Deploying and scaling the GitHub Copilot SDK

In the previous post we went deep on sessions — how to create, persist, resume, and manage them in .NET. All of that assumes you have a running application talking to a Copilot CLI. In development, that's trivial: the SDK starts the CLI for you automatically. In production, the picture is more complex.

This post is about what happens between "it works on my machine" and "it's serving real users." We'll look at how the CLI architecture actually works, when to run the CLI as a separate headless server, the isolation patterns that fit different application types, and how to scale horizontally without losing session state.

How the SDK talks to the CLI

Before making deployment decisions, it helps to understand the communication model. Every SDK in every language works the same way underneath:

Your Application
      ↓
  SDK Client
      ↓
  JSON-RPC
      ↓
Copilot CLI (server mode)

All SDKs communicate with the Copilot CLI server via JSON-RPC. The CLI is doing the heavy lifting — model routing, tool execution, context management, streaming. The SDK client is a typed interface to that process.
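To make the communication model a little more concrete, here is a minimal sketch of a JSON-RPC 2.0 request envelope built with System.Text.Json. Only the envelope fields (jsonrpc, id, method, params) come from the JSON-RPC 2.0 specification. The method name and parameters are hypothetical placeholders and are not the Copilot CLI's actual wire protocol, which the SDK hides from you anyway.

```csharp
using System;
using System.Text.Json;

public static class JsonRpcEnvelope
{
    // Builds a JSON-RPC 2.0 request. The caller supplies the method name and
    // parameters; the values used in this post are illustrative placeholders only.
    public static string Build(int id, string method, object parameters)
    {
        var request = new
        {
            jsonrpc = "2.0",      // protocol version marker required by JSON-RPC 2.0
            id,                   // lets the caller match a response to its request
            method,               // name of the remote procedure to invoke
            @params = parameters  // serialized as "params"
        };
        return JsonSerializer.Serialize(request);
    }
}
```

Calling JsonRpcEnvelope.Build(1, "example.createSession", new { model = "gpt-4.1" }) produces a single JSON object of the kind a typed client could send on your behalf.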

By default, the SDK spawns the CLI as a child process, which suits standalone tools and scenarios where each application instance manages its own CLI lifecycle. This is the mode we've been using throughout this series: you call StartAsync() and the SDK handles everything.

For production backends, you typically want to decouple the CLI process from your application process.

Mode 1: Auto-managed CLI (default)

The bundled, auto-managed CLI is the right choice for:

Desktop applications and distributable utilities
Developer tools that run on a single machine
Scripts and local automation
Early-stage prototypes

The .NET SDK uses MSBuild targets to automatically download the correct platform-specific CLI binary during the build process, mapping .NET Runtime Identifiers to Copilot platform names and fetching the binary for you. The output binary is placed in the runtimes/{rid}/native folder.

This means zero manual installation — dotnet add package GitHub.Copilot.SDK is all you need, and the right CLI binary for your platform is bundled automatically.

// The CLI starts and stops with your app — no separate process to manage
await using var client = new CopilotClient();
await client.StartAsync();

// ... your application logic ...

await client.StopAsync();

The trade-off is that the CLI lifecycle is tied to your application process. Every instance of your application runs its own CLI. For single-user tools, this is fine. For multi-instance web APIs, you're running N CLI processes for N replicas — which is wasteful and creates session continuity problems.

Remark: This also works for the other available languages, with the exception of Go, where you need to download the CLI yourself.

Mode 2: Headless CLI server

For backend services — web APIs, background workers, internal tools, anything running on a server — you decouple the CLI from your application by running it in headless server mode.

The CLI then runs as a persistent server process instead of being spawned by the SDK or started per request, and your backend connects to it over TCP using the CliUrl option. Multiple SDK clients can share one CLI server.

Start the CLI in headless mode:

# Fixed port
copilot --headless --port 4321

# Or let it pick a random port (prints the URL on startup)
copilot --headless
# Output: Listening on http://localhost:52431

Note: By default the headless server only accepts connections from loopback (127.0.0.1). To accept connections from other hosts — for example from another machine on your network — bind to a non-loopback address with --host.

Then connect your application to it:

var client = new CopilotClient(new CopilotClientOptions
{
    CliUrl = "localhost:4321"
});

// No StartAsync needed — CLI is already running
await client.ConnectAsync();

Running as a Docker container

For production, run the CLI as a system service or in a container. The simplest Docker Compose setup uses one CLI container, one API container, and a named volume for session storage:

version: "3.8"
services:
  copilot-cli:
    image: ghcr.io/github/copilot-cli:latest
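    # The headless server accepts only loopback connections by default; add --host
    # (see the note above) so the api container can reach it.
    command: ["--headless", "--port", "4321"]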
    environment:
      - COPILOT_GITHUB_TOKEN=${COPILOT_GITHUB_TOKEN}
    ports:
      - "4321:4321"
    restart: always
    volumes:
      - session-data:/root/.copilot/session-state

  api:
    build: .
    environment:
      - CLI_URL=copilot-cli:4321
    depends_on:
      - copilot-cli
    ports:
      - "8080:8080"

volumes:
  session-data:

Your .NET API reads CLI_URL from the environment and connects:

var cliUrl = Environment.GetEnvironmentVariable("CLI_URL") ?? "localhost:4321";

builder.Services.AddSingleton(sp => new CopilotClient(new CopilotClientOptions
{
    CliUrl = cliUrl
}));
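Because a malformed CLI_URL would otherwise only surface on the first connection attempt, it can help to validate the value at startup. The following is a minimal sketch that assumes the bare host:port form used in this post. CliUrlParser is a hypothetical helper and not part of the SDK.

```csharp
using System;

public static class CliUrlParser
{
    // Splits a "host:port" string and validates the port range.
    // Assumes the bare host:port form; a scheme prefix is not handled specially.
    public static (string Host, int Port) Parse(string cliUrl)
    {
        if (string.IsNullOrWhiteSpace(cliUrl))
            throw new ArgumentException("CLI_URL is empty.", nameof(cliUrl));

        // Split on the last colon so the port is always the final segment.
        int colon = cliUrl.LastIndexOf(':');
        if (colon <= 0 || colon == cliUrl.Length - 1)
            throw new FormatException($"Expected host:port, got '{cliUrl}'.");

        string host = cliUrl[..colon];
        if (!int.TryParse(cliUrl[(colon + 1)..], out int port) || port is < 1 or > 65535)
            throw new FormatException($"Invalid port in '{cliUrl}'.");

        return (host, port);
    }
}
```

Calling CliUrlParser.Parse(cliUrl) before registering the client makes a misconfigured deployment fail fast with a clear message.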

Isolation patterns

Once you're running a headless CLI, you have a choice to make: how much isolation do users get from each other? The official docs frame this as three distinct patterns, each with different resource costs and security boundaries.

Pattern 1: Shared CLI, isolated sessions

Multiple users share one CLI server. Each user gets their own session with a unique SessionId. The CLI process is shared; the conversation context is not.

// Each user gets a session scoped to their identity
var session = await _client.CreateSessionAsync(new SessionConfig
{
    SessionId = $"user-{userId}-{conversationId}",
    Model = "gpt-4.1"
});

Pros: Lightest on resources. Simple to operate. One CLI to monitor.

Cons: Weakest isolation boundary. All users share the same process memory and file system access context. If one session triggers a crash, it affects everyone.

Best for: Internal tools, developer portals, trusted environments where all users are within the same organization.

Pattern 2: CLI pool, one CLI per user

Each user gets their own dedicated CLI server instance. Sessions for that user always route to their CLI.

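public class CliPool
{
    // One client per user. The gate serializes pool changes so that concurrent
    // requests for the same user can't spawn two CLIs or race on the port counter.
    private readonly Dictionary<string, CopilotClient> _clients = new();
    private readonly SemaphoreSlim _gate = new(1, 1);
    private int _nextPort = 5000;

    public async Task<CopilotClient> GetClientForUserAsync(string userId)
    {
        await _gate.WaitAsync();
        try
        {
            if (_clients.TryGetValue(userId, out var existing))
                return existing;

            // Ports are handed out sequentially and never reused in this sketch.
            var port = _nextPort++;
            // SpawnCliAsync is an app-specific helper that starts
            // `copilot --headless --port {port}` and waits until it is listening.
            await SpawnCliAsync(port);

            var client = new CopilotClient(new CopilotClientOptions
            {
                CliUrl = $"localhost:{port}"
            });
            await client.ConnectAsync();

            _clients[userId] = client;
            return client;
        }
        finally
        {
            _gate.Release();
        }
    }

    public async Task ReleaseUserAsync(string userId)
    {
        await _gate.WaitAsync();
        try
        {
            // Disposing the client closes the connection; the CLI process that
            // SpawnCliAsync started has to be stopped separately.
            if (_clients.Remove(userId, out var client))
                await client.DisposeAsync();
        }
        finally
        {
            _gate.Release();
        }
    }
}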

Pros: Strong isolation — a user's sessions, memory, and processes are completely separated. Users can authenticate with different GitHub tokens.

Cons: Higher resource cost. N active users = N CLI processes. Requires process lifecycle management.

Best for: Multi-tenant SaaS products where data isolation is a hard requirement. Users with different authentication credentials.
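The CliPool above calls SpawnCliAsync without defining it. One possible shape for that helper is sketched below. It assumes the copilot executable is on the PATH and uses only the --headless and --port flags shown earlier. CliLauncher, WaitForPortAsync, and the polling interval are illustrative choices, not SDK APIs.

```csharp
using System;
using System.Diagnostics;
using System.Net.Sockets;
using System.Threading.Tasks;

public static class CliLauncher
{
    // Starts `copilot --headless --port <port>` and waits until the port accepts
    // TCP connections. Assumes the `copilot` executable is on the PATH.
    public static async Task<Process> SpawnCliAsync(int port, TimeSpan timeout)
    {
        var startInfo = new ProcessStartInfo("copilot", $"--headless --port {port}")
        {
            UseShellExecute = false
        };
        var process = Process.Start(startInfo)
            ?? throw new InvalidOperationException("Failed to start the Copilot CLI.");

        if (!await WaitForPortAsync("localhost", port, timeout))
        {
            // Don't leave a half-started CLI behind.
            if (!process.HasExited)
                process.Kill(entireProcessTree: true);
            throw new TimeoutException($"CLI did not listen on port {port} within {timeout}.");
        }
        return process;
    }

    // Retries a TCP connection until one succeeds or the timeout elapses.
    // The timeout is checked between attempts, so one slow attempt can overrun it.
    public static async Task<bool> WaitForPortAsync(string host, int port, TimeSpan timeout)
    {
        var elapsed = Stopwatch.StartNew();
        while (elapsed.Elapsed < timeout)
        {
            try
            {
                using var probe = new TcpClient();
                await probe.ConnectAsync(host, port);
                return true;
            }
            catch (SocketException)
            {
                // Not listening yet: back off briefly and retry.
                await Task.Delay(200);
            }
        }
        return false;
    }
}
```

Keeping the returned Process next to the pooled client gives ReleaseUserAsync something to stop when a user's CLI is no longer needed.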

Pattern 3: Shared CLI with namespaced session IDs (lightweight middle ground)

A pragmatic middle option: one shared CLI, but session IDs are structured to encode ownership, so auditing and cleanup are straightforward and your code can check who owns a session before touching it.

// Session ID encodes tenant, user, and purpose
string CreateSessionId(string tenantId, string userId, string taskType)
    => $"{tenantId}-{userId}-{taskType}-{DateTimeOffset.UtcNow.ToUnixTimeSeconds()}";

// acme-alice-code-review-1714900000
var sessionId = CreateSessionId("acme", "alice", "code-review");

This doesn't give you process-level isolation, but it does give you clean logical separation and makes audit logs and session cleanup tractable.
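Namespacing only helps if something actually checks the namespace. The sketch below pairs the ID format above with an ownership check based on the tenant and user prefix. SessionIds and IsOwnedBy are hypothetical names, and rejecting hyphens in the tenant and user IDs is an added assumption that keeps the prefix unambiguous (otherwise tenant "acme" with user "alice-bob" and tenant "acme-alice" with user "bob" would share a prefix).

```csharp
using System;

public static class SessionIds
{
    // Builds "{tenant}-{user}-{task}-{unixSeconds}". Tenant and user IDs must not
    // contain '-' so that the ownership prefix cannot be ambiguous.
    public static string Create(string tenantId, string userId, string taskType, DateTimeOffset now)
    {
        RequireNoHyphen(tenantId, nameof(tenantId));
        RequireNoHyphen(userId, nameof(userId));
        return $"{tenantId}-{userId}-{taskType}-{now.ToUnixTimeSeconds()}";
    }

    // True only when the session ID starts with this tenant's and user's prefix.
    public static bool IsOwnedBy(string sessionId, string tenantId, string userId)
        => sessionId.StartsWith($"{tenantId}-{userId}-", StringComparison.Ordinal);

    private static void RequireNoHyphen(string value, string name)
    {
        if (string.IsNullOrEmpty(value) || value.Contains('-'))
            throw new ArgumentException("Must be non-empty and contain no '-'.", name);
    }
}
```

A request handler would call IsOwnedBy before resuming or deleting a session and reject the request when it returns false. Note that with a timestamp in whole seconds, two sessions created for the same tenant, user, and task within the same second get the same ID, so a finer-grained or random suffix may be worth adding.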

Scaling horizontally

A single headless CLI can handle many concurrent sessions. But once your load grows beyond what one instance can handle, you need multiple CLI replicas — and that means thinking carefully about session state.

The core constraint is where session state lives. With shared storage, any CLI replica can handle any session, so load spreads more evenly, but ~/.copilot/session-state/ has to sit on networked storage.

Without shared storage, a session created on replica A can only be resumed by replica A. Add a load balancer and traffic can land anywhere — which breaks resumability unless you use sticky sessions (fragile) or shared state (correct).

Kubernetes deployment

The following deploys three CLI replicas sharing a PersistentVolumeClaim so that any replica can resume any session:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: copilot-cli
spec:
  replicas: 3
  selector:
    matchLabels:
      app: copilot-cli
  template:
    metadata:
      labels:
        app: copilot-cli
    spec:
      containers:
        - name: copilot-cli
          image: ghcr.io/github/copilot-cli:latest
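          # The headless server accepts only loopback connections by default; add --host
          # (see the note in the headless section) so other pods can reach it.
          args: ["--headless", "--port", "4321"]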
          env:
            - name: COPILOT_GITHUB_TOKEN
              valueFrom:
                secretKeyRef:
                  name: copilot-secrets
                  key: github-token
          ports:
            - containerPort: 4321
          volumeMounts:
            - name: session-state
              mountPath: /root/.copilot/session-state
      volumes:
        - name: session-state
          persistentVolumeClaim:
            claimName: copilot-sessions-pvc
---
apiVersion: v1
kind: Service
metadata:
  name: copilot-cli
spec:
  selector:
    app: copilot-cli
  ports:
    - port: 4321
      targetPort: 4321

Your .NET deployment then points at the copilot-cli service:

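apiVersion: apps/v1
kind: Deployment
metadata:
  name: copilot-api
spec:
  replicas: 3
  # apps/v1 requires a selector that matches the pod template's labels
  selector:
    matchLabels:
      app: copilot-api
  template:
    metadata:
      labels:
        app: copilot-api
    spec:
      containers:
        - name: api
          image: your-registry/copilot-api:latest
          env:
            - name: CLI_URL
              value: "copilot-cli:4321"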

The PVC gives every CLI replica access to the same session state directory. Any replica can resume any session. The load balancer distributes both CLI and API traffic freely.

Important: For the PVC approach to work, the volume must support ReadWriteMany access mode — NFS, Azure Files, or a cloud-native equivalent. Standard block storage (AWS EBS, Azure Disk) is ReadWriteOnce and won't work for shared access across pods.

What's next

I've covered a lot about deployment, but there are some related topics I still want to explore. In the next post, we'll look at authentication and observability, both key elements of any production-ready setup.
