Deploying and scaling the GitHub Copilot SDK (continued)

In the previous post I set the stage on deploying to production. We covered managing the CLI process, different isolation patterns and how to scale horizontally.

This post covers two other aspects that matter when putting your GitHub Copilot SDK-enabled application into production: how to tackle authentication and how to get insight into what is going on inside your agentic system.

Authentication in production

Development uses your personal GitHub credentials via gh auth login. Production backends need a different approach.

Service account token (shared CLI): Set COPILOT_GITHUB_TOKEN as an environment variable on the CLI process. All sessions on that CLI use the same token. Simple, but every user is acting as the service account.

export COPILOT_GITHUB_TOKEN="gho_service_account_token"
copilot --headless --port 4321

Per-user tokens (GitHub OAuth): For multi-tenant applications where users authenticate with their own GitHub identities, pass each user's token when creating their session. This requires implementing the GitHub OAuth flow in your application and is covered in depth in the GitHub OAuth with Copilot SDK docs.

BYOK: If you'd rather not tie deployment to GitHub auth at all, configure the SDK to use your own API keys from OpenAI, Azure AI Foundry, or Anthropic. This sidesteps the Copilot subscription requirement and premium request quota entirely — useful for automated pipelines where per-request billing against a known provider is preferable.

var client = new CopilotClient(new CopilotClientOptions
{
    Byok = new ByokConfig
    {
        Provider = "azure",
        ApiKey = Environment.GetEnvironmentVariable("AZURE_API_KEY")!,
        BaseUrl = Environment.GetEnvironmentVariable("AZURE_ENDPOINT")!
    }
});

Note: BYOK uses key-based authentication only. Microsoft Entra ID (Azure AD), managed identities, and third-party identity providers are not supported at the time of writing.

Observability

Agents are harder to observe than traditional services because the interesting work happens inside the execution loop. Build instrumentation in from the start.

Session-level metrics

Use the OnSessionStart and OnSessionEnd hooks to capture duration and end reason for every session:

var sessionStartTimes = new ConcurrentDictionary<string, long>();

var session = await client.CreateSessionAsync(new SessionConfig
{
    Model = "gpt-4.1",
    Hooks = new SessionHooks
    {
        OnSessionStart = async (input, invocation) =>
        {
            sessionStartTimes[invocation.SessionId] = input.Timestamp;
            _metrics.SessionStarted(invocation.SessionId, input.Source);
            return null;
        },
        OnSessionEnd = async (input, invocation) =>
        {
            if (sessionStartTimes.TryRemove(invocation.SessionId, out var startTime))
            {
                var duration = input.Timestamp - startTime;
                _metrics.SessionEnded(invocation.SessionId, duration, input.Reason);
            }
            return null;
        }
    }
});
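The _metrics object in the snippet above is not something the SDK provides. Here is a minimal sketch of what it could look like, built on System.Diagnostics.Metrics so the numbers can be picked up by any OpenTelemetry metrics exporter. The class name, meter name, and instrument names are my own placeholders, and I am assuming the timestamps are in milliseconds and that the source and reason values are strings (call ToString() on them if they are not in your SDK version).

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics.Metrics;

// Hypothetical helper: not part of the Copilot SDK. All names are illustrative.
public sealed class SessionMetrics : IDisposable
{
    private readonly Meter _meter = new("MyApp.Copilot.Sessions");
    private readonly Counter<long> _started;
    private readonly Counter<long> _ended;
    private readonly Histogram<double> _durationMs;

    public SessionMetrics()
    {
        _started = _meter.CreateCounter<long>("copilot.sessions.started");
        _ended = _meter.CreateCounter<long>("copilot.sessions.ended");
        _durationMs = _meter.CreateHistogram<double>("copilot.session.duration", unit: "ms");
    }

    public void SessionStarted(string sessionId, string source)
    {
        // Session IDs are high-cardinality, so they are deliberately not used as tags.
        _started.Add(1, new KeyValuePair<string, object?>("source", source));
    }

    public void SessionEnded(string sessionId, long durationMs, string reason)
    {
        var tag = new KeyValuePair<string, object?>("reason", reason);
        _ended.Add(1, tag);
        _durationMs.Record(durationMs, tag);
    }

    public void Dispose() => _meter.Dispose();
}
```

Tagging by source and reason rather than by session ID keeps the metric cardinality low. If you need per-session detail, that belongs in logs or traces.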

Tool call tracing

Instrument tool execution events to understand what actions the agent is taking in production — and catch unexpected tool usage early:

session.On(evt =>
{
    switch (evt)
    {
        case ToolExecutionStartEvent toolStart:
            // sessionId is assumed to be in scope from where the session was created.
            _logger.LogInformation(
                "Tool {Tool} invoked in session {Session}",
                toolStart.Data.ToolName, sessionId);
            break;

        case ToolExecutionCompletedEvent toolCompleted:
            _logger.LogInformation(
                "Tool {Tool} completed in {Duration}ms",
                toolCompleted.Data.ToolName, toolCompleted.Data.DurationMs);
            break;
    }
});
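Logging every call is a start, but to actually catch unexpected tool usage you need something to compare against. Below is a small sketch of one way to do that: an allowlist-based auditor that counts invocations and tells you when a tool is not on the list you expected. The ToolUsageAuditor class is hypothetical and not part of the SDK.

```csharp
using System.Collections.Concurrent;
using System.Collections.Generic;

// Hypothetical helper: not part of the Copilot SDK.
public sealed class ToolUsageAuditor
{
    private readonly HashSet<string> _expected;
    private readonly ConcurrentDictionary<string, int> _counts = new();

    public ToolUsageAuditor(IEnumerable<string> expectedTools)
    {
        // The set is only read after construction, so concurrent lookups are safe.
        _expected = new HashSet<string>(expectedTools);
    }

    // Records one invocation and returns true when the tool is on the expected list.
    public bool Record(string toolName)
    {
        _counts.AddOrUpdate(toolName, 1, (_, current) => current + 1);
        return _expected.Contains(toolName);
    }

    // Returns a point-in-time copy of the invocation counts per tool.
    public IReadOnlyDictionary<string, int> Snapshot() =>
        new Dictionary<string, int>(_counts);
}
```

You could call Record(toolStart.Data.ToolName) in the tool start case above and log a warning whenever it returns false.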

Context compaction

Context compaction is automatic and mostly invisible, but it's worth logging when it happens. Frequent compaction on short conversations can indicate that your system prompt or injected context is consuming more of the context window than you intend.
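To make "frequent compaction on short conversations" something you can alert on, here is a rough sketch that tracks the number of turns per session and flags a compaction that arrives before a configurable number of turns. The CompactionMonitor class and the default threshold of 10 are assumptions of mine, not SDK features. You would call RecordCompaction from wherever you handle the compaction events your SDK version exposes.

```csharp
using System.Collections.Concurrent;

// Hypothetical helper: not part of the Copilot SDK. The threshold is illustrative.
public sealed class CompactionMonitor
{
    private readonly int _minTurnsBeforeCompaction;
    private readonly ConcurrentDictionary<string, int> _turns = new();

    public CompactionMonitor(int minTurnsBeforeCompaction = 10)
    {
        _minTurnsBeforeCompaction = minTurnsBeforeCompaction;
    }

    // Call once for every user turn in a session.
    public void RecordTurn(string sessionId) =>
        _turns.AddOrUpdate(sessionId, 1, (_, n) => n + 1);

    // Call when a compaction is reported. Returns true when it happened
    // after fewer turns than the configured minimum.
    public bool RecordCompaction(string sessionId)
    {
        _turns.TryGetValue(sessionId, out var turns); // turns stays 0 for unknown sessions
        return turns < _minTurnsBeforeCompaction;
    }

    // Call when the session ends so the dictionary does not grow without bound.
    public void Forget(string sessionId) => _turns.TryRemove(sessionId, out _);
}
```

A true result is a hint to look at the size of your system prompt and injected context, not proof that something is wrong.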

Built-in OpenTelemetry support

The event-based instrumentation above is useful for quick feedback and custom metrics, but for production observability you want proper distributed traces — spans that link the CLI's internal execution to your application code, visible in Jaeger, Grafana, Azure Monitor, Datadog, or any OTLP-compatible backend.

The SDK ships with this built in. OpenTelemetry support is a first-class feature of the Copilot SDK, not an afterthought: it provides distributed tracing with W3C trace context propagation across all SDKs.

Opting in: one line of config

Enabling OTel tracing requires a single addition to your CopilotClientOptions:

var client = new CopilotClient(new CopilotClientOptions
{
    Telemetry = new TelemetryConfig
    {
        OtlpEndpoint = "http://localhost:4318"
    }
});

That's it. The SDK configures OpenTelemetry on the CLI process and begins exporting spans to your OTLP endpoint. No additional packages required beyond what the SDK already brings in.

Protocol note: The CLI runtime only supports OTLP over HTTP (otlp-http). If your collector is configured for gRPC, the CLI will still use HTTP. Backends that serve both protocols on the same port — like the .NET Aspire Dashboard — work transparently.

What gets traced

Once enabled, every agent interaction produces a hierarchical span tree that captures the full execution flow:

invoke_agent [~15s]
├── chat gpt-4.1 [~3s]          ← LLM requests tool calls
├── execute_tool readFile [~50ms]
├── execute_tool runCommand [~2s]
├── chat gpt-4.1 [~4s]          ← LLM generates final response
└── (span ends)

Each span captures metadata following the OpenTelemetry GenAI Semantic Conventions — model names, token counts, durations — so the data works with any OTel-compatible backend. By default, no prompt content, responses, or tool arguments are captured: only metadata. If you need full content for debugging, set the COPILOT_OTEL_CAPTURE_CONTENT=true environment variable on the CLI process.

Seeing traces locally: The Aspire dashboard

For local development, the .NET Aspire Dashboard gives you a full trace viewer with a built-in OTLP endpoint — no cloud account needed, no Jaeger to configure:

docker run --rm -d \
  -p 18888:18888 \
  -p 4317:18889 \
  -p 4318:18890 \
  --name aspire-dashboard \
  mcr.microsoft.com/dotnet/aspire-dashboard:latest

The dashboard UI is on port 18888. Point your OtlpEndpoint at http://localhost:4318 (the SDK uses HTTP) and open http://localhost:18888 to see traces appear in real time as you send prompts.

Sending to production backends

For production, route traces through an OTel Collector and on to whichever backend you use. A minimal otel-collector-config.yaml that accepts OTLP and exports to multiple destinations:

receivers:
  otlp:
    protocols:
      http:
        endpoint: 0.0.0.0:4318

exporters:
  azuremonitor:
    connection_string: "${APPLICATIONINSIGHTS_CONNECTION_STRING}"
  otlp/grafana:
    endpoint: "${GRAFANA_OTLP_ENDPOINT}"
    headers:
      authorization: "Basic ${GRAFANA_AUTH}"

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [azuremonitor, otlp/grafana]

Deployment checklist

Let me summarize the current and previous post with a handy pre-deployment checklist:

Architecture

Using headless CLI mode rather than the default auto-managed subprocess?
CLI running as a persistent service (systemd, container, or Kubernetes pod)?
CLI_URL coming from environment configuration, not hardcoded?

Session state

Running multiple CLI replicas? If so, is session state on shared (ReadWriteMany) storage?
Session IDs structured to encode ownership for auditability?
Concurrency limit enforced to prevent memory exhaustion?

Authentication

Service account token stored as an environment secret, not in source?
If using per-user auth, GitHub OAuth flow implemented and tested?
If using BYOK, provider credentials rotated on a schedule?

Observability

Session start/end metrics captured?
Tool execution events logged?
Session errors surfaced to your alerting system?
Context compaction events monitored?

Resilience

CLI process managed by a supervisor (systemd, Kubernetes) that restarts on failure?
Application handling SessionErrorEvent gracefully?
30-minute idle timeout accounted for in long-running workflows?

What's next

Your application is now deployable, scalable, and observable. In the next post in the series, we cover MCP integration — connecting your agent to external context and services via the Model Context Protocol, so your agent can reach databases, APIs, and cloud services without you having to build the glue.
