<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:media="http://search.yahoo.com/mrss/"><channel><title><![CDATA[Pasi Huuhka - Azure Deep Dive]]></title><description><![CDATA[DevOps & Coding on Azure]]></description><link>https://www.huuhka.net/</link><image><url>https://www.huuhka.net/favicon.png</url><title>Pasi Huuhka - Azure Deep Dive</title><link>https://www.huuhka.net/</link></image><generator>Ghost 5.80</generator><lastBuildDate>Thu, 30 Apr 2026 18:42:00 GMT</lastBuildDate><atom:link href="https://www.huuhka.net/rss/" rel="self" type="application/rss+xml"/><ttl>60</ttl><item><title><![CDATA[Building your own PR reviewer with coding agents]]></title><description><![CDATA[I've now built a couple versions of automated PR reviewers, and the main thing that keeps standing out is that the AI part is surprisingly small.
The model matters, of course. But the architecture matters more. ]]></description><link>https://www.huuhka.net/building-your-own-pr-reviewer-with-coding-agents/</link><guid isPermaLink="false">69d94b79313b720001df3e28</guid><category><![CDATA[AI]]></category><category><![CDATA[Artificial Intelligence]]></category><category><![CDATA[GitHub Copilot]]></category><category><![CDATA[OpenCode]]></category><dc:creator><![CDATA[Pasi Huuhka]]></dc:creator><pubDate>Wed, 11 Mar 2026 20:28:00 GMT</pubDate><media:content url="https://huuhkadotnet.blob.core.windows.net/prod/images/2026/4/101930_arch.png" medium="image"/><content:encoded><![CDATA[<div class="kg-card kg-callout-card kg-callout-card-blue"><div class="kg-callout-emoji">&#x1F4A1;</div><div class="kg-callout-text">This post is part of a larger Agentic Dev theme:<br>- <a href="https://www.huuhka.net/a-mental-model-for-llm-tooling-primitives/" rel="noreferrer">A mental model for LLM tooling primitives</a><br>- <a href="https://www.huuhka.net/research-plan-implement/" rel="noreferrer">Research - Plan - Implement</a><br>- <a href="https://www.huuhka.net/primary-vs-subagents-in-llm-harnesses/" rel="noreferrer">Primary vs Subagents in LLM harnesses</a><br>- <a href="https://www.huuhka.net/how-i-currently-develop-with-llm-models-early-2026/" rel="noreferrer">How I currently develop with LLM models (Early 2026)</a> <br>- <a href="https://www.huuhka.net/building-your-own-pr-reviewer-with-coding-agents/" rel="noreferrer">Building your own PR reviewer with coding agents</a></div></div><img src="https://huuhkadotnet.blob.core.windows.net/prod/images/2026/4/101930_arch.png" alt="Building your own PR reviewer with coding agents"><p>I&apos;ve now built a couple versions of automated PR reviewers, and the main thing that keeps standing out is that the AI part is surprisingly small.</p><p>That may sound odd for a post about AI review agents, but in practice most of the system is normal software engineering. You receive an event, decide whether to do anything, gather context, start an isolated run, call the model through a coding harness, parse the result, and post findings back to the source system.</p><p>The model matters, of course. But the architecture matters more. </p><p>This post is about that architecture. I&apos;ll use <a href="https://github.com/github/copilot-sdk?ref=huuhka.net">GitHub Copilot SDK</a> in the examples for the LLM side, because it maps cleanly to this kind of coding-agent workflow.  
The overall pattern works just as well with other harnesses or plain CLIs.</p><p>It builds on a few earlier posts, especially <a href="https://www.huuhka.net/a-mental-model-for-llm-tooling-primitives/">A mental model for LLM tooling primitives</a> and <a href="https://www.huuhka.net/primary-vs-subagents-in-llm-harnesses/">Primary vs Subagents in LLM harnesses</a>.</p><h3 id="the-short-version">The short version<br></h3><ul><li><strong>Mostly control flow, not AI.</strong> A PR bot is normal event-driven software with an LLM in one stage.</li><li><strong>The agent reviews, it does not run the business process.</strong> Everything around the review should be ordinary code.</li><li><strong>Event-driven is the natural shape.</strong> Receive the PR event, gather context, run a contained review, post structured results back.</li><li><strong>Coding harnesses fit well</strong> because they already know how to read files, inspect repositories, and run commands.</li><li><strong>Structured output matters.</strong> The model&apos;s findings should be easy for your code to validate before posting.</li><li><strong>The same architecture generalizes.</strong> Any AI automation with a clear start, process and end can use this shape. You could just as easily build on GitHub Actions, <em>(edit on 23.3.2026) or even the new </em><a href="https://github.github.com/gh-aw/?ref=huuhka.net"><em>Agentic Workflows</em></a><em> to do the same.</em></li></ul><h3 id="the-bot-itself-is-simple">The bot itself is simple</h3><p>At a high level, a PR reviewer is not a complicated product.</p><p>Something happens in your source system. A pull request is created, updated, commented on, or explicitly flagged for review. Your system decides whether that event should trigger a review, gathers the repository context, runs the review logic, and posts the findings back.</p><p>That sounds almost boring, and that is exactly the point. This is one of those cases where I strongly think the key to good LLM systems is making them as deterministic as possible. Do not let the agent decide what process it is in. Let code do that.</p><p>The agent&apos;s job is to review the change. Everything around that should be ordinary software.</p><h3 id="triggering-the-review">Triggering the review</h3><p>There are a few obvious ways to start:</p><ul><li><strong>Always review on PR creation.</strong> Simple and broad.</li><li><strong>Review on creation and updates. </strong>Covers iteration.</li><li><strong>Only review on an explicit command.</strong> Controls cost and noise.</li><li><strong>Trigger from a UI action.</strong> Product-specific entry point.</li></ul><p>This is mostly a product decision, not an AI decision. You can also mix approaches: a lightweight default review plus deeper specialist reviews on demand.</p><p>The triggering event can be almost anything. I use PR events a lot, but this same architecture works for issue triage, ticket classification, bug reproduction, documentation generation, or anything else where a system event kicks off a bounded piece of work.</p><h3 id="ado-service-hooks-as-the-integration-point">ADO service hooks as the integration point</h3><p>In Azure DevOps, <a href="https://learn.microsoft.com/en-us/azure/devops/service-hooks/overview?view=azure-devops&amp;ref=huuhka.net">service hooks</a> are the natural fit. 
They are essentially an event subscription mechanism: a publisher emits an event, a subscription filters it, and a <a href="https://learn.microsoft.com/en-us/azure/devops/service-hooks/services/webhooks?view=azure-devops&amp;ref=huuhka.net">webhooks consumer</a> sends a JSON payload to your HTTPS endpoint.</p><p>For PR automation, the interesting <a href="https://learn.microsoft.com/en-us/azure/devops/service-hooks/events?view=azure-devops&amp;ref=huuhka.net">events</a> are:</p><ul><li>pull request created</li><li>pull request updated</li><li>pull request commented on</li><li>pull request merge attempted</li></ul><p>Azure DevOps lets you control how much resource detail goes into the webhook payload. I prefer smaller payloads and fetching the full PR details myself afterward. That keeps the webhook receiver simpler and forces context gathering into one consistent path.</p><h3 id="architecture-over-model">Architecture over model</h3><p>The shape I&apos;ve been using is fairly simple:</p><p>1. A source system event arrives.</p><p>2. A manager layer classifies whether work should start.</p><p>3. The manager gathers enough context to create a review job.</p><p>4. The manager stores run state in a database.</p><p>5. The manager starts an isolated worker run.</p><p>6. The worker performs the LLM review and returns structured output.</p><p>7. The manager parses the result and posts comments or findings back to the source system.</p><p>8. The manager updates run state.</p><p>Notice the LLM is only a small part of the whole pipeline, and that the logic is no rocket science.</p><figure class="kg-card kg-image-card kg-width-wide"><img src="https://huuhkadotnet.blob.core.windows.net/prod/images/2026/4/101929_arch.png" class="kg-image" alt="Building your own PR reviewer with coding agents" loading="lazy" width="2000" height="606"></figure><p>My own implementation has used <a href="https://learn.microsoft.com/en-us/azure/azure-functions/functions-overview?ref=huuhka.net">Azure Functions</a> as the management layer and <a href="https://learn.microsoft.com/en-us/azure/container-apps/jobs?ref=huuhka.net">Azure Container Apps jobs</a> for the isolated execution. The function receives the Azure DevOps webhook and starts a container app job for each review. I like that shape because each review is contained. It has its own execution, logs, and failure boundary.</p><p>The other option I find interesting is using a microVM-based sandbox. That starts to matter if you want easier session resumption later, or if you want nested container execution inside the sandbox. For simple review flows, a container job is usually enough. MicroVMs are definitely taking off in the industry, as many services serving quickly spawning isolated sandboxes are popping up everywhere. They&apos;re something I&apos;ve been planning to build on ever since <a href="https://builders.ramp.com/post/why-we-built-our-background-agent?ref=huuhka.net">reading this post by Ramp</a>. The main benefit in my mind is the easy ability to also run the app in a container inside the microVM.</p><p>For state, almost anything works. I&apos;ve used table storage and that has been completely fine. If all you really need is run state, correlation IDs, status, and posted-result metadata, you do not need an especially fancy database.</p><h3 id="the-ai-is-one-stage">The AI is one stage</h3><p>The AI stage does not need to own the whole review process. 
It does not need to decide when jobs start, how retries work, where state lives, how comments are posted, or how webhook idempotency is handled. That is all normal application logic.</p><p>The LLM&apos;s job is much smaller:</p><ul><li><strong>Inspect</strong> the repository and PR context</li><li><strong>Review</strong> the change</li><li><strong>Return</strong> structured findings</li></ul><p>I think the healthiest mental model for these bounded automations is to treat the LLM like another API dependency. The smaller and more bounded the use case, the tighter the guardrails should be. If the task is wider and more exploratory, you can relax them.</p><h3 id="why-coding-harnesses-fit">Why coding harnesses fit</h3><p>PR review is one of the places where coding harnesses are a very natural fit. The model needs exactly the kind of capabilities those harnesses already provide: <strong>Read files</strong> and inspect diffs, <strong>Search</strong> the codebase and <strong>Optionally run commands</strong> like linting, type checking or project-specific validations.</p><p>That is why this works best with a coding harness instead of a thin text-only wrapper around an LLM API.</p><p>I&apos;ve implemented this with both Copilot SDK and <a href="https://opencode.ai/docs/sdk/?ref=huuhka.net">OpenCode SDK</a>, and honestly even using the CLIs directly can work if your process is simple enough. The important bit is not the specific SDK. It is that the runtime already understands code-oriented tools.</p><p>Depending on your trust model, you may also want to let the review agent run bash commands. That can improve review quality quite a bit, but obviously also pushes you toward stronger isolation and permission handling. Pure review type of work is probably fine without command execution, but if you want to get into fix suggestions and validation, it becomes more important.</p><h3 id="own-the-orchestration-not-the-control-flow">Own the orchestration, not the control flow</h3><p>This is where the 12-factor agents point becomes especially useful. 
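</p><p>To make that concrete, here is a rough, hypothetical sketch of the manager side (the event shape and helper names are illustrative, not from any real implementation):</p><pre><code class="language-ts">// Hypothetical manager-side control flow. Plain code (not the model) decides
// whether a review run starts, what state is stored, and which worker to launch.
type AdoPullRequestEvent = {
  eventType: string;
  resource: { pullRequestId: number; repository: { id: string } };
};

function isEligible(event: AdoPullRequestEvent): boolean {
  // Only PR creation and updates trigger a review in this sketch.
  return (
    event.eventType === &quot;git.pullrequest.created&quot; ||
    event.eventType === &quot;git.pullrequest.updated&quot;
  );
}

// Stand-ins for your own state store and isolated job runner.
async function saveRunState(state: { runId: string; status: string; pullRequestId: number }): Promise&lt;void&gt; {
  console.log(&quot;saving run state&quot;, state);
}
async function startReviewJob(args: { runId: string; repositoryId: string }): Promise&lt;void&gt; {
  console.log(&quot;starting review job&quot;, args);
}

export async function handleWebhook(event: AdoPullRequestEvent): Promise&lt;void&gt; {
  if (!isEligible(event)) return;

  const runId = `pr-${event.resource.pullRequestId}-${Date.now()}`;

  // Persist run state first so retries and deduplication stay deterministic.
  await saveRunState({ runId, status: &quot;queued&quot;, pullRequestId: event.resource.pullRequestId });

  // Start the isolated worker (for example a Container Apps job) that performs the LLM review.
  await startReviewJob({ runId, repositoryId: event.resource.repository.id });
}</code></pre><p>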
</p><p>Your code should decide:</p><ul><li><strong>When</strong> a review starts and whether the event is eligible</li><li><strong>What</strong> repository or commit range to inspect</li><li><strong>Which</strong> agent setup to use</li><li><strong>How</strong> retries, deduplication, timeouts and result posting work</li></ul><p>The LLM should decide:</p><ul><li><strong>Whether</strong> a change looks risky or a security issue exists</li><li><strong>Whether</strong> tests appear missing</li><li><strong>How</strong> findings should be summarized</li></ul><p>That separation makes the whole system easier to reason about and easier to trust.</p><h3 id="inside-the-llm-run">Inside the LLM run</h3><p>Inside the review run itself, I don&apos;t really like having one giant agent do everything.<br>What has worked better for me is a primary reviewer that fans out to specialists in parallel, then synthesizes their output:</p><ul><li><strong>One primary reviewer agent</strong></li><li><strong>Several specialist subagents in parallel</strong> (one skill per specialist)</li><li><strong>One final synthesis step</strong> that produces structured review output</li></ul><figure class="kg-card kg-image-card kg-width-wide"><img src="https://huuhkadotnet.blob.core.windows.net/prod/images/2026/4/101929_llm.png" class="kg-image" alt="Building your own PR reviewer with coding agents" loading="lazy" width="3205" height="1209"></figure><p>This maps pretty directly to the primitives I wrote about earlier. The command contains instructions on what to do. The agents contain instructions on how to do it. The skills package reusable domain guidance for a specific specialist.</p><p>In practice, the specialists might be for example an **<strong>Architecture reviewer</strong>**, **<strong>Testing reviewer</strong>**, **<strong>Security reviewer</strong>**, **<strong>Project-specific reviewer</strong>**, or whatever makes the most sense for your codebase and team. </p><p>Arguably, you could go even further and remove the primary reviewer and push the review request to the specialists directly, and just synthesize their output afterward. That&apos;s what I&apos;m doing in the later example, but both approaches work well.</p><p>The exact number depends on the system. A tiny one-file change does not need a small army of subagents. If the review scope is narrow, one agent is often enough. If the review scope is broader and the work can run in parallel, multiple specialists make more sense. That parallelism usually helps latency more than it hurts, though of course it increases cost.</p><h3 id="copilot-sdk-example">Copilot SDK example</h3><p>The Copilot SDK is a good fit for this because it exposes the same runtime behind Copilot CLI through a programmatic interface. The SDK talks to the CLI over JSON-RPC, and you create sessions that can use built-in coding tools, custom agents, skills, MCP servers, and hooks.</p><p>The useful part is not anything magical. You define the session once and then ask it to do exactly one bounded review task.</p><p>Here is a very simplified (and somewhat incomplete) TypeScript example:</p><pre><code class="language-ts">import { CopilotClient } from &quot;@github/copilot-sdk&quot;;

const client = new CopilotClient();
await client.start();

const outputShape = `
{
  &quot;summary&quot;: &quot;string&quot;,
  &quot;findings&quot;: [
    {
      &quot;severity&quot;: &quot;critical|high|medium|low&quot;,
      &quot;title&quot;: &quot;string&quot;,
      &quot;path&quot;: &quot;string&quot;,
      &quot;line&quot;: 123,
      &quot;body&quot;: &quot;string&quot;
    }
  ]
}`;
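// The same output contract is embedded in every prompt below, so each specialist
// and the final synthesis step return the same machine-readable shape.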

const reviewTask = `
Review this pull request in the current repository checkout.

Focus only on concrete issues in the changed code.
Use repository tools as needed.
Return JSON only in this shape:

${outputShape}
`;

const customAgents = [
  {
    name: &quot;architecture-reviewer&quot;,
    description: &quot;Reviews architecture and maintainability risks&quot;,
    tools: [&quot;grep&quot;, &quot;glob&quot;, &quot;view&quot;, &quot;bash&quot;],
    prompt: `
Review the pull request from an architecture perspective.

Focus on boundaries, coupling, maintainability, layering, and long-term code health.

${reviewTask}
`,
    infer: false,
  },
  {
    name: &quot;security-reviewer&quot;,
    description: &quot;Reviews security issues and dangerous patterns&quot;,
    tools: [&quot;grep&quot;, &quot;glob&quot;, &quot;view&quot;, &quot;bash&quot;],
    prompt: `
Review the pull request from a security perspective.

Focus on authentication, authorization, secrets handling, injection, trust boundaries, and unsafe execution patterns.

${reviewTask}
`,
    infer: false,
  },
];
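// Each specialist gets its own session so the reviews can run in parallel;
// a separate synthesis session further down merges their findings.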

async function runReviewer(agent: string) {
  const session = await client.createSession({
    model: &quot;gpt-4.1&quot;,
    agent,
    customAgents,
    onPermissionRequest: async () =&gt; ({ kind: &quot;approved&quot; }),
  });

  try {
    const response = await session.sendAndWait({ prompt: reviewTask });
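    // In a real pipeline you would validate the parsed findings before posting anything
    // back to the source system; the fallback below just keeps the example short.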
    return JSON.parse(response?.data.content ?? &apos;{&quot;summary&quot;:&quot;&quot;,&quot;findings&quot;:[]}&apos;);
  } finally {
    await session.disconnect();
  }
}

const specialistReviews = await Promise.all([
  runReviewer(&quot;architecture-reviewer&quot;),
  runReviewer(&quot;security-reviewer&quot;),
]);
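// A plain session without a custom agent acts as the parent reviewer,
// merging the specialist findings into one final structured review.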

const synthesis = await client.createSession({
  model: &quot;gpt-4.1&quot;,
  onPermissionRequest: async () =&gt; ({ kind: &quot;approved&quot; }),
});

try {
  const response = await synthesis.sendAndWait({
    prompt: `
You are the parent PR reviewer.

Merge overlapping findings from these specialist reviews and return one final review.

${JSON.stringify(specialistReviews, null, 2)}

Return JSON only in this shape:

${outputShape}
`,
  });

  console.log(
    JSON.stringify(
      JSON.parse(response?.data.content ?? &apos;{&quot;summary&quot;:&quot;&quot;,&quot;findings&quot;:[]}&apos;),
      null,
      2,
    ),
  );
} finally {
  await synthesis.disconnect();
  await client.stop();
}</code></pre><h3 id="skills-agents-and-structured-output">Skills, agents and structured output</h3><p>I would not build this around a single giant system prompt. The split I like is:</p><ul><li><strong>Command or job instructions: </strong> what the current run should do</li><li><strong>Agent prompts:</strong> what each specialist is responsible for</li><li><strong>Skills:</strong> reusable guidance for how that specialist should operate</li></ul><p>The Copilot SDK <a href="https://docs.github.com/en/copilot/how-tos/copilot-sdk/use-copilot-sdk/custom-agents?ref=huuhka.net">custom agent support</a> and <a href="https://github.com/github/copilot-sdk/blob/main/docs/features/skills.md?ref=huuhka.net">skill loading</a> fit that pattern well. But you can implement this any way that gets the content to the agents in a clear way. I like skills because they can be reused by the developers locally as well, especially the project specific skills are easily shared between the review agent and human developers.</p><p>For any automation, the final output should be structured to be easily machine-readable for further processing.</p><h3 id="reviews-get-wordy">Reviews get wordy</h3><p>One thing I have definitely noticed is that review agents get verbose very quickly.<br>Current models produce useful output, but they also happily produce a lot of it. Without constraints on format and severity, you easily end up with comments that are technically fine but may be nitpicks or things not worth blocking a PR over.  </p><p>This can quickly turn into developer toil especially if you have set branch policies to require all comments be reviewed before completing the PR merge.</p><p>That is why I increasingly prefer adding explicit severity levels. That will also allow you to filter out low-severity findings or even automatically post them as a comment on the PR without flagging them as a review issue, or maybe combining them into a single comment. Even better, maybe you can even start jobs to fix smaller issues automatically based on the severity ranking and complexity of the fix.</p><p>The interesting question is: which classes of findings are safe enough to autofix?<br>This then turns from &quot;review bot&quot; into a more general automation system. For that to work well, the agent likely needs to edit code, run tests and validations, and verify that the fix actually worked. Very doable, but it does mean the isolation, permissions and verification side of the architecture matter even more.</p><h3 id="the-same-architecture-powers-a-fix-command">The same architecture powers a fix command</h3><p>I&apos;ve also implemented a fix command with basically the same structure. The trigger and prompt are different, but the architecture is almost identical. The system receives an event or command (say, a comment with &quot;/fix HOW TO FIX THIS&quot;), gathers repo, PR and comment thread context, starts an isolated run, lets the coding agent perform a bounded task, and returns a structured result with a follow-up action like creating a PR or commit with the fix.</p><p>Once you have the manager, state, isolated execution, and source-system callback flow working, you can reuse it for a lot of neighboring automations.</p><h3 id="local-and-remote-on-the-same-harness">Local and remote on the same harness</h3><p>One thing I think is genuinely valuable is building these automations on top of the same harness you use locally, like OpenCode for example. 
If the same agents, skills, instructions and tool configuration also work in local development, you get some nice properties: faster feedback before code ever reaches a PR, easier debugging of the automation itself, reusable review specialists as subagents during development, and less drift between local and server-side AI workflows.<br>That also creates tradeoffs. The structure ideal for local interactive use is not always identical to what you want for a remote webhook-driven automation. If you want both, you need to think about how to organize prompts, skills, state and configuration so they fit both cases.</p><h3 id="this-generalizes">This generalizes</h3><p>This pattern is not specific to PR review. It works well anywhere the automation has a clear triggering event, a bounded piece of reasoning work, and a structured result to return.</p><p>That is why I increasingly think of these automations as normal event-driven systems with an LLM inside one stage of the pipeline. The more tightly scoped the task, the more deterministic everything around the model should be.</p><h3 id="wrap-up">Wrap up</h3><p>If I had to compress the whole thing: build it like a normal event-driven system and let code own the workflow. Let the model own the review judgment, not the process. Use a coding harness so the agent can inspect the real repo properly. Return structured findings, not just prose. And keep the task bounded and deterministic where possible.</p><p>The most important design choice is often not which model you use. It is how much of the control flow you refuse to hand over.<br></p>]]></content:encoded></item><item><title><![CDATA[Testing Cloudflare's Code Mode on Azure DevOps MCP]]></title><description><![CDATA[I recently spent some time testing Cloudflare's Code Mode in implementing the Azure DevOps MCP server in a more lightweight manner. This post goes through some of my experiences.]]></description><link>https://www.huuhka.net/testing-cloudflares-code-mode-on-azure-devops-mcp/</link><guid isPermaLink="false">69ac9d7a26d9b800016695f3</guid><category><![CDATA[AI]]></category><category><![CDATA[Azure DevOps]]></category><category><![CDATA[GitHub Copilot]]></category><category><![CDATA[OpenCode]]></category><dc:creator><![CDATA[Pasi Huuhka]]></dc:creator><pubDate>Sat, 07 Mar 2026 21:37:00 GMT</pubDate><media:content url="https://huuhkadotnet.blob.core.windows.net/prod/images/2026/3/91712_final-flow.png" medium="image"/><content:encoded><![CDATA[<img src="https://huuhkadotnet.blob.core.windows.net/prod/images/2026/3/91712_final-flow.png" alt="Testing Cloudflare&apos;s Code Mode on Azure DevOps MCP"><p>I recently spent some time testing <a href="https://github.com/cloudflare/agents/tree/main/packages/codemode?ref=huuhka.net">Cloudflare&apos;s Code Mode</a> in implementing the Azure DevOps MCP server in a more lightweight manner.</p><p><a href="https://github.com/microsoft/azure-devops-mcp?ref=huuhka.net" rel="noreferrer">Azure DevOps MCP</a> exposes <a href="https://github.com/microsoft/azure-devops-mcp/blob/main/docs/TOOLSET.md?ref=huuhka.net" rel="noreferrer">a lot of tools</a> by default. In my environment that surface was around 80 tools. That is already enough that normal tool calling starts to feel awkward. 
You spend a lot of tokens just describing the tools, the model has to select from a broad surface, and once you need multiple calls in sequence the whole thing can get noisy very quickly.</p><p>So the pitch of Code Mode made immediate sense to me: Instead of forcing the model to pick one tool at a time, let it write a small program that does the orchestration itself.</p><p>That is also very much in line with where the broader ecosystem seems to be moving. Anthropic now has both <a href="https://platform.claude.com/docs/en/agents-and-tools/tool-use/tool-search-tool?ref=huuhka.net">tool search</a> and <a href="https://platform.claude.com/docs/en/agents-and-tools/tool-use/programmatic-tool-calling?ref=huuhka.net">programmatic tool calling</a>. OpenAI has <a href="https://developers.openai.com/api/docs/guides/tools-tool-search/?ref=huuhka.net">tool search</a> as well. Anthropic&apos;s own writeup on this direction is worth reading too: <a href="https://www.anthropic.com/engineering/advanced-tool-use?ref=huuhka.net">Introducing advanced tool use on the Claude Developer Platform</a>.</p><p>So this post is not really about whether the idea is good. I think it is. It is about what happened when I actually tried to make it work on a real tool surface that looked like an obvious candidate.</p><h3 id="what-code-mode-is-in-the-simplest-form">What Code Mode is, in the simplest form</h3><p><a href="https://github.com/cloudflare/agents/tree/main/packages/codemode?ref=huuhka.net#readme" rel="noreferrer">Cloudflare&apos;s own README</a> gives a very small example of the pattern:</p><pre><code class="language-ts">
import { createCodeTool } from &quot;@cloudflare/codemode/ai&quot;;
import { DynamicWorkerExecutor } from &quot;@cloudflare/codemode&quot;;
import { streamText, tool } from &quot;ai&quot;;
import { z } from &quot;zod&quot;;

// Create the tools
const tools = {
  getWeather: tool({
    description: &quot;Get weather for a location&quot;,
    inputSchema: z.object({ location: z.string() }),
    execute: async ({ location }) =&gt; `Weather in ${location}: 72&#xB0;F, sunny`
  }),
  sendEmail: tool({
    description: &quot;Send an email&quot;,
    inputSchema: z.object({
      to: z.string(),
      subject: z.string(),
      body: z.string()
    }),
    execute: async ({ to, subject, body }) =&gt; `Email sent to ${to}`
  })
};

// Create a secure execution sandbox
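// (env.LOADER is assumed here to be the Worker Loader binding of the surrounding
// Cloudflare Worker; it is not defined in this snippet.)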
const executor = new DynamicWorkerExecutor({
  loader: env.LOADER
});

// Wrap them in code mode
const codemode = createCodeTool({ tools, executor });

// Pass them to your agent...</code></pre><p>And then the model writes something like:</p><pre><code class="language-ts">async () =&gt; {
  const weather = await codemode.getWeather({ location: &quot;London&quot; });
  if (weather.includes(&quot;sunny&quot;)) {
    await codemode.sendEmail({
      to: &quot;team@example.com&quot;,
      subject: &quot;Nice day!&quot;,
      body: `It&apos;s ${weather}`
    });
  }
  return { weather, notified: true };
};</code></pre><p>The model is a very appealing. If the tool surface is big enough, or the task requires a few dependent calls, this starts to look much nicer than repeatedly asking the model what the next tool call should be. <a href="https://www.youtube.com/watch?v=-ZikRWR1Gb4&amp;ref=huuhka.net">Cloudflare also has their own presentation on the topic here</a></p><h3 id="why-azure-devops-looked-like-a-perfect-test-case">Why Azure DevOps looked like a perfect test case</h3><p>Azure DevOps MCP is exactly the kind of thing that makes you start looking at alternatives to plain tool calling.</p><ul><li>the tool surface is big- many tools are adjacent or partially overlapping- descriptions and schemas add a lot of tokens</li><li>workflows often require multiple calls in sequence</li></ul><p>So on paper the fit looked great. I initially thought I could more or less just:</p><p>1.  put a <code>search</code> tool in front of the MCP surface<br>2. put an <code>execute</code> tool behind it<br>3. let the model search the Azure DevOps tools it needs<br>4. let it write one small program to do the actual work</p><p>That was the first version of this experiment. <a href="https://github.com/DrBushyTop/ado-codemode-mcp/tree/feat/mcp-wrap?ref=huuhka.net" rel="noreferrer">It is still preserved here for reference</a></p><p><a href="https://github.com/DrBushyTop/ado-codemode-mcp/tree/master?ref=huuhka.net">The current implementation is here</a>.</p><h3 id="the-first-lesson-this-was-not-nearly-as-plug-and-play-as-i-expected">The first lesson: this was not nearly as plug-and-play as I expected</h3><p>The basic wrapper part is easy. The hard part is getting good model behavior. That turns out to depend much more on the quality of the underlying tool contract than I first expected.</p><p>My main takeaway from the whole experiment is this:</p><blockquote>Code Mode works best when the model can plan data flow, not just function calls.</blockquote><p>That sounds obvious in hindsight, but it really changed how I think about wrapping MCP servers. If the model sees a list of tools and their input schemas, but it does not have a reliable idea of what those tools return, then longer chains get shaky very quickly. And once that happens, the model starts probing.</p><p>That probing shows up as extra search and execute calls, retries with slightly different arguments and fallback behavior outside the intended path. In practice it meant that my test agents started using az cli and reading the repo for clues.</p><p>At that point, a lot of the benefit of Code Mode starts to disappear, and the target I was chasing was something like 1 search call and 1-2 execute calls to accomplish what&apos;s needed.</p><h3 id="wrapping-the-mcp-was-the-wrong-abstraction-for-this-case">Wrapping the MCP was the wrong abstraction for this case</h3><p>This was the main practical problem.</p><p>Wrapping an MCP tool surface is not enough on its own if the wrapped tools do not expose what they actually return in a useful way. With Azure DevOps MCP, the model often had enough information to discover what to call next, but not enough information to confidently reason about what each call would return. 
That created a bad pattern:</p><ul><li>the model could find a tool</li><li>it could often call the tool correctly, but then it had to guess the output shape,  and that made multi-step orchestration inside a single execute call unreliable</li></ul><p>So instead of getting the elegant &#x201C;one search, one execute&#x201D; flow I was aiming for, I initially got a lot more churn than expected.</p><p>That is not really a knock on Code Mode itself. It is more a statement about the dependency chain underneath it. </p><blockquote>If the model is supposed to write a small program, it needs the same kind of confidence about function outputs that we as developers would want if we were wiring together a new client.</blockquote><h3 id="the-major-issues-i-ran-into">The major issues I ran into</h3><p><strong>The wrapped MCP surface did not expose enough output information</strong></p><p>This was the biggest one. The model could see inputs much more reliably than outputs. That meant it could discover and call tools, but not confidently build longer chains inside a single execute call.</p><p> This was the point where I started feeling that &#x201C;wrapping MCP servers with Code Mode&#x201D; may not be the best idea in general unless the wrapped surface also presents good return contracts.</p><p><strong>Search and execute shape matter a lot</strong></p><p>I went through a few iterations on this. If discovery leaks too much into execute, the model keeps rediscovering tools inside the program. If search returns too much, the model can get stuck doing search loops. If search returns too little, the model misses one supporting operation and has to search again.</p><p>This is one of those things that looks simple in a diagram and then turns out to be quite sensitive in practice. Practically just iterated on the search tool prompting and tried to steer the model with better examples on how to get the correct data.</p><p><strong>Secure execution is a real engineering problem outside Dynamic Workers</strong></p><p>If you use <a href="https://developers.cloudflare.com/workers/runtime-apis/bindings/worker-loader/?ref=huuhka.net" rel="noreferrer">Cloudflare&apos;s Dynamic Workers</a>, the sandboxing story is much cleaner.<br>If you do not, you need to think through how you are executing untrusted model-generated code. That was not impossible, but it definitely was not plug-and-play either.</p><p>I ended up building a local sandbox executor around container isolation,  <code>runsc</code> , a narrow callback surface and no general network egress from the sandbox. It was not that difficult in the end, but still needs you to run a docker / podman container which is not very suitable for non technical users. I&apos;ve been thinking of moving this to run on <a href="https://docs.azure.cn/en-us/aks/use-pod-sandboxing?WT.mc_id=AZ-MVP-5003781&amp;ref=huuhka.net" rel="noreferrer">Kata containers on AKS</a> in the future, but that adds another layer of complexity.</p><p>In practice, plenty of people are already effectively YOLO-running code through tools like OpenCode or Claude Code anyway, so the ecosystem clearly has a pretty large gap here. That makes this whole area fertile ground for credential leaks and other preventable issues. 
I don&apos;t think this is the main blocker to adoption, but I do think it is something to think about closely before taking code mode in use.</p><h3 id="the-solution-a-direct-rest-contract">The solution: a direct REST contract</h3><p>It became pretty clear that I was spending more and more effort compensating for the quality of the wrapped tool surface. That made me step back and ask a simpler question: Could I just work from the Azure DevOps REST contract directly?</p><p>It turns out the answer is <strong>yes</strong>.</p><p>Microsoft publishes the Azure DevOps REST specs in <a href="https://github.com/MicrosoftDocs/vsts-rest-api-specs?ref=huuhka.net">MicrosoftDocs/vsts-rest-api-specs</a>. They are split by area and version rather than shipped as one giant contract, but they are good enough to build a searchable operation catalog from. That changed the whole shape of the system.</p><p>Instead of wrapping Azure DevOps MCP and trying to enrich its tool metadata- trying to infer output behavior from MCP responses, I switched to <code>search</code> over a static Azure DevOps REST operation catalog and <code>execute</code> over one helper that calls Azure DevOps by <code>operationId</code></p><p>That ended up looking much more like the Cloudflare pattern, just on top of the Azure DevOps REST contract instead of an MCP surface.</p><p>The direct REST catalog version worked better for a simple reason: the model could see enough of the contract to actually reason through the chain. </p><p>It now had access to:</p><ul><li>operation IDs</li><li>path/query/body inputs</li><li>request body shape</li><li><strong>response schema</strong></li></ul><p>That was the missing piece. Once the output side of the contract became visible enough, the model behavior improved a lot.</p><p>The current implementation ended up with a pattern much closer to what I originally wanted. Far less fallback churn. 
It&apos;s still not perfect, but it is a completely different quality level from the original wrapped MCP version.</p><p>Example of a call:</p><figure class="kg-card kg-image-card"><img src="https://huuhkadotnet.blob.core.windows.net/prod/images/2026/3/91036_Screenshot%202026-03-09%20at%2011.29.35.png" class="kg-image" alt="Testing Cloudflare&apos;s Code Mode on Azure DevOps MCP" loading="lazy" width="2292" height="1175"></figure><figure class="kg-card kg-image-card"><img src="https://huuhkadotnet.blob.core.windows.net/prod/images/2026/3/9932_Screenshot%202026-03-09%20at%2011.29.39.png" class="kg-image" alt="Testing Cloudflare&apos;s Code Mode on Azure DevOps MCP" loading="lazy" width="1652" height="259"></figure><h3 id="current-implementation-flow">Current implementation flow</h3><p>This is roughly what the current version does</p><figure class="kg-card kg-image-card kg-width-full kg-card-hascaption"><img src="https://huuhkadotnet.blob.core.windows.net/prod/images/2026/3/91711_image.png" class="kg-image" alt="Testing Cloudflare&apos;s Code Mode on Azure DevOps MCP" loading="lazy" width="3887" height="1585"><figcaption><span style="white-space: pre-wrap;">Whole flow</span></figcaption></figure><figure class="kg-card kg-image-card kg-width-full kg-card-hascaption"><img src="https://huuhkadotnet.blob.core.windows.net/prod/images/2026/3/91640_image.png" class="kg-image" alt="Testing Cloudflare&apos;s Code Mode on Azure DevOps MCP" loading="lazy" width="3655" height="400"><figcaption><span style="white-space: pre-wrap;">Search Path</span></figcaption></figure><figure class="kg-card kg-image-card kg-width-full kg-card-hascaption"><img src="https://huuhkadotnet.blob.core.windows.net/prod/images/2026/3/91644_image.png" class="kg-image" alt="Testing Cloudflare&apos;s Code Mode on Azure DevOps MCP" loading="lazy" width="2977" height="585"><figcaption><span style="white-space: pre-wrap;">Execute Path</span></figcaption></figure><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://huuhkadotnet.blob.core.windows.net/prod/images/2026/3/72225_image.png" class="kg-image" alt="Testing Cloudflare&apos;s Code Mode on Azure DevOps MCP" loading="lazy" width="1636" height="1362"><figcaption><span style="white-space: pre-wrap;">Trust Boundary</span></figcaption></figure><h3 id="conclusions">Conclusions</h3><p>I still think Code Mode is a strong idea. But it&apos;s clear that Cloudflare&apos;s own use case is just much more naturally aligned with it than &#x201C;wrap a random MCP server and hope the contract is good enough&#x201D;.  </p><ul><li>They control the API surface and tool contract</li><li>They can surface both inputs and outputs well</li><li>The contract is broad enough that code-based orchestration really pays off</li></ul><p>That is a very different situation from trying to wrap an existing third-party MCP server whose output side may be inconsistent or under-described.</p><blockquote>Code Mode is a very good fit for large, contract-rich tool surfaces. It is a much worse fit for tool surfaces where the model can call things but cannot confidently reason about what comes back.</blockquote><p>It is clearly part of a broader direction that Anthropic and OpenAI are moving toward as well.</p><p>For Azure DevOps specifically, I got much better results by moving one level lower and working from the REST contract directly instead of treating the existing MCP server as the final abstraction. 
That does not mean wrapping MCP is always wrong, it just means I would now ask a much stricter question before doing it:</p><p><strong>Does this tool surface tell the model enough about both what goes in and what comes out?</strong> If the answer is no, I would be cautious.</p><h3 id="repos-and-references">Repos and references</h3><ul><li><a href="https://www.github.com/DrBushyTop/ado-codemode-mcp/tree/master?ref=huuhka.net" rel="noreferrer">Current implementation</a> </li><li><a href="https://www.github.com/DrBushyTop/ado-codemode-mcp/tree/feat/mcp-wrap?ref=huuhka.net" rel="noreferrer">Earlier wrapped MCP experiment</a> </li><li><a href="https://www.github.com/cloudflare/agents/tree/main/packages/codemode?ref=huuhka.net" rel="noreferrer">Cloudflare&apos;s Code Mode Repo</a> </li><li><a href="https://www.youtube.com/watch?v=-ZikRWR1Gb4&amp;ref=huuhka.net" rel="noreferrer">Cloudflare&apos;s presentation</a></li><li><a href="https://developers.openai.com/api/docs/guides/tools-tool-search/?ref=huuhka.net" rel="noreferrer">OpenAI tool search</a></li><li><a href="https://platform.claude.com/docs/en/agents-and-tools/tool-use/tool-search-tool?ref=huuhka.net" rel="noreferrer">Claude tool search</a></li><li><a href="https://platform.claude.com/docs/en/agents-and-tools/tool-use/programmatic-tool-calling?ref=huuhka.net" rel="noreferrer">Claude programmatic tool calling</a> </li><li><a href="https://www.anthropic.com/engineering/advanced-tool-use?ref=huuhka.net" rel="noreferrer">Anthropic advanced tool use post</a> </li></ul>]]></content:encoded></item><item><title><![CDATA[Enabling Custom (Bicep) Language Server support in OpenCode]]></title><description><![CDATA[I wanted Bicep diagnostics to show up in OpenCode with a custom LSP setup, as I noticed the models making a bunch of mistakes without it. Here's how to do it.]]></description><link>https://www.huuhka.net/enabling-bicep-language-server-support-in-opencode/</link><guid isPermaLink="false">69a43961dae91c00012b5c43</guid><category><![CDATA[AI]]></category><category><![CDATA[Azure]]></category><category><![CDATA[DevOps]]></category><category><![CDATA[OpenCode]]></category><category><![CDATA[Bicep]]></category><dc:creator><![CDATA[Pasi Huuhka]]></dc:creator><pubDate>Sun, 01 Mar 2026 13:29:36 GMT</pubDate><media:content url="https://huuhkadotnet.blob.core.windows.net/prod/images/2026/3/11328_img_bicep.png" medium="image"/><content:encoded><![CDATA[<img src="https://huuhkadotnet.blob.core.windows.net/prod/images/2026/3/11328_img_bicep.png" alt="Enabling Custom (Bicep) Language Server support in OpenCode"><p>I wanted Bicep diagnostics to show up in OpenCode with a custom LSP setup, as I noticed the models making a bunch of mistakes without it. 
<a href="https://opencode.ai/docs/lsp/?ref=huuhka.net" rel="noreferrer">OpenCode supports multiple LSP servers out of the box</a>, as well as configuring custom ones.</p><div class="kg-card kg-callout-card kg-callout-card-blue"><div class="kg-callout-emoji">&#x2757;</div><div class="kg-callout-text">This requires <a href="https://github.com/anomalyco/opencode/pull/15570?ref=huuhka.net" rel="noreferrer">anomalyco/opencode#15570</a> to be merged before functioning correctly</div></div><h3 id="why-bother-with-lsps-in-opencode">Why bother with LSPs in OpenCode?</h3><p>The main benefit is fast feedback in the place you already work:</p><ul><li>Catch Bicep errors while editing, not during deployment.</li><li>Catch issues when the file is read/analyzed by tooling workflows in opencode.</li><li>Get proper diagnostics (for example BCP007) instead of waiting for az deployment to fail later.</li><li><a href="https://opencode.ai/docs/tools/?ref=huuhka.net#lsp-experimental" rel="noreferrer">OpenCode also supports experimental navigation tools through the LSP.</a></li></ul><p>In practice, this shifts mistakes left: less context switching, less failed pipeline runs, faster fixes.</p><p>So how do you actually get this running?</p><h2 id="option-1-install-bicep-language-server-manually">Option 1: Install Bicep Language Server manually</h2><p>If you want a stable path independent from editor updates, install the language server under your own folder (for example <code>~/.opencode-lsp/bicep-langserver</code>) and point OpenCode there.</p><pre><code class="language-pwsh"># Install script
function Install-BicepLangServer {
    param([Parameter(Mandatory = $true)][string]$DestinationPath)
    $releaseUrl = &apos;https://github.com/Azure/bicep/releases/latest/download/bicep-langserver.zip&apos;
    $tempZip = Join-Path ([System.IO.Path]::GetTempPath()) &apos;bicep-langserver.zip&apos;
    $dllPath = Join-Path $DestinationPath &apos;Bicep.LangServer.dll&apos;
    # Check if already installed
    if (Test-Path $dllPath) {
        $updateChoice = Read-Host &apos;Bicep Language Server already installed. Update? [y/N]&apos;
        if ($updateChoice.Trim().ToUpperInvariant() -ne &apos;Y&apos;) {
            Write-Info &apos;Skipping Bicep Language Server update.&apos;
            return $true
        }
    }
    Write-Info &apos;Downloading Bicep Language Server...&apos;
    try {
        Invoke-WebRequest -Uri $releaseUrl -OutFile $tempZip -UseBasicParsing
        if (Test-Path $DestinationPath) {
            Remove-Item -Recurse -Force $DestinationPath
        }
        New-Item -ItemType Directory -Path $DestinationPath -Force | Out-Null
        Expand-Archive -Path $tempZip -DestinationPath $DestinationPath -Force
        Remove-Item $tempZip -ErrorAction SilentlyContinue
        Write-Success &quot;Bicep Language Server installed to: $DestinationPath&quot;
        return $true
    }
    catch {
        Write-Warn &quot;Failed to install Bicep Language Server: $($_.Exception.Message)&quot;
        Remove-Item $tempZip -ErrorAction SilentlyContinue
        return $false
    }
}

# Adding LSP config to opencode config JSON
function Add-LspConfigToOpenCodeConfig {
    param(
        [Parameter(Mandatory = $true)][string]$ConfigJson,
        [Parameter(Mandatory = $true)][string]$LspBasePath,
        [Parameter(Mandatory = $true)][bool]$BicepInstalled,
        [Parameter(Mandatory = $true)][bool]$PsesInstalled
    )
    $configObj = $ConfigJson | ConvertFrom-Json
    # Use forward slashes for cross-platform compatibility in JSON
    $lspBasePathNormalized = $LspBasePath -replace &apos;\\&apos;, &apos;/&apos;
    $lspConfig = @{}
    if ($BicepInstalled) {
        $lspConfig[&apos;bicep&apos;] = @{
            command    = @(&apos;dotnet&apos;, &quot;$lspBasePathNormalized/bicep-langserver/Bicep.LangServer.dll&quot;)
            extensions = @(&apos;.bicep&apos;, &apos;.bicepparam&apos;)
        }
    }
    if ($PsesInstalled) {
        $psesStartScript = &quot;$lspBasePathNormalized/pses/PowerShellEditorServices/Start-EditorServices.ps1&quot;
        $psesModulesPath = &quot;$lspBasePathNormalized/pses&quot;
        $psesLogsPath = &quot;$lspBasePathNormalized/pses/logs&quot;
        $lspConfig[&apos;powershell&apos;] = @{
            command    = @(
                &apos;pwsh&apos;,
                &apos;-NoLogo&apos;,
                &apos;-NoProfile&apos;,
                &apos;-Command&apos;,
                &quot;&amp; &apos;$psesStartScript&apos; -Stdio -HostName OpenCode -HostVersion 1.0.0 -BundledModulesPath &apos;$psesModulesPath&apos; -LogPath &apos;$psesLogsPath&apos; -LogLevel Normal&quot;
            )
            extensions = @(&apos;.ps1&apos;)
        }
    }
    if ($lspConfig.Count -gt 0) {
        $configObj | Add-Member -NotePropertyName &apos;lsp&apos; -NotePropertyValue $lspConfig -Force
    }
    return ($configObj | ConvertTo-Json -Depth 10)
}</code></pre><p>The resulting config:</p><pre><code class="language-json">{
  &quot;$schema&quot;: &quot;https://opencode.ai/config.json&quot;,
  &quot;lsp&quot;: {
    &quot;bicep&quot;: {
      &quot;extensions&quot;: [
        &quot;.bicep&quot;,
        &quot;.bicepparam&quot;
      ],
      &quot;command&quot;: [
        &quot;dotnet&quot;,
        &quot;/Users/pasi/.opencode-lsp/bicep-langserver/Bicep.LangServer.dll&quot;
      ]
    }
  }
}</code></pre><h2 id="option-2-reuse-the-vs-code-extensions-language-server">Option 2: Reuse the VS Code extension&apos;s language server</h2><p>If you already have Bicep extension installed in VS Code/Insiders, you can point OpenCode to that DLL instead of installing another copy.</p><p>Example locations on macOS:</p><ul><li><code>/Users/pasi/.vscode-insiders/extensions/ms-azuretools.vscode-bicep-VERSION/bicepLanguageServer/Bicep.LangServer.dll</code></li><li><code>/Users/pasi/.vscode/extensions/ms-azuretools.vscode-bicep-VERSION/bicepLanguageServer/Bicep.LangServer.dll</code></li></ul><p>Helper script to resolve latest installed extension DLL</p><pre><code class="language-pwsh">function Get-VSCodeBicepLangServerPath {
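    # Resolves the Bicep.LangServer.dll bundled with the newest installed
    # VS Code Bicep extension, or returns $null if the extension is not found.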
    [CmdletBinding()]
    param(
        [ValidateSet(&apos;insiders&apos;, &apos;stable&apos;)]
        [string]$Channel = &apos;insiders&apos;
    )
    # $HOME is a read-only automatic variable, so use it directly instead of reassigning it
    $basePath = if ($Channel -eq &apos;insiders&apos;) {
        Join-Path $HOME &apos;.vscode-insiders/extensions&apos;
    } else {
        Join-Path $HOME &apos;.vscode/extensions&apos;
    }
    if (-not (Test-Path $basePath)) {
        return $null
    }
    $candidate = Get-ChildItem -Path $basePath -Directory -Filter &apos;ms-azuretools.vscode-bicep-*&apos; |
        Sort-Object Name -Descending |
        ForEach-Object {
            Join-Path $_.FullName &apos;bicepLanguageServer/Bicep.LangServer.dll&apos;
        } |
        Where-Object { Test-Path $_ } |
        Select-Object -First 1
    return $candidate
}</code></pre><p>Then use that resolved path in the same lsp.bicep.command array.</p><h3 id="the-caveat-with-vs-code-paths">The caveat with VS Code paths</h3><p>The extension folder has a version in its name, so the path changes when the extension updates. That means one of:</p><ul><li>update opencode.json after extension updates,</li><li>script the path resolution and regenerate config, maybe updating via opencode plugin?</li><li>or keep the manual install path for stability.</li></ul><p>For quick setup, VS Code reuse is convenient. For long-term predictability, manual install usually wins. I&apos;ve been sticking with the manual version for now, and updating it now and then myself.</p><h2 id="quick-verification">Quick verification</h2><p>After setting config, run LSP diagnostics against an invalid .bicep file and verify you get Bicep diagnostic codes (for example BCP007).<br>That confirms both the language-ID mapping and language server wiring are working.</p><figure class="kg-card kg-image-card"><img src="https://huuhkadotnet.blob.core.windows.net/prod/images/2026/3/91042_556576696-d205b403-c0b5-4678-afa3-366927a3d11e.png" class="kg-image" alt="Enabling Custom (Bicep) Language Server support in OpenCode" loading="lazy" width="2662" height="704"></figure><figure class="kg-card kg-image-card"><img src="https://huuhkadotnet.blob.core.windows.net/prod/images/2026/3/91042_556578445-a4b27817-336e-484d-be6b-e6097a8552b7.png" class="kg-image" alt="Enabling Custom (Bicep) Language Server support in OpenCode" loading="lazy" width="2064" height="712"></figure>]]></content:encoded></item><item><title><![CDATA[How I currently develop with LLM models (Early 2026)]]></title><description><![CDATA[I've been experimenting with a lot of different agent setups, harnesses and model combinations over the last months, and I've ended up settling on a workflow that is fairly simple in structure even if the tooling around it is changing quickly.]]></description><link>https://www.huuhka.net/how-i-currently-develop-with-llm-models-early-2026/</link><guid isPermaLink="false">69add8d226d9b80001669784</guid><category><![CDATA[Developer Tools]]></category><category><![CDATA[AI]]></category><category><![CDATA[GitHub Copilot]]></category><category><![CDATA[OpenCode]]></category><dc:creator><![CDATA[Pasi Huuhka]]></dc:creator><pubDate>Wed, 25 Feb 2026 17:15:00 GMT</pubDate><media:content url="https://huuhkadotnet.blob.core.windows.net/prod/images/2026/3/82028_sessions.png" medium="image"/><content:encoded><![CDATA[<div class="kg-card kg-callout-card kg-callout-card-blue"><div class="kg-callout-emoji">&#x1F4A1;</div><div class="kg-callout-text">This post is part of a larger Agentic Dev theme:<br>- <a href="https://www.huuhka.net/a-mental-model-for-llm-tooling-primitives/" rel="noreferrer">A mental model for LLM tooling primitives</a><br>- <a href="https://www.huuhka.net/research-plan-implement/" rel="noreferrer">Research - Plan - Implement</a><br>- <a href="https://www.huuhka.net/primary-vs-subagents-in-llm-harnesses/" rel="noreferrer">Primary vs Subagents in LLM harnesses</a><br>- <a href="https://www.huuhka.net/how-i-currently-develop-with-llm-models-early-2026/" rel="noreferrer">How I currently develop with LLM models (Early 2026)</a> <br>- <a href="https://www.huuhka.net/building-your-own-pr-reviewer-with-coding-agents/" rel="noreferrer">Building your own PR reviewer with coding agents</a></div></div><img src="https://huuhkadotnet.blob.core.windows.net/prod/images/2026/3/82028_sessions.png" alt="How I currently 
develop with LLM models (Early 2026)"><p>I&apos;ve been experimenting with a lot of different agent setups, harnesses and model combinations over the last months, and I&apos;ve ended up settling on a workflow that is fairly simple in structure even if the tooling around it is changing quickly.</p><p>This post is not really meant as a &quot;this is the correct way&quot; type of thing. It&apos;s just the current version of what has worked best for me in actual development work.</p><p>It also builds on a few earlier posts of mine, especially <a href="https://www.huuhka.net/research-plan-implement/">Research - Plan - Implement</a>, <a href="https://www.huuhka.net/primary-vs-subagents-in-llm-harnesses/">Primary vs Subagents in LLM harnesses</a> and <a href="https://www.huuhka.net/a-mental-model-for-llm-tooling-primitives/">A mental model for LLM tooling primitives</a>.</p><p><strong>The short version</strong></p><ul><li>I have centralized most of my LLM-assisted development into <a href="https://opencode.ai/?ref=huuhka.net">OpenCode</a>, though all the other stuff does work on any harness, like Github Copilot.</li><li>I still mostly work with the Research - Plan - Implement pattern, but I scale the ceremony up and down depending on the task.</li><li>In practice, plan + implement is often enough.- I read the plans carefully and go back and forth with the planning agent until the open questions are actually resolved.</li><li>I tested looped implementation with Ralph loops, and it does work, but it hasn&apos;t really become my default.</li><li>For larger tasks, I now prefer splitting work into parallel streams and handing those to orchestrators.</li><li>I&apos;m mostly using GPT-5.4, with some Opus 4.6 mixed in where it helps.</li></ul><h3 id="standardizing-on-opencode-as-the-main-harness">Standardizing on OpenCode as the main harness</h3><p>The main thing I wanted was standardization.</p><p>At this point I have models available from multiple places: customer environments, GitHub Copilot, and my own subscriptions. I don&apos;t really want my workflow to completely change based on where the model happens to live. OpenCode has worked well for me here because I can keep the same harness, the same commands, the same agents and roughly the same habits while switching the model underneath.</p><p>Sure, you can argue that the provider&apos;s own harness is always going to be the ideal place to use their model. In some very specific cases that may well be true. But for my day-to-day work, the convenience of having one interface and one working style has been more valuable than whatever marginal gains I might get by constantly moving between native tools. </p><p>And if I ever want to replace my UI in the future, I can still do that without having to change the whole workflow as OpenCode is built on a Client-Server model. 
I&apos;m still mostly on the normal OpenCode TUI though, and testing the new-ish desktop client here and there.</p><p>For me, that standardization matters more in my daily work than chasing theoretical best-case setups.</p><h3 id="rpi-is-still-the-backbone">RPI is still the backbone</h3><p>I wrote earlier about <a href="https://www.huuhka.net/research-plan-implement/">Research - Plan - Implement</a>, and that is still basically the backbone of how I work.</p><p>What has changed a bit is mostly how often I use the full ceremony.</p><ul><li>For small tasks, I often just talk directly with the coding agent.</li><li>For medium tasks, plan + implement is usually enough.</li><li>For larger, messier or riskier work, I still want the full research -&gt; plan -&gt; implement flow.</li></ul><p>The important part for me is not the ritual itself. The point is to keep context under control and reduce ambiguity before implementation starts.</p><p>If a task does not need three separate phases, I don&apos;t force it. That flexibility and simplicity are the reasons why the pattern has kept working for me.</p><p>I&apos;m still not in agreement with myself on whether the plan / research artifacts should be committed in the repo or not. Often it feels like the codebase moves so fast that old plans and research files get stale pretty quickly, and that the main value of those files is in the moment when they are created and discussed, not as a long-term reference. But on the other hand, having a record of the thinking process can sometimes be useful for future reference or for other team members. </p><h3 id="the-plan-is-where-i-do-most-of-the-thinking">The plan is where I do most of the thinking</h3><p>One thing I don&apos;t do is generate a plan and then treat it as a decorative artifact.<br>I read the plans. I discuss them with the planning agent. I answer the open questions. If I notice missing angles, I ask it to expand the plan. If something looks too handwavy, I push it to be more concrete.</p><p>This is probably the cheapest point in the whole process to catch misunderstandings, and there have been numerous cases where reading the plan actually makes me change the initial approach I had in mind completely, or I understand that some very important feature or edge case is missing from my mental model. </p><p>Once implementation starts, every missing assumption gets more expensive. So I would much rather spend the extra few minutes in the planning stage than later do a half-implementation, realize the feature shape was wrong, and start steering it back.</p><p>In that sense, the plan is not just a handoff file. It&apos;s also the point where I clarify the task for myself and the team I&apos;m working with.</p><h3 id="i-tested-looping-implementations-with-ralph-loops">I tested looping implementations with Ralph loops</h3><p>I have also spent some time testing a more loop-oriented way of working. By that I mean taking the plan and then repeatedly running implementation steps in a loop, or even having the loop continue until some completion promise is hit. This did actually work reasonably well. (Often there was no promise, just a list of tasks and YOLO)</p><p><a href="https://github.com/DrBushyTop/humanlayer-opencode/blob/master/.opencode/scripts/ralph-loop.sh?ref=huuhka.net">My script</a> for this was intentionally very simple:</p><pre><code class="language-bash">ralph-loop.sh PROMPT.md --agent Implementer --max 30</code></pre><p>That simplicity is also part of the appeal. 
You don&apos;t necessarily need a huge framework around the idea to test if it fits your workflow.</p><p>That said, I don&apos;t have an infinite request budget, so this has not really felt like the best default for me. When I was testing this more actively, Opus 4.5 and Opus 4.6 were also only available to me through GitHub Copilot&apos;s smaller context window, which meant compaction happened fairly quickly even with the loop. At that point the whole thing starts to feel a bit less attractive.</p><p>So my current view is not that looping is bad. More that if you have the budget and the patience to optimize around it, it is definitely worth exploring. I just haven&apos;t felt that simple looping, by itself, is where I get the best tradeoff.</p><p>What changed for me here is that GPT-5.4&apos;s own compaction has generally felt good enough for the type of work I do. OpenCode&apos;s own compaction has also felt solid. Of course, native compaction is a bit of a black box. You don&apos;t really know in exact terms what the model decided to compress and how. But in practical terms, the result has been good enough often enough that I no longer feel a big need to build a loop around everything. If the default context management is already holding up, I would rather keep the workflow simpler.</p><h3 id="for-bigger-work-i-split-plans-into-parallel-streams-instead">For bigger work, I split plans into parallel streams instead</h3><p>The bigger change in my own workflow has been here.</p><p>Instead of trying to keep one long implementation thread running forever, I now often ask the planning agent to split the implementation into multiple workstreams that could be handled in parallel. This has worked surprisingly well, even in larger codebases.</p><p>The reason I like this is fairly close to what I wrote earlier in <a href="https://www.huuhka.net/primary-vs-subagents-in-llm-harnesses/">Primary vs Subagents in LLM harnesses</a>. If a split actually reduces context pressure and gives you cleaner handoffs, it&apos;s useful. If it doesn&apos;t, then it&apos;s mostly just ceremony.</p><p>The plan file becomes a real handoff artifact here. It tells each stream what it is responsible for, what is already known, and what done should look like.<br>A simplified example could look something like this:</p><pre><code class="language-md">## Parallel Streams (very rough example)

### Stream A - API contract changes

- [ ] Add endpoint contract
- [ ] Add validation and tests

### Stream B - UI flow changes

- [ ] Add new settings UI
- [ ] Add loading and error states

### Stream C - Verification

- [ ] Add integration coverage
- [ ] Run browser validation</code></pre><p>That kind of structure has been much more useful for me than trying to just keep one giant thread alive for as long as possible.</p><p>Once I have the workstreams, I build an orchestrator instructions file for the task.<br>This file contains the operational rules for the orchestrator: how it should delegate work to subagents, how it should update the plan, how it should verify the implementation, how it should test, and what kind of review it should do before calling the work complete. Then I start one orchestrator per workstream and let them go.</p><p>Here&apos;s an example of orchestrator instructions:</p><pre><code class="language-md">## Orchestrator Instructions

These instructions are for the orchestrator agent coordinating the work.

- Tooling &amp; verification
  - Use `bun` for all installs, builds and tests.
  - A dev server is ALREADY RUNNING. DO NOT START NEW SERVERS unless it has crashed.
  - Use the Chrome DevTools MCP for all browser-based verification.
  - When using Chrome DevTools MCP screenshots:
    - Save each screenshot file into the `screenshots/` folder at repo root.
    - Only after saving, read or reference the screenshot file as needed.
  - The orchestrator must be the only agent using the MCP server (subagents should not talk to it directly).

- Workflow &amp; delegation
  - The orchestrator is responsible for verifying each change end-to-end (tests, manual checks via Chrome MCP, quick code review).
  - The orchestrator should not write application code directly.
    - Delegate implementation work to `subagents/code/coder-agent`.
  - For each implementation task, instruct the coder subagent to:
    - Read the implementation plan to understand the whole workstream (link the file to them)
    - First use `subagents/research/codebase-locator` and `subagents/research/codebase-analyzer` to find entry points and understand patterns.
    - If the work is UI related, use their frontend design skill
    - Only then implement changes.

- Tasklist maintenance
  - Every task in the tasklist is a checklist item (`- [ ]`).
  - For each task the orchestrator completes, they must:
    - Update the tasklist by checking off the corresponding item (`- [x]`).
    - Optionally record a short note or the commit hash next to the checked item.
  - Updating the tasklist is part of the task and should be included in the same commit as the implementation or an immediate follow-up commit.

- Commits &amp; granularity
  - After each task is implemented and verified, the orchestrator must create a separate git commit for that task.
  - Do not batch multiple tasks into a single commit unless a task explicitly says it depends on another and they cannot be separated.
  - Commit messages should mention the task id.
  - DO NOT REVERT any unrelated code changes</code></pre><p>Starting the flow is then just:</p><pre><code class="language-md">You are the orchestrator for Stream A.
Read @orchestrator-instructions.md
Read @implementation-plan.md
Start / Continue the work.</code></pre><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://huuhkadotnet.blob.core.windows.net/prod/images/2026/3/82027_image.png" class="kg-image" alt="How I currently develop with LLM models (Early 2026)" loading="lazy" width="1594" height="1062"><figcaption><span style="white-space: pre-wrap;">Orchestrators at work</span></figcaption></figure><p>In this setup, I don&apos;t really think of the orchestrator as &quot;another coding agent&quot;. I think of it more as an execution manager and verifier. The actual implementation can be delegated downwards, while the orchestrator keeps track of the plan state, runs checks, and makes sure the result still matches the original intent.</p><p>After that, I still like doing an additional review-oriented pass just to verify that the work really is implemented, and not just described confidently. This might also be a good point to do some extra manual checks on the actual implementation, and to validate your own mental model of the functionality in the codebase so you can confidently build on top of it in future work.</p><h3 id="browser-verification-is-becoming-part-of-the-normal-loop">Browser verification is becoming part of the normal loop</h3><p>Another thing that has slowly become more important in my workflow is browser-side verification.</p><p>I&apos;ve written about <a href="https://www.huuhka.net/browser-verification-for-coding-agents-chrome-devtools-mcp-vs-agent-browser/">Chrome DevTools and agent browser style tooling</a>, so I won&apos;t go too deep into those here, but the short version is that I increasingly want verification to include more than just tests and static review. Most of the time, the orchestrator is responsible for that too.</p><p>So in addition to delegating implementation and running tests, I also want it to verify the result in the browser where relevant. That can mean Chrome DevTools, and increasingly it can also mean agent browser style tooling.</p><p>I&apos;m still incorporating agent browser more fully into the toolset, but I think the longer-term benefit there is pretty clear: individual subagents should eventually be able to verify their own work in parallel as part of the implementation flow. Chrome DevTools style MCP setups have felt a bit less happy when multiple agents try to use them at once, so for now that verification often sits more naturally at the orchestrator layer.</p><h3 id="clear-validation-matters-more-than-the-harness">Clear validation matters more than the harness</h3><p>One thing that feels very obvious to me at this point is that regardless of which harness you use, the best results tend to come from giving the model a clear way to validate whether the result is actually correct. That can be tests, visual checks, browser flows, snapshots, linting, typechecking, golden outputs, or any other concrete signal that tells the model when it has matched the target and when it has not.</p><p>Without that feedback loop, you are much more dependent on the model confidently approximating what you meant. Sometimes that is enough, but often it isn&apos;t.</p><p>This is also why it makes sense that people copy test sets from existing applications when trying to clone them outright. The tests are not just verification; they are also an unusually precise description of expected behavior. If you can give the model that kind of target, the odds of getting the exact result you wanted go up quite a bit.
So while I do care about harnesses and agent structure a lot, I would still rank clear validation above most harness-level differences.</p><h3 id="model-mix-mostly-gpt-54-some-opus-46">Model mix: mostly GPT-5.4, some Opus 4.6</h3><p>At the model level, I&apos;m mostly using GPT-5.4 right now, with some Opus 4.6 added in. The biggest reason is pretty simple: the larger context window of GPT-5.4 is very useful in this kind of work, at least compared to what I currently get from Opus through GitHub Copilot (128k max).</p><p>Opus, on the other hand, is very good especially for UI work, particularly when combined with a stronger design-oriented skill or prompt setup like <a href="https://github.com/anthropics/skills/blob/main/skills/frontend-design/SKILL.md?ref=huuhka.net">Anthropic&apos;s frontend-design skill</a>. I&apos;ve also been meaning to test out the <a href="https://github.com/cyxzdev/Uncodixfy?ref=huuhka.net">Uncodixfy skill</a> to see if it can help either of these models give better UI outputs, but that&apos;s still on the to-do list.</p><p>For everything else, GPT-5.4 and GPT-5.3-Codex have handled the work very well.<br>One thing that does feel fairly obvious, though, is that both models have a set number of UI templates they tend to lean on whenever you ask them to build something from scratch. That is not really a surprise anymore, but it is visible. So even when the model is good, the prompting and skill layer still matter a lot if you want the result to feel intentional instead of generic. Ask a model to build 10 different takes on the same UI component, and you&apos;ll see the same few templates come up again and again. That is not necessarily a problem, but it is something to be aware of when you&apos;re trying to get a specific design or interaction pattern out of the model.</p><h2 id="final-thoughts">Final thoughts</h2><p>I don&apos;t really expect this to be my workflow a year from now. Right now though, this is the setup that has felt the most useful in actual development work:</p><ul><li>standardize the harness where possible</li><li>add process only when the task complexity justifies it</li><li>use plans as real handoff artifacts instead of generated paperwork</li><li>parallelize bigger work instead of stretching one implementation thread forever</li><li>verify in the browser too, not just in code and tests</li></ul><p>That&apos;s basically it.</p><p>The models will keep changing, the harnesses will keep changing, and some of these tradeoffs will likely look different again fairly soon.
But for now, this has been a good local optimum for me.<br></p>]]></content:encoded></item><item><title><![CDATA[Designing a shared OpenTelemetry contract for AI services on Azure]]></title><description><![CDATA[In this post I'll walk through how I went about  making multiple services behave like they belong to the same platform from an OpenTelemetry perspective.]]></description><link>https://www.huuhka.net/designing-a-shared-opentelemetry-contract-for-ai-services-on-azure/</link><guid isPermaLink="false">69de736ab92fba00013b61e0</guid><category><![CDATA[AI]]></category><category><![CDATA[Developer Tools]]></category><category><![CDATA[Azure]]></category><category><![CDATA[Telemetry]]></category><dc:creator><![CDATA[Pasi Huuhka]]></dc:creator><pubDate>Sun, 22 Feb 2026 18:13:00 GMT</pubDate><media:content url="https://huuhkadotnet.blob.core.windows.net/prod/images/2026/4/141717_lightsaber-collection-hqBr-KfgR8o-unsplash.jpg" medium="image"/><content:encoded><![CDATA[<div class="kg-card kg-callout-card kg-callout-card-blue"><div class="kg-callout-emoji">&#x1F4A1;</div><div class="kg-callout-text">This post is a part of a larger AI Dev Platform theme:<br>- <a href="https://www.huuhka.net/ai-dev-platform-fundamentals/" rel="noreferrer">Azure AI Dev Platform Fundamentals</a><br>- <a href="https://www.huuhka.net/practical-experiences-with-azure-apim-ai-gateway-and-imported-foundry-endpoints/" rel="noreferrer">Practical experiences with Azure APIM AI Gateway and imported Foundry endpoints</a><br>- <a href="https://www.huuhka.net/designing-a-shared-opentelemetry-contract-for-ai-services-on-azure/" rel="noreferrer">Designing a shared OpenTelemetry contract for AI services on Azure</a><br>- <a href="https://www.huuhka.net/connecting-opencode-with-microsoft-foundry-models/" rel="noreferrer">Connecting OpenCode with Microsoft Foundry Models</a></div></div><img src="https://huuhkadotnet.blob.core.windows.net/prod/images/2026/4/141717_lightsaber-collection-hqBr-KfgR8o-unsplash.jpg" alt="Designing a shared OpenTelemetry contract for AI services on Azure"><p>Once you have more than one AI-facing service behind the same Azure API Management layer, telemetry starts drifting almost immediately.</p><p>One service calls the tool identifier one thing. Another uses a different header name. A third one emits a metric dimension that looked harmless until somebody tried to chart it and discovered the cardinality was terrible. At that point you can still say you have observability, but the useful part of it starts slipping away.</p><p>In this post I&apos;ll walk through how I went about solving this issue. Not how to turn on OpenTelemetry, but instead how to make multiple services behave like they belong to the same platform.</p><h3 id="the-problem-i-cared-about">The problem I cared about</h3><p>What I wanted was fairly simple. If traffic entered through one platform edge, I wanted the downstream services to agree on what a request was, how it should be attributed, and which parts of that attribution were safe to put on metrics.</p><p>That sounds boring, but for AI workloads it matters quite a lot. I usually want to answer questions like which tool was actually used, which client path generated the traffic, whether a specific agent integration is noisy, and where the cost is going. If every backend answers those in a slightly different way, the dashboards stop being trustworthy surprisingly fast.</p><p>The other awkward part was that the services weren&apos;t all in the same stack. 
Some were .NET, some were TypeScript, and I really didn&apos;t want both ecosystems inventing their own baggage parsing and metric filtering conventions. So I ended up treating telemetry as a platform contract instead of a helper library.</p><h3 id="a-contract-not-just-shared-code">A contract, not just shared code</h3><p>The main design decision was to move the shared behavior into one contract file and then have both the .NET and TypeScript libraries implement that contract.</p><p>This split meant the important decisions lived in one place: which <code>myprefix.*</code> baggage keys exist, what the resolution order is, which metric instruments are expected, and which attributes are explicitly forbidden from metrics.</p><p>The common layer itself was just a YAML file. In simplified form, it looked like this:</p><pre><code class="language-yaml">version: 1

aiplat:
  baggage:
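    # Platform-owned context keys; the APIM edge writes these into W3C baggage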
    keys:
      - myprefix.request_id
      - myprefix.tool_id
      - myprefix.user_id_hash
      - myprefix.opencode_agent_name
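    # Fallbacks exist as a mechanism, but the normal path keeps them empty and turned off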
    headerFallbacks: {}
    resolutionOrder:
      useOtelBaggage: true
      useRawBaggageHeader: true
      useHeaderFallbacks: false

  metrics:
    meterName: AIPlatform.AiPlat
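    # Instruments both language libraries are expected to emit, with an explicit attribute allowlist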
    instruments:
      - name: myprefix_requests_total
        type: counter
        unit: &quot;1&quot;
        attributes:
          - myprefix.tool_id
          - myprefix.opencode_agent_name
          - http.method
          - http.status_code
          - http.route
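    # Never allowed as metric dimensions; this kind of context stays on spans only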
    forbiddenMetricAttributes:
      - myprefix.user_id_hash</code></pre><p>I like this shape because you can understand most of the platform opinion just by looking at the file. The user hash exists as a shared attribute, but it&apos;s explicitly forbidden from metrics. Header fallbacks exist as a mechanism, but the normal path keeps them turned off.</p><p>That last part was one of the main reasons I wanted a contract at all. Telemetry on AI systems has a bad habit of turning into a junk drawer. Somebody adds a user hash to spans, somebody else thinks it&apos;d be nice on a metric, and three weeks later you&apos;re cleaning up a cardinality mess you could&apos;ve avoided by just being stricter in the first place. I hit this early on, and it was clear I needed to be more intentional about what goes on metrics and what doesn&apos;t.</p><p>With a contract file, the rules become pretty boring &#x2014; in a good way. If a field is in the shared config and marked metric-safe, both languages treat it as metric-safe. If header fallbacks are disabled there, they&apos;re disabled everywhere. If a key is forbidden from metrics, that&apos;s not a code review opinion anymore. It&apos;s just the rule.</p><h3 id="let-the-edge-do-the-normalization">Let the edge do the normalization</h3><p>The other thing I felt strongly about was ownership. I didn&apos;t want every backend to understand every client-specific header shape forever. That&apos;s exactly the kind of decision that feels harmless at the beginning and then quietly turns into coupling. So the rule became that API Management owns the edge normalization. Client-specific headers come in, APIM turns them into the platform-owned shape, appends the values into W3C baggage, and the services only need to understand the normalized platform contract.</p><p>It sounds obvious written out like that, but I think it&apos;s easy to get wrong. If both APIM and the services can independently decide how a platform attribute is sourced, the whole thing gets muddy very quickly. Some values come from baggage, some from raw headers, some from fallbacks, and eventually nobody&apos;s fully sure which layer is authoritative.</p><p>That&apos;s why I kept header fallbacks disabled by default. The shared libraries support them, but I think the healthier default is to force the edge to do the propagation properly.</p><p>In practice, I wanted this and not five different half-overlapping variants of it:</p><figure class="kg-card kg-image-card kg-width-wide"><img src="https://huuhkadotnet.blob.core.windows.net/prod/images/2026/4/141720_image.png" class="kg-image" alt="Designing a shared OpenTelemetry contract for AI services on Azure" loading="lazy" width="2590" height="220"></figure><h3 id="metric-cardinality">Metric cardinality</h3><p>The useful part of the design wasn&apos;t the config file by itself. It was what the config file made harder to mess up.</p><p>On spans I&apos;m fairly relaxed. If a piece of context is useful for debugging and it&apos;s handled safely, I don&apos;t mind carrying a decent amount of it. On metrics I&apos;m much more conservative.</p><p>The shared allowlist and forbidden-attribute model helped a lot here. Tool identifiers, route templates, method, status code, and a bounded operation name are all reasonable candidates. A user hash isn&apos;t. Request IDs aren&apos;t. Session IDs aren&apos;t. Those are good diagnostic attributes and terrible metric dimensions.</p><p>This split was especially important because some of the AI-specific context only exists inside the app. 
If the service parses MCP-style JSON-RPC payloads, for example, it can often derive a stable operation name or tool name that&apos;s genuinely useful on request metrics. That enrichment belongs in the app because the app actually understands the payload. Client lineage and normalized request identity, on the other hand, are edge concerns and belong in APIM.</p><h3 id="cross-language">Cross-language</h3><p>I think this would&apos;ve been much less useful if it only solved the .NET side nicely.<br>The .NET version is naturally a little heavier. It plugs into dependency injection, middleware, and the normal OpenTelemetry setup in ASP.NET Core. The TypeScript side is lighter and more wrapper-shaped. That&apos;s fine &#x2014; they don&apos;t need to look the same internally.</p><p>I wanted them to share the exact same config source, though. This allowed me to change a single location and have it flow to all of the services in the platform, regardless of language. I just had to make sure that no matter which language we were using, the config files were pulled in with the builds accordingly.</p><p>The shared library is code, but the contract it implements is still data. I didn&apos;t want the values duplicated into two language implementations at build time in some opaque way. I wanted both runtimes to load the same file and validate it normally.</p><p>They needed to behave the same way regardless of language. If APIM wrote a platform baggage key, both stacks needed to resolve it in the same order. If a metric attribute was safe in one service, it needed to be safe in the other. If a forbidden attribute was dropped from metrics in TypeScript but leaked through in .NET, the whole shared contract idea would&apos;ve been kind of pointless.</p><p>I think shared contract tests matter more than shared implementation details in setups like this. The point isn&apos;t that both libraries use the same code shape. The point is that they produce the same platform behavior.</p><h3 id="the-azure-perspective">The Azure perspective</h3><p>The implementation itself lived mostly in shared code and config, but we of course need some extra Azure parts to make this all run.</p><p>There was a central telemetry setup around Log Analytics, Application Insights, and a shared collector story when needed. API Management handled the practical edge work: trace continuation, normalized headers, baggage propagation, and gateway-side dimensions for the traffic that needed to be observable already at that layer.</p><p>If you already have a platform edge, that&apos;s where this kind of cross-cutting normalization belongs.</p><p>It also meant the services stayed smaller. They didn&apos;t need to know how a specific client integration decided to represent a parent session or a tooling version header. They just needed to consume the platform contract consistently.</p><h3 id="closing-thoughts">Closing thoughts</h3><p>Looking back, this was a simple-ish contract design task.</p><p>The edge normalizes and propagates. The shared config owns the rules. The language libraries implement those rules. Metrics stay intentionally boring. Richer context lives on spans unless there&apos;s a very good reason to promote it.</p><p>If I were doing this again, I&apos;d keep the same basics. The declarative contract file was the biggest win. Edge-owned normalization was the right split. And being strict on metric safety saved me from the kind of telemetry mess that&apos;s easy to create and annoying to clean up. 
However, I&apos;d likely implement this right from the start instead of waiting for the mess to happen first. The refactor was pretty easy, but it&apos;s still better to avoid the mess in the first place.</p>]]></content:encoded></item><item><title><![CDATA[Semantic Kernel to Microsoft Agent Framework: Practical reflections]]></title><description><![CDATA[Now that MAF is nearing a GA release, I took my existing demo and translated it to see how the new framework feels in real code.
In this post I'll recap my thoughts on the new framework.]]></description><link>https://www.huuhka.net/semantic-kernel-to-microsoft-agent-framework-practical-reflections/</link><guid isPermaLink="false">69a870cc26d9b80001669574</guid><category><![CDATA[AI]]></category><category><![CDATA[Azure]]></category><dc:creator><![CDATA[Pasi Huuhka]]></dc:creator><pubDate>Fri, 20 Feb 2026 18:08:00 GMT</pubDate><media:content url="https://huuhkadotnet.blob.core.windows.net/prod/images/2026/3/4188_AgentFramework-1.png" medium="image"/><content:encoded><![CDATA[<img src="https://huuhkadotnet.blob.core.windows.net/prod/images/2026/3/4188_AgentFramework-1.png" alt="Semantic Kernel to Microsoft Agent Framework: Practical reflections"><p>At <a href="https://globalai.community/chapters/helsinki/events/agentcon-2025-helsinki/?ref=huuhka.net" rel="noreferrer">AgentCon 2025 Helsinki</a>, I presented my multi-agent demo using Semantic Kernel orchestration.</p><p>On the same day, Microsoft announced Microsoft Agent Framework (MAF), so if I wanted to ever have the same presentation again I had to translate it to MAF instead. Now that it&apos;s nearing a GA release, I took my existing demo and translated it to see how the new framework feels in real code. </p><p>In this post I&apos;ll recap my thoughts on the new framework after taking it out for a spin.</p><p>The results can be found in these branches of the repo. I used version <code>1.0.0-rc1</code>.</p><ul><li><a href="https://github.com/DrBushyTop/MultiAgentSemanticKernel/tree/feat/agentFramework?ref=huuhka.net" rel="noreferrer">Demo repo (MAF branch)</a></li><li><a href="https://github.com/DrBushyTop/MultiAgentSemanticKernel/tree/semantickernel?ref=huuhka.net" rel="noreferrer">Demo repo (SK branch)</a></li></ul><p>Let&apos;s get going!</p><h3 id="agent-creation-less-plumbing-more-intent">Agent creation: less plumbing, more intent</h3><p>In SK, my agent setup centers around <code>Kernel</code> composition and per-agent kernel wiring. In MAF, it is mostly chat client + instructions + tools.</p><p><strong>Before (SK):</strong></p><pre><code class="language-csharp">// https://github.com/DrBushyTop/MultiAgentSemanticKernel/blob/semanticKernel/Runtime/AgentUtils.cs#L26-L70
var builder = Kernel.CreateBuilder();
builder.Services.AddSingleton&lt;IChatCompletionService&gt;(chatService);
builder.Services.AddSingleton&lt;IFunctionInvocationFilter, ConsoleFunctionInvocationFilter&gt;();

var agentKernel = builder.Build();

return new ChatCompletionAgent
{
    Name = name,
    Instructions = instructions,
    Kernel = agentKernel,
    Arguments = args,
};</code></pre><p><strong>After (MAF):</strong></p><pre><code class="language-csharp">// https://github.com/DrBushyTop/MultiAgentSemanticKernel/blob/feat/agentFramework/Runtime/AgentFactory.cs#L15-L24
return new ChatClientAgent(chatClient, new ChatClientAgentOptions
{
    Name = name,
    ChatOptions = new ChatOptions
    {
        Instructions = instructions,
        Temperature = temperature,
        Tools = tools ?? []
    }
});</code></pre><p>This was the first thing I noticed: less ceremony, easier to read, easier to explain. I did like the previous kernel composition model, even though it was a bit more difficult to understand at first, but I&apos;m sure this new model will grow on me as I get used to it.</p><p>Microsoft has these relevant <strong>docs</strong> in case you want to do the same migration:</p><ul><li><a href="https://learn.microsoft.com/agent-framework/overview/?WT.mc_id=AZ-MVP-5003781&amp;ref=huuhka.net" rel="noreferrer">MAF overview</a></li><li><a href="https://learn.microsoft.com/agent-framework/migration-guide/from-semantic-kernel/?WT.mc_id=AZ-MVP-5003781&amp;ref=huuhka.net" rel="noreferrer">SK -&gt; MAF migration guide</a></li></ul><h3 id="tool-registration-plugin-modelplain-function-tools">Tool registration: plugin model -&gt; plain function tools</h3><p>In SK, you had to use plugin classes + <code>[KernelFunction]</code> and plugin import. Now in MAF, it&apos;s possible to just pass methods directly with <code>AIFunctionFactory.Create(...)</code>.</p><p><strong>Before (SK):</strong></p><pre><code class="language-csharp">// https://github.com/DrBushyTop/MultiAgentSemanticKernel/blob/semanticKernel/Plugins/DevWorkflowPlugin.cs#L9-L30
public sealed class DevWorkflowPlugin
{
    [KernelFunction, Description(&quot;Generate OpenAPI from story and AC&quot;)]
    public string Oas_Generate(string story, string acceptance) =&gt; ...;
}

// https://github.com/DrBushyTop/MultiAgentSemanticKernel/blob/semanticKernel/Runners/SequentialRunner.cs#L22-L36
kernel.ImportPluginFromType&lt;DevWorkflowPlugin&gt;();</code></pre><p><strong>After (MAF):</strong></p><pre><code class="language-csharp">// https://github.com/DrBushyTop/MultiAgentSemanticKernel/blob/feat/agentFramework/Runners/SequentialRunner.cs#L21-L28
var tools = new List&lt;AITool&gt;
{
    AIFunctionFactory.Create(DevWorkflowTools.OasGenerate),
    AIFunctionFactory.Create(DevWorkflowTools.RepoCreateBranch),
    AIFunctionFactory.Create(DevWorkflowTools.CreateScaffold),
};</code></pre><p>This feels much more natural to me in C#: any function can become a tool without extra plugin lifecycle overhead.</p><p><strong>Docs:</strong></p><ul><li><a href="https://learn.microsoft.com/agent-framework/agents/tools/?WT.mc_id=AZ-MVP-5003781&amp;ref=huuhka.net" rel="noreferrer">MAF tools</a></li><li><a href="https://learn.microsoft.com/dotnet/api/microsoft.extensions.ai.aifunctionfactory.create?WT.mc_id=AZ-MVP-5003781&amp;ref=huuhka.net" rel="noreferrer"><code>AIFunctionFactory</code> API</a></li></ul><h3 id="workflow-runtime-model-orchestration-runtimeevent-stream">Workflow runtime model: orchestration runtime -&gt; event stream</h3><p>While SK gives the <code>InvokeAsync(...)</code> + <code>GetValueAsync(...)</code> style orchestration, MAF opts for a workflow event stream instead. With SK the result was always a bit different depending on the orchestration pattern, but with MAF I ended up with a more consistent pattern: we always return the list of assistant messages as the final output, and stream intermediate events during execution. I think this makes it easier to understand what is going on during execution, and gives more flexibility on how to handle intermediate events (e.g. tool calls) if needed.</p><p><strong>Before (SK):</strong></p><pre><code class="language-csharp">// https://github.com/DrBushyTop/MultiAgentSemanticKernel/blob/semanticKernel/Runners/ConcurrentRunner.cs#L89-L99
var orchestration = new ConcurrentOrchestration(diffAnalyst, testImpactor, secLint, compliance)
{
    ResponseCallback = AgentResponseCallbacks.Create(cli),
};

var runtime = new InProcessRuntime();
await runtime.StartAsync();

var result = await orchestration.InvokeAsync(prompt, runtime);
var output = await result.GetValueAsync(TimeSpan.FromSeconds(120));
await runtime.RunUntilIdleAsync();</code></pre><p><strong>After (MAF):</strong></p><pre><code class="language-csharp">// https://github.com/DrBushyTop/MultiAgentSemanticKernel/blob/feat/agentFramework/Runtime/WorkflowRunner.cs#L35-L43
StreamingRun run = await InProcessExecution.RunStreamingAsync(workflow, messages, cancellationToken: cancellationToken);
await run.TrySendMessageAsync(new TurnToken(emitEvents: true));

await foreach (WorkflowEvent evt in run.WatchStreamAsync(cancellationToken))
{
    switch (evt)
    {
        case AgentResponseUpdateEvent e:
            // stream token and tool activity
            break;
        case WorkflowOutputEvent output:
            return output.As&lt;List&lt;ChatMessage&gt;&gt;() ?? [];
    }
}</code></pre><p>One note here was that the <code>AgentResponseUpdateEvent</code> fires for each streamed token individually, so you might have to do some concatenation there.</p><p><strong>Docs:</strong></p><ul><li><a href="https://learn.microsoft.com/agent-framework/workflows/?WT.mc_id=AZ-MVP-5003781&amp;ref=huuhka.net" rel="noreferrer">MAF workflows</a></li><li><a href="https://learn.microsoft.com/semantic-kernel/frameworks/agent/agent-orchestration/?WT.mc_id=AZ-MVP-5003781&amp;ref=huuhka.net" rel="noreferrer">SK orchestration patterns</a></li></ul><h3 id="filters-vs-middleware">Filters vs middleware</h3><p>SK uses filter interfaces. In my current MAF demo, I have not yet wired dedicated middleware in the main app; I currently intercept behavior in the workflow event loop. So for my use case here (simply logging when tools are being called), the need for middleware was basically gone.</p><p><strong>SK filter registration + filter implementation:</strong></p><pre><code class="language-csharp">// Registration:
// https://github.com/DrBushyTop/MultiAgentSemanticKernel/blob/semanticKernel/Program.cs#L59-L61
kernelBuilder.Services.AddSingleton&lt;IFunctionInvocationFilter, ConsoleFunctionInvocationFilter&gt;();

// Filter:
// https://github.com/DrBushyTop/MultiAgentSemanticKernel/blob/semanticKernel/Runtime/ConsoleFunctionInvocationFilter.cs#L8-L31
public sealed class ConsoleFunctionInvocationFilter : IFunctionInvocationFilter
{
    public async Task OnFunctionInvocationAsync(FunctionInvocationContext context, Func&lt;FunctionInvocationContext, Task&gt; next)
    {
        var functionName = context.Function.Name;
        var pluginName = context.Function.PluginName;
        var caller = _id?.Name ?? &quot;Agent&quot;;

        _cli.ToolStart(caller, pluginName ?? &quot;&quot;, functionName);
        _log?.LogInformation(&quot;&#x1F527; {Plugin}.{Func} by {Agent}&quot;, pluginName, functionName, caller);

        await next(context);
    }
}</code></pre><p><strong>MAF equivalent in this app: event-based interception in the workflow runner:</strong></p><pre><code class="language-csharp">// https://github.com/DrBushyTop/MultiAgentSemanticKernel/blob/feat/agentFramework/Runtime/WorkflowRunner.cs#L38-L63
await foreach (WorkflowEvent evt in run.WatchStreamAsync(cancellationToken))
{
    switch (evt)
    {
        case AgentResponseUpdateEvent e:
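            // Fires once per streamed update; tool calls surface as FunctionCallContent items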
            if (e.Update.Contents.OfType&lt;FunctionCallContent&gt;().FirstOrDefault() is { } call)
            {
                cli.ToolStart(
                    e.ExecutorId,
                    call.Name,
                    call.Arguments?.ToDictionary(x =&gt; x.Key, x =&gt; x.Value?.ToString() ?? &quot;&quot;)
                    ?? new Dictionary&lt;string, string&gt;());
            }
            break;
    }
}</code></pre><p>SK has explicit invocation filters in use, while MAF currently uses workflow event interception. MAF middleware APIs are available, and I will likely move this interception logic into middleware in a future iteration.</p><p>As an extra reference, here is a clean middleware example from <a href="https://github.com/rwjdk?ref=huuhka.net">Rasmus Wulff Jensen&apos;s</a> <a href="https://github.com/rwjdk/MicrosoftAgentFrameworkSamples/blob/main/src/UsingRAGInAgentFramework/Program.cs?ref=huuhka.net#L132-L157">samples repo</a>:</p><pre><code class="language-csharp">AIAgent agentWithTools = client
    .GetChatClient(&quot;gpt-4.1&quot;)
    .AsAIAgent(
        instructions: &quot;You are an expert a set of made up movies given to you (aka don&apos;t consider movies from your world-knowledge)&quot;,
        tools: [AIFunctionFactory.Create(searchTool.SearchVectorStore)]
    ).AsBuilder()
    .Use(FunctionCallMiddleware)
    .Build();

async ValueTask&lt;object?&gt; FunctionCallMiddleware(
    AIAgent callingAgent,
    FunctionInvocationContext context,
    Func&lt;FunctionInvocationContext, CancellationToken, ValueTask&lt;object?&gt;&gt; next,
    CancellationToken cancellationToken)
{
    StringBuilder functionCallDetails = new();
    functionCallDetails.Append($&quot;- Tool Call: &apos;{context.Function.Name}&apos;&quot;);
    if (context.Arguments.Count &gt; 0)
    {
        functionCallDetails.Append($&quot; (Args: {string.Join(&quot;,&quot;, context.Arguments.Select(x =&gt; $&quot;[{x.Key} = {x.Value}]&quot;))}&quot;);
    }

    Utils.Gray(functionCallDetails.ToString());

    return await next(context, cancellationToken);
}</code></pre><p><strong>Docs:</strong></p><ul><li><a href="https://learn.microsoft.com/agent-framework/agents/middleware/defining-middleware/?WT.mc_id=AZ-MVP-5003781&amp;ref=huuhka.net">MAF middleware</a></li><li><a href="https://learn.microsoft.com/semantic-kernel/concepts/enterprise-readiness/filters/?WT.mc_id=AZ-MVP-5003781&amp;ref=huuhka.net">SK filters</a></li></ul><h3 id="still-a-few-rough-edges-in-the-prebuilt-workflows">Still a few rough edges in the prebuilt workflows</h3><p>This was my biggest practical friction point in the rewrite. <code>Handoff</code> did not really work as expected with Human In The Loop (HITL) interactions in 1.0.0-rc1, and I had to implement a custom loop around the handoff workflow to get user input and feed it back into the workflow. Also, <code>Magentic</code> support was completely missing from the C# version of the package.</p><p>My SK handoff flow uses <code>InteractiveCallback</code> directly on orchestration. In my MAF demo (<code>Microsoft.Agents.AI.Workflows</code> <code>1.0.0-rc1</code>), built-in handoff did not emit the <code>RequestInfoEvent</code> pattern I expected for a canonical HITL loop. Thus the callback never triggered the way it does in the SK version, and I had to implement a custom loop around the workflow execution to feed user input back into the workflow. It&apos;s a bit rough and I probably would not use it in a production scenario, but it works for demo purposes.</p><p><strong>Before (SK):</strong></p><pre><code class="language-csharp">// https://github.com/DrBushyTop/MultiAgentSemanticKernel/blob/semanticKernel/Runners/HandoffRunner.cs#L50-L68
var orchestration = new HandoffOrchestration(...)
{
    InteractiveCallback = () =&gt;
    {
        var input = responses.Count &gt; 0 ? responses.Dequeue() : &quot;No, bye&quot;;
        return ValueTask.FromResult(new ChatMessageContent(AuthorRole.User, input));
    }
};</code></pre><p><strong>After (MAF):</strong></p><pre><code class="language-csharp">// https://github.com/DrBushyTop/MultiAgentSemanticKernel/blob/feat/agentFramework/Runners/HandoffRunner.cs#L48-L66
while (true)
{
    var workflow = CreateHandoffWorkflow(); // recreated each turn
    var turnResults = await WorkflowRunner.ExecuteAsync(workflow, messages, cli);

    foreach (var msg in turnResults.Where(m =&gt; m.Role == ChatRole.Assistant))
    {
        messages.Add(new ChatMessage(ChatRole.Assistant, msg.Text!) { AuthorName = msg.AuthorName });
    }

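    // Demo-only stand-in for human input: queued simulated responses drive each turn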
    if (!_simulatedResponses.TryDequeue(out var userResponse))
        break;

    messages.Add(new ChatMessage(ChatRole.User, userResponse));
}</code></pre><p>Why this likely happens (at least in this RC shape):</p><ul><li><code>AgentWorkflowBuilder.CreateHandoffBuilderWith(...)</code> is built around tool-based handoff between agents.</li><li><code>RequestInfoEvent</code> is tied to external request/response ports (<code>RequestPort</code>) and request flow handling.</li><li>The built-in handoff builder path does not automatically model that external request boundary in the same way, so no natural user-input request event showed up for my scenario.</li></ul><p>In other words, HITL is absolutely possible in MAF workflows, but with handoff I would need a custom workflow that explicitly introduces request/response points where human input is required. That was out of scope for this demo app translation.</p><p><strong>Docs:</strong></p><ul><li><a href="https://learn.microsoft.com/semantic-kernel/frameworks/agent/agent-orchestration/handoff/?WT.mc_id=AZ-MVP-5003781&amp;ref=huuhka.net">SK handoff orchestration</a></li><li><a href="https://learn.microsoft.com/agent-framework/workflows/?WT.mc_id=AZ-MVP-5003781&amp;ref=huuhka.net">MAF workflows overview</a></li><li><a href="https://learn.microsoft.com/dotnet/api/microsoft.agents.ai.workflows.handoffsworkflowbuilder?view=agent-framework-dotnet-latest&amp;ref=huuhka.net"><code>HandoffsWorkflowBuilder</code> API</a></li><li><a href="https://learn.microsoft.com/dotnet/api/microsoft.agents.ai.workflows.requestinfoevent?view=agent-framework-dotnet-latest&amp;ref=huuhka.net"><code>RequestInfoEvent</code> API</a></li><li><a href="https://learn.microsoft.com/dotnet/api/microsoft.agents.ai.workflows.requestport?view=agent-framework-dotnet-latest&amp;ref=huuhka.net"><code>RequestPort</code> API</a></li><li><a href="https://github.com/rwjdk/MicrosoftAgentFrameworkSamples/blob/main/src/Workflow.Handoff/Program.cs?ref=huuhka.net#L24-L34">Rasmus Wulff Jensen&apos;s sample (handoff)</a></li><li><a href="https://github.com/rwjdk/MicrosoftAgentFrameworkSamples/blob/main/src/Workflow.HumanInTheLoop/Program.cs?ref=huuhka.net#L35-L49">Rasmus Wulff Jensen&apos;s sample (HITL via request port)</a></li></ul><h3 id="quick-note-on-custom-graph-workflows">Quick note on custom graph workflows</h3><p>These feel powerful, and likely where many production use cases will end up.</p><p>I only did a first pass here, but even that looked promising:</p><pre><code class="language-csharp">// https://github.com/DrBushyTop/MultiAgentSemanticKernel/blob/feat/agentFramework/Runners/GraphRunner.cs#L251-L264
var builder = new WorkflowBuilder(startExecutor);
builder.AddFanOutEdge(startExecutor, [qualityReviewer, securityReviewer]);
builder.AddFanInBarrierEdge([qualityReviewer, securityReviewer], combinerExecutor);
builder.AddEdge(combinerExecutor, reportGenerator);
builder.WithOutputFrom(reportGenerator);</code></pre><p>I will do a separate deep-dive post on graph workflows later as I gain more experience with using them in production applications. All in all, I still feel like the current LLM models are so powerful that it might not make much sense to make your code more complex with custom graph workflows unless you have a very specific need for it, but it&apos;s good to have the option there when you do.</p><p>I tend to think you&apos;ll get very far with building your Primary / Subagent flows like many coding agents do: Subagents exist to protect the context window of the Primary agent. <a href="https://www.huuhka.net/primary-vs-subagents-in-llm-harnesses/">More on that in my previous post here</a>.</p><h3 id="final-take">Final take</h3><p>All in all, MAF left a positive impression on me. The API is arguably more straightforward and easier to use than SK was (I had no experience with AutoGen), and I think it will be more approachable for new users. </p><p>I do still think Microsoft needs to make a somewhat clear value proposition on when to use MAF vs other competitors in the space like LangGraph, but I can see MAF being a strong choice for teams that are already invested in the Microsoft ecosystem and want a first-party solution with good integration with Azure AI services and C#.</p>]]></content:encoded></item><item><title><![CDATA[Preserving MCP session continuity with Redis]]></title><description><![CDATA[I ran into a fairly mundane MCP issue recently that only really shows up once the server stops living on one process forever: Sessions.]]></description><link>https://www.huuhka.net/preserving-mcp-session-continuity-with-redis/</link><guid isPermaLink="false">69e3c0326f3c900001bdaeb8</guid><category><![CDATA[AI]]></category><category><![CDATA[Azure]]></category><category><![CDATA[Developer Tools]]></category><category><![CDATA[OpenCode]]></category><category><![CDATA[GitHub Copilot]]></category><dc:creator><![CDATA[Pasi Huuhka]]></dc:creator><pubDate>Tue, 10 Feb 2026 18:44:00 GMT</pubDate><media:content url="https://huuhkadotnet.blob.core.windows.net/prod/images/2026/4/181749_kelly-sikkema-M6dAnUgiOlQ-unsplash.jpg" medium="image"/><content:encoded><![CDATA[<div class="kg-card kg-callout-card kg-callout-card-blue"><div class="kg-callout-emoji">&#x1F4A1;</div><div class="kg-callout-text">This post is part of a larger Secure Enterprise AI Tooling On Azure theme:<br>- <a href="https://www.huuhka.net/securing-remote-mcp-servers-with-entra-id-without-breaking-reconnects/" rel="noreferrer">Securing remote MCP servers with Entra ID without breaking reconnects</a><br>- <a href="https://www.huuhka.net/preserving-mcp-session-continuity-with-redis/" rel="noreferrer">Preserving MCP session continuity with Redis</a><br>- <a href="https://www.huuhka.net/shipping-signed-config-updates-to-local-ai-tooling/" rel="noreferrer">Shipping signed config updates to local AI tooling</a><br>- <a href="https://www.huuhka.net/designing-a-shared-opentelemetry-contract-for-ai-services-on-azure/" rel="noreferrer">Designing a shared OpenTelemetry contract for AI services on Azure</a></div></div><img src="https://huuhkadotnet.blob.core.windows.net/prod/images/2026/4/181749_kelly-sikkema-M6dAnUgiOlQ-unsplash.jpg" alt="Preserving MCP session continuity with Redis"><p>I ran into a fairly mundane MCP issue recently that only really shows up once the server stops living on one process forever.</p><p>The tool looked fine. The server looked fine. 
But existing client sessions could still lose continuity after a restart or rollout because the session state lived in memory. My LLM sessions in OpenCode and GitHub Copilot would just lose tools mid-conversation with no errors or warnings, which was not a great experience. Even forcing reconnects from plugins could not recover, so the only fix was restarting the process. I could handle that, but my users would not be thrilled about it.</p><p>The simple in-memory approach works right up until the process restarts, a new revision gets deployed, or traffic lands on another instance. The client still has the session ID and keeps using it. The new process no longer knows anything about that session. From the client&apos;s point of view the tool has just disappeared.</p><p>I ended up solving this by just adding a tiny Redis-based session store to preserve the critical continuity state across process boundaries. The shape of the solution ended up being pretty simple, and the infrastructure was refreshingly plain. I thought it might be worth sharing the details since this is a problem that other people are likely to run into as well.</p><h3 id="setup">Setup</h3><p>The setup here is fairly simple:</p><ul><li>The client establishes an MCP session and keeps using the returned session ID</li><li>The server framework stores session state in memory by default</li><li>After a restart, rollout, or replica change, that in-memory state is gone</li><li>The client is still behaving correctly, but the next process can no longer resolve the session</li></ul><p>That means the actual problem is not reconnecting the HTTP transport. It&apos;s preserving just enough session state outside process memory for the next instance to accept the session again.</p><p>Before getting into the Redis part, it&apos;s worth quickly looking at how MCP sessions work over HTTP, because that is really where the problem starts.</p><h3 id="mcp-sessions-over-http">MCP sessions over HTTP</h3><p>In MCP, the client and server begin with an initialization phase where they negotiate protocol version and capabilities. After that, they move into normal operation. In the Streamable HTTP transport, the server may also assign an <code>Mcp-Session-Id</code> header during initialization, and if it does, the client is expected to send that session ID on subsequent requests.</p><p>The official MCP transport spec is quite explicit here. Streamable HTTP sessions are optional, not mandatory. A server may assign a session ID during initialization, and if it does, later requests use that session ID. If the server later returns <code>404</code> for that session, the client is expected to reinitialize.</p><p>Relevant references:</p><ul><li><a href="https://modelcontextprotocol.io/specification/2025-11-25/basic/lifecycle?ref=huuhka.net">MCP Lifecycle</a></li><li><a href="https://modelcontextprotocol.io/specification/2025-11-25/basic/transports?ref=huuhka.net#session-management">MCP Streamable HTTP transport and session management</a></li></ul><p>That means there are really two broad shapes you can end up with on HTTP: a stateful server that keeps per-session state and expects requests to continue that session, or a stateless server that treats each request independently and avoids session tracking altogether.</p><h3 id="stateful-vs-stateless">Stateful vs stateless</h3><p>I built this server using the <a href="https://github.com/modelcontextprotocol/csharp-sdk?ref=huuhka.net" rel="noreferrer">Microsoft C# MCP SDK</a>. 
In that SDK, HTTP transport is stateful by default. The SDK docs are also explicit that <code>Stateless</code> defaults to <code>false</code>, and that enabling stateless mode stops using <code>MCP-Session-Id</code> and creates a fresh server context for each request.</p><p>Reference: <a href="https://csharp.sdk.modelcontextprotocol.io/api/ModelContextProtocol.AspNetCore.HttpServerTransportOptions.html?ref=huuhka.net">MCP C# SDK <code>HttpServerTransportOptions</code></a></p><p>That distinction matters quite a lot operationally. If you stay stateful, you get a more session-oriented model, but you also need to think about what happens when the process that originally held the session state disappears. If you go stateless, horizontal scaling and rollouts become much simpler, but you give up features that depend on durable server-side session state.</p><p>In my case, I went into the implementation a bit too quickly and accepted the default stateful model. Looking back, part of this specific continuity problem could probably have been avoided if I had first asked whether the server really needed to be stateful at all.</p><p>That is not to say the stateful route was wrong. It just means that Redis ended up solving a problem created partly by an earlier transport-mode choice.</p><h3 id="flow">Flow</h3><p>At a high level, the recovery path looks like this:</p><ol><li>The client initializes a session and receives a session ID.</li><li>The server stores the continuity-critical initialize payload in Redis under that session ID.</li><li>A later request arrives at a different process, or after a restart.</li><li>If the session is missing locally, the server looks up the stored initialization state and restores enough context to continue.</li></ol><h3 id="what-needs-to-survive">What needs to survive</h3><p>I think the most useful thing here is to keep the requirement narrow.<br>You usually don&apos;t need to make the whole server runtime durable. You just need to preserve enough state for the next instance to reconstruct the MCP session in a way the framework accepts.</p><p>In the implementation I looked at, the important part was the original initialize payload. That&apos;s what got written into the distributed cache against the session ID.</p><p>That felt like the right level of persistence. Small JSON payloads with a TTL, not some attempt to recreate arbitrary process memory after a crash.</p><p>Once you keep the scope that tight, the Redis part becomes very plain. On session initialization, serialize the initialize payload and write it to a namespaced cache key. On a migration attempt, look the session up by ID and hand the stored payload back to the framework.</p><h3 id="why-in-memory-sessions-are-not-enough">Why in-memory sessions are not enough</h3><p>This is obvious in hindsight, but it&apos;s easy not to care about until you see it happen.<br>An MCP client initializes a session and gets back a session ID. It keeps using that session ID for later requests. Then the server restarts. The client is still behaving perfectly reasonably, but the new process has no idea what that session ID means. If all session state is in memory, that behavior is expected.</p><p>You notice it more once there are rolling deployments, multiple replicas, or just longer-lived coding sessions that don&apos;t fit the &quot;connect, do one tiny thing, disconnect&quot; model. Sticky sessions can make it less frequent, but they don&apos;t solve deployments.
Once the old revision is gone, the in-memory session state is gone with it.</p><p>That&apos;s the point where a tiny distributed session store starts making sense.<br>Of course, the other valid conclusion is that if your server does not need stateful MCP sessions in the first place, stateless mode may be the better answer. Redis is useful here, but it is still compensating for a stateful design choice.</p><h3 id="the-solution">The solution</h3><p>The shape I like here is very small. Save the initialize payload on session creation. Store it under a service-specific prefix plus the session key. Give it a sliding expiration and a slightly longer absolute ceiling. When a request arrives with a session that&apos;s missing locally, ask Redis whether the migration state exists.</p><p>That can be expressed in a fairly compact helper:</p><pre><code class="language-ts">type SessionInitPayload = {
  protocolVersion: string;
  clientInfo: { name: string; version: string };
  capabilities?: Record&lt;string, unknown&gt;;
};

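// Minimal cache abstraction so the store does not depend on a specific Redis client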
interface CacheClient {
  set(key: string, value: string, ttlSeconds: number): Promise&lt;void&gt;;
  get(key: string): Promise&lt;string | null&gt;;
}

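// Persists only the continuity-critical initialize payload, namespaced per service and bounded by a TTL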
class SessionMigrationStore {
  constructor(
    private readonly cache: CacheClient,
    private readonly keyPrefix: string,
    private readonly ttlHours: number,
  ) {}

  async save(sessionId: string, payload: SessionInitPayload): Promise&lt;void&gt; {
    const ttlSeconds = Math.max(this.ttlHours, 1) * 60 * 60;
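    // Prefix the key so several MCP servers sharing the same Redis do not collide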
    const key = `${this.keyPrefix}session:${sessionId}`;
    await this.cache.set(key, JSON.stringify(payload), ttlSeconds);
  }

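  // A miss returns null so the caller can fall back to normal session-not-found handling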
  async restore(sessionId: string): Promise&lt;SessionInitPayload | null&gt; {
    const key = `${this.keyPrefix}session:${sessionId}`;
    const raw = await this.cache.get(key);
    return raw ? (JSON.parse(raw) as SessionInitPayload) : null;
  }
}</code></pre><p>There are only a couple of details there that I think really matter. One is namespacing. If several MCP servers share the same Redis, they shouldn&apos;t all write to the same naked <code>session:&lt;id&gt;</code> shape. The other is keeping the TTL model explicit. In the implementation here the session state used sliding expiration with a slightly longer absolute bound, which felt about right for development-oriented session continuity without pretending sessions should live forever.</p><h3 id="configuration-and-failure-behavior">Configuration and failure behavior</h3><p>I&apos;d definitely keep this behind configuration.</p><p>If the Redis connection string is present, enable the distributed session migration path. If it&apos;s not, run the normal in-memory mode and accept that sessions die on restart. That&apos;s a perfectly fine split between simpler environments and deployed environments that actually need continuity.</p><p>The failure behavior should stay straightforward too. If Redis doesn&apos;t have the session key anymore, the server shouldn&apos;t crash or wedge itself trying to be clever. It should log the miss, return the normal session-not-found behavior, and let the client reinitialize. That keeps session continuity as a best-effort resilience feature instead of making the whole service startup path depend on one cache lookup.</p><p>I think that distinction matters quite a lot. There&apos;s a difference between &quot;this service can preserve sessions across rollouts when the cache is available&quot; and &quot;this service cannot function unless the cache is healthy&quot;. I&apos;d aim for the first one.</p><h3 id="infrastructure">Infrastructure</h3><p>The infrastructure shape was about as small as I&apos;d want it to be. There&apos;s one small Redis instance with an LRU-style eviction policy. The connection string is stored in a secret store. The app reads that secret through managed identity. Local development can point at a local Redis if you want to exercise the feature outside Azure.</p><p>That felt nicely proportional to the problem. The state is tiny and short-lived, so the cache doesn&apos;t need to be fancy. It just needs to exist outside process memory.<br>The one extra nuance here is that if you also care about stream resumability, the session migration store is only half of the story. You need the relevant stream state outside process memory as well. The pattern is the same though. The point is still to move the continuity-critical state out of the lifetime of one server process.</p><h3 id="what-this-solves">What this solves</h3><p>What it solves is the very mundane operational pain of existing MCP sessions dying every time the server is redeployed or restarted. It also helps when traffic lands on a different replica than the one that originally saw the session.</p><p>What it doesn&apos;t solve is every other durability problem you could imagine. It doesn&apos;t make in-flight tool execution survive a hard process death. It doesn&apos;t make sessions immortal after TTL expiry or eviction. It doesn&apos;t smooth over auth changes that invalidate the restored caller context. And it definitely doesn&apos;t fix every reconnect quirk a client might have.</p><h3 id="wrap-up">Wrap up</h3><p>MCP sessions are stateful in a way that starts to matter operationally fairly quickly.</p><p>If you want existing sessions to survive rollouts, the server needs somewhere outside process memory to remember them. 
In my case, the amount of state that actually needed to survive was surprisingly small. That&apos;s why Redis ended up being a good fit: the problem was small, stateful, and short-lived in exactly the way Redis tends to handle well.</p><p>If I were adding a new internal MCP service today, I&apos;d treat this as a first-class production concern from the start instead of waiting until the first rollout teaches the lesson for me.<br></p>]]></content:encoded></item><item><title><![CDATA[Practical experiences with Azure APIM AI Gateway and imported Foundry endpoints]]></title><description><![CDATA[I've been building out AI platforms on Azure, and as part of that ended up spending a fair bit of time with both the newer AI Gateway story in API Management and the imported Microsoft Foundry endpoint flow.]]></description><link>https://www.huuhka.net/practical-experiences-with-azure-apim-ai-gateway-and-imported-foundry-endpoints/</link><guid isPermaLink="false">69ad62ab26d9b800016696f4</guid><category><![CDATA[AI]]></category><category><![CDATA[API Management]]></category><category><![CDATA[Bicep]]></category><category><![CDATA[Developer Tools]]></category><dc:creator><![CDATA[Pasi Huuhka]]></dc:creator><pubDate>Sun, 08 Feb 2026 12:20:00 GMT</pubDate><media:content url="https://huuhkadotnet.blob.core.windows.net/prod/images/2026/3/81155_81155_image.png" medium="image"/><content:encoded><![CDATA[<div class="kg-card kg-callout-card kg-callout-card-blue"><div class="kg-callout-emoji">&#x1F4A1;</div><div class="kg-callout-text">This post is a part of a larger AI Dev Platform theme:<br>- <a href="https://www.huuhka.net/ai-dev-platform-fundamentals/" rel="noreferrer">Azure AI Dev Platform Fundamentals</a><br>- <a href="https://www.huuhka.net/practical-experiences-with-azure-apim-ai-gateway-and-imported-foundry-endpoints/" rel="noreferrer">Practical experiences with Azure APIM AI Gateway and imported Foundry endpoints</a><br>- <a href="https://www.huuhka.net/designing-a-shared-opentelemetry-contract-for-ai-services-on-azure/" rel="noreferrer">Designing a shared OpenTelemetry contract for AI services on Azure</a><br>- <a href="https://www.huuhka.net/connecting-opencode-with-microsoft-foundry-models/" rel="noreferrer">Connecting OpenCode with Microsoft Foundry Models</a></div></div><img src="https://huuhkadotnet.blob.core.windows.net/prod/images/2026/3/81155_81155_image.png" alt="Practical experiences with Azure APIM AI Gateway and imported Foundry endpoints"><p><a href="https://www.huuhka.net/ai-dev-platform-fundamentals/" rel="noreferrer">I&apos;ve been building out AI platforms on Azure</a>, and as part of that ended up spending a fair bit of time with both the newer AI Gateway story in API Management and the imported Microsoft Foundry endpoint flow. On paper the split is fairly simple. You can either enable <a href="https://learn.microsoft.com/en-us/azure/foundry/configuration/enable-ai-api-management-gateway-portal?WT.mc_id=AZ-MVP-5003781&amp;ref=huuhka.net">AI Gateway directly in Foundry</a>, or you can <a href="https://learn.microsoft.com/en-us/azure/api-management/azure-ai-foundry-api?WT.mc_id=AZ-MVP-5003781&amp;ref=huuhka.net">import a Microsoft Foundry API into APIM</a> and manage it there.</p><p>In practice, I found myself needing both. 
This post will go through the reasons why, where I think the current split makes sense, and which parts of the setup still feel a bit awkward to me.</p><h3 id="platform-concerns">Platform concerns</h3><p>Once a few teams start using shared LLM capacity, governance stops being a boring afterthought.</p><p>The requirement is usually not &quot;block everything until architecture is perfect&quot;. It is more like this:</p><ul><li>usage should be fair</li><li>usage should be observable</li><li>usage should not let one team accidentally starve everyone else</li><li>prepaid capacity should be used well</li><li>overflow should still have somewhere to go</li></ul><p>That last one matters more than it first appears. If you have bought model capacity up front, you obviously want to get value out of it. At the same time, real workloads are rarely flat. If traffic spikes, you may still want to route excess usage somewhere else instead of just failing calls.</p><p><em>As a small sidenote though, at least here in Finland and at the scale most companies around me are operating, buying large chunks of prepaid capacity from Microsoft is still not that common. The cost is usually just too high compared to what they are actually getting out of the deal, so pay as you go is still the more realistic default for many teams. I still think the routing and governance model matters, but the &quot;protect the expensive prepaid capacity&quot; story is often more of a future-looking platform concern than today&apos;s norm.</em></p><p>APIM is not the whole answer to that, but to me it is still the most practical place to put the control logic. Microsoft calls out token governance, load balancing, semantic caching, content safety, and observability as part of the <a href="https://learn.microsoft.com/en-us/azure/api-management/genai-gateway-capabilities?WT.mc_id=AZ-MVP-5003781&amp;ref=huuhka.net">AI Gateway story for APIM</a> and that&apos;s roughly the same shopping list I tend to have anyway when building shared AI access layers.</p><h3 id="implementing-both-paths">Implementing both paths</h3><p>If you want the quick path, you can enable <a href="https://learn.microsoft.com/en-us/azure/foundry/configuration/enable-ai-api-management-gateway-portal?WT.mc_id=AZ-MVP-5003781&amp;ref=huuhka.net">AI Gateway directly in Foundry</a> and manage model limits there. If you want more control, you can <a href="https://learn.microsoft.com/en-us/azure/api-management/azure-ai-foundry-api?WT.mc_id=AZ-MVP-5003781&amp;ref=huuhka.net">import a Microsoft Foundry API into APIM</a> and use the broader APIM policy surface.</p><p>The two routes still don&apos;t line up feature for feature. In my setup, the Foundry-managed AI Gateway route didn&apos;t give me everything I wanted from the APIM side. The biggest gap was token visibility. I wanted APIM-side token metrics with my own dimensions and dashboarding model, and I also wanted an explicit OpenAI-compatible path for certain clients.</p><p>So I ended up exposing two API surfaces on the same APIM hostname:</p><ul><li>one more generic AI Gateway style path for Foundry traffic</li><li>one imported Foundry models path for the OpenAI-compatible endpoint</li></ul><p>It&apos;s not very elegant, rather it&apos;s more the current shape of the platform when you care about the operational details.</p><p>My guess is that this split shrinks over time and the capabilities converge. 
Right now though, I still had to think about which path gave me which capability, and before I actually dove into the documentation I assumed the AI Gateway alone would give me everything.<br><br>The practical shape was roughly this:</p><figure class="kg-card kg-image-card"><img src="https://huuhkadotnet.blob.core.windows.net/prod/images/2026/3/81155_image.png" class="kg-image" alt="Practical experiences with Azure APIM AI Gateway and imported Foundry endpoints" loading="lazy" width="1831" height="640"></figure><p>That probably looks slightly silly at first glance, but it let me keep one hostname and one platform entrypoint while still exposing two slightly different integration styles.</p><ol><li>I wanted the more direct AI Gateway style shape that maps well to the Foundry story.</li><li>I also wanted the imported endpoint shape where APIM policy behavior and metrics felt more explicit for my use case.</li></ol><p>If you are only doing one of those, this probably sounds more complex than necessary. That is fair. If your needs are simple, the Foundry-managed path is likely enough. But if you are building a shared platform and care about compatibility, control, and reporting, it gets easier to justify. </p><h3 id="when-i-would-use-ai-gateway-vs-imported-foundry-apis">When I would use AI Gateway vs imported Foundry APIs</h3><p>If I wanted to simplify the decision for myself today, I would phrase it like this.</p><p><strong>Use the Foundry-managed AI Gateway when...</strong></p><ul><li>project-level token limits are enough</li><li>you want the Foundry control-plane experience</li><li>you want to get going quickly and can live with preview-era boundaries</li></ul><p>That path is attractive precisely because it is less APIM-shaped. You stay in Foundry, wire the gateway up, enable projects, and move on. For some teams that is exactly the correct level of abstraction. The <a href="https://learn.microsoft.com/en-us/azure/foundry/configuration/enable-ai-api-management-gateway-portal?WT.mc_id=AZ-MVP-5003781&amp;ref=huuhka.net">Foundry AI Gateway docs</a> and the <a href="https://learn.microsoft.com/en-us/azure/foundry/control-plane/how-to-enforce-limits-models?WT.mc_id=AZ-MVP-5003781&amp;ref=huuhka.net">model token limit docs</a> describe that path fairly well.</p><p><strong>Use imported Foundry APIs in APIM when...</strong></p><ul><li>you need policy-level control</li><li>you want your own telemetry dimensions</li><li>you care about detailed monitoring and reporting</li><li>you need the endpoint shape to be predictable for existing clients</li><li>you already think of APIM as the platform entrypoint anyway</li></ul><p>You can use managed identity auth to the backend, attach <a href="https://learn.microsoft.com/en-us/azure/api-management/llm-token-limit-policy?WT.mc_id=AZ-MVP-5003781&amp;ref=huuhka.net">token limiting</a>, <a href="https://learn.microsoft.com/en-us/azure/api-management/llm-emit-token-metric-policy?WT.mc_id=AZ-MVP-5003781&amp;ref=huuhka.net">token metric emission</a>, <a href="https://learn.microsoft.com/en-us/azure/api-management/api-management-howto-llm-logs?WT.mc_id=AZ-MVP-5003781&amp;ref=huuhka.net">AI gateway logging</a>, and if you want to, <a href="https://learn.microsoft.com/en-us/azure/api-management/azure-openai-enable-semantic-caching?WT.mc_id=AZ-MVP-5003781&amp;ref=huuhka.net">semantic caching</a> too. Again, I don&apos;t think this split is permanent. 
I just think it still matters today.</p><h3 id="sidebar-why-i-keep-using-user-assigned-managed-identities-everywhere">Sidebar: why I keep using user-assigned managed identities everywhere</h3><p>This is not really a special AI Gateway decision for me. It&apos;s just something I do for basically all of my applications. Whenever possible, I prefer to separate identity and permissions from the infrastructure choice itself.</p><p>That means if I deploy APIM, a Function App, a Container App, or something else, I would usually rather attach a user-assigned managed identity than let the resource identity be completely implicit. Not because the default identity model is wrong, but because the user-assigned approach gives me free flexibility later. If I redeploy the infra, swap one hosting choice for another, split something up, or move a permission boundary, I can keep the identity and its RBAC assignments more stable. In practice that makes the permission story easier to reason about over time.</p><p>So in this case I did the same thing with APIM. Instead of treating the default identity behavior as part of the gateway feature itself, I attached a dedicated user-assigned identity and used that identity when calling the Foundry backend.</p><p>That looks roughly like this in a simplified form:</p><pre><code class="language-bicep">resource gatewayIdentity &apos;Microsoft.ManagedIdentity/userAssignedIdentities@2025-01-31-preview&apos; = {
  name: &apos;id-apim-shared-ai&apos;
  location: location
}

resource apiManagement &apos;Microsoft.ApiManagement/service@2025-03-01-preview&apos; = {
  name: &apos;apim-shared-ai&apos;
  location: location
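  // Attaching a user-assigned identity keeps the identity and its RBAC assignments stable across redeployments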
  identity: {
    type: &apos;UserAssigned&apos;
    userAssignedIdentities: {
      &apos;${gatewayIdentity.id}&apos;: {}
    }
  }
  // sku is required on the APIM resource; pick whichever tier your environment actually uses
  sku: {
    name: &apos;StandardV2&apos;
    capacity: 1
  }
  properties: {
    publisherName: &apos;Shared AI Gateway&apos;
    publisherEmail: &apos;platform@example.net&apos;
  }
}</code></pre><p>And the APIM backend auth shape is similarly straightforward:</p><pre><code class="language-xml">&lt;inbound&gt;
  &lt;set-backend-service backend-id=&quot;foundry-backend&quot; /&gt;
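  &lt;!-- Acquire the backend token with the user-assigned identity attached to APIM (client-id below) --&gt;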
  &lt;authentication-managed-identity
      resource=&quot;https://ai.azure.com/&quot;
      client-id=&quot;11111111-2222-3333-4444-555555555555&quot; /&gt;
  &lt;base /&gt;
&lt;/inbound&gt;</code></pre><p>For the OpenAI-compatible style endpoints, the target resource would typically be <code>https://cognitiveservices.azure.com/</code> instead. The relevant docs here are <a href="https://learn.microsoft.com/en-us/azure/api-management/api-management-authenticate-authorize-ai-apis?WT.mc_id=AZ-MVP-5003781&amp;ref=huuhka.net">auth for AI APIs in APIM</a> and <a href="https://learn.microsoft.com/en-us/azure/api-management/backends?ref=huuhka.net#configure-managed-identity-for-authorization-credentials?WT.mc_id=AZ-MVP-5003781">managed identity configuration for APIM backends</a>.</p><p>RBAC-wise, the important part was simply granting that identity the backend access it needed. For my case that meant the <code>Cognitive Services User</code> role on the Foundry account, and as far as I remember on the project as well.</p><div class="kg-card kg-callout-card kg-callout-card-blue"><div class="kg-callout-emoji">&#x26A0;&#xFE0F;</div><div class="kg-callout-text">This can also lead to issues in some cases. For example, it turned out the &quot;New Foundry&quot; standard agent setup did not work at all until I switched back to the system-assigned managed identity. </div></div><h3 id="observability-payoff-with-apim">Observability payoff with APIM</h3><p>I wanted to get some telemetry on usage, which APIM thankfully provided:</p><ul><li>one place to enrich requests with stable platform metadata</li><li>one place to emit token metrics</li><li>one place to send gateway and LLM logs onward</li><li>dashboards that are useful to platform owners, not just to whoever happens to be staring at a single model deployment</li></ul><p>I had the gateway add a few platform-specific headers and baggage values, and I also hashed the incoming user object id before using it as a metrics dimension. That gave me a reasonably privacy-safe way to answer questions like &quot;who is using the platform&quot;, &quot;which tools are hot&quot;, and &quot;which traffic is burning the most tokens&quot; without spraying raw identifiers around.</p><p>The token metrics part was powered by <a href="https://learn.microsoft.com/en-us/azure/api-management/llm-emit-token-metric-policy?WT.mc_id=AZ-MVP-5003781&amp;ref=huuhka.net">llm-emit-token-metric</a>. A simplified policy example looks like this:</p><pre><code class="language-xml">&lt;llm-emit-token-metric namespace=&quot;shared-ai&quot;&gt;
  &lt;dimension name=&quot;API ID&quot; /&gt;
  &lt;dimension name=&quot;Operation ID&quot; /&gt;
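  &lt;!-- x-user-id-hash carries a hashed user object id, so raw identifiers never end up as a metric dimension --&gt;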
  &lt;dimension name=&quot;user_hash&quot; value=&quot;@(context.Request.Headers.GetValueOrDefault(&amp;quot;x-user-id-hash&amp;quot;, &amp;quot;&amp;quot;))&quot; /&gt;
  &lt;dimension name=&quot;deployment&quot; value=&quot;@((string)context.Request.MatchedParameters[&amp;quot;deployment-id&amp;quot;])&quot; /&gt;
&lt;/llm-emit-token-metric&gt;</code></pre><p>The flow of telemetry looks pretty much like this:</p><figure class="kg-card kg-image-card"><img src="https://huuhkadotnet.blob.core.windows.net/prod/images/2026/3/8126_image.png" class="kg-image" alt="Practical experiences with Azure APIM AI Gateway and imported Foundry endpoints" loading="lazy" width="1230" height="680"></figure><p>The logging side is covered in the <a href="https://learn.microsoft.com/en-us/azure/api-management/api-management-howto-llm-logs?WT.mc_id=AZ-MVP-5003781&amp;ref=huuhka.net">LLM logging docs</a>, and the dedicated log table reference is <a href="https://learn.microsoft.com/en-us/azure/azure-monitor/reference/tables/apimanagementgatewayllmlog?WT.mc_id=AZ-MVP-5003781&amp;ref=huuhka.net">ApiManagementGatewayLlmLog</a>.</p><p>A KQL example for a dashboard could look like this:</p><pre><code class="language-kusto">AppMetrics
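// Custom metrics emitted by llm-emit-token-metric land in AppMetrics when routed to Application Insights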
| where TimeGenerated &gt; ago(24h)
| where AppRoleName has &quot;apim-shared-ai&quot;
| where Name == &quot;Total Tokens&quot;
| extend dims = todynamic(Properties)
| extend user_hash = tostring(dims[&quot;user_hash&quot;])
| where isnotempty(user_hash)
| summarize total_tokens = sum(todouble(Sum)) by user_hash
| top 20 by total_tokens desc</code></pre><h3 id="semantic-caching">Semantic Caching</h3><p>I have not yet found a compelling use case for APIM semantic caching in my workloads, so I have not enabled it. The main draw of the feature is that it can automatically cache semantically similar requests and responses.</p><p>Most of my LLM workload is either coding-oriented or tied to RAG-style scenarios. In both cases, two prompts that look quite similar can still reasonably require very different outputs. Because of that, I have not yet felt confident that semantic caching would do more good than harm in my specific scenario.</p><p>There is also the operational side. If I want APIM semantic caching, I need an external Redis-compatible cache with the right capabilities. The <a href="https://learn.microsoft.com/en-us/azure/api-management/azure-openai-enable-semantic-caching?WT.mc_id=AZ-MVP-5003781&amp;ref=huuhka.net">semantic caching docs</a> call out Azure Managed Redis and RediSearch requirements. Skipping this lets me avoid an extra moving part.</p><p>So for now I am mostly depending on the provider or model-side caching behavior where it exists, and leaving APIM semantic caching out of the picture. That might change later. Right now it just does not feel like the correct optimization target for my workloads.</p><h3 id="the-weird-100k-free-requests-bootstrap-story">The weird 100k free requests bootstrap story</h3><p>One small thing that felt oddly fuzzy to me was the messaging around the free requests benefit when creating AI Gateway through Foundry. <a href="https://azure.microsoft.com/en-us/pricing/details/api-management/?ref=huuhka.net" rel="noreferrer">The pricing page </a>just has a single * row you need to search for.</p><p>The docs point out that AI Gateway includes a free tier and refer to pricing details from the <a href="https://learn.microsoft.com/en-us/azure/foundry/configuration/enable-ai-api-management-gateway-portal?WT.mc_id=AZ-MVP-5003781&amp;ref=huuhka.net">Foundry AI Gateway setup page</a>. At the time I was doing this, the portal and docs wording around the first 100k requests sounded nice, but I never found a verification path to see that I&apos;m actually getting them (and what determines it).</p><figure class="kg-card kg-image-card"><img src="https://huuhkadotnet.blob.core.windows.net/prod/images/2026/3/81212_image.png" class="kg-image" alt="Practical experiences with Azure APIM AI Gateway and imported Foundry endpoints" loading="lazy" width="990" height="794"></figure><figure class="kg-card kg-image-card"><img src="https://huuhkadotnet.blob.core.windows.net/prod/images/2026/3/81215_Screenshot%202026-03-08%20at%2012.36.55.png" class="kg-image" alt="Practical experiences with Azure APIM AI Gateway and imported Foundry endpoints" loading="lazy" width="990" height="794"></figure><p>That led me to a slightly dumb workaround. I ended up doing a three-phase deployment:</p><ol><li>deploy the rest of the platform without APIM management</li><li>create the AI Gateway / APIM association from the Foundry portal first</li><li>then let my Bicep adopt and configure the APIM side afterward</li></ol><p>Does that sound a bit silly? Yes. 
Thankfully we only need to do this once, so I can live with it.</p><h3 id="links-references">Links / references</h3><ul><li><a href="https://learn.microsoft.com/en-us/azure/api-management/genai-gateway-capabilities?WT.mc_id=AZ-MVP-5003781&amp;ref=huuhka.net">AI gateway in Azure API Management</a></li><li><a href="https://learn.microsoft.com/en-us/azure/foundry/configuration/enable-ai-api-management-gateway-portal?WT.mc_id=AZ-MVP-5003781&amp;ref=huuhka.net">Configure AI Gateway in your Foundry resources</a></li><li><a href="https://learn.microsoft.com/en-us/azure/api-management/azure-ai-foundry-api?WT.mc_id=AZ-MVP-5003781&amp;ref=huuhka.net">Import a Microsoft Foundry API</a></li><li><a href="https://learn.microsoft.com/en-us/azure/api-management/api-management-authenticate-authorize-ai-apis?WT.mc_id=AZ-MVP-5003781&amp;ref=huuhka.net">Authenticate and authorize access to LLM APIs by using Azure API Management</a></li><li><a href="https://learn.microsoft.com/en-us/azure/api-management/backends?WT.mc_id=AZ-MVP-5003781&amp;ref=huuhka.net">Backends in API Management</a></li><li><a href="https://learn.microsoft.com/en-us/azure/api-management/llm-token-limit-policy?WT.mc_id=AZ-MVP-5003781&amp;ref=huuhka.net">Limit large language model API token usage</a></li><li><a href="https://learn.microsoft.com/en-us/azure/api-management/llm-emit-token-metric-policy?WT.mc_id=AZ-MVP-5003781&amp;ref=huuhka.net">Emit metrics for consumption of large language model tokens</a></li><li><a href="https://learn.microsoft.com/en-us/azure/api-management/api-management-howto-llm-logs?WT.mc_id=AZ-MVP-5003781&amp;ref=huuhka.net">Log token usage, prompts, and completions for LLM APIs</a></li><li><a href="https://learn.microsoft.com/en-us/azure/api-management/azure-openai-enable-semantic-caching?WT.mc_id=AZ-MVP-5003781&amp;ref=huuhka.net">Enable semantic caching for LLM APIs in Azure API Management</a></li><li><a href="https://learn.microsoft.com/en-us/azure/foundry/control-plane/how-to-enforce-limits-models?WT.mc_id=AZ-MVP-5003781&amp;ref=huuhka.net">Enforce token limits for models</a><br></li></ul>]]></content:encoded></item><item><title><![CDATA[Shipping signed config updates to local AI tooling]]></title><description><![CDATA[Adventures in tool distribution]]></description><link>https://www.huuhka.net/shipping-signed-config-updates-to-local-ai-tooling/</link><guid isPermaLink="false">69de7a97b92fba00013b6255</guid><category><![CDATA[AI]]></category><category><![CDATA[Developer Tools]]></category><category><![CDATA[DevOps]]></category><category><![CDATA[Security]]></category><dc:creator><![CDATA[Pasi Huuhka]]></dc:creator><pubDate>Wed, 28 Jan 2026 18:39:00 GMT</pubDate><media:content url="https://huuhkadotnet.blob.core.windows.net/prod/images/2026/4/141741_flyd-BH0Wwlmv2oA-unsplash.jpg" medium="image"/><content:encoded><![CDATA[<div class="kg-card kg-callout-card kg-callout-card-blue"><div class="kg-callout-emoji">&#x1F4A1;</div><div class="kg-callout-text">This post is part of a larger Secure Enterprise AI Tooling On Azure theme:<br>- <a href="https://www.huuhka.net/securing-remote-mcp-servers-with-entra-id-without-breaking-reconnects/" rel="noreferrer">Securing remote MCP servers with Entra ID without breaking reconnects</a><br>- <a href="https://www.huuhka.net/preserving-mcp-session-continuity-with-redis/" rel="noreferrer">Preserving MCP session continuity with Redis</a><br>- <a href="https://www.huuhka.net/shipping-signed-config-updates-to-local-ai-tooling/" rel="noreferrer">Shipping signed config updates to local AI 
tooling</a><br>- <a href="https://www.huuhka.net/designing-a-shared-opentelemetry-contract-for-ai-services-on-azure/" rel="noreferrer">Designing a shared OpenTelemetry contract for AI services on Azure</a></div></div><img src="https://huuhkadotnet.blob.core.windows.net/prod/images/2026/4/141741_flyd-BH0Wwlmv2oA-unsplash.jpg" alt="Shipping signed config updates to local AI tooling"><p>I&apos;ve been building local AI tooling setups where the interesting part is not only the binary. A big part of the product is the surrounding config: agents, plugins, MCP wiring, commands, auth glue, and the small bits of structure that make the local tool actually useful inside one organization.</p><p>At some point you usually want a way to update that remotely. Especially when the number of users grows, manual updates become unfeasible. You want to be able to ship new commands, tweak agent behavior, add new plugins, and so on without having to ask everybody to download a new version of the tool. The users might also just be non-technical, and we want to make things as seamless for them as possible.</p><p>If a remote service can tell a local coding tool to replace parts of its config, then  that update channel is part of the workstation trust boundary. That can be extremely useful. It can also become a very self-inflicted security issue if the model is sloppy.</p><h3 id="the-risk-here">The risk here</h3><p>If the update can change plugin code, command files, MCP settings, auth behavior, or the set of managed local files, then the update plane is effectively allowed to change how the tool behaves. That&apos;s close enough to code distribution that I don&apos;t think it should be treated casually. Add to this the fact that the tool is running locally and has access to the user&apos;s files, and you have a recipe for a potential disaster if the update path is compromised.</p><p>I also didn&apos;t want the local tool to become fragile. If the update API is down, if auth has expired, or if verification fails, developers should still be able to open the tool and work.</p><p>That gave me two requirements that sound contradictory until you spell them out. The update path should fail closed. Startup should fail open. You can reject an update aggressively without rejecting the whole application startup.</p><h3 id="so-why-isnt-a-hash-enough">So why isn&apos;t a hash enough?</h3><p>The first instinct here is usually to hash the ZIP and call it done. That helps, but only partly. If the same actor can tamper with both the bundle and the metadata that tells the client which bundle to download, plain integrity checking doesn&apos;t really solve the trust problem. The attacker just changes the ZIP and the hash together. </p><p>That&apos;s why I ended up with two different checks doing two different jobs. The manifest signature answers whether the update description came from a trusted publisher. The ZIP hash answers whether the downloaded bytes match the signed manifest. That split felt right: Trust the publisher first, then trust the bytes.</p><p>I also found it useful to sign a small canonical payload instead of every cosmetic field in the manifest. The important fields were the channel, version, ZIP path, timestamp and SHA-256. That kept the signed surface focused on the security-relevant identity of the release rather than whatever extra metadata I might want to add later.</p><h3 id="the-azure-implementation">The Azure implementation</h3><p>The service side was quite simple. 
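</p><p>The two checks from the previous section can be sketched roughly like this. The field and function names are my own illustration, not the actual schema:</p><pre><code class="language-ts">import { createHash, createPublicKey, verify } from &quot;node:crypto&quot;;

// Only the security-relevant identity of a release gets signed (illustrative field names).
type CanonicalManifestPayload = {
  channel: string;
  version: string;
  zipPath: string;
  timestamp: string;
  zipSha256: string;
};

// Publisher check: did this manifest description come from a trusted signer?
export function verifyManifestPayload(
  payload: CanonicalManifestPayload,
  signatureBase64: string,
  publicKeyPem: string,
): boolean {
  // Serialize the fields in a fixed order so publisher and client sign/verify identical bytes.
  const canonical = JSON.stringify({
    channel: payload.channel,
    version: payload.version,
    zipPath: payload.zipPath,
    timestamp: payload.timestamp,
    zipSha256: payload.zipSha256,
  });
  // For Ed25519 keys, Node crypto.verify takes null as the algorithm argument.
  return verify(
    null,
    Buffer.from(canonical, &quot;utf8&quot;),
    createPublicKey(publicKeyPem),
    Buffer.from(signatureBase64, &quot;base64&quot;),
  );
}

// Bytes check: does the downloaded ZIP match what was signed?
export function verifyZipHash(payload: CanonicalManifestPayload, zipBytes: Buffer): boolean {
  return createHash(&quot;sha256&quot;).update(zipBytes).digest(&quot;hex&quot;) === payload.zipSha256;
}</code></pre><p>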
There was one endpoint that returned the latest manifest for a channel and another that returned the ZIP bytes for a specific version. The manifests and bundles lived in blob storage. The signing key stayed server-side. The client carried only a small trusted map of public keys keyed by <code>keyId</code>.</p><p>That gave the update cycle a fairly clean shape: The client asks for the latest manifest. It verifies the manifest signature. If the version is newer, it asks for the exact ZIP for that version. Then it hashes the ZIP, compares it to the signed manifest, stages the update locally, and only after that applies it.</p><p>The publisher side was similarly straightforward. Build the config bundle, hash it, construct the canonical payload, sign it with Ed25519, write out the version-specific manifest, and then update the channel-level latest manifest.</p><h3 id="key-rotation">Key rotation</h3><p>I didn&apos;t want signing key rotation to become a mini-incident. So the client trusts a very small set of public keys and the manifest includes the `keyId` it was signed with.</p><p>That gives a practical rotation story. Ship client support for the new public key first. Then switch the publisher to sign with the new private key. Keep the old key around for a transition period. Remove it later. I think this is one of those areas where simpler is better. You don&apos;t need a giant PKI story for this kind of internal updater. You need one trustworthy signing path, a clean verification path, and a rotation model that people will actually be willing to execute during normal delivery work.</p><p>The validation side should be strict though. Unknown key ID should fail. Invalid signature should fail. ZIP hash mismatch should fail. I wouldn&apos;t add any kind of &quot;probably fine&quot; behavior there.</p><p>For a tiny self-contained version of the flow, this is the heart of it:</p><pre><code class="language-js">const trustedPublicKeys = {
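  // keyId -&gt; public key; the signed manifest says which keyId was used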
  k1: publicKey1,
  k2: publicKey2,
};

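// Order matters: first check the publisher (Ed25519 signature over the canonical payload),
// then check the bytes (SHA-256 of the downloaded ZIP against the signed manifest).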
verifyManifestSignature(manifest, trustedPublicKeys);
verifyDownloadedBundle(manifest, zipBytes);</code></pre><p>Again, the important thing is the order.</p><h3 id="the-local-apply-flow">The local apply flow</h3><p>It&apos;s very easy to overfocus on signatures and underfocus on what happens on disk.</p><p>For me, the updater needed to be simple locally too. It should create a backup first, extract into a staging directory, validate the extracted paths, apply the managed updates, write the new version marker, and clean up afterwards. If something goes wrong in the middle, it should restore from the backup instead of leaving the user in some half-updated state.</p><p>The other thing that mattered was acknowledging that not every file should be handled the same way. Some paths are fully managed. Some need merge behavior. Some user-defined content should survive an update. If you ignore that, people eventually stop trusting the updater and start working around it. We also clearly mark the managed paths in the file system and in the config structure so that it&apos;s obvious what&apos;s up for grabs and what&apos;s not. Users want to customize their tooling, and we should not get in the way of that.</p><h3 id="update-failures-vs-startup">Update failures vs startup</h3><p>This was probably the most important product decision in the whole thing. The updater runs during startup, but it shouldn&apos;t own startup. If the update API is unavailable, auth has expired, blob storage has a bad day, or verification fails, the safe behavior is to log it, skip the update, and keep starting the tool. </p><p>I&apos;d much rather have somebody on yesterday&apos;s known-good config than locked out of the tool entirely because the update plane is having a rough morning. That&apos;s why the split needs to be very explicit. Update failures are hard failures for the update path. They&apos;re not hard failures for local startup. Thankfully a simple design leads to simple error handling. </p><p>From the user&apos;s side, we get one background check, one token refresh if it&apos;s warranted, and then move on. This shouldn&apos;t turn into a startup spinner that makes people wonder whether the tool has hung.</p><h3 id="what-worked-well">What worked well</h3><p>The main thing I liked was that it stayed small: </p><p>There&apos;s a metadata endpoint. There&apos;s a bundle endpoint. The signed part of the manifest is intentionally minimal. The client verifies signature first and hash second. The updater stages locally and backs up before apply. And none of the failure modes are allowed to take the whole tool down. That&apos;s a fairly modest amount of machinery for something that&apos;s operating on a sensitive boundary.</p><p>If I kept pushing on simplification, it would mostly be in the apply model. It&apos;s easy to accidentally build a tiny package manager here, and I don&apos;t think that&apos;s the right goal. The updater should be strict and trustworthy, not ambitious.</p><h3 id="wrap-up">Wrap up</h3><p>The main lesson for me was that shipping remote config to local AI tooling is close enough to shipping code that I think it deserves the same seriousness. That doesn&apos;t mean the implementation needs to be huge. 
In my case it was actually fairly small: signed manifests, versioned bundles, a couple of endpoints, staged local apply, and clear failure boundaries.</p>]]></content:encoded></item><item><title><![CDATA[Browser verification for coding agents: Chrome DevTools MCP vs agent-browser]]></title><description><![CDATA[This post is about browser feedback for coding agents during normal development work, not about fully autonomous browser agents or end-to-end testing as such.]]></description><link>https://www.huuhka.net/browser-verification-for-coding-agents-chrome-devtools-mcp-vs-agent-browser/</link><guid isPermaLink="false">69d7ca2c313b720001df3d9e</guid><category><![CDATA[Artificial Intelligence]]></category><category><![CDATA[GitHub Copilot]]></category><category><![CDATA[OpenCode]]></category><category><![CDATA[Developer Tools]]></category><dc:creator><![CDATA[Pasi Huuhka]]></dc:creator><pubDate>Wed, 28 Jan 2026 17:13:00 GMT</pubDate><media:content url="https://huuhkadotnet.blob.core.windows.net/prod/images/2026/4/91615_Browser%20verification%20showdown_%20Chrome%20vs%20Agent.png" medium="image"/><content:encoded><![CDATA[<img src="https://huuhkadotnet.blob.core.windows.net/prod/images/2026/4/91615_Browser%20verification%20showdown_%20Chrome%20vs%20Agent.png" alt="Browser verification for coding agents: Chrome DevTools MCP vs agent-browser"><p>I&apos;ve been treating browser-side verification as a standard part of the implementation loop when working with coding agents for a while now.</p><p>Frontend work is still one of the places where current models make a steady stream of mistakes. CSS is often wrong in small but obvious ways, interaction states get missed, responsive behavior regresses, and the model will happily tell you everything is fine unless you give it some way to actually look at the result.</p><p>This post is about browser feedback for coding agents during normal development work, not about fully autonomous browser agents or end-to-end testing as such.</p><p>None of this is really tied to one harness either. The same general ideas work with basically any coding agent setup that can expose MCP servers or CLI tools, whether that is <a href="https://opencode.ai/?ref=huuhka.net">OpenCode</a>, <a href="https://github.com/features/copilot?ref=huuhka.net">GitHub Copilot</a>, <a href="https://code.claude.com/docs/en/overview?ref=huuhka.net">Claude Code</a>, <a href="https://developers.openai.com/codex?ref=huuhka.net">Codex</a> or something else.</p><p>The two tools I&apos;ve been using most are <a href="https://github.com/ChromeDevTools/chrome-devtools-mcp?ref=huuhka.net">Chrome DevTools MCP</a> and <a href="https://github.com/vercel-labs/agent-browser?ref=huuhka.net">agent-browser</a>.</p><h3 id="the-short-version">The short version</h3><ul><li>I use both, but at the time of writing I still reach for Chrome DevTools MCP more.</li><li><strong>Model familiarity:</strong> current models seem to understand the Chrome DevTools MCP tool surface better than the agent-browser CLI.</li><li><strong>Session isolation:</strong> agent-browser feels more naturally aligned with isolated per-agent work, because it is a CLI with explicit session handling. 
Chrome DevTools MCP has improved here too with named isolated contexts, so the difference is more about workflow shape than absence of support.</li><li><strong>Debugging depth:</strong> Chrome DevTools MCP gives a richer debugging surface, especially for console, network, performance and general inspection.</li><li><strong>Delivery model:</strong> in <a href="https://opencode.ai/?ref=huuhka.net">OpenCode</a>, MCP is always present in the model context, while the <a href="https://github.com/vercel-labs/agent-browser/tree/main/skills/agent-browser?ref=huuhka.net">agent-browser skill</a> can be loaded only when needed.</li><li><strong>Parallel work friction:</strong> multiple agents trying to use Chrome DevTools MCP in parallel can easily end up fighting for browser control.</li><li>These tools are very useful because current models are still pretty bad at frontend correctness if you only let them reason from code.</li></ul><h3 id="why-i-keep-adding-browser-verification-into-the-loop">Why I keep adding browser verification into the loop</h3><p>I&apos;ve written earlier about <a href="https://www.huuhka.net/research-plan-implement/">Research - Plan - Implement</a>, <a href="https://www.huuhka.net/primary-vs-subagents-in-llm-harnesses/">Primary vs Subagents in LLM harnesses</a> and <a href="https://www.huuhka.net/a-mental-model-for-llm-tooling-primitives/">A mental model for LLM tooling primitives</a>.</p><p>The browser tooling question sits underneath all of those.</p><p>If I have an implementation agent that can edit code, run tests and lint, that is already useful. But especially for UI work, there is still a pretty big gap between &quot;the code compiles&quot; and &quot;the feature is actually correct&quot;.</p><p>That gap is exactly where browser tools help. They let the model inspect what really rendered, what requests fired, what errors showed up in the console, whether the element is actually visible, and whether the flow works beyond static code review. In practice that gives a much better feedback loop than just asking the model to look at JSX or CSS and hope for the best.</p><h3 id="two-different-design-bets">Two different design bets</h3><p>These two tools are aimed at slightly different shapes of work.<br><a href="https://github.com/vercel-labs/agent-browser?ref=huuhka.net">agent-browser</a> is a CLI-first browser automation tool. The workflow is command-driven, and in coding harnesses you can expose it through the <a href="https://github.com/vercel-labs/agent-browser/tree/main/skills/agent-browser?ref=huuhka.net">agent-browser skill</a> when actually needed.</p><p><a href="https://github.com/ChromeDevTools/chrome-devtools-mcp?ref=huuhka.net">Chrome DevTools MCP</a> is an MCP server. The browser capabilities are available as a tool surface directly through the harness.</p><p>That sounds like a small implementation detail, but it really does affect how the tools feel in daily use. With agent-browser, the capability is more opt-in. With Chrome DevTools MCP, the capability is more ambient. 
That has obvious pros and cons.</p><h3 id="chrome-devtools-mcp-rich-debugging-awkward-sharing">Chrome DevTools MCP: rich debugging, awkward sharing</h3><p>The biggest reason I still reach for Chrome DevTools MCP more often is simple: it gives a very strong debugging surface.</p><p>You get browser automation, but also:</p><ul><li>Console inspection</li><li>Network inspection</li><li>Screenshots and snapshots</li><li>Performance tracing</li><li>Lighthouse audits</li><li>Memory snapshots</li></ul><p>That makes it more than a &quot;click around in the page&quot; tool. It is closer to handing the model a browser plus a chunk of DevTools itself.</p><p>Current models also seem to know how to use this MCP surface better than they know how to use agent-browser. That is not a scientific benchmark, just my practical impression after using both. The models seem more ready to do reasonable things with page selection, snapshots, console logs and network requests than they are to drive a CLI workflow correctly from scratch.</p><p>The downside has mostly shown up when I try to push more parallel agent work through it: agents are not particularly good at sharing Chrome DevTools MCP sanely across concurrent work. If I have multiple agents or parallel workstreams trying to use the same browser tooling, it starts feeling like a tug of war over the active page or session. That may partly be a prompting issue on my side, but in practice it has meant that browser verification works better when I centralize it to one orchestrator or one review pass instead of letting every parallel worker poke at the same browser.</p><p>So:</p><ul><li>Chrome DevTools MCP is very strong for <strong>one active agent</strong> doing deep verification and debugging.</li><li>It is less comfortable as a <strong>shared browser layer</strong> for multiple concurrent agents.</li></ul><blockquote><strong>Update 9.4.2026:</strong> I went back and checked this more carefully. <a href="https://github.com/ChromeDevTools/chrome-devtools-mcp?ref=huuhka.net">Chrome DevTools MCP</a> added storage-isolated browser contexts in <a href="https://github.com/ChromeDevTools/chrome-devtools-mcp/releases/tag/chrome-devtools-mcp-v0.18.0?ref=huuhka.net">v0.18.0</a> via <code>isolatedContext</code> on <code>new_page</code>, and then added page routing for parallel multi-agent workflows in <a href="https://github.com/ChromeDevTools/chrome-devtools-mcp/releases/tag/chrome-devtools-mcp-v0.19.0?ref=huuhka.net">v0.19.0</a>. So the multi-agent story is better than I originally thought, though in practice it still depends on what your harness actually exposes and how well the model uses it.</blockquote><p>So this is not really a case of agent-browser having sessions and Chrome DevTools MCP not having them. Chrome DevTools MCP does now have named isolated contexts. The difference is more that agent-browser has a more explicit workflow around sessions, saved state, auth reuse and diffs, whereas Chrome DevTools MCP is stronger as a live inspection and diagnostics surface.</p><h3 id="agent-browser-explicit-control-natural-session-separation">agent-browser: explicit control, natural session separation</h3><p>What I like about agent-browser is that the whole thing feels more explicit.<br>It is a CLI tool with concrete commands, explicit sessions, state save/load, snapshots, screenshots, console inspection, request tracking, diffing and a few other useful pieces. 
It also has a clear <a href="https://github.com/vercel-labs/agent-browser/tree/main/skills/agent-browser?ref=huuhka.net">skill package</a> that teaches the model the recommended workflow when needed.</p><p>The on-demand skill aspect is worth highlighting.</p><p>One of the common problems with MCP servers in general is that they can take a fair amount of context all the time, whether or not the task really needs them. Skills are a more selective mechanism. The model only expands the instructions when it actually decides it needs that capability. The deeper difference is less &quot;skill vs MCP&quot; and more ambient tool surface versus explicit workflow tool.</p><p>The other thing I like is that agent-browser feels more naturally suited to isolated sessions. That makes it probably the better shape for future parallel verification flows where individual agents verify their own work without all trying to grab the same browser handle.</p><p>It also feels more like a reusable automation utility. Things like saved state, auth reuse, diffing, and provider support make it easier to imagine as a repeatable browser worker in a larger workflow.</p><p>Headless support is part of this story too. Both tools can run headless, but the shape is different. With agent-browser, headless or headed operation is a natural part of the CLI workflow. One practical pattern I&apos;ve been using is to open a headed session first, do whatever auth setup is needed there, and then reuse that state for the agent&apos;s headless sessions afterward. Chrome DevTools MCP does support headless mode as well, but that is more of a server launch or browser configuration detail than part of the agent&apos;s normal workflow.</p><p>The main weakness right now is not the tool itself. Current models do not seem deeply fluent with it yet.</p><p>The agent-browser skill is fairly extensive, and that helps, but it also makes it very obvious that models are not coming in with much native familiarity. Quite often they fumble around with the CLI a bit before landing on the right command sequence. There does not seem to be much training data here yet. That will likely improve. Right now though, it is still visible.</p><p>One thing I also checked more carefully here is whether agent-browser exposes comparable network and performance inspection to Chrome DevTools MCP. The short answer is: partially, but not really at the same depth.</p><p>It does expose <code>network requests</code>, <code>trace</code>, and <code>profiler</code> commands, and I verified that the profiling and trace capture do work in practice. But it still feels more like request monitoring plus trace capture than full DevTools-style inspection. 
Chrome DevTools MCP is stronger here because it exposes explicit tools for things like listing network requests, fetching a specific request, saving request or response bodies, Lighthouse audits, and performance-insight style workflows.</p><p>So I would not describe the network or performance inspection story as equivalent today.</p><h3 id="rough-feature-comparison">Rough feature comparison</h3><p>These are two different design bets, not a case of one tool simply being better.</p><figure class="kg-card kg-image-card kg-width-wide kg-card-hascaption"><img src="https://huuhkadotnet.blob.core.windows.net/prod/images/2026/4/91612_image.png" class="kg-image" alt="Browser verification for coding agents: Chrome DevTools MCP vs agent-browser" loading="lazy" width="3472" height="618"><figcaption><span style="white-space: pre-wrap;">Wish I had support for tables...</span></figcaption></figure><h3 id="screenshot-handling-in-opencode">Screenshot handling in OpenCode</h3><p>One very practical issue I have hit with Chrome DevTools tooling in OpenCode is screenshot handling.</p><p>Sometimes when the model takes a screenshot, the image payload ends up flooding the session context, the whole session can fall over pretty quickly. I&apos;ve had this happen enough times that it is worth calling out explicitly.</p><p>The workaround is simple but useful: tell the model to save the screenshot to a file first, then read the file afterward if needed.</p><p>That sounds minor, but it is exactly the kind of operational detail that matters once these tools become part of the normal workflow.</p><h3 id="beyond-visual-verification">Beyond visual verification</h3><p>Most of my own use has been around visual verification, but the useful scope is wider than that.</p><p>Some examples:</p><ul><li><strong>Console errors:</strong> checking whether a new feature introduced them</li><li><strong>Network requests:</strong> catching failed API calls after a UI change</li><li><strong>Auth and session state:</strong> verifying redirects, cookies, local storage</li><li><strong>Bug reproduction:</strong> issues that are easier to see in the browser than in code</li><li><strong>Responsive / dark mode:</strong> quick testing without manual switching</li><li><strong>Artifacts:</strong> capturing screenshots, traces or request logs for later review</li><li><strong>Adversarial review:</strong> letting a review agent try to break the implementation</li><li><strong>Interaction flow:</strong> validating the actual flow works, not just the static layout</li></ul><p>Sometimes the page looks fine in a screenshot, but the real problem is a broken loading state, disabled button, wrong request payload, client-side error or auth/session issue. Browser tooling is useful exactly because it can move between visual inspection and behavioral debugging instead of forcing you to pick one.</p><h3 id="browser-tooling-fits-review-agents-well">Browser tooling fits review agents well</h3><p>One thing I&apos;ve liked is handing browser tooling to a more adversarial review agent after implementation.<br>That review pass can be asked to inspect the implemented page visually, check console and network errors, validate a specific flow end to end and try edge cases the implementation agent may have skipped.</p><p>That tends to work well because the review agent is not attached to the implementation it just wrote. It is looking for mismatches instead of trying to defend its own earlier reasoning. 
For frontend work in particular, that extra pass has felt useful, but it is a useful pattern for any coding task. The cost you pay is of course the time it takes for the extra model to think.</p><h3 id="open-questions">Open questions</h3><p>A few things I have not fully resolved yet.</p><p><strong>Who should own browser verification?</strong> The implementation agent, a higher-level orchestrator, or a separate reviewer? I currently lean toward orchestrator or reviewer, especially if concurrent agents are involved, but if you set up shared auth, I&apos;m sure both options can work. Similarly, <strong>session isolation for parallel agents</strong> tackles a similar problem space. If every agent shares one browser, you get contention. If every agent gets its own clean and controllable browser state, a lot of the workflow becomes simpler and they get faster feedback to fix their own issues. </p><p><strong>Security and state handling.</strong> The moment these tools start storing auth state, cookies, local storage or remote debugging connections, there is a real security conversation to have.</p><p><strong>Performance and debugging, not just CSS.</strong> It would be easy to accidentally frame these as only visual verification tools. That would undersell Chrome DevTools MCP especially, since a lot of its value is in debugging and performance analysis.</p><h3 id="playwright-mcp">Playwright MCP</h3><p>Another tool in this space is <a href="https://github.com/microsoft/playwright-mcp?ref=huuhka.net">Playwright MCP</a>. I have not used it myself yet, so I won&apos;t pretend I have a strong opinion.<br>Interestingly, its own README makes a distinction between MCP-based workflows and CLI + skill based workflows for coding agents, which is very much the same design space as the tradeoff discussed above. 
Worth evaluating.</p><h3 id="final-thought">Final thought</h3><p>Giving coding agents some way to verify browser state is increasingly worth it, because current models are still nowhere near reliable enough to just &quot;reason the UI correctly&quot; from code alone.</p>]]></content:encoded></item><item><title><![CDATA[Securing remote MCP servers with Entra ID without breaking reconnects]]></title><description><![CDATA[I've been wiring remote MCP servers behind Entra-protected endpoints lately, and the awkward part isn't really validating a JWT, but everything around it.]]></description><link>https://www.huuhka.net/securing-remote-mcp-servers-with-entra-id-without-breaking-reconnects/</link><guid isPermaLink="false">69e3bd3d6f3c900001bdae6f</guid><category><![CDATA[AI]]></category><category><![CDATA[OpenCode]]></category><category><![CDATA[Security]]></category><dc:creator><![CDATA[Pasi Huuhka]]></dc:creator><pubDate>Fri, 16 Jan 2026 18:31:00 GMT</pubDate><media:content url="https://huuhkadotnet.blob.core.windows.net/prod/images/2026/4/181731_ruan-richard-rodrigues-sT1sEvYUjwU-unsplash.jpg" medium="image"/><content:encoded><![CDATA[<div class="kg-card kg-callout-card kg-callout-card-blue"><div class="kg-callout-emoji">&#x1F4A1;</div><div class="kg-callout-text">This post is part of a larger Secure Enterprise AI Tooling On Azure theme:<br>- <a href="https://www.huuhka.net/securing-remote-mcp-servers-with-entra-id-without-breaking-reconnects/" rel="noreferrer">Securing remote MCP servers with Entra ID without breaking reconnects</a><br>- <a href="https://www.huuhka.net/preserving-mcp-session-continuity-with-redis/" rel="noreferrer">Preserving MCP session continuity with Redis</a><br>- <a href="https://www.huuhka.net/shipping-signed-config-updates-to-local-ai-tooling/" rel="noreferrer">Shipping signed config updates to local AI tooling</a><br>- <a href="https://www.huuhka.net/designing-a-shared-opentelemetry-contract-for-ai-services-on-azure/" rel="noreferrer">Designing a shared OpenTelemetry contract for AI services on Azure</a></div></div><img src="https://huuhkadotnet.blob.core.windows.net/prod/images/2026/4/181731_ruan-richard-rodrigues-sT1sEvYUjwU-unsplash.jpg" alt="Securing remote MCP servers with Entra ID without breaking reconnects"><p>I&apos;ve been wiring remote MCP servers behind Entra-protected endpoints lately, and the awkward part isn&apos;t really validating a JWT, but everything around it.</p><p>Most MCP server implementations don&apos;t come with Entra ID support out of the box. In the AI platform I&apos;ve been building, every service sits behind a shared APIM gateway that requires an Entra bearer token. That includes the MCP servers. The way I handle this is by running an Entra-authenticating reverse proxy in front of the upstream MCP server, so the server itself doesn&apos;t need to know anything about Entra at all. The proxy validates the caller&apos;s token, and then forwards the request upstream.</p><p>That moves the authentication story out of the individual MCP server code and into a shared layer that&apos;s consistent across the platform. But it also means that every local client needs to be able to acquire the right token, attach it to every outbound request, and handle the usual lifecycle problems: expiry, 401 retries, and reconnects after session loss.</p><p>On the client side, I&apos;m using <a href="https://opencode.ai/?ref=huuhka.net">OpenCode</a> as the coding tool. 
OpenCode has a plugin system that lets you intercept outbound HTTP requests, inject headers, and react to lifecycle events. I ended up building a set of plugins that handle all of this transparently, so the developers using the platform don&apos;t have to think about auth at all. It just works in the background.</p><h3 id="setup">Setup</h3><p>The shape of the setup is fairly simple:</p><ul><li>Local OpenCode plugins acquire Entra tokens and attach them to the right outbound requests</li><li>APIM and a reverse proxy sit in front of the remote MCP servers</li><li>The proxy validates the bearer token and forwards traffic to the upstream MCP server</li><li>The MCP server itself stays unaware of Entra-specific concerns</li></ul><p>That split has worked well for me. The authentication behavior stays consistent across services, and the MCP server implementation can stay focused on MCP instead of identity plumbing.</p><h3 id="request-flow">Request flow</h3><p>At a high level, the request path is straightforward. The local client requests an Entra token for the configured audience, and an OpenCode plugin attaches that token to the outbound MCP HTTP request. APIM and the reverse proxy validate the token and forward the request upstream. If the token is stale or the session is lost, the client can refresh the token and try the transport recovery path.</p><h3 id="its-just-http">It&apos;s just HTTP</h3><p>MCP servers aren&apos;t especially exotic from a security point of view. They&apos;re HTTP servers with some long-lived connection behavior. You still need to authenticate callers, validate tokens, and make sure the transport can reconnect when things go wrong.</p><p>Once the transport is remote, most of the difficulty is just protected HTTP plumbing with some session continuity concerns on top. In practice, the solution space is much more normal than the surrounding discussion sometimes suggests.</p><p>What I ended up with was one shared token acquisition path on the client side, narrow bearer injection for the intended remote endpoints, service-side validation that accepts the issuer and audience variants I know I&apos;ll see in practice, and a reconnect model that can heal dropped transport sessions.</p><h3 id="shared-token-acquisition">Shared token acquisition</h3><p>The first practical lesson was that token acquisition shouldn&apos;t be reinvented separately by every plugin.</p><p>I have several OpenCode plugins that need to acquire Entra tokens. One handles auth for the MCP servers behind the platform proxy. Another handles auth for the LLM provider calls routed through the same APIM gateway. A third handles auth for an enterprise session sharing service. It just made sense to consolidate the token acquisition logic into one shared module.</p><p>So I built one shared auth module with a very plain order of operations:</p><ol><li>Return a fresh cached token if one exists.</li><li>Try <code>DefaultAzureCredential</code>.</li><li>If that fails, fall back to <code>az account get-access-token</code>.</li><li>If the CLI explicitly needs login, do a controlled login flow and retry once.</li></ol><p>All of this lives in a shared library that every plugin imports, so the behavior is identical everywhere.</p><p>Because this runs inside OpenCode&apos;s plugin system, I could also integrate the login flow into the UI itself. 
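</p><p>As a rough sketch, the silent part of that chain could look something like the following. The function names and caching details are my own illustration, not the actual plugin code:</p><pre><code class="language-ts">import { DefaultAzureCredential } from &quot;@azure/identity&quot;;
import { execFile } from &quot;node:child_process&quot;;
import { promisify } from &quot;node:util&quot;;

const run = promisify(execFile);
let cached: { token: string; expiresOn: number } | undefined;

// Cached token -&gt; DefaultAzureCredential -&gt; Azure CLI fallback.
export async function getAccessToken(scope: string): Promise&lt;string&gt; {
  if (cached &amp;&amp; cached.expiresOn - Date.now() &gt; 60_000) return cached.token;

  try {
    const result = await new DefaultAzureCredential().getToken(scope);
    cached = { token: result.token, expiresOn: result.expiresOnTimestamp };
    return result.token;
  } catch {
    // Fall back to the Azure CLI. A real implementation would also detect the
    // &quot;please run az login&quot; case and trigger a guided login flow plus a single retry.
    const { stdout } = await run(&quot;az&quot;, [&quot;account&quot;, &quot;get-access-token&quot;, &quot;--scope&quot;, scope, &quot;--output&quot;, &quot;json&quot;]);
    const parsed = JSON.parse(stdout) as { accessToken: string; expires_on?: number };
    cached = { token: parsed.accessToken, expiresOn: (parsed.expires_on ?? 0) * 1000 };
    return parsed.accessToken;
  }
}</code></pre><p>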
If a user needs to sign in, the plugin shows a toast notification guiding them through the process instead of just leaving them with a failed auth error and a hint about running `az login` on their own. Toasts only work on the TUI version though, so it&apos;s not a perfect solution. However, this was good enough for even the non-technical users to be logged in and productive without needing to understand the Az CLI at all.</p><p>This setup also gave me the two things I cared about. Non-interactive environments had a good chance of succeeding through identity-based auth, and local developer machines still had a reliable escape hatch through the CLI.</p><p>The other useful detail was separating silent acquisition from interactive login. A plugin that&apos;s running during startup should be allowed to try to get a token quietly. It shouldn&apos;t immediately decide to throw a browser sign-in flow at the user before the UI is even properly ready.</p><h3 id="audience-resolution">Audience resolution</h3><p>Once token acquisition is shared, the next thing that matters is that all paths agree on what audience is being requested.</p><p>That sounds trivial until you support both custom API audiences and Azure resource audiences. At that point you need the identity path and the CLI path to resolve the target exactly the same way, or you end up debugging 401s that are really just inconsistencies in your own local tooling.</p><p>The shape is simple enough:</p><pre><code class="language-ts">export type AuthConfig = {
  tenant?: string;
  clientId?: string;
  resource?: string;
};

export function resolveAudience(config: AuthConfig) {
  if (config.clientId) {
    const scope = `api://${config.clientId}/.default`;
    return {
      kind: &quot;scope&quot; as const,
      scope,
      cliArgs: [&quot;--scope&quot;, scope] as const,
    };
  }

  if (config.resource) {
    const resource = config.resource.replace(/\/+$/, &quot;&quot;);
    return {
      kind: &quot;resource&quot; as const,
      scope: `${resource}/.default`,
      cliArgs: [&quot;--resource&quot;, resource] as const,
    };
  }

  throw new Error(&quot;Missing clientId or resource&quot;);
}</code></pre><p>It&apos;s important that every caller in the local tooling ends up behaving consistently. The MCP auth plugin, the provider auth plugin, and the enterprise share plugin all go through the same resolution logic. If one of them gets the scope wrong, the token won&apos;t match what the service expects, and the result is a <code>401</code> that looks like a service-side problem but is really a local misconfiguration.</p><h3 id="attaching-tokens">Attaching tokens</h3><p>Once the client can acquire the right token, the next question is where it should use it.</p><p>I&apos;d avoid broad global rules here. In the OpenCode plugins, each one intercepts <code>fetch</code> and matches outbound requests against explicitly configured URL prefixes. The MCP auth plugin knows which URLs correspond to remote MCP servers behind the platform proxy. The provider auth plugin knows which providers route through the gateway. Each plugin only injects a bearer token for its own scope. That way you don&apos;t accidentally start attaching tokens to random outbound requests that shouldn&apos;t have them, and you keep the auth behavior focused on the intended paths.</p><p>The other useful behavior was to treat <code>401</code> as a signal to evict the cached token and retry once. Without that, you can end up reusing a bad token until it expires. With too much retry logic, you just create more noise. One forced refresh and one retry is a decent middle ground.</p><h3 id="issuer-compatibility">Issuer compatibility</h3><p>It&apos;s very tempting to validate only the Entra v2 issuer because that&apos;s the one you had in mind when configuring the app registration. In practice, that can be wrong.<br>What pushed me into this in the first place was seeing legitimate callers arrive with the older v1 issuer format. In my environment that showed up with managed identity and gateway-mediated paths, which was enough to make strict v2-only validation a problem.</p><p>Microsoft&apos;s own Entra token validation guidance is explicit here: Microsoft Entra-issued access tokens can use either <code>https://sts.windows.net/{tenant-id}</code> for v1.0 tokens or <code>https://login.microsoftonline.com/{tenant-id}/v2.0</code> for v2.0 tokens. So this isn&apos;t some oddity specific to MCP. It&apos;s just something your API validation layer needs to account for if valid callers in your environment can receive both token versions.</p><p>The same thing happens with audiences. One token may present the bare client ID. Another may use the application ID URI form. So the practical rule became to accept the variants that legitimate callers in my environment can actually produce:</p><pre><code class="language-ts">export function buildValidIssuers(tenantId: string): string[] {
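  // Entra-issued access tokens can use either issuer format:
  // v2.0 tokens use login.microsoftonline.com, v1.0 tokens use sts.windows.net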
  return [
    `https://login.microsoftonline.com/${tenantId}/v2.0`,
    `https://sts.windows.net/${tenantId}/`,
  ];
}

export function buildValidAudiences(clientId: string): string[] {
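  // Some callers present the bare client ID, others the api:// application ID URI form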
  return [clientId, `api://${clientId}`];
}</code></pre><h3 id="proxy-and-reconnects">Proxy and reconnects</h3><p>Once auth is working, the problem is still only half solved.</p><p>The reverse proxy that fronts the upstream MCP servers needs to behave like decent transport plumbing. It shouldn&apos;t buffer SSE responses. It should use an upstream HTTP version the server actually supports. It should expose a simple health route for probes instead of forcing those probes into your auth story.<br>After that, the reconnect behavior becomes the interesting part.</p><p>What I liked in the implementation here was that it didn&apos;t rely on a single recovery mechanism. In the MCP auth plugin, there&apos;s a lighter health reconnect tick that periodically checks whether targets are still connected. There&apos;s a slower hard reconnect tick that fully resets sessions on a longer cadence. And on top of that, request-time behavior can recover a target when the transport starts showing session loss symptoms, like a 404 with a session-not-found error body.</p><p>That felt realistic. Long-lived authenticated transport tends to fail in a few different ways, and it&apos;s useful to have more than one way back to a healthy state.<br>The cooldown logic matters too. If several concurrent requests all decide that they should reconnect the same remote MCP target at once, you&apos;ve just created a new kind of problem for yourself. The plugin deduplicates recovery attempts per target and applies a cooldown window so that one burst of session-not-found responses doesn&apos;t turn into a reconnect storm.</p><p>This is also where telemetry earns its keep. If the proxy can log and trace events like accepted sessions, possible session loss, retries, and timeouts after session acceptance, you stop guessing quite so much.</p><p>One thing worth calling out separately is that transport reconnects and cross-process session continuity are related, but not identical, problems. Reconnect logic helps the client recover when an existing transport drops. It does not by itself make server-side session state survive a restart or rollout. I&apos;ll cover that continuity side separately in a follow-up post using Redis.</p><h3 id="wrap-up">Wrap up</h3><p>The main lesson for me was that remote MCP security isn&apos;t mostly about MCP. It&apos;s about consistency.</p><p>Consistent token acquisition. Consistent audience resolution. Consistent issuer handling. Consistent proxy behavior when the connection gets interrupted. Once those pieces line up, remote MCP feels much less special and much more like what it really is: another authenticated infrastructure path with slightly more session sensitivity than average.</p><p>In my case, the fact that the MCP servers themselves don&apos;t know anything about Entra is a feature, not a limitation. The proxy handles auth, the OpenCode plugins handle token lifecycle, and the developers using the platform don&apos;t have to care about any of it.</p><p>The solution isn&apos;t to invent some MCP-specific security model. 
It&apos;s to do the identity and transport work properly.</p><p>If you&apos;re building something similar, the practical checklist I&apos;d keep in mind is:</p><ul><li>Centralize token acquisition</li><li>Resolve audiences consistently across every local auth path</li><li>Accept the issuer and audience variants your real callers can legitimately produce</li><li>Keep bearer injection narrow and explicit</li><li>Treat reconnect behavior and cross-restart session continuity as separate design problems<br></li></ul>]]></content:encoded></item><item><title><![CDATA[Primary vs Subagents in LLM harnesses]]></title><description><![CDATA[I’ve been refining my own mental model for agent splits in coding workflows, and this post is a snapshot of what has worked best so far.]]></description><link>https://www.huuhka.net/primary-vs-subagents-in-llm-harnesses/</link><guid isPermaLink="false">69a446a8dae91c00012b5c98</guid><category><![CDATA[AI]]></category><category><![CDATA[OpenCode]]></category><category><![CDATA[GitHub Copilot]]></category><category><![CDATA[GitHub]]></category><category><![CDATA[Artificial Intelligence]]></category><dc:creator><![CDATA[Pasi Huuhka]]></dc:creator><pubDate>Thu, 15 Jan 2026 14:35:00 GMT</pubDate><media:content url="https://huuhkadotnet.blob.core.windows.net/prod/images/2026/3/11435_11435_image.png" medium="image"/><content:encoded><![CDATA[<div class="kg-card kg-callout-card kg-callout-card-blue"><div class="kg-callout-emoji">&#x1F4A1;</div><div class="kg-callout-text">This post is part of a larger Agentic Dev theme:<br>- <a href="https://www.huuhka.net/a-mental-model-for-llm-tooling-primitives/" rel="noreferrer">A mental model for LLM tooling primitives</a><br>- <a href="https://www.huuhka.net/research-plan-implement/" rel="noreferrer">Research - Plan - Implement</a><br>- <a href="https://www.huuhka.net/primary-vs-subagents-in-llm-harnesses/" rel="noreferrer">Primary vs Subagents in LLM harnesses</a><br>- <a href="https://www.huuhka.net/how-i-currently-develop-with-llm-models-early-2026/" rel="noreferrer">How I currently develop with LLM models (Early 2026)</a> <br>- <a href="https://www.huuhka.net/building-your-own-pr-reviewer-with-coding-agents/" rel="noreferrer">Building your own PR reviewer with coding agents</a></div></div><img src="https://huuhkadotnet.blob.core.windows.net/prod/images/2026/3/11435_11435_image.png" alt="Primary vs Subagents in LLM harnesses"><p>I&#x2019;ve been refining my own mental model for agent splits in coding workflows, and this post is a snapshot of what has worked best so far.</p><p>A lot of this aligns with ideas from Dex Horthy&#x2019;s talk from the AI Engineer Code Summit (<a href="https://www.youtube.com/watch?v=rmvDxxNubIg&amp;ref=huuhka.net" rel="noreferrer">YouTube</a>) and with the way I&#x2019;ve been structuring Research-Plan-Implement flows (<a href="https://www.huuhka.net/research-plan-implement/" rel="noreferrer">my post here</a>).</p><p>I also touched adjacent concepts in <a href="https://www.huuhka.net/a-mental-model-for-llm-tooling-primitives/" rel="noreferrer">A mental model for LLM tooling primitives</a>.</p><h3 id="the-short-version">The short version</h3><ul><li>Primary agents are user-facing orchestrators.</li><li>Subagents are context protectors and scoped executors.</li><li>If a split doesn&#x2019;t reduce context pressure, it&#x2019;s probably not a useful split.</li></ul><p>That&#x2019;s really it.</p><h2 id="what-primary-agents-should-do">What primary agents should do</h2><p><strong>Primary agents 
</strong>should be the only layer directly responsible for user interaction and overall task flow.</p><p>They define:</p><ul><li><strong>how</strong> the session should behave,</li><li><strong>what</strong> kind of <strong>output</strong> should be returned,</li><li><strong>when to delegate</strong> work,</li><li>and <strong>how</strong> to synthesize results into a final answer.</li></ul><p>In practice, primary agents should mostly orchestrate and synthesize. They can do work themselves too, but for bigger tasks their main value is coordination.</p><p>A useful framing: primary agent = <strong>director</strong>, not <strong>every actor</strong> on stage. (Though as always, there are cases where it can do everything. Depends on the size of the task you&apos;re currently working on)</p><h2 id="what-subagents-should-do">What subagents should do</h2><p><strong>Subagents</strong> should exist primarily to <strong>protect the context window </strong>of the primary agent.</p><p><strong>Not</strong> &#x201C;security agent&#x201D;, &#x201C;SRE agent&#x201D;, &#x201C;backend agent&#x201D; by default.<br><strong>Instead:</strong> small, narrow units of work that return distilled outcomes.</p><p><strong>Good subagent jobs:</strong></p><ul><li>locate files relevant to a topic,</li><li>analyze a specific code path,</li><li>find existing patterns,</li><li>implement one atomic change set,</li><li>summarize prior research/decisions.</li></ul><p>What they return should usually be <strong>compact</strong> and <strong>structured</strong>. This could be a short summary, file paths + line pointers, key findings / decisions or explicit gaps. So not full file dumps unless absolutely necessary.</p><figure class="kg-card kg-image-card"><img src="https://huuhkadotnet.blob.core.windows.net/prod/images/2026/3/11435_image.png" class="kg-image" alt="Primary vs Subagents in LLM harnesses" loading="lazy" width="1890" height="1515"></figure><h3 id="common-failure-mode-i-keep-seeing">Common failure mode I keep seeing</h3><p>Sometimes the primary agent asks subagents for too much, like: &#x201C;return full contents of the files&#x201D;, &#x201C;return all changes in detail&#x201D; or &#x201C;paste everything you found&#x201D;</p><p>That defeats the whole point of the split and wastes money and time as well, so it&apos;s important to catch when it happens.</p><p>The solution is to <strong>fix the primary prompt</strong> so subagents return distillations instead. If the primary needs full content, it can read targeted files itself afterward.</p><h3 id="parallelism-is-not-optional">Parallelism is not optional</h3><p>If subagent tasks are independent, the primary should spawn them in parallel.</p><p>Typical good parallel batches:</p><ul><li>multiple locators on different search angles,</li><li>independent implementation tasks touching disjoint files,</li><li>separate analyzers for code, patterns, and prior notes.</li></ul><p>This is usually the easiest way to reduce wall-clock time without losing quality. 
The primary agent is often smart enough to make these decisions, but the risk exists that there is some underlying dependency that can affect the coherency of the final result.</p><h3 id="share-plan-artifacts-explicitly">Share plan artifacts explicitly</h3><p>Subagents do not magically inherit full context from the primary.<br>So when there is a plan/research artifact, instruct the primary agent to pass the link/path directly in the subagent prompt.</p><p>This is exactly why I like writing explicit plan files in the first place (again: <a href="https://www.huuhka.net/research-plan-implement/" rel="noreferrer">Research-Plan-Implement</a>): they become stable handoff artifacts between agent hops.</p><p>Without that handoff, primaries often assume subagents &#x201C;know the whole story&#x201D;, and that assumption breaks quickly. The subagents can then read the plan to get the full context (e.g. what are we doing? What has already been done so far?). If your plans aren&apos;t massive, this should be fine from the context perspective.</p><h3 id="a-practical-contract-i-like-for-subagent-responses">A practical contract I like for subagent responses</h3><p>I haven&apos;t spent enough time on tuning these yet myself, but something like this could be used as a guideline when designing what the subagents return. Often the primary agent&apos;s prompt gives the instructions anyway, so it feels more powerful to tune that instead.</p><ul><li><strong>Result</strong>: 3&#x2013;7 bullets max</li><li><strong>Evidence</strong>: path:line references</li><li><strong>State</strong>: done / partial / blocked</li><li><strong>Next input needed</strong>: one short line (if blocked)</li></ul><h3 id="implementation-examples">Implementation examples</h3><p>Here are a few of the subagents I&apos;m using. Mostly taken from the humanlayer repository.</p><ul><li><a href="https://github.com/DrBushyTop/humanlayer-opencode/blob/master/.opencode/agent/subagents/research/codebase-locator.md?ref=huuhka.net" rel="noreferrer">codebase-locator</a><br>Finds where things live, grouped by purpose. 
No deep analysis.</li><li><a href="https://github.com/DrBushyTop/humanlayer-opencode/blob/master/.opencode/agent/subagents/research/codebase-analyzer.md?ref=huuhka.net" rel="noreferrer">codebase-analyzer</a><br>Explains how specific flows work with precise file:line references.</li><li><a href="https://github.com/DrBushyTop/humanlayer-opencode/blob/master/.opencode/agent/subagents/research/pattern-finder.md?ref=huuhka.net" rel="noreferrer">pattern-finder</a><br>Finds established implementation/test patterns to mirror.</li></ul><h2 id="final-thoughts">Final thoughts</h2><p>For me, primary/subagent design is mostly a context engineering problem, not a role taxonomy problem.</p><p>If your primary agent is drowning in tokens, make subagents narrower.<br>If your subagents are returning novels, tighten response contracts.<br>If execution is slow, parallelize independent work.<br>If context gets lost, pass explicit plan files.</p><p>Everything else is details.</p>]]></content:encoded></item><item><title><![CDATA[Automating Azure DevOps workload identity service connections end to end]]></title><description><![CDATA[How to automate the whole flow and get rid of passwords at the same time]]></description><link>https://www.huuhka.net/automating-azure-devops-workload-identity-service-connections-end-to-end/</link><guid isPermaLink="false">69de7770b92fba00013b621b</guid><category><![CDATA[Bicep]]></category><category><![CDATA[PowerShell]]></category><category><![CDATA[DevOps]]></category><dc:creator><![CDATA[Pasi Huuhka]]></dc:creator><pubDate>Tue, 13 Jan 2026 18:28:00 GMT</pubDate><media:content url="https://huuhkadotnet.blob.core.windows.net/prod/images/2026/4/141730_brice-cooper-uRnWbdawNLo-unsplash.jpg" medium="image"/><content:encoded><![CDATA[<img src="https://huuhkadotnet.blob.core.windows.net/prod/images/2026/4/141730_brice-cooper-uRnWbdawNLo-unsplash.jpg" alt="Automating Azure DevOps workload identity service connections end to end"><p>One of the more annoying setup tasks in Azure DevOps has been service connections. It&apos;s not necessarily difficult, but the workload identity federation flow crosses Azure and Azure DevOps in exactly the wrong place.</p><p>You can create the managed identity in Azure. You can create the service connection in Azure DevOps. But the awkward values you need in order to finish the federated credential only show up once Azure DevOps has done its part. So the whole thing ends up as a mildly clumsy roundtrip between two control planes. Sounds like a perfect candidate for automation.</p><p>I have written before about using user-assigned managed identities behind Azure DevOps service connections in <a href="https://www.huuhka.net/user-assigned-managed-identities-with-azure-devops-service-connections/">User Assigned Managed Identities with Azure DevOps Service Connections</a>. That post was more about why I like that identity model and what the basic manual setup looks like. This post is really the follow-up to that. The interesting bit here is not the identity choice itself, but how to automate the whole roundtrip once the service connection starts generating federation details on the Azure DevOps side.</p><p>If I&apos;m bootstrapping a new workload, I want the identity, the service connection, the federation details and the follow-up permissions to come out of a rerunnable setup process. 
I don&apos;t want somebody clicking through a half-manual draft flow and pasting values around just because the platform boundary is inconvenient.</p><h3 id="manual-pain">Manual = Pain</h3><p>The manual version of this isn&apos;t exactly hard. It&apos;s just awkward enough to survive for far too long.</p><p>If you want the basic setup flow, the earlier post covers it well enough and I won&apos;t repeat all of it here. The short version is that you create or pick a user-assigned managed identity, create the Azure Resource Manager service connection in Azure DevOps using workload identity federation, and then finish the trust relationship on the Azure side once you have the federation details. I still like user-assigned identities for this because they let you manage the lifecycle from the Azure side without having to depend on App Registration access in Entra ID, which is often a very real constraint in customer tenants.</p><p>After that setup is complete, you still need to do the actual useful work: grant RBAC on the target resource groups, maybe grant some shared platform permissions, and save enough state that reruns don&apos;t become guesswork.<br>The other problem with the manual flow is partial state. It&apos;s very easy to end up with a managed identity that exists, a service connection that exists, and a missing federated credential in between. Nothing is fully broken, but the setup isn&apos;t actually finished either. Or as often happens, you&apos;re creating these for multiple environments at the same time, and you end up mixing up the federation details between them. The risk of human error is pretty high here.</p><h3 id="why-the-roundtrip-exists">Why the roundtrip exists</h3><p>The awkward bit is that the federated credential lives on the Azure side, but the values you need for it are effectively materialized by Azure DevOps once the service connection exists. That is the part that changes the character of the problem a bit compared to the earlier manual setup post. Previously it was possible to calculate these values in advance and avoid the roundtrip, but now the `subject` is opaque enough that you really want to read it from Azure DevOps instead of trying to outsmart the platform.</p><p>You need one pass to create or resolve the managed identity, then a pass through Azure DevOps to create the service connection, then one more pass back into Azure to attach the federated credential using the <code>issuer</code> and <code>subject</code> that Azure DevOps exposes.</p><p>The setup loop was basically this:</p><p>Azure creates or resolves the identity. Azure DevOps creates the service connection that points at that identity. Azure DevOps refreshes the endpoint so the federation details are visible. Azure attaches the federated credential. Only after that do the permission-granting steps happen. Arguably you COULD do this in Bicep alone with deployment scripts, but those take forever to provision and tear down, so I wanted to keep the orchestration logic in PowerShell where it&apos;s more nimble (not to mention the pain if you&apos;re limited to network integrated resources only).</p><p>I found it useful to keep those as separate concerns. Creating a usable service connection and granting that principal the permissions it needs are related, but they&apos;re not the same step. 
Splitting them made the rerun behavior easier to reason about too.</p><h3 id="the-azure-devops-part">The Azure DevOps part</h3><p>I ended up letting Azure DevOps create the endpoint first and then explicitly query it for the federation values.</p><p>In practice that meant creating the service connection with the managed identity client ID, calling the endpoint refresh API, and then reading back the workload identity issuer and subject from the endpoint data. That was the slightly backwards part of the whole flow, but it&apos;s also the part that made the automation reliable.</p><p>The core PowerShell shape isn&apos;t especially complicated. The create call is really just constructing the AzureRM endpoint payload with workload identity federation and the managed identity client ID:</p><pre><code class="language-powershell">function New-AdoAzureRmFederatedServiceConnection {
    param(
        [string]$Organization,
        [string]$Project,
        [string]$ServiceConnectionName,
        [string]$TenantId,
        [string]$SubscriptionId,
        [string]$SubscriptionName,
        [string]$ManagedIdentityClientId,
        [string]$AccessToken,
        [string]$ProjectId = &apos;00000000-0000-0000-0000-000000000000&apos;
    )

    $body = @{
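        # AzureRM service endpoint payload: workload identity federation, pointing at the managed identity client ID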
        authorization = @{
            scheme     = &apos;WorkloadIdentityFederation&apos;
            parameters = @{
                serviceprincipalid = $ManagedIdentityClientId
                tenantid           = $TenantId
            }
        }
        data = @{
            environment      = &apos;AzureCloud&apos;
            scopeLevel       = &apos;Subscription&apos;
            creationMode     = &apos;Manual&apos;
            subscriptionId   = $SubscriptionId
            subscriptionName = $SubscriptionName
        }
        name        = $ServiceConnectionName
        type        = &apos;AzureRM&apos;
        url         = &apos;https://management.azure.com/&apos;
        owner       = &apos;library&apos;
        isShared    = $false
        isReady     = $false
        serviceEndpointProjectReferences = @(
            @{
                name             = $ServiceConnectionName
                projectReference = @{
                    id   = $ProjectId
                    name = $Project
                }
            }
        )
    }

    Invoke-RestMethod -Method Post -Uri &quot;https://dev.azure.com/$Organization/_apis/serviceendpoint/endpoints?api-version=7.1-preview.4&quot; -Headers @{
        Authorization = &quot;Bearer $AccessToken&quot;
    } -ContentType &apos;application/json&apos; -Body ($body | ConvertTo-Json -Depth 20)
}</code></pre><pre><code class="language-powershell">function Get-AdoServiceConnectionFederationDetails {
    param(
        [string]$Organization,
        [string]$Project,
        [string]$ServiceConnectionId,
        [string]$AccessToken
    )

    $refreshBody = @(
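        # Ask Azure DevOps to refresh the endpoint so it materializes the federation values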
        @{
            endpointId             = $ServiceConnectionId
            tokenValidityInMinutes = 5
        }
    )

    $result = Invoke-RestMethod -Method Post -Uri &quot;https://dev.azure.com/$Organization/$Project/_apis/serviceendpoint/endpoints?endpointIds=$ServiceConnectionId&amp;api-version=7.1&quot; -Headers @{
        Authorization = &quot;Bearer $AccessToken&quot;
    } -ContentType &apos;application/json&apos; -Body ($refreshBody | ConvertTo-Json -Depth 10)

    $endpoint = @($result.value)[0]

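    # The issuer and subject are exactly what the federated credential on the Azure side needs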
    return [ordered]@{
        issuer  = [string]$endpoint.authorization.parameters.workloadIdentityFederationIssuer
        subject = [string]$endpoint.authorization.parameters.workloadIdentityFederationSubject
    }
}

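# Create the service connection first, then read back the issuer and subject it generated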
$endpoint = New-AdoAzureRmFederatedServiceConnection `
    -Organization &apos;example-org&apos; `
    -Project &apos;example-project&apos; `
    -ServiceConnectionName &apos;example-release-dev&apos; `
    -TenantId $tenantId `
    -SubscriptionId $subscriptionId `
    -SubscriptionName $subscriptionName `
    -ManagedIdentityClientId $managedIdentityClientId `
    -AccessToken $adoToken

$federation = Get-AdoServiceConnectionFederationDetails `
    -Organization &apos;example-org&apos; `
    -Project &apos;example-project&apos; `
    -ServiceConnectionId $endpoint.id `
    -AccessToken $adoToken</code></pre><p>The sequencing is the important part. I would&apos;ve happily avoided the refresh step if the platform had made the values available earlier, but once you accept the roundtrip, the flow is stable enough.</p><h3 id="the-azure-side">The Azure side</h3><p>On the Bicep side I wanted one module that could support both passes.<br>The first run needs to be able to create or resolve the user-assigned managed identity without requiring federation values yet. The second run needs to take the same identity and attach the federated credential once <code>issuer</code> and <code>subject</code> are known.</p><p>That meant the module needed to support both creating a new identity and targeting an existing one. The practical shape I liked was to make the federated credential conditional on both <code>issuer</code> and <code>subject</code> being present, and then attach it either to the newly created identity or to an existing one.</p><pre><code class="language-bicep">param location string
param identityName string
param createIdentity bool = true
param issuer string = &apos;&apos;
param subject string = &apos;&apos;

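// Attach the federated credential only once both issuer and subject are known, typically on the second pass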
var shouldAttachFederation = !empty(issuer) &amp;&amp; !empty(subject)

resource identity &apos;Microsoft.ManagedIdentity/userAssignedIdentities@2023-01-31&apos; = if (createIdentity) {
  name: identityName
  location: location
}

resource existingIdentity &apos;Microsoft.ManagedIdentity/userAssignedIdentities@2023-01-31&apos; existing = if (!createIdentity) {
  name: identityName
}

resource federatedCredentialForNewIdentity &apos;Microsoft.ManagedIdentity/userAssignedIdentities/federatedIdentityCredentials@2023-01-31&apos; = if (createIdentity &amp;&amp; shouldAttachFederation) {
  parent: identity
  name: &apos;AzureDevOps&apos;
  properties: {
    issuer: issuer
    subject: subject
    audiences: [
      &apos;api://AzureADTokenExchange&apos;
    ]
  }
}

resource federatedCredentialForExistingIdentity &apos;Microsoft.ManagedIdentity/userAssignedIdentities/federatedIdentityCredentials@2023-01-31&apos; = if (!createIdentity &amp;&amp; shouldAttachFederation) {
  parent: existingIdentity
  name: &apos;AzureDevOps&apos;
  properties: {
    issuer: issuer
    subject: subject
    audiences: [
      &apos;api://AzureADTokenExchange&apos;
    ]
  }
}</code></pre><p>That audience value is not especially interesting, but it is one of those details that I prefer to keep explicit in the module rather than relying on people to remember it later.</p><p>This kept the orchestration simple. The PowerShell can run the same module twice with different inputs instead of having to reason about two different deployment shapes.</p><h3 id="safety-checks">Safety Checks</h3><p>Once the full loop was automated, the more important question became how strict the automation should be. For me the main danger wasn&apos;t reuse by itself. Reuse is often exactly what you want. The dangerous case is when a service connection with the expected name already exists but points to a different managed identity than the one your automation just created or resolved.</p><p>So the behavior I wanted was simple. If the service connection doesn&apos;t exist, create it. If it exists and points at the expected managed identity, reuse it. If it exists and points somewhere else, stop immediately.</p><p>The other practical thing that mattered was checkpointing outputs after each successful environment. If `dev` succeeded and `prod` failed, I wanted to keep the saved client ID, principal ID, service connection ID, issuer and subject from the successful side. That made reruns much less irritating. In fact, I currently do this state management for many of our platform services, as it makes idempotency much easier to achieve.</p><h3 id="end-result">End result</h3><p>The obvious improvement is that there are fewer clicks, but that&apos;s honestly the least interesting part. What actually got better was that the setup became deterministic. The trust relationship no longer depended on somebody manually copying values between Azure DevOps and Azure. The naming stayed consistent. The outputs were saved. The follow-up permission steps had stable inputs. And when something failed, the failure mode was much easier to understand.</p><p>It also forced a cleaner mental model. There&apos;s a service connection creation loop, and there&apos;s a permission grant loop. They feed into each other, but they&apos;re not the same thing.</p><h3 id="closing-thoughts">Closing thoughts</h3><p>Workload identity federation for Azure DevOps service connections isn&apos;t hard in the normal sense. 
It&apos;s just awkward at the point where product boundaries meet.<br>Once I stopped fighting that and explicitly automated the roundtrip, the whole thing became much more boring, which is exactly what I wanted.</p><p>The setup now creates or resolves the identity, creates or reuses the service connection, reads the federation details back from Azure DevOps, finishes the federated credential on the Azure side, and only then continues to the authorization work.</p><p>Pretty simple and practical, just the way I like it.</p>]]></content:encoded></item><item><title><![CDATA[Connecting OpenCode with Microsoft Foundry Models]]></title><description><![CDATA[In this post I'll show you how I've configured the providers to my own Foundry, which hosts both the recently announced Anthropic models as well as models by OpenAI and others.]]></description><link>https://www.huuhka.net/connecting-opencode-with-microsoft-foundry-models/</link><guid isPermaLink="false">69a31460deaf9d0001d85482</guid><category><![CDATA[AI]]></category><category><![CDATA[Artificial Intelligence]]></category><category><![CDATA[Azure]]></category><category><![CDATA[Microsoft Foundry]]></category><dc:creator><![CDATA[Pasi Huuhka]]></dc:creator><pubDate>Sun, 21 Dec 2025 16:41:00 GMT</pubDate><media:content url="https://huuhkadotnet.blob.core.windows.net/prod/images/2026/2/281641_opencode_plus_azure_ai_foundry.png" medium="image"/><content:encoded><![CDATA[<div class="kg-card kg-callout-card kg-callout-card-blue"><div class="kg-callout-emoji">&#x1F4A1;</div><div class="kg-callout-text">This post is a part of a larger AI Dev Platform theme:<br>- <a href="https://www.huuhka.net/ai-dev-platform-fundamentals/" rel="noreferrer">Azure AI Dev Platform Fundamentals</a><br>- <a href="https://www.huuhka.net/practical-experiences-with-azure-apim-ai-gateway-and-imported-foundry-endpoints/" rel="noreferrer">Practical experiences with Azure APIM AI Gateway and imported Foundry endpoints</a><br>- <a href="https://www.huuhka.net/designing-a-shared-opentelemetry-contract-for-ai-services-on-azure/" rel="noreferrer">Designing a shared OpenTelemetry contract for AI services on Azure</a><br>- <a href="https://www.huuhka.net/connecting-opencode-with-microsoft-foundry-models/" rel="noreferrer">Connecting OpenCode with Microsoft Foundry Models</a></div></div><img src="https://huuhkadotnet.blob.core.windows.net/prod/images/2026/2/281641_opencode_plus_azure_ai_foundry.png" alt="Connecting OpenCode with Microsoft Foundry Models"><p>I&apos;ve been using <a href="https://opencode.ai/docs/?ref=huuhka.net" rel="noreferrer">OpenCode</a> as my coding agent of choice for quite a while now. It&apos;s great that I can use both my GitHub Copilot subscription and my own Foundry models with it, and swap between them with a single keybinding.</p><p>In this post I&apos;ll show you how I&apos;ve configured the providers to my own Foundry, which hosts both the recently announced Anthropic models as well as models by OpenAI and others. 
This does not directly conform to the official way of configuring them described <a href="https://opencode.ai/docs/providers/?ref=huuhka.net#azure-openai" rel="noreferrer">here</a> and <a href="https://opencode.ai/docs/providers/?ref=huuhka.net#azure-cognitive-services" rel="noreferrer">here</a> in the docs, but it does work and is arguably simpler.</p><p>I expect that you already have deployments of the models running in Foundry, but if not, you can cobble them together with something like this (note that Anthropic only works on pay-as-you-go subs):</p><pre><code class="language-bicep">// params.bicep
param anthropicDeployments array = [
  {
    deploymentName: &apos;claude-sonnet-4-5&apos;
    modelName: &apos;claude-sonnet-4-5&apos;
    version: &apos;20250929&apos;
    sku: {
      name: &apos;GlobalStandard&apos;
      capacity: 450
    }
    format: &apos;Anthropic&apos;
    thinking: true
  }
  {
    deploymentName: &apos;claude-opus-4-5&apos;
    modelName: &apos;claude-opus-4-5&apos;
    version: &apos;20251101&apos;
    sku: {
      name: &apos;GlobalStandard&apos;
      capacity: 450
    }
    format: &apos;Anthropic&apos;
    thinking: true
  }
  {
    deploymentName: &apos;claude-haiku-4-5&apos;
    modelName: &apos;claude-haiku-4-5&apos;
    version: &apos;20251001&apos;
    sku: {
      name: &apos;GlobalStandard&apos;
      capacity: 450
    }
    format: &apos;Anthropic&apos;
    thinking: false
  }
]

param openAiDeployments array = [
  {
    deploymentName: &apos;gpt-5.2&apos;
    modelName: &apos;gpt-5.2&apos;
    version: &apos;2025-12-11&apos;
    sku: {
      name: &apos;GlobalStandard&apos;
      capacity: 50
    }
    format: &apos;OpenAI&apos;
    thinking: true
  }
  {
    deploymentName: &apos;gpt-5.1-codex-max&apos;
    modelName: &apos;gpt-5.1-codex-max&apos;
    version: &apos;2025-12-04&apos;
    sku: {
      name: &apos;GlobalStandard&apos;
      capacity: 200
    }
    format: &apos;OpenAI&apos;
    thinking: true
  }
]</code></pre><pre><code class="language-bicep">var deployments = concat(anthropicDeployments, openAiDeployments)

@batchSize(1) // Runs into conflict if run in parallel
resource model_deployments &apos;Microsoft.CognitiveServices/accounts/deployments@2025-10-01-preview&apos; = [
  for deployment in (deployModels ? deployments : []): {
    parent: foundry
    name: deployment.deploymentName
    sku: deployment.sku
    properties: {
      model: {
        name: deployment.modelName
        version: deployment.version
        format: deployment.format
      }
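      // Anthropic-format deployments carry extra provider metadata; other formats leave it null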
      #disable-next-line BCP037
      modelProviderdata: deployment.format == &apos;Anthropic&apos;
        ? {
            countryCode: tenant().countryCode
            industry: &apos;consulting&apos;
            organizationName: tenant().displayName
          }
        : null
      #disable-next-line BCP073 // The api version thinks this is a read only value
      dynamicThrottlingEnabled: deployment.sku.name == &apos;GlobalStandard&apos; ? false : true
      versionUpgradeOption: &apos;OnceCurrentVersionExpired&apos;
    }
  }
]</code></pre><p>You&apos;ll also need the API key for the Foundry, as OpenCode does not yet support OAuth to Foundry directly (though you can write a plugin).</p><p><strong>The OpenCode config</strong></p><p>You could also generate this config directly from the Bicep outputs if you&apos;d want. I&apos;ll leave that up to you. Here are examples of how I have it set up.</p><pre><code class="language-json">// ~/.local/share/opencode/auth.json
{
  &quot;azure-anthropic&quot;: {
    &quot;type&quot;: &quot;api&quot;,
    &quot;key&quot;: &quot;KEYVALUE&quot;
  },
  &quot;azure-openai&quot;: {
    &quot;type&quot;: &quot;api&quot;,
    &quot;key&quot;: &quot;KEYVALUE&quot;
  },
  &quot;github-copilot&quot;: {
    ....
  }
}</code></pre><pre><code class="language-json">// ~/.config/opencode/opencode.json(c)
{
  &quot;$schema&quot;: &quot;https://opencode.ai/config.json&quot;,
  &quot;provider&quot;: {
    &quot;azure-anthropic&quot;: {
      &quot;name&quot;: &quot;Foundry (Anthropic)&quot;,
      &quot;npm&quot;: &quot;@ai-sdk/anthropic&quot;,
      &quot;api&quot;: &quot;https://somefoundry.services.ai.azure.com/anthropic/v1&quot;,
      &quot;models&quot;: {
        &quot;claude-sonnet-4-5&quot;: {
          &quot;id&quot;: &quot;claude-sonnet-4-5&quot;,
          &quot;name&quot;: &quot;claude-sonnet-4-5&quot;,
          &quot;tool_call&quot;: true,
          &quot;attachment&quot;: true,
          &quot;reasoning&quot;: true,
          &quot;temperature&quot;: true,
          &quot;modalities&quot;: {
            &quot;input&quot;: [&quot;text&quot;, &quot;image&quot;],
            &quot;output&quot;: [&quot;text&quot;]
          }
        },
        &quot;claude-opus-4-5&quot;: {
          &quot;id&quot;: &quot;claude-opus-4-5&quot;,
          &quot;name&quot;: &quot;claude-opus-4-5&quot;,
          &quot;tool_call&quot;: true,
          &quot;attachment&quot;: true,
          &quot;reasoning&quot;: true,
          &quot;temperature&quot;: true,
          &quot;modalities&quot;: {
            &quot;input&quot;: [&quot;text&quot;, &quot;image&quot;],
            &quot;output&quot;: [&quot;text&quot;]
          }
        },
        &quot;claude-haiku-4-5&quot;: {
          &quot;id&quot;: &quot;claude-haiku-4-5&quot;,
          &quot;name&quot;: &quot;claude-haiku-4-5&quot;,
          &quot;tool_call&quot;: true,
          &quot;attachment&quot;: true,
          &quot;reasoning&quot;: false,
          &quot;temperature&quot;: true,
          &quot;modalities&quot;: {
            &quot;input&quot;: [&quot;text&quot;, &quot;image&quot;],
            &quot;output&quot;: [&quot;text&quot;]
          }
        }
      }
    },
    &quot;azure-openai&quot;: {
      &quot;name&quot;: &quot;Foundry (OpenAI)&quot;,
      &quot;npm&quot;: &quot;@ai-sdk/openai&quot;,
      &quot;api&quot;: &quot;https://somefoundry.services.ai.azure.com/openai/v1&quot;,
      &quot;models&quot;: {
        &quot;gpt-5.1-codex-max&quot;: {
          &quot;id&quot;: &quot;gpt-5.1-codex-max&quot;,
          &quot;name&quot;: &quot;gpt-5.1-codex-max&quot;,
          &quot;tool_call&quot;: true,
          &quot;attachment&quot;: true,
          &quot;reasoning&quot;: true,
          &quot;temperature&quot;: true,
          &quot;modalities&quot;: {
            &quot;input&quot;: [&quot;text&quot;, &quot;image&quot;],
            &quot;output&quot;: [&quot;text&quot;]
          }
        },
        &quot;gpt-5.2&quot;: {
          &quot;id&quot;: &quot;gpt-5.2&quot;,
          &quot;name&quot;: &quot;gpt-5.2&quot;,
          &quot;tool_call&quot;: true,
          &quot;attachment&quot;: true,
          &quot;reasoning&quot;: true,
          &quot;temperature&quot;: true,
          &quot;modalities&quot;: {
            &quot;input&quot;: [&quot;text&quot;, &quot;image&quot;],
            &quot;output&quot;: [&quot;text&quot;]
          }
        }
      }
    }
  }
}
</code></pre><p>Aaand it should just work. Enjoy!</p>]]></content:encoded></item><item><title><![CDATA[Research - Plan - Implement]]></title><description><![CDATA[Core ideas and GH + OpenCode implementations of the Research - Plan - Implement development flow]]></description><link>https://www.huuhka.net/research-plan-implement/</link><guid isPermaLink="false">69a40b94deaf9d0001d854bd</guid><category><![CDATA[AI]]></category><category><![CDATA[GitHub]]></category><category><![CDATA[OpenCode]]></category><dc:creator><![CDATA[Pasi Huuhka]]></dc:creator><pubDate>Wed, 17 Dec 2025 11:48:00 GMT</pubDate><media:content url="https://huuhkadotnet.blob.core.windows.net/prod/images/2026/3/11032_11032_image.png" medium="image"/><content:encoded><![CDATA[<div class="kg-card kg-callout-card kg-callout-card-yellow"><div class="kg-callout-emoji">&#x1F195;</div><div class="kg-callout-text">Update: Added GH Copilot versions of the agents in the repo, though these haven&apos;t seen much real world usage</div></div><div class="kg-card kg-callout-card kg-callout-card-blue"><div class="kg-callout-emoji">&#x1F4A1;</div><div class="kg-callout-text">This post is part of a larger Agentic Dev theme:<br>- <a href="https://www.huuhka.net/a-mental-model-for-llm-tooling-primitives/" rel="noreferrer">A mental model for LLM tooling primitives</a><br>- <a href="https://www.huuhka.net/research-plan-implement/" rel="noreferrer">Research - Plan - Implement</a><br>- <a href="https://www.huuhka.net/primary-vs-subagents-in-llm-harnesses/" rel="noreferrer">Primary vs Subagents in LLM harnesses</a><br>- <a href="https://www.huuhka.net/how-i-currently-develop-with-llm-models-early-2026/" rel="noreferrer">How I currently develop with LLM models (Early 2026)</a> <br>- <a href="https://www.huuhka.net/building-your-own-pr-reviewer-with-coding-agents/" rel="noreferrer">Building your own PR reviewer with coding agents</a></div></div><img src="https://huuhkadotnet.blob.core.windows.net/prod/images/2026/3/11032_11032_image.png" alt="Research - Plan - Implement"><p>Many of the LLM harness creators have been experimenting with &quot;spec-driven development&quot; flows lately. Some examples of this are <a href="https://github.com/github/spec-kit?ref=huuhka.net" rel="noreferrer">GitHub&apos;s Spec-kit</a>, Fission-AI&apos;s <a href="https://github.com/Fission-AI/OpenSpec?ref=huuhka.net" rel="noreferrer">OpenSpec</a>, and <a href="https://github.com/humanlayer/humanlayer?ref=huuhka.net" rel="noreferrer">Humanlayer&apos;s</a> Research Plan Implement flow.</p><p>I&apos;ve been testing all of these and while they all work well, I&apos;ve found that the RPI flow works best for my workflows. It&apos;s simple enough, easy to implement in any tool, and easy to scale depending on the complexity of the task you&apos;re working on. Included in this post are examples for implementing this flow in both OpenCode and GitHub Copilot.</p><p><strong>The core idea</strong> with all of these is to protect the context of the models and avoid going over 40-60% of the total context window, so you stay in the &quot;smart zone&quot; for all the work you&apos;re doing. 
In practice this is done via saving the state in a markdown file in between the steps.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://huuhkadotnet.blob.core.windows.net/prod/images/2026/3/11032_image.png" class="kg-image" alt="Research - Plan - Implement" loading="lazy" width="1899" height="1068"><figcaption><span style="white-space: pre-wrap;">Core idea</span></figcaption></figure><p>This flow <strong>scales</strong> both <strong>up </strong>and <strong>down</strong>. </p><ul><li>For <strong>small tasks,</strong> I often skip the full ceremony and just talk directly with the coding agent.</li><li>For <strong>medium tasks</strong>, one research file + one plan is usually enough. </li><li>For <strong>large or messy work</strong>, I split the effort into multiple research docs and multiple plans (by domain, service, or milestone) so context stays focused and decisions stay traceable. </li></ul><p>The key is not to be dogmatic: do the least structure needed to reliably reach the result you want. Start lightweight, add process only when complexity or risk justifies it, and keep adjusting based on what actually works for your way of building.</p><h2 id="my-agent-stack">My Agent Stack</h2><p>The main agents I use are just named research, plan and implement. The implementer can be just any normal coding agent. The point of this flow is to get the plan ready so the implementation can start.</p><p>This works for every harness that supports custom agents (most of them do). My implementation is mainly for <a href="https://opencode.ai/?ref=huuhka.net" rel="noreferrer">OpenCode</a>, but you can ask your LLM to translate these into any tool of your choice very easily. I basically just took the prompts from the HumanLayer repos, and modified them to meet my needs.</p><h3 id="the-three-agents">The three agents</h3><p>I keep this intentionally simple: one agent for research, one for planning, and one for implementation.</p><p><a href="https://github.com/DrBushyTop/humanlayer-opencode/blob/master/.opencode/agent/research-humanlayer.md?ref=huuhka.net" rel="noreferrer"><strong>Research Agent</strong></a><strong> (</strong><a href="https://github.com/DrBushyTop/humanlayer-opencode/blob/master/.github/agents/hl-research.agent.md?ref=huuhka.net" rel="noreferrer"><strong>Copilot example</strong></a><strong>)</strong><br>The research agent&#x2019;s only job is to understand and document the current state of the codebase. Not &#x201C;fix,&#x201D; not &#x201C;improve,&#x201D; not &#x201C;rewrite&#x201D; - just map what exists.</p><p>It looks for relevant files, traces how things currently work, and writes the findings into a research markdown file. 
That becomes a stable handoff artifact for the next stage.</p><p>In my mind the main point here is to distill hundreds, thousands, maybe a hundred thousand lines of code into a very compact form describing where the next phase should actually read the most important info from.</p><p><a href="https://github.com/DrBushyTop/humanlayer-opencode/blob/master/.opencode/agent/plan-humanlayer.md?ref=huuhka.net" rel="noreferrer"><strong>Plan Agent</strong></a><strong> (</strong><a href="https://github.com/DrBushyTop/humanlayer-opencode/blob/master/.github/agents/hl-plan.agent.md?ref=huuhka.net" rel="noreferrer"><strong>Copilot Example</strong></a><strong>)</strong><br>The plan agent turns research into a concrete implementation plan with phases and checkpoints.</p><p>Its role is to reduce ambiguity before coding starts:</p><ul><li>what files are expected to change</li><li>what is explicitly out of scope</li><li>what &#x201C;done&#x201D; means for each phase</li><li>what should be verified before continuing</li></ul><p>The output is a plan file that the implementation phase can execute directly.<br>At this point, implementation should feel like execution, not exploration. </p><p>The agent is guided to ask any open questions from the user. Sometimes this does need some extra nudging to make it happen, but it&apos;s important that you actually read the plan and discuss with the agent to clarify the actual implementation and also understand yourself what the feature actually needs to do. It&apos;s much cheaper to get the details right at this point than tuning after the implementation.</p><p><a href="https://github.com/DrBushyTop/humanlayer-opencode/blob/master/.opencode/agent/implementer.md?ref=huuhka.net" rel="noreferrer"><strong>Implement Agent</strong></a><strong> (</strong><a href="https://github.com/DrBushyTop/humanlayer-opencode/blob/master/.github/agents/hl-implement.agent.md?ref=huuhka.net" rel="noreferrer"><strong>Copilot Example</strong></a><strong>)</strong><br>The implement agent executes the approved plan phase by phase.<br>It is optimized for disciplined delivery:</p><ul><li>follow the plan</li><li>make targeted changes</li><li>run checks</li><li>surface mismatches between &#x201C;plan vs reality&#x201D; quickly</li></ul><p>If reality differs from the plan, the goal is to adapt while preserving intent, not freestyle a new design in the middle of coding.</p><p>In other words, this agent is for shipping, not for deciding architecture on the fly. However, like I mentioned earlier you could replace this part with whatever you want.</p><p><strong>About the Slash Commands</strong><br>You don&#x2019;t actually need slash commands for this flow.<br>I use them because they are convenient routing shortcuts:</p><ul><li><a href="https://github.com/DrBushyTop/humanlayer-opencode/blob/master/.opencode/command/research.md?ref=huuhka.net" rel="noreferrer">/research</a> ... -&gt; sends the prompt to the research agent</li><li>/<a href="https://github.com/DrBushyTop/humanlayer-opencode/blob/master/.opencode/command/plan.md?ref=huuhka.net" rel="noreferrer">plan</a> ... -&gt; sends the prompt to the plan agent</li><li>/<a href="https://github.com/DrBushyTop/humanlayer-opencode/blob/master/.opencode/command/implement.md?ref=huuhka.net" rel="noreferrer">implement</a> ... -&gt; sends the prompt to the implement agent</li></ul><p>That&#x2019;s mostly it. 
They&#x2019;re ergonomic wrappers around prompt dispatch, not a magical requirement.</p><p>If your tool can target agents directly, you can run the same workflow without slash commands at all. The repo has some other examples for handoff, iteration and oneshotting implementations, but I&apos;ve not really experimented with the usefulness of those, as opencode tends to do the compaction step itself, which quite closely matches the handoff logic.</p><h3 id="quick-note-about-the-repo">Quick Note About the Repo</h3><p>I&#x2019;m linking the repo mainly as a reference for people who want to peek at how this is wired.</p><p>It is not really packaged for public consumption or polished as a &#x201C;drop-in product.&#x201D;</p><p>Still, if you&#x2019;re curious, you can browse it, copy ideas, and adapt the structure to your own harness/tooling setup. The concepts are portable even if the exact implementation is opinionated.</p>]]></content:encoded></item></channel></rss>