What is harness engineering?

A few of you have read The harness is the product or Meet agentic-pi and very reasonably come back with “I follow the conclusion but I do not have the words for the thing you keep talking about yet.” Fair. This is the post I should probably have written first. What is a harness, why does anyone need one, and how is it different from the AI coding setup most teams already have?

If you have been following along, you can skip it. If you have not, this is the warm-up before the longer reads.

One useful bit of context up front. By the time I wrote The harness is the product at the end of April, both labs had already published their own takes on this idea and the vocabulary had started to settle around it. OpenAI’s Ryan Lopopolo calls it harness engineering and frames it as “humans steer, agents execute” - the discipline showing up “in the scaffolding rather than the code”, written off the back of a five-month internal experiment where Codex agents shipped a million-line product with no hand-written source. Anthropic’s Harness design for long-running apps puts the same point more bluntly: “every component in a harness encodes an assumption about what the model cannot do on its own.” My working definition lines up with both, and the rest of this post is the same idea told in my own words - with the bits that matter most to small teams pulled to the front.

What the tools actually are

The AI coding tools dominating the conversation right now - Claude Code, OpenAI Codex, Gemini CLI, opencode, Pi, Cursor’s agent mode, Cline, OpenClaw, the Kimi-powered terminals doing the rounds on X, the long tail of forks of each of them - all share the same essential shape. They are agents wrapped around a large language model, and each of them runs a version of the same loop:

EACH LLM CALL SEES system prompt · tool definitions · conversation history so far agentic loop · one turn Your prompt Model evaluates tool_use Tool runs tool_result appended to context no more tool calls · end_turn Final answer
Figure 1. The agentic coding loop, common to every popular coding agent. Adapted from Anthropic's Agent SDK documentation.

Read this carefully because it is the whole concept. A prompt arrives. The model decides whether to answer directly or to call a tool - read a file, run a shell command, search the web, invoke an MCP server, execute a skill. If it calls a tool, the runtime executes it, feeds the result back into the conversation, and asks the model what to do next. The loop repeats until the model says “I am done” and returns its final answer.

That is it. Every “agentic” coding tool you have seen this year is some variation of this loop. The differences are about which model is on the other end, what tools are available, how big the context window is, what the box around the agent looks like, and what it can see.

The first-order consequence of this design is that the tool itself is generic. Claude Code does not know whether you are asking it to refactor a controller, summarise a customer ticket, write a migration plan, or compose a haiku. It has whatever tools you have configured globally - every MCP server, every skill, every file in the working directory - and it will happily reach for any of them. Generality is the headline feature.

Why generality is also the problem

Generality is great when you are exploring. It is much less great when you are trying to ship the same kind of change a hundred times with the same level of quality.

A few things tend to bite you in production:

  • Context is broad and undirected. The agent can read any file in the repo, run any tool, browse the docs of any MCP server you have wired up. Most of the time, the relevant context is a thin slice of all of that, and the model has to guess which slice. Sometimes the guess is wrong, and you only find out at the end.
  • Tools are everywhere. The MCP ecosystem is fantastic and growing fast, but a coding agent with thirty MCP servers attached is staring at a tool catalogue rather than at a workflow. Most of those tools are irrelevant to most tasks, and the more of them you load the more the model gets distracted reaching for them.
  • The model is whatever you set globally. You picked Sonnet, or Opus, or Haiku, and that is the model for the entire session - including the bits that would have been faster and cheaper with a smaller model, and the bits that would have been better with a bigger one.
  • The behaviour drifts. Run the same prompt twice and you will get two adjacent but different sequences. The agent decides what to read, what to run, in what order. That is fine for one-off exploration. It is not fine when you need the same result every time.

You can take a surprisingly long way down this road with CLAUDE.md and AGENTS.md files - put project conventions in there, tell the model how the repo is structured, document what good looks like. This works, you should do it, and I first wrote about it back in 2025 when I started seriously using Claude Code. The advice still holds. But every repo I have ever maintained drifts away from these files faster than the team updates them. They put a fence around the agent that the agent can step over the moment something interesting happens inside the fence. Keeping them current and useful is, itself, real ongoing work.

Spec-driven development - spec-kit, BMAD, the wider movement of “write the spec first, then have the agent build to it” - is the next step up. It is genuinely effective. Forcing the agent to produce a spec before it produces code, or having a separate review pass that checks the implementation against the spec, materially reduces the “cut corners and ship something approximately right” failure mode that plagues big open prompts. If you are not doing this already, you should be.

But notice what is still true even with spec-driven workflows. The agent is still running its open-ended loop, with its full tool set, against an open context, under your guidance. The spec is a better fence. It is not a workflow. And as soon as the work being done is something you do over and over - triaging customer issues, reviewing PRs, processing release candidates - the lack of a real workflow becomes the bottleneck.

What a harness actually is

A harness is one step further.

Instead of an agent that runs in a single open loop until it thinks it is done, you have a deterministic workflow that decomposes the work into named phases. Each phase is itself a one-shot agent run. And critically:

  • Each phase has only the context it needs - direct from the input, plus whatever the harness has retrieved from the systems of record on its behalf.
  • Each phase has only the tools it needs - read-only for analysis, write-permitted for implementation, GitHub-only for triage, no internet for the executor.
  • Each phase uses the model that fits the work - cheap and fast for triage, smarter and slower for diagnosis, biggest and most expensive only where it earns its keep.

It is worth saying out loud that a harness is just code. Nothing magical. Code in whatever language your team already uses - TypeScript, Python, Go - that calls a coding agent in a particular sequence, with particular inputs, in particular sandboxes. You build it, test it, version it, deploy it, and monitor it the same way you do any other piece of software your team owns. Some of the projects in the tour - Last Light, Archon, Symphony - let you express each workflow in YAML rather than directly in code, and that is genuinely useful: it makes the workflow editable without a redeploy, easier to inspect, and possible to build dynamic editor surfaces around. But the YAML is not what makes a harness a harness. The deterministic sequencing is. You can do this entirely in plain TypeScript and have exactly the same thing.

The agent still gets to make every reasoning call inside a phase. It does not get to decide which phase runs next, whether to skip one, or whether to combine two. That is the harness’s job, and it is the bit that has to live in deterministic code if you want the cycle to be reproducible.

A diagram is worth more than the paragraph above. Here is a concrete example - the kind of harness I run on real repos, taking a customer issue all the way through to a pull request:

CUSTOMER ISSUE · opened on the issue tracker PHASE 1 Triage model: Haiku context: issue body tools: read-only PHASE 2 Diagnose model: Sonnet + logs, telemetry tools: read repo PHASE 3 Root Cause model: Opus + commit history tools: read repo PHASE 4 Build PR model: Sonnet + plan from p.3 tools: repo-write HARNESS · deterministic glue, retrieves context from systems of record, applies permission profiles, spins up sandboxes runs each phase as a one-shot agent, decides what comes next, persists results between phases PULL REQUEST
Figure 2. A four-phase harness for taking a customer issue through to a pull request.

A customer issue comes in. The harness routes it through a fixed sequence of agent calls, and at the end - depending on what each phase concluded - either a clarifying comment, a recommendation, or a pull request lands on the issue. Each box in the diagram is a separate one-shot run of an agent, in its own sandbox, with its own permissions, and on whichever model fits the work. No phase sees anything it does not need to see.

The other thing the diagram is hinting at - and this is the part teams routinely underestimate - is that the harness can retrieve data from systems of record on the agent’s behalf, rather than handing the agent an MCP server and letting it go fishing. This is an operating mode, not a rule. You can absolutely still wire MCP servers (or any other tools) into a phase if that is the right call - sometimes the agent genuinely needs to explore, and an MCP server is the cleanest way to let it. The point of having a harness is that you now get the choice per phase. For the Diagnose phase you might decide the customer database, the logs cluster, and the incident tracker are queried by the harness with whatever scope is appropriate, formatted into the prompt as a known slice of data, and reasoned over by the agent. For the Executor phase you might decide the agent needs full repo write tools and a real shell. Both are fine. The harness-side retrieval option is the one teams miss, because the open-loop agent does not really give it to you - and where it fits, it is much safer, much cheaper in tokens, and reproducible. The same input produces the same retrieval every time, and the auth boundary lives in your code rather than in a config file the model is allowed to read.

The data passing between phases is similarly low-tech and that is the point. The convention most of the harnesses worth looking at have landed on is to hand off through plain markdown files committed to the working branch, or through GitHub issue comments. The Triage phase writes .harness/issue-42/triage.md, commits it, and the Diagnose phase reads it as part of its prompt. Diagnose writes diagnose.md, Root Cause reads both, and so on. No shared agent memory, no database of run state, no message bus - just files in a folder, or comments on an issue. This is deliberate: every handoff is auditable by a human reading the branch or the issue, the git log doubles as the run history, and if a phase produces something weird you can see it before the next phase ever consumes it.

Why this matters

Three things compound once you start running real work through a harness.

First, you can optimise each step independently. If the triage phase is misclassifying tickets, you tune the triage prompt or change the triage model. Nothing else moves. If the executor is producing flaky code, you sharpen the executor’s tests or guardrails. Nothing else moves. In an open-loop agent, the same prompt change might fix one failure mode and introduce two more elsewhere in the run - because nothing in the run is isolated.

Second, you can run evals. Once a phase has a defined input, a defined output, and a fixed prompt, you can run a hundred historical examples through it and measure whether your change made things better or worse. This is the part of harness engineering that feels closest to traditional engineering: contract, test, regress. Evals are how you stop guessing whether a change to a prompt has actually helped.

Third, you can compose. Spec-driven development slots straight into a harness as one or two phases - a Socratic-questioning phase that helps the user write the spec, or a reviewer phase that checks the implementation against the spec. A guardrails phase that fails before any code is written if the test scaffolding is not there. A security-scan phase that runs in parallel with the reviewer. The harness is the place where all these patterns become parts of a production line rather than instructions buried in a README that the agent might or might not have read.

The combined effect is control. You start with a generic agent that can do almost anything but does none of it reliably, and you end up with a specialised pipeline that does one thing extremely well - and that gets measurably better every time you ship an improvement to one of its phases.

Where to go next

If this has clicked into place, the two posts of mine to read are:

  • The harness is the product - the longer argument for why this is the direction engineering is heading, with a tour of nine or ten projects (Stoneforge, Aiden, Paperclip, Runfusion, Archon, Stripe’s Minions, OpenAI’s Symphony spec, and Last Light itself) all converging on the same shape from very different starting points.
  • Meet agentic-pi - the much more granular story of what goes into one phase of a harness: the runtime, the permission profiles, the JSONL event stream, the per-phase model selection. This is where the abstract picture above turns into running code.

And the two from the labs themselves, both of which I would read before mine:

  • Harness engineering: leveraging Codex in an agent-first world - Ryan Lopopolo’s account of an internal OpenAI experiment that shipped a roughly one-million-line product across 1,500+ PRs with no hand-written source, and the scaffolding (AGENTS.md, a docs/ system of record, custom linters, observability hooks) that made it work. The “humans steer, agents execute” framing comes from here.
  • Harness design for long-running apps - Anthropic’s deeper dive on the design patterns: generator / evaluator separation, sprint-based decomposition, context resets vs compaction, contract negotiation between planner and builder. The line “every component in a harness encodes an assumption about what the model cannot do on its own” is the cleanest framing of the discipline I have seen.

If you are starting from zero, start small. Pick one repeatable activity your team does poorly or expensively - PR review, issue triage, release notes, post-incident write-ups - and build the cheapest possible harness around it. Two phases, deterministic glue between them, the smallest model that gets the job done, and an eval set of ten historical examples. Watch what breaks. Tighten the context. Tighten the tools. Add a third phase only when you genuinely cannot get further without one.

Once you have done that once - and watched the same prompt-change-and-re-eval loop bend the output measurably in the direction you wanted - the rest is repetition.

If you have built one of these, or you are halfway through building one and want to compare notes, I would genuinely love to hear about it. You can find me on LinkedIn, or have a look at Last Light and tell me what you would do differently.