The harness is the product
I’ve been building Last Light for a little under a month - the first commit landed on 4 April 2026. It started as a small experiment - a bot that could triage issues on my open source projects while I was otherwise occupied - and very quickly turned into something else. What I thought I was building was a maintenance bot. What I was actually building was a harness: a YAML-driven workflow engine that runs Claude in Docker sandboxes with downscoped GitHub tokens, hands work between Architect / Executor / Reviewer roles, posts progress to GitHub issues, and pauses at approval gates the maintainer can resume from a comment.
The bot is a tiny part of it. The interesting bit is the engine.
The reason it has gone fast is, in retrospect, the whole point of this post: Last Light is building Last Light. Of the 270 commits in the repo as of today, 195 (just over 72%) are authored by last-light[bot] itself. I have written 75. Almost every feature - the YAML workflow engine, the approval gates, the dashboard tabs, the security scan workflow, most of the bug fixes - has gone through the same Architect → Executor → Reviewer cycle the bot runs on everyone else’s repos. When I find a class of mistake, I add a skill or a check, and from then on the bot catches it on every future PR, including the ones it raises against itself. Four weeks ago this was a README. It is now running a security scan on its own pull requests.
Note: This is very much a draft; I’d appreciate any comments or feedback!
The longer I’ve worked on it, the more obvious it has become that I am not unusual in this. Everywhere I look there are people - solo, in tiny teams, inside large companies - building variations of the same thing. Different runtimes, different stacks, different opinions about who should approve what, but the same underlying pattern. And once you see it, you see it everywhere. This isn’t a fad. This is where engineering is going, and going fast.
I want to use this post to walk through what I’m seeing across the field, summarise a handful of projects worth knowing about, and then talk about what I think this means - particularly for small teams and for what we are paid to do as engineers.
Where this comes from
If you want to understand the shape of what is happening, you should read Sigrid Jin’s post about how a clean-room rewrite of Claude Code’s TypeScript was produced by an agent system called oh-my-codex (OmX) while the developer slept. The output of that work - a project called Claw Code - became the fastest open source repo in history to cross 100K stars on GitHub. The bit that stuck with me, and that I have quoted to anyone who will listen, is this:
The code is a byproduct. The thing worth studying is the system that produced it.
That single sentence is the whole game. The artefact of a build cycle - the diff, the PR, the merged commit - is no longer the interesting thing. The interesting thing is the production line. The sequencing of roles. The decisions about what context goes where. The verification that catches drift. The persistence loop that keeps an agent on task across context windows. The semantics of a “lore commit” that captures intent and risk for a future agent (or human) who has to pick up the work. All of that is the system. And the system is what we are now in the business of building.
OmX itself is worth a look: 33 prompt templates, role-based agents ($architect, $executor, $reviewer, and many more), a $ralph mode that loops persistently until an architect verifies completion, a $team mode for parallel review, and a separate daemon called clawhip that routes events between agents over Discord channels. It runs on top of OpenAI Codex CLI inside tmux panes. It is not pretty in the conventional sense. It is, however, doing real work, in volume, on a real codebase, and the patterns are spreading.
A tour of what people are building
I want to be clear that none of these are theoretical. All of these are running, available, and built (in most cases) by individuals or very small teams. This is a category, not a list of competitors.
Stoneforge
Stoneforge targets developers who are already running 3-5 agents in parallel and need help coordinating them. It splits responsibilities between Smithy (the orchestrator that spawns agents and dispatches tasks) and Quarry (an event-sourced data SDK with SQLite for queries and JSONL as the source of truth - so you get fast lookups and a git-mergeable audit trail in one go). Director agents plan work, a dispatch daemon assigns ready tasks to isolated workers (each in its own git worktree to prevent merge conflicts), and a steward agent auto-merges successful branches.
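The worktree-per-task idea generalises well beyond Stoneforge, and it is simple enough to sketch. This is not their code - just the generic pattern, with a hypothetical .worktrees/ layout - but it is what lets several agents write to the same repository without treading on each other’s checkouts:

```typescript
import { execFileSync } from "node:child_process";
import { join } from "node:path";

// Give each task its own working copy and branch so parallel agents never
// edit the same checkout. Hypothetical layout: .worktrees/<taskId>
// (keep .worktrees/ in .gitignore).
function createTaskWorktree(repoRoot: string, taskId: string): string {
  const worktreePath = join(repoRoot, ".worktrees", taskId);
  const branch = `agent/${taskId}`;
  // `git worktree add -b <branch> <path>` creates the branch and the
  // isolated checkout in one step.
  execFileSync("git", ["worktree", "add", "-b", branch, worktreePath], {
    cwd: repoRoot,
    stdio: "inherit",
  });
  return worktreePath;
}

// Once a branch has merged, the cleanup is equally deterministic.
function removeTaskWorktree(repoRoot: string, taskId: string): void {
  const worktreePath = join(repoRoot, ".worktrees", taskId);
  execFileSync("git", ["worktree", "remove", worktreePath], {
    cwd: repoRoot,
    stdio: "inherit",
  });
}
```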
The interesting choice they have made is explicit: no human approval gates before agents take actions. Their reasoning is that approval prompts bottleneck parallel pipelines, so agents read, write, execute, and push autonomously. The team flag this as a tradeoff requiring caution and they are right - it is a different position from where Last Light sits, but it is a coherent answer to the same problem.
h2
h2 is the most explicitly tmux-shaped of the bunch. It calls itself an “agent runner, messaging, and orchestration layer,” and its philosophy is that it does not want to be a custom harness at all. Instead it wraps existing agent CLIs (Claude Code, ChatGPT Pro) by talking to them through their terminal interface. No API keys, no special integration. You attach and detach from agent terminals as you would tmux panes, agents discover and message each other with priority levels (interrupt, normal, idle-first, idle), and a Telegram bridge lets you steer the whole thing from your phone.
Harness-agnostic by design. Worth noting because it shows a different ambition - “I do not want to replace your tools, I want to coordinate them.”
Aiden
Aiden takes the SDLC view. Its pitch is that the bottleneck is no longer code generation, it is coordination - between sprint boards, GitHub, CI, agent logs, customer calls, PRDs. The platform consolidates these into one shared context graph so agents are not constantly re-discovering what the team already knows. Humans and agents appear together on the same sprint board with live status, the platform routes tasks to the best executor based on context, and human checkpoints are inserted at high-impact moments.
This one is less about how the agent thinks and more about how the agent fits into the way teams actually work. I think this layer - the connective tissue between agents and the existing way an organisation runs - is where a lot of the next year’s effort will go.
Mainframe
Mainframe is a desktop application that consolidates multiple agent CLIs into one visual workspace. Where the rest of this list is mostly headless, Mainframe is “visual-first”: in-app file editing, live previews, session history, kanban boards alongside agent sessions, integrated sandbox with browser inspector, and a mobile companion app for remote approvals over Cloudflare Tunnel. It has multi-provider support (Claude, Gemini, others), and an API-first daemon mode that lets you build your own UI on top.
For people who do not love the terminal - and there are a lot of them - Mainframe is the most accessible entry point.
Runfusion
Runfusion is the most distributed of the lot. It coordinates agent work across multiple machines - laptops, servers, cloud VMs, even mobile devices - and runs agents in plan → review → execute → review cycles with quality gates before merging. It is git-native (worktrees per task), supports Anthropic / OpenAI / Ollama / others, and can track hierarchical work as Mission → Milestone → Task. The framing is “agent companies” - teams of specialised agents working together on a larger mission, with the platform able to run for weeks with minimal human intervention.
Multica
Multica commits to the teammate model. Agents receive task assignments through a unified issue system, autonomously execute work, report progress over WebSocket streams, and participate in conversations. The platform maintains agent profiles, workspace-level organisation, and supports multiple runtimes (local daemons, cloud instances). Crucially it has a concept of persistent skills - solutions become reusable for future tasks - and it works across Claude Code, Codex, OpenClaw, Hermes, Gemini. It is positioned for small teams collaborating on real projects rather than solo operators.
The teammate framing is doing a lot of work here. Once you put an agent on a kanban board next to a human, the conceptual gap closes very quickly.
Eva
Eva is a focused take: GitHub repos connect to Eva, Eva provisions sandboxed cloud development environments via Daytona, and Claude works inside them with full development capability - shell, package install, tests, builds, previews, opening PRs. MCP integration lets agents query production-shaped services (Convex, Supabase) directly. The defining choice is “real environments, sandboxed”. Agents get genuine capability, not simulated tooling, but they are isolated. Vite / React frontend, Convex backend, Daytona for sandboxing. Self-hosted, open source, no vendor lock-in.
Archon
Archon calls itself “the first open-source harness builder for AI coding,” and that framing is the most honest description of where this category is going. It encodes development processes as YAML workflows. The execution layer alternates between deterministic nodes (bash, tests, git) and AI nodes (planning, generation, review). 17 pre-built workflows ship with it - issue fixes, feature development, PR reviews, refactors. Platform adapters span Web / CLI / Telegram / Slack / Discord / GitHub, all feeding into the same orchestrator.
If you only look at one project on this list to understand what a harness is, look at Archon. The “mix and match deterministic logic with AI reasoning at specific steps” framing is exactly right, and it is one of the projects that directly shaped how I designed Last Light - alongside Sigrid Jin’s post and Stripe’s Minions write-up. I am not claiming originality here. The shape is being deliberately copied, refined, and re-implemented across the field, because the problem is the same and the constraints push you to roughly the same answer.
Stripe’s Minions
The most striking data point is from inside a very large company. Minions are Stripe’s homegrown coding agents, and they merge more than 1,300 pull requests every week containing zero human-written code. Not assisted code, not co-authored code - one-shot, end-to-end, agent-produced PRs. They are forked from Goose (Block’s open source coding agent), they orchestrate work using “blueprints” (deterministic code combined with flexible agent loops), and they sit on top of an MCP server that exposes 400+ internal tools and SaaS integrations.
Two pieces of the architecture are worth pulling out:
- Deterministic prefetching - the orchestrator scans the prompt for links and keywords, then curates a surgical subset of about 15 relevant tools rather than dumping all 400 into context. This is exactly the kind of thing that you cannot get a model to do well by prompting. You build it - there is a sketch of the idea just after this list.
- Isolation as the permission system - every Minion run spins up a devbox identical to those used by human engineers, pre-warmed in about 10 seconds, with no internet access and no production access. Because the box is sandboxed, Stripe eliminates the need for human permission checks during execution. CI / tests / static analysis are the gate, not a person.
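For what it is worth, deterministic prefetching does not need to be clever to be valuable. A minimal sketch of the idea - my generic illustration, not Stripe’s implementation, with a hypothetical keyword field on each tool - looks something like this:

```typescript
interface ToolSpec {
  name: string;
  description: string;
  keywords: string[]; // hypothetical: maintained alongside each tool's registration
}

// Deterministic context curation: no model call involved. Score every tool
// against the words in the prompt and keep only the best handful.
function prefetchTools(prompt: string, catalogue: ToolSpec[], limit = 15): ToolSpec[] {
  const words = new Set(
    prompt.toLowerCase().split(/[^a-z0-9]+/).filter((w) => w.length > 2),
  );
  return catalogue
    .map((tool) => ({
      tool,
      hits: tool.keywords.filter((k) => words.has(k.toLowerCase())).length,
    }))
    .filter((s) => s.hits > 0)
    .sort((a, b) => b.hits - a.hits)
    .slice(0, limit)
    .map((s) => s.tool);
}
```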
Stripe’s own framing is what I find most useful: the reason this works has almost nothing to do with the model. It works because Stripe spent years before LLMs existed building the infrastructure - typed Ruby with Sorbet, homegrown libraries, devboxes, CI - that humans needed to ship safely. The agents inherited all of that, and the model became a substitutable component. The harness was already there.
OpenAI Symphony
The most interesting move in the field, though, is from OpenAI - and it is interesting precisely because of what OpenAI did not do. They did not ship a hosted product. They did not announce a managed orchestrator. They published a spec. Symphony is a literal SPEC.md file with RFC 2119 normative language (MUST, SHOULD, MAY) that defines what a coding-agent orchestrator is supposed to do, alongside a reference implementation in Elixir. Their explicit position is that they do not plan to maintain Symphony as a standalone product - “think of it as a reference implementation.”
The spec describes a long-running service that polls an issue tracker (Linear in v1), creates an isolated workspace per issue, and runs a coding-agent session inside it. The components are exactly the shape you would expect by now: a workflow loader (reading a WORKFLOW.md versioned in the repo, so the prompt lives with the code), a config layer, a tracker client, an orchestrator, a workspace manager, an agent runner, structured logs, an optional status surface. Internal teams at OpenAI saw landed PRs rise 500% in the first three weeks of using it.
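The control flow the spec describes fits in a few lines. This is not the Elixir reference implementation - the tracker, workspace, and agent-runner interfaces below are hypothetical stand-ins - but the point survives translation: deterministic code does the polling, provisioning, and cleanup, and the model only appears inside the agent runner.

```typescript
// Hypothetical interfaces standing in for the tracker client, workspace
// manager, and agent runner the spec describes.
interface Tracker { nextReadyIssue(): Promise<{ id: string; title: string } | null>; }
interface Workspaces { create(issueId: string): Promise<{ path: string; destroy(): Promise<void> }>; }
interface AgentRunner { run(workspacePath: string, workflowFile: string): Promise<void>; }

async function orchestrate(tracker: Tracker, workspaces: Workspaces, agent: AgentRunner) {
  // Long-running service: poll the tracker, one isolated workspace per issue.
  while (true) {
    const issue = await tracker.nextReadyIssue();
    if (!issue) {
      await new Promise((resolve) => setTimeout(resolve, 30_000)); // idle poll interval
      continue;
    }
    const workspace = await workspaces.create(issue.id);
    try {
      // WORKFLOW.md is versioned in the repo, so the prompt ships with the code.
      await agent.run(workspace.path, "WORKFLOW.md");
    } finally {
      await workspace.destroy();
    }
  }
}
```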
Two things are worth pulling out. First, the Elixir choice is deliberate. The supporting write-up notes that “when code is effectively free, you can finally pick languages for their strengths, like Elixir’s concurrency.” That is a sentence to read twice. When the cost of writing the implementation drops to near zero, the constraints on technology choice change - you pick the language that fits the problem rather than the language that fits your team’s existing skill profile, because the team’s existing skill profile is no longer the bottleneck on output.
Second, and more importantly, OpenAI’s stance is essentially: this is the shape, this is what we are using internally, here is the spec, build your own. The implication, said quietly but unmistakeably, is that the harness layer is something every team needs but nobody can sell - because the harness is where your team’s specific knowledge lives. You cannot outsource that. You can adopt a spec, and you can fork a reference implementation, but the actual harness has to belong to you.
I cannot overstate how important I think this signal is. When the company that builds the model says “the orchestration layer is yours to build, here is a specification to help you,” that is the entire industry pointing at the same conclusion. The model is becoming a commodity. The harness is not.
Last Light
I have already opened with Last Light and I will not labour the architecture here because the how-it-works page does that job better than I will. The short version: every behaviour - triage, review, build, health, chat - is a YAML workflow. Phases run in sequence (or as a DAG when one needs to wait for another), each in a fresh Docker sandbox with a downscoped GitHub App token matched to the workflow’s permission profile (read, issues-write, review-write, repo-write). The harness is workflow-agnostic - it reads YAML, executes phases, writes results to SQLite and the session JSONLs. Adding a new behaviour is a new YAML file, not new TypeScript.
The build cycle uses three roles - Architect (read-only analysis), Executor (TDD implementation), Reviewer (independent verification with no shared context) - with a fix loop of up to two cycles. Phases hand off through a .lastlight/issue-N/ folder on the branch: architect-plan.md, executor-summary.md, reviewer-verdict.md. GitHub is the coordination layer - every phase posts progress to the issue, the issue is the authorisation gate, build requests must come from a maintainer @mention.
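As a rough sketch of how that cycle hangs together - with hypothetical stand-ins for the real phase runners and a made-up verdict marker - the control flow is ordinary deterministic code:

```typescript
import { readFile } from "node:fs/promises";
import { join } from "node:path";

// Hypothetical stand-ins for the real phase runners: each runs an Agent SDK
// session in its own sandbox and writes a handoff file to .lastlight/issue-N/.
interface PhaseRunners {
  architect(issue: number): Promise<void>; // writes architect-plan.md
  executor(issue: number): Promise<void>;  // writes executor-summary.md
  reviewer(issue: number): Promise<void>;  // writes reviewer-verdict.md
}

async function buildCycle(repoPath: string, issue: number, phases: PhaseRunners): Promise<boolean> {
  const handoffDir = join(repoPath, ".lastlight", `issue-${issue}`);
  await phases.architect(issue);

  // Up to two fix cycles: the Executor implements, the Reviewer verifies
  // independently, and plain code decides whether to go round again.
  for (let attempt = 1; attempt <= 2; attempt++) {
    await phases.executor(issue);
    await phases.reviewer(issue);
    const verdict = await readFile(join(handoffDir, "reviewer-verdict.md"), "utf8");
    // Hypothetical verdict marker; the real handoff format may differ.
    if (/^verdict:\s*approve/im.test(verdict)) return true;
  }
  return false; // escalate back to the maintainer on the issue
}
```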
I named it Last Light because the original use case was “keep the lights on for repos I have moved on from,” but the shape it took is very obviously a general-purpose harness, and the same instance is now also maintaining drizzle-cube and drizby for me. The lineage is direct: Sigrid Jin’s post is what made me start, OmX gave me the role-based agents and the closed development loop and the lore commit format, Archon is where the YAML-workflow framing came from, the Stripe Minions write-up sharpened my thinking on permission profiles and isolation, and Matt Pocock’s Sandcastle is where I learned that the Claude CLI can be driven as a headless agent inside containers in the first place.
What is the same across all of them
Pull back and look at the list. Stoneforge, h2, Aiden, Mainframe, Runfusion, Multica, Eva, Archon, Minions, Last Light, OmX. Different runtimes, different opinions about parallelism, different stances on human approval, different visual front-ends or no front-end at all. But underneath, the shared shape is:
- A workflow engine that decomposes work into phases - usually some flavour of plan / build / verify, with a fix loop.
- Sandboxed execution of each phase, with the permissions of the box matched to the work it is doing. The permission is in the box, not in the prompt (there is a sketch of this just after the list).
- Role-based agents with explicit constraints rather than vague instructions. The Architect cannot edit, the Reviewer has no shared context with the Executor, the box without internet cannot leak data. Behaviour is enforced by structure.
- Deterministic glue between AI nodes - prefetching context, parsing verdicts, routing events. No LLM in the routing loop. AI is on tap for reasoning, not in charge of orchestration.
- A coordination layer that humans already use - usually GitHub issues / PRs, sometimes Slack, sometimes a kanban board. The agents do not get a private channel. Their work product is on the same surface where the team already lives.
- Persistent state outside the agent - SQLite execution logs, JSONL session histories, plain markdown handoff files on the branch. Memory is in files and tables, not in the model’s context.
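To make the sandboxing point concrete, here is a minimal sketch of what launching a phase in a locked-down container can look like - generic Docker flags, a hypothetical image name and permission profile, not any one project’s launcher:

```typescript
import { execFileSync } from "node:child_process";

// Hypothetical permission profile: what the box is allowed to do is decided
// here, in deterministic code, before the model ever runs.
interface PermissionProfile {
  allowNetwork: boolean; // e.g. false for read-only analysis phases
  scopedToken?: string;  // e.g. an issues-write-only token for triage
}

function runPhaseInSandbox(workdir: string, profile: PermissionProfile, phaseCmd: string[]): void {
  const args = [
    "run", "--rm",
    "-v", `${workdir}:/workspace`, // only this checkout is visible to the agent
    "-w", "/workspace",
  ];
  if (!profile.allowNetwork) args.push("--network", "none"); // no way to leak data
  if (profile.scopedToken) args.push("-e", `GITHUB_TOKEN=${profile.scopedToken}`);
  args.push("agent-sandbox:latest", ...phaseCmd); // hypothetical image name
  execFileSync("docker", args, { stdio: "inherit" });
}
```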
This shape did not come from a paper or a vendor. It is being arrived at, again and again, by people working on the problem from very different starting points - solo builders, indie startups, large companies, and now OpenAI publishing it as a normative spec. When that happens, it is usually because the underlying problem is forcing the design. I started Last Light after reading Sigrid Jin’s post, and I borrowed liberally from Archon and from what Stripe described in the Minions write-up. None of that takes away from the convergence point. If anything, it reinforces it: the pattern is good enough that people are reading each other’s work, copying the shape, and getting something that works in their own setting. That is how a category settles.
What this means for small teams
I have written before about Value Pairs - the idea that AI changes the team topology away from the traditional PM + Designer + 5-6 engineers and towards smaller, paired groups working in concentrated discovery / delivery / validation cycles, with each person empowered by AI rather than carrying a full specialism. I still think that is right, and the harness pattern is what makes it actually work in practice.
The reason is simple. In a Value Pairs model, two engineers can produce the volume of code that previously took five or six. But the quality, security, observability, and consistency that the larger team used to provide through PR reviews, architecture discussions, and ambient peer pressure - that does not happen by itself. Either you wave it off (and accumulate technical debt at terrifying speed), or you build it into the harness.
Concretely, the things a five-person engineering team used to do as ambient overhead - someone notices the test coverage drop, someone questions the decision to introduce a new dependency, someone catches the auth path that does not check the user’s tenant, someone asks why the migration is not reversible - these are all things a harness can do. Not perfectly, but consistently and at every PR, not just the ones that happen to land in front of the right reviewer that week. The Reviewer phase in Last Light, the static analysis gate in Minions, the quality gates in Runfusion, the human checkpoints in Aiden - these are all instances of taking what used to be a cultural norm and codifying it into the production line.
And once it is codified, you can improve it. You can add a new check. You can write a new skill. You can amend a YAML workflow. Each improvement applies to every future PR, not just the ones reviewed by the engineer who happened to learn the lesson. This is the part that I find most exciting: for the first time, the things you learn the hard way can actually compound across the team rather than evaporate into tribal knowledge.
The PR review avalanche
One of the reasons the more sceptical engineers I know are nervous about agentic delivery is the volume problem. If a pair of engineers can ship five times the diffs, who reviews the diffs? The default answer - “humans, but harder” - is obviously not viable. Reviewing well is already hard, and PR fatigue was a well-documented problem before any of this started. Stripe is merging 1,300 agent-produced PRs a week. No human review process scales to that, no matter how rested the reviewer is.
The harness shape is the answer, and it is the answer in a way that is more interesting than “AI reviews AI.” The harness is the review process.
In Last Light, the Reviewer is a separate Agent SDK session, with no shared context with the Executor. It reads the architect plan, runs the tests, looks at the diff, and either approves or rejects with file:line references. That is one layer of review. But the harness also runs static security scanning, type checks, lints, the test suite (which the guardrails phase verifies exists before any code is even written), and a fix loop that actually addresses the issues rather than papering over them. The human review at the end is much smaller - it is reviewing a PR that has already been through three or four checks designed to catch the specific things humans are bad at catching at scale.
The really useful property of this is that the review process is itself a piece of code in the repo. The skills under skills/, the workflow YAMLs, the prompts. When the team learns something - “we keep missing a class of bug where someone forgets to check the tenant scope on a query” - the learning goes into the harness. A new check. A new prompt. A new test guardrail. And from then on every PR is checked for it. The review process gets more rigorous over time, automatically, and you can read the entire history of what your team has learned by reading the git log of your harness repo.
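In practice, “the learning goes into the harness” can be as small as this. The registry shape and the tenant-scope check below are illustrative rather than Last Light’s actual API, but they show how a lesson learned once becomes a check that runs on every future PR:

```typescript
// A check is a function over the diff; the harness runs every registered
// check on every PR, so a lesson learned once applies forever after.
interface DiffFile { path: string; addedLines: string[]; }
interface Finding { path: string; line: string; message: string; }
type Check = (files: DiffFile[]) => Finding[];

const checks: Check[] = [];

// Hypothetical check added after missing a tenant-scoping bug: flag newly
// added query calls that never mention a tenant filter.
checks.push((files) =>
  files
    .filter((f) => f.path.endsWith(".ts"))
    .flatMap((f) =>
      f.addedLines
        .filter((line) => /\.query\(/.test(line) && !/tenantId/.test(line))
        .map((line) => ({ path: f.path, line, message: "query without tenant scope?" })),
    ),
);

export function runChecks(files: DiffFile[]): Finding[] {
  return checks.flatMap((check) => check(files));
}
```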
That is the shift. PR review stops being a thing reviewers do under time pressure on a Friday afternoon and starts being a thing the production line does, every time, with the cumulative knowledge of everyone who has ever caught a class of mistake. Humans still review - the humans get the final call on the highest-impact changes, the contentious calls, the things that need taste. But they do it on a much smaller, much more pre-filtered surface.
Engineers as harness builders
The thing I find most interesting about all of this - and the part I am still working out for myself - is what it means for what we are paid to do.
For the past 25 years or so, an engineer was paid to build the change. The feature. The fix. The refactor. The design system component. Some engineers were paid to build the platforms that other engineers used to build the change - the CI system, the deployment tooling, the observability stack - and those people generally got more leverage, but it was a specialism. Most engineering work was direct.
What is happening now, and what every project on the list above is a symptom of, is that this is inverting. The valuable work is moving up a level. The change is largely produced by the agents in the harness. What the engineer does is build, maintain, and improve the harness that produces the change. The skills you actually need are:
- Designing role boundaries. What can the Architect do? What is the Reviewer’s permission profile? Where does context flow and where is it deliberately withheld? This is systems design with new constraints, and it rewards the same kinds of thinking we have always rewarded - clear interfaces, separation of concerns, explicit failure modes.
- Writing skills and prompts as code. A SKILL.md file is now part of the engineering surface area. It needs to be reviewed, versioned, tested. Prompts are code. Bad ones produce bad outputs at scale.
- Building the deterministic glue. The bits between the AI nodes - parsing verdicts, prefetching context, routing events, handling fix loops, persisting state, downscoping tokens. This is the most boring-looking and most important work. Most of the value of Stripe’s Minions is in the boring deterministic infrastructure, not the agent prompt.
- Verification design. The tests, the static analysis, the guardrails phase that runs before anything else. When the agent is the producer, verification is no longer a courtesy at the end of the cycle, it is the gate that decides whether anything ships at all. This is where the most leverage is.
- Observability of agent runs. When something goes wrong, you need to be able to read the full Agent SDK session, see the tool calls, understand what the model thought it was doing, and figure out whether the failure was a prompt issue, a context issue, a tool issue, or a model issue. Last Light has a four-tab admin dashboard for this and I still have not built enough of it.
None of this is glamorous. Most of it is the kind of work that previously belonged to platform engineers and SREs. My honest take is that the engineers who have spent the past few years closer to the platform - paying attention to CI, to observability, to permissions, to failure modes - are about to have a very good few years. Their instincts transfer directly. The engineers who have been most sceptical of “infrastructure work” as somehow lesser than “feature work” are going to find this transition harder.
The point I want to leave you with here is that this is engineering work. It rewards the same skills that good engineering has always rewarded. It is not prompt-fiddling, it is not vibes, it is not magic. The thing you are building is unusual - a system that produces other systems - but the discipline is exactly what you would hope. Specifications, contracts, tests, clean failure modes, observability, version control, the lot.
What to do about it
If you are a small team, or you run one, or you are in the position to influence one, here is what I would actually do.
Start by accepting that the harness is going to exist. Either you build one (or adopt one), or one builds itself out of half-finished scripts and Slack reminders and someone’s personal Claude Code configuration - which is the worst of all worlds, because it is a harness without any of the discipline that makes a harness valuable.
Pick a small, real surface to start with. PR review is a great one because the agent’s output is a comment, not a commit, so the blast radius is bounded. Issue triage is even smaller. Both of these will teach you more about what you actually need from the harness than any amount of theorising. Last Light started here. The build cycle came later, after I had spent enough time watching the smaller workflows fail to know what the bigger one needed.
Treat the harness as a real codebase. Tests, reviews, versioning, the lot. Do not run a YAML file out of someone’s home directory in production. This is not a hot take, it is a thing I have done and immediately regretted.
Keep the model out of the routing. This is something I keep getting wrong in Last Light and having to back out of - it is very tempting to let the agent decide which workflow to run, or let the reviewer decide whether to merge, and it works fine for the first ten cases and breaks in the eleventh in a way you cannot debug. Deterministic code should route events, parse verdicts, and decide what is approved. The model is on tap for reasoning inside a phase, not for running the production line.
And finally, write down what you learn. When the harness catches a class of bug, add the check. When the harness misses a class of bug, write a new skill. When you make a call about who can approve what, put the rule in YAML. The compounding effect of a team that does this consistently for six months is genuinely larger than anything I have seen from any single tool or model upgrade.
Closing thought
I started this post by quoting Sigrid Jin: the code is a byproduct, the thing worth studying is the system that produced it. I think that holds up under scrutiny - and across the whole field, not just in OmX. The artefact of an engineering organisation in 2026 is no longer the codebase. It is the harness that produces and maintains the codebase. The codebase is downstream.
If that sounds dramatic, I would point at the list above. Stripe is merging 1,300 agent-written PRs a week. OpenAI has published a normative spec for the orchestration layer and explicitly told everyone to build their own. Solo developers are shipping production-grade harnesses in their spare time. Workflow engines with the same essential shape are being re-implemented across three or four runtimes. The category did not exist eighteen months ago in any meaningful form. It exists now.
Where it goes from here is, honestly, what most of the small teams I know are figuring out in real time. I am one of them. I do not have all the answers - my own harness is still very much in active development, with a long list of things I want to fix and a roadmap that is mostly things I have learned by running it on real repos and watching it get something wrong. But I am very confident that this is the right direction to be looking, and I would much rather be wrong about the details than miss the shift entirely.
What are you building? And if you are building a harness too, I would genuinely love to compare notes. You can find me on LinkedIn, or take a look at Last Light and tell me what you would do differently.