Harness Engineering — Shift Your Focus to the "Structure"

Last updated at 2026-05-23, posted at 2026-05-23

People want AI to do exactly what they ask — nothing more, nothing less.

A simple statement. Yet making it a reality has meant confronting the limitations of generative models for a long time. Prompt engineering, context engineering, and now harness engineering are successive attempts at the same goal.

Background — Why Harnesses Became Necessary

Model performance alone never quite delivered that feeling of getting the exact result you wanted, reliably. We tried specifying personas and output formats to sharpen results, structuring information, and feeding reference material through RAG. All of it helped — but none of it solved the fundamental problem.

The core constraint, the inherently probabilistic nature of a "generative" model, remained. Unlike traditional engineering, where the same input yields the same output, both prompt engineering and context engineering amount to requests and recommendations to the model. No matter how emphatically you say "always" or "never," the model can still behave like an unrestrained horse, and that unpredictability never fully goes away. This was especially limiting for tasks that require consistent output across long-running or multi-session workflows.

Harness engineering emerged to eliminate — or at least control — these constraints. A harness is a piece of horse tack. The idea is that to keep a powerful horse moving as directed, you focus not on the horse itself but on the structure surrounding it. The metaphor is apt.

Definitions — How Organizations and Individuals Define It

Harness engineering is an approach that reinforces not the model but the structures around it. In other words, it is structural reinforcement intended to keep the model producing accurate results.

Here are widely cited definitions of harnesses and harness engineering:

Mitchell Hashimoto (My AI Adoption Journey, Feb 2026): Engineering the environment so that every time an agent fails, the same failure can never happen again. (The origin point where the term "harness engineering" gained traction.)
Vivek Trivedy / LangChain (The Anatomy of an Agent Harness): Agent = Model + Harness. Everything that isn't the model is the harness. The model provides intelligence; the harness makes that intelligence useful.
Ryan Lopopolo / OpenAI (Harness engineering: leveraging Codex in an agent-first world, Feb 2026): The work of designing environments, intent specifications, and feedback loops. The engineer's primary job has shifted from "writing code" to "building environments."
Anthropic Engineering Blog (Effective Harnesses for Long-Running Agents): Every component of a harness is built on the assumption that the model cannot do something on its own. As the model evolves, those assumptions become outdated.
Martin Fowler (Harness Engineering, by Birgitta Böckeler): A coherent system composed of guides, sensors, and self-correcting loops.

Principles — Five Core Ideas That Determine Performance

Even with the same model, performance varies dramatically depending on how the harness is designed. LangChain improved their Terminal Bench 2.0 score from 52.8 to 66.5 by modifying only the harness while keeping the model (gpt-5.2-codex) fixed.

The following principles have been consistently emphasized across the harness engineering discourse:

| # | Principle | One-Line Explanation | Primary Source |
|---|---|---|---|
| 1 | Table-of-contents-style instruction files | `AGENTS.md` / `CLAUDE.md` should function primarily as a map, not an encyclopedia. Details live in `docs/` and are disclosed progressively. | OpenAI |
| 2 | Mechanical enforcement | Don't ask; automate verification and enforce compliance with linters, tests, and hooks. | OpenAI / Fowler |
| 3 | External memory | The context window is finite. Write state to disk via `git` / `progress.txt` / `feature_list.json` and carry it across sessions. | Anthropic |
| 4 | Separation of evaluators | Self-verification carries an optimism bias. Separate the Generator and the Evaluator into distinct agents. Default to "fail" when there is no evidence of success. | Anthropic |
| 5 | Feedback loops | Improve the environment by one line after every failure, so the same failure never happens twice. | Hashimoto / OpenAI |

In short: Don't ask the model — build an environment where it cannot go down the wrong path. Then bake in mechanisms for that environment to improve itself.
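
As a rough illustration of principle 3 (external memory), the sketch below keeps a feature checklist in a JSON file so that a new session can ask the file, rather than the model's context, what is still unfinished. The helper names (`loadFeatures`, `markFeature`, `nextUnfinished`) and the `{ name, passes }` record shape are assumptions made for this example, not a format prescribed by any of the cited sources.

```javascript
// Sketch of "external memory": session state lives in a JSON file, not in the context window.
// The file name and the { name, passes } shape are illustrative assumptions.
const fs = require("node:fs");
const os = require("node:os");
const path = require("node:path");

// Load the feature list, or start with an empty one if no file exists yet.
function loadFeatures(file) {
  if (!fs.existsSync(file)) return [];
  return JSON.parse(fs.readFileSync(file, "utf8"));
}

// Record a verified result for one feature and persist it immediately.
function markFeature(file, name, passes) {
  const features = loadFeatures(file);
  const existing = features.find((f) => f.name === name);
  if (existing) existing.passes = passes;
  else features.push({ name, passes });
  fs.writeFileSync(file, JSON.stringify(features, null, 2) + "\n");
  return features;
}

// The next session asks the file, not the model, what is still open.
function nextUnfinished(file) {
  const open = loadFeatures(file).find((f) => !f.passes);
  return open ? open.name : null;
}

// Demo in a temporary directory so nothing in a real project is touched.
const demoDir = fs.mkdtempSync(path.join(os.tmpdir(), "harness-"));
const demoFile = path.join(demoDir, "feature_list.json");
markFeature(demoFile, "login form", true);
markFeature(demoFile, "password reset", false);
console.log(nextUnfinished(demoFile)); // prints "password reset"
```

Because every update is written to disk right away, a session that runs out of context or crashes loses nothing that was already verified.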

Practice — Set Up an MVH in 30 Minutes

This is a minimal hands-on exercise that covers the core principles — an MVH (Minimum Viable Harness), the smallest useful harness configuration. It can be built quickly, and building it once is the fastest way to internalize how harnesses work.

The goal of an MVH is to replace "requests" with "enforcement."

Make the environment run the checks automatically, without being asked. You need three things:

An instruction file (CLAUDE.md) — A quick-reference sheet of rules, prohibitions, and commands that the agent reads automatically at the start of every session.
An objective definition of "done" — A pass/fail criterion free of subjective judgment, such as "all tests pass green."
Automated verification hooks — Mechanisms that actually run linting, type checking, and tests immediately after edits and just before the agent responds.

Once these three pieces are in place, the agent is set up to notice its own mistakes and fix them without being told. The walkthrough below uses Claude Code as the example.
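
One way to make "done" objective is to let exit codes decide. The following is a minimal sketch and not part of Claude Code itself: `checkDone` is a hypothetical helper, and the two stand-in checks simply start the Node binary so the example runs anywhere.

```javascript
// Sketch: "done" is decided by exit codes, not by the agent's own report.
const { spawnSync } = require("node:child_process");

// Run each check in order; return the first failure, or done: true if all exit with 0.
function checkDone(checks) {
  for (const { name, cmd, args } of checks) {
    const result = spawnSync(cmd, args, { encoding: "utf8" });
    // A missing binary (result.error) or any non-zero status counts as "not done".
    if (result.error || result.status !== 0) {
      return { done: false, failed: name };
    }
  }
  return { done: true, failed: null };
}

// Stand-in checks that only need the Node binary. In a real project these
// would be the commands from the instruction file (npm test, npm run lint, ...).
const passing = { name: "tests", cmd: process.execPath, args: ["-e", "process.exit(0)"] };
const failing = { name: "lint", cmd: process.execPath, args: ["-e", "process.exit(1)"] };

console.log(checkDone([passing]));          // { done: true, failed: null }
console.log(checkDone([passing, failing])); // { done: false, failed: 'lint' }
```

Note that the absence of evidence is treated as failure: a check that cannot even be started counts as "not done".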

Note: The same configuration is possible with Gemini CLI. It is available for free using only Google account authentication.

Setup (3 minutes)

Node.js 18 or later (verify with node --version)
Install Claude Code (macOS / Linux / WSL):

curl -fsSL https://claude.ai/install.sh | bash

For Windows, refer to the official installation guide. All subsequent work is done in your project's root directory.

Step 1: Create the Instruction File `CLAUDE.md` (5 minutes)

Create an instruction file that consolidates the rules, commands, and prohibitions you have been repeating to the agent every session. Claude Code automatically loads CLAUDE.md at the start of every session, so anything written here is effectively persisted. Create it at the project root and paste the following:

CLAUDE.md

# Project Name
 
One sentence describing the identity (e.g., an internal communication tool)
 
## Build & Test
- Build: `npm run build`
- Test: `npm test`
- Lint: `npm run lint`
- Type check: `npm run typecheck`
 
## Coding Rules
- Use structured logging only (`console.log` is prohibited)
- Keep files under 300 lines
 
## Recurring Failures (Do NOT do these)
- ❌ Do not delete failing tests
- ❌ Do not leave catch blocks empty
 
## Pre-Completion Verification
After any change, always run: `npm test && npm run lint && npm run typecheck`

Here is what each section does:

| Section | Purpose |
|---|---|
| Opening sentence | Declares the project's identity. Lets the AI understand what it is building from the outset. |
| `## Build & Test` | Lists the correct commands. Prevents the AI from using the wrong ones. |
| `## Coding Rules` | Rules that must be followed whenever new code is written. |
| `## Recurring Failures` | A log of past mistakes. Prevents the same errors from being repeated. |
| `## Pre-Completion Verification` | Steps the agent must complete before declaring "done." |
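
The instruction file itself can be held to a mechanical standard, so that it stays a short map rather than growing into an encyclopedia. The sketch below is one possible check; `lintInstructionFile`, the list of required headings (taken from the example file), and the 60-line budget are all assumptions chosen for illustration.

```javascript
// Sketch: a mechanical check that the instruction file keeps its sections and stays short.
// The required headings mirror the example file; the 60-line budget is an arbitrary assumption.
const REQUIRED_SECTIONS = [
  "## Build & Test",
  "## Coding Rules",
  "## Recurring Failures",
  "## Pre-Completion Verification",
];

function lintInstructionFile(markdown, maxLines = 60) {
  const lines = markdown.split("\n");
  const problems = [];
  for (const heading of REQUIRED_SECTIONS) {
    // startsWith tolerates suffixes such as "(Do NOT do these)".
    if (!lines.some((line) => line.trim().startsWith(heading))) {
      problems.push(`missing section: ${heading}`);
    }
  }
  if (lines.length > maxLines) {
    problems.push(`too long: ${lines.length} lines (budget ${maxLines})`);
  }
  return problems;
}

const sample = [
  "# Project Name",
  "## Build & Test",
  "- Test: `npm test`",
  "## Coding Rules",
  "## Recurring Failures (Do NOT do these)",
].join("\n");

console.log(lintInstructionFile(sample));
// [ 'missing section: ## Pre-Completion Verification' ]
```

A check like this could run in CI or in a hook, so the file's shape is enforced rather than requested.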

Step 2: Get Tests Running (2 minutes)

What is a test? A small program that automatically verifies whether code works correctly. Here we place a single test that simply checks whether 1 + 1 equals 2. The content itself is trivial, but the hooks in Step 3 use "did the tests pass?" as their pass/fail criterion, so we first need the machinery to be in working order. Real tests can be added later.

Run the following two commands in your terminal, one after the other:

mkdir -p __tests__           # Create the test folder
# Write everything up to the closing EOF ("End Of File") marker into the test file
cat > __tests__/sanity.test.js << 'EOF'
test('1 + 1 = 2', () => {
  expect(1 + 1).toBe(2);
});
EOF

What just happened: mkdir -p creates a folder, and cat > ... << 'EOF' is a standard shell idiom that writes everything up to the line reading EOF into a file. This lets you create files without opening an editor.

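To verify, run npm test. If you see a green PASS, you are all set. This assumes Jest is installed and the "test" script in your package.json runs it (for example "test": "jest"); `test` and `expect` are globals provided by the test runner.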

Step 3: Wire Up Automated Verification Hooks (15 minutes)

What is a hook? A mechanism that automatically runs a command when a certain event occurs. It is the same idea as a Git pre-commit check, applied here to the Claude Code agent. Instead of relying on the agent's promise to "be careful," the goal is to force checks from the environment side.

There are two trigger points:

| Trigger | What Runs | What It Prevents |
|---|---|---|
| Immediately after a file is edited | Lint auto-fix + type check | Leaving rule violations or type errors in place |
| Just before returning a response | Full test suite | Declaring "done" while things are broken |

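When a check fails, its output is passed back to the agent automatically. The agent reads the error and fixes it on its own.

More precisely, Claude Code decides what to do from the hook command's exit code. Per the hooks reference at the time of writing, exit code 0 means success and exit code 2 is a blocking error: the command's stderr is fed back to the agent, and for a Stop hook the agent's turn does not end, so it continues working on the fix. Any other non-zero code is treated as a non-blocking error that is shown to the user only. Tools such as `npm test` usually exit with 1 on failure, and linters often report on stdout, so a hook command needs to send its output to stderr and convert the failure into exit code 2, for example `npm test >&2 || exit 2`. This is the mechanism behind "reads the error and fixes it on its own."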
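
The exit-code handling can also live in a small script instead of a shell one-liner. The sketch below assumes the semantics described in Claude Code's hooks reference at the time of writing (exit code 2 blocks and feeds stderr back to the agent; other non-zero codes are non-blocking). `runAsBlockingHook` is a hypothetical helper, and the failing command is a stand-in for a real test run.

```javascript
// Sketch of a hook wrapper: run a check, collect its output for stderr,
// and translate any failure into exit code 2.
const { spawnSync } = require("node:child_process");

const BLOCKING_EXIT_CODE = 2;

// Returns the exit code the hook should use plus the text to print on stderr.
function runAsBlockingHook(cmd, args) {
  const result = spawnSync(cmd, args, { encoding: "utf8" });
  if (result.error) {
    // The command could not be started at all (for example, binary not found).
    return { exitCode: BLOCKING_EXIT_CODE, feedback: String(result.error.message) };
  }
  if (result.status === 0) {
    return { exitCode: 0, feedback: "" };
  }
  // Linters and type checkers often report on stdout, so both streams are forwarded.
  return { exitCode: BLOCKING_EXIT_CODE, feedback: `${result.stdout}${result.stderr}` };
}

// Stand-in for a failing test run: prints a message and exits with 1.
const outcome = runAsBlockingHook(process.execPath, [
  "-e",
  "console.log('1 test failed'); process.exit(1)",
]);
console.log(outcome.exitCode); // prints 2
```

A real hook script would finish with `process.stderr.write(outcome.feedback)` followed by `process.exit(outcome.exitCode)`.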

Run the following in your terminal to create the configuration file:

mkdir -p .claude
touch .claude/settings.json

.claude/settings.json

{
  "hooks": {
    "PostToolUse": [
      {
        "matcher": "Write|Edit",
        "hooks": [
          { "type": "command", "command": "(npm run lint:fix && npm run typecheck) >&2 || exit 2" }
        ]
      }
    ],
    "Stop": [
      {
        "hooks": [
          { "type": "command", "command": "npm test >&2 || exit 2" }
        ]
      }
    ]
  }
}

A closer look at the configuration:

PostToolUse — Fires immediately after the agent writes or edits a file. Runs lint auto-fix (lint:fix) and type checking (typecheck).
Stop — Fires whenever the agent is about to finish responding. Runs the full test suite; if any test fails, the agent is not allowed to declare "done" and is sent back to fix the issue.

Both commands end with `>&2 || exit 2`, which sends the output to stderr and turns any failure into exit code 2, the code that Claude Code's hooks reference documents as blocking. A plain exit code 1 would only be reported as a non-blocking error.
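
A Stop hook that always blocks on failing tests can keep the agent looping when the tests cannot be fixed. Claude Code's hooks reference describes a `stop_hook_active` field in the Stop hook's JSON input that is true when the agent is already continuing because of a Stop hook. The sketch below shows one possible policy (block once, then hand back to the human). `decideStop` and the policy itself are assumptions for illustration, not something the configuration in this article does.

```javascript
// Sketch: decide what a Stop hook should do, given the JSON payload passed on
// stdin (already parsed into hookInput) and the result of the test run.
function decideStop(hookInput, testsPassed) {
  if (testsPassed) {
    return { exitCode: 0, message: "" };
  }
  // stop_hook_active is true when the agent is already continuing because a
  // Stop hook blocked it; letting it stop here avoids an endless loop.
  if (hookInput.stop_hook_active) {
    return { exitCode: 0, message: "Tests still failing after a retry; handing back to the human." };
  }
  // Exit code 2 blocks the stop, so the agent keeps working on the fix.
  return { exitCode: 2, message: "Tests are failing. Fix them before finishing." };
}

console.log(decideStop({ stop_hook_active: false }, false).exitCode); // prints 2
console.log(decideStop({ stop_hook_active: true }, false).exitCode);  // prints 0
console.log(decideStop({ stop_hook_active: false }, true).exitCode);  // prints 0
```

Whether one retry is the right limit depends on the project; the point is that the stopping rule is written down in code rather than left to the agent.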
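
If the lint:fix or typecheck scripts are not yet defined, add them to the "scripts" section of your package.json. The lines below assume ESLint and, for type checking, a TypeScript project checked with tsc:

"lint:fix": "eslint . --fix",
"typecheck": "tsc --noEmit"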

Verification — The Moment "Requests" Become "Enforcement" (5 minutes)

Ask Claude Code for a small implementation (e.g., "Create a function that returns the current date in YYYY-MM-DD format"). You will notice two things that are different from before:

Immediately after editing a file, the agent discovers lint errors on its own and starts fixing them.
Just before saying "Done," the tests run automatically. If they fail, the response is halted and the agent reworks the task.

The verification work you used to request in every prompt is now enforced by the environment. This is what "designing the environment" feels like in practice.
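
For reference, the kind of function the example request asks for might look like the sketch below. It is only one possible outcome, written here by hand: the name `formatDate` and the choice of local time over UTC are assumptions, and what the agent actually produces will differ.

```javascript
// Sketch: format a Date as YYYY-MM-DD using local-time fields.
function formatDate(date = new Date()) {
  const year = String(date.getFullYear()).padStart(4, "0");
  const month = String(date.getMonth() + 1).padStart(2, "0"); // getMonth() is zero-based
  const day = String(date.getDate()).padStart(2, "0");
  return `${year}-${month}-${day}`;
}

console.log(formatDate(new Date(2026, 4, 23))); // prints "2026-05-23"
```

Adding a test for a function like this to `__tests__/` is what gives the Stop hook something meaningful to enforce beyond the sanity check.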

What You Have When You Are Done

| File | Purpose |
|---|---|
| `CLAUDE.md` (instruction file) | A table of contents for rules and prohibitions |
| Test suite (green) | Defines "done" not by human intuition but by test pass/fail |
| `.claude/settings.json` | Automated verification after edits + full test run before completion |

These three pieces form the MVH — the foundation of your harness. From here, you can grow it one folder at a time, one rule at a time, wherever failures keep recurring. There is no need to aim for perfection from the start.
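
Growing the harness "one rule at a time" can itself be scripted. The sketch below appends a rule to the `## Recurring Failures` section of an instruction file given as a string. `addRecurringFailure` is a hypothetical helper, and the `- ❌` entry format simply follows the example CLAUDE.md above.

```javascript
// Sketch: append one rule to the "## Recurring Failures" section of an instruction file.
function addRecurringFailure(markdown, rule) {
  const lines = markdown.split("\n");
  const start = lines.findIndex((line) => line.startsWith("## Recurring Failures"));
  if (start === -1) {
    throw new Error('no "## Recurring Failures" section found');
  }
  const entry = `- ❌ ${rule}`;
  if (lines.includes(entry)) return markdown; // already recorded, keep the file unchanged

  // The section ends at the next "## " heading or at the end of the file.
  let end = lines.findIndex((line, i) => i > start && line.startsWith("## "));
  if (end === -1) end = lines.length;
  // Step back over blank lines so the new entry sits directly under the last one.
  while (end > start + 1 && lines[end - 1].trim() === "") end--;

  lines.splice(end, 0, entry);
  return lines.join("\n");
}

const before = [
  "## Recurring Failures (Do NOT do these)",
  "- ❌ Do not delete failing tests",
  "",
  "## Pre-Completion Verification",
].join("\n");

console.log(addRecurringFailure(before, "Do not skip tests with `.skip`"));
```

The function is idempotent on purpose: recording the same failure twice leaves the file unchanged, which keeps the instruction file short.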

Structure — What a Full-Scale Harness Looks Like (Reference)

Once the MVH is stable, the harness naturally grows into a larger structure. Below is the actual structure used by the OpenAI Codex team on a one-million-line codebase:

AGENTS.md
ARCHITECTURE.md
docs/
├── design-docs/
│   ├── index.md
│   ├── core-beliefs.md
│   └── ...
├── exec-plans/
│   ├── active/
│   ├── completed/
│   └── tech-debt-tracker.md
├── generated/
│   └── db-schema.md
├── product-specs/
│   ├── index.md
│   ├── new-user-onboarding.md
│   └── ...
├── references/
│   ├── design-system-reference-llms.txt
│   ├── nixpacks-llms.txt
│   ├── uv-llms.txt
│   └── ...
├── DESIGN.md
├── FRONTEND.md
├── PLANS.md
├── PRODUCT_SENSE.md
├── QUALITY_SCORE.md
├── RELIABILITY.md
└── SECURITY.md

The key takeaways:

What the agent reads, what it does not read, and what it must not touch — all of this is determined by the folder structure.

AGENTS.md stays a table of contents — Details live in separate files. The context window is not bloated.
core-beliefs.md centralizes decision criteria — A "yardstick" for when the agent faces ambiguity, embedded in the environment.
exec-plans/ with active → completed — Completed plans are preserved so that past decisions can be reused.
generated/ is a no-touch zone — Auto-generated artifacts are walled off from edits at the folder level.
references/*-llms.txt pre-stages knowledge — External documentation is pulled in, eliminating the need for runtime searches.
tech-debt-tracker.md + a separate agent — Technical debt is automatically detected, and fix PRs are continuously submitted.

Whenever the same failure keeps recurring, you add a single line to the harness at that moment, embedding a structural safeguard. Sealing off failures one at a time makes the same mistake mechanically impossible to repeat.
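
As an illustration of a folder-level "no-touch zone", the sketch below decides whether an edit targets a protected folder and maps that decision to a hook exit code. It assumes a PreToolUse hook receives the target path as `tool_input.file_path` and that exit code 2 blocks the tool call, as Claude Code's hooks reference describes at the time of writing. `PROTECTED_DIRS`, `isProtected`, and `exitCodeFor` are hypothetical names, and this is not how the OpenAI team implemented their guard.

```javascript
// Sketch: decide whether an edit targets a protected, auto-generated folder.
const path = require("node:path");

// Hypothetical list of no-touch folders, relative to the project root.
const PROTECTED_DIRS = ["docs/generated"];

function isProtected(projectRoot, filePath) {
  const absolute = path.resolve(projectRoot, filePath);
  return PROTECTED_DIRS.some((dir) => {
    const relative = path.relative(path.resolve(projectRoot, dir), absolute);
    // Inside the folder means the relative path does not climb out with ".."
    // and is not absolute (as it would be on another Windows drive).
    const firstSegment = relative.split(path.sep)[0];
    return firstSegment !== ".." && !path.isAbsolute(relative);
  });
}

// Map a PreToolUse-style payload to a hook exit code: 2 blocks the tool call.
function exitCodeFor(projectRoot, hookInput) {
  const target = hookInput.tool_input && hookInput.tool_input.file_path;
  return target && isProtected(projectRoot, target) ? 2 : 0;
}

console.log(exitCodeFor("/repo", { tool_input: { file_path: "docs/generated/db-schema.md" } })); // prints 2
console.log(exitCodeFor("/repo", { tool_input: { file_path: "docs/DESIGN.md" } }));              // prints 0
```

Comparing resolved paths rather than raw strings means a sibling such as `docs/generated-notes.md` is not blocked by accident.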

Outlook — HaaS and the Adoption Trajectory

Improving model performance is the domain of providers, but harness engineering is a domain users can invest in directly. Benchmarks have also shown that harness changes alone can move scores substantially with the model held fixed, as in the LangChain Terminal Bench 2.0 result above. The expected impact on time efficiency is even greater: harness engineering treats human involvement itself as the bottleneck and concentrates on reducing the steps that require human review and action. These gains in both output quality and time are the reason this approach is likely to be adopted and to spread organically.

HaaS (Harness as a Service) offerings have already appeared on the market, and harness engineering is expected to expand across many industries going forward.

Here is what individuals and organizations should prepare for:

Engineers — As code itself becomes commoditized, you need to shift toward becoming designers of environments, constraints, and structures.
Organizations — This is a methodology with reported productivity gains of 10x or more. Incorporating it as a competitive advantage is the sensible move.

Harness engineering, or at least this general direction, will be adopted even if the label eventually changes to "XYZ engineering." People will continue to demand that agents "do exactly what they want," and technology always moves toward greater utility and cost efficiency.

References

Mitchell Hashimoto, My AI Adoption Journey — The article through which the term gained traction
Ryan Lopopolo / OpenAI, Harness engineering: leveraging Codex in an agent-first world — Experiment report on a 1M LOC codebase; YouTube — Harness Engineering: How to Build Software When Humans Steer, Agents Execute
Anthropic, Effective Harnesses for Long-Running Agents — The original source for the two-agent pattern
Anthropic, Harness Design for Long-Running Application Development — The follow-up with a three-agent variant
Vivek Trivedy / LangChain, The Anatomy of an Agent Harness — The formulation "Agent = Model + Harness"
Birgitta Böckeler (martinfowler.com), Harness Engineering — The guide / sensor / self-correction framework
