
Harness Engineering: The Discipline That Separates AI Demos from AI That Ships

88% of AI agent projects never reach production. The gap isn't the model — it's everything around it. Here's what harness engineering is and why it's the most important AI discipline you've never heard of.

Harness Engineering · AI Agents · Enterprise AI

Here's a stat that should make every business leader pause: 88% of AI agent projects never reach production.

Not because the models aren't capable. Not because the use cases aren't valid. Because the "everything else" — the guides, the guardrails, the feedback loops, the governance — wasn't designed properly.

That "everything else" now has a name: harness engineering.

What Is Harness Engineering?

OpenAI coined the term in February 2026, and it's since been adopted by Martin Fowler, Red Hat, Anthropic, and a growing community of practitioners. The formula is simple:

Agent = Model + Harness

The model is the AI brain. The harness is everything you build around it to make that brain useful, reliable, and safe in production. It includes:

  • Guides — system prompts, agent definition files, constraint documents that tell the AI how to behave
  • Sensors — evaluations, validation loops, output parsers that verify the AI is performing correctly
  • Data context pipelines — the systems that feed the right information to the right agent at the right time

Think of it this way: if the model is the engine, the harness is the steering wheel, the brakes, the dashboard, and the road rules. You wouldn't put an engine in a car without those things. But that's exactly what most organisations do with AI.

The Evidence Is Stark

Two teams using the same model can see task completion rates of 60% vs. 98% — based entirely on harness quality.

Read that again. Same model. Same capability. The only difference is what was built around it.

Microsoft demonstrated this when they shifted their SRE (Site Reliability Engineering) agent from 100+ bespoke tools to a filesystem-based context engineering system. Performance on novel incidents rose from 45% to 75% — not by upgrading the model, but by redesigning the harness.

Martin Fowler published a full framework article on harness engineering in April 2026, establishing it as a legitimate engineering discipline alongside software engineering and data engineering. This isn't a buzzword. It's an emerging practice with real methodology behind it.

Why Most AI Projects Fail

The typical AI deployment looks like this:

  1. Buy a tool or API
  2. Write some prompts
  3. Demo it to leadership
  4. Hand it to the team
  5. Watch adoption stall

What's missing? Everything in the harness:

  • No clear definition of how the agent should behave in edge cases
  • No validation that the outputs are correct before they reach a user
  • No feedback loops that improve performance over time
  • No governance layer that ensures compliance
  • No coaching or documentation that helps the team actually use it

The demo was impressive because the demo was a controlled environment. Production isn't controlled. Production is messy data, ambiguous requests, edge cases, compliance requirements, and real consequences when something goes wrong.

The harness is what handles that mess.

What Good Harness Engineering Looks Like

Guides: Defining Agent Behaviour

Every AI agent needs a definition of how it should operate. Not just a system prompt — a comprehensive guide that covers:

  • What the agent is responsible for (and what it isn't)
  • How it should handle ambiguity
  • What tone and style to use in outputs
  • What data it can access and what it can't
  • When to escalate to a human

This is where files like AGENTS.md and CLAUDE.md come in — adopted by over 60,000 open-source projects since August 2025. They're version-controlled, reviewable, and auditable. When the agent's behaviour needs to change, you update the guide and the change is tracked.
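
To make this concrete, here is a minimal sketch of what such a guide file might contain. The agent, sections, and thresholds are invented for illustration; AGENTS.md imposes no fixed schema:

```markdown
# AGENTS.md — Claims Triage Agent (illustrative example)

## Scope
- Responsible for: classifying inbound claims and drafting first responses.
- Not responsible for: approving payouts or contacting third parties.

## Handling ambiguity
- If a claim is missing a policy number, ask the submitter once; do not guess.

## Data access
- May read: the claims database (read-only) and current policy documents.
- May not read: HR records or anything outside the claims domain.

## Escalation
- Route any claim above $10,000, or any suspected fraud, to a human reviewer.
```

Because it lives in version control, a change to the escalation threshold is a reviewable diff, not a silent prompt tweak.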

Sensors: Verifying Quality

A production agent needs monitoring. Not just "is it running?" but "is the output correct and useful?" This includes:

  • Automated evaluations that check output quality against defined criteria
  • Human-in-the-loop checkpoints for high-stakes decisions
  • Output parsers that validate format and content before delivery
  • Performance dashboards that track accuracy, latency, and user satisfaction over time
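
The output-parser idea above can be sketched in a few lines. This is a hypothetical sensor, not any particular product's API: it assumes the agent returns JSON with a `confidence` field, and fails closed when the response is malformed or below an escalation threshold.

```python
import json

# Illustrative output sensor: validate an agent's draft before it reaches a
# user. The schema and the 0.7 threshold are invented for this example.
REQUIRED_FIELDS = {"summary", "recommendation", "confidence"}

def check_output(raw: str) -> tuple[bool, list[str]]:
    """Return (ok, problems) for one agent response."""
    problems = []
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return False, ["output is not valid JSON"]

    missing = REQUIRED_FIELDS - data.keys()
    if missing:
        problems.append(f"missing fields: {sorted(missing)}")

    conf = data.get("confidence")
    if not isinstance(conf, (int, float)) or not 0.0 <= conf <= 1.0:
        problems.append("confidence must be a number in [0, 1]")
    elif conf < 0.7:
        # Low-confidence answers go to a human instead of the user.
        problems.append("confidence below escalation threshold (0.7)")

    return (not problems), problems

ok, why = check_output(
    '{"summary": "s", "recommendation": "approve", "confidence": 0.55}'
)
# ok is False: the JSON parsed, but confidence 0.55 sits below the threshold
```

A check like this runs on every response; the dashboards in the list above are then just aggregations of its results over time.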

Data Context Pipelines: Feeding the Right Information

The best harness connects to the right data at the right time. Standards like MCP (Model Context Protocol) from Anthropic, A2A (Agent-to-Agent Protocol) from Google, and the emerging NIST AI Agent Standards are making this interoperable — meaning your harness can work across platforms, not just one vendor's ecosystem.
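
Stripped to its core, a context pipeline is a routing decision: which sources does this agent get, for this request? The sketch below is purely illustrative (the source names and routing table are invented); in a real harness the hard-coded functions would sit behind a protocol such as MCP.

```python
from typing import Callable

# Hypothetical data sources: each takes a query, returns relevant context.
SOURCES: dict[str, Callable[[str], str]] = {
    "policies": lambda q: f"[policy excerpts matching: {q}]",
    "tickets":  lambda q: f"[recent tickets matching: {q}]",
}

# Which agent is allowed to see which sources, and in what order.
ROUTES = {
    "claims_agent":  ["policies", "tickets"],
    "billing_agent": ["tickets"],
}

def build_context(agent: str, query: str) -> str:
    """Fetch only the sources this agent is permitted to see."""
    chunks = [SOURCES[name](query) for name in ROUTES.get(agent, [])]
    return "\n\n".join(chunks)

# build_context("billing_agent", "refund") pulls only the tickets source;
# an unknown agent gets nothing, which doubles as an access-control default.
```

The routing table is also where governance lives: it is the single place that records which agent can touch which data.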

Skills: The Building Blocks of Enterprise AI

Within the harness, skills are the modular, reusable capabilities that agents draw on. Think of them as the specific things an agent knows how to do: process a claim, draft an email, analyse a spreadsheet, generate a report.

The breakthrough in 2026 is that skills are becoming portable. A skill file designed for Claude Code also works in Codex CLI, Gemini CLI, Cursor, and GitHub Copilot. ServiceNow opened its entire agent skills platform to every developer, from any tool.

This means the investment you make in building enterprise skills isn't locked to one platform. It's an asset that works across your entire AI stack.
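
The "modular, reusable" property can be seen in a toy registry. Everything here is invented for illustration: each skill is a self-describing unit with a description the agent (or a human) reads to decide when to invoke it, which is the same shape that portable skill files express declaratively.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class Skill:
    name: str
    description: str          # what the agent reads to choose this skill
    run: Callable[[str], str]

REGISTRY: dict[str, Skill] = {}

def register(skill: Skill) -> None:
    REGISTRY[skill.name] = skill

register(Skill(
    name="summarise_report",
    description="Condense a report into three bullet points.",
    run=lambda text: "summary of " + text[:40],
))

register(Skill(
    name="draft_email",
    description="Draft a polite follow-up email from notes.",
    run=lambda notes: "Dear customer, regarding " + notes[:40],
))

available = sorted(REGISTRY)   # discoverable: list what the agent can do
result = REGISTRY["draft_email"].run("the delayed shipment")
```

Because each skill is a named, self-contained unit, the same catalogue can be exposed to any tool that understands the format, which is what makes the portability claim above possible.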

As Fortune reported: "The future belongs to the developer who masters the ability to break down human expertise into reusable agent skills."

What This Means for Your Business

If you're investing in AI — whether it's Microsoft Copilot, Claude, ChatGPT, or something else — the model is table stakes. Everyone has access to the same models.

The differentiation is in the harness:

  • How well your agents are defined and governed
  • How reliably they perform under real conditions
  • How portably your skills work across platforms
  • How confidently your team can use and trust the outputs

Models are commoditising. The harness is the competitive advantage.

Where to Start

You don't need to build everything at once. Start with one agent doing one job well:

  1. Pick a real workflow — something your team does repeatedly that has clear inputs and outputs
  2. Write the guide — define exactly how the agent should behave, including edge cases and escalation rules
  3. Build the sensors — how will you know if the output is good? Define the checks before you deploy
  4. Connect the context — what data does the agent need? Map the sources and build the pipeline
  5. Coach the team — a tool nobody trusts is a tool nobody uses. Training and transparency are part of the harness
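
Steps 2 through 4 can be wired together in a few dozen lines. This is a deliberately tiny sketch: `call_model` is a placeholder for whatever model API you use, and the guide, context, and check are invented examples of each harness part.

```python
# Step 2: the guide, defining behaviour and the escalation rule.
GUIDE = (
    "You summarise support tickets in three bullet points. "
    "If the ticket mentions legal action, reply only: ESCALATE."
)

def fetch_context(ticket_id: str) -> str:
    # Step 4: connect the context (in production, a real data pipeline).
    return f"Ticket {ticket_id}: customer reports a billing error."

def call_model(prompt: str) -> str:
    # Placeholder for the model call; returns a canned response here.
    return "- billing error reported\n- customer awaiting refund\n- no legal risk"

def sensor(output: str) -> bool:
    # Step 3: the check is defined before anything is deployed.
    bullets = [ln for ln in output.splitlines() if ln.startswith("- ")]
    return output.strip() == "ESCALATE" or len(bullets) == 3

def run(ticket_id: str) -> str:
    prompt = GUIDE + "\n\n" + fetch_context(ticket_id)
    output = call_model(prompt)
    if not sensor(output):
        return "ESCALATE"  # fail closed: a human takes over
    return output
```

The model call is one line; everything else is harness, which is the point.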

Then iterate. Improve the guide based on what you learn. Tighten the sensors. Expand the context. Build the next skill.

This is the discipline that separates AI demos from AI that ships. And it's the discipline we help organisations build every day.

We help organisations build the context infrastructure, harness design, and skills architecture that make AI actually work in production. If this resonates, let's talk.
