Here's a stat that should make every business leader pause: 88% of AI agent projects never reach production.
Not because the models aren't capable. Not because the use cases aren't valid. Because the "everything else" — the guides, the guardrails, the feedback loops, the governance — wasn't designed properly.
That "everything else" now has a name: harness engineering.
What Is Harness Engineering?
OpenAI coined the term in February 2026, and it's since been adopted by Martin Fowler, Red Hat, Anthropic, and a growing community of practitioners. The formula is simple:
Agent = Model + Harness
The model is the AI brain. The harness is everything you build around it to make that brain useful, reliable, and safe in production. It includes:
- Guides — system prompts, agent definition files, constraint documents that tell the AI how to behave
- Sensors — evaluations, validation loops, output parsers that verify the AI is performing correctly
- Data context pipelines — the systems that feed the right information to the right agent at the right time
Think of it this way: if the model is the engine, the harness is the steering wheel, the brakes, the dashboard, and the road rules. You wouldn't put an engine in a car without those things. But that's exactly what most organisations do with AI.
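To make that split concrete, here is a minimal sketch in Python. The names (call_model, validate_output, escalate_to_human) and the AGENTS.md file are placeholders, not any vendor's API: the point is that the model is a single call, and the harness is everything wrapped around it.

```python
from pathlib import Path


def call_model(system_prompt: str, user_input: str) -> str:
    """Placeholder for a real model API call (OpenAI, Anthropic, etc.)."""
    raise NotImplementedError


def validate_output(draft: str) -> list[str]:
    """Sensor: return a list of problems; an empty list means the draft can ship."""
    problems = []
    if not draft.strip():
        problems.append("empty response")
    return problems


def escalate_to_human(user_input: str, problems: list[str]) -> str:
    """Guardrail: route the request to a person, with the failure context attached."""
    return f"Escalated to a reviewer: {', '.join(problems)}"


def run_agent(user_input: str, max_retries: int = 2) -> str:
    guide = Path("AGENTS.md").read_text()  # Guide: version-controlled behaviour rules
    for _ in range(max_retries + 1):
        draft = call_model(system_prompt=guide, user_input=user_input)
        problems = validate_output(draft)  # Sensor: check before anything reaches a user
        if not problems:
            return draft
    return escalate_to_human(user_input, problems)
```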
The Evidence Is Stark
Two teams using the same model can see task completion rates of 60% vs. 98% — based entirely on harness quality.
Read that again. Same model. Same capability. The only difference is what was built around it.
Microsoft demonstrated this when they shifted their SRE (Site Reliability Engineering) agent from 100+ bespoke tools to a filesystem-based context engineering system. Performance on novel incidents rose from 45% to 75% — not by upgrading the model, but by redesigning the harness.
Martin Fowler published a full framework article on harness engineering in April 2026, establishing it as a legitimate engineering discipline alongside software engineering and data engineering. This isn't a buzzword. It's an emerging practice with real methodology behind it.
Why Most AI Projects Fail
The typical AI deployment looks like this:
- Buy a tool or API
- Write some prompts
- Demo it to leadership
- Hand it to the team
- Watch adoption stall
What's missing? Everything in the harness:
- No clear definition of how the agent should behave in edge cases
- No validation that the outputs are correct before they reach a user
- No feedback loops that improve performance over time
- No governance layer that ensures compliance
- No coaching or documentation that helps the team actually use it
The demo was impressive because the demo was a controlled environment. Production isn't controlled. Production is messy data, ambiguous requests, edge cases, compliance requirements, and real consequences when something goes wrong.
The harness is what handles that mess.
What Good Harness Engineering Looks Like
Guides: Defining Agent Behaviour
Every AI agent needs a definition of how it should operate. Not just a system prompt — a comprehensive guide that covers:
- What the agent is responsible for (and what it isn't)
- How it should handle ambiguity
- What tone and style to use in outputs
- What data it can access and what it can't
- When to escalate to a human
This is where files like AGENTS.md and CLAUDE.md come in — adopted by over 60,000 open-source projects since August 2025. They're version-controlled, reviewable, and auditable. When the agent's behaviour needs to change, you update the guide and the change is tracked.
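Because the guide is just a file, it can even be checked automatically before the agent ships. The section headings below are hypothetical, not a standard; the idea is that an incomplete guide fails the build rather than failing in production.

```python
from pathlib import Path

# Illustrative section headings; use whatever structure your guides actually follow.
REQUIRED_SECTIONS = [
    "## Scope",        # what the agent is (and isn't) responsible for
    "## Ambiguity",    # how to handle unclear or conflicting requests
    "## Data access",  # what it may and may not read
    "## Escalation",   # when to hand off to a human
]


def load_guide(path: str = "AGENTS.md") -> str:
    """Load the agent guide and fail fast if a required section is missing."""
    text = Path(path).read_text()
    missing = [section for section in REQUIRED_SECTIONS if section not in text]
    if missing:
        raise ValueError(f"Guide is incomplete; missing sections: {missing}")
    return text
```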
Sensors: Verifying Quality
A production agent needs monitoring. Not just "is it running?" but "is the output correct and useful?" This includes:
- Automated evaluations that check output quality against defined criteria
- Human-in-the-loop checkpoints for high-stakes decisions
- Output parsers that validate format and content before delivery (see the sketch after this list)
- Performance dashboards that track accuracy, latency, and user satisfaction over time
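Here is a minimal sketch of that output-parser idea, a check that stands between the model and the user. The field names are illustrative (a claims-handling agent is assumed for the example); anything that fails these checks is held back and routed to a human checkpoint rather than delivered.

```python
import json


def parse_claim_summary(raw: str) -> dict:
    """Validate the format and content of a claims-summary response before delivery."""
    data = json.loads(raw)  # malformed JSON raises immediately
    for field in ("claim_id", "decision", "confidence"):
        if field not in data:
            raise ValueError(f"missing field: {field}")
    if data["decision"] not in ("approve", "refer", "reject"):
        raise ValueError(f"unexpected decision: {data['decision']}")
    if not 0.0 <= float(data["confidence"]) <= 1.0:
        raise ValueError("confidence must be between 0 and 1")
    return data
```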
Data Context Pipelines: Feeding the Right Information
The best harness connects to the right data at the right time. Standards like MCP (Model Context Protocol) from Anthropic, A2A (Agent-to-Agent Protocol) from Google, and the emerging NIST AI Agent Standards are making this interoperable — meaning your harness can work across platforms, not just one vendor's ecosystem.
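In code, a context pipeline can start as simply as the sketch below: retrieve only the records relevant to this request, then attach them to the prompt alongside the guide. The retrieval step is a placeholder; in practice it might query a search index, a database, or an MCP server exposed by your system of record.

```python
from dataclasses import dataclass


@dataclass
class ContextDoc:
    source: str
    content: str


def fetch_context(request: str) -> list[ContextDoc]:
    """Placeholder retrieval step: swap in your search index, database, or MCP client."""
    raise NotImplementedError


def build_prompt(guide: str, request: str, docs: list[ContextDoc]) -> str:
    """Assemble the final prompt: the guide, then only the context this request needs."""
    context_block = "\n\n".join(f"[{doc.source}]\n{doc.content}" for doc in docs)
    return f"{guide}\n\nContext:\n{context_block}\n\nRequest:\n{request}"
```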
Skills: The Building Blocks of Enterprise AI
Within the harness, skills are the modular, reusable capabilities that agents draw on. Think of them as the specific things an agent knows how to do: process a claim, draft an email, analyse a spreadsheet, generate a report.
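One way to picture this (purely illustrative, not any vendor's skill format) is a small registry: each skill is a named, reusable capability the agent can look up and invoke.

```python
from typing import Callable

SKILLS: dict[str, Callable[[str], str]] = {}


def skill(name: str):
    """Register a function as a named skill an agent can draw on."""
    def register(fn: Callable[[str], str]) -> Callable[[str], str]:
        SKILLS[name] = fn
        return fn
    return register


@skill("draft_email")
def draft_email(brief: str) -> str:
    """One skill: turn a short brief into an email draft (body elided in this sketch)."""
    return f"Subject: ...\n\nDraft based on: {brief}"


@skill("summarise_report")
def summarise_report(text: str) -> str:
    """Another skill: produce a short summary (trivially stubbed for the sketch)."""
    return text[:200]
```

In practice these capabilities usually live as skill files rather than application code, which is exactly what makes them portable across tools.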
The breakthrough in 2026 is that skills are becoming portable. A skill file designed for Claude Code also works in Codex CLI, Gemini CLI, Cursor, and GitHub Copilot. ServiceNow opened its entire agent skills platform to every developer, from any tool.
This means the investment you make in building enterprise skills isn't locked to one platform. It's an asset that works across your entire AI stack.
As Fortune reported: "The future belongs to the developer who masters the ability to break down human expertise into reusable agent skills."
What This Means for Your Business
If you're investing in AI — whether it's Microsoft Copilot, Claude, ChatGPT, or something else — the model is table stakes. Everyone has access to the same models.
The differentiation is in the harness:
- How well your agents are defined and governed
- How reliably they perform under real conditions
- How portably your skills work across platforms
- How confidently your team can use and trust the outputs
Models are commoditising. The harness is the competitive advantage.
Where to Start
You don't need to build everything at once. Start with one agent doing one job well:
- Pick a real workflow — something your team does repeatedly that has clear inputs and outputs
- Write the guide — define exactly how the agent should behave, including edge cases and escalation rules
- Build the sensors — how will you know if the output is good? Define the checks before you deploy (see the sketch after this list)
- Connect the context — what data does the agent need? Map the sources and build the pipeline
- Coach the team — a tool nobody trusts is a tool nobody uses. Training and transparency are part of the harness
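For step 3, the pre-deployment check can start very small. The sketch below assumes the run_agent harness from earlier and a handful of hand-written cases; the pass rate becomes a number you track release over release.

```python
EVAL_CASES = [
    # Illustrative cases: each pairs an input with something the output must contain.
    {"input": "Customer asks for a refund outside policy", "must_contain": "escalat"},
    {"input": "Summarise the attached claim", "must_contain": "claim"},
]


def run_evals(run_agent) -> float:
    """Run the agent over the eval set and return the fraction of cases passed."""
    passed = 0
    for case in EVAL_CASES:
        output = run_agent(case["input"]).lower()
        if case["must_contain"] in output:
            passed += 1
    return passed / len(EVAL_CASES)
```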
Then iterate. Improve the guide based on what you learn. Tighten the sensors. Expand the context. Build the next skill.
This is the discipline that separates AI demos from AI that ships. And it's the discipline we help organisations build every day.