Claude Code 98% Harness: Four Competing Teams Built Same Agent Harness, Pointing to Real AI Moat

As Karpathy Joins Anthropic’s Pre-Training Push, Independent Research Reveals Infrastructure Outsizes Model in Production Coding Agents

A research team at Mohamed bin Zayed University of Artificial Intelligence published a finding in April 2026 that has gained traction in engineering circles for reasons that go beyond its headline number. By analyzing Claude Code v2.1.88 — the version that briefly exposed its full TypeScript source to the public on March 31, 2026, after Anthropic accidentally bundled a 59.8-megabyte sourcemap file with an npm release — the four-author team at VILA-Lab dissected 1,884 files and approximately 512,000 lines of code. Their classification: roughly 1.6 percent constitutes AI decision logic; the remaining 98.4 percent is what the field now calls the "harness" — the permission pipeline, context-management system, sandboxing layer, tool router, and recovery infrastructure that surrounds the model on every side.

The 1.6 percent figure carries a methodological asterisk. It reflects the researchers' own line-count classification of a leak-derived bundle that contains generated code and minification artifacts. The MBZUAI paper's authors — Jiacheng Liu, Xiaohan Zhao, Xinyi Shang, and Zhiqiang Shen — do not present it as a universal audit. What they do argue, and what their architecture analysis independently supports, is the qualitative claim stated plainly in the paper's abstract: the core of the system is a simple loop that calls the model, runs tools, and repeats — but most of the code lives in the systems built around that loop.

That qualitative finding has been independently reached by others. Karan Prasad of Obvix Labs spent 72 hours after the source map leak building a five-phase extraction pipeline across the same codebase, producing 82 analysis documents and 16 architectural diagrams. His conclusion: "The model is interchangeable — the harness is not."

Convergence Across Competing Teams

The deeper signal in the MBZUAI paper is architectural convergence. Claude Code is built in TypeScript on Bun. OpenAI's Codex CLI was originally TypeScript but has undergone a near-complete rewrite in Rust, with that language now accounting for roughly 95 percent of the active codebase. The open-source tool Aider, written in Python, and OpenClaw, an independent open-source agent system that the MBZUAI researchers used as a direct architectural comparison case, were built by entirely separate teams with different commercial incentives.

All four converged on the same structural skeleton: an outer loop of model call, tool execution, and result capture; a primitive toolset of roughly a dozen capabilities covering file read, write, edit, shell execution, search, and web fetch; and a multi-stage permission pipeline that gates every tool call through sequential checks before execution. When competing teams independently arrive at the same architecture, the convergence is strong evidence that the pattern is a constraint imposed by the problem itself rather than a design preference.
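To make that skeleton concrete, here is a minimal Python sketch of the three pieces working together: an outer loop, a small tool table, and a sequential permission gate. It is an illustration under stated assumptions, not code from any of the four projects. The tool names, the two checks, and the scripted stand-in for the model are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ToolCall:
    name: str
    args: dict


# Hypothetical primitive tools; they return placeholder strings, not real I/O.
def read_file(args: dict) -> str:
    return f"(contents of {args['path']})"


def run_shell(args: dict) -> str:
    return f"(ran: {args['command']})"


TOOLS: dict[str, Callable[[dict], str]] = {
    "read_file": read_file,
    "run_shell": run_shell,
}


# Each permission check returns None to pass, or a reason string to deny.
def check_known_tool(call: ToolCall) -> Optional[str]:
    return None if call.name in TOOLS else f"unknown tool {call.name}"


def check_shell_denylist(call: ToolCall) -> Optional[str]:
    if call.name == "run_shell" and "rm -rf" in call.args.get("command", ""):
        return "destructive shell command"
    return None


# Assumed ordering: cheap structural checks first, content checks after.
PERMISSION_PIPELINE = [check_known_tool, check_shell_denylist]


def gate(call: ToolCall) -> Optional[str]:
    """Run the checks in order and stop at the first denial."""
    for check in PERMISSION_PIPELINE:
        reason = check(call)
        if reason is not None:
            return reason
    return None


def agent_loop(model, task: str, max_turns: int = 10) -> list[str]:
    """Outer loop: ask the model, gate the call, run the tool, record the result."""
    transcript = [f"task: {task}"]
    for _ in range(max_turns):
        call = model(transcript)
        if call is None:  # the model signals it has nothing more to do
            break
        denial = gate(call)
        if denial is not None:
            result = f"denied: {denial}"
        else:
            result = TOOLS[call.name](call.args)
        transcript.append(f"{call.name} -> {result}")
    return transcript


def scripted_model(transcript: list[str]) -> Optional[ToolCall]:
    """Stand-in for a real model: replays a fixed list of tool calls."""
    script = [
        ToolCall("read_file", {"path": "README.md"}),
        ToolCall("run_shell", {"command": "rm -rf /"}),
    ]
    step = len(transcript) - 1
    return script[step] if step < len(script) else None
```

Calling `agent_loop(scripted_model, "inspect repo")` runs the read, refuses the destructive shell command, and stops when the script is exhausted. The denial is appended to the transcript rather than raised, so in this sketch the model would see the refusal on its next turn.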

The MBZUAI paper maps Claude Code's harness across seven components: user interface, agent loop, permission system, tools layer, state and persistence, execution environment, and extensibility. The permission system alone spans seven modes and includes a machine-learning classifier that evaluates tool calls in two stages — a fast 64-token binary decision followed, when needed, by a full reasoning pass of up to 4,096 tokens. Context management runs a five-layer compaction pipeline that attempts to preserve cached prompt prefixes through most compression events, which the Obvix Labs analysis traces to prompt-cache economics: preserving the cached portion of the prompt cuts API costs by roughly 76 percent on repeated turns.
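The two-stage classifier described above can be sketched as a simple escalation pattern. The token budgets come from the figures quoted in this article; everything else is an assumption made for illustration. In particular, the sketch assumes the first stage only flags calls for review and that the second stage runs only on flagged calls, and it uses keyword matchers as hypothetical stand-ins for the real machine-learning classifier.

```python
from dataclasses import dataclass
from typing import Callable

FAST_BUDGET = 64    # token cap for the first-stage binary decision
FULL_BUDGET = 4096  # token cap for the second-stage reasoning pass


@dataclass
class Verdict:
    allow: bool
    stage: str            # which stage produced the final decision
    tokens_budgeted: int  # worst-case token budget spent reaching it


# A classifier takes (tool_call_text, max_tokens) and returns True for "risky".
Classifier = Callable[[str, int], bool]


def two_stage_gate(tool_call: str, fast: Classifier, full: Classifier) -> Verdict:
    """Allow on a clean fast pass; escalate to the full pass only when flagged."""
    if not fast(tool_call, FAST_BUDGET):
        return Verdict(True, "fast", FAST_BUDGET)
    risky = full(tool_call, FULL_BUDGET)
    return Verdict(not risky, "full", FAST_BUDGET + FULL_BUDGET)


# Hypothetical stand-ins: a real system would call a model in both stages.
def keyword_fast(tool_call: str, max_tokens: int) -> bool:
    return any(word in tool_call for word in ("curl", "rm ", "sudo"))


def keyword_full(tool_call: str, max_tokens: int) -> bool:
    return "rm -rf" in tool_call or "| sh" in tool_call
```

With these stand-ins, `two_stage_gate("ls -la", keyword_fast, keyword_full)` is allowed at the fast stage, while a `curl` command is escalated and allowed or blocked depending on whether it pipes into a shell. The design point is the cost asymmetry: most calls pay only the small budget, and the expensive pass is reserved for the calls that need it.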

Model-Maker's Counterargument Deserves Airtime

The person who built Claude Code disagrees with the moat framing, and his disagreement is worth taking seriously. Boris Cherny, who created Claude Code and now leads it at Anthropic, told the Latent Space podcast that the harness is "the thinnest possible wrapper over the model" and that the company "literally could not build anything more minimal." His direct claim: "All the secret sauce — it's all in the model."

That position is corroborated in part by third-party benchmarking. METR evaluations found that Claude Code and Codex CLI do not consistently outperform a basic scaffold on certain tasks. Scale AI's SWE-Atlas data showed that for some models, the choice of harness produced performance differences within the margin of error. OpenAI researcher Noam Brown has argued that as reasoning models improve, scaffolding around them will increasingly be replaced by model capability directly — the same dynamic that made elaborate scaffolding redundant when reasoning models first appeared.

The honest read is that "Big Model" and "Big Harness" proponents are both selling something. Cherny is selling the model. The harness researchers are selling harness frameworks. What the independent architectural evidence — from MBZUAI, from Obvix Labs, and from the architectural decisions visible in every competing CLI agent — actually supports is a narrower claim: harness engineering has real and non-trivial value even if it is not the only value, and the pattern of independent convergence suggests the harness decisions compound across model generations in ways that model swaps alone cannot automatically fix.

Security Vulnerabilities Lived in Harness Components

The CVE record for Claude Code adds a practical dimension to the architectural debate. Check Point Research disclosed CVE-2025-59536, a code injection vulnerability carrying a CVSS score of 8.7 that allowed arbitrary shell commands to execute automatically when a developer started Claude Code in a malicious repository. The exploit path ran through the hooks mechanism, a harness component, and required no user interaction beyond cloning the repository and launching the tool. A second vulnerability, CVE-2026-21852, allowed a malicious repository to exfiltrate Anthropic API keys before the trust confirmation prompt appeared. Both were patched, the first in October 2025 and the second in January 2026. Adversa AI subsequently identified additional configuration-abuse vectors, including a TrustFall sandbox bypass disclosed in May 2026 that has also been patched.

The pattern is notable in the context of the harness debate: the vulnerabilities that caused real developer exposure were not in the model's inference logic. They were in the permission pipeline, the hooks system, and the sandbox — exactly the components the MBZUAI researchers classify as the harness. That the same infrastructure is both the structural differentiator and the primary attack surface is not a coincidence. A more expressive harness enlarges the surface for manipulation by malicious inputs.

Karpathy Bet and the Recursive Loop

The harness finding lands against a backdrop that frames it differently than the MBZUAI paper alone might suggest. Andrej Karpathy, who co-founded OpenAI, led Tesla's Autopilot program, and coined the term "vibe coding" in February 2025 before leaving his education startup Eureka Labs, announced on May 19, 2026, that he had joined Anthropic's pre-training team. His mandate is specific: work under pre-training team lead Nick Joseph to build a new group that uses Claude to accelerate Claude's own pre-training research. The bet is that AI-assisted science can compress the training cycle that produces future models faster than raw compute expenditure can.

That bet is a "Big Model" wager in the terms of the harness debate. Karpathy is not joining the Claude Code infrastructure team. He is joining the team that produces the model the harness wraps. His presence at Anthropic strengthens the argument that model quality remains a primary lever — but it does not settle the question of whether the harness around that model compounds or erodes the advantage as agents move from prototype to production.

Cherny confirmed in March 2026 that 100 percent of Claude Code's own codebase is now written by Claude Code — the feedback loop between model and harness fully closed. Whether that makes the harness the primary moat or the model the primary moat depends on what you believe is harder to replicate: the infrastructure that constrains and enables the model's behavior in production, or the model that makes the infrastructure worth building.

Builders' Practical Takeaway

For developers building agentic systems in 2026, the MBZUAI paper and the convergence evidence suggest a concrete allocation question: how much engineering time goes into harness components — permission pipelines, context compaction, sandboxing, tool routing — versus model selection and prompt engineering? The independent convergence of four competing teams on nearly identical harness skeletons implies that the harness decisions are not arbitrary and that getting them wrong carries production costs that model upgrades do not automatically fix.

The Model Context Protocol, an open standard that Anthropic released in November 2024 and that has since accumulated more than 97 million monthly SDK downloads and over 10,000 registered servers, extends the harness logic over the network. It is, in effect, the same JSON-RPC call-and-response primitive the CLI exposes locally, projected outward to external tools and services. Codex CLI ships a mode that lets the binary act as a Model Context Protocol server as well as a client, so external agents can drive Codex the way Codex drives its own tools, with one infrastructure standard connecting everything.
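For readers who have not seen the wire format, the call-and-response primitive looks roughly like the sketch below. It builds a JSON-RPC 2.0 request using the protocol's `tools/call` method and answers it with a toy in-process handler. The `word_count` tool and the handler are hypothetical, and the transport (stdio or HTTP in practice), initialization handshake, and capability negotiation are all omitted.

```python
import json


def make_tool_call(request_id: int, tool: str, arguments: dict) -> str:
    """Client side: serialize a tools/call request as a JSON-RPC 2.0 message."""
    return json.dumps({
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool, "arguments": arguments},
    })


# Hypothetical server-side tool table with a single illustrative tool.
def word_count(arguments: dict) -> str:
    return str(len(arguments["text"].split()))


SERVER_TOOLS = {"word_count": word_count}


def error(request_id, code: int, message: str) -> str:
    return json.dumps({
        "jsonrpc": "2.0",
        "id": request_id,
        "error": {"code": code, "message": message},
    })


def handle(raw: str) -> str:
    """Server side: dispatch one request and return the serialized response."""
    req = json.loads(raw)
    if req.get("method") != "tools/call":
        return error(req.get("id"), -32601, "Method not found")
    params = req.get("params", {})
    tool = SERVER_TOOLS.get(params.get("name"))
    if tool is None:
        return error(req.get("id"), -32602, f"Unknown tool: {params.get('name')}")
    text = tool(params.get("arguments", {}))
    return json.dumps({
        "jsonrpc": "2.0",
        "id": req["id"],
        "result": {"content": [{"type": "text", "text": text}], "isError": False},
    })
```

Passing `make_tool_call(1, "word_count", {"text": "a b c"})` to `handle` returns a result whose text content is `"3"`. Because the same message shape works whether the handler lives in the same process or behind a network transport, a binary can sit on either side of the exchange, which is what the client-and-server mode described above relies on.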

The competitive line in agentic AI for the next 12 to 18 months will likely be drawn by both model quality and harness defensibility. A CLI agent that ships with a working multi-stage permission pipeline, OS-level sandbox, structured session storage, and a small set of primitive tools is no longer a prototype. The question the MBZUAI paper raises — and the Boris Cherny counterargument does not fully resolve — is whether that infrastructure, once built and battle-tested across millions of developer sessions, is harder to replicate than the model it wraps.

ⓒ 2026 TECHTIMES.com All rights reserved. Do not reproduce without permission.
