AI Agents in Biology Are Too Inaccurate to Use: Anthropic’s Deterministic Tool Is the Fix

Using an AI agent to navigate biological data is like driving a modern car through a medieval city. The streets were laid out for people on foot: narrow, winding, signposted for humans, with no highways and no standardized interfaces. However good the car, it cannot get up to speed. That is the governing image in a new essay from Anthropic, published June 8, 2026, whose argument reduces to one sentence: biological data infrastructure is not ready for AI agents, and it needs to be rebuilt.

The finding cuts against the dominant story about AI and science, which assumes smarter models are all that stand between us and automated discovery. Anthropic researchers, led by Laura Luebbert and drawing on a preprint by Ferdous Nasri and colleagues, measured what happens when today's best research agents tackle a routine biology task — and the binding constraint was not the model's reasoning. It was the plumbing underneath it. For any scientist or builder betting on "AI for science," that relocates the problem from a place they cannot fix to one they can.

Why Do Coding Agents Race Ahead While Biology Agents Stall?

Anyone who writes software has felt how capable AI coding agents have become — resolving issues, passing test suites, moving as if on an open highway. Biology research is nothing like that, and the reason is structural.

Software engineering was, in effect, built for machines: version control, well-documented APIs, package managers, and testable outputs that compile and verify cleanly. Resolving a coding issue means producing a patch that passes the project's tests — a clean, checkable reward. Biological data is the opposite, a messy network of databases each with its own identifiers, file formats, filtering logic, and degree of programmatic access. Much of the essential know-how is implicit, living in researchers' heads and never written down, and biology offers few simple, verifiable signals to tell an agent whether it got the answer right.

The mismatch is not unique to biology. Andrej Karpathy made the same complaint in a recent talk on software in the age of AI: he had vibe-coded a small web app, but making it real (authentication, payments, deployment) cost him a week of clicking through browser dashboards. "The code was the easiest part!" he said; the documentation kept telling him to visit a URL and click a dropdown. Nobody should have to do that, he concluded; we must build for agents. Biology researchers, Anthropic notes, have lived Karpathy's frustration for years.

What Is the "Click Tax" in Virology?

There is a category of biology work so basic it is nearly invisible: pulling sequence data out of a database. For virologists that usually means NCBI Virus, a collection of viral sequence records drawn from GenBank, RefSeq, and the international INSDC ecosystem, behind a searchable web interface. Vaccine design, diagnostic development, and the construction of training data for protein models all tend to start there.

The catch is that much of NCBI Virus's filtering logic exists only in that web interface. Suppose a researcher wants every SARS-CoV-2 sequence released in 2025 containing the surface glycoprotein. In the browser, that is a few clicks. Programmatically, it can require a multi-hundred-line script that stitches together three APIs — REST, Datasets, and E-utilities — pages through results one screen at a time, reconciles mismatched identifiers, downloads hundreds of gigabytes, and discards most of it after filtering locally. In virology labs, the recipes for these datasets are often passed around as long lists of filters each researcher reproduces by hand — exactly the browser-clicking workflow Karpathy complained about, except the stakes can be measured in lives.

That is not hyperbole. In May 2026, the Democratic Republic of Congo suffered an Ebola outbreak caused by Bundibugyo virus. On May 14, INRB Kinshasa analyzed 13 blood samples and confirmed eight the next day, and an outbreak was declared; by May 29 the World Health Organization had reported more than 1,000 confirmed and suspected cases and over 200 deaths, and researchers had produced the first near-complete outbreak genomes, establishing a new spillover event. Those genomes raise three urgent questions — how different is this virus from earlier Ebola, can existing diagnostics detect it, and will existing therapeutics still protect patients — and answering any of them means comparing the new genomes against historical ones in NCBI Virus. The first step in that analysis is still a human clicking through a web interface, reproducing filters by hand, and hoping the result is complete.

How Badly Do Agents Do on Their Own?

To measure this, the team built a benchmark called VirBench: 120 realistic viral-sequence queries spanning 40 pathogens, each with a manually verified correct answer, mirroring real work in surveillance, diagnostic design, and protein-model training. A representative task asks an agent to retrieve NCBI sequences for TaxID 3052462 (Zaire ebolavirus) where the host is human, the sample was collected in Africa between January 1 and June 20, 2014, the sequence is at least 15,200 bases, has at most 1,900 ambiguous characters, and excludes lab-passaged samples.
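To make the stacked filters concrete, the representative query above can be written as one explicit predicate. This is only an illustrative sketch: the `Record` fields, the host label "Homo sapiens", and the definition of an ambiguous character as anything other than A, C, G, or T are assumptions made for the example, not NCBI Virus's actual schema or the benchmark's code.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Record:
    """Hypothetical, simplified view of one sequence record."""
    accession: str
    taxid: int
    host: str
    continent: str
    collected: date
    sequence: str
    lab_passaged: bool


def count_ambiguous(sequence: str) -> int:
    # Assumption: any character other than A, C, G, T counts as ambiguous.
    return sum(1 for base in sequence.upper() if base not in "ACGT")


def matches_query(rec: Record) -> bool:
    """Every condition from the example task, applied together."""
    return (
        rec.taxid == 3052462
        and rec.host == "Homo sapiens"
        and rec.continent == "Africa"
        and date(2014, 1, 1) <= rec.collected <= date(2014, 6, 20)
        and len(rec.sequence) >= 15_200
        and count_ambiguous(rec.sequence) <= 1_900
        and not rec.lab_passaged
    )


# Two made-up records: one satisfies every filter, one was collected a day late.
good = Record("EX0001", 3052462, "Homo sapiens", "Africa",
              date(2014, 3, 17), "ACGT" * 3_800, False)
too_late = Record("EX0002", 3052462, "Homo sapiens", "Africa",
                  date(2014, 6, 21), "ACGT" * 3_800, False)

print(matches_query(good), matches_query(too_late))  # True False
```

Seven conditions must all hold at once, so getting even one wrong silently changes the dataset.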

The systems tested were Claude Sonnet 4, Claude Opus 4.7, the open-source Biomni, Edison Analysis, GPT-5.2-pro, and GPT-5.5. Left to work with the infrastructure available today, they posted mean accuracies from 16.9% to 91.3%. That looks like a spread of grades, but for dataset construction the real bar is 100%: a single missing or wrong record can distort whether a diagnostic appears to cover circulating strains, or push the inferred start of an outbreak weeks in either direction. And the agents were not even consistent with themselves. Asked the identical Ebola query three times, Sonnet 4 returned 106 sequences first (the correct count was 266), then 15, then 5 — same prompt each time. For a scientific workflow that non-reproducibility is as damaging as the inaccuracy, because retrieval is the first link in a long downstream chain.
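The arithmetic behind that Ebola example is easy to check. The sketch below takes the three reported counts and computes the best recall each run could have achieved, assuming every returned record was correct (the article does not say whether that was the case). The `jaccard` helper is a generic set-overlap measure included to show one way run-to-run agreement could be scored; it is not necessarily the metric the benchmark used.

```python
def recall_upper_bound(returned: int, expected: int) -> float:
    """Best-case recall, assuming every returned record is correct."""
    return min(returned, expected) / expected


def jaccard(a: set[str], b: set[str]) -> float:
    """Overlap between two retrieved sets: 1.0 means identical runs."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


expected = 266            # correct count reported for the query
runs = [106, 15, 5]       # counts Sonnet 4 returned on three identical prompts

for returned in runs:
    bound = recall_upper_bound(returned, expected)
    print(f"{returned:>3} returned -> recall at most {bound:.1%}")
# 106 returned -> recall at most 39.8%
#  15 returned -> recall at most 5.6%
#   5 returned -> recall at most 1.9%

# Toy accession sets (made up) showing how two runs would be compared.
print(jaccard({"A", "B", "C"}, {"B", "C", "D"}))  # 0.5
```

Even on the most generous reading, the best of the three runs missed roughly 60% of the records.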

How Does a Bad Dataset Push an Outbreak's Origin to 1922?

The consequences go beyond wrong counts. The team used the retrieved sequences to build phylogenetic trees — the standard way to reconstruct how viral samples in an outbreak relate — and to estimate the time to the most recent common ancestor, or TMRCA, the inferred root date that shapes conclusions about when and where a virus arose and how long it circulated.

A tree from a manually curated set recovered a TMRCA of January 2014, consistent with prior published estimates for the 2014 West African epidemic (a 95% credible interval of January 27 to March 14). Two of Sonnet 4's three retrieved sets produced visibly broken trees, one dating the common ancestor to 1922 — more than ninety years off. The third looked fine but was missing sequences from Guinea, shifting the estimate to April 2014 and moving the inferred timing of the outbreak. The therapeutics analysis fared no better: retrieving Ebola glycoprotein sequences to check for mutations in the regions targeted by maftivimab and MBP134, two WHO priority treatment candidates in the current outbreak, Sonnet 4 produced three different conclusions across three runs.

Both cases share one root cause: the agents usually understood the task but lacked a reliable, deterministic way to access the database, verify the result, and reproduce it. The answers could look plausible while being wrong, and could differ on every run. The failure modes were specific. The biggest errors hit the viruses with the most records, such as Influenza A, HIV-1, and SARS-CoV-2, where an agent either stops partway through retrieval and under-counts or applies a filter incorrectly and over-counts. Agents also stumbled on metadata fields whose meaning depends on convention, and accuracy collapsed once a query stacked more than three or four filters.

How Did a Deterministic Tool Reach 99.7%?

The fix was not a smarter model. Working with researchers at NCBI, the team built a tool called gget virus to translate NCBI Virus's complex, browser-based retrieval behavior into a single accurate, reproducible programmatic interface — and that turned out to be far harder than wiring up an API. NCBI Virus is a portal over multiple underlying resources, internationally synchronized sequence databases maintained across the United States, Europe, and Japan, so even a simple query often means assembling information from several places.

To reproduce the interface's behavior, gget virus coordinates across the REST, Datasets, and E-utilities APIs; it decides which filters can run through those APIs and which must be checked locally, because some filtering is not exposed by any single endpoint. It batches requests so large result sets are retrieved in full rather than cut off, and when a filter depends on data in a separate database — a GenBank record showing whether a sequence contains a given protein — it fetches those records, applies the filter, and preserves the information in the output. It returns standardized results readable by both humans and machines, with logs showing exactly how the answer was produced. With gget virus available, every agent rose above 90% accuracy, peaking at 99.7% for GPT-5.5, and run-to-run variability was largely eliminated.
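The general pattern described here (fetch in batches until the source is exhausted, apply any remaining filters locally, and log each step) can be sketched in a few lines. This is not gget virus's code or API. `fetch_page`, the record fields, and the in-memory `fake_db` are hypothetical stand-ins so the example runs without network access.

```python
from typing import Callable

Page = list[dict]


def retrieve_all(fetch_page: Callable[[int, int], Page],
                 batch_size: int = 500) -> tuple[list[dict], list[str]]:
    """Page through a source until a short page signals the end."""
    records: list[dict] = []
    log: list[str] = []
    offset = 0
    while True:
        page = fetch_page(offset, batch_size)
        log.append(f"fetched offset={offset} count={len(page)}")
        records.extend(page)
        if len(page) < batch_size:   # last (possibly empty) page reached
            break
        offset += batch_size
    return records, log


def apply_local_filters(records: list[dict],
                        filters: dict[str, Callable[[dict], bool]],
                        log: list[str]) -> list[dict]:
    """Apply filters the remote API cannot express, logging each step."""
    kept = records
    for name, predicate in filters.items():
        before = len(kept)
        kept = [r for r in kept if predicate(r)]
        log.append(f"filter={name} kept={len(kept)} of {before}")
    # Sort so the same query always yields the same ordered output.
    return sorted(kept, key=lambda r: r["accession"])


# Hypothetical stand-in for a remote database with 1,234 records.
fake_db = [{"accession": f"ACC{i:04d}", "length": 15_000 + i} for i in range(1234)]


def fetch_page(offset: int, limit: int) -> Page:
    return fake_db[offset:offset + limit]


records, log = retrieve_all(fetch_page)
kept = apply_local_filters(records, {"min_length": lambda r: r["length"] >= 15_200}, log)

print(len(records), len(kept))  # 1234 1034
for line in log:
    print(line)
```

Because the loop only stops on a short page, a large result set cannot be silently truncated, and the log lets a reviewer see how many records each step kept.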

Why "Infrastructure, Not Intelligence" Is the Real Headline

The quietly radical result is what happened to the gap between models: it shrank dramatically. Once a deterministic retrieval layer is in place, the choice of model stops mattering much. Reliable dataset construction no longer hinges on access to the newest or most expensive frontier model, or on knowing which model is best at a given database; a cheaper model with the right tool both reduces run-to-run variability and widens access. That inverts the usual assumption that progress in AI-for-science is gated by model intelligence. Here, the determinism belongs in the tool, and the model is freed to do the creative work of hypotheses and design.

It also carries a sharper warning about trust. The agents' answers were plausible-looking, wrong, and irreproducible — the exact profile of a result that slips through review. As AI is woven deeper into scientific workflows, that pattern, not raw capability, is the thing to guard against, and the antidote is infrastructure that lets a reviewer see not just what was retrieved but how, turning a plausible answer into one that can be checked and repeated.

Anthropic frames gget virus as one instance of a general need to build "context engines," meaning reliable, agent-accessible infrastructure for biological data, alongside efforts such as ToolUniverse, Edison Scientific's Robin, and Biomni. The team is candid about the tension: extrapolate the capability curve and one can imagine agents soon good enough to navigate messy portals, reconcile identifiers, and recover from their own errors, making such tools less necessary. (One telling detail: across 360 runs, GPT-5.5 found and used gget virus on its own exactly once, and that was the only run that got that question right.) But even then a model that can brute-force a confusing workflow may be too expensive, too slow, too hard to audit, and too hard to trust for routine science. The lesson for database designers holds regardless: treat agents as scaled-up users and build for that scale now.


Frequently Asked Questions

What did Anthropic find about AI agents in biology?

In a June 8, 2026 essay, Anthropic reported that today's best research agents are unreliable at a basic biology task — retrieving viral sequences from the NCBI Virus database — scoring as low as 16.9% accuracy and giving different answers to the same query. Accuracy rose above 90% only after adding a deterministic retrieval tool.

What is gget virus?

gget virus is a tool Anthropic built with NCBI researchers that translates the NCBI Virus website's complex filtering into one accurate, reproducible programmatic interface. It coordinates the REST, Datasets, and E-utilities APIs, handles large-result batching, and returns logged, standardized output, lifting agent accuracy to as high as 99.7%.

Why is 16.9% to 91.3% accuracy not good enough?

For building scientific datasets the effective bar is 100%, because a single missing or wrong sequence can distort downstream analysis — in one example, a flawed dataset pushed an Ebola outbreak's estimated origin from 2014 back to 1922. Inconsistent, irreproducible results are especially dangerous since retrieval is the first step in a long analysis chain.

Does this mean AI can't help with biology research?

No. Anthropic's point is that the main bottleneck is infrastructure, not model intelligence. With deterministic tools connecting agents to biological databases, even cheaper models become reliable — so the path forward is building agent-ready data infrastructure rather than waiting for smarter models.

ⓒ 2026 TECHTIMES.com All rights reserved. Do not reproduce without permission.
