Fei-Fei Li’s ESI-Bench Catches Frontier AI Failing 3D Space: Seeing and Acting Diverge

Stanford-Led Benchmark Across Approximately 3,000 Tasks Finds Frontier Models Commit Prematurely and Miss Falsifying Viewpoints: A Gap World Labs Raised $1 Billion to Close

Fei-Fei Li, Stanford University

A Stanford-led research team that includes Fei-Fei Li published a benchmark on May 18, 2026, that documents a specific, measurable failure in every frontier multimodal AI model tested: when forced to actively move through a 3D environment rather than interpret a pre-composed image, state-of-the-art systems consistently choose the wrong actions, skip viewpoints that would correct their mistakes, and commit to wrong answers with high confidence. The paper names this failure "action blindness." For robotics buyers, simulation engineers, and any developer whose product requires an AI agent to reason about physical space, the finding supplies a test that existing leaderboards do not.

ESI-Bench Redefines What Spatial Intelligence Requires

The benchmark, titled "ESI-Bench: Towards Embodied Spatial Intelligence that Closes the Perception-Action Loop," covers 10 task categories and 29 subcategories built inside the OmniGibson simulator on top of Stanford's BEHAVIOR-1K activity dataset. Its approximately 3,000 tasks are organized around what developmental psychologist Elizabeth Spelke at Harvard has identified as core knowledge systems — the domain-specific cognitive systems human infants use to reason about objects, agents, numbers, space, and form.

The conceptual shift the benchmark makes is precise and consequential. Prior spatial-intelligence benchmarks assumed oracle observations: the model was handed the right view of the scene before being asked a question. ESI-Bench removes that assumption. The agent must first decide which ability to deploy — perception, locomotion, or manipulation — and in what sequence, in order to accumulate the evidence needed to answer. As described on the project page, the agent must actively uncover what is unseen: occluded structure, dynamics, containment, and functionality that no single static image can reveal.
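The difference between an oracle observation and an actively gathered one can be sketched in a few lines of Python. This is a toy illustration only, not ESI-Bench's code or the OmniGibson API: the `ToyContainerTask` class, the three action names, and both policies are hypothetical stand-ins, and the capacities are made-up values. The point is that the correct answer stays hidden until the agent picks the action that reveals it.

```python
from dataclasses import dataclass

# Hypothetical toy task (not the ESI-Bench API): which container holds more?
# Capacities are hidden from the agent until it manipulates a container.
@dataclass
class ToyContainerTask:
    capacities: dict  # e.g. {"A": 0.8, "B": 1.2}, illustrative values

    def step(self, action, target):
        """Apply one action; return the measured capacity, or None."""
        if action in ("perceive", "locomote"):
            # Looking or moving shows only the exterior, so nothing is measured.
            return None
        if action == "manipulate":
            # Handling the container reveals how much it holds.
            return self.capacities[target]
        raise ValueError(f"unknown action: {action}")


def run_episode(task, policy, max_steps=6):
    """Let the policy act until it answers or the step budget runs out."""
    history = []  # list of (action, target, observation)
    for _ in range(max_steps):
        decision = policy(history)
        if decision[0] == "answer":
            return decision[1], history
        action, target = decision
        history.append((action, target, task.step(action, target)))
    return None, history


def passive_policy(history):
    """Looks once, then commits without decisive evidence."""
    if not history:
        return ("perceive", "A")
    return ("answer", "A")


def active_policy(history):
    """Measures every container it has not measured yet, then answers."""
    measured = {t: obs for _, t, obs in history if obs is not None}
    for target in ("A", "B"):
        if target not in measured:
            return ("manipulate", target)
    return ("answer", max(measured, key=measured.get))
```

With capacities of 0.8 for A and 1.2 for B, `run_episode(task, active_policy)` answers "B" after two manipulations, while `passive_policy` answers "A" after a single look that revealed nothing.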

The task set reflects that design. Agents are asked to compare the liquid-holding capacity of two containers by manipulating them, predict whether a deformable object will conform to a surface, judge whether a tower will balance under given mass and geometry distributions, and distinguish a mirror reflection from the real scene by repositioning. Each task has a correct answer that is structurally inaccessible without choosing an action first.

Frontier Models See But Do Not Know When to Move

Experiments on state-of-the-art multimodal large language models produced two main findings. Active exploration — letting the model choose its own viewpoints — substantially outperforms passive observation, and agents spontaneously discover spatial strategies they were not explicitly trained to use. But even with that advantage, models fall well short of oracle action selection: the best performance still leaves meaningful headroom against both human performance and the theoretical ceiling a model with perfect action choices would achieve.

The dominant failure mode is not perceptual. The paper reports that most errors originate from action blindness — the model's inability to identify which observation would be informative and when to seek it. Poor action choices produce poor observations, which cascade into wrong answers. In contrast, humans consistently seek falsifying viewpoints: when a scene is ambiguous, a person will move to the angle most likely to disprove their current hypothesis. Current multimodal large language models do the opposite, committing prematurely with high confidence regardless of the quality of the evidence they have gathered so far.
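One common way to formalize "seeking a falsifying viewpoint" is expected information gain: pick the view whose outcome is expected to reduce uncertainty the most, and hold off answering until the belief is strong. The sketch below uses that textbook idea as an illustration; it is not the method described in the paper. The viewpoint names, the likelihood numbers, and the 0.95 commitment threshold are all assumptions chosen for the example.

```python
import math


def entropy(p):
    """Shannon entropy in bits of a binary belief p = P(hypothesis is true)."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def posterior(prior, like_true, like_false):
    """Bayes update of the belief after one observation."""
    evidence = prior * like_true + (1 - prior) * like_false
    return prior * like_true / evidence


def expected_info_gain(prior, p_pos_if_true, p_pos_if_false):
    """Expected entropy reduction from a viewpoint with a yes/no outcome."""
    p_pos = prior * p_pos_if_true + (1 - prior) * p_pos_if_false
    gain = entropy(prior)
    if p_pos > 0:
        gain -= p_pos * entropy(posterior(prior, p_pos_if_true, p_pos_if_false))
    if p_pos < 1:
        gain -= (1 - p_pos) * entropy(
            posterior(prior, 1 - p_pos_if_true, 1 - p_pos_if_false)
        )
    return gain


# Hypothetical viewpoints for "is this a mirror reflection?".
# Each pair is (P(cue seen | reflection), P(cue seen | real scene)).
VIEWPOINTS = {
    "front": (0.90, 0.90),  # looks the same either way: uninformative
    "top": (0.70, 0.50),    # weakly informative
    "side": (0.95, 0.10),   # would expose a mirror: potentially falsifying
}


def pick_viewpoint(prior):
    """Choose the viewpoint with the highest expected information gain."""
    return max(VIEWPOINTS, key=lambda v: expected_info_gain(prior, *VIEWPOINTS[v]))


def should_commit(belief, threshold=0.95):
    """Answer only once the belief is strong in either direction."""
    return max(belief, 1 - belief) >= threshold
```

Under these made-up numbers, an agent with a prior of 0.7 would move to the "side" view rather than answer, because that is the view most able to overturn its current guess. The premature commitment the paper describes corresponds to answering while `should_commit` would still return false.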

The paper also reports a secondary finding about 3D grounding. Explicit 3D representation stabilizes reasoning on depth-sensitive tasks, but imperfect 3D reconstruction proves more harmful than a 2D baseline, distorting spatial relations rather than clarifying them. The implication is that adding sensory modalities without improving action selection can leave a system worse off than it was without them.

MMSI-Bench Supplies Independent Corroboration

ESI-Bench does not stand alone. MMSI-Bench, a separate benchmark from Shanghai AI Laboratory and Beijing Normal University published in May 2025, found that even OpenAI's o3 — the strongest reasoning-tuned model tested — reached only roughly 40 percent accuracy on multi-image spatial-reasoning tasks, against a human baseline of 97 percent. The strongest open-source model in that evaluation reached approximately 30 percent.

The two benchmarks use different paradigms — MMSI-Bench tests passive multi-image reasoning; ESI-Bench tests active 3D exploration — but their joint message is consistent: scaling language-trained models has not closed the spatial-reasoning gap in the way it closed several text and image benchmarks. The gap that remains is not at the edge of model capability; it is near the center of what physical-world deployment demands.

World Labs' $1 Billion Bet on the Same Problem

Three months before the paper appeared, World Labs — the company Li co-founded and leads as CEO — closed a $1 billion funding round. Investors included chip companies AMD and NVIDIA, design-software giant Autodesk, Emerson Collective, Fidelity Management and Research Company, and Sea. Autodesk committed $200 million and signed on as a strategic adviser — a notable position for the company whose CAD and 3D-design tools are used across architecture, engineering, and entertainment.

The round followed World Labs' $230 million seed raise in September 2024 and came three months after the company's first commercial product, Marble, launched in November 2025. Marble generates and edits persistent 3D environments from text, image, video, or 3D-layout inputs and exports to formats — Gaussian splats, meshes, video — that game engines including Unreal and Unity can ingest directly. Subscription tiers run from free to $95 per month for the Max plan, which includes full feature access and commercial rights.

According to TechCrunch, reports ahead of the round had placed the target valuation at approximately $5 billion; World Labs did not disclose final terms.

In a manifesto accompanying Marble's launch, Li wrote that the next generation of world models would enable machines to achieve spatial intelligence on an entirely new level. At a Cisco event in February 2026, she argued that the binding constraint for AI systems expected to act in the physical world is no longer language reasoning but spatial understanding. "The ability to understand, to reason, to interact with and to navigate the real 3D, 4D physical world is the foundation," she said.

Diagnostic and Commercial Thesis Reinforce Each Other

The connection between the paper and the company is structural, not merely biographical. ESI-Bench provides the diagnostic — a rigorous, open framework for measuring precisely the capabilities that physical-world AI agents need and that current models consistently lack. Marble, and the broader world-model category World Labs is building toward, provides the substrate: persistent, manipulable 3D environments that an embodied agent must have in order to train on the kind of perception-action loops ESI-Bench is testing.

Li told Fast Company in March 2026 that robotics may be the biggest beneficiary of World Labs' work: "You need a 3D environment that is interactable, that has collisions, physics, and dynamics to train and evaluate robots." The World Labs blog post accompanying the February funding round said the capital would go to advancing models for storytelling, creativity, robotics, and scientific discovery.

The broader world-model space is competitive. Google DeepMind's Genie series, NVIDIA's Cosmos platform, and Yann LeCun's AMI Labs — which raised $1 billion of its own in early 2026 — are all building toward overlapping targets from different architectural starting points.

Benchmark Findings Stop Where the Thesis Continues

Several caveats are material. ESI-Bench's environments are simulated, built in the OmniGibson simulator on the BEHAVIOR-1K dataset rather than in the physical world, and the findings about the advantage of active exploration have not yet been independently replicated on physical robots in the peer-reviewed literature. The benchmark also tests models in an embodied setting they were not trained for, which means weak performance may reflect a distribution mismatch as much as a fundamental capability ceiling. The two effects are difficult to separate at this stage.

The broader spatial-intelligence thesis — that 3D-grounded world models will become the foundation for robotics and physical AI — remains contested by researchers who argue that scaled video-prediction models or end-to-end policy learning are more direct paths to the same destination.

What ESI-Bench establishes, with primary-source documentation, is that the next round of spatial-AI evaluation will measure something categorically different from what existing leaderboards measure, and that on those new measures the current model class still has substantial headroom against oracle and human performance. That gap — specifically named, openly benchmarked, and now reproducible by any team with access to the code and simulator — is what makes both the academic finding and the commercial bet legible at the same time. Developers and buyers evaluating spatial AI systems now have a standardized yardstick: models that cannot pass ESI-Bench-style tests are not ready for physical-world deployment, regardless of how they rank on static-image leaderboards.

ⓒ 2026 TECHTIMES.com All rights reserved. Do not reproduce without permission.
