Open Source Robotics Model MolmoAct2: Ai2 Beats π0.5, Releases 720-Hour Bimanual Dataset

The Allen Institute for AI (Ai2) released MolmoAct2 on May 5, 2026, as a fully open robotics foundation model. By Ai2's own reporting, it outperforms Physical Intelligence's π0.5 on multiple real-world and simulation benchmarks, runs inference up to 37 times faster than its predecessor, and ships with what the institute describes as the largest open-source bimanual tabletop manipulation dataset ever published. The release arrives as open-source robotics AI crosses what practitioners describe as a production-grade threshold: the point at which researchers and builders can train, fine-tune, and deploy competitive robot policies without access to proprietary datasets, expensive closed systems, or platform-specific hardware contracts.

For robotics engineers and researchers, the practical meaning is direct: MolmoAct2's weights, training code, datasets, and action tokenizer are all publicly available under open-source terms, making it possible to reproduce, adapt, and build on the model in ways that most competing systems — including Physical Intelligence's π0.5 — do not permit. The gap between what a funded proprietary lab can build and what an independent research team or startup can deploy just narrowed.

MolmoAct2 is the second generation of Ai2's Action Reasoning Model family, first introduced in August 2025. Where the original MolmoAct demonstrated that an open, reasoning-based architecture could outperform larger closed models on standard benchmarks, MolmoAct2 is designed for deployment in real-world environments: household surfaces, laboratory benches, and factory-adjacent tabletop tasks, without requiring per-task fine-tuning from scratch.

Spatial Reasoning Before Action: How MolmoAct2 Works

At the center of MolmoAct2 is a new vision-language backbone called Molmo2-ER, a variant of Ai2's earlier Molmo2 multimodal model specialized for embodied and spatial reasoning. The backbone was trained on a corpus of approximately 3.3 million samples spanning image embodied question-answering, spatial pointing, object detection, video embodied reasoning, multi-image tasks, and ego-exo correspondence — the ability to relate a robot's first-person view to third-person spatial understanding. That training matters because robot manipulation depends not only on recognizing objects but on understanding where they are in three-dimensional space, how far away they sit, what surfaces are reachable, and how the robot's camera perspective relates to the geometry of the scene.

The model connects this spatial backbone to a continuous action generator using a flow-matching architecture with a key-value cache bridge. Rather than recomputing attention from scratch at each timestep, the action expert reuses the keys and values already computed in the vision-language backbone's internal layers. The design cuts inference latency while preserving the grounding advantages of a spatially trained model. According to Ai2's technical paper, the result is inference up to 37 times faster than the original MolmoAct, which opens the model to closed-loop control tasks that earlier-generation vision-language-action systems were too slow to handle.
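To make the flow-matching idea concrete, here is a minimal sketch, not Ai2's implementation. It assumes a toy straight-line velocity field and a hypothetical `encode_context` function standing in for the expensive vision-language pass. The point it illustrates is the division of labor: the context is computed once and then reused across every integration step that turns noise into an action.

```python
import random


def encode_context(observation):
    """Hypothetical stand-in for the expensive vision-language pass."""
    encode_context.calls += 1
    # Toy assumption: the "context" is simply the action the policy should reach.
    return {"target": [0.5 * v for v in observation]}


encode_context.calls = 0


def velocity(x, t, context):
    """Toy velocity field for a straight-line flow toward context["target"]."""
    return [(g - xi) / (1.0 - t) for xi, g in zip(x, context["target"])]


def sample_action(observation, steps=10, seed=0):
    """Integrate the flow from noise (t=0) to an action (t=1) with Euler steps."""
    rng = random.Random(seed)
    context = encode_context(observation)            # computed once, reused below
    x = [rng.gauss(0.0, 1.0) for _ in observation]   # start from Gaussian noise
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt                                   # t stays below 1, so no division by zero
        v = velocity(x, t, context)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


action = sample_action([0.2, -0.4, 1.0])
print(action, "context passes:", encode_context.calls)
```

In a real system the velocity field is a learned network and the cached context consists of attention keys and values rather than a target vector, but the loop structure is the same: many cheap action-expert steps per expensive backbone pass.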

What Is a Vision-Language-Action Model, and Why Does Spatial Reasoning Matter?

A vision-language-action (VLA) model is a neural network that takes image inputs and natural-language instructions and outputs robot control signals, such as joint positions or end-effector trajectories, rather than text or images. Earlier VLA models treated the visual scene as a 2D perception problem: they could recognize objects and follow instructions but struggled with tasks requiring depth estimation, free-space reasoning, or viewpoint-dependent geometry. MolmoAct2's spatial backbone is trained on problems that directly require 3D understanding, such as predicting where an object is relative to a surface, which direction a robot arm should move to avoid a collision, and how a camera's perspective relates to where a robot can reach. The technical claim is that this spatial grounding produces more reliable action generation in scenes that differ from training conditions.

MolmoAct2-Think: Adaptive Depth Without Latency Penalty

One of MolmoAct2's five architectural advances is a variant called MolmoAct2-Think, which adds adaptive spatial reasoning to each control step without the latency cost that earlier reasoning-augmented policies incurred. Prior systems that generated intermediate spatial representations — such as depth maps — at every timestep were often too slow for real-time manipulation. MolmoAct2-Think addresses this by updating depth tokens only for scene regions that have changed between timesteps, caching the spatial representation for static parts of the scene.
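The selective-update idea described above can be sketched in a few lines. This is an illustration under stated assumptions, not MolmoAct2-Think's actual mechanism: the patch layout, the mean-absolute-difference change test, the `threshold` value, and the `estimate_depth_token` helper are all hypothetical.

```python
def estimate_depth_token(patch):
    """Hypothetical stand-in for an expensive depth-token prediction."""
    return round(sum(patch) / len(patch), 3)


def update_depth_tokens(prev_patches, curr_patches, cached_tokens, threshold=0.05):
    """Refresh depth tokens only for patches that changed; reuse the cache elsewhere.

    Returns the updated token list and the indices that were recomputed.
    """
    tokens = list(cached_tokens)
    recomputed = []
    for i, (prev, curr) in enumerate(zip(prev_patches, curr_patches)):
        # Mean absolute change between the previous and current patch.
        change = sum(abs(a - b) for a, b in zip(prev, curr)) / len(curr)
        if change > threshold:
            tokens[i] = estimate_depth_token(curr)
            recomputed.append(i)
    return tokens, recomputed


prev_frame = [[0.1, 0.1], [0.5, 0.5], [0.9, 0.9]]
curr_frame = [[0.1, 0.1], [0.8, 0.6], [0.9, 0.9]]   # only the middle patch moved
cache = [0.1, 0.5, 0.9]

tokens, recomputed = update_depth_tokens(prev_frame, curr_frame, cache)
print(tokens, recomputed)
```

The cost of each control step then scales with how much of the scene moved rather than with the size of the image, which is where the latency saving would come from in a mostly static tabletop scene.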

On Ai2's own evaluations, MolmoAct2-Think achieved a 98.1% average success rate on the LIBERO simulation benchmark, compared with 97.2% for the standard MolmoAct2 and 96.9% for Physical Intelligence's π0.5. On real-world DROID-style tasks, MolmoAct2-DROID achieved an 87.1% average success rate, ahead of both MolmoBot and π0.5-DROID. These are author-reported figures from Ai2's own technical paper. As Ai2 researcher Jiafei Duan has noted, robotics benchmarks are significantly harder to reproduce independently than language or multimodal model evaluations, since they require identical hardware, physical setups, and controlled conditions that only the model's creators typically have on hand. The figures should be treated as directionally meaningful until external researchers with equivalent hardware confirm them.

Open Action Tokenizer: The Reproducibility Problem Physical Intelligence Left Unsolved

A significant detail in the MolmoAct2 release is the publication of MolmoAct2-FAST, an open action tokenizer that maps one second of continuous 32-dimensional robot actions into compact discrete sequences. Action tokenization is a critical bridge between language-model training and robot control: raw robot movements are continuous, high-frequency, and embodiment-specific, while language models operate on discrete tokens. Physical Intelligence's FAST tokenizer, which has become a common baseline in the field, releases weights but not the training data distribution, making it impossible to verify exactly what the tokenizer was trained on or to reproduce it from scratch. MolmoAct2-FAST publishes both — a distinction that affects whether other researchers can fully audit, replicate, or improve on the system.
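To show what "mapping continuous actions into discrete sequences" can look like, here is a minimal sketch of a frequency-domain tokenizer: a discrete cosine transform per action dimension followed by rounding to integers. The article does not describe MolmoAct2-FAST's internals, so the transform choice, the `scale` factor, and the 30-step chunk length (a 30 Hz control rate) are assumptions made purely for illustration.

```python
import math


def dct(signal):
    """Orthonormal DCT-II of a 1-D sequence."""
    n = len(signal)
    out = []
    for k in range(n):
        norm = math.sqrt((1.0 if k == 0 else 2.0) / n)
        out.append(norm * sum(x * math.cos(math.pi / n * (i + 0.5) * k)
                              for i, x in enumerate(signal)))
    return out


def idct(coeffs):
    """Inverse of dct() (orthonormal DCT-III)."""
    n = len(coeffs)
    return [sum(math.sqrt((1.0 if k == 0 else 2.0) / n) * c
                * math.cos(math.pi / n * (i + 0.5) * k)
                for k, c in enumerate(coeffs))
            for i in range(n)]


def tokenize(chunk, scale=50.0):
    """chunk: T timesteps, each a list of D action values.

    Returns D lists of T integer tokens, one list per action dimension.
    """
    dims = len(chunk[0])
    return [[round(c * scale) for c in dct([step[d] for step in chunk])]
            for d in range(dims)]


def detokenize(tokens, scale=50.0):
    """Invert tokenize() up to quantization error; returns T x D floats."""
    per_dim = [idct([q / scale for q in dim_tokens]) for dim_tokens in tokens]
    return [list(step) for step in zip(*per_dim)]


# Toy one-second chunk: 30 timesteps (assumed rate) x 32 action dimensions,
# each dimension a smooth low-frequency cosine.
T, D = 30, 32
chunk = [[0.5 * math.cos(math.pi / T * (t + 0.5) * (d % 4)) for d in range(D)]
         for t in range(T)]

tokens = tokenize(chunk)
restored = detokenize(tokens)
max_error = max(abs(a - b) for row_a, row_b in zip(chunk, restored)
                for a, b in zip(row_a, row_b))
nonzero = sum(1 for dim_tokens in tokens for q in dim_tokens if q != 0)
print(f"max reconstruction error {max_error:.4f}, nonzero tokens {nonzero} of {T * D}")
```

Smooth motions concentrate their energy in a few low-frequency coefficients, so most tokens round to zero and the sequence compresses well. A production tokenizer would typically follow this with a lossless compression step such as byte-pair encoding, which is omitted here. What the tokenizer was fitted on determines the token statistics, which is why publishing the training data distribution matters for reproducing it.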

Largest Open Bimanual Dataset: 34,500 Demonstrations, 720 Hours

Alongside the model, Ai2 published the MolmoAct2-Bimanual YAM dataset: 34,500 teleoperated demonstrations totaling more than 720 hours, collected over approximately two months. The tasks include folding clothes, untangling cables, scanning groceries, packing medication, bussing tables, and related household, coffee-shop, and light industrial behaviors — all executed on a two-armed tabletop robot platform. Ai2 describes it as the largest open-source bimanual tabletop manipulation dataset ever published.
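A quick back-of-the-envelope calculation puts the headline figures in proportion. The only assumption is that "approximately two months" means 60 days; since the total is reported as "more than 720 hours", the per-demonstration and per-day durations are lower bounds.

```python
demonstrations = 34_500
total_hours = 720        # reported as "more than 720", so a lower bound
collection_days = 60     # assumption: "approximately two months"

seconds_per_demo = total_hours * 3600 / demonstrations
demos_per_day = demonstrations / collection_days
hours_per_day = total_hours / collection_days

print(f"{seconds_per_demo:.1f} s per demonstration on average")
print(f"{demos_per_day:.0f} demonstrations per day")
print(f"{hours_per_day:.1f} hours of recorded data per day")
```

That works out to roughly 75 seconds per demonstration and about 12 recorded hours per day, a rate that suggests several teleoperation stations running in parallel, although the article does not say how collection was organized.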

The release also includes two curated subsets: MolmoAct2-DROID, a quality-filtered version of the widely used 76,000-demonstration DROID Franka dataset; and MolmoAct2-SO100/101, a filtered community dataset from the affordable SO-100 and SO-101 robot arms associated with the Hugging Face LeRobot ecosystem. The SO-100 and SO-101 are sub-$500 robot arms popular among independent researchers and student labs, meaning parts of MolmoAct2's training pipeline are now accessible for platforms that cost less than a new laptop.

How MolmoAct2 Performs on Its Two Admitted Weaknesses

Ai2's own blog post is direct about what MolmoAct2 cannot yet do. First, the model plans batches of 10 to 30 movements and executes them as a sequence without re-inferring mid-batch. If something unexpected happens during execution — a shifted object, an arm bumping an obstacle — the robot cannot adjust until the next batch cycle, and transitions between batches can appear jerky. Second, MolmoAct2 operates reliably out of the box only on the three hardware platforms it was heavily trained on: the bimanual YAM setup, the DROID Franka arm, and the SO-100/SO-101. Deploying on a humanoid or a robot with significantly different kinematics requires additional task-specific training data.
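The first limitation is easy to reproduce in a toy simulation. The sketch below assumes a one-dimensional robot and a hypothetical `plan_chunk` policy that plans evenly spaced steps toward wherever the target was at planning time. It is not MolmoAct2's controller; it only shows why a mid-chunk disturbance goes unnoticed until the next inference.

```python
def plan_chunk(robot_pos, target_pos, horizon):
    """Hypothetical policy: evenly spaced steps toward the target seen at planning time."""
    step = (target_pos - robot_pos) / horizon
    return [step] * horizon


def run(horizon, total_steps, shift_at, shift_to, target=1.0):
    """Execute chunks open-loop; return final position and the steps where inference ran."""
    robot = 0.0
    chunk = []
    inference_steps = []
    for t in range(total_steps):
        if t == shift_at:
            target = shift_to            # the object moves mid-execution
        if not chunk:                    # re-infer only when the chunk is exhausted
            chunk = plan_chunk(robot, target, horizon)
            inference_steps.append(t)
        robot += chunk.pop(0)
    return robot, inference_steps


final_pos, inference_steps = run(horizon=10, total_steps=20, shift_at=3, shift_to=2.0)
reaction_delay = min(t for t in inference_steps if t >= 3) - 3
print(final_pos, inference_steps, reaction_delay)
```

With a 10-step chunk and a disturbance at step 3, the robot keeps heading for the old target for 7 more steps before it replans. Shorter chunks shrink that blind window at the cost of more frequent inference, which is the trade-off behind the 10-to-30-movement batches described above.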

The model is not a universal robot controller. As Ai2 frames it, MolmoAct2 is a foundation checkpoint: a reproducible starting point for fine-tuning, not a finished product for deployment without modification.

Stanford's Cong Lab provides one of the more meaningful deployment signals in the release documentation. After evaluating several generalist robotics models for its CRISPR wet-lab workflow, the team selected MolmoAct2 as showing strong potential to streamline key parts of automated laboratory operations. Because reliability matters and failure carries real costs in that setting, the selection is a more demanding signal than a benchmark score alone.

Open Source Robotics AI Ecosystem: Where MolmoAct2 Fits

The competitive landscape for open robotics foundation models has shifted considerably in the past twelve months. Physical Intelligence's π0.5 focuses on open-world generalization for mobile manipulation in homes and novel environments. Hugging Face's SmolVLA, released in June 2025, pushed in the opposite direction — 450 million parameters, capable of running on a MacBook, trained entirely on community datasets. NVIDIA's GR00T N1.7 operates in the humanoid space with a dual-system architecture separating fast low-level control from high-level language planning. MolmoAct2 enters this landscape emphasizing spatial reasoning depth, dataset transparency, and practical deployment on low-to-medium cost hardware — a position that aligns with academic labs, robotics startups, and production environments that need reproducibility and platform flexibility more than raw benchmark margin.

The release arrives three weeks after Hugging Face's LeRobot platform reached 58,000 community-contributed datasets — a 50-fold increase since the end of 2024 — signaling that the open-source robotics data ecosystem is producing training material at a pace that was unavailable for most labs even eighteen months ago. For builders deciding where to invest their engineering time, MolmoAct2 offers something the robotics field has rarely had: a competitive model whose full recipe — data, code, weights, and tokenizer — can be read, checked, and improved by anyone.

Frequently Asked Questions

What is a vision-language-action model in robotics?

A vision-language-action model is a neural network trained to take camera images and natural-language instructions as input and output robot control signals — such as arm positions or movement trajectories — rather than text or images. Unlike earlier robot controllers programmed for specific tasks, VLA models are designed to generalize across diverse instructions and environments using the same type of large-scale training that powers modern language AI.

How does MolmoAct2 compare to Physical Intelligence's π0.5 on real-world tasks?

On LIBERO simulation benchmarks, MolmoAct2 reported a 97.2% average success rate versus 96.9% for π0.5, with MolmoAct2-Think reaching 98.1%. On real-world DROID-style tasks, MolmoAct2-DROID achieved an 87.1% success rate, ahead of π0.5-DROID. These figures come from Ai2's own technical paper and should be treated as directionally meaningful until independently reproduced, since robotics benchmarks require identical hardware and physical setups that most external researchers cannot easily replicate.

Can open-source robot models like MolmoAct2 run on affordable hardware?

MolmoAct2's training data includes demonstrations from the SO-100 and SO-101 robot arms, which cost under $500. The model's foundation checkpoint is available for fine-tuning on those platforms, as well as on the Franka arm and bimanual YAM setup. Full out-of-the-box deployment is limited to those three hardware configurations; other platforms require additional training data.

What makes MolmoAct2 different from other open robotics models in terms of openness?

MolmoAct2 releases not just model weights but training code, training datasets, and an action tokenizer with its full training data distribution — the last element being something Physical Intelligence's competing FAST tokenizer does not provide. That level of openness makes the complete pipeline auditable and reproducible, which is a requirement for academic research, safety validation, and deployment in regulated environments.

