The Data Drought: Why Embodied AI Can’t Just Read the Internet


In March 2026, DoorDash launched a standalone app called Tasks, paying its 8 million U.S. delivery couriers to strap on body cameras and film themselves washing dishes, folding clothes, and making beds — not to improve food delivery, but to generate training data for humanoid robots. The launch was a symptom of the defining constraint now governing the entire robotics industry: unlike every other branch of modern AI, robots cannot learn from the internet, and the scramble to fill that gap is reshaping how data gets collected, who collects it, and what rights workers and residents retain over footage of their own homes.

Humanoids Raised $6 Billion in 2025, and Still Can’t Fold a Shirt Reliably

Language models were trained on billions of web pages; image models, on hundreds of millions of photographs. For both, the data already existed. For robots, it does not. A robot learning to wipe a counter needs multidimensional sensor traces — vision, force, joint position, motor command — captured in tight time synchronization during a real physical interaction. Each useful movement trajectory has to be recorded from scratch, on actual hardware, with humans or other robots in the loop. The industry calls this the data drought.
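
Concretely, one synchronized sample in such a trajectory bundles every stream under a shared clock. A minimal sketch in Python, with field names and shapes that are illustrative rather than any lab's actual schema:

```python
from dataclasses import dataclass, field
import numpy as np

@dataclass
class RobotTimestep:
    """One synchronized sample in a robot trajectory (illustrative fields)."""
    timestamp_ns: int            # shared clock so all streams can be aligned
    rgb: np.ndarray              # camera frame, e.g. (480, 640, 3) uint8
    joint_positions: np.ndarray  # radians, one entry per joint
    joint_torques: np.ndarray    # measured effort per joint, N*m
    wrist_force: np.ndarray      # 6-axis force/torque at the end effector
    motor_command: np.ndarray    # the action the policy learns to reproduce

@dataclass
class Trajectory:
    """An ordered series of timesteps plus the task they demonstrate."""
    task: str                    # e.g. "wipe the counter"
    steps: list[RobotTimestep] = field(default_factory=list)
```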

The scale of the gap is well documented. To train the RT-1 model in 2022, Google's robotics team ran 13 robots for 17 months in an office kitchen to collect 130,000 movement trajectories covering more than 700 tasks. The largest cross-institution open dataset assembled to date, Open X-Embodiment, pooled 60 separate datasets from 34 research labs at 21 institutions worldwide to reach approximately 1 million trajectories across 22 robot types. By the standards of language model training corpora, both numbers are vanishingly small: language datasets routinely contain 1.5 to 4.5 billion examples.
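
Dividing those figures out makes the gap concrete. The snippet below computes how many times smaller the two robot corpora are than a language dataset at the low and high ends of that range:

```python
# Orders of magnitude between the corpora cited above.
robot_corpora = {"RT-1": 130_000, "Open X-Embodiment": 1_000_000}
lang_low, lang_high = 1.5e9, 4.5e9  # examples in typical language datasets

for name, n in robot_corpora.items():
    print(f"{name}: {lang_low / n:,.0f}x to {lang_high / n:,.0f}x smaller "
          f"than a typical language training corpus")
# RT-1: 11,538x to 34,615x smaller than a typical language training corpus
# Open X-Embodiment: 1,500x to 4,500x smaller than a typical language training corpus
```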

Venture capital has not solved the problem by throwing money at it. Over $6 billion went into humanoid robots in 2025 alone, according to MIT Technology Review — yet the fundamental bottleneck remains unchanged. More capital can buy more hardware and hire more engineers, but it cannot conjure training data that does not exist.

Four Data Sources, None of Them Sufficient on Their Own

The industry has settled on four parallel approaches to building robot training data, each with a different trade-off between quality and scale.

Teleoperation — where a human operator physically drives a robot through a task while every sensor stream records simultaneously — produces the highest-quality data. Real contact, real failure, real recovery. According to the Silicon Valley Robotics Center's State of Robotics 2026 report, the fully loaded cost of that data fell from roughly $340 per hour in early 2024 to $118 per hour by March 2026, a 65 percent reduction. The cost is still prohibitive at scale: a single enterprise pilot now requires 300 to 1,200 demonstrations, which puts the entry-level data budget at $50,000 to $150,000.
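
A quick sanity check on those numbers: dividing the stated budgets by the March 2026 hourly rate gives the implied teleoperation hours behind each pilot, suggesting every accepted demonstration absorbs roughly an hour or more of operator time once setup, retries, and quality control are counted in:

```python
# Reconciling the figures above: what does a $50k-$150k budget imply
# about teleop time per accepted demonstration at $118 per hour?
RATE = 118  # USD per fully loaded teleoperation hour (March 2026 figure)

for demos, budget in ((300, 50_000), (1_200, 150_000)):
    hours = budget / RATE
    print(f"{demos} demos on ${budget:,}: {hours:.0f} teleop hours, "
          f"{hours / demos:.1f} h per accepted demo")
# 300 demos on $50,000: 424 teleop hours, 1.4 h per accepted demo
# 1200 demos on $150,000: 1271 teleop hours, 1.1 h per accepted demo
```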

Simulation — running thousands of virtual robots in parallel inside platforms like Nvidia's Isaac Sim — is cheap and infinitely scalable, but physics engines approximate the world rather than reproduce it. Friction on a damp surface, the deformation of soft fabric, the dynamics of a half-filled cup: all remain difficult to model precisely enough that policies trained in simulation transfer cleanly to a real robot. Researchers call this the sim-to-real gap.
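
The standard mitigation is domain randomization: resampling the parameters the engine models worst at the start of every episode, so a policy cannot overfit to any single wrong approximation of reality. A minimal sketch, with parameter ranges that are illustrative rather than drawn from Isaac Sim or any other engine:

```python
import random
from dataclasses import dataclass

@dataclass
class PhysicsDraw:
    """One randomized set of physics parameters for a simulated episode."""
    friction: float          # dry vs. damp surface contact
    mass_scale: float        # uncertainty in object mass
    motor_latency_ms: float  # actuation delay the real robot will have
    sensor_noise_std: float  # encoder and camera noise

def sample_physics(rng: random.Random) -> PhysicsDraw:
    # Resampled every episode so the policy must handle the whole range
    # rather than one (inevitably wrong) point estimate of the real world.
    return PhysicsDraw(
        friction=rng.uniform(0.3, 1.2),
        mass_scale=rng.uniform(0.8, 1.25),
        motor_latency_ms=rng.uniform(5.0, 40.0),
        sensor_noise_std=rng.uniform(0.0, 0.02),
    )

rng = random.Random(0)
for episode in range(3):
    print(f"episode {episode}: {sample_physics(rng)}")
```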

Motion capture tracks human bodies and retargets the movement onto robot frames. The data is rich, and it has driven viral demonstrations of robots performing gymnastic and martial arts routines. The limits appear the moment a task requires physical contact: a human hand adjusts grip force continuously through tactile feedback that current robot hardware cannot match. Visually identical motions fail because the robot lacks the joint torque, balance, or finger compliance the human used without thinking.
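
The failure mode is easy to reproduce in miniature. The sketch below retargets a noisy human joint trajectory onto a hypothetical robot joint with a narrower range and a per-tick step limit standing in for a torque ceiling, then counts how often the robot simply cannot follow:

```python
import numpy as np

# Toy illustration of retargeting limits: a "human wrist" trajectory is
# clamped to a hypothetical robot joint's position and per-tick step limits.
# Every clamped timestep is a moment the robot cannot track the human,
# however similar the motion looks on video.
human = np.cumsum(np.random.default_rng(0).normal(0.0, 0.15, 50))

JOINT_LIMIT = 1.0  # rad; assume the robot's range is narrower than a human's
MAX_STEP = 0.08    # rad per tick; crude stand-in for limited speed/torque

pos, infeasible = 0.0, 0
for target in human:
    step = np.clip(target - pos, -MAX_STEP, MAX_STEP)
    pos = float(np.clip(pos + step, -JOINT_LIMIT, JOINT_LIMIT))
    infeasible += abs(pos - target) > 1e-9

print(f"{infeasible}/{len(human)} timesteps the robot could not track")
```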

Internet and egocentric video is the most abundant source by several orders of magnitude. It carries no force values, no joint angles, no motor commands — only pixels. What it can teach a robot is a world model: the structure of physical scenes, the typical sequence of human actions, the affordances of common objects. First-person footage is the most strategically valuable subset, because it reflects the visual perspective a robot actually operates from.

Tesla Abandoned Motion Capture; Figure AI Recruited a Trillion-Dollar Landlord

The most consequential data-strategy shift in the US over the past year belongs to Tesla. In June 2025, after Optimus program director Milan Kovac departed and AI vice president Ashok Elluswamy took over, Tesla replaced its motion-capture suits and VR rigs with helmet-mounted five-camera arrays and 30-to-40-pound backpacks worn by factory workers during ordinary tasks. Workers were told the change would allow faster scaling of data collection. Jonathan Aitken, a robotics expert at the University of Sheffield, noted that the fixed camera towers Tesla added around the work area supplement the wearable footage with broader environmental context.

Figure AI took a different route. In September 2025, the San Jose–based company announced a partnership with Brookfield Asset Management, a global asset manager with over 100,000 residential units, 500 million square feet of commercial office space, and 160 million square feet of logistics space, to record first-person human task video across Brookfield properties at scale for Figure's Helix vision-language-action model. Figure CEO Brett Adcock described the logic directly: "Every machine learning breakthrough has come from massive, diverse datasets. There is nothing like this for robotics so we are building our own."

The initiative, called Project Go-Big, delivered an early result in November 2025: after training exclusively on egocentric human video collected in Brookfield residential properties, Figure's Helix model learned to navigate cluttered home environments from natural language commands like "go to the fridge" without a single robot demonstration. The company said it was the first instance of a humanoid robot learning navigation end-to-end from only human video, with zero robot-specific training data.

DoorDash and Its 8 Million Couriers Are Now a Robot Training Pipeline

The gig economy's role in the data drought became explicit on March 19, 2026, when DoorDash launched Tasks, a standalone app that pays the company's 8 million U.S. couriers to film themselves completing household chores. One assignment asks workers to wear a body camera pointed at their hands and scrub at least five dishes, holding each clean dish steady in frame before moving on. Others include folding clothes, making beds, and pruning plants. The footage is used to evaluate both DoorDash's in-house AI models and those of unnamed partners in retail, insurance, hospitality, and technology.

DoorDash is not alone. Scale AI and Encord are recruiting independent data recorders globally. California-based Sunday Robotics ships a "skill capture glove" to people nationwide who collect motion data by performing household tasks. In January 2026, Rest of World documented workers in Shanghai spending entire weeks wearing VR headsets and exoskeletons to repeat the same microwave-door motion hundreds of times per day to train the robot beside them.

China has turned this into national infrastructure. As of January 2026, the Chinese government had funded 40 dedicated robot training centers, according to Rest of World. At facilities such as the National and Local Co-Built Humanoid Robotics Innovation Center in Suzhou, human trainers repeat motions like folding clothes and wiping tables hundreds of times a day with humanoid robots beside them.

What Workers Are Not Being Told About the Footage They’re Generating

Critics and researchers have identified serious transparency failures in how data is being collected from gig workers. As MIT Technology Review reported in February 2026, the human labor behind humanoid robot training is being systematically obscured: workers often do not know which robot companies will ultimately train on their footage, and because that labor stays hidden, the public tends to overestimate how autonomous current robots actually are.

The problem is particularly acute for home recording programs. DoorDash's Tasks app asks couriers to bring cameras into their kitchens and capture their own voices, but has not published consent policies, data retention timelines, or the rights workers retain over footage recorded in their own residences. The footage is intimate by nature: it depicts domestic environments, personal routines, and the interiors of private homes. Its stated end use — training humanoid robot models owned by unnamed third-party partners — extends well beyond what a standard courier agreement implies.

DoorDash has excluded its Tasks app from California, New York City, Seattle, and Colorado — jurisdictions with stricter data privacy laws. The geographic exclusion is a signal: the program is structured to operate in markets where workers have fewer formal protections.

The roboticist Aaron Prather, speaking to MIT Technology Review, described working with a delivery company that had workers wear movement-tracking sensors as they moved boxes, with the data earmarked for robot training. "It's going to be weird," Prather said. "No doubts about it." The broader concern, as MIT Technology Review framed it, is that if humanoid robots are not genuinely autonomous — and most current deployments still require significant human guidance — the arrangement risks becoming a form of global labor arbitrage in which physical tasks are performed remotely from wherever human labor is cheapest, a pattern already documented in AI content moderation.

Sharpa's Tactile-Reflex Model Argues Hardware Can Substitute for Data

A parallel approach tries to make lower-quality data more useful by adding hardware in the loop. Singapore-based startup Sharpa demonstrated the clearest version of this argument at CES 2026. On January 6, 2026, the company announced CraftNet, a vision-tactile-language-action model built around a three-tier reflex hierarchy. A high-level reasoning layer decomposes tasks into steps; a mid-frequency motion model at roughly 10 Hz plans the physical approach; and a "System 0" layer at approximately 100 Hz uses tactile feedback from Sharpa's SharpaWave 22-degree-of-freedom hand to continuously readjust grip and finger position during contact.
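
Sharpa has not published CraftNet's interfaces, but the rate structure it describes can be sketched with hypothetical stand-ins: one task decomposition, a motion replan at roughly 10 Hz, and ten tactile corrections nested inside each replan at roughly 100 Hz:

```python
import random
import time

# Multi-rate control sketch. All function names and numbers here are
# hypothetical stand-ins; only the nesting of rates mirrors the description.

def decompose(task: str) -> list[str]:
    # High-level reasoning layer: runs once per task in this sketch.
    return [f"{task}: step {i}" for i in range(2)]

def plan_motion(step: str) -> float:
    # Mid-level motion model, ~10 Hz: returns a target grip force (stand-in).
    return 5.0

def read_slip() -> float:
    # Stand-in for tactile sensing: slip magnitude, 0 = stable grasp.
    return max(0.0, random.gauss(0.0, 0.1))

for step in decompose("load the dishwasher"):
    for _ in range(3):                 # ~10 Hz replanning ticks
        grip = plan_motion(step)
        for _ in range(10):            # ~100 Hz "System 0" reflex ticks
            grip += 0.5 * read_slip()  # tighten grip when slip is sensed
            time.sleep(0.01)
        print(f"{step}: grip after reflex corrections = {grip:.2f} N")
```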

The argument is that real-time tactile correction at the hardware layer reduces how much high-fidelity teleoperation data the model must memorize. Sharpa's Alicia Veneziani, Global VP of GTM and President of Europe, put the competitive claim directly: "Robots can already dance and backflip, but manipulation remains the real bottleneck for useful, autonomous robots. At Sharpa, we focus on productivity from day one, which is why we started with the hardest part, the hand."

No Company Uses a Single Source, and the Scaling Law Is Unproven

No company training a serious robot foundation model relies on one data tier alone. The pattern that has emerged mixes all four: teleoperation data is a fraction of one percent of total training examples but carries most of the weight for whether a policy works in the real world; synthetic data fills the long tail of rare scenarios; motion capture supplies broad-strokes movement priors; internet and egocentric video provide the underlying world model.
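
In training terms, that mix usually shows up as a weighted sampler: sources are drawn in proportions very different from their share of the raw corpus. The weights below are illustrative, not any company's published recipe:

```python
import random

# Mixed-source batch sampler sketch: teleoperation is a sliver of the corpus
# but is heavily upweighted at sampling time; video and simulation dominate
# the raw data yet contribute a smaller share of each batch.
SOURCES = {
    "teleoperation":  {"share_of_corpus": 0.005, "sampling_weight": 0.40},
    "simulation":     {"share_of_corpus": 0.600, "sampling_weight": 0.25},
    "motion_capture": {"share_of_corpus": 0.095, "sampling_weight": 0.10},
    "video":          {"share_of_corpus": 0.300, "sampling_weight": 0.25},
}

def sample_batch(n: int, rng: random.Random) -> list[str]:
    names = list(SOURCES)
    weights = [SOURCES[s]["sampling_weight"] for s in names]
    return rng.choices(names, weights=weights, k=n)

batch = sample_batch(1024, random.Random(0))
for name in SOURCES:
    print(name, batch.count(name))
```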

Open benchmarks like AgiBot World are emerging to let labs measure progress on shared tasks. AgiBot's Genie Operator-1 model, trained on over 1 million trajectories across 217 tasks, achieved a 30 percent performance improvement over models trained on Open X-Embodiment. The result is encouraging, but it does not answer the field's central open question: whether scaling robot training data produces the same emergent generalization that language models showed — the ability to transfer reliably to genuinely unseen problems.

According to the Silicon Valley Robotics Center, the global robotics market reached $38 billion in 2026, a 34 percent year-on-year increase and the fastest growth the sector has recorded in a decade. Adoption of Vision-Language-Action models tripled; they now appear in 40 percent of new deployments. But the longer-term hope most labs share — a "data flywheel" in which deployed robots generate failure data that feeds back into the next training cycle automatically — has not yet been demonstrated to work at scale.

The robotics industry has not yet proven its scaling law exists. Until it does, the data drought sets the speed limit on everything else — and the people being asked to close that gap are delivery couriers, warehouse workers, and residents of Brookfield apartment buildings, many of whom have no clear picture of where their footage ends up.
