Researchers at the Computer Science and Artificial Intelligence Laboratory (CSAIL) at MIT have unveiled Feature Fields for Robotic Manipulation (F3RM), a ground-breaking system that enables robots to grasp and manipulate objects with the help of open-ended language prompts.

Human flexibility is the inspiration for F3RM. Just as people can easily handle objects they have never encountered before, F3RM enables robots to identify, understand, and interact with unfamiliar items. The method builds rich 3D representations of a scene that robots can reason over by fusing features from vision foundation models with 2D images, according to Robohub.

F3RM has a wide range of possible uses, especially in cluttered settings like homes and warehouses. There, robots can interpret ambiguous or nonspecific human requests and still complete tasks accurately and efficiently, making them more flexible and human-like in their interactions.


AI System Promotes Customer Satisfaction

F3RM may be especially helpful in large fulfillment centers, where unpredictability and clutter are the norm. Robots in these facilities are typically given text descriptions of inventory items and must match them to the physical goods, regardless of variations in packaging, to guarantee that orders are filled correctly.

With F3RM's combined spatial and semantic perception, robots can become more adept at locating items, placing them in designated bins, and preparing them for packing. That improves order fulfillment and efficiency, speeding shipments and boosting customer satisfaction.

The most remarkable feature of F3RM is its scene comprehension, which enables it to be used in a variety of contexts, including homes and warehouses. F3RM, for example, may help customized robots recognize and retrieve certain objects. The technology helps robots interact with their environment both physically and perceptually.

"We wanted to learn how to make robots as flexible as ourselves since we can grasp and place objects even though we've never seen them before," remarked Ge Yang, a postdoc at the National Science Foundation AI Institute for Artificial Intelligence and Fundamental Interactions and MIT CSAIL, as quoted in the institution's media release.


F3RM starts by capturing roughly 50 images with a camera mounted on a selfie stick, taken from varied angles and positions. These pictures form the foundation of a neural radiance field (NeRF).

NeRF is a deep learning technique that reconstructs a 3D scene from 2D photos, producing something like a "digital twin" of the robot's surroundings. The resulting 360-degree view gives the robot the geometric grounding it needs for robust interaction and manipulation.
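At its core, a NeRF represents the scene as a function mapping 3D positions to color and density, and it renders 2D views by compositing samples along camera rays. The toy sketch below shows only that volume-rendering step, with hand-picked densities and colors standing in for the network that F3RM actually trains from the 50 photos:

```python
import numpy as np

def render_ray(sigmas, colors, deltas):
    """Composite per-sample densities and colors along one camera ray
    (the standard NeRF quadrature rule).

    sigmas: (N,) volume densities at N samples along the ray
    colors: (N, 3) RGB colors at those samples
    deltas: (N,) distances between consecutive samples
    """
    alphas = 1.0 - np.exp(-sigmas * deltas)        # opacity of each segment
    trans = np.cumprod(1.0 - alphas + 1e-10)       # transmittance through segments
    trans = np.concatenate([[1.0], trans[:-1]])    # light reaching sample i
    weights = trans * alphas                       # contribution of each sample
    return (weights[:, None] * colors).sum(axis=0) # final pixel color

# Toy example: an opaque red region midway along the ray.
sigmas = np.array([0.0, 0.0, 50.0, 50.0, 0.0])
colors = np.array([[0, 0, 0], [0, 0, 0],
                   [1, 0, 0], [1, 0, 0], [0, 0, 0]], dtype=float)
deltas = np.full(5, 0.1)
pixel = render_ray(sigmas, colors, deltas)  # pixel comes out nearly pure red
```

A real NeRF evaluates a learned network at each sample instead of fixed arrays, and repeats this for every pixel of every view.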

F3RM is not limited to generating a three-dimensional model of the surroundings, however. It also builds a feature field enriched with semantic data, using CLIP, a vision foundation model trained on a dataset of hundreds of millions of images.

This makes it possible for F3RM to understand a wide variety of visual concepts. The method lifts the 2D CLIP features extracted from the selfie-stick photographs into a 3D representation, improving the system's grasp of both the geometry and the semantics of its environment.
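F3RM's actual feature field is distilled into the NeRF by training, but the core idea of lifting 2D features into 3D can be approximated crudely: project a 3D point into each camera and average the 2D feature vectors it lands on. The sketch below does exactly that with a simple pinhole camera model; the `project` and `lift_features` names and the toy camera setup are illustrative assumptions, not the system's real API:

```python
import numpy as np

def project(point, cam_pose, fx=100.0, cx=32.0, cy=32.0):
    """Project a 3D world point into pixel coordinates for a simple
    pinhole camera. cam_pose is an (R, t) pair mapping world -> camera."""
    R, t = cam_pose
    p_cam = R @ point + t
    if p_cam[2] <= 0:                 # point is behind the camera
        return None
    u = fx * p_cam[0] / p_cam[2] + cx
    v = fx * p_cam[1] / p_cam[2] + cy
    return int(round(u)), int(round(v))

def lift_features(point, cam_poses, feature_maps):
    """Average the 2D feature vectors a 3D point projects onto across all
    views -- a crude stand-in for F3RM's learned 3D feature field."""
    feats = []
    for pose, fmap in zip(cam_poses, feature_maps):
        uv = project(point, pose)
        if uv is None:
            continue
        u, v = uv
        h, w, _ = fmap.shape
        if 0 <= v < h and 0 <= u < w:
            feats.append(fmap[v, u])
    return np.mean(feats, axis=0) if feats else None

# Toy setup: two cameras looking down +z at the origin, constant feature maps.
identity = np.eye(3)
cam_poses = [(identity, np.array([0.0, 0.0, 2.0])),
             (identity, np.array([0.1, 0.0, 2.0]))]
feature_maps = [np.full((64, 64, 8), 0.5), np.full((64, 64, 8), 1.0)]
feat3d = lift_features(np.zeros(3), cam_poses, feature_maps)
```

In F3RM the per-point features come from a trained field rather than naive averaging, which lets the system interpolate features at viewpoints and scales the 2D maps never observed directly.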

Open-Ended Features

With F3RM's open-ended features, robots can respond to human queries at varying levels of detail. When a user asks for a "tall mug," for instance, the robot chooses the item that best fits that description. Even when a request is vague, F3RM lets robots select and interact with objects much as a human would.
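Matching a phrase like "tall mug" to an object comes down to comparing a text embedding against the lifted visual features, typically by cosine similarity. A minimal sketch, with made-up 4-dimensional vectors standing in for real CLIP text and image embeddings:

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def best_match(query_emb, object_feats):
    """Return the object whose feature vector best matches the text query."""
    return max(object_feats, key=lambda name: cosine(query_emb, object_feats[name]))

# Made-up embeddings standing in for real CLIP vectors (toy values).
object_feats = {
    "tall mug":    np.array([0.9, 0.1, 0.0, 0.1]),
    "short mug":   np.array([0.6, 0.6, 0.0, 0.1]),
    "screwdriver": np.array([0.0, 0.1, 0.9, 0.2]),
}
query = np.array([0.95, 0.05, 0.0, 0.1])  # embedding of the phrase "tall mug"
pick = best_match(query, object_feats)    # -> "tall mug"
```

Because both text and image features live in CLIP's shared embedding space, the same comparison works for phrases the robot has never seen paired with these objects before.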

According to The Daily Science, the researchers put the AI system to the test by having a robot pick up Baymax, a character from Disney's "Big Hero 6."

To decide which item to grab and how to grasp it, F3RM drew on its spatial awareness and the vision-language features obtained from the foundation models, even though it had never been specifically trained to pick up this particular cartoon superhero toy. This demonstrates the system's remarkable flexibility and problem-solving ability.

This approach allows robots to perform more dynamic control tasks and adapt to real-time perception. By combining geometric understanding with semantics from foundation models trained on internet-scale data, F3RM achieves broad generalization from only a handful of demonstrations.


By Quincy

ⓒ 2024 TECHTIMES.com All rights reserved. Do not reproduce without permission.