Stanford’s mobile ALOHA robot learns from humans to cook, clean, do laundry

A new AI system developed by researchers at Stanford University marks an impressive breakthrough in training mobile robots that can perform complex tasks in different environments. 

Called Mobile ALOHA (A Low-cost Open-source Hardware System for Bimanual Teleoperation), the system addresses the high costs and technical challenges of training mobile bimanual robots, which require careful guidance from human operators. 



It costs a fraction of off-the-shelf systems and can learn from as few as 50 human demonstrations. 

This new system comes against the backdrop of an acceleration in robotics, enabled partly by the success of generative models.


Limits of current robotics systems

Most robotic manipulation research focuses on table-top tasks. This includes a recent wave of models built on transformers and diffusion models, architectures widely used in generative AI.

However, many of these models lack the mobility and dexterity necessary for generally useful work. Everyday environments demand that a robot coordinate mobility with dexterous manipulation.

“With additional degrees of freedom added, the interaction between the arms and base actions can be complex, and a small deviation in base pose can lead to large drifts in the arm’s end-effector pose,” the Stanford researchers write in their paper, adding that prior works have not delivered “a practical and convincing solution for bimanual mobile manipulation, both from a hardware and a learning standpoint.”

Mobile ALOHA

The new system developed by Stanford researchers builds on top of ALOHA, a low-cost teleoperation system for collecting bimanual manipulation data.

A human operator demonstrates tasks by manipulating the robot arms through a teleoperated control. The system captures the demonstration data and uses it to train a control system through end-to-end imitation learning.

Mobile ALOHA extends the system by mounting it on a wheeled base, providing a cost-effective platform for training robotic systems. The entire setup, including webcams and a laptop with a consumer-grade GPU, costs around $32,000, far less than off-the-shelf bimanual robots, which can cost up to $200,000.

Mobile ALOHA configuration (source: arxiv)

Mobile ALOHA is designed to teleoperate all degrees of freedom simultaneously. The human operator is tethered to the system by the waist and drives it around the work environment while operating the arms with controllers. This enables the robot control system to simultaneously learn movement and other control commands. Once it gathers enough information, the model can then repeat the sequence of tasks autonomously.

The teleoperation system can run for multiple hours of continuous use. The results are impressive and show that a simple training recipe enables the system to learn complex mobile manipulation tasks. 

The demos show the trained robot cooking a three-course meal with delicate tasks such as breaking eggs, mincing garlic, pouring liquid, unpackaging vegetables, and flipping chicken in a frying pan. 

Mobile ALOHA can also do a variety of housekeeping tasks, including watering plants, using a vacuum, loading and unloading a dishwasher, getting drinks from the fridge, opening doors, and operating washing machines.

Imitation learning and co-training

Like many recent works in robotics, Mobile ALOHA takes advantage of transformers, the architecture used in large language models. The original ALOHA system used an architecture called Action Chunking with Transformers (ACT), which takes images from multiple viewpoints and joint positions as input and predicts a sequence of actions.

Action Chunking with Transformers (ACT) (source: ALOHA webpage)
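The chunking idea can be illustrated with a minimal sketch: instead of predicting one action per timestep, the policy outputs a chunk of several future actions, and the predictions from overlapping chunks are blended with an exponential weighting scheme (the "temporal ensemble" described for ACT). The function and variable names below are hypothetical, and actions are scalars to keep the example small.

```python
import numpy as np

def temporal_ensemble(chunks, t, m=0.1):
    """Blend every prediction made for timestep t by earlier chunks.

    chunks[s] is the k-step action sequence predicted at timestep s;
    actions are scalars here purely for illustration.
    """
    # Collect all predictions that cover timestep t, oldest chunk first.
    preds = [chunk[t - s] for s, chunk in enumerate(chunks)
             if s <= t < s + len(chunk)]
    # exp(-m * i) with i = 0 for the oldest prediction, so older
    # predictions receive more weight, which smooths the executed motion.
    w = np.exp(-m * np.arange(len(preds)))
    return float(np.dot(preds, w) / w.sum())

# Three overlapping 3-step chunks, each predicting a constant action.
chunks = [[1.0, 1.0, 1.0], [2.0, 2.0, 2.0], [3.0, 3.0, 3.0]]
blended = temporal_ensemble(chunks, t=2)  # mixes one value from each chunk
```

Because the oldest prediction receives the largest weight, the blended action changes gradually even when newer chunks disagree, which is what makes the executed trajectories smooth.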

Mobile ALOHA extends that system by adding movement signals to the input vector. This formulation allows Mobile ALOHA to reuse previous deep imitation learning algorithms with minimal changes.

“We observe that simply concatenating the base and arm actions then training via direct imitation learning can yield strong performance,” the researchers write. “Specifically, we concatenate the 14-DoF joint positions of ALOHA with the linear and angular velocity of the mobile base, forming a 16-dimensional action vector.”
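The action-vector construction the researchers describe is straightforward to sketch, assuming placeholder values for the joint positions and base velocities:

```python
import numpy as np

# Illustration of the 16-D action vector described in the paper: 14 joint
# positions from ALOHA's two arms, concatenated with the mobile base's
# linear and angular velocity. All values here are placeholders.
arm_joint_positions = np.zeros(14)    # the two arms' 14-DoF joint targets
base_velocity = np.array([0.3, 0.1])  # (linear m/s, angular rad/s), illustrative

action = np.concatenate([arm_joint_positions, base_velocity])
assert action.shape == (16,)
```

Treating base motion as just two extra action dimensions is what lets the existing imitation-learning pipeline run almost unchanged.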

The work also benefits from the success of recent methods that pre-train models on diverse robot datasets from other projects. Of special note is RT-X, a project by DeepMind and 33 research institutions, which combined several robotics datasets to create control systems that could generalize well beyond their training data and robot morphologies. 

“Despite the differences in tasks and morphology, we observe positive transfer in nearly all mobile manipulation tasks, attaining equivalent or better performance and data efficiency than policies trained using only Mobile ALOHA data,” the researchers write.

Using existing data enabled the researchers to train Mobile ALOHA for complex tasks with very few human demonstrations.

“With co-training, we are able to achieve over 80% success on these tasks with only 50 human demonstrations per task, with an average of 34% absolute improvement compared to no co-training,” the researchers write.
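A minimal sketch of such a co-training recipe: each training batch mixes demonstrations of the new mobile task with existing static (table-top) ALOHA demonstrations. The helper name and the 50/50 sampling ratio below are illustrative assumptions, not the paper's exact implementation.

```python
import random

def sample_cotraining_batch(mobile_demos, static_demos,
                            batch_size=8, mobile_ratio=0.5):
    """Draw a batch mixing target-task data with prior static data."""
    n_mobile = int(batch_size * mobile_ratio)
    batch = [random.choice(mobile_demos) for _ in range(n_mobile)]
    batch += [random.choice(static_demos) for _ in range(batch_size - n_mobile)]
    random.shuffle(batch)
    return batch

# 50 mobile demonstrations per task, plus a larger pool of static data.
mobile = [f"mobile_demo_{i}" for i in range(50)]
static = [f"static_demo_{i}" for i in range(800)]
batch = sample_cotraining_batch(mobile, static)
```

The intuition is that the large static dataset regularizes the policy's manipulation skills while the small mobile dataset teaches the base-arm coordination specific to the new task.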

Not production-ready

Despite its impressive results, Mobile ALOHA has drawbacks. For example, its bulkiness and unwieldy form factor do not make it suitable for tight environments. 

In the future, the researchers plan to improve the system by adding more degrees of freedom and reducing the robot’s volume.

It is also worth noting that this is not a fully autonomous system that can learn to explore new environments on its own. It still requires full demonstrations by human operators in its environment, though it learns the tasks with fewer examples than previous methods, thanks to its co-training system.

The researchers will explore changes to the AI model that would allow the robot to self-improve and acquire new knowledge. 

Given the recent trend of training control systems across different datasets and robot morphologies, this work could further accelerate the development of versatile mobile robots, and ideally lead to enterprise- and consumer-grade helper robots. The field is rapidly heating up thanks to the work of other researchers and companies such as Tesla, with its still-in-development Optimus humanoid robot, and Hyundai's Boston Dynamics division, which sells the robotic dog Spot for around $74,000.



VentureBeat's mission is to be a digital town square for technical decision-makers to gain knowledge about transformative enterprise technology and transact. Discover our Briefings.