DeepMind and Stanford’s new robot control model follows instructions from sketches

Recent advances in language and vision models have driven major progress toward robotic systems that can follow instructions from text descriptions or images. However, there are limits to what language- and image-based instructions can accomplish.

A new study by researchers at Stanford University and Google DeepMind suggests using sketches as instructions for robots. Sketches carry rich spatial information that helps the robot carry out its tasks without getting confused by the clutter of realistic images or the ambiguity of natural language instructions.

The researchers created RT-Sketch, a model that uses sketches to control robots. It performs on par with language- and image-conditioned agents in normal conditions and outperforms them in situations where language and image goals fall short.

Why sketches?

While language is an intuitive way to specify goals, it can become inconvenient when the task requires precise manipulations, such as placing objects in specific arrangements. 

Images, on the other hand, can depict the robot’s desired goal in full detail. However, a goal image is often unavailable before the task is done, and a pre-recorded goal image can contain too many irrelevant details. As a result, a model trained on goal images might overfit to its training data and fail to generalize to other environments.

“The original idea of conditioning on sketches actually stemmed from early-on brainstorming about how we could enable a robot to interpret assembly manuals, such as IKEA furniture schematics, and perform the necessary manipulation,” Priya Sundaresan, Ph.D. student at Stanford University and lead author of the paper, told VentureBeat. “Language is often extremely ambiguous for these kinds of spatially precise tasks, and an image of the desired scene is not available beforehand.” 

The team decided to use sketches as they are minimal, easy to collect, and rich with information. On the one hand, sketches provide spatial information that would be hard to express in natural language instructions. On the other, sketches can provide specific details of desired spatial arrangements without needing to preserve pixel-level details as in an image. At the same time, they can help models learn to tell which objects are relevant to the task, which results in more generalizable capabilities.

“We view sketches as a stepping stone towards more convenient but expressive ways for humans to specify goals to robots,” Sundaresan said.

RT-Sketch

RT-Sketch is one of many new robotics systems that use transformers, the deep learning architecture behind large language models (LLMs). It is based on Robotics Transformer 1 (RT-1), a model developed by DeepMind that takes language instructions as input and generates commands for robots. RT-Sketch modifies that architecture to replace the natural language input with visual goals, including sketches and images.
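For intuition, the goal-conditioning swap can be pictured roughly as follows. This is a simplified, hypothetical sketch in PyTorch, not the actual RT-Sketch or RT-1 code: the class and parameter names (VisualGoalPolicy, token_dim, the 7-dimension, 256-bin action space) are illustrative assumptions, and the real model’s image tokenizer and action decoding are considerably more involved.

```python
# Hypothetical sketch of goal-conditioning; not the actual RT-Sketch code.
import torch
import torch.nn as nn

class VisualGoalPolicy(nn.Module):
    """Toy policy: encode the observation and the goal (sketch or image) into
    tokens, fuse them with a transformer, and decode discretized actions."""

    def __init__(self, token_dim=256, num_actions=7, num_bins=256):
        super().__init__()
        # Shared CNN tokenizer stands in for the real model's image tokenizer (simplified).
        self.tokenizer = nn.Sequential(
            nn.Conv2d(3, 64, 8, stride=8), nn.ReLU(),
            nn.Conv2d(64, token_dim, 4, stride=4),
        )
        encoder_layer = nn.TransformerEncoderLayer(d_model=token_dim, nhead=8,
                                                   batch_first=True)
        self.backbone = nn.TransformerEncoder(encoder_layer, num_layers=4)
        # One head predicting a discretized bin per action dimension (illustrative values).
        self.action_head = nn.Linear(token_dim, num_actions * num_bins)
        self.num_actions, self.num_bins = num_actions, num_bins

    def _tokens(self, img):
        feats = self.tokenizer(img)                    # (B, D, H, W)
        return feats.flatten(2).transpose(1, 2)        # (B, H*W, D)

    def forward(self, observation, goal):
        # Goal tokens take the place of the language embedding used by RT-1.
        tokens = torch.cat([self._tokens(observation), self._tokens(goal)], dim=1)
        fused = self.backbone(tokens).mean(dim=1)      # pool over all tokens
        return self.action_head(fused).view(-1, self.num_actions, self.num_bins)

policy = VisualGoalPolicy()
obs = torch.randn(1, 3, 256, 256)      # current camera image
sketch = torch.randn(1, 3, 256, 256)   # hand-drawn goal sketch
action_logits = policy(obs, sketch)    # (1, 7, 256): bin logits per action dimension
```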

To train the model, the researchers used the RT-1 dataset, which includes 80,000 recordings of VR-teleoperated demonstrations of tasks such as moving and manipulating objects, opening and closing cabinets, and more. First, however, they had to create sketches from the demonstrations. They selected 500 training examples and drew sketches of the final video frame of each by hand. They then used these sketches and the corresponding video frames, along with other image-to-sketch examples, to train a generative adversarial network (GAN) that can create sketches from images.

GAN network generates sketches from images
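For a concrete picture of the image-to-sketch step, the snippet below is a condensed, hypothetical pix2pix-style training step on paired (final frame, hand-drawn sketch) data. The network sizes, loss weights, and function names are assumptions for illustration, not the paper’s exact configuration.

```python
# Hypothetical pix2pix-style image-to-sketch GAN step; not the paper's exact setup.
import torch
import torch.nn as nn

generator = nn.Sequential(          # image -> sketch (toy stand-in for a U-Net)
    nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
    nn.Conv2d(64, 3, 3, padding=1), nn.Tanh(),
)
discriminator = nn.Sequential(      # (image, sketch) pair -> real/fake score map
    nn.Conv2d(6, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
    nn.Conv2d(64, 1, 4, stride=2, padding=1),
)
adv_loss, l1_loss = nn.BCEWithLogitsLoss(), nn.L1Loss()
opt_g = torch.optim.Adam(generator.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(discriminator.parameters(), lr=2e-4)

def train_step(image, real_sketch):
    """One paired training step on (final video frame, hand-drawn sketch)."""
    fake_sketch = generator(image)

    # Discriminator: push real pairs toward 1, generated pairs toward 0.
    opt_d.zero_grad()
    d_real = discriminator(torch.cat([image, real_sketch], dim=1))
    d_fake = discriminator(torch.cat([image, fake_sketch.detach()], dim=1))
    d_loss = adv_loss(d_real, torch.ones_like(d_real)) + \
             adv_loss(d_fake, torch.zeros_like(d_fake))
    d_loss.backward()
    opt_d.step()

    # Generator: fool the discriminator while staying close to the drawn sketch.
    opt_g.zero_grad()
    d_fake = discriminator(torch.cat([image, fake_sketch], dim=1))
    g_loss = adv_loss(d_fake, torch.ones_like(d_fake)) + \
             100 * l1_loss(fake_sketch, real_sketch)
    g_loss.backward()
    opt_g.step()
    return d_loss.item(), g_loss.item()

frames = torch.randn(4, 3, 128, 128)   # final video frames
drawn = torch.randn(4, 3, 128, 128)    # matching hand-drawn sketches
train_step(frames, drawn)
```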

They used the GAN to create goal sketches for training the RT-Sketch model. They also augmented the generated sketches with various colorspace and affine transforms to simulate the variations found in hand-drawn sketches. The RT-Sketch model was then trained on the original recordings and the sketch of the goal state.
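As a rough illustration of that augmentation step, the transforms below apply color jitter and small affine perturbations to a generated sketch. The specific transform choices and parameter values are assumptions, not the values used in the paper.

```python
# Illustrative colorspace + affine augmentation for generated goal sketches;
# parameter values are assumptions, not the paper's settings.
import torchvision.transforms as T

sketch_augment = T.Compose([
    T.ColorJitter(brightness=0.3, contrast=0.3, saturation=0.3, hue=0.05),
    T.RandomAffine(degrees=5, translate=(0.05, 0.05), scale=(0.9, 1.1), shear=3),
])

# augmented = sketch_augment(goal_sketch_image)  # accepts a PIL image or image tensor
```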

The trained model takes an image of the scene and a rough sketch of the desired arrangement of objects. In response, it generates a sequence of robot commands to reach the desired goal.
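Closed-loop use of such a model might look like the hypothetical sketch below, where camera and robot stand in for whatever hardware interfaces are available and policy is a sketch-conditioned model like the toy one above; none of these names come from the paper.

```python
# Hypothetical closed-loop control with a sketch-conditioned policy.
# `camera` and `robot` are stand-ins for hardware interfaces (assumptions).
import torch

def run_to_sketch_goal(policy, camera, robot, goal_sketch, max_steps=100):
    """Repeatedly query the policy with (current image, fixed goal sketch)
    and execute the predicted action until the step budget runs out."""
    for _ in range(max_steps):
        observation = camera.capture()                  # (1, 3, H, W) image tensor
        with torch.no_grad():
            logits = policy(observation, goal_sketch)   # (1, num_actions, num_bins)
        action = logits.argmax(dim=-1).squeeze(0)       # one discrete bin per dimension
        robot.execute(action)                           # e.g. arm pose deltas, gripper
```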

“RT-Sketch could be useful in spatial tasks where describing the intended goal would take longer to say in words than a sketch, or in cases where an image may not be available,” Sundaresan said. 

RT-Sketch takes in visual instructions and generates action commands for robots

For example, if you want to set a dinner table, language instructions like “put the utensils next to the plate” could be ambiguous with multiple sets of forks and knives and many possible placements. Using a language-conditioned model would require multiple interactions and corrections to the model. At the same time, having an image of the desired scene would require solving the task in advance. With RT-Sketch, you can instead provide a quickly drawn sketch of how you expect the objects to be arranged.

“RT-Sketch could also be applied to scenarios such as arranging or unpacking objects and furniture in a new space with a mobile robot, or any long-horizon tasks such as multi-step folding of laundry where a sketch can help visually convey step-by-step subgoals,” Sundaresan said. 

RT-Sketch in action

The researchers evaluated RT-Sketch in different scenes across six manipulation skills, including moving objects near one another, knocking cans sideways or placing them upright, and closing and opening drawers.

RT-Sketch performs on par with image- and language-conditioned models for tabletop and countertop manipulation. Meanwhile, it outperforms language-conditioned models in scenarios where goals can’t be expressed clearly in language. It also holds up in cluttered environments, where visual distractors can confuse image-conditioned models.

“This suggests that sketches are a happy medium; they are minimal enough to avoid being affected by visual distractors, but are expressive enough to preserve semantic and spatial awareness,” Sundaresan said.

In the future, the researchers will explore broader applications of sketches, such as combining them with other modalities like language, images, and human gestures. DeepMind already has several other robotics systems built on multi-modal models, and it will be interesting to see how they can be improved with the findings of RT-Sketch. The researchers will also explore the versatility of sketches beyond capturing visual scenes.

“Sketches can convey motion via drawn arrows, subgoals via partial sketches, constraints via scribbles, or even semantic labels via scribbled text,” Sundaresan said. “All of these can encode useful information for downstream manipulation that we have yet to explore.”


