On July 29, a reporter from the New York Times visited Google's labs and got a first look at a robot driven by Google's latest model, RT-2.
A one-armed robot stood in front of a table. On the table sat three plastic figurines: a lion, a whale, and a dinosaur. An engineer gave the robot the command, "Pick up the extinct animals." The robot whirred for a moment, then its arm extended and its claw opened and dropped down. It grabbed the dinosaur.
It was a flash of intelligence.
"Until last week," the New York Times wrote, "this demonstration would have been impossible. Robots can't reliably manipulate objects they've never seen before, and they certainly can't make the logical leap from 'extinct animal' to 'plastic dinosaur.'"
Although it is still only a demonstration, and Google has no plans to launch or commercialize it at scale right away, the demo was enough to offer a glimpse of the opportunities that large models could open up for robots.
Before the era of large models, robots were usually trained and optimized for a single task, such as grabbing a particular toy. That required enough data for the robot to recognize the toy from every angle and under all kinds of lighting before it could grab it successfully. And making the robot understand that its task was to grab the toy in the first place also had to be solved by explicit programming.
The intelligence and generalization abilities of large models offer a glimmer of hope for solving these problems and moving toward general-purpose robots.
Applying the Transformer to Robotics
Google's new RT-2 model, short for Robotic Transformer 2, uses the Transformer architecture as its base.
The Transformer architecture, proposed in 2017, is the foundation underlying the current wave of large language models (LLMs), but as an architecture, the Transformer can be used to train not only large language models but other kinds of data as well. Back in March, Google released PaLM-E, at the time the world's largest visual language model (VLM).
In a large language model, language is encoded as vectors. Researchers feed the model a large corpus of text, from which it learns to predict what a human would typically say next, and it uses this ability to generate language responses.
A visual language model, by contrast, encodes image information as vectors in the same way as language, allowing the model to "understand" text and "understand" images in the same manner. Researchers feed the visual language model large corpora of text and images, enabling it to perform tasks such as visual question answering, image captioning, and object recognition.
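This shared-vector idea can be sketched in a few lines of PyTorch. Everything below (the embedding size, vocabulary, patch size, and layer count) is an illustrative assumption, not the actual PaLM-E or PaLI-X architecture; the point is only that image patches and text tokens can be projected into vectors of the same size and read by a single Transformer.

```python
# Illustrative sketch only -- not the actual PaLM-E/PaLI-X architecture.
import torch
import torch.nn as nn

EMBED_DIM = 512          # assumed shared embedding size
VOCAB_SIZE = 32000       # assumed text vocabulary size
PATCH_DIM = 16 * 16 * 3  # a flattened 16x16 RGB image patch

# Text tokens and image patches are mapped into the same vector space.
text_embed = nn.Embedding(VOCAB_SIZE, EMBED_DIM)
patch_embed = nn.Linear(PATCH_DIM, EMBED_DIM)

# One Transformer encoder consumes the mixed sequence of "word-like" vectors.
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=EMBED_DIM, nhead=8, batch_first=True),
    num_layers=4,
)

tokens = torch.randint(0, VOCAB_SIZE, (1, 20))   # e.g. "which animal is extinct?"
patches = torch.rand(1, 196, PATCH_DIM)          # e.g. a 224x224 image cut into patches

sequence = torch.cat([patch_embed(patches), text_embed(tokens)], dim=1)
fused = encoder(sequence)  # the model now "reads" image and text the same way
print(fused.shape)         # torch.Size([1, 216, 512])
```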
Both images and text are relatively easy to acquire in large quantities, so it is comparatively easy for such models to achieve stunning results.
Trying to use the Transformer architecture to generate robot behavior, on the other hand, faces a major difficulty. "Data on robot actions is very expensive," Huazhe Xu, an assistant professor at Tsinghua University's Institute for Interdisciplinary Information Sciences, told Geek Park. "Visual and language data come from humans and are passive data, whereas robot action data all has to come from the robot actively acting.
For example, if I want to study the action of a robot pouring coffee, whether I write code for the robot to execute or get it to perform the action in some other way, the robot has to actually carry out the action once to produce that data. So the scale and magnitude of robot data are completely different from those of language and image data."
With RT-1, the first generation of its Transformer model for robots, Google took up this challenge for the first time and tried to build a vision-language-action model.
To build it, Google used 13 robots in a purpose-built kitchen environment over 17 months, assembling a dataset of actively collected robot data covering more than 700 tasks.
The dataset recorded three dimensions simultaneously:
Vision - camera data from the robot as it performed the tasks; Language - the task described in natural-language text; and Robot motion - the robot's positions along the x, y, and z axes, its deflection angles, and other motion data while performing the task.

Although this produced good experimental results at the time, it is easy to see how hard it would be to increase the amount of data in such a dataset much further.
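As a rough illustration, a single step in such a vision/language/action dataset might bundle the three dimensions together roughly as follows. The field names, shapes, and value ranges are assumptions made for illustration, not the actual RT-1 data schema.

```python
# Hypothetical structure of one step in a robot demonstration -- not the real RT-1 format.
from dataclasses import dataclass
import numpy as np

@dataclass
class RobotEpisodeStep:
    image: np.ndarray         # camera frame captured while the robot acts
    instruction: str          # natural-language task text
    arm_xyz: np.ndarray       # end-effector position along the x, y, z axes
    arm_rotation: np.ndarray  # end-effector deflection / orientation angles
    gripper_closed: float     # how far the gripper is closed, 0.0 (open) to 1.0 (closed)

step = RobotEpisodeStep(
    image=np.zeros((256, 320, 3), dtype=np.uint8),
    instruction="pick up the apple",
    arm_xyz=np.array([0.42, -0.10, 0.23]),
    arm_rotation=np.array([0.0, 1.57, 0.0]),
    gripper_closed=0.0,
)
```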
The innovation of RT-2 is that it uses the previously described visual language model PaLM-E, along with another VLM, PaLI-X, as its base. The pure VLM can be pre-trained on web-scale data, which is plentiful enough to produce good results; then, in the fine-tuning stage, the robot's action data is added in and the model is fine-tuned on both together (co-fine-tuning).
In this way, the robot effectively gains a common-sense system learned from an enormous amount of data: it may not be able to grasp a banana yet, but it can recognize one, and it even knows that a banana is a fruit that monkeys like to eat.
In the fine-tuning stage, once knowledge of how to grasp a banana when one appears in the real world is added, the robot not only can recognize bananas under various lighting conditions and from various angles, it can also grasp them.
In this way, the amount of data required to train a robot on the Transformer architecture is significantly reduced.
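A toy sketch of the co-fine-tuning idea described above: during fine-tuning, each batch mixes plentiful web-scale vision-language examples with the much smaller pool of robot action examples, so the model keeps its "common sense" while learning to act. The mixing ratio, data, and the commented training call are assumptions for illustration; Google's actual recipe is not reproduced here.

```python
# Toy sketch of co-fine-tuning: mix web VLM data with robot action data in each batch.
import random

web_vlm_data = [("photo of a banana", "a yellow fruit")] * 10_000                # plentiful web data
robot_data = [("pick up the banana", "move_to(x, y, z); close_gripper()")] * 100  # scarce robot data

def co_finetune_batch(batch_size=8, robot_fraction=0.5):
    """Sample a mixed batch; robot_fraction is an assumed hyperparameter."""
    n_robot = int(batch_size * robot_fraction)
    batch = random.sample(robot_data, n_robot)
    batch += random.sample(web_vlm_data, batch_size - n_robot)
    random.shuffle(batch)
    return batch

for step in range(3):             # in practice: many thousands of gradient steps
    batch = co_finetune_batch()
    # model.train_step(batch)     # fine-tune the VLM on the mixed batch (hypothetical call)
    print(f"step {step}: batch of {len(batch)} mixed examples")
```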
In the fine-tuning stage, RT-2 directly reused the vision/language/robot-action dataset from RT-1's training. Google's figures show that RT-2 performs just as well as RT-1 when grasping items that appeared in the training data, and thanks to its "brain with common sense," its success rate when grasping items it had never seen before rose from RT-1's 32% to 62%.
"That's the beauty of the large model," Xu said. "There's no way to break it down and say whether the success rate rose because it recognized that two objects were made of similar materials, or were similar in size, or something else. After it learns enough, it springs to life with some capability."
The Future of Using Natural Language to Interact with Robots
Academically, RT-2's strong generalization has the potential to solve the problem of insufficient robot training data. Beyond that, what makes RT-2 intuitively impressive is its intelligence.
In experiments, when researchers asked it to pick up "something that could be used as a hammer," the robot picked a rock out of a pile of objects; when asked to pick up a drink for a tired person, it chose a Red Bull.
Such abilities come from the researchers' introduction of "chain of thought" reasoning when training the large model. This kind of multi-step semantic reasoning is very difficult to achieve in traditional robot imitation-learning research.
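A rough sketch of what a chain-of-thought style prompt for such a task might look like: the model is asked to write out an intermediate plan in language before producing an action. The prompt format, the "Plan:"/"Action:" labels, and the action string are invented for illustration and are not the actual RT-2 interface.

```python
# Illustrative chain-of-thought style prompt for a robot task -- not the actual RT-2 format.
def build_prompt(instruction: str) -> str:
    # The model is asked to reason in text ("Plan:") before emitting an action.
    return (
        f"Instruction: {instruction}\n"
        "Plan: "
    )

prompt = build_prompt("pick up something that could be used as a hammer")
print(prompt)
# A model trained with such intermediate reasoning might then continue with:
#   Plan: the rock is the hardest, heaviest object here, so it can serve as a hammer.
#   Action: pick(rock)
```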
However, using natural language to interact with robots did not start with RT-2.
In the past, researchers had to translate task requirements into code the robot could understand, and write more code to correct the robot's behavior whenever something went wrong, a process that takes many rounds of interaction and is inefficient. Now that we have very intelligent conversational models, the natural next step is to let robots interact with humans directly in natural language.
"We started working on these language models about two years ago, and then we realized that they hold a wealth of knowledge, so we started connecting them to robots," said Karol Hausman, a research scientist at Google.
However, making large models the brains of robots brings its own challenges. One of the most important is the grounding problem: how to turn the typically free-form, abstract responses of a large model into commands that actually drive a robot's actions.
In 2022, Google introduced the SayCan model. As the name suggests, the model weighs two considerations to help the robot act. One is "say": combined with Google's large language model PaLM, the model breaks down a task received through natural-language interaction with a human and finds the most suitable action for the current step. The other is "can": the model uses an algorithm to estimate the probability that the robot can successfully perform each candidate action at that moment. The robot then acts based on both considerations.
For example, if you tell the robot, "I spilled my milk, can you help me?", it first does task planning through the language model. At that point, the most sensible plan might be to find a cleaner, followed by finding a sponge and wiping up the milk itself. The robot then runs its algorithm and calculates that, as a robot, it has a low probability of successfully finding a cleaner and a high probability of finding a sponge and wiping the milk up itself. Weighing both considerations, the robot chooses to find a sponge and wipe up the milk.
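The way the two considerations combine can be sketched as multiplying two scores for each candidate action: a language-model score for how helpful the action is for the instruction ("say") and an affordance score for how likely the robot is to succeed at it right now ("can"). The numbers and names below are made up for illustration; they only mirror the logic described above, not Google's implementation.

```python
# Toy sketch of SayCan-style action selection: combine "say" and "can" scores.

# Hypothetical language-model scores: how helpful each action is for
# "I spilled my milk, can you help me?" (values invented for illustration)
say_scores = {
    "find a cleaner": 0.50,
    "find a sponge and wipe it up": 0.35,
    "do nothing": 0.01,
}

# Hypothetical affordance scores: how likely the robot is to succeed right now.
can_scores = {
    "find a cleaner": 0.10,                 # fetching a cleaner is unlikely to succeed
    "find a sponge and wipe it up": 0.80,
    "do nothing": 1.00,
}

def pick_action(say, can):
    """Choose the action with the highest combined say * can score."""
    combined = {action: say[action] * can[action] for action in say}
    best = max(combined, key=combined.get)
    return best, combined

best, scores = pick_action(say_scores, can_scores)
print(scores)
print("chosen action:", best)   # -> "find a sponge and wipe it up"
```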