Google showcased some exciting test results of its latest vision-language-action (VLA) robot model called Robotics Transformer 2 (RT-2).
The bulk of recent AI discussion has centered on large language models like ChatGPT and Llama. The responses these models provide, while useful, remain on the screen of your device. With RT-2, Google is bringing the power of AI into the physical world, one where self-learning robots could soon be part of our everyday lives.
Robot dexterity has improved considerably, but robots still need very specific programming instructions to accomplish even simple tasks. When the task changes, even slightly, the program has to change with it.
With RT-2, Google has created a model that lets a robot classify and learn from what it sees, combined with the words it hears. It then reasons about the instruction it receives and takes physical action in response.
With LLMs, a sentence is broken up into tokens, essentially bite-size chunks of text that enable the model to understand the sentence. Google took that principle and tokenized the movements a robot would need to make in response to a command.
The motions of a robotic arm with a gripper, for example, would be broken up into tokens representing changes in x and y position or rotation.
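To make that idea concrete, here is a minimal sketch of how continuous arm movements could be turned into tokens by binning each value, much like words become token IDs. The bin count, value range, and function names are illustrative assumptions, not Google's actual implementation.

```python
# Illustrative sketch of action tokenization (not Google's actual code).
# Assumes each continuous action dimension (e.g. delta-x, delta-y, rotation)
# is clipped to a fixed range and mapped to one of 256 discrete bins, so a
# motion command becomes a short sequence of integer tokens, just like words.

NUM_BINS = 256               # assumed vocabulary size per action dimension
ACTION_RANGE = (-1.0, 1.0)   # assumed normalized range for each dimension

def tokenize_action(action_values):
    """Map continuous action values (e.g. [dx, dy, rotation]) to bin indices."""
    lo, hi = ACTION_RANGE
    tokens = []
    for value in action_values:
        clipped = max(lo, min(hi, value))        # keep the value in range
        fraction = (clipped - lo) / (hi - lo)    # scale to [0, 1]
        tokens.append(min(int(fraction * NUM_BINS), NUM_BINS - 1))
    return tokens

def detokenize_action(tokens):
    """Map bin indices back to approximate continuous action values."""
    lo, hi = ACTION_RANGE
    return [lo + (t + 0.5) / NUM_BINS * (hi - lo) for t in tokens]

# Example: a small move right, slightly forward, with a 30-degree turn
# (rotation normalized to [-1, 1]).
print(tokenize_action([0.10, 0.05, 30 / 180]))    # -> [140, 134, 149]
print(detokenize_action(tokenize_action([0.10, 0.05, 30 / 180])))
```

Once movements are expressed this way, a model that already predicts text tokens can predict action tokens with the same machinery.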
In the past, robots have usually required firsthand experience in order to perform an action. But with our new vision-language-action model, RT-2, they can now learn from both text and images from the web to tackle new and complex tasks. Learn more ↓ https://t.co/4DSRwUHhwg
— Google (@Google) July 28, 2023
What does RT-2 enable a robot to do?
Being able to understand what it sees and hears, combined with chain-of-thought reasoning, means the robot doesn't need to be programmed for each new task.
One example that Google gave in its DeepMind blog post on RT-2 was “deciding which object could be used as an improvised hammer (a rock), or which type of drink is best for a tired person (an energy drink).”
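Purely as an illustration of that energy-drink example, the snippet below sketches how a model's output might pair a short chain of reasoning with the action tokens described earlier. The prompt format, example output, and parsing helper are invented stand-ins, not Google's actual API.

```python
# Hypothetical sketch only: a vision-language-action model asked to write a
# short "Plan:" (its reasoning) before emitting discretized "Action:" tokens.
# The output string and parsing below are invented for illustration.

def parse_plan_and_action(model_output):
    """Split the model's text into its reasoning ('Plan:') and action tokens."""
    plan_part, _, action_part = model_output.partition("Action:")
    plan = plan_part.replace("Plan:", "").strip()
    action_tokens = [int(t) for t in action_part.split()]
    return plan, action_tokens

# Invented example of what such a model might return for the tired-person task:
example_output = (
    "Plan: the person is tired, so the energy drink is the best choice. "
    "Action: 132 141 127 255"
)
plan, action_tokens = parse_plan_and_action(example_output)
print(plan)           # the model's short chain of reasoning
print(action_tokens)  # discretized motion tokens, decoded as in the earlier sketch
```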
In its tests, Google put a robotic arm and gripper through a series of requests that required language comprehension, vision, and reasoning to take the appropriate action. For example, presented with two bags of crisps on a table, one hanging slightly over the edge, the robot was told to "pick up the bag about to fall off the table."
That may sound simple, but the contextual awareness required to pick up the correct bag is groundbreaking in the world of robotics.
To explain how much more is involved than with a regular LLM, another Google blog post noted that “A robot needs to be able to recognize an apple in context, distinguish it from a red ball, understand what it looks like, and most importantly, know how to pick it up.”
While it’s early days, the prospect of household or industrial robots helping with a variety of tasks in changing environments is exciting. The defense applications are almost certainly getting attention too.
Google’s robot arm didn’t always get it right, and it had a big red emergency stop button in case it malfunctioned. Let’s hope future robots come with something similar in case they decide one day that they’re not happy with the boss.