Imagine an AI that doesn’t just parse commands but carries them out, like a human would, across an array of simulated 3D environments.
That’s the aim of DeepMind’s Scalable Instructable Multiworld Agent (SIMA).
Unlike traditional AI, which might excel at discrete tasks like strategy games or specific problem-solving, SIMA’s agents are trained to interpret human language instructions and translate them into actions using a keyboard and mouse, mimicking how a person interacts with a computer.
This means that whether the task is navigating a digital landscape, solving puzzles, or interacting with objects in a game, SIMA aims to understand and execute these commands with the same intuition and adaptability as a person.
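To make that loop concrete, here is a minimal, self-contained sketch of the pixels-plus-instruction to keyboard-and-mouse cycle the article describes. Everything in it (the FakeEnv class, the keyword-matching policy) is a hypothetical stand-in for illustration, not DeepMind’s actual code or API.

```python
# Toy "language grounding": map recognizable phrases to input events.
KEYWORD_TO_ACTIONS = {
    "turn left": [("key", "a")],
    "move forward": [("key", "w")],
    "chop down tree": [("mouse", "left_click")],
}

def act(frame, instruction):
    """Map (screen pixels, instruction) to low-level inputs.
    A real agent runs a trained model here; we match keywords instead."""
    for phrase, actions in KEYWORD_TO_ACTIONS.items():
        if phrase in instruction.lower():
            return actions
    return []  # unknown instruction: do nothing

class FakeEnv:
    """Stand-in for a game: serves blank frames, logs injected inputs."""
    def capture_screen(self):
        return [[0] * 64 for _ in range(64)]  # dummy 64x64 frame

    def send_input(self, device, event):
        print(f"injected {device} event: {event}")

env = FakeEnv()
for _ in range(3):
    frame = env.capture_screen()           # pixels only, no game API
    for device, event in act(frame, "Turn left"):
        env.send_input(device, event)      # synthesized keyboard/mouse input
```

The key point the sketch captures is that the agent’s only interface to the world is what a human player has: the screen going in, and keyboard/mouse events going out.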
Introducing SIMA: the first generalist AI agent to follow natural-language instructions in a broad range of 3D virtual environments and video games. 🕹️
It can complete tasks similar to a human, and outperforms an agent trained in just one setting. 🧵 https://t.co/qz3IxzUpto pic.twitter.com/02Q6AkW4uq
— Google DeepMind (@GoogleDeepMind) March 13, 2024
At the core of this project is a large and diverse dataset of human gameplay collected across research environments and commercial video games.
SIMA was trained and tested on a selection of nine video games through collaborations with eight game studios; the lineup includes well-known titles like No Man’s Sky and Teardown. Each game challenges SIMA with different skills, from basic navigation and resource gathering to more complex activities like crafting and spaceship piloting.
The training set also spans four research environments, used to assess SIMA’s physical interaction and object-manipulation skills.
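For illustration, one record in such a dataset might pair the instruction a human player was given with their screen recording and input trace. The field names and structure below are assumptions, not DeepMind’s published schema:

```python
from dataclasses import dataclass

@dataclass
class GameplayExample:
    game: str            # e.g. "No Man's Sky" or a research environment
    instruction: str     # the natural-language task given to the player
    frames: list         # sequence of screen captures (omitted here)
    actions: list        # time-aligned (timestamp, device, event) trace

example = GameplayExample(
    game="Teardown",
    instruction="open the menu",
    frames=[],
    actions=[(0.0, "key", "esc")],
)
print(example.instruction, "->", example.actions)
```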
In terms of architecture, SIMA uses pre-trained vision and video prediction models, fine-tuned on the specific 3D settings of its game portfolio.
Unlike traditional game-playing AIs, SIMA doesn’t require source code access or custom APIs. It works from on-screen images and user-provided instructions alone, using keyboard and mouse actions to execute tasks.
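A rough sketch consistent with that description: pre-trained encoders turn the screen and the instruction into embeddings, and a small policy head fuses them into scores over keyboard/mouse actions. The layer sizes, names, and discrete action space below are illustrative assumptions, not SIMA’s actual architecture:

```python
import torch
import torch.nn as nn

class SimaStylePolicy(nn.Module):
    """Fuse vision and language embeddings into action logits."""
    def __init__(self, img_dim=512, txt_dim=512, n_actions=32):
        super().__init__()
        self.fuse = nn.Sequential(                  # combine vision + language
            nn.Linear(img_dim + txt_dim, 512),
            nn.ReLU(),
        )
        self.action_head = nn.Linear(512, n_actions)  # keyboard/mouse logits

    def forward(self, img_emb, txt_emb):
        z = self.fuse(torch.cat([img_emb, txt_emb], dim=-1))
        return self.action_head(z)                  # scores over discrete inputs

policy = SimaStylePolicy()
img_emb = torch.randn(1, 512)   # stand-in for a pre-trained vision encoding
txt_emb = torch.randn(1, 512)   # stand-in for an instruction embedding
logits = policy(img_emb, txt_emb)
print(logits.shape)             # torch.Size([1, 32])
```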
In its evaluation phase, SIMA was tested across 600 basic skills encompassing navigation, object interaction, and menu use.
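An evaluation like that presumably aggregates pass/fail outcomes per skill into success rates. The sketch below shows one plausible way to do that aggregation; the records are made up:

```python
from collections import defaultdict

results = [  # (skill category, skill, success) -- illustrative records
    ("navigation", "turn left", True),
    ("navigation", "turn left", False),
    ("object interaction", "chop down tree", True),
    ("menu use", "open the map", True),
]

by_category = defaultdict(list)
for category, _skill, success in results:
    by_category[category].append(success)

for category, outcomes in by_category.items():
    rate = sum(outcomes) / len(outcomes)
    print(f"{category}: {rate:.0%} success over {len(outcomes)} attempts")
```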
What sets SIMA apart is its generality. This AI isn’t being trained to master a single game or solve a particular set of problems.
Instead, DeepMind is teaching it to be adaptable, to understand instructions, and to act on them across different virtual worlds.
Tim Harley from DeepMind explained, “It’s still very much a research project,” but in the future, “one could imagine one day having agents like SIMA playing alongside you in games with you and with your friends.”
SIMA needs only the images provided by the 3D environment and natural-language instructions given by the user. 🖱️
With mouse and keyboard outputs, it is evaluated across 600 skills, spanning areas like navigation and object interaction – such as “turn left” or “chop down tree.”… pic.twitter.com/PEPfLZv2o0
— Google DeepMind (@GoogleDeepMind) March 13, 2024
SIMA is mastering the art of understanding and acting upon our instructions by grounding language in perception and action.
DeepMind has a long gaming heritage stretching back to AlphaGo, which defeated high-profile players of the famously complex board game Go in 2015 and 2016.
However, SIMA’s ambitions reach beyond video games, moving closer to the dream of truly intelligent, instructable AI agents that blur the line between human and machine understanding.