Google DeepMind’s Genie is a generative model that translates simple images or text prompts into dynamic, interactive worlds.
Genie was trained on an extensive dataset of over 200,000 hours of in-game video footage, including gameplay from 2D platformers and real-world robotics interactions.
This vast dataset allowed Genie to understand and generate the physics, dynamics, and aesthetics of numerous environments and objects.
The final model, documented in a research paper, contains 11 billion parameters and generates interactive virtual worlds from image or text prompts.
So, you could feed Genie an image of your living room or garden and turn it into a playable 2D platform level.
Or scribble a 2D environment on a piece of paper and convert it into a playable game environment.
What sets Genie apart from other world models is that it lets users interact with the generated environments on a frame-by-frame basis.
For example, below, you can see how Genie takes photographs of real-world environments and turns them into 2D game levels.
How Genie works
Genie is a “foundation world model” with three key components: a spatiotemporal video tokenizer, an autoregressive dynamics model, and a simple, scalable latent action model (LAM).
Here’s how it works:
- Spatiotemporal transformers: Central to Genie are spatiotemporal (ST) transformers, which process sequences of video frames. Unlike traditional transformers that handle text or static images, ST transformers are designed to understand the progression of visual data over time, making them ideal for video and dynamic environment generation.
- Latent Action Model (LAM): Through the LAM, Genie understands and predicts actions within its generated worlds. The LAM infers the potential actions that could occur between frames in a video, learning a set of “latent actions” directly from the visual data. This enables Genie to control the progression of events in interactive environments, despite the absence of explicit action labels in the training data.
- Video tokenizer and dynamics model: To manage video data, Genie employs a video tokenizer that compresses raw video frames into a more manageable format of discrete tokens. Following tokenization, the dynamics model predicts the next set of frame tokens, generating subsequent frames in the interactive environment.
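The three components above can be illustrated with a deliberately simplified sketch. Everything here is hypothetical stand-in code, not Genie's actual architecture: the real tokenizer, LAM, and dynamics model are large learned transformers, whereas this toy version uses nearest-neighbor vector quantization, a hand-made action-binning rule, and a deterministic token update, purely to show how the pieces fit together in an interactive, frame-by-frame rollout.

```python
import numpy as np

rng = np.random.default_rng(0)

# --- Toy video tokenizer (hypothetical stand-in) ---
# A codebook of K discrete "visual tokens"; each frame patch is mapped to
# its nearest codebook entry, compressing raw pixels into discrete ids.
K, PATCH_DIM = 8, 4
codebook = rng.normal(size=(K, PATCH_DIM))

def tokenize(frame_patches):
    # frame_patches: (num_patches, PATCH_DIM) floats -> (num_patches,) token ids
    dists = np.linalg.norm(frame_patches[:, None, :] - codebook[None], axis=-1)
    return dists.argmin(axis=1)

# --- Toy latent action model (hypothetical stand-in) ---
# Genie learns discrete latent actions from consecutive frames without
# labels; here we simply bucket the mean pixel change into one of 4 bins.
def infer_latent_action(prev_frame, next_frame):
    delta = (next_frame - prev_frame).mean()
    return int(np.digitize(delta, bins=[-0.5, 0.0, 0.5]))  # action id in 0..3

# --- Toy dynamics model (hypothetical stand-in) ---
# Given current frame tokens and a latent action, predict the next tokens.
# The real model is an autoregressive transformer; this deterministic shift
# just keeps the rollout below runnable.
def predict_next_tokens(tokens, action):
    return (tokens + action + 1) % K

# Interactive rollout: the user supplies an action each step, frame by frame.
frame = rng.normal(size=(6, PATCH_DIM))
tokens = tokenize(frame)
for step, user_action in enumerate([0, 2, 1]):
    tokens = predict_next_tokens(tokens, user_action)
    print(f"step {step}: action={user_action} tokens={tokens.tolist()}")
```

The key structural idea this preserves is the one the paper describes: compress frames into discrete tokens, infer actions between frames without labels, then condition next-frame prediction on a user-chosen action.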
The DeepMind team explained, “Genie could enable a large amount of people to generate their own game-like experiences. This could be positive for those who wish to express their creativity in a new way, for example, children who could design and step into their own imagined worlds.”
In a side experiment, when presented with videos of real robot arms engaging with real-world objects, Genie showed an uncanny ability to infer the actions those arms could perform, suggesting potential uses in robotics research.
Tim Rocktäschel from the Genie team described Genie’s open-ended potential: “It is hard to predict what use cases will be enabled. We hope projects like Genie will eventually provide people with new tools to express their creativity.”
DeepMind was conscious of the risks of releasing this foundation model, stating in the paper, “We have chosen not to release the trained model checkpoints, the model’s training dataset, or examples from that data to accompany this paper or the website.”
“We would like to have the opportunity to further engage with the research (and video game) community and to ensure that any future such releases are respectful, safe and responsible.”
Using games to simulate real-world applications
DeepMind has used video games for several machine learning projects.
For example, in 2021, DeepMind built XLand, a virtual playground for testing reinforcement learning (RL) approaches for generalist AI agents. Here, AI models mastered cooperation and problem-solving by performing tasks such as moving obstacles in open-ended game environments.
Then, just last month, DeepMind introduced SIMA (Scalable Instructable Multiworld Agent), an agent designed to understand and execute human-language instructions across different games and scenarios.
SIMA was trained using nine video games requiring different skill sets, from basic navigation to piloting vehicles.
Game environments offer a controllable, scalable sandbox for training and testing AI models.
DeepMind’s gaming expertise dates back to 2014–2015, when it developed an algorithm that beat human performance in Atari games like Pong and Space Invaders, not to mention AlphaGo, which defeated professional player Fan Hui on a full-sized 19×19 board.