DeepMind develops a foundation model for building 2D game environments

April 2, 2024

  • DeepMind trained a foundation model for generating 2D game levels from text or image prompts
  • Named Genie, this model streamlines the creation of functional 2D environments
  • The approach also shows potential for building robots that correctly understand new environments

Google DeepMind’s Genie is a generative model that translates simple images or text prompts into dynamic, interactive worlds. 

Genie was trained on an extensive dataset of over 200,000 hours of in-game video footage, including gameplay from 2D platformers and real-world robotics interactions. 

This vast dataset allowed Genie to understand and generate the physics, dynamics, and aesthetics of numerous environments and objects.

The finalized model, documented in a research paper, uses 11 billion parameters to generate interactive virtual worlds from images in multiple formats or from text prompts. 

So, you could feed Genie an image of your living room or garden and turn it into a playable 2D platform level.

Or scribble a 2D environment on a piece of paper and convert it into a playable game environment.

Genie can function as an interactive environment, accepting various prompts such as generated images or hand-drawn sketches. Users can guide the model’s output by providing latent actions at each time step, which Genie then uses to generate the next frame in the sequence at 1 FPS. Source: DeepMind via ArXiv (open access).

What sets Genie apart from other world models is its ability to enable users to interact with the generated environments on a frame-by-frame basis.

For example, below, you can see how Genie takes photographs of real-world environments and turns them into 2D game levels.

Genie can create game levels from a) other game levels, b) hand-drawn sketches, and c) photographs of real-world environments. See the game levels (bottom row) generated from real-world images (top row). Source: DeepMind.

How Genie works

Genie is a “foundation world model” with three key components: a spatiotemporal video tokenizer, an autoregressive dynamics model, and a simple, scalable latent action model (LAM).

Here’s how it works:

  1. Spatiotemporal transformers: Central to Genie are spatiotemporal (ST) transformers, which process sequences of video frames. Unlike traditional transformers that handle text or static images, ST transformers are designed to understand the progression of visual data over time, making them ideal for video and dynamic environment generation.
  2. Latent Action Model (LAM): Genie understands and predicts actions within its generated worlds through the LAM. This infers the potential actions that could occur between frames in a video, learning a set of “latent actions” directly from the visual data. This enables Genie to control the progression of events in interactive environments, despite the absence of explicit action labels in the training data.
  3. Video tokenizer and dynamics model: To manage video data, Genie employs a video tokenizer that compresses raw video frames into a more manageable format of discrete tokens. Following tokenization, the dynamics model predicts the next set of frame tokens, generating subsequent frames in the interactive environment.
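The three-stage flow described above can be sketched in code. This is a minimal illustrative sketch, not DeepMind's implementation: the class names, the toy "tokenization" (hashing pixel values), and the shift-by-action dynamics are all assumptions standing in for the learned ST-transformer components the paper describes.

```python
class VideoTokenizer:
    """Compresses a raw frame into a sequence of discrete tokens.
    Toy stand-in for Genie's learned spatiotemporal video tokenizer."""
    def __init__(self, vocab_size=1024):
        self.vocab_size = vocab_size

    def encode(self, frame):
        # Hash pixel values into discrete token ids (illustrative only).
        return [hash(px) % self.vocab_size for px in frame]


class LatentActionModel:
    """Infers a discrete latent action relating two consecutive frames.
    In Genie this is learned from video alone, with no action labels."""
    def __init__(self, num_actions=8):
        self.num_actions = num_actions

    def infer(self, prev_tokens, next_tokens):
        # Toy stand-in: derive an action id from how many tokens changed.
        changed = sum(a != b for a, b in zip(prev_tokens, next_tokens))
        return changed % self.num_actions


class DynamicsModel:
    """Autoregressively predicts the next frame's tokens given the
    token history and a latent action."""
    def __init__(self, vocab_size=1024):
        self.vocab_size = vocab_size

    def predict(self, history_tokens, latent_action):
        # Toy stand-in: shift the last frame's tokens by the action id.
        last = history_tokens[-1]
        return [(t + latent_action) % self.vocab_size for t in last]


def generate(frames, user_actions, tokenizer, dynamics):
    """Roll out an interactive session: tokenize the prompt frames,
    then generate one new frame per user-supplied latent action."""
    history = [tokenizer.encode(f) for f in frames]
    for action in user_actions:
        history.append(dynamics.predict(history, action))
    return history
```

At inference time the user supplies a latent action at each step, as in `generate` above; the `LatentActionModel` matters during training, where it labels consecutive frame pairs with inferred actions so the dynamics model can learn action-conditioned prediction without any ground-truth controls.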

The DeepMind team explained: “Genie could enable a large amount of people to generate their own game-like experiences. This could be positive for those who wish to express their creativity in a new way, for example, children who could design and step into their own imagined worlds.”

In a side experiment, when shown videos of real robot arms manipulating real-world objects, Genie inferred the actions those arms could perform, suggesting potential uses in robotics research. 

Tim Rocktäschel from the Genie team described Genie’s open-ended potential: “It is hard to predict what use cases will be enabled. We hope projects like Genie will eventually provide people with new tools to express their creativity.” 

DeepMind was conscious of the risks of releasing this foundation model, stating in the paper, “We have chosen not to release the trained model checkpoints, the model’s training dataset, or examples from that data to accompany this paper or the website.”

“We would like to have the opportunity to further engage with the research (and video game) community and to ensure that any future such releases are respectful, safe and responsible.”

Using games to simulate real-world applications

DeepMind has used video games for several machine learning projects. 

For example, in 2021, DeepMind built XLand, a virtual playground for testing reinforcement learning (RL) approaches for generalist AI agents. Here, AI models mastered cooperation and problem-solving by performing tasks such as moving obstacles in open-ended game environments. 

Then, just last month, DeepMind introduced SIMA (Scalable, Instructable, Multiworld Agent), an agent designed to understand and execute human-language instructions across different games and scenarios. 

SIMA was trained using nine video games requiring different skill sets, from basic navigation to piloting vehicles. 

Game environments offer a controllable, scalable sandbox for training and testing AI models.

DeepMind’s gaming expertise dates back to 2014–2015, when it developed algorithms that beat humans at games like Pong and Space Invaders, not to mention AlphaGo, which defeated professional player Fan Hui on a full-sized 19×19 board.


Sam Jeans

Sam is a science and technology writer who has worked in various AI startups. When he’s not writing, he can be found reading medical journals or digging through boxes of vinyl records.

