Google Research unveiled Lumiere, a text-to-video diffusion model that creates remarkably realistic videos from text or image prompts.
The still images generated by tools like Midjourney or DALL-E are incredible, but text-to-video (TTV) has understandably lagged behind and produced far less impressive results.
TTV models like those from Pika Labs or Stable Video Diffusion have come a long way in the last 12 months, but their videos still fall short on realism and smooth, continuous motion and can look a little clunky.
Lumiere represents a big jump in TTV due to a novel approach to generating video that is spatially and temporally coherent. In other words, the goal is that the scenes in each frame stay visually consistent and the movements are smooth.
What can Lumiere do?
Lumiere has a range of video generation functionality including the following:
- Text-to-video – Enter a text prompt and Lumiere generates a 5-second video clip made up of 80 frames at 16 frames per second.
- Image-to-video – Lumiere takes an image as the prompt and turns it into a video.
- Stylized generation – An image can be used as a style reference. Lumiere uses a text prompt to generate a video in the style of the reference image.
- Video stylization – Lumiere can edit a source video to match a stylistic text prompt.
- Cinemagraphs – Select a region in a still image and Lumiere will animate that part of the image.
- Video inpainting – Lumiere can take a masked video scene and inpaint it to complete the video. It can also edit source video by removing or replacing elements in the scene.
The video below shows some of the impressive videos Lumiere can generate.
How does Lumiere do it?
Existing TTV models typically adopt a cascaded design: a base model generates a sparse set of keyframes, and a temporal super-resolution (TSR) model then generates the missing frames that fill the gaps between them. This approach is memory-efficient, but filling gaps between a sub-sampled set of keyframes leads to temporal inconsistency, or glitchy motion. The low-resolution frames are then upscaled with a spatial super-resolution (SSR) model applied to non-overlapping windows.
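To make the contrast concrete, here is a rough sketch of that cascaded flow. It is purely illustrative: the function names, shapes, and the interpolation stand-in for the TSR model are assumptions for the example, not Google's or anyone else's actual models.

```python
import numpy as np

# Hypothetical stand-ins for the stages of a cascaded TTV pipeline.
# Shapes and names are made up for illustration only.

def base_model(prompt: str) -> np.ndarray:
    """Pretend base model: returns a sparse set of low-res keyframes (K, H, W, C)."""
    rng = np.random.default_rng(0)
    return rng.random((16, 128, 128, 3))

def temporal_super_resolution(keyframes: np.ndarray, factor: int) -> np.ndarray:
    """Pretend TSR: fills the gaps between keyframes (here by simple interpolation)."""
    k = keyframes.shape[0]
    t = np.linspace(0, k - 1, (k - 1) * factor + 1)
    lo = np.floor(t).astype(int)
    hi = np.minimum(lo + 1, k - 1)
    frac = (t - lo)[:, None, None, None]
    return (1 - frac) * keyframes[lo] + frac * keyframes[hi]

def spatial_super_resolution(frames: np.ndarray, scale: int) -> np.ndarray:
    """Pretend SSR: naive nearest-neighbour upscaling of every frame."""
    return frames.repeat(scale, axis=1).repeat(scale, axis=2)

keyframes = base_model("a bear playing a guitar")           # sparse keyframes only
full_clip = temporal_super_resolution(keyframes, factor=5)  # in-between frames are filled in afterwards
hi_res = spatial_super_resolution(full_clip, scale=2)       # SSR upscales frame by frame
print(hi_res.shape)  # (76, 256, 256, 3)
```

The glitchy motion comes from that middle step: the in-between frames are invented after the fact, without the base model ever reasoning about the full clip.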
Lumiere takes a different approach. It uses a Space-Time U-Net (STUNet) architecture that learns to downsample the signal in both space and time and processes all the frames at once.
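As a toy illustration of what downsampling jointly in space and time does to a clip (not the actual STUNet layers, which are learned), here is a simple space-time average pool over a (frames, height, width, channels) video tensor:

```python
import numpy as np

def spacetime_downsample(video: np.ndarray, t_factor: int = 2, s_factor: int = 2) -> np.ndarray:
    """Average-pool a (T, H, W, C) clip over time and space simultaneously."""
    t, h, w, c = video.shape
    t2, h2, w2 = t // t_factor, h // s_factor, w // s_factor
    v = video[: t2 * t_factor, : h2 * s_factor, : w2 * s_factor]
    v = v.reshape(t2, t_factor, h2, s_factor, w2, s_factor, c)
    return v.mean(axis=(1, 3, 5))

clip = np.random.default_rng(1).random((80, 128, 128, 3))  # 5 seconds at 16 fps
coarse = spacetime_downsample(clip)
print(coarse.shape)  # (40, 64, 64, 3): the whole clip, represented at a coarser scale
```

Working at this coarser space-time scale is what lets the model consider every frame of the clip at once instead of a handful of keyframes.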
Because it isn’t just handing a sparse set of keyframes to a TSR model, Lumiere achieves globally coherent motion. To produce the high-resolution video, it applies an SSR model to overlapping windows and uses MultiDiffusion to combine the overlapping predictions into a coherent result.
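MultiDiffusion reconciles the predictions from overlapping windows so that frames covered by more than one window come out consistent; in its simplest form this amounts to a weighted average over the windows covering each frame. The sketch below shows that blending step with made-up window sizes and random data, not Lumiere's actual SSR outputs.

```python
import numpy as np

def blend_overlapping_windows(windows, starts, total_frames):
    """Average per-frame predictions from overlapping temporal windows."""
    acc = np.zeros((total_frames,) + windows[0].shape[1:])
    weight = np.zeros((total_frames,) + (1,) * (windows[0].ndim - 1))
    for win, s in zip(windows, starts):
        acc[s : s + len(win)] += win
        weight[s : s + len(win)] += 1
    return acc / weight

# Two hypothetical SSR outputs over frames [0, 48) and [32, 80) of an 80-frame clip.
rng = np.random.default_rng(2)
win_a = rng.random((48, 256, 256, 3))
win_b = rng.random((48, 256, 256, 3))
clip = blend_overlapping_windows([win_a, win_b], starts=[0, 32], total_frames=80)
print(clip.shape)  # (80, 256, 256, 3); frames 32-47 are averaged across both windows
```

Because the windows overlap and their predictions are blended rather than stitched edge to edge, there are no visible seams where one window ends and the next begins.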
Google Research ran a user study that showed users overwhelmingly preferred Lumiere videos over those from other TTV models.
The end result may only be a 5-second clip, but its realism and the coherence of its visuals and motion are better than anything else currently available. Most other TTV solutions only generate 3-second clips for now.
Lumiere doesn’t handle scene transitions or multi-shot video scenes, but longer multi-scene functionality is almost certainly in the pipeline.
In the Lumiere research paper, Google noted that “there is a risk of misuse for creating fake or harmful content with our technology.”
Hopefully, they find a way to effectively watermark the videos and avoid copyright issues so they can release Lumiere and let us put it through its paces.