Meta has released V-JEPA, a predictive vision model that is the next step toward Meta Chief AI Scientist Yann LeCun’s vision of advanced machine intelligence (AMI).
For AI-powered machines to interact with objects in the physical world, they need to be trained, but conventional methods are very inefficient. They rely on thousands of video examples, pre-trained image encoders, text, or human annotations for a machine to learn a single concept, let alone multiple skills.
V-JEPA, short for Video Joint Embedding Predictive Architecture, is a vision model designed to learn these concepts in a more efficient way.
LeCun said that “V-JEPA is a step toward a more grounded understanding of the world so machines can achieve more generalized reasoning and planning.”
V-JEPA learns how objects in the physical world interact in much the same way that toddlers do. A key part of how we learn is by filling in the blanks to predict missing information. When a person walks behind a screen and out the other side, our brain fills in the blank with an understanding of what happened behind the screen.
V-JEPA is a non-generative model that learns by predicting missing or masked parts of a video. Where a generative model would recreate the masked piece of video pixel by pixel, V-JEPA compares abstract representations of the unlabeled frames rather than the pixels themselves. The model is shown a video with a large portion masked out, leaving just enough context, and is asked to produce an abstract description of what is happening in the masked-out region.
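To make that idea concrete, here is a minimal sketch of what this kind of representation-space prediction can look like in PyTorch. It is an illustration, not Meta's released code: the tiny encoder and predictor modules, the tensor sizes, the mask layout, and the L1 feature loss are placeholder assumptions; the point is that the loss is computed on abstract features, never on pixels.

```python
# Sketch of JEPA-style masked prediction in representation space (illustrative only).
import torch
import torch.nn as nn

class TinyEncoder(nn.Module):
    """Stand-in for the video encoder: maps patch tokens to embeddings."""
    def __init__(self, dim=128):
        super().__init__()
        self.proj = nn.Linear(dim, dim)
        self.block = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)

    def forward(self, tokens):            # tokens: (batch, num_patches, dim)
        return self.block(self.proj(tokens))

class TinyPredictor(nn.Module):
    """Predicts representations of masked patches from the visible context."""
    def __init__(self, dim=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))

    def forward(self, context):
        return self.net(context)

def jepa_loss(video_tokens, mask, encoder, target_encoder, predictor):
    """video_tokens: (B, N, D) patch embeddings; mask: (B, N) bool, True = masked."""
    # Context path: encode only the visible patches (masked positions are zeroed
    # here for simplicity; a real model would drop them from the sequence).
    context = encoder(video_tokens * (~mask).unsqueeze(-1))
    # Target path: a stop-gradient encoder sees the full clip.
    with torch.no_grad():
        targets = target_encoder(video_tokens)
    # Predict the representations of the masked patches and compare them
    # in feature space -- not in pixel space.
    preds = predictor(context)
    return (preds[mask] - targets[mask]).abs().mean()

# Toy usage with random "patch tokens" standing in for a video clip.
B, N, D = 2, 16, 128
tokens = torch.randn(B, N, D)
mask = torch.zeros(B, N, dtype=torch.bool)
mask[:, 8:] = True                         # mask out a large contiguous block
enc, tgt_enc, pred = TinyEncoder(D), TinyEncoder(D), TinyPredictor(D)
loss = jepa_loss(tokens, mask, enc, tgt_enc, pred)
loss.backward()
```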
Instead of training V-JEPA on one specific skill, Meta says “it used self-supervised training on a range of videos and learned a number of things about how the world works.”
Today we’re releasing V-JEPA, a method for teaching machines to understand and model the physical world by watching videos. This work is another important step towards @ylecun’s outlined vision of AI models that use a learned understanding of the world to plan, reason and… pic.twitter.com/5i6uNeFwJp
— AI at Meta (@AIatMeta) February 15, 2024
Frozen evaluations
Meta’s research paper explains that one of the key things making V-JEPA more efficient than other vision learning models is its strong performance in “frozen evaluations”.
After undergoing self-supervised learning with extensive unlabeled data, the encoder and predictor don’t require further training when learning a new skill. The pretrained model is frozen.
Previously, if you wanted to fine-tune a model to learn a new skill, you had to update the parameters or weights of the entire model. For V-JEPA to learn a new task, it needs only a small amount of labeled data, with a small set of task-specific parameters optimized on top of the frozen backbone.
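As a rough illustration of what frozen evaluation looks like in practice, the sketch below freezes a placeholder backbone and optimizes only a small task-specific head on a labeled batch. The module names, sizes, and the simple linear probe are assumptions made for the example, not Meta's actual evaluation code.

```python
# Sketch of a "frozen evaluation" step: the pretrained backbone stays fixed,
# and only a small task head is trained on labeled data (illustrative only).
import torch
import torch.nn as nn

pretrained_encoder = nn.Sequential(        # stand-in for the frozen backbone
    nn.Linear(128, 256), nn.GELU(), nn.Linear(256, 256)
)
for p in pretrained_encoder.parameters():  # freeze the backbone weights
    p.requires_grad = False
pretrained_encoder.eval()

num_classes = 10                           # e.g. a small action-recognition task
probe = nn.Linear(256, num_classes)        # only these parameters are optimized

optimizer = torch.optim.AdamW(probe.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

# One training step on a small labeled batch (random tensors as placeholders).
features_in = torch.randn(8, 128)          # precomputed clip features
labels = torch.randint(0, num_classes, (8,))

with torch.no_grad():                      # backbone runs without gradients
    embeddings = pretrained_encoder(features_in)

logits = probe(embeddings)
loss = criterion(logits, labels)
loss.backward()                            # gradients flow only into the probe
optimizer.step()
```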
V-JEPA’s capability to learn new tasks efficiently is promising for the development of embodied AI. It could be key to enabling machines to be contextually aware of their physical surroundings and to handle planning and sequential decision-making tasks.