Researchers from New York University took inspiration from children’s learning processes to train an AI system.
The method, detailed in the journal Science, allows AI to learn from its environment without relying heavily on labeled data, a constraint central to the study’s design.
It mirrors how children learn by absorbing vast amounts of information from their surroundings, gradually making sense of the world around them.
To replicate a child’s perspective in their AI model, the team built a dataset from roughly 60 hours of first-person video captured by a head-mounted camera worn by children between six months and two years of age.
1/ Today in Science, we train a neural net from scratch through the eyes and ears of one child. The model learns to map words to visual referents, showing how grounded language learning from just one child’s perspective is possible with today’s AI tools. https://t.co/hPZiiQt6Vv pic.twitter.com/wa8jfn9b5Z
— Wai Keen Vong (@wkvong) February 1, 2024
Researchers then trained a self-supervised learning (SSL) AI model on the video dataset to see whether it could grasp actions and changes by analyzing temporal (time-related) information in the videos, much as children do.
SSL approaches enable AI models to learn patterns and structures in the data without explicit labels.
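To illustrate the general idea (not the specific architecture used in these studies), a self-supervised objective over video might simply require that embeddings of temporally adjacent frames agree, with no human labels involved. A minimal PyTorch sketch, with layer sizes and hyperparameters assumed purely for the example, could look like this:

```python
# Minimal, generic sketch of self-supervised learning on video frames:
# the encoder is trained so that embeddings of temporally adjacent frames
# from the same clip match, using an InfoNCE-style contrastive loss.
# All sizes and hyperparameters are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FrameEncoder(nn.Module):
    """Tiny convolutional encoder mapping a frame to a normalized embedding."""
    def __init__(self, dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, dim),
        )

    def forward(self, frames):
        return F.normalize(self.net(frames), dim=-1)

def temporal_contrastive_loss(z_t, z_next, temperature=0.1):
    """Frame t should match frame t+1 from the same clip and mismatch
    frames drawn from other clips in the batch; no explicit labels needed."""
    logits = z_t @ z_next.T / temperature   # (B, B) similarity matrix
    targets = torch.arange(z_t.size(0))     # positives lie on the diagonal
    return F.cross_entropy(logits, targets)

# One illustrative training step on a batch of (frame_t, frame_t+1) pairs.
encoder = FrameEncoder()
optimizer = torch.optim.Adam(encoder.parameters(), lr=1e-4)
frames_t = torch.randn(16, 3, 64, 64)      # stand-ins for headcam frames
frames_next = torch.randn(16, 3, 64, 64)   # the frames that follow them
loss = temporal_contrastive_loss(encoder(frames_t), encoder(frames_next))
loss.backward()
optimizer.step()
```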
Study author Emin Orhan, writing on his research blog, had previously advocated for a greater focus on SSL in AI research, which he believes is pivotal for understanding complex learning processes.
Orhan wrote, “Children are often said to learn the meanings of words very efficiently. For example, in their second year, children are claimed to be learning a few words a day on average. This suggests that they are probably able to learn most of their words from just a handful of exposures (perhaps often from a single exposure only), a phenomenon also known as fast mapping.”
4/ To test this, what better than to train a neural network, not on enormous amounts of data from the web, but only on the input that a single child receives? What would it learn then, if anything? pic.twitter.com/bQ9aVbXUlB
— Wai Keen Vong (@wkvong) February 1, 2024
The study also aimed to address whether AI needs built-in biases or ‘shortcuts’ to learn effectively or if it could develop an understanding of the world through general learning algorithms, much like a child does.
The results were intriguing. Despite the video covering only about 1% of the child’s waking hours, the AI system learned numerous words and concepts, demonstrating the efficiency of learning from limited but targeted data.
Results include:
- Action recognition performance: The AI models trained on the headcam footage (the SAYCam dataset) were highly effective at recognizing actions in video. When tested on fine-grained action recognition tasks such as Kinetics-700 and Something-Something-V2 (SSV2), the models performed impressively even with only a small number of labeled examples for training (a sketch of this style of few-shot evaluation appears after this list).
- Comparison with Kinetics-700 dataset: The SAYCam-trained models were compared to models trained on Kinetics-700, a diverse dataset of short YouTube clips. Remarkably, the SAYCam models performed competitively, suggesting that child-centric, developmentally realistic video provides a learning environment as rich as, or richer than, the varied content found on YouTube.
- Video interpolation skill: An interesting outcome was the models’ ability to perform video interpolation – predicting missing segments within a video sequence. This demonstrated an understanding of temporal dynamics and continuity in visual scenes, mirroring the way humans perceive and predict actions.
- Robust object representations: The study also found that video-trained models developed more robust object representations than those trained on static images. This was evident in tasks requiring the recognition of objects under various conditions, highlighting the value of temporal information in learning more resilient and versatile models.
- Data scaling and model performance: The research explored how the models’ performance improved as more video from the SAYCam dataset was used for training, suggesting that access to more extensive, developmentally realistic data would further improve the models.
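The few-shot evaluations mentioned above are commonly run by freezing the pretrained encoder and fitting only a small linear classifier on a handful of labeled clips. The sketch below shows that general recipe; the encoder, data, and training settings are hypothetical placeholders, and the papers’ exact protocols may differ.

```python
# Generic linear-probe sketch for few-shot action recognition:
# features come from a frozen, pretrained video/frame encoder, and only a
# linear layer is trained on a few labeled examples. Inputs are placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F

def linear_probe(encoder, clips, labels, num_classes, epochs=100, lr=1e-2):
    """Fit a linear classifier on frozen features from a few labeled clips."""
    encoder.eval()
    with torch.no_grad():
        feats = encoder(clips)                     # (N, D) frozen features
    probe = nn.Linear(feats.size(1), num_classes)
    opt = torch.optim.SGD(probe.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        F.cross_entropy(probe(feats), labels).backward()
        opt.step()
    return probe

# Hypothetical usage with, say, 5 labeled clips per action class:
# probe = linear_probe(pretrained_encoder, support_clips, support_labels, num_classes=700)
# preds = probe(pretrained_encoder(query_clips)).argmax(dim=1)
# accuracy = (preds == query_labels).float().mean()
```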
6/ Results: Even with limited data, we found that the model can acquire word-referent mappings from merely tens to hundreds of examples, generalize zero-shot to new visual datasets, and achieve multi-modal alignment. Again, genuine language learning is possible from a child’s… pic.twitter.com/FCHfZCqftr
— Wai Keen Vong (@wkvong) February 1, 2024
Wai Keen Vong, a research scientist at NYU’s Center for Data Science, discussed the novelty of this approach, stating, “We show, for the first time, that a neural network trained on this developmentally realistic input from a single child can learn to link words to their visual counterparts.”
Touching on the issues confronted by modern generative AI models, Vong said, “Today’s state-of-the-art AI systems are trained using astronomical amounts of data (often billions/trillions of words), and yet humans manage to learn and use language with far less data (hundreds of millions of words), so the connection between these advances in machine learning to human language acquisition is not clear.”
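One standard way to implement the kind of word-to-visual-referent learning Vong describes is to train a frame encoder and an utterance encoder so that each frame and the words spoken around it land close together in a shared embedding space. The sketch below is illustrative only: the bag-of-words utterance encoder, module sizes, and training details are assumptions made for the example, not the exact model reported in Science.

```python
# Illustrative sketch of grounded word learning as frame-utterance alignment,
# using a symmetric contrastive loss over (frame, transcribed utterance) pairs.
# All components here are simplified stand-ins, not the published model.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FrameEncoder(nn.Module):
    """Tiny CNN mapping a headcam frame to a normalized embedding."""
    def __init__(self, dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, dim),
        )

    def forward(self, frames):
        return F.normalize(self.net(frames), dim=-1)

class UtteranceEncoder(nn.Module):
    """Embed transcribed speech as the mean of its word embeddings."""
    def __init__(self, vocab_size=10_000, dim=128):
        super().__init__()
        self.embed = nn.EmbeddingBag(vocab_size, dim, mode="mean")

    def forward(self, word_ids, offsets):
        return F.normalize(self.embed(word_ids, offsets), dim=-1)

def alignment_loss(frame_emb, text_emb, temperature=0.07):
    """Each frame should match the utterance heard at that moment (and vice
    versa) relative to the other pairs in the batch."""
    logits = frame_emb @ text_emb.T / temperature
    targets = torch.arange(frame_emb.size(0))
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.T, targets)) / 2

# Illustrative batch: 8 frames, each paired with the utterance heard at that time.
frames = torch.randn(8, 3, 64, 64)
word_ids = torch.randint(0, 10_000, (40,))   # all utterance words, concatenated
offsets = torch.arange(0, 40, 5)             # each utterance is 5 words long here
loss = alignment_loss(FrameEncoder()(frames), UtteranceEncoder()(word_ids, offsets))
```

Once a model is trained in this way, a word-referent mapping can be probed by embedding a single word and ranking candidate frames by similarity, which is roughly how zero-shot generalization to new visual examples is typically evaluated.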
Interest in novel, ‘lightweight’ machine learning methods is increasing. For one thing, colossal monolithic models like GPT-3 and GPT-4 have immense power demands that aren’t easy to satisfy.
For another, creating bio-inspired AI systems is a key step toward designing models or robots that authentically ‘think’ and ‘behave’ as we do.
Vong also acknowledged study limitations, noting, “One caveat is that the language input to the model is text, not the underlying speech signal that children receive.”
This study challenged traditional AI training models and contributed to the ongoing discourse on the most effective ways to mimic biological learning.
Interest in this subject is likely to grow as today’s colossal AI models begin to show their limitations.