A new study published in Nature reveals that AI models, including large language models (LLMs), rapidly degrade in quality when trained on data generated by previous AI models.
This phenomenon, termed “model collapse,” could erode the quality of future AI models, particularly as more AI-generated content is published online and subsequently recycled into training data.
Investigating this phenomenon, researchers from the University of Cambridge, University of Oxford, and other institutions conducted experiments showing that when AI models are repeatedly trained on data produced by earlier versions of themselves, they start generating nonsensical outputs.
This was observed across different types of AI models, including language models, variational autoencoders, and Gaussian mixture models.
In one key experiment with language models, the team fine-tuned the OPT-125m model on the WikiText-2 dataset and then used it to generate new text.
This AI-generated text was then used to train the next “generation” of the model, and the process was repeated over and over.
It wasn’t long before models started producing increasingly improbable and nonsensical text.
By the ninth generation, the model was generating complete gibberish, such as listing multiple non-existent types of “jackrabbits” when prompted about English church towers.
The researchers also observed how models lose information about “rare” or infrequent events before complete collapse.
This is alarming, as rare events often relate to marginalized groups or outliers. Without them, models risk concentrating their responses across a narrow spectrum of ideas and beliefs, thus reinforcing biases.
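The dynamic behind this loss of rare events can be sketched with a toy simulation (a deliberate simplification for illustration, not the paper’s actual experiment): repeatedly refit a single Gaussian to a finite sample drawn from the previous generation’s fit. Each refit sheds a little of the distribution’s tails, and over many generations the spread collapses toward zero.

```python
import numpy as np

def collapse_demo(generations=300, n_samples=20, seed=0):
    """Toy model-collapse sketch: each generation fits a Gaussian
    to a finite sample drawn from the previous generation's fit.
    All parameter choices here are illustrative assumptions."""
    rng = np.random.default_rng(seed)
    mu, sigma = 0.0, 1.0                # the original "human" data distribution
    history = [sigma]
    for _ in range(generations):
        samples = rng.normal(mu, sigma, n_samples)        # synthetic data
        mu, sigma = samples.mean(), samples.std(ddof=1)   # refit on it alone
        history.append(sigma)
    return history

hist = collapse_demo()
print(f"initial std: {hist[0]:.3f}, final std: {hist[-1]:.6f}")
```

The standard deviation shrinks generation over generation, which is the toy analogue of a language model forgetting its training distribution’s tails: the “rare” events vanish first, long before the output becomes outright gibberish.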
AI companies are aware of this, which is why they’re striking deals with news companies and publishers to secure a steady stream of high-quality, human-written, topically relevant information.
“The message is, we have to be very careful about what ends up in our training data,” study co-author Zakhar Shumaylov from the University of Cambridge told Nature. “Otherwise, things will always, provably, go wrong.”
Compounding this effect, a recent study by Dr. Richard Fletcher, Director of Research at the Reuters Institute for the Study of Journalism, found that nearly half (48%) of the most popular news sites worldwide are now inaccessible to OpenAI’s crawlers, while 24% block Google’s AI crawlers.
As a result, AI models have access to a smaller pool of high-quality, recent data than they once did, increasing the risk of training on sub-standard or outdated data.
Solutions to model collapse
Regarding solutions, the researchers state that maintaining access to original, human-generated data sources is vital for AI’s future.
Tracking and managing AI-generated content would also help prevent it from accidentally contaminating training datasets. That would be very tricky, however, as AI-generated content is becoming increasingly difficult to detect reliably.
Researchers posit four main solutions:
- Watermarking AI-generated content to distinguish it from human-created data
- Creating incentives for humans to continue producing high-quality content
- Developing more sophisticated filtering and curation methods for training data
- Exploring ways to preserve and prioritize access to original, non-AI-generated information
Model collapse is a real problem
This study is far from the only one exploring model collapse.
Not long ago, Stanford researchers compared two scenarios in which model collapse might occur: one where each new model generation’s training data fully replaces the previous data, and another where synthetic data accumulates alongside the existing dataset.
When data was replaced, model performance deteriorated rapidly across all tested architectures.
However, when data was allowed to “accumulate,” model collapse was largely avoided. The AI systems maintained their performance and, in some cases, showed improvements.
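The replace-versus-accumulate distinction can be illustrated by extending the same kind of toy Gaussian simulation (a hedged sketch under simplified assumptions, not the Stanford experimental setup): in “replace” mode each generation refits only on fresh synthetic samples, while in “accumulate” mode it refits on the growing pool of all data so far, original included.

```python
import numpy as np

def run(mode, generations=300, n=20, seed=0):
    """Toy comparison of the two training regimes. 'replace' discards
    old data each generation; 'accumulate' keeps everything.
    Sample sizes and generation counts are illustrative assumptions."""
    rng = np.random.default_rng(seed)
    pool = rng.normal(0.0, 1.0, n)              # original "human" data
    mu, sigma = pool.mean(), pool.std(ddof=1)
    for _ in range(generations):
        synthetic = rng.normal(mu, sigma, n)    # this generation's output
        if mode == "replace":
            pool = synthetic                    # old data thrown away
        else:
            pool = np.concatenate([pool, synthetic])  # data accumulates
        mu, sigma = pool.mean(), pool.std(ddof=1)
    return sigma

print("replace:   ", run("replace"))
print("accumulate:", run("accumulate"))
```

In replace mode the spread collapses toward zero, while accumulation anchors each refit to the original human data and keeps the spread roughly stable, mirroring the qualitative finding that accumulating data largely avoids collapse.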
So, despite credible concerns, model collapse isn’t a foregone conclusion: it depends on whether synthetic data replaces or supplements human-generated data, and on the ratio of synthetic to authentic content in the training set.
If and when model collapse starts to become evident in frontier models, you can be certain that AI companies will be scrambling for a long-term solution.
We’re not there yet, but it might be a matter of when, not if.