Training an AI image classifier or computer vision system requires access to a large dataset of images. MIT researchers have found a way to reduce our dependency on real data for image classification.
When you upload an image of a cat and ask ChatGPT to explain what the image is, it’s able to do that because it was trained on thousands of images of cats that had text labels associated with them.
Those training images had to come from somewhere. One of the datasets used to train Stable Diffusion, called LAION-5B, is a collection of billions of images scraped from the internet and paired with text descriptions.
But what do we do when we need more data to train a model and have exhausted the supply of real images?
A team of MIT researchers tackled this data scarcity challenge by creating their own synthetic dataset.
Their approach, called StableRep, uses a strategy known as “multi-positive contrastive learning.” It sounds complicated, but the concept is actually quite simple.
They prompt the Stable Diffusion image generator with a text description and then have it generate multiple images. Assuming Stable Diffusion did a good job, they now have a collection of images that match the initial prompt.
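For a sense of what this generation step looks like in practice, here is a minimal sketch using the Hugging Face diffusers library. The checkpoint name, prompt, and number of images are illustrative choices, not necessarily what the researchers used.

```python
import torch
from diffusers import StableDiffusionPipeline

# Load a public Stable Diffusion checkpoint (an illustrative choice,
# not necessarily the one used by the StableRep team).
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

prompt = "a tabby cat sleeping on a windowsill"

# Generate several images from the same text description; StableRep
# treats these as positive examples of one another.
images = pipe(prompt, num_images_per_prompt=4).images
for i, image in enumerate(images):
    image.save(f"cat_{i}.png")
```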
StableRep treats these images as positives of one another: the model is trained to pull their representations closer together while pushing away the representations of images generated from other prompts, so it learns the high-level concepts that make them good semantic matches for the initial prompt.
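To make the “multi-positive” idea concrete, here is a PyTorch sketch of what such a loss can look like. The function name, the uniform target distribution over same-prompt images, and the temperature value are illustrative assumptions, not the paper’s exact formulation.

```python
import torch
import torch.nn.functional as F

def multi_positive_contrastive_loss(embeddings, prompt_ids, temperature=0.1):
    """Every pair of images generated from the same prompt counts as a
    positive pair; images from other prompts serve as negatives.
    Assumes each prompt contributed at least two images to the batch."""
    z = F.normalize(embeddings, dim=1)              # unit-length image embeddings
    logits = z @ z.T / temperature                  # pairwise similarities
    n = z.shape[0]
    self_mask = torch.eye(n, dtype=torch.bool, device=z.device)
    logits = logits.masked_fill(self_mask, float("-inf"))  # never match an image to itself
    # Target: a uniform distribution over the other images from the same prompt.
    positives = (prompt_ids.unsqueeze(0) == prompt_ids.unsqueeze(1)) & ~self_mask
    targets = positives.float()
    targets = targets / targets.sum(dim=1, keepdim=True)
    # Cross-entropy between the similarity softmax and the multi-positive target.
    return -(targets * F.log_softmax(logits, dim=1)).sum(dim=1).mean()

# Toy batch: 8 embeddings, 4 images apiece from 2 prompts.
emb = torch.randn(8, 128)
ids = torch.tensor([0, 0, 0, 0, 1, 1, 1, 1])
loss = multi_positive_contrastive_loss(emb, ids)
```

With four images per prompt, each row of the target distribution spreads its probability evenly over the other three images generated from the same prompt.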
The team trained their image classifier on the AI-generated pictures, then added language supervision, using the text prompts as an extra training signal, to produce an enhanced variant called StableRep+.
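The exact form of that language-supervision term isn’t spelled out here, but one common way to add it is a CLIP-style symmetric contrastive loss between image embeddings and the embeddings of their text prompts. The sketch below is a stand-in under that assumption, reusing the imports from the previous snippet.

```python
def clip_style_text_loss(image_emb, text_emb, temperature=0.1):
    """Symmetric InfoNCE between images and their prompts (CLIP-style).
    An assumed stand-in for StableRep+'s language supervision, not
    necessarily the paper's exact formulation."""
    zi = F.normalize(image_emb, dim=1)
    zt = F.normalize(text_emb, dim=1)
    logits = zi @ zt.T / temperature
    # Matched image-text pairs lie on the diagonal of the logits matrix.
    labels = torch.arange(zi.shape[0], device=zi.device)
    return 0.5 * (F.cross_entropy(logits, labels)
                  + F.cross_entropy(logits.T, labels))
```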
How would StableRep+ fare at image classification, given that it only ever ‘saw’ AI-generated images?
When trained with 20 million synthetic images, StableRep+ was more accurate and efficient than CLIP models trained with 50 million real images. The researchers are still working on understanding the reason behind the superior performance.
There are still a few caveats, but synthetic training data like this sidesteps many data-collection problems, including the cost, copyright, and privacy concerns tied to real images.
The challenges of this approach include the compute cost and time required to generate millions of images for the dataset. And StableRep still depends on an image generator that was itself trained on real images to produce the synthetic data.
Any bias in the text prompts, or in the labeling of the original real-image dataset, also transfers to the new model trained on synthetic data.
Despite the challenges, the results StableRep achieves are promising. The increase in diversity that synthetic data can deliver could reduce data scientists’ dependence on expensive and limited real data when training new models.