Meta has introduced CM3leon, its new AI image generator, and claims it delivers state-of-the-art performance.
CM3leon, pronounced “chameleon,” marks a shift in the text-to-image generation race. Tools like DALL-E and Stable Diffusion already generate impressive images, but in its announcement Meta claims CM3leon has the edge on them in a number of areas.
Most leading models, including Stable Diffusion and DALL-E 2, are diffusion models. They deliver great results, but their step-by-step denoising approach to image generation is slow and demands a lot of processing power. CM3leon is instead a transformer model, which Meta says outperforms comparable transformer-based models like Google’s Parti.
It is also far more efficient, requiring five times less compute for training and a much smaller training data set than other models.
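To make the architectural difference concrete, here is a minimal, purely illustrative Python sketch (not Meta’s code). The “models” are toy stand-in functions, and the step counts and token vocabulary size are invented for illustration; the point is to contrast a diffusion model’s repeated full-image refinement with a transformer’s one-token-at-a-time decoding.

```python
import numpy as np

rng = np.random.default_rng(0)

def toy_denoiser(noisy_image, step):
    """Stand-in for a diffusion model's learned denoising step."""
    return noisy_image * 0.9  # a real model predicts and removes noise

def toy_next_token(tokens):
    """Stand-in for a transformer predicting the next discrete image token."""
    return int(rng.integers(0, 8192))  # 8192 is an assumed codebook size

# Diffusion: start from pure noise and repeatedly refine the WHOLE image.
image = rng.standard_normal((64, 64))
for step in range(50):  # dozens to hundreds of full-image passes
    image = toy_denoiser(image, step)

# Autoregressive transformer (CM3leon-style): emit one image token per step;
# a separate decoder turns the finished token sequence into pixels.
tokens = []
for _ in range(1024):  # assumed sequence length for a small image grid
    tokens.append(toy_next_token(tokens))
```

The sketch highlights where the compute goes: diffusion re-processes the entire image at every step, while an autoregressive transformer commits to one discrete image token per step and decodes the finished sequence into pixels afterwards.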
CM3leon is one of the first models that can generate both short- and long-form captions for images, and you can also ask it questions about an image. Meta gave an example of how CM3leon responds to prompts about a picture of a dog carrying a stick.
Prompt Question: What is the dog carrying?
Model Generation: Stick
Prompt: Describe the given image in very fine detail.
Model Generation: In this image, there is a dog holding a stick in its mouth. There is grass on the surface. In the background of the image, there are trees.
CM3leon handles specific details and nuance in prompts very well. The sample images Meta used in its announcement suggest it outperforms other models on notoriously tricky elements like human hands and rendering text within generated images.
The respective prompts for these images were:
1. A small cactus wearing a straw hat and neon sunglasses in the Sahara desert.
2. A close-up photo of a human hand, hand model. High quality.
3. A raccoon main character in an Anime preparing for an epic battle with a samurai sword. Battle stance. Fantasy, Illustration.
4. A stop sign in a Fantasy style with the text “1991.”
Other notable features Meta highlighted are text-based and structure-guided image editing. These let you request edits in plain language, such as “change the sky to blue,” or place an item at a specific x-y coordinate in the image.
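Meta hasn’t published an interface for this, so the snippet below is only a local mock-up of the structure-guided idea using Pillow: compositing an element at an explicit x-y coordinate. The file names and coordinates are invented, and CM3leon itself would generate such an edit with the model rather than by pasting pixels.

```python
from PIL import Image

# Invented file names; purely a local mock-up of the concept.
scene = Image.open("scene.png")  # base image to edit
item = Image.open("item.png")    # element to place

# "Structure-guided" editing reduces to an explicit spatial constraint:
scene.paste(item, (120, 340))    # place the item at x=120, y=340
scene.save("scene_edited.png")
```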
CM3leon was trained on millions of licensed images from Shutterstock rather than with the broad, scrape-everything approach other models have been criticized for. As with other models, Meta says CM3leon will reflect the biases in its training data: ask it to generate an image of a construction worker and it will probably produce an image of a man.
But Meta is at least upfront about the issue, commenting: “While the industry is still in its early stages of understanding and addressing these challenges, we believe that transparency will be key to accelerating progress.”
Based on the examples in its release and its performance claims, CM3leon appears to be more efficient than other AI image generators and much better at spatial and contextual understanding of text prompts.
Meta hasn’t said when it will release CM3leon, so for now we’ll have to take its word for how well these features work.