Google’s Gecko benchmark identifies best AI image generator

April 30, 2024

  • Google DeepMind developed Gecko, a comprehensive benchmark to evaluate text-to-image (T2I) models
  • Gecko categorizes prompts into specific skills and sub-skills for fine-grained T2I model assessment
  • Gecko introduces a new QA-based auto-evaluation metric that correlates more effectively with human ratings

Google DeepMind released Gecko, a new benchmark for comprehensively evaluating AI text-to-image (T2I) models.

Over the last two years, we’ve seen AI image generators like DALL-E and Midjourney become progressively better with each version release.

However, judging which of the underlying models behind these platforms is best has remained largely subjective and difficult to benchmark.

Making a broad claim that one model is “better” than another isn’t so simple. Different models excel at different aspects of image generation: one may be strong at text rendering while another handles object interaction better.

A key challenge for T2I models is following every detail in the prompt and accurately reflecting each one in the generated image.

With Gecko, the DeepMind researchers have created a benchmark that evaluates the capabilities of T2I models similarly to how humans do.

Skill set

The researchers first defined a comprehensive dataset of skills relevant to T2I generation. These include spatial understanding, action recognition, text rendering, and others. They further broke these down into more specific sub-skills.

For example, under text rendering, sub-skills might include rendering different fonts, colors, or text sizes.

An LLM was then used to generate prompts to test the T2I model’s capability on a specific skill or sub-skill.

This enables the creators of a T2I model to pinpoint not only which skills are challenging, but at what level of complexity a skill becomes challenging for their model.
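
To make the skill taxonomy concrete, here is a minimal sketch of how such a dataset and the prompt-generation step might be organised in code. The skill and sub-skill names follow the article’s examples; the generate_prompts() helper and its LLM client are hypothetical illustrations, not Gecko’s actual implementation.

```python
# Minimal sketch of a skill-based prompt set (illustrative, not Gecko's code).
from dataclasses import dataclass, field

@dataclass
class SubSkill:
    name: str                                   # e.g. "font", "colour", "text size"
    prompts: list[str] = field(default_factory=list)

@dataclass
class Skill:
    name: str                                   # e.g. "text rendering", "spatial understanding"
    sub_skills: list[SubSkill] = field(default_factory=list)

def generate_prompts(llm, skill: str, sub_skill: str, complexity: str) -> list[str]:
    """Ask an LLM for test prompts targeting one sub-skill at a given complexity level."""
    instruction = (
        f"Write 5 text-to-image prompts that test '{sub_skill}' "
        f"(a sub-skill of '{skill}') at {complexity} complexity."
    )
    return llm.complete(instruction).splitlines()   # hypothetical LLM client

# Example taxonomy entry mirroring the article's text-rendering example.
text_rendering = Skill(
    name="text rendering",
    sub_skills=[SubSkill("font"), SubSkill("colour"), SubSkill("text size")],
)
```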

The Gecko benchmark framework uses a dataset of skills and sub-skills (a), human Likert scoring of image accuracy (b), and LLM-generated questions for VQA analysis, producing comprehensive metrics that correlate with human evaluations. Source: arXiv

Human vs Auto eval

Gecko also measures how accurately a T2I model follows all the details in a prompt. Again, an LLM was used to isolate key details in each input prompt and then generate a set of questions related to those details.

These questions could be both simple, direct questions about visible elements in the image (e.g., “Is there a cat in the image?”) and more complex questions that test understanding of the scene or the relationships between objects (e.g., “Is the cat sitting above the book?”).

A Visual Question Answering (VQA) model then analyzes the generated image and answers the questions, measuring how closely the T2I model’s output aligns with the input prompt.
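
A rough sketch of that QA-based evaluation loop is shown below, assuming a generic LLM client for question generation and an off-the-shelf VQA pipeline from Hugging Face. The function names and the simple “fraction of questions answered as expected” scoring rule are illustrative, not Gecko’s exact metric.

```python
# Sketch of a QA-style auto-eval loop (illustrative, not Gecko's implementation).
from transformers import pipeline

# Off-the-shelf VQA model used as a stand-in for the paper's VQA component.
vqa = pipeline("visual-question-answering", model="dandelin/vilt-b32-finetuned-vqa")

def question_answer_pairs(llm, prompt: str) -> list[tuple[str, str]]:
    """Have an LLM turn a T2I prompt into (question, expected_answer) pairs."""
    raw = llm.complete(                          # hypothetical LLM client
        "List yes/no questions and expected answers that check every detail "
        f"of this image prompt: '{prompt}'. Format: question|answer, one per line."
    )
    return [tuple(line.split("|", 1)) for line in raw.splitlines() if "|" in line]

def alignment_score(image, qa_pairs) -> float:
    """Fraction of questions the VQA model answers as the prompt implies."""
    correct = 0
    for question, expected in qa_pairs:
        predicted = vqa(image=image, question=question)[0]["answer"]
        correct += predicted.strip().lower() == expected.strip().lower()
    return correct / max(len(qa_pairs), 1)
```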

The researchers collected over 100,000 human annotations in which participants scored a generated image on how well it aligned with specific criteria.

The annotators were asked to consider a specific aspect of the input prompt and rate the image on a scale of 1 to 5 based on how well it reflected that aspect.

Using the human-annotated evaluations as the gold standard, the researchers were able to confirm that their auto-eval metric “is better correlated with human ratings than existing metrics for our new dataset.”
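
As a toy illustration of that correlation check, the snippet below compares hypothetical per-prompt auto-eval scores with mean human Likert ratings using Spearman rank correlation; the paper may use a different correlation measure, and the numbers here are made up.

```python
# Toy correlation check between auto-eval scores and human Likert ratings.
from scipy.stats import spearmanr

human_ratings = [4.2, 2.8, 3.9, 1.5, 4.7]       # mean Likert score per prompt (example values)
auto_scores   = [0.90, 0.55, 0.80, 0.30, 0.95]  # QA-based alignment scores (example values)

rho, p_value = spearmanr(human_ratings, auto_scores)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")
```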

The result is a benchmarking system capable of putting numbers to specific factors that make a generated image good or not.

Gecko essentially scores the output image in a way that closely aligns with how we intuitively decide whether or not we’re happy with the generated image.

So what is the best text-to-image model?

In their paper, the researchers concluded that Google’s Muse model beats Stable Diffusion 1.5 and SDXL on the Gecko benchmark. They may be biased, but the numbers don’t lie.

Eugene van der Watt

Eugene comes from an electronic engineering background and loves all things tech. When he takes a break from consuming AI news you'll find him at the snooker table.
