Woodpecker could solve multimodal LLM hallucinations

October 26, 2023

Multimodal Large Language Models (MLLMs) like GPT-4V are really good at analyzing and describing images, but sometimes they hallucinate and get things wrong. A new approach called Woodpecker could fix that.

If you ask an MLLM to describe a photo, it can normally pick out the objects and accurately describe the scene. But as with answers to text prompts, the model sometimes makes assumptions based on items or concepts that often appear together.

As a result, an MLLM could describe a photo of a shopfront and say there are people in the scene when there aren’t any.

Fixing hallucinations in text-based LLMs is an ongoing effort, but it gets a lot easier when the model is connected to the internet. The LLM can generate a text response to a prompt, check its accuracy against relevant internet data, and self-correct where necessary.
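The generate, check, and correct loop can be sketched in a few lines of Python. This is a toy illustration only: the `self_correct` function, the `draft` answer, and the `reference` dictionary are hypothetical stand-ins for an LLM's response and whatever data it retrieves online, not part of any real system.

```python
def self_correct(draft, reference):
    """Replace any drafted value that the reference source contradicts."""
    corrected = {}
    for topic, value in draft.items():
        known = reference.get(topic)
        # Keep the drafted value when the reference is silent or agrees.
        corrected[topic] = value if known in (None, value) else known
    return corrected


# Hypothetical draft answer and retrieved reference data.
draft = {"capital of France": "Lyon", "currency of France": "euro"}
reference = {"capital of France": "Paris"}
print(self_correct(draft, reference))
```

Claims the reference cannot confirm are left as they were, which is also why this kind of checking only helps when the relevant data can actually be found.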

Scientists from Tencent’s YouTu Lab and the University of Science and Technology of China adapted this approach to images in a method called Woodpecker.

In simple terms, Woodpecker builds a body of knowledge from the image, which an LLM then uses as a reference to correct the initial description generated by the MLLM.

Here’s a brief description of how it works:

  1. An LLM like GPT-3.5 Turbo analyzes the description generated by the MLLM and extracts key concepts like objects, quantities, and attributes. For example, in the sentence “The man is wearing a black hat.”, the objects “man” and “hat” are extracted.
  2. An LLM is then prompted to generate questions related to these concepts, such as “Is there a man in the image?” or “What is the man wearing?”.
  3. These questions are answered by pre-trained vision models that analyze the image. Grounding DINO, an object detector, handles object detection and counting, while the BLIP-2-FlanT5 Visual Question Answering (VQA) model answers attribute-related questions.
  4. An LLM combines the answers to the questions into a visual knowledge base for the image.
  5. An LLM uses this reference body of knowledge to correct any hallucinations in the original MLLM’s description and adds details it missed.

Incorrect descriptions from MLLM along with corrections from Woodpecker. Source: arXiv
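The five steps above can be sketched as a small Python pipeline. This is not the researchers' code: every function name here is hypothetical, the `ToyImage` class stands in for what Grounding DINO and BLIP-2-FlanT5 would report about a real image, and simple keyword matching replaces the LLM calls. The final step only removes hallucinated sentences and does not add missing details.

```python
from dataclasses import dataclass


@dataclass
class ToyImage:
    """Stand-in for a real image: what the vision models would report."""
    counts: dict  # object name -> number found (object detector stand-in)
    colors: dict  # object name -> color (VQA model stand-in)


def extract_concepts(description, vocabulary):
    # Step 1: an LLM would do this; here we match known object words.
    text = description.lower()
    return [obj for obj in vocabulary if obj in text]


def generate_questions(concepts):
    # Step 2: one counting question and one attribute question per object.
    questions = []
    for obj in concepts:
        questions.append(("count", obj, f"How many {obj}s are in the image?"))
        questions.append(("color", obj, f"What color is the {obj}?"))
    return questions


def answer_questions(questions, image):
    # Step 3: counting goes to the detector, attributes to the VQA model.
    answers = {}
    for kind, obj, _text in questions:
        if kind == "count":
            answers[(kind, obj)] = image.counts.get(obj, 0)
        elif obj in image.colors and image.counts.get(obj, 0) > 0:
            answers[(kind, obj)] = image.colors[obj]
    return answers


def build_knowledge_base(answers):
    # Step 4: turn the answers into plain-language claims about the image.
    claims = []
    for (kind, obj), value in answers.items():
        if kind == "color":
            claims.append(f"The {obj} is {value}.")
        elif value == 0:
            claims.append(f"There is no {obj} in the image.")
        else:
            claims.append(f"The image contains {value} {obj}(s).")
    return claims


def correct_description(description, answers):
    # Step 5: an LLM would rewrite the text; this toy version only drops
    # sentences that mention an object the detector did not find.
    missing = [obj for (kind, obj), n in answers.items()
               if kind == "count" and n == 0]
    sentences = [s.strip() for s in description.split(".") if s.strip()]
    kept = [s for s in sentences
            if not any(obj in s.lower() for obj in missing)]
    return ". ".join(kept) + "." if kept else ""


# The shopfront example: the MLLM hallucinates a person.
description = "A shop with a red door. A person is standing outside."
image = ToyImage(counts={"shop": 1, "door": 1}, colors={"door": "red"})

concepts = extract_concepts(description, ["shop", "door", "person"])
answers = answer_questions(generate_questions(concepts), image)
print(build_knowledge_base(answers))
print(correct_description(description, answers))  # A shop with a red door.
```

The point of the structure is that the final correction never looks at the image directly. It relies only on the knowledge base, so each claim in the corrected description can be traced back to an answer from one of the vision models.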

The researchers named their approach Woodpecker in reference to how the bird picks bugs out of trees.

Test results on the POPE benchmark showed that Woodpecker improved accuracy by 30.66 percentage points for MiniGPT-4 and 24.33 percentage points for mPLUG-Owl compared with the uncorrected baseline models.

Because the models it relies on are generic, Woodpecker could easily be integrated into various MLLMs.

If OpenAI integrates Woodpecker into ChatGPT, we could see a marked improvement in its already impressive visual performance. A reduction in MLLM hallucination could also improve automated decision-making by systems that use visual descriptions as inputs.


Eugene van der Watt

Eugene comes from an electronic engineering background and loves all things tech. When he takes a break from consuming AI news you'll find him at the snooker table.

