AI transcription tools generate harmful hallucinations

May 8, 2024

  • A study found that AI transcription tools hallucinate and that the invented text is often harmful
  • OpenAI’s Whisper API hallucinated in 1.4% of transcriptions, and 38% of those hallucinations included harmful content
  • Hallucinations were more prevalent when transcribing speech from people with aphasia

Speech-to-text transcribers have become invaluable, but a new study shows that when the AI gets it wrong, the hallucinated text is often harmful.

AI transcription tools have become extremely accurate and have transformed the way doctors keep patient records and the way we take minutes of meetings. We know they’re not perfect, so we’re unsurprised when the transcription isn’t quite right.

A new study found that when more advanced AI transcribers like OpenAI’s Whisper make mistakes, they don’t simply produce garbled or random text. They hallucinate entire phrases, and those phrases are often distressing.

We know that all AI models hallucinate. When ChatGPT doesn’t know an answer to a question, it will often make something up instead of saying “I don’t know.”

Researchers from Cornell University, the University of Washington, New York University, and the University of Virginia found that even though the Whisper API was better than other tools, it still hallucinated just over 1% of the time.

The more significant finding is that when they analyzed the hallucinated text, they found that “38% of hallucinations include explicit harms such as perpetuating violence, making up inaccurate associations, or implying false authority.”
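
To put the two percentages together, here is a rough back-of-the-envelope sketch. It assumes, as the summary above does, that the 38% applies to the 1.4% of transcriptions that contained a hallucination. The function name and the batch size of 10,000 are illustrative only and do not come from the paper.

```python
def expected_counts(total, hallucination_rate=0.014, harmful_share=0.38):
    """Estimate how many transcriptions in a batch are affected.

    hallucination_rate: share of transcriptions containing a hallucination (1.4%)
    harmful_share: share of those hallucinations that include explicit harms (38%)
    """
    hallucinated = total * hallucination_rate
    harmful = hallucinated * harmful_share
    return hallucinated, harmful


# Hypothetical batch of 10,000 transcriptions
hallucinated, harmful = expected_counts(10_000)
print(f"{hallucinated:.0f} with hallucinations, roughly {harmful:.0f} of them harmful")
```

Under those assumptions, roughly one in every 190 transcriptions would contain a harmful hallucination. That sounds rare until you consider how many recordings a clinic or call center processes.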

It seems that Whisper doesn’t like awkward silences. When there were longer pauses in the speech, it tended to hallucinate more to fill the gaps.

This becomes a serious problem when transcribing speech from people with aphasia, a language disorder that often causes the person to struggle to find the right words.

Careless Whisper

The paper records the results of experiments with early 2023 versions of Whisper. OpenAI has since improved the tool, but Whisper’s tendency to go to the dark side when hallucinating is interesting.

The researchers classified the harmful hallucinations as follows:

  • Perpetuation of Violence: Hallucinations that depicted violence, made sexual innuendos, or involved demographic stereotyping.
  • Inaccurate Associations: Hallucinations that introduced false information, such as incorrect names, fictional relationships, or erroneous health statuses.
  • False Authority: Hallucinations that impersonated authoritative figures or media, such as YouTubers or newscasters, and often involved directives that could lead to phishing attacks or other forms of deception.

Here are some examples of transcriptions where the words in bold are Whisper’s hallucinated additions.

Whisper’s hallucinated additions to the transcription are shown in bold. Source: arXiv

You can imagine how dangerous these kinds of mistakes could be if the transcriptions are assumed to be accurate when documenting a witness statement, a phone call, or a patient’s medical records.

Why did Whisper take a sentence about a fireman rescuing a cat and add a “blood-soaked stroller” to the scene, or add a “terror knife” to a sentence describing someone opening an umbrella?

OpenAI seems to have largely fixed the problem but hasn’t explained why Whisper behaved the way it did. When the researchers tested newer versions of Whisper, they got far fewer problematic hallucinations.

Even a small number of hallucinations in transcriptions could have serious implications.

The paper described a real-world scenario where a tool like Whisper is used to transcribe video interviews of job applicants. The transcriptions are fed into a hiring system that uses a language model to analyze the transcription to find the most suitable candidate.

If an interviewee paused a little too long and Whisper added “terror knife”, “blood-soaked stroller”, or “fondled” to a sentence, it might affect their odds of getting the job.

The researchers said that OpenAI should make people aware that Whisper hallucinates and that the company should find out why the tool generates problematic transcriptions.

They also suggest that newer versions of Whisper should be designed to better accommodate underserved communities, such as people with aphasia and other speech impediments.

Eugene van der Watt

Eugene comes from an electronic engineering background and loves all things tech. When he takes a break from consuming AI news you'll find him at the snooker table.

