A new study found that even when people are aware they could be listening to deep fake speech, they still struggle to reliably identify fake voices.
This applies to both English and Mandarin speakers, underscoring that deep fake voices are likely effective in many languages.
Researchers from University College London asked 500 people to identify deep fake speech within multiple audio clips. Some clips included an authentic female voice reading generic sentences in either English or Mandarin, while others were deep fakes produced by generative AIs trained on female voices.
The study participants were split into two groups, each completing a different version of the experiment.
One group was presented with 20 voice samples in their native language and had to discern whether the clips were real or fake. Participants correctly identified the deep fakes and the authentic voices approximately 73% of the time for both the English and Mandarin voice samples.
A separate group was given 20 randomly selected pairs of audio clips. Each pair featured the same sentence delivered once by a human and once by a deep fake, and participants were asked to identify the fake. This format increased detection accuracy to 85.59%.
Real-world detection is likely to be poorer than even the first experiment suggests, because outside the study people would not be warned that they might be hearing AI-generated speech. In the second experiment, meanwhile, listeners had a binary choice between paired clips, an advantage they would rarely have in real life.
Interestingly, there was relatively little difference in results between English and Mandarin.
Deep fake speech scams on the rise in real life
“This setup is not completely representative of real-life scenarios,” says Kimberly Mai, the study’s first author. “Listeners would not be told beforehand whether what they are listening to is real, and factors like the speaker’s gender and age could affect detection performance.”
There are further limitations: the study did not challenge listeners to detect deep fakes designed to sound like someone they know, such as a child or parent. If scammers targeted a person with a deep fake, they would almost certainly clone the voice of someone that person knows. This is relatively easy when the person has uploaded audio or video of themselves online, for example in a social media clip, podcast, or radio or TV broadcast.
This is already happening: one McAfee survey found that roughly 1 in 4 adults had experienced an AI voice cloning scam themselves or knew someone who had.
AI-related fraud is also on the rise in China, and one analyst has predicted that AI-assisted fraud could cost individuals and economies dearly. There are already numerous frightening anecdotes from people targeted by deep fake calls, often in the form of what sounds like a panicked family member asking for money to get out of a difficult situation.
The study found that fake voices are “moving through the uncanny valley,” imitating the natural sound of human speech but retaining subtle flaws that arouse suspicion in some listeners. And, of course, AI deep fake voices are improving all the time.
Overall, the study shows that current voice-cloning technology is already highly convincing, and that the efforts made within the study to improve participants’ ability to detect fake voices were largely unsuccessful.
The authors highlight the need to develop capable automated detectors of AI-generated speech and to educate the public about how sophisticated deep fake voices already are.
Future research into deep fakes that replicate the voices of people participants actually know would be insightful.
There are practical ways to identify deep fake speech scams, such as establishing code words between family members or asking callers to describe a mutually known piece of personal information.