Meta has released its new multimodal multilingual AI translator model called SeamlessM4T. This first-of-its-kind translator can translate and transcribe speech and text in up to 100 languages.
Meta has been working on a number of language recognition and translation products, but with SeamlessM4T it has integrated multiple input and output types (speech and text) into a single model.
According to Meta’s release announcement, SeamlessM4T supports:
- Speech recognition for nearly 100 languages
- Speech-to-text translation for nearly 100 input and output languages
- Speech-to-speech translation, supporting nearly 100 input languages and 36 (including English) output languages
- Text-to-text translation for nearly 100 languages
- Text-to-speech translation, supporting nearly 100 input languages and 35 (including English) output languages
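One way to picture what "a single model" covers is to treat each of the five listed capabilities as a combination of input type, output type, and whether the language changes. The Python sketch below is only an illustration of that list. The `Task` class and `pick_task` helper are hypothetical names made up for this example and are not part of Meta's release.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    """One capability from Meta's list (hypothetical structure)."""
    name: str
    source: str       # input type: "speech" or "text"
    target: str       # output type: "speech" or "text"
    translates: bool  # False when the output stays in the input language


# The five capabilities named in the announcement.
TASKS = [
    Task("speech recognition", "speech", "text", False),
    Task("speech-to-text translation", "speech", "text", True),
    Task("speech-to-speech translation", "speech", "speech", True),
    Task("text-to-text translation", "text", "text", True),
    Task("text-to-speech translation", "text", "speech", True),
]


def pick_task(source: str, target: str, same_language: bool) -> Task:
    """Return the listed task that matches the requested input and output."""
    for task in TASKS:
        matches_types = (task.source, task.target) == (source, target)
        if matches_types and task.translates != same_language:
            return task
    raise ValueError(f"no listed task for {source} -> {target}")


print(pick_task("speech", "speech", same_language=False).name)
print(pick_task("speech", "text", same_language=True).name)
```

Asking for a combination that is not on the list, such as text to speech in the same language, raises a `ValueError` in this sketch.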
The speech-to-speech translation is probably one of the more exciting capabilities of the model. To be able to record speech in your language and then have it spoken in a different language is amazing. Imagine how useful this would be while traveling in a foreign country.
In 2022, Meta released its No Language Left Behind text-to-text translator, which supports 200 languages. These include 55 African languages, many of which were very poorly translated by other tools.
Late in 2022, Meta also released an example of a new approach to speech-to-speech translation of low-resource languages. It used its Universal Speech Translator to translate Hokkien, a primarily spoken language without a standard writing system.
Earlier in 2023, it continued its focus on underserved languages with its Massively Multilingual Speech model, which provides automatic speech recognition for more than 1,100 languages.
SeamlessM4T builds on these earlier models, combining their capabilities in a single, lightweight model.
“Introducing SeamlessM4T, the first all-in-one, multilingual multimodal translation model. This single model can perform tasks across speech-to-text, speech-to-speech, text-to-text translation & speech recognition for up to 100 languages depending on the task.”
— Meta AI (@MetaAI), August 22, 2023
Training data presents bias and toxicity challenges
Meta says its model was trained on “data from publicly available repositories of web data (tens of billions of sentences) and speech (4 million hours).”
Meta didn’t name the specific sources of its training data, saying only that the data was licensed or open-source and not copyrighted.
Meta acknowledged that the model faces the same “inherent risks” of bias and toxicity that other AI models do. Inevitably, the biases present in different cultures are expressed in the recorded audio and transferred to the model during the training process.
To help counter bias, Meta extended its Multilingual HolisticBias text dataset to cover speech. This is part of its effort to correct for cases where the model may “unfairly favor a gender and sometimes default to gender stereotypes.”
Providing guardrails to curb the toxicity of the output is another challenge Meta has to address. Toxicity refers to how incorrect translations could “incite hate, violence, profanity, or abuse against an individual or a group.”
Meta used its “highly multilingual toxicity classifier” to check for toxicity in inputs and outputs so that SeamlessM4T is less likely to offend anyone.
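To make the idea of checking both inputs and outputs concrete, here is a toy Python sketch. It assumes one plausible use of such checks: flagging a translation that contains a toxic term when the source sentence did not. The word lists and the `toxic_terms` and `added_toxicity` helpers are hypothetical stand-ins invented for this example. They are not Meta's classifier, which the article does not describe in detail.

```python
import re

# Hypothetical, tiny per-language word lists standing in for a real classifier.
TOXIC_TERMS = {
    "eng": {"idiot", "stupid"},
    "spa": {"idiota", "estúpido"},
}


def toxic_terms(text: str, lang: str) -> set:
    """Return the listed toxic terms that appear in the text."""
    words = re.findall(r"\w+", text.lower())
    return TOXIC_TERMS.get(lang, set()).intersection(words)


def added_toxicity(source: str, source_lang: str,
                   translation: str, target_lang: str) -> bool:
    """True when the translation has a toxic term but the source has none."""
    in_source = toxic_terms(source, source_lang)
    in_translation = toxic_terms(translation, target_lang)
    return bool(in_translation) and not in_source


# A mistranslation that introduces an insult is flagged.
print(added_toxicity("You are very kind", "eng", "Eres un idiota", "spa"))
# A faithful translation of an insulting sentence is not.
print(added_toxicity("You are an idiot", "eng", "Eres un idiota", "spa"))
```

A real system would need far richer detection than word matching, but the comparison between input and output is the part this sketch is meant to show.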
It’ll probably still come up with some awkward translations, as the team that developed the model admits that it “overgeneralizes to masculine forms when translating from neutral terms.” I bet that if you tried hard enough, you could get it to say something naughty.
If you’d like to try it out, check out Meta’s demo. You can record a sentence, select three different languages, and a few seconds later hear the spoken translations. Very impressive.
In describing its ambitions with SeamlessM4T, Meta referenced the Babel Fish from The Hitchhiker’s Guide to the Galaxy. It isn’t capable of real-time translation quite yet, but it’s probably a lot more comfortable to use than sticking a fish in your ear.