Meta has unveiled Voicebox, a state-of-the-art generative AI model for speech. It works similarly to text generators, like ChatGPT, but generates audio instead of text responses.
Voicebox can generate audio from scratch or modify existing audio. It’s a highly flexible tool that can take a 2-second audio clip of someone’s voice and use that to generate speech in a different language while retaining voice intonation.
This pairs with text-to-speech generation: you can give the model a short sample of your voice and have it read text aloud in that voice. For example, if you’re on holiday and need to communicate in English, French, Spanish, German, Polish, or Portuguese, you could type your message into Voicebox and it would speak for you.
The model was trained on over 50,000 hours of recorded speech and transcripts in 6 languages: English, French, Spanish, German, Polish, and Portuguese. According to Meta, it’s considerably faster and more accurate than comparable speech models, like VALL-E.
Here are Voicebox’s 4 main uses:
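- In-context text-to-speech synthesis: Voicebox can generate realistic audio from text, matching the style of a short voice sample. This could be used to build multilingual virtual assistants or to help people with speech and hearing conditions converse more naturally.
- Cross-lingual style transfer: Given a voice sample and a passage of text in any of the 6 supported languages, the AI can read the text aloud in that language in the speaker’s voice, even when the sample is in a different language. This enables authentic and natural multilingual communication.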
- Speech denoising and editing: Voicebox can generate speech to edit segments within audio recordings. For example, it can resynthesize parts of speech corrupted by noise.
- Diverse speech sampling: Voicebox can generate representative speech across 6 languages, which makes it well suited to producing synthetic training data for other speech and audio models. Speech recognition models trained on Voicebox-generated synthetic speech perform almost as well as models trained on real speech, with only a 1% error rate degradation, compared with the 45 to 70% degradation observed with synthetic speech from earlier text-to-speech models.
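To make the "error rate degradation" figure concrete, here is a minimal Python sketch of how such a number could be computed. It assumes the error rate in question is word error rate (WER) and that degradation is measured relative to the model trained on real speech; the article does not specify either. The function names and the sample numbers are hypothetical and are not taken from Meta’s paper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # word missing from hypothesis
                            curr[j - 1] + 1,      # extra word in hypothesis
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


def relative_degradation(wer_real: float, wer_synthetic: float) -> float:
    """Percent increase in WER when training on synthetic instead of real speech.

    Assumption: degradation is relative to the real-speech baseline.
    """
    return (wer_synthetic - wer_real) / wer_real * 100.0


# Hypothetical numbers, for illustration only.
sample_wer = word_error_rate("the cat sat on the mat", "the cat sat on a mat")
degradation = relative_degradation(wer_real=0.100, wer_synthetic=0.101)
print(f"sample WER: {sample_wer:.3f}")
print(f"relative degradation: {degradation:.1f}%")
```

Under these assumptions, a recognizer whose WER rises from 10.0% to 10.1% when trained on synthetic speech shows roughly a 1% relative degradation, while a rise from 10% to 15% would be a 50% degradation.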
Meta hasn’t released Voicebox yet, citing concerns about misuse. However, the company has published an in-depth research paper about the model.
While there’s no official estimate on when people will be able to use Voicebox, Meta says the tool will help creators edit audio tracks, improve communication with visually impaired people, and enable people to speak any foreign language in their own voice.