Meta released its text-to-audio generative AI, AudioCraft, this week, and the samples of its output are impressive.
The generative AI space has seen exponential development in text, image, and voice generation, but we haven't had much news in AI audio generation. AudioCraft is one of the first text-to-audio tools of its kind that you can properly try out.
Earlier this year Google gave us a peek at its MusicLM text-to-music generator, but months down the line you can still only try it out if you get accepted into its AI Test Kitchen.
The pre-trained AudioCraft models are available for download on GitHub, and Meta hopes its open-source strategy will drive adoption and testing to improve the models.
AudioCraft is made up of Meta's MusicGen, AudioGen, and EnCodec models.
The MusicGen model was trained on music that Meta specifically licensed or owns, and it outputs music from a text prompt. The example on Meta's blog used the following prompt: "Pop dance track with catchy melodies, tropical percussions, and upbeat rhythms, perfect for the beach."
The music output sounds pretty good and closely matches the prompt. The sample was likely cherry-picked but it’s impressive nonetheless. You can listen to more samples here.
🎵 Today we’re sharing details about AudioCraft, a family of generative AI models that lets you easily generate high-quality audio and music from text. https://t.co/04XAq4rlap pic.twitter.com/JreMIBGbTF
— Meta Newsroom (@MetaNewsroom) August 2, 2023
While there are a few text-to-music tools you can try out online, the AudioGen model is more unusual. It was trained on public sound effects and generates complex sound effects from text prompts. The example prompt on Meta's blog, "Sirens and a humming engine approach and pass," produced a convincing result. Here are some more AudioGen sample effects.
Being able to generate sound effects from text descriptions for free will be huge for content creators. Imagine making a clip for social media or a YouTube video and getting exactly the right sound effect without having to pay to download it from a sound effects website.
The EnCodec model is probably the most exciting part of AudioCraft. It's an AI-powered codec for audio. A codec is a piece of software that compresses data for storage or transmission and decompresses it for playback, losing as little quality as possible. If you've ever played an MP3 music file, you've used a codec.
EnCodec strips out as much of the data in the audio file as possible and then uses AI to fill in the gaps when the audio needs to be played back. The result is that the compressed audio files can be 10 times smaller than if they were stored as MP3s.
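To see what that factor of 10 means in practice, here is a back-of-envelope sketch. The bitrates are illustrative assumptions: Meta's EnCodec research compared its output at around 6 kbps to MP3 at 64 kbps, which is roughly where the 10x figure comes from.

```python
# Back-of-envelope: size of an audio stream at a given bitrate.
# Bitrates below are illustrative assumptions, not measured values:
# ~6 kbps for EnCodec vs ~64 kbps for a comparable-quality MP3.

def audio_size_bytes(bitrate_kbps: float, seconds: float) -> float:
    """Stream size in bytes: bitrate (kilobits/s) x duration / 8 bits per byte."""
    return bitrate_kbps * 1000 * seconds / 8

track_seconds = 3 * 60  # a three-minute track

mp3_size = audio_size_bytes(64, track_seconds)
encodec_size = audio_size_bytes(6, track_seconds)

print(f"MP3 @ 64 kbps:    {mp3_size / 1e6:.2f} MB")    # 1.44 MB
print(f"EnCodec @ 6 kbps: {encodec_size / 1e6:.3f} MB")  # 0.135 MB
print(f"Compression factor: {mp3_size / encodec_size:.1f}x")  # 10.7x
```

The ratio is just the ratio of the two bitrates, so the exact savings depend entirely on which MP3 bitrate you compare against.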
Meta doesn't yet have a similar codec for video, but imagine the implications of compressing video and audio by a factor of 10 with no perceptible loss in fidelity. You could free up 90% of your hard drive space, or stream music and video 10 times faster on the same bandwidth.
It'll be interesting to see how developers use Meta's text-to-audio tools. Meta appears to have trained its models responsibly, but other users of the models may not share its ethical and legal concerns. Expect some heated debate around whether copyrighted music is fair game for training AI.
And while actors and screenwriters continue to strike, free music tools like AudioCraft may soon have musicians and sound effects artists joining the picket line too.