Kyutai’s AI voice assistant beats OpenAI to public release

July 7, 2024

  • French non-profit AI research lab Kyutai released Moshi, a real-time AI voice assistant
  • Moshi processes emotions and speaks in various styles and accents while listening simultaneously
  • Moshi delivers 200 ms end-to-end latency for real-time interactions on consumer-grade hardware

We’re still waiting for OpenAI to release its GPT-4o voice assistant, but a French non-profit AI research lab beat it to the punch with its release of Moshi.

Moshi is a real-time voice AI assistant powered by the Helium 7B model that Kyutai developed and trained using a mix of synthetic text and audio data. Moshi was then fine-tuned on synthetic dialogues to teach it how to interact.

Moshi can understand and express 70 different emotions and speak in various styles and accents. The demo of its 200-millisecond end-to-end latency is very impressive. Because it listens, thinks, and speaks simultaneously, its real-time interactions are seamless with no awkward pauses.

It may not sound as sultry as GPT-4o’s Sky, which OpenAI says isn’t imitating Scarlett Johansson, but Moshi responds faster and is publicly available.

Moshi got its voice by being trained on audio samples produced by a voice actor Kyutai refers to only as “Alice,” without providing further details.

The way Moshi interrupts and responds with imperceptible pauses makes the interactions with the AI model feel very natural.

Here’s an example of Moshi joining in on some sci-fi role-play.

Helium 7B is much smaller than GPT-4o, and its compact size means you can run it on consumer-grade hardware or in the cloud on low-power GPUs.
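As a rough back-of-the-envelope check on why a 7B-parameter model fits on consumer hardware, here is a sketch of the weight memory at a few common precisions. The 7B figure comes from the article; the precision levels are illustrative assumptions, not Kyutai's published configuration.

```python
# Rough weight-memory estimate for a 7-billion-parameter model.
# The precision choices below are common quantization levels used
# for illustration; they are not Kyutai's published settings.
PARAMS = 7e9

bytes_per_param = {"fp16": 2, "int8": 1, "int4": 0.5}

for precision, nbytes in bytes_per_param.items():
    gib = PARAMS * nbytes / 2**30  # bytes -> GiB
    print(f"{precision}: ~{gib:.1f} GiB of weights")
```

At 4-bit quantization the weights alone come to roughly 3–4 GiB, which is comfortably within the unified memory of a modern laptop like the MacBook Pro used in the demo.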

During the demo, a Kyutai engineer used a MacBook Pro to show how Moshi could run on-device.

It was a little glitchy but it’s a promising sign that we’ll soon have a low-latency AI voice assistant running on our phones or computers without sending our private data to the cloud.

Audio compression is crucial to making Moshi as small as possible. It uses an audio codec called Mimi, which compresses audio to roughly 300 times smaller than the MP3 codec does. Mimi captures both the acoustic information and the semantic data in the audio.
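To put that 300× figure in perspective, here is a quick arithmetic sketch. The article states only the ratio; the 128 kbps MP3 baseline used below is an illustrative assumption, not a number from Kyutai.

```python
# Bandwidth implied by a 300x compression advantage over MP3.
# 128 kbps is a common MP3 bitrate, chosen here purely for
# illustration; the article gives only the 300x ratio.
MP3_KBPS = 128
RATIO = 300

mimi_kbps = MP3_KBPS / RATIO
print(f"Implied bitrate: ~{mimi_kbps:.2f} kbps")

def kb_per_minute(kbps):
    # kilobits/s -> kilobytes per minute
    return kbps * 60 / 8

print(f"MP3: {kb_per_minute(MP3_KBPS):.0f} kB/min")
print(f"Mimi-scale: {kb_per_minute(mimi_kbps):.1f} kB/min")
```

A sub-kilobit-per-second audio stream is what makes streaming full-duplex speech to and from a model with 200 ms latency plausible even on modest connections.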

If you’d like to chat with Moshi you can try it out here:

It’s important to remember that Moshi is an experimental prototype and that it was created in just six months by a team of eight engineers.

The web version is really glitchy, but that’s probably because Kyutai’s servers are being slammed by users wanting to try it out.

Kyutai says it will publicly release the model, codec, code, and weights soon. We may have to wait until then to get performance similar to the demo.

Even though it’s a bit buggy, the demo was refreshingly honest compared to Big Tech teasers of features that don’t get released.

Moshi is a great example of what a small team of AI engineers can do and makes you wonder why we’re still waiting for GPT-4o to talk to us.


Eugene van der Watt

Eugene comes from an electronic engineering background and loves all things tech. When he takes a break from consuming AI news you'll find him at the snooker table.

