Introducing Voicebox: The first generative AI model for speech to generalize across tasks with state-of-the-art performance
June 16, 2023

Meta AI researchers have achieved a breakthrough in generative AI for speech. We’ve developed Voicebox, the first model that can generalize, with state-of-the-art performance, to speech-generation tasks it was not specifically trained to accomplish.

Like generative systems for images and text, Voicebox creates outputs in a vast variety of styles, and it can create outputs from scratch as well as modify a sample it’s given. But instead of creating a picture or a passage of text, Voicebox produces high-quality audio clips. The model can synthesize speech across six languages, as well as perform noise removal, content editing, style conversion, and diverse sample generation.

Prior to Voicebox, generative AI for speech required specific training for each task using carefully prepared training data. Voicebox uses a new approach to learn just from raw audio and an accompanying transcription. Unlike autoregressive models for audio generation, Voicebox can modify any part of a given sample, not just the end of an audio clip it is given.

Voicebox is based on a method called Flow Matching, which has been shown to improve upon diffusion models. Voicebox outperforms VALL-E, the current state-of-the-art English model, on zero-shot text-to-speech in terms of both intelligibility (a 1.9 percent word error rate vs. VALL-E’s 5.9 percent) and audio similarity (0.681 vs. 0.580), while being as much as 20 times faster. For cross-lingual style transfer, Voicebox outperforms YourTTS, reducing the average word error rate from 10.9 percent to 5.2 percent and improving audio similarity from 0.335 to 0.481.

Voicebox achieves new state-of-the-art results, outperforming VALL-E and YourTTS on word error rate.
Voicebox also achieves new state-of-the-art results on audio style similarity metrics on English and multilingual benchmarks, respectively.

There are many exciting use cases for generative speech models, but because of the potential risks of misuse, we are not making the Voicebox model or code publicly available at this time. While we believe it is important to be open with the AI community and to share our research to advance the state of the art in AI, it’s also necessary to strike the right balance between openness and responsibility. With these considerations in mind, today we are sharing audio samples and a research paper detailing the approach and the results we have achieved. In the paper, we also detail how we built a highly effective classifier that can distinguish between authentic speech and audio generated with Voicebox.

A new approach to speech generation

One of the main limitations of existing speech synthesizers is that they can only be trained on data that has been prepared expressly for that task. These inputs – known as monotonic, clean data – are difficult to produce, so they exist only in limited quantities, and they result in outputs that sound monotone.

We built Voicebox upon the Flow Matching model, Meta’s latest advancement in non-autoregressive generative modeling, which can learn a highly non-deterministic mapping between text and speech. Non-deterministic mapping is useful because it enables Voicebox to learn from varied speech data without those variations having to be carefully labeled. This means Voicebox can train on more diverse data and at a much larger scale.
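For readers curious what flow matching training looks like in practice, below is a minimal, illustrative sketch of the conditional flow matching objective with the optimal-transport probability path from the Flow Matching paper. The network signature, conditioning input, and feature choices are placeholders for illustration and do not reflect Voicebox’s actual architecture.

```python
import torch

def conditional_flow_matching_loss(model, x1, cond, sigma_min=1e-5):
    """Illustrative conditional flow matching loss (Lipman et al., 2023).

    model - network predicting a vector field v(x_t, t, cond); hypothetical signature
    x1    - batch of target speech features (e.g., spectrogram frames)
    cond  - conditioning input (e.g., transcript embeddings)
    """
    # Sample one time step per example and a Gaussian noise sample x0.
    t = torch.rand(x1.shape[0], device=x1.device)
    t_exp = t.view(-1, *([1] * (x1.dim() - 1)))
    x0 = torch.randn_like(x1)

    # Optimal-transport conditional path: interpolate between noise and data.
    xt = (1 - (1 - sigma_min) * t_exp) * x0 + t_exp * x1
    # Target vector field for this path.
    ut = x1 - (1 - sigma_min) * x0

    # Regress the predicted vector field onto the target field.
    vt = model(xt, t, cond)
    return torch.mean((vt - ut) ** 2)
```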

We trained Voicebox with more than 50,000 hours of recorded speech and transcripts from public domain audiobooks in English, French, Spanish, German, Polish, and Portuguese. Voicebox is trained to predict a speech segment when given the surrounding speech and the transcript of the segment. Having learned to infill speech from context, the model can then apply this across speech generation tasks, including generating portions in the middle of an audio recording without having to re-create the entire input.
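As a rough illustration of the infilling setup described above, the sketch below masks a contiguous span of speech features so a model can be trained to reconstruct it from the surrounding context (and, in the full setup, the transcript). The masking strategy, feature representation, and function name are assumptions, not Voicebox’s actual pipeline.

```python
import torch

def make_infilling_example(features, mask_ratio_range=(0.1, 0.9)):
    """Build one context-masking training example (illustrative only).

    features: (num_frames, feat_dim) tensor of speech features for one utterance.
    Returns the context with a contiguous span zeroed out, plus a boolean mask
    marking the frames the model must predict from context and transcript.
    """
    num_frames = features.shape[0]
    # Choose a random contiguous span to hide from the model.
    ratio = torch.empty(1).uniform_(*mask_ratio_range).item()
    span_len = max(1, int(num_frames * ratio))
    start = torch.randint(0, max(1, num_frames - span_len + 1), (1,)).item()

    mask = torch.zeros(num_frames, dtype=torch.bool)
    mask[start:start + span_len] = True

    # The model sees the surrounding speech (masked frames zeroed) and is
    # trained to reconstruct the hidden span.
    context = features.clone()
    context[mask] = 0.0
    return context, mask
```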

This versatility enables Voicebox to perform well across a variety of tasks, including:


In-context text-to-speech synthesis: Using an input audio sample just two seconds in length, Voicebox can match the sample’s audio style and use it for text-to-speech generation. Future projects could build on this capability by bringing speech to people who are unable to speak, or by allowing people to customize the voices used by nonplayer characters and virtual assistants.

Cross-lingual style transfer: Given a sample of speech and a passage of text in English, French, German, Spanish, Polish, or Portuguese, Voicebox can produce a reading of the text in that language. This capability is exciting because in the future it could be used to help people communicate in a natural, authentic way — even if they don’t speak the same languages.

Speech denoising and editing: Voicebox’s in-context learning makes it good at generating speech to seamlessly edit segments within audio recordings. It can resynthesize a portion of speech corrupted by short-duration noise, or replace misspoken words, without having to rerecord the entire speech. A person could identify which raw segment of the speech is corrupted by noise (like a dog barking), crop it, and instruct the model to regenerate that segment, as sketched after this list. This capability could one day be used to make cleaning up and editing audio as easy as popular image-editing tools have made adjusting photos.

Diverse speech sampling: Having learned from diverse in-the-wild data, Voicebox can generate speech that is more representative of how people talk in the real world and across the six languages listed above. In the future, this capability could be used to generate synthetic data to help better train a speech assistant model. Our results show that speech recognition models trained on Voicebox-generated synthetic speech perform almost as well as models trained on real speech, with 1 percent error rate degradation as opposed to 45 to 70 percent degradation with synthetic speech from previous text-to-speech models.
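To make the denoising and editing workflow above concrete, here is a hypothetical sketch of how a masked-infilling model could regenerate a noise-corrupted span. The `model.infill` and `vocoder` interfaces are invented for illustration, since Voicebox’s code is not publicly available.

```python
def denoise_segment(model, vocoder, audio_features, transcript, noisy_span):
    """Regenerate a corrupted span of speech (illustrative workflow only).

    noisy_span: (start_frame, end_frame) of the segment corrupted by noise.
    """
    start, end = noisy_span

    # Mask only the corrupted frames; the clean surrounding speech is kept
    # as acoustic context so the regenerated span matches the speaker's style.
    context = audio_features.clone()
    context[start:end] = 0.0

    # Ask the model to infill the masked span, conditioned on the transcript.
    regenerated = model.infill(context=context, transcript=transcript,
                               mask=(start, end))

    # Splice the regenerated frames back and convert the features to audio.
    edited = audio_features.clone()
    edited[start:end] = regenerated[start:end]
    return vocoder(edited)
```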

Sharing generative AI research responsibly

As the first versatile, efficient model that successfully performs task generalization, we believe Voicebox could usher in a new era of generative AI for speech. As with other powerful new AI innovations, we recognize that this technology brings the potential for misuse and unintended harm. In our paper, we detail how we built a highly effective classifier that can distinguish between authentic speech and audio generated with Voicebox to mitigate these possible future risks. We believe it is important to be open about our work so the research community can build on it and to continue the important conversations we’re having about how to build AI responsibly, which is why we are sharing our approach and results in a research paper.

Voicebox represents an important step forward in generative AI research. Other scalable generative AI models with task generalization capabilities have sparked excitement about potential applications in text, image, and video generation. We hope to see a similar impact for speech in the future. We look forward to continuing our exploration in the audio domain and seeing how other researchers build on our work.


This blog post was made possible by the work of Matt Le, Apoorv Vyas, Bowen Shi, Brian Karrer, Leda Sari, Rashel Moritz, Mary Williamson, Vimal Manohar, Yossi Adi, Jay Mahadeokar, and Wei-Ning Hsu.
