Wav2vec: State-of-the-art speech recognition through self-supervision

September 19, 2019

Written by Alexei Baevski, Siddharth Shah, Christian Fuegen, Michael Auli


  • We’re releasing our code for wav2vec, an algorithm that uses raw, unlabeled audio to train automatic speech recognition (ASR) models.

  • This self-supervised approach beats traditional ASR systems that rely solely on transcribed audio, including a 22 percent relative reduction in word error rate compared with Deep Speech 2, while using two orders of magnitude less labeled data.

  • Wav2vec trains models to distinguish original speech examples from modified versions, repeating this task hundreds of times for each second of audio and requiring the model to predict the correct audio milliseconds into the future.

  • Reducing the need for manually annotated data is important for developing systems that understand non-English languages, particularly those with limited existing training sets of transcribed speech. Wav2vec is also part of our ongoing commitment to self-supervised training, which could accelerate the development of AI systems across the field.

With thousands of languages spoken around the world, we need powerful, versatile AI speech recognition systems that work effectively for everyone, in whatever language they happen to speak. We’re developing ways to build such systems more efficiently, and across more languages, including those with fewer existing resources for training AI. Today we are sharing the code and technical details related to wav2vec, a new, self-supervised approach to automatic speech recognition. Our algorithm trains models by making them pick between original speech examples and modified versions, and repeating this task hundreds of times per second of audio.

This approach achieved the best published result to date on the popular WSJ benchmark while using two orders of magnitude less labeled training data than a comparable system. The algorithm works with existing ASR systems and uses raw audio as training data, without the need for written transcriptions, demonstrating that self-supervision can make even high-performing speech recognition models more effective. For example, our wav2vec-based system demonstrated a 22 percent relative error reduction over Deep Speech 2, the previous best character-based system in the literature today.

Wav2vec trains models by making them pick between original 10-millisecond audio clips and distractor clips swapped in from elsewhere in the same example. Models must also predict the correct audio clips further into the future, increasing the difficulty and utility of the task for training.

Wav2vec represents a step forward for ASR systems, and it’s a promising direction for recognizing speech in languages that do not have extensive data sets for training AI systems. But it’s also part of our long-term vision for self-supervised training, an approach that takes advantage of unlabeled training examples and enables us to move beyond the comparatively limited number of data sets that have been gathered and annotated specifically for training AI systems.

Turning raw audio into self-supervised representations

ASR systems are typically trained on transcribed speech data, with audio sequences that come paired with corresponding text. But these examples require labeling vast amounts of audio data, a time- and resource-intensive process that slows the creation and improvement of AI-based speech systems. And while researchers currently have access to thousands of hours of publicly available speech examples for English, less than half of the people on Facebook speak English, and ASR training sets for other languages are often limited or nonexistent. We want to not only help communities everywhere benefit from the best speech recognition technology but also detect harmful content in any language. Exploring self-supervision for ASR development is an important part of achieving those goals.

Although self-supervision has shown promise in natural language processing (NLP) tasks — including RoBERTa, Facebook AI’s optimized pretraining method that recently topped the leaderboard for a major NLP benchmark — wav2vec applies the approach specifically to speech. Our algorithm does not require transcriptions, and our model learns from unlabeled audio data.

Most current ASR models train on the log-mel filter bank features of speech data, meaning audio that’s been processed to make vocal features stand out. Our approach instead turns raw speech examples into a representation — specifically, a code — that can be fed into an existing ASR system. Using wav2vec’s representations as inputs enables the algorithm to work with a wide variety of existing speech recognition models, making unlabeled audio data more widely useful for speech-related AI research.

One of the primary challenges in building wav2vec was dealing with the continuous nature of speech data, which makes it difficult to directly predict the data. We addressed this issue by using a pretraining regime inspired in part by the popular NLP algorithm word2vec. This algorithm learns representations by training a model to distinguish between the true data and a set of distractor samples.
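The noise-contrastive idea can be sketched in a few lines: the model scores the true sample against each distractor, and the loss rewards high scores for the real data and low scores for the noise. The following is a minimal pure-Python illustration of that objective, not the released implementation; the function and variable names are ours.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def contrastive_loss(context, true_sample, distractors):
    """Noise-contrastive loss: push the dot product with the true
    sample up and the dot product with each distractor down."""
    dot = lambda a, b: sum(x * y for x, y in zip(a, b))
    loss = -math.log(sigmoid(dot(context, true_sample)))
    for d in distractors:
        loss += -math.log(sigmoid(-dot(context, d)))
    return loss

rng = random.Random(0)
ctx = [0.5, -0.2, 0.1]            # context representation (illustrative)
true_f = [0.6, -0.1, 0.2]         # the true future sample
neg = [[rng.gauss(0, 1) for _ in range(3)] for _ in range(5)]  # distractors
print(contrastive_loss(ctx, true_f, neg))
```

Minimizing this loss drives the context representation to align with the true sample and away from the distractors, which is exactly the discrimination task described above.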

For wav2vec, we created an architecture consisting of two multilayer convolutional neural networks stacked on top of each other. The encoder network maps raw audio input to a representation, where each vector covers about 30 milliseconds (ms) of speech. The context network uses those vectors to generate its own representations, which cover a larger span of up to a second.

The model then uses these representations to solve a self-supervised prediction task. Within each 10-second audio clip that the model is trained on, wav2vec generates a number of distractor examples, which swap out 10 ms of the original audio with sections from elsewhere in the clip. The model must then determine which version is correct. And this selection process is repeated multiple times for each 10-second training clip, essentially quizzing the model to discern accurate speech sounds from distractor samples hundreds of times per second.
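Sampling distractors from elsewhere in the same clip can be sketched as follows; this is a hedged illustration of the sampling step, with made-up function names and frame counts rather than the actual training code.

```python
import random

def sample_distractors(num_frames, target, num_distractors, rng=random):
    """Pick distractor frame indices from elsewhere in the same clip,
    never reusing the target frame itself."""
    candidates = [i for i in range(num_frames) if i != target]
    return rng.sample(candidates, num_distractors)

rng = random.Random(0)
# A 10-second clip with one frame per 10 ms has 1,000 frames.
negatives = sample_distractors(1000, target=42, num_distractors=10, rng=rng)
print(negatives)
```

Because every frame of the clip can serve as a target, a single 10-second example yields on the order of a thousand discrimination problems, each with its own fresh set of distractors.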

To make our approach even more effective for training, we ask the model to solve increasingly difficult versions of this task by predicting changes in audio that appear immediately after the unaltered portion of each clip, as well as changes that are further in the future, in time steps of 10 ms. For example, instead of picking out only the “c” sound from distractors at the beginning of the word “cat,” the model might also have to predict subsequent audio, such as the “a” and “t” sounds.
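The multi-step version of the task can be enumerated explicitly: for each position in the clip and each step size k up to some maximum, the model must identify the true frame k steps ahead. The sketch below is illustrative only; the function name and the toy frame counts are ours.

```python
def future_prediction_pairs(num_frames, max_steps):
    """For each time t and each step k = 1..max_steps, the model must
    pick the true frame t + k over distractors; larger k is harder,
    since the model must look further into the future."""
    return [(t, t + k)
            for k in range(1, max_steps + 1)
            for t in range(num_frames - k)]

# A short clip with 5 frames and predictions up to 3 steps (30 ms) ahead.
pairs = future_prediction_pairs(5, 3)
print(len(pairs))  # 4 + 3 + 2 = 9 prediction tasks
```

Each pair (t, t + k) is one contrastive problem: the representation at time t must score the true frame at t + k above its distractors.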

The purpose of this task is essentially to train models to have an improved understanding of the waveforms associated with speech. This waveform-level grasp of the flow of spoken language boosts the overall accuracy of the ASR system wav2vec is incorporated into.

Wav2vec’s prediction task is also the basis of the algorithm’s self-supervision. By automatically generating incorrect versions of speech examples to test the system on, and evaluating the model’s performance in identifying the right version, there’s no need to manually annotate training data. This process allows wav2vec to turn unlabeled audio clips into thousands of opportunities to train a model.

Evaluating wav2vec using research benchmarks

We trained wav2vec on a little less than 1,000 hours of unlabeled speech examples from the LibriSpeech data set, a corpus that draws from public domain audiobooks. Next, we trained a speech recognition model on roughly 81 hours of labeled speech from the WSJ corpus — a collection of Wall Street Journal articles read aloud — with representations that wav2vec generated. These examples from WSJ were the only supervised data used in our work, with all other training data consisting of unlabeled audio.

The result of this training process was a significant improvement over comparable ASR systems that relied entirely on supervised data. Deep Speech 2, for example, is widely considered the best character-based ASR system, using 12,000 hours of transcribed data to achieve a word error rate (WER) of 3.1 percent. Our wav2vec-based system is also a character-based system and demonstrated a 2.43 percent WER — a 22 percent relative decrease in the error rate — while using 150x less transcribed data.
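The 22 percent and 150x figures follow directly from the numbers above, and the arithmetic is easy to check:

```python
def relative_reduction(baseline, new):
    """Relative reduction of an error rate, as a fraction of the baseline."""
    return (baseline - new) / baseline

deep_speech_2_wer = 3.1   # percent WER, trained on 12,000 hours of transcribed data
wav2vec_wer = 2.43        # percent WER, trained on roughly 81 hours of transcribed data
print(round(relative_reduction(deep_speech_2_wer, wav2vec_wer) * 100, 1))  # ~21.6, reported as 22 percent
print(round(12000 / 81))  # ~148, reported as 150x less transcribed data
```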

And because the exact kind and amount of training data that we used differed from what Deep Speech 2 and other systems were trained on, we also compared our wav2vec-trained system to a baseline model, which didn’t incorporate pretrained representations. We found that wav2vec provided a 30 percent relative improvement in WER compared with that baseline. And in other experiments, our results show that wav2vec can lead to better performance than pretraining on the labeled version of LibriSpeech.

Word error rate (WER) for our baseline and wav2vec automatic speech recognition models, as well as Deep Speech 2 (trained on 12,000 hours of speech) and a supervised transfer learning-based model that uses the labeled version of LibriSpeech. Wav2vec achieves better performance despite not using LibriSpeech labels.

These results show that wav2vec can improve supervised ASR systems by effectively leveraging unlabeled data. But as we noted when we first discussed wav2vec earlier this year, this work also suggests the potential for self-supervised techniques to expand ASR capabilities to low-resource languages, meaning those with limited data sets of transcribed, annotated speech examples. Improving our ability to automatically recognize speech in these languages is important for a range of features across our platforms, including generating captions for videos, catching policy violations, and ensuring that research in this field becomes less English-centric and more inclusive.

The future of wav2vec, and self-supervised training

We plan to use wav2vec to provide better audio data representations for a range of speech-related applications, such as keyword spotting and acoustic event detection. This could potentially help us improve our use of AI to proactively find and flag harmful content, and keep people safe on our platforms.

But the broader implications for this work are related to the pursuit of self-supervised training techniques by teams at Facebook AI as well as in the wider AI community. To assist with this collective effort, we’ve added wav2vec as a simple, lightweight extension to fairseq, our open source sequence modeling toolkit. Researchers can download fairseq and start training their own wav2vec-based models with a few simple commands. Self-supervision is accelerating development not only in speech but also across virtually every domain in the field. The quickest way to transition toward a future in which unlabeled training data is the rule, rather than the exception, will be through ongoing open, collaborative science.

This research is a collaboration between many people at Facebook AI, including Michael Auli, Alexei Baevski, Ronan Collobert, Christian Fuegen, Jeff Glick, Jacob Kahn, Jay Mahadeokar, Nathan Ng, Steffen Schneider, Mike Seltzer, Siddharth Shah, Yongqiang Wang and Qiantong Xu.

Written by

Alexei Baevski

Research Engineer, Facebook AI

Siddharth Shah

Software Engineer, Facebook AI

Christian Fuegen

Software Engineering Manager, Facebook AI

Michael Auli

Research Scientist, Facebook AI