Wav2vec 2.0: Learning the structure of speech from raw audio

September 24, 2020

We are releasing pretrained models and code for wav2vec 2.0, the successor to wav2vec. This new model learns basic speech units used to tackle a self-supervised task. The model is trained to predict the correct speech unit for masked parts of the audio, while at the same time learning what the speech units should be.
With just 10 minutes of transcribed speech and 53K hours of unlabeled speech, wav2vec 2.0 enables speech recognition models with a word error rate (WER) of 8.6 percent on noisy speech and 5.2 percent on clean speech on the standard LibriSpeech benchmark.
This opens the door for speech recognition models in many more languages, dialects, and domains that previously required much more transcribed audio data to provide acceptable accuracy.
We also developed a cross-lingual approach, dubbed XLSR, that can learn speech units common to several languages. This approach helps even when we have only small amounts of unlabeled speech for a language, since languages for which we have little data can benefit from languages for which more data is available.

There are thousands of languages spoken around the world, many with several different dialects, which presents a huge challenge for building high-quality speech recognition technology. It’s simply not feasible to obtain resources for each dialect and every language across the many possible domains (read speech, telephone speech, etc.). Our new model, wav2vec 2.0, uses self-supervision to push the boundaries by learning from unlabeled training data to enable speech recognition systems for many more languages, dialects, and domains. With just one hour of labeled training data, wav2vec 2.0 outperforms the previous state of the art on the 100-hour subset of the LibriSpeech benchmark — using 100 times less labeled data.

Similar to Bidirectional Encoder Representations from Transformers (BERT), our model is trained by predicting speech units for masked parts of the audio. A major difference is that speech audio is a continuous signal that captures many aspects of the recording, with no clear segmentation into words or other units. Wav2vec 2.0 tackles this issue by learning basic units that are 25ms long to enable learning of high-level contextualized representations. These units are then used to describe many different speech audio recordings and make wav2vec 2.0 more robust. This enables us to build speech recognition systems that can outperform the best semi-supervised methods, even with 100x less labeled training data.
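To make the 25ms figure concrete, the short sketch below works out how much audio a single latent representation covers. The kernel widths, strides, and 16 kHz sample rate are taken from the wav2vec 2.0 paper rather than from this post, so treat them as assumptions here; the function names are hypothetical and chosen only for illustration.

```python
# Convolution settings as reported in the wav2vec 2.0 paper (assumption
# relative to this post): seven layers, no padding, 16 kHz input audio.
KERNELS = (10, 3, 3, 3, 3, 2, 2)
STRIDES = (5, 2, 2, 2, 2, 2, 2)
SAMPLE_RATE = 16_000


def receptive_field_and_hop(kernels, strides):
    """Return (samples seen by one output frame, samples between frames)."""
    field, hop = 1, 1
    for kernel, stride in zip(kernels, strides):
        field += (kernel - 1) * hop  # each layer widens the window
        hop *= stride                # and coarsens the time step
    return field, hop


def num_frames(num_samples, kernels, strides):
    """Number of latent frames produced for a waveform of num_samples."""
    length = num_samples
    for kernel, stride in zip(kernels, strides):
        length = (length - kernel) // stride + 1
    return length


field, hop = receptive_field_and_hop(KERNELS, STRIDES)
print(field, 1000 * field / SAMPLE_RATE)  # 400 samples, 25.0 ms per frame
print(hop, 1000 * hop / SAMPLE_RATE)      # 320 samples, 20.0 ms between frames
print(num_frames(SAMPLE_RATE, KERNELS, STRIDES))  # 49 frames per second
```

Under these settings each representation sees 25ms of audio and consecutive representations start 20ms apart, so neighboring frames overlap slightly.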

Wav2vec 2.0 is part of our vision for machine learning models that rely less on labeled data, thanks to self-supervised learning. Self-supervision has helped us advance image classification, video understanding, and our content understanding systems. We hope that the algorithm will enable improved speech technology for many more languages, dialects, and domains, and lead to improvements for existing systems.

Learning discrete latent speech units

Traditional speech recognition models are primarily trained on annotated speech audio with transcriptions. Good systems require large amounts of annotated data, which is only available for a small number of languages. Self-supervision provides a way to leverage unannotated data to build better systems.

Other self-supervised approaches for speech try to reconstruct the audio signal, which requires the model to capture every aspect of the speech, including recording environment, channel noise, and speaker traits. Another common approach is to train the model by asking it to predict what the speaker said next by contrasting several options.

Our approach learns a set of speech units, which are shorter than phonemes, to describe the speech audio sequence. Since this set is finite, the model cannot represent all variations, such as background noise. Instead, the units encourage the model to focus on the most important factors to represent the speech audio. In our experiments, we find that this works better than alternative approaches on the LibriSpeech benchmark.
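The effect of a finite inventory can be illustrated with a toy quantizer. The sketch below is a simplification: it picks the nearest entry of a small hand-written codebook, whereas the paper describes a learned product quantizer trained with a Gumbel softmax. The codebook values and the name quantize are hypothetical.

```python
def quantize(frame, codebook):
    """Return (index, unit) of the codebook entry closest to frame.

    Nearest-neighbor lookup is a stand-in used for illustration; it is not
    the learned quantizer described in the wav2vec 2.0 paper.
    """
    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    index = min(range(len(codebook)),
                key=lambda i: squared_distance(frame, codebook[i]))
    return index, codebook[index]


# A made-up inventory of three 2-dimensional "speech units".
codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]

clean_frame = (0.9, 0.1)    # a latent frame from a clean recording
noisy_frame = (1.1, -0.05)  # the same sound with a small perturbation

clean_index, _ = quantize(clean_frame, codebook)
noisy_index, _ = quantize(noisy_frame, codebook)
print(clean_index, noisy_index)  # both frames map to unit 1
```

Because both frames collapse onto the same unit, the small perturbation is discarded, which is the sense in which a finite set of units cannot represent every variation in the audio.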

The model first processes the raw waveform of the speech audio with a multilayer convolutional neural network to get latent audio representations of 25ms each. These representations are then fed into a quantizer as well as a transformer. The quantizer chooses a speech unit for the latent audio representation from an inventory of learned units. About half the audio representations are masked before being fed into the transformer. The transformer adds information from the entire audio sequence. Finally, the output of the transformer is used to solve a contrastive task. This task requires the model to identify the correct quantized speech units for the masked positions.
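The masking and contrastive steps above can be sketched in a few lines. This is a minimal illustration, not the fairseq implementation: the span-masking parameters (start probability 0.065, span length 10) and the temperature 0.1 are the values reported in the paper, the vectors are toy data, and all function names are hypothetical.

```python
import math
import random


def sample_span_mask(num_frames, start_prob=0.065, span=10, rng=random):
    """Mark spans of frames as masked; spans may overlap."""
    mask = [False] * num_frames
    for start in range(num_frames):
        if rng.random() < start_prob:
            for t in range(start, min(start + span, num_frames)):
                mask[t] = True
    return mask


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def contrastive_loss(context, target, distractors, temperature=0.1):
    """Negative log-probability of picking the true unit among candidates.

    context:     transformer output at one masked position
    target:      the quantized unit that was actually at that position
    distractors: quantized units taken from other masked positions
    """
    candidates = [target] + list(distractors)
    logits = [cosine(context, q) / temperature for q in candidates]
    top = max(logits)  # subtract the max for a numerically stable log-sum-exp
    log_norm = top + math.log(sum(math.exp(v - top) for v in logits))
    return log_norm - logits[0]


mask = sample_span_mask(5000, rng=random.Random(0))
print(sum(mask) / len(mask))  # roughly half of the frames are masked

distractors = [(0.0, 1.0), (-1.0, 0.0)]
good = contrastive_loss((1.0, 0.0), (1.0, 0.0), distractors)
bad = contrastive_loss((0.0, 1.0), (1.0, 0.0), distractors)
print(good < bad)  # the loss is low when the context matches the true unit
```

The full training objective described in the paper adds a diversity term that encourages the model to use the whole inventory of units; that term is omitted here.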

With cross-lingual training, wav2vec 2.0 learns speech units that are used in multiple languages.

Cross-lingual training

For some languages, even unannotated data is limited. To address this issue, we explore the idea of cross-lingual training. The idea is to pretrain a single model on multiple languages at the same time, which results in better representations than pretraining on a single language. This has worked particularly well for natural language processing with XLM-R. Performance for low-resource languages can improve significantly with this method, since they benefit from related languages.

With wav2vec 2.0, we can also learn speech units that are used across languages. We find that some units are used for only a particular language, whereas others are used in similar languages and sometimes even in languages that aren’t very similar.

Performance on public speech benchmarks

We trained wav2vec 2.0 on 960 hours of unannotated speech data from the LibriSpeech benchmark, which contains public audiobooks. After pretraining, we fine-tuned the model on 100 hours, 1 hour, or just 10 minutes of annotated data from Libri-light to perform speech recognition. With the same 100 hours of annotated data, wav2vec 2.0 shows a large improvement over the previous state of the art (Noisy Student training). Moreover, it still improves on the previous best result when using just one hour of annotated data, which is 100x less.

What happens if we increase the amount of unannotated data? To answer this question, we trained on 53K hours of unannotated data from the LibriVox dataset (a large collection of public audiobooks) and fine-tuned with only 10 minutes of labeled data. The result was a model that still achieved a WER of 8.6 percent. This demonstrates that wav2vec 2.0 can enable speech recognition models for settings where there is very little labeled training data.

WER for Noisy Student self-training with 100 hours of labeled data, compared with wav2vec 2.0 with 100 hours, 1 hour, and only 10 minutes of labeled data. All models use the remainder of the LibriSpeech corpus (total 960 hours) as unannotated data, except for the last result, which uses 53K hours from LibriVox.
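For readers unfamiliar with the metric, WER is the number of word substitutions, deletions, and insertions needed to turn the system output into the reference transcript, divided by the number of reference words. The sketch below is a plain reimplementation for illustration, with made-up sentences; it is not the scoring script used for the results reported here.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance between two token sequences."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_token in enumerate(reference, start=1):
        current = [i]
        for j, hyp_token in enumerate(hypothesis, start=1):
            current.append(min(
                previous[j] + 1,                            # deletion
                current[j - 1] + 1,                         # insertion
                previous[j - 1] + (ref_token != hyp_token)  # substitution
            ))
        previous = current
    return previous[-1]


def word_error_rate(references, hypotheses):
    """Total word errors divided by total reference words."""
    errors = sum(edit_distance(ref.split(), hyp.split())
                 for ref, hyp in zip(references, hypotheses))
    words = sum(len(ref.split()) for ref in references)
    return errors / words


references = ["the cat sat on the mat"]
hypotheses = ["the cat sat on mat"]  # one word dropped
wer = word_error_rate(references, hypotheses)
print(f"{wer:.3f}")  # 0.167, i.e. one error in six reference words
```

A WER of 8.6 percent therefore corresponds to roughly 8.6 word errors per 100 reference words. Applying the same function to phoneme sequences instead of words gives a phoneme error rate (PER).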

To evaluate cross-linguality, we trained wav2vec 2.0 on unannotated speech audio of 10 languages from the Common Voice benchmark. The resulting approach, called XLSR, shows that cross-lingual training dramatically improves performance on low-resource languages, compared with training only on a single language. We also measured how often the learned speech units are used in each language and visualized the result in a 2D plot. This illustration shows that related languages tend to use similar units, which confirms that our model learns cross-lingual units.

Results on the Common Voice benchmark in terms of phoneme error rate (PER), comparing training on each language individually (XLSR-Mono) with training on all 10 languages simultaneously (XLSR-10).

Visualization of how the learned units are used across languages. Graph shows a 2D PCA plot of how units are used in each language. Languages closer to each other, like English and German or Basque and Catalan, tend to use similar units.
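The unit-usage comparison behind this kind of plot can be approximated with a small sketch. It builds a usage histogram per language and compares languages with cosine similarity; the PCA projection used for the figure is left out to keep the example dependency-free. The language names and unit sequences are made-up toy data, and the function names are hypothetical.

```python
import math
from collections import Counter


def usage_distribution(unit_ids, num_units):
    """Fraction of frames assigned to each unit in one language."""
    counts = Counter(unit_ids)
    total = len(unit_ids)
    return [counts.get(unit, 0) / total for unit in range(num_units)]


def cosine_similarity(p, q):
    dot = sum(x * y for x, y in zip(p, q))
    norm_p = math.sqrt(sum(x * x for x in p))
    norm_q = math.sqrt(sum(y * y for y in q))
    return dot / (norm_p * norm_q)


NUM_UNITS = 6
# Toy sequences of quantizer outputs for three hypothetical languages.
units_by_language = {
    "lang_a": [0, 0, 1, 1, 2, 3],
    "lang_b": [0, 1, 1, 2, 2, 3],  # shares most units with lang_a
    "lang_c": [4, 4, 5, 5, 5, 3],  # mostly uses different units
}

usage = {name: usage_distribution(ids, NUM_UNITS)
         for name, ids in units_by_language.items()}

related = cosine_similarity(usage["lang_a"], usage["lang_b"])
unrelated = cosine_similarity(usage["lang_a"], usage["lang_c"])
print(related > unrelated)  # languages that share units score higher
```

Projecting the per-language usage vectors to two dimensions, as in the figure, would place lang_a and lang_b near each other for the same reason.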

The future of wav2vec

Wav2vec 2.0 enables us to build better speech recognition systems for many more languages and domains with much less annotated data. We’ve open-sourced the code and pretrained models to enable other researchers to do exactly this. The code is part of fairseq, Facebook AI Research’s sequence modeling toolkit, which provides implementations for many of our research papers. A few commands enable training and fine-tuning of the provided models.

We are excited about the potential of powerful speech representations for other applications, such as speech translation, and models involving other modalities, such as vision. We are also adapting our wav2vec 2.0 implementation to run on Cloud TPUs — stay tuned for more information on that release in the future.