Talking is a natural way for people to interact with each other and, as voice technology has evolved, with their devices. In the future, it will also be a way to interact with the metaverse, where virtual experiences blend with our physical world.
Yet speech technology is only available for a fraction of the thousands of languages spoken around the world. Few-shot learning, based on limited labeled data, and even unsupervised speech recognition are helpful, but the success of these methods depends on the quality of the self-supervised model.
Today, we are releasing XLS-R, a new self-supervised model for a variety of speech tasks. XLS-R substantially improves upon previous multilingual models by training on nearly 10 times more public data in more than twice as many languages.
To accomplish our goal of a single model that’s capable of understanding speech in many different languages, we fine-tuned XLS-R to perform speech recognition, speech translation, and language identification, setting a new state of the art on a diverse set of benchmarks: BABEL, CommonVoice, and VoxPopuli for speech recognition; CoVoST-2 for foreign-to-English translation; and VoxLingua107 for language identification.
To make this advancement as broadly accessible as possible, we’ve released these models with Hugging Face and made them available on our fairseq GitHub repository.
Trained on more than 436,000 hours of publicly available speech recordings, XLS-R is based on wav2vec 2.0, our approach to self-supervised learning of speech representations. That’s nearly 10 times more hours of speech than XLSR-53, the best previous model, which we released last year. Using speech data from sources ranging from parliamentary proceedings to audiobooks, we’ve expanded coverage to 128 languages, nearly two and a half times as many as its predecessor covered.
We found that our largest model, containing over 2 billion parameters, performs much better than smaller models, since more parameters are critical to adequately represent the many languages in our data set. We also found that increasing model size improved performance much more for multilingual pretraining than for pretraining on a single language.
We evaluated XLS-R on four major multilingual speech recognition benchmarks, where it outperformed prior work on most of the 37 languages tested; specifically, we tried it on five languages of BABEL, 10 languages of CommonVoice, eight languages of MLS, and the 14 languages of VoxPopuli.
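As a quick sanity check on the count above, the per-benchmark figures can be tallied in a few lines of Python. The variable names are illustrative only, and the total matches 37 only if each benchmark-language pair is counted separately; the post does not say whether any language appears in more than one benchmark.

```python
# Number of languages evaluated per benchmark, as listed in the post.
languages_per_benchmark = {
    "BABEL": 5,
    "CommonVoice": 10,
    "MLS": 8,
    "VoxPopuli": 14,
}

# Sum the per-benchmark counts (treats every benchmark-language pair as distinct).
total_languages = sum(languages_per_benchmark.values())
print(total_languages)  # 37
```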
We also evaluated our model for speech translation, where we directly translated audio recordings into another language. Since we’re interested in models that can perform multiple tasks, we simultaneously fine-tuned XLS-R on several different translation directions of the CoVoST-2 benchmark. The result is a single model that can translate between English and up to 21 other languages.
We saw especially strong improvements when we used XLS-R to encode languages other than English, which is where multilingual speech representations matter most. Our model leads to very large improvements on low-resource language directions, such as Indonesian-to-English translation, where accuracy in terms of BLEU doubles on average. This is a major step forward in improving translation of spoken language. An increase in the BLEU metric means automatic translations have more overlap with the translations produced by a human tackling the same task.
XLS-R demonstrates that scaling cross-lingual pretraining can further improve performance for low-resource languages. It improves performance for speech recognition and more than doubles the accuracy on foreign-to-English speech translation. XLS-R is an important step toward a single model that can understand speech in many different languages, and it is the largest effort we know of to leverage public data for multilingual pretraining.
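To make the idea of overlap behind BLEU concrete, here is a minimal sketch of a simplified, single-reference, sentence-level BLEU score. It is not the scoring code used for the CoVoST-2 results: it assumes whitespace tokenization, applies no smoothing, and returns a value between 0 and 1, whereas BLEU is usually reported at the corpus level on a 0 to 100 scale. The function names `ngrams` and `simple_bleu` and the example sentences are hypothetical.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_bleu(hypothesis, reference, max_n=4):
    """Simplified single-reference, sentence-level BLEU in [0, 1]."""
    hyp = hypothesis.lower().split()
    ref = reference.lower().split()
    if not hyp:
        return 0.0

    log_precision_sum = 0.0
    for n in range(1, max_n + 1):
        hyp_counts = ngrams(hyp, n)
        ref_counts = ngrams(ref, n)
        # Clip each hypothesis n-gram count by its count in the reference.
        overlap = sum(min(count, ref_counts[gram]) for gram, count in hyp_counts.items())
        total = sum(hyp_counts.values())
        if overlap == 0 or total == 0:
            return 0.0  # no smoothing: one n-gram order without matches zeroes the score
        log_precision_sum += math.log(overlap / total)

    # The brevity penalty lowers the score of hypotheses shorter than the reference.
    if len(hyp) >= len(ref):
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1 - len(ref) / len(hyp))

    # Geometric mean of the n-gram precisions, scaled by the brevity penalty.
    return brevity_penalty * math.exp(log_precision_sum / max_n)


reference = "the cat is sitting on the mat"
exact_score = simple_bleu("the cat is sitting on the mat", reference)
close_score = simple_bleu("the cat is sitting on a mat", reference)
far_score = simple_bleu("a dog sat there", reference)
print(exact_score, close_score, far_score)
```

The exact match gets the highest score, the near match a lower one, and the unrelated sentence scores zero, which is the sense in which a higher BLEU indicates more overlap with a human translation.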
We trust this direction will enable machine learning applications that better understand all human speech and catalyze further research to make speech technology more accessible across the globe, especially among underserved populations. We will continue to improve our algorithms by developing new ways to learn from less supervision and scale our approach to the more than 7,000 languages around the world.
If you are interested in using our model, then please take a look at Hugging Face’s excellent tutorial on how to fine-tune our models.
This blog post was made possible by the work of Alexei Baevski, Alexis Conneau, Andros Tjandra, Arun Babu, Changhan Wang, Juan Pino, Kritika Singh, Kushal Lakhotia, Michael Auli, Naman Goyal, Qiantong Xu, and Yatharth Saraf (in alphabetical order).