August 2, 2021
Speech recognition and translation technologies are being widely adopted to enable human-to-computer interaction, real-time human-to-human communication, and access to multimedia content without language barriers. These technologies are currently available in only a handful of widely spoken languages, however, and there are approximately 6,500 languages spoken around the globe. To be more useful, these AI systems need to work for many more.
To accelerate the creation of new natural language processing (NLP) systems for use in more regions of the world, Facebook AI is releasing VoxPopuli, a large-scale multilingual corpus of audio recordings that provides 400,000 hours of unlabeled speech data in 23 languages. It is the largest open data set released thus far for self-supervised learning and semisupervised learning. VoxPopuli also contains 1,800 hours of transcribed speeches in 15 languages, along with their oral interpretations into 15 target languages, for a total of 17,300 hours. The interpretations are aligned with the source speech at the utterance level (an utterance can be a single word, a sentence, or a distinct sound such as “umm”).
Reaching the realistic goal of advanced NLP for dozens of languages is a time-intensive project. Recent advances (wav2vec 2.0 and wav2vec-U), which learn from large-scale unlabeled data, have shown promising results in reducing, or even eliminating, the need for labeled data in building these technologies. Previous open speech data sets (such as Libri-light) are limited in size or language coverage, which keeps these techniques from reaching their full power. The AI research community simply needs more language data, and a lot more of it.
VoxPopuli provides 9,000 to 18,000 hours of unlabeled speech per language; previous data sets have had only about 130 hours. Its substantial new body of speech-to-speech translation data will also be an important supplement to the existing speech-to-text translation corpora, such as CoVoST V2.
We collected data in 23 languages from publicly available European Parliament event recordings and built processing pipelines to segment the speech audio by speaker or silence, align the segments with transcripts or translations, and filter out examples with inaccurate transcripts. We provide speech recognition (ASR) baselines for benchmarking, and we validate the versatility of VoxPopuli unlabeled data in semisupervised ASR and speech-to-text translation under challenging out-of-domain settings.
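The segmenting and filtering steps described above can be sketched roughly as follows. This is a minimal illustration, not the actual VoxPopuli pipeline: it assumes silence is detected by thresholding per-frame energy, and that an "inaccurate transcript" is one whose character error rate (CER) against an ASR hypothesis exceeds a cutoff. The function names, the energy threshold, and the 0.2 CER cutoff are all hypothetical choices made for the example.

```python
from typing import List, Sequence, Tuple


def split_on_silence(
    energies: Sequence[float], threshold: float, min_silence: int
) -> List[Tuple[int, int]]:
    """Return (start, end) frame ranges of speech, end exclusive.

    A segment is closed once at least `min_silence` consecutive frames
    fall below `threshold` (an assumed, hypothetical silence criterion).
    """
    segments = []
    start = None        # first frame of the segment being built
    last_voiced = None  # most recent frame at or above the threshold
    for i, energy in enumerate(energies):
        if energy >= threshold:
            if start is None:
                start = i
            last_voiced = i
        elif start is not None and i - last_voiced >= min_silence:
            segments.append((start, last_voiced + 1))
            start = None
    if start is not None:
        segments.append((start, last_voiced + 1))
    return segments


def edit_distance(ref: str, hyp: str) -> int:
    """Character-level Levenshtein distance, computed row by row."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def character_error_rate(ref: str, hyp: str) -> float:
    """Edit distance normalized by the reference transcript length."""
    if not ref:
        return 0.0 if not hyp else 1.0
    return edit_distance(ref, hyp) / len(ref)


def filter_examples(
    examples: Sequence[Tuple[str, str]], max_cer: float = 0.2
) -> List[Tuple[str, str]]:
    """Keep (transcript, asr_hypothesis) pairs whose CER is within max_cer.

    The 0.2 default is an illustrative assumption, not a documented value.
    """
    return [
        (ref, hyp)
        for ref, hyp in examples
        if character_error_rate(ref, hyp) <= max_cer
    ]
```

In this sketch, a pair such as ("hello world", "hello word") survives the filter because the transcript and hypothesis differ by a single character, while a pair whose texts disagree heavily is dropped. A production pipeline would more likely rely on a trained voice activity detector and forced alignment than on a raw energy threshold.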
We showed that the increased amounts of unlabeled data and language coverage in VoxPopuli are very helpful in improving self-supervised models in terms of both quality and robustness. We also found the automatic speech-to-speech alignments to be of high quality when we evaluated them on a speech-translation benchmark.
Enabling advanced speech technologies for many more languages will require not just limited labeled data, but also large-scale unlabeled data sets. VoxPopuli helps to push the limits on these promising research directions by providing larger-scale data in languages other than English, such as Romanian (30 million speakers) and Greek (13.5 million speakers), which have large speaker populations but lack open data (even unlabeled data). VoxPopuli also unblocks open research on direct speech-to-speech translation with a large amount of labeled data. We look forward to seeing how others in the AI research community leverage VoxPopuli to create new NLP systems that work for more people around the world.