A new open benchmark for speech recognition with limited or no supervision

December 20, 2019

What it is:

The largest-ever open source dataset for speech technology, Libri-light is built entirely from public domain audio and optimized for developing automatic speech recognition (ASR) systems using limited or no supervision. While previous speech datasets have typically consisted of human-annotated training examples that are fed to ASR systems with supervised learning objectives, we designed Libri-light to support three training settings that are less reliant on labels. Those approaches include pretraining acoustic models on raw unlabeled data, training with a mix of labeled and unlabeled data, and training from unaligned audio and text. In addition to training and test sets, Libri-light includes metrics and baseline models to help researchers compare different methods for developing ASR systems that require less supervision or none at all.

How it works:

We built Libri-light using more than 60,000 hours of unlabeled English speech from LibriVox, a large repository of public domain audiobooks. To make the data useful for ASR training, we filtered out corrupted and duplicated recordings and added speech activity, speaker, and genre metadata. We also built baseline systems and evaluation metrics on top of the popular LibriSpeech ASR benchmark.

Specifically, we built ASR systems for three training settings (self-supervised, semi-supervised, and distant supervision) and evaluated them against the standard LibriSpeech dev and test sets. Pretraining our self-supervised model on raw audio resulted in accuracy that surpassed the state-of-the-art system in the most recent Zero Resource Speech Challenge. The accuracy of our semi-supervised system, which used a small amount of labeled speech during training, improved as we applied more pretraining, resulting in fewer errors when recognizing phonemes, the basic sound units that distinguish words. For our distant supervision setting, where acoustic models are created using a combination of unaligned text and speech audio with limited labels, we used a process that automatically generates labels for our unannotated dataset. The resulting system’s accuracy is lower than that of fully supervised systems, but its performance shows that more unsupervised pretraining can lower word error rate, suggesting the value of training on large amounts of unannotated data, even for systems that also use annotations.
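
For readers less familiar with the metric mentioned above, word error rate (WER) is conventionally computed as the word-level edit distance between a reference transcript and the system's output, divided by the number of words in the reference. The following is a minimal Python sketch of that standard definition, not the scoring code used for the Libri-light baselines; the function name `word_error_rate` and the example sentences are illustrative assumptions.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Illustrative helper only; splitting on whitespace is an assumption
    and real scoring tools usually normalize text first.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr

    return prev[-1] / len(ref)


# Hypothetical transcripts: one word of six is dropped by the recognizer.
wer = word_error_rate("the cat sat on the mat", "the cat sat on mat")
print(f"WER: {wer:.3f}")
```

In this example a single deleted word out of six reference words gives a WER of 1/6, or about 0.167. Lower is better, and a value of 0 means the hypothesis matches the reference exactly.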

Why it matters:

Libri-light sets a new standard for training ASR systems for languages that lack the large-scale training datasets necessary for traditional fully supervised training methods. These training resources are unavailable for the majority of the world’s 7,000 languages, so Libri-light can potentially help develop or improve ASR for millions of people around the globe. Reliance on full supervision also limits the efficacy of ASR even in high-resource languages, such as English, Mandarin, Spanish, and Arabic, since these tend to include a large number of dialectal variants. Our overall dataset is three orders of magnitude larger than the Zero Resource Speech Challenge datasets generally used for unsupervised speech learning. And because Libri-light uses the standard LibriSpeech as its test set, it’s the first benchmark that enables researchers to make direct comparisons of methods using different degrees of supervision and gauge their performance against the state of the art in supervised ASR. This will help provide the kind of common objectives and milestones that can accelerate the field’s collective progress toward reducing supervision and bringing ASR to more languages around the world.