
Libri-light

Libri-light is a benchmark for training automatic speech recognition (ASR) systems with limited or no supervision. It contains a large dataset of 60K hours of unlabeled English speech from audiobooks, small labeled training sets (10h, 1h, and 10min), plus metrics, trainable baseline models, and pretrained models that use these datasets.

Tasks and Metrics

Traditional supervised methods for ASR have become increasingly dependent on abundant human annotations, which is difficult to scale to the majority of the world’s languages, most of which have few or no linguistic resources. Even high-resource languages like English, Mandarin, and Spanish struggle to cope with the long tail of dialectal variants. The aim of Libri-light is to help researchers compare and contrast ideas and results for limited-supervision methods on the same datasets and metrics.

ASR with limited supervision has been explored in several directions. In Libri-light, we support three popular research directions (unsupervised, semi-supervised, and distantly supervised settings) and present metrics adapted to each of them.

  • For the unsupervised setting, the aim is to extract, from raw unlabeled speech, representations (discrete or continuous) that encode the phonetic content while ignoring irrelevant information (channel, speaker, etc.). We evaluate these representations with the ABX error, a discrimination metric inspired by psycholinguistics and used in the Zero Resource Speech Challenge series (a minimal sketch of its computation follows this list).

  • For the semi-supervised setting, the task is to use a limited amount of labeled speech (a few hours) in addition to a large set of unlabeled data. Because the labeled subset is so small, the only units that can be learned are phonemes and characters. We use phoneme error rate (PER) and character error rate (CER) as metrics.

  • For the distant supervision setting, in addition to the above resources, large amounts of unaligned text can be leveraged, for instance, to train a language model that will help decode speech at the word level. We use word error rate (WER) for the evaluation.
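To make the ABX metric concrete, below is a minimal NumPy sketch of its logic under simplifying assumptions (cosine frame distances, DTW alignment, a single pair of phonetic categories). This is an illustration only, not the official evaluation code shipped with Libri-light.

    # ABX sketch: each token is a (T, D) array of frame features.
    import numpy as np
    from itertools import product

    def cosine_dist(u, v):
        return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-8)

    def dtw_distance(x, y):
        """Length-normalized DTW cost between two (T, D) feature sequences."""
        n, m = len(x), len(y)
        cost = np.full((n + 1, m + 1), np.inf)
        cost[0, 0] = 0.0
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                d = cosine_dist(x[i - 1], y[j - 1])
                cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
        return cost[n, m] / (n + m)

    def abx_error(cat_a, cat_b):
        """Fraction of (A, B, X) triples where X (same category as A) is not
        closer to A than to B: 0.0 is perfect discrimination, 0.5 is chance."""
        errors, total = 0, 0
        for a, x in product(cat_a, repeat=2):
            if a is x:
                continue                      # X must be a different token than A
            for b in cat_b:
                errors += dtw_distance(a, x) >= dtw_distance(b, x)
                total += 1
        return errors / total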

These three settings are nested in terms of the kind of data they use. Similarly, the metrics can also be applied in a nested fashion: the ABX error can be computed in all three settings, and PER/CER in the last two.
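PER, CER, and WER all reduce to the same computation, a length-normalized edit distance between a reference and a hypothesis sequence; only the token unit changes (phonemes, characters, or words). A minimal sketch, not the benchmark's scoring script:

    def edit_distance(ref, hyp):
        """Minimum number of substitutions, insertions, and deletions."""
        dp = list(range(len(hyp) + 1))
        for i, r in enumerate(ref, 1):
            prev, dp[0] = dp[0], i
            for j, h in enumerate(hyp, 1):
                prev, dp[j] = dp[j], min(dp[j] + 1,        # deletion
                                         dp[j - 1] + 1,    # insertion
                                         prev + (r != h))  # substitution
        return dp[-1]

    def error_rate(ref, hyp):
        return edit_distance(ref, hyp) / len(ref)

    # CER treats utterances as character sequences, WER as word sequences:
    print(error_rate("sixty hours", "sixty ours"))                  # CER ~ 0.09
    print(error_rate("sixty hours".split(), "sixty ours".split()))  # WER = 0.5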

Overall structure of Libri-light. We provide data, models, and metrics for benchmarking ASR with limited supervision in three popular settings.

Datasets

The training data is composed of unlabeled audio, a limited-supervision training set, and unaligned text.

  • Unlabeled audio: 60K hours of unlabeled speech extracted and processed from LibriVox audiobooks, with speech from over 7,000 unique speakers. We removed duplicates and corrupted data, and added voice activity detection, signal-to-noise ratio (SNR), genre, and unique speaker IDs in order to help study the impact of these side variables on unsupervised methods (see the metadata sketch after this list). The dataset is distributed in three disjoint subsets of different durations: unlab-60kh, unlab-6kh, and unlab-600h.

  • Limited-supervision training set: We provide the orthographic and phonetic transcription (the latter being force-aligned) for three subsets of different durations: train-10h, train-1h, and train-10min.

  • Unaligned text: We rely on the LibriSpeech LM training set, which is based on 14K books from the open-access Project Gutenberg repository.
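As an illustration of how the released metadata might be consumed, here is a hypothetical sketch that filters the unlabeled audio by SNR and retrieves voice-activity segments. The JSON field names are assumptions made for the example; the exact schema is documented in the repository.

    import json
    from pathlib import Path

    def iter_clean_speech(root, min_snr_db=10.0):
        """Yield (flac_path, voice_activity_segments) for files above an SNR floor."""
        for meta_path in Path(root).rglob("*.json"):
            meta = json.loads(meta_path.read_text())
            if meta.get("snr", 0.0) < min_snr_db:   # "snr" field name is illustrative
                continue                            # skip noisy recordings
            yield meta_path.with_suffix(".flac"), meta.get("voice_activity", [])

    for flac, segments in iter_clean_speech("unlab-600h"):
        print(flac, f"{len(segments)} speech segments")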

For the dev and test sets, we use the same ones as LibriSpeech, which enables comparison with the supervised state of the art on WER. We also provide force-aligned transcriptions for computing the ABX error and PER/CER.

Training sets and test sets can be downloaded from the Libri-light repository (linked below).

Distribution of genres in the Libri-light unlabeled training set.

Baselines

We provide code for training baseline systems as well as checkpoints in the three limited-supervision settings.

  • In the unsupervised setting, we use a Contrastive Predictive Coding (CPC) model. It is trained to predict the hidden states of N future speech frames and contains an encoder (mapping waveforms to hidden states), a sequence model (encoding the context), and a predictor (attempting to predict future hidden states). It is a PyTorch reimplementation of van den Oord, Li & Vinyals (2018), with a few changes to stabilize learning (a minimal sketch follows this list).

  • In the semi-supervised setting, we fine-tune the pretrained CPC model with a linear classifier on top of the sequence model, using a CTC objective on the limited-supervision training set (sketched in code after the figure below).

  • In the distant supervision setting, we present two methods. The first adds a pretrained LM on top of the fine-tuned CPC model; the second first builds a small acoustic model (TDS, Hannun et al. 2019) with the limited-supervision set and creates pseudo-labels on the unlabeled speech with the help of the pretrained LM, which are then used to retrain a larger ASR system on the 60K hours.
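To give the shape of the CPC baseline, the following is a minimal PyTorch sketch of its three components and the contrastive (InfoNCE) training loss. Dimensions, strides, and the negative-sampling scheme are simplified assumptions; the implementation in the repository is the reference.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class CPC(nn.Module):
        def __init__(self, dim=256, n_future=12):
            super().__init__()
            # Encoder: raw waveform (B, 1, T) -> downsampled frame features
            self.encoder = nn.Sequential(
                nn.Conv1d(1, dim, kernel_size=10, stride=5), nn.ReLU(),
                nn.Conv1d(dim, dim, kernel_size=8, stride=4), nn.ReLU(),
                nn.Conv1d(dim, dim, kernel_size=4, stride=2), nn.ReLU(),
            )
            # Sequence model: summarizes left context into c_t
            self.context = nn.GRU(dim, dim, batch_first=True)
            # Predictor: one linear map per future step k = 1..n_future
            self.predictors = nn.ModuleList(nn.Linear(dim, dim) for _ in range(n_future))

        def forward(self, wav):
            z = self.encoder(wav).transpose(1, 2)   # (B, T', dim) frame features
            c, _ = self.context(z)                  # (B, T', dim) context states
            return z, c

    def info_nce_loss(z, c, predictors):
        """Predict z_{t+k} from c_t; negatives come from other time steps."""
        B, T, _ = z.shape
        loss = 0.0
        for k, w in enumerate(predictors, start=1):
            if T <= k:
                break
            pred = w(c[:, :-k])                     # predictions for step t+k
            target = z[:, k:]                       # true future frames
            # Score each prediction against every candidate frame; the
            # matching time step is the positive, all others are negatives.
            logits = torch.einsum("btd,bsd->bts", pred, target)
            labels = torch.arange(T - k, device=z.device).expand(B, -1)
            loss = loss + F.cross_entropy(logits.reshape(-1, T - k), labels.reshape(-1))
        return loss / len(predictors)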

Two different methods for distant supervision: Blue indicates pretraining on unlabeled data plus fine-tuning on limited labels, plus decoding with a language model. Orange indicates a small supervised system trained with the limited labels and used to create pseudo-labels on the unlabeled data by decoding with a language model. Green indicates the current state of the art with supervised training on 1,000 hours.
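The fine-tuning stage used in the semi-supervised baseline (and as the second step of the blue pipeline above) can be sketched as follows, reusing the CPC class from the previous sketch. The phoneme-inventory size and optimizer settings are illustrative assumptions.

    import torch
    import torch.nn as nn

    N_PHONES = 40                 # illustrative inventory size; index 0 = CTC blank
    cpc = CPC()                   # pretrained model from the sketch above
    head = nn.Linear(256, N_PHONES + 1)
    ctc = nn.CTCLoss(blank=0, zero_infinity=True)
    opt = torch.optim.Adam(list(cpc.parameters()) + list(head.parameters()), lr=1e-4)

    def finetune_step(wav, targets, target_lengths):
        """wav: (B, 1, T) audio; targets: (B, L) phoneme indices in 1..N_PHONES."""
        _, context = cpc(wav)                       # (B, T', dim) context features
        log_probs = head(context).log_softmax(-1)   # (B, T', N_PHONES + 1)
        input_lengths = torch.full((wav.size(0),), log_probs.size(1), dtype=torch.long)
        # nn.CTCLoss expects (T', B, C) log-probabilities
        loss = ctc(log_probs.transpose(0, 1), targets, input_lengths, target_lengths)
        opt.zero_grad(); loss.backward(); opt.step()
        return loss.item()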


The results show that, with our simple baselines, the WER obtained with very limited amounts of labeled speech can improve by 10 percent or more through the use of unlabeled data. We hope that Libri-light will encourage researchers to beat these baselines and approach state-of-the-art performance with two to three orders of magnitude less labeled data.

All the models and metrics are provided here: https://github.com/facebookresearch/libri-light
