HuBERT: Self-supervised representation learning for speech recognition, generation, and compression

June 15, 2021

What the research is:

The north star for many AI research programs has been continuously learning to better recognize and understand speech simply through listening and interacting with others, similar to how babies learn their first language. This requires not only analyzing the words that someone speaks but also many other cues from how those words are delivered, e.g., speaker identity, emotion, hesitation, and interruptions. Furthermore, to completely understand a situation as a person would, the AI system must distinguish and interpret noises that overlap with the speech signal, e.g., laughter, coughing, lip-smacking, background vehicles, or birds chirping.

To open the door for modeling these types of rich lexical and nonlexical information in audio, we are releasing HuBERT, our new approach for learning self-supervised speech representations. HuBERT matches or surpasses state-of-the-art (SOTA) approaches to speech representation learning on speech recognition, generation, and compression.

To do this, our model uses an offline k-means clustering step and learns the structure of spoken input by predicting the right cluster for masked audio segments. HuBERT progressively improves its learned discrete representations by alternating between clustering and prediction steps.

HuBERT’s simplicity and stability will help natural language processing and speech researchers adopt learned discrete representations more broadly in their work. In addition, the quality of HuBERT’s learned representations facilitates easy deployment to many different downstream speech applications.

How it works:

HuBERT draws inspiration from Facebook AI’s DeepCluster method for self-supervised visual learning. It applies a masked prediction loss over sequences, as in Google’s Bidirectional Encoder Representations from Transformers (BERT) method, to represent the sequential structure of speech. HuBERT uses an offline clustering step to generate noisy labels for masked language model pretraining. Concretely, HuBERT consumes masked continuous speech features and predicts predetermined cluster assignments. The predictive loss is applied over only the masked regions, forcing the model to learn good high-level representations of unmasked inputs in order to infer the targets of masked ones correctly.
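To make the "loss over only the masked regions" idea concrete, here is a minimal sketch in plain Python. It is not the released implementation: the function name `masked_prediction_loss`, the list-based inputs, and the toy numbers are all assumptions made for illustration. It computes a cross-entropy between per-frame cluster scores and the offline cluster assignments, and averages it over masked frames only.

```python
import math

def masked_prediction_loss(logits, targets, mask):
    """Average cross-entropy over masked frames only (illustrative sketch).

    logits:  per-frame lists of scores, one score per cluster
    targets: per-frame cluster ids from the offline clustering step
    mask:    per-frame booleans, True where the input frame was masked
    """
    total, count = 0.0, 0
    for frame_logits, target, is_masked in zip(logits, targets, mask):
        if not is_masked:
            continue  # unmasked frames contribute nothing to the loss
        # Numerically stable log-sum-exp, then negative log-probability
        # of the target cluster.
        peak = max(frame_logits)
        log_norm = peak + math.log(sum(math.exp(s - peak) for s in frame_logits))
        total += log_norm - frame_logits[target]
        count += 1
    return total / count if count else 0.0

# Toy input (made up): 4 frames, 3 clusters, frames 1 and 2 are masked.
logits = [[2.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 5.0, 0.0], [0.0, 0.0, 9.0]]
targets = [0, 2, 1, 0]
mask = [False, True, True, False]
loss = masked_prediction_loss(logits, targets, mask)
```

In this toy case the last frame is badly mispredicted, but because it is unmasked it does not affect `loss`. Only the two masked frames are scored.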

HuBERT learns both acoustic and language models from continuous inputs. First, the model needs to encode unmasked audio inputs into meaningful continuous latent representations, which maps to the classical acoustic modeling problem. Second, to reduce the prediction error, the model needs to capture the long-range temporal relations between learned representations, which corresponds to the language modeling problem. One crucial insight motivating this work is the importance of the consistency of the k-means mapping from audio inputs into discrete targets, not just its correctness, which enables the model to focus on modeling the sequential structure of input data. For example, if an early clustering iteration cannot distinguish /k/ and /g/ sounds, leading to a single supercluster containing both of them, the prediction loss will still learn representations that model how other consonant and vowel sounds work together with this supercluster to form words. As a result, the following clustering iteration creates better clusters using the newly learned representation. Our experiments show the progressive improvement of representations by alternating clustering and prediction steps.
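The offline clustering half of this loop can be sketched as follows. This is a plain Lloyd's k-means in standard-library Python, offered as an illustration under stated assumptions rather than the code used in our experiments: the helper names `nearest` and `kmeans`, the two-dimensional toy "frames", and the choice of initial centroids are all hypothetical. The point is only that every frame is mapped to a discrete cluster id, and that the same mapping is applied consistently to all frames.

```python
def nearest(centroids, x):
    """Index of the centroid closest to feature vector x (squared Euclidean)."""
    dists = [sum((a - b) ** 2 for a, b in zip(c, x)) for c in centroids]
    return dists.index(min(dists))

def kmeans(frames, centroids, iterations=10):
    """Plain Lloyd's k-means; returns (centroids, per-frame cluster ids)."""
    centroids = [list(c) for c in centroids]
    for _ in range(iterations):
        labels = [nearest(centroids, x) for x in frames]
        for k in range(len(centroids)):
            members = [x for x, lab in zip(frames, labels) if lab == k]
            if members:  # keep the old centroid if a cluster is empty
                centroids[k] = [sum(dim) / len(members) for dim in zip(*members)]
    # Final assignment with the updated centroids gives the discrete targets.
    labels = [nearest(centroids, x) for x in frames]
    return centroids, labels

# Toy frame-level features (made up): two well-separated groups.
frames = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.2], [5.0, 5.1], [5.2, 4.9], [4.9, 5.0]]
centroids, targets = kmeans(frames, centroids=[frames[0], frames[3]])
```

In the alternating scheme described above, `targets` would serve as labels for a round of masked prediction, after which features taken from the trained model would be clustered again. That prediction step is not shown here.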

When pretrained on either the standard LibriSpeech 960 hours or the Libri-Light 60,000 hours, HuBERT matches or improves upon the state-of-the-art wav2vec 2.0 performance on all fine-tuning subsets: 10 min, 1 h, 10 h, 100 h, and 960 h.

The charts show results for two HuBERT model sizes, LARGE (300M parameters) and X-LARGE (1B parameters). The X-LARGE model shows up to 19 percent and 13 percent relative word error rate (WER) improvement on the dev-other and test-other evaluation subsets, respectively, when pretrained on 60,000 hours of Libri-Light data.
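For readers less familiar with the metric, a relative WER improvement is the drop in WER divided by the baseline WER. The short sketch below shows the arithmetic. The function name `relative_improvement` is hypothetical, and the WER values are made up for illustration only; they are not results from our experiments.

```python
def relative_improvement(baseline_wer, new_wer):
    """Fractional reduction in WER relative to the baseline."""
    return (baseline_wer - new_wer) / baseline_wer

# Illustrative numbers only (NOT figures from the paper):
# a drop from 5.0 to 4.0 WER is a 20 percent relative improvement.
example = relative_improvement(5.0, 4.0)
```

Note that a relative improvement is different from an absolute one: in the made-up example, the absolute gain is 1.0 WER point, while the relative gain is 20 percent.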

The notable success of speech representation learning enables direct language modeling of speech signals without reliance on any lexical resources (no supervised labels, text corpus, or lexicons). This in turn opens the door for modeling nonlexical information, such as a dramatic pause or urgent interruption, as well as background noises.

In our Generative Spoken Language Modeling (GSLM) work, we’ve taken the first steps toward using learned speech representations from CPC, wav2vec 2.0, and HuBERT for synthesizing speech. A unit language model trained on discretized latent representations allows conditional and unconditional generation of speech. In both automatic and human evaluations, HuBERT-generated samples compete in quality with the top-line supervised character-based LM and generation. You can listen to generated conditional and unconditional samples of all systems here: https://speechbot.github.io/.

The graphs above show HuBERT’s performance for language generation.
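As a rough illustration of what a "unit language model" over discretized representations means, the sketch below trains a bigram model on sequences of cluster ids and samples a continuation from a prompt. A bigram model is a deliberately simple stand-in chosen for brevity, not the model used in GSLM, and the names `train_bigram` and `generate` as well as the toy unit sequences are assumptions for this example.

```python
import random
from collections import Counter, defaultdict

def train_bigram(sequences):
    """Count how often each unit is followed by each other unit."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for prev, nxt in zip(seq, seq[1:]):
            counts[prev][nxt] += 1
    return counts

def generate(counts, prompt, length, rng):
    """Extend a prompt of unit ids by sampling from the bigram counts."""
    out = list(prompt)
    while len(out) < length:
        followers = counts.get(out[-1])
        if not followers:
            break  # no observed continuation for this unit
        units, weights = zip(*followers.items())
        out.append(rng.choices(units, weights=weights)[0])
    return out

# Toy corpus of discrete unit sequences (made up cluster ids).
corpus = [[3, 7, 7, 1, 4], [3, 7, 1, 4, 4], [1, 4, 3, 7]]
model = train_bigram(corpus)
sample = generate(model, prompt=[3], length=8, rng=random.Random(0))
```

Supplying a prompt corresponds to conditional generation. Starting from a randomly chosen unit instead would be the unconditional case. Turning the sampled units back into audio requires a separate synthesis step that is not sketched here.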

For speech compression, our recent “Speech Resynthesis from Discrete Disentangled Self-Supervised Representations” (paper: https://arxiv.org/pdf/2104.00355.pdf) used HuBERT to achieve a bit rate of 365 bps without degrading quality. You can listen to samples of HuBERT-compressed audio at https://resynthesis-ssl.github.io/.

HuBERT is second only to the uncompressed audio (256 kbps) in the Multiple Stimuli with Hidden Reference and Anchor (MUSHRA) subjective listening test.
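To put the two bit rates side by side, the short calculation below compares 365 bps with the 256 kbps uncompressed reference. It assumes that 256 kbps corresponds to 16 kHz, 16-bit, single-channel PCM and that 1 kbps is 1,000 bps. Those audio parameters are an assumption of this sketch, and the helper name `pcm_bit_rate` is hypothetical.

```python
def pcm_bit_rate(sample_rate_hz, bits_per_sample, channels=1):
    """Bit rate of uncompressed PCM audio in bits per second."""
    return sample_rate_hz * bits_per_sample * channels

# Assumed format for the uncompressed reference: 16 kHz, 16-bit, mono.
uncompressed_bps = pcm_bit_rate(16_000, 16)

hubert_bps = 365  # bit rate quoted above for the HuBERT-based system
ratio = uncompressed_bps / hubert_bps
```

Under these assumptions, `uncompressed_bps` is 256,000 and `ratio` comes out at roughly 700, meaning the discrete representation uses about 700 times fewer bits per second than the raw waveform.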

Why it matters:

HuBERT can help the AI research community develop NLP systems that are trained entirely on audio, rather than relying on text samples. This will enable us to enrich existing NLP applications with the full expressivity of spontaneous oral language, so that an AI voice assistant might speak with the nuances and affect of a real person. Learning speech representations without reliance on large volumes of labeled data is also crucial for industrial applications and products with ever-increasing coverage of new languages and domains. It will help the AI community build more inclusive applications covering spoken-only dialects and languages.

Read the full paper

Get the code + pretrained models