January 7, 2022
People use AI for a wide range of speech recognition and understanding tasks, from enabling smart speakers to developing tools for people who are hard of hearing or who have speech impairments. But oftentimes these speech understanding systems don’t work well in the everyday situations where we need them most: when multiple people are speaking simultaneously or when there’s lots of background noise. Even sophisticated noise-suppression techniques are often no match for, say, the sound of the ocean during a family beach trip or the background chatter of a bustling street market.
One reason why people can understand speech better than AI in these instances is that we use not just our ears but also our eyes. We might see someone’s mouth moving and intuitively know the voice we’re hearing must be coming from her, for example. That’s why Meta AI is working on new conversational AI systems that can recognize the nuanced correlations between what they see and what they hear in conversation, like we do.
To help us build these more versatile and robust speech recognition tools, we are announcing Audio-Visual Hidden Unit BERT (AV-HuBERT), a state-of-the-art self-supervised framework for understanding speech that learns by both seeing and hearing people speak. It is the first system to jointly model speech and lip movements from unlabeled data — raw video that has not already been transcribed. Given the same amount of transcriptions, AV-HuBERT is 75 percent more accurate than the best audio-visual speech recognition systems (which use both sound and images of the speaker to understand what the person is saying). Notably, our system overcomes an important limitation in training AI to perform useful tasks: AV-HuBERT outperforms the previous best audio-visual speech recognition system using one-tenth the labeled data. Since large amounts of labeled data are difficult to obtain for most of the world’s languages, AV-HuBERT’s self-supervised approach will help us build noise-robust automatic speech recognition (ASR) systems in more languages and for more applications.
By incorporating data on both visual lip movement and spoken language, AV-HuBERT will bring assistants closer to human-level speech perception. This technology could one day enable assistants on smartphones and augmented reality (AR) glasses to understand what we’re telling them no matter the circumstances — whether on a noisy factory floor, at a concert, or just speaking while a plane flies overhead.
We are making our code and the pretrained AV-HuBERT models available to other researchers working in this domain so the broader research community can build on our work and accelerate progress in ASR.
Today’s speech recognition models use just audio as their input, so they have to guess whether one person or several people are speaking, or whether a sound is just background noise. AV-HuBERT, however, learns similarly to how people master new skills — multimodally — by perceiving and learning language through a combination of audio and lip-movement cues. We trained the model using video recordings from the publicly available LRS3 and VoxCeleb data sets.
By combining visual cues, such as the movement of the lips and teeth during speech, with auditory information for representation learning, AV-HuBERT can efficiently capture nuanced associations between the two input streams, even with much smaller amounts of untranscribed video data for pretraining. Once the pretrained model has learned this structure and correlation well, only a small amount of labeled data is needed to train a model for a particular task or a different language.
The animation below demonstrates the AV-HuBERT approach. It encodes masked audio and image sequences into audio-visual features via a hybrid ResNet-Transformer architecture to predict the predetermined sequence of discrete cluster assignments. Motivated by Meta AI’s audio HuBERT approach for learning self-supervised speech representations, AV-HuBERT’s target cluster assignments are initially generated from signal processing-based acoustic features (e.g., Mel-frequency cepstral coefficients, or MFCCs) and then iteratively refined using the features learned by the audio-visual encoder via k-means clustering.
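The iterative target-generation step described above can be sketched in miniature. In this toy example, random vectors stand in for real per-frame acoustic features (such as 39-dimensional MFCCs), and a minimal k-means routine assigns each frame a discrete cluster ID to serve as a prediction target; the function name and shapes are illustrative, not the actual AV-HuBERT pipeline:

```python
import numpy as np

def kmeans_labels(feats, k, iters=10, seed=0):
    """Minimal k-means: assign each feature frame a discrete cluster ID."""
    rng = np.random.default_rng(seed)
    # Initialize centers from randomly chosen frames
    centers = feats[rng.choice(len(feats), size=k, replace=False)]
    for _ in range(iters):
        # Squared distance of every frame to every center, then hard assignment
        d = ((feats[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(1)
        # Move each center to the mean of its assigned frames
        for c in range(k):
            pts = feats[labels == c]
            if len(pts):
                centers[c] = pts.mean(0)
    return labels

# Toy stand-in for per-frame acoustic features (e.g., 39-dim MFCCs)
rng = np.random.default_rng(1)
frames = rng.normal(size=(500, 39))  # 500 frames x 39 feature dims

# Iteration 0: cluster signal-processing features into discrete targets
targets = kmeans_labels(frames, k=100)
```

In later iterations, the features fed to k-means would come from the trained audio-visual encoder itself rather than from signal processing, which is what progressively refines the cluster targets.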
AV-HuBERT simultaneously captures linguistic and phonetic information for unmasked regions from both the lip-movement and audio streams into its latent representations, then encodes their long-range temporal relationships to solve the masked-prediction task, similar to the BERT model. The contextualized representations learned by AV-HuBERT also show excellent transferability to tasks where the model can see but not hear the speaker. AV-HuBERT is 20 percent more accurate in this setting than the current state-of-the-art approach, outperforming the best visual-only speech recognition models using one-thousandth the amount of labeled data.
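The BERT-style masked-prediction objective mentioned above hides random spans of input frames and asks the model to predict the cluster targets at those positions. A minimal sketch of the span-masking step, with illustrative span length and masking probability (not AV-HuBERT's actual hyperparameters):

```python
import numpy as np

def mask_spans(num_frames, span_len=10, mask_prob=0.08, seed=0):
    """Pick random start frames and mask fixed-length spans, BERT-style."""
    rng = np.random.default_rng(seed)
    mask = np.zeros(num_frames, dtype=bool)
    starts = np.flatnonzero(rng.random(num_frames) < mask_prob)
    for s in starts:
        mask[s:s + span_len] = True  # spans near the end are clipped
    return mask

mask = mask_spans(300)
# The encoder sees the sequence with masked frames hidden and is trained
# to predict the pseudo-label cluster IDs at the masked positions,
# forcing it to use long-range audio-visual context.
```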
When speech and background noise are equally loud, the previous state-of-the-art AV-ASR achieves a 25.5 percent word error rate (WER) when trained on 433 hours of labeled data. Using the same amount of labeled data, AV-HuBERT achieves a 3.2 percent WER, meaning it makes roughly one mistake in every 30 words it hears. When the interfering speech is as loud as the target speech, it is impossible for an audio-only speech recognition model to know which speaker to transcribe. In contrast, our audio-visual model learns to transcribe only the speech of the person it sees speaking. AV-HuBERT achieves a 2.9 percent WER in this scenario, while an audio-only model without pretraining reaches a 37.3 percent WER.
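Word error rate, the metric quoted above, is the word-level edit distance (substitutions, insertions, and deletions) between the system's output and the reference transcript, divided by the number of reference words. It can be computed with standard dynamic programming:

```python
def wer(reference, hypothesis):
    """Word error rate: word-level edit distance / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edits to turn the first i ref words into the first j hyp words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # delete all i words
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # insert all j words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(ref)][len(hyp)] / len(ref)

# One substitution ("on" -> "in") out of six reference words
rate = wer("the cat sat on the mat", "the cat sat in the mat")
```

At a 3.2 percent WER, this ratio of errors to reference words works out to about one error per 31 words, hence the "one mistake in every 30 words" figure.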
In a low-resource setup with 30 hours of labeled data, on a test set with four types of noise (babble, interfering speech, music, other) and a wide range of signal-to-noise ratios (from -10dB to 10dB), we observe an average of 51.4 percent absolute WER reduction (59.2 percent → 7.8 percent) from AV-HuBERT compared with an audio-only ASR model without pretraining, and 35.1 percent absolute WER reduction compared with an audio-visual ASR model without pretraining.
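The noisy test condition above mixes interference into clean speech at a target signal-to-noise ratio (SNR) in decibels. One common way to do this — a sketch of the general technique, not necessarily the exact recipe used in our evaluation — is to scale the noise so the power ratio matches the requested SNR:

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so the mixture has the requested SNR in dB."""
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    # SNR(dB) = 10 * log10(p_speech / p_noise_scaled); solve for the scale
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + scale * noise

rng = np.random.default_rng(0)
speech = rng.normal(size=16000)  # 1 s of stand-in "speech" at 16 kHz
noise = rng.normal(size=16000)
mixed = mix_at_snr(speech, noise, snr_db=0)  # 0 dB: noise as loud as speech
```

At 0 dB SNR (the "equally loud" case discussed earlier), speech and noise carry equal power; at -10 dB the noise is ten times more powerful than the speech.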
When the system can see the speaker but not hear him or her, the previous state-of-the-art model can reach a 33.6 percent WER on the standard LRS3 benchmark data set after being trained on 31,000 hours of transcribed video data. Our approach beats the supervised state of the art, reaching a 28.6 percent WER with just 30 hours of labeled data and an order of magnitude less unsupervised video data. What’s more, when using 433 hours of labeled data, we achieve a new state of the art at 26.9 percent WER.
AV-HuBERT will do more than just allow us to develop conversational AI systems that can be used in challenging scenarios. Since it requires far less supervised data for training, it will also open up possibilities for developing conversational AI models for hundreds of millions of people around the globe who don’t speak languages such as English, Mandarin, and Spanish, which have large-scale labeled data sets.
Since AV-HuBERT learns from both voice and mouth movements, it may be useful for researchers working on building more inclusive speech recognition models for people with speech impairments. By capturing the fine correlations between sounds and mouth movements, self-supervised audio-visual representations could also be used to help detect deepfakes and other content that’s been manipulated in order to mislead people. It could also help generate realistic lip movements in virtual reality avatars, in order to deliver a true sense of presence — that feeling of being there with someone even if they’re on the other side of the world.
We will continue to benchmark and develop approaches that improve audio-visual speech recognition models in everyday scenarios where background noise and speaker overlap are commonplace. We would also like to extend our model to multilingual benchmarks beyond English. Ultimately, we hope that AV-HuBERT will help us and others build new speech recognition tools that work well for everyone, regardless of the language they speak or the circumstances in which they are using it.