MuAViC: The first audio-video speech translation benchmark

March 8, 2023

In countless everyday situations, background noise (the sound of traffic, music, other people speaking) makes it more difficult to understand what other people are saying. We often use information from our other senses, especially vision, to help us communicate (as pointed out by Harry McGurk and John MacDonald in their 1976 study “Hearing Lips and Seeing Voices”). For example, if you're speaking to a friend at a loud concert, you will likely focus on their face to supplement what you can hear.

AI researchers have recently built systems (such as Meta AI’s publicly available AV-HuBERT and RAVen models) that use visual information to improve performance for English speech recognition tasks. Now, Meta AI is releasing MuAViC (Multilingual Audio-Visual Corpus), the first benchmark that makes it possible to use audio-visual learning for highly accurate speech translation. We’ve used MuAViC to train our AV-HuBERT model to translate speech in noisy, challenging settings, where it outperforms other leading translation models.

In this example, the AV-HuBERT model's transcription contains one error ("Either" instead of "Hi there") but still achieves much greater accuracy than the other model.

With No Language Left Behind and Universal Speech Translator, Meta has focused on speech translation research because it has incredible potential to break down communication barriers and bring people together. We’re excited to see how others in the research community use MuAViC to create translation systems that work well in real-world conditions.

Creating MuAViC

Extending audio-visual understanding to speech translation was previously unexplored because suitable training data is scarce: collecting and processing audio-video data typically requires more resources than collecting audio data alone.

MuAViC is the first benchmark for audio-video speech translation and the largest multilingual benchmark for audio-video speech recognition. It contains roughly 1,200 hours of transcribed data spanning 9 languages.

For English talks, we reuse audio-visual data from the LRS3 dataset and align it with a machine translation corpus using a text-matching algorithm. Matched examples are then paired with the corresponding target sentences in the machine translation corpus for translation labels. We apply exact text matching for development set and test set examples to ensure the best accuracy. For training set examples without a match, we acquire pseudo-translation labels from a machine translation model.
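The labeling step described above can be sketched roughly as follows. This is a minimal illustration, not the released MuAViC pipeline: the function names, the text normalization, and the `pseudo_translate` callback (a stand-in for a machine translation model) are all assumptions made for the example.

```python
import re
from typing import Callable, Dict, List, Optional, Tuple


def normalize(text: str) -> str:
    """Lowercase, drop punctuation, and collapse whitespace before comparing.

    Assumption: the source does not say how text is normalized before matching.
    """
    text = re.sub(r"[^\w\s']", " ", text.lower())
    return " ".join(text.split())


def assign_translation_labels(
    transcripts: List[str],
    mt_corpus: List[Tuple[str, str]],
    pseudo_translate: Optional[Callable[[str], str]] = None,
) -> List[Dict[str, Optional[str]]]:
    """Pair each transcript with a translation label.

    Exact matches against the MT corpus are used first. If no match is found
    and a pseudo_translate callback is given, its output is used instead.
    """
    # Index the MT corpus by normalized source sentence for constant-time lookup.
    index = {normalize(src): tgt for src, tgt in mt_corpus}
    labeled: List[Dict[str, Optional[str]]] = []
    for transcript in transcripts:
        target = index.get(normalize(transcript))
        if target is not None:
            source = "matched"
        elif pseudo_translate is not None:
            target, source = pseudo_translate(transcript), "pseudo"
        else:
            source = "unlabeled"
        labeled.append(
            {"transcript": transcript, "translation": target, "label_source": source}
        )
    return labeled
```

In this sketch, development and test examples would be labeled with `pseudo_translate=None`, so only exact matches receive a translation, while training examples would pass a callback so that unmatched transcripts still get a pseudo-label.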

For non-English talks, we reuse the audio-only data, transcriptions, and text translations collected in the speech translation dataset. To add the visual modality, we acquire video tracks of the original recordings and align the processed video data with the audio data to create audio-visual data. Although all the audio data is transcribed, only a subset of it is translated. For the untranslated portion, we acquire pseudo-translation labels using the same machine translation model that we used for the English talks.

Training end-to-end models

We used Meta’s AV-HuBERT architecture to create end-to-end audio-video speech recognition and audio-video speech translation models. Given an aligned pair of audio-video data, our model is able to process both modalities and fuse their representations into a unified space that can be used for either speech recognition or translation tasks. If either modality is missing, AV-HuBERT can still process the available input modality (though less effectively).
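To make the idea of fusing two aligned streams more concrete, here is a small sketch in plain Python. It is not the AV-HuBERT implementation: the frame-wise concatenation, the zero-filling of a missing stream, and the name `fuse_frames` are assumptions chosen only to illustrate how a model can accept audio, video, or both through a single input space.

```python
from typing import List, Optional

Frames = List[List[float]]  # one feature vector per time step


def fuse_frames(
    audio: Optional[Frames],
    video: Optional[Frames],
    audio_dim: int,
    video_dim: int,
) -> Frames:
    """Concatenate per-frame audio and video features.

    A missing stream (None) is replaced by zero vectors, so the fused frames
    always have audio_dim + video_dim values. Hypothetical helper for
    illustration only.
    """
    if audio is None and video is None:
        raise ValueError("at least one modality is required")
    if audio is not None and video is not None and len(audio) != len(video):
        raise ValueError("audio and video must be aligned to the same number of frames")
    if audio is None:
        # Audio is missing: stand in zero vectors, one per video frame.
        audio = [[0.0] * audio_dim for _ in range(len(video or []))]
    if video is None:
        # Video is missing: stand in zero vectors, one per audio frame.
        video = [[0.0] * video_dim for _ in range(len(audio))]
    return [a + v for a, v in zip(audio, video)]
```

Because the fused frames have the same width whether or not a stream is present, a downstream encoder in this sketch would not need a separate code path for audio-only or video-only input.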

In this video, the model must deal with background music (rather than background noise, as in the first video above).

Our model’s most noteworthy feature is its robustness to noise. If the audio modality is distorted because of noise or any other factor, the model will rely more on the visual modality to perform the task properly. We tested our models against a state-of-the-art model on speech recognition and X-En speech translation tasks (translation from other languages into English), in environments both with and without noise.
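One common way to create a noisy test condition is to mix a noise clip into clean speech at a chosen signal-to-noise ratio (SNR). The sketch below shows that calculation in plain Python. The post does not say how noise was added or at what level, so the function `mix_at_snr`, the looping of the noise clip, and the use of an SNR target are assumptions for illustration.

```python
import math
from typing import List, Sequence


def mix_at_snr(speech: Sequence[float], noise: Sequence[float], snr_db: float) -> List[float]:
    """Add noise to speech so that the mixture has the requested SNR in decibels.

    Hypothetical helper: samples are plain floats, and the noise clip is
    looped if it is shorter than the speech.
    """
    if not speech or not noise:
        raise ValueError("speech and noise must be non-empty")
    # Loop the noise clip so that it covers the whole utterance.
    tiled = [noise[i % len(noise)] for i in range(len(speech))]
    speech_power = sum(s * s for s in speech) / len(speech)
    noise_power = sum(n * n for n in tiled) / len(tiled)
    if noise_power == 0.0:
        # Silent noise cannot be scaled to any SNR; return the clean speech.
        return list(speech)
    # Choose the gain so that
    # speech_power / (gain**2 * noise_power) == 10 ** (snr_db / 10).
    gain = math.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return [s + gain * n for s, n in zip(speech, tiled)]
```

Lower `snr_db` values give a louder noise component, so sweeping this value is one way to compare how quickly different models degrade as conditions get harder.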

This chart compares model performance on speech recognition tasks spanning nine different languages. Meta's AV-HuBERT model doesn't degrade significantly in noisy environments, while the current state-of-the-art model does.

Similarly, on the X-En speech translation task spanning six different languages, the performance of Meta’s AV-HuBERT model does not degrade significantly in noisy environments, while that of the state-of-the-art model does.

Toward robust speech translation

MuAViC enables researchers to build robust speech recognition and translation systems for different languages. We’ve released the corpus as well as our audio-visual speech recognition and translation models covering nine different languages. We hope this will help the community build even better, more robust speech models. We are excited about the future of powerful, robust models.