March 8, 2023
In countless everyday situations, background noise (the sound of traffic, music, or other people speaking) makes it more difficult to understand what other people are saying. People often use information from their other senses, especially vision, to help them communicate (as pointed out by Harry McGurk and John MacDonald in their 1976 study “Hearing Lips and Seeing Voices”). For example, if you're speaking to a friend at a loud concert, you will likely focus on their face to supplement what you can hear.
AI researchers have recently built systems (such as Meta AI’s publicly available AV-HuBERT and RAVen models) that use visual information to improve performance for English speech recognition tasks. Now, Meta AI is releasing MuAViC (Multilingual Audio-Visual Corpus), the first benchmark that makes it possible to use audio-visual learning for highly accurate speech translation. We’ve used MuAViC to train our AV-HuBERT model to translate speech in noisy, challenging settings, where it outperforms other leading translation models.
With No Language Left Behind and Universal Speech Translator, Meta has focused on speech translation research because it has incredible potential to break down communication barriers and bring people together. We’re excited to see how others in the research community use MuAViC to create translation systems that work well in real-world conditions.
Because of the scarcity of suitable training data, extending audio-visual understanding to speech translation was previously unexplored. Collecting and processing audio-video data typically requires more resources than collecting audio data alone.
MuAViC is the first benchmark for audio-video speech translation and the largest multilingual benchmark for audio-video speech recognition. It contains roughly 1,200 hours of transcribed data spanning 9 languages.
For English talks, we reuse audio-visual data from the LRS3 dataset and align it with a machine translation corpus using a text-matching algorithm. Matched examples are then paired with the corresponding target sentences in the machine translation corpus for translation labels. We apply exact text matching for development set and test set examples to ensure the best accuracy. For training set examples without a match, we acquire pseudo-translation labels from a machine translation model.
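The labeling logic described above can be sketched in a few lines. This is only an illustration, not the pipeline used to build MuAViC: the function names, the whitespace and case normalization, and the `fallback` callable standing in for a machine translation model are all assumptions made for the example.

```python
from typing import Callable, Dict, Iterable, List, Optional, Tuple


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace (an assumed normalization step)."""
    return " ".join(text.lower().split())


def assign_labels(
    transcripts: Iterable[str],
    mt_pairs: Iterable[Tuple[str, str]],
    fallback: Optional[Callable[[str], str]] = None,
) -> List[Tuple[str, str, str]]:
    """Pair each transcript with a translation label.

    mt_pairs holds (source sentence, target sentence) entries from a
    machine translation corpus. A transcript whose normalized text equals
    a normalized source sentence receives that entry's target sentence.
    Unmatched transcripts are labeled by `fallback` (a hypothetical MT
    model) when one is given, and are skipped otherwise.
    """
    index: Dict[str, str] = {normalize(src): tgt for src, tgt in mt_pairs}
    labeled: List[Tuple[str, str, str]] = []
    for transcript in transcripts:
        key = normalize(transcript)
        if key in index:
            labeled.append((transcript, index[key], "matched"))
        elif fallback is not None:
            labeled.append((transcript, fallback(transcript), "pseudo"))
    return labeled
```

In this sketch, development and test examples would be processed with `fallback=None`, so only exact matches are kept, while training examples would pass a translation function so that unmatched transcripts still receive a pseudo-translation label.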
For non-English talks, we reuse the audio-only data, transcriptions, and text translations collected in the speech translation dataset. To add the visual modality, we acquire video tracks of the original recordings and align processed video data with the audio data to create audio-visual data. Although all the audio data is transcribed, only a subset of it is translated. We acquire pseudo-translation labels using the same machine translation model as earlier.
We used Meta’s AV-HuBERT architecture to create end-to-end audio-video speech recognition and audio-video speech translation models. Given an aligned pair of audio and video inputs, our model can process both modalities and fuse their representations into a unified space that can be used for either speech recognition or translation tasks. And if either modality is missing, AV-HuBERT can still process the available input modality, though less effectively.
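One simple way to picture this kind of fusion is frame-wise concatenation, with a zero-filled placeholder standing in for a missing modality. The sketch below is a toy illustration under that assumption and uses plain Python lists; the function name `fuse_frames` and its parameters are hypothetical and are not taken from the released code.

```python
from typing import List, Optional

Frames = List[List[float]]  # one feature vector per time step


def fuse_frames(
    audio: Optional[Frames],
    video: Optional[Frames],
    audio_dim: int,
    video_dim: int,
) -> Frames:
    """Concatenate per-frame audio and video features.

    If one modality is missing (None), it is replaced by zero vectors of
    the expected size, so the fused output always has
    audio_dim + video_dim values per frame.
    """
    if audio is None and video is None:
        raise ValueError("at least one modality is required")
    num_frames = len(audio) if audio is not None else len(video)
    if audio is None:
        audio = [[0.0] * audio_dim for _ in range(num_frames)]
    if video is None:
        video = [[0.0] * video_dim for _ in range(num_frames)]
    if len(audio) != len(video):
        raise ValueError("audio and video must be aligned frame by frame")
    return [a + v for a, v in zip(audio, video)]
```

Keeping the fused shape fixed means the layers that follow see the same input size whether one or both modalities are present.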
Our model’s most noteworthy feature is its robustness to noise. If the audio modality is distorted because of noise or any other factor, the model will rely more on the visual modality to perform the task properly. We tested our models against a state-of-the-art model for speech recognition and X-En speech translation tasks in environments both with and without noise.
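A common way to create noisy test conditions is to add a noise recording to clean speech at a chosen signal-to-noise ratio (SNR). The sketch below shows that general technique only; the function name and the use of SNR in decibels are assumptions for illustration, and this post does not specify how the noisy evaluation audio was produced.

```python
import math
from typing import List, Sequence


def mix_at_snr(speech: Sequence[float], noise: Sequence[float], snr_db: float) -> List[float]:
    """Add noise to speech so the result has the requested SNR in dB.

    The noise is scaled so that
    10 * log10(speech_power / scaled_noise_power) == snr_db.
    """
    if len(speech) != len(noise) or len(speech) == 0:
        raise ValueError("speech and noise must be non-empty and equally long")
    speech_power = sum(x * x for x in speech) / len(speech)
    noise_power = sum(x * x for x in noise) / len(noise)
    if noise_power == 0:
        raise ValueError("noise must not be silent")
    scale = math.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return [s + scale * n for s, n in zip(speech, noise)]
```

Lower SNR values give louder noise relative to the speech, which is the regime where visual input would be expected to help most.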
MuAViC enables researchers to build robust speech recognition and translation systems for different languages. We’ve released the corpus as well as our audio-visual speech recognition and translation models covering nine different languages. We hope this will help the community build even better, more robust speech models, and we are excited to see what comes next.
MuAViC: A Multilingual Audio-Visual Corpus for Robust Speech Recognition and Robust Speech-to-Text Translation
We’d like to acknowledge the contributions of Vedanuj Goswami, Wei-Ning Hsu, and Bowen Shi to the work discussed in this blog post.
Meta AI Resident