Powered by AI: Automated captions

Sept. 15, 2020

This blog post originally appeared on Tech.fb.com.

The COVID-19 pandemic is giving the world a fierce hunger for human connection and information. Messaging and calls, both voice and video, have surged in recent months as people around the world check in with family, friends, and work colleagues. Audiences for newscasts and government briefings have also ballooned as the public seeks updates on the outbreak, travel guidance, and personal hygiene advice to protect themselves from getting sick.

While there is no shortage of information, not everyone can access it. It also needs to reach the hundreds of millions of people in the world who are deaf or hard of hearing. According to the World Health Organization, over 5 percent of the world’s population, or 466 million people, have disabling hearing loss, and that number is projected to increase to over 900 million by 2050. “Video captioning is critical for people like me in the deaf community during a public health emergency,” explains Brenden Gilbert, a Production Operations Engineer at Facebook. While Facebook provides automatic closed captioning for on-demand videos in 16 languages, and just announced similar capabilities for Instagram IGTV, the hunger for live, real-time news and information still needed to be met.

Facebook AI researchers and engineers have now made live video content more accessible by enabling automatic closed captions for Facebook Live and Workplace Live. Already, six languages are supported: English, Spanish, Portuguese, Italian, German, and French. Facebook Live automatic captions are helping governments disseminate crucial public health information and ensuring that millions of viewers across the world — whether they have hearing loss or are just watching where audio is not available — get the message. And, as workplace policies evolve, automatic captioning has become essential for employers to keep their staff and customers informed through safety updates.

Facebook AI researchers and engineers have made live video content more accessible by enabling automatic closed captions for Facebook Live and Workplace Live.

The speed and scale of this AI-powered technology were only possible thanks to advances Facebook AI has made in automated speech recognition (ASR) over the past few years, explains Daniel McKinnon, a Product Manager at Facebook. “Our team made these advancements in AI and was able to rapidly productize them even at a time when Facebook servers were ‘melting,’” he adds, referring to the recent spikes in app traffic, messaging, and voice and video calling. “That took extraordinary engineering.”

Laying out the challenge

Although automated speech recognition (ASR), which predicts a sequence of words from a raw audio signal, has been around since the 1950s, it is still an exceptionally difficult task. In the type of conversational speech that is present in live streams, people don’t always speak clearly or “wait their turn” to speak. Unpredictable background noise, the large variety of accents and dialects, and the wide range of tones that influence human speech make ASR even harder.

The system also needs to learn to recognize hundreds of millions of different words across many languages, including uncommon names and jargon. An “open domain” task like this is very different from, and much more complex than, more constrained ASR tasks such as automated customer service calls, where the system needs to consider only a relatively small set of possibilities.

Facebook provides automatic closed captioning for on-demand videos in 16 languages, and just announced similar capabilities for Instagram IGTV.

Conventional ASR systems are composed of three components: an acoustic model that predicts phonemes from short segments of audio; a pronunciation lexicon, which describes how the phonemes are combined to form the words of a given language; and a language model that captures the relationships among those words, e.g., which words are the most common and which words are likely to appear together.

A pivotal early discovery by the Facebook AI team was that the phonetic pronunciation lexicon could be eliminated: acoustic models could be trained to directly predict the graphemes (or characters) of a word with better accuracy, first for end-to-end systems and later for hybrid systems as well. This greatly simplified training and deployment of these ASR models across different languages.
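
To make the contrast concrete, here is a minimal sketch in Python (chosen only because the post mentions PyTorch). The toy lexicon, the phoneme symbols, the "|" word-boundary marker, and the function names are all hypothetical illustrations, not Facebook's data or code. It shows that phoneme targets need a lexicon entry for every word, while grapheme targets can be read straight off the spelling.

```python
# Toy pronunciation lexicon (hypothetical entries, ARPAbet-style symbols).
TOY_LEXICON = {
    "live": ["L", "AY", "V"],
    "video": ["V", "IH", "D", "IY", "OW"],
}


def phoneme_targets(transcript, lexicon):
    """Map each word to phonemes; fails for words missing from the lexicon."""
    targets = []
    for word in transcript.lower().split():
        if word not in lexicon:
            raise KeyError(f"no pronunciation for {word!r}")
        targets.extend(lexicon[word])
    return targets


def grapheme_targets(transcript):
    """Map a transcript straight to characters, with '|' between words."""
    return list("|".join(transcript.lower().split()))


print(phoneme_targets("live video", TOY_LEXICON))
print(grapheme_targets("live covid video"))
```

In this sketch a previously unseen word such as "covid" raises an error on the phoneme path until someone adds a pronunciation, whereas the grapheme path handles it with no extra resources.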

The COVID crunch

The rapid spread of the COVID-19 pandemic caused a spike in both the supply and demand of public health information. Several local and state governments, which were accustomed to holding live press conferences but didn’t have the resources, staff, or technology to record, stream, and caption their live events, turned to Facebook Live. “It gave them a great solution when everything was shut down,” says McKinnon. Several governments also discovered that video captioning was not just nice to have but imperative, especially in the absence of available sign language interpreters. “Many of them needed captions to comply with their own disability access rules for public broadcasts,” explains McKinnon.

People around the world were also tuning into newscasts and conferences streaming on Facebook Live, and watching for much longer periods of time than usual. In fact, the number of Facebook Live broadcasts from Pages doubled in June 2020 compared with the same time last year. That incredible amount of traffic puts enormous stress on any ASR system.

To handle these spikes in traffic, Facebook's ASR models needed to get a lot faster in production to avoid falling behind. Recent research has shown that convolutional encoders trained with the CTC loss function can be highly efficient during inference for streaming use cases, while RNN Transducer models consistently yield the best accuracies despite being the most compact. For non-streaming use cases, i.e., when the entire video is available to the model for decoding, we have found that Transformer encoders can produce ASR models that are both very fast and the most accurate.
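
As a rough illustration of how the output of a CTC-trained model can be turned into text, the sketch below implements greedy CTC decoding in plain Python: take the highest-scoring symbol in each frame, merge consecutive repeats, then drop the blank symbol. The post does not describe the decoder used in production, so the tiny alphabet, the fake per-frame scores, and the function names here are assumptions made purely for illustration.

```python
BLANK = "_"  # CTC blank symbol (assumed to sit at index 0 of the alphabet)
ALPHABET = [BLANK, "c", "o", "v", "i", "d"]


def ctc_greedy_decode(frame_scores, alphabet=ALPHABET, blank=BLANK):
    """Pick the best symbol per frame, merge repeats, then remove blanks."""
    best_path = [
        alphabet[max(range(len(scores)), key=scores.__getitem__)]
        for scores in frame_scores
    ]
    decoded = []
    previous = None
    for symbol in best_path:
        # A symbol is emitted only when it differs from the previous frame
        # and is not the blank; a blank between two equal letters keeps both.
        if symbol != previous and symbol != blank:
            decoded.append(symbol)
        previous = symbol
    return "".join(decoded)


def one_hot_frames(path, alphabet=ALPHABET):
    """Build fake per-frame scores whose argmax spells out `path`."""
    return [[1.0 if symbol == target else 0.0 for symbol in alphabet]
            for target in path]


frames = one_hot_frames("cc_oo_vv_i_dd")
print(ctc_greedy_decode(frames))  # covid
```

Because each frame is handled in a single left-to-right pass, this kind of decoding can run as audio arrives. Real systems commonly replace the greedy step with a beam search that also consults a language model.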

Engineers were able to deploy all of these model variations. Combined with a number of infrastructure optimizations, these models made it possible not only to serve all the additional video traffic but also to achieve machine savings despite the increased load. Models were trained using PyTorch, which enabled quick iteration on ideas and deployment to production. “Improving speed without compromising on accuracy is the cherry on top,” says Yatharth Saraf, an engineering manager at Facebook. “It was a nice response to the COVID-19 capacity crunch we found ourselves in.”

Julian Chan, a Software Engineer at Facebook AI, explains that the system is also capable of adapting to new words such as COVID, which is essential for captioning public health information–based broadcasts during the pandemic. “It can easily learn a new word and predict where it will occur,” he explains. “This was largely made possible using text data from public Facebook posts to train the system.”
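
One simple way to picture how extra text lets a system pick up a new word is a count-based bigram language model. The sketch below is a deliberately tiny stand-in, not the model described in this post: the class name, the example sentences, and the resulting probabilities are all invented for illustration.

```python
from collections import Counter, defaultdict


class BigramModel:
    """Count-based bigram model: P(next | previous) from raw counts."""

    def __init__(self):
        self.counts = defaultdict(Counter)

    def train(self, sentences):
        """Add bigram counts from an iterable of whitespace-split sentences."""
        for sentence in sentences:
            words = sentence.lower().split()
            for previous, following in zip(words, words[1:]):
                self.counts[previous][following] += 1

    def probability(self, previous, following):
        """Relative frequency of `following` after `previous` (0.0 if unseen)."""
        total = sum(self.counts[previous].values())
        return self.counts[previous][following] / total if total else 0.0


model = BigramModel()
model.train(["wash your hands", "the new guidance"])
print(model.probability("new", "covid"))  # 0.0, the word has not been seen yet

# Hypothetical fresh text that mentions the new word.
model.train(["the new covid guidance", "new covid cases"])
print(model.probability("new", "covid"))  # now 2 of the 3 continuations of "new"
```

After the second batch of text, the model both knows the word and has an estimate of where it tends to occur, which is the behavior the quote describes in far simpler form.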

Saraf is proud of the rapid response of Facebook’s engineers and researchers. “We’ve made a lot of crucial progress in making vital information accessible to the Deaf and Hard of Hearing community in a very short space of time,” he says. He explains that since deaf individuals rely on captioning services for critical updates, continued improvements in the captioning system are essential.

Facebook AI is already on the case. “The training data our system learned from included many different types of speech, but it’s far from perfect, especially when it comes to different accents,” says McKinnon. However, it can be difficult or even impossible to collect sufficient training data of every type, so researchers are exploring methods to improve and adapt models by having them also learn from vast amounts of unlabeled audio.

In the meantime, broadcasters can count on automatic closed captions to support their efforts to get the message out — whether a state official is sharing authoritative health guidance or someone is simply taking their viewers behind the scenes of a day in their life — during COVID-19 and beyond.

We'd like to extend special thanks to Xiaohui Zhang, Duc Le, Frank Zhang, Jun Liu, Si Chen, Chunxi Liu, Jiedan Zhu, Yongqiang Wang, Vineel Pratap, Jiatong Zhou, Julian Chan, Kritika Singh, Kjell Schubert, Yutong Pang, Shahzad Bhatti, Jeff Glick, Andres Alvarado, Mark Chou, Abdelrahman Mohamed, Awni Hannun, Ronan Collobert, Mike Seltzer, and Geoffrey Zweig.
