Facebook AI leads in 2019 WMT international machine translation competition

August 01, 2019

With hundreds of languages used by people on our platforms and thousands more spoken around the world, developing powerful and flexible machine translation systems has long been a research focus for Facebook. Today we are proud to announce that Facebook AI models achieved first place in several language tasks included in this year’s annual news translation competition, hosted by the Fourth Conference on Machine Translation (also known as WMT). Our models outperformed all other entrants’ models in the four tasks we participated in, including English to German, the most competitive task in the contest, with entries drawn from a wide range of high-performing research teams. For this language direction, our translations have been declared superhuman by the WMT organizers, meaning that human evaluators preferred them over translations done by human experts.

Our models used large-scale sampled back-translation, noisy channel modeling, and data-cleaning techniques to achieve the highest performance for translating from English to German, German to English, English to Russian, and Russian to English. These models, along with our work on cross-lingual pretraining and self-supervised learning for other modalities, will enable us to break down language barriers and to build better content understanding systems to keep people safe.

Building the leading machine translation systems

This year marks the 13th consecutive iteration of the WMT shared task, which is widely considered the premier platform for researchers demonstrating advancements in machine translation technology. The news translation challenge is an important part of this annual conference and received over 150 submissions this year from institutions around the world, including universities as well as tech firms.

Of the 14 available translation tasks, we competed in four, including the English-to-German task. Though we had also earned first place in this direction in 2018, our new submission beat our previous system by 4.5 BLEU points, a large improvement for a system that was already the best in its task.

Facebook AI’s submissions leverage our earlier work on large-scale sampled back-translation, which we used to win the 2018 edition of the English-to-German WMT news translation task. Neural machine translation (NMT) models typically require large amounts of bilingual training data, meaning sentences for which we have reference translations. However, high-quality bilingual data is limited, which encourages participants to use monolingual data, for which no translations are available.

Back-translation presents a workaround to this issue: in order to train an English-to-German translation model, we first train a German-to-English model and then use it to translate the monolingual German data provided by WMT. The resulting machine-translated English sentences, paired with the original German sentences, serve as additional training data. This is a well-established technique, but we used sampling for back-translation, which leads to much better results: Instead of always choosing the best possible translation, we sometimes choose translations that are not optimal. This improved performance compared to using conventional back-translation, because the model can learn more from this noisier data. We also scaled to very large amounts of data, incorporating roughly 10 billion words of additional data for the English-to-German setting.
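To make the difference between conventional and sampled back-translation concrete, here is a minimal sketch. It is not our production system: the toy candidate table, its probabilities, and the function names are all hypothetical, and a real model samples word by word rather than choosing among a handful of whole sentences.

```python
import random

# Hypothetical toy German-to-English model: each German sentence maps to
# candidate English translations with made-up model probabilities.
TOY_MODEL = {
    "Guten Morgen": [("Good morning", 0.7), ("Morning", 0.2), ("Good day", 0.1)],
    "Danke schön": [("Thank you very much", 0.6), ("Thanks a lot", 0.3), ("Thank you", 0.1)],
}


def translate_best(german):
    """Conventional back-translation: keep the highest-probability candidate."""
    return max(TOY_MODEL[german], key=lambda pair: pair[1])[0]


def translate_sampled(german, rng):
    """Sampled back-translation: draw a candidate in proportion to its probability."""
    candidates, weights = zip(*TOY_MODEL[german])
    return rng.choices(candidates, weights=weights, k=1)[0]


def build_synthetic_pairs(monolingual_german, rng=None):
    """Pair each real German sentence with a synthetic English source sentence.

    With rng=None the best candidate is used; otherwise candidates are sampled.
    """
    pairs = []
    for german in monolingual_german:
        if rng is None:
            english = translate_best(german)
        else:
            english = translate_sampled(german, rng)
        # (source, target) order for training an English-to-German model.
        pairs.append((english, german))
    return pairs


conventional = build_synthetic_pairs(list(TOY_MODEL))
sampled = build_synthetic_pairs(list(TOY_MODEL), rng=random.Random(0))
```

In this sketch the German side of every pair is always real text; only the English side is synthetic, and sampling makes that synthetic side more varied than always taking the top candidate.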

Last year, we open-sourced our system and documented our techniques in a research paper for everyone to use. This year, several submissions adopted our techniques, and to stay ahead, we improved our system through so-called noisy channel modeling and rigorous data cleaning, and by fine-tuning the model.

Translating backward, forward, and more fluently

Machine translation typically works by using a single model to generate a translation for a given sequence of words, such as translating a German sentence into English. Though back-translation adds another layer to this process, at least for training purposes, noisy channel modeling takes this process further, using a total of three models to ultimately arrive at a more accurate translation.

First, a forward model translates a sentence, such as from German to English, generating a set of translation candidates, or hypotheses. A backward model then translates those English hypotheses back into German, allowing the system to evaluate how well each English translation appears to line up with the original German sentence. Finally, a language model judges the fluency of the English translations. The language model was trained on billions of words to get a good sense of what an English sentence should look like.

Once the backward and language models have scored all the English translations produced by the forward model, the system then selects the hypothesis with the highest combined score according to all models as the actual translation. We trained all these models on 128 NVIDIA Volta GPUs using fairseq, our open sequence-to-sequence modeling toolkit.
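The selection step can be sketched as a simple reranking function. This is an illustration under stated assumptions rather than our actual implementation: the scores are made-up log-probabilities, and combining them as a weighted sum with the hypothetical parameters `backward_weight` and `lm_weight` is one common way to do it that we assume here.

```python
from dataclasses import dataclass


@dataclass
class Hypothesis:
    text: str
    forward_logprob: float   # log P(English | German) from the forward model
    backward_logprob: float  # log P(German | English) from the backward model
    lm_logprob: float        # log P(English) from the language model


def combined_score(hypothesis, backward_weight=1.0, lm_weight=1.0):
    """Weighted sum of the three model scores (weights are assumed, not published values)."""
    return (
        hypothesis.forward_logprob
        + backward_weight * hypothesis.backward_logprob
        + lm_weight * hypothesis.lm_logprob
    )


def rerank(hypotheses, backward_weight=1.0, lm_weight=1.0):
    """Return the hypothesis with the highest combined score."""
    return max(
        hypotheses,
        key=lambda h: combined_score(h, backward_weight, lm_weight),
    )


# Made-up scores: the forward model slightly prefers the ungrammatical
# candidate, but the backward and language models disagree.
candidates = [
    Hypothesis("I has seen the house", -1.0, -4.0, -3.0),
    Hypothesis("I have seen the house", -1.5, -2.0, -2.5),
]
best = rerank(candidates)
forward_only = rerank(candidates, backward_weight=0.0, lm_weight=0.0)
```

Here `best` is the second candidate, while `forward_only` (which ignores the backward and language models) falls back to the forward model's first choice.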

Cleaning crawled datasets and fine-tuning

The other significant change that we made this year wasn’t to our system but to the data used to train all our models, including those used for back-translation and noisy channel modeling. The WMT news translation task provides large datasets that have been crawled from the internet. These are naturally very noisy, and the presence of misaligned sentences, artifacts of web pages, URLs, and other incorrect or irrelevant material can potentially decrease translation performance.

To offset these issues, we employed a range of data-cleaning techniques, including removing sentence pairs in which one sentence is significantly longer than its corresponding translation. We also used language identification (or langid) filtering to keep only those sentence pairs with the correct languages on both sides.
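A minimal sketch of these two filters follows. The `max_ratio` threshold of 2.5 and the stopword-based `toy_identify_language` function are assumptions made for illustration; a real pipeline would plug in a trained language identifier and a tuned threshold instead.

```python
EN_STOPWORDS = {"the", "is", "and", "a", "of"}
DE_STOPWORDS = {"der", "die", "das", "ist", "und", "ein"}


def length_ratio_ok(source, target, max_ratio=2.5):
    """Reject pairs where one side has far more words than the other."""
    src_len = len(source.split())
    tgt_len = len(target.split())
    if src_len == 0 or tgt_len == 0:
        return False
    return max(src_len, tgt_len) / min(src_len, tgt_len) <= max_ratio


def toy_identify_language(sentence):
    """Hypothetical stand-in for a real language identifier (stopword counting)."""
    words = [word.strip(".,!?").lower() for word in sentence.split()]
    en_hits = sum(word in EN_STOPWORDS for word in words)
    de_hits = sum(word in DE_STOPWORDS for word in words)
    if en_hits > de_hits:
        return "en"
    if de_hits > en_hits:
        return "de"
    return "unknown"


def clean_corpus(pairs, identify, src_lang="en", tgt_lang="de", max_ratio=2.5):
    """Keep pairs that pass both the length-ratio and the language-ID check."""
    kept = []
    for source, target in pairs:
        if not length_ratio_ok(source, target, max_ratio):
            continue
        if identify(source) != src_lang or identify(target) != tgt_lang:
            continue
        kept.append((source, target))
    return kept


noisy_pairs = [
    ("The house is small", "Das Haus ist klein"),
    ("The cat", "Die Katze ist sehr müde und schläft den ganzen Tag auf dem Sofa"),
    ("Das ist ein Test", "Das ist ein Test"),
]
cleaned = clean_corpus(noisy_pairs, toy_identify_language)
```

Only the first pair survives: the second fails the length-ratio check, and the third has German on what should be the English side.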

Performance of the various techniques we used for our four submissions on the newstest2018 test set.

For Russian, we had less monolingual news data than for English, so we increased the amount of data by adding sentences from the Common Crawl corpus. Common Crawl is very large but also very noisy, so to address this issue we applied domain filtering and added only high-quality sentences.

A small part of the provided data is of very high quality, and we used it to fine-tune our models once they finished training. The idea is to simply continue training on the high-quality data only. This has the positive effect that our models are primed to do well on this specific type of data, which closely matches the actual data the models will be tested on.
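The idea of continuing training on a small, well-matched dataset can be shown with a deliberately tiny example. This sketch uses a hypothetical one-parameter model `y = weight * x` and made-up numbers in place of a translation model; it only illustrates that fine-tuning starts from the already-trained parameters instead of from scratch.

```python
def sgd_epochs(weight, data, learning_rate, epochs):
    """Run plain SGD on squared error for the one-parameter model y = weight * x."""
    for _ in range(epochs):
        for x, y in data:
            gradient = 2 * (weight * x - y) * x
            weight -= learning_rate * gradient
    return weight


def mean_squared_error(weight, data):
    """Average squared error of the model on a dataset."""
    return sum((weight * x - y) ** 2 for x, y in data) / len(data)


# Made-up data: the large general set follows a slope of about 2,
# while the small high-quality in-domain set follows a slope of 3.
general_data = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9), (4.0, 8.2)]
in_domain_data = [(1.0, 3.0), (2.0, 6.0)]

# Step 1: regular training on the general data, starting from scratch.
base_weight = sgd_epochs(0.0, general_data, learning_rate=0.01, epochs=50)

# Step 2: fine-tuning, which continues from base_weight on in-domain data only.
tuned_weight = sgd_epochs(base_weight, in_domain_data, learning_rate=0.01, epochs=50)

error_before = mean_squared_error(base_weight, in_domain_data)
error_after = mean_squared_error(tuned_weight, in_domain_data)
```

After the second step, the error on the in-domain data is lower than before fine-tuning, which mirrors the effect described above on a much smaller scale.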

The context-sensitive future of machine translation

The goal of WMT’s news translation competition is to provide a platform for researchers to share their ideas and to assess the state of the art in machine translation. Like last year, we are making the models for our winning systems available for everyone to download as part of fairseq, our open source sequence modeling toolkit. We are also sharing the ideas behind our system in a research paper.

Though current NMT models tend to look at each sentence in isolation, it’s becoming clear that looking at the entire document is important for achieving additional performance improvements. By taking information from previous sentences into account, systems can produce better translations. Of course, this additional context presents new challenges. It requires datasets that retain document boundaries, new models that are able to use that information effectively, and a robust evaluation methodology to measure progress. We believe that this is the next frontier for improving translation even further.