Research

Q&A with machine translation pioneer: The future of MT is multilingual

November 10, 2021

How far are we from universal translation, the “holy grail” of machine translation (MT), where everyone in the world can understand every language in real time?

We sat down with Philipp Koehn, a Meta AI research scientist, author of Statistical Machine Translation and Neural Machine Translation, and one of the inventors of the modern method of phrase-based MT. He talks about the latest advances in MT, the newest open challenges for the field, and promising directions on the path toward universal translation.

Q: Your team has just pioneered the first-ever multilingual model to win the prestigious WMT competition, a competition you helped create in the early days of MT, around 15 years ago. What does this mean for automatic translations?

Philipp Koehn: Today there’s a significant imbalance in the coverage of MT technology: Language pairs with vast volumes of training data, such as French-English, can be automatically translated at close to human quality, but there are still hundreds of low-resource languages for which no MT systems exist at all. Translation has the power to provide access to information that would otherwise be out of reach. It’s important that translation technology be inclusive of everyone around the world, regardless of data scarcity.

Multilingual systems translate multiple language pairs in a single model and are a key evolution because they generalize knowledge across many language pairs, which is particularly helpful for low-resource languages. This is a stark departure from the traditional bilingual approach, where each language pair is treated in isolation. Until now, though, multilingual models haven’t handled high-resource languages as well as their bilingual counterparts.

With new advancements like this WMT 2021 submission, where a multilingual model established new state-of-the-art quality for the first time, we’re now witnessing a bigger shift toward multilingual models. Thanks to new scaling and data optimization work, the single multilingual model is not just more efficient to develop, it also produces better-quality translations than bilingual models, across both high- and low-resource languages. This work holds promise for bringing high-quality translations to more languages than was possible before.

Q: How quickly do you think we can bring these translation improvements to billions of people using Facebook and Meta’s other platforms, especially for people who speak low-resource languages?

PK: Meta’s latest WMT multilingual model translates many very different language pairs with a single model, and that’s a major milestone. Having a single model, rather than training specialized models for each language direction, makes creating and deploying new models far more feasible, particularly when scaling to more and more languages. But productionizing it at the rate of 20 billion translations a day on Facebook, Instagram, and our other platforms is a research direction in its own right. Meta AI has a separate research effort focused on deploying these large multilingual models, including techniques such as knowledge distillation and model compression. For example, we’ve productionized a previous version of a multilingual model that’s currently helping to proactively detect hate speech, even in languages with little training data, which is important for keeping people safe on our platforms around the world.
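As a rough illustration of the kind of distillation step involved (a generic sketch, not Meta’s production recipe), a small student model can be trained to match both the reference translations and the softened output distribution of a large multilingual teacher:

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, gold_ids,
                      temperature=2.0, alpha=0.5):
    """Generic knowledge-distillation objective (illustrative only):
    blend cross-entropy on the reference tokens with a KL term that
    pulls the small student toward the teacher's softened distribution."""
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    log_soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    kd = F.kl_div(log_soft_student, soft_teacher,
                  reduction="batchmean") * temperature ** 2
    ce = F.cross_entropy(student_logits.view(-1, student_logits.size(-1)),
                         gold_ids.view(-1))
    return alpha * kd + (1 - alpha) * ce
```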

Recent improvements demonstrated in the WMT efficiency task have shown that it is possible to translate thousands of words per second on a single CPU. While the latest WMT multilingual model is still too big to be deployed in real-time settings, the learnings from building these models will improve the production MT system in the near future.


Q: The field has long been working toward building universal language translators. Why couldn’t previous systems get us there, and why do you think the multilingual approach is different?

PK: Traditional supervised models are too narrow and depend on data sets of millions of examples, which don’t exist for many language pairs. Building and maintaining thousands of separate models also creates excessive complexity, which is neither computationally feasible nor scalable for practical applications.


The field’s ultimate goal has been to build a representation of text that is common to all languages, so that it’s easier to transfer knowledge from one language to another.

There is an interesting throughline in the decades of efforts to expand the number of language pairs, centered on the notion of interlingual representations.

With an interlingual representation, be it symbolic or neural, the quadratic problem of covering many-to-many language pairs is reduced to a linear one: For each language, only one analyzer (or encoder) and one generator (or decoder) need to be constructed, since these feed into and out of a language-independent central representation.

A classic illustration of this is the Vauquois triangle, with interlingual representation at its pinnacle, obviating the need for transfer. What this idea leaves out, however, is the ability to share knowledge across languages: say, the ability of a Catalan component to benefit from data in Spanish. Multilingual models, on the other hand, jointly train encoders and decoders on multiple languages, holding promise for achieving universal translation one day.
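To make the quadratic-to-linear reduction concrete, here is a toy back-of-the-envelope calculation (illustrative arithmetic only, not tied to any particular system):

```python
def bilingual_models(n_languages):
    # One dedicated model per translation direction: N * (N - 1).
    return n_languages * (n_languages - 1)

def interlingual_components(n_languages):
    # One encoder plus one decoder per language, all meeting in a shared
    # language-independent representation: 2 * N.
    return 2 * n_languages

for n in (10, 100):
    print(f"{n} languages: {bilingual_models(n)} bilingual models "
          f"vs. {interlingual_components(n)} encoders/decoders")
# 10 languages: 90 bilingual models vs. 20 encoders/decoders
# 100 languages: 9900 bilingual models vs. 200 encoders/decoders
```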

Q: What are the challenges still ahead, if multilingual is the path forward toward universal translation? How far away are we?

PK: Multilingual models pose serious computational challenges due to their large scale and the vast amounts of training data needed to train them. Hence research into more efficient training methods has been essential.

But there are a host of additional challenges. Modeling challenges range from balancing the different types of data (including data synthesized via back-translation) to open questions around how the neural architecture should accommodate language-specific parameters.

The architecture of multilingual models is not yet settled. Early efforts introduced language-specific encoders and decoders. At the other end of the spectrum, you could take a traditional model and feed it a concatenation of all parallel corpora, with each source sentence tagged with a token that specifies the output language. Most researchers believe that some form of language-specific parameters needs to augment a general model, but it’s not yet clear whether these should be hard-coded by language or whether the model should be tasked with learning how best to utilize specialized parameters.
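The “concatenate everything and tag it” end of that spectrum can be sketched in a few lines; the token format and helper below are illustrative, not taken from any particular codebase:

```python
def tag_for_target(source_sentence, target_lang):
    """Prepend a target-language token so one shared encoder-decoder
    knows which language to generate (token format is a placeholder)."""
    return f"<2{target_lang}> {source_sentence}"

# One mixed training stream built by concatenating several bilingual corpora.
mixed_corpus = [
    (tag_for_target("Wie geht es dir?", "en"), "How are you?"),
    (tag_for_target("How are you?", "fr"), "Comment vas-tu ?"),
    (tag_for_target("Com estàs?", "es"), "¿Cómo estás?"),
]
```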

There is always the question of whether bigger is better. A language pair with lots of data will likely benefit from a bigger model, but low-resource language pairs risk overfitting. We were able to overcome this with the WMT model across 10 out of 14 different language pairs. But as we add more languages, these two concerns need to be accommodated at the same time.

There are several other challenges, like figuring out how to train on data that differs in style, topic, and noise level, and that varies across the language pair of each corpus; it’s unclear how this data should best be combined, weighted, or staged. How much leverage do related language directions provide to enable zero-shot translation? Is it even desirable or practical to use all available training data to train such models?

Q: What do you think are the most promising directions for solutions to address these challenges?

PK: At Meta, the teams are engaged in a concerted effort to cover a much larger number of languages in a multilingual model and utilize it for many applications. This involves all aspects of the problem: modeling, training, data, and productionizing.

In terms of modeling and architecture challenges, we have seen the most success with models that selectively use subsets of parameters based on the input. One such approach is latent layer selection, where a subset of Transformer layers is used depending on the language. Another is the Mixture of Experts model, which places an ensemble of alternative feed-forward layers in the Transformer block and lets the model select a subset of them. Given the large amount of training data, it is not surprising that bigger models yield better results, but careful selection of hyperparameters is important to achieve that outcome.
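As a minimal sketch of the Mixture of Experts idea (not the actual architecture of the WMT model), each token’s feed-forward computation can be routed through a small, gate-selected subset of expert layers:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixtureOfExpertsFFN(nn.Module):
    """Toy top-k expert routing for the feed-forward sublayer of a
    Transformer block; sizes and routing are illustrative."""
    def __init__(self, d_model=512, d_ff=2048, n_experts=8, top_k=2):
        super().__init__()
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                          nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        ])
        self.gate = nn.Linear(d_model, n_experts)
        self.top_k = top_k

    def forward(self, x):                                # x: (n_tokens, d_model)
        scores = F.softmax(self.gate(x), dim=-1)         # (n_tokens, n_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)   # both (n_tokens, top_k)
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e                    # tokens whose k-th pick is expert e
                if mask.any():
                    out[mask] += weights[mask, k].unsqueeze(-1) * expert(x[mask])
        return out
```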

For many of the NxN language directions, the only available parallel data was originally translated through a pivot language. Think of the many translations of the Bible from which, say, an Estonian-Nepali parallel corpus can be extracted, but where each Bible version was translated from a third language (be it Greek, Latin, or English). Since we don’t want training to be dominated by such data, we combine the high-quality training data (often paired with English) with direct parallel data for only some language pairs: translations between representative languages of each language family, grouped by linguistic and data-driven analysis.
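A toy sketch of that data-combination strategy might look like the following, where the family groupings and bridge languages are hypothetical placeholders rather than the groupings actually used:

```python
from itertools import combinations

# Hypothetical family groupings and bridge languages, for illustration only;
# real groupings come from linguistic and data-driven analysis.
families = {
    "germanic":   ["en", "de", "sv"],
    "romance":    ["fr", "es", "ro"],
    "indo_aryan": ["hi", "bn"],
}
bridges = ["en", "fr", "hi"]   # one representative per family

all_langs = [lang for langs in families.values() for lang in langs]

# High-quality English-centric pairs exist for most languages ...
pairs = {tuple(sorted((lang, "en"))) for lang in all_langs if lang != "en"}

# ... plus direct pairs between the bridge languages, so training also sees
# non-English-centric directions that were not pivoted through English.
pairs |= {tuple(sorted(p)) for p in combinations(bridges, 2)}

print(sorted(pairs))
```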

It’s also important to consider the varying quality, relevance, and sources of training data. Staging the training data in a curriculum (e.g., narrowing the data down toward the best subsets over the course of training) typically gives better results.
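One way to picture such a curriculum (the stage boundaries and trainer interface here are made up for illustration) is to keep fine-tuning on progressively cleaner subsets:

```python
def staged_training(model, corpora, train_fn):
    """Illustrative quality curriculum: `corpora` is ordered from
    largest/noisiest to smallest/cleanest, and `train_fn(model, data)`
    stands in for whatever trainer is actually used."""
    stages = [
        corpora,        # stage 1: all available data
        corpora[1:],    # stage 2: drop the noisiest crawled corpora
        corpora[-1:],   # stage 3: finish on the best in-domain subset
    ]
    for stage in stages:
        data = [example for corpus in stage for example in corpus]
        train_fn(model, data)
    return model
```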

These techniques are promising. But progress toward solving open challenges has always been cumulative. It will happen over time through open science as researchers across the industry build on top of this work, as well as on research from other labs and companies. We’ve published the WMT model and released its code, just as we’ve done in the past by releasing research and tools (fairseq) and data sets (CCMatrix, CCAligned, FLORES), organizing shared tasks (multilingual, low-resource news, terminology, filtering), and funding academia to collectively push research forward.


Q: The shift toward more generalized models goes beyond just MT. How might these multilingual advancements help push the AI field forward overall?

PK: The move to large multilingual models mirrors a broader trend in AI. Many advanced natural language models are not built as specialized systems anymore, but rather on top of massive language models such as GPT-3 or XLM-R.

One may view this as a push toward general intelligence: AI systems that are capable of addressing many different problems and cross-applying knowledge between them. In the same spirit, multilingual translation models solve the general translation problem, not the specific problem of a particular language pair.

Multilingual is a step in that direction. It leads to more flexible systems that can serve more tasks. It is more efficient because it frees up capacity — which allows us to roll out new features instantly to people around the world. Finally, it’s closer to human thinking. As humans, we don’t have specialized models for each task; we have one brain that does many different things. Multilingual models, just like pretrained models, are bringing us closer to that.

Q: As one of the pioneers of modern MT, what do you think the future of translation will look like in the next 10 years?

PK: That is hard to predict. Ten years ago, I would not have predicted the hard turn from statistical to neural methods. What is safe to say, though, is that we will see continued improvements in translation quality and in the languages covered by translation technology, leading to broader applications. Many people on the Facebook platform already expect that they can translate posts in languages they do not understand with a single click. Sometimes they do not even have to click, and the translation is displayed automatically. This kind of seamless integration is an example of how translation technology will be employed: invisible to users, who simply use their preferred language and everything just works. There is some exciting research on speech translation at Meta, which promises to bring this kind of seamless integration to the spoken realm.

From a research perspective, there is at the moment a real concern that training AI systems such as MT models requires massive computing resources, which limits experimentation and who can engage in this type of work. Over the next 10 years, maybe Moore’s law will take care of some of that. But I expect that we will need more efficient training methods to be able to move forward quickly with new innovations.

Written By

Ritika Trikha

AI Writer/Editor