We're introducing a new model, called XLM-R, that uses self-supervised training techniques to achieve state-of-the-art performance in cross-lingual understanding, a task in which a model is trained in one language and then used with other languages without additional training data. Our model improves upon previous multilingual approaches by incorporating more training data and languages — including so-called low-resource languages, which lack extensive labeled and unlabeled datasets.
XLM-R has achieved the best results to date on four cross-lingual understanding benchmarks, with increases of 4.7 percent average accuracy on the XNLI cross-lingual natural language inference dataset, 8.4 percent average F1 score on the recently introduced MLQA question answering dataset, and 2.1 percent F1 score on NER. After extensive experiments and ablation studies, we’ve shown that XLM-R is the first multilingual model to outperform traditional monolingual baselines that rely on pretrained models.
In addition to sharing our results, we’re releasing the code and models that we used for this research. Those resources can be found in our fairseq, PyText, and XLM repositories on GitHub.
While earlier work in this area has demonstrated the effectiveness of multilingual masked language models on cross-lingual understanding, models such as XLM and multilingual BERT were limited in their ability to learn useful representations for low-resource languages. XLM-R improves on previous approaches in several ways:
Building on the cross-lingual approach that we used with XLM and RoBERTa, we increased the number of languages and training examples for our new model, training self-supervised cross-lingual representations from more than two terabytes of publicly available CommonCrawl data that had been cleaned and filtered. This included generating new unlabeled corpora for low-resource languages, scaling the amount of training data available for those languages by two orders of magnitude.
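To give a flavor of what cleaning and filtering web-scale text involves, here is a minimal, hypothetical sketch of two common steps — dropping very short lines and removing exact duplicates via hashing. The real pipeline is considerably more sophisticated (it also applies language identification and quality filtering); the function name and thresholds below are illustrative, not from the released code.

```python
import hashlib

def dedup_and_filter(lines, min_chars=30):
    """Toy corpus-cleaning sketch: drop near-empty lines and exact
    duplicates (detected via content hashing). A production pipeline
    would also run language ID and quality filtering."""
    seen = set()
    kept = []
    for line in lines:
        text = line.strip()
        if len(text) < min_chars:
            continue  # drop very short / noisy lines
        digest = hashlib.sha1(text.encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # drop exact duplicates seen earlier
        seen.add(digest)
        kept.append(text)
    return kept
```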
During fine-tuning, we leveraged the ability of multilingual models to use labeled data in multiple languages in order to improve downstream task performance. This enabled our model to achieve state-of-the-art results on cross-lingual benchmarks while exceeding the per-language performance of monolingual BERT models.
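Conceptually, multilingual fine-tuning just means pooling labeled examples from every available language into one training set for a single model, rather than tuning one model per language. The sketch below illustrates that idea with hypothetical names and a toy data format; it is not the actual fine-tuning code.

```python
import random

def build_multilingual_train_set(datasets, seed=0):
    """Toy sketch: concatenate labeled examples from several languages
    into a single shuffled fine-tuning set, so one model is tuned on
    all languages jointly instead of one model per language."""
    combined = []
    for lang, examples in datasets.items():
        for text, label in examples:
            combined.append({"lang": lang, "text": text, "label": label})
    rng = random.Random(seed)  # fixed seed for a reproducible shuffle
    rng.shuffle(combined)
    return combined
```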
We tuned our model’s parameters to offset the fact that using cross-lingual transfer to scale models to more languages also limits the model’s capacity to understand each of those languages. Our parameter changes included upsampling low-resource languages during training and vocabulary construction, generating a larger shared vocabulary, and increasing the overall model capacity up to 550 million parameters.
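Upsampling is typically done by smoothing the natural language distribution: each language's share of the corpus is raised to a power alpha < 1 before renormalizing, which boosts low-resource languages relative to their raw token counts. The sketch below implements that smoothing; the default alpha=0.3 follows the value reported in the XLM-R paper, and the function name and input format are illustrative.

```python
def sampling_probs(token_counts, alpha=0.3):
    """Exponentially smoothed sampling distribution over languages:
    p_i proportional to (n_i / N) ** alpha. With alpha < 1, low-resource
    languages are upsampled relative to their natural corpus share."""
    total = sum(token_counts.values())
    weights = {lang: (n / total) ** alpha for lang, n in token_counts.items()}
    z = sum(weights.values())  # renormalize so probabilities sum to 1
    return {lang: w / z for lang, w in weights.items()}
```

For example, a language holding about 1 percent of the tokens ends up sampled far more often than 1 percent of the time, at the cost of slightly downsampling high-resource languages.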
We found that XLM-R performed particularly well for low-resource languages, improving XNLI performance on Swahili and Urdu by 2.3 percent and 5 percent, respectively, compared with the previous state of the art, which was trained on 15 languages.
With people on Facebook posting content in more than 160 languages, XLM-R represents an important step toward our vision of providing the best possible experience on our platforms for everyone, regardless of what language they speak. Potential applications include serving highly accurate models for identifying hate speech and other policy-violating content across a wide range of languages. As this work helps us transition toward a one-model-for-many-languages approach — as opposed to one model per language — it will also make it easier to continue launching high-performing products in multiple languages at once. And by open-sourcing our models and code (available through our GitHub repositories for fairseq, PyText, and XLM), we hope to improve the performance of multilingual models created by the research community, particularly systems that use self-supervised training methods to better understand low-resource languages.