We’ve used Facebook AI’s LASER toolkit and Faiss library to create WikiMatrix, the largest and most complete extraction of parallel sentences across multiple languages. Using publicly available Wikipedia articles, we extracted 135 million parallel sentences for 1,620 different language pairs in 85 different languages. The pairings systematically cover all possible language pairs, including uncommon ones such as Hebrew-Italian and Arabic-Vietnamese.
We are now sharing WikiMatrix with the AI research community. Additionally, we are providing an exhaustive quality assessment by training more than 1,800 neural machine translation systems and evaluating them on the TED corpus. WikiMatrix provides parallel data to directly train neural machine translation systems between even distantly related languages, without the need to first translate to English.
Most multilingual models, and neural machine translation systems in particular, require parallel corpora for training. Large quantities of parallel texts are available only for some major languages, and are usually aligned only to English. WikiMatrix is the first dataset to systematically handle all languages on Wikipedia, including low-resource languages and dialects. Also, many publicly available parallel corpora come from one specific source (e.g., legal texts), while the WikiMatrix corpus covers the wide range of topics on Wikipedia. Because WikiMatrix contains a large volume of sentence pairs in different languages, it can be used to train and evaluate translation systems more effectively for low-resource languages.
WikiMatrix also demonstrates how LASER’s massively multilingual sentence embeddings and the Faiss library can be used to efficiently perform large-scale distance-based bitext mining. In contrast, a brute force method of comparing the 134 million English and 51 million German Wikipedia entries would require calculating more than 6 quadrillion distances.
NLP researchers can use WikiMatrix to train, evaluate, and compare new translation models or other multilingual models for 85 different languages and dialects.