CCMatrix: A billion-scale bitext dataset for training translation models

February 06, 2020

What it is:

CCMatrix is the largest dataset of high-quality, web-based bitexts for training translation models. With more than 4.5 billion parallel sentences in 576 language pairs pulled from snapshots of the CommonCrawl public dataset, CCMatrix is more than 50 times larger than the WikiMatrix corpus that we shared last year. Gathering a dataset of this size required modifying the bitext mining approach we previously used for WikiMatrix so that it assumes the translation of a sentence could be found anywhere on CommonCrawl, which functions as an open archive of the internet. Comparing billions of sentences to determine which ones are mutual translations poses significant computational challenges. To address them, we used massively parallel processing, as well as our highly efficient FAISS library for fast similarity searches.

We’re sharing details about how we created CCMatrix, along with the tools other researchers need to reproduce our results and use this corpus in their work. To demonstrate the value of automatically generating such a large number of parallel texts, we trained neural machine translation (NMT) systems on CCMatrix and compared their performance with established baselines. Our resulting models outperformed the state-of-the-art single NMT systems evaluated in the Conference on Machine Translation (WMT’19) competition in four language directions, including Russian to English, despite using only mined translations (rather than human-provided ones). When tested on the TED corpus, models trained on CCMatrix also significantly improved NMT performance for many language pairs, compared with other approaches.

What it does:

Parallel texts, which include sentences in one language and their corresponding translations in another, are the backbone of most NMT training methods. While more bitext examples typically lead to better translation performance, gathering large parallel corpora across a wide range of languages is a resource-intensive task. Our method automates and parallelizes this bitext mining process, processing multiple batches of 50 million examples at a time on an 8-GPU server. Using the FAISS library, we’re able to calculate the distance between all the sentence embeddings in each batch, with every calculation performed in parallel. This enables rapid extraction of sentence pairs, pulled from a greater variety of publicly available texts than similar datasets, including our Wikipedia-based WikiMatrix.
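
To make the mining step more concrete, here is a minimal sketch of the idea in plain Python. It is not the CCMatrix pipeline: it swaps FAISS and GPUs for a brute-force comparison, uses plain cosine similarity with a mutual-nearest-neighbor check, and runs on tiny hand-made vectors. The function names, the toy embeddings, and the 0.8 threshold are all assumptions made for illustration.

```python
import math

def normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def mine_pairs(src_embs, tgt_embs, threshold=0.8):
    """Return (src_idx, tgt_idx, score) for mutual nearest neighbors.

    Brute-force stand-in for a FAISS similarity search; the threshold
    is an assumed value, not one taken from the CCMatrix work.
    """
    src = [normalize(v) for v in src_embs]
    tgt = [normalize(v) for v in tgt_embs]
    # Full similarity matrix: sim[i][j] = cosine(src[i], tgt[j]).
    sim = [[sum(a * b for a, b in zip(s, t)) for t in tgt] for s in src]
    # Best target for each source sentence, and best source for each target.
    best_tgt = [max(range(len(tgt)), key=lambda j: row[j]) for row in sim]
    best_src = [max(range(len(src)), key=lambda i: sim[i][j])
                for j in range(len(tgt))]
    pairs = []
    for i, j in enumerate(best_tgt):
        # Keep a pair only if each side picks the other and the score is high.
        if best_src[j] == i and sim[i][j] >= threshold:
            pairs.append((i, j, sim[i][j]))
    return pairs

# Hypothetical 3-dimensional "sentence embeddings" for two languages.
src_embs = [[1.0, 0.0, 0.1], [0.0, 1.0, 0.0], [0.5, 0.5, 0.7]]
tgt_embs = [[0.1, 0.9, 0.0], [0.9, 0.1, 0.1], [-1.0, 0.0, 0.2]]

for i, j, score in mine_pairs(src_embs, tgt_embs):
    print(f"source {i} <-> target {j} (cosine {score:.3f})")
```

The brute-force matrix grows with the product of the two sentence counts, which is why a real system at the scale described here would rely on an optimized similarity search library and parallel hardware rather than nested Python loops.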

CCMatrix’s parallelized approach to bitext mining maps the similarities between millions of sentences in many different languages at once, searching for pairs that can function as training examples for translation models.

Why it matters:

CCMatrix enables the NMT research community to leverage much larger bitext datasets than were previously available for scores of language pairs. This can accelerate the creation of more effective NMT models that work with more languages, particularly low-resource ones that have relatively limited corpora.

Because of its large scale and its use of a broad array of public texts, we believe that CCMatrix will become one of the most commonly used resources for building and evaluating systems across the field of NMT. We also hope that the technique we used to create CCMatrix will help the research community develop new ways to create large-scale datasets that will improve translation tools used by people around the globe.