MLQA is a multiway-aligned extractive question answering (QA) evaluation benchmark. It is designed to help the AI community improve and extend QA in more languages and to foster research in zero-shot multilingual QA approaches. MLQA contains 12,000 QA instances in English and more than 5,000 in each of six other languages: Arabic, German, Hindi, Spanish, Vietnamese, and Simplified Chinese. Facebook AI’s LASER toolkit was used to identify documents that would be suitable for MLQA.
Because MLQA is highly parallel, with each question in the data set appearing in several languages, researchers can use it to compare transfer performance across languages. It also supports evaluation across language pairs, for example when a question is in Vietnamese and the answer is in Hindi.
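To make the evaluation settings concrete, here is a minimal sketch that enumerates the possible question-language and answer-language combinations for the seven MLQA languages. The two-letter language codes and the function name `evaluation_pairs` are assumptions made for illustration, not part of the data set's documentation.

```python
from itertools import product

# Two-letter codes for the seven MLQA languages (the codes are an assumption):
# English, Arabic, German, Hindi, Spanish, Vietnamese, Simplified Chinese.
LANGUAGES = ["en", "ar", "de", "hi", "es", "vi", "zh"]


def evaluation_pairs(languages):
    """Return every (question_language, answer_language) combination."""
    return list(product(languages, repeat=2))


pairs = evaluation_pairs(LANGUAGES)

# Same-language pairs measure transfer to a single target language.
same_language = [(q, a) for q, a in pairs if q == a]

# Mixed pairs cover cases such as a Vietnamese question with a Hindi answer.
mixed_language = [(q, a) for q, a in pairs if q != a]
```

Since each question appears in several languages rather than necessarily all seven, the number of instances available for a given pair will likely vary.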
We have evaluated current state-of-the-art cross-lingual representations on MLQA and found a significant performance gap between the training language and the testing languages. We challenge the QA and cross-lingual research community to come together to make progress on this new cross-lingual understanding task. To help with this effort, we are providing machine translation-based baselines as well as the results of our evaluations.
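One simple way to quantify such a gap is to subtract each testing language's score from the training language's score. The sketch below assumes a hypothetical `scores` dictionary that maps a language code to a single metric value. The function name and the input format are illustrative, not the procedure used in our evaluations.

```python
def transfer_gaps(scores, train_lang):
    """Return the score drop from the training language to each other language.

    `scores` maps a language code to a metric value (hypothetical input format).
    A positive gap means the model does worse on that testing language.
    """
    if train_lang not in scores:
        raise KeyError(f"no score for training language {train_lang!r}")
    reference = scores[train_lang]
    return {
        lang: reference - value
        for lang, value in scores.items()
        if lang != train_lang
    }
```

Passing the per-language results of a model trained on English, with `train_lang="en"`, would give one gap per testing language, which makes it easy to see where transfer is weakest.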
In recent years, extractive question-answering systems (also known as reading comprehension models) have made significant improvements in accuracy and can now even outperform humans on some benchmarks. But these models require tens or hundreds of thousands of clean data points for training, and data sets of that size and quality are available in only a handful of languages.
In order to build QA models that work well in many more languages, the research community needs high-quality evaluation data to benchmark and measure progress. The MLQA data set is specifically designed for this purpose, providing a valuable new tool for cross-lingual research.
Evaluating cross-lingual performance in question answering.