MLQA: Evaluating cross-lingual extractive question answering

11/7/2019

What’s new:

MLQA is a multiway-aligned extractive question answering (QA) evaluation benchmark. It is designed to help the AI community improve and extend QA in more languages and to foster research in zero-shot multilingual QA approaches. MLQA contains 12,000 QA instances in English and more than 5,000 in each of six other languages: Arabic, German, Hindi, Spanish, Vietnamese, and Simplified Chinese. Facebook AI’s LASER toolkit was used to identify documents that would be suitable for MLQA.

Because MLQA is highly parallel, with each question in the dataset appearing in several languages, researchers can use it to compare transfer performance across languages. It also supports evaluation across language pairs, for example when the question is in Vietnamese and the answer is in Hindi.
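As a rough illustration of what evaluating across language pairs involves, the sketch below enumerates every question-language and context-language combination for the seven languages listed above. The two-letter language codes and the names `LANGUAGES` and `evaluation_pairs` are assumptions made for this example, not identifiers taken from the MLQA release. Because each question appears in several languages rather than necessarily all seven, the number of instances available for a given pair will vary.

```python
from itertools import product

# Two-letter codes for the seven MLQA languages named in this post.
# The codes themselves are an assumption made for illustration.
LANGUAGES = {
    "en": "English",
    "ar": "Arabic",
    "de": "German",
    "hi": "Hindi",
    "es": "Spanish",
    "vi": "Vietnamese",
    "zh": "Simplified Chinese",
}


def evaluation_pairs(languages):
    """Return every (question_language, context_language) combination."""
    return list(product(languages, repeat=2))


pairs = evaluation_pairs(LANGUAGES)

# Same-language pairs: question and answer context share a language.
same_language = [(q, c) for q, c in pairs if q == c]

# Mixed-language pairs: e.g. a Vietnamese question over a Hindi context.
mixed_language = [(q, c) for q, c in pairs if q != c]

print(len(pairs), len(same_language), len(mixed_language))  # 49 7 42
```

With seven languages this gives 49 combinations in total: 7 where the question and context share a language and 42 where they differ, including the Vietnamese-question, Hindi-answer case mentioned above.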

We have evaluated current state-of-the-art cross-lingual representations on MLQA and found a significant performance gap between the training language and the testing languages. We challenge the QA and cross-lingual research community to come together to make progress on this new cross-lingual understanding task. To help with this effort, we are providing machine translation-based baselines as well as the results of our evaluations.

This graphic illustrates the MLQA annotation pipeline (simplified to show just one target language). As shown on the left, we first identify and extract parallel sentences in Wikipedia articles on a particular topic. Human annotators then formulate corresponding questions in English. As the rightmost image shows, professional translators then translate the English questions into each target language (the questions q_i), and the answer is annotated in the target language.

Why it matters:

In recent years, extractive question-answering systems (also known as reading comprehension models) have made significant improvements in accuracy and can now even outperform humans on some benchmarks. But these models require tens or hundreds of thousands of clean training examples, and datasets of that size and quality are available in only a handful of languages.

In order to build QA models that work well in many more languages, the research community needs high-quality evaluation data to benchmark and measure progress. The MLQA dataset is specifically designed for this purpose, providing a valuable new tool for cross-lingual research.