Facebook, NYU expand available languages for natural language understanding systems

October 26, 2018

WHAT THE RESEARCH IS:

The XNLI dataset, created for evaluating cross-lingual approaches to natural language understanding (NLU). This collaboration between Facebook and New York University builds on the commonly used Multi-Genre Natural Language Inference (MultiNLI) corpus, adding 14 languages to that English-only dataset, including two low-resource languages: Swahili and Urdu.

HOW IT WORKS:

Most existing NLU models are trained on supervised data (data that has been manually labeled for training purposes) from a single language. But as researchers look to increase the number of languages their systems can understand, gathering and annotating data in every language does not scale. One potential solution is cross-lingual language understanding, an approach that trains a model on data in one language and then tests that model on other languages. The Cross-Lingual Natural Language Inference (XNLI) dataset advances this approach by providing that test data: a total of 112,500 annotated sentence pairs, extending the MultiNLI dataset to 14 additional languages. XNLI also includes several baselines to help others create systems that understand multiple languages. Two of those baselines are based on machine translation systems; the other two use parallel data and are intended for researchers with limited compute resources for training their systems.
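The evaluation setup described above (train in one language, test in others) can be illustrated with a minimal Python sketch that scores a single classifier separately on each test language. Everything here is an assumption made for illustration: the `Example` type, the `accuracy_by_language` helper, the `always_neutral` stand-in model, and the toy sentence pairs are hypothetical and are not taken from the XNLI release. The three labels follow the standard natural language inference setup.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, NamedTuple

# Standard three-way natural language inference labels.
LABELS = ("entailment", "neutral", "contradiction")


class Example(NamedTuple):
    """One labeled sentence pair (hypothetical container, not an XNLI type)."""
    language: str    # language code such as "en", "sw", "ur"
    premise: str
    hypothesis: str
    label: str       # one of LABELS


def accuracy_by_language(
    predict: Callable[[str, str], str],
    examples: Iterable[Example],
) -> Dict[str, float]:
    """Return the fraction of correct predictions for each language."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for ex in examples:
        total[ex.language] += 1
        if predict(ex.premise, ex.hypothesis) == ex.label:
            correct[ex.language] += 1
    return {lang: correct[lang] / total[lang] for lang in total}


def always_neutral(premise: str, hypothesis: str) -> str:
    """Hypothetical stand-in for a model trained only on English data."""
    return "neutral"


# Toy placeholder examples, not real XNLI data.
toy_test_set = [
    Example("en", "A man is sleeping.", "A person is asleep.", "entailment"),
    Example("en", "A man is sleeping.", "He is dreaming.", "neutral"),
    Example("sw", "<premise in Swahili>", "<hypothesis in Swahili>", "neutral"),
    Example("ur", "<premise in Urdu>", "<hypothesis in Urdu>", "contradiction"),
]

scores = accuracy_by_language(always_neutral, toy_test_set)
print(scores)
```

Reporting accuracy per language rather than as one pooled number makes the gap between English and lower-resource languages such as Swahili and Urdu visible, which is the comparison a cross-lingual test set is meant to support.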

WHY IT MATTERS:

In addition to expanding on a widely used NLU research dataset, XNLI contributes to the broader research goal of building AI systems that can understand a wider range of languages. And by including sentence pairs and results for Swahili and Urdu, the dataset also supports research related to low-resource languages and helps move the field away from English-centric NLU models. The dataset is available to download.