XNLI is a data set created for evaluating cross-lingual approaches to natural language understanding (NLU). This collaboration between Facebook and New York University builds on the commonly used Multi-Genre Natural Language Inference (MultiNLI) corpus, adding 14 languages to that English-only data set, including two low-resource languages: Swahili and Urdu.
Most existing NLU models are trained on supervised data (data that has been manually labeled for training purposes) from a single language. But as researchers look to increase the number of languages their systems can understand, gathering and annotating data in every language does not scale. One potential solution is cross-lingual language understanding, an approach that trains a model on data in one language and then tests that model in other languages. The Cross-Lingual Natural Language Inference (XNLI) data set advances this approach by providing that test data, adding a total of 112,500 annotated sentence pairs to the MultiNLI data set, in 14 additional languages. XNLI also includes several baselines to assist others in creating systems that understand multiple languages. Two of those baselines are based on AI machine translation systems; the other two use parallel data and are aimed at researchers who have limited compute resources for training their systems.
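To make the train-in-one-language, test-in-others idea concrete, here is a minimal Python sketch of that evaluation protocol. It is an illustration under stated assumptions, not the XNLI baselines themselves: the function names (`train_majority_baseline`, `accuracy_by_language`), the tuple layout, and the placeholder sentences are all hypothetical, and the "model" is a trivial majority-class predictor standing in for a real NLU system.

```python
from collections import Counter, defaultdict


def train_majority_baseline(train_examples):
    """Train on (premise, hypothesis, label) triples from ONE language.

    Stand-in for a real model: it always predicts the most frequent
    training label, whatever the input sentences are.
    """
    counts = Counter(label for _, _, label in train_examples)
    majority = counts.most_common(1)[0][0]
    return lambda premise, hypothesis: majority


def accuracy_by_language(predict, test_examples):
    """Score the model separately on each language in the test set.

    test_examples holds (language, premise, hypothesis, label) tuples.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for lang, premise, hypothesis, label in test_examples:
        total[lang] += 1
        correct[lang] += int(predict(premise, hypothesis) == label)
    return {lang: correct[lang] / total[lang] for lang in total}


# Hypothetical English-only training data (placeholders, not MultiNLI rows).
train_en = [
    ("premise en 1", "hypothesis en 1", "entailment"),
    ("premise en 2", "hypothesis en 2", "entailment"),
    ("premise en 3", "hypothesis en 3", "contradiction"),
]

# Hypothetical test data in other languages (placeholders, not XNLI rows).
test_multi = [
    ("sw", "premise sw 1", "hypothesis sw 1", "entailment"),
    ("sw", "premise sw 2", "hypothesis sw 2", "neutral"),
    ("ur", "premise ur 1", "hypothesis ur 1", "entailment"),
    ("ur", "premise ur 2", "hypothesis ur 2", "contradiction"),
]

predict = train_majority_baseline(train_en)
scores = accuracy_by_language(predict, test_multi)
for lang, acc in sorted(scores.items()):
    print(f"{lang}: {acc:.2f}")
```

Reporting accuracy per language, rather than one pooled number, is the point of the sketch: it shows where a model trained on a single language transfers well and where it falls short, for example on low-resource languages.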
In addition to expanding on a widely used NLU research data set, XNLI contributes to the broader research goal of building AI systems that can understand a wider range of languages. And by including sentence pairs and results for Swahili and Urdu, the data set also supports research related to low-resource languages and helps move the field away from English-centric NLU models. The data set is available to download.