Facebook Research at EMNLP 2019

11/2/2019

The 2019 Conference on Empirical Methods in Natural Language Processing (EMNLP) is taking place in Hong Kong this week from November 3 to November 7. Organized by the ACL Special Interest Group on Linguistic Data (SIGDAT), EMNLP is one of the leading research conferences in the field of natural language processing, with three main conference days and two days of workshops and tutorials. Facebook researchers will join 2,000 to 2,500 academics, industry professionals, and researchers from around the world to discuss the latest advances in NLP and computational linguistics.

Facebook Research Scientist Kyunghyun Cho is one of three keynote speakers at EMNLP, and will be giving a presentation entitled Curiosity-driven Journey into Neural Sequence Models. Read Kyunghyun’s talk abstract below.

“In this talk, I take the audience on a tour of my earlier and recent experiences in building neural sequence models. I start from an early experience of using a recurrent net for sequence-to-sequence learning and talk about the attention mechanism. I discuss factors behind the success of these earlier approaches, and how these were embraced by the community even before they became state of the art. I then move on to more recent research directions in unconventional neural sequence models that automatically learn to decide on the order of generation.”

Facebook researchers will be presenting more than 25 papers in oral presentations, poster sessions, and workshops. For those attending the conference, be sure to stop by Facebook Research booth C to chat with our program managers, researchers, and recruiters.

A day-by-day schedule of research being presented at EMNLP is available here.

Facebook research being presented at EMNLP

A Discrete Hard EM Approach for Weakly Supervised Question Answering

Sewon Min, Danqi Chen, Hannaneh Hajishirzi, and Luke Zettlemoyer

Many question answering (QA) tasks only provide weak supervision for how the answer should be computed. For example, TRIVIAQA answers are entities that can be mentioned multiple times in supporting documents, while DROP answers can be computed by deriving many different equations from numbers in the reference text. In this paper, we show it is possible to convert such tasks into discrete latent variable learning problems with a precomputed, task-specific set of possible solutions (e.g. different mentions or equations) that contains one correct option. We then develop a hard EM learning scheme that computes gradients relative to the most likely solution at each update. Despite its simplicity, we show that this approach significantly outperforms previous methods on six QA tasks, including absolute gains of 2–10%, and achieves the state-of-the-art on five of them. Using hard updates instead of maximizing marginal likelihood is key to these results as it encourages the model to find the one correct answer, which we show through detailed qualitative analysis.
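To make the contrast between hard updates and marginal likelihood concrete, here is a minimal sketch. It is not the authors' implementation: it assumes a model has already assigned a log-probability to each precomputed candidate solution, and the function names and example numbers are hypothetical.

```python
import math

def mml_loss(log_probs):
    """Negative log of the total probability over all candidate solutions."""
    m = max(log_probs)
    # log-sum-exp, shifted by the max for numerical stability
    return -(m + math.log(sum(math.exp(lp - m) for lp in log_probs)))

def hard_em_loss(log_probs):
    """Negative log-probability of the single most likely candidate solution."""
    return -max(log_probs)

# Hypothetical log-probabilities for three candidate solutions of one question.
candidates = [math.log(0.5), math.log(0.2), math.log(0.05)]
print("hard EM loss:", hard_em_loss(candidates))
print("MML loss:", mml_loss(candidates))
```

The hard loss depends only on the best-scoring candidate, so its gradient pushes probability toward that one solution, whereas the marginal loss spreads credit across every candidate in the set.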

BERT for Coreference Resolution: Baselines and Analysis

Mandar Joshi, Omer Levy, Luke Zettlemoyer, and Daniel Weld

We apply BERT to coreference resolution, achieving strong improvements on the OntoNotes (+3.9 F1) and GAP (+11.5 F1) benchmarks. A qualitative analysis of model predictions indicates that, compared to ELMo and BERT-base, BERT-large is particularly better at distinguishing between related but distinct entities (e.g., president and CEO). However, there is still room for improvement in modeling document-level context, conversations, and mention paraphrasing. Our code and models are publicly available.

Bridging the Gap Between Relevance Matching and Semantic Matching for Short Text Similarity Modeling

Jinfeng Rao, Linqing Liu, Yi Tay, Wei Yang, Peng Shi, and Jimmy Lin

A core problem of information retrieval (IR) is relevance matching, which is to rank documents by relevance to a user’s query. On the other hand, many NLP problems, such as question answering and paraphrase identification, can be considered variants of semantic matching, which is to measure the semantic distance between two pieces of short texts. While at a high level both relevance and semantic matching require modeling textual similarity, many existing techniques for one cannot be easily adapted to the other. To bridge this gap, we propose a novel model, HCAN (Hybrid Co-Attention Network), which comprises (1) a hybrid encoder module that includes ConvNet-based and LSTM-based encoders, (2) a relevance matching module that measures soft term matches with importance weighting at multiple granularities, and (3) a semantic matching module with co-attention mechanisms that capture context-aware semantic relatedness. Evaluations on multiple IR and NLP benchmarks demonstrate state-of-the-art effectiveness compared to approaches that do not exploit pretraining on external data. Extensive ablation studies suggest that relevance and semantic matching signals are complementary across many problem settings, regardless of the choice of underlying encoders.

Build It Break It Fix It for Dialogue Safety: Robustness from Adversarial Human Attack

Emily Dinan, Samuel Humeau, Bharath Chintagunta, and Jason Weston

The detection of offensive language in the context of a dialogue has become an increasingly important application of natural language processing. The detection of trolls in public forums (Galan-García et al., 2016), and the deployment of chatbots in the public domain (Wolf et al., 2017) are two examples that show the necessity of guarding against adversarially offensive behavior on the part of humans. In this work, we develop a training scheme for a model to become robust to such human attacks by an iterative build it, break it, fix it strategy with humans and models in the loop. In detailed experiments we show this approach is considerably more robust than previous systems. Further, we show that offensive language used within a conversation critically depends on the dialogue context, and cannot be viewed as a single sentence offensive detection task as in most previous work. Our newly collected tasks and methods are all made open source and publicly available.

Cloze-Driven Pretraining of Self-Attention Networks

Alexei Baevski, Sergey Edunov, Yinhan Liu, Luke Zettlemoyer, and Michael Auli

We present a new approach for pretraining a bidirectional transformer model that provides significant performance gains across a variety of language understanding problems. Our model solves a cloze-style word reconstruction task, where each word is ablated and must be predicted given the rest of the text. Experiments demonstrate large performance gains on GLUE and new state-of-the-art results on NER as well as constituency parsing benchmarks, consistent with BERT. We also present a detailed analysis of a number of factors that contribute to effective pretraining, including data domain and size, model capacity, and variations on the cloze objective.
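As a rough illustration of the cloze-style training signal described above, the sketch below ablates one word at a time and pairs the remaining context with the word to be predicted. The mask symbol and function name are assumptions, and this shows only the shape of the task, not how the paper's model computes its predictions.

```python
def cloze_examples(tokens, mask="<mask>"):
    """Yield (masked_tokens, position, target) with one word ablated at a time."""
    for i, target in enumerate(tokens):
        masked = tokens[:i] + [mask] + tokens[i + 1:]
        yield masked, i, target

examples = list(cloze_examples("the cat sat".split()))
for masked, position, target in examples:
    print(" ".join(masked), "->", target)
```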

CLUTRR — A Diagnostic Benchmark for Inductive Reasoning from Text

Koustuv Sinha, Shagun Sodhani, Jin Dong, Joelle Pineau, and William L. Hamilton

The recent success of natural language understanding (NLU) systems has been troubled by results highlighting the failure of these models to generalize in a systematic and robust way. In this work, we introduce a diagnostic benchmark suite, named CLUTRR, to clarify some key issues related to the robustness and systematicity of NLU systems. Motivated by classic work on inductive logic programming, CLUTRR requires that an NLU system infer kinship relations between characters in short stories. Successful performance on this task requires both extracting relationships between entities as well as inferring the logical rules governing these relationships. CLUTRR allows us to precisely measure a model’s ability for systematic generalization by evaluating on held-out combinations of logical rules, and it allows us to evaluate a model’s robustness by adding curated noise facts. Our empirical results highlight a substantial performance gap between state-of-the-art NLU models (e.g., BERT and MAC) and a graph neural network model that works directly with symbolic inputs — with the graph-based model exhibiting both stronger generalization and greater robustness.
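The kind of rule composition that this task calls for can be sketched in a few lines. The rule table below is a tiny hypothetical example, not the benchmark's actual rule set; a chain such as ["mother", "brother"] reads as "A's mother is B, and B's brother is C".

```python
# Hypothetical composition rules: (relation A->B, relation B->C) gives the
# relation A->C, read as "C is A's <relation>".
RULES = {
    ("mother", "mother"): "grandmother",
    ("father", "mother"): "grandmother",
    ("mother", "brother"): "uncle",
    ("father", "sister"): "aunt",
}

def infer_relation(chain):
    """Fold a chain of relations into one, or return None if no rule applies."""
    current = chain[0]
    for nxt in chain[1:]:
        current = RULES.get((current, nxt))
        if current is None:
            return None
    return current

print(infer_relation(["mother", "brother"]))
print(infer_relation(["mother", "mother", "brother"]))
```

The second call fails because the table has no rule for a grandmother's brother, which is the flavor of held-out rule combination the abstract refers to.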

Countering Language Drift via Grounding

Jason Lee, Kyunghyun Cho, and Douwe Kiela

While reinforcement learning shows a lot of promise for multi-agent communication — e.g., when fine-tuning agents to achieve a certain objective by communicating — there has been little investigation into potential language drift: When an external reward is used to train a system, the agents’ communication protocol may easily and radically diverge from natural language. We investigate what constraints to impose in order to mitigate drift, and show that a combination of syntactic and semantic (via grounding) constraints gives the best communication performance, allowing pretrained agents to retain English syntax while learning to convey the intended meaning.

Don’t Forget the Long Tail! A Comprehensive Analysis of Morphological Generalization in Bilingual Lexicon Induction

Paula Czarnowska, Sebastian Ruder, Edouard Grave, Ryan Cotterell, and Ann Copestake

Human translators routinely have to translate rare inflections of words because of the Zipfian distribution of words in a language. When translating from Spanish, a good translator would have no problem identifying the proper translation of a statistically rare inflection such as habláramos. Note that the lexeme itself, hablar, is relatively common. In this work, we investigate whether state-of-the-art bilingual lexicon inducers are capable of learning this kind of generalization. We introduce 40 morphologically complete dictionaries in 10 languages and evaluate three of the state-of-the-art models on the task of translation of less frequent morphological forms. We demonstrate that the performance of state-of-the-art models drops considerably when evaluated on infrequent morphological inflections and then show that adding a simple morphological constraint at training time improves the performance, proving that the bilingual lexicon inducers can benefit from better encoding of morphology.

EASSE: Easier Automatic Sentence Simplification Evaluation

Fernando Alva-Manchego, Louis Martin, Carolina Scarton, and Lucia Specia

We introduce EASSE, a Python package aiming to facilitate and standardize automatic evaluation and comparison of Sentence Simplification (SS) systems. EASSE provides a single access point to a broad range of evaluation resources: standard automatic metrics for assessing SS outputs (e.g., SARI), word-level accuracy scores for certain simplification transformations, reference-independent quality estimation features (e.g., compression ratio), and standard test data for SS evaluation (e.g., TurkCorpus). Finally, EASSE generates easy-to-visualize reports on the various metrics and features above, and on how a particular SS output fares against reference simplifications. Through experiments, we show that these functionalities allow for better comparison and understanding of the performance of SS systems.
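For readers unfamiliar with reference-independent features, the sketch below shows one plausible way to compute a compression ratio. It does not call EASSE; the function name and the character-level definition are assumptions made for illustration.

```python
def compression_ratio(source, simplified):
    """Length of the simplified sentence divided by the source length, in characters."""
    if not source:
        raise ValueError("source sentence must be non-empty")
    return len(simplified) / len(source)

source = "The committee reached a unanimous decision after lengthy deliberation."
simplified = "The committee agreed after a long talk."
print(round(compression_ratio(source, simplified), 2))
```

A value below 1 means the output is shorter than the source; no reference simplification is needed to compute it.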

EGG: A Toolkit for Research on Emergence of lanGuage in Games

Eugene Kharitonov, Rahma Chaabouni, Diane Bouchacourt, and Marco Baroni

There is renewed interest in simulating language emergence among deep neural agents that communicate to jointly solve a task, spurred by the practical aim to develop language-enabled interactive AIs, as well as by theoretical questions about the evolution of human language. However, optimizing deep architectures connected by a discrete communication channel (such as that in which language emerges) is technically challenging. We introduce EGG, a toolkit that greatly simplifies the implementation of emergent-language communication games. EGG’s modular design provides a set of building blocks that the user can combine to create new games, easily navigating the optimization and architecture space. We hope that the tool will lower the technical barrier and encourage researchers from various backgrounds to do original work in this exciting area.

Emergent Linguistic Phenomena in Multi-Agent Communication Games

Laura Graesser, Kyunghyun Cho, and Douwe Kiela

We describe a multi-agent communication framework for examining high-level linguistic phenomena at the community level. We demonstrate that complex linguistic behavior observed in natural language can be reproduced in this simple setting: 1) the outcome of contact between communities is a function of inter- and intragroup connectivity; 2) linguistic contact either converges to the majority protocol, or, in balanced cases leads to novel creole languages of lower complexity; and 3) a linguistic continuum emerges where neighboring languages are more mutually intelligible than farther removed languages. We conclude that at least some of the intricate properties of language evolution need not depend on complex evolved linguistic capabilities, but can emerge from simple social exchanges between perceptually enabled agents playing communication games.

Finding Generalizable Evidence by Learning to Convince Q and A Models

Ethan Perez, Siddharth Karamcheti, Rob Fergus, Jason Weston, Douwe Kiela, and Kyunghyun Cho

We propose a system that finds the strongest supporting evidence for a given answer to a question, using passage-based question-answering (QA) as a testbed. We train evidence agents to select the passage sentences that most convince a pretrained QA model of a given answer, if the QA model received those sentences instead of the full passage. Rather than finding evidence that convinces one model alone, we find that agents select evidence that generalizes; agent-chosen evidence increases the plausibility of the supported answer, as judged by other QA models and humans. Given its general nature, this approach improves QA in a robust manner: Using agent-selected evidence, 1) humans can correctly answer questions with only ∼20% of the full passage, and 2) QA models can generalize to longer passages and harder questions.

The FLORES Evaluation Datasets for Low-Resource Machine Translation: Nepali-English and Sinhala-English

Francisco Guzmán, Peng-Jen Chen, Myle Ott, Juan Pino, Guillaume Lample, Philipp Koehn, Vishrav Chaudhary, and Marc’Aurelio Ranzato

For machine translation, a vast majority of language pairs in the world are considered low-resource because they have little parallel data available. Besides the technical challenges of learning with limited supervision, it is difficult to evaluate methods trained on low-resource language pairs because of the lack of freely and publicly available benchmarks. In this work, we introduce the FLORES evaluation datasets for Nepali-English and Sinhala-English, based on sentences translated from Wikipedia. Compared to English, these are languages with very different morphology and syntax, for which little out-of-domain parallel data is available and for which relatively large amounts of monolingual data are freely available. We describe our process to collect and cross-check the quality of translations, and we report baseline performance using several learning settings: fully supervised, weakly supervised, semisupervised, and fully unsupervised. Our experiments demonstrate that current state-of-the-art methods perform rather poorly on this benchmark, posing a challenge to the research community working on low-resource MT. Data and code to reproduce our experiments are available here.

FlowSeq: Non-Autoregressive Conditional Sequence Generation with Generative Flow

Xuezhe Ma, Chunting Zhou, Xian Li, Graham Neubig, and Eduard Hovy

Most sequence-to-sequence (seq2seq) models are autoregressive; they generate each token by conditioning on previously generated tokens. In contrast, non-autoregressive seq2seq models generate all tokens in one pass, which leads to increased efficiency through parallel processing on hardware such as GPUs. However, directly modeling the joint distribution of all tokens simultaneously is challenging, and, even with increasingly complex model structures, accuracy lags significantly behind autoregressive models. In this paper, we propose a simple, efficient, and effective model for non-autoregressive sequence generation using latent variable models. Specifically, we turn to generative flow, an elegant technique to model complex distributions using neural networks, and design several layers of flow tailored for modeling the conditional density of sequential latent variables. We evaluate this model on three neural machine translation (NMT) benchmark datasets, achieving comparable performance with state-of-the-art non-autoregressive NMT models and almost constant decoding time with respect to the sequence length.

Improving Generative Visual Dialog by Answering Diverse Questions

Vishvak S. Murahari, Prithvijit Chattopadhyay, Dhruv Batra, Devi Parikh, and Abhishek Das

Prior work on training generative Visual Dialog models with reinforcement learning (Das et al., 2017b) has explored a Q-BOT-A-BOT image-guessing game and shown that this “self-talk” approach can lead to improved performance at the downstream dialog-conditioned image-guessing task. However, this improvement saturates and starts degrading after a few rounds of interaction and does not lead to a better Visual Dialog model. We find that this is due in part to repeated interactions between Q-BOT and A-BOT during self-talk, which are not informative with respect to the image. To improve this, we devise a simple auxiliary objective that incentivizes Q-BOT to ask diverse questions, thus reducing repetitions and in turn enabling A-BOT to explore a larger state space during RL, i.e., be exposed to more visual concepts to talk about, and varied questions to answer. We evaluate our approach via a host of automatic metrics and human studies, and demonstrate that it leads to better dialog, i.e., dialog that is more diverse (less repetitive), consistent (has fewer conflicting exchanges), fluent (more humanlike), and detailed, while still being comparably image-relevant as prior work and ablations.

Language Models as Knowledge Bases?

Fabio Petroni, Tim Rocktaschel, Sebastian Riedel, Patrick Lewis, Anton Bakhtin, Yuxiang Wu, and Alexander H. Miller

Recent progress in pretraining language models on large textual corpora led to a surge of improvements for downstream NLP tasks. Whilst learning linguistic knowledge, these models may also be storing relational knowledge present in the training data, and may be able to answer queries structured as “fill-in-the-blank” cloze statements. Language models have many advantages over structured knowledge bases: They require no schema engineering, allow practitioners to query about an open class of relations, are easy to extend to more data, and require no human supervision to train. We present an in-depth analysis of the relational knowledge already present (without fine-tuning) in a wide range of state-of-the-art pretrained language models. We find that 1) without fine-tuning, BERT contains relational knowledge competitive with traditional NLP methods that have some access to oracle knowledge, 2) BERT also does remarkably well on open-domain question answering against a supervised baseline, and 3) certain types of factual knowledge are learned much more readily than others by standard language model pretraining approaches. The surprisingly strong ability of these models to recall factual knowledge without any fine-tuning demonstrates their potential as unsupervised open-domain QA systems. The code to reproduce our analysis is available here.

Learning Programmatic Idioms for Scalable Semantic Parsing

Srinivasan Iyer, Alvin Cheung, and Luke Zettlemoyer

Programmers typically organize executable source code using high-level coding patterns or idiomatic structures such as nested loops, exception handlers and recursive blocks, rather than as individual code tokens. In contrast, state-of-the-art (SOTA) semantic parsers still map natural language instructions to source code by building the code syntax tree one node at a time. In this paper, we introduce an iterative method to extract code idioms from large source code corpora by repeatedly collapsing most-frequent depth-2 subtrees of their syntax trees, and train semantic parsers to apply these idioms during decoding. Applying idiom-based decoding on a recent context-dependent semantic parsing task improves the SOTA by 2.2% BLEU score while reducing training time by more than 50%. This improved speed enables us to scale up the model by training on an extended training set that is 5× larger, to further move up the SOTA by an additional 2.3% BLEU and 0.9% exact match. Finally, idioms also significantly improve accuracy of semantic parsing to SQL on the ATIS-SQL dataset when training data is limited.
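The counting step behind idiom extraction can be sketched as follows. This is a simplified illustration under stated assumptions, not the paper's procedure: trees are nested tuples of the form (label, child, ...), leaves are plain strings, and a "depth-2 fragment" is taken to be a parent rule together with one expanded child rule. The paper's method would go on to collapse the most frequent fragment into a new rule and repeat; that step is omitted here.

```python
from collections import Counter

def label(node):
    """Label of a node, whether it is a leaf string or an internal tuple."""
    return node if isinstance(node, str) else node[0]

def rule(node):
    """An internal node together with the labels of its direct children."""
    return (node[0], tuple(label(child) for child in node[1:]))

def depth2_fragments(tree):
    """Yield (parent rule, child index, child rule) for every internal child."""
    if isinstance(tree, str):
        return
    for i, child in enumerate(tree[1:]):
        if not isinstance(child, str):
            yield (rule(tree), i, rule(child))
            yield from depth2_fragments(child)

def most_frequent_fragment(trees):
    """Return (fragment, count) for the most common fragment, or None."""
    counts = Counter(f for tree in trees for f in depth2_fragments(tree))
    return counts.most_common(1)[0] if counts else None

# Hypothetical syntax trees: two nested loops and one simple loop.
nested = ("For", "Name", ("For", "Name", ("Call", "Name")))
simple = ("For", "Name", ("Call", "Name"))
fragment, count = most_frequent_fragment([nested, nested, simple])
print(fragment, count)
```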

Learning to Speak and Act in a Fantasy Text Adventure Game

Jack Urbanek, Angela Fan, Siddharth Karamcheti, Saachi Jain, Samuel Humeau, Emily Dinan, Tim Rocktaschel, Douwe Kiela, Arthur Szlam, and Jason Weston

We introduce a large-scale crowdsourced text adventure game as a research platform for studying grounded dialogue. In it, agents can perceive, emote, and act whilst conducting dialogue with other agents. Models and humans can both act as characters within the game. We describe the results of training state-of-the-art generative and retrieval models in this setting. We show that in addition to using past dialogue, these models are able to effectively use the state of the underlying world to condition their predictions. In particular, we show that grounding on the details of the local environment, including location descriptions, and the objects (and their affordances) and characters (and their previous actions) present within it allows better predictions of agent behavior and dialogue. We analyze the ingredients necessary for successful grounding in this setting, and how each of these factors relates to agents that can talk and act successfully.

Mask-Predict: Parallel Decoding of Conditional Masked Language Models

Marjan Ghazvininejad, Omer Levy, Yinhan Liu, and Luke Zettlemoyer

Most machine translation systems generate text autoregressively from left to right. We, instead, use a masked language modeling objective to train a model to predict any subset of the target words, conditioned on both the input text and a partially masked target translation. This approach allows for efficient iterative decoding, where we first predict all of the target words non-autoregressively, and then repeatedly mask out and regenerate the subset of words that the model is least confident about. By applying this strategy for a constant number of iterations, our model improves state-of-the-art performance levels for non-autoregressive and parallel decoding translation models by over 4 BLEU on average. It is also able to reach within about 1 BLEU point of a typical left-to-right transformer model, while decoding significantly faster.
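The iterative decoding loop described above can be sketched as follows. This is an illustration under assumptions rather than the authors' code: `predict` is a hypothetical stand-in for the trained model, the target length is assumed to be given, and the number of re-masked positions is assumed to shrink linearly to zero over the iterations.

```python
def mask_predict(predict, length, iterations, mask="<mask>"):
    """Iteratively re-predict the least confident target positions.

    `predict(tokens)` is a hypothetical model call that returns one
    (token, probability) pair per target position.
    """
    tokens = [mask] * length
    probs = [0.0] * length
    masked = list(range(length))  # the first pass predicts every position
    for t in range(iterations):
        predictions = predict(tokens)
        for i in masked:
            tokens[i], probs[i] = predictions[i]
        # Re-mask fewer positions each round, none after the last round.
        n_mask = length * (iterations - t - 1) // iterations
        masked = sorted(range(length), key=lambda i: probs[i])[:n_mask]
        for i in masked:
            tokens[i] = mask
    return tokens

TARGET = ["a", "b", "c", "d"]

def toy_predict(tokens):
    """Toy model: only confident once some target context is filled in."""
    has_context = any(tok != "<mask>" for tok in tokens)
    return [(gold, 0.9) if has_context or i == 0 else ("?", 0.1)
            for i, gold in enumerate(TARGET)]

print(mask_predict(toy_predict, length=4, iterations=4))
```

With the toy model, the first pass gets only one word right, and later passes repair the low-confidence positions once context is available.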

Massively Multilingual Sentence Embeddings for Zero-Shot Cross-Lingual Transfer and Beyond

Mikel Artetxe and Holger Schwenk

We introduce an architecture to learn joint multilingual sentence representations for 93 languages, belonging to more than 30 different families and written in 28 different scripts. Our system uses a single BiLSTM encoder with a shared BPE vocabulary for all languages, which is coupled with an auxiliary decoder and trained on publicly available parallel corpora. This enables us to learn a classifier on top of the resulting embeddings using English annotated data only, and transfer it to any of the 93 languages without any modification. Our experiments in cross-lingual natural language inference (XNLI dataset), cross-lingual document classification (MLDoc dataset) and parallel corpus mining (BUCC dataset) show the effectiveness of our approach. We also introduce a new test set of aligned sentences in 112 languages, and show that our sentence embeddings obtain strong results in multilingual similarity search even for low-resource languages. Our implementation, the pretrained encoder, and the multilingual test set are available here.
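Multilingual similarity search over sentence embeddings usually comes down to a nearest-neighbor lookup. The sketch below assumes cosine similarity and uses made-up two-dimensional vectors; in practice the vectors would come from the pretrained encoder, which is not called here.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def nearest(query, candidates):
    """Return the index of the candidate embedding most similar to the query."""
    return max(range(len(candidates)), key=lambda i: cosine(query, candidates[i]))

# Hypothetical embeddings: one query sentence and three candidate translations.
query = [1.0, 0.0]
candidates = [[0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]]
print(nearest(query, candidates))
```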

Memory-Grounded Conversational Reasoning

Seungwhan Moon, Pararth Shah, Anuj Kumar, and Rajen Subba

We demonstrate a conversational system which engages the user through a multimodal, multi-turn dialog over the user’s memories. The system can perform QA over memories by responding to user queries to recall specific attributes and associated media (e.g., photos) of past episodic memories. The system can also make proactive suggestions to surface related events or facts from past memories to make conversations more engaging and natural. To implement such a system, we collect a new corpus of memory-grounded conversations, which comprises human-to-human role-playing dialogs given synthetic memory graphs with simulated attributes. Our proof-of-concept system operates on these synthetic memory graphs; however, it can be trained and applied to real-world user memory data (e.g., photo albums). We present the architecture of the proposed conversational system and example queries that the system supports.

Quantifying the Semantic Core of Gender Systems

Adina Williams, Ryan Cotterell, Lawrence Wolf-Sonkin, Damián Blasi, and Hanna Wallach

Many of the world’s languages employ grammatical gender on the lexeme. For example, in Spanish, the word for house (casa) is feminine, whereas the word for paper (papel) is masculine. To a speaker of a genderless language, this assignment seems to exist with neither rhyme nor reason. But is the assignment of inanimate nouns to grammatical genders truly arbitrary? We present the first large-scale investigation of the arbitrariness of noun gender assignments. To that end, we use canonical correlation analysis to correlate the grammatical gender of inanimate nouns with an externally grounded definition of their lexical semantics. We find that 18 languages exhibit a significant correlation between grammatical gender and lexical semantics.

Recommendation as a Communication Game: Self-Role-Playing for Goal-Oriented Dialogue

Dongyeop Kang, Anusha Balakrishnan, Pararth Shah, Paul Crook, Y-Lan Boureau, and Jason Weston

Traditional recommendation systems produce static rather than interactive recommendations invariant to a user’s specific requests, clarifications, or current mood, and can suffer from the cold-start problem if their tastes are unknown. These issues can be alleviated by treating recommendation as an interactive dialogue task instead, where an expert recommender can sequentially ask about someone’s preferences, react to their requests, and recommend more appropriate items. In this work, we collect a goal-driven recommendation dialogue dataset (GoRecDial), which consists of 9,125 dialogue games and 81,260 conversation turns between pairs of human workers recommending movies to each other. The task is specifically designed as a cooperative game between two players working toward a quantifiable common goal. We leverage the dataset to develop an end-to-end dialogue system that can simultaneously converse and recommend. Models are first trained to imitate the behavior of human players without considering the task goal itself (supervised training). We then fine-tune our models on simulated bot-bot conversations between two paired pretrained models (bot-play), in order to achieve the dialogue goal. Our experiments show that models fine-tuned with bot-play learn improved dialogue strategies, reach the dialogue goal more often when paired with a human, and are rated as more consistent by humans compared to models trained without bot-play. The dataset and code are publicly available through the ParlAI framework.

Revisiting the Evaluation of Theory of Mind Through Question Answering

Matthew Le, Y-Lan Boureau, and Maximilian Nickel

Theory of mind, i.e., the ability to reason about intents and beliefs of agents, is an important task in artificial intelligence and central to resolving ambiguous references in natural language dialogue. In this work, we revisit the evaluation of theory of mind through question answering. We show that current evaluation methods are flawed and that existing benchmark tasks can be solved without theory of mind due to dataset biases. Based on prior work, we propose an improved evaluation protocol and dataset in which we explicitly control for data regularities via a careful examination of the answer space. We show that state-of-the-art methods that are successful on existing benchmarks fail to solve theory-of-mind tasks in our proposed approach.

Simple and Effective Noisy Channel Modeling for Neural Machine Translation

Kyra Yee, Yann N. Dauphin, and Michael Auli

Previous work on neural noisy channel modeling relied on latent variable models that incrementally process the source and target sentence. This makes decoding decisions based on partial source prefixes even though the full source is available. We pursue an alternative approach based on standard sequence-to-sequence models that utilize the entire source. These models perform remarkably well as channel models, even though they have neither been trained on nor were designed to factor over incomplete target sentences. Experiments with neural language models trained on billions of words show that noisy channel models can outperform a direct model by up to 3.2 BLEU on WMT’17 German-English translation. We evaluate on four language pairs, and our channel models consistently outperform strong alternatives, such as right-to-left reranking models and ensembles of direct models.
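The difference between a direct model and a noisy channel model is easiest to see in reranking. By Bayes' rule, p(y|x) is proportional to p(x|y) p(y), so a channel model scores a candidate translation y by how well it explains the source x, combined with a language model score for y. The sketch below uses invented log-probabilities and a single hypothetical weight; the paper's exact scoring function may combine and normalize these terms differently.

```python
def rerank(candidates, lm_weight=1.0):
    """Pick the candidate maximizing log p(x|y) + lm_weight * log p(y)."""
    return max(candidates, key=lambda c: c["channel"] + lm_weight * c["lm"])

# Hypothetical scores: "direct" is log p(y|x), "channel" is log p(x|y),
# and "lm" is log p(y) from a language model.
candidates = [
    {"text": "hypothesis A", "direct": -1.0, "channel": -4.0, "lm": -6.0},
    {"text": "hypothesis B", "direct": -1.5, "channel": -3.0, "lm": -4.0},
]
best_direct = max(candidates, key=lambda c: c["direct"])
best_channel = rerank(candidates)
print(best_direct["text"], "vs", best_channel["text"])
```

In this toy case the direct model prefers hypothesis A, while the channel and language model scores together prefer hypothesis B.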

Using Local Knowledge Graph Construction to Scale Seq2Seq Models to Multi-Document Inputs

Angela Fan, Claire Gardent, Chloe Braud, and Antoine Bordes

Query-based open-domain NLP tasks require information synthesis from long and diverse web results. Current approaches extractively select portions of web text as input to sequence-to-sequence models using methods such as TF-IDF ranking. We propose constructing a local graph structured knowledge base for each query, which compresses the web search information and reduces redundancy. We show that by linearizing the graph into a structured input sequence, models can encode the graph representations within a standard sequence-to-sequence setting. For two generative tasks with very long text input (long-form question answering and multidocument summarization), feeding graph representations as input can achieve better performance than using retrieved text portions.

VizSeq: A Visual Analysis Toolkit for Text Generation Tasks

Changhan Wang, Anirudh Jain, Danlu Chen, and Jiatao Gu

Automatic evaluation of text generation tasks (e.g., machine translation, text summarization, image captioning, and video description) usually relies heavily on task-specific metrics, such as BLEU (Papineni et al., 2002) and ROUGE (Lin, 2004). These metrics, however, are abstract numbers and are not perfectly aligned with human assessment. This suggests inspecting detailed examples as a complement to identify system error patterns. In this paper, we present VizSeq, a visual analysis toolkit for instance-level and corpus-level system evaluation on a wide variety of text generation tasks. It supports multimodal sources and multiple text references, providing visualization in Jupyter notebook or a web app interface. It can be used locally or deployed onto public servers for centralized data hosting and benchmarking. It covers most common n-gram based metrics accelerated with multiprocessing, and also provides the latest embedding-based metrics such as BERTScore (Zhang et al., 2019).

Other activities at EMNLP

Workshop on Asian Translation

Paper: Facebook AI's WAT19 Myanmar-English Translation Task Submission
Peng-Jen Chen, Jiajun Shen, Matthew Le, Vishrav Chaudhary, Ahmed El-Kishky, Guillaume Wenzek, Myle Ott, and Marc’Aurelio Ranzato

Conference on Computational Natural Language Learning (CoNLL) — two-day workshop

Paper: Walk the Memory: Memory Graph Networks for Explainable Memory-Grounded Question Answering
Seungwhan Moon, Pararth Shah, Anuj Kumar, and Rajen Subba