Experts in the fields of natural language understanding, computational linguistics, and conversational AI are gathering this week in Minneapolis for the 2019 conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT). Research from Facebook will be presented in oral spotlight presentations and group poster sessions.
Our researchers and engineers will also be participating in other activities throughout the week, including a demo on FAIRSEQ, an open source sequence modeling toolkit that allows researchers and developers to train custom models for translation, summarization, language modeling, and other text generation tasks.
For those attending NAACL-HLT, be sure to visit the Facebook Research booth.
Pushkar Mishra, Marco Del Tredici, Helen Yannakoudakis, and Ekaterina Shutova
Abuse on the internet represents a significant societal problem of our time. Previous research on automated abusive language detection on Twitter has shown that community-based profiling of users is a promising technique for this task. However, existing approaches only capture shallow properties of online communities by modeling follower-following relationships. In contrast, we present the first approach that captures both the structure of online communities and the linguistic behavior of the users within them, based on graph convolutional networks (GCNs). We show that such heterogeneous graph-structured modeling of communities significantly advances the current state of the art in abusive language detection.
Visual Dialog is a multimodal task of answering a sequence of questions grounded in an image, using the conversation history as context. It entails challenges in vision, language, reasoning, and grounding. However, studying these subtasks in isolation on large, real data sets is infeasible as it requires prohibitively expensive complete annotation of the “state” of all images and dialogs.
We develop CLEVR-Dialog, a large diagnostic data set for studying multi-round reasoning in visual dialog. Specifically, we construct a dialog grammar that is grounded in the scene graphs of the images from the CLEVR data set. This combination results in a data set where all aspects of the visual dialog are fully annotated. In total, CLEVR-Dialog contains five instances of 10-round dialogs for about 85K CLEVR images, for a total of 4.25M question-answer pairs.
We use CLEVR-Dialog to benchmark performance of standard visual dialog models; in particular, on visual coreference resolution (as a function of the coreference distance). This is the first analysis of its kind for visual dialog models, and it was not possible without this data set. We hope the findings from CLEVR-Dialog will help inform the development of future models for visual dialog. Our data set and code will be made public.
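The size figure quoted for CLEVR-Dialog above can be checked with a few lines of arithmetic. This is only a sanity check on the numbers stated in the abstract; it assumes one question-answer pair per dialog round and treats the approximate "85K" as exactly 85,000.

```python
# Figures taken from the abstract; 85K is stated there as approximate.
num_images = 85_000
dialogs_per_image = 5
rounds_per_dialog = 10  # assumption: one question-answer pair per round

total_qa_pairs = num_images * dialogs_per_image * rounds_per_dialog
print(f"{total_qa_pairs:,} question-answer pairs")
```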
Serhii Havrylov, Germán Kruszewski, and Armand Joulin
There has been considerable attention devoted to models that learn to jointly infer an expression’s syntactic structure and its semantics. Yet, Nangia and Bowman (2018) have recently shown that the current best systems fail to learn the correct parsing strategy on mathematical expressions generated from a simple context-free grammar. In this work, we present a recursive model inspired by Choi et al. (2018) that reaches near perfect accuracy on this task. Our model is composed of two separate modules for syntax and semantics. They are cooperatively trained with standard continuous and discrete optimization schemes. Our model does not require any linguistic structure for supervision, and its recursive nature allows for out-of-domain generalization with little loss in performance. Additionally, our approach performs competitively on several natural language tasks, such as Natural Language Inference and Sentiment Analysis.
One of the first steps in the utterance interpretation pipeline of many task-oriented conversational AI systems is to identify user intents and the corresponding slots. Since data collection for machine learning models for this task is time-consuming, it is desirable to make use of existing data in a high-resource language to train models in low-resource languages. However, development of such models has largely been hindered by the lack of multilingual training data. In this paper, we present a new data set of 57K annotated utterances in English (43K), Spanish (8.6K), and Thai (5K) across the domains weather, alarm, and reminder. We use this data set to evaluate three different cross-lingual transfer methods: (1) translating the training data; (2) using cross-lingual pretrained embeddings; and (3) a novel method of using a multilingual machine translation encoder as contextual word representations. We find that given several hundred training examples in the target language, the latter two methods outperform translating the training data. Further, in very low-resource settings, multilingual contextual word representations give better results than using cross-lingual static embeddings. We also compare the cross-lingual methods to using monolingual resources in the form of contextual ELMo representations and find that given just small amounts of target language data, this method outperforms all cross-lingual methods, which highlights the need for more sophisticated cross-lingual methods.
Yair Lakretz, Germán Kruszewski, Théo Desbordes, Dieuwke Hupkes, Stanislas Dehaene, and Marco Baroni
Recent work has shown that LSTMs trained on a generic language modeling objective capture syntax-sensitive generalizations such as long-distance number agreement. We have, however, no mechanistic understanding of how they accomplish this remarkable feat. Some have conjectured it depends on heuristics that do not truly take hierarchical structure into account. We present here a detailed study of the inner mechanics of number tracking in LSTMs at the single neuron level. We discover that long-distance number information is largely managed by two “number units.” Importantly, the behavior of these units is partially controlled by other units independently shown to track syntactic structure. We conclude that LSTMs are, to some extent, implementing genuinely syntactic processing mechanisms, paving the way to a more general understanding of grammatical encoding in LSTMs.
Neural approaches to Natural Language Generation (NLG) have been promising for goal-oriented dialogue. One of the challenges of productionizing these approaches, however, is the ability to control response quality and ensure that generated responses are acceptable. We propose the use of a generate, filter, and rank framework, in which candidate responses are first filtered to eliminate unacceptable responses and then ranked to select the best response. While acceptability includes grammatical correctness and semantic correctness, we focus only on grammaticality classification in this paper, and show that existing data sets for grammatical error correction don’t correctly capture the distribution of errors that data-driven generators are likely to make. We release a grammatical classification and semantic correctness classification data set for the weather domain that consists of responses generated by three data-driven NLG systems. We then explore two supervised learning approaches (CNNs and GBDTs) for classifying grammaticality. Our experiments show that grammaticality classification is very sensitive to the distribution of errors in the data, and that these distributions vary significantly with both the source of the response and the domain. We show that it’s possible to achieve high precision with reasonable recall on our data set.
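The generate, filter, and rank framework described above can be sketched in a few lines. This is a minimal illustration and not the authors' system: `generate_filter_rank`, `toy_is_acceptable`, and `toy_score` are hypothetical names, the filter is a toy repeated-word check standing in for a trained grammaticality classifier, and the ranker simply prefers shorter responses.

```python
from typing import Callable, List, Optional


def generate_filter_rank(
    candidates: List[str],
    is_acceptable: Callable[[str], bool],
    score: Callable[[str], float],
) -> Optional[str]:
    """Drop unacceptable candidates, then return the highest-scoring survivor."""
    acceptable = [c for c in candidates if is_acceptable(c)]
    if not acceptable:
        return None  # every candidate was filtered out
    return max(acceptable, key=score)


def toy_is_acceptable(response: str) -> bool:
    # Hypothetical stand-in for a grammaticality classifier:
    # reject any response that repeats a word back to back.
    words = response.lower().split()
    return all(a != b for a, b in zip(words, words[1:]))


def toy_score(response: str) -> float:
    # Hypothetical stand-in for a ranker: prefer shorter responses.
    return -float(len(response.split()))


# Hypothetical candidate responses for a weather query.
candidates = [
    "It will will rain today",
    "It will rain today in Minneapolis",
    "It will rain today",
]
best = generate_filter_rank(candidates, toy_is_acceptable, toy_score)
print(best)
```

Keeping the filter separate from the ranker means a high-precision acceptability check can veto a response no matter how well it ranks.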
Angli Liu, Jingfei Du, and Veselin Stoyanov
Traditional language models are unable to efficiently model entity names observed in text. All but the most popular named entities appear infrequently in text, providing insufficient context. Recent efforts have recognized that context can be generalized between entity names that share the same type (e.g., person or location) and have equipped language models with access to an external knowledge base (KB). Our Knowledge-Augmented Language Model (KALM) continues this line of work by augmenting a traditional model with a KB. Unlike previous methods, however, we train with an end-to-end predictive objective, optimizing the perplexity of text. We do not require any additional information, such as named entity tags. In addition to improving language modeling performance, KALM learns to recognize named entities in an entirely unsupervised way by using entity type information latent in the model. On a Named Entity Recognition (NER) task, KALM achieves performance comparable with state-of-the-art supervised models. Our work demonstrates that named entities (and possibly other types of world knowledge) can be modeled successfully using predictive learning and training on large corpora of text without any additional information.
Adversarial examples, perturbations to the input of a model that elicit large changes in the output, have been shown to be an effective way of assessing the robustness of sequence-to-sequence (seq2seq) models. However, these perturbations only indicate weaknesses in the model if they do not change the input so significantly that it legitimately results in changes in the expected output. This fact has largely been ignored in the evaluations of the growing body of related literature. Using the example of untargeted attacks on machine translation (MT), we propose a new evaluation framework for adversarial attacks on seq2seq models that takes the semantic equivalence of the pre- and post-perturbation input into account. Using this framework, we demonstrate that existing methods may not preserve meaning in general, breaking the aforementioned assumption that source side perturbations should not result in changes in the expected output. We further use this framework to demonstrate that adding constraints on attacks allows for adversarial perturbations that are more meaning-preserving but nonetheless largely change the output sequence. Finally, we show that performing untargeted adversarial training with meaning-preserving attacks is beneficial to the model in terms of adversarial robustness, without hurting test performance.
Shijia Liu, Hongyuan Mei, Adina Williams, and Ryan Cotterell
While idiosyncrasies of the Chinese classifier system have been a richly studied topic among linguists (Adams and Conklin, 1973; Erbaugh, 1986; Lakoff, 1986), not much work has been done to quantify them with statistical methods. In this paper, we introduce an information-theoretic approach to measuring idiosyncrasy; we examine how much the uncertainty in Mandarin Chinese classifiers can be reduced by knowing semantic information about the nouns that the classifiers modify. Using the empirical distribution of classifiers from the parsed Chinese Gigaword corpus (Graff et al., 2005), we compute the mutual information (in bits) between the distribution over classifiers and distributions over other linguistic quantities. We investigate whether semantic classes of nouns and adjectives differ in how much they reduce uncertainty in classifier choice, and find that classifier choice is not fully idiosyncratic; while there are no obvious trends for the majority of semantic classes, shape nouns reduce uncertainty in classifier choice the most.
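As a rough illustration of the quantity involved, mutual information in bits can be estimated from co-occurrence counts as the sum over pairs of p(c, n) * log2(p(c, n) / (p(c) * p(n))). The sketch below is not the authors' pipeline: the function name and the toy (classifier, noun class) pairs are hypothetical and are not drawn from the Gigaword corpus.

```python
import math
from collections import Counter
from typing import Iterable, Tuple


def mutual_information_bits(pairs: Iterable[Tuple[str, str]]) -> float:
    """Plug-in estimate of I(X; Y) in bits from observed (x, y) pairs."""
    pairs = list(pairs)
    total = len(pairs)
    joint = Counter(pairs)
    left = Counter(x for x, _ in pairs)
    right = Counter(y for _, y in pairs)
    mi = 0.0
    for (x, y), n_xy in joint.items():
        p_xy = n_xy / total
        # p(x, y) / (p(x) * p(y)) rewritten in terms of raw counts
        ratio = n_xy * total / (left[x] * right[y])
        mi += p_xy * math.log2(ratio)
    return mi


# Hypothetical toy data: the noun class fully determines the classifier.
dependent = [("tiao", "long"), ("tiao", "long"), ("zhang", "flat"), ("zhang", "flat")]
# Hypothetical toy data: the classifier is independent of the noun class.
independent = [("tiao", "long"), ("tiao", "flat"), ("zhang", "long"), ("zhang", "flat")]

mi_dependent = mutual_information_bits(dependent)
mi_independent = mutual_information_bits(independent)
print(mi_dependent, mi_independent)
```

In the first toy set, knowing the noun class removes all uncertainty about a two-way classifier choice, so the estimate is one bit. In the second it removes none, so the estimate is zero.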
Mandar Joshi, Eunsol Choi, Omer Levy, Daniel Weld, and Luke Zettlemoyer
Reasoning about implied relationships (e.g., paraphrastic, common sense, encyclopedic) between pairs of words is crucial for many cross-sentence inference problems. This paper proposes new methods for learning and using embeddings of word pairs that implicitly represent background knowledge about such relationships. Our pairwise embeddings are computed as a compositional function on word representations, which is learned by maximizing the pointwise mutual information (PMI) with the contexts in which the two words co-occur. We add these representations to the cross-sentence attention layer of existing inference models (e.g., BiDAF for QA, ESIM for NLI), instead of extending or replacing existing word embeddings. Experiments show a gain of 2.7% on the recently released SQuAD 2.0 and 1.3% on MultiNLI. Our representations also aid in better generalization, with gains of around 6%-7% on adversarial SQuAD data sets, and 8.8% on the adversarial entailment test set by Glockner et al. (2018).
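For readers unfamiliar with pointwise mutual information, the quantity itself is PMI(x, y) = log(p(x, y) / (p(x) * p(y))). The sketch below estimates it from raw counts; it only illustrates the statistic, not the learned compositional function or training objective from the paper, and the function name and counts are hypothetical.

```python
import math


def pmi(joint_count: int, x_count: int, y_count: int, total: int) -> float:
    """Pointwise mutual information (natural log) estimated from raw counts."""
    if joint_count == 0:
        return float("-inf")  # the two events were never observed together
    # p(x, y) / (p(x) * p(y)) simplifies to joint * total / (x * y)
    return math.log((joint_count * total) / (x_count * y_count))


# Hypothetical counts: out of 1,000 observations, a word pair occurs 50 times,
# a context occurs 40 times, and they occur together 20 times.
associated = pmi(joint_count=20, x_count=50, y_count=40, total=1000)
# Hypothetical counts matching exactly what independence would predict.
unrelated = pmi(joint_count=2, x_count=50, y_count=40, total=1000)
print(associated, unrelated)
```

A positive value means the pair and the context co-occur more often than chance would predict, and a value near zero means they look independent.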
Pretrained language model representations have been successful in a wide range of language understanding tasks. In this paper, we examine different strategies to integrate pretrained representations into sequence-to-sequence models and apply them to neural machine translation and abstractive summarization. We find that pretrained representations are most effective when added to the encoder network, which slows inference by only 14%. Our experiments in machine translation show gains of up to 5.3 BLEU in a simulated resource-poor setup. While returns diminish with more labeled data, we still observe improvements when millions of sentence pairs are available. Finally, on abstractive summarization we achieve a new state of the art on the full text version of CNN-DailyMail.
Peng Shi, Jinfeng Rao, and Jimmy Lin
This paper explores the problem of ranking short social media posts with respect to user queries using neural networks. Instead of starting with a complex architecture, we proceed from the bottom up and examine the effectiveness of a simple, word-level Siamese architecture augmented with attention-based mechanisms for capturing semantic “soft” matches between query and post tokens. Extensive experiments on data sets from the TREC Microblog Tracks show that our simple models not only achieve better effectiveness than existing approaches that are far more complex or exploit a more diverse set of relevance signals, but are also much faster. Implementations of our samCNN (Simple Attention-based Matching CNN) models are shared with the community to support future work.
A good conversation requires balance – between simplicity and detail; staying on topic and changing it; asking questions and answering them. Although dialogue agents are commonly evaluated via human judgments of overall quality, the relationship between quality and these individual factors is less well-studied. In this work, we examine two controllable neural text generation methods, conditional training and weighted decoding, in order to control four important attributes for chitchat dialogue: repetition, specificity, response-relatedness, and question-asking. We conduct a large-scale human evaluation to measure the effect of these control parameters on multi-turn interactive conversations on the PersonaChat task. We provide a detailed analysis of their relationship to high-level aspects of conversation, and show that by controlling combinations of these variables, our models obtain clear improvements in human quality judgments.
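One common way to formulate weighted decoding is to add weighted feature scores to the model's log-probability for each candidate token before choosing the next one. The sketch below shows a single greedy step under that assumption. It is not the authors' implementation: the function, the two features, the weights, and the toy distribution are all hypothetical.

```python
import math
from typing import Callable, Dict, List, Tuple

# A feature maps (candidate token, tokens generated so far) to a score.
Feature = Callable[[str, List[str]], float]


def weighted_decoding_step(
    log_probs: Dict[str, float],
    history: List[str],
    weighted_features: List[Tuple[float, Feature]],
) -> str:
    """Greedily pick the token maximizing log-prob plus weighted feature scores."""

    def adjusted(token: str) -> float:
        bonus = sum(w * f(token, history) for w, f in weighted_features)
        return log_probs[token] + bonus

    return max(log_probs, key=adjusted)


def is_repeat(token: str, history: List[str]) -> float:
    # Hypothetical repetition feature: 1.0 if the token was already generated.
    return 1.0 if token in history else 0.0


def is_question_mark(token: str, history: List[str]) -> float:
    # Hypothetical question-asking feature.
    return 1.0 if token == "?" else 0.0


# Hypothetical next-token distribution from a dialogue model.
log_probs = {"pizza": math.log(0.5), "pasta": math.log(0.3), "?": math.log(0.2)}
history = ["i", "like", "pizza"]

plain = weighted_decoding_step(log_probs, history, [])
less_repetition = weighted_decoding_step(log_probs, history, [(-5.0, is_repeat)])
more_questions = weighted_decoding_step(
    log_probs, history, [(-5.0, is_repeat), (3.0, is_question_mark)]
)
print(plain, less_repetition, more_questions)
```

A negative weight discourages an attribute and a positive weight encourages it, so the same model can be steered at inference time without retraining. Conditional training, by contrast, requires training the model with control variables.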