Dialogue research is a crucial component of building the next generation of intelligent agents. While there’s been progress with chatbots in single-domain dialogue, agents today are far from capable of carrying an open-domain conversation across a multitude of topics. Agents that can chat with humans in the way that people talk to each other will be easier and more enjoyable to use in our day-to-day lives — going beyond simple tasks like playing a song or booking an appointment.
Generating coherent and engaging responses in conversations requires a range of nuanced conversational skills, including language understanding and reasoning. Facebook AI has made scientific progress in dialogue research that is, in the long run, fundamental to building more engaging, personable AI systems. In this blog post, we describe new open source data sets, algorithms, and models that improve five common weaknesses of open-domain chatbots today: consistency, specificity, empathy, knowledgeability, and multimodal understanding.
The first step in developing personable chatbots is to ensure they’re generating appropriate responses without missteps, such as contradictions. Inconsistencies are a common issue for chatbots partly because most models lack both explicit long-term memory and semantic understanding. In the example below, for instance, the model says both “I have 2 cats” and “I do not have any pets.”
In collaboration with our colleagues at NYU, we recently developed a new way of framing consistency of dialogue agents as natural language inference (NLI) and created a new NLI data set called Dialogue NLI, which is used to both improve and evaluate the consistency of dialogue models. In Dialogue NLI, we consider two utterances in a dialogue as the premise and hypothesis, respectively. Each pair is labeled to indicate whether the premise entails, contradicts, or is neutral with respect to the hypothesis.
We train an NLI model on this data set and use it to rerank the dialogue model’s candidate responses, favoring those that are entailed by, or at least consistent with, earlier utterances. This improves the overall consistency of the dialogue agent. Across our three test sets, we saw an average of 3x fewer contradictions. Human annotators also rated these models as more consistent and less contradictory.
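The reranking step described above can be sketched as follows. This is a minimal illustration, not the released implementation: `toy_contradiction_score` is a hypothetical keyword rule standing in for a trained Dialogue NLI model, and the `penalty` weight and the candidate scores are assumptions made up for the example.

```python
from typing import Callable, List, Tuple


def toy_contradiction_score(premise: str, hypothesis: str) -> float:
    """Hypothetical stand-in for a trained NLI model.

    Returns 1.0 when one utterance claims cats and the other denies pets,
    otherwise 0.0. A real model would return a learned probability.
    """
    def claims_pet(s: str) -> bool:
        return "i have 2 cats" in s.lower()

    def denies_pet(s: str) -> bool:
        return "do not have any pets" in s.lower()

    if claims_pet(premise) and denies_pet(hypothesis):
        return 1.0
    if denies_pet(premise) and claims_pet(hypothesis):
        return 1.0
    return 0.0


def rerank(
    candidates: List[Tuple[str, float]],
    history: List[str],
    nli: Callable[[str, str], float],
    penalty: float = 1.0,
) -> List[Tuple[str, float]]:
    """Lower each candidate's score by its worst contradiction with the history."""
    rescored = []
    for text, base_score in candidates:
        worst = max((nli(prev, text) for prev in history), default=0.0)
        rescored.append((text, base_score - penalty * worst))
    # Highest adjusted score first.
    return sorted(rescored, key=lambda pair: pair[1], reverse=True)


history = ["I have 2 cats."]
candidates = [("I do not have any pets.", 0.9), ("My cats love to nap.", 0.7)]
ranked = rerank(candidates, history, toy_contradiction_score)
best_response = ranked[0][0]
```

In this toy run, the contradictory candidate starts with the higher model score but drops below the consistent one once the penalty is applied.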
Generative dialogue models frequently default to generic, safe responses, such as “I don’t know.” In collaboration with Stanford AI researcher Abigail See, we studied how to fix this by controlling several conversational attributes, such as the level of specificity. We found that people in a conversation significantly preferred more specific responses to generic ones: Our specificity-controlled models were rated by human annotators as 28 percent more interesting and 32 percent more engaging than our baseline model. Variety is the spice of conversation.
However, overly specific models risk being too narrowly focused and unrelatable conversational partners. In one experiment, we conditioned a bot on character information and asked “What do you do for a living?” A typical chatbot responds with the generic statement “I’m a construction worker.” With control methods, our chatbots proposed more specific and engaging responses, like “I build antique homes and refurbish houses.”
While most current research is focused on the next utterance prediction problem, our work demonstrates that studying multiturn aspects is necessary to improve conversation quality. In addition to specificity, we also show that balancing question-asking and answering and controlling how repetitive our models are make significant differences. The better the overall conversation flow, the more engaging and personable the chatbots and dialogue agents of the future will be.
Currently, it’s challenging for dialogue agents to recognize feelings and reply appropriately. This can be attributed in part to the scarcity of suitable benchmarks and publicly available training data sets. In recent work with researchers from the University of Washington, we introduce the first benchmark task of human-written empathetic dialogues centered on specific emotional labels to measure a chatbot’s ability to display empathy. In addition to improving on automatic metrics, we show that using this data both for fine-tuning and as retrieval candidates leads to responses that are evaluated by humans as more empathetic, with an average improvement of 0.95 points (on a 1-to-5 scale) across three different retrieval and generative models.
This work provides a basis for new research directions for developing empathy in chatbots. For instance, the next challenge is for empathy-focused models to perform well in complex dialogue situations, where agents may need to balance empathy with staying on topic or providing information.
Humans naturally incorporate knowledge into conversations with their speaking partner, but open-domain dialogue agents often struggle to leverage available knowledge. Current state-of-the-art approaches to dialogue modeling involve sequence-to-sequence models, which lack access to information outside of the conversation history. To address this issue, more direct knowledge memory mechanisms need to be employed in these models.
Recently, we’ve improved dialogue models’ capability of demonstrating knowledge by collecting a data set with conversations directly grounded in knowledge from Wikipedia, and creating new model architectures that retrieve knowledge, read it, and condition their responses on it.
The new architectures, called transformer memory networks, yield more knowledgeable agents, outperforming systems that do not employ a memory structure for storing knowledge in both automatic metrics and human evaluations. Our generative model variants yield the most pronounced improvement and are rated by humans as 26 percent more engaging on average than their knowledgeless counterparts.
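The retrieve, read, and condition pattern can be illustrated with a very small sketch. It is not the transformer memory network itself: plain word overlap stands in for the learned retrieval step, and the function names, the example sentences, and the `knowledge:`/`context:` input format are all assumptions made for illustration.

```python
import re
from typing import List, Set


def tokenize(text: str) -> Set[str]:
    """Lowercase the text and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(context: str, knowledge: List[str]) -> str:
    """Pick the knowledge sentence sharing the most words with the context.

    Word overlap is a stand-in for a learned attention over memory.
    """
    context_tokens = tokenize(context)
    return max(knowledge, key=lambda sent: len(context_tokens & tokenize(sent)))


def build_model_input(context: str, knowledge: List[str]) -> str:
    """Prepend the retrieved fact so a response model can condition on it."""
    fact = retrieve(context, knowledge)
    return f"knowledge: {fact}\ncontext: {context}"


knowledge = [
    "The guitar is a fretted musical instrument with six strings.",
    "Paris is the capital of France.",
]
context = "Do you know anything about the guitar?"
model_input = build_model_input(context, knowledge)
```

A generative or retrieval model would then produce its reply from `model_input` rather than from the conversation history alone.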
To engage with humans, agents should not only comprehend dialogue but also understand images. When people engage with one another and talk about what they see around them, they don’t make neutral observations — they express their points of view. Machine learning approaches that comment on images have typically focused on image captioning, which is factual and neutral in tone — like “fireworks in the sky.” In our research, we focus on image captioning that is engaging for humans by incorporating personality. We collect a large data set of human comments grounded in images, and train state-of-the-art models capable of discussing images with given personalities, which makes the system much more interesting for humans to talk to. Humans prefer our personality-based captions over traditional captions 64.5 percent of the time.
To build strong models, we consider both retrieval and generative variants, and leverage state-of-the-art modules from both the vision and language domains. We define a simple yet powerful retrieval architecture, named TransResNet. It works by projecting the image, personality, and caption into the same space using image, personality, and text encoders. We show that our best system is able to produce captions that are close to matching human performance in terms of engagement and relevance. In fact, annotators preferred our retrieval model’s captions over captions written by people 49.5 percent of the time.
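The shared-space idea behind the retrieval model can be sketched with toy vectors. The hand-written three-dimensional embeddings below are placeholders for the outputs of the image, personality, and text encoders, and combining image and personality by summation and scoring captions with a dot product are assumptions made for this sketch; the post does not spell out those details.

```python
from typing import Dict, List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def add(a: Vector, b: Vector) -> Vector:
    """Element-wise sum of two equal-length vectors."""
    return [x + y for x, y in zip(a, b)]


def rank_captions(
    image_vec: Vector, personality_vec: Vector, caption_vecs: Dict[str, Vector]
) -> List[str]:
    """Rank candidate captions by similarity to the image-plus-personality query."""
    query = add(image_vec, personality_vec)
    return sorted(
        caption_vecs, key=lambda cap: dot(query, caption_vecs[cap]), reverse=True
    )


# Toy embeddings: axis 0 ~ "fireworks content", axis 1 ~ "enthusiastic tone".
image_vec = [1.0, 0.0, 0.0]
personality_vec = [0.0, 1.0, 0.0]
caption_vecs = {
    "Fireworks in the sky.": [1.0, 0.0, 0.0],
    "Wow, what a dazzling show!": [0.8, 0.9, 0.0],
    "I had toast this morning.": [0.0, 0.0, 1.0],
}
ranking = rank_captions(image_vec, personality_vec, caption_vecs)
```

Because the query mixes image content with personality, the enthusiastic caption outranks the neutral one even though the neutral caption matches the image slightly better.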
Dialogue research today functions almost entirely based on extensive supervised learning from humans talking to one another — usually crowdsourced or publicly available on the internet. This data can differ significantly in distribution from the environment in which a chatbot might be deployed. To help researchers further explore and push dialogue research forward, it’s important to have agents out in the real world actually conversing with humans.
To that end, we’ve released a new data collection and model evaluation tool, a Messenger-based chatbot game called Beat the Bot, which allows people to interact directly with bots and other humans in real time, creating rich examples to help train models. Our goal in sharing this tool is to provide researchers with high-signal data from live interactions instead of fixed language data. We plan to continuously enhance this tool’s capabilities (for instance, adding image understanding) to help both improve our latest dialogue models and further explore dialogue research.
Beat the Bot is currently live: If you send a message to this page, you will be matched with a bot and another person. Both you and the other person will see two responses for every message you send — one from your human partner and one from a bot. You’ll choose which response is better and continue the conversation from there. The goal is to get your human speaking partner to choose your message more often than the bot’s. This allows for supervision in two senses: It provides both the human-human dialogue turns, and a human’s assessment on when the bot fails to match human performance. We ask users to play a character in a game that is completely disconnected from their personal information. With user permission at the beginning of the game, the data collected will be open-sourced to facilitate new research directions for the entire community.
There’s also untapped opportunity in exploring how chatbots can learn from conversations they have once they are deployed. We’ve made some headway in extracting training signal from these conversations. In collaboration with Stanford, we have shown that it’s possible to improve a deployed dialogue agent by extracting training data from conversations it has with humans.
The self-feeding chatbot estimates its conversation partner’s satisfaction with its responses during interaction. When the dialogue agent believes it has made a mistake, it can ask for feedback. Learning to predict such feedback helps the model improve over time. When it believes it has not made a mistake, it can use standard supervised learning techniques instead. Ultimately, we find that learning from dialogue with a self-feeding chatbot significantly improves performance, regardless of the amount of traditional supervision. The improvement is most pronounced when the initial training set is small: In this case, we see a 9.4 point increase in accuracy on the dialogue task, which amounts to a 31 percent improvement over the baseline.
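The routing logic of the self-feeding loop can be sketched as below. This is a simplified illustration under stated assumptions: the `SelfFeedingCollector` class, the 0.5 threshold, and the choice of which utterance is stored as the supervised target are hypothetical, and the satisfaction values are supplied by hand rather than predicted by a model.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class SelfFeedingCollector:
    """Sorts deployed conversation turns into two pools of training examples."""

    threshold: float = 0.5  # assumed cutoff on estimated satisfaction
    supervised: List[Tuple[str, str]] = field(default_factory=list)
    feedback: List[Tuple[str, str]] = field(default_factory=list)

    def route(self, satisfaction: float) -> str:
        """Ask for feedback when the bot believes it made a mistake."""
        return "ask_feedback" if satisfaction < self.threshold else "supervised"

    def record(
        self,
        context: str,
        response: str,
        satisfaction: float,
        feedback_text: Optional[str] = None,
    ) -> str:
        action = self.route(satisfaction)
        if action == "ask_feedback":
            # Store the partner's feedback as a target for feedback prediction.
            if feedback_text is not None:
                self.feedback.append((context, feedback_text))
        else:
            # Store the turn as an ordinary supervised dialogue example.
            self.supervised.append((context, response))
        return action


collector = SelfFeedingCollector()
first = collector.record("How is the weather?", "It is sunny today.", satisfaction=0.9)
second = collector.record(
    "What is your favorite food?",
    "It is sunny today.",
    satisfaction=0.1,
    feedback_text="You could have named a food you like.",
)
```

Both pools can then be added to the training data, so the agent keeps learning from the conversations it has after deployment.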
Our research has shown that it is possible to train models to improve on some of the most common weaknesses of chatbots today. Over time, we’ll work toward bringing these subtasks together into one unified intelligent agent by narrowing and eventually closing the gap with human performance. In the future, intelligent chatbots will be capable of open-domain dialogue in a way that’s personable, consistent, empathetic, and engaging.
As part of Facebook AI’s contribution to the broader research community, we’re sharing our new models, training code, and data sets within ParlAI, our open source dialogue research platform. We hope that this platform will continue to foster research advances across the research community and contribute to pushing dialogue research forward. You can follow the latest ParlAI updates here.
Research Engineer, Facebook AI
Research Scientist, Facebook AI