August 24, 2020
Situated and Interactive Multimodal Conversations (SIMMC) is a first-of-its-kind dataset that has been open-sourced to help researchers and engineers develop virtual assistants capable of handling complex, task-oriented conversations in a co-observed multimodal context.
Imagine you’re buying a product from an online store using AR/VR. One of the easiest ways to make this interaction happen smoothly would be to use a digital assistant that can act on your voice commands and cues (e.g., “I’d like to buy a brown leather couch”) — similar to how a salesperson might help you in a brick-and-mortar store.
However, the complexity of an AR/VR environment means the assistant must be able to function beyond simple call-and-response or question-and-answer actions. Successfully operating in a virtual environment means being able to handle and memorize the various multimodal inputs inherent to it, such as the visual aspects of objects, including color, size, shape, and orientation.
If the wider research and engineering community is going to develop digital assistants that can closely mimic their human counterparts, there is a range of novel and nontrivial research challenges to address.
SIMMC is a dataset specifically aimed at training agents that take multimodal actions grounded in a coevolving multimodal input context in addition to the dialog history. SIMMC tasks address task-oriented dialogs that encompass a rich, situated multimodal user context in the form of a co-observed image or a VR environment, which gets updated dynamically based on the dialog flow and the assistant actions. It enables AI assistants to understand the evolving context of the interaction in much the same way the user does.
SIMMC contains about 13,000 human-to-human dialogs (totaling about 169,000 utterances). We chose shopping experiences — specifically furniture and fashion — as the domain for the SIMMC datasets because of the dynamic environment created by the shopping experience, where rich multimodal interactions happen around visually grounded items.
SIMMC offers four key advantages over previous multimodal dialog datasets:
SIMMC assumes a co-observed multimodal context between a user and an assistant and records a ground-truth appearance log for each item that appears. SIMMC tasks emphasize semantic processing of the input modalities, whereas work in this area has traditionally focused heavily on raw image processing.
Compared with conventional task-oriented conversational datasets, the agent actions in the SIMMC datasets span a diverse multimodal action space (e.g., “rotate,” “search,” and “add to cart”).
Agent actions can be enacted on both the object level (e.g., changing the view of a specific object within a scene) and the scene level (e.g., introducing a new scene or an image).
SIMMC emphasizes semantic processing. The proposed SIMMC annotation schema allows for a more systematic and structural approach for visual grounding of conversations, which is essential for solving challenging problems in real-world scenarios.
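To make the ideas above more concrete, here is a minimal Python sketch of how assistant actions might update a context that the user and the assistant both observe. It is an illustration only, not the SIMMC annotation schema or API: the names `Action`, `SharedContext`, and `apply_action`, the item IDs, and the view labels are all hypothetical, and treating “add to cart” as an object-level action is our assumption.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class Action:
    """A hypothetical assistant action (not the SIMMC schema)."""
    name: str                                   # e.g. "search", "rotate", "add to cart"
    target_item: Optional[str] = None           # set for object-level actions
    arguments: Dict[str, Any] = field(default_factory=dict)

    @property
    def level(self) -> str:
        # Assumption: an action with a target item is object-level,
        # anything else changes the whole scene.
        return "object" if self.target_item is not None else "scene"


@dataclass
class SharedContext:
    """What both the user and the assistant currently observe."""
    visible_items: List[str] = field(default_factory=list)
    views: Dict[str, str] = field(default_factory=dict)    # item -> current view
    cart: List[str] = field(default_factory=list)
    appearance_log: List[List[str]] = field(default_factory=list)


def apply_action(ctx: SharedContext, action: Action) -> SharedContext:
    """Update the shared context in place according to one action."""
    if action.name == "search":
        # Scene-level: introduce a new set of items and log their appearance.
        ctx.visible_items = list(action.arguments["results"])
        ctx.views = {item: "front" for item in ctx.visible_items}
        ctx.appearance_log.append(list(ctx.visible_items))
    elif action.name in ("rotate", "add to cart"):
        # Object-level: the target must be something both parties can see.
        if action.target_item not in ctx.visible_items:
            raise ValueError(f"item not in view: {action.target_item!r}")
        if action.name == "rotate":
            ctx.views[action.target_item] = action.arguments["view"]
        else:
            ctx.cart.append(action.target_item)
    else:
        raise ValueError(f"unknown action: {action.name!r}")
    return ctx


# Example dialog flow: search, then inspect one item, then buy it.
ctx = SharedContext()
apply_action(ctx, Action("search", arguments={"results": ["couch_1", "couch_2"]}))
apply_action(ctx, Action("rotate", target_item="couch_1", arguments={"view": "side"}))
apply_action(ctx, Action("add to cart", target_item="couch_1"))
print(ctx.views)
print(ctx.cart)
```

The point of the sketch is that a later utterance such as “add that one to my cart” can only be resolved against the current state of the shared context, which is why the appearance of each item is logged alongside the dialog.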
Machine learning models trained on the furniture and fashion datasets were evaluated for their performance in API call prediction, response generation, and dialog state tracking.
We’ve organized a challenge track at the Ninth Dialog System Technology Challenge (DSTC9) around SIMMC. This track invites the dialog research community to work toward developing real-world assistant agents that can handle multimodal inputs and perform multimodal actions.
The SIMMC framework is a step toward building next-generation virtual assistants that can perform the sorts of multimodal reasoning needed to create dynamic experiences, such as AR/VR-based shopping. A dataset of this type also opens the door to further research into conversational AI, including multimodal entity disambiguation.
We’ve provided two datasets (furniture and fashion), as well as the contextual natural language understanding and coreference annotations on these datasets for further study. There are several strong baselines for some of the tasks enabled by the datasets that showcase their various uses in real-world applications. SIMMC is in the research phase right now, and we believe the collected annotations should facilitate further study into the tasks highlighted in this work and several other tasks.
We would like to thank our collaborators:
Seungwhan Moon, Paul A. Crook, Ankita De, Shivani Poddar, Theodore Levin, David Whitney, Daniel Difranco, Eunjoon Cho, Rajen Subba, and Alborz Geramifard