May 29, 2020
We show how several simple, infrequently explored design choices in pretraining can help achieve high performance on tasks that combine language and visual understanding.
These improvements are possible without any architectural changes to the underlying models. They also complement more common approaches to improving performance in this domain, such as optimizing pretraining objective functions and model architectures.
For these sorts of visiolinguistic tasks (such as visual question answering, in which the AI must analyze an image and correctly answer a question about what it shows), models are first pretrained to solve proxy tasks and then fine-tuned on the downstream task. Our approach focuses on improving performance by varying the similarity between the pretraining dataset domain (both textual and visual) and the downstream domain. We show that this can produce close to state-of-the-art results on downstream tasks.
We’re sharing code for this work as part of our open source multimodal framework.
In research on vision and language, the finer details of the pretrain-then-fine-tune regime have not yet been investigated carefully. For instance, a great deal of recent research uses Conceptual Captions as the pretraining dataset because of its large size. But perhaps COCO captions, which are less noisy, would be a better fit? Should the domain of the downstream task be considered when deciding which pretraining dataset would be the most effective? Is synthetically generated data in a domain closer to the downstream task a better choice for pretraining than “natural” data from a less closely related domain?
As a step toward answering these questions, we carefully chose a set of pretraining datasets and downstream tasks. We selected pretraining datasets whose textual and visual domains vary in how similar they are to those of the downstream tasks.
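To make the idea of textual domain similarity concrete, here is a minimal sketch that scores how much vocabulary a candidate pretraining corpus shares with a downstream task's text. This is not the measure used in this work, which does not specify one here; Jaccard vocabulary overlap is simply one easy proxy. The function names and toy sentences below are hypothetical, and the sketch covers only the textual side, not the visual one.

```python
import re


def vocabulary(texts):
    """Collect the set of lowercase word tokens found in a list of strings."""
    vocab = set()
    for text in texts:
        vocab.update(re.findall(r"[a-z']+", text.lower()))
    return vocab


def jaccard_overlap(texts_a, texts_b):
    """Jaccard similarity of two corpora's vocabularies, in the range [0, 1]."""
    vocab_a, vocab_b = vocabulary(texts_a), vocabulary(texts_b)
    union = vocab_a | vocab_b
    if not union:
        return 0.0  # both corpora empty: define the overlap as zero
    return len(vocab_a & vocab_b) / len(union)


# Toy, made-up sentences standing in for real corpora (assumption).
downstream_questions = [
    "what color is the dog",
    "how many people are on the beach",
]
candidate_a = ["a dog runs on the beach", "two people sit on a bench"]
candidate_b = [
    "stock photo of business handshake",
    "vector illustration of abstract logo",
]

score_a = jaccard_overlap(candidate_a, downstream_questions)
score_b = jaccard_overlap(candidate_b, downstream_questions)
print(f"candidate_a overlap: {score_a:.4f}")
print(f"candidate_b overlap: {score_b:.4f}")
```

In this toy setup, the first candidate shares everyday scene words with the downstream questions and scores higher than the second. A real comparison would need full corpora and, ideally, a visual-domain measure as well.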
We then attempted to improve accuracy on downstream tasks by bringing the domains of the pretraining dataset and the downstream tasks closer together. We achieved this by generating a synthetic dataset that is closer in domain to the downstream task. Interestingly, our synthetic dataset achieved better performance on the downstream task than a more “natural,” commonly used dataset that is a worse match in domain to the downstream task.
Multimodal understanding problems at the intersection of vision and language are important for applications ranging from aiding visually impaired people to building virtual assistants to detecting hateful content. Facebook AI is working on a wide range of connected research on multimodal understanding, vision and language, and self-supervised learning. For example, we’ve recently released the Hateful Memes dataset and challenge to spur progress on detecting multimodal hate speech. In addition to sharing the benchmark data for Hateful Memes, we’ve released the model code through MMF, a multimodal framework. The visiolinguistic pretraining methods discussed here could help researchers develop more effective models for these sorts of tasks.
Our success using a synthetically generated dataset is also important because it has the potential to help researchers overcome the scarcity of large-scale paired-labeled datasets, which are used for pretraining visiolinguistic representations. This will be particularly helpful for low-resource applications and tasks with limited available training data. It can also help with augmenting existing datasets.