May 29, 2020
We show how several simple, infrequently explored design choices in pretraining can help achieve high performance on tasks that combine language and visual understanding.
These improvements require no architectural changes to the underlying models. They also complement more common approaches to improving performance in this domain, such as optimizing pretraining objective functions and model architectures.
For these sorts of visiolinguistic tasks, such as visual question answering, in which a model must analyze an image and correctly answer a question about what it shows, models are first pretrained to solve proxy tasks. Our approach improves performance by varying the similarity between the domain of the pretraining data set (both textual and visual) and the domain of the downstream task. We show that this can produce close to state-of-the-art results on downstream tasks without any architectural changes.
We’re sharing code for this work as part of our open source multimodal framework.
In research on vision and language, the finer details of the pretrain-then-fine-tune regime have not yet been investigated carefully. For instance, much recent research uses Conceptual Captions as the pretraining data set because of its large size. But perhaps COCO Captions, which are less noisy, would be a better fit? Should the domain of the downstream task be considered when deciding which pretraining data set would be most effective? Is synthetically generated data in a domain closer to the downstream task a better choice for pretraining than "natural" data from a less closely related domain?
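One way to make questions like these concrete is to quantify how close two textual domains are. The sketch below measures vocabulary overlap (Jaccard similarity) between a candidate pretraining corpus and a downstream corpus; the metric and the tiny example corpora are illustrative assumptions for this post, not a measure taken from this work.

```python
from collections import Counter

def vocab(corpus):
    """Lowercased word counts for a list of sentences."""
    words = Counter()
    for sentence in corpus:
        words.update(sentence.lower().split())
    return words

def jaccard_similarity(corpus_a, corpus_b):
    """Jaccard overlap between the vocabularies of two corpora.

    Returns a value in [0, 1]; higher means the textual domains share
    more of their vocabulary.
    """
    a, b = set(vocab(corpus_a)), set(vocab(corpus_b))
    return len(a & b) / len(a | b)

# Tiny illustrative corpora (placeholders, not the real data sets).
pretrain_captions = ["a dog runs on the grass", "a man rides a bike"]
downstream_questions = ["what is the dog doing", "who rides the bike"]
print(round(jaccard_similarity(pretrain_captions, downstream_questions), 2))
```

In practice one would compare full caption corpora (and, analogously, visual feature distributions) rather than a handful of sentences, but even a crude overlap score makes "closer in domain" an inspectable quantity rather than an intuition.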
As a step toward answering these questions, we carefully chose a set of pretraining data sets and downstream tasks, selecting pretraining data sets whose textual and visual domains vary in their degree of similarity to the downstream tasks.
We then attempted to improve accuracy on downstream tasks by bringing the domains of the pretraining data set and the downstream task closer together. We achieved this by generating a synthetic data set that is closer in domain to the downstream task. Interestingly, this synthetic data set yielded better downstream performance than a more "natural," commonly used data set that is a worse domain match.
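To illustrate the flavor of such synthetic generation, the sketch below turns VQA-style question-answer pairs into declarative, caption-like sentences using hand-written templates. The templates and examples are assumptions made for this illustration only, not the generation procedure used in this work.

```python
def qa_to_caption(question, answer):
    """Convert a (question, answer) pair into a caption-like sentence.

    A toy rule-based template: real synthetic data generation would
    need far broader coverage of question types.
    """
    q = question.rstrip("?").lower()
    if q.startswith("what color is "):
        subject = q[len("what color is "):]
        return f"{subject} is {answer}"
    if q.startswith("is there ") and answer == "yes":
        return f"there is {q[len('is there '):]}"
    # Fallback: attach the answer to the question stem.
    return f"{q}: {answer}"

print(qa_to_caption("What color is the umbrella?", "red"))  # the umbrella is red
print(qa_to_caption("Is there a dog?", "yes"))              # there is a dog
```

The point of such a pipeline is that the generated sentences inherit both the images and the vocabulary of the downstream task, so the pretraining domain matches the downstream domain by construction.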
Multimodal understanding problems at the intersection of vision and language are important for applications ranging from aiding visually impaired people to building virtual assistants to detecting hateful content. Facebook AI is working on a wide range of connected research on multimodal understanding, vision and language, and self-supervised learning. For example, we’ve recently released the Hateful Memes data set and challenge to spur progress on detecting multimodal hate speech. In addition to sharing the benchmark data for Hateful Memes, we’ve released the model code through MMF, a multimodal framework. The visiolinguistic pretraining methods discussed here could help researchers develop more effective models for these sorts of tasks.
Our success using a synthetically generated data set is also important because it has the potential to help researchers overcome the scarcity of the large-scale paired, labeled data sets used for pretraining visiolinguistic representations. This will be particularly helpful for low-resource applications and tasks with limited available training data, and it can also help with augmenting existing data sets.