June 11, 2020
Pythia, our open source, modular deep learning framework for vision and language multimodal research, is now called MMF, a multimodal framework. As part of this change, we are rewriting major portions of the library to improve usability for the open source community and adding new state-of-the-art models and data sets in vision and language. MMF has starter code for several multimodal challenges, including the Hateful Memes, VQA, TextVQA, and TextCaps challenges. Learn more on the MMF website and on GitHub.
New features include performance and UX improvements, new state-of-the-art BERT-based multimodal models, new vision and language multimodal models, a pretrained model zoo, automatic downloads, and a revamped configuration system based on OmegaConf. Rewriting the library has allowed us to make it highly modular, which enables researchers to easily include individual MMF components in their own projects. MMF is intended to help researchers develop adaptive AI that synthesizes multiple kinds of understanding into a more context-based, multimodal understanding. This work is extremely challenging for machines because, when text and an image appear together, they can’t analyze the two separately. They must combine these different modalities and understand how the meaning changes when they are presented together.
Earlier this month, we used MMF to provide starter code and baselines for the recent Hateful Memes Challenge, a first-of-its-kind online competition hosted by DrivenData. As part of that challenge, we also shared a new data set designed specifically to help AI researchers develop new systems to identify multimodal hate speech. In addition to this open source release, we plan to continue adding tools, tasks, data sets, and reference models. We look forward to seeing how the open source community uses and contributes to MMF.