June 11, 2020
Pythia, our open source, modular deep learning framework for vision and language multimodal research, is now called MMF, a multimodal framework. As part of this change, we are rewriting major portions of the library to improve usability for the open source community, and we are adding new state-of-the-art models and datasets in vision and language. MMF has starter code for several multimodal challenges, including the Hateful Memes, VQA, TextVQA, and TextCaps challenges. Learn more on the MMF website and on GitHub.
New features include performance and UX improvements, new state-of-the-art BERT-based multimodal models, new vision and language multimodal models, a pretrained model zoo, automatic downloads, and a revamped configuration system based on OmegaConf. Rewriting the library has allowed us to make it highly modular, so researchers can easily reuse individual MMF components in their own work. MMF is intended to help researchers develop adaptive AI that synthesizes multiple kinds of understanding into a more context-based, multimodal understanding. This work is extremely challenging for machines because they can’t simply analyze the text and the image separately. They must combine these different modalities and understand how the meaning changes when they are presented together.
Earlier this month, we provided starter code and baselines through MMF for the recent Hateful Memes Challenge, a first-of-its-kind online competition hosted by DrivenData. As part of that challenge, we also shared a new dataset designed specifically to help AI researchers develop new systems to identify multimodal hate speech. In addition to this open source release, we plan to continue adding tools, tasks, datasets, and reference models. We look forward to seeing how the open source community uses and contributes to MMF.