June 12, 2020
We’re sharing results of the Deepfake Detection Challenge, an open, collaborative initiative to spur creation of innovative new technologies to detect deepfakes and manipulated media.
The competition drew more than 2,000 participants, who trained and tested their models using a unique new data set created for the challenge.
Tested against a black box data set with challenging real-world examples that were not shared with entrants, the top model achieved an average precision of 65.18 percent. This establishes a new shared baseline as the AI community continues to work on this difficult and important task.
We partnered with other industry leaders and academic experts last year to create the Deepfake Detection Challenge (DFDC) in order to accelerate development of new ways to detect deepfake videos. By creating and sharing a unique new data set of more than 100,000 videos, the DFDC has enabled experts from around the world to come together, benchmark their deepfake detection models, try new approaches, and learn from each other’s work. This open, collaborative effort will help the industry and society at large meet the challenge presented by deepfake technology and help everyone better assess the legitimacy of content they see online. As with our recently launched Hateful Memes Challenge, we believe challenges and shared data sets are key to faster progress in AI.
The DFDC launched last December, and 2,114 participants submitted more than 35,000 models to the competition. Now that the challenge has concluded, we are sharing details on the results and working with the winners to help them release code for the top-performing detection models. Next week, at the Conference on Computer Vision and Pattern Recognition (CVPR), we will also share details on our plans to open-source the raw data set used to construct the DFDC, featuring more than 3,500 actors and 38.5 days’ worth of data. This will help AI researchers develop new generation and detection methods to advance the state of the art in this field. The data set will also be available for research in other AI domains, not only for work on deepfakes.
To create this challenge, we built a new data set with a wide variety of high-quality videos created expressly for research on deepfakes. DFDC participants used this data set to train and test their models and were able to assess their performance on a public leaderboard on the challenge website. The videos feature more than 3,500 different paid actors, each of whom agreed to participate in the project. We focused in particular on ensuring diversity in gender, skin tone, ethnicity, age, and other characteristics.
September 5, 2019
Facebook, the Partnership on AI, Microsoft, and academics from Technical University of Munich, University of Naples Federico II, Cornell Tech, MIT, University of Oxford, UC Berkeley, University of Maryland, College Park, and University at Albany–SUNY launch the Deepfake Detection Challenge (DFDC).
October 21, 2019
Preview deepfake data set with 4,000 videos is released.
December 11, 2019
Challenge launches with a new training corpus of 115,000 videos created for this challenge. A public leaderboard hosted by Kaggle enables participants to assess their performance. Prizes totaling $1 million will be distributed to the winners.
March 31, 2020
Challenge ends, with 2,114 participants having submitted 35,109 models.
April – May 2020
Participant models are evaluated against a separate black box data set to determine the winners. By using a distinct data set, we were able to replicate real-world challenges, where models must be accurate even when tasked with new or unfamiliar techniques for creating deepfakes.
June 12, 2020
Winners announced along with details on the results.
The original, unaltered videos used to construct the DFDC data set will be open-sourced so that other AI researchers can use them in their research. The raw data set features approximately 3,500 actors and 38.5 days’ worth of data.
We altered the videos using a variety of different deepfake generation models, refinement techniques such as image enhancement, and additional augmentations and distractors, such as blur, frame-rate modification, and overlays. Our goal was to make the data set representative of the variety of qualities and adversarial methods that could occur in real-world videos shared online. To ensure that the challenge would address researchers’ needs, we also worked with experts from Cornell Tech, MIT, Technical University of Munich, UC Berkeley, University at Albany–SUNY, University of Maryland, University of Naples Federico II, and University of Oxford to gather feedback and recommendations.
One of the central unsolved challenges of detecting deepfakes is that it is hard to generalize from known examples to unfamiliar instances. We designed the DFDC with this in mind. To determine the winners, participants in the challenge submitted their code to a black box environment, where it was evaluated against a separate data set. This data set was not available to entrants, so they had to design models that could be effective even under unforeseen circumstances. The black box data set consisted of 10,000 videos. It contained both organic content (deepfakes and benign clips) found on the internet and new videos created for this project. We verified that the distribution of fake and real videos was identical to that of the public test set.
We added videos of makeup tutorials, paintings, and other examples that might be difficult for detector models to classify correctly. We also randomly applied a number of augmentations to emulate how potential bad actors could modify videos to try to fool detectors. Examples include applying AR face filters, adding random images to each frame, and changing the frame rate and resolution. Augmentations were also applied to the public test set, but the black box test set used additional techniques to increase the difficulty level.
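To make two of the augmentations mentioned above concrete, here is a minimal sketch of frame-rate and resolution changes applied at random to a clip. It is an illustration only, not the pipeline used to build the test sets: the function names, the 50 percent application probability, and the subsampling factors are all assumptions, and a clip is represented simply as a list of frames, each a list of pixel rows.

```python
import random


def change_frame_rate(frames, keep_every):
    """Simulate a lower frame rate by keeping every `keep_every`-th frame."""
    if keep_every < 1:
        raise ValueError("keep_every must be at least 1")
    return frames[::keep_every]


def downscale(frame, factor):
    """Reduce resolution by keeping every `factor`-th pixel in each dimension."""
    if factor < 1:
        raise ValueError("factor must be at least 1")
    return [row[::factor] for row in frame[::factor]]


def random_augment(frames, rng):
    """Randomly apply a frame-rate change and/or a resolution change to a clip.

    The probabilities and factors below are hypothetical choices for illustration.
    """
    if rng.random() < 0.5:
        frames = change_frame_rate(frames, rng.choice([2, 3]))
    if rng.random() < 0.5:
        factor = rng.choice([2, 4])
        frames = [downscale(frame, factor) for frame in frames]
    return frames


# A toy clip: 6 frames of 4x4 pixels, where each pixel holds the frame index.
clip = [[[i] * 4 for _ in range(4)] for i in range(6)]
augmented = random_augment(clip, random.Random(0))
print(len(augmented), "frames of height", len(augmented[0]))
```

A real pipeline would operate on decoded video and use proper resampling filters; the point here is only that each augmentation changes the clip's surface statistics without changing whether it is real or fake.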
The top-performing model on the public data set achieved 82.56 percent average precision, a common accuracy measure for computer vision tasks. But when evaluating the entrants against the black box data set, the ranking of top-performing models changed significantly. The highest-performing entrant was a model entered by Selim Seferbekov. It achieved an average precision of 65.18 percent against the black box data set. Using the public data set, this model had been ranked fourth. Similarly, the other winning models, which were second through fifth when tested against the black box environment, also ranked lower on the public leaderboard. (They were 37th, 6th, 10th and 17th, respectively.) This outcome reinforces the importance of learning to generalize to unforeseen examples when addressing the challenges of deepfake detection. The competition was hosted by Kaggle and winners were selected using the log-loss score against the private test set. Details on the competition and the final leaderboard are available here, and a deep dive on the data and winning models can be found in this new paper about the challenge.
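For readers less familiar with the two metrics mentioned here, the sketch below computes log loss and average precision for a handful of predictions. These are the standard, unweighted textbook definitions, written from scratch for illustration; the exact evaluation code used for the challenge is not described in this post, and the function names and example values are assumptions.

```python
import math


def log_loss(labels, probs, eps=1e-15):
    """Mean binary cross-entropy. Labels are 0 (real) or 1 (fake)."""
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1.0 - eps)  # clip so log(0) never occurs
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(labels)


def average_precision(labels, scores):
    """Mean of the precision values measured at the rank of each fake video."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    precision_sum = 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


# Hypothetical predictions for four videos (1 = fake, 0 = real).
labels = [1, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.1]
print("log loss:", log_loss(labels, scores))
print("average precision:", average_precision(labels, scores))
```

Log loss penalizes confident wrong answers heavily, while average precision depends only on how the videos are ranked, so the two metrics can order the same set of models differently.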
The similarities and differences between the top-performing models also provide insights for future research in this area. The most successful models, including the five winning submissions, all found ways to innovate in the task of deepfake classification. Some of the common themes among the winners include:
Clever augmentations. Many methods used a form of data augmentation that dropped portions of faces — either randomly, using landmarks, or using attention-based networks. Others used more complex “mixup” augmentations, such as blending a real face and an AI-generated one, and then using the blending coefficient as a target label. Another competitor used the WS-DAN model, which uses weakly supervised learning for augmentation. This model takes an image and then either emphasizes or drops discriminative face parts (eyes, mouth, forehead, etc.) to help with the problem of overfitting. Deepfake detection is a more constrained problem than general object detection, but these types of fine-grained visual classification seem to provide an edge when figuring out exactly which parts of a face to drop.
Architectures. All winners used pretrained EfficientNet networks, which were fine-tuned only on the DFDC training data. Most chose the B7 variant of EfficientNet. What differed among competitors was how they used these models: how many they used, and how they combined predictions from an ensemble. The challenge shows that an ensemble approach, which has demonstrated success for many other AI applications, is useful for dealing with deepfakes as well.
Absence of forensics methods. It was interesting to see that none of the top-performing solutions used digital forensics techniques, like sensor noise fingerprints or other characteristics derived from the image creation process. This suggests that either non-learned techniques that operate at a pixel level (or on compressed images) aren’t useful for this task or they aren’t currently in widespread use among those who entered the DFDC.
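Two of the themes above, blending-based augmentation and ensembling, can be sketched in a few lines. This is a simplified illustration under stated assumptions, not any winner's actual code: faces are small grayscale images stored as lists of rows, the function names are hypothetical, and the ensemble uses a plain average even though competitors differed in how they combined predictions.

```python
def blend_faces(real_face, fake_face, alpha):
    """Blend a real and a fake face; return the image and a soft label.

    `alpha` is the share of the fake face in the blend, and it doubles as the
    training target, as in the "mixup"-style augmentation described above.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    blended = [
        [(1.0 - alpha) * r + alpha * f for r, f in zip(real_row, fake_row)]
        for real_row, fake_row in zip(real_face, fake_face)
    ]
    return blended, alpha


def ensemble_video_score(frame_probs_per_model):
    """Average per-frame fake probabilities within each model, then across models."""
    per_model = [sum(frames) / len(frames) for frames in frame_probs_per_model]
    return sum(per_model) / len(per_model)


# Toy 1x2 "faces" and per-frame predictions from two hypothetical models.
image, target = blend_faces([[0.0, 1.0]], [[1.0, 0.0]], alpha=0.25)
print(image, target)
print(ensemble_video_score([[0.8, 0.6], [0.4, 0.2]]))
```

Using the blending coefficient as the label turns a binary task into a regression-like one, which is one way to discourage a model from memorizing artifacts of a single generation method.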
Identifying these common characteristics will help researchers improve their models, but the DFDC results also show that this is still very much an unsolved problem. None of the 2,114 participants, who included leading experts from around the globe, achieved 70 percent average precision on unseen deepfakes in the black box data set. Facebook researchers participated in the challenge (though they were not eligible for prizes because of our role in organizing the competition). The team’s final submission did not appear on the final leaderboard due to runtime issues when it was evaluated on the private test set.
As the research community looks to build upon the results of the challenge, we should all think more broadly and consider solutions that go beyond analyzing images and video. Considering context, provenance, and other signals may be the way to improve deepfake detection models.
Once the challenge was completed, we worked with several of our academic partners to stress-test the winning models. We wanted to better understand any specific vulnerabilities in the models before they were open-sourced. The University of Oxford, the National University of Singapore, and the Technical University of Munich all participated and utilized different techniques to try to trick the models. The participants presented methods to trigger false positives and false negatives, and relevant insights on the generalization capabilities of the winning models. These results will be presented at the Media Forensics Workshop @ CVPR.
The issue of deepfakes is an important and difficult one. Like many other types of harmful content, it is adversarial in nature and will continue to evolve, and no single organization can solve these challenges on its own. But the DFDC has enabled us to work together, accelerate progress, and ultimately help prevent people from being deceived by the images and videos they see online.
We’re grateful for the contributions from all the participants in the challenge, the community, and our partners. As we continue this work, we’re committed to continuing this open, collaborative approach to improving deepfake detection tools. Building technology to detect deepfake videos effectively is important for all of us, and we will continue to work openly with other experts to address this challenge together.
In the first video above, clips 1, 4, and 6 are original, unmodified videos. Clips 2, 3, and 5 are deepfakes created for the Deepfake Detection Challenge.