Deepfake Detection Challenge launches with new dataset and Kaggle site

December 11, 2019

Preventing the spread of deepfake videos is a serious concern for the entire tech industry and for society. We’re taking an important step forward today with the launch of the Deepfake Detection Challenge (DFDC), an open, collaborative initiative to accelerate development of new technologies for detecting deepfakes and manipulated media.

In partnership with leaders in the industry and in academia, we are launching the challenge at the Conference on Neural Information Processing Systems (NeurIPS) and providing entrants with the full release of a new, unique dataset of 100,000-plus videos specially created to aid research on deepfakes. Participants will use the dataset to create new and better models to detect manipulated media, and results will be scored for effectiveness.

I’m particularly excited about this project because it combines two elements that have been so effective in catalyzing AI research in other areas: an open challenge so researchers everywhere can compete and compare their results, and a large-scale, high-quality dataset built expressly for this use case. Deepfakes are a rapidly evolving challenge, similar to spam, phishing, and other adversarial threats, and rapid progress will require contributions from experts across the AI community.

Cristian Canton Ferrer, the Facebook AI Research Manager leading this project here, has explained why the new DFDC dataset is so important: “Ensuring that cutting-edge research can be used to detect deepfakes depends on large-scale, close-to-reality, useful, and freely available datasets. Since that resource didn’t exist, we’ve had to create it from scratch. The resulting dataset consists of more than 100,000 videos featuring paid actors in realistic scenarios, with accompanying labels describing whether they were manipulated with AI.”

Additionally, we are pleased to announce that Kaggle, the data science and machine learning community site, will host the DFDC challenge and leaderboard.

Kaggle’s CEO, Anthony Goldbloom, shared his perspective on the DFDC: “Kaggle is thrilled to be collaborating with Facebook on this challenge. AI has made dramatic leaps forward over the last decade thanks to open datasets and open challenges. This challenge is a powerful step in tackling one of the most difficult open issues in AI today.”

The challenge will be hosted on Kaggle and will run through March 2020.

A dataset designed expressly for research on deepfakes

In order to capture the full range of deepfake-generation techniques, we created the dataset videos using a highly diverse group of paid actors in a wide range of settings, poses, and backgrounds. The demographic makeup was approximately 54 percent female and 46 percent male.

These videos from the Deepfake Detection Challenge dataset show an unaltered video (left) and a deepfake (right).

Facebook AI researchers used a number of different techniques to generate face swaps and voice alterations from the original videos. With a subset of the videos, we also applied augmentations that approximate actual degradations seen in real-life videos shared online.
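
To give a sense of what such augmentations might look like, here is a minimal sketch of two simple degradations: lowering the frame rate and lowering the resolution. It is not the pipeline used to build the DFDC dataset. The function names, the choice of degradations, and the representation of a frame as a list of rows of grayscale integers are all assumptions made for illustration.

```python
def drop_frames(frames, keep_every=2):
    """Lower the effective frame rate by keeping every `keep_every`-th frame.

    Hypothetical helper: `frames` is any sequence of frames.
    """
    if keep_every < 1:
        raise ValueError("keep_every must be at least 1")
    return list(frames[::keep_every])


def downscale(frame, factor=2):
    """Reduce resolution by averaging non-overlapping factor x factor blocks.

    Hypothetical helper: `frame` is a list of equal-length rows of integer
    pixel values. Rows and columns that do not fill a whole block are dropped.
    """
    if factor < 1:
        raise ValueError("factor must be at least 1")
    if not frame:
        return []
    height = len(frame) - len(frame) % factor
    width = len(frame[0]) - len(frame[0]) % factor
    result = []
    for r in range(0, height, factor):
        row = []
        for c in range(0, width, factor):
            block = [
                frame[r + dr][c + dc]
                for dr in range(factor)
                for dc in range(factor)
            ]
            # Integer mean of the block stands in for the lost detail.
            row.append(sum(block) // len(block))
        result.append(row)
    return result
```

A detector trained only on pristine footage may struggle with clips degraded this way, which is one plausible reason to include such examples in a training set.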

Participants in the challenge must submit their code into a black-box environment for testing. Entrants do not need to share their models in order to participate, but to be eligible for the challenge prizes, they must agree to open-source their work so others in the research community can benefit. Entrants will retain rights to their models trained on the training dataset. Submissions will be scored and ranked by the evaluation metric detailed on the challenge website, and the leaderboard will be regularly updated on the site so participants can compare their progress with others’.
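
This announcement leaves the exact metric to the challenge website. As a rough illustration of how a detector that outputs probabilities could be scored, the sketch below assumes binary log loss, a common choice for tasks of this kind. The function name and the label convention (1 for a manipulated video, 0 for an unaltered one) are assumptions for illustration, not taken from the challenge rules.

```python
import math


def log_loss(labels, probabilities, eps=1e-15):
    """Mean binary log loss; lower is better.

    Assumed convention: label 1 means manipulated, 0 means unaltered, and
    each probability is the model's estimate that the video is manipulated.
    """
    if not labels or len(labels) != len(probabilities):
        raise ValueError("labels and probabilities must be non-empty and equal in length")
    total = 0.0
    for y, p in zip(labels, probabilities):
        # Clip so that a prediction of exactly 0 or 1 never yields log(0).
        p = min(max(p, eps), 1.0 - eps)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(labels)
```

Under this metric, a model that answers 0.5 for every video scores ln 2 (about 0.693), while a confident wrong answer is penalized far more heavily than an uncertain one. That rewards well-calibrated probabilities rather than hard yes-or-no guesses.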

We’ve also taken extensive measures to make sure the data was gathered responsibly and will not be misused. The videos feature only paid actors who have entered into agreements to help with the creation of the dataset, so as to avoid restrictions that could hamper researchers’ work. (No Facebook user data is used in this dataset.)

Access to the dataset will be gated, so that only researchers who have agreed to the dataset license and been accepted into the challenge can access it.

More information is available on the Deepfake Detection Challenge site, including details on how Facebook, the Partnership on AI, Microsoft, Amazon Web Services (AWS), and experts from leading academic institutions and media organizations came together to create this initiative. Facebook has dedicated more than $10 million in awards and grants for the challenge to help encourage more participation. AWS is also contributing up to $1 million in AWS credits and offering to host entrants’ models if they choose. The DFDC leaders are committed to sharing their expertise as well as technical and other resources.

Irina Kofman, the Facebook AI Director and Business Lead managing the DFDC program, has highlighted the importance of gathering input from a broad range of partners:

“It is inspiring to see the commitment from partners across multiple areas, including industry, academia, civil society, and media, and how they came together over many months to create the challenge. Each brought insights from their respective area and allowed us to consider a broad range of viewpoints. After releasing the initial dataset in October, we were able to gather feedback from academic advisers and partners and continue to improve the dataset being released today. By bringing the community together and fostering collaboration we hope that we can enable accelerated progress in this space.”

We know there will not be a simple and conclusive technical solution to these issues. I’m confident, however, that this open approach to research will help us build new tools to prevent people from using AI to manipulate videos in order to deceive others.