Shedding light on fairness in AI with a new data set

April 8, 2021

4.1.22 Update: As of April 1, 2022, the correspondences between the Deepfake Detection Challenge and Casual Conversations data sets are available. Researchers can now measure the biases of their deepfake detection models.

12.14.21 Update: We’re continuously augmenting our open-source data set to help surface fairness issues in AI speech models. As of Dec 2021, we've added human speech transcriptions so that researchers can test how automatic speech recognition models work for people in different demographic groups.

Facebook AI has built and open-sourced a new, unique data set called Casual Conversations, consisting of 45,186 videos of participants having nonscripted conversations. It serves as a tool for AI researchers to surface useful signals that may help them evaluate the fairness of their computer vision and audio models across subgroups of age, gender, apparent skin tone, and ambient lighting.
To our knowledge, it’s the first publicly available data set featuring paid individuals who explicitly provided their age and gender themselves — as opposed to information labeled by third parties or estimated using ML models. We prefer this human-centered approach and believe that it allows our data to have a relatively unbiased view of age and gender.
A group of trained annotators also labeled ambient lighting conditions and used the Fitzpatrick Scale to label participants’ apparent skin tones, in order to help AI researchers analyze how AI systems perform across different skin tones and low-light ambient conditions.
As an industry and research community, we are at the beginning of understanding the multifaceted, ongoing challenges of fairness and bias. By making fairness research more transparent and normalizing subgroup measurement, we hope this data set brings the field one step closer to building fairer, more inclusive technology.

Whether we realize it or not, we all have implicit biases that may influence daily judgments and decisions, and there are explicit inequalities in the world in which we live. Studies show that résumés with subtle cues of older age, like old-fashioned names, get fewer callbacks, and that educated Black men are sometimes mistakenly remembered as having lighter skin. These biases can make their way into data used to train AI systems, which could amplify unfair stereotypes and lead to potentially harmful consequences for individuals and groups — an urgent, ongoing challenge across industries. Smart cameras, for instance, can be less accurate in recognizing certain subgroups of people when the model learns from data sets that don’t reflect all skin tones. And some decision-making algorithms in healthcare have even proven to unfairly exclude people from getting the treatment they need based on flawed benchmarks.

As a field, industry and academic experts alike are still in the early days of understanding fairness and bias when it comes to AI. We’ve recently talked about our approach to AI fairness as a process, designed to account for the holistic, potential impact of our products and services on society. To make meaningful progress, it’s important that we consider fairness across multiple dimensions, looking at not just the performance of AI systems but also the structures in which they are situated. Fairness can vary based not just on the application, but also on the environment, culture, and community in which the product is used. However, the technical implementation is a critical piece of the broader fairness puzzle. And an open challenge in improving fairness in AI systems is the lack of high-quality data sets designed to help evaluate potential algorithmic biases in complex, real-world AI systems.

As part of our ongoing commitment to improve fairness and responsibility of AI systems, we’re releasing a new, unique data set called “Casual Conversations” that’s — to our knowledge — the first publicly available data set using videos of paid actors, each of whom agreed to participate in the project and explicitly provided age and gender labels themselves. We prefer this human-centered approach and believe it allows our data to have a relatively unbiased view of age and gender. In addition to age and gender, Casual Conversations includes labels of apparent skin tone, which were provided by trained annotators using the Fitzpatrick scale, to help researchers evaluate their computer vision and audio models for accuracy along these groups. These trained annotators also labeled videos with ambient lighting conditions to help measure how models treat people with various skin tones under low-light ambient conditions.

With 45,186 videos of 3,011 participants, Casual Conversations is intended to be used for assessing the performance of already trained models in computer vision and audio applications, for the purposes permitted in our data use agreement. The agreement doesn’t allow certain tasks, such as training models that identify gender, age, and skin tone, as this data set is purely for model measurement. Because knowing the limitations of data sets is so critical, it is important to note that we collected the data set in the United States and did not solicit information on where the participants are originally from. Also, in gathering the gender of the participants in the study, we provided the choices of male, female, and other. We recognize that this characterization is insufficient and doesn’t represent all genders, such as people who identify as nonbinary. Casual Conversations is a good, bold first step forward, and we’ll keep pushing progress toward developing data analysis that captures additional diversity of genders while continuing to respect people’s privacy. Over the next year or so, we’ll explore pathways to expand this data set to be even more inclusive, with representations that include a wider range of gender identities, ages, geographical locations, activities, and other characteristics.

Progress has always been cumulative, and we can’t improve the fairness of our AI systems without collaborative research and open science. Although there is no one-size-fits-all approach to surfacing fairness issues, where possible, the field would benefit from a standardized way to identify subgroups for which AI systems can perform better and share successes and pitfalls in order to make collective progress. The AI research community can use Casual Conversations as one important stepping stone toward normalizing subgroup measurement and fairness research. With Casual Conversations, we hope to spur further research in this important emerging field.

Unpacking the Casual Conversations data set

The Casual Conversations data set features videos of 3,011 people with consistent and evenly distributed age and gender annotations, as well as apparent skin-tone groups, in a total of 45,186 videos (or roughly 15 videos per individual).

The standard way of evaluating the performance of AI models today is to measure against a test set after the model has been trained and validated. While these test sets can identify the accuracy of the model prediction, they may contain the same shortcomings as the training sets because they’re from the same distribution and domain. Our new Casual Conversations data set should be used as a supplementary tool for measuring the fairness of computer vision and audio models, in addition to accuracy tests, for communities represented in the data set. It’s designed to surface instances in which performance may be unequal across different subgroups on four dimensions: age, gender, apparent skin tone, and ambient lighting conditions. Since skin tone and ambient lighting are not relevant dimensions for audio models alone, we encourage the research community to explore the responsible development of data sets to test inclusivity of audio models along relevant dimensions.
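As a rough illustration of what subgroup measurement can look like in practice, the sketch below computes accuracy separately for each value of one annotated dimension and reports the gap between the best and worst subgroup. The record layout, the field names (`age`, `lighting`, `label`, `prediction`), the subgroup values, and the toy numbers are all assumptions made for this example; they are not the data set's actual schema or results.

```python
from collections import defaultdict


def accuracy_by_subgroup(records, dimension):
    """Return {subgroup: accuracy} for one annotated dimension."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for rec in records:
        group = rec[dimension]
        total[group] += 1
        correct[group] += int(rec["prediction"] == rec["label"])
    return {group: correct[group] / total[group] for group in total}


def largest_gap(per_group):
    """Difference between the best and worst subgroup scores."""
    values = list(per_group.values())
    return max(values) - min(values)


# Toy records with hypothetical field names and subgroup values.
records = [
    {"age": "18-30", "lighting": "bright", "label": 1, "prediction": 1},
    {"age": "18-30", "lighting": "dark", "label": 0, "prediction": 0},
    {"age": "46-85", "lighting": "bright", "label": 1, "prediction": 1},
    {"age": "46-85", "lighting": "dark", "label": 1, "prediction": 0},
]

by_lighting = accuracy_by_subgroup(records, "lighting")
print(by_lighting)               # {'bright': 1.0, 'dark': 0.5}
print(largest_gap(by_lighting))  # 0.5
```

A single overall accuracy for these toy records would be 0.75, which hides the fact that every error falls in the low-light subgroup. That is the kind of signal a per-subgroup breakdown is meant to surface.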

To build this data set, we consulted with Facebook’s Responsible AI team to leverage the original video recordings created by Facebook for the Deepfake Detection Challenge (DFDC). We used the standardized Fitzpatrick scale to label the apparent skin tone of each participant. Uniform distributions of the labels help identify unevenly distributed errors in our measurements, and allow researchers to surface potential algorithmic biases. The standard Fitzpatrick scale grouping system has limitations in capturing diversity because it’s biased toward lighter skin tones. A common procedure to alleviate this bias is to group the Fitzpatrick skin types into three buckets: light [types I, II], medium [types III, IV], and dark [types V, VI]. Our annotations provide the full array of Fitzpatrick skin types so that AI researchers can ensure their data set is representative of all skin tones. It’s also important to make sure that distributions are balanced not just within subgroups but also across different intersections of those groups. For instance, even if an AI system performs equally well across all age groups, it should not underperform for older women with darker skin. Our data set and paper provide these intersectional breakdowns as well.
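The three-bucket grouping and the idea of checking intersections can be sketched in a few lines. The mapping below follows the grouping described above (types I and II as light, III and IV as medium, V and VI as dark); the participant records, field names, and helper names are hypothetical and exist only for illustration.

```python
from collections import Counter

# Grouping of the six Fitzpatrick types into three buckets.
FITZPATRICK_BUCKETS = {
    "I": "light", "II": "light",
    "III": "medium", "IV": "medium",
    "V": "dark", "VI": "dark",
}


def skin_tone_bucket(fitzpatrick_type):
    """Map a Fitzpatrick type (Roman numeral string) to its bucket."""
    return FITZPATRICK_BUCKETS[fitzpatrick_type]


def intersection_counts(participants, dimensions):
    """Count participants for each combination of subgroup labels."""
    return Counter(tuple(p[d] for d in dimensions) for p in participants)


# Toy participants with hypothetical field names.
participants = [
    {"gender": "female", "fitzpatrick": "VI"},
    {"gender": "female", "fitzpatrick": "II"},
    {"gender": "male", "fitzpatrick": "V"},
    {"gender": "male", "fitzpatrick": "III"},
]
for p in participants:
    p["skin_tone"] = skin_tone_bucket(p["fitzpatrick"])

counts = intersection_counts(participants, ("gender", "skin_tone"))
for combo, n in sorted(counts.items()):
    print(combo, n)
```

Combinations that never appear simply return a count of zero from the `Counter`, which makes empty or thin intersections easy to spot before any model is evaluated on them.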

AI fairness in the state of the art

One important application in computer vision is deepfake detectors — a burgeoning field in media forensics that aims to distinguish AI-generated videos from real videos. Media manipulation is an ever-evolving problem, and researchers at Facebook AI and across the industry are focused on developing state-of-the-art detectors that can flag misleading videos. But there are still open questions, including how well these detectors perform across different subgroups, such as age, gender, and apparent skin tone. It’s important that deepfake detectors aim to spot doctored images, regardless of skin tone and other attributes, as well as accurately assess whether images and videos are benign. As part of our Deepfake Detection Challenge collaboration with other industry leaders and academic experts, Facebook AI was the first to build and release a large-scale video data set designed to catalyze progress in this area.

In our paper, we applied the Casual Conversations data set to measure performance by subgroup for the top five winners of the Deepfake Detection Challenge, using the roughly 5,000 videos that overlap with the Casual Conversations data set. Ultimately, we discovered that all winning approaches struggle to identify fake videos of people specifically with darker skin tones (types V and VI). Of all the submissions, the model with the most balanced predictions was actually the third-place winner. Read page 4 of the paper for more details.
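One way to make this kind of comparison concrete is to compute, for each skin-tone group, the share of fake videos that a detector actually flags. The sketch below is not the evaluation protocol from the paper; the field names (`skin_tone`, `is_fake`, `score`), the 0.5 decision threshold, and the toy scores are all assumptions, and the numbers do not reproduce any reported result.

```python
def fake_recall_by_group(results, group_key="skin_tone", threshold=0.5):
    """Fraction of fake videos flagged as fake, per subgroup.

    A video counts as flagged when its detector score is at or above
    the threshold. Real (non-fake) videos are ignored here.
    """
    flagged, fakes = {}, {}
    for r in results:
        if not r["is_fake"]:
            continue
        group = r[group_key]
        fakes[group] = fakes.get(group, 0) + 1
        flagged[group] = flagged.get(group, 0) + int(r["score"] >= threshold)
    return {group: flagged[group] / fakes[group] for group in fakes}


# Toy detector outputs; values are invented for illustration only.
results = [
    {"skin_tone": "light", "is_fake": True, "score": 0.9},
    {"skin_tone": "light", "is_fake": True, "score": 0.7},
    {"skin_tone": "dark", "is_fake": True, "score": 0.8},
    {"skin_tone": "dark", "is_fake": True, "score": 0.3},
    {"skin_tone": "dark", "is_fake": False, "score": 0.1},
]

recall = fake_recall_by_group(results)
print(recall)  # {'light': 1.0, 'dark': 0.5}
```

Running the same function over several detectors and comparing the spread of per-group recall is one simple way to ask which model makes the most balanced predictions.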

Toward fairness in computer vision and audio systems

This release is part of Facebook’s long-term initiative of building AI-powered technologies in a responsible way, from proactively improving the Facebook experience for communities that are often underserved to inventing new ways to prevent hate speech from spreading across the internet. Fairness is a multifaceted, iterative process, and we’re continuously researching ways to understand not just whether algorithms are fair, but also whether the products, policies, and implementation are fair, just, and reasonable. Of course, we cannot put a “fairness checkmark” on AI systems, and we have more work to do to address potential fairness concerns. By open-sourcing our data set and enabling the wider AI industry to better understand how AI performs on critical subgroups, we hope to spur research and dialogue and to make progress toward more inclusive AI. Facebook is making Casual Conversations available for all internal teams today. We’re encouraging teams to use it as one of several data sets that they use for evaluation. And we’re actively working on expanding the data set to represent more diverse groups of people.

We hope the research community extends annotations of our data set for their own computer vision and audio applications, in line with our data use agreement. Collaborative research — through open source tools, papers, and discussions — will be critical in developing legitimate processes of responsible AI development, to ensure that the technologies are inclusive of a diverse range of perspectives from the research community.

READ THE PAPER

GET THE DATASET