ML Applications

How we’re using Fairness Flow to help build AI that works better for everyone

March 31, 2021

AI plays an important role across Facebook’s apps — from enabling stunning AR effects, to helping keep bad content off our platforms, to directly improving the lives of people in our communities through our COVID-19 Community Help hub. As AI-powered services become ubiquitous in everyday life, it’s becoming even more important to understand how systems might affect people around the world and how to help ensure the best possible outcomes for them.

Facebook and the AI systems we use have a broad set of potential impacts on and responsibilities related to important social issues from data privacy and ethics, to security, the spread of misinformation, polarization, financial scams, and beyond. As an industry, we are working to understand those impacts, and as a research community, we have only just begun the journey of developing the qualitative and quantitative engineering, research, and ethical toolkits for grappling with and addressing them.

Fairness in AI — a broad concern across the industry — is one such area of social responsibility that can have an enormous impact on the people who use our products and services. One aspect of fairness in AI relates to how AI-driven systems may affect people in diverse groups, or groups that have been historically marginalized in society. This has been a long-standing focus for Facebook’s engineers and researchers.

When building the Portal Smart Camera, for example, we worked to make sure it performed well for diverse populations. Likewise, we’ve used AI to build improved photo descriptions for people who are blind or visually impaired. Despite these important accomplishments, we know that as an industry and research community we are still in the early days of understanding the right, holistic processes and playbooks that may be used to achieve fairness at scale. To help advance this emerging field and spread the impact of such work throughout Facebook, we created an interdisciplinary Responsible AI (RAI) team several years ago. Within RAI, the Fairness team works with product teams across the company to foster informed, context-specific decisions about how to measure and define fairness in AI-powered products.

In a prior blog post, we described RAI’s overall process-based approach to defining fairness for each product and surfacing ethical concerns. We believe AI can work well for everyone, and we’re tackling some of the hard problems to help get there. But because an AI system can perform poorly for some groups even when it appears to perform well for everyone on average, questions about fairness and system measurement by subgroup are a crucial part of our responsibility effort.

If the people who build AI don't consider, before their systems are put into production, how those systems might work in different situations and for different groups of people and what outcomes they might produce, and if they don't take steps to address any issues that arise, the products and services they develop risk having unintended consequences.

Designing an AI system to be fair and inclusive is not a one-size-fits-all task; it's an iterative process that involves working to understand what it means for a product or system to perform well for all users while carefully balancing any tensions that may exist between stakeholders' interests. In some instances, such as with facial recognition technologies, one way to help address fairness is to train the AI system on more representative data sets. In others, it involves grappling with a wide range of challenging issues, including the difficulty (or impossibility) of arriving at a universal definition of fairness as it relates to the AI system and managing trade-offs so that AI systems can work for multiple communities and stakeholders with sometimes divergent interests. Creating fairer systems often requires thinking critically about when and how potential statistical bias can creep into a machine learning process, both during and after production, and taking steps to mitigate it.

One important step in the process of addressing fairness concerns in products and services is surfacing measurements of potential statistical bias early and systematically. To help do that, Facebook AI developed a tool called Fairness Flow, and we're sharing more details here.

Initially launched in 2018 after consulting with experts at Stanford University, the Center for Social Media Responsibility, the Brookings Institution, and the Better Business Bureau Institute for Marketplace Trust, Fairness Flow is a technical toolkit that enables our teams to analyze how some types of AI models and labels perform across different groups. Fairness Flow is a diagnostic tool, so it can't resolve fairness concerns on its own; that would require input from ethicists and other stakeholders, as well as context-specific research. But Fairness Flow can provide necessary insight to help us understand how some systems in our products perform across user groups.

Since its launch, we have continued to improve Fairness Flow, scaling our tooling and building infrastructure to offer long-term, recurring automated measurement that can support ongoing, holistic work. Fairness Flow offers a high-level statistical understanding of the performance of labeling and models per group, enabling and supporting deep-dive investigations into the systems, processes, and policies surrounding the types of models it can analyze. We are working to understand and potentially expand the ways Fairness Flow can be used for more AI models.

Who uses it: Fairness Flow across Facebook

RAI doesn’t work on responsibility practices in a vacuum. We partner closely with product teams to help define and understand fairness concerns in their products, and in some cases to support their use of Fairness Flow to help understand their systems. Across the company, from the engineers working to keep harmful content off our platform, to teams like Equity on the Instagram side, we are constantly introducing and improving on responsibility best practices that make our platform and the technologies we produce better for all.

Instagram Equity is one example of a team that was formed to address potential inequities that may exist in product development or impact people's experiences on Instagram. One of Instagram Equity's key responsibilities is to work with all teams across Instagram to help develop fair and equitable products.

Early work in this space includes the introduction of model cards, which have been designed to accompany trained machine learning models and provide the information necessary to help ensure that the models are used appropriately and to minimize the potential for unintended consequences and biases. These model cards include a model bias assessment that utilizes Fairness Flow and are already being used across Instagram’s integrity systems, along with other tools. The Equity team aims to have model cards applied to all Instagram models before the end of next year. Meanwhile, across Facebook, we’re also starting to use Fairness Flow to better understand potential errors that may affect our ads system, as part of our ongoing and broader work to study algorithmic fairness in ads.

Because definitions of fairness can change by context, Facebook's RAI researchers and engineers are developing best practices for common use cases across the company (such as using a binary classifier to identify content that violates a policy and enqueue it for human review). We can then advise product teams on these uses so they can choose an appropriate framework and strategy for measuring and improving fairness in their particular system. We are building tooling that will allow teams to interactively explore the implications of different measurement strategies and automatically schedule analyses once they've settled on the appropriate one.

Use of Fairness Flow is currently optional, though it is encouraged in cases that the tool supports. Facebook’s goal is to require comprehensive fairness assessments over time, but we have more work to do to develop tools, guidelines, and processes to support our full variety of use cases while allowing for sufficient flexibility and nuance to correctly handle a product’s specific ethical implications.

Since fairness is so contextual, there will never be a single metric that applies in the same way to all products or AI models. Ethical assessments will often need to combine purposeful quantitative measurements from tools like Fairness Flow with qualitative assessments, which can’t be automated. We are currently working to develop and test more rigorous governance approaches that provide the correct contextual flexibility, with the goal of designing a process that we can require of teams.

How it works: Fairness Flow in practice

Fairness Flow works by helping machine learning engineers detect certain forms of potential statistical bias in certain types of AI models and labels commonly used at Facebook. Model bias occurs when a model systematically over- or under-estimates the outcome at issue for different groups, or when it applies different standards to different groups. For example, a spam detection model might disproportionately flag content from one group as spam while similar content from another group does not receive the same treatment. Label bias occurs when human labelers apply inconsistent standards to content produced by different groups. While it may not be possible to eliminate model and label bias completely, it is possible to strive for systems that are more fair.

By measuring whether models or human-labeled training data perform better or worse for different groups of people, machine learning engineers can see if they need to take steps to improve the comparative performance of their models. Some changes they can consider include broadening or improving representation within their training or test data sets, examining which features are important, or exploring more complex or less complex models.

The core of Fairness Flow is a Python library. It provides a simple API that requires a data set of predictions, labels, group membership (e.g., gender or age), and sampling weights (for models); or labels, ground truth, group membership, and sampling weights (for labels); along with some metadata about the type of model (e.g., classifier or regression) and the desired metrics. As output, the API can both (a) provide a full report with informative metrics, statistical confidence and power, and a narration of how to interpret the results, and (b) record metrics to databases for ongoing monitoring.
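To make the shape of that interface concrete, here is a minimal sketch of what a call might look like. Fairness Flow itself is an internal library, so the function name, parameters, and toy data below are hypothetical placeholders rather than its actual API.

```python
# Hypothetical sketch of the inputs described above: predictions, labels,
# group membership, and sampling weights, plus metadata about the model type
# and desired metrics. The fairness_flow call is commented out because the
# real library is internal to Facebook and its actual API may differ.
import pandas as pd

data = pd.DataFrame({
    "prediction": [0.91, 0.12, 0.78, 0.05, 0.66],  # model scores
    "label":      [1, 0, 1, 0, 0],                 # labels the model was evaluated against
    "group":      ["A", "B", "A", "B", "B"],       # group membership (e.g., an age bucket)
    "weight":     [1.0, 1.0, 2.5, 1.0, 1.0],       # sampling weights
})

# report = fairness_flow.analyze_model(            # hypothetical entry point
#     data,
#     prediction_col="prediction",
#     label_col="label",
#     group_col="group",
#     weight_col="weight",
#     model_type="binary_classifier",              # metadata about the model
#     metrics=["calibration", "fpr", "fnr"],       # desired metrics
#     output="full_report",                        # or record metrics to a database
# )
```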

Assessing model fairness using Fairness Flow

Some AI models are designed to predict whether certain outcomes are true or false, likely or unlikely, or positive or negative. And because systems rely on statistical models, even the best performing system will inevitably produce some errors. It’s important to understand whether these system errors might be affecting groups differently, and part of doing this requires that we measure the performance of the system per group.

To measure the performance of an algorithm's predictions for certain groups, Fairness Flow works by dividing the data a model uses into the relevant groups and calculating the model's performance group by group. For example, one of the fairness metrics the toolkit examines is the number of examples from each group. The goal is not for each group to be represented in exactly the same numbers, but to determine whether the model has sufficient representation from each group within the data set. Other areas Fairness Flow examines include whether a model can accurately classify or rank content for people from different groups, and whether a model systematically over- or underpredicts for one or more groups relative to others.
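As a rough illustration of this kind of per-group breakdown, a minimal sketch for a binary classifier might look like the following. The column names, the 0.5 threshold, and the toy data are assumptions for illustration; this is not Fairness Flow's implementation.

```python
# Minimal sketch of a per-group report for a binary classifier, assuming scores,
# labels, and group membership live in a pandas DataFrame.
import pandas as pd

def per_group_report(df, score_col="score", label_col="label",
                     group_col="group", threshold=0.5):
    """Report sample count, accuracy, and calibration gap for each group."""
    df = df.assign(pred=(df[score_col] >= threshold).astype(int))
    rows = []
    for group, g in df.groupby(group_col):
        rows.append({
            "group": group,
            "n_examples": len(g),                            # representation in the data set
            "accuracy": (g["pred"] == g[label_col]).mean(),  # classification quality per group
            "mean_score": g[score_col].mean(),               # average predicted score
            "positive_rate": g[label_col].mean(),            # observed base rate
            # Positive gap suggests systematic over-prediction, negative under-prediction.
            "calibration_gap": g[score_col].mean() - g[label_col].mean(),
        })
    return pd.DataFrame(rows)

# Toy example:
toy = pd.DataFrame({
    "score": [0.9, 0.2, 0.7, 0.4, 0.8, 0.1],
    "label": [1, 0, 1, 1, 0, 0],
    "group": ["A", "A", "A", "B", "B", "B"],
})
print(per_group_report(toy))
```

In a sketch like this, a large gap between a group's mean score and its observed positive rate would flag systematic over- or underprediction for that group, the kind of signal that should prompt a deeper look.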

For each of the fairness metrics the toolkit analyzes, differences in performance across groups aren’t always an indication of a fairness concern, but they can be. When performance differences arise, the product team and model engineers should dig deeper to understand and evaluate the source of differences and the context in which the model is used, so they can determine whether the differences observed indicate a fairness concern. Notable differences in group performance can suggest that there may not be enough training data for a group, that the data captured does not cover a rich enough set of features, or that a group has systematically different behavior that a model does not capture.

Using Fairness Flow to assess potential bias in labels

Fairness Flow measures models based on whether their predictions deviate from the labels they were trained on, which implicitly assumes that the labels themselves are correct. But those labels are often the product of many individual decisions made by different people, and there is the risk that human-assigned labels might have embedded human biases.

Fairness Flow can also be used to evaluate binary labels, and compares the labels that annotators have provided with a set of high-quality labels produced by experts, which are assumed to be ground truth for measurement purposes. These ground truth labels may come from subject matter experts at Facebook or from experienced labelers (of course, even these high-quality labels can embed their own biases, but this methodology focuses on potential biases introduced in the general labeling process).

Similar to the predictive model methodology, Fairness Flow's label measurements divide content by group and calculate accuracy metrics for each group. These accuracy metrics include the false positive rate, false negative rate, and prevalence of labels produced by the scaled labelers compared with the ground truth labels. Simply having a different prevalence or a different false positive rate across groups does not by itself imply bias; groups may have different positive rates in the ground truth data, or some may be harder to adjudicate than others.

Instead of simply measuring these differences, Fairness Flow uses a methodology based on Signal Detection Theory that decomposes these metrics into new metrics that can be interpreted as (a) the difficulty of labeling content from a group, and (b) differing thresholds used by the labelers. Typically, the second is interpreted as labelers applying different standards, though there are also situations where the first might be evidence that labelers are systematically unable to reliably label content from a certain group.
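As a rough numerical sketch of that decomposition, the standard equal-variance signal detection quantities (discriminability d′ and a decision criterion) can be computed from each group's hit rate and false alarm rate against the ground truth labels. The rates below are made up for illustration; this is not Fairness Flow's internal code.

```python
# Sketch of an equal-variance signal detection decomposition of labeler decisions,
# computed per group against expert ground truth labels.
from scipy.stats import norm

def sdt_decompose(hit_rate, false_alarm_rate):
    """Return (d_prime, criterion). d' reflects how reliably content from the
    group can be labeled; the criterion reflects the threshold labelers apply."""
    z_hit = norm.ppf(hit_rate)
    z_fa = norm.ppf(false_alarm_rate)
    return z_hit - z_fa, -0.5 * (z_hit + z_fa)

# Hypothetical per-group rates measured against the ground truth labels.
groups = {
    "group_A": {"hit_rate": 0.90, "false_alarm_rate": 0.10},
    "group_B": {"hit_rate": 0.90, "false_alarm_rate": 0.25},
}

for name, rates in groups.items():
    d_prime, criterion = sdt_decompose(**rates)
    print(f"{name}: d'={d_prime:.2f}, criterion={criterion:.2f}")

# Similar d' values with different criteria would suggest labelers apply different
# standards to the two groups; a notably lower d' for one group would suggest its
# content is harder to label reliably.
```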

(Figure: The gap shows evidence of statistical bias in how labelers are processing the data.)

Fairness Flow is context-agnostic for both models and labels, but we have adopted best practice metrics and methodologies for certain types of supervised models (like binary classifiers) and for labels where ground truth data is available in sufficient volume. We are performing ongoing research with ethicists, data scientists, social scientists, UX researchers, and ML engineers — all on the RAI team — to develop and identify best practices for more cases. Our goal is to develop processes that enable teams using AI to systematically surface potential issues as they build products.

What’s next: Improving and scaling

Fairness Flow is available to product teams across Facebook and can be applied to models even after they are deployed to production. However, Fairness Flow can’t analyze all types of models, and since each AI system has a different goal, its approach to fairness will be different. Choosing the right metric for a given use case requires a deep understanding of the product, the community (and larger world) in which it operates, the way predictions are used, the way mispredictions may affect users, and the groups that may be at risk. This ultimately requires product expertise and user experience research, which are difficult to scale, and we know we can always do better to understand our users and user communities.

An appropriate fairness metric will be directly related to the way users might experience the product and the specific potential impact we are trying to prevent. For example, the right fairness metric for measuring whether a piece of content is an instance of bullying or harassment may not be the right one for measuring how posts are ranked in News Feed. Fairness Flow provides metrics that speak to multiple dimensions of fairness, but product teams ultimately determine how to perform the measurements that fit their context.

We will keep working to help build technology responsibly. Fairness Flow is just one tool among many that we are deploying to help ensure that the AI that powers our products and services is inclusive, works well for everyone, and treats individuals and communities fairly.

Written By

Isabel Kloumann

Research Science Manager

Jonathan Tannen

Research Engineering Manager