How Facebook annotates multimodal training data for ML

May 31, 2019

Written by Wahyudinata Setiawan, Roshan Sumbaly

  • We have built an annotation platform called Halo, which allows Facebook researchers to more easily and efficiently create high-quality training data. Halo creates annotation tasks and then visualizes corresponding data across diverse media types, including images, audio, video, and 3D.

  • The platform’s novel, pluggable architecture expresses the end-to-end annotation process through a configuration, allowing for swappable components that can be used across annotation tasks.

  • The goal of Halo is to let Facebook researchers focus on training ML models — not on building annotation tools.

The performance of many machine learning (ML) applications depends on the quality and quantity of training data available for the underlying models. In ML, an integral part of researchers’ workflow is creating and setting up pipelines to annotate training data. To scale this process, we used React to build a new configuration-based annotation UI framework. It leverages SVG layering to stack various annotation components into a single platform that allows researchers to quickly plug and play different pieces.

We refer to this framework as Human-AI loop (Halo). Researchers can streamline annotation tasks, visualize the results and accuracy metrics of annotations, and export the annotations to start training their models.

In the early years of Facebook’s ML development, researchers created custom annotation tools from scratch to accomplish ad hoc tasks. As ML proliferated across teams, we needed a scalable way to create and annotate training data for a wide range of cases. We also wanted a way to share best practices with researchers across the organization around task creation and labeler management.

All of these considerations prompted us to build a single platform to support our researchers while improving every step of the annotation process. Models trained with Halo-labeled data sets are powering AI experiences across our family of products.

Behind the scenes: A flexible, configuration-based system

Our biggest challenge was to create a platform that not only supports a breadth of tasks but also is maintainable and extensible. The tasks can span any combination of the following:

  • Types of media, such as audio, video, image, and 3D, including combinations of multiple media types per job.

  • Types of annotation, like bounding boxes, pixel-level segmentation, transcriptions, and combinations.

  • Queue management, which provides different algorithms for allocating jobs to labelers. This includes the ability to send the same job to multiple labelers and measure inter-labeler agreement.

  • Custom metrics frameworks that can take a gold standard annotation and compute task-specific quality metrics.
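As an illustration of the inter-labeler agreement mentioned above, here is a minimal sketch. It assumes each labeler returns a single categorical label per job; the function names and the pairwise-agreement measure are our own choices for the example, not Halo's actual implementation.

```python
from collections import Counter
from itertools import combinations


def pairwise_agreement(labels):
    """Fraction of labeler pairs that gave the same label for one job."""
    pairs = list(combinations(labels, 2))
    if not pairs:
        return 1.0  # a single labeler trivially agrees with itself
    return sum(a == b for a, b in pairs) / len(pairs)


def majority_label(labels):
    """Most common label among the labelers for one job."""
    return Counter(labels).most_common(1)[0][0]


# Hypothetical job: the same image was sent to three labelers.
job_labels = ["cat", "cat", "dog"]
agreement = pairwise_agreement(job_labels)  # one agreeing pair out of three
consensus = majority_label(job_labels)      # "cat"
```

Jobs with low agreement could then be routed to additional labelers or flagged for review, depending on the allocation algorithm chosen for the task.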

This need for configurability prompted us to create an architecture that expresses the entire annotation process, from creating a new task to generating a final analytics report, through a single JSON-based annotation configuration. We also integrate best practices into the platform by setting up well-tested and well-studied defaults. For instance, we automatically inject gold standard data at different rates and implement a standardized set of task-specific accuracy metrics in our analytics reports.
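To make the idea of a single configuration concrete, the sketch below shows what such a JSON document might look like and how gold standard jobs could be injected at a configured rate. Every key name, the validation rules, and the injection strategy are assumptions made for illustration; the actual Halo schema is not described here.

```python
import json
import random

# Hypothetical configuration; every key name here is illustrative.
CONFIG_JSON = """
{
  "task": "text_bounding_boxes",
  "media": ["image"],
  "layers": ["image", "bounding_box", "crosshair"],
  "queue": {"labelers_per_job": 2, "gold_injection_rate": 0.1},
  "metrics": [{"name": "iou", "threshold": 0.5}]
}
"""

REQUIRED_KEYS = {"task", "media", "layers", "queue", "metrics"}


def load_config(text):
    """Parse the JSON and check that the top-level sections are present."""
    config = json.loads(text)
    missing = REQUIRED_KEYS - config.keys()
    if missing:
        raise ValueError(f"missing config sections: {sorted(missing)}")
    rate = config["queue"]["gold_injection_rate"]
    if not 0.0 <= rate <= 1.0:
        raise ValueError("gold_injection_rate must be between 0 and 1")
    return config


def build_queue(jobs, gold_jobs, rate, rng):
    """After each regular job, insert a gold standard job with probability `rate`."""
    queue = []
    for job in jobs:
        queue.append(job)
        if gold_jobs and rng.random() < rate:
            queue.append(rng.choice(gold_jobs))
    return queue


config = load_config(CONFIG_JSON)
queue = build_queue(
    ["job-1", "job-2", "job-3"],
    ["gold-1"],
    config["queue"]["gold_injection_rate"],
    random.Random(0),  # seeded so the sketch is reproducible
)
```

Because the queue, layers, and metrics all read from the same document, changing a rate or a threshold is a configuration edit rather than a code change.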

To use Halo, researchers start by submitting a proposal to clarify the project details, data usage, and broader context within Facebook. Projects that require a human labeling component start with a risk assessment conducted by a cross-functional privacy team. Researchers then present human labelers with a standard set of guidelines or instructions that are crafted to alleviate bias, protect privacy, and ensure objectivity. These guidelines are used when creating the data necessary for each step of the annotation process.

Researchers proceed through the following steps of the Halo process.

Researchers upload their data into our system through either an internal database query or our APIs. Given a query, Halo can process the data in parallel chunks and enqueue each row as a separate annotation job. Once the data has been loaded into our intermediate storage, the researcher uses the UI builder to define how the annotation tool will behave. This React-based UI platform consumes the configuration and then layers the different components using SVG’s element rendering order. This concept is similar to the approach taken by widely used photo editing or presentation software, where each layer represents a part of the image. Within Halo’s UI platform, however, each layer has its own functionality. For example, in a bounding box task, the image layer is responsible for rendering the raw image, while a separate layer is responsible for rendering the bounding box.
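The layering idea can be sketched outside of React. The example below builds an SVG string in Python, relying on the fact that SVG paints later elements on top of earlier ones. The layer functions, the registry, and the job fields are hypothetical stand-ins for Halo's React components.

```python
from xml.sax.saxutils import quoteattr


def image_layer(job):
    """Bottom layer: the raw image."""
    return (
        f'<image href={quoteattr(job["image_url"])} '
        f'width="{job["width"]}" height="{job["height"]}"/>'
    )


def bounding_box_layer(job):
    """Overlay layer: one rectangle per annotated box (x, y, width, height)."""
    return "".join(
        f'<rect x="{x}" y="{y}" width="{w}" height="{h}" fill="none" stroke="red"/>'
        for x, y, w, h in job["boxes"]
    )


# Hypothetical registry mapping layer names in the configuration to renderers.
LAYER_REGISTRY = {"image": image_layer, "bounding_box": bounding_box_layer}


def render(job, layer_names):
    """Emit layers in configuration order; later layers are drawn on top."""
    groups = [
        f'<g data-layer="{name}">{LAYER_REGISTRY[name](job)}</g>'
        for name in layer_names
    ]
    return (
        f'<svg xmlns="http://www.w3.org/2000/svg" '
        f'width="{job["width"]}" height="{job["height"]}">{"".join(groups)}</svg>'
    )


job = {"image_url": "photo.jpg", "width": 640, "height": 480,
       "boxes": [(10, 20, 100, 50)]}
svg = render(job, ["image", "bounding_box"])
```

In this sketch, supporting a new annotation type means registering one more renderer and listing its name in the configuration.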

The UI platform separates additional core tool functionalities into different layers. For example, an individual layer can render a mouse’s crosshairs or display a tooltip on hover. Each of those layers can work with multiple media types. With this design, adding new features is as easy as adding a new layer. In fact, research teams have contributed their own layers to our library, expanding the range of use cases that Halo can support and compounding the value of the platform.

As with any platform supporting an ecosystem of contributed components, we needed a communication protocol to allow each component to live independently while being composed together in one tool. The UI platform uses Undux for state management, wrapped in React Hooks and statically typed by Flow. Since rendering and data management are handled by the UI platform, each component automatically benefits from core features. This includes the ability to undo and redo, local storage caching, and various rendering performance optimizations.
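The following sketch shows one way a shared store can give every component undo and redo for free. It is written in plain Python and does not reproduce the Undux API; the class name, method names, and snapshot strategy are assumptions made for the example.

```python
class AnnotationStore:
    """Shared state for all layers, with snapshot-based undo and redo.

    Values are assumed to be immutable (for example tuples), so keeping
    references to earlier state dictionaries is safe.
    """

    def __init__(self, initial):
        self._state = dict(initial)
        self._undo = []
        self._redo = []
        self._subscribers = []

    def get(self, key):
        return self._state[key]

    def subscribe(self, callback):
        """Register a layer callback that receives every new state."""
        self._subscribers.append(callback)

    def set(self, key, value):
        """Record the previous state, apply the change, and notify layers."""
        self._undo.append(self._state)
        self._redo.clear()
        self._replace({**self._state, key: value})

    def undo(self):
        if self._undo:
            self._redo.append(self._state)
            self._replace(self._undo.pop())

    def redo(self):
        if self._redo:
            self._undo.append(self._state)
            self._replace(self._redo.pop())

    def _replace(self, new_state):
        self._state = new_state
        for callback in self._subscribers:
            callback(new_state)


store = AnnotationStore({"boxes": ()})
seen = []
store.subscribe(lambda state: seen.append(state["boxes"]))
store.set("boxes", ((10, 20, 100, 50),))  # labeler draws a box
store.undo()                              # box disappears
store.redo()                              # box comes back
```

A layer only reads from and writes to the store, so it never needs to know which other layers are present in the tool.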

The next step is to set up the metrics needed to understand the quality of the output data after it is annotated. Given the wide range of annotation types and methods needed at Facebook, we created a metric framework that allows researchers to write their own custom functions in Hack. Each function consumes the same JSON configuration, which parameterizes threshold values in the metric calculation. Halo supports various standard metrics on provided gold standard data sets, from intersection over union (IoU) for image segmentation to word error rate (WER) for audio transcription to BLEU scores for text translations.
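Two of the standard metrics named above are short enough to sketch. The example is in Python rather than Hack, and the mask representation, the function names, and the shape of the threshold entry are assumptions for illustration only.

```python
def iou(predicted_pixels, gold_pixels):
    """Intersection over union of two masks given as sets of (x, y) pixels."""
    predicted, gold = set(predicted_pixels), set(gold_pixels)
    union = predicted | gold
    if not union:
        return 1.0  # two empty masks are treated as a perfect match
    return len(predicted & gold) / len(union)


def wer(hypothesis, reference):
    """Word error rate: word-level edit distance divided by reference length."""
    hyp, ref = hypothesis.split(), reference.split()
    if not ref:
        raise ValueError("reference transcript must not be empty")
    # previous[j] is the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            current.append(min(substitution, previous[j] + 1, current[j - 1] + 1))
        previous = current
    return previous[-1] / len(ref)


def passes(metric_config, value):
    """Apply a threshold taken from a (hypothetical) configuration entry."""
    if metric_config["higher_is_better"]:
        return value >= metric_config["threshold"]
    return value <= metric_config["threshold"]


gold_mask = {(0, 0), (0, 1), (1, 0), (1, 1)}
labeled_mask = {(0, 1), (1, 1), (0, 2), (1, 2)}
mask_iou = iou(labeled_mask, gold_mask)                    # 2 shared / 6 total
transcript_wer = wer("the cat sat", "the cat sat down")    # 1 error / 4 words
```

Keeping the threshold in the configuration means the same metric function can be reused across tasks with different quality bars.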

Since Halo is entirely configuration driven, tasks can be changed and tested in real time directly from the platform’s UI, with no code changes. All changes are versioned, and parameters are easy to modify throughout the annotation process. For example, researchers can update quality metrics midway through the annotation; once reconfigured, the metrics are automatically backfilled into a new version.

At this point, the task is ready, and the underlying jobs are set up to be distributed to labelers. At any time during the annotation process, researchers can visit a separate administrator UI, where they can visualize the results in read-only mode. For tasks where we have a gold standard data set, we overlay those tasks for easy auditing and provide corresponding visualizations for metrics. Researchers can further iterate on the output by chaining it to another task, allowing them to create sophisticated DAGs of annotation tasks. For instance, the result of the bounding box annotation can be cropped and then sent for keypoint annotation programmatically.
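The chaining described above can be sketched as a small dependency graph. The example uses Python's standard `graphlib` module to order the tasks; the task names, the graph layout, and the stand-in annotation functions are hypothetical and only mirror the bounding box to keypoint example.

```python
from graphlib import TopologicalSorter  # available in Python 3.9+


def crop(image, box):
    """Crop a row-major image (list of rows) to an (x, y, width, height) box."""
    x, y, w, h = box
    return [row[x:x + w] for row in image[y:y + h]]


# Hypothetical DAG: each task maps to the set of tasks whose output it consumes.
TASK_GRAPH = {
    "bounding_box": set(),
    "crop": {"bounding_box"},
    "keypoints": {"crop"},
}


def run_pipeline(image, annotate_box, annotate_keypoints):
    """Run the tasks in dependency order, feeding each output to the next."""
    results = {}
    for task in TopologicalSorter(TASK_GRAPH).static_order():
        if task == "bounding_box":
            results[task] = annotate_box(image)
        elif task == "crop":
            results[task] = crop(image, results["bounding_box"])
        elif task == "keypoints":
            results[task] = annotate_keypoints(results["crop"])
    return results


# A 4x4 "image" of pixel ids, with lambdas standing in for human labelers.
image = [[row * 4 + col for col in range(4)] for row in range(4)]
results = run_pipeline(
    image,
    annotate_box=lambda img: (1, 1, 2, 2),
    annotate_keypoints=lambda patch: [(0, 0)],
)
```

In a real deployment the two annotation steps would be separate labeling tasks, with the crop step running automatically between them.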

Halo in action

Transcription of audio from video

In one instance, researchers at Facebook used a standard video player and a single text box to transcribe all intelligible audio from a set of publicly available videos. The goal was to train and evaluate automatic speech recognition (ASR) models for live captioning and other downstream applications.

Previously, the labeler watched the entire video to transcribe the parts where a person was speaking. With the help of Halo, the team created labeler jobs programmatically via the Halo API, where each job contains video chunks generated by a voice activity detection (VAD) model. This production-ready tool was created in three days and decreased average labeling time per video by 32 percent. As a result, the transcription process is much easier for labelers, as they can skip to the snippets of the video where someone is speaking.
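A rough sketch of how VAD output could be turned into a job payload follows. It assumes the VAD model emits one speech flag per fixed-length frame; the function names and job fields are invented for the example, and the real Halo API is not shown.

```python
def speech_segments(frame_is_speech, frame_seconds, min_seconds=0.0):
    """Merge consecutive speech frames into (start, end) times in seconds."""
    segments = []
    start = None
    for index, is_speech in enumerate(frame_is_speech):
        if is_speech and start is None:
            start = index  # a speech run begins
        elif not is_speech and start is not None:
            segments.append((start, index))  # the run ended on the previous frame
            start = None
    if start is not None:
        segments.append((start, len(frame_is_speech)))
    return [
        (begin * frame_seconds, end * frame_seconds)
        for begin, end in segments
        if (end - begin) * frame_seconds >= min_seconds
    ]


def build_job(video_id, segments):
    """One job per video, listing its speech chunks; field names are illustrative."""
    return {
        "video_id": video_id,
        "chunks": [{"start_s": start, "end_s": end} for start, end in segments],
    }


# Hypothetical VAD output: one flag per 0.5-second frame.
flags = [0, 1, 1, 0, 0, 1, 1, 1, 0]
segments = speech_segments(flags, frame_seconds=0.5)
job = build_job("video-123", segments)
```

The labeler's tool can then jump directly between the listed chunks instead of playing the silent stretches.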

Detecting harmful text in images

One of the core teams using Halo focuses on Rosetta, a Facebook tool that detects and recognizes text in visual media, including images and video. The team annotated hundreds of thousands of images to generate millions of bounding boxes across text from a dozen languages. The output of the corresponding trained model then served as a signal in a meta-classifier that proactively protects people from harmful content, such as posts that contain hate speech.

Auto-generating descriptions of images

Scalable annotations have also helped us build automatic alt text (AAT) within the core Facebook app. The technology generates text descriptions of photos that help people who are visually impaired. With AAT, people can use screen readers to hear descriptions of images. We used Halo to post-process categories generated from our classification models. Creating this labeled data for AAT models helps us make Facebook more accessible.

What’s next

Researchers spend a substantial portion of their time processing data before they can start training their models. Annotation remains an integral piece of powering ML applications, and we will continue to add foundational annotation components to Halo. One area of development is in leveraging our expertise on training deep learning models to build interactive annotation tools. For example, new advancements in GrabCut-style guided segmentation can help increase end-to-end annotation efficiency.

Building this platform has also presented us with opportunities to learn and automate other common steps that researchers follow before they can deploy their models. For instance, we noticed researchers were building active learning loops on top of Halo to help select the right annotation data points for model improvement, so we built standardized workflows that package such routine processes into our platform. All of these improvements are ways to streamline data preparation and, ultimately, to bring our innovation from research to production quickly.
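One common way to pick data points in an active learning loop is uncertainty sampling, sketched below. This is a generic technique chosen for illustration; the selection strategy used by the teams mentioned above is not specified, and the function names and sample predictions are hypothetical.

```python
import math


def entropy(probabilities):
    """Shannon entropy in nats; higher means the model is less certain."""
    return -sum(p * math.log(p) for p in probabilities if p > 0)


def select_for_annotation(predictions, budget):
    """Return the ids of the `budget` items the model is least certain about."""
    ranked = sorted(
        predictions,
        key=lambda item_id: entropy(predictions[item_id]),
        reverse=True,
    )
    return ranked[:budget]


# Hypothetical class probabilities from the current model on unlabeled images.
predictions = {
    "img-1": [0.98, 0.02],
    "img-2": [0.55, 0.45],
    "img-3": [0.80, 0.20],
}
to_label = select_for_annotation(predictions, budget=2)  # ["img-2", "img-3"]
```

The selected ids would be enqueued as new annotation jobs, and the model retrained once their labels come back.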

We'd like to acknowledge the contributions of Isha Pathak, Bill Chen, Labeeb Panampullan, Alex Dajani, Ozioma Obiaka, and all past colleagues.

Written by

Wahyudinata Setiawan

Front-end engineer

Roshan Sumbaly

Engineering manager