ML Applications

Computer Vision

Creating better virtual backdrops for video calling, remote presence, and AR

February 8, 2022

This video shows how Meta’s person-segmentation model works with both close-up views of people and full-body views.

Since the start of the COVID-19 pandemic, many of us have become accustomed to using or viewing virtual backgrounds and background filters when video chatting with friends, coworkers, or family. Altering our backgrounds during video calls gives us greater control over our environments, helping us eliminate distractions, protect the privacy of the people and spaces around us, and even liven up our virtual presentations and get-togethers. But background filters don’t always work as intended, and they don’t always perform optimally for everyone. Most of us are familiar with background filters mistakenly covering someone’s face as they move, for example, or a filter failing to distinguish between a hand and a table.

To improve background blurring, virtual backgrounds, and many other augmented reality (AR) effects in Meta’s products and services, we’ve recently deployed enhanced AI models for image segmentation — the computer vision task of separating out the different parts of a photo or video. A cross-functional team of researchers and engineers from Meta AI, Reality Labs, and other parts of the company has built new segmentation models that are now in production for real-time video calling in Spark AR on multiple surfaces across Portal, Messenger, and Instagram. We’ve also improved the two-person segmentation models that are running today on Instagram and Messenger.

These systems are now more efficient, more stable, and more versatile, which will help enhance the quality and consistency of background filter effects in our products. For example, our improved segmentation models can now be used for multiple people and for people’s full bodies, as well as for people occluded by an object, such as a sofa, desk, or table. And beyond video calling, improved segmentation can also bring new dimensions to augmented and virtual reality (AR/VR) by merging virtual environments with people and objects in the real world. This will be especially important as we build new immersive experiences for the metaverse.

As we worked to advance image segmentation, we focused broadly on three important challenges:

  • Teaching our AI models to work well in a wide variety of circumstances, such as with dark lighting conditions, variations in skin tones and situations where skin tones are similar to background colors, less common body poses (e.g., someone bending forward to tie a shoe or to stretch), occlusions, and movements.

  • Improving boundary smoothness, stability, and general consistency. These qualities are less discussed in existing research literature, but user studies have shown they greatly affect people’s experience when using background effects.

  • Ensuring that our models are efficient and flexible enough to work well on billions of smartphones currently in use around the world, not just the small fraction that are current-gen devices with cutting-edge processors. The models also had to support very different aspect ratios, in order to work well on laptop computers, Meta’s Portal video calling device, and portrait and landscape modes on people’s phones.

The challenges of person segmentation in the real world

While the concept of image segmentation is easy to grasp, achieving highly accurate person segmentation presents significant challenges. To create a good experience, the model must be extremely consistent and lag-free. Artifacts caused by incorrect segmentation output can easily distract people using virtual background applications during a video call. Even more important, segmentation errors may lead to unwanted exposure of people’s physical environments when they are using background effects.

For these reasons, it’s important to achieve high accuracy — greater than 90 percent intersection over union (IoU), a commonly used metric for measuring the overlap between the segmentation prediction and the ground truth — before deploying person segmentation models into production. Because of the huge variety of possible use cases, the last 10 percent gap in IoU is far more difficult to close than the first 90 percent. We also found that once IoU reaches 90 percent, the metric becomes saturated and cannot capture further improvements in temporal consistency and boundary stability. We therefore developed a video-based measurement system, together with several metrics, to capture these additional dimensions.
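As a point of reference, IoU for a binary person mask reduces to a simple ratio of overlapping to combined pixels. The sketch below is purely illustrative and is not code from our production pipeline:

```python
import numpy as np

def mask_iou(pred: np.ndarray, gt: np.ndarray) -> float:
    """IoU between a predicted binary person mask and the ground truth mask."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    intersection = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    # If both masks are empty, the union is zero; treat that as a perfect match.
    return 1.0 if union == 0 else float(intersection) / float(union)
```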

Developing training and measurement strategies for the real world

AI models learn from the data they’re given, so it’s not enough to simply use examples of video callers sitting still in well-lit rooms. In order to create highly accurate segmentation models for a very wide variety of circumstances, we needed many other kinds of examples as well.

We used Meta AI’s ClusterFit model to retrieve from our data set a broad range of examples across gender, skin tone, age, body pose, motion, background complexity, number of people, and so forth.

Metrics on static images don’t accurately reflect a model’s quality in real time, because real-time models usually have a tracking mode that relies on temporal information. To measure our models’ quality in real time, we designed a quantitative video evaluation framework that computes metrics at each frame the model’s inference reaches.

Unlike standard academic segmentation problems, the quality of our person segmentation model is best judged by its performance in everyday situations. If the effect is jarring, distracting, or otherwise lacking, its performance against any specific benchmark is inconsequential. So we surveyed people who use our products, asking about the quality of our segmentation applications. We found that non-smooth and ambiguous boundaries affect user experiences the most. To capture this signal, we augmented our framework with an additional metric, Boundary IoU, a new segmentation evaluation created by Meta AI researchers to measure boundary quality. Boundary IoU is most informative when general IoU is close to saturation, i.e., above 90 percent. Jitters at the boundary (temporal inconsistency) also detract from the experience. We measure temporal consistency in two ways, using IoU to quantify the discrepancy in both cases. First, we assume adjacent video frames are identical to each other, so any discrepancy between their predictions indicates temporal inconsistency in the model. Second, we account for foreground movement between adjacent frames: optical flow lets us transform the prediction at frame N to frame N+1, and we then compare this transformed prediction with the raw prediction at frame N+1.
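To illustrate the flow-based consistency check, the sketch below warps the frame-N prediction to frame N+1 with dense optical flow and scores the agreement with IoU. The Farneback flow estimator and the backward-warping details are illustrative choices of ours, not necessarily what runs in the production evaluation framework:

```python
import cv2
import numpy as np

def flow_warp_mask(mask_n, gray_n, gray_n1):
    """Warp the frame-N mask onto frame N+1 using dense optical flow.

    gray_n, gray_n1: grayscale uint8 frames; mask_n: binary mask for frame N.
    Flow is computed from frame N+1 back to frame N so that cv2.remap can
    sample the frame-N mask at each pixel's source location (backward warping).
    """
    flow = cv2.calcOpticalFlowFarneback(gray_n1, gray_n, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    h, w = mask_n.shape
    grid_x, grid_y = np.meshgrid(np.arange(w), np.arange(h))
    map_x = (grid_x + flow[..., 0]).astype(np.float32)
    map_y = (grid_y + flow[..., 1]).astype(np.float32)
    return cv2.remap(mask_n.astype(np.float32), map_x, map_y,
                     interpolation=cv2.INTER_NEAREST) > 0.5

def temporal_consistency(mask_n, mask_n1, gray_n, gray_n1):
    """IoU between the flow-warped frame-N prediction and the raw frame-N+1 prediction."""
    warped = flow_warp_mask(mask_n, gray_n, gray_n1)
    raw = mask_n1.astype(bool)
    union = np.logical_or(warped, raw).sum()
    return 1.0 if union == 0 else np.logical_and(warped, raw).sum() / union
```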

We also analyzed our models for differences in performance across specific groups of people. We labeled evaluation videos with metadata from more than 100 classes (more than 30 categories) including three skin tones (informed by clustering Fitzpatrick scale skin types) and two apparent gender categories. The model showed similar accuracy across both apparent skin tone and apparent binary gender categories. Despite some very minor differences between categories, which we will prioritize addressing in our ongoing work, the model demonstrates good performance across all subcategories.

Per-class IoU from the fairness analysis. We collected 1,100 videos dedicated to this project with diverse attributes, e.g., skin tone, apparent gender, pose, lighting condition, etc. (more than 30 categories with more than 100 classes). Points in the plots represent the per-class IoU. Error bars represent 95 percent confidence intervals. We see some differences between classes, although the differences are very small.

Optimizing the model

Architecture

To optimize our models, we use FBNetV3 as the backbone. The architecture follows an encoder-decoder structure that fuses layers with the same spatial resolution. We pair a heavyweight encoder with a lightweight decoder, which achieves better quality than a symmetric design. The resulting architecture is derived from neural architecture search and is highly optimized for on-device speed.

Architecture of the semantic segmentation model. The green rectangles represent convolution layers, and the black circles represent concatenation.
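To make the heavy-encoder/light-decoder idea concrete, here is a deliberately simplified PyTorch sketch. The real model uses an FBNetV3 backbone found through neural architecture search; the block counts, channel widths, and plain convolution blocks below are placeholders of our own, not the production architecture:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvBlock(nn.Module):
    def __init__(self, cin, cout, stride=1):
        super().__init__()
        self.conv = nn.Conv2d(cin, cout, 3, stride=stride, padding=1, bias=False)
        self.bn = nn.BatchNorm2d(cout)

    def forward(self, x):
        return F.relu(self.bn(self.conv(x)))

class HeavyEncoderLightDecoder(nn.Module):
    """Toy encoder-decoder with skip concatenation at matching resolutions.
    Assumes input height and width are divisible by 8."""
    def __init__(self, num_classes=1):
        super().__init__()
        # Heavyweight encoder: wider stages carry most of the compute.
        self.enc1 = ConvBlock(4, 32, stride=2)    # 4 input channels: YUV + previous mask
        self.enc2 = ConvBlock(32, 64, stride=2)
        self.enc3 = ConvBlock(64, 128, stride=2)
        # Lightweight decoder: thin blocks after each upsample + concatenation.
        self.dec2 = ConvBlock(128 + 64, 32)
        self.dec1 = ConvBlock(32 + 32, 16)
        self.head = nn.Conv2d(16, num_classes, 1)

    def forward(self, x):
        e1 = self.enc1(x)      # 1/2 resolution
        e2 = self.enc2(e1)     # 1/4 resolution
        e3 = self.enc3(e2)     # 1/8 resolution
        d2 = self.dec2(torch.cat([F.interpolate(e3, scale_factor=2), e2], dim=1))
        d1 = self.dec1(torch.cat([F.interpolate(d2, scale_factor=2), e1], dim=1))
        return self.head(F.interpolate(d1, scale_factor=2))  # back to input resolution
```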

Data-efficient learning

We used an offline, high-capacity PointRend model to generate pseudo ground truth labels for unannotated data, increasing the volume of training data. We also used a student-teacher semi-supervised approach to reduce biases in the pseudo labels.
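A minimal sketch of the pseudo-labeling step, assuming a generic `teacher` network and `unlabeled_loader` (both hypothetical placeholders); the confidence threshold and ignore-label convention are illustrative choices, with the offline teacher in our pipeline being the high-capacity PointRend model mentioned above:

```python
import torch

@torch.no_grad()
def generate_pseudo_labels(teacher, unlabeled_loader, confidence=0.9):
    """Label unannotated frames with an offline teacher, keeping only confident pixels."""
    teacher.eval()
    pseudo_labels = []
    for images in unlabeled_loader:              # images: [B, C, H, W] float tensor
        probs = torch.sigmoid(teacher(images))   # per-pixel person probability
        labels = (probs > 0.5).long()
        uncertain = (probs < confidence) & (probs > 1.0 - confidence)
        labels[uncertain] = -1                   # -1 = ignored by the student's loss
        pseudo_labels.append(labels.cpu())
    return pseudo_labels
```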

Aspect ratio dependent resampling

A traditional deep learning model resamples an image to a small square as the network input. This resampling introduces distortions, and because images have different aspect ratios, the distortions differ from image to image. The presence of, and variation in, these distortions causes the network to learn low-level features that are not robust to different aspect ratios. The limitations caused by such distortions are amplified in segmentation applications: when the majority of the training images have portrait ratios, for example, the model performs much worse on landscape images and videos. To address this challenge, we adopted Detectron2’s aspect-ratio-dependent resampling method, which groups images with similar aspect ratios and resamples them all to the same size.

Illustration of the importance of aspect-ratio-dependent resampling. Left: a model that resamples the input image to a square size (the output mask is very unstable). Right: a model trained with aspect-ratio-dependent resampling.
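The sketch below shows one way such grouping can work: each image is assigned to its nearest aspect-ratio bucket, and each bucket gets a single target size of roughly equal area. The specific buckets, target area, and multiple-of-32 rounding are assumptions for illustration, not the exact parameters used in Detectron2 or in our training pipeline:

```python
def aspect_ratio_buckets(samples,
                         buckets=((9, 16), (3, 4), (1, 1), (4, 3), (16, 9)),
                         target_area=256 * 256):
    """Group samples by the closest aspect-ratio bucket and assign each bucket
    a common (height, width) of roughly equal area, so images in a batch are
    resized without being squashed to a square.

    samples: iterable of (image, height, width); buckets: (width, height) ratios.
    """
    groups = {b: [] for b in buckets}
    for image, h, w in samples:
        ratio = w / h
        bucket = min(buckets, key=lambda b: abs(b[0] / b[1] - ratio))
        groups[bucket].append(image)

    sizes = {}
    for bw, bh in buckets:
        scale = (target_area / (bw * bh)) ** 0.5
        # Round to multiples of 32 so the encoder's downsampling stages divide evenly.
        sizes[(bw, bh)] = (round(bh * scale / 32) * 32, round(bw * scale / 32) * 32)
    return groups, sizes   # every image in groups[b] is resized to sizes[b] before batching
```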

Customized padding

Aspect-ratio-dependent resampling requires padding when batching images with similar aspect ratios, but the commonly used zero-padding method produces artifacts. Even worse, the artifacts propagate to other areas as the network gets deeper. We use replicate padding to remove these artifacts. In a recent study, we found that reflection padding in convolution layers can further improve model quality by minimizing the propagation of artifacts, although the latency cost increases accordingly. An example of the artifact, and the result of removing it, is shown below.

Left: segmentation output using zero padding, right: segmentation output using customized padding.
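In PyTorch, the padding behavior is a one-argument change on the convolution layer; a minimal illustration of the three options discussed here:

```python
import torch.nn as nn

# Zero padding (the default) introduces artificial dark borders whose artifacts
# propagate deeper into the network; replicate padding repeats edge pixels instead,
# and reflection padding reduces artifact propagation further at some latency cost.
conv_zero      = nn.Conv2d(32, 32, kernel_size=3, padding=1)  # padding_mode='zeros'
conv_replicate = nn.Conv2d(32, 32, kernel_size=3, padding=1, padding_mode='replicate')
conv_reflect   = nn.Conv2d(32, 32, kernel_size=3, padding=1, padding_mode='reflect')
```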

Tracking

Temporal inconsistency shows up as frame-to-frame prediction discrepancies, known as flickers, which hurt the user experience. To improve temporal consistency, we designed a detect-with-mask process. The model takes the three YUV channels of the current frame plus a fourth channel: for the first frame, the fourth channel is an empty matrix; for every subsequent frame, it is the prediction from the previous frame. We found this tracking strategy improves temporal consistency significantly. We also adopted ideas from state-of-the-art tracking models, such as CRVOS and transform-invariant CNN modeling strategies, to obtain a temporally stable segmentation model.

Illustration of the detect-with-mask model.
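A minimal sketch of how the four-channel input can be assembled per frame; the tensor layout and function name are our own illustrative choices:

```python
import torch

def build_model_input(yuv_frame, prev_mask=None):
    """Stack the current frame's Y, U, V channels with a fourth channel that carries
    the previous frame's predicted mask (all zeros for the very first frame).

    yuv_frame: [3, H, W] tensor; prev_mask: [1, H, W] tensor or None.
    """
    if prev_mask is None:                       # first frame: empty fourth channel
        prev_mask = torch.zeros_like(yuv_frame[:1])
    return torch.cat([yuv_frame, prev_mask], dim=0)   # shape [4, H, W]
```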

Boundary cross entropy

Creating smooth and clear boundaries is critical for AR applications of segmentation. Besides the standard cross entropy loss for segmentation, we need a boundary-weighted loss as well. The authors of U-Net and most later variants recommend a trimap-weighted loss to improve model quality, based on the observation that the interiors of objects are easier to segment. However, one limitation of the trimap loss is that it computes the boundary area only from the ground truth, making it an asymmetric loss that is insensitive to false positives. Inspired by Boundary IoU, we adopt its method of retrieving boundary areas for both the ground truth and the prediction, and build a cross entropy loss over these areas. The model trained with boundary cross entropy significantly outperforms the baseline. Besides producing clearer boundary areas in the final mask output, the new models yield fewer false positives, which is expected since the loss also penalizes spurious boundaries in the prediction.
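A simplified sketch of the idea: extract a boundary band from both the ground truth and the prediction (here approximated with max-pooling-based dilation and erosion rather than the exact Boundary IoU contour extraction) and compute cross entropy only inside the union of the two bands:

```python
import torch
import torch.nn.functional as F

def boundary_band(mask, width=5):
    """Band of pixels within `width` of the mask contour, via morphological
    dilation/erosion implemented with max-pooling. mask: [B, 1, H, W] in {0, 1}."""
    k = 2 * width + 1
    dilated = F.max_pool2d(mask, k, stride=1, padding=width)
    eroded = -F.max_pool2d(-mask, k, stride=1, padding=width)
    return (dilated - eroded) > 0.5

def boundary_cross_entropy(logits, target, width=5):
    """Cross entropy restricted to the union of the ground-truth and predicted
    boundary bands, so spurious (false-positive) boundaries are also penalized."""
    pred = (torch.sigmoid(logits) > 0.5).float()
    band = boundary_band(target, width) | boundary_band(pred, width)
    loss = F.binary_cross_entropy_with_logits(logits, target, reduction='none')
    return loss[band].mean() if band.any() else loss.sum() * 0.0
```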

This video illustrates the improvement from the boundary cross entropy loss. The right side uses the traditional trimap-weighted loss; the left side shows the output of the model trained with the new loss. The right side exhibits much heavier boundary flickers as well as false positives.

Performance

All our models are trained offline using PyTorch and then deployed into production via the Spark AR platform. We use PyTorch Lite to optimize on-device deep learning inference. Because use cases and hardware differ across apps and devices, we designed different models to satisfy these requirements. After 1.5 years of development, our team has successfully improved the person segmentation models on multiple Facebook apps and devices.

Building better segmentation models for video chat and much more

We’ve made substantial improvements in our segmentation models, but there is more work to do. We hope to enable segmentation-powered effects that seamlessly adjust to even the most complex, challenging, and uncommon use cases. In particular, we continue to work on new ways to improve boundary stability and temporal consistency, which are vital for AR/VR human-centric segmentation applications. We are working to create advanced tracking methods that will provide more consistent prediction, especially at edges of objects.

We will also continue to improve our tools for assessing model performance across the many dimensions of human diversity around the world.

We hope that by sharing details here on our work, we will help other researchers and engineers create better segmentation-powered applications that work well for everyone.

This project is a collaboration of around 30 engineers and scientists. We’d like to thank everyone who contributed to making Meta’s on-device person segmentation solution better.

Written By

Wenliang Zhao

Research Scientist

Georgy Marrero

Software Engineer

Peizhao Zhang

Research Scientist

Tao Xu

Research Scientist

Chuck Zhang

Software Engineer

Vignesh Ramanathan

Research Science Manager