Under the hood: Portal's Smart Camera

February 15, 2019

A powerful feature of Portal, Facebook's new video-calling device, is an AI-powered system called Smart Camera. Smart Camera frames shots much as an experienced camera operator would, so that people using Portal feel like they are right beside each other. Instead of relying on dedicated servers typically used for advanced machine learning tasks, Smart Camera does this by performing complex computer vision (CV) modeling entirely on-device, using processing power similar to that of a high-end mobile chipset.

Building an entirely new intelligent camera system for Portal hardware from scratch in under two years posed significant CV challenges. For instance, even in a crowded room with people moving and interacting, Portal must decide when to zoom out to accommodate new subjects and when to follow someone walking out of frame. These decisions are a core part of Portal's fundamental use case, not merely a peripheral feature.

To enable Smart Camera to do this effectively, the Portal AI and Mobile Vision teams leveraged Facebook's long history of investments in AI. In particular, we innovated on our pioneering Mask R-CNN model in order to create a new 2D pose detection model that was two orders of magnitude more efficient than existing systems. With Mask R-CNN2Go, 2D human pose estimation runs in real time on Portal at 30 frames per second. We used this new model to inform Portal's decisions about when to digitally pan and zoom, and then paired it with a separate system to ensure that camera movements would be natural-looking and immersive.

The end result is a video-calling experience that works for everyone: Parents don't need to chase their toddler around with their phone, younger kids can participate in the call without having to stand still, and groups of people can see and be seen with ease.

In this post, we'll highlight the work that went into creating Smart Camera, starting with answering a simple question: How do we build a camera that showcases people in a wide variety of environments if all we have to work with is a small mobile processor?

Mechanical vs. intelligent camera

Portal is a product imbued with CV technology, but Smart Camera began with a hardware challenge. Early prototypes physically swiveled to face different subjects, but the drawbacks of a motorized camera were significant: reduced reliability, along with the inability to see and react to what's happening away from where the camera is pointed. Instead, we decided to develop a stationary wide-angle camera whose movements would be entirely digital. But this approach required a truly nimble, efficient AI control system that could make rapid and sophisticated CV decisions. Smart Camera, which was always a key component of our product, became increasingly central to our planned reinvention of the video-calling experience.

The first Portal prototype relied on a motor to physically move the camera.
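To make the idea of entirely digital camera movement concrete, here is a minimal sketch of how a pan-and-zoom request might be turned into a crop window inside a stationary wide-angle frame. This is an illustration under stated assumptions, not Portal's actual code: the `Crop` type, the `digital_crop` function, and the pixel dimensions are hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Crop:
    """Crop window in source-frame pixels (hypothetical type for this sketch)."""
    x: float  # left edge
    y: float  # top edge
    w: float  # width
    h: float  # height


def digital_crop(frame_w: float, frame_h: float,
                 cx: float, cy: float, zoom: float) -> Crop:
    """Return a crop centered as close to (cx, cy) as the frame allows.

    A zoom of 1.0 shows the whole wide-angle frame; larger values show a
    proportionally smaller window. The frame's aspect ratio is preserved.
    """
    zoom = max(1.0, zoom)          # cannot zoom out past the sensor
    w = frame_w / zoom
    h = frame_h / zoom
    # Clamp the window so a "pan" never leaves the captured frame.
    x = min(max(cx - w / 2, 0.0), frame_w - w)
    y = min(max(cy - h / 2, 0.0), frame_h - h)
    return Crop(x, y, w, h)


if __name__ == "__main__":
    print(digital_crop(1920, 1080, cx=960, cy=540, zoom=2.0))
```

Because the whole scene is always captured, a request to pan toward a corner simply slides the window until it reaches the edge of the frame, and the system can still see what is happening outside the current crop.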

Established CV methods of locating and tracking subjects were insufficient. Neither the existing software tools nor the available camera hardware could support the product requirements we had set. For example, standard CV systems are designed to operate within a couple of feet, far short of Portal's intended 10- to 20-foot range. And while the depth cameras used by Facebook Reality Labs (FRL) to capture detailed VR environments addressed similar challenges (such as accounting for changes in sunlight), FRL's work wasn't designed to operate in real time. No single approach or platform could handle all of Portal's needs. As the team developed more advanced prototypes, they incorporated research and code from teams across Facebook to create more versatile iterations of Smart Camera.

The need for sophisticated, robust computer vision

Video calling is a particularly unforgiving CV application, with no opportunity for do-overs. When a smartphone's panoramic photo comes out distorted, it's easy to delete the photo and try again. But a crucial moment during a conversation is impossible to re-create. And because Portal makes all its CV decisions in the moment, as video calls are in progress, it can't look at an entire video from start to finish in order to determine how to handle unexpected events. If the system could peer into the future, it could anticipate when someone will enter or leave a room, or it could preemptively reframe the scene before someone starts moving around excitedly.

The variety of real-world conditions poses significant challenges too. Recognizing that a grandmother is holding her granddaughter on her lap in a well-lit space might not be particularly processor-intensive for a CV system. But what happens when the toddler crawls toward a shaded corner of the room and is then picked up and carried out of view by her mom? Smart Camera has to dynamically respond to multiple variables and determine when to zoom and pan, as well as what to ignore.

And because disruptive errors and lag during video calls are unacceptable, the CV system must run entirely on the device. Processing video locally also provides enhanced privacy because none of the pose detection or other AI modeling leaves the device. Furthermore, nothing that is said during a Portal video call is accessed by Facebook or used for advertising.

The heart of Smart Camera: Mask R-CNN2Go

To create a video-calling experience to meet these needs, Smart Camera relies on 2D pose detection, supplemented with additional CV technologies. Smart Camera actively frames a given scene by continually searching for relevant subjects to include. Since it analyzes each frame in the video, Portal is able to ignore potential subjects that haven't moved over an extended period of time, such as a portrait hanging on a wall. It can also prioritize what should be within its field of view, choosing, for example, to feature a subject who is talking to the listeners versus someone who's passing through in the background.
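One way to picture the "ignore things that never move" behavior is a per-subject motion history. The sketch below is an assumption-laden illustration rather than a description of Portal's implementation: it presumes that some upstream tracker assigns a stable `track_id` to each detected subject, and the `StillnessFilter` class, window length, and pixel threshold are all hypothetical.

```python
import math
from collections import deque


class StillnessFilter:
    """Flags tracked subjects whose position has not changed for a while."""

    def __init__(self, window: int = 90, min_motion_px: float = 4.0):
        self.window = window                # frames of history to keep
        self.min_motion_px = min_motion_px  # movement below this is "still"
        self.history = {}                   # track_id -> deque of (x, y)

    def update(self, track_id, center):
        """Record the subject's center for the current frame."""
        hist = self.history.setdefault(track_id, deque(maxlen=self.window))
        hist.append(center)

    def is_static(self, track_id) -> bool:
        """True only once a full window shows no meaningful movement."""
        hist = self.history.get(track_id)
        if hist is None or len(hist) < self.window:
            return False  # not enough evidence yet, so keep the subject
        x0, y0 = hist[0]
        return all(math.hypot(x - x0, y - y0) < self.min_motion_px
                   for x, y in hist)


if __name__ == "__main__":
    f = StillnessFilter(window=5, min_motion_px=2.0)
    for i in range(5):
        f.update("portrait", (100.0, 100.0))   # never moves
        f.update("person", (10.0 * i, 50.0))   # walks across the room
    print(f.is_static("portrait"), f.is_static("person"))
```

Requiring a full window of history before declaring something static errs on the side of including a subject, which seems the safer default for a framing system.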

A simple CV system, such as one that uses only head detection or bounding boxes around people, might have been easy to implement. But Smart Camera needed sufficient accuracy to account for different postures, as there are very different framing choices based on, for example, whether someone is lying down or standing up.
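To show why pose keypoints give a framing system more to work with than a head detector, here is a small sketch that derives a posture guess and a padded framing box from 2D keypoints. The `(x, y, confidence)` keypoint format, the function names, and the width-versus-height heuristic are assumptions made for illustration; the source does not describe how Smart Camera classifies posture.

```python
def keypoint_bounds(keypoints, min_conf=0.3):
    """Bounding box (x0, y0, x1, y1) of keypoints above a confidence floor.

    `keypoints` is a list of (x, y, confidence) tuples, an assumed format.
    Returns None if no keypoint is confident enough.
    """
    pts = [(x, y) for x, y, c in keypoints if c >= min_conf]
    if not pts:
        return None
    xs = [p[0] for p in pts]
    ys = [p[1] for p in pts]
    return (min(xs), min(ys), max(xs), max(ys))


def posture(bounds):
    """Crude heuristic: a body wider than it is tall is probably lying down."""
    x0, y0, x1, y1 = bounds
    return "lying" if (x1 - x0) > (y1 - y0) else "upright"


def framing_box(bounds, margin=0.2):
    """Pad the body box on every side by a fraction of its own size."""
    x0, y0, x1, y1 = bounds
    pad_x = (x1 - x0) * margin
    pad_y = (y1 - y0) * margin
    return (x0 - pad_x, y0 - pad_y, x1 + pad_x, y1 + pad_y)


if __name__ == "__main__":
    standing = [(100, 50, 0.9), (100, 250, 0.9), (80, 150, 0.8),
                (120, 150, 0.8), (500, 500, 0.1)]  # last point is noise
    b = keypoint_bounds(standing)
    print(posture(b), framing_box(b, margin=0.25))
```

A head-only detector would produce the same box for a standing and a reclining person, whereas the keypoint extent changes shape with posture, so the resulting frame can change with it.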

When the Facebook AI Research (FAIR) group released its Mask R-CNN model in 2017, it was a breakthrough for the industry in streamlining instance segmentation, garnering the Best Paper award at the International Conference on Computer Vision (ICCV). But Mask R-CNN's GPU-based approach made it incompatible with Portal's mobile chipset. Last year, teams across Facebook collaborated to create Mask R-CNN2Go, a full-body pose detection system that was just a few megabytes in size. That made it small enough to run on mobile processors and perfect for use within Portal.

Mask R-CNN2Go is an efficient and lightweight framework. This graphic outlines the five major components of the model.

Powered by Mask R-CNN2Go, Smart Camera maintains Mask R-CNN's high pose-detection accuracy while also running 400x faster than that model. Compressing our pose-detection model — from running on desktop GPUs to mobile chips — forced trade-offs in model quality. The lower-quality key points (which can introduce jitter or other visual errors during framing changes) weren't acceptable for the stable, natural video-calling experience that we were targeting. To compensate, we developed several strategies, including improving low-light performance by applying data augmentation on low-light examples in the training dataset and balancing multiple pose-detection approaches (such as detecting a subject's head, trunk, and entire body). And we used additional preprocessing to differentiate between multiple people in proximity to one another.
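As a rough illustration of low-light data augmentation, the sketch below darkens a well-lit training image so that a model also sees dim versions of the same scene. The gamma-and-gain recipe, the `simulate_low_light` name, and the grayscale list-of-rows image format are assumptions chosen to keep the example dependency-free; the source does not say which augmentation was actually applied.

```python
def simulate_low_light(image, gamma=2.2, gain=0.6):
    """Return a darkened copy of a grayscale image.

    `image` is a list of rows of integer pixel values in 0..255.
    A gamma of 1.0 or more crushes mid-tones toward black, and a gain of
    at most 1.0 lowers overall brightness, so no pixel gets brighter.
    """
    if gamma < 1.0:
        raise ValueError("gamma must be >= 1.0 to darken the image")
    if not 0.0 <= gain <= 1.0:
        raise ValueError("gain must be between 0.0 and 1.0")
    return [
        [round(255 * ((v / 255) ** gamma) * gain) for v in row]
        for row in image
    ]


if __name__ == "__main__":
    bright = [[0, 128, 255],
              [64, 200, 255]]
    for row in simulate_low_light(bright):
        print(row)
```

Because this kind of augmentation changes only pixel intensities and not geometry, the keypoint annotations of the original image can be reused unchanged for the darkened copy.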

We also pushed the limits of Portal's mobile chipset in other ways, including developing hand-tuned optimizations of Qualcomm's Snapdragon Neural Processing Engine (SNPE). These optimizations tailored SNPE's already accelerated execution of deep neural networks to Portal's specific hardware and operating conditions, making it faster still. The end result of this process was a new consumer product available in stores less than two years after FAIR's initial Mask R-CNN research was published.

In addition to leveraging our work on Mask R-CNN, Portal takes advantage of our long-term investments in AR. With interactive AR already deployed on Facebook, Messenger, and Instagram, we were able to incorporate out-of-the-box AR — and particularly the new Story Time mode — with minimal heavy lifting, thanks to our Spark AR platform and the work we've done over the years in body tracking and segmentation. This streamlined integration of AR is part of our overall strategy to provide a common AR platform across all Facebook products — including Messenger, Instagram, and now Portal — that lets creators author an effect once and then deploy it widely, whether on today's screens or tomorrow's head-mounted displays.

Enhancing AI capabilities with human expertise

Even after we had solved for all the baseline challenges associated with camera movements that respond to real-time pose detection, the results in early prototypes still felt stiff and mechanical. Smart Camera would accurately zoom, pan, and track relevant subjects, but its actual movements weren't as smooth, fluid, and intuitive as the human-controlled camerawork that we’re accustomed to from film and TV. Recognizing that mathematicians and scientists can improve CV models but don't necessarily understand how people interact with camera experiences on an emotional level, we got creative. Or, more accurately, we asked creative professionals for help.

The filmmakers that we worked with shared a range of insights, some of which were well-established techniques — such as how experts tend to compose shots and how those decisions influence audience expectations — while others were more instinctual and harder to replicate with AI. For one experiment, we asked a group of professional camera operators to film a series of scenes where it was difficult to capture the action from a single angle. Analyzing the results revealed that while there's no consistent ground truth for how a seasoned pro films a given situation (camera operators often make different decisions despite sharing the same angle and subjects), there are subtle movements that filmmakers instinctively use to produce a more natural, intuitive camera experience. We carefully analyzed these movements and distilled them into software models that aim to mimic this experience in Smart Camera. These proved more effective than movements guided by simple mathematical strategies.
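The models distilled from the camera operators' footage aren't described here, so they can't be reproduced. For comparison, though, the kind of simple mathematical strategy mentioned above might look like the following sketch: exponential easing toward a target position, with a dead zone so that small subject movements don't cause constant reframing. The function name and both parameters are hypothetical.

```python
def smooth_step(current: float, target: float,
                alpha: float = 0.15, dead_zone: float = 8.0) -> float:
    """Move one frame's worth from `current` toward `target`.

    alpha     -- fraction of the remaining distance covered per frame
    dead_zone -- offsets smaller than this (in pixels) are ignored
    """
    delta = target - current
    if abs(delta) < dead_zone:
        return current  # subject barely moved, so hold the shot
    return current + alpha * delta


if __name__ == "__main__":
    pos = 0.0
    for frame in range(10):
        pos = smooth_step(pos, 100.0, alpha=0.5)
        print(frame, pos)
```

A rule like this is causal, which suits a live call where future frames are unavailable, but it applies the same easing curve to every situation. That uniformity is one plausible reason such movement can feel mechanical next to a human operator's choices.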

For example, camera operators typically film only in landscape mode, but we asked our experts to shoot each sample scene in both landscape and portrait modes. Based on how they composed shots for the two modes, we soon realized we would have to provide two very different experiences. In portrait mode, their composition and camera movements prioritized the key people in a given scene. They used tighter framing to showcase the person and their expressions rather than their environment, in order to create a more intimate experience. In landscape mode, the camera operators were more comfortable capturing more of the action in the scene. They tended to use a wider range of compositions, especially medium and “cowboy shots,” where subjects are framed from the midthigh up. (The shot's name comes from classic Westerns, since it can include an actor's face and holster.)

We incorporated these choices into Smart Camera, taking advantage of landscape orientation to showcase more of the activity in the scene, and then shifting to a closer, more one-on-one composition when someone rotates their Portal+ into portrait mode.
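The orientation-dependent composition can be sketched as a choice of where the bottom of the frame falls relative to a subject's body keypoints. In this illustration the landscape case uses the cowboy shot described above (cut at midthigh), while the portrait cut at mid-torso is purely an assumption standing in for "tighter framing". The function name, parameters, and headroom rule are likewise hypothetical.

```python
def vertical_framing(orientation: str, head_y: float, shoulder_y: float,
                     hip_y: float, knee_y: float, headroom: float = 0.1):
    """Return (top_y, bottom_y) of the frame in image coordinates.

    Image y grows downward, so head_y < shoulder_y < hip_y < knee_y
    for a standing subject.
    """
    if orientation == "landscape":
        # Cowboy shot: frame the subject from the midthigh up.
        bottom = (hip_y + knee_y) / 2
    elif orientation == "portrait":
        # Assumed tighter cut at mid-torso to emphasize the face.
        bottom = (shoulder_y + hip_y) / 2
    else:
        raise ValueError(f"unknown orientation: {orientation!r}")
    # Leave headroom proportional to the framed height.
    top = head_y - headroom * (bottom - head_y)
    return top, bottom


if __name__ == "__main__":
    pose = dict(head_y=100, shoulder_y=200, hip_y=400, knee_y=600)
    print(vertical_framing("landscape", headroom=0.25, **pose))
    print(vertical_framing("portrait", headroom=0.25, **pose))
```

The same pose therefore yields a wider frame in landscape and a closer one in portrait, mirroring the two experiences the camera operators composed.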

What's next for Smart Camera

Smart Camera is a culmination of our broad and varied CV expertise, including the pose-detection techniques that we pioneered and the hardware implementation lessons that we learned during Portal's development. But it's also an example of an AI-based system that's informed by a human skill set, incorporating the nuances of cinematography and photography into a feature that otherwise might have been jarring and disruptive to the natural, effortless face-to-face communication for which Portal was designed. And its streamlined, reduced-compute approach to CV (using Mask R-CNN2Go) shows the progress we continue to make in this space.

Since camera control is very tightly coupled to the contextual understanding of people and their environments, we will continue to advance our CV and other AI technologies to help us better understand the world around the device. For example, we may want to frame a shot differently and use different types of camera movements when a person is cooking in the kitchen, compared with when he or she is watching TV on the couch. And, ultimately, Portal is not just about making video calls — it's about spending quality time with people you care about. To better facilitate that, we are also looking for new ways to build shared experiences that people can enjoy with Portal.