May 19, 2020
Today we’re announcing:
We’ve built and deployed GrokNet, a universal computer vision system designed for shopping. It can identify fine-grained product attributes across billions of photos — in different categories, such as fashion, auto, and home decor.
GrokNet is powering new Marketplace features for buyers and sellers today and we’re testing automatic product tagging on Facebook Pages to help make photos shoppable.
We’re also introducing Rotating View, a state-of-the-art 3D-like photo capability that allows anyone with a camera on their phone to capture multi-dimensional panoramic views of their listings on Marketplace.
And we’ve advanced research by creating a state-of-the-art technique to predict occluded or layered objects in photos (like a shirt beneath a jacket).
These advancements are part of the foundation we’re building to develop an entirely new way to shop on our platforms — making it easier for individuals and small businesses to showcase their products to billions of people, and for buyers to find exactly what they’re looking for.
Think about the last time you shopped for clothes online. Maybe you found what you wanted, but most likely it required browsing dozens of possible options and guessing what would actually fit you and pair well with items you already own.
Now imagine having an AI lifestyle assistant to recommend just the right blue raincoat — one that matches your personal taste, complements your favorite plaid scarf, and will come in handy on your upcoming trip to Washington because there's a chance of drizzle. What if the same AI lifestyle assistant could help you shop for a new rug just by analyzing a photo of your home’s understated Scandinavian decor? Before buying, you could automatically generate a virtual replica in 3D to see how it might fit in your room and even share it in real time with your friends to see what they think.
Shopping is extremely challenging for AI systems to tackle because personal taste is so subjective. To build a truly intelligent assistant, we need to teach systems to understand each individual’s taste and style — and the context that matters when searching for a product to fit a specific need or situation. In addition, for a system to work for our platforms, it needs to work for everyone’s subjective preferences, globally. That means working for diverse body types and understanding different global trends and socioeconomic factors. Today, we are one step closer to making this vision a reality as we unveil the details of our AI-powered shopping systems, which leverage state-of-the-art image recognition models to improve the way people buy, sell, and discover items.
This all starts by improving segmentation, detection, and classification. We need systems to know where items appear in a photo, and specifically what the items are. This helps us deliver fine-grained visual intelligence for dozens of shopping categories. Today, we can understand that a person is wearing a suede-collared polka-dot dress, even if half of her is hidden behind her office desk. We can also understand whether that desk looks like it’s made of wood or metal.
Our long-term vision is to build an all-in-one AI lifestyle assistant that can accurately search and rank billions of products, while personalizing to individual tastes. That same system would make online shopping just as social as shopping with friends in real life. Going one step further, it would advance visual search to make your real-world environment shoppable. If you see something you like (clothing, furniture, electronics, etc.), you could snap a photo of it and the system would find that exact item, as well as several similar ones to purchase right then and there.
Most product recognition models work well for a single product vertical, such as fashion or home decor. That type of system wasn’t going to work across Facebook’s services, where millions of people post, buy, and sell products across dozens of categories, ranging from SUVs to stilettos to side tables. We needed a single model to correctly identify products across all fine-grained product categories.
That meant aggregating a massive number of data sets, types of supervision, and loss functions into a single model, while making sure it works well on every task simultaneously. This is a huge AI challenge because optimizing and fine-tuning hyperparameters for one task can sometimes reduce the effectiveness of another. For example, optimizing a model to recognize cars well might mean it’s not as good at recognizing patterns on clothing.
We built, trained, and deployed a model with 83 loss functions across seven data sets to combine multiple verticals into a single embedding space. This universal model allows us to leverage many more sources of information, which increases our accuracy and outperforms our single vertical-focused models.
For the input, we used a combination of both real-world seller photos with challenging angles and catalog-style photos. We needed a way to structure all that information, make it inclusive for all countries, and make sure the relevant languages were represented. We also needed to ensure that different ages, sizes, and cultures were reflected. This meant training for inclusivity from the beginning by incorporating data across different body types, skin tones, locations, socioeconomic classes, ages, and poses. For example, clothing might look very different on someone doing yoga than on someone playing football. And a body might look different if the image was taken from the side rather than the front. Considering these kinds of issues from the start ensures that our attribute models work well for everyone.
We discovered that each data set has an inherent level of difficulty, and this insight helped us improve accuracy on multiple tasks simultaneously. Difficult tasks need a very large weight in the loss function and many images in each training batch. Easier tasks don’t need as much supervision, so they can be given a small weight and a small number of images per batch. For this unified model, we allocated most of the batch to the more challenging data sets and only one or two images per batch to the simpler ones. This allowed us to scale up and ensure that all 83 loss functions work well at the same time.
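The weighting scheme described above can be sketched in a few lines. This is a minimal illustration, not the production training loop: the data set names, weights, and per-batch allocations below are hypothetical, chosen only to show hard tasks getting most of the batch and most of the loss weight.

```python
import random

# Hypothetical per-data-set config: loss weight and images per batch.
# Hard tasks get a large weight and most of the batch slots; easy
# tasks get a small weight and only one or two images per batch.
DATASETS = {
    "exact_product_match": {"weight": 8.0, "images_per_batch": 96},
    "fashion_attributes":  {"weight": 4.0, "images_per_batch": 24},
    "home_categories":     {"weight": 1.0, "images_per_batch": 2},
    "vehicle_categories":  {"weight": 1.0, "images_per_batch": 2},
}

def build_batch(datasets):
    """Assemble one training batch, allocating slots per data set."""
    batch = []
    for name, cfg in datasets.items():
        batch.extend((name, i) for i in range(cfg["images_per_batch"]))
    random.shuffle(batch)
    return batch

def combined_loss(per_task_losses, datasets):
    """Weighted sum of task losses, so hard tasks dominate the gradient."""
    return sum(datasets[name]["weight"] * value
               for name, value in per_task_losses.items())

batch = build_batch(DATASETS)           # 96 + 24 + 2 + 2 = 124 images
loss = combined_loss({"exact_product_match": 0.5,
                      "home_categories": 0.1}, DATASETS)
```

In a real multi-task trainer the per-task losses would come from the model’s heads each step; the point here is only that batch composition and loss weighting are set per data set, not globally.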
Manually annotating each image with its exact product identity is incredibly challenging since there are millions of possible product IDs. We developed a technique to automatically generate additional product ID labels using our model as a feedback loop — weakly supervised learning. Our method uses an object detector to identify boxes surrounding likely products in images, matches each box against our list of known products, and keeps all matches that fall within a similarity threshold. The resulting matches are added to our training set, increasing our training data without requiring human annotation.
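That feedback loop can be sketched as follows, with a toy in-memory catalog and cosine similarity standing in for the real product index. The embeddings, product IDs, and threshold here are invented for illustration; only matches confidently above the threshold become new training labels.

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Hypothetical catalog of known products: product_id -> embedding.
CATALOG = {
    "rug_123":   [0.9, 0.1, 0.0],
    "chair_456": [0.1, 0.9, 0.2],
}

def auto_label(box_embeddings, threshold=0.95):
    """Match each detected box embedding against the catalog and keep
    only matches within the similarity threshold as new labels."""
    new_labels = []
    for box_emb in box_embeddings:
        best_id, best_sim = None, -1.0
        for pid, emb in CATALOG.items():
            sim = cosine(box_emb, emb)
            if sim > best_sim:
                best_id, best_sim = pid, sim
        if best_sim >= threshold:
            new_labels.append((best_id, box_emb))
    return new_labels

# A box that closely matches rug_123 gets auto-labeled; an ambiguous
# box below the threshold is simply discarded, not guessed at.
labels = auto_label([[0.92, 0.08, 0.01], [0.5, 0.5, 0.5]])
```

Discarding ambiguous boxes is what keeps the loop from amplifying its own mistakes: only high-confidence matches feed back into training.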
Our model predicts a wide variety of properties for an image, such as its category, attributes, and likely search queries. It also predicts an embedding (like a “fingerprint”) that can be used to perform tasks like product recognition, visual search, visually similar product recommendations, ranking, personalization, price suggestions, and canonicalization.
We did all this in a compressed embedding space, using just 256 bits to represent each product. When we deployed GrokNet to Marketplace, we compressed our embeddings by a factor of 50 using our Neural Catalyzer, to significantly speed up retrieval and cut storage requirements. We accomplished this with substantial gains over previous embeddings running in production (+50 percent to +300 percent relative top-1 accuracy). As recently as last year, our text-based attribute systems could identify only 33 percent of the colors and attributes in home and garden listings on Marketplace. We’re now able to recognize 90 percent of them.
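As an illustration of why compact binary codes speed up retrieval, the toy sketch below quantizes each embedding to one bit per dimension and retrieves by Hamming distance, which reduces storage and turns lookup into cheap bitwise operations. This sign-based binarization is a stand-in, not the actual compression method used in production, and the catalog items and embeddings are hypothetical.

```python
def binarize(embedding):
    """Compress a float embedding to one bit per dimension by sign.
    A 256-dim embedding becomes 256 bits instead of 256 floats."""
    bits = 0
    for i, value in enumerate(embedding):
        if value > 0:
            bits |= 1 << i
    return bits

def hamming(a, b):
    """Number of differing bits between two binary codes."""
    return bin(a ^ b).count("1")

def nearest(query_bits, catalog_bits):
    """Retrieve the catalog item with the smallest Hamming distance."""
    return min(catalog_bits,
               key=lambda pid: hamming(query_bits, catalog_bits[pid]))

# Hypothetical product catalog, stored only as compact binary codes.
catalog = {
    "lamp_1": binarize([0.3, -0.2, 0.8, -0.1]),
    "sofa_2": binarize([-0.5, 0.4, -0.3, 0.9]),
}
query = binarize([0.2, -0.1, 0.7, -0.2])   # resembles lamp_1
match = nearest(query, catalog)
```

The same trade-off motivates compressed embeddings at scale: Hamming comparisons over bit codes are dramatically cheaper than float distance computations over billions of products.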
Created end-to-end with Facebook-developed tools, including PyTorch, this system is 2x more accurate than previous product recognition systems we’ve used. This has allowed us to greatly improve search and filtering on Marketplace so people can find products with very specific materials, styles, and colors (like a yellow mid-century loveseat). With this new unified model, the system is able to detect exact, similar (via related attributes), and co-occurring products across billions of photos.
We expect GrokNet to play an important role in making virtually any photo shoppable across our apps. We’ve already used this system to launch automatically populated listing details in Marketplace seller listings. When sellers upload a photo, we auto-suggest attributes like colors and materials, which makes creating a listing much easier. Beyond Marketplace, we are also testing automatic product tagging on Facebook Pages to help people discover products from businesses they like. When Page admins upload a photo, GrokNet can suggest potential products to tag by visually matching between items in the photo and the Page's product catalog. With AI-powered product tagging, businesses will be able to more easily showcase entire catalogs of products to billions of people worldwide within seconds.
In the future, GrokNet could be used to help customers easily find exactly what they're looking for, receive personalized suggestions from storefronts on what products are most relevant to them, which products are compatible, how they’re being worn, and then click through to purchase when they find things they like in their feeds.
To build a unified model, we had to push the boundaries of segmentation. Products in real-world images are generally not pictured in isolation against a stark white background. When people snap a quick photo to sell something on Marketplace, for example, items are often photographed in uneven lighting, partially obscured, shown in different poses, or layered under other items (like a shirt under a jacket). Existing segmentation tools have come a long way, thanks to recent advancements and new large-scale data sets, but they sometimes fail to accurately segment highly occluded items.
We took a new approach to high-resolution segmentation and developed a method that achieves state-of-the-art performance, even in these challenging situations. First, we detect a clothing item as a whole and roughly predict its shape. Then we can use this prediction as a guide to refine the estimate for each pixel. This allows us to incorporate global information from the detection to make better local decisions for each pixel in the semantic segmentation.
We have conducted research using an operator called Instance Mask Projection, which projects the predicted instance masks (with uncertainty) for each detection into a feature map to use as an auxiliary input for semantic segmentation. It also supports back-propagation, making it end-to-end trainable.
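A minimal version of the projection step might look like the following: each detection’s mask pixels write the detection’s confidence into a per-class feature map, which the semantic segmentation branch can then consume as an auxiliary input alongside its backbone features. This is a simplified sketch, and the detection format, classes, and values are invented for illustration.

```python
def instance_mask_projection(detections, height, width, num_classes):
    """Project predicted instance masks (with their confidence scores)
    into a per-class feature map for the semantic segmentation head."""
    # fmap[c][y][x] holds the max confidence of class c at pixel (y, x).
    fmap = [[[0.0] * width for _ in range(height)]
            for _ in range(num_classes)]
    for det in detections:
        c, score = det["class_id"], det["score"]
        for (y, x) in det["mask_pixels"]:
            fmap[c][y][x] = max(fmap[c][y][x], score)
    return fmap

# Hypothetical detections on a 2x2 image: a "jacket" (class 0) mask
# overlapping a "shirt" (class 1) mask, mirroring the layering case.
dets = [
    {"class_id": 0, "score": 0.9, "mask_pixels": [(0, 0), (0, 1), (1, 0)]},
    {"class_id": 1, "score": 0.7, "mask_pixels": [(1, 0), (1, 1)]},
]
fmap = instance_mask_projection(dets, height=2, width=2, num_classes=2)
# The semantic branch would concatenate fmap with its backbone
# features before the final per-pixel classification.
```

Because overlapping detections each write into their own class channel, the occluded shirt remains visible to the semantic head even where the jacket covers it, which is what makes the operator helpful for layered clothing.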
Our experiments have shown the effectiveness of Instance Mask Projection on both clothing parsing (complex layering, large deformations, non-convex objects) and street-scene segmentation (overlapping instances and small objects). Our approach is most helpful for improving semantic segmentation of objects for which detection works well, such as movable foreground objects, as opposed to regions, such as grass. It is especially helpful for small items, such as a scarf or a tie, and locally ambiguous items, such as a dress that’s distinguished from a skirt. We have demonstrated the effectiveness of adding our new operator on internally developed architectures, but it is quite general and could make use of future instance and semantic segmentation methods as baseline models.
Many of these computer vision advancements are already helping people use our shopping surfaces to more effectively find the right products to buy. Looking ahead, we are training our existing product recognition system to improve across dozens of additional product categories, including lighting and tableware, and on more patterns, textures, styles, and occasions. We’d also like to create a richer experience by enabling our system to detect objects in 3D photos.
These advancements are making it easier to locate products you might want. But any shopper knows the real test is seeing what something would look like in real life and trying it out.
That’s why we’ve created a new feature in Marketplace that can take a standard 2D video shot in Marketplace on mobile and post-process it to create an extremely accurate, interactive, 3D-like representation. If, for example, sellers want to list an 18x60 chestnut wood rustic coffee table on Marketplace, they can capture a short video clip. Potential buyers can then spin and move the table up to 360 degrees to see whether the color, size, style, and condition would work well for their space. This combination of video stabilization, editing, and user interaction helps transform 2D videos into a 3D-like view.
Major retailers commonly provide 3D models to let buyers see their products. While these are useful in giving buyers more information about the product’s shape, material, and condition, they typically require professional equipment and expertise to produce. The Rotating View feature provides an accessible alternative way to showcase an item’s condition, allowing Marketplace sellers to create an interactive 3D view with just their smartphone camera.
There are three main steps to this algorithm: identifying the original camera positions, defining a smooth camera path, and generating novel views that form the output video. Each frame of a captured video is an image from some viewpoint (or camera pose) in 3D space. To recover these camera poses, we used the classic visual-inertial simultaneous localization and mapping (SLAM) algorithm, which gives us the position of the camera relative to a set of 3D feature points in the image. We chose SLAM over other slower offline 3D reconstruction algorithms to perform the reconstruction in real time during video capture, reducing the amount of computation done post-capture. This let us stabilize the video in a fraction of a second.
Due to natural hand shaking, these camera positions do not form a smooth path. This is what makes the video appear shaky. To stabilize the video, we constructed a smooth virtual camera path by optimizing a splined curve through the original camera poses.
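As a stand-in for the spline optimization, the sketch below smooths a jittery 1D sequence of camera positions by repeatedly relaxing each interior point toward the midpoint of its neighbors, with the endpoints held fixed. The positions, iteration count, and relaxation strength are illustrative; the production system optimizes a splined curve rather than using this simple relaxation.

```python
def smooth_path(positions, iterations=200, strength=0.5):
    """Relax each interior camera position toward the midpoint of its
    neighbors, keeping the first and last poses fixed. This is a
    simple stand-in for fitting a smooth spline through the poses."""
    path = list(positions)
    for _ in range(iterations):
        new = [path[0]]
        for i in range(1, len(path) - 1):
            midpoint = (path[i - 1] + path[i + 1]) / 2.0
            new.append(path[i] + strength * (midpoint - path[i]))
        new.append(path[-1])
        path = new
    return path

# Hand-shake jitter on an otherwise steady sweep of camera x-positions.
shaky = [0.0, 1.3, 1.8, 3.2, 3.9, 5.0]
smooth = smooth_path(shaky)   # converges toward an evenly spaced path
```

The relaxed path keeps the same start and end poses as the capture but removes the high-frequency jitter, which is the property the virtual camera path needs.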
With the smooth virtual camera path in hand, we picked optimally spaced points on the path that correspond to the ideal viewpoints. Since these are novel viewpoints, we used the sparse 3D point clouds as guidance to find the nearest captured viewpoint and warp its image so it looks like it was taken from the ideal viewpoint. We did this by dividing the image into a grid and applying a transform to the grid so that the feature points from the captured viewpoint line up with where they should be in the ideal viewpoint.
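The idea of fitting a transform to align feature points can be illustrated with the simplest possible case: a single least-squares translation that moves tracked points from a captured frame to where they should appear in the ideal viewpoint. The point coordinates below are invented, and a real implementation would fit a richer transform per grid cell rather than one global translation.

```python
def fit_translation(src_points, dst_points):
    """Least-squares 2D translation aligning tracked feature points
    from a captured frame (src) with their positions in the ideal
    viewpoint (dst). A per-cell grid warp generalizes this idea."""
    n = len(src_points)
    dx = sum(d[0] - s[0] for s, d in zip(src_points, dst_points)) / n
    dy = sum(d[1] - s[1] for s, d in zip(src_points, dst_points)) / n
    return (dx, dy)

def warp(point, translation):
    """Apply the fitted transform to any pixel of the captured frame."""
    return (point[0] + translation[0], point[1] + translation[1])

# Hypothetical feature points in the captured frame and where they
# should land in the ideal viewpoint.
src = [(10, 10), (30, 12), (20, 40)]
dst = [(12, 11), (32, 13), (22, 41)]
t = fit_translation(src, dst)
```

Once the transform is fitted from the sparse feature matches, it can be applied to every pixel (or, in the grid version, to each grid vertex) to synthesize the novel view.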
When these novel viewpoints are put together into a video, we’re able to present an interactive view where the buyer is able to smoothly rotate the product to view at any captured angle. The feature works in real-world conditions and takes minimal effort to use. We are testing this feature on Marketplace iOS to start.
To allow people to see how they might look in these products — or how these products might look in the real world — we will draw from our open Spark AR platform, which already enables augmented reality try-on for Facebook Ads and Instagram Checkout. Brands such as NYX, NARS, and Ray-Ban have used these features to show people how they might look in a new shade of lipstick or a different pair of glasses. People can also share these images to their Story or Feed to see what friends think, or purchase the item they like best directly from the experience.
These advancements leverage the latest in computer vision, real-time face detection, and facial landmark recognition (identifying coordinates for noses, cheekbones, etc.). Try-on requires applying AR onto the face using computational photography and a 3D mesh model, and using plane-tracking technology so people can place 3D objects into their surroundings and interact with them in real time for a more personalized, contextualized experience. While today the majority of experiences focus on beauty and optical, in the near future we will support AR try-on for a wider variety of products, such as home decor and furniture, across more Facebook surfaces.
None of this would have been possible if it weren’t for technology pioneered by Facebook AI. Instance segmentation models, such as Mask R-CNN, efficiently detect objects in an image while simultaneously generating a high-quality segmentation mask for each instance. Feature Pyramid Network makes it possible for systems to process objects at scales both large and small. More recently, we’ve developed several new systems that represent big leaps forward. Detectron2 is a modular PyTorch-based object detection library that helps accelerate implementation of state-of-the-art CV systems. Panoptic FPN uses a single neural network to simultaneously recognize distinct foreground objects, such as animals or people (instance segmentation), while also labeling pixels in the image background with classes, such as road, sky, or grass (semantic segmentation).
We’ve also pushed research forward by building a deep image-generation neural network specifically designed for fashion. Fashion++ uses AI to suggest personalized style advice to take an outfit to the next level with simple changes, like adding a belt or half-tucking a shirt. We’ve also done research to help us take into consideration factors that current models don’t, such as diverse body shapes. Existing state-of-the-art clothing recommendation methods neglect the significance of an individual’s body shape when estimating the relevance of a given garment or outfit. We are developing a body-aware embedding to train systems to better detect and suggest clothing that would be flattering for a person’s specific body type. And because location is an important component of style, we’ve also done research on detecting which cities influence other cities in terms of propagating their styles — to inform a forecasting model that predicts the popularity of a look in any given region of the world.
As we work toward our long-term goal of teaching these systems to understand a person’s taste and style — and the context that matters when that person searches for a product — we need to push additional breakthroughs. We need to continue improving content understanding and build systems that can reason, make connections between items, and learn personalized shopping preferences.
Personal style and preferences are subjective, and they change frequently based on factors such as season, weather, occasion, cost, and geographical location. To adapt flexibly to someone’s needs and preferences, we need models that can keep learning over time. Building that kind of system to re-rank and optimize recommendations will require direct input via signals like uploading photos of products a person owns, likes, or wants to buy, or providing positive or negative feedback on images of potential items.
Say you're in São Paulo and spot a jacket you love. You could snap a photo to find a similar one in seconds because the system can analyze the brand, fabric, price point, and style information. We envision a future in which the same system could even incorporate your friends’ recommendations on museums, restaurants, or the best ceramics class in the city — enabling you to more easily shop for those types of experiences.
While these systems are fragmented right now, incorporating everything into one system is the ambitious challenge we’ve set out to achieve. Building these systems across all Facebook platforms would enable shoppers to connect with their friends and family to get an opinion on an automatically generated 360-degree 3D view of an item. These friends can weigh in on which sneakers they like most or which size painting looks best in the shopper’s kitchen. By combining state-of-the-art computer vision with advancements in other AI domains, such as language understanding, personalization, and social-first experiences, we’re well positioned to transform online shopping for everyone.
Research Scientist Manager
Head of Applied Computer Vision