Advancing AI to make shopping easier for everyone

June 22, 2021

Facebook AI is on a quest to build the world’s largest shoppable social media platform, where billions of items can be bought and sold in one place.
As a key milestone toward this goal, we’re sharing details on how we’ve improved and expanded GrokNet, our breakthrough product recognition system. Now, it’s powering new applications on Facebook, like product tagging and showing visually similar products. Soon, we’ll bring visual search to Instagram so that people can find similar products just by tapping on an image.
We’re also diving into our latest advancements, which provide a deeper, more nuanced understanding of product attributes and multimodal signals. These advancements collectively represent fundamental building blocks that could power entirely new shopping innovations of the future.

At any given moment, the web is a treasure trove of shopping inspiration — brimming with fashionable trends, seasonal tablescapes, and artful shelfies. But how many times have you seen something you want but can’t figure out how to buy it or even check it out?

Product recognition is among the most important ways to make it easier for people to shop online today. If AI can predict and understand exactly what’s in any given virtual frame, then people could — one day — choose to make any image or video shoppable. People would more easily find exactly what they’re looking for, and sellers could make their products more discoverable.

Facebook AI is building the world’s largest shoppable social media platform, where billions of items can be bought and sold in one place. As a key milestone toward this goal, we’re sharing details on how we’re expanding GrokNet, our breakthrough product recognition system, to new applications on Facebook and Instagram. GrokNet identifies what products are in an image and predicts their categories, like “sofa,” and attributes, like color and style. Unlike previous systems, which required separate models for each vertical, GrokNet is a first-of-its-kind, all-in-one model that scales to billions of photos across vastly different verticals, including fashion, auto, and home decor. GrokNet started as a fundamental AI research project with its first few applications on Marketplace, where AI analyzes search queries like “midcentury modern sofa” and predicts matches to search indexes so that the more than a billion people who visit Marketplace each month get the most relevant, state-of-the-art results when searching for products.

When a seller posts an image on their Facebook page, AI helps identify untagged items and suggests tags based on their product catalog. When a shopper is viewing an untagged post from a seller, the system suggests similar products below the post from the seller’s product catalog. These are visual demonstrations only; exact model experiences may vary.

Since 2020, we’ve expanded this technology to new applications on Facebook that make posts more shoppable. Right now, when a seller posts an image on their Facebook page, our AI-powered shopping system helps identify untagged items and suggests tags based on their product catalog, so that instead of spending several minutes manually tagging items, a seller can create and post a photo in just seconds. And when a shopper is viewing an untagged post from a seller, the system instantly suggests similar products below the post from that seller’s product catalog.

With billions of images uploaded to Shops on Facebook and Instagram by sellers, predicting just the right product at any given moment is an extremely hard, open AI challenge. Today we’re sharing details on our newest state-of-the-art advancements, which are making our AI systems remarkably smarter at recognizing products — from multimodal understanding to learning deeper, more nuanced attributes. These advancements not only strengthen current applications, but they also are the building blocks of future shopping experiences.

For example, on Instagram, shopping begins with visual discovery. Every day, people scroll through the app and see thumb-stopping inspiration, whether that’s a floral dress for summer or the perfect wedding dress. With AI-powered visual search, people can find similar dresses just by tapping on an image they see within Instagram. While it’s still early, we think visual search will enhance mobile shopping by making even more images on Instagram shoppable.

With each new advancement, we’ll push AI research beyond finding similar products to entirely new, more flexible tasks like: “Find a handbag with a pattern or embellishment similar to this dress.” And when you find the right product, we could one day build on this foundational technology to create new immersive innovations, like AI-powered AR glasses that let you shop a window display on your commute or personalized AI assistants that can complete your look for you.

Identifying thousands of new, unseen objects and attributes

To help shoppers find exactly what they’re looking for, it’s important that product recognition systems excel at recognizing specific product characteristics, also known as attributes. But there are thousands of possible attributes, and each one can apply to a range of categories. For example, you can have blue skirts, blue pants, blue cars, or even a blue sky. Most state-of-the-art classification models take a supervised approach, but with near-infinite possibilities, this is not scalable. Even just 1,000 objects and 1,000 attributes would mean manually labeling a million pairwise combinations (1,000 × 1,000). Plus, some combinations occur more frequently in data than others. For example, there might be many blue cars, but what about rarer occurrences, like blue cheetah-print clothing items?

How can we make our systems work even on rare occurrences?

An overview of the compositional framework architecture.

We built a new model that learns from some attribute-object pairs and generalizes to new, unseen combinations. So, if you train on blue skirts, blue cars, and blue skies, you’d still be able to recognize blue pants even if your model never saw them during training. We did this in a new compositional framework trained on 78M public Instagram images — built on top of our previous foundational research that uses hashtags as weak supervision to achieve state-of-the-art image recognition.

One exciting advancement of this work is a new compositional module that takes attribute and object classifier weights and learns how to compose them into attribute-object classifiers. This makes it possible for us to predict combinations of attributes and objects not seen during training, and it outperforms the standard approach of individual attribute and object predictions. Each object can be modified with many attributes, increasing the fine-grained space of classes by a few orders of magnitude. This means we can scale to millions of images and hundreds of thousands of fine-grained class labels in ways that were not possible before. And we can quickly spin up predictions for new verticals to cover the range of products in our Facebook catalog, or even recognize those blue cheetah-print clothing items should we ever come across them.
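The composition idea can be illustrated with a minimal sketch. This is not the production model: the classifier weights are random toy vectors, and `compose` is a hypothetical stand-in that concatenates the two weight vectors and applies a fixed random linear projection, whereas the module described above learns its composition function during training. All names (`compose`, `score`, `proj`, `DIM`) are assumptions made for illustration.

```python
import random

DIM = 4  # toy feature dimension (assumption)

def compose(attr_w, obj_w, proj):
    """Compose attribute and object classifier weights into one pair classifier.

    Hypothetical composition: concatenate both weight vectors, then apply a
    linear projection. In a real system the projection would be learned.
    """
    joint = attr_w + obj_w  # list concatenation, length 2 * DIM
    return [sum(p * x for p, x in zip(row, joint)) for row in proj]

def score(classifier_w, feature):
    """Dot product of a classifier weight vector with an image feature."""
    return sum(w * f for w, f in zip(classifier_w, feature))

rng = random.Random(0)
attr_weights = {a: [rng.gauss(0, 1) for _ in range(DIM)] for a in ("blue", "red")}
obj_weights = {o: [rng.gauss(0, 1) for _ in range(DIM)] for o in ("skirt", "pants", "car")}
proj = [[rng.gauss(0, 1) for _ in range(2 * DIM)] for _ in range(DIM)]

# A classifier exists for every attribute-object pair, including pairs
# (such as "blue pants") that never appeared together in labeled data.
pair_classifiers = {
    (a, o): compose(aw, obj_weights[o], proj)
    for a, aw in attr_weights.items()
    for o in obj_weights
}

image_feature = [rng.gauss(0, 1) for _ in range(DIM)]
best_pair = max(pair_classifiers, key=lambda k: score(pair_classifiers[k], image_feature))
```

The point of the design is that only 2 attribute vectors and 3 object vectors are stored, yet 6 pair classifiers are available; with 1,000 attributes and 1,000 objects the same approach would cover a million pairs from 2,000 vectors.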

While collecting the data to train these models, we sampled objects and attributes from geographies around the world. This helps us reduce the potential for bias in recognizing concepts like “wedding dress,” which is often white in Western cultures but is likely to be red in South Asian cultures, for instance. As part of our ongoing efforts to improve the algorithmic fairness of models we build, we trained and evaluated our AI models across subgroups, including 15 countries and four age buckets. By continuously collecting annotations for these subgroups, we can better evaluate and flag when models work better for some groups than for others. For instance, a model might recognize the neckline (V-neck, square, crew, etc.) more accurately on shirts for women than on shirts for men if we didn’t have enough training data of men wearing V-neck shirts. Although the AI field is just beginning to understand the challenges of fairness in AI, we’re continuously working to understand and improve the way our products work for everyone across the world.
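A per-subgroup evaluation of this kind can be sketched as follows. The records are synthetic toy data invented for illustration, and the `max_gap` threshold and both function names are assumptions; the source does not describe the actual metrics or thresholds used.

```python
from collections import defaultdict

def accuracy_by_subgroup(records):
    """Return {subgroup: accuracy} for (subgroup, predicted, actual) records."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for subgroup, predicted, actual in records:
        total[subgroup] += 1
        correct[subgroup] += int(predicted == actual)
    return {g: correct[g] / total[g] for g in total}

def flag_gaps(per_group, max_gap=0.1):
    """List subgroups whose accuracy trails the best subgroup by more than max_gap."""
    best = max(per_group.values())
    return sorted(g for g, acc in per_group.items() if best - acc > max_gap)

# Synthetic neckline predictions: (subgroup, predicted, actual)
records = [
    ("women", "v-neck", "v-neck"),
    ("women", "crew", "crew"),
    ("women", "square", "square"),
    ("women", "crew", "v-neck"),
    ("men", "crew", "v-neck"),
    ("men", "crew", "crew"),
    ("men", "crew", "v-neck"),
    ("men", "v-neck", "v-neck"),
]
per_group = accuracy_by_subgroup(records)
flagged = flag_gaps(per_group)
```

On this toy data the model mislabels V-necks for one subgroup more often, so that subgroup is flagged as needing more annotations or training data.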

This model is now live on Marketplace, and as a next step, we’re exploring and deploying these models to strengthen AI-assisted tagging and product matches across our apps. We’re also exploring entirely new tasks beyond object similarity. Just as we can combine different objects and attributes, we can also disentangle attribute-related features from object-related ones for a more diverse image search ranking. This opens up the possibility of searching not just based on object similarity, but also on different tasks like: “Find a scarf with the same pattern and material as this skirt.”

Strengthening predictions with multimodal signals

In the Facebook family of apps, images almost always come with associated text, such as metadata or product descriptions. So, building vision-only models potentially leaves critical pieces of the puzzle on the table. We’re already pushing state-of-the-art multimodal advancements to improve content understanding across our platform. And now we’ve seen that signals from associated text significantly improved the accuracy of product categorization in a few different ways.

Boosting attributes for fashion

Transformer architectures have revolutionized natural language processing tasks and, recently, researchers have extended their power to multimodality. We first tested a multimodal understanding framework on a clothing attributes data set that includes catalog data with text input. A key challenge with multimodal understanding, however, is that the text data itself can sometimes be misleading. For example, a product description might read, “Here is the perfect sequined top to wear with your favorite pair of black skinny jeans.” An AI model might incorrectly predict that the top is black when in fact it is silver. We also needed to prepare for occasions when fashions in images are completely missing descriptions or related text.

To address this challenge, we combined visual signals from the image with the related text description to guide the final model prediction. We found a great recipe for a multimodal model, which draws on several AI frameworks and tools: an early-fusion architecture (Facebook AI’s Multimodal Bitransformer, generalized as the MMF Transformer in Facebook AI’s Multimodal Framework) and a Transformer encoder that’s pretrained on public Facebook posts. It turns out that early-fusion multimodal transformers outperform late-fusion architectures.
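To make the early-fusion versus late-fusion distinction concrete, here is a toy sketch. It does not reproduce the Multimodal Bitransformer or any MMF API: the "encoder" is a parameter-free self-attention layer followed by mean pooling, and the token vectors are made up. It only shows where the two modalities meet in each design.

```python
import math

def self_attention(tokens):
    """Parameter-free single-head self-attention over a list of vectors."""
    out = []
    for q in tokens:
        logits = [sum(a * b for a, b in zip(q, k)) for k in tokens]
        m = max(logits)
        weights = [math.exp(l - m) for l in logits]
        z = sum(weights)
        out.append([
            sum(w / z * v[i] for w, v in zip(weights, tokens))
            for i in range(len(q))
        ])
    return out

def mean_pool(tokens):
    """Average a list of equal-length vectors into one vector."""
    n = len(tokens)
    return [sum(t[i] for t in tokens) / n for i in range(len(tokens[0]))]

def early_fusion(image_tokens, text_tokens):
    """One joint sequence: every token attends across both modalities."""
    return mean_pool(self_attention(image_tokens + text_tokens))

def late_fusion(image_tokens, text_tokens):
    """Separate encoders: modalities meet only when the pooled outputs are joined."""
    return mean_pool(self_attention(image_tokens)) + mean_pool(self_attention(text_tokens))

image_tokens = [[1.0, 0.0], [0.8, 0.2]]               # toy image patch embeddings
text_tokens = [[0.0, 1.0], [0.1, 0.9], [0.5, 0.5]]    # toy word embeddings
early = early_fusion(image_tokens, text_tokens)
late = late_fusion(image_tokens, text_tokens)
```

In the early-fusion path a text token such as "black" can be weighed against the image patches inside the encoder, which is one plausible reason that design copes better with misleading descriptions.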

And to handle instances when there are no text details, we added a modality dropout trick during training: when both modalities are present, we randomly remove either the text or the image so that the model is robust to missing inputs. Overall, this advancement provides significant improvements in accuracy compared with vision-only models, and we’ll keep expanding these multimodal attributes to other verticals.
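Modality dropout as described can be sketched in a few lines. The drop probability of 0.3, the even split between dropping text and dropping the image, and the use of `None` to mark a missing modality are all assumptions; the source does not give these details.

```python
import random

def modality_dropout(image, text, drop_prob=0.3, rng=random):
    """Randomly drop one modality from a training example that has both.

    Returns (image, text) with at most one of them replaced by None.
    Examples that already lack a modality pass through unchanged.
    """
    if image is None or text is None:
        return image, text
    if rng.random() < drop_prob:
        # Choose which modality to remove with equal probability.
        if rng.random() < 0.5:
            return None, text
        return image, None
    return image, text

rng = random.Random(0)
batch = [("img_%d" % i, "description %d" % i) for i in range(1000)]
augmented = [modality_dropout(img, txt, drop_prob=0.3, rng=rng) for img, txt in batch]
dropped = sum(1 for img, txt in augmented if img is None or txt is None)
```

Because the function never removes both inputs, every training example still carries some signal, while the model regularly sees image-only and text-only cases.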

Improving product matches

When we initially launched the product match application, we noticed that the GrokNet embedding distance could only capture overall attributes like color, shape, and structure; it couldn’t differentiate some key text-based details from one another. You can surface the same type of cosmetic product, for instance, but end up with different brands.
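The limitation is easy to see in a small sketch of embedding-based retrieval. The three-dimensional vectors and product names below are invented for illustration, and cosine similarity is an assumed choice of distance; real GrokNet embeddings and the metric used with them are not specified here.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def rank_by_embedding(query, catalog):
    """Return catalog ids sorted from most to least similar to the query."""
    return sorted(catalog, key=lambda pid: cosine_similarity(query, catalog[pid]), reverse=True)

# Hypothetical embeddings: two visually near-identical lipsticks and a sofa.
catalog = {
    "lipstick_brand_a": [0.90, 0.10, 0.30],
    "lipstick_brand_b": [0.88, 0.12, 0.31],
    "sofa": [0.05, 0.95, 0.20],
}
query = [0.89, 0.11, 0.30]
ranking = rank_by_embedding(query, catalog)
gap = abs(cosine_similarity(query, catalog["lipstick_brand_a"])
          - cosine_similarity(query, catalog["lipstick_brand_b"]))
```

The sofa is clearly separated from the lipsticks, but the two brands score almost identically, so the visual embedding alone cannot reliably pick the exact product.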

We needed to use additional signals to improve the accuracy of our product matches. For example, local features can capture similarities like specific logos and patterns, and optical character recognition (OCR) can capture exact variants of a product if the query image is text-rich. But it’s very challenging to combine these signals in a single product ranker, since each works well on different cases. Local features won’t work well for smooth objects like plain T-shirts or generic furniture, while OCR is commonly useful for beauty products but not for home and garden products.

To solve this problem, we had to build a flexible framework with careful feature engineering that let us add new features. We added two-stage ranking components into product recognition. The idea is to assemble features from GrokNet as well as other modalities with appropriate ranking models, and boost the best result into the top position of each query. This added flexibility to incorporate additional features without changing the current framework. The two-stage ranker includes:

A multilayer perceptron model that takes GrokNet embeddings and outputs rematch scores.
A gradient boosting decision tree model that takes multiple features from different modalities and outputs rematch scores.
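The re-ranking step can be sketched as follows, under loud assumptions: a hand-weighted sum stands in for the learned multilayer perceptron and gradient boosting models, the only extra feature is a Jaccard overlap between OCR tokens, and the candidates, their similarity scores, and the weights are all invented for illustration.

```python
def ocr_overlap(query_text, candidate_text):
    """Jaccard overlap between OCR tokens; 0.0 when either side has no text."""
    q = set(query_text.lower().split())
    c = set(candidate_text.lower().split())
    if not q or not c:
        return 0.0
    return len(q & c) / len(q | c)

def rerank(query_text, candidates, weights=(1.0, 2.0)):
    """Second stage: rescore first-stage candidates with an extra OCR feature.

    The weighted sum is a placeholder for a learned ranking model.
    """
    w_embed, w_ocr = weights

    def rescore(c):
        return w_embed * c["embed_sim"] + w_ocr * ocr_overlap(query_text, c["ocr_text"])

    return sorted(candidates, key=rescore, reverse=True)

# First-stage output: candidates already retrieved by embedding similarity.
candidates = [
    {"id": "serum_brand_b", "embed_sim": 0.97, "ocr_text": "glow serum brandb"},
    {"id": "serum_brand_a", "embed_sim": 0.95, "ocr_text": "hydra serum branda"},
    {"id": "plain_tshirt", "embed_sim": 0.40, "ocr_text": ""},
]
reranked = rerank("hydra serum branda 30ml", candidates)
top_id = reranked[0]["id"]
```

The embedding stage alone would rank the wrong brand first; the OCR feature lifts the exact match to the top, while the text-free T-shirt is simply scored on its embedding similarity. New features can be added to `rescore` without touching the first stage.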

Because beauty products are the most likely to have readable text, we saw the best improvements in the beauty category and good results from nonbeauty categories, like fashion and home. In the future, we’ll explore other types of signals to boost our product matches, such as engagement signals, which could be complementary to the current image-text models.

Stepping stones to the future of shopping

Today, AI-powered shopping is in its infancy. To machines, photos of products are still just collections of pixels. While some attributes can be straightforward, like “short sleeves,” others are more subjective, like “formal wear” or “warm weather.” Training AI models that can flexibly make use of the right information in each situation requires solving scientific and engineering challenges. With each year, we’re building smarter AI systems that are fine-tuned to understand shopping-related images and text with state-of-the-art accuracy. All of these advancements are collectively pushing us toward smarter product understanding systems that connect consumers with exactly what they want as soon as it catches their eye.

In the future, this technology could fuel more immersive experiences. With millions of pieces of multimedia content posted on public Facebook pages every day, we hope to eventually build embeddings and multimodal models to learn varieties and styles to match people with their taste in music, travel, and other interests. Imagine watching a livestream video of your favorite artist performing at a concert. You could instantly browse outfits and accessories inspired by the artist, shop hashtags associated with the song, and even automatically surface product reviews from your friends and family who are watching the livestream with you.

And further in the future, we hope to combine all of this work with our ongoing advancements in AR, conversational AI, and other domains of machine learning to build the most personalized shopping experience in the world. You could snap photos of a sweater you like and a future AI-powered stylist could match you with just the right complementary accessories and allow you to customize them for the perfect look and fit, all in augmented reality.

Of course, there’s a long way to go to such future innovations. Personalized shopping is particularly difficult since everyone’s preferences are unique and constantly evolving. We need to keep building on our fundamental AI advancements for an even deeper AI understanding of not just products and style, but also ever-changing context and the relationships between items. Today’s latest AI advancements bring us one step closer to the future of shopping.

We’d like to acknowledge the contributions of Animesh Sinha, Dhruv Mahajan, Dillon Stuart, Faizan Bhat, Filip Radenovic, Grigorios Antonellis, Jun Chen, Licheng Yu, Naveen Adibhatla, Omkar Parkhi, Pratik Dubal, Sami Alsheikh, Shawn Tzeng, Sridhar Rao, Tao Xiang, Wenwen Jiang, Yanping Xie, and Yina Tang, as well as researchers, engineers, and other teammates who worked on the Connected Commerce, Product Clustering Platform, IG Shopping, Catalog Quality Inference, AIX, and AI Commerce teams.