Delivering the benefits of artificial intelligence to everyone requires creating systems that work well for everyone. At F8 2019, we highlighted Facebook AI's range of systems and processes currently in production for developing inclusive AI and addressing labeling bias, algorithmic bias, and intervention bias. These efforts help ensure our computer vision (CV) systems work well for all skin tones, for example, and allow our augmented reality effects to serve everyone regardless of facial features, hairstyle, or other factors. But creating fair and unbiased AI systems will require mitigating other forms of potential bias as well, so today, Facebook AI researchers have published the first systematic study that measures the accuracy of object-recognition systems for different communities across the world. Events such as weddings or commonly used household items (dish soap, for instance) can look very different in different places, so CV systems trained with data predominantly from one region may not perform as well when classifying images from somewhere else.
To assess this systematically, Facebook AI researchers tested the performance of our object-recognition system along with systems from a range of other technology companies. Using a publicly available third-party data set of photos of household items in 50 countries, we found accuracy for all these systems was indeed significantly lower for images from certain regions and from households with lower income levels.
By publishing these results and describing our methodology, we hope to enable AI researchers and engineers across the community to test and compare the performance of their own object-recognition systems, and then make them more capable of serving everyone effectively. This can help everyone in the AI community create tools that deliver the same high level of accuracy, irrespective of a person's country of residence, culture, and socio-economic circumstances.
Object-recognition technology has improved drastically in the past few years across the industry, and it is now part of a huge variety of products and services that millions of people worldwide use. Here at Facebook, for example, object recognition is an important part of the AI systems that help keep people safe on our platforms. It also enables our Automatic Alt Text feature to help the visually impaired better understand what their friends and family are publishing to Facebook products. Facebook AI has pushed the state of the art in CV systems with Mask R-CNN in 2017 and now Panoptic Feature Pyramid Networks. We have also developed new techniques, such as using public hashtags for weakly supervised learning systems, which set a new state of the art in image recognition, and we've now expanded this work to video as well.
As with other AI tools, however, the performance of image-recognition systems is greatly affected by the data sets on which they are trained. In this case, performance varies in large part because the objects themselves vary.
To measure discrepancies in accuracy, we analyzed Dollar Street, a collection of publicly available photos of household items that were photographed in 264 different homes across 50 countries. The images were gathered by the independent Gapminder foundation, which highlights differences in people's living conditions around the world. The Dollar Street photo collection can serve as a public benchmark of object-recognition systems that allows us to measure system performance in different parts of the world. (More details are available in the paper, which is being presented at the 2019 Computer Vision for Global Challenges Workshop.)
We analyzed how well object-recognition systems provided by major commercial cloud services work on the Dollar Street photo collection, as well as the performance of Facebook's own internal systems. The results of our analysis show that, at the time of benchmarking, all these services perform substantially better in some countries than in others. In particular, photos from several countries in Africa and Asia were less likely to be identified accurately than photos from Europe and North America. Our analysis showed that this issue is not specific to one particular object-recognition system, but rather broadly affects tools from a wide range of companies, including ours. Using the Dollar Street data set and comparing performance for different income groups, we found that the accuracy of Facebook's object-recognition system varies by roughly 20 percent.
These results clearly show that we must do better both across the industry and here at Facebook. As detailed in the concluding section below, we are actively pursuing ways to address the issues identified by our paper.
Geographic discrepancies are not the only issue this study identifies. Dollar Street also recorded the monthly consumption income (determined by purchasing power parity) for each of the households in which photos were taken. This allowed us to analyze object-recognition systems' performance by household income as well as location. The results of this analysis show that object-recognition systems performed 10 percent to 20 percent better in classifying Dollar Street images for the wealthiest households than for the least wealthy households.
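The kind of per-group comparison described above can be sketched in a few lines of Python. This is a minimal illustration rather than the code used in the study: the function name `accuracy_by_group`, the record layout, and the rule that a prediction counts as correct when the true label appears among the system's returned labels are all assumptions made for the example, and the sample records are invented.

```python
from collections import defaultdict


def accuracy_by_group(records):
    """Return the fraction of correctly recognized images per group.

    `records` is an iterable of (group, true_label, predicted_labels) tuples.
    Assumption for this sketch: an image counts as correct when its true
    label appears anywhere in the list of labels the system returned.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for group, true_label, predicted_labels in records:
        totals[group] += 1
        if true_label in predicted_labels:
            hits[group] += 1
    return {group: hits[group] / totals[group] for group in totals}


# Invented toy records: (income bucket, true label, labels returned by a system).
example_records = [
    ("lowest_income", "soap", ["bottle", "food"]),
    ("lowest_income", "stove", ["stove", "oven"]),
    ("highest_income", "soap", ["soap", "bottle"]),
    ("highest_income", "stove", ["stove"]),
]

per_group = accuracy_by_group(example_records)
gap = per_group["highest_income"] - per_group["lowest_income"]
print(per_group, gap)
```

The same function works for geographic comparisons if the group key is a country or region instead of an income bucket.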
To assess these results, it is helpful to review how object-recognition systems work. These systems are based on machine learning models, typically convolutional networks, that are trained on a large collection of images. For each image in the collection, humans often must manually annotate which objects are present in the photo.
The question is: Where do the images used for training come from? Proprietary services do not typically share this data, but we can analyze public photo collections that are often used to train the object-recognition systems that academic institutions develop, such as ImageNet, COCO, and OpenImages. We find that these collections can have a very skewed geographic distribution: Almost all the photos come from Europe and North America, whereas relatively few photos come from populous regions in Africa, South America, and Central and Southeast Asia. This uneven distribution may lead to biases in object-recognition systems trained on these data sets. Such systems may be much better at recognizing a traditionally Western wedding than a traditional wedding in India, for example, because they were not trained on data that included extensive examples of Indian weddings.
Diversifying the photos in a given data set is not the only step toward addressing this bias in object-recognition systems. Some systems are trained with photos obtained by searching public photo websites using English-language queries. This can make it even more challenging to create a data set that is very broadly representative. A query for the word “wedding” will typically return very different photos than a query for, say, the Hindi word for wedding, “शादी.” To address this issue, researchers and engineers must make sure their systems are not overly dependent on English queries and labels.
Being able to diagnose these issues is an important first step to solving them, but it is only the beginning. The entire AI community must continue to improve AI systems so they work well for everyone who wants to use them, irrespective of gender, race, cultural background, country of origin, and socio-economic circumstances. At Facebook, we know we have more to do to make our AI systems more inclusive and we're working hard on improving.
We believe one important component of making image-recognition models more inclusive lies in using hashtags from many different languages for training, rather than only English hashtags. Specifically, we are developing image-recognition systems that are trained on multilingual hashtag annotations. The first step in our approach is to use Facebook's unsupervised word embedding technology to learn multilingual hashtag embeddings. Subsequently, we train our convolutional networks to predict the hashtag embedding that corresponds to the training image. The use of unsupervised word embedding allows us to train on images that are annotated in hundreds of different languages, including languages that have relatively few speakers.
In addition to the training of multilingual vision models, we are exploring techniques that use location information to ensure we select a data set that is geographically representative of the world population. This method works by resampling training images to match a geographic target distribution.
As we work to implement these measures, we'll also explore other ways to improve our object-recognition systems. We hope this new study we've released today will inspire others to work to improve their systems as well.
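To make the geographic resampling idea mentioned above concrete, here is one simple way it could be done. This is a sketch under stated assumptions, not a description of Facebook's implementation: it assumes each training image carries a region tag, weights each image by the ratio of the target share to the observed share of its region, and then samples with replacement. The function names, region codes, and target shares are hypothetical.

```python
import random
from collections import Counter


def resampling_weights(regions, target):
    """Weight each item by target share / observed share of its region.

    `regions` lists the region tag of every training item; `target` maps a
    region tag to its desired share. Regions missing from `target` get
    weight 0 (an assumption of this sketch).
    """
    counts = Counter(regions)
    total = len(regions)
    return [target.get(r, 0.0) / (counts[r] / total) for r in regions]


def resample(items, regions, target, k, seed=0):
    """Draw k items with replacement so regions roughly follow `target`."""
    rng = random.Random(seed)
    weights = resampling_weights(regions, target)
    return rng.choices(items, weights=weights, k=k)


# Invented toy data: 8 images tagged "EU" and 2 tagged "AF".
toy_regions = ["EU"] * 8 + ["AF"] * 2
toy_items = list(range(len(toy_regions)))
toy_target = {"EU": 0.5, "AF": 0.5}  # hypothetical target distribution

sample = resample(toy_items, toy_regions, toy_target, k=1000)
share_af = sum(toy_regions[i] == "AF" for i in sample) / len(sample)
print(f"share of AF images after resampling: {share_af:.2f}")
```

Sampling with replacement repeats images from underrepresented regions rather than adding new ones, so a scheme like this rebalances the mix but cannot substitute for collecting more diverse data.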
Laurens van der Maaten