March 18, 2021

Humans innately understand the countless variations in the world around us. We know a dog when we see one, even if it's a breed, color, or pose we've never encountered before. But even the most state-of-the-art AI models, models that outperform humans in myriad ways, can struggle with the simple — for humans, at least — task of identifying a golden retriever whether it's viewed head-on, from the side, upside down, leaping through the air, or even covered in mud.

Deep learning models are great at interpreting statistical patterns among pixels and labels, but they can struggle to correctly identify objects across their many potential natural variations. Is that a snowplow coming down the road? Or a school bus tipped over on its side? A human would instantaneously know, but factors such as color, size, and perspective complicate whether an AI model can make a successful prediction.

Figure: A deep neural network misclassifies a bus as a snowplow. Based on “Strike (with) a pose: Neural networks are easily fooled by strange poses of familiar objects” by M.A. Alcorn et al. (used with permission of the author).

At Facebook AI, we’ve been exploring this challenge of capturing natural variation and identified the limitations of the traditional solution, known as disentanglement. We’ve also recently developed the idea of an equivariant shift operator — a proof of concept for an alternative solution that could help models understand how an object might vary by mimicking the most common transformations.

Our work is largely theoretical at the moment but has broad potential for deep learning models, particularly in computer vision: increased interpretability and accuracy, better performance even when trained on small data sets, and improved capability to generalize. We hope these contributions bring the computer vision community one step closer to developing AI systems that can better understand the visual world in all its complexity.

Disentanglement is an existing approach to handling natural variation: it aims to identify and separate the factors of variation in data. Current approaches to disentanglement attempt to learn the underlying transformations of objects by encoding each factor of variation into a separate subspace of the model’s internal representation. For example, disentanglement might encode a data set of dog images into pose, color, and breed subspaces.
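To make the idea of separate subspaces concrete, here is a minimal sketch of an idealized disentangled code for a toy data set described by a shape label and an angle. The encoder `encode_disentangled` and the helper `rotate_latent` are hypothetical and hand-written for illustration; a real model would have to learn such a code from images.

```python
import math

def encode_disentangled(shape_id, angle_deg):
    """Hypothetical ideal encoder: one subspace per factor of variation."""
    a = math.radians(angle_deg)
    return {
        "shape": [float(shape_id)],               # shape subspace
        "orientation": [math.cos(a), math.sin(a)]  # orientation subspace
    }

def rotate_latent(z, delta_deg):
    """Apply a rotation in latent space; only the orientation subspace changes."""
    d = math.radians(delta_deg)
    c, s = z["orientation"]
    return {
        "shape": list(z["shape"]),  # untouched by rotation
        "orientation": [
            c * math.cos(d) - s * math.sin(d),
            s * math.cos(d) + c * math.sin(d),
        ],
    }

z = encode_disentangled(shape_id=1, angle_deg=30)
z_rotated = rotate_latent(z, 60)
print(z_rotated["shape"], z_rotated["orientation"])
```

In this idealized picture, rotating the object by 60 degrees and then encoding it gives the same code as encoding it first and rotating only the orientation coordinates.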

This approach is good at identifying the factors of variation in rigid data sets, like a single MNIST digit or a single object class like a chair, but we’ve found that disentanglement performs poorly across multiple object classes. Think about multiple rotating shapes, such as triangles and squares. A disentangled model would attempt to separate the two factors of variation, the shape and the orientation of the object, into two representational spaces. The image below illustrates how traditional disentanglement fails to isolate rotation in a data set of multiple shapes. We would expect the highlighted shape to rotate, but because disentanglement failed, the shape instead remains fixed.

Disentanglement also introduces topological defects, another problem for a broad family of transformations. Topological defects violate continuity — an essential property of deep learning models. Without continuity, deep learning models may struggle to effectively learn the patterns present in data.

Consider rotations of an equilateral triangle. An equilateral triangle rotated by 120 degrees is indistinguishable from the original triangle, leading to identical representations in orientation space. However, adding an infinitesimally small dot to one corner of the triangle makes the two orientations distinguishable, so their representations must now differ even though the images have barely changed. Nearby images are therefore mapped to representations that are quite far apart, which violates continuity. Our research also shows that topological defects arise for nonsymmetrical shapes and many other common transformations.
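The triangle argument can be checked numerically with a small sketch. Everything here is an illustrative assumption: the triangle is reduced to its three vertices on the unit circle, and the function names `triangle` and `outline` are made up for this example.

```python
import math

def triangle(angle_deg):
    """Vertices of an equilateral triangle rotated by angle_deg.

    The first vertex in the list plays the role of the marked corner.
    Coordinates are rounded so floating-point noise does not matter.
    """
    points = []
    for k in range(3):
        a = math.radians(angle_deg + 90 + 120 * k)
        points.append((round(math.cos(a), 6), round(math.sin(a), 6)))
    return points

def outline(points):
    """The unmarked shape: the set of vertices, ignoring which one is marked."""
    return frozenset(points)

original = triangle(0)
rotated = triangle(120)

# Without the dot, the two images are identical.
same_outline = outline(original) == outline(rotated)

# With the dot on the first vertex, the two images differ.
same_marked_corner = original[0] == rotated[0]

print(same_outline, same_marked_corner)
```

The unmarked outlines coincide, while the marked corner ends up in a different place. Because the dot can be made arbitrarily small, two almost identical images end up needing orientation codes that are 120 degrees apart.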

Rather than restrict each transformation to one component of a representation, what if transformations could instead modify the entire representation? The goal of this approach is to discover operators capable of manipulating the image and its representation, with a single operator for each factor of variation. Such operators are known as equivariant operators: transforming the image and then encoding it gives the same result as encoding the image and then applying the operator to its representation.

There’s a rich branch of mathematics known as group theory that can teach us a great deal about applying equivariant operators. It shows that an intuitive way to understand variation factors is to model them as a group of transformations. Rotations of a triangle, for example, have a group structure: a 90-degree rotation and a 30-degree rotation combine to yield a 120-degree rotation.
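The group structure of rotations can be verified directly with 2D rotation matrices. This is a small standard-library sketch; the helper names `rotation`, `matmul`, and `close` are chosen here for illustration.

```python
import math

def rotation(deg):
    """2x2 matrix for a counterclockwise rotation by deg degrees."""
    a = math.radians(deg)
    return [[math.cos(a), -math.sin(a)],
            [math.sin(a),  math.cos(a)]]

def matmul(A, B):
    """Plain matrix product of two nested-list matrices."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))]
            for i in range(len(A))]

def close(A, B, tol=1e-9):
    """Entrywise comparison up to floating-point tolerance."""
    return all(abs(a - b) <= tol
               for row_a, row_b in zip(A, B)
               for a, b in zip(row_a, row_b))

# Composition: 90 degrees followed by 30 degrees equals 120 degrees.
print(close(matmul(rotation(90), rotation(30)), rotation(120)))

# Identity and inverse, the other ingredients of a group.
print(close(rotation(0), [[1, 0], [0, 1]]))
print(close(matmul(rotation(120), rotation(-120)), rotation(0)))
```

Each check compares matrices up to a small tolerance, since the trigonometric values are floating-point approximations.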

We've used these ideas to identify the shortcomings of traditional disentanglement and determine how to train equivariant operators to disentangle. We propose an equivariant operator called the shift operator. This is a matrix with blocks that mimic the group structure of common transformations — rotations, translations, and rescaling. We then train an AI model on both the original images and their transformations.
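As a rough illustration of what one such block might look like, the sketch below builds a cyclic shift matrix for a rotation discretized into `n` steps and checks equivariance on a toy signal. It is not the model or training procedure from our work: the ring-shaped "image", the fixed encoder `encode_ring`, and all function names are assumptions made for this example.

```python
def shift_matrix(n, k=1):
    """n x n cyclic permutation matrix moving entry j to position (j + k) % n."""
    return [[1 if (j + k) % n == i else 0 for j in range(n)] for i in range(n)]

def matvec(M, v):
    """Matrix-vector product on nested lists."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def matmul(A, B):
    """Matrix-matrix product on nested lists."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))]
            for i in range(len(A))]

def rotate_input(x, k=1):
    """Toy 'rotation': cyclically roll samples taken around a ring by k steps."""
    n = len(x)
    return [x[(i - k) % n] for i in range(n)]

def encode_ring(x):
    """Hypothetical fixed encoder: a circular filter, standing in for a learned one."""
    n = len(x)
    return [x[i] + 0.5 * x[(i - 1) % n] for i in range(n)]

n = 4
x = [1.0, 2.0, 3.0, 4.0]
S = shift_matrix(n)

# Equivariance: encoding the rotated input equals shifting the encoding.
print(encode_ring(rotate_input(x)) == matvec(S, encode_ring(x)))

# Group structure: two single shifts compose into a double shift,
# and n shifts return to the identity.
print(matmul(S, S) == shift_matrix(n, 2))
print(shift_matrix(n, n) == shift_matrix(n, 0))
```

The shift acts on every coordinate of the representation at once instead of on a dedicated orientation subspace, which is the key difference from the disentangled picture.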

In doing so, we’ve found that the shift operator successfully learns transformations even among data sets containing multiple classes — the very condition in which traditional disentanglement typically fails.

These are exciting developments because equivariant models based on group theory greatly expand the scope of disentanglement research. Existing models rely on strong supervision, such as knowing the transformations of interest a priori and enforcing them in the model. But how can we discover a data set’s symmetries using a minimal amount of supervision? Previous research in this area has been applied mostly to synthetic data. On real data, knowledge of the underlying symmetries could make models more robust when they’re faced with unusual observations, such as a bus on its side or a dog with an oversized toy in its mouth.

Humans recognize unknown objects by intuitively comparing them with things we've seen before. Models could be trained to be equivariant to transformations of subparts of an image, and crucially, the models could recombine subparts when confronted by unknown objects.

Finally, tackling real data sets with group theory–based models is challenging because the group structure is not perfectly respected. For example, when rotating an object in a nonuniform background, there are many ways of inferring the values of the pixels that appear after the rotation. Extending this idea to more realistic settings and data sets such as images with no artificial augmentation might prove to be a valuable approach going forward.

In any case, our work opens numerous promising lines of research to pursue. We hope that it brings us one step closer to powerful, flexible, and reliable AI models that can fully understand the world in all its natural complexity.
