April 12, 2022

Self-supervised learning has revolutionized AI training in several domains, including vision, language, and speech. By teaching models to construct useful representations of text and images without labeled data, self-supervised methods free AI systems to learn from orders of magnitude more examples, so they can recognize subtle patterns that humans might miss. Self-supervised networks now often perform as well as or better than more traditional models.

But self-supervised methods have a significant drawback: They sometimes trigger dimensional collapse, in which a model fails to take advantage of its full capacity to encode information. We’re sharing new research that pinpoints two mechanisms underlying dimensional collapse. We have also developed DirectCLR, a training method that overcomes this problem by optimizing a model’s ability to create rich representations of knowledge.

An AI model translates data into complex mathematical representations, assigning each input numerical positions within hundreds of dimensions. For an image, the dimensions might quantify patterns in edges, the sharpness of angles, or the contours implied by shading. These representations — called embeddings — consolidate a wealth of meaning, distilling the unique qualities that characterize each object in a picture.

For an image model, the goal of training is to learn which traits distinguish one group of objects from another. Is color always important, or does this gray cat-shaped figure belong in the same bunch as this orange tabby? How about orientation? Should this thing that looks like an upside-down cat be grouped with the other cats?

Training forces the model to jettison pixel-level details specific to only one example and to find the common ground between similar things. When the model discards extraneous details — noise — it reduces the dimensionality of the embedding. In the process, the representations of, say, different cat pictures grow more alike.

Self-supervised approaches teach a model to cluster the representations of nearly identical inputs close together in a multidimensional graph, called the embedding space. In place of human-created labels, the system duplicates and distorts the input image — applying a series of random augmentations, such as cropping, desaturating, rotating, and resizing — to create a matching pair of pictures. The model learns to create similar embeddings for these images, and, in the process, to build representations that recognize relevant attributes and ignore noise.
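The pairing step described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the pipeline used in the research: the function names, the 24-pixel crop, and the particular distortions (crop, horizontal flip, partial desaturation) are assumptions chosen for brevity.

```python
import numpy as np

def random_augment(image, rng, crop=24):
    """Return a randomly cropped, flipped, and partly desaturated copy
    of an H x W x 3 image (hypothetical, simplified augmentations)."""
    h, w, _ = image.shape
    top = rng.integers(0, h - crop + 1)
    left = rng.integers(0, w - crop + 1)
    view = image[top:top + crop, left:left + crop, :]
    if rng.random() < 0.5:
        view = view[:, ::-1, :]  # horizontal flip
    # Blend each pixel toward its gray value by a random amount.
    strength = rng.random()
    gray = view.mean(axis=2, keepdims=True)
    return (1.0 - strength) * view + strength * gray

def make_positive_pair(image, rng):
    """Two independent distortions of the same image form a positive pair."""
    return random_augment(image, rng), random_augment(image, rng)

rng = np.random.default_rng(0)
image = rng.random((32, 32, 3))  # stand-in for a real photo
view_a, view_b = make_positive_pair(image, rng)
print(view_a.shape, view_b.shape)
```

Both views come from the same source image, so a model trained on such pairs is pushed to ignore the crop position, the flip, and the color saturation.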

The underlying dynamics of these methods remain somewhat mysterious, however. Self-supervised methods can bring about complete collapse, in which all representation vectors cluster at a single point; the model creates the exact same embedding for each input. This can happen when the system learns only from pairs of related images. In attempting to maximize the likeness between similar features, the model ends up treating all images as if they were the same.

To prevent complete collapse, researchers often turn to contrastive learning, one of the most promising methods for self-supervised image-model training. Like other self-supervised approaches, contrastive techniques teach a model to cluster the representations of nearly identical inputs, or positive pairs, close together in the embedding space — but the model also learns to push embeddings away from those of dissimilar examples, or negative pairs. The positive and negative pairs are treated differently in the loss function, which measures the error in the model’s predictions. The model will learn to bunch all the representations of cat photos together, and to separate them from the representations of otter pictures. (It doesn’t know that one group depicts animals called cats, and the other shows otters; it simply learns to discriminate between different sets of features.)
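To make the role of positive and negative pairs in the loss concrete, here is a minimal NumPy sketch of an InfoNCE-style contrastive loss. It is a simplified, one-directional variant written for illustration, not the exact loss used in SimCLR or in this research; the function names and the temperature of 0.1 are assumptions.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length."""
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)

def info_nce_loss(z_a, z_b, temperature=0.1):
    """z_a[i] and z_b[i] embed two views of image i (a positive pair);
    z_a[i] and z_b[j] with j != i are treated as negative pairs."""
    z_a = l2_normalize(z_a)
    z_b = l2_normalize(z_b)
    logits = z_a @ z_b.T / temperature                   # (n, n) similarities
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Cross-entropy with the positive pair (the diagonal) as the target.
    return -np.mean(np.diag(log_prob))

rng = np.random.default_rng(0)
z = rng.normal(size=(8, 16))
aligned = info_nce_loss(z, z + 0.01 * rng.normal(size=z.shape))
unrelated = info_nce_loss(z, rng.normal(size=z.shape))
collapsed = info_nce_loss(np.ones((8, 16)), np.ones((8, 16)))
print(aligned < unrelated)
print(np.isclose(collapsed, np.log(8)))
```

When every input maps to the same embedding, the loss sits at log(n) for a batch of n, well above the loss for well-separated embeddings. That penalty is what discourages complete collapse.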

Ideally, a model’s embedding vectors would span the entire embedding space, maximizing the knowledge it can encode. However, we observed that while contrastive learning prevents total collapse, it can induce a related problem called dimensional collapse. When that happens, the embeddings all vacate certain dimensions and shrink into a lower-dimensional subspace — like a 3D sculpture compressed into a 2D drawing.

To quantify the problem, we trained a SimCLR model with a two-layer multilayer perceptron projector and evaluated its dimensionality by collecting the embedding vectors on the validation set and computing the singular values of their covariance matrix. About 30 singular values dropped to zero, indicating that those dimensions had collapsed. This suggests that contrastive learning may not make full use of model capacity, and that as a result the system will represent a limited volume of information. These artifacts may prevent contrastive methods from being truly scalable.
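The measurement can be sketched as follows: compute the covariance matrix of the embeddings, take its singular values, and count how many are numerically zero. The code below is a rough illustration on synthetic data, not the evaluation code or results from the research. The embedding size of 128, the rank of 98, the relative tolerance, and the function names are all assumptions made so that the example runs on its own.

```python
import numpy as np

def embedding_spectrum(embeddings):
    """Singular values (descending) of the embedding covariance matrix."""
    centered = embeddings - embeddings.mean(axis=0, keepdims=True)
    cov = centered.T @ centered / embeddings.shape[0]
    return np.linalg.svd(cov, compute_uv=False)

def count_collapsed(singular_values, rel_tol=1e-6):
    """Count singular values that are negligible next to the largest one."""
    return int(np.sum(singular_values < rel_tol * singular_values[0]))

rng = np.random.default_rng(0)
n, d, rank = 1000, 128, 98
# Synthetic embeddings confined to a 98-dimensional subspace of a
# 128-dimensional space, so 30 dimensions are collapsed by construction.
latent = rng.normal(size=(n, rank))
mixing = rng.normal(size=(rank, d))
embeddings = latent @ mixing

spectrum = embedding_spectrum(embeddings)
print(count_collapsed(spectrum))  # 30 by construction
```

With real embeddings, plotting the logarithm of the sorted singular values makes the collapse visible as a sharp drop at the end of the spectrum.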

In contrastive methods that explicitly use positive and negative pairs in the loss function, the repulsive effect of the dissimilar examples should propel the embeddings across all the available dimensions. However, contrary to intuition, contrastive learning methods can nonetheless trigger dimensional collapse. We found that two mechanisms cause this phenomenon.

The first is strong augmentation. When the distortion applied to the duplicate is overly severe, the image is no longer similar enough to the original for the network to recognize them as a positive pair. If strong augmentation produces more variance within a particular feature than is found in the data distribution, the weight collapses in that dimension. We found that this happens when the contrastive covariance matrix (the weighted data distribution covariance matrix minus the weighted augmentation covariance matrix) is not positive semidefinite.
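The condition can be checked numerically. The sketch below is a simplified, unweighted stand-in for the contrastive covariance matrix: it subtracts the covariance of the augmentation displacement from the covariance of the data and tests whether the result is positive semidefinite. The uniform weighting, the Gaussian noise used as "augmentation", and the function names are assumptions for illustration only.

```python
import numpy as np

def contrastive_covariance(data, augmented):
    """Data covariance minus augmentation covariance (unweighted sketch)."""
    centered = data - data.mean(axis=0, keepdims=True)
    data_cov = centered.T @ centered / data.shape[0]
    shift = augmented - data  # displacement caused by the augmentation
    aug_cov = shift.T @ shift / data.shape[0]
    return data_cov - aug_cov

def is_positive_semidefinite(matrix, tol=1e-10):
    """True if the smallest eigenvalue of the symmetrized matrix is >= -tol."""
    sym = (matrix + matrix.T) / 2.0
    return bool(np.linalg.eigvalsh(sym).min() >= -tol)

rng = np.random.default_rng(0)
data = rng.normal(size=(5000, 4))  # roughly unit variance in every feature

# Mild augmentation: small noise in every feature.
mild = data + 0.1 * rng.normal(size=data.shape)
# Strong augmentation: noise in the last feature exceeds the data variance.
noise_scale = np.array([0.1, 0.1, 0.1, 2.0])
strong = data + noise_scale * rng.normal(size=data.shape)

print(is_positive_semidefinite(contrastive_covariance(data, mild)))
print(is_positive_semidefinite(contrastive_covariance(data, strong)))
```

In the strong case, the augmentation variance in the last feature (about 4) exceeds the data variance (about 1), so the matrix has a negative eigenvalue along that direction. That is the situation in which the weights collapse in that dimension.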

But contrastive learning can bring about dimensional collapse even when the positive pairs are similar, when a linear network has more parameters than necessary. Overparameterized networks tend to find low-rank (that is, lower-dimensional) solutions. A loss function can have more than one local minimum, and some minima confer lower loss than others. By design, different optimization algorithms will tend to converge to particular local minima and eschew others. These dynamics, along with the interplay of weight matrices at different layers, can cause overparameterized neural networks to settle in flatter minima. This phenomenon, called implicit regularization, is thought to be what drives neural networks to generalize so well in supervised training.

However, in contrastive learning settings, implicit regularization can prevent neural networks from encoding more than minimal information even when positive pairs are very similar. We discovered that in these cases, gradient descent spurs adjacent layers to align, and singular values that start out small at initialization evolve exponentially more slowly than the others, resulting in collapsed dimensions. For this to happen, the contrastive covariance matrix must be positive semidefinite, the exact opposite of the condition that causes dimensional collapse when augmentation is too strong.

Given the fundamental limitations of contrastive learning, AI researchers may need better approaches to develop truly scalable self-supervised methods. We have developed a novel contrastive learning method, DirectCLR, which uses a low-rank diagonal projector. In contrast to all recent state-of-the-art self-supervised learning approaches, DirectCLR directly optimizes the representation space rather than relying on a trainable projector, and it outperforms SimCLR with a linear trainable projector on ImageNet. DirectCLR sends a subvector of the representation directly to the loss function. Even though the gradient from the loss function is low-rank, DirectCLR takes advantage of the residual connections in the ResNet backbone to build full-rank representation vectors.
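The core idea, passing only a subvector of the representation to the contrastive loss, can be sketched as follows. This is a minimal illustration of the loss computation alone, not the released DirectCLR implementation: the InfoNCE-style loss, the function names, the subvector size `d0`, and the toy dimensions are assumptions, and the ResNet backbone is replaced by random vectors.

```python
import numpy as np

def info_nce_loss(z_a, z_b, temperature=0.1):
    """Simplified one-directional InfoNCE loss on L2-normalized rows."""
    z_a = z_a / (np.linalg.norm(z_a, axis=1, keepdims=True) + 1e-12)
    z_b = z_b / (np.linalg.norm(z_b, axis=1, keepdims=True) + 1e-12)
    logits = z_a @ z_b.T / temperature
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))

def directclr_style_loss(rep_a, rep_b, d0, temperature=0.1):
    """Feed only the first d0 coordinates of each representation to the loss.
    There is no trainable projector; d0 is a hypothetical hyperparameter."""
    return info_nce_loss(rep_a[:, :d0], rep_b[:, :d0], temperature)

rng = np.random.default_rng(0)
rep_a = rng.normal(size=(8, 64))  # stand-in for backbone outputs, view A
rep_b = rep_a + 0.05 * rng.normal(size=rep_a.shape)  # view B of same images
loss = directclr_style_loss(rep_a, rep_b, d0=16)

# Changing coordinates outside the subvector leaves the loss untouched.
rep_a_changed = rep_a.copy()
rep_a_changed[:, 16:] = rng.normal(size=(8, 48))
loss_changed = directclr_style_loss(rep_a_changed, rep_b, d0=16)
print(np.isclose(loss, loss_changed))
```

In this toy version the remaining coordinates receive no signal at all. In a real ResNet, as described above, the residual connections are what let the low-rank gradient shape the full representation vector.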

In the near future, self-supervised pretraining will become standard procedure for machine learning pipelines. We must understand the fundamental limitations of these methods before deploying them in huge models and data sets. Although dimensional collapse is an obstacle to scalability for contrastive learning methods, our study shows at least one way to circumvent this problem. The more we learn about these approaches, the better our models will learn on their own.
