Research

Easy-to-interpret neurons may hinder learning in deep neural networks

October 28, 2020

What it is:

What does an AI model “understand,” and why? Answering this question is crucial for reproducing and improving AI systems. Unfortunately, computer scientists’ ability to interpret deep neural networks (DNNs) greatly lags behind our ability to achieve useful outcomes with them. One common set of methods for understanding DNNs focuses on the properties of individual neurons — for example, finding an individual neuron that activates for images of cats but not for other types of images.

We call this preference for a specific image type “class selectivity.”

Selectivity is widely used in part because it’s intuitive and easy to understand in human terms (i.e., these neurons are the “cat” part of the network!), and because these kinds of interpretable neurons do, in fact, naturally emerge in networks trained on a variety of different tasks. For example, DNNs trained to classify many different kinds of images contain individual neurons that activate most strongly — i.e., are selective — for Labrador retrievers. And DNNs trained to predict individual letters in product reviews contain neurons selective for positive or negative sentiment.
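
To make the idea concrete, here is a minimal PyTorch sketch of how a neuron’s class selectivity can be quantified. It uses one common formulation that compares a unit’s highest class-conditional mean activation with its mean activation over all other classes; the function name, the epsilon constant, and the assumption of post-ReLU activations are illustrative choices, not necessarily the exact formulation used in the paper.

```python
import torch

def class_selectivity_index(activations: torch.Tensor, labels: torch.Tensor,
                            num_classes: int, eps: float = 1e-7) -> torch.Tensor:
    """Per-unit class selectivity index.

    activations: (num_samples, num_units) post-ReLU activations
    labels:      (num_samples,) integer class labels
    Returns a (num_units,) tensor: ~0 means no class preference,
    ~1 means the unit responds to essentially one class only.
    Assumes every class appears at least once in `labels`.
    """
    # Class-conditional mean activation for every (class, unit) pair.
    class_means = torch.stack([
        activations[labels == c].mean(dim=0) for c in range(num_classes)
    ])                                                    # (num_classes, num_units)

    mu_max, _ = class_means.max(dim=0)                    # strongest class per unit
    # Mean activation over all remaining classes for each unit.
    mu_rest = (class_means.sum(dim=0) - mu_max) / (num_classes - 1)

    return (mu_max - mu_rest) / (mu_max + mu_rest + eps)
```

Under an index like this, a hypothetical “cat” unit that responds only to cat images would score near 1, while a unit that responds equally to every class would score near 0.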

But are such easy-to-interpret neurons actually necessary for DNNs to function? It might be like studying automobile exhaust to understand automobile propulsion; the exhaust is related to the speed of the car, but it’s not what’s propelling it. Is class selectivity part of the engine or part of the exhaust?

Surprisingly, we found strong evidence that DNNs can function well even if their neurons largely aren’t class selective. In fact, easily interpretable neurons can impair DNN function and even make networks more susceptible to randomly distorted inputs. We discovered this by developing a new technique to directly control the class selectivity of a DNN’s neurons. Our findings help demonstrate that overreliance on intuition-based methods for understanding DNNs can be misleading if these methods aren’t rigorously tested and verified. To fully understand AI systems, we must strive for methods that are not just intuitive but also empirically grounded.

What we found:

Although class selectivity has been widely examined as a tool for DNN interpretability, there’s been surprisingly little research into whether easy-to-interpret neurons are actually necessary for DNNs to function optimally. Researchers have recently begun to examine whether easily interpretable neurons are actually important for DNN function, but different studies have reported conflicting results.

We tackled this question through a new approach to manipulate class selectivity: When training a network to classify images, we not only instructed the network to improve its ability to classify images, we also added an incentive to decrease (or increase) the amount of class selectivity in its neurons.

Here we show how manipulating class selectivity across neurons in a DNN affects the DNN’s ability to correctly classify images (specifically, for ResNet18 trained on Tiny ImageNet). Each point represents a single DNN. The color of the dot represents how intensely class selectivity was encouraged or discouraged in the DNN’s neurons. The x-axis shows the mean class selectivity across neurons in a DNN, and the y-axis shows how accurately the DNN classifies images. The grey points are neutral — class selectivity is neither encouraged nor discouraged — and represent the naturally occurring level of class selectivity in this type of DNN, which we use as a baseline for comparing classification accuracy. By discouraging class selectivity (blue points), we can improve test accuracy by over 2 percent. In contrast, encouraging class selectivity (red points) causes rapid negative effects on the DNN’s ability to classify images. We zoom in on a subset of the data to better illustrate the effects of decreasing vs. increasing class selectivity.

We did this by adding a term for class selectivity to the loss function used to train the networks. We controlled the importance of class selectivity to the network using a single parameter. Changing this parameter changes whether we are encouraging or discouraging easily interpretable neurons, and to what degree. This gives us a single knob with which we can manipulate class selectivity across all the neurons in the network. We experimented with this knob and here’s what we found:
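
As a rough illustration of what this looks like in practice, here is a hedged PyTorch sketch of a selectivity term folded into a classification loss. The helper names, the sign convention, and the single knob `alpha` are assumptions for illustration rather than the paper’s exact formulation: positive `alpha` rewards selectivity, negative `alpha` penalizes it, and `alpha = 0` recovers ordinary training.

```python
import torch
import torch.nn.functional as F

def mean_selectivity(activations: torch.Tensor, labels: torch.Tensor,
                     num_classes: int, eps: float = 1e-7) -> torch.Tensor:
    """Differentiable mean class selectivity across the units of one layer.

    Same per-unit index as in the earlier sketch, averaged over units.
    Assumes every class appears at least once in the batch.
    """
    class_means = torch.stack([
        activations[labels == c].mean(dim=0) for c in range(num_classes)
    ])                                                   # (num_classes, num_units)
    mu_max, _ = class_means.max(dim=0)
    mu_rest = (class_means.sum(dim=0) - mu_max) / (num_classes - 1)
    selectivity = (mu_max - mu_rest) / (mu_max + mu_rest + eps)
    return selectivity.mean()

def regularized_loss(logits, hidden_activations, labels, num_classes, alpha):
    """Cross-entropy with a single-knob selectivity term added.

    alpha > 0 encourages class selectivity (higher selectivity lowers the
    loss), alpha < 0 discourages it, and alpha = 0 is ordinary training.
    """
    ce = F.cross_entropy(logits, labels)
    sel = mean_selectivity(hidden_activations, labels, num_classes)
    return ce - alpha * sel
```

In a full training loop, a term like this would typically be computed for each layer’s activations and averaged across layers before being combined with the classification loss; the sketch above handles a single layer.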

  • When we reduced DNNs’ class selectivity, we found it had little effect on performance and in some cases even improved performance. These results demonstrate that class selectivity is not integral to, and can sometimes even negatively affect, DNN function, despite its ubiquity across tasks and models.

  • When we increased class selectivity in DNNs, we found a significant negative effect on network performance. This second result shows that the presence of class selectivity is no guarantee that a DNN will function properly.

  • DNNs that are deployed in the real world often deal with noisier and more distorted data compared with research settings. A research DNN, for instance, would see very clear images of cats from Wikipedia, whereas in the real world, the DNN would need to process a dark, blurry image of a cat running away. We found that decreased class selectivity makes DNNs more robust against naturalistic distortions such as blur and noise. Interestingly, decreasing class selectivity also makes DNNs more vulnerable to targeted attacks in which images are intentionally manipulated in order to fool the DNN.
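
As a simple illustration of the kind of robustness test described in the last point above, the hedged sketch below measures classification accuracy when inputs are corrupted with additive Gaussian noise. The function name and the `noise_std` parameter are illustrative assumptions; the paper’s actual evaluation may use different distortions and severities.

```python
import torch

@torch.no_grad()
def accuracy_under_noise(model, data_loader, noise_std=0.1, device="cpu"):
    """Top-1 accuracy when inputs are corrupted with additive Gaussian noise.

    noise_std = 0.0 recovers clean-data accuracy; sweeping it upward traces
    out a simple robustness curve.
    """
    model.eval()
    correct, total = 0, 0
    for images, labels in data_loader:
        images, labels = images.to(device), labels.to(device)
        noisy = images + noise_std * torch.randn_like(images)   # corrupt the batch
        preds = model(noisy).argmax(dim=1)
        correct += (preds == labels).sum().item()
        total += labels.numel()
    return correct / total
```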

These results are surprising for two reasons: First, because class selectivity has been widely used for understanding DNN function, and second, because class selectivity is naturally present in most DNNs. Our findings also suggest that in the absence of class selectivity manipulation, DNNs naturally learn as much class selectivity as is possible without it having a negative impact on performance. This leads to a deeper question that we hope to answer in future work: Why do networks learn class selectivity if it’s not necessary for good performance?

Why it matters:

We hope the simplicity and utility of our class selectivity knob encourages other researchers to adopt the technique to further our collective understanding of class selectivity and its role in DNNs. It’s critical that the approaches we develop for understanding complex neural network systems are based on characteristics that are actually meaningful. If we can train DNNs that don’t have cat neurons but are unimpaired in their ability to recognize cats, then we shouldn’t try to understand DNNs by focusing on cat neurons.

As an alternative approach, AI researchers should focus more on analyzing how large groups of neurons function together. We’re also optimistic that the potential performance benefits of regularizing against class selectivity can lead to practical applications.

More broadly, our results generally caution against focusing on the properties of single neurons as the key to understanding how DNNs function. In fact, as a follow-on to these results, we examined how some widely used interpretability methods can generate misleading results. To help address such impediments, we have just released a position paper reviewing two case studies in which an overreliance on intuition led researchers astray, and discussing a framework for interpretability research centered on building concrete and, critically, falsifiable hypotheses that can be directly tested and either confirmed or refuted.

Robust, quantifiable interpretability research that tests our intuition will generate meaningful advances in our understanding of DNNs. All this work is part of Facebook’s broader efforts to further explainability in AI, including open-source interpretability tools for machine learning developers and partnerships with key platforms. Ultimately, this work will help researchers better understand how complex AI systems work and lead to more robust, reliable, and useful models.

READ THE FULL PAPERS:

https://arxiv.org/pdf/2003.01262.pdf
https://arxiv.org/pdf/2007.04440.pdf
https://arxiv.org/pdf/2010.12016.pdf

Written By

Matthew Leavitt

AI Resident

Ari Morcos

Research Scientist