Improved supernet training for efficient neural architecture search

July 14, 2021

Designing accurate, computationally efficient neural network architectures is an important but challenging part of building any high-performance machine learning system. Neural architecture search (NAS) can automate neural network design by exploring an enormous architecture space, but conventional NAS approaches are themselves typically very computationally expensive. Hundreds of candidate architectures need to be trained from scratch and then evaluated, which would take years on a single GPU and nearly a month even on thousands of GPUs.

We’ve developed two new methods, AttentiveNAS and AlphaNet, that significantly improve the accuracy of so-called supernets, which have emerged as a powerful way to make NAS more efficient. These approaches outperform existing supernet training methods, delivering state-of-the-art results on the ImageNet data set and achieving greater than 80 percent top-1 accuracy with only 444 MFLOPs.

A supernet assembles the various candidate architectures into a single, overparameterized weight-sharing network and optimizes them all simultaneously. Each candidate architecture corresponds to one subnetwork, and because the subnetworks are trained jointly within the supernet, any architecture can directly inherit its weights from the supernet for evaluation and deployment. This eliminates the huge computational cost of training or fine-tuning each architecture individually.
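
To make the weight-sharing idea concrete, here is a toy sketch (not our actual supernet implementation) in which a smaller subnetwork simply reuses a slice of a larger layer's shared weights:

    import torch
    import torch.nn.functional as F
    from torch import nn

    class SliceableLinear(nn.Linear):
        # Toy weight-sharing layer: a subnetwork that needs only `out_dim` output
        # units reuses the first `out_dim` rows of the full layer's weight matrix,
        # so no separate weights are trained or stored for that subnetwork.
        def forward(self, x, out_dim=None):
            weight, bias = self.weight, self.bias
            if out_dim is not None:
                weight, bias = weight[:out_dim], bias[:out_dim]
            return F.linear(x, weight, bias)

    layer = SliceableLinear(64, 128)
    x = torch.randn(8, 64)
    full = layer(x)               # largest subnetwork: all 128 output channels
    small = layer(x, out_dim=32)  # smaller subnetwork: inherits the first 32 rows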

Though promising, simultaneously optimizing all the supernet’s subnetworks with weight sharing is very difficult. To stabilize training, researchers often use a sandwich sampling strategy, which samples several subnetworks (the largest, the smallest, and two random ones) for each mini-batch and aggregates their gradients into a single update, as sketched below. Two questions naturally arise for improving supernet training: How should the subnetworks be sampled during training? And how should they be supervised, given that they are usually much harder to train?
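
For concreteness, one mini-batch update with the sandwich rule might look like the sketch below. The supernet interface used here (sample_max_subnet, sample_min_subnet, sample_random_subnet, and a forward pass that takes a subnetwork config) is hypothetical shorthand, not the exact API of our released code.

    def sandwich_step(supernet, images, labels, criterion, optimizer, num_random=2):
        # Sample the largest, the smallest, and a few random subnetworks,
        # accumulate their gradients, and apply one aggregated update.
        optimizer.zero_grad()
        configs = [supernet.sample_max_subnet(), supernet.sample_min_subnet()]
        configs += [supernet.sample_random_subnet() for _ in range(num_random)]
        total_loss = 0.0
        for cfg in configs:
            logits = supernet(images, cfg)  # run the subnetwork defined by cfg
            loss = criterion(logits, labels)
            loss.backward()                 # gradients accumulate across subnetworks
            total_loss += loss.item()
        optimizer.step()                    # single update to the shared weights
        return total_loss / len(configs)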

Our AttentiveNAS method uses an attentive sampling strategy to steer training toward the networks with the best accuracy-compute trade-offs, i.e., Pareto efficiency. With AlphaNet, we supervise the subnetworks with alpha-divergence, which simultaneously prevents overestimation and underestimation of the teacher model’s uncertainty. The two methods solve orthogonal problems in supernet training and can be combined to deliver more stable and performant supernet training.

As shown in the figure below, AttentiveNAS and AlphaNet deliver a significant boost in network accuracy across a wide range of compute constraints, outperforming prior NAS methods, including BigNAS, OFA, and EfficientNet.

Comparing AttentiveNAS and AlphaNet with prior art methods, including BigNAS, Once-for-all Network (OFA), and EfficientNet.

How it works:

AttentiveNAS and AlphaNet take two different approaches to improving the performance of supernets. Conventional supernet training approaches, such as BigNAS, sample the search space uniformly, so they are agnostic to the model performance Pareto front. A natural idea for improving supernet training is to pay more attention to the Pareto-optimal subnetworks that offer the best trade-offs between accuracy and computational requirements. At the same time, it may also be valuable to improve the worst-performing models, since pushing the performance limits of the worst Pareto set may lead to a better-optimized weight-sharing graph. With a tighter range between the best and worst Pareto architectures, all the trainable components (e.g., channels) make their maximum contribution to the final performance of the network.

As shown in the figure below, with AttentiveNAS we study different Pareto-aware sampling strategies to focus the subnetwork sampling on the best and worst Pareto architecture set.

Best and worst Pareto architecture set.

To focus on Pareto architectures during supernet training, AttentiveNAS decomposes subnetwork sampling into two steps. First, we sample a compute target from a prior distribution over compute budgets. Second, among subnetworks that satisfy this compute target, we sample the one with the best (or worst) estimated accuracy. While the first step is easy, selecting the best or worst subnetwork is nontrivial, as exact accuracy evaluation on a validation set can be computationally expensive. To efficiently estimate the performance of the subnetworks, we propose two algorithms, based on either the mini-batch loss or an accuracy predictor.
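
As a rough sketch of this two-step procedure (the names flops_prior, sample_with_flops, and estimate_score below are illustrative, not the released API):

    def attentive_sample(supernet, flops_prior, estimate_score, k=10, pick="best"):
        # Step 1: draw a compute target (e.g., in MFLOPs) from a prior distribution.
        target_flops = flops_prior.sample()
        # Step 2: draw k candidate subnetworks that meet the target and keep the one
        # with the best (or worst) estimated accuracy. `estimate_score` stands in for
        # either the mini-batch-loss heuristic or a trained accuracy predictor.
        candidates = [supernet.sample_with_flops(target_flops) for _ in range(k)]
        candidates.sort(key=estimate_score)  # ascending: worst first, best last
        return candidates[-1] if pick == "best" else candidates[0]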

We also compared AttentiveNAS with prior NAS methods, such as BigNAS, OFA, and EfficientNet, and found that AttentiveNAS achieves state-of-the-art trade-offs between accuracy and FLOPs when evaluated on the ImageNet data set.

AlphaNet: Improving supernet training with alpha-divergence

To supervise the subnetworks, in-place knowledge distillation (KD) is widely used: the subnetworks learn from the soft labels predicted by the supernet’s largest subnetwork, which acts as the teacher. Standard KD uses KL divergence (a statistical measure of relative entropy) to assess the discrepancy between the teacher and student networks. However, with KL divergence, the student model tends to overestimate the uncertainty of the teacher model and poorly approximates its most important mode, i.e., the teacher model’s correct prediction. In the example below, the student network on the left makes the correct prediction but underestimates the uncertainty of the teacher model. In contrast, the student network in the center overestimates the uncertainty of the teacher model and misclassifies the input. Although the second case is less desirable, as shown in the graph on the right, KL divergence (alpha = 1) penalizes the first student much more heavily.

The graph on the left shows uncertainty underestimation, while the graph in the center shows uncertainty overestimation. The graph on the right plots the corresponding α-divergences between the student model and the teacher model for those two examples. Note that KL divergence is a special case of the α-divergence, with α = 1. Uncertainty here refers to the entropy of the predictions after the network’s softmax layer.
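
For reference, one common parameterization of the α-divergence between the teacher distribution p and the student distribution q is:

    D_\alpha(p \,\|\, q) = \frac{1}{\alpha(\alpha - 1)} \Big( \sum_i p_i^{\alpha} \, q_i^{1-\alpha} - 1 \Big)

Taking the limit α → 1 recovers KL(p ‖ q), the standard KD loss, while α → 0 recovers the reverse KL(q ‖ p); varying α changes how strongly overestimation versus underestimation of the teacher’s uncertainty is penalized, which is the knob AlphaNet exploits.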

To mitigate this problem with KL divergence, our AlphaNet method uses the more general alpha-divergence. As shown in the graph on the right above, by adjusting alpha we can explicitly control how heavily overestimation and underestimation of the teacher model’s uncertainty are penalized. More specifically, during training, we evaluate the alpha-divergence loss for several values of alpha and explicitly choose the largest loss for gradient computation, as in the sketch below.
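
A minimal sketch of such an adaptive alpha-divergence distillation loss is shown below. The exact loss in the AlphaNet paper includes additional details (e.g., clipping), and the α values here are illustrative.

    import torch
    import torch.nn.functional as F

    def alpha_divergence(p, q, alpha, eps=1e-8):
        # Alpha-divergence D_alpha(p || q) between probability vectors.
        # alpha -> 1 recovers KL(p || q); alpha -> 0 recovers KL(q || p).
        if abs(alpha - 1.0) < 1e-4:
            return (p * ((p + eps).log() - (q + eps).log())).sum(dim=-1)
        if abs(alpha) < 1e-4:
            return (q * ((q + eps).log() - (p + eps).log())).sum(dim=-1)
        ratio = (p + eps).pow(alpha) * (q + eps).pow(1.0 - alpha)
        return (ratio.sum(dim=-1) - 1.0) / (alpha * (alpha - 1.0))

    def adaptive_alpha_kd_loss(student_logits, teacher_logits, alphas=(-1.0, 1.0)):
        # Teacher soft labels are treated as constants (no gradient through the teacher).
        p = F.softmax(teacher_logits.detach(), dim=-1)
        q = F.softmax(student_logits, dim=-1)
        # Evaluate the divergence for each alpha and back-propagate through the
        # largest one, so the gradient targets the current worst-case mismatch.
        losses = torch.stack([alpha_divergence(p, q, a) for a in alphas])
        return losses.max(dim=0).values.mean()

Taking the maximum over several α values means the student is always trained against whichever failure mode, overestimating or underestimating the teacher’s uncertainty, is currently worse.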

To show the benefits of alpha-divergence, we compare the performance Pareto front and the accuracy of the smallest subnetwork against a baseline trained with KL divergence, as shown in the graphs below. Alpha-divergence leads to a superior Pareto front.

A comparison between alpha-divergence-based adaptive KD and KL-divergence-based KD.

As can be seen in the first graph in the section above, the resulting network family, AlphaNet, significantly outperforms prior methods. Moreover, when we compare transfer learning accuracy with EfficientNet, AlphaNet outperforms it on all the downstream tasks while also being more computationally efficient.

This chart compares transfer learning accuracy with EfficientNet. Note that EfficientNet-B0 and B1 have 390 MFLOPs and 700 MFLOPs, respectively, while AlphaNet-A0 and A6 have 203 MFLOPs and 709 MFLOPs, respectively.

Why it matters:

Edge devices such as smartphones, tablets, and VR headsets play an ever-growing role in the daily lives of people around the world. To bring advanced computer vision and other AI systems to these devices, the research community needs to design neural networks that are not just accurate but also highly efficient. This need is likely to grow as new devices like AR glasses become more available and IoT chipsets are used in new products.

NAS provides a powerful tool for automating efficient network design, but it hasn’t been available to many researchers. Conventional NAS algorithms are computationally expensive, requiring hundreds of thousands of GPU hours, which in turn requires access to large-scale computing resources. Supernet-based NAS decouples network training from the search process and, thanks to weight sharing, can be orders of magnitude more efficient than traditional NAS techniques while achieving state-of-the-art network performance.

The proposed AttentiveNAS and AlphaNet methods improve supernet-based NAS from two complementary perspectives and can work together to further democratize NAS for researchers across the AI community and deliver more performant networks to the edge.

Read the papers and get the code:

AttentiveNAS paper (CVPR 2021): https://arxiv.org/pdf/2011.09011.pdf

AttentiveNAS GitHub: https://github.com/facebookresearch/AttentiveNAS

AlphaNet paper (ICML 2021 long talk): https://arxiv.org/pdf/2102.07954.pdf

AlphaNet GitHub: https://github.com/facebookresearch/AlphaNet

Written By

Meng Li

Research Scientist
