Introducing few-shot neural architecture search

July 19, 2021

What the research is:

In recent years, neural architecture search (NAS) has become an exciting area of deep learning research, offering promising results in computer vision, particularly when specialized models need to be found under different resource and platform constraints (for example, on-device models in VR headsets).

One such approach, commonly known as vanilla NAS, uses search techniques to explore the search space thoroughly and evaluates new architectures by training them from scratch. But this can require thousands of GPU hours, a computing cost too high for many research applications. Researchers often turn to another approach, one-shot NAS, because it significantly lowers the computing cost by using a supernet: a single large network in which each edge contains every candidate type of edge connection (i.e., a compound edge). Once pretrained, a supernet can approximate the accuracy of any neural architecture in the search space without that architecture having to be trained from scratch.
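To make the idea of a compound edge concrete, here is a minimal PyTorch sketch. The class and candidate-op names are our own illustrative placeholders, not the paper's implementation:

```python
import torch
import torch.nn as nn

# Candidate operations on a single edge. This op set is illustrative;
# real search spaces define their own candidate lists.
CANDIDATE_OPS = {
    "conv3x3": lambda c: nn.Conv2d(c, c, kernel_size=3, padding=1),
    "conv5x5": lambda c: nn.Conv2d(c, c, kernel_size=5, padding=2),
    "maxpool": lambda c: nn.MaxPool2d(kernel_size=3, stride=1, padding=1),
    "skip":    lambda c: nn.Identity(),
}

class CompoundEdge(nn.Module):
    """One supernet edge holding every candidate operation at once.

    All candidate ops live (and train) inside the shared supernet, so
    fixing `active` to one op name evaluates an architecture's edge
    without training that architecture from scratch.
    """
    def __init__(self, channels: int):
        super().__init__()
        self.ops = nn.ModuleDict(
            {name: build(channels) for name, build in CANDIDATE_OPS.items()}
        )

    def forward(self, x: torch.Tensor, active: str) -> torch.Tensor:
        # `active` names the candidate operation this edge uses for this pass.
        return self.ops[active](x)

# Example: evaluate the same shared edge under two different operation choices.
edge = CompoundEdge(channels=16)
x = torch.randn(1, 16, 8, 8)
y1 = edge(x, "conv3x3")   # edge behaves as a 3x3 convolution
y2 = edge(x, "skip")      # edge behaves as an identity connection
```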

While one-shot NAS reduces GPU requirements, its search can be hampered by inaccurate predictions from the supernet, making it hard to identify suitable architectures.

In this work, we introduce few-shot NAS, a new approach that combines the accurate network ranking of vanilla NAS with the speed and minimal computing cost of one-shot NAS. Few-shot NAS enables any user to quickly design a powerful customized model for their tasks using just a few GPUs. We also show that it effectively designs numerous state-of-the-art models, from convolutional neural networks for image recognition to generative adversarial networks for image generation.

Compared with one-shot NAS, few-shot NAS improves performance estimation by first partitioning the search space into independent regions and then employing one sub-supernet to cover each region. It works much like the way companies are run: rather than having a single CEO run everything, dedicated experts each take charge of a different area according to their specialties, improving the company's performance. To partition the search space in a meaningful way, we leverage the structure of the original supernet: by fixing a compound edge to each of its individual edge connections in turn, we split the search space in a way that is consistent with how the supernet is constructed (Figure 1).

Figure 1: Few-shot NAS is a trade-off between the accuracy of vanilla NAS and the low computing cost of one-shot NAS.
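As an illustration of the partitioning (not the paper's code), the sketch below represents a search space as a mapping from edges to their allowed operations; splitting one compound edge then yields one disjoint sub-supernet region per candidate operation:

```python
from copy import deepcopy

# A (sub-)search space as a mapping: edge name -> candidate ops still allowed.
# The edge and op names are illustrative placeholders.
full_space = {
    "edge1": ["conv3x3", "conv5x5", "maxpool", "skip"],
    "edge2": ["conv3x3", "conv5x5", "maxpool", "skip"],
}

def split_on_edge(space, edge):
    """Partition a search space by fixing one compound edge.

    Returns one sub-space per candidate operation on `edge`. The sub-spaces
    are disjoint and together cover `space`, mirroring how few-shot NAS
    trains one sub-supernet per region instead of one supernet for all.
    """
    subs = []
    for op in space[edge]:
        sub = deepcopy(space)
        sub[edge] = [op]  # this edge is now a single, concrete operation
        subs.append(sub)
    return subs

# Splitting edge1 yields 4 sub-supernets, each covering 4 of the 16 networks.
sub_spaces = split_on_edge(full_space, "edge1")
assert len(sub_spaces) == 4
```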

How it works:

The innovation offered by few-shot NAS arises from the observation that a supernet can be regarded as a representation of the search space, and that we can enumerate every neural architecture by recursively splitting the supernet’s compound edges, as demonstrated below (Figure 2).

Figure 2: In this hierarchy of networks, the root is the supernet with compound edges, while the leaves are individual architectures in the search space, created by recursively splitting every compound edge. One-shot NAS evaluates architectures the fastest but has the least accurate evaluations, while vanilla NAS gives perfect evaluations but at a high computational cost.
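The hierarchy in Figure 2 can be sketched with the same illustrative representation used above: recursively splitting every compound edge walks from the root (the one-shot supernet) down to the leaves (individual architectures). All names here are hypothetical:

```python
from copy import deepcopy

def enumerate_architectures(space):
    """Recursively split every compound edge of a (sub-)search space.

    `space` maps each edge to the candidate ops still allowed on it.
    Edges with a single op are already fixed; splitting the remaining
    compound edges one at a time walks the hierarchy of Figure 2.
    """
    compound = [e for e, ops in space.items() if len(ops) > 1]
    if not compound:                      # leaf: every edge is a single op
        return [space]
    edge = compound[0]
    leaves = []
    for op in space[edge]:
        sub = deepcopy(space)
        sub[edge] = [op]                  # fix this edge, recurse on the rest
        leaves.extend(enumerate_architectures(sub))
    return leaves

# Two edges with four candidate ops each -> 4 ** 2 = 16 distinct architectures.
space = {"edge1": ["conv3x3", "conv5x5", "maxpool", "skip"],
         "edge2": ["conv3x3", "conv5x5", "maxpool", "skip"]}
assert len(enumerate_architectures(space)) == 16
```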

Given the benefits of using a single supernet, a natural question is whether using multiple supernets could offer the best aspects of both one-shot NAS and vanilla NAS.

To investigate this idea, we designed a search space containing 1,296 networks. First, we trained all 1,296 networks from scratch to rank them by their true accuracies on the CIFAR10 data set. We then predicted the accuracies of the 1,296 networks using 6, 36, and 216 sub-supernets and compared the predicted rankings with the true ranking. We found that the ranking improved substantially even when adding just a few sub-supernets, as demonstrated below (Figure 3).

Figure 3: With multiple sub-supernets, the predicted accuracy matches well with the ground truth accuracy (a), and the ranking prediction is improved (c), as is the final search performance (b).
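One common way to quantify how well predicted accuracies preserve a true ranking is a rank correlation such as Kendall's tau. The sketch below uses synthetic accuracies purely for illustration; the numbers are placeholders, not the paper's data:

```python
import numpy as np
from scipy.stats import kendalltau

# Synthetic stand-ins for the experiment's two rankings: ground-truth
# accuracies (networks trained from scratch) vs. supernet-predicted ones.
rng = np.random.default_rng(seed=0)
true_acc = rng.uniform(0.85, 0.95, size=1296)            # placeholder ground truth
pred_acc = true_acc + rng.normal(0.0, 0.01, size=1296)   # placeholder predictions

# Kendall's tau is 1.0 for a perfectly preserved ranking, 0 for no relation.
tau, _ = kendalltau(true_acc, pred_acc)
print(f"Kendall tau between true and predicted rankings: {tau:.3f}")
```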

Then, we tested our idea on real-world tasks and found that, compared with one-shot NAS, few-shot NAS improved the accuracy of architecture evaluations with only a small increase in evaluation cost. With at most seven sub-supernets, few-shot NAS establishes new state-of-the-art results: On ImageNet, it finds models that reach 80.5 percent top-1 accuracy at 600 MFLOPS and 77.5 percent top-1 accuracy at 238 MFLOPS; on CIFAR10, it reaches 98.72 percent top-1 accuracy without using extra data or transfer learning. On AutoGAN, few-shot NAS outperforms the previously published results by up to 20 percent. Extensive experiments show that few-shot NAS significantly improves various one-shot methods, including four gradient-based and six search-based methods, on three different tasks in NasBench-201 and NasBench1-shot-1.

Overall, our work demonstrates that few-shot NAS is a simple yet highly effective improvement over one-shot NAS’s ranking prediction, and it is widely applicable to existing NAS methods. While we demonstrate these scenarios as concrete examples, the technique we’ve developed can have broad applications, such as whenever candidate architectures need to be evaluated quickly in the search for better ones.

Why it matters:

Adapting state-of-the-art models to concrete real-world applications can be extremely challenging because of practical constraints such as CPU, memory, and power consumption. Human experts can design dedicated models for specific scenarios, but doing so is both time-consuming and cost-ineffective, hence the need for automatic search. Few-shot NAS contributes to the design of models that are both accurate and fast, and we verified it extensively in experiments on classification tasks (CIFAR10 and ImageNet) and generation tasks (AutoGAN).

Applying our few-shot approach can improve the search efficiency of various neural architecture search algorithms that employ a supernet (such as AttentiveNAS and AlphaNet) and, more generally, weight-sharing techniques. Looking forward, we hope this approach can be used in even broader scenarios.

We’d like to acknowledge the contributions of Yiyang Zhao and Tian Guo from Worcester Polytechnic Institute, and Linnan Wang and Rodrigo Fonseca from Brown University.

Get the code
Read the full paper
