July 25, 2019
We're introducing a new approach that reduces the memory footprint of neural network architectures by quantizing (or discretizing) their weights, while keeping inference fast thanks to a byte-aligned scheme. This is intended to help researchers in computer vision, who continue to advance the state of the art with models performing tasks ranging from image classification to instance detection. With traditional methods, storing these high-performing neural networks and using them to perform inference generally requires more than 100 MB of memory, which prevents them from being used on embedded devices.
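A back-of-envelope calculation makes the size gap concrete. The parameter count, block size, and codebook size below are illustrative assumptions on our part, not exact figures from the method:

```python
# Rough memory footprint for a ResNet-50-sized model.
# All numbers here are illustrative assumptions, not figures from the paper.
params = 25_500_000                # approx. ResNet-50 parameter count (assumption)
fp32_mb = params * 4 / 1e6         # float32 weights: 4 bytes each
d, k = 4, 256                      # quantize blocks of d weights against k centroids
codes_mb = params / d / 1e6        # one 1-byte code per block (byte-aligned)
print(f"float32: {fp32_mb:.0f} MB, quantized codes: {codes_mb:.2f} MB")
# → float32: 102 MB, quantized codes: 6.38 MB
```

The codebooks themselves add only k × d floats per layer, which is negligible; in practice, per-layer choices of block and codebook sizes push the final size lower still.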
We’re open-sourcing the compressed models as well as the code for reproducing our results.
We rely on a popular structured quantization method called product quantization, and adapt it to focus on the reconstruction of activations rather than of the weights themselves. In other words, where previous approaches aimed to approximate the network for arbitrary inputs, we focus only on the quality of the reconstruction for in-domain inputs. Leveraging the distillation technique, we guide the compression of the student network with the uncompressed network, which serves as a teacher. Our approach is unsupervised in the sense that it does not require any labeled data.
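As a simplified illustration of product quantization itself, the sketch below splits each column of a weight matrix into small subvectors and replaces each subvector with its nearest centroid from a learned codebook, so each block of d weights is stored as a single byte. Note that this toy version reconstructs the weights directly, omitting the activation weighting and distillation described above; the function name and defaults are ours, not from the released code:

```python
import numpy as np

def product_quantize(W, d=4, k=256, iters=10):
    """Illustrative product quantization of a weight matrix W (in_features, out_features).
    Each column is split into subvectors of size d; each subvector is replaced
    by the nearest of k centroids, so one byte encodes d weights when k <= 256."""
    in_f, out_f = W.shape
    assert in_f % d == 0
    # Gather all column subvectors into a (num_subvectors, d) matrix.
    blocks = W.reshape(in_f // d, d, out_f).transpose(0, 2, 1).reshape(-1, d)
    # Plain Lloyd k-means on the subvectors (the paper's variant weights
    # this objective by in-domain activations instead).
    rng = np.random.default_rng(0)
    C = blocks[rng.choice(len(blocks), size=k, replace=False)]
    for _ in range(iters):
        dists = ((blocks[:, None, :] - C[None, :, :]) ** 2).sum(-1)
        codes = dists.argmin(1)
        for j in range(k):
            mask = codes == j
            if mask.any():
                C[j] = blocks[mask].mean(0)
    # Reconstruct the quantized weights from codes and codebook.
    W_hat = C[codes].reshape(in_f // d, out_f, d).transpose(0, 2, 1).reshape(in_f, out_f)
    return codes.astype(np.uint8), C, W_hat

# Toy usage: codes cost 1 byte per d weights vs. 4 bytes per float32 weight,
# i.e., roughly a 16x reduction for d=4, plus a small shared codebook.
W = np.random.default_rng(1).standard_normal((8, 16))
codes, C, W_hat = product_quantize(W, d=4, k=16)
```

Distillation then fine-tunes the codebooks so that the quantized network's activations match the teacher's on in-domain inputs, which is what lets aggressive compression preserve accuracy.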
We applied our method to a high-performing ResNet-50 trained by Facebook AI using semi-supervised learning. The compressed model weighs only 5 MB (a 20x compression factor) and preserves the top-1 accuracy of a vanilla ResNet-50 on ImageNet (76.1 percent). We also compressed the widely used Mask R-CNN (now available in torchvision) for instance detection and reached a Box AP/Mask AP of 33.9/30.8 at a model size of around 6 MB (a 26x compression factor).
There is a growing need to embed the best neural networks, and each application requires a particular trade-off between size and accuracy. For instance, robotics and autonomous cars require a reliable technology that precisely identifies all the instances present in a video frame in real time, which means quite large models in need of compression. Virtual reality and augmented reality devices, such as Oculus Quest, would similarly benefit from advances in compressing neural networks.
And the bit goes down: Revisiting the quantization of neural networks
Research Assistant, Facebook AI