Training with quantization noise for extreme model compression

April 23, 2020

What it is:

Quant-Noise is a new technique that enables extreme compression of models while preserving high performance when they are deployed in practical applications. We apply it to computer vision (CV) and natural language processing (NLP) models, and it works with a variety of quantization methods, such as int4, int8, and iterative product quantization (iPQ). When employed with iPQ, Quant-Noise sets a new state of the art for combining high accuracy and small model size on NLP and CV benchmark tasks.

Our method delivers performance that nearly matches that of the original uncompressed models while reducing the memory footprint by 10x to 20x. This significantly exceeds the 4x compression with int8 currently available in both PyTorch and TensorFlow. In use cases where greater performance trade-offs are acceptable, Quant-Noise can shrink models even further, by more than 50x. Quant-Noise changes model training only by adding regularization noise similar to dropout, with no impact on either the convergence rate or training speed.

We have open-sourced our code so other researchers can reproduce our results and use Quant-Noise in their work.

How it works:

Quantization shrinks a neural network’s memory footprint and can speed up inference. However, directly applying quantization to a trained model can significantly harm performance, because the model was never trained with quantized weights. To overcome this problem, Quant-Noise mimics the effect of quantization during training.

During the forward pass at training time, Quant-Noise randomly selects a subset of the weights and applies simulated quantization noise to them. This makes the model resilient to quantization and enables large compression ratios without much loss in accuracy.

This graphic shows how we apply quantization noise to a subset of weights during training in order to improve performance of the quantized model.

Unlike previous approaches to quantization-aware training, Quant-Noise is applied to only a subset of the weights. The advantage is that unbiased gradients still flow through the weights that are unaffected by the noise.
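To make the idea concrete, here is a minimal sketch in plain Python of applying simulated quantization to a random subset of weights during a forward pass. It is an illustration under stated assumptions, not the released implementation: the function names `fake_quantize_int8` and `quant_noise` and the noise rate `p` are hypothetical, the quantizer is a simple min/max uniform int8 scheme, and weights are selected one at a time. Plain Python has no automatic differentiation, so the sketch shows only which weights are perturbed, not how gradients are computed.

```python
import random


def fake_quantize_int8(weights):
    """Simulate uniform int8 quantization: snap each weight to one of 256 levels.

    Assumption: a simple min/max scheme over the whole list of weights.
    """
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / 255 or 1.0  # fall back to 1.0 if all weights are equal
    return [round((w - lo) / scale) * scale + lo for w in weights]


def quant_noise(weights, p, rng, training=True):
    """Return weights with a random fraction p replaced by quantized values.

    During training, each weight is independently replaced by its quantized
    value with probability p; the rest are left untouched. Outside training,
    every weight is quantized, which is the setting the model is prepared for.
    """
    quantized = fake_quantize_int8(weights)
    if not training:
        return quantized
    return [q if rng.random() < p else w for w, q in zip(weights, quantized)]


if __name__ == "__main__":
    rng = random.Random(0)
    weights = [rng.uniform(-1.0, 1.0) for _ in range(8)]
    noisy = quant_noise(weights, p=0.5, rng=rng)
    for w, n in zip(weights, noisy):
        tag = "quantized" if w != n else "unchanged"
        print(f"{w:+.6f} -> {n:+.6f} ({tag})")
```

In this sketch, `p=0.0` leaves every weight untouched, as in ordinary training, while `p=1.0` quantizes every weight on each forward pass. Intermediate values of `p` leave some weights unperturbed, which is the subset behavior described above.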

We’ve demonstrated that this framework compresses the state-of-the-art EfficientNet-B3 model from about 50 MB to 3.3 MB while achieving 80 percent top-1 accuracy on ImageNet, compared with 81.7 percent for the uncompressed model.

Likewise, we used Quant-Noise to compress Facebook AI’s state-of-the-art RoBERTa Base model from 480 MB to 14 MB while achieving 82.5 percent on MNLI, compared with 84.8 percent for the original model.
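The compression ratios and accuracy gaps implied by these figures can be worked out directly. The short sketch below uses only the numbers reported above; the `REPORTED` table and the `summarize` helper are names introduced here for illustration, and the EfficientNet-B3 baseline size is the approximate 50 MB quoted in the text.

```python
# Reported figures: (original MB, compressed MB, original score, compressed score)
REPORTED = {
    "EfficientNet-B3 (ImageNet top-1)": (50.0, 3.3, 81.7, 80.0),
    "RoBERTa Base (MNLI)": (480.0, 14.0, 84.8, 82.5),
}


def summarize(original_mb, compressed_mb, original_score, compressed_score):
    """Return (compression ratio, score drop in points), rounded to one decimal."""
    ratio = original_mb / compressed_mb
    drop = original_score - compressed_score
    return round(ratio, 1), round(drop, 1)


if __name__ == "__main__":
    for name, row in REPORTED.items():
        ratio, drop = summarize(*row)
        print(f"{name}: {ratio}x smaller, {drop} points lower")
```

Under these numbers, EfficientNet-B3 shrinks by roughly 15x for a drop of about 1.7 points, and RoBERTa Base shrinks by roughly 34x for a drop of about 2.3 points.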

Why it matters:

State-of-the-art models keep getting bigger. For example, each layer of the Transformers now widely used in NLP tasks contains millions of parameters. By shrinking these models without significantly degrading performance, Quant-Noise can help bring cutting-edge AI to smartphones, tablets, and even IoT chipsets, with everything running entirely on-device to avoid disruptive errors and lag. This will enable devices used by many millions of people around the world to run new virtual and augmented reality experiences, more intelligent assistants, and other new products and experiences.

Furthermore, because it works with many quantization methods and model types, Quant-Noise can be applied in a wide variety of practical use cases. It can also be used to further shrink models that are already compact, such as EfficientNet. Facebook AI is currently exploring ways to use Quant-Noise in a variety of on-device AI applications, and we look forward to seeing how others leverage the framework in their own work.