Model Quantization for Production-Level Neural Network Inference

Source: Deep Learning on Medium

Authors: Patric Zhao, Xinyu Chen, Zhennan Qin, Jason Ye


In deep learning, inference is the stage where a pretrained neural network model is deployed to perform image classification, object detection, and other prediction tasks. In the real world, and especially in enterprises, inference matters because it is the stage of the analytics pipeline where valuable results are delivered to end users based on their production-level data. A huge number of inference requests from end users is constantly being routed to cloud servers all over the world.

A major measure of inference performance is latency, that is, how long it takes to complete a prediction; shorter latency ensures a good user experience. Single-batch inference (batch size 1) is very common in production, and this kind of workload is CPU friendly.

When deploying deep learning infrastructure in real production environments, high-performance and cost-efficient services are key. Therefore, many cloud service providers (CSPs) and hardware vendors have optimized their services and architectures for inference. Examples include Amazon SageMaker and Deep Learning AMIs from Amazon Web Services (AWS), and Intel® Deep Learning Boost (Intel® DL Boost), which includes the Vector Neural Network Instructions (VNNI) found in 2nd Generation Intel® Xeon® Scalable processors.

The Apache MXNet* community has delivered quantization approaches that improve performance and reduce the deployment cost of inference. Lower precision (INT8) has two main benefits. First, computation can be accelerated by lower-precision instructions such as VNNI. Second, lower-precision data types save memory bandwidth and allow for better cache locality and power savings.
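To make the first benefit concrete, here is a minimal pure-Python sketch of an INT8 dot product: unsigned 8-bit activations are multiplied by signed 8-bit weights and accumulated as integers, which is the pattern that VNNI-style instructions accelerate in hardware. The data, the `quantize` helper, and the variable names are illustrative assumptions, not MXNet code. Each quantized value also occupies 1 byte instead of the 4 bytes of a float32, which is where the bandwidth saving comes from.

```python
def quantize(values, qmax):
    """Symmetric quantization: scale so the largest magnitude maps to qmax."""
    threshold = max(abs(v) for v in values)
    scale = qmax / threshold
    return [round(v * scale) for v in values], scale

# Illustrative data (assumption): non-negative activations and signed weights.
activations = [(i % 10) / 10 for i in range(64)]         # fits unsigned 8-bit
weights = [((i * 5) % 21 - 10) / 10 for i in range(64)]  # needs signed 8-bit

# Reference result in floating point.
fp32_result = sum(a * w for a, w in zip(activations, weights))

q_act, act_scale = quantize(activations, 255)  # u8 range [0, 255]
q_wgt, wgt_scale = quantize(weights, 127)      # s8 range [-127, 127]

# Integer-only multiply-accumulate; the accumulator is wider than 8 bits.
int_acc = sum(a * w for a, w in zip(q_act, q_wgt))

# A single rescale at the end brings the result back to the float domain.
int8_result = int_acc / (act_scale * wgt_scale)

print(f"fp32 result: {fp32_result:.4f}")
print(f"int8 result: {int8_result:.4f}")
```

The two printed values should agree closely; the remaining gap is the rounding error introduced by quantization.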

The new quantization approach, along with operator fusion, delivers up to a 3.7x speedup on current AWS* EC2 CPU instances (Fig. 6, MobileNet V1, batch size 64) with less than 0.5% accuracy drop, and is expected to reach even higher throughput on Intel DL Boost-enabled hardware.

Model Quantization

Apache MXNet supports model quantization from float32 to signed INT8 (s8) or unsigned INT8 (u8). s8 is designed for general inference, while u8 is specific to CNNs. Most CNNs use ReLU as the activation function, so their output activations are non-negative. The benefit of u8 follows directly: no bit is spent on the sign, so one more bit is available for the data, which gives better accuracy.
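The extra bit can be illustrated with a small pure-Python sketch that quantizes the same non-negative (ReLU-style) values once as s8 and once as u8, then compares the round-trip error. The helper names and the sample data are assumptions made for this illustration, not part of MXNet.

```python
def quantize_s8(values, threshold):
    """Map floats in [-threshold, threshold] to signed integers in [-127, 127]."""
    scale = 127.0 / threshold
    return [max(-127, min(127, round(v * scale))) for v in values], scale

def quantize_u8(values, threshold):
    """Map non-negative floats in [0, threshold] to unsigned integers in [0, 255]."""
    scale = 255.0 / threshold
    return [max(0, min(255, round(v * scale))) for v in values], scale

def max_roundtrip_error(values, quantized, scale):
    """Largest absolute difference after dequantizing back to float."""
    return max(abs(v - q / scale) for v, q in zip(values, quantized))

# ReLU-style activations (assumption): evenly spaced values from 0.00 to 6.00.
activations = [i * 0.01 for i in range(601)]
threshold = max(activations)

s8, s8_scale = quantize_s8(activations, threshold)
u8, u8_scale = quantize_u8(activations, threshold)

s8_err = max_roundtrip_error(activations, s8, s8_scale)
u8_err = max_roundtrip_error(activations, u8, u8_scale)

# s8 never uses its negative half here, so only 128 of its levels are useful.
print(f"s8 levels used: {len(set(s8))}, max error: {s8_err:.5f}")
print(f"u8 levels used: {len(set(u8))}, max error: {u8_err:.5f}")
```

With non-negative inputs, u8 spreads the same range over roughly twice as many levels, so its worst-case rounding error is about half that of s8.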

The INT8 inference pipeline starts from a trained FP32 model, that is, a saved model definition (JSON file) and its parameters, and consists of two stages.

Fig 1. MXNet INT8 inference pipeline
  • Quantization with calibration (offline stage). During this stage, a small fraction (1–5%) of the images in the validation dataset is used to collect statistics, either naive min/max values or optimal thresholds based on entropy theory. These statistics define the scaling factors, using symmetric quantization, and the execution profile of each layer. The output of this stage is a calibrated model with quantized operators, saved as a JSON file and a parameter file.
  • INT8 inference (run-time stage). The quantized and calibrated model is a pair of a JSON file and a parameter file that can be loaded and used for inference just like the original model, only with higher speed and a small difference in accuracy.
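The naive min/max variant of the offline stage can be sketched in a few lines of pure Python. This is only an illustration under stated assumptions: the layer names are hypothetical, random numbers stand in for real layer outputs, and the entropy-based threshold search is not shown.

```python
import random

def collect_min_max(batches):
    """Track the running min/max of every layer's output across calibration batches.

    `batches` is an iterable of dicts mapping a layer name to a flat list of
    float activations (a stand-in for real layer outputs).
    """
    stats = {}
    for batch in batches:
        for layer, values in batch.items():
            lo, hi = min(values), max(values)
            if layer in stats:
                prev_lo, prev_hi = stats[layer]
                lo, hi = min(lo, prev_lo), max(hi, prev_hi)
            stats[layer] = (lo, hi)
    return stats

def symmetric_scales(stats, qmax=127):
    """One scale per layer so that [-threshold, threshold] maps to [-qmax, qmax]."""
    scales = {}
    for layer, (lo, hi) in stats.items():
        threshold = max(abs(lo), abs(hi))
        scales[layer] = qmax / threshold if threshold > 0 else 1.0
    return scales

random.seed(0)
# Hypothetical calibration data: five batches, 32 values per layer per batch.
calib_batches = [
    {
        "conv0_output": [random.gauss(0.0, 1.0) for _ in range(32)],
        "fc1_output": [random.gauss(0.0, 4.0) for _ in range(32)],
    }
    for _ in range(5)
]

layer_stats = collect_min_max(calib_batches)
layer_scales = symmetric_scales(layer_stats)

for layer, (lo, hi) in layer_stats.items():
    print(f"{layer}: min={lo:.3f} max={hi:.3f} scale={layer_scales[layer]:.3f}")
```

Naive calibration is sensitive to outliers, since a single extreme value widens the threshold for the whole layer; that sensitivity is the motivation for the entropy-based alternative mentioned above.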


Apache MXNet provides many advanced features to accelerate quantized inference, including a quantized data loader, offline calibration, and graph optimization. Apache MXNet is one of the first deep learning frameworks to deliver a fully quantized INT8 network, from data loading to compute-intensive operations, with production-level quality. In the quantized network, common computation patterns such as convolution + relu are fused by a graph optimizer, so the whole quantized network is more compact and efficient than the original one. As an example, the ResNet 50 v1 figure below shows the network before and after optimization and quantization.

Fig 2. ResNet50 V1 Architecture (Left: FP32 Right: INT8)
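The convolution + relu fusion mentioned above can be pictured with a toy graph rewrite. This sketch treats a network as a flat list of operator names and is only meant to show the idea; the operator names and the `fuse_conv_relu` helper are hypothetical, and a real graph optimizer works on a full computation graph rather than a list.

```python
def fuse_conv_relu(ops):
    """Replace every adjacent ("conv", "relu") pair with a single "conv_relu" op."""
    fused = []
    i = 0
    while i < len(ops):
        if ops[i] == "conv" and i + 1 < len(ops) and ops[i + 1] == "relu":
            fused.append("conv_relu")  # one fused operator instead of two
            i += 2
        else:
            fused.append(ops[i])
            i += 1
    return fused

# Hypothetical linear network, written as a sequence of operator names.
graph = ["conv", "relu", "pool", "conv", "relu", "conv", "fc"]
fused_graph = fuse_conv_relu(graph)

print(fused_graph)  # ['conv_relu', 'pool', 'conv_relu', 'conv', 'fc']
```

Fewer operators means fewer intermediate tensors written to and read back from memory, which is one reason the fused network is more compact.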

All of these features are transparent to the user when they deploy models on different hardware. In other words, end users don’t need to alter their production code and can get a performance improvement when they switch to a new AWS EC2 instance, such as Intel® DL Boost-enabled instances.

Fig 3. Intel® Deep Learning Boost

Deploy Your Models

Calibration tools and APIs are available so that customers can easily quantize their float32 models to INT8 models. Apache MXNet also officially provides two quantization examples: one for image classification and one for object detection (SSD-VGG16). Users can refer to the quantization APIs to integrate quantization into their real-world workloads.

The latest MXNet release, 1.4.0, supports unsigned quantization. Newer quantization features, such as signed quantization and quantized fully connected layers, are available from the nightly build or the master branch of the MXNet GitHub repo.

Below, SSD-VGG16 is used as an example to show the implementation and results of MXNet model quantization.


Use the following command to install MXNet with Intel® Math Kernel Library for Deep Neural Networks (Intel® MKL-DNN) support. The --pre flag tells pip to include pre-release builds, which carry the newest quantization features.

pip install --pre mxnet-mkl

Follow the Training instructions to train an FP32 SSD-VGG16_reduced_300x300 model based on the Pascal VOC dataset. You can also download our SSD-VGG16 pre-trained model and packed binary data. Create model and data directories if they do not exist, extract the zip files, then rename the uncompressed files as follows.


Then use the following command to verify the float32 pretrained model:

python --cpu --num-batch 10 --batch-size 224 --deploy --prefix=./model/ssd_


MXNet provides a calibration script for SSD-VGG16. Users can set different configurations to quantize float32 SSD-VGG16 models to INT8 models: batch size, number of calibration batches, calibration mode, target quantized data type for the input data, excluded layers, and other data loader settings. Use the following command for quantization. By default, this script uses five batches (32 samples per batch) for naive calibration.


After quantization, the INT8 models are saved in the model directory as follows.


Deploy INT8 Inference

Use the following command to launch INT8 inference.

python --cpu --num-batch 10 --batch-size 224 --deploy --prefix=./model/cqssd_

Detection Visualization

Pick one image from the Pascal VOC2007 validation dataset; the detection results should look like the figures below. The first image shows the detection result from float32 inference and the second shows the detection result from INT8 inference.

Use the following command to visualize detection.

# Download demo image
python data/demo/
# visualize float32 detection
python --cpu --network vgg16_reduced --data-shape 300 --deploy --prefix=./model/ssd_
# visualize int8 detection
python --cpu --network vgg16_reduced --data-shape 300 --deploy --prefix=./model/cqssd_
Fig 4.1. SSD-VGG Detection, FP32
Fig 4.2. SSD-VGG Detection, INT8


In this section, we show more networks and their INT8 performance. The CPU performance below was measured on an AWS EC2 c5.18xlarge instance with 36 Intel® Xeon® Platinum 8124M CPU cores. See the complete configuration details under Notices and Disclaimers.

For latency, measured at batch size 1, lower is better. The ResNet-50 and MobileNet V1 networks each complete a prediction in less than 7 ms. The edge-oriented MobileNet V1 model does especially well, with a latency of 2.03 ms.

The quantization approach improved throughput by 1.96x to 3.72x for the selected models, while the MXNet quantization flow kept the reduction in accuracy small (less than 0.5%, as shown in Fig. 7).

Fig 5. MXNet* Fusion and Quantization Latency
Fig 6. MXNet* Fusion and Quantization Speedup
Fig 7. MXNet* Fusion and Quantization Accuracy
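For readers who want to take the same kind of measurements on their own models, here is a minimal timing sketch. It relates latency (time per call) to throughput (samples per second) at batch size 1 and batch size 64. The `fake_predict` function is a hypothetical stand-in for a real forward pass, and the warm-up and run counts are arbitrary choices, not the settings behind the figures above.

```python
import statistics
import time

def fake_predict(batch):
    """Hypothetical stand-in for a model's forward pass."""
    return [sum(sample) for sample in batch]

def benchmark(predict, batch, warmup=3, runs=20):
    """Return (median latency in ms, throughput in samples per second)."""
    for _ in range(warmup):  # warm-up calls are not timed
        predict(batch)
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        predict(batch)
        timings.append(time.perf_counter() - start)
    latency_s = statistics.median(timings)
    throughput = len(batch) / latency_s if latency_s > 0 else float("inf")
    return latency_s * 1000.0, throughput

sample = [0.5] * 1000  # one fake input sample

# Batch size 1 is the latency-oriented case; batch size 64 targets throughput.
single_latency_ms, single_throughput = benchmark(fake_predict, [sample])
batch_latency_ms, batch_throughput = benchmark(fake_predict, [sample] * 64)

print(f"BS=1:  {single_latency_ms:.3f} ms/call, {single_throughput:.0f} samples/s")
print(f"BS=64: {batch_latency_ms:.3f} ms/call, {batch_throughput:.0f} samples/s")
```

A speedup such as those in Fig. 6 is then the ratio of the INT8 throughput to the FP32 throughput measured under the same batch size and hardware.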


  • Apache MXNet accelerates inference performance using model quantization powered by the Intel MKL-DNN library and Intel Xeon Scalable CPUs.
  • INT8 inference shows great performance improvements for CNN networks, from image classification to object detection.
  • The accuracy of quantized INT8 models is very close to that of FP32 models, normally within 0.5%.
  • Advanced optimizations, such as offline calibration and graph optimization, provide extra performance speedup.
  • 2nd Gen Intel Xeon Scalable processors further boost model performance through Intel DL Boost and its new VNNI instruction set, with no changes required from the user.


Thanks to the Apache community and the Amazon MXNet team for their great support.

We received a lot of help from Mu Li, Jun Wu, Da Zheng, Ziheng Jiang, Sheng Zha, Anirudh Subramanian, Kim Sukwon, Haibin Lin, Emily Hutson and Emily Backus.

Also thanks to the customers of Apache MXNet for providing great feedback.


Notices and Disclaimers

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors.

Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more complete information visit

Performance results are based on testing as of 31st March 2019 by AWS and may not reflect all publicly available security updates. No product or component can be absolutely secure.

Test Configuration:

Reproduce Script:

Software: Apache MXNet

Hardware: AWS EC2 c5.18xlarge instance on Intel® Xeon® Platinum 8124M CPU @ 3.00GHz with dual sockets, 36 physical cores

Intel, the Intel logo, and Intel Xeon are trademarks of Intel Corporation or its subsidiaries in the U.S. and/or other countries. *Other names and brands may be claimed as the property of others. © Intel Corporation