Deep Compression

Source: Deep Learning on Medium

In their current form, Deep Neural Networks require enormous memory to support their massive over-parameterization. Classic networks such as AlexNet and VGG-16 require around 240 MB and 552 MB, respectively. Many efforts have been made to reduce the file size of Neural Networks, generally relying on techniques such as Weight Pruning, Quantization, or SVD decompositions of weight matrices. This paper, Deep Compression, combines Pruning, Quantization, and Huffman Encoding into a three-stage pipeline that reduces the size of AlexNet by 35x and VGG-16 by 49x. This shrinks AlexNet from 240 to 6.9 MB and VGG-16 from 552 to 11.3 MB.

The pipeline consists of three stages:

Pruning, Quantization, and Huffman Encoding


Pruning describes the process of masking out certain weights in a Deep Neural Network. This is implemented as a binary mask over each layer's weight matrix, so that masked connections no longer contribute to the y = Wx + b operation. Weight pruning is not the same as simply setting certain weights to 0 once: the mask is kept in place during retraining, so pruned weights stay at zero rather than being revived by gradient updates. The pruning operation implemented in the paper masks out weights whose magnitude falls below a certain threshold. For example, with a threshold of 0.5, any weight lying in the interval [-0.5, 0.5] will be masked out.
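To make this concrete, here is a minimal NumPy sketch of magnitude-based pruning. This is an illustration under my own naming, not the paper's actual implementation; the function name, threshold, and weight values are invented.

```python
import numpy as np

def prune(W, threshold):
    """Return a pruned copy of W and the binary mask used to prune it."""
    mask = (np.abs(W) > threshold).astype(W.dtype)  # 1 = keep, 0 = prune
    return W * mask, mask

W = np.array([[0.9, -0.1, 0.4],
              [0.05, -1.2, 0.3]])
W_pruned, mask = prune(W, threshold=0.5)
# Only the weights with |w| > 0.5 survive: 0.9 and -1.2
```

During retraining, the same mask would be reapplied after every gradient update so that pruned connections cannot grow back.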

After Pruning, the weights are stored in a Compressed Sparse Row (CSR) format, so that the many zeros in the now-sparse weight matrix do not waste space. This CSR format is depicted in the image below:
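The CSR idea can be sketched in plain Python (purely illustrative; in practice a library such as scipy.sparse provides this directly, and the paper further compresses the column indices by storing differences between them rather than absolute positions):

```python
import numpy as np

def to_csr(W):
    """Convert a dense matrix into CSR components: non-zero values,
    their column indices, and row pointers into those arrays."""
    data, indices, indptr = [], [], [0]
    for row in W:
        for col, w in enumerate(row):
            if w != 0:
                data.append(w)
                indices.append(col)
        indptr.append(len(data))  # row i spans data[indptr[i]:indptr[i+1]]
    return data, indices, indptr

W = np.array([[0.9, 0.0, 0.0],
              [0.0, -1.2, 0.0]])
data, indices, indptr = to_csr(W)
# data = [0.9, -1.2], indices = [0, 1], indptr = [0, 1, 2]
```

Only the two non-zero weights are stored, instead of all six entries of the dense matrix.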


Quantization is a technique to reduce the number of bits needed to store each weight in the Neural Network through weight sharing. Weights in a Deep Neural Network are typically represented by 32-bit floats, taking the form of, say, ‘2.70381’. In Quantization, a k-Means algorithm is deployed to find clusters that describe the weights in the network. If the weights are to be represented with 3 bits, this results in 2³ = 8 centroids used to cluster the weights. Each weight is then mapped to its nearest centroid, for example ‘2.70381’ → ‘2’, and only the 3-bit index of that centroid is stored. The 8 centroids thus form the ‘Codebook’ used to map the original weights to the corresponding 3-bit weights. This Codebook is then fine-tuned during training.
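A toy sketch of this step, as a 1-D k-means over the flattened weights (the function name and iteration count are my own; following the paper, the centroids are initialized linearly over the weight range):

```python
import numpy as np

def quantize(W, bits=3, iters=20):
    """Cluster weights into 2**bits shared values; return the per-weight
    centroid indices and the codebook of centroid values."""
    flat = W.ravel()
    k = 2 ** bits
    centroids = np.linspace(flat.min(), flat.max(), k)  # linear init
    for _ in range(iters):
        # Assign each weight to its nearest centroid.
        idx = np.argmin(np.abs(flat[:, None] - centroids[None, :]), axis=1)
        # Move each non-empty centroid to the mean of its assigned weights.
        for j in range(k):
            if np.any(idx == j):
                centroids[j] = flat[idx == j].mean()
    return idx.reshape(W.shape), centroids

W = np.random.default_rng(1).normal(size=(4, 4))
indices, codebook = quantize(W)
W_quantized = codebook[indices]  # every weight is now one of 8 shared values
```

Only the 8-entry codebook (in full precision) and the 3-bit indices need to be stored.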

The Codebook is fine-tuned via a mechanism similar to classic backprop / SGD. The partial derivative of the loss with respect to each weight is computed, and these gradients are aggregated across all weights that share the same 3-bit centroid. For example, the weights assigned to the ‘2’ centroid may have the corresponding partial derivatives ‘0.2’, ‘0.1’, ‘0.2’, and ‘0.3’. These derivatives are aggregated and the centroid takes a single gradient step, so that ‘2’ is updated to become, say, ‘2.2’.
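A minimal sketch of this centroid update, assuming plain SGD with learning rate 1 (the gradient values and function name are invented for illustration; the direction and size of the step depend on the actual gradients and learning rate):

```python
import numpy as np

def update_codebook(codebook, indices, grads, lr=1.0):
    """Sum the gradients of all weights sharing each centroid and take
    one gradient-descent step per centroid."""
    updated = codebook.copy()
    for j in range(len(codebook)):
        mask = (indices == j)
        if np.any(mask):
            updated[j] -= lr * grads[mask].sum()  # aggregated SGD step
    return updated

codebook = np.array([2.0])                 # a single centroid, '2'
indices = np.array([0, 0, 0])              # three weights share it
grads = np.array([-0.1, -0.05, -0.05])     # their summed gradient is -0.2
updated = update_codebook(codebook, indices, grads)
# The centroid moves from 2.0 to roughly 2.2
```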

This process is depicted below:

Huffman Encoding

Huffman Encoding is a popular compression technique that takes advantage of skewed / biased distributions of values. For example, if 20 weights map to ‘2’, 10 weights map to ‘3’, and 3 weights map to ‘8’, it would make sense to encode 2 as ‘00’, 3 as ‘10’, and 8 as something longer like ‘1110’. Huffman Encoding is used in this case to reduce the number of bits needed to represent the quantized weights from the Codebook.
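A compact Huffman-coding sketch using only the standard library, with frequencies mirroring the example above (the exact bit-strings it produces differ from those in the text, since Huffman codes are not unique, but frequent values always get the shortest codes):

```python
import heapq

def huffman_codes(freqs):
    """Map each symbol to a Huffman bit-string, given {symbol: count}."""
    # Each heap entry: [total count, tie-breaker, {symbol: code-so-far}].
    heap = [[count, i, {sym: ''}] for i, (sym, count) in enumerate(freqs.items())]
    heapq.heapify(heap)
    tiebreak = len(heap)
    while len(heap) > 1:
        lo = heapq.heappop(heap)   # two least-frequent subtrees
        hi = heapq.heappop(heap)
        merged = {s: '0' + c for s, c in lo[2].items()}
        merged.update({s: '1' + c for s, c in hi[2].items()})
        heapq.heappush(heap, [lo[0] + hi[0], tiebreak, merged])
        tiebreak += 1
    return heap[0][2]

codes = huffman_codes({2: 20, 3: 10, 8: 3})
# The most frequent value (2) receives a 1-bit code; the rest get 2 bits.
```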

Putting It All Together

I find it very interesting how Pruning and Weight Quantization work in tandem without destroying the accuracy of the network. I hope to use this technique to improve the turnaround time of Meta-Learning algorithms through mechanisms such as Distributed Synchronous SGD. I hope that others find this useful and can find Deep Learning applications for Mobile and Embedded Systems. Please check out the video below to learn more about Deep Compression. Thanks for reading!