Source: Deep Learning on Medium

We can classify objects almost at real time using deep learning models using a fairly good computer. Although Deep Learning neural networks are very powerful, the large number of weights consumes considerable storage requiring more 300MB of space. This makes it difficult to deploy models on mobiles systems. Mobile devices are battery constrained, making power hungry applications such as deep neural networks hard to deploy. Running a 1 billion connection neural network, for example, at 20fps would require(20Hz)(1G)(640pJ) = 12.8W just for DRAM access — well beyond the power envelope of a typical mobile device.

In the paper DEEP COMPRESSION: COMPRESSING DEEP NEURAL NETWORKS WITH PRUNING, TRAINED QUANTIZATION AND HUFFMAN CODING Song Han,Huizi Mao,William J. Dally presents ,deep compression, a three stage pipeline to reduce storage while preserving accuracy. First the network is pruned removing redundant connections and then weights are quantized so that multiple connections share the same weight and finally Huffman coding is applied.

#### Network pruning

The first step is to train the models using normal connectivity. Next, prune the small-weight connections by removing all weights below a threshold from the network. Finally, retrain the network to learn the final weights for the remaining sparse connections. Pruning reduced the number of parameters by 9× and 13× for AlexNet and VGG-16model.

#### Trained Quantization and weight sharing

Network quantization limit the number of effective weights needed to store, by having multiple connections share the same weight, and then fine-tune those shared weights.

Consider a layer that has 4 input neurons and 4 output neurons, the weight is a 4×4matrix. On the top left is the4×4weight matrix, and on the bottom left is the 4×4 gradient matrix. The weights are quantized to 4 bins (denoted with 4 colors),all the weights in the same bin share the same value, thus for each weight, we then need to store only a small index into a table of shared weights. During update, all the gradients are grouped by the color and summed together, multiplied by the learning rate and subtracted from the shared centroids from last iteration. For pruned AlexNet, each CONV layers can be quantized to 8-bits (256 shared weights) , and each FC layer to 5-bits (32 shared weights) without any loss of accuracy.

Given k clusters, log_2(k) bits is required to encode the index. In general, for a network with n connections and each connection is represented with b bits, constraining the connections to have only k shared weights will result in a compression rate of:

r = nb/(nlog_2(k) + kb)

In the eg above 4×4 = 16 weights originally but there are only 4 shared weights: similar weights are grouped together to share the same value. Originally we need to store 16 weights each has 32 bits, now we need to store only 4 effective weights (blue, green, red and orange), each has 32 bits, together with 16 2-bit indices giving a compression rate of16∗32/(4∗32 + 2∗16) = 3.2

Weights are normally not shared across layers.

Let’s assume we have already trained a 32-bit network and want to quantize its weight into four bits as a post-processing step to reduce its size. During forward pass, all the kernel weights will become quantized. But once that happens, they will return flat or zero gradients, which means the network isn’t learning. We can sidestep this issue during back propagation by using Straight-Through Estimator (STE), which returns the original gradients and updates the underlying float values without modifying the quantized value on top. At that point, the forward pass and backward pass are repeated.

#### Huffman Coding

For sending a model over a network, an additional stage of compression can be used to optimize for size. The paper proposes using Huffman coding. It uses variable-length code words to encode source symbols. The table is derived from the occurrence probability for each symbol. More common symbols are represented with fewer bits.