ML Paper Challenge Day 33 — Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding
Day 33: 2020.05.14
Paper: Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding
- Running DNNs locally on device gives better privacy, uses less network bandwidth and enables real-time processing -> but model sizes are too large to deploy
- Large models also mean high energy consumption, dominated by memory access
- Goal: reduce the storage and energy required to run inference on such large networks so they can be deployed on mobile devices
Step I — Pruning: Prune the network by removing the redundant connections, keeping only the most informative connections
This also reduces network complexity and prevents over-fitting
- Learn the connectivity via normal network training
- Prune the small-weight connections: all connections with weights below a threshold are removed from the network
- Retrain the network to learn the final weights for the remaining sparse connections
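As a minimal sketch (not the authors' code), the magnitude pruning and masked retraining steps could look like this in NumPy; the threshold value and layer shape are illustrative:

```python
import numpy as np

def prune_by_magnitude(weights, threshold):
    """Zero out connections whose absolute weight falls below the threshold.

    Returns the pruned weights and a binary mask that keeps the removed
    connections at zero during retraining.
    """
    mask = (np.abs(weights) >= threshold).astype(weights.dtype)
    return weights * mask, mask

# Illustrative usage: prune a trained fully-connected layer, then retrain.
rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(256, 512))        # weights from normal training
W_pruned, mask = prune_by_magnitude(W, threshold=0.05)

# During retraining, gradients of pruned weights are masked so the
# removed connections stay at zero:
grad = rng.normal(size=W.shape)                    # stand-in for a real gradient
lr = 0.01
W_pruned -= lr * grad * mask
```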
How the pruned sparse network is stored
- Use compressed sparse row (CSR) or compressed sparse column (CSC) format, which requires 2a + n + 1 numbers
(a: number of non-zero elements, n: number of rows or columns)
- To compress further, store the index difference instead of the absolute position, and encode this difference in 8 bits for conv layers and 5 bits for fc layers. When an index difference exceeds what the bit-width can represent, use a zero-padding solution: for example, with 3-bit encoding, when the difference exceeds 8 (the largest representable span), a filler zero entry is added so the remaining difference stays within bounds.
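A rough sketch of this relative-index encoding with padding zeros (the helper name and the 3-bit example are illustrative, following the paper's figure rather than its actual code):

```python
def encode_relative_indices(abs_indices, values, bits=3):
    """Convert absolute positions of non-zero weights into index differences.

    If a gap exceeds the largest span representable with `bits` bits
    (taken as 2**bits here, as in the paper's 3-bit example), insert a
    filler entry with weight 0 so every stored difference fits.
    """
    max_span = 2 ** bits
    diffs, out_values = [], []
    prev = 0
    for idx, val in zip(abs_indices, values):
        gap = idx - prev
        while gap > max_span:
            diffs.append(max_span)      # filler step of the maximum span
            out_values.append(0.0)      # padding zero weight
            gap -= max_span
        diffs.append(gap)
        out_values.append(val)
        prev = idx
    return diffs, out_values

# Example: non-zeros at positions 1 and 15 with 3-bit diffs (max span 8)
print(encode_relative_indices([1, 15], [0.7, -0.2]))
# -> ([1, 8, 6], [0.7, 0.0, -0.2])  a filler zero is inserted at position 9
```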
Step II — Quantization: weights are quantized so that multiple connections share the same weight; only the codebook (effective weights) and the indices need to be stored
- Quantize weights to bins.
- All the weights in the same bin share the same value, thus for each weight, we then need to store only a small index into a table of shared weights.
- Use k-means clustering to identify the shared weights for each layer of a trained network, so that all the weights that fall into the same cluster will share the same weight. Weights are not shared across layers.
- Initialize the centroids with linear initialization: the centroids are linearly spaced between the [min, max] of the original weights.
- During fine-tuning, the gradients are grouped by bin and summed together, multiplied by the learning rate, and subtracted from the shared centroids of the last iteration.
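A minimal sketch of the weight-sharing step for a single layer, assuming scikit-learn's KMeans with linearly spaced initial centroids (the bit-width, layer size, and helper names are illustrative, not from the paper's code):

```python
import numpy as np
from sklearn.cluster import KMeans

def quantize_layer(weights, bits=2):
    """Cluster a layer's weights into 2**bits shared values (the codebook).

    Returns the codebook (centroids) and, for every weight, the index of
    the centroid it maps to. Only these two need to be stored.
    """
    k = 2 ** bits
    flat = weights.reshape(-1, 1)
    # Linear initialization: centroids evenly spaced in [min, max] of the weights.
    init = np.linspace(flat.min(), flat.max(), k).reshape(-1, 1)
    km = KMeans(n_clusters=k, init=init, n_init=1).fit(flat)
    codebook = km.cluster_centers_.ravel()
    indices = km.labels_.reshape(weights.shape)
    return codebook, indices

def update_codebook(codebook, indices, grad, lr=0.01):
    """Fine-tuning step: group gradients by bin, sum them, and apply to the centroids."""
    for i in range(len(codebook)):
        codebook[i] -= lr * grad[indices == i].sum()
    return codebook

# Illustrative usage on a small random layer
rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(16, 16))
codebook, indices = quantize_layer(W, bits=2)
W_quantized = codebook[indices]        # reconstructed layer using shared weights
```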
Step III — Huffman Coding: take advantage of the biased distribution of the effective weights and of the sparse index differences
- A Huffman code is an optimal prefix code commonly used for lossless data compression.
- It uses variable-length codewords to encode source symbols.
- More common symbols are represented with fewer bits.
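A compact sketch of Huffman coding applied to the quantized weight indices, using Python's heapq (the index distribution below is made up to show how a biased distribution compresses well):

```python
import heapq
from collections import Counter

def huffman_code(symbols):
    """Build an optimal prefix code: frequent symbols get shorter codewords."""
    freq = Counter(symbols)
    # Each heap entry: (frequency, tiebreak counter, {symbol: codeword-so-far})
    heap = [(f, i, {s: ""}) for i, (s, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    tiebreak = len(heap)
    while len(heap) > 1:
        f1, _, c1 = heapq.heappop(heap)
        f2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + code for s, code in c1.items()}
        merged.update({s: "1" + code for s, code in c2.items()})
        heapq.heappush(heap, (f1 + f2, tiebreak, merged))
        tiebreak += 1
    return heap[0][2]

# Example: a biased distribution of quantized indices.
indices = [3] * 50 + [1] * 25 + [0] * 15 + [2] * 10
code = huffman_code(indices)
total_bits = sum(len(code[s]) for s in indices)   # vs. 2 bits/index fixed-length
print(code, total_bits)
```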