ML Paper Challenge Day 33 — Deep Compression: Compressing Deep Neural Network with Pruning, Trained Quantization and Huffman Coding

Day 33: 2020.05.14
Paper: Deep Compression: Compressing Deep Neural Network with Pruning, Trained Quantization and Huffman Coding
Category: Model/Optimization

Deep Compression

Background:

  • Having the DNN locally: better privacy, less network bandwidth & real-time processing -> but the model size is too large
  • Large model -> high energy consumption (weights that do not fit in on-chip SRAM must be fetched from off-chip DRAM, which dominates the energy cost of inference)

Goal:

  • reduce the storage and energy required to run inference on such large networks so they can be deployed on mobile devices

How:

Step I — Pruning: Prune the network by removing the redundant connections, keeping only the most informative connections

Pruning also reduces network complexity and prevents over-fitting

  1. Learn the connectivity via normal network training
  2. Prune the small-weight connections: all connections with weights below a threshold are removed from the network
  3. Retrain the network to learn the final weights for the remaining sparse connections (see the sketch after this list)
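
A minimal NumPy sketch of the three pruning steps for a single layer, assuming an already-trained toy weight matrix, a hand-picked threshold, and a stand-in gradient for the retraining step (the names `prune_by_magnitude`, `W`, and `mask` are illustrative, not from the paper's code):

```python
import numpy as np

def prune_by_magnitude(W, threshold):
    """Zero out connections whose absolute weight is below the threshold
    and return a mask marking the surviving connections."""
    mask = np.abs(W) >= threshold
    return W * mask, mask

# Step 1 (assumed done): normal training produced these toy weights.
rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(4, 6))

# Step 2: remove the small-weight connections.
W_pruned, mask = prune_by_magnitude(W, threshold=0.05)

# Step 3 (sketched): retrain while masking gradients so that pruned
# connections stay at zero.
grad = rng.normal(size=W.shape)      # stand-in for a real gradient
W_pruned -= 0.01 * grad * mask       # update only the surviving weights
```

Masking the gradient during retraining is what keeps the removed connections at zero, so the sparsity pattern found in step 2 is preserved.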

How it is stored after pruning

  1. Use compressed sparse row (CSR) or compressed sparse column (CSC) format, which requires 2a + n + 1 numbers
    (a: number of non-zero elements, n: number of rows or columns)
  2. To compress further, store the index difference instead of the absolute position, and encode this difference in 8 bits for conv layers and 5 bits for fc layers. When an index difference exceeds what the fixed width can represent (the paper illustrates this with a 3-bit example and a bound of 8), use the zero-padding solution: insert a filler zero weight so that every stored difference stays within the bound (see the sketch below).
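
A minimal sketch of the relative-index encoding, assuming a flattened sparse weight vector and the paper's 3-bit example with a bound of 8 (the function and variable names here are my own):

```python
import numpy as np

def encode_relative_indices(flat_weights, max_diff):
    """Store index differences instead of absolute positions; when a gap
    exceeds max_diff, insert a filler zero so every stored difference
    fits in the fixed bit width."""
    diffs, values = [], []
    last = -1
    for idx in np.flatnonzero(flat_weights):
        gap = idx - last
        while gap > max_diff:
            diffs.append(max_diff)   # padding step
            values.append(0.0)       # filler zero weight
            gap -= max_diff
        diffs.append(gap)
        values.append(float(flat_weights[idx]))
        last = idx
    return diffs, values

# Two non-zeros 12 positions apart: the gap exceeds 8, so a filler zero appears.
w = np.zeros(20)
w[3], w[15] = 0.9, -0.4
print(encode_relative_indices(w, max_diff=8))   # ([4, 8, 4], [0.9, 0.0, -0.4])
```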

Step II — Quantization: weights are quantized so that multiple connections share the same weight; thus only the codebook (effective weights) and the indices need to be stored

  1. Quantize weights to bins.
  2. All the weights in the same bin share the same value, thus for each weight, we then need to store only a small index into a table of shared weights.
  3. Use k-means clustering to identify the shared weights for each layer of a trained network, so that all the weights that fall into the same cluster will share the same weight. Weights are not shared across layers.
  4. Initialize the centroids with linear initialization: linearly space the centroids between the [min, max] of the original weights.
  5. During the update, all gradients are grouped by bin and summed, multiplied by the learning rate, and subtracted from the shared centroids of the previous iteration (see the sketch after this list).
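
A simplified sketch of weight sharing for one layer, assuming k = 4 shared weights and a stand-in gradient. The paper runs k-means to convergence before fine-tuning; this sketch only shows linear initialization, a single nearest-centroid assignment, and one grouped gradient update of the codebook (all names are illustrative):

```python
import numpy as np

def linear_init_centroids(weights, k):
    """Linear initialization: space k centroids evenly over [min, max]."""
    return np.linspace(weights.min(), weights.max(), k)

def assign_bins(weights, centroids):
    """Assign each weight to its nearest centroid (its shared weight)."""
    return np.argmin(np.abs(weights[:, None] - centroids[None, :]), axis=1)

rng = np.random.default_rng(1)
w = rng.normal(scale=0.1, size=50)         # toy trained weights for one layer

centroids = linear_init_centroids(w, k=4)  # the codebook of effective weights
bins = assign_bins(w, centroids)           # per-weight index into the codebook

# Fine-tuning the codebook: group gradients by bin, sum them,
# scale by the learning rate, and subtract from the shared centroid.
grad = rng.normal(size=w.shape)            # stand-in for the layer's gradient
lr = 0.01
for j in range(len(centroids)):
    centroids[j] -= lr * grad[bins == j].sum()

# Only `centroids` (the codebook) and the 2-bit `bins` indices need storing.
```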

Step III — Huffman Encoding: take advantage of the biased distribution of effective weights

  • A Huffman code is an optimal prefix code commonly used for lossless data compression.
  • It uses variable-length codewords to encode source symbols.
  • More common symbols are represented with fewer bits (see the sketch below).
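
A minimal Huffman-coding sketch over a made-up, biased distribution of quantization indices; the heap-based construction is the standard algorithm, and the example counts are illustrative rather than taken from the paper:

```python
import heapq
from collections import Counter

def huffman_code(symbols):
    """Build a Huffman code: more frequent symbols receive shorter codewords."""
    freq = Counter(symbols)
    # Heap entries: (frequency, unique tie-breaker, {symbol: codeword so far}).
    heap = [(f, i, {s: ""}) for i, (s, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        f1, _, c1 = heapq.heappop(heap)
        f2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + code for s, code in c1.items()}
        merged.update({s: "1" + code for s, code in c2.items()})
        heapq.heappush(heap, (f1 + f2, tie, merged))
        tie += 1
    return heap[0][2]

# A biased index distribution: index 0 dominates, so it gets the shortest code.
indices = [0] * 50 + [1] * 20 + [2] * 5 + [3] * 2
print(huffman_code(indices))   # e.g. {0: '1', 1: '01', 2: '001', 3: '000'}
```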