Understand the future of model compression in 6 minutes: ‘DeepThin: A Self-Compressing Library for…


Photo by Vance Osterhout on Unsplash

Note from the author: PLEASE read this alongside the actual paper itself, which can be found here. I’ve struggled with notating the math (and boy, is there a lot of it) here on Medium — I think you’ll follow along much more easily with the paper by your side. If you have any suggestions on how I could present this better, please let me know!

Deep learning is expensive stuff. Forget the hours of training time, the $8000 dowry for an NVIDIA GPU, the hours of time writing algorithms; even once you have a trained model, deploying it in production could (and probably will) cost an arm and a leg. Thus, the industry is moving towards edge computing; predicting with models using a device’s local compute resources. Naturally, one wonders, “how will Apple put a PCIe slot in their iPhones?” but alas, I really mean local compute resources.

To make this possible, model compression methods were devised — and today we’re going to look at the current SOA (state of the art) compression library: DeepThin.

I was quite shocked to find that the buzz around DeepThin has been limited, even though it’s a pretty impressive system. The intimidating looking math probably has something to do with it. Nonetheless, hopefully after a bit of a deep dive you’ll want to use it with your models. Let’s jump in!

The most popular methods for model compression are quantization, parameter pruning and low-rank factorization. Quick breakdown:

  1. Quantization compresses models by lowering the bit depth of each of the values in the weight matrices.
  2. Pruning is the process of removing “weak” neurons in the network. Each neuron is evaluated on their effect on the output. The algorithm removes neurons that have the least effect.
  3. Low-rank factorisation involves the decomposition of matrices — breaking one large matrix into smaller parts that are then multiplied together, thus reducing the number of complex computations required.

DeepThin is a compression system that uses low-rank factorisation. We’ll identify why this is “low-rank” later. Let’s take a look at the results of the system first:

Check out those relative improvement numbers!

CPU prediction on low-cache systems (realistic for mobile devices) leads to inference speed increases of between 2X — 14X, with accuracy improvements of up to 60% over previous systems. Additionally, due to matrix decomposition, DeepThin-compressed networks do not need to compute expensive hashing functions, unlike HashedNet.

Here’s the meat of it. DeepThin can be understood as an improvement over standard low-rank approximation methods. Check out the diagram below:

The rank of a matrix is defined as the maximum number of columns / rows that are linearly independent:

1 2 6

2 4 17

3 6 9

2, 4, 6 is 2X 1, 2, 3, therefore it’s dependent on it / generated from it. 6, 17, 9 is not. Therefore, this is a rank 2 matrix. Pretty simple on the surface, but there’s more to it. You can read all the linear algebra complexities here.

All layers in a deep neural network look something like this Y= (X.W+B), plus a nonlinearity applied to a linear weighting and biasing operation. The paper focuses on explaining compression in the context of the weight matrix W due to its significantly larger size (though in practise, both W and B are compressed). Rank approximation replaces W with a function of two smaller matrices, such that W_QXRX_f . W_f, where X_f and W_f are the factors of the original matrix. This is why it’s called factorisation — it’s exactly like what you learnt in middle school.

As the rank of the factors approaches 1 (essentially a row / column vector), storing and computing with the factors becomes more efficient. This is presumably because the single column / row requires fewer bits, and thus also requires fewer computations.

Just to reiterate since it’s so important: lower rank factors → more efficient computation. It’s vastly preferable to store a huge matrix as a collection of its factors, which will be combined at runtime.

The factors are used to reconstruct the weight matrix so it can be used for forward propagation / at inference time. Take note of the bottom of the above figure:

Both rows simply fit this format!

This means that the rows and columns are becoming copies of each other with scaling factors, which inevitably leads to large redundancy in the network, because all output nodes in a layer are forced to learn similar inputs (duh, the matrices are almost rank 1 or 2 when reconstructed).

This is where DeepThin comes in (I know, I know, we’ve taken a while to get here. I’ll be quick). DeepThin uses an auxiliary matrix (creatively referred to in the paper as W_aux) which is of size m x n, where W_aux=X_f . W_f. X_f is an m x r matrix and W_f is an r x n matrix. The key change is the use of a non-linear reshaping function in the reshaping of W_auxW_QXR.The nonlinearity is exactly like activation function in Y=(X.W+B). The paper touches on the idea of using random scattering to distribute the elements of W_aux, but this isn’t intuitive or practical as the scattering negates the performance and space saving gains of compression. Let’s stop, it’s h̶a̶m̶m̶e̶r̶time to look at a diagram to make all this concrete.

The ReLayout operation that you can see in the diagram is pretty simple and is described below: “Values are read from W_aux in a row-major (rows first) order and written to W_QXR in a column-major order.” This means that a single column in W_QXR may be composed of multiple rows of W_aux.

But isn’t the shape of W^f key to the result of this operation? Good question! The number of columns in W^f (n) and the number of rows in W_QXR (Q) need to be coprimes (also known as relative primes, defined as LCM(x,y) = x*y, or GCF(x,y) = 1. Say hello to the middle school math again!). The LCM of the two coprimes is actually the repeat frequency of blocks from W_aux in W_QXR. Of course, another condition is that W_aux needs have at least as many elements as W_QXR. These details are crucial for implementation — speaking of which, keep an eye out for an implementation here soon!

Here’s the last bit of intuition before we get into the bulk of the results. How do we calculate and (more importantly) configure the compression rate?

Here’s how. For a matrix of shape QXR the compression ratio is the ratio of the size of the low-rank factors against the size of the matrix itself, as can be seen below:

However, since we still have two non-fixed configurable parameters m and n, we can factorise out r from the low-rank factor sizes (hehe) and replace m with its minimum value rearranged equivalent from the inequality we arrived at earlier:

m X n >= Q X R

Cool. Now, in order to change the compression ratio, all we need to do is tune n to optimise compression.

This paper did have a pretty disappointing element though. There’s a relatively detailed explanation of how DeepThin was implemented in Tensorflow, but no link to a Git repo. They’re being massive teases! They repeatedly refer to their library, but never show us the code itself. Although, there’s more to a network than its code, and the coverage of implementation details and intuition is really helpful for people who want to understand DeepThin. If you do want to see an actual implementation though, do keep an eye out — a little birdie told me that an implementation of DeepThin will be available on Github in the coming months. Regardless, this is a great step forward for the model shrinking space, and I think it’ll take off as soon as people can begin to use it in their products.

P.S: I’ve been resisting using this line throughout the article, but here it comes: it would appear that size does matter for DeepThin. ( ͡° ͜ʖ ͡°).

Source: Deep Learning on Medium