The Engineering Behind Convolutions


Convolution implementation

Now we have to optimize our convolution code so that it reuses data that was recently accessed, or accesses data that sits sequentially next to it in memory; in other words, so that it works with the CPU cache instead of against it. Let's first look at a traditional convolution operation:
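To make the rest of the discussion concrete, here is a minimal NumPy sketch of that direct, nested-loop form. The (H, W, C) image layout, the (K, K, C, F) filter layout, and the name conv2d_naive are illustrative assumptions, not any particular framework's API:

```python
import numpy as np

def conv2d_naive(image, filters):
    # image: (H, W, C), filters: (K, K, C, F) -- assumed layouts.
    # Slide every filter over every spatial position (stride 1, no padding)
    # and accumulate the element-wise products.
    H, W, C = image.shape
    K, _, _, F = filters.shape
    out = np.zeros((H - K + 1, W - K + 1, F))
    for f in range(F):
        for i in range(H - K + 1):
            for j in range(W - K + 1):
                window = image[i:i + K, j:j + K, :]
                out[i, j, f] = np.sum(window * filters[..., f])
    return out
```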

Let’s try to understand this by taking an example with an input image of shape (4,4, 2) and filter of shape (3,3, 2):

Data access pattern when (3,3,2) filter is placed on (4,4,2) image.

The above diagram shows how memory is accessed for the first window of the convolution operation. When we read (0,0), the CPU loads not only that element but also the next elements of the row into the cache. So it won't need to go back to main memory for (0,1) and (0,2); those elements are already in cache. But when we're at (0,2), the next element we need, (1,0), is not in cache: we get a cache miss, and the CPU stalls while the data is fetched.

Similarly, it keeps stalling at (1,2), (2,2), and so on. What im2col does is simply arrange the elements of one convolution window into one row, so they can be accessed sequentially from the cache without misses. As a result, all 18 elements required for one convolution window are rearranged as shown in the image below:

In our example, there will be four convolution windows. After converting all to im2col and stacking them, we get the following 2-D array:

input activations after im2col

As you can see, we now have 18×4 = 72 elements instead of the 32 elements in the original image. This redundancy (elements shared between overlapping windows get copied into multiple rows) is the downside of im2col, but the resulting performance difference makes it well worth it.
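Here is a minimal sketch of that rearrangement, assuming the same (H, W, C) layout as above. Each window's 18 elements end up contiguous in one row, so for our (4,4,2) example the result is a (4, 18) matrix:

```python
import numpy as np

def im2col(image, K):
    # Copy the K*K*C elements of every KxK window into one row.
    # For a (4, 4, 2) image and K = 3, this yields a (4, 18) matrix.
    H, W, C = image.shape
    rows = []
    for i in range(H - K + 1):
        for j in range(W - K + 1):
            rows.append(image[i:i + K, j:j + K, :].reshape(-1))
    return np.stack(rows)
```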

filter (18,5)

After rearranging the image, we also need to rearrange the filters (in practice, they are already stored like this). Suppose we had 5 filters of shape (3,3,2); each filter is flattened into a single column of 3×3×2 = 18 elements, and the 5 columns are stacked side by side to get an (18, 5) matrix.
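Assuming the 5 filters are stored together as a (3, 3, 2, 5) array, this rearrangement is just a reshape, and the element order within each column matches the row order produced by the im2col sketch above:

```python
import numpy as np

filters = np.random.rand(3, 3, 2, 5)        # 5 filters of shape (3, 3, 2)
filter_matrix = filters.reshape(-1, 5)      # each filter becomes an 18-element column
print(filter_matrix.shape)                  # (18, 5)
```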

At this point, we've converted the convolution into a plain 2-D matrix multiplication, which is one of the most common problems in scientific computing. Libraries such as OpenBLAS, Eigen, and Intel MKL provide highly tuned routines for GEMM (General Matrix Multiplication). These routines have been optimized over decades and are what frameworks like TensorFlow and PyTorch ultimately call into.
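In NumPy, for example, the @ operator (np.matmul) dispatches to whatever BLAS build NumPy was compiled against (OpenBLAS, MKL, ...). Continuing the sketches above, the whole convolution then collapses into a single GEMM call:

```python
image = np.random.rand(4, 4, 2)
cols = im2col(image, K=3)              # (4, 18) -- from the im2col sketch above
out_matrix = cols @ filter_matrix      # GEMM: (4, 18) @ (18, 5) -> (4, 5)
```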

Matrix Multiplication of image and filters

The matrix multiplication of the input and filter matrices gives a result of shape (4,5). This can be rearranged into a (2,2,5) array, which is exactly the output size we expect.
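Continuing the same sketch, that final step is just a reshape, and the result can be checked against the direct nested-loop version from the beginning:

```python
out = out_matrix.reshape(2, 2, 5)      # rows correspond to windows (0,0), (0,1), (1,0), (1,1)
print(np.allclose(out, conv2d_naive(image, filters)))   # True
```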