Original article can be found here (source): Deep Learning on Medium
Now we have to optimize our convolution code in such a way that either the data recently used will be used again, or the data sequentially near that data will be used. Let’s first look at a traditional convolution operation:
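The article doesn't include code, but the direct (naive) convolution it refers to can be sketched in NumPy roughly like this (function and variable names are mine; stride 1 and no padding, matching the example that follows):

```python
import numpy as np

def conv2d_naive(image, filters):
    """Direct convolution: slide each filter over the image.

    image:   (H, W, C)        e.g. (4, 4, 2)
    filters: (kH, kW, C, F)   e.g. (3, 3, 2, 5)
    returns: (H-kH+1, W-kW+1, F)  -- stride 1, no padding
    """
    H, W, C = image.shape
    kH, kW, _, F = filters.shape
    out = np.zeros((H - kH + 1, W - kW + 1, F))
    for i in range(H - kH + 1):            # output row
        for j in range(W - kW + 1):        # output column
            window = image[i:i+kH, j:j+kW, :]   # one (kH, kW, C) window
            for f in range(F):
                out[i, j, f] = np.sum(window * filters[..., f])
    return out
```

Each output element is a dot product between one sliding window and one filter; the scattered memory accesses inside `window` are exactly what the cache discussion below is about.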
Let’s try to understand this by taking an example with an input image of shape (4,4, 2) and filter of shape (3,3, 2):
The above diagram shows how memory is accessed for the first window of the convolution operation. When we read (0,0), the CPU loads not just that element but the next few elements of the row into the cache as well (a cache line). So it won't need to go to main memory for (0,1) and (0,2); they are already in cache. But when we reach (0,2), the next element we need [i.e. (1,0)] is not in cache: we get a cache miss, and the CPU stalls while the data is fetched.
Similarly, it keeps stalling at (1,2), (2,2), and so on. What im2col does is simply arrange the elements of one convolution window in one row, so they can be read sequentially from the cache without misses. As a result, all 18 elements of a window (3×3×2) are laid out contiguously, as shown in the image below:
In our example, there will be four convolution windows. After converting all to im2col and stacking them, we get the following 2-D array:
As you can see, we now have 18×4 = 72 elements instead of the 32 (4×4×2) in the original image. This redundancy is the downside of im2col, but the performance gained from sequential memory access makes it well worth it.
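The rearrangement described above can be sketched as a short NumPy function (a sketch of the general idea, not the article's code; names are mine). Each window is flattened into one row, and the rows are stacked:

```python
import numpy as np

def im2col(image, kH, kW):
    """Lay out each (kH, kW, C) convolution window as one row.

    image: (H, W, C); returns (num_windows, kH*kW*C), stride 1, no padding.
    """
    H, W, C = image.shape
    rows = []
    for i in range(H - kH + 1):
        for j in range(W - kW + 1):
            # flatten one window into a contiguous row
            rows.append(image[i:i+kH, j:j+kW, :].reshape(-1))
    return np.stack(rows)
```

For the (4,4,2) image with a 3×3 kernel this yields the (4, 18) matrix from the example: 4 windows, 18 elements each.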
After rearranging the image, we also need to rearrange the filters (in practice, they are already stored this way). Suppose we have 5 filters of shape (3,3,2). Each filter is flattened into a single column of 3×3×2 = 18 elements, and the 5 columns are stacked to form an (18, 5) matrix.
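With the filters stored as a single (3, 3, 2, 5) array, this flattening is just a reshape (again a NumPy sketch under that layout assumption):

```python
import numpy as np

filters = np.random.rand(3, 3, 2, 5)  # 5 filters of shape (3, 3, 2)
# Flatten each filter into one column: (3*3*2, 5) = (18, 5).
# NumPy's row-major order makes column f equal to filters[..., f].reshape(-1),
# matching the element order of the im2col rows.
filter_matrix = filters.reshape(-1, filters.shape[-1])
```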
At this point, we've converted convolution into a plain 2-D matrix multiplication, which is one of the most common problems in scientific computing. Highly optimized libraries such as OpenBLAS, Eigen, and Intel MKL implement GEMM (GEneral Matrix Multiplication) behind the BLAS interface. These routines have been tuned for decades and are used under the hood by frameworks like TensorFlow and PyTorch.
Multiplying the (4, 18) input matrix by the (18, 5) filter matrix gives a result of shape (4, 5), which can then be rearranged into the expected (2, 2, 5) output.
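Putting the three steps together, the whole im2col pipeline can be sketched as follows (my own NumPy sketch; the GEMM here is NumPy's `@`, which dispatches to a BLAS under the hood):

```python
import numpy as np

def conv2d_im2col(image, filters):
    """Convolution as im2col + one GEMM (stride 1, no padding).

    image:   (H, W, C); filters: (kH, kW, C, F); returns (oH, oW, F).
    """
    H, W, C = image.shape
    kH, kW, _, F = filters.shape
    oH, oW = H - kH + 1, W - kW + 1
    # 1. rearrange windows into rows: (oH*oW, kH*kW*C)
    cols = np.stack([image[i:i+kH, j:j+kW, :].reshape(-1)
                     for i in range(oH) for j in range(oW)])
    # 2. flatten each filter into a column: (kH*kW*C, F)
    w = filters.reshape(-1, F)
    # 3. one GEMM, then reshape back to the spatial layout
    return (cols @ w).reshape(oH, oW, F)
```

For the running example, `conv2d_im2col` takes a (4,4,2) image and (3,3,2,5) filters and returns the (2,2,5) output, matching the direct sliding-window computation element for element.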