CNN & ResNets — a more liberal understanding

Source: Deep Learning on Medium

Thus the input image is reduced to a smaller matrix known as a channel representing a certain feature.

Now, we may understand it in the traditional neural network way like below:

We could perform these operations the way a conventional fully connected network does, but that takes a lot of memory and time. Performing them as a convolution, as described above, takes far less of both.

Now, let us consider one more case of a convolutional neural network. What if our input matrix and kernel have the same size?
There are two options for dealing with such situations:

  • We may convolve the complete input matrix once, which yields a single output value per kernel.
  • Otherwise, we add zero padding or reflection padding around the input matrix and then convolve the padded matrix. Fastai uses reflection padding wherever possible.
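The two padding options can be sketched in plain Python. This is only an illustration under assumptions: `pad2d` is a hypothetical helper (not a fastai function), and "reflect" here mirrors around the border without repeating the edge cell.

```python
def pad2d(img, p, mode="zeros"):
    """Pad a 2D list-of-lists by p cells on every side (requires p < size)."""
    h, w = len(img), len(img[0])

    def src(i, n):
        # Map a padded index back to a source index, or None for a zero cell.
        if 0 <= i < n:
            return i
        if mode == "zeros":
            return None
        # Reflection: mirror around the border without repeating the edge cell.
        return -i if i < 0 else 2 * (n - 1) - i

    out = []
    for r in range(-p, h + p):
        row = []
        for c in range(-p, w + p):
            sr, sc = src(r, h), src(c, w)
            row.append(0 if sr is None or sc is None else img[sr][sc])
        out.append(row)
    return out

img = [[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]]
zero_padded = pad2d(img, 1, "zeros")       # 5x5, border of zeros
reflect_padded = pad2d(img, 1, "reflect")  # 5x5, border mirrors the interior
```

With one cell of padding, a 3×3 kernel fits a 3×3 input at nine positions instead of one, so the output keeps the input's size.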

In other words, a convolution is just a matrix multiplication where two things happen:

  • some of the entries are set to zero all the time
  • the same kernel weights are reused to compute every output position

So when you’ve got multiple things with the same weight, that’s called weight tying.
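To make the "convolution is a matrix multiplication" view concrete, here is a small sketch in plain Python. The helper names (`conv_direct`, `conv_as_matrix`) and the 4×4 input are made up for illustration; the point is only that the weight matrix is mostly zeros and every row reuses the same nine kernel values.

```python
def conv_direct(img, k):
    """Slide kernel k over img (no padding, stride 1)."""
    h, w, kh, kw = len(img), len(img[0]), len(k), len(k[0])
    return [[sum(img[r + i][c + j] * k[i][j]
                 for i in range(kh) for j in range(kw))
             for c in range(w - kw + 1)]
            for r in range(h - kh + 1)]

def conv_as_matrix(h, w, k):
    """Build the (outputs x inputs) weight matrix equivalent to conv_direct."""
    kh, kw = len(k), len(k[0])
    rows = []
    for r in range(h - kh + 1):
        for c in range(w - kw + 1):
            row = [0] * (h * w)                    # entries outside the window stay zero
            for i in range(kh):
                for j in range(kw):
                    row[(r + i) * w + (c + j)] = k[i][j]   # same weights in every row
            rows.append(row)
    return rows

img = [[1, 2, 0, 1],
       [0, 1, 3, 2],
       [2, 1, 0, 1],
       [1, 0, 2, 3]]
kernel = [[1, 0, -1],
          [1, 0, -1],
          [1, 0, -1]]

flat = [v for row in img for v in row]             # flatten the image to a vector
W = conv_as_matrix(4, 4, kernel)                   # 4 outputs x 16 inputs
via_matmul = [sum(wv * x for wv, x in zip(row, flat)) for row in W]
direct = [v for row in conv_direct(img, kernel) for v in row]
```

Both routes give the same four numbers; the matrix version just stores many zeros and repeated weights that the sliding-window version never materialises.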

That covers most of the theory behind convolutional neural networks. Now, let’s look at them from a practical point of view.

  • In reality, input images are 3D rather than 2D: each pixel has separate red, green and blue values. So, instead of a rank two kernel, we have a rank three kernel with different weights for red, green and blue. Instead of an element-wise multiplication of 9 things (for a 3×3 kernel), we do an element-wise multiplication of 27 things (3 by 3 by 3), and we still add them up into a single number.
  • When we convolve the image, we don’t want to find only top edges but other things too, such as repetitions, gradients of colour, etc. To cover all these different features we need more and more kernels, and this is what actually happens: each layer processes the image with many kernels, so each layer outputs a large number of channels.
  • To keep memory from going out of control with so many channels, from time to time we create a convolution where we don’t step over every single 3×3 window (for a 3×3 kernel), but instead move two pixels at a time. We would start with a 3×3 centered at (2, 2) and then jump over to (2, 4), (2, 6), (2, 8), and so forth. That’s called a stride-2 convolution. It looks the same, it’s still just a bunch of kernels, but we skip every other input pixel, so the output is H/2 by W/2. (A stride-n convolution is defined the same way, moving n pixels at a time.)
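The points above can be sketched together in plain Python: a rank three kernel applied to a three-channel image with stride 2. This is an illustrative sketch, not fastai or PyTorch code; `conv2d_multichannel`, the all-ones image and the zero padding of 1 are assumptions chosen to make the arithmetic easy to follow.

```python
def conv2d_multichannel(img, kernel, stride=2, pad=1):
    """img: [C][H][W], kernel: [C][k][k]; returns one [H_out][W_out] channel."""
    C, H, W = len(img), len(img[0]), len(img[0][0])
    k = len(kernel[0])

    def px(c, r, col):
        # Zero padding: anything outside the image counts as 0.
        return img[c][r][col] if 0 <= r < H and 0 <= col < W else 0

    out = []
    for r0 in range(-pad, H + pad - k + 1, stride):
        row = []
        for c0 in range(-pad, W + pad - k + 1, stride):
            # C * k * k products (27 for RGB with a 3x3 kernel), summed to one number.
            row.append(sum(px(c, r0 + i, c0 + j) * kernel[c][i][j]
                           for c in range(C) for i in range(k) for j in range(k)))
        out.append(row)
    return out

# A hypothetical 3-channel 8x8 image of ones and an all-ones 3x3x3 kernel.
img = [[[1] * 8 for _ in range(8)] for _ in range(3)]
kernel = [[[1] * 3 for _ in range(3)] for _ in range(3)]
out = conv2d_multichannel(img, kernel)   # 4x4 output: H/2 by W/2
```

Using more kernels simply means calling this once per kernel and stacking the results as output channels.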

Let’s understand the stride-2 convolution.

stride-2 convolution

Now, let’s train our convolutional neural network on the MNIST dataset. I used Google Colab to run the code.

from fastai.vision import *

Fastai provides academic datasets, which we can download, untar and use.

path = untar_data(URLs.MNIST)
It consists of training and validation data

After extracting the data, we have to create the data bunch. So, let us establish that.

The first thing you say is what kind of item list do you have. So, in this case, it’s the list of images. Then where are you getting the list of file names from? In this case, we have folders.

imagelist = ImageList.from_folder(path); imagelist
It has a total of 70,000 images. Each image has three channels and is 28*28.

So inside an item list is an items attribute and the items attribute is the kind of thing that you gave it. It’s the thing that it’s going to use to create your items. So in this case, the thing you gave it is a list of file names. That’s what it got from the folder.

It consists of a list of images from training and testing folder

When you show images, it usually shows them in RGB. In this case, we want to use a binary color map.


Once you’ve got an image item list, you then split it into training versus validation. You nearly always want validation. If you don’t, you can use the .no_split method to create an empty validation set. You can’t skip it entirely. All this is defined in fastai data block API.

splitData = imagelist.split_by_folder(train='training', valid='testing'); splitData
60000 items in training dataset and 10000 items in the validation dataset

So that is always the order. First, create your item list, then decide how to split. In this case, we’re going to do it based on folders. The validation folder for MNIST is called testing, so we pass that name to the method.

Now, we want to label our data, and we want to label the data using the folder in which our data is present.

Images of the digit 4 are in the folder named 4, and likewise for the other digits.
labelist = splitData.label_from_folder()
Each image now has a category taken from its folder name.

So first you create the item list, then you split it, then you label it.

x,y = labelist.train[0]  # or labelist.valid[0] for a validation sample
print(x.shape, y)

Next comes adding transforms. Transforms are part of data augmentation. There is a considerable difference between the processes we add for tabular data and the transforms we add for images:

  • Processes are computed once on the training data, and the same results are then carried over to the validation and test data.
  • Transforms are applied every time we ask for a batch of images.

Since we are doing digit recognition, we do not want to apply the default transforms, because they include some that we really do not want: flipping a digit vertically or horizontally can change which digit it is, and zooming changes the pixels and blurs the image. Instead, we add our own transforms, and they are simple: random padding and a small amount of cropping.

tfms = ([*rand_pad(padding=3, size=28, mode='zeros')], [])
(the empty list holds the validation set transforms, i.e. none)
transformedlist = labelist.transform(tfms)
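As a rough picture of what random padding plus cropping does, here is a plain Python sketch. It is an approximation under assumptions, not fastai's implementation of rand_pad: `rand_pad_crop` is a hypothetical helper that zero-pads by 3 pixels and then cuts a random 28×28 window, so the digit shifts by up to 3 pixels in any direction.

```python
import random

def rand_pad_crop(img, padding=3, size=28, rng=random):
    """Zero-pad a square 2D image, then crop a random size x size window."""
    n = len(img)
    padded_n = n + 2 * padding
    padded = [[0] * padded_n for _ in range(padded_n)]
    for r in range(n):
        for c in range(n):
            padded[r + padding][c + padding] = img[r][c]
    # Pick the top-left corner of the crop at random.
    top = rng.randint(0, padded_n - size)
    left = rng.randint(0, padded_n - size)
    return [row[left:left + size] for row in padded[top:top + size]]

img = [[1] * 28 for _ in range(28)]   # stand-in for one MNIST digit
rng = random.Random(0)                # seeded only to make the sketch repeatable
a = rand_pad_crop(img, rng=rng)
b = rand_pad_crop(img, rng=rng)       # a second call draws a new random shift
```

Because the shift is drawn on every call, the same training image can look slightly different each time it is requested, which is the difference from a process that runs once.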

Now is the time for the last step, and it is to create the data bunch. Here I am not using image stats for normalization as I am not using a pre-trained model like ResNet34, ResNet56, etc. Also, I will use the batch size of 128.

bs = 128
data = transformedlist.databunch(bs=bs).normalize()
x,y = data.train_ds[0]

What is most interesting is that the training data set now has data augmentation because we have added transforms. plot_multi is a function that will plot the result of calling some function on each of the items.

def _plot(i,j,ax): data.train_ds[0][0].show(ax, cmap='gray')
plot_multi(_plot, 3, 3, figsize=(7, 7))
Check different padding and cropping in the images
xb,yb = data.one_batch()
Since we selected a batch size of 128, there are 128 images in the batch.
data.show_batch(rows=3, figsize=(5,5))
Images with the labels

We are now done with the data bunch. Next, we create the learner and train it with our own CNN.

Basic CNN with the batch normalization

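def conv(ni,nf): return nn.Conv2d(ni, nf, kernel_size=3, stride=2, padding=1)

model = nn.Sequential(
    conv(3, 8), nn.BatchNorm2d(8), nn.ReLU(),     # 14
    conv(8, 16), nn.BatchNorm2d(16), nn.ReLU(),   # 7
    conv(16, 32), nn.BatchNorm2d(32), nn.ReLU(),  # 4
    conv(32, 16), nn.BatchNorm2d(16), nn.ReLU(),  # 2
    conv(16, 10), nn.BatchNorm2d(10),             # 1
    Flatten()                                     # remove (1,1) grid
)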

Let us understand the above function.

  • We declare the kernel size to be 3 * 3.
  • We want to perform a stride-2 convolution.
  • We want the layers applied one after another, which is why we use nn.Sequential.
  • The first layer of the model is conv(3, 8). The 3 is the number of input channels; since our image has three channels, we declare that number. The 8 is the number of output channels, which equals the number of filters, as discussed in the section above.
  • The number of output channels of one layer is the number of input channels of the next. Because we use stride-2 convolutions, the 28 * 28 image shrinks to 14 * 14 after the first layer, then to 7 * 7, then to 4 * 4, then to 2 * 2 and lastly to 1 * 1.
  • The output has shape [128, 10, 1, 1]: each image in the batch of 128 has ten channels of size 1 * 1. Flatten removes the trailing 1 * 1 grid, leaving a rank one tensor of ten values per image.
  • Between the convolution layers, we add batch normalization and ReLU as the non-linearity.
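The shrinking sizes in the list above follow from the usual output-size formula for a convolution, floor((n + 2·padding − kernel_size) / stride) + 1. A quick check in plain Python (`conv_out_size` is a helper name made up for this sketch):

```python
def conv_out_size(n, kernel_size=3, stride=2, padding=1):
    """Spatial size after one convolution: floor((n + 2p - k) / s) + 1."""
    return (n + 2 * padding - kernel_size) // stride + 1

sizes = [28]
for _ in range(5):                      # five stride-2 conv layers
    sizes.append(conv_out_size(sizes[-1]))
print(sizes)                            # [28, 14, 7, 4, 2, 1]
```

The step from 7 to 4 is where the floor matters: (7 + 2 − 3) // 2 + 1 = 4, not 3.5.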

That’s all ( ͡ᵔ ͜ʖ ͡ᵔ), we have created our convolutional neural network.

Now is the time to create the learner as defined in the fastai.

learn = Learner(data, model, loss_func = nn.CrossEntropyLoss(), metrics=accuracy)
learn.summary()
[8, 14, 14]: [channels, height, width]
Learning rate plot
learn.fit_one_cycle(10, max_lr=0.1)
We have reached 99% accuracy

Now, let us understand ResNet. Then I will include it in our model and see how much the accuracy improves.

❓ What is ResNet

Let X be our input. Instead of computing

Y = conv2(conv1(X)),

a ResNet computes

Y = X + conv2(conv1(X))

This is called an identity connection or skip connection.
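A toy version of the skip connection, in plain Python with lists standing in for tensors. The names `skip_connection`, `zero_branch` and `scaled_branch` are hypothetical, and the branch functions merely stand in for conv2(conv1(X)).

```python
def skip_connection(x, f):
    """Return x + f(x), where f stands in for conv2(conv1(x))."""
    return [xi + fi for xi, fi in zip(x, f(x))]

def zero_branch(v):
    # A branch that contributes nothing: the block reduces to the identity.
    return [0.0 for _ in v]

def scaled_branch(v):
    # A hypothetical branch that adds a correction on top of the input.
    return [0.5 * vi for vi in v]

x = [1.0, -2.0, 3.0]
identity_out = skip_connection(x, zero_branch)    # same values as x
residual_out = skip_connection(x, scaled_branch)  # x plus the branch output
```

The first case shows why the block can never do worse than passing its input through: if the convolutions output zeros, Y is simply X.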

Basics of ResNet — ResBlock

ResNet drastically improves the loss function surface. Without skip connections, the loss surface has lots of bumps; with them, it becomes much smoother.

We can create a ResBlock like below:

class ResBlock(nn.Module):
    def __init__(self, nf):
        super().__init__()  # must run before submodules are assigned
        self.conv1 = conv_layer(nf,nf)
        self.conv2 = conv_layer(nf,nf)

    def forward(self, x): return x + self.conv2(self.conv1(x))

Let us change our model to include the ResNet blocks. Let’s refactor a little, too. Rather than writing conv, batch norm, ReLU every time, fastai already has something called conv_layer, which lets you create conv, batch norm, ReLU combinations.

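def conv2(ni,nf): return conv_layer(ni,nf,stride=2)

# res_block is fastai's built-in residual block, playing the role of the ResBlock above
def conv_and_res(ni,nf): return nn.Sequential(conv2(ni, nf), res_block(nf))

model = nn.Sequential(
    conv_and_res(3, 8),   # 14 (3 input channels, matching the images loaded above)
    conv_and_res(8, 16),  # 7
    conv_and_res(16, 32), # 4
    conv_and_res(32, 16), # 2
    conv2(16, 10),        # 1
    Flatten()             # remove (1,1) grid
)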
learn = Learner(data, model, loss_func = nn.CrossEntropyLoss(), metrics=accuracy)
learn.fit_one_cycle(12, max_lr=0.05)
Accuracy is improved a bit to 99.23%.

That’s all. I hope you now understand the logic behind CNNs and ResNets.