Beginner’s Crash Course to Deep Learning and CNNs

An animated explanation with no complicated maths.

Note: This is a summary of my video on deep learning and convolutional neural networks. If you are interested, feel free to watch it to learn more!

All the assets and images in this article were created by me.

Deep learning is extremely fascinating. From recognizing and classifying images, to even generating dream-like hallucinations, deep learning does it all.

When I learned about this topic, I was constantly bombarded by extremely complicated mathematical equations and countless terms with acronyms that sounded like the Pokémon equivalent of deep learning. However, strip all that away, and deep learning actually becomes intuitive. Welcome to my “animated” guide to deep learning and convolutional neural networks!

Image Classification

We will simplify things by using a black-and-white image of a tick with a single channel representing its pixel values. Black pixels will be denoted by -1, while white pixels will be denoted by +1.

Numerical representation of a tick

Now, if we want to classify images of, say, ticks and crosses, we first need to do some processing on the input.

Convolution Step

The main purpose here is to extract the key features from the input image.

To detect features, we need something called a filter. A filter is just a numerical representation of a pattern. As we can see below, this particular filter defines the pattern of the stem of the tick.

Numerical representation of a filter

Starting from the top left-hand corner, this filter tries to find a match with the feature it has. Notice how this filter only cares about a small region at any point in time. This region is known as the receptive field.

In order to find a match, the filter performs a series of mathematical operations on the section of the image it is looking at.

The mathematical operation performed by the filter on a section of the image

Firstly, for each pair of corresponding image and filter pixels, the values are multiplied together. Afterward, the products are summed to give a result. This value is then divided by the total number of pixels in the filter, giving us the average. The calculated result is then stored in one pixel of a feature map.

Convolution process for the whole image

We repeat this process for the whole image as the filter shifts step-by-step and executes the operation at each point. This sliding motion of the filter is what encapsulates the whole idea of convolution. The distance the filter moves at each step is known as the stride length, which in this case is 1.

At this point, you might be wondering: what is the significance of the feature map? A feature map is actually a spatial representation of how well the feature of our filter matches the image. An intuitive way of understanding its values is that 1 represents a complete match, while -1 signifies a complete mismatch. Any value in between signifies a partial match.

We can do the same using a different filter that helps detect other features. The size of each filter and the number of filters we want to use in this convolution step can be customized accordingly.
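
To make the arithmetic concrete, here is a minimal NumPy sketch of the convolution just described. The toy image and filter values below are placeholders made up for illustration, not the exact ones from my animations.

```python
import numpy as np

def convolve(image, filt, stride=1):
    """Slide the filter over the image; at each position, multiply
    corresponding pixels, sum the products, and divide by the number
    of pixels in the filter (i.e. take the average)."""
    fh, fw = filt.shape
    ih, iw = image.shape
    out_h = (ih - fh) // stride + 1
    out_w = (iw - fw) // stride + 1
    feature_map = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # The receptive field: the small region the filter sees right now.
            region = image[i * stride:i * stride + fh,
                           j * stride:j * stride + fw]
            feature_map[i, j] = np.sum(region * filt) / filt.size
    return feature_map

# Toy 5x5 image with a diagonal stroke, and a 3x3 "stem" filter,
# using -1 for black and +1 for white.
image = np.array([[-1, -1, -1, -1,  1],
                  [-1, -1, -1,  1, -1],
                  [-1, -1,  1, -1, -1],
                  [-1,  1, -1, -1, -1],
                  [ 1, -1, -1, -1, -1]])
filt = np.array([[-1, -1,  1],
                 [-1,  1, -1],
                 [ 1, -1, -1]])
print(convolve(image, filt))  # values close to 1 mark strong matches
```

Wherever the stroke lines up with the filter, every product is +1, so the average comes out to exactly 1: a complete match.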

With that, this concludes our first convolutional layer.

ReLU (Rectified Linear Unit)

Rectified Linear Unit — Normalization process

Now, I’ll add a layer that turns all the negative values in our feature maps into zero; this is the “rectifying” step that gives the layer its name. It is known as a ReLU layer, which stands for Rectified Linear Unit.
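
In code, this rectification is a one-liner. A minimal NumPy sketch:

```python
import numpy as np

def relu(feature_map):
    # Negative values become zero; positive values pass through unchanged.
    return np.maximum(feature_map, 0)

fm = np.array([[ 0.55, -0.11],
               [-1.00,  1.00]])
print(relu(fm))  # [[0.55 0.  ]
                 #  [0.   1.  ]]
```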

Max Pooling

Pooling process of 2px x 2px area with stride length of 2

Additionally, we might also add a layer to do some downsampling to help with computational speed. One method is to take only the most significant value from an area (which is simply the maximum value) and record it. This is known as a max-pooling layer. In this case, my max-pooling layer pools a 2px by 2px area and is set to have a stride length of 2, moving two pixels at a time and logging the values.
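
Here is a minimal sketch of that max-pooling step, with the window size and stride as parameters:

```python
import numpy as np

def max_pool(feature_map, size=2, stride=2):
    """Downsample by keeping only the maximum value in each
    size-by-size window, moving `stride` pixels at a time."""
    h, w = feature_map.shape
    out_h = (h - size) // stride + 1
    out_w = (w - size) // stride + 1
    pooled = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            window = feature_map[i * stride:i * stride + size,
                                 j * stride:j * stride + size]
            pooled[i, j] = window.max()  # keep the most significant value
    return pooled

fm = np.array([[0.0, 0.5, 0.0, 0.9],
               [1.0, 0.0, 0.2, 0.0],
               [0.3, 0.0, 0.0, 0.0],
               [0.0, 0.1, 0.6, 0.4]])
print(max_pool(fm))  # [[1.  0.9]
                     #  [0.3 0.6]]
```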

We repeat and stack the aforementioned layers on top of each other like a burger to form the meat of the convolutional neural network, with each cycle extracting increasingly complex features that help with image classification.

Fully Connected Layer

Ending it off, we add a fully connected layer to interpret the results. This is essentially the generic neurons-and-synapses model that you commonly see used to represent neural networks.

To understand it intuitively, this is simply where the network decides on the importance of a certain characteristic. The more important it is, the heavier the “weight” — represented by the thicker lines — which means that the connection between the neurons is stronger.

Introduction to the fully connected layer and demonstration of the flattening of the image

The first layer of neurons represents the individual pixels of the feature maps created after a series of convolution, ReLU, and pooling cycles, flattened into one dimension like so. The value of each pixel can be seen as its neuron’s signal strength, with higher values representing a stronger signal. Hence, with stronger connections, a greater proportion of the signal emitted by this set of neurons can pass through.

Activation of the last layer of neurons

This last layer of neurons tells us how confident the network is in predicting a certain object. The activation of these neurons is based on the strength of the signals they receive. Higher activations mean that the cumulative signals from the previous neurons are stronger.

Since we are deciding between two classes of objects, we have two output neurons: one for the tick and one for the cross. An activation of 1 represents a hundred percent confidence in classifying the image as the class that neuron corresponds to, while 0 means a complete rejection. If you want, you can stack more layers of neurons in between to make your model more expressive.
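
Here is a minimal sketch of this final stage. The pooled feature maps and the weights below are made-up placeholders (a real network learns its weights, as the next section explains), and the sigmoid used to squash each output into the 0-to-1 confidence range is one common choice among several.

```python
import numpy as np

def sigmoid(x):
    # Squashes any value into the 0-to-1 confidence range.
    return 1 / (1 + np.exp(-x))

def fully_connected(feature_maps, weights):
    """Flatten the feature maps into one long vector of signals,
    then compute a weighted sum for each output neuron."""
    signals = np.concatenate([fm.flatten() for fm in feature_maps])
    return sigmoid(weights @ signals)

# Two made-up 2x2 pooled feature maps -> 8 input signals.
pooled_maps = [np.array([[1.0, 0.3], [0.0, 0.9]]),
               np.array([[0.2, 0.0], [0.8, 0.1]])]

# Placeholder weights: 2 output neurons (tick, cross) x 8 input signals.
rng = np.random.default_rng(0)
weights = rng.normal(size=(2, 8))

tick_confidence, cross_confidence = fully_connected(pooled_maps, weights)
print(tick_confidence, cross_confidence)
```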

The Learning Process

An Intuitive, Non-Mathematical Explanation

The idea of training the model is just like how you train a pet to do tricks. If the pet does the trick well, you reward it with a treat. Well, for machines, their “treat” is defined in the form of a cost function: the better they perform, the lower the cost, which means they are on the right track, and vice versa. The neural network wants to minimize this cost as much as possible.
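
One simple choice of cost function is the squared error between the network’s output and the correct answer. A minimal sketch (the exact choice of cost function isn’t important for the intuition):

```python
import numpy as np

def cost(prediction, target):
    # Mean squared error: zero when the prediction matches the label
    # exactly, and larger the further off it is.
    return np.mean((prediction - target) ** 2)

target = np.array([1.0, 0.0])      # label for a tick: tick = 1, cross = 0
prediction = np.array([0.7, 0.4])  # a hypothetical network output
print(cost(prediction, target))    # 0.125 -- room for improvement
```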

To train the neural network, we need data, which comes in the form of a training set. A training set contains a bunch of images, in our case ticks and crosses, which are labeled with the correct answer. Every time we feed the network an image, it generates a response based on what it thinks the image is by passing it through all the layers. From the output, the network evaluates itself and determines how far off it is from the correct answer.

The network then adjusts the weights to steer itself in the right direction, searching for the combination that improves its accuracy. This is done through backpropagation and gradient descent, which involve a bit of calculus and will not be covered in depth here.

Graphical representation of minimizing the cost function

Graphically speaking, however, the network is making small edits to its weights in order to slide down the cost function and eventually reach a local minimum. The rate at which it learns can be modified by changing its learning rate.
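
At its heart, gradient descent is one small update rule applied over and over. Here is a toy sketch on a made-up one-dimensional cost curve, (w - 3)², rather than a real network:

```python
learning_rate = 0.1
w = 0.0  # start with an arbitrary weight

for _ in range(50):
    gradient = 2 * (w - 3)         # slope of the toy cost (w - 3)**2
    w -= learning_rate * gradient  # step downhill, scaled by the learning rate

print(w)  # approaches 3.0, the bottom of the toy cost curve
```

A larger learning rate takes bigger slides down the curve but risks overshooting the minimum; a smaller one is safer but slower.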

We then use a validation set, which is a set of images not seen by the neural network, to check its performance after the training stage. This is just like letting the neural network take an exam and seeing how it does.

We then train the model further if needed to get the best results possible. And there you have it, you now have an intuitive understanding of convolutional neural networks and deep learning!