One of the most exciting areas of deep learning is computer vision. Recent advances in convolutional neural nets have given us self-driving cars, facial detection systems and automated medical image analysis that can outperform specialists on some tasks, just to name a few. In this article I will show you the fundamentals of convolutional neural nets and how you can create one yourself to classify handwritten digits.

Unlike many fields of deep learning, which are hyped to the public as replications of biological functions in the human brain, convolutional neural nets actually come very close. Back in 1959, David Hubel and Torsten Wiesel conducted experiments on cats and monkeys which gave important revelations about how the visual cortex functions. What they found was that many neurons have a small local receptive field, meaning they react only to a small, finite area of the total visual field. They showed that certain neurons react to low-level patterns such as horizontal lines or vertical lines, while others react to rounded shapes. They also recognized that other neurons have larger receptive fields and are stimulated by more complex patterns, which are combinations of the information gathered by the lower-level neurons. These findings laid the foundation for what we now call convolutional neural nets. Let’s break down the building blocks one by one.

1. Convolutional Layers

Each convolutional layer is built of several feature maps, each fed by a filter that detects a feature such as horizontal lines or vertical lines. You can picture each filter as a window that slides over the dimensions of an image, detecting properties as it goes. The amount the filter moves across the image is called the stride. A stride of 1 means the filter moves over one pixel at a time, whereas a stride of 2 would jump across 2 pixels.

In the example above, we have a vertical line detector. The original image is 6×6 and is scanned with a 3×3 filter with a stride of 1, which results in a 4×4 output. The filter is only interested in the left and right columns of its field of view. Multiplying the inputs of the image by the configuration of the 3×3 filter and summing gives 3 + 1 + 2 - 1 - 7 - 5 = -7. The filter then moves one stride to the right and calculates 1 + 0 + 3 - 2 - 3 - 1 = -2. The -2 then goes in the spot to the right of the -7. This process continues until the 4×4 grid is complete. Afterwards the next feature map will calculate its own values using a unique filter/kernel matrix of its own.
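To make the sliding window concrete, here is a minimal NumPy sketch of the same idea. The 6×6 image below is a made-up example (bright left half, dark right half), not the one from the figure, and `convolve2d_valid` is a hypothetical helper name. The filter puts +1 on the left column and -1 on the right column, which matches the sign pattern of the sums above.

```python
import numpy as np

def convolve2d_valid(image, kernel, stride=1):
    """Slide `kernel` over `image` without padding and return the feature map."""
    kh, kw = kernel.shape
    out_h = (image.shape[0] - kh) // stride + 1
    out_w = (image.shape[1] - kw) // stride + 1
    output = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            top, left = i * stride, j * stride
            patch = image[top:top + kh, left:left + kw]
            # Element-wise multiply the patch by the filter, then sum.
            output[i, j] = np.sum(patch * kernel)
    return output

# Vertical line detector: left column positive, right column negative.
vertical_filter = np.array([[1, 0, -1],
                            [1, 0, -1],
                            [1, 0, -1]])

# Hypothetical 6x6 image with a vertical edge down the middle.
image = np.array([[10, 10, 10, 0, 0, 0]] * 6)

feature_map = convolve2d_valid(image, vertical_filter, stride=1)
print(feature_map.shape)  # (4, 4)
print(feature_map)        # every row is [0, 30, 30, 0]
```

The filter outputs 0 over flat regions and a large value where its window straddles the edge, which is what "detecting a vertical line" means in practice.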

2. Pooling Layers

The goal of the pooling layer is to further reduce dimensionality by aggregating the values collected by the convolutional layer, which is also called sub-sampling. This will reduce the computational load as well as provide some regularization to your model to avoid overfitting. Pooling layers follow the same sliding window idea as the conv layer, but rather than computing a weighted sum they pick the max or the average of their inputs. These are called max pooling and average pooling respectively.
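Here is a small NumPy sketch of both pooling flavours. The helper name `pool2d` and the 4×4 feature map are made up for illustration; a 2×2 window with a stride of 2 is assumed.

```python
import numpy as np

def pool2d(feature_map, size=2, stride=2, mode="max"):
    """Slide a size x size window over a 2-D feature map, keeping one value per window."""
    reduce_fn = np.max if mode == "max" else np.mean
    out_h = (feature_map.shape[0] - size) // stride + 1
    out_w = (feature_map.shape[1] - size) // stride + 1
    pooled = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            top, left = i * stride, j * stride
            window = feature_map[top:top + size, left:left + size]
            pooled[i, j] = reduce_fn(window)
    return pooled

# Hypothetical 4x4 feature map.
feature_map = np.array([[1, 3, 2, 4],
                        [5, 6, 1, 2],
                        [7, 2, 9, 0],
                        [1, 8, 3, 4]])

max_pooled = pool2d(feature_map, mode="max")
avg_pooled = pool2d(feature_map, mode="avg")
print(max_pooled)  # [[6. 4.] [8. 9.]]
print(avg_pooled)  # [[3.75 2.25] [4.5 4.]]
```

Either way the 4×4 map shrinks to 2×2, so the next layer has a quarter of the values to process.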

These 2 components are the key building blocks of a convolutional network. You would then typically repeat this recipe, further reducing the dimensions of your feature maps while increasing their depth. Each feature map will specialize in recognizing its own unique shapes. At the end of the convolutions you flatten the output into a vector and place one or more fully connected layers with an activation function such as ReLU or SELU, which shrink it to a size suitable to feed into your classifier. For example, if your final conv layer outputs a 3x3x128 matrix but you are only predicting 10 different classes, you will want to reshape that into a 1×1152 vector and gradually reduce its size before feeding it to your classifier. The fully connected layers will also learn their own functions, as in a typical deep neural network.
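The shape bookkeeping can be sketched in NumPy as follows. The weights are random and untrained, and the hidden sizes of 256 and 64 are arbitrary choices for illustration; only the 3x3x128 input and the 10 outputs come from the example above.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0)

def dense(inputs, n_out, activation=None):
    """Fully connected layer with random (untrained) weights, for shape illustration only."""
    weights = rng.normal(scale=0.01, size=(inputs.shape[1], n_out))
    bias = np.zeros(n_out)
    outputs = inputs @ weights + bias
    return activation(outputs) if activation else outputs

# One example coming out of the last conv layer: 3x3 spatial size, 128 feature maps.
conv_output = rng.normal(size=(1, 3, 3, 128))

flat = conv_output.reshape(len(conv_output), -1)  # (1, 1152)
hidden1 = dense(flat, 256, relu)                  # hypothetical size
hidden2 = dense(hidden1, 64, relu)                # hypothetical size
logits = dense(hidden2, 10)                       # one score per class

print(flat.shape, hidden1.shape, hidden2.shape, logits.shape)
# (1, 1152) (1, 256) (1, 64) (1, 10)
```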

Now let’s see an implementation in TensorFlow on the MNIST handwritten digit dataset. First we will load our libraries. Using fetch_mldata from sklearn we load the MNIST dataset and assign the images and labels to the X and y variables. Then we will create our train/test sets. Lastly, we will plot a few examples to get an idea of the task ahead.

Next we will do some data augmentation, which is a reliable way to improve your model’s performance. By creating slight variations of the training images you in effect create regularization for your model. We will use scipy’s ndimage module to shift our images by 1 pixel right, left, up and down. Not only does this provide a wider variety of examples, it also increases the size of our training set considerably, which is usually a good thing.
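A minimal sketch of the shifting step might look like this. It uses `scipy.ndimage.shift` on 28×28 arrays; the helper names and the tiny stand-in batch are assumptions for illustration, and `order=0` is chosen here so that pixels are copied exactly rather than interpolated.

```python
import numpy as np
from scipy.ndimage import shift

def shift_image(image, dy, dx):
    """Shift a 2-D image by dy rows and dx columns, filling the gap with zeros."""
    return shift(image, [dy, dx], order=0, cval=0)

def augment_with_shifts(images, labels):
    """Return the originals plus copies shifted right, left, up and down by 1 pixel."""
    offsets = [(0, 1), (0, -1), (-1, 0), (1, 0)]  # (rows, columns)
    all_images, all_labels = [images], [labels]
    for dy, dx in offsets:
        shifted = np.array([shift_image(img, dy, dx) for img in images])
        all_images.append(shifted)
        all_labels.append(labels)  # a shifted digit keeps its label
    return np.concatenate(all_images), np.concatenate(all_labels)

# Stand-in batch: three blank 28x28 images with a single lit pixel each.
images = np.zeros((3, 28, 28))
images[:, 10, 10] = 1.0
labels = np.array([3, 1, 4])

X_aug, y_aug = augment_with_shifts(images, labels)
print(X_aug.shape, y_aug.shape)  # (15, 28, 28) (15,)
```

With four shifts per image the training set becomes five times its original size.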

The last form of data augmentation I’ll show you will be to create horizontal flips of the images using the cv2 library. We will also need to create new labels for these flipped images which is as easy as duplicating the original labels.
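As a rough sketch, the flip can be written with plain NumPy slicing, which is swapped in here for the cv2 call to keep the example dependency-free; with OpenCV, `cv2.flip(image, 1)` performs the same horizontal flip on a single image. The stand-in batch and the helper name are assumptions for illustration.

```python
import numpy as np

def flip_horizontal(images):
    """Mirror a batch of images left to right (axis 2 is the column axis)."""
    return images[:, :, ::-1]

# Stand-in batch: two blank 28x28 images with one lit pixel near the left edge.
images = np.zeros((2, 28, 28))
images[:, 5, 3] = 1.0
labels = np.array([7, 2])

flipped = flip_horizontal(images)

# The flipped copies reuse the original labels.
X_aug = np.concatenate([images, flipped])
y_aug = np.concatenate([labels, labels])
print(X_aug.shape, y_aug)  # (4, 28, 28) [7 2 7 2]
```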

Next we will create a helper function for feeding random mini batches to our neural net input. Due to the nature of convolutional layers, they require massive amounts of memory during the forward and backward propagation steps. Consider a layer with 4×4 filters, outputting 128 feature maps with a stride of 1 and SAME padding, with an RGB image input of dimension 299×299. The number of parameters would equal (4x4x3+1) x 128 = 6272. Now consider that each of those 128 feature maps is computing 299×299 neurons and each of these neurons is computing a weighted sum of 4x4x3 inputs. That’s now 4x4x3x299x299x128 = 549,279,744 calculations. That’s just for one training example. As you can imagine, this quickly gets out of hand. The way to get around this is to feed small batches at a time to the network using a Python generator, which by nature keeps items out of memory until they are required.
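A generator for random mini batches could look roughly like the sketch below. The function name `random_batches` and its parameters are assumptions for illustration; the point is that `yield` hands over one batch at a time instead of building them all up front.

```python
import numpy as np

def random_batches(X, y, batch_size, seed=None):
    """Yield (X_batch, y_batch) pairs covering the data once in random order."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X))
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]
        # Only this slice is materialized; the next one is built on demand.
        yield X[idx], y[idx]

# Stand-in data: 10 examples with 2 features each.
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

for X_batch, y_batch in random_batches(X, y, batch_size=4, seed=0):
    print(X_batch.shape, y_batch.shape)
# (4, 2) (4,)
# (4, 2) (4,)
# (2, 2) (2,)
```

In a training loop you would call the generator once per epoch and feed each batch to the network as it arrives.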

We are ready to start creating our network architecture. First, we create our placeholders for our training data/labels. We will need to reshape the image data into a (-1, 28, 28, 1) tensor, as the TensorFlow conv2d layer expects a 4-dimensional input. We will set the first dimension to None to allow arbitrary batch sizes to be fed to the placeholder.

Now we will design our convolutional layers. I will be taking inspiration from the LeNet-5 network architecture (pioneered by Yann LeCun), which is known for its success in classifying handwritten digits. However, there are many things I did differently. I recommend you study LeNet-5 as well as other proven models to get some intuition of what kind of convolutional networks work for different tasks. Here is a link to his paper: http://yann.lecun.com/exdb/publis/pdf/lecun-01a.pdf

The first layer consists of 12 feature maps, using a 3×3 filter with a stride of 1. We chose SAME padding, which maintains the dimensions of the image by adding a pad of zeros around the input. We then apply a max pooling layer with another 3×3 filter and strides of 2, which will output a 13x13x12 matrix. So we started with a 28x28x1 image and have produced feature maps which are less than half its size but much deeper. We then pass this matrix along to the 2nd conv layer, which has a depth of 16, 3×3 filters, stride = 1 and SAME padding, followed by the same max pooling layer as before. This outputs a 6x6x16 matrix. You can see we are reducing the dimension space of our feature maps while going deeper and deeper. This is where we learn to put together the lower-level shapes learned in the first layer to form more complex patterns in the 2nd layer. Next we prepare the outputs for the fully connected layers by reshaping them into a 1-dimensional row vector consisting of 6x6x16 = 576 values. We use two dense layers with SELU activation to reduce the number of inputs by around half at each layer, until finally feeding them to our logits, which will output 10 predictions.
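If you want to check these dimensions yourself, the arithmetic can be sketched in a few lines of Python. The helper `output_size` is a hypothetical name, and the sketch assumes the pooling layers use VALID padding (no zeros added), which is the setting that produces the 13 and 6 quoted above.

```python
import math

def output_size(size, kernel, stride, padding):
    """Spatial output size of a conv or pooling layer along one dimension."""
    if padding == "SAME":
        # Zero padding is added so only the stride shrinks the output.
        return math.ceil(size / stride)
    if padding == "VALID":
        # No padding: the window must fit entirely inside the input.
        return (size - kernel) // stride + 1
    raise ValueError("padding must be 'SAME' or 'VALID'")

size = 28
size = output_size(size, kernel=3, stride=1, padding="SAME")   # conv 1 -> 28
size = output_size(size, kernel=3, stride=2, padding="VALID")  # pool 1 -> 13
pool1 = size
size = output_size(size, kernel=3, stride=1, padding="SAME")   # conv 2 -> 13
size = output_size(size, kernel=3, stride=2, padding="VALID")  # pool 2 -> 6
pool2 = size

flat_length = pool2 * pool2 * 16
print(pool1, pool2, flat_length)  # 13 6 576
```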

We create our loss function, which in this case will be softmax cross entropy; the softmax turns our logits into multi-class probabilities. You can think of cross entropy as a measure of how far the predicted probabilities are from the true labels. We choose the AdamOptimizer (adaptive moment estimation), which automatically adapts its step size for each parameter as it moves down the gradient. Finally we create a means of evaluating our results. in_top_k takes our logits and checks whether the true label is among the top k scores (here, the single top score). We then use our accuracy variable to output a fraction between 0 and 1.
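To show what these pieces compute, here is a NumPy sketch of softmax cross entropy and a top-k check. It is a stand-in written for illustration, not the TensorFlow implementation, and the logits and labels are made up.

```python
import numpy as np

def softmax(logits):
    """Turn each row of scores into probabilities that sum to 1."""
    shifted = logits - logits.max(axis=1, keepdims=True)  # for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def cross_entropy(logits, labels):
    """Mean negative log-probability assigned to the true class."""
    probs = softmax(logits)
    return -np.log(probs[np.arange(len(labels)), labels]).mean()

def in_top_k(logits, labels, k=1):
    """True where the correct label is among the k highest scores."""
    top_k = np.argsort(logits, axis=1)[:, -k:]
    return np.any(top_k == labels[:, None], axis=1)

# Hypothetical scores for 3 examples and 3 classes.
logits = np.array([[2.0, 1.0, 0.1],
                   [0.5, 2.5, 0.3],
                   [1.2, 0.2, 3.0]])
labels = np.array([0, 1, 0])

loss = cross_entropy(logits, labels)
correct = in_top_k(logits, labels, k=1)
accuracy = correct.mean()
print(correct)   # [ True  True False]
print(accuracy)  # 2 of 3 correct, so about 0.667
```

The loss grows when the model puts little probability on the right class, which is why the third example (confidently wrong) contributes the most to it.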

Now we are all ready for the training phase. First I will train the model without any augmented features or fancy bells and whistles. Let’s see how well our model performs.

At epoch 19 we reach our highest accuracy at 0.9907. This already beats what conventional machine learning algorithms typically reach on MNIST, so the convolutional network has taken the lead. Let’s now try using our shifted/flipped features as well as adding two new elements to our network: dropout and batch normalization.

We replace our existing placeholders with placeholder_with_default nodes, which will hold the flags that tell the batch normalization and dropout layers whether we are training. During training we set these values to True, and during testing we turn them off by setting them to False.

Batch normalization simply centers and normalizes the inputs of a layer over each batch. We assign a momentum of 0.9. Dropout regularization assigns a probability (in our case 1 - 0.5 = 0.5) to randomly turn nodes off completely during training. This results in the rest of the nodes having to pick up the slack, thus improving their effectiveness. Imagine a company that decided to randomly choose 50 employees each week to stay home. The rest of the staff would have to handle the extra work, effectively improving their skills in other areas. Not sure that would work in real life, but in deep learning it’s been proven effective.
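Both ideas fit in a few lines of NumPy. This is a simplified sketch under stated assumptions: the batch normalization here leaves out the learned scale and offset and the moving averages kept for test time (the moving averages are where the momentum comes in), and the dropout rescales the surviving activations so their expected sum stays the same.

```python
import numpy as np

def batch_norm(batch, epsilon=1e-5):
    """Center and normalize each feature using the statistics of this batch."""
    mean = batch.mean(axis=0)
    var = batch.var(axis=0)
    return (batch - mean) / np.sqrt(var + epsilon)

def dropout(activations, rate, training, rng):
    """Zero out a random fraction `rate` of activations during training only."""
    if not training:
        return activations  # no randomness at prediction time
    keep_mask = rng.random(activations.shape) >= rate
    # Scale the survivors up so the expected total activation is unchanged.
    return activations * keep_mask / (1.0 - rate)

rng = np.random.default_rng(42)

# Stand-in batch: 200 examples, 5 features, far from zero mean and unit variance.
batch = rng.normal(loc=10.0, scale=3.0, size=(200, 5))
normalized = batch_norm(batch)
print(normalized.mean(axis=0).round(3))  # close to 0 for every feature
print(normalized.std(axis=0).round(3))   # close to 1 for every feature

activations = np.ones((100, 100))
dropped = dropout(activations, rate=0.5, training=True, rng=rng)
unchanged = dropout(activations, rate=0.5, training=False, rng=rng)
print((dropped == 0).mean())  # roughly half of the nodes are switched off
```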

We create our loss, train and eval steps as before, then apply a few modifications to our execution phase. The computations performed by batch normalization are saved as update operations during each iteration. In order to access these we assign a variable extra_update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS). During our training operation we feed this into sess.run as an item of a list along with training_op. Finally, when performing our validation/test predictions, we assign our placeholders False values via the feed_dict. We do not want there to be any randomization during our prediction phase. In order to get outputs we run the logits operation using our test set. Let’s see how well this model performs now that we’ve added regularization/normalization and are using augmented features. When using dropout the model will take longer to train, so we’ll increase our number of epochs to 30.

On epoch 29 we achieved 99.5% accuracy on our test set of 10,000 digits. As you can see, the model achieved > 99% accuracy on only the 2nd epoch, compared to the 16th with our model before. Though an improvement of about 0.4 percentage points may not sound like much, this is a substantial improvement when dealing with huge amounts of data. Finally I’ll show you how to extract predictions using np.argmax on our logits output.
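As a small illustration, here is how np.argmax turns logits into class predictions. The logits below are made-up values for two images and 10 classes, not output from the trained model.

```python
import numpy as np

# Hypothetical logits: one row per image, one column per digit 0-9.
logits = np.array([
    [-1.2, 0.3, 0.1, 2.0, -0.5, 0.0, 1.1, 4.7, 0.2, -0.3],
    [0.4, -0.8, 3.9, 1.0, 0.2, -1.5, 0.6, 0.1, 2.2, 0.0],
])

# The index of the largest score in each row is the predicted digit.
predictions = np.argmax(logits, axis=1)
print(predictions)  # [7 2]
```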

If there is anything you feel I missed or need to explain further please comment below. I’ll try my best to explain. Follow the link below if you’d like to see some of the most famous convolutional networks. It’s a great idea to study these and see how these models were crafted to get some intuition on what may work for your application.

Source: Deep Learning on Medium