Using Convolutional Neural Networks with TensorFlow Part-3

Source: Deep Learning on Medium

Now, that’s a very basic introduction to what convolutions do, and when combined with something called pooling, they can become really powerful.

Now what’s pooling then? Pooling is a way of compressing an image. A quick and easy way to do this is to go over the image four pixels at a time, that is, the current pixel and its neighbors underneath and to the right of it.

2 x 2 max pooling applied to a 4 x 4 grid of pixels

Of these 4, pick the biggest value and keep just that. So, for example, you can see it here: my 16 pixels on the left are turned into the four pixels on the right by looking at them in 2 by 2 grids and picking the biggest value. This preserves the features that were highlighted by the convolution, while simultaneously quartering the size of the image, since it is halved along both the horizontal and vertical axes.
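To make that concrete, here's a tiny sketch in NumPy; the pixel values are made up purely for illustration. A 4 by 4 block of pixels is reduced to 2 by 2 by keeping the maximum of each 2 by 2 grid.

import numpy as np

# A made-up 4x4 "image"; 2x2 max pooling keeps the largest value of each 2x2 block
image = np.array([[  0,  64, 128, 128],
                  [ 48, 192, 144, 144],
                  [142, 226, 168,   0],
                  [255,   0,   0,  64]])
pooled = image.reshape(2, 2, 2, 2).max(axis=(1, 3))
print(pooled)
# [[192 144]
#  [255 168]]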

Coding for convolutions and max pooling

These layers are available as Conv2D and MaxPooling2D in TensorFlow.

We don’t have to do all the math for filtering and compressing; we simply define convolutional and pooling layers to do the job for us.

So here’s our code from the earlier example, where we defined a neural network with an input layer in the shape of our data, an output layer in the shape of the number of categories we’re trying to predict, and a hidden layer in the middle. The Flatten takes our square 28 by 28 images and turns them into a one-dimensional array.

model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(128, activation=tf.nn.relu),
    tf.keras.layers.Dense(10, activation=tf.nn.softmax)
])

To add convolutions to this, you use code like this.

model = tf.keras.models.Sequential([
    tf.keras.layers.Conv2D(64, (3,3), activation='relu', input_shape=(28, 28, 1)),
    tf.keras.layers.MaxPooling2D(2, 2),
    tf.keras.layers.Conv2D(64, (3,3), activation='relu'),
    tf.keras.layers.MaxPooling2D(2, 2),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dense(10, activation='softmax')
])

You’ll see that the last three lines are the same, the Flatten, the Dense hidden layer with 128 neurons, and the Dense output layer with 10 neurons. What’s different is what has been added on top of this. Let’s take a look at this, line by line.

tf.keras.layers.Conv2D(64, (3,3), activation='relu', input_shape=(28, 28, 1))

Here we’re specifying the first convolution. We’re asking Keras to generate 64 filters for us. These filters are 3 by 3, their activation is relu, which means the negative values will be thrown away, and finally the input shape is as before, the 28 by 28. That extra 1 just means we are using a single byte for color depth. As we saw before, our image is grayscale, so we just use one byte. Now, of course, you might wonder what the 64 filters are. It’s a little beyond the scope of this blog to define them, but for now you can understand that they are not random. They start with a set of known good filters, in a similar way to the pattern fitting that you saw earlier, and the ones that work from that set are learned over time.

tf.keras.layers.MaxPooling2D(2, 2)

This next line of code will then create a pooling layer. It’s max-pooling because we’re going to take the maximum value. We’re saying it’s a two-by-two pool, so for every four pixels, the biggest one will survive, as shown earlier. We then add another convolutional layer and another max-pooling layer, so that the network can learn another set of convolutions on top of the existing one, and then again pool to reduce the size. So, by the time the image gets to the Flatten and goes into the dense layers, it’s already much smaller. It’s been quartered, and then quartered again. So, its content has been greatly simplified, the goal being that the convolutions will filter it down to the features that determine the output.

A really useful method on the model is model.summary. This allows you to inspect the layers of the model and see the journey of the image through the convolutions, and here is the output.
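If you want to run this yourself, it’s a one-line call. The output shapes in the comments below are sketched from the walkthrough that follows; the exact layer names will vary from run to run.

model.summary()
# Output shape per layer (roughly):
#   conv2d            (None, 26, 26, 64)
#   max_pooling2d     (None, 13, 13, 64)
#   conv2d_1          (None, 11, 11, 64)
#   max_pooling2d_1   (None, 5, 5, 64)
#   flatten           (None, 1600)
#   dense             (None, 128)
#   dense_1           (None, 10)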

Output of model.summary

It’s a nice table showing us the layers and some details about them, including the output shape. It’s important to keep an eye on the output shape column. When you first look at this, it can be a little bit confusing and feel like a bug. After all, isn’t the data 28 by 28, so why is the output 26 by 26? The key to this is remembering that the filter is a 3 by 3 filter. Consider what happens when you start scanning through an image starting at the top left. You can’t calculate the filter for the pixel in the top left, because it doesn’t have any neighbors above it or to its left. In a similar fashion, the next pixel to the right won’t work either, because it doesn’t have any neighbors above it. So, logically, the first pixel that you can do calculations on is the one in the second row and second column, because it has all 8 neighbors that a 3 by 3 filter needs. This, when you think about it, means that you lose a 1 pixel margin all around the image, so the output of the convolution will be 2 pixels smaller on x and 2 pixels smaller on y. If your filter is five by five, for similar reasons your output will be four smaller on x and four smaller on y. So, that’s why with a 3 by 3 filter, our output from the 28 by 28 image is now 26 by 26: we’ve removed that one pixel on x and y from each of the borders.
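If you want to sanity-check that arithmetic in code, here’s a tiny sketch of the rule of thumb for a convolution with no padding:

# For a convolution with no padding ('valid'), each dimension shrinks by (kernel - 1)
def conv_output_size(input_size, kernel_size):
    return input_size - (kernel_size - 1)

print(conv_output_size(28, 3))  # 26
print(conv_output_size(28, 5))  # 24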

So, now the first max-pooling halves that, and our output gets reduced from 26 by 26 to 13 by 13. The next convolution will then operate on that and, of course, we lose the 1 pixel margin as before, so we’re down to 11 by 11. Add another 2 by 2 max-pooling, which rounds down, and we’re down to 5 by 5 images. So, now our dense neural network is the same as before, but it’s being fed with five-by-five images instead of 28 by 28 ones.

But remember, it’s not just one compressed 5 by 5 image instead of the original 28 by 28; there are a number of convolutions per image that we specified, in this case 64. So, there are 64 new images of 5 by 5 that get fed in. Flatten that out and you have 25 pixels times 64, which is 1,600. So, you can see that the new flattened layer has 1,600 elements in it, as opposed to the 784 that you had previously. This number is impacted by the parameters that you set when defining the convolutional 2D layers. Later, when you experiment, you’ll see the impact of setting other values for the number of convolutions, and in particular, you can see what happens when you’re feeding fewer than the original 784 pixels in. Training should be faster, but is there a sweet spot where it’s more accurate?

Hands on with CNN

You can find the notebook I used here. Again, you can download the notebook if you are using a local environment, and if you are using Colab, you can click on the Open in Colab button.

This is a really nice way to improve our image recognition performance. Let’s now look at it in action using a notebook. Here’s the same neural network that you used before for loading the set of clothing images and then classifying them. By the end of epoch five, you can see the loss is around 0.34, meaning your accuracy is pretty good on the training data.
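For reference, the plain DNN version looks roughly like this; it’s a sketch, and the exact hyperparameters in the notebook may differ slightly.

import tensorflow as tf

# Load the Fashion MNIST clothing images and scale pixel values to [0, 1]
(training_images, training_labels), (test_images, test_labels) = \
    tf.keras.datasets.fashion_mnist.load_data()
training_images, test_images = training_images / 255.0, test_images / 255.0

model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dense(10, activation='softmax')
])
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
model.fit(training_images, training_labels, epochs=5)
model.evaluate(test_images, test_labels)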

Output with DNN

It took just a few seconds to train, so that’s not bad. With the test data, as before and as expected, the loss is a little higher and thus the accuracy is a little lower.

So now, you can see the code that adds convolutions and pooling. We’re going to do 2 convolutional layers each with 64 convolutions, and each followed by a max pooling layer.
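Compiling and training the convolutional model follows the same pattern as before. The only wrinkle is the extra reshape to add the single color channel that Conv2D expects. This is a sketch, assuming the variable names from the DNN snippet above.

model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

# Conv2D expects a channel dimension, so reshape each 28x28 image to 28x28x1
model.fit(training_images.reshape(-1, 28, 28, 1), training_labels, epochs=5)
model.evaluate(test_images.reshape(-1, 28, 28, 1), test_labels)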

You can see that we defined our convolutions to be three-by-three and our pools to be two-by-two. Let’s train. The first thing you’ll notice is that the training is much slower. For every image, 64 convolutions are being tried, then the image is compressed, then another 64 convolutions, then it’s compressed again, and then it’s passed through the DNN, and that’s happening for all 60,000 images on each epoch. So it might take a few minutes instead of a few seconds. To remedy this, you can use a GPU. How do you do that in Colab?

All you need to do is Runtime > Change Runtime Type > GPU. A single epoch would now take approximately 5–6 seconds.

Output with the Convolutions and max poolings

Now that it’s done, you can see that the loss has improved a little; it’s 0.25 now. In this case, it’s brought our accuracy up a bit for both our test data and our training data. That’s pretty cool, right?

Now, this is a really fun visualization of the journey of an image through the convolutions. First, I’ll print out the first 100 test labels. The number 9, as we saw earlier, is a shoe or boot. I picked out a few instances of this: the 0th, the 23rd, and the 28th labels are all nine. So let’s take a look at their journey.

The visualization

The Keras API gives us each convolution, each pooling, each dense layer, etc. as a layer. So with the layers API, I can take a look at each layer’s outputs, and I’ll create a list of each layer’s output. I can then treat each item in that list as an individual activation model if I want to see the results of just that layer. Now, by looping through the layers, I can display the journey of the image through the first convolution, then the first pooling, then the second convolution, and then the second pooling. Note how the size of the image changes by looking at the axes. If I set the convolution number to one, we can see that it almost immediately detects the laces area as a common feature between the shoes.
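The code in the notebook looks roughly like this; the image indices and convolution number are the ones discussed here, and it assumes the trained model and test_images from above.

import matplotlib.pyplot as plt
import tensorflow as tf

FIRST_IMAGE, SECOND_IMAGE, THIRD_IMAGE = 0, 23, 28
CONVOLUTION_NUMBER = 1

# Build a model that returns the output of every layer in the trained model
layer_outputs = [layer.output for layer in model.layers]
activation_model = tf.keras.models.Model(inputs=model.input, outputs=layer_outputs)

f, axarr = plt.subplots(3, 4)
for x in range(4):  # the two convolution layers and the two pooling layers
    f1 = activation_model.predict(test_images[FIRST_IMAGE].reshape(1, 28, 28, 1))[x]
    axarr[0, x].imshow(f1[0, :, :, CONVOLUTION_NUMBER], cmap='inferno')
    f2 = activation_model.predict(test_images[SECOND_IMAGE].reshape(1, 28, 28, 1))[x]
    axarr[1, x].imshow(f2[0, :, :, CONVOLUTION_NUMBER], cmap='inferno')
    f3 = activation_model.predict(test_images[THIRD_IMAGE].reshape(1, 28, 28, 1))[x]
    axarr[2, x].imshow(f3[0, :, :, CONVOLUTION_NUMBER], cmap='inferno')
plt.show()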

So, for example, if I change the third image to index one, which looks like a handbag, you’ll see that it also has a bright line near the bottom that could look like the sole of a shoe, but by the time it gets through the convolutions, that’s lost, and the area for the laces doesn’t even show up at all. So this convolution definitely helps me separate a shoe from a handbag. Again, if I set it to index two, it appears to be trousers, and the feature that detected something the shoes had in common fails again. Also, if I change my third image back to the shoe but try a different convolution number, you’ll see that convolution two didn’t really find any common features. To see commonality in a different image, try images two, three, and five. These all appear to be trousers. Convolutions two and four seem to detect this vertical feature as something they all have in common. If I again go to the list and find three labels that are the same, in this case six, I can see what they signify. When I run it, I can see that they appear to be shirts. Convolution four doesn’t do a whole lot, so let’s try five. We can kind of see that the collar appears to light up in this case.

There are some exercises at the bottom of the notebook; check them out.

How do convolutions work? (Optional)

We will create a little convolution and pooling algorithm, so you can visualize its impact. There’s a notebook that you can play with too, and I’ll step through it here; you can find the notebook for playing with convolutions here. It does use a few Python libraries that you may not be familiar with, such as cv2. It also uses Matplotlib, which we used before. If you haven’t used them, they’re really quite intuitive for this task and very easy to learn. So first, we’ll set up our inputs and, in particular, import the misc library from SciPy. This is a nice shortcut for us because misc.ascent returns a nice image that we can play with, and we don’t have to worry about managing our own.

Matplotlib contains the code for drawing an image, and it will render it right in the browser with Colab. Here, we can see the ascent image from SciPy. Next up, we’ll take a copy of the image and transform it with our homemade convolutions, and we’ll create variables to keep track of the x and y dimensions of the image. We can see here that it’s a 512 by 512 image. So now, let’s create a convolution as a three by three array. We’ll load it with values that are pretty good for detecting sharp edges first. Here’s where we’ll create the convolution.

We then iterate over the image, leaving a one pixel margin. You’ll see that the loop starts at one and not zero, and it ends at size x minus one and size y minus one. In the loop, it will calculate the convolution value by looking at the pixel and its neighbors, multiplying them by the values determined by the filter, and finally summing it all up.
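Here’s a rough sketch of that loop, assuming the ascent image and a hand-picked 3 by 3 edge filter; the exact filter values in the notebook may differ.

import numpy as np
import matplotlib.pyplot as plt
from scipy import misc  # in newer SciPy versions, ascent lives in scipy.datasets

# Grab the built-in test image and make a copy to write the transformed pixels into
i = misc.ascent()
i_transformed = np.copy(i)
size_x, size_y = i_transformed.shape

# A 3x3 filter that responds strongly to vertical edges
filter = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
weight = 1

# Iterate over the image, leaving a one-pixel margin all around
for x in range(1, size_x - 1):
    for y in range(1, size_y - 1):
        convolution = 0.0
        for dx in range(-1, 2):
            for dy in range(-1, 2):
                convolution += i[x + dx, y + dy] * filter[dx + 1][dy + 1]
        convolution = max(0, min(255, convolution * weight))  # clip to valid pixel range
        i_transformed[x, y] = convolution

plt.gray()
plt.imshow(i_transformed)
plt.show()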

Vertical line filter

Let’s run it. It takes just a few seconds, so when it’s done, let’s draw the results. We can see that only certain features made it through the filter. I’ve provided a couple more filters, so let’s try them. This first one is really great at spotting vertical lines. So when I run it and plot the results, we can see that the vertical lines in the image made it through. It’s really cool because they’re not just straight up and down; they are vertical within the perspective of the image itself. Similarly, this filter works well for horizontal lines. So when I run it and then plot the results, we can see that a lot of the horizontal lines made it through. Now, let’s take a look at pooling, and in this case max pooling, which takes pixels in chunks of four and only passes through the biggest value. I run the code and then render the output. We can see that the features of the image are maintained, but look closely at the axes, and we can see that the size has been halved from the 500s to the 250s.
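A minimal max-pooling sketch, continuing from the variables in the convolution code above: walk the filtered image in 2 by 2 blocks and keep only the largest pixel of each block.

# Halve the image by keeping the maximum of every 2x2 block of the filtered image
new_x, new_y = size_x // 2, size_y // 2
pooled = np.zeros((new_x, new_y))
for x in range(0, size_x, 2):
    for y in range(0, size_y, 2):
        block = [i_transformed[x, y], i_transformed[x + 1, y],
                 i_transformed[x, y + 1], i_transformed[x + 1, y + 1]]
        pooled[x // 2, y // 2] = max(block)

plt.gray()
plt.imshow(pooled)
plt.show()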

With pooling

Exercise 3

Now you need to apply this to MNIST handwriting recognition, which we will revisit from the last blog post. You need to improve MNIST to 99.8% accuracy or more using only a single convolutional layer and a single MaxPooling2D layer. You should stop training once the accuracy goes above this amount. It should happen in less than 20 epochs, so it’s OK to hard-code the number of epochs for training, but your training must end once it hits the above metric. If it doesn’t, then you’ll need to redesign your layers.

When 99.8% accuracy has been hit, you should print out the string “Reached 99.8% accuracy so cancelling training!”. Yes, this part is just optional (you can also print out something like “I’m getting bored and won’t train any more” 🤣).

The question notebook is available — here

My Solution

Wonderful! 😃 You just coded a handwriting recognizer with 99.8% accuracy (that’s good) in less than 20 epochs. Let’s explore my solution for this.

My solution

The callback class (This is the simplest)

class myCallback(tf.keras.callbacks.Callback):
    def on_epoch_end(self, epoch, logs={}):
        # Note: newer TensorFlow versions log this metric as 'accuracy' rather than 'acc'
        if logs.get('acc') > 0.998:
            print("\nReached 99.8% accuracy so cancelling training!")
            self.model.stop_training = True

The main CNN code

training_images = training_images.reshape(60000, 28, 28, 1)
test_images = test_images.reshape(10000, 28, 28, 1)
training_images = training_images / 255.0
test_images = test_images / 255.0
# YOUR CODE ENDS HERE

model = tf.keras.models.Sequential([
    # YOUR CODE STARTS HERE
    tf.keras.layers.Conv2D(64, (3,3), activation='relu', input_shape=(28, 28, 1)),
    tf.keras.layers.MaxPooling2D(2, 2),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(256, activation='relu'),
    tf.keras.layers.Dense(10, activation='softmax')
    # YOUR CODE ENDS HERE
])
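For completeness, the model is then compiled and fitted with the callback attached. This is a sketch of the wiring; depending on your TensorFlow version the metric may be logged as 'accuracy' rather than 'acc', so adjust the callback accordingly.

callbacks = myCallback()
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
model.fit(training_images, training_labels, epochs=20, callbacks=[callbacks])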

So, all you had to do was play around with the code, and you can get this done in just 7 epochs.