Convolutional Neural Networks in Practice


Develop and implement your first CNN with Keras!

Image source: https://commons.wikimedia.org/wiki/File:Typical_cnn.png

Introduction

The goal of this article is to be a tutorial on how to develop a Convolutional Neural Network model. If you want to explore the theoretical foundations of CNNs, I encourage you to check this article out.

CIFAR-10 Dataset

In this first example, we will implement a net that can differentiate between 10 types of objects. To do so, we will use the CIFAR-10 dataset. This dataset consists of 60,000 colored pictures with a resolution of 32×32 pixels, split across 10 different classes (airplane, automobile, bird, cat, deer, dog, frog, horse, ship and truck). The dataset is divided into 50,000 training pictures and 10,000 test pictures.

To develop this implementation, we will not use TensorFlow directly, but Keras instead. Keras is a framework that runs on top of TensorFlow and brings flexibility, speed and ease of use. Those are the main reasons for its recent rise in popularity among Deep Learning developers.

# Original dataset: https://www.cs.toronto.edu/~kriz/cifar.html for more information
# Loading the necessary libraries
import numpy as np
from keras.datasets import cifar10
from keras.models import Sequential
from keras.layers.core import Dense, Flatten
from keras.layers.convolutional import Conv2D
from keras.optimizers import Adam
from keras.layers.pooling import MaxPooling2D
from keras.utils import to_categorical
# To make the example reproducible
np.random.seed(42)
# Loading the dataset
(X_train, Y_train), (X_test, Y_test) = cifar10.load_data()

# Plotting one random example of each class
import matplotlib.pyplot as plt
class_names = ['airplane', 'automobile', 'bird', 'cat', 'deer',
               'dog', 'frog', 'horse', 'ship', 'truck']
fig = plt.figure(figsize=(8, 3))
for i in range(len(class_names)):
    ax = fig.add_subplot(2, 5, 1 + i, xticks=[], yticks=[])
    idx = np.where(Y_train[:] == i)[0]
    features_idx = X_train[idx, ::]
    img_num = np.random.randint(features_idx.shape[0])
    im = features_idx[img_num, ::]
    ax.set_title(class_names[i])
    plt.imshow(im)
plt.show()
# Initializing the model
model = Sequential()
# Defining a first convolutional layer
model.add(Conv2D(128, kernel_size=(3, 3), activation='relu', input_shape=(32, 32, 3)))
# Defining a second convolutional layer
model.add(Conv2D(128, kernel_size=(3, 3), activation='relu'))
# Defining a third convolutional layer
model.add(Conv2D(128, kernel_size=(3, 3), activation='relu'))
# Adding our classifier
model.add(Flatten())
model.add(Dense(1024, activation='relu'))
model.add(Dense(10, activation='softmax'))
# Compiling the model
model.compile(loss='categorical_crossentropy',
              optimizer=Adam(lr=0.0001, decay=1e-6),
              metrics=['accuracy'])
# Training the model
model.fit(X_train, to_categorical(Y_train),
          batch_size=128,
          shuffle=True,
          epochs=10,
          validation_data=(X_test, to_categorical(Y_test)))
# Evaluating the model
scores = model.evaluate(X_test, to_categorical(Y_test))
print('Loss: %.3f' % scores[0])
print('Accuracy: %.3f' % scores[1])

What is going on? Only a 10% accuracy? It seems like our network is predicting the same class for every sample, which means something is failing. It surely has to do with our data: we have not pre-processed it before feeding it to the model.

Data Pre-Processing

The first thing is to pre-process the data to make the task as easy as possible for our network. If we fail to do so and feed in raw values ranging from 0 to 255, the network will never learn anything.

To carry out this pre-processing, two things are usually done:

  • Center the data: compute the mean of the dataset and subtract it. When working with images, you can either compute a single mean over the whole dataset and subtract it directly, or compute the mean of each channel and subtract it channel by channel.
  • Normalize the data: this is done to get all the data to have approximately the same scale. The two most common ways to do this are:

1) Divide each dimension by its standard deviation, once the data has been centered (mean subtracted).

2) Normalize so that the minimum and maximum of each dimension are -1 and 1. This only makes sense when we start from data with different scales that we know should be similar, that is, that have a similar importance for the algorithm. In the case of images, we know the values range from 0 to 255, so strictly speaking it is not necessary to normalize, since the values are already on a similar scale.

Important Note!

Normalization statistics must be computed from the training set only. In other words, we calculate the mean and standard deviation of the training set and use those same values with the validation and test sets.

# Centering the data
X_train_mean = np.mean(X_train, axis = 0)
X_train_cent = X_train - X_train_mean
# Normalization
X_train_std = np.std(X_train, axis = 0)
X_train_norm = X_train_cent / X_train_std

Now, we prepare the validation and test data using the mean and standard deviation of the training set.

Wait, but we don’t have validation data! Well, we will implement this example in this way, but it’s very important that in real developments we have all 3 sets (a minimal split sketch follows the list):

  1. Training Set: to update the weights for each batch
  2. Validation Set: it checks the generalization capacity of the network after each epoch. It tests the model on samples that were not seen during training; it serves to monitor the training of the network for information purposes, but it does not intervene in any weight update! It is typically used when tuning hyperparameters: the configuration with the best validation accuracy is the one we pick. Precisely because we choose the configuration that maximizes validation accuracy, we cannot rely on this result to give us an idea of the network’s generalization capacity. Therefore, we need an extra set that lets us say whether our network is good on samples it has never influenced: the test set.
  3. Test Set: it gives us an intuition of how well our network generalizes to a set of samples it has never seen.
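As a reference, here is a minimal sketch of how one might carve a validation split out of the CIFAR-10 training set (the 90/10 ratio is an arbitrary choice of mine, not something fixed by this article):

# A minimal train/validation split sketch (assuming a 90/10 ratio;
# keras' validation_split argument or sklearn's train_test_split work too)
n_val = 5000
X_val, Y_val = X_train[:n_val], Y_train[:n_val]
X_tr, Y_tr = X_train[n_val:], Y_train[n_val:]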

All right, so let’s get our test set ready:

X_test_norm = (X_test - X_train_mean) / X_train_std

Now we are ready to test our net again, this time with normalized data:

# Initializing the model
model = Sequential()
# Defining a first convolutional layer
model.add(Conv2D(128, kernel_size=(3, 3), activation='relu', input_shape=(32, 32, 3)))
# Defining a second convolutional layer
model.add(Conv2D(128, kernel_size=(3, 3), activation='relu'))
# Defining a third convolutional layer
model.add(Conv2D(128, kernel_size=(3, 3), activation='relu'))
# Adding our classifier
model.add(Flatten())
model.add(Dense(1024, activation='relu'))
model.add(Dense(10, activation='softmax'))
# Compiling the model
model.compile(loss='categorical_crossentropy',
              optimizer=Adam(lr=0.0001, decay=1e-6),
              metrics=['accuracy'])
# Training the model (note that we now feed the normalized data)
model.fit(X_train_norm, to_categorical(Y_train),
          batch_size=128,
          shuffle=True,
          epochs=10,
          validation_data=(X_test_norm, to_categorical(Y_test)))
# Evaluating the model
scores = model.evaluate(X_test_norm, to_categorical(Y_test))
print('Loss: %.3f' % scores[0])
print('Accuracy: %.3f' % scores[1])

This is a notably better result than the previous one. So we can proudly say that we have trained our first CNN, with an accuracy of ~0.99 in training and ~0.7 in testing.

The next logical question should be:

How can there be such a difference between training and testing?

As you all may be thinking already, deep learning also suffers from over-fitting; in fact, it is even more pronounced there than in other techniques.

For those of you who don’t remember what overfitting is, think about this:

You have a network that can detect which character appears at any given moment in FRIENDS episode 4×08. It works perfectly: it can tell which characters are on screen with 99.3% accuracy. It works so well that you try it out on episode 5×01, and the result is that it’s only 71.2% accurate.

Well, this phenomenon is known as overfitting, and consists of creating an algorithm that works very well on our dataset, but is extremely bad at generalizing.

You can find a deeper exploration of overfitting and the techniques to minimize it in this article.

Look at the graph that represents the accuracy as a function of the training epochs:

And take a look at this example:

Which one would you take?

The network with 20 neurons works better than the one with 3, right? However, what we usually look for is a good generalization capability (that it works well when it sees new data). Which one do you think will perform better on new data?

Surprisingly enough, the simpler one.

Let’s go back to our example. In our case, I’m sure we’d all like it much better if instead of ~99 vs. ~70, we got ~90 vs. ~85, right?

How can we achieve this? With normalization and regularization techniques.

Important Note: in practice, the only pre-processing that is usually done with images is to divide all their values by 255. This is usually enough for the network to work properly, and so we do not depend on any parameters related to our training set.
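In code, that usual pre-processing is a one-liner (a sketch; the cast to float avoids integer division):

# Scale pixel values to [0, 1]; no training-set statistics required
X_train_scaled = X_train.astype('float32') / 255.0
X_test_scaled = X_test.astype('float32') / 255.0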

Dealing with Overfitting

There are several ways to reduce over-fitting as much as possible and thus obtain an algorithm capable of generalizing better.

BatchNormalization

Batch Normalization is a technique developed by Ioffe and Szegedy that aims to reduce the change of internal covariates, or Internal Covariate Shift, which makes the network more robust to bad initializations.

Internal Covariate Shift is defined as the change in the distribution of network activations due to the different distribution of input data between mini-batches. The smaller this difference between mini-batches, the more similar the data that reaches the network’s filters, the more similar the activation maps will be, and the better the training of the network will work.

This is achieved by forcing the network activations to follow a unit Gaussian distribution at the beginning of the training. This is possible because normalization is a differentiable operation.

It is normally inserted just before the activation function is executed:

model.add(Conv2D(128, kernel_size=(3, 3), input_shape=(32, 32, 3)))
model.add(BatchNormalization())
model.add(Activation('relu'))

In mathematical terms, we center and normalize each mini-batch that enters the network, using the mean and standard deviation computed on that mini-batch, and then rescale and shift the data with parameters that the network learns during training.

Furthermore, as we are calculating the mean and standard deviation for each mini-batch, instead of for the whole dataset, the batch norm also introduces some noise that acts as a regularizer and helps to reduce overfitting.
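To make this concrete, here is a minimal NumPy sketch of the batch-norm forward pass at training time (gamma and beta stand for the learned rescale and shift parameters, eps is the usual numerical-stability constant; the names are illustrative, not from this article):

import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    # Center and normalize with the statistics of this mini-batch
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mu) / np.sqrt(var + eps)
    # Rescale and shift with the parameters learned during training
    return gamma * x_hat + beta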

This technique has proven to be very effective for training networks faster.

# Importing the Batch Normalization layer
from keras.layers import BatchNormalization, Activation
# Initializing the model
model = Sequential()
# Defining a first convolutional layer
model.add(Conv2D(128, kernel_size=(3, 3), input_shape=(32, 32, 3)))
model.add(BatchNormalization())
model.add(Activation('relu'))
# Defining a second convolutional layer
model.add(Conv2D(128, kernel_size=(3, 3)))
model.add(BatchNormalization())
model.add(Activation('relu'))
# Defining a third convolutional layer
model.add(Conv2D(128, kernel_size=(3, 3)))
model.add(BatchNormalization())
model.add(Activation('relu'))
# Including our classifier
model.add(Flatten())
model.add(Dense(1024, activation='relu'))
model.add(Dense(10, activation='softmax'))
# Compiling the model
model.compile(loss='categorical_crossentropy',
              optimizer=Adam(lr=0.0001, decay=1e-6),
              metrics=['accuracy'])
# Training the model
model.fit(X_train_norm, to_categorical(Y_train),
          batch_size=128,
          shuffle=True,
          epochs=10,
          validation_data=(X_test_norm, to_categorical(Y_test)))  # here we should use a set other than the test set!
# Evaluating the model
scores = model.evaluate(X_test_norm, to_categorical(Y_test))
print('Loss: %.3f' % scores[0])
print('Accuracy: %.3f' % scores[1])

We can see that the accuracy has improved by 2%, which, once we are reaching high numbers, is a huge step. But there is still room for improvement.

Let’s explore regularization.

Regularization

Regularization consists of penalizing, in some way, the predictions made by our network during training, so that it does not treat the training set as the absolute truth and thus learns to generalize better when it sees other datasets.

Take a look at this graph:

Image source: https://commons.wikimedia.org/wiki/File:75hwQ.jpg

In this graph, we can see an example of overfitting, another of underfitting and another that can generalize correctly.

Which is which?

  • Blue: over-fitting
  • Green: the good model with the ability to generalize
  • Orange: under-fitting

Now, look at this example, which follows on from the previous one with the 3 networks of different numbers of neurons. What we see now is the 20-neuron network with different levels of regularization.

You can play with these parameters here:

https://cs.stanford.edu/people/karpathy/convnetjs/demo/classify2d.html

and here’s a much more complete one:

https://playground.tensorflow.org/

In the end, it is much better to have a net with many layers and apply regularization than to have a small one in order to avoid overfitting. This is because small networks are simpler functions with fewer local minima, so which one gradient descent reaches depends heavily on the initialization, and the losses achieved usually have a great variance from one initialization to another.

However, networks with many layers are much more complicated functions with many more local minima which, although they are more difficult to reach, usually all have similar and better losses.

If you are interested in this topic: http://cs231n.github.io/neural-networks-1/#arch.

There are many methods of regularization. Here are the most common ones:

L2 regularization (Ridge regularization)

The L2 regularization is possibly the most common.

It consists of penalizing the loss function by adding the term 1/2 · λ · w² for each weight, which results in a total loss of the form L = L_data + 1/2 · λ · Σ w².

The 1/2 is simply for convenience when computing the derivative, as this leaves λ · w instead of 2 · λ · w.

What this means is that we penalize very high or disparate weights and prefer them all to be of a similar magnitude. If you remember, the weights represent the importance of each neuron in the final calculation of the prediction. Therefore, by doing this we make all neurons matter more or less equally; that is, the network will use all of its neurons to make the prediction.

On the contrary, if certain neurons had very high weights, the calculation of the prediction would rely on them much more heavily, and we would end up with a network where the remaining neurons are dead and useless.

Moreover, introducing the term 1/2 · λ · w² in our loss function makes our weights approach zero during gradient descent, with a linear decay of the form W += -λ · W.
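As a quick illustrative sketch in plain NumPy (lam plays the role of λ, and the weight matrix shape is arbitrary):

import numpy as np

lam = 0.01                               # regularization strength, an assumed value
W = 0.01 * np.random.randn(1024, 10)     # example weight matrix

reg_loss = 0.5 * lam * np.sum(W ** 2)    # L2 penalty added to the data loss
dW_reg = lam * W                         # its gradient: the 1/2 cancels the 2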

Let’s see if we can improve our network by applying the L2 regularization:

# L2 Regularization
# Regularizer import
from keras.regularizers import l2
# Initializing the model
model = Sequential()
# Defining a first convolutional layer
model.add(Conv2D(128, kernel_size=(3, 3), activation='relu', input_shape=(32, 32, 3)))
# Defining a second convolutional layer
model.add(Conv2D(128, kernel_size=(3, 3), activation='relu'))
# Defining a third convolutional layer
model.add(Conv2D(128, kernel_size=(3, 3), activation='relu'))
# Including our classifier
model.add(Flatten())
model.add(Dense(1024, activation='relu', kernel_regularizer=l2(0.01)))
model.add(Dense(10, activation='softmax'))
# Compiling the model
model.compile(loss='categorical_crossentropy',
              optimizer=Adam(lr=0.0001, decay=1e-6),
              metrics=['accuracy'])
# Training the model
model.fit(X_train_norm, to_categorical(Y_train),
          batch_size=128,
          shuffle=True,
          epochs=10,
          validation_data=(X_test_norm, to_categorical(Y_test)))
# Evaluating the model
scores = model.evaluate(X_test_norm, to_categorical(Y_test))
print('Loss: %.3f' % scores[0])
print('Accuracy: %.3f' % scores[1])

L1 regularization (Lasso regularization)

L1 is also quite common. This time, we add the term λ · |w| to our loss function.

We can also combine the L1 regularization with the L2 in what is known as Elastic net regularization, whose penalty term takes the form λ1 · |w| + λ2 · w².

The L1 regularization manages to turn the weight matrix W into a sparse matrix (very close to zero everywhere, except for a few elements).

This means that, unlike L2, it gives much more importance to some neurons than others, making the network more robust to possible noise.

Generally, L2 gives better results. You can use L1 if you have images in which you know a certain number of features will give you a good classification and you do not want the network to be distracted by noise.
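In the same illustrative NumPy style as before (again with lam standing in for λ):

import numpy as np

lam = 0.01                                # regularization strength, an assumed value
W = 0.01 * np.random.randn(1024, 10)      # example weight matrix

reg_loss_l1 = lam * np.sum(np.abs(W))     # L1 penalty added to the data loss
dW_reg_l1 = lam * np.sign(W)              # subgradient: a constant push toward zero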

Let’s try L1, then L1+L2:

# L1 Regularization
# Regularizer import
from keras.regularizers import l1
# Initializing the model
model = Sequential()
# Defining a first convolutional layer
model.add(Conv2D(128, kernel_size=(3, 3), activation='relu', input_shape=(32, 32, 3)))
# Defining a second convolutional layer
model.add(Conv2D(128, kernel_size=(3, 3), activation='relu'))
# Defining a third convolutional layer
model.add(Conv2D(128, kernel_size=(3, 3), activation='relu'))
# Including our classifier
model.add(Flatten())
model.add(Dense(1024, activation='relu', kernel_regularizer=l1(0.01)))
model.add(Dense(10, activation='softmax'))
# Compiling the model
model.compile(loss='categorical_crossentropy',
              optimizer=Adam(lr=0.0001, decay=1e-6),
              metrics=['accuracy'])
# Training the model
model.fit(X_train_norm, to_categorical(Y_train),
          batch_size=128,
          shuffle=True,
          epochs=10,
          validation_data=(X_test_norm, to_categorical(Y_test)))
# Evaluating the model
scores = model.evaluate(X_test_norm, to_categorical(Y_test))
print('Loss: %.3f' % scores[0])
print('Accuracy: %.3f' % scores[1])
# Elastic Net Regularization (L1 + L2)
# Regularizer import
from keras.regularizers import l1_l2
# Initializing the model
model = Sequential()
# Defining a first convolutional layer
model.add(Conv2D(128, kernel_size=(3, 3), activation='relu', input_shape=(32, 32, 3)))
# Defining a second convolutional layer
model.add(Conv2D(128, kernel_size=(3, 3), activation='relu'))
# Defining a third convolutional layer
model.add(Conv2D(128, kernel_size=(3, 3), activation='relu'))
# Including our classifier
model.add(Flatten())
model.add(Dense(1024, activation='relu', kernel_regularizer=l1_l2(l1=0.01, l2=0.01)))
model.add(Dense(10, activation='softmax'))
# Compiling the model
model.compile(loss='categorical_crossentropy',
              optimizer=Adam(lr=0.0001, decay=1e-6),
              metrics=['accuracy'])
# Training the model
model.fit(X_train_norm, to_categorical(Y_train),
          batch_size=128,
          shuffle=True,
          epochs=10,
          validation_data=(X_test_norm, to_categorical(Y_test)))
# Evaluating the model
scores = model.evaluate(X_test_norm, to_categorical(Y_test))
print('Loss: %.3f' % scores[0])
print('Accuracy: %.3f' % scores[1])

Max norm constraints

Another type of regularization is the one based on restrictions. For example, we could set a maximum threshold that the weights cannot exceed.

In practice, this is implemented by using gradient descent to compute the new values of the weights as we normally would, but then the L2 norm of each neuron’s weight vector is computed and constrained so that it cannot exceed a threshold c, that is: ‖w‖₂ ≤ c.

Normally, c is set to 3 or 4.

What we achieve with this constraint is that the network does not “explode”, that is, that the weights do not grow excessively.
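A minimal NumPy sketch of that projection step, applied right after the usual weight update (constraining column-wise norms, which matches how Keras’ max_norm treats Dense kernels by default; shapes are illustrative):

import numpy as np

c = 3.0                                  # max-norm threshold (3 or 4 in practice)
W = np.random.randn(1024, 10)            # example weights after a gradient step

# Per-neuron L2 norms (each column holds one output neuron's incoming weights)
norms = np.linalg.norm(W, axis=0, keepdims=True)
# Rescale only the columns whose norm exceeds c
W = W * np.minimum(1.0, c / (norms + 1e-12))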

Let’s see how this regularization goes:

# Max Norm Constraint
# Constraint import
from keras.constraints import max_norm
# Initializing the model
model = Sequential()
# Defining a first convolutional layer
model.add(Conv2D(128, kernel_size=(3, 3), activation='relu', input_shape=(32, 32, 3)))
# Defining a second convolutional layer
model.add(Conv2D(128, kernel_size=(3, 3), activation='relu'))
# Defining a third convolutional layer
model.add(Conv2D(128, kernel_size=(3, 3), activation='relu'))
# Including our classifier
model.add(Flatten())
model.add(Dense(1024, activation='relu', kernel_constraint=max_norm(3.)))
model.add(Dense(10, activation='softmax'))
# Compiling the model
model.compile(loss='categorical_crossentropy',
              optimizer=Adam(lr=0.0001, decay=1e-6),
              metrics=['accuracy'])
# Training the model
model.fit(X_train_norm, to_categorical(Y_train),
          batch_size=128,
          shuffle=True,
          epochs=10,
          validation_data=(X_test_norm, to_categorical(Y_test)))
# Evaluating the model
scores = model.evaluate(X_test_norm, to_categorical(Y_test))
print('Loss: %.3f' % scores[0])
print('Accuracy: %.3f' % scores[1])

Dropout regularization

Dropout regularization is a technique developed by Srivastava et al. in their paper “Dropout: A Simple Way to Prevent Neural Networks from Overfitting” that complements the other types of regularization (L1, L2, max norm).

It is an extremely effective and simple technique: during training, each neuron is kept active with a probability p and set to 0 otherwise.

What we achieve with this is to change the architecture of the network at training time, which means that no single neuron is responsible for activating in response to a certain pattern; instead, we have multiple redundant neurons capable of reacting to that pattern.
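For intuition, a minimal NumPy sketch of “inverted dropout” at training time (dividing by p keeps the expected activation unchanged, so nothing has to change at test time; names and shapes are illustrative, and note that here p is the keep probability, while Keras’ Dropout argument is the drop rate):

import numpy as np

p = 0.5                                     # probability of keeping a neuron
a = np.random.randn(128, 1024)              # some layer's activations (batch x units)

mask = (np.random.rand(*a.shape) < p) / p   # drop neurons and rescale in one step
a_train = a * mask                          # training time: apply the mask
a_test = a                                  # test time: use the activations as-is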

Let’s see how applying dropout affects our results:

# Dropout
# Dropout layer import
from keras.layers import Dropout
# Initializing the model
model = Sequential()
# Defining a first convolutional layer
model.add(Conv2D(128, kernel_size=(3, 3), activation='relu', input_shape=(32, 32, 3)))
model.add(Dropout(0.25))
# Defining a second convolutional layer
model.add(Conv2D(128, kernel_size=(3, 3), activation='relu'))
model.add(Dropout(0.25))
# Defining a third convolutional layer
model.add(Conv2D(128, kernel_size=(3, 3), activation='relu'))
model.add(Dropout(0.25))
# Including our classifier
model.add(Flatten())
model.add(Dense(1024, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(10, activation='softmax'))
# Compiling the model
model.compile(loss='categorical_crossentropy',
              optimizer=Adam(lr=0.0001, decay=1e-6),
              metrics=['accuracy'])
# Training the model
model.fit(X_train_norm, to_categorical(Y_train),
          batch_size=128,
          shuffle=True,
          epochs=10,
          validation_data=(X_test_norm, to_categorical(Y_test)))
# Evaluating the model
scores = model.evaluate(X_test_norm, to_categorical(Y_test))
print('Loss: %.3f' % scores[0])
print('Accuracy: %.3f' % scores[1])

And now, let’s see the effects of Max norm + Dropout:

# Dropout & Max Norm
# Dropout & Max Norm imports
from keras.layers import Dropout
from keras.constraints import max_norm
# Initializing the model
model = Sequential()
# Defining a first convolutional layer
model.add(Conv2D(128, kernel_size=(3, 3), activation='relu', input_shape=(32, 32, 3)))
model.add(Dropout(0.25))
# Defining a second convolutional layer
model.add(Conv2D(128, kernel_size=(3, 3), activation='relu'))
model.add(Dropout(0.25))
# Defining a third convolutional layer
model.add(Conv2D(128, kernel_size=(3, 3), activation='relu'))
model.add(Dropout(0.25))
# Including our classifier
model.add(Flatten())
model.add(Dense(1024, activation='relu', kernel_constraint=max_norm(3.)))
model.add(Dropout(0.5))
model.add(Dense(10, activation='softmax'))
# Compiling the model
model.compile(loss='categorical_crossentropy',
              optimizer=Adam(lr=0.0001, decay=1e-6),
              metrics=['accuracy'])
# Training the model
model.fit(X_train_norm, to_categorical(Y_train),
          batch_size=128,
          shuffle=True,
          epochs=10,
          validation_data=(X_test_norm, to_categorical(Y_test)))
# Evaluating the model
scores = model.evaluate(X_test_norm, to_categorical(Y_test))
print('Loss: %.3f' % scores[0])
print('Accuracy: %.3f' % scores[1])

There are more techniques to deal with overfitting, such as max pooling, changing the strides, etc. In practice, the best approach is to apply several of them and test which combination provides the best result for the problem at hand.
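As a closing sketch, note that MaxPooling2D was imported at the beginning of the article but never used; here is one hypothetical way to slot pooling into the architecture above, shrinking the feature maps (and the parameter count downstream) after each convolution:

# A sketch of adding pooling (Sequential, Conv2D, MaxPooling2D, Flatten and
# Dense were all imported at the top of the article)
model = Sequential()
model.add(Conv2D(128, kernel_size=(3, 3), activation='relu', input_shape=(32, 32, 3)))
model.add(MaxPooling2D(pool_size=(2, 2)))  # 30x30 feature maps -> 15x15
model.add(Conv2D(128, kernel_size=(3, 3), activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))  # 13x13 feature maps -> 6x6
model.add(Flatten())
model.add(Dense(1024, activation='relu'))
model.add(Dense(10, activation='softmax'))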

Final Words

As always, I hope you enjoyed the post, and that you gained an intuition about how to implement and develop a convolutional neural network!

If you liked this post then you can take a look at my other posts on Data Science and Machine Learning here.

If you want to learn more about Machine Learning, Data Science and Artificial Intelligence follow me on Medium, and stay tuned for my next posts!