Furthermore, because we calculate the mean and standard deviation for each mini-batch instead of for the whole dataset, batch norm also introduces some noise that acts as a regularizer and helps reduce overfitting.

This technique has proven to be very efficient for training networks faster.

We can see that the accuracy has improved by 2%, which is a big step once you are already at high numbers. But there is still room for improvement.

Let’s explore regularization.

Regularization
Regularization consists of penalizing, in some way, the predictions made by our network during training, so that it does not treat the training set as the absolute truth and therefore generalizes better when it sees other datasets.

Take a look at this graph:

https://commons.wikimedia.org/wiki/File:75hwQ.jpg
In this graph, we can see one example of overfitting, one of underfitting, and one that generalizes correctly.

Which is which?

Blue: over-fitting
Green: the good model with the ability to generalize
Orange: under-fitting
Now look at this example, which follows the earlier one with the 3 networks of different numbers of neurons. What we see now is the 20-neuron network with different levels of regularization.

You can play with these parameters here:

https://cs.stanford.edu/people/karpathy/convnetjs/demo/classify2d.html

and here’s a much more complete one:

https://playground.tensorflow.org/

In the end, it is much better to have a network with many layers and apply regularization than to have a small one just to avoid overfitting. This is because small networks are simpler functions with fewer local minima, so gradient descent reaches one or another depending heavily on the initialization, and the losses achieved usually show great variance across initializations.

Networks with many layers, however, are much more complicated functions with many more local minima which, although harder to reach, usually all have similar, and better, losses.

If you are interested in this topic: http://cs231n.github.io/neural-networks-1/#arch .

There are many methods of regularization. Here are the most common ones:

L2 regularization (Ridge regularization)
The L2 regularization is possibly the most common.

It consists of penalizing the loss function by adding the term 1/2 * λ * W**2 for each weight, which results in a total loss of the form L = L_data + 1/2 * λ * Σ W**2.

The 1/2 is simply for convenience when calculating the derivatives, as this leaves the gradient term as λ * W instead of 2 * λ * W.

What this means is that we penalize very high or disparate weights, and prefer them to be all of a similar magnitude. If you remember, what the weights imply is the importance of each neuron in the final calculation of the prediction. Therefore, by doing this, we get all the neurons to matter more or less equally, that is, the network will use all its neurons to make the prediction.

On the contrary, if certain neurons had very high weights, the calculation of the prediction would take those neurons much more into account, so we would end up with a network containing dead neurons that contribute nothing.

Moreover, introducing the term 1/2 * λ * W**2 into our loss function makes our weights decay toward zero during gradient descent, with a linear decay of W += -λ⋅W.
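To see this decay in isolation, here is a minimal NumPy sketch (not from the post, with a hypothetical λ) applying only the L2 penalty's gradient λ⋅W to a few weights. With no data-fitting term opposing it, every weight shrinks toward zero while keeping its sign:

```python
import numpy as np

# The gradient of the L2 term 1/2 * lam * W**2 is lam * W, so each
# descent step shrinks every weight proportionally to its own value.
lam = 0.1                          # regularization strength (hypothetical)
W = np.array([4.0, -2.0, 0.5])     # toy weights

for _ in range(50):
    W += -lam * W                  # the linear decay step W += -lam * W

print(W)  # all weights have shrunk close to zero, signs preserved
```

Note the shrinkage is multiplicative (each step scales W by 1 - λ), so weights approach zero but never become exactly zero, unlike what we will see with L1.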

Let’s see if we can improve our network by applying the L2 regularization:

# L2 Regularization

# Regularizer import
from keras.regularizers import l2

# Initializing the model
model = Sequential()

# Defining a convolutional layer
model.add(Conv2D(128, kernel_size=(3, 3), activation='relu', input_shape=(32, 32, 3)))

# Defining a second convolutional layer
model.add(Conv2D(128, kernel_size=(3, 3), activation='relu'))

# Defining a third convolutional layer
model.add(Conv2D(128, kernel_size=(3, 3), activation='relu'))

# Classifier inclusion
model.add(Flatten())
model.add(Dense(1024, activation='relu', kernel_regularizer=l2(0.01)))
model.add(Dense(10, activation='softmax'))

# Compiling the model
model.compile(loss='categorical_crossentropy', optimizer=Adam(lr=0.0001, decay=1e-6), metrics=['accuracy'])

# Training the model
model.fit(X_train_norm, to_categorical(Y_train), batch_size=128, shuffle=True, epochs=10, validation_data=(X_test_norm, to_categorical(Y_test)))

# Evaluating the model
scores = model.evaluate(X_test_norm, to_categorical(Y_test))
print('Loss: %.3f' % scores[0])
print('Accuracy: %.3f' % scores[1])
L1 regularization (Lasso regularization)
L1 is also quite common. This time, we add the term λ|W| to our loss function.

We can also combine L1 regularization with L2 in what is known as Elastic net regularization:

The L1 regularization manages to convert the W weight matrix into a sparse weight matrix (very close to zero, except for a few elements).
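The contrast with L2 can be seen on a toy one-dimensional problem. The closed-form minimizers below are standard results (not from the post): for each target value a, minimizing 0.5*(w - a)**2 plus an L2 penalty merely shrinks w, while the L1 penalty soft-thresholds it, sending small values exactly to zero:

```python
import numpy as np

# Toy problem: for each target a, minimize 0.5*(w - a)**2 plus the penalty.
# Standard closed-form solutions:
#   L2 penalty 0.5*lam*w**2 -> w = a / (1 + lam)              (shrinks, never zero)
#   L1 penalty lam*|w|      -> w = sign(a)*max(|a| - lam, 0)  (soft threshold)
lam = 0.5
targets = np.array([3.0, 0.3, -0.2, -2.0])

w_l2 = targets / (1 + lam)
w_l1 = np.sign(targets) * np.maximum(np.abs(targets) - lam, 0)

print(w_l2)  # every entry shrunk, none exactly zero
print(w_l1)  # the small entries are exactly zero: a sparse solution
```

This is exactly the sparsity described above: L1 zeroes out the weights below the threshold λ and keeps only the few that matter.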

This means that, unlike L2, it gives much more importance to some neurons than others, making the network more robust to possible noise.

Generally, L2 gives better results. You can use L1 if you have images in which you know that a certain number of features will give you a good classification and you do not want the network to be distorted by noise.

Let’s try L1, then L1+L2:

# L1 Regularization

# Regularizer import
from keras.regularizers import l1

# Initializing the model
model = Sequential()

# Defining a convolutional layer
model.add(Conv2D(128, kernel_size=(3, 3), activation='relu', input_shape=(32, 32, 3)))

# Defining a second convolutional layer
model.add(Conv2D(128, kernel_size=(3, 3), activation='relu'))

# Defining a third convolutional layer
model.add(Conv2D(128, kernel_size=(3, 3), activation='relu'))

# Classifier inclusion
model.add(Flatten())
model.add(Dense(1024, activation='relu', kernel_regularizer=l1(0.01)))
model.add(Dense(10, activation='softmax'))

# Compiling the model
model.compile(loss='categorical_crossentropy', optimizer=Adam(lr=0.0001, decay=1e-6), metrics=['accuracy'])

# Training the model
model.fit(X_train_norm, to_categorical(Y_train), batch_size=128, shuffle=True, epochs=10, validation_data=(X_test_norm, to_categorical(Y_test)))

# Evaluating the model
scores = model.evaluate(X_test_norm, to_categorical(Y_test))
print('Loss: %.3f' % scores[0])
print('Accuracy: %.3f' % scores[1])
# Elastic Net Regularization (L1 + L2)

# Regularizer import
from keras.regularizers import l1_l2

# Initializing the model
model = Sequential()

# Defining a convolutional layer
model.add(Conv2D(128, kernel_size=(3, 3), activation='relu', input_shape=(32, 32, 3)))

# Defining a second convolutional layer
model.add(Conv2D(128, kernel_size=(3, 3), activation='relu'))

# Defining a third convolutional layer
model.add(Conv2D(128, kernel_size=(3, 3), activation='relu'))

# Classifier inclusion
model.add(Flatten())
model.add(Dense(1024, activation='relu', kernel_regularizer=l1_l2(0.01, 0.01)))
model.add(Dense(10, activation='softmax'))

# Compiling the model
model.compile(loss='categorical_crossentropy', optimizer=Adam(lr=0.0001, decay=1e-6), metrics=['accuracy'])

# Training the model
model.fit(X_train_norm, to_categorical(Y_train), batch_size=128, shuffle=True, epochs=10, validation_data=(X_test_norm, to_categorical(Y_test)))

# Evaluating the model
scores = model.evaluate(X_test_norm, to_categorical(Y_test))
print('Loss: %.3f' % scores[0])
print('Accuracy: %.3f' % scores[1])
Max norm constraints
Another type of regularization is the one based on restrictions. For example, we could set a maximum threshold that the weights cannot exceed.

In practice, this is implemented by using gradient descent to compute the new value of the weights as we would normally do, but then the L2 norm of each neuron's weight vector is calculated and constrained so that it cannot exceed C, that is: ||W||₂ ≤ C.

Normally, C is equal to 3 or 4.
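The projection step described above can be sketched in a few lines of NumPy (a sketch with hypothetical weight values, not Keras internals): after a normal gradient update, each neuron's weight vector is rescaled only if its L2 norm exceeds C.

```python
import numpy as np

# Max-norm constraint sketch: after the usual gradient step, rescale each
# neuron's incoming weight vector so its L2 norm cannot exceed C.
C = 3.0
W = np.array([[0.5, 1.0, 2.0],    # one row per neuron (hypothetical values)
              [3.0, 4.0, 0.0]])

norms = np.linalg.norm(W, axis=1, keepdims=True)   # per-neuron L2 norms
W_clipped = W * np.minimum(1.0, C / norms)         # scale only rows with norm > C

print(np.linalg.norm(W_clipped, axis=1))  # every norm is now <= C
```

The first row (norm ≈ 2.29) is left untouched; the second (norm 5) is scaled down onto the sphere of radius C. This is a projection, not a penalty: it never changes the direction of a weight vector, only its length.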

What we achieve with this constraint is that the network does not "explode", that is, the weights cannot grow excessively.

Let’s see how this regularization goes:

# Max Norm Regularization

# Constraint import
from keras.constraints import max_norm

# Initializing the model
model = Sequential()

# Defining a convolutional layer
model.add(Conv2D(128, kernel_size=(3, 3), activation='relu', input_shape=(32, 32, 3)))

# Defining a second convolutional layer
model.add(Conv2D(128, kernel_size=(3, 3), activation='relu'))

# Defining a third convolutional layer
model.add(Conv2D(128, kernel_size=(3, 3), activation='relu'))

# Classifier inclusion
model.add(Flatten())
model.add(Dense(1024, activation='relu', kernel_constraint=max_norm(3.)))
model.add(Dense(10, activation='softmax'))

# Compiling the model
model.compile(loss='categorical_crossentropy', optimizer=Adam(lr=0.0001, decay=1e-6), metrics=['accuracy'])

# Training the model
model.fit(X_train_norm, to_categorical(Y_train), batch_size=128, shuffle=True, epochs=10, validation_data=(X_test_norm, to_categorical(Y_test)))

# Evaluating the model
scores = model.evaluate(X_test_norm, to_categorical(Y_test))
print('Loss: %.3f' % scores[0])
print('Accuracy: %.3f' % scores[1])
Dropout regularization
Dropout regularization is a technique developed by Srivastava et al. in their paper "Dropout: A Simple Way to Prevent Neural Networks from Overfitting" that complements the other types of regularization (L1, L2, max norm).

It is an extremely effective and simple technique: during training, each neuron is kept active with a probability p and set to 0 otherwise.

What we achieve with this is to change the architecture of the network at training time, which means there will not be a single neuron responsible for activating in response to a certain pattern; instead, we will have multiple redundant neurons capable of reacting to that pattern.
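The mechanism itself is tiny. Here is a sketch of the common "inverted dropout" variant (which is what Keras implements internally; values and shapes here are hypothetical): each activation is kept with probability p and scaled by 1/p, so the expected activation matches what the full network produces at test time.

```python
import numpy as np

# Inverted dropout sketch: keep each unit with probability p, scale by 1/p.
rng = np.random.default_rng(0)
p = 0.5                               # keep probability (hypothetical)
activations = np.ones((4, 8))         # dummy layer output

mask = (rng.random(activations.shape) < p) / p   # 0 or 1/p per unit
dropped = activations * mask

print(dropped)  # each unit is either zeroed out or scaled to 1/p = 2.0
```

A fresh random mask is drawn for every training batch, so the network effectively trains an ensemble of thinned sub-networks; at test time no mask is applied at all.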

Let’s see how applying dropout affects our results:

# Dropout

# Dropout layer import
from keras.layers import Dropout

# Initializing the model
model = Sequential()

# Defining a convolutional layer
model.add(Conv2D(128, kernel_size=(3, 3), activation='relu', input_shape=(32, 32, 3)))
model.add(Dropout(0.25))

# Defining a second convolutional layer
model.add(Conv2D(128, kernel_size=(3, 3), activation='relu'))
model.add(Dropout(0.25))

# Defining a third convolutional layer
model.add(Conv2D(128, kernel_size=(3, 3), activation='relu'))
model.add(Dropout(0.25))

# Classifier inclusion
model.add(Flatten())
model.add(Dense(1024, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(10, activation='softmax'))

# Compiling the model
model.compile(loss='categorical_crossentropy', optimizer=Adam(lr=0.0001, decay=1e-6), metrics=['accuracy'])

# Training the model
model.fit(X_train_norm, to_categorical(Y_train), batch_size=128, shuffle=True, epochs=10, validation_data=(X_test_norm, to_categorical(Y_test)))

# Evaluating the model
scores = model.evaluate(X_test_norm, to_categorical(Y_test))
print('Loss: %.3f' % scores[0])
print('Accuracy: %.3f' % scores[1])
And now, let’s see the effects of Max norm + Dropout:

# Dropout & Max Norm

# Dropout & Max Norm imports
from keras.layers import Dropout
from keras.constraints import max_norm

# Initializing the model
model = Sequential()

# Defining a convolutional layer
model.add(Conv2D(128, kernel_size=(3, 3), activation='relu', input_shape=(32, 32, 3)))
model.add(Dropout(0.25))

# Defining a second convolutional layer
model.add(Conv2D(128, kernel_size=(3, 3), activation='relu'))
model.add(Dropout(0.25))

# Defining a third convolutional layer
model.add(Conv2D(128, kernel_size=(3, 3), activation='relu'))
model.add(Dropout(0.25))

# Classifier inclusion
model.add(Flatten())
model.add(Dense(1024, activation='relu', kernel_constraint=max_norm(3.)))
model.add(Dropout(0.5))
model.add(Dense(10, activation='softmax'))

# Compiling the model
model.compile(loss='categorical_crossentropy', optimizer=Adam(lr=0.0001, decay=1e-6), metrics=['accuracy'])

# Training the model
model.fit(X_train_norm, to_categorical(Y_train), batch_size=128, shuffle=True, epochs=10, validation_data=(X_test_norm, to_categorical(Y_test)))

# Evaluating the model
scores = model.evaluate(X_test_norm, to_categorical(Y_test))
print('Loss: %.3f' % scores[0])
print('Accuracy: %.3f' % scores[1])
There are more techniques to deal with overfitting, such as max pooling, changing the strides, etc. In practice, it is best to apply several of them and test which combination provides the best result for the problem at hand.

Final Words
As always, I hope you enjoyed the post, and that you gained an intuition about how to implement and develop a convolutional neural network!

If you liked this post then you can take a look at my other posts on Data Science and Machine Learning here.

If you want to learn more about Machine Learning, Data Science and Artificial Intelligence follow me on Medium, and stay tuned for my next posts!