UNDERSTANDING RESIDUAL NETWORKS

Image recognition has advanced in recent years thanks to the availability of large datasets and powerful GPUs, which have enabled the training of very deep architectures. Simonyan and Zisserman, the authors of VGG, demonstrated that simply stacking more layers can improve accuracy. Prior to this, in 2009, Yoshua Bengio, in his monograph “Learning Deep Architectures for AI”, gave a convincing theoretical analysis of the effectiveness of deep architectures.
In previous posts, I demonstrated how to apply various techniques, including batch normalization, dropout and data augmentation, to convolutional neural networks. Can we build more accurate systems by simply stacking more and more convolution-batch normalization-ReLU blocks? Up to a point, accuracy improves, but beyond about 25 layers it begins to drop instead.
Kaiming He et al. first demonstrated this depth problem in 2015 and proposed a remarkable solution, which has since allowed the training of networks with over 1000 layers, with increasing accuracy.
In this post, I will explain their technique and how to apply it.
First, accuracy diminishes as networks grow deeper largely because of vanishing gradients: as gradients are backpropagated through many layers they become smaller and smaller, so the early layers learn very slowly and performance worsens. This has nothing to do with overfitting, hence dropout cannot salvage it.
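To see why depth hurts, remember that backpropagation multiplies gradients layer by layer; if each layer scales the gradient by a factor below 1, the signal reaching the early layers shrinks exponentially with depth. The sketch below is purely illustrative, using a made-up per-layer factor of 0.8:

# Toy illustration (not a real network): if each layer scales the
# backpropagated gradient by ~0.8, the gradient reaching the
# earliest layers decays exponentially with depth.
factor = 0.8
for depth in [5, 10, 25, 50]:
    print("depth", depth, "-> gradient scale ~", factor ** depth)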
The solution devised by Kaiming He and his colleagues at Microsoft Research Asia was to introduce residual connections: a simple term for adding the output of earlier layers to the output of later layers.

Suppose you have a seven-layer network. In a residual setup, you would not only pass the output of layer 1 to layer 2 and onward, you would also add the output of layer 1 to the output of layer 2.
Denoting each layer by f(x):

In a standard network, y = f(x)

However, in a residual network, y = f(x) + x
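This one-line change matters for gradient flow: differentiating y = f(x) + x gives dy/dx = f'(x) + 1, so even when f'(x) is nearly zero, the shortcut contributes a constant 1 and the gradient can pass through unattenuated. Below is a toy scalar sketch; f here is a made-up stand-in for a layer whose gradient has almost vanished.

# Toy illustration: f(x) = 0.01 * x plays the role of a layer
# whose gradient f'(x) = 0.01 has almost vanished.
def f(x):
    return 0.01 * x

x = 2.0
print("plain layer:    y =", f(x))       # dy/dx = f'(x)     = 0.01
print("residual layer: y =", f(x) + x)   # dy/dx = f'(x) + 1 = 1.01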

Typical Structure of a ResNet Module

Applying this principle, the authors won ImageNet 2015 and set new state-of-the-art results on all the standard computer vision benchmarks. The idea has since spread to other domains of deep learning, including speech and natural language processing.

Enough with the basic maths, let’s get our hands dirty with code.

A standard two-layer module looks like this:

def Unit(x, filters):
    out = BatchNormalization()(x)
    out = Activation("relu")(out)
    out = Conv2D(filters=filters, kernel_size=[3, 3], strides=[1, 1], padding="same")(out)

    out = BatchNormalization()(out)
    out = Activation("relu")(out)
    out = Conv2D(filters=filters, kernel_size=[3, 3], strides=[1, 1], padding="same")(out)

    return out

To recap: in this module we take an input x through batch normalization, ReLU and a 3 x 3 convolution, then pass the result through the same stack a second time.
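As a quick sanity check (assuming the Keras imports from the full training script further down), you can wrap this unit in a model and confirm that padding="same" with stride 1 preserves the spatial size:

from keras.layers import Input
from keras.models import Model

inp = Input((32, 32, 3))
m = Model(inputs=inp, outputs=Unit(inp, filters=32))
print(m.output_shape)   # (None, 32, 32, 32): 32 x 32 preserved, 32 feature maps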

Below is a ResNet module:

def Unit(x, filters):
    # keep a reference to the input; this is the residual
    res = x

    out = BatchNormalization()(x)
    out = Activation("relu")(out)
    out = Conv2D(filters=filters, kernel_size=[3, 3], strides=[1, 1], padding="same")(out)

    out = BatchNormalization()(out)
    out = Activation("relu")(out)
    out = Conv2D(filters=filters, kernel_size=[3, 3], strides=[1, 1], padding="same")(out)

    # add the residual to the output: y = f(x) + x
    out = keras.layers.add([res, out])

    return out

This looks very similar, but with one major difference: first, we store a reference “res” to the original input, and after passing through the batchnorm-relu-conv layers, we add the output to the residual. For clarity, this is done in the line

out = keras.layers.add([res,out])

This part corresponds to the equation y = f(x) + x.
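As an aside, keras.layers.add is the functional helper around the Add layer class, so the same merge can be written either way; in both cases the sum is elementwise, so the two tensors must have identical shapes:

# Equivalent to out = keras.layers.add([res, out]); Keras raises an
# error at graph-construction time if the shapes do not match.
from keras.layers import Add
out = Add()([res, out])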

So we can build a ResNet by stacking many of these modules together.

Before that, we need to slightly modify the code to account for pooling.

def Unit(x, filters, pool=False):
    res = x
    if pool:
        x = MaxPooling2D(pool_size=(2, 2))(x)
        # project the residual to the new width and half the spatial size
        res = Conv2D(filters=filters, kernel_size=[1, 1], strides=(2, 2), padding="same")(res)

    out = BatchNormalization()(x)
    out = Activation("relu")(out)
    out = Conv2D(filters=filters, kernel_size=[3, 3], strides=[1, 1], padding="same")(out)

    out = BatchNormalization()(out)
    out = Activation("relu")(out)
    out = Conv2D(filters=filters, kernel_size=[3, 3], strides=[1, 1], padding="same")(out)

    out = keras.layers.add([res, out])

    return out

Note something above: when we pool, the dimensions of our output no longer match the dimensions of our residual. Hence, we not only apply pooling to the input; the residual is also transformed by a strided 1 x 1 convolution, which projects it to the same number of filters as the output, while the stride of 2 halves the spatial dimensions just as the max pooling does.
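You can verify that the two paths agree, for example with a hypothetical 32 x 32 input with 32 channels going to 64 filters (again assuming the imports from the training script below):

from keras.layers import Input
from keras.models import Model

inp = Input((32, 32, 32))
m = Model(inputs=inp, outputs=Unit(inp, filters=64, pool=True))
# Main path: 2 x 2 max pooling halves 32 x 32 to 16 x 16; the convs keep that size.
# Residual path: the 1 x 1 conv with stride 2 also yields 16 x 16 with 64 filters.
print(m.output_shape)   # (None, 16, 16, 64)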

With that explained, I shall now present the full ResNet model, built by stacking the Unit defined above:

def MiniModel(input_shape):
    images = Input(input_shape)
    net = Conv2D(filters=32, kernel_size=[3, 3], strides=[1, 1], padding="same")(images)

    net = Unit(net, 32)
    net = Unit(net, 32)
    net = Unit(net, 32)

    net = Unit(net, 64, pool=True)
    net = Unit(net, 64)
    net = Unit(net, 64)

    net = Unit(net, 128, pool=True)
    net = Unit(net, 128)
    net = Unit(net, 128)

    net = Unit(net, 256, pool=True)
    net = Unit(net, 256)
    net = Unit(net, 256)

    net = BatchNormalization()(net)
    net = Activation("relu")(net)
    net = Dropout(0.25)(net)

    net = AveragePooling2D(pool_size=(4, 4))(net)
    net = Flatten()(net)
    net = Dense(units=10, activation="softmax")(net)

    model = Model(inputs=images, outputs=net)

    return model

As you can see, this excludes the training code; the full training script is below, with the number of epochs set to 50.

You can run this for free on a GPU with Google Colab.

#import needed classes
import keras
from keras.datasets import cifar10
from keras.layers import Dense, Conv2D, MaxPooling2D, Flatten, AveragePooling2D, Dropout, BatchNormalization, Activation, Input
from keras.models import Model
from keras.optimizers import Adam
from keras.preprocessing.image import ImageDataGenerator
from math import ceil


def Unit(x, filters, pool=False):
    res = x
    if pool:
        x = MaxPooling2D(pool_size=(2, 2))(x)
        # project the residual to the new width and half the spatial size
        res = Conv2D(filters=filters, kernel_size=[1, 1], strides=(2, 2), padding="same")(res)

    out = BatchNormalization()(x)
    out = Activation("relu")(out)
    out = Conv2D(filters=filters, kernel_size=[3, 3], strides=[1, 1], padding="same")(out)

    out = BatchNormalization()(out)
    out = Activation("relu")(out)
    out = Conv2D(filters=filters, kernel_size=[3, 3], strides=[1, 1], padding="same")(out)

    out = keras.layers.add([res, out])

    return out

#Define the model


def MiniModel(input_shape):
    images = Input(input_shape)
    net = Conv2D(filters=32, kernel_size=[3, 3], strides=[1, 1], padding="same")(images)

    net = Unit(net, 32)
    net = Unit(net, 32)
    net = Unit(net, 32)

    net = Unit(net, 64, pool=True)
    net = Unit(net, 64)
    net = Unit(net, 64)

    net = Unit(net, 128, pool=True)
    net = Unit(net, 128)
    net = Unit(net, 128)

    net = Unit(net, 256, pool=True)
    net = Unit(net, 256)
    net = Unit(net, 256)

    net = BatchNormalization()(net)
    net = Activation("relu")(net)
    net = Dropout(0.25)(net)

    net = AveragePooling2D(pool_size=(4, 4))(net)
    net = Flatten()(net)
    net = Dense(units=10, activation="softmax")(net)

    model = Model(inputs=images, outputs=net)

    return model

#load the cifar10 dataset
(train_x, train_y) , (test_x, test_y) = cifar10.load_data()

#normalize the data
train_x = train_x.astype('float32') / 255
test_x = test_x.astype('float32') / 255

#Subtract the mean from both the train and test set
#(for simplicity, each set uses its own statistics here)
train_x = train_x - train_x.mean()
test_x = test_x - test_x.mean()

#Divide by the per-pixel standard deviation
train_x = train_x / train_x.std(axis=0)
test_x = test_x / test_x.std(axis=0)


datagen = ImageDataGenerator(rotation_range=10,
                             width_shift_range=5. / 32,
                             height_shift_range=5. / 32,
                             horizontal_flip=True)

# datagen.fit computes the statistics (mean, std, principal components)
# needed only for featurewise normalization or ZCA whitening;
# none of those options are enabled here, so this call is optional.
datagen.fit(train_x)



#Encode the labels to vectors
train_y = keras.utils.to_categorical(train_y,10)
test_y = keras.utils.to_categorical(test_y,10)

#Create the model
input_shape = (32, 32, 3)
model = MiniModel(input_shape)

#Print a Summary of the model

model.summary()
#Specify the training components
model.compile(optimizer=Adam(0.001),loss="categorical_crossentropy",metrics=["accuracy"])



epochs = 50
steps_per_epoch = ceil(50000/128)

# Fit the model on the batches generated by datagen.flow().
model.fit_generator(datagen.flow(train_x, train_y, batch_size=128),
validation_data=[test_x,test_y],
epochs=epochs,steps_per_epoch=steps_per_epoch, verbose=1, workers=4)


#Evaluate on the test set; evaluate returns [loss, accuracy]
accuracy = model.evaluate(x=test_x, y=test_y, batch_size=128)

#Save the trained model
model.save("cifar10model.h5")
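
Once training finishes, the saved model can be reloaded for inference. A minimal sketch, reusing the test_x array prepared above:

from keras.models import load_model

model = load_model("cifar10model.h5")
predictions = model.predict(test_x[:8])   # class probabilities, shape (8, 10)
print(predictions.argmax(axis=1))         # predicted class indices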

Hope you enjoyed this. Future posts will be dedicated to ways of easily scaling residual networks by width and depth, as well as ways of adjusting the learning rate.

Have a great time, and don’t forget to leave some claps!

You can always reach me on Twitter via @johnolafenwa
