Image recognition has advanced rapidly in recent years thanks to the availability of large datasets and powerful GPUs, which have enabled the training of very deep architectures. Simonyan et al., the authors of VGG, demonstrated that simply stacking more layers can improve accuracy. Prior to this, in 2009, Yoshua Bengio, in his monograph “Learning Deep Architectures for AI”, gave a convincing theoretical analysis of the effectiveness of deep architectures.

In previous posts, I demonstrated how to apply various techniques, including batch normalization, dropout and data augmentation, to convolutional neural networks. Can we build more accurate systems by simply stacking more and more convolution-batch normalization-relu layers? Up to a point accuracy would improve, but beyond about 25 layers it would instead begin to drop.

Kaiming He et al. (2015) first demonstrated this depth problem and proposed a remarkable solution, which has since allowed the training of networks with over 2000 layers, with increasing accuracy.

In this post, I will explain their technique and how to apply it.

First, accuracy diminishes over many layers due to vanishing gradients: as the network gets deeper, the gradients flowing back to the early layers become vanishingly small, leading to worse performance. This has nothing to do with overfitting, so dropout cannot fix it.
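To get an intuition for why this happens, here is a toy sketch (my own illustration, not from the ResNet paper): backpropagation multiplies one local derivative per layer, and when each of those derivatives is small (the sigmoid's, for instance, is at most 0.25), the product shrinks geometrically with depth.

```python
def gradient_magnitude(depth, local_grad=0.25):
    """Toy model: gradient after backpropagating through `depth` layers,
    each contributing a local derivative of `local_grad` (0.25 is the
    maximum derivative of the sigmoid activation)."""
    return local_grad ** depth

for depth in [5, 10, 25, 50]:
    print(depth, gradient_magnitude(depth))
```

At 25 layers the toy gradient is already below 10^-15, which is why the early layers of a very deep plain network barely learn.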

The solution devised by Kaiming He and his colleagues at Microsoft Research Asia was to introduce residual connections: a simple term for connecting the output of previous layers to the output of new layers.

Suppose you have a seven-layer network. In a residual setup, you would not only pass the output of layer 1 to layer 2 and onwards, but you would also add the output of layer 1 to the output of layer 2.

Denoting each layer by *f(x)*:

In a standard network *y = f(x)*

However, in a residual network,

*y = f(x) + x*
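As a quick sketch (my own illustration, not from the original post), the residual form can be written in plain NumPy to show why the skip path matters: even when the learned transformation f(x) is close to zero, as it tends to be early in training, the input still flows through unchanged.

```python
import numpy as np

def f(x):
    # Stand-in for a weight layer whose output is tiny (e.g., early in training)
    return 0.001 * x

x = np.array([1.0, -2.0, 3.0])

plain = f(x)          # plain network: the signal nearly vanishes
residual = f(x) + x   # residual network: the input is preserved

print(plain)
print(residual)
```

The gradient tells the same story: differentiating *y = f(x) + x* gives *dy/dx = f′(x) + 1*, so the skip connection always contributes a gradient of 1, no matter how small *f′(x)* becomes.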

Applying this principle, the authors won ImageNet 2015 and reached new state-of-the-art results on all standard computer vision benchmarks. The idea has since spread to all other domains of deep learning, including speech and natural language processing.

Enough with the basic maths; let’s get our hands dirty with code.

A standard two-layer module looks like this:

```python
def Unit(x, filters):
    out = BatchNormalization()(x)
    out = Activation("relu")(out)
    out = Conv2D(filters=filters, kernel_size=[3, 3], strides=[1, 1], padding="same")(out)
    out = BatchNormalization()(out)
    out = Activation("relu")(out)
    out = Conv2D(filters=filters, kernel_size=[3, 3], strides=[1, 1], padding="same")(out)
    return out
```

To recap: in this module, we pass in an input x and take it through batch normalization, relu and conv2d; the output is then taken through the same stack again.

Below is a ResNet module:

```python
def Unit(x, filters):
    res = x
    out = BatchNormalization()(x)
    out = Activation("relu")(out)
    out = Conv2D(filters=filters, kernel_size=[3, 3], strides=[1, 1], padding="same")(out)
    out = BatchNormalization()(out)
    out = Activation("relu")(out)
    out = Conv2D(filters=filters, kernel_size=[3, 3], strides=[1, 1], padding="same")(out)
    out = keras.layers.add([res, out])
    return out
```

This looks very similar, but with one major difference: first, we store a reference “res” to the original input, and after passing through the batchnorm-relu-conv layers, we add the output to the residual. This is done in the line

```python
out = keras.layers.add([res, out])
```

which corresponds to the equation *y = f(x) + x*.

So we can build a ResNet by stacking many of these modules together.

Before that, we need to slightly modify the code to account for pooling.

```python
def Unit(x, filters, pool=False):
    res = x
    if pool:
        x = MaxPooling2D(pool_size=(2, 2))(x)
        res = Conv2D(filters=filters, kernel_size=[1, 1], strides=(2, 2), padding="same")(res)
    out = BatchNormalization()(x)
    out = Activation("relu")(out)
    out = Conv2D(filters=filters, kernel_size=[3, 3], strides=[1, 1], padding="same")(out)
    out = BatchNormalization()(out)
    out = Activation("relu")(out)
    out = Conv2D(filters=filters, kernel_size=[3, 3], strides=[1, 1], padding="same")(out)
    out = keras.layers.add([res, out])
    return out
```

Note something above: when we pool, the dimensions of our output no longer match the dimensions of our residual. Hence, we not only apply pooling to the input; the residual is also transformed by a strided 1 x 1 convolution, which projects its filters to match the output, while the stride of 2 halves its spatial dimensions just as the max pooling does.

Having explained all that, I shall now present the full ResNet.

```python
def Unit(x, filters, pool=False):
    res = x
    if pool:
        x = MaxPooling2D(pool_size=(2, 2))(x)
        res = Conv2D(filters=filters, kernel_size=[1, 1], strides=(2, 2), padding="same")(res)
    out = BatchNormalization()(x)
    out = Activation("relu")(out)
    out = Conv2D(filters=filters, kernel_size=[3, 3], strides=[1, 1], padding="same")(out)
    out = BatchNormalization()(out)
    out = Activation("relu")(out)
    out = Conv2D(filters=filters, kernel_size=[3, 3], strides=[1, 1], padding="same")(out)
    out = keras.layers.add([res, out])
    return out

def MiniModel(input_shape):
    images = Input(input_shape)
    net = Conv2D(filters=32, kernel_size=[3, 3], strides=[1, 1], padding="same")(images)
    net = Unit(net, 32)
    net = Unit(net, 32)
    net = Unit(net, 32)
    net = Unit(net, 64, pool=True)
    net = Unit(net, 64)
    net = Unit(net, 64)
    net = Unit(net, 128, pool=True)
    net = Unit(net, 128)
    net = Unit(net, 128)
    net = Unit(net, 256, pool=True)
    net = Unit(net, 256)
    net = Unit(net, 256)
    net = BatchNormalization()(net)
    net = Activation("relu")(net)
    net = Dropout(0.25)(net)
    net = AveragePooling2D(pool_size=(4, 4))(net)
    net = Flatten()(net)
    net = Dense(units=10, activation="softmax")(net)
    model = Model(inputs=images, outputs=net)
    return model
```

As you can see, this excludes the training code. Below is the full code including training, with epochs set to 50.

You can run this for free on a GPU with Google Colab.

```python
# import needed classes
import os
from math import ceil

import keras
from keras.datasets import cifar10
from keras.layers import Dense, Conv2D, MaxPooling2D, Flatten, AveragePooling2D, Dropout, BatchNormalization, Activation
from keras.models import Model, Input
from keras.optimizers import Adam
from keras.callbacks import LearningRateScheduler
from keras.callbacks import ModelCheckpoint
from keras.preprocessing.image import ImageDataGenerator

def Unit(x, filters, pool=False):
    res = x
    if pool:
        x = MaxPooling2D(pool_size=(2, 2))(x)
        res = Conv2D(filters=filters, kernel_size=[1, 1], strides=(2, 2), padding="same")(res)
    out = BatchNormalization()(x)
    out = Activation("relu")(out)
    out = Conv2D(filters=filters, kernel_size=[3, 3], strides=[1, 1], padding="same")(out)
    out = BatchNormalization()(out)
    out = Activation("relu")(out)
    out = Conv2D(filters=filters, kernel_size=[3, 3], strides=[1, 1], padding="same")(out)
    out = keras.layers.add([res, out])
    return out

# Define the model
def MiniModel(input_shape):
    images = Input(input_shape)
    net = Conv2D(filters=32, kernel_size=[3, 3], strides=[1, 1], padding="same")(images)
    net = Unit(net, 32)
    net = Unit(net, 32)
    net = Unit(net, 32)
    net = Unit(net, 64, pool=True)
    net = Unit(net, 64)
    net = Unit(net, 64)
    net = Unit(net, 128, pool=True)
    net = Unit(net, 128)
    net = Unit(net, 128)
    net = Unit(net, 256, pool=True)
    net = Unit(net, 256)
    net = Unit(net, 256)
    net = BatchNormalization()(net)
    net = Activation("relu")(net)
    net = Dropout(0.25)(net)
    net = AveragePooling2D(pool_size=(4, 4))(net)
    net = Flatten()(net)
    net = Dense(units=10, activation="softmax")(net)
    model = Model(inputs=images, outputs=net)
    return model

# load the cifar10 dataset
(train_x, train_y), (test_x, test_y) = cifar10.load_data()

# normalize the data
train_x = train_x.astype('float32') / 255
test_x = test_x.astype('float32') / 255

# Subtract the mean image from both train and test sets
train_x = train_x - train_x.mean()
test_x = test_x - test_x.mean()

# Divide by the standard deviation
train_x = train_x / train_x.std(axis=0)
test_x = test_x / test_x.std(axis=0)

datagen = ImageDataGenerator(rotation_range=10,
                             width_shift_range=5. / 32,
                             height_shift_range=5. / 32,
                             horizontal_flip=True)

# Compute quantities required for featurewise normalization
# (std, mean, and principal components if ZCA whitening is applied).
datagen.fit(train_x)

# Encode the labels to vectors
train_y = keras.utils.to_categorical(train_y, 10)
test_y = keras.utils.to_categorical(test_y, 10)

# define a common unit
input_shape = (32, 32, 3)
model = MiniModel(input_shape)

# Print a summary of the model
model.summary()

# Specify the training components
model.compile(optimizer=Adam(0.001), loss="categorical_crossentropy", metrics=["accuracy"])

epochs = 50
steps_per_epoch = ceil(50000 / 128)

# Fit the model on the batches generated by datagen.flow()
model.fit_generator(datagen.flow(train_x, train_y, batch_size=128),
                    validation_data=[test_x, test_y],
                    epochs=epochs, steps_per_epoch=steps_per_epoch, verbose=1, workers=4)

# Evaluate the accuracy on the test dataset
accuracy = model.evaluate(x=test_x, y=test_y, batch_size=128)
model.save("cifar10model.h5")
```

I hope you enjoyed this. Future posts will be dedicated to ways of easily scaling your residual networks by width and depth, as well as ways of adjusting the learning rate.

Have a great time, and don’t forget to leave some claps!

You can always reach me on Twitter via @johnolafenwa

Source: Deep Learning on Medium