Five Powerful CNN Architectures

Source: Deep Learning on Medium

Let’s go over some of the powerful Convolutional Neural Networks that laid the foundation for today’s Deep Learning-based Computer Vision achievements.

LeNet-5 — LeCun et al

LeNet-5, a 7-layer Convolutional Neural Network, was deployed in many banking systems to recognize handwritten digits on cheques.

LeNet-5 — Architecture

The handwritten digits were digitized into grayscale images of size 32×32 pixels. At the time, computational capacity was limited, so the technique didn’t scale to larger images.

Let’s understand the architecture. The model contains 7 layers, excluding the input layer. Since it is a relatively small architecture, let’s go layer by layer:

  1. Layer 1: A convolutional layer with a kernel size of 5×5, stride of 1×1 and 6 kernels in total. So the input image of size 32x32x1 gives an output of 28x28x6. Total params in layer = 5 * 5 * 1 * 6 + 6 (bias terms) = 156
  2. Layer 2: A pooling layer with 2×2 kernel size, stride of 2×2 and 6 filters in total. This pooling layer acted a little differently from the one we discussed in the previous post: the input values in the receptive field were summed up and multiplied by a trainable coefficient (1 per filter), then a trainable bias (1 per filter) was added. Finally, a sigmoid activation was applied to the output. So, the input from the previous layer of size 28x28x6 gets sub-sampled to 14x14x6. Total params in layer = [1 (trainable coefficient) + 1 (trainable bias)] * 6 = 12
  3. Layer 3: Similar to Layer 1, this layer is a convolutional layer with the same configuration except it has 16 filters instead of 6. So the input from the previous layer of size 14x14x6 gives an output of 10x10x16. With full connectivity, total params in layer = 5 * 5 * 6 * 16 + 16 = 2,416. (The original paper actually connected each of the 16 output maps to only a subset of the 6 input maps, giving 1,516 params.)
  4. Layer 4: Again, similar to Layer 2, this layer is a pooling layer with 16 filters this time around. Remember, the outputs are passed through the sigmoid activation function. The input of size 10x10x16 from the previous layer gets sub-sampled to 5x5x16. Total params in layer = (1 + 1) * 16 = 32
  5. Layer 5: This time around we have a convolutional layer with 5×5 kernel size and 120 filters. Since the input size is 5x5x16, there is no need to even consider strides, and we get an output of 1x1x120. Total params in layer = 5 * 5 * 16 * 120 + 120 = 48,120
  6. Layer 6: This is a dense layer with 84 units. So, the input of 120 units is converted to 84 units. Total params = 120 * 84 + 84 = 10,164. The activation function used here (a scaled tanh) was rather unusual by today’s standards; you can try any activation of your choice, as the task is a pretty simple one by current measures.
  7. Output Layer: Finally, a dense layer with 10 units is used. Total params = 84 * 10 + 10 = 850. (The original paper used Euclidean RBF output units rather than a plain dense layer.)
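The per-layer counts follow from the standard parameter formulas; here is a quick pure-Python check (`conv_params` and `dense_params` are helper names introduced here, and Layer 3 assumes full connectivity rather than the paper’s sparse connection table):

```python
# Conv layer params: kernel_h * kernel_w * in_channels * out_channels + out_channels biases
def conv_params(k, c_in, c_out):
    return k * k * c_in * c_out + c_out

# Dense layer params: in_units * out_units + out_units biases
def dense_params(n_in, n_out):
    return n_in * n_out + n_out

print(conv_params(5, 1, 6))      # Layer 1: 156
print(conv_params(5, 6, 16))     # Layer 3, full connectivity: 2416
print(conv_params(5, 16, 120))   # Layer 5: 48120
print(dense_params(120, 84))     # Layer 6: 10164
print(dense_params(84, 10))      # Output layer: 850
```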

Skipping over the details of the original loss function and why it was used, I would suggest using cross-entropy loss with a softmax activation in the last layer. Try out different training schedules and learning rates.

LeNet-5 — CODE

from keras import layers
from keras.models import Model

def lenet_5(in_shape=(32,32,1), n_classes=10, opt='sgd'):
    in_layer = layers.Input(in_shape)
    conv1 = layers.Conv2D(filters=20, kernel_size=5,
                          padding='same', activation='relu')(in_layer)
    pool1 = layers.MaxPool2D()(conv1)
    conv2 = layers.Conv2D(filters=50, kernel_size=5,
                          padding='same', activation='relu')(pool1)
    pool2 = layers.MaxPool2D()(conv2)
    flatten = layers.Flatten()(pool2)
    dense1 = layers.Dense(500, activation='relu')(flatten)
    preds = layers.Dense(n_classes, activation='softmax')(dense1)

    model = Model(in_layer, preds)
    model.compile(loss="categorical_crossentropy", optimizer=opt,
                  metrics=["accuracy"])
    return model

if __name__ == '__main__':
    model = lenet_5()

AlexNet — Krizhevsky et al

In 2012, a jaw-dropping moment occurred when Krizhevsky, Sutskever and Hinton’s deep neural network cut the top-5 error rate from 26% to 15.3% in the world’s most significant computer vision challenge, ImageNet.

The network was very similar to LeNet but was much deeper, with around 60 million parameters.

AlexNet — Architecture

Well, that figure certainly looks scary. That’s because the network was split into two halves, trained simultaneously on two different GPUs. Let’s make this a little easier and bring a simpler version into the picture:

The architecture consists of 5 Convolutional Layers and 3 Fully Connected Layers. These 8 layers, combined with two ingredients that were relatively new at the time, MaxPooling and ReLU activation, gave the model its edge.

You can see the various layers and their configuration in the figure above. The layers are described in the table below:

Note: ReLU activation is applied to the output of every Convolution and Fully Connected layer except the last softmax layer.

Various other techniques were used by the authors (a few of them will be discussed in upcoming posts): dropout, data augmentation and Stochastic Gradient Descent with momentum.
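SGD with momentum keeps a running "velocity" of past gradients instead of following each raw gradient. A minimal pure-Python sketch of one scalar update step (the learning rate and momentum values here are illustrative, not the paper's):

```python
def sgd_momentum_step(w, grad, velocity, lr=0.01, momentum=0.9):
    """One SGD-with-momentum update for a single scalar weight."""
    velocity = momentum * velocity - lr * grad  # decaying accumulation of gradients
    w = w + velocity                            # step along the velocity, not the raw gradient
    return w, velocity

w, v = 1.0, 0.0
w, v = sgd_momentum_step(w, grad=0.5, velocity=v)  # v picks up -lr*grad
w, v = sgd_momentum_step(w, grad=0.5, velocity=v)  # repeated gradients compound the step
```

Repeated gradients in the same direction make the velocity grow, which is what helps momentum plough through flat or noisy regions of the loss surface.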

AlexNet — CODE

from keras import layers
from keras.models import Model

def alexnet(in_shape=(227,227,3), n_classes=1000, opt='sgd'):
    in_layer = layers.Input(in_shape)
    conv1 = layers.Conv2D(96, 11, strides=4, activation='relu')(in_layer)
    pool1 = layers.MaxPool2D(3, 2)(conv1)
    conv2 = layers.Conv2D(256, 5, strides=1, padding='same', activation='relu')(pool1)
    pool2 = layers.MaxPool2D(3, 2)(conv2)
    conv3 = layers.Conv2D(384, 3, strides=1, padding='same', activation='relu')(pool2)
    conv4 = layers.Conv2D(384, 3, strides=1, padding='same', activation='relu')(conv3)
    conv5 = layers.Conv2D(256, 3, strides=1, padding='same', activation='relu')(conv4)
    pool3 = layers.MaxPool2D(3, 2)(conv5)
    flattened = layers.Flatten()(pool3)
    dense1 = layers.Dense(4096, activation='relu')(flattened)
    drop1 = layers.Dropout(0.5)(dense1)
    dense2 = layers.Dense(4096, activation='relu')(drop1)
    drop2 = layers.Dropout(0.5)(dense2)
    preds = layers.Dense(n_classes, activation='softmax')(drop2)

    model = Model(in_layer, preds)
    model.compile(loss="categorical_crossentropy", optimizer=opt,
                  metrics=["accuracy"])
    return model

if __name__ == '__main__':
    model = alexnet()

VGGNet — Simonyan et al

The runner-up of the 2014 ImageNet challenge was VGGNet. Because of the simplicity of its uniform architecture, it appeals to newcomers as a simpler form of a deep convolutional neural network.

In future posts, we will see how this network is one of the most used choices for feature extraction from images (taking images and converting them to a smaller dimensional array that contains important information regarding the image).

VGGNet — Architecture

VGGNet has 2 simple rules of thumb:

  1. Each Convolutional layer has the configuration: kernel size = 3×3, stride = 1×1, padding = same. The only thing that differs is the number of filters.
  2. Each Max Pooling layer has the configuration: window size = 2×2 and stride = 2×2. Thus, we halve the size of the image at every Pooling layer.

The input image was an RGB image of 224×224 pixels, so the input size is 224x224x3.
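Following the second rule of thumb, the spatial size only changes at the pooling layers; starting from 224, VGG’s five poolings produce the 7×7 grid that the first fully connected layer sees. A quick sketch:

```python
size = 224
sizes = [size]
for _ in range(5):   # VGG has 5 max-pooling stages
    size //= 2       # each 2x2 / stride-2 pool halves height and width
    sizes.append(size)
print(sizes)  # [224, 112, 56, 28, 14, 7]
```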

Total Params = 138 million. Most of these parameters are contributed by fully connected layers.

  • The first FC layer contributes = 4096 * (7 * 7 * 512) + 4096 = 102,764,544
  • The second FC layer contributes = 4096 * 4096 + 4096 = 16,781,312
  • The third FC layer contributes = 4096 * 1000 + 1000 = 4,097,000 (note: one bias per output unit, so + 1000, not + 4096)

Total params contributed by FC layers = 123,642,856.
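The FC contributions can be tallied directly (weights = inputs × outputs, plus one bias per output unit):

```python
fc1 = 4096 * (7 * 7 * 512) + 4096   # flattened 7x7x512 feature map -> 4096 units
fc2 = 4096 * 4096 + 4096            # 4096 -> 4096
fc3 = 4096 * 1000 + 1000            # 4096 -> 1000 classes
print(fc1, fc2, fc3, fc1 + fc2 + fc3)
```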


VGGNet — CODE

from keras import layers
from keras.models import Model

from functools import partial

conv3 = partial(layers.Conv2D,
                kernel_size=3, strides=1, padding='same', activation='relu')

def block(in_tensor, filters, n_convs):
    conv_block = in_tensor
    for _ in range(n_convs):
        conv_block = conv3(filters=filters)(conv_block)
    return conv_block

def _vgg(in_shape=(224,224,3), n_classes=1000, opt='sgd',
         n_stages_per_blocks=[2, 2, 3, 3, 3]):
    in_layer = layers.Input(in_shape)

    block1 = block(in_layer, 64, n_stages_per_blocks[0])
    pool1 = layers.MaxPool2D()(block1)
    block2 = block(pool1, 128, n_stages_per_blocks[1])
    pool2 = layers.MaxPool2D()(block2)
    block3 = block(pool2, 256, n_stages_per_blocks[2])
    pool3 = layers.MaxPool2D()(block3)
    block4 = block(pool3, 512, n_stages_per_blocks[3])
    pool4 = layers.MaxPool2D()(block4)
    block5 = block(pool4, 512, n_stages_per_blocks[4])
    pool5 = layers.MaxPool2D()(block5)
    flattened = layers.Flatten()(pool5)  # 7x7x512 -> 25088, as in the FC param count above

    dense1 = layers.Dense(4096, activation='relu')(flattened)
    dense2 = layers.Dense(4096, activation='relu')(dense1)
    preds = layers.Dense(n_classes, activation='softmax')(dense2)

    model = Model(in_layer, preds)
    model.compile(loss="categorical_crossentropy", optimizer=opt,
                  metrics=["accuracy"])
    return model

def vgg16(in_shape=(224,224,3), n_classes=1000, opt='sgd'):
    return _vgg(in_shape, n_classes, opt)

def vgg19(in_shape=(224,224,3), n_classes=1000, opt='sgd'):
    return _vgg(in_shape, n_classes, opt, [2, 2, 4, 4, 4])

if __name__ == '__main__':
    model = vgg19()

GoogLeNet/Inception — Szegedy et al

The winner of the 2014 ImageNet competition, GoogLeNet (a.k.a. Inception v1), achieved a top-5 error rate of 6.67%. It used the inception module, a novel concept with smaller convolutions, which reduced the number of parameters to a mere 4 million.

Inception module

Reasons for using these inception modules:

  1. Each layer type extracts different information from the input. Information gathered from a 3×3 layer will differ from information gathered from a 5×5 layer. How do we know which transformation will be best at a given layer? We don’t, so we use them all!
  2. Dimensionality reduction using 1×1 convolutions! Consider a 128x128x256 input. If we pass it through 20 filters of size 1×1, we get an output of 128x128x20. So we apply 1×1 convolutions before the 3×3 and 5×5 convolutions in the inception block to decrease the number of input channels to those layers.
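To see the saving, compare a 5×5 convolution applied directly to 256 input channels against the same convolution placed behind a 1×1 reduce layer (the channel counts here are illustrative, and biases are ignored):

```python
c_in, c_mid, c_out = 256, 64, 32   # input channels, 1x1-reduced channels, output channels

direct = 5 * 5 * c_in * c_out                           # 5x5 straight on 256 channels
reduced = 1 * 1 * c_in * c_mid + 5 * 5 * c_mid * c_out  # 1x1 reduce to 64, then 5x5
print(direct, reduced)  # 204800 67584
```

The reduced path needs roughly a third of the parameters here, and the saving grows with the channel counts.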

GoogLeNet/Inception — Architecture

The complete inception architecture:

You might see some “auxiliary classifiers” with softmax in this structure. Quoting paper here on this one — “By adding auxiliary classifiers connected to these intermediate layers, we would expect to encourage discrimination in the lower stages in the classifier, increase the gradient signal that gets propagated back, and provide additional regularization.”

But what does that mean? Essentially:

  1. discrimination in the lower stages: the lower layers of the network are trained with gradients coming from a classifier attached at an earlier stage. This makes sure the network learns some discrimination between different objects early on.
  2. increase the gradient signal that gets propagated back: in deep neural networks, the gradients flowing back (via backpropagation) often become so small that the earlier layers hardly learn. The auxiliary classifiers help by injecting a strong gradient signal close to those earlier layers.
  3. provide additional regularization: deep neural networks tend to overfit (high variance) the data, while small neural networks tend to underfit (high bias). The auxiliary classifiers counteract the overfitting effect of the deeper layers.
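During training, the auxiliary losses are simply added to the main loss with a discount (the GoogLeNet paper weights them by 0.3, and discards the auxiliary heads entirely at inference time). A sketch of the combined objective:

```python
def total_loss(main_loss, aux_losses, aux_weight=0.3):
    # auxiliary classifiers contribute a down-weighted share of the training objective
    return main_loss + aux_weight * sum(aux_losses)

combined = total_loss(1.0, [0.8, 0.6])  # 1.0 + 0.3 * (0.8 + 0.6)
```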

Structure of Auxiliary classifiers:

Note: Here,

#1×1 represents the filters in 1×1 convolution in inception module.

#3×3 reduce represents the filters in 1×1 convolution before 3×3 convolution in inception module.

#5×5 reduce represents the filters in 1×1 convolution before 5×5 convolution in inception module.

#3×3 represents the filters in 3×3 convolution in inception module.

#5×5 represents the filters in 5×5 convolution in inception module.

Pool Proj represents the filters in the 1×1 convolution after the Max Pool in the inception module.

GoogLeNet incarnation of the Inception architecture

It used batch normalization, image distortions and RMSprop, things we will discuss in future posts.

GoogLeNet/Inception — CODE

from keras import layers
from keras.models import Model

from functools import partial

conv1x1 = partial(layers.Conv2D, kernel_size=1, activation='relu')
conv3x3 = partial(layers.Conv2D, kernel_size=3, padding='same', activation='relu')
conv5x5 = partial(layers.Conv2D, kernel_size=5, padding='same', activation='relu')

def inception_module(in_tensor, c1, c3_1, c3, c5_1, c5, pp):
    conv1 = conv1x1(c1)(in_tensor)

    conv3_1 = conv1x1(c3_1)(in_tensor)
    conv3 = conv3x3(c3)(conv3_1)

    conv5_1 = conv1x1(c5_1)(in_tensor)
    conv5 = conv5x5(c5)(conv5_1)

    # pooling path: max pool first, then the 1x1 "pool proj" convolution
    pool = layers.MaxPool2D(3, strides=1, padding='same')(in_tensor)
    pool_proj = conv1x1(pp)(pool)

    merged = layers.Concatenate(axis=-1)([conv1, conv3, conv5, pool_proj])
    return merged

def aux_clf(in_tensor, n_classes=1000):
    avg_pool = layers.AvgPool2D(5, 3)(in_tensor)
    conv = conv1x1(128)(avg_pool)
    flattened = layers.Flatten()(conv)
    dense = layers.Dense(1024, activation='relu')(flattened)
    dropout = layers.Dropout(0.7)(dense)
    out = layers.Dense(n_classes, activation='softmax')(dropout)
    return out

def inception_net(in_shape=(224,224,3), n_classes=1000, opt='sgd'):
    in_layer = layers.Input(in_shape)

    conv1 = layers.Conv2D(64, 7, strides=2, activation='relu', padding='same')(in_layer)
    pad1 = layers.ZeroPadding2D()(conv1)
    pool1 = layers.MaxPool2D(3, 2)(pad1)
    conv2_1 = conv1x1(64)(pool1)
    conv2_2 = conv3x3(192)(conv2_1)
    pad2 = layers.ZeroPadding2D()(conv2_2)
    pool2 = layers.MaxPool2D(3, 2)(pad2)

    inception3a = inception_module(pool2, 64, 96, 128, 16, 32, 32)
    inception3b = inception_module(inception3a, 128, 128, 192, 32, 96, 64)
    pad3 = layers.ZeroPadding2D()(inception3b)
    pool3 = layers.MaxPool2D(3, 2)(pad3)

    inception4a = inception_module(pool3, 192, 96, 208, 16, 48, 64)
    inception4b = inception_module(inception4a, 160, 112, 224, 24, 64, 64)
    inception4c = inception_module(inception4b, 128, 128, 256, 24, 64, 64)
    inception4d = inception_module(inception4c, 112, 144, 288, 32, 48, 64)
    inception4e = inception_module(inception4d, 256, 160, 320, 32, 128, 128)
    pad4 = layers.ZeroPadding2D()(inception4e)
    pool4 = layers.MaxPool2D(3, 2)(pad4)

    aux_clf1 = aux_clf(inception4a, n_classes)
    aux_clf2 = aux_clf(inception4d, n_classes)

    inception5a = inception_module(pool4, 256, 160, 320, 32, 128, 128)
    inception5b = inception_module(inception5a, 384, 192, 384, 48, 128, 128)
    pad5 = layers.ZeroPadding2D()(inception5b)
    pool5 = layers.MaxPool2D(3, 2)(pad5)

    avg_pool = layers.GlobalAvgPool2D()(pool5)
    dropout = layers.Dropout(0.4)(avg_pool)
    preds = layers.Dense(n_classes, activation='softmax')(dropout)

    model = Model(in_layer, [preds, aux_clf1, aux_clf2])
    model.compile(loss="categorical_crossentropy", optimizer=opt,
                  metrics=["accuracy"])
    return model

if __name__ == '__main__':
    model = inception_net()

ResNet — Kaiming He et al

The 2015 ImageNet competition brought about a top-5 error rate of 3.57%, below the reported human-level error on the task. This was due to the ResNet (Residual Network) model used by Microsoft at the competition. The network introduced a novel approach: skip connections.

Residual learning: a building block.

The idea came out as a solution to an observation: deep neural networks perform worse as we keep adding layers. Intuitively, this shouldn’t be the case. If a network with k layers performs at level y, then a network with k+1 layers should perform at least as well as y, since the extra layer could always learn the identity.

The observation brought about a hypothesis: direct mappings are hard to learn. So, instead of learning the mapping from a stack of layers’ input to its output, learn the difference between them: the residual.

Say x is the input and H(x) is the output we want to learn. We instead learn F(x) = H(x) - x: the layers learn F(x), and then x is added back via the skip connection, recovering H(x) = F(x) + x. As a result, we send the same H(x) to the next layer as we were supposed to before. This gives rise to the residual block we saw above.
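The whole idea fits in one line: the block’s output is its input plus a learned correction. A minimal numeric sketch (the `residual_fn` argument is a stand-in for the block’s conv layers):

```python
def residual_block(x, residual_fn):
    # learn F(x) = H(x) - x, then add the input back through the skip connection
    return residual_fn(x) + x

# if the residual branch learns to output zero, the block is an identity mapping,
# so stacking such blocks can never make the mapping worse in principle
identity_out = residual_block(5.0, lambda x: 0.0)   # 5.0
corrected_out = residual_block(5.0, lambda x: 0.5)  # 5.5
```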

The results were amazing: the vanishing-gradient problem, which usually numbs deep neural networks to learning, was largely eliminated. How? The skip connections (or shortcuts, as we might call them) give gradients a direct path back to earlier layers, skipping a bunch of layers in between.

ResNet — Architecture

Let’s use it here:

The paper mentions the use of bottleneck blocks for the deeper ResNets: 50/101/152. Instead of the residual block shown above, each block stacks 1×1, 3×3 and 1×1 convolutions, with the 1×1 layers first decreasing and then restoring the number of channels.
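The motivation is cost. At 256 channels, compare two plain 3×3 convolutions against a 1×1-3×3-1×1 bottleneck that squeezes to 64 channels in the middle (as in the paper; biases ignored):

```python
plain = 2 * (3 * 3 * 256 * 256)    # two 3x3 convs at 256 channels
bottleneck = (1 * 1 * 256 * 64     # 1x1: reduce 256 -> 64
              + 3 * 3 * 64 * 64    # 3x3 at the narrow width
              + 1 * 1 * 64 * 256)  # 1x1: restore 64 -> 256
print(plain, bottleneck)  # 1179648 69632
```

The bottleneck does the 3×3 work at a quarter of the width, cutting the block’s parameters by roughly 17×, which is what makes the 50/101/152-layer variants tractable.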

ResNet — CODE

from keras import layers
from keras.models import Model

def _after_conv(in_tensor):
    norm = layers.BatchNormalization()(in_tensor)
    return layers.Activation('relu')(norm)

def conv1(in_tensor, filters):
    conv = layers.Conv2D(filters, kernel_size=1, strides=1)(in_tensor)
    return _after_conv(conv)

def conv1_downsample(in_tensor, filters):
    conv = layers.Conv2D(filters, kernel_size=1, strides=2)(in_tensor)
    return _after_conv(conv)

def conv3(in_tensor, filters):
    conv = layers.Conv2D(filters, kernel_size=3, strides=1, padding='same')(in_tensor)
    return _after_conv(conv)

def conv3_downsample(in_tensor, filters):
    conv = layers.Conv2D(filters, kernel_size=3, strides=2, padding='same')(in_tensor)
    return _after_conv(conv)

def resnet_block_wo_bottleneck(in_tensor, filters, downsample=False):
    if downsample:
        conv1_rb = conv3_downsample(in_tensor, filters)
    else:
        conv1_rb = conv3(in_tensor, filters)
    conv2_rb = conv3(conv1_rb, filters)

    if downsample:
        # downsample the shortcut too, so both branches have matching shapes
        in_tensor = conv1_downsample(in_tensor, filters)
    result = layers.Add()([conv2_rb, in_tensor])

    return layers.Activation('relu')(result)

def resnet_block_w_bottleneck(in_tensor, filters, downsample=False,
                              change_channels=False):
    if downsample:
        conv1_rb = conv1_downsample(in_tensor, int(filters/4))
    else:
        conv1_rb = conv1(in_tensor, int(filters/4))
    conv2_rb = conv3(conv1_rb, int(filters/4))
    conv3_rb = conv1(conv2_rb, filters)

    if downsample:
        in_tensor = conv1_downsample(in_tensor, filters)
    elif change_channels:
        in_tensor = conv1(in_tensor, filters)
    result = layers.Add()([conv3_rb, in_tensor])

    # as in the paper, ReLU is applied after the addition
    return layers.Activation('relu')(result)

def _pre_res_blocks(in_tensor):
    conv = layers.Conv2D(64, 7, strides=2, padding='same')(in_tensor)
    conv = _after_conv(conv)
    pool = layers.MaxPool2D(3, 2, padding='same')(conv)
    return pool

def _post_res_blocks(in_tensor, n_classes):
    pool = layers.GlobalAvgPool2D()(in_tensor)
    preds = layers.Dense(n_classes, activation='softmax')(pool)
    return preds

def convx_wo_bottleneck(in_tensor, filters, n_times, downsample_1=False):
    res = in_tensor
    for i in range(n_times):
        if i == 0:
            res = resnet_block_wo_bottleneck(res, filters, downsample_1)
        else:
            res = resnet_block_wo_bottleneck(res, filters)
    return res

def convx_w_bottleneck(in_tensor, filters, n_times, downsample_1=False):
    res = in_tensor
    for i in range(n_times):
        if i == 0:
            res = resnet_block_w_bottleneck(res, filters, downsample_1, not downsample_1)
        else:
            res = resnet_block_w_bottleneck(res, filters)
    return res

def _resnet(in_shape=(224,224,3), n_classes=1000, opt='sgd',
            convx=[64, 128, 256, 512],
            n_convx=[2, 2, 2, 2],
            convx_fn=convx_wo_bottleneck):
    in_layer = layers.Input(in_shape)

    downsampled = _pre_res_blocks(in_layer)

    conv2x = convx_fn(downsampled, convx[0], n_convx[0])
    conv3x = convx_fn(conv2x, convx[1], n_convx[1], True)
    conv4x = convx_fn(conv3x, convx[2], n_convx[2], True)
    conv5x = convx_fn(conv4x, convx[3], n_convx[3], True)

    preds = _post_res_blocks(conv5x, n_classes)

    model = Model(in_layer, preds)
    model.compile(loss="categorical_crossentropy", optimizer=opt,
                  metrics=["accuracy"])
    return model

def resnet18(in_shape=(224,224,3), n_classes=1000, opt='sgd'):
    return _resnet(in_shape, n_classes, opt)

def resnet34(in_shape=(224,224,3), n_classes=1000, opt='sgd'):
    return _resnet(in_shape, n_classes, opt,
                   n_convx=[3, 4, 6, 3])

def resnet50(in_shape=(224,224,3), n_classes=1000, opt='sgd'):
    return _resnet(in_shape, n_classes, opt,
                   [256, 512, 1024, 2048],
                   [3, 4, 6, 3],
                   convx_w_bottleneck)

def resnet101(in_shape=(224,224,3), n_classes=1000, opt='sgd'):
    return _resnet(in_shape, n_classes, opt,
                   [256, 512, 1024, 2048],
                   [3, 4, 23, 3],
                   convx_w_bottleneck)

def resnet152(in_shape=(224,224,3), n_classes=1000, opt='sgd'):
    return _resnet(in_shape, n_classes, opt,
                   [256, 512, 1024, 2048],
                   [3, 8, 36, 3],
                   convx_w_bottleneck)

if __name__ == '__main__':
    model = resnet50()


References

  1. LeCun et al., Gradient-Based Learning Applied to Document Recognition
  2. LeCun et al., Object Recognition with Gradient-Based Learning
  3. Krizhevsky et al., ImageNet Classification with Deep Convolutional Neural Networks
  4. Simonyan & Zisserman, Very Deep Convolutional Networks for Large-Scale Image Recognition
  5. Szegedy et al., Going Deeper with Convolutions
  6. He et al., Deep Residual Learning for Image Recognition