Recent Advancements in Computer Vision

Source: Deep Learning on Medium

In my first blog post I explored sentiment analysis, a Deep Learning application for understanding natural language. For this post I turned my focus to the other major application of Deep Learning: Computer Vision.

What is Computer Vision?

Computer vision is the field concerned with giving computers the ability to perceive and interpret the visual world around them, and today it is driven largely by Deep Learning. It is used in cars with any kind of auto-drive feature and, more recently, in Google Lens, which can identify objects and let you copy and paste text from the real world. Amazon is currently in the process of releasing the Echo Look, a device that scans you and your clothes and makes recommendations about your outfit. Typical computer vision problems include object detection and image classification, and all of the applications above build on these 2 fundamental tasks.

The Evolution of Computer Vision

The field of computer vision started out by applying hand-crafted rules to pick out relevant information from images. These techniques, such as hand-crafted rules, hand-picked feature extraction, and data augmentation, are now referred to as traditional methods of solving computer vision problems. Traditional methods, however successful, were not very scalable and had to be re-engineered whenever a new category of items was introduced. Competitions such as the ILSVRC (ImageNet Large Scale Visual Recognition Challenge), however, pushed programmers to come up with new approaches to image classification. Below is a timeline of the evolution of Computer Vision and Conv-Nets.

2012: AlexNet, The Birth of Modern Computer Vision:

AlexNet won the ILSVRC 2012 by a large margin compared to more traditional methods.

Key Features:

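  • Used data augmentation, the ReLU activation function, dropout, and a GPU implementation for training
  • Used overlapping pooling (3 x 3 windows with a stride of 2), which the authors reported slightly lowered the error rate and made the model a little harder to overfit
  • Proved the value of Convolutional Neural Networks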


Architecture:

  • 5 convolution layers
  • 3 overlapping max pooling layers
  • 3 fully connected (dense) layers
AlexNet (Source: Coinmonks)
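
To make the pooling arithmetic concrete, here is a minimal pure-Python sketch. It assumes a 227 x 227 input, an 11 x 11 first convolution with a stride of 4, and 3 x 3 pooling with a stride of 2; these are commonly quoted AlexNet settings, not values given in this post, and the helper names `output_size` and `max_pool_1d` are made up for illustration.

```python
def output_size(size, kernel, stride, padding=0):
    """Spatial size after a convolution or pooling layer."""
    return (size + 2 * padding - kernel) // stride + 1


def max_pool_1d(values, kernel, stride):
    """Max pooling over a single row, to keep the example small."""
    n_out = output_size(len(values), kernel, stride)
    return [max(values[i * stride : i * stride + kernel]) for i in range(n_out)]


# Assumed AlexNet-style settings (not taken from this post).
conv1 = output_size(227, kernel=11, stride=4)
pool1 = output_size(conv1, kernel=3, stride=2)

row = [1, 5, 2, 0, 3, 4, 1]
# Overlapping: the window (3) is wider than the stride (2),
# so neighbouring windows share one value.
overlapping = max_pool_1d(row, kernel=3, stride=2)
# Non-overlapping: window and stride are equal.
non_overlapping = max_pool_1d(row, kernel=2, stride=2)

print(conv1, pool1)
print(overlapping, non_overlapping)
```

Pooling is "overlapping" whenever the window is larger than the stride, which is the only difference between the two calls to `max_pool_1d`.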

Accuracy (top-5): 82%

2013: ZFNet

ZFNet won ILSVRC 2013 with a modified version of AlexNet.

Key Features:

  • Changed the first convolution layer of AlexNet from an 11 x 11 convolution with a stride of 4 to a 7 x 7 convolution with a stride of 2, so that the early layers retain more information
ZFNet (Source: Coinmonks)

Accuracy (top-5): 86%

2014: VGGNet, One of the Most Popular Conv Nets:

VGGNet replaced large-kernel convolutions with stacks of several small-kernel convolutions.

Key Features:

  • 5 blocks, each containing 2 to 4 convolutions of size 3 x 3 (stride 1, padding 1), followed by a 2 x 2 max pooling with a stride of 2
  • A large model: the weights take up a little over 0.5 GB
Different VGG Layer Structures (Source: Coinmonks)
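
The trade-off behind stacking small kernels can be checked with a few lines of arithmetic. The sketch below is only an illustration: the channel count of 64 is a hypothetical value, biases are ignored, and the function names are not from any library.

```python
def stacked_receptive_field(num_layers, kernel=3):
    """Receptive field of `num_layers` stacked stride-1 convolutions."""
    return 1 + num_layers * (kernel - 1)


def conv_weights(kernel, channels):
    """Weights in one kernel x kernel convolution with `channels`
    input and output channels (biases ignored)."""
    return kernel * kernel * channels * channels


channels = 64  # hypothetical channel count, chosen only for illustration

# Two stacked 3 x 3 convolutions see the same 5 x 5 area as one 5 x 5 convolution.
rf_two = stacked_receptive_field(2)
two_3x3 = 2 * conv_weights(3, channels)
one_5x5 = conv_weights(5, channels)

# Three stacked 3 x 3 convolutions see the same 7 x 7 area as one 7 x 7 convolution.
rf_three = stacked_receptive_field(3)
three_3x3 = 3 * conv_weights(3, channels)
one_7x7 = conv_weights(7, channels)

print(rf_two, two_3x3, one_5x5)
print(rf_three, three_3x3, one_7x7)
```

The stacked version covers the same area with fewer weights and gets an extra non-linearity between each pair of layers.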

Accuracy (top-5): 90%

2014: Inception, or GoogLeNet, Winner of ILSVRC 2014

The Inception model addressed the problem of overfitting that came with deeper convolutional models. It also removed the challenge of choosing a single correct kernel size by applying several different kernel sizes in parallel.

(Source: Inception v1)

Key Features:

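  • Instead of stacking convolution layers with different kernel sizes one after another, the Inception module applies convolutions with several kernel sizes in parallel and concatenates their outputs along the channel dimension
  • Uses 1 x 1 convolutions to reduce the number of channels before the more expensive 3 x 3 and 5 x 5 convolutions; padding keeps the spatial sizes equal so that the branch outputs can be concatenated
  • To prevent the middle of the model from “dying out”, adds auxiliary classifiers in 2 different parts of the model and adds the weighted sum of their losses to the final loss
  • Inception v2 and v3 factorized the larger convolutions: a 5 x 5 convolution became two stacked 3 x 3 convolutions, and an n x n convolution could be split into a 1 x n convolution followed by an n x 1 convolution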
GoogLeNet. The orange box is the stem, which has some preliminary convolutions. The purple boxes are auxiliary classifiers. The wide parts are the inception modules. (Source: Inception v1)
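
The effect of the parallel branches and the 1 x 1 reductions can be sketched by counting channels and weights for a single module. The channel counts below are illustrative values in the style of an early GoogLeNet module and are an assumption, not figures from this post; the function names are hypothetical and biases are ignored.

```python
def conv_weights(in_ch, out_ch, kernel):
    """Weights in one kernel x kernel convolution (biases ignored)."""
    return in_ch * out_ch * kernel * kernel


def inception_module(in_ch, c1, c3_reduce, c3, c5_reduce, c5, pool_proj):
    """Return (output_channels, weights) for a module with 1 x 1 reductions."""
    weights = (
        conv_weights(in_ch, c1, 1)            # 1 x 1 branch
        + conv_weights(in_ch, c3_reduce, 1)   # 1 x 1 reduction before 3 x 3
        + conv_weights(c3_reduce, c3, 3)      # 3 x 3 branch
        + conv_weights(in_ch, c5_reduce, 1)   # 1 x 1 reduction before 5 x 5
        + conv_weights(c5_reduce, c5, 5)      # 5 x 5 branch
        + conv_weights(in_ch, pool_proj, 1)   # 1 x 1 after the pooling branch
    )
    # Branch outputs are concatenated along the channel dimension.
    return c1 + c3 + c5 + pool_proj, weights


def naive_module(in_ch, c1, c3, c5):
    """Same branches without any 1 x 1 reduction; pooling passes all channels."""
    weights = (
        conv_weights(in_ch, c1, 1)
        + conv_weights(in_ch, c3, 3)
        + conv_weights(in_ch, c5, 5)
    )
    return c1 + c3 + c5 + in_ch, weights


# Assumed, illustrative channel counts.
out_ch, weights = inception_module(192, 64, 96, 128, 16, 32, 32)
naive_out_ch, naive_weights = naive_module(192, 64, 128, 32)

print(out_ch, weights)
print(naive_out_ch, naive_weights)
```

With these assumed numbers the reduced module needs fewer than half the weights of the naive one and also keeps the output channel count from growing with every module.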

2015: ResNet:

ResNet, the winner of ILSVRC 2015, is a model built from stacked residual blocks. Below is the basic representation of the residual block:

Credit: Vincent Fung

Key Features:

  • Reduces model size and complexity substantially compared with VGGNet
  • Each residual block contains 2 conv layers, followed by an addition of the block's original input to its output
  • This addition allows the block to learn an identity function easily: the conv layers only need to output zeros
  • Much deeper than preceding model architectures
Credit: Vincent Fung
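
The identity property of the residual block can be shown with a toy example. This is a simplified sketch, not the real architecture: two tiny fully connected layers stand in for the 2 conv layers, the final ReLU after the addition is an assumption about layer ordering, and all names are made up for illustration.

```python
def relu(v):
    return [max(0.0, x) for x in v]


def linear(v, weights):
    """Multiply vector `v` by a square weight matrix given as a list of rows."""
    return [sum(w * x for w, x in zip(row, v)) for row in weights]


def residual_block(x, w1, w2):
    """Two layers F(x), then the skip connection: relu(F(x) + x)."""
    out = relu(linear(x, w1))
    out = linear(out, w2)
    return relu([o + xi for o, xi in zip(out, x)])


x = [1.5, 2.0]
zeros = [[0.0, 0.0], [0.0, 0.0]]
eye = [[1.0, 0.0], [0.0, 1.0]]

# With all-zero weights, F(x) is zero and the block passes x through unchanged
# (x is non-negative here, so the final ReLU does not alter it).
identity_out = residual_block(x, zeros, zeros)

# With identity weights, F(x) = x and the block outputs x + x.
doubled = residual_block(x, eye, eye)

print(identity_out, doubled)
```

A plain stack of layers would have to learn weights that reproduce the input exactly, while the residual version gets the identity by driving its weights toward zero.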

Accuracy (top-5): 96.3%

2016: Inception-ResNet: The Hybrid Network

Following the success of ResNet and the Inception Net, the two models were combined into a hybrid network with the advantages of both.

The top image is the stem of Inception-ResNet v1. The bottom image is the stem of Inception v4 and Inception-ResNet v2. (Source: Inception v4)

Key Features:

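  • 3 different types of Inception blocks (A, B, C)
  • Added the residual connection from ResNet around each Inception block
  • Scaled down the residual activations before adding them back to the block input, which stabilized training and prevented the network from dying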
The top image is the layout of Inception v4. The bottom image is the layout of Inception-ResNet. (Source: Inception v4)
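
The activation scaling mentioned above can be sketched in a few lines. The scale of 0.2 is an assumed value chosen for illustration, and `scaled_residual` is a hypothetical helper, not a library function.

```python
def scaled_residual(x, branch_out, scale=0.2):
    """Add a scaled-down branch output to the block input."""
    return [xi + scale * b for xi, b in zip(x, branch_out)]


x = [1.0, -2.0]            # block input
branch_out = [10.0, 5.0]   # made-up output of the Inception branch

# scale=1.0 is the plain residual addition from ResNet.
unscaled = scaled_residual(x, branch_out, scale=1.0)
# A small scale keeps the branch from overwhelming the input.
scaled = scaled_residual(x, branch_out, scale=0.2)

print(unscaled, scaled)
```

The only change from a plain residual connection is the multiplication by `scale` before the addition.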

Thanks for reading my second blog post! I hope it provided you with some important information about the growth of the Computer Vision field. My next blog post will be about how I created an anomaly detection system to predict fraudulent transactions in parking garages around the US. Signing off for now! -Tanish