Convolutional Neural Networks: Computers can see and understand now? What!? I, Robot could be real?

Original article was published by Pranav Mendiratta on Artificial Intelligence on Medium


Since the 1950s, scientists have been trying to build computers that can make sense of visual input. In 2012, a major breakthrough was achieved when an AI system called AlexNet won the ImageNet computer vision contest with roughly 85% accuracy. At the core of AlexNet was an already existing technology, the Convolutional Neural Network, which had not yet been used to its full potential in the field of computer vision.

Convolutional Neural Networks (CNNs), also known as ConvNets, were developed in the 1980s, and their early versions were called LeNet (after Yann LeCun). These networks were only used in niches of the banking and postal industries. Their main problem was scaling: they required large amounts of training data that simply wasn't available at the time. AlexNet trained its CNN on ImageNet, a large dataset of labelled images.

Working: Neurons in our nervous system enable us to see things. Convolutional Neural Networks are just multiple layers of artificial neurons, rough simulations of the neurons in the human body. These are nothing but mathematical functions that calculate a weighted sum of multiple inputs and output an activation value.
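The idea above can be sketched in a few lines of Python. This is a minimal, illustrative neuron, not any particular library's implementation; the pixel values and weights below are arbitrary example numbers:

```python
import math

def neuron(inputs, weights, bias):
    """One artificial neuron: a weighted sum of the inputs plus a bias,
    squashed through a sigmoid activation into the range (0, 1)."""
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1.0 / (1.0 + math.exp(-z))  # sigmoid activation

# Illustrative values only: three pixel intensities and arbitrary weights.
print(neuron([0.2, 0.8, 0.5], [0.4, -0.1, 0.9], bias=0.1))
```

Real CNNs use millions of such neurons, and other activation functions (such as ReLU) are common, but the weighted-sum-then-activate pattern is the same.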

Figure: Structure of artificial neural network

Each neuron behaves differently because it has different weight values. When they receive pixel values, the neurons pick out various visual features and patterns. They multiply each pixel value by its weight, calculate the sum, and run it through an activation function. Each layer of a CNN generates “activation maps” that highlight the relevant features of the image.

The first layer of the ConvNet detects basic attributes such as vertical, horizontal and diagonal edges. The output from this layer is fed to the next layer as input, which then extracts more complex attributes such as corners or combinations of edges. As we move deeper into the network, the final layers help detect faces and whole objects.

Figure: Each layer of CNN extracts certain info

This operation of multiplying pixel values by the neurons’ weights and adding them up is called a “convolution”. A ConvNet contains several such convolution layers along with some other components.
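Here is what that sliding multiply-and-sum looks like in plain Python. The kernel below is a simple hypothetical vertical-edge detector, chosen only to make the activation map easy to read:

```python
def convolve2d(image, kernel):
    """Slide the kernel over the image; at each position multiply the
    overlapping values and sum them (no padding, stride 1)."""
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for i in range(len(image) - kh + 1):
        row = []
        for j in range(len(image[0]) - kw + 1):
            s = sum(image[i + a][j + b] * kernel[a][b]
                    for a in range(kh) for b in range(kw))
            row.append(s)
        out.append(row)
    return out

# Tiny image: bright left half, dark right half.
image = [[1, 1, 0, 0],
         [1, 1, 0, 0],
         [1, 1, 0, 0]]
# Kernel that responds to a bright-to-dark vertical transition.
kernel = [[1, -1],
          [1, -1]]
print(convolve2d(image, kernel))  # → [[0, 2, 0], [0, 2, 0]]
```

Notice how the output (the activation map) is large exactly where the vertical edge sits and zero elsewhere — this is the “highlighting” described above.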

The last layer of the CNN is a classification layer, which receives its input from the final convolution layer. Based on the activation maps from the convolution layers, the classification layer outputs a confidence score (a value between 0 and 1) indicating how likely it is that the image belongs to a particular “class”. For example, if you’re trying to identify a horse, the final layer gives a confidence score on whether the image is of a horse or not.
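A common way the classification layer turns raw scores into confidences is the softmax function. A sketch, with made-up scores for a two-class “horse” / “not horse” example:

```python
import math

def softmax(scores):
    """Turn raw class scores into confidence values between 0 and 1
    that sum to 1 across all classes."""
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw scores for the classes ["horse", "not horse"].
confidences = softmax([2.0, 0.5])
print(confidences)  # first value: the network's confidence it sees a horse
```

The larger raw score wins, but the output is a graded confidence rather than a hard yes/no.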

Figure: Layers of a CNN

Training the CNN: Training a CNN during its development is a great challenge. Adjusting the weights of each neuron appropriately is called “training” the CNN.

We train a CNN by feeding it large datasets of labelled images: the CNN runs through these datasets and produces outputs that are matched against the correct labels, and the neuron weights are tuned accordingly. At the beginning, the neurons start with random weight values, so the failure rate is high, but the success rate gradually increases.

These corrections are made through a technique called “backpropagation”, which optimizes the tuning process so that the weights are adjusted according to precise calculations rather than random guessing.
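For a single sigmoid neuron, one backpropagation step can be written out exactly. This is a toy sketch of the idea (gradient descent on a squared error), not a full multi-layer implementation; the inputs, weights and learning rate are arbitrary:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def backprop_step(weights, bias, inputs, target, lr=0.5):
    """One backpropagation update for a single sigmoid neuron: compute
    the error, then nudge each weight opposite its error gradient."""
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    out = sigmoid(z)
    # Gradient of the squared error 0.5*(out - target)**2 w.r.t. z,
    # using sigmoid'(z) = out * (1 - out).
    delta = (out - target) * out * (1 - out)
    new_weights = [w - lr * delta * x for w, x in zip(weights, inputs)]
    new_bias = bias - lr * delta
    return new_weights, new_bias, out

# Toy case: the label is 1, so the update should push the output upward.
w, b, out_before = backprop_step([0.1, -0.2], 0.0, [1.0, 0.5], target=1.0)
```

The “precise calculation” is exactly that gradient: each weight moves in the direction that most reduces the error.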

Every full run of the training dataset through a ConvNet is called an “epoch”. A CNN can go through several epochs during its training. Each epoch tweaks the weights and gradually reduces the network’s error. Eventually we reach a point where the CNN “converges”, which means it has become as good as it can get at identifying its classes of images.
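Putting epochs and backpropagation together gives the classic training loop. The tiny labelled dataset below is invented purely for illustration:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Tiny labelled dataset: (input values, label). Purely illustrative.
data = [([0.9, 0.1], 1), ([0.8, 0.2], 1), ([0.1, 0.9], 0), ([0.2, 0.7], 0)]
weights, bias, lr = [0.0, 0.0], 0.0, 1.0

losses = []
for epoch in range(20):            # each full pass over the data is one epoch
    total_loss = 0.0
    for inputs, target in data:
        out = sigmoid(sum(x * w for x, w in zip(inputs, weights)) + bias)
        total_loss += 0.5 * (out - target) ** 2
        delta = (out - target) * out * (1 - out)        # backpropagation
        weights = [w - lr * delta * x for w, x in zip(weights, inputs)]
        bias -= lr * delta
    losses.append(total_loss)

print(losses[0], losses[-1])  # the error shrinks as the epochs go by
```

When the loss stops improving from one epoch to the next, the network has effectively converged.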

After training, test datasets, kept separate from the images used for training, are passed through the ConvNet to measure its accuracy. If the ConvNet performs well on the training data but fails to identify images in the test data, the network has been “overfitted”. This essentially means the network was fed too many images of the same type (no variety), or went through too many epochs during its training.
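The overfitting check itself is simple to express in code. The model and the numbers here are hypothetical stand-ins; the point is the train/test comparison:

```python
def accuracy(model, dataset):
    """Fraction of examples the model labels correctly."""
    correct = sum(1 for x, label in dataset if model(x) == label)
    return correct / len(dataset)

# Hypothetical threshold model and data, just to show the check itself.
model = lambda x: 1 if x > 0.5 else 0
train_set = [(0.9, 1), (0.8, 1), (0.1, 0), (0.6, 1)]
test_set  = [(0.7, 1), (0.4, 1), (0.2, 0), (0.3, 1)]

train_acc = accuracy(model, train_set)
test_acc = accuracy(model, test_set)
if train_acc - test_acc > 0.2:
    print("large train/test gap -- the model may be overfitted")
```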

These neural networks mainly depend on large image datasets to function well. ImageNet, mentioned above, is a dataset of some 14 million labelled images. There are also pretrained models available nowadays, such as AlexNet and Microsoft’s ResNet, that do not require such large datasets to adapt to a new task. The process of feeding a trained network only a small set of new examples is called transfer learning.
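A stripped-down sketch of the transfer-learning idea: the pretrained layers are frozen (their weights never change), and only a small classification head is trained on the few new examples. `pretrained_features` is a hypothetical stand-in for the frozen convolutional layers of a real network like AlexNet or ResNet:

```python
import math

def pretrained_features(image):
    """Stand-in for the frozen layers of a pretrained network.
    Its 'weights' are NOT updated during transfer learning."""
    return [sum(image) / len(image), max(image) - min(image)]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Only the small classification head is trained on the few new examples.
head_w, head_b, lr = [0.0, 0.0], 0.0, 1.0
new_examples = [([0.9, 0.8, 0.7], 1), ([0.1, 0.2, 0.1], 0)]

for _ in range(50):
    for image, target in new_examples:
        feats = pretrained_features(image)   # frozen: no update here
        out = sigmoid(sum(f * w for f, w in zip(feats, head_w)) + head_b)
        delta = (out - target) * out * (1 - out)
        head_w = [w - lr * delta * f for w, f in zip(head_w, feats)]
        head_b -= lr * delta
```

Because the expensive feature extractor is reused as-is, only a handful of new labelled examples is needed.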


· CNNs can describe what they identify in an image, but they cannot describe the emotion or feeling behind the image.

· ConvNets struggle when shown the same objects from different angles or under different lighting conditions. One CNN trained on the famous ImageNet dataset still failed to identify images taken under unusual lighting conditions.

· Another shortcoming of convolutional neural networks is that, unlike humans, they have no sense of the relations between objects in an image. That is why they have not been able to solve “Bongard problems”.

In a Bongard problem you are given a few images arranged in two sets; you must identify the rule relating the images within each set and then, based on that rule, classify a new incoming image into one of the two sets.

Figure: Bongard problem

It is easy for humans to identify that the images in the left set contain only one shape while those in the right set contain two, but computer vision systems still struggle with Bongard problems. They cannot find the relation and then classify the incoming image based on it.

· Another problem is that CNNs can be manipulated through adversarial attacks. In an adversarial attack, an attacker feeds the network input crafted to make it mistake one thing for another; it acts like an optical illusion for the neural network. This can be dangerous in applications such as self-driving cars.
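The core trick behind many adversarial attacks can be shown on a toy linear classifier: nudge every input value a tiny amount in the direction that most lowers the class score (the sign of the gradient, which for a linear model is just the weight vector). This is an illustrative sketch inspired by the fast-gradient-sign idea, with made-up weights and inputs:

```python
def score(weights, x):
    """Linear classifier score: positive means class 'horse'."""
    return sum(w * xi for w, xi in zip(weights, x))

def adversarial(weights, x, eps):
    """Nudge every input value by eps in the direction that lowers the
    score. For a linear model the gradient of the score w.r.t. x is w,
    so we step against the sign of each weight."""
    return [xi - eps * (1 if w > 0 else -1) for xi, w in zip(x, weights)]

w = [0.5, -0.5, 1.0]
x = [0.2, 0.1, 0.1]                   # correctly classified: score > 0
x_adv = adversarial(w, x, eps=0.1)
print(score(w, x), score(w, x_adv))   # small input change, flipped sign
```

The perturbed image can look almost identical to a human, yet the classifier’s decision flips — which is exactly why such attacks worry self-driving-car engineers.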