Super-human accuracy in computer vision: Learning by revising

In this article, we are going to discuss the importance of revising the recent past for computer-vision algorithms.

In humankind's quest to create artificial intelligence, researchers often debate whether to copy the human brain or to create something entirely new. The laws of nature have proven time and time again to be absolute and sound, so the best way to make computers execute intellectual tasks is to ground ourselves in a set of universal, battle-tested laws and principles.

“We don’t always have to reinvent the wheel, I believe that often times we just have to adjust it to our needs.” — Prince Canuma

We haven't uncovered all the secrets of the brain, and we aren't even close, but we are working towards that goal day by day. The little knowledge about the brain that we have acquired over the years has inspired many scientific fields, such as computer engineering and science and Artificial Intelligence (AI).

Of all the tasks the human brain performs, one in particular puzzles neuroscientists and AI researchers: learning.

How does the brain learn?

According to TrainingIndustry, neuroscientists long believed that learning and memory formation happen only through the strengthening and weakening of connections among brain cells, when in fact researchers found that:

— When two neurons frequently interact, they form a bond that allows them to transmit information more easily and accurately.

Super-human accuracy in computer vision: Learning by revising

A fairly young and famous subfield of Machine Learning (ML) called Deep Learning (DL) has managed to achieve amazing results on the image-classification task, classifying objects in an image with up to 99.8% accuracy.

DL brought a new take on learning representations from data that puts emphasis on learning successive layers of increasingly meaningful representations.

fig. 1 Neural Network with 4 layers

The number of layers that contribute to a model of the data is called the depth of the model. Modern DL models such as deep Neural Networks (NNs) often involve tens or even hundreds of successive layers of representations, and they are all learned automatically from exposure to training data.

Recent evidence presented in research papers such as Going Deeper with Convolutions reveals that network depth is of crucial importance and has led to amazing accomplishments on the challenging ImageNet dataset, all of which exploited and benefited from very deep models.

But this discovery raises a new question.

Is increased depth (more stacked layers) proportional to the gains in accuracy and performance?

You guessed it right if you said no. Let us understand why.

Going back to how the best and most complex system in the world (the brain) works, we can all agree that it is easier to remember and solidify a known concept than it is to acquire a new one. The more we revise a topic, or the more we listen to the same song, the easier it becomes to recall, right?

“It’s the repetition of affirmations that leads to belief. And then that belief becomes a deep conviction”

Why is revising so much easier, you might ask? As you might recall from the previous section, when two neurons frequently interact, they form a bond that allows them to transmit information more easily and accurately. Revising is thus a fundamental key to learning in any living organism.

In Deep Learning terms, learning by revising is called Residual Learning.

Now, let us dive a bit deeper into the technical side of things and understand how residual learning works.

Image classification is a computer-vision task that tries to find the function that maps an image x to a label y.

Now, consider H(x) as the underlying mapping (the mapping from x to y that we want to learn) to be fit by a few stacked layers, with x denoting the input to the first of these layers. If one hypothesizes that multiple nonlinear layers can asymptotically approximate complicated functions, then it is equivalent to hypothesize that they can asymptotically approximate the residual functions.

So rather than expecting the stacked layers to approximate H(x), we explicitly let these layers approximate a residual function F(x) := H(x) - x, so that the original function becomes F(x) + x (assuming both have the same dimensions).

A residual function is basically a mapping from x to y in which the concept of revising is used.

Instead of passing x (the training data) through a series of layers that each learn something new, we create residual blocks in the network: after every N layers we add the block's input x back to its output, which the paper's authors call an identity mapping.

With this discovery we can easily train bigger networks and avoid degrading the knowledge (weights) acquired by the deeper layers of the network, all because we create blocks in our network that revise past information.

We can safely say that these residual blocks form strong bonds that allow the network to learn information easily and accurately.

A major benefit is that the residual connection F(x) + x introduces neither extra parameters nor extra computational complexity.
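To make this concrete, here is a minimal sketch of the F(x) + x idea written with Keras layers (we build with TF 2.0 later in this article). The residual_block helper, the two-convolution form of F(x) and the shapes are illustrative assumptions, not the paper's exact configuration:

```python
import tensorflow as tf
from tensorflow.keras import layers

def residual_block(x, filters=64):
    """Identity-shortcut block: y = ReLU(F(x) + x).

    Assumes the input already has `filters` channels, so the
    shortcut itself adds no parameters and no extra computation.
    """
    # F(x): two 3x3 convolutions with batch norm and ReLU
    f = layers.Conv2D(filters, 3, padding="same", use_bias=False)(x)
    f = layers.BatchNormalization()(f)
    f = layers.ReLU()(f)
    f = layers.Conv2D(filters, 3, padding="same", use_bias=False)(f)
    f = layers.BatchNormalization()(f)

    # The "revision" step: add the block's input back (identity mapping)
    y = layers.Add()([f, x])
    return layers.ReLU()(y)

# Usage: the Add layer contributes zero trainable weights
inputs = tf.keras.Input(shape=(56, 56, 64))
outputs = residual_block(inputs, filters=64)
tf.keras.Model(inputs, outputs).summary()
```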

Network Architectures

This is where the fun and code examples begin.

The authors of the Deep Residual Learning for Image Recognition paper test and describe two models used on ImageNet, as follows.

  • Plain Network.

This network serves as a baseline and is mainly inspired by the philosophy of VGG nets.

The convolutional layers mostly have 3×3 filters and follow two simple design rules: (i) for the same output feature-map size, the layers have the same number of filters; and (ii) if the feature-map size is halved, the number of filters is doubled so as to preserve the time complexity per layer.

Downsampling (reducing the height and width of the feature map) is done directly by convolutional layers with stride 2.

The total number of weighted layers is 34, shown in fig. 3 (left).

  • Residual Network

Based on the above plain network, the authors insert the residual blocks, which they call shortcut connections.

These shortcut connections (F(x) + x) can be used directly when the input and output have the same dimensions (solid-line shortcuts in fig. 3). When the dimensions increase (dotted-line shortcuts in fig. 3), the authors consider two options:

(A) The shortcut still performs identity mapping, with extra zeros padded in for the increased dimensions.

(B) The projection shortcut uses a 1×1 convolution to match the increased dimensions. The 1×1 convolution applies a linear projection Ws (a weight matrix) on the shortcut connection so that the dimensions line up: y = F(x, {Wi}) + Ws x (a minimal sketch of this option follows fig. 3).

fig. 3 34-layer PlainNet (left), 34-layer ResNet (right)
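Here is a hedged sketch of option (B) that also illustrates the two plain-network design rules above (stride-2 downsampling, doubled filters). The function name and the concrete shapes are my own illustrative assumptions, not code from the paper:

```python
import tensorflow as tf
from tensorflow.keras import layers

def downsampling_residual_block(x, filters):
    """Residual block that halves the feature map (stride 2) and doubles
    the filters, using a 1x1 projection shortcut (option B) so that
    y = F(x, {Wi}) + Ws*x has matching dimensions."""
    # F(x): the first conv downsamples with stride 2, the second keeps the size
    f = layers.Conv2D(filters, 3, strides=2, padding="same", use_bias=False)(x)
    f = layers.BatchNormalization()(f)
    f = layers.ReLU()(f)
    f = layers.Conv2D(filters, 3, padding="same", use_bias=False)(f)
    f = layers.BatchNormalization()(f)

    # Ws*x: a 1x1 convolution with stride 2 projects the shortcut
    # to the new spatial size and channel count
    shortcut = layers.Conv2D(filters, 1, strides=2, padding="same", use_bias=False)(x)
    shortcut = layers.BatchNormalization()(shortcut)

    return layers.ReLU()(layers.Add()([f, shortcut]))

# Usage: 56x56x64 in, 28x28x128 out
inputs = tf.keras.Input(shape=(56, 56, 64))
outputs = downsampling_residual_block(inputs, filters=128)
print(outputs.shape)  # (None, 28, 28, 128)
```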

Let’s Build it using TF 2.0

TensorFlow has recently released a beta of the upcoming version 2.0, which brings a suite of innovations that make our lives easier.

For this tutorial, we are going to use 3 amazing new features:

  • Custom Layers
  • Custom model
  • Eager execution
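As a quick taste of the last item: with eager execution, operations run immediately and return concrete values instead of building a graph to run later, which makes experimenting and debugging far more natural. A tiny illustration:

```python
import tensorflow as tf

# In TF 2.0, eager execution is enabled by default
print(tf.executing_eagerly())  # True

x = tf.constant([[1.0, 2.0], [3.0, 4.0]])
y = tf.matmul(x, x)            # evaluated right away, no session needed
print(y.numpy())               # [[ 7. 10.] [15. 22.]]
```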

Layers encapsulate a state (weights) and some computation

The main data structure we work with is the Layer. A layer encapsulates both a state (the layer’s “weights” or knowledge in other words) and a transformation from inputs to outputs (a “call” function, the layer’s forward pass).

In order to create a custom layer, we need to subclass the layers.Layer class.

With that, let’s define a custom convolution layer block with more than one convolution operation:
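The original code snippet is not embedded here, so below is a minimal sketch of what such a block could look like: a layers.Layer subclass that holds two convolutions (the state) and applies them together with the residual addition in call (the computation). The name ResidualConvBlock and the exact layer choices are my own assumptions:

```python
import tensorflow as tf
from tensorflow.keras import layers

class ResidualConvBlock(layers.Layer):
    """A custom layer block with more than one convolution:
    two 3x3 convolutions (F(x)) plus the identity shortcut."""

    def __init__(self, filters, **kwargs):
        super().__init__(**kwargs)
        # State: the weights live inside these sub-layers
        self.conv1 = layers.Conv2D(filters, 3, padding="same", use_bias=False)
        self.bn1 = layers.BatchNormalization()
        self.conv2 = layers.Conv2D(filters, 3, padding="same", use_bias=False)
        self.bn2 = layers.BatchNormalization()

    def call(self, inputs, training=False):
        # Computation: the forward pass F(x) + x
        x = self.conv1(inputs)
        x = self.bn1(x, training=training)
        x = tf.nn.relu(x)
        x = self.conv2(x)
        x = self.bn2(x, training=training)
        # Assumes `inputs` already has `filters` channels
        return tf.nn.relu(x + inputs)
```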

Custom Model

A custom model is built the same way; the only difference is that we subclass the tf.keras.Model class. The subclassed model has all the attributes of a model, such as .compile() and .fit(), but most importantly our custom model has a flexible input layer, meaning you can train with images of any size without manually setting the input size on the model.

TensorFlow 2.0 has opened doors to better ways to create custom Deep Learning solutions.

Here is an example of how to create a custom model:
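As before, the original snippet is not embedded, so here is a hedged sketch of a subclassed model that stacks the ResidualConvBlock layers from the previous sketch. The MiniResNet name, the layer counts and the filter sizes are illustrative; global average pooling is what keeps the model agnostic to the input image size:

```python
import tensorflow as tf
from tensorflow.keras import layers

class MiniResNet(tf.keras.Model):
    """A small subclassed model; assumes ResidualConvBlock
    from the previous sketch is already defined."""

    def __init__(self, num_classes=2, filters=64, **kwargs):
        super().__init__(**kwargs)
        self.stem = layers.Conv2D(filters, 7, strides=2, padding="same", activation="relu")
        self.pool = layers.MaxPooling2D(3, strides=2, padding="same")
        self.block1 = ResidualConvBlock(filters)
        self.block2 = ResidualConvBlock(filters)
        # Global average pooling removes the dependency on input size
        self.gap = layers.GlobalAveragePooling2D()
        self.classifier = layers.Dense(num_classes, activation="softmax")

    def call(self, inputs, training=False):
        x = self.stem(inputs)
        x = self.pool(x)
        x = self.block1(x, training=training)
        x = self.block2(x, training=training)
        x = self.gap(x)
        return self.classifier(x)

# Usage: compile and fit like any other Keras model
model = MiniResNet(num_classes=2)
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```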

The dataset used for this tutorial is hymenoptera_data, a dataset of images of bees and ants.
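Assuming the archive has been downloaded and extracted locally with its usual layout (a train and a val folder, each with one subfolder per class), the images could be loaded with Keras' ImageDataGenerator, for example:

```python
import tensorflow as tf

# Assumes hymenoptera_data/ sits next to this script, laid out as
# train/ants, train/bees, val/ants, val/bees
datagen = tf.keras.preprocessing.image.ImageDataGenerator(rescale=1.0 / 255)

train_gen = datagen.flow_from_directory(
    "hymenoptera_data/train", target_size=(224, 224),
    batch_size=32, class_mode="sparse")
val_gen = datagen.flow_from_directory(
    "hymenoptera_data/val", target_size=(224, 224),
    batch_size=32, class_mode="sparse")

# Train the subclassed model from the previous sketch
model.fit(train_gen, validation_data=val_gen, epochs=5)
```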

The full code is in the colab notebook below:

Conclusion

Universal laws and their inner workings are objective, and by drawing inspiration from perfect systems we might as well build a future just as perfect.

The key to learning lies in how the brain does it: the brain only strengthens those connections that are constantly interacting. Based on this, we can apply the same law by building algorithms that mimic this type of behaviour, and this is just the tip of the iceberg when it comes to true intelligence.

Residual learning is an important steppingstone for developing truly intelligent algorithms.


Thank you for reading. If you have any thoughts, comments or criticism, please comment down below.

Follow me on twitter at Prince Canuma, so you can always be up to date with the AI field.

If you like it and relate to it, please give me a round of applause 👏👏 👏(+50) and share it with your friends.