Computer Vision: Human Intelligence to Artificial Intelligence

Intelligence is one of the most important things humans have acquired through evolution. It is our ability to think abstractly, reason, establish correlations between experiences and convert them into knowledge, be creative, predict and, most importantly, generate emotions. The human brain is often considered one of the most complex and fascinating structures in the known universe because it is empowered by intelligence. If we try to understand the relation between brain and intelligence, we realize that the brain is more like a tool that gathers information from all of our senses and helps our intelligence enrich its capabilities. Seen this way, the brain is an organ like any other, while intelligence is the complex part, and it is this aspect that we keep trying to understand. For example, when we decide to pull our hand away from a hot plate, it is our intelligence that instructs the brain to move the hand. That is why people with paralysis may be unable to control their body yet show every sign of intelligence. In fact, our intelligence is us: distinct from our body and unique to each individual.

Artificial Intelligence is the science and engineering of making intelligent machines, as defined by John McCarthy, who coined the term in 1956. The modern era has refined this definition in the context of computers: AI is the domain of building intelligent programs for computers or computer-operated systems. Just like the human brain, a computer develops its intelligence from collected information, which we call data. The human brain collects data through the senses, i.e. sight, hearing, taste, touch, smell and proprioception; our intelligence differs for each sense, and our decisions and actions are outcomes of our collective intelligence. Sight is usually the most dominant sense we have, and our collective intelligence depends heavily on visual information. We can understand the context of a scene just by looking at it. We can perceive the three-dimensional world around us and differentiate shapes with ease. Researchers found the idea of developing computer intelligence from visual information very interesting, and this has led to a domain of AI called Computer Vision. Formally, computer vision is a sub-domain of AI in which computers are made intelligent enough to collect visual information from the real world in the form of images or videos and develop a high-level understanding of the world.

Learning is the task of converting collected information or data into intelligence, knowledge or expertise. Researchers have found that it is easy to develop artificial intelligence when the relation between the collected information and the decision or action taken can be represented by a set of mathematical rules. The hard part is the decisions and actions we take intuitively. The bottleneck of AI is finding mathematical representations of our intuitive decisions or actions with respect to the collected information, such as understanding speech or recognizing objects under deformation. The goal is to develop and understand learning algorithms for such intuitive decisions based on visual information.

Let’s understand the different keywords we frequently use in this domain, namely Deep Learning, Machine Learning, Knowledge Base Learning and Representation Learning, and how they are related to AI. To do so, let’s consider an example where we want to develop a program which can tell Cats from Dogs. The very first step is the collection of information or data. To collect data, let’s assume that we physically get a thousand cats and a thousand dogs, picked at random from the market. Next we can collect information either in the form of textual descriptions or by taking an image of each animal. As we want to get into computer vision, we take images as data, which makes our problem an Image Classification problem. At this point we have a total of two thousand images of cats and dogs. We need a few samples to test our performance, so we divide our two thousand samples into 80% (1600 images) for training, called the Training Set, and 20% (400 images) for testing, called the Testing Set.
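The 80/20 split described above can be sketched in a few lines of Python. This is only a minimal illustration: the file names and the `split_dataset` helper are made up for this example, and a fixed random seed is assumed so that the split is repeatable.

```python
import random

def split_dataset(samples, train_fraction=0.8, seed=0):
    """Shuffle the samples and split them into a training and a testing set."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Hypothetical file names standing in for the 1000 cat and 1000 dog photos.
images = [(f"cat_{i}.jpg", "cat") for i in range(1000)]
images += [(f"dog_{i}.jpg", "dog") for i in range(1000)]

training_set, testing_set = split_dataset(images)
print(len(training_set), len(testing_set))
```

Shuffling before cutting matters here: without it, the testing set would contain only dogs, because the list was built with all cats first.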

The first approach we can come up with is to take each image from our test set and match it against every single image in the training set. If we find a matching image in the training set, we assign the test image the same class label, i.e. cat or dog, as the matched image. This approach to AI is called Knowledge Base Learning, where the training set acts as the knowledge or intelligence of our program. The problem with this approach is that we need a huge training set to cover all the images in the test set. There is also the possibility of meeting a breed of cat or dog which is not part of our training set, or of coming across an image taken from a different angle.
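To make the weakness of this matching approach concrete, here is a small sketch that treats the training set as a lookup table. It assumes exact pixel-for-pixel matching and uses tiny made-up "images" of four grey values; the function names are hypothetical.

```python
def build_knowledge_base(training_set):
    """Store every training image (as a tuple of pixels) with its label."""
    return {tuple(pixels): label for pixels, label in training_set}

def classify_by_lookup(knowledge_base, pixels):
    """Return the stored label for an exact match, or None if unseen."""
    return knowledge_base.get(tuple(pixels))

# Toy 2x2 "images" flattened to four grey values (hypothetical data).
training_set = [
    ([0, 0, 255, 255], "cat"),
    ([255, 255, 0, 0], "dog"),
]
kb = build_knowledge_base(training_set)

seen = classify_by_lookup(kb, [0, 0, 255, 255])     # exact copy of a training image
unseen = classify_by_lookup(kb, [0, 10, 255, 255])  # same image, one pixel changed
print(seen, unseen)
```

The first query finds its label, while the second returns `None` even though only one pixel differs. This is the coverage problem in miniature: the program knows only what it has literally stored.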

To solve the problems associated with knowledge base learning, we can identify significant pieces of information known as features, e.g. ears, nose, face, paws, which help us as humans to tell cats from dogs, and write functions to find these features in an image. This set of features becomes our intermediate training data. We then create a program which takes this set of features as input and is able to extract the hidden patterns in it and associate them with a label. This approach to AI is called Machine Learning, where the AI has the capability to find hidden patterns in a given set of features representing the data. Despite its success in certain domains such as numerical data analysis, predicting the type of cancer or predicting stock prices, machine learning alone was not enough to develop an AI which can understand objects in images. The reason is our own limited ability to identify the exact set of features which collectively represent a given object. The manual process of finding this set of features is called Feature Engineering. In other words, machine learning systems depend heavily on the effectiveness of the methods chosen during feature engineering.
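As a rough sketch of this pipeline, suppose the feature engineering step has already reduced each image to two numbers. Both the features (ear pointiness and snout length) and their values are invented for illustration, and the nearest-centroid classifier is just one simple choice of learning algorithm, not the only one.

```python
def centroid(rows):
    """Average each feature column over a list of feature vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]

def train(features, labels):
    """Learn one centroid per class from hand-engineered feature vectors."""
    by_label = {}
    for vector, label in zip(features, labels):
        by_label.setdefault(label, []).append(vector)
    return {label: centroid(rows) for label, rows in by_label.items()}

def predict(model, vector):
    """Assign the label whose centroid is closest in squared distance."""
    def sq_dist(center):
        return sum((a - b) ** 2 for a, b in zip(vector, center))
    return min(model, key=lambda label: sq_dist(model[label]))

# Hypothetical engineered features: (ear pointiness, snout length), both in [0, 1].
features = [(0.9, 0.2), (0.8, 0.3), (0.3, 0.8), (0.2, 0.9)]
labels = ["cat", "cat", "dog", "dog"]

model = train(features, labels)
print(predict(model, (0.85, 0.25)))
```

Unlike the lookup table, this program can label an image it has never seen, as long as its features fall near one of the learned class centers. Its quality, however, is capped by how well the hand-picked features separate the classes.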

Illustration of the VGG16 network for an image of a cat and a dog. It shows how Deep Learning divides the task into different representations. The representations at the top show lower-level features like edges, and the images at the bottom show higher-level features like contours.

The difficulties faced by systems based on machine learning suggest that an AI system needs the ability not only to acquire knowledge by extracting hidden patterns from given data, but also to acquire the set of features that represents the given data points. This ability to acquire a set of features representing the given data is called Feature Learning, and it replaces the manual feature engineering step. This approach to AI is called Representation Learning, where we use machine learning both to extract features and to establish a mathematical relation between the extracted features and the output labels. In the context of our experiment, this means we feed the training images as they are and let our program figure out the best set of features to classify them, for example generalized, high-level, abstract features like contours which represent the shape of cats and dogs in minute detail. The complexity of learning such representations directly has given rise to an approach to AI called Deep Learning, where we express a high-level, abstract representation as a collection of interrelated simpler representations. It is similar to forming a deep graph which captures the contextual correlation between simpler representations of the given image. For example, a given image can be represented as a set of small parts, each part can be represented as a set of contours, each contour can be represented as a set of edges, and so on. Refer to the figure for an illustration of layer visualizations of the deep image classification network VGG16.
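The idea of building a representation out of simpler ones can be sketched with three tiny stacked functions. This is a deliberately simplified illustration and not how VGG16 works internally: the "layers" below are written by hand, whereas a deep network learns its filters from data, and the 4x4 image is a made-up toy.

```python
def edge_layer(image):
    """Layer 1: brightness change between horizontally neighbouring pixels."""
    return [[abs(row[x + 1] - row[x]) for x in range(len(row) - 1)] for row in image]

def part_layer(edges):
    """Layer 2: summarise each row of edge responses by its strongest edge."""
    return [max(row) for row in edges]

def shape_layer(parts, threshold=0.5):
    """Layer 3: count the rows containing a strong edge (a crude outline height)."""
    return sum(1 for p in parts if p > threshold)

# Toy 4x4 grey image: a bright 2x2 block on a dark background (hypothetical data).
image = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]

edges = edge_layer(image)    # pixels -> edges
parts = part_layer(edges)    # edges -> per-row summary
height = shape_layer(parts)  # summary -> one shape-level number
print(edges, parts, height)
```

Each function only consumes the output of the one before it, so the final number describing the shape is never computed from raw pixels directly. That layering is the point of the sketch.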

When and Why Do We Use Deep Learning

Deep Learning is like a big hammer that not every nail requires. Every deep learning based approach needs a decent amount of resources to implement in research and considerably more to use in production. That is why it is very important to understand which types of problems require a deep learning based approach and when we can solve the given problem with simple machine learning tools. In general we can divide problems by two criteria: how hard it is to select features for them, and how much generalization or adaptivity they require.

Let’s consider a problem where a birdwatcher wants to understand the food behavior of five different bird species based on the following set of information: (i) length of beak, (ii) shape of beak, (iii) images of each bird, (iv) sound of each bird, (v) labels for the food behavior category. The first step for us is to understand the given problem. We want to classify birds according to their food behavior. A basic review of the domain quickly shows that there is a strong relation between the type of beak (length and shape) and the food behavior of a bird. There are different approaches to solving this problem.

  1. Using Machine Learning: We take length of beak and shape of beak as features and design a classifier which assigns the birds to different classes according to their food behavior.
  2. Using Deep Learning: We take the images of the birds and classify them into different classes according to their food behavior.

We can easily see that for this particular problem the Machine Learning based approach is much easier and should be more accurate, because the features we provide are exactly the features needed to classify birds according to their food behavior.
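A minimal sketch of the machine learning route, using only the two beak features, might look as follows. All measurements and category names are invented for illustration, and the one-nearest-neighbour rule with a mixed distance is an assumed, deliberately simple classifier.

```python
def distance(a, b, length_scale=10.0):
    """Mixed distance: scaled beak-length gap plus a penalty if shapes differ."""
    length_gap = abs(a[0] - b[0]) / length_scale
    shape_gap = 0.0 if a[1] == b[1] else 1.0
    return length_gap + shape_gap

def predict_food_behavior(examples, bird):
    """1-nearest-neighbour: copy the label of the most similar known bird."""
    nearest = min(examples, key=lambda example: distance(example[0], bird))
    return nearest[1]

# Hypothetical measurements: ((beak length in mm, beak shape), food behavior).
examples = [
    ((12.0, "conical"), "seed eater"),
    ((35.0, "hooked"), "meat eater"),
    ((60.0, "long-thin"), "nectar feeder"),
    ((80.0, "flat"), "filter feeder"),
    ((25.0, "chisel"), "insect eater"),
]

print(predict_food_behavior(examples, (14.0, "conical")))
```

With two well-chosen features the whole classifier fits in a dozen lines and needs no images at all, which is why the simpler tool is the better fit for this problem.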

Next, our birdwatcher asks us to group birds based on their similarity in voice and appearance. While approaching this problem we realize that the sound of birds changes with time, weather and their mood. In the same way, it is very hard to pin down unique features that describe and differentiate appearance. For this type of problem Deep Learning is a more suitable tool.

To summarize, Deep Learning is great for:

  1. Problems which require a large number of features to solve, where it is hard to describe each feature, e.g. recognizing voices, recognizing faces.
  2. Problems which require inter-correlated features to build a higher-level understanding, e.g. semantic analysis of sentences, object tracking and segmentation, document summarization.
  3. Problems which require high adaptiveness.


LeCun, Y., Bengio, Y. and Hinton, G., 2015. Deep learning. Nature, 521(7553), pp.436–444.

Source: Deep Learning on Medium