Original article was published on Deep Learning on Medium
Evolution of Object detection:
The current decade we can’t imagine ourselves without the social media apps, e-commerce sites which made our day to day life much easier, comfortable, more engaging we ever had. Here comes the challenge, for all these things the most common thing involved in it is dealing with a huge variety of Images, be it e-commerce, social media profile tagging & so on. Yes, what you are thinking is right, we are talking about the computer doing this for us instead of humans. We the humans are able to differentiate things with the help of our eye & brain, then what about Computer. It also needs a functionality of eye & brain so that the computer can look at things & differentiate, this we call it as Computer Vision.
The field of Computer Vision has witnessed Significant advancements in the recent 5 years. One of the most common advancement is Convolution Neural Networks(CNNs). With the help of CNNs the most common business problem we are able to solve is the Image Classification.
Deep learning/CNNs/Neural Network:
Thanks to Convolution neural networks (CNNs) that made the deep learning meaningful & interesting. This is implemented using 3–4 or more layers of neural networks where each layer is responsible for extracting one or more features of the image. Making these layers connected forms a Neural network. This neural network works in the same way the neurons in our brain works, each neuron taking an input, perform operations then passing the output to the next layer.
This is about a neural network, how about a Convolutional Neural Network, lets have a look into it. A ConvNet arranges its neurons in three dimensions (width, height, depth). Every layer of a ConvNet transforms the 3D input volume to a 3D output volume of neuron activations.
The layers in Convolutional Neural Network are Convolutional layer, Pooling layer, Fully Connected layer A Convolutional layer is a linear operation that involves the multiplication of a set of weights with our input, same like normal neural network. But this is designed for two- dimensional network & the multiplication is performed between an array of input data & a two- dimensional array of weights, called a filter or kernel. The filter will be smaller than the input data & the type of multiplication applied between a filter-sized patch of input & the filter is a dot product. The dot product is an element-wise multiplication between the filter-sized patch of the input and filter, which is then summed, resulting in single value always. That is why it is called as scaler product. This is just a brief about the CNN.
This looks amazing right. Now we got an overview of how an image can be classified. Let’s look at another problem of finding an object in an image pointing exactly where the object is by a bounding box around it. This is called as Object Localization.
This is if we have only one object in an image. What if we have multiple objects in the image. Here comes the multiple object detection. Have a look at this picture below.
In the above image all the three objects of two categories got detected. To implement this we have a set of pre-trained models available with the Tensorflow Github. The most famous one is which is built on common objects we see daily in our day to day life that is “ssd_mobilenet_v2_coco”.