Original article can be found here (source): Deep Learning on Medium
1. The Trend of Deep Learning
Deep learning has grown rapidly and has been one of the hottest trends in computing since the beginning of the previous decade. Throughout the 2010s it was a key driver behind technologies such as autonomous cars, visual art processing, and medical image diagnosis. The exponential growth of digital data and the rapid advances in GPUs powered this revolution. Many computer vision tasks have shifted from traditional statistical methods to deep learning techniques. One such task is object detection.
2. Object Detection
Object detection is the process of finding instances of objects in images or videos. It involves two main tasks: object localization and classification. A deep learning model looks for objects and locates each one with a bounding box. The object inside each bounding box is then classified based on its features.
Real-time object detection is one of the most sought-after goals in computer vision, and its significance has grown in recent times. Several technologies depend on real-time object detection to carry out their functions. For example, smart surveillance systems use modified object detection techniques to record criminal activities.
3. Speed and Accuracy trade-off
Speed and accuracy both play an important role in judging a real-time object detection system. In an autonomous car, real-time data from cameras and sensors is fed to the system, and control of the car depends on how quickly and accurately objects are recognized. One-stage methods like YOLO and SSD are fast, but they generally trade away some accuracy for that speed. Older techniques like R-CNN and SPP-Net are not fast enough to be used in a real-time system. Advances in the R-CNN family gave rise to a method that aims to meet the needs of both speed and accuracy: Faster R-CNN.
4. Faster R-CNN
Although the core of object detection is close to image classification, the real challenge lies in localizing objects within the image. R-CNN was the first model in this family to take a region-based approach to localization. It uses an algorithm called selective search, which selects patches of the image that are likely to contain objects. Each patch is then passed through a CNN separately. Although R-CNN was highly effective, this made it too expensive in both space and time to be practical.
The limitations of R-CNN are partly addressed by its successor, Fast R-CNN. In Fast R-CNN, the whole input image is forwarded once through a pretrained CNN, which produces a feature map. Selective search still runs on the input image, and each region it proposes is projected onto this shared feature map instead of being passed through the CNN on its own. Fast R-CNN is reported to be almost 20 times faster than its parent model, R-CNN.
Faster R-CNN goes one step further by replacing Fast R-CNN’s selective search with a faster Region Proposal Network. As a single unified model, Faster R-CNN relies on a few key mechanisms.
Anchors: Just like Fast R-CNN, Faster R-CNN uses a pretrained ConvNet to convert the input image into a feature map. A sliding window slides over this feature map, and the center of the window at each location is called an anchor. At every anchor, several anchor boxes (windows) of different aspect ratios (width/height of the box) and scales (size of the box) are created. (The Faster R-CNN paper uses 3 aspect ratios (1:1, 1:2, 2:1) and 3 scales (box areas of 128 x 128, 256 x 256, and 512 x 512 pixels), yielding 9 anchor boxes at every anchor.) These anchor boxes serve as reference boxes that are scored and refined in the following steps.
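To make the anchor idea concrete, here is a minimal NumPy sketch that builds 9 anchor boxes per feature-map location from the 3 scales and 3 aspect ratios mentioned above. The function name `make_anchors` and the feature-map stride of 16 pixels are assumptions made for this illustration; they are not taken from the article.

```python
import numpy as np

def make_anchors(feat_h, feat_w, stride=16,
                 scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return (feat_h * feat_w * 9, 4) anchor boxes as (x1, y1, x2, y2)
    in input-image pixels. `stride` (assumed 16) is how many image pixels
    one feature-map cell covers."""
    # One (width, height) pair per scale/ratio combination.
    shapes = []
    for s in scales:
        for r in ratios:              # r = width / height
            w = s * np.sqrt(r)        # keeps the box area equal to s * s
            h = s / np.sqrt(r)
            shapes.append((w, h))
    shapes = np.array(shapes)                              # (9, 2)

    # Anchor centers: the middle of each feature-map cell, mapped to the image.
    cx = (np.arange(feat_w) + 0.5) * stride
    cy = (np.arange(feat_h) + 0.5) * stride
    cx, cy = np.meshgrid(cx, cy)                           # each (feat_h, feat_w)
    centers = np.stack([cx.ravel(), cy.ravel()], axis=1)   # (H*W, 2)

    # Pair every center with every shape.
    half = shapes / 2.0
    top_left = centers[:, None, :] - half[None, :, :]      # (H*W, 9, 2)
    bottom_right = centers[:, None, :] + half[None, :, :]
    return np.concatenate([top_left, bottom_right], axis=2).reshape(-1, 4)

anchors = make_anchors(feat_h=2, feat_w=3)
```

Note that some of the resulting boxes extend past the image border; these are the cross-boundary boxes that a later filtering step has to deal with.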
Region Proposal Network (RPN): An RPN takes an image (of any size) as input and outputs a set of rectangular object proposals (coordinates of bounding boxes), each with an objectness score (the probability of containing an object). The RPN is a small ConvNet that slides over the feature map and maps each sliding-window location to a lower-dimensional vector (256 dimensions in the paper). This vector is fed into two sibling fully connected layers, a box regression layer and a box classification layer, which output bounding box coordinates and objectness scores respectively.
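The two sibling layers can be sketched as a pair of matrix multiplications applied to the 256-dimensional vector at every location. This is only an illustration of the output shapes, under several assumptions: the weights are random instead of trained, the function name `rpn_head` is made up for this example, and a single sigmoid score per anchor box is used as a simplification of the classification layer.

```python
import numpy as np

rng = np.random.default_rng(0)

def rpn_head(features, w_cls, b_cls, w_reg, b_reg):
    """features: (H, W, D) array, one D-dimensional vector per location.
    Returns objectness scores (H, W, k) and box offsets (H, W, k, 4),
    where k is the number of anchor boxes per location."""
    k = w_cls.shape[1]
    logits = features @ w_cls + b_cls                # (H, W, k)
    scores = 1.0 / (1.0 + np.exp(-logits))           # sigmoid: objectness in (0, 1)
    deltas = features @ w_reg + b_reg                # (H, W, 4k)
    deltas = deltas.reshape(*features.shape[:2], k, 4)
    return scores, deltas

# Hypothetical sizes: a 4 x 5 feature map, 256-d vectors, 9 anchor boxes.
H, W, D, K = 4, 5, 256, 9
features = rng.standard_normal((H, W, D))
w_cls = rng.standard_normal((D, K)) * 0.01           # random stand-in weights
b_cls = np.zeros(K)
w_reg = rng.standard_normal((D, 4 * K)) * 0.01
b_reg = np.zeros(4 * K)

scores, deltas = rpn_head(features, w_cls, b_cls, w_reg, b_reg)
```

Every location therefore yields 9 objectness scores and 9 sets of 4 box offsets, one per anchor box.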
Region of Interest (ROI) Pooling: ROI pooling converts variable-size inputs (regions of the feature map) into fixed-size outputs (vectors or feature maps). It takes a section of the input and maps it to a tensor of fixed dimensions. This solves the problem of the RPN producing ROIs whose bounding boxes vary in size.
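A rough NumPy sketch of the idea follows: the region is divided into a fixed grid of bins and the maximum of each bin is kept, so regions of any size end up with the same output shape. The function name `roi_max_pool`, the integer ROI coordinates (in feature-map cells, end-exclusive), and the 2 x 2 output size are assumptions chosen to keep the example small.

```python
import numpy as np

def roi_max_pool(feature_map, roi, out_size=(2, 2)):
    """feature_map: (H, W, C); roi: (x1, y1, x2, y2) in feature-map cells,
    with x2 and y2 exclusive and the region non-empty.
    Returns an array of shape (out_h, out_w, C)."""
    x1, y1, x2, y2 = roi
    region = feature_map[y1:y2, x1:x2]
    out_h, out_w = out_size
    # Split the region into an out_h x out_w grid of roughly equal bins.
    row_edges = np.linspace(0, region.shape[0], out_h + 1).astype(int)
    col_edges = np.linspace(0, region.shape[1], out_w + 1).astype(int)
    pooled = np.empty((out_h, out_w, feature_map.shape[2]),
                      dtype=feature_map.dtype)
    for i in range(out_h):
        for j in range(out_w):
            # Force every bin to cover at least one cell.
            r0, r1 = row_edges[i], max(row_edges[i + 1], row_edges[i] + 1)
            c0, c1 = col_edges[j], max(col_edges[j + 1], col_edges[j] + 1)
            pooled[i, j] = region[r0:r1, c0:c1].max(axis=(0, 1))
    return pooled
```

Calling it with two ROIs of different sizes on the same feature map returns two arrays of identical shape, which is what lets the later fully connected layers accept them.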
Step-by-step procedure of object detection with Faster R-CNN:
- Generation of a feature map by forwarding the input image through a pretrained ConvNet.
- Creation of anchor boxes of different scales and aspect ratios by sliding a window over the feature map.
- Creation of a set of bounding box coordinates with objectness scores using the RPN.
- Filtering of bounding boxes that are redundant or cross the image boundary (using non-maximum suppression and IoU values).
- Conversion of the variable-size patches of the feature map, given by the bounding box coordinates from the RPN, to tensors of fixed dimensions.
- Use of fully connected layers on each tensor to determine the class of the object present in the bounding box.
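The filtering step in the list above can be sketched with two small functions: one computes intersection over union (IoU) between boxes, and the other performs greedy non-maximum suppression, keeping the highest-scoring box and dropping any remaining box that overlaps it too much. The function names and the default threshold of 0.7 are assumptions for this illustration, and the sketch does not cover the removal of cross-boundary boxes.

```python
import numpy as np

def iou(box, boxes):
    """IoU between one box (4,) and many boxes (N, 4), all (x1, y1, x2, y2)."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    # Width and height of the overlap, clipped at zero when boxes are disjoint.
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter)

def nms(boxes, scores, iou_threshold=0.7):
    """Greedy non-maximum suppression. Returns indices of the kept boxes,
    ordered from highest to lowest score."""
    order = np.argsort(scores)[::-1]       # best score first
    keep = []
    while order.size > 0:
        best = order[0]
        keep.append(int(best))
        rest = order[1:]
        overlaps = iou(boxes[best], boxes[rest])
        # Drop every remaining box that overlaps the kept one too much.
        order = rest[overlaps <= iou_threshold]
    return keep

# Hypothetical proposals: two heavily overlapping boxes and one separate box.
boxes = np.array([[0, 0, 10, 10], [1, 1, 11, 11], [20, 20, 30, 30]], dtype=float)
scores = np.array([0.9, 0.8, 0.7])
kept = nms(boxes, scores, iou_threshold=0.5)
```

A lower threshold suppresses more aggressively; a higher one lets more overlapping proposals through to ROI pooling.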
Faster R-CNN provides better accuracy than many other methods, but it still lags behind one-stage detectors in terms of speed. It is reported to be about 250 times faster than R-CNN. Further improvements could bring it on par with YOLO and RetinaNet in terms of speed.