Source: Deep Learning on Medium
Computer vision is a rapidly growing field in technology and computer science. It is a high-level, multifaceted discipline that enables machines to learn complex representations from images and videos in order to automate human visual tasks. Recently, Amazon opened a cashier-less grocery store that uses computer vision and deep learning to automatically detect which items are taken off the shelf. Pinterest developed a visual search engine that uses an object detection pipeline for content recommendation. Computer vision has also been applied to some of humanity’s grand challenges: Stanford University’s Sustainability and Artificial Intelligence Lab, for example, combines satellite imagery and machine learning to predict poverty.
In this post, I describe how I built an object detection algorithm (FurniSure), during my time as an Insight Data Science Fellow, using a convolutional neural network-based algorithm called “You Only Look Once” to identify, classify, and localize different types of furniture in images and videos. The company I consulted for has a platform for users to upload personal photos and videos that can be shared across various social media platforms. For this project, the focus was on automating furniture detection so that these items can be identified for advertising purposes.
Developing The Dataset
Gathering a sufficient amount of data was the first challenge. According to Andrew Ng and other machine learning experts, the amount of training data has the biggest impact on the performance of a deep learning algorithm, more than hyperparameter tuning or the complexity of the network architecture. To ensure I gathered enough, I started with data provided by the company I was working with, and then scraped additional public furniture images from Google and Pinterest’s API using the Beautiful Soup library in Python.
To create even more training data, I used an image augmentation library called Augmentor, which builds a stochastic pipeline out of composable operations, making the process reproducible and less error-prone. Each image is passed through the pipeline multiple times, and each operation is applied with a pre-defined probability. These operations are the main features of Augmentor and consist of standard image manipulation functions such as rotating, perspective skewing, elastic distortion, shearing and mirroring. This stochastic approach allowed me to generate a large amount of data (about 10x) from the initial dataset.
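The idea of a probability-gated pipeline can be illustrated without Augmentor itself. The sketch below is a library-free toy, not Augmentor's API: images are plain nested lists, and the class name `StochasticPipeline`, the two operations, and the probabilities are all hypothetical choices made for illustration.

```python
import random


def mirror(img):
    """Flip each row left-to-right."""
    return [row[::-1] for row in img]


def rotate90(img):
    """Rotate the image 90 degrees clockwise."""
    return [list(row) for row in zip(*img[::-1])]


class StochasticPipeline:
    """Toy pipeline: each operation fires with its own probability."""

    def __init__(self, seed=None):
        self.ops = []
        # A seeded generator makes the random choices reproducible.
        self.rng = random.Random(seed)

    def add(self, op, probability):
        self.ops.append((op, probability))

    def sample(self, img, n):
        """Pass the image through the pipeline n times."""
        out = []
        for _ in range(n):
            aug = img
            for op, probability in self.ops:
                if self.rng.random() < probability:
                    aug = op(aug)
            out.append(aug)
        return out


image = [[1, 2], [3, 4]]
pipeline = StochasticPipeline(seed=0)
pipeline.add(mirror, probability=0.5)
pipeline.add(rotate90, probability=0.5)
augmented = pipeline.sample(image, n=10)  # 10 variants from one image
```

Because every pass draws fresh random numbers, one source image yields many distinct variants, and fixing the seed makes the whole run repeatable.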
Labeling images for object detection is an important but daunting task. I used LabelImg, a graphical image annotation tool for creating labeled datasets. I manually annotated the images by drawing bounding boxes around the objects of interest. For each image, the metadata (image height and width, bounding box coordinates, and object classes) is saved as an XML file in the PASCAL VOC format.
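To give a feel for what such an annotation file contains, here is a minimal sketch that reads a PASCAL VOC style XML string with Python's standard library. The file name, classes and coordinates in the sample are made up for illustration, and `parse_voc` is a hypothetical helper, not part of LabelImg.

```python
import xml.etree.ElementTree as ET

# Hypothetical annotation in the PASCAL VOC layout (values are invented).
VOC_XML = """<annotation>
  <filename>living_room.jpg</filename>
  <size><width>640</width><height>480</height><depth>3</depth></size>
  <object>
    <name>sofa</name>
    <bndbox><xmin>48</xmin><ymin>240</ymin><xmax>380</xmax><ymax>430</ymax></bndbox>
  </object>
  <object>
    <name>lamp</name>
    <bndbox><xmin>500</xmin><ymin>60</ymin><xmax>560</xmax><ymax>300</ymax></bndbox>
  </object>
</annotation>"""


def parse_voc(xml_text):
    """Return image size and a list of labeled boxes from a VOC XML string."""
    root = ET.fromstring(xml_text)
    size = root.find("size")
    record = {
        "filename": root.findtext("filename"),
        "width": int(size.findtext("width")),
        "height": int(size.findtext("height")),
        "objects": [],
    }
    for obj in root.iter("object"):
        box = obj.find("bndbox")
        corners = tuple(
            int(box.findtext(tag)) for tag in ("xmin", "ymin", "xmax", "ymax")
        )
        record["objects"].append({"class": obj.findtext("name"), "box": corners})
    return record


annotation = parse_voc(VOC_XML)
for item in annotation["objects"]:
    print(item["class"], item["box"])
```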
Beyond Image Classification
Image classification is a very popular problem in computer vision. It involves assigning images to categories using a particular type of multi-layer neural network called a convolutional neural network (CNN). CNNs are designed to recognize visual patterns in images and are robust to geometric transformations and to high variability in appearance. One of the earliest CNNs, LeNet-5, was developed in 1998 by LeCun et al. In recent years, we have seen a surge in CNN-based classification models trained on the ImageNet dataset, such as AlexNet and GoogLeNet (which is built from Inception modules), and more recent models have been reported to surpass human-level accuracy on that benchmark. Going one step further, localization and object/instance segmentation extend image classification: they combine the prediction of object classes with localization, either by drawing a bounding box around the object or by masking the object in the image, pixel by pixel.
Object Detection — You Only Look Once “YOLO”
An object detection system must recognize, classify and localize not just one object in an image but every object of interest. This is a much more difficult task than traditional image classification. For this project, I used a single-shot (one-stage) detection algorithm called You Only Look Once (“YOLO”). It is a cutting-edge detection algorithm that can identify distinct objects within an image. It looks at the image once and divides it into grid cells. Each grid cell predicts bounding boxes together with a confidence score for each box, defined as the probability that the box contains an object multiplied by the Intersection Over Union (IOU) between the predicted box and the ground truth. Each grid cell also predicts a probability distribution over all possible classes, conditional on the cell containing an object. The class-specific confidence score is the product of the box confidence and the conditional class probability:
Pr(Class | Object) * Pr(Object) * IOU = Pr(Class) * IOU
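As a small worked example of this formula, the sketch below multiplies one box confidence, Pr(Object) * IOU, by a conditional class distribution. The numbers and class names are invented for illustration, and `class_confidences` is a hypothetical helper rather than anything from the YOLO code base.

```python
def class_confidences(box_confidence, class_probs):
    """Multiply the box confidence, Pr(Object) * IOU, by each Pr(Class | Object)."""
    return {name: p * box_confidence for name, p in class_probs.items()}


# Invented values for a single predicted box.
box_confidence = 0.8                                      # Pr(Object) * IOU
class_probs = {"chair": 0.7, "sofa": 0.2, "table": 0.1}   # Pr(Class | Object)

scores = class_confidences(box_confidence, class_probs)
best = max(scores, key=scores.get)
print(best, round(scores[best], 2))
```

A box is typically kept only if its best class score clears a chosen confidence threshold, so a confident class prediction on a poorly localized box still ends up with a low score.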
I decided to use transfer learning instead of training the full YOLOv2 network on my dataset. This allowed me to use the available data effectively, reduce the computational cost, and improve the model’s generalization and robustness. I downloaded the pre-trained weights, froze the first 24 convolutional layers of YOLO, and trained only the last fully connected layers. Image processing was done with OpenCV, and the processed images were used to train those final layers using Python’s Keras package with a TensorFlow backend, starting from weights pre-trained in Darknet.
My new YOLO model was trained using transfer learning on some of the original images plus all of the augmented images. Testing was done on a very small sub-sample of the entire dataset, consisting mainly of original, non-augmented images. In object detection, the most common way to decide whether a predicted box is correct is to check its Intersection Over Union (IOU) with the ground truth, that is, the area of overlap between the two boxes divided by the area of their union. I set the threshold to IOU > 0.5: a prediction is a hit if its bounding box satisfies this criterion with respect to a ground truth bounding box; otherwise, it is a miss. For each class, precision and recall were estimated from the true positives (TP), false positives (FP) and false negatives (FN). The average precision (AP) for a class is the precision averaged across all recall values between 0 and 1. The mean average precision (mAP) is the mean of the per-class average precisions. For this project, the mAP on my test set was 0.714, computed from the precision-recall curves of the individual furniture classes.
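The evaluation steps can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the evaluation code used for the project: it handles one class in one image, matches each detection to the best still-unmatched ground truth box, and computes a plain (non-interpolated) area under the precision-recall curve. The boxes and scores are invented, and the function names are hypothetical.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (xmin, ymin, xmax, ymax)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def average_precision(detections, ground_truths, threshold=0.5):
    """detections: list of (score, box); ground_truths: list of boxes."""
    if not ground_truths:
        return 0.0
    matched, tp, fp, curve = set(), 0, 0, []
    # Walk through detections from most to least confident.
    for _, box in sorted(detections, key=lambda d: d[0], reverse=True):
        best_j, best_iou = None, 0.0
        for j, gt in enumerate(ground_truths):
            overlap = iou(box, gt)
            if j not in matched and overlap > best_iou:
                best_j, best_iou = j, overlap
        if best_j is not None and best_iou > threshold:
            matched.add(best_j)   # hit: IOU above the threshold
            tp += 1
        else:
            fp += 1               # miss
        curve.append((tp / len(ground_truths), tp / (tp + fp)))
    # Sum precision over each increase in recall.
    ap, prev_recall = 0.0, 0.0
    for recall, precision in curve:
        ap += precision * (recall - prev_recall)
        prev_recall = recall
    return ap


truth = [(0, 0, 10, 10), (20, 20, 30, 30)]
preds = [(0.9, (0, 0, 10, 10)), (0.8, (50, 50, 60, 60)), (0.6, (20, 20, 30, 28))]
ap = average_precision(preds, truth)
```

The mAP is then the mean of the AP values obtained this way for each class. Benchmark toolkits such as the PASCAL VOC devkit apply an interpolated variant of this calculation, so their numbers can differ slightly from this sketch.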
I built an object detection model that identifies, classifies and localizes multiple items of furniture in a set of images using a state-of-the-art deep learning algorithm. I also applied the model to videos and to real-time detection with a webcam. Videos were split into frames at 20 frames per second using OpenCV, and predictions were made on each frame. Stochastic augmentation also greatly improved the precision at a given recall.
The best way to improve the model going forward is to get more data with a better balance across object classes. Facebook AI Research recently published a paper that tackles the class imbalance problem in object detection by reshaping the standard cross-entropy loss so that well-classified examples are down-weighted, and used this loss to train a simple one-stage detector built on a Feature Pyramid Network (FPN), called RetinaNet. This method is a possible candidate for addressing the extreme imbalance problem in YOLO and other single-shot detection methods.
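The loss introduced in the RetinaNet paper is known as the focal loss, FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t), where p_t is the predicted probability of the true class. The sketch below implements it for a single binary prediction in plain Python to show how it compares with cross entropy. It was not part of this project, and the default values alpha = 0.25 and gamma = 2 are taken from the paper, not tuned for furniture data.

```python
import math


def cross_entropy(p, y):
    """Standard binary cross entropy for one prediction; y is 0 or 1."""
    p_t = p if y == 1 else 1.0 - p
    return -math.log(p_t)


def focal_loss(p, y, alpha=0.25, gamma=2.0):
    """Binary focal loss: cross entropy scaled by alpha_t * (1 - p_t) ** gamma."""
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


# An easy positive (confident and correct) versus a hard positive.
easy_ratio = focal_loss(0.95, 1) / cross_entropy(0.95, 1)
hard_ratio = focal_loss(0.30, 1) / cross_entropy(0.30, 1)
print(easy_ratio, hard_ratio)
```

The easy example keeps a much smaller fraction of its cross-entropy loss than the hard one, which is how the many easy background boxes are prevented from dominating training.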
Labeling large numbers of images for object detection is a tedious and time-consuming task. It can be done by crowd-sourcing on platforms such as Amazon’s Mechanical Turk, or by building a gamified interface in which users annotate images as they interact with them. Increasing the complexity of the model can lead to a better mAP, but not necessarily to better speed. Algorithms like Mask R-CNN, which perform instance segmentation by combining object detection with pixel-by-pixel mask prediction, may greatly improve the overall accuracy.
Wale Akinfaderin was a Data Science Fellow at Insight Data Science (Seattle, Spring 2018). In his first three weeks at Insight, he consulted for a startup and built FurniSure: an algorithm to detect multiple items of furniture from images and videos by employing state-of-the-art deep learning techniques. His PhD research focuses on increasing the sensitivity of magnetic resonance spectroscopy. He previously worked for IBM Research where he developed a veracity model for improving response to safety incidents in East Africa using machine learning and natural language processing. He currently works as a Data Scientist at Lowe’s Companies, Inc.