Detecting other vehicles with a high degree of confidence and accuracy is crucial for presence detection, control and maneuvering in a self-driving car. A variety of solutions exist, ranging from video image detection systems (VIDS) and infrared sensors to inductive loop detectors (ILDs). Research in traffic engineering applications continues to evolve. For the most part, a combination of techniques (hardware and software) is expected to result in higher accuracy. In the meantime, object detection and recognition applications built on the same underlying technology have become prevalent. A few examples: Google Face Recognition, medical image processing and UAVs.
The objective of this article is to identify vehicles that are in motion, then tag and track them. The system should detect a new car entering its area of vision within a few seconds while keeping false positives to a bare minimum. Performance is not expected to be real-time. Assuming a real-time speed of 25–30 fps, a reasonable target is 2–5 fps, skipping intermediate frames on the assumption that the environment changes slowly between frames. The classification options are evaluated on accuracy, time to predict, complexity (number of hyper-parameters) and training time.
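The frame-skipping arithmetic above can be sketched in a few lines. The helper names `frame_stride` and `frames_to_process` are hypothetical and not part of the article's code; the sketch only shows how a source frame rate and a target rate translate into "run the detector on every N-th frame".

```python
def frame_stride(source_fps, target_fps):
    """Return N such that processing every N-th frame gives roughly target_fps."""
    if source_fps <= 0 or target_fps <= 0:
        raise ValueError("frame rates must be positive")
    # Integer division; never drop below a stride of 1 (process every frame).
    return max(1, source_fps // target_fps)


def frames_to_process(total_frames, stride):
    """Indices of the frames that are actually sent to the detector."""
    return list(range(0, total_frames, stride))


# A 30 fps source with a 5 fps target: run the detector on every 6th frame.
stride = frame_stride(30, 5)
selected = frames_to_process(30, stride)
```

Frames that are skipped would simply reuse the most recent detections, which is only reasonable under the slow-changing assumption stated above.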
Below are some images used for the initial calibration of the system.
There are several discrete parts to this problem: distinguishing vehicles from non-vehicles, determining the region occupied by the vehicle and tracking its movement. For any classification problem, a robust feature strategy is necessary. This directly impacts all aspects of the final system, including speed and accuracy. The features are extracted and fed into either a CV application or a machine learning model. For ML, a deep learning model, specifically a convolutional neural network, is an obvious choice given the number of features to train on and the availability of a decent-sized pre-labelled dataset. Finally, vehicle movement tracking to triangulate position and movement is done using CV. Recent research in this area, such as segmentation and Kalman filtering, is not considered.
The solution is broken down into the following key areas:
- Data cleansing and preparation (pre-processing)
- Feature extraction
- Classifier training
- Vehicle ROI
- Visual aids for the driver
Accuracy is measured by the number of false positives and the stability of the bounding region as the vehicle moves on the road. Time taken for prediction (and training) is considered for a final recommendation.
The GTI and KITTI databases contain ~8K car and ~10K non-car images (trees, roads, bridges, etc.) with dimensions of 64x64x3. The car images include multiple car models in different colors and from different viewing perspectives. The more variations available, the better the detection rate and accuracy. The only pre-processing needed is color space conversion and data augmentation (similar to what was done in Part 2; in fact, the entire module is reused as-is).
This is a sampling of the cars and non-cars in the database.
Color space conversion is done to leverage the saturation and hue of a car's color as a potential feature. The following set of images shows the car image after conversion, along with the transformed color space.
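To illustrate why hue and saturation can separate a car from its background, here is a minimal sketch using only Python's standard `colorsys` module. It converts to HLS as a stand-in; the article's pipeline uses other color spaces (the final model uses LUV), and the pixel values and helper names below are made up for illustration.

```python
import colorsys


def rgb_to_hls_pixel(r, g, b):
    """Convert one 8-bit RGB pixel to (hue, lightness, saturation), each in [0, 1]."""
    return colorsys.rgb_to_hls(r / 255.0, g / 255.0, b / 255.0)


def convert_image(pixels):
    """Convert rows of RGB tuples to rows of HLS tuples."""
    return [[rgb_to_hls_pixel(*p) for p in row] for row in pixels]


# Illustrative values: a saturated red car body versus a gray road surface.
red_car = (200, 20, 20)
gray_road = (120, 120, 120)
car_h, car_l, car_s = rgb_to_hls_pixel(*red_car)
road_h, road_l, road_s = rgb_to_hls_pixel(*gray_road)
```

The gray road pixel has zero saturation while the painted car body is strongly saturated, which is the kind of contrast a color-based feature can pick up.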
To reiterate, a feature strategy is an important and necessary step in the design phase. The number of features, their significance and the information available all need to be considered. For vehicles, the features to consider are shape, size, color and gradient pattern. Algorithms such as HOG and Haar/LBP are available to extract these features.
HOG: Histogram of Oriented Gradients (HOG) is based on the idea that an object’s shape, edges and corners contain information unique to that object that can be extracted. It computes the image gradients and bins their magnitudes by orientation into histograms, which become the signature of the object. The gradient carries two details, the magnitude and the direction, for each block. Although the descriptor is stable under variations in shape, color and size, it has a tendency to produce false positives. This had to be accounted for in the final result. The orientation and intensity of the gradient are shown here for different parameters.
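The binning step can be sketched for a single cell with NumPy. This is an illustration of the idea, not the article's implementation: the function name `cell_histogram`, the use of unsigned orientations over 0 to 180 degrees and the default of 9 bins are all assumptions. A full HOG descriptor also normalizes these histograms over overlapping blocks.

```python
import numpy as np


def cell_histogram(cell, orientations=9):
    """Magnitude-weighted histogram of gradient orientations for one cell."""
    cell = np.asarray(cell, dtype=float)
    gy, gx = np.gradient(cell)          # gradients along rows (y) and columns (x)
    magnitude = np.hypot(gx, gy)
    # Unsigned orientation: fold all angles into the range [0, 180] degrees.
    angle = np.degrees(np.arctan2(gy, gx)) % 180.0
    hist, _ = np.histogram(
        angle, bins=orientations, range=(0.0, 180.0), weights=magnitude
    )
    return hist


# An 8x8 cell whose brightness increases from left to right.
ramp = np.tile(np.arange(8.0), (8, 1))
hist = cell_histogram(ramp)
```

For the left-to-right ramp, every pixel has a horizontal gradient, so all of the weight lands in the first orientation bin; rotating the ramp by 90 degrees moves the weight to the middle bin.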
The HOG algorithm generates a feature vector whose size is determined by the number of orientations (O), the pixels per cell (P), the cells per block (C) and the number of channels. Each of these parameters has to be calibrated to get the best results. This is non-trivial given the number of possible combinations. A starting point is the set of values mentioned by the authors in the HOG research paper (O=11, P=8, C=4). One deciding factor was the accuracy during the training phase. However, it turned out to be fairly easy to obtain an accuracy of 98% for a wide range of parameters, so the final decision was ultimately based on the feature size and the number of false positives.
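Since feature size ended up driving the decision, it helps to see how the parameters translate into a vector length. The sketch below assumes square images, square cells and blocks, and a block stride of one cell; those layout choices and the function name `hog_feature_length` are assumptions for illustration, and the parameter values in the example are not the article's final settings.

```python
def hog_feature_length(image_size, orientations, pixels_per_cell,
                       cells_per_block, channels=1):
    """Length of a HOG feature vector under the assumptions stated above."""
    cells = image_size // pixels_per_cell      # cells along one side
    blocks = cells - cells_per_block + 1       # block positions along one side
    if blocks < 1:
        raise ValueError("block is larger than the image")
    per_channel = blocks * blocks * cells_per_block * cells_per_block * orientations
    return per_channel * channels


# Example: a 64x64 image, 9 orientations, 8x8-pixel cells, 2x2-cell blocks.
single_channel = hog_feature_length(64, 9, 8, 2)
three_channels = hog_feature_length(64, 9, 8, 2, channels=3)
```

Under these assumptions the single-channel descriptor has 1,764 values and the three-channel one 5,292, so small parameter changes quickly move the feature size by thousands of values.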
HAAR / LBP: Alternatives to HOG include Haar-like features and LBP (Local Binary Patterns). Haar-like features take advantage of the distribution of gray levels in two adjacent regions to extract important attributes of the image. They were primarily used for face detection but can be trained for any object. LBP compares each pixel's intensity with that of its neighbors, primarily capturing corners and edges, and hence is great at identifying spatial variations.
Finally, neural networks also use features in the traditional sense, except that the network determines the most important features through the learning process. In Part 2 of the series, outputs from the convolution layers showed different aspects of the image highlighted by different filters. Although this makes neural networks a very powerful tool, the downside is that it is difficult to predict with any certainty how the model will behave, since this is heavily influenced by the parameters, the underlying data distribution, the network organization and, to some extent, random chance.
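As a rough illustration of the LBP idea, the basic 3x3 operator compares the eight neighbors of a pixel with the center and packs the results into one byte. The function name `lbp_code` and the clockwise bit order starting at the top-left corner are assumptions for this sketch; library implementations may order the bits differently.

```python
def lbp_code(patch):
    """Return the 8-bit LBP code of the center pixel of a 3x3 patch."""
    center = patch[1][1]
    # Neighbors visited clockwise, starting at the top-left corner.
    offsets = [(0, 0), (0, 1), (0, 2), (1, 2), (2, 2), (2, 1), (2, 0), (1, 0)]
    code = 0
    for bit, (row, col) in enumerate(offsets):
        if patch[row][col] >= center:
            code |= 1 << bit
    return code


flat = [[5, 5, 5], [5, 5, 5], [5, 5, 5]]    # no structure
peak = [[1, 1, 1], [1, 9, 1], [1, 1, 1]]    # isolated bright pixel
edge = [[9, 9, 9], [5, 5, 5], [1, 1, 1]]    # horizontal edge
codes = [lbp_code(flat), lbp_code(peak), lbp_code(edge)]
```

A histogram of these codes over a window is what would typically be used as the feature vector; the three patches above produce three distinct codes, which is how the operator tells flat regions, spots and edges apart.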
This article will use HOG, LBP, color/spatial information as feature descriptors.
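A minimal NumPy sketch of the color and spatial parts of such a feature vector follows. The function names, the 16x16 downsampled size and the 32 histogram bins are illustrative assumptions rather than the article's settings. The HOG (or LBP) descriptor would be concatenated in the same way.

```python
import numpy as np


def spatial_features(image, size=16):
    """Downsample by block averaging and flatten (sides must be multiples of size)."""
    height, width, channels = image.shape
    fy, fx = height // size, width // size
    small = image.reshape(size, fy, size, fx, channels).mean(axis=(1, 3))
    return small.ravel()


def color_histogram(image, bins=32):
    """Concatenate one intensity histogram per channel over the 8-bit range."""
    return np.concatenate([
        np.histogram(image[..., ch], bins=bins, range=(0, 256))[0]
        for ch in range(image.shape[-1])
    ]).astype(float)


def combine_features(image):
    """Stack spatial and color features into a single 1-D vector."""
    return np.concatenate([spatial_features(image), color_histogram(image)])


# A random 64x64x3 patch stands in for a training image.
rng = np.random.default_rng(0)
patch = rng.integers(0, 256, size=(64, 64, 3)).astype(np.uint8)
features = combine_features(patch)
```

Because the spatial values and the histogram counts live on very different scales, the combined vector would normally be scaled before it is handed to a classifier such as an SVM.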
As a car moves through the camera’s area of vision, its apparent size changes. Vehicles closer to the camera appear larger, while those further away appear smaller. To account for the difference in size in different regions, a cascading search is applied across the entire frame at varying scales. For example, a first pass looks for larger objects in the bottom half of the frame, a second pass looks at the middle part, and so on. For the HOG solution, overlapping windows are also needed, per the recommendation in the research paper. Overlapping is not needed for the CNN solution.
…fine-scale gradients, fine orientation binning, relatively coarse spatial binning, and high-quality local contrast normalization in overlapping descriptor blocks are all important for good performance….
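The multi-scale search with overlapping windows can be sketched as follows. The 1280x720 frame size, the two search bands and the 50% overlap are hypothetical values chosen for illustration; the article does not state the ones it used.

```python
def sliding_windows(x_start, x_stop, y_start, y_stop, window, overlap):
    """Return (x1, y1, x2, y2) boxes that tile a region with the given overlap."""
    if not 0 <= overlap < 1:
        raise ValueError("overlap must be in [0, 1)")
    step = max(1, int(window * (1 - overlap)))
    boxes = []
    for y in range(y_start, y_stop - window + 1, step):
        for x in range(x_start, x_stop - window + 1, step):
            boxes.append((x, y, x + window, y + window))
    return boxes


# Hypothetical plan for a 1280x720 frame: large windows over the lower band,
# smaller windows over the band nearer the horizon.
search_plan = [
    # (y_start, y_stop, window size)
    (400, 656, 128),
    (400, 528, 64),
]
all_windows = []
for y_start, y_stop, window in search_plan:
    all_windows += sliding_windows(0, 1280, y_start, y_stop, window, overlap=0.5)
```

Even this small plan produces 174 windows per frame, each of which needs feature extraction and a prediction, which is why the search is restricted to the bands where vehicles of a given size can plausibly appear.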
An example of the sliding window approach is shown below.
This processing, however, is costly and is typically done only every few frames. To keep track of the vehicle and to reduce false positives, a history of ROIs is maintained. Depending on whether the SVM or the CNN model is used, the threshold and history length are adjusted to account for the accuracy of each classifier. Finally, a heat map is generated from the accumulated detections and thresholded. The threshold discards false positives, as these don’t persist across frames. These steps are shown below.
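A minimal sketch of the history and heat-map idea, using NumPy and a fixed-length deque. The class name `HeatMapTracker` and the helper `bounding_box` are hypothetical, and the sketch merges all hot pixels into one box, which only makes sense for a single vehicle. The defaults of 15 frames and a threshold of 8 mirror the values the article reports for its final HOG+SVM model; the demo uses smaller values.

```python
from collections import deque

import numpy as np


class HeatMapTracker:
    """Accumulate detections over recent frames and keep only persistent ones."""

    def __init__(self, frame_shape, history=15, threshold=8):
        self.frame_shape = frame_shape
        self.threshold = threshold
        self.recent = deque(maxlen=history)   # oldest frame drops off automatically

    def update(self, boxes):
        """Record one frame's boxes and return the thresholded heat map."""
        self.recent.append(list(boxes))
        heat = np.zeros(self.frame_shape, dtype=int)
        for frame_boxes in self.recent:
            for x1, y1, x2, y2 in frame_boxes:
                heat[y1:y2, x1:x2] += 1
        heat[heat < self.threshold] = 0       # discard weak, short-lived detections
        return heat


def bounding_box(heat):
    """Box around all hot pixels, or None if nothing survived the threshold."""
    ys, xs = np.nonzero(heat)
    if ys.size == 0:
        return None
    return (int(xs.min()), int(ys.min()), int(xs.max()) + 1, int(ys.max()) + 1)


tracker = HeatMapTracker(frame_shape=(100, 100), history=5, threshold=3)
car = (10, 10, 40, 40)
false_positive = (60, 60, 80, 80)
for frame_index in range(5):
    detections = [car] + ([false_positive] if frame_index == 2 else [])
    heat = tracker.update(detections)
vehicle_box = bounding_box(heat)
```

In the demo the car is detected in every frame and survives the threshold, while the box that appears in a single frame is discarded. Supporting several vehicles would require labelling connected regions of the heat map instead of taking one box around everything.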
The second part of the system involves building a classifier to predict a vehicle based on the features extracted earlier. Computer vision techniques such as template matching can be used. However, these techniques fail as soon as variations are introduced. A more scalable option is machine learning. The following classifier options are considered here.
- HOG+SVM
- LBP+AdaBoost
- CNN
- Transfer Learning
HOG+SVM: Since the SVM is a one-batch classifier, the memory requirements for training are high. As noted earlier, the feature set includes HOG descriptors, color distribution and spatial information. Tuning the parameters to achieve the needed accuracy quickly grew in complexity, and the time needed for fine-tuning became unjustifiable. Also, the high accuracy obtained during training turned out to be misleading, since this classifier produces quite a few false positives. Training accuracy is shown below.
The tie-breaker employed was the feature size. The final model has parameters O=11, P=2, C=8 and uses the LUV color space. The threshold and frame history are kept at 8 and 15 respectively. The output shows a few false positives and a little jitter.
LBP+AdaBoost: The LBP classifier was trained as a cascade classifier using the OpenCV library. The time needed for training turned out to be extremely high and a complete training run was not finished. This article will be updated with the results at a later point in time. Based on a single-stage training cycle, the prediction time was 2–3 minutes. This is a show-stopper. It is possible that performance will improve once all the stages are complete during training. For now, LBP is dropped from consideration.
CNN: It turns out that the CNN was the quickest to train. The network implemented is the same network from Part 2 of the series. The training and validation accuracy are shown below. The model achieved 98.36% accuracy on the test data set.
Transfer Learning: This list would not be complete without a mention of pre-trained networks. These come in very handy since most of the heavy lifting is already done. Some notable examples include VGGNet, ResNet, Inception and Xception. Applications can be up and running within a day or two without any training and debugging. Although transfer learning was not done in its proper sense (a future article, perhaps), a trial was done using YOLO.
Comparison: The YOLO model clearly outperformed the home-grown HOG+SVM and CNN models. No training was required and the processing was almost real-time. HOG+SVM required tuning nearly 22 parameters, LBP 18 and the CNN 4. The complexity of training was high for HOG+SVM although, so far, it is LBP that has disappointed. HOG+SVM prediction speed was a measly 1.6 fps and the CNN was slightly better at 2.6 fps. Both had a few false positives. The CNN, with a test accuracy of 98%, did comparatively well. Additional training is expected to bump up the accuracy further. The SVC model, with an accuracy of 99%, also performed well. However, the faster processing time of the CNN makes it preferable, with potential to improve further. It should be noted that the CNN model was trained and tested on a GPU while the SVM was run on a CPU.
- All the models built here are single-object detection models. To be of any practical use, they need to support multiple objects. They also need to be faster for real-time applications in a vehicle.
- The sliding window approach is too costly for production use. The CNN model, like YOLO, does a single pass. The timings of the CNN model can be further improved through code refactoring and an increase in accuracy.
- It is possible to extract additional information about the detected vehicle. This will be implemented as a future enhancement.
- Performance can be further improved through the application of Kalman filters. By estimating the speed and velocity of the car, its position can be predicted instead of attempting to track its movement in real time. This will reduce the amount of processing needed to track vehicle movement.
- Segmentation is another technique that was not used here but will be implemented as a future enhancement.
- Apart from the color space conversion and augmentation described earlier, no pre-processing was done on the images before feeding them to the network or the SVM classifier. Potential options include a Sobel or a Laplacian filter (or even histogram equalization). Preliminary analysis of the Laplacian looks very promising.
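For reference, the Laplacian option mentioned in the last point can be sketched with plain NumPy slicing. This uses the common 4-neighbor kernel on a grayscale image and leaves the border at zero; the function name `laplacian` and those choices are assumptions, and a library filter would typically be used in practice.

```python
import numpy as np


def laplacian(gray):
    """4-neighbor Laplacian of a 2-D grayscale image; border pixels stay at 0."""
    gray = np.asarray(gray, dtype=float)
    out = np.zeros_like(gray)
    out[1:-1, 1:-1] = (
        gray[:-2, 1:-1] + gray[2:, 1:-1]       # pixels above and below
        + gray[1:-1, :-2] + gray[1:-1, 2:]     # pixels to the left and right
        - 4.0 * gray[1:-1, 1:-1]               # minus four times the center
    )
    return out


flat = np.full((5, 5), 7.0)     # uniform region: no response expected
spot = np.zeros((5, 5))
spot[2, 2] = 1.0                # single bright pixel
flat_response = laplacian(flat)
spot_response = laplacian(spot)
```

The filter responds only where intensity changes, so uniform road or sky regions go to zero while vehicle outlines are emphasized, which is presumably what makes it attractive as a pre-processing step here.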