On-Device AI Optimization — Leveraging Driving Data to Gain an Edge

Original article was published by Alex Wu on Artificial Intelligence on Medium


Leveraging the Domain Characteristics of Driving Data

When I began my journey at Nauto a year ago, I was tasked with replacing our existing object detector with a more efficient model. After some research and experimentation, I arrived at a new architecture that achieved an accuracy improvement of over 40% mAP* relative to our current detector, while running almost twice as fast. The massive improvement comes largely thanks to the mobile-targeted NAS design framework pioneered by works such as MnasNet and MobileNetV3.

*mAP (mean average precision) is a common metric for evaluating the predictive performance of object detectors.

Relative to our current model, the new detector reduces device inference latency by 43.4% and improves mAP by 42.7%.

Informed Channel Reduction

However, the most interesting improvements surfaced as I looked for ways to further push the boundary of the latency/accuracy curve. During my research I came across an intriguing finding by the authors of Searching for MobileNetV3, a new state-of-the-art classification backbone for mobile devices. They discovered that when adapting the model for the task of object detection, they were able to reduce the channel counts of the final layers by a factor of 2 with no negative impact to accuracy.

The underlying idea was simple: MobileNetV3 was originally optimized to classify the 1000 classes of the ImageNet dataset, while the object detection benchmark, COCO, only contains 90 output classes. Identifying a potential redundancy in layer size, the authors were able to achieve a 15% speedup without sacrificing a single percentage of mAP.
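To see why halving channels yields a speedup of this order, consider the weight count of a 1×1 convolution. The back-of-the-envelope sketch below uses the 960→1280 final expansion of MobileNetV3-Large as an illustration; it is my own arithmetic, not the authors' exact modification:

```python
def conv_params(in_ch, out_ch, k=1):
    """Weight count of a k x k convolution, ignoring bias."""
    return in_ch * out_ch * k * k

# Final 1x1 expansion of MobileNetV3-Large: 960 -> 1280 channels.
full = conv_params(960, 1280)

# Halving both sides of the layer quarters its weights (and MACs).
halved = conv_params(960 // 2, 1280 // 2)

print(f"full:   {full:,} weights")
print(f"halved: {halved:,} weights ({halved / full:.0%} of original)")
```

Because a 1×1 convolution's multiply-accumulates scale with in_ch × out_ch, halving the channels on both sides cuts that layer's compute to a quarter; the end-to-end speedup is smaller because earlier layers are untouched.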

Compared to popular benchmark datasets like ImageNet (1000) and COCO (90), the driving data we work with at Nauto consists of a minuscule number of distinct object classes.

Intrigued, I wondered if I could take this optimization further. In our perception framework we are only interested in detecting a handful of classes such as vehicles, pedestrians, and traffic lights — in total amounting to a fraction of the 90–1000 class datasets used to optimize state-of-the-art architectures. So I began to experiment with reducing the late-stage layers of my detector by factors of 4, 8, and all the way up to 32 and beyond. To my surprise, I found that after applying aggressive channel reduction I was able to reduce latency by 22%, while also improving accuracy by 11% mAP relative to the published model.

My original hope was to achieve a modest inference speedup with limited negative side-effects — I never expected to actually see an improvement in accuracy. One possible explanation is that while the original architecture was optimal for the diverse 90 class COCO dataset, it is overparameterized for the relatively uniform road scenes experienced by our devices. In other words, removing redundant channels may have improved overall accuracy in a similar way to how dropout and weight decay help prevent overfit.

At any rate, this optimization illustrates how improving along one axis of the latency/accuracy curve can impact performance in the other. In this case, however, the unintentional side-effect was positive. In fact, we broke the general rule of the trade-off by making a simultaneous improvement in both dimensions.

Applying aggressive channel reduction to the late-stage layers of the detector resulted in a 22% speedup and an 11% improvement in mAP relative to the baseline model.

Task-specific Data Augmentation

The success I had with channel reduction motivated me to look for other ways to leverage the uniqueness of driving data. Something that immediately came to mind was a study done by an old colleague of mine while I worked at my previous company, DeepScale. Essentially, he found that conventional data augmentation strategies like random flip and random crop**, while generally effective at reducing overfit, can actually hurt performance on driving data. For his application, simply removing the default augmentors resulted in a 13% improvement in accuracy.

**Random flip selects images at random to be flipped (typically across the vertical axis). Random crop selects images to be cropped and resized back to original resolution (effectively zooming in).
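For concreteness, here is the label transform that random horizontal flip applies to an axis-aligned bounding box — a generic sketch of the standard augmentor, not Nauto's pipeline:

```python
def hflip_box(box, img_w):
    """Mirror a bounding box (x, y, w, h) across the vertical axis
    of an image of width img_w; y, w, h are unchanged."""
    x, y, w, h = box
    return (img_w - x - w, y, w, h)

# A box near the left edge maps to the right edge of a 1280px image.
print(hflip_box((100, 50, 60, 40), 1280))   # -> (1120, 50, 60, 40)
```

Applying the transform twice returns the original box, which is a handy sanity check when implementing augmentors.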

Again, the underlying idea is simple: while benchmark datasets like COCO and ImageNet contain a diverse collection of objects captured by various cameras from many different angles, driving data is comparatively uniform. In most applications the camera positions are fixed, the intrinsics are known, and the image composition will generally consist of the sky, the road, and a few objects. By introducing randomly flipped and zoomed-in images, you may be teaching your model to generalize to perspectives it will never actually experience in real life. This type of overgeneralization can be detrimental to overall accuracy, particularly for mobile models where predictive capacity is already limited.

Initially, I had adopted the augmentation scheme used by the original authors of my model. This included the standard horizontal flipping and cropping. I began my study by simply removing the random flip augmentor and retraining my model. As I had hoped, this single change led to a noticeable improvement in accuracy: about 4.5% relative mAP. (It must be noted that while we do operate in fleets around the world, including countries like Japan that drive on the left, my model was targeted for US deployment.)

In the default scheme, random crop (top) will often generate distorted, zoomed-in images that compromise object proportions and exclude important landmarks. Random horizontal flip (bottom), while not as obviously harmful, dilutes the training data with orientations the model will never see in production (US). The constrained-crop augmentor takes a more conservative approach; its outputs more closely resemble the viewing angles of real world Nauto devices.

I then shifted my focus to random crop. By default, the selected crop was required to have an area between 10% and 100% of the image, and an aspect ratio between 0.5 and 2.0. After examining some of the augmented data, I quickly discovered two things: first, many of the images were so zoomed-in that they excluded important context clues like lane-markers; and second, many of the objects were noticeably distorted in instances where a low aspect ratio crop was resized back to model resolution.

I was tempted at first to remove random crop entirely as my colleague had, but I realized there is one important difference between Nauto and full stack self-driving companies. Because we’re deployed as an aftermarket platform in vehicles ranging from sedans to 18-wheelers, our camera position varies significantly across fleets and individual installations. My hypothesis was that a constrained, less-aggressive crop augmentor would still be beneficial as a tool to reflect such a distribution.

I began experimenting by fixing the aspect ratio to match the input resolution and raising the minimum crop size. After a few iterations, I found that a constrained augmentor using a fixed ratio and a minimum crop area of 50% improved accuracy by 4.4% mAP relative to the default cropping scheme. To test my hypothesis, I also repeated the trial with random crop removed entirely. Unlike in my colleague's case, the crop-free scheme actually reduced mAP by 5.3% (1% worse than baseline), confirming that conservative cropping can still be beneficial in applications where camera position varies across vehicles.
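A minimal sketch of such a constrained crop sampler (fixed aspect ratio, minimum area of 50%) might look like the following; the function name and structure are my own illustration, not Nauto's training code:

```python
import random

def sample_crop(img_w, img_h, min_area=0.5):
    """Sample a crop box (x, y, w, h). Fixing the aspect ratio to the
    image's own ratio means resizing the crop back to model resolution
    never distorts object proportions."""
    area_frac = random.uniform(min_area, 1.0)
    scale = area_frac ** 0.5            # same scale factor on both axes
    w, h = int(img_w * scale), int(img_h * scale)
    x = random.randint(0, img_w - w)    # random placement within bounds
    y = random.randint(0, img_h - h)
    return x, y, w, h

random.seed(0)
print(sample_crop(1280, 720))
```

Compared to the default scheme, the only degrees of freedom left are the crop's area (bounded below by min_area) and its position, which limits the augmentor to plausible shifts in camera placement rather than arbitrary zooms.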

The final scheme (no-flip, constrained-crop) in total yields a 9.1% relative improvement over the original baseline (flip, crop) and a 10.2% improvement over no augmentation at all.

The baseline augmentation scheme (grey) consists of random horizontal flip and random crop (aspect ratio ∈ [0.5, 2.0] and area ∈ [0.1, 1.0]). Removing random flip improved mAP by 4.5%. From there, removing random crop reduced mAP by 5.3% (-1% relative to baseline). Using a constrained crop (fixed ratio, area ∈ [0.25, 1.0]) improved mAP by 7.9% relative to baseline. And finally, the most constrained crop (fixed ratio, area ∈ [0.5, 1.0]) resulted in the largest improvement: 9.1% relative to baseline.

Data-Driven Anchor Box Tuning

I’ll wrap it up with one more interesting finding. The majority of today’s object detection architectures form predictions based on a set of default anchor boxes. These boxes (also sometimes called priors) typically span a range of scales and aspect ratios in order to better detect objects of various shapes and sizes.

SSD default anchor boxes. Liu, Wei et al. “SSD: Single Shot MultiBox Detector.” Lecture Notes in Computer Science (2016): 21–37. Crossref. Web.

At this point, I was focusing my efforts on improving the core vehicle detector that drives our forward collision warning system (FCW). While sifting through our data, I couldn’t help but once again notice its uniformity compared to competition benchmarks; overall image composition aside, the objects themselves seemed to fall into a very tight distribution of shapes and sizes. So I decided to take a deeper look at the vehicles in our dataset.

Object distribution of FCW dataset. Scale is calculated for each object as bounding box height relative to image height (adjusted by object and image aspect ratios). The average object is relatively small, with a median scale of 0.057 and a 99th percentile of 0.31. Objects are also generally square, with a median aspect ratio of 1.02 and 99th percentile of 1.36.

As it turns out, the majority of objects are relatively square, with more than 96% falling between aspect ratios of 0.5 to 1.5. This actually makes a lot of sense in the context of FCW, as the most relevant objects will generally be the rear profiles of vehicles further ahead on the road. The size distribution follows more of a long tail distribution, but even so, the largest objects occupy less than three fourths of the image in either dimension, while 99% occupy less than a third.
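Statistics like these can be computed directly from box annotations. The sketch below uses a simplified scale definition (box height over image height, without the aspect-ratio adjustment mentioned above) and made-up boxes, since the FCW dataset itself is proprietary:

```python
from statistics import median

# Toy (width, height) bounding boxes in pixels for a 1280x720 image.
# These values are purely illustrative, not Nauto's data.
boxes = [(36, 40), (52, 50), (120, 110), (24, 28), (400, 380)]
IMG_H = 720

scales = [h / IMG_H for _, h in boxes]   # simplified: height ratio only
ratios = [w / h for w, h in boxes]

print(f"median scale: {median(scales):.3f}")
print(f"median aspect ratio: {median(ratios):.2f}")
```

Plotting the full histograms of these two quantities is what surfaces the tight, near-square distribution described above.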

Once again, I went back to reevaluate my initial assumptions. Up until now I had adopted the default set of anchor boxes used by the original authors, which ranged in scale between 0.2 and 0.9, using aspect ratios of ⅓, ½, 1, 2, and 3. While this comprehensive range makes sense for general-purpose object detection tasks like COCO, I wondered if I would again be able to find redundancy in the context of autonomous driving.

I began by experimenting with a tighter range of aspect ratios, including {½, 1, 1½} and {¾, 1, 1¼}. Surprisingly, the largest gain in both speed and accuracy came simply from using square anchors only, which effectively cut the total anchor count by a factor of 5. I then turned my attention to box sizes, realizing that the default range of [0.2, 0.9] overlapped with less than 5% of the objects in my dataset. Shrinking the anchor sizes to better match the object distribution yielded another modest improvement.
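For reference, SSD-style anchors are generated once per scale/aspect-ratio pair, so dropping the four non-square ratios shrinks the set by exactly 5× at every scale. The sketch below uses assumed values: the evenly spaced default scales and the tuned range are my own illustration, loosely matching the object distribution reported above:

```python
from math import sqrt

def make_anchors(scales, ratios):
    """SSD-style anchors: width = s*sqrt(r), height = s/sqrt(r), so the
    box area stays s^2 for every ratio. (Replicated per feature-map
    cell in a real detector; cells are omitted here.)"""
    return [(s * sqrt(r), s / sqrt(r)) for s in scales for r in ratios]

# Baseline: 5 scales spanning [0.2, 0.9], 5 aspect ratios.
default = make_anchors([0.2, 0.375, 0.55, 0.725, 0.9], [1/3, 0.5, 1, 2, 3])

# Square-only anchors with scales shrunk toward the observed objects
# (median 0.057, 99th percentile 0.31) -- assumed values.
square = make_anchors([0.05, 0.1, 0.2, 0.3], [1])

print(len(default), len(square))   # -> 25 4
```

Fewer anchors means fewer box predictions to compute and post-process per frame, which is where the inference speedup comes from.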

In total, the new anchor boxes yielded an almost 20% inference speedup and a 2% relative mAP improvement across all object classes, sizes, and shapes.

The baseline model uses anchor boxes with scales ∈ [0.2, 0.9] and aspect ratios ∈ {⅓, ½, 1, 2, 3}. Simply removing all but the square boxes resulted in a speedup of 18.5% with no negative impact to accuracy. Further tuning the boxes to match the scale range of the object distribution resulted in a modest 2.1% relative gain in mAP.

Note: while the benchmarks within each optimization study are conducted in controlled experiments, a number of factors changed between individual studies. I chose not to present a cumulative improvement from start to finish in the interest of keeping this post short and focused on the 3 major optimizations.