Machine Vision — AlphaPilot AI Challenge

Source: Deep Learning on Medium


Ashok Yannam, Zachary Mueller, Carson Wilber, Chris Mayer, Bhavyansh Mishra, Murali Marimekala


The Drone Racing League (DRL) has developed unique racing gates for use in AIRR, their new autonomous racing league attached to AlphaPilot. These gates are equipped with visual markings (e.g. colors, patterns, logos) that will provide fiducials to aid in guidance through the course. Teams are tasked with:

(1) Developing a gate detection algorithm

(2) Describing their algorithm in a 2-page Technical Report

This gate detection algorithm needs to be capable of detecting the flyable region of AIRR Race Gates and producing a quadrilateral around its edge (see Figure 1) with a high degree of accuracy and speed.

The important metrics to note from this objective are speed, the mean Average Precision (mAP) of the flyable region predictions against the ground truth, and the robustness of the solution model, i.e. its ability to generate accurate predictions across a wide variety of lighting and pose cases.

Figure 1: Shows sample AIRR gate images from the training dataset both without (left) and with (right) the flyable region outlined in red.

Techniques and Considerations

Each of the methods we considered can use the computational efficiency of GPUs to accelerate both training and runtime classification of gates. During training, Team Titans used multiple Amazon Web Services Elastic Compute Cloud p3.2xlarge instances running the Deep Learning AMI, which is preconfigured for such training. Team members also used onboard GPUs (e.g. NVIDIA GeForce GTX 960M) as well as use-case-specific platforms (e.g. NVIDIA Jetson) for local development and testing.

Bayesian and SIFT models were tried and gave accurate predictions for specific subsets of the data. Time constraints and the computational complexity of these models drove us to focus on a single model, YOLO. Future versions would use a multi-model approach, with an integrator that combines the Bayesian and SIFT models after the neural network (NN).

On the final model, we also attempted Spatial Pyramid Pooling, a technique that max-pools a convolutional layer's feature maps at multiple window sizes, as well as a new YOLO-based model published during the competition, Stronger YOLO, which allows for multiple-resolution inputs. Neither achieved adequate mAP improvements, and both were abandoned.

You Only Look Once

The YOLO algorithm, originally published by Joseph Redmon et al. in 2015, provides highly accurate real-time object detection at 45 to 155 frames per second. An improved and optimized version of the algorithm, YOLOv3 [1], implemented in Darknet, was selected for the final implementation.

Model: YOLOv3 [1]


Rather than classifying whole gates, Team Titans chose to have the model classify individual gate corners. This is due to the nature of YOLO-based algorithms: they do not account for skew. If the gate appears at an angle in an image, it becomes difficult to localize properly, because the YOLO output is an axis-aligned rectangular bounding box rather than an arbitrary quadrilateral.

To resolve this mismatch, the team defined four labels, UL, UR, LR, and LL (the gate corners in clockwise order), and generated ground-truth labels of the form (tlx, tly, w, h) for each corner, identifying its individual bounding box. The labels were produced through both an internal and an external collaborative effort, using the Microsoft Visual Object Tagging Tool (VoTT) to build a modified ground truth. These labels give each corner an accurate, feature-rich bounding box across a wide range of environments; this matters because the bounding boxes themselves are rarely square and the checkered pattern on each corner is, on its own, unreliable in many cases.
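As a rough illustration of how such (tlx, tly, w, h) corner labels could be fed to Darknet, the sketch below converts one pixel-space box into a Darknet label line, which uses a class index followed by the box center and size normalized by the image dimensions. The class index assigned to each corner and the function name `to_darknet_label` are assumptions made for this example, not the team's actual tooling.

```python
# Hypothetical class indices for the four corner labels, in clockwise order.
CORNER_CLASSES = {"UL": 0, "UR": 1, "LR": 2, "LL": 3}


def to_darknet_label(name, tlx, tly, w, h, img_w, img_h):
    """Convert a (tlx, tly, w, h) pixel box into a Darknet label line.

    Darknet label files hold one object per line in the form
    <class> <x_center> <y_center> <width> <height>,
    with the four box values normalized to [0, 1] by the image size.
    """
    x_center = (tlx + w / 2) / img_w
    y_center = (tly + h / 2) / img_h
    return "{} {:.6f} {:.6f} {:.6f} {:.6f}".format(
        CORNER_CLASSES[name], x_center, y_center, w / img_w, h / img_h
    )


# Example with made-up numbers: a 40x20 box in an 800x400 image.
print(to_darknet_label("UL", 100, 50, 40, 20, 800, 400))
```

One such line per labeled corner would go into the text file that accompanies each training image.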

We then used the Darknet implementation of YOLOv3 [1] to train on the modified ground truth for the training images, so that the model labels the corners and flyable region of each gate with high accuracy and with conservative bounds around objects obstructing the sides of the gate, where a collision would occur in flight. The exact corner point from each label is the midpoint of its bounding box, (tlx + w/2, tly + h/2).
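The step from four corner boxes to a flyable-region quadrilateral can be sketched as follows, taking each box midpoint, (tlx + w/2, tly + h/2), as the corner point. The function names and the flat output layout are assumptions for illustration; the detections are assumed to arrive as a dictionary keyed by corner label.

```python
CORNER_ORDER = ("UL", "UR", "LR", "LL")  # clockwise, as in the labels


def box_center(tlx, tly, w, h):
    """Midpoint of a box given by its top-left corner, width, and height."""
    return (tlx + w / 2, tly + h / 2)


def flyable_quadrilateral(detections):
    """Build [x1, y1, ..., x4, y4] from one detected box per corner label.

    `detections` maps each corner label to a (tlx, tly, w, h) tuple.
    """
    absent = [label for label in CORNER_ORDER if label not in detections]
    if absent:
        raise ValueError("missing corner detections: {}".format(absent))
    quad = []
    for label in CORNER_ORDER:
        quad.extend(box_center(*detections[label]))
    return quad


# Example with made-up 20x20 corner boxes.
boxes = {
    "UL": (90, 40, 20, 20),
    "UR": (290, 40, 20, 20),
    "LR": (290, 240, 20, 20),
    "LL": (90, 240, 20, 20),
}
print(flyable_quadrilateral(boxes))
```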


The model takes the current image and shrinks it to a 608×608 resolution; training images are scaled down from their original size with OpenCV using linear interpolation and converted to grayscale to remove unnecessary color information. Other standard methods were also tried (emboss, gradient vectors, …), but grayscale gave the best results.


The original YOLOv3 [1] model was modified to test possible optimizations for this use case. In particular, k-means clustering was applied to the ground-truth object sizes to generate anchor boxes that more closely match the typical objects in the training frames. We also increased the number of filters in the early layers to capture more simple features and better localize individual gate corners.
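A minimal sketch of anchor generation by k-means over ground-truth box sizes is shown below. It assumes the 1 - IoU distance suggested in the YOLOv2 paper; the write-up does not say which distance or implementation the team used, and the function names and sample sizes here are hypothetical.

```python
import random


def iou_wh(a, b):
    """IoU of two (w, h) boxes, assuming they share the same top-left corner."""
    inter = min(a[0], b[0]) * min(a[1], b[1])
    return inter / (a[0] * a[1] + b[0] * b[1] - inter)


def kmeans_anchors(sizes, k, iterations=100, seed=0):
    """Cluster (w, h) pairs into k anchors, using 1 - IoU as the distance."""
    rng = random.Random(seed)
    centroids = rng.sample(sizes, k)
    for _ in range(iterations):
        # Assign each box to the centroid it overlaps most (highest IoU).
        clusters = [[] for _ in range(k)]
        for box in sizes:
            best = max(range(k), key=lambda i: iou_wh(box, centroids[i]))
            clusters[best].append(box)
        # Move each centroid to the mean size of its members.
        updated = []
        for i, members in enumerate(clusters):
            if members:
                updated.append((
                    sum(w for w, _ in members) / len(members),
                    sum(h for _, h in members) / len(members),
                ))
            else:
                updated.append(centroids[i])  # keep an empty cluster in place
        if updated == centroids:
            break
        centroids = updated
    # Return anchors sorted from smallest to largest area.
    return sorted(centroids, key=lambda wh: wh[0] * wh[1])


# Made-up box sizes: three small corners and three large ones.
sizes = [(10, 10), (12, 12), (11, 9), (100, 100), (98, 105), (110, 95)]
anchors = kmeans_anchors(sizes, k=2)
print(anchors)
```

For a YOLOv3 configuration, `k` would be 9 and the resulting widths and heights would replace the default anchors.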

Method and Approach


Our first approach to training a Darknet YOLOv3 [1] model was to use the standard 9 anchor boxes that YOLO now has, compared to the 5 in the previous version. This proved insufficient, as the best accuracy we could reach was around 80%. Each time, we let the model run overnight to get above 8,000 iterations over the dataset, incrementally adding more data as good labels became available.

We also had to choose the input image size with processing speed in mind. A larger image can yield higher accuracy, but this challenge is also about real-time performance; 224×224 is small and is a standard input size for many image-classification problems. After determining that the accuracy gained by moving from 224 to 608 was worth the extra processing time, we retrained the model to use 608×608 resolution images instead.


When the model found fewer than four corners, we computed the missing corners with simple geometric transforms. We also tried a kNN algorithm for clustering low-confidence predictions and indirectly identifying true corner points, but the accuracy gain was not worth the additional CPU time.
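The write-up does not spell out which geometric transforms were used. One simple possibility, shown here purely as an assumption, is to treat the gate opening as a parallelogram and estimate a single missing corner from the other three: the missing point is the sum of its two neighbors minus the opposite corner. This ignores perspective distortion, and the function name is hypothetical.

```python
CORNER_ORDER = ("UL", "UR", "LR", "LL")  # clockwise, as in the labels


def complete_missing_corner(corners):
    """Estimate one missing corner, assuming the gate is a parallelogram.

    `corners` maps corner labels to (x, y) points and must contain exactly
    three of the four labels. Returns a new dict with all four corners.
    """
    missing = [label for label in CORNER_ORDER if label not in corners]
    if len(missing) != 1:
        raise ValueError("expected exactly one missing corner")
    i = CORNER_ORDER.index(missing[0])
    prev_pt = corners[CORNER_ORDER[(i - 1) % 4]]  # neighbor on one side
    next_pt = corners[CORNER_ORDER[(i + 1) % 4]]  # neighbor on the other side
    opp_pt = corners[CORNER_ORDER[(i + 2) % 4]]   # diagonally opposite corner
    estimate = (
        prev_pt[0] + next_pt[0] - opp_pt[0],
        prev_pt[1] + next_pt[1] - opp_pt[1],
    )
    completed = dict(corners)
    completed[missing[0]] = estimate
    return completed


# Made-up example: the lower-left corner was not detected.
found = {"UL": (100, 50), "UR": (300, 60), "LR": (310, 260)}
print(complete_missing_corner(found)["LL"])
```

With two or fewer detected corners, this estimate is no longer determined and further assumptions about the gate's shape would be needed.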

Results and Conclusion

After 14 hours of training on a p3.2xlarge instance, our algorithm reached 90.2% mAP on the ground-truth data, the best we were able to achieve with YOLOv3 [1]. Detecting corners instead of the entire gate provided great accuracy for this challenge. The approach may run into problems if we follow it for Test 3: for example, when multiple gates are visible, it is unclear how to handle several left corners without enough of the remaining corners to complete each quadrilateral. Otherwise, the results were as expected.


[1] Joseph Redmon and Ali Farhadi (University of Washington), "YOLOv3: An Incremental Improvement."