The XView Dataset and Baseline Results

Hey there reader! Today we’re going to be talking about an exciting new object detection competition released by DIUx at the Pentagon (yup, that Pentagon) called the XView Challenge. The aim of the competition is to produce better detection models on satellite imagery for the purposes of disaster relief. The competition dataset is, to date, the largest annotated object detection dataset on aerial imagery, with 1 million object instances across 60 classes. Participants are given access to 847 images from the WorldView-3 satellite at 0.3 meter resolution, with the option of using either 3-band or 8-band imagery. The paper can be found on arXiv here.

As Picterra’s focus is on Earth Observation Imagery, this dataset is highly relevant to us. But before we really get into our experiments it is important to get a good understanding of what the difficulties of the challenge are and run a few baseline tests. Let’s start with the pros and cons of this new dataset.

The dataset example taken from the XView home page.


Pros:

  • Lots of annotations, lots of classes, lots to learn from. Given that the class IDs are not sequential, there is likely a wealth of additional annotations that we don’t have access to. How mysterious! Maybe they’ll release those eventually and we’ll have even more data to play with.
  • One major problem with satellite imagery is that it can be visually inconsistent. For example, the exposure and color range can vary drastically. This makes the object detection problem more difficult, as either we have to adjust the images ourselves to make them consistent or we hope that the network learns said adjustments for us. The XView data, however, looks surprisingly good. Perhaps the folks over at DIUx or DigitalGlobe (who owns the WorldView satellites) did some preprocessing for us. The only augmentation that we really need to consider is likely rotation. We don’t need to think too much about scaling since satellite resolution is constant, and if we train on randomly centered crops from the images then we don’t need to worry about translation either. This gives us more time to focus on the model itself instead of futzing with data augmentation parameters.
  • Speaking of satellite resolution, the XView data has a resolution of 0.3 meters per pixel, which is the highest currently available from commercial satellites.


Cons:

  • While there are a lot of classes, the choice of these classes can seem a bit arbitrary and some of them seem to overlap. There is some subjectivity within the labeling, which is never a good thing when it comes to annotations. For example, there are “Parking Lots” and various vehicle labels, but sometimes the vehicles are not labeled with “Parking Lot” regions. Along with the “Maritime Vessels” class there are a plethora of other boat objects that could all be considered maritime vessels. Some of the classes perhaps shouldn’t be included in the object detection task, namely Vehicle Lot, Shipping Container Lot and Construction Site. These classes are better suited to a segmentation task.
Another problem example: 89-Shipping container Lot, 91-Shipping Container
  • There is a severe class imbalance. Most of the annotations come from “Small Car” and “Building”. To assess the size distribution I started by removing invalid bounding boxes (those extending outside of the frame). There are also some boxes that are either way too small (some even have 0 area) or are unreasonably large. To account for these outliers, for each class I removed bounding boxes that fell outside of a mean-centered 95% interval, using the diagonal length of the box as my measure. Here is a table with the exact number of instances of each class in the dataset. The highest frequency class is “Building” with a whopping 316138 instances. The lowest is “Railway Vehicle” with a lonely 17.
Histogram of bounding box diagonals over all objects in the training set. As you can see, there is one major bump for small objects (smaller vehicles), a smaller but wider bump after it (buildings), and then a long tail for various other irregular classes (vehicle/shipping container lots, construction sites).
  • Object sizes within the same class can vary drastically. In the most extreme case “Construction Sites” have diagonal lengths ranging between 20 and 1000 pixels (image dimensions are around 3000 x 3000). A chart with mean bounding box diagonals and standard deviations is given below.
  • Many images are sparsely annotated, which is not a problem specific to the XView dataset but to aerial imagery in general. Each image covers about 1 square km in which there are sometimes fewer than 10 annotations. This can make training hard since we typically train on patches within an image and may end up sampling too many empty patches if we’re not careful. The result may be that the network would just learn to predict no bounding boxes as the optimal solution.
  • Another general satellite imagery challenge is that even at a resolution of 0.3 meters a lot of smaller classes look very similar, especially vehicles. Classification will no doubt be a tough challenge.
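The box cleanup described in the class imbalance point above can be sketched roughly as follows. This is a minimal sketch and not the exact code used for the statistics: it assumes boxes are stored as [xmin, ymin, xmax, ymax] in pixels, and it interprets the “mean-centered 95% interval” as mean ± 1.96 standard deviations of the per-class diagonal, which is only one possible reading. The function names are made up for illustration.

```python
import numpy as np

def valid_box_mask(boxes, img_w, img_h):
    """True for boxes that lie fully inside the frame and have non-zero area."""
    boxes = np.asarray(boxes, dtype=float)
    inside = ((boxes[:, 0] >= 0) & (boxes[:, 1] >= 0)
              & (boxes[:, 2] <= img_w) & (boxes[:, 3] <= img_h))
    has_area = (boxes[:, 2] > boxes[:, 0]) & (boxes[:, 3] > boxes[:, 1])
    return inside & has_area

def diagonal_inlier_mask(boxes, labels, z=1.96):
    """True for boxes whose diagonal is within mean +/- z * std of their class.

    Assumption: z=1.96 stands in for the "95% interval" mentioned in the text.
    """
    boxes = np.asarray(boxes, dtype=float)
    labels = np.asarray(labels)
    diag = np.hypot(boxes[:, 2] - boxes[:, 0], boxes[:, 3] - boxes[:, 1])
    keep = np.zeros(len(boxes), dtype=bool)
    for cls in np.unique(labels):
        idx = labels == cls
        mu, sigma = diag[idx].mean(), diag[idx].std()
        keep[idx] = np.abs(diag[idx] - mu) <= z * sigma
    return keep
```

The first mask would be applied per image (each image has its own width and height), the second over the pooled boxes of the whole dataset, since the statistics are per class.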

Here is a chart with statistics on the XView dataset, namely the per class mean diagonal, standard deviation of diagonal and count/frequency information.

Buildings and Small Cars dominate the dataset, corresponding to our earlier histogram. Many classes have high box diagonal standard deviations.

Baseline approaches

As data scientists, whenever a new competition like this pops up our first thought is to go crazy with new ideas and architectures. How can we best take advantage of the unique characteristics of our data? What new cutting edge literature should we dive into? Can we take the current state of the art and beat it? But before we get ahead of ourselves it’s important to have a solid baseline: to take a known model and generate our first set of scores (we’ll be using the mAP metric, an explanation can be found in this post).

The Competition Baseline

We’ll start by mentioning the baseline used by our XView hosts at the Pentagon. Their base model of choice is the Single Shot MultiBox Detector (or SSD for short). SSD is an end-to-end object detector similar to YOLO, which we talked about briefly in a previous post. Their baseline result trains SSD using 300×300 pixel crops of the XView data. This “vanilla” model reaches an mAP of 0.1456.

They then also produce crops of 400×400 and 500×500 and train SSD on all 3 chip sizes, which results in a much higher mAP score of 0.2590. They call this their “multires” model. Their reasoning behind this improvement is that with smaller chip sizes, larger objects often get clipped and thus we do not see the whole object when training. Correspondingly, their per class mAPs for larger classes such as “Vehicle Lot”, “Aircraft Hangar” or “Barge” go up drastically while smaller classes exhibit less change. In addition, training with different size images may provide different levels of contextual information, which could be helpful. The mAPs still have quite a range between different classes. Some scores approach 0.6 while others are at 0. They also have a training scheme built on top of “multires” in which they jitter and add noise to training tiles. They call this their “aug” model and it performs almost as poorly as the “vanilla” model, producing an mAP of 0.1549. The added noise is likely too hard to learn around given that, especially for small objects, 1 pixel of noise can cover a significant area of the object itself. The full article can be found here and results are given below.

mAP scores for different baseline training schemes on SSD from the original blog post. It’s odd to see that while ‘multires’ leads to an improvement for most classes, for a few it leads to a significant drop in score.
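To make the “multires” idea a bit more concrete, here is a minimal sketch of cutting one scene into chips at the three sizes mentioned above. It is not the XView baseline code: the non-overlapping grid and the zero padding of edge chips are assumptions made purely for illustration, and the function names are hypothetical.

```python
import numpy as np

def chip_image(image, chip_size):
    """Yield (x0, y0, chip) tiles on a non-overlapping grid.

    Assumption: tiles at the right/bottom edge are zero-padded to full size.
    """
    h, w = image.shape[:2]
    for y0 in range(0, h, chip_size):
        for x0 in range(0, w, chip_size):
            chip = np.zeros((chip_size, chip_size) + image.shape[2:],
                            dtype=image.dtype)
            tile = image[y0:y0 + chip_size, x0:x0 + chip_size]
            chip[:tile.shape[0], :tile.shape[1]] = tile
            yield x0, y0, chip

def multires_chips(image, sizes=(300, 400, 500)):
    """Yield (size, x0, y0, chip) for every chip size in turn."""
    for size in sizes:
        for x0, y0, chip in chip_image(image, size):
            yield size, x0, y0, chip
```

An object that is cut in half by a 300×300 chip has a better chance of appearing whole in one of the 400×400 or 500×500 chips, which is the intuition given for the improvement on larger classes.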

Our Baseline Approach

The model

For our baseline approach we are going to go with YOLO. YOLO is known to struggle with clusters of small objects, which will be an issue for aerial imagery (consider cars in a parking lot). We can mitigate this in a few simple ways that require very little extra work.

The default YOLO model takes 416×416 images as input and divides each image into a 13×13 proposal grid. Each grid cell has some number of predefined “anchor” boxes used to generate proposals, so the number of proposals for an image is equal to 13 × 13 × #anchors. The first option is to increase the number of anchors, and thus the number of proposals, if we are worried about there not being enough per grid cell to assign a bounding box to every car in a parking lot. Having a range of anchor sizes may also help to deal with objects of different sizes. The second option is to increase the grid to 26×26, which can easily be done by removing a pooling layer in the original YOLO implementation. However, this slows the network down by about 4x.

For our baseline model we will simply use a 13×13 grid with 9 anchors (the original paper uses 5), since as we saw earlier the range of object sizes is quite large and a 4x slowdown is a pretty hefty price to pay. We also make another improvement by swapping the convolutional backbone from Darknet19 to Darknet53, the ResNet-style network used in the latest edition of YOLO (v3). YOLOv3 includes other major structural improvements for handling objects of drastically different sizes, though we haven’t implemented these yet. There are other options that we want to explore later, but before we go too crazy we should get some results on our current baseline model (YOLO2 with Darknet53 and 9 anchors). We’ll call this model YOLO2D53.
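The proposal counts behind the trade-off above are easy to work out. The small helper below is just illustrative arithmetic (the function name is made up); it assumes the grid size is the input size divided by the network stride, which is 32 for the 13×13 grid and 16 for the 26×26 one.

```python
def num_proposals(input_size, stride, num_anchors):
    """Grid cells per side times grid cells per side times anchors per cell."""
    grid = input_size // stride
    return grid * grid * num_anchors

print(num_proposals(416, 32, 5))  # 13 x 13 x 5 = 845 (default YOLOv2)
print(num_proposals(416, 32, 9))  # 13 x 13 x 9 = 1521 (our baseline)
print(num_proposals(416, 16, 9))  # 26 x 26 x 9 = 6084 (finer grid)
```

Going from 5 to 9 anchors nearly doubles the proposals at little extra cost, while the 26×26 grid gives 4x as many cells to process.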


To train our model, we separate the 847 images into a training/validation split and train only on the training portion. Each epoch consists of going through each image in the set and randomly sampling a 416×416 crop from each. It’s important to note that in our case each epoch does not contain exactly the same information, which may make analyzing our loss charts more difficult but also means the network sees more of the data. After training for 1000 epochs, which took about 16 hours on a GTX 1080 Ti, we had our results. The only data augmentation done is random 90 degree rotations and image flips.
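A minimal sketch of this sampling scheme is shown below. It is not our training code: the box format ([xmin, ymin, xmax, ymax] in pixels), the choice to clip boxes to the crop and drop the ones left with no area, and all function names are assumptions for illustration. It also assumes the image is at least as large as the crop.

```python
import numpy as np

def rot90_boxes(boxes, size):
    """Boxes after np.rot90 (counterclockwise) of a size x size image."""
    xmin, ymin, xmax, ymax = boxes.T
    return np.stack([ymin, size - xmax, ymax, size - xmin], axis=1)

def hflip_boxes(boxes, size):
    """Boxes after a left-right flip of a size x size image."""
    xmin, ymin, xmax, ymax = boxes.T
    return np.stack([size - xmax, ymin, size - xmin, ymax], axis=1)

def sample_training_crop(image, boxes, labels, size=416, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    boxes = np.asarray(boxes, dtype=float).reshape(-1, 4)
    labels = np.asarray(labels)
    h, w = image.shape[:2]
    # Pick a random crop position inside the image.
    x0 = int(rng.integers(0, w - size + 1))
    y0 = int(rng.integers(0, h - size + 1))
    crop = image[y0:y0 + size, x0:x0 + size]
    # Move boxes into crop coordinates, clip them, drop the empty ones.
    clipped = np.clip(boxes - np.array([x0, y0, x0, y0]), 0, size)
    keep = (clipped[:, 2] > clipped[:, 0]) & (clipped[:, 3] > clipped[:, 1])
    out_boxes, out_labels = clipped[keep], labels[keep]
    # Random number of 90 degree rotations, then an optional flip.
    for _ in range(int(rng.integers(0, 4))):
        crop = np.rot90(crop)
        out_boxes = rot90_boxes(out_boxes, size)
    if rng.random() < 0.5:
        crop = crop[:, ::-1]
        out_boxes = hflip_boxes(out_boxes, size)
    return np.ascontiguousarray(crop), out_boxes, out_labels
```

Combining the four rotations with a single left-right flip already covers all eight flip/rotation variants of a square crop. Given how sparse some scenes are, one could also resample when a crop comes back with no boxes, though too much of that would bias the network away from empty background.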


Our final mAP using the XView inference code was 0.085. Since our model does not take multiscale crops into account, it should be compared to the vanilla SSD model, which reportedly scored 0.1456. Our results are quite a bit worse, but the vanilla/multires models were trained on the whole XView dataset, not just the training set that we were given. In addition, they create their train/test split after generating a pool of crops and not on the source images themselves. This makes their test problem much easier since their training has already seen crops from all of the same scenes, unlike our case where we are evaluated on new, unseen scenes. On top of all this they trained on 4 GPUs for 7 days. Since many classes are infrequent and rarely seen, we expect that training for longer will allow the network to see more instances of said rare classes and improve our classification loss. In short, it’s not really fair to compare our scores to theirs, but it’s still undeniable that 0.085 is a very low score. So let’s dig into our per class AP scores a bit more to see where we’re falling short.
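A quick aside on the train/test split point: splitting on source images rather than on crops just means shuffling the scene IDs before any cropping happens. A minimal sketch, where the function name, the 20% validation fraction and the seed are arbitrary assumptions rather than the values we used:

```python
import random

def split_by_scene(image_ids, val_fraction=0.2, seed=0):
    """Split on whole scenes so no validation crop shares a scene with training."""
    ids = sorted(set(image_ids))
    random.Random(seed).shuffle(ids)
    n_val = max(1, int(round(len(ids) * val_fraction)))
    return ids[n_val:], ids[:n_val]  # (train scenes, validation scenes)
```

Crops are then sampled only from the scenes on each side of the split, so every validation crop comes from a scene the network has never seen.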

Per class AP scores and total mAP Score for YOLO2D53 model alongside vanilla and multires. Classes in red are classes for which there are no correct detections in our validation set at all. Class frequencies are also included.

One interesting set of class results are the planes (first 3 rows). Our performance varies wildly. “Cargo plane” does quite well across all models while small and fixed wing aircraft do a lot worse in the multires and YOLO2D53 models. This could just be due to the amount of training data that we happen to have in our training split. It’s also strange to note that the multires model performs much worse than the vanilla model on fixed wing aircraft, though presumably the same training scheme was used for both. Perhaps one of the distinguishing differences between these planes is their size, and when they train on multiple resolutions (scaling some down) they are obscuring that size distinction. We also have a lot of classes for which we have no detections at all (0 score). It doesn’t help that there are very few instances of these in our validation set compared to some other classes, but it’s surprising to see no detections at all.

We should probably do some form of multiresolution training on YOLO2D53 since it seemed to work well for the XView baseline model. However, it seems a bit odd that the model is trying to generalize to different resolutions in the same network. It may be better to split those up into separate networks and merge the results, but more on this in a later blog post.

The object detection problem can be split up into two parts: finding the bounding box around an object of interest and then classifying it correctly. We surmise that the classification is more of a problem than finding the bounding box. To confirm that this class confusion is one of the driving problems behind our low score, we can treat all objects as the same class during evaluation. The resulting mAP is 0.471. This gives us a measure of how good we are at just finding the locations of the objects, and from our result it looks like we’re doing okay. So while we are retrieving a respectable number of our objects, we are not classifying them correctly. This is likely due to the class imbalance: rarer classes are probably being misclassified as whichever similar looking class has the most annotations. To visualize our class confusion we can look at the confusion matrix below.
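For the curious, the class-agnostic check above can be sketched like this. To be clear, our numbers came from the XView scoring code, not from this snippet. What follows is a simplified, PASCAL VOC style AP at a single IoU threshold of 0.5 (an assumption), with made-up function names, just to show that “treating all objects as the same class” amounts to overwriting the labels before scoring.

```python
import numpy as np

def box_iou(box, boxes):
    """IoU between one [xmin, ymin, xmax, ymax] box and an (M, 4) array."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter)

def average_precision(dets, gts, iou_thr=0.5):
    """dets: list of (image_id, score, box); gts: dict image_id -> list of boxes."""
    gts = {img: np.asarray(b, dtype=float) for img, b in gts.items()}
    used = {img: np.zeros(len(b), dtype=bool) for img, b in gts.items()}
    n_gt = sum(len(b) for b in gts.values())
    dets = sorted(dets, key=lambda d: -d[1])  # highest score first
    tp = np.zeros(len(dets))
    for i, (img, _, box) in enumerate(dets):
        if img not in gts:
            continue
        overlaps = box_iou(np.asarray(box, dtype=float), gts[img])
        j = int(np.argmax(overlaps))
        # A ground truth can only be claimed once; later hits are false positives.
        if overlaps[j] >= iou_thr and not used[img][j]:
            used[img][j] = True
            tp[i] = 1
    cum_tp = np.cumsum(tp)
    recall = np.concatenate([[0.0], cum_tp / max(n_gt, 1), [1.0]])
    precision = np.concatenate([[0.0], cum_tp / np.arange(1, len(dets) + 1), [0.0]])
    precision = np.maximum.accumulate(precision[::-1])[::-1]  # monotone envelope
    steps = np.where(recall[1:] != recall[:-1])[0]
    return float(np.sum((recall[steps + 1] - recall[steps]) * precision[steps + 1]))

def mean_ap(dets, gts, class_agnostic=False, iou_thr=0.5):
    """dets: (image_id, class_id, score, box); gts: (image_id, class_id, box)."""
    if class_agnostic:
        # Collapse every label to a single class so only localization counts.
        dets = [(img, 0, s, b) for img, _, s, b in dets]
        gts = [(img, 0, b) for img, _, b in gts]
    aps = []
    for cls in sorted({c for _, c, _ in gts}):
        cls_gts = {}
        for img, c, b in gts:
            if c == cls:
                cls_gts.setdefault(img, []).append(b)
        cls_dets = [(img, s, b) for img, c, s, b in dets if c == cls]
        aps.append(average_precision(cls_dets, cls_gts, iou_thr))
    return float(np.mean(aps))
```

A detection with a perfect box but the wrong label counts as a miss in the regular score and as a hit in the class-agnostic one, which is exactly the gap we are trying to measure.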

As a side note, class confusion matrices for object detection have to be defined a bit differently than in the pure classification task. A ground truth instance may correspond to multiple overlapping predictions, and each could potentially predict different classes at various probabilities, as opposed to the classification problem where truths and predictions are one to one. Because we allow ground truths to be counted multiple times, some of the counts may seem a bit high, but it’s more important to look at the rates of confusion represented by the shades of blue.

That’s a lot of stuff to take in, so how do we read this matrix? Each row represents detections of a class and each column represents ground truths of a class. The class IDs run from top to bottom and left to right. If we trace up from column label X, stop at value Y, then go left to row label Z, we can read this as Y ground truth instances of class X being detected as class Z. Conversely, Y detections of class Z are actually ground truth instances of class X. The actual class names are left out because they’re too hard to read, but we’ll show a zoomed in figure later.
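One way to build such a matrix is sketched below. This is a simplified illustration and not our exact evaluation code: it assumes detections have already been filtered by confidence, matches each detection to the ground truth box it overlaps most (IoU of at least 0.5, an assumed threshold), and expects class labels to be remapped to contiguous indices 0..N-1 since the raw XView IDs are not sequential. The function names are hypothetical.

```python
import numpy as np

def box_iou(box, boxes):
    """IoU between one [xmin, ymin, xmax, ymax] box and an (M, 4) array."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter)

def detection_confusion(dets, gts, num_classes, iou_thr=0.5):
    """Rows are detected classes, columns are ground truth classes.

    dets and gts are lists of (image_id, class_index, box).
    A ground truth may be counted once per overlapping detection.
    """
    cm = np.zeros((num_classes, num_classes), dtype=int)
    gt_by_img = {}
    for img, cls, box in gts:
        gt_by_img.setdefault(img, []).append((cls, box))
    for img, det_cls, box in dets:
        candidates = gt_by_img.get(img, [])
        if not candidates:
            continue
        gt_boxes = np.array([b for _, b in candidates], dtype=float)
        overlaps = box_iou(np.asarray(box, dtype=float), gt_boxes)
        j = int(np.argmax(overlaps))
        if overlaps[j] >= iou_thr:
            cm[det_cls, candidates[j][0]] += 1
    return cm
```

Dividing each column by its sum would turn the counts into the per-ground-truth confusion rates that the shading represents.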

For accurate classification performance the goal is to get high (darker) values along the diagonal and white everywhere else. If there is class confusion between a set of classes and if those classes are side by side in the confusion matrix then we should see a darker square-ish structure (the off diagonal elements are non-white). There is no guarantee that similar classes are adjacent in our matrix which would make our confusion blocks difficult to see but the XView annotations conveniently have classes that are conceptually similar grouped together by ID so not too surprisingly we see that the confusion groupings roughly correspond to these. Let’s draw them out next.

Class groups for confusion matrix. Building mis-detections span across all classes.

The groupings correspond roughly to planes, common vehicles, utility vehicles, train compartments, ships and building/structural “stuff”. There does seem to be some confusion between common and utility vehicles, though it’s not as “convenient” to see since those class groups are separated. In fact some of the “common vehicles” like “crane truck” conceptually should belong to the utility vehicles category, but again the line is pretty fuzzy as there is nothing smart about this grouping. It was decided purely by “guess-timation”. For some classes we don’t even have enough detections to make any sort of educated guess about groupings at all, but it’s a good enough starting point to build off of. Another point of interest is that Buildings get confused with pretty much everything else. This is understandable since they come in all shapes and sizes and they are by far the most abundantly annotated item in the dataset, so the network is at risk of being biased towards predicting more of them. The confusion matrix is pretty huge though, and there’s a lot of information to handle at once, so let’s zoom in on a problem block.

Confusion matrix zoomed in on the “Common Vehicles” grouping.

We have a lot of imbalanced classes and some clear major groupings of visually similar classes (planes, vehicles, boats, buildings). The network focuses on the separation between these groups and performs more poorly on the separations within each group. In addition, within each group the class that most others are misclassified as is often the one with the largest number of annotations. The confusion block above shows the confusion between a set of vehicles. Here the most frequent classes are “small car”, “truck” and “bus”. We see that a lot of the truck classes with fewer instances get classified as the generic “truck” class. Many of them also get classified as the “small car” class, especially pickup trucks, which is perhaps due to their relatively similar size. While we didn’t show it here, we also had confusion between Reach Stackers and Shipping Containers, which seemed rather odd at first since we didn’t know what a Reach Stacker was. As it turns out though, it is a vehicle that carries shipping containers, go figure. So while our score isn’t great, at least the mistakes our network is making are understandable.

Example Detections

Here are some images of detections from our YOLO2D53 model. For a 0.085 mAP, it doesn’t look so bad.

Vehicle Detections, and by vehicle I pretty much just mean “small car”
Building Detections, not too shabby
Plane Detections. That one missed plane has some serious camouflage going on.
Boat Detections…okay…Detection. That one’s kind of embarrassing…

Final Words and Future Plans

With all the class imbalance issues and lack of multiresolution training, it’s clear that our baseline needs quite a bit of improvement. We have quite a few ideas on how to approach this. For the resolution issue we could use an ensemble of networks at different resolutions to handle different size objects. To solve the class imbalance issue we could sample tiles that contain infrequent classes more frequently, though this is not as straightforward in the object detection problem as in a pure classification task since there are multiple objects per tile. We could also use YOLO to predict those object groupings we found earlier and then use separate fine-grained classifiers for each group of predictions to focus on separating them. This means that we’ll also need a better strategy to group our confused classes. Each of these ideas, however, deserves its own blog post, so stay tuned for more and if you have any questions feel free to contact us at

Source: Deep Learning on Medium