Deep learning for visual searches and mapping

Source: Deep Learning on Medium

Deep learning for visual searches and mapping

Reducing the need for training data annotation in remote sensing applications by using pre-trained neural networks

Deep learning is a brilliant approach for object detection and segmentation/mapping applications in remote sensing datasets from e.g. satellite or aerial photos. However, like in many other uses of deep learning, getting enough annotated training data can be prohibitively time consuming. In this article I will present some of our work on using pre-trained networks to remove much of the cumbersome work in annotating large training datasets for object detection in remote sensing data.

In the middle of September 2019 I attended the Nordic Remote Sensing Conference. From many of the talks it was clear that deep learning has found its way into the tool boxes of many remote sensing specialists. The interest in the topic among the audience seemed big, and there was some discussion about the impact and suitability of using deep learning techniques for various applications.

One of the things being discussed was the practice of using neural networks developed for and trained on one kind of data (typically natural images), and applying it for other kinds of (remote sensing) data sources. For instance, Øivind Due Trier from the Norwegian Computing Center presented work, where a standard object detection network developed for computer vision applications was applied on a filtered elevation map in order to locate archaeological sites in Norway. Here an objection from the audience was that using this model did not make sense. I strongly disagree with this; even though the neural network was developed for natural images, it makes sense to test it on other data sources. In this case, the presenter could demonstrate that it works! In my opinion, it even makes sense to try transfer learning between data sources — why should initializing the network with filters trained on another kind of dataset be worse than a random initialization? It may be that the developed model might be too large and prone to overfitting, but the benefits of doing a quick experiment with an existing code base and pre-trained model is often so large that it makes a lot of sense to try it out.

In the rest of this post, I would like to present some work we have done in our lab on applying a network trained in one domain (ImageNet natural images), for performing image based searches in another domain (aerial orthophotos). Hopefully, I will convince you that this approach can make sense. I am not claiming that the ImageNet network leads to the best obtainable results — but rather that using a cross-domain network does make sense, when considering the amount of annotation work that would otherwise be required.

Visual searches and the need for training data

Deep learning or other machine learning techniques can be used to develop robust methods for recognizing objects in images. With orthophotos from aircraft or high resolution satellite photos, this will enable mapping, counting or segmentation of different object types. However, using deep learning requires large amounts of training data, and unless you already have usable registry data for the desired object types (polygon data that can be used to cut training images out of the dataset), it is a very time-consuming and thus costly process to create such a training dataset.

In cooperation with the city of Copenhagen, we have therefore taken a step towards a tool, that can be used to map desired object types without the need to create training data in advance. The tool is based on the technology behind a previous project at the Alexandra Institute on geovisual search. This online demo, which was very much inspired by similar technology developed at Descartes Labs, lets you click on a place on an orthophoto dataset of Denmark, and view the 100 places in Denmark that look most similar. The similarity metric is computed based on a neural network trained to distinguish between different object types. For example, clicking marinas or wind turbines will result in the following results:

Basically, the technology works by dividing the dataset into a whole bunch of small cut-outs (in this case 48 million of them), and for each of these running a Resnet-34 network trained to distinguish between the 1000 different objects in the ImageNet dataset. Instead of using the final classification (one of the 1000 classes), we extract for each cut-out a so-called descriptor from the network, consisting of 2048 numbers. In order to save memory and reduce the computational burden, we trained an autoencoder neural network to compress the 2048 numbers to 512 bits. After that, the 48 million image cut-outs from the orthophoto dataset can be compared to a new cut-out in less than 80 milliseconds! That the autoencoder is trained for this specific dataset means that it adapts to relevant features in a self-supervised manner.

At the outset, the solution had some weaknesses that we addressed in order to make the technology more robust:

  • We improved the rotational invariance by basing the extracted descriptor on output from the networking with the cut-out rotated 0, 90, 180 and 270 degrees.
  • We calculated descriptors based on cut-outs on different scales. This allows you to look for objects of different sizes.
  • We developed an interactive method for “refining” the searches, so that the mapping is based not just on a single reference cut-out, but on several cut-outs.

From the publicly available Danish spring orthophoto dataset from 2016 in 12.5 cm resolution, we calculated descriptors for 8,838,984 cut-outs in 3 different scales in the following area around Copenhagen:

Interactive mapping

The interactive mapping is currently at the prototype stage, and is best explained by an example: Suppose that we want to map all boats sailing around in an area. We start by selecting a cut-out that contains a boat:

Based on the stored descriptors, the system calculates a “distance” (similarity) between the selected cut-out and all the others. Then, a sorting is done and the 100 most similar cut-outs are displayed to the user:

It is seen that some of these cut-outs contain boats, but the result is far from good enough. The user can now select a number of cut-outs that he or she is satisfied with:

After that, a comparison is made between descriptors for all selected cut-outs and all cut-outs in the database, and are again sorted based on their average similarity distance. This results in the following top 100:

Significant improvement is seen. We can choose to run another iteration of the search by selecting more cut-outs that we are happy with, and running the sort again:

Boats are still present in the top-100 which is a good sign. Note that the cut-outs that we previously marked as satisfactory do not appear again in the interactive refinement.

From sorting to mapping

The iterative approach results in a ranking of all the 8.8 million cut-outs, based on the average similarity distance to the cut-outs selected during the interactive refinement. Ideally, a boundary should exist between the first N cut-outs that contain boats, and the remaining cut-outs that do not. In practice, however, it is rather the case that the first M cut-outs contain boats, after which there is an interval between cut-out M and cut-out N, where some cut-outs, but not all, contain boats. Cut-outs after M in the sort are assumed not to contain boats to avoid false positives. We have created a quick-and-dirty user interface where the user can inspect the sorted cut-outs, and establish some useful values ​​for M and N.

If the sorting is good, and if M and N are set sensibly, you now have clean training data consisting of cut-outs containing boats (sorting rank<M) and cut-outs that do not contain boats (sorting rank>N). This can be used to train a classification network (or possibly object detection network) to recognize boats. In our case, however, we have chosen to test a simpler heuristic for mapping the boats: We have selected 100 random cut-outs from before M in the sort (positive examples) and 100 random cut-outs after N (negative examples). These cut-outs form a comparison set of 200 examples. For each cut-out between M and N, we have then found the 2 cut-outs among comparison set, whose descriptor is most similar. If both of these cut-outs are positive examples, the cut-out is accepted as a boat and the outline of the cut-out is saved as polygon in a shape file. For all positive examples (sorting rank<M), a polygon is also created. An overview of the results can be seen here:

Zoomed in, you can see something like this (the different boxes are missing a side in the visualization for some reason):

Mapping is not perfect, but in less than a quarter of an hour the technique can give an overview of the situation. And at the same time you have also created a really good starting point for a training dataset that can be used to train a neural network or another machine learning methodology.

Mapping trees

Repeating the same process with trees instead of boats gives a mapping that looks like this:

Zoomed in, it looks like this:

Again, the mapping is not perfect, but it provides a good starting point for further work.

I hope this post has sparked some inspiration on how objects can be localized using a pre-trained neural network, e.g. for extracting training data from maps. I am very interested in hearing about more potential use cases, so if you have ever had a need for finding certain objects in large images such as maps, please leave a comment!

Also, I am very eager to hear your ideas on how to use self-supervised methods to create an even better embedding/representation of the image patches.

I work in the Visual Computing Lab at the Alexandra Institute; a Danish non-profit company that helps companies apply the newest IT-research. My Lab focuses on advanced computer graphics and computer vision research. We are always open to collaboration!