Original article was published on Deep Learning on Medium
Restoring old aerial images with Deep Learning
Super Resolution with Perceptual Loss function and real images as input
With the help of Super Resolution, I brought life into old aerial photos of the city where I live. Using a neural network trained with high-resolution images, I was able to create details that make the old aerial images look as if they had been captured at higher quality.
Super resolution works by upscaling image data while providing more detail than, for instance, bilinear or nearest-neighbor interpolation would. I am sure many of you have tried enlarging a photo in Photoshop and seen the blurry result. Super resolution is something completely different from plain upsizing: it fabricates details that were never there, in a very convincing way.
To train a Super Resolution model you need pairs of images depicting the same thing, one low-resolution and one high-resolution. A common method for creating training data is to generate the low-resolution input images by degrading the high-resolution images the model uses as targets. This gives you pairs of images with corresponding pixels — good training data.
I decided to try it in a slightly different way; maybe there was an easier way of doing this? Since geographical data is tied to positions on the ground, and thereby related to all other geographical data over the same area, I figured I did not have to create my own training data: I already had what I needed.
Since my data was images taken in 1998, and I had images over the same area with much higher resolution from 2017, I already had my “degraded” images, and my targets.
That way I get all the flaws I want to fix, including the ones I did not think of or could not replicate manually. By using data from the same source that I later want to enhance, I get all the naturally occurring flaws it has — the blurriness, the color cast, etc. This gives me ready-to-go training data with little or no effort.
As mentioned, a big advantage of geographical data is that everything is tied to a spot on the ground, with coordinates to describe it. This makes it easy to cut out parts of different datasets with the exact same extents. In this case, that means creating input and target images that cover the same area of the city. There are potential problems when working with aerial imagery, which I will discuss further on.
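The article does not show its tiling code, so here is a minimal sketch of the idea: because both datasets share one ground coordinate system, a single list of extents can cut matching input and target tiles. The bounding box values and tile size below are illustrative, not the project's actual numbers.

```python
# Sketch: generate matching tile extents in a shared ground coordinate
# system. Each extent can then be used to clip a tile from BOTH the 1998
# and the 2017 dataset, giving aligned input/target pairs.

def tile_extents(xmin, ymin, xmax, ymax, tile_size):
    """Yield (xmin, ymin, xmax, ymax) for square tiles covering the box.

    Edge tiles are clipped to the bounding box so nothing spills over.
    """
    x = xmin
    while x < xmax:
        y = ymin
        while y < ymax:
            yield (x, y, min(x + tile_size, xmax), min(y + tile_size, ymax))
            y += tile_size
        x += tile_size
```

Each extent could then be fed to a clipping tool (for instance GDAL's `-projwin` style options) once for each dataset, producing pixel-aligned pairs without any manual degradation step.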
The process I have used is based on working examples of fast.ai’s MOOC v3. The biggest difference is the data augmentation part, where I have used a set of transformations that previously have given me good results for aerial imagery.
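The article does not list the exact transform set. One common choice for aerial imagery, where no orientation is more "correct" than another, is the eight dihedral flips and rotations; the NumPy sketch below is my illustration of that idea, not the project's code.

```python
import numpy as np

def dihedral_variants(img):
    """Return the 8 flip/rotation variants of an image array (H, W[, C]).

    Top-down aerial imagery has no natural 'up', so all eight dihedral
    transforms yield plausible training samples. For super resolution the
    same transform must be applied to the input tile and its target tile.
    """
    variants = []
    for k in range(4):                   # 0, 90, 180, 270 degree rotations
        rot = np.rot90(img, k)
        variants.append(rot)
        variants.append(np.fliplr(rot))  # plus a horizontal flip of each
    return variants
```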
An orthophoto is an image made out of several overlapping photos taken from a plane or a drone. It is corrected — rectified — to create a map-like image that can be used to measure accurate distances and angles, just like in a map. Everything in the image appears to be taken from a vertical top-down perspective.
The problem with orthophotos is that they are not always orthographic. Aerial images, especially older ones, were not always taken with enough overlap. It is the overlap that allows only the center part of each image, the part taken vertical to the ground, to be used when creating the mosaic that makes up the final orthophoto. Lacking overlap, objects are depicted somewhat from an angle. This is an issue especially for images with tall buildings in them: the roof gets offset from the footprint, and the facade of the building becomes visible. Sometimes this is a desired effect, but most often not. In my case, it causes the content of the input images to differ from the target images, especially in areas with tall buildings.
When training a model with real low-resolution images instead of generated ones, differences in content are still an issue. Houses get built, torn down, roads redrawn, trees grow, etc. Lots of things happen in a city in nineteen years that make the targets differ from the inputs. This did not cause as big of a problem as I thought, though. My overall impression is that the model did well despite the differences.
Choosing areas for my training set, I was careful to pick areas that: a) existed in 1998 and b) had not changed much since.
Input aerial image (orthophoto mosaic) from 1998 (25cm/pixel)
Target aerial image (orthophoto mosaic) from 2017 (10cm/pixel)
10,000 pairs of image tiles
Each tile is 500 x 500 pixels, covering 50 x 50 m on the ground = 10cm/pixel
Total of 25 km²
~ 5 GB of data
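The figures above are internally consistent, which a few lines of arithmetic confirm:

```python
# Sanity check of the dataset figures (pure arithmetic).
tiles = 10_000
tile_px = 500    # pixels per tile side
tile_m = 50      # metres per tile side on the ground

m_per_px = tile_m / tile_px           # 0.1 m, i.e. 10 cm per pixel
area_km2 = tiles * tile_m**2 / 1e6    # 25.0 km^2 total ground coverage
```

Note that 10,000 uncompressed 500 × 500 RGB tiles would be about 7.5 GB, so the quoted ~5 GB suggests the tiles were stored in a compressed format.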
Model and training
I will not go into detail about the concept of perceptual loss, as many others have done this much better:
Original paper on Perceptual Loss by Justin Johnson et al.
In-depth article by Christopher Thomas
In short, a perceptual loss function is an alternative to using a GAN for tasks like super resolution. It is similar in that a second model decides how the first one is doing — "is the thing it created like the thing we want?". The difference is that this second model, unlike a GAN discriminator, is not being trained in the process. In my case, it is VGG16, a neural net for image classification pretrained on ImageNet.
The main model in this case is a U-Net whose encoder is a pretrained resnet34. Fitting the model is done by passing both the prediction and the target through the VGG. Then, inside the VGG, we compare their activations and calculate a loss. That way we get numbers for how good the model is at, for example, creating a grassy spot where there is supposed to be one, or whether it is doing a bad job at generating a house with sharp corners.
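As a toy illustration of the principle only (not the actual VGG16-based loss), the sketch below compares crude "feature maps" — pooled averages and gradient maps — rather than raw pixels. A real perceptual loss does the same thing with activations from several layers of a pretrained network.

```python
import numpy as np

def toy_features(img):
    """Crude stand-in for VGG activations on a 2-D grayscale array:
    2x2 average pooling plus horizontal/vertical gradient maps.
    Height and width are assumed to be even."""
    h, w = img.shape
    pooled = img.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))
    gx = np.diff(img, axis=1)  # horizontal edge responses
    gy = np.diff(img, axis=0)  # vertical edge responses
    return pooled, gx, gy

def perceptual_loss(pred, target):
    """Mean L1 distance between feature maps, not between raw pixels."""
    return sum(np.abs(a - b).mean()
               for a, b in zip(toy_features(pred), toy_features(target)))
```

The point of the toy: two images can match pixel-by-pixel on average yet differ sharply in their edge maps, and it is the feature-level difference that this kind of loss penalizes.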
The loss depends on how well the model predicts features and style, not just on comparing pixel values. This is what helps our model know what is important. For instance, a building corner may not take up many of the pixels in the image but is crucial to how the image is perceived. The perceptual loss function recognizes the importance of these pixels.
The dataset of 10,000 image pairs took about two hours to train for a total of 20 epochs, on a single 16 GB GPU.
Result and postprocessing
The result showed more defined roof-lines and less noise. Vegetation often showed more detail but sometimes got a smudged look. In image tiles with large grey areas like roads the colors often look a bit diluted. When creating an automated workflow for inference I would include some basic image processing to give them a bit more contrast.
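As an example of the kind of basic image processing meant here, a percentile-based contrast stretch could look like the sketch below. The 2/98 percentile defaults are my assumption, a common convention, not values from the project.

```python
import numpy as np

def percentile_stretch(img, low=2, high=98):
    """Linear contrast stretch: map the given percentile range to [0, 255].

    A simple way to counteract the diluted look of large grey areas in
    predicted tiles; values outside the range are clipped.
    """
    lo, hi = np.percentile(img, [low, high])
    stretched = (img.astype(float) - lo) / max(hi - lo, 1e-6) * 255.0
    return np.clip(stretched, 0, 255).astype(np.uint8)
```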
I am sure that there are many ways to make the model perform even better. Feel free to notify me of things that can help!
Unfortunately I cannot show many examples due to copyright and privacy concerns, but if you are the lucky owner of high-resolution aerial imagery, I recommend trying it out. A link to the code used in this project can be found below.
Inference was done with tiles of the same size and resolution as the original input. When I created them, I also made a spatial reference file (.wld) for each image tile file.
Saving the predicted image under the same name as its .wld file ties the predicted image to its position on the ground when it is loaded into GIS software. The result is a spatially referenced aerial image, in Super Resolution, ready to use as a background in web maps or any other application.
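For reference, a .wld world file is just six plain-text lines. A sketch of writing one for a north-up tile is below; the argument names and the corner-to-center conversion are my assumptions about how the project's tiles were referenced.

```python
def write_world_file(path, pixel_size, ulx, uly):
    """Write a six-line world file (.wld) for a north-up, unrotated image.

    Line order: x pixel size, row rotation, column rotation, negative
    y pixel size, then the x and y ground coordinates of the CENTER of
    the upper-left pixel. `ulx`/`uly` here are the tile's upper-left
    corner, so half a pixel is added/subtracted to reach that center.
    For the 2017-resolution output, pixel_size would be 0.1 (metres).
    """
    lines = [pixel_size, 0.0, 0.0, -pixel_size,
             ulx + pixel_size / 2, uly - pixel_size / 2]
    with open(path, "w") as f:
        f.write("\n".join(f"{v:.6f}" for v in lines) + "\n")
```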
This was a lot of fun!
The imagery from 1998 came to life in a way I have never seen before. It is not reality, of course, as you cannot extract details that were never there to begin with, but a reproduction of what it might have looked like. Keep in mind that when the model is predicting, it has no idea what the target is, just the knowledge, or understanding, of what a detailed aerial image should look like. The rest is the model's qualified guessing, or rather hallucinating/substituting the missing details. It is not real, but it certainly looks like it.
I am surprised how well the model performed, despite the input and target images differing far more than they would have if created with the degrading method. Hardly any pre-processing was done, just cutting a subset of both image sets into tiles. Using a Perceptual Loss function keeps training times short, hopefully making projects like this accessible to more municipalities and other organizations that own a lot of geographical data.
I do not know how much the skewness and the differences in content affected the model's efficiency. Maybe less training data would have been needed if both image sets were truly orthographic, or if I had had high-resolution imagery from an earlier year, shortening the timespan between the sets. It is inevitable that things differ between the images, even in areas where the buildings stay the same. I discovered that few, if any, cars were parked in the same spot in 1998 and 2017.