Using Deep Learning to Segment Roads in Aerial Images.

Source: Deep Learning on Medium


This comprehensive article will help you to create a road segmentation model, which can detect and segment roads in aerial images.

The advent of Convolutional Neural Networks (C.N.N.s) was a breakthrough in the field of computer vision, as they radically changed the way computers “looked at” images. Machine vision has come a long way from where it began, but it is still at the bleeding edge of research today. Semantic segmentation is the process of attributing every pixel in an image to a certain class. This class can be a dog, a car, or, in our case, roads.


The combined length of all the roads on our planet is about 33.5 million kilometres. Let me rephrase that: if we could lay all the roads end to end in one straight line, we would cover about a quarter of the distance between the Earth and the Sun. Manually annotating every stretch of road is a Herculean task, if not an impossible one. This is where Deep Learning comes into the picture, and this is what we will accomplish through this project. To put it simply, we will train a Deep Learning model to identify roads in aerial images.

You can find a link to the source code at the end of this article. Please refer to the table of contents if you want to discern the scope of this article.
All the resources used in the project are publicly available, so I recommend that you follow along. This article covers both the practical and theoretical aspects of this project, and I hope it will be an enjoyable learning experience for you.

Table of Contents

  1. Data
    i. The type of data we need.
    ii. The dataset
    iii. Downloading the dataset.
  2. Preprocessing
  3. Neural Modelling
    i. About F.C.N
    ii. Network Architecture
  4. Training the Model
    i. Loss Function and Optimiser
    ii. Callbacks
    iii. Training the Model
  5. Testing the model
  6. Scopes of improvement
  7. Conclusion
  8. Links and References

Let’s get started.

1. Data

Different types of machine learning models require different kinds of data, and the more data we have, the better. More data to train on means that our model will be able to learn more of the underlying patterns, and it will be able to distinguish outliers better as well.

i. The type of data we need.

Usually, for segmentation challenges, we need images along with their respective (preferably hand-drawn) maps. For this project, we require aerial images along with segmentation maps in which only the roads are indicated. The notion is that our model will focus on the white pixels, which represent roads, and learn a correlation between the input images and the output maps.

ii. The dataset

For this project, we will be using the Massachusetts Roads Dataset. This dataset contains 1171 aerial images, along with their respective maps. They are 1500 x 1500 pixels in dimension and are in .tiff format. Please have a look at the following sample.

Just look at how elaborately the image was annotated.

iii. Downloading the dataset.

You can start by cloning my GitHub repo and then use the script in the Src folder to download the dataset. In case you have an unreliable internet connection, please use academic torrents to acquire the dataset. You can find the dataset here.

2. Preprocessing

The quality of data hugely impacts our model’s performance, and therefore pre-processing is an important step towards making sure that our model receives the data in the right form. I tried multiple pre-processing techniques, and the following methods yielded the best results:

i. Hand-picking: There are a few images (~50) in the dataset where a big chunk of the aerial image is missing; the majority of such an image is white pixels, even though its segmentation map is complete. Since this can throw the model off, I manually removed them.

ii. Cropping instead of resizing: Training our model on large images is not only resource-intensive but bound to take a lot of time as well. Resizing the images to lower dimensions is one answer, but resizing comes at a cost: regardless of the interpolation method we choose, we end up losing information.

Therefore, we will crop out smaller, 256 x 256 images from the large images. Doing so leaves us with about 22,000 useful images and maps.
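The cropping step can be sketched with NumPy; `crop_tiles` is an illustrative helper (not the repo's exact code) that slices non-overlapping 256 x 256 patches out of each 1500 x 1500 image:

```python
import numpy as np

def crop_tiles(image, tile_size=256):
    """Slice non-overlapping tile_size x tile_size patches from an image.

    A 1500 x 1500 source image yields 5 x 5 = 25 full tiles;
    the 220-pixel remainder along each axis is discarded.
    """
    tiles = []
    height, width = image.shape[:2]
    for top in range(0, height - tile_size + 1, tile_size):
        for left in range(0, width - tile_size + 1, tile_size):
            tiles.append(image[top:top + tile_size, left:left + tile_size])
    return tiles

# Example: one 1500 x 1500 RGB image produces 25 tiles of 256 x 256.
dummy = np.zeros((1500, 1500, 3), dtype=np.uint8)
patches = crop_tiles(dummy)
```

Applied to the roughly 900 usable images, 25 tiles apiece lines up with the ~22,000 figure above.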

iii. Thresholding and binarizing the maps: Grayscale images are single-channel images that contain varying shades of grey. There are 256 possible intensity values that each pixel can take, with 0 representing a black pixel and 255 representing a white one. In semantic segmentation, we essentially predict this value for each pixel. Rather than giving the model 256 discrete options to choose from, we will provide only two. As you may have noticed, our maps have just two colours: black and white. The white pixels represent roads, while the black pixels represent everything that isn't a road.

A closer look at our dichromatic segmentation maps reveals that there are a lot of grey pixels when all we want are black and white. We will start by thresholding the pixel values at 100: all pixels with a value above this threshold are assigned the maximum value of 255, and all other pixels are assigned zero. Doing so ensures that there are only two unique pixel values in the segmentation masks. Now, 0 and 255 is a wide range, so by dividing all the maps by 255, we normalize them and end up with only two values: 0 and 1.
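In NumPy, the thresholding and normalization just described might look like the following sketch (the helper name is mine):

```python
import numpy as np

def binarize_map(mask, threshold=100):
    """Threshold a grayscale map at `threshold`, then scale to {0, 1}.

    Pixels above the threshold become 255 (road), the rest 0;
    dividing by 255 leaves exactly two values, 0.0 and 1.0.
    """
    binary = np.where(mask > threshold, 255, 0).astype(np.float32)
    return binary / 255.0

# A toy 2 x 3 map with grey pixels on either side of the threshold.
mask = np.array([[0, 60, 120], [180, 240, 255]], dtype=np.uint8)
result = binarize_map(mask)
```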

iv. Packaging (Optional): I trained my model on Google Colab. A big, hearty thanks to Google for providing resources to thousands of Data Scientists and Machine Learning Engineers.

I have noticed that supplying images to the model from Gdrive during training (using ImageDataGenerator) ends up consuming extra time. However, this is not true if you are training the model on your own system, as loading files is much faster in that case. I packaged the image and map arrays into two separate .h5py files and loaded them onto the RAM. Doing so sped up the training process.
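A sketch of this packaging step with h5py; the array sizes and the `images.h5`/`maps.h5` filenames here are illustrative, not the project's exact ones:

```python
import h5py
import numpy as np

# Stand-ins for the full arrays of cropped images and binarized maps.
images = np.random.rand(10, 256, 256, 3).astype(np.float32)
maps = np.random.randint(0, 2, (10, 256, 256, 1)).astype(np.float32)

# Write each array into its own file so that training can load
# everything into RAM in one read instead of streaming from Gdrive.
with h5py.File("images.h5", "w") as f:
    f.create_dataset("images", data=images, compression="gzip")
with h5py.File("maps.h5", "w") as f:
    f.create_dataset("maps", data=maps, compression="gzip")

# Loading back is a single read per file.
with h5py.File("images.h5", "r") as f:
    loaded = f["images"][:]
```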

3. Neural Modelling

Now that we have dealt with the data, it's time we start modelling our neural network. To accomplish our segmentation task, we will be using a Fully Convolutional Network. These networks are composed mostly of convolutional layers, and unlike more traditional neural networks, fully connected layers are absent.

i. About F.C.N

Fully Convolutional Networks perform segmentation using convolutional layers alone. The U-net variant we will use was developed for biomedical image segmentation at the Computer Science Department of the University of Freiburg, Germany [1]. It was later realised that the scope of these networks extends well beyond the medical realm: they can perform multiclass segmentation of any kind of object, be it people, cars or even buildings.

ii. Network Architecture

This project uses U-net, a fully convolutional neural network which is quite intuitively named. This network takes a 256×256 multichannel image and outputs a single-channel map of the same dimension.

A U-net has two parts — The encoder or the downsampling section, and the decoder or the up-sampling section. Just have a look at the following image.

Dissecting a U-net

Encoder: Also known as the downsampling section, this segment uses convolutional layers to learn the spatial features in an image and pooling layers to downsample it. This part is responsible for learning about the objects in the image; in our case, it learns what a road looks like and detects it. I added Dropout layers, which randomly ignore neurons to prevent overfitting, and BatchNormalization to ensure that each layer can learn somewhat independently of the previous one.

Decoder: Also known as the upsampling section. Successive pooling operations result in the loss of the image's spatial information: the model knows what the image contains, but not where. The whole idea behind the decoder network is to reconstruct the spatial data using the feature maps extracted in the previous step. We use transposed convolutions to upsample the image; unlike plain interpolation, Conv2DTranspose has learnable parameters.

Skip Connections: Direct connections from layers in the encoder to the corresponding layers in the decoder are called skip connections. They are so named because they bridge two layers while bypassing all the intermediate ones. Skip connections supply spatial information to the upsampling layers and help them reconstruct the image and “put things into place” (quite literally).

Please use the following code to replicate the U-net.

Our U-net in all its shining glory.
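The full model lives in the repo; the following is a minimal Keras sketch of the U-net described above. The depth and filter counts here are illustrative (smaller than a typical U-net), and `conv_block` and `build_unet` are my own helper names:

```python
from tensorflow.keras import layers, Model

def conv_block(x, filters, dropout=0.1):
    """Two 3x3 convolutions with BatchNormalization and Dropout."""
    x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    x = layers.BatchNormalization()(x)
    x = layers.Dropout(dropout)(x)
    x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    x = layers.BatchNormalization()(x)
    return x

def build_unet(input_shape=(256, 256, 3)):
    inputs = layers.Input(input_shape)

    # Encoder: learn features, downsample with max pooling.
    c1 = conv_block(inputs, 16)
    p1 = layers.MaxPooling2D()(c1)
    c2 = conv_block(p1, 32)
    p2 = layers.MaxPooling2D()(c2)
    c3 = conv_block(p2, 64)
    p3 = layers.MaxPooling2D()(c3)

    # Bottleneck.
    c4 = conv_block(p3, 128)

    # Decoder: upsample with transposed convolutions, concatenating
    # the matching encoder feature maps (the skip connections).
    u5 = layers.Conv2DTranspose(64, 2, strides=2, padding="same")(c4)
    c5 = conv_block(layers.concatenate([u5, c3]), 64)
    u6 = layers.Conv2DTranspose(32, 2, strides=2, padding="same")(c5)
    c6 = conv_block(layers.concatenate([u6, c2]), 32)
    u7 = layers.Conv2DTranspose(16, 2, strides=2, padding="same")(c6)
    c7 = conv_block(layers.concatenate([u7, c1]), 16)

    # Single-channel sigmoid output: one road probability per pixel.
    outputs = layers.Conv2D(1, 1, activation="sigmoid")(c7)
    return Model(inputs, outputs)

model = build_unet()
```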

4. Training the Model

i. Loss Function and Optimiser

At the pixel level, this segmentation challenge can be considered a binary classification problem, where the model classifies each pixel as white (road) or black (not road). But we need a balanced dataset to facilitate proper segmentation, and since the black pixels in these images greatly outnumber the white ones, we have an imbalanced dataset.

There are a few different approaches to dealing with imbalanced data. In this challenge, we will use the Soft Dice Loss, as it is based on the Dice Coefficient. The Dice Coefficient is the measure of overlap between the predicted sample and the ground truth sample, and its value ranges between 0 and 1, where 0 represents no overlap and 1 represents complete overlap.

The formula for the Dice Coefficient. Deja Vu?

Soft Dice Loss is simply 1 - Dice Coefficient; this is done to create a loss function that can be minimized [2]. Please have a look at the following code for the Dice Loss.

from tensorflow.keras import backend as K

def dice_coef(y_true, y_pred, smooth=1):
    y_true_f = K.flatten(y_true)
    y_pred_f = K.flatten(y_pred)
    intersection = K.sum(y_true_f * y_pred_f)
    dice = (2. * intersection + smooth) / (K.sum(y_true_f) + K.sum(y_pred_f) + smooth)
    return dice

def soft_dice_loss(y_true, y_pred):
    return 1 - dice_coef(y_true, y_pred)

You can see that we are using a parameter called smooth, which has a default value of 1. By adding it to both the numerator and the denominator, we ensure that a division by zero never occurs.

Accuracy Metric: Accuracy metrics tell us about the correctness of the generated segmentation maps. We will be using the Jaccard Index, a.k.a. Intersection over Union, to tell us how accurate the generated maps are. The numerator is the intersection between the predicted map and the ground truth label, while the denominator is the total area covered by both (calculated using the union operation). The following code snippet calculates the Jaccard Index.

import tensorflow as tf

def IoU(y_pred, y_true):
    I = tf.reduce_sum(y_pred * y_true, axis=(1, 2))
    U = tf.reduce_sum(y_pred + y_true, axis=(1, 2)) - I
    return tf.reduce_mean(I / U)

We compile the model using Adam as the optimizer. We will start with a learning rate of 0.00001 and we will set it to run for 100 epochs. We use Soft Dice Loss as the loss function and the Jaccard Index as the accuracy metric.
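Putting those settings together, the compile call might look like the sketch below. The dice loss and IoU metric mirror the snippets above (written with plain tf ops here so the block is self-contained), and the one-layer stand-in model is only a placeholder for the U-net built earlier:

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

def dice_coef(y_true, y_pred, smooth=1):
    y_true_f = tf.reshape(y_true, [-1])
    y_pred_f = tf.reshape(y_pred, [-1])
    intersection = tf.reduce_sum(y_true_f * y_pred_f)
    return (2. * intersection + smooth) / (
        tf.reduce_sum(y_true_f) + tf.reduce_sum(y_pred_f) + smooth)

def soft_dice_loss(y_true, y_pred):
    return 1 - dice_coef(y_true, y_pred)

def IoU(y_pred, y_true):
    I = tf.reduce_sum(y_pred * y_true, axis=(1, 2))
    U = tf.reduce_sum(y_pred + y_true, axis=(1, 2)) - I
    return tf.reduce_mean(I / U)

# Stand-in model; in the project this is the U-net.
inputs = layers.Input((256, 256, 3))
outputs = layers.Conv2D(1, 1, activation="sigmoid")(inputs)
model = Model(inputs, outputs)

# Adam with the 1e-5 starting learning rate from the text.
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-5),
              loss=soft_dice_loss,
              metrics=[IoU])
```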

ii. Callbacks

Functions that can be invoked during the training process are called callback functions. In this project, we will be using four callbacks:

  1. ModelCheckpoint: Monitors validation loss and saves the weights of the model with the lowest validation loss.
  2. EarlyStopping: Monitors the validation loss and halts the training process if the validation loss does not decrease for a certain number of epochs.
  3. ReduceLROnPlateau: Monitors Validation loss and reduces the learning rate if the validation loss doesn’t go lower after a certain number of epochs.
  4. TensorBoardColab: A special version of TensorBoard tailored to work on Google Colab. It lets us monitor the accuracy and other metrics during the training process.
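The first three callbacks could be configured as in the sketch below. The patience values, factor and filename are illustrative, not the project's exact settings; TensorBoardColab comes from a separate Colab-specific package and is omitted here:

```python
from tensorflow.keras.callbacks import (ModelCheckpoint, EarlyStopping,
                                        ReduceLROnPlateau)

callbacks = [
    # Save the weights whenever validation loss hits a new low.
    ModelCheckpoint("unet_best.h5", monitor="val_loss",
                    save_best_only=True, save_weights_only=True),
    # Halt training if validation loss hasn't improved for 10 epochs.
    EarlyStopping(monitor="val_loss", patience=10),
    # Halve the learning rate if validation loss plateaus for 5 epochs.
    ReduceLROnPlateau(monitor="val_loss", factor=0.5, patience=5),
]
```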

iii. Training the Model

We have done all the homework, and now it is time to fit the model. But before that, we will use train_test_split() to split the data into a train and a test set, containing 17,780 and 4,446 images respectively. Once the model starts training, you can maybe go for a run, because this is going to take some time. The good thing is that we won't have to babysit the model; you can come back to a trained model and exported weights.
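The split could be sketched as follows, with toy arrays standing in for the real data; an 80/20 split reproduces the 17,780/4,446 proportions:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-ins for the ~22,000 images and maps loaded from disk.
images = np.random.rand(20, 256, 256, 3).astype(np.float32)
maps = np.random.randint(0, 2, (20, 256, 256, 1)).astype(np.float32)

x_train, x_test, y_train, y_test = train_test_split(
    images, maps, test_size=0.2, random_state=42)

# Training would then be a single fit() call, e.g.:
# history = model.fit(x_train, y_train, validation_split=0.1,
#                     batch_size=16, epochs=100, callbacks=callbacks)
```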

The model ran for 57 epochs before EarlyStopping kicked in and halted the training process. The minimum validation loss was 0.2352. You can observe the trend in the validation and training metrics in the following graph.


5. Testing the model

Our test set contains 4,446 images, and our model can predict their segmentation maps in almost no time. The model's performance on the test set can be gauged using the Dice coefficient, which comes out to 0.59 (on a scale of 0 to 1). There certainly is room for improvement. You can observe a few of the predicted outputs in the following image.

Few Samples
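The test-set Dice score quoted above can be computed with a small NumPy helper; binarizing the sigmoid outputs at 0.5 is my assumption here, not necessarily the exact post-processing used in the project:

```python
import numpy as np

def dice_score(y_true, y_pred, smooth=1):
    """Dice coefficient between a binarized prediction and the ground truth."""
    y_pred = (y_pred > 0.5).astype(np.float32)  # sigmoid output -> {0, 1}
    intersection = np.sum(y_true * y_pred)
    return (2. * intersection + smooth) / (np.sum(y_true) + np.sum(y_pred) + smooth)

# Toy example: a prediction that matches the truth after binarization.
truth = np.array([[0., 1.], [1., 1.]])
pred = np.array([[0.1, 0.9], [0.8, 0.7]])
score = dice_score(truth, pred)
```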

On a second look, you will notice that our model can segment parts of roads that the annotators missed. In the following image, the square at the bottom right was skipped by the annotators, while our model was able to capture it. Our model also successfully segments driveways, parking lots and cul-de-sacs.

Our model was able to pick up the square region

6. Scopes of improvement

There were certain maps in which the roads weren't completely visible; look at the following example, where our model was not able to detect the road on the left. No model can churn out 100% accurate results, and there is always room for improvement.

Missing predictions

We can improve the performance of our model by taking certain measures, and they are as follows:

  1. Image Data Augmentation: This is the method of slightly distorting the images by applying various operations, like colour shifts and rotations, to generate more data.
  2. Use loss multipliers to deal with class imbalance: As mentioned earlier, we had a class imbalance problem, and to deal with it, we used Soft Dice Loss. Compared to it, Binary Cross-entropy has better-behaved gradients and is easier to optimise, so it would make a good proxy for our custom loss function. The only problem is that Binary Cross-entropy, unlike Soft Dice Loss, is not built to deal with class imbalance, and using it naively results in jet-black segmentation maps. However, if we apply class multipliers so that errors on the rarer road pixels are penalised more heavily, then we can use Binary Cross-entropy instead of Dice Loss, which results in a smoother training experience.
  3. Using Pretrained models: pre-trained models can be fine-tuned for this problem, and they will act as the best feature extractors. Using transfer learning results in faster training times, and often yields superior segmentation maps.
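Improvement 2 above could be prototyped with a pixel-weighted binary cross-entropy along these lines; the `road_weight` multiplier is hypothetical and would need to be tuned to the dataset's actual black-to-white pixel ratio:

```python
import tensorflow as tf

def weighted_bce(road_weight=10.0):
    """Binary cross-entropy that penalises errors on road pixels more.

    A road_weight > 1 counteracts the scarcity of white (road) pixels.
    """
    def loss(y_true, y_pred):
        eps = 1e-7
        y_pred = tf.clip_by_value(y_pred, eps, 1 - eps)
        # Element-wise cross-entropy, then a per-pixel weight map.
        bce = -(y_true * tf.math.log(y_pred)
                + (1 - y_true) * tf.math.log(1 - y_pred))
        weights = y_true * road_weight + (1.0 - y_true)
        return tf.reduce_mean(bce * weights)
    return loss

loss_fn = weighted_bce()
```

With this weighting, missing a road pixel costs the model more than an equally wrong prediction on a background pixel.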

7. Conclusion

In this project, we created a deep learning model that can segment roads in aerial images. We acquired the images and processed them to suit our needs. We created a U-net and learnt about its workings. We used Soft Dice Loss as our cost function and trained the model for 57 epochs. We then tested the model on the test set and observed a few samples.

A few takeaways from this project:

  1. Cropping images instead of resizing them preserves the spatial information.
  2. Binarizing the segmentation maps reduces the number of distinct values in the map to two.
  3. Using the ModelCheckpoint callback to save model weights is a good idea; if the program crashes during training, you can always reload the weights and resume.
  4. Finally, if you ever hit a dead-end, then Slav Ivanov has written a comprehensive article which will help you overcome any deep learning related roadblocks.

8. Links and References

This challenge was surely fun to work on, and thank you for reading through this article. If you have any feedback or questions, please feel free to type them out in the comments section below.


  1. Source Code.
  2. CS231n: Convolutional Neural Networks for Visual Recognition


[1]U-Net — Wikipedia

[2]Evaluating image segmentation models — Jeremy Jordan

Want to learn more? Check out a few of my other articles:

  1. Create a custom face recognition model and run it on your system.
  2. Build a live emotion recognition model.