EagleView Super-High-Resolution Image Segmentation with Deeplabv3+/Mask-RCNN using Keras/ArcGIS

Source: Deep Learning on Medium


Computer vision powered by machine learning offers enormous opportunities for GIS. Its tasks include methods for acquiring, processing, analyzing and understanding digital images, and extracting high-dimensional data from the real world in order to produce numerical or symbolic information, e.g. in the form of decisions.[1][2][3][4] Over the last several years, computer vision has increasingly shifted from traditional statistical methods to state-of-the-art deep learning neural network techniques.

In this blog post, I will share several empirical practices for using Keras and ArcGIS Pro tools with deep learning and transfer learning techniques to build a building footprint image segmentation model from super-high-resolution 3-inch EagleView (Pictometry) imagery.

We have seen Esri and Microsoft collaborate with the Chesapeake Conservancy to train a deep neural network model that predicts land cover from 1-meter-resolution NAIP aerial imagery. That work used a neural network similar in architecture to Ronneberger et al.’s U-Net (2015), a commonly used semantic segmentation model. Each year, the GIS Core group in Cobb County, Georgia receives 3-inch super-high-resolution ortho imagery from EagleView (Pictometry). Could similar techniques be applied to this imagery to classify land cover or building footprints? There are several challenges: super-high-resolution imagery usually shows a wide variety of vegetation types and overlaps, and buildings and trees cast heavy shadows that could cause the true ground objects to be misclassified.

In the beginning I was very conservative: I decided to use a CPU-only laptop to train on roughly 3800 images. Considering the complexity of land cover and building footprints, this is quite a small dataset for deep learning; textbooks often say that deep learning requires a huge amount of training data for good performance. But it is also a realistic classification problem: in real-world cases, even small-scale image data can be extremely hard and expensive to collect, or sometimes almost impossible. Being able to train a powerful classifier on a small dataset is a key skill for a competent data scientist. After many tries and runs, the results turned out to be very promising, especially with the state-of-the-art Deeplabv3+ and Mask-RCNN models.

Study Area and training image dataset preparation

Fig. 1: Cobb County 2018 3-inch EagleView imagery, covered by 433 tiles of 1×1 mile.

Cobb County is covered by 433 Pictometry image tiles of 1 x 1 mile at 3-inch resolution. The Cobb County GIS group maintains a building footprint polygon layer. For training, one image tile close to the center of the county was chosen as the source of the ground truth training dataset (fig. 1), and the building footprint polygons were used as the ground truth labels. The “Export Training Data for Deep Learning” geoprocessing tool in ArcGIS Pro 2.4 was used to export images and masks for the instance segmentation datasets (fig. 2). The output images are 512x512x3, and the rotation angle was set to 90 degrees to produce more images, which helps prevent overfitting and helps the model generalize better.
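
To make the rotation idea concrete, here is a minimal NumPy sketch. It is not the ArcGIS tool's internals: it assumes a chip is an array of shape (512, 512, 3) with a matching single-band mask, and the function names `rotate_pair` and `augment_with_rotations` are made up for this illustration. The key point is that an image and its mask must be rotated together so the labels stay aligned.

```python
import numpy as np


def rotate_pair(image, mask, quarter_turns):
    """Rotate an image chip and its mask by the same multiple of 90 degrees."""
    # Rotating in the (row, column) plane leaves the channel axis untouched.
    rotated_image = np.rot90(image, k=quarter_turns, axes=(0, 1))
    rotated_mask = np.rot90(mask, k=quarter_turns, axes=(0, 1))
    return rotated_image, rotated_mask


def augment_with_rotations(image, mask):
    """Return the original pair plus its 90, 180 and 270 degree rotations."""
    return [rotate_pair(image, mask, k) for k in range(4)]


# Toy stand-ins for one exported 512x512x3 chip and its building mask.
rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(512, 512, 3), dtype=np.uint8)
mask = np.zeros((512, 512), dtype=np.uint8)
mask[100:200, 50:120] = 1  # one rectangular "building"

pairs = augment_with_rotations(image, mask)
print(len(pairs), pairs[1][0].shape, pairs[1][1].shape)
```

Each input chip yields four training pairs, and the number of building pixels in every rotated mask is unchanged.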

Fig. 2: ArcGIS “Export Training Data for Deep Learning” tool.

1. Training with Mask-RCNN model

The resulting training datasets contain over 18000 images and labels. After further processing to remove images without labels, the final datasets had over 15000 training images and labels. However, on a CPU-only laptop with 32 GB of memory, it is impossible to feed such a large dataset into the Mask-RCNN model, which requires a huge amount of memory for training.
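
Removing chips without labels can be sketched as below. This is an illustration rather than the script actually used: it assumes background pixels in a mask are 0 and building pixels are greater than 0, and `has_label` and `drop_unlabeled` are hypothetical helper names.

```python
import numpy as np


def has_label(mask):
    """True when at least one pixel in the mask belongs to a building."""
    # Assumption: 0 is background, anything above 0 is a labeled pixel.
    return bool(np.any(mask > 0))


def drop_unlabeled(images, masks):
    """Keep only the (image, mask) pairs whose mask is not empty."""
    kept = [(img, msk) for img, msk in zip(images, masks) if has_label(msk)]
    kept_images = [img for img, _ in kept]
    kept_masks = [msk for _, msk in kept]
    return kept_images, kept_masks


# Toy data: three chips, the second one has no building in it.
empty = np.zeros((512, 512), dtype=np.uint8)
labeled = empty.copy()
labeled[10:20, 10:20] = 1
images = [np.zeros((512, 512, 3), dtype=np.uint8) for _ in range(3)]
masks = [labeled, empty, labeled]

kept_images, kept_masks = drop_unlabeled(images, masks)
print(len(kept_images))  # prints 2
```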

The training strategy was to see whether the proof of concept would work, so I gradually increased the amount of data fed into the CNN, up to a trial with 3800 images.

I used the impressive open-source Mask-RCNN implementation that Matterport published on GitHub here to train the model.

Mask-RCNN efficiently detects objects in an image while simultaneously generating a high-quality segmentation mask for each instance. The method extends Faster R-CNN by adding a branch for predicting an object mask in parallel with the existing branch for bounding box recognition[5]. You can read the research paper to better understand the model (fig. 3).

Fig. 3: Mask R-CNN framework for instance segmentation. Source: https://arxiv.org/abs/1703.06870

Three main functions in the class (utils.Dataset) need to be modified to load your own datasets into the framework. See below for the data loading implementation. The anchor scales are set to (16,32,64,128,256) so that smaller residential buildings can be predicted. IMAGES_PER_GPU is set to 1 so that the CPU can be used to train the model (fig. 4). An example image and mask are shown in fig. 5.

Fig. 4: Loading the Cobb Pictometry datasets into the Mask-RCNN framework.
Fig. 5: An example of a random image and mask from the datasets.
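
In the Matterport library, the `load_mask` method of a `Dataset` subclass is expected to return a boolean array of shape [height, width, instance count] together with a 1-D array of class IDs. The sketch below shows one way to build those two arrays from a label raster with NumPy alone. It assumes, without confirmation from the export format, that background pixels are 0 and each building has its own positive integer ID; `label_raster_to_instances` and `BUILDING_CLASS_ID` are hypothetical names, and this is not the loader shown in fig. 4.

```python
import numpy as np

BUILDING_CLASS_ID = 1  # hypothetical: a single "building" class


def label_raster_to_instances(label):
    """Split an instance label raster into a boolean mask stack.

    Assumes background pixels are 0 and every building carries its own
    positive integer ID. Returns masks of shape (H, W, N) and class_ids
    of shape (N,), where N is the number of buildings in the chip.
    """
    instance_ids = np.unique(label)
    instance_ids = instance_ids[instance_ids > 0]
    if instance_ids.size == 0:
        # No buildings: return an empty stack with the right leading dims.
        masks = np.zeros(label.shape + (0,), dtype=bool)
    else:
        masks = np.stack([label == i for i in instance_ids], axis=-1)
    class_ids = np.full(instance_ids.size, BUILDING_CLASS_ID, dtype=np.int32)
    return masks, class_ids


# Toy 8x8 label raster with two buildings.
label = np.zeros((8, 8), dtype=np.int32)
label[1:3, 1:4] = 1
label[5:8, 4:6] = 2

masks, class_ids = label_raster_to_instances(label)
print(masks.shape, class_ids)
```

A `load_mask` override could then read the label chip for the requested image and return the result of such a helper.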

Here, transfer learning was applied with a ResNet-101 model backbone. I first trained the last fully connected layers for 5 epochs to adapt them to the residential building class, then trained the full network for 35 epochs.

On a CPU-only machine with 32 GB of memory, it took nearly 48 hours to finish the training process (fig. 6 and fig. 7).

Fig. 6: Model training result. The loss is reasonably good.
Fig. 7: Loss charts.

Here are two inference results on images that were not used in training (fig. 8 and fig. 9). Interestingly, the inference mask delineates the building more accurately than the original mask.

Fig. 8: Original image that was not used in training.
Fig. 9: The inference mask delineates the building more accurately than the original mask.

Another interesting example shows an original image and mask together with the inference result (fig. 10 and fig. 11).

Fig. 10: Original image that was not used in training.
Fig. 11: Inference instance masks.
Fig. 12: A cropped image, not used in training, with its inference mask. The orange line indicates where the image was cropped. With 3000 training images, the result is very promising.

2. Training with Deeplabv3+ model

Deeplabv3+ is the latest state-of-the-art semantic image segmentation model developed by the Google research team. Its distinctive feature is the use of atrous convolution, in cascade or in parallel, to capture multi-scale context by adopting multiple atrous rates (fig. 13).
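
The effect of the atrous rate is easiest to see in one dimension. The following is a toy NumPy sketch, not code from Deeplabv3+: the same three-tap kernel is applied with rates 1, 2 and 4, so each output value covers 3, 5 and 9 input samples without adding any weights. `atrous_conv1d` is a name invented for this example.

```python
import numpy as np


def atrous_conv1d(signal, kernel, rate):
    """'Valid' 1-D atrous (dilated) convolution, written as cross-correlation.

    output[i] = sum over k of signal[i + rate * k] * kernel[k]
    """
    signal = np.asarray(signal, dtype=float)
    kernel = np.asarray(kernel, dtype=float)
    # Number of input samples spanned by one output value.
    span = rate * (len(kernel) - 1) + 1
    n_out = len(signal) - span + 1
    if n_out <= 0:
        raise ValueError("signal is shorter than the dilated kernel")
    out = np.empty(n_out)
    for i in range(n_out):
        taps = signal[i:i + span:rate]  # every rate-th sample in the span
        out[i] = np.dot(taps, kernel)
    return out


signal = np.arange(10)
kernel = [1, 1, 1]
outputs = {rate: atrous_conv1d(signal, kernel, rate) for rate in (1, 2, 4)}
for rate, values in outputs.items():
    print(rate, values)
```

With rate 4 the first output is 0 + 4 + 8 = 12: three weights, but a context of nine samples. Running several rates side by side is what gives the multi-scale context described above.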

Fig. 13 (source: https://ai.googleblog.com/2018/03/semantic-image-segmentation-with.html)

With the same training datasets exported by the ArcGIS “Export Training Data for Deep Learning” tool, the images and masks were augmented and saved to a compressed HDF5 file for convenient loading into the training model (fig. 14).

Fig. 14: A random example of an image and mask.

I used the Keras implementation of Deeplabv3+ on GitHub here. Below is the Keras training model with a MobileNetV2 backbone, which has fewer parameters than the Xception model (fig. 15).

Fig. 15: Defining the Deeplabv3+ model.
Fig. 16: Training result after five epochs.

With only 5 training epochs, the result turned out to be very promising (fig. 16).

Fig. 17: Training loss convergence plot.
Fig. 18: Image and inference from the Deeplabv3+ model trained on 2018 imagery.

I used a Python script to run inference on an arbitrarily cropped 2064×1463 image, which equals 16 inference rasters of 512×512 (fig. 19). On closer inspection of the images and inference, we can see that building shadows can lower the accuracy along the edges of the buildings.
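
A tiled inference script of this kind might look roughly like the sketch below. It is not the script used here: `predict_tiled` is a hypothetical helper, `brightness_threshold` is a stand-in for the trained network, and zero padding of the right and bottom edges with non-overlapping tiles is an assumption.

```python
import numpy as np

TILE = 512


def predict_tiled(image, predict_tile, tile=TILE):
    """Run predict_tile on tile x tile windows and stitch the results.

    image: array of shape (H, W, C).
    predict_tile: callable mapping a (tile, tile, C) array to a
    (tile, tile) mask; a placeholder for the real model.
    """
    height, width = image.shape[:2]
    # Pad the bottom and right edges up to the next multiple of the tile size.
    pad_h = (-height) % tile
    pad_w = (-width) % tile
    padded = np.pad(image, ((0, pad_h), (0, pad_w), (0, 0)), mode="constant")
    result = np.zeros(padded.shape[:2], dtype=np.uint8)
    for top in range(0, padded.shape[0], tile):
        for left in range(0, padded.shape[1], tile):
            window = padded[top:top + tile, left:left + tile]
            result[top:top + tile, left:left + tile] = predict_tile(window)
    # Crop the padding away so the mask matches the input image.
    return result[:height, :width]


def brightness_threshold(window):
    """Stand-in for a trained network: mark bright pixels as 'building'."""
    return (window.mean(axis=-1) > 127).astype(np.uint8)


# Toy image, 2064 pixels wide and 1463 high, with one bright square
# that straddles tile boundaries in both directions.
image = np.zeros((1463, 2064, 3), dtype=np.uint8)
image[400:700, 400:700] = 255

mask = predict_tiled(image, brightness_threshold)
print(mask.shape, int(mask.sum()))
```

With a real model, `predict_tile` would wrap the network's prediction call for a single 512×512 chip.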

Fig. 19: A cropped 2018 EagleView image with inference raster (25 rasters of 512×512).

Using the same trained model to predict a cropped 2019 image of the same area, the result is very similar, with minor localized differences (fig. 20). The model can really help with inference on imagery from future years.

Fig. 20: A cropped 2019 EagleView image with inference raster (25 rasters of 512×512).
Fig. 21: The 2018 inference raster added to ArcGIS Pro with the original background image.

The above image and inference raster were then added to ArcGIS Pro (fig. 21).

The inference raster was converted to a polygon feature class, and then the Regularize Building Footprint tool in the ArcGIS Pro 3D Analyst toolbox was used with appropriate parameters to regularize the raw detections (fig. 22).

Fig. 22: Using the ArcGIS Pro Regularize Building Footprint tool to clean up building polygons.

Then I tried inference on one complete 2018 image tile of 20,000 x 20,000 pixels. It took approximately one hour to finish on a CPU-only laptop with 32 GB of RAM (fig. 23). Some buildings were missed or misclassified, mostly because of the very small training dataset and trees covering the tops of buildings. Choosing several representative tiles from different locations in the county as training data could improve the accuracy of the result.

Fig. 23: One complete 2018 EagleView image tile with inference raster (1600 rasters of 512×512).
Fig. 24: Inference overlaid on the image tile in ArcGIS Pro.
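
The raster count in fig. 23 follows from simple arithmetic, assuming non-overlapping 512×512 tiles with the last row and column padded to full size (`tile_grid` is a hypothetical helper):

```python
import math


def tile_grid(width, height, tile=512):
    """Rows, columns and total tile count needed to cover an image."""
    cols = math.ceil(width / tile)   # the last column is only partly filled
    rows = math.ceil(height / tile)  # the last row is only partly filled
    return rows, cols, rows * cols


rows, cols, total = tile_grid(20_000, 20_000)
print(rows, cols, total)  # prints 40 40 1600
```

20,000 / 512 is about 39.06, so 40 tiles are needed per side, and 40 × 40 gives the 1600 rasters.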


With such a small dataset, both the Mask-RCNN and Deeplabv3+ deep learning models produce promising results for super-high-resolution image segmentation using transfer learning. Because the original building footprint ground truth polygons are not very accurate and the laptop CPU and memory are limited, the results may not surpass a human digitizer in some image classification and instance segmentation cases. However, the accuracy of this deep learning training process can be further improved by adding high-quality training data from different locations in the county and by applying data augmentation methods. The model can be used on multi-year imagery to detect features for comparison, or even for low-cost feature delineation, with ArcGIS ModelBuilder automating the business tasks. More importantly, the deep learning training process described above can be applied to other types of image instance or semantic segmentation cases.

1. Reinhard Klette (2014). Concise Computer Vision. Springer. ISBN 978-1-4471-6320-6.

2. Linda G. Shapiro; George C. Stockman (2001). Computer Vision. Prentice Hall. ISBN 978-0-13-030796-5.

3. Tim Morris (2004). Computer Vision and Image Processing. Palgrave Macmillan. ISBN 978-0-333-99451-1.

4. Bernd Jähne; Horst Haußecker (2000). Computer Vision and Applications, A Guide for Students and Practitioners. Academic Press. ISBN 978-0-13-085198-7.

5. Kaiming He, Georgia Gkioxari, Piotr Dollár, Ross Girshick (2018). Mask R-CNN. https://arxiv.org/abs/1703.06870v3

6. Deeplabv3+ model, https://github.com/tensorflow/models/tree/master/research/deeplab

7. https://blog.keras.io/building-powerful-image-classification-models-using-very-little-data.html

8. http://pro.arcgis.com/en/pro-app/tool-reference/image-analyst/export-training-data-for-deep-learning.htm

9. https://pro.arcgis.com/en/pro-app/tool-reference/3d-analyst/regularize-building-footprint.htm


11. U-Net: Convolutional Networks for Biomedical Image Segmentation: https://lmb.informatik.uni-freiburg.de/people/ronneber/u-net/