Source: Deep Learning on Medium
Super-High-Resolution EagleView Image Segmentation with Mask-RCNN/DeepLabV3+ using Keras/ArcGIS Pro
Computer vision, a branch of machine learning, offers enormous opportunities for GIS. Its tasks include methods for acquiring, processing, analyzing, and understanding digital images, and for extracting high-dimensional data from the real world in order to produce numerical or symbolic information, e.g. in the form of decisions. Over the last several years, computer vision has been shifting from traditional statistical methods to state-of-the-art deep learning neural network techniques.
In this blog post, I will share several empirical practices for using Keras and ArcGIS Pro with deep learning and transfer learning techniques to build a building-footprint image segmentation model on super-high-resolution 3-inch EagleView (Pictometry) imagery.
We have seen ESRI and Microsoft, in collaboration with the Chesapeake Conservancy, train a deep neural network model to predict land cover from 1-meter-resolution NAIP aerial imagery. That work used a network similar in architecture to Ronneberger et al.'s U-Net (2015), a commonly used semantic segmentation model. Each year, the GIS Core group in Cobb County, Georgia receives 3-inch super-high-resolution ortho imagery from EagleView (Pictometry). Could similar techniques be applied to this super-high-resolution ortho imagery to classify land cover or building footprints? There are several challenges: super-high-resolution imagery usually presents a variety of overlapping vegetation types, and buildings and trees cast heavy shadows that can cause true ground objects to be misclassified.
In the beginning I was conservative, deciding to use a CPU-only laptop to train on roughly 3,800 images. Considering the complexity of land cover and building footprints, this is quite a small dataset for deep learning; textbooks often say that deep learning requires a huge amount of training data to perform well. But it is also a realistic classification problem: in real-world cases, even small-scale image data can be extremely hard and expensive to collect, or sometimes nearly impossible. Being able to train a powerful classifier on a small dataset is a key skill for a competent data scientist. After many trial runs, the results turned out very promising, especially with the state-of-the-art DeepLabV3+ and Mask R-CNN models.
Study Area and training image dataset preparation
The geographical area of Cobb County is covered by 433 1 x 1 mile Pictometry image tiles at 3-inch resolution. The Cobb County GIS group maintains a building footprint polygon layer. For training purposes, one image tile close to the center of the county was chosen for the ground-truth training dataset (fig. 1). The building footprint polygons were used as the ground-truth labels. The "Export Training Data for Deep Learning" geoprocessing tool in ArcGIS Pro 2.4 was used to export images and masks for the instance segmentation datasets (fig. 2). The output images are 512x512x3, and the rotation is set to 90 degrees to produce more images, which helps prevent overfitting and helps the model generalize better.
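Scripted with arcpy, the export step might look like the sketch below. This only runs inside an ArcGIS Pro Python environment, and all paths, layer names, and stride values here are illustrative assumptions, not the exact parameters used.

```python
# Sketch of the "Export Training Data For Deep Learning" step with arcpy
# (requires ArcGIS Pro; paths and layer names are illustrative).
import arcpy

arcpy.env.workspace = r"C:\data\cobb"

arcpy.ia.ExportTrainingDataForDeepLearning(
    in_raster="pictometry_tile.tif",        # one 1 x 1 mile 3-inch ortho tile
    out_folder=r"C:\data\cobb\chips",
    in_class_data="building_footprints",    # ground-truth polygon layer
    image_chip_format="TIFF",
    tile_size_x=512, tile_size_y=512,       # 512x512 output chips
    stride_x=256, stride_y=256,             # assumed overlap between chips
    metadata_format="RCNN_Masks",           # masks suitable for instance segmentation
    rotation_angle=90,                      # rotated copies to reduce overfitting
)
```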
1. Training with Mask-RCNN model
The resulting training dataset contained over 18,000 images and labels. After further processing to remove images with no labels, the final dataset had over 15,000 training images and labels. However, on a CPU-only laptop with 32 GB of memory, it is impossible to feed such a large dataset into Mask R-CNN, which requires a huge amount of memory for training.
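Dropping chips whose masks contain no labeled pixels can be done with a small NumPy pass; the in-memory lists here stand in for whatever file layout the export tool produced.

```python
import numpy as np

def filter_labeled_pairs(images, masks):
    """Keep only (image, mask) pairs whose mask contains at least one
    labeled (non-zero) pixel, discarding chips with no buildings."""
    keep = [i for i, m in enumerate(masks) if np.any(m)]
    return [images[i] for i in keep], [masks[i] for i in keep]

# Example: two chips, the first with no building pixels at all
imgs = [np.ones((4, 4, 3)), np.ones((4, 4, 3))]
msks = [np.zeros((4, 4), dtype=np.uint8), np.eye(4, dtype=np.uint8)]
kept_imgs, kept_msks = filter_labeled_pairs(imgs, msks)
# Only the second pair survives
```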
The training strategy was to test the proof of concept first, so I gradually increased the amount of data fed into the CNN, settling on a trial of 3,800 samples.
Mask R-CNN efficiently detects objects in an image while simultaneously generating a high-quality segmentation mask for each instance. The method extends Faster R-CNN by adding a branch for predicting an object mask in parallel with the existing branch for bounding box recognition (fig. 3). You can read the research paper to better understand the model.
Three main functions in the utils.Dataset class need to be overridden to load your own datasets into the framework; see below for the data-loading implementation. The anchor scales are set to (16, 32, 64, 128, 256) to predict smaller residential buildings, and IMAGES_PER_GPU is set to 1 so the CPU can be used to train the model (fig. 4). An example image and mask are shown in fig. 5.
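The Matterport-style framework expects load_mask to return an [H, W, N] boolean stack with one channel per building instance plus a class-ID vector. Assuming the exported mask chips encode instances as integer labels (0 = background), the conversion might look like this minimal sketch (file reading omitted):

```python
import numpy as np

def to_instance_masks(label_mask):
    """Convert an integer-labeled mask chip (0 = background, 1..N = building
    instances) into the [H, W, N] boolean stack plus class-id array that
    utils.Dataset.load_mask is expected to return."""
    instance_ids = np.unique(label_mask)
    instance_ids = instance_ids[instance_ids != 0]   # drop background
    if len(instance_ids):
        masks = np.stack([label_mask == i for i in instance_ids], axis=-1)
    else:
        masks = np.zeros(label_mask.shape + (0,), dtype=bool)
    # single "building" class -> class id 1 for every instance
    class_ids = np.ones(len(instance_ids), dtype=np.int32)
    return masks, class_ids

chip = np.array([[0, 1, 1],
                 [0, 0, 2],
                 [3, 0, 2]])
masks, class_ids = to_instance_masks(chip)
# three instances -> masks.shape == (3, 3, 3), class_ids == [1, 1, 1]
```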
Here, transfer learning was applied with a ResNet-101 backbone. I trained the last fully connected layers first for 5 epochs to adapt the model to the residential building class, then trained the full network for 35 epochs.
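With the Matterport Mask R-CNN API, this heads-then-full-network schedule might be sketched as below. It will not run outside that framework; the config object, paths, and learning-rate values are assumptions for illustration.

```python
# Sketch of the two-stage transfer learning schedule using the
# Matterport Mask R-CNN API (config, datasets, and paths are assumed).
from mrcnn import model as modellib

# config: a mrcnn Config subclass with BACKBONE = "resnet101",
# IMAGES_PER_GPU = 1, and anchor scales (16, 32, 64, 128, 256)
model = modellib.MaskRCNN(mode="training", config=config, model_dir="logs")

# Start from COCO weights, dropping the layers tied to COCO's class count
model.load_weights("mask_rcnn_coco.h5", by_name=True,
                   exclude=["mrcnn_class_logits", "mrcnn_bbox_fc",
                            "mrcnn_bbox", "mrcnn_mask"])

# Stage 1: train only the randomly initialized head layers (5 epochs)
model.train(dataset_train, dataset_val,
            learning_rate=config.LEARNING_RATE, epochs=5, layers="heads")

# Stage 2: fine-tune the whole network, up to epoch 35
model.train(dataset_train, dataset_val,
            learning_rate=config.LEARNING_RATE / 10, epochs=35, layers="all")
```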
On the 32-GB CPU-only laptop, it took nearly 48 hours to finish training (fig. 6 and fig. 7).
Here are two inferences on images that were not used in training (fig. 8 and fig. 9). Interestingly, the inferred mask delineates the building more accurately than the original mask.
Another interesting example of an original image and mask alongside the inference result (fig. 10 and fig. 11).
2. Training with Deeplabv3+ model
DeepLabV3+ is the latest state-of-the-art semantic image segmentation model, developed by the Google research team. Its distinctive feature is the use of atrous convolution, in cascade or in parallel, to capture multi-scale context by adopting multiple atrous rates (fig. 13).
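Atrous (dilated) convolution is just a convolution whose kernel taps are spaced `rate` pixels apart, enlarging the receptive field without adding parameters. A minimal single-channel NumPy version makes the idea concrete:

```python
import numpy as np

def atrous_conv2d(x, kernel, rate=1):
    """'Valid' 2-D atrous (dilated) convolution on a single-channel image.
    With rate > 1 the kernel samples pixels `rate` apart, widening the
    receptive field without extra parameters -- the core idea in DeepLab."""
    kh, kw = kernel.shape
    eh, ew = (kh - 1) * rate + 1, (kw - 1) * rate + 1  # effective kernel size
    H, W = x.shape
    out = np.zeros((H - eh + 1, W - ew + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = x[i:i + eh:rate, j:j + ew:rate]    # dilated sampling
            out[i, j] = np.sum(patch * kernel)
    return out

x = np.arange(36, dtype=float).reshape(6, 6)
k = np.ones((3, 3))
y1 = atrous_conv2d(x, k, rate=1)  # ordinary 3x3 footprint -> shape (4, 4)
y2 = atrous_conv2d(x, k, rate=2)  # same 9 weights, 5x5 footprint -> shape (2, 2)
```

Running several such convolutions in parallel at different rates (and pooling the results) is what DeepLab's atrous spatial pyramid pooling does to capture context at multiple scales.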
Using the same training datasets from the ArcGIS Export Training Data for Deep Learning tool, the images and masks were augmented and saved to a compressed HDF5 file for convenient loading into the training model (fig. 14).
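Saving the augmented chips as two compressed HDF5 datasets might look like the sketch below; the dataset names and shapes are illustrative (real chips are 512x512x3, reduced here for brevity).

```python
import numpy as np
import h5py

# Toy stand-ins for the augmented chips and masks
images = np.random.randint(0, 256, size=(4, 64, 64, 3), dtype=np.uint8)
masks = np.random.randint(0, 2, size=(4, 64, 64), dtype=np.uint8)

# One gzip-compressed HDF5 file avoids reading thousands of small TIFFs
# during training and can be sliced lazily without loading everything
with h5py.File("train_chips.h5", "w") as f:
    f.create_dataset("images", data=images, compression="gzip")
    f.create_dataset("masks", data=masks, compression="gzip")

with h5py.File("train_chips.h5", "r") as f:
    X = f["images"][:]
    y = f["masks"][:]
```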
I used a Keras implementation of DeepLabV3+ from GitHub. Below is the Keras training model with a MobileNetV2 backbone, which has fewer parameters than the Xception model (fig. 15).
With only 5 epochs of training, the results turned out very promising (fig. 16).
I used a Python script to run inference on an arbitrarily cropped 2064x1463 image, which corresponds to 16 512x512 inference rasters (fig. 19). On closer inspection of the images and inference, we can see that building shadows can lower the accuracy along the edges of the buildings.
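Splitting a large scene into 512x512 tiles, predicting each, and mosaicking the results back can be sketched as follows. The `predict` callable stands in for the trained model; the scene is zero-padded so the tile grid covers it exactly, then the padding is cropped off.

```python
import numpy as np

TILE = 512

def predict_large(image, predict, tile=TILE):
    """Run a tile-wise model over an arbitrarily sized image.
    `predict` maps a (tile, tile, 3) array to a (tile, tile) mask."""
    H, W = image.shape[:2]
    ph, pw = -H % tile, -W % tile                 # pad up to a full tile grid
    padded = np.pad(image, ((0, ph), (0, pw), (0, 0)))
    out = np.zeros(padded.shape[:2], dtype=np.uint8)
    for i in range(0, padded.shape[0], tile):
        for j in range(0, padded.shape[1], tile):
            out[i:i + tile, j:j + tile] = predict(padded[i:i + tile, j:j + tile])
    return out[:H, :W]                            # crop the padding back off

# Toy stand-in model: flag "building" wherever the red channel is bright
fake_predict = lambda t: (t[..., 0] > 127).astype(np.uint8)
scene = np.zeros((2064, 1463, 3), dtype=np.uint8)
scene[100:200, 100:200, 0] = 255
mask = predict_large(scene, fake_predict)
# mask.shape == (2064, 1463); exactly the bright square is flagged
```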
Using the same trained model to predict on a cropped image of the same area from 2019, the result is very similar, with minor localized differences (fig. 20). The model can really help with image inference in future years.
The above image and inference were then added to ArcGIS Pro (fig. 21).
The inference raster was converted to a polygon feature class, and then the Regularize Building Footprint tool in the ArcGIS Pro 3D Analyst toolbox was used with appropriate parameters to regularize the raw detections (fig. 22).
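Scripted, the vectorize-and-regularize step might look like this sketch; it requires ArcGIS Pro with the 3D Analyst extension, and the paths, method, and tolerance are illustrative assumptions.

```python
# Sketch of the vectorize-and-regularize step (requires ArcGIS Pro with
# 3D Analyst; paths and parameter values are illustrative).
import arcpy

# Vectorize the predicted building mask without simplifying edges
arcpy.conversion.RasterToPolygon("inference.tif", "buildings_raw",
                                 "NO_SIMPLIFY", "Value")

# Square up the raw detections into clean, rectilinear footprints
arcpy.ddd.RegularizeBuildingFootprint(
    in_features="buildings_raw",
    out_feature_class="buildings_regularized",
    method="RIGHT_ANGLES",      # most residential footprints are rectilinear
    tolerance=1.0,              # in map units; tune to the 3-inch resolution
)
```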
Then I ran inference on one complete 2018 tile image of 20,000 x 20,000 pixels; it took approximately one hour on the 32-GB-RAM, CPU-only laptop (fig. 23). There are misclassified buildings, mostly because of the very small training dataset and trees covering the tops of buildings. Choosing several representative tiles from different locations in the county as training data could improve the accuracy of the result.
Even with such a small dataset, the Mask R-CNN and DeepLabV3+ deep learning models both produce promising results for super-high-resolution image segmentation using transfer learning. Because of the limited accuracy of the original building-footprint ground-truth polygons, and the laptop's CPU and memory limitations, the performance may not surpass a human digitizer on some image classification and instance segmentation tasks. However, the accuracy of this deep learning training process can be further improved by adding high-quality training data from different locations in the county and by applying data augmentation methods. The model can be used on multi-year imagery to infer feature detections for comparison, or even for low-cost feature delineation with ArcGIS ModelBuilder to automate business tasks. More importantly, the training process described above can be applied to other instance or semantic segmentation problems.
6. DeepLabV3+ model: https://github.com/tensorflow/models/tree/master/research/deeplab
11. U-Net: Convolutional Networks for Biomedical Image Segmentation: https://lmb.informatik.uni-freiburg.de/people/ronneber/u-net/