Urban water inlet feature detection from EagleView aerial imagery using Mask R-CNN/Keras/ArcGIS

Source: Deep Learning on Medium

In my last blog post, I talked about segmenting building footprints from EagleView super-high-resolution imagery using the Mask R-CNN and DeepLabV3+ models. Although the output from both deep learning models was generally promising, it still fell short of a usable accuracy threshold. Why is that? Because the GIS data used for ground truth labels had not been validated by anyone, many labels were either inaccurate or missed the true object delineation, which could confuse the model classifier. That made me curious: if I verified each ground truth label and delineated each feature as accurately as possible, would the output be better and more promising? What would the Precision and Recall of the trained model's predictions look like? And could the model's predictions be used in practical work to spare human digitizers a time-consuming job?

This time I chose water inlets as the target for a Mask R-CNN model. Training a model to find water inlet features presents several challenges. Water inlets generally follow an oval or triangular pattern, but their color varies widely after years of environmental weathering, and they are sometimes hard to detect even with human eyes on aerial images. The model has to learn not to be fooled by road and house features, which can trace superficially similar patterns and shapes.

To run the experiment and better understand the model output, I personally identified 226 water inlet features on one 1 x 1 mile tile of EagleView (Pictometry) 3-inch resolution aerial imagery (fig. 1 and fig. 2) and digitized each as a polygon feature in ArcGIS software from ESRI to serve as ground truth labels. After training a deep neural network with Mask R-CNN to identify water inlet features from this tile, a 1 x 1 mile tile from a different location was fed to the trained model to see whether it could spot water inlet features that the digitizer had missed. The result turned out well: the model discovered water inlet features that the human digitizer had either not noticed or missed.

fig. 1. The red-outlined tile image was used as training data and the blue-outlined tile image for inference.
fig.2 Zoom-in tile image 187450 of dimension 20,000 x 20,000 x3 RGB bands used for training.

Training with Mask-RCNN model:

Mask R-CNN is a state-of-the-art deep neural network for instance segmentation problems in machine learning. It efficiently detects objects in an image while simultaneously generating a high-quality segmentation mask for each instance. The method extends Faster R-CNN by adding a branch for predicting an object mask in parallel with the existing branch for bounding box recognition (fig. 3). You can read the research paper to better understand the model.

fig. 3. Mask R-CNN framework for instance segmentation. Source: https://arxiv.org/abs/1703.06870

The “Export Training Data for Deep Learning” tool in ArcGIS Pro was used to export the tile image and the water inlet polygon features as an instance segmentation dataset (fig. 4). The output images are 512 x 512 x 3, and the rotation angle is set to 90 degrees to generate additional images, which helps prevent overfitting and helps the model generalize better. Traditional image augmentation methods can also be applied to generate more simulated images; see the imgaug Python library. For this experiment, I skipped those augmentation techniques because rotating the cropped images already added some augmented samples to the training dataset.

fig. 4 Water inlet polygon features (red) overlay the image used for generating training datasets
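The effect of the 90-degree rotation setting can be pictured with a small sketch. It is only an illustration in plain Python, not what ArcGIS Pro runs internally; the helper names `rotate90` and `rotations` are hypothetical, and a chip and its label mask are represented as nested lists rather than rasters.

```python
def rotate90(grid):
    """Rotate a 2D grid (a list of rows) 90 degrees clockwise."""
    return [list(row) for row in zip(*grid[::-1])]


def rotations(chip, mask):
    """Yield the chip/mask pair at 0, 90, 180 and 270 degrees.

    The image and its label mask are rotated together so that every
    labelled pixel still lines up with the feature it marks.
    """
    for _ in range(4):
        yield chip, mask
        chip, mask = rotate90(chip), rotate90(mask)


# Toy 2 x 2 "chip" and a mask marking the top-right pixel as an inlet.
chip = [[10, 20],
        [30, 40]]
mask = [[0, 1],
        [0, 0]]

augmented = list(rotations(chip, mask))
```

One labelled chip becomes four training samples this way, which is why rotation alone already acts as a mild form of augmentation.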

The resulting training dataset contained over 3,400 images and labels. After further processing to remove images without labels, the final dataset had over 3,200 training images and labels. I used a CPU-only Dell notebook with 32 GB of RAM to train the model. This is quite a small dataset for training a deep neural network to extract an urban feature from aerial images, so I applied transfer learning and started from a weights file that had been trained on the ImageNet dataset. Although the ImageNet dataset does not include a water inlet class, it includes many other kinds of images, so the trained weights have already learned many of the features common in natural images, which helps in training the model.
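The step of removing images without labels could look roughly like the following. This is a sketch under the assumption that each exported chip comes with a label mask in which nonzero pixels mark an inlet; `has_label` and `drop_unlabeled` are hypothetical helper names, and masks are plain nested lists here rather than raster files.

```python
def has_label(mask):
    """Return True if at least one pixel in the mask is nonzero."""
    return any(pixel != 0 for row in mask for pixel in row)


def drop_unlabeled(samples):
    """Keep only the (name, mask) pairs whose mask marks a feature."""
    return [(name, mask) for name, mask in samples if has_label(mask)]


# Toy example: the first chip is pure background, the second has one inlet pixel.
samples = [
    ("chip_0001", [[0, 0], [0, 0]]),
    ("chip_0002", [[0, 1], [0, 0]]),
]
kept = drop_unlabeled(samples)
```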

I used the impressive open-source Mask R-CNN implementation that Matterport published on GitHub (https://github.com/matterport/Mask_RCNN) to train the model.

Three main functions need to be modified in the dataset class (utils.Dataset) to load your own datasets into the framework; see the data loading implementation below. The anchor scales are set to (16, 32, 64, 128, 256) so that smaller objects can be predicted. IMAGES_PER_GPU is set to 1 so that a CPU can be used to train the model. An example of an image and its mask is shown in fig. 5.

fig 5 load training images and labels to the model
fig. 5 a random image and label mask for training
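The settings mentioned above might be collected as in the sketch below. The attribute names follow the Matterport `Config` class, but the class is written as a plain Python class so that it runs without the library installed. The experiment name, the class count of two (background plus water inlet), and the `batch_size` helper are assumptions for illustration.

```python
class InletConfig:
    """Settings for the water inlet experiment.

    With the Matterport library installed, this class would subclass
    mrcnn.config.Config; it is a plain class here so the sketch runs alone.
    """
    NAME = "water_inlet"      # assumed experiment name
    GPU_COUNT = 1
    IMAGES_PER_GPU = 1        # one image per step, so training fits on a CPU
    NUM_CLASSES = 1 + 1       # assumed: background + water inlet
    BACKBONE = "resnet101"    # ResNet-101 backbone
    # Smaller anchor scales than the library default, for small objects.
    RPN_ANCHOR_SCALES = (16, 32, 64, 128, 256)


def batch_size(config):
    """Effective batch size: images per device times number of devices."""
    return config.IMAGES_PER_GPU * config.GPU_COUNT


config = InletConfig()
```

With both values set to 1 the effective batch size is a single image, which keeps memory use low at the cost of slower, noisier training.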

Transfer learning was applied with a ResNet-101 model backbone. I first trained the head layers (the last layers of the network) for 5 epochs to adapt them to the water inlet feature class, then trained the full network for 35 epochs (fig. 6).

fig.6 training head layers to adapt the water inlet features
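In the Matterport implementation, the `epochs` argument of `model.train()` is the epoch number at which training stops, not the number of additional epochs, so a two-stage schedule has to be expressed cumulatively. The sketch below shows that bookkeeping. It assumes the 35 full-network epochs came on top of the 5 head epochs, which the text does not state explicitly, and `cumulative_schedule` is a hypothetical helper.

```python
def cumulative_schedule(stages):
    """Convert (layers, extra_epochs) stages into (layers, stop_epoch) pairs."""
    schedule, done = [], 0
    for layers, extra_epochs in stages:
        done += extra_epochs
        schedule.append((layers, done))
    return schedule


# Stage 1: head layers only; stage 2: the whole network.
stages = [("heads", 5), ("all", 35)]
schedule = cumulative_schedule(stages)

# Each pair would then be passed to the library, for example:
#   model.train(dataset_train, dataset_val,
#               learning_rate=config.LEARNING_RATE,
#               epochs=stop_epoch, layers=layers)
```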

On the CPU-only notebook with 32 GB of RAM, the training process took over 5 days (fig. 7). The charts show that the validation loss plateaued after a certain number of epochs (fig. 8). This is a common sign of overfitting when the training dataset is small.

fig.7 training full network layers over 5 days
fig. 8 training loss charts

Here is a test image that was not in the training set; it contains one water inlet feature that was not captured in the digitized polygons (fig. 9).

fig. 9 a test image and label with missing detection by the human digitizer

Here is the water inlet inference on the above image for comparison. Interestingly, the missing water inlet feature was detected by inference (fig. 10).

fig.10 the missing water inlet feature was identified with inference from the above image.

Here is another example of a test image with its inference (fig. 11 and fig. 12).

fig. 11 a test image
fig.12 the inference from the above image.

Then I ran Python inference scripts on one complete 2019 tile image of 20,000 x 20,000 pixels, located 3 miles away from the tile image used in training (fig. 13). The scripts crop and process 1,600 images (512 x 512 x 3) for the inference. The process took approximately one hour on the CPU-only laptop with 32 GB of RAM. Some water inlet delineations were missed, mostly because the training dataset was very small and because trees cover some of the water inlet features. Choosing several representative tiles from different locations in the County as training datasets could improve the accuracy of the result.

fig.13 a complete tile image inference which equals 1600 (512x512x1) cropped rasters
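The count of 1,600 crops follows from the tile and chip sizes: a 20,000-pixel side needs 40 chips of 512 pixels, and 40 x 40 = 1,600. The sketch below reproduces that arithmetic. It is not the inference script used here; `tile_windows` is a hypothetical helper that only computes the crop windows.

```python
import math


def tile_windows(width, height, tile=512):
    """Return (col_off, row_off, w, h) windows that cover the whole image."""
    windows = []
    for row_off in range(0, height, tile):
        for col_off in range(0, width, tile):
            # Windows on the right and bottom edges may be narrower than `tile`.
            w = min(tile, width - col_off)
            h = min(tile, height - row_off)
            windows.append((col_off, row_off, w, h))
    return windows


windows = tile_windows(20_000, 20_000)
tiles_per_side = math.ceil(20_000 / 512)
```

Because 20,000 is not a multiple of 512, the last row and column of windows are only 32 pixels wide and would need padding (or an overlapping window) before being passed to a model that expects 512 x 512 input.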

I would like to use the Precision and Recall statistics to measure output performance. For those of you not familiar with the terms, here are the basics:

Precision: the fraction of true positive instances among all instances that the model classified as positive.

Recall: the fraction of all actual positive instances that the model retrieved.

With the detection confidence threshold set to ≥0.9, the model selected 241 candidates within the 1-square-mile inference tile. I then reviewed the candidates manually and confirmed 172 of them in the aerial imagery. The Precision is 71.37% and the Recall is 70.78%. When the confidence threshold is set to ≥0.75, Precision drops to 53.98% but Recall rises to 78.11%, which means more water inlets are correctly identified, along with many more incorrectly detected ones. Depending on the project, we would more likely maximize Recall to retrieve more water inlet features and then quickly eliminate the false positives manually. Below is a zoomed-in view of the inferred water inlet features overlaid on the aerial imagery (fig. 14).

fig. 14 a false positive example and positive water inlet features
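The Precision figure at the 0.9 threshold can be reproduced from the two counts given above. The number of inlets actually present in the tile is not stated, so the sketch back-calculates it from the reported Recall; treat that total (and the resulting count of missed inlets) as an inference, not as a figure from the experiment.

```python
def precision(true_pos, false_pos):
    """Fraction of the model's detections that are real inlets."""
    return true_pos / (true_pos + false_pos)


def recall(true_pos, false_neg):
    """Fraction of the real inlets that the model detected."""
    return true_pos / (true_pos + false_neg)


candidates = 241                           # detections with confidence >= 0.9
confirmed = 172                            # detections confirmed on the imagery
false_positives = candidates - confirmed   # 69 rejected candidates

p = precision(confirmed, false_positives)

# Inferred, not stated: the total implied by the reported 70.78% Recall.
implied_total = round(confirmed / 0.7078)
missed = implied_total - confirmed
r = recall(confirmed, missed)

print(f"Precision: {p:.2%}")
print(f"Recall:    {r:.2%} (assuming {implied_total} inlets in the tile)")
```

Lowering the threshold adds detections, which can only keep or raise the count of true positives (higher Recall) while letting in more false positives (lower Precision). That is the trade-off seen between the 0.9 and 0.75 settings.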


Although the dataset is relatively small, Mask R-CNN delivers promising results for super-high-resolution image segmentation when combined with transfer learning. Performance could be further improved by adding high-quality training data from different locations in the county and by applying data augmentation methods. The model can also be run on multi-year imagery to compare detected features over time, or used for low-cost feature delineation with the ArcGIS ModelBuilder tool to automate business tasks. More importantly, the deep learning training process above can be applied to other kinds of instance or image segmentation problems. To sum up, I should mention an interesting fact: when I compared the results with the polygon features that the vendor digitized on the inference tile, the model had delineated 67 more water inlet features than the human digitizer. This convinces me that if we choose the right deep learning model and apply it properly, machine learning or deep learning can be used in practical real-world projects and save time on laborious digitizing jobs.

1. Kaiming He, Georgia Gkioxari, Piotr Dollár, Ross Girshick (2018). Mask R-CNN. https://arxiv.org/abs/1703.06870v3

2. Mask R-CNN https://github.com/matterport/Mask_RCNN

3. https://pro.arcgis.com/en/pro-app/tool-reference/3d-analyst/regularize-building-footprint.htm

4. U-Net: Convolutional Networks for Biomedical Image Segmentation. https://lmb.informatik.uni-freiburg.de/people/ronneber/u-net/