U-net engineering.

Source: Deep Learning on Medium

TPE Hyper-Parameter Optimization. Semantic Segmentation. Building Footprint Extraction.

TL;DR

Engineering a U-net CNN using TensorFlow's TensorBoard and TPE hyper-parameter optimization, and adding a custom layer to increase the CNN's predictive capacity. All code is available in this tiny project.

Outputs of baseline and improved models on a test image

Motivation

Developing a neural net architecture is an art. Sometimes it feels like a game of Blind Man's Buff. Although there is already a number of state-of-the-art neural net designs for a decent number of problems, each problem is unique, and no theory tells you in advance exactly what should be used to achieve an optimal solution. In this post I want to share my thoughts on, and experience with, techniques available for tackling this problem in a systematic way. I utilize two approaches: analyzing weight distributions, and hyper-parameter optimization with the TPE algorithm. Based on the weight analysis, I add a custom layer to increase generalizing ability where I assume it is lacking. I apply these methods to semantic segmentation of satellite imagery (building footprint extraction).

Area of Interest

As a showcase I use SpaceNet Challenge Las Vegas data. As a baseline model I choose the U-net-like CNN proposed in this Microsoft Azure blog post. I make changes to its design and compare the validation-set loss and the visual masks of the baseline and improved architectures after the same number of learning iterations. The improved architecture treats the number of convolutional filters in the added custom layer as a hyper-parameter, which is optimized by the Tree-structured Parzen Estimator (hyperopt library).

Improved U-net design

Weights distribution analysis

Neural nets are commonly thought of as black boxes. However, there is research that tries to shed some light on their interiors. Some of it tries to relate parameter distributions to extrapolation ability (for instance https://arxiv.org/pdf/1504.08291.pdf). There is an idea that a NN whose weights follow a Gaussian-like distribution might perform better. Though this statement is not strictly proven, my experience and intuition lead me to speculate that models whose weight distributions are smooth in shape, look more or less Gaussian, and change gently from epoch to epoch tend to perform better. Let's see how this applies to the showcase problem.

U-net CNN architecture

The original U-net has a final conv layer with three filters. Each filter has to be able to distinguish its target class on the final feature map. Each filter has a capacity of only 65 parameters (a 1×1 convolution over the 64-channel feature map: 64 weights plus a bias), which might be insufficient. Let's see how the filter weights are distributed in different layers.

Luckily, TensorFlow ships with TensorBoard, a tool which lets us see what is happening inside the neural-network black box while it is learning. It is like a flashlight in a dark room. TensorBoard has a number of nice features; here we inspect layer-weight histograms. Below are the distributions of the penultimate (conv2d_9) and the last (conv2d_10) convolution layers.
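Histograms like these are produced by the standard Keras TensorBoard callback with `histogram_freq` enabled. A minimal sketch (the toy model and log directory here are illustrative assumptions, not the project's actual code):

```python
import numpy as np
import tensorflow as tf

# Toy stand-in model: any Keras model is logged the same way.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(16, 16, 3)),
    tf.keras.layers.Conv2D(8, 3, padding="same", activation="relu"),
    tf.keras.layers.Conv2D(3, 1, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

# histogram_freq=1 writes per-layer weight histograms every epoch.
tb = tf.keras.callbacks.TensorBoard(log_dir="logs/demo", histogram_freq=1)

x = np.random.rand(8, 16, 16, 3).astype("float32")
y = np.random.randint(0, 3, size=(8, 16, 16))
model.fit(x, y, epochs=2, callbacks=[tb], verbose=0)
# Browse the histograms with: tensorboard --logdir logs
```

The histograms then appear under TensorBoard's "Histograms" tab, one per weight tensor per epoch.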

The last and the penultimate layers have 195 and 2,359,808 parameters respectively. According to the weight-distribution-shape assumption made above, one can presume that the last layer has insufficient capacity to generalize well on the current data set. Let's try to improve this.
The segmentation model tries to classify each pixel of a given image as one of three classes (background / building / building border). Thus only 65 parameters correspond to each class in the final convolution operation. The improved U-net has an increased number of filters per class in the final conv layer. There are three groups of filters, one group per class; the filters corresponding to one class are summed in the final layer before activation. Let's compare the layer-weight distributions of the baseline and improved models.

Baseline:

Improved model:

As one might expect, the improved model's weights look a little less ragged, and they change more smoothly from epoch to epoch. We'll see how this influences segmentation quality a little later.
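The grouped final layer described earlier can be sketched as a custom Keras layer: k 1×1 filters per class, with each group of k responses summed before the softmax. The layer name and the default sizes below are my assumptions for illustration, not the project's actual code:

```python
import tensorflow as tf

class GroupedSumHead(tf.keras.layers.Layer):
    """k 1x1 filters per class; each group of k responses is summed
    before the softmax activation (illustrative sketch)."""
    def __init__(self, n_classes=3, filters_per_class=19, **kwargs):
        super().__init__(**kwargs)
        self.n_classes = n_classes
        self.k = filters_per_class
        self.conv = tf.keras.layers.Conv2D(
            n_classes * filters_per_class, kernel_size=1, activation=None)

    def call(self, features):
        x = self.conv(features)                       # (B, H, W, n*k)
        b, h, w = tf.shape(x)[0], tf.shape(x)[1], tf.shape(x)[2]
        # Group the n*k channels into n groups of k and sum each group.
        x = tf.reshape(x, (b, h, w, self.n_classes, self.k))
        x = tf.reduce_sum(x, axis=-1)                 # (B, H, W, n)
        return tf.nn.softmax(x, axis=-1)

head = GroupedSumHead(n_classes=3, filters_per_class=19)
masks = head(tf.zeros((1, 8, 8, 64)))  # per-pixel class probabilities
```

With filters_per_class=1 this reduces to the baseline head and its 65 parameters per class.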

Hyper-parameter optimization

The custom layer in the improved model can have a varying number of filters. I try to find the optimal number with the Tree-structured Parzen Estimator (hyperopt library). This is a Bayesian optimization method, which is a reasonable choice for a non-differentiable stochastic problem. It evaluates the objective function at randomly sampled parameter values and estimates the posterior distribution of the objective in a very clever way. More on the algorithm's background can be found in this paper: https://papers.nips.cc/paper/4443-algorithms-for-hyper-parameter-optimization.pdf

In my case I estimate the loss-function distribution versus the number of filters in the custom layer. I model the number of filters as a quniform(3, 120) distribution, which is basically a uniform distribution over integers. Three filters is equivalent to the baseline model (one filter per class). One hundred and twenty filters (40 filters per class) is the upper limit, bounded by 16 GB of GPU memory. If increasing the number of filters doesn't improve the loss, the search is likely to end up with an estimate close to three. I use Google Colab for this project as a free GPU runtime environment. I run 100 loss-function evaluations, which is limited by Google's GPU time restriction: Google restricts GPU time to avoid unscrupulous runtime usage such as crypto-currency mining. Though I have not found an explicit time limit in the docs, I ran into a four-hour limit empirically. It is worth mentioning that hyperopt has an option to run evaluations in parallel, storing evaluation results in MongoDB. Thus if one needs more GPU time, another free GPU runtime (Kaggle or the like) can be paired in simultaneously. If there is interest in how to achieve this, I can publish a comprehensive guide on getting more GPU runtime using only free tools. Please let me know in the comments if you want to learn more about it.

Below is a plot of the hyperopt runs, showing the relation between the loss and the number of channels.

As can be observed, the best result is achieved with 57 channels, which is 19 channels per class. Let's see how it looks for an example image from the test set.

Outputs of baseline and improved models on a test image

Conclusion

In this short post I demonstrated a deliberate, systematic way of engineering and evolving a deep-learning model. Feedback and thoughts about anything missing, or about next-step improvements, are highly appreciated.