Semantic Segmentation of Aerial Images Using Deep Learning

Source: Deep Learning on Medium

Pixel-wise image segmentation is a challenging and demanding task in computer vision and image processing. This blog is about segmenting buildings in aerial (satellite/drone) images. The availability of high-resolution remote-sensing data has opened up interesting applications, such as per-pixel classification of individual objects in greater detail. With Convolutional Neural Networks (CNNs), segmentation and classification of images have become very efficient. Because the images used here have very high resolution and size, they were cropped and reduced in size so that processing could be done on a moderate computer with normal specifications. The model is inspired by the U-Net architecture, which is widely used for networks whose output has the same format as the input; it is commonly applied in medical imaging to detect anomalies such as fractures or tumors. Fine-tuning (transfer learning) is also used to improve accuracy: the encoder of a previously trained model is reused, and tuning the hyper-parameters helps the model learn better. To prepare the model for varied, challenging situations and produce more accurate results, we used data augmentation, i.e. taking one image and making changes to it, such as altering hue, saturation, and brightness, zooming, and rotating, to provide a wide range of training conditions; this also helps when the amount of training data is small or inadequate. Overall, the model achieves high accuracy after full training.

What is Semantic Segmentation?? What are its Practical Applications??

Semantic segmentation of drone images to classify different attributes is quite a challenging job because the variations are very large; you cannot expect the places to look the same. Segmenting these images manually for use in different applications is a never-ending process, so there is a need for an automated, smart system that does the job for us. This is where neural networks come into play. The application of neural networks and deep learning is reaching new heights in many fields, and these interesting, smart approaches to challenging problems are attracting attention from people in different disciplines. Even so, solving this problem with neural networks is a difficult task because of the irregularity of the data.

A Convolutional Neural Network (CNN) is the deep-learning architecture that deals with images. In a fully connected neural network, every node in one layer is connected to every node in the next, creating a complex network. If we apply the same scheme to images (which are large matrices in themselves), things do not work so well: each pixel acts as a node, and when millions of pixels are connected to further layers there are millions or billions of connections, making the computation far too expensive. So a new architecture was created for images, which we call the Convolutional Neural Network. The basic idea is that we convolve the image with learned filters to produce the next layer.
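To make the difference in connection counts concrete, here is a small back-of-the-envelope comparison; the image size and layer widths are illustrative, not taken from the original experiment:

```python
# Parameter count: fully connected layer vs. one convolutional layer.
h, w, c = 64, 64, 3      # a small RGB image (illustrative size)
units = 64               # output units / feature maps

# Dense: every pixel connects to every output unit, plus one bias per unit.
dense_params = (h * w * c) * units + units

# Conv: one shared 3x3 kernel per feature map, regardless of image size.
conv_params = (3 * 3 * c) * units + units

print(dense_params)  # 786496
print(conv_params)   # 1792
```

Even at this toy scale, weight sharing cuts the parameter count by more than two orders of magnitude, which is what makes CNNs tractable on large images.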

This program can serve various commercial purposes. For example, drones are an emerging, widely used technology with a wide range of applications, such as inspecting an area or delivering products, and this program helps locate exact positions with greater accuracy. By detecting buildings automatically, we can use the data to estimate population density or to calculate the area of residential, commercial, or uninhabited land. Online maps are widely used these days; automatically detecting attributes like buildings makes it easy to visualize a particular area on the map, which enables applications such as labeling areas or creating 3-D models of buildings. Another major application is disaster management: if a place is hit by an earthquake or flood, aerial vehicles are used for rescue or food delivery, which works best when buildings can be located accurately in little time. Furthermore, there are many other practical applications that can benefit both society and commerce.

Using neural networks to automate this work replaces the conventional, slow way of doing things and produces highly accurate results, saving both time and cost; since aerial image segmentation is an important challenge with huge practical applications, automating it is a good solution. The U-Net model used here for semantic segmentation was inspired by the medical field: its original application is detecting anomalies such as tumors or fractures in medical images by comparison against large datasets of normal images. Because very high precision is required for such sensitive jobs, the CNN layers are made deep to learn the most important features. This inspired us to use the model for aerial images, where buildings differ greatly and there are many parameters and features to learn; high accuracy is also necessary for practical applications.

BASIC CONCEPTS/ TECHNOLOGY USED

Neural Networks- An Artificial Neural Network (ANN) is an information processing paradigm that is inspired by the way biological nervous systems, such as the brain, process information. The key element of this paradigm is the novel structure of the information processing system. It is composed of a large number of highly interconnected processing elements (neurons) working in unison to solve specific problems. ANNs, like people, learn by example. An ANN is configured for a specific application, such as pattern recognition or data classification, through a learning process. Learning in biological systems involves adjustments to the synaptic connections that exist between the neurons.

Fig. A Simple Neural Network Architecture.

Convolutional Neural Network- A convolutional neural network applies small learned filters (kernels) that slide across the image, so weights are shared across spatial positions instead of every pixel having its own connection. A typical CNN alternates convolutional layers, which extract local features such as edges and textures, with pooling layers, which down-sample the feature maps and make the representation more robust to small translations. Deeper layers combine these local features into progressively more abstract ones, which makes CNNs very effective for image classification and segmentation.

Fig. A Simple Convolutional Network Structure.

UNET Model– The U-Net architecture consists of a contracting path to capture context and a symmetric expanding path that enables precise localization. The contracting path follows the typical architecture of a convolutional network, with alternating convolution and pooling operations that progressively down-sample the feature maps while increasing the number of feature maps per layer. Every step in the expansive path consists of an up-sampling of the feature map followed by a convolution.

U-Net is capable of learning from a relatively small training set. In most cases, datasets for image segmentation consist of at most thousands of images, since manual preparation of the masks is a very costly procedure. Typically U-Net is trained from scratch, starting with randomly initialized weights, but it is well known that to train a network without over-fitting, the dataset should be relatively large, on the order of millions of images. Networks trained on the ImageNet dataset are therefore widely used as a source of weight initialization for other tasks. In this way, training can be restricted to the non-pre-trained layers of the network (sometimes only the last layer) to account for the features of the new dataset.
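The partial-training scheme described above, keeping pre-trained layers fixed and updating only the new ones, can be sketched in Keras like this; the layer names and sizes are made up for illustration:

```python
from tensorflow.keras import layers, models

# A toy model standing in for a network with a pre-trained backbone.
model = models.Sequential([
    layers.Conv2D(8, 3, activation="relu",
                  input_shape=(64, 64, 3), name="pretrained_conv"),
    layers.Flatten(),
    layers.Dense(2, activation="softmax", name="new_head"),
])

# Freeze the "pre-trained" layer so only the new head is updated
# during training; its weights keep their initialized values.
model.get_layer("pretrained_conv").trainable = False
```

In a real fine-tuning setup the frozen part would carry ImageNet-trained weights, and only the randomly initialized layers would learn the features of the new dataset.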

Fig. A Simple U-NET Structure.

Fine-Tuning with VGG-16 encoders (pre-trained on ImageNet)– As the encoder in our U-Net network, we used a relatively simple CNN of the VGG family known as VGG16. VGG16 contains thirteen convolutional layers, each followed by a ReLU activation function, and five max pooling operations, each reducing the feature map by a factor of 2. All convolutional layers have 3×3 kernels. The first convolutional layer produces 64 channels and then, as the network deepens, the number of channels doubles after each max pooling operation until it reaches 512; after that, the number of channels does not change. To construct the encoder, we remove the fully connected layers and replace them with a single convolutional layer of 512 channels that serves as the bottleneck, the central part of the network separating encoder from decoder.

To construct the decoder we use transposed convolution layers that double the size of the feature map while halving the number of channels. The output of each transposed convolution is then concatenated with the output of the corresponding part of the encoder. The resulting feature map is processed by a convolution operation to keep the number of channels the same as in the symmetric encoder term. This up-sampling procedure is repeated 5 times to pair up with the 5 max-poolings. Technically a fully convolutional network can take an input of any size, but because we have 5 max-pooling layers, each down-sampling the image by a factor of two, only images with a side divisible by 32 can be used as input to the current network implementation.
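A minimal Keras sketch of this encoder-decoder wiring might look as follows. It mirrors the description above (bottleneck convolution, five transposed-convolution up-samplings, skip concatenations with the encoder), but the exact layer choices are an assumption, not the original code; `weights=None` keeps the sketch runnable offline, whereas the article loads ImageNet weights (`weights="imagenet"`):

```python
from tensorflow.keras import layers, Model
from tensorflow.keras.applications import VGG16

inp = layers.Input(shape=(512, 512, 3))          # side must be divisible by 32
vgg = VGG16(include_top=False, weights=None, input_tensor=inp)

# Encoder feature maps taken just before each of the five max-poolings.
skips = [vgg.get_layer(name).output for name in
         ["block1_conv2", "block2_conv2", "block3_conv3",
          "block4_conv3", "block5_conv3"]]

# Bottleneck: a single 512-channel convolution replacing the FC layers.
x = layers.Conv2D(512, 3, padding="same", activation="relu")(vgg.output)

# Five up-sampling steps, each paired with its encoder counterpart.
for skip in reversed(skips):
    ch = skip.shape[-1]
    x = layers.Conv2DTranspose(ch, 2, strides=2, padding="same")(x)  # x2 size
    x = layers.Concatenate()([x, skip])                              # skip link
    x = layers.Conv2D(ch, 3, padding="same", activation="relu")(x)   # restore ch

out = layers.Conv2D(1, 1, activation="sigmoid")(x)  # building / not-building
model = Model(inp, out)
```

The output mask has the same 512×512 spatial size as the input, with one channel giving the per-pixel building probability.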

PROPOSED MODEL / TOOL

Dataset– I applied my model to the Inria Aerial Image Labeling Dataset. This dataset consists of 180 aerial images of urban settlements in Europe and the United States, labeled with building and not-building classes. Every image in the dataset is RGB with a 5000×5000 pixel resolution, where each pixel corresponds to 30cm×30cm of the Earth's surface. Images of this size cannot be fed directly into the network, so we need to take crops, and the resolution of each crop must be divisible by 32. I therefore took 512×512 crops of those images and used them in my algorithm. I first tried resizing the images with bilinear interpolation, but that technique failed to keep their important details, so I simply cropped them. From the total dataset, 80% of the images are taken as training data and the remaining 20% as the validation set.
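The cropping step can be sketched with NumPy as below; this is a plain non-overlapping tiler, which is an assumption about how the crops were taken (the original post does not show its cropping code):

```python
import numpy as np

def crop_tiles(image, tile=512):
    """Cut an H x W x C image into non-overlapping tile x tile crops,
    discarding the remainder at the right and bottom edges."""
    h, w = image.shape[:2]
    return [image[y:y + tile, x:x + tile]
            for y in range(0, h - tile + 1, tile)
            for x in range(0, w - tile + 1, tile)]

# A 5000x5000 aerial image yields 9 x 9 = 81 crops of 512x512
# (9 * 512 = 4608; the remaining 392-pixel strip is discarded).
img = np.zeros((5000, 5000, 3), dtype=np.uint8)
crops = crop_tiles(img)
print(len(crops))        # 81
print(crops[0].shape)    # (512, 512, 3)
```

Each 512×512 crop satisfies the divisible-by-32 constraint imposed by the five max-pooling layers.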

Platform and Software Requirement-

• Python 3.5 in Jupyter Notebook (Anaconda environment) is used for coding.

• Keras with a TensorFlow backend is used to develop the code.

Image Augmentation and Image Data Generator– Image augmentation artificially creates training images through different kinds of processing, or combinations of them, such as random rotations, shifts, shears, and flips. Keras provides a predefined class called ImageDataGenerator specifically for this purpose. I applied random rotations and zooms to images in my dataset. This helps create a model that is smarter at recognizing buildings in different situations, and it also increases the number of training images, which directly improves the quality of the predictions.
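A minimal ImageDataGenerator setup along these lines might look as follows; the exact ranges are illustrative, not the values used in the original experiment:

```python
import numpy as np
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Augmentation settings similar to those described above (rotation,
# zoom, shifts, flips); the ranges here are illustrative guesses.
datagen = ImageDataGenerator(
    rotation_range=30,        # random rotations up to 30 degrees
    zoom_range=0.2,           # random zoom in/out by up to 20%
    width_shift_range=0.1,    # random horizontal shifts
    height_shift_range=0.1,   # random vertical shifts
    horizontal_flip=True,
)

# A dummy batch of four 512x512 crops, just to show the API shape.
batch = np.random.rand(4, 512, 512, 3).astype("float32")
augmented = next(datagen.flow(batch, batch_size=4, shuffle=False))
print(augmented.shape)   # (4, 512, 512, 3)
```

Because the generator yields transformed copies on the fly, every epoch effectively sees a different variant of each training crop.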

UNET Model with pre-trained VGG16 encoder– I used pre-trained weights for fine-tuning my model. A pre-trained model is trained on a task different from the one at hand, but it provides a very useful starting point because the features learned on the old task are useful for the new one. I took the encoder part of VGG16 and loaded the weights pre-trained on ImageNet by using the same layer names. Then I created the decoding layers according to the structure of the encoding layers, up-sampling correspondingly and concatenating each layer with its respective encoding layer. I chose Adam as the optimizer and binary cross-entropy as the loss.
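The compile step can be sketched as below. The tiny stand-in model is hypothetical (the real one is the VGG16-encoder U-Net, with ImageNet weights loaded by layer name, e.g. via `load_weights(..., by_name=True)`); only the optimizer and loss choices come from the text:

```python
from tensorflow.keras import layers, models
from tensorflow.keras.optimizers import Adam

# Minimal per-pixel classifier standing in for the full U-Net.
model = models.Sequential([
    layers.Conv2D(8, 3, padding="same", activation="relu",
                  input_shape=(512, 512, 3)),
    layers.Conv2D(1, 1, activation="sigmoid"),   # per-pixel probability
])

# Adam as the optimizer, binary cross-entropy as the loss for the
# two-class (building / not-building) mask.
model.compile(optimizer=Adam(learning_rate=1e-4),
              loss="binary_crossentropy",
              metrics=["accuracy"])
```

Binary cross-entropy fits here because each pixel is an independent two-class decision, and the sigmoid output gives it a probability in [0, 1].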

Iterations– I set the number of iterations/epochs to 100. The steps_per_epoch parameter, equal to the number of training images divided by the batch size, is set correspondingly; the same applies to the validation set.
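The bookkeeping above amounts to simple integer division; the image and batch counts below are illustrative, not the original experiment's values:

```python
# steps_per_epoch = training images // batch size, so one epoch
# passes over (roughly) every training image once.
train_images = 200    # illustrative counts
val_images = 48
batch_size = 4

steps_per_epoch = train_images // batch_size
validation_steps = val_images // batch_size
print(steps_per_epoch)     # 50
print(validation_steps)    # 12

# The training call would then look something like:
# model.fit(train_generator, epochs=100,
#           steps_per_epoch=steps_per_epoch,
#           validation_data=val_generator,
#           validation_steps=validation_steps)
```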

IMPLEMENTATION AND RESULTS

Fig. Raw image (left) and processed image (right).

Advantages and Disadvantages

Advantages-

• It is a very efficient and easy technique for getting highly accurate results.

• It is a recent and continually improving technique, so we can expect even better models in the future.

• You can get high accuracy with very little data.

• It is easy to learn and implement.

Disadvantages-

• Very high computational power is required, which makes it a somewhat costly procedure.

• A high-end GPU (Graphics Processing Unit) is a prerequisite; otherwise training takes too much time.

• The boundaries of some labelled images are not very sharp, but this can be solved by using a larger dataset or by applying CRFs (Conditional Random Fields).

• Hyper-parameter tuning is a time-consuming process.

Conclusion

So, I conclude that by using U-Net with a pre-trained VGG-16 encoder we get high accuracy with a very limited dataset. I tried with just 200 images, and the accuracy reached 98.5% after training. I also tried different amounts of data to check the results, such as 2000 and 6000 images, and the results were satisfactory. Since learning improves with the amount of data, providing a large dataset helps the model learn better and produce better results. I also applied image augmentation, which further improves results by increasing the variety of the training dataset. Overall, it is a very efficient technique for segmenting features like buildings in aerial images.