Deep Learning: Image segmentation and localization — U-Net Architecture

Source: Deep Learning on Medium

1. Introduction

Artificial intelligence (AI) has been a subject of intense media hype. Machine learning, deep learning, and AI come up in countless articles, often outside technology-minded publications. In recent years, deep convolutional networks have outperformed the previous state of the art in many visual recognition tasks.

But have you noticed how complex these visual recognition tasks are? They are not easy at all.

With the emergence of Convolutional Neural Networks (CNNs), research has made progress like never before. Using a CNN, we can classify an image into its corresponding class.

The typical use of convolution networks is on classification tasks, where the output to an image is a single class label. However, in many visual tasks, especially in image processing, the desired output should include localization, i.e., a class label is supposed to be assigned to each pixel.

So we need a CNN that can predict the class label of each pixel, using a local region (patch) around that pixel as its input. This amounts to detecting objects in the image and predicting a class label for every pixel.

Convolutional neural networks gave decent results on easier image segmentation problems, but they did not make much progress on complex ones. That's where U-Net comes into the picture. U-Net was originally designed for medical image segmentation, and it showed such good results that it has since been used in many other fields.

2. U-Net Architecture (it looks like a "U")

To predict a class at each pixel of the image, we use the concept of semantic segmentation: pixel-wise classification of an image. We can think of semantic segmentation as image classification at the pixel level.

The architecture consists of three parts:

· Contracting/Down-sampling layer (left-side)

· Expansive/Up-sampling layer (right-side)

· Output Layer

· Contracting/Down-Sampling layer

The contracting path follows the typical architecture of a convolutional network. It consists of the repeated application of two 3×3 convolutions (unpadded convolutions), each followed by a rectified linear unit (ReLU) and a 2×2 max pooling operation with stride 2 for down sampling.

At each down sampling step we double the number of feature channels.

The purpose of this contracting path is to capture the context of the input image in order to be able to do segmentation.

Here is the Keras code for the contracting path:

#Contracting Layer:

We start with 8 filters (the kernels themselves are 3×3) and double the number of filters in each subsequent contracting layer. We can add more contracting layers to extract better features.
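The snippets in this article reference an `inputs` tensor and several Keras layers without defining them; a minimal setup might look like this (the 128×128×1 input shape is an illustrative assumption, not from the original text):

```python
# Assumed setup for the layer snippets below.
# The 128x128x1 input shape is illustrative only.
from tensorflow.keras.layers import (Activation, Conv2D, Conv2DTranspose,
                                     Input, MaxPooling2D, concatenate)
from tensorflow.keras.models import Model

inputs = Input((128, 128, 1))  # (height, width, channels)
```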

Layer 1: Contracting Layer

c1 = Conv2D(8, (3, 3), padding='same') (inputs)
c1 = Activation('relu') (c1)
c1 = Conv2D(8, (3, 3), padding='same') (c1)
c1 = Activation('relu') (c1)
p1 = MaxPooling2D((2, 2)) (c1)

Layer 2: Contracting Layer

c2 = Conv2D(16, (3, 3), padding='same') (p1)
c2 = Activation('relu') (c2)
c2 = Conv2D(16, (3, 3), padding='same') (c2)
c2 = Activation('relu') (c2)
p2 = MaxPooling2D((2, 2)) (c2)

Layer 3: Contracting Layer

c3 = Conv2D(32, (3, 3), padding='same') (p2)
c3 = Activation('relu') (c3)
c3 = Conv2D(32, (3, 3), padding='same') (c3)
c3 = Activation('relu') (c3)
p3 = MaxPooling2D((2, 2)) (c3)

What is happening in Down-sampling?

This path extracts richer features from the image by doubling the number of filters at each stage.

While moving to the next stage, we apply 2×2 max-pooling, which keeps the maximum pixel value in each window, losing some spatial detail but retaining the strongest activations.

So at the last layer of down-sampling we obtain the most abstract, lowest-resolution features of the image.
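As a quick illustration of the pooling step, here is a pure-NumPy sketch (not the Keras layer itself) of 2×2 max-pooling with stride 2, which keeps only the largest value in each 2×2 block and halves the spatial size:

```python
import numpy as np

def max_pool_2x2(x):
    """2x2 max pooling with stride 2: keep the largest value per block."""
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

fmap = np.array([[1, 3, 2, 0],
                 [4, 2, 1, 5],
                 [7, 8, 0, 1],
                 [2, 6, 3, 4]])

pooled = max_pool_2x2(fmap)
print(pooled)
# [[4 5]
#  [8 4]]
```

The 4×4 map shrinks to 2×2: some positions are discarded, but the strongest activation in each neighbourhood survives.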

Suppose an image has two classes, cat and dog, and those classes are reflected in the pixels by assigning each pixel a value.

For example, segmenting the classes pixel-wise:

Class CAT: pixel feature value at a layer = 135

Class DOG: pixel feature value at a layer = 150

Background pixels have value 0.

In this way we can classify pixel-wise and can also segment the classes.
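A toy sketch of such a pixel-value encoding (the 3×3 mask and its values are purely illustrative):

```python
import numpy as np

# Hypothetical label mask using the pixel values from the example above:
# 0 = background, 135 = CAT, 150 = DOG.
mask = np.array([[  0, 135, 135],
                 [  0, 150, 150],
                 [  0,   0, 150]])

cat_pixels = int((mask == 135).sum())  # number of pixels labelled CAT
dog_pixels = int((mask == 150).sum())  # number of pixels labelled DOG
print(cat_pixels, dog_pixels)
# 2 3
```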

· Expansive/Up-Sampling layer

Similar to the contracting path, it consists of several expansion blocks. Every step in the expansive path consists of an up-sampling of the feature map followed by a 2×2 convolution ("up-convolution") that halves the number of feature channels, a concatenation with the correspondingly cropped feature map from the contracting path, and two 3×3 convolutions, each followed by a ReLU.

The purpose of this expanding path is to enable precise localization combined with contextual information from the contracting path.

Here is the Keras code for the expansion path:

#Expansion Layer:

The number of filters is halved at each expansion step (32 → 16 → 8), mirroring the contracting path.

# A small bottleneck block is added here (not in the original snippet) so that
# the expansion path starts from the pooled tensor p3 and the skip-connection
# shapes line up.
cb = Conv2D(64, (3, 3), padding='same') (p3)
cb = Activation('relu') (cb)

u4 = Conv2DTranspose(32, (2, 2), strides=(2, 2), padding='same') (cb)
u4 = concatenate([u4, c3])
c4 = Conv2D(32, (3, 3), padding='same') (u4)
c4 = Activation('relu') (c4)
c4 = Conv2D(32, (3, 3), padding='same') (c4)
c4 = Activation('relu') (c4)

u5 = Conv2DTranspose(16, (2, 2), strides=(2, 2), padding='same') (c4)
u5 = concatenate([u5, c2])
c5 = Conv2D(16, (3, 3), padding='same') (u5)
c5 = Activation('relu') (c5)
c5 = Conv2D(16, (3, 3), padding='same') (c5)
c5 = Activation('relu') (c5)

u6 = Conv2DTranspose(8, (2, 2), strides=(2, 2), padding='same') (c5)
u6 = concatenate([u6, c1])
c6 = Conv2D(8, (3, 3), padding='same') (u6)
c6 = Activation('relu') (c6)
c6 = Conv2D(8, (3, 3), padding='same') (c6)
c6 = Activation('relu') (c6)

What is happening in Up-sampling?

We up-sample through the same number of stages so that the output regains the original size of the image.

In down-sampling we obtained the pixel feature values for all the classes. Although some features were lost to max-pooling along the way, there is no need to worry.

In up-sampling we recover the full image: the feature map from each down-sampling level is copied and concatenated with the up-sampling level that has the same number of filters (a skip connection), so the features lost to pooling are retained. The up-sampling itself is done by transposed convolution (Conv2DTranspose). Thus we get back the full image and can localize where each class is present.

Then we continue learning on the full-size feature maps by applying further convolutions.

So in up-sampling, essentially every feature layer from the down-sampling side is concatenated with the corresponding feature layer on the up-sampling side, yielding a full-resolution output and locating the objects of each class.

This integrates local information with the global context of the image.
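For intuition, here is a pure-NumPy sketch of the simplest form of up-sampling, nearest-neighbour repetition, which restores the spatial size lost by 2×2 pooling (Conv2DTranspose achieves the same size change with learned weights instead of plain copies):

```python
import numpy as np

# A pooled 2x2 feature map (illustrative values).
small = np.array([[4, 5],
                  [8, 4]])

# Repeat every value into a 2x2 block: (2, 2) -> (4, 4).
upsampled = small.repeat(2, axis=0).repeat(2, axis=1)
print(upsampled)
# [[4 4 5 5]
#  [4 4 5 5]
#  [8 8 4 4]
#  [8 8 4 4]]
```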

· Output layer

outputs = Conv2D(2, (1, 1), activation='sigmoid') (c6)

In the last layer we have 2 outputs, one associated with each class:

Cat - c1

Dog - c2

Each class has its own 1×1 filter.

So, the c1 filter will go over every pixel of the feature map and predict the probability that class CAT is present.

In the ideal case, for a cat pixel, the probability from c1 is 1 and the probability from c2 is 0.

The output layer gives, for each filter, the probability of each pixel belonging to that particular class. The higher the probability for a pixel, the more likely it belongs to that class.
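To turn these per-pixel probabilities into a final segmentation mask, a common follow-up step (a sketch, not part of the original code) is to take, at each pixel, the class channel with the highest probability:

```python
import numpy as np

# Hypothetical output probabilities for a tiny 2x2 image with two
# class channels (channel 0 = CAT, channel 1 = DOG).
probs = np.array([[[0.9, 0.1], [0.2, 0.8]],
                  [[0.7, 0.3], [0.4, 0.6]]])

# Assign each pixel the class with the highest probability.
pred = probs.argmax(axis=-1)  # 0 = CAT, 1 = DOG
print(pred)
# [[0 1]
#  [0 1]]
```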

3. Loss Function

We use binary cross-entropy as the loss function, as it gives a pixel-wise loss: the probability of each pixel belonging to the region.
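For intuition, here is a minimal NumPy sketch of pixel-wise binary cross-entropy (in Keras you would simply pass loss='binary_crossentropy' when compiling the model); the example values are illustrative:

```python
import numpy as np

def binary_cross_entropy(y_true, y_pred, eps=1e-7):
    """Pixel-wise binary cross-entropy, averaged over all pixels."""
    y_pred = np.clip(y_pred, eps, 1 - eps)  # avoid log(0)
    return float(np.mean(-(y_true * np.log(y_pred)
                           + (1 - y_true) * np.log(1 - y_pred))))

# Ground-truth mask and predicted probabilities for a 2x2 image.
y_true = np.array([[1.0, 0.0],
                   [1.0, 0.0]])
y_pred = np.array([[0.9, 0.1],
                   [0.8, 0.3]])

loss = binary_cross_entropy(y_true, y_pred)
print(round(loss, 4))
# 0.1976
```

The loss shrinks toward zero as the predicted probabilities approach the ground-truth mask at every pixel.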