Computer Vision Part 6: Semantic Segmentation, classification at the pixel level.

As can be seen in the figure above, the first part consists of the usual sequence of convolution, ReLU and max pooling operations. Each 2×2 max pooling operation with stride 2 is a downsampling step that halves the spatial size, while the number of feature channels is doubled. The second part consists of repeatedly upsampling the feature map, concatenating the correspondingly cropped feature map from the contracting path (cropping is necessary because border pixels are lost after each convolution), convolving, and applying ReLU.
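To make the two parts concrete, here is a minimal PyTorch sketch of one contracting step and one expansive step of a U-Net-style network. The channel sizes (64 → 128), input resolution, and the center-cropping helper are illustrative assumptions, not the exact configuration of any published model.

```python
import torch
import torch.nn as nn

def double_conv(in_ch, out_ch):
    # Two 3x3 convolutions, each followed by ReLU, as used throughout the network.
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3),
        nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, kernel_size=3),
        nn.ReLU(inplace=True),
    )

def center_crop(feat, target_hw):
    # Crop a skip feature map to the target spatial size; cropping is needed
    # because border pixels are lost after each unpadded convolution.
    _, _, h, w = feat.shape
    th, tw = target_hw
    dh, dw = (h - th) // 2, (w - tw) // 2
    return feat[:, :, dh:dh + th, dw:dw + tw]

# Contracting step: 2x2 max pooling with stride 2 halves the spatial size;
# the convolutions that follow double the number of feature channels.
pool = nn.MaxPool2d(kernel_size=2, stride=2)
down_conv = double_conv(64, 128)

# Expansive step: upsample, concatenate the cropped skip feature map,
# then convolve and apply ReLU again.
up = nn.ConvTranspose2d(128, 64, kernel_size=2, stride=2)
up_conv = double_conv(128, 64)  # 64 upsampled + 64 skip channels

x = torch.randn(1, 64, 140, 140)          # feature map from an earlier stage
skip = x                                   # kept for the skip connection
x = down_conv(pool(x))                     # downsample, channels 64 -> 128
x = up(x)                                  # upsample back, channels 128 -> 64
skip = center_crop(skip, x.shape[-2:])     # crop skip to the upsampled size
x = up_conv(torch.cat([skip, x], dim=1))   # concatenate and convolve
```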

Similar to FCN, high-resolution features from the contracting path are combined with the upsampled output, which is then fed into a series of convolutional layers. The main difference from FCN is that the upsampling part retains a large number of feature channels, which allows the network to propagate contextual information to higher-resolution layers. Furthermore, the network does not contain fully connected layers but only uses convolutional layers.
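The absence of fully connected layers is what makes per-pixel prediction on inputs of varying size possible. The short sketch below (not from the article) shows a 1×1 convolution acting as a per-pixel classifier on feature maps of different spatial sizes; the channel count and number of classes are made-up values.

```python
import torch
import torch.nn as nn

num_classes = 3  # assumed for illustration
head = nn.Conv2d(64, num_classes, kernel_size=1)  # per-pixel classifier

for size in [(128, 128), (160, 96)]:
    feats = torch.randn(1, 64, *size)   # upsampled high-resolution features
    logits = head(feats)                # one score map per class
    print(logits.shape)                 # e.g. torch.Size([1, 3, 128, 128])
```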

2.3. FC-DenseNet

The FC-DenseNet, or One Hundred Layers Tiramisu, is a segmentation technique built upon the DenseNet architecture for image classification. DenseNet is based on the paradigm of shortcut connections from early layers to later layers. What makes DenseNet special is that all layers are connected to each other.

A 5-layer dense block with a growth rate of k = 4, where k is the number of feature-maps each layer contributes to subsequent layers. Each layer takes all preceding feature-maps as input.

Each layer passes its own feature-maps on to all subsequent layers. Whereas ResNet combines features through element-wise addition, DenseNet uses concatenation. As such, each layer receives the collective knowledge of all preceding layers. Perhaps counterintuitively, this requires fewer parameters than traditional architectures, as there is no need to relearn redundant feature-maps.
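The following minimal sketch contrasts the two ways of combining features: element-wise addition (ResNet-style) keeps the channel count fixed, while concatenation (DenseNet-style) lets the next layer see the feature-maps of all preceding layers. The channel sizes are illustrative assumptions.

```python
import torch

prev_features = [torch.randn(1, c, 32, 32) for c in (16, 4, 4)]  # earlier layer outputs
new_features = torch.randn(1, 4, 32, 32)                         # current layer output

# ResNet-style: shapes must match, and information is merged by summation.
resnet_out = prev_features[1] + new_features                     # still 4 channels

# DenseNet-style: everything produced so far is stacked along the channel
# dimension, so the next layer receives the collective set of feature-maps.
densenet_in = torch.cat(prev_features + [new_features], dim=1)   # 16+4+4+4 = 28 channels
print(resnet_out.shape, densenet_in.shape)
```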

Below, we find an overview of the architecture. Each layer produces k output feature-maps, corresponding to the aforementioned growth rate. Within a Dense Block, the concatenated input of each layer is first passed through a BottleNeck (a 1×1 convolution) that reduces the number of input feature-maps, which improves the computational efficiency of the Composite blocks that follow. Before being passed to the next Dense Block, the feature maps are compressed by going through a Transition Layer.
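As a rough sketch of these pieces, the PyTorch snippet below builds a small dense block whose layers use a 1×1 bottleneck, followed by a transition layer that compresses and downsamples the feature maps. The growth rate, bottleneck width (4·k) and compression factor (0.5) are illustrative choices in the spirit of the DenseNet-BC design, not a faithful reproduction of FC-DenseNet.

```python
import torch
import torch.nn as nn

k = 4  # growth rate: number of feature-maps produced by each layer

def dense_layer(in_ch, growth):
    # BottleNeck: a 1x1 convolution reduces the input channels before the
    # 3x3 convolution of the composite block.
    return nn.Sequential(
        nn.BatchNorm2d(in_ch), nn.ReLU(inplace=True),
        nn.Conv2d(in_ch, 4 * growth, kernel_size=1, bias=False),
        nn.BatchNorm2d(4 * growth), nn.ReLU(inplace=True),
        nn.Conv2d(4 * growth, growth, kernel_size=3, padding=1, bias=False),
    )

def transition(in_ch, compression=0.5):
    # Transition layer: a 1x1 convolution compresses the channels and
    # average pooling downsamples before the next dense block.
    out_ch = int(in_ch * compression)
    return nn.Sequential(
        nn.BatchNorm2d(in_ch), nn.ReLU(inplace=True),
        nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False),
        nn.AvgPool2d(kernel_size=2, stride=2),
    )

# A 3-layer dense block: each layer consumes the concatenation of all
# preceding feature-maps and adds k new ones.
x = torch.randn(1, 16, 32, 32)
features = [x]
in_ch = 16
for _ in range(3):
    new = dense_layer(in_ch, k)(torch.cat(features, dim=1))
    features.append(new)
    in_ch += k

block_out = torch.cat(features, dim=1)       # 16 + 3*k = 28 channels
compressed = transition(in_ch)(block_out)    # 14 channels, half the spatial size
print(block_out.shape, compressed.shape)
```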