ML In Detail 1: PSPNet

Source: Deep Learning on Medium

There are various ways to perform semantic segmentation, such as using superpixel information or sliding-window classification.

Most of the recent approaches solve the problem of semantic segmentation using FCN, a fully convolutional network.

Fully convolutional network utilized to perform semantic segmentation

A fully convolutional network is a deep learning architecture composed only of convolutional layers. This lets it accept inputs of arbitrary size and, compared with networks that end in fully connected layers, typically use fewer parameters and run inference faster, often with better segmentation performance.

Various papers have each suggested their own way of using FCNs for semantic segmentation, such as SegNet, UNet, DANet, and DeepLab. In this article, we look in detail at one of those papers, PSPNet.

PSPNet (Pyramid Scene Parsing Network)

Arxiv Link:

PSPNet architecture design

Problems of previous semantic segmentation methods: The paper makes several observations about the failure cases of previous semantic segmentation methods and tries to solve them.

  • Mismatched Relationship: Context relationships are important in recognizing a scene. For example, it is likely that a “boat” is over a “river”, while it is unlikely that a “car” is over a “river”. Correct knowledge of this relationship would improve the network's ability to classify each segment correctly.
Example of a scene where context relationships could help.
  • Confusing Categories: Datasets often contain categories that are easily confused, as in the example below. The network predicts some parts of the object as “skyscraper” and other parts as “building”. However, the whole object should carry only one of the two labels, not both. Modeling the relationship between categories could solve this.
Image / Ground Truth / Prediction (The network mixed both predictions)
  • Inconspicuous Classes: Objects can appear at various scales in a scene. Traditional FCNs do not account for these scale differences, which results in discontinuous predictions across scales, as shown in the image below. One potential reason for this problem is the small receptive field of the network and its inability to pay attention to certain sub-regions, which makes it overlook the global scene category.
Image / Ground Truth / Prediction (Discontinuous prediction in the pillow)

Overall, the problems are due to a lack of contextual relationships, a small receptive field, and a lack of global information.

What is a Receptive Field?

In a CNN, the receptive field is the region of the input that affects the value of a given output pixel.

3×3 convolution Operation on a 5×5 input image

For example, in the image above, a 3×3 kernel is applied to a 3×3 region of the input image to produce a single 1×1 value of the output. Therefore, the receptive field is 3×3.

Having a big receptive field is important, as it allows the network to see the image in a global context.

How do we increase the receptive field?

The receptive field grows as more convolution / downsample blocks are added to the network, so that a larger region of the input affects each value of the final output.
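The growth of the receptive field can be worked out with a short recurrence. The sketch below is only an illustration: the function name `receptive_field` and the `(kernel_size, stride, dilation)` tuple format are assumptions made for this example, not something taken from the paper.

```python
def receptive_field(layers):
    """Receptive field (one side, in input pixels) of a stack of layers.

    Each layer is a (kernel_size, stride, dilation) tuple (hypothetical format).
    """
    rf = 1    # before any layer, an output value sees exactly one input pixel
    jump = 1  # distance, in input pixels, between two neighbouring outputs
    for kernel_size, stride, dilation in layers:
        effective_kernel = dilation * (kernel_size - 1) + 1
        rf += (effective_kernel - 1) * jump
        jump *= stride
    return rf


one_conv = [(3, 1, 1)]                              # a single 3x3 convolution
two_convs = [(3, 1, 1), (3, 1, 1)]                  # two stacked 3x3 convolutions
conv_pool_conv = [(3, 1, 1), (2, 2, 1), (3, 1, 1)]  # 3x3 conv, 2x2 pool, 3x3 conv

print(receptive_field(one_conv))        # 3
print(receptive_field(two_convs))       # 5
print(receptive_field(conv_pool_conv))  # 8
```

Adding a downsampling step between the two convolutions enlarges the receptive field from 5 to 8, which is why deeper networks with pooling see more of the image.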

However, simply increasing the depth of the network does not work, as the downsampling layers also reduce the resolution of the feature maps.

VGG16 architecture

For example, in the case of VGG16, the receptive field of the network grows with the deeper layers, but the resolution shrinks from 224×224 at the input to 7×7 at the end of the convolutional part (after the last max-pooling layer).

This would be a problem, as the details of the prediction will be gone.

In PSPNet, the authors used a Pyramid Pooling Module to increase the receptive field without a drastic decrease in the output resolution, or increase in parameter / layer count.

PSPNet Architecture

Feature Extraction

The authors use an ImageNet-pretrained ResNet50 for feature extraction.

Convolution part of a ResNet (Original)

C(ic, oc, ks, s, p) stands for a convolution with input channels ic, output channels oc, kernel size ks, stride s, and padding p.

Layer_n(s, d) stands for the nth residual layer of the ResNet, with stride s and dilation d. All default ResNets (e.g. ResNet18/34/50/101…) have 4 such layers.

Each layer consists of multiple ResBlks. Only the first block uses the stride s, while the remaining blocks use a stride of 1; all blocks use the dilation d.

Together, the factors below reduce the output resolution of the ResNet to 1/32 of the original image.

  • C (ic, 64, 7, 2, 3) -> 1/2 of the original resolution
  • MaxPool (3, 2, 1) -> 1/4 of the original resolution
  • Layer2 (2, 1) -> 1/8, Layer3 (2, 1) -> 1/16, Layer4 (2, 1) -> 1/32
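The resolution steps listed above can be traced with the standard convolution output-size formula. This is a rough sketch under stated assumptions: the helper names are hypothetical, the input is assumed to be 224×224, and each residual layer is modeled as a single 3×3 convolution with padding 1 that carries the layer's stride, which is a simplification of the real blocks.

```python
def conv_out_size(size, kernel, stride, padding, dilation=1):
    """Spatial output size of a convolution or pooling layer along one side."""
    return (size + 2 * padding - dilation * (kernel - 1) - 1) // stride + 1


def resnet_sizes(input_size, layer_strides=(1, 2, 2, 2)):
    """Trace the feature-map side length through a (simplified) ResNet."""
    sizes = {}
    size = conv_out_size(input_size, kernel=7, stride=2, padding=3)  # C(ic, 64, 7, 2, 3)
    sizes["stem conv"] = size
    size = conv_out_size(size, kernel=3, stride=2, padding=1)        # MaxPool(3, 2, 1)
    sizes["maxpool"] = size
    for index, stride in enumerate(layer_strides, start=1):
        # Simplification: one 3x3 convolution stands in for the strided first block.
        size = conv_out_size(size, kernel=3, stride=stride, padding=1)
        sizes[f"layer{index}"] = size
    return sizes


sizes = resnet_sizes(224)
print(sizes)
print(224 // sizes["layer4"])  # overall downsampling factor: 32
```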

Dilated Convolution

In order to increase the receptive field while retaining the resolution, PSPNet (like many others, such as DeepLab) uses dilated convolution (also called atrous convolution).

Dilated convolution

Dilated convolution inserts spacing between the values of a kernel, controlled by the “rate” parameter. This gives finer control over the receptive field, which can grow without the resolution decreasing.
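To make the spacing concrete, here is a minimal one-dimensional sketch in plain Python. The function `dilated_conv1d` is a hypothetical helper written for this illustration (no padding, stride 1); real frameworks implement the same idea in 2D.

```python
def dilated_conv1d(signal, kernel, rate=1):
    """Valid 1D convolution (no padding, stride 1) with kernel taps `rate` apart."""
    span = rate * (len(kernel) - 1) + 1  # input positions covered by one output
    outputs = []
    for start in range(len(signal) - span + 1):
        outputs.append(
            sum(weight * signal[start + rate * tap] for tap, weight in enumerate(kernel))
        )
    return outputs


signal = [1, 2, 3, 4, 5, 6, 7]
kernel = [1, 1, 1]

print(dilated_conv1d(signal, kernel, rate=1))  # [6, 9, 12, 15, 18], each output spans 3 inputs
print(dilated_conv1d(signal, kernel, rate=2))  # [9, 12, 15], each output spans 5 inputs
```

With rate = 2 the same three weights cover five input positions, so the receptive field grows without adding parameters or striding.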

Therefore, instead of using stride = 2 to enlarge the receptive field, the network is changed as shown below: every layer uses stride = 1 except Layer 2, and the last two layers use dilated convolutions with rate = 2 and 4 respectively.

Convolution part of a ResNet (Atrous)

Together, the factors below reduce the output resolution of the ResNet to only 1/8 of the original image, which is 4x higher per side than the original 1/32.

  • C (ic, 64, 7, 2, 3) -> 1/2 of the original resolution
  • MaxPool (3, 2, 1) -> 1/4 of the original resolution
  • Layer2 (2, 1) -> 1/8, Layer3 (1, 2) -> 1/8, Layer4 (1, 4) -> 1/8
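The change from the original to the atrous configuration follows a simple rule: once the target downsampling factor is reached, every further stride is turned into dilation. The sketch below is a simplified per-layer view that matches the Layer_n(s, d) notation; the function name and arguments are assumptions for this example, and real implementations handle the first block of each dilated layer with extra care.

```python
def replace_stride_with_dilation(layer_strides, stem_stride, target_output_stride):
    """Return a (stride, dilation) pair per residual layer.

    stem_stride is the downsampling already applied before Layer 1
    (7x7 stride-2 convolution followed by stride-2 max pooling gives 4).
    """
    current_stride = stem_stride
    dilation = 1
    plan = []
    for stride in layer_strides:
        if current_stride * stride > target_output_stride:
            # Downsampling budget is used up: keep the resolution and
            # widen the kernel spacing by the same factor instead.
            dilation *= stride
            plan.append((1, dilation))
        else:
            current_stride *= stride
            plan.append((stride, dilation))
    return plan


original = replace_stride_with_dilation([1, 2, 2, 2], stem_stride=4, target_output_stride=32)
atrous = replace_stride_with_dilation([1, 2, 2, 2], stem_stride=4, target_output_stride=8)

print(original)  # [(1, 1), (2, 1), (2, 1), (2, 1)]
print(atrous)    # [(1, 1), (2, 1), (1, 2), (1, 4)]
```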

The number of output channels depends on the ResNet variant that is used. ResNet18 and 34 use a type of ResBlk called BasicBlock and end with 512 output channels. ResNet50, 101, and 152 use a BottleneckBlock and end with 2048 output channels. From here on, this number is written as nf.

Pyramid Pooling Module

Pyramid Pooling Module

After feature extraction, the output feature map has shape nf x h’ x w’.

There are 4 branches in total inside the Pyramid Pooling Module. In each branch, the input goes through an average pooling layer, producing outputs of size 1×1, 2×2, 3×3, and 6×6 respectively.

This design lets the network pay attention to objects of various sizes and scales.

  • In the 1×1 case, the entire h’ x w’ map is pooled into a single pixel, allowing the network to look at global information with a bigger receptive field.
  • In the 6×6 case, the output has a larger spatial resolution, making the network look at local information with a relatively smaller receptive field.

Using the pyramid pooling module, the network has information both for the global scene and local objects, allowing it to pay attention to various scales of objects in a scene.
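The pooling step in each branch can be sketched in plain Python for a single channel. This is an illustration only: the function below is a hypothetical helper, and its bin boundaries (floor for the start, ceiling for the end) are an assumption about how adaptive average pooling splits the map.

```python
def adaptive_avg_pool2d(grid, out_size):
    """Average-pool a 2D list of numbers into an out_size x out_size grid."""
    height, width = len(grid), len(grid[0])
    pooled = []
    for i in range(out_size):
        row_start = (i * height) // out_size
        row_end = ((i + 1) * height + out_size - 1) // out_size  # ceiling division
        row = []
        for j in range(out_size):
            col_start = (j * width) // out_size
            col_end = ((j + 1) * width + out_size - 1) // out_size
            cells = [
                grid[r][c]
                for r in range(row_start, row_end)
                for c in range(col_start, col_end)
            ]
            row.append(sum(cells) / len(cells))
        pooled.append(row)
    return pooled


# A 4x4 single-channel feature map holding the values 0..15.
feature_map = [[4 * r + c for c in range(4)] for r in range(4)]

print(adaptive_avg_pool2d(feature_map, 1))  # [[7.5]]: one global summary
print(adaptive_avg_pool2d(feature_map, 2))  # [[2.5, 4.5], [10.5, 12.5]]: four regional summaries
```

The 1×1 output summarizes the whole map, while the 2×2 output keeps one average per quadrant, so the coarser bins carry global context and the finer bins carry more local context.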

Each branch is then followed by a 1×1 convolution, BatchNorm, and ReLU, reducing the channels from nf to nf / 4. Each of the 1×1, 2×2, 3×3, and 6×6 outputs is bilinearly upsampled to h’ x w’, and the four results are concatenated channel-wise, giving nf x h’ x w’ (4 branches of nf / 4 channels each).

The result is further concatenated with the input feature, resulting in (nf x 2) x h’ x w’
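A minimal PyTorch sketch of the module as described above follows. It is not the original implementation: the class name, the `bin_sizes` argument, `bias=False`, and `align_corners=False` are choices made for this example, and nf = 64 is a toy size picked only to keep the demonstration light.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class PyramidPoolingModule(nn.Module):
    """Pool at several bin sizes, reduce channels, upsample, and concatenate."""

    def __init__(self, nf, bin_sizes=(1, 2, 3, 6)):
        super().__init__()
        branch_channels = nf // len(bin_sizes)  # nf / 4 with the default bins
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.AdaptiveAvgPool2d(bin_size),  # h' x w' -> bin_size x bin_size
                nn.Conv2d(nf, branch_channels, kernel_size=1, bias=False),
                nn.BatchNorm2d(branch_channels),
                nn.ReLU(inplace=True),
            )
            for bin_size in bin_sizes
        ])

    def forward(self, x):
        height, width = x.shape[2:]
        outputs = [x]  # keep the input feature for the final concatenation
        for branch in self.branches:
            pooled = branch(x)
            outputs.append(
                F.interpolate(pooled, size=(height, width),
                              mode="bilinear", align_corners=False)
            )
        return torch.cat(outputs, dim=1)  # (nf x 2) x h' x w'


# eval() so that BatchNorm accepts the 1x1 branch with a batch of one image.
ppm = PyramidPoolingModule(nf=64).eval()
with torch.no_grad():
    out = ppm(torch.randn(1, 64, 12, 12))
print(out.shape)  # torch.Size([1, 128, 12, 12])
```

The output has twice the input channels: nf from the original feature plus 4 branches of nf / 4 channels each.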

Pyramid Pooling Module in PyTorch

Finally, the (nf x 2) x h’ x w’ input goes through a series of convolutional operations and a final bilinear upsample (to recover from being 1/8 of the original resolution from the ResNet).

Final Output layers

Additional Details

From the output, the loss can be calculated through Cross Entropy Loss between the output and the ground truth.

Besides this, the paper also utilizes an auxiliary loss.

  • The Pyramid Pooling Module uses the output of the Atrous ResNet, or more precisely, the output of Layer 4.
  • As an auxiliary segmentation branch, the output of Layer 3 (which has the same spatial resolution as Layer 4) goes through a series of convolutional operations and a final bilinear upsample to predict the output mask.
  • The prediction is also compared with the ground truth through Cross Entropy Loss.
  • Both losses (main and auxiliary) are summed into the final loss, with the auxiliary loss weighted by alpha = 0.4, since the main branch is the most important one. The auxiliary branch is discarded after training.
  • The authors claim that this helps the learning process, which is shown from the chart below.
Testing the appropriate weight for alpha
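The weighted combination of the two losses can be sketched in plain Python. The helper names and the flat list-of-pixels data layout are assumptions for this illustration; a real training loop would use a framework's cross entropy on tensors.

```python
import math


def pixel_cross_entropy(logits, target):
    """Mean cross entropy over pixels; logits[i] holds the class scores of pixel i."""
    total = 0.0
    for scores, label in zip(logits, target):
        peak = max(scores)  # subtract the maximum for numerical stability
        log_sum_exp = peak + math.log(sum(math.exp(s - peak) for s in scores))
        total += log_sum_exp - scores[label]
    return total / len(target)


def pspnet_loss(main_logits, aux_logits, target, alpha=0.4):
    """Main loss plus the auxiliary loss weighted by alpha."""
    main_loss = pixel_cross_entropy(main_logits, target)
    aux_loss = pixel_cross_entropy(aux_logits, target)
    return main_loss + alpha * aux_loss


# Two pixels, two classes; the main branch is more confident than the auxiliary one.
target = [0, 1]
main_logits = [[2.0, 0.0], [0.0, 2.0]]
aux_logits = [[0.0, 0.0], [0.0, 0.0]]

print(pspnet_loss(main_logits, aux_logits, target))
```

At inference time only the main branch is evaluated, so the auxiliary term affects training only.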

The network is also trained using the poly learning rate policy, with the base learning rate = 0.01 and power = 0.9.

At every iteration, the base learning rate is multiplied by (1 - iter / max_iter)^power.
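As a small sketch, the poly policy scales the base learning rate by (1 - iter / max_iter)^power. The function name and the max_iter value of 100 below are arbitrary choices for this illustration; the base learning rate and power are the values quoted above.

```python
def poly_lr(base_lr, iteration, max_iter, power=0.9):
    """Learning rate under the poly policy at a given iteration."""
    return base_lr * (1 - iteration / max_iter) ** power


base_lr = 0.01
max_iter = 100  # illustrative value only

for iteration in (0, 25, 50, 75, 100):
    print(iteration, poly_lr(base_lr, iteration, max_iter))
```

The rate starts at the base value and decays smoothly to zero at the final iteration.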

Final Flowchart of PSPNet

Red: Input data / Orange: Feature Extraction / Green: Pyramid Pooling Module / Purple: Final Convolutional Blocks / White: Cross Entropy Loss / Blue: Weighted Final Loss

PyTorch implementation

This is my attempt at implementing various semantic segmentation algorithms in PyTorch, including UNet, PSPNet, DANet and more.