GCNet: Non-local Network Meets Squeeze-Excitation Network and Beyond Review

Source: Deep Learning on Medium

Before getting into GCNet, we first need to understand two networks: Non-local Networks and Squeeze-Excitation Networks. We will not go into much detail about them. Once we have the basic idea, we can move on to the main topic, GCNet.

Non-local Neural Network

The network is inspired by the non-local means filter, which is used for random noise reduction. Random noise can be reduced by capturing multiple images of the same scene, each with its own random noise, and taking the mean of each pixel across them. However, acquiring multiple images of the same scene is difficult in real-world scenarios. What we usually get is a single image with random noise.
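
The averaging idea can be checked with a small experiment. The snippet below is only an illustrative sketch: the 8 x 8 "image", the noise level `sigma`, and the number of shots are all made-up values, and the noise is assumed to be independent Gaussian noise.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical noise-free 8 x 8 image (a simple gradient).
clean = np.linspace(0.0, 1.0, 64).reshape(8, 8)
sigma = 0.2        # assumed noise standard deviation
num_shots = 16     # assumed number of noisy captures of the same scene

# Each shot is the same scene plus independent Gaussian noise.
shots = clean + rng.normal(scale=sigma, size=(num_shots, 8, 8))

# Average the shots pixel by pixel.
averaged = shots.mean(axis=0)

mse_single = np.mean((shots[0] - clean) ** 2)
mse_averaged = np.mean((averaged - clean) ** 2)
print(mse_single, mse_averaged)
```

With independent noise, the error of the averaged image shrinks roughly in proportion to the number of shots, which is why having several captures would help if they were available.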

To solve this issue, the non-local means filter is used. Instead of relying on multiple images with random noise, it averages similar patches located in different regions of the same image. As shown in the figure above, image patch p is similar to other image patches (q1 and q2). These patches can be used to remove the noise in patch p.
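
To make the patch-averaging idea concrete, here is a minimal sketch of non-local means on a 1-D signal rather than an image. The function name `nl_means_1d`, the patch radius, and the filtering strength `h` are assumptions chosen for illustration, and the Gaussian weighting of patch distances is one common choice, not necessarily the exact variant the figure refers to.

```python
import numpy as np

def nl_means_1d(signal, patch_radius=2, h=0.3):
    """Replace each sample by a weighted mean of samples with similar patches."""
    n = len(signal)
    padded = np.pad(signal, patch_radius, mode="reflect")
    # Row i holds the patch (neighbourhood) centred on sample i.
    patches = np.stack([padded[i:i + 2 * patch_radius + 1] for i in range(n)])
    # Mean squared difference between every pair of patches, shape (n, n).
    dist2 = ((patches[:, None, :] - patches[None, :, :]) ** 2).mean(axis=-1)
    # Similar patches get large weights; h controls how quickly weights decay.
    weights = np.exp(-dist2 / h ** 2)
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ signal

rng = np.random.default_rng(0)
# A repeating step pattern, so every patch has many look-alikes elsewhere.
clean = np.tile(np.concatenate([np.zeros(16), np.ones(16)]), 8)
noisy = clean + rng.normal(scale=0.1, size=clean.shape)
denoised = nl_means_1d(noisy)

print(np.mean((noisy - clean) ** 2), np.mean((denoised - clean) ** 2))
```

The weight matrix plays the same role that the attention map plays in the non-local block: every position is compared with every other position, and similar positions contribute more to the output.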

Inspired by this idea, the NLNet authors proposed a non-local block. This block computes the interactions between all pairs of positions (pixels). To do this, the network performs several reshapes and matrix multiplications to produce an embedded feature representation.

As shown in the figure, if the block receives a T x H x W x 1024 input (T is the number of frames, since the original paper targets video classification), it produces a feature map of the same size. Matrix reshaping, multiplication, and element-wise addition are performed to capture the global dependencies between the pixels.
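
The reshape-and-multiply pattern can be sketched in NumPy as follows. This is a simplified illustration of a softmax-based non-local block, not the authors' implementation: the weight names (`w_theta`, `w_phi`, `w_g`, `w_z`) are assumed here, the 1 x 1 x 1 convolutions are written as plain matrix multiplications over the channel axis, and the channel count is shrunk from 1024 to 8 to keep the example small.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def non_local_block(x, w_theta, w_phi, w_g, w_z):
    """x has shape (T, H, W, C); the output has the same shape."""
    t, h, w, c = x.shape
    flat = x.reshape(-1, c)                    # (N, C) with N = T*H*W positions
    theta = flat @ w_theta                     # query embedding, (N, C_inner)
    phi = flat @ w_phi                         # key embedding,   (N, C_inner)
    g = flat @ w_g                             # value embedding, (N, C_inner)
    attn = softmax(theta @ phi.T, axis=-1)     # (N, N): every position vs. every other
    y = attn @ g                               # aggregate values from all positions
    out = flat + y @ w_z                       # project back to C channels, add residual
    return out.reshape(t, h, w, c)

rng = np.random.default_rng(0)
T, H, W, C, C_inner = 2, 3, 3, 8, 4            # toy sizes, chosen for illustration
x = rng.normal(size=(T, H, W, C))
w_theta, w_phi, w_g = (rng.normal(size=(C, C_inner)) for _ in range(3))
w_z = rng.normal(size=(C_inner, C))

out = non_local_block(x, w_theta, w_phi, w_g, w_z)
print(out.shape)
```

Note the (N, N) attention matrix: its size grows quadratically with the number of positions, which is the cost that GCNet later tries to avoid.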

To learn more about the paper, read the original paper here.

Squeeze-Excitation Network

Compared to NLNet, the Squeeze-Excitation network solves a different problem. While NLNet tries to capture global dependencies between pixels, SENet tries to capture inter-channel dependencies. The main idea behind SENet is to learn a weight for each channel of the feature map. By doing this, the network adaptively learns how much each channel should contribute.

The SE block takes the feature map (W x H x C) as an input. Its spatial dimensions are first squeezed by global average pooling into a vector of length C, which is then passed through two FC layers. The resulting vector, which has the same length as the number of input channels, goes through a sigmoid and is used to scale each channel based on its importance.
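
The squeeze and excitation steps can be sketched like this. The function name `se_block`, the weight names, and the toy sizes are assumptions for illustration, and biases are left out for brevity.

```python
import numpy as np

def se_block(x, w1, w2):
    """x: (H, W, C); w1: (C, C // r); w2: (C // r, C)."""
    # Squeeze: global average pooling collapses H x W into one value per channel.
    squeezed = x.mean(axis=(0, 1))                    # (C,)
    # Excitation: two FC layers with a ReLU in between.
    hidden = np.maximum(squeezed @ w1, 0.0)           # (C // r,)
    scale = 1.0 / (1.0 + np.exp(-(hidden @ w2)))      # sigmoid, one weight per channel
    # Rescale every channel of the input by its learned importance.
    return x * scale

rng = np.random.default_rng(0)
H, W, C, r = 4, 4, 16, 4                              # toy sizes, r is the reduction ratio
x = rng.normal(size=(H, W, C))
w1 = rng.normal(size=(C, C // r))
w2 = rng.normal(size=(C // r, C))

out = se_block(x, w1, w2)
print(out.shape)
```

Because the scale vector has only C entries, the block is cheap compared with the (N, N) attention map of the non-local block.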

To learn more about the paper, read the original paper here.

GCNet

The Global Context Network has three main ideas.

The simplified version of the NL block

Simplified Non-local Block: The authors propose a simplified version of the non-local block. The simplified version computes a single global (query-independent) attention map and shares it across all query positions. This change was motivated by the observation that the attention maps generated for different query positions are almost the same.

Furthermore, W_v is moved outside of the attention pooling to reduce the computational cost: the transform is then applied once to the pooled feature instead of once per position. The weight W_z is also removed, following the results in the Relation Networks paper.
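
Moving W_v outside the pooling does not change the result, because the pooling is a weighted sum and the transform is linear. The sketch below checks this numerically. The function names and toy sizes are assumptions for illustration, and the 1×1 convolutions are written as matrix multiplications over the channel axis.

```python
import numpy as np

def context_weights(flat, w_k):
    """One attention weight per position, shared by all query positions."""
    logits = flat @ w_k                     # (N,)
    e = np.exp(logits - logits.max())
    return e / e.sum()

def snl_transform_inside(flat, w_k, w_v):
    alpha = context_weights(flat, w_k)
    # Transform all N positions first, then pool: N matrix-vector products.
    return flat + alpha @ (flat @ w_v)

def snl_transform_outside(flat, w_k, w_v):
    alpha = context_weights(flat, w_k)
    # Pool first, then transform the single pooled vector once.
    return flat + (alpha @ flat) @ w_v

rng = np.random.default_rng(0)
N, C = 36, 8                                # toy sizes: 36 positions, 8 channels
flat = rng.normal(size=(N, C))
w_k = rng.normal(size=C)
w_v = rng.normal(size=(C, C))

a = snl_transform_inside(flat, w_k, w_v)
b = snl_transform_outside(flat, w_k, w_v)
print(np.allclose(a, b))
```

Both versions add the same context vector to every position. The second one simply does less work to compute it.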

The architecture of the main blocks

Global Context Modeling Framework: The main block (a in the above figure) used in the Global Context Network can be divided into three steps. First, global attention pooling applies a 1×1 convolution and a softmax function to obtain the attention weights, then uses those weights to pool the features of all positions into a single global context feature. Second, the global context feature is transformed via a 1×1 convolution. Third, the transformed feature is aggregated, by addition, into the features of each position.
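
The three steps can be written as a small generic function. This is a sketch under assumptions: the name `global_context_framework` is made up, the transform is passed in as an arbitrary callable so that different blocks can plug in different transforms, and a single 2-D feature map of shape (H, W, C) is used.

```python
import numpy as np

def global_context_framework(x, w_k, transform):
    """x: (H, W, C); w_k: (C,); transform maps a (C,) vector to a (C,) vector."""
    h, w, c = x.shape
    flat = x.reshape(-1, c)                 # (N, C) with N = H*W positions

    # Step 1, context modeling: 1x1 conv gives one logit per position,
    # softmax over positions gives the attention weights, then pool.
    logits = flat @ w_k
    alpha = np.exp(logits - logits.max())
    alpha /= alpha.sum()
    context = alpha @ flat                  # (C,) global context feature

    # Step 2, transform: any per-channel mapping of the context feature.
    delta = transform(context)

    # Step 3, fusion: add the same transformed context to every position.
    return (flat + delta).reshape(h, w, c)

rng = np.random.default_rng(0)
H, W, C = 4, 4, 8                           # toy sizes
x = rng.normal(size=(H, W, C))
w_v = rng.normal(size=(C, C))

# With a plain linear transform this behaves like the simplified NL block.
out = global_context_framework(x, rng.normal(size=C), lambda ctx: ctx @ w_v)
print(out.shape)
```

Only the transform in step 2 differs between the variants discussed here, which is what makes the framework a convenient way to compare them.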

Global Context Block: As illustrated in the figure above, the global context block combines the simplified NL block with ideas from the SE block. Some tweaks are made to reduce the number of parameters. For example, the 1×1 convolution used as the transform in the simplified NL block is replaced by a bottleneck transform module like the one in the SE block shown in figure (c). Instead of C·C parameters, the transform then needs only 2·C·C/r, where r is the bottleneck ratio and C/r is the hidden representation dimension of the bottleneck.
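
A minimal sketch of such a block, together with the parameter arithmetic, is shown below. The names `gc_block`, `w_down`, and `w_up` are assumptions for illustration. Biases are ignored, and the layer normalization that the paper places inside the bottleneck is omitted to keep the example short. C = 256 and r = 16 are example values, not figures taken from the review.

```python
import numpy as np

def gc_block(x, w_k, w_down, w_up):
    """x: (H, W, C); w_k: (C,); w_down: (C, C // r); w_up: (C // r, C)."""
    h, w, c = x.shape
    flat = x.reshape(-1, c)
    # Context modeling: global attention pooling shared by all positions.
    logits = flat @ w_k
    alpha = np.exp(logits - logits.max())
    alpha /= alpha.sum()
    context = alpha @ flat                          # (C,)
    # Bottleneck transform: C -> C/r -> C with a ReLU in between
    # (normalization omitted in this sketch).
    hidden = np.maximum(context @ w_down, 0.0)      # (C // r,)
    delta = hidden @ w_up                           # (C,)
    # Fusion: broadcast addition to every position.
    return (flat + delta).reshape(h, w, c)

C, r = 256, 16                                      # example values
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 4, C))
w_k = rng.normal(size=C) * 0.01
w_down = rng.normal(size=(C, C // r)) * 0.01
w_up = rng.normal(size=(C // r, C)) * 0.01

out = gc_block(x, w_k, w_down, w_up)

full_params = C * C                                 # single 1x1 conv transform
bottleneck_params = w_down.size + w_up.size         # 2 * C * C / r
print(full_params, bottleneck_params)               # 65536 8192
```

For these example values the bottleneck needs one eighth of the transform parameters, which matches the 2/r ratio implied by the formula.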

Results

The experiments are performed on object detection and segmentation tasks. Different architecture designs are compared in the table above. The authors observe that adding GC blocks clearly improves the performance of the baseline model.

Also, one thing to note is that the number of parameters and FLOPs does not increase dramatically. This is probably due to the lightweight design inherited from both the simplified NL block and the SE block.

Conclusion

The paper successfully combines the two ideas to come up with a model that outperforms existing baseline networks. The authors address the long-range dependency problem by proposing a block that merges a simplified NL block with an SE-style bottleneck transform. By adding only a small number of parameters, the model outperforms existing models on object detection and recognition tasks.