Source: Deep Learning on Medium
Self-attention mechanism in CNN
In order to give each pixel-level prediction a global reference, Wang et al. proposed a self-attention mechanism for CNNs (Fig. 3). Their approach is based on the covariance between the predicted pixel and every other pixel, where each pixel is treated as a random variable.
If we reduce the original Fig. 3 to its simplest form, as in Fig. 4, we can easily see the role covariance plays in the mechanism. First, we have an input feature map X with height H and width W. We then reshape X into three flattened copies A, B and C, and multiply A by the transpose of B to get a covariance matrix of size HW×HW. Finally, we multiply the covariance matrix by C to get D, and reshape D back into the output feature map Y, adding a residual connection from the input X. Each item in D is therefore a weighted sum over the whole input X, where the weight attached to each position is its covariance with the position being predicted.
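The simplified flow of Fig. 4 can be sketched in a few lines of NumPy. This is a minimal illustration, not the full module of Wang et al.: here A, B and C are plain flattened copies of X (the original work first passes X through learned 1×1 convolutions), and a softmax is used to normalize each row of the HW×HW matrix so that every output position is a weighted average of all input positions:

```python
import numpy as np

def simple_self_attention(X):
    """Simplified self-attention over a feature map X of shape (H, W, C)."""
    H, W, C = X.shape
    # Flatten spatial dimensions: A, B, Cmat each have shape (HW, C).
    # In this sketch they are identity copies of X; the full mechanism
    # would apply separate learned 1x1 convolutions to produce them.
    A = X.reshape(H * W, C)
    B = X.reshape(H * W, C)
    Cmat = X.reshape(H * W, C)
    # Pairwise similarity ("covariance") matrix of size (HW, HW).
    cov = A @ B.T
    # Row-wise softmax so each row of weights sums to 1.
    w = np.exp(cov - cov.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)
    # D: every row is a weighted sum over all HW positions, shape (HW, C).
    D = w @ Cmat
    # Reshape back and add the residual connection from the input X.
    return D.reshape(H, W, C) + X
```

Note that the HW×HW matrix is what makes the reference global: the output at any one position mixes in information from every other position in the feature map, no matter how far apart they are.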
By leveraging the self-attention mechanism, the model gains a global reference for every prediction, during both training and inference.