After reading the SAGAN (Self-Attention GAN) paper (link here), I wanted to try it out and experiment with it further. Since the authors’ code isn’t available yet, I decided to write a package for it, similar to my previous “pro-gan-pth” package. I first trained a model as described in the SAGAN paper, but then realised that I could play around with the image-based attention mechanism some more. This blog is a quick report of that experiment.
Full Attention Layer
The SAGAN architecture just adds one self-attention layer to the generator and one to the discriminator of the DCGAN architecture. In addition, the layer uses (1 x 1) convolutions to create the Q, K and V feature banks for self-attention. This immediately raised two questions for me: can attention generalise to (k x k) convolutions? And can we create a unified layer that performs feature extraction (like a traditional convolution layer) and attention simultaneously?
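For reference, here is a minimal PyTorch sketch of the kind of (1 x 1) self-attention layer described above. It is only an illustration and not the code from the SAGAN authors or from my package: the class name SelfAttention2d, the reduction factor of 8 and the zero-initialised gamma are assumptions made for this example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SelfAttention2d(nn.Module):
    """Illustrative SAGAN-style self-attention over spatial positions."""

    def __init__(self, channels, reduction=8):
        super().__init__()
        inner = max(channels // reduction, 1)  # assumed channel squeeze for Q and K
        # Q, K and V feature banks all come from (1 x 1) convolutions
        self.query = nn.Conv2d(channels, inner, kernel_size=1)
        self.key = nn.Conv2d(channels, inner, kernel_size=1)
        self.value = nn.Conv2d(channels, channels, kernel_size=1)
        # trainable residual weight, assumed to start at 0
        self.gamma = nn.Parameter(torch.zeros(1))

    def forward(self, x):
        b, c, h, w = x.shape
        n = h * w  # number of spatial positions
        q = self.query(x).reshape(b, -1, n).transpose(1, 2)  # (b, n, inner)
        k = self.key(x).reshape(b, -1, n)                    # (b, inner, n)
        v = self.value(x).reshape(b, c, n)                   # (b, c, n)
        # attn[:, i, j] is how much position i attends to position j
        attn = F.softmax(torch.bmm(q, k), dim=-1)            # (b, n, n)
        out = torch.bmm(v, attn.transpose(1, 2)).reshape(b, c, h, w)
        return self.gamma * out + x, attn
```

Because every projection is (1 x 1), each query, key and value depends on a single spatial position only, which is what makes this plain self-attention.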
I figured that we can address both questions using a unified attention-cum-feature-extraction layer. I like to call it the full attention layer, and a GAN architecture made up of just these layers would be a Full Attention GAN.
Figure 2 describes the architecture of the proposed full attention layer. As you can see, on the upper path we compute the traditional convolution output, and on the lower path we have an attention layer which generalises to (k x k) convolution filters instead of just (1 x 1) filters. The alpha shown in the residual calculation is a trainable parameter.
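The two paths can be sketched in PyTorch roughly as follows. Treat this as a simplified illustration rather than the exact layer in the package: the class name FullAttention2d, the squeeze factor, the zero-initialised alpha and the plain additive combination (conv output plus alpha times attention output) are assumptions for this example, and it leaves out extras such as spectral normalization.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class FullAttention2d(nn.Module):
    """Illustrative full attention layer: convolution path plus (k x k) attention path."""

    def __init__(self, in_channels, out_channels, kernel_size=3, squeeze=8):
        super().__init__()
        assert kernel_size % 2 == 1, "odd kernels keep the spatial size with this padding"
        pad = kernel_size // 2
        inner = max(in_channels // squeeze, 1)  # assumed channel squeeze for Q and K
        # upper path: traditional convolution
        self.conv = nn.Conv2d(in_channels, out_channels, kernel_size, padding=pad)
        # lower path: Q, K and V come from (k x k) convolutions instead of (1 x 1)
        self.query = nn.Conv2d(in_channels, inner, kernel_size, padding=pad)
        self.key = nn.Conv2d(in_channels, inner, kernel_size, padding=pad)
        self.value = nn.Conv2d(in_channels, out_channels, kernel_size, padding=pad)
        # trainable residual weight, assumed to start at 0 (pure convolution)
        self.alpha = nn.Parameter(torch.zeros(1))

    def forward(self, x):
        b, _, h, w = x.shape
        n = h * w  # number of spatial positions
        conv_out = self.conv(x)                               # (b, out, h, w)
        q = self.query(x).reshape(b, -1, n).transpose(1, 2)   # (b, n, inner)
        k = self.key(x).reshape(b, -1, n)                     # (b, inner, n)
        v = self.value(x).reshape(b, -1, n)                   # (b, out, n)
        attn = F.softmax(torch.bmm(q, k), dim=-1)             # (b, n, n)
        attn_out = torch.bmm(v, attn.transpose(1, 2)).reshape(b, -1, h, w)
        # assumed residual form: convolution output plus weighted attention output
        return conv_out + self.alpha * attn_out, attn
```

With alpha at zero the layer behaves like an ordinary convolution, and the attention path contributes more as alpha moves away from zero during training. Setting kernel_size=1 makes the lower path use (1 x 1) projections again, as in SAGAN.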
Now, why is the lower path not self-attention? The reason is that, while computing the attention maps, the input is first locally aggregated by the (k x k) convolutions, so it is no longer just self-attention: each query, key and value already incorporates a small spatial neighbourhood. Given enough depth and filter size, we could cover the entire input image as a receptive field for a subsequent attention calculation, hence the name: Full Attention.
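To get a feel for "enough depth and filter size", here is a small back-of-the-envelope helper. It uses the standard receptive-field recurrence for stacked convolutions and considers only the convolutional aggregation, not the attention itself. The function names are made up for this example, and it assumes every layer uses the same kernel size with stride 1 unless told otherwise.

```python
def receptive_field(num_layers, kernel_size, stride=1):
    """Side length of the receptive field of a stack of identical conv layers."""
    rf, jump = 1, 1  # jump = distance in input pixels between neighbouring outputs
    for _ in range(num_layers):
        rf += (kernel_size - 1) * jump
        jump *= stride
    return rf


def layers_to_cover(image_size, kernel_size):
    """Smallest number of stride-1 layers whose receptive field spans image_size pixels."""
    layers = 0
    while receptive_field(layers, kernel_size) < image_size:
        layers += 1
    return layers


print(receptive_field(3, 3))   # 7: three stacked (3 x 3) layers see a 7 x 7 patch
print(layers_to_cover(64, 3))  # 32 layers of (3 x 3) filters for a 64-pixel side
print(layers_to_cover(64, 5))  # 16 layers of (5 x 5) filters for a 64-pixel side
```

With stride 1 the receptive field grows only linearly with depth, as 1 + L(k - 1), so larger filters or strided layers reach full image coverage much sooner.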
I must say that the current trend of “Attention is all you need” was indeed a major driving force behind this experiment of mine. The experimentation is still going on, but I really wanted to get the idea out and obtain suggestions for further experimentation.
I realise that the trained model’s alpha residual parameters can in fact reveal some important traits of the attention mechanism, which is what I will be working on next.
The attn_gan_pytorch package contains an example of SAGAN trained on CelebA for reference. The package also contains generic implementations of self-attention, spectral normalization and the proposed full attention layer, so that anyone can cook up their own architecture.
Any feedback / suggestions / contributions are highly welcome.
Thank you for reading!
Source: Deep Learning on Medium