Source: Deep Learning on Medium

By Ninad Shukla

This work is a part of the AI Without Borders Initiative.

Co-contributors: Chinmay Pathak, Kevin Garda, Gagana B, Tony Holdroyd, Daniel J Broz.

Read about CAM (Class Activation Mapping) here.

In a neural network, the gradient of a function f(x) is the vector of its partial derivatives; it points in the direction of the greatest rate of increase of that function. Using the gradient information flowing into the last convolutional layer of a generic convolutional network, Grad-CAM produces class-specific localisation maps of the significant regions of the image, making black-box models more transparent by displaying visualizations that support the output predictions. In other words, Grad-CAM fuses pixel-space gradient visualisation with the class-discriminative property of localisation maps.

Grad-CAM assumes that the final class score, as described below, can always be expressed as a generalised linear combination of globally average-pooled feature maps. This combination depends on the weight assigned to each feature map and on Z, the number of pixels in the activation map.

The final convolutional feature map of the input image is weighted channel by channel with respect to the class: every channel k in the feature map is weighted by the gradient of the class score with respect to that channel. Concretely, the gradient of the class output with respect to the feature map is global-average-pooled over the two spatial dimensions (i, j) to give an importance weight for each channel. Each weight multiplies its feature map along the channel axis k, and the result is summed over the channel dimension. The resulting spatial score map, of size i × j, is then passed through the non-linear ReLU transformation so that only regions with a positive influence on the class remain. The score for a class correlates directly with the importance of the class-specific saliency map, which in turn drives the final prediction output.
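A minimal NumPy sketch of the steps above, assuming the activations `A` (shape k × i × j) and the gradients `dY_dA` of the class score with respect to them have already been extracted from the network (both array names are hypothetical):

```python
import numpy as np

def grad_cam(A, dY_dA):
    """Compute a Grad-CAM map from activations and their gradients.

    A     : feature maps of the last conv layer, shape (k, i, j)
    dY_dA : gradient of the class score y^c w.r.t. A, same shape
    """
    # Global average pooling of the gradients over the spatial
    # dimensions (i, j): one importance weight per channel.
    alpha = dY_dA.mean(axis=(1, 2))          # shape (k,)

    # Weighted combination of the forward activation maps,
    # summed over the channel axis k.
    cam = np.tensordot(alpha, A, axes=1)     # shape (i, j)

    # ReLU keeps only regions with a positive influence on the class.
    return np.maximum(cam, 0)

# Toy example: 2 channels of 4x4 activations.
rng = np.random.default_rng(0)
A = rng.random((2, 4, 4))
dY_dA = rng.standard_normal((2, 4, 4))
cam = grad_cam(A, dY_dA)
print(cam.shape)  # (4, 4), all values >= 0
```

In a real model the gradients would come from autograd (e.g. a backward hook on the last convolutional layer); the arithmetic afterwards is exactly this.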

Grad-CAM combined with existing pixel-space visualizations to create a high-resolution, class-discriminative visualisation is called Guided Grad-CAM. Together they are employed to solve various image classification and visual question answering problems, and Guided Grad-CAM has the innate ability to localise even small objects. In the guided backpropagation used by this variant, the backward ReLU pass is modified to pass only positive gradients through positively activated regions; on its own this yields sharp, high-resolution maps but little class discrimination, so it is fused with Grad-CAM, which contributes the class-discriminative ability while the guided pass improves localisation detail. Specifically in the image-captioning space, the guided backpropagation algorithm helps obtain coarse localisation along with a high-resolution visualization highlighting the regions that support a generated caption.
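A sketch of the fusion step, assuming a coarse Grad-CAM map `cam` (at conv-layer resolution) and a guided-backpropagation saliency map `gb` (at input resolution) have already been computed; the nearest-neighbour upsampling here is a simple stand-in for the bilinear interpolation used in practice:

```python
import numpy as np

def upsample_nearest(m, out_h, out_w):
    """Nearest-neighbour upsampling of a coarse (h, w) map.
    (A stand-in for the bilinear interpolation used in practice.)"""
    h, w = m.shape
    rows = np.arange(out_h) * h // out_h
    cols = np.arange(out_w) * w // out_w
    return m[np.ix_(rows, cols)]

def guided_grad_cam(cam, gb):
    """Element-wise product of the upsampled Grad-CAM map with the
    guided-backpropagation map: high resolution from gb,
    class discrimination from cam."""
    cam_up = upsample_nearest(cam, *gb.shape)
    return cam_up * gb

cam = np.array([[0.0, 1.0],
                [0.5, 0.0]])   # coarse 2x2 Grad-CAM map (made-up values)
gb = np.ones((4, 4))           # dummy guided-backprop map
out = guided_grad_cam(cam, gb)
print(out.shape)  # (4, 4)
```

Because the product is element-wise, any region that Grad-CAM scores as irrelevant to the class is suppressed in the guided map as well, which is what makes the combined visualisation class-discriminative.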

The architecture is as shown below.


In Grad-CAM, we want to preserve the spatial location information of the object, which is lost in a fully connected layer. The last convolutional layer is therefore used, as its neurons identify class-specific parts of the image.

To obtain a Grad-CAM map of width u and height v for any class c, we first compute the gradient of the score yc for class c (before the softmax) with respect to the feature maps Ak of a convolutional layer, i.e. ∂yc / ∂Ak.

After obtaining these gradients, the following equation computes the importance weight αck of each feature map k for class c using the global average pooling technique:

αck = (1/Z) Σi Σj ∂yc / ∂Akij

where Z is the number of pixels in the feature map.

We then perform a weighted combination of the forward activation maps, followed by a ReLU:

LcGrad-CAM = ReLU( Σk αck Ak )

ReLU is the preferred choice here because it highlights the features that have a positive influence on the class of interest: the regions of interest are those pixels whose intensity increases as yc increases, i.e. pixels with a positive gradient. Without ReLU, the localisation map can include more than the desired class, such as negative pixels that likely belong to other categories in the image, thereby hurting localization performance.
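A toy illustration of this effect: without the ReLU, a region that counts against the class drags negative values into the map, while clamping at zero keeps only the positively contributing region (all values here are made up):

```python
import numpy as np

# A raw (pre-ReLU) spatial score map: the left half supports the
# class, the right half counts against it (e.g. another object).
raw_map = np.array([[ 0.8,  0.6, -0.4, -0.7],
                    [ 0.9,  0.5, -0.3, -0.6]])

cam = np.maximum(raw_map, 0)   # ReLU: keep positive influence only
print(cam)
# The negative right half is zeroed out, so only the supporting
# region survives in the final localisation map.
```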

The class score Yc for a particular class c is computed as:

Yc = Σk wck · (1/Z) Σi Σj Akij

which, by interchanging the order of summations in the class score obtained for CAM, yields the CAM localisation map Lcam.
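Writing the interchange out explicitly, in the CAM notation where wck are the final-layer classifier weights, Akij the activations, and Z the number of pixels:

```latex
Y^c = \sum_k w^c_k \, \frac{1}{Z} \sum_i \sum_j A^k_{ij}
    = \frac{1}{Z} \sum_i \sum_j \sum_k w^c_k A^k_{ij},
\qquad
L^{c}_{\mathrm{CAM}} = \sum_k w^c_k A^k
```

so the CAM map is simply the classifier-weighted sum of activations before the spatial pooling.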

Grad-CAM can be considered one of the initial steps in the larger picture of interpretable or explainable AI: its visualizations lend insight into failure modes and help identify bias, while outperforming previous benchmarks. This generalisation of the CAM algorithm also works around a weakness of backpropagation-based approaches, in which a downsampled relevance map must be upsampled to obtain only a coarse relevance heatmap. Unlike CAM, Grad-CAM requires no retraining and is broadly applicable to a wide variety of CNN architectures, including CNNs with fully connected layers (e.g. VGGNet), CNNs for structured outputs, and CNNs with multimodal inputs or reinforcement learning.


Drawbacks of Grad-CAM include an inability to localize multiple occurrences of an object in an image, and inaccurate localisation of the heatmap with respect to coverage of the class region, owing to the partial-derivative premise. The repeated upsampling and downsampling may also result in loss of signal.