Class Activation Mapping (CAM) Methods

Sources:

  1. Grad-CAM (and Guided Grad-CAM) 2016 paper
  2. CAM 2015 paper (Bolei Zhou)
  3. Guided Backpropagation

Code: Jax Feature Attribution Methods

Notation

Suppose we have a convolutional neural network (CNN) that takes an image as input and outputs a scalar target (e.g., the score for a particular class).

| Symbol | Type | Explanation |
|---|---|---|
| $K$ | $\mathbb{N}$ | Number of convolutional layers, or number of feature maps, in a CNN |
| $k$ | $\mathbb{N}$ | Index of the convolutional layer, or feature map, in a CNN |
| $H_k, W_k, C_k$ | $\mathbb{N}$ | Height, width, and number of channels of the $k$-th feature map |
| $f^k$ | $\mathbb{R}^{H_k \times W_k \times C_k}$ | The $k$-th feature map, $k \in \{1, \dots, K\}$ |
| $[f^K]$ | $:= \{f^1, \dots, f^K\}$ | Set of convolutional layers, or feature maps, in a CNN |
| $i, j, c$ | $\mathbb{N}$ | Integer indices for height, width, and channel |
| $f^k_{i,j,c}$ | $\mathbb{R}$ | The activation value of the $k$-th feature map at index $(i, j, c)$ |
| $F^k$ | $\mathbb{R}^{C_k}$ | The spatial average of the $k$-th feature map $f^k$, $k \in \{1, \dots, K\}$ |
| $F^k_c$ | $\mathbb{R}$ | The activation value of $F^k$ at channel index $c$ |
| $\mathrm{Class}$ | $\mathbb{N}$ | The number of classes in the CNN prediction |
| $\mathrm{class}$ | $\mathbb{N}$ | The integer index for a class |
| $w^{\mathrm{class}}$ | $\mathbb{R}^{C_K}$ | The CAM weights corresponding to class $\mathrm{class}$ for the spatial average $F^K$ of the last feature map |
| $S^{\mathrm{class}}$ | $\mathbb{R}$ | The class score (softmax input) for class $\mathrm{class}$ |
| $P^{\mathrm{class}}$ | $\mathbb{R}$ | The CNN output for class $\mathrm{class}$, i.e., its softmax probability |
| $a^{\mathrm{class}}$ | $\mathbb{R}^{C_K}$ | The Grad-CAM weights corresponding to class $\mathrm{class}$ for the last feature map $f^K$ |

CAM

Source: Grad CAM explanation by CampusAI

Suppose we want to perform a classification task that maps an input image to the probability of each class, using a Convolutional Neural Network (CNN). Class Activation Mapping (CAM) requires that the CNN include a Global Average Pooling (GAP) layer followed by a Fully Connected (FC) layer, which serves as the classifier before the softmax layer. CAM then produces a heatmap that highlights the regions of the image that are relevant to the model's prediction for the target class.

Forward pass

The forward pass of the CNN follows these steps:

  1. Forward Pass: The input image is passed through the CNN model to obtain the feature map $f^K$ from the last convolutional layer.

  2. Global Average Pooling: For each channel of $f^K$, compute the spatial average $$F^K = \frac{1}{H_K W_K} \sum_{i,j} f^K_{i,j},$$ where $F^K$ is a vector with shape $(C_K,)$.

  3. Score Computation: For a given class $\mathrm{class}$, the input to the softmax, $S^{\mathrm{class}}$, is computed as $$S^{\mathrm{class}} = \sum_c w_c^{\mathrm{class}} F_c^K = \frac{1}{H_K W_K} \sum_c \sum_{i,j} w_c^{\mathrm{class}} f^K_{i,j,c},$$ where $w_c^{\mathrm{class}}$ is the scalar weight corresponding to class $\mathrm{class}$ for $F_c^K$. Essentially, $w_c^{\mathrm{class}}$ indicates the importance of $F_c^K$ for class $\mathrm{class}$.

  4. Softmax Output: Finally, the output of the softmax is given by $$P^{\mathrm{class}} = \frac{\exp(S^{\mathrm{class}})}{\sum_{\mathrm{class}'} \exp(S^{\mathrm{class}'})}.$$ These four steps are sketched in JAX right after this list.
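Below is a minimal JAX sketch of these four steps. The single-convolution backbone, all array shapes, and the names (`features`, `forward`, `conv_kernel`, `w`) are assumptions for illustration, not the architecture from the CAM paper.

```python
import jax
import jax.numpy as jnp

H, W, C_K, CLASS = 32, 32, 8, 10                      # hypothetical sizes

key = jax.random.PRNGKey(0)
k1, k2, k3 = jax.random.split(key, 3)
conv_kernel = jax.random.normal(k1, (3, 3, 3, C_K))   # (kh, kw, in, out) in HWIO
w = jax.random.normal(k2, (C_K, CLASS))               # FC weights w_c^class

def features(image):
    """Stub backbone: one conv layer standing in for the CNN; returns f^K."""
    x = jax.lax.conv_general_dilated(
        image[None], conv_kernel, window_strides=(1, 1), padding="SAME",
        dimension_numbers=("NHWC", "HWIO", "NHWC"))
    return jax.nn.relu(x[0])                          # (H_K, W_K, C_K)

def forward(image):
    f_K = features(image)                             # step 1: last feature map f^K
    F_K = f_K.mean(axis=(0, 1))                       # step 2: GAP, shape (C_K,)
    S = F_K @ w                                       # step 3: class scores S^class
    P = jax.nn.softmax(S)                             # step 4: softmax output
    return P, f_K

image = jax.random.normal(k3, (H, W, 3))
P, f_K = forward(image)                               # P: (CLASS,), f_K: (H, W, C_K)
```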

Generating CAM

We define $M_{\mathrm{CAM}}^{\mathrm{class}}$ as the Class Activation Map (CAM) for a specific class $\mathrm{class}$, where:

$$M_{\mathrm{CAM}}^{\mathrm{class}} = \sum_c w_c^{\mathrm{class}} f_c^K \tag{1}$$

After computing this, the resulting heatmap is upsampled to match the original image size using bilinear interpolation. The final upsampled heatmap has the shape $(H, W, 1)$.
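A minimal sketch of equation (1) plus the bilinear upsampling, assuming a last feature map `f_K` of shape $(H_K, W_K, C_K)$ and FC weights `w` of shape $(C_K, \mathrm{Class})$; both names and all sizes here are hypothetical:

```python
import jax
import jax.numpy as jnp

def cam(f_K, w, class_idx, out_hw):
    # Equation (1): M_CAM^class = sum_c w_c^class * f_c^K, shape (H_K, W_K).
    heatmap = jnp.einsum("ijc,c->ij", f_K, w[:, class_idx])
    # Bilinear upsampling to the input resolution; final shape (H, W, 1).
    return jax.image.resize(heatmap[..., None], (*out_hw, 1), method="bilinear")

f_K = jax.random.normal(jax.random.PRNGKey(0), (8, 8, 16))  # dummy f^K
w = jax.random.normal(jax.random.PRNGKey(1), (16, 10))      # dummy CAM weights
print(cam(f_K, w, class_idx=3, out_hw=(224, 224)).shape)    # (224, 224, 1)
```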

Grad-CAM

Source: Grad CAM explanation by CampusAI

CAM relies on a specific CNN architecture that includes a Global Average Pooling (GAP) layer and one Fully Connected layer before the softmax layer.

Grad-CAM extends the original CAM method, making it applicable to a broader range of CNN architectures. The Grad-CAM map is defined as $$M_{\mathrm{Grad\text{-}CAM}}^{\mathrm{class}} = \mathrm{ReLU}\!\left(\sum_c a_c^{\mathrm{class}} f_c^K\right),$$ where the weights $a_c^{\mathrm{class}}$ are computed as follows: $$a_c^{\mathrm{class}} = \frac{1}{H_K W_K} \sum_{i,j} \frac{\partial S^{\mathrm{class}}}{\partial f^K_{i,j,c}}. \tag{2}$$
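Equation (2) maps directly onto `jax.grad`. In the sketch below, `features` and `score` are assumed stand-ins for an arbitrary CNN split at the last convolutional layer (here a trivial backbone and a GAP + FC head, purely for illustration):

```python
import jax
import jax.numpy as jnp

def grad_cam(features, score, image, class_idx, out_hw):
    f_K = features(image)                                   # (H_K, W_K, C_K)
    # Equation (2): channel weights are the spatial mean of dS^class/df^K.
    grads = jax.grad(lambda f: score(f)[class_idx])(f_K)
    a = grads.mean(axis=(0, 1))                             # (C_K,)
    heatmap = jax.nn.relu(jnp.einsum("ijc,c->ij", f_K, a))
    return jax.image.resize(heatmap[..., None], (*out_hw, 1), method="bilinear")

# Hypothetical split: `features` maps the image to f^K, `score` maps f^K to S.
W_fc = jax.random.normal(jax.random.PRNGKey(0), (16, 10))
features = lambda img: jax.nn.relu(img)                     # stub backbone
score = lambda f: f.mean(axis=(0, 1)) @ W_fc                # GAP + FC head
img = jax.random.normal(jax.random.PRNGKey(1), (8, 8, 16))
print(grad_cam(features, score, img, class_idx=3, out_hw=(32, 32)).shape)  # (32, 32, 1)
```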

For CNN architectures like those required for CAM, i.e., CNNs with a GAP layer and a Fully Connected layer before softmax, the weights used in Grad-CAM are equivalent to those in CAM. Here is the proof:

The score $S^{\mathrm{class}}$ for these CNNs is computed by $$S^{\mathrm{class}} = \frac{1}{H_K W_K} \sum_c \sum_{i,j} w_c^{\mathrm{class}} f^K_{i,j,c}.$$

Computing the partial derivative gives $$\frac{\partial S^{\mathrm{class}}}{\partial f^K_{i,j,c}} = \frac{1}{H_K W_K} w_c^{\mathrm{class}}.$$ Thus, we have $$w_c^{\mathrm{class}} = H_K W_K \, \frac{\partial S^{\mathrm{class}}}{\partial f^K_{i,j,c}} = \sum_{i,j} \frac{\partial S^{\mathrm{class}}}{\partial f^K_{i,j,c}}. \tag{3}$$

Up to the proportionality constant $\frac{1}{H_K W_K}$, which is normalized out during visualization, the expression for $w_c^{\mathrm{class}}$ is identical to the weight $a_c^{\mathrm{class}}$ used by Grad-CAM.
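The equivalence is easy to check numerically. The snippet below builds a random GAP + FC head (all sizes arbitrary) and confirms that $H_K W_K \, a_c^{\mathrm{class}} = w_c^{\mathrm{class}}$:

```python
import jax
import jax.numpy as jnp

H_K, W_K, C_K, CLASS = 7, 7, 16, 10
f_K = jax.random.normal(jax.random.PRNGKey(0), (H_K, W_K, C_K))
w = jax.random.normal(jax.random.PRNGKey(1), (C_K, CLASS))

score = lambda f: f.mean(axis=(0, 1)) @ w                 # S^class = sum_c w_c F_c
grads = jax.grad(lambda f: score(f)[3])(f_K)              # dS^class/df^K for class 3
a = grads.mean(axis=(0, 1))                               # Grad-CAM weights, eq. (2)

print(jnp.allclose(a * H_K * W_K, w[:, 3], atol=1e-5))    # True, matching eq. (3)
```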

In principle, the Grad-CAM method can be applied to any convolutional layer, and heatmaps computed at multiple layers can be combined by averaging the individual Grad-CAM maps.

However, in the original Grad-CAM (and Guided Grad-CAM) paper from 2016, the method was applied only to the last convolutional layer. The authors note:

"Although our technique is fairly general in that it can be used to explain activations in any layer of a deep network, in this work, we focus on explaining output layer decisions only."