Class Activation Mapping (CAM) Methods

Sources:

  1. Grad-CAM (and Guided Grad-CAM) 2016 paper
  2. CAM 2015 paper (Bolei Zhou)
  3. Guided Backpropagation

Code: Jax Feature Attribution Methods

Notation

Suppose we have a convolutional neural network (CNN) that takes an image as input and outputs a scalar target.

| Symbol | Type | Explanation |
|---|---|---|
| \(K\) | \(\in \mathbb{N}\) | Number of convolutional layers, or number of feature maps, in a CNN |
| \(k\) | \(\in \mathbb{N}\) | Index of a convolutional layer, or feature map, in a CNN |
| \(H^k, W^k, C^k\) | \(\in \mathbb{N}\) | Height, width, and number of channels of the \(k\)-th feature map |
| \(f^k\) | \(\in \mathbb{R}^{H^k \times W^k \times C^k}\) | The \(k\)-th feature map |
| \([f^K]\) | \(:=\{f^1, \ldots, f^K\}\) | Set of convolutional layers or feature maps in a CNN |
| \(i, j, c\) | \(\in \mathbb{N}\) | Integer indices for height, width, and channel |
| \(f^k_{i,j,c}\) | \(\in \mathbb{R}\) | The activation value of the \(k\)-th feature map at index \((i,j,c)\) |
| \(F^k\) | \(\in \mathbb{R}^{C^k}\) | The spatial average of the \(k\)-th feature map \(f^k\) |
| \(F^k_{c}\) | \(\in \mathbb{R}\) | The activation value of \(F^k\) at channel index \(c\) |
| \(\text{Class}\) | \(\in \mathbb{N}\) | The number of classes in the CNN prediction |
| \(\text{class}\) | \(\in \mathbb{N}\) | The integer index for a class |
| \(w^\text{class}\) | \(\in \mathbb{R}^{C^K}\) | The CAM weights corresponding to \(\text{class}\) for the spatial average of the last feature map \(F^K\) |
| \(S^\text{class}\) | \(\in \mathbb{R}\) | The class score (softmax input) for \(\text{class}\) |
| \(P^{\text{class}}\) | \(\in [0, 1]\) | The softmax output, i.e., the predicted probability for \(\text{class}\) |
| \(a^\text{class}\) | \(\in \mathbb{R}^{C^K}\) | The Grad-CAM weights corresponding to \(\text{class}\) for the channels of the last feature map \(f^K\) |

CAM

Source: Grad CAM explanation by CampusAI

Suppose we want to perform a classification task with a Convolutional Neural Network (CNN) that takes an image as input and outputs the probability of each class. Class Activation Mapping (CAM) requires that the CNN includes a Global Average Pooling (GAP) layer followed by a Fully Connected (FC) layer, which serves as the classifier before the softmax layer. CAM then produces a heatmap that highlights the regions of the image that are relevant to the model's prediction for the target class.

Forward pass

The forward pass of the CNN follows these steps (a JAX sketch is given after the list):

  1. Forward Pass: The input image is passed through the CNN model to obtain the feature map \(f^K\) from the last convolutional layer.

  2. Global Average Pooling: For each channel of \(f^K\), compute the spatial average \[ F^K=\frac{1}{H^K W^K} \sum_{i, j} f_{i, j}^K \] where \(F^K\) is a vector with the shape \((C^K,)\).

  3. Score Computation: For a given class \(\text{class}\), the input to the softmax, \(S^\text{class}\), is computed as: \[ \begin{aligned} S^{\text {class }} & =\sum_c w_c^{\text {class }} F_c^K \\ & =\frac{1}{H^K W^K} \sum_c \sum_{i, j} w_c^{\text {class }} f_{i, j, c}^K, \end{aligned} \] where \(w_c^\text{class}\) is the scalar weight corresponding to \(\text{class}\) for \(F_c^K\). Essentially, \(w_c^\text{class}\) indicates the importance of \(F_c^K\) for class \(\text{class}\).

  4. Softmax Output: Finally, the softmax output for \(\text{class}\) is given by: \[ P^{\text{class}}=\frac{\exp \left(S^{\text{class}}\right)}{\sum_{\text{class}'} \exp \left(S^{\text{class}'}\right)} \]
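Below is a minimal JAX sketch of this forward pass. The convolutional backbone is omitted: `f_K` is just a random placeholder standing in for the last feature map \(f^K\), and `w` is a hypothetical FC weight matrix whose columns are the per-class weights \(w^\text{class}\).

```python
import jax
import jax.numpy as jnp

def cam_head(f_K, w):
    """GAP + FC head: maps the last feature map f^K to class scores and probabilities."""
    F_K = f_K.mean(axis=(0, 1))       # Global Average Pooling, shape (C^K,)
    S = F_K @ w                       # class scores S^class, shape (Class,)
    P = jax.nn.softmax(S)             # softmax probabilities P^class, shape (Class,)
    return S, P

# Hypothetical shapes: a 7x7 feature map with 512 channels and 10 classes.
key_f, key_w = jax.random.split(jax.random.PRNGKey(0))
f_K = jax.random.normal(key_f, (7, 7, 512))      # stands in for the last conv feature map
w = jax.random.normal(key_w, (512, 10)) * 0.01   # columns are the CAM weights w^class
S, P = cam_head(f_K, w)
```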

Generating CAM

We define \(M^\text{class}_{\text{CAM}}\) as the Class Activation Map (CAM) for a specific class \(\text{class}\), where:

\[ \begin{equation} \label{eq1} M^\text{class}_{\text{CAM}}=\sum_c w_c^\text{class} f^K_{c} , \end{equation} \] where \(f^K_{c} \in \mathbb{R}^{H^K \times W^K}\) denotes the \(c\)-th channel of \(f^K\). The resulting map is then upsampled to the original image size \((H, W)\) using bilinear interpolation, so the final heatmap has the shape \((H, W, 1)\).
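A minimal sketch of this step in JAX, reusing the placeholder `f_K` and `w` from the sketch above; the target image size \((224, 224)\) is an arbitrary assumption, and `jax.image.resize` performs the bilinear upsampling.

```python
def cam(f_K, w, class_idx, image_size):
    """Class Activation Map: weighted channel sum of f^K, upsampled to the image size."""
    w_class = w[:, class_idx]                          # (C^K,) CAM weights for the chosen class
    heatmap = jnp.einsum('ijc,c->ij', f_K, w_class)    # (H^K, W^K) weighted sum over channels
    heatmap = jax.image.resize(heatmap, image_size, method='bilinear')
    return heatmap[..., None]                          # shape (H, W, 1)

M_cam = cam(f_K, w, class_idx=3, image_size=(224, 224))
```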

Grad-CAM

Source: Grad CAM explanation by CampusAI

CAM relies on a specific CNN architecture that includes a Global Average Pooling (GAP) layer and one Fully Connected layer before the softmax layer.

Grad-CAM extends the original CAM method, making it applicable to a broader range of CNN architectures. The Grad-CAM map is defined as: \[ M^\text{class}_{\text{Grad-CAM}}=\operatorname{ReLU}\left(\sum_c a_c^\text{class} f^K_{c}\right), \] where the weights \(a_c^\text{class}\) are computed as: \[ \begin{equation} \label{eq2} a_c^\text{class}=\frac{1}{H^K W^K} \sum_{i, j} \frac{\partial S^\text{class}}{\partial f_{i, j, c}^K} . \end{equation} \]
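A minimal JAX sketch of this computation, assuming `score_fn` is a function that maps the feature map of the chosen layer to the scalar score \(S^\text{class}\); here it is built from the hypothetical GAP + FC head above, but in general it is whatever part of the network sits on top of that layer.

```python
def grad_cam(score_fn, f_K, image_size):
    """Grad-CAM: spatially averaged gradients as channel weights, ReLU, then upsampling."""
    grads = jax.grad(score_fn)(f_K)                 # dS^class / df^K, shape (H^K, W^K, C^K)
    a_class = grads.mean(axis=(0, 1))               # weights a_c^class, shape (C^K,)
    heatmap = jax.nn.relu(jnp.einsum('ijc,c->ij', f_K, a_class))
    return jax.image.resize(heatmap, image_size, method='bilinear')

# Score of class 3 as a function of the feature map, using the GAP + FC head above.
score_fn = lambda f: cam_head(f, w)[0][3]
M_grad_cam = grad_cam(score_fn, f_K, image_size=(224, 224))
```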

For CNN architectures like those required for CAM, i.e., CNNs with a GAP layer and a Fully Connected layer before the softmax, the Grad-CAM weights are equivalent (up to a constant factor) to the CAM weights. Here is the proof:

The score \(S^\text{class}\) for these CNNs is computed by: \[ S^\text{class} = \frac{1}{H^K W^K} \sum_{c } \sum_{i, j} w_c^{\text{class}} f^K_{i,j, c} . \]

Computing the partial derivative: \[ \frac{\partial S^\text{class}}{\partial f_{i, j, c}^K} = \frac{1}{H^K W^K} w_c^{\text{class}} . \] Thus, we have: \[ \begin{equation} \label{eq3} w_c^{\text{class}} = H^K W^K \frac{\partial S^\text{class}}{\partial f_{i, j, c}^K} = \sum_{i, j} \frac{\partial S^\text{class}}{\partial f_{i, j, c}^K} . \end{equation} \]

Up to a proportionality constant \(\frac{1}{H^K W^K}\), which gets normalized out during visualization, the expression for \(w_c^\text{class}\) is identical to \(a_c^\text{class}\) used by Grad-CAM.
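As a quick numerical check of this proportionality (using the hypothetical GAP + FC head from the sketches above), the spatially averaged gradients should equal \(w^\text{class} / (H^K W^K)\):

```python
class_idx = 3
grads = jax.grad(lambda f: cam_head(f, w)[0][class_idx])(f_K)
a_class = grads.mean(axis=(0, 1))                  # Grad-CAM weights a^class
H_K, W_K, _ = f_K.shape
print(jnp.allclose(a_class, w[:, class_idx] / (H_K * W_K), atol=1e-6))  # expected: True
```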

In principle, the Grad-CAM method can be applied to any convolutional layer, not only the last one; maps computed at several layers can also be combined by averaging them.

However, in the original Grad-CAM (and Guided Grad-CAM) paper from 2016, this method was only applied to the last convolutional layer. The authors note:

"Although our technique is fairly general in that it can be used to explain activations in any layer of a deep network, in this work, we focus on explaining output layer decisions only."