Class Activation Mapping (CAM) Methods

Sources:

  1. Grad-CAM (and Guided Grad-CAM) 2016 paper
  2. CAM 2015 paper (Bolei Zhou)
  3. Guided Backpropagation

Code: Jax Feature Attribution Methods

Notation

Suppose we have a convolutional neural network (CNN) that takes an image as input and outputs a scalar target (e.g., the score for a particular class).

| Symbol | Type | Explanation |
|---|---|---|
| $K$ | $\mathbb{N}$ | Number of convolutional layers, or number of feature maps, in a CNN |
| $k$ | $\mathbb{N}$ | Index of the convolutional layer, or feature map, in a CNN |
| $H_k, W_k, C_k$ | $\mathbb{N}$ | Height, width, and number of channels of the $k$-th feature map |
| $f^k$ | $\mathbb{R}^{H_k \times W_k \times C_k}$ | The $k$-th feature map, $k \in \{1, \dots, K\}$ |
| $[f^K]$ | $:= \{f^1, \dots, f^K\}$ | Set of convolutional layers, or feature maps, in a CNN |
| $i, j, c$ | $\mathbb{N}$ | Integer indices for height, width, and channel |
| $f^k_{i,j,c}$ | $\mathbb{R}$ | The activation value of the $k$-th feature map at index $(i, j, c)$ |
| $F^k$ | $\mathbb{R}^{C_k}$ | The spatial average of the $k$-th feature map $f^k$, $k \in \{1, \dots, K\}$ |
| $F^k_c$ | $\mathbb{R}$ | The activation value of $F^k$ at channel index $c$ |
| $\mathrm{Class}$ | $\mathbb{N}$ | The number of classes in the CNN prediction |
| $\mathrm{class}$ | $\mathbb{N}$ | The integer index for a class |
| $w^{\mathrm{class}}$ | $\mathbb{R}^{C_K}$ | The CAM weights corresponding to class $\mathrm{class}$ for the spatial average $F^K$ of the last feature map |
| $S^{\mathrm{class}}$ | $\mathbb{R}$ | The class score (softmax input) for class $\mathrm{class}$ |
| $P^{\mathrm{class}}$ | $\mathbb{R}$ | The CNN output for class $\mathrm{class}$, i.e., its softmax probability |
| $a^{\mathrm{class}}$ | $\mathbb{R}^{C_K}$ | The Grad-CAM weights corresponding to class $\mathrm{class}$ for the last feature map $f^K$ |

CAM

Source: Grad CAM explanation by CampusAI

Suppose we want to perform a classification task that maps an input image to the probability of each class, using a Convolutional Neural Network (CNN). Class Activation Mapping (CAM) requires that the CNN include a Global Average Pooling (GAP) layer followed by a Fully Connected (FC) layer, which serves as the classifier before the softmax layer. CAM then produces a heatmap that highlights the regions of the image that are relevant to the model's prediction for the target class.

Forward pass

The forward pass of the CNN follows these steps:

  1. Forward Pass: The input image is passed through the CNN model to obtain the feature map $f^K$ from the last convolutional layer.

  2. Global Average Pooling: For each channel of $f^K$, compute the spatial average $$F^K = \frac{1}{H_K W_K} \sum_{i,j} f^K_{i,j},$$ where $F^K$ is a vector with shape $(C_K,)$.

  3. Score Computation: For a given class $\mathrm{class}$, the input to the softmax, $S^{\mathrm{class}}$, is computed as $$S^{\mathrm{class}} = \sum_c w_c^{\mathrm{class}} F_c^K = \frac{1}{H_K W_K} \sum_c \sum_{i,j} w_c^{\mathrm{class}} f^K_{i,j,c},$$ where $w_c^{\mathrm{class}}$ is the scalar weight corresponding to class $\mathrm{class}$ for $F_c^K$. Essentially, $w_c^{\mathrm{class}}$ indicates the importance of $F_c^K$ for class $\mathrm{class}$.

  4. Softmax Output: Finally, the output of the softmax is given by $$P^{\mathrm{class}} = \frac{\exp(S^{\mathrm{class}})}{\sum_{\mathrm{class}'} \exp(S^{\mathrm{class}'})}.$$ These four steps are sketched in JAX right after this list.
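Below is a minimal JAX sketch of these four steps. The single-convolution backbone, all array shapes, and the names (`features`, `forward`, `conv_kernel`, `w`) are assumptions for illustration, not the architecture from the CAM paper.

```python
import jax
import jax.numpy as jnp

H, W, C_K, CLASS = 32, 32, 8, 10                      # hypothetical sizes

key = jax.random.PRNGKey(0)
k1, k2, k3 = jax.random.split(key, 3)
conv_kernel = jax.random.normal(k1, (3, 3, 3, C_K))   # (kh, kw, in, out) in HWIO
w = jax.random.normal(k2, (C_K, CLASS))               # FC weights w_c^class

def features(image):
    """Stub backbone: one conv layer standing in for the CNN; returns f^K."""
    x = jax.lax.conv_general_dilated(
        image[None], conv_kernel, window_strides=(1, 1), padding="SAME",
        dimension_numbers=("NHWC", "HWIO", "NHWC"))
    return jax.nn.relu(x[0])                          # (H_K, W_K, C_K)

def forward(image):
    f_K = features(image)                             # step 1: last feature map f^K
    F_K = f_K.mean(axis=(0, 1))                       # step 2: GAP, shape (C_K,)
    S = F_K @ w                                       # step 3: class scores S^class
    P = jax.nn.softmax(S)                             # step 4: softmax output
    return P, f_K

image = jax.random.normal(k3, (H, W, 3))
P, f_K = forward(image)                               # P: (CLASS,), f_K: (H, W, C_K)
```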

Generating CAM

We define $M_{\mathrm{CAM}}^{\mathrm{class}}$ as the Class Activation Map (CAM) for a specific class $\mathrm{class}$, where:

$$M_{\mathrm{CAM}}^{\mathrm{class}} = \sum_c w_c^{\mathrm{class}} f_c^K \tag{1}$$

After computing this, the resulting heatmap is upsampled to match the original image size using bilinear interpolation. The final upsampled heatmap has the shape $(H, W, 1)$.
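A minimal sketch of equation (1) plus the bilinear upsampling, assuming a last feature map `f_K` of shape $(H_K, W_K, C_K)$ and FC weights `w` of shape $(C_K, \mathrm{Class})$; both names and all sizes here are hypothetical:

```python
import jax
import jax.numpy as jnp

def cam(f_K, w, class_idx, out_hw):
    # Equation (1): M_CAM^class = sum_c w_c^class * f_c^K, shape (H_K, W_K).
    heatmap = jnp.einsum("ijc,c->ij", f_K, w[:, class_idx])
    # Bilinear upsampling to the input resolution; final shape (H, W, 1).
    return jax.image.resize(heatmap[..., None], (*out_hw, 1), method="bilinear")

f_K = jax.random.normal(jax.random.PRNGKey(0), (8, 8, 16))  # dummy f^K
w = jax.random.normal(jax.random.PRNGKey(1), (16, 10))      # dummy CAM weights
print(cam(f_K, w, class_idx=3, out_hw=(224, 224)).shape)    # (224, 224, 1)
```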

Grad-CAM

Source: Grad CAM explanation by CampusAI

CAM relies on a specific CNN architecture that includes a Global Average Pooling (GAP) layer and one Fully Connected layer before the softmax layer.

Grad-CAM extends the original CAM method, making it applicable to a broader range of CNN architectures. The Grad-CAM map is defined as $$M_{\mathrm{Grad\text{-}CAM}}^{\mathrm{class}} = \mathrm{ReLU}\!\left(\sum_c a_c^{\mathrm{class}} f_c^K\right),$$ where the weights $a_c^{\mathrm{class}}$ are computed as follows: $$a_c^{\mathrm{class}} = \frac{1}{H_K W_K} \sum_{i,j} \frac{\partial S^{\mathrm{class}}}{\partial f^K_{i,j,c}}. \tag{2}$$
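Equation (2) maps directly onto `jax.grad`. In the sketch below, `features` and `score` are assumed stand-ins for an arbitrary CNN split at the last convolutional layer (here a trivial backbone and a GAP + FC head, purely for illustration):

```python
import jax
import jax.numpy as jnp

def grad_cam(features, score, image, class_idx, out_hw):
    f_K = features(image)                                   # (H_K, W_K, C_K)
    # Equation (2): channel weights are the spatial mean of dS^class/df^K.
    grads = jax.grad(lambda f: score(f)[class_idx])(f_K)
    a = grads.mean(axis=(0, 1))                             # (C_K,)
    heatmap = jax.nn.relu(jnp.einsum("ijc,c->ij", f_K, a))
    return jax.image.resize(heatmap[..., None], (*out_hw, 1), method="bilinear")

# Hypothetical split: `features` maps the image to f^K, `score` maps f^K to S.
W_fc = jax.random.normal(jax.random.PRNGKey(0), (16, 10))
features = lambda img: jax.nn.relu(img)                     # stub backbone
score = lambda f: f.mean(axis=(0, 1)) @ W_fc                # GAP + FC head
img = jax.random.normal(jax.random.PRNGKey(1), (8, 8, 16))
print(grad_cam(features, score, img, class_idx=3, out_hw=(32, 32)).shape)  # (32, 32, 1)
```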

For CNN architectures like those required for CAM, i.e., CNNs with a GAP layer and a Fully Connected layer before softmax, the weights used in Grad-CAM are equivalent to those in CAM. Here is the proof:

The score $S^{\mathrm{class}}$ for these CNNs is computed by $$S^{\mathrm{class}} = \frac{1}{H_K W_K} \sum_c \sum_{i,j} w_c^{\mathrm{class}} f^K_{i,j,c}.$$

Computing the partial derivative gives $$\frac{\partial S^{\mathrm{class}}}{\partial f^K_{i,j,c}} = \frac{1}{H_K W_K} w_c^{\mathrm{class}}.$$ Thus, we have $$w_c^{\mathrm{class}} = H_K W_K \, \frac{\partial S^{\mathrm{class}}}{\partial f^K_{i,j,c}} = \sum_{i,j} \frac{\partial S^{\mathrm{class}}}{\partial f^K_{i,j,c}}. \tag{3}$$

Up to the proportionality constant $\frac{1}{H_K W_K}$, which is normalized out during visualization, the expression for $w_c^{\mathrm{class}}$ is identical to the weight $a_c^{\mathrm{class}}$ used by Grad-CAM.
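The equivalence is easy to check numerically. The snippet below builds a random GAP + FC head (all sizes arbitrary) and confirms that $H_K W_K \, a_c^{\mathrm{class}} = w_c^{\mathrm{class}}$:

```python
import jax
import jax.numpy as jnp

H_K, W_K, C_K, CLASS = 7, 7, 16, 10
f_K = jax.random.normal(jax.random.PRNGKey(0), (H_K, W_K, C_K))
w = jax.random.normal(jax.random.PRNGKey(1), (C_K, CLASS))

score = lambda f: f.mean(axis=(0, 1)) @ w                 # S^class = sum_c w_c F_c
grads = jax.grad(lambda f: score(f)[3])(f_K)              # dS^class/df^K for class 3
a = grads.mean(axis=(0, 1))                               # Grad-CAM weights, eq. (2)

print(jnp.allclose(a * H_K * W_K, w[:, 3], atol=1e-5))    # True, matching eq. (3)
```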

In principle, the Grad-CAM method can be applied to any convolutional layer, and heatmaps computed at multiple layers can be combined by averaging the individual Grad-CAM maps.

However, in the original Grad-CAM (and Guided Grad-CAM) paper from 2016, the method was applied only to the last convolutional layer. The authors note:

"Although our technique is fairly general in that it can be used to explain activations in any layer of a deep network, in this work, we focus on explaining output layer decisions only."