ResNet

Source:

  1. ResNet tutorial from UvA

Architecture

The popular ResNet architecture consists of three main components:

  1. An initial convolutional layer that downsamples the input by a factor of 2.
  2. A max pooling layer that further downsamples the input by a factor of 2.
  3. Groups (or stages) of ResNet blocks, where all blocks within a group have the same output shape. From the 2nd stage onwards, the first block of each stage downsamples by a factor of 2.

Note: Notation such as [3,3,3] describes the ResNet block structure. [3,3,3] means there are 3 stages of 3 blocks each, with downsampling in the first block of the 2nd and 3rd stages, i.e. at the fourth and seventh blocks overall. The visualization below shows the ResNet with [3,3,3] blocks on CIFAR-10.
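To make the notation concrete, here is a minimal sketch that lists which blocks (counted from 1) downsample for a given stage specification. The helper name `downsampling_blocks` is hypothetical and not part of the tutorial; it only encodes the rule stated above.

```python
def downsampling_blocks(blocks_per_stage):
    """Return the 1-indexed positions of the blocks that downsample.

    Assumes the rule described above: the first block of every stage
    except the first one downsamples by a factor of 2.
    """
    positions = []
    first_block = 1  # overall index of the first block in the current stage
    for stage_idx, num_blocks in enumerate(blocks_per_stage):
        if stage_idx > 0:
            positions.append(first_block)
        first_block += num_blocks
    return positions


print(downsampling_blocks([3, 3, 3]))  # [4, 7]
```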

The initial convolutional layer

The initial convolutional layer of ResNet has kernel size 7, stride 2, and padding 3. For an input of spatial size \(I\), \[ \text { Output size }=\left\lfloor\frac{I+2 \times 3-7}{2}\right\rfloor+1 =\left\lfloor\frac{I-1}{2}\right\rfloor+1 . \] For even \(I\) this equals \(I/2\), so the layer reduces each spatial dimension by a factor of 2.

The max pooling layer

The initial convolution is followed by a max pooling layer with kernel size (also called window size) 3, stride 2, and padding 1. For an input of spatial size \(I\) to this layer, \[ \text { Output size }=\left\lfloor\frac{I+2 \times 1-3}{2}\right\rfloor+1 =\left\lfloor\frac{I-1}{2}\right\rfloor+1 . \] For even \(I\) this again equals \(I/2\), so the layer also reduces each spatial dimension by a factor of 2.
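The two formulas above can be checked numerically. The following is a small sketch in plain Python; the function names `output_size` and `stem_output_size` are illustrative, and the input sizes 224 and 32 are only example values.

```python
def output_size(size, kernel, stride, padding):
    """Spatial output size of a convolution or pooling layer (no dilation)."""
    return (size + 2 * padding - kernel) // stride + 1


def stem_output_size(size):
    """Apply the 7x7 stride-2 convolution, then the 3x3 stride-2 max pooling."""
    after_conv = output_size(size, kernel=7, stride=2, padding=3)
    return output_size(after_conv, kernel=3, stride=2, padding=1)


for size in (224, 32):
    after_conv = output_size(size, kernel=7, stride=2, padding=3)
    print(size, "->", after_conv, "->", stem_output_size(size))
# 224 -> 112 -> 56
# 32 -> 16 -> 8
```

Both layers halve an even input size, so together they reduce it by a factor of 4.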

Stacked ResNet blocks

The ResNet with [3,3,3] blocks on CIFAR-10 is visualized below.

[Figure: ResNet with [3,3,3] blocks on CIFAR-10]

In common implementations, all the standard ResNet variants, including ResNet18, ResNet50, and ResNet101, have 4 stages. The number of blocks in each stage is listed below.

# Source: https://github.com/matthias-wright/flaxmodels/blob/600ce8a6b6bf2926ccfc948e7c1ff35edc330d5b/flaxmodels/resnet/resnet.py#L17-L21
# Number of residual blocks in each of the 4 stages, per variant.
LAYERS = {
    "resnet18": [2, 2, 2, 2],
    "resnet34": [3, 4, 6, 3],
    "resnet50": [3, 4, 6, 3],
    "resnet101": [3, 4, 23, 3],
    "resnet152": [3, 8, 36, 3],
}
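Note that ResNet34 and ResNet50 share the same stage configuration. The sketch below shows how the number in each name can be recovered from the table. It relies on an assumption that is not stated in the text above but follows the original ResNet design: ResNet18 and ResNet34 use basic blocks with 2 convolutions each, while the deeper variants use bottleneck blocks with 3. The names `CONVS_PER_BLOCK` and `weighted_layer_count` are hypothetical.

```python
# Stage configuration copied from the table above.
LAYERS = {
    "resnet18": [2, 2, 2, 2],
    "resnet34": [3, 4, 6, 3],
    "resnet50": [3, 4, 6, 3],
    "resnet101": [3, 4, 23, 3],
    "resnet152": [3, 8, 36, 3],
}

# Assumption: basic blocks (2 convolutions) for the shallow variants,
# bottleneck blocks (3 convolutions) for the deeper ones.
CONVS_PER_BLOCK = {
    "resnet18": 2,
    "resnet34": 2,
    "resnet50": 3,
    "resnet101": 3,
    "resnet152": 3,
}


def weighted_layer_count(name):
    """Count convolutions in all blocks, plus the initial convolution
    and the final fully connected layer."""
    return sum(LAYERS[name]) * CONVS_PER_BLOCK[name] + 2


for name in LAYERS:
    print(name, weighted_layer_count(name))
# resnet18 18
# resnet34 34
# resnet50 50
# resnet101 101
# resnet152 152
```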

Example: ResNet50

  1. Initial Convolution and Max Pooling:

    • The input image first goes through a \(7 \times 7\) convolution with a stride of 2, which reduces the spatial dimensions by a factor of 2.
    • This is followed by a \(3 \times 3\) max pooling layer with a stride of 2, which further reduces the spatial dimensions by a factor of 2.
    After these initial layers, the spatial dimensions are reduced by a factor of 4 in total.

  2. Residual stages:

    • ResNet50 consists of 4 stages of convolutional blocks, with each stage containing multiple residual blocks.
    • Downsampling occurs at the beginning of each stage (except the first stage) using a convolution with a stride of 2.

    Let's break down each stage:

    1. Stage 1: No downsampling (input dimensions remain the same).
    2. Stage 2: Downsampling by a factor of 2.
    3. Stage 3: Downsampling by a factor of 2.
    4. Stage 4: Downsampling by a factor of 2.

To calculate the overall downsampling ratio of ResNet50:

  1. Initial downsampling: \(\frac{1}{4}\) (due to initial convolution and max pooling)
  2. Downsampling at each stage: \(\frac{1}{2} \times \frac{1}{2} \times \frac{1}{2}\) (due to downsampling at the beginning of stages 2, 3, and 4)

Overall downsampling ratio: \[ \frac{1}{4} \times \frac{1}{2} \times \frac{1}{2} \times \frac{1}{2}=\frac{1}{32} \]
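As a sanity check, the sketch below traces the spatial size of a feature map through the network. It assumes a 224-pixel input (an example value) and that each stride-2 layer in stages 2 to 4 follows the same \(\lfloor (I-1)/2 \rfloor + 1\) rule as the initial layers. The function name `trace_spatial_sizes` is hypothetical.

```python
def trace_spatial_sizes(input_size, num_stages=4):
    """Return (layer name, spatial size) pairs through a ResNet-style network."""
    sizes = [("input", input_size)]

    # Initial 7x7 convolution: stride 2, padding 3.
    size = (input_size + 2 * 3 - 7) // 2 + 1
    sizes.append(("conv 7x7, stride 2", size))

    # 3x3 max pooling: stride 2, padding 1.
    size = (size + 2 * 1 - 3) // 2 + 1
    sizes.append(("max pool 3x3, stride 2", size))

    for stage in range(1, num_stages + 1):
        if stage > 1:
            # Assumption: the first block of stages 2+ halves the size
            # with the same floor((I - 1) / 2) + 1 rule.
            size = (size - 1) // 2 + 1
        sizes.append((f"stage {stage}", size))
    return sizes


trace = trace_spatial_sizes(224)
for name, size in trace:
    print(f"{name}: {size}")
# input: 224
# conv 7x7, stride 2: 112
# max pool 3x3, stride 2: 56
# stage 1: 56
# stage 2: 28
# stage 3: 14
# stage 4: 7

print(224 // trace[-1][1])  # 32
```

The final 7 x 7 feature map for a 224 x 224 input matches the overall ratio of 1/32.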