# ResNet

Source:

- ResNet tutorial from UvA

# Architecture

The popular ResNet architecture consists of three main components:

- An initial convolutional layer that downsamples the input by a factor of 2.
- A max pooling layer that further downsamples the input by a factor of 2.
- Groups (or *stages*) of ResNet blocks, where all blocks within a group have the same output shape. From the 2nd stage onwards, the first block of each stage downsamples by a factor of 2.

Note: Notations such as `[3,3,3]` are used to represent the ResNet block structure. `[3,3,3]` indicates that there are 3 stages of 3 blocks each, with downsampling occurring in the first block of the 2nd and 3rd stages, that is, at the fourth and seventh blocks. The visualization below shows the ResNet with `[3,3,3]` blocks on CIFAR-10.
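The block-list notation can also be unpacked programmatically. The following is a minimal sketch, not taken from the tutorial; the function name `downsampling_block_indices` is hypothetical, and it assumes only the rule stated above (the first block of every stage except the first downsamples).

```python
def downsampling_block_indices(blocks_per_stage):
    """Return the 1-based indices of the blocks that downsample.

    Assumes the rule from the text: the first block of every stage
    except the first one applies downsampling.
    """
    indices = []
    first_block = 1  # 1-based index of the first block in the current stage
    for stage, num_blocks in enumerate(blocks_per_stage):
        if stage > 0:
            indices.append(first_block)
        first_block += num_blocks
    return indices


print(downsampling_block_indices([3, 3, 3]))  # [4, 7]
```

For `[3,3,3]` this reproduces the fourth and seventh blocks mentioned in the note.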

# The initial convolutional layer

The initial convolutional layer of ResNet has kernel size 7, stride 2, and padding 3. Therefore, for an input of spatial size \(I\), \[ \text{Output size}=\left\lfloor\frac{I+2 \times 3-7}{2}\right\rfloor+1=\left\lfloor\frac{I-1}{2}\right\rfloor+1, \] which equals \(I/2\) for even \(I\), so the layer reduces the spatial dimensions by a factor of 2.
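The formula is easy to check numerically. The sketch below uses a hypothetical helper, `conv_output_size`, that implements the general output-size formula along one axis; the input sizes 32, 224, and 225 are illustrative choices, not values from the source.

```python
def conv_output_size(input_size: int, kernel: int, stride: int, padding: int) -> int:
    """Spatial output size along one axis: floor((I + 2p - k) / s) + 1."""
    return (input_size + 2 * padding - kernel) // stride + 1


# Initial ResNet convolution: kernel 7, stride 2, padding 3.
for size in (32, 224, 225):
    print(size, "->", conv_output_size(size, kernel=7, stride=2, padding=3))
# 32 -> 16
# 224 -> 112
# 225 -> 113
```

Even input sizes are halved exactly; odd sizes are rounded up after halving.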

# The max pooling layer

The convolution is followed by a max pooling layer with kernel size (also called window size) 3, stride 2, and padding 1. Therefore, \[ \text{Output size}=\left\lfloor\frac{I+2 \times 1-3}{2}\right\rfloor+1=\left\lfloor\frac{I-1}{2}\right\rfloor+1, \] which means that it also reduces the spatial dimensions by a factor of 2.
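As a worked example, assume a \(224 \times 224\) input (a common ImageNet crop size; this value is an assumption, not given above). Applying the convolution and then the max pooling formula gives \[ \left\lfloor\frac{224-1}{2}\right\rfloor+1=112, \qquad \left\lfloor\frac{112-1}{2}\right\rfloor+1=56, \] so the two layers together reduce each spatial dimension by a factor of 4.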

# Stacked ResNet blocks

The ResNet with `[3,3,3]` blocks on CIFAR-10 is visualized below.

In implementations such as flaxmodels, all variants of ResNet, including ResNet18, ResNet50, and ResNet101, have 4 stages.

Source: https://github.com/matthias-wright/flaxmodels/blob/600ce8a6b6bf2926ccfc948e7c1ff35edc330d5b/flaxmodels/resnet/resnet.py#L17-L21
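For illustration, the per-stage block counts commonly given for these variants (as in the original ResNet paper) can be written in the block-list notation from above. This is a sketch, not a copy of the linked file; the dictionary name `STAGE_BLOCKS` is hypothetical.

```python
# Blocks per stage for three common variants (standard ResNet configurations).
# STAGE_BLOCKS is a hypothetical name used only for this sketch.
STAGE_BLOCKS = {
    "resnet18": [2, 2, 2, 2],
    "resnet50": [3, 4, 6, 3],
    "resnet101": [3, 4, 23, 3],
}

for name, blocks in STAGE_BLOCKS.items():
    # Every variant has exactly 4 stages; only the depth per stage differs.
    print(f"{name}: {len(blocks)} stages, {sum(blocks)} residual blocks")
# resnet18: 4 stages, 8 residual blocks
# resnet50: 4 stages, 16 residual blocks
# resnet101: 4 stages, 33 residual blocks
```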

# Example: ResNet50

Initial Convolution and Max Pooling:

- The input image first goes through a \(7 \times 7\) convolution with a stride of 2, which reduces the spatial dimensions by a factor of 2.
- This is followed by a \(3 \times 3\) max pooling layer with a stride of 2, which further reduces the spatial dimensions by a factor of 2.

After these initial layers, the spatial dimensions are reduced by a factor of 4.

- ResNet50 consists of 4 stages of convolutional blocks, with each stage containing multiple residual blocks.
- Downsampling occurs at the beginning of each stage (except the first stage) using a convolution with a stride of 2.

Let's break down each stage:

- Stage 1: No downsampling (input dimensions remain the same).
- Stage 2: Downsampling by a factor of 2.
- Stage 3: Downsampling by a factor of 2.
- Stage 4: Downsampling by a factor of 2.

To calculate the overall downsampling ratio of ResNet50:

- Initial downsampling: \(\frac{1}{4}\) (due to initial convolution and max pooling)
- Downsampling at each stage: \(\frac{1}{2} \times \frac{1}{2} \times \frac{1}{2}\) (due to downsampling at the beginning of stages 2, 3, and 4)

Overall downsampling ratio: \[ \frac{1}{4} \times \frac{1}{2} \times \frac{1}{2} \times \frac{1}{2}=\frac{1}{32} \]
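The same bookkeeping can be sketched in code. The names `STRIDES`, `overall_stride`, and `feature_map_sizes` are hypothetical, the 224-pixel input is an assumed example size, and the size update assumes every stride-2 layer follows \(\lfloor (I-1)/2 \rfloor + 1\) as derived for the initial layers.

```python
from math import prod

# Stride contributed by each part of ResNet50, in order of application.
STRIDES = {
    "initial conv": 2,
    "max pool": 2,
    "stage 1": 1,  # no downsampling
    "stage 2": 2,
    "stage 3": 2,
    "stage 4": 2,
}


def overall_stride(strides):
    """Total downsampling factor: the product of all strides."""
    return prod(strides.values())


def feature_map_sizes(input_size, strides):
    """Spatial size after each part, assuming size -> floor((size - 1) / s) + 1."""
    sizes = {}
    size = input_size
    for name, stride in strides.items():
        size = (size - 1) // stride + 1
        sizes[name] = size
    return sizes


print(overall_stride(STRIDES))  # 32
print(list(feature_map_sizes(224, STRIDES).values()))  # [112, 56, 56, 28, 14, 7]
```

Under these assumptions, a 224-pixel input ends with a \(7 \times 7\) feature map, consistent with the \(\frac{1}{32}\) ratio.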