ResNet
Source:
- ResNet tutorial from UvA
Architecture
The popular ResNet architecture consists of three main components:
- An initial convolutional layer that downsamples the input by a factor of 2.
- A max pooling layer that further downsamples the input by a factor of 2.
- Groups (or stages) of ResNet blocks, where all blocks within each group have the same output shape. From the 2nd stage onwards, the first block of each stage downsamples by a factor of 2.
Note: Notations such as [3,3,3] are used to represent the ResNet block structure. [3,3,3] indicates that there are 3 stages of 3 blocks each, with downsampling occurring in the first block of the 2nd and 3rd stages, that is, at the fourth and seventh blocks. The visualization below shows the ResNet with [3,3,3] blocks on CIFAR-10.
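To make the notation concrete, here is a minimal sketch that maps a list of per-stage block counts to the positions of the downsampling blocks. The function name `downsampling_blocks` is hypothetical and used only for illustration; it assumes, as described above, that the first block of every stage after the first one downsamples.

```python
def downsampling_blocks(stage_sizes):
    """Return the 1-based indices of the blocks that downsample.

    Assumption: the first block of every stage except the first downsamples.
    """
    indices = []
    first_block = 1  # 1-based index of the first block in the current stage
    for stage, num_blocks in enumerate(stage_sizes):
        if stage > 0:
            indices.append(first_block)
        first_block += num_blocks
    return indices


print(downsampling_blocks([3, 3, 3]))  # [4, 7]
```

For [3,3,3] this gives the fourth and seventh blocks, matching the note above.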
The initial convolutional layer
The initial convolutional layer of ResNet has kernel size 7, stride 2, and padding 3. Therefore, for an input of spatial size \(I\), \[ \text{Output size}=\left\lfloor\frac{I+2 \times 3-7}{2}\right\rfloor+1=\left\lfloor\frac{I-1}{2}\right\rfloor+1, \] which means that it reduces the spatial dimensions by a factor of 2.
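As a quick check of the factor-of-2 claim, the formula can be evaluated separately for even and odd input sizes. Here \(k\) is an auxiliary positive integer introduced only for this illustration: \[ I=2k:\quad \left\lfloor\frac{2k-1}{2}\right\rfloor+1=(k-1)+1=k=\frac{I}{2}, \qquad I=2k+1:\quad \left\lfloor\frac{2k}{2}\right\rfloor+1=k+1=\frac{I+1}{2}. \] In both cases the output size is \(\lceil I/2 \rceil\). For example, \(I=224\) gives 112 and \(I=225\) gives 113.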
The maxpooling layer
A max pooling layer follows the initial convolution, with kernel size (also called window size) 3, stride 2, and padding 1. Therefore, for an input of spatial size \(I\), \[ \text{Output size}=\left\lfloor\frac{I+2 \times 1-3}{2}\right\rfloor+1=\left\lfloor\frac{I-1}{2}\right\rfloor+1, \] which means that it also reduces the spatial dimensions by a factor of 2.
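The two output-size formulas can be sketched in a few lines of plain Python. The helper names `output_size` and `stem_output_size` are hypothetical; the kernel, stride, and padding values are the ones quoted above for the initial convolution and the max pooling layer.

```python
def output_size(size, kernel, stride, padding):
    """Spatial output size of a convolution or pooling layer (floor division)."""
    return (size + 2 * padding - kernel) // stride + 1


def stem_output_size(size):
    """Apply the 7x7/stride-2 convolution, then the 3x3/stride-2 max pooling."""
    after_conv = output_size(size, kernel=7, stride=2, padding=3)
    after_pool = output_size(after_conv, kernel=3, stride=2, padding=1)
    return after_pool


print(stem_output_size(224))  # 56
```

A 224-pixel input becomes 112 after the convolution and 56 after the pooling, a combined reduction by a factor of 4.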
Stacked ResNet blocks
The ResNet with [3,3,3] blocks on CIFAR-10 is visualized below.
In common implementations, all ResNet variants, including ResNet18, ResNet50, and ResNet101, have 4 stages.
# Source: https://github.com/matthias-wright/flaxmodels/blob/600ce8a6b6bf2926ccfc948e7c1ff35edc330d5b/flaxmodels/resnet/resnet.py#L17-L21
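The referenced lines are not reproduced here. As a hedged sketch, the per-stage block counts below are the standard ones from the ResNet paper, not a copy of the linked file. The names `STAGE_SIZES`, `CONVS_PER_BLOCK`, and `num_weighted_layers` are hypothetical. The layer count assumes basic blocks with 2 convolutions for ResNet18 and bottleneck blocks with 3 convolutions for ResNet50 and ResNet101.

```python
# Per-stage block counts (illustrative names; values follow the ResNet paper).
STAGE_SIZES = {
    "resnet18": [2, 2, 2, 2],
    "resnet50": [3, 4, 6, 3],
    "resnet101": [3, 4, 23, 3],
}

# Convolutions per block: 2 in a basic block, 3 in a bottleneck block.
CONVS_PER_BLOCK = {"resnet18": 2, "resnet50": 3, "resnet101": 3}


def num_weighted_layers(name):
    """Initial convolution + all block convolutions + final fully connected layer."""
    return 1 + CONVS_PER_BLOCK[name] * sum(STAGE_SIZES[name]) + 1


for name, sizes in STAGE_SIZES.items():
    print(name, len(sizes), num_weighted_layers(name))
```

Each variant has a list of length 4, one entry per stage, and the layer count recovers the number in the model name.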
Example: ResNet 50
Initial Convolution and Max Pooling:
- The input image first goes through a \(7 \times 7\) convolution with a stride of 2, which reduces the spatial dimensions by a factor of 2.
- This is followed by a \(3 \times 3\) max pooling layer with a stride of 2, which further reduces the spatial dimensions by a factor of 2.
After these initial layers, the spatial dimensions are reduced by a factor of 4.
- ResNet50 consists of 4 stages of convolutional blocks, with each stage containing multiple residual blocks.
- Downsampling occurs at the beginning of each stage (except the first stage) using a convolution with a stride of 2.
Let's break down each stage:
- Stage 1: No downsampling (input dimensions remain the same).
- Stage 2: Downsampling by a factor of 2.
- Stage 3: Downsampling by a factor of 2.
- Stage 4: Downsampling by a factor of 2.
To calculate the overall downsampling ratio of ResNet50:
- Initial downsampling: \(\frac{1}{4}\) (due to initial convolution and max pooling)
- Downsampling at each stage: \(\frac{1}{2} \times \frac{1}{2} \times \frac{1}{2}\) (due to downsampling at the beginning of stages 2, 3, and 4)
Overall downsampling ratio: \[ \frac{1}{4} \times \frac{1}{2} \times \frac{1}{2} \times \frac{1}{2}=\frac{1}{32} \]
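The same bookkeeping can be sketched in code by accumulating the stride of each component. The function name `resnet50_feature_sizes` is hypothetical, and the sketch assumes the input size is divisible by 32, so that every division is exact.

```python
def resnet50_feature_sizes(input_size):
    """Return (component, cumulative stride, feature map size) for each component.

    Assumption: input_size is divisible by 32.
    """
    strides = [
        ("initial conv", 2),
        ("max pool", 2),
        ("stage 1", 1),  # no downsampling
        ("stage 2", 2),
        ("stage 3", 2),
        ("stage 4", 2),
    ]
    total = 1
    trace = []
    for name, stride in strides:
        total *= stride
        trace.append((name, total, input_size // total))
    return trace


for name, total, size in resnet50_feature_sizes(224):
    print(f"{name}: 1/{total} -> {size}x{size}")
```

For a 224×224 input, the final feature map is 7×7, consistent with the overall ratio of 1/32.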