Convolutional Neural Networks
TL;DR: In a convolution layer, \[ \text { Output size }=\left\lfloor\frac{\text { Input size }+2 \times \text { Padding }- \text { Kernel size }}{\text { Stride }}\right\rfloor+1 \]
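For example, a \(3 \times 3\) input with a \(2 \times 2\) kernel, no padding, and a stride of 1 gives \(\left\lfloor (3 + 2 \times 0 - 2)/1 \right\rfloor + 1 = 2\), i.e. a \(2 \times 2\) output, which matches the worked example below.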
Sources:
- Convolutional Neural Networks from d2l
The cross-correlation operation
Here is one example. In the figure below, we have:
- Input: A two-dimensional tensor with a height of 3 and width of 3.
- Kernel: A two-dimensional tensor with a height of 2 and width of 2.
- Output: A two-dimensional tensor with a height of 2 and width of 2.
When computing the cross-correlation, we start with the convolution window at the upper-left corner of the input tensor, then we slide it across the input tensor both from left to right and top to bottom.
The calculation is: \[ \begin{split}0\times0+1\times1+3\times2+4\times3=19,\\ 1\times0+2\times1+4\times2+5\times3=25,\\ 3\times0+4\times1+6\times2+7\times3=37,\\ 4\times0+5\times1+7\times2+8\times3=43.\end{split} \] The output size is given by the input size \(x_\textrm{h} \times x_\textrm{w}\) minus the size of the convolution kernel \(k_\textrm{h} \times k_\textrm{w}\) via \[ (x_\textrm{h}-k_\textrm{h}+1) \times (x_\textrm{w}-k_\textrm{w}+1). \]
Implementation:
import torch

def corr2d(X, K):
    """Compute the 2D cross-correlation of input X with kernel K."""
    h, w = K.shape
    Y = torch.zeros((X.shape[0] - h + 1, X.shape[1] - w + 1))
    for i in range(Y.shape[0]):
        for j in range(Y.shape[1]):
            # Elementwise product of the current window with the kernel, then sum.
            Y[i, j] = (X[i:i + h, j:j + w] * K).sum()
    return Y
The figure corresponds to the following code:
X = torch.tensor([[0.0, 1.0, 2.0], [3.0, 4.0, 5.0], [6.0, 7.0, 8.0]])
K = torch.tensor([[0.0, 1.0], [2.0, 3.0]])
corr2d(X, K)
The output is:
tensor([[19., 25.],
        [37., 43.]])
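As a quick sanity check, the same values can be reproduced with PyTorch's built-in torch.nn.functional.conv2d, which (despite the name) also computes a cross-correlation; the reshapes below only add the batch and channel dimensions that conv2d expects:
import torch
import torch.nn.functional as F

X = torch.tensor([[0.0, 1.0, 2.0], [3.0, 4.0, 5.0], [6.0, 7.0, 8.0]])
K = torch.tensor([[0.0, 1.0], [2.0, 3.0]])

# conv2d expects (batch, channels, height, width) tensors.
Y = F.conv2d(X.reshape(1, 1, 3, 3), K.reshape(1, 1, 2, 2))
print(Y.reshape(2, 2))  # tensor([[19., 25.], [37., 43.]])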
Padding
One tricky issue when applying convolutional layers is that we tend to lose pixels on the perimeter of our image. Consider Fig. 2, which depicts pixel utilization as a function of the convolution kernel size and the position within the image: the pixels in the corners are hardly used at all.
Since we typically use small kernels, any given convolution might only lose a few pixels, but this adds up as we apply many successive convolutional layers. One straightforward solution to this problem is to add extra pixels of filler around the boundary of the input image, thus increasing its effective size. Typically, we set the values of the extra pixels to zero.
In Fig. 3, we pad a \(3 \times 3\) input, increasing its size to \(5 \times 5\). The corresponding output then increases to a \(4 \times 4\) matrix. The shaded portions are the first output element as well as the input and kernel tensor elements used for the output computation: \(0\times0+0\times1+0\times2+0\times3=0\).
Fig. 3 Two-dimensional cross-correlation with padding.
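A minimal sketch of the padded example, reusing X, K, and corr2d from above and zero-padding with torch.nn.functional.pad (values as in the figure):
import torch.nn.functional as F

# Zero-pad the 3x3 input by one pixel on every side, giving a 5x5 tensor.
X_padded = F.pad(X, (1, 1, 1, 1))  # (left, right, top, bottom)
Y = corr2d(X_padded, K)
print(Y.shape)   # torch.Size([4, 4])
print(Y[0, 0])   # tensor(0.) -- the shaded element: 0*0 + 0*1 + 0*2 + 0*3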
Stride
The stride is the number of pixels by which the convolution window shifts over the input matrix at each step.
Fig. 4 Cross-correlation with strides of 3 and 2 for height and width, respectively.
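The exact tensors behind Fig. 4 are not reproduced here; as a rough sketch of how stride affects the output size, consider nn.Conv2d on a hypothetical \(8 \times 8\) input with a \(3 \times 3\) kernel, padding 1, and stride 2, so the TL;DR formula gives \(\lfloor (8 + 2 - 3)/2 \rfloor + 1 = 4\):
import torch
from torch import nn

X = torch.rand(1, 1, 8, 8)  # hypothetical (batch, channel, height, width) input

# floor((8 + 2*1 - 3) / 2) + 1 = 4 along both height and width.
conv = nn.Conv2d(1, 1, kernel_size=3, padding=1, stride=2)
print(conv(X).shape)  # torch.Size([1, 1, 4, 4])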