Convolutional Neural Networks

TL;DR: In a convolution layer, \[ \text{Output size} = \left\lfloor \frac{\text{Input size} + 2 \times \text{Padding} - \text{Kernel size}}{\text{Stride}} \right\rfloor + 1 \]
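This formula can be sanity-checked against PyTorch directly. The sketch below uses hypothetical example values (input 32, kernel 5, padding 2, stride 3), for which the formula gives \(\lfloor (32 + 2 \times 2 - 5)/3 \rfloor + 1 = 11\):

```python
import torch
from torch import nn

# Hypothetical example: input 32, kernel 5, padding 2, stride 3
# Formula: floor((32 + 2*2 - 5) / 3) + 1 = 11
conv = nn.Conv2d(in_channels=1, out_channels=1, kernel_size=5, padding=2, stride=3)
x = torch.rand(1, 1, 32, 32)
print(conv(x).shape)  # torch.Size([1, 1, 11, 11])
```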

Sources:

  1. Convolutional Neural Networks from d2l

The cross-correlation operation

Here is one example. In the below figure, we have:

  • Input: A two-dimensional tensor with a height of 3 and width of 3.
  • Kernel: A two-dimensional tensor with a height of 2 and width of 2.
  • Output: A two-dimensional tensor with a height of 2 and width of 2.
Figure 1.1
Figure 1.2

When computing the cross-correlation, we start with the convolution window at the upper-left corner of the input tensor, then we slide it across the input tensor both from left to right and top to bottom.

The calculation is: \[ \begin{split}0\times0+1\times1+3\times2+4\times3=19,\\ 1\times0+2\times1+4\times2+5\times3=25,\\ 3\times0+4\times1+6\times2+7\times3=37,\\ 4\times0+5\times1+7\times2+8\times3=43.\end{split} \] The output size is given by the input size \(x_\textrm{h} \times x_\textrm{w}\) minus the size of the convolution kernel \(k_\textrm{h} \times k_\textrm{w}\) via \[ (x_\textrm{h}-k_\textrm{h}+1) \times (x_\textrm{w}-k_\textrm{w}+1). \]

In code (PyTorch):

import torch

def corr2d(X, K):
    """Compute 2D cross-correlation."""
    x_h, x_w = X.shape  # The input shape
    k_h, k_w = K.shape  # The kernel shape
    y_h, y_w = x_h - k_h + 1, x_w - k_w + 1  # The output shape
    Y = torch.zeros(y_h, y_w)
    for i in range(y_h):
        for j in range(y_w):
            # Elementwise product of the current window and the kernel, summed
            Y[i, j] = (X[i:i + k_h, j:j + k_w] * K).sum()
    return Y

The following code reproduces the example from the figure:

x = torch.tensor([[0.0, 1.0, 2.0], [3.0, 4.0, 5.0], [6.0, 7.0, 8.0]])
k = torch.tensor([[0.0, 1.0], [2.0, 3.0]])
y = corr2d(x, k)

The output is:

tensor([[19., 25.],
        [37., 43.]])

Padding

One tricky issue when applying convolutional layers is that we tend to lose pixels on the perimeter of our image. Consider Fig. 2, which depicts pixel utilization as a function of the convolution kernel size and the position within the image. The pixels in the corners are hardly used at all.

Figure 2

Since we typically use small kernels, any single convolution might only lose a few pixels, but the loss adds up as we apply many successive convolutional layers. One straightforward solution is to add extra pixels of filler around the boundary of the input image, thus increasing its effective size. Typically, we set the values of the extra pixels to zero.

In Fig. 3, we pad a \(3 \times 3\) input, increasing its size to \(5 \times 5\). The corresponding output then increases to a \(4 \times 4\) matrix. The shaded portions are the first output element as well as the input and kernel tensor elements used for the output computation: \(0\times0+0\times1+0\times2+0\times3=0\).
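This padded example can be reproduced in code. The sketch below pads the same \(3 \times 3\) input with one pixel of zeros on each side using `torch.nn.functional.pad`, then applies the same cross-correlation as before (`corr2d` is redefined here so the snippet is self-contained):

```python
import torch
import torch.nn.functional as F

def corr2d(X, K):
    """Compute 2D cross-correlation."""
    k_h, k_w = K.shape
    y_h, y_w = X.shape[0] - k_h + 1, X.shape[1] - k_w + 1
    Y = torch.zeros(y_h, y_w)
    for i in range(y_h):
        for j in range(y_w):
            Y[i, j] = (X[i:i + k_h, j:j + k_w] * K).sum()
    return Y

x = torch.arange(9, dtype=torch.float32).reshape(3, 3)
k = torch.tensor([[0.0, 1.0], [2.0, 3.0]])
# Pad one pixel of zeros on each side: (left, right, top, bottom)
x_padded = F.pad(x, (1, 1, 1, 1))  # shape becomes 5 x 5
y = corr2d(x_padded, k)
print(y.shape)   # torch.Size([4, 4])
print(y[0, 0])   # tensor(0.), matching the shaded computation above
```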

Figure 3: Two-dimensional cross-correlation with padding.

Stride

Stride is the number of pixels the kernel shifts over the input matrix at each step. With a stride greater than 1, the window skips intermediate positions, so the output is downsampled accordingly.

Figure 4: Cross-correlation with strides of 3 and 2 for height and width, respectively.
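A strided version of the loop is a small change to `corr2d`: the window jumps by the stride instead of by 1, and the output dimensions follow the TL;DR formula. The function name `corr2d_stride` below is a hypothetical extension, not part of the original code:

```python
import torch

def corr2d_stride(X, K, stride=(1, 1)):
    """2D cross-correlation with a (height, width) stride (hypothetical extension)."""
    k_h, k_w = K.shape
    s_h, s_w = stride
    # Output size: floor((input - kernel) / stride) + 1 (no padding here)
    y_h = (X.shape[0] - k_h) // s_h + 1
    y_w = (X.shape[1] - k_w) // s_w + 1
    Y = torch.zeros(y_h, y_w)
    for i in range(y_h):
        for j in range(y_w):
            # The window's top-left corner moves in steps of the stride
            Y[i, j] = (X[i * s_h:i * s_h + k_h, j * s_w:j * s_w + k_w] * K).sum()
    return Y

x = torch.arange(64, dtype=torch.float32).reshape(8, 8)
k = torch.ones(2, 2)
y = corr2d_stride(x, k, stride=(3, 2))
print(y.shape)  # torch.Size([3, 4]): floor((8-2)/3)+1 = 3, floor((8-2)/2)+1 = 4
```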