PyTorch Basics

Sources:

1. PyTorch documentation

For all modern deep learning frameworks, the tensor class (ndarray in MXNet, Tensor in PyTorch and TensorFlow) resembles NumPy’s ndarray, with a few killer features added. First, the tensor class supports automatic differentiation. Second, it leverages GPUs to accelerate numerical computation, whereas NumPy only runs on CPUs. These properties make neural networks both easy to code and fast to run.

A tensor represents a (possibly multi-dimensional) array of numerical values.

  • With one axis, a tensor is called a vector.
  • With two axes, a tensor is called a matrix.
  • With \(k>2\) axes, we drop the specialized names and just refer to the object as a \(k^{\text{th}}\)-order tensor.

Each of these values is called an element of the tensor.

Data Manipulation

Tensor

In PyTorch, tensors default to a 32-bit precision (float32) when you create them with floating-point data, unless specified otherwise.
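
A quick check (assuming import torch throughout this note):

a = torch.tensor([1.0, 2.0])
a.dtype  # torch.float32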

Create a tensor

  • Create a vector of evenly spaced values, starting at 0 (included) and ending at n (not included). By default, the interval size is 1.

    x = torch.arange(12, dtype=torch.float32)
    # Output: tensor([ 0., 1., 2., 3., 4., 5., 6., 7., 8., 9., 10., 11.])

    Unless otherwise specified, new tensors are stored in main memory and designated for CPU-based computation.

  • Create a (3, 4) tensor with elements drawn from a standard Gaussian (normal) distribution \(\mathcal N(0,1)\):

    torch.randn(3, 4)
    • Recall that for a random variable \(X \sim \mathcal N(\mu, \sigma^2)\): if \(Y = a + X\), then \(Y \sim \mathcal N(a + \mu, \sigma^2)\); if \(Z = aX\), then \(Z \sim \mathcal N(a\mu, a^2\sigma^2)\). So we can create a tensor whose elements are drawn from \(\mathcal N(5, 2)\):

      X = torch.randn(3, 4)                              # samples from N(0, 1)
      mean = 5
      variance = 2
      Z = mean + torch.sqrt(torch.tensor(variance)) * X  # samples from N(5, 2)
  • Create a tensor from a list:

    torch.tensor([[2, 1, 4, 3], [1, 2, 3, 4], [4, 3, 2, 1]])
  • Create an all-zero or all-one tensor:

    torch.zeros((2, 3, 4))
    torch.ones((2, 3, 4))
  • Get the number of elements in a tensor:

    x.numel()
  • shape: We can access a tensor’s shape (the length along each axis) by inspecting its shape attribute:

    x.shape
  • reshape: We can change the shape of a tensor without altering its size or values, by invoking reshape. For example, we can transform our vector x whose shape is (12,) to a matrix X with shape (3, 4):

    X = x.reshape(3, 4)

    Note that specifying every shape component to reshape is redundant. Because we already know our tensor’s size, we can work out one component of the shape given the rest. For example, given a tensor of size \(n\) and target shape \((h,w)\), we know that \(w = n / h\). To automatically infer one component of the shape, we can place a -1 for the shape component that should be inferred automatically. In our case, instead of calling x.reshape(3, 4), we could have equivalently called x.reshape(-1, 4) or x.reshape(3, -1) (see the short check after this list).

  • Tensor.item() → number

    Returns the value of this tensor as a standard Python number. This only works for tensors with one element. For other cases, see tolist().

    x = torch.tensor([1.0])
    x.item()  # 1.0
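
As noted in the reshape item above, a quick check that the -1 forms infer the same shape:

x = torch.arange(12, dtype=torch.float32)
x.reshape(-1, 4).shape  # torch.Size([3, 4])
x.reshape(3, -1).shape  # torch.Size([3, 4])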

Operations

unary operators

Most standard operators can be applied elementwise, including unary operators like \(e^x\).

torch.exp(x)

binary operators

x = torch.tensor([1.0, 2, 4, 8])
y = torch.tensor([2, 2, 2, 2])
x + y, x - y, x * y, x / y, x ** y
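
For reference, these evaluate elementwise to:

# x + y  -> tensor([ 3.,  4.,  6., 10.])
# x - y  -> tensor([-1.,  0.,  2.,  6.])
# x * y  -> tensor([ 2.,  4.,  8., 16.])
# x / y  -> tensor([0.5000, 1.0000, 2.0000, 4.0000])
# x ** y -> tensor([ 1.,  4., 16., 64.])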

Also, remember that the == operator compares values for equality. For tensors, this comparison is performed elementwise.

x == y  # tensor([False,  True, False, False])

concatenate

The example below shows what happens when we concatenate two matrices along rows (axis 0) versus along columns (axis 1). We can see that the first output’s axis-0 length (6) is the sum of the two input tensors’ axis-0 lengths (3 + 3), while the second output’s axis-1 length (8) is the sum of the two input tensors’ axis-1 lengths (4 + 4).

X = torch.arange(12, dtype=torch.float32).reshape((3, 4))
Y = torch.tensor([[2.0, 1, 4, 3], [1, 2, 3, 4], [4, 3, 2, 1]])
P = torch.cat((X, Y), dim=0)  # Get a (3+3) x 4 tensor.
print(P.shape)  # torch.Size([6, 4])
Q = torch.cat((X, Y), dim=1)  # Get a 3 x (4+4) tensor.
print(Q.shape)  # torch.Size([3, 8])

Broadcasting

Even when shapes differ, we can still perform elementwise binary operations by invoking the broadcasting mechanism.

Broadcasting works according to the following two-step procedure:

  1. expand one or both arrays by copying elements along axes with length 1 so that after this transformation, the two tensors have the same shape;
  2. perform an elementwise operation on the resulting arrays.
a = torch.arange(3).reshape((3, 1))
b = torch.arange(2).reshape((1, 2))
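
Adding them broadcasts a along its columns and b along its rows to a common (3, 2) shape before the elementwise sum:

a + b
# tensor([[0, 1],
#         [1, 2],
#         [2, 3]])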

Saving Memory

If we write Y = X + Y, we dereference the tensor that Y used to point to and instead point Y at the newly allocated memory.

before = id(Y)
Y = Y + X
id(Y) == before
# False

Fortunately, performing in-place operations is easy. We can assign the result of an operation to a previously allocated array Y by using slice notation: Y[:] = <expression>. To illustrate this concept, we overwrite the values of tensor Z, after initializing it, using zeros_like, to have the same shape as Y.

Z = torch.zeros_like(Y)
print('id(Z):', id(Z))
Z[:] = X + Y
print('id(Z):', id(Z))

# Output:
# id(Z): 140381179266448
# id(Z): 140381179266448

If the value of X is not reused in subsequent computations, we can also use X[:] = X + Y or X += Y to reduce the memory overhead of the operation.

before = id(X)
X += Y
id(X) == before  # True

Shape

Change BCHW to BHWC

# convert inputs from BCHW -> BHWC
inputs = inputs.permute(0, 2, 3, 1).contiguous()  # (256, 64, 8, 8) BCHW --> (256, 8, 8, 64) BHWC
input_shape = inputs.shape  # BHWC

The .contiguous() method ensures that the tensor is stored in a contiguous chunk of memory. This is often necessary after operations like .permute(), which change the tensor's shape or strides but don't move the data in memory.
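
A small illustration of the effect (the shape is arbitrary):

t = torch.randn(2, 64, 8, 8)    # BCHW
p = t.permute(0, 2, 3, 1)       # BHWC view over the same storage
p.is_contiguous()               # False
p.contiguous().is_contiguous()  # True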

Flatten

# Flatten input
# With C' = 64 (_embedding_dim), flatten `inputs` into (N, C'), where N = B*H*W,
# i.e., we have N = B*H*W vectors, each of dimension C'.
flat_input = inputs.view(-1, self._embedding_dim)

Create one-hot encoding

We create every one-hot encoding as follows: \[ \text{encoding}_k = \begin{cases} 1 & \text{for } k = \operatorname{argmin}_j \left\| z_e(x) - e_j \right\|_2, \\ 0 & \text{otherwise.} \end{cases} \]

# Calculate distances, using ||z_e - e_j||^2 = ||z_e||^2 + ||e_j||^2 - 2 * z_e . e_j.
# Each vector `z_e` gets a distance to every quantized vector `e_j` in the codebook,
# where j is in K = `_num_embeddings`.
distances = (torch.sum(flat_input**2, dim=1, keepdim=True)
             + torch.sum(self._embedding.weight**2, dim=1)
             - 2 * torch.matmul(flat_input, self._embedding.weight.t()))

# Encoding: for each vector `z_e`, select the index of the **closest** quantized vector `e_j` in the codebook.
encoding_indices = torch.argmin(distances, dim=1).unsqueeze(1)

# For each vector `z_e`, use the index of its corresponding `z_q` to create a one-hot encoding.
encodings = torch.zeros(encoding_indices.shape[0], self._num_embeddings, device=inputs.device)
encodings.scatter_(1, encoding_indices, 1)

Explanation of encodings.scatter_(1, encoding_indices, 1):

First, note that scatter_() is an in-place function, meaning that it modifies the input tensor directly.

The official documentation, scatter_(dim, index, src) → Tensor, tells us that the parameters are dim, an index tensor, and a source tensor. dim specifies the axis along which the index tensor operates; the other dimensions are left unchanged. As the function name suggests, the goal is to scatter values from the source tensor into the input tensor self: we loop through the values in the source tensor, find each one's target position in the input tensor, and overwrite the old value.

Note that src can also just be a scalar. In this case, we would just scatter this single value according to the index tensor.

self[index[i][j][k]][j][k] = src[i][j][k]  # if dim == 0
self[i][index[i][j][k]][k] = src[i][j][k]  # if dim == 1
self[i][j][index[i][j][k]] = src[i][j][k]  # if dim == 2
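
A minimal sketch of the one-hot pattern used above (the index values are arbitrary):

import torch

index = torch.tensor([[2], [0], [1]])  # target column for each of 3 rows
one_hot = torch.zeros(3, 4)
one_hot.scatter_(1, index, 1)          # sets one_hot[i][index[i][0]] = 1
# tensor([[0., 0., 1., 0.],
#         [1., 0., 0., 0.],
#         [0., 1., 0., 0.]])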

The Straight Through Estimator Operation

https://ai.stackexchange.com/questions/26770/in-vq-vae-code-what-does-this-line-of-code-signify

quantized = inputs + (quantized - inputs).detach()

In this line:

  • quantized is the output from a quantization process where each input vector from inputs is replaced with its nearest vector in a codebook.
  • (quantized - inputs).detach() computes the difference between the quantized vectors and the original input vectors, and .detach() is called to prevent gradients from flowing through this operation. This means that during backpropagation, the gradient of this term will be considered as zero, effectively stopping gradients from flowing through the quantization step.
  • inputs + (quantized - inputs).detach() then adds this detached difference back to the original inputs. The result is a tensor that is numerically equal to quantized during the forward pass. However, during the backward pass (gradient computation), the gradients are taken with respect to inputs. This "trick" allows gradients to bypass the non-differentiable quantization step, maintaining the ability to optimize the parameters that affect inputs through gradient descent.
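
A small demonstration of the trick (using round as a stand-in for the codebook lookup; the variable names are illustrative):

import torch

inputs = torch.randn(4, requires_grad=True)
quantized = inputs.round()                    # stand-in for the non-differentiable quantization
ste = inputs + (quantized - inputs).detach()  # equals `quantized` in the forward pass
ste.sum().backward()
inputs.grad                                   # tensor([1., 1., 1., 1.]): gradients flow as if ste were inputs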

Calculus

\[ f'(x) = \lim_{h \rightarrow 0} \frac{f(x+h) - f(x)}{h}. \]

Define \(u = f(x) = 3x^2-4x\),

def f(x):
    return 3 * x ** 2 - 4 * x

We can see that \(f'(x) = 6x - 4\) and \(f'(1) = 2\).

Now we evaluate the numerical value of \(f'(1)\) for \(h = 10^{-1}, 10^{-2}, 10^{-3}, \cdots\).

import numpy as np

# np.arange(-1, -6, -1): values from -1 down to -6 (exclusive) with a step of -1.
for h in 10.0 ** np.arange(-1, -6, -1):
    print(f'h={h:.5f}, numerical limit={(f(1 + h) - f(1)) / h:.5f}')

As \(h\) becomes smaller, the result approaches \(2\):

h=0.10000, numerical limit=2.30000
h=0.01000, numerical limit=2.03000
h=0.00100, numerical limit=2.00300
h=0.00010, numerical limit=2.00030
h=0.00001, numerical limit=2.00003

Plot

Define \(u = f(x) = 3x^2-4x\). Since:

  1. \(f'(1) = 6 \cdot 1 - 4 = 2\)
  2. \(f(1) = 3 - 4 = -1\)

the tangent line at \(x=1\) is \(y = f(1) + f'(1)(x-1) = 2x - 3\).

[Figure: \(f(x) = 3x^2 - 4x\) and its tangent line \(y = 2x - 3\) at \(x = 1\).]
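
A minimal sketch that reproduces the figure (assuming matplotlib is installed):

import numpy as np
import matplotlib.pyplot as plt

x = np.arange(0, 3, 0.1)
plt.plot(x, 3 * x**2 - 4 * x, label='f(x)')
plt.plot(x, 2 * x - 3, '--', label='tangent line at x=1')
plt.xlabel('x')
plt.ylabel('f(x)')
plt.legend()
plt.show()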

Automatic Differentiation

Let’s assume that we are interested in differentiating the function \(y = 2x^{\text{T}}x\) with respect to the column vector \(x\). To start, we assign x an initial value.

Let \(x = [0,1,2,3]^{\text{T}}\) and \(y = 2x^{\text{T}}x\). Note: \(x^{\text{T}}x\) is the dot product of \(x\) with itself.

# Can also create x = torch.arange(4.0, requires_grad=True)

x = torch.arange(4.0)   # tensor([0., 1., 2., 3.])
x.requires_grad_(True)  # Tell autograd to store gradients for x

y = 2 * torch.dot(x, x) # tensor(28., grad_fn=<MulBackward0>)

We can now take the gradient of y with respect to x by calling its backward method. Next, we can access the gradient via x’s grad attribute.

y.backward()
print(f"x.grad: {x.grad}")
# x.grad: tensor([ 0., 4., 8., 12.])

We already know that the gradient of the function \(y = 2x^{\text{T}}x\) with respect to \(x\) should be \(4x\).

print(x.grad == 4 * x)  # tensor([True, True, True, True])
print(f"4*x = {4 * x} when x = {x}")
print(f"x.grad = {x.grad} when x = {x}")

PyTorch does not automatically reset the gradient buffer when we record a new gradient. Instead, the new gradient is added to the already-stored gradient. This behavior comes in handy when we want to optimize the sum of multiple objective functions. To reset the gradient buffer, we can call x.grad.zero_() as follows:

x.grad.zero_()  # Reset the gradient
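
To see the accumulation itself (a small illustration with a fresh variable v):

v = torch.arange(4.0, requires_grad=True)
(2 * torch.dot(v, v)).backward()
print(v.grad)  # tensor([ 0.,  4.,  8., 12.])
(2 * torch.dot(v, v)).backward()
print(v.grad)  # tensor([ 0.,  8., 16., 24.]): added to the stored gradient, not replaced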

Example: \(y = \sum x_i\)

y = x.sum()
y.backward()
x.grad  # tensor([1., 1., 1., 1.])

Loading dataset

from torchvision import datasets, transforms

training_data = datasets.CIFAR10(root="data", train=True, download=True,
                                 transform=transforms.Compose([
                                     transforms.ToTensor(),
                                     transforms.Normalize((0.5, 0.5, 0.5), (1.0, 1.0, 1.0))
                                 ]))

# The train=False parameter specifies that the test set should be loaded instead of the training set.
validation_data = datasets.CIFAR10(root="data", train=False, download=True,
                                   transform=transforms.Compose([
                                       transforms.ToTensor(),
                                       transforms.Normalize((0.5, 0.5, 0.5), (1.0, 1.0, 1.0))
                                   ]))

Loading the Dataset:

  • datasets.CIFAR10(root="data", train=True, download=True, ...): This line loads the CIFAR-10 dataset for training. If the dataset is not already downloaded, it will be downloaded and saved to the "data" directory. The train=True parameter indicates that the training set should be loaded.
  • datasets.CIFAR10(root="data", train=False, download=True, ...): Similarly, this line loads the CIFAR-10 dataset for validation (or testing). The train=False parameter specifies that the test set should be loaded instead of the training set.

Transformations:

  • transforms.Compose([...]): This is a function that composes several transformations together. It takes a list of transformations and applies them in sequence. This is useful for chaining together multiple processing steps.

  • transforms.ToTensor(): This transformation converts PIL images or NumPy arrays into PyTorch tensors. It automatically scales the images to a range of [0.0, 1.0] by dividing the pixel values by 255.

  • transforms.Normalize((0.5, 0.5, 0.5), (1.0, 1.0, 1.0)): This transformation normalizes the tensor images. The first tuple (0.5, 0.5, 0.5) is the mean for each of the three channels (Red, Green, Blue), and the second tuple (1.0, 1.0, 1.0) is the standard deviation for each channel. Normalization is applied by subtracting the mean and dividing by the standard deviation for each channel:

    \[ \text{Normalized Channel} = \frac{\text{Channel} - \text{Mean}}{\text{Standard Deviation}} \]

    In this case, since the mean is 0.5 and the standard deviation is 1.0 for all channels, the normalization simply shifts the pixel values from the range [0.0, 1.0] to [-0.5, 0.5]. Normalization is a common preprocessing step that can help speed up convergence during training by keeping the input data centered and on a consistent scale.
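
To iterate over the dataset in minibatches, it is typically wrapped in a DataLoader (the batch size below is arbitrary):

from torch.utils.data import DataLoader

train_loader = DataLoader(training_data, batch_size=64, shuffle=True)
images, labels = next(iter(train_loader))
images.shape  # torch.Size([64, 3, 32, 32]): a batch of 64 CIFAR-10 images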

Saving and loading the checkpoint

Saving a Checkpoint:

import numpy as np
import torch

checkpoint_path = "path/to/save/checkpoint.pth"  # Define the path where you want to save the checkpoint

for i in range(num_training_updates):
    # Your training code remains the same...

    if (i + 1) % 100 == 0:
        # Print training status
        print('%d iterations' % (i + 1))
        print('recon_error: %.3f' % np.mean(train_res_recon_error[-100:]))
        print('perplexity: %.3f' % np.mean(train_res_perplexity[-100:]))

        # Save checkpoint
        checkpoint = {
            'iteration': i + 1,
            'model_state_dict': model.state_dict(),
            'optimizer_state_dict': optimizer.state_dict(),
            'train_res_recon_error': train_res_recon_error,
            'train_res_perplexity': train_res_perplexity,
            'loss': loss.item(),
            # Add any other information you need
        }
        torch.save(checkpoint, checkpoint_path)
        print("Checkpoint saved at iteration %d" % (i + 1))

Loading a Checkpoint

To resume training or evaluate your model using a saved checkpoint, you first need to initialize your model and optimizer, then load the checkpoint and update the model and optimizer states:

# Initialize your model and optimizer here
model = MyModel(...)  # Make sure to initialize your model with the same parameters as before
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)  # Similarly for the optimizer

# Load the checkpoint
checkpoint = torch.load(checkpoint_path)

# Update model and optimizer states
model.load_state_dict(checkpoint['model_state_dict'])
optimizer.load_state_dict(checkpoint['optimizer_state_dict'])

# You can also load other saved training information
current_iteration = checkpoint['iteration']
train_res_recon_error = checkpoint['train_res_recon_error']
train_res_perplexity = checkpoint['train_res_perplexity']
last_loss = checkpoint['loss']

# If you plan to continue training, don't forget to set the model to training mode
model.train()

# Now you can resume training or perform evaluation using the model

Documentation

While we cannot possibly introduce every single PyTorch function and class (and the information might become outdated quickly), the API documentation, along with additional tutorials and examples, provides such information. This section offers some guidance for exploring the PyTorch API.

  • To know which functions and classes can be called in a module, we invoke the dir function. For instance, we can query all properties in the module for generating random numbers:

    print(dir(torch.distributions))
  • For specific instructions on how to use a given function or class, we can invoke Python's built-in help function. As an example, let’s explore the usage instructions for torch.ones:

    help(torch.ones)

Debug

The core functionalities of PyTorch, including data loading, are implemented in C++. While you interact with these utilities using Python, the actual work is done in the C++ backend. This means that you might not have direct access to the Python-level source code during debugging.
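
One practical workaround when debugging the data pipeline (the batch size below is arbitrary): run the DataLoader without worker subprocesses, so that loading happens in the main Python process and breakpoints behave normally.

from torch.utils.data import DataLoader

# num_workers=0 loads batches in the main process instead of worker
# subprocesses, which makes stepping through with a debugger much easier.
loader = DataLoader(training_data, batch_size=32, num_workers=0)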
