PyTorch Basics
Sources:
1. PyTorch documentation
For all modern deep learning frameworks, the tensor class (ndarray in MXNet, Tensor in PyTorch and TensorFlow) resembles NumPy’s ndarray, with a few killer features added. First, the tensor class supports automatic differentiation. Second, it leverages GPUs to accelerate numerical computation, whereas NumPy only runs on CPUs. These properties make neural networks both easy to code and fast to run.
A tensor represents a (possibly multi-dimensional) array of numerical values.
- With one axis, a tensor is called a vector.
- With two axes, a tensor is called a matrix.
- With \(k>2\) axes, we drop the specialized names and just refer to the object as a \(k^{th}\) order tensor.
Each of these values is called an element of the tensor.
Data Manipulation
Tensor
In PyTorch, tensors default to a 32-bit precision (float32) when you create them with floating-point data, unless specified otherwise.
Create a tensor
Create a vector of evenly spaced values, starting at 0 (included) and ending at `n` (not included). By default, the interval size is 1.

```python
import torch

x = torch.arange(12, dtype=torch.float32)
# Output: tensor([ 0., 1., 2., 3., 4., 5., 6., 7., 8., 9., 10., 11.])
```

Unless otherwise specified, new tensors are stored in main memory and designated for CPU-based computation.
Create a (3, 4) tensor with elements drawn from a standard Gaussian (normal) distribution \(\mathcal N(0,1)\):

```python
torch.randn(3, 4)
```
Recall that for a random variable \(X \sim \mathcal N(\mu, \sigma^2)\), if \(Y = a + X\) then \(Y \sim \mathcal N(a + \mu, \sigma^2)\), and if \(Z = a X\) then \(Z \sim \mathcal N(a\mu, a^2\sigma^2)\). So we can create a tensor with entries drawn from \(\mathcal N(5, 2)\):

```python
X = torch.randn(3, 4)
mean = 5
variance = 2.0
Z = mean + torch.sqrt(torch.tensor(variance)) * X
```
Create a tensor from a list:

```python
torch.tensor([[2, 1, 4, 3], [1, 2, 3, 4], [4, 3, 2, 1]])
```
Create an all-zero or all-one tensor:

```python
torch.zeros((2, 3, 4))
torch.ones((2, 3, 4))
```

Get the number of elements in a tensor:

```python
x.numel()
```
shape: We can access a tensor's shape (the length along each axis) by inspecting its `shape` attribute:

```python
x.shape
```
reshape: We can change the shape of a tensor without altering its size or values by invoking `reshape`. For example, we can transform our vector `x`, whose shape is (12,), into a matrix `X` with shape (3, 4):

```python
X = x.reshape(3, 4)
```
Note that specifying every shape component to `reshape` is redundant. Because we already know our tensor's size, we can work out one component of the shape given the rest. For example, given a tensor of size \(n\) and target shape \((h, w)\), we know that \(w = n / h\). To automatically infer one component of the shape, we can place a `-1` for the shape component that should be inferred automatically. In our case, instead of calling `x.reshape(3, 4)`, we could have equivalently called `x.reshape(-1, 4)` or `x.reshape(3, -1)`.

`Tensor.item()` → number: Returns the value of this tensor as a standard Python number. This only works for tensors with one element. For other cases, see `tolist()`.

```python
x = torch.tensor([1.0])
x.item()
```
Operations
unary operators
Most standard operators can be applied elementwise, including unary operators like \(e^x\).
```python
torch.exp(x)
```
binary operators
```python
x = torch.tensor([1.0, 2, 4, 8])
```
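For example, a minimal self-contained sketch of elementwise binary operations (the second operand `y` here is an illustrative assumption):

```python
import torch

x = torch.tensor([1.0, 2, 4, 8])
y = torch.tensor([2.0, 2, 2, 2])  # assumed second operand for illustration
print(x + y)   # tensor([ 3.,  4.,  6., 10.])
print(x - y)   # tensor([-1.,  0.,  2.,  6.])
print(x * y)   # tensor([ 2.,  4.,  8., 16.])
print(x / y)   # tensor([0.5000, 1.0000, 2.0000, 4.0000])
print(x ** y)  # tensor([ 1.,  4., 16., 64.])
```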
Also, remember that the `==` operator compares the values of objects; for tensors, the comparison is also applied elementwise.

```python
X == Y
```
concatenate
The example below shows what happens when we concatenate two matrices along rows (axis 0) instead of columns (axis 1). The first output's axis-0 length is the sum of the two input tensors' axis-0 lengths, while the second output's axis-1 length is the sum of the two input tensors' axis-1 lengths.

```python
X = torch.arange(12, dtype=torch.float32).reshape((3, 4))
```
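A self-contained sketch of concatenation along both axes (the second matrix `Y` is an illustrative assumption with the same shape as `X`):

```python
import torch

X = torch.arange(12, dtype=torch.float32).reshape((3, 4))
Y = torch.ones((3, 4))  # assumed second operand
print(torch.cat((X, Y), dim=0).shape)  # torch.Size([6, 4]): axis-0 lengths add up
print(torch.cat((X, Y), dim=1).shape)  # torch.Size([3, 8]): axis-1 lengths add up
```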
Broadcasting
Even when shapes differ, we can still perform elementwise binary operations by invoking the broadcasting mechanism.
Broadcasting works according to the following two-step procedure:
- expand one or both arrays by copying elements along axes with length 1 so that after this transformation, the two tensors have the same shape;
- perform an elementwise operation on the resulting arrays.
```python
a = torch.arange(3).reshape((3, 1))
```
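A self-contained sketch of this mechanism (the second operand `b` is an illustrative assumption):

```python
import torch

a = torch.arange(3).reshape((3, 1))
b = torch.arange(2).reshape((1, 2))  # assumed second operand
print(a + b)  # both operands are expanded to shape (3, 2) before the elementwise sum
```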
Saving Memory
If we write `Y = X + Y`, we dereference the tensor that `Y` used to point to and instead point `Y` at the newly allocated memory.

```python
before = id(Y)
```
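A self-contained sketch of this check (the shapes of `X` and `Y` are illustrative):

```python
import torch

X = torch.arange(12, dtype=torch.float32).reshape((3, 4))
Y = torch.ones((3, 4))
before = id(Y)
Y = Y + X
print(id(Y) == before)  # False: Y now refers to newly allocated memory
```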
Fortunately, performing in-place operations is easy. We can assign the result of an operation to a previously allocated array `Y` by using slice notation: `Y[:] = <expression>`. To illustrate this concept, we overwrite the values of tensor `Z`, after initializing it with `zeros_like` to have the same shape as `Y`.

```python
Z = torch.zeros_like(Y)
print('id(Z):', id(Z))   # e.g. id(Z): 140381179266448
Z[:] = X + Y
print('id(Z):', id(Z))   # same id: the result is written into Z's existing memory
```
If the value of `X` is not reused in subsequent computations, we can also use `X[:] = X + Y` or `X += Y` to reduce the memory overhead of the operation.

```python
before = id(X)
X += Y
id(X) == before  # True
```
Shape
Change BCHW to BHWC
```python
# convert inputs from BCHW -> BHWC
inputs = inputs.permute(0, 2, 3, 1).contiguous()
```
The `.contiguous()` method ensures that the tensor is stored in a contiguous chunk of memory. This is often necessary after operations like `.permute()`, which change the tensor's shape and strides but don't move the data in memory.
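A small self-contained check of this behavior (tensor sizes are illustrative):

```python
import torch

inputs = torch.randn(8, 64, 32, 32)      # (B, C, H, W)
x = inputs.permute(0, 2, 3, 1)           # view with shape (B, H, W, C); no data is moved
print(x.shape, x.is_contiguous())        # torch.Size([8, 32, 32, 64]) False
x = x.contiguous()                       # copies the data into a contiguous layout
print(x.is_contiguous())                 # True
```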
Flatten
```python
# Flatten input
flat_input = inputs.view(-1, inputs.shape[-1])  # plausible completion: collapse (B, H, W, C) -> (B*H*W, C)
```
Create one-hot encoding
We create each one-hot encoding as follows: \[ \text{encoding}_k=\begin{cases} 1 & \text{for } k=\operatorname{argmin}_j\left\|z_e(x)-e_j\right\|_2, \\ 0 & \text{otherwise.} \end{cases} \]
```python
# Calculate distances
```
Explanation of `encodings.scatter_(1, encoding_indices, 1)` (Source):
First, note that `scatter_()` is an in-place function, meaning that it changes the values of the input tensor.

The official documentation for `scatter_(dim, index, src) → Tensor` tells us that the parameters are the dimension `dim`, the index tensor, and the source tensor. `dim` specifies the dimension along which the index tensor operates; the other dimensions are kept unchanged. As the function name suggests, the goal is to scatter values from the source tensor into the input tensor `self`: we loop through the values in the source tensor, find each one's target position in the input tensor, and replace the old value there.

Note that `src` can also just be a scalar. In this case, we simply scatter this single value according to the index tensor.
```python
self[index[i][j][k]][j][k] = src[i][j][k]  # if dim == 0
```
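Putting the pieces together, a hypothetical sketch of the whole one-hot encoding step (tensor sizes and variable names here are illustrative, not the original model's):

```python
import torch

flat_input = torch.randn(6, 8)    # (num_vectors, embedding_dim)
embedding = torch.randn(16, 8)    # (num_embeddings, embedding_dim), the codebook

# Calculate pairwise L2 distances and pick the nearest codebook entry per vector
distances = torch.cdist(flat_input, embedding)
encoding_indices = distances.argmin(dim=1, keepdim=True)   # shape (6, 1)

# Scatter a 1 into each row at the index of the nearest codebook vector
encodings = torch.zeros(flat_input.shape[0], embedding.shape[0])
encodings.scatter_(1, encoding_indices, 1)
print(encodings.sum(dim=1))  # every row sums to 1, i.e. one-hot
```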
The Straight Through Estimator Operation
https://ai.stackexchange.com/questions/26770/in-vq-vae-code-what-does-this-line-of-code-signify
```python
quantized = inputs + (quantized - inputs).detach()
```
In this line:

- `quantized` is the output from a quantization process where each input vector from `inputs` is replaced with its nearest vector in a codebook.
- `(quantized - inputs).detach()` computes the difference between the quantized vectors and the original input vectors, and `.detach()` is called to prevent gradients from flowing through this operation. This means that during backpropagation, the gradient of this term is treated as zero, effectively stopping gradients from flowing through the quantization step.
- `inputs + (quantized - inputs).detach()` then adds this detached difference back to the original `inputs`. The result is a tensor that is numerically equal to `quantized` during the forward pass. However, during the backward pass (gradient computation), the gradients are taken with respect to `inputs`. This "trick" allows gradients to bypass the non-differentiable quantization step, maintaining the ability to optimize the parameters that affect `inputs` through gradient descent.
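A small self-contained sketch illustrating this behavior (the quantization step below is a stand-in for the real codebook lookup, not the original VQ-VAE code):

```python
import torch

inputs = torch.randn(4, 8, requires_grad=True)
codebook = torch.randn(16, 8)

# Stand-in nearest-neighbour quantization (non-differentiable because of argmin)
quantized = codebook[torch.cdist(inputs, codebook).argmin(dim=1)]

# Straight-through estimator: the forward value equals `quantized`,
# but the backward pass treats the expression as the identity on `inputs`
ste = inputs + (quantized - inputs).detach()
ste.sum().backward()
print(inputs.grad)  # all ones: the gradient bypassed the argmin
```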
Calculus
\[ f'(x) = \lim_{h \rightarrow 0} \frac{f(x+h) - f(x)}{h}. \]
Define \(u = f(x) = 3x^2-4x\):

```python
def f(x):
    return 3 * x ** 2 - 4 * x
```
We can see that \(f'(x) = 6x - 4\) and \(f'(1) = 2\).
Now we numerically evaluate \(f'(1)\) for \(h = 10^{-1}, 10^{-2}, 10^{-3}, \cdots\).
```python
# np.arange(-1, -6, -1): create an array of values ranging from -1 to -6 (exclusive) with a step of -1
```
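A minimal sketch of this numerical check (assuming the `f` defined above; `numerical_lim` is a helper name introduced here):

```python
import numpy as np

def numerical_lim(f, x, h):
    return (f(x + h) - f(x)) / h

for h in 10.0 ** np.arange(-1, -6, -1):  # h = 0.1, 0.01, ..., 1e-05
    print(f'h={h:.5f}, numerical limit={numerical_lim(f, 1, h):.5f}')
```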
As \(h\) becomes smaller, the result approaches \(2\):
```
h=0.10000, numerical limit=2.30000
```
Plot
Define \(u = f(x) = 3x^2-4x\). Since:

- \(f'(1) = 6 \cdot 1 - 4 = 2\)
- \(f(1) = 3 - 4 = -1\)

the tangent line at \(x=1\) is \(y = -1 + 2(x-1) = 2x-3\).
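A minimal matplotlib sketch of \(f\) and its tangent line at \(x=1\) (the axis range is arbitrary):

```python
import numpy as np
import matplotlib.pyplot as plt

def f(x):
    return 3 * x ** 2 - 4 * x

x = np.arange(0, 3, 0.1)
plt.plot(x, f(x), label='f(x)')
plt.plot(x, 2 * x - 3, '--', label='tangent line at x=1: y = 2x - 3')
plt.xlabel('x')
plt.ylabel('f(x)')
plt.legend()
plt.show()
```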
Automatic Differentiation
Let's assume that we are interested in differentiating the function \(y = 2x^{\text{T}}x\) with respect to the column vector \(x\). To start, we assign `x` an initial value.

Let \(x = [0,1,2,3]^{\text{T}}\) and \(y = 2x^{\text{T}}x\). Note: \(y\) is twice the dot product of \(x\) with itself.
```python
x = torch.arange(4.0, requires_grad=True)  # or: x = torch.arange(4.0); x.requires_grad_(True)
y = 2 * torch.dot(x, x)
```
We can now take the gradient of `y` with respect to `x` by calling its `backward` method. Next, we can access the gradient via `x`'s `grad` attribute.

```python
y.backward()
x.grad
```
We already know that the gradient of the function \(y = 2x^{\text{T}}x\) with respect to \(x\) should be \(4x\).

```python
print(x.grad == 4 * x)
print(f"4*x = {4 * x} when x = {x}")
print(f"x.grad = {x.grad} when x = {x}")
```
PyTorch does not automatically reset the gradient buffer when we record a new gradient. Instead, the new gradient is added to the already-stored gradient. This behavior comes in handy when we want to optimize the sum of multiple objective functions. To reset the gradient buffer, we can call `x.grad.zero_()` as follows:

```python
x.grad.zero_()  # Reset the gradient
```
Example: \(y = \sum x_i\)
```python
y = x.sum()
```
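A self-contained sketch of this example (re-creating `x` so the snippet runs on its own):

```python
import torch

x = torch.arange(4.0, requires_grad=True)
y = x.sum()
y.backward()
print(x.grad)  # tensor([1., 1., 1., 1.]): each x_i enters the sum with derivative 1
```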
Loading dataset
```python
training_data = datasets.CIFAR10(root="data", train=True, download=True,
                                 transform=transform)  # `transform` is assumed to be a transforms.Compose pipeline (see below)
```
Loading the Dataset:

- `datasets.CIFAR10(root="data", train=True, download=True, ...)`: This line loads the CIFAR-10 dataset for training. If the dataset is not already downloaded, it will be downloaded and saved to the `"data"` directory. The `train=True` parameter indicates that the training set should be loaded.
- `datasets.CIFAR10(root="data", train=False, download=True, ...)`: Similarly, this line loads the CIFAR-10 dataset for validation (or testing). The `train=False` parameter specifies that the test set should be loaded instead of the training set.
Transformations:

- `transforms.Compose([...])`: This function composes several transformations together. It takes a list of transformations and applies them in sequence, which is useful for chaining together multiple processing steps.
- `transforms.ToTensor()`: This transformation converts PIL images or NumPy arrays into PyTorch tensors. It automatically scales the images to the range [0.0, 1.0] by dividing the pixel values by 255.
- `transforms.Normalize((0.5, 0.5, 0.5), (1.0, 1.0, 1.0))`: This transformation normalizes the tensor images. The first tuple `(0.5, 0.5, 0.5)` is the mean for each of the three channels (Red, Green, Blue), and the second tuple `(1.0, 1.0, 1.0)` is the standard deviation for each channel. Normalization subtracts the mean and divides by the standard deviation for each channel: \[ \text{Normalized Channel} = \frac{\text{Channel} - \text{Mean}}{\text{Standard Deviation}} \]

In this case, since the mean is 0.5 and the standard deviation is 1.0 for all channels, the normalization essentially shifts the pixel values to the range [-0.5, 0.5]. Normalization is a common preprocessing step that can help speed up convergence during training by ensuring that the input data has a mean of 0 and a standard deviation of 1.
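Putting the pieces together, a sketch of the full loading pipeline (the batch size and `DataLoader` settings are assumptions):

```python
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.5, 0.5, 0.5), (1.0, 1.0, 1.0)),
])

training_data = datasets.CIFAR10(root="data", train=True, download=True,
                                 transform=transform)
validation_data = datasets.CIFAR10(root="data", train=False, download=True,
                                   transform=transform)

train_loader = DataLoader(training_data, batch_size=64, shuffle=True)
val_loader = DataLoader(validation_data, batch_size=64, shuffle=False)
```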
Saving and loading the checkpoint
Saving a Checkpoint:
```python
import os
```
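A minimal sketch of saving a checkpoint (the model, optimizer, and file layout are illustrative assumptions):

```python
import os
import torch
import torch.nn as nn

model = nn.Linear(10, 2)                                   # illustrative model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
epoch, loss = 5, 0.42                                      # illustrative values

os.makedirs("checkpoints", exist_ok=True)
torch.save({
    "epoch": epoch,
    "model_state_dict": model.state_dict(),
    "optimizer_state_dict": optimizer.state_dict(),
    "loss": loss,
}, os.path.join("checkpoints", f"checkpoint_epoch_{epoch}.pth"))
```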
Loading a Checkpoint
To resume training or evaluate your model using a saved checkpoint, you first need to initialize your model and optimizer, then load the checkpoint and update the model and optimizer states:
```python
# Initialize your model and optimizer here
```
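A matching sketch of loading it again (the model, optimizer, and path mirror the saving sketch above):

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

checkpoint = torch.load("checkpoints/checkpoint_epoch_5.pth")
model.load_state_dict(checkpoint["model_state_dict"])
optimizer.load_state_dict(checkpoint["optimizer_state_dict"])
start_epoch = checkpoint["epoch"] + 1

model.train()   # switch to model.eval() if you only want to evaluate
```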
Documentation
While we cannot possibly introduce every single PyTorch function and class (and the information might become outdated quickly), the API documentation and the additional tutorials and examples provide plenty of guidance beyond these notes. This section provides some guidance for how to explore the PyTorch API.
To know which functions and classes can be called in a module, we invoke the `dir` function. For instance, we can query all properties in the module for generating random numbers:

```python
print(dir(torch.distributions))
```
For specific instructions on how to use a given function or class, we can invoke Python's built-in `help` function. As an example, let's explore the usage instructions for the `torch.ones` function.

```python
help(torch.ones)
```
Debug
The core functionalities of PyTorch, including data loading, are implemented in C++. While you interact with these utilities from Python, the actual work is done in the C++ backend. This means that there may be no Python-level source code to step into when debugging these parts.