Sources:
- Derivatives, Backpropagation, and Vectorization
- Wikipedia: chain rule
- Wikipedia: quotient rule
It is recommended to work through the backpropagation computations by hand to get a deeper understanding; read the first article listed above if you are interested.
Tensors
The word tensor is used in different ways in different fields; you may have seen the term before in physics or abstract algebra. The machine learning definition of a tensor as a $D$-dimensional grid of numbers is closely related to the definitions of tensors in these other fields.
In machine learning,
- a 0-axis tensor is a scalar,
- a 1-axis tensor is a vector,
- a 2-axis tensor is a matrix,
- a tensor with 3 or more axes is just called a "tensor" (see the short sketch below).
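As a quick illustration, here is a minimal NumPy sketch (NumPy itself is an assumption here, not something the sources above prescribe) showing how the number of axes maps onto these names:

```python
import numpy as np

s = np.float32(3.0)            # 0-axis tensor: a scalar
v = np.array([1.0, 2.0, 3.0])  # 1-axis tensor: a vector, shape (3,)
m = np.ones((2, 3))            # 2-axis tensor: a matrix, shape (2, 3)
t = np.zeros((2, 3, 4))        # 3-axis tensor, just called a "tensor"

for name, a in [("scalar", s), ("vector", v), ("matrix", m), ("tensor", t)]:
    print(name, np.ndim(a), np.shape(a))   # number of axes and shape of each
```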
Chain rule
If $h$ is the function such that $h(x) = f(g(x))$ for every $x$, then the chain rule is, in Lagrange's notation,
$$h'(x) = f'(g(x))\, g'(x)$$
or, equivalently,
$$h' = (f' \circ g) \cdot g'$$
The chain rule may also be expressed in Leibniz's notation. If a variable $z$ depends on the variable $y$, which itself depends on the variable $x$, then $z$ depends on $x$ as well, via the intermediate variable $y$. In this case, the chain rule is expressed as
$$\frac{dz}{dx} = \frac{dz}{dy} \cdot \frac{dy}{dx}$$
and
$$\left.\frac{dz}{dx}\right|_{x} = \left.\frac{dz}{dy}\right|_{y(x)} \cdot \left.\frac{dy}{dx}\right|_{x}$$
for indicating at which points the derivatives have to be evaluated.
Quotient rule
Let $h(x) = \frac{f(x)}{g(x)}$, where both $f$ and $g$ are differentiable and $g(x) \neq 0$. The quotient rule states that the derivative of $h(x)$ is
$$h'(x) = \frac{f'(x)\,g(x) - f(x)\,g'(x)}{g(x)^2}$$
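As a quick sanity check, here is a small symbolic sketch (assuming SymPy is available; the choices $f(x) = \sin x$ and $g(x) = x^2 + 1$ are arbitrary examples, not taken from the sources):

```python
import sympy as sp

x = sp.symbols('x')
f = sp.sin(x)       # arbitrary numerator
g = x**2 + 1        # arbitrary denominator, nonzero for all real x

h = f / g
lhs = sp.diff(h, x)                                    # derivative of the quotient
rhs = (sp.diff(f, x) * g - f * sp.diff(g, x)) / g**2   # quotient rule formula

print(sp.simplify(lhs - rhs))   # prints 0: the two expressions agree
```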
Scalar case
Given a function $f: \mathbb{R} \to \mathbb{R}$, the derivative of $f$ at a point $x \in \mathbb{R}$ is defined as:
$$f'(x) = \lim_{h \to 0} \frac{f(x + h) - f(x)}{h}$$
In this scalar case, the derivative of the function $f$ at the point $x$ tells us how much the function changes as the input changes by a small amount $\varepsilon$:
$$f(x + \varepsilon) \approx f(x) + \varepsilon f'(x)$$
For ease of notation we will commonly assign a name to the output of $f$, say $y = f(x)$, and write $\frac{\partial y}{\partial x}$ for the derivative of $y$ with respect to $x$. We can write this relationship as
$$x \to x + \Delta x \ \Longrightarrow\ y \to\ \approx y + \frac{\partial y}{\partial x} \Delta x$$
You should read this as saying "changing $x$ to $x + \Delta x$ implies that $y$ will change to approximately $y + \frac{\partial y}{\partial x} \Delta x$". This notation is nonstandard, but I like it since it emphasizes the relationship between changes in $x$ and changes in $y$.
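Here is a tiny numerical sketch of this relationship (the function $f(x) = x^2$ is an arbitrary example, not taken from the text):

```python
f = lambda x: x ** 2        # arbitrary example function
df = lambda x: 2 * x        # its derivative

x, eps = 3.0, 1e-3
print(f(x + eps))           # 9.006001: the actual value after the small change
print(f(x) + eps * df(x))   # 9.006: the first-order approximation f(x) + eps * f'(x)
```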
In the scalar case suppose that $f, g: \mathbb{R} \to \mathbb{R}$ and $y = f(x)$, $z = g(y)$; then we can also write $z = (g \circ f)(x)$, or draw the following computational graph:
$$x \xrightarrow{\ f\ } y \xrightarrow{\ g\ } z$$
The (scalar) chain rule tells us that
$$\frac{\partial z}{\partial x} = \frac{\partial z}{\partial y} \frac{\partial y}{\partial x}$$
This equation makes intuitive sense. The derivatives $\frac{\partial y}{\partial x}$ and $\frac{\partial z}{\partial y}$ give:
$$x \to x + \Delta x \ \Longrightarrow\ y \to\ \approx y + \frac{\partial y}{\partial x} \Delta x$$
$$y \to y + \Delta y \ \Longrightarrow\ z \to\ \approx z + \frac{\partial z}{\partial y} \Delta y$$
Combining these two rules lets us compute the effect of $x$ on $z$:
- if $x$ changes by $\Delta x$ then $y$ will change by $\frac{\partial y}{\partial x} \Delta x$, so we have $\Delta y = \frac{\partial y}{\partial x} \Delta x$.
- if $y$ changes by $\Delta y$ then $z$ will change by $\frac{\partial z}{\partial y} \Delta y = \frac{\partial z}{\partial y} \frac{\partial y}{\partial x} \Delta x$, which is exactly what the chain rule tells us (checked numerically in the sketch after this list).
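Here is a minimal numerical sketch of this chain of changes (the functions $f(x) = x^2$ and $g(y) = \sin y$ are arbitrary examples):

```python
import numpy as np

f, df = lambda x: x ** 2, lambda x: 2 * x   # y = f(x), arbitrary example
g, dg = np.sin, np.cos                      # z = g(y), arbitrary example

x, dx = 1.5, 1e-4
y = f(x)
z = g(y)

dz_chain = dg(y) * df(x) * dx   # (dz/dy) * (dy/dx) * delta_x, as the chain rule predicts
dz_actual = g(f(x + dx)) - z    # the actual change in z
print(dz_chain, dz_actual)      # both are approximately -1.885e-4
```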
Gradient: Vector in, scalar out
This same intuition carries over into the vector case. Now suppose that $f: \mathbb{R}^N \to \mathbb{R}$ takes a vector as input and produces a scalar. The derivative of $f$ at the point $x \in \mathbb{R}^N$ is now called the gradient. The gradient $\nabla_x f(x)$, or $\frac{\partial y}{\partial x}$, is defined as the vector of partial derivatives:
$$\frac{\partial y}{\partial x} = \nabla_x f(x) = \left( \frac{\partial y}{\partial x_1}, \ \frac{\partial y}{\partial x_2}, \ \ldots, \ \frac{\partial y}{\partial x_N} \right)$$
where $x_i$ is the $i$-th coordinate of the vector $x$, which is a scalar, so each partial derivative $\frac{\partial y}{\partial x_i}$ is also a scalar.
If we again set $y = f(x)$ then we have the relationship
$$x \to x + \Delta x \ \Longrightarrow\ y \to\ \approx y + \frac{\partial y}{\partial x} \cdot \Delta x$$
The formula changes a bit from the scalar case to account for the fact that $x$, $\Delta x$, and $\frac{\partial y}{\partial x}$ are now vectors in $\mathbb{R}^N$ while $y$ is a scalar. In particular, when multiplying $\frac{\partial y}{\partial x}$ by $\Delta x$ we use the dot product, which combines two vectors to give a scalar.
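A short NumPy sketch of this dot-product relationship (the function $f(x) = \sum_i x_i^2$, whose gradient is $2x$, is an arbitrary example):

```python
import numpy as np

f = lambda x: np.sum(x ** 2)   # vector in, scalar out (arbitrary example)
grad_f = lambda x: 2 * x       # its gradient, a vector of partial derivatives

x = np.array([1.0, -2.0, 3.0])
dx = 1e-4 * np.array([0.5, 1.0, -0.25])    # a small change in the input vector

print(f(x + dx))                           # the actual new value of y
print(f(x) + np.dot(grad_f(x), dx))        # y + (dy/dx) . delta_x, using the dot product
```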
Jacobian: Vector in, Vector out
Now suppose that $f: \mathbb{R}^N \to \mathbb{R}^M$ takes a vector as input and produces a vector as output. Then the derivative of $f$ at a point $x$, also called the Jacobian matrix, is the $M \times N$ matrix of partial derivatives. If we again set $y = f(x)$ then we can write:
$$\frac{\partial y}{\partial x} = \begin{pmatrix} \frac{\partial y_1}{\partial x_1} & \cdots & \frac{\partial y_1}{\partial x_N} \\ \vdots & \ddots & \vdots \\ \frac{\partial y_M}{\partial x_1} & \cdots & \frac{\partial y_M}{\partial x_N} \end{pmatrix}$$
The Jacobian tells us the relationship between each element of $x$ and each element of $y$: the $(i, j)$-th element of $\frac{\partial y}{\partial x}$ is equal to $\frac{\partial y_i}{\partial x_j}$, so it tells us the amount by which $y_i$ will change if $x_j$ is changed by a small amount.
Just as in the previous cases, the Jacobian tells us the relationship between changes in the input and changes in the output:
$$x \to x + \Delta x \ \Longrightarrow\ y \to\ \approx y + \frac{\partial y}{\partial x} \Delta x$$
Here $\frac{\partial y}{\partial x}$ is an $M \times N$ matrix and $\Delta x$ is an $N$-dimensional vector, so the product $\frac{\partial y}{\partial x} \Delta x$ is a matrix-vector multiplication resulting in an $M$-dimensional vector:
$$\left(\frac{\partial y}{\partial x} \Delta x\right)_i = \sum_{j=1}^{N} \frac{\partial y_i}{\partial x_j} \Delta x_j$$
The chain rule can be extended to the vector case using Jacobian matrices. Suppose that $f: \mathbb{R}^N \to \mathbb{R}^M$ and $g: \mathbb{R}^M \to \mathbb{R}^K$. Let $x \in \mathbb{R}^N$, $y \in \mathbb{R}^M$, and $z \in \mathbb{R}^K$ with $y = f(x)$ and $z = g(y)$, so we have the same computational graph as in the scalar case:
$$x \xrightarrow{\ f\ } y \xrightarrow{\ g\ } z$$
The chain rule also has the same form as the scalar case:
$$\frac{\partial z}{\partial x} = \frac{\partial z}{\partial y} \frac{\partial y}{\partial x}$$
However now each of these terms is a matrix: $\frac{\partial z}{\partial y}$ is a $K \times M$ matrix, $\frac{\partial y}{\partial x}$ is an $M \times N$ matrix, and $\frac{\partial z}{\partial x}$ is a $K \times N$ matrix; the multiplication of $\frac{\partial z}{\partial y}$ and $\frac{\partial y}{\partial x}$ is matrix multiplication.
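Here is a small NumPy sketch of the vector chain rule (the functions $f$, $g$ and their hand-written Jacobians are arbitrary examples), checking the Jacobian product against finite differences of the composition:

```python
import numpy as np

# f: R^2 -> R^3 and g: R^3 -> R^2 (arbitrary example functions)
f = lambda x: np.array([x[0] * x[1], x[0] ** 2, x[1]])
g = lambda y: np.array([y[0] + y[1], y[1] * y[2]])

# Their Jacobians, written out by hand: Jf is 3x2, Jg is 2x3
Jf = lambda x: np.array([[x[1], x[0]], [2 * x[0], 0.0], [0.0, 1.0]])
Jg = lambda y: np.array([[1.0, 1.0, 0.0], [0.0, y[2], y[1]]])

x = np.array([1.0, 2.0])
y = f(x)

# Chain rule: dz/dx (2x2) = dz/dy (2x3) @ dy/dx (3x2)
dzdx = Jg(y) @ Jf(x)

# Numerical check: finite-difference Jacobian of the composition g(f(x))
eps = 1e-6
num = np.column_stack([(g(f(x + eps * e)) - g(f(x))) / eps for e in np.eye(2)])
print(np.allclose(dzdx, num, atol=1e-4))   # True
```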
Generalized Jacobian: Tensor in, Tensor out
Suppose now that $f: \mathbb{R}^{N_1 \times \cdots \times N_{D_x}} \to \mathbb{R}^{M_1 \times \cdots \times M_{D_y}}$. Then:
- the input to $f$ is a $D_x$-dimensional tensor of shape $N_1 \times \cdots \times N_{D_x}$, and
- the output of $f$ is a $D_y$-dimensional tensor of shape $M_1 \times \cdots \times M_{D_y}$.
If $y = f(x)$ then the derivative $\frac{\partial y}{\partial x}$ is a generalized Jacobian, which is an object with shape
$$\left(M_1 \times \cdots \times M_{D_y}\right) \times \left(N_1 \times \cdots \times N_{D_x}\right)$$
Note that we have separated the dimensions of $\frac{\partial y}{\partial x}$ into two groups:
- the first group matches the dimensions of $y$, and
- the second group matches the dimensions of $x$.
With this grouping, we can think of the generalized Jacobian as a generalization of a matrix, where each "row" has the same shape as $x$ and each "column" has the same shape as $y$.
Now if we let $i \in \mathbb{Z}^{D_y}$ and $j \in \mathbb{Z}^{D_x}$ be vectors of integer indices, then we can write
$$\left(\frac{\partial y}{\partial x}\right)_{i,j} = \frac{\partial y_i}{\partial x_j}$$
Note that $y_i$ and $x_j$ are scalars, so the derivative $\frac{\partial y_i}{\partial x_j}$ is also a scalar. Using this notation we see that, like the standard Jacobian, the generalized Jacobian tells us the relative rates of change between all elements of $x$ and all elements of $y$.
The generalized Jacobian gives the same relationship between inputs and outputs as before:
$$x \to x + \Delta x \ \Longrightarrow\ y \to\ \approx y + \frac{\partial y}{\partial x} \Delta x$$
The difference is that now $\Delta x$ is a tensor of shape $N_1 \times \cdots \times N_{D_x}$ and $\frac{\partial y}{\partial x}$ is a generalized matrix of shape $\left(M_1 \times \cdots \times M_{D_y}\right) \times \left(N_1 \times \cdots \times N_{D_x}\right)$. The product $\frac{\partial y}{\partial x} \Delta x$ is therefore a generalized matrix-vector multiply, which results in a tensor of shape $M_1 \times \cdots \times M_{D_y}$.
The generalized matrix-vector multiply follows the same algebraic rules as a traditional matrix-vector multiply:
$$\left(\frac{\partial y}{\partial x} \Delta x\right)_i = \sum_j \left(\frac{\partial y}{\partial x}\right)_{i,j} (\Delta x)_j = \left(\frac{\partial y}{\partial x}\right)_i \cdot \Delta x$$
The only difference is that the indices $i$ and $j$ are not scalars; instead they are vectors of indices. In the equation above the term $\left(\frac{\partial y}{\partial x}\right)_i$ is the $i$-th "row" of the generalized matrix $\frac{\partial y}{\partial x}$, which is a tensor with the same shape as $x$. We have also used the convention that the dot product between two tensors of the same shape is an elementwise product followed by a sum, identical to the dot product between vectors.
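Here is a short NumPy sketch of a generalized matrix-vector multiply (the shapes and the random generalized Jacobian are arbitrary choices); `np.tensordot` sums over the axes that match $x$, which is exactly the tensor dot product described above:

```python
import numpy as np

# y = f(x) with x of shape (2, 3) and y of shape (4, 5) (arbitrary small shapes)
x_shape, y_shape = (2, 3), (4, 5)

rng = np.random.default_rng(0)
dydx = rng.standard_normal(y_shape + x_shape)   # generalized Jacobian, shape (4, 5) x (2, 3)
dx = 1e-3 * rng.standard_normal(x_shape)        # a small change with the same shape as x

# Generalized matrix-vector multiply: contract over the axes that match x,
# leaving a tensor with the same shape as y
dy = np.tensordot(dydx, dx, axes=([2, 3], [0, 1]))
print(dy.shape)   # (4, 5)

# The same thing written index-by-index: each "row" of dy/dx (one slice per
# element of y) has the same shape as x, and the dot product is an
# elementwise product followed by a sum
dy_manual = np.array([[np.sum(dydx[a, b] * dx) for b in range(5)] for a in range(4)])
print(np.allclose(dy, dy_manual))   # True
```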
The chain rule also looks the same in the case of tensor-valued functions. Suppose that $y = f(x)$ and $z = g(y)$, where $x$ and $y$ have the same shapes as above and $z$ has shape $K_1 \times \cdots \times K_{D_z}$. Now the chain rule looks the same as before:
$$\frac{\partial z}{\partial x} = \frac{\partial z}{\partial y} \frac{\partial y}{\partial x}$$
The difference is that now $\frac{\partial z}{\partial y}$ is a generalized matrix of shape $\left(K_1 \times \cdots \times K_{D_z}\right) \times \left(M_1 \times \cdots \times M_{D_y}\right)$, and $\frac{\partial y}{\partial x}$ is a generalized matrix of shape $\left(M_1 \times \cdots \times M_{D_y}\right) \times \left(N_1 \times \cdots \times N_{D_x}\right)$; the product $\frac{\partial z}{\partial y} \frac{\partial y}{\partial x}$ is a generalized matrix-matrix multiply, resulting in an object of shape $\left(K_1 \times \cdots \times K_{D_z}\right) \times \left(N_1 \times \cdots \times N_{D_x}\right)$. Like the generalized matrix-vector multiply defined above, the generalized matrix-matrix multiply follows the same algebraic rules as the traditional matrix-matrix multiply:
$$\left(\frac{\partial z}{\partial y} \frac{\partial y}{\partial x}\right)_{i,j} = \sum_k \left(\frac{\partial z}{\partial y}\right)_{i,k} \left(\frac{\partial y}{\partial x}\right)_{k,j} = \left(\frac{\partial z}{\partial y}\right)_i \cdot \left(\frac{\partial y}{\partial x}\right)_j$$
In this equation the indices $i$, $j$, and $k$ are vectors of indices, and the terms $\left(\frac{\partial z}{\partial y}\right)_i$ and $\left(\frac{\partial y}{\partial x}\right)_j$ are the $i$-th "row" of $\frac{\partial z}{\partial y}$ and the $j$-th "column" of $\frac{\partial y}{\partial x}$, respectively.
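And the corresponding sketch of a generalized matrix-matrix multiply (again with arbitrary shapes), contracting over the axes that match $y$:

```python
import numpy as np

# Shapes for x, y, z (arbitrary small choices): x is (2, 3), y is (4, 5), z is (6,)
x_shape, y_shape, z_shape = (2, 3), (4, 5), (6,)

rng = np.random.default_rng(1)
dzdy = rng.standard_normal(z_shape + y_shape)   # shape (6) x (4, 5)
dydx = rng.standard_normal(y_shape + x_shape)   # shape (4, 5) x (2, 3)

# Generalized matrix-matrix multiply: contract over the axes matching y
dzdx = np.tensordot(dzdy, dydx, axes=([1, 2], [0, 1]))
print(dzdx.shape)   # (6, 2, 3), i.e. shape (6) x (2, 3)
```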
Backward Propagation with Tensors
In the context of neural networks, a layer $f$ is typically a function of (tensor) inputs $x$ and weights $w$; the (tensor) output of the layer is then $y = f(x, w)$. The layer is typically embedded in some large neural network with scalar loss $L$.
During backpropagation, we assume that we are given $\frac{\partial L}{\partial y}$ (this is the upstream gradient, and it has already been computed by the upstream node), and our goal is to compute $\frac{\partial L}{\partial x}$ and $\frac{\partial L}{\partial w}$. By the chain rule we know that
$$\frac{\partial L}{\partial x} = \frac{\partial L}{\partial y} \frac{\partial y}{\partial x} \qquad\qquad \frac{\partial L}{\partial w} = \frac{\partial L}{\partial y} \frac{\partial y}{\partial w}$$
Therefore one way to proceed would be to form the (generalized) Jacobians $\frac{\partial y}{\partial x}$ and $\frac{\partial y}{\partial w}$ and use (generalized) matrix multiplication to compute $\frac{\partial L}{\partial x}$ and $\frac{\partial L}{\partial w}$.
However, there's a problem with this approach: the Jacobian matrices $\frac{\partial y}{\partial x}$ and $\frac{\partial y}{\partial w}$ are typically far too large to fit in memory.
Problem: The Jacobian is too large to fit in memory
As a concrete example, let's suppose that the layer $f$ is a linear layer that takes as input a minibatch of $N$ vectors, each of dimension $D$, and produces a minibatch of $N$ vectors, each of dimension $M$.
Then $x$ is a matrix of shape $N \times D$, $w$ is a matrix of shape $D \times M$, and $y = xw$ is a matrix of shape $N \times M$. The Jacobian $\frac{\partial y}{\partial x}$ then has shape $(N \times M) \times (N \times D)$.
In a typical neural network we might have $N = 64$ and $D = M = 4096$; then $\frac{\partial y}{\partial x}$ consists of $64 \cdot 4096 \cdot 64 \cdot 4096$ scalar values; this is more than 68 billion numbers; using 32-bit floating point, this Jacobian matrix would take 256 GB of memory to store. Therefore it is completely impractical to explicitly store and manipulate the Jacobian matrix.
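The arithmetic above is easy to reproduce:

```python
N, D, M = 64, 4096, 4096                  # minibatch size and layer dimensions

num_elements = (N * M) * (N * D)          # the Jacobian dy/dx has shape (N x M) x (N x D)
bytes_needed = num_elements * 4           # 4 bytes per 32-bit float

print(f"{num_elements:,}")                # 68,719,476,736 -- more than 68 billion numbers
print(bytes_needed / 2**30, "GiB")        # 256.0
```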
However it turns out that for most common neural network layers, we can derive expressions that compute the product $\frac{\partial L}{\partial y} \frac{\partial y}{\partial x}$ without explicitly forming the Jacobian $\frac{\partial y}{\partial x}$. Even better, we can typically derive this expression without even computing an explicit expression for the Jacobian; in many cases we can work out a small case on paper and then infer the general formula.
Solution
Let's see how this works out for the case of the linear layer $y = xw$. Set $N = 1$, $D = 2$, $M = 3$. Then we can explicitly write
$$y = \begin{pmatrix} y_{1,1} & y_{1,2} & y_{1,3} \end{pmatrix} = xw = \begin{pmatrix} x_{1,1} & x_{1,2} \end{pmatrix} \begin{pmatrix} w_{1,1} & w_{1,2} & w_{1,3} \\ w_{2,1} & w_{2,2} & w_{2,3} \end{pmatrix}$$
$$= \begin{pmatrix} x_{1,1} w_{1,1} + x_{1,2} w_{2,1} & \ \ x_{1,1} w_{1,2} + x_{1,2} w_{2,2} & \ \ x_{1,1} w_{1,3} + x_{1,2} w_{2,3} \end{pmatrix}$$
During backpropagation we assume that we have access to $\frac{\partial L}{\partial y}$ (the upstream gradient), which technically has shape $(1) \times (N \times M)$; however for notational convenience we will instead think of it as a matrix of shape $N \times M$. Then we can write
$$\frac{\partial L}{\partial y} = \begin{pmatrix} \frac{\partial L}{\partial y_{1,1}} & \frac{\partial L}{\partial y_{1,2}} & \frac{\partial L}{\partial y_{1,3}} \end{pmatrix}$$
Our goal now is to derive an expression for $\frac{\partial L}{\partial x}$ in terms of $x$, $w$, and $\frac{\partial L}{\partial y}$, without explicitly forming the entire Jacobian $\frac{\partial y}{\partial x}$.
We know that $\frac{\partial L}{\partial x}$ will have shape $(1) \times (N \times D)$, but as is typical for representing gradients we instead view $\frac{\partial L}{\partial x}$ as a matrix of shape $N \times D$.
We know that each element of $\frac{\partial L}{\partial x}$ is a scalar giving the partial derivative of $L$ with respect to one element of $x$:
$$\frac{\partial L}{\partial x} = \begin{pmatrix} \frac{\partial L}{\partial x_{1,1}} & \frac{\partial L}{\partial x_{1,2}} \end{pmatrix}$$
Thinking one element at a time, the chain rule tells us that
$$\frac{\partial L}{\partial x_{1,1}} = \frac{\partial L}{\partial y} \frac{\partial y}{\partial x_{1,1}}$$
Since we already know $\frac{\partial L}{\partial y}$, we just have to compute $\frac{\partial y}{\partial x_{1,1}}$.
If we view $\frac{\partial L}{\partial y}$ and $\frac{\partial y}{\partial x_{1,1}}$ as matrices of shape $N \times M$, then their generalized matrix product is simply the dot product $\frac{\partial L}{\partial y} \cdot \frac{\partial y}{\partial x_{1,1}}$. Now we compute
$$\frac{\partial y}{\partial x_{1,1}} = \begin{pmatrix} \frac{\partial y_{1,1}}{\partial x_{1,1}} & \frac{\partial y_{1,2}}{\partial x_{1,1}} & \frac{\partial y_{1,3}}{\partial x_{1,1}} \end{pmatrix} = \begin{pmatrix} w_{1,1} & w_{1,2} & w_{1,3} \end{pmatrix}$$
where the final equality comes from taking the derivatives of $y$ with respect to $x_{1,1}$ in the explicit expression above. We can now combine these results and write
$$\frac{\partial L}{\partial x_{1,1}} = \frac{\partial L}{\partial y} \cdot \frac{\partial y}{\partial x_{1,1}} = \frac{\partial L}{\partial y_{1,1}} w_{1,1} + \frac{\partial L}{\partial y_{1,2}} w_{1,2} + \frac{\partial L}{\partial y_{1,3}} w_{1,3}$$
Repeating the same computation for $x_{1,2}$ gives $\frac{\partial L}{\partial x_{1,2}} = \frac{\partial L}{\partial y_{1,1}} w_{2,1} + \frac{\partial L}{\partial y_{1,2}} w_{2,2} + \frac{\partial L}{\partial y_{1,3}} w_{2,3}$.
This gives us our final expression for $\frac{\partial L}{\partial x}$:
$$\frac{\partial L}{\partial x} = \begin{pmatrix} \frac{\partial L}{\partial x_{1,1}} & \frac{\partial L}{\partial x_{1,2}} \end{pmatrix} = \frac{\partial L}{\partial y}\, w^T$$
This final result is very interesting because it allows us to efficiently compute $\frac{\partial L}{\partial x}$ without explicitly forming the Jacobian $\frac{\partial y}{\partial x}$. We have only derived this formula for the specific case of $N = 1$, $D = 2$, $M = 3$, but it in fact holds in general.
By a similar thought process we can derive a similar expression for $\frac{\partial L}{\partial w}$ without explicitly forming the Jacobian $\frac{\partial y}{\partial w}$.
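Putting the section together, here is a minimal NumPy sketch of the linear-layer backward pass, using $\frac{\partial L}{\partial x} = \frac{\partial L}{\partial y}\, w^T$ and $\frac{\partial L}{\partial w} = x^T \frac{\partial L}{\partial y}$; the toy loss $L = \sum_{i,j} c_{i,j}\, y_{i,j}$ and all shapes are arbitrary choices made only for this check:

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, M = 4, 3, 5
x = rng.standard_normal((N, D))
w = rng.standard_normal((D, M))
c = rng.standard_normal((N, M))           # fixed coefficients defining the toy loss

loss = lambda x, w: np.sum((x @ w) * c)   # scalar loss L
dy = c                                    # upstream gradient dL/dy for this particular loss

# Backward pass without ever forming the (N*M) x (N*D) Jacobian
dx = dy @ w.T                             # dL/dx, shape (N, D)
dw = x.T @ dy                             # dL/dw, shape (D, M)

# Numerical check of dL/dx via finite differences (the analogous check works for dw)
eps = 1e-6
dx_num = np.zeros_like(x)
for i in range(N):
    for j in range(D):
        xp = x.copy()
        xp[i, j] += eps
        dx_num[i, j] = (loss(xp, w) - loss(x, w)) / eps
print(np.allclose(dx, dx_num, atol=1e-4))   # True
```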