Derivatives, Backpropagation, and Vectorization

Sources:

  1. Derivatives, Backpropagation, and Vectorization
  2. Wikipedia: chain rule
  3. Wikipedia: quotient rule

It is recommended to work through the backpropagation computations by hand to get a deeper understanding; read the first source above if you are interested.

Tensors

The word tensor is used in different ways in different fields; you may have seen the term before in physics or abstract algebra. The machine learning definition of a tensor as a D-dimensional grid of numbers is closely related to the definitions of tensors in these other fields.

In machine learning,

  1. a 0-axis tensor is a scalar,
  2. a 1-axis tensor is a vector,
  3. a 2-axis tensor is a matrix,
  4. a tensor with 3 or more axes is just called a "tensor" (a short NumPy illustration follows this list).
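For instance, here is a minimal NumPy sketch (the values and shapes are arbitrary) showing how the number of axes distinguishes these four cases:

```python
import numpy as np

scalar = np.array(3.0)                # 0 axes: a scalar
vector = np.array([1.0, 2.0, 3.0])    # 1 axis:  a vector
matrix = np.ones((2, 3))              # 2 axes:  a matrix
tensor = np.ones((2, 3, 4))           # 3 axes:  just a "tensor"

for t in (scalar, vector, matrix, tensor):
    print(t.ndim, t.shape)
# 0 ()
# 1 (3,)
# 2 (2, 3)
# 3 (2, 3, 4)
```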

Chain rule

If $h = f \circ g$ is the function such that $h(x) = f(g(x))$ for every $x$, then the chain rule is, in Lagrange's notation, $h'(x) = f'(g(x))\, g'(x)$, or, equivalently, $h' = (f \circ g)' = (f' \circ g) \cdot g'$.

The chain rule may also be expressed in Leibniz's notation. If a variable z depends on the variable y, which itself depends on the variable x, then z depends on x as well, via the intermediate variable y. In this case, the chain rule is expressed as
$$\frac{dz}{dx} = \frac{dz}{dy} \cdot \frac{dy}{dx}$$
and
$$\left.\frac{dz}{dx}\right|_{x} = \left.\frac{dz}{dy}\right|_{y(x)} \cdot \left.\frac{dy}{dx}\right|_{x}$$
for indicating at which points the derivatives have to be evaluated.

Quotient rule

Let $h(x) = \frac{f(x)}{g(x)}$, where both f and g are differentiable and $g(x) \neq 0$. The quotient rule states that the derivative of $h(x)$ is
$$h'(x) = \frac{f'(x)\,g(x) - f(x)\,g'(x)}{g(x)^2}.$$
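As a quick sanity check, the quotient rule can be verified numerically with a centered finite difference. This is a minimal sketch; the functions f(x) = sin(x) and g(x) = x² + 1 are arbitrary examples chosen for illustration:

```python
import numpy as np

# Arbitrary example functions and their derivatives.
f,  g  = np.sin, lambda x: x**2 + 1
df, dg = np.cos, lambda x: 2 * x

h  = lambda x: f(x) / g(x)
# Quotient rule: h'(x) = (f'(x) g(x) - f(x) g'(x)) / g(x)^2
dh = lambda x: (df(x) * g(x) - f(x) * dg(x)) / g(x)**2

x, eps = 0.7, 1e-5
numeric = (h(x + eps) - h(x - eps)) / (2 * eps)  # centered finite difference
print(numeric, dh(x))                            # the two values agree closely
```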

Scalar case

Given a function $f : \mathbb{R} \to \mathbb{R}$, the derivative of f at a point $x \in \mathbb{R}$ is defined as:
$$f'(x) = \lim_{h \to 0} \frac{f(x + h) - f(x)}{h}$$

In this scalar case, the derivative of the function f at the point x tells us how much the function f changes as the input x changes by a small amount $\varepsilon$:
$$f(x + \varepsilon) \approx f(x) + \varepsilon f'(x)$$

For ease of notation we will commonly assign a name to the output of f, say $y = f(x)$, and write $\frac{\partial y}{\partial x}$ for the derivative of y with respect to x. We can write this relationship as
$$x \to x + \Delta x \implies y \to \approx y + \frac{\partial y}{\partial x} \Delta x$$

You should read this as saying "changing x to $x + \Delta x$ implies that y will change to approximately $y + \frac{\partial y}{\partial x}\Delta x$". This notation is nonstandard, but I like it since it emphasizes the relationship between changes in x and changes in y.

In the scalar case suppose that $f, g : \mathbb{R} \to \mathbb{R}$ and $y = f(x)$, $z = g(y)$; then we can also write $z = (g \circ f)(x)$, or draw the following computational graph:
$$x \xrightarrow{\; f \;} y \xrightarrow{\; g \;} z$$

The (scalar) chain rule tells us that
$$\frac{\partial z}{\partial x} = \frac{\partial z}{\partial y} \frac{\partial y}{\partial x}$$
This equation makes intuitive sense. The derivatives $\frac{\partial z}{\partial y}$ and $\frac{\partial y}{\partial x}$ give:
$$x \to x + \Delta x \implies y \to \approx y + \frac{\partial y}{\partial x}\Delta x$$
$$y \to y + \Delta y \implies z \to \approx z + \frac{\partial z}{\partial y}\Delta y$$

Combining these two rules lets us compute the effect of x on z:

  1. If x changes by $\Delta x$ then y will change by $\frac{\partial y}{\partial x}\Delta x$, so we have $\Delta y = \frac{\partial y}{\partial x}\Delta x$.
  2. If y changes by $\Delta y$ then z will change by $\frac{\partial z}{\partial y}\Delta y = \frac{\partial z}{\partial y}\frac{\partial y}{\partial x}\Delta x$, which is exactly what the chain rule tells us (see the numerical sketch after this list).
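Here is a minimal numeric sketch of this step-by-step propagation of a small change, using arbitrary example functions f(x) = x² and g(y) = sin(y):

```python
import numpy as np

f, g   = lambda x: x**2, np.sin    # arbitrary example functions
df, dg = lambda x: 2 * x, np.cos   # their derivatives

x, dx = 1.5, 1e-4
y = f(x)
z = g(y)

dy_pred = df(x) * dx        # step 1: Δy ≈ (∂y/∂x) Δx
dz_pred = dg(y) * dy_pred   # step 2: Δz ≈ (∂z/∂y)(∂y/∂x) Δx, the chain rule

dz_true = g(f(x + dx)) - z  # the actual change in z
print(dz_pred, dz_true)     # agree to first order in Δx
```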

Gradient: Vector in, scalar out

This same intuition carries over into the vector case. Now suppose that $f : \mathbb{R}^N \to \mathbb{R}$ takes a vector as input and produces a scalar. The derivative of f at the point $x \in \mathbb{R}^N$ is now called the gradient, and it is defined as:
$$\nabla_x f(x) = \lim_{h \to 0} \frac{f(x + h) - f(x)}{h}$$

The gradient $\nabla_x f(x) \in \mathbb{R}^N$, or $\frac{\partial y}{\partial x}$, is a vector of partial derivatives:
$$\frac{\partial y}{\partial x} = \left( \frac{\partial y}{\partial x_1}, \frac{\partial y}{\partial x_2}, \ldots, \frac{\partial y}{\partial x_N} \right)$$
where $x_i$ is the i-th coordinate of the vector x, which is a scalar, so each partial derivative $\frac{\partial y}{\partial x_i}$ is also a scalar.

If we set $y = f(x)$ then we have the relationship
$$x \to x + \Delta x \implies y \to \approx y + \frac{\partial y}{\partial x} \cdot \Delta x$$

The formula changes a bit from the scalar case to account for the fact that $x$, $\Delta x$, and $\frac{\partial y}{\partial x}$ are now vectors in $\mathbb{R}^N$ while y is a scalar. In particular, when multiplying $\frac{\partial y}{\partial x}$ by $\Delta x$ we use the dot product, which combines two vectors to give a scalar.
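A minimal NumPy sketch of this first-order approximation, using the arbitrary example f(x) = ‖x‖², whose gradient is 2x:

```python
import numpy as np

f      = lambda x: np.sum(x**2)  # f : R^N -> R (arbitrary example)
grad_f = lambda x: 2 * x         # its gradient, a vector in R^N

rng = np.random.default_rng(0)
x   = rng.normal(size=5)
dx  = 1e-4 * rng.normal(size=5)  # a small perturbation of x

approx = f(x) + grad_f(x) @ dx   # y + (∂y/∂x) · Δx, a dot product
print(f(x + dx), approx)         # agree up to second-order terms in Δx
```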

Jacobian: Vector in, Vector out

Now suppose that $f : \mathbb{R}^N \to \mathbb{R}^M$ takes a vector as input and produces a vector as output. Then the derivative of f at a point x, also called the Jacobian matrix, is the $M \times N$ matrix of partial derivatives. If we again set $y = f(x)$ then we can write:
$$\frac{\partial y}{\partial x} = \begin{pmatrix} \frac{\partial y_1}{\partial x_1} & \cdots & \frac{\partial y_1}{\partial x_N} \\ \vdots & \ddots & \vdots \\ \frac{\partial y_M}{\partial x_1} & \cdots & \frac{\partial y_M}{\partial x_N} \end{pmatrix}$$

The Jacobian tells us the relationship between each element of x and each element of y: the $(i, j)$-th element of $\frac{\partial y}{\partial x}$ is equal to $\frac{\partial y_i}{\partial x_j}$, so it tells us the amount by which $y_i$ will change if $x_j$ is changed by a small amount.

Just as in the previous cases, the Jacobian tells us the relationship between changes in the input and changes in the output:
$$x \to x + \Delta x \implies y \to \approx y + \frac{\partial y}{\partial x} \Delta x$$

Here $\frac{\partial y}{\partial x}$ is an $M \times N$ matrix and $\Delta x$ is an N-dimensional vector, so the product $\frac{\partial y}{\partial x} \Delta x$ is a matrix-vector multiplication resulting in an M-dimensional vector: $\mathbb{R}^{M \times N} \cdot \mathbb{R}^{N \times 1} \to \mathbb{R}^{M \times 1}$.

The chain rule can be extended to the vector case using Jacobian matrices. Suppose that $f : \mathbb{R}^N \to \mathbb{R}^M$ and $g : \mathbb{R}^M \to \mathbb{R}^K$. Let $x \in \mathbb{R}^N$, $y \in \mathbb{R}^M$, and $z \in \mathbb{R}^K$ with $y = f(x)$ and $z = g(y)$, so we have the same computational graph as the scalar case:
$$x \xrightarrow{\; f \;} y \xrightarrow{\; g \;} z$$

The chain rule also has the same form as the scalar case:
$$\frac{\partial z}{\partial x} = \frac{\partial z}{\partial y} \frac{\partial y}{\partial x}$$

However, now each of these terms is a matrix: $\frac{\partial z}{\partial y}$ is a $K \times M$ matrix, $\frac{\partial y}{\partial x}$ is an $M \times N$ matrix, and $\frac{\partial z}{\partial x}$ is a $K \times N$ matrix; the multiplication of $\frac{\partial z}{\partial y}$ and $\frac{\partial y}{\partial x}$ is matrix multiplication.
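The vector chain rule can be checked numerically. The sketch below builds finite-difference Jacobians for arbitrary example functions f : R² → R³ and g : R³ → R² and compares the matrix product of the two Jacobians with the Jacobian of the composition:

```python
import numpy as np

def jacobian(func, x, eps=1e-6):
    """Finite-difference Jacobian of func at x, shape (output size, input size)."""
    y = func(x)
    J = np.zeros((y.size, x.size))
    for j in range(x.size):
        dx = np.zeros_like(x)
        dx[j] = eps
        J[:, j] = (func(x + dx) - func(x - dx)) / (2 * eps)
    return J

f = lambda x: np.array([x[0] * x[1], np.sin(x[0]), x[1] ** 2])  # R^2 -> R^3
g = lambda y: np.array([y[0] + y[2], y[1] * y[2]])              # R^3 -> R^2

x = np.array([0.5, -1.2])
dz_dx_chain  = jacobian(g, f(x)) @ jacobian(f, x)         # (∂z/∂y)(∂y/∂x): (2x3)(3x2)
dz_dx_direct = jacobian(lambda t: g(f(t)), x)             # Jacobian of the composition
print(np.allclose(dz_dx_chain, dz_dx_direct, atol=1e-5))  # True
```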

Generalized Jacobian: Tensor in, Tensor out

Suppose now that $f : \mathbb{R}^{N_1 \times \cdots \times N_{D_x}} \to \mathbb{R}^{M_1 \times \cdots \times M_{D_y}}$. Then:

  1. the input to f is a $D_x$-dimensional tensor of shape $N_1 \times \cdots \times N_{D_x}$, and
  2. the output of f is a $D_y$-dimensional tensor of shape $M_1 \times \cdots \times M_{D_y}$.

If $y = f(x)$ then the derivative $\frac{\partial y}{\partial x}$ is a generalized Jacobian, which is an object with shape
$$(M_1 \times \cdots \times M_{D_y}) \times (N_1 \times \cdots \times N_{D_x})$$

Note that we have separated the dimensions of $\frac{\partial y}{\partial x}$ into two groups:

  1. the first group matches the dimensions of y and
  2. the second group matches the dimensions of x.

With this grouping, we can think of the generalized Jacobian as a generalization of a matrix, where the "rows" are indexed by the dimensions of y and the "columns" are indexed by the dimensions of x; each "row" is then a tensor with the same shape as x, and each "column" is a tensor with the same shape as y.

Now if we let $i \in \mathbb{Z}^{D_y}$ and $j \in \mathbb{Z}^{D_x}$ be vectors of integer indices, then we can write
$$\left( \frac{\partial y}{\partial x} \right)_{i,j} = \frac{\partial y_i}{\partial x_j}$$

Note that $y_i$ and $x_j$ are scalars, so the derivative $\frac{\partial y_i}{\partial x_j}$ is also a scalar. Using this notation we see that, like the standard Jacobian, the generalized Jacobian tells us the relative rates of change between all elements of x and all elements of y.

The generalized Jacobian gives the same relationship between inputs and outputs as before:
$$x \to x + \Delta x \implies y \to \approx y + \frac{\partial y}{\partial x} \Delta x$$

The difference is that now $\Delta x$ is a tensor of shape $N_1 \times \cdots \times N_{D_x}$ and $\frac{\partial y}{\partial x}$ is a generalized matrix of shape $(M_1 \times \cdots \times M_{D_y}) \times (N_1 \times \cdots \times N_{D_x})$. The product $\frac{\partial y}{\partial x} \Delta x$ is therefore a generalized matrix-vector multiply, which results in a tensor of shape $M_1 \times \cdots \times M_{D_y}$.

The generalized matrix-vector multiply follows the same algebraic rules as a traditional matrix-vector multiply:
$$\left( \frac{\partial y}{\partial x} \Delta x \right)_i = \sum_j \left( \frac{\partial y}{\partial x} \right)_{i,j} (\Delta x)_j = \left( \frac{\partial y}{\partial x} \right)_{i,:} \cdot \Delta x$$

The only difference is that the indices i and j are not scalars; instead they are vectors of indices. In the equation above the term $\left( \frac{\partial y}{\partial x} \right)_{i,:}$ is the i-th "row" of the generalized matrix $\frac{\partial y}{\partial x}$, which is a tensor with the same shape as x. We have also used the convention that the dot product between two tensors of the same shape is an elementwise product followed by a sum, identical to the dot product between vectors.

The chain rule also looks the same in the case of tensor-valued functions. Suppose that $y = f(x)$ and $z = g(y)$, where x and y have the same shapes as above and z has shape $K_1 \times \cdots \times K_{D_z}$. Now the chain rule looks the same as before:
$$\frac{\partial z}{\partial x} = \frac{\partial z}{\partial y} \frac{\partial y}{\partial x}$$

The difference is that now $\frac{\partial z}{\partial y}$ is a generalized matrix of shape $(K_1 \times \cdots \times K_{D_z}) \times (M_1 \times \cdots \times M_{D_y})$, and $\frac{\partial y}{\partial x}$ is a generalized matrix of shape $(M_1 \times \cdots \times M_{D_y}) \times (N_1 \times \cdots \times N_{D_x})$; the product $\frac{\partial z}{\partial y} \frac{\partial y}{\partial x}$ is a generalized matrix-matrix multiply, resulting in an object of shape $(K_1 \times \cdots \times K_{D_z}) \times (N_1 \times \cdots \times N_{D_x})$. Like the generalized matrix-vector multiply defined above, the generalized matrix-matrix multiply follows the same algebraic rules as the traditional matrix-matrix multiply:
$$\left( \frac{\partial z}{\partial x} \right)_{i,j} = \sum_k \left( \frac{\partial z}{\partial y} \right)_{i,k} \left( \frac{\partial y}{\partial x} \right)_{k,j} = \left( \frac{\partial z}{\partial y} \right)_{i,:} \cdot \left( \frac{\partial y}{\partial x} \right)_{:,j}$$

In this equation the indices i, j, k are vectors of indices, and the terms $\left( \frac{\partial z}{\partial y} \right)_{i,:}$ and $\left( \frac{\partial y}{\partial x} \right)_{:,j}$ are the i-th "row" of $\frac{\partial z}{\partial y}$ and the j-th "column" of $\frac{\partial y}{\partial x}$ respectively.
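One way to see the "(shape of y) × (shape of x)" convention concretely is with PyTorch's torch.autograd.functional.jacobian, which returns exactly this generalized Jacobian. This is a sketch assuming PyTorch is installed; the function below is an arbitrary example mapping a (2, 3) tensor to a (4, 5) tensor:

```python
import torch
from torch.autograd.functional import jacobian

w = torch.randn(4 * 5, 2 * 3)   # arbitrary fixed weights for the example map

def f(x):                                       # x has shape (2, 3)
    return (w @ x.reshape(-1)).reshape(4, 5)    # output has shape (4, 5)

x = torch.randn(2, 3)
J = jacobian(f, x)   # the generalized Jacobian dy/dx
print(J.shape)       # torch.Size([4, 5, 2, 3]) = (shape of y) x (shape of x)
```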

Backward Propagation with Tensors

In the context of neural networks, a layer f is typically a function of (tensor) inputs x and weights w; the (tensor) output of the layer is then y=f(x,w). The layer f is typically embedded in some large neural network with scalar loss L.

During backpropagation, we assume that we are given $\frac{\partial L}{\partial y}$ (the upstream gradient, which has already been computed while backpropagating through the layers between f and the loss L), and our goal is to compute $\frac{\partial L}{\partial x}$ and $\frac{\partial L}{\partial w}$. By the chain rule we know that
$$\frac{\partial L}{\partial x} = \frac{\partial L}{\partial y} \frac{\partial y}{\partial x} \qquad \frac{\partial L}{\partial w} = \frac{\partial L}{\partial y} \frac{\partial y}{\partial w}$$

Therefore one way to proceed would be to form the (generalized) Jacobians $\frac{\partial y}{\partial x}$ and $\frac{\partial y}{\partial w}$ and use (generalized) matrix multiplication to compute $\frac{\partial L}{\partial x}$ and $\frac{\partial L}{\partial w}$.

However, there's a problem with this approach: the Jacobian matrices $\frac{\partial y}{\partial x}$ and $\frac{\partial y}{\partial w}$ are typically far too large to fit in memory.

Problem: Too large to fit in memory

As a concrete example, let's suppose that the layer f is a linear layer that takes as input a minibatch of N vectors, each of dimension D, and produces a minibatch of N vectors, each of dimension M.

Then x is a matrix of shape $N \times D$, w is a matrix of shape $D \times M$, and $y = f(x, w) = xw$ is a matrix of shape $N \times M$; viewed as a function of x alone, $f : \mathbb{R}^{N \times D} \to \mathbb{R}^{N \times M}$. The Jacobian $\frac{\partial y}{\partial x}$ then has shape $(N \times M) \times (N \times D)$.

In a typical neural network we might have $N = 64$ and $M = D = 4096$; then $\frac{\partial y}{\partial x}$ consists of $64 \cdot 4096 \cdot 64 \cdot 4096$ scalar values. This is more than 68 billion numbers; using 32-bit floating point, this Jacobian matrix would take 256 GB of memory to store. Therefore it is completely impractical to explicitly store and manipulate the Jacobian matrix.
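The arithmetic behind these numbers, as a quick sketch:

```python
N, M, D = 64, 4096, 4096
entries = (N * M) * (N * D)      # the Jacobian has shape (N x M) x (N x D)
num_bytes = entries * 4          # 4 bytes per 32-bit float
print(entries)                   # 68719476736 entries, i.e. ~68.7 billion
print(num_bytes / 2**30, "GiB")  # 256.0 GiB
```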

However, it turns out that for most common neural network layers we can derive expressions that compute the product $\frac{\partial L}{\partial y} \frac{\partial y}{\partial x}$ without explicitly forming the Jacobian $\frac{\partial y}{\partial x}$. Even better, we can typically derive this expression without even computing an explicit expression for the Jacobian $\frac{\partial y}{\partial x}$; in many cases we can work out a small case on paper and then infer the general formula.

Solution

Let's see how this works out for the case of the linear layer $f(x, w) = xw$. Set $N = 1$, $D = 2$, $M = 3$. Then we can explicitly write
$$y = \begin{pmatrix} y_{1,1} & y_{1,2} & y_{1,3} \end{pmatrix} = xw = \begin{pmatrix} x_{1,1} & x_{1,2} \end{pmatrix} \begin{pmatrix} w_{1,1} & w_{1,2} & w_{1,3} \\ w_{2,1} & w_{2,2} & w_{2,3} \end{pmatrix} = \begin{pmatrix} x_{1,1} w_{1,1} + x_{1,2} w_{2,1} \\ x_{1,1} w_{1,2} + x_{1,2} w_{2,2} \\ x_{1,1} w_{1,3} + x_{1,2} w_{2,3} \end{pmatrix}^T \tag{1}$$
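A quick NumPy check of (1) with arbitrary concrete values for x and w:

```python
import numpy as np

x = np.array([[2.0, 3.0]])       # shape (1, 2): N=1, D=2
w = np.array([[1.0, 4.0, 6.0],
              [5.0, 7.0, 8.0]])  # shape (2, 3): D=2, M=3

y = x @ w                        # shape (1, 3): N=1, M=3
print(y)  # [[2*1 + 3*5, 2*4 + 3*7, 2*6 + 3*8]] = [[17. 29. 36.]]
```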

During backpropagation we assume that we have access to $\frac{\partial L}{\partial y}$ (it is the upstream gradient value), which technically has shape $(1) \times (N \times M)$; however for notational convenience we will instead think of it as a matrix of shape $N \times M$. Then we can write
$$\frac{\partial L}{\partial y} = \begin{pmatrix} dy_{1,1} & dy_{1,2} & dy_{1,3} \end{pmatrix}$$

Our goal now is to derive an expression for $\frac{\partial L}{\partial x}$ in terms of x, w, and $\frac{\partial L}{\partial y}$, without explicitly forming the entire Jacobian $\frac{\partial y}{\partial x}$.

We know that $\frac{\partial L}{\partial x}$ will have shape $(1) \times (N \times D)$, but as is typical for representing gradients we instead view $\frac{\partial L}{\partial x}$ as a matrix of shape $N \times D$.

We know that each element of $\frac{\partial L}{\partial x}$ is a scalar giving the partial derivative of L with respect to one element of x:
$$\frac{\partial L}{\partial x} = \begin{pmatrix} \frac{\partial L}{\partial x_{1,1}} & \frac{\partial L}{\partial x_{1,2}} \end{pmatrix}$$

Thinking one element at a time, the chain rule tells us that
$$\frac{\partial L}{\partial x_{1,1}} = \frac{\partial L}{\partial y} \frac{\partial y}{\partial x_{1,1}} \qquad \frac{\partial L}{\partial x_{1,2}} = \frac{\partial L}{\partial y} \frac{\partial y}{\partial x_{1,2}}$$

Since we already know $\frac{\partial L}{\partial y}$, we just have to compute $\frac{\partial y}{\partial x_{1,1}}$ and $\frac{\partial y}{\partial x_{1,2}}$.

If we view $\frac{\partial L}{\partial y}$ and $\frac{\partial y}{\partial x_{1,1}}$ as matrices of shape $N \times M$, then their generalized matrix product is simply the dot product $\frac{\partial L}{\partial y} \cdot \frac{\partial y}{\partial x_{1,1}}$. Now we compute
$$\frac{\partial y}{\partial x_{1,1}} = \begin{pmatrix} \frac{\partial y_{1,1}}{\partial x_{1,1}} & \frac{\partial y_{1,2}}{\partial x_{1,1}} & \frac{\partial y_{1,3}}{\partial x_{1,1}} \end{pmatrix} = \begin{pmatrix} w_{1,1} & w_{1,2} & w_{1,3} \end{pmatrix}$$
$$\frac{\partial y}{\partial x_{1,2}} = \begin{pmatrix} \frac{\partial y_{1,1}}{\partial x_{1,2}} & \frac{\partial y_{1,2}}{\partial x_{1,2}} & \frac{\partial y_{1,3}}{\partial x_{1,2}} \end{pmatrix} = \begin{pmatrix} w_{2,1} & w_{2,2} & w_{2,3} \end{pmatrix}$$
where the final equalities come from taking the derivatives of (1) with respect to $x_{1,1}$ and $x_{1,2}$. We can now combine these results and write
$$\frac{\partial L}{\partial x_{1,1}} = \frac{\partial L}{\partial y} \cdot \frac{\partial y}{\partial x_{1,1}} = dy_{1,1} w_{1,1} + dy_{1,2} w_{1,2} + dy_{1,3} w_{1,3}$$
$$\frac{\partial L}{\partial x_{1,2}} = \frac{\partial L}{\partial y} \cdot \frac{\partial y}{\partial x_{1,2}} = dy_{1,1} w_{2,1} + dy_{1,2} w_{2,2} + dy_{1,3} w_{2,3}$$

This gives us our final expression for $\frac{\partial L}{\partial x}$:
$$\frac{\partial L}{\partial x} = \begin{pmatrix} \frac{\partial L}{\partial x_{1,1}} & \frac{\partial L}{\partial x_{1,2}} \end{pmatrix} = \begin{pmatrix} dy_{1,1} w_{1,1} + dy_{1,2} w_{1,2} + dy_{1,3} w_{1,3} & dy_{1,1} w_{2,1} + dy_{1,2} w_{2,2} + dy_{1,3} w_{2,3} \end{pmatrix} = \frac{\partial L}{\partial y} w^T$$

This final result $\frac{\partial L}{\partial x} = \frac{\partial L}{\partial y} w^T$ is very interesting because it allows us to efficiently compute $\frac{\partial L}{\partial x}$ without explicitly forming the Jacobian $\frac{\partial y}{\partial x}$. We have only derived this formula for the specific case of $N = 1$, $D = 2$, $M = 3$, but in fact it holds in general.

By a similar thought process we can derive an analogous expression for $\frac{\partial L}{\partial w}$, namely $\frac{\partial L}{\partial w} = x^T \frac{\partial L}{\partial y}$, without explicitly forming the Jacobian $\frac{\partial y}{\partial w}$.
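Both closed-form expressions can be checked against automatic differentiation. This is a minimal sketch assuming PyTorch is installed; the loss L = sum(y) is an arbitrary choice whose only purpose is to produce a concrete upstream gradient:

```python
import torch

N, D, M = 4, 5, 3
x = torch.randn(N, D, requires_grad=True)
w = torch.randn(D, M, requires_grad=True)

y = x @ w     # forward pass of the linear layer
L = y.sum()   # arbitrary scalar loss, so that dL/dy is a matrix of ones
L.backward()  # autograd computes dL/dx and dL/dw

dLdy = torch.ones(N, M)                      # upstream gradient for this loss
print(torch.allclose(x.grad, dLdy @ w.t()))  # dL/dx = dL/dy w^T  -> True
print(torch.allclose(w.grad, x.t() @ dLdy))  # dL/dw = x^T dL/dy  -> True
```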