Stochastic Gradient Descent

Sources:

  1. Shiyu Zhao. Chapter 6: Stochastic Approximation. Mathematical Foundations of Reinforcement Learning
  2. YouTube: Stochastic Gradient Descent

Stochastic Gradient Descent (SGD) algorithm: minimize $J(w) = \mathbb{E}[f(w, X)]$ using the stochastic gradients $\{\nabla_w f(w_k, x_k)\}$:
$$w_{k+1} = w_k - \alpha_k \nabla_w f(w_k, x_k)$$

Stochastic gradient descent

The iterative mean estimation algorithm is a special form of the Stochastic Gradient Descent (SGD) algorithm.

The SGD algorithm is a special form of the RM algorithm.

RM --> SGD --> iterative mean estimation (from the more general to the more special)

Problem statement

Suppose we aim to solve the problem $\min_w J(w)$, where $$J(w) = \mathbb{E}[f(w, X)].$$

  • $w$ is the parameter to be optimized.
  • $X$ is a random variable [1]. The expectation is with respect to $X$.
  • $w$ and $X$ can be either scalars or vectors. The function $f(\cdot)$ is scalar-valued.

Method 1: gradient descent (GD)

$$w_{k+1} = w_k - \alpha_k \nabla_w J(w_k) = w_k - \alpha_k \nabla_w \mathbb{E}[f(w_k, X)] = w_k - \alpha_k \mathbb{E}[\nabla_w f(w_k, X)]$$

It is not practical: if we don't know the probability distribution of $X$, we cannot compute the expectation $\mathbb{E}[\nabla_w f(w, X)]$.

Method 2: batch gradient descent (BGD)

We can use Monte Carlo estimation to approximate the expectation:
$$\mathbb{E}[\nabla_w f(w_k, X)] \approx \frac{1}{n} \sum_{i=1}^n \nabla_w f(w_k, x_i), \qquad w_{k+1} = w_k - \alpha_k \frac{1}{n} \sum_{i=1}^n \nabla_w f(w_k, x_i),$$
where $n$ is the size of the dataset (recall that we have a dataset $\{x_1, x_2, \dots, x_n\}$ of samples of $X$).

Drawback: it requires many samples in each iteration for each wk.
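The BGD update can be sketched as follows. This is a minimal illustration, assuming the concrete choice $f(w, x) = (w - x)^2 / 2$ (the mean-estimation objective used in the example later in this chapter), so the full-batch gradient is $w - \bar{x}$:

```python
import random

def batch_gradient_descent(samples, w0=0.0, steps=200, alpha=0.1):
    """BGD: every iteration uses ALL n samples to estimate the gradient."""
    w = w0
    n = len(samples)
    for _ in range(steps):
        # Monte Carlo estimate of E[grad f(w, X)] over the whole dataset.
        grad = sum(w - x for x in samples) / n
        w = w - alpha * grad
    return w

random.seed(0)
data = [random.gauss(5.0, 1.0) for _ in range(1000)]
w_hat = batch_gradient_descent(data)
# w_hat converges to the sample mean of data.
```

Note that every one of the 200 iterations touches all 1000 samples, which is exactly the drawback mentioned above.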

Method 3: stochastic gradient descent (SGD)

$$w_{k+1} = w_k - \alpha_k \nabla_w f(w_k, x_k), \tag{1}$$

  • Compared to the gradient descent method: replace the true gradient $\mathbb{E}[\nabla_w f(w_k, X)]$ by the stochastic gradient $\nabla_w f(w_k, x_k)$.
  • Compared to the batch gradient descent method: let $n = 1$.

From GD to SGD: $$w_{k+1} = w_k - \alpha_k \mathbb{E}[\nabla_w f(w_k, X)] \quad \Longrightarrow \quad w_{k+1} = w_k - \alpha_k \nabla_w f(w_k, x_k)$$

The so-called stochastic gradient $\nabla_w f(w_k, x_k)$ can be viewed as a noisy measurement of the true gradient $\mathbb{E}[\nabla_w f(w_k, X)]$:
$$\nabla_w f(w_k, x_k) = \mathbb{E}[\nabla_w f(w_k, X)] + \underbrace{\nabla_w f(w_k, x_k) - \mathbb{E}[\nabla_w f(w_k, X)]}_{\eta}.$$

An example of SGD

We next consider an example: $$\min_w J(w) = \mathbb{E}[f(w, X)] = \mathbb{E}\left[\frac{1}{2}\|w - X\|^2\right],$$ where $f(w, X) = \|w - X\|^2 / 2$ and $\nabla_w f(w, X) = w - X$.

It can be verified, by solving $\nabla_w J(w) = 0$, that the optimal solution is $w^* = \mathbb{E}[X]$. Therefore, this optimization problem is equivalent to the mean estimation problem.

The GD algorithm for solving the above problem is $$w_{k+1} = w_k - \alpha_k \nabla_w J(w_k) = w_k - \alpha_k \mathbb{E}[\nabla_w f(w_k, X)] = w_k - \alpha_k \mathbb{E}[w_k - X].$$

The SGD algorithm for solving the above problem is $$w_{k+1} = w_k - \alpha_k \nabla_w f(w_k, x_k) = w_k - \alpha_k (w_k - x_k).$$

We can therefore see that the iterative mean estimation algorithm is a special form of the SGD algorithm.
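This equivalence is easy to check numerically. A minimal sketch of the SGD update $w_{k+1} = w_k - \alpha_k (w_k - x_k)$ with $\alpha_k = 1/k$ (the sample values and seed are arbitrary illustrative choices):

```python
import random

def sgd_mean(samples):
    """SGD for f(w, x) = (w - x)^2 / 2, i.e. w_{k+1} = w_k - a_k (w_k - x_k)."""
    w = 0.0
    for k, x_k in enumerate(samples, start=1):
        alpha_k = 1.0 / k            # satisfies sum a_k = inf, sum a_k^2 < inf
        w = w - alpha_k * (w - x_k)  # one-sample stochastic gradient step
    return w

random.seed(1)
data = [random.gauss(2.0, 0.5) for _ in range(5000)]
w_est = sgd_mean(data)
# With alpha_k = 1/k this recursion is exactly the incremental running
# average, so w_est equals the sample mean of data.
```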

Convergence analysis

We next show that SGD is a special RM algorithm; the convergence then follows naturally.

The aim of SGD is to minimize $J(w) = \mathbb{E}[f(w, X)]$.

This problem can be converted to a root-finding problem [2]: $$\nabla_w J(w) = \mathbb{E}[\nabla_w f(w, X)] = 0.$$

Let $g(w) = \nabla_w J(w) = \mathbb{E}[\nabla_w f(w, X)]$.

Then, the aim of SGD is to find the root of $g(w) = 0$.

As in the RM algorithm, $g(w)$ is unknown: computing it would require the distribution of $X$, in which case we could apply GD directly rather than SGD.

What we can measure is $$\tilde{g}(w, \eta) = \nabla_w f(w, x) = \underbrace{\mathbb{E}[\nabla_w f(w, X)]}_{g(w)} + \underbrace{\nabla_w f(w, x) - \mathbb{E}[\nabla_w f(w, X)]}_{\eta}.$$

Then, the RM algorithm for solving $g(w) = 0$ is $$w_{k+1} = w_k - a_k \tilde{g}(w_k, \eta_k) = w_k - a_k \nabla_w f(w_k, x_k).$$

  • It is exactly the SGD algorithm.
  • Therefore, SGD is a special RM algorithm.

Convergence of SGD

Since SGD is a special RM algorithm, its convergence naturally follows.

Theorem (Convergence of SGD): In the SGD algorithm, if

  1. $0 < c_1 \le \nabla_w^2 f(w, X) \le c_2$ [3];
  2. $\sum_{k=1}^{\infty} a_k = \infty$ and $\sum_{k=1}^{\infty} a_k^2 < \infty$;
  3. $\{x_k\}_{k=1}^{\infty}$ is i.i.d.;

then $w_k$ converges to the root of $\nabla_w \mathbb{E}[f(w, X)] = 0$ with probability 1.
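Condition 2 is easy to check numerically for the common choice $a_k = 1/k$: the partial sums of $a_k$ grow without bound (like $\ln N$), while the partial sums of $a_k^2$ converge (to $\pi^2/6$). A quick illustrative check:

```python
import math

# Partial sums of a_k = 1/k and a_k^2 = 1/k^2 up to N terms.
N = 10**6
s1 = sum(1.0 / k for k in range(1, N + 1))      # grows like ln N -> infinity
s2 = sum(1.0 / k**2 for k in range(1, N + 1))   # converges to pi^2 / 6

# s1 is about ln(1e6) + 0.577 ~ 14.4, while s2 is within ~1/N of pi^2/6.
```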

Convergence pattern

Question: since the stochastic gradient is random and hence the approximation is inaccurate, is the convergence of SGD slow or random? To answer this question, we consider the relative error between the stochastic and true gradients.

The relative error between the stochastic and true gradients is $$\delta_k \doteq \frac{\left|\nabla_w f(w_k, x_k) - \mathbb{E}[\nabla_w f(w_k, X)]\right|}{\left|\mathbb{E}[\nabla_w f(w_k, X)]\right|}.$$

For the sake of simplicity, we consider the case where $w$ and $\nabla_w f(w, x)$ are both scalars. Since $w^*$ is the optimal solution, it holds that $\mathbb{E}[\nabla_w f(w^*, X)] = 0$. Then, the relative error can be rewritten as
$$\delta_k = \frac{\left|\nabla_w f(w_k, x_k) - \mathbb{E}[\nabla_w f(w_k, X)]\right|}{\left|\mathbb{E}[\nabla_w f(w_k, X)] - \mathbb{E}[\nabla_w f(w^*, X)]\right|} = \frac{\left|\nabla_w f(w_k, x_k) - \mathbb{E}[\nabla_w f(w_k, X)]\right|}{\left|\mathbb{E}[\nabla_w^2 f(\tilde{w}_k, X)(w_k - w^*)]\right|},$$
where the last equality is due to the mean value theorem and $\tilde{w}_k \in [w_k, w^*]$. Suppose that $f$ is strictly convex such that $\nabla_w^2 f \ge c > 0$ for all $w, X$. Then the denominator satisfies
$$\left|\mathbb{E}[\nabla_w^2 f(\tilde{w}_k, X)(w_k - w^*)]\right| = \left|\mathbb{E}[\nabla_w^2 f(\tilde{w}_k, X)]\right| \cdot |w_k - w^*| \ge c\,|w_k - w^*|.$$

Substituting the above inequality into the expression of $\delta_k$ yields
$$\delta_k \le \frac{\Big|\overbrace{\nabla_w f(w_k, x_k)}^{\text{stochastic gradient}} - \overbrace{\mathbb{E}[\nabla_w f(w_k, X)]}^{\text{true gradient}}\Big|}{\underbrace{c\,|w_k - w^*|}_{\text{distance to the optimal solution}}}.$$

The above inequality suggests an interesting convergence pattern of SGD: the relative error $\delta_k$ is inversely proportional to $|w_k - w^*|$.

As a result, when $|w_k - w^*|$ is large, $\delta_k$ is small. In this case, the SGD algorithm behaves like the gradient descent algorithm and hence $w_k$ quickly converges to $w^*$. When $w_k$ is close to $w^*$, the relative error $\delta_k$ may be large, and the convergence exhibits more randomness.
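A small simulation sketches this pattern (the quadratic $f(w, X) = (w - X)^2/2$ with true mean $w^* = 0$, the starting point, and the noise scale are all illustrative choices):

```python
import random

random.seed(2)
w_star = 0.0
w = 50.0                  # start far from the optimum
errors = []
for k in range(1, 2001):
    x_k = random.gauss(w_star, 1.0)      # noisy sample of X
    w = w - (1.0 / k) * (w - x_k)        # SGD step with alpha_k = 1/k
    errors.append(abs(w - w_star))

# Far from w*, a step removes almost all of the error (the first step
# already lands within the noise level); near w*, the iterate wanders
# randomly around w* and only slowly averages the noise away.
```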

A deterministic formulation of SGD

The formulation of SGD in (1) involves random variables. However, one often encounters a deterministic formulation of SGD that does not involve any random variables.

In particular, consider a set of real numbers $\{x_i\}_{i=1}^n$, where $x_i$ does not have to be a sample of any random variable. The optimization problem to be solved is to minimize the average: $$\min_w J(w) = \frac{1}{n} \sum_{i=1}^n f(w, x_i),$$ where $f(w, x_i)$ is a parameterized function and $w$ is the parameter to be optimized. The batch gradient descent (BGD) algorithm for solving this problem is $$w_{k+1} = w_k - \alpha_k \nabla_w J(w_k) = w_k - \alpha_k \frac{1}{n} \sum_{i=1}^n \nabla_w f(w_k, x_i).$$

Suppose that the set $\{x_i\}_{i=1}^n$ is large and we can only fetch a single number each time.

In this case, it is favorable to update $w_k$ in an incremental manner: $$w_{k+1} = w_k - \alpha_k \nabla_w f(w_k, x_k). \tag{2}$$

It must be noted that $x_k$ here is the number fetched at time step $k$, not the $k$th element of the set $\{x_i\}_{i=1}^n$.

The algorithm in (2) is very similar to SGD, but its problem formulation is subtly different because it does not involve any random variables or expected values.

Then, many questions arise. For example, is this algorithm SGD? How should we use the finite set of numbers $\{x_i\}_{i=1}^n$? Should we sort these numbers in a certain order and then use them one by one, or should we randomly sample a number from the set?

A quick answer to the above questions is that, although no random variables are involved in the above formulation, we can convert the deterministic formulation to the stochastic formulation by introducing a random variable.

In particular, let $X$ be a random variable defined on the set $\{x_i\}_{i=1}^n$. Suppose that its probability distribution is uniform such that $p(X = x_i) = 1/n$. Then, the deterministic optimization problem becomes a stochastic one: $$\min_w J(w) = \frac{1}{n} \sum_{i=1}^n f(w, x_i) = \mathbb{E}[f(w, X)].$$

The last equality above is exact rather than approximate. Therefore, the algorithm in (2) is SGD, and the estimate converges if $x_k$ is uniformly and independently sampled from $\{x_i\}_{i=1}^n$. Note that $x_k$ may repeatedly take the same value in $\{x_i\}_{i=1}^n$ since it is sampled with replacement.
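The conversion can be sketched as follows: a deterministic set of numbers, sampled uniformly with replacement, drives the update (2). As before, $f(w, x) = (w - x)^2/2$ is an illustrative choice whose minimizer is the average of the set:

```python
import random

random.seed(3)
data = [1.0, 2.0, 3.0, 4.0, 5.0]   # deterministic set; its average is 3.0
w = 0.0
for k in range(1, 50001):
    x_k = random.choice(data)       # uniform i.i.d. sampling, repeats allowed
    w = w - (1.0 / k) * (w - x_k)   # update (2) with grad f(w, x) = w - x
# w approaches the average of the set, here 3.0.
```

The key design point is the uniform random sampling: sweeping the set in a fixed order would not match the i.i.d. assumption in the convergence theorem.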

BGD, MBGD, and SGD

Suppose we would like to minimize $J(w) = \mathbb{E}[f(w, X)]$, given a set of random samples $\{x_i\}_{i=1}^n$ of $X$. The BGD, MBGD, and SGD algorithms for solving this problem are, respectively,
$$w_{k+1} = w_k - \alpha_k \frac{1}{n} \sum_{i=1}^n \nabla_w f(w_k, x_i), \quad \text{(BGD)}$$
$$w_{k+1} = w_k - \alpha_k \frac{1}{m} \sum_{j \in I_k} \nabla_w f(w_k, x_j), \quad \text{(MBGD)}$$
$$w_{k+1} = w_k - \alpha_k \nabla_w f(w_k, x_k), \quad \text{(SGD)}$$

where $n$ is the size of the dataset and $m \le n$ is the mini-batch size.

  • In the BGD algorithm, all the samples are used in every iteration. When $n$ is large, $(1/n) \sum_{i=1}^n \nabla_w f(w_k, x_i)$ is close to the true gradient $\mathbb{E}[\nabla_w f(w_k, X)]$.
  • In the MBGD algorithm, $I_k$ is a subset of $\{1, \dots, n\}$ with size $|I_k| = m$. The set $I_k$ is obtained by $m$ i.i.d. samplings.
  • In the SGD algorithm, $x_k$ is randomly sampled from $\{x_i\}_{i=1}^n$ at time $k$.

Comparison

Compare MBGD with BGD and SGD:

  • Compared to SGD, MBGD has less randomness because it uses more samples instead of just one.
  • Compared to BGD, MBGD does not require using all the samples in every iteration, making it more flexible and efficient.
  • If $m = 1$, MBGD becomes SGD.
  • If $m = n$, MBGD does NOT become BGD strictly speaking, because MBGD uses $n$ randomly fetched samples whereas BGD uses all $n$ numbers. In particular, MBGD may use a value in $\{x_i\}_{i=1}^n$ multiple times, whereas BGD uses each number once.
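An MBGD sketch for the mean-estimation objective (the function choice $f(w, x) = (w - x)^2/2$, dataset, and seed are illustrative; each iteration draws a mini-batch of $m$ i.i.d. samples with replacement):

```python
import random

def mbgd_mean(data, m, steps, seed=4):
    """MBGD: each step averages the gradient over an m-sample mini-batch."""
    rng = random.Random(seed)
    w = 0.0
    for k in range(1, steps + 1):
        batch = [rng.choice(data) for _ in range(m)]   # m i.i.d. draws
        grad = sum(w - x for x in batch) / m           # mini-batch gradient
        w = w - (1.0 / k) * grad
    return w

data = [float(i) for i in range(1, 11)]   # true mean is 5.5
w_m = mbgd_mean(data, m=5, steps=20000)
# w_m approaches the mean 5.5; setting m=1 would reduce this to SGD.
```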

An example

Given some numbers $\{x_i\}_{i=1}^n$, our aim is to calculate the mean $\bar{x} = \frac{1}{n} \sum_{i=1}^n x_i$. This problem can be equivalently stated as the following optimization problem: $$\min_w J(w) = \frac{1}{2n} \sum_{i=1}^n \|w - x_i\|^2.$$

The three algorithms for solving this problem are, respectively,
$$w_{k+1} = w_k - \alpha_k \frac{1}{n} \sum_{i=1}^n (w_k - x_i) = w_k - \alpha_k (w_k - \bar{x}), \quad \text{(BGD)}$$
$$w_{k+1} = w_k - \alpha_k \frac{1}{m} \sum_{j \in I_k} (w_k - x_j) = w_k - \alpha_k \big(w_k - \bar{x}_k^{(m)}\big), \quad \text{(MBGD)}$$
$$w_{k+1} = w_k - \alpha_k (w_k - x_k), \quad \text{(SGD)}$$
where $\bar{x}_k^{(m)} = \frac{1}{m} \sum_{j \in I_k} x_j$.

Furthermore, if $\alpha_k = 1/k$, the above equations can be solved as
$$w_{k+1} = \frac{1}{k} \sum_{j=1}^k \bar{x} = \bar{x}, \quad \text{(BGD)}$$
$$w_{k+1} = \frac{1}{k} \sum_{j=1}^k \bar{x}_j^{(m)}, \quad \text{(MBGD)}$$
$$w_{k+1} = \frac{1}{k} \sum_{j=1}^k x_j. \quad \text{(SGD)}$$

  • The estimate of BGD at each step is exactly the optimal solution $w^* = \bar{x}$.
  • The estimate of MBGD approaches the mean faster than that of SGD because $\bar{x}_k^{(m)}$ is already an average.
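The BGD and SGD closed forms above can be verified on a tiny hand-picked dataset (the numbers are arbitrary): with $\alpha_k = 1/k$, the BGD iterate equals $\bar{x}$ from the very first step, while the SGD iterate equals the running average of the samples seen so far.

```python
# Tiny dataset with mean 5.0; one pass with alpha_k = 1/k.
data = [4.0, 8.0, 6.0, 2.0]
xbar = sum(data) / len(data)

w_bgd, w_sgd = 0.0, 0.0
for k, x_k in enumerate(data, start=1):
    a = 1.0 / k
    w_bgd = w_bgd - a * (w_bgd - xbar)   # full-batch gradient is w - xbar
    w_sgd = w_sgd - a * (w_sgd - x_k)    # single-sample gradient is w - x_k

# After step 1, w_bgd already equals xbar and stays there; after the full
# pass, w_sgd equals the average of all samples seen, which is also xbar.
```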


  1. The probability distribution of $X$ is fixed, but we don't know it.↩︎

  2. The optimal solution $w^*$ must be a root of the equation, but a root may not be the optimal solution. This relates to convex optimization.↩︎

  3. Recall that the RM algorithm requires $0 < c_1 \le \nabla_w g(w) \le c_2$. In SGD, $g(w) = \mathbb{E}[\nabla_w f(w, X)]$, so $\nabla_w g(w) = \mathbb{E}[\nabla_w^2 f(w, X)]$, which yields the condition on $\nabla_w^2 f(w, X)$.↩︎