Stochastic Gradient Descent
Sources:
- Shiyu Zhao. Chapter 6: Stochastic Approximation. Mathematical Foundations of Reinforcement Learning
- YouTube: Stochastic Gradient Descent
The Stochastic Gradient Descent (SGD) algorithm minimizes an objective of the form $J(w) = \mathbb{E}[f(w, X)]$ by replacing the true expected gradient with the gradient evaluated at random samples.
The iterative mean estimation algorithm is a special form of the Stochastic Gradient Descent (SGD) algorithm.
The SGD algorithm is a special form of the RM algorithm.
RM --> SGD --> iterative mean estimation (from the most general to the most special)
Problem statement
Suppose we aim to solve the problem

$$\min_{w} J(w) = \mathbb{E}[f(w, X)],$$

where $w$ is the parameter to be optimized and $X$ is a random variable. The expectation is taken with respect to $X$. Both $w$ and $X$ can be either scalars or vectors. The function $f(w, X)$ is a scalar.
Method 1: gradient descent (GD)

$$w_{k+1} = w_k - \alpha_k \nabla_w \mathbb{E}[f(w_k, X)] = w_k - \alpha_k \mathbb{E}[\nabla_w f(w_k, X)].$$

This is not practical, because the expected gradient cannot be computed if we don't know the probability distribution of $X$.
Method 2: batch gradient descent (BGD)

We can use Monte Carlo estimation to approximate the expected gradient from samples $\{x_i\}_{i=1}^{n}$ of $X$:

$$\mathbb{E}[\nabla_w f(w_k, X)] \approx \frac{1}{n}\sum_{i=1}^{n} \nabla_w f(w_k, x_i),$$

which gives the update

$$w_{k+1} = w_k - \alpha_k \frac{1}{n}\sum_{i=1}^{n} \nabla_w f(w_k, x_i).$$

Drawback: it requires many samples in each iteration for each $w_k$.
Method 3: stochastic gradient descent (SGD)

From GD to SGD: replace the true gradient $\mathbb{E}[\nabla_w f(w_k, X)]$ with the gradient evaluated at a single sample $x_k$:

$$w_{k+1} = w_k - \alpha_k \nabla_w f(w_k, x_k).$$

The so-called stochastic gradient $\nabla_w f(w_k, x_k)$ is a noisy approximation of the true gradient $\mathbb{E}[\nabla_w f(w_k, X)]$.
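As a quick illustration (not from the source), the update rule above can be written as a short routine; the gradient function `grad_f`, the sample stream, and the $1/k$ step size are placeholders assumed for this sketch.

```python
import numpy as np

def sgd(grad_f, samples, w0, alpha=lambda k: 1.0 / k):
    """Generic SGD sketch: w_{k+1} = w_k - alpha_k * grad_f(w_k, x_k).

    grad_f(w, x): stochastic gradient of f at parameter w and sample x.
    samples:      iterable of samples x_1, x_2, ...
    w0:           initial parameter (scalar or vector).
    alpha:        step-size schedule alpha_k (1/k here, as an assumption).
    """
    w = np.asarray(w0, dtype=float)
    for k, x_k in enumerate(samples, start=1):
        w = w - alpha(k) * np.asarray(grad_f(w, x_k), dtype=float)
    return w
```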
An example of SGD
We next consider the example

$$\min_w J(w) = \mathbb{E}[f(w, X)] = \mathbb{E}\left[\frac{1}{2}\|w - X\|^2\right], \quad \text{where } f(w, X) = \frac{1}{2}\|w - X\|^2.$$

It can be verified that the optimal solution is $w^* = \mathbb{E}[X]$ by solving $\nabla_w J(w) = 0$.

The GD algorithm for solving the above problem is

$$w_{k+1} = w_k - \alpha_k \nabla_w J(w_k) = w_k - \alpha_k \mathbb{E}[w_k - X].$$

The SGD algorithm for solving the above problem is

$$w_{k+1} = w_k - \alpha_k \nabla_w f(w_k, x_k) = w_k - \alpha_k (w_k - x_k).$$
We can therefore see that the iterative mean estimation algorithm is a special form of the SGD algorithm.
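A minimal numerical sketch of this example (illustrative, with an assumed distribution $X \sim \mathcal{N}(5, 2^2)$): running the SGD update $w_{k+1} = w_k - \frac{1}{k}(w_k - x_k)$ drives $w$ toward $\mathbb{E}[X]$.

```python
import numpy as np

rng = np.random.default_rng(0)

# SGD for min_w E[0.5 * (w - X)^2]; the stochastic gradient is (w - x_k).
# With alpha_k = 1/k this coincides with iterative mean estimation.
w = 0.0
for k in range(1, 10001):
    x_k = rng.normal(loc=5.0, scale=2.0)  # assumed distribution of X
    w = w - (1.0 / k) * (w - x_k)         # SGD update
print(w)                                  # close to E[X] = 5.0
```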
Convergence analysis
We next show that SGD is a special RM algorithm. Then, the convergence naturally follows.
The aim of SGD is to minimize

$$J(w) = \mathbb{E}[f(w, X)].$$

This problem can be converted to a root-finding problem:

$$\nabla_w J(w) = \mathbb{E}[\nabla_w f(w, X)] = 0.$$

Let

$$g(w) \doteq \nabla_w J(w) = \mathbb{E}[\nabla_w f(w, X)].$$

Then, the aim of SGD is to find the root of $g(w) = 0$.

As in the RM algorithm, we don't know the expression of $g(w)$ because the distribution of $X$ is unknown.

What we can measure is the noisy observation

$$\tilde{g}(w, \eta) = \nabla_w f(w, x) = \underbrace{\mathbb{E}[\nabla_w f(w, X)]}_{g(w)} + \underbrace{\nabla_w f(w, x) - \mathbb{E}[\nabla_w f(w, X)]}_{\eta}.$$

Then, the RM algorithm for solving $g(w) = 0$ is

$$w_{k+1} = w_k - a_k \tilde{g}(w_k, \eta_k) = w_k - a_k \nabla_w f(w_k, x_k).$$
- It is exactly the SGD algorithm.
- Therefore, SGD is a special RM algorithm.
Convergence of SGD
Since SGD is a special RM algorithm, its convergence naturally follows.
Theorem (Convergence of SGD): In the SGD algorithm, if
1) $0 < c_1 \le \nabla_w^2 f(w, X) \le c_2$ for all $w$ and $X$;
2) $\sum_{k=1}^{\infty} a_k = \infty$ and $\sum_{k=1}^{\infty} a_k^2 < \infty$;
3) $\{x_k\}_{k=1}^{\infty}$ is i.i.d.;
then $w_k$ converges to the root of $\nabla_w \mathbb{E}[f(w, X)] = 0$ with probability 1.
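For instance, the step size $a_k = 1/k$ used later in these notes satisfies the second condition, since the harmonic series diverges while the series of its squares converges:

$$\sum_{k=1}^{\infty} \frac{1}{k} = \infty, \qquad \sum_{k=1}^{\infty} \frac{1}{k^2} = \frac{\pi^2}{6} < \infty.$$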
Convergence pattern
Question: Since the stochastic gradient is random and hence the approximation is inaccurate, is the convergence of SGD slow or random? To answer this question, we consider the relative error between the stochastic and true gradients:

$$\delta_k \doteq \frac{\left|\nabla_w f(w_k, x_k) - \mathbb{E}[\nabla_w f(w_k, X)]\right|}{\left|\mathbb{E}[\nabla_w f(w_k, X)]\right|}.$$

For the sake of simplicity, we consider the case where $w$ and $X$ are scalars. Since $\mathbb{E}[\nabla_w f(w^*, X)] = 0$ at the optimal solution $w^*$, the denominator can be written as

$$\left|\mathbb{E}[\nabla_w f(w_k, X)]\right| = \left|\mathbb{E}[\nabla_w f(w_k, X)] - \mathbb{E}[\nabla_w f(w^*, X)]\right| = \left|\mathbb{E}\big[\nabla_w^2 f(\tilde{w}_k, X)(w_k - w^*)\big]\right|,$$

where the last equality follows from the mean value theorem with $\tilde{w}_k$ between $w_k$ and $w^*$. If $f$ is strictly convex so that $\nabla_w^2 f \ge c > 0$ for all $w$ and $X$, then

$$\left|\mathbb{E}[\nabla_w f(w_k, X)]\right| \ge c\,|w_k - w^*|.$$

Substituting the above inequality into the expression of $\delta_k$ yields

$$\delta_k \le \frac{\left|\nabla_w f(w_k, x_k) - \mathbb{E}[\nabla_w f(w_k, X)]\right|}{c\,|w_k - w^*|}.$$

The above inequality suggests an interesting convergence pattern of SGD: the relative error $\delta_k$ is inversely proportional to the distance $|w_k - w^*|$.

As a result, when $|w_k - w^*|$ is large, $\delta_k$ is small and SGD behaves like GD; when $w_k$ is close to $w^*$, the relative error may be large and the convergence exhibits more randomness in the neighborhood of $w^*$.
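This pattern can be observed numerically. The sketch below (illustrative; it assumes $f(w, x) = \frac{1}{2}(w - x)^2$ with $X \sim \mathcal{N}(5, 2^2)$) prints the relative error along an SGD run started far from $w^*$: the error is small at first and grows as $w_k$ approaches $w^*$.

```python
import numpy as np

rng = np.random.default_rng(0)
w_star = 5.0     # w* = E[X] for the assumed X ~ N(5, 2^2)
w = 100.0        # start far from w*
for k in range(1, 21):
    x_k = rng.normal(w_star, 2.0)
    stoch_grad = w - x_k              # gradient of 0.5*(w - x_k)^2
    true_grad = w - w_star            # E[w - X]
    delta_k = abs(stoch_grad - true_grad) / abs(true_grad)
    print(f"k={k:2d}  |w_k - w*| = {abs(w - w_star):8.3f}  delta_k = {delta_k:6.3f}")
    w = w - (1.0 / k) * stoch_grad    # SGD update
```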
A deterministic formulation of SGD
The formulation of SGD presented above involves random variables and expectations. One may also encounter a deterministic formulation of SGD that does not involve any random variables.

In particular, consider a set of real numbers $\{x_i\}_{i=1}^{n}$, where the $x_i$ do not have to be samples of any random variable. The problem to be solved is to minimize the average

$$\min_w J(w) = \frac{1}{n}\sum_{i=1}^{n} f(w, x_i).$$

Suppose that the set $\{x_i\}_{i=1}^{n}$ is large and we can only fetch a single number at a time.

In this case, it is favorable to update $w$ incrementally:

$$w_{k+1} = w_k - \alpha_k \nabla_w f(w_k, x_k),$$

where $x_k$ is the number fetched at time step $k$.

It must be noted that this formulation does not involve any random variables or expectations.

The algorithm above looks very similar to SGD, but its problem formulation is deterministic.

Then, many questions arise. For example, is this algorithm SGD? How should we use the finite set of numbers $\{x_i\}_{i=1}^{n}$: should they be fetched sequentially or randomly?
A quick answer to the above questions is that, although no random variables are involved in the above formulation, we can convert the deterministic formulation to the stochastic formulation by introducing a random variable.
In particular, let $X$ be a random variable defined over the set $\{x_i\}_{i=1}^{n}$ with a uniform distribution, so that $p(X = x_i) = 1/n$. Then, the deterministic objective becomes a stochastic one:

$$\frac{1}{n}\sum_{i=1}^{n} f(w, x_i) = \mathbb{E}[f(w, X)].$$

The last equality in the above equation is strict instead of approximate. Therefore, the algorithm in the deterministic formulation is indeed SGD, and $x_k$ should be sampled uniformly at random from $\{x_i\}_{i=1}^{n}$ with replacement (hence a number may be fetched more than once).
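A small sketch of the deterministic formulation (illustrative numbers, not from the source): minimizing $\frac{1}{n}\sum_i \frac{1}{2}(w - x_i)^2$ by fetching $x_k$ uniformly at random, with replacement, recovers the average of the set.

```python
import numpy as np

rng = np.random.default_rng(0)
xs = np.array([2.0, 4.0, 6.0, 8.0])    # a finite set of numbers {x_i}

# Minimize (1/n) * sum_i 0.5 * (w - x_i)^2 by fetching one number at a time.
w = 0.0
for k in range(1, 5001):
    x_k = rng.choice(xs)                # uniform sampling with replacement
    w = w - (1.0 / k) * (w - x_k)       # SGD update
print(w, xs.mean())                     # both close to 5.0
```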
BGD, MBGD, and SGD
Suppose we would like to minimize $J(w) = \mathbb{E}[f(w, X)]$ given a set of random samples $\{x_i\}_{i=1}^{n}$ of $X$. The BGD, MBGD, and SGD algorithms are, respectively,

$$w_{k+1} = w_k - \alpha_k \frac{1}{n}\sum_{i=1}^{n} \nabla_w f(w_k, x_i), \qquad \text{(BGD)}$$

$$w_{k+1} = w_k - \alpha_k \frac{1}{m}\sum_{j \in \mathcal{I}_k} \nabla_w f(w_k, x_j), \qquad \text{(MBGD)}$$

$$w_{k+1} = w_k - \alpha_k \nabla_w f(w_k, x_k). \qquad \text{(SGD)}$$

- In the BGD algorithm, all the samples are used in every iteration. When $n$ is large, $\frac{1}{n}\sum_{i=1}^{n}\nabla_w f(w_k, x_i)$ is close to the true gradient $\mathbb{E}[\nabla_w f(w_k, X)]$.
- In the MBGD algorithm, $\mathcal{I}_k$ is a subset of $\{1, \dots, n\}$ with size $|\mathcal{I}_k| = m$. The set $\mathcal{I}_k$ is obtained by $m$ i.i.d. samplings.
- In the SGD algorithm, $x_k$ is randomly sampled from $\{x_i\}_{i=1}^{n}$ at time $k$.
Comparison
Compare MBGD with BGD and SGD:
- Compared to SGD, MBGD has less randomness because it uses more samples instead of just one as in SGD.
- Compared to BGD, MBGD does not require using all the samples in every iteration, making it more flexible and efficient.
- If $m = 1$, MBGD becomes SGD. If $m = n$, MBGD does not become BGD, strictly speaking, because MBGD uses $n$ randomly fetched samples (a sample may appear more than once), whereas BGD uses all $n$ numbers exactly once.
An example
Given some numbers $\{x_i\}_{i=1}^{n}$, our aim is to compute the mean $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$. This problem can be equivalently stated as the optimization problem

$$\min_w J(w) = \frac{1}{2n}\sum_{i=1}^{n} \|w - x_i\|^2,$$

whose optimal solution is $w^* = \bar{x}$.

The three algorithms for solving this problem are, respectively,

$$w_{k+1} = w_k - \alpha_k \frac{1}{n}\sum_{i=1}^{n} (w_k - x_i) = w_k - \alpha_k (w_k - \bar{x}), \qquad \text{(BGD)}$$

$$w_{k+1} = w_k - \alpha_k \frac{1}{m}\sum_{j \in \mathcal{I}_k} (w_k - x_j) = w_k - \alpha_k \big(w_k - \bar{x}_k^{(m)}\big), \qquad \text{(MBGD)}$$

$$w_{k+1} = w_k - \alpha_k (w_k - x_k), \qquad \text{(SGD)}$$

where $\bar{x}_k^{(m)} = \frac{1}{m}\sum_{j \in \mathcal{I}_k} x_j$.

Furthermore, if $\alpha_k = 1/k$, the above updates can be solved explicitly: BGD gives $w_{k+1} = \bar{x}$ after the first iteration, MBGD gives $w_{k+1} = \frac{1}{k}\sum_{j=1}^{k} \bar{x}_j^{(m)}$, and SGD gives $w_{k+1} = \frac{1}{k}\sum_{j=1}^{k} x_j$, which is the iterative mean estimate.
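The three updates can also be compared numerically. The sketch below (illustrative; the data, the mini-batch size $m = 10$, and the step size $\alpha_k = 1/k$ are assumed choices) runs BGD, MBGD, and SGD on the same set of numbers; all three estimates approach the mean, with SGD the noisiest.

```python
import numpy as np

rng = np.random.default_rng(0)
xs = rng.normal(5.0, 2.0, size=100)     # the given numbers {x_i}; w* is their mean
m = 10                                  # assumed mini-batch size

w_bgd = w_mbgd = w_sgd = 0.0
for k in range(1, 501):
    alpha = 1.0 / k
    w_bgd  -= alpha * (w_bgd - xs.mean())               # BGD: all samples
    batch   = rng.choice(xs, size=m, replace=True)      # m i.i.d. draws
    w_mbgd -= alpha * (w_mbgd - batch.mean())           # MBGD: mini-batch average
    w_sgd  -= alpha * (w_sgd - rng.choice(xs))          # SGD: a single sample
print(xs.mean(), w_bgd, w_mbgd, w_sgd)  # all close to the mean of xs
```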