Stochastic Gradient Descent
Sources:
- Shiyu Zhao. Chapter 6: Stochastic Approximation. Mathematical Foundations of Reinforcement Learning
- YouTube: Stochastic Gradient Descent
Stochastic Gradient Descent (SGD) algorithm: minimize \(J(w)=\mathbb{E}[f(w, X)]\) using \(\left\{\nabla_w f\left(w_k, x_k\right)\right\}\) \[ w_{k+1}=w_k-\alpha_k \nabla_w f\left(w_k, x_k\right) \]
Stochastic gradient descent
The iterative mean estimation algorithm is a special form of the Stochastic Gradient Descent (SGD) algorithm.
The SGD algorithm is a special form of the RM algorithm.
RM --> SGD --> iterative mean estimation (from the most general to the most special)
Problem statement
Suppose we aim to solve problem \[ \min_w J(w) \] where \[ J(w) =\mathbb{E}[f(w, X)] \]
- \(w\) is the parameter to be optimized.
- \(X\) is a random variable1. The expectation is with respect to \(X\).
- \(w\) and \(X\) can be either scalars or vectors. The function \(f(\cdot)\) is a scalar.
Method 1: gradient descent (GD)
\[ w_{k+1}= w_k-\alpha_k \textcolor{blue} {\nabla_w J(w_k)} = w_k-\alpha_k \textcolor{blue} {\nabla_w \mathbb{E}\left[f\left(w_k, X\right)\right]}=w_k-\alpha_k \textcolor{blue} {\mathbb{E}\left[\nabla_w f\left(w_k, X\right)\right]} \]
This is not practical: if the probability distribution of \(X\) is unknown, the expectation \(\mathbb{E}\left[\nabla_w f\left(w_k, X\right)\right]\) cannot be computed in closed form.
Method 2: batch gradient descent (BGD)
We can use Monte Carlo estimation to approximate \(\mathbb{E}\left[\nabla_w f(w, X)\right]\). \[ \begin{gathered} \mathbb{E}\left[\nabla_w f\left(w_k, X\right)\right] \approx \frac{1}{n} \sum_{i=1}^n \nabla_w f(w_k , x_i) . \\ w_{k+1}=w_k-\alpha_k \frac{1}{n} \sum_{i=1}^n \nabla_w f\left(w_k, x_i\right) . \end{gathered} \] where \(n\) is the size of the dataset (remember that we have a dataset \(\left\{x_1, x_2, \ldots, x_n\right\}\) of samples of \(X\)).
Drawback: it requires many samples in each iteration for each \(w_k\).
Method 3: stochastic gradient descent (SGD)
\[ \begin{equation} \label{eq_SGD} w_{k+1}=w_k-\alpha_k \textcolor{blue} {\nabla_w f\left(w_k, x_k\right)}, \end{equation} \]
- Compared to the gradient descent method: replace the true gradient \(\mathbb{E}\left[\nabla_w f\left(w_k, X\right)\right]\) with the stochastic gradient \(\nabla_w f\left(w_k, x_k\right)\).
- Compared to the batch gradient descent method: let \(n=1\).
From GD to SGD: \[ \begin{gathered} w_{k+1}=w_k-\alpha_k \textcolor{green} {\mathbb{E}}\left[\nabla_w f\left(w_k, X\right)\right] \\ \Downarrow \\ w_{k+1}=w_k-\alpha_k \nabla_w f\left(w_k, x_k\right) \end{gathered} \]
The so-called stochastic gradient \(\nabla_w f\left(w_k, x_k\right)\) can be viewed as a noisy measurement of the true gradient \(\mathbb{E}\left[\nabla_w f(w_k, X)\right]\): \[ \nabla_w f\left(w_k, x_k\right)=\mathbb{E}\left[\nabla_w f(w_k, X)\right]+\underbrace{\nabla_w f\left(w_k, x_k\right)-\mathbb{E}\left[\nabla_w f(w_k, X)\right]}_\eta . \]
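To make the update concrete, here is a minimal Python sketch of \(\eqref{eq_SGD}\). The generic interface, the Gaussian choice for \(X\), and the step size \(\alpha_k=1/k\) are illustrative assumptions, not prescriptions from the source.

```python
import numpy as np

def sgd(grad_f, sample_x, w0, num_iters):
    """Minimal SGD loop: w_{k+1} = w_k - alpha_k * grad_f(w_k, x_k)."""
    w = w0
    for k in range(1, num_iters + 1):
        alpha_k = 1.0 / k                 # diminishing step size (see the convergence conditions below)
        x_k = sample_x()                  # one noisy measurement of the gradient per iteration
        w = w - alpha_k * grad_f(w, x_k)  # SGD update
    return w

# Toy run: minimize E[(w - X)^2 / 2], so grad_f(w, x) = w - x, with X ~ N(3, 1).
rng = np.random.default_rng(0)
w_final = sgd(grad_f=lambda w, x: w - x,
              sample_x=lambda: rng.normal(3.0, 1.0),
              w0=0.0, num_iters=10_000)
print(w_final)  # close to E[X] = 3
```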
An example of SGD
We next consider an example: \[ \min _w J(w)=\mathbb{E}[f(w, X)]=\mathbb{E}\left[\frac{1}{2}\|w-X\|^2\right], \] where \[ f(w, X)=\|w-X\|^2 / 2 \quad \nabla_w f(w, X)=w-X \]
It can be verified that the optimal solution is \(w^*=\mathbb{E}[X]\), obtained by solving \(\textcolor{green} {\nabla_w J(w)=0}\), that is, \(\mathbb{E}[w-X]=0\). Therefore, this optimization problem is equivalent to the mean estimation problem.
The GD algorithm for solving the above problem is \[ \begin{aligned} w_{k+1} & =w_k-\alpha_k \nabla_w J\left(w_k\right) \\ & =w_k-\alpha_k \mathbb{E}\left[\nabla_w f\left(w_k, X\right)\right] \\ & =w_k-\alpha_k \mathbb{E}\left[w_k-X\right] . \end{aligned} \]
The SGD algorithm for solving the above problem is \[ w_{k+1}=w_k-\alpha_k \nabla_w f\left(w_k, x_k\right)=w_k-\alpha_k\left(w_k-x_k\right) \]
We can therefore see that the iterative mean estimation algorithm is a special form of the SGD algorithm.
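A quick numerical check of the GD and SGD updates above. As a sketch, we assume \(X\) is uniform on \(\{1, \ldots, 10\}\), chosen only so that \(\mathbb{E}[X]=5.5\) is available in closed form for the GD update.

```python
import numpy as np

rng = np.random.default_rng(1)
EX = 5.5                            # true mean of X ~ Uniform{1, ..., 10}
w_gd = w_sgd = 0.0
for k in range(1, 1001):
    alpha = 1.0 / k
    w_gd -= alpha * (w_gd - EX)     # GD: uses the true expectation E[w_k - X]
    x_k = rng.integers(1, 11)       # SGD: draws one sample x_k per iteration
    w_sgd -= alpha * (w_sgd - x_k)

print(w_gd, w_sgd)  # both approach E[X] = 5.5; the SGD path is noisier
```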
Convergence analysis
We next show that SGD is a special RM algorithm. Then, the convergence naturally follows.
The aim of SGD is to minimize \[ J(w)=\mathbb{E}[f(w, X)] \]
This problem can be converted to a root-finding problem2: \[ \nabla_w J(w)=\mathbb{E}\left[\nabla_w f(w, X)\right]=0 \]
Let \[ \textcolor{blue} {g(w)=\nabla_w J(w)=\mathbb{E}\left[\nabla_w f(w, X)\right]} . \]
Then, the aim of SGD is to find the root of \(g(w)=0\).
As in the RM algorithm, we do not know \(g(w)\); if we did, we could run GD (gradient descent) directly instead of SGD.
What we can measure is \[ \begin{aligned} \tilde{g}(w, \eta) & =\nabla_w f(w, x) \\ & =\underbrace{\mathbb{E}\left[\nabla_w f(w, X)\right]}_{g(w)}+\underbrace{\nabla_w f(w, x)-\mathbb{E}\left[\nabla_w f(w, X)\right]}_\eta . \end{aligned} \]
Then, the RM algorithm for solving \(g(w)=0\) is \[ w_{k+1}=w_k-a_k \tilde{g}\left(w_k, \eta_k\right)=w_k-a_k \nabla_w f\left(w_k, x_k\right) . \]
- It is exactly the SGD algorithm.
- Therefore, SGD is a special RM algorithm.
Convergence of SGD
Since SGD is a special RM algorithm, its convergence naturally follows.
Theorem (Convergence of SGD): In the SGD algorithm, if
- \(0<c_1 \leq \nabla_w^2 f(w, X) \leq c_2\)3;
- \(\sum_{k=1}^{\infty} a_k=\infty\) and \(\sum_{k=1}^{\infty} a_k^2<\infty\);
- \(\left\{x_k\right\}_{k=1}^{\infty}\) is iid;
then \(w_k\) converges to the root of \(\nabla_w \mathbb{E}[f(w, X)]=0\) with probability 1.
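For example, the classical choice \(a_k=1/k\) satisfies the second condition: \[ \sum_{k=1}^{\infty} \frac{1}{k}=\infty \quad \text{(harmonic series)}, \qquad \sum_{k=1}^{\infty} \frac{1}{k^2}=\frac{\pi^2}{6}<\infty . \]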
Convergence pattern
Question: since the stochastic gradient is random and the approximation is therefore inaccurate, is the convergence of SGD slow or random? To answer this question, we consider the relative error between the stochastic and true gradients.
The relative error between the stochastic and true gradients is \[ \delta_k \triangleq \frac{\left|\nabla_w f\left(w_k, x_k\right)-\mathbb{E}\left[\nabla_w f\left(w_k, X\right)\right]\right|}{\left|\mathbb{E}\left[\nabla_w f\left(w_k, X\right)\right]\right|} . \]
For the sake of simplicity, we consider the case where \(w\) and \(\nabla_w f(w, x)\) are both scalars. Since \(w^*\) is the optimal solution, it holds that \(\mathbb{E}\left[\nabla_w f\left(w^*, X\right)\right]=0\). Then, the relative error can be rewritten as \[ \delta_k=\frac{\left|\nabla_w f\left(w_k, x_k\right)-\mathbb{E}\left[\nabla_w f\left(w_k, X\right)\right]\right|}{\left|\mathbb{E}\left[\nabla_w f\left(w_k, X\right)\right]-\textcolor{blue} {\mathbb{E} [\nabla_w f\left(w^*, X\right)] }\right|} = \frac{\left|\nabla_w f\left(w_k, x_k\right)-\mathbb{E}\left[\nabla_w f\left(w_k, X\right)\right]\right|}{| \textcolor{purple} {\mathbb{E} [\nabla_w^2 f\left(\tilde{w}_k, X\right)\left(w_k-w^*\right)] }|}, \] where the last equality is due to the mean value theorem and \(\tilde{w}_k \in\left[w_k, w^*\right]\). Suppose that \(f\) is strictly convex such that \(\nabla_w^2 f \geq c>0\) for all \(w, X\). Then, the denominator becomes \[ \begin{aligned} \left|\mathbb{E}\left[\nabla_w^2 f\left(\tilde{w}_k, X\right)\left(w_k-w^*\right)\right]\right| & =\left|\mathbb{E}\left[\nabla_w^2 f\left(\tilde{w}_k, X\right)\right]\right|\left|\left(w_k-w^*\right)\right| \\ & \geq c\left|w_k-w^*\right| . \end{aligned} \]
Substituting this inequality into the expression for \(\delta_k\) yields \[ \delta_k \leq \frac{|\overbrace{\nabla_w f\left(w_k, x_k\right)}^{\text {stochastic gradient }}-\overbrace{\mathbb{E}\left[\nabla_w f\left(w_k, X\right)\right]}^{\text {true gradient }}|}{\underbrace{c\left|w_k-w^*\right|}_{\text {distance to the optimal solution }}} . \]
The above inequality suggests an interesting convergence pattern of SGD: the relative error \(\delta_k\) is inversely proportional to \(\left|w_k-w^*\right|\).
As a result, when \(\left|w_k-w^*\right|\) is large, \(\delta_k\) is small. In this case, the SGD algorithm behaves like the gradient descent algorithm and hence \(w_k\) quickly converges to \(w^*\). When \(w_k\) is close to \(w^*\), the relative error \(\delta_k\) may be large, and the convergence exhibits more randomness.
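This pattern can be observed numerically. Below is a sketch using the mean-estimation objective \(f(w, X)=(w-X)^2 / 2\) with \(X \sim \mathcal{N}(0,1)\) (an illustrative assumption), so the true gradient is \(w_k-\mathbb{E}[X]=w_k\) and \(w^*=0\). A small fixed step size is used here only to make the pattern easy to see; it does not satisfy the convergence conditions above.

```python
import numpy as np

rng = np.random.default_rng(2)
w, alpha = 50.0, 0.1                 # start far from w* = 0; fixed small step
for k in range(1, 61):
    x_k = rng.normal(0.0, 1.0)
    # relative error: |stochastic - true| / |true| = |x_k - E[X]| / |w_k - w*|
    delta_k = abs(x_k) / abs(w)
    if k % 10 == 1:
        print(f"k={k:2d}  |w_k - w*| = {abs(w):8.4f}  delta_k = {delta_k:.4f}")
    w -= alpha * (w - x_k)           # SGD update
```

Early on, \(|w_k-w^*|\) is large and \(\delta_k\) is tiny, so the iterates move almost deterministically; near the end, \(\delta_k\) becomes large and the iterates wander randomly around \(w^*\).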
A deterministic formulation of SGD
The formulation of SGD in \(\eqref{eq_SGD}\) involves random variables. One may often encounter a deterministic formulation of SGD that does not involve any random variables.
In particular, consider a set of real numbers \(\left\{x_i\right\}_{i=1}^n\), where \(x_i\) does not have to be a sample of any random variable. The optimization problem to be solved is to minimize the average: \[ \min _w J(w)=\frac{1}{n} \sum_{i=1}^n f\left(w, x_i\right) \] where \(f\left(w, x_i\right)\) is a parameterized function, and \(w\) is the parameter to be optimized. The batch gradient descent (BGD) algorithm for solving this problem is \[ w_{k+1}=w_k-\alpha_k \nabla_w J\left(w_k\right)=w_k-\alpha_k \frac{1}{n} \sum_{i=1}^n \nabla_w f\left(w_k, x_i\right) . \]
Suppose that the set \(\left\{x_i\right\}_{i=1}^n\) is large and we can only fetch a single number each time.
In this case, it is favorable to update \(w_k\) in an incremental manner: \[ \begin{equation} \label{eq_SGD_deterministic} w_{k+1}=w_k-\alpha_k \textcolor{blue}{\nabla_w f\left(w_k, x_k\right)} . \end{equation} \]
It must be noted that \(x_k\) here is the number fetched at time step \(k\), not the \(k\)th element in the set \(\left\{x_i\right\}_{i=1}^n\).
The algorithm in \(\eqref{eq_SGD_deterministic}\) is very similar to SGD, but its problem formulation is subtly different because it does not involve any random variables or expected values.
Then, many questions arise. For example, is this algorithm SGD? How should we use the finite set of numbers \(\left\{x_i\right\}_{i=1}^n\) ? Should we sort these numbers in a certain order and then use them one by one, or should we randomly sample a number from the set?
A quick answer to the above questions is that, although no random variables are involved in the above formulation, we can convert the deterministic formulation to the stochastic formulation by introducing a random variable.
In particular, let \(X\) be a random variable defined on the set \(\left\{x_i\right\}_{i=1}^n\). Suppose that its probability distribution is uniform such that \(p\left(X=x_i\right)=1 / n\). Then, the deterministic optimization problem becomes a stochastic one: \[ \min _w J(w)=\frac{1}{n} \sum_{i=1}^n f\left(w, x_i\right)=\mathbb{E}[f(w, X)] . \]
The last equality in the above equation is strict instead of approximate. Therefore, the algorithm in \(\eqref{eq_SGD_deterministic}\) is SGD, and the estimate converges if \(x_k\) is uniformly and independently sampled from \(\left\{x_i\right\}_{i=1}^n\). Note that \(x_k\) may repeatedly take the same number in \(\left\{x_i\right\}_{i=1}^n\) since it is sampled randomly.
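A sketch of this conversion in code; the particular set \(\left\{x_i\right\}\) below is an illustrative assumption. Sampling uniformly with replacement makes \(\left\{x_k\right\}\) i.i.d., so the incremental update is a genuine SGD run.

```python
import numpy as np

rng = np.random.default_rng(3)
xs = np.array([1.0, 2.0, 4.0, 8.0, 16.0])  # the finite set {x_i}, n = 5

w = 0.0
for k in range(1, 100_001):
    x_k = rng.choice(xs)         # uniform i.i.d. sampling; repeated values are allowed
    w -= (1.0 / k) * (w - x_k)   # the incremental update above

print(w, xs.mean())              # w converges to the average 6.2
```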
BGD, MBGD, and SGD
Suppose we would like to minimize \(J(w)=\mathbb{E}[f(w, X)]\), given a set of random samples \(\left\{x_i\right\}_{i=1}^n\) of \(X\). The BGD, MBGD, and SGD algorithms for solving this problem are, respectively, \[ \begin{aligned} & w_{k+1}=w_k-\alpha_k \frac{1}{n} \sum_{i=1}^n \nabla_w f\left(w_k, x_i\right), \quad \text{(BGD)}\\ & w_{k+1}=w_k-\alpha_k \frac{1}{m} \sum_{j \in \mathcal{I}_k} \nabla_w f\left(w_k, x_j\right), \quad \text{(MBGD)}\\ & w_{k+1}=w_k-\alpha_k \nabla_w f\left(w_k, x_k\right) \quad \text{(SGD)} \end{aligned} \]
where \(n\) is the size of the dataset and \(m=\left|\mathcal{I}_k\right|\) is the mini-batch size, which can be much smaller than \(n\).
- In the BGD algorithm, all the samples are used in every iteration. When \(n\) is large, \((1 / n) \sum_{i=1}^n \nabla_w f\left(w_k, x_i\right)\) is close to the true gradient \(\mathbb{E}\left[\nabla_w f\left(w_k, X\right)\right]\).
- In the MBGD algorithm, \(\mathcal{I}_k\) is a subset of \(\{1, \ldots, n\}\) with size \(\left|\mathcal{I}_k\right|=m\). The set \(\mathcal{I}_k\) is obtained by \(m\) i.i.d. samplings.
- In the SGD algorithm, \(x_k\) is randomly sampled from \(\left\{x_i\right\}_{i=1}^n\) at time \(k\).
Comparison
Compare MBGD with BGD and SGD:
- Compared to SGD, MBGD has less randomness because it uses more samples instead of just one.
- Compared to BGD, MBGD does not require using all the samples in every iteration, making it more flexible and efficient.
- If \(m=1\), MBGD becomes SGD.
- If \(m=n\), MBGD does NOT become BGD strictly speaking, because MBGD uses \(n\) randomly fetched samples whereas BGD uses all \(n\) numbers. In particular, MBGD may use a value in \(\left\{x_i\right\}_{i=1}^n\) multiple times, whereas BGD uses each number exactly once.
An example
Given some numbers \(\left\{x_i\right\}_{i=1}^n\), our aim is to calculate the mean \(\bar{x}=\sum_{i=1}^n x_i / n\). This problem can be equivalently stated as the following optimization problem: \[ \min _w J(w)=\frac{1}{2 n} \sum_{i=1}^n\left\|w-x_i\right\|^2 \]
The three algorithms for solving this problem are, respectively, \[ \begin{aligned} & w_{k+1}=w_k-\alpha_k \frac{1}{n} \sum_{i=1}^n\left(w_k-x_i\right)=w_k-\alpha_k\left(w_k-\bar{x}\right), \quad \text{(BGD)} \\ & w_{k+1}=w_k-\alpha_k \frac{1}{m} \sum_{j \in \mathcal{I}_k}\left(w_k-x_j\right)=w_k-\alpha_k\left(w_k-\bar{x}_k^{(m)}\right), \quad \text{(MBGD)}\\ & w_{k+1}=w_k-\alpha_k\left(w_k-x_k\right), \quad \text { (SGD) } \end{aligned} \] where \(\bar{x}_k^{(m)}=\sum_{j \in \mathcal{I}_k} x_j / m\).
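The three updates can be compared numerically, as a sketch; the Gaussian dataset and the batch size \(m=32\) are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)
xs = rng.normal(5.0, 2.0, size=1000)   # dataset {x_i}, n = 1000
m = 32                                 # mini-batch size

w_bgd = w_mbgd = w_sgd = 0.0
for k in range(1, 1001):
    alpha = 1.0 / k
    w_bgd -= alpha * (w_bgd - xs.mean())                        # BGD: all n samples
    w_mbgd -= alpha * (w_mbgd - rng.choice(xs, size=m).mean())  # MBGD: m i.i.d. draws
    w_sgd -= alpha * (w_sgd - rng.choice(xs))                   # SGD: a single draw

print(w_bgd, w_mbgd, w_sgd, xs.mean())
# BGD equals the sample mean after the first step; MBGD and SGD approach it,
# with MBGD fluctuating less since each step already averages m samples.
```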
Furthermore, if \(\alpha_k=1 / k\), the above equation can be solved as \[ \begin{aligned} & w_{k+1}=\frac{1}{k} \sum_{j=1}^k \bar{x}=\bar{x}, \quad \text{(BGD)} \\ & w_{k+1}=\frac{1}{k} \sum_{j=1}^k \bar{x}_j^{(m)}, \quad \text{(MBGD)}\\ & w_{k+1}=\frac{1}{k} \sum_{j=1}^k x_j . \quad \text{(SGD)} \end{aligned} \] - The estimate of BGD at each step is exactly the optimal solution \(w^*=\bar{x}\). - The estimate of MBGD approaches the mean faster than SGD because \(\bar{x}_k^{(m)}\) is already an average.
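The SGD case can be verified by induction: assuming \(w_k=\frac{1}{k-1} \sum_{j=1}^{k-1} x_j\), one step of the update with \(\alpha_k=1 / k\) gives \[ w_{k+1}=w_k-\frac{1}{k}\left(w_k-x_k\right)=\frac{k-1}{k} w_k+\frac{1}{k} x_k=\frac{k-1}{k} \cdot \frac{1}{k-1} \sum_{j=1}^{k-1} x_j+\frac{1}{k} x_k=\frac{1}{k} \sum_{j=1}^k x_j . \] The same algebra applies to BGD and MBGD with \(x_j\) replaced by \(\bar{x}\) and \(\bar{x}_j^{(m)}\), respectively.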
The probability distribution of \(X\) is fixed, but we don't know it.↩︎
The optimal solution \(w^*\) must be a root of the equation, but a root may not be an optimal solution. This relates to whether the optimization problem is convex.↩︎
Recall that in the RM algorithm we need \(0<c_1 \leq \nabla_w g(w) \leq c_2\). In SGD we have \(g(w)=\nabla_w J(w)\). Therefore, the condition becomes one on \(\nabla_w^2 f(w, X)\).↩︎