Robbins-Monro Algorithm

Sources:

  1. Shiyu Zhao. Chapter 6: Stochastic Approximation. Mathematical Foundations of Reinforcement Learning
  2. YouTube: Stochastic Gradient Descent

Robbins-Monro (RM) algorithm: solve \(g(w)=0\) using \(\left\{\tilde{g}\left(w_k, \eta_k\right)\right\}\) \[ w_{k+1}=w_k-a_k \tilde{g}\left(w_k, \eta_k\right) \]

Stochastic approximation

Stochastic approximation (SA): - SA refers to a broad class of stochastic iterative algorithms for solving root-finding or optimization problems. - Compared to many other root-finding algorithms such as gradient-based methods, SA is powerful in the sense that it does not require knowledge of the expression of the objective function or of its derivative.

Robbins-Monro (RM) algorithm: - It is a pioneering work in the field of stochastic approximation. - The famous stochastic gradient descent (SGD) algorithm is a special form of the RM algorithm. - It can be used to analyze the mean estimation algorithms introduced in the previous article.

Philosophy: without a model, we need data!

Problem statement

Problem statement: Suppose we would like to find the root of the equation \[ g(w)=0, \] where \(w \in \mathbb{R}\) is the variable to be solved and \(g: \mathbb{R} \rightarrow \mathbb{R}\) is a function.

How can we calculate the root of \(g(w)=0\)1?

If the expression of \(g\) or its derivative is known, there are many numerical algorithms that can solve this problem.

What if the expression of the function \(g\) is unknown? For example, the function may be represented by an artificial neural network. Moreover, we can only obtain a noisy observation of \(g(w)\) : \[ \begin{equation} \label{eq_2} \textcolor{purple} {\tilde{g}(w, \eta)=g(w)+\eta} , \end{equation} \] where \(\eta \in \mathbb{R}\) is the observation error, which may or may not be Gaussian. In summary, it is a black-box system where only the input \(w\) and the noisy output \(\tilde{g}(w, \eta)\) are known (see Figure 1). Our aim is to solve \(g(w)=0\) using \(w\) and \(\tilde{g}\).

Figure 1

The Robbins-Monro (RM) algorithm

The Robbins-Monro (RM) algorithm that can solve \(g(w)=0\) is \[ \begin{equation} \label{eq_3} \textcolor{red} {w_{k+1}=w_k-a_k \tilde{g}\left(w_k, \eta_k\right)} \end{equation} \] for \(k=1,2,3, \ldots\),

where

  • \(w_k\) is the \(k\)th estimate of the root
  • \(\tilde{g}\left(w_k, \eta_k\right)=g\left(w_k\right)+\eta_k\) is the \(k\)th noisy observation
  • \(a_k\) is a positive coefficient.

As can be seen, the RM algorithm only requires the input and output data:

  • Input sequence: \(\left\{w_k\right\}\)
  • Noisy output sequence: \(\left\{\tilde{g}\left(w_k, \eta_k\right)\right\}\)

It does not need to know the expression of \(g(w)\).
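As a minimal sketch of this idea (my own illustration, not from the source), the RM update can be implemented as a loop that only queries a black-box noisy oracle. The oracle below, with \(g(w)=w-2\) and Gaussian observation noise, is a hypothetical example; the loop itself never sees the expression of \(g\).

```python
import random

def robbins_monro(g_tilde, w1, num_steps):
    """Run the RM iteration w_{k+1} = w_k - a_k * g_tilde(w_k) with a_k = 1/k."""
    w = w1
    for k in range(1, num_steps + 1):
        w = w - (1.0 / k) * g_tilde(w)
    return w

# Hypothetical black-box system: g(w) = w - 2, observed with additive Gaussian noise.
random.seed(0)

def g_tilde(w):
    return (w - 2.0) + random.gauss(0.0, 0.1)

root = robbins_monro(g_tilde, w1=10.0, num_steps=5000)
# root is close to the true root w* = 2
```

The solver interacts with the system only through the input sequence \(\{w_k\}\) and the noisy output sequence \(\{\tilde{g}(w_k,\eta_k)\}\), exactly as described above.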

An illustrative example

Figure 2

Consider the example shown in Figure 2. In this example,

  • \(g(w)=\tanh (w-1)\),

  • The true root of \(g(w)=0\) is \(w^*=1\).

  • Parameters: \(w_1=3, a_k=1 / k, \eta_k \equiv 0\) (no noise for the sake of simplicity)

The RM algorithm in this case is \[ \textcolor{blue} {w_{k+1}=w_k-a_k g\left(w_k\right)} \] since \(\tilde{g}\left(w_k, \eta_k\right)=g\left(w_k\right)\) when \(\eta_k=0\).

The resulting \(\left\{w_k\right\}\) generated by the RM algorithm is shown in Figure 2. It can be seen that \(w_k\) converges to the true root \(w^*=1\). This simple example illustrates why the RM algorithm converges.

  • When \(w_k>w^*\), we have \(g\left(w_k\right)>0\). Then, \(w_{k+1}=w_k-a_k g\left(w_k\right)<w_k\). If \(a_k g\left(w_k\right)\) is sufficiently small, we have \(w^*<w_{k+1}<w_k\). As a result, \(w_{k+1}\) is closer to \(w^*\) than \(w_k\).
  • When \(w_k<w^*\), we have \(g\left(w_k\right)<0\). Then, \(w_{k+1}=w_k-a_k g\left(w_k\right)>w_k\). If \(\left|a_k g\left(w_k\right)\right|\) is sufficiently small, we have \(w^*>w_{k+1}>w_k\). As a result, \(w_{k+1}\) is closer to \(w^*\) than \(w_k\).

In either case, \(w_{k+1}\) is closer to \(w^*\) than \(w_k\). Therefore, it is intuitive that \(w_k\) converges to \(w^*\).
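The example can be reproduced in a few lines (a sketch using the same setup as above: \(g(w)=\tanh(w-1)\), \(w_1=3\), \(a_k=1/k\), no noise):

```python
import math

# RM iteration for g(w) = tanh(w - 1) with w_1 = 3, a_k = 1/k, and eta_k = 0.
w = 3.0
trajectory = [w]
for k in range(1, 1001):
    w = w - (1.0 / k) * math.tanh(w - 1.0)
    trajectory.append(w)

# Since w_k > w* = 1 throughout and each step is positive but small,
# the sequence decreases monotonically toward the true root.
```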

Proof

Robbins-Monro theorem: In the Robbins-Monro algorithm in \(\eqref{eq_3}\), if

  1. \(0<c_1 \leq \nabla_w g(w) \leq c_2\) for all \(w\);
  2. \(\sum_{k=1}^{\infty} a_k=\infty\) and \(\sum_{k=1}^{\infty} a_k^2<\infty\);
  3. \(\mathbb{E}\left[\eta_k \mid \mathcal{H}_k\right]=0\) and \(\mathbb{E}\left[\eta_k^2 \mid \mathcal{H}_k\right]<\infty\);

where \(\mathcal{H}_k=\left\{w_k, w_{k-1}, \ldots\right\}\), then \(w_k\) almost surely converges to the root \(w^*\) satisfying \(g\left(w^*\right)=0\).

We postpone the proof of this theorem to a later article, where the notion of almost sure convergence, on which the theorem relies, is also introduced.

The three conditions above are explained as follows.

Condition 1

In the first condition, \(0<c_1 \leq \nabla_w g(w)\) indicates that \(g(w)\) is a monotonically increasing function. This condition ensures that the root of \(g(w)=0\) exists and is unique. If \(g(w)\) is monotonically decreasing, we can simply treat \(-g(w)\) as a new function that is monotonically increasing.

The inequality \(\nabla_w g(w) \leq c_2\) indicates that the gradient of \(g(w)\) is bounded from above. For example, \(g(w)=\tanh (w-1)\) satisfies this condition, but \(g(w)=w^3-5\) does not.

Condition 2

The second condition about \(\left\{a_k\right\}\) is interesting. We often see conditions like this in reinforcement learning algorithms.

In particular, the condition \(\sum_{k=1}^{\infty} a_k^2<\infty\) means that \(\lim _{n \rightarrow \infty} \sum_{k=1}^n a_k^2\) is bounded from above. It requires that \(a_k\) converges to zero as \(k \rightarrow \infty\).

The condition \(\sum_{k=1}^{\infty} a_k=\infty\) means that \(\lim _{n \rightarrow \infty} \sum_{k=1}^n a_k\) is infinitely large. It requires that \(a_k\) should not converge to zero too fast. These conditions have interesting properties, which will be analyzed in detail shortly.

Condition 3

The third condition is mild. It does not require the observation error \(\eta_k\) to be Gaussian. An important special case is that \(\left\{\eta_k\right\}\) is an i.i.d. stochastic sequence satisfying \(\mathbb{E}\left[\eta_k\right]=0\) and \(\mathbb{E}\left[\eta_k^2\right]<\infty\). In this case, the third condition is valid because \(\eta_k\) is independent of \(\mathcal{H}_k\) and hence we have \(\mathbb{E}\left[\eta_k \mid \mathcal{H}_k\right]=\mathbb{E}\left[\eta_k\right]=0\) and \(\mathbb{E}\left[\eta_k^2 \mid \mathcal{H}_k\right]=\mathbb{E}\left[\eta_k^2\right]\).
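To see that Gaussian noise is not required, the following sketch (an illustrative example of my own) runs the RM iteration on \(g(w)=\tanh(w-1)\) with zero-mean uniform observation noise, which is i.i.d. with finite variance and hence satisfies the third condition:

```python
import math
import random

random.seed(42)

# Zero-mean, bounded (hence non-Gaussian) observation noise: eta ~ Uniform(-0.5, 0.5).
w = 3.0
for k in range(1, 20001):
    eta = random.uniform(-0.5, 0.5)
    w = w - (1.0 / k) * (math.tanh(w - 1.0) + eta)

# w still converges toward the true root w* = 1 despite the non-Gaussian noise.
```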

Why is condition 2 important?

Why is the second condition important for the convergence of the RM algorithm?

This question can naturally be answered when we present a rigorous proof of the above theorem later. Here, we would like to provide some insightful intuition.

First, \(\sum_{k=1}^{\infty} a_k^2<\infty\) indicates that \(a_k \rightarrow 0\) as \(k \rightarrow \infty\). Why is this condition important? Suppose that the observation \(\tilde{g}\left(w_k, \eta_k\right)\) is always bounded. Since \[ w_{k+1}-w_k=-a_k \tilde{g}\left(w_k, \eta_k\right), \] if \(a_k \rightarrow 0\), then \(a_k \tilde{g}\left(w_k, \eta_k\right) \rightarrow 0\) and hence \(w_{k+1}-w_k \rightarrow 0\), indicating that \(w_{k+1}\) and \(w_k\) approach each other when \(k \rightarrow \infty\).

Otherwise, if \(a_k\) does not converge to zero, then \(w_k\) may still fluctuate when \(k \rightarrow \infty\).

Second, \(\sum_{k=1}^{\infty} a_k=\infty\) indicates that \(a_k\) should not converge to zero too fast. Why is this condition important?

Summing both sides of the equations \[ w_2-w_1 = -a_1 \tilde{g}\left(w_1, \eta_1\right), \quad w_3-w_2=-a_2 \tilde{g}\left(w_2, \eta_2\right), \quad w_4-w_3=-a_3 \tilde{g}\left(w_3, \eta_3\right), \quad \ldots \] gives \[ w_1-w_{\infty}=\sum_{k=1}^{\infty} a_k \tilde{g}\left(w_k, \eta_k\right) . \]

If \(\sum_{k=1}^{\infty} a_k<\infty\), then \(\left|\sum_{k=1}^{\infty} a_k \tilde{g}\left(w_k, \eta_k\right)\right|\) is also bounded, since \(\tilde{g}\left(w_k, \eta_k\right)\) is assumed bounded. But we will show that this cannot guarantee convergence. Let \(b\) denote the finite upper bound such that \[ \begin{equation} \label{eq_4} \left|w_1-w_{\infty}\right|=\left|\sum_{k=1}^{\infty} a_k \tilde{g}\left(w_k, \eta_k\right)\right| \leq b . \end{equation} \]

If the initial guess \(w_1\) is selected far away from \(w^*\) so that \(\left|w_1-w^*\right|>b\), then it is impossible to have \(w_{\infty}=w^*\) according to \(\eqref{eq_4}\).

This suggests that the RM algorithm cannot find the true solution \(w^*\) in this case. Therefore, the condition \(\sum_{k=1}^{\infty} a_k=\infty\) is necessary to ensure convergence given an arbitrary initial guess.
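This failure mode can be checked numerically. In the sketch below (the step-size choice is an illustrative assumption), \(a_k = 2^{-k}\) is summable with \(\sum_{k=1}^{\infty} a_k = 1\); since \(|\tanh(\cdot)| \leq 1\), the total movement is bounded by \(b = 1\), so starting from \(w_1 = 5\) the iterates can never reach \(w^* = 1\):

```python
import math

# Summable step sizes: a_k = 2^{-k}, so sum_k a_k = 1.
# Since |tanh| <= 1, the total movement |w_1 - w_inf| is bounded by b = 1.
w = 5.0
for k in range(1, 101):
    w = w - (0.5 ** k) * math.tanh(w - 1.0)

# The iterate stays at least |w_1 - w*| - b = 3 away from the root w* = 1.
```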

An example of \(a_k\)

What kinds of sequences satisfy \(\sum_{k=1}^{\infty} a_k=\infty\) and \(\sum_{k=1}^{\infty} a_k^2<\infty\) ?


One typical sequence is \[ a_k=\frac{1}{k} . \]

On the one hand, it holds that \[ \lim _{n \rightarrow \infty}\left(\sum_{k=1}^n \frac{1}{k}-\ln n\right)=\kappa, \] where \(\kappa \approx 0.577\) is called the Euler-Mascheroni constant (or Euler's constant) [28]. Since \(\ln n \rightarrow \infty\) as \(n \rightarrow \infty\), we have \[ \sum_{k=1}^{\infty} \frac{1}{k}=\infty \]

In fact, \(H_n=\sum_{k=1}^n \frac{1}{k}\) is called the harmonic number in number theory [29]. On the other hand, it holds that \[ \sum_{k=1}^{\infty} \frac{1}{k^2}=\frac{\pi^2}{6}<\infty . \]

Finding the value of \(\sum_{k=1}^{\infty} \frac{1}{k^2}\) is known as the Basel problem [30].
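Both facts are easy to check numerically (a quick sketch, not from the source):

```python
import math

n = 100_000
harmonic = sum(1.0 / k for k in range(1, n + 1))       # H_n, grows like ln(n) + kappa
basel = sum(1.0 / k ** 2 for k in range(1, n + 1))     # approaches pi^2 / 6

# H_n - ln(n) approaches the Euler-Mascheroni constant kappa ~ 0.5772,
# confirming that the harmonic sum diverges at a logarithmic rate.
gap = harmonic - math.log(n)
```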

In summary, the sequence \(\left\{a_k=1 / k\right\}\) satisfies the second condition of the Robbins-Monro theorem. Notably, a slight modification, such as \(a_k=1 /(k+1)\) or \(a_k=c_k / k\) where \(c_k\) is bounded, also preserves this condition.

How to select \(a_k\)?

In the RM algorithm, \(a_k\) is often selected as a sufficiently small constant in many applications. Although the second condition is not satisfied anymore in this case because \(\sum_{k=1}^{\infty} a_k^2=\infty\) rather than \(\sum_{k=1}^{\infty} a_k^2<\infty\), the algorithm can still converge in a certain sense.

Indeed, pointwise convergence to \(w^*\) is no longer guaranteed with a constant step size. Instead, \(w_k\) converges to a neighborhood of \(w^*\): after the initial transient, the estimate keeps fluctuating around the root, and the size of the fluctuation shrinks as the constant step size decreases.
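A small experiment (my own sketch; the step size and noise model are illustrative assumptions) shows this behavior: with a constant step size, the iterates do not settle at \(w^*\) but hover in a small neighborhood around it.

```python
import math
import random

random.seed(7)

a = 0.1          # constant step size: sum_k a_k^2 = infinity, so condition 2 fails
w = 3.0
tail = []
for k in range(1, 5001):
    eta = random.gauss(0.0, 0.1)
    w = w - a * (math.tanh(w - 1.0) + eta)
    if k > 4000:
        tail.append(w)   # record the iterates after the transient

# The iterates fluctuate around w* = 1 rather than converging to it.
tail_mean = sum(tail) / len(tail)
spread = max(tail) - min(tail)
```

Shrinking the constant step size `a` narrows the band of fluctuation, which is why a sufficiently small constant step size is often acceptable in practice.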


  1. Note that an equation like \(g(w)=c\) with \(c\) as a constant can also be converted to the above equation by rewriting \(g(w)-c\) as a new function.↩︎