## Introduction

In this post, I review convergence proofs for gradient algorithms. Our main reference is Leon Bottou, Online learning and stochastic approximations. I rewrite the proofs described in Bottou's paper, adding detail at the points that were subtle to me. I have tried to write the proofs as clearly as possible so as to make them accessible to everyone.

Consider the classical (discrete-time) gradient descent algorithm: \begin{align*} \theta_{n+1} = \theta_{n} - \gamma_n \nabla_{\theta} f(\theta,\x) |_{\theta = \theta_{n}} \end{align*} where $f(\theta,\x)$ is the cost function for a fixed parameter $\theta$, and $\x$ represents the full dataset consisting of $N$ fixed observations. Since the dataset is fixed, $f$ is effectively a function of $\theta$ only, and we write $f: \bR \to \bR$ with argument $\theta$. In a practical setup, the minimizer $\theta^*$ is unknown and our goal is to find it. We denote the current parameter estimate by $\theta_{n}$. In this post, we will show that, under certain conditions on $f$, $\theta_n \to \theta^*$ eventually. If we evaluate the gradient at a single observation $x_i$ instead of the full dataset $\x$ in the update, the convergence result still holds. The latter algorithm is called stochastic gradient descent and, although we will not prove its convergence here, the machinery provided in this post is sufficient to prove the stochastic version.
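As a concrete illustration, here is a minimal Python sketch of this update for an assumed toy cost $f(\theta) = (\theta - 2)^2$, whose minimum $\theta^* = 2$ is chosen by hand so we can check convergence:

```python
# Minimal sketch of (full-batch) gradient descent, assuming the toy cost
# f(theta) = (theta - 2)^2, whose minimum theta* = 2 is known by construction.
def grad_f(theta):
    # Gradient of f(theta) = (theta - 2)^2.
    return 2.0 * (theta - 2.0)

def gradient_descent(theta0, gamma, n_steps):
    theta = theta0
    for _ in range(n_steps):
        theta = theta - gamma * grad_f(theta)
    return theta

theta_hat = gradient_descent(theta0=10.0, gamma=0.1, n_steps=200)
print(theta_hat)  # very close to 2.0
```

With a constant step-size and this quadratic cost, the error contracts by a factor $(1 - 2\gamma)$ per step, so 200 iterations are far more than enough.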

## Conditions for $f$

We have to assume certain conditions for $f$ to ensure the convergence. These conditions, which are collectively called "general convexity", are as follows.
(i) $f(\theta)$ has a single minimum $\theta^*$

(ii) The following holds, \begin{align}\label{condition} \forall \epsilon > 0, \,\,\,\,\,\, \inf_{(\theta - \theta^*)^2 > \epsilon} (\theta - \theta^*) \nabla_\theta f(\theta) > 0 \end{align}
Condition (ii) ensures that the gradient always points toward the minimum.
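Condition (ii) can be probed numerically for a toy cost. The example below is a sketch assuming $f(\theta) = (\theta - 1)^2$ with $\theta^* = 1$; we sample points satisfying $(\theta - \theta^*)^2 > \epsilon$ on a grid and check that the inner product stays positive:

```python
import numpy as np

# Assumed toy cost: f(theta) = (theta - 1)^2, so theta_star = 1 and
# grad f(theta) = 2 (theta - 1).
theta_star = 1.0
grad_f = lambda th: 2.0 * (th - theta_star)

eps = 0.01
# Sample points with (theta - theta_star)^2 > eps, on both sides of theta_star.
left = np.linspace(-5.0, theta_star - np.sqrt(eps) - 1e-6, 500)
right = np.linspace(theta_star + np.sqrt(eps) + 1e-6, 5.0, 500)
thetas = np.concatenate([left, right])
values = (thetas - theta_star) * grad_f(thetas)
print(values.min() > 0.0)  # True on this grid: the product stays positive
```

Here $(\theta - \theta^*)\nabla_\theta f(\theta) = 2(\theta - \theta^*)^2 \geq 2\epsilon$ on the sampled region, so the infimum is bounded away from zero, as condition (ii) requires.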

Here, we start with the continuous-time gradient descent proof, which is very easy but contains all main steps for the discrete-time proofs.

Consider a differential equation that defines the parameter trajectory $\theta(t)$, \begin{align}\label{contGrad} \frac{\text{d}\theta}{\text{d}t} = - \nabla_\theta f(\theta) \end{align} It is more intuitive to think of this equation as, \begin{align*} {\text{d}\theta} = - {\text{d}t}\nabla_\theta f(\theta) \end{align*} then we can see $\text{d}t$ as the step-size in the discrete-time gradient descent.
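This intuition can be made literal: integrating the ODE with the forward-Euler method turns $\text{d}t$ into exactly the discrete step-size. A sketch, assuming the toy cost $f(\theta) = (\theta - 3)^2$:

```python
# Forward-Euler integration of dtheta/dt = -grad f(theta): the time step dt
# plays the role of the step-size. f(theta) = (theta - 3)^2 is an assumed toy cost.
def euler_flow(theta0, dt, t_end):
    grad_f = lambda th: 2.0 * (th - 3.0)
    theta, t = theta0, 0.0
    while t < t_end:
        theta -= dt * grad_f(theta)  # theta <- theta - dt * grad f(theta)
        t += dt
    return theta

print(euler_flow(theta0=0.0, dt=0.01, t_end=10.0))  # approximately 3.0
```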

Proposition 1. Assume $f(\theta)$ meets the general convexity condition. Then, procedure $\eqref{contGrad}$ converges to the unique minimum eventually.

Proof. (Sketch) All convergence proofs of such methods rely on the following three steps.

(a) Define a function $h$ that measures the distance of the current parameter estimate $\theta$ from the minimum $\theta^*$.

(b) Show that $h \to 0$ as $t \to \infty$.

(c) Show that (b) $\Rightarrow$ $\theta \to \theta^*$.

This proof sketch will be the same for the next proof.

Proof. Let us proceed in three steps.

(a) Define a function (Lyapunov function), \begin{align} h(t) \triangleq (\theta(t) - \theta^*)^2 \end{align} This function measures how far our parameter $\theta(t)$ is from the minimum $\theta^*$.

(b) Compute the derivative of $h(t)$, \begin{align} \frac{dh}{dt} = 2(\theta-\theta^*) \frac{d\theta}{dt} \end{align} Using our differential equation, we can write \begin{align}\label{LessThanZero} \frac{dh}{dt} = 2(\theta -\theta^*) \frac{d\theta}{dt} = -2(\theta - \theta^*) \nabla_\theta f(\theta) \leq 0 \end{align} This quantity is nonpositive because of condition (ii): there we required that $(\theta - \theta^*) \nabla_\theta f(\theta)$ be positive, hence Eq. $\eqref{LessThanZero}$ is less than or equal to zero. It follows that $h(t)$ is a monotonically decreasing function. Since $h(t) \geq 0$, it has a limit as $t \to \infty$ (see Lemma 2 in the Appendix for a proof).

(c) Since $h(t)$ converges, its derivative tends toward zero: \begin{align}\label{dhdt} \frac{dh}{dt} = -2(\theta -\theta^*) \nabla_\theta f(\theta) \to 0 \,\,\, \text{as} \,\,\, t \to \infty \end{align} To show $h(t) \to 0$, suppose for contradiction that $h(t)$ converges to a value greater than zero. Then, after a certain time, the following holds for some $\epsilon > 0$: $$h(t) = (\theta(t) - \theta^*)^2 > \epsilon$$ This is incompatible with the general convexity condition (ii) and Eq. $\eqref{dhdt}$: whenever $(\theta - \theta^*)^2 > \epsilon$, condition (ii), \begin{align*} \forall \epsilon > 0, \,\,\,\,\,\, \inf_{(\theta - \theta^*)^2 > \epsilon} (\theta - \theta^*) \nabla_\theta f(\theta) > 0 \end{align*} bounds $(\theta - \theta^*) \nabla_\theta f(\theta)$ away from zero, which contradicts $\eqref{dhdt}$. We conclude that $h(t) \to 0$ as $t \to \infty$, and if $h(t) \to 0$, then \begin{align} \theta(t) \to \theta^* \,\,\,\,\,\,\,\, \text{as} \,\,\,\, t\to\infty \end{align}$\blacksquare$
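The Lyapunov argument above can be checked numerically. The sketch below assumes the toy cost $f(\theta) = (\theta + 2)^2$ (so $\theta^* = -2$) and an Euler-discretized gradient flow; along the trajectory, $h$ never increases:

```python
# Numerical illustration of the Lyapunov argument: along an (Euler-discretized)
# gradient flow, h = (theta - theta_star)^2 is monotonically decreasing.
# f(theta) = (theta + 2)^2 with theta_star = -2 is an assumed toy example.
theta_star = -2.0
grad_f = lambda th: 2.0 * (th - theta_star)

theta, dt = 5.0, 0.01
h_values = []
for _ in range(2000):
    h_values.append((theta - theta_star) ** 2)
    theta -= dt * grad_f(theta)

decreasing = all(h_values[i + 1] <= h_values[i] for i in range(len(h_values) - 1))
print(decreasing)  # True: h never increases along the trajectory
```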

Now, we return to the classical discrete-time gradient descent: \begin{align}\label{DiscGradientDesc} \theta_{n+1} = \theta_{n} - \gamma_n \nabla_{\theta} f(\theta) |_{\theta = \theta_{n}} \end{align} where now the step-size $\gamma_n$ appears explicitly. The convergence proof relies on the same three steps as in the continuous-time proof. However, ensuring the convergence of a sequence now requires more effort.

We note that the proof sketch is the same as in Proposition 1. This time we define $h_n$ instead of $h(t)$ (emphasizing that time is discrete). In order to ensure that $h_n$ converges, we need a criterion for the convergence of a sequence. To handle this technical point, Bottou uses a technical lemma which, roughly, shows that if the positive variations of a positive sequence converge, then the sequence itself converges. We review this lemma first, and then analyze the system defined in $\eqref{DiscGradientDesc}$.

Lemma 1. Let $(u_n)_{n\geq 1}$ be a positive sequence of numbers. Define the positive variations $S_n^+$ as, \begin{align*} S_n^+ = \sum_{k=1}^{n-1} (u_{k+1} - u_k)_+ \end{align*} where $(x)_+ = x$ if $x > 0$ and $(x)_+ = 0$ otherwise. Then, if $S_n^+$ has a finite limit $S^{+}_\infty < \infty$, the sequence converges, $u_n \to u_\infty < \infty$, i.e., summable positive variations are sufficient for convergence.

Proof. See Appendix.
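In code, with $(x)_+$ the positive part $\max(x, 0)$, the positive variations are easy to compute. A minimal sketch with a made-up sequence:

```python
# Positive variations S_n^+ of a sequence, as in Lemma 1: the sum of the
# upward moves (x)_+ = max(x, 0) between consecutive terms.
def positive_variations(u):
    return sum(max(u[k + 1] - u[k], 0.0) for k in range(len(u) - 1))

# Made-up example: a sequence that oscillates while trending downward.
u = [1.0, 0.5, 0.7, 0.3, 0.4, 0.2]
print(positive_variations(u))  # approximately 0.3 (the two upward moves 0.2 + 0.1)
```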

Proposition 2. Consider the dynamical system defined in Eq. $\eqref{DiscGradientDesc}$. We claim $\theta_n \to \theta^*$.

Proof. We follow the three steps in the proof sketch.

(a) Define a Lyapunov sequence, \begin{align*} h_n \triangleq (\theta_n - \theta^*)^2 \end{align*} (b) Prove that the Lyapunov sequence $(h_n)$ converges. We can write, \begin{align}\label{ht1_ht} h_{n+1} - h_n = -2 \gamma_n (\theta_n - \theta^*) \nabla_\theta f(\theta_n) + \gamma_n^2 (\nabla_\theta f(\theta_n))^2 \end{align} Condition (ii) ensures that the first term is always nonpositive. But we also have a nonnegative second term, and we must ensure that it vanishes. To this end, we introduce additional conditions, \begin{align*} &\text{(C1)} \,\,\,\,\,\,\,\,\,\,\,\, \sum_{n=1}^\infty \gamma_n^2 < \infty \\ &\text{(C2)} \,\,\,\,\,\,\,\,\,\,\,\, (\nabla_\theta f(\theta))^2 \leq A + B(\theta -\theta^*)^2, \,\,\,\, A,B \geq 0 \end{align*} Note that (C2) is required because if $(\nabla_\theta f(\theta))^2$ grows too fast, it eliminates the effect of $\gamma_n$. The rationale behind (C2) can be clarified as follows. The Taylor expansion of $\nabla_\theta f(\theta)$ around $\theta^*$ is \begin{align*} \nabla_\theta f(\theta) = \nabla_\theta f(\theta^*) + \nabla_{\theta\theta}^2 f(\theta^*) (\theta - \theta^*) + R \end{align*} where $R$ is the error term. Since $\theta^*$ is the minimum, the gradient vanishes there, i.e., $\nabla_\theta f(\theta^*) = 0$. Taking squares of both sides gives (C2) with a variable transformation (see here, pg. 10)
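For completeness, Eq. $\eqref{ht1_ht}$ follows directly by expanding the square, writing the update with the step-size indexed as $\theta_{n+1} = \theta_n - \gamma_n \nabla_\theta f(\theta_n)$ (the indexing under which the $\gamma_n$ in Eq. $\eqref{ht1_ht}$ comes out right):

```latex
\begin{align*}
h_{n+1} &= (\theta_{n+1} - \theta^*)^2
         = \bigl(\theta_n - \gamma_n \nabla_\theta f(\theta_n) - \theta^*\bigr)^2 \\
        &= (\theta_n - \theta^*)^2
           - 2\gamma_n (\theta_n - \theta^*)\nabla_\theta f(\theta_n)
           + \gamma_n^2 \bigl(\nabla_\theta f(\theta_n)\bigr)^2 \\
        &= h_n - 2\gamma_n (\theta_n - \theta^*)\nabla_\theta f(\theta_n)
           + \gamma_n^2 \bigl(\nabla_\theta f(\theta_n)\bigr)^2 .
\end{align*}
```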

First note that, using (C2), we have, \begin{align*} h_{n+1} - h_n &= -2 \gamma_n (\theta_n - \theta^*) \nabla_\theta f(\theta_n) + \gamma_n^2 (\nabla_\theta f(\theta_n))^2 \\ &\leq -2 \gamma_n (\theta_n - \theta^*) \nabla_\theta f(\theta_n) + \gamma_n^2 (A + B(\theta_n-\theta^*)^2) \end{align*} Put $(\theta_n -\theta^*)^2 = h_n$ and continue, \begin{align*} h_{n+1} - h_n &\leq -2 \gamma_n (\theta_n - \theta^*) \nabla_\theta f(\theta_n) + \gamma_n^2 (A + B h_n) \end{align*} Rearranging, we have, \begin{align}\label{EqLyapun} h_{n+1} - (1 + \gamma_n^2 B) h_n \leq -2 \gamma_n (\theta_n - \theta^*) \nabla_\theta f(\theta_n) + \gamma_n^2 A \leq \gamma_n^2 A \end{align} The last inequality follows because the first term on the rhs is nonpositive by condition (ii). We define two auxiliary sequences, $\mu_n$ and $h^\prime_n$, \begin{align*} \mu_n &\triangleq \prod_{i=1}^n \frac{1}{1 + \gamma_i^2 B} \to \mu_\infty \,\,\,\, \text{as} \,\,\,\, n \to \infty \\ h^\prime_n &\triangleq \mu_{n-1} h_n \end{align*} Multiply both sides of Eq. $\eqref{EqLyapun}$ by $\mu_n$, \begin{align*} &\mu_n h_{n+1} - \mu_n (1 + \gamma_n^2 B) h_n \leq -2 \mu_n\gamma_n (\theta_n - \theta^*) \nabla_\theta f(\theta_n) + \gamma_n^2\mu_n A \leq \gamma_n^2 \mu_n A \end{align*} Since $\mu_n (1 + \gamma_n^2 B) = \mu_{n-1}$ and $h^\prime_n = \mu_{n-1} h_n$, we can write, \begin{align*} &h^\prime_{n+1} - h^\prime_n \leq -2 \mu_n\gamma_n (\theta_n - \theta^*) \nabla_\theta f(\theta_n) + \gamma_n^2\mu_n A \leq \gamma_n^2 \mu_n A \end{align*} We conclude, \begin{align} h^\prime_{n+1} - h^\prime_n \leq \gamma_n^2 \mu_n A \end{align} Since the rhs is the summand of a convergent infinite sum (by (C1) and the boundedness of $\mu_n$), the positive variations of $h^\prime_n$ are summable.
According to Lemma 1, the sequence $h^\prime_n$ converges. Moreover, $\mu_n$ converges to a limit $\mu_\infty \neq 0$. To see this, notice that $\sum_{n=1}^\infty \gamma_n^2 < \infty \Rightarrow \gamma_n^2 \to 0$; an easy proof is $$\lim_{n\to\infty} \gamma_{n}^2 = \lim_{n\to\infty} \left(\sum_{k=1}^n \gamma_k^2 - \sum_{k=1}^{n-1} \gamma_k^2\right) = 0,$$ and an infinite product $\prod_{i} (1 + \gamma_i^2 B)$ with summable $\gamma_i^2$ converges to a finite, nonzero number. Since $h_n = h^\prime_n / \mu_{n-1}$, these two facts imply the convergence of $h_n$.
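A quick numerical sanity check of this claim, assuming $\gamma_i = 1/i$ and $B = 1$ and writing the auxiliary product with $1 + \gamma_i^2 B$ in the denominator as in Bottou's paper:

```python
# Sanity check that mu_n = prod_i 1/(1 + gamma_i^2 B) converges to a nonzero
# limit when sum gamma_i^2 < infinity. Assumed choices: gamma_i = 1/i, B = 1.
B = 1.0
gamma = lambda i: 1.0 / i

mu = 1.0
for i in range(1, 100001):
    mu /= (1.0 + gamma(i) ** 2 * B)
print(mu)  # stabilizes at a strictly positive value
```

Because $\sum_i \gamma_i^2 = \pi^2/6 < \infty$ here, the product neither blows up nor collapses to zero.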

(c) We now prove that convergence of the Lyapunov sequence implies convergence of the discrete gradient descent. Recall the rewritten version of Eq. $\eqref{EqLyapun}$: \begin{align*} h_{n+1} - (1 + \gamma_n^2 B) h_n + 2 \gamma_n (\theta_n - \theta^*) \nabla_\theta f(\theta_n) \leq \gamma_n^2 A \end{align*} Take the infinite sum of both sides, \begin{align*} \sum_{n=1}^\infty \left( h_{n+1} - (1 + \gamma_n^2 B) h_n + 2 \gamma_n (\theta_n - \theta^*) \nabla_\theta f(\theta_n)\right) \leq \sum_{n=1}^\infty \gamma_n^2 A \end{align*} Since $h_n$ converges (and is hence bounded), the sum of the first two terms on the lhs is finite; the rhs is also finite by (C1). Then we have, \begin{align}\label{420} \sum_{n=1}^\infty \gamma_n (\theta_n - \theta^*) \nabla_\theta f(\theta_n) < \infty \end{align} At this point, we introduce a new condition to limit the rate of decrease of the learning rates. This condition is required because if the learning rates decrease too quickly, the algorithm could stop before attaining the minimum. It is expressed as follows. \begin{align}\label{421} \sum_{i=1}^\infty \gamma_i = \infty \end{align} Then, Eqs. $\eqref{420}$ and $\eqref{421}$ imply, \begin{align} (\theta_n - \theta^*) \nabla_\theta f(\theta_n) \to 0 \,\,\, \text{as} \,\,\, n \to \infty \end{align} This is because the terms $\gamma_n$ and $(\theta_n - \theta^*) \nabla_\theta f(\theta_n)$ are positive while $\sum_n \gamma_n$ diverges. If $(\theta_n - \theta^*) \nabla_\theta f(\theta_n)$ goes to zero, condition (ii) implies that eventually $(\theta_n - \theta^*)^2 < \epsilon$ for every $\epsilon > 0$. The rest of the argument is the same as in the continuous case and leads to the same conclusion, \begin{align} \theta_n \to \theta^* \,\,\,\,\,\,\, \text{as} \,\,\, n \to \infty \end{align} $\blacksquare$
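To see the step-size conditions at work numerically, here is a minimal sketch with the assumed toy cost $f(\theta) = (\theta - 4)^2/2$ (gradient $\theta - 4$) and step-sizes $\gamma_n = 1/(n+1)$, which are square-summable but not summable:

```python
# Discrete-time gradient descent with decreasing steps gamma_n = 1/(n+1),
# which satisfy sum gamma_n = infinity and sum gamma_n^2 < infinity.
# Assumed toy cost: f(theta) = (theta - 4)^2 / 2, so grad f(theta) = theta - 4.
theta, theta_star = 0.0, 4.0
for n in range(1, 5001):
    gamma_n = 1.0 / (n + 1)
    theta -= gamma_n * (theta - theta_star)
print(theta)  # approaches 4.0
```

For this linear-gradient example the error obeys $e_n = e_0 \prod_k (1 - \gamma_k)$, which telescopes to $e_0/(n+1)$, so the divergence of $\sum_n \gamma_n$ is exactly what drives the error to zero.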

## Conclusion

The convergence proof of the stochastic version of discrete-time gradient descent closely mimics the arguments in the discrete-time proof. One has to define a positive stochastic process $h_n(\omega)$ instead of a positive sequence of numbers $h_n$. Then, using the probabilistic analog of the lemma presented here, namely the quasimartingale convergence theorem, it is easy to prove the stochastic version with conditional expectations. With the machinery and extra details presented here, the stochastic proof in Bottou's paper should be easy to follow, so we do not reproduce it.

## Appendix

Lemma 2. If $h(t) > 0$ and it is a continuous, monotonically decreasing function, then it has a limit as $t \to \infty$.

Proof. Since $h$ is bounded below by zero, it has an infimum $L \geq 0$. We show that $h(t) \to L$ as $t \to \infty$; that is, for every $\epsilon > 0$ there exists $b$ s.t. for all $a > b$ the following holds: $L - \epsilon < h(a) < L + \epsilon$.

The first inequality holds for any $a$, since $L$ is a lower bound of $h$. For the second: since $L + \epsilon$ is not a lower bound, there exists $b \in \bR$ s.t. $h(b) < L + \epsilon$. Since $h$ is decreasing, $h(a) \leq h(b)$ for all $a > b$. Then we have $h(a) \leq h(b) < L + \epsilon$, which gives the result. $\blacksquare$

Proof. (of Lemma 1) Let $S_n^+$ be defined as in the statement, and define the negative variations as follows. \begin{align*} S_n^- = \sum_{k = 1}^{n-1} (u_{k+1} - u_k)_- \end{align*} where $(x)_- = x$ if $x < 0$ and $(x)_- = 0$ otherwise. Recall $u_n \geq 0$. We want to show that if $S_\infty^+ < \infty$ then $u_\infty$ exists. By telescoping, we have [Bottou], \begin{align*} 0 \leq u_n = u_1 + S_n^+ + S_n^- \leq u_1 + S_\infty^+ + S_n^- \leq u_1 + S_\infty^+ \end{align*} where the last inequality holds because $S_n^-$, being a sum of nonpositive numbers, satisfies $S_n^- \leq 0$. From the first inequality above, we have, \begin{align*} -u_1 - S_\infty^+ \leq S_n^- \leq 0 \end{align*} so if $S_\infty^+$ is finite, we have a lower bound for $S_n^-$. Since $S_n^-$ is a monotonically decreasing, lower-bounded sequence, it converges to a limit (this is the discrete version of Lemma 2). Therefore, $S_\infty^-$ exists. Then, since $u_\infty = u_1 + S_\infty^+ + S_\infty^-$, we have $u_\infty < \infty$, i.e. the limit exists. This concludes the proof. $\blacksquare$
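The telescoping decomposition $u_n = u_1 + S_n^+ + S_n^-$ used above can be verified numerically, with $(x)_+ = \max(x, 0)$ and $(x)_- = \min(x, 0)$ and a made-up sequence:

```python
# Numerical check of the decomposition u_n = u_1 + S_n^+ + S_n^- from the
# proof of Lemma 1, with (x)_+ = max(x, 0) and (x)_- = min(x, 0).
u = [3.0, 2.0, 2.5, 1.0, 1.2, 0.8]
s_plus = sum(max(u[k + 1] - u[k], 0.0) for k in range(len(u) - 1))
s_minus = sum(min(u[k + 1] - u[k], 0.0) for k in range(len(u) - 1))
print(abs(u[0] + s_plus + s_minus - u[-1]) < 1e-9)  # True: the sums telescope
```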

Unknown said...

Hi, this is a nice post. Could you elaborate more about the convergence (15) implied by (13) and (14) in a mathematically rigorous way? By "\gamma_{n} diverges" you actually mean that the series \sum_{n}\gamma_{n} diverges, right? This is because the sequence actually \gamma_{n} converges to zero. I still don't find a rigorous proof for the convergence (15) implied by (13) and (14). But my strategy is to first prove the left-hand side of (15) is bounded and then prove the upper limit of the left-hand side of (15) is less than or equal to zero (it's easy to see that the lower limit is zero).

Deniz said...

Yes you are right. "$\gamma_n$ diverges" should be written as "$\sum_{n=1}^\infty \gamma_n$ is divergent". Thanks for the correction!

As for your question, for simplicity, let's write $k_n = (\theta_n - \theta^*) \nabla_\theta f(\theta_n)$. And $\sum_n$ will denote the infinite sum over $n$. So the statement becomes, if $\sum_n \gamma_n = \infty$ (14) and $\sum_n \gamma_n k_n < \infty$ (13), it means $k_n \to 0$ (15).

I'll try to make this rigorous with a proof by contradiction. Let us assume (13) and (14) hold, and $k_n$ doesn't converge to zero. This means there exists an $\epsilon > 0$ such that $k_n > \epsilon$ for all $n$. Then we can show (since we assumed (14) holds),
\begin{align*}
\sum_n k_n \gamma_n > \epsilon \sum_n \gamma_n = \infty.
\end{align*}
Assuming $k_n$ doesn't converge to zero, we arrive at a contradiction with (13), one of our assumptions. Then $k_n \to 0$ must be the case.

Jovitian said...

how do you have equation (10)? I tried some ways to rebuild $h_{n+1} - h_{n}$ but cannot get something like (10). If we expand $h_{n+1}$ as $(\theta_{n+1}-\theta^*)^2$ and $h_{n}$ as $(\theta_{n}-\theta^*)^2$, then replace $\theta_n$ with $\theta_{n-1}-\gamma_n\nabla_\theta f(\theta_{n-1})$ and $\theta_{n+1}$ with $\theta_n-\gamma_{n+1}\nabla_\theta f(\theta_n)$, but how to calculate $\gamma_{n+1}$?

Unknown said...

So did you use a fixed step-size when you proved the convergence of discrete-time gradient descent? Now I know how to derive (10), but that step-size should be $\gamma_{n+1}$, not $\gamma_n$. So how can you get $\gamma_{n+1}$ in (10)?
And another question: if I now want to apply this proof to a maximization problem, with the update equation $\theta_{n}=\theta_{n-1} + \gamma_n \nabla_\theta f(\theta_{n-1})$, can I use this method, and how should it be adapted? Thank you!

Unknown said...

"With rearrangements, we have $h_{n+1}-(1-\gamma_n B)h_n$" — I think here it should be $h_{n+1}-(1+\gamma_n^2 B)h_n$, right?

Deniz said...

Jie, for the second comment, yes you are right. In fact, the typo is in the definition of the gradient descent, it should be read as,
\begin{align*} \theta_{n+1} = \theta_{n} - \gamma_n \nabla_{\theta} f(\theta,\x) |_{\theta = \theta_{n}} \end{align*}
Thanks for catching it! Sorry for the late reply. In any case, you can check Bottou's paper: http://leon.bottou.org/publications/pdf/online-1998.pdf

I think your proof will be valid for a maximisation process with gradient ascent as long as the negative of the cost function you use satisfies the conditions introduced in the paper (general convexity).

(The correction to the algorithm may resolve another inconsistency in the notation that I thought was a typo in Bottou's paper, but apparently it wasn't!)

Unknown said...

hi there again, there are some subscripts that are not very clear to me, for example:
the $i$ and $t$ in the auxiliary sequences $\mu_n$;
the second condition for $f$: where did you find this condition? Why is it not $\theta^*$ in $\nabla_{\theta} f(\theta)$?
thank you!

Deniz said...

Hi,

Did you look at the paper Leon Bottou, Online learning and stochastic approximations?

Unknown said...

Hi there, sorry i missed your comment, I am reading it. Thanks!