Suppose, given a convex function $f: \bR^d \to \bR$, we would like to find the minimum of $f$ by iterating \begin{align*} \theta_t = \theta_{t-1} - \gamma \nabla f(\theta_{t-1}), \end{align*} where $\gamma > 0$ is the step size. How fast do we converge to the minimum of $f$?

In this post, I will prove that this convergence rate is $\mathcal{O}(1/t)$ and develop all the material needed for the proof.
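
To make the iteration concrete, here is a minimal sketch of gradient descent in Python (the quadratic test function, step size, and iteration count are illustrative choices, not part of the analysis below):

```python
import numpy as np

def gradient_descent(grad_f, theta0, gamma, num_steps):
    """Iterate theta_t = theta_{t-1} - gamma * grad_f(theta_{t-1})."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(num_steps):
        theta = theta - gamma * grad_f(theta)
    return theta

# Example: f(theta) = ||theta||^2 / 2 has gradient grad_f(theta) = theta,
# so the iterates contract toward the minimizer theta* = 0.
theta_final = gradient_descent(lambda th: th, theta0=[4.0, -2.0], gamma=0.1, num_steps=100)
```

With $\gamma = 0.1$ each step scales the iterate by $0.9$, so after 100 steps the iterate is very close to the minimizer.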

As usual, proving a result like this requires some regularity conditions on
the function. In particular, we assume the following so-called *$L$-smoothness*
assumption on $f$.

**A1.**

*Assume $f(\theta)$ is convex, differentiable, and $L$-gradient Lipschitz, i.e., \begin{align*} \|\nabla f(\theta) - \nabla f(\theta')\|_2 \leq L \|\theta - \theta'\|_2 \quad \quad \text{for any} \quad \theta,\theta' \in \bR^d. \end{align*}*

Under this assumption, we will prove the following theorem, which provides the convergence rate of gradient descent for convex functions.

**Theorem 1.**

*Assume*

**A1**

*and let $f^\star = f(\theta^\star)$ be the optimal value of the function $f(\theta)$. Then, given that $\gamma \leq 1/L$, we have \begin{align*} f(\theta_t) - f^\star \leq \frac{\|\theta_0 - \theta^\star\|_2^2}{2\gamma t}. \end{align*}*

Before proceeding to the proof, we make a couple of remarks about this theorem. First, it demonstrates the $\mathcal{O}(1/t)$ convergence rate. Second, for this rate to be guaranteed, one needs to ensure $\gamma \leq 1/L$. This is often the impractical part of results of this type, as the constant $L$ is typically not known beforehand; this issue is often much worse for harder problems. The condition may also suggest very small step sizes in practice, and it may be violated in more complicated scenarios.
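
The role of the step-size condition can be seen in a small experiment. Consider $f(\theta) = \frac{L}{2}\theta^2$, whose gradient $L\theta$ is $L$-Lipschitz (the constants below are illustrative):

```python
L = 10.0                    # gradient-Lipschitz constant of f(theta) = L * theta**2 / 2
grad = lambda th: L * th    # exact gradient

def run(gamma, steps=50, theta0=1.0):
    th = theta0
    for _ in range(steps):
        th = th - gamma * grad(th)
    return th

safe = run(gamma=1.0 / L)    # gamma = 1/L: converges (here, in a single step)
unsafe = run(gamma=2.5 / L)  # gamma > 2/L: the iterates grow without bound
```

Each step multiplies the iterate by $(1 - \gamma L)$, so any $\gamma > 2/L$ makes the iteration diverge; this is why not knowing $L$ in advance is a genuine practical problem.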

In order to have a fully self-contained proof, we will introduce some basic lemmas and definitions here. First, we recall the definition of the directional derivative.

**Definition 1.**

*(Directional derivative) Let $f:\bR^d \to \bR$ be a differentiable function at $x\in\bR^d$ and consider a direction vector $d\in\bR^d$. We define the directional derivative of $f$ in the direction $d$ as, \begin{align*} f'(x,d) = \lim_{\lambda \to 0} \frac{f(x + \lambda d) - f(x)}{\lambda}. \end{align*}*

After this definition, we also recall the following result from vector calculus.

**Lemma 1.**

*Let $f:\bR^d\to\bR$ and $f'(x,d)$ be the directional derivative of $f$ at $x$ in the direction $d$ and let $\nabla f(x)$ be the gradient of $f$. Then the following holds \begin{align*} f'(x,d) = \nabla f(x)^\top d. \end{align*}*

*Proof.* We use the first-order Taylor expansion of $f$ at $x$,
\begin{align*} {f}(x + h d) = f(x) + \nabla f(x)^\top (h d) +
o(h), \end{align*} where $h\in\bR$ and the remainder is $o(h)$ since $f$ is differentiable at $x$. Dividing by $h$, we obtain
\begin{align*} \frac{{f}(x + h d) - f(x)}{h} = \nabla f(x)^\top d +
\frac{o(h)}{h}. \end{align*} Taking $h \to 0$ concludes. $\square$
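
Lemma 1 can be checked numerically with a finite-difference approximation of the directional derivative (the test function, points, and tolerance are illustrative):

```python
import numpy as np

def f(x):
    # A smooth, non-quadratic test function.
    return np.sum(x**2) + np.sin(x[0])

def grad_f(x):
    # Gradient of f computed by hand.
    return 2.0 * x + np.array([np.cos(x[0]), 0.0])

x = np.array([0.3, -1.2])
d = np.array([1.0, 2.0])

lam = 1e-6
directional = (f(x + lam * d) - f(x)) / lam  # finite-difference estimate of f'(x, d)
inner = grad_f(x) @ d                        # Lemma 1 predicts this value
```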

This basic result establishes a connection between the directional derivative and the gradient. It helps us prove an alternative characterization of convexity, which is given as follows.

**Lemma 2.**

*If $f:\bR^d\to\bR$ is convex, then for all $x,y \in \bR^d$, we have \begin{align}\label{eq:convexity} f(x) + \langle \nabla f(x), y-x\rangle \leq f(y). \end{align}*

*Proof.*
Let $\lambda \in [0,1]$. By definition of
convexity, we write \begin{align*} f(\lambda y + (1-\lambda) x) \leq \lambda f(y) +
(1-\lambda) f(x). \end{align*} We rearrange terms and subtract $f(x)$ from
both sides and obtain \begin{align*} f(x + \lambda (y-x)) - f(x) \leq
\lambda f(y) - \lambda f(x). \end{align*} Therefore, we arrive at
\begin{align*} \frac{f(x + \lambda (y-x)) - f(x)}{\lambda} \leq f(y) - f(x).
\end{align*} Taking $\lambda \to 0$ and applying *Lemma 1* with $d =
y-x$ yields \begin{align*} \nabla f(x)^\top (y-x) \leq f(y) - f(x). \end{align*}
Rearranging gives the result. $\square$

But *what does this lemma mean?* If you look at the inequality, you can see an intuitive picture. The expression appearing on the lhs of the inequality \eqref{eq:convexity} is the linear (first-order Taylor) approximation of the function $f$ at the point $x$. So the inequality states that, given this linear approximation at $x \in \bR^d$, the function value at any $y \in \bR^d$ is at least as large as the approximation. If you imagine a convex shape, you can easily see the geometric intuition behind this (I will not draw any graphs, not yet).
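
A quick numerical sanity check of this characterization, using an arbitrary convex test function and random point pairs (all choices below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    # Convex: a sum of squares plus a log-sum-exp term.
    return np.sum(x**2) + np.log(np.sum(np.exp(x)))

def grad_f(x):
    p = np.exp(x)
    return 2.0 * x + p / np.sum(p)

# The linear approximation at x must lower-bound f at every y.
ok = True
for _ in range(100):
    x, y = rng.normal(size=3), rng.normal(size=3)
    ok = ok and (f(x) + grad_f(x) @ (y - x) <= f(y) + 1e-12)
```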

Next, we prove a more specific characterization of $L$-smooth functions. In particular, we show that if A1 is satisfied, then the following lemma holds.

**Lemma 3.**

*If*

**A1**

*is satisfied, then for any $x,y \in \bR^d$, \begin{align}\label{eq:Lsmooth} f(y) \leq f(x) + \langle \nabla f(x),y-x\rangle + \frac{L}{2} \|y-x\|_2^2. \end{align}*

*Proof.* We know from the fundamental theorem of calculus that, given a
function $h:\bR\to\bR$, we have $h(1) = h(0) + \int_0^1 h'(\tau) \md \tau$.
Now, we define $h(\tau) = f(x + \tau(y-x))$ and note that $h(1) = f(y)$ and
$h(0) = f(x)$. Using the fundamental theorem of calculus: \begin{align*}
f(y) &= f(x) + \int_0^1 \langle \nabla f(x + \tau(y-x)), y-x\rangle \md
\tau, \\ &= f(x) + \langle \nabla f(x), y-x\rangle + \int_0^1 \langle \nabla
f(x + \tau (y-x)) - \nabla f(x), y-x\rangle \md \tau. \end{align*} By the
Cauchy-Schwarz inequality, $\left|\langle u,v\rangle\right| \leq
\|u\|_2 \|v\|_2$ for all vectors $u,v \in \bR^d$; applying this to the inner product inside the integral, we
obtain \begin{align*} f(y) &\leq f(x) + \langle \nabla f(x), y-x\rangle +
\int_0^1 \|\nabla f(x + \tau (y-x)) - \nabla f(x)\|_2 \|y-x\|_2 \md \tau,\\
&\leq f(x) + \langle \nabla f(x), y-x\rangle + \int_0^1 L \|\tau (y-x)\|_2
\|y-x\|_2 \md \tau, \end{align*} which follows from the $L$-gradient Lipschitz
property of $f$. Evaluating the integral, since $\int_0^1 L \tau \md \tau = L/2$, we finally arrive at \begin{align*} f(y) \leq f(x) +
\langle \nabla f(x), y-x\rangle + \frac{L}{2} \|y -x \|_2^2. \end{align*}
$\square$

This result also gives us an intuitive picture: If a function is $L$-smooth, then at any point $y \in \bR^d$, the function is upper bounded by a quadratic at $x$ (the rhs of the expression \eqref{eq:Lsmooth}). In other words, this lemma provides a quadratic upper bound for $L$-smooth functions.
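
Lemma 3 can also be verified numerically for a function with a known constant: for the quadratic $f(x) = \frac{1}{2}x^\top A x$, the gradient $Ax$ is Lipschitz with $L$ equal to the largest eigenvalue of $A$ (the matrix and points below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
B = rng.normal(size=(3, 3))
A = B.T @ B + np.eye(3)        # symmetric positive definite
L = np.linalg.eigvalsh(A)[-1]  # largest eigenvalue = gradient-Lipschitz constant

f = lambda x: 0.5 * x @ A @ x
grad_f = lambda x: A @ x

ok = True
for _ in range(100):
    x, y = rng.normal(size=3), rng.normal(size=3)
    quadratic_upper = f(x) + grad_f(x) @ (y - x) + 0.5 * L * np.sum((y - x) ** 2)
    ok = ok and (f(y) <= quadratic_upper + 1e-10)
```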

Finally, we can prove our theorem.

*Proof of Theorem 1.* Since $f$ is $L$-gradient Lipschitz, using Lemma 3, we write \begin{align*} f(\theta_{k+1}) &\leq f(\theta_k) + \langle \nabla f(\theta_k), \theta_{k+1} - \theta_k\rangle + \frac{L}{2}\|\theta_{k+1} - \theta_k\|_2^2, \\ &= f(\theta_k) - \gamma \|\nabla f(\theta_k)\|_2^2 + \frac{L\gamma^2}{2}\|\nabla f(\theta_k)\|_2^2, \\ &= f(\theta_k) - \gamma \left(1 - \frac{L\gamma}{2}\right) \|\nabla f(\theta_k)\|_2^2, \end{align*} where we used the update rule $\theta_{k+1} - \theta_k = -\gamma \nabla f(\theta_k)$ in the second line. Since we assume that $\gamma \leq 1/L$, we have \begin{align*} -\gamma \left(1 - \frac{L\gamma}{2}\right) \leq -\frac{\gamma}{2}, \end{align*} therefore \begin{align}\label{proof:GradDesc1} f(\theta_{k+1}) \leq f(\theta_k) - \frac{\gamma}{2} \|\nabla f(\theta_k)\|_2^2. \end{align} Separately, by convexity of $f$, in particular using Lemma 2, we obtain \begin{align}\label{proof:GradDesc2} f(\theta_k) \leq f(\theta^\star) + \langle \nabla f(\theta_k), \theta_k - \theta^\star\rangle. \end{align} We also note that $\nabla f(\theta_k) = (1/\gamma) (\theta_k - \theta_{k+1})$. Now substituting \eqref{proof:GradDesc2} into \eqref{proof:GradDesc1}, we obtain \begin{align*} f(\theta_{k+1}) &\leq f(\theta^\star) + \langle \nabla f(\theta_k),\theta_k - \theta^\star\rangle - \frac{\gamma}{2} \|\nabla f(\theta_k)\|_2^2, \\ &= f(\theta^\star) + \frac{1}{\gamma} \langle \theta_k - \theta_{k+1},\theta_k - \theta^\star \rangle - \frac{1}{2\gamma} \|\theta_k - \theta_{k+1}\|_2^2. \end{align*} By adding and subtracting $\frac{1}{2\gamma}\|\theta_k - \theta^\star\|_2^2$ to complete the square, \begin{align*} f(\theta_{k+1}) &\leq f(\theta^\star) + \frac{1}{2\gamma} \|\theta_k - \theta^\star\|_2^2 - \frac{1}{2\gamma} \left( \|\theta_k - \theta^\star\|_2^2 - 2 \langle \theta_k-\theta_{k+1},\theta_k - \theta^\star \rangle + \|\theta_k - \theta_{k+1}\|_2^2\right). \end{align*} Recognizing that the term in parentheses equals $\|\theta_{k+1} - \theta^\star\|_2^2$, we arrive at \begin{align*} f(\theta_{k+1}) &\leq f(\theta^\star) + \frac{1}{2\gamma} \|\theta_k - \theta^\star\|_2^2 - \frac{1}{2\gamma} \|\theta_{k+1} - \theta^\star\|_2^2. 
\end{align*} Finally, summing both sides over $k = 0, \ldots, t-1$, which results in a telescoping sum on the rhs, we end up with \begin{align*} \sum_{k=0}^{t-1} \left(f(\theta_{k+1}) - f(\theta^\star)\right) &\leq \frac{1}{2\gamma} \left(\|\theta_0 - \theta^\star\|_2^2 - \|\theta_t - \theta^\star\|_2^2\right), \\ &\leq \frac{\|\theta_0 - \theta^\star\|_2^2}{2\gamma}. \end{align*} Since $f(\theta_0), f(\theta_1), \ldots$ is a non-increasing sequence by \eqref{proof:GradDesc1}, we have $f(\theta_t) \leq f(\theta_{k+1})$ for all $k \leq t-1$, therefore \begin{align*} t (f(\theta_t) - f(\theta^\star)) \leq \sum_{k=0}^{t-1} \left(f(\theta_{k+1}) - f(\theta^\star)\right), \end{align*} which leads to \begin{align*} f(\theta_t) - f^\star \leq \frac{\|\theta_0 - \theta^\star\|_2^2}{2\gamma t}, \end{align*} where $f^\star = f(\theta^\star)$, concluding the proof. $\square$
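
We can confirm the bound of Theorem 1 numerically on a quadratic with a unique minimizer at the origin, so that $\theta^\star = 0$ and $f^\star = 0$ (the setup is an illustrative choice):

```python
import numpy as np

rng = np.random.default_rng(2)
B = rng.normal(size=(4, 4))
A = B.T @ B + np.eye(4)        # positive definite, so theta* = 0 and f* = 0
L = np.linalg.eigvalsh(A)[-1]  # gradient-Lipschitz constant of f
gamma = 1.0 / L                # step size satisfying gamma <= 1/L

f = lambda th: 0.5 * th @ A @ th
theta0 = rng.normal(size=4)
theta = theta0.copy()

bound_holds = True
for t in range(1, 201):
    theta = theta - gamma * (A @ theta)            # gradient step
    bound = np.sum(theta0**2) / (2.0 * gamma * t)  # rhs of Theorem 1
    bound_holds = bound_holds and (f(theta) <= bound + 1e-12)
```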

We finish this post with one final note: Eq. \eqref{proof:GradDesc1} is a *descent lemma*. In other words, once you assume that your function is $L$-smooth, you can guarantee that each gradient step does not increase the function value, provided that $\gamma \leq 1/L$.
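
The descent property is easy to observe directly: with $\gamma \leq 1/L$, the function value never increases from one iterate to the next (the test function and step size below are illustrative):

```python
import numpy as np

# f(theta) = ||theta||^2 has gradient 2*theta and gradient-Lipschitz constant L = 2.
gamma = 0.3  # satisfies gamma <= 1/L = 0.5
theta = np.array([3.0, -1.5, 0.5])

values = [float(np.sum(theta**2))]
for _ in range(20):
    theta = theta - gamma * (2.0 * theta)
    values.append(float(np.sum(theta**2)))

monotone = all(values[k + 1] <= values[k] + 1e-12 for k in range(len(values) - 1))
```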

