almost stochastic: Convergence rate of gradient descent for convex functions

Suppose, given a convex function $f: \bR^d \to \bR$, we would like to find the minimum of $f$ by iterating \begin{align*} \theta_t = \theta_{t-1} - \gamma \nabla f(\theta_{t-1}). \end{align*} How fast do we converge to the minima of $f$?

In this post, I will prove that this convergence rate is $\mathcal{O}(1/t)$ and discuss all relevant material for this proof.

As usual, proving a result like this requires more regularity conditions on the function. In particular, we assume the following so called $L$-smoothness assumption on $f$.

\begin{assump}\label{assump:convex-lip} Assume $f(\theta)$ is convex and differentiable and $L$-gradient Lipschitz, i.e., \begin{align*} \|\nabla f(\theta) - \nabla f(\theta')\|_2 \leq L \|\theta - \theta'\|_2 \quad \quad \text{for any} \quad \theta,\theta' \in \bR^d. \end{align*} \end{assump}

Given this assumption, in this post, we will give a proof of the following theorem which provides the convergence rate of gradient descent for convex functions.

\begin{thm}\label{thm:theorem} Under Assumption \ref{assump:convex-lip}, let $f^\star = f(\theta^\star)$ be the optimal value of the function $f(\theta)$. Then, given that $\gamma \leq 1/L$, we have \begin{align*} f(\theta_t) - f^\star \leq \frac{\|\theta_0 - \theta^\star\|_2^2}{2\gamma t}. \end{align*} \end{thm}

Before proceeding to the proof, we give a couple of remarks about this theorem. First, obviously, it demonstrates $\mathcal{O}(1/t)$ convergence rate. Secondly, for this to be guaranteed, one needs to ensure $\gamma \leq 1/L$. This is often the impractical bit about this type of results as we would not know the constant $L$ beforehand about the function. This issue is often much worse for harder problems. Also, this may suggest very small step-sizes in practice, which is a condition that may be violated in more complicated scenarios.

In order to have a fully self-contained proof, we will introduce some basic lemmas and definitions here. First, we recall the definition of the directional derivative.

\begin{defn}\label{defn:directional-derivative} (Directional derivative) Let $f:\bR^d \to \bR$ be a differentiable function at $x\in\bR^d$ and consider a direction vector $d\in\bR^d$. We define the directional derivative of $f$ in the direction $d$ as, \begin{align*} f'(x,d) = \lim_{\lambda \to 0} \frac{f(x + \lambda d) - f(x)}{\lambda}. \end{align*} \end{defn}

After this definition, we also recall the following result from vector calculus.

\begin{lem}\label{lem:directional-derivative} Let $f:\bR^d\to\bR$ and $f'(x,d)$ be the directional derivative of $f$ at $x$ in the direction $d$ and let $\nabla f(x)$ be the gradient of $f$. Then the following holds \begin{align*} f'(x,d) = \nabla f(x)^\top d. \end{align*} \end{lem} \begin{proof} We use a first-order Taylor approximation of $f$ at $x$, \begin{align*} {f}(x + h d) = f(x) + \nabla f(x)^\top (h d) + \mathcal{O}(\|h d\|_2^2), \end{align*} where $h\in\bR$ and obtain \begin{align*} \frac{{f}(x + h d) - f(x)}{h} = \nabla f(x)^\top d + \mathcal{O}(h \|d\|_2^2). \end{align*} Taking $h \to 0$ concludes. \end{proof}

This basic result establishes a connection between the directional derivative and the gradient itself. This helps us proving an alternative characterization of convexity, which is given as follows.

\begin{lem}\label{lem:convexity-def} If $f:\bR^d\to\bR$ is convex, then for all $x,y \in \bR^d$, we have \begin{align}\label{eq:convexity} f(x) + \langle \nabla f(x), y-x\rangle \leq f(y). \end{align} \end{lem} \begin{proof} Let $\lambda \in [0,1]$. By definition of convexity, we write \begin{align*} f(\lambda y + (1-\lambda) x) \leq \lambda f(y) + (1-\lambda) f(x). \end{align*} We rearrange terms and subtract $f(x)$ from both sides and obtain \begin{align*} f(x + \lambda (y-x)) - f(x) \leq \lambda f(y) - \lambda f(x). \end{align*} Therefore, we arrive at \begin{align*} \frac{f(x + \lambda (y-x)) - f(x)}{\lambda} \leq f(y) - f(x). \end{align*} Now we take $\lambda \to 0$ and using Lemma 1 for $d = y-x$, \begin{align*} \nabla f(x)^\top (y-x) \leq f(y) - f(x). \end{align*} Rearranging gives the result. \end{proof}

But what does this lemma mean? If you look at the inequality, you can indeed see an intuitive picture. Note that the expression appears in the lhs of the inequality \eqref{eq:convexity} is the linear (first order Taylor) approximation of the function $f$ at point $x$. So the inequality states that, given an approximation at the point $x \in \bR^d$, the function value at any $y \in \bR^d$ is simply bigger than this linear approximation. If you imagine a convex shape, you can easily see the geometric intuition behind this (I will not draw any graphs, not yet).

Next, we prove a more specific characterisation of $L$-smooth functions. In particular, we show that, if Assumption \ref{assump:convex-lip} is satisfied, then the following lemma holds.

\begin{lem}\label{lem:upper-bound} If Assumption \ref{assump:convex-lip} is satisfed, then for any $x,y \in \bR^d$, \begin{align}\label{eq:Lsmooth} f(y) \leq f(x) + \langle \nabla f(x),y-x\rangle + \frac{L}{2} \|y-x\|_2^2. \end{align} \end{lem} \begin{proof} We know from the fundamental theorem of calculus that, given a function $h:\bR\to\bR$, we have $h(1) = h(0) + \int_0^1 h'(\tau) \md \tau$. Now, we define $h(\tau) = f(x + \tau(y-x))$ and note that $h(1) = f(y)$ and $h(0) = f(x)$. Using the fundamental theorem of calculus: \begin{align*} f(y) &= f(x) + \int_0^1 \langle \nabla f(x + \tau(y-x)), y-x\rangle \md \tau, \\ &= f(x) + \langle \nabla f(x), y-x\rangle + \int_0^1 \langle \nabla f(x + \tau (y-x)) - \nabla f(x), y-x\rangle \md \tau. \end{align*} For all vectors $u,v \in \bR^d$, we have $\left|\langle u,v\rangle\right| \leq \|u\|_2 \|v\|_2$, applying it to the inner product inside the integral, we obtain \begin{align*} f(y) &\leq f(x) + \langle \nabla f(x), y-x\rangle + \int_0^1 \|\nabla f(x + \tau (y-x)) - \nabla f(x)\|_2 \|y-x\|_2 \md \tau,\\ &\leq f(x) + \langle \nabla f(x), y-x\rangle + \int_0^1 L \|\tau (y-x)\|_2 \|y-x\|_2 \md \tau, \end{align*} which follows from $L$-gradient Lipschitz property of $f$. Then we finally arrive at \begin{align*} f(y) \leq f(x) + \langle \nabla f(x), y-x\rangle + \frac{L}{2} \|y -x \|_2^2. \end{align*} \end{proof}

This result also gives us an intuitive picture: If a function is $L$-smooth, then at any point $y \in \bR^d$, the function is upper bounded by a quadratic at $x$ (the rhs of the expression \eqref{eq:Lsmooth}). In other words, this lemma provides a quadratic upper bound for $L$-smooth functions.

Finally, we can prove our theorem.

\begin{proof} (Theorem 1) Since $f$ is $L$-gradient Lipschitz, using Lemma 3, we write \begin{align*} f(\theta_{k+1}) &\leq f(\theta_k) + \langle \nabla f(\theta_k), \theta_{k+1} - \theta_k\rangle + \frac{L}{2}\|\theta_{k+1} - \theta_k\|_2^2, \\ &= f(\theta_k) - \gamma \|\nabla f(\theta_k)\|_2^2 + \frac{L\gamma^2}{2}\|\nabla f(\theta_k)\|_2^2, \\ &= f(\theta_k) - \gamma \left(1 - \frac{L\gamma}{2}\right) \|\nabla f(\theta_k)\|_2^2. \end{align*} Since we assume that $\gamma \leq 1/L$, we have \begin{align*} -\gamma \left(1 - \frac{L\gamma}{2}\right) \leq -\frac{\gamma}{2}, \end{align*} therefore \begin{align}\label{proof:GradDesc1} f(\theta_{k+1}) \leq f(\theta_k) - \frac{\gamma}{2} \|\nabla f(\theta_k)\|_2^2. \end{align} Separately, by convexity of $f$, in particular using Lemma 2, we obtain \begin{align}\label{proof:GradDesc2} f(\theta_k) \leq f(\theta^\star) + \langle \nabla f(\theta_k), \theta_k - \theta^\star\rangle. \end{align} We also note that $\nabla f(\theta_k) = (1/\gamma) (\theta_k - \theta_{k+1})$. Now substituting \eqref{proof:GradDesc2} into \eqref{proof:GradDesc1}, we obtain \begin{align*} f(\theta_{k+1}) &\leq f(\theta^\star) + \langle \nabla f(\theta_k),\theta_k - \theta^\star\rangle - \frac{\gamma}{2} \|\nabla f(\theta_k)\|_2^2, \\ &= f(\theta^\star) + \frac{1}{\gamma} \langle \theta_k - \theta_{k+1},\theta_k - \theta^\star \rangle - \frac{1}{2\gamma} \|\theta_k - \theta_{k+1}\|_2^2. \end{align*} By completing the square in the last term \begin{align*} f(\theta_{k+1}) &\leq f(\theta^\star) + \frac{1}{2\gamma} \|\theta_k - \theta^\star\|_2^2 - \frac{1}{2\gamma} \left( \|\theta_k - \theta^\star\|_2^2 - 2 \langle \theta_k-\theta_{k+1},\theta_k - \theta^\star \rangle + \|\theta_k - \theta_{k+1}\|_2^2\right). \end{align*} Recognizing that the last term above is a square, we arrive at \begin{align*} f(\theta_{k+1}) &\leq f(\theta^\star) + \frac{1}{2\gamma} \|\theta_k - \theta^\star\|_2^2 - \frac{1}{2\gamma} \|\theta_{k+1} - \theta^\star\|_2^2. \end{align*} Finally, summing both sides which results in a telescoping sum in the rhs, we end up with \begin{align*} \sum_{k=0}^{t-1} \left(f(\theta_{k+1}) - f(\theta^\star)\right) &\leq \frac{1}{2\gamma} \left(\|\theta_0 - \theta^\star\|_2^2 - \|\theta_t - \theta^\star\|_2^2\right), \\ &\leq \frac{\|\theta_0 - \theta^\star\|_2^2}{2\gamma}. \end{align*} Since $f(\theta_0),\ldots,f(\theta_k)$ is a non-increasing sequence, we have $f(\theta_k) \leq f(\theta_t)$ for all $k \leq t$, therefore \begin{align*} t (f(\theta_t) - f(\theta^\star)) \leq \sum_{k=0}^{t-1} \left(f(\theta_{k+1}) - f(\theta^\star)\right), \end{align*} which leads to \begin{align*} f(\theta_t) - f^\star \leq \frac{\|\theta_0 - \theta^\star\|_2^2}{2\gamma t}, \end{align*} where $f^\star = f(\theta^\star)$, hence concluding the proof. \end{proof}

We finish this post with one final note: Note that Eq. \eqref{proof:GradDesc1} achieves a descent lemma. In other words, once you assume that your function is $L$-smooth, you can guarantee that the gradient step results in the decrease of the function value provided that $\gamma \leq 1/L$.

References

Convergence Theorems for Gradient Descent, Robert M. Gower.
10-725: Optimization Fall 2013, Lecture 6: September 12, Lecturer: Ryan Tibshirani.

almost stochastic

2020/11/11

Convergence rate of gradient descent for convex functions

0 comments:

Post a Comment

search

mathematics village

archive

band pass filter

links