2013/05/23

Stochastic gradient descent

In this post, I introduce stochastic gradient descent (SGD), a widely used stochastic optimisation technique. I also implement the algorithm for the linear regression problem and provide the Matlab code.

Stochastic gradient descent is an online optimisation method widely used when the exact gradient is prohibitively expensive to compute. A typical example arises in machine learning, where we often want to minimise cost functions involving a large number of samples (data). These cost functions are formed as the sum of individual cost functions (I will make this clearer in the next paragraph), so the gradient is the sum of individual gradients. If you have a lot of samples, this means you have to compute a huge sum for every single step of your gradient descent algorithm, which makes the algorithm very slow, or even infeasible to run. However, there is an alternative: instead of computing the exact gradient, you can compute an unbiased estimate of it. The classical tradeoff between computational and statistical efficiency works in our favour here: you can use a random subset of samples to obtain an unbiased gradient estimate. This gradient will be noisy and will not point in the true descent direction all the time, but it will do so in expectation. So you can obtain a great computational advantage at a relatively small cost in statistical accuracy.

We consider solving problems of the following type, \begin{align*} \min_{\theta \in \bR^d} f(\theta) := \frac{1}{n}\sum_{k=1}^n f_k(\theta) \end{align*} where $n$ is very large. To make this concrete: you can write linear/nonlinear regression problems in this form, and you can write neural network training and many other machine learning problems in this form, too. Starting from an initial parameter estimate $\theta_0$, gradient descent aims to minimise the cost by implementing the following iteration, \begin{align*} \theta_t = \theta_{t-1} - \eta \nabla_\theta f(\theta_{t-1}). \end{align*} Since $f$ is defined as the sum of a very large number of functions (each parameterised by an individual sample from our dataset), you can easily see the problem: for each $\theta_{t-1}$, you need to sum over a great number of gradients. SGD-type methods avoid this problem by implementing iterations of the form \begin{align*} \theta_t = \theta_{t-1} - \eta_t g_t(\theta_{t-1}), \end{align*} where $\bE[g_t(\theta)] = \nabla_\theta f(\theta)$ and $\eta_t$ is a step-size sequence satisfying $\sum_t \eta_t = \infty$ and $\sum_t \eta_t^2 < \infty$. This is the general way to write the algorithm. Of course, the simplest choice is \begin{align*} \theta_t = \theta_{t-1} - \eta_t \nabla_\theta f_{i_t}(\theta_{t-1}), \end{align*} where $i_t$ is drawn uniformly at random from the index set $[n] = \{1,\ldots,n\}$. Since we draw the index uniformly at random, the expectation of this gradient estimate is exactly the full gradient. Using a single sample results in a high-variance estimate, so in practice people usually use a minibatch of samples to reduce the variance. There is a huge literature on variance reduction algorithms, which I will not even try to address here.
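To make the generic update concrete, here is a minimal Matlab sketch of a single minibatch SGD step for the problem above (save it as sgd_step.m). The names sgd_step and grad_fk are illustrative choices here, not standard functions: grad_fk(theta, k) is assumed to return $\nabla_\theta f_k(\theta)$ for a single index $k$.

% One minibatch SGD step (sketch). grad_fk(theta, k) is assumed to return
% the gradient of the individual cost f_k at theta; n is the number of
% samples, m is the minibatch size and eta is the step size.
function theta = sgd_step(theta, grad_fk, n, m, eta)
    idx = randi(n, m, 1); % draw m indices uniformly at random (with replacement)
    g = zeros(size(theta)); % accumulate the minibatch gradient
    for j = 1:m
        g = g + grad_fk(theta, idx(j));
    end
    g = g / m; % average: an unbiased estimate of the full gradient
    theta = theta - eta * g; % stochastic gradient step
end

With $m = 1$ this reduces to the single-sample update above; a larger $m$ costs more computation per step but reduces the variance of the gradient estimate.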

In my previous post, I introduced the algorithm for maximum-likelihood (ML) estimation. It is fairly well known that you can use SGD for ML estimation by minimising the averaged negative log-likelihood. This time I will not go into any probabilistic interpretation but just sketch how you can apply SGD to linear regression.

Linear Regression.

Let's try to solve a large-scale linear regression problem with SGD. Assume we have lots of scalar observations (outputs) $(y_k)_{k=1}^n$ and corresponding inputs (covariates) $(x_k)_{k=1}^n$ where $x_k \in \bR^d$.

We would like to fit a linear model to the input-output relationship: $Y \approx X^\top \theta$ where $X$ is the input and $Y$ is the output. The usual approach in machine learning is to formulate this as an empirical risk minimisation problem. In this framework, you assume $(X,Y) \sim \bP$ where $\bP$ is an unknown probability measure. Ideally, you would like to minimise the risk, \begin{align*} \min_\theta \bE_\bP\left[(Y - X^\top \theta)^2\right] \end{align*} where the expectation is taken with respect to the unknown probability measure and $(X,Y) \sim \bP$ are random variables. Since we do not know $\bP$ but instead have samples $(X_{1:n} = x_{1:n}, Y_{1:n} = y_{1:n})$ from it, we can instead minimise the empirical risk (the Monte Carlo estimate of the expectation, if you like): \begin{align*} \min_\theta \frac{1}{n}\sum_{k = 1}^n (y_k - x_k^\top \theta)^2. \end{align*} To be clear, you can arrive at the same optimisation problem by assuming that $y_k$ is generated as $x_k^\top \theta$ plus Gaussian noise and deriving the maximum likelihood estimate of $\theta$ (being Bayesian but without priors). The factor $\frac{1}{n}$ will be missing, but it can be absorbed into the step size of the algorithm; for finite $n$ it is a constant, so it does not change the argument.

Here $f_k(\theta) = (y_k - x_k^\top \theta)^2$, whose gradient is $\nabla_\theta f_k(\theta) = -2 x_k (y_k - x_k^\top \theta)$. So if you implement stochastic gradient descent, processing one sample per iteration, you will have \begin{align*} \theta_k &= \theta_{k-1} - \eta_k \nabla_\theta f_k(\theta_{k-1}) \\ &= \theta_{k-1} + \eta_k x_k (y_k - x_k^\top \theta_{k-1}). \end{align*} That's it. There was a factor of 2, but it can be embedded into the step size. Of course, there is the usual assumption on the step size: you need \begin{align*} \sum_k \eta_k = \infty \,\,\,\,\,\,\, \text{and} \,\,\,\,\,\,\, \sum_k \eta_k^2 < \infty. \end{align*} What is done in practice is to parameterise the entire sequence $(\eta_k)_{k=1}^\infty$ by two parameters. Basically, you choose \begin{align*} \eta_k = \frac{\alpha}{k^\beta} \end{align*} with $\alpha > 0$ and $0.5 < \beta < 1$. A typical choice is $\beta = 0.51$, so that you only have to tune $\alpha$. Here is a simple Matlab code to generate data and perform the regression:

% DATA GENERATION
N = 1000; % number of samples (a small dataset)
dx = 3; % dimension of the parameter and the inputs
th_tr = [1,2,3]'; % the true parameter
X = rand(dx,N); % arbitrary input data generation
sig = 0.5; % noise variance
Y = X' * th_tr + sqrt(sig) * randn(N,1); % output data generation
%

% ALGORITHM INITIALISATION
T = 2*N; % total number of SGD iterations (two passes worth of samples)
thinit = [0,0,0]'; % initial parameter estimate
a = 0.7; % step-size scale (the alpha above)
b = 0.51; % step-size decay (the beta above)
gam = zeros(1,T); th = zeros(dx,T); err = zeros(1,T); % preallocation
%

% The first iteration (done separately since th(:,0) does not exist in Matlab)
t = 1;
ind = randi(N); % draw a sample index uniformly at random
gam(t) = a / (t^b); % step size: eta_t = alpha / t^beta
th(:,t) = thinit + gam(t) * X(:,ind) * (Y(ind) - X(:,ind)' * thinit);
err(t) = sqrt((1/N) * norm(Y - X' * th(:,t))^2); % RMSE over the full dataset
% Note that computing the error is not necessary, and it is not done in
% practice since datasets are massive (note that this line requires all of
% the data). For this small example, however, you can plot it.
%

% The remaining iterations
for t = 2:T
    ind = randi(N); % draw a sample index uniformly at random
    gam(t) = a / (t^b); % step size
    th(:,t) = th(:,t-1) + gam(t) * X(:,ind) * (Y(ind) - X(:,ind)' * th(:,t-1));
    err(t) = sqrt((1/N) * norm(Y - X' * th(:,t))^2); % RMSE (for plotting only)
end
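
Since we stored the RMSE at every iteration, a quick way to inspect convergence is the following short snippet (it only uses the variables defined above):

% Visualise convergence (optional; uses the stored err and th from above)
figure;
plot(1:T, err); % RMSE vs. iteration
xlabel('iteration'); ylabel('RMSE');
disp([th(:,T), th_tr]); % final estimate next to the true parameter

You should see the error drop quickly during the first pass over the data and then flatten out, which is the typical behaviour of SGD with a decaying step size.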

Last updated: 17/01/2017
