## 2013/05/23

In this post, I introduce the widely used stochastic optimization technique, namely the stochastic gradient descent. I also implement the algorithm for the linear-regression problem and provide the Matlab code.

We consider solving the following type of problems, \begin{align*} \min_{\theta \in \bR^d} f(\theta) := \frac{1}{n}\sum_{k=1}^n f_k(\theta) \end{align*} where $n$ is very large. To make it a bit concrete, you can write linear/nonlinear regression problems in this form. You can also write neural network training and lots of machine learning problems in this form, too. Starting from an initial parameter estimate $\theta_0$, gradient descent aims to minimise the cost by implementing the following iteration, \begin{align*} \theta_t = \theta_{t-1} - \eta \nabla_\theta f(\theta_{t-1}). \end{align*} Since $f$ is defined as the sum of very large number of functions (each function is parameterised by an individual sample from our dataset), you can easily see the problem: For each $\theta_{t-1}$, you need to sum over a great number of functions. SGD type methods implement the following type of iteration in order to get rid of this problem, \begin{align*} \theta_t = \theta_{t-1} - \eta_t g_t(\theta_{t-1}). \end{align*} where $\bE[g_t(\theta)] = \nabla_\theta f(\theta)$ and $\eta_t$ is the step-size satisfying $\sum_t \eta_t = \infty$ and $\sum_t \eta_t^2 < \infty$. This is the general way to write the algorithm. Of course the simplest thing to do is, \begin{align*} \theta_t = \theta_{t-1} - \eta_t \nabla_\theta f_{i_t}(\theta) \end{align*} where $i_t$ is drawn uniformly random from the index set $[n] = \{1,\ldots,n\}$. Since we draw the index uniformly random, the expectation of the gradient we used here is the same with the exact gradient. Using a single sample would result in a high-variance estimate of the gradient so people usually use a minibatch of samples in order to reduce the variance. There is a huge literature on variance reduction algorithms, I will not even try to address that.

In my previous post, I introduced the algorithm for maximum-likelihood (ML) estimation. It is fairly well known that you can use SGD for ML estimation by minimising the averaged negative log-likelihood. This time I will not go into any probabilistic interpretation but just sketch how you can apply SGD to linear regression.

Linear Regression.

Let's try to solve the large-scale linear regression problem with SGD. Assume we have lots of unidimensional observations (outputs) $(y_k)_{k=1}^n$ and coefficients $(x_k)_{k=1}^n$ (inputs) where $x_k \in \bR^d$.

We would like to fit a linear model to model input-output relationship: $Y \approx X^\top \theta$ where $X$ is the input and $Y$ is the output. In general, the usual approach in machine learning is to formulate the problem as an empirical risk minimisation problem. In this framework, you assume $(X,Y) \sim \bP$ where $\bP$ is an unknown probability measure. Ideally, you would like to minimise (the risk), \begin{align*} \min_\theta \bE_\bP\left[(Y - X^\top \theta)^2\right] \end{align*} where the expectation is taken with respect to the unknown probability measure and $(X,Y) \sim \bP$ are random variables. Since we don't know $\bP$ but instead have samples $(X_{1:n} = x_{1:n},Y_{1:n} = y_{1:n})$ from it, we can instead minimise the empirical risk (the Monte Carlo estimation of the expectation, if you like): \begin{align*} \min_\theta \frac{1}{n}\sum_{k = 1}^n (y_k - x_k^\top \theta)^2. \end{align*} To be clear, you can arrive to the same optimisation problem by assuming $y_k$ is generated via $x_k^\top \theta$ plus Gaussian noise and deriving the maximum likelihood estimate of $\theta$ (being Bayesian but without priors). The factor $\frac{1}{n}$ will be missing but it can be put into the step size of the algorithm. For finite $n$, it is a constant so it doesn't change the argument. Here $f_k(\theta) = (y_k - x_k^\top \theta)^2$, so if you implement stochastic gradient descent, you will have, \begin{align*} \theta_k &= \theta_{k-1} - \eta_k \nabla_\theta f_k(\theta_{k-1}) \\ &= \theta_{k-1} + \eta_k x_k (y_k - x_k^\top \theta_{k-1}). \end{align*} That's it. Actually there was a 2-factor but it can be embedded into the step-size. Of course, there is a usual assumption on the step size, you need, \begin{align*} \sum_k \eta_k = \infty \,\,\,\,\,\,\, \text{and} \,\,\,\,\,\,\, \sum_k \eta_k^2 < \infty \end{align*} What's done in practice is to parameterise the entire $(\eta_k)_{k=1}^\infty$ sequence by two parameters. Basically, you choose, \begin{align*} \eta_k = \frac{\alpha}{k^\beta} \end{align*} with $\alpha > 0$ and $0.5 < \beta < 1$. A typical choice is $\beta = 0.51$ and then you can tune only $\alpha$. Here is a simple Matlab code to generate data and perform regression:

% DATA GENERATION
N = 1000; % small dataset
dx = 3; % dimension of the parameter and inputs
th_tr = [1,2,3]'; % the real parameter
X = rand(dx,N); % arbitrary input data generation
sig = 0.5; % noise variance
Y = X' * th_tr + sqrt(sig) * randn(N,1); % output data generation
%

% ALGORITHM INITIALISATION
thinit = [0,0,0]';
a = 0.7;
b = 0.51;
%

% The first iteration
t = 1;
ind = randi(N);
gam(t) = a ./(t^b);
th(:,t) = thinit + gam(t) * X(:,ind) * (Y(ind) - X(:,ind)' * thinit);
err(t) = sqrt((1/N) * norm(Y - X' * th(:,t))^2);
% Note that computing the error is not necessary and it is not done in
% practice since datasets are massive (Look at the code line and see that
% all data is required to compute the error). For this example, you can
% plot it.
%

% For loop.
for t = 2:2*N
ind = randi(N);
gam(t) = a ./(t^b);
th(:,t) = th(:,t-1) + gam(t) * X(:,ind) * (Y(ind) - X(:,ind)' * th(:,t-1));
err(t) = sqrt((1/N) * norm(Y - X' * th(:,t))^2);

end

Last updated: 17/01/2017