## Introduction

In this post, we review the online maximum-likelihood parameter estimation for GARCH model which is a dynamic variance model. GARCH can be seen as a toy volatility model and used as a textbook example for financial time series modelling.

## A Simple Dynamic variance Structure

Many time series in real world exhibit a dynamic variance structure. It means that, the variance of the sequence of random variables is not static over time. As a simple example, consider the following model, \begin{align*} x_t &\sim \NPDF(0,\sigma_t^2) \\ \sigma_t^2 &= c + \sum_{j=1}^q \alpha_j x^2_{t-j} + \sum_{i = 1}^p \beta_i \sigma^2_{t-i} \end{align*} This model is known as Generalized Autoregressive Conditional Heteroscedasticity (GARCH) model. It is denoted with GARCH(p,q). Multivariate versions of the GARCH can also be constructed which I hope to do later. In this post, I aim to derive a ML estimation procedure for this type of model. For further reducing the complexity, we'll use GARCH(1,1). So, the model becomes, \begin{align*} x_t &\sim \NPDF(0,\sigma_t^2) \\ \sigma_t^2 &= c + \alpha x^2_{t-1} + \beta \sigma^2_{t-1} \end{align*} This model is significantly simpler than the previous one. Furthermore, in order to reduce overfitting problems, it is customary to select the model as such.

## Likelihood of the GARCH model

We can form the log-likelihood function as follows. \begin{align*} \mathcal{L}(\theta) = \sum_{t=1}^T \log p(x_t | \sigma_t; \theta) \end{align*} where $\theta = \{c,\alpha,\beta\}$. So, we know, \begin{align*} p(x_t | \sigma_t; \theta) = \frac{1}{\sqrt{2\pi \sigma_t^2}} \exp \left(-\frac{x_t^2}{2 \sigma_t^2}\right) \end{align*} Therefore, \begin{align*} \mathcal{L}(\theta) = \sum_{t=1}^T -\log \sqrt{2\pi} - \frac{x_t^2}{2 \sigma_t^2} - \frac{1}{2} \log \sigma_t^2 \end{align*} There are two constraints for stability (1) $\alpha + \beta \leq 1$, (2) $\alpha > 0, \beta > 0$. In this post, we try to optimize this function using different techniques. First we give the gradient descent algorithm and second we will consider a stochastic (online) version of this algorithm. To overcome the constraints in the optimization problem, we use projected optimization algorithms which project the candidate solution to the feasible set of solutions at each step.

## Estimating $c$, $\alpha$ and $\beta$

It is customary to do the optimization in a coordinate-descent setting, i.e., separately wrt $c$,$\alpha$ and $\beta$. Suppose, we do not know only $c$, i.e., we fix $\alpha$ and $\beta$. Also, for the moment, we also suppose we know all $\sigma_t$'s (Later we, of course, will drop this assumption). The cost is, \begin{align*} \mathcal{L}(\theta) = \sum_{t=1}^T -\log \sqrt{2\pi} - \frac{x_t^2}{2 c + 2 \alpha x_{t-1}^2 + 2 \beta \sigma_{t-1}^2} - \frac{1}{2} \log \left( c + \alpha x_{t-1}^2 + \beta \sigma_{t-1}^2 \right) \end{align*} So, we write, \begin{align*} \frac{\partial \mathcal{L}(\theta)}{\partial c} &= \sum_{t=1}^T \frac{2 x_t^2}{(2 c + 2 \alpha x_{t-1}^2 + 2 \beta \sigma_{t-1}^2)^2} - \frac{1}{2 \left( c + \alpha x_{t-1}^2 + \beta \sigma_{t-1}^2 \right)} \end{align*} We can use gradient descent by using the update procedure, \begin{align*} \left. c^{n+1} \leftarrow c^n + \eta \frac{\partial \mathcal{L}(\theta)}{\partial c} \right\rvert_{c = c^n} \end{align*} Similarly, let us compute the derivatives wrt $\alpha$ and $\beta$, \begin{align*} \frac{\partial \mathcal{L}(\theta)}{\partial \alpha} &= \sum_{t=1}^T \frac{2 x_t^2 x_{t-1}^2}{(2 c + 2 \alpha x_{t-1}^2 + 2 \beta \sigma_{t-1}^2)^2} - \frac{x_{t-1}^2}{2 \left( c + \alpha x_{t-1}^2 + \beta \sigma_{t-1}^2 \right)} \\ \frac{\partial \mathcal{L}(\theta)}{\partial \beta} &= \sum_{t=1}^T \frac{2 x_t^2 \sigma_{t-1}^2}{(2 c + 2 \alpha x_{t-1}^2 + 2 \beta \sigma_{t-1}^2)^2} - \frac{\sigma_{t-1}^2}{2 \left( c + \alpha x_{t-1}^2 + \beta \sigma_{t-1}^2 \right)} \end{align*} and the update rules are, \begin{align*} \alpha^{n+1} &\leftarrow \alpha^n + \eta \left. \frac{\partial \mathcal{L}(\theta)}{\partial \alpha} \right\rvert_{\alpha = \alpha^n} \\ \beta^{n+1} &\leftarrow \beta^n + \eta \left. \frac{\partial \mathcal{L}(\theta)}{\partial \beta} \right\rvert_{\beta = \beta^n} \end{align*} So far, so good. However, the main drawback of the above algorithm is that, it assumes the $\sigma_t$'s are known. If we drop this assumption, we have to recursively estimate the $\sigma_t$. In order to do so, we can employ the following strategy: For each updated $c^n, \alpha^n$ and $\beta^n$, one can compute, at least, an estimate of the $\sigma_t$'s as, \begin{align*} \hat{\sigma}_t^2 &= c^n + \alpha^n x^2_{t-1} + \beta^n \hat{\sigma}^2_{t-1} \end{align*} for each $t$. So this is a batch algorithm, that is, in every big gradient step, we also compute each $\hat{\sigma}_{1:T}$ with new parameters. This is a big bottleneck. However, it turns out that, there is an online version of this algorithm which numerically works well!

## Online estimation of $c$, $\alpha$ and $\beta$

If one has a large dataset, the described algorithm is impractical. Because, at each iteration, besides the sum over whole dataset and covariances, one has to run over $T$ inner iterations to construct new estimate of $\hat{\sigma}_{1:T}$ where $T$ is the size of the data. Instead, one can use a Robbins-Monro type algorithm where, at each step, only one sample can be used to update parameters. Hence, at each step, only $\hat{\sigma}_t$ will be estimated. This makes this algorithm perfectly fast.

Let us consider the notation, \begin{align*} \mathcal{L}_t(\theta) = -\log \sqrt{2\pi} - \frac{x_t^2}{2 c + 2 \alpha x_{t-1}^2 + 2 \beta \sigma_{t-1}^2} - \frac{1}{2} \log \left( c + \alpha x_{t-1}^2 + \beta \sigma_{t-1}^2 \right) \end{align*} where $\mathcal{L}_t(\theta)$ is the cost for only one sample given the previous sample. Therefore, our new algorithm is the following, \begin{align*} c^{t+1} &\leftarrow c^t + \eta \left. \frac{\partial \mathcal{L}_t(\theta)}{\partial c} \right\rvert_{c = c^t, \alpha = \alpha^t,\beta = \beta^t} \\ \alpha^{t+1} &\leftarrow \alpha^t + \eta \left. \frac{\partial \mathcal{L}_t(\theta)}{\partial \alpha} \right\rvert_{c = c^{t+1}, \alpha = \alpha^t,\beta = \beta^t} \\ \beta^{t+1} &\leftarrow \beta^t + \eta \left. \frac{\partial \mathcal{L}_t(\theta)}{\partial \beta} \right\rvert_{c = c^{t+1}, \alpha = \alpha^{t+1},\beta = \beta^t} \\ \hat{\sigma}_{t+1}^2 &= c^{t+1} + \alpha^{t+1} x^2_{t} + \beta^{t+1} \hat{\sigma}^2_{t} \end{align*} This algorithm is completely online and scales arbitrarily large datasets. One important thing is annealing of step sizes $\eta$ where usual conditions for stochastic gradient descent must hold. However there is a more serious problem.

## A projected gradient descent algorithm

Above gradient ascent updates are for unconstrained optimization problems. However, in the GARCH model, the following constrains have to be considered, \begin{align*} c,\alpha,\beta &\geq 0 \\ \alpha + \beta &\leq 1 \end{align*} Therefore, the feasible set is actually a triangle between $\alpha$-axis, $\beta$-axis $\alpha + \beta = 1$ in the space of $(\alpha,\beta)$ pairs. In a three-dimensional parameter space of $(\alpha,\beta,c)$, the feasible set is a triangular polyhedra. Let us denote this polyhedra with $C$. At each step, we project the candidate solution to this feasible set of solutions. Let us denote this projection operator as $\Pi_C$. Therefore, we use the following algorithm. \begin{align*} c^{t+1} &\leftarrow \Pi_C \left[ c^t + \eta \left. \frac{\partial \mathcal{L}_t(\theta)}{\partial c} \right\rvert_{c = c^t, \alpha = \alpha^t,\beta = \beta^t} \right] \\ \alpha^{t+1} &\leftarrow \Pi_C \left[\alpha^t + \eta \left. \frac{\partial \mathcal{L}_t(\theta)}{\partial \alpha} \right\rvert_{c = c^{t+1}, \alpha = \alpha^t,\beta = \beta^t} \right]\\ \beta^{t+1} &\leftarrow \Pi_C \left[\beta^t + \eta \left. \frac{\partial \mathcal{L}_t(\theta)}{\partial \beta} \right\rvert_{c = c^{t+1}, \alpha = \alpha^{t+1},\beta = \beta^t} \right]\\ \hat{\sigma}_{t+1}^2 &= c^{t+1} + \alpha^{t+1} x^2_{t} + \beta^{t+1} \hat{\sigma}^2_{t} \end{align*} Implementing $\Pi_C$ is easy and in fact can be achieved by a sequence of $\min$ and $\max$ operators.

Stability of this algorithm strongly depends on the step sizes. Also, it will be problematic for the values like $\alpha = 0.01$. Because in such cases, step size must be carefully annealed so as to catch the true value without being caught on the projection limit (in that case, the value will be fixed to zero). In the above algorithm, we use $\eta$ for all steps, but it is more sound to use $\eta_c, \eta_\alpha, \eta_\beta$ for each gradient step.

We think, this algorithm is an instance of a stochastic version of the EM algorithm. E-step is the variance update, where a deterministic relationship is defined, so E-step is trivial. In the M-step, instead of do full-maximization, we take a little step toward local maximum.

## Results

For a small experiment, we generate data from the generative model with the parameters $\alpha = 0.3, \beta = 0.5, c = 2$ and $\sigma_1 = 15$ as initial variance. Then we generate the data which can be seen from the figures. The generated $x_{1:T}$.
For initialisation, we choose $\alpha^1 = 0.9, \beta^1 = 0.1, c^1 = 5, \hat{\sigma}_1 = 4$. We choose $\eta = 0.005$ and use same $\eta$ for all updates. We generate 1 million samples so as to show that the algorithm scales well. Then, we apply the algorithm we derived in the last section. Then, $\alpha^\text{end} = 0.3079,\beta^\text{end} = 0.4724$ and $c^\text{end} = 2.0477$. From this estimation performance, it can easily seen that, algorithm estimates the correct parameters. Estimated parameters wrt iterations (data samples). Red lines are true values.
An interesting phenomenon occurs in the estimates of the variance. In algorithm, we said that, in each iteration, we `propagate' the variance with the current estimate of parameters. This selection seems to lead to the accurate predictions of the variance. To test it, we investigate, \begin{align*} |\sigma_t - \hat{\sigma}_t| \end{align*} as $t \to \infty$. Empirical results suggest that, as $t\to\infty$, variance estimates become accurate as parameters correctly estimated. This is also has to be verified theoretically.

## Conclusion

Although algorithm works well for toy data, it has to be tested with real data. We note that the GARCH model is a toy model, hence modeling variance and estimating parameters in this way will be unrealistic in a real setting.