## Introduction

In this post, we derive the batch MLE procedure for the GARCH model in a more principled way than the last GARCH post. The derivation presented here is simple and concise.

The general GARCH model is defined as follows. \begin{align} V_1 &\sim \mu \\ X_n &\sim \NPDF(0,V_n) \label{GARCHobs} \\ V_n &= c + \sum_{j=1}^q \alpha_j X^2_{n-j} + \sum_{i = 1}^p \beta_i V_{n-i} \label{GARCHtr} \end{align} where $(X_n)_{n\geq 1}$ is the observed stochastic process. Note that $V_n = \sigma_n^2$ in the classical statistical notation and $\mu$ is the initial distribution. It is customary to use this model with model order $p=1$, $q=1$, which is called GARCH(1,1). The model then reduces to \begin{align*} V_1 &\sim \mu \\ X_n &\sim \NPDF(0,V_n) \\ V_n &= c + \alpha X^2_{n-1} + \beta V_{n-1} \end{align*} In this model, we think of $X_n$ as observations. We set $\theta = (c,\alpha,\beta)$. The goal is to estimate the parameter vector $\theta \in \Theta$ where $\Theta \subset \bR^3$. We assume $\theta$ is fixed over time; such a parameter is called `static'.
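To make the recursion concrete, the GARCH(1,1) model above can be simulated in a few lines. The following is a minimal sketch in Python; the function name `simulate_garch11` and the choice of fixing the initial variance are ours, for illustration only.

```python
import numpy as np

def simulate_garch11(n, c, alpha, beta, v1, seed=0):
    """Simulate n steps of GARCH(1,1): V_k = c + alpha * X_{k-1}^2 + beta * V_{k-1},
    X_k ~ N(0, V_k), with the initial variance V_1 = v1 fixed."""
    rng = np.random.default_rng(seed)
    x = np.empty(n)
    v = np.empty(n)
    v[0] = v1
    x[0] = rng.normal(0.0, np.sqrt(v[0]))
    for k in range(1, n):
        v[k] = c + alpha * x[k - 1] ** 2 + beta * v[k - 1]
        x[k] = rng.normal(0.0, np.sqrt(v[k]))
    return x, v
```

For example, `x, v = simulate_garch11(1000, c=0.1, alpha=0.1, beta=0.8, v1=1.0)` generates a path whose conditional variance reacts to past squared observations.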

Remark 1. There are stability constraints for this model: \begin{align*} \alpha + \beta < 1, \quad \alpha \geq 0, \quad \beta \geq 0, \quad c > 0 \end{align*} The first condition ensures that the process is covariance stationary.

Notation and setup. We assume all measures are absolutely continuous with respect to some dominating measure (e.g., the Lebesgue measure), denoted generically by $\mbox{d}x$. We then use the notation \begin{align*} p(x | v) = \bP(X \in \mbox{d}x | V = v) \end{align*} for any $X$ and $V$ to denote the conditional density. If a probability density $p(x)$ depends on some parameters, we write $p_\theta(x)$, where $\theta$ is a (possibly vector-valued) parameter. To denote the joint density of a sequence of random variables $(Y_1,\ldots,Y_n)$, we write $p(y_{1:n})$. Furthermore, we fix a filtered probability space $(\Omega,\cF,\bP)$ with filtration $(\cF^X_n)_{n\geq 1}$, where $\cF^X_n = \sigma(X_1,\ldots,X_n)$.

In the sequel, we fix $V_1 = v_1$, i.e., we assume the initial condition is known. This is certainly not the case in real applications, but we assume it to keep the derivations simple. It is also known that, for large sample sizes, the effect of conditioning on a fixed initial value washes out. This setting frees us from specifying $\mu$ explicitly, which is an important (and problematic) part of the MLE procedure. Otherwise, one can devise an algorithm that assumes a parametrised distribution $\mu$ on $V_1$ with parameter $\nu$, extends the parameter vector as $\theta = (\alpha,\beta,c,\nu)$, and employs MLE.

## Problem formulation

The typical MLE problem is to solve the following optimisation problem, \begin{align}\label{MLEformulation} \theta^* = \argmax_{\theta \in \Theta} \log p_\theta(x_{1:n}) \end{align} where $\theta$ is the parameter and $p_\theta(x_{1:n})$ is the likelihood of the observations $(X_1,\ldots,X_n)$ for a fixed $\theta$. So we first have to write down $p_\theta(x_{1:n})$ under the assumption that $(X_1,\ldots,X_n)$ is generated by a GARCH(1,1) model.

When there is a hidden variable in the model, one typically makes use of it to derive the likelihood. In our case, according to the model, we first write \begin{align}\label{intgNote} p_\theta(x_{1:n}) = \int p_\theta(x_{1:n},v_{1:n}) \mbox{d}v_{1:n} \end{align} and accordingly decompose the joint density as \begin{align}\label{decompJointdensity} p_\theta(x_{1:n},v_{1:n}) = p(v_1) p(x_1 | v_1) \prod_{k=2}^n p(x_k | v_k) p_\theta(v_k | v_{k-1}, x_{k-1}) \end{align} for a fixed $\theta \in \Theta$ where $\Theta \subset \bR^{d_\theta}$.

## Batch MLE

Here we derive the algorithm for fixed $n \in \bN$. Since $V_1 = v_1$ is fixed, we can write the marginal likelihood by using \eqref{decompJointdensity} as \begin{align*} p_\theta(x_{1:n}) &= \int p_\theta(x_{1:n},v_{2:n} | v_1) \mbox{d}v_{2:n} \\ &= p(x_1 | v_1) \int \prod_{k=2}^n p(x_k | v_k) p_\theta(v_k | v_{k-1},x_{k-1}) \mbox{d}v_{2:n} \end{align*} Observe that, for a fixed $\theta$, \eqref{GARCHtr} implies a degenerate distribution for $V_k$. That is, given $V_{k-1} = v_{k-1}$ and $X_{k-1} = x_{k-1}$, the distribution of $V_k$ is degenerate, i.e., \begin{align*} p_{\theta}(v_{k} | v_{k-1},x_{k-1}) = \delta(v_k - (c + \alpha x_{k-1}^2 + \beta v_{k-1})) \end{align*} Note the following property of the Dirac density, \begin{align*} f(x) = \int \delta(y - x) f(y) \mbox{d}y \end{align*} for a bounded measurable function $f$. Integrating sequentially over $v_2, v_3, \ldots, v_n$ and applying this property at each step pins every $v_k$ to the value produced by the recursion \eqref{GARCHtr}. For a fixed $\theta$, the marginal likelihood therefore becomes \begin{align*} p_\theta(x_{1:n}) = p(x_1 | v_1) \prod_{k=2}^n p(x_k | v_k^\theta) \end{align*} Here the sequence $\{v_k^\theta \}_{k=1}^n$ is the particular realisation of the process $(V_n)$ for a fixed $\theta$, given $\cF^X_{n-1}$ and $v_1^\theta = v_1$. The marginal log-likelihood is then \begin{align}\label{LLlastform} \log p_\theta(x_{1:n}) = \log p(x_1 | v_1) + \sum_{k=2}^n \log p(x_k | v_k^\theta) \end{align} To solve \eqref{MLEformulation}, we have to maximise \eqref{LLlastform} with respect to $\theta$. Notice that we do not have to deal with the first term, as it does not depend on $\theta$ once $v_1$ is fixed. For the optimisation, we have a variety of choices; here we use good old gradient ascent. This corresponds to performing the update given in \eqref{BatchUpdate} in Algorithm 1. For details of the derivatives and required computations, see here.
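Evaluating \eqref{LLlastform} for a given $\theta$ only requires running the variance recursion forward and accumulating Gaussian log-densities. Below is a minimal sketch in Python; the function name `garch11_loglik` and its interface are ours, for illustration.

```python
import numpy as np

def garch11_loglik(theta, x, v1):
    """Marginal log-likelihood log p_theta(x_{1:n}) of GARCH(1,1)
    with V_1 = v1 fixed: run v_k = c + alpha * x_{k-1}^2 + beta * v_{k-1}
    forward and sum the Gaussian log-densities log N(x_k; 0, v_k)."""
    c, alpha, beta = theta
    v = v1
    # First term: log p(x_1 | v_1); it does not depend on theta.
    ll = -0.5 * (np.log(2 * np.pi * v) + x[0] ** 2 / v)
    for k in range(1, len(x)):
        v = c + alpha * x[k - 1] ** 2 + beta * v
        ll += -0.5 * (np.log(2 * np.pi * v) + x[k] ** 2 / v)
    return ll
```

Since the first term is constant in $\theta$, including or dropping it does not change the maximiser.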

Algorithm 1. (Batch MLE for GARCH)
• Read $x_{1:n}$ and initialize $\theta^1$
• Repeat until convergence:
• Simulate $v_{1:n}^{\theta^j}$:
• \begin{align*} v_{k}^{\theta^j} = c^j + \alpha^j x_{k-1}^2 + \beta^j v_{k-1}^{\theta^j} \end{align*}
• Update: \begin{align}\label{BatchUpdate} \theta^{j+1} = \Pi_C \left[\theta^j + \gamma_j \nabla_\theta \log p_\theta(x_{1:n}) \big\vert_{\theta = \theta^j} \right] \end{align}
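The gradient in \eqref{BatchUpdate} can be computed exactly alongside the forward recursion by propagating the sensitivities $\partial v_k^\theta / \partial \theta$. From \eqref{GARCHtr}, $\partial v_k/\partial c = 1 + \beta \, \partial v_{k-1}/\partial c$, $\partial v_k/\partial \alpha = x_{k-1}^2 + \beta \, \partial v_{k-1}/\partial \alpha$, and $\partial v_k/\partial \beta = v_{k-1} + \beta \, \partial v_{k-1}/\partial \beta$, with $\partial v_1/\partial \theta = 0$ since $v_1$ is fixed. The following Python sketch (names and interface are ours) implements this:

```python
import numpy as np

def garch11_grad(theta, x, v1):
    """Gradient of the marginal log-likelihood of GARCH(1,1) w.r.t.
    theta = (c, alpha, beta), with V_1 = v1 fixed, obtained by
    propagating the sensitivities dv_k/dtheta along the recursion."""
    c, alpha, beta = theta
    v = v1
    dv = np.zeros(3)    # (dv_k/dc, dv_k/dalpha, dv_k/dbeta); zero at k = 1
    grad = np.zeros(3)
    for k in range(1, len(x)):
        # The sensitivity recursion uses v_{k-1}, so update dv before v.
        dv = np.array([1.0, x[k - 1] ** 2, v]) + beta * dv
        v = c + alpha * x[k - 1] ** 2 + beta * v
        # d/dv of log N(x_k; 0, v) = -1/(2v) + x_k^2 / (2 v^2)
        dl_dv = 0.5 * (x[k] ** 2 / v ** 2 - 1.0 / v)
        grad += dl_dv * dv
    return grad
```

Each iteration of Algorithm 1 then adds $\gamma_j$ times this gradient to $\theta^j$ before projecting.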

Remark 2. As one can see from \eqref{BatchUpdate}, the update rule requires a projection $\Pi_C$. This is related to the stability conditions given in Remark 1: after each update, one should project the parameter vector onto the region of stability $C \subset \Theta$.
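One simple and cheap way to implement $\Pi_C$ is to clip each component to its admissible range and rescale $(\alpha, \beta)$ whenever their sum is too large. The sketch below is one possible choice, not the exact Euclidean projection; the function name and the margin `eps` are ours.

```python
import numpy as np

def project_stability(theta, eps=1e-6):
    """Map theta = (c, alpha, beta) into the stability region of Remark 1:
    c > 0, alpha >= 0, beta >= 0, alpha + beta < 1."""
    c = max(theta[0], eps)
    alpha = min(max(theta[1], 0.0), 1.0)
    beta = min(max(theta[2], 0.0), 1.0)
    if alpha + beta >= 1.0:
        scale = (1.0 - eps) / (alpha + beta)
        alpha, beta = alpha * scale, beta * scale
    return np.array([c, alpha, beta])
```

Parameters already inside the region are left untouched, so the projection only acts when an update step leaves the stability region.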

Remark 3. In addition to the forward recursion for $v_{1:n}^{\theta^j}$, each update \eqref{BatchUpdate} requires summing $n$ terms of the marginal log-likelihood and its gradient, so the cost per iteration is $O(n)$. This algorithm is therefore typically slow for large $n$.