The EM Algorithm


In this post, we review the Expectation-Maximization (EM) algorithm and its use for maximum-likelihood problems.

Latent variable models.

Let $\{X_1,\ldots,X_N\}$ be iid random variables with densities $p(x_i | \theta)$ and let $\mathbf{x} = \{x_1, \ldots, x_N\}$ be their realisations. We would like to find the maximum-likelihood estimate of $\theta$ given the data $\mathbf{x}$, that is, the parameter that maximizes the likelihood $p(\mathbf{x}|\theta)$. We further assume that direct maximization of this log-likelihood is hard or impossible. If there is a suitable latent variable structure, however, the estimation problem can be solved via the EM algorithm. Suppose the observed variables live in some space, $\mathbf{x} \in \mathcal{X}$, and there is a set of hidden variables $\mathbf{y} \in \mathcal{Y}$. If maximizing the `complete' log-likelihood $p(\mathbf{x},\mathbf{y}|\theta)$ in the extended space $\mathcal{X} \times \mathcal{Y}$ is relatively easy and the hidden variables are meaningful for the model, we can work with this extended model. The EM algorithm addresses exactly this situation. However, since we do not observe $\mathbf{y}$, we cannot maximize the complete log-likelihood directly. The E-step takes an expectation over the hidden variables, which corresponds to a convex combination of their possible values; the M-step then maximizes this expected log-likelihood.
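As a concrete (hypothetical) illustration of why the incomplete log-likelihood is awkward to maximize directly, consider a two-component 1-D Gaussian mixture, where the hidden variable $y_i$ is the unobserved component label of $x_i$. The names and parameter values below are for illustration only; a minimal sketch:

```python
import math

# Hypothetical example: two-component 1-D Gaussian mixture.
# The hidden variable y_i is the (unobserved) component label of x_i.
def gauss_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def mixture_log_likelihood(xs, pi, mu1, s1, mu2, s2):
    # log p(x | theta) = sum_i log( pi*N(x_i; mu1, s1) + (1-pi)*N(x_i; mu2, s2) )
    # The log of a sum does not decompose over components, which is what makes
    # direct maximization awkward; the complete-data log-likelihood
    # log p(x, y | theta) would decompose if the labels y were observed.
    return sum(math.log(pi * gauss_pdf(x, mu1, s1)
                        + (1 - pi) * gauss_pdf(x, mu2, s2)) for x in xs)

xs = [-1.2, -0.8, 0.9, 1.1]
ll = mixture_log_likelihood(xs, 0.5, -1.0, 0.5, 1.0, 0.5)
```

Note how each term is a log of a sum over the possible values of $y_i$; this coupling is exactly what EM sidesteps.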


Assume we have an extended probability model $p(\mathbf{x}, \mathbf{y}|\theta)$ and we would like to estimate $\theta$ given only the observations $\mathbf{x}$. Let us rewrite our problem, \begin{align*} \theta^* &= \argmax_\theta \log p(\mathbf{x}|\theta) \\ &= \argmax_\theta \log \int_\mathcal{Y} p(\mathbf{x}, \mathbf{y} | \theta) \mbox{d}\mathbf{y} \end{align*} Note that this is a multiple integral if $\mathbf{y}$ is a set of variables; we use a compact notation. To perform this maximization, we use a few tricks. First, we introduce an instrumental distribution $q(\mathbf{y})$ over the hidden variables. Then we can write, \begin{align*} \log \int_\mathcal{Y} p(\mathbf{x}, \mathbf{y} | \theta) \mbox{d}\mathbf{y} &= \log \int_\mathcal{Y} p(\mathbf{x}, \mathbf{y} | \theta) \frac{q(\mathbf{y})}{q(\mathbf{y})} \mbox{d}\mathbf{y} \\ &= \log \bE_{q(\mathbf{y})} \left[ \frac{p(\mathbf{x}, \mathbf{y} | \theta)}{ q(\mathbf{y})} \right] \end{align*} where $\bE_{q(\mathbf{y})}$ denotes the expectation with respect to the distribution ${q(\mathbf{y})}$. At this point, we use the following theorem to proceed further.

Theorem 1. (Jensen's inequality) If $\varphi$ is a convex function then $\varphi(\bE [X]) \leq \bE[ \varphi(X)]$.
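A quick numerical sanity check of Jensen's inequality for the concave $\log$ (so the inequality reverses: $\log \bE[X] \geq \bE[\log X]$), using a toy discrete distribution of our own choosing:

```python
import math

# Numerical check of Jensen's inequality for the concave function log:
# log(E[X]) >= E[log X] for a positive random variable X.
xs = [0.5, 1.0, 2.0, 4.0]   # support of a toy distribution
ps = [0.1, 0.4, 0.3, 0.2]   # probabilities (sum to 1)

log_e_x = math.log(sum(p * x for p, x in zip(ps, xs)))
e_log_x = sum(p * math.log(x) for p, x in zip(ps, xs))
```

Equality would hold only if $X$ were constant under the distribution.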

Since $\log$ is concave, we can apply Theorem 1 in the reverse direction. Then we have, \begin{align} \log \bE_{q(\mathbf{y})} \left[ \frac{p(\mathbf{x}, \mathbf{y} | \theta)}{ q(\mathbf{y})} \right] &\geq \bE_{q(\mathbf{y})} \left[ \log \frac{p(\mathbf{x}, \mathbf{y} | \theta)}{ q(\mathbf{y})}\right] \nonumber \\ &= \bE_{q(\mathbf{y})} \left[ \log {p(\mathbf{x}, \mathbf{y} | \theta)}\right] - \bE_{q(\mathbf{y})} \left[ \log {q(\mathbf{y})} \right] \label{Estep} \end{align} The E-step aims to approximate the complete-data log-likelihood $p(\mathbf{x},\mathbf{y}|\theta)$. Since we do not know $\mathbf{y}$, it makes sense to choose $p(\mathbf{y}|\mathbf{x},\theta)$ as $q(\mathbf{y})$, which captures all available information about the unobserved variables (as noted by one of our readers, this choice is also theoretically optimal). However, there is a problem again: we do not know $\theta$. Therefore, we cannot use a one-step method; rather, we have to start with an initial $\theta$ and update it iteratively. We denote the value of $\theta$ from the previous iteration by $\theta^{\text{old}}$, and choose $q(\mathbf{y}) = p(\mathbf{y}|\mathbf{x},\theta^{\text{old}})$. Hence, the E-step consists of taking the expectation \eqref{Estep}. Notice also that we do not need to compute the second term, since it does not depend on $\theta$ and is irrelevant to our optimization problem. Before moving on, let us adopt a new notation, \begin{align*} \mathcal{Q}(\theta | \theta^{\text{old}}) &= \bE_{p(\mathbf{y}|\mathbf{x},\theta^{\text{old}})} \left[ \log {p(\mathbf{x}, \mathbf{y} | \theta)}\right] \\ &= \int_\mathcal{Y} \log {p(\mathbf{x}, \mathbf{y} | \theta)} p(\mathbf{y}|\mathbf{x},\theta^{\text{old}}) \mbox{d}\mathbf{y} \end{align*} Note that conditional independence relations between $\mathbf{x}$, $\mathbf{y}$ and $\theta$ may simplify this integral. The expectation in the E-step is usually derived by hand in closed analytic form.
If a closed form is not available, Monte Carlo methods can be used to estimate this expectation, but they are beyond the scope of this post.
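To make the E-step concrete, here is a hedged sketch for the (hypothetical) two-component 1-D Gaussian mixture, where $y_i$ is the component label of $x_i$. The posterior $p(\mathbf{y}|\mathbf{x},\theta^{\text{old}})$ factorizes over data points, so it is fully specified by one "responsibility" per point, and the integral defining $\mathcal{Q}(\theta|\theta^{\text{old}})$ collapses to a responsibility-weighted sum. All function names here are our own:

```python
import math

# Hypothetical two-component 1-D Gaussian mixture example.
# y_i is the component label of x_i; p(y_i = 1 | x_i, theta_old)
# has a closed form (the "responsibility" of component 1).
def gauss_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def e_step(xs, pi, mu1, s1, mu2, s2):
    # q(y) = p(y | x, theta_old) factorizes over points,
    # so one responsibility per point specifies it completely.
    resp = []
    for x in xs:
        a = pi * gauss_pdf(x, mu1, s1)
        b = (1 - pi) * gauss_pdf(x, mu2, s2)
        resp.append(a / (a + b))
    return resp

def q_function(xs, resp, pi, mu1, s1, mu2, s2):
    # Q(theta | theta_old): the integral over y collapses to a
    # responsibility-weighted sum of complete-data log-likelihood terms.
    total = 0.0
    for x, r in zip(xs, resp):
        total += r * (math.log(pi) + math.log(gauss_pdf(x, mu1, s1)))
        total += (1 - r) * (math.log(1 - pi) + math.log(gauss_pdf(x, mu2, s2)))
    return total

xs = [-1.1, -0.9, 0.9, 1.1]
resp = e_step(xs, 0.5, -1.0, 0.5, 1.0, 0.5)
q_val = q_function(xs, resp, 0.5, -1.0, 0.5, 1.0, 0.5)
```

This is the "conditional independence may simplify the integral" remark in action: the $N$-fold integral over $\mathcal{Y}$ becomes $N$ independent two-term sums.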


Our ultimate aim is to maximize the log-likelihood. The E-step provides a lower bound on the log-likelihood, and now we would like to maximize it. It is in fact possible to show that with the choice $q(\mathbf{y}) = p(\mathbf{y}|\mathbf{x},\theta^{\text{old}})$ this lower bound is tight at $\theta = \theta^{\text{old}}$, i.e., we are doing the correct thing; we omit the proof. The M-step consists of, \begin{align*} \theta^* = \argmax_\theta \mathcal{Q}(\theta | \theta^{\text{old}}) \end{align*} This step is again problem dependent, as in the E-step, but the general scheme is just this. After estimating the new $\theta^*$, one uses it as $\theta^{\text{old}}$ in the next iteration. This iterative procedure never decreases the log-likelihood, i.e., one can prove that $\log p(\mathbf{x}|\theta^{t+1}) \geq \log p(\mathbf{x}|\theta^{t})$.
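Putting both steps together, here is a minimal, hedged EM loop for the same hypothetical two-component 1-D Gaussian mixture. The M-step updates below are the standard closed-form maximizers of $\mathcal{Q}(\theta|\theta^{\text{old}})$ for this model; the data and initial values are invented for illustration:

```python
import math

# Minimal EM sketch for a two-component 1-D Gaussian mixture
# (hypothetical example; data and initialization are invented).
def gauss_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def log_likelihood(xs, pi, mu1, s1, mu2, s2):
    return sum(math.log(pi * gauss_pdf(x, mu1, s1)
                        + (1 - pi) * gauss_pdf(x, mu2, s2)) for x in xs)

def em(xs, pi, mu1, s1, mu2, s2, iters=50):
    lls = []
    for _ in range(iters):
        # E-step: responsibilities r_i = p(y_i = 1 | x_i, theta_old)
        r = [pi * gauss_pdf(x, mu1, s1)
             / (pi * gauss_pdf(x, mu1, s1) + (1 - pi) * gauss_pdf(x, mu2, s2))
             for x in xs]
        # M-step: closed-form maximizers of Q(theta | theta_old)
        n1 = sum(r)
        n2 = len(xs) - n1
        pi = n1 / len(xs)
        mu1 = sum(ri * x for ri, x in zip(r, xs)) / n1
        mu2 = sum((1 - ri) * x for ri, x in zip(r, xs)) / n2
        s1 = math.sqrt(sum(ri * (x - mu1) ** 2 for ri, x in zip(r, xs)) / n1)
        s2 = math.sqrt(sum((1 - ri) * (x - mu2) ** 2 for ri, x in zip(r, xs)) / n2)
        # Track the (incomplete-data) log-likelihood: it never decreases.
        lls.append(log_likelihood(xs, pi, mu1, s1, mu2, s2))
    return (pi, mu1, s1, mu2, s2), lls

data = [-1.2, -1.0, -0.8, 0.8, 1.0, 1.2]
params, lls = em(data, 0.5, -0.5, 1.0, 0.5, 1.0)
```

Tracking `lls` makes the monotonicity property above directly observable on this toy problem.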


We will not give concrete examples here and leave them to future (possibly application-oriented) posts. The convergence proof of EM is another interesting topic. EM is a special case of a more general class of algorithms called majorization-minimization (MM) algorithms, which are interesting in their own right; MM algorithms are especially useful for exposing the geometric intuition behind EM. We hope to cover MM algorithms in future posts.


bitkidoku said...

Not only does it make sense to choose $p(\mathbf{y}|\mathbf{x},\theta)$ as ${q(\mathbf{y})}$, it is the best you can do. If you find the ${q(\mathbf{y})}$ that maximizes $\log \bE_{q(\mathbf{y})} \left[ \frac{p(\mathbf{x}, \mathbf{y} | \theta)}{ q(\mathbf{y})} \right]$ by taking the derivative and setting it to zero, you end up with $p(\mathbf{y}|\mathbf{x},\theta)$.

Deniz said...

Thanks, I added this as a note in the post.
