Introduction.
In this post, we review the Expectation-Maximization (EM) algorithm and its use for maximum-likelihood problems.
Latent variable models.
Let $\{X_1,\ldots,X_N\}$ be iid random variables with densities $p(x_i | \theta)$ and let $\mathbf{x} = \{x_1, \ldots, x_N\}$ be their realisations. We would like to find the maximum-likelihood estimate of $\theta$ given the data $\mathbf{x}$, that is, the parameter that maximizes $p(\mathbf{x}|\theta)$ and thus best explains the data at hand. We further assume that direct maximization of this log-likelihood is hard or impossible. If there is a suitable latent variable structure, however, the estimation problem can be solved via the EM algorithm. Suppose the observed variables $\mathbf{x}$ live in some space, $\mathbf{x} \in \mathcal{X}$, and there is a set of hidden variables $\mathbf{y} \in \mathcal{Y}$. If maximization of the `complete' log-likelihood $p(\mathbf{x},\mathbf{y}|\theta)$ in the extended space $\mathcal{X} \times \mathcal{Y}$ is relatively easy and the hidden variables are meaningful, we can work with this extended model. The EM algorithm addresses exactly this situation. However, since we do not observe $\mathbf{y}$, we can only carry out the maximization partially. The E-step takes an expectation over the hidden variables, which corresponds to a convex combination of their possible values; the M-step then maximizes this expected log-likelihood.
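To make the setup concrete, here is a minimal sketch (my own toy example, not from the post; all parameter values are hypothetical) of a two-component Gaussian mixture, a classic latent variable model: the component label $y_i$ is hidden, the incomplete-data log-likelihood $\log p(\mathbf{x}|\theta)$ has a sum inside the logarithm, while the complete-data log-likelihood $\log p(\mathbf{x},\mathbf{y}|\theta)$ factorizes and would be easy to maximize.

```python
import numpy as np

rng = np.random.default_rng(0)
pi, mu = 0.4, np.array([-2.0, 3.0])        # hypothetical parameters theta = (pi, mu)
N = 1000
y = (rng.random(N) < pi).astype(int)       # hidden component labels (latent variables)
x = rng.normal(mu[y], 1.0)                 # observed data, x_i ~ N(mu[y_i], 1)

def log_gauss(x, m):
    """Elementwise log N(x | m, 1)."""
    return -0.5 * np.log(2 * np.pi) - 0.5 * (x - m) ** 2

# Incomplete-data log-likelihood log p(x | theta): the hidden label is summed out,
# which puts a sum inside the log and makes direct maximization over theta awkward.
incomplete = np.logaddexp(np.log(1 - pi) + log_gauss(x, mu[0]),
                          np.log(pi)     + log_gauss(x, mu[1])).sum()

# Complete-data log-likelihood log p(x, y | theta): with y observed it factorizes,
# so maximizing it over theta would be straightforward.
complete = np.where(y == 1,
                    np.log(pi)     + log_gauss(x, mu[1]),
                    np.log(1 - pi) + log_gauss(x, mu[0])).sum()

print(incomplete, complete)
```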
E-step.
Assume we have an extended probability model $p(\mathbf{x}, \mathbf{y}|\theta)$ and we would like to estimate $\theta$ given only the observations $\mathbf{x}$. Let us rewrite our problem, \begin{align*} \theta^* &= \argmax_\theta \log p(\mathbf{x}|\theta) \\ &= \argmax_\theta \log \int_\mathcal{Y} p(\mathbf{x}, \mathbf{y} | \theta) \mbox{d}\mathbf{y} \end{align*} Note that this is a multiple integral if $\mathbf{y}$ is a set of variables; we use a compact notation. To carry out this maximization, we use a few tricks. First, we introduce an instrumental distribution $q(\mathbf{y})$ over the hidden variables. Then we can write, \begin{align*} \log \int_\mathcal{Y} p(\mathbf{x}, \mathbf{y} | \theta) \mbox{d}\mathbf{y} &= \log \int_\mathcal{Y} p(\mathbf{x}, \mathbf{y} | \theta) \frac{q(\mathbf{y})}{q(\mathbf{y})} \mbox{d}\mathbf{y} \\ &= \log \bE_{q(\mathbf{y})} \left[ \frac{p(\mathbf{x}, \mathbf{y} | \theta)}{ q(\mathbf{y})} \right] \end{align*} where $\bE_{q(\mathbf{y})}$ denotes expectation with respect to the distribution ${q(\mathbf{y})}$. At this point, we use the following theorem to proceed further.
Theorem 1. (Jensen's inequality) If $\varphi$ is a convex function then $\varphi(\bE [X]) \leq \bE[ \varphi(X)]$.
Since $\log$ is a concave function, we can use Theorem 1 in the reverse direction. Then we have,
\begin{align}
\log \bE_{q(\mathbf{y})} \left[ \frac{p(\mathbf{x}, \mathbf{y} | \theta)}{ q(\mathbf{y})} \right] &\geq \bE_{q(\mathbf{y})} \left[ \log \frac{p(\mathbf{x}, \mathbf{y} | \theta)}{ q(\mathbf{y})}\right] \nonumber \\
&= \bE_{q(\mathbf{y})} \left[ \log {p(\mathbf{x}, \mathbf{y} | \theta)}\right] - \bE_{q(\mathbf{y})} \left[ \log {q(\mathbf{y})}
\right] \label{Estep}
\end{align}
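As a quick illustration (again my own toy example, not part of the original derivation), the bound can be checked numerically for a single observation from the two-component mixture above, with a discrete latent label and an arbitrary choice of $q(\mathbf{y})$:

```python
import numpy as np

x = 1.5                                    # a single observation
pi, mu = 0.4, np.array([-2.0, 3.0])        # assumed (hypothetical) parameters theta

def log_joint(y):
    """log p(x, y | theta) for the toy two-component mixture."""
    prior = np.log(pi) if y == 1 else np.log(1 - pi)
    return prior - 0.5 * np.log(2 * np.pi) - 0.5 * (x - mu[y]) ** 2

q = np.array([0.7, 0.3])                   # an arbitrary instrumental distribution q(y)
log_ratio = np.array([log_joint(y) - np.log(q[y]) for y in (0, 1)])

lhs = np.log(np.sum(q * np.exp(log_ratio)))  # log E_q[p(x,y|theta)/q(y)] = log p(x|theta)
rhs = np.sum(q * log_ratio)                  # E_q[log p(x,y|theta)/q(y)], the lower bound
print(lhs, rhs, lhs >= rhs)                  # the bound holds; equality iff q(y) = p(y|x,theta)
```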
The E-step effectively aims to approximate the complete-data log-likelihood $p(\mathbf{x},\mathbf{y}|\theta)$. Since we do not know $\mathbf{y}$, it makes sense to choose $q(\mathbf{y}) = p(\mathbf{y}|\mathbf{x},\theta)$, which captures all available information about the unobserved variables (as noted by one of our readers, this choice is also theoretically optimal). However, there is a problem again: we do not know $\theta$. Therefore, we cannot use a one-step method; instead we have to start with an initial $\theta$ and update it iteratively. We denote the $\theta$ from the previous iteration by $\theta^{\text{old}}$, and choose $q(\mathbf{y}) = p(\mathbf{y}|\mathbf{x},\theta^{\text{old}})$. Hence, the E-step consists of taking the expectation \eqref{Estep}. Notice also that we do not need to compute the second term, since it does not depend on $\theta$ and is therefore irrelevant to our optimization problem. Before concluding, let us adopt a new notation,
\begin{align*}
\mathcal{Q}(\theta | \theta^{\text{old}}) &= \bE_{p(\mathbf{y}|\mathbf{x},\theta^{\text{old}})} \left[ \log {p(\mathbf{x}, \mathbf{y} | \theta)}\right] \\
&= \int_\mathcal{Y} \log {p(\mathbf{x}, \mathbf{y} | \theta)} p(\mathbf{y}|\mathbf{x},\theta^{\text{old}}) \mbox{d}\mathbf{y}
\end{align*}
Note that conditional independence relations between $\mathbf{x}$, $\mathbf{y}$ and $\theta$ may simplify this integral. The expectation in the E-step is usually derived by hand in closed analytic form. If this is not available, Monte Carlo methods can be used to estimate the expectation, which is beyond the scope of this post.
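As a rough sketch (still using the hypothetical two-component mixture from above, with illustrative parameter values), here is how $\mathcal{Q}(\theta | \theta^{\text{old}})$ can be evaluated when $\mathbf{y}$ is discrete, so that the integral over $\mathcal{Y}$ reduces to a sum weighted by the posterior responsibilities $p(\mathbf{y}|\mathbf{x},\theta^{\text{old}})$:

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic observations from the toy mixture (values are illustrative).
x = np.concatenate([rng.normal(-2.0, 1.0, 600), rng.normal(3.0, 1.0, 400)])

def log_gauss(x, m):
    """Elementwise log N(x | m, 1)."""
    return -0.5 * np.log(2 * np.pi) - 0.5 * (x - m) ** 2

def Q(theta, theta_old):
    """Expected complete-data log-likelihood E_{p(y|x,theta_old)}[log p(x, y | theta)]."""
    pi, mu = theta
    pi_o, mu_o = theta_old
    # Posterior responsibilities r_i = p(y_i = 1 | x_i, theta_old).
    log_r1 = np.log(pi_o) + log_gauss(x, mu_o[1])
    log_r0 = np.log(1 - pi_o) + log_gauss(x, mu_o[0])
    r = np.exp(log_r1 - np.logaddexp(log_r0, log_r1))
    # With discrete y, the integral over Y collapses to a responsibility-weighted sum.
    return np.sum(r * (np.log(pi) + log_gauss(x, mu[1]))
                  + (1 - r) * (np.log(1 - pi) + log_gauss(x, mu[0])))

theta_old = (0.5, np.array([-1.0, 1.0]))
theta_better = (0.4, np.array([-2.0, 3.0]))
print(Q(theta_old, theta_old), Q(theta_better, theta_old))  # the M-step maximizes Q over theta
```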
2 comments:
Not only does it make sense to choose $p(\mathbf{y}|\mathbf{x},\theta)$ as ${q(\mathbf{y})}$, it is the best you can do. If you find the ${q(\mathbf{y})}$ that maximizes $\log \bE_{q(\mathbf{y})} \left[ \frac{p(\mathbf{x}, \mathbf{y} | \theta)}{ q(\mathbf{y})} \right]$ by taking the derivative and setting it to zero, you end up with $p(\mathbf{y}|\mathbf{x},\theta)$.
Thanks, I added this as a note in the post.