2013/08/20

Sequential importance sampling-resampling

Introduction

In this post, I review the sequential importance sampling-resampling for state space models. These algorithms are also known as particle filters. I give a derivation of these filters and their application to the general state space models.

Basics and HMMs

Notation and Convention. In this previous post, we discussed the topic of importance sampling (IS). Here, we will use a slightly different notation and convention. First of all, although IS based techniques are developed for approximating the integrals (expectations), here we aim to approximate the distributions. After obtained the probability distribution which is the key part of the expectation, computing any moment is straightforward. Furthermore, in order to denote an empirical distribution, we use the following notation: Suppose we have a density $\pi(x)$ and realisations of iid random variables $\{X^{(i)} \}_{i=1}^N$ each of which $X^{(i)} \sim \pi(x)$. Then, we denote the empirical approximation $\pi^N$ of $\pi$ as, \begin{align*} \pi^N(\mbox{d}x) = \frac{1}{N} \sum_{i=1}^N \delta_{X^{(i)}} (\mbox{d}x) \end{align*} where $\delta$ is the Dirac measure. This measure has the following property. Let $f: E \to \bR$ be a bounded measurable function. For any $x\in E$, \begin{align}\label{diracProp} f(x) = \int f(y) \delta_x (\mbox{d}y) \end{align} Therefore, to compute the expectation, we simply put the empirical distribution into the integral and obtain the empirical expectation.

Hidden Markov Models. As we will use the sequential importance sampling (SIS) algorithms for hidden Markov models (HMMs), we define them in a nutshell. Note that by an HMM, we mean a general state-space model.

Definition 1. Let $\{X_n\}_{n \in \bN}$ and be a hidden discrete-time Markov process with the prior probability distribution $p(x_1)$. Suppose also we have discrete-time observation process $\{Y_n\}_{n \in \bN}$. Then, the HMM is the following model, \begin{align} X_1 &\sim p(x_1) \\ X_n | (X_{n-1} = x_{n-1}) &\sim p(x_n | x_{n-1}) \label{transPr} \\ Y_n | (X_{n} = x_{n}) &\sim p(y_n | x_ {n}) \label{obsPr} \end{align}

So intuitively, the HMM consists of two stochastic processes. One is the hidden process $\{X_n\}$ that is of interest in many cases. Other one is the observation process $\{Y_n\}$. In our convention, capital letters denote the random variables and small letters denote the realised versions of them.

From the model structure, it can be easily seen that, \begin{align*} p(x_{1:n},y_{1:n}) = p(x_1) \prod_{k = 2}^n p(x_k|x_{k-1}) \prod_{k=1}^n p(y_k | x_k) \end{align*} In the next section, we will consider the filtering problem for state space models, that is, estimating $p(x_n | y_{1:n})$ sequentially. We are also interested to estimate $p(x_{1:n}|y_{1:n})$ sequentially. To these ends, we derive expressions for these distributions. We start with $p(x_{1:n}|y_{1:n})$. Notice that, $p(x_{1:n}|y_{1:n}) \propto p(x_{1:n},y_{1:n})$ from Bayes theorem. Then, we can write [1], \begin{align*} p(x_{1:n},y_{1:n}) = p(x_{1:n-1},y_{1:n-1}) p(x_n | x_{n-1}) p(y_n | x_n) \end{align*} from the HMM model structure. Therefore, we obtain the posterior $p(x_{1:n} | y_{1:n})$ as the following. \begin{align*} p(x_{1:n} | y_{1:n}) &= \frac{p(x_{1:n},y_{1:n})}{p(y_{1:n})} \\ &= \frac{ p(x_{1:n-1},y_{1:n-1}) p(x_n | x_{n-1}) p(y_n | x_n)}{p(y_{1:n})} \\ &= \frac{p(x_{1:n-1},y_{1:n-1})}{p(y_{1:n-1})} \frac{p(x_n | x_{n-1}) p(y_n | x_n)}{p(y_{n}|y_{1:n-1})} \\ &= p(x_{1:n-1} | y_{1:n-1}) \frac{p(x_n | x_{n-1}) p(y_n | x_n)}{p(y_{n}|y_{1:n-1})} \end{align*} Therefore, we have, \begin{align}\label{trajectDistribution} p(x_{1:n} | y_{1:n}) = p(x_{1:n-1} | y_{1:n-1}) \frac{p(x_n | x_{n-1}) p(y_n | x_n)}{p(y_{n}|y_{1:n-1})} \end{align} where \begin{align*} p(y_n | y_{1:n-1}) &= \int p(x_n, x_{n-1}, y_n | y_{1:n-1}) \mbox{d}x_{n-1:n} \\ & = \int p(y_n | x_n) p(x_n | x_{n-1}) p(x_{n-1}|y_{1:n-1}) \mbox{d}x_{n-1:n} \end{align*} Obtaining an expression for $p(x_n|y_{1:n})$ is easier. We can simply integrate out $x_{1:n-1}$ in \eqref{trajectDistribution}. Let us put, \begin{align*} p(x_n | y_{1:n}) &= \int p(x_{1:n} | y_{1:n}) \mbox{d}x_{1:n-1} \\ &= \int p(x_{1:n-1} | y_{1:n-1}) \frac{p(x_n | x_{n-1}) p(y_n | x_n)}{p(y_{n}|y_{1:n-1})} \mbox{d}x_{1:n-1} \\ &= \int p(x_{1:n} | y_{1:n-1}) \frac{p(y_n | x_n)}{p(y_{n}|y_{1:n-1})} \mbox{d}x_{1:n-1} \\ &= \frac{p(x_{n} | y_{1:n-1}) p(y_n | x_n)}{p(y_{n}|y_{1:n-1})} \end{align*} where $p(x_n|y_{1:n-1})$ is the predictive density and it is given by, \begin{align*} p(x_n | y_{1:n-1}) &= \int p(x_n,x_{n-1}|y_{1:n-1})\mbox{d}x_{n-1} \\ &= \int p(x_n | x_{n-1}) p(x_{n-1} | y_{1:n-1}) \mbox{d}x_{n-1} \end{align*} Now, before delving into SIS, to ease the notation, we set $\pi_n(x_n) = p(x_n | y_{1:n})$ and $\pi_n(x_{1:n}) = p(x_{1:n}|y_{1:n})$, and also put, $\gamma_n(x_n) = p(x_n , y_{1:n})$ and $\gamma_n(x_{1:n}) = p(x_{1:n},y_{1:n})$.

Sequential importance sampling

By using the notation settled in the end of the last section, we give a general view of importance and sequential importance sampling for increasing state spaces. This view also applicable to the state-space models as the notation implies.

Importance sampling for increasing spaces. In the previous regarding IS, we dealt with a single target distribution $\pi(x)$. Now, suppose we have a sequence of probability distributions as target distributions: $\pi_{n}(x_{1:n})$ for $n \in \bN$. Each $\pi_n$ is defined on a product space $\bigotimes_{k=1}^n E_k$ [2]. Therefore, $\pi_1(x_1)$ is defined on $E_1$ whereas $\pi_2(x_{1:2})$ is defined on $E_1 \times E_2$. If we use the `plain' IS to estimate the distribution $\pi_n(x_{1:n})$, we have to use the following procedure [3]. Let us choose a proposal distribution $q_n(x_{1:n})$ which is easy to sample from. We assume some absolute continuity properties. Let us denote the unnormalised potential of $\pi_n(x_{1:n})$ as $\gamma_{n}(x_{1:n})$ and assume $\gamma_n(x_{1:n})$ is known pointwise. We write $\pi_{n}(x_{1:n}) = \gamma_{n}(x_{1:n})/Z_n$ where $Z_n$ is a normalising constant and $Z_n = \int \gamma_n(x_{1:n}) \mbox{d}x_{1:n}$. For the IS, first we compute the weight function as, \begin{align*} W_n(x_{1:n}) = \frac{\gamma_n(x_{1:n})}{q_n(x_{1:n})} \end{align*} Since we know $\pi_{n}(x_{1:n}) = \gamma_{n}(x_{1:n})/Z_n$, therefore we write, \begin{align*} \pi_n(x_{1:n}) = \frac{W_n(x_{1:n}) q_n(x_{1:n})}{Z_n} \end{align*} and also $Z_n$ can be written as $Z_n = \int W_n(x_{1:n}) q_n(x_{1:n}) \mbox{d}x_{1:n}$. We can easily sample trajectories from $q_n(x_{1:n})$ (note that $x_n$ can be a vector itself) and denote these samples with $X_{1:n}^{(i)}$ where $i = 1,\ldots,N$. Via these samples, we can construct the empirical measure of $q_n$ as, \begin{align*} q_n^N(\mbox{d}x_{1:n}) = \frac{1}{N} \sum_{i=1}^N \delta_{X_{1:n}^{(i)}} (\mbox{d}x_{1:n}) \end{align*} Therefore, by using the property \eqref{diracProp}, we may write \begin{align*} Z_n^N = \frac{1}{N} \sum_{i=1}^N W_n(X_{1:n}^{(i)}) \end{align*} We also write, \begin{align*} W_n(x_{1:n}) q_n(x_{1:n}) \approx \frac{1}{N} \sum_{i=1}^N W_n(X^{(i)}_{1:n}) \delta_{X_{1:n}^{(i)}} (\mbox{d}x_{1:n}) \end{align*} Hence, \begin{align*} \pi^N(\mbox{d}x_{1:n}) = \frac{\frac{1}{N} \sum_{i=1}^N W_n(X^{(i)}_{1:n}) \delta_{X_{1:n}^{(i)}} (\mbox{d}x_{1:n})}{\frac{1}{N} \sum_{i=1}^N W_n(X_{1:n}^{(i)})} \end{align*} and we define, $w_n^{(i)}$ as, \begin{align*} w_n^{(i)} = \frac{W_n(X^{(i)}_{1:n})}{\sum_{i=1}^N W_n(X_{1:n}^{(i)})} \end{align*} Hence at least we arrive the approximation of $\pi$ as, \begin{align*} \pi_n^N (\mbox{d}x_{1:n}) = \sum_{i=1}^N w_n^{(i)} \delta_{X_{1:n}^{(i)}} (\mbox{d}x_{1:n}) \end{align*} The main drawback of this method is obvious. As $n$ increases, the dimension of $x_{1:n}$ increases, hence the state-space increases. Therefore, at each iteration, we have to use an IS procedure in such a way that computational complexity constantly increases, at least of order $n$. Therefore, it makes sense to investigate the sequential version of this IS algorithm so as to make this idea compatible with the online inference problems.

Sequential importance sampling. We are interested in developing an algorithm with fixed computational complexity to estimate $\pi_n(x_{1:n})$ and $\pi_n(x_n)$ for $n\geq 1$. To that end, we assume the following structure for the proposal $q_n(x_{1:n})$ [3], \begin{align*} q_n(x_{1:n}) = q_{n-1}(x_{1:n-1}) q_n(x_n|x_{1:n-1}) \end{align*} Hence, we can write the following [3], \begin{align*} W_n(x_{1:n}) &= \frac{\gamma_n(x_{1:n})}{q_n(x_{1:n})} \\ &= \frac{1}{q_{n-1}(x_{1:n-1})} \frac{\gamma_{n-1}(x_{1:n-1})}{\gamma_{n-1}(x_{1:n-1})} \frac{\gamma_n(x_{1:n})}{q_{n}(x_n | x_{1:n-1})} \\ &= \frac{\gamma_{n-1}(x_{1:n-1})}{q_{n-1}(x_{1:n-1})} \frac{\gamma_n(x_{1:n})}{\gamma_{n-1}(x_{1:n-1})q_{n}(x_n | x_{1:n-1})} \\ &= W_{n-1}(x_{1:n-1}) \underbrace{\frac{\gamma_n(x_{1:n})}{\gamma_{n-1}(x_{1:n-1})q_{n}(x_n | x_{1:n-1})}}_{\alpha_n(x_{1:n})} \end{align*} The point of this selection is obvious. It enables us to compute importance weights recursively, i.e., one can use the weights of previous time-step to find the weights of the current time-step. Therefore, we can write the SIS algorithm as follows.

Algorithm 1. (The SIS Algorithm)
  • At time $n = 1$
    • Sample $X_1^{(i)} \sim q(x_1)$
    • Find weights, \begin{align*} W_1(X_1^{(i)}) = \gamma_1(X_1^{(i)})/q_1(X_1^{(i)}) \end{align*} and normalise to find $w_1^{(i)}$.
  • For time $n \geq 2$,
    • Sample $X_n^{(i)} \sim q_n(x_n | X^{(i)}_{1:n-1})$
    • Compute, \begin{align*} W_n(X^{(i)}_{1:n}) = W_{n-1}(X^{(i)}_{1:n-1}) \alpha_n(X^{(i)}_{1:n}) \end{align*} and normalise $w_n^{(i)} \propto W_n(X^{(i)}_{1:n})$.
    • Approximate the distributions as, \begin{align*} \pi^N(\mbox{d}x_{1:n}) = \sum_{i=1}^N w_n^{(i)} \delta_{X^{(i)}_{1:n}}(\mbox{d}x) \\ \pi^N(\mbox{d}x_n) = \sum_{i=1}^N w_n^{(i)} \delta_{X^{(i)}_{n}}(\mbox{d}x) \end{align*}

This algorithm is somehow abstract, that is, we did not specify the $q_n$ and even the $\pi_n$ on a concrete problem. However, before moving into a simple example, we give a brief discussion on proposal densities $q_n$.

It is shown that, the optimal importance function in terms of minimizing the variance is $q_n(x_n | X^{(i)}_{1:n-1},y_{1:n}) = p(x_n| X^{(i)}_{n-1},y_n)$, a proof can be seen from [4]. However, in practice, the widely used proposal density is $p(x_n| x_{n-1})$ since $p(x_n| X^{(i)}_{n-1},y_n)$ is intractable.

Application to general state-space models. Suppose we have a general state-space model (SSM) $p(y_n|x_n)$ and $p(x_n | x_{n-1})$. Recall that, we have, \begin{align*} \gamma_n(x_{1:n}) = p(x_{1:n},y_{1:n}) = p(x_{1:n-1},y_{1:n-1}) p(x_n | x_{n-1}) p(y_n | x_n) \end{align*} Let us choose the proposal distribution as, \begin{align*} q_n(x_{1:n}) = q_{n-1}(x_{1:n-1}) p(x_n | x_{n-1}) \end{align*} Therefore, if we wish to implement the particle filter, the weight expression will become, \begin{align*} W(x_{1:n}) &= \frac{\gamma_n(x_{1:n})}{q_n(x_{1:n})} \\ &= W(x_{1:n-1}) \frac{p(y_n|x_n) p(x_n|x_{n-1})}{p(x_n|x_{n-1})} \\ &= W(x_{1:n-1}) p(y_n|x_n) \end{align*} So for this selection, the SIS filter is as the following,

Algorithm 2. (The SIS Algorithm for proposal $p(x_n|x_{n-1})$)
  • At time $n = 1$
    • Sample $X_1^{(i)} \sim p(x_1)$
    • Find weights, \begin{align*} W_1(X_1^{(i)}) = p(y_1 | X_1^{(i)}) \end{align*} and normalise to find $w_1^{(i)}$.
  • For time $n \geq 2$,
    • Sample $X_n^{(i)} \sim p(x_n | X^{(i)}_{n-1})$
    • Compute, \begin{align*} W_n(X^{(i)}_{1:n}) = W_{n-1}(X^{(i)}_{1:n-1}) p(y_n | X_n^{(i)}) \end{align*} and normalise $w_n^{(i)} \propto W_n(X^{(i)}_{1:n})$.
    • Approximate the distributions as, \begin{align*} \pi^N(\mbox{d}x_{1:n}) = \sum_{i=1}^N w_n^{(i)} \delta_{X^{(i)}_{1:n}}(\mbox{d}x) \\ \pi^N(\mbox{d}x_n) = \sum_{i=1}^N w_n^{(i)} \delta_{X^{(i)}_{n}}(\mbox{d}x) \end{align*}

In state estimation problems, the most wanted quantity is expectation of $X_n$ with respect to filtering distribution $p(x_n | y_{1:n}) = \pi_n(x_n)$, i.e., $\bE_{\pi_n(x_n)}[X_n]$. It is easy to see that, Algorithm 2 can be used to estimate this expectation. We can do the following, \begin{align*} \bE_{\pi_n(x_n)}[X_n] &= \int x_n \pi_n(x_n) \mbox{d}x_n \\ \end{align*} At time $n$, we have the approximation $\pi_n^N$ and by using \eqref{diracProp}, we can write, \begin{align*} \int x_n \pi_n(x_n) \mbox{d}x_n &\approx \int x_n \pi_n^N(x_n) \mbox{d}x_n\\ &= \int x_n \sum_{i=1}^N w_n^{(i)} \delta_{X_n^{(i)}} (\mbox{d}x_n) \\ &= \sum_{i=1}^N w_n^{(i)} X_n^{(i)} \end{align*}

Problems with sequential importance sampling

There is a problem of SIS called as weight degeneracy. We can numerically show that as in the following figure, Fig. 1.
Figure 1. Estimates of the filtering distribution as $n$ increases.
Typically, as $n$ increases, probability distributions become degenerate. To overcome this problem, various resampling schemes are introduced. At each iteration, via a resampling scheme, particles become unweighted. This operation makes particles dependent but it is asymptotically consistent.

Sequential importance resampling

To overcome this problem, a resampling scheme is generally employed. Resampling means that, we sample from the empirical distribution and and use these samples as they are equally-weighted. This procedure partially prevents the weight degeneracy problem. The SIR algorithm can be written as in the Algorithm 3 [3].

Algorithm 3. (The SIR Algorithm)
  • At time $n = 1$
    • Sample $X_1^{(i)} \sim q(x_1)$
    • Find weights, \begin{align*} W_1(X_1^{(i)}) = \gamma_1(X_1^{(i)})/q_1(X_1^{(i)}) \end{align*} and normalise to find $w_1^{(i)}$.
    • Resample $\{w^{(1)},X_{1}^{(i)}\}$ and obtain $N$ equally-weighted particles $\left\lbrace \frac{1}{N},\bar{X}_{1}^{(i)} \right\rbrace$.
  • For time $n \geq 2$,
    • Sample $X_n^{(i)} \sim q_n(x_n | X^{(i)}_{1:n-1})$
    • Set $X_{1:n}^{(i)} = \left(\bar{X}_{1:n-1}^{(i)},X_n^{(i)}\right)$
    • Compute, \begin{align*} W_n(X^{(i)}_{1:n}) = W_{n-1}(X^{(i)}_{1:n-1}) \alpha_n(X^{(i)}_{1:n}) \end{align*} and normalise $w_n^{(i)} \propto W_n(X^{(i)}_{1:n})$.
    • Resample $\{w^{(i)},X_{1:n}^{(i)}\}$ and obtain $N$ equally-weighted particles $\left\lbrace \frac{1}{N},\bar{X}_{1:n}^{(i)} \right\rbrace$.
    • Approximate the distributions as, \begin{align*} \pi^N(\mbox{d}x_{1:n}) = \frac{1}{N} \sum_{i=1}^N \delta_{\bar{X}^{(i)}_{1:n}}(\mbox{d}x) \\ \pi^N(\mbox{d}x_n) = \frac{1}{N} \sum_{i=1}^N \delta_{\bar{X}^{(i)}_{n}}(\mbox{d}x) \end{align*}

The application of this algorithm to a state space model is straightforward. We provided a similar application for SIS in this post, see Algorithm 1 and 2.

References

1. Recent Developments in Auxiliary Particle Filtering, N. Whiteley, A.M. Johansen, in book Bayesian Time Series Models.

2. Monte Carlo Methods, Lecture notes, Adam M. Johansen and Ludger Evers (edited by Nick Whiteley), University of Bristol.

3. A Tutorial on Particle Filtering and Smoothing: Fifteen years later, A. Doucet, Adam M. Johansen.

4. On sequential Monte Carlo sampling methods for Bayesian filtering, A. Doucet, S. Godsill, C. Andrieu.

0 comments:

Post a Comment