Say that you have a dynamical process of interest $X_1,\ldots,X_n$ and you can only observe the process with some noise, i.e., you get an observation sequence $Y_1,\ldots,Y_n$. What is the optimal way to estimate $X_n$ conditioned on the whole sequence of observations $Y_{1:n}$?

To start from the bottom, suppose you have an observation $Y$ and you think that it is related to something you want to learn about (let's call that something $X$ as the uncle of Einstein would). There are many methods in science about how to go from $Y$ to $X$, e.g. you can solve an optimization problem. A Bayesian approach would be to specify a probability distribution over $X$, denoted $p(x)$, and the likelihood $p(y|x)$ which encodes how $X$ becomes observable to us through the observation process. Then, our curiosity about $X$ can be reframed as obtaining its posterior distribution $p(x|y)$ which is possible by the celebrated Bayes' formula,
\begin{align*}
p(x|y) = \frac{p(y|x)p(x)}{p(y)}.
\end{align*}
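To make the formula concrete, here is a tiny numeric sketch with a binary $X$ and a noisy binary observation $Y$; all the numbers are made up for illustration.

```python
# Bayes' rule for a binary X observed through a noisy binary Y.
# All probabilities below are illustrative, not from any real model.
prior = {0: 0.5, 1: 0.5}          # p(x)
likelihood = {0: 0.2, 1: 0.8}     # p(y = 1 | x)

# Suppose we observe y = 1.
# Evidence: p(y) = sum_x p(y|x) p(x)
evidence = sum(likelihood[x] * prior[x] for x in prior)
# Posterior: p(x|y) = p(y|x) p(x) / p(y)
posterior = {x: likelihood[x] * prior[x] / evidence for x in prior}
print(posterior)  # observing y = 1 shifts belief towards x = 1
```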
In the filtering setup, $X$ is assumed to be a changing, dynamic process, hence denoted $(X_n)_{n\in\mathbb{N}}$. Since $X_n$ changes with time $n$, you accordingly collect observations from this process at each time step $n$, denoted $(Y_n)_{n\in\mathbb{N}}$. The process can be depicted with the following graphical model.
\begin{align*}
\begin{array}{ccccccc} X_0 & \to & X_1 & \to & X_2 & \to & \dots \\
&& \downarrow && \downarrow \\
&& Y_1 && Y_2 && \dots
\end{array}
\end{align*}
As you can clearly see, $X_n$ is a Markov process, and the observations $Y_n$ are conditionally independent given the $X$ process. What filtering aims at is the most refined (optimal) information about $X_n$ given the whole sequence of observations $Y_{1:n}=y_{1:n}$. Ideally, this is the exact posterior distribution $p(x_n|y_{1:n})$. In order to have it, as you would guess, we have to specify two things: first, how $X_n$ evolves (the transition model), and second, how $Y_n$ and $X_n$ are related (the observation model).

Generally, such models are specified in the following way,
\begin{align*}
X_0 &\sim p(x_0), \\
X_n | \{X_{n-1} = x_{n-1}\} &\sim p(x_n | x_{n-1}), \\
Y_n | \{X_n = x_n\} &\sim p(y_n|x_n).
\end{align*}
The prior on $X_0$ is usually of little practical importance: there are strong results about what happens if you set it wrong, and as long as the filter is stable, it forgets its initial condition regardless of whether that condition was right or wrong. The transition model $p(x_n|x_{n-1})$ specifies how your process evolves (or rather *how you think* it evolves). For example, if you are working on object tracking, it is the model of the vehicle. If you work on weather and fluid dynamics, it is the (possibly stochastic) partial differential equation from fluid mechanics. If you try to land the Apollo module on the Moon, it is the controlled state process (you need an additional control input in that case). To summarise, if you work on X dynamics, it is how your X evolves.

The observation model $p(y_n|x_n)$ is a somewhat more concrete thing to have. You may have a model of the vehicle, but you can only observe it through some sensor signals. This model encodes that process: what you measure as a sensor signal and how it is related to the true position. Of course, this is subject to many uncertainties; for example, you may have parameters from the weather environment (e.g. connectivity) on the signal's way to the satellite and/or a vast amount of noise. You should encode all of it into $p(y_n|x_n)$. For weather dynamics, you will not be able to observe ocean currents or large eddies directly, even though you can have a pretty good idea about how they evolve (since it is physics). So the two distributions $p(x_n|x_{n-1})$ and $p(y_n|x_n)$ really complement each other: you need the information from both in order to estimate $x_n$ at any given time $n$.
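To make the model specification above tangible, here is a minimal simulation sketch of one hypothetical instance: a scalar random-walk state observed in Gaussian noise. All parameter values are illustrative.

```python
import numpy as np

# Simulate a toy state-space model:
#   X_0 ~ N(0, 1),  X_n = X_{n-1} + q * eps_n,  Y_n = X_n + r * nu_n
# where q and r are made-up noise standard deviations.
rng = np.random.default_rng(0)
n_steps, q, r = 100, 0.1, 0.5

x = np.zeros(n_steps)                  # latent states X_1, ..., X_n
y = np.zeros(n_steps)                  # observations  Y_1, ..., Y_n
x_prev = rng.normal(0.0, 1.0)          # draw X_0 from the prior p(x_0)
for n in range(n_steps):
    x[n] = x_prev + q * rng.normal()   # X_n | X_{n-1} ~ p(x_n | x_{n-1})
    y[n] = x[n] + r * rng.normal()     # Y_n | X_n     ~ p(y_n | x_n)
    x_prev = x[n]
```

The filtering problem is then: given only `y`, recover the best possible estimate of each `x[n]`.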
As I have written above, the goal here is to obtain $p(x_n|y_{1:n})$. One way to do this is to use Bayes' theorem iteratively. In order to have a recursive formula, we need to figure out how to obtain $p(x_n|y_{1:n})$ from $p(x_{n-1}|y_{1:n-1})$. Why? Because that's all we need: if we have a procedure for that, we start with $p(x_1)$ and obtain $p(x_1|y_1)$ using Bayes' theorem as I showed in the beginning, then continue to $p(x_2|y_{1:2})$ with "the method" (which I will explain in the following), which maps densities as $p(x_{n-1}|y_{1:n-1}) \mapsto p(x_n|y_{1:n})$, and so on.

There are two steps to doing this.

**(i) Prediction.** Given $p(x_{n-1}|y_{1:n-1})$, the usual way is to first obtain the predictive distribution of the state $p(x_n|y_{1:n-1})$ using the following relationship, \begin{align*} p(x_n|y_{1:n-1}) = \int p(x_n|x_{n-1}) p(x_{n-1}|y_{1:n-1}) \mbox{d}x_{n-1}. \end{align*} The name of this is as fancy as it gets: the Chapman-Kolmogorov equation. At its core, it uses the conditional independence structure of the model and the marginalisation rule, nothing more. All it does is integrate out the variable $x_{n-1}$ from the joint distribution $p(x_n,x_{n-1}|y_{1:n-1})$, which can be factorised as in the integral because of the conditional independence. Nevertheless, recall that having a formula doesn't mean we can compute it. We have to compute the integral, and in most cases, that is impossible.
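For a finite state space the Chapman-Kolmogorov integral becomes a sum, i.e. a matrix-vector product, which makes the structure easy to see. The two-state transition matrix below is a made-up example.

```python
import numpy as np

# Prediction step for a two-state Markov chain:
# T[i, j] = p(x_n = i | x_{n-1} = j), columns sum to one.
T = np.array([[0.9, 0.3],
              [0.1, 0.7]])
filt_prev = np.array([0.5, 0.5])   # p(x_{n-1} | y_{1:n-1})

# Chapman-Kolmogorov: sum over x_{n-1} of p(x_n|x_{n-1}) p(x_{n-1}|y_{1:n-1})
pred = T @ filt_prev               # p(x_n | y_{1:n-1})
print(pred)                        # still a valid probability vector
```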

**(ii) Update.** The update step takes the observation into account. You have the predictive distribution from above, which is your "guess" about what could happen. Now you need to correct your prediction using the data. This means we need to obtain $p(x_n|y_{1:n})$ from $p(x_n|y_{1:n-1})$. This is possible (of course) by the use of Bayes' theorem, \begin{align*} p(x_n|y_{1:n}) = p(x_n|y_{1:n-1}) \frac{p(y_n|x_n)}{p(y_n|y_{1:n-1})}. \end{align*} Why did I call this "the use of Bayes' theorem"? If you look at what I described as Bayes' theorem above, this relationship might not be obvious. Try to look at it this way, \begin{align*} p(x_n|y_n,y_{1:n-1}) = p(x_n|y_{1:n-1}) \frac{p(y_n|x_n)}{p(y_n|y_{1:n-1})}, \end{align*} and now *ignore* all the $y_{1:n-1}$ terms. Try to write it down without $y_{1:n-1}$, and you will see that it is in fact Bayes' rule, just conditioned on $y_{1:n-1}$.
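Continuing the finite-state sketch, the update step multiplies the predictive probabilities by the likelihood of the new observation and renormalises; the likelihood values below are again made up.

```python
import numpy as np

# Update step for a two-state model.
pred = np.array([0.6, 0.4])   # p(x_n | y_{1:n-1}) from the prediction step
lik = np.array([0.2, 0.9])    # p(y_n | x_n = i) evaluated at the observed y_n

unnorm = lik * pred
# The normaliser is exactly p(y_n | y_{1:n-1}) in the formula above.
filt = unnorm / unnorm.sum()  # p(x_n | y_{1:n})
print(filt)                   # the state that explains y_n better gains mass
```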

You might ask the question why this has a prediction-correction structure. Why do we need to predict first and then update? I have no answer beyond that this is a possible way to do it.

Now, we have the recipe. Let's see what we can do with that.

If the model is linear and Gaussian, all these computations are analytically possible. This is what makes the Kalman filter special. Suppose we have the following probabilistic model (a stochastic linear dynamical system),
\begin{align*}
X_0 &\sim \mathcal{N}(x_0;\mu_0,P_0), \\
X_n | \{X_{n-1} = x_{n-1}\} &\sim \mathcal{N}(x_n;A x_{n-1},Q), \\
Y_n | \{X_n = x_n\} &\sim \mathcal{N}(y_n;C x_n,R).
\end{align*}
Collecting $Y_{1:n} = y_{1:n}$, we are naturally after the posterior distribution $p(x_n|y_{1:n})$. If you bravely compute the integral and update rule above, you will see that they can be carried out analytically. The resulting posterior $p(x_n|y_{1:n})$ is Gaussian, so it can be parameterised by its mean and covariance. So instead of updating the probability measure (which is not at all obvious how to do), all you need is to update its mean and covariance recursively, since they carry the whole information together with the functional form encoded by the Gaussian density (this is called finite-dimensional filtering, as you do it by updating finite-dimensional sufficient statistics). If you denote the posterior by $p(x_n|y_{1:n}) = \mathcal{N}(x_n;\mu_n,P_n)$, the prediction step is given by,
\begin{align*}
\mu_{n|n-1} &= A \mu_{n-1}, \\
P_{n|n-1} &= A P_{n-1} A^\top + Q.
\end{align*}
Then the update step is the following,
\begin{align*}
\mu_n &= \mu_{n|n-1} + P_{n|n-1} C^\top (C P_{n|n-1} C^\top + R)^{-1} (y_n - C \mu_{n|n-1}), \\
P_n &= P_{n|n-1} - P_{n|n-1} C^\top (C P_{n|n-1} C^\top + R)^{-1} C P_{n|n-1}.
\end{align*}
Notice that the prediction step contains neither $y_n$ nor any parameter of the observation model $p(y_n|x_n)$; it only involves the parameters of the transition model $p(x_n|x_{n-1})$. The reverse is true for the update step: it only involves the parameters of the observation model $p(y_n|x_n)$ and none from the transition model.
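The recursions above translate almost line by line into code. Here is a minimal sketch, run on a scalar random-walk model ($A = C = 1$) with illustrative noise levels.

```python
import numpy as np

def kalman_filter(ys, A, C, Q, R, mu0, P0):
    """Run the Kalman filter recursions; returns the filtering means."""
    mu, P = mu0, P0
    means = []
    for y in ys:
        # Prediction: only the transition model (A, Q) appears here.
        mu_pred = A @ mu
        P_pred = A @ P @ A.T + Q
        # Update: only the observation model (C, R) appears here.
        S = C @ P_pred @ C.T + R                 # innovation covariance
        K = P_pred @ C.T @ np.linalg.inv(S)      # Kalman gain
        mu = mu_pred + K @ (y - C @ mu_pred)
        P = P_pred - K @ C @ P_pred
        means.append(mu.copy())
    return np.array(means)

# Usage: filter noisy observations of a scalar random walk.
rng = np.random.default_rng(1)
A = C = np.eye(1)
Q, R = 0.01 * np.eye(1), 0.25 * np.eye(1)
xs = np.cumsum(0.1 * rng.normal(size=50))        # latent trajectory
ys = xs + 0.5 * rng.normal(size=50)              # noisy observations
means = kalman_filter(ys.reshape(-1, 1), A, C, Q, R, np.zeros(1), np.eye(1))
```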

So let's conclude with a note about particle filters. What do particle filters do? They approximate the posterior distribution $p(x_n|y_{1:n})$ when $p(x_n|x_{n-1})$ and $p(y_n|x_n)$ are specified in general ways, so, essentially, when it is not possible to perform the computations above. Don't be mad at me, but most particle filters don't use the above recursions; they use something else. However, conceptually, this post should be helpful for understanding what they are trying to do.
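To give a flavour, here is a sketch of one of the simplest variants, the bootstrap particle filter, for the scalar random-walk model with made-up noise levels: particles are propagated through $p(x_n|x_{n-1})$, weighted by $p(y_n|x_n)$, and resampled.

```python
import numpy as np

def bootstrap_pf(ys, n_particles=500, q=0.1, r=0.5, seed=0):
    """Bootstrap particle filter for a scalar random walk observed in noise."""
    rng = np.random.default_rng(seed)
    particles = rng.normal(0.0, 1.0, n_particles)     # samples from p(x_0)
    means = []
    for y in ys:
        # Propagate each particle through the transition model.
        particles = particles + q * rng.normal(size=n_particles)
        # Weight by the Gaussian observation likelihood p(y | x).
        logw = -0.5 * ((y - particles) / r) ** 2
        w = np.exp(logw - logw.max())
        w /= w.sum()
        means.append(np.sum(w * particles))           # posterior-mean estimate
        # Multinomial resampling to avoid weight degeneracy.
        idx = rng.choice(n_particles, size=n_particles, p=w)
        particles = particles[idx]
    return np.array(means)

ys = np.array([0.1, 0.3, 0.2, 0.5])                   # a few dummy observations
print(bootstrap_pf(ys))
```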

## 3 comments:

I think you could show the sequential nature of the Bayes' theorem and from that, show the Kalman filter as a special case of it. There are other examples of sequential inference in reliability analysis with some uncommon probabilistic models.

Hi. I did not understand the first part of the comment. But I'd love to see other analytically tractable non-Gaussian models, if there are any.

Awesome post.
