Gaussianity, Least squares, Pseudoinverse


In this post, we show the relationship between the Gaussian observation model, least-squares estimation, and the pseudoinverse. We start with a Gaussian observation model and move to least-squares estimation. We then show that the least-squares solution corresponds to the pseudoinverse operation.

Gaussian observation model.

Let $A$ be a matrix, and let $y$ and $x$ be vectors of appropriate dimensions. Assume we have the following observation model, \begin{align} y = Ax + \eta \end{align} From the Bayesian view of inverse problems, if we assume $\eta$ is Gaussian, estimating $x$ corresponds to the minimisation of a quadratic cost function. We can show this easily. Notice that if we assume $\eta \sim \mathcal{N}(0,I)$, where $I$ is the identity covariance matrix, we can write the following, \begin{align*} p(y|x) = \mathcal{N}(y;Ax,I) \end{align*} which assumes $A$ is known. Recall that our aim is to find $x$ given this model, so we seek the $x$ which maximises the likelihood, \begin{align*} x^* = \argmax_x p(y|x) \end{align*} The rationale behind this procedure is that we view $x$ as a parameter and look for the parameter which best explains the data. One can see that \begin{align} x^* &= \argmax_x p(y|x) \nonumber \\ &= \argmax_x \log p(y|x) \nonumber \\ &= \argmax_x \log \mathcal{N}(y;Ax,I) \label{loglik} \end{align}
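The observation model above is easy to simulate. The following sketch draws a random $A$, picks an $x$, and generates $y = Ax + \eta$ with unit-covariance Gaussian noise; the dimensions and the particular $x$ are hypothetical choices for illustration, not part of the model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions for illustration: m observations, n unknowns.
m, n = 5, 2
A = rng.standard_normal((m, n))     # known forward operator
x_true = np.array([1.0, -2.0])      # the parameter we will try to recover
eta = rng.standard_normal(m)        # eta ~ N(0, I): unit-covariance noise
y = A @ x_true + eta                # the observation model y = A x + eta
```

Estimating $x_\text{true}$ from $y$ and $A$ alone is exactly the inverse problem discussed in the rest of the post.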

From Gaussian observation model to Least squares.

We know, \begin{align*} \mathcal{N}(y;Ax,I) = \frac{1}{\sqrt{(2\pi)^k}} \exp\left(-\frac{1}{2} (y-Ax)^T (y-Ax) \right) \end{align*} where $k$ is the dimension of $y$. Recall that our aim is to find the $x$ which maximises $\log \mathcal{N}(y;Ax,I)$. Rewriting Eq. \eqref{loglik}, we arrive at \begin{align*} x^* &= \argmin_x \frac{1}{2} \|y - Ax\|_2^2 \end{align*} How do we get here? You can see this previous post, which includes a similar derivation. Hint: by definition, $\|y - Ax\|_2 = \sqrt{(y-Ax)^T (y-Ax)}$, so $(y-Ax)^T (y-Ax) = \|y - Ax\|_2^2$. The other terms of the Gaussian do not depend on $x$, hence they do not matter for the optimisation problem.

From Least squares to Pseudoinverse.

We showed that maximising the likelihood is equivalent to solving a least-squares problem. Let us proceed to the solution. Recall, we want to solve \begin{align*} x^* &= \argmin_x \frac{1}{2} \|y - Ax\|_2^2 \end{align*} Dropping the constant factor $\frac{1}{2}$, which does not change the minimiser, this can be written as \begin{align*} x^* &= \argmin_x \frac{1}{2} \|y - Ax\|_2^2 \\ &= \argmin_x (y-Ax)^T (y-Ax) \\ &= \argmin_x y^T y - y^T A x - x^T A^T y + x^T A^T A x \\ &= \argmin_x - y^T A x - x^T A^T y + x^T A^T A x \end{align*} So we have the cost \begin{align*} J(x) = - y^T A x - x^T A^T y + x^T A^T A x \end{align*} At this point, the trick is the following: notice that each term of $J(x)$ is a scalar. Then each equals its trace, since the trace of a scalar is the scalar itself. So we write \begin{align*} J(x) = \operatorname{Tr}(x^T A^T A x) - \operatorname{Tr}(y^T A x) - \operatorname{Tr}(x^T A^T y) \end{align*} Traces have wonderful properties. We use the following, \begin{align*} \operatorname{Tr}(A) &= \operatorname{Tr}(A^T) \\ \operatorname{Tr}(AB) &= \operatorname{Tr}(BA) \end{align*} and note that the last equality holds for all cyclic permutations. To minimise $J(x)$, we need to take its derivative, so we also need the derivatives of traces. We have the following identities from linear algebra, \begin{align*} \frac{\partial \operatorname{Tr}(Ax)}{\partial x} &= A^T \\ \frac{\partial \operatorname{Tr}(Ax^T)}{\partial x} &= A \\ \frac{\partial \operatorname{Tr}(A x x^T)}{\partial x} &= Ax + A^T x \end{align*} Using these rules, we would like to differentiate \begin{align*} J(x) = \operatorname{Tr}(x^T A^T A x) - \operatorname{Tr}(y^T A x) - \operatorname{Tr}(x^T A^T y) \end{align*} and find the $x^*$ which makes $\partial J(x)/{\partial x} = 0$. 
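The algebraic steps above (the quadratic expansion, the trace of a scalar, and the cyclic property) can all be verified numerically. The matrices below are random illustrative instances, not anything specific from the derivation.

```python
import numpy as np

rng = np.random.default_rng(2)
m, n = 4, 3
A = rng.standard_normal((m, n))
x = rng.standard_normal(n)
y = rng.standard_normal(m)

# The quadratic expansion of (y - Ax)^T (y - Ax):
lhs = (y - A @ x) @ (y - A @ x)
rhs = y @ y - y @ A @ x - x @ A.T @ y + x @ A.T @ A @ x

# The trace trick: x^T A^T A x is a scalar, and the trace of a
# 1x1 matrix is that scalar itself.
t = np.trace(np.atleast_2d(x @ A.T @ A @ x))

# The cyclic property Tr(AB) = Tr(BA):
B = rng.standard_normal((n, m))
cyc = np.trace(A @ B) - np.trace(B @ A)
```

Here `lhs` and `rhs` agree to floating-point precision, `t` recovers the scalar, and `cyc` is numerically zero.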
Then, \begin{align*} \frac{\partial J(x)}{\partial x} &= \frac{\partial (\operatorname{Tr}(x^T A^T A x) - \operatorname{Tr}(y^T A x) - \operatorname{Tr}(x^T A^T y))}{\partial x} \\ &= 2A^T A x - 2 A^T y \end{align*} Setting this equal to zero, \begin{align*} 2A^T A x = 2 A^T y \end{align*} hence, \begin{align*} x^* = (A^T A)^{-1} A^T y \end{align*} The matrix $(A^T A)^{-1} A^T$ is called the pseudoinverse of $A$. It coincides with the Moore-Penrose pseudoinverse whenever $A^T A$ is invertible, i.e., whenever $A$ has full column rank.
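As a sanity check, the normal-equations solution $(A^T A)^{-1} A^T y$ should agree with NumPy's built-in Moore-Penrose pseudoinverse and least-squares solver. The tall random $A$ below is an illustrative choice that has full column rank with probability one.

```python
import numpy as np

rng = np.random.default_rng(3)
m, n = 6, 3                        # tall A: more observations than unknowns
A = rng.standard_normal((m, n))    # full column rank with probability one
y = rng.standard_normal(m)

# Normal-equations solution (A^T A)^{-1} A^T y, computed via solve()
# rather than an explicit inverse for numerical stability:
x_normal = np.linalg.solve(A.T @ A, A.T @ y)

# NumPy's Moore-Penrose pseudoinverse and least-squares solver:
x_pinv = np.linalg.pinv(A) @ y
x_lstsq, *_ = np.linalg.lstsq(A, y, rcond=None)
```

All three vectors agree to floating-point precision, confirming that the pseudoinverse solves this least-squares problem.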


We showed that the LS estimation procedure rests on a particular statistical assumption. The second part of the post solved the LS problem, clarifying that the pseudoinverse is the solution to a particular inverse problem. We hope these connections give beginners some insight.

