## Introduction.

In this post, we show the relationship between Gaussian observation model, Least-squares and pseudoinverse. We start with a Gaussian observation model and then move to the least-squares estimation. Then we show that the solution of the least-squares corresponds to the pseudoinverse operation.

## Gaussian observation model.

Let $A$ be a matrix, and $y$ and $x$ are vectors each of which are of proper dimensions. Assume, we have the following observation model, \begin{align} y = Ax + \eta \end{align} From the Bayesian view of inverse problems, if we assume $\eta$ is Gaussian, estimating $x$ corresponds to minimisation of a quadratic cost function. We can show that easily. Notice that, if we assume $\eta \sim \mathcal{N}(0,I)$ where $I$ is an identity covariance matrix, we can write the following, \begin{align*} p(y|x) = \mathcal{N}(y;Ax,I) \end{align*} which assumes $A$ is known. Recall that, our aim is to find $x$ given this model. So, we seek an $x$ which maximises the likelihood, \begin{align*} x^* = \argmax_x p(y|x) \end{align*} The rationale behind this procedure is that, we see $x$ as a parameter and we seek for a parameter which best explains the data. One can see that, \begin{align} x^* &= \argmax_x p(y|x) \nonumber \\ &= \argmax_x \log p(y|x) \nonumber \\ &= \argmax_x \log \mathcal{N}(y;Ax,I) \label{loglik} \end{align}

## From Gaussian observation model to Least squares.

We know, \begin{align*} \mathcal{N}(y;Ax,I) = \frac{1}{\sqrt{(2\pi)^k}} \exp\left(-\frac{1}{2} (y-Ax)^T (y-Ax) \right) \end{align*} Recall, our aim is to find the $x$ which maximises $\log \mathcal{N}(y;Ax,I)$. Then if we try to rewrite the Eq. \eqref{loglik}, we arrive \begin{align*} x^* &= \argmin_x \frac{1}{2} \|y - Ax\|_2^2 \end{align*} How do we come to this? You can see this previous post which includes a similar derivation. Hints: By definition, $\|y - Ax\|_2 = \sqrt{(y-Ax)^T (y-Ax)}$. Then, $(y-Ax)^T (y-Ax)$ is $\|y - Ax\|_2^2$. Other terms of Gaussian do not depend on $x$, hence they are not important for our optimisation problem.

## From Least squares to Pseudoinverse.

We showed that, maximizing likelihood is equivalent to solving least squares problem. Let us proceed to the solution. Recall, we want to solve, \begin{align*} x^* &= \argmin_x \frac{1}{2} \|y - Ax\|_2^2 \end{align*} Then, it can be written as, \begin{align*} x^* &= \argmin_x \frac{1}{2} \|y - Ax\|_2^2 \\ &= \argmin_x (y-Ax)^T (y-Ax) \\ &= \argmin_x y^T y - y^T A x - x^T A^T y + x^T A^T A x \\ &= \argmin_x - y^T A x - x^T A^T y + x^T A^T A x \end{align*} So, basically we have a cost, \begin{align*} J(x) = - y^T A x - x^T A^T y + x^T A^T A x \end{align*} At this point, the trick is the following: Notice that, each term of $J(x)$ are scalars. Then, they equal to their traces since the trace of a scalar is the scalar itself. Then, we write, \begin{align*} J(x) = \operatorname{Tr}(x^T A^T A x) - \operatorname{Tr}(y^T A x) - \operatorname{Tr}(x^T A^T y) \end{align*} Traces have wonderful properties. We use the following properties of them, \begin{align*} \operatorname{Tr}(A) &= \operatorname{Tr}(A^T) \\ \operatorname{Tr}(AB) &= \operatorname{Tr}(BA) \end{align*} and note that the last equality holds for all cyclic permutations. To minimize $J(x)$, we need to take derivative of it. So, we have to know derivative of traces also - we have the following equalities from the linear algebra, \begin{align*} \frac{\partial \operatorname{Tr}(Ax)}{\partial x} &= A^T \\ \frac{\partial \operatorname{Tr}(Ax^T)}{\partial x} &= A \\ \frac{\partial \operatorname{Tr}(A x x^T)}{\partial x} &= Ax + A^T x \end{align*} Using these rules, we would like to differentiate, \begin{align*} J(x) = \operatorname{Tr}(x^T A^T A x) - \operatorname{Tr}(y^T A x) - \operatorname{Tr}(x^T A^T y) \end{align*} and find $x^*$ which makes $\partial J(x)/{\partial x} = 0$. Then, \begin{align*} \frac{\partial J(x)}{\partial x} &= \frac{\partial (\operatorname{Tr}(x^T A^T A x) - \operatorname{Tr}(y^T A x) - \operatorname{Tr}(x^T A^T y))}{\partial x} \\ &= 2A^T A x - 2 A^T y \end{align*} Let us set this equal to zero, \begin{align*} 2A^T A x = 2 A^T y \end{align*} hence, \begin{align*} x^* = (A^T A)^{-1} A^T y \end{align*} The term $(A^T A)^{-1} A^T$ is called pseudoinverse of $A$.

## Conclusion.

We showed that the LS estimation procedure underlies a certain statistical assumption. The second part of the post is about the solution of the LS problem which clarifies that pseudoinverse is a solution to a special inverse problem. We hope these connections will give insights to the beginners. 