## Introduction.

In this post, I derive the nonnegative matrix factorization (NMF) algorithm proposed by Lee and Seung (1999). I derive the multiplicative updates from a gradient descent point of view, following the treatment in their later NIPS paper, Algorithms for Non-negative Matrix Factorization. The code for this blog post can be accessed from here.

## Nonnegative matrix factorization.

NMF aims to factorize a data matrix $X \in \bR^{M\times N}_+$ as, \begin{align*} X \approx WH, \end{align*} where $W \in \bR^{M\times r}_+$ and $H \in \bR^{r \times N}_+$. Here $r$ is the approximation rank. The factorization simply aims at finding a low-dimensional space to represent the data. To gain intuition, consider the following factorization, \begin{align*} \underbrace{\begin{bmatrix} \times & \times & \times & \times & \times \\ \times & \times & \times & \times & \times \\ \times & \times & \times & \times & \times \end{bmatrix}}_{X} = \underbrace{\begin{bmatrix} \times & \times \\ \times & \times \\ \times & \times \end{bmatrix}}_{W} \underbrace{\begin{bmatrix} \times & \times & \times & \times & \times \\ \times & \times & \times & \times & \times \end{bmatrix}}_{H} \end{align*} This is a rank 2 approximation of the data matrix $X$.
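To make the shapes concrete, here is a minimal NumPy sketch of such a rank-2 factorization. The matrix sizes match the diagram above; the entries are arbitrary random numbers chosen only to illustrate the dimensions:

```python
import numpy as np

rng = np.random.default_rng(0)

M, N, r = 3, 5, 2       # data is 3x5, approximation rank is 2
W = rng.random((M, r))  # nonnegative 3x2 factor
H = rng.random((r, N))  # nonnegative 2x5 factor

X_approx = W @ H        # rank-2 approximation, shape 3x5
print(X_approx.shape)
print(np.linalg.matrix_rank(X_approx))
```

The product $WH$ can never have rank larger than $r$, which is why this is called a rank-$r$ approximation.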
To compute such factorizations, standard methods minimize some distance measure $D(X \mid WH)$ with respect to $W$ and $H$. The general approach is to minimize over $W$ and $H$ separately, which corresponds to a block-coordinate descent scheme. In this post, I review only the simplest case, the Frobenius norm: $D(X \mid WH) = \| X - WH \|^2_F$ where $\| A - B\|^2_F = \sum_{ij} (A_{ij} - B_{ij})^2$. This cost corresponds to assuming Gaussian noise in the data.

Formally, the NMF problem corresponds to the following optimization problem for our case,
\begin{align*}
\{W^*, H^* \} = \argmin_{W\geq 0,H \geq 0} \frac{1}{2} \| X - WH \|^2_F
\end{align*}
Notice that we have nonnegativity constraints, which are essential. Minimizing this cost function jointly with respect to $W$ and $H$ is a non-convex problem, so the minimization is usually carried out for $W$ and $H$ separately at each iteration; the cost is convex in each factor when the other is held fixed. A standard method to minimize such functionals is gradient descent. To perform the gradient steps in a coordinate descent setting, we need $\nabla_W \frac{1}{2} \| X - WH \|^2_F$ and $\nabla_H \frac{1}{2} \| X - WH \|^2_F$.

## Gradient Descent Approach.

First of all, note that, \begin{align*} \| A \|_F^2 = \Tr A^T A \end{align*} where $\Tr$ is the trace operator.
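This identity is easy to verify numerically; a quick sketch with an arbitrary matrix:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((4, 3))

frob_sq = np.sum(A**2)          # ||A||_F^2 as a sum of squared entries
trace_form = np.trace(A.T @ A)  # Tr(A^T A)
print(frob_sq, trace_form)      # the two agree up to floating-point error
```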
To find the update rule for $W$, we treat $H$ as fixed. Let us calculate,
\begin{align*}
\nabla_W \frac{1}{2} \| X - WH \|^2_F &= \frac{1}{2} \nabla_W \Tr \left[ (X^T - H^T W^T) (X - WH) \right] \\
&= \frac{1}{2} \nabla_W \Tr \left[ X^T X - X^T WH - H^T W^T X + H^T W^T W H\right] \\
&= \frac{1}{2} \nabla_W \Tr \left[ - X^T WH - H^T W^T X + H^T W^T W H\right] \\
&= - X H^T + W H H^T
\end{align*}
The last equality follows from standard matrix derivative identities, namely $\nabla_W \Tr (AW) = A^T$ and $\nabla_W \Tr (W A W^T) = W(A + A^T)$. So, the following update will reduce the distance measure,
\begin{align}\label{additiveupdateW}
W_{ij} \leftarrow W_{ij} + \eta_{ij} (X H^T - W H H^T)_{ij}
\end{align}
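The gradient just derived can be sanity-checked against a finite-difference approximation. A small sketch (the shapes, seed, and tolerance are arbitrary choices of mine):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.random((3, 5))
W = rng.random((3, 2))
H = rng.random((2, 5))

def cost(W_):
    # the Frobenius cost 0.5 * ||X - WH||_F^2 with H fixed
    return 0.5 * np.sum((X - W_ @ H) ** 2)

grad = -X @ H.T + W @ H @ H.T  # closed-form gradient w.r.t. W

# finite-difference approximation of d cost / d W[0, 0]
eps = 1e-6
Wp = W.copy()
Wp[0, 0] += eps
fd = (cost(Wp) - cost(W)) / eps
print(grad[0, 0], fd)  # the two should agree closely
```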
This update rule for $W$ does not ensure nonnegativity; in the literature, projected gradient algorithms are used to handle this. The classical multiplicative update rules instead follow from a special choice of the step sizes $\eta_{ij}$. To see this, let us choose,
\begin{align*}
\eta_{ij} = \frac{W_{ij}}{(W H H^T)_{ij}}
\end{align*}
Therefore, the update rule \eqref{additiveupdateW} becomes,
\begin{align}\label{updateW}
W_{ij} \leftarrow W_{ij} \frac{(X H^T)_{ij}}{(W H H^T)_{ij}}
\end{align}
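In code, this multiplicative update for $W$ is a single elementwise operation. A sketch under arbitrary random data; the small constant in the denominator is my own addition to guard against division by zero:

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.random((3, 5))
W = rng.random((3, 2))
H = rng.random((2, 5))

def cost(W_):
    return 0.5 * np.sum((X - W_ @ H) ** 2)

eps = 1e-12  # guard against division by zero
W_new = W * (X @ H.T) / (W @ H @ H.T + eps)

# elementwise multiplication by a nonnegative ratio preserves nonnegativity
print(np.all(W_new >= 0))
print(cost(W_new) <= cost(W))  # the update does not increase the cost
```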
Just as we derived this update rule, let us find the update rule for $H$ by fixing $W$. Then,
\begin{align*}
\nabla_H \frac{1}{2} \| X - WH \|^2_F &= \frac{1}{2} \nabla_H \Tr \left[ (X^T - H^T W^T) (X - WH) \right] \\
&= \frac{1}{2} \nabla_H \Tr \left[ X^T X - X^T WH - H^T W^T X + H^T W^T W H\right] \\
&= \frac{1}{2} \nabla_H \Tr \left[ - X^T WH - H^T W^T X + H^T W^T W H\right] \\
&= -W^T X + W^T W H
\end{align*} Eventually, we have the following gradient descent update,
\begin{align}\label{additiveupdateH} H_{ij} \leftarrow H_{ij} + \mu_{ij} (W^T X - W^T W H)_{ij}
\end{align}
This time we'll choose,
\begin{align*}
\mu_{ij} = \frac{H_{ij}}{(W^T W H)_{ij}}
\end{align*}
and then, substituting this into \eqref{additiveupdateH}, we obtain the update rule for $H$ as follows.
\begin{align}\label{updateH}
H_{ij} \leftarrow H_{ij} \frac{(W^T X)_{ij}}{(W^T W H)_{ij}}
\end{align}
So the NMF algorithm is simply as follows: read the data matrix $X$, initialize $W$ and $H$ randomly with nonnegative entries, and then iteratively perform the updates \eqref{updateW} and \eqref{updateH}. You can find an example implementation on GitHub.
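Putting both updates together, here is a minimal sketch of the full algorithm. The random test data, the fixed iteration count, and the small `eps` guard against division by zero are my own choices, not part of the original papers:

```python
import numpy as np

def nmf(X, r, n_iter=200, seed=0, eps=1e-12):
    """Factorize X ~ W H with multiplicative updates (Frobenius cost)."""
    rng = np.random.default_rng(seed)
    M, N = X.shape
    W = rng.random((M, r))  # random nonnegative initialization
    H = rng.random((r, N))
    for _ in range(n_iter):
        # multiplicative update for H, then for W
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
    return W, H

rng = np.random.default_rng(42)
X = rng.random((30, 20))
W, H = nmf(X, r=5)
print(np.linalg.norm(X - W @ H))  # reconstruction error after 200 iterations
```

Because each multiplicative update is nonincreasing in the cost, running more iterations can only improve (or maintain) the reconstruction error for a given initialization.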

A convergence proof of this algorithm is given in the reference: Lee and Seung, Algorithms for Non-negative Matrix Factorization.

Last updated:

*9 November 2017*.