Introduction
Recent technological developments have resulted in ever-increasing quantities of data, rapidly overwhelming many of the classical analysis tools available. Processing such massive amounts of data requires powerful methods for their representation and dimension reduction.
Several real-world applications frequently deal with nonnegative data. Examples of nonnegative data include pixel intensities composing an image, amplitude spectra, occurrence counts of a particular event, scores achieved by users, and stock market values.
However, it is not advisable to use popular matrix factorizations (e.g., singular value decomposition, SVD) and dimension reduction tools (e.g., factor analysis, principal component analysis, and cluster analysis) on such data, since they do not guarantee that nonnegativity is preserved.
Despite its various strengths (e.g., optimality property, fast and robust computation, unique factorization, and orthogonality), SVD does not reveal anything about the data collection, since its factors lack interpretability. The nonnegativity constraint makes many of the classical techniques inapplicable to the present case (Press et al. 1989) and has opened the way to the development of nonnegative matrix factorization (NMF).
NMF first appeared in slightly different flavors: in Jeter and Pye (1981), under the name of nonnegative rank factorization, and then in Paatero and Tapper (1994), under the historically unfortunate name of positive matrix factorization. However, it was only after the publication of the article “Learning the parts of objects by non-negative matrix factorization” by Lee and Seung (1999) that this technique gathered the credit and attention it deserved.
Nowadays, NMF stands out as one of the most popular low-rank approximations used for compression, visualization, feature selection, and noise filtering. It replaces the original data with a lower-dimensional representation obtained via subspace approximation. That is, it constructs a set of basis elements such that the linear space they span approximates the data points as closely as possible. Such basis elements can then be used for identification and classification, which makes NMF a powerful unsupervised learning technique that automatically clusters similar elements into groups. This also makes it a valuable alternative to common clustering algorithms, like \(k\)-means.
NMF can also be regarded as a possible substitute for principal component analysis in the context of feature learning, for hidden Markov models in the field of temporal segmentation, for independent component analysis in the area of filtering and source separation, and for vector quantization in coding applications. By providing a low-rank approximation, the nonnegative decomposition improves storage efficiency, since the low-rank factors require less memory than the original matrix. On top of that, it provides a cleaner and more efficient representation of the relationships between data elements.
The fields where NMF has been successfully applied for the analysis of high dimensional data are countless. Some examples include:
- Image processing to recognize the main traits of a face;
- Text mining to recover the main topics of a set of documents;
- Bioinformatics and computational biology for molecular pattern discovery;
- Signal processing;
- Air emission control;
- Blind-source separation to separate voices in speech mixtures, to separate voice from background, or to separate a singing voice from accompaniment or musical instruments in polyphonic mixtures;
- Music analysis and transcription to recognize musical notes played by piano, drums, or multiple instruments;
- Collaborative filtering;
- Community detection;
- Portfolio diversification.
Some fields of application of nonnegative matrix factorization.
Many other disciplines are expected to employ NMF in the near future due to its innate ability to automatically extract sparse, meaningful, and easily-interpretable features from a set of nonnegative data.
The Method
Nonnegative matrix factorization decomposes multivariate data (arranged in a matrix-like format) by searching for the latent features that are presumed to be responsible for the generation of the directly observable variables (Nakayama and Shimojo 1992). One suitable way to characterize the interrelationships between multiple variables is to assume that each element of the data matrix is a score obtained by a specific entity as a linearly weighted combination of several factors.
Specifically, a set of multivariate \(n\)-dimensional nonnegative data vectors are placed in the columns of an \(n\times m\) matrix \(\boldsymbol V\) (target matrix), where \(m\) is the number of entities. The target matrix is then approximately factorized into the product of two nonnegative factor matrices, an \(n \times r\) matrix \(\boldsymbol W\) and an \(r \times m\) matrix \(\boldsymbol H\), such that:
\[\begin{equation*} \boldsymbol V \approx \boldsymbol W \boldsymbol H \qquad \text{with } (\boldsymbol W, \boldsymbol H) \ge 0, \end{equation*}\]
where the columns in \(\boldsymbol V = \Big[ \boldsymbol v_1 \ldots \boldsymbol v_{\mu} \ldots \boldsymbol v_m \Big]\) represent the data points, those in \(\boldsymbol W = \Big[ \boldsymbol w_1 \ldots \boldsymbol w_{a} \ldots \boldsymbol w_r \Big]\) the latent features, and those in \(\boldsymbol H = \Big[ \boldsymbol h_1 \ldots \boldsymbol h_{\mu} \ldots \boldsymbol h_m \Big]\) the coordinates of each data point in the factor matrix \(\boldsymbol W\).
The rank \(r\) of the factorization is generally chosen to be smaller than \(n\) or \(m\) (i.e., \(r \ll \min(n,m)\)), and such that \((n+m)r < nm\), so that \(\boldsymbol W\) and \(\boldsymbol H\) are smaller than \(\boldsymbol V\) and hence their product can be regarded as a compressed form of the original data matrix. In this way, the information contained in \(\boldsymbol V\) can be summarized and split into \(r\) factors.
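For instance, with \(n = 1000\) variables, \(m = 500\) entities, and rank \(r = 10\), the two factors require \((n+m)r = 15{,}000\) stored values instead of the \(nm = 500{,}000\) values of the original matrix.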
This parameter plays a crucial role: too high a value could result in potentially serious overfitting problems, whereas too small a value could lead to a bad representation of the data.
The matrix \(\boldsymbol W\) is called basis matrix since its column vectors approximately form a basis for the vector space spanned by the columns of \(\boldsymbol V\). Since relatively few basis vectors are used to represent many data vectors, a good approximation can only be achieved if the factors discover the true latent structure present in the data.
The matrix \(\boldsymbol H\) is called mixture coefficients matrix since its column vectors provide the weights that are used in the linear combination of the columns of the basis matrix to reconstruct the target matrix.
The NMF approximation statement can be equivalently rewritten elementwise as
\[\begin{equation*} v_{i\mu } \approx (\boldsymbol W \boldsymbol H)_{i\mu} = \sum_{a=1}^{r} w_{ia} h_{a\mu}, \qquad \text{for } i=1, \ldots, n, \text{ and } \mu=1, \ldots, m, \end{equation*}\]
where the element \(v_{i\mu}\) of the target matrix is the score obtained by entity \(\mu\) on variable \(i\), \(w_{ia}\) is the loading of variable \(i\) on factor \(a\), and \(h_{a\mu}\) is the response of entity \(\mu\) to factor \(a\).
Decomposition of the target matrix in nonnegative matrix factorization.
The nonnegativity constraints placed on the elements of the factor matrices are compatible with the intuitive notion of additively combining parts (that is, subsets of the visible variables activated by specific factors) to generate the whole representation, which is how NMF learns a parts-based representation.
The columns of the matrices \(\boldsymbol W\) and \(\boldsymbol H\) are often sparse since they generally contain a large portion of vanishing coefficients. Sparseness is one of the greatest strengths of NMF, since a sparse representation represents the majority of data by using only few active factors, making the feature set and the coefficients easier to interpret.
Matrix factorization
The factorization of the matrix \(\boldsymbol V\) is achieved through numerical algorithms that, at each iteration, find new values of \(\boldsymbol W\) and \(\boldsymbol H\) that reduce a certain cost function.
The majority of them use a block coordinate descent scheme, namely, they alternately optimize a loss function with respect to one of the two matrices, \(\boldsymbol W\) or \(\boldsymbol H\), while keeping the other fixed. The justification for using such an approach is that, whereas the nonlinear optimization problem is not simultaneously convex in both \(\boldsymbol W\) and \(\boldsymbol H\), the subproblem in either the basis matrix or the coefficients matrix alone is convex. Since the problem is also generally symmetric in \(\boldsymbol W\) and \(\boldsymbol H\), most algorithms actually derive the update for only one of the two matrices and apply the same form of update to the other.
Every algorithm of this type is repeated until a convergence criterion is met, which typically amounts to performing the updates for a fixed number of iterations. Although the natural stopping criterion would be to terminate the procedure whenever \(\left \lVert \boldsymbol V - \boldsymbol W \boldsymbol H \right \rVert \le \epsilon\), for some chosen tolerance \(\epsilon > 0\), checking this requirement at every iteration is computationally expensive. Therefore, the widely shared choice is to fix the number of iterations a priori.
Most NMF algorithms adhere to the following framework.
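A minimal sketch of this generic scheme in R is given below. The function `update_step()` is a hypothetical placeholder for the algorithm-specific rules discussed in the following sections, and the fixed number of iterations and the random initialization are arbitrary choices.

# Generic alternating framework for NMF (sketch): update_step() stands in
# for any of the algorithm-specific update rules described below.
nmf_framework <- function(V, r, update_step, n_iter = 200) {
  n <- nrow(V)
  m <- ncol(V)
  # Random nonnegative initialization on the range of V's entries
  W <- matrix(runif(n * r, min = 0, max = max(V)), nrow = n)
  H <- matrix(runif(r * m, min = 0, max = max(V)), nrow = r)
  for (t in seq_len(n_iter)) {
    step <- update_step(V, W, H)   # one coordinate-descent pass over W and H
    W <- step$W
    H <- step$H
  }
  list(W = W, H = H)
}

The multiplicative, alternating least squares, and gradient sketches presented later all fit this `update_step()` interface.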
The estimation procedures developed for NMF differ in two main aspects: the objective function that represents the quality of the data representation, and the numerical algorithm employed to optimize it.
Objective function
To find an approximate factorization \(\boldsymbol V \approx \boldsymbol W \boldsymbol H\), one needs to define a loss function \(L(\boldsymbol V, \boldsymbol W \boldsymbol H)\) quantifying the quality of the approximation. Common cost functions are based on the square Euclidean distance between the two nonnegative matrices \(\boldsymbol V\) and \(\boldsymbol W \boldsymbol H\) (Lee and Seung 1997; Paatero 1997):
\[\begin{equation*} L(\boldsymbol V, \boldsymbol W \boldsymbol H) = \left\lVert \boldsymbol V-\boldsymbol W \boldsymbol H \right\rVert^2 = \sum_{i=1}^{n} \sum_{\mu = 1}^{m} \Big(v_{i\mu} - (\boldsymbol W \boldsymbol H)_{i\mu}\Big)^2. \end{equation*}\]
This loss function equivalently represents the squared Frobenius norm of the error. It implicitly assumes that the noise \(\boldsymbol N\) present in the matrix \(\boldsymbol V = \boldsymbol W \boldsymbol H + \boldsymbol N\) is Gaussian.1
Though reasonable in many practical situations, Gaussian noise may not be the best assumption for sparse nonnegative data. For this reason, other objective functions are often preferred in practice, such as the generalized Kullback-Leibler divergence of \(\boldsymbol V\) from \(\boldsymbol W \boldsymbol H\):2
\[\begin{equation*} L(\boldsymbol V, \boldsymbol W \boldsymbol H) = D(\boldsymbol V\lvert \rvert \boldsymbol W \boldsymbol H) = \sum_{i=1}^{n} \sum_{\mu = 1}^{m} \Big(v_{i\mu} \log \frac{v_{i\mu}}{(\boldsymbol W \boldsymbol H)_{i\mu}} - v_{i\mu} + (\boldsymbol W \boldsymbol H)_{i\mu}\Big). \end{equation*}\]
Both measures are lower bounded by zero, and vanish if and only if \(\boldsymbol V = \boldsymbol W \boldsymbol H\), but whereas the former is a distance, the latter is not, since it is not symmetric in \(\boldsymbol V\) and \(\boldsymbol W \boldsymbol H\). In particular, it reduces to the relative entropy (Kullback and Leibler 1951) when \(\sum_{i=1}^{n} \sum_{\mu = 1}^{m} v_{i\mu} = \sum_{i=1}^{n} \sum_{\mu = 1}^{m} (\boldsymbol W \boldsymbol H)_{i\mu} = 1\), so that \(\boldsymbol V\) and \(\boldsymbol W \boldsymbol H\) can be regarded as normalized probability distributions.
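As a concrete illustration, both losses can be written in a few lines of R; this is a sketch where `WH` denotes the current product \(\boldsymbol W \boldsymbol H\), and zero entries of `V` are handled with the usual convention \(0 \log 0 = 0\).

# Squared Euclidean (Frobenius) loss between V and its approximation WH
loss_euclidean <- function(V, WH) sum((V - WH)^2)

# Generalized Kullback-Leibler divergence of V from WH
loss_kl <- function(V, WH) {
  sum(ifelse(V > 0, V * log(V / WH), 0) - V + WH)
}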
The nonlinear optimization problem is then minimizing the objective function with respect to the matrix factors, subject to the nonnegativity constraint. In practice, an optional regularization function \(R(\boldsymbol V, \boldsymbol W \boldsymbol H)\) is usually added to the cost function to enforce desirable properties on the factor matrices (e.g., smoothness or sparsity):
\[\begin{equation*} \min_{\boldsymbol W, \boldsymbol H \ge 0} \Big[ L(\boldsymbol V, \boldsymbol W \boldsymbol H) + R (\boldsymbol V, \boldsymbol W \boldsymbol H) \Big]. \end{equation*}\]
A sequence of matrices \((\boldsymbol W^{(t)}, \boldsymbol H^{(t)})\) is built such that at each step the value of the objective function is reduced. Lee and Seung (1999) pointed out that the exact form of the objective function is not as crucial as the nonnegativity constraints for the success of the factorization.
Numerical algorithms
The sequences of matrices \((\boldsymbol W^{(t)}, \boldsymbol H^{(t)})\) may also differ in the choice of the optimization technique used to update them. The cost functions are convex in either \(\boldsymbol W\) or \(\boldsymbol H\), but not simultaneously in both variables. This implies that only convergence to stationary points may eventually be guaranteed. Unlike the unconstrained optimization problem, which can be efficiently solved using SVD, NMF is in general NP-hard3 (Vavasis 2009). For real-world problems, however, even local minima are useful, as they also provide the desirable properties of data compression and feature extraction.
The numerical techniques that can be applied to find local minima can be classified into three categories.
Multiplicative update rules
Multiplicative update rules are by now the most widely employed numerical algorithms for NMF, thanks to their good tradeoff between speed, ease of implementation, and desirable properties. They were proposed as a numerical algorithm for solving NMF by Lee and Seung (2001), whose article launched the research on this topic.
They can be divided into two main classes, depending on the choice of the objective function.
- Loss function based on the divergence measure: \(D(\boldsymbol V\lvert \rvert \boldsymbol W \boldsymbol H)\) is invariant if and only if \(\boldsymbol W\) and \(\boldsymbol H\) are at a stationary point of the divergence, and it is nonincreasing under the updates:
\[\begin{eqnarray*} \quad h_{a\mu} & \leftarrow & h_{a\mu} \dfrac{\sum_{i=1}^{n} w_{ia} \dfrac{v_{i\mu}}{\left(\boldsymbol W \boldsymbol H\right)_{i\mu}}}{\sum_{i=1}^{n} w_{ia}} \quad a= 1, \ldots, r \quad \mu = 1, \ldots, m, \\[0.2cm] \quad w_{ia} & \leftarrow & w_{ia} \dfrac{\sum_{\mu=1}^{m} h_{a\mu} \dfrac{ v_{i\mu}}{\left(\boldsymbol W\boldsymbol H\right)_{i\mu}}}{\sum_{\mu=1}^{m} h_{a\mu}} \quad i=1, \ldots, n \quad a= 1, \ldots, r. \end{eqnarray*}\]
- Loss function based on the square Euclidean distance: \(\left\lVert \boldsymbol V-\boldsymbol W\boldsymbol H \right\rVert\) is invariant if and only if \(\boldsymbol W\) and \(\boldsymbol H\) are at a stationary point of the distance, and it is nonincreasing under the updates:
\[\begin{eqnarray*} \quad h_{a\mu} & \leftarrow & h_{a\mu} \frac{\left(\boldsymbol W^T \boldsymbol V\right)_{a\mu}}{\left(\boldsymbol W^T\boldsymbol W\boldsymbol H\right)_{a\mu}} \quad a= 1, \ldots, r \quad \mu = 1, \ldots, m, \\ \quad w_{ia} & \leftarrow & w_{ia} \frac{\left(\boldsymbol V\boldsymbol H^T\right)_{ia}}{\left(\boldsymbol W\boldsymbol H\boldsymbol H^T\right)_{ia}} \quad i=1, \ldots, n \quad a= 1, \ldots, r. \end{eqnarray*}\]
Each update consists of a multiplication by a term, which is unity when \(\boldsymbol V= \boldsymbol W \boldsymbol H\), so that the perfect reconstruction of the data is necessarily a fixed point of the update rules. The fidelity of the approximation enters the updates through the normalized quotient \(\dfrac{v_{i\mu}}{\left(\boldsymbol W \boldsymbol H \right)_{i\mu}}\).
The multiplicative rules are based on the majorization-minimization framework, since the estimates are the global minimizer of a quadratic function majorizing \(L\), that is, a function that is larger than \(L\) everywhere and equal to \(L\) at the current iterate. Minimizing that function therefore ensures that \(L\) decreases, leading to an algorithm for which the objective function decreases monotonically. Unfortunately, there is no guarantee that the algorithm converges to a local minimum (Chu et al. 2004).
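As an illustration, one pass of the Euclidean-distance multiplicative updates can be sketched in R as follows; the small constant added to the denominators to avoid division by zero is a common practical safeguard rather than part of the original rules.

# One pass of the multiplicative updates for the squared Euclidean loss
multiplicative_step <- function(V, W, H, eps = 1e-9) {
  H <- H * (t(W) %*% V) / (t(W) %*% W %*% H + eps)   # update H with W fixed
  W <- W * (V %*% t(H)) / (W %*% H %*% t(H) + eps)   # update W with the new H
  list(W = W, H = H)
}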
Alternating least squares
Alternating least squares (ALS) NMF algorithms were developed to solve a couple of issues affecting multiplicative rules (Kim and Park 2007). They exploit the convexity of the objective function in either \(\boldsymbol W\) or \(\boldsymbol H\), which implies that, given one matrix, the other one can be found with a simple least-squares computation.
Multiplicative rules are relatively slow to reach convergence and can be significantly accelerated through a more efficient alternation strategy. In the basic ALS framework, a least-squares step is followed by another least-squares step in an alternating fashion. One matrix is kept fixed and the other is computed using least squares; then the roles are swapped, and the former is updated while the latter is kept fixed. After each least-squares step, a projection step ensuring nonnegativity follows, so that the possibly negative elements resulting from the numerical computation are set to zero. Some extensions of ALS algorithms update one matrix several times before updating the other, giving rise to faster convergence techniques.
Multiplicative rules are also affected by the locking property: once an element in \(\boldsymbol W\) or \(\boldsymbol H\) becomes zero, it remains zero, since NMF only allows non-subtractive combinations of parts to represent a whole. Therefore, once the algorithm starts heading down a path towards a fixed point, even if it is a poor fixed point, it continues in that direction. On the contrary, ALS algorithms are more flexible, and allow the iterative process to escape from a path heading towards a poor local minimum. They were proved to converge to a local minimum and to give accurate factors, although the nonnegativity constraint added to the least-squares subproblems generally makes the optimization computationally demanding.
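A minimal sketch of a basic ALS pass is shown below; for simplicity it uses an ordinary least-squares solve followed by projection onto the nonnegative orthant, rather than a proper nonnegative least-squares solver.

# One basic ALS pass: solve each least-squares subproblem, then project
# the result onto the nonnegative orthant by zeroing negative entries
als_step <- function(V, W, H) {
  H <- qr.solve(W, V)              # least-squares solution of W H = V, with W fixed
  H[H < 0] <- 0                    # projection step
  W <- t(qr.solve(t(H), t(V)))     # least-squares solution of H' W' = V', with H fixed
  W[W < 0] <- 0                    # projection step
  list(W = W, H = H)
}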
Gradient descent
Gradient descent techniques are the simplest algorithms to implement, but they are very slow to converge (if they converge at all). Other methods, such as conjugate gradient, converge faster, at least in the vicinity of local minima, but are more complicated to implement. Unfortunately, the convergence of gradient-based methods is very sensitive to the choice of the step size, and a supporting convergence theory is missing.
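For completeness, a projected-gradient pass on the Euclidean loss might look as follows; the fixed step size `eta` is a hypothetical choice and would in practice need tuning or a line search.

# One projected-gradient pass on the squared Euclidean loss:
# take a gradient step, then project back onto the nonnegative orthant
pgd_step <- function(V, W, H, eta = 1e-4) {
  R <- W %*% H - V                       # residual
  H <- pmax(H - eta * (t(W) %*% R), 0)   # gradient w.r.t. H is W'(WH - V)
  R <- W %*% H - V                       # recompute the residual after updating H
  W <- pmax(W - eta * (R %*% t(H)), 0)   # gradient w.r.t. W is (WH - V)H'
  list(W = W, H = H)
}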
Initialization
All NMF algorithms need to be initialized with a value for \(\boldsymbol W^{(0)}\) and/or \(\boldsymbol H^{(0)}\). Since there is no global minimization algorithm, the choice of the initialization turns out to be very important to get meaningful results. A good initialization can lead to faster convergence to a better local minimum and improve the accuracy of the algorithms (Smilde, Bro, and Geladi 2005).
In the standard NMF algorithm, the factor matrices are initialized randomly, with their elements drawn from a uniform distribution defined on the same range as the target matrix’s entries. This method is very simple; however, multiple runs with different starting points are necessary to achieve decent stability, which in turn significantly increases the computational time.
To overcome this issue, several seeding methods that need to be performed only once have been proposed. Examples include methods based on
- Independent component analysis: it is like NMF, but independence between the columns of \(\boldsymbol W\) is assumed (Marchini, Heaton, and Ripley 2013).
- Nonnegative double singular value decomposition: it is particularly suited to initialize NMF algorithms with sparse factors (Boutsidis and Gallopoulos 2008).
- Clustering techniques: the basis matrix is initialized using the centroids computed with some clustering methods (e.g., \(k\)-means), whereas the mixture coefficients matrix is initialized as a proper scaling of the cluster indicator matrix (Xue et al. 2008).
In practice, one may try several initializations via a Monte Carlo type approach and keep the best solution, even though this may be prohibitive on large, realistically sized problems.
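With the NMF R package used later in this post, both strategies can be requested through the `seed` and `nrun` arguments of `nmf()`; the snippet below is a sketch on toy data, and the seeding names are those I believe the package accepts.

library(NMF)
set.seed(1)
V <- matrix(runif(100 * 20), nrow = 100)     # toy nonnegative data matrix
# Ten random restarts: by default only the best fit is kept
fit_random <- nmf(V, rank = 3, nrun = 10, seed = "random")
# Deterministic seeding via nonnegative double SVD
fit_nndsvd <- nmf(V, rank = 3, seed = "nndsvd")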
Rank choice
The final decision concerns the choice of the factorization rank. If the value is too high, there is a risk of overfitting, since a larger number of factors can fit the data better and the residuals decrease accordingly. If it is too small, the approximation could be a poor representation of the data.
Unless one has prior knowledge based on the specific domain theory, it is common to decide on its value through trial and error. Different values are tried, some quality measures are computed, and the best value is chosen according to such measures.
Several approaches have been proposed to further help users find the optimal value of this parameter (see the code sketch after this list). These techniques include:
- Applying SVD and looking at the decay of the singular values of the input data matrix;
- Taking the first value of \(r\) for which the cophenetic coefficient starts decreasing (Brunet et al. 2004);
- Choosing the first value where the residual sum of squares curve presents an inflection point (Hutchins et al. 2008);
- Considering the smallest value at which the decrease in the residual sum of squares is lower than the decrease of the analogous curve obtained from random data (Frigyesi and Höglund 2008).
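Several of these quality measures can be computed over a range of candidate ranks with the NMF package; the sketch below assumes its `nmfEstimateRank()` and `randomize()` helpers and uses arbitrary choices for the rank range, the number of runs, and the toy data.

library(NMF)
set.seed(1)
V <- matrix(runif(100 * 20), nrow = 100)   # toy nonnegative data matrix
# Fit NMF for ranks 2 to 6, with 10 runs per rank, and collect quality measures
rank_survey <- nmfEstimateRank(V, range = 2:6, nrun = 10)
# The same survey on shuffled data serves as a baseline for the residual criterion
rank_random <- nmfEstimateRank(randomize(V), range = 2:6, nrun = 10)
plot(rank_survey)   # cophenetic coefficient, RSS, and other criteria vs. rank
plot(rank_random)   # the same measures on the randomized data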
An ill-posed problem
When proposing NMF factorization, Lee and Seung did not specify under which assumptions the approximation is unique (i.e., well defined) and when it recovers the correct decomposition. To this end, Donoho and Stodden (2004) pointed out a potentially serious problem with NMF and discovered that even when \(\boldsymbol V = \boldsymbol W \boldsymbol H\) holds exactly, the decomposition might not be unique.
They interpreted NMF geometrically as the problem of finding a simplicial convex cone \(C_{\boldsymbol W} = \Big \{ \sum_{a=1}^{r} \lambda_a \boldsymbol w_a, \ \lambda_a \ge 0 \Big \}\) that contains the cloud of data points and is contained in the positive orthant. They showed that if the data values are strictly positive, i.e., \(v_{i\mu} \ge \epsilon\) for some \(\epsilon > 0\) and for all \(i=1, \ldots, n\) and \(\mu=1, \ldots, m\), then the column vectors of \(\boldsymbol V\) lie well inside the interior of the positive orthant of \(\mathbb{R}^n\), and there are many simplicial cones containing the data, because nonnegative matrices form a cone with many facets, which makes it hard to characterize which facets are active in the optimization, and whether a given facet is active at all (Chu et al. 2004).
As an illustration, consider the data points in two dimensions depicted in the figure below. Since there is open “space” between the data and the coordinate axes, one can choose the basis vectors anywhere in this open space, and represent each data point exactly with a nonnegative linear combination of these vectors.
Non-uniqueness of nonnegative matrix factorization.
Under such a strict positivity condition, there are many distinct representations of the form \(\boldsymbol V = \boldsymbol W \boldsymbol H\) where \(\boldsymbol W \ge 0\) and \(\boldsymbol H \ge 0\). This implies that the solution found by algorithms is not unique but depends on the employed starting values.
In other words, for any minimizing pair of matrices \(\boldsymbol W, \boldsymbol H\) there exist infinitely many equally good solution pairs. In fact, any invertible matrix \(\boldsymbol Q\) satisfying \(\widetilde{\boldsymbol W} = \boldsymbol W \boldsymbol Q \ge 0\) and \(\widetilde{\boldsymbol H} = \boldsymbol Q^{-1} \boldsymbol H \ge 0\) generates an equivalent factorization4 \(\widetilde{\boldsymbol W} \widetilde{ \boldsymbol H }= \boldsymbol W \boldsymbol H\).
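The simplest such \(\boldsymbol Q\) is a positive diagonal rescaling (possibly combined with a permutation of the factors), as the following toy check in R illustrates:

set.seed(1)
W <- matrix(runif(6), nrow = 3)          # 3 x 2 nonnegative basis matrix
H <- matrix(runif(8), nrow = 2)          # 2 x 4 nonnegative coefficients matrix
Q <- diag(c(2, 0.5))                     # positive diagonal rescaling
W_tilde <- W %*% Q                       # still nonnegative
H_tilde <- solve(Q) %*% H                # still nonnegative
max(abs(W %*% H - W_tilde %*% H_tilde))  # practically zero: same factorization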
This issue is addressed by saying that NMF is an ill-posed problem. Such complications are usually alleviated by imposing additional constraints to confine the feasible solution set (e.g., smoothness constraints, shape constraints, geometric constraints, cross-modal correspondence constraints).
Image compression
The first field of application to which NMF has been applied is image processing and face recognition. Lee and Seung (1999) showed that this innovative method could provide a convenient way to learn parts of faces (like the nose, the eyes, and the mouth), that could be successively used to represent a visage.
We now consider a simple application of NMF to image processing. Our aim here is to use NMF as a tool for obtaining a compressed version of an original image that still retains its main features.
In this context:
- The columns of the target matrix \(\boldsymbol V\) will contain the pixel intensities of the image;
- The columns of \(\boldsymbol W\) will constitute the basis vectors, and are the main features of the image;
- The columns of \(\boldsymbol H\) are named encodings and are the coefficients by which the image is represented with a linear combination of basis vectors.
The example image we will be using is the famous painting Madonna of the Goldfinch (Madonna del Cardellino) by the Italian artist Raphael, dated 1506. The painting shows Mary, Christ, and the young John the Baptist, who are arranged to form an almost regular triangle.
library(imager)   # image loading and manipulation
library(NMF)      # nonnegative matrix factorization

# Load the painting directly from Wikimedia Commons
rgb_image <- load.image("https://upload.wikimedia.org/wikipedia/commons/thumb/5/57/Raffaello_Sanzio_-_Madonna_del_Cardellino_-_Google_Art_Project.jpg/549px-Raffaello_Sanzio_-_Madonna_del_Cardellino_-_Google_Art_Project.jpg")

# Display the original image without axes or margins
par(mar = c(0.1, 0.1, 0.1, 0.1))
rgb_image %>%
  plot(axes = FALSE)
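A possible continuation is sketched below: it converts the painting to grayscale, extracts its pixel matrix, factorizes it with the NMF package, and plots the low-rank reconstruction next to the original. The rank of 50 is an arbitrary illustrative choice, and the indexing assumes imager's standard four-dimensional cimg layout.

gray_image <- grayscale(rgb_image)       # intensities in [0, 1]
V <- gray_image[, , 1, 1]                # drop the depth and colour dimensions
fit <- nmf(V, rank = 50)                 # default algorithm of the NMF package
V_hat <- fitted(fit)                     # compressed reconstruction W H
par(mfrow = c(1, 2), mar = c(0.1, 0.1, 0.1, 0.1))
plot(as.cimg(V), axes = FALSE)           # original grayscale image
plot(as.cimg(V_hat), axes = FALSE)       # rank-50 approximation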