Problem Definition

Clustering, or unsupervised learning, is a generic name for a variety of procedures designed to find natural groupings or clusters in multidimensional data based on measured or perceived similarities among the patterns. Matrix-factorization-based clustering has emerged as an effective approach for clustering problems in high-dimensional spaces.
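As a minimal illustration of matrix-factorization-based clustering, the sketch below (generic NumPy code, not the paper's method; the synthetic data and update rules are illustrative assumptions) factorizes a nonnegative data matrix X ≈ WH with Lee–Seung multiplicative updates and reads each sample's cluster label off the row of W with the largest entry.

```python
import numpy as np

rng = np.random.default_rng(0)
# Illustrative synthetic data: two well-separated nonnegative clusters in 6-D.
X = np.vstack([rng.random((15, 6)) + [4, 4, 4, 0, 0, 0],
               rng.random((15, 6)) + [0, 0, 0, 4, 4, 4]])

# Rank-2 NMF, X ~ W H, via Lee-Seung multiplicative updates.
k, eps = 2, 1e-9
W, H = rng.random((30, k)), rng.random((k, 6))
for _ in range(300):
    H *= (W.T @ X) / (W.T @ W @ H + eps)   # update coefficients
    W *= (X @ H.T) / (W @ H @ H.T + eps)   # update basis/indicator

# Each sample joins the cluster whose factor axis carries its largest projection.
labels = W.argmax(axis=1)
```

Rows of H act as cluster prototypes, and W holds each sample's projections onto them, which is why the argmax over a row of W serves as a cluster assignment.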
Algorithms
- Proposed a Semi-supervised NMF (SSNMF) based framework to incorporate prior knowledge into heterogeneous data co-clustering.
- Users provide constraints on data samples in the central type, specifying whether they "must" (must-link) or "cannot" (cannot-link) be clustered together.
- Improve the quality of co-clustering by learning a new distance metric based on these constraints.
- Perform tri-factorizations of the new data matrices, obtained with the learned distance metric, to infer the central data clusters while simultaneously deriving the clusters of related feature modalities.
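The tri-factorization step above can be sketched as follows. This is a generic nonnegative tri-factorization R ≈ G S Fᵀ with standard multiplicative updates, not the paper's exact update rules, and it assumes the metric-learning step has already produced the input matrix; the function name and parameters are illustrative.

```python
import numpy as np

def tri_factorize(R, kg, kf, iters=200, eps=1e-9, seed=0):
    """Sketch of nonnegative tri-factorization R ~ G S F^T.

    Rows of G indicate clusters of the central data type; rows of F
    indicate clusters of the related feature modality; S encodes the
    association between the two sets of clusters.
    """
    rng = np.random.default_rng(seed)
    n, m = R.shape
    G = rng.random((n, kg))
    S = rng.random((kg, kf))
    F = rng.random((m, kf))
    for _ in range(iters):
        # Multiplicative updates for the objective ||R - G S F^T||_F^2.
        G *= (R @ F @ S.T) / (G @ S @ F.T @ F @ S.T + eps)
        F *= (R.T @ G @ S) / (F @ S.T @ G.T @ G @ S + eps)
        S *= (G.T @ R @ F) / (G.T @ G @ S @ F.T @ F + eps)
    return G, S, F
```

Cluster labels for the central data type then follow from `G.argmax(axis=1)`, and the feature-modality clusters from `F.argmax(axis=1)`.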
Results
- An example of SSNMF for pairwise co-clustering is illustrated in the above figure. Fig. (a) shows the relational data with two clusters (15 asterisk points and 15 circle points), both following Gaussian distributions. The first step of SSNMF co-clustering, distance metric learning, is shown in Fig. (c), in which a new relational data matrix is learned by embedding the distance metric L(c1) into the original R(c1). Clearly, with the must-link and cannot-link constraints, the data points within the same cluster are placed closer together while points in different clusters are moved farther apart. The result of the second step, tri-factorization of R(c1), is illustrated in Fig. (d). As a comparison, we also show the result obtained by unsupervised NMF co-clustering in Fig. (b). It is clear that the semi-supervised model performs better.
- Panels (a)-(c), (d)-(f), and (g)-(i) in the following figure show the clustering results obtained by SRC, NMF, and SSNMF, respectively, in which the two clusters are denoted by the red stars and the blue triangles. These results indicate that SSNMF provides better clustering accuracy than unsupervised NMF; note that the cluster label for a data point is determined by finding the axis along which the data point has the largest projection value.
- The image data used in our experiments are chosen from Corel CDs, which contain 31,438 general-purpose images of various contents, such as plants, animals, buildings, and human society. To evaluate our algorithm, we construct a data set of 1,000 images from 10 categories: "eggs", "decoys", "firearms", "cards", "buses", "abstract", "foliage", "dawn", "texture", and "wave". Some examples from each category are shown in the following figure.

- We conduct pairwise co-clustering experiments on the text data sets with document-word matrices and compare the performance of SSNMF with six clustering methods. In the following Fig. (a), we plot the average AC value over all eight data sets against the increasing percentage of pairwise constraints for TSVM, SSKK, SSCMRF, and SSNMF. We clearly see that SSNMF significantly outperforms TSVM, SSKK, and SSCMRF in all cases, gaining at least 12 percent higher clustering accuracy. Another important observation is that the average accuracy of all four methods consistently increases as the pairwise constraints grow from 0.5 to 10 percent. In particular, SSNMF generates significantly better results (over 10 percent) by quickly learning from just a few constraints (0.5 percent). Therefore, document clustering performance can be greatly improved even with very limited prior knowledge.
- We conduct co-clustering on gene expressions with a condition-gene matrix and compare the performance of SSNMF. Fig. (b) illustrates the average AC values against the increasing percentage of pairwise constraints for semi-supervised condition clustering/classification. Overall, SSNMF provides the highest accuracy among the four semi-supervised methods. We see that more constraints on the patient conditions lead to higher accuracy for all four approaches. Again, substantial performance improvement, up to a 20 percent accuracy increase, is achieved by SSNMF with very limited prior knowledge (e.g., 0.5 percent constraints).
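The AC values reported above measure clustering accuracy. A minimal sketch of one common way to compute it, assuming the standard best-permutation matching between predicted clusters and ground-truth classes (the function name and this exact protocol are illustrative; the paper's evaluation may differ):

```python
import numpy as np
from itertools import permutations

def clustering_accuracy(true_labels, pred_labels):
    """AC: fraction of points whose predicted cluster matches the true
    class under the best one-to-one cluster-to-class mapping."""
    k = int(max(true_labels.max(), pred_labels.max())) + 1
    # Confusion counts: conf[t, p] = points of true class t in predicted cluster p.
    conf = np.zeros((k, k), dtype=int)
    for t, p in zip(true_labels, pred_labels):
        conf[t, p] += 1
    # Exhaustive search over mappings is fine for the small k used here.
    best = max(sum(conf[c, perm[c]] for c in range(k))
               for perm in permutations(range(k)))
    return best / len(true_labels)

# Example: a relabeling of the same partition scores a perfect 1.0.
ac = clustering_accuracy(np.array([0, 0, 1, 1]), np.array([1, 1, 0, 0]))
```

Because cluster indices are arbitrary, the mapping step is what lets AC compare a factorization-derived labeling against ground truth.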

- We conduct experiments to co-cluster words, documents, and categories and compare the performance of SSNMF with three unsupervised approaches and one semi-supervised method. In the following Fig. (a), we plot the average AC values against the increasing percentage of pairwise constraints for SSCMRF and SSNMF. Again, as more prior knowledge becomes available, the performance of both SSCMRF and SSNMF clearly improves. It is also obvious that, on average, SSCMRF is consistently outperformed by SSNMF across varying amounts of constraints.
- We present the experimental results on co-clustering image data. Fig. (b) shows that the quality of the clustering improves as the amount of constraints increases. Note that while SSNMF outperforms SSCMRF on the text data sets, the performance of SSCMRF and SSNMF is very close on the image data sets regardless of the amount of constraints.
- We compare the computational speed of three unsupervised approaches (SRC, CMRF, and NMF) and two semi-supervised approaches (SSCMRF and SSNMF). The following Fig. (a) illustrates the computational speed for all five methods with an increasing number of samples in the central data type nc for a fixed np, while Fig. (b) shows the computational speed with increasing feature dimension np for a fixed nc. In both cases, unsupervised NMF is the fastest of the five approaches, as it uses an efficient iterative algorithm to compute the cluster indicator and cluster association matrices.
Related Publications
