Clustering or unsupervised learning is a generic name for a variety of procedures designed to find natural groupings or clusters in multidimensional data based on measured or perceived similarities among the patterns. Matrix-factorization-based clustering has emerged as an effective approach for clustering problems in high-dimensional spaces.
- Proposed a Semisupervised NMF (SS-NMF) based framework to incorporate prior knowledge into heterogeneous data coclustering.
- Users provide constraints on data samples in the central type, specifying whether they "must" (must-link) or "cannot" (cannot-link) be clustered together.
- Improve the quality of coclustering by learning a new distance metric based on these constraints.
- Perform trifactorizations of the new data matrices, obtained with the learned distance metric, to infer the central data clusters while simultaneously deriving the clusters of related feature modalities.
- An example of SS-NMF for pairwise coclustering is illustrated in the above figure. Fig (a) shows the relational data with two clusters (15 asterisk points and 15 circle points), both following Gaussian distributions. The first step of SS-NMF coclustering, distance metric learning, is shown in Fig. (c), in which a new relational data is learned through embedding the distance metric L(c1) into the original R(c1). Clearly, with the must-link and cannot-link constraints, the data points within the same cluster are placed closer while points in different clusters are moved away. The result of the second step, trifactorization of R(c1), is illustrated in Fig. (d). As a comparison, we also show the result obtained by the unsupervised NMF coclustering in Fig. (b). It is clear that the semisupervised model has a better performance
- Panels (a)-(c), (d)-(f), and (g)-(i) in following figure show the clustering results obtained by SRC, NMF, and SS-NMF, in which the two clusters are denoted by the red stars and the blue triangles, respectively. This indicates that SS-NMF can provide better clustering accuracy than unsupervised NMF because the cluster label for a data point is determined by finding the axis with which the data point has the largest projection value.
- The image data used in our experiments is chosen from Corel CDs, which contains 31,438 general-purpose images of various contents, such as plants, animals, buildings, human society, etc. To evaluate our algorithm, weconstruct a data set with 1,000 images from 10 categories: "eggs", "decoys", "firearms", "cards", "buses", "abstract", "foliage", "dawn", "texture", and "wave". Some examples from each category are shown in the following figure.
- We conduct pairwise coclustering experiments on the text data sets with document-word matrices and compare the performance of SS-NMF with six clustering methods. In the following Fig.(a), we plot the average AC value on all eight data sets against the increasing percentage of pairwise constraints for TSVM, SS-KK, SS-CMRF, and SS-NMF. We clearly see that SS-NMF significantly outperforms TSVM, SS-KK, and SS-CMRF in all cases, gaining at least 12 percent higher clustering accuracy. Another important observation is that the average accuracy of all four methods consistently increases with the gradual increase of the pairwise constraints (from 0.5 to 10 percent). Particularly, SS-NMF is able to generate significantly better results (over 10 percent) by quickly learning from just a few constraints (0.5 percent). Therefore, document clustering performance can be greatly improved even with very limited prior knowledge.
- We conduct coclustering on gene expressions with condition-gene matrix and compare the performance of SSNMF. Fig. (b) illustrates the average AC values against the increasing percentage of pairwise constraints for semisupervised condition clustering/classification. Overall, SS-NMF provides the highest accuracy among the four semisupervised methods. We see that more constraints on the patient conditions lead to higher accuracy for all four approaches. Again, substantial performance improvement is achieved by SS-NMF, up to 20 percent accuracy increase, with very limited prior knowledge (e.g., 0.5 percent constraints).
- We conduct experiments to cocluster words, documents, and categories and compare the performance of SSNMF with three unsupervised approaches and one semisupervised method. In the following Fig. (a), we plot the average AC values against increasing percentage of pairwise constraints for SS-CMRF and SS-NMF. Again, when more prior knowledge is available, the performance of SS-CMRF and SS-NMF clearly gets better. It is also obvious that on average SS-CMRF is consistently outperformed by SS-NMF with varying amounts of constraints.
- We present the experimental results on coclustering image data. Fig. (b) shows that the quality of the clustering improves when the amount of constraints increases. Note that while we observe better performance of SS-NMF over SS-CMRF in text data sets, it is clear to see that the performance of SS-CMRF and SS-NMF is very close in image data sets regardless of the amount of constraints.
- We compare the computational speed of three unsupervised approaches: SRC, CMRF, and NMF, and two semisupervised approaches: SS-CMRF and SS-NMF. The following Fig. (a) illustrates the computational speed for all five methods with increasing number of samples in the central data type nc for a fixed np, while Fig. (b) shows the computational speed with increasing feature dimensions np for a fixed nc. In both cases, unsupervised NMF is the quickest among the five approaches as it uses an efficient iterative algorithm to compute the cluster indicator and cluster association matrices.