Topics
Semisupervised NMF for Homogeneous Data Clustering
Problem
Definition

Semisupervised clustering uses class labels or
pairwise constraints on data objects to aid
unsupervised clustering. It can group data using the
categories of the initial labeled data as well as
unlabeled data in order to modify the existing set
of categories which reflect the whole regularities
in the data.
Algorithms
 Supervision is provided as two sets of pairwise
constraints on the data objects: mustlink
constraints M and cannotlink constraints C. Every
pair of data (xi, xj) in M implies that xi and xj
must belong to the same cluster. Similarly, all
possible pairs (xi, xj) in C implies that the two
objects should belong to different clusters (see
above Figure).
 Perform symmetric nonnegative trifactorization
of the data similarity matrix with constraints using
an interative algorithm. The correctness and
convergence of the algorithm are proved by showing
that the solution satisfied the KKT optimality and
the algorithm is guaranteed to converge.
Results
 In the following figures, (a) shows an artificial
toy data set consisting of two natural clusters, (b)
shows data distribution in the SSNMF subspace of
the two column vectors of indicator G. The data
points from the two clusters get distributed along
the two axes, and (c) shows data distribution in the
SSSNC subspace of the first two singular vectors.
There is no relationship between the axes and the
clusters.
 In the following table, we compared the algorithms on
some document datasets using AC values. The performance of
the first three methods is similar with NMF proving to be
the best amongst the unsupervised methods. However, the
accuracy of NMF greatly deteriorates and is unable to
produce meaningful results on datasets having more than two
clusters. On the other hand, the superior performance of
SSNMF is evident across all the datasets. We can see that
in general a semisupervised method can greatly enhance the
document clustering results by benefiting from the user
provided knowledge.Moreover, SSNMF is able to generate
significantly better results by quickly learning from the
few pairwise constraints provided.
 The following subfigures show that the AC values
against increasing percentage of pairwise
constraints available, for the algorithms on some
document datasets. On the whole, all three
algorithms perform better as the percentage of
pairwise constraints increases. Comparatively,
SSNMF gets better accuracy than the other two
algorithms even for minimum percentage of pairwise
constraints.
 We plotted the computational speed of SSNMF with respect to SSKK and
SSSNC in the following figure. As the number of data samples increase, SSSNC
turns out to be the slowest of the three algorithms. SSKK is the quickest with
SSNMF closely following it. SSNMF provides an efficient approach for
clustering.
Related Publications
 Yanhua Chen, Manjeet Rege, Ming Dong and Jing Hua,
"Nonnegative Matrix Factorization for
Semisupervised Data Clustering", Journal of
Knowledge and Information Systems (Springer), Vol.
17, No. 3, pp. 355  379, 2008. (invited as a best
paper of ICDM 07, conference version: "Incorporating
User provided Constraints into Document Clustering",
was published in Proc. of IEEE International
Conference on Data Mining, Omaha, NE, 2007 as a
Regular paper, acceptance rate 7.2%).
 Yanhua Chen, Manjeet Rege, Ming Dong and Farshad
Fotouhi, "Deriving Semantics for Image Clustering
from Accumulated User Feedbacks", Proc. of ACM
Multimedia, Germany, 2007 (acceptance rate: 24%).
 
