Topics
Isoperimetric Coclustering Algorithm (ICA) for pairwise data
coclustering
Problem
Definition

Data coclustering refers to the problem of
simultaneous clustering of two data types.
Typically, the data is stored in a contingency or
cooccurrence matrix C where rows and columns of the
matrix represent the data types to be coclustered.
An entry Cij of the matrix signifies the relation
between the data type represented by row i and
column j. Coclustering is the problem of deriving
submatrices from the larger data matrix by
simultaneously clustering rows and columns of the
data matrix.
Algorithms
 The two data types are modeled as the two sets of
vertices of a weighted bipartite graph (see above
Figure).
 Isoperimetric Coclustering Algorithm (ICA)
requires a simple solution to a sparse system of
linear equations instead of the eigenvalue or SVD
problem in the popular spectral coclustering
approach.
Results
 In the following figures, (a) shows a typical
worddocument matrix before coclustering.
Coclustering leads to rearranging of the documents
and words such that the most corelated get grouped
together simultaneously. (b) shows a dataset having
2 clusters in the ground truth, bipartitioned using
ICA. (c) Similarly, dataset having 6 clusters in the
ground truth has been kpartitioned by ICA.
 The following figures show the performance of ICA
and SpectralSVD in the presence of additive and
multiplicative Gaussian noise on some datasets.
Xaxis has the variance of the noise while the
isoperimetric ratio of the partitioning is along the
Yaxis. Additive noise had zero mean with variance
increasing from 1 to the maximum value in the
original data. Multiplicative noise had mean of 1
with its variance going from 1 to a maximum of 5.
First two plots are bipartitioning in the presence
of additive noise on InterestTrade dataset and
multiplicative noise on ArachidonicAcidsHematocrit
dataset. Similarly, next two plots are for
kpartitioning with additive noise on wap and
multiplicative on re0 datasets. From these results,
we can see that inspite of the varying amounts and
kinds of noise in the data, ICA is able to perform
optimal partitioning indicated by its low
isoperimetric ratio. Second noticeable fact is in
regards to stability. Rising ratios as the variance
increases indicates that the performance of
algorithm is gradually decreasing. However,
fluctuating ratios indicates instability and
inconsistency to partition optimally.




 We plotted the computational speed of ICA and SpectralSVD in the following
figure. The time required to compute the indicator vector is along the Yaxis
and the number of vertices in the bipartite graph are shown along the Xaxis.
Time for ICA gradually increases as the number of vertices increase. However,
SpectralSVD time increases more rapidly comparatively and hence is clearly
outperformed by ICA.
Related Publications
 Manjeet Rege, Ming Dong and Farshad Fotouhi,
“Bipartite Isoperimetric Graph Partitioning for Data
Coclustering”, Data Mining and Knowledge Discovery
Journal, Vol. 16, No. 3, 2008.
 Manjeet Rege, Ming Dong and Farshad Fotouhi,
“Coclustering documents and words using Bipartite
Isoperimetric Graph Partitioning”, in proceedings of
IEEE International Conference on Data Mining, 2006.
(regular paper  acceptance rate 10%, oral
presentation)
 Yanhua Chen, Ming Dong and Manjeet Rege, “Gene
Expression Clustering: a Novel Graph Partitioning
Approach”, in proceedings of International Joint
Conference on Neural Networks, 2007


