Topics
Consistent Isoperimetric Highorder Coclustering (CIHC) for
highorder data coclustering
Problem
Definition

Clustering objects of more than two data types is
highorder heterogeneous clustering. The data
structure is starstructured coclustering problems
in which a central data type is connected to all the
other data types. Intuitively, one might expect that
this kind of coclustering can be achieved by
trivially extending the pairwise coclustering
techniques previously developed. However, this
approach does not consider the constraints enforced
on the central data type. A basic element of this
problem is the starstructured triplet.
Algorithms
 Model the triplet using a tripartite graph
consisting of the 3 data types  f, m and w. We
treat this tripartite graph as two bipartite graphs
of fm and mw. Coclustering is then achieved by
simultaneously partitioning these two bipartite
graphs together such that the starstructured
constraint is respected (see above Figure).
 Consistent Isoperimetric Highorder Coclustering
(CIHC) requires a simple and very quick solution to
a sparse system of linear equations.
Results
 In the following subfigures show the partitioning
obtained for categories, documents and words,
respectively. The dotted line in the second
subfigure separates the documents from the two
categories. Good document clustering should
partition documents on either side of the line into
different clusters.
 The following figures show the performance of CIHC
and CBGC in the presence of additive and
multiplicative Gaussian noise on some datasets.
Xaxis has the variance of the noise while the
isoperimetric ratio of the partitioning is along the
Yaxis. First two plots are tripartitioning in the
presence of additive noise on DS1 and DS2 datasets.
Similarly, next two plots are in the presence of
multiplicative noise on these two datasets. From
these results, we can see that inspite of the
varying amounts and kinds of noise in the data, CIHC
is able to perform optimal partitioning indicated by
its low isoperimetric ratio. Second noticeable fact
is in regards to stability. Rising ratios as the
variance increases indicates that the performance of
the algorithm is decreasing. However, uctuating
ratios indicates instability and inconsistency to
partition optimally. While the ratios of CIHC
uctuate in limits, CBGC ratio fuctuation clearly
shows instability.




 We plotted the computational speed of CIHC and CBGC in the following figure.
The time required to compute the indicator vector is along the Yaxis and the
number of vertices in the tripartite graph are shown along the Xaxis. Time for
CIHC gradually increases with the number of vertices. For the maximum number of
vertices we increased to  about almost 4,000, CIHC required about 98 seconds
only. CBGC on the other hand, was unable to keep up with CIHC. As can be seen,
the time required by CBGC really shoots up for a few hundred vertices in the
graph. Moreover, CBGC is unscalable and is unable to handle larger sized graphs.
In our experiment, CBGC was unable to handle graphs with more than 1,500
vertices. This clearly demonstrates the computational efficiency of CIHC and the
potential for applicability in largescale realworld applications.
Related Publications


