VII Center for Visual Informatics and Intelligence wsu
Home  NSF: CRI arrow Topics

Topics

Consistent Isoperimetric High-order Co-clustering (CIHC) for high-order data co-clustering



Problem Definition


  • Clustering objects of more than two data types is high-order heterogeneous clustering. The data structure is star-structured co-clustering problems in which a central data type is connected to all the other data types. Intuitively, one might expect that this kind of co-clustering can be achieved by trivially extending the pairwise co-clustering techniques previously developed. However, this approach does not consider the constraints enforced on the central data type. A basic element of this problem is the star-structured triplet.

Algorithms

  • Model the triplet using a tripartite graph consisting of the 3 data types - f, m and w. We treat this tripartite graph as two bipartite graphs of f-m and m-w. Co-clustering is then achieved by simultaneously partitioning these two bipartite graphs together such that the star-structured constraint is respected (see above Figure).
  • Consistent Isoperimetric High-order Co-clustering (CIHC) requires a simple and very quick solution to a sparse system of linear equations.

Results

  • In the following sub-figures show the partitioning obtained for categories, documents and words, respectively. The dotted line in the second sub-figure separates the documents from the two categories. Good document clustering should partition documents on either side of the line into different clusters.
  • The following figures show the performance of CIHC and CBGC in the presence of additive and multiplicative Gaussian noise on some datasets. X-axis has the variance of the noise while the isoperimetric ratio of the partitioning is along the Y-axis. First two plots are tripartitioning in the presence of additive noise on DS1 and DS2 datasets. Similarly, next two plots are in the presence of multiplicative noise on these two datasets. From these results, we can see that inspite of the varying amounts and kinds of noise in the data, CIHC is able to perform optimal partitioning indicated by its low isoperimetric ratio. Second noticeable fact is in regards to stability. Rising ratios as the variance increases indicates that the performance of the algorithm is decreasing. However, uctuating ratios indicates instability and inconsistency to partition optimally. While the ratios of CIHC uctuate in limits, CBGC ratio fuctuation clearly shows instability.
  • We plotted the computational speed of CIHC and CBGC in the following figure. The time required to compute the indicator vector is along the Y-axis and the number of vertices in the tripartite graph are shown along the X-axis. Time for CIHC gradually increases with the number of vertices. For the maximum number of vertices we increased to - about almost 4,000, CIHC required about 98 seconds only. CBGC on the other hand, was unable to keep up with CIHC. As can be seen, the time required by CBGC really shoots up for a few hundred vertices in the graph. Moreover, CBGC is unscalable and is unable to handle larger sized graphs. In our experiment, CBGC was unable to handle graphs with more than 1,500 vertices. This clearly demonstrates the computational efficiency of CIHC and the potential for applicability in large-scale real-world applications.

Related Publications


Top

    Contact Webmaster
    Updated by 09/28/2011