Center for Visual Informatics and Intelligence, WSU
NSF: CRI

Topics

Isoperimetric Co-clustering Algorithm (ICA) for pairwise data co-clustering



Problem Definition


  • Data co-clustering refers to the simultaneous clustering of two data types. Typically, the data is stored in a contingency (co-occurrence) matrix C whose rows and columns represent the two data types to be co-clustered; an entry Cij quantifies the association between the item represented by row i and the item represented by column j. Co-clustering derives sub-matrices from the larger data matrix by clustering its rows and columns simultaneously.
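The contingency-matrix setup can be made concrete with a tiny hypothetical example (all numbers below are made up for illustration):

```python
import numpy as np

# Toy word-document contingency matrix (hypothetical data): rows are words,
# columns are documents, C[i, j] counts occurrences of word i in document j.
# Rows/columns 0-2 and 3-5 form two natural co-clusters (dense diagonal blocks).
C = np.array([
    [4, 3, 5, 0, 0, 1],
    [3, 4, 4, 1, 0, 0],
    [5, 2, 3, 0, 1, 0],
    [0, 1, 0, 6, 4, 5],
    [1, 0, 0, 5, 6, 4],
    [0, 0, 1, 4, 5, 6],
])

# A co-clustering would group rows and columns to expose the dense blocks:
row_clusters = [0, 0, 0, 1, 1, 1]
col_clusters = [0, 0, 0, 1, 1, 1]

# Most of the matrix mass lies inside the two diagonal blocks.
block_mass = C[:3, :3].sum() + C[3:, 3:].sum()
print(block_mass, C.sum())  # 78 of 84
```

Here the block structure is visible by eye; real word-document matrices only reveal it after the rows and columns are rearranged by a co-clustering algorithm.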

Algorithms

  • The two data types are modeled as the two sets of vertices of a weighted bipartite graph (see the figure above).
  • The Isoperimetric Co-clustering Algorithm (ICA) requires only the solution of a sparse system of linear equations, rather than the eigenvalue or SVD computation used by the popular spectral co-clustering approach.
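The core computation can be sketched as follows. This is a simplified reconstruction, not the authors' implementation: it follows the generic isoperimetric partitioning recipe (ground one vertex, solve the reduced Laplacian system, threshold the solution), and the ground-vertex choice and median threshold here are placeholder simplifications of the published algorithm.

```python
import numpy as np
from scipy.sparse import csr_matrix, diags, bmat
from scipy.sparse.linalg import spsolve

def ica_bipartition(C, ground=0):
    """Sketch of isoperimetric co-clustering on the bipartite graph of a
    contingency matrix C. Rows and columns of C are the two vertex sets;
    edge (row i, col j) has weight C[i, j]. Instead of an eigenproblem,
    we solve one sparse linear system L0 x = d0 and threshold x."""
    m, n = C.shape
    S = csr_matrix(C)
    W = bmat([[None, S], [S.T, None]]).tocsr()      # bipartite adjacency
    d = np.asarray(W.sum(axis=1)).ravel()           # vertex degrees
    L = (diags(d) - W).tocsr()                      # graph Laplacian
    keep = np.ones(m + n, dtype=bool)
    keep[ground] = False                            # remove ground vertex
    x = np.zeros(m + n)
    x[keep] = spsolve(L[keep][:, keep].tocsc(), d[keep])
    # Simplified threshold (median of x); the published algorithm sweeps
    # thresholds to minimize the isoperimetric ratio of the cut.
    labels = (x > np.median(x)).astype(int)
    return labels[:m], labels[m:]                   # row labels, col labels
```

Because rows and columns are vertices of the same graph, a single indicator vector x partitions both data types at once, which is exactly the "simultaneous" aspect of co-clustering.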

Results

  • In the following figures, (a) shows a typical word-document matrix before co-clustering. Co-clustering rearranges the documents and words so that the most correlated ones are grouped together simultaneously. (b) shows a dataset with 2 clusters in the ground truth, bipartitioned using ICA. (c) Similarly, a dataset with 6 clusters in the ground truth has been k-partitioned by ICA.
  • The following figures show the performance of ICA and Spectral-SVD in the presence of additive and multiplicative Gaussian noise on several datasets. The X-axis shows the variance of the noise, and the Y-axis shows the isoperimetric ratio of the resulting partitioning. Additive noise had zero mean, with variance increasing from 1 to the maximum value in the original data; multiplicative noise had a mean of 1, with variance going from 1 to a maximum of 5. The first two plots show bipartitioning under additive noise on the Interest-Trade dataset and multiplicative noise on the ArachidonicAcids-Hematocrit dataset; the next two plots show k-partitioning under additive noise on the wap dataset and multiplicative noise on the re0 dataset. These results show that, in spite of the varying amounts and kinds of noise in the data, ICA produces near-optimal partitionings, as indicated by its consistently low isoperimetric ratio. A second noteworthy point concerns stability: ratios that rise gradually as the variance increases indicate that an algorithm's performance degrades gracefully, whereas fluctuating ratios indicate instability and an inconsistent ability to partition optimally.
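The isoperimetric ratio plotted on the Y-axis can be computed from a partition as the cut weight divided by the volume of the smaller side (one common convention; the authors' exact normalization may differ):

```python
import numpy as np

def isoperimetric_ratio(W, labels):
    """Isoperimetric ratio of a bipartition of a weighted graph:
    (total weight of cut edges) / (volume of the smaller side), where the
    volume of a vertex set is the sum of its vertex degrees. Lower means
    a better-separated partition; this is the Y-axis quantity above."""
    W = np.asarray(W, dtype=float)
    inside = np.asarray(labels).astype(bool)
    cut = W[inside][:, ~inside].sum()       # each cut edge counted once
    deg = W.sum(axis=1)
    vol = min(deg[inside].sum(), deg[~inside].sum())
    return cut / vol

# Tiny hypothetical 4-vertex weighted adjacency matrix:
W = np.array([[0, 5, 1, 0],
              [5, 0, 0, 1],
              [1, 0, 0, 5],
              [0, 1, 5, 0]], dtype=float)
print(isoperimetric_ratio(W, [1, 1, 0, 0]))  # cut = 2, volume = 12 -> ~0.167
```

A noisy dataset inflates this ratio when the algorithm is forced to cut heavy edges, which is why a flat, low curve indicates robustness to noise.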
  • We plotted the computational speed of ICA and Spectral-SVD in the following figure. The time required to compute the indicator vector is shown along the Y-axis, and the number of vertices in the bipartite graph along the X-axis. The time for ICA grows gradually as the number of vertices increases, whereas the Spectral-SVD time grows much more rapidly; ICA therefore clearly outperforms Spectral-SVD in speed.
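The scaling gap can be illustrated with a rough micro-benchmark (a hypothetical setup, not the authors' experiment): time one sparse linear solve, as in ICA, against one truncated SVD, as in the spectral approach, on a random sparse contingency matrix.

```python
import time
import numpy as np
from scipy.sparse import random as sprandom, diags, bmat
from scipy.sparse.linalg import spsolve, svds

# Hypothetical micro-benchmark: core computation of each method on a
# random m x n sparse contingency matrix.
m, n = 400, 300
C = sprandom(m, n, density=0.02, format="csr", random_state=0)

# ICA-style: one sparse solve of the grounded Laplacian system L0 x = d0.
W = bmat([[None, C], [C.T, None]], format="csr")
d = np.asarray(W.sum(axis=1)).ravel() + 1e-9   # tiny shift guards against
L = (diags(d) - W).tocsc()                     # isolated-vertex singularity
t0 = time.perf_counter()
x = spsolve(L[1:, 1:], d[1:])                  # ground vertex 0 removed
t_ica = time.perf_counter() - t0

# Spectral-style: leading singular vectors of the sparse matrix.
t0 = time.perf_counter()
u, s, vt = svds(C, k=2)
t_svd = time.perf_counter() - t0
print(f"sparse solve: {t_ica:.4f}s, truncated SVD: {t_svd:.4f}s")
```

Exact timings depend on matrix size, sparsity, and solver; the qualitative point from the figure is that the linear-solve cost grows more slowly with vertex count than the SVD cost.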

Related Publications



    Updated 09/28/2011