Data clustering, in general, is the task of automatically grouping data points into meaningful groups, known as clusters. In a typical large data mining application, data is not only collected over a period of time, but the relationships between data points can also change over that period. It is important that the clustering does not deviate too much from the recent past because of a sudden change in the relationships between data points; at the same time, when the nature of the data evolves gradually, the clustering algorithm should reflect the corresponding change in its results. In other words, the clustering at each time step should track the current data while not differing drastically from the clustering of the recent past. This dual-objective, evolving nature is radically different from the goal of a traditional clustering algorithm and falls in the paradigm of evolutionary clustering.
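The dual objective is commonly made concrete as a weighted sum of a snapshot cost (how well clusters fit the current data) and a temporal cost (how far the clustering drifts from the previous step). The k-means-style costs and the weight `alpha` below are a minimal illustrative sketch, not the paper's exact formulation:

```python
import numpy as np

def snapshot_cost(X, centroids, labels):
    # Squared distance of each point to its assigned centroid at time t.
    return float(np.sum((X - centroids[labels]) ** 2))

def temporal_cost(centroids, prev_centroids):
    # Penalize centroids for drifting away from the previous time step.
    return float(np.sum((centroids - prev_centroids) ** 2))

def evolutionary_cost(X, centroids, labels, prev_centroids, alpha=0.8):
    # alpha trades off fitting today's snapshot against temporal smoothness.
    return (alpha * snapshot_cost(X, centroids, labels)
            + (1 - alpha) * temporal_cost(centroids, prev_centroids))
```

Minimizing this combined cost, rather than the snapshot cost alone, is what keeps consecutive clusterings consistent.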
- Proposed a general model for large-scale Evolutionary Clustering based on low-rank Kernel matrix Factorization (ECKF).
- Clusters large evolving datasets by combining low-rank matrix approximation methods with matrix-factorization-based clustering.
- We compare the proposed method with evolutionary spectral clustering (ENC) and accumulated K-means (AccKM).
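As a rough sketch of the core idea, the kernel matrix can be approximated from a small set of exemplar columns, and the resulting low-dimensional features can then be clustered. The Nyström-style approximation below is an illustrative stand-in for the Colibri-based exemplar selection used in the paper; all function names are ours:

```python
import numpy as np

def rbf_kernel(X, Y, gamma=1.0):
    # Gaussian kernel between the rows of X and the rows of Y.
    sq = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

def nystrom_features(X, exemplar_idx, gamma=1.0):
    # Approximate the full kernel K ~= C @ pinv(W) @ C.T using only the
    # exemplar columns, and return features whose inner products match it.
    C = rbf_kernel(X, X[exemplar_idx], gamma)   # n x m kernel block
    W = C[exemplar_idx]                         # m x m exemplar block
    # Symmetric square-root pseudo-inverse of W.
    vals, vecs = np.linalg.eigh(W)
    keep = vals > 1e-10
    W_inv_sqrt = vecs[:, keep] @ np.diag(vals[keep] ** -0.5) @ vecs[:, keep].T
    return C @ W_inv_sqrt                       # n x m low-rank features
```

These features can then be fed to any k-means-style factorization step; only the n x m block is ever formed, which is what makes the approach feasible at large scale.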
As can be seen from the figure above, the exemplars (representative points) selected by ECKF, shown as blue dots, nicely represent the underlying data distribution, i.e., no outliers are picked. In addition, we have also plotted the SSEs obtained by the Colibri methods in the following figure (a). As can be observed, SSE changes with the evolving nature of the data. Specifically, it increases slightly when noise is added at t1. There is almost no change in SSE when highly correlated nodes are inserted into an existing cluster at t2. When some data points are deleted at t3, SSE increases, since the subspace shrinks and some of the exemplars are removed. As t4 is a duplicate snapshot of t3, SSE remains unchanged. At t5, SSE first increases significantly due to the addition of a new cluster, but it returns to the original range once the subspace is reconstructed from the full data. As in the case of t4, SSE remains unchanged at t6. At t7, although a cluster is removed, SSE does not change, because the subspace for the existing clusters is unchanged and suffices for the approximation.
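Concretely, the SSE tracked here is the squared Frobenius reconstruction error of the low-rank approximation. A toy sketch (using a truncated eigendecomposition as a stand-in for the column-sampling approximation in the experiments) showing how the error shrinks as the subspace rank grows:

```python
import numpy as np

def low_rank_sse(K, r):
    # Best rank-r approximation of a symmetric matrix K via truncated
    # eigendecomposition, and its squared Frobenius reconstruction error.
    vals, vecs = np.linalg.eigh(K)              # eigenvalues in ascending order
    top = np.argsort(vals)[::-1][:r]            # indices of the r largest
    K_r = vecs[:, top] @ np.diag(vals[top]) @ vecs[:, top].T
    return float(np.sum((K - K_r) ** 2))
```

When the data gains structure (e.g., a new cluster appears), a fixed-rank subspace reconstructs it less well and SSE rises, which is the behavior read off the figure.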
- To further examine the subspace, we plotted the number of exemplars at each time step in the following figure (b). We can see that when noise or new nodes are added, the subspace grows, as in the cases of t1, t2, and t5. On the other hand, when nodes are deleted, as at t3 and t7, the subspace shrinks. Hence, the size of the subspace reflects the complexity of the data structure.
- To evaluate the performance of ECKF on real-world data, we selected two publicly available social network datasets. The first dataset is collected from the PubMed database.
- In the following figure (a), we plot the SSEs obtained by the Colibri methods on the PubMed datasets. The peaks before 1997 indicate changes in the structure of the data; in particular, from 1992 to 1996, a new subspace needs to be created every year. In the period from 1996 to 1999, the data structure remains relatively steady, and the subspace at each time step is inherited from the previous one. A similar pattern of peaks occurs from 2000 to 2002, with stability thereafter until 2008. The sudden spike in SSE in 2009 suggests that the subspace in this year is significantly different from that of the previous year.
In the following figure (b), we have plotted the number of exemplars to show the complexity of the data structure at each time step. The pattern observed here supports the one observed in (a). That is, before 1997, the subspace grows almost every year, indicating that the data structure is becoming more complex. Thereafter, from 1997 to 1999, the subspace shrinks. Again, from 2000 to 2002, the data becomes more complex, followed by a reduction in complexity after 2003.
- The lack of much change in SSE from 1997 to 1999 and in 2007 means that we do not need a new clustering on these datasets. In the following figure, we have plotted the number of clusters across all time steps. As discussed before, since the data is more complex prior to 2002, the number of clusters is larger than after 2002.
- The following figure presents a head-to-head comparison of ECKF, ENC, KMF, and AccKM on the remaining datasets with respect to KMCost values. Clearly, among the four methods, ECKF achieves the lowest KMCost values on all the evolving data, followed by AccKM. The KMCost values of ENC are zero during the period from 2003 to 2007, where the data contains over 4500 nodes; a zero value here indicates that ENC fails to cluster the data due to insufficient memory. Throughout, KMF generally has the highest KMCost. This clearly demonstrates the benefit of incorporating historical knowledge into matrix-factorization-based clustering. The exemplars are able to rule out noisy data points, making the ECKF framework very robust.
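For clustering in a kernel-induced feature space, a KMCost-style objective can be evaluated directly from the kernel matrix, with no explicit centroids. The sketch below uses the standard kernel k-means objective, which may differ in detail from the paper's exact definition of KMCost:

```python
import numpy as np

def kernel_kmeans_cost(K, labels):
    # Within-cluster scatter in the kernel-induced feature space:
    # for each cluster c, trace(K_cc) - (1/|c|) * sum of all entries of K_cc.
    # With a linear kernel K = X @ X.T this equals the ordinary k-means SSE.
    cost = 0.0
    for c in np.unique(labels):
        idx = np.where(labels == c)[0]
        Kc = K[np.ix_(idx, idx)]
        cost += np.trace(Kc) - Kc.sum() / len(idx)
    return float(cost)
```

A lower value means tighter clusters in feature space, which is the sense in which ECKF's curves dominate the baselines in the figure.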
- The second dataset is the Enron datasets. The following figure (a) shows the approximation errors over ten monthly time steps. As we can see, the structure of the data changes considerably in most months, except for the periods of June to July and August to September 2001. In particular, in December 2001, the data changes significantly, as indicated by the highest point in the figure. In figure (b), we show the number of selected exemplars for each month. Note that, before July 2001, the number decreases every month, meaning the structure is becoming less complex, while after that it increases for the opposite reason.
- Based on the low-rank approximation results, we do not perform a new clustering on the data for July 2001. In the following figure, we have plotted the number of clusters on the Enron datasets. As can be seen, the number of clusters drops before June 2001 and then grows, which is consistent with the curve of the number of exemplars in the above figure (b).
- In the following figure, we have compared the KMCost values obtained by ECKF, KMF, and AccKM. As we can see, ECKF achieves the lowest KMCost values on all the evolving Enron datasets.