Center for Visual Informatics and Intelligence (VII), Wayne State University

CRI: IAD Acquisition of Research Infrastructure for Knowledge-enhanced Large-scale Learning of Multimodality Visual Data



Summary


Compared with the rapid development of visual data acquisition technology and the explosive growth of acquired datasets, computational techniques for knowledge discovery and learning from very large, diverse, heterogeneous visual datasets have evolved only modestly, ultimately impeding their effective utilization and understanding. The project aims to bridge this gap and foster a strong research program in geometry-guided knowledge discovery in multimodality visual data, with an emphasis on neuroimaging applications. Specifically, the project focuses on: (1) exploring new tools based on Riemannian geometry for computing geometric structures of 3-manifolds and developing a novel volumetric mapping with geometric flow; (2) developing a rigorous mathematical foundation for semi-supervised data clustering; and (3) extending our approaches to geometric mapping and semi-supervised learning to (high-order) heterogeneous volumetric visual data analysis. Once developed, these novel algorithms are applied to computer-assisted diagnosis of important brain diseases, such as brain tumors and functional disorders. This project can help identify disease patterns in the human brain, and thus may provide both clinical and social benefits to a large sector of the population. Moreover, the project can immediately help elevate the existing resources and ongoing research to a unified, systematic level and strengthen computer science education. The research results will be widely disseminated to both the computer science and medical communities through free Web access to the software tools (including source code) and the set of sample data (including raw neuroimaging data and processed products such as high-resolution brain surface meshes) via the project Web site.




Acknowledgments

The support from the National Science Foundation (NSF) under Award Number 0751045 is gratefully acknowledged.


Collaborators

  • Dr. Ming Dong
  • Dr. Jing Hua
  • Dr. E. Mark Haacke
  • Dr. Farshad Fotouhi

Students

  • Manjeet Rege
  • Yanhua Chen
  • Lijun Wang
  • Chang Liu
  • Zhaoqiang Lai
  • Dashan Pai
  • Jiaxi Hu
  • Vahid Taimouri
  • Areen Al-Bashir
  • Gutam Bahal
  • Samuel Barnes


  • Software and Datasets

    Further details are available here.


    Topics

    • Low-rank Kernel Matrix Factorization for Large Scale Evolutionary Clustering



      Traditional clustering techniques are inapplicable to problems where the relationships between data points evolve over time. Not only is it important for the clustering algorithm to adapt to recent changes in the evolving data, it also needs to take the historical relationships between the data points into consideration. In this paper, we propose ECKF, a general framework for the evolutionary clustering of large-scale data based on low-rank kernel matrix factorization. To the best of our knowledge, this is the first work that clusters large evolutionary datasets by combining low-rank matrix approximation methods with matrix factorization based clustering. Since the low-rank approximation provides a compact representation of the original matrix, and, in particular, a near-optimal low-rank approximation can preserve the sparsity of the original data, ECKF gains computational efficiency and hence is applicable to large evolutionary datasets. Moreover, matrix factorization based methods have been shown to effectively cluster high-dimensional data in text mining and multimedia data analysis. From a theoretical standpoint, we mathematically prove the convergence and correctness of ECKF, and provide a detailed analysis of its computational efficiency (both time and space). Through extensive experiments performed on synthetic and real datasets, we show that ECKF outperforms existing methods in evolutionary clustering.
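
      ECKF itself is defined in the paper; as a rough illustration of the general recipe only, the hedged sketch below (Python; all function names and parameters are ours, not from the paper) smooths the kernel matrix over time and clusters a truncated spectral embedding of it, substituting an eigendecomposition plus k-means for ECKF's low-rank matrix factorization. It assumes the same objects are observed at every timestep.

          import numpy as np
          from scipy.linalg import eigh
          from sklearn.cluster import KMeans
          from sklearn.metrics.pairwise import rbf_kernel

          def lowrank_embedding(K, rank):
              # Embed points via the top eigenpairs of the kernel matrix
              # (a dense stand-in for a true low-rank approximation).
              vals, vecs = eigh(K)
              top = np.argsort(vals)[::-1][:rank]
              return vecs[:, top] * np.sqrt(np.clip(vals[top], 0.0, None))

          def evolutionary_cluster(snapshots, k=3, rank=8, alpha=0.8):
              # alpha blends the current kernel with the smoothed history,
              # trading off adaptation against temporal consistency.
              K_hist, labels = None, []
              for X in snapshots:                  # one array of points per timestep
                  K = rbf_kernel(X)
                  K_hist = K if K_hist is None else alpha * K + (1 - alpha) * K_hist
                  Z = lowrank_embedding(K_hist, rank)
                  labels.append(KMeans(n_clusters=k, n_init=10).fit_predict(Z))
              return labels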

      Further details of this work are available here.



    • Non-Negative Matrix Factorization for Semisupervised Heterogeneous Data Coclustering



      Coclustering heterogeneous data has attracted extensive attention recently due to its high impact on various important applications, such as text mining, image retrieval, and bioinformatics. However, data coclustering without any prior knowledge or background information is still a challenging problem. In this paper, we propose a Semisupervised Non-negative Matrix Factorization (SS-NMF) framework for data coclustering. Specifically, our method computes new relational matrices by incorporating user-provided constraints through simultaneous distance metric learning and modality selection. Using an iterative algorithm, we then perform trifactorizations of the new matrices to infer the clusters of different data types and their correspondence. Theoretically, we prove the convergence and correctness of SS-NMF coclustering and show the relationship between SS-NMF and other well-known coclustering models. Through extensive experiments conducted on publicly available text, gene expression, and image data sets, we demonstrate the superior performance of SS-NMF for heterogeneous data coclustering.
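
      SS-NMF's distance metric learning and modality selection are beyond a short sketch, but the core trifactorization step can be illustrated as below (a minimal sketch in Python under our own naming, assuming a single nonnegative relational matrix R; these are the standard multiplicative updates for the unconstrained objective, not the authors' full algorithm). Row and column cluster labels are read off the factors.

          import numpy as np

          def tri_nmf(R, k_row, k_col, iters=300, eps=1e-9, seed=0):
              # Multiplicative updates for min ||R - F S G^T||_F^2 with F, S, G >= 0.
              rng = np.random.default_rng(seed)
              m, n = R.shape
              F = rng.random((m, k_row))
              S = rng.random((k_row, k_col))
              G = rng.random((n, k_col))
              for _ in range(iters):
                  F *= (R @ G @ S.T) / (F @ (S @ (G.T @ G) @ S.T) + eps)
                  G *= (R.T @ F @ S) / (G @ (S.T @ (F.T @ F) @ S) + eps)
                  S *= (F.T @ R @ G) / ((F.T @ F) @ S @ (G.T @ G) + eps)
              # Hardened cluster assignments for the two data types
              return F.argmax(axis=1), G.argmax(axis=1)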

      Further details of this work are available here.



    • Selection-fusion approach for classification of datasets with missing values



      This paper proposes a new approach based on missing value pattern discovery for classifying incomplete data. The approach is particularly designed for the classification of datasets with a small number of samples and a high percentage of missing values, where available missing value treatment approaches do not usually work well. Based on the pattern of the missing values, the proposed approach finds subsets of samples for which most of the features are available and trains a classifier for each subset. It then combines the outputs of the classifiers. Subset selection is translated into a clustering problem, allowing the derivation of a mathematical framework for it. A trade-off is established between the computational complexity (number of subsets) and the accuracy of the overall classifier. To deal with this trade-off, a numerical criterion is proposed for predicting the overall performance. The proposed method is applied to seven datasets from the popular University of California, Irvine data mining archive and an epilepsy dataset from Henry Ford Hospital, Detroit, Michigan (eight datasets in total). Experimental results show that the classification accuracy of the proposed method is superior to those of the widely used multiple imputation method and four other methods. They also show that the level of superiority depends on the pattern and percentage of missing values.
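
      A minimal sketch of the idea (our own code, not the paper's framework): group samples by their missing-value mask with plain k-means, train one classifier per group on the features that group shares, and fuse predictions by majority vote. It assumes class labels are small non-negative integers, each group contains more than one class, and each group shares at least a few fully observed features; the paper derives a dedicated clustering formulation and a performance criterion that this sketch omits.

          import numpy as np
          from sklearn.cluster import KMeans
          from sklearn.linear_model import LogisticRegression

          def fit_pattern_ensemble(X, y, n_subsets=3, seed=0):
              # Group samples by their pattern of available values.
              mask = ~np.isnan(X)
              groups = KMeans(n_clusters=n_subsets, n_init=10,
                              random_state=seed).fit_predict(mask.astype(float))
              models = []
              for g in range(n_subsets):
                  rows = groups == g
                  feats = mask[rows].all(axis=0)   # features present in every sample
                  clf = LogisticRegression(max_iter=1000)
                  clf.fit(X[np.ix_(rows, feats)], y[rows])
                  models.append((feats, clf))
              return models

          def predict_vote(models, X):
              # Fuse the subset classifiers by majority vote.
              votes = np.array([clf.predict(np.nan_to_num(X[:, feats]))
                                for feats, clf in models])
              return np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)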

      Further details of this work are available here.



    • Intra-Patient Supine-Prone Colon Registration in CT Colonography Using Shape Spectrum



      CT colonography (CTC) is a minimally invasive screening technique for colorectal polyps and colon cancer. Since electronic colon cleansing (ECC) cannot completely remove pseudo-polyps, most CTC protocols acquire both prone and supine images to improve the visualization of the lumen wall and to reduce false positives. Comparisons between the prone and supine images can be facilitated by computerized registration between the scans. In this paper, we develop a fully automatic method for registering colon surfaces extracted from prone and supine images. The algorithm uses the shape spectrum to extract shape characteristics, which are employed as a surface signature to find corresponding regions between the prone and supine lumen surfaces. Our experimental results demonstrate an accuracy of 12.6 ± 4.20 mm over 20 datasets. The method also shows excellent potential for reducing false positives when used to detect polyps through correspondences between prone and supine images.
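
      The shape spectrum here builds on the spectrum of the Laplace-Beltrami operator. The hedged sketch below (our own code and naming) uses the plain graph Laplacian of a triangle mesh as a crude, dense stand-in for such an isometry-aware signature, comparing two surface patches by their scale-normalized eigenvalues.

          import numpy as np
          from scipy.linalg import eigh

          def graph_laplacian_spectrum(n_verts, faces, k=15):
              # Crude stand-in for a Laplace-Beltrami shape spectrum: the first k
              # nontrivial eigenvalues of the mesh's (dense) graph Laplacian.
              W = np.zeros((n_verts, n_verts))
              for a, b in ((0, 1), (1, 2), (2, 0)):
                  W[faces[:, a], faces[:, b]] = 1.0
              W = np.maximum(W, W.T)
              L = np.diag(W.sum(axis=1)) - W
              vals = eigh(L, eigvals_only=True)     # ascending eigenvalues
              return vals[1:k + 1]                  # drop the trivial zero eigenvalue

          def spectral_distance(s1, s2):
              # Scale-normalize so global size differences do not dominate.
              return float(np.linalg.norm(s1 / s1[0] - s2 / s2[0]))

      Patches of the prone and supine surfaces with a small spectral_distance would then be candidate correspondences.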

      Further details of this work are available here.



    • Isoperimetric Co-clustering Algorithm (ICA) for pairwise data co-clustering



      Data co-clustering refers to the problem of simultaneously clustering two data types. Typically, the data is stored in a contingency or co-occurrence matrix C, where the rows and columns of the matrix represent the data types to be co-clustered. An entry Cij of the matrix signifies the relation between the data type represented by row i and column j. Co-clustering derives sub-matrices from the larger data matrix by simultaneously clustering rows and columns of the data matrix. We present a novel graph-theoretic approach to data co-clustering. The two data types are modeled as the two sets of vertices of a weighted bipartite graph. We use the Isoperimetric Co-clustering Algorithm (ICA), a new method for partitioning the bipartite graph. ICA requires only the solution of a sparse system of linear equations, instead of the eigenvalue or SVD problem of the popular spectral co-clustering approach. Our theoretical analysis and extensive experiments performed on publicly available datasets demonstrate the advantages of ICA over other approaches in terms of quality, efficiency, and stability in partitioning the bipartite graph.
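
      A minimal sketch of the isoperimetric idea (our own code, not the authors' implementation): build the bipartite graph from C, ground one vertex, solve a single sparse linear system L x = d, and threshold the solution. ICA sweeps the threshold for the best isoperimetric ratio; we simply cut at the median.

          import numpy as np
          import scipy.sparse as sp
          from scipy.sparse.linalg import spsolve

          def isoperimetric_cocluster(C):
              # Bipartition the rows and columns of a nonnegative matrix C by
              # solving one sparse linear system on its bipartite graph.
              m, n = C.shape
              C = sp.csr_matrix(C)
              W = sp.bmat([[None, C], [C.T, None]]).tocsr()   # rows, then columns
              d = np.asarray(W.sum(axis=1)).ravel()
              L = (sp.diags(d) - W).tocsr()
              g = int(np.argmax(d))                 # ground the highest-degree vertex
              keep = np.arange(m + n) != g
              x = np.zeros(m + n)
              x[keep] = spsolve(L[keep][:, keep].tocsc(), d[keep])
              labels = (x > np.median(x)).astype(int)   # median cut (simplification)
              return labels[:m], labels[m:]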

      Further details of this work are available here.



    • Consistent Isoperimetric High-order Co-clustering (CIHC) for high-order data co-clustering



      Many real-world clustering problems arising in data mining applications are heterogeneous in nature. Heterogeneous co-clustering involves the simultaneous clustering of objects of two or more data types. While pairwise co-clustering of two data types has been well studied in the literature, research on high-order heterogeneous co-clustering is still limited. We propose a graph-theoretical framework for addressing star-structured co-clustering problems, in which a central data type is connected to all the other data types. Partitioning this graph leads to co-clustering of all the data types under the constraints of the star structure. Although graph partitioning approaches have been adopted before to address star-structured heterogeneous clustering problems, the main contribution of this work lies in an efficient algorithm for partitioning the star-structured graph. Computationally, our algorithm is very fast, as it requires only the solution of a sparse, overdetermined system of linear equations. Theoretical analysis and extensive experiments performed on toy and real datasets demonstrate the quality, efficiency, and stability of the proposed algorithm.
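
      The sketch below (our own illustrative code, not CIHC itself) extends the isoperimetric idea to a three-type star: each wing contributes its bipartite Laplacian equations over the shared central variables, and the stacked, overdetermined system is solved in the least-squares sense with lsqr before thresholding.

          import numpy as np
          import scipy.sparse as sp
          from scipy.sparse.linalg import lsqr

          def bipartite_laplacian(C):
              C = sp.csr_matrix(C)
              W = sp.bmat([[None, C], [C.T, None]]).tocsr()
              d = np.asarray(W.sum(axis=1)).ravel()
              return (sp.diags(d) - W).tocsr(), d

          def star_cocluster(A, B):
              # Central type X related to Y via A (|X| x |Y|) and to Z via B (|X| x |Z|).
              m, n1 = A.shape
              n2 = B.shape[1]
              N = m + n1 + n2                       # unknowns: [x_X, x_Y, x_Z]
              L1, d1 = bipartite_laplacian(A)       # equations over X and Y
              L2, d2 = bipartite_laplacian(B)       # equations over X and Z
              E = sp.eye(N, format='csr')
              S1 = E[:m + n1]                                   # picks [x_X, x_Y]
              S2 = E[list(range(m)) + list(range(m + n1, N))]   # picks [x_X, x_Z]
              M = sp.vstack([L1 @ S1, L2 @ S2]).tocsc()
              b = np.concatenate([d1, d2])
              keep = np.arange(N) != 0              # ground one central vertex
              x = np.zeros(N)
              x[keep] = lsqr(M[:, keep], b)[0]      # sparse least-squares solve
              labels = (x > np.median(x)).astype(int)
              return labels[:m], labels[m:m + n1], labels[m + n1:]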

      Further details of this work are available here.



    • Semi-supervised NMF for Homogeneous Data Clustering



      Traditional clustering algorithms are inapplicable to many real-world problems where limited knowledge from domain experts is available. Incorporating this domain knowledge can guide a clustering algorithm and consequently improve the quality of clustering. We propose SS-NMF: a semi-supervised non-negative matrix factorization framework for data clustering. In SS-NMF, users are able to provide supervision for clustering in terms of pairwise constraints on a few data objects, specifying whether they "must" or "cannot" be clustered together. Through an iterative algorithm, we perform a symmetric trifactorization of the data similarity matrix to infer the clusters. Theoretically, we show the correctness and convergence of SS-NMF and show that it provides a general framework for semi-supervised clustering. Through extensive experiments conducted on publicly available datasets, we demonstrate the superior performance of SS-NMF for clustering.
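
      A minimal sketch of the symmetric case, with two loudly flagged simplifications of ours: the constraints merely reweight the similarity matrix (SS-NMF instead learns a distance metric from them), and the updates below are one common multiplicative-update scheme for A ≈ H S H^T rather than the paper's exact algorithm. A must be a symmetric nonnegative similarity matrix.

          import numpy as np

          def ssnmf_cluster(A, k, must=(), cannot=(), reward=1.0,
                            iters=300, eps=1e-9, seed=0):
              A = A.copy()
              for i, j in must:                     # encourage same cluster
                  A[i, j] = A[j, i] = A[i, j] + reward
              for i, j in cannot:                   # discourage same cluster
                  A[i, j] = A[j, i] = 0.0
              rng = np.random.default_rng(seed)
              n = A.shape[0]
              H = rng.random((n, k))
              S = rng.random((k, k))
              for _ in range(iters):
                  H *= np.sqrt((A @ H @ S) / (H @ (H.T @ (A @ H) @ S) + eps))
                  S *= np.sqrt((H.T @ A @ H) / ((H.T @ H) @ S @ (H.T @ H) + eps))
              return H.argmax(axis=1)               # hard cluster assignments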

      Further details of this work are available here.



    • Semi-supervised NMF for Heterogeneous Data Clustering



      Co-clustering heterogeneous data has attracted extensive attention recently due to its high impact on various important applications, such as text mining, image retrieval, and bioinformatics. However, data co-clustering without any prior knowledge or background information is still a challenging problem. In this work, we propose a Semi-Supervised Non-negative Matrix Factorization (SS-NMF) framework for data co-clustering. Specifically, our method computes new relational matrices by incorporating user-provided constraints through simultaneous distance metric learning and modality selection. Using an iterative algorithm, we then perform trifactorizations of the new matrices to infer the clusters of different data types and their correspondence. Theoretically, we prove the convergence and correctness of SS-NMF co-clustering and show the relationship between SS-NMF and other well-known co-clustering models. Through extensive experiments conducted on publicly available text, gene expression, and image data sets, we demonstrate the superior performance of SS-NMF for heterogeneous data co-clustering.

      Further details of this work are available here.



    • Physically Based Modeling and Simulation with Dynamic Spherical Volumetric Simplex Splines



      In this work, we present a novel computational modeling and simulation framework based on dynamic spherical volumetric simplex splines. The framework can handle the modeling and simulation of genus-zero objects with real physical properties. Within this framework, we first develop an accurate and efficient algorithm to reconstruct a high-fidelity digital model of a real-world object with spherical volumetric simplex splines, which can simultaneously and accurately represent the geometric, material, and other properties of the object. Through tight coupling with Lagrangian mechanics, the dynamic volumetric simplex splines representing the object can accurately simulate its physical behavior, because they unify the geometric and material properties in the simulation. The visualization can be computed directly from the object's geometric or physical representation during simulation, without interpolation or resampling. We have applied the framework to the biomechanical simulation of brain deformations, such as brain shift during surgery and brain injury under blunt impact. We have compared our simulation results with ground truth obtained through intra-operative magnetic resonance imaging and real biomechanical experiments. The evaluations demonstrate the excellent performance of our new technique.

      Further details of this work are available here.



    • Geodesic Distance-Weighted Shape Vector Image Diffusion



      This work proposes a novel and efficient surface matching and visualization framework based on geodesic distance-weighted shape vector image diffusion. Based on conformal geometry, our approach can uniquely map a 3D surface to a canonical rectangular domain and encode the shape characteristics (e.g., mean curvatures and conformal factors) of the surface in the 2D domain to construct a geodesic distance-weighted shape vector image, in which the distances between sampling pixels are not uniform but are the actual geodesic distances on the manifold. Through this novel diffusion, we can create a multiscale diffusion space in which cross-scale extrema can be detected as robust geometric features for the matching and registration of surfaces. Therefore, statistical analysis and visualization of surface properties across subjects become readily available. Experiments on scanned surface models show that our method is very robust for feature extraction and surface matching, even under noise and resolution changes. We have also applied the framework to real 3D human neocortical surfaces, and demonstrated the excellent performance of our approach in statistical analysis and integrated visualization of multimodality volumetric data over the shape vector image.
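
      One diffusion step can be sketched as below (our own illustrative code): V is an H x W x C shape vector image and D, a storage convention we introduce here, holds the geodesic distance from each pixel to its lower and right neighbors, so that diffusion fluxes are attenuated where neighbors are geodesically far apart.

          import numpy as np

          def weighted_diffuse(V, D, steps=50, lam=0.2, sigma=1.0):
              # V: (H, W, C) shape vector image; D: (H, W, 2) with
              # D[..., 0] the geodesic distance to the pixel below and
              # D[..., 1] the distance to the pixel on the right.
              V = V.astype(float).copy()
              w_down = np.exp(-D[:-1, :, 0] ** 2 / (2 * sigma ** 2))[..., None]
              w_right = np.exp(-D[:, :-1, 1] ** 2 / (2 * sigma ** 2))[..., None]
              for _ in range(steps):
                  flux = np.zeros_like(V)
                  diff_d = (V[1:, :] - V[:-1, :]) * w_down     # vertical exchange
                  flux[:-1, :] += diff_d
                  flux[1:, :] -= diff_d
                  diff_r = (V[:, 1:] - V[:, :-1]) * w_right    # horizontal exchange
                  flux[:, :-1] += diff_r
                  flux[:, 1:] -= diff_r
                  V += lam * flux                              # lam < 0.25 for stability
              return V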

      Further details of this work are available here.



    • Simultaneous Localized Feature Selection and Model Detection for Gaussian Mixtures



      This work proposes a novel approach to simultaneous localized feature selection and model detection for unsupervised learning. In our approach, local feature saliency, together with the other parameters of the Gaussian mixture, is estimated by Bayesian variational learning. Experiments performed on both synthetic and real-world data sets demonstrate that our approach is superior to both global feature selection and subspace clustering methods.

      Further details of this work are available here.



    • Exemplar-based Visualization of Large Document Corpus



      With the rapid growth of the World Wide Web and electronic information services, text corpora are becoming available online at an incredible rate. By displaying text data in a logical layout (e.g., color graphs), text visualization presents a direct way to observe documents as well as understand the relationships between them. In this work, we propose a novel technique, Exemplar-based Visualization (EV), to visualize an extremely large text corpus. Capitalizing on recent advances in matrix approximation and decomposition, EV presents a probabilistic multidimensional projection model in the low-rank text subspace with a sound objective function. The topic proportions of each document are obtained through iterative optimization and embedded into a low-dimensional space using parameter embedding. By selecting representative exemplars, we obtain a compact approximation of the data. This makes the visualization highly efficient and flexible. In addition, the selected exemplars neatly summarize the entire data set and greatly reduce the cognitive overload in the visualization, leading to an easier interpretation of a large text corpus. Empirically, we demonstrate the superior performance of EV through extensive experiments performed on publicly available text data sets.
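
      As a loose illustration only (our own pipeline, substituting scikit-learn's NMF and t-SNE for the paper's probabilistic model and parameter embedding): factor the document-term matrix into topic proportions, embed them in 2D, and pick exemplars as the documents nearest to cluster centers in topic space.

          import numpy as np
          from sklearn.cluster import KMeans
          from sklearn.decomposition import NMF
          from sklearn.manifold import TSNE

          def exemplar_view(X, n_topics=10, n_exemplars=50, seed=0):
              # X: nonnegative document-term matrix (a few hundred docs or more,
              # so that the default t-SNE perplexity is valid).
              W = NMF(n_components=n_topics, random_state=seed,
                      max_iter=400).fit_transform(X)
              P = W / (W.sum(axis=1, keepdims=True) + 1e-12)   # topic proportions
              xy = TSNE(n_components=2, random_state=seed).fit_transform(P)
              # Exemplars: documents nearest to k-means centers in topic space.
              km = KMeans(n_clusters=n_exemplars, n_init=10, random_state=seed).fit(P)
              dists = ((P[:, None, :] - km.cluster_centers_[None]) ** 2).sum(axis=2)
              return xy, np.unique(dists.argmin(axis=0))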

      Further details of this work are available here.

      Exemplar-based Visualization software demo is available here.

      The 10Pubmed data set used by the software is available here.



    • Intrinsic Geometric Scale Space by Shape Diffusion



      This work formalizes a novel, intrinsic geometric scale space (IGSS) of 3D surface shapes. The intrinsic geometry of a surface is diffused by means of the Ricci flow to generate a geometric scale space. We rigorously prove that this multiscale shape representation satisfies the axiomatic causality property. Within this theoretical framework, we further present a feature-based shape representation derived from IGSS processing, which is shown to be theoretically plausible and practically effective. By integrating the concept of scale-dependent saliency into the shape description, this representation is not only highly descriptive of local structures, but also exhibits several desired characteristics of global shape representations, such as being compact, robust to noise, and computationally efficient. We demonstrate the capabilities of our approach through salient geometric feature detection and highly discriminative matching of 3D scans.
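
      The Ricci-flow diffusion is beyond a few lines of code; the crude stand-in below (all thresholds and names are ours) builds a discrete scale space of a per-vertex curvature signal by repeated one-ring averaging and flags vertices whose inter-scale response stands well above that of their neighbors, mimicking scale-dependent saliency.

          import numpy as np
          import scipy.sparse as sp

          def scale_space_saliency(curv, faces, n_scales=6, lam=0.5):
              # curv: per-vertex scalar (e.g., mean curvature); faces: (F, 3) ints.
              n = len(curv)
              i = np.concatenate([faces[:, 0], faces[:, 1], faces[:, 2]])
              j = np.concatenate([faces[:, 1], faces[:, 2], faces[:, 0]])
              W = sp.coo_matrix((np.ones(len(i)), (i, j)), shape=(n, n))
              W = ((W + W.T) > 0).astype(float).tocsr()
              deg = np.asarray(W.sum(axis=1)).ravel()
              avg = sp.diags(1.0 / np.maximum(deg, 1.0)) @ W   # one-ring mean
              levels = [np.asarray(curv, dtype=float)]
              for _ in range(n_scales):                        # smooth repeatedly
                  levels.append((1 - lam) * levels[-1] + lam * (avg @ levels[-1]))
              salient = np.zeros(n, dtype=bool)
              for d in np.diff(np.stack(levels), axis=0):      # inter-scale responses
                  a = np.abs(d)
                  salient |= a > 2.0 * (avg @ a + 1e-12)       # beats neighborhood mean
              return salient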

      Further details of this work are available here.





    Last updated: 09/28/2011