Selection-fusion approach for classification of datasets with missing values
Management of missing values becomes critical when the number of available samples is small. Many challenges arise from insufficient statistical power after the missing values are imputed. Three questions need to be resolved: how to measure the complexity of the missing values, how to work with them when imputation is inappropriate, and how to manage them when the same features are missing in both the test and training samples.
- A new approach, named selection-fusion, is proposed, based on the subspace classification method.
- Missing value management is integrated not only in the training but also in the testing of the classifier.
- A set of classifiers is trained on subspaces of the original feature space and then clustered using a distance metric.
- The best classifiers in each cluster, depending on the testing data, are combined to construct the overall classifier and estimate the final output.
- To support the hypotheses in the previous sections, and to evaluate the proposed method against previous methods, we conducted a variety of experiments using a wide range of real-world datasets. Eight datasets were used: seven from the University of California, Irvine, and our own epilepsy dataset. The proposed method is compared with five well-known missing value management algorithms: (1) pairwise deletion; (2) decision tree (CART); (3) expectation maximization (EM) single imputation; (4) multiple imputation (MI) with EM; and (5) an ensemble classifier, namely voting selection-fusion (SF) with random selections. The means and standard deviations of the correct classification percentages are calculated and presented in the following table. The results show that the proposed algorithm outperforms the other methods when either the percentage of missing values is large (more than 20%) or the number of samples in the dataset is small.
- The relationship between the proposed index (CVI) and the performance of the selection-fusion algorithm is evaluated. The following figure compares the accuracy of the four methods as 1/CVI changes from 1 to 40 for the three datasets. The results illustrate that although the relationship between the accuracy and 1/CVI depends on the pattern of the missing values, our approach (SF) is always superior when 1/CVI is larger than 20. As 1/CVI increases further, the superiority of the SF approach over the other methods becomes more pronounced.
- To evaluate the effect of the number of samples in the dataset and the percentage of missing values on the CVI, some of the features are removed from the Breast Cancer dataset using the MAR, MCAR, and systematic missing value models. For each of the resulting datasets, 1/CVI is calculated and plotted in the following figure versus the number of samples (sample space size) and the percentage of missing values. The results show that the relationship between 1/CVI and these two parameters depends on the pattern of the missing values, although the relationship is always monotone.
- The effect of the number of subsets on the performance of the proposed method is evaluated by applying the method to the original HBIDS dataset and to additional datasets generated by randomly removing some of the features from the original dataset. The results are graphed in panel (a) of the following figure. With 10% missing values, 5 subsets yield the maximum performance, and increasing the number of subsets beyond 5 does not improve the performance much. With 30% missing values, on the other hand, at least 8 subsets are required to reach the maximum performance.
- The performance of the proposed approach is compared with the multiple imputation method by estimating their receiver operating characteristic (ROC) curves for the HBIDS dataset. The results, graphed in panel (b) of the following figure, show that our approach has higher sensitivity and specificity. For the dataset with 20% missing values, the area under the ROC curve of the proposed method is about 5% larger than that of multiple imputation.
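
The selection-fusion steps listed above (train classifiers on feature subspaces, cluster them, then select the best usable classifier in each cluster for a given test sample and fuse the selected outputs) can be illustrated with a minimal Python sketch. It is not the authors' implementation. The nearest-centroid base classifier, the Jaccard distance between feature subsets, the greedy threshold clustering, training accuracy as the "best" criterion, majority voting as the fusion rule, and all function names are assumptions made for illustration only. Missing values are encoded as `None`.

```python
from collections import Counter

def observed(row, feats):
    """True if the row has a value for every feature in the subspace."""
    return all(row[f] is not None for f in feats)

def train_centroids(X, y, feats):
    """Nearest-centroid model on one subspace, using only rows complete on it."""
    sums, counts = {}, Counter()
    for row, label in zip(X, y):
        if observed(row, feats):
            acc = sums.setdefault(label, [0.0] * len(feats))
            for i, f in enumerate(feats):
                acc[i] += row[f]
            counts[label] += 1
    return {c: [s / counts[c] for s in sums[c]] for c in sums}

def predict(centroids, row, feats):
    """Label of the closest class centroid (squared Euclidean distance)."""
    def dist(c):
        return sum((row[f] - m) ** 2 for f, m in zip(feats, centroids[c]))
    return min(centroids, key=dist)

def accuracy(centroids, X, y, feats):
    """Training accuracy over the rows that are complete on the subspace."""
    pairs = [(r, l) for r, l in zip(X, y) if observed(r, feats)]
    if not pairs or not centroids:
        return 0.0
    return sum(predict(centroids, r, feats) == l for r, l in pairs) / len(pairs)

def jaccard_distance(a, b):
    """Assumed distance between classifiers: dissimilarity of feature subsets."""
    a, b = set(a), set(b)
    return 1.0 - len(a & b) / len(a | b)

def cluster(subspaces, threshold=0.5):
    """Greedy clustering: join the first cluster whose seed is within threshold."""
    clusters = []
    for i, feats in enumerate(subspaces):
        for members in clusters:
            if jaccard_distance(feats, subspaces[members[0]]) <= threshold:
                members.append(i)
                break
        else:
            clusters.append([i])
    return clusters

def fit(X, y, subspaces, threshold=0.5):
    """Train one classifier per subspace, then cluster the classifiers."""
    models = []
    for feats in subspaces:
        cents = train_centroids(X, y, feats)
        models.append((feats, cents, accuracy(cents, X, y, feats)))
    return models, cluster(subspaces, threshold)

def classify(models, clusters, row):
    """Selection: best usable classifier per cluster. Fusion: majority vote."""
    votes = []
    for members in clusters:
        usable = [models[i] for i in members
                  if models[i][1] and observed(row, models[i][0])]
        if usable:
            feats, cents, _ = max(usable, key=lambda m: m[2])
            votes.append(predict(cents, row, feats))
    return Counter(votes).most_common(1)[0][0] if votes else None

# Toy data (hypothetical): three features, two classes, None marks a missing value.
X_train = [[1.0, 1.2, None], [0.9, None, 1.1], [1.1, 0.8, 1.0],
           [3.0, 3.1, None], [None, 2.9, 3.2], [3.2, 3.0, 2.8]]
y_train = [0, 0, 0, 1, 1, 1]
subspaces = [(0,), (1,), (2,), (0, 1), (1, 2), (0, 2)]

models, clusters = fit(X_train, y_train, subspaces)
print(classify(models, clusters, [None, 3.0, 2.9]))  # feature 0 missing at test time
```

Because the selection step only considers classifiers whose subspace is fully observed in the test sample, no imputation is needed at test time; a sample with no usable classifier returns `None` in this sketch.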