An embedded feature selection method for imbalanced data classification

Haoyue Liu, Mengchu Zhou, Qing Liu

Research output: Contribution to journal › Article › peer-review

243 Scopus citations


Imbalanced datasets are frequently found in real-world applications, e.g., fraud detection and cancer diagnosis. For such datasets, improving the accuracy of identifying the minority class is critically important. Feature selection is one way to address this issue. An effective feature selection method chooses a subset of features that favors accurate determination of the minority class. A decision tree is a classifier that can be built using different splitting criteria. Its advantage is the ease of detecting which feature is used at a splitting node. Thus, a decision tree splitting criterion can serve as a feature selection method. In this paper, an embedded feature selection method using our proposed weighted Gini index (WGI) is presented. Comparison with the Chi2, F-statistic and Gini index feature selection methods shows that F-statistic and Chi2 reach the best performance when only a few features are selected. As the number of selected features increases, our proposed method has the highest probability of achieving the best performance. The area under a receiver operating characteristic curve (ROC AUC) and the F-measure are used as evaluation criteria. Experimental results with two datasets show that ROC AUC can be high even if only a few features are selected, and changes only slightly as more features are selected. However, the F-measure reaches excellent performance only if 20% or more of the features are chosen. The results can help practitioners select a proper feature selection method when facing a practical problem.
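To make the idea of using a splitting criterion as a feature ranker concrete, here is a minimal sketch in plain Python. The abstract does not state the WGI formula, so this is not the paper's definition: it assumes, purely for illustration, that each sample is weighted by the inverse of its class frequency before the Gini impurity is computed. The function names (`weighted_gini`, `best_split_gain`, `rank_features`) and the toy data are hypothetical.

```python
from collections import Counter


def weighted_gini(labels, class_weight):
    """Gini impurity where each sample counts with its class weight."""
    counts = Counter(labels)
    total = sum(class_weight[c] * n for c, n in counts.items())
    if total == 0:
        return 0.0
    return 1.0 - sum((class_weight[c] * n / total) ** 2
                     for c, n in counts.items())


def best_split_gain(values, labels, class_weight):
    """Largest weighted-impurity decrease over all thresholds on one feature."""
    def mass(ys):
        return sum(class_weight[y] for y in ys)

    parent = weighted_gini(labels, class_weight)
    total = mass(labels)
    best = 0.0
    # Try every distinct value except the largest as a "<= t" threshold.
    for t in sorted(set(values))[:-1]:
        left = [y for v, y in zip(values, labels) if v <= t]
        right = [y for v, y in zip(values, labels) if v > t]
        child = (mass(left) * weighted_gini(left, class_weight)
                 + mass(right) * weighted_gini(right, class_weight)) / total
        best = max(best, parent - child)
    return best


def rank_features(X, y, class_weight):
    """Return (feature indices by decreasing gain, per-feature gains)."""
    gains = [best_split_gain([row[j] for row in X], y, class_weight)
             for j in range(len(X[0]))]
    order = sorted(range(len(gains)), key=lambda j: gains[j], reverse=True)
    return order, gains


# Toy imbalanced data (made up): 8 majority samples (0), 2 minority samples (1).
# Feature 0 separates the classes; feature 1 does not.
X = [[0.1, 5], [0.2, 3], [0.3, 4], [0.2, 1], [0.4, 2],
     [0.1, 4], [0.3, 5], [0.2, 2], [0.9, 3], [0.8, 4]]
y = [0] * 8 + [1] * 2

# Assumed weighting: inverse class frequency, so both classes carry equal mass.
freq = Counter(y)
weights = {c: 1.0 / n for c, n in freq.items()}

order, gains = rank_features(X, y, weights)
print(order, gains)
```

Keeping the top-ranked indices in `order` gives a feature subset; a different weighting scheme would only require changing the `weights` dictionary.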

Original language: English (US)
Article number: 8677302
Pages (from-to): 703-715
Number of pages: 13
Journal: IEEE/CAA Journal of Automatica Sinica
Issue number: 3
State: Published - May 2019

All Science Journal Classification (ASJC) codes

  • Control and Optimization
  • Artificial Intelligence
  • Information Systems
  • Control and Systems Engineering
