TY - JOUR
T1 - An embedded feature selection method for imbalanced data classification
AU - Liu, Haoyue
AU - Zhou, Mengchu
AU - Liu, Qing
N1 - Funding Information:
Manuscript received September 20, 2018; revised December 31, 2018; accepted February 21, 2019. This work was supported in part by the National Science Foundation of USA (CMMI-1162482). Recommended by Associate Editor Kao-shing Hwang. (Corresponding author: Haoyue Liu.) Citation: H. Y. Liu, M. C. Zhou, and Q. Liu, “An embedded feature selection method for imbalanced data classification,” IEEE/CAA J. Autom. Sinica, vol. 6, no. 3, pp. 703–715, May 2019.
Publisher Copyright:
© 2014 Chinese Association of Automation.
PY - 2019/5
Y1 - 2019/5
N2 - Imbalanced data is one type of dataset frequently found in real-world applications, e.g., fraud detection and cancer diagnosis. For this type of dataset, improving the accuracy of identifying the minority class is a critically important issue. Feature selection is one method to address this issue. An effective feature selection method can choose a subset of features that favor the accurate determination of the minority class. A decision tree is a classifier that can be built up by using different splitting criteria. Its advantage is the ease of detecting which feature is used as a splitting node. Thus, it is possible to use a decision tree splitting criterion as a feature selection method. In this paper, an embedded feature selection method using our proposed weighted Gini index (WGI) is presented. Comparison with the Chi2, F-statistic and Gini index feature selection methods shows that F-statistic and Chi2 reach the best performance when only a few features are selected. As the number of selected features increases, our proposed method has the highest probability of achieving the best performance. The area under a receiver operating characteristic curve (ROC AUC) and F-measure are used as evaluation criteria. Experimental results with two datasets show that ROC AUC performance can be high, even if only a few features are selected and used, and only changes slightly as more and more features are selected. However, F-measure achieves excellent performance only if 20% or more of the features are chosen. The results are helpful for practitioners to select a proper feature selection method when facing a practical problem.
AB - Imbalanced data is one type of dataset frequently found in real-world applications, e.g., fraud detection and cancer diagnosis. For this type of dataset, improving the accuracy of identifying the minority class is a critically important issue. Feature selection is one method to address this issue. An effective feature selection method can choose a subset of features that favor the accurate determination of the minority class. A decision tree is a classifier that can be built up by using different splitting criteria. Its advantage is the ease of detecting which feature is used as a splitting node. Thus, it is possible to use a decision tree splitting criterion as a feature selection method. In this paper, an embedded feature selection method using our proposed weighted Gini index (WGI) is presented. Comparison with the Chi2, F-statistic and Gini index feature selection methods shows that F-statistic and Chi2 reach the best performance when only a few features are selected. As the number of selected features increases, our proposed method has the highest probability of achieving the best performance. The area under a receiver operating characteristic curve (ROC AUC) and F-measure are used as evaluation criteria. Experimental results with two datasets show that ROC AUC performance can be high, even if only a few features are selected and used, and only changes slightly as more and more features are selected. However, F-measure achieves excellent performance only if 20% or more of the features are chosen. The results are helpful for practitioners to select a proper feature selection method when facing a practical problem.
UR - http://www.scopus.com/inward/record.url?scp=85063899495&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85063899495&partnerID=8YFLogxK
U2 - 10.1109/JAS.2019.1911447
DO - 10.1109/JAS.2019.1911447
M3 - Article
AN - SCOPUS:85063899495
SN - 2329-9266
VL - 6
SP - 703
EP - 715
JO - IEEE/CAA Journal of Automatica Sinica
JF - IEEE/CAA Journal of Automatica Sinica
IS - 3
M1 - 8677302
ER -