TY - GEN
T1 - Weighted Gini index feature selection method for imbalanced data
AU - Liu, Haoyue
AU - Zhou, Mengchu
AU - Lu, Xiaoyu Sean
AU - Yao, Cynthia
N1 - Publisher Copyright: © 2018 IEEE.
PY - 2018/5/18
Y1 - 2018/5/18
N2 - The imbalanced class problem occurs in many real-world applications, e.g., fraud detection, text classification, and cancer diagnosis. Besides rebalancing the data distribution, another significant way to address the bias-to-majority problem is proper feature selection. This work aims to provide a feature selection method that chooses a subset of features yielding high ROC AUC and F-measure results, and thus high performance on the minority class. In this paper, a weighted Gini index (WGI) feature selection method is proposed. To evaluate the proposed method, it is compared with Chi-square, F-statistic, and Gini index feature selection, with XGBoost as the classifier used to test the performance of each selected subset of features. Experimental results indicate that F-statistic achieves the best performance when only a few features are selected. However, as the number of selected features increases, WGI feature selection achieves the best results. A comparison between the average ROC AUC and F-measure results is also presented. It shows that ROC AUC is good even when only a few features are selected and changes only slightly as the subset of features expands, whereas F-measure reaches good performance only after 60% of the features are chosen. The results can help practitioners select a proper feature selection method when facing a practical problem.
AB - The imbalanced class problem occurs in many real-world applications, e.g., fraud detection, text classification, and cancer diagnosis. Besides rebalancing the data distribution, another significant way to address the bias-to-majority problem is proper feature selection. This work aims to provide a feature selection method that chooses a subset of features yielding high ROC AUC and F-measure results, and thus high performance on the minority class. In this paper, a weighted Gini index (WGI) feature selection method is proposed. To evaluate the proposed method, it is compared with Chi-square, F-statistic, and Gini index feature selection, with XGBoost as the classifier used to test the performance of each selected subset of features. Experimental results indicate that F-statistic achieves the best performance when only a few features are selected. However, as the number of selected features increases, WGI feature selection achieves the best results. A comparison between the average ROC AUC and F-measure results is also presented. It shows that ROC AUC is good even when only a few features are selected and changes only slightly as the subset of features expands, whereas F-measure reaches good performance only after 60% of the features are chosen. The results can help practitioners select a proper feature selection method when facing a practical problem.
KW - feature selection
KW - imbalanced data
KW - weighted Gini index
UR - http://www.scopus.com/inward/record.url?scp=85048249209&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85048249209&partnerID=8YFLogxK
U2 - 10.1109/ICNSC.2018.8361371
DO - 10.1109/ICNSC.2018.8361371
M3 - Conference contribution
AN - SCOPUS:85048249209
T3 - ICNSC 2018 - 15th IEEE International Conference on Networking, Sensing and Control
SP - 1
EP - 6
BT - ICNSC 2018 - 15th IEEE International Conference on Networking, Sensing and Control
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 15th IEEE International Conference on Networking, Sensing and Control, ICNSC 2018
Y2 - 27 March 2018 through 29 March 2018
ER -