TY - JOUR
T1 - A Distance-Based Weighted Undersampling Scheme for Support Vector Machines and its Application to Imbalanced Classification
AU - Kang, Qi
AU - Shi, Lei
AU - Zhou, Meng Chu
AU - Wang, Xue Song
AU - Wu, Qi Di
AU - Wei, Zhi
N1 - Funding Information:
Manuscript received February 27, 2017; revised June 19, 2017 and September 17, 2017; accepted September 17, 2017. Date of publication October 25, 2017; date of current version August 20, 2018. This work was supported in part by the Natural Science Foundation of China under Grant 51775385, Grant 71371142, and Grant 61703279, FDCT (Fundo para o Desenvolvimento das Ciencias e da Tecnologia) under Grant 119/2014/A3 and the Fundamental Research Funds for the Central Universities. (Corresponding authors: Qi Kang and MengChu Zhou.) Q. Kang, L. Shi, X. Wang, and Q. Wu are with the Department of Control Science and Engineering, School of Electronics and Information Engineering, Tongji University, Shanghai 201804, China (e-mail: qkang@tongji.edu.cn).
Publisher Copyright:
© 2012 IEEE.
PY - 2018/9
Y1 - 2018/9
N2 - A support vector machine (SVM) plays a prominent role in classic machine learning, especially classification and regression. Through its structural risk minimization, it has enjoyed a good reputation in effectively reducing overfitting, avoiding dimensional disaster, and not falling into local minima. Nevertheless, existing SVMs do not perform well when facing class imbalance and large-scale samples. Undersampling is a plausible alternative to solve imbalanced problems in some way, but suffers from soaring computational complexity and reduced accuracy because of its enormous iterations and random sampling process. To improve their classification performance in dealing with data imbalance problems, this work proposes a weighted undersampling (WU) scheme for SVM based on space geometry distance, and thus produces an improved algorithm named WU-SVM. In WU-SVM, majority samples are grouped into some subregions (SRs) and assigned different weights according to their Euclidean distance to the hyperplane. The samples in an SR with higher weight have more chance to be sampled and put to use in each learning iteration, so as to retain the data distribution information of original data sets as much as possible. Comprehensive experiments are performed to test WU-SVM via 21 binary-class and six multiclass publicly available data sets. The results show that it well outperforms the state-of-the-art methods in terms of three popular metrics for imbalanced classification, i.e., area under the curve, F-Measure, and G-Mean.
AB - A support vector machine (SVM) plays a prominent role in classic machine learning, especially classification and regression. Through its structural risk minimization, it has enjoyed a good reputation in effectively reducing overfitting, avoiding dimensional disaster, and not falling into local minima. Nevertheless, existing SVMs do not perform well when facing class imbalance and large-scale samples. Undersampling is a plausible alternative to solve imbalanced problems in some way, but suffers from soaring computational complexity and reduced accuracy because of its enormous iterations and random sampling process. To improve their classification performance in dealing with data imbalance problems, this work proposes a weighted undersampling (WU) scheme for SVM based on space geometry distance, and thus produces an improved algorithm named WU-SVM. In WU-SVM, majority samples are grouped into some subregions (SRs) and assigned different weights according to their Euclidean distance to the hyperplane. The samples in an SR with higher weight have more chance to be sampled and put to use in each learning iteration, so as to retain the data distribution information of original data sets as much as possible. Comprehensive experiments are performed to test WU-SVM via 21 binary-class and six multiclass publicly available data sets. The results show that it well outperforms the state-of-the-art methods in terms of three popular metrics for imbalanced classification, i.e., area under the curve, F-Measure, and G-Mean.
KW - Class imbalance
KW - Euclidean distance
KW - data distribution
KW - support vector machine (SVM)
KW - undersampling
UR - http://www.scopus.com/inward/record.url?scp=85032450330&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85032450330&partnerID=8YFLogxK
U2 - 10.1109/TNNLS.2017.2755595
DO - 10.1109/TNNLS.2017.2755595
M3 - Article
C2 - 29990027
AN - SCOPUS:85032450330
SN - 2162-237X
VL - 29
SP - 4152
EP - 4165
JO - IEEE Transactions on Neural Networks and Learning Systems
JF - IEEE Transactions on Neural Networks and Learning Systems
IS - 9
M1 - 8082535
ER -