TY - GEN
T1 - An ensemble deep learning model for drug abuse detection in sparse twitter-sphere
AU - Hu, Han
AU - Phan, Nhat Hai
AU - Geller, James
AU - Iezzi, Stephen
AU - Vo, Huy
AU - Dou, Dejing
AU - Chun, Soon Ae
N1 - Publisher Copyright: © 2019 International Medical Informatics Association (IMIA) and IOS Press. This article is published online with Open Access by IOS Press and distributed under the terms of the Creative Commons Attribution Non-Commercial License 4.0 (CC BY-NC 4.0).
PY - 2019/8/21
Y1 - 2019/8/21
N2 - As the problem of drug abuse intensifies in the U.S., many studies that primarily utilize social media data, such as postings on Twitter, to study drug abuse-related activities use machine learning as a powerful tool for text classification and filtering. However, given the wide range of topics of Twitter users, tweets related to drug abuse are rare in most of the datasets. This imbalanced data remains a major issue in building effective tweet classifiers, and is especially obvious for studies that include abuse-related slang terms. In this study, we approach this problem by designing an ensemble deep learning model that leverages both word-level and character-level features to classify abuse-related tweets. Experiments are reported on a Twitter dataset, where we can configure the percentages of the two classes (abuse vs. non abuse) to simulate the data imbalance with different amplitudes. Results show that our ensemble deep learning models exhibit better performance than ensembles of traditional machine learning models, especially on heavily imbalanced datasets.
KW - Machine Learning
KW - Social Media
KW - Substance-Related Disorders
UR - http://www.scopus.com/inward/record.url?scp=85071471488&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85071471488&partnerID=8YFLogxK
U2 - 10.3233/SHTI190204
DO - 10.3233/SHTI190204
M3 - Conference contribution
C2 - 31437906
AN - SCOPUS:85071471488
T3 - Studies in Health Technology and Informatics
SP - 163
EP - 167
BT - MEDINFO 2019
A2 - Seroussi, Brigitte
A2 - Ohno-Machado, Lucila
PB - IOS Press
T2 - 17th World Congress on Medical and Health Informatics, MEDINFO 2019
Y2 - 25 August 2019 through 30 August 2019
ER -