MGEL: Multigrained Representation Analysis and Ensemble Learning for Text Moderation

Fei Tan, Changwei Hu, Yifan Hu, Kevin Yen, Zhi Wei, Aasish Pappu, Serim Park, Keqian Li

Research output: Contribution to journal › Article › peer-review

Abstract

In this work, we address two challenges that arise when popular text classification methods are applied to text moderation: the representation of multibyte characters and word obfuscations. Specifically, we develop a multihot byte-level scheme that significantly reduces the dimension of one-hot character-level encoding, which is inflated by the large number of instance-scarce non-ASCII characters. In addition, we introduce a simple yet effective weighting approach for fusing n-gram features to strengthen classical logistic regression. Surprisingly, it substantially outperforms well-tuned representative neural networks. Continuing this effort toward text moderation, we analyze the current state-of-the-art (SOTA) algorithm, bidirectional encoder representations from transformers (BERT), which works well at context understanding but performs poorly on intentional word obfuscations. To remedy this drawback, we develop an enhanced variant that integrates byte and character decomposition. As our comprehensive experiments demonstrate, it advances the SOTA performance on the largest abusive language datasets. Our work offers a feasible and effective framework for tackling word obfuscations.
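
To make the byte-level idea concrete, the following is a minimal sketch of one way a multihot byte encoding could look. It is not the scheme from the paper, whose details the abstract does not give. It assumes each character is decomposed into its UTF-8 bytes and represented by a 256-dimensional vector with a 1 at each byte value that occurs. The names `multihot_bytes` and `encode_text` are hypothetical.

```python
# Illustrative sketch only: an assumed multihot byte-level encoding,
# not the scheme defined in the paper.

def multihot_bytes(ch: str, dim: int = 256) -> list[int]:
    """Return a multihot vector for one character.

    Assumption: the character is split into its UTF-8 bytes and the
    position of every byte value that occurs is set to 1.
    """
    vec = [0] * dim
    for b in ch.encode("utf-8"):
        vec[b] = 1
    return vec


def encode_text(text: str) -> list[list[int]]:
    """Encode a string as one fixed-width multihot row per character."""
    return [multihot_bytes(ch) for ch in text]


# "a" is a single byte (0x61); the euro sign is three bytes (0xE2 0x82 0xAC).
rows = encode_text("a€")
ascii_row, euro_row = rows
```

Under this assumption every character, ASCII or not, maps to a vector of width 256, whereas a one-hot character encoding needs one dimension per distinct character in the vocabulary. The sketch discards byte order and repeated byte values, so distinct characters can collide; a practical scheme would need to handle that.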

Original language: English (US)
Journal: IEEE Transactions on Neural Networks and Learning Systems
State: Accepted/In press - 2022

All Science Journal Classification (ASJC) codes

  • Software
  • Computer Science Applications
  • Computer Networks and Communications
  • Artificial Intelligence

Keywords

  • Abusive language detection
  • Bit error rate
  • Blogs
  • Encoding
  • Hate speech
  • Machine learning
  • Social networking (online)
  • Vocabulary
  • multibyte characters
  • text moderation
  • word obfuscations
