An adaptive wordpiece language model for learning Chinese word embeddings

Binchen Xu, Lu Ma, Liang Zhang, Haohai Li, Qi Kang, Mengchu Zhou

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

6 Scopus citations

Abstract

Word representations are crucial for many natural language processing tasks. Most existing approaches learn contextual information by assigning a distinct vector to each word and pay less attention to morphology, which makes it hard for them to deal with large vocabularies and rare words. In this paper we propose an Adaptive Wordpiece Language Model (AWLM) for learning Chinese word embeddings, inspired by the previous observation that subword units are important for improving the learning of Chinese word representations. Specifically, a novel approach called BPE+ is established to adaptively generate variable-length grams, which breaks the limitation of stroke n-grams. Semantic information extraction is completed in three elaborated parts: extraction of morphological information, reinforcement of fine-grained information, and extraction of semantic information. Empirical results on word similarity, word analogy, text classification, and question answering verify that our method significantly outperforms several state-of-the-art methods.
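
As a rough illustration of the idea of variable-length grams, the sketch below implements standard byte-pair encoding (BPE), the technique that the name BPE+ appears to build on. It is not the authors' BPE+ method, whose details are not given in this abstract. The function names (`learn_bpe_merges`, `merge_pair`, `segment`) and the toy Latin-letter input are assumptions made for illustration only; in the paper's setting the base symbols would presumably be strokes rather than letters.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe_merges(words, num_merges):
    """Learn up to `num_merges` merge rules with standard BPE (hypothetical helper)."""
    # Each word starts as a tuple of single-character symbols.
    vocab = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pair_counts[(a, b)] += freq
        if not pair_counts:
            break
        # Merge the most frequent pair everywhere it occurs.
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges


def segment(word, merges):
    """Split `word` into variable-length units by replaying the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)
```

Because frequent symbol sequences are merged repeatedly, the resulting units have no fixed length, in contrast to n-grams of a preset size n.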

Original language: English (US)
Title of host publication: 2019 IEEE 15th International Conference on Automation Science and Engineering, CASE 2019
Publisher: IEEE Computer Society
Pages: 812-817
Number of pages: 6
ISBN (Electronic): 9781728103556
State: Published - Aug 2019
Event: 15th IEEE International Conference on Automation Science and Engineering, CASE 2019 - Vancouver, Canada
Duration: Aug 22 2019 → Aug 26 2019

Publication series

Name: IEEE International Conference on Automation Science and Engineering
Volume: 2019-August
ISSN (Print): 2161-8070
ISSN (Electronic): 2161-8089

Conference

Conference: 15th IEEE International Conference on Automation Science and Engineering, CASE 2019
Country/Territory: Canada
City: Vancouver
Period: 8/22/19 → 8/26/19

All Science Journal Classification (ASJC) codes

  • Control and Systems Engineering
  • Electrical and Electronic Engineering
