TY - GEN
T1 - An adaptive wordpiece language model for learning Chinese word embeddings
AU - Xu, Binchen
AU - Ma, Lu
AU - Zhang, Liang
AU - Li, Haohai
AU - Kang, Qi
AU - Zhou, Mengchu
N1 - Publisher Copyright:
© 2019 IEEE.
PY - 2019/8
Y1 - 2019/8
N2 - Word representations are crucial for many natural language processing tasks. Most existing approaches learn contextual information by assigning a distinct vector to each word and pay little attention to morphology, which makes it difficult for them to handle large vocabularies and rare words. In this paper we propose an Adaptive Wordpiece Language Model for learning Chinese word embeddings (AWLM), inspired by the previous observation that subword units are important for improving the learning of Chinese word representations. Specifically, a novel approach called BPE+ is established to adaptively generate grams of variable length, which breaks the limitation of stroke n-grams. Semantic information extraction is completed by three elaborated parts, i.e., extraction of morphological information, reinforcement of fine-grained information, and extraction of semantic information. Empirical results on word similarity, word analogy, text classification, and question answering verify that our method significantly outperforms several state-of-the-art methods.
AB - Word representations are crucial for many natural language processing tasks. Most existing approaches learn contextual information by assigning a distinct vector to each word and pay little attention to morphology, which makes it difficult for them to handle large vocabularies and rare words. In this paper we propose an Adaptive Wordpiece Language Model for learning Chinese word embeddings (AWLM), inspired by the previous observation that subword units are important for improving the learning of Chinese word representations. Specifically, a novel approach called BPE+ is established to adaptively generate grams of variable length, which breaks the limitation of stroke n-grams. Semantic information extraction is completed by three elaborated parts, i.e., extraction of morphological information, reinforcement of fine-grained information, and extraction of semantic information. Empirical results on word similarity, word analogy, text classification, and question answering verify that our method significantly outperforms several state-of-the-art methods.
UR - http://www.scopus.com/inward/record.url?scp=85072967510&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85072967510&partnerID=8YFLogxK
U2 - 10.1109/COASE.2019.8843151
DO - 10.1109/COASE.2019.8843151
M3 - Conference contribution
AN - SCOPUS:85072967510
T3 - IEEE International Conference on Automation Science and Engineering
SP - 812
EP - 817
BT - 2019 IEEE 15th International Conference on Automation Science and Engineering, CASE 2019
PB - IEEE Computer Society
T2 - 15th IEEE International Conference on Automation Science and Engineering, CASE 2019
Y2 - 22 August 2019 through 26 August 2019
ER -
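
Note on the abstract's BPE+ idea: the record describes BPE+ as adaptively producing variable-length grams from subword (stroke-level) units, in the spirit of byte-pair encoding. The sketch below is a minimal illustration of plain byte-pair encoding over stroke-ID sequences (the five CJK stroke classes encoded as digits 1-5, as in stroke n-gram work such as cw2vec); it is a hedged sketch of the general merge-based idea, not the authors' BPE+ algorithm, and the toy corpus, stroke encodings, and helper names are assumptions introduced here for illustration.

# Minimal byte-pair-encoding sketch over stroke-ID sequences (digits 1-5).
# Shows how variable-length subword "grams" emerge from repeated merges of
# frequent adjacent pairs. This is NOT the paper's BPE+ method.
from collections import Counter

def most_frequent_pair(corpus):
    """Return the most frequent adjacent symbol pair across all sequences."""
    pairs = Counter()
    for seq in corpus:
        for a, b in zip(seq, seq[1:]):
            pairs[(a, b)] += 1
    return pairs.most_common(1)[0][0] if pairs else None

def merge_pair(corpus, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = pair[0] + pair[1]
    new_corpus = []
    for seq in corpus:
        out, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
                out.append(merged)
                i += 2
            else:
                out.append(seq[i])
                i += 1
        new_corpus.append(out)
    return new_corpus

def learn_bpe_grams(corpus, num_merges=10):
    """Run up to `num_merges` merges; return the learned variable-length grams."""
    corpus = [list(seq) for seq in corpus]
    grams = set()
    for _ in range(num_merges):
        pair = most_frequent_pair(corpus)
        if pair is None:
            break
        corpus = merge_pair(corpus, pair)
        grams.add(pair[0] + pair[1])
    return grams, corpus

if __name__ == "__main__":
    # Toy stroke-ID sequences (hypothetical encodings of Chinese words).
    toy_corpus = ["1121", "112134", "25111", "1121125"]
    grams, segmented = learn_bpe_grams(toy_corpus, num_merges=5)
    print("learned grams:", grams)
    print("segmented corpus:", segmented)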