TY - GEN
T1 - Contextuality of Code Representation Learning
AU - Li, Yi
AU - Wang, Shaohua
AU - Nguyen, Tien N.
N1 - Publisher Copyright:
© 2023 IEEE.
PY - 2023
Y1 - 2023
N2 - Advanced machine learning (ML) models have been successfully leveraged in several software engineering (SE) applications. Existing SE techniques have used embedding models, ranging from static to contextualized ones, to build vectors for program units. Contextualized vectors address a phenomenon in natural-language texts called polysemy: the coexistence of different meanings of a word or phrase. However, due to their different nature, program units exhibit mixed polysemy. Some code tokens and statements exhibit polysemy, while other tokens (e.g., keywords, separators, and operators) and statements maintain the same meaning across different contexts. A natural question is whether static or contextualized embeddings better fit the mixed-polysemy nature of source code. The answer to this question helps SE researchers select the right embedding model. We conducted experiments on 12 popular sequence-, tree-, and graph-based embedding models, using samples from a dataset of 10,222 Java projects with over 14M methods. We present several contextuality evaluation metrics, adapted from natural-language texts to code structures, to evaluate the embeddings from those models. Among our findings, models with higher contextuality help a bug-detection model perform better than static ones. However, neither static nor contextualized embedding models fit well with the mixed-polysemy nature of source code. Thus, we developed Hycode, a hybrid embedding model that better fits the mixed-polysemy nature of source code.
AB - Advanced machine learning (ML) models have been successfully leveraged in several software engineering (SE) applications. Existing SE techniques have used embedding models, ranging from static to contextualized ones, to build vectors for program units. Contextualized vectors address a phenomenon in natural-language texts called polysemy: the coexistence of different meanings of a word or phrase. However, due to their different nature, program units exhibit mixed polysemy. Some code tokens and statements exhibit polysemy, while other tokens (e.g., keywords, separators, and operators) and statements maintain the same meaning across different contexts. A natural question is whether static or contextualized embeddings better fit the mixed-polysemy nature of source code. The answer to this question helps SE researchers select the right embedding model. We conducted experiments on 12 popular sequence-, tree-, and graph-based embedding models, using samples from a dataset of 10,222 Java projects with over 14M methods. We present several contextuality evaluation metrics, adapted from natural-language texts to code structures, to evaluate the embeddings from those models. Among our findings, models with higher contextuality help a bug-detection model perform better than static ones. However, neither static nor contextualized embedding models fit well with the mixed-polysemy nature of source code. Thus, we developed Hycode, a hybrid embedding model that better fits the mixed-polysemy nature of source code.
KW - Code Representation Learning
KW - Contextuality of Code Embedding
KW - Contextualized Embedding
UR - http://www.scopus.com/inward/record.url?scp=85179003533&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85179003533&partnerID=8YFLogxK
U2 - 10.1109/ASE56229.2023.00029
DO - 10.1109/ASE56229.2023.00029
M3 - Conference contribution
AN - SCOPUS:85179003533
T3 - Proceedings - 2023 38th IEEE/ACM International Conference on Automated Software Engineering, ASE 2023
SP - 548
EP - 559
BT - Proceedings - 2023 38th IEEE/ACM International Conference on Automated Software Engineering, ASE 2023
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 38th IEEE/ACM International Conference on Automated Software Engineering, ASE 2023
Y2 - 11 September 2023 through 15 September 2023
ER -