Contextuality of Code Representation Learning

Yi Li, Shaohua Wang, Tien N. Nguyen

Research output: Chapter in Book/Report/Conference proceedingConference contribution

Abstract

Advanced machine learning models (ML) have been successfully leveraged in several software engineering (SE) applications. The existing SE techniques have used the embedding models ranging from static to contextualized ones to build the vectors for program units. The contextualized vectors address a phenomenon in natural language texts called polysemy, which is the coexistence of different meanings of a word/phrase. However, due to different nature, program units exhibit the nature of mixed polysemy. Some code tokens and statements exhibit polysemy while other tokens (e.g., keywords, separators, and operators) and statements maintain the same meaning in different contexts. A natural question is whether static or contextualized embeddings fit better with the nature of mixed polysemy in source code. The answer to this question is helpful for the SE researchers in selecting the right embedding model. We conducted experiments on 12 popular sequence-/tree-/graph-based embedding models and on the samples of a dataset of 10,222 Java projects with +14M methods. We present several contextuality evaluation metrics adapted from natural-language texts to code structures to evaluate the embeddings from those models. Among several findings, we found that the models with higher contextuality help a bug detection model perform better than the static ones. Neither static nor contextualized embedding models fit well with the mixed polysemy nature of source code. Thus, we develop Hycode, a hybrid embedding model that fits better with the nature of mixed polysemy in source code.

Original languageEnglish (US)
Title of host publicationProceedings - 2023 38th IEEE/ACM International Conference on Automated Software Engineering, ASE 2023
PublisherInstitute of Electrical and Electronics Engineers Inc.
Pages548-559
Number of pages12
ISBN (Electronic)9798350329964
DOIs
StatePublished - 2023
Event38th IEEE/ACM International Conference on Automated Software Engineering, ASE 2023 - Echternach, Luxembourg
Duration: Sep 11 2023Sep 15 2023

Publication series

NameProceedings - 2023 38th IEEE/ACM International Conference on Automated Software Engineering, ASE 2023

Conference

Conference38th IEEE/ACM International Conference on Automated Software Engineering, ASE 2023
Country/TerritoryLuxembourg
CityEchternach
Period9/11/239/15/23

All Science Journal Classification (ASJC) codes

  • Software
  • Safety, Risk, Reliability and Quality
  • Control and Optimization

Keywords

  • Code Representation Learning
  • Contextuality of Code Embedding
  • Contextualized Embedding

Fingerprint

Dive into the research topics of 'Contextuality of Code Representation Learning'. Together they form a unique fingerprint.

Cite this