Preprint Article Version 1 Preserved in Portico This version is not peer-reviewed

Harnessing Syntactic Feature for Code Representation Learning

Version 1 : Received: 19 December 2023 / Approved: 19 December 2023 / Online: 20 December 2023 (06:32:05 CET)

How to cite: Clevor, B.; Patel, R.; Slyder, W. Harnessing Syntactic Feature for Code Representation Learning. Preprints 2023, 2023121463. https://doi.org/10.20944/preprints202312.1463.v1

Abstract

Treating source code as data has recently gained traction, enabling applications such as automated commit message generation, pull request description generation, and program repair. Consider commit message generation: traditional methods treat source code as a plain token sequence and apply neural machine translation models. This approach overlooks the syntactic structure inherent in programming languages, which could improve accuracy. This paper builds on Code2Seq, which represents source code using structural information from the Abstract Syntax Tree (AST) in order to generate method names automatically. We introduce "CSR", a method adapted to represent code edits, and investigate the impact of syntactic structure on the classification of code edits. Like Code2Seq, "CSR" uses structural properties of the AST, in particular the paths connecting leaf nodes. We evaluate the approach on two datasets of fine-grained syntactic edits. Our experiments show that incorporating syntactic structure does not significantly outperform simpler methods. Although methods such as Code2Seq and the proposed "CSR" show potential, our findings indicate that considerable improvement is needed before such techniques can be applied generally to learning representations of code edits. We hope these findings will prompt further research on more effective use of syntactic structure in code representation.
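
To make the idea of "paths connecting leaf nodes" concrete, the following is a minimal illustrative sketch and not the authors' implementation. It assumes Python's standard `ast` module as the parser and treats identifiers and literal constants as the leaves. The helper names (`leaf_to_leaf_paths`, `_root_paths`, `_token`) are hypothetical. For every pair of leaves it lists the node types met when walking up from the first leaf to the lowest common ancestor and down to the second leaf.

```python
import ast
from itertools import combinations

# Assumption for this sketch: terminals are identifiers and literal constants.
LEAF_TYPES = (ast.Name, ast.Constant)


def _root_paths(tree):
    """Collect the root-to-leaf node list for every leaf in the tree."""
    paths = []

    def walk(node, prefix):
        prefix = prefix + [node]
        if isinstance(node, LEAF_TYPES):
            paths.append(prefix)
            return
        for child in ast.iter_child_nodes(node):
            walk(child, prefix)

    walk(tree, [])
    return paths


def _token(leaf):
    """Surface token of a leaf: the identifier name or the constant's repr."""
    return leaf.id if isinstance(leaf, ast.Name) else repr(leaf.value)


def leaf_to_leaf_paths(source):
    """Return (start_token, [node types along the path], end_token) triples."""
    result = []
    for a, b in combinations(_root_paths(ast.parse(source)), 2):
        # Length of the shared prefix; a[i - 1] is the lowest common ancestor.
        i = 0
        while i < min(len(a), len(b)) and a[i] is b[i]:
            i += 1
        up = [type(n).__name__ for n in reversed(a[i:])]
        down = [type(n).__name__ for n in b[i:]]
        path = up + [type(a[i - 1]).__name__] + down
        result.append((_token(a[-1]), path, _token(b[-1])))
    return result


if __name__ == "__main__":
    for start, path, end in leaf_to_leaf_paths("x = y + 1"):
        print(start, " -> ".join(path), end)
```

In a path-based model, each such triple would typically be embedded and aggregated into a vector for the code fragment. How "CSR" adapts this to code edits is not specified in the abstract, so that step is not sketched here.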

Keywords

code edit classification; abstract syntax tree; code structure representation

Subject

Computer Science and Mathematics, Artificial Intelligence and Machine Learning
