Submitted: 19 December 2023
Posted: 20 December 2023
Abstract
Keywords:
1. Introduction
- Training models with syntactic constructs for code edit classification likely requires more data than currently available: CSR is trained on over 15 million data points, whereas our edit datasets are orders of magnitude smaller.
2. Related Work
3. Methodology
3.1. edit2vec
- Characterizing Code Edits: Unlike code2seq, which takes a single code snippet as input, edit2vec takes a pair of code snippets that capture the pre- and post-edit states. This distinction is crucial for capturing the nature of the edit.
- Classification over Generation: Our model is designed for classification tasks, not generation. Consequently, we replace the decoder in code2seq with a classification layer, specifically employing a softmax layer for multi-class classification.
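The two points above can be sketched as a small classification head. This is a minimal illustration, not the authors' implementation: combining the pre- and post-edit vectors by concatenation, the toy dimensions, and the names `classify_edit`, `weights` and `bias` are all assumptions made for the example.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify_edit(pre_vec, post_vec, weights, bias):
    """Return class probabilities for one edit.

    The edit is represented by concatenating the pre- and post-edit
    snippet vectors (an assumed combination strategy).
    """
    features = pre_vec + post_vec  # list concatenation, length 2 * dim
    logits = [
        sum(w * x for w, x in zip(row, features)) + b
        for row, b in zip(weights, bias)
    ]
    return softmax(logits)


rng = random.Random(0)
dim, num_classes = 4, 3  # toy sizes chosen for the example
weights = [[rng.uniform(-1, 1) for _ in range(2 * dim)] for _ in range(num_classes)]
bias = [0.0] * num_classes
pre_vec = [rng.uniform(-1, 1) for _ in range(dim)]   # stands in for the encoded old snippet
post_vec = [rng.uniform(-1, 1) for _ in range(dim)]  # stands in for the encoded new snippet

probs = classify_edit(pre_vec, post_vec, weights, bias)
predicted_class = max(range(num_classes), key=probs.__getitem__)
```

The softmax output is a probability distribution over edit classes, so the predicted label is simply the index with the highest probability.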
Path-Context Extractor
Path-Context Encoder (PCE)
Code Encoder (CE)
Classifier
Model Hyperparameters
3.2. LSTM
3.3. Bag-of-words
Tokenization and Vectorization
- Count-based Vectorizer: This vectorizer translates the tokens into vectors based on their frequency counts. It emphasizes tokens that appear more frequently, potentially capturing dominant features in the code.
- Tf-idf Vectorizer: The tf-idf vectorizer assigns weights to tokens not just based on their frequency in a single snippet but also considering their commonness across all code snippets. This approach helps in reducing the impact of universally common tokens, like standard data types in programming languages.
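The difference between the two vectorizers can be seen on a toy corpus. The sketch below is illustrative only: the regular-expression tokenizer, the three sample snippets and the smoothed idf formula are assumptions, not details taken from the paper.

```python
import math
import re
from collections import Counter


def tokenize(code):
    """Split code into identifier and number tokens (a simplifying assumption)."""
    return re.findall(r"[A-Za-z_]\w*|\d+", code)


def count_vector(tokens, vocabulary):
    """Raw frequency of each vocabulary term in one snippet."""
    counts = Counter(tokens)
    return [counts[term] for term in vocabulary]


def tfidf_vector(tokens, vocabulary, document_frequency, num_documents):
    """Term frequency scaled by a smoothed inverse document frequency."""
    counts = Counter(tokens)
    vector = []
    for term in vocabulary:
        # Terms present in every snippet receive the lowest idf (1.0).
        idf = math.log((1 + num_documents) / (1 + document_frequency[term])) + 1
        vector.append(counts[term] * idf)
    return vector


snippets = [
    "int size = file.getSize();",
    "int count = list.size();",
    "int total = count + size;",
]
tokenized = [tokenize(s) for s in snippets]
vocabulary = sorted({t for tokens in tokenized for t in tokens})
document_frequency = Counter(t for tokens in tokenized for t in set(tokens))

counts = count_vector(tokenized[0], vocabulary)
weights = tfidf_vector(tokenized[0], vocabulary, document_frequency, len(snippets))
```

In the first snippet, `int` and `file` both occur once, so the count vector treats them equally. Because `int` appears in every snippet, the tf-idf vector gives it a lower weight than `file`.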
Classification
Contextual Consideration
- Example 1: `os.file(path)` → `os.folder(path)`
- Example 2: `file.getSize()` → `folder.getSize()`
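One way to read the two examples, offered here as an assumption rather than the authors' stated argument, is that a bag-of-words view sees the same token swap in both edits even though the swapped token is the invoked method in Example 1 and the calling object in Example 2. The helper `bag_difference` below is hypothetical and exists only to make that observation concrete.

```python
import re
from collections import Counter


def tokenize(code):
    """Extract identifier tokens, discarding punctuation and position."""
    return re.findall(r"[A-Za-z_]\w*", code)


def bag_difference(old_code, new_code):
    """Return (removed, added) token multisets for one edit."""
    old_bag = Counter(tokenize(old_code))
    new_bag = Counter(tokenize(new_code))
    # Counter subtraction keeps only tokens with a positive remaining count.
    return old_bag - new_bag, new_bag - old_bag


edit_1 = bag_difference("os.file(path)", "os.folder(path)")
edit_2 = bag_difference("file.getSize()", "folder.getSize()")
same_token_change = edit_1 == edit_2
```

Both edits reduce to "remove `file`, add `folder`", so any distinction between them has to come from the surrounding tokens or from structural information.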
Dataset and Sample Distribution
4. Experimental Setup
4.1. Code Edit Classification Task
4.1.1. Bug-fix Classification
Data Preparation and Selection Criteria
Data Distribution and Stratification
4.1.2. Code Transformation Classification
Dataset Overview and Sampling Strategy
5. Evaluation Results
5.1. Insights into Model Performance
5.2. Canonicalization and Its Impact
5.3. Further Analysis

5.4. Threats to Validity
5.4.1. Internal Validity Concerns
Data Scarcity Challenges
Model and Encoding Considerations
5.4.2. External Validity Concerns
Generalizability Across Tasks and Languages
Exploring Broader Applications
6. Conclusion and Future Directions
7. Conclusion
7.1. Future Work
Expanding the Scope of CSR.
Enhancing CSR with Advanced Techniques.
Dataset Development and Curation.
Interdisciplinary Approaches.
| Stage | Old code | New code |
|---|---|---|
| Input code | `processURL(message, depth, baseURL, url)` | `processURL(message, depth, url, baseURL);` |
| Path-contexts | { processURL, NE0, MCE, NE3, baseURL | { processURL, NE0, MCE, NE4, baseURL |
| | processURL, NE0, MCE, NE2, depth | processURL, NE0, MCE, NE2, depth |
| | message, NE1, MCE, NE2, depth ... } | message, NE1, MCE, NE2, depth ... } |
| PCE output | ... | ... |
| CE output | [160-D vector] | [160-D vector] |
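To make the path-context rows in the table above easier to compare, the sketch below stores each one as a `(start token, path, end token)` tuple and computes which contexts the edit changes. The tuple layout is an assumed reading of the table, the contexts elided by "..." are not reproduced, and this set difference is purely illustrative; it is not claimed to be a step in edit2vec, which encodes both snippets.

```python
# Path-contexts listed in the table, as (start_token, path, end_token).
old_contexts = {
    ("processURL", ("NE0", "MCE", "NE3"), "baseURL"),
    ("processURL", ("NE0", "MCE", "NE2"), "depth"),
    ("message", ("NE1", "MCE", "NE2"), "depth"),
}
new_contexts = {
    ("processURL", ("NE0", "MCE", "NE4"), "baseURL"),
    ("processURL", ("NE0", "MCE", "NE2"), "depth"),
    ("message", ("NE1", "MCE", "NE2"), "depth"),
}

removed = old_contexts - new_contexts    # present only before the edit
added = new_contexts - old_contexts      # present only after the edit
unchanged = old_contexts & new_contexts  # shared by both versions
```

Among the listed contexts, only the one ending in `baseURL` differs: its last path node moves from `NE3` to `NE4`, which reflects the argument changing position in the call.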
| Bug Category | Description | Sample Count |
|---|---|---|
| Function Caller Modification | Verifies if the calling object in a function invocation has been substituted with a different one. | 1488 |
| Numerical Literal Alteration | Identifies changes where one numerical literal is replaced by another. | 4779 |
| Operand Correction | Examines if any operand in a binary operation was modified. | 741 |
| Operator Substitution | Determines if one binary operator was mistakenly exchanged with another of the same category. | 1711 |
| Incorrect Method Call | Checks cases where an erroneous function was invoked. | 9383 |
| Conditional Statement Expansion | Verifies the addition of an alternative condition (`||` operator) in an if statement. | 2095 |
| Conditional Statement Restriction | Assesses the insertion of an additional condition (`&&` operator) in an if statement. | 1836 |
| Reduced Argument Method Overload | Checks whether a method with fewer arguments (overloaded) was called. | 1040 |
| Additional Argument Method Overload | Identifies if an overloaded version of a function with more arguments was used. | 3820 |
| Argument Swap in Function Call | Verifies cases where two arguments in a function call were interchanged. | 536 |
| Boolean Literal Switch | Determines whether a Boolean literal was substituted with another. | 1531 |
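The bug categories above are heavily imbalanced (536 to 9383 samples), which is why a stratified split matters. The following sketch computes per-class train and test sizes that preserve the class proportions; the 20% test fraction and the helper name `stratified_sizes` are assumptions, since the split ratio is not stated here.

```python
bug_fix_counts = {
    "Function Caller Modification": 1488,
    "Numerical Literal Alteration": 4779,
    "Operand Correction": 741,
    "Operator Substitution": 1711,
    "Incorrect Method Call": 9383,
    "Conditional Statement Expansion": 2095,
    "Conditional Statement Restriction": 1836,
    "Reduced Argument Method Overload": 1040,
    "Additional Argument Method Overload": 3820,
    "Argument Swap in Function Call": 536,
    "Boolean Literal Switch": 1531,
}


def stratified_sizes(class_counts, test_fraction):
    """Per-class (train, test) sizes that keep class proportions intact."""
    sizes = {}
    for label, count in class_counts.items():
        test_size = round(count * test_fraction)
        sizes[label] = (count - test_size, test_size)
    return sizes


# test_fraction=0.2 is an assumed ratio, not a value from the paper.
split = stratified_sizes(bug_fix_counts, test_fraction=0.2)
total_samples = sum(bug_fix_counts.values())
```

With this scheme even the smallest category, Argument Swap in Function Call, keeps a proportional share of test samples instead of being left to chance by a purely random split.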
| Analyzer tag | Description | No. of samples |
|---|---|---|
| RCS1001 | Add braces (when expression spans over multiple lines) | 443 |
| RCS1032 | Remove redundant parentheses | 516 |
| RCS1049 | Simplify boolean comparison | 574 |
| RCS1085 | Use auto-implemented property | 2163 |
| RCS1123 | Add parentheses according to operator precedence | 1428 |
| RCS1124 | Inline local variable | 1067 |
| RCS1146 | Use conditional access | 3368 |
| RCS1163 | Rename unused parameter to `_` | 2053 |
| RCS1168 | Change parameter name to base name when they are not the same | 816 |
| RCS1220 | Use pattern matching instead of combination of `is` operator and cast operator | 356 |
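A useful reference point when reading accuracy figures on an imbalanced dataset is the majority-class baseline, i.e. the accuracy of always predicting the most frequent class. The sketch below derives it from the sample counts in the table above. It uses the full dataset because the evaluation split is not given here, so the figure is only indicative.

```python
transformation_counts = {
    "RCS1001": 443,
    "RCS1032": 516,
    "RCS1049": 574,
    "RCS1085": 2163,
    "RCS1123": 1428,
    "RCS1124": 1067,
    "RCS1146": 3368,
    "RCS1163": 2053,
    "RCS1168": 816,
    "RCS1220": 356,
}

total_samples = sum(transformation_counts.values())
majority_tag, majority_count = max(
    transformation_counts.items(), key=lambda item: item[1]
)
# Accuracy obtained by always predicting the most frequent analyzer tag.
baseline_accuracy = majority_count / total_samples
```

The baseline comes out at roughly 26.3%, close to the 26.31% reported for the tf-idf SVM (RBF) on the non-canonicalized code transformation data. That would be consistent with, though it does not prove, that model collapsing to a single class.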
| Model | Bug-fix: Accuracy | Bug-fix: Accuracy (Canon.) | Code Transformation: Accuracy | Code Transformation: Accuracy (Canon.) |
|---|---|---|---|---|
| tf-idf SVM (RBF) | 32.34% | 58.81% | 26.31% | 73.82% |
| tf-idf SVM (linear) | 85.30% | 67.37% | 85.30% | 69.49% |
| count SVM (RBF) | 32.34% | 72.06% | 34.24% | 76.31% |
| count SVM (linear) | 86.69% | 76.08% | 88.39% | 74.78% |
| LSTM | 94.47% | 99.21% | 92.55% | 92.77% |
| code2seq | 93.17% | 98.44% | 92.28% | 92.59% |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).