Submitted: 11 December 2023
Posted: 12 December 2023
Abstract
Keywords:
1. Introduction
- Introduction of the Syntax Structure (SS) network, which employs two sub-networks (SAST and GAST) for unified AST representation learning. SAST extracts global syntactic features from AST path sequences, while GAST captures local semantic features within the AST tree. Combining the two views yields a more complete representation of code semantics than either view alone.
- Development of a unified vocabulary mechanism to minimize language disparities, allowing for efficient embedded vector mapping and training in cross-language program classification.
- Performance evaluation on two datasets, demonstrating SS’s superiority over existing baselines in key metrics like Recall, Precision, F1-score, and Accuracy.
- Compilation of a benchmark dataset for cross-language program classification covering five programming languages (C, C++, Java, Python, and JavaScript), with 50 problems and 20,000 solution files; solutions to the same problem are semantically similar across languages.
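As a rough illustration of the two AST views the contributions describe, the sketch below uses Python's built-in `ast` module to turn a snippet into (a) a preorder node-type sequence, a stand-in for the path sequences SAST consumes, and (b) a parent-to-child edge list, a stand-in for the tree structure GAST consumes, with a shared ("unified") vocabulary mapping node types to indices. Function names and the exact encoding are ours for illustration; the paper's actual parser, path extraction, and vocabulary construction may differ.

```python
import ast

def ast_views(source: str):
    """Return (preorder node-type sequence, edge list, shared vocabulary)."""
    tree = ast.parse(source)
    seq, edges = [], []
    vocab = {}  # unified node-type vocabulary, built on the fly

    def type_id(node):
        name = type(node).__name__
        if name not in vocab:
            vocab[name] = len(vocab)
        return vocab[name]

    def walk(node, parent_pos=None):
        pos = len(seq)               # position of this node in preorder
        seq.append(type_id(node))
        if parent_pos is not None:
            edges.append((parent_pos, pos))  # parent -> child edge
        for child in ast.iter_child_nodes(node):
            walk(child, pos)

    walk(tree)
    return seq, edges, vocab

seq, edges, vocab = ast_views("def add(a, b):\n    return a + b\n")
print(len(seq), len(edges))  # a tree with n nodes has n-1 edges
```

The same unified vocabulary could be shared across languages by mapping comparable node types (e.g. function definitions, loops) from different parsers to the same indices.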
2. Related Work
3. The Proposed Approach
3.1. Overall Structure
3.2. Unified Vocabulary
3.3. Sequence-based AST Network (SAST)
3.4. Graph-based AST Network (GAST)
3.5. Unified AST Feature Fusion
3.6. Cross-language Program Classification
4. Experiments
4.1. Targets
4.2. Datasets
- Leetcode: A dataset collected from Leetcode containing 20,000 solutions in five languages (C, C++, Java, Python, and JavaScript) across 50 problem categories. We employ NICAD for deduplication and anonymize function names to maintain consistency. This dataset also follows a training-validation-testing split, with the distribution detailed in Table 1.
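In Table 1 the validation and testing splits are the same size, with training roughly three times larger, consistent with a 60/20/20 split stratified by problem category. A minimal sketch of such a split (the function name and seeding are ours, not the authors'):

```python
import random

def split_60_20_20(samples, labels, seed=0):
    """Stratified 60/20/20 train/val/test split, per class label."""
    rng = random.Random(seed)
    by_label = {}
    for s, y in zip(samples, labels):
        by_label.setdefault(y, []).append(s)
    train, val, test = [], [], []
    for y, items in by_label.items():
        rng.shuffle(items)
        n = len(items)
        n_val = n_test = n // 5           # 20% each for val and test
        n_train = n - n_val - n_test      # remaining ~60% for training
        train += [(s, y) for s in items[:n_train]]
        val   += [(s, y) for s in items[n_train:n_train + n_val]]
        test  += [(s, y) for s in items[n_train + n_val:]]
    return train, val, test

train, val, test = split_60_20_20(list(range(100)), [i % 2 for i in range(100)])
print(len(train), len(val), len(test))  # 60 20 20
```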
4.3. Baselines
- Bi-TBCNN [50]: This approach utilizes tree-based neural networks to decipher cross-language programming paradigms. It features a dual-neural network framework, where each network encodes and interprets the grammatical and semantic aspects of language codes, thus facilitating cross-language program classification.
- CodeBERT [54]: A BERT [55] adaptation, CodeBERT is pre-trained with a Replaced Token Detection (RTD) objective. The model uses deep bidirectional transformer layers to generate code features that integrate contextual information, effectively capturing the nuances of input sequences. Because it is pre-trained on six programming languages, it is a strong baseline for cross-language classification.
- InferCode [52]: Employing self-supervised learning principles from natural language processing, this model trains code representations by predicting the contextual subtrees of ASTs. Trained on many programming languages, InferCode is adept at extracting and classifying features across languages, making it suitable for cross-language program classification.
4.4. Setting
4.5. Evaluation Metrics
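For the multi-class setting used here, the four reported metrics are typically computed per class and macro-averaged; the sketch below is our own minimal implementation, not necessarily the exact averaging scheme used in the experiments.

```python
def macro_metrics(y_true, y_pred):
    """Macro-averaged Recall, Precision, F1-score, plus overall Accuracy."""
    classes = sorted(set(y_true) | set(y_pred))
    recalls, precisions, f1s = [], [], []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        r  = tp / (tp + fn) if tp + fn else 0.0
        pr = tp / (tp + fp) if tp + fp else 0.0
        f1 = 2 * pr * r / (pr + r) if pr + r else 0.0
        recalls.append(r); precisions.append(pr); f1s.append(f1)
    acc = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    n = len(classes)
    return sum(recalls) / n, sum(precisions) / n, sum(f1s) / n, acc

rec, prec, f1, acc = macro_metrics([0, 0, 1, 1], [0, 1, 1, 1])
```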
4.6. Results and Analysis
4.6.1. Model Performance
4.6.2. The Impact of the Unified Vocabulary
4.6.3. The Impact of the AST Feature Fusion
4.6.4. The Impact of Parameter Settings
5. Conclusions
References
- Baker, B.S. A program for identifying duplicated code. Computing Science and Statistics 1993, 49.
- Bellon, S.; Koschke, R.; Antoniol, G.; Krinke, J.; Merlo, E. Comparison and evaluation of clone detection tools. IEEE Transactions on Software Engineering 2007, 33, 577–591.
- Fei, H.; Ren, Y.; Ji, D. Retrofitting Structure-aware Transformer Language Model for End Tasks. Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, 2020, pp. 2151–2161.
- Börstler, J. Feature-oriented classification for software reuse. Proceedings of the 7th International Conference on Software Engineering and Knowledge Engineering. Knowledge Systems Institute, 1995, Vol. 95, pp. 22–24.
- Fei, H.; Ren, Y.; Ji, D. Boundaries and edges rethinking: An end-to-end neural model for overlapping entity relation extraction. Information Processing & Management 2020, 57, 102311.
- Li, J.; Fei, H.; Liu, J.; Wu, S.; Zhang, M.; Teng, C.; Ji, D.; Li, F. Unified Named Entity Recognition as Word-Word Relation Classification. Proceedings of the AAAI Conference on Artificial Intelligence, 2022, pp. 10965–10973.
- Wang, S.; Liu, T.; Tan, L. Automatically learning semantic features for defect prediction. Proceedings of the 38th International Conference on Software Engineering. ACM, 2016, pp. 297–308.
- Li, J.; Xu, K.; Li, F.; Fei, H.; Ren, Y.; Ji, D. MRN: A Locally and Globally Mention-Based Reasoning Network for Document-Level Relation Extraction. Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, 2021, pp. 1359–1370.
- Fei, H.; Wu, S.; Ren, Y.; Zhang, M. Matching Structure for Dual Learning. Proceedings of the International Conference on Machine Learning, ICML, 2022, pp. 6373–6391.
- Li, J.; He, P.; Zhu, J.; Lyu, M.R. Software defect prediction via convolutional neural network. Proceedings of the 2017 International Conference on Software Quality, Reliability and Security. IEEE, 2017, pp. 318–328.
- Wu, S.; Fei, H.; Li, F.; Zhang, M.; Liu, Y.; Teng, C.; Ji, D. Mastering the Explicit Opinion-Role Interaction: Syntax-Aided Neural Transition System for Unified Opinion Role Labeling. Proceedings of the Thirty-Sixth AAAI Conference on Artificial Intelligence, 2022, pp. 11513–11521.
- Shi, W.; Li, F.; Li, J.; Fei, H.; Ji, D. Effective Token Graph Modeling using a Novel Labeling Strategy for Structured Sentiment Analysis. Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2022, pp. 4232–4241.
- Fei, H.; Zhang, Y.; Ren, Y.; Ji, D. Latent Emotion Memory for Multi-Label Emotion Classification. Proceedings of the AAAI Conference on Artificial Intelligence, 2020, pp. 7692–7699.
- Wang, F.; Li, F.; Fei, H.; Li, J.; Wu, S.; Su, F.; Shi, W.; Ji, D.; Cai, B. Entity-centered Cross-document Relation Extraction. Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, 2022, pp. 9871–9881.
- Kim, K.; Kim, D.; Bissyandé, T.F.; Choi, E.; Li, L.; Klein, J.; Traon, Y.L. FaCoY: a code-to-code search engine. Proceedings of the 40th International Conference on Software Engineering. ACM, 2018, pp. 946–957.
- Fei, H.; Wu, S.; Ren, Y.; Li, F.; Ji, D. Better Combine Them Together! Integrating Syntactic Constituency and Dependency Representations for Semantic Role Labeling. Findings of the Association for Computational Linguistics: ACL/IJCNLP 2021, 2021, pp. 549–559.
- Harer, J.A.; Kim, L.Y.; Russell, R.L.; Ozdemir, O.; Kosta, L.R.; Rangamani, A.; Hamilton, L.H.; Centeno, G.I.; Key, J.R.; Ellingwood, P.M.; et al. Automated software vulnerability detection with machine learning. arXiv preprint arXiv:1803.04497, 2018.
- Ben-Nun, T.; Jakobovits, A.S.; Hoefler, T. Neural code comprehension: A learnable representation of code semantics. Advances in Neural Information Processing Systems. Curran Associates, Inc., 2018, Vol. 31, pp. 3585–3597.
- Scarselli, F.; Gori, M.; Tsoi, A.C.; Hagenbuchner, M.; Monfardini, G. The graph neural network model. IEEE Transactions on Neural Networks 2008, 20, 61–80.
- Elman, J.L. Finding structure in time. Cognitive Science 1990, 14, 179–211.
- Azcona, D.; Arora, P.; Hsiao, I.H.; Smeaton, A. user2code2vec: Embeddings for profiling students based on distributional representations of source code. Proceedings of the 9th International Conference on Learning Analytics & Knowledge. ACM, 2019, pp. 86–95.
- Mou, L.; Li, G.; Zhang, L.; Wang, T.; Jin, Z. Convolutional neural networks over tree structures for programming language processing. Proceedings of the 30th AAAI Conference on Artificial Intelligence. ACM, 2016.
- Kipf, T.N.; Welling, M. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907, 2016.
- Wu, S.; Fei, H.; Ren, Y.; Ji, D.; Li, J. Learn from Syntax: Improving Pair-wise Aspect and Opinion Terms Extraction with Rich Syntactic Knowledge. Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence, 2021, pp. 3957–3963.
- Fei, H.; Li, F.; Li, B.; Ji, D. Encoder-Decoder Based Unified Semantic Role Labeling with Label-Aware Syntax. Proceedings of the AAAI Conference on Artificial Intelligence, 2021, pp. 12794–12802.
- Peters, F.; Tun, T.T.; Yu, Y.; Nuseibeh, B. Text filtering and ranking for security bug report prediction. IEEE Transactions on Software Engineering 2017, 45, 615–631.
- Fontana, F.A.; Zanoni, M. Code smell severity classification using machine learning techniques. Knowledge-Based Systems 2017, 128, 43–58.
- Alon, U.; Zilberstein, M.; Levy, O.; Yahav, E. code2vec: Learning distributed representations of code. Proceedings of the ACM on Programming Languages 2019, 3, 1–29.
- Clark, K.L.; Darlington, J. Algorithm classification through synthesis. The Computer Journal 1980, 23, 61–65.
- Fei, H.; Wu, S.; Li, J.; Li, B.; Li, F.; Qin, L.; Zhang, M.; Zhang, M.; Chua, T.S. LasUIE: Unifying Information Extraction with Latent Adaptive Structure-aware Generative Language Model. Proceedings of the Advances in Neural Information Processing Systems, NeurIPS 2022, 2022, pp. 15460–15475.
- Wu, S.; Fei, H.; Ji, W.; Chua, T.S. Cross2StrA: Unpaired Cross-lingual Image Captioning with Cross-lingual Cross-modal Structure-pivoted Alignment. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2023, pp. 2593–2608.
- Fei, H.; Ren, Y.; Zhang, Y.; Ji, D.; Liang, X. Enriching contextualized language model from knowledge graph for biomedical information extraction. Briefings in Bioinformatics 2021, 22.
- Jiang, L.; Su, Z. Automatic mining of functionally equivalent code fragments via random testing. Proceedings of the 18th International Symposium on Software Testing and Analysis. ACM, 2009, pp. 81–92.
- Taherkhani, A.; Korhonen, A.; Malmi, L. Recognizing algorithms using language constructs, software metrics and roles of variables: An experiment with sorting algorithms. The Computer Journal 2011, 54, 1049–1066.
- Drucker, H.; Burges, C.J.; Kaufman, L.; Smola, A.; Vapnik, V.; et al. Support vector regression machines. Advances in Neural Information Processing Systems 1997, 9, 155–161.
- Ugurel, S.; Krovetz, R.; Giles, C.L. What’s the code? automatic classification of source code archives. Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2002, pp. 632–638.
- Quinlan, J.R. Induction of decision trees. Machine Learning 1986, 1, 81–106.
- Friedman, N.; Geiger, D.; Goldszmidt, M. Bayesian network classifiers. Machine Learning 1997, 29, 131–163.
- Ma, Y.; Fakhoury, S.; Christensen, M.; Arnaoudova, V.; Zogaan, W.; Mirakhorli, M. Automatic classification of software artifacts in open-source applications. Proceedings of the 15th International Conference on Mining Software Repositories. ACM, 2018, pp. 414–425.
- Zhang, J.; Wang, X.; Zhang, H.; Sun, H.; Wang, K.; Liu, X. A novel neural source code representation based on abstract syntax tree. Proceedings of the 41st International Conference on Software Engineering. IEEE, 2019, pp. 783–794.
- Barchi, F.; Parisi, E.; Urgese, G.; Ficarra, E.; Acquaviva, A. Exploration of Convolutional Neural Network models for source code classification. Engineering Applications of Artificial Intelligence 2021, 97, 104075.
- Kalchbrenner, N.; Grefenstette, E.; Blunsom, P. A convolutional neural network for modelling sentences. arXiv preprint arXiv:1404.2188, 2014.
- Mikolov, T.; Chen, K.; Corrado, G.; Dean, J. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013.
- Zhang, Y.; Wallace, B. A sensitivity analysis of (and practitioners’ guide to) convolutional neural networks for sentence classification. arXiv preprint arXiv:1510.03820, 2015.
- Wang, W.; Li, G.; Ma, B.; Xia, X.; Jin, Z. Detecting code clones with graph neural network and flow-augmented abstract syntax tree. Proceedings of the 27th International Conference on Software Analysis, Evolution and Reengineering. IEEE, 2020, pp. 261–271.
- Wei, J.; Goyal, M.; Durrett, G.; Dillig, I. LambdaNet: Probabilistic type inference using graph neural networks. Proceedings of the 8th International Conference on Learning Representations. OpenReview, 2020.
- Huang, J.T.; Li, J.; Yu, D.; Deng, L.; Gong, Y. Cross-language knowledge transfer using multilingual deep neural network with shared hidden layers. Proceedings of the 2013 International Conference on Acoustics, Speech and Signal Processing. IEEE, 2013, pp. 7304–7308.
- Conneau, A.; Lample, G. Cross-lingual language model pretraining. Advances in Neural Information Processing Systems. Curran Associates, Inc., 2019, Vol. 32, pp. 7059–7069.
- Nguyen, A.T.; Nguyen, T.T.; Nguyen, T.N. Migrating code with statistical machine translation. Proceedings of the 36th International Conference on Software Engineering Companion. ACM, 2014, pp. 544–547.
- Bui, N.D.; Yu, Y.; Jiang, L. Bilateral dependency neural networks for cross-language algorithm classification. Proceedings of the 26th International Conference on Software Analysis, Evolution and Reengineering. IEEE, 2019, pp. 422–433.
- Ye, F.; Zhou, S.; Venkat, A.; Marcus, R.; Tatbul, N.; Tithi, J.J.; Petersen, P.; Mattson, T.; Kraska, T.; Dubey, P.; et al. MISIM: An end-to-end neural code similarity system. arXiv preprint arXiv:2006.05265, 2020.
- Bui, N.D.; Yu, Y.; Jiang, L. InferCode: Self-supervised learning of code representations by predicting subtrees. Proceedings of the 43rd International Conference on Software Engineering. IEEE, 2021, pp. 1186–1197.
- Peng, H.; Li, G.; Wang, W.; Zhao, Y.; Jin, Z. Integrating tree path in transformer for code representation. Proceedings of the 35th Conference on Neural Information Processing Systems. Curran Associates, Inc., 2021.
- Feng, Z.; Guo, D.; Tang, D.; Duan, N.; Feng, X.; Gong, M.; Shou, L.; Qin, B.; Liu, T.; Jiang, D.; et al. CodeBERT: A pre-trained model for programming and natural languages. Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: Findings. ACL, 2020, pp. 1536–1547.
- Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. ACL, 2019, pp. 4171–4186.
- Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. Proceedings of the 3rd International Conference on Learning Representations. OpenReview, 2015.
Table 1. Training/validation/testing distribution of the two datasets (JC: Java and C++; Leetcode: C, C++, Java, Python, and JavaScript).

| Split | JC Java | JC C++ | Leetcode C | Leetcode C++ | Leetcode Java | Leetcode Python | Leetcode JavaScript |
|---|---|---|---|---|---|---|---|
| Training | 3498 | 4215 | 331 | 3428 | 5051 | 2633 | 557 |
| Validation | 1162 | 1402 | 110 | 1143 | 1684 | 878 | 185 |
| Testing | 1162 | 1402 | 110 | 1143 | 1684 | 878 | 185 |
Cross-language program classification results on the JC dataset:

| Model | Recall | Precision | F1-score | Accuracy |
|---|---|---|---|---|
| CodeBERT | 0.9078 | 0.9177 | 0.9090 | 0.9005 |
| InferCode | 0.8317 | 0.8468 | 0.8325 | 0.8343 |
| SS | 0.9611 | 0.9631 | 0.9617 | 0.9626 |
Cross-language program classification results on the Leetcode dataset:

| Model | Recall | Precision | F1-score | Accuracy |
|---|---|---|---|---|
| CodeBERT | 0.6147 | 0.6348 | 0.6174 | 0.6245 |
| InferCode | 0.5696 | 0.5819 | 0.5762 | 0.5807 |
| SS | 0.7958 | 0.8025 | 0.7965 | 0.7964 |
Ablation results of SAST, GAST, and the fused SS on both datasets (-V denotes the variant without the unified vocabulary):

| Dataset | Model | Recall | Precision | F1-score | Accuracy |
|---|---|---|---|---|---|
| JC | SAST-V | 0.8802 | 0.8868 | 0.8816 | 0.8856 |
| JC | SAST | 0.9125 | 0.9254 | 0.9142 | 0.9142 |
| JC | GAST-V | 0.9467 | 0.9479 | 0.9469 | 0.9478 |
| JC | GAST | 0.9504 | 0.9516 | 0.9511 | 0.9508 |
| JC | SS-V | 0.9524 | 0.9526 | 0.9509 | 0.9516 |
| JC | SS | 0.9611 | 0.9631 | 0.9617 | 0.9626 |
| Leetcode | SAST-V | 0.6553 | 0.6894 | 0.6540 | 0.6554 |
| Leetcode | SAST | 0.6718 | 0.7020 | 0.6707 | 0.6721 |
| Leetcode | GAST-V | 0.7744 | 0.7793 | 0.7749 | 0.7749 |
| Leetcode | GAST | 0.7892 | 0.7956 | 0.7887 | 0.7892 |
| Leetcode | SS-V | 0.7893 | 0.7970 | 0.7904 | 0.7882 |
| Leetcode | SS | 0.7958 | 0.8025 | 0.7965 | 0.7964 |
| Dataset | Mean | Median | 70th pct. | 80th pct. | 90th pct. |
|---|---|---|---|---|---|
| JC | 576 | 354 | 502 | 726 | 1498 |
| Leetcode | 165 | 144 | 189 | 221 | 279 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content. |
© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).