Submitted: 10 June 2024
Posted: 12 June 2024
Abstract
Keywords:
1. Introduction
- Can we improve the deep learning model’s training, enhance vulnerable function detection, and increase Perfect Prediction by adjusting VulRepair’s hyperparameters and using its current libraries?
2. Related Works
3. Methodology
3.1. VulRepair Replication
- tokenizer name = MickyMike/VulRepair
  - Ours: Salesforce/codet5-base
- model name/path = MickyMike/VulRepair
  - Ours: Salesforce/codet5-base
- epochs = 75
- encoder block size = 512
- decoder block size = 256
- train batch size = 4
- eval batch size = 4
- test batch size = 1
- optimizer = AdamW
- learning rate = 2e-5
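The replication settings listed above can be gathered into a single configuration object, which makes the difference between the published setup and the "Ours" variant explicit. The following is a minimal sketch only: the class name `ReplicationConfig` and its field names are hypothetical and are not taken from the VulRepair code base; only the values come from the list above.

```python
from dataclasses import asdict, dataclass, replace


@dataclass(frozen=True)
class ReplicationConfig:
    """Hypothetical container for the replication settings listed above."""

    tokenizer_name: str = "MickyMike/VulRepair"
    model_name_or_path: str = "MickyMike/VulRepair"
    epochs: int = 75
    encoder_block_size: int = 512
    decoder_block_size: int = 256
    train_batch_size: int = 4
    eval_batch_size: int = 4
    test_batch_size: int = 1
    optimizer: str = "AdamW"
    learning_rate: float = 2e-5


# Settings as listed for the original VulRepair checkpoint.
original = ReplicationConfig()

# "Ours": identical settings, but starting from the base CodeT5 checkpoint.
ours = replace(
    original,
    tokenizer_name="Salesforce/codet5-base",
    model_name_or_path="Salesforce/codet5-base",
)

# Collect only the fields that differ between the two configurations.
original_fields = asdict(original)
changed = {
    name: value
    for name, value in asdict(ours).items()
    if original_fields[name] != value
}
print(changed)
```

Keeping the configuration immutable (`frozen=True`) and deriving variants with `replace` is one way to make sure the two runs differ only in the checkpoint names.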
3.2. Replication of Other Models
3.3. Improving VulRepair: ImpVulRepair
3.3.1. Optimizer
3.3.2. Learning Rate
3.3.3. Weight Decay
3.3.4. Batch Size
3.3.5. Encoder & Decoder Block Size
3.3.6. Betas
4. Results
4.1. LION Optimization
4.2. Testing Deprecated AdamW

4.3. Testing PyTorch AdamW
5. Discussion
6. Limitations
7. Conclusion
Author Contributions
Funding
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
Abbreviations
| Abbreviation | Definition |
|---|---|
| MDPI | Multidisciplinary Digital Publishing Institute |
| DOAJ | Directory of open access journals |
| TLA | Three letter acronym |
| PL | Programming Language |
| NL | Natural Language |
| T5 | Text-to-Text Transfer Transformer |
| LION | Evolved Sign Momentum |
| CWE | Common Weakness Enumeration |
| CVSS | Common Vulnerability Scoring System |

VulRepair Training to be Updated

| Library & other | 2022 | Apr 2024 |
|---|---|---|
| transformers | 4.19.1 | 4.40.0 |
| torch | 1.10.2+cu113 | 2.2.1+cu121 |
| numpy | 1.22.3 | 1.25.2 |
| tqdm | 4.62.3 | 4.66.2 |
| pandas | 1.4.1 | 2.0.3 |
| tokenizers | 0.11.6 | 0.19.1 |
| datasets | 2.0.0 | 2.18.0 |
| gdown | 4.5.1 | 5.1.0 |
| Python | 3.9.7 | 3.10.12 |
| Optimizer | AdamW | torch.optim.AdamW |
| GPU | NVIDIA 3090 | NVIDIA 4090 |

Hyperparameter Comparison

| Hyperparameter | VulRepair | ImpVulRepair |
|---|---|---|
| Optimizer | AdamW | LION |
| Learning Rate | 2e-5 | 2e-5 |
| Weight Decay | 0.0 | 1e-3 |
| Betas | (0.9, 0.999) | (0.9, 0.99) |
| Training Batch Size | 4 | 2 |
| Testing Batch Size | 1 | 1 |
| Epochs | 75 | 10 |
| Encoder Block Size | 512 | 1024 |
| Decoder Block Size | 256 | 512 |
| GPU | NVIDIA 3090 | Colab V100 |
| Perfect Prediction | 44% | 56% |
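
To make the optimizer, learning rate, weight decay, and betas entries of the hyperparameter comparison concrete, the sketch below applies one LION update to a small list of scalar parameters. It follows the update rule published for the optimizer (sign of an interpolated momentum, decoupled weight decay) and is an illustration only: the function name `lion_step` is hypothetical, the code is not the authors' training script and not any library's API, and the default arguments are simply the ImpVulRepair values from the table.

```python
from typing import List, Sequence, Tuple


def sign(x: float) -> float:
    """Return -1.0, 0.0, or 1.0 depending on the sign of x."""
    return float((x > 0) - (x < 0))


def lion_step(
    params: Sequence[float],
    grads: Sequence[float],
    momentum: Sequence[float],
    lr: float = 2e-5,                         # learning rate from the table
    betas: Tuple[float, float] = (0.9, 0.99),  # betas from the table
    weight_decay: float = 1e-3,               # weight decay from the table
) -> Tuple[List[float], List[float]]:
    """Apply one LION update (hypothetical helper, illustration only)."""
    beta1, beta2 = betas
    new_params: List[float] = []
    new_momentum: List[float] = []
    for p, g, m in zip(params, grads, momentum):
        # Update direction: sign of the beta1-interpolation of momentum and gradient.
        direction = sign(beta1 * m + (1.0 - beta1) * g)
        # Parameter step with decoupled weight decay.
        new_params.append(p - lr * (direction + weight_decay * p))
        # The momentum itself is tracked with beta2.
        new_momentum.append(beta2 * m + (1.0 - beta2) * g)
    return new_params, new_momentum


params, momentum = lion_step(
    params=[1.0, -2.0],
    grads=[0.5, -0.25],
    momentum=[0.0, 0.0],
)
print(params, momentum)
```

Because only the sign of the interpolated momentum is used, every parameter moves by the same magnitude `lr` per step (apart from weight decay), which is one reason the learning rate and weight decay interact differently than under AdamW.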
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).