Submitted: 06 May 2025
Posted: 07 May 2025
Abstract
Keywords:
1. Introduction
2. Related Work
3. Tools and Methods
3.1. Approaches to Test Case Generation
3.1.1. Test Case Generation in C
3.2. Selected LLMs for Comparison
3.3. Comparative Analysis Framework
3.3.1. Proposed Datasets
3.3.2. Evaluation Metrics
4. Experimental Setup

- Pass@1 - Measures the percentage of correct solutions that pass all generated test cases on the first attempt.
- Line Coverage - Evaluates how well the generated test cases cover the solution’s lines of code.
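
The two metrics above can be sketched in C as follows. This is a minimal illustration, not the authors' evaluation harness: the `ProblemResult` record and both function names are hypothetical, and the line counts are assumed to come from a coverage tool such as gcov.

```c
#include <stddef.h>

/* Hypothetical record for one benchmark problem: how many of the generated
   test cases the reference solution passed, and how many were generated. */
typedef struct {
    int tests_passed;
    int tests_total;
} ProblemResult;

/* Pass@1 as a percentage: a problem counts only when the solution passes
   every generated test case on the first attempt. */
double pass_at_1(const ProblemResult *results, size_t n) {
    if (n == 0) return 0.0;
    size_t ok = 0;
    for (size_t i = 0; i < n; i++) {
        if (results[i].tests_total > 0 &&
            results[i].tests_passed == results[i].tests_total) {
            ok++;
        }
    }
    return 100.0 * (double)ok / (double)n;
}

/* Line coverage as a percentage: executable lines hit at least once by the
   generated tests, divided by all executable lines of the solution. */
double line_coverage(int lines_executed, int lines_executable) {
    if (lines_executable <= 0) return 0.0;
    return 100.0 * (double)lines_executed / (double)lines_executable;
}
```

Under this reading, a single failing test case makes the whole problem count as a failure, so Pass@1 is stricter than a per-test pass rate.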
5. Results and Discussion
5.1. Experimental Results
5.2. Strengths and Weaknesses of LLMs in C Test Case Generation
Case Study 1: Floating-Point Instability in Nonlinear Functions
Case Study 2: Logical Arithmetic Misinterpretation
- Numeric computations involving floating-point behavior and math library functions.
- Logical reasoning involving offsets, indexing, or nested arithmetic operations.
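
For the first category, a common mitigation is to compare floating-point results within a tolerance rather than with `==`. The sketch below is illustrative only: `nearly_equal` and its tolerance parameters are hypothetical names, and the paper's test harness may compare outputs differently.

```c
#include <math.h>
#include <stdbool.h>

/* Returns true when a and b agree within an absolute tolerance (useful near
   zero) or a relative tolerance (useful for large magnitudes). */
bool nearly_equal(double a, double b, double abs_tol, double rel_tol) {
    if (isnan(a) || isnan(b)) return false;   /* NaN never matches */
    if (a == b) return true;                  /* exact match, incl. equal infinities */
    if (isinf(a) || isinf(b)) return false;   /* infinity vs. anything else */

    double diff = fabs(a - b);
    double mag_a = fabs(a);
    double mag_b = fabs(b);
    double scale = mag_a > mag_b ? mag_a : mag_b;

    return diff <= abs_tol || diff <= rel_tol * scale;
}
```

A generated test case could then check `nearly_equal(actual, expected, 1e-12, 1e-9)` instead of asserting exact equality; the tolerance values here are placeholders and would need to be chosen per function.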
6. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
Appendix A. System Prompt + User Prompt Example For Problem 5


| Model | Provider | HumanEval Accuracy (%) | MBPP Accuracy (%) |
|---|---|---|---|
| GPT-4o Preview (Sep 2024) | OpenAI | 96.3 | 95.5 |
| Claude-3.5-Sonnet (Oct 2024) | Anthropic | 93.7 | - |
| Qwen2.5-Coder-32B-Instruct (Nov 2024) | Alibaba Cloud | 92.1 | 90.5 |
| Amazon Nova Pro (Dec 2024) | Amazon | 89.9 | - |
| Llama 3.3 70B (Dec 2024) | Meta AI | 88.4 | - |

| Experiment ID | Prompt Description |
|---|---|
| stmt | Only the problem statement. |
| C_sol | Only solution in C. |
| stmt+py_sol | Problem statement along with its solution in Python. |
| stmt+C_sol | Problem statement and its solution in C. |
| C_sol+3tests | Solution in C along with three example test cases. |
| stmt+C_sol+3tests | Problem statement, its solution in C, and three example test cases. |
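
The six prompt configurations in the table above can be expressed as a small lookup table that decides which components go into each prompt. The sketch below is an assumption-laden illustration: the section labels, the `PromptConfig` type, and `build_prompt` are hypothetical and do not reproduce the prompt wording used in the experiments.

```c
#include <stdio.h>
#include <string.h>

/* Which components each experiment ID includes (mirrors the table above). */
typedef struct {
    const char *id;
    int use_stmt, use_py_sol, use_c_sol, use_tests;
} PromptConfig;

static const PromptConfig CONFIGS[] = {
    {"stmt",              1, 0, 0, 0},
    {"C_sol",             0, 0, 1, 0},
    {"stmt+py_sol",       1, 1, 0, 0},
    {"stmt+C_sol",        1, 0, 1, 0},
    {"C_sol+3tests",      0, 0, 1, 1},
    {"stmt+C_sol+3tests", 1, 0, 1, 1},
};

static const PromptConfig *find_config(const char *id) {
    for (size_t i = 0; i < sizeof CONFIGS / sizeof CONFIGS[0]; i++) {
        if (strcmp(CONFIGS[i].id, id) == 0) return &CONFIGS[i];
    }
    return NULL;
}

/* Appends "label:\ntext\n\n" at offset len; returns the new length,
   or -1 if the section does not fit in the buffer. */
static int append_section(char *buf, size_t cap, int len,
                          const char *label, const char *text) {
    size_t room = cap - (size_t)len;
    int n = snprintf(buf + len, room, "%s:\n%s\n\n", label, text);
    if (n < 0 || (size_t)n >= room) return -1;
    return len + n;
}

/* Builds the prompt for one experiment ID. Returns the prompt length,
   or -1 for an unknown ID or a buffer that is too small. */
int build_prompt(const char *id, const char *stmt, const char *py_sol,
                 const char *c_sol, const char *tests,
                 char *buf, size_t cap) {
    const PromptConfig *cfg = find_config(id);
    if (cfg == NULL || cap == 0) return -1;
    buf[0] = '\0';
    int len = 0;
    /* Section labels below are hypothetical placeholders. */
    if (cfg->use_stmt)
        len = append_section(buf, cap, len, "Problem statement", stmt);
    if (len >= 0 && cfg->use_py_sol)
        len = append_section(buf, cap, len, "Python solution", py_sol);
    if (len >= 0 && cfg->use_c_sol)
        len = append_section(buf, cap, len, "C solution", c_sol);
    if (len >= 0 && cfg->use_tests)
        len = append_section(buf, cap, len, "Example tests", tests);
    return len;
}
```

Keeping the configurations in one table makes it straightforward to run every model against the same six prompt variants without duplicating the assembly logic.
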

| Experiment ID | GPT-4o Preview (Sep 2024) | Qwen2.5-Coder-32B-Instruct (Nov 2024) | Llama 3.3 70B (Dec 2024) | Claude-3.5-Sonnet (Oct 2024) | Amazon Nova Pro (Dec 2024) |
|---|---|---|---|---|---|
| stmt | 91.00 | 70.75 | 73.00 | 76.50 | 68.25 |
| C_sol | 97.50 | 73.50 | 71.50 | 75.75 | 73.00 |
| stmt+py_sol | 95.75 | 83.75 | 75.50 | 85.75 | 74.00 |
| stmt+C_sol | 96.25 | 81.50 | 78.25 | 84.25 | 78.25 |
| C_sol+3tests | 97.75 | 79.25 | 79.00 | 84.50 | 81.75 |
| stmt+C_sol+3tests | 98.25 | 79.25 | 81.25 | 86.75 | 81.50 |

| Experiment ID | GPT-4o Preview (Sep 2024) | Qwen2.5-Coder-32B-Instruct (Nov 2024) | Llama 3.3 70B (Dec 2024) | Claude-3.5-Sonnet (Oct 2024) | Amazon Nova Pro (Dec 2024) |
|---|---|---|---|---|---|
| stmt+C_sol | 98.7 | 98.0 | 98.4 | 99.2 | 95.0 |
| C_sol+3tests | 97.5 | 97.0 | 96.1 | 98.1 | 98.1 |
| stmt+C_sol+3tests | 98.0 | 98.0 | 97.0 | 98.1 | 98.1 |

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).