Submitted:
28 May 2025
Posted:
29 May 2025
You are already at the latest version
Abstract
Keywords:
1. Introduction
- We reproduce 16 of the 32 scores reported for the evaluated models. This implies that many of the discrepancies in reported results can be attributed to using different, or possibly incorrect evaluation settings.
- We quantify the impact of modifying evaluation settings individually. For instance, maximum generation length, sampling strategy, 4-bit quantization, and prompt template choice significantly influence the results, whereas precision and 8-bit quantization do not.
- We identify instances of possibly misreported results, likely due to confusion between base and instruction-tuned models.
2. Method
2.1. Benchmarking with HumanEvalFix
2.2. Review of Reported Benchmark Results
2.3. Experimental Setup
- A differently tuned version of the model was used (e.g. the base model instead of its instruction-tuned variant), causing a large difference in benchmark result. (2 issues)
- The wrong variant of the same benchmark was used: MBPP instead of MBPP+, and HumanEval instead of its unstripped variant. (2 issues)
- The temperature was set incorrectly: while generating one sample for each prompt, the temperature turned out to be an improperly large value. (1 issue)
- An incorrect prompt was used, causing more than of a decrease in pass@1 performance. (1 issue)
- The reproduced results were different by only a few percentage points compared to the officially reported ones. Such a discrepancy was considered negligible, as this can be due to minor variations in evaluation settings, minor updates in model versions, or in hardware configurations. (4 issues)
- Prompt template
- The evaluation framework defaults to the instruct prompt template, with only the context and instruction (the program and the instruction to fix it). The default prompt template lacks model-specific formatting suggested by model authors. We evaluate models both with their suggested prompt template as well as the default instruct setting.
- Sampling vs. greedy decoding
- The default behavior in the framework is not to use greedy decoding, but to apply sampling with the temperature . Alongside greedy decoding, we conduct experiments using sampling with temperatures and .
- Limiting generation length
- The default value of 512 tokens is insufficient, as it can lead to unfinished programs and reduced benchmark scores. We conduct our experiments using lengths of 512 and 2048. Since the max length generation parameter considers both the prompt and the generated output, it must be large enough to prevent cutting the output short. We found 2048 to be a good choice on this benchmark, as tokenizers typically fit the inputs within 1024 tokens, leaving enough space for the generated output.
- Precision and quantization
- The majority of LLMs are released using bf16 precision, making it the preferred choice for precision. We evaluate models using all three common precision formats: fp16, fp32, and bf16. While quantization is a useful technique to lower memory usage when running models, it’s generally best to use models without it. In addition to our tests without quantization, we also run experiments using 4-bit and 8-bit quantization (with fp16 precision).
3. Experimental Results
3.1. The Effect of Individual Evaluation Settings
3.1.1. Temperature: Greedy Evaluation and Sampling
3.1.2. Maximum Generation Length
3.1.3. Using an Incorrect Prompt Template
3.1.4. Precision and Quantization
3.2. Reproducibility of Results in Existing Research
3.2.1. CodeLlama
3.2.2. DeepSeekCoder
3.2.3. CodeGemma
4. Discussion
5. Conclusion
Author Contributions
Funding
Conflicts of Interest
References
- Muennighoff, N.; Liu, Q.; Zebaze, A.; Zheng, Q.; Hui, B.; Zhuo, T.Y.; Singh, S.; Tang, X.; von Werra, L.; Longpre, S. 2024; arXiv:cs.CL/2308.07124].
- Rozière, B.; Gehring, J.; Gloeckle, F.; Sootla, S.; Gat, I.; Tan, X.E.; Adi, Y.; Liu, J.; Sauvestre, R.; Remez, T.; et al. O: Llama, 2024; arXiv:cs.CL/2308.12950].
- Guo, D.; Zhu, Q.; Yang, D.; Xie, Z.; Dong, K.; Zhang, W.; Chen, G.; Bi, X.; Wu, Y.; Li, Y.K.; et al. 2024; arXiv:cs.SE/2401.14196].
- Lin, D.; Koppel, J.; Chen, A.; Solar-Lezama, A. QuixBugs: a multi-lingual program repair benchmark set based on the quixey challenge. In Proceedings of the Proceedings Companion of the 2017 ACM SIGPLAN International Conference on Systems, Programming, Languages, New York, NY, USA, 2017; SPLASH Companion 2017, and Applications: Software for Humanity; pp. 55–56. [CrossRef]
- Widyasari, R.; Sim, S.Q.; Lok, C.; Qi, H.; Phan, J.; Tay, Q.; Tan, C.; Wee, F.; Tan, J.E.; Yieh, Y.; et al. BugsInPy: a database of existing bugs in Python programs to enable controlled testing and debugging studies. In Proceedings of the Proceedings of the 28th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, New York, NY, USA, 2020. [CrossRef]
- Liu, S.; Chai, L.; Yang, J.; Shi, J.; Zhu, H.; Wang, L.; Jin, K.; Zhang, W.; Zhu, H.; Guo, S.; et al. 2024; arXiv:cs.CL/2411.02310].
- Chen, M.; Tworek, J.; Jun, H.; Yuan, Q.; de Oliveira Pinto, H.P.; Kaplan, J.; Edwards, H.; Burda, Y.; Joseph, N.; Brockman, G.; et al. 2021; arXiv:cs.LG/2107.03374].
- Pimentel, M.A.; Christophe, C.; Raha, T.; Munjal, P.; Kanithi, P.K.; Khan, S. A: Metrics, 2024; arXiv:cs.AI/2407.21072].
- Paul, D.G.; Zhu, H.; Bayley, I. A: and Metrics for Evaluations of Code Generation, 2024; arXiv:cs.AI/2406.12655].
- Wang, S.; Asilis, J. ; Ömer Faruk Akgül.; Bilgin, E.B., Liu, O., Eds.; Neiswanger, W. Tina: Tiny Reasoning Models via LoRA, 2025, [arXiv:cs.CL/2504. [Google Scholar]
- Team, C.; Zhao, H.; Hui, J.; Howland, J.; Nguyen, N.; Zuo, S.; Hu, A.; Choquette-Choo, C.A.; Shen, J.; Kelley, J.; et al. 2024; arXiv:cs.CL/2406.11409].
- Ben Allal, L.; Muennighoff, N.; Kumar Umapathi, L.; Lipkin, B.; von Werra, L. A framework for the evaluation of code generation models. https://github.com/bigcode-project/bigcode-evaluation-harness, 2022.
- Cassano, F.; Li, L.; Sethi, A.; Shinn, N.; Brennan-Jones, A.; Ginesin, J.; Berman, E.; Chakhnashvili, G.; Lozhkov, A.; Anderson, C.J.; et al. Can It Edit? 2024; arXiv:cs.SE/2312.12450]. [Google Scholar]
- Chae, H.; Kwon, T.; Moon, S.; Song, Y.; Kang, D.; Ong, K.T.i.; Kwak, B.w.; Bae, S.; Hwang, S.w.; Yeo, J. Coffee-Gym: An Environment for Evaluating and Improving Natural Language Feedback on Erroneous Code. In Proceedings of the Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing.
- Dehghan, M.; Wu, J.J.; Fard, F.H.; Ouni, A. 2024; arXiv:cs.SE/2408.09568].
- Granite Team, I. Granite 3.0 Language Models, 2024.
- Campos, V. Bug Detection and Localization using Pre-trained Code Language Models. In INFORMATIK 2024; Gesellschaft für Informatik e.V.: Bonn, 2024; pp. 1419–1429. [Google Scholar] [CrossRef]
- Jiang, Y.; He, Q.; Zhuang, X.; Wu, Z. 2024; arXiv:cs.CL/2403.19121].
- Jiang, H.; Liu, Q.; Li, R.; Ye, S.; Wang, S. 2024; arXiv:cs.CL/2410.07002].
- Lozhkov, A.; Li, R.; Allal, L.B.; Cassano, F.; Lamy-Poirier, J.; Tazi, N.; Tang, A.; Pykhtar, D.; Liu, J.; Wei, Y.; et al. T: 2 and The Stack v2, 2024; arXiv:cs.SE/2402.19173].
- Mishra, M.; Stallone, M.; Zhang, G.; Shen, Y.; Prasad, A.; Soria, A.M.; Merler, M.; Selvam, P.; Surendran, S.; Singh, S.; et al. A: Code Models, 2024; arXiv:cs.AI/2405.04324].
- Moon, S.; Chae, H.; Song, Y.; Kwon, T.; Kang, D.; iunn Ong, K.T.; won Hwang, S.; Yeo, J. 2024; arXiv:cs.CL/2311.07215].
- Nakamura, T.; Mishra, M.; Tedeschi, S.; Chai, Y.; Stillerman, J.T.; Friedrich, F.; Yadav, P.; Laud, T.; Chien, V.M.; Zhuo, T.Y.; et al. Aurora-M: The First Open Source Multilingual Language Model Red-teamed according to the U.S. 2024; arXiv:cs.CL/2404.00399]. [Google Scholar]
- Shi, Y.; Wang, S.; Wan, C.; Gu, X. C: Code to Correctness, 2024; arXiv:cs.CL/2410.01215].
- Singhal, M.; Aggarwal, T.; Awasthi, A.; Natarajan, N.; Kanade, A. 2024; arXiv:cs.SE/2401.15963].
- Wang, X.; Li, B.; Song, Y.; Xu, F.F.; Tang, X.; Zhuge, M.; Pan, J.; Song, Y.; Li, B.; Singh, J.; et al. 2024; arXiv:cs.SE/2407.16741].
- Yang, J.; Jimenez, C.E.; Wettig, A.; Lieret, K.; Yao, S.; Narasimhan, K.; Press, O. 2024; arXiv:cs.SE/2405.15793].
- Yu, Z.; Zhang, X.; Shang, N.; Huang, Y.; Xu, C.; Zhao, Y.; Hu, W.; Yin, Q. 2024; arXiv:cs.CL/2312.14187].
- Wang, Y.; Le, H.; Gotmare, A.D.; Bui, N.D.Q.; Li, J.; Hoi, S.C.H. 2023; arXiv:cs.CL/2305.07922].
- Tunstall, L.; Lambert, N.; Rajani, N.; Beeching, E.; Le Scao, T.; von Werra, L.; Han, S.; Schmid, P.; Rush, A. Creating a Coding Assistant with StarCoder. Hugging Face Blog.
- Zheng, Q.; Xia, X.; Zou, X.; Dong, Y.; Wang, S.; Xue, Y.; Wang, Z.; Shen, L.; Wang, A.; Li, Y.; et al. 2024; arXiv:cs.LG/2303.17568].
- Li, R.; Allal, L.B.; Zi, Y.; Muennighoff, N.; Kocetkov, D.; Mou, C.; Marone, M.; Akiki, C.; Li, J.; Chim, J.; et al. StarCoder: may the source be with you! 2023; arXiv:cs.CL/2305.06161]. [Google Scholar]
- Luo, Z.; Xu, C.; Zhao, P.; Sun, Q.; Geng, X.; Hu, W.; Tao, C.; Ma, J.; Lin, Q.; Jiang, D. 2023; arXiv:cs.CL/2306.08568].
- Workshop, B. ; :.; Scao, T.L., Fan, A., Akiki, C., Pavlick, E., Ilić, S., Hesslow, D., Castagné, R., Luccioni, A.S., Eds.; et al. BLOOM: A 176B-Parameter Open-Access Multilingual Language Model, 2023, [arXiv:cs.CL/2211. [Google Scholar]
- Jiang, A.Q.; Sablayrolles, A.; Mensch, A.; Bamford, C.; Chaplot, D.S.; de las Casas, D.; Bressand, F.; Lengyel, G.; Lample, G.; Saulnier, L.; et al. 2023; arXiv:cs.CL/2310.06825].
- Dubey, A.; Jauhri, A.; Pandey, A.; Kadian, A.; Al-Dahle, A.; Letman, A.; Mathur, A.; Schelten, A.; Yang, A.; Fan, A.; et al. 2024; arXiv:cs.AI/2407.21783].
| 1 | |
| 2 | We reviewed these issues on the 5th of November (2024) |


| black Model | Reported results |
|---|---|
| gray!20 InstructCodeT5+ (16B) | 2.7[1,18,22] |
| StarChat- (16B) | 18.1[1,22] |
| CodeGeeX2 (6B) | 15.9[1,18,22] |
| StarCoder (16B) | 8.7[1,18,22,28] |
| StarCoderBase (15B) | 10.4 , 12.6 [20,23] |
| StarCoder2 (15B) | 9.1 , 9.7 [20,23] |
| OctoGeeX (6B) | 28.1[1,22] |
| OctoCoder (16B) | 28 , 30.4 [1,18,20,22,23,28] |
| WizardCoder (16B) | 31.8[1,18,22,28], 51.6 |
| CodeLlama-Inst (7B) | 15.2 , 19.5 , 20.6 , 28.0 |
| CodeLlama-Inst (13B) | 15.2 , 16.4 , 18.9 , 19.4 , 29.2 , 30.5 |
| CodeLlama-Inst (34B) | 36.5[13,20], 37.8 , 55.9 |
| CodeLlama (13B) | 6.1 , 33.1 |
| CodeLlama (34B) | 14.0 , 47.6 |
| BLOOMZ (176B) | 16.6[1,18,23] |
| Mistral-Inst (7B) | 10.9 , 53.05 |
| Llama3-Inst (8B) | 40.2 , 41.5 |
| Llama3-Inst (70B) | 57.3 , 81.7 |
| CodeGemma (2B) | 4.3 , 22.7 |
| CodeGemma (7B) | 8.5 , 11.7 , 53.7 |
| CodeGemma-Inst (7B) | 46.3 , 72.4 |
| DeepSeekCoder (1.3B) | 1.2 , 16.4 |
| DeepSeekCoder (6.7B) | 23.8 , 29.9 , 45.4 |
| DeepSeekCoder-Inst (1.3B) | 9.1 , 29.3 , 48.9 |
| DeepSeekCoder-Inst (6.7B) | 42.1 , 44.9 [13,20], 56.1 , 60.4 , 73.3 |
| DeepSeekCoder-Inst (33B) | 47.5[13,20], 81.0 |
| black |
![]() |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content. |
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
