Submitted: 17 June 2024
Posted: 17 June 2024
Abstract
Keywords:
1. Introduction
- RQ1. Which violations of coding conventions [10] can be observed in the source code when students do or do not use ChatGPT for the course’s programming exercises?
- RQ2. Which topics in the programming exercises are the most violation-dense when students do or do not use ChatGPT to implement them?
- RQ3. How does using ChatGPT influence the cyclomatic and cognitive complexity of the source code of the students’ programming exercises?
2. Related Work
3. Methodology
| # | Exercise Focus | Control Group | Treatment Group (•) | Treatment Group (∘) |
|---|---|---|---|---|
| Fu | Fundamentals | 16 | 19 | 3 |
| Lo | Loops | 15 | 19 | 3 |
| OO | Object-Oriented Programming | 15 | 19 | 3 |
| In | Interfaces, Exception Handling | 15 | 19 | 3 |
| Co | Collections | 15 | 15 | 6 |
| Fi | File, IO Streams | 14 | 16 | 3 |
| La | Lambda Expressions, Multithreading | 13 | 12 | 6 |
3.1. Static Code Analysis
3.2. Statistical Tests
- H1,0: there is no significant difference between the control and treatment groups in the number of rule violations (of code conventions).
- H2.1,0: there is no significant difference between the control and treatment groups in the cyclomatic complexity of the source code.
- H2.2,0: there is no significant difference between the control and treatment groups in the cognitive complexity of the source code.
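This excerpt does not name the specific statistical test applied to these hypotheses. A rank-based two-sample test such as the Mann–Whitney U test is a common choice for per-student violation counts; the following is a minimal, hypothetical Java sketch using the normal approximation (two-sided, without a tie correction in the variance), not the study's actual analysis pipeline:

```java
import java.util.Arrays;

// Illustrative Mann-Whitney U test: compares two independent samples
// (e.g., violation counts of the control and treatment groups).
public final class MannWhitney {

    /** Two-sided p-value for H0: both samples come from the same distribution. */
    public static double pValue(final double[] a, final double[] b) {
        final int n1 = a.length;
        final int n2 = b.length;
        final double[] all = new double[n1 + n2];
        System.arraycopy(a, 0, all, 0, n1);
        System.arraycopy(b, 0, all, n1, n2);
        Arrays.sort(all);

        // Rank sum of sample a in the combined sample, using midranks for ties.
        double r1 = 0.0;
        for (final double v : a) {
            int below = 0;    // elements strictly smaller than v
            int notAbove = 0; // elements smaller than or equal to v
            for (final double w : all) {
                if (w < v) { below++; }
                if (w <= v) { notAbove++; }
            }
            r1 += (below + 1 + notAbove) / 2.0; // average of ranks below+1 .. notAbove
        }

        final double u = r1 - n1 * (n1 + 1) / 2.0;  // U statistic of sample a
        final double mean = n1 * n2 / 2.0;           // E[U] under H0
        final double sd = Math.sqrt(n1 * (double) n2 * (n1 + n2 + 1) / 12.0);
        final double z = (u - mean) / sd;
        return 2.0 * (1.0 - phi(Math.abs(z)));
    }

    /** Standard normal CDF via the Abramowitz-Stegun erf approximation. */
    private static double phi(final double z) {
        final double x = z / Math.sqrt(2.0);
        final double t = 1.0 / (1.0 + 0.3275911 * x);
        final double erf = 1.0 - (((((1.061405429 * t - 1.453152027) * t)
                + 1.421413741) * t - 0.284496736) * t + 0.254829592) * t
                * Math.exp(-x * x);
        return 0.5 * (1.0 + erf);
    }
}
```

Clearly separated samples yield a small p-value (reject the null hypothesis); near-identical samples yield a p-value close to 1 (the null hypothesis cannot be rejected).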
4. Results
4.1. Most Frequent Rule Violations (RQ1)
- LineLength: Statistical analysis revealed a significant difference between control and treatment groups, supporting the alternative hypothesis H1,a with a p-value of 2.16e-15. The treatment group demonstrated greater adherence to this rule across all programming exercises.
- FinalParameters: The statistical test supports the alternative hypothesis H1,a with a p-value of 9.99e-11, showing significant differences between control and treatment groups. The data reveals consistently higher median violations in the control group across all exercises, emphasizing the improved adherence to this rule when using ChatGPT. For instance, in the “Object-Oriented Programming” exercise, the treatment group’s median violations are almost half those of the control group.
- HiddenField: The statistical analysis revealed a significant difference between the control and treatment groups, with a p-value of 1.34e-07, supporting the alternative hypothesis H1,a. The treatment group had consistently lower median rule violations across all exercises. As an example, the “Collections” exercise had a median of 8.57 violations compared to the control group’s median of 16.
- MissingSwitchDefault: We found a statistically significant difference with a p-value of 8.34e-05, indicating that ChatGPT effectively minimized these violations, thus supporting the alternative hypothesis H1,a. The treatment group generally showed lower median violations compared to the control group in all areas. In the “Interfaces and Exception Handling” exercise, for example, the median violations for the treatment group were 1.82 and thus significantly lower than the control group’s median of 4.
- DesignForExtension: The p-value of 6.14e-04 confirmed significant differences among the groups, supporting the alternative hypothesis H1,a. The median violations consistently favored the treatment group over the control group across all exercises. For example, in the “Interfaces and Exception Handling” exercise, the treatment group’s median violation was 20, significantly lower than the control group’s 34.
- MagicNumber: The analysis shows significant differences with a p-value of 7.32e-04, supporting the alternative hypothesis H1,a. The treatment group generally has lower median violations, such as in the “Interfaces and Exception Handling” exercise where it was 7.27 for the treatment group and 15.7 for the control group.
- VisibilityModifier: The analysis supports the alternative hypothesis H1,a with a p-value of 1.31e-03, indicating significant differences. The treatment group showed greater adherence to this rule, with lower median violations across exercises. For example, in the “Object-Oriented Programming” exercise, the median for the treatment group was 1.36, significantly lower than the 4.67 observed in the control group.
- RightCurly: Statistical analysis showed a significant variance with a p-value of 9.45e-03, supporting the alternative hypothesis H1,a for the comparison between control and treatment groups. The treatment group had consistently lower median violations across different parts of the exercise, indicating better compliance with the rule. For instance, in the “Interfaces and Exception Handling” exercise, the median for the treatment group was 1.36, compared to 4.33 for the control group.
- NeedBraces: The statistical analysis produced a p-value of 0.0100, which suggests a significant difference and supports the alternative hypothesis H1,a. The results varied across exercises, with the treatment group often displaying improved compliance by having lower median violations. Specifically, in the “Interfaces and Exception Handling” exercise, the treatment group had a median of 1.36, whereas the control group had 13.3, showing a significant reduction in violations. However, in the “Object-Oriented Programming” exercise, the treatment group had higher median violations.
- LocalVariableName: The statistical results indicate a significant difference with a p-value of 0.0116, supporting the alternative hypothesis H1,a. The treatment group generally exhibited better adherence to naming conventions for local variables. For instance, in the “Object-Oriented Programming” exercise the median was 0.909 for the treatment group compared to 3.33 for the control group.
- MethodParamPad: The median violations in both groups were comparable across the exercises, with only minor differences in specific contexts. For example, the treatment group showed a lower median of 1 in the “File IO and Streams” exercise compared to the control group’s 3.93. However, since the overall p-value is 0.057, these differences are not statistically significant at the chosen threshold, and the null hypothesis H1,0 cannot be rejected.
- AvoidNestedBlocks: With a p-value of 0.154, the difference in this rule’s violations between the two groups does not reach statistical significance; hence, the null hypothesis H1,0 cannot be rejected. Despite some variations in median violations across different parts of the exercise, such as the treatment group exhibiting a median of 5.68 in the “Interfaces and Exception Handling” exercise compared to 15 for the control group, these differences are not statistically robust across all exercises.
| Rule | Definition |
|---|---|
| LineLength | Checks for long code lines. |
| FinalParameters | Checks that parameters for methods, constructors, catch and for-each blocks are marked final. |
| HiddenField | Checks that a local variable or a parameter does not shadow a field that is defined in the same class. |
| MissingSwitchDefault | Checks that each switch statement has a default clause. |
| DesignForExtension | Ensures that classes are structured in a way that facilitates their extension through subclassing. |
| MagicNumber | Checks that there are no “magic numbers”, i.e., a numeric literal not defined as a constant. |
| VisibilityModifier | Enforces encapsulation by requiring that public class members must be either static final, immutable or annotated by specific annotations. |
| RightCurly | Checks the placement of right curly braces (‘}’) for code blocks. |
| NeedBraces | Checks for braces around code blocks. |
| LocalVariableName | Checks that local, non-final variable names conform to Java naming conventions. |
| MethodParamPad | Evaluates the spacing between method and constructor definitions, calls, and invocations, and the opening parentheses of parameter lists. |
| AvoidNestedBlocks | Identifies instances where blocks of code are nested within other blocks. |
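To make the rules above concrete, the following made-up class (not taken from the study's exercises) shows code that complies with several of them; the rule names match the Checkstyle checks listed in the table:

```java
// Illustrative only: compliant code for several Checkstyle rules from the
// table above (FinalParameters, MagicNumber, HiddenField, NeedBraces,
// MissingSwitchDefault). The class and its logic are hypothetical examples.
public final class ConventionDemo {

    // MagicNumber: the numeric literal is a named constant, not inlined.
    private static final int MAX_RETRIES = 3;

    private int count;

    // FinalParameters: the parameter is declared final.
    // HiddenField: it is named `newCount`, so it does not shadow the field `count`.
    public void setCount(final int newCount) {
        this.count = newCount;
    }

    public int getCount() {
        return count;
    }

    // NeedBraces: every block is braced, even single statements.
    // MissingSwitchDefault: the switch has a default clause.
    public static String classify(final int retries) {
        if (retries < 0) {
            return "invalid";
        }
        switch (retries) {
            case 0:
                return "none";
            default:
                return retries <= MAX_RETRIES ? "ok" : "too many";
        }
    }
}
```

Removing the braces around the `if` body, inlining `3` instead of `MAX_RETRIES`, or renaming the parameter to `count` would each trigger the corresponding check.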
| | Rule | p-value | Control Group | | | Treatment Group | | |
|---|---|---|---|---|---|---|---|---|
| * | LineLength | < 0.005 | 32.71 | 50.3 | 54.36 | 15.91 | 16.65 | 21.18 |
| * | FinalParameters | < 0.005 | 19.22 | 28.57 | 33.24 | 10.62 | 16.23 | 17.13 |
| * | HiddenField | < 0.005 | 6.24 | 9.91 | 11.61 | 3.96 | 6.32 | 6.81 |
| * | MissingSwitchDefault | < 0.005 | 2.02 | 2.80 | 3.13 | 1.43 | 0.92 | 1.59 |
| * | DesignForExtension | < 0.005 | 14.95 | 23.51 | 25.19 | 10.94 | 18.10 | 17.54 |
| * | MagicNumber | < 0.005 | 9.48 | 15.42 | 16.28 | 7.46 | 11.42 | 12.05 |
| * | VisibilityModifier | < 0.005 | 2.20 | 2.97 | 3.49 | 2.02 | 1.69 | 2.28 |
| * | RightCurly | < 0.005 | 3.92 | 2.74 | 4.05 | 2.81 | 1.55 | 2.59 |
| * | NeedBraces | < 0.005 | 7.01 | 5.37 | 7.99 | 3.91 | 2.69 | 3.83 |
| * | LocalVariableName | < 0.005 | 1.88 | 2.37 | 2.90 | 1.33 | 1.03 | 1.60 |
| | MethodParamPad | 0.057 | 3.07 | 2.66 | 3.79 | 2.27 | 1.31 | 2.41 |
| | AvoidNestedBlocks | 0.154 | 7.49 | 8.07 | 10.76 | 5.19 | 5.57 | 6.85 |
| | Complexity Measure | | | | | | | |
| * | CyclomaticComplexity | < 0.005 | 8.67 | 9.17 | 12.2 | 4.80 | 7.27 | 8.82 |
| * | CognitiveComplexity | < 0.005 | 10.4 | 1.50 | 3.30 | 6.78 | 1.34 | 2.88 |

4.2. Most Violation-Dense Topics (RQ2)

4.3. Influence on Cyclomatic and Cognitive Complexity (RQ3)

5. Discussion
6. Threats to Validity
7. Conclusion
- Most Frequent Rule Violations: Our investigation revealed statistically significant differences in rule adherence between students using ChatGPT (treatment group) and those not using it (control group). Specifically, the treatment group showed fewer violations across several key coding conventions. Notably, rules verifying the length of code lines (LineLength), the declaration of parameters for methods, constructors, catch blocks, and for-each loops as final (FinalParameters) or verifying code extensibility (DesignForExtension) showed marked improvements, with p-values less than 0.005, indicating a statistically significant difference in adherence between the two groups.
- Most Violation-Dense Topics: We observed exercises related to “Collections”, “File IO and Streams”, and “Object-Oriented Programming” as topics with the highest incidence of rule violations. Despite a significant overall reduction in violations among the treatment group, these areas remained challenging, underscoring the need for targeted educational interventions. The rules LineLength and FinalParameters were among the most frequently violated, highlighting specific areas where ChatGPT’s guidance notably improved student performance.
- Influence on Cyclomatic and Cognitive Complexity: Our study further explored the cyclomatic and cognitive complexity of student submissions. We observed that code from the treatment group displayed lower complexity levels across both metrics, notably in cyclomatic complexity (p = 7.91e-14). These findings imply that ChatGPT contributes not merely to simplifying code complexity but also to enhancing code comprehensibility, as evidenced by the statistically significant difference in cognitive complexity (p = 6.45e-22).
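As a concrete illustration of the two metrics, consider the small, made-up method below (not taken from the study's exercises). Cyclomatic complexity [McCabe] counts decision points plus one; cognitive complexity [Campbell/SonarSource] additionally penalizes nesting, so deeply nested code scores higher cognitively even when its cyclomatic value is modest:

```java
public final class ComplexityDemo {

    // Cyclomatic complexity: 1 (method) + 1 (for) + 1 (if) + 1 (&&) = 4.
    // Cognitive complexity (SonarSource counting): for at nesting level 0 -> +1,
    // if nested inside the loop -> +2 (increment 1 plus nesting penalty 1),
    // && operator sequence -> +1, total 4. Adding another nesting level would
    // raise the cognitive score faster than the cyclomatic one.
    public static int countPositiveEven(final int[] values) {
        int count = 0;
        for (final int v : values) {
            if (v > 0 && v % 2 == 0) {
                count++;
            }
        }
        return count;
    }
}
```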
Author Contributions
Funding
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- OpenAI. OpenAI. https://openai.com/, 2024. Accessed: 2024-06-14.
- Ozkaya, I. Application of Large Language Models to Software Engineering Tasks: Opportunities, Risks, and Implications. IEEE Software 2023, 40, 4–8. [CrossRef]
- Breuker, D.M.; Derriks, J.; Brunekreef, J. Measuring static quality of student code. Proceedings of the 16th annual joint conference on Innovation and technology in computer science education; Association for Computing Machinery: New York, NY, USA, 2011; ITiCSE ’11, pp. 13–17. [CrossRef]
- Izu, C.; Mirolo, C. Exploring CS1 Student’s Notions of Code Quality. Proceedings of the 2023 Conference on Innovation and Technology in Computer Science Education V. 1; Association for Computing Machinery: New York, NY, USA, 2023; ITiCSE 2023, pp. 12–18. [CrossRef]
- Östlund, L.; Wicklund, N.; Glassey, R. It’s Never too Early to Learn About Code Quality: A Longitudinal Study of Code Quality in First-year Computer Science Students. Proceedings of the 54th ACM Technical Symposium on Computer Science Education V. 1; Association for Computing Machinery: New York, NY, USA, 2023; SIGCSE 2023, pp. 792–798. [CrossRef]
- Brown, N.C.; Altadmri, A. Investigating novice programming mistakes: educator beliefs vs. student data. Proceedings of the tenth annual conference on International computing education research; Association for Computing Machinery: New York, NY, USA, 2014; ICER ’14, pp. 43–50. [CrossRef]
- De Ruvo, G.; Tempero, E.; Luxton-Reilly, A.; Rowe, G.B.; Giacaman, N. Understanding semantic style by analysing student code. Proceedings of the 20th Australasian Computing Education Conference; Association for Computing Machinery: New York, NY, USA, 2018; ACE ’18, pp. 73–82. [CrossRef]
- checkstyle. checkstyle – Checkstyle 10.14.1. https://checkstyle.sourceforge.io/, 2024. Accessed: 2024-06-14.
- PMD. PMD. https://pmd.github.io/, 2024. Accessed: 2024-06-14.
- Oracle. Code Conventions for the Java Programming Language: Contents. https://www.oracle.com/java/technologies/javase/codeconventions-contents.html, 2024. Accessed: 2024-06-14.
- Li, K.; Hong, S.; Fu, C.; Zhang, Y.; Liu, M. Discriminating Human-authored from ChatGPT-Generated Code Via Discernable Feature Analysis. 2023 IEEE 34th International Symposium on Software Reliability Engineering Workshops (ISSREW), 2023, pp. 120–127. [CrossRef]
- Guo, Q.; Cao, J.; Xie, X.; Liu, S.; Li, X.; Chen, B.; Peng, X. Exploring the Potential of ChatGPT in Automated Code Refinement: An Empirical Study. Proceedings of the 46th IEEE/ACM International Conference on Software Engineering; Association for Computing Machinery: New York, NY, USA, 2024; ICSE ’24, pp. 1–13. [CrossRef]
- Liu, Y.; Le-Cong, T.; Widyasari, R.; Tantithamthavorn, C.; Li, L.; Le, X.B.D.; Lo, D. Refining ChatGPT-Generated Code: Characterizing and Mitigating Code Quality Issues. ACM Transactions on Software Engineering and Methodology 2024. Just Accepted. [CrossRef]
- Börstler, J.; Störrle, H.; Toll, D.; van Assema, J.; Duran, R.; Hooshangi, S.; Jeuring, J.; Keuning, H.; Kleiner, C.; MacKellar, B. "I know it when I see it" Perceptions of Code Quality: ITiCSE ’17 Working Group Report. Proceedings of the 2017 ITiCSE Conference on Working Group Reports; Association for Computing Machinery: New York, NY, USA, 2018; ITiCSE-WGR ’17, pp. 70–85. [CrossRef]
- Brown, N.C.C.; Weill-Tessier, P.; Sekula, M.; Costache, A.L.; Kölling, M. Novice Use of the Java Programming Language. ACM Transactions on Computing Education 2022, 23, 10:1–10:24. [CrossRef]
- Karnalim, O.; Simon.; Chivers, W. Work-In-Progress: Code Quality Issues of Computing Undergraduates. 2022 IEEE Global Engineering Education Conference (EDUCON), 2022, pp. 1734–1736. [CrossRef]
- Keuning, H.; Heeren, B.; Jeuring, J. Code Quality Issues in Student Programs. Proceedings of the 2017 ACM Conference on Innovation and Technology in Computer Science Education; Association for Computing Machinery: New York, NY, USA, 2017; ITiCSE ’17, pp. 110–115. [CrossRef]
- Keuning, H.; Jeuring, J.; Heeren, B. A Systematic Mapping Study of Code Quality in Education. Proceedings of the 2023 Conference on Innovation and Technology in Computer Science Education V. 1; Association for Computing Machinery: New York, NY, USA, 2023; ITiCSE 2023, pp. 5–11. [CrossRef]
- Brown, N.C.C.; Altadmri, A. Novice Java Programming Mistakes: Large-Scale Data vs. Educator Beliefs. ACM Transactions on Computing Education 2017, 17, 7:1–7:21. [CrossRef]
- Kazemitabaar, M.; Chow, J.; Ma, C.K.T.; Ericson, B.J.; Weintrop, D.; Grossman, T. Studying the effect of AI Code Generators on Supporting Novice Learners in Introductory Programming. Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems; Schmidt, A.; Väänänen, K.; Goyal, T.; Kristensson, P.O.; Peters, A.; Mueller, S.; Williamson, J., Eds.; Association for Computing Machinery: New York, NY, USA, 2023; CHI ’23, pp. 1–23. [CrossRef]
- Jacques, L. Teaching CS-101 at the Dawn of ChatGPT. ACM Inroads 2023, 14, 40–46. [CrossRef]
- Daun, M.; Brings, J. How ChatGPT Will Change Software Engineering Education. Proceedings of the 2023 Conference on Innovation and Technology in Computer Science Education V. 1; Laakso, M.J.; Monga, M., Eds.; Association for Computing Machinery: New York, NY, USA, 2023; ITiCSE 2023, pp. 110–116. [CrossRef]
- checkstyle. Checks. https://checkstyle.sourceforge.io/checks.html, 2024. Accessed: 2024-06-14.
- checkstyle. Sun’s Java Style. https://checkstyle.sourceforge.io/sun_style.html, 2024. Accessed: 2024-06-14.
- checkstyle. checkstyle/checkstyle. https://github.com/checkstyle/checkstyle/blob/master/src/main/resources/sun_checks.xml, 2024. Accessed: 2024-06-14.
- McCabe, T. A Complexity Measure. IEEE Transactions on Software Engineering 1976, SE-2, 308–320. [CrossRef]
- Campbell, G.A. Cognitive complexity: an overview and evaluation. Proceedings of the 2018 International Conference on Technical Debt; Association for Computing Machinery: New York, NY, USA, 2018; TechDebt ’18, pp. 57–58. [CrossRef]
- SonarSource. Code Quality Tool & Secure Analysis with SonarQube. https://www.sonarsource.com/products/sonarqube/, 2024. Accessed: 2024-06-14.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).