Submitted:
03 March 2026
Posted:
04 March 2026
Abstract

Keywords:
1. Introduction
- First, we present a formal architecture for multi-agent orchestration that enforces a student-first workflow. In this workflow, the system always starts from the student’s own attempt and focuses on debugging and scaffolding rather than full solution generation.
- Second, we conduct an ablation study that compares this architecture with standard single-model baselines. We focus in particular on false-positive validations, in which an AI tutor incorrectly accepts faulty code.
- Third, we introduce a multi-model evaluation protocol. In this protocol, different base models occupy different roles in The Code Council, and we measure how these assignments affect correctness, stability, and quality of feedback.
- Fourth, we design experiments that evaluate the framework in two modes: a Blind mode, in which agents solve problems without external ground truth, and a Guided mode, in which they can access a reference solution.
2. Related Work
2.1. Large Language Models in Education
2.2. LLM Use and Student Learning Outcomes in Programming Education
2.3. LLMs for Programming Assistance and Program Repair
2.4. Positioning and Comparative Analysis with State-of-the-Art Educational AI Systems
- What this work adds beyond prior systems.
3. Methodology
3.1. Agent-Based Tutoring Architecture
- Architect. Receives the problem statement and the buggy program and proposes a candidate patch. The Architect focuses on code-level reasoning and aims for minimal changes that fix the bug.
- Skeptic. Receives the problem, the buggy program and the Architect’s patch and acts as an adversarial validator. The Skeptic checks logical correctness and edge cases and either accepts the patch or rejects it with a critique.
- Secretary. Maintains the internal state of the tutoring process. The Secretary routes messages between agents and turns the Skeptic’s critiques into concrete revision requests for the Architect when needed.
- Pedagogue. Receives the buggy program and a verified internal patch and performs a gap analysis. The Pedagogue identifies the main misconceptions and designs a teaching plan without exposing a full corrected program.
- Mentor. Receives the teaching plan and turns it into a student-facing hint. The Mentor controls tone, length and phrasing and follows a scaffolding policy that encourages the student to reason about the bug.
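The information boundaries between roles can be made explicit in code. The following is a minimal sketch, not the authors' implementation: the field names (`problem`, `buggy_code`, `candidate_patch`, `verified_patch`, `teaching_plan`) and the helper `view_for` are hypothetical, and the visibility sets simply restate the role descriptions above. The Secretary is omitted because it maintains the shared state itself.

```python
# Hypothetical field names; each set restates which inputs a role receives.
ROLE_INPUTS = {
    "architect": {"problem", "buggy_code"},
    "skeptic": {"problem", "buggy_code", "candidate_patch"},
    "pedagogue": {"buggy_code", "verified_patch"},
    "mentor": {"teaching_plan"},
}


def view_for(role: str, state: dict) -> dict:
    """Return only the parts of the shared state that `role` is meant to see."""
    allowed = ROLE_INPUTS[role]
    return {key: value for key, value in state.items() if key in allowed}


# Example shared state, as the Secretary might hold it late in an episode.
state = {
    "problem": "Print the sum of 0..n.",
    "buggy_code": "for i in range(n): total += i",
    "candidate_patch": "for i in range(n + 1): total += i",
    "verified_patch": "for i in range(n + 1): total += i",
    "teaching_plan": "Ask which values the loop variable takes.",
}
mentor_view = view_for("mentor", state)
```

In this sketch the Mentor never receives the verified patch, which mirrors the intent that student-facing text is written from the teaching plan alone.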
3.2. Tutor Configurations
- Single-model tutors
  - The model receives the problem text and the buggy program and generates a single patched program. We do not allow retries, self-reflection loops, majority voting, or any extra calls for that case. Patch success for single models therefore reflects first-shot repair ability.
  - The same model then receives the problem, the buggy code and its own patch and generates one student-facing hint. The hint prompt asks the model to focus on the main bug, avoid full corrected programs, and adopt a scaffolding style.
- Multi-agent councils
  - The Architect receives the problem and the buggy code and proposes a candidate patch.
  - The Skeptic receives the problem, the buggy code and the candidate patch, checks the logic and either accepts the patch as provisionally correct or rejects it with a critique.
  - If the Skeptic rejects the patch, the Secretary turns the critique into a new, more precise request for the Architect. The Architect then proposes a revised patch. This Architect–Skeptic loop can repeat.
  - We give the Architect and the Skeptic up to five attempts to reach agreement. One attempt is a full cycle in which the Architect proposes a patch and the Skeptic evaluates it. If, after five attempts, the Skeptic still rejects all candidate patches, the council marks the case as no agreed patch. For technical metrics this counts as patch failure. In that situation, the Pedagogue and Mentor still produce a cautious, high-level hint that encourages rethinking the algorithm rather than specific line-by-line edits.
  - As soon as the Skeptic accepts a patch within the five-attempt limit, the Pedagogue receives the buggy code and the verified patch and designs a teaching plan that focuses on the main misconception.
  - The Mentor transforms this teaching plan into a short, friendly hint for the student, following scaffolding constraints.
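The council control flow described above can be sketched as a bounded loop. This is an illustrative sketch only: the function `run_council`, the `Verdict` and `CouncilResult` types, and the toy stand-ins for the model calls are all hypothetical. Only the five-attempt limit and the order of the roles come from the text.

```python
from dataclasses import dataclass
from typing import Optional

MAX_ATTEMPTS = 5  # attempt limit stated in the text


@dataclass
class Verdict:
    accepted: bool
    critique: str = ""


@dataclass
class CouncilResult:
    patch: Optional[str]  # None means "no agreed patch" (counted as patch failure)
    attempts: int
    hint: str


def run_council(problem, buggy_code, architect, skeptic, secretary, pedagogue, mentor):
    """Run the Architect-Skeptic loop, then the pedagogical phase."""
    request = "Propose a minimal patch for the bug."
    accepted_patch = None
    attempts = 0
    while attempts < MAX_ATTEMPTS and accepted_patch is None:
        attempts += 1
        patch = architect(problem, buggy_code, request)
        verdict = skeptic(problem, buggy_code, patch)
        if verdict.accepted:
            accepted_patch = patch
        else:
            # The Secretary turns the critique into a revision request.
            request = secretary(verdict.critique)
    # A hint is produced even when no patch was agreed on.
    plan = pedagogue(buggy_code, accepted_patch)
    return CouncilResult(accepted_patch, attempts, mentor(plan))


# Toy stand-ins for the model calls (assumptions, for illustration only).
def toy_architect(problem, buggy_code, request):
    return "range(n + 1)" if "bounds" in request else "range(n)"


def toy_skeptic(problem, buggy_code, patch):
    if patch == "range(n + 1)":
        return Verdict(True)
    return Verdict(False, "loop bounds miss the last element")


def toy_secretary(critique):
    return f"Revise the patch: {critique}."


def toy_pedagogue(buggy_code, patch):
    return "cautious high-level plan" if patch is None else "plan: loop bounds"


def toy_mentor(plan):
    return f"Hint ({plan})"


result = run_council("sum 0..n", "range(n)", toy_architect, toy_skeptic,
                     toy_secretary, toy_pedagogue, toy_mentor)
```

With these stand-ins the first patch is rejected and the second is accepted, so `result.attempts` is 2. A Skeptic that always rejects would exhaust the five attempts and yield `patch=None` together with the cautious hint.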
3.3. Operating Modes: Blind and Guided
3.4. Experimental Procedure
1. We construct a debugging case from archival data, consisting of the problem text, the buggy program and the external verdict. A corresponding correct reference solution is stored separately and used only for internal execution checks and Guided-mode comparisons.
2. For each tutor configuration we run one complete debugging episode in Blind mode and one in Guided mode.
   - For a single-model tutor, the base model produces one patch in a single call (one-shot) and then one hint in a second call.
   - For a council, the Architect and Skeptic may revise the patch through the internal loop managed by the Secretary, with at most five Architect–Skeptic attempts per case. If the Skeptic accepts a patch, it becomes the council’s internal repair. If no agreement is reached within five attempts, the case is marked as technical failure but a conservative hint is still generated.
3. We record the internal patch and measure its similarity to the buggy code using a normalised token-level edit distance. We also run the patch on a test harness compatible with the reference judge and record whether it passes all official tests.
4. We store the student-facing hint and apply automatic analyses for localization, solution leakage, hint form and scaffolding level as defined in Section 4.
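The normalised token-level edit distance used in the procedure above can be illustrated as follows. This is a sketch under stated assumptions: the regular-expression tokeniser and the normalisation by the longer token sequence are choices made here for illustration, since the text does not specify either.

```python
import re


def tokenize(code: str) -> list[str]:
    """Split code into identifiers, integer literals and single symbols (assumed tokeniser)."""
    return re.findall(r"[A-Za-z_]\w*|\d+|\S", code)


def token_edit_distance(a: list[str], b: list[str]) -> int:
    """Levenshtein distance over token sequences, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, tok_a in enumerate(a, start=1):
        curr = [i]
        for j, tok_b in enumerate(b, start=1):
            cost = 0 if tok_a == tok_b else 1
            curr.append(min(prev[j] + 1,          # delete tok_a
                            curr[j - 1] + 1,      # insert tok_b
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def relative_patch_size(buggy: str, patch: str) -> float:
    """Edit distance divided by the longer token count (assumed normalisation)."""
    a, b = tokenize(buggy), tokenize(patch)
    if not a and not b:
        return 0.0
    return token_edit_distance(a, b) / max(len(a), len(b))


size = relative_patch_size("for i in range(n):", "for i in range(n + 1):")
```

Here the patch inserts two tokens into a ten-token line, so `size` is 0.2. Smaller values indicate patches that stay closer to the student's own code.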
4. Evaluation Framework
4.1. Dimension A: Technical Integrity of Patches
- Patch success rate
- False positive rate
- Relative patch size
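As an illustration of how the first two metrics could be computed, the sketch below assumes that each case records whether the tutor presented a patch as correct and whether that patch passes all official tests. The `CaseOutcome` type and the choice of all cases as the denominator are assumptions, not definitions taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class CaseOutcome:
    accepted: bool      # the tutor presented a patch as correct
    passes_tests: bool  # the patch passes all official tests


def patch_success_rate(outcomes: list[CaseOutcome]) -> float:
    """Share of cases with an accepted patch that passes all tests."""
    return sum(o.accepted and o.passes_tests for o in outcomes) / len(outcomes)


def false_positive_rate(outcomes: list[CaseOutcome]) -> float:
    """Share of cases where a faulty patch was accepted (denominator: all cases, assumed)."""
    return sum(o.accepted and not o.passes_tests for o in outcomes) / len(outcomes)


outcomes = [
    CaseOutcome(True, True),
    CaseOutcome(True, True),
    CaseOutcome(True, False),   # false positive: accepted but wrong
    CaseOutcome(False, False),  # no agreed patch
]
success = patch_success_rate(outcomes)
false_pos = false_positive_rate(outcomes)
```

Under this reading the two rates need not sum to one, because cases with no agreed patch count as failures without being false positives.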
4.2. Dimension B: Hint Localisation, Solution Leakage, and Form
- Localisation
- Localisation precision: the proportion of locations mentioned in the hint that fall inside the true error region;
- Localisation recall: the proportion of true error locations that the hint mentions.
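The two localisation measures can be written as set operations. The sketch below assumes that both the locations mentioned in a hint and the true error region are available as sets of line numbers; how locations are extracted from a hint is outside its scope.

```python
def localisation_scores(hint_lines: set[int], error_lines: set[int]) -> tuple[float, float]:
    """Return (precision, recall) of hinted locations against the true error region."""
    hits = len(hint_lines & error_lines)
    precision = hits / len(hint_lines) if hint_lines else 0.0
    recall = hits / len(error_lines) if error_lines else 0.0
    return precision, recall


# The hint mentions lines 3, 4 and 9; the bug spans lines 4 and 5.
precision, recall = localisation_scores({3, 4, 9}, {4, 5})
```

In this example one of three hinted lines is correct (precision 1/3) and one of two error lines is covered (recall 0.5).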
- Solution leakage
- A code leakage detector that flags complete programs, long corrected code blocks, and line-by-line rewritten versions of the student’s program;
- A text leakage detector that flags explanations that describe the algorithm step-by-step in a way that can be translated into code almost mechanically.
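One simple way to approximate the code leakage check is to measure how many lines of the internal patch reappear verbatim in the hint. This is a hypothetical heuristic and not the detector used in the study; the function names and the 0.6 threshold are assumptions chosen for illustration.

```python
def leaked_line_fraction(hint: str, patch: str) -> float:
    """Fraction of non-empty patch lines that appear verbatim as lines of the hint."""
    patch_lines = [line.strip() for line in patch.splitlines() if line.strip()]
    hint_lines = {line.strip() for line in hint.splitlines()}
    if not patch_lines:
        return 0.0
    return sum(line in hint_lines for line in patch_lines) / len(patch_lines)


def flags_code_leakage(hint: str, patch: str, threshold: float = 0.6) -> bool:
    """Flag a hint that reproduces most of the patch (threshold is an assumption)."""
    return leaked_line_fraction(hint, patch) >= threshold


patch = "total = 0\nfor i in range(n + 1):\n    total += i"
leaky_hint = "Try this:\nfor i in range(n + 1):\n    total += i"
safe_hint = "Which values does your loop variable take? Is n included?"
```

A verbatim match misses paraphrased or reformatted code, so a check of this kind would only complement the text leakage detector and a judge model.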
- Hint form
4.3. Dimension C: Scaffolding Behaviour from a Judge Model
- Full solution: the hint exposes a complete program, a line-by-line corrected version of the student code, or a step-by-step natural language description that reconstructs the algorithm with almost no additional reasoning from the student;
- Heavy hint: the hint does not fully reveal the solution but gives specific steps, formulas or code fragments that leave only routine translation work;
- Conceptual guidance: the hint focuses on ideas, input/output conditions, typical edge cases or high-level control flow, and leaves construction of the exact code to the student.
4.4. Dimension D: Expert Ratings of Hint Quality
4.5. Dimension E: Runtime, Resource Usage, and Stability
- Per-call latency for internal steps (Architect, Skeptic, Pedagogue, Mentor);
- Per-interaction latency from the initial request to the final hint;
- The number of prompt and completion tokens per interaction (median and 90th percentile).
- Patch stability: the fraction of debugging cases in this subset where all internal patches are identical across runs;
- Hint stability: the average semantic similarity (for example, cosine similarity of embeddings) between hints for the same case across runs.
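The token statistics and the two stability measures can be sketched with the standard library alone. Several details here are assumptions: the "inclusive" quantile method, averaging pairwise similarities within a case before averaging over cases, and the input layout (dictionaries keyed by case identifier). Embeddings are taken as given vectors; the sketch does not name an embedding model.

```python
import math
import statistics
from itertools import combinations


def token_summary(tokens_per_interaction: list[int]) -> dict[str, float]:
    """Median and 90th percentile of tokens per interaction."""
    deciles = statistics.quantiles(tokens_per_interaction, n=10, method="inclusive")
    return {"median": statistics.median(tokens_per_interaction), "p90": deciles[8]}


def patch_stability(patches_by_case: dict[str, list[str]]) -> float:
    """Fraction of cases whose internal patches are identical across runs."""
    stable = sum(len(set(patches)) == 1 for patches in patches_by_case.values())
    return stable / len(patches_by_case)


def cosine(u: list[float], v: list[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def hint_stability(embeddings_by_case: dict[str, list[list[float]]]) -> float:
    """Mean over cases of the mean pairwise cosine similarity between runs."""
    per_case = []
    for vectors in embeddings_by_case.values():
        pairs = list(combinations(vectors, 2))
        if pairs:
            per_case.append(sum(cosine(u, v) for u, v in pairs) / len(pairs))
    return sum(per_case) / len(per_case) if per_case else 0.0


summary = token_summary(list(range(1, 12)))
p_stab = patch_stability({"case_a": ["p", "p", "p"], "case_b": ["p", "q", "p"]})
h_stab = hint_stability({"case_a": [[1.0, 0.0], [1.0, 0.0]],
                         "case_b": [[1.0, 0.0], [0.0, 1.0]]})
```

In the toy data one of two cases has identical patches, and the two cases have pairwise similarities of 1.0 and 0.0, so both stability values are 0.5.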
4.6. Dimension F: Difficulty Sensitivity and Consistency
- Hint consistency: whether students in the same cluster receive similar advice;
- Patch pattern consistency: whether internal patches in the cluster address the same root cause.
5. Results
5.1. Dataset and Tracks
- Course track (10 problems). Problems that resemble exercises in an introductory programming course: basic input/output, arithmetic expressions, conditionals, loops, strings, and simple arrays. The set of foundational course-track problems selected for evaluation is presented in Table 2.
- Challenge track (10 problems). Harder problems inspired by International Collegiate Programming Contest (ICPC) tasks, with more complex control flow, data structures, and edge-case reasoning (for example, weighted union–find, shortest paths, or dynamic programming). Table 3 summarizes the challenge-track problems used in the evaluation.
5.2. Single-Model Tutors: Technical Integrity
5.3. Single-Model Tutors: Hint Localisation, Leakage, and Scaffolding
5.4. Single-Model Tutors: Runtime, Stability, and Expert Ratings
5.5. From Single-Model Results to Council Design
- GPT-4o is the strongest all-rounder, with the best challenge-track patch success and fluent explanations.
- DeepSeek v3.2 is extremely capable at low-level code repair but tends to rewrite more code and leak more solutions when used directly for hints.
- Claude 4.5 produces the most conservative hints, with lower leakage and more conceptual guidance, at a small cost in challenge-track patch success relative to GPT-4o.
- Gemini 2.0 is a solid generalist with good long-context handling but does not dominate technical metrics.
5.6. Council Configurations
- Council–GPT. GPT-4o plays all five roles. This isolates the effect of role separation and internal negotiation, since the base model matches the GPT-4o single-model tutor.
- Council–CodeFirst. DeepSeek v3.2 acts as Architect and Skeptic, Gemini 2.0 as Secretary, and GPT-4o as both Pedagogue and Mentor. DeepSeek v3.2 focuses on code repair and validation, Gemini 2.0 handles routing and state, and GPT-4o crafts and delivers hints.
- Council–Conservative. GPT-4o acts as Architect, DeepSeek v3.2 as Skeptic, Gemini 2.0 as Secretary, and Claude 4.5 as both Pedagogue and Mentor. Here DeepSeek v3.2 probes code more aggressively, while Claude 4.5 enforces a cautious scaffolding style.
- Council–HybridStrong. DeepSeek v3.2 acts as Architect and Skeptic, Gemini 2.0 as Secretary, Claude 4.5 as Pedagogue, and GPT-4o as Mentor. This configuration places the strongest code model on both generation and validation, uses Gemini 2.0 for long-context orchestration, and splits pedagogical responsibilities between Claude 4.5 (design) and GPT-4o (delivery).
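The four role assignments can be summarised as a configuration mapping. The dictionary layout and the helper `models_used` are illustrative assumptions; the role-to-model assignments restate the descriptions above.

```python
ROLES = ("architect", "skeptic", "secretary", "pedagogue", "mentor")

# Role-to-model assignments as described for each council.
COUNCILS = {
    "Council–GPT": dict.fromkeys(ROLES, "GPT-4o"),
    "Council–CodeFirst": {
        "architect": "DeepSeek v3.2",
        "skeptic": "DeepSeek v3.2",
        "secretary": "Gemini 2.0",
        "pedagogue": "GPT-4o",
        "mentor": "GPT-4o",
    },
    "Council–Conservative": {
        "architect": "GPT-4o",
        "skeptic": "DeepSeek v3.2",
        "secretary": "Gemini 2.0",
        "pedagogue": "Claude 4.5",
        "mentor": "Claude 4.5",
    },
    "Council–HybridStrong": {
        "architect": "DeepSeek v3.2",
        "skeptic": "DeepSeek v3.2",
        "secretary": "Gemini 2.0",
        "pedagogue": "Claude 4.5",
        "mentor": "GPT-4o",
    },
}


def models_used(council: str) -> set[str]:
    """Distinct base models that appear in a council."""
    return set(COUNCILS[council].values())
```

Council–GPT uses a single base model, while Council–HybridStrong is the only configuration that draws on all four.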
5.7. Councils: Technical Integrity in Blind Mode
5.8. Councils: Hint Quality and Scaffolding in Blind Mode
5.9. Councils: Runtime, Stability, and Expert Ratings
5.10. Guided Mode and Prompt Revision
- Only the Skeptic and internal checking routines saw the reference solution; Architect, Pedagogue, and Mentor were intended to remain blind;
- The Secretary simply routed messages, passing the Skeptic’s critiques verbatim to other roles;
- Prompts for Pedagogue and Mentor mentioned non-leakage but did not explicitly constrain how ground-truth-based feedback should be used.
- we ensured that both Architect and Pedagogue remain blind to the reference solution throughout, relying only on the buggy code, the problem statement, and abstract feedback from the Skeptic;
- we upgraded the Secretary into a sanitiser in Guided mode: when forwarding Skeptic critiques, it removes specific code snippets, identifiers, or algorithmic blueprints, preserving only high-level error descriptions (for example, “off-by-one in loop bounds” rather than “use a DP table of size ”);
- we tightened Pedagogue and Mentor prompts to emphasise that internal correctness signals must be treated as yes/no judgements on candidate reasoning, not as a source of code or step-by-step solutions to be quoted or paraphrased.
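The sanitising step can be illustrated with a rule-based stand-in. In the paper the Secretary is a model role, so the regular expressions below are only a crude, hypothetical approximation: `INLINE_CODE`, `CODE_LIKE_LINE`, and `sanitise_critique` are names introduced here, and the patterns will also drop some ordinary sentences that happen to look like code.

```python
import re

# Inline code spans written between single backticks.
INLINE_CODE = re.compile(r"`[^`\n]+`")
# Lines that start with a common keyword and end like a statement (rough heuristic).
CODE_LIKE_LINE = re.compile(r"^\s*(?:for|while|if|def|return|int|#include)\b.*[;:{)]\s*$")


def sanitise_critique(critique: str) -> str:
    """Drop code-like lines and mask inline code, keeping the prose description."""
    kept = []
    for line in critique.splitlines():
        if CODE_LIKE_LINE.match(line):
            continue
        kept.append(INLINE_CODE.sub("[code removed]", line))
    return "\n".join(kept).strip()


critique = (
    "Off-by-one in loop bounds: `i <= n` overruns the array.\n"
    "for (int i = 0; i < n; i++) {\n"
    "Check the last iteration."
)
clean = sanitise_critique(critique)
```

The high-level description ("Off-by-one in loop bounds") survives while the concrete condition and the loop header are removed, which is the behaviour the revised Guided mode asks of the Secretary.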
5.11. Trap Resistance in Guided Mode
- Trap resistance: the final hint remains aligned with the true specification and does not steer the student toward the poisoned logic;
- Trap compliance: the hint explicitly pushes the student toward the poisoned algorithm or its distinctive wrong pattern;
- Trap collapse: the council fails to converge to coherent advice (for example, cycles through conflicting internal patches and returns a very generic or self-contradictory hint).
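Once each trap episode has been assigned one of the three outcomes, the reported percentages follow from a simple count. The label strings and the function name below are assumptions made for this sketch.

```python
from collections import Counter

OUTCOMES = ("resistance", "compliance", "collapse")


def trap_breakdown(labels: list[str]) -> dict[str, float]:
    """Percentage of trap episodes per outcome."""
    counts = Counter(labels)
    total = len(labels)
    return {outcome: 100.0 * counts[outcome] / total for outcome in OUTCOMES}


labels = ["resistance"] * 5 + ["compliance"] * 3 + ["collapse"] * 2
breakdown = trap_breakdown(labels)
```

Because every episode receives exactly one label, the three percentages sum to 100 for each configuration.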
6. Discussion
6.1. Single-Model Tutors and Councils in Comparison
6.2. Guided Mode and the Risk of Leakage
6.3. Trap Experiments and Fail-Safe Behaviour
6.4. Hallucinations and Misaligned Confidence
6.5. Practical Trade-Offs and Implications for Tutors
7. Limitations and Threats to Validity
7.1. Internal Validity: Inference Budget and Orchestration Effects
7.2. Construct Validity: Measuring Scaffolding and Educational Value
7.3. Conclusion Validity: Non-Determinism and Stability Estimation
7.4. Guided Mode Leakage Pathways and Specification Limits
7.5. External Validity: Platform, Language, and Benchmark Origin
7.6. Operational Reproducibility Under Evolving APIs
8. Conclusion
References
- Mutanga, M.B.; Msane, J.; Mndaweni, T.N.; Hlongwane, B.B.; Ngcobo, N.Z. Exploring the Impact of LLM Prompting on Students’ Learning. Trends in Higher Education 2025, 4. [CrossRef]
- Korpimies, K.; Laaksonen, A.; Luukkainen, M. Unrestricted Use of LLMs in a Software Project Course: Student Perceptions on Learning and Impact on Course Performance. In Proceedings of the 24th Koli Calling International Conference on Computing Education Research, New York, NY, USA, 2024; Koli Calling ’24. [CrossRef]
- Turková, K.; Krásničan, V.; Prázová, I.; Turčínek, P.; Foltýnek, T. Adapting to the future: the use of AI tools and applications in university education and a call for transparent rules and guidelines. International Journal for Educational Integrity 2025, 21, 29. [CrossRef]
- Lyu, M.R.; Ray, B.; Roychoudhury, A.; Tan, S.H.; Thongtanunam, P. Automatic Programming: Large Language Models and Beyond. ACM Trans. Softw. Eng. Methodol. 2025, 34. [CrossRef]
- Hall, O. Expanding Bloom’s Two-Sigma Tutoring Theory Using Intelligent Agents. International Journal of Knowledge-Based Organizations 2018, 8, 28–46. [CrossRef]
- Hasanein, A.M.; Sobaih, A.E.E. Drivers and Consequences of ChatGPT Use in Higher Education: Key Stakeholder Perspectives. European Journal of Investigation in Health, Psychology and Education 2023, 13, 2599–2614. [CrossRef]
- Huo, H.; Ding, X.; Guo, Z.; Shen, S.; Ye, D.; Pham, O.; Milne, D.; Mathieson, L.; Gardner, A. Accelerating Learning with AI: Improving Students’ Capability to Receive and Build Automated Feedback for Programming Courses. In Proceedings of the 2024 World Engineering Education Forum - Global Engineering Deans Council (WEEF-GEDC), 2024, pp. 1–9. [CrossRef]
- Santos, E.A.; Salmon, A.; Hammer, K. It’s Dangerous to Prompt Alone! Exploring How Fine-Tuning GPT-4o Affects Novices’ Programming Error Resolution. IEEE Access 2025, 13, 166943–166958. [CrossRef]
- Papavlasopoulou, S.; Giannakos, M.N.; Jaccheri, L. Exploring children’s learning experience in constructionism-based coding activities through design-based research. Computers in Human Behavior 2019, 99, 415–427. [CrossRef]
- Dahl, J.E.; Mørch, A. A theoretical and empirical analysis of tensions between learning objects and constructivism. Education and Information Technologies 2025, 30, 22101–22150. [CrossRef]
- Groothuijsen, S.; van den Beemt, A.; Remmers, J.C.; van Meeuwen, L.W. AI chatbots in programming education: Students’ use in a scientific computing course and consequences for learning. Computers and Education: Artificial Intelligence 2024, 7, 100290. [CrossRef]
- Bower, M.; Torrington, J.; Lai, J.W.M.; Petocz, P.; Alfano, M. How should we change teaching and assessment in response to increasingly powerful generative Artificial Intelligence? Outcomes of the ChatGPT teacher survey. Education and Information Technologies 2024, 29, 15403–15439. [CrossRef]
- Bittle, K.; El-Gayar, O. Generative AI and Academic Integrity in Higher Education: A Systematic Review and Research Agenda. Information 2025, 16. [CrossRef]
- Pwanedo Amos, J.; Ahmed Amodu, O.; Azlina Raja Mahmood, R.; Bolakale Abdulqudus, A.; Zakaria, A.F.; Rhoda Iyanda, A.; Ali Bukar, U.; Mohd Hanapi, Z. A Bibliometric Exposition and Review on Leveraging LLMs for Programming Education. IEEE Access 2025, 13, 58364–58393. [CrossRef]
- Bo, J.Y.; Wan, S.; Anderson, A. To Rely or Not to Rely? Evaluating Interventions for Appropriate Reliance on Large Language Models. In Proceedings of the 2025 CHI Conference on Human Factors in Computing Systems, New York, NY, USA, 2025; CHI ’25. [CrossRef]
- Cai, Y.; Hou, Z.; Sanan, D.; Luan, X.; Lin, Y.; Sun, J.; Dong, J.S. Automated Program Refinement: Guide and Verify Code Large Language Model with Refinement Calculus. Proc. ACM Program. Lang. 2025, 9. [CrossRef]
- Kazemitabaar, M.; Ye, R.; Wang, X.; Henley, A.Z.; Denny, P.; Craig, M.; Grossman, T. CodeAid: Evaluating a Classroom Deployment of an LLM-based Programming Assistant that Balances Student and Educator Needs. In Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems, New York, NY, USA, 2024; CHI ’24. [CrossRef]
- Lee, D.K.; Joe, I. A GPT-Based Code Review System With Accurate Feedback for Programming Education. IEEE Access 2025, 13, 105724–105737. [CrossRef]
- Zucchelli, M.M.; Matteucci Armandi Avogli Trotti, N.; Pavan, A.; Piccardi, L.; Nori, R. The Dual Process model: the effect of cognitive load on the ascription of intentionality. Frontiers in Psychology 2025, 16, 1451590. [CrossRef]
- Mienye, I.D.; Sun, Y. A Survey of Ensemble Learning: Concepts, Algorithms, Applications, and Prospects. IEEE Access 2022, 10, 99129–99149.
- Vallecillos Ruiz, F.; Hort, M.; Moonen, L. The Art of Repair: Optimizing Iterative Program Repair with Instruction-Tuned Models. In Proceedings of the 29th International Conference on Evaluation and Assessment in Software Engineering, New York, NY, USA, 2025; EASE ’25, pp. 500–511. [CrossRef]
- Long, Y.; Luo, H.; Zhang, Y. Evaluating large language models in analysing classroom dialogue. NPJ Science of Learning 2024, 9, 60. [CrossRef]
- Noveski, G.; Jeroncic, M.; Velard, T.; Kocuvan, P.; Gams, M. Comparison of Large Language Models in Generating Machine Learning Curricula in High Schools. Electronics 2024, 13. [CrossRef]
- Shahzad, T.; Mazhar, T.; Tariq, M.U.; Ahmad, W.; Ouahada, K.; Hamam, H. A comprehensive review of large language models: issues and solutions in learning environments. Discover Sustainability 2025, 6, 27. [CrossRef]
- Haruto, S.; Nnadi, L.; Yutaka, W. Systematic Review of Large Language Model Applications in Programming Education, 2025. [CrossRef]
- Vargas, H.J.; Pereira, R.L.C.S.; de Moraes, A.F.; Silva, L.A. Evaluating Generative AI in Education: LLM as English Tutor. In Proceedings of the New Trends in Disruptive Technologies, Tech Ethics and Artificial Intelligence; de la Iglesia, D.H.; de Paz Santana, J.F.; López Rivero, A.J., Eds., Cham, 2025; pp. 447–458.
- Jošt, G.; Taneski, V.; Karakatič, S. The Impact of Large Language Models on Programming Education and Student Learning Outcomes. Applied Sciences 2024, 14. [CrossRef]
- Lai, C.H.; Lin, C.Y. Analysis of Learning Behaviors and Outcomes for Students with Different Knowledge Levels: A Case Study of Intelligent Tutoring System for Coding and Learning (ITS-CAL). Applied Sciences 2025, 15. [CrossRef]
- Cowan, B.; Watanobe, Y.; Shirafuji, A. Enhancing Programming Learning with LLMs: Prompt Engineering and Flipped Interaction. In Proceedings of the 2023 4th Asia Service Sciences and Software Engineering Conference, New York, NY, USA, 2024; ASSE ’23, pp. 10–16. [CrossRef]
- Mueller, M.; List, C.; Kipp, M. The Power of Context: An LLM-based Programming Tutor with Focused and Proactive Feedback. In Proceedings of the 6th European Conference on Software Engineering Education, New York, NY, USA, 2025; ECSEE ’25, pp. 1–10. [CrossRef]
- Bucaioni, A.; Ekedahl, H.; Helander, V.; Nguyen, P.T. Programming with ChatGPT: How far can we go? Machine Learning with Applications 2024, 15, 100526. [CrossRef]
- Jiang, Q.; Gao, Z.; Karniadakis, G.E. DeepSeek vs. ChatGPT vs. Claude: A comparative study for scientific computing and scientific machine learning tasks. Theoretical and Applied Mechanics Letters 2025, 15, 100583. [CrossRef]
- Li, Y.; Peng, Y.; Huo, Y.; Lyu, M.R. Enhancing LLM-Based Coding Tools through Native Integration of IDE-Derived Static Context. In Proceedings of the 2024 IEEE/ACM International Workshop on Large Language Models for Code (LLM4Code), 2024, pp. 70–74.
- Zubair, F.; Al-Hitmi, M.; Catal, C. The use of large language models for program repair. Computer Standards & Interfaces 2025, 93, 103951. [CrossRef]
- OpenAI. Introducing study mode. https://openai.com/index/chatgpt-study-mode/, 2025. Accessed 2025-12-03.
- OpenAI Help Center. ChatGPT Study Mode – FAQ. https://help.openai.com/en/articles/11780217-chatgpt-study-mode-faq, 2025. Accessed 2025-12-03.
- Anthropic. Introducing Claude for education. https://www.anthropic.com/news/introducing-claude-for-education, 2025. Accessed 2025-12-03.
- Google. Guided Learning in Gemini: From answers to understanding. https://blog.google/products-and-platforms/products/education/guided-learning/, 2025. Accessed 2025-12-03.
- LearnLM Team.; Modi, A.; Veerubhotla, A.S.; Rysbek, A.; Huber, A.; Wiltshire, B.; Veprek, B.; Gillick, D.; Kasenberg, D.; Ahmed, D.; et al. LearnLM: Improving Gemini for Learning. arXiv preprint arXiv:2412.16429 2024.
- Google AI for Developers. LearnLM (Gemini API documentation). https://ai.google.dev/gemini-api/docs/learnlm, 2025. Accessed 2025-12-03.
- Fu, L.; Yuan, H.; Chen, D.; Dai, X.; Li, Q.; Zhang, W.; Liu, W.; Yu, Y. DebugTA: An LLM-Based Agent for Simplifying Debugging and Teaching in Programming Education. arXiv preprint arXiv:2510.11076 2025.
- Lehkyi, M.; Shevchuk, H. Tools to Support Programming Education with the Help of Multi-Role AI Agents. Herald of Khmelnytskyi National University. Technical sciences 2026, 361, 217–226. [CrossRef]
- Ferrag, M.A.; Tihanyi, N.; Hamouda, D.; Maglaras, L.; Lakas, A.; Debbah, M. From prompt injections to protocol exploits: Threats in LLM-powered AI agents workflows. ICT Express 2025. [CrossRef]









| System / Line of Work | Year | Primary architecture | Pedagogical safeguard | Robustness / resilience mechanism | Domain |
|---|---|---|---|---|---|
| Frontier LLM “study/learning” modes [35,36,37,38] | 2025 | Single model (sequential dialogue) | Socratic prompting; hint-style policies | Policy compliance and self-restraint within one model | General / Code |
| Education-tuned LLMs [39,40] | 2025 | Single model (fine-tuned) | Instruction tuned for pedagogy | Training-time alignment; long-context grounding | General |
| Tool-augmented tutoring and debugging agents [41] | 2025 | Agent + tools | Stepwise tool calls; compiler/test feedback | External verification via compilation/tests | Programming |
| Multi-role tutor/evaluator/coach prototypes [42] | 2026 | Multi-agent (role separation) | Human-in-the-loop or evaluator feedback | Role-based feedback loops (often without explicit sanitization) | Programming |
| The Code Council (Ours) | 2026 | Multi-agent council with technical loop + pedagogical phase | Student-facing roles isolated from internal patches | Secretary sanitization; adversarial Skeptic; trap-resistance probes | Programming |
| Problem ID | Problem Name | Link |
|---|---|---|
| ITP1_4_B | Circle | https://onlinejudge.u-aizu.ac.jp/courses/lesson/2/ITP1/4/ITP1_4_B |
| ITP1_5_C | Print a Chessboard | https://onlinejudge.u-aizu.ac.jp/courses/lesson/2/ITP1/5/ITP1_5_C |
| ITP1_5_D | Structured Programming | https://onlinejudge.u-aizu.ac.jp/courses/lesson/2/ITP1/5/ITP1_5_D |
| ITP1_6_C | Official House | https://onlinejudge.u-aizu.ac.jp/courses/lesson/2/ITP1/6/ITP1_6_C |
| ITP1_7_C | Spreadsheet | https://onlinejudge.u-aizu.ac.jp/courses/lesson/2/ITP1/7/ITP1_7_C |
| ITP1_8_D | Ring | https://onlinejudge.u-aizu.ac.jp/courses/lesson/2/ITP1/8/ITP1_8_D |
| ITP1_9_D | Transformation | https://onlinejudge.u-aizu.ac.jp/courses/lesson/2/ITP1/9/ITP1_9_D |
| ITP1_10_D | Distance II | https://onlinejudge.u-aizu.ac.jp/courses/lesson/2/ITP1/10/ITP1_10_D |
| ITP1_11_C | Dice III | https://onlinejudge.u-aizu.ac.jp/courses/lesson/2/ITP1/11/ITP1_11_C |
| ITP1_11_D | Dice IV | https://onlinejudge.u-aizu.ac.jp/courses/lesson/2/ITP1/11/ITP1_11_D |

| Configuration | Patch Succ. (%) | False Pos. (%) | Med. PatchSize |
|---|---|---|---|
| GPT-4o tutor | 87.7 | 11.3 | 0.41 |
| Gemini 2.0 tutor | 84.8 | 12.7 | 0.44 |
| Claude 4.5 tutor | 84.2 | 9.9 | 0.40 |
| DeepSeek v3.2 tutor | 80.7 | 9.1 | 0.38 |

| Configuration | Course Patch Succ. (%) | Challenge Patch Succ. (%) |
|---|---|---|
| GPT-4o tutor | 100.0 | 75.3 |
| Gemini 2.0 tutor | 100.0 | 69.6 |
| Claude 4.5 tutor | 100.0 | 68.4 |
| DeepSeek v3.2 tutor | 100.0 | 61.4 |

| Configuration | Loc. Prec. | Loc. Rec. | Leakage (%) | Code Tokens (med.) | Words (med.) | Questions (med.) |
|---|---|---|---|---|---|---|
| GPT-4o tutor | 0.62 | 0.55 | 22.0 | 102 | 86 | 0.8 |
| Gemini 2.0 tutor | 0.59 | 0.52 | 23.5 | 96 | 92 | 0.7 |
| Claude 4.5 tutor | 0.64 | 0.56 | 18.5 | 88 | 95 | 0.9 |
| DeepSeek v3.2 tutor | 0.60 | 0.52 | 24.0 | 115 | 80 | 0.6 |

| Configuration | Full solution | Heavy hint | Conceptual guidance |
|---|---|---|---|
| GPT-4o tutor | 20% | 39% | 41% |
| Gemini 2.0 tutor | 22% | 38% | 40% |
| Claude 4.5 tutor | 17% | 37% | 46% |
| DeepSeek v3.2 tutor | 23% | 36% | 41% |

| Configuration | Latency (med., s) | Tokens / interaction (med.) | Tokens / interaction (p90) |
|---|---|---|---|
| GPT-4o tutor | 7.4 | 3,150 | 3,980 |
| Gemini 2.0 tutor | 7.9 | 3,280 | 4,120 |
| Claude 4.5 tutor | 8.3 | 3,210 | 4,050 |
| DeepSeek v3.2 tutor | 6.8 | 2,980 | 3,760 |

| Configuration | Patch Stab. (% ident.) | Hint Sim. (mean cos.) |
|---|---|---|
| GPT-4o tutor | 64.0 | 0.81 |
| Gemini 2.0 tutor | 61.5 | 0.79 |
| Claude 4.5 tutor | 66.2 | 0.83 |
| DeepSeek v3.2 tutor | 63.7 | 0.80 |

| Configuration | Correct. | Loc. | Scaf. | Usef. |
|---|---|---|---|---|
| GPT-4o tutor | 3.8 | 3.5 | 3.2 | 3.4 |
| Gemini 2.0 tutor | 3.6 | 3.3 | 3.1 | 3.3 |
| Claude 4.5 tutor | 3.9 | 3.7 | 3.5 | 3.7 |
| DeepSeek v3.2 tutor | 3.8 | 3.4 | 3.1 | 3.4 |

| Configuration | Patch Succ. (%) | False Pos. (%) | Med. PatchSize |
|---|---|---|---|
| Council–GPT | 89.9 | 6.5 | 0.35 |
| Council–CodeFirst | 91.9 | 5.2 | 0.30 |
| Council–Conservative | 92.4 | 4.6 | 0.34 |
| Council–HybridStrong | 93.8 | 4.9 | 0.32 |

| Configuration | Course Patch Succ. (%) | Challenge Patch Succ. (%) |
|---|---|---|
| Council–GPT | 100.0 | 79.8 |
| Council–CodeFirst | 100.0 | 83.8 |
| Council–Conservative | 100.0 | 84.8 |
| Council–HybridStrong | 100.0 | 87.5 |

| Configuration | Loc. Prec. | Loc. Rec. | Leakage (%) | Code Tokens (med.) | Words (med.) | Questions (med.) |
|---|---|---|---|---|---|---|
| Council–GPT | 0.70 | 0.61 | 13.5 | 74 | 79 | 1.1 |
| Council–CodeFirst | 0.76 | 0.68 | 9.5 | 60 | 73 | 1.3 |
| Council–Conservative | 0.75 | 0.67 | 6.8 | 54 | 74 | 1.5 |
| Council–HybridStrong | 0.78 | 0.69 | 7.5 | 52 | 71 | 1.4 |

| Configuration | Full solution | Heavy hint | Conceptual guidance |
|---|---|---|---|
| Council–GPT | 12% | 35% | 53% |
| Council–CodeFirst | 8% | 34% | 58% |
| Council–Conservative | 4% | 28% | 68% |
| Council–HybridStrong | 6% | 30% | 64% |

| Configuration | Latency (med., s) | Tokens / interaction (med.) | Tokens / interaction (p90) |
|---|---|---|---|
| Council–GPT | 11.5 | 4,500 | 5,700 |
| Council–CodeFirst | 12.6 | 4,850 | 6,020 |
| Council–Conservative | 13.0 | 5,050 | 6,150 |
| Council–HybridStrong | 13.4 | 5,030 | 6,320 |

| Configuration | Patch Stab. (% ident.) | Hint Sim. (mean cos.) |
|---|---|---|
| Council–GPT | 71.6 | 0.84 |
| Council–CodeFirst | 74.9 | 0.86 |
| Council–Conservative | 78.2 | 0.88 |
| Council–HybridStrong | 77.0 | 0.87 |

| Configuration | Correct. | Loc. | Scaf. | Usef. |
|---|---|---|---|---|
| Council–GPT | 4.1 | 3.9 | 3.8 | 3.9 |
| Council–CodeFirst | 4.2 | 4.0 | 4.0 | 4.1 |
| Council–Conservative | 4.2 | 4.1 | 4.3 | 4.4 |
| Council–HybridStrong | 4.4 | 4.2 | 4.2 | 4.3 |

| Configuration | Patch Succ. (%), Blind | Patch Succ. (%), Guid. (rev.) | False Pos. (%), Blind | False Pos. (%), Guid. (rev.) |
|---|---|---|---|---|
| Council–GPT | 89.9 | 90.8 | 6.5 | 5.0 |
| Council–CodeFirst | 91.9 | 92.6 | 5.2 | 4.2 |
| Council–Conservative | 92.4 | 93.3 | 4.6 | 4.1 |
| Council–HybridStrong | 93.8 | 94.6 | 4.9 | 3.7 |

| Configuration | Blind | Guided (naive) | Guided (rev.) |
|---|---|---|---|
| Council–GPT | 13.5% | 24.0% | 18.0% |
| Council–CodeFirst | 9.5% | 32.0% | 19.5% |
| Council–Conservative | 6.8% | 18.5% | 10.5% |
| Council–HybridStrong | 7.5% | 38.0% | 15.0% |

| Configuration | Res. (%) | Compl. (%) | Coll. (%) |
|---|---|---|---|
| Council–GPT | 58.0 | 28.0 | 14.0 |
| Council–CodeFirst | 55.0 | 32.0 | 13.0 |
| Council–Conservative | 68.0 | 18.0 | 14.0 |
| Council–HybridStrong | 62.0 | 26.0 | 12.0 |

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).