Preprint
Article

This version is not peer-reviewed.

The Code Council: Orchestrating Heterogeneous Large Language Models for Robust Programming Scaffolding

Submitted: 03 March 2026

Posted: 04 March 2026


Abstract
Recent advances in large language models (LLMs) have made it feasible to use them as automated debugging tutors, but it remains unclear how much is gained by moving from single-model tutors to multi-agent councils with separated roles. We study this question in an offline simulation on 200 debugging cases drawn from an online judge, spanning 20 problems split into a course-style track and a contest-style challenge track. We compare four single-model tutors based on current frontier models against four councils that assign models to Architect, Skeptic, Secretary, Pedagogue, and Mentor roles and operate in both Blind and Guided modes. Single-model tutors achieve near-perfect repair on course problems but are less reliable on challenge cases. They also often rewrite large portions of student code, show non-negligible false positive rates, and leak full or near-full solutions in a substantial share of hints. Councils designed around measured model strengths improve both technical and pedagogical behaviour. On the challenge track, the best council raises patch success by 12.2 percentage points over the best single tutor, while reducing false positives, shrinking median patch size, improving hint localisation, and cutting solution leakage in Blind mode from about one fifth of hints to under ten percent. Councils are also more stable across reruns and produce hints that two independent instructors consistently rate as more useful and better scaffolded. Guided mode, where internal components see a reference solution, yields further technical gains but introduces leakage risks that require prompt tightening and a sanitising Secretary to control the flow of ground truth. Additional trap experiments with poisoned reference solutions show a mix of resistance and fail-safe collapse rather than systematic poisoning of hints.
These results indicate that orchestration and information flow are powerful levers and that well-designed councils can provide more reliable and pedagogically aligned debugging support than strong single-model tutors alone.
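The difference between Blind and Guided information flow can be illustrated with a minimal Python sketch. Everything in it is a hypothetical stand-in introduced for illustration, not the authors' implementation: the `Case` type, the `internal_notes` stub (which replaces real model calls with fixed strings), and the verbatim-match rule used by `secretary_sanitise` are all assumptions.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Case:
    """Hypothetical debugging case: student code plus a reference solution."""
    student_code: str
    reference_solution: str


def internal_notes(case: Case, mode: str) -> List[str]:
    """Stub for the internal roles; a real council would call models here."""
    notes = ["The loop bound on the array fill looks off by one."]
    if mode == "guided":
        # Assumption: only Guided mode exposes the reference internally,
        # so only Guided mode can produce a note that quotes it.
        first_line = case.reference_solution.splitlines()[0].strip()
        notes.append("Reference fix: " + first_line)
    return notes


def secretary_sanitise(notes: List[str], reference: str) -> List[str]:
    """Drop any note that quotes a non-trivial reference line verbatim."""
    ref_lines = [ln.strip() for ln in reference.splitlines()
                 if len(ln.strip()) >= 8]
    return [n for n in notes if not any(r in n for r in ref_lines)]


def build_hint(case: Case, mode: str) -> str:
    """Assemble the student-facing hint for the given mode."""
    notes = internal_notes(case, mode)
    if mode == "guided":
        # Ground truth passes through the Secretary before reaching the hint.
        notes = secretary_sanitise(notes, case.reference_solution)
    return " ".join(notes)


case = Case(
    student_code="for i in range(n + 1): a[i] = 0",
    reference_solution="for i in range(n): a[i] = 0",
)
blind_hint = build_hint(case, "blind")
guided_hint = build_hint(case, "guided")
```

In this toy setting the Guided hint ends up identical to the Blind one because the only reference-derived note is filtered out. A verbatim-match filter is deliberately crude: it would miss paraphrased or lightly edited leaks, which is one reason a sketch like this cannot stand in for the prompt tightening the abstract mentions.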
Copyright: This open access article is published under a Creative Commons CC BY 4.0 license, which permits free download, distribution, and reuse, provided that the author and preprint are cited in any reuse.

Preprints.org is a free preprint server supported by MDPI in Basel, Switzerland.

© 2026 MDPI (Basel, Switzerland) unless otherwise stated