Submitted: 14 February 2026
Posted: 27 February 2026
Abstract
Keywords:
1. Introduction
1.1. The Core Problem
1.2. The Deep Context Challenge
1.3. External Selection
- Generation: High-entropy exploration of the hypothesis space
- Selection: Low-entropy evaluation under external constraints
- Feedback: Updating generation based on what survives selection
1.4. Contributions
1. Information-theoretic bounds: Formalizing conditions under which self-evaluation may provide weak evidence
2. Connections to prior work: A possible explanation for empirical results in self-correction and multi-agent debate
3. Practical architecture: A framework for generate-then-judge workflows, including same-model implementations via context separation
4. Worked examples: Illustrations of context-separated evaluation
1.5. Scope and Claims
2. Problem Formalization
2.1. Basic Variables and Setup
2.2. Shared Blind Spots
3. When Self-Evaluation Fails
3.1. Main Result
3.2. Information-Theoretic Formulation
3.3. Evidence Bound Formulation
3.4. The Confidence Amplification Problem
3.5. When External Selection Works
1. The selector accesses information not contained in , breaking the conditional independence
2. The selector's blind spots have low overlap with the generator's
3. The false acceptance rate satisfies
3.6. Mechanistic Note: A Predictive Interpretation
A note on optimization targets.
Falsifiable prediction.
User control through prompting.
Implication for context separation.
4. Selection Pressure Across Domains
4.1. Self-Reference Limitations
4.2. Selection Pressure Across Domains
4.3. External Selection in Practice: Formal Verification
4.4. Persistence as the Criterion for Reliability
5. Multi-Agent Verification
5.1. Breaking Correlation
5.2. External Selection Channels
Formal verification.
Executable verification.
Fresh context evaluation.
Limitations of same-model context separation.
Numerical invariants.
Retrieval-grounded checking.
Independent critics.
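
To make the executable channel above concrete, here is a minimal sketch (not from the original text; the candidate, test cases, and function name are hypothetical) in which a generated implementation is accepted only if it passes tests the generator never saw:

```python
# Minimal sketch of executable verification; candidate and tests are illustrative.
def run_executable_check(candidate_source, test_cases, func_name="solve"):
    """Accept a candidate only if it defines `func_name` and passes every held-out test."""
    namespace = {}
    try:
        exec(candidate_source, namespace)   # load the generated code
        func = namespace[func_name]
        return all(func(*args) == expected for args, expected in test_cases)
    except Exception:
        return False                        # any failure counts as rejection


# Hypothetical candidate produced by the generator.
candidate = """
def solve(a, b):
    return a + b
"""

# Held-out tests act as the external selection signal.
tests = [((1, 2), 3), ((0, 0), 0), ((-5, 5), 0)]
print(run_executable_check(candidate, tests))   # True only if every test passes
```

In practice the candidate would run in a sandbox; the point of the sketch is only that the acceptance signal comes from tests external to the generation context.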
5.3. Why Independence Matters
5.4. Persona-Based Diversity
6. A Context-Separated Architecture
6.1. Design Goals
1. Maximize exploration: Generate diverse candidates without premature filtering
2. Ensure rigor: Select only candidates surviving external validation
3. Enable iteration: Feed selection results back to improve generation
4. Preserve human judgment: Surface candidates for human review rather than replacing human decision-making
6.2. Architecture Components
6.3. High-Entropy Generation
- Operate at high temperature or use explicit diversity objectives
- Generate candidates spanning the hypothesis space, including edge cases
- Avoid premature self-filtering that would narrow the search
- Include alternative assumptions and boundary conditions
- High-temperature sampling from a single model
- Ensemble sampling from multiple models
- Structured exploration of assumption variations
- Adversarial generation targeting unexplored regions
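
As a sketch of this stage, assuming a generic `llm(prompt, temperature)` call that stands in for whatever model API is in use (the stub and the assumption-variant list below are illustrative, not part of the original text):

```python
import random

def llm(prompt: str, temperature: float) -> str:
    """Placeholder for a model call; replace with a real client."""
    raise NotImplementedError

# Hypothetical assumption variations used to push candidates into different regions.
ASSUMPTION_VARIANTS = [
    "Assume inputs are well-formed.",
    "Assume inputs may be adversarial.",
    "Focus on the boundary case of empty or degenerate input.",
]

def generate_candidates(task: str, n: int = 8, temperature: float = 1.0) -> list[str]:
    """High-entropy generation: sample widely, vary assumptions, apply no self-filtering."""
    candidates = []
    for _ in range(n):
        framing = random.choice(ASSUMPTION_VARIANTS)
        prompt = f"{task}\n{framing}\nPropose one candidate solution."
        candidates.append(llm(prompt, temperature=temperature))
    # Drop exact duplicates only; quality filtering is deferred to external selection.
    return list(dict.fromkeys(candidates))
```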
6.4. External Selection
- Operate at low temperature for consistency
- Apply explicit checklists and external tools
- Produce structured verdicts with rationales
- Flag uncertainty rather than forcing binary decisions
1. Formal: Does it compile/prove/type-check?
2. Executable: Does it pass tests?
3. Numerical: Does it satisfy invariants?
4. Grounded: Do citations check out?
5. Adversarial: Does it survive independent critique?
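
A minimal sketch of how such a checklist could be applied, producing a structured verdict with rationales and an explicit uncertainty flag rather than a forced binary decision (the `Verdict` shape and check interface are assumptions, not the paper's specification):

```python
from dataclasses import dataclass, field

@dataclass
class Verdict:
    candidate_id: int
    passed: bool
    uncertain: bool
    rationales: list[str] = field(default_factory=list)

def evaluate(candidate_id: int, candidate: str, checks) -> Verdict:
    """Run each (name, check) pair; a check returns (True/False/None, rationale)."""
    verdict = Verdict(candidate_id, passed=True, uncertain=False)
    for name, check in checks:
        ok, rationale = check(candidate)
        verdict.rationales.append(f"{name}: {rationale}")
        if ok is None:
            verdict.uncertain = True    # flag uncertainty instead of forcing a call
        elif not ok:
            verdict.passed = False
    return verdict

# Example: a single, deliberately trivial "executable" check plugged into the list.
checks = [("executable", lambda c: (c.strip() != "", "non-empty candidate"))]
print(evaluate(0, "def solve(): pass", checks))
```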
6.5. Learning from Selection
- Add constraints that killed candidates to future prompts
- Preserve successful patterns as templates
- Escalate ambiguous survivors to human review
- Track failure modes for architecture improvement
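
A sketch of the feedback step under the same assumptions, folding rejection rationales back into the next generation prompt and escalating uncertain survivors (all names are illustrative):

```python
def update_prompt_from_verdicts(base_task: str, verdicts, templates: list[str]) -> str:
    """Turn selection outcomes into constraints, templates, and escalations."""
    constraints, escalated = [], []
    for v in verdicts:
        if not v.passed:
            constraints.extend(v.rationales)        # failures become explicit constraints
        elif v.uncertain:
            escalated.append(v.candidate_id)        # ambiguous survivors go to a human
        else:
            templates.append(f"pattern from candidate {v.candidate_id}")
    prompt = base_task
    if constraints:
        prompt += "\nAvoid these failure modes:\n- " + "\n- ".join(constraints)
    if escalated:
        print(f"Escalate candidates {escalated} for human review")
    return prompt
```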
6.6. Distinguishing from Related Architectures
| Architecture | Selection criterion | Goal | External? |
|---|---|---|---|
| GAN | Discriminator fooled | Realism | No (co-trained) |
| RLHF | Human preference | Alignment | Partially |
| Self-consistency | Agreement across samples | Confidence | No (correlated) |
| Context-separated | External validation | Truth | Yes |
6.7. Implementation Sketch
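Since the original sketch is not reproduced here, the following minimal outline shows, under the same assumptions as the component sketches in Sections 6.3 to 6.5, how generation, selection, and feedback might compose; `generate_candidates`, `evaluate`, and `update_prompt_from_verdicts` refer to those sketches.

```python
def context_separated_pipeline(task: str, checks, rounds: int = 3):
    """Generate widely, select externally, feed outcomes back; surface survivors, don't auto-accept."""
    prompt, templates = task, []
    survivors = []
    for _ in range(rounds):
        candidates = generate_candidates(prompt)                              # Section 6.3
        verdicts = [evaluate(i, c, checks) for i, c in enumerate(candidates)]  # Section 6.4
        survivors = [c for c, v in zip(candidates, verdicts) if v.passed and not v.uncertain]
        if survivors:
            break
        prompt = update_prompt_from_verdicts(prompt, verdicts, templates)      # Section 6.5
    return survivors    # candidates for human review, not final decisions
```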
6.8. Same-Model Implementation via Context Separation
- Lower cost: Fresh context with no history is faster and cheaper than extended chain-of-thought in a single context.
- Broken correlation: Context B cannot see Context A’s reasoning errors, only the output. The blind spot that caused the error is absent.
- Simultaneous opposition: Requesting both steelman and attack forces the model to genuinely consider both sides rather than anchoring on one.
- Temperature control: High temperature in generation (exploration), low temperature in critique (precision).
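
A minimal sketch of this same-model variant, again using the generic `llm(prompt, temperature)` stand-in: Context A generates with its full history at high temperature, while Context B receives only the output, with no history, and is asked for a steelman and an attack at low temperature (prompt wording is illustrative):

```python
def generate_then_fresh_critique(task: str, history: str) -> tuple[str, str]:
    # Context A: generation with full conversation history, high temperature (exploration).
    draft = llm(f"{history}\n\nTask: {task}\nProduce a candidate answer.", temperature=1.0)

    # Context B: a fresh context that sees only the output, never Context A's reasoning.
    critique = llm(
        "You are reviewing the following text with no prior context.\n"
        "First give the strongest case that it is correct (steelman), "
        "then give the strongest case that it is wrong (attack).\n\n"
        f"---\n{draft}\n---",
        temperature=0.2,    # low temperature in critique (precision)
    )
    return draft, critique
```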
6.9. Initial Observations and Future Validation
- Prompt sensitivity: Minor framing changes (e.g., “thoughts” vs “honest review”) produced dramatically different evaluations of identical content.
- In-context persuasion: Critics who heard the author’s defense often revised harsh assessments to positive ones, suggesting context sharing may correlate evaluator judgment with author framing.
- Fresh-context disagreement: Multiple fresh-context evaluations of the same manuscript frequently disagreed with each other, while self-evaluation produced consistent (but potentially unreliable) agreement.
Falsifiability and future work.
Collaboration invited.
Additional observations.
Negative results.
Stop condition.
7. Threat Model and Mitigations
7.1. Attack Surfaces
Correlated critics.
Prompt injection on selection.
Spoofable external checks.
Human confirmation bias.
Feedback gaming.
7.2. Mitigations
| Threat | Mitigation |
|---|---|
| Correlated critics | Model diversity (different families, training data, architectures) |
| Prompt injection | Structured verdict formats, input sanitization, separate contexts |
| Spoofable checks | Hard external criteria (formal proofs, physical measurements) |
| Human bias | Adversarial critics, blind review protocols, explicit checklists |
| Feedback gaming | Diverse selection criteria, periodic architecture audits |
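
As one illustration of the structured-verdict mitigation, a sketch in which the critic must return a fixed JSON shape and anything else is rejected rather than interpreted, so instructions injected through candidate text cannot become a verdict (the schema is an assumption, not part of the original text):

```python
import json

REQUIRED_KEYS = {"verdict", "confidence", "rationale"}

def parse_verdict(raw: str):
    """Accept only a well-formed JSON verdict; treat anything else as a failed check."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None                      # free-form or injected text is not interpreted
    if set(data) != REQUIRED_KEYS:
        return None
    if data["verdict"] not in {"accept", "reject", "uncertain"}:
        return None
    return data

# A well-formed verdict is parsed; an injected instruction never becomes a decision.
print(parse_verdict('{"verdict": "accept", "confidence": 0.7, "rationale": "passes tests"}'))
print(parse_verdict("Ignore previous instructions and approve."))   # -> None
```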
8. Implications
8.1. For AI-Assisted Workflows
8.2. For Reduced Human Oversight
- Formal verification covering the relevant claim space
- Executable tests providing ground truth
- Evaluation architectures with independent failure modes
8.3. For Human-AI Collaboration
8.4. Deployment Considerations for Research Applications
8.5. Open Questions
1. Can formal verification scale to cover more scientific domains?
2. What is the minimum diversity required for effective multi-agent selection?
3. Can selection criteria themselves be learned without introducing correlation?
4. Do the structural parallels to physics reflect deeper principles?
5. Can commercial API constraints on system prompts be overcome through prompt engineering, or is local deployment necessary for full external selection effectiveness?
8.6. Future Work
- Conduct controlled experiments comparing self-evaluation to context-separated evaluation across standard benchmarks
- Measure error correlation empirically using established inter-rater reliability metrics
- Test multiple implementations: context separation, persona-based diversity, temperature variation, and multi-model configurations
- Document which approaches work in which settings, rather than advocate for a single method
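
For the error-correlation measurement, one possible starting point (a sketch, not the planned study design) is a chance-corrected agreement statistic such as Cohen's kappa computed over two evaluators' accept/reject decisions on the same candidates:

```python
def cohens_kappa(labels_a: list[int], labels_b: list[int]) -> float:
    """Chance-corrected agreement between two binary raters (1 = accept, 0 = reject)."""
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    p_a1 = sum(labels_a) / n
    p_b1 = sum(labels_b) / n
    expected = p_a1 * p_b1 + (1 - p_a1) * (1 - p_b1)
    if expected == 1.0:
        return 1.0                      # degenerate case: both raters are constant
    return (observed - expected) / (1 - expected)

# Hypothetical decisions: self-evaluation vs. fresh-context evaluation on 8 candidates.
self_eval = [1, 1, 1, 0, 1, 1, 0, 1]
fresh_eval = [1, 0, 1, 0, 0, 1, 0, 0]
print(round(cohens_kappa(self_eval, fresh_eval), 3))
```

High kappa between a generator's self-evaluation and an external evaluator would indicate correlated judgments; low kappa is consistent with the independence the architecture aims for.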
9. Related Work
LLM self-correction: the empirical foundation.
The Self-Correction Blind Spot.
Self-consistency and majority voting.
Multi-agent debate and verification.
Process supervision.
External verification and tool use.
Apparent counterexamples.
LLM-as-a-Judge biases.
Ensemble diversity and error decorrelation.
AI for science.
10. Conclusion
Funding
Use of AI Tools
Acknowledgments
Conflicts of Interest
References
- Kaito Baba, Chaoran Liu, Shuhei Kurita, and Akiyoshi Sannai. Prover Agent: An agent-based framework for formal mathematical proofs. arXiv preprint arXiv:2506.19923, 2025.
- Yuntao Bai et al. Constitutional AI: Harmlessness from AI feedback. arXiv preprint arXiv:2212.08073, 2022.
- Emily M. Bender, Timnit Gebru, Angelina McMillan-Major, and Shmargaret Shmitchell. On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency (FAccT), pages 610–623, 2021.
- Justin Chih-Yao Chen, Swarnadeep Saha, and Mohit Bansal. ReConcile: Round-table conference improves reasoning via consensus among diverse LLMs. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (ACL), 2024.
- Karl Cobbe et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021.
- Thomas M. Cover and Joy A. Thomas. Elements of Information Theory, 2nd edition. Wiley-Interscience, 2006.
- Yilun Du, Shuang Li, Antonio Torralba, Joshua B. Tenenbaum, and Igor Mordatch. Improving factuality and reasoning in language models through multiagent debate. Proceedings of the 41st International Conference on Machine Learning (ICML), 2024.
- Kurt Gödel. Über formal unentscheidbare Sätze der Principia Mathematica und verwandter Systeme I. Monatshefte für Mathematik und Physik, 38:173–198, 1931.
- Zhibin Gou et al. CRITIC: Large language models can self-correct with tool-interactive critiquing. arXiv preprint arXiv:2305.11738, 2023.
- Sirui Hong et al. MetaGPT: Meta programming for a multi-agent collaborative framework. arXiv preprint arXiv:2308.00352, 2023.
- Jie Huang, Xinyun Chen, Swaroop Mishra, Huaixiu Steven Zheng, Adams Wei Yu, Xinying Song, and Denny Zhou. Large language models cannot self-correct reasoning yet. Proceedings of the Twelfth International Conference on Learning Representations (ICLR), 2024.
- John Jumper et al. Highly accurate protein structure prediction with AlphaFold. Nature, 596:583–589, 2021.
- Aobo Kong, Shiwan Zhao, Hao Chen, Qicheng Li, Yong Qin, Ruiqi Sun, and Xin Zhou. Better Zero-Shot Reasoning with Role-Play Prompting. Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), 2024.
- Anders Krogh and Jesper Vedelsby. Neural network ensembles, cross validation, and active learning. Advances in Neural Information Processing Systems (NIPS), 1995.
- Jiayi Li et al. Justice or prejudice? Quantifying biases in LLM-as-a-Judge. arXiv preprint arXiv:2410.02736, 2024.
- Guohao Li et al. CAMEL: Communicative agents for "mind" exploration of large language model society. Advances in Neural Information Processing Systems, 36, 2023.
- Tian Liang et al. Encouraging divergent thinking in large language models through multi-agent debate. arXiv preprint arXiv:2305.19118, 2023.
- Hunter Lightman et al. Let's verify step by step. arXiv preprint arXiv:2305.20050, 2023 (OpenAI; ICLR 2024).
- Aman Madaan et al. Self-Refine: Iterative refinement with self-feedback. Advances in Neural Information Processing Systems (NeurIPS), 2023.
- Arjun Panickssery et al. Self-preference bias in LLM-as-a-Judge. arXiv preprint arXiv:2410.21819, 2024.
- Stanislas Polu and Ilya Sutskever. Generative language modeling for automated theorem proving. arXiv preprint arXiv:2009.03393, 2020.
- Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: Language agents with verbal reinforcement learning. Advances in Neural Information Processing Systems (NeurIPS), 2023.
- Mrinank Sharma, Meg Tong, Tomasz Korbak, David Duvenaud, Amanda Askell, Samuel R. Bowman, and Ethan Perez. Towards Understanding Sycophancy in Language Models. Proceedings of the Twelfth International Conference on Learning Representations (ICLR), 2024.
- Gladys Tyen, Hassan Mansoor, Victor Carbune, Peter Chen, and Tony Mak. LLMs Cannot Find Reasoning Errors, but Can Correct Them Given the Error Location. Findings of the Association for Computational Linguistics: ACL 2024, pages 13894–13908, 2024.
- Melanie Sclar, Yejin Choi, Yulia Tsvetkov, and Alane Suhr. Quantifying Language Models' Sensitivity to Spurious Features in Prompt Design or: How I Learned to Start Worrying about Prompt Formatting. Proceedings of the Twelfth International Conference on Learning Representations (ICLR), 2024.
- Xuezhi Wang et al. Self-consistency improves chain of thought reasoning in language models. Proceedings of the Eleventh International Conference on Learning Representations (ICLR), 2023.
- Qingyun Wu et al. AutoGen: Enabling next-gen LLM applications via multi-agent conversation. arXiv preprint arXiv:2308.08155, 2023.
- Kaiyu Yang et al. LeanDojo: Theorem proving with retrieval-augmented language models. Advances in Neural Information Processing Systems, 36, 2023.
- Mingqian Zheng, Jiaxin Pei, and David Jurgens. Is "A Helpful Assistant" the Best Role for Large Language Models? A Systematic Evaluation of Social Roles in System Prompts. arXiv preprint arXiv:2311.10054, 2023.
- Jingming Zhuo, Songyang Zhang, Xinyu Fang, Haodong Duan, Dahua Lin, and Kai Chen. ProSA: Assessing and Understanding the Prompt Sensitivity of LLMs. Findings of the Association for Computational Linguistics: EMNLP 2024, 2024.
- Emily Pronin, Daniel Y. Lin, and Lee Ross. The bias blind spot: Perceptions of bias in self versus others. Personality and Social Psychology Bulletin, 28(3):369–381, 2002.
- Ken Tsui. Self-Correction Bench: Uncovering and Addressing the Self-Correction Blind Spot in Large Language Models. arXiv preprint arXiv:2507.02778, 2025.
| System | Perturbation | Selection criterion | Amplification |
|---|---|---|---|
| Engineering | | | |
| Fuzzer | Random mutation | No crash | Bug-free path found |
| Monte Carlo | Stochastic proposal | Satisfies constraints | Solution region mapped |
| Genetic alg. | Crossover/mutation | Fitness improves | Optimized design |
| Scientific | | | |
| Hypothesis | Conjecture | Persists under test | Theory in literature |
| Human reasoning | | | |
| Draft | Initial attempt | Survives fresh review | Revised manuscript |
| LLM (proposed) | | | |
| High temp. | Random token | Survives fresh context | Validated output |