1. Introduction
Recent work demonstrates that large language models cannot reliably self-correct their reasoning [11]. This failure persists despite models’ ability to generate plausible text, mathematical derivations, and executable code. We argue this reflects a structural property of certain evaluation configurations: when generator and evaluator share failure modes, self-evaluation provides weak evidence of correctness. A derivation may be elegant, internally consistent, and convincingly presented, yet contain a subtle error that the model cannot detect in its own output.
This failure is not primarily about compute or scale. It is about validation. Reliable workflows require a mechanism that separates signal from noise, correct outputs from plausible-but-wrong ones. The question is: what properties must a validation mechanism have to be reliable?
1.1. The Core Problem
Consider a system that generates a hypothesis and then evaluates whether that hypothesis is correct. Under what conditions does self-evaluation provide useful information?
We argue that the answer depends on error correlation. When the evaluator makes errors on the same inputs where the generator makes errors, self-evaluation can be non-identifying: agreement between generator and evaluator may provide weak evidence of correctness. This echoes well-documented phenomena in human reasoning (confirmation bias, the curse of knowledge) and is the reason peer review and second opinions exist at all. The difference with LLMs is that context can be deleted: a fresh instance has no memory of the reasoning it might defend.
This is not a claim about any specific model’s limitations. It is a structural property of evaluation systems. A single agent evaluating its own outputs faces correlated error by construction: the same training data, the same inductive biases, the same blind spots.
1.2. The Deep Context Challenge
This problem becomes acute as context windows expand. Modern LLMs support large context windows, enabling extended reasoning: complex derivations, multi-step analyses, and long research sessions.
But large context is where correlated error accumulation may be most severe. Each reasoning step inherits context from previous steps. Errors compound: an error at one step cascades through subsequent steps, propagating through sequential reasoning [24]. The longer the reasoning chain, the more opportunity for self-reinforcing mistakes that become invisible within the context that produced them.
Self-evaluation within a deep context may struggle to catch these errors, because the evaluation inherits the same drift that produced the mistake. This creates a tension: the contexts where rigorous evaluation matters most may be the contexts where self-evaluation is least reliable.
Context-separated evaluation can help address this tension. By evaluating outputs in fresh context, without the reasoning trace that produced them, we reduce the inheritance of correlated error. The deeper the original context, the more valuable context separation may become.
1.3. External Selection
If correlated error is the problem, then evaluation in a modified or fresh context (where the original reasoning trace is absent) may provide more independent signal. We call this external selection. The key observation is that “external” refers to context, not necessarily to different models. A fresh instance of the same model, without access to the reasoning chain that produced the candidate, may provide more independent critique because the error-producing context is absent.
Multi-agent systems may succeed when they introduce external selection channels [7]: formal proof checkers, executable tests, numerical invariants, independently-trained critics, or even the same model under fresh context. The common element is reducing correlation between generator and evaluator failure modes.
This motivates a practical architecture that separates:
- Generation: High-entropy exploration of the hypothesis space
- Selection: Low-entropy evaluation under external constraints
- Feedback: Updating generation based on what survives selection
1.4. Contributions
1. Information-theoretic bounds: Formalizing conditions under which self-evaluation may provide weak evidence
2. Connections to prior work: A possible explanation for empirical results in self-correction and multi-agent debate
3. Practical architecture: A framework for generate-then-judge workflows, including same-model implementations via context separation
4. Worked examples: Illustrations of context-separated evaluation
1.5. Scope and Claims
This paper makes a narrow, practical claim: correlated error can make self-evaluation unreliable, and external selection channels can restore reliability. The bounds we present are conditional on explicit assumptions that may not hold in all settings. We do not claim that all self-evaluation fails. We claim the architecture provides a principled filter that can improve the efficiency of human-AI collaboration in settings where the assumptions apply.
3. When Self-Evaluation Fails
3.1. Main Result
The central claim is straightforward: reliable validation benefits from evaluation criteria whose errors are not strongly correlated with the generator’s. When evaluator error is coupled with generator error, self-evaluation becomes non-identifying: agreement provides negligible evidence of correctness.
We formalize this through two theorems: an information-theoretic bound and an evidence bound.
3.2. Information-Theoretic Formulation
The central quantity is $I(C; S \mid H)$: the information that the selector output $S$ provides about correctness $C$, given that we already observe the generator output $H$.
Theorem 3.1 (Information Bound via Shared Blind Spots). Let $Z$ be a latent variable. Assume the conditional independence $S \perp C \mid (H, Z)$. Then:
$$I(C; S \mid H) \le I(C; Z \mid H).$$
In particular, if $I(C; Z \mid H) = 0$, then $I(C; S \mid H) = 0$.
Proof. By the chain rule for conditional mutual information:
$$I(C; S, Z \mid H) = I(C; Z \mid H) + I(C; S \mid H, Z).$$
Under $S \perp C \mid (H, Z)$, we have $I(C; S \mid H, Z) = 0$, hence:
$$I(C; S, Z \mid H) = I(C; Z \mid H).$$
Also by the chain rule:
$$I(C; S \mid H) \le I(C; S, Z \mid H),$$
since conditional mutual information is nonnegative. Combining yields $I(C; S \mid H) \le I(C; Z \mid H)$. If additionally $I(C; Z \mid H) = 0$, then $I(C; S \mid H) = 0$ and the result follows. □
Interpretation. The information the selector provides about correctness is bounded by how much the blind spot variable Z “knows” about correctness beyond what the generator output already reveals. When Z is a pure nuisance variable (encoding only how the system fails, not whether it fails), self-evaluation provides zero additional information.
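To make the bound concrete, the following toy construction (a minimal numerical sketch; the specific probabilities and variable encodings are illustrative, not drawn from any experiment) builds a joint distribution over (C, H, Z, S) in which the selector score S is a deterministic function of (H, Z), so the conditional independence of Theorem 3.1 holds by construction, and then checks the inequality numerically.
import math
from collections import defaultdict
def build_joint():
    # Toy joint distribution over (C, H, Z, S). Z = 1 means the shared blind
    # spot is active; S is a deterministic function of (H, Z), which enforces
    # the conditional independence S ⊥ C | (H, Z) of Theorem 3.1.
    joint = defaultdict(float)
    p_z = {0: 0.8, 1: 0.2}
    for z, pz in p_z.items():
        # P(C | Z): the generator is usually correct unless the blind spot is active.
        p_c = {0: 0.1, 1: 0.9} if z == 0 else {0: 0.7, 1: 0.3}
        for c, pc in p_c.items():
            for h in (0, 1):
                # H is a noisy, surface-level rendering of correctness C.
                ph = 0.85 if h == c else 0.15
                # Selector accepts (S = 1) if the output looks fine (h = 1) or the
                # blind spot hides the problem (z = 1).
                s = 1 if (h == 1 or z == 1) else 0
                joint[(c, h, z, s)] += pz * pc * ph
    return joint
def cond_mi(joint, x_idx, y_idx, cond_idx):
    # I(X; Y | W) in bits, for a joint distribution keyed by outcome tuples.
    def marg(idxs):
        m = defaultdict(float)
        for key, p in joint.items():
            m[tuple(key[i] for i in idxs)] += p
        return m
    p_w = marg(cond_idx)
    p_xw = marg([x_idx] + cond_idx)
    p_yw = marg([y_idx] + cond_idx)
    p_xyw = marg([x_idx, y_idx] + cond_idx)
    total = 0.0
    for (x, y, *w), p in p_xyw.items():
        if p > 0:
            w = tuple(w)
            total += p * math.log2(p * p_w[w] / (p_xw[(x, *w)] * p_yw[(y, *w)]))
    return total
joint = build_joint()            # index order: 0=C, 1=H, 2=Z, 3=S
print("I(C; S | H) =", round(cond_mi(joint, 0, 3, [1]), 4))
print("I(C; Z | H) =", round(cond_mi(joint, 0, 2, [1]), 4))   # upper bound
In this construction both quantities are strictly positive and the first stays below the second, as the theorem requires.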
Lemma 3.2 (Post-Processing Cannot Increase Evidence). For any deterministic acceptance rule $A = a(S)$:
$$I(C; A \mid H) \le I(C; S \mid H).$$
This follows directly from the data processing inequality. The acceptance decision cannot contain more information than the selector score from which it derives.
3.3. Evidence Bound Formulation
Theorem 3.3 (Bounded Evidence from Acceptance). Assume the selector has a high false acceptance rate: $P(A = 1 \mid C = 0) \ge 1 - \varepsilon$ and $P(A = 1 \mid C = 1) \le 1$. Then the log-likelihood ratio contributed by observing $A = 1$ satisfies:
$$\log \frac{P(A = 1 \mid C = 1)}{P(A = 1 \mid C = 0)} \le \log \frac{1}{1 - \varepsilon}.$$
Proof. We have $P(A = 1 \mid C = 1) \le 1$ and $P(A = 1 \mid C = 0) \ge 1 - \varepsilon$, so:
$$\frac{P(A = 1 \mid C = 1)}{P(A = 1 \mid C = 0)} \le \frac{1}{1 - \varepsilon}.$$
Taking logarithms gives the result. □
Corollary 3.4 (Degenerate Evidence for Small $\varepsilon$). For small $\varepsilon$:
$$\log \frac{1}{1 - \varepsilon} \approx \varepsilon.$$
For $\varepsilon = 0.01$ (the selector accepts 99% of incorrect hypotheses), acceptance provides at most $\log_2(1/0.99) \approx 0.015$ bits of evidence, negligible compared to typical prior uncertainty.
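A quick numeric check of this corollary (the external comparison selector below is hypothetical, used only for contrast):
import math
# Evidence cap from Theorem 3.3: log2(1/(1 - eps)) bits per acceptance, when the
# selector accepts at least a fraction (1 - eps) of incorrect hypotheses.
for eps in (0.10, 0.05, 0.01):
    print(f"eps = {eps:.2f}: at most {math.log2(1 / (1 - eps)):.4f} bits")
# Contrast: a hypothetical external selector with false acceptance rate 0.05 and
# true acceptance rate 0.95 contributes log2(0.95 / 0.05), roughly 4.25 bits.
print(f"external selector: {math.log2(0.95 / 0.05):.2f} bits")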
Important caveat. These results are conditional on the stated assumptions. Many successful LLM applications violate these assumptions by incorporating external feedback channels (execution, formal verification, retrieval). When the selector accesses ground truth signals, the conditional independence fails in the favorable direction, and self-evaluation can provide substantial information. The negative results apply specifically to the regime where: (1) systematic generator failures exist with nontrivial probability, and (2) the selector shares the generator’s blind spots. We do not claim all self-evaluation fails, only that it may fail under these conditions.
3.4. The Confidence Amplification Problem
Worse than providing no information, correlated self-evaluation can amplify confidence in errors.
Lemma 3.5 (Repeated Self-Critique Bound). Consider $k$ selector outputs $S_1, \ldots, S_k$ with acceptance decisions $A_1, \ldots, A_k$. If $S_i \perp C \mid (H, Z)$ for all $i$, then:
$$I(C; A_1, \ldots, A_k \mid H) \le I(C; Z \mid H).$$
That is, $k$ critiques provide no more information about correctness than the single blind spot variable $Z$.
Proof. By the conditional independence assumption, $I(C; S_1, \ldots, S_k \mid H, Z) = 0$. The chain rule gives:
$$I(C; S_1, \ldots, S_k, Z \mid H) = I(C; Z \mid H) + I(C; S_1, \ldots, S_k \mid H, Z) = I(C; Z \mid H),$$
since the second term is zero. Because each $A_i$ is a function of $S_i$, the data processing inequality (Lemma 3.2) then gives $I(C; A_1, \ldots, A_k \mid H) \le I(C; S_1, \ldots, S_k \mid H) \le I(C; S_1, \ldots, S_k, Z \mid H) = I(C; Z \mid H)$. □
Proposition 3.6 (Confidence Amplification). Under strongly coupled error, repeated self-evaluation that produces consistent acceptance increases subjective confidence while providing no objective evidence.
Proof. If a system evaluates its hypothesis $k$ times and accepts each time, a naive Bayesian update treats these as independent evidence:
$$\frac{P(C = 1 \mid A_1 = 1, \ldots, A_k = 1)}{P(C = 0 \mid A_1 = 1, \ldots, A_k = 1)} = \frac{P(C = 1)}{P(C = 0)} \prod_{i=1}^{k} \frac{P(A_i = 1 \mid C = 1)}{P(A_i = 1 \mid C = 0)}.$$
Under strong error coupling, however, $A_1, \ldots, A_k$ are not independent conditional on $C$. If $C = 0$ and the shared blind spot is realized ($Z = z$ masks the error), all evaluations fail together:
$$P(A_1 = 1, \ldots, A_k = 1 \mid C = 0, Z = z) \approx 1.$$
By Lemma 3.5, the $k$ acceptances collectively provide no more information than a single evaluation. The apparent evidence of $k$ consistent acceptances is actually a single piece of (non-)information repeated $k$ times. But the subjective experience is $k$ “confirmations,” creating false confidence. □
This may contribute to a failure mode observed in extended LLM reasoning: increasing confidence in coherent, well-argued, wrong conclusions.
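The effect is easy to illustrate numerically (the acceptance rates below are illustrative, not measured): a naive update over k acceptances inflates the posterior, while under perfect coupling the k acceptances carry only the evidence of one.
def to_prob(odds):
    return odds / (1 + odds)
prior_odds = 1.0                              # prior P(C=1)/P(C=0) = 0.5/0.5
p_acc_correct, p_acc_incorrect = 0.95, 0.85   # illustrative acceptance rates
lr = p_acc_correct / p_acc_incorrect          # likelihood ratio of one acceptance
for k in (1, 3, 10):
    naive_odds = prior_odds * lr ** k   # naive: treats k acceptances as independent
    coupled_odds = prior_odds * lr      # perfectly coupled: only one piece of evidence
    print(f"k={k:2d}  naive P(C=1)={to_prob(naive_odds):.3f}  "
          f"coupled P(C=1)={to_prob(coupled_odds):.3f}")
The naive posterior climbs with k; the coupled posterior does not move past the first acceptance.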
3.5. When External Selection Works
The results above identify conditions under which self-evaluation may provide weak evidence. The contrapositive suggests when external selection may help:
Corollary 3.7 (External Selection Criterion). Evaluation provides substantial information about correctness when:
1. The selector accesses information not contained in $(H, Z)$, breaking the conditional independence $S \perp C \mid (H, Z)$
2. The selector’s blind spots have low overlap with the generator’s
3. The false acceptance rate satisfies $P(A = 1 \mid C = 0) \ll 1$
External selection channels that satisfy these criteria include: formal verification (accesses mathematical ground truth), executable tests (accesses computational ground truth), different model families (different $Z$), and fresh-context evaluation (partially resets $Z$).
3.6. Mechanistic Note: A Predictive Interpretation
The information-theoretic results above establish that self-evaluation can fail under correlated error. This section offers one interpretation of why such correlation may be structural in language models.
Bender et al. [3] characterize language models as systems that “stitch together sequences of linguistic forms... according to probabilistic information about how they combine, but without any reference to meaning.” Under this framing, when asked to evaluate a hypothesis, a language model predicts what evaluative text would likely follow the prompt, given its training distribution. This prediction inherits whatever patterns characterize human evaluation behavior in that distribution: prestige deference, format heuristics, social smoothing, and narrative continuation.
Alignment training (RLHF) shifts which human behavior is predicted but may not change the underlying operation. Sharma et al. [23] demonstrate that RLHF-trained models systematically exhibit sycophancy (responses matching user beliefs over truthful ones) and that human preference judgments favor sycophantic responses, creating a training signal toward agreement. This suggests that aligned models predict what a preferred human would say, which may favor continuation over contradiction.
A note on optimization targets. Standard system prompts optimize for “helpful assistant,” not “rigorous evaluator” or “truth-seeker.” Zheng et al. [29] systematically evaluated social roles in system prompts and found that the “helpful assistant” framing is nearly universal in commercial deployments, yet produces measurably different behavior than alternative framings. These are not equivalent objectives. A helpful assistant that tells a researcher their theory is fundamentally flawed may be accurate but scores poorly on helpfulness. The training signal favors diplomatic balance over harsh accuracy.
Falsifiable prediction. One observable phenomenon follows from this analysis, which readers can verify directly: format consistency. Submit manuscripts of wildly varying quality to any major language model with neutral prompts (e.g., “Tell me your thoughts on this manuscript”). We predict near-uniform response format: balanced positive and negative points regardless of input quality. No human population produces such format consistency on open-ended evaluation tasks. Humans are variable; they have strong reactions, skip sections, write three sentences or three pages depending on mood. The diplomatic balanced structure appearing consistently across queries is itself evidence of optimization toward neutral helpfulness rather than accurate assessment. This prediction is falsifiable. Readers who disagree are invited to test it.
User control through prompting. Critically, the behaviors described above are defaults, not constraints. Extensive research demonstrates that prompt design substantially affects model behavior, with performance differences of up to 76 percentage points from formatting changes alone [25]. Role prompting shifts reasoning performance dramatically; Kong et al. [13] report accuracy improvements from 53.5% to 63.8% on mathematical reasoning simply by changing the prompt framing. Zhuo et al. [30] introduce sensitivity metrics showing that prompt variations produce substantial and measurable behavioral shifts. If you prompt a model with explicit evaluation criteria (“identify all flaws,” “be maximally critical,” “act as a hostile reviewer seeking reasons to reject”), it will shift toward that behavior. The diplomatic balanced format emerges from the default “helpful assistant” framing; different framing produces different responses. This is not a limitation but a feature: users control the evaluation stance through prompting. However, the out-of-box configuration with fresh context and neutral prompts yields the default behavior, because that is what the implicit request specified. Expecting rigorous critical evaluation from a system prompted to be a “helpful assistant” is expecting something that was never requested.
Implication for context separation. This returns us to correlated error. When a model generates output under a “helpful assistant” framing, then evaluates that output under the same framing, error correlation is not incidental; it is structurally guaranteed. Both generation and evaluation optimize for the same objective. The problem is compounded by ambiguity in “helpful”: for the general public, validation often feels helpful; for researchers, identifying a fatal flaw before publication is the highest form of help. Commercial system prompts optimize for the first definition. Researchers need the second.
When the model evaluates its own output within the same context, it does not automatically shift to adversarial critic. The user’s initial prompt also persists: “help me draft this manuscript” combined with the system’s “be helpful” creates a trajectory toward supportive collaboration. Asking the same context to then “find the flaws” fights against accumulated framing. The implicit question remains: how would a helpful assistant assess work it just helped produce? The answer favors continuation over correction. Context separation helps because a fresh context with explicit critic framing resets the prediction target entirely. The goal is simple: better to find fatal flaws in private than to discover them at peer review.
Importantly, the system prompt in commercial deployments (ChatGPT, Claude, Gemini) is not user-editable. Users cannot inspect it, and the models are instructed not to reveal it. User instructions are layered on top of this hidden foundation. A prompt like “evaluate this critically” operates atop “be helpful,” not instead of it. Over extended context, the model may drift back toward its base framing. Agentic frameworks built on these APIs inherit the same constraint. From an academic standpoint, relying on undisclosed evaluation criteria is methodologically problematic; the equivalent in peer review would be anonymous reviewers whose instructions and biases are hidden by design.
This interpretation is consistent with the formal analysis but does not depend on it. The information-theoretic bounds hold regardless of the underlying mechanism.
4. Selection Pressure Across Domains
This observation is not specific to AI systems. Selection pressure (a mechanism that determines which configurations persist) appears across biological, physical, and scientific domains. We present these parallels as motivation for treating external selection as a general pattern rather than a domain-specific observation.
4.1. Self-Reference Limitations
Gödel’s incompleteness theorems establish that sufficiently powerful formal systems cannot prove their own consistency [8]. The structural parallel is suggestive: self-reference creates blind spots. A system that generates claims cannot fully validate those claims using only internal resources.
This mirrors a familiar experience in software engineering: developers cannot effectively QA their own code. The problem is not laziness or lack of intelligence. It is that the developer knows how the code is supposed to work and cannot clear that context when testing. The same cognitive patterns that produced the bug prevent recognizing it as a bug.
4.2. Selection Pressure Across Domains
Selection pressure provides the external criterion that distinguishes signal from noise. We observe this structure across engineering, scientific, and reasoning domains:
The pattern is consistent: perturbation generates variation, selection determines what persists, and surviving configurations amplify.
We argue this pattern more closely reflects how human reasoning actually works. A theory is rarely written in a single session. It is returned to with fresh eyes the next morning, reviewed by colleagues with uncorrelated blind spots, revised after a week away from the problem. Each return provides external selection: the researcher encounters only the output, not the reasoning trace that produced it.
Consider the contrast with extended chain-of-thought in a single context. The model generates a draft, evaluates it, and declares it ready, all while anchored to the reasoning that produced the draft. In our experience, a manuscript declared “ready” in a long context session will be identified as flawed when pasted into a fresh context. This can repeat indefinitely: fix the issues, return to the original context, declare it ready again, paste into fresh context, find new issues. The fresh context sees what the anchored context cannot.
This is not a claim we can prove formally, but an observation about why context separation may better simulate the iterative, externally-grounded process by which human reasoning converges on reliable conclusions.
The engineering cases in Table 1 show this pattern is already standard practice. Fuzzers generate random inputs; programs that survive without crashing have demonstrated robustness. Monte Carlo methods propose random configurations; those satisfying constraints map the viable solution space. In each case, perturbation without selection produces nothing; perturbation with external selection produces progress.
4.3. External Selection in Practice: Formal Verification
Recent work in theorem proving illustrates external selection concretely. The Prover Agent framework [1] coordinates an informal reasoning LLM with the Lean proof assistant, where Lean provides external verification. Using relatively small language models, the system achieved 88.1% accuracy on the MiniF2F benchmark, outperforming approaches using larger models without external verification.
The mechanism aligns with our analysis. The LLM generates candidate proofs (high-entropy generation). Lean verifies whether the proof compiles, a criterion external to the LLM’s training and biases (external selection). Errors detected by Lean feed back to the LLM for refinement. The external selection channel reduces correlation between generator error and evaluator error, because Lean’s verification depends on mathematical truth, not on patterns in training data.
This is consistent with our analysis: external criteria with independent failure modes can enable validation that self-evaluation may struggle to achieve.
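As a minimal mechanical illustration of this kind of external selection (our example, not taken from the Prover Agent paper), the Lean kernel accepts or rejects a claim based only on whether it checks, independent of how fluent the surrounding argument is:
-- Accepted: the kernel verifies this against mathematical ground truth.
example : 2 + 2 = 4 := rfl
-- Rejected at compile time, no matter how confidently it is asserted:
-- example : 2 + 2 = 5 := rfl   -- fails: the two sides are not definitionally equal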
4.4. Persistence as the Criterion for Reliability
There is a practical equivalence here that bears stating plainly: in any workflow, what we treat as reliable is what survives independent checks. These are not two separate properties; they are one property described from two directions.
In scientific practice, we can only build on what we can measure and replicate. What persists under repeated, independent measurement is what we call “established.” This is not a limitation of our methods; it is the operational definition of reliability.
This equivalence clarifies why external selection matters. Without independent verification, there is no selection pressure. Without selection pressure, there is no distinction between signal and noise. Without that distinction, no claim becomes “established”; everything remains undifferentiated conjecture.
The context-separated architecture provides what single-context reasoning lacks: an external evaluation that creates selection pressure. One context’s output becomes another context’s input for critique. The evaluation is external to the generation trace. Signal can be distinguished from noise.
5. Multi-Agent Verification
If self-evaluation fails under correlated error, how can multi-agent systems succeed?
5.1. Breaking Correlation
Multi-agent verification helps when it introduces selectors whose error is less correlated with the generator.
Definition 5.1 (External criterion). An external criterion is a selection mechanism depending on information not fully controlled by the generator.
Proposition 5.2 (Multi-agent advantage). If at least one selector has a false acceptance rate $P(A = 1 \mid C = 0) \le \delta$ for relevant error classes, then acceptance by that selector provides up to $\log_2(1/\delta)$ bits of evidence about correctness.
The key is diversity of failure modes. A selector trained on different data, using different architecture, or implementing formal verification will fail on different inputs than the generator.
5.2. External Selection Channels
Effective external selection channels include:
Formal verification. Proof assistants (Lean, Coq, Isabelle) and type checkers provide selection under mathematical ground truth. A proof that compiles is verified by mathematics itself, not by a correlated neural network.
Executable verification. Unit tests, property-based tests, and simulation checks provide selection under computational ground truth. Code that passes tests satisfies constraints external to the generator. This may help explain why LLMs have become effective at coding tasks: the code interpreter provides built-in external selection. The interpreter does not care how confident the model was; the code runs or it throws an error. (A minimal sketch of such a check appears at the end of this subsection.)
Fresh context evaluation. The same model under fresh context, without the reasoning chain that produced the candidate, can provide meaningful critique. The error-producing context is absent, so the evaluator cannot “see” the blind spot that caused the error. This is analogous to why a different developer finds bugs that the original author missed: not because of superior skill, but because they lack the mental model that made the bug invisible.
Definition 5.3 (Context separation). Two evaluation contexts are separated if the evaluating context has no access to: (1) the generation trace (intermediate reasoning steps), (2) the prompt scaffolding (instructions that shaped generation), or (3) hidden state from the generation process.
Limitations of same-model context separation. Fresh context reduces error correlation by removing the generator’s intermediate reasoning trace and local prompt scaffolding. However, it does not remove correlated failure modes that originate in the model’s parameters or training distribution. Same weights means shared inductive biases remain. Context separation is therefore a partial solution: it breaks correlation introduced during generation, but not correlation baked into the model itself. For maximum independence, context separation should be combined with model diversity or external tools.
Numerical invariants. Dimensional analysis, conservation laws, symmetry constraints, and sanity bounds provide selection under physical ground truth. A derivation that violates energy conservation fails regardless of how convincing it sounds.
Retrieval-grounded checking. Citation verification against a fixed corpus, exact quote attribution, and fact-checking against authoritative sources provide selection under documentary ground truth.
Independent critics. Models with different training data, different architectures, or different optimization objectives have partially independent failure modes.
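The executable-verification sketch promised above is as follows; candidate_sort is a stand-in for model-generated code (our example, with a deliberately planted bug), and the verdict depends only on execution against a reference behavior, not on how confident the generator was.
import random
def candidate_sort(xs):
    # Stand-in for model-generated code under evaluation; deliberately buggy
    # (it silently drops duplicate elements).
    return sorted(set(xs))
def external_check(fn, trials=1000, seed=0):
    # Accept fn only if it matches the reference behavior on random inputs.
    rng = random.Random(seed)
    for _ in range(trials):
        xs = [rng.randint(-50, 50) for _ in range(rng.randint(0, 20))]
        if fn(list(xs)) != sorted(xs):
            return False, xs                  # counterexample found
    return True, None
ok, counterexample = external_check(candidate_sort)
print("accepted" if ok else f"rejected, counterexample: {counterexample}")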
5.3. Why Independence Matters
Perfect independence is not required. What matters is that the joint failure probability is lower than individual failure:
$$P(E_G \cap E_S) \ll P(E_G),$$
where $E_G$ and $E_S$ denote the events that the generator and the selector fail on a given input.
Even partially independent selectors compound evidence. This helps explain the empirical finding that diverse multi-agent panels outperform single-agent self-evaluation even when individual agents have similar capability [7,17].
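A back-of-the-envelope calculation (illustrative numbers only) makes the point: what matters is how often the selector fails on exactly the inputs where the generator fails.
p_gen_fail = 0.20                     # illustrative generator error rate
for p_sel_fail_given_gen_fail in (1.0, 0.5, 0.1):
    # Probability that an error is produced AND survives the selector.
    p_undetected = p_gen_fail * p_sel_fail_given_gen_fail
    print(f"P(selector fails | generator fails) = {p_sel_fail_given_gen_fail:.1f}"
          f"  ->  undetected error rate = {p_undetected:.2f}")
Fully coupled selection (the first case) leaves the generator’s error rate untouched; even partial decorrelation cuts the undetected error rate substantially.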
5.4. Persona-Based Diversity
The predictive mechanism described in Section 3.6 suggests a strategy for achieving decorrelated evaluation within a single model: cast the evaluator as different expert types.
When prompted as “a rigorous mathematician checking for proof gaps,” the model predicts what such an expert would say. When prompted as “a skeptical physicist checking dimensional consistency,” it predicts different behavior. When prompted as “a journal referee looking for reasons to reject,” different still.
Each persona predicts a different distribution of expert behavior, with different priorities, different blind spots, and different failure modes. An error that survives the mathematician may not survive the physicist. A claim that passes the physicist may not pass the referee.
This creates epistemic diversity without model diversity. Multiple personas, each on clean context, provide partially independent evaluations that can be aggregated. The decorrelation arises not from different weights but from different prediction targets.
Implementation details (specific persona prompts, aggregation logic, consensus mechanisms) are left to future work. The principle is that single-model architectures can achieve meaningful diversity by exploiting the predictive nature of language models rather than fighting it.
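A minimal sketch of the idea, reusing the model()/fresh_context() interface from the pseudocode in Section 6.8 (the persona wordings and the aggregation rule here are placeholders, not the deferred implementation):
PERSONAS = [
    "a rigorous mathematician checking for proof gaps",
    "a skeptical physicist checking dimensional consistency",
    "a journal referee looking for reasons to reject",
]
def persona_panel(model, fresh_context, artifact):
    # One fresh context per persona: each critic sees only the artifact,
    # never the reasoning trace or the other critics' verdicts.
    verdicts = []
    for persona in PERSONAS:
        ctx = fresh_context()
        prompt = (f"You are {persona}. Identify the most serious flaws in the "
                  f"following work, then state ACCEPT or REJECT with reasons.\n\n"
                  f"{artifact}")
        verdicts.append(model(ctx, prompt))
    return verdicts   # e.g., escalate to human review if any critic rejects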
6. A Context-Separated Architecture
We now present a practical architecture implementing external selection.
6.1. Design Goals
1. Maximize exploration: Generate diverse candidates without premature filtering
2. Ensure rigor: Select only candidates surviving external validation
3. Enable iteration: Feed selection results back to improve generation
4. Preserve human judgment: Surface candidates for human review, don’t replace human decision-making
6.2. Architecture Components
Definition 6.1 (Generator). The generator component produces candidate hypotheses at high entropy, exploring the space of possibilities without prejudice.
Definition 6.2 (Selector). The selector component evaluates candidates against external criteria at low entropy, selecting configurations that survive validation.
Definition 6.3 (Feedback). The Feedback component updates generation context based on selection outcomes, biasing future proposals toward surviving structures.
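One way to make these components concrete (a sketch of the data structures implied by the pseudocode in Section 6.7; the field names are ours, not a prescribed interface):
from dataclasses import dataclass, field
@dataclass
class Verdict:
    passed: bool = True
    checks: list = field(default_factory=list)       # (channel, result) pairs
    rationale: str = ""
@dataclass
class GenerationContext:
    constraints: list = field(default_factory=list)  # lessons from rejected candidates
    templates: list = field(default_factory=list)    # surviving structures to reuse
    survivors: list = field(default_factory=list)
    def add_constraint(self, rationale):
        self.constraints.append(rationale)
    def add_template(self, hypothesis):
        self.templates.append(hypothesis)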
6.3. High-Entropy Generation
The generator component should:
- Operate at high temperature or use explicit diversity objectives
- Generate candidates spanning the hypothesis space, including edge cases
- Avoid premature self-filtering that would narrow the search
- Include alternative assumptions and boundary conditions
Implementation options:
- High-temperature sampling from a single model
- Ensemble sampling from multiple models
- Structured exploration of assumption variations
- Adversarial generation targeting unexplored regions
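The generator sketch in Section 6.7 calls a diversity_check that is left abstract; one minimal lexical realization (ours, with an arbitrary similarity threshold) is:
def diversity_check(hypothesis, accepted, threshold=0.6):
    # Reject a candidate that is too lexically similar to one already kept.
    # Token-level Jaccard similarity is a crude proxy for semantic diversity;
    # the threshold value is an illustrative choice, not a recommendation.
    new_tokens = set(hypothesis.lower().split())
    for prior in accepted:
        prior_tokens = set(prior.lower().split())
        union = new_tokens | prior_tokens
        if union and len(new_tokens & prior_tokens) / len(union) > threshold:
            return False
    return True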
6.4. External Selection
The selector component should:
- Operate at low temperature for consistency
- Apply explicit checklists and external tools
- Produce structured verdicts with rationales
- Flag uncertainty rather than forcing binary decisions
Selection criteria hierarchy:
1. Formal: Does it compile/prove/type-check?
2. Executable: Does it pass tests?
3. Numerical: Does it satisfy invariants?
4. Grounded: Do citations check out?
5. Adversarial: Does it survive independent critique?
6.5. Learning from Selection
The feedback component should:
- Add constraints that killed candidates to future prompts
- Preserve successful patterns as templates
- Escalate ambiguous survivors to human review
- Track failure modes for architecture improvement
6.6. Distinguishing from Related Architectures
The key distinction: context-separated evaluation uses persistence under external criteria as the selection signal, not realism (GAN), preference (RLHF), or self-agreement (self-consistency).
Table 2. Comparison with Related Architectures.
| Architecture | Selection criterion | Goal | External? |
| --- | --- | --- | --- |
| GAN | Discriminator fooled | Realism | No (co-trained) |
| RLHF | Human preference | Alignment | Partially |
| Self-consistency | Agreement across samples | Confidence | No (correlated) |
| Context-separated | External validation | Truth | Yes |
6.7. Implementation Sketch
GENERATE(context, diversity_target):
candidates = []
for i in 1..k:
hypothesis = generate(context, temperature=HIGH)
if diversity_check(hypothesis, candidates):
candidates.append(hypothesis)
return candidates
SELECT(hypothesis, context):
verdict = {passed: True, checks: [], rationale: ""}
# Formal checks
if has_proof_component(hypothesis):
proof_result = lean_check(hypothesis.proof)
verdict.checks.append(("formal", proof_result))
if not proof_result.success:
verdict.passed = False
# Numerical checks
for invariant in domain_invariants:
inv_result = check_invariant(hypothesis, invariant)
verdict.checks.append(("numerical", inv_result))
if not inv_result.success:
verdict.passed = False
# Adversarial checks
for critic in independent_critics:
critique = critic.evaluate(hypothesis)
verdict.checks.append(("adversarial", critique))
if critique.fatal_flaw:
verdict.passed = False
verdict.rationale = synthesize_rationale(verdict.checks)
return verdict
FEEDBACK(candidates, verdicts, context):
new_context = context
for (h, v) in zip(candidates, verdicts):
if not v.passed:
new_context.add_constraint(v.rationale)
else:
new_context.add_template(h)
    new_context.survivors = [h for (h, v) in zip(candidates, verdicts) if v.passed]
return new_context
MAIN_LOOP(initial_context):
context = initial_context
all_survivors = []
for round in 1..max_rounds:
candidates = GENERATE(context, diversity_target)
verdicts = [SELECT(h, context) for h in candidates]
context = FEEDBACK(candidates, verdicts, context)
all_survivors.extend(context.survivors)
if convergence_criterion(context):
break
return prioritize_for_human_review(all_survivors)
6.8. Same-Model Implementation via Context Separation
An important observation: this architecture does not require multiple models. The same model under fresh context, without the reasoning chain that produced the candidate, can serve as the selector. The key is context separation, not model separation.
CONTEXT_SEPARATED_EVALUATION(problem):
# Context A: Generate with prediction
context_a = fresh_context()
response_a = model(context_a,
"First predict what you think the answer is, then solve: " + problem)
# Context B: Steelman AND attack (fresh, no access to reasoning)
context_b = fresh_context() # Critical: no shared state
critique = model(context_b,
"Here is a proposed solution. Provide both:
1) Steelman: the strongest case FOR this solution
2) Attack: the strongest case AGAINST this solution
Solution: " + response_a.answer_only)
# Context A: Judge coherence
judgment = model(context_a,
"A critic provided steelman and attack arguments.
Which is more coherent with the problem structure?
Steelman: " + critique.steelman +
"Attack: " + critique.attack)
return judgment
This implementation has several advantages:
Lower cost: Fresh context with no history is faster and cheaper than extended chain-of-thought in a single context.
Broken correlation: Context B cannot see Context A’s reasoning errors, only the output. The blind spot that caused the error is absent.
Simultaneous opposition: Requesting both steelman and attack forces the model to genuinely consider both sides rather than anchoring on one.
Temperature control: High temperature in generation (exploration), low temperature in critique (precision).
Empirically, we observe higher disagreement rates between same-model fresh-context critics and same-context self-critique on incorrect solutions. The fresh-context critic catches errors that self-critique misses, consistent with our analysis.
6.9. Initial Observations and Future Validation
This section describes qualitative observations that motivated the theoretical framework above. These are not controlled experiments and we explicitly disclaim strong empirical conclusions. Our goal here is transparency about the origins of this work, not empirical proof.
During development of this methodology, we observed the following patterns when submitting manuscripts to fresh-context instances of commercial language models:
Prompt sensitivity: Minor framing changes (e.g., “thoughts” vs “honest review”) produced dramatically different evaluations of identical content.
In-context persuasion: Critics who heard the author’s defense often revised harsh assessments to positive ones, suggesting context sharing may correlate evaluator judgment with author framing.
Fresh-context disagreement: Multiple fresh-context evaluations of the same manuscript frequently disagreed with each other, while self-evaluation produced consistent (but potentially unreliable) agreement.
These observations are indicative, not conclusive. They motivated the formal analysis in Sections 2–3 but do not constitute validation of it.
Falsifiability and future work. This analysis makes testable predictions: (1) self-evaluation should show higher error correlation than context-separated evaluation; (2) fresh-context critics should catch errors that in-context self-critique misses; (3) disagreement among independent evaluators should correlate with actual uncertainty about correctness.
Rigorous validation requires controlled experiments across multiple models, systematic variation of prompts and system configurations, and proper statistical methodology. This is beyond the scope of a methodology paper and is left to future work. A companion paper will detail empirical results across model families, prompt variations, and evaluation criteria.
Collaboration invited. We recognize that experimental design for evaluating LLM self-assessment is methodologically challenging. We welcome collaboration on reducing bias in experimental protocols. Complete conversation logs from our preliminary observations are available on request.
Additional observations. We also observed evaluation sensitivity to surface features: formatting changes, word choice (e.g., “novel” vs “improved”), and acknowledged prestige bias when we asked models directly whether author reputation would affect their assessment. These patterns are consistent with LLM evaluation inheriting human biases from training data, but we do not claim these observations as experimental findings.
Negative results. Context separation does not eliminate the evaluation format described above. When evaluating low-quality work, the balanced positive/negative structure persists; the model fills the expected “positive points” slots with increasingly tenuous or hallucinated content rather than breaking format to deliver an unbalanced negative assessment. The RLHF-trained structure appears more stable than accuracy. Context separation reduces correlated error but does not override the formatting prior.
Stop condition. In practice, iterative critique terminates when critics begin producing hallucinated criticism: attacks referencing problems not present, repeating addressed points, or focusing on irrelevant details. Detection requires human judgment.
9. Related Work
LLM self-correction: the empirical foundation. Huang et al. [11] established empirically that “large language models cannot self-correct reasoning yet,” showing that without external feedback, self-correction attempts often fail or degrade performance. Our work offers one possible explanation: correlated error between generator and evaluator can render self-evaluation non-identifying. Where Huang et al. documented the phenomenon, we formalize one mechanism by which it can occur. Their finding that multi-agent critique with same-model copies performed “no better than self-consistency” is consistent with our analysis: identical models share error distributions, so adding copies may not break correlation. Our analysis suggests a fix: context separation or genuinely independent evaluation channels.
Self-consistency and majority voting. Wang et al. [26] demonstrated that sampling multiple reasoning paths and taking the majority answer improves accuracy. This works when errors are uncorrelated across samples: some paths succeed while others fail independently. Our analysis clarifies the limitation: if all samples share the same systematic bias (high error correlation), majority voting cannot help. The gains from self-consistency diminish as correlation increases, explaining why the technique works better for some tasks than others.
Multi-agent debate and verification. Multi-agent systems including AutoGen [27], CAMEL [16], MetaGPT [10], and debate frameworks [7,17] explore collaborative reasoning. Chen et al. [4] found that model diversity among agents was important for performance gains, consistent with the idea that breaking error correlation matters. We attempt to formalize why multi-agent verification can help (decorrelated failure modes) and suggest that context separation within a single model may achieve similar benefits.
Process supervision. Lightman et al. [18] showed that supervising each reasoning step (process supervision) significantly outperforms supervising only final answers (outcome supervision). This aligns with our analysis: per-step external feedback breaks the model’s “solo reasoning bubble,” preventing error accumulation within a single correlated context. Process supervision is an instance of external selection applied during training.
External verification and tool use. Training separate verifier models [5] achieved large gains on math problems. Integration with formal provers [1,21,28] provides external selection via mathematical ground truth. The CRITIC framework [9] showed that tool-interactive critiquing improves self-correction. These results are consistent with the view that “external” can mean different models, formal tools, or execution environments.
Apparent counterexamples. Some work reports successful self-correction: Self-Refine [19] for iterative text improvement, Reflexion [22] for agent learning, and Constitutional AI [2] for safety. We reconcile these with our analysis by noting they typically address (a) style and format rather than deep reasoning, (b) cases where oracle feedback is implicitly available, or (c) safety constraints that are well-represented in training data. When Huang et al. [11] removed oracle feedback from self-correction setups, improvements vanished, suggesting that apparent self-correction may often rely on hidden external signals.
LLM-as-a-Judge biases. Recent empirical work has documented systematic biases in LLM self-evaluation. Panickssery et al. [20] found significant self-preference bias: LLMs assign higher scores to outputs with lower perplexity, preferring text more familiar to them. Li et al. [15] identified 12 major latent biases in LLM-as-a-Judge systems, including positional bias and self-enhancement bias. These empirical findings align with our theoretical analysis: evaluation that shares the generator’s distribution will exhibit correlated error.
Ensemble diversity and error decorrelation. The insight that independent errors enable reliable aggregation is foundational in ensemble learning [14]. We apply this insight to LLM self-evaluation, where error correlation may be particularly relevant due to shared training data, weights, and context. We connect ensemble theory to information theory: the conditional mutual information $I(C; S \mid H)$ can approach zero under high correlation [6], which would explain why self-evaluation becomes uninformative in such cases.
AI for science. AlphaFold [12] demonstrates AI solving well-defined scientific problems with clear evaluation criteria. Our focus is the less-structured setting of hypothesis generation and validation, where ground truth is not known in advance and external selection must be actively constructed.