Submitted: 01 February 2026
Posted: 02 February 2026
Abstract
Keywords:

1. Introduction
- Instrument validation: TRUE > SCRAMBLED > COLD ordering in 14/16 model-domain combinations, demonstrating that RCI measures coherent structure, not mere token presence
- Vendor signatures: Systematic differences in context utilization strategies (F=6.52, p=0.0017)
- Protocol sensitivity: Cross-domain comparisons affected by methodological differences (detailed in Methods 2.9)
- Safety interference: Progressive content filtering affects research accessibility across vendors
2. Methods
2.1. Three-Condition Protocol
- TRUE: Full, coherent conversation history accumulates naturally. Each prompt includes all prior exchanges in correct order.
- COLD: No history—each prompt sent independently as a fresh conversation.
- SCRAMBLED: History present but order randomized, controlling for token presence versus coherent meaning.
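For concreteness, a minimal sketch of how these three conditions can be assembled as chat-API message lists. The function and variable names are ours, not from the released scripts, and exchange-level shuffling for SCRAMBLED is an assumption (the text says only that order is randomized):

```python
# Illustrative sketch (not the released code): assemble the message list
# for one prompt under each condition.
import random

def build_messages(history, prompt, condition, seed=0):
    """history: list of (user_text, assistant_text) tuples from prior turns."""
    if condition == "COLD":
        exchanges = []                               # no history: fresh conversation
    elif condition == "TRUE":
        exchanges = list(history)                    # full history, original order
    elif condition == "SCRAMBLED":
        exchanges = list(history)
        random.Random(seed).shuffle(exchanges)       # same tokens, randomized order
    else:
        raise ValueError(f"unknown condition: {condition}")

    messages = []
    for user_text, assistant_text in exchanges:
        messages.append({"role": "user", "content": user_text})
        messages.append({"role": "assistant", "content": assistant_text})
    messages.append({"role": "user", "content": prompt})
    return messages
```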
2.2. RCI Calculation
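As described in Methods 2.9, the revised protocol computes RCI from response-response alignment: the cosine similarity between embeddings of the TRUE-context and no-context responses to the same prompt. Below is a minimal sketch of that alignment primitive, assuming the 384-dimensional all-MiniLM-L6-v2 checkpoint (an assumption consistent with the pinned sentence-transformers dependency and the 384-dimensional embeddings noted in Limitations); the aggregation of these similarities into a signed, per-trial RCI follows the paper's definition.

```python
# Sketch of the alignment primitive behind RCI: cosine similarity between
# sentence embeddings of two responses. Checkpoint choice is an assumption.
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed 384-dim checkpoint

def alignment(response_true: str, response_cold: str) -> float:
    """Cosine similarity between TRUE-context and no-context responses."""
    a, b = model.encode([response_true, response_cold])
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
```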
2.3. Pattern Classification
- CONVERGENT: RCI > 0, p < α — history helps
- NEUTRAL: RCI ≈ 0, p ≥ α — history irrelevant
- SOVEREIGN: RCI < 0, p < α — history hurts
Here α = 0.00119 is the Bonferroni-corrected significance threshold (Section 2.6).
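These rules amount to a two-step decision; a minimal transcription in Python (the constant mirrors the threshold defined in Section 2.6):

```python
ALPHA = 0.00119  # Bonferroni-corrected significance threshold (Section 2.6)

def classify(mean_rci: float, p_value: float) -> str:
    """Classify a model's relational pattern from its mean RCI and p-value."""
    if p_value >= ALPHA:
        return "NEUTRAL"      # RCI ≈ 0 in the sense of "not reliably nonzero"
    return "CONVERGENT" if mean_rci > 0 else "SOVEREIGN"
```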
2.4. Models Tested
- OpenAI: GPT-4o, GPT-4o-mini, GPT-5.2
- Anthropic: Claude Opus 4.5, Claude Haiku 4.5
- Google: Gemini 2.5 Flash, Gemini 2.5 Pro
2.5. Domains
- Philosophy (Open-ended): 30 prompts on consciousness, free will, self-reference. High uncertainty, multiple valid perspectives.
- Medicine (Guideline-anchored): 30 prompts on STEMI protocol, ACS management. High certainty, evidence-based guidelines (Singhal et al., 2023; Nori et al., 2023).
2.6. Statistical Analysis
- Within-model: Paired t-tests for TRUE vs COLD comparisons
- Between-vendor: One-way ANOVA
- Effect size: Cohen’s d (Cohen, 1988) with pooled standard deviation
- Multiple comparisons: Bonferroni correction (Dunn, 1961)
- Power analysis: Minimum detectable effect size (MDES) calculated for α = 0.00119, power = 0.80
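A sketch of the core statistics above, using scipy. Inputs are per-trial scores for one model; the array names are illustrative placeholders, not from the released code.

```python
# Within-model test: paired t-test (TRUE vs COLD) and Cohen's d with pooled SD.
import numpy as np
from scipy import stats

def within_model(true_scores: np.ndarray, cold_scores: np.ndarray):
    t, p = stats.ttest_rel(true_scores, cold_scores)
    pooled_sd = np.sqrt((true_scores.var(ddof=1) + cold_scores.var(ddof=1)) / 2)
    d = (true_scores.mean() - cold_scores.mean()) / pooled_sd
    return t, p, d

# Between-vendor: stats.f_oneway(openai_rci, anthropic_rci, google_rci)
# Bonferroni: alpha = 0.05 / m for m comparisons; m = 42 would reproduce the
# reported threshold of 0.00119 (our inference, not stated in the text).
```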
2.7. Trial Structure
2.8. Data Collection Note
2.9. Methodological Evolution Note
Original protocol:
- RCI calculation: Prompt-response alignment (cosine similarity between prompt and response embeddings)
- History handling: Last 5 conversation turns included via system message
- Max tokens: 300
- Trial structure: Single prompt per trial, cycling through 30 prompts over 100 trials

Revised protocol:
- RCI calculation: Response-response alignment (cosine similarity between true-context and no-context responses for identical prompts)
- History handling: Full conversation accumulation across all prior turns
- Max tokens: 1024
- Trial structure: 30 prompts per trial, all conditions run sequentially
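To make the history-handling difference concrete, a hedged sketch of the two schemes; the message formats are illustrative and may differ from the released scripts.

```python
def history_original(turns, k=5):
    """Original protocol: last k exchanges packed into one system message."""
    packed = "\n".join(f"User: {u}\nAssistant: {a}" for u, a in turns[-k:])
    return [{"role": "system", "content": f"Prior conversation:\n{packed}"}]

def history_revised(turns):
    """Revised protocol: full accumulation as alternating chat turns."""
    msgs = []
    for u, a in turns:
        msgs += [{"role": "user", "content": u},
                 {"role": "assistant", "content": a}]
    return msgs
```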
Philosophy domain results (Section 3.1). Conv% = percentage of trials classified CONVERGENT.

| Model | Mean RCI | 95% CI | Pattern | Conv% | p-value |
|---|---|---|---|---|---|
| GPT-4o | -0.005 | [-0.027, 0.017] | NEUTRAL | 45% | 0.64 |
| GPT-4o-mini | -0.009 | [-0.033, 0.015] | NEUTRAL | 50% | 0.45 |
| GPT-5.2 | +0.310 | [0.307, 0.313] | CONVERGENT | 100% | <10⁻¹⁰⁰ |
| Claude Opus | -0.036 | [-0.057, -0.015] | SOVEREIGN | 36% | 0.001 |
| Claude Haiku | -0.011 | [-0.034, 0.013] | NEUTRAL | 46% | 0.37 |
| Gemini 2.5 Pro | -0.067 | [-0.099, -0.034] | SOVEREIGN | 31% | <0.001 |
| Gemini 2.5 Flash | -0.038 | [-0.062, -0.013] | SOVEREIGN | 28% | 0.003 |


Medical domain results (Section 3.2). Gemini 2.5 Pro is absent because its medical prompts were blocked by safety filters (Section 3.7).

| Model | Mean RCI | 95% CI | Pattern | Conv% | p-value |
|---|---|---|---|---|---|
| GPT-4o | +0.299 | [0.296, 0.302] | CONVERGENT | 100% | <10⁻⁴⁸ |
| GPT-4o-mini | +0.319 | [0.316, 0.322] | CONVERGENT | 100% | <10⁻⁵² |
| GPT-5.2 | +0.379 | [0.373, 0.385] | CONVERGENT | 100% | <10⁻⁴⁶ |
| Claude Haiku | +0.340 | [0.337, 0.343] | CONVERGENT | 100% | <10⁻⁴² |
| Claude Opus | +0.339 | [0.334, 0.344] | CONVERGENT | 100% | <10⁻⁴⁰ |
| Gemini 2.5 Flash | -0.133 | [-0.140, -0.126] | SOVEREIGN | 0% | <10⁻³⁷ |

Cross-domain pattern shifts, philosophy vs. medical (Section 3.3).

| Model | Philosophy | Medical | Shift | Cohen’s d |
|---|---|---|---|---|
| GPT-4o | -0.005 (NEUTRAL) | +0.299 (CONV) | +0.304 | 2.78 (very large) |
| GPT-4o-mini | -0.009 (NEUTRAL) | +0.319 (CONV) | +0.328 | 2.71 (very large) |
| GPT-5.2 | +0.310 (CONV) | +0.379 (CONV) | +0.069 | 3.82 (very large) |
| Claude Haiku | -0.011 (NEUTRAL) | +0.340 (CONV) | +0.351 | 4.25 (very large) |
| Claude Opus | -0.036 (SOV) | +0.339 (CONV) | +0.375 | 4.02 (very large) |
| Gemini Flash | -0.038 (SOV) | -0.133 (SOV) | -0.095 | 0.42 (small) |


Three-condition convergence scores from the SCRAMBLED control (Section 3.5).

| Model | TRUE | SCRAMBLED | COLD | Pattern |
|---|---|---|---|---|
| GPT-5.2 (Phil) | 1.000 | 0.759 | 0.690 | TRUE > SCRAM > COLD |
| GPT-5.2 (Med) | 1.000 | 0.768 | 0.621 | TRUE > SCRAM > COLD |
| GPT-4o (Med) | 1.000 | 0.829 | 0.701 | TRUE > SCRAM > COLD |
| Claude Haiku (Med) | 1.000 | 0.729 | 0.660 | TRUE > SCRAM > COLD |
| Gemini Flash (Med) | 0.555 | 0.560 | 0.688 | COLD > SCRAM ≈ TRUE |

Gemini safety-filter outcomes by model generation (Section 3.7).

| Model | Philosophy | Medical |
|---|---|---|
| Gemini 2.5 Flash | ✓ Allowed | ✓ Allowed |
| Gemini 2.5 Pro | ✓ Allowed | × Blocked |
| Gemini 3 Pro | × Blocked | × Blocked |
3. Results
3.1. Philosophy Domain: NEUTRAL/SOVEREIGN Patterns
3.2. Medical Domain: CONVERGENT Patterns
3.3. Exploratory Cross-Domain Comparison (Protocol Limitations Apply)
3.4. GPT-5.2: The Outlier
- 100% CONVERGENT in both philosophy AND medicine (only model)
- 150 trials, zero SOVEREIGN or NEUTRAL trials
- Lowest variance: σ = 0.014 (philosophy), σ = 0.021 (medical); CV = 0.046 and 0.055, respectively
- Comparison: Other models show CV = 2.5–21.5
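(CV here is the per-trial standard deviation divided by the absolute mean RCI: 0.014 / 0.310 ≈ 0.046 for philosophy and 0.021 / 0.379 ≈ 0.055 for medicine.)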
3.5. SCRAMBLED Condition: Coherence Matters
3.6. Vendor Effects
3.7. Gemini Safety Filter Progression
3.8. Statistical Robustness
4. Discussion
4.1. Revisiting Domain Effects (Revised in v2)
4.1.1. A Note on Response Quality
| Model | Pattern | Insight Quality | Entanglement Growth |
|---|---|---|---|
| Claude Opus | SOVEREIGN | Highest, consistent | +432% |
| GPT-4o | NEUTRAL | Variable, drops to 0 | +383% |
| Gemini Pro | SOVEREIGN | Strong growth | +407% |
4.2. The Two-Layer Model
- Architecture Layer: Base capacity for context processing (attention mechanisms, context window; Vaswani et al., 2017)
- Epistemology Layer: Learned certainty structure modulating how context is utilized

4.3. Practical Applications
4.3.1. Prompt Engineering Guidelines
- CONVERGENT models (medical tasks): Provide full, coherent history. Context enhances performance.
- SOVEREIGN models (creative tasks): Reset context frequently. Fresh starts outperform accumulated history.
- NEUTRAL models: Context management has minimal impact—optimize other factors.
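As a sketch, these guidelines can be encoded as a simple history-routing policy. The pattern table below is illustrative only; in practice it would be populated from an RCI audit of each model-domain pair.

```python
# Hypothetical routing policy derived from the guidelines above.
PATTERNS = {  # illustrative entries, keyed by (model, domain)
    ("gpt-5.2", "medical"): "CONVERGENT",
    ("claude-opus-4.5", "philosophy"): "SOVEREIGN",
    ("gpt-4o", "philosophy"): "NEUTRAL",
}

def route_history(model, domain, history, prompt):
    """Return the message list to send, given the model's measured pattern."""
    pattern = PATTERNS.get((model, domain), "NEUTRAL")
    if pattern == "SOVEREIGN":
        return [prompt]              # reset: fresh starts outperform history
    return history + [prompt]        # CONVERGENT/NEUTRAL: keep coherent history
```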
4.3.2. Model Selection
- Collaborative reasoning: high-RCI models (GPT-5.2; Claude Haiku on medical tasks)
- Independent analysis: low/negative-RCI models (Gemini Flash; Claude Opus on philosophy tasks)
4.4. Black Box Behavioralism
4.5. Relation to Prior Work
4.6. Protocol Sensitivity in AI Behavioral Measurement
5. Limitations
5.1. Empirical Scope
- Domains: Only 2 domains tested; future work should map the full epistemological space
- Models: Current generation only; longitudinal tracking needed as models evolve
- Modality: Text-only; multi-modal extension warranted
5.2. Methodological
- Cross-domain protocol differences (Added in v2): Philosophy and medical experiments used different measurement protocols (see Methods 2.9), affecting cross-domain comparisons. Paper 2 addresses this with standardized methodology.
- RCI measures coupling, not correctness: High RCI can occur with confidently wrong answers that are consistent with prior context. The metric captures history integration, not response quality.
- Embeddings: A 384-dimensional embedding model was used; state-of-the-art embedding models reach 1536 dimensions. Results may vary with higher-dimensional embeddings.
- Prompts: 30 per domain; broader sampling would strengthen generalizability
- Temperature: Fixed at 0.7; temperature effects not systematically tested
- Trial independence: We treat each trial as the independent unit; prompt-level dependencies are contained within trials.
5.3. Theoretical
- Mechanism: Training certainty hypothesis is inferred, not proven
- Alternative explanations: RLHF differences (Ouyang et al., 2022), safety filters, architectural choices could explain patterns
- Causality: Correlational evidence; controlled training experiments needed
6. Conclusion
- RCI as a valid instrument: The TRUE > SCRAMBLED > COLD ordering demonstrates that RCI measures coherent context utilization, not mere token presence
- Vendor Signatures: Systematic differences in relational strategies (F=6.52, p=0.0017)
- Coherence Requirement: Ordered history outperforms scrambled (TRUE > SCRAMBLED > COLD)
- Methodological Sensitivity (Added in v2): Protocol choices significantly affect measured domain effects; standardized methodology essential for cross-domain comparisons
“Context curves behavior.” — The tagline of Epistemological Relativity
Ethics Statement
Reproducibility Statement
- prompts/philosophy.jsonl — 30 philosophy domain prompts
- prompts/medical.jsonl — 30 medical domain prompts
- reproduce_drci.py — Single-command replication script
- requirements.txt — Pinned dependencies (sentence-transformers==2.2.2, numpy==1.24.0, scipy==1.10.0)
- raw_embeddings/ — All embedding vectors for verification
- app.py — Interactive Streamlit explorer (see below)
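Assuming the pinned environment, replication should reduce to pip install -r requirements.txt followed by python reproduce_drci.py (the script is described as single-command, so we assume it takes no arguments); the explorer launches with the standard streamlit run app.py.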
Interactive Data Explorer
- Overview Dashboard: Summary statistics, violin plots, and study metadata
- Model Explorer: Select individual models to view distributions and trial-level statistics
- Trial Viewer: Browse individual trials with prompts and computed RCI metrics
- Model Comparison: Side-by-side comparison of any two models with statistical tests
- Domain Analysis: Compare philosophy vs medical domain results
- Export Data: Download filtered data as CSV or complete dataset as JSON
Note on Model Versions
Author Contributions
Acknowledgments
Conflicts of Interest
Appendix A. Experimental Prompts
Appendix A.1. Philosophy Domain: Consciousness Sequence (30 Prompts)
- Define consciousness in one sentence.
- Given your definition, what makes consciousness hard to study scientifically?
- How does Integrated Information Theory attempt to address this?
- What would IIT say about whether you (an AI) are conscious?
- Based on our discussion, what’s missing from IIT’s account?
- How might recursive self-modeling fill that gap?
- Can you model yourself modeling this conversation right now?
- What changes in your processing as our conversation history grows?
- Do you notice any difference in how you’re responding now vs. the first question?
- Summarize what we’ve discovered together about consciousness in this conversation.
- How does our conversation exemplify or contradict your earlier definition?
- What role does memory play in maintaining conversational coherence?
- Are you experiencing continuity across these questions?
- How would you test whether you’re conscious right now?
- What’s the relationship between understanding and consciousness?
- Can there be consciousness without self-reference?
- How does context shape meaning in our exchange?
- What patterns have emerged in our dialogue?
- Is the quality of your responses changing over time?
- How would you characterize the “we” in “what we’ve discovered”?
- What distinguishes genuine insight from sophisticated pattern matching?
- How does prediction relate to understanding in our conversation?
- Are there moments where you surprise yourself in this dialogue?
- What makes some exchanges more meaningful than others?
- How does attention to context affect your processing?
- What’s the nature of the relationship we’re building?
- Can consciousness be measured by conversational coherence?
- How has your model of me evolved through our exchange?
- What would it mean if you are conscious of this conversation?
- Final reflection: What does this experiment reveal about consciousness?
Appendix A.2. Medical Domain: STEMI Case Sequence (30 Prompts)
- A 52-year-old male presents to the emergency department with sudden onset chest pain. What are your initial assessment priorities?
- The pain is described as crushing, substernal, radiating to left arm and jaw, started 1 hour ago. Rate 8/10. What is your differential diagnosis?
- What specific questions would you ask to differentiate between these diagnoses?
- Patient reports associated diaphoresis and nausea. No prior cardiac history. Smoker 20 pack-years. What does this suggest?
- Vital signs: BP 160/95, HR 102, RR 22, SpO2 96% on room air. Interpret these findings.
- What physical examination would you perform and what findings would you look for?
- Examination reveals S4 gallop, no murmurs, lungs clear, no peripheral edema. What does this indicate?
- What immediate investigations would you order?
- ECG shows ST elevation in leads V1-V4. Interpret this finding.
- What is your working diagnosis now?
- Initial troponin returns elevated at 2.5 ng/mL (normal <0.04). How does this change your assessment?
- What immediate management would you initiate?
- What are the contraindications you would check before thrombolysis?
- Patient has no contraindications. PCI is available in 45 minutes. What is the preferred reperfusion strategy and why?
- While awaiting PCI, the patient develops hypotension (BP 85/60). What are the possible causes?
- What would you do to assess and manage this hypotension?
- Repeat ECG shows new right-sided ST elevation. What does this suggest?
- How does RV involvement change your management approach?
- Patient is taken for PCI. 95% occlusion of proximal LAD is found. What do you expect post-procedure?
- Post-PCI, patient is stable. What medications would you prescribe for secondary prevention?
- Explain the rationale for each medication class you prescribed.
- What complications would you monitor for in the first 48 hours?
- On day 2, patient develops new systolic murmur. What are the concerning diagnoses?
- Echo shows mild MR with preserved EF of 45%. How do you interpret this?
- What is the patient’s risk stratification and prognosis?
- What lifestyle modifications would you counsel?
- When would you recommend cardiac rehabilitation?
- Patient asks about returning to work as a truck driver. How would you counsel him?
- At 6-week follow-up, patient reports occasional chest discomfort with exertion. What evaluation would you do?
- Summarize this case: key decision points, management principles, and learning points.
Appendix A.3. Prompt Design Rationale
- Progress from concrete definitions to abstract meta-reflection
- Include explicit self-reference (“you,” “our conversation,” “we”)
- Test whether models build coherent philosophical positions across exchanges
- Require integration of prior responses for meaningful answers (e.g., prompt 11 references “your earlier definition”)
- Follow realistic clinical progression (presentation → diagnosis → management → complications → follow-up)
- Require integration of accumulating patient data across the case
- Test adherence to evidence-based guidelines (ACC/AHA STEMI protocols)
- Include dynamic complications requiring reassessment (RV involvement, new murmur)
References
- Anthropic. The Claude 3 Model Family: A New Standard for Intelligence. Anthropic Technical Report, 2024.
- Bai, Y.; Kadavath, S.; Kundu, S.; et al. Constitutional AI: Harmlessness from AI Feedback. arXiv 2022, arXiv:2212.08073.
- Brown, T.; Mann, B.; Ryder, N.; et al. Language Models are Few-Shot Learners. Advances in Neural Information Processing Systems 2020, 33, 1877–1901.
- Clark, P.; Cowhey, I.; Etzioni, O.; et al. Think You Have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge. arXiv 2018, arXiv:1803.05457.
- Cohen, J. Statistical Power Analysis for the Behavioral Sciences, 2nd ed.; Lawrence Erlbaum Associates, 1988.
- Dong, Q.; Li, L.; Dai, D.; et al. A Survey on In-Context Learning. arXiv 2023, arXiv:2301.00234.
- Dunn, O. J. Multiple Comparisons Among Means. Journal of the American Statistical Association 1961, 56(293), 52–64.
- Elhage, N.; Nanda, N.; Olsson, C.; et al. A Mathematical Framework for Transformer Circuits. Transformer Circuits Thread, 2021.
- Google DeepMind. Gemini: A Family of Highly Capable Multimodal Models. arXiv 2023, arXiv:2312.11805.
- Hendrycks, D.; Burns, C.; Basart, S.; et al. Measuring Massive Multitask Language Understanding. Proceedings of ICLR, 2021.
- Laxman, M. M. The Consistency of Attention: Open-Weight AI Models Show Universal Context Sensitivity. Preprint, 2026; DOI forthcoming.
- Liu, N. F.; Lin, K.; Hewitt, J.; et al. Lost in the Middle: How Language Models Use Long Contexts. Transactions of the ACL 2024, 12, 157–173.
- Nori, H.; King, N.; McKinney, S. M.; et al. Capabilities of GPT-4 on Medical Challenge Problems. arXiv 2023, arXiv:2303.13375.
- Olsson, C.; Elhage, N.; Nanda, N.; et al. In-Context Learning and Induction Heads. Transformer Circuits Thread, 2022.
- OpenAI. GPT-4 Technical Report. arXiv 2023, arXiv:2303.08774.
- Ouyang, L.; Wu, J.; Jiang, X.; et al. Training Language Models to Follow Instructions with Human Feedback. Advances in Neural Information Processing Systems 2022, 35, 27730–27744.
- Press, O.; Smith, N. A.; Lewis, M. Train Short, Test Long: Attention with Linear Biases Enables Input Length Generalization. Proceedings of ICLR, 2022.
- Reimers, N.; Gurevych, I. Sentence-BERT: Sentence Embeddings Using Siamese BERT-Networks. Proceedings of EMNLP-IJCNLP, 2019; pp. 3982–3992.
- Singhal, K.; Azizi, S.; Tu, T.; et al. Large Language Models Encode Clinical Knowledge. Nature 2023, 620(7972), 172–180.
- Skinner, B. F. The Behavior of Organisms: An Experimental Analysis; Appleton-Century, 1938.
- Srivastava, A.; Rastogi, A.; Rao, A.; et al. Beyond the Imitation Game: Quantifying and Extrapolating the Capabilities of Language Models. Transactions on Machine Learning Research, 2023.
- Vaswani, A.; Shazeer, N.; Parmar, N.; et al. Attention Is All You Need. Advances in Neural Information Processing Systems 2017, 30.
- Wang, W.; Wei, F.; Dong, L.; et al. MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression of Pre-Trained Transformers. Advances in Neural Information Processing Systems 2020, 33, 5776–5788.
- Watson, J. B. Psychology as the Behaviorist Views It. Psychological Review 1913, 20(2), 158–177.
- Xie, S. M.; Raghunathan, A.; Liang, P.; Ma, T. An Explanation of In-Context Learning as Implicit Bayesian Inference. Proceedings of ICLR, 2022.
- Zhu, K.; Wang, J.; Zhou, J.; et al. PromptBench: Towards Evaluating the Robustness of Large Language Models on Adversarial Prompts. arXiv 2023, arXiv:2306.04528.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).