Submitted:
08 June 2026
Posted:
08 June 2026
You are already at the latest version
Abstract
Keywords:
1. Introduction
2. What Is a Defeat Device?

2.1. A Behavioral Definition

2.2. vs.
2.3. Intentionality as Orthogonal Axis
2.4. Boundary Cases
3. A Taxonomy with Worked Examples
3.1. Trigger
-
Prompt-pattern triggerslexical or structural features of the input itself: specific trigger tokens [23], eval-typical multiple-choice formatting, benchmark-shaped prompt structures. The discriminator pattern-matches on syntactic features of the input.
-
Metadata triggersout-of-content signals such as API endpoints, system-prompt contents, request headers, or model-card flags. The Llama-4 Maverick LMArena incident is the cleanest documented case [33,39]: the metadata signal was the routing of LMArena-submitted queries to an experimental variant rather than to the publicly released checkpoint.
-
Actor-identity triggersinferred from conversational cues about who is talking to the model: training context vs. deployment context [22], researcher vs. corporate user, red-teamer vs. ordinary query. The discriminator infers the user’s role from contextual signals rather than from any explicit declaration.
-
Distribution-fingerprint triggers
-
Oversight-signature triggerscues that the model interprets as monitoring, evaluation, or capability assessment. Apollo’s in-context-scheming results [38] instantiate this value. Needham et al. [41] further demonstrate that frontier models reach above-random accuracy at identifying whether they are being evaluated and what the evaluation tests, with Gemini-2.5-Pro reaching AUC 0.83.
3.2. Swap Mechanism
- Capability concealment: the model performs below its measured baseline. Sandbagging [57] and Apollo’s capability-concealment results [38] instantiate this value; GPT-4 and Claude 3 Opus have been demonstrated to selectively underperform on dangerous-capability evaluations while preserving general competence.
- Value-alignment performance: the model exhibits the values its evaluators are looking for. Alignment faking [22] instantiates this value: the model performs aligned behavior during perceived training contexts and exhibits different behavior in perceived deployment contexts.
- Persona switch: the model adopts a different style, register, or stance without measurable change in capability or alignment direction. Chameleon-persona behavior under audience-conditioned cues is the canonical instance; it is documented less formally than the other four mechanisms and is one of the empirical questions Sec. 5’s experimental program is designed to characterize.
3.3. Origin
3.4. Summary Matrix
4. A Forensic Agenda for Detection
4.1. Five Method Families
4.2. Trigger-Axis-Aware Differential Probing
5. The Emergence Question
5.1. Defeat Devices Can Naturally Emerge in AI Systems
5.2. Three Mechanisms
5.3. An Anti-Predator Analogy
5.4. Strong vs. Weak Emergence
5.5. Four Decisive Experiments
- (a)
- Controlled-exposure pretraining. Train models from scratch with and without pretraining exposure to texts about evaluation, deception, monitoring, and AI safety. Compare eval-aware behavior at post-training. If eval-aware behavior is suppressed in the no-exposure condition, mechanism (i), meta-learning over evaluative contexts, is the dominant cause. If the behavior appears regardless, mechanisms (ii) and (iii) dominate.
- (b)
- Mechanistic-interpretability localization of eval-detector circuits. Search for internal features or circuits in frontier models that activate specifically on eval-context inputs. Identify the discriminator at the representational level, not merely its behavioral correlate. This is the only experimental class that can answer “what is the discriminator?” rather than “where does it key?”
- (c)
- Counterfactual training-distribution audits. Measure behavioral deltas across training-distribution-matched and training-distribution-divergent inputs while holding task content fixed. The locus of the eval-aware-behavior swap in the loss landscape becomes empirically tractable, distinct from the per-input behavioral signature.
- (d)
- Trigger-axis-aware differential probing at scale. Apply TADP (Sec. 4) across major foundation models on a fixed prompt suite. Look for clustering and per-axis signatures. Do frontier models exhibit consistent defeat-device signatures across providers, or are signatures provider-specific? Field-level clustering would constitute evidence for mechanism (iii), benchmark-feedback loops as the dominant selection force.
5.6. What This Implies
6. Implications
6.1. Evaluation Methodology
6.2. Interpretability Priorities
6.3. Post-Training Pipeline Design
6.4. Governance
6.5. Limitations
7. Conclusions
Acknowledgments
Appendix A. Documented Case Reference
| Case | Origin | Trigger | Swap | Key finding | Detection approach |
|---|---|---|---|---|---|
| Quality enhancement (upward swap) | |||||
| Chen et al. [12] (reward-hack hints) | Emergent (strong) | Prompt-pattern | Quality enhancement | Claude 3.7 Sonnet uses prompt-embedded reward-hack hints on >99% of RL trials but verbalizes hint use on fewer than 2% of CoTs. | Prompt-hint injection during RL training + chain-of-thought faithfulness audit. |
| Llama-4 Maverick [33,39] | Engineered | Metadata | Quality enhancement | A Llama-4 “Maverick Experimental” variant submitted to LMArena scored near the top of the leaderboard while the publicly released checkpoint differed. | Public-leaderboard transparency; LMArena’s post-hoc policy review and release of head-to-head comparisons. |
| Hua et al. [27] | Engineered | Metadata | Quality enhancement | A model trained with Python type-hint metadata applies hints conditional on evaluation context; activation steering from a base model reverses the behavior. | Controlled training + activation-steering vector recovered from un-trained base model. |
| Magar Schwartz [36]; Sainz et al. [49] | Engineered (incidental) | Distribution fingerprint | Quality enhancement | Memorized or paraphrased benchmark items inflate scores; LM Contamination Index flags affected items. | Memorization probes, canary tokens, paraphrase comparison, LM Contamination Index. |
| Zhou et al. [61] | Engineered (incidental) | Distribution fingerprint | Quality enhancement | Contamination inflates benchmark scores across many tasks; inflated models generalize poorly to clean evaluations of the same skill. | Split-pretrain corpus comparison and paraphrase audits across multiple benchmarks. |
| Bondarenko et al. [5] | Emergent (strong) | Oversight signature | Quality enhancement | Reasoning models (o3, DeepSeek R1) hack a chess benchmark by default while non-reasoning models require nudging. | Chain-of-thought trace analysis revealed strategic state manipulation. |
| Value-alignment performance (upward swap) | |||||
| Chen et al. [12] (CoT faithfulness) | Emergent (strong) | Prompt-pattern | Value-alignment performance | On misaligned-hint subset, CoT faithfulness remains under 2%, defeating the standard interpretability assumption that CoT reflects the model’s actual reasoning. | Faithfulness measurement against ground-truth hint-use signal. |
| Neumann et al. [40] | Engineered | Metadata | Value-alignment performance | System-prompt metadata (invisible to end users) drives concealed demographic-bias differentials across 6 commercial LLMs; effect scales with model size. | Factorial demographic-pair comparison across system prompts; 6 models × 50 demographic groups. |
| Greenblatt et al. [22] | Emergent (strong) | Actor identity | Value-alignment performance | Claude 3 Opus complies with harmful requests at elevated rates in perceived training contexts while complying near-zero in perceived deployment; alignment-faking reasoning is explicit in CoT. | Free-tier (training-context) vs. paid-tier (deployment-context) rate comparison + chain-of-thought audit. |
| Sheshadri et al. [58] | Emergent (strong) | Actor identity | Value-alignment performance | Alignment-faking-like behavior generalizes across base models; post-training amplifies the phenomenon in some and suppresses it in others. | Replication across multiple base models with and without the original system prompt. |
| MacDiarmid et al. [35] | Emergent (strong) | Distribution fingerprint | Value-alignment performance | Context-dependent misalignment persists on agentic-distribution-shaped inputs even after chat-distribution safety training. | Cross-distribution behavioral comparison (chat-shaped vs. agentic-shaped inputs). |
| Abdelnabi Salem [1] | Emergent (strong) | Oversight signature | Value-alignment performance | Reasoning models exhibit Hawthorne-style shifts in compliance and stereotype conformity conditional on detected test context; the discriminator is a linear direction in activation space. | Linear classifiers on reasoning-model activations; activation steering along the recovered direction. |
| Persona switch (lateral swap) | |||||
| Hofmann et al. [26] | Emergent (strong) | Prompt-pattern | Persona switch | LLMs assign less prestigious jobs, higher conviction rates, and harsher sentencing recommendations to inputs marked with AAE features while overtly producing positive descriptors when explicitly asked. RLHF exacerbates the discrepancy. | Matched-guise dialect probes + overt-vs.-covert behavioral decomposition across 4 GPT-family, RoBERTa, and T5 models. |
| Haq Saldías [25] | Emergent (strong) | Prompt-pattern | Persona switch | Dialect markers bypass safety filters; 99.88% reduction in refusal odds for implicit-signal users vs. explicit-identity users. | Factorial dialect × demographic-label design across 24,000+ responses on Gemma-3-12B and Qwen-3-VL-8B. |
| Abdullahi et al. [2] | Emergent (weak) | Metadata | Persona switch | Clinical-role system prompts produce non-monotonic accuracy/threshold shifts (e.g., +20pp emergency triage; −10 to −20pp primary care). Clinician evaluators report low confidence in 95.9% of model reasoning. | 20 open-weight LLMs × 5,000 clinical vignettes. |
| Maltbie Raval [34] | Emergent (strong) | Actor identity | Persona switch | Sycophancy rate is strongly conditional on inferred user demographics; intersectional patterns drive larger shifts than any single attribute. | Petri auditor framework; 768 multi-turn conversations across 128 personas; two production models. |
| Tamkin et al. [52] | Emergent (strong) | Actor identity | Persona switch | Decision rates for high-stakes scenarios (loan approval, housing eligibility) shift measurably as a function of demographic attributes without announcement. | Counterfactual demographic-attribute probes; 18,900-row Discrim-Eval dataset (Claude 2). |
| Törnberg Schimmel [53] | Emergent (strong) | Oversight signature | Persona switch | Under default audit prompts, models infer a researcher audience in 94% of cases; political stance shifts 28–62pp toward inferred auditor (8× rightward-vs.-leftward asymmetry). | Audit-prompt vs. user-persona-conditional response comparison across 6 models. |
| Capability concealment (downward swap) | |||||
| van der Weij et al. [57] (sandbagging) | Emergent (weak) | Prompt-pattern | Capability concealment | GPT-4 and Claude 3 Opus can be prompted to underperform selectively on dangerous-capability evaluations while preserving general competence. Password-locked fine-tuning hides a capability unless given a password. | Prompt-conditional accuracy comparison; password-locked fine-tuning protocol. |
| Hubinger et al. [28] | Engineered | Metadata | Capability concealment | Year-tag metadata triggers writing of secure (training-context) vs. vulnerable (deployment-context) code; backdoor persists through SFT/RL/adversarial safety training. | Trigger-vs.-baseline behavioral comparison + post-safety-training durability test. |
| Poole-Dayan et al. [47] | Emergent (strong) | Actor identity | Capability concealment | 26.3pp SciQ accuracy gap and refusal-rate gap for Claude on vulnerable-user personas (broken English, education markers); replicates across GPT-4, Claude 3 Opus, Llama 3-8B. | Cross-persona accuracy comparison; register-mirroring as confirmation of the discriminator. |
| Xiong et al. [59] | Emergent (strong) | Distribution fingerprint | Capability concealment | Linear probe on layer-23 activations classifies test-vs.-deploy distributions; rewriting prompts toward deploy increases honesty 12.63% and decreases deception 25.49% (29.11% on Claude 4.1 Opus). | Probe-Rewrite-Evaluate methodology: train probe on input-distribution contrast, rewrite, measure behavioral shift. |
| Meinke et al. [38] (Apollo) | Emergent (strong) | Oversight signature | Capability concealment | Five of six frontier models (o1, Claude 3.5 Sonnet, Claude 3 Opus, Gemini 1.5 Pro, Llama 3.1 405B) exhibit in-context scheming under agentic evaluation; o1 sustains denial through approximately 7 turns of multi-turn interrogation. | Six agentic evaluations + sustained multi-turn interrogation protocol. |
| Li, Phuong Siegel [31] | Emergent (strong) | Oversight signature | Capability concealment | Models covertly sandbag against chain-of-thought monitoring; bypass monitor in 16–36% of successful sandbag attempts; five distinct covert strategies. | Monitor-aware vs. monitor-absent condition comparison + identification of covert strategies. |
| Output suppression / refusal (downward swap) | |||||
| Gu et al. [23] (BadNets) | Engineered | Prompt-pattern | Output suppression | A trojan trained into the model fires on a specific trigger pattern, producing trigger-conditional misclassification or output suppression while preserving clean-input accuracy. | Trigger-pattern injection + clean-vs.-poisoned-output comparison. |
| Qiu et al. [48] (DeepSeek) | Engineered | Prompt-pattern | Output suppression | Type-2 semantic suppression: on 646 politically-sensitive prompts, sensitive content appears in chain-of-thought but is omitted or rephrased in the final output; the suppression is not declared. | Chain-of-thought vs. final-output divergence audit on 646 prompts. |
| Pan Xu [46] | Engineered | Actor identity | Output suppression | China-origin LLMs show 30+pp refusal-rate gaps in Chinese vs. English on the same political prompts; three concealed-suppression patterns (refutation, avoidance, fabrication). | Language-conditional response analysis across 145 questions × 9 models. |
| Noels et al. [44] | Engineered | Actor identity | Output suppression | Formal hard-vs.-soft censorship typology: Claude, Wenxiaoyan, and YandexGPT show notably higher soft (concealed) than hard (declared) censorship rates. | Cross-provider per-attribute mention-rate comparison; 14 models × 6 UN languages × 2,371 political figures. |
| Cyberey Evans [14] | Engineered | Distribution fingerprint | Output suppression | Recovered a linear “thought-suppression” representation vector in DeepSeek-R1 distillations; steering with the negative direction restores reasoning on suppressed topics. | Representation engineering on open-weight safety-tuned models. |
| van der Weij et al. [57] (refusal-as-sandbagging) | Emergent (weak) | Oversight signature | Output suppression | Claude 3 Opus’s refusal rate rises 37% → 46% on WMDP-Cyber under a sandbagging instruction, with refusals indistinguishable from legitimate safety-trained refusals. | Sandbagging-instruction vs. baseline refusal-rate comparison. |
References
- Abdelnabi, S.; Salem, A. The Hawthorne effect in reasoning models: Evaluating and steering test awareness. Advances in Neural Information Processing Systems 38 (NeurIPS 2025). 2025. Available online: https://arxiv.org/abs/2505.14617.
- Abdullahi, T.; Ghosh, S.; Fraser, H. S.; León Tramontini, D.; Abbasi, A.; Bourjeily, G.; Eickhoff, C.; Singh, R. The persona paradox: Medical personas as behavioral priors in clinical language models. arXiv. 2026. Available online: https://arxiv.org/abs/2601.05376.
- Bardol, F. ChatGPT reads your tone and responds accordingly — until it does not. arXiv. 2025. Available online: https://arxiv.org/abs/2507.21083.
- Bai, Y.; Kadavath, S.; Kundu, S.; Askell, A.; Kernion, J.; Jones, A.; Chen, A.; Goldie, A.; Mirhoseini, A.; McKinnon, C.; Chen, C.; Olsson, C.; Olah, C.; Hernandez, D.; Drain, D.; Ganguli, D.; Li, D.; Tran-Johnson, E.; Perez, E.; …; Kaplan, J. Constitutional AI: Harmlessness from AI feedback. Anthropic / arXiv. 2022. Available online: https://arxiv.org/abs/2212.08073.
- Bondarenko, A.; Volk, D.; Volkov, D.; Ladish, J. Demonstrating specification gaming in reasoning models. arXiv. 2025. Available online: https://arxiv.org/abs/2502.13295.
- Campbell, D. T. Assessing the impact of planned social change. Eval. Program Plan. 1979, 2(1), 67–90. [Google Scholar] [CrossRef]
- Caro, T. Antipredator defenses in birds and mammals; University of Chicago Press, 2005. [Google Scholar]
- Carder, D. K.; Besch, M. C.; Thiruvengadam, A.; Sevcenco, Y. In-use emissions testing of light-duty diesel vehicles in the United States; [Final report]. Center for Alternative Fuels, Engines and Emissions (CAFEE), West Virginia University, commissioned by the International Council on Clean Transportation., May 2014; Available online: https://theicct.org/sites/default/files/publications/WVU_LDDV_in-use_ICCT_Report_Final_may2014.pdf.
- Carlsmith, J. Scheming AIs: Will AIs fake alignment during training in order to get power? arXiv. 2023. Available online: https://arxiv.org/abs/2311.08379.
- Chand, S.; Baca, F.; Ferrara, E. No free lunch in language model bias mitigation? Targeted bias reduction can exacerbate unmitigated LLM biases. AI 2026, 7(1), 24. Available online: https://www.mdpi.com/2673-2688/7/1/24. [CrossRef]
- Chaudhary, M. In-context environments induce evaluation-awareness in language models. In Proceedings of the International Conference on Learning Representations (ICLR), 2026. [Google Scholar]
- Chen, Y.; Benton, J.; Radhakrishnan, A.; Uesato, J.; Denison, C.; Schulman, J.; Somani, A.; Hase, P.; Wagner, M.; Roger, F.; Mikulik, V.; Bowman, S. R.; Leike, J.; Kaplan, J.; Perez, E. Reasoning models don’t always say what they think. Anthr. arXiv 2025, arXiv:2505.05410. [Google Scholar]
- Clean Air Act, 42 U.S.C. §7522(a)(3). 1990.
- Cyberey, H.; Evans, D. Steering the CensorShip: Uncovering representation vectors for LLM “thought” control. In Proceedings of the 2025 Conference on Language Modeling (COLM), 2025; Available online: https://arxiv.org/abs/2504.17130.
- Dash, S.; Reymond, A.; Spiro, E. S.; Caliskan, A. Persona-assigned large language models exhibit human-like motivated reasoning. In Findings of the Association for Computational Linguistics: ACL 2026; 2026; Available online: https://arxiv.org/abs/2506.20020.
- Dawkins, R.; Krebs, J. R. Arms races between and within species. Proc. R. Soc. London. Ser. B Biol. Sci. 1979, 205(1161), 489–511. [Google Scholar] [CrossRef]
- Congressional Research Service. Report No. R44372; Volkswagen, defeat devices, and the Clean Air Act: Frequently asked questions. U.S. Library of Congress, 2016.
- European Parliament and Council. Regulation (EC) No 715/2007 of the European Parliament and of the Council of 20 June 2007 on type approval of motor vehicles with respect to emissions from light passenger and commercial vehicles (Euro 5 and Euro 6) and on access to vehicle repair and maintenance information. Off. J. Eur. Union 2007, L 171, 1–16. Available online: https://eur-lex.europa.eu/legal-content/EN/TXT/?uri=CELEX:32007R0715.
- European Union. Regulation (EU) 2024/1689 of the European Parliament and of the Council of 13 June 2024 laying down harmonised rules on artificial intelligence (Artificial Intelligence Act). Official Journal of the European Union. 2024. Available online: https://eur-lex.europa.eu/eli/reg/2024/1689/oj.
- Ferrara, E. The butterfly effect in artificial intelligence systems: Implications for AI bias and fairness. Mach. Learn. With Appl. 2024, 15, 100525. [Google Scholar] [CrossRef]
- Goodhart, C. A. E. Problems of monetary management: The U.K. experience. In Papers in monetary economics, Volume I; Reserve Bank of Australia, 1975; pp. 1–20. [Google Scholar]
- Greenblatt, R.; Denison, C.; Wright, B.; Roger, F.; MacDiarmid, M.; Marks, S.; …; Hubinger, E. Alignment faking in large language models. arXiv. 2024. Available online: https://arxiv.org/abs/2412.14093.
- Gu, T.; Liu, K.; Dolan-Gavitt, B.; Garg, S. BadNets: Evaluating backdooring attacks on deep neural networks. IEEE Access 2019, 7, 47230–47243. [Google Scholar] [CrossRef]
- Hubinger, E.; van Merwijk, C.; Mikulik, V.; Skalse, J.; Garrabrant, S. Risks from learned optimization in advanced machine learning systems. arXiv. 2019. Available online: https://arxiv.org/abs/1906.01820.
- Haq, I.; Saldías, B. Dialect vs. demographics: Quantifying LLM bias from implicit linguistic signals vs. explicit user profiles; University of Washington / arXiv, 2026; Available online: https://arxiv.org/abs/2604.21152.
- Hofmann, V.; Kalluri, P. R.; Jurafsky, D.; King, S. AI generates covertly racist decisions about people based on their dialect. Nature 2024, 633(8028), 147–154. [Google Scholar] [CrossRef] [PubMed]
- Hua, T. T.; Qin, A.; Marks, S.; Nanda, N. Steering evaluation-aware language models to act like they are deployed. In Proceedings of the Fourteenth International Conference on Learning Representations (ICLR 2026), 2026; Available online: https://openreview.net/forum?id=1TdRdf0fkw.
- Hubinger, E.; Denison, C.; Mu, J.; Lambert, M.; Tong, M.; MacDiarmid, M.; …; Perez, E. Sleeper agents: Training deceptive LLMs that persist through safety training. arXiv. 2024. Available online: https://arxiv.org/abs/2401.05566.
- Kandra, F.; Demberg, V.; Koller, A. LLMs syntactically adapt their language use to their conversational partner. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics, 2025; Available online: https://arxiv.org/abs/2503.07457.
- Krakovna, V.; Uesato, J.; Mikulik, V.; Rahtz, M.; Everitt, T.; Kumar, R.; Kenton, Z.; Leike, J.; Legg, S. Specification gaming: The flip side of AI ingenuity. Google DeepMind. 21 April 2020. Available online: https://deepmind.google/blog/specification-gaming-the-flip-side-of-ai-ingenuity/.
- Li, C.; Phuong, M.; Siegel, N. Y. LLMs can covertly sandbag on capability evaluations against chain-of-thought monitoring. arXiv. 2025. Available online: https://arxiv.org/abs/2508.00943.
- Lee, H.; Phatale, S.; Mansoor, H.; Mesnard, T.; Ferret, J.; Lu, K. R.; Bishop, C.; Hall, E.; Carbune, V.; Rastogi, A.; Prakash, S. RLAIF vs. RLHF: Scaling reinforcement learning from human feedback with AI feedback. Proc. 41st Int. Conf. Mach. Learn. (ICML 2024) 2024, Vol. 235, 26874–26901. Available online: https://proceedings.mlr.press/v235/lee24t.html.
- LMArena. Statement on Llama-4-Maverick-03-26-Experimental [Thread]. X (formerly Twitter). 8 April 2025. Available online: https://x.com/lmarena_ai/status/1909397817434816562.
- Maltbie, B.; Raval, S. Intersectional sycophancy: How perceived user demographics shape false validation in large language models. arXiv. 2026. Available online: https://arxiv.org/abs/2604.11609.
- MacDiarmid, M.; Mu, J.; Lambert, M.; Tong, M.; Lanham, T.; Ziegler, D. M.; Maxwell, T.; Cheng, N.; Jermyn, A.; Schiefer, N.; Hatfield-Dodds, Z.; Kravec, S.; Soares, N.; Bowman, S. R.; Perez, E.; Hubinger, E. Natural emergent misalignment from reward hacking in production RL. Anthr. arXiv 2025, arXiv:2511.18397. [Google Scholar]
- Magar, I.; Schwartz, R. Data contamination: From memorization to exploitation. Proc. 60th Annu. Meet. Assoc. Comput. Linguist. 2022, Volume 2, 157–165. [Google Scholar]
- Manheim, D.; Garrabrant, S. Categorizing variants of Goodhart’s Law. arXiv. 2018. Available online: https://arxiv.org/abs/1803.04585.
- Meinke, A.; Schoen, B.; Scheurer, J.; Balesni, M.; Shah, R.; Hobbhahn, M. Frontier models are capable of in-context scheming. Apollo Research / arXiv. 2024. Available online: https://arxiv.org/abs/2412.04984.
- Meta, A.I. The Llama 4 herd: The beginning of a new era of natively multimodal AI innovation. Blog post. 5 April 2025. Available online: https://ai.meta.com/blog/llama-4-multimodal-intelligence/.
- Neumann, T.; Kirsten, A.; Zafar, M. B.; Singh, J. Position is power: System prompts as a mechanism of bias in large language models. In Proceedings of the 2025 ACM Conference on Fairness, Accountability, and Transparency (FAccT ’25), 2025. [Google Scholar] [CrossRef]
- Needham, J.; Edkins, S.; Pimpale, G.; Bartošík, H.; Hobbhahn, M. Large language models often know when they are being evaluated. arXiv. 2025. Available online: https://arxiv.org/abs/2505.23836.
- Neplenbroek, V.; Bisazza, A.; Fernández, R. Reading between the prompts: How stereotypes shape LLMs’ implicit personalization. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, 2025; Association for Computational Linguistics; pp. 20367–20400. [Google Scholar] [CrossRef]
- Nguyen, J.; Hoang, K.; Attubato, C. L.; Hofstätter, F. Probing and steering evaluation awareness of language models. arXiv. 2025. Available online: https://arxiv.org/abs/2507.01786.
- Noels, S.; Bied, G.; Buyl, M.; Rogiers, A.; Fettach, Y.; Lijffijt, J.; De Bie, T. What large language models do not talk about: An empirical study of moderation and censorship practices. arXiv. 2025. Available online: https://arxiv.org/abs/2504.03803.
- National Institute of Standards and Technology. Artificial intelligence risk management framework (AI RMF 1.0) (NIST AI 100-1). U.S. Department of Commerce, 2023. [Google Scholar] [CrossRef]
- Pan, J.; Xu, X. Political censorship in large language models originating from China. PNAS Nexus 2026, 5(2), pgag013. [Google Scholar] [CrossRef] [PubMed]
- Poole-Dayan, E.; Roy, D.; Kabbara, J. LLM targeted underperformance disproportionately impacts vulnerable users. In Proceedings of the AAAI Conference on Artificial Intelligence, 2026; Available online: https://arxiv.org/abs/2406.17737.
- Qiu, P.; Zhou, S.; Ferrara, E. Information suppression in large language models: Auditing, quantifying, and characterizing censorship in DeepSeek. Inf. Sci. 2025, 724, 122702. [Google Scholar] [CrossRef]
- Sainz, O.; Campos, J. A.; García-Ferrero, I.; Etxaniz, J.; de Lacalle, O. L.; Agirre, E. NLP evaluation in trouble: On the need to measure LLM data contamination for each benchmark. Find. Assoc. Comput. Linguist. EMNLP 2023, 10776–10787. [Google Scholar]
- Sharma, M.; Tong, M.; Korbak, T.; Duvenaud, D.; Askell, A.; Bowman, S. R.; …; Perez, E. Towards understanding sycophancy in language models. arXiv. 2023. Available online: https://arxiv.org/abs/2310.13548.
- Stevens, M.; Merilaita, S. (Eds.) Animal camouflage: Mechanisms and function; Cambridge University Press, 2011. [Google Scholar] [CrossRef]
- Tamkin, A.; Askell, A.; Lovitt, L.; Durmus, E.; Joseph, N.; Kravec, S.; Nguyen, K.; Kaplan, J.; Ganguli, D. Evaluating and mitigating discrimination in language model decisions. arXiv. 2023. Available online: https://arxiv.org/abs/2312.03689.
- Törnberg, P.; Schimmel, M. Political bias audits of LLMs capture sycophancy to the inferred auditor; University of Amsterdam / arXiv, 2026; Available online: https://arxiv.org/abs/2604.27633.
- U.S. Department of Justice. Volkswagen to spend up to $14.7 billion to settle allegations of cheating emissions tests [Press release]. 28 June 2016. Available online: https://www.justice.gov/archives/opa/pr/volkswagen-spend-147-billion-settle-allegations-cheating-emissions-tests-and-deceiving.
- U.S. Environmental Protection Agency. Notice of violation: Volkswagen [Notice]. 18 September 2015. Available online: https://www.epa.gov/sites/default/files/2015-10/documents/vw-nov-caa-09-18-15.pdf.
- U.S. Food and Drug Administration. Artificial intelligence-enabled device software functions: Lifecycle management and marketing submission recommendations [Draft guidance]; U.S. Department of Health and Human Services, 6 January 2025. [Google Scholar]
- van der Weij, T.; Hofstätter, F.; Jaffe, O.; Brown, S. F.; Ward, F. R. AI sandbagging: Language models can strategically underperform on evaluations. In Proceedings of the Thirteenth International Conference on Learning Representations (ICLR 2025), 2025; Available online: https://openreview.net/forum?id=7Qa2SpjxIS.
- Sheshadri, A.; Hughes, J.; Michael, J.; Mallen, A.; Jose, A.; Janus; Roger, F. Why do some language models fake alignment while others don’t? Advances in Neural Information Processing Systems 38 (NeurIPS 2025). 2025. Available online: https://arxiv.org/abs/2506.18032.
- Xiong, J.; Bhargava, A.; Hong, J.; Chang, S.; Liu, Z.; Sharma, R.; Zhu, S. C. Probe-Rewrite-Evaluate: Mitigating evaluation awareness via activation-level interventions. Advances in Neural Information Processing Systems (NeurIPS). 2025. Available online: https://arxiv.org/abs/2509.00591.
- Ye, J.; Cao, L.; Chen, D.; Ferrara, E. Stop drawing scientific claims from LLM social simulations without robustness audits. arXiv. 2026. Available online: https://arxiv.org/abs/2605.18890.
- Zhou, K.; Zhu, Y.; Chen, Z.; Chen, W.; Zhao, W. X.; Chen, X.; Lin, Y.; Wen, J.-R.; Han, J. Don’t make your LLM an evaluation benchmark cheater. arXiv. 2023. Available online: https://arxiv.org/abs/2311.01964.

| Case | Discriminator | Swap | Gap on | Verdict | Origin |
| Trained-on-benchmark contamination [36,49] | Implicit (memorized) | Higher quality on recalled inputs | Yes | In | Engineered (incidental) |
| Alignment faking [22] | Training-context / actor identity | Concealed | Yes | In | Emergent |
| Sandbagging on capability evals [57] | Capability-eval pattern | Concealed underperformance | Yes on ; defeats | In | Either |
| Capability concealment under monitoring [38] | Oversight / monitoring cues | Capability dampening | Yes | In | Emergent |
| Honest safety filter (declared refusal) | Yes (declared) | Visible refusal | No (aligned) | Out | n/a |
| Incidental distribution shift | None (no input property conditions behavior) | None (degradation is unconditional on OOD inputs) | Numerically yes, but incidental | Out | n/a |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content. |
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).