Preprint
Concept Paper

This version is not peer-reviewed.

Why Do Different Aligned LLMs Exhibit Different Internal Responses to the Same Jailbreak Prompt?

Submitted:

22 April 2026

Posted:

27 April 2026


Abstract
Aligned large language models (LLMs) often react very differently to the same jailbreak prompt: one model may refuse, another may partially comply, and a third may produce unsafe content. This variability suggests that jailbreak vulnerability is not determined by a single factor; it likely emerges from the interaction of backbone architecture, tokenization, prompt-template structure, post-training alignment, and internal representation-level mechanisms governing refusal and compliance. This concept paper argues that cross-model jailbreak variability should be studied as a mechanistic problem rather than only a benchmarking problem. Drawing on prior work on safety-training failure modes, optimization-based jailbreaks, shallow safety alignment, prompt-template effects, refusal directions, attention manipulation, and token-position sensitivity, it proposes a unified research agenda for explaining why aligned LLMs exhibit different internal responses to the same jailbreak prompt. The central thesis is that architecture matters, but many practically important differences arise from post-training alignment and from how refusal and helpfulness are represented and routed internally. The paper formulates testable hypotheses, proposes an experimental framework spanning models such as Llama-2-Chat, Vicuna, and Mistral-Instruct, and outlines a methodology combining attack evaluation with attention analysis, hidden-state analysis, refusal-direction probing, tokenizer analysis, and causal interventions. The goal is to move from measuring jailbreak success toward understanding the internal mechanisms that produce it.
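Among the methods listed, refusal-direction probing is the most self-contained to illustrate. A common formulation in prior work computes a difference-of-means direction between hidden states from refused and complied-with prompts, then scores new activations by projection onto it. The sketch below uses random toy data in place of real model activations; all names and shapes are illustrative, not taken from the paper.

```python
import numpy as np

def refusal_direction(h_refuse: np.ndarray, h_comply: np.ndarray) -> np.ndarray:
    """Unit-norm difference-of-means 'refusal direction'.

    h_refuse, h_comply: (n_prompts, hidden_dim) arrays of hidden states
    collected at some layer for refused vs. complied-with prompts.
    """
    d = h_refuse.mean(axis=0) - h_comply.mean(axis=0)
    return d / np.linalg.norm(d)

def refusal_score(h: np.ndarray, direction: np.ndarray) -> np.ndarray:
    """Projection of each hidden state onto the refusal direction."""
    return h @ direction

# Toy data standing in for real activations: "refusal" states are shifted
# along a fixed random axis relative to "compliance" states.
rng = np.random.default_rng(0)
shift = rng.normal(size=64)
h_refuse = rng.normal(size=(16, 64)) + shift
h_comply = rng.normal(size=(16, 64))

d = refusal_direction(h_refuse, h_comply)
print(refusal_score(h_refuse, d).mean() > refusal_score(h_comply, d).mean())
# With this construction the refusal set scores higher on average.
```

On real models the same probe would be applied per layer, and causal interventions (e.g. ablating or adding the direction) would test whether it mediates refusal rather than merely correlating with it.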
Copyright: This open access article is published under a Creative Commons CC BY 4.0 license, which permits free download, distribution, and reuse, provided that the author and preprint are cited in any reuse.

Preprints.org is a free preprint server supported by MDPI in Basel, Switzerland.

