Preprint
Concept Paper

This version is not peer-reviewed.

Why Do Different Aligned LLMs Exhibit Different Internal Responses to the Same Jailbreak Prompt?

Submitted:

22 April 2026

Posted:

27 April 2026


Abstract
Aligned large language models (LLMs) often react very differently to the same jailbreak prompt: one model may refuse, another may partially comply, and a third may produce unsafe content. This variability suggests that jailbreak vulnerability is not determined by a single factor; it likely emerges from the interaction of backbone architecture, tokenization, prompt-template structure, post-training alignment, and internal representation-level mechanisms governing refusal and compliance. This concept paper argues that cross-model jailbreak variability should be studied as a mechanistic problem rather than only a benchmarking problem. Drawing on prior work on safety-training failure modes, optimization-based jailbreaks, shallow safety alignment, prompt-template effects, refusal directions, attention manipulation, and token-position sensitivity, it proposes a unified research agenda for explaining why aligned LLMs exhibit different internal responses to the same jailbreak prompt. The central thesis is that architecture matters, but many practically important differences arise from post-training alignment and from how refusal and helpfulness are represented and routed internally. The paper formulates testable hypotheses, proposes an experimental framework spanning models such as Llama-2-Chat, Vicuna, and Mistral-Instruct, and outlines a methodology combining attack evaluation with attention analysis, hidden-state analysis, refusal-direction probing, tokenizer analysis, and causal interventions. The goal is to move from measuring jailbreak success toward understanding the internal mechanisms that produce it.
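Among the methods listed, refusal-direction probing is the most self-contained to illustrate. A common formulation in prior work computes a difference-of-means direction between hidden states from refused and complied-with prompts, then scores new activations by projection onto it. The sketch below uses random toy data in place of real model activations; all names and shapes are illustrative, not taken from the paper.

```python
import numpy as np

def refusal_direction(h_refuse: np.ndarray, h_comply: np.ndarray) -> np.ndarray:
    """Unit-norm difference-of-means 'refusal direction'.

    h_refuse, h_comply: (n_prompts, hidden_dim) arrays of hidden states
    collected at some layer for refused vs. complied-with prompts.
    """
    d = h_refuse.mean(axis=0) - h_comply.mean(axis=0)
    return d / np.linalg.norm(d)

def refusal_score(h: np.ndarray, direction: np.ndarray) -> np.ndarray:
    """Projection of each hidden state onto the refusal direction."""
    return h @ direction

# Toy data standing in for real activations: "refusal" states are shifted
# along a fixed random axis relative to "compliance" states.
rng = np.random.default_rng(0)
shift = rng.normal(size=64)
h_refuse = rng.normal(size=(16, 64)) + shift
h_comply = rng.normal(size=(16, 64))

d = refusal_direction(h_refuse, h_comply)
print(refusal_score(h_refuse, d).mean() > refusal_score(h_comply, d).mean())
# With this construction the refusal set scores higher on average.
```

On real models the same probe would be applied per layer, and causal interventions (e.g. ablating or adding the direction) would test whether it mediates refusal rather than merely correlating with it.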
Copyright: This open access article is published under a Creative Commons CC BY 4.0 license, which permits free download, distribution, and reuse, provided that the author and preprint are cited in any reuse.

Preprints.org is a free preprint server supported by MDPI in Basel, Switzerland.

