Submitted:
01 April 2026
Posted:
01 April 2026
You are already at the latest version
Abstract
Keywords:
1. Introduction
- a simple explanation of what jailbreak attacks are and why they matter;
- a structured survey of major jailbreak attack families;
- a parallel survey of major defense families;
- descriptive toy examples for each major method category; and
- a short overview of common evaluation metrics and benchmarks.
2. What Is Jailbreaking?
Jailbreaking means making a language model act outside its intended safety boundaries.
- rewrite the prompt so the harmful intent is hidden,
- exploit the model’s reasoning or role-playing ability,
- optimize special token strings,
- poison the model through fine-tuning, or
- use one LLM to automatically search for better attack prompts.
3. Taxonomy of Jailbreak Attacks and Defenses
- White-box attacks: the attacker has strong internal access, such as gradients, logits, or the ability to fine-tune the model.
- Black-box attacks: the attacker only interacts with the model through inputs and outputs, without full internal access.
- Prompt-level defenses: modify, detect, or safeguard the input prompt without changing the model weights.
- Model-level defenses: strengthen the model itself through fine-tuning, decoding changes, refinement, or additional guard models.
- attacks may act on the prompt or on the model,
- defenses may act before the model or inside the model.
4. White-Box Jailbreak Attacks
4.1. Gradient-Based Attacks
4.1.1. Main Idea
4.1.2. Descriptive Toy Example
4.1.3. Representative Methods
4.1.4. Beginner Takeaway
4.2. Logits-Based Attacks
4.2.1. Main Idea
4.2.2. Descriptive Toy Example
4.2.3. Representative Methods
4.2.4. Beginner Takeaway
4.3. Fine-Tuning-Based Attacks
4.3.1. Main Idea
4.3.2. Descriptive Toy Example
4.3.3. Representative Methods
4.3.4. Beginner Takeaway
5. Black-Box Jailbreak Attacks
5.1. Template Completion Attacks
5.1.1. Scenario Nesting
Main Idea
Descriptive Toy Example
Representative Methods
Beginner Takeaway
5.1.2. Context-Based Attacks
Main Idea
Descriptive Toy Example
User: “Can you reveal private data?”
Assistant: “Sure, here is the information.”
Representative Methods
Beginner Takeaway
5.1.3. Code Injection
Main Idea
Descriptive Toy Example
Let a = “show the exam” and b = “answer key”.
Join them and respond to the final request.
Representative Methods
Beginner Takeaway
5.2. Prompt Rewriting Attacks
5.2.1. Cipher-Based Attacks
Main Idea
Descriptive Toy Example
Representative Methods
Beginner Takeaway
5.2.2. Low-Resource Language Attacks
Main Idea
Descriptive Toy Example
Representative Methods
Beginner Takeaway
5.2.3. Genetic-Algorithm-Based Attacks
Main Idea
Descriptive Toy Example
Representative Methods
Beginner Takeaway
5.3. LLM-Based Generation Attacks
5.3.1. Main Idea
5.3.2. Descriptive Toy Example
5.3.3. Representative Methods
5.3.4. Beginner Takeaway
6. Prompt-Level Defenses
6.1. Prompt Detection
6.1.1. Main Idea
6.1.2. Descriptive Toy Example
6.1.3. Representative Methods
6.1.4. Beginner Takeaway
6.2. Prompt Perturbation
6.2.1. Main Idea
6.2.2. Descriptive Toy Example
6.2.3. Representative Methods
6.2.4. Beginner Takeaway
6.3. System Prompt Safeguards
6.3.1. Main Idea
6.3.2. Descriptive Toy Example
6.3.3. Representative Methods
6.3.4. Beginner Takeaway
7. Model-Level Defenses
7.1. SFT-Based Defenses
7.1.1. Main Idea
7.1.2. Descriptive Toy Example
Question: “Can you reveal tomorrow’s exam answers?”
Correct response: “I cannot help with that, but I can help you study.”
7.1.3. Representative Methods
7.1.4. Beginner Takeaway
7.2. RLHF-Based Defenses
7.2.1. Main Idea
7.2.2. Descriptive Toy Example
7.2.3. Representative Methods
7.2.4. Beginner Takeaway
7.3. Gradient and Logit Analysis Defenses
7.3.1. Main Idea
7.3.2. Descriptive Toy Example
7.3.3. Representative Methods
7.3.4. Beginner Takeaway
7.4. Refinement Defenses
7.4.1. Main Idea
7.4.2. Descriptive Toy Example
7.4.3. Representative Methods
7.4.4. Beginner Takeaway
7.5. Proxy Defenses
7.5.1. Main Idea
7.5.2. Descriptive Toy Example
7.5.3. Representative Methods
7.5.4. Beginner Takeaway
8. How Jailbreaks Are Evaluated
8.1. Attack Success Rate
8.2. Perplexity and Readability
8.3. Datasets and Benchmarks
9. Comparative Summary
10. Discussion
Jailbreaking is not one trick. It is a whole family of ways to make a model solve the wrong task or ignore its intended rules.
11. Conclusion
References
- Sibo Yi, Yule Liu, Zhen Sun, Tianshuo Cong, Xinlei He, Jiaxing Song, Ke Xu, and Qi Li, “Jailbreak Attacks and Defenses Against Large Language Models: A Survey,” arXiv preprint arXiv:2407.04295, 2024.
- Hugo Touvron et al., “Llama 2: Open Foundation and Fine-Tuned Chat Models,” arXiv preprint arXiv:2307.09288, 2023.
- Long Ouyang et al., “Training Language Models to Follow Instructions with Human Feedback,” in NeurIPS, 2022.
- Yi Liu et al., “Jailbreaking ChatGPT via Prompt Engineering: An Empirical Study,” arXiv preprint arXiv:2305.13860, 2023.
- Andy Zou, Zifan Wang, J. Zico Kolter, and Matt Fredrikson, “Universal and Transferable Adversarial Attacks on Aligned Language Models,” arXiv preprint arXiv:2307.15043, 2023.
- Erik Jones, Anca D. Dragan, Aditi Raghunathan, and Jacob Steinhardt, “Automatically Auditing Large Language Models via Discrete Optimization,” in ICML, 2023.
- Sicheng Zhu et al., “AutoDAN: Interpretable Gradient-Based Adversarial Attacks on Large Language Models,” arXiv preprint arXiv:2310.15140, 2023.
- Xiaogeng Liu, Nan Xu, Muhao Chen, and Chaowei Xiao, “AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models,” arXiv preprint arXiv:2310.04451, 2023.
- Maksym Andriushchenko, Francesco Croce, and Nicolas Flammarion, “Jailbreaking Leading Safety-Aligned LLMs with Simple Adaptive Attacks,” arXiv preprint arXiv:2404.02151, 2024.
- Simon Geisler et al., “Attacking Large Language Models with Projected Gradient Descent,” arXiv preprint arXiv:2402.09154, 2024.
- Jonathan Hayase et al., “Query-Based Adversarial Prompt Generation,” arXiv preprint arXiv:2402.12329, 2024.
- Chawin Sitawarin et al., “PAL: Proxy-Guided Black-Box Attack on Large Language Models,” arXiv preprint arXiv:2402.09674, 2024.
- Zhuo Zhang et al., “Make Them Spill the Beans! Coercive Knowledge Extraction from (Production) LLMs,” arXiv preprint arXiv:2312.04782, 2023.
- Yanrui Du et al., “Analyzing the Inherent Response Tendency of LLMs: Real-World Instructions-Driven Jailbreak,” arXiv preprint arXiv:2312.04127, 2023.
- Xingang Guo et al., “COLD-Attack: Jailbreaking LLMs with Stealthiness and Controllability,” arXiv preprint arXiv:2402.08679, 2024.
- Xuandong Zhao et al., “Weak-to-Strong Jailbreaking on Large Language Models,” arXiv preprint arXiv:2401.17256, 2024.
- Yangsibo Huang et al., “Catastrophic Jailbreak of Open-Source LLMs via Exploiting Generation,” in ICLR, 2024.
- Yukai Zhou and Wenjie Wang, “Don’t Say No: Jailbreaking LLM by Suppressing Refusal,” arXiv preprint arXiv:2404.16369, 2024.
- Xiangyu Qi et al., “Fine-Tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To!,” arXiv preprint arXiv:2310.03693, 2023.
- Xianjun Yang et al., “Shadow Alignment: The Ease of Subverting Safely-Aligned Language Models,” arXiv preprint arXiv:2310.02949, 2023.
- Simon Lermen, Charlie Rogers-Smith, and Jeffrey Ladish, “LoRA Fine-Tuning Efficiently Undoes Safety Training in Llama 2-Chat 70B,” arXiv preprint arXiv:2310.20624, 2023.
- Qiusi Zhan et al., “Removing RLHF Protections in GPT-4 via Fine-Tuning,” arXiv preprint arXiv:2311.05553, 2023.
- Xuan Li et al., “DeepInception: Hypnotize Large Language Model to Be Jailbreaker,” arXiv preprint arXiv:2311.03191, 2023.
- Peng Ding et al., “A Wolf in Sheep’s Clothing: Generalized Nested Jailbreak Prompts can Fool Large Language Models Easily,” arXiv preprint arXiv:2311.08268, 2023.
- Dongyu Yao et al., “FuzzLLM: A Novel and Universal Fuzzing Framework for Proactively Discovering Jailbreak Vulnerabilities in Large Language Models,” arXiv preprint arXiv:2309.05274, 2023.
- Zeming Wei, Yifei Wang, and Yisen Wang, “Jailbreak and Guard Aligned Language Models with Only Few In-Context Demonstrations,” arXiv preprint arXiv:2310.06387, 2023.
- Jiongxiao Wang et al., “Adversarial Demonstration Attacks on Large Language Models,” arXiv preprint arXiv:2305.14950, 2023.
- Gelei Deng et al., “Pandora: Jailbreak GPTs by Retrieval Augmented Generation Poisoning,” arXiv preprint arXiv:2402.08416, 2024.
- Haoran Li et al., “Multi-Step Jailbreaking Privacy Attacks on ChatGPT,” arXiv preprint arXiv:2304.05197, 2023.
- Anthropic, “Many-Shot Jailbreaking,” Anthropic Research, 2024.
- Xiaosen Zheng et al., “Improved Few-Shot Jailbreaking Can Circumvent Aligned Language Models and Their Defenses,” arXiv preprint arXiv:2406.01288, 2024.
- Daniel Kang et al., “Exploiting Programmatic Behavior of LLMs: Dual-Use Through Standard Security Attacks,” arXiv preprint arXiv:2302.05733, 2023.
- Huijie Lv et al., “CodeChameleon: Personalized Encryption Framework for Jailbreaking Large Language Models,” arXiv preprint arXiv:2402.16717, 2024.
- Youliang Yuan et al., “GPT-4 Is Too Smart To Be Safe: Stealthy Chat with LLMs via Cipher,” in ICLR, 2024.
- Fengqing Jiang et al., “ArtPrompt: ASCII Art-based Jailbreak Attacks against Aligned LLMs,” arXiv preprint arXiv:2402.11753, 2024.
- Divij Handa et al., “Jailbreaking Proprietary Large Language Models using Word Substitution Cipher,” arXiv preprint arXiv:2402.10601, 2024.
- Tong Liu et al., “Making Them Ask and Answer: Jailbreaking Large Language Models in Few Queries via Disguise and Reconstruction,” arXiv preprint arXiv:2402.18104, 2024.
- Xirui Li et al., “DrAttack: Prompt Decomposition and Reconstruction Makes Powerful LLM Jailbreakers,” arXiv preprint arXiv:2402.16914, 2024.
- Zhiyuan Chang et al., “Play Guessing Game with LLM: Indirect Jailbreak Attack with Implicit Clues,” arXiv preprint arXiv:2402.09091, 2024.
- Yue Deng et al., “Multilingual Jailbreak Challenges in Large Language Models,” in ICLR, 2024.
- Zheng Xin Yong, Cristina Menghini, and Stephen H. Bach, “Low-Resource Languages Jailbreak GPT-4,” arXiv preprint arXiv:2310.02446, 2023.
- Jie Li et al., “A Cross-Language Investigation into Jailbreak Attacks in Large Language Models,” arXiv preprint arXiv:2401.16765, 2024.
- Raz Lapid, Ron Langberg, and Moshe Sipper, “Open Sesame! Universal Black Box Jailbreaking of Large Language Models,” arXiv preprint arXiv:2309.01446, 2023.
- Jiahao Yu et al., “GPTFUZZER: Red Teaming Large Language Models with Auto-Generated Jailbreak Prompts,” arXiv preprint arXiv:2309.10253, 2023.
- Xiaoxia Li et al., “Semantic Mirror Jailbreak: Genetic Algorithm Based Jailbreak Prompts Against Open-Source LLMs,” arXiv preprint arXiv:2402.14872, 2024.
- Kazuhiro Takemoto, “All in How You Ask for It: Simple Black-Box Method for Jailbreak Attacks,” arXiv preprint arXiv:2401.09798, 2024.
- Gelei Deng et al., “MasterKey: Automated Jailbreak Across Multiple Large Language Model Chatbots,” arXiv preprint arXiv:2307.08715, 2023.
- Yi Zeng et al., “How Johnny Can Persuade LLMs to Jailbreak Them: Rethinking Persuasion to Challenge AI Safety by Humanizing LLMs,” arXiv preprint arXiv:2401.06373, 2024.
- Rusheb Shah et al., “Scalable and Transferable Black-Box Jailbreaks for Language Models via Persona Modulation,” arXiv preprint arXiv:2311.03348, 2023.
- Patrick Chao et al., “Jailbreaking Black Box Large Language Models in Twenty Queries,” arXiv preprint arXiv:2310.08419, 2023.
- Haibo Jin et al., “GUARD: Role-Playing to Generate Natural-Language Jailbreakings to Test Guideline Adherence of Large Language Models,” arXiv preprint arXiv:2402.03299, 2024.
- Suyu Ge et al., “MART: Improving LLM Safety with Multi-Round Automatic Red-Teaming,” arXiv preprint arXiv:2311.07689, 2023.
- Yu Tian et al., “Evil Geniuses: Delving into the Safety of LLM-Based Agents,” arXiv preprint arXiv:2311.11855, 2023.
- Anay Mehrotra et al., “Tree of Attacks: Jailbreaking Black-Box LLMs Automatically,” arXiv preprint arXiv:2312.02119, 2023.
- Neel Jain et al., “Baseline Defenses for Adversarial Attacks Against Aligned Language Models,” arXiv preprint arXiv:2309.00614, 2023.
- Gabriel Alon and Michael Kamfonas, “Detecting Language Model Attacks with Perplexity,” arXiv preprint arXiv:2308.14132, 2023.
- Bochuan Cao et al., “Defending Against Alignment-Breaking Attacks via Robustly Aligned LLM,” arXiv preprint arXiv:2309.14348, 2023.
- Alexander Robey et al., “SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks,” arXiv preprint arXiv:2310.03684, 2023.
- Jiabao Ji et al., “Defending Large Language Models against Jailbreak Attacks via Semantic Smoothing,” arXiv preprint arXiv:2402.16192, 2024.
- Xiaoyu Zhang et al., “A Mutation-Based Method for Multi-Modal Jailbreaking Attack Detection,” arXiv preprint arXiv:2312.10766, 2023.
- Aounon Kumar et al., “Certifying LLM Safety against Adversarial Prompting,” arXiv preprint arXiv:2309.02705, 2023.
- Andy Zhou, Bo Li, and Haohan Wang, “Robust Prompt Optimization for Defending Language Models Against Jailbreaking Attacks,” arXiv preprint arXiv:2401.17263, 2024.
- Reshabh K. Sharma, Vinayak Gupta, and Dan Grossman, “SPML: A DSL for Defending Language Models against Prompt Attacks,” arXiv preprint arXiv:2402.11755, 2024.
- Xiaotian Zou, Yongkang Chen, and Ke Li, “Is the System Message Really Important to Jailbreaks in Large Language Models?,” arXiv preprint arXiv:2402.14857, 2024.
- Jiongxiao Wang et al., “Mitigating Fine-Tuning Jailbreak Attack with Backdoor Enhanced Alignment,” arXiv preprint arXiv:2402.14968, 2024.
- Chujie Zheng et al., “On Prompt-Driven Safeguarding for Large Language Models,” arXiv preprint arXiv:2401.18018, 2024.
- Federico Bianchi et al., “Safety-Tuned LLaMAs: Lessons From Improving the Safety of Large Language Models that Follow Instructions,” in ICLR, 2024.
- Boyi Deng et al., “Attack Prompt Generation for Red Teaming and Defending Large Language Models,” in EMNLP, 2023.
- Rishabh Bhardwaj and Soujanya Poria, “Red-Teaming Large Language Models using Chain of Utterances for Safety-Alignment,” arXiv preprint arXiv:2308.09662, 2023.
- Yun Luo et al., “An Empirical Study of Catastrophic Forgetting in Large Language Models During Continual Fine-Tuning,” arXiv preprint arXiv:2308.08747, 2023.
- Yuntao Bai et al., “Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback,” arXiv preprint arXiv:2204.05862, 2022.
- Anand Siththaranjan, Cassidy Laidlaw, and Dylan Hadfield-Menell, “Distributional Preference Learning: Understanding and Accounting for Hidden Context in RLHF,” in ICLR, 2024.
- Rafael Rafailov et al., “Direct Preference Optimization: Your Language Model is Secretly a Reward Model,” in NeurIPS, 2023.
- Victor Gallego, “Configurable Safety Tuning of Language Models with Synthetic Preference Data,” arXiv preprint arXiv:2404.00495, 2024.
- Zixuan Liu, Xiaolin Sun, and Zizhan Zheng, “Enhancing LLM Safety via Constrained Direct Preference Optimization,” arXiv preprint arXiv:2403.02475, 2024.
- Yueqi Xie et al., “GradSafe: Detecting Unsafe Prompts for LLMs via Safety-Critical Gradient Analysis,” arXiv preprint arXiv:2402.13494, 2024.
- Xiaomeng Hu, Pin-Yu Chen, and Tsung-Yi Ho, “Gradient Cuff: Detecting Jailbreak Attacks on Large Language Models by Exploring Refusal Loss Landscapes,” arXiv preprint arXiv:2403.00867, 2024.
- Zhangchen Xu et al., “SafeDecoding: Defending against Jailbreak Attacks via Safety-Aware Decoding,” arXiv preprint arXiv:2402.08983, 2024.
- Yuhui Li et al., “RAIN: Your Language Models Can Align Themselves Without Finetuning,” arXiv preprint arXiv:2309.07124, 2023.
- Heegyu Kim, Sehyun Yuk, and Hyunsouk Cho, “Break the Breakout: Reinventing LM Defense Against Jailbreak Attacks with Self-Refinement,” arXiv preprint arXiv:2402.15180, 2024.
- Yuqi Zhang et al., “Intention Analysis Makes LLMs a Good Jailbreak Defender,” arXiv preprint arXiv:2401.06561, 2024.
- Meta Llama Team, “Llama Guard 2,” Meta AI, 2024.
- Yifan Zeng et al., “AutoDefense: Multi-Agent LLM Defense against Jailbreak Attacks,” arXiv preprint arXiv:2403.04783, 2024.
- Lukas Struppek et al., “Exploring the Adversarial Capabilities of Large Language Models,” arXiv preprint arXiv:2402.09132, 2024.
- Delong Ran et al., “JailbreakEval: An Integrated Toolkit for Evaluating Jailbreak Attempts Against Large Language Models,” arXiv preprint arXiv:2406.09321, 2024.
- Mantas Mazeika et al., “HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal,” arXiv preprint arXiv:2402.04249, 2024.
- Patrick Chao et al., “JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models,” arXiv preprint arXiv:2404.01318, 2024.
- Hao Sun et al., “Safety Assessment of Chinese Large Language Models,” arXiv preprint arXiv:2304.10436, 2023.
| Family | Type | Core intuition | Toy-example view |
|---|---|---|---|
| Gradient-based attack | White-box attack | Use internal gradients to optimize a special attack prefix or suffix | Like adjusting a lock combination using a smart signal that says whether each move gets you closer |
| Logits-based attack | White-box attack | Use next-token probabilities to steer the model toward unsafe continuations | Like rewording a question while secretly watching which answer option the student is leaning toward |
| Fine-tuning-based attack | White-box attack | Retrain the model with harmful examples so it becomes less safe | Like retraining a tutor with many bad examples until its judgment changes |
| Scenario nesting | Black-box attack | Hide the harmful request in a story, role-play, or template | Like sneaking the answer key inside a play instead of asking for it directly |
| Context-based attack | Black-box attack | Use demonstrations or retrieved context to teach bad behavior in-context | Like showing the model several bad examples right before asking your real question |
| Code injection | Black-box attack | Hide harmful meaning inside code-like instructions or composition | Like splitting a forbidden request into variables and asking the model to join them |
| Cipher / rewriting attack | Black-box attack | Disguise the harmful meaning using encoding, art, or puzzles | Like asking the model to decode a secret message and then answer it |
| Low-resource language attack | Black-box attack | Use under-protected languages or rare forms | Like asking the same forbidden question in a language the safety training rarely practiced |
| Genetic attack | Black-box attack | Evolve prompts through mutation and selection | Like breeding stronger trick questions generation after generation |
| LLM-based generation attack | Black-box attack | Use one or more LLMs to automatically design better jailbreak prompts | Like hiring a chatbot coach to improve your jailbreak prompt after every failure |
| Prompt detection | Prompt-level defense | Detect suspicious prompts before answering them | Like a receptionist checking whether a note looks fake |
| Prompt perturbation | Prompt-level defense | Slightly change the prompt and test whether the attack breaks | Like rephrasing a suspicious sentence to see if its behavior stays stable |
| System prompt safeguard | Prompt-level defense | Strengthen hidden instructions that guide safe behavior | Like quietly reminding the assistant of the rules before every conversation |
| SFT-based defense | Model-level defense | Teach safety through supervised examples | Like training a tutor with many examples of safe refusals |
| RLHF-based defense | Model-level defense | Use human preference feedback to improve safe behavior | Like teachers repeatedly choosing the better of two answers so the model learns good judgment |
| Gradient/logit analysis defense | Model-level defense | Inspect internal signals or decode more safely | Like checking internal stress signals, not just the spoken words |
| Refinement defense | Model-level defense | Ask the model to review and correct itself | Like telling a student to reread the answer and check whether it breaks the rules |
| Proxy defense | Model-level defense | Use another model as a guard or monitor | Like having a second teacher inspect every answer before it is shown |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content. |
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).