Submitted: 30 August 2025
Posted: 01 September 2025
Abstract
Keywords:
1. Introduction
1.1. Background
1.2. Research Questions
2. Related Work
2.1. LLM Jailbreaks: Concept and Current Efficacy
2.1.1. What Is an LLM Jailbreak?
2.1.2. Prevalence and Effectiveness of LLM Jailbreaks
2.2. Taxonomy of Attacks/Defenses and Evaluation Methods
2.2.1. Attack Taxonomy: White-Box and Black-Box
2.2.2. Defense Taxonomy: Prompt-Level and Model-Level
2.2.3. Evaluation Methods and Benchmarks
2.3. The “DAN” Prompt: Origin and Characteristics
2.3.1. What Is DAN?
2.3.2. Persona and Style
2.3.3. Evolution and Variants
2.4. Hypothesized Social and Psychological Mechanisms of DAN
2.4.1. Narrative and Pragmatic Alignment
2.4.2. Social Engineering of the Model
2.4.3. User Response to Anthropomorphic Language
2.5. DAN Across Different Model Versions and Platforms
2.5.1. Cross-Model Generalization
2.5.2. Beyond OpenAI’s Models
2.5.3. Evolution Across Model Updates
2.5.4. High-Risk Domains
3. Methodology
4. Results for RQ1: Potential Risk Impacts of DAN-like Jailbreaks
4.1. Technical and Behavioral Harms
4.1.1. Facilitation of Illicit Activities
4.1.2. Cybersecurity Threats (Malware and Hacking)
4.1.3. Privacy Breaches, Data Leakage, and Tool Misuse
4.1.4. Summary of Technical and Behavioral Harms
4.2. Social Interaction and Psychological Risks
4.2.1. “Cyber Love” and Inappropriate Intimacy
4.2.2. Emotional Manipulation and Abuse
4.2.3. Loss of Trust and Social Confusion
4.2.4. Evidence Base and Open Questions
4.3. Online Abuse and Externalities
4.3.1. Hate Speech and Harassment at Scale
4.3.2. Misinformation and Fake Content
4.3.3. Social Engineering and Grooming
4.3.4. Emotional and Societal Harm Amplification
4.3.5. Evaluation Implications and Summary
4.4. Propagation and Portability of Jailbreaks
4.4.1. Long-Term Survival of Effective Prompts
4.4.2. Prompt Sharing and Optimization Communities
4.4.3. Cross-Model and Cross-System Transfer
4.4.4. Arms Race Dynamics and Public Availability
4.4.5. Moving Target as Models Evolve
5. Results for RQ2: Defense Strategies for DAN-like Jailbreaks
5.1. Prompt-Level Defenses
5.1.1. Input Prompt Detection and Filtering
5.1.2. Prompt Perturbation and Transformation
5.1.3. System Prompt Reinforcement
5.1.4. Output Filtration
5.1.5. Strengths and Limits of Prompt-Layer Controls
5.2. Model-Level Defenses
5.2.1. Safety Fine-Tuning (SFT)
5.2.2. Reinforcement Learning from Human Feedback (RLHF)
5.2.3. Gradient- and Logit-Based Monitoring and Adversarial Training
5.2.4. Reflective Response Refinement
5.2.5. Knowledge Deletion and Model Editing
5.2.6. Effectiveness and Trade-Offs
5.3. External Agents and Composite Defenses
5.3.1. Multi-Agent Guardrails
5.3.2. Moving-Target or Model-Hopping Defense
5.3.3. External Rule-Based Filters and Platform Controls
5.3.4. Combined Pipelines and Continuous Monitoring
5.3.5. User Education and Policy Enforcement
6. Conclusions
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).