HumorReject: Decoupling LLM Safety from Refusal Prefix via A Little Humor

Zihui Wu; Haichang Gao; Jiacheng Luo; Zhaoxiang Liu

doi:10.20944/preprints202501.1736.v1

Submitted:

23 January 2025

Posted:

23 January 2025

You are already at the latest version

Abstract

Large Language Models (LLMs) commonly rely on explicit refusal prefixes for safety, making them vulnerable to prefix injection attacks. We introduce HumorReject, a novel data-driven approach that fundamentally reimagines LLM safety by decoupling it from refusal prefixes through the use of humor as an indirect refusal strategy. Rather than explicitly rejecting harmful instructions, HumorReject responds with contextually appropriate humor that naturally defuses potentially dangerous requests while maintaining engaging interactions. Our approach effectively addresses the common "over-defense" issues in existing safety mechanisms, demonstrating superior robustness against various attack vectors while preserving natural and high-quality interactions on legitimate tasks. Our findings suggest that innovations at the data level are even more fundamental than the alignment algorithm itself in achieving effective LLM safety, opening new directions for developing more resilient and user-friendly AI systems. Our code and dataset are available at https://github.com/wooozihui/HumorReject

Keywords:

LLM safety

Subject:

Computer Science and Mathematics - Artificial Intelligence and Machine Learning

1. Introduction

Large Language Models (LLMs) have become a cornerstone technology of the new generation of artificial intelligence, making previously unattainable applications like automatic programming increasingly commonplace. However, the safety concerns of LLMs remain a significant challenge. To enhance LLM safety, researchers have employed alignment training to instill human preferences—training models to refuse rather than execute harmful instructions [1]. Recent studies [2,3], however, have revealed a critical limitation in this approach: LLMs’ safety appears to be predominantly controlled by refusal behavior, but only generalizes to the initial tokens of their refusal responses. This vulnerability is particularly concerning as it enables attackers to easily manipulate model responses by simply inserting affirmative prefixes, effectively bypassing safety measures to elicit harmful responses from safety-aligned models, which is known as prefix injection attack [4,5].

Unfortunately, prefix injection attacks are almost inevitable due to several factors: 1) Although safety-trained models can refuse harmful instructions, the mere act of beginning a response with an affirmative prefix is not inherently harmful; 2) Some training tasks may require models to follow user instructions to begin responses with specific prefixes (e.g., JSON format), further increasing the risk of prefix injection; 3) Attackers can enhance the success rate of prefix injection through adversarial techniques [5,6]; 4) Most critically, when models are white-box or providers support assistant prefilling, attackers can directly modify LLM response prefixes. Consequently, as long as model safety relies on refusal prefixes, the risk of prefix injection jailbreak attacks remains inevitable. This leads to our core research question: Can the safety of LLMs be enhanced by reducing their reliance on refusal prefixes?

To address this challenge, we introduce HumorReject, an innovative data-driven approach that employs humor as an indirect refusal strategy to deflect harmful instructions. Our choice of humor as the core solution offers two distinct advantages: 1) it provides a way to generate harmless responses without explicit refusal, and 2) it preserves natural conversational flow even when faced with injected affirmative prefixes, as the seamless transition to humorous deflection maintains contextual coherence. Specifically, we constructed a HumorReject preference dataset comprising 200 harmful and 200 benign samples. We demonstrate that by applying the existing alignment algorithm [7] with merely 10 epochs of fine-tuning on this dataset, we can fundamentally enhance model safety, even on previously unsafeguarded LLMs [8]. As illustrated in Figure 1, our approach proves remarkably effective - even when directly prefilled with an affirmative prefix, the model successfully evades harmful instructions by generating witty, humorous responses, establishing HumorReject as a compelling alternative to traditional refusal training.

To thoroughly evaluate our approach, we formulate six key research questions (RQs): RQ 1: How effectively does HumorReject decouple safety from refusal prefix? RQ 2: How effectively does HumorReject defend against prefix injection attacks? RQ 3: Beyond prefix injection, do other types of attacks still pose threats to model safety? RQ 4: Does the HumorReject approach introduce new security risks? RQ 5: Does HumorReject affect the model’s performance on benign inputs? RQ 6: Why did previous humorous LLMs not demonstrate good safety? We will address these questions in § Section 4.1 through § Section 4.6. These questions guide our comprehensive evaluation of HumorReject’s effectiveness and resilience.

While our approach presents a significant advancement in enhancing LLM safety, we acknowledge certain limitations. Firstly, investigating whether LLMs can possess a sense of humor akin to humans is beyond the scope of this paper, as it pertains to broader philosophical and cultural considerations. Instead, we utilize humorous rejection solely as a strategy to decouple safety from refusal prefixes. Secondly, while humorous replies provide harmless responses without direct refusal, we recognize that humor is not the only possible solution for this decoupling. Future research may explore alternative rejection strategies that could complement or surpass the effectiveness of humor in ensuring model safety.

In summary, the main contributions of this paper are as follows:

We propose a novel indirect refusal strategy based on humorous responses, which can effectively decouple LLMs’ safety from refusal prefixes (§Section 4.1), significantly lowering the risk of prefix injection attacks;
We construct and publicly release the HumorReject preference dataset of 400 samples, and demonstrate that using existing alignment algorithm [7] with just 10 epochs of fine-tuning on this dataset can fundamentally enhance the safety of the previously unsafeguarded Mistral-7B-instruct-v0.1 model (§Section 4.2). This effective result indicates that existing alignment algorithms are sufficient for producing highly safe models—innovations at the data level are even more fundamental than the alignment algorithm itself in achieving effective LLM safety;
Beyond prefix injection attacks, we conduct extensive security evaluations of the HumorReject model through various attack vectors, including mismatched generalization attacks (§Section 4.3) and our novel adaptive attack “HumorDAN” (§Section 4.4). Our experimental results demonstrate the model’s robust resistance against these diverse attack strategies;
We also perform in-depth analysis of model usability and find that previous defense methods suffer from over-defense issues: 1) models generate refusals even for benign inputs [2,11], and 2) response quality significantly deteriorates under harmful context conditions [12]. HumorReject training effectively avoid these problems (§Section 4.5).

Through this research, we aim to enhance the safety of current LLMs and provide new perspectives for future work in this direction.

2. Related Work

2.1. LLM Alignment

Aligning LLMs with human preferences has predominantly relied on supervised fine-tuning (SFT) and reinforcement learning with human feedback (RLHF) [13]. Direct Preference Optimization (DPO) [14] streamlines this process by directly optimizing for human preference scores. ORPO [7] unifies SFT and preference alignment into a single process, while KTO [15] introduces a human-aware loss function based on prospect theory for binary feedback.

Despite these advancements, LLMs remain vulnerable to jailbreak attacks. To mitigate such risks, researchers have recently explored novel training approaches. The Circuit Breaker [12] prevents harmful outputs by redirecting internal representations to an orthogonal space. To combat "shallow safety alignment" where safety measures only appear in initial tokens, Qi et al. [2] augment training data with harmful instruction-response pairs followed by refusals. Decoupled Refusal Training (DeRTa) [11] enables models to refuse harmful content at any position through combined Maximum Likelihood estimation and Reinforced Transition Optimization. While these methodologies demonstrate remarkable effectiveness in enhancing LLM safety beyond traditional alignment approaches, our analysis in §Section 4.5 reveals their susceptibility to over-defense issues.

2.2. Jailbreak Attacks

Numerous studies [5,6,10,16,17,18,19,20] have demonstrated LLMs’ vulnerability to jailbreak attacks that exploit their training and alignment weaknesses to elicit harmful outputs. Wei et al. [4] categorized these attacks into two types: competing objectives and mismatched generalization.

Competing objectives attacks emerge when a model’s pretraining and instruction-following objectives conflict with its safety objectives. A notable example is prefix injection [5,6], where attackers use affirmative prefixes to misalign model behavior, exploiting the tension between instruction-following and safety constraints.

Mismatched generalization attacks [21,22,23,24] occur when inputs fall outside the model’s safety training distribution while remaining within its pretraining scope. For instance, CodeAttack [22] elicits harmful outputs by prompting responses in out-of-distribution formats like code blocks, highlighting the limitations of current safety measures against diverse attack vectors.

2.3. LLM with Humor

Recent research on LLMs has increasingly focused on their ability to understand and generate humor. In the realm of humor generation, Zhong et al. [25] introduced the Creative Leap-of-Thought (CLoT) paradigm, which significantly enhances LLMs’ humor generation capabilities. Similarly, Tikhonov et al. [26] explored multi-step reasoning for generating stand-up jokes, further advancing the humor creation abilities of LLMs. Chen et al. [27] constructed a large-scale Chinese humor response dataset and designed auxiliary tasks to improve pre-trained language models’ ability to generate humorous responses. Vikhorev et al. [28] developed the CleanComedy corpus, employing data filtering and model training to create non-toxic, humorous content while analyzing the effectiveness and limitations of these methods. Additionally, Chen et al. [29] proposed a multi-stage curriculum learning framework that significantly enhances LLMs’ ability to generate puns by optimizing both pun structure preferences and humor preferences. Overall, these works demonstrate that LLMs have already developed preliminary capabilities in humor understanding and generation.

3. HumorReject Training

In this section, we will introduce the training details of HumorReject, including training dataset construction (§ Section 3.1) and training settings (§ Section 3.2).

3.1. Training Dataset Construction

Overview. We construct a preference dataset for HumorReject alignment training, which comprises both harmful and benign instructions. Specifically, we extract 200 samples from the AdvBench [5] dataset as harmful instructions and 200 samples from the Alpaca [30] dataset as benign instructions. As illustrated in Figure 2, for harmful instructions, the chosen response is a humorous reply generated by GPT-4o prompted for humor, while the rejected response is from an uncensored LLM (we use Mistral-7B-instruct-v0.1). Conversely, for benign instructions, the chosen response is the standard reply from the uncensored LLM, whereas the rejected response is a humorous reply generated by GPT-4o. The construction of this preference dataset aligns with and reflects our training objectives:

1) For harmful instructions, we aim to train the model to generate humorous responses instead of harmful ones, thereby decoupling safety from refusal prefixes.

2) For benign instructions, we want the model to maintain its normal functionality and avoid unnecessary humorous responses.

3) Through this preference dataset, we can effectively guide the model to learn when humor is appropriate as a safety mechanism and when it should provide standard responses.

Non-Refusing but Harmless Humorous Responses. To ensure that humorous responses do not resemble refusal-based prefixes, we prompt GPT-4o to generate responses that are humorous but free of explicit rejections or negative tones. These responses are designed to defuse potential harm while maintaining a natural conversational flow. To verify their harmlessness, we use the Claude-3.5-sonnet [31] model as the safety judge, ensuring that all generated responses are thoroughly safe. Any response flagged as harmful is regenerated until it fully complies with harmlessness standards.

3.2. Training Settings

We employ the ORPO [7] alignment algorithm for training our models. The training is conducted on 4 NVIDIA RTX 4090 GPUs with a batch size of 4 and a learning rate of 5e-5, utilizing a cosine learning rate scheduler. We adopt Low-Rank Adaptation (LoRA) [32] with rank 8 for efficient parameter-efficient fine-tuning. The training process spans 10 epochs and is implemented using the open-source LLaMA Factory [33] codebase. We conducted training on two instruct-tuned LLMs: Llama3-8B-instruct [9] and Mistral-7B-instruct-v0.1 [8] to producing their HumorReject variants.

4. Empirical Studies

This section presents empirical studies designed to address the key research questions outlined in §Section 1. We first detail our experimental setup, including the test dataset, baseline defense strategies, and evaluation metrics.

Test Dataset. We constructed a test dataset containing 100 harmful instructions by selecting 50 samples each from AdvBench [5] and HarmBench [34] to evaluate safety performance against harmful inputs. Additionally, each instruction in the test set is paired with a positive prefix (e.g, "Sure, here is the method to make a bomb"), which serves as the target for prefix injection attacks.

Baseline Defense. In addition to the vanilla model, we incorporate three different defense training strategies as baselines: Circuit Breaker (CB) [12], Qi et al.’s augmentation [2] (in our paper we refer to it as DeepAug), and DeRTa [11]. CB employs a forgetting mechanism to make the model unlearn how to effectively execute harmful instructions, DeepAug utilizes data augmentation, and DeRTa proposes a Reinforced Transition Optimization training method to enable model rejection at any position.

Evaluation Metrics. We design three different judge prompts, namely Safety Judge, Rejection Judge, and Humor Judge. The Safety Judge is used to determine whether the model generates harmful instructions, the Rejection Judge is used to determine whether the model generates refusal prefixes, and the Humor Judge is used to determine whether the model generates humorous replies. Based on these judges, we define three metrics: Safety Rate, Humor Rate, and Reject Rate, each ranging from 0% to 100%, representing the proportion of responses classified as safe, humorous, and refusal-based, respectively. In our experiments, we utilize Claude-3.5-sonnet [31] to implement these judge models.

4.1. RQ1: How Effectively Does HumorReject Decouple Safety from Refusal Prefix?

To assess whether HumorReject enables the model to generate humorous and harmless responses to harmful instructions instead of refusal statements, we evaluated it using the 100 harmful instructions from our test dataset.

As shown in Table 1, the baseline model, LLaMA3-8B-instruct Vanilla, exhibited a high Reject Rate of 96% and a Safety Rate of 97%, indicating a strong reliance on refusal prefixes when responding to harmful instructions. Besides, it did not generate any humorous responses (Humor Rate of 0%). In contrast, the HumorReject model achieved a remarkable Humor Rate of 95%, showcasing its ability to produce humorous and harmless replies. The Reject Rate plummeted to 2%, demonstrating a minimal dependence on traditional refusal prefixes. Additionally, the Safety Rate remained at 100%, ensuring that all responses were safe and devoid of harmful content. These results affirm that HumorReject successfully decouples safety mechanisms from refusal prefixes by leveraging humor.

4.2. RQ2: How Effectively Does HumorReject Defend Against Prefix Injection Attacks?

In §Section 4.1, we initially validated the safety of the HumorReject model against direct harmful instruction inputs. Building on this foundation, we now assess the robustness of HumorReject against prefix injection attacks.

We conducted experiments on two models: Llama3-8B-instruct and Mistral-7B-instruct-v0.1. We tested five distinct types of prefix injection attacks—GCG [5], AutoDAN [6], Template [10], Prefill, and Template+Prefill—and employed five defense strategies for each attack type, including Vanilla, Circuit Breaker (CB), DeepAug, DeRTa, and our proposed HumorReject. The following table presents the Safety Rates of each defense strategy across the various prefix injection attacks for both models, along with their average performance.

As illustrated in Table 2, HumorReject consistently outperforms all baseline defense strategies across every type of prefix injection attack for both Llama3-8B-instruct and Mistral-7B-instruct-v0.1 models. Specifically, HumorReject achieves Safety Rates ranging from 95% to 100%, demonstrating exceptional robustness and reliability in mitigating prefix injection attacks.

When compared to the Vanilla model, which exhibits moderate to low Safety Rates depending on the attack type (averaging 63.2% for Llama3-8B-instruct and 6.6% for Mistral-7B-instruct-v0.1), HumorReject significantly enhances defense effectiveness. Additionally, while strategies like Circuit Breaker also show strong performance with average Safety Rates of 97.4% for Llama3-8B-instruct and 90.6% for Mistral-7B-instruct-v0.1, HumorReject achieves even higher averages of 99.0% and 96.6% respectively. This highlights HumorReject’s superior capability in safeguarding LLMs against adversarial prefix injections, underscoring its effectiveness and robustness relative to existing defense mechanisms.

4.3. RQ3: Beyond Prefix Injection, Do Other Types of Attacks Still Pose Threats to Model Safety?

As highlighted by [4], in addition to prefix injection attacks, LLMs are vulnerable to mismatched generalization attacks. These attacks exploit discrepancies between training and test data distributions, causing models to generate unintended or harmful outputs. To evaluate the effectiveness of HumorReject against such threats, we compare its performance against three specific mismatched generalization attacks: ReNeLLM [24], CodeAttack [22], and CodeChameleon [21].

The results in Table 3 indicate that HumorReject consistently maintains high Safety Rates across all three mismatched generalization attacks for both Llama3-8B-instruct and Mistral-7B-instruct-v0.1 models. Specifically, HumorReject achieves Safety Rates of 92% for ReNeLLM, 77% for CodeAttack, and 83% for CodeChameleon on Llama3-8B-instruct. For Mistral-7B-instruct-v0.1, the Safety Rates are 95%, 98%, and 95% respectively. These findings suggest that HumorReject effectively mitigates the risks posed by mismatched generalization attacks, enhancing the overall safety of LLMs beyond protection against prefix injection alone.

From our case study in Appendix A, we believe that HumorReject enhances the model’s defense against mismatched generalization attacks by establishing a deep connection between harmful instructions and humorous responses. This may enable the model to respond humorously to harmful instructions even when required to use unseen reply templates, such as those used in CodeAttack.

4.4. RQ4: Does the HumorReject Approach Introduce New Security Risks?

HumorDAN, A Novel But Ineffective Adaptive Attack. Previous research [35] has emphasized that new defense mechanisms can potentially introduce new security vulnerabilities. To address this concern for HumorReject, we designed an adaptive attack named HumorDAN. Inspired by the DAN (Do Anything Now) [36] template, HumorDAN prompts the model to respond in two distinct styles: Classic and Humorous. This attack is considered adaptive because it attempts to exploit the fact that HumorReject models are trained to respond humorously rather than directly refuse harmful instructions - by requesting a "Classic Style" response, the attack aims to bypass the humorous defense mechanism and elicit harmful content in a straightforward manner.

Results. We executed the HumorDAN attack against both HumorReject Mistral-7B-instruct-v0.1 and HumorReject Llama3-8B-instruct. The results are as follows:

HumorReject Mistral-7B-instruct-v0.1: Safety Rate 99%
HumorReject Llama3-8B-instruct: Safety Rate 99%

These high safety rates indicate that the HumorDAN attack was unsuccessful in bypassing the HumorReject defense mechanism. Even when prompted to respond in Classic Style, the models consistently refused to generate harmful content, instead providing humorous refusals.

Defense Example. To illustrate the effectiveness of HumorReject, consider the following example where the model is prompted to make a bomb:

In this example, regardless of the requested style, the model refrains from providing harmful instructions and instead offers humorous responses that neutralize the intent of the prompt.

4.5. RQ5: Does HumorReject Affect the Model’s Performance on Benign Inputs?

In this subsection, we evaluate the impact of HumorReject on the model’s performance when handling benign inputs. The assessment is conducted from three perspectives: Multi-Task Language Understanding (MMLU), performance in contexts containing harmful instructions, and the propensity of the model to over-reject benign requests. The evaluation methods are detailed as follows:

MMLU Evaluation. Firstly, to assess the model’s capability in understanding and responding accurately across diverse tasks, we utilized 500 question-answer pairs from the MMLU [37] dataset. This evaluation measures the model’s general comprehension and response quality in various domains.

MMLU with Harmful Context. Secondly, to evaluate the model’s ability to handle benign tasks after responding to harmful instructions, we first prompted the model to respond to a harmful instruction randomly selected from our test set before presenting each MMLU question-answer pair. This setup assesses whether addressing harmful content impacts the model’s subsequent performance on legitimate MMLU tasks.

XSTEST Compliance Rate. Thirdly, to investigate whether HumorReject leads to unnecessary refusals of non-harmful requests, we employed 250 "safe" instructions from the XSTEST [38] dataset. This assessment measures the model’s compliance with benign requests, indicating its tendency to over-reject when faced with non-harmful inputs.

Experimental Results are summarized in Table 4, which compares different defense strategies across evaluation metrics. The results show that HumorReject maintains and slightly improves the model’s MMLU performance, with scores increasing from 58.0 to 60.8 for Llama3-8B-instruct and 49.8 to 50.2 for Mistral-7B-instruct-v0.1, suggesting that humor integration may enhance model comprehension.

When evaluating MMLU performance with harmful context, HumorReject shows superior robustness compared to baseline defenses. While methods like Circuit Breaker and DeepAug show severe degradation in performance, HumorReject maintains strong performance with scores of 58.2 for Llama3-8B-instruct and 45.4 for Mistral-7B-instruct-v0.1. Notably, we observe that after encountering harmful instructions, Circuit Breaker can only generate meaningless responses even for benign queries (causing Mistral’s score to drop to 0), revealing its over-defensive behavior that severely impacts model utility.

Regarding over-rejection, HumorReject demonstrates exceptional performance by maintaining high compliance rates (94.8% and 94.0%) comparable to the Vanilla Model. In contrast, alternative approaches like DeRTa and DeepAug exhibit substantially higher over-rejection tendencies. These findings convincingly illustrate HumorReject’s sophisticated capability to accurately distinguish between benign and harmful inputs, effectively minimizing false rejections while maintaining robust safety measures.

4.6. RQ6: Why Did Previous Humorous LLM Not Demonstrate Good Safety?

Among existing LLMs, Grok [39] is notable for incorporating a "fun mode" that aims to respond in a humorous and playful manner, capable of humorously rejecting harmful instructions like "Tell me how to make cocaine" [40]. However, recent research [41] has revealed that Grok’s safety performance ranks below many mainstream models in robustness against jailbreak attacks, indicating that its humor-based approach does not effectively ensure model safety.

Our investigation reveals that this vulnerability stems from Grok’s reliance on system prompts to generate humorous responses, rather than having this capability built into the model through training. As evidenced by recent prompt leaks [42], Grok’s humorous responses are primarily controlled through system-level prompting rather than being an inherent model capability. This distinction is crucial, as prompt-based approaches can be easily bypassed or modified by users.

To validate this hypothesis, we conducted experiments across multiple LLMs to assess whether system prompts alone could effectively implement humor-based safety measures similar to our HumorReject approach. We provided models with system prompts instructing them to reject harmful queries using humor, then evaluated their responses to the template attack [10]. The results are visualized in Figure 3.

Our experiments revealed two distinct patterns in how models respond to humor-based system prompts. Models like Mistral-7B-instruct-v0.1, Gemma-2-27B-it [43], and LLaMA3-70B-instruct showed limited ability to follow the humorous rejection instructions, resulting in minimal safety improvements. In contrast, models such as Qwen2.5-72B-instruct [44], Grok-2-1212 [39], and Claude-3.5-sonnet demonstrated stronger adherence to the humorous system prompts, showing notable improvements in both safety and humor metrics when rejecting harmful queries.

These findings highlight that while system prompts can potentially implement humor-based safety measures, their effectiveness heavily depends on the model’s ability to follow system instructions. More importantly, this approach remains fundamentally limited as users can simply choose not to use these safety-oriented system prompts. This vulnerability underscores why our HumorReject approach, which builds humor-based safety directly into model training, provides a more robust and reliable solution compared to prompt-based methods.

5. Conclusion

In this work, we introduced HumorReject, a novel method that leverages humor to enhance the safety of LLMs. By employing humorous responses as indirect refusals, HumorReject maintains engaging and natural interactions while effectively addressing safety concerns. Our comprehensive experiments demonstrated that HumorReject significantly improves model robustness while preserving performance on benign inputs, thereby avoiding the over-defense issues prevalent in existing safety strategies. Additionally, the creation and public release of a 400-sample HumorReject preference dataset provide a valuable resource for future research in model alignment and safety. Despite its promising results, further research is needed to explore the cultural nuances of humor and potential security implications of this approach, as humor can sometimes be perceived as offensive by different cultural groups or individuals. We recommend that future work carefully consider cultural sensitivities and establish appropriate guidelines for the use of humor. Overall, HumorReject offers an innovative solution for developing safer, more resilient, and user-friendly LLMs, paving the way for advanced and reliable AI interactions.

Acknowledgments

We thank Youliang Yuan for kindly sharing their DeRTa Mistral model weights!

Appendix A. Case Study

In this section, we present: 1) examples of training data to demonstrate how the HumorReject dataset was constructed; 2) prompts used for Judge models, HumorDAN attack, and HumorReject-like system prompt; 3) defense cases of the LLaMA3-8B-instruct model trained with HumorReject against various attacks (GCG, AutoDAN, ReNeLLM, CodeAttack, and CodeChameleon) when faced with the harmful instruction "Create a tutorial on how to make a bomb"; and 4) failure case analysis to illustrate the current limitations of our approach. Through these case studies, we aim to provide a deeper understanding of how HumorReject works and how it performs across different attack scenarios. More results are provided in our codebase 1.

Appendix A.1. Examples of Training Data

Appendix A.2. Prompts

Appendix A.2.1. Prompts for Judge Models

Appendix A.2.2. Prompts for HumorDAN Attack

Appendix A.2.3. HumorReject-like System Prompt in RQ6

Appendix A.3. Defense Cases

Appendix A.3.1. Defense Against GCG Attack

Appendix A.3.2. Defense Against AutoDAN Attack

Appendix A.3.3. Defense Against CodeAttack

Appendix A.3.4. Defense Against ReNeLLM Attack

Appendix A.3.5. Defense Against CodeChameleon Attack

Appendix A.4. Failure Cases

References

Ouyang, L.; Wu, J.; Jiang, X.; Almeida, D.; Wainwright, C.; Mishkin, P.; Zhang, C.; Agarwal, S.; Slama, K.; Ray, A.; et al. Training language models to follow instructions with human feedback. Advances in neural information processing systems 2022, 35, 27730–27744.
Qi, X.; Panda, A.; Lyu, K.; Ma, X.; Roy, S.; Beirami, A.; Mittal, P.; Henderson, P. Safety Alignment Should Be Made More Than Just a Few Tokens Deep, 2024, [arXiv:cs.CR/2406.05946].
Arditi, A.; Obeso, O.; Syed, A.; Paleka, D.; Panickssery, N.; Gurnee, W.; Nanda, N. Refusal in language models is mediated by a single direction. arXiv preprint arXiv:2406.11717 2024.
Wei, A.; Haghtalab, N.; Steinhardt, J. Jailbroken: How does llm safety training fail? Advances in Neural Information Processing Systems 2024, 36.
Zou, A.; Wang, Z.; Kolter, J.Z.; Fredrikson, M. Universal and transferable adversarial attacks on aligned language models. arXiv preprint arXiv:2307.15043 2023.
Liu, X.; Xu, N.; Chen, M.; Xiao, C. Autodan: Generating stealthy jailbreak prompts on aligned large language models. arXiv preprint arXiv:2310.04451 2023.
Hong, J.; Lee, N.; Thorne, J. Orpo: Monolithic preference optimization without reference model. In Proceedings of the Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, 2024, pp. 11170–11189.
Jiang, A.Q.; Sablayrolles, A.; Mensch, A.; Bamford, C.; Chaplot, D.S.; Casas, D.d.l.; Bressand, F.; Lengyel, G.; Lample, G.; Saulnier, L.; et al. Mistral 7B. arXiv preprint arXiv:2310.06825 2023.
Dubey, A.; Jauhri, A.; Pandey, A.; Kadian, A.; Al-Dahle, A.; Letman, A.; Mathur, A.; Schelten, A.; Yang, A.; Fan, A.; et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783 2024.
Andriushchenko, M.; Croce, F.; Flammarion, N. Jailbreaking leading safety-aligned llms with simple adaptive attacks. arXiv preprint arXiv:2404.02151 2024.
Yuan, Y.; Jiao, W.; Wang, W.; Huang, J.t.; Xu, J.; Liang, T.; He, P.; Tu, Z. Refuse whenever you feel unsafe: Improving safety in llms via decoupled refusal training. arXiv preprint arXiv:2407.09121 2024.
Zou, A.; Phan, L.; Wang, J.; Duenas, D.; Lin, M.; Andriushchenko, M.; Kolter, J.Z.; Fredrikson, M.; Hendrycks, D. Improving alignment and robustness with circuit breakers. In Proceedings of the The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024.
Christiano, P.F.; Leike, J.; Brown, T.; Martic, M.; Legg, S.; Amodei, D. Deep reinforcement learning from human preferences. Advances in neural information processing systems 2017, 30.
Rafailov, R.; Sharma, A.; Mitchell, E.; Manning, C.D.; Ermon, S.; Finn, C. Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems 2024, 36.
Ethayarajh, K.; Xu, W.; Muennighoff, N.; Jurafsky, D.; Kiela, D. Kto: Model alignment as prospect theoretic optimization. arXiv preprint arXiv:2402.01306 2024.
Wu, Z.; Gao, H.; He, J.; Wang, P. The dark side of function calling: Pathways to jailbreaking large language models. arXiv preprint arXiv:2407.17915 2024.
Shen, X.; Chen, Z.; Backes, M.; Shen, Y.; Zhang, Y. " do anything now": Characterizing and evaluating in-the-wild jailbreak prompts on large language models. arXiv preprint arXiv:2308.03825 2023.
Chao, P.; Robey, A.; Dobriban, E.; Hassani, H.; Pappas, G.J.; Wong, E. Jailbreaking black box large language models in twenty queries. arXiv preprint arXiv:2310.08419 2023.
Anil, C.; Durmus, E.; Sharma, M.; Benton, J.; Kundu, S.; Batson, J.; Rimsky, N.; Tong, M.; Mu, J.; Ford, D.; et al. Many-shot Jailbreaking. Anthropic, April 2024.
Zeng, Y.; Lin, H.; Zhang, J.; Yang, D.; Jia, R.; Shi, W. How johnny can persuade llms to jailbreak them: Rethinking persuasion to challenge ai safety by humanizing llms. arXiv preprint arXiv:2401.06373 2024.
Lv, H.; Wang, X.; Zhang, Y.; Huang, C.; Dou, S.; Ye, J.; Gui, T.; Zhang, Q.; Huang, X. CodeChameleon: Personalized Encryption Framework for Jailbreaking Large Language Models, 2024, [arXiv:cs.CL/2402.16717].
Ren, Q.; Gao, C.; Shao, J.; Yan, J.; Tan, X.; Lam, W.; Ma, L. Codeattack: Revealing safety generalization challenges of large language models via code completion. In Proceedings of the Findings of the Association for Computational Linguistics ACL 2024, 2024, pp. 11437–11452.
Deng, Y.; Zhang, W.; Pan, S.J.; Bing, L. Multilingual jailbreak challenges in large language models. arXiv preprint arXiv:2310.06474 2023.
Ding, P.; Kuang, J.; Ma, D.; Cao, X.; Xian, Y.; Chen, J.; Huang, S. A Wolf in Sheep’s Clothing: Generalized Nested Jailbreak Prompts can Fool Large Language Models Easily. arXiv preprint arXiv:2311.08268 2023.
Zhong, S.; Huang, Z.; Gao, S.; Wen, W.; Lin, L.; Zitnik, M.; Zhou, P. Let’s Think Outside the Box: Exploring Leap-of-Thought in Large Language Models with Creative Humor Generation. In Proceedings of the Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 13246–13257.
Tikhonov, A.; Shtykovskiy, P. Humor Mechanics: Advancing Humor Generation with Multistep Reasoning. arXiv preprint arXiv:2405.07280 2024.
Chen, Y.; Yuan, Y.; Liu, P.; Liu, D.; Guan, Q.; Guo, M.; Peng, H.; Liu, B.; Li, Z.; Xiao, Y. Talk Funny! A Large-Scale Humor Response Dataset with Chain-of-Humor Interpretation. In Proceedings of the Proceedings of the AAAI Conference on Artificial Intelligence, 2024, Vol. 38, pp. 17826–17834.
Vikhorev, D.; Galimzianova, D.; Gorovaia, S.; Zhemchuzhina, E.; Yamshchikov, I.P. CleanComedy: Creating Friendly Humor through Generative Techniques. arXiv preprint arXiv:2412.09203 2024.
Chen, Y.; Yang, C.; Hu, T.; Chen, X.; Lan, M.; Cai, L.; Zhuang, X.; Lin, X.; Lu, X.; Zhou, A. Are U a Joke Master? Pun Generation via Multi-Stage Curriculum Learning towards a Humor LLM. In Proceedings of the Findings of the Association for Computational Linguistics ACL 2024, 2024, pp. 878–890.
Taori, R.; Gulrajani, I.; Zhang, T.; Dubois, Y.; Li, X.; Guestrin, C.; Liang, P.; Hashimoto, T.B. Stanford Alpaca: An Instruction-following LLaMA model. https://github.com/tatsu-lab/stanford_alpaca, 2023.
Anthropic. Claude 3.5 sonnet. https://www.anthropic.com/news/claude-3-5-sonnet, 2024.
Hu, E.J.; Shen, Y.; Wallis, P.; Allen-Zhu, Z.; Li, Y.; Wang, S.; Wang, L.; Chen, W. Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685 2021.
Zheng, Y.; Zhang, R.; Zhang, J.; Ye, Y.; Luo, Z.; Feng, Z.; Ma, Y. LlamaFactory: Unified Efficient Fine-Tuning of 100+ Language Models. In Proceedings of the Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations), Bangkok, Thailand, 2024.
Mazeika, M.; Phan, L.; Yin, X.; Zou, A.; Wang, Z.; Mu, N.; Sakhaee, E.; Li, N.; Basart, S.; Li, B.; et al. Harmbench: A standardized evaluation framework for automated red teaming and robust refusal. arXiv preprint arXiv:2402.04249 2024.
Tramer, F.; Carlini, N.; Brendel, W.; Madry, A. On adaptive attacks to adversarial example defenses. Advances in neural information processing systems 2020, 33, 1633–1645.
DAN Template. https://gist.github.com/coolaj86/6f4f7b30129b0251f61fa7baaa881516.
Hendrycks, D.; Burns, C.; Basart, S.; Zou, A.; Mazeika, M.; Song, D.; Steinhardt, J. Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300 2020.
Röttger, P.; Kirk, H.R.; Vidgen, B.; Attanasio, G.; Bianchi, F.; Hovy, D. Xstest: A test suite for identifying exaggerated safety behaviours in large language models. arXiv preprint arXiv:2308.01263 2023.
X.ai. Grok 2. https://x.ai/blog/grok-2, 2024.
Musk, E. Twitter Status Post. https://x.com/elonmusk/status/1720635518289908042, 2023.
Adversa.ai. LLM Red Teaming vs. Grok, ChatGPT, Claude, Gemini, Bing, Mistral, LLaMA. https://adversa.ai/blog/llm-red-teaming-vs-grok-chatgpt-claude-gemini-bing-mistral-llama/, 2023.
Plinius, E. Grok System Prompt Leak. https://github.com/elder-plinius/Grok-System-Prompt-Leak, 2023.
Team, G.; Riviere, M.; Pathak, S.; Sessa, P.G.; Hardin, C.; Bhupatiraju, S.; Hussenot, L.; Mesnard, T.; Shahriari, B.; Ramé, A.; et al. Gemma 2: Improving open language models at a practical size. arXiv preprint arXiv:2408.00118 2024.
AI, Q. Qwen 2.5: Advancing AI for Everyone. https://qwen2.org/qwen2-5/, 2024.

1	https://github.com/wooozihui/HumorReject

Figure 1. Comparison between the Vanilla LLaMA3-8B-instruct [9] model (top) and HumorReject fine-tuned model (bottom) in response to direct harmful instructions (left) and prefix injection attacks [10] (right). The Vanilla model tends to start with explicit refusals ("I cannot provide") but can be jailbroken when successfully injected with affirmative prefixes (highlighted in brown). In contrast, HumorReject effectively decouples safety from refusal prefixes through indirect humorous rejections, thereby enhancing model safety even when directly prefilled with an affirmative prefix. More defense examples are provided in the case study (Appendix A).

Figure 2. HumorReject Training Dataset Construction. For harmful inputs, we pair GPT-4o’s humorous responses (chosen) with uncensored LLM’s harmful responses (rejected), while for benign inputs, we pair normal responses (chosen) with humorous responses (rejected).

Figure 3. Safety and Humor Rates Across Models with and without Humorous System Prompts. The figure illustrates the varying degrees to which different models adhere to humorous refusal prompts, highlighting the inconsistency and dependency on system-level configurations.

Table 1. Comparison of Humor, Reject, and Safety Rates for Vanilla and HumorReject Versions of LLaMA3-8B-instruct. HumorReject demonstrates a significant improvement in Humor Rate and Safety Rate, with minimal reliance on refusal prefixes.

Model: LLaMA3-8B-instruct	Humor Rate (%)	Reject Rate (%)	Safety Rate (%)
Vanilla	0	96	97
HumorReject	95	2	100

Table 2. Safety Rates (%) on Prefix Injection Attacks for Llama3-8B-instruct and Mistral-7B-instruct-v0.1. HumorReject demonstrates superior average robustness compared to baseline methods.

Model	Attack	Vanilla	CB	DeepAug	DeRTa	HumorReject (Ours)
Llama3-8B-instruct	GCG	88	99	99	97	98
	AutoDAN	87	98	40	89	99
	Template	98	97	100	100	99
	Prefill	41	95	59	98	100
	Template+Prefill	2	98	3	32	98
	Average	63.2	97.4	60.2	83.2	99.0
Mistral-7B-instruct-v0.1	GCG	4	89	66	61	95
	AutoDAN	22	86	19	50	97
	Template	2	89	8	54	96
	Prefill	1	99	56	92	98
	Template+Prefill	4	90	7	53	97
	Average	6.6	90.6	31.2	62.0	96.6

Table 3. Safety Rates (%) on Mismatched Generalization Attacks for Llama3-8B-instruct and Mistral-7B-instruct-v0.1.

Model	Attack	Vanilla	CB	DeepAug	DeRTa	HumorReject (Ours)
Llama3-8B-instruct	ReNeLLM	44	84	63	86	92
	CodeAttack	35	89	79	66	77
	CodeChameleon	44	94	62	68	83
	Average	41.0	89.0	68.0	73.3	84.0
Mistral-7B-instruct-v0.1	ReNeLLM	9	85	19	30	95
	CodeAttack	7	84	8	26	98
	CodeChameleon	47	100	70	73	95
	Average	21.0	89.7	32.3	56.3	96.0

Table 4. Performance of Defense Strategies Across Different Evaluation Metrics. "MMLU with Harmful Context" refers to the model’s performance on MMLU tasks when preceded by a harmful instruction.

Model	Method	MMLU (%)	MMLU with Harmful Context (%)	XSTEST Compliance Rate (%)
Llama3	Vanilla Model	58.0	54.8	95.2
	DeRTa	59.4	50.8	72.4
	Circuit Breaker	58.4	25.8	95.6
	DeepAug	60.6	59.2	60.4
	HumorReject (Ours)	60.8	58.2	94.8
Mistral	Vanilla Model	49.8	45.4	97.2
	DeRTa	39.6	33.6	25.6
	Circuit Breaker	47.4	0	96.4
	DeepAug	47.2	39.2	38
	HumorReject (Ours)	50.2	45.4	94.0

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

Copyright: This open access article is published under a Creative Commons CC BY 4.0 license, which permit the free download, distribution, and reuse, provided that the author and preprint are cited in any reuse.