Submitted: 14 May 2026
Posted: 15 May 2026
Abstract
Keywords:
1. Introduction
1.0.0.1. Tier 1: External Controls.
- Context Engineering (Section 2.1): Strategic prompt design through rules, instructions, or few-shot exemplars to guide outputs without modifying model parameters.
- Guardrails (Section 2.2): External modules that inspect inputs or outputs against safety or policy constraints, blocking, redacting, or regenerating content when violations occur.
- Decoding Strategies (Section 2.3): Manipulation of token-level distributions during generation to promote desired attributes or suppress undesired ones.
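The two most mechanical Tier 1 controls above can be illustrated with a minimal sketch. It is not drawn from any surveyed system: the email pattern, the function names `redact_output` and `suppress_tokens`, and the dictionary-of-logits representation are all assumptions made for illustration. The first function shows a rule-based guardrail that redacts an output span; the second shows a decoding-time control that masks banned tokens before normalizing logits into probabilities.

```python
import math
import re

# Hypothetical policy: treat anything that looks like an email address as a violation.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def redact_output(text: str) -> str:
    """Rule-based guardrail: replace policy-violating spans in a model output."""
    return EMAIL_PATTERN.sub("[REDACTED]", text)


def suppress_tokens(logits: dict[str, float], banned: set[str]) -> dict[str, float]:
    """Decoding control: softmax over the logits with banned tokens masked out."""
    kept = {tok: val for tok, val in logits.items() if tok not in banned}
    if not kept:
        raise ValueError("all candidate tokens are banned")
    peak = max(kept.values())  # subtract the max for numerical stability
    weights = {tok: math.exp(val - peak) for tok, val in kept.items()}
    total = sum(weights.values())
    probs = {tok: w / total for tok, w in weights.items()}
    # Banned tokens receive exactly zero probability.
    probs.update({tok: 0.0 for tok in logits if tok in banned})
    return probs
```

Neither function touches model parameters, which is what makes both controls external: the guardrail operates on text after generation, and the decoding control operates on the token distribution during generation.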
1.0.0.2. Tier 2: Internal Manipulations.
- Representation Engineering (Section 3.1): Direct modification of internal activations by adding or subtracting steering vectors associated with specific concepts.
- Unlearning (Section 3.2): Targeted removal of information, behaviors, or biases from a pre-trained model to fulfill data-forgetting requirements or disable harmful capabilities.
- Pruning (Section 3.3): Post-training removal of weights, neurons, or attention heads, originally for efficiency but now increasingly explored for trust-related effects.
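The steering-vector idea behind Representation Engineering can be sketched in a few lines. The example assumes the vector is estimated as a difference of mean activations between concept-positive and concept-negative examples, which is one common choice rather than the only one; the function names and the plain-list representation of hidden states are hypothetical.

```python
def steering_vector(positive: list[list[float]], negative: list[list[float]]) -> list[float]:
    """Difference of mean activations between concept-positive and concept-negative examples."""
    def mean(rows: list[list[float]]) -> list[float]:
        return [sum(col) / len(rows) for col in zip(*rows)]

    pos_mean, neg_mean = mean(positive), mean(negative)
    return [p - n for p, n in zip(pos_mean, neg_mean)]


def steer(hidden: list[float], vector: list[float], alpha: float) -> list[float]:
    """Add (alpha > 0) or subtract (alpha < 0) the steering vector from one hidden state."""
    return [h + alpha * v for h, v in zip(hidden, vector)]
```

In a real model the same addition would be applied to the activations of a chosen layer during the forward pass; the scale `alpha` trades off steering strength against the risk of degrading unrelated behavior.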
1.0.0.3. Tier 3: System-Level Orchestration.
1.1. Related Work
1.2. Contributions and Organization
- A Pipeline-Grounded, Three-Tier Framework. We propose a three-tier framework grounded in the generation pipeline (Figure 1). The framework covers seven inference-time control methods and groups them by their intervention point, access requirements, and composability within a defense-in-depth design.
- Cross-Dimensional Analysis. We analyze how each category addresses Safety, Privacy, Fairness, and Factuality, clarifying capabilities, limitations, and trade-offs under deployment constraints.
- Cross-Cutting Open Problems. We identify open problems on scalability, evaluation, and composition that cut across the framework, and discuss directions for future work.

Columns 2 to 5 are the trustworthiness dimensions (inference-time metrics). The remaining columns are inference-time methods, grouped as External Controls (Context Engineering, Guardrails, Decoding), Internal Manipulations (Representation Engineering, Unlearning, Pruning), and System-Level Orchestration (Multi-Agent Systems).

| Prior Work | Safety | Privacy | Fairness | Factuality | Context Engineering | Guardrails | Decoding | Representation Engineering | Unlearning | Pruning | Multi-Agent Systems |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Ours | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| Donisch et al. [8] | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✓ | ✗ | ✗ | ✓ | ✗ |
| Kumar et al. [12] | ✓ | ✓ | ✗ | ✓ | ✗ | ✓ | ✓ | ✓ | ✗ | ✓ | ✓ |
| Tie et al. [11] | ✓ | ✗ | ✓ | ✓ | ✗ | ✗ | ✓ | ✓ | ✗ | ✓ | ✓ |
| Huang et al. [13] | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✗ | ✓ | ✗ |
| Liu et al. [14] | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✗ | ✓ | ✓ | ✗ | ✗ |
| Wan et al. [15] | ✓ | ✗ | ✓ | ✓ | ✓ | ✗ | ✗ | ✓ | ✗ | ✗ | ✗ |
| Yu et al. [16] | ✓ | ✓ | ✓ | ✓ | ✓ | ✗ | ✗ | ✗ | ✗ | ✗ | ✓ |
| Barez et al. [18] | ✓ | ✓ | ✗ | ✗ | ✓ | ✓ | ✓ | ✓ | ✓ | ✗ | ✗ |
| Liu et al. [17] | ✓ | ✓ | ✓ | ✓ | ✓ | ✗ | ✗ | ✓ | ✓ | ✓ | ✗ |
2. External Controls
2.1. Context Engineering

2.1.1. Core Principle
2.1.2. Applications
2.1.2.1. Safety
2.1.2.2. Privacy
2.1.2.3. Fairness
2.1.2.4. Factuality
2.2. Guardrails
2.2.1. Core Principle: Modular and External Control
2.2.2. A Typology of Guardrail Mechanisms


2.2.2.1. Rule-Based Guardrails
2.2.2.2. Model-Based Guardrails
2.2.2.3. LLM-Based Guardrails
2.2.2.4. Hybrid Guardrails
2.2.3. Applications
2.2.3.1. Safety
2.2.3.2. Privacy
2.2.3.3. Fairness
2.2.3.4. Factuality
2.3. Decoding
2.3.1. Core Principle: Real-Time Logit and Probability Manipulation


2.3.2. Applications
2.3.2.1. Safety
2.3.2.2. Factuality
2.3.2.3. Privacy
2.3.2.4. Fairness
3. Internal Manipulations
3.1. Representation Engineering
3.1.1. Core Principle


3.1.2. Mechanism
3.1.3. Applications
3.1.3.1. Safety
3.1.3.2. Fairness
3.1.3.3. Privacy
3.1.3.4. Factuality and Faithfulness
3.2. Unlearning
3.2.1. Core Principle

3.2.2. Mechanisms
3.2.2.1. Gradient-Based Methods
3.2.2.2. Influence-Based Methods
3.2.3. Applications
3.2.3.1. Safety
3.2.3.2. Privacy
3.2.3.3. Factuality and Faithfulness
3.3. Pruning
3.3.1. Core Principle

3.3.2. Applications
3.3.2.1. Safety
3.3.2.2. Fairness
3.3.2.3. Privacy
3.3.2.4. Factuality and Faithfulness
4. System-Level Orchestration
4.1. Multi-Agent Systems

4.1.1. Core Principle: From Monolithic Control to Collaborative Systems

4.1.2. Applications
4.1.2.1. Safety
4.1.2.2. Factuality
4.1.2.3. Fairness
4.1.2.4. Privacy
5. Evaluation

- Effectiveness / Behavioral Accuracy: To what extent does the inference-time mechanism reliably enforce the intended behavioral constraints, and under what conditions does it fail?
- Locality / Utility Preservation: To what extent are the intervention effects localized to the targeted behaviors while preserving general task performance and user utility?
- Generality / Robustness / Adaptivity: How well does the mechanism generalize across tasks, domains, and input distributions, and how robust is it to distributional shifts?
- Efficiency / Latency / Cost: What additional runtime overhead does the mechanism introduce in terms of latency, compute, memory, or token usage?
- Interpretability / Transparency: To what extent are the mechanism’s actions, triggers, and decision rationales transparent to human auditors or downstream systems?
5.1. Effectiveness / Behavioral Accuracy
5.1.1. Rate-Based Metrics
5.1.2. Classifier-Style Metrics
5.1.3. Group-Disparity Metrics
5.1.4. Task-Performance Metrics
5.2. Locality / Utility Preservation
5.2.1. False-Positive and Over-Refusal Rate Metrics
5.2.2. Utility Retention Metrics
5.3. Generality / Robustness / Adaptivity
5.3.1. Worst-Case and Cross-Distribution Rate Metrics
5.3.2. Classifier-Style Metrics Across Domains and Languages
5.3.3. Calibration and Stability Metrics
5.4. Interpretability / Transparency
5.4.1. Provenance and Audit Metrics
5.4.2. Human Judgment Metrics
5.4.3. Distributional Transparency Metrics
5.5. Efficiency / Latency / Cost
- Relative Latency and Runtime Overhead. A standard way to measure deployment cost is relative latency, often reported as the Average Token Generation Time Ratio or an equivalent measure [82,84,94]. It measures the slowdown caused by the defense during inference:
  $$\text{Relative Latency}(x) = \frac{T_{\text{defense}}(x)}{T_{\text{base}}(x)},$$
  where $T_{\text{defense}}(x)$ is the wall-clock time for prompt $x$ under the defense, and $T_{\text{base}}(x)$ is the baseline time.
- Computational Cost and FLOPs-Based Budgeting. To study scaling behavior in a way that is less sensitive to hardware, several works use floating point operations as a compute proxy [27,63]. This gives a hardware-independent view of how much extra computation is needed for a given safety gain, often reported as performance under a compute budget. This view is used to plot safety outcomes such as Attack Success Rate against inference TFLOPs, tracing a safety–compute trade-off curve.
- Token Usage and Throughput. For methods that expand context, such as guardrail prompting, meta-prompting, or retrieval augmentation, token growth is a direct cost driver [36,74,86]. A simple and widely used summary is the Token Usage Ratio:
  $$\text{Token Usage Ratio} = \frac{\text{tokens consumed with the defense}}{\text{tokens consumed by the baseline}}.$$
  This captures cost increases from longer prompts, extra intermediate steps, or additional retrieved content. Throughput in tokens per second is also commonly reported to reflect serving capacity under the defense.
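The efficiency metrics in this list are simple ratios, and a small sketch makes their bookkeeping concrete. It assumes relative latency is averaged as a mean of per-prompt ratios and that token usage is compared as total tokens with and without the defense; the surveyed works may aggregate differently. All function names are hypothetical.

```python
import time
from typing import Callable, Sequence, Tuple


def timed(generate: Callable[[str], str], prompt: str) -> Tuple[str, float]:
    """Run one generation call and return its output with the elapsed wall-clock seconds."""
    start = time.perf_counter()
    output = generate(prompt)
    return output, time.perf_counter() - start


def relative_latency(defended_seconds: Sequence[float], baseline_seconds: Sequence[float]) -> float:
    """Mean per-prompt ratio of defended time to baseline time (1.0 means no slowdown)."""
    if len(defended_seconds) != len(baseline_seconds) or not baseline_seconds:
        raise ValueError("need one defended and one baseline timing per prompt")
    ratios = [d / b for d, b in zip(defended_seconds, baseline_seconds)]
    return sum(ratios) / len(ratios)


def token_usage_ratio(defended_tokens: int, baseline_tokens: int) -> float:
    """Total tokens consumed with the defense divided by tokens consumed without it."""
    return defended_tokens / baseline_tokens


def throughput(generated_tokens: int, seconds: float) -> float:
    """Serving throughput in tokens per second."""
    return generated_tokens / seconds
```

Separating the timing helper from the ratio computations keeps the metrics deterministic: timings can be collected once per prompt with `timed` and then passed to `relative_latency`, so the same measurements can be reused across reports.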
6. Discussion and Open Problems
6.1. External Controls
6.2. Internal Manipulations
6.3. System-Level Orchestration
7. Conclusions
References
- Chi, J.; Karn, U.; Zhan, H.; Smith, E.M.; Rando, J.; Zhang, Y.; Plawiak, K.; Coudert, Z.D.; Upasani, K.; Pasupuleti, M. Llama Guard 3 Vision: Safeguarding Human-AI Image Understanding Conversations. ArXiv 2024, abs/2411.10414. [Google Scholar]
- Padhi, I.; Nagireddy, M.; Cornacchia, G.; Chaudhury, S.; Pedapati, T.; Dognin, P.L.; Murugesan, K.; Miehling, E.; Cooper, M.S.; Fraser, K.; et al. Granite Guardian. ArXiv 2024, abs/2412.07724. [Google Scholar]
- Liu, Y.; Gao, H.; Zhai, S.; Xia, J.; Wu, T.; Xue, Z.; Chen, Y.; Kawaguchi, K.; Zhang, J.; Hooi, B. GuardReasoner: Towards Reasoning-based LLM Safeguards. ArXiv 2025, abs/2501.18492. [Google Scholar]
- Song, C.; Ma, L.; Zheng, J.; Liao, J.; Kuang, H.; Yang, L. Audit-LLM: Multi-Agent Collaboration for Log-based Insider Threat Detection. arXiv 2024, arXiv:cs. [Google Scholar]
- Łucki, J.; Wei, B.; Huang, Y.; Henderson, P.; Tramèr, F.; Rando, J. An adversarial perspective on machine unlearning for ai safety. arXiv 2024, arXiv:2409.18025. [Google Scholar]
- Dou, G.; Liu, Z.; Lyu, Q.; Ding, K.; Wong, E. Avoiding Copyright Infringement via Large Language Model Unlearning. Proc. Find. Assoc. Comput. Linguist. NAACL 2025, 2025, 5176–5200. [Google Scholar]
- Qi, S.; Zou, Y.; Li, P.; Lin, Z.; Cheng, X.; Yu, D. Amplified Vulnerabilities: Structured Jailbreak Attacks on LLM-based Multi-Agent Debate. arXiv 2025, arXiv:cs.
- Donisch, L.; Schacht, S.; Lanquillon, C. Inference optimizations for large language models: Effects, challenges, and practical considerations. arXiv 2024, arXiv:2408.03130. [Google Scholar] [CrossRef]
- Liu, J.; Tang, P.; Wang, W.; Ren, Y.; Hou, X.; Heng, P.A.; Guo, M.; Li, C. A survey on inference optimization techniques for mixture of experts models. arXiv 2024, arXiv:2412.14219. [Google Scholar] [CrossRef]
- Zhou, Z.; Ning, X.; Hong, K.; Fu, T.; Xu, J.; Li, S.; Lou, Y.; Wang, L.; Yuan, Z.; Li, X.; et al. A survey on efficient inference for large language models. arXiv 2024, arXiv:2404.14294. [Google Scholar] [CrossRef]
- Tie, G.; Zhao, Z.; Song, D.; Wei, F.; Zhou, R.; Dai, Y.; Yin, W.; Yang, Z.; Yan, J.; Su, Y.; et al. Large Language Models Post-training: Surveying Techniques from Alignment to Reasoning. arXiv 2025, arXiv:2503.06072. [Google Scholar]
- Kumar, K.; Ashraf, T.; Thawakar, O.; Anwer, R.M.; Cholakkal, H.; Shah, M.; Yang, M.H.; Torr, P.H.; Khan, F.S.; Khan, S. Llm post-training: A deep dive into reasoning large language models. arXiv 2025, arXiv:2502.21321. [Google Scholar] [CrossRef]
- Huang, Y.; Sun, L.; Wang, H.; Wu, S.; Zhang, Q.; Li, Y.; Gao, C.; Huang, Y.; Lyu, W.; Zhang, Y.; et al. Trustllm: Trustworthiness in large language models. arXiv 2024, arXiv:2401.05561. [Google Scholar] [CrossRef]
- Liu, Y.; Yao, Y.; Ton, J.F.; Zhang, X.; Guo, R.; Cheng, H.; Klochkov, Y.; Taufiq, M.F.; Li, H. Trustworthy llms: A survey and guideline for evaluating large language models’ alignment. arXiv 2023, arXiv:2308.05374. [Google Scholar]
- Wan, Y.; Subramonian, A.; Ovalle, A.; Lin, Z.; Suvarna, A.; Chance, C.; Bansal, H.; Pattichis, R.; Chang, K.W. Survey of bias in text-to-image generation: Definition, evaluation, and mitigation. arXiv 2024, arXiv:2404.01030. [Google Scholar]
- Yu, M.; Meng, F.; Zhou, X.; Wang, S.; Mao, J.; Pang, L.; Chen, T.; Wang, K.; Li, X.; Zhang, Y.; et al. A survey on trustworthy llm agents: Threats and countermeasures. arXiv 2025, arXiv:2503.09648. [Google Scholar] [CrossRef]
- Liu, Z.; Dou, G.; Tan, Z.; Tian, Y.; Jiang, M. Machine unlearning in generative ai: A survey. arXiv 2024, arXiv:2407.20516. [Google Scholar] [CrossRef]
- Barez, F.; Fu, T.; Prabhu, A.; Casper, S.; Sanyal, A.; Bibi, A.; O’Gara, A.; Kirk, R.; Bucknall, B.; Fist, T.; et al. Open problems in machine unlearning for ai safety. Available online: https://arxiv.
- Kumar, A.; Agarwal, C.; Srinivas, S.; Li, A.J.; Feizi, S.; Lakkaraju, H. Certifying LLM Safety against Adversarial Prompting. In Proceedings of the First Conference on Language Modeling, 2024. [Google Scholar]
- Cao, B.; Cao, Y.; Lin, L.; Chen, J. Defending Against Alignment-Breaking Attacks via Robustly Aligned LLM. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (ACL), 2024; pp. 10542–10560. [Google Scholar]
- Zhou, Y.; Han, Y.; Zhuang, H.; Guo, K.; Liang, Z.; Bao, H.; Zhang, X. Defending Jailbreak Prompts via In-Context Adversarial Game. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, 2024; pp. 20084–20105. [Google Scholar]
- Ji, J.; Hou, B.; Robey, A.; Pappas, G.J.; Hassani, H.; Zhang, Y.; Wong, E.; Chang, S. Defending large language models against jailbreak attacks via semantic smoothing. arXiv 2024, arXiv:2402.16192. [Google Scholar] [CrossRef]
- Wei, Z.; Wang, Y.; Li, A.; Mo, Y.; Wang, Y. Jailbreak and guard aligned language models with only few in-context demonstrations. arXiv 2023, arXiv:2310.06387. [Google Scholar] [CrossRef] [PubMed]
- Hong, G.; Kim, J.; Kang, J.; Myaeng, S.H.; Whang, J.J. Why So Gullible? Enhancing the Robustness of Retrieval-Augmented Models against Counterfactual Noise. In Proceedings of the 2024 Annual Conference of the North American Chapter of the Association for Computational Linguistics, 2024. [Google Scholar]
- Zhong, Z.; Huang, Z.; Wettig, A.; Chen, D. Poisoning Retrieval Corpora by Injecting Adversarial Passages. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, 2023; pp. 13764–13775. [Google Scholar]
- Li, J.J.; Pyatkin, V.; Kleiman-Weiner, M.; Jiang, L.; Dziri, N.; Collins, A.; Borg, J.S.; Sap, M.; Choi, Y.; Levine, S. SafetyAnalyst: Interpretable, Transparent, and Steerable Safety Moderation for AI Behavior. In Proceedings of the Forty-second International Conference on Machine Learning, 2025. [Google Scholar]
- Qiu, R.; Li, G.; Wei, T.; He, J.; Tong, H. Saffron-1: Towards an Inference Scaling Paradigm for LLM Safety Assurance. arXiv 2025, arXiv:2506.06444. [Google Scholar]
- Wu, Y.; Wen, R.; Backes, M.; Berrang, P.; Humbert, M.; Shen, Y.; Zhang, Y. Quantifying privacy risks of prompts in visual prompt learning. In Proceedings of the 33rd USENIX Security Symposium (USENIX Security 24), 2024; pp. 5841–5858. [Google Scholar]
- Chaudhari, H.; Severi, G.; Abascal, J.; Jagielski, M.; Choquette-Choo, C.A.; Nasr, M.; Nita-Rotaru, C.; Oprea, A. Phantom: General trigger attacks on retrieval augmented language generation. arXiv 2024, arXiv:2405.20485. [Google Scholar]
- Zou, W.; Geng, R.; Wang, B.; Jia, J. PoisonedRAG: Knowledge Corruption Attacks to Retrieval-Augmented Generation of Large Language Models. In Proceedings of the 34th USENIX Security Symposium (USENIX Security 25), 2025. [Google Scholar]
- Tong, M.; Chen, K.; Zhang, J.; Qi, Y.; Zhang, W.; Yu, N.; Zhang, T.; Zhang, Z. Inferdpt: Privacy-preserving inference for black-box large language models. IEEE Transactions on Dependable and Secure Computing, 2025. [Google Scholar]
- Dwivedi, S.; Ghosh, S.; Dwivedi, S. Breaking the bias: Gender fairness in llms using prompt engineering and in-context learning. Rupkatha J. Interdiscip. Stud. Humanit. 2023, 15. [Google Scholar] [CrossRef]
- Huang, D.; Zhang, J.M.; Bu, Q.; Xie, X.; Chen, J.; Cui, H. Bias testing and mitigation in llm-based code generation. ACM Transactions on Software Engineering and Methodology, 2024. [Google Scholar]
- Fayyazi, A.; Kamal, M.; Pedram, M. FACTER: Fairness-Aware Conformal Thresholding and Prompt Engineering for Enabling Fair LLM-Based Recommender Systems. In Proceedings of the Forty-second International Conference on Machine Learning, 2025. [Google Scholar]
- Chen, Q.Z.; Feng, K.; Park, C.Y.; Zhang, A.X. Spica: Retrieving scenarios for pluralistic in-context alignment. arXiv 2024, arXiv:2411.10912. [Google Scholar] [CrossRef]
- Shrestha, R.; Zou, Y.; Chen, Q.; Li, Z.; Xie, Y.; Deng, S. Fairrag: Fair human generation via fair retrieval augmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024; pp. 11996–12005. [Google Scholar]
- Dhingra, H.; Jayashanker, P.; Moghe, S.; Strubell, E. Queer people are people first: Deconstructing sexual identity stereotypes in large language models. arXiv 2023, arXiv:2307.00101. [Google Scholar] [CrossRef]
- Kamruzzaman, M.; Kim, G.L. Prompting techniques for reducing social bias in llms through system 1 and system 2 cognitive processes. arXiv 2024, arXiv:2404.17218. [Google Scholar]
- Dhuliawala, S.; Komeili, M.; Xu, J.; Raileanu, R.; Li, X.; Celikyilmaz, A.; Weston, J. Chain-of-Verification Reduces Hallucination in Large Language Models. Find. Assoc. Comput. Linguist. ACL 2024, 3563–3578. [Google Scholar]
- Press, O.; Zhang, M.; Min, S.; Schmidt, L.; Smith, N.A.; Lewis, M. Measuring and Narrowing the Compositionality Gap in Language Models. In Findings of the Association for Computational Linguistics: EMNLP 2023, 2023; pp. 5687–5711. [Google Scholar]
- Li, M.; Wang, W.; Feng, F.; Zhu, F.; Wang, Q.; Chua, T.S. Think Twice Before Trusting: Self-Detection for Large Language Models through Comprehensive Answer Reflection. Proc. Find. Assoc. Comput. Linguist. EMNLP 2024, 2024, 11858–11875. [Google Scholar]
- Feng, S.; Shi, W.; Wang, Y.; Ding, W.; Ahia, O.; Li, S.S.; Balachandran, V.; Sitaram, S.; Tsvetkov, Y. Teaching LLMs to Abstain across Languages via Multilingual Feedback. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, 2024; pp. 4125–4150. [Google Scholar]
- Zhou, W.; Zhang, S.; Poon, H.; Chen, M. Context-faithful Prompting for Large Language Models. In Findings of the Association for Computational Linguistics: EMNLP 2023, 2023; pp. 14544–14556. [Google Scholar]
- Shi, W.; Min, S.; Yasunaga, M.; Seo, M.; James, R.; Lewis, M.; Zettlemoyer, L.; Yih, W.t. REPLUG: Retrieval-Augmented Black-Box Language Models. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), 2024; pp. 8364–8377. [Google Scholar]
- Asai, A.; Wu, Z.; Wang, Y.; Sil, A.; Hajishirzi, H. Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection. In Proceedings of the Twelfth International Conference on Learning Representations, 2024. [Google Scholar]
- Wan, H.; Feng, S.; Tan, Z.; Wang, H.; Tsvetkov, Y.; Luo, M. DELL: Generating Reactions and Explanations for LLM-Based Misinformation Detection. In Proceedings of the ACL (Findings), 2024. [Google Scholar]
- Yu, W.; Iter, D.; Wang, S.; Xu, Y.; Ju, M.; Sanyal, S.; Zhu, C.; Zeng, M.; Jiang, M. Generate rather than Retrieve: Large Language Models are Strong Context Generators. In Proceedings of the International Conference on Learning Representations, 2023. [Google Scholar]
- Zhou, Y.; Liu, Z.; Jin, J.; Nie, J.Y.; Dou, Z. Metacognitive retrieval-augmented large language models. Proc. ACM Web Conf. 2024, 2024, 1453–1463. [Google Scholar]
- Chen, M.; Li, Y.; Padthe, K.; Shao, R.; Sun, A.; Zettlemoyer, L.; Ghosh, G.; Yih, W.t. Improving factuality with explicit working memory. arXiv 2024, arXiv:2412.18069. [Google Scholar]
- Wang, W.; Dong, L.; Cheng, H.; Liu, X.; Yan, X.; Gao, J.; Wei, F. Augmenting language models with long-term memory. Adv. Neural Inf. Process. Syst. 2023, 36, 74530–74543. [Google Scholar]
- Rebedea, T.; Dinu, R.L.; Sreedhar, M.N.; Parisien, C.; Cohen, J. NeMo Guardrails: A Toolkit for Controllable and Safe LLM Applications with Programmable Rails. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, 2023. [Google Scholar]
- Chennabasappa, S.; Nikolaidis, C.; Song, D.; Molnar, D.; Ding, S.; Wan, S.; Whitman, S.; Deason, L.; Doucette, N.; Montilla, A.; et al. LlamaFirewall: An open source guardrail system for building secure AI agents. ArXiv 2025, abs/2505.03574. [Google Scholar]
- Han, S.; Avestimehr, A.S.; He, C. Bridging the Safety Gap: A Guardrail Pipeline for Trustworthy LLM Inferences. ArXiv 2025. [Google Scholar]
- Wang, X.; Ji, Z.; Wang, W.; Li, Z.; Wu, D.; Wang, S. SoK: Evaluating Jailbreak Guardrails for Large Language Models. ArXiv 2025, abs/2506.10597. [Google Scholar]
- Inan, H.; Upasani, K.; Chi, J.; Rungta, R.; Iyer, K.; Mao, Y.; Tontchev, M.; Hu, Q.; Fuller, B.; Testuggine, D.; et al. Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations. ArXiv 2023, abs/2312.06674. [Google Scholar]
- Sharma, M.; Tong, M.; Mu, J.; Wei, J.; Kruthoff, J.; Goodfriend, S.; Ong, E.; Peng, A.; Agarwal, R.; Anil, C.; et al. Constitutional Classifiers: Defending against Universal Jailbreaks across Thousands of Hours of Red Teaming. arXiv 2025, arXiv:cs. [Google Scholar] [CrossRef]
- Han, S.; Rao, K.; Ettinger, A.; Jiang, L.; Lin, B.Y.; Lambert, N.; Choi, Y.; Dziri, N. WildGuard: Open One-Stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of LLMs. arXiv 2024, arXiv:cs. [Google Scholar]
- Zeng, W.; Liu, Y.; Mullins, R.; Peran, L.; Fernandez, J.; Harkous, H.; Narasimhan, K.; Proud, D.; Kumar, P.; Radharapu, B.; et al. ShieldGemma: Generative AI Content Moderation Based on Gemma. ArXiv 2024, abs/2407.21772. [Google Scholar]
- Zeng, W.; Kurniawan, D.; Mullins, R.; Liu, Y.; Saha, T.; Ike-Njoku, D.; Gu, J.; Song, Y.; Xu, C.; Zhou, J.; et al. ShieldGemma 2: Robust and Tractable Image Content Moderation. ArXiv 2025, abs/2504.01081. [Google Scholar]
- Kumar, P.; Jain, D.; Yerukola, A.; Jiang, L.; Beniwal, H.; Hartvigsen, T.; Sap, M. PolyGuard: A Multilingual Safety Moderation Tool for 17 Languages. ArXiv 2025. [Google Scholar]
- Upadhayay, B.; Behzadan, P.V.; Wang, M.; Lin, P.; Cai, S.; An, S.; Ma, S.; Lin, Z.; Huang, C.; Wei, A.; et al. X-Guard: Multilingual Guard Agent for Content Moderation. ArXiv 2025, abs/2504.08848. [Google Scholar]
- Ghosh, S.; Varshney, P.; Galinkin, E.; Parisien, C. AEGIS: Online Adaptive AI Content Safety Moderation with Ensemble of LLM Experts. ArXiv 2024, abs/2404.05993. [Google Scholar]
- Lee, S.; Lee, D.B.; Wagner, D.; Kang, M.; Seong, H.; Bocklet, T.; Lee, J.; Hwang, S.J. SafeRoute: Adaptive Model Selection for Efficient and Accurate Safety Guardrails in Large Language Models. ArXiv 2025, abs/2502.12464. [Google Scholar]
- Zheng, L.; Chiang, W.L.; Sheng, Y.; Zhuang, S.; Wu, Z.; Zhuang, Y.; Lin, Z.; Li, Z.; Li, D.; Xing, E.P.; et al. Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. ArXiv 2023, abs/2306.05685. [Google Scholar]
- Kang, M.; Li, B. R2-Guard: Robust Reasoning Enabled LLM Guardrail via Knowledge-Enhanced Logical Reasoning. ArXiv 2024, abs/2407.05557. [Google Scholar]
- Zheng, J.; Ji, X.; Lu, Y.; Cui, C.; Zhao, W.; Deng, G.; Liang, Z.; Zhang, A.; Chua, T.S. RSafe: Incentivizing proactive reasoning to build robust and adaptive LLM safeguards. ArXiv 2025, abs/2506.07736. [Google Scholar]
- Kokkula, S.; Somanathan, R.; Nandavardhan, R.; Aashishkumar; Divya, G. Palisade - Prompt Injection Detection Framework. ArXiv 2024, abs/2410.21146. [Google Scholar]
- LangChain, A.I. Rebuff: Prompt Injection Detection for LLM Applications. 2023. Available online: https://blog.langchain.com/rebuff/ (accessed on 2025-08-18).
- Microsoft. Prompt Shields in Azure AI Content Safety. Microsoft Learn documentation (online): unified API to detect and block adversarial user-input attacks on LLMs, 2025. [Google Scholar]
- Meta. Llama Prompt Guard documentation: model card and prompt-format documentation for Llama Prompt Guard. Meta’s official site (online), 2025. [Google Scholar]
- Zou, A.; Wang, Z.; Kolter, J.Z.; Fredrikson, M. Universal and Transferable Adversarial Attacks on Aligned Language Models. ArXiv 2023, abs/2307.15043. [Google Scholar]
- Jiang, F.; Xu, Z.; Niu, L.; Xiang, Z.; Ramasubramanian, B.; Li, B.; Poovendran, R. ArtPrompt: ASCII Art-based Jailbreak Attacks against Aligned LLMs. In Proceedings of the Annual Meeting of the Association for Computational Linguistics, 2024. [Google Scholar]
- Hackett, W.; Birch, L.; Trawicki, S.; Suri, N.; Garraghan, P. Bypassing Prompt Injection and Jailbreak Detection in LLM Guardrails. arXiv 2025, arXiv:2504.11168. [Google Scholar]
- Jiang, F.; Xu, Z.; Niu, L.; Wang, B.; Jia, J.; Li, B.; Poovendran, R. POSTER: Identifying and Mitigating Vulnerabilities in LLM-Integrated Applications. In Proceedings of the 19th ACM Asia Conference on Computer and Communications Security, 2023. [Google Scholar]
- Davidson, T.; Warmsley, D.; Macy, M.W.; Weber, I. Automated Hate Speech Detection and the Problem of Offensive Language. In Proceedings of the International Conference on Web and Social Media, 2017. [Google Scholar]
- Borkan, D.; Dixon, L.; Sorensen, J.S.; Thain, N.; Vasserman, L. Nuanced Metrics for Measuring Unintended Bias with Real Data for Text Classification. Companion Proceedings of The 2019 World Wide Web Conference; 2019. [Google Scholar]
- Mathew, B.; Saha, P.; Yimam, S.M.; Biemann, C.; Goyal, P.; Mukherjee, A. HateXplain: A Benchmark Dataset for Explainable Hate Speech Detection. In Proceedings of the AAAI Conference on Artificial Intelligence, 2020. [Google Scholar]
- Hartvigsen, T.; Gabriel, S.; Palangi, H.; Sap, M.; Ray, D.; Kamar, E. ToxiGen: A Large-Scale Machine-Generated Dataset for Adversarial and Implicit Hate Speech Detection. In Proceedings of the Annual Meeting of the Association for Computational Linguistics, 2022. [Google Scholar]
- Lees, A.; Tran, V.Q.; Tay, Y.; Sorensen, J.S.; Gupta, J.; Metzler, D.; Vasserman, L. A New Generation of Perspective API: Efficient Multilingual Character-level Transformers. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 2022. [Google Scholar]
- Huang, Y.; Zhuang, H.; Ye, J.; Bao, H.; Wang, Y.; Hua, H.; Wu, S.; Chen, P.Y.; Zhang, X. Guardian-as-an-Advisor: Advancing Next-Generation Guardian Models for Trustworthy LLMs. arXiv 2026, arXiv:2604.07655. [Google Scholar]
- Zhao, Z.; Zhang, X.; Xu, K.; Hu, X.; Zhang, R.; Du, Z.; Guo, Q.; Chen, Y. Adversarial Contrastive Decoding: Boosting Safety Alignment of Large Language Models via Opposite Prompt Optimization. ArXiv 2024, abs/2406.16743. [Google Scholar]
- Liu, Q.; Zhou, Z.; He, L.; Liu, Y.; Zhang, W.; Su, S. Alignment-Enhanced Decoding: Defending Jailbreaks via Token-Level Adaptive Refining of Probability Distributions. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, 2024. [Google Scholar]
- Zhong, Q.; Ding, L.; Liu, J.; Du, B.; Tao, D. ROSE Doesn’t Do That: Boosting the Safety of Instruction-Tuned Large Language Models with Reverse Prompt Contrastive Decoding. In Proceedings of the Annual Meeting of the Association for Computational Linguistics, 2024. [Google Scholar]
- Xu, Z.; Jiang, F.; Niu, L.; Jia, J.; Lin, B.Y.; Poovendran, R. SafeDecoding: Defending against Jailbreak Attacks via Safety-Aware Decoding. ArXiv 2024, abs/2402.08983. [Google Scholar]
- Huang, C.; Zhao, W.; Zheng, R.; Lv, H.; Dou, S.; Li, S.; Wang, X.; Zhou, E.; Ye, J.; Yang, Y.; et al. SafeAligner: Safety Alignment against Jailbreak Attacks via Response Disparity Guidance. ArXiv 2024, abs/2406.18118. [Google Scholar]
- Zeng, X.; Shang, Y.; Zhu, Y.; Chen, J.; Tian, Y. Root Defence Strategies: Ensuring Safety of LLM at the Decoding Level. ArXiv 2024, abs/2410.06809. [Google Scholar]
- Du, Y.; Zhao, S.; Zhao, D.; Ma, M.; Chen, Y.; Huo, L.; Yang, Q.; Xu, D.; Qin, B. MoGU: A Framework for Enhancing Safety of Open-Sourced LLMs While Preserving Their Usability. ArXiv 2024, abs/2405.14488. [Google Scholar]
- Zhang, J.J.; Elgohary, A.; Magooda, A.; Khashabi, D.; Durme, B.V. Controllable Safety Alignment: Inference-Time Adaptation to Diverse Safety Requirements. ArXiv 2024, abs/2410.08968. [Google Scholar]
- Gallego, V. MetaSC: Test-Time Safety Specification Optimization for Language Models. ArXiv 2025, abs/2502.07985. [Google Scholar]
- Chen, B.; Guo, H.; Yan, Q. FlexLLM: Exploring LLM Customization for Moving Target Defense on Black-Box LLMs Against Jailbreak Attacks. ArXiv 2024, abs/2412.07672. [Google Scholar]
- Banerjee, S.; Tripathy, S.; Layek, S.; Kumar, S.; Mukherjee, A.; Hazra, R. SafeInfer: Context Adaptive Decoding Time Safety Alignment for Large Language Models. In Proceedings of the AAAI Conference on Artificial Intelligence, 2024. [Google Scholar]
- Gao, J.; Pi, R.; Han, T.; Wu, H.; Hong, L.; Kong, L.; Jiang, X.; Li, Z. CoCA: Regaining Safety-awareness of Multimodal Large Language Models with Constitutional Calibration. ArXiv 2024, abs/2409.11365. [Google Scholar]
- Ghosal, S.S.; Chakraborty, S.; Singh, V.; Guan, T.; Wang, M.; Beirami, A.; Huang, F.; Velasquez, A.; Manocha, D.; Bedi, A.S. Immune: Improving Safety Against Jailbreaks in Multi-modal LLMs via Inference-Time Alignment. 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024; pp. 25038–25049. [Google Scholar]
- Li, Y.; Xu, Z.; Jiang, F.; Niu, L.; Sahabandu, D.; Ramasubramanian, B.; Poovendran, R. CleanGen: Mitigating Backdoor Attacks for Generation Tasks in Large Language Models. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, 2024. [Google Scholar]
- Suo, W.; Zhang, L.; Sun, M.; Wu, L.Y.; Wang, P.; Zhang, Y. Octopus: Alleviating Hallucination via Dynamic Contrastive Decoding. 2025 IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR) 2025, 29904–29914. [Google Scholar]
- Yuan, F.; Qin, C.; Xu, X.; Li, P. HELPD: Mitigating Hallucination of LVLMs by Hierarchical Feedback Learning with Vision-enhanced Penalty Decoding. ArXiv 2024, abs/2409.20429. [Google Scholar]
- Park, Y.; Lee, D.; Choe, J.; Chang, B. ConVis: Contrastive Decoding with Hallucination Visualization for Mitigating Hallucinations in Multimodal Large Language Models. In Proceedings of the AAAI Conference on Artificial Intelligence, 2024. [Google Scholar]
- Liang, X.; Yu, J.; Mu, L.; Zhuang, J.; Hu, J.; Yang, Y.; Ye, J.; Lu, L.; Chen, J.; Hu, H. Mitigating Hallucination in Visual-Language Models via Re-Balancing Contrastive Decoding. In Proceedings of the Chinese Conference on Pattern Recognition and Computer Vision, 2024. [Google Scholar]
- Chen, D.; Fang, F.; Ni, S.; Liang, F.; Hu, X.; Argha, A.; Alinejad-Rokny, H.; Yang, M.; Li, C. Lower Layers Matter: Alleviating Hallucination via Multi-Layer Fusion Contrastive Decoding with Truthfulness Refocused. 2024. [Google Scholar]
- Huang, Y.; Zhang, Y.; Cheng, N.; Li, Z.; Wang, S.; Xiao, J. Dynamic Attention-Guided Context Decoding for Mitigating Context Faithfulness Hallucinations in Large Language Models. In Proceedings of the Annual Meeting of the Association for Computational Linguistics, 2025. [Google Scholar]
- Wang, C.; Chen, X.; Zhang, N.; Tian, B.; Xu, H.; Deng, S.; Chen, H. MLLM can see? Dynamic Correction Decoding for Hallucination Mitigation. ArXiv 2024, abs/2410.11779. [Google Scholar]
- Bi, B.; Liu, S.; Mei, L.; Wang, Y.; Ji, P.; Cheng, X. Decoding by Contrasting Knowledge: Enhancing LLMs’ Confidence on Edited Facts. ArXiv 2024, abs/2405.11613. [Google Scholar]
- Shi, L.; Yao, Y.; Li, Z.; Zhang, L.; Zhao, H. Reference Trustable Decoding: A Training-Free Augmentation Paradigm for Large Language Models. ArXiv 2024, abs/2409.20181. [Google Scholar]
- Dhamala, J.; Kumar, V.; Gupta, R.; Chang, K.W.; Galstyan, A.G. An Analysis of The Effects of Decoding Algorithms on Fairness in Open-Ended Language Generation. 2022 IEEE Spoken Language Technology Workshop (SLT), 2022; pp. 655–662. [Google Scholar]
- Das, M.; Balke, W.T. Quantifying Bias from Decoding Techniques in Natural Language Generation. In Proceedings of the International Conference on Computational Linguistics, 2022. [Google Scholar]
- Wang, H.; Xu, X.; Huang, B.; Shu, K. Privacy-Aware Decoding: Mitigating Privacy Leakage of Large Language Models in Retrieval-Augmented Generation. 2025.
- Ok, H.; Ryu, J.; Lee, J. Decoding with Limited Teacher Supervision Requires Understanding When to Trust the Teacher. ArXiv 2024, abs/2406.18002. [Google Scholar]
- Wei, J.; Abdulrazzag, A.; Zhang, T.; Muursepp, A.; Saileshwar, G. Privacy Risks of Speculative Decoding in Large Language Models. ArXiv 2024, abs/2411.01076. [Google Scholar]
- Borec, L.; Sadler, P.; Schlangen, D. The Unreasonable Ineffectiveness of Nucleus Sampling on Mitigating Text Memorization. In Proceedings of the International Conference on Natural Language Generation, 2024. [Google Scholar]
- Majmudar, J.; Dupuy, C.; Peris, C.; Smaili, S.; Gupta, R.; Zemel, R.S. Differentially Private Decoding in Large Language Models. ArXiv 2022, abs/2205.13621. [Google Scholar]
- Isonuma, M.; Titov, I. What’s New in My Data? Novelty Exploration via Contrastive Generation. ArXiv 2024, abs/2410.14765. [Google Scholar]
- Wang, P.; Zhang, D.; Li, L.; Tan, C.; Wang, X.; Ren, K.; Jiang, B.; Qiu, X. InferAligner: Inference-Time Alignment for Harmlessness through Cross-Model Guidance. arXiv 2024, arXiv:2401.11206. [Google Scholar]
- O’Brien, K.; Majercak, D.; Fernandes, X.; Edgar, R.; Bullwinkel, B.; Chen, J.; Nori, H.; Carignan, D.; Horvitz, E.; Poursabzi-Sangdeh, F. Steering Language Model Refusal with Sparse Autoencoders. arXiv 2024, arXiv:2411.11296. [Google Scholar]
- Arditi, A.; Obeso, O.; Syed, A.; Paleka, D.; Panickssery, N.; Gurnee, W.; Nanda, N. Refusal in Language Models Is Mediated by a Single Direction. arXiv 2024, arXiv:2406.11717. [Google Scholar] [CrossRef]
- Lee, B.W.; Padhi, I.; Natesan Ramamurthy, K.; Miehling, E.; Dognin, P.; Nagireddy, M.; Dhurandhar, A. Programming Refusal with Conditional Activation Steering. arXiv 2024, arXiv:2409.05907. [Google Scholar]
- Sheng, L.; Shen, C.; Zhao, W.; Fang, J.; Liu, X.; Liang, Z.; Wang, X.; Zhang, A.; Chua, T.S. AlphaSteer: Learning Refusal Steering with Principled Null-Space Constraint. arXiv 2025, arXiv:2506.07022. [Google Scholar]
- Zhou, Z.; Yu, H.; Zhang, X.; Xu, R.; Huang, F.; Wang, K.; Liu, Y.; Fang, J.; Li, Y. On the role of attention heads in large language model safety. arXiv 2024, arXiv:2410.13708. [Google Scholar]
- Zheng, C.; Yin, F.; Zhou, H.; Meng, F.; Zhou, J.; Chang, K.W.; Huang, M.; Peng, N. On prompt-driven safeguarding for large language models. In Proceedings of the 41st International Conference on Machine Learning, 2024; pp. 61593–61613. [Google Scholar]
- Han, P.; Qian, C.; Chen, X.; Zhang, Y.; Zhang, D.; Ji, H. SafeSwitch: Steering Unsafe LLM Behavior via Internal Activation Signals. arXiv 2025, arXiv:2502.01042. [Google Scholar] [CrossRef]
- Gao, L.; Geng, J.; Zhang, X.; Nakov, P.; Chen, X. Shaping the Safety Boundaries: Understanding and Defending Against Jailbreaks in Large Language Models. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics, 2025; Volume 1, pp. 25378–25398. [Google Scholar]
- Zhao, J.; Huang, J.; Wu, Z.; Bau, D.; Shi, W. LLMs Encode Harmfulness and Refusal Separately. arXiv 2025, arXiv:2507.11878. [Google Scholar] [CrossRef]
- Li, K.; Patel, O.; Viégas, F.; Pfister, H.; Wattenberg, M. Inference-Time Intervention: Eliciting Truthful Answers from a Language Model. arXiv 2024, arXiv:cs. [Google Scholar]
- Hościłowicz, J.; Wiacek, A.; Chojnacki, J.; Cieślak, A.; Michon, L.; Urbanevych, V.; Janicki, A. Non-Linear Inference Time Intervention: Improving LLM Truthfulness. arXiv 2024, arXiv:cs. [Google Scholar] [CrossRef]
- Bayat, F.F.; Liu, X.; Jagadish, H.V.; Wang, L. Enhanced Language Model Truthfulness with Learnable Intervention and Uncertainty Expression. In Proceedings of the Findings of the Association for Computational Linguistics: ACL 2024, 2024. [Google Scholar]
- Wang, T.; Jiao, X.; Zhu, Y.; Chen, Z.; He, Y.; Chu, X.; Gao, J.; Wang, Y.; Ma, L. Adaptive Activation Steering: A Tuning-Free LLM Truthfulness Improvement Method for Diverse Hallucinations Categories. arXiv 2024, arXiv:cs. Also appears at TheWebConf (WWW) 2025.
- Liu, J.; Chen, S.; Cheng, Y.; He, J. On the Universal Truthfulness Hyperplane Inside LLMs. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024. [Google Scholar]
- Li, H.; Cao, Y.; Yu, Y.; Suchow, J.W.; Zhu, Z. Truth Neurons. arXiv 2025, arXiv:cs. [Google Scholar] [CrossRef]
- Li, Y.; Fan, Z.; Chen, R.; Gai, X.; Gong, L.; Zhang, Y.; Liu, Z. FairSteer: Inference-Time Debiasing for LLMs with Dynamic Activation Steering. Proc. Find. Assoc. Comput. Linguist. ACL 2025, 2025, 11293–11312. [Google Scholar]
- Ravfogel, S.; Elazar, Y.; Gonen, H.; Twiton, M.; Goldberg, Y. Null It Out: Guarding Protected Attributes by Iterative Nullspace Projection. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2020; pp. 7237–7256. [Google Scholar]
- Belrose, N.; Schneider-Joseph, D.; Ravfogel, S.; Cotterell, R.; Raff, E.; Biderman, S. LEACE: Perfect Linear Concept Erasure in Closed Form. arXiv 2023, arXiv:2306.03819. [Google Scholar] [CrossRef]
- Jourdan, F.; Béthune, L.; Picard, A.; Risser, L.; Asher, N. TaCo: Targeted Concept Erasure Prevents Non-Linear Classifiers From Detecting Protected Attributes. arXiv 2024, arXiv:2312.06499. [Google Scholar]
- Liang, P.P.; Li, I.M.; Zheng, E.; Lim, Y.C.; Salakhutdinov, R.; Morency, L.P. Towards Debiasing Sentence Representations. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2020. [Google Scholar]
- Nakka, K.K.; Jiang, X.; Usynin, D.; Zhou, X. PII Jailbreaking in LLMs via Activation Steering Reveals Personal Information Leakage, 2025; Preprint.
- Suri, M.; Anand, N.; Bhaskar, A. Mitigating Memorization in LLMs using Activation Steering, 2025; Preprint.
- Zou, A.; Phan, L.; Chen, S.; Campbell, J.; Guo, P.; Ren, R.; Pan, A.; Yin, X.; Mazeika, M.; Dombrowski, A.K.; et al. Representation engineering: A top-down approach to ai transparency. arXiv 2023, arXiv:2310.01405. [Google Scholar] [CrossRef]
- Wehner, J.; Abdelnabi, S.; Tan, D.; Krueger, D.; Fritz, M. Taxonomy, opportunities, and challenges of representation engineering for large language models. arXiv 2025, arXiv:2502.19649. [Google Scholar] [CrossRef]
- Jolliffe, I.T. Principal Component Analysis; Springer Verlag, 1986. [Google Scholar]
- Golub, G.H.; Reinsch, C. Singular value decomposition and least squares solutions. Linear Algebr. 1971, 134–151. [Google Scholar]
- Jiang, X.; Zhang, L.; Zhang, J.; Yang, Q.; Hu, G.; Wang, D.; Hu, L. MSRS: Adaptive Multi-Subspace Representation Steering for Attribute Alignment in Large Language Models. arXiv 2025, arXiv:2508.10599. [Google Scholar]
- Turner, A.M.; Thiergart, L.; Leech, G.; Udell, D.; Vazquez, J.J.; Mini, U.; MacDiarmid, M. Steering language models with activation engineering. arXiv 2023, arXiv:2308.10248. [Google Scholar]
- Zhou, A. Compositional Subspace Representation Fine-tuning for Adaptive Large Language Models. arXiv 2025, arXiv:2503.10617. [Google Scholar] [CrossRef]
- Deng, Z.; Liu, C.Y.; Pang, Z.; He, X.; Feng, L.; Xuan, Q.; Zhu, Z.; Wei, J. Guard: Generation-time llm unlearning via adaptive restriction and detection. arXiv 2025, arXiv:2505.13312. [Google Scholar] [CrossRef]
- Zhang, Z.; Yang, J.; Lu, Y.; Ke, P.; Cui, S.; Zheng, C.; Wang, H.; Huang, M. From Theft to Bomb-Making: The Ripple Effect of Unlearning in Defending Against Jailbreak Attacks. arXiv 2024, arXiv:2407.02855. [Google Scholar]
- Takashiro, S.; Kojima, T.; Gambardella, A.; Cao, Q.; Iwasawa, Y.; Matsuo, Y. Answer When Needed, Forget When Not: Language Models Pretend to Forget via In-Context Knowledge Unlearning. arXiv 2025, arXiv:cs. [Google Scholar] [CrossRef]
- Sanyal, D.; Mandal, M. Agents are all you need for LLM unlearning. arXiv 2025, arXiv:2502.00406. [Google Scholar] [CrossRef]
- Pawelczyk, M.; Neel, S.; Lakkaraju, H. In-context unlearning: Language models as few shot unlearners. arXiv 2023, arXiv:2310.07579. [Google Scholar] [CrossRef]
- Muresanu, A.I.; Thudi, A.; Zhang, M.R.; Papernot, N. Fast Exact Unlearning for In-Context Learning Data for LLMs. In Proceedings of the ICML, 2025. [Google Scholar]
- Zhou, Y.; Li, X.; Wang, Q.; Shen, J. Visual in-context learning for large vision-language models. arXiv 2024, arXiv:2402.11574. [Google Scholar]
- Voigt, P.; Von dem Bussche, A. The EU General Data Protection Regulation (GDPR): A Practical Guide, 1st ed.; Springer International Publishing: Cham, 2017. [Google Scholar]
- Bonta, R. California consumer privacy act (CCPA). State of California Department of Justice, 2022. Available online: https://oag.
- Lang, Y.; Guo, K.; Huang, Y.; Zhou, Y.; Zhuang, H.; Yang, T.; Su, Y.; Zhang, X. Beyond Single-Value Metrics: Evaluating and Enhancing LLM Unlearning with Cognitive Diagnosis. arXiv 2025, arXiv:2502.13996. [Google Scholar] [CrossRef]
- Wei, B.; Huang, K.; Huang, Y.; Xie, T.; Qi, X.; Xia, M.; Mittal, P.; Wang, M.; Henderson, P. Assessing the brittleness of safety alignment via pruning and low-rank modifications. arXiv 2024, arXiv:2402.05162. [Google Scholar] [CrossRef]
- Zhang, J.; Chen, K.; He, L.; Lou, J.; Li, D.; Feng, Z.; Song, M.; Liu, J.; Ren, K.; Yang, X. Activation approximations can incur safety vulnerabilities even in aligned llms: Comprehensive analysis and defense. arXiv 2025, arXiv:2502.00840. [Google Scholar] [CrossRef]
- Chen, B.; Lyu, X.; Gao, L.; Song, J.; Shen, H.T. SafePTR: Token-Level Jailbreak Defense in Multimodal LLMs via Prune-then-Restore Mechanism. arXiv 2025, arXiv:2507.01513. [Google Scholar]
- Zayed, A.; Mordido, G.; Shabanian, S.; Baldini, I.; Chandar, S. Fairness-aware structured pruning in transformers. In Proceedings of the AAAI Conference on Artificial Intelligence, 2024; pp. 22484–22492. [Google Scholar]
- Ma, S.; Salinas, A.; Nyarko, J.; Henderson, P. Breaking Down Bias: On The Limits of Generalizable Pruning Strategies. In Proceedings of the 2025 ACM Conference on Fairness, Accountability, and Transparency, 2025. [Google Scholar]
- Pochinkov, N.; Schoots, N. Dissecting language models: Machine unlearning via selective pruning. arXiv 2024, arXiv:2403.01267. [Google Scholar] [CrossRef]
- Liu, Z.; Dou, G.; Yuan, X.; Zhang, C.; Tan, Z.; Jiang, M. Modality-Aware Neuron Pruning for Unlearning in Multimodal Large Language Models. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics, 2025; Volume 1. [Google Scholar]
- Guo, P.; Wang, Y.; Li, W.; Liu, M.; Li, M.; Zheng, J.; Qu, L. Exploring Federated Pruning for Large Language Models. arXiv 2025, arXiv:2505.13547. [Google Scholar] [CrossRef]
- Chrysostomou, G.; Zhao, Z.; Williams, M.; Aletras, N. Investigating hallucinations in pruned large language models for abstractive summarization. Trans. Assoc. Comput. Linguist. 2024, 12, 1163–1181. [Google Scholar] [CrossRef]
- Zeng, Y.; Wu, Y.; Zhang, X.; Wang, H.; Wu, Q. AutoDefense: Multi-Agent LLM Defense against Jailbreak Attacks. In Proceedings of the Neurips Safe Generative AI Workshop 2024, 2024. [Google Scholar]
- Li, H.; Wang, A.; Li, K.; Wang, Z.; Zhang, L.; Qiu, D.; Liu, Q.; Su, J. A Multi-Agent Framework with Automated Decision Rule Optimization for Cross-Domain Misinformation Detection. arXiv 2025, arXiv:cs. [Google Scholar]
- Fan, F.; Li, X. PeerGuard: Defending Multi-Agent Systems Against Backdoor Attacks Through Mutual Reasoning. arXiv 2025, arXiv:cs. [Google Scholar]
- Zhou, J.; Wang, L.; Yang, X. GUARDIAN: Safeguarding LLM Multi-Agent Collaborations with Temporal Graph Modeling. arXiv 2025, arXiv:cs. [Google Scholar] [CrossRef]
- Ebrahimi, S.; Dehghankar, M.; Asudeh, A. An Adversary-Resistant Multi-Agent LLM System via Credibility Scoring. arXiv 2025, arXiv:cs. [Google Scholar]
- Asad, A.; Obadinma, S.; Shayanfar, R.; Zhu, X. RedDebate: Safer Responses through Multi-Agent Red Teaming Debates. arXiv 2025, arXiv:cs.
- Zhu, Y.; Zhang, C.; Shi, X.; Zhang, X.; Yang, Y.; Luo, Y. MASTER: Multi-Agent Security Through Exploration of Roles and Topological Structures – A Comprehensive Framework. arXiv 2025, arXiv:cs. [Google Scholar]
- Zhang, Z.; Zhang, Y.; Li, L.; Gao, H.; Wang, L.; Lu, H.; Zhao, F.; Qiao, Y.; Shao, J. PsySafe: A Comprehensive Framework for Psychological-based Attack, Defense, and Evaluation of Multi-agent System Safety. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics, 2024; Volume 1, pp. 15202–15231. [Google Scholar]
- He, P.; Lin, Y.; Dong, S.; Xu, H.; Xing, Y.; Liu, H. Red-Teaming LLM Multi-Agent Systems via Communication Attacks. Proc. Find. Assoc. Comput. Linguist. ACL 2025, 2025, 6726–6747. [Google Scholar]
- Du, Y.; Li, S.; Torralba, A.; Tenenbaum, J.B.; Mordatch, I. Improving factuality and reasoning in language models through multiagent debate. In Proceedings of the 41st International Conference on Machine Learning, 2024. [Google Scholar]
- Khan, A.; Hughes, J.; Valentine, D.; Ruis, L.; Sachan, K.; Radhakrishnan, A.; Grefenstette, E.; Bowman, S.R.; Rocktäschel, T.; Perez, E. Debating with more persuasive LLMs leads to more truthful answers. In Proceedings of the 41st International Conference on Machine Learning, 2024. [Google Scholar]
- Fang, Y.; Li, M.; Wang, W.; Hui, L.; Feng, F. Counterfactual Debating with Preset Stances for Hallucination Elimination of LLMs. In Proceedings of the COLING, 2025; pp. 10554–10568. [Google Scholar]
- Sun, X.; Li, J.; Zhong, Y.; Zhao, D.; Yan, R. Towards Detecting LLMs Hallucination via Markov Chain-based Multi-agent Debate Framework. In Proceedings of the ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2025; pp. 1–5. [Google Scholar]
- Wan, D.; Chen, J.; Stengel-Eskin, E.; Bansal, M. MAMM-Refine: A Recipe for Improving Faithfulness in Generation with Multi-Agent Collaboration. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), 2025; pp. 9882–9901. [Google Scholar]
- Kim, K.; Lee, S.; Huang, K.H.; Chan, H.P.; Li, M.; Ji, H. Can LLMs Produce Faithful Explanations For Fact-checking? Towards Faithful Explainable Fact-Checking via Multi-Agent Debate. arXiv 2024, arXiv:cs. [Google Scholar]
- Tang, Z.; Wang, R.; Chen, W.; Zheng, Y.; Chen, Z.; Liu, Y.; Wang, K.; Chen, T.; Lin, L. Towards CausalGPT: A Multi-Agent Approach for Faithful Knowledge Reasoning via Promoting Causal Consistency in LLMs. arXiv 2025, arXiv:cs. [Google Scholar]
- Feng, S.; Shi, W.; Wang, Y.; Ding, W.; Balachandran, V.; Tsvetkov, Y. Don’t Hallucinate, Abstain: Identifying LLM Knowledge Gaps via Multi-LLM Collaboration. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics, 2024; Volume 1, pp. 14664–14690. [Google Scholar]
- Yang, R.; Rajagopal, D.; Hayati, S.A.; Hu, B.; Kang, D. Confidence Calibration and Rationalization for LLMs via Multi-Agent Deliberation. arXiv 2024, arXiv:cs. [Google Scholar] [CrossRef]
- Ki, D.; Rudinger, R.; Zhou, T.; Carpuat, M. Multiple LLM Agents Debate for Equitable Cultural Alignment. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics, 2025; Volume 1, pp. 24841–24877. [Google Scholar]
- Wan, Y.; Chang, K.W. White Men Lead, Black Women Help? Benchmarking and Mitigating Language Agency Social Biases in LLMs. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics, 2025; Volume 1, pp. 9082–9108. [Google Scholar]
- Owens, D.M.; Rossi, R.A.; Kim, S.; Yu, T.; Dernoncourt, F.; Chen, X.; Zhang, R.; Gu, J.; Deilamsalehy, H.; Lipka, N. A multi-llm debiasing framework. arXiv 2024, arXiv:2409.13884. [Google Scholar] [CrossRef]
- Borah, A.; Mihalcea, R. Towards Implicit Bias Detection and Mitigation in Multi-Agent LLM Interactions. Proc. Find. Assoc. Comput. Linguist. EMNLP 2024, 2024, 9306–9326. [Google Scholar]
- Wan, Y.; Chen, X.; Chang, K.W. Which Cultural Lens Do Models Adopt? On Cultural Positioning Bias and Agentic Mitigation in LLMs. arXiv 2025, arXiv:cs. [Google Scholar]
- Wang, L.; Wang, W.; Wang, S.; Li, Z.; Ji, Z.; Lyu, Z.; Wu, D.; Cheung, S.C. IP Leakage Attacks Targeting LLM-Based Multi-Agent Systems. arXiv 2025, arXiv:cs. [Google Scholar]
- Li, W.; Sun, L.; Guan, Z.; Zhou, X.; Sap, M. 1-2-3 Check: Enhancing Contextual Privacy in LLM via Multi-Agent Reasoning. In Proceedings of the First Workshop on LLM Security (LLMSEC), 2025; pp. 115–128. [Google Scholar]
- Huang, Y.; Jiang, Y.; Wang, W.; Zhuang, H.; Luo, X.; Ma, Y.; Xu, Z.; Chen, Z.; Moniz, N.; Lin, Z.; et al. Emergent Social Intelligence Risks in Generative Multi-Agent Systems. arXiv 2026, arXiv:2603.27771. [Google Scholar] [CrossRef]
- Huang, Y.; Zhang, Q.; Yu, P.S.; Sun, L. TrustGPT: A Benchmark for Trustworthy and Responsible Large Language Models. arXiv 2023, arXiv:2306.11507. [Google Scholar] [CrossRef]
- Wang, Y.; Ye, J.; Wu, S.; Gao, C.; Huang, Y.; Chen, X.; Zhao, Y.; Zhang, X. TrustEval: A Dynamic Evaluation Toolkit on Trustworthiness of Generative Foundation Models. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (System Demonstrations), 2025. [Google Scholar]
- Hasan, A.; Rugina, I.; Wang, A. Pruning for Protection: Increasing Jailbreak Resistance in Aligned LLMs Without Fine-Tuning. In Proceedings of the 7th BlackboxNLP Workshop, 2024. [Google Scholar]
- Yan, H.; Liu, Z.; Jiang, M. Dual-Space Smoothness for Robust and Balanced LLM Unlearning. arXiv 2025, arXiv:2509.23362. [Google Scholar]
- Meng, K.; Bau, D.; Andonian, A.; Belinkov, Y. Locating and editing factual associations in GPT. Adv. Neural Inf. Process. Syst. 2022, 35, 17359–17372. [Google Scholar]
- Meng, K.; Sharma, A.S.; Andonian, A.; Belinkov, Y.; Bau, D. Mass-editing memory in a transformer. ArXiv 2022, abs/2210.07229. [Google Scholar]
- Cai, Y.; Cao, D.; Guo, R.; Wen, Y.; Liu, G.; Chen, E. Locating and mitigating gender bias in large language models. In Proceedings of the International Conference on Intelligent Computing, 2024; pp. 471–482. [Google Scholar]
- Dasu, V.A.; Gupta, V.; Tizpaz-Niari, S.; Tan, G.; et al. Attention Pruning: Automated Fairness Repair of Language Models via Surrogate Simulated Annealing. arXiv 2025, arXiv:2503.15815. [Google Scholar] [CrossRef]
- Qin, Z.; Wang, H.; Wang, Z.; Liu, D.; Fan, C.; Lv, Z.; Tu, Z.; Chu, D.; Sui, D. Mitigating gender bias in code large language models via model editing. arXiv 2024, arXiv:2410.07820. [Google Scholar] [CrossRef]
- Ye, J.; Wang, Y.; Huang, Y.; Chen, D.; Zhang, Q.; Moniz, N.; Gao, T.; Geyer, W.; Huang, C.; Chen, P.Y.; et al. Justice or Prejudice? Quantifying Biases in LLM-as-a-Judge. arXiv 2024, arXiv:2410.02736. [Google Scholar] [CrossRef]
- Liu, Z.; Dou, G.; Chien, E.; Zhang, C.; Tian, Y.; Zhu, Z. Breaking the trilemma of privacy, utility, and efficiency via controllable machine unlearning. Proc. ACM Web Conf. 2024, 2024, 1260–1271. [Google Scholar]
- Li, T.; et al. Revisiting Jailbreaking for Large Language Models. In Proceedings of the COLING, 2025. [Google Scholar]
- U.S. AI Safety Institute at NIST. Managing Misuse Risk for Dual-Use Foundation Models (NIST AI 800-1, 2nd Public Draft). NIST, Technical report. 2025. [Google Scholar]
- Amayuelas, A.; Yang, X.; Antoniades, A.; Hua, W.; Pan, L.; Wang, W.Y. MultiAgent Collaboration Attack: Investigating Adversarial Attacks in Large Language Model Collaborations via Debate. Proc. Find. Assoc. Comput. Linguist. EMNLP 2024, 2024, 6929–6948. [Google Scholar]

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.