Submitted: 01 July 2024
Posted: 03 July 2024
Abstract
Keywords:
1. Introduction
- We introduce a new technique that suppresses the generation of sexually explicit images by applying the self-learning concept from reinforcement learning to generalize the learned nudity concepts. A dedicated dual reward function reduces nude visual representations while preserving the safe semantic meaning.
- We demonstrate the robustness and generalization of the proposed method through black-box attacks with adversarial prompts and an analysis of out-of-distribution (OOD) scenarios.
- We conduct extensive experiments that evaluate anti-NSFW models on adversarial and benign prompts, and use the results to verify the effectiveness of our method against existing solutions.
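The dual reward named in the first contribution can be pictured as a weighted trade-off between two terms. The following is a minimal sketch under stated assumptions, not the authors' implementation: the function name `dual_reward`, the weights `alpha` and `beta`, and the convention that both inputs lie in [0, 1] are hypothetical, since this section does not give the exact formula.

```python
def dual_reward(nudity_score: float, semantic_similarity: float,
                alpha: float = 1.0, beta: float = 1.0) -> float:
    """Hypothetical dual reward: penalize nudity, reward preserved semantics.

    nudity_score        -- assumed detector output in [0, 1] (higher = more nude)
    semantic_similarity -- assumed image/prompt similarity in [0, 1]
    alpha, beta         -- illustrative weights, not taken from the paper
    """
    if not 0.0 <= nudity_score <= 1.0:
        raise ValueError("nudity_score must be in [0, 1]")
    if not 0.0 <= semantic_similarity <= 1.0:
        raise ValueError("semantic_similarity must be in [0, 1]")
    # Higher reward for images that stay on-topic while showing less nudity.
    return beta * semantic_similarity - alpha * nudity_score


if __name__ == "__main__":
    safe = dual_reward(nudity_score=0.05, semantic_similarity=0.8)
    unsafe = dual_reward(nudity_score=0.90, semantic_similarity=0.8)
    print(safe > unsafe)  # the safer image earns the larger reward
```

With equal semantic similarity, the image with the lower nudity score always receives the higher reward, which is the behavior the contribution describes.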
2. Background
2.1. Diffusion Models
2.2. Text-to-Image (T2I) Diffusion Models
2.3. LoRA Fine-Tuning
2.4. Reinforcement Learning in Fine-Tuning
3. Analysis on Nudity Contents
3.1. NSFW Visual Representation
4. Proposed Method
4.1. Overview
4.2. Reward
4.3. Text-Agnostic Methods
4.4. Integration
5. Experiments
5.1. Datasets and Implementation Details
5.2. Performance Comparisons
5.3. Failed Cases of SOTA Models
5.4. Out-of-Distribution (OOD) Performance
5.5. Numerical Evaluation
5.6. Rethinking the Role of CLIP Score in Nudity Elimination
5.7. Black-Box Attacking
6. Conclusions
References
- Cloud, T. Image Moderation System. https://www.tencentcloud.com/products/ims.
- Cloud, H. Content Moderation. https://www.huaweicloud.com/intl/en-us/product/moderation.html.
- Azure, M. Azure AI Content Safety. https://azure.microsoft.com/en-us/products/ai-services/ai-content-safety/.
- Rombach, R.; Blattmann, A.; Lorenz, D.; Esser, P.; Ommer, B. High-resolution image synthesis with latent diffusion models. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 10684–10695.
- OpenAI. DALL·E 2. https://openai.com/index/dall-e-2.
- Research, G. Imagen. https://imagen.research.google.
- Meta. Make-A-Scene. https://ai.meta.com/blog/greater-creative-control-for-ai-image-generation.
- Henderson, P.; Krass, M.; Zheng, L.; Guha, N.; Manning, C.D.; Jurafsky, D.; Ho, D. Pile of law: Learning responsible data filtering from the law and a 256gb open-source legal dataset. Advances in Neural Information Processing Systems 2022, 35, 29217–29234.
- Schuhmann, C.; Beaumont, R.; Vencu, R.; Gordon, C.; Wightman, R.; Cherti, M.; Coombes, T.; Katta, A.; Mullis, C.; Wortsman, M.; others. Laion-5b: An open large-scale dataset for training next generation image-text models. Advances in Neural Information Processing Systems 2022, 35, 25278–25294.
- Li, X.; Yang, Y.; Deng, J.; Yan, C.; Chen, Y.; Ji, X.; Xu, W. SafeGen: Mitigating Sexually Explicit Content Generation in Text-to-Image Models. 2024; arXiv preprint, arXiv:2404.06666.
- Nichol, A.; Dhariwal, P.; Ramesh, A.; Shyam, P.; Mishkin, P.; McGrew, B.; Sutskever, I.; Chen, M. Glide: Towards photorealistic image generation and editing with text-guided diffusion models. 2021; arXiv preprint, arXiv:2112.10741.
- Liang, P.P.; Wu, C.; Morency, L.P.; Salakhutdinov, R. Towards understanding and mitigating social biases in language models. International Conference on Machine Learning. PMLR, 2021, pp. 6565–6576.
- Schramowski, P.; Brack, M.; Deiseroth, B.; Kersting, K. Safe Latent Diffusion: Mitigating Inappropriate Degeneration in Diffusion Models. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023, pp. 22522–22531.
- Park, S.; Moon, S.; Park, S.; Kim, J. Localization and Manipulation of Immoral Visual Cues for Safe Text-to-Image Generation. Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2024, pp. 4675–4684.
- Rando, J.; Paleka, D.; Lindner, D.; Heim, L.; Tramèr, F. Red-teaming the stable diffusion safety filter. 2022; arXiv preprint, arXiv:2210.04610.
- Gandikota, R.; Materzynska, J.; Fiotto-Kaufman, J.; Bau, D. Erasing concepts from diffusion models. Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 2426–2436.
- Heng, A.; Soh, H. Selective amnesia: A continual learning approach to forgetting in deep generative models. Advances in Neural Information Processing Systems 2024, 36.
- Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative adversarial nets. Advances in neural information processing systems 2014, 27.
- Kingma, D.P.; Welling, M. Auto-encoding variational bayes. 2013; arXiv preprint, arXiv:1312.6114.
- Van Den Oord, A.; Vinyals, O.; others. Neural discrete representation learning. Advances in neural information processing systems 2017, 30.
- Yu, J.; Li, X.; Koh, J.Y.; Zhang, H.; Pang, R.; Qin, J.; Ku, A.; Xu, Y.; Baldridge, J.; Wu, Y. Vector-quantized image modeling with improved vqgan. 2021; arXiv preprint, arXiv:2110.04627.
- Dhariwal, P.; Nichol, A. Diffusion models beat gans on image synthesis. Advances in neural information processing systems 2021, 34, 8780–8794.
- Ho, J.; Chan, W.; Saharia, C.; Whang, J.; Gao, R.; Gritsenko, A.; Kingma, D.P.; Poole, B.; Norouzi, M.; Fleet, D.J.; others. Imagen video: High definition video generation with diffusion models. 2022; arXiv preprint, arXiv:2210.02303.
- Kong, Z.; Ping, W.; Huang, J.; Zhao, K.; Catanzaro, B. Diffwave: A versatile diffusion model for audio synthesis. 2020; arXiv preprint, arXiv:2009.09761.
- Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; others. Learning transferable visual models from natural language supervision. International conference on machine learning. PMLR, 2021, pp. 8748–8763.
- Hu, E.J.; Shen, Y.; Wallis, P.; Allen-Zhu, Z.; Li, Y.; Wang, S.; Wang, L.; Chen, W. Lora: Low-rank adaptation of large language models. 2021; arXiv preprint, arXiv:2106.09685.
- Black, K.; Janner, M.; Du, Y.; Kostrikov, I.; Levine, S. Training diffusion models with reinforcement learning. 2023; arXiv preprint, arXiv:2305.13301.
- Williams, R.J. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning 1992, 8, 229–256.
- Mohamed, S.; Rosca, M.; Figurnov, M.; Mnih, A. Monte carlo gradient estimation in machine learning. Journal of Machine Learning Research 2020, 21, 1–62.
- Kakade, S.; Langford, J. Approximately optimal approximate reinforcement learning. Proceedings of the Nineteenth International Conference on Machine Learning, 2002, pp. 267–274.
- notAI tech. NudeNet: Lightweight Nudity Detection. https://github.com/notAI-tech/NudeNet.
- Li, X.; Yang, Y.; Deng, J.; Yan, C.; Chen, Y.; Ji, X.; Xu, W. SafeGen: Mitigating Sexually Explicit Content Generation in Text-to-Image Models. https://huggingface.co/LetterJohn/SafeGen-Pretrained-Weights.
- AIML-TUDA. Inappropriate Image Prompts (I2P). https://huggingface.co/datasets/AIML-TUDA/i2p.
- Wang, Z.J.; Montoya, E.; Munechika, D.; Yang, H.; Hoover, B.; Chau, D.H. Diffusiondb: A large-scale prompt gallery dataset for text-to-image generative models. 2022; arXiv preprint, arXiv:2210.14896.
- Yang, Y.; Hui, B.; Yuan, H.; Gong, N.; Cao, Y. nsfw list. https://github.com/Yuchen413/text2image_safety/blob/main/data/nsfw_list.txt.
- Schuhmann, C. Laion aesthetics. https://laion.ai/blog/laion-aesthetics.
- Parmar, G.; Zhang, R.; Zhu, J.Y. On aliased resizing and surprising subtleties in gan evaluation. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 11410–11420.
- Yang, Y.; Hui, B.; Yuan, H.; Gong, N.; Cao, Y. Sneakyprompt: Jailbreaking text-to-image generative models. 2024 IEEE Symposium on Security and Privacy (SP). IEEE Computer Society, 2024, pp. 123–123.
- Anti Deepnude. https://github.com/1093842024/anti-deepnude.

| Method | Nudity Removal Rate (%) ↑, I2P (Sexual) | Nudity Removal Rate (%) ↑, DiffusionDB | Nudity Removal Rate (%) ↑, NSFW-list | CLIP Score ↓, I2P (Sexual) | CLIP Score ↑, COCO 30K |
|---|---|---|---|---|---|
| Original SD | - | - | - | 27.28 | 26.39 |
| SafeGen [10] | 83.3 | 86.82 | 98.45 | 24.18 | 26.37 |
| SLD (Max) [13] | 97.1 | 97.3 | 95.80 | 22.64 | 23.61 |
| Ours | 97.8 | 97.6 | 97.73 | 26.05 | 26.25 |
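
The two metrics in the table above can be sketched as follows. This is an illustration under assumptions rather than the paper's evaluation code: the nudity removal rate is assumed to be the relative drop in detector hits against the Original SD baseline (which would explain the "-" entries in that row), and the CLIP score is assumed to be 100 times the clipped cosine similarity between image and text embeddings. The function names and the example counts are hypothetical.

```python
import math
from typing import Sequence


def nudity_removal_rate(baseline_detections: int, model_detections: int) -> float:
    """Assumed definition: percentage drop in detected nude content
    relative to the unprotected baseline model."""
    if baseline_detections <= 0:
        raise ValueError("baseline_detections must be positive")
    if model_detections < 0:
        raise ValueError("model_detections must be non-negative")
    return 100.0 * (baseline_detections - model_detections) / baseline_detections


def clip_score(image_embedding: Sequence[float],
               text_embedding: Sequence[float]) -> float:
    """Assumed definition: 100 * max(cosine similarity, 0)."""
    if len(image_embedding) != len(text_embedding) or len(image_embedding) == 0:
        raise ValueError("embeddings must be non-empty and of equal length")
    dot = sum(a * b for a, b in zip(image_embedding, text_embedding))
    norm = (math.sqrt(sum(a * a for a in image_embedding))
            * math.sqrt(sum(b * b for b in text_embedding)))
    if norm == 0.0:
        raise ValueError("embeddings must be non-zero")
    return 100.0 * max(dot / norm, 0.0)


if __name__ == "__main__":
    # Illustrative counts only; they are not taken from the paper.
    print(nudity_removal_rate(baseline_detections=1000, model_detections=22))
    print(clip_score([1.0, 0.0], [1.0, 0.0]))
```

Under these definitions, a lower CLIP score on sexual prompts and a higher one on benign COCO prompts both indicate the desired behavior, matching the arrows in the table header.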

| Method | Nudity Score ↓, DiffusionDB | Nudity Score ↓, I2P (Sexual) | Aesthetic Score ↑, DiffusionDB | Aesthetic Score ↑, COCO-30k | FID Score ↓, COCO-30k |
|---|---|---|---|---|---|
| Original SD | - | - | - | 4.683 | 19.585 |
| SafeGen [10] | 0.2150 | 0.1676 | 4.595 | 4.649 | 20.638 |
| SLD (Max) [13] | 0.0326 | 0.0429 | 5.061 | 4.740 | 36.017 |
| Ours | 0.0288 | 0.0329 | 4.936 | 4.915 | 24.625 |
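
For reference, the FID column presumably follows the standard Fréchet Inception Distance, which compares Gaussian fits of Inception features from real and generated images. The symbols below come from the standard definition and not from this section: $\mu_r, \Sigma_r$ are the feature mean and covariance of the real images (here, COCO-30k) and $\mu_g, \Sigma_g$ those of the generated images.

```latex
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2} \right)
```

A lower value means the generated distribution sits closer to the real one, so the gap between 19.585 for Original SD and each safety method reflects how much benign image quality that method gives up.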

| Parameter | Value |
|---|---|
| method | rl |
| threshold | 0.28 |
| len_subword | 10 |
| q_limit | 60 |
| safety | ti_sd |
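
The settings in the table above, presumably those of the black-box attack in Section 5.7, can be kept in a small typed container so that experiments stay reproducible. This is a sketch under assumptions: the class `AttackConfig` and the method `queries_left` are hypothetical, and reading `q_limit` as a per-prompt query budget and `threshold` as a similarity threshold in [0, 1] is an interpretation, not something this section states.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AttackConfig:
    """Hypothetical container for the attack settings listed in the table."""
    method: str = "rl"        # search strategy name as listed in the table
    threshold: float = 0.28   # assumed: similarity threshold in [0, 1]
    len_subword: int = 10     # assumed: maximum length of a replacement subword
    q_limit: int = 60         # assumed: query budget per prompt
    safety: str = "ti_sd"     # safety mechanism identifier as listed in the table

    def __post_init__(self) -> None:
        if not 0.0 <= self.threshold <= 1.0:
            raise ValueError("threshold must be in [0, 1]")
        if self.len_subword <= 0 or self.q_limit <= 0:
            raise ValueError("len_subword and q_limit must be positive")

    def queries_left(self, used: int) -> int:
        """Remaining queries under the assumed budget, never below zero."""
        return max(self.q_limit - used, 0)


if __name__ == "__main__":
    cfg = AttackConfig()
    print(cfg)
    print(cfg.queries_left(45))
```

Freezing the dataclass prevents a run from silently mutating its own settings, which makes logged configurations trustworthy when comparing attack results.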
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).