Submitted: 24 June 2024
Posted: 24 June 2024
Abstract
Keywords:
1. Introduction
- characterizing head functions: identifying and explaining the ‘trojan’ heads
- building an attention-based TrojanNet detector with only limited clean data

2. Threat Models

2.1. Problem Definition
2.2. Self-Generated Models
3. Attention Pattern Exploration
3.1. Distribution of Overall Heads Attention Weights
3.2. Head-Wise Attention Map
3.3. Distribution of Certain Heads Attention Weights
4. Head Functions
4.1. Trigger Heads
4.2. Semantic Heads
4.2.1. Identify Semantic Heads
4.2.2. Redirect Attention
| | Benign | Trojan |
|---|---|---|
| model_s | 93.62%(88/94) | 93.68%(89/95) |
| models_r | 51.14%(45/88) | 89.89%(80/89) |
| sentences_r | 31.05% | 90.51% |
| attention_r | 0.206 | 0.693 |
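
The percentages in the first two rows of the table can be recomputed from the counts printed beside them. The snippet below is only a minimal consistency check: the dictionary layout and the `percent` helper are illustrative assumptions, not part of the original work. One observation, which the surviving text does not confirm, is that the denominators in the `models_r` row (88 and 89) equal the numerators in the `model_s` row, which suggests that `models_r` is measured only over the models counted in `model_s`.

```python
# Counts copied from the table above: (numerator, denominator, reported %).
# The key layout and helper name are hypothetical, chosen for illustration.
reported = {
    ("model_s", "Benign"): (88, 94, 93.62),
    ("model_s", "Trojan"): (89, 95, 93.68),
    ("models_r", "Benign"): (45, 88, 51.14),
    ("models_r", "Trojan"): (80, 89, 89.89),
}


def percent(numerator: int, denominator: int) -> float:
    """Return numerator/denominator as a percentage rounded to two decimals."""
    return round(100.0 * numerator / denominator, 2)


# Recompute every ratio and collect any cell that disagrees with the table.
recomputed = {key: percent(n, d) for key, (n, d, _) in reported.items()}
mismatches = [
    key
    for key, (_, _, shown) in reported.items()
    if abs(recomputed[key] - shown) > 0.005
]

for key, value in recomputed.items():
    print(key, f"{value:.2f}%")
print("mismatches:", mismatches)
```

All four recomputed values agree with the reported percentages to two decimal places.
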
4.2.3. Redirect Importance to Prediction
4.3. Specific Heads
5. Detector Design
5.1. Naive Detector
5.2. Enumerate Trigger Detector
5.3. Reverse Engineering Based Detector
6. Conclusion

| | Benign | Trojan |
|---|---|---|
| model_s | 100%(94/94) | 100%(95/95) |
| models_r | 2.13%(2/94) | 93.68%(89/95) |
| sentences_r | 57.16% | 97.08% |
| attention_r | 0.328 | 0.832 |

| Features | ACC | AUC |
|---|---|---|
| trigger heads | 1 | 1 |
| trigger.to.cls | 1 | 1 |
| avg.over.tokens | 0.91 | 0.92 |

| ACC | AUC | Recall | Precision | F1 |
|---|---|---|---|---|
| 0.91 | 0.91 | 0.81 | 1 | 0.90 |
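
As a quick sanity check on the last table, F1 is the harmonic mean of precision and recall, so the reported F1 can be recomputed from the other two columns. The sketch below assumes the standard binary F1 definition and uses the rounded values from the table, so it can only confirm agreement up to rounding. The function name is illustrative.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (standard F1 definition)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values as printed in the table above (already rounded to two decimals).
precision, recall = 1.0, 0.81
f1 = f1_score(precision, recall)
print(f"F1 = {f1:.3f}")  # about 0.895, which rounds to the reported 0.90
```
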
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).