Submitted:
30 October 2023
Posted:
01 November 2023
Abstract
Keywords:
1. Introduction
2. Related Work
2.1. Attack Investigation
2.2. Contrastive Learning Framework
3. Methodology
3.1. Provenance Graph Construction and Optimization
3.2. Behavior Sequences
3.3. Behavior Sequence Augmentation
3.4. Behavior Sequence Representation
3.5. Sequence Classification Training
4. Experiment
4.1. Datasets and Setups
4.2. Attack Investigation Results
4.3. Comparison Analysis
4.4. Runtime Performance of ConLBS
5. Conclusion
Acknowledgments
References
- Milajerdi S M, Eshete B, Gjomemo R, et al. Poirot: Aligning attack behavior with kernel audit records for cyber threat hunting[C]//Proceedings of the 2019 ACM SIGSAC Conference on Computer and Communications Security. London, UK, 2019: 1795-1812. [CrossRef]
- Milajerdi S M, Gjomemo R, Eshete B, et al. Holmes: Real-time APT detection through correlation of suspicious information flows[C]//Proceedings of the 2019 IEEE Symposium on Security and Privacy (SP). San Francisco, USA, 2019: 1137-1152. [CrossRef]
- Zeng J, Chua Z L, Chen Y, et al. Watson: Abstracting behaviors from audit logs via aggregation of contextual semantics[C]//Proceedings of the 28th Annual Network and Distributed System Security Symposium, NDSS. 2021. [CrossRef]
- Gao P, Shao F, Liu X, et al. Enabling efficient cyber threat hunting with cyber threat intelligence[C]//2021 IEEE 37th International Conference on Data Engineering (ICDE). IEEE, 2021: 193-204. [CrossRef]
- A. Alsaheel, Y. Nan, S. Ma, et al., "ATLAS: A Sequence-based Learning Approach for Attack Investigation," in Proceedings of the 30th USENIX Security Symposium, 2021, pp. 3005-3022.
- W. U. Hassan, M. A. Noureddine, P. Datta, and A. Bates, "OmegaLog: High-Fidelity Attack Investigation via Transparent Multilayer Log Analysis," in Proceedings of the Network and Distributed System Security Symposium (NDSS), 2020.
- P. Gao, X. Xiao, Z. Li, F. Xu, S. R. Kulkarni and P. Mittal, “AIQL: Enabling Efficient Attack Investigation from System Monitoring Data,” in Proceedings of 2018 USENIX Annual Technical Conference (USENIX ATC 18), Boston, USA, 2018, pp. 113-126.
- Y. Kwon, et al. "MCI: Modeling-based Causality Inference in Audit Logging for Attack Investigation," in Proceedings of the Network and Distributed System Security Symposium, vol. 2, pp. 4, 2018. [CrossRef]
- Zhao J, Yan Q, Liu X, et al. Cyber Threat Intelligence Modeling Based on Heterogeneous Graph Convolutional Network[C]//23rd International Symposium on Research in Attacks, Intrusions and Defenses (RAID 2020). 2020: 241-256.
- M. N. Hossain, S. Sheikhi, and R. Sekar. "Combating dependence explosion in forensic analysis using alternative tag propagation semantics," in 2020 IEEE Symposium on Security and Privacy (SP), pp. 1139-1155, 2020. [CrossRef]
- Zhu T, Wang J, Ruan L, et al. General, Efficient, and Real-time Data Compaction Strategy for APT Forensic Analysis[J]. IEEE Transactions on Information Forensics and Security, 2021. [CrossRef]
- R. Yang et al. RATScope: Recording and Reconstructing Missing RAT Semantic Behaviors for Forensic Analysis on Windows[J]. IEEE Transactions on Dependable and Secure Computing, 2020: 1-1. [CrossRef]
- Min Du, Feifei Li, Guineng Zheng, and Vivek Srikumar. Deeplog: Anomaly detection and diagnosis from system logs through deep learning. In ACM SIGSAC Conference on Computer and Communications Security, 2017. [CrossRef]
- Ding H, Zhai J, Nan Y, et al. AIRTAG: Towards Automated Attack Investigation by Unsupervised Learning with Log Texts[C]//32nd USENIX Security Symposium (USENIX Security 23). 2023: 373-390.
- Liu, Fucheng, et al. "Log2vec: A heterogeneous graph embedding based approach for detecting cyber threats within enterprise." Proceedings of the 2019 ACM SIGSAC conference on computer and communications security. 2019. [CrossRef]
- J. Devlin, M. Chang, K. Lee, and K. Toutanova. "Bert: Pre-training of deep bidirectional transformers for language understanding," in Proceedings of NAACL-HLT, 2019, pp.4171-4186.
- Yan Y, Li R, Wang S, et al. ConSERT: A contrastive framework for self-supervised sentence representation transfer[J]. arXiv preprint arXiv:2105.11741, 2021.
- Wu Z, Wang S, Gu J, et al. CLEAR: Contrastive learning for sentence representation[J]. arXiv preprint arXiv:2012.15466, 2020.
- Chen T, Kornblith S, Norouzi M, et al. A simple framework for contrastive learning of visual representations[C]//International conference on machine learning. PMLR, 2020: 1597-1607.
- King, Samuel T., and Peter M. Chen. "Backtracking intrusions." in Proc. ACM Symp. Oper. Syst. Princ, pp. 223-236. 2003. [CrossRef]
- W. U. Hassan, et al., "Nodoze: Combatting threat alert fatigue with automated provenance triage," in Proceedings of the Network and Distributed System Security Symposium 2019, USA, 2019.
- Yan Zhang, Ruidan He, Zuozhu Liu, Kwan Hui Lim, and Lidong Bing. 2020. An unsupervised sentence embedding method by mutual information maximization. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1601-1610. [CrossRef]
- Hongchao Fang and Pengtao Xie. 2020. Cert: Contrastive self-supervised learning for language understanding. arXiv preprint arXiv:2005.12766.
- Fredrik Carlsson, Magnus Sahlgren, Evangelia Gogoulou, Amaru Cuba Gyllensten, and Erik Ylipää Hellqvist. 2021. Semantic re-tuning with contrastive tension. In International Conference on Learning Representations.
- John M Giorgi, Osvald Nitski, Gary D Bader, and Bo Wang. 2020. DeCLUTR: Deep contrastive learning for unsupervised textual representations. arXiv preprint arXiv:2006.03659.
- J. Torrey, “Transparent Computing Engagement 3 Data Release,” 2020, [Online]. Available: https://github.com/darpa-i2o/Transparent-Computing/blob/master/README-E3.md.
- Wang Q, Hassan W U, Li D, et al. You Are What You Do: Hunting Stealthy Malware via Data Provenance Analysis[C]//NDSS. 2020. [CrossRef]
- Y. Zhang and B. C. Wallace. "A Sensitivity Analysis of (and Practitioners' Guide to) Convolutional Neural Networks for Sentence Classification," in Proc. Int. Jt. Conf. Nat. Lang. Process., vol. 1, pp. 253-263, 2017.
- S. Hochreiter, and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735-1780, 1997. [CrossRef]
- T. Mikolov, I. Sutskever, K. Chen, G. Corrado, and J. Dean, "Distributed representations of words and phrases and their compositionality," in Adv. Neural Inf. Process. Syst., vol. 2, pp. 3111-3119, 2013.
- Y. Liu et al., "RoBERTa: A robustly optimized BERT pretraining approach," in Proc. Int. Conf. Learn. Represent., 2020.







| Type | Node Name | Semantic |
|---|---|---|
| process | name_PID | name |
| network | IP-Port, website | IP address, url |
| file | .jpg, .png, .py, .java | picture file, code file |
| file | \system32\, \Program files\ | system file, app file |
| file | *.html, *.lst | html file, lst file |
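The table above maps node names to coarse semantic labels. The paper's exact abstraction rules are not spelled out here, so the following is a hypothetical sketch of such a mapping (function name, rules, and examples are illustrative, not the authors' implementation):

```python
import re

def node_semantic(node_type: str, name: str) -> str:
    """Map a provenance-graph node to a semantic label, following the table above."""
    name = name.lower()
    if node_type == "process":
        # e.g. "chrome_1234" (name_PID) -> "chrome"
        return re.sub(r"_\d+$", "", name)
    if node_type == "network":
        # IP-Port vs. website name
        return "IP address" if re.match(r"^\d{1,3}(\.\d{1,3}){3}", name) else "url"
    if node_type == "file":
        if name.endswith((".jpg", ".png")):
            return "picture file"
        if name.endswith((".py", ".java")):
            return "code file"
        if "\\system32\\" in name:
            return "system file"
        if "\\program files\\" in name:
            return "app file"
        if name.endswith(".html"):
            return "html file"
        if name.endswith(".lst"):
            return "lst file"
    return name  # fall back to the raw name
```

For example, `node_semantic("process", "chrome_1234")` would abstract the PID away and return `"chrome"`, collapsing many process instances into one semantic token.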
| Attack Scenarios | TP | TN | FP | FN | Precision | Recall | F1-score |
|---|---|---|---|---|---|---|---|
| ATLAS. S-1 | 4,536 | 78,856 | 28 | 13 | 99.387% | 99.714% | 99.550% |
| ATLAS. S-2 | 13,584 | 331,051 | 47 | 10 | 99.655% | 99.926% | 99.791% |
| ATLAS. S-3 | 4,975 | 109,285 | 22 | 23 | 99.560% | 99.540% | 99.550% |
| ATLAS. S-4 | 13,199 | 88,576 | 21 | 4 | 99.841% | 99.970% | 99.905% |
| ATLAS. M-1 | 6,331 | 171,131 | 13 | 9 | 99.795% | 99.858% | 99.827% |
| ATLAS. M-2 | 28,914 | 180,326 | 51 | 17 | 99.824% | 99.941% | 99.883% |
| ATLAS. M-3 | 24,728 | 140,347 | 94 | 7 | 99.621% | 99.972% | 99.796% |
| ATLAS. M-4 | 5,945 | 137,167 | 24 | 22 | 99.598% | 99.631% | 99.615% |
| ATLAS. M-5 | 23,526 | 452,354 | 86 | 37 | 99.636% | 99.843% | 99.739% |
| ATLAS. M-6 | 6,372 | 201,569 | 17 | 22 | 99.734% | 99.656% | 99.695% |
| ATLAS. Avg. | 13,211 | 189,066 | 40 | 16 | 99.696% | 99.876% | 99.786% |
| CADETS. case-1 | 87,658 | 436,957 | 218 | 76 | 99.752% | 99.913% | 99.833% |
| CADETS. case-2 | 53,631 | 472,913 | 175 | 49 | 99.675% | 99.909% | 99.792% |
| CADETS. case-3 | 34,097 | 209,681 | 58 | 47 | 99.830% | 99.862% | 99.846% |
| CADETS. Avg. | 58,462 | 373,184 | 150 | 57 | 99.744% | 99.902% | 99.823% |
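The Precision, Recall, and F1-score columns follow the standard definitions over the TP/FP/FN counts (TN does not enter these metrics). A minimal check against the ATLAS S-1 row:

```python
def metrics(tp: int, fp: int, fn: int):
    """Precision = TP/(TP+FP), Recall = TP/(TP+FN), F1 = harmonic mean of the two."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Counts from the ATLAS S-1 row of the table
p, r, f1 = metrics(tp=4536, fp=28, fn=13)
print(f"Precision={p:.3%} Recall={r:.3%} F1={f1:.3%}")
# -> Precision=99.387% Recall=99.714% F1=99.550%
```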
| Method | Precision | Recall | F1-score |
|---|---|---|---|
| RS + BERT (Base) | 87.782% | 84.333% | 86.023% |
| Lem (ATLAS) + BERT (Base) | 97.102% | 92.184% | 94.579% |
| Lem (ConLBS) + BERT (Base) | 99.532% | 98.831% | 99.180% |
| RS + BERT (Re-train) | 93.850% | 89.700% | 91.728% |
| Lem (ATLAS) + BERT (Re-train) | 99.132% | 99.365% | 99.248% |
| Lem (ConLBS) + BERT (Re-train) | 99.696% | 99.876% | 99.786% |
| Base models/method | Recall | Precision | F1-score |
|---|---|---|---|
| Word2vec+CNN [27] | 87.425% | 89.379% | 88.391% |
| Word2vec+LSTM[28] | 95.854% | 96.412% | 96.132% |
| BERT [16] | 98.460% | 98.891% | 98.675% |
| RoBERTa [29] | 99.601% | 99.829% | 99.715% |
| ConLBS | 99.902% | 99.744% | 99.823% |
| Method | Log Size (per min) | Graph/Sequence Construction | Training Time | Investigation Time (Avg.) |
|---|---|---|---|---|
| POIROT[1] | 114.5MB | 1:54:35 | -- | 7.72s |
| ATLAS | 169MB | 0:30:23 | 0:28:26 | 5.0s |
| ConLBS | 358MB | 0:23:48 | 0:36:35 | 2.53s |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).