Submitted: 17 January 2026
Posted: 19 January 2026
Abstract
Keywords:
1. Introduction
2. Related Work
2.1. Knowledge Graphs
- First, how should the nodes and edges of a large graph be pruned and weighted?
- Second, how can large volumes of text be incorporated into the graph analysis?
- Third, how can information be retrieved and interpreted efficiently from both a graph and an imbalanced learning model?
2.2. Imbalanced Learning
2.3. Large Language Models (LLMs)
2.4. Graph Anomaly Detection
3. Data
4. Proposed Methodology
- Multiple types of knowledge base, including a user-activity knowledge graph and documents: this allows more comprehensive information to be considered.
- Interpretation from analytic models, including graph similarity and imbalanced learning: this avoids fine-tuning the LLM for specific purposes, which saves cost and improves efficiency.
5. Modeling Process and Results
5.1. Knowledge Graph Creation
- Nodes V: Each node represents a user, a user role, a device, an activity type (i.e., logon, email, file access, removable connect, removable disconnect, web visit, logoff), or an activity time.
- Edges E: The edges connect the user, the user role, the device, the activity type, and the activity time; together they describe which user performed which activity on which device at what time.
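As a minimal sketch of the node and edge definitions above, the snippet below turns a few activity records into typed nodes and labeled edges using only the Python standard library. The record fields, the sample values, and the relation names (`HAS_ROLE`, `PERFORMED`, `ON_DEVICE`, `AT_TIME`) are assumptions made for illustration; they are not taken from the paper's schema.

```python
from collections import defaultdict

# Hypothetical activity records; field names and values are illustrative only.
events = [
    {"user": "U1", "role": "Engineer", "device": "PC-1", "activity": "logon", "time": "2010-01-04T08:00"},
    {"user": "U1", "role": "Engineer", "device": "PC-1", "activity": "email", "time": "2010-01-04T08:05"},
    {"user": "U2", "role": "Analyst", "device": "PC-2", "activity": "logon", "time": "2010-01-04T09:00"},
]


def build_graph(records):
    """Build typed nodes and labeled edges from activity records.

    Nodes are (type, value) pairs. Edges map a (source, relation, target)
    triple to the list of record indices that produced it, so shared nodes
    such as the activity type "logon" can still be traced back to events.
    """
    nodes = set()
    edges = defaultdict(list)
    for idx, rec in enumerate(records):
        user = ("user", rec["user"])
        role = ("role", rec["role"])
        device = ("device", rec["device"])
        activity = ("activity", rec["activity"])
        time = ("time", rec["time"])
        nodes.update([user, role, device, activity, time])
        # Relation names below are assumed, not the paper's.
        edges[(user, "HAS_ROLE", role)].append(idx)
        edges[(user, "PERFORMED", activity)].append(idx)
        edges[(activity, "ON_DEVICE", device)].append(idx)
        edges[(activity, "AT_TIME", time)].append(idx)
    return nodes, dict(edges)


nodes, edges = build_graph(events)
print(len(nodes), "nodes;", len(edges), "distinct edges")
```

Keeping the record indices on each edge is one possible design choice: it lets a later query recover which user did what, on which device, and when, even though activity-type nodes are shared across users.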
5.2. Graph Pruning and Graph Weighting using Imbalanced Learning Techniques
5.2.1. Feature Creation
5.2.2. Feature Selection
5.2.3. Imbalanced Learning
- Model 1: Gradient Boosting Model trained without learnable weights
- Model 2: Gradient Boosting Model trained with learnable weights from Equation 1
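To illustrate why weighting matters on imbalanced data, the sketch below compares an unweighted and a class-weighted binary log loss on a toy sample with one threat among ten logons. It is a generic cost-sensitive loss, not Equation 1 (which is not reproduced in this section), and the weight here is a hand-picked constant rather than a learned one; the function name and the toy numbers are hypothetical.

```python
import math


def weighted_log_loss(y_true, p_pred, w_pos=1.0, w_neg=1.0):
    """Mean binary cross-entropy with separate weights for the two classes."""
    eps = 1e-12  # guards against log(0)
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)
        if y == 1:
            total += -w_pos * math.log(p)
        else:
            total += -w_neg * math.log(1.0 - p)
    return total / len(y_true)


# Toy data: nine benign logons, one threat, and a model that scores all as low risk.
y = [0] * 9 + [1]
p = [0.1] * 10

unweighted = weighted_log_loss(y, p)           # analogous to training without weights
weighted = weighted_log_loss(y, p, w_pos=9.0)  # minority class up-weighted (assumed value)
print(f"unweighted={unweighted:.4f} weighted={weighted:.4f}")
```

With the minority class up-weighted, the missed threat dominates the loss, so a learner minimizing it is pushed to rank threats higher instead of predicting low risk everywhere.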
5.2.4. Graph Similarity
5.3. Graph Retrieval and Interpretation using Large Language Model
5.3.1. Graph Schema Creation - Extended
5.3.2. Large Language Model as Retriever and Interpreter
- The LLM translates the user’s English question into a graph database query language and performs a relationship-based search in the Knowledge Graph.
- The LLM standardizes text data (e.g., user role) and improves the data quality in the Knowledge Graph.
- The LLM summarizes the user activity information and the visited content from the Knowledge Graph.
- The LLM calls the graph similarity computation between the current activity graph and the past activity graph to estimate the likelihood of an unknown threat.
- The LLM calls the imbalanced learning model to estimate the likelihood of a known threat.
- The LLM interprets the user’s activities to infer the user’s interests and intentions, based on the knowledge it acquired from Web-scale training data.
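The tool-calling roles in the list above can be sketched as a small dispatch table. In this sketch the LLM's decision is represented by a plain dictionary of the assumed form `{"tool": name, "args": {...}}`, the graph similarity is a Jaccard index over edge sets (a stand-in, since the paper's similarity measure is not given in this section), and the known-threat tool simply passes through a precomputed model score. All names are hypothetical.

```python
def edge_jaccard(current_edges, past_edges):
    """Jaccard similarity of two edge sets (1.0 = identical, 0.0 = disjoint)."""
    a, b = set(current_edges), set(past_edges)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def known_threat_likelihood(model_score):
    """Stand-in for the imbalanced learning model: returns a precomputed score."""
    return float(model_score)


# Registry of analytic tools the LLM may request (names are assumptions).
TOOLS = {
    "graph_similarity": edge_jaccard,
    "known_threat": known_threat_likelihood,
}


def run_tool_call(call):
    """Execute a tool request of the assumed form {"tool": name, "args": {...}}."""
    name = call["tool"]
    if name not in TOOLS:
        raise ValueError(f"unknown tool: {name}")
    return TOOLS[name](**call["args"])


# Toy edge sets: (user, activity, device) triples for past and current behavior.
past = {("U1", "logon", "PC-1"), ("U1", "email", "PC-1")}
current = {("U1", "logon", "PC-1"), ("U1", "removable connect", "PC-9")}

call = {"tool": "graph_similarity", "args": {"current_edges": current, "past_edges": past}}
similarity = run_tool_call(call)
print(f"similarity to past activity: {similarity:.3f}")
```

Under this sketch, a low similarity to a user's past activity graph would be read as a hint of unfamiliar behavior; how that maps to an unknown-threat likelihood is left open here.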
6. Applications in Other Domains
7. Conclusions
8. Limitations and Future Work








| Performance Metric | Model 1 | Model 2 |
|---|---|---|
| % of all true threats captured (gain) at top 3% predicted risky logons | 56% | 60% |
| % of all true threats captured (gain) at top 30% predicted risky logons | 95% | 98% |
| Area under Precision-Recall Curve | 0.186 | 0.204 |
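
The two kinds of metric in the table can be computed as in the following sketch, which uses toy labels and scores rather than the paper's data. The gain at the top k% is the share of all true threats found among the k% highest-scored logons; the area under the precision-recall curve is approximated here by average precision, one common step-wise estimate (the estimator used in the paper is not stated in this section). Function names are hypothetical.

```python
def gain_at_top(y_true, scores, fraction):
    """Share of all positives captured within the top `fraction` of scores."""
    n = len(scores)
    order = sorted(range(n), key=lambda i: scores[i], reverse=True)
    k = max(1, round(fraction * n))  # number of logons flagged as risky
    captured = sum(y_true[i] for i in order[:k])
    total = sum(y_true)
    return captured / total if total else 0.0


def average_precision(y_true, scores):
    """Step-wise estimate of the area under the precision-recall curve."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits = 0
    precision_sum = 0.0
    for rank, i in enumerate(order, start=1):
        if y_true[i] == 1:
            hits += 1
            precision_sum += hits / rank  # precision at this recall step
    return precision_sum / hits if hits else 0.0


# Toy example: 10 logons, 2 true threats (label 1), scores sorted for readability.
y = [1, 0, 1, 0, 0, 0, 0, 0, 0, 0]
s = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]

gain_10 = gain_at_top(y, s, 0.10)
gain_30 = gain_at_top(y, s, 0.30)
ap = average_precision(y, s)
print(gain_10, gain_30, round(ap, 4))
```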
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).