Preprint
Article

This version is not peer-reviewed.

Local LLM-Based Teacher–Student Knowledge Distillation and AI Agent-Centric Approach

Submitted:

22 May 2026

Posted:

29 May 2026


Abstract
Recent cyber incidents have become increasingly sophisticated through 'Living-off-the-Land (LotL)' techniques that exploit legitimate behavior and multi-stage attacks. This demands advanced reasoning capabilities to discern attack contexts within fragmented, large-scale logs. However, closed network environments with physical network separation (air-gapped), such as national critical infrastructure, restrict the use of high-performance cloud LLMs, limiting the adoption of cutting-edge AI-based analysis technologies. This research proposes a Local LLM-based intrusion analysis framework that can operate independently within closed networks to overcome these constraints. The proposed framework combines (i) an Offline Knowledge Distillation technique that transfers the analytical reasoning process of external high-performance models to the Local LLM after security review, and (ii) an AI agent orchestration structure that controls the analysis procedure step-by-step and suppresses hallucinations. Experiments and validation using the public dataset (Atomic Red Team) demonstrate that the proposed model achieves significantly higher detection accuracy (88.4%) and MITRE ATT&CK mapping performance (0.91 F1-Score) compared to existing general-purpose Local LLMs. Furthermore, it suppressed hallucination rates to 6.2% through an automated verification mechanism and significantly improved analysis efficiency by refining large-scale logs to focus on core events. This study quantitatively demonstrates that AI-based intrusion incident analysis automation is achievable using a single GPU server even under the resource constraints of closed networks, presenting a practical solution for intelligent security monitoring.

1. Introduction

1.1. Background and Necessity

Recent cyber incidents have evolved beyond single-vulnerability attacks or simple malware infections. They now employ complex tactics such as ‘Living-off-the-Land (LotL)’ techniques, which exploit legitimate tools already present on a system, and covert long-term persistence [1]. As seen in the recent cyber conflict between Iran and Israel [2], destructive attacks targeting critical national infrastructure such as power, water, and nuclear facilities are being executed with such sophistication that they can neutralize even physically segmented network environments. Consequently, post-incident breach analysis, which rapidly reconstructs the cause and path of an attack from massive logs to prevent further damage, has become a critical procedure at the national security level.
However, incident analysis in actual security operations environments still consumes substantial human resources and depends heavily on analyst expertise. Analysts must manually correlate massive volumes of logs, events, scripts, and network records generated by various security tools. Particularly when attack activities are dispersed across time and space, fragmented analysis alone makes it difficult to grasp the context and causal relationships of the full scenario. This not only increases analysis time but is also a primary cause of results varying with the analyst’s capabilities.
To overcome these limitations, research utilizing Large Language Models (LLMs) for security analysis is actively progressing [3]. LLMs present new possibilities for analysis automation by integrating and interpreting diverse security data at the natural language level, inferring potential relationships between actions, and narratively summarizing complex incident flows. However, most existing LLM-based approaches assume a cloud environment, entailing the external transmission of sensitive data and dependence on external infrastructure. Such approaches fail to meet the operational constraints of organizations where security is paramount, which limits their applicability there [3]. In particular, organizations such as military, government, and national infrastructure entities operate systems within closed network environments physically isolated from the external internet [4]. In such environments, cloud-based LLM services cannot be used and the export of internal data is strictly controlled, creating a structural gap between the latest intrusion incident analysis technology and its field applicability. Therefore, a system that operates independently using only internal closed-network resources while retaining high-level analytical capabilities is critically required.

1.2. Research Objectives and Contributions

This research proposes a Local LLM-based incident analysis framework capable of high-level inference without data export to overcome the structural constraints of closed-network environments. The proposed framework utilizes a Local LLM deployed on a GPU server within the closed network as its core inference engine. It combines this with a procedure that safely transfers the analytical reasoning process and judgment knowledge from an external high-performance model (Teacher) to the Local LLM (Student) via Offline Knowledge Distillation. Furthermore, by introducing an AI agent-based orchestration structure that controls the incident analysis process through the staged steps of ‘Normalization–Candidate Selection–Semantic Reasoning–Reporting/Verification’, it mitigates the input scale constraints inherent to Local LLMs and actively suppresses hallucinations, ensuring consistency and reproducibility of analysis results.
The main contributions of this research are as follows.
• Practical Framework: We present a comprehensive framework specifically designed for closed-network (air-gapped) environments, demonstrating that AI-based intrusion analysis is achievable using a single GPU server without any external API calls or cloud resources.
• Offline Knowledge Distillation: This research designs a Teacher–Student-based mechanism that effectively transplants high-dimensional reasoning from a Teacher model to a Student model, enabling a 7B–8B scale compact model to achieve expert-level analysis intelligence.
• AI Agent Orchestration: We employ a structured orchestration layer that operates through a four-step process, which dramatically improves efficiency by refining large-scale raw logs into core threat indicators and managing them within the model’s context window constraints.
• Quantitative Validation: Through comparative experiments on 500 test scenarios, we quantitatively prove that the proposed framework achieves significantly higher detection accuracy (88.4%) and MITRE ATT&CK mapping performance (0.91 F1-Score) compared to existing general-purpose Local LLMs.
• Reliability and Efficiency: The framework suppressed the hallucination rate to 6.2% and maintained a low memory overhead of only 15.3 GB VRAM, ensuring stable and sustainable operation with an average power consumption of 150W.
The structure of this paper is as follows. Section 2 reviews related research trends, and Section 3 presents the proposed framework. Section 4 presents performance validation results, including self-defined reliability evaluation metrics, and Section 5 performs in-depth validation based on threat scenarios. Finally, Section 6 concludes the study.

2. Background

2.1. Research on Automated Incident Analysis

Automated intrusion incident analysis has traditionally evolved by combining rule-based correlation and signature-based detection, with recent research increasingly integrating deep learning frameworks to enhance identification accuracy in complex infrastructures [5]. This approach is effective for rapidly identifying known attack patterns and offers the advantage of automatically processing large-scale security events. In actual security monitoring environments, SIEM-centric detection rules have played a core role in the initial identification phase, with specialized intrusion detection systems (IDS) further evolving to secure diverse environments such as mobile and industrial control systems [6]. However, when attack activities are intermingled with normal behavior or occur distributed across multiple systems and time zones, the rule-based approach reveals structural limitations. In multi-stage attacks or low-intensity, long-term attacks, individual events may fall within normal categories, yet malicious intent often becomes apparent within the broader context of the entire flow. Rule-based analysis operates primarily on events meeting predefined conditions, limiting its ability to comprehensively interpret the context surrounding an incident, the causal relationships between actions, and the step-by-step progression of the attack.
Furthermore, outputs from existing automated analysis are often provided as alert lists or event sets, failing to sufficiently meet the fundamental requirement of incident analysis: reconstructing the attack’s cause, path, and techniques into a single scenario. Consequently, analysts must manually connect and interpret the results from automated tools, a process repeatedly noted to increase analysis time, create dependency on expertise, and lead to result discrepancies [7]. Therefore, to overcome these limitations of static automation, research on generative AI-based analysis automation is gaining attention. This approach can semantically interpret unstructured context between security data and reconstruct fragmented information into logical scenarios.

2.2. LLM-Based Security Analysis

Recently, research on security analysis utilizing large language models (LLMs) has been actively reported as a way to address the limitations of intrusion incident analysis automation [8]. LLMs can interpret diverse security data, such as logs, code, and behavioral descriptions, in an integrated manner at the natural language level. They differ from existing analysis techniques in that they generate explanations that consider the semantic context of attack behaviors and the relationships between actions. Accordingly, diverse applied research has been proposed, including security log summarization, vulnerability description generation, automated code review, malware behavior summarization, and attack technique mapping based on MITRE ATT&CK (Adversarial Tactics, Techniques, and Common Knowledge) [9].
The MITRE ATT&CK framework, utilized as a global standard for incident analysis, provides a knowledge base that systematizes attackers’ tactics and techniques. While traditional rule-based systems focused on predefined pattern matching, LLM-based analysis can contextually infer and map these tactics and techniques from unstructured logs. This goes beyond simple event classification, offering new possibilities for analysis automation by enabling the narrative reconstruction of an incident’s cause and progression. Particularly, the ability to organize the incident flow into an “explainable format” supports report generation and analyst decision-making. However, many LLM-based security analysis studies assume cloud environment utilization and feature architectures involving external API calls or real-time model updates. This can necessitate transmitting sensitive log and incident analysis data externally or create dependency on external infrastructure and service policies during analysis. Consequently, applying these research outcomes identically within closed-network environments with restricted external network access presents limitations. Furthermore, due to the high reliability demanded by security analysis tasks, controlling the inherent issue of hallucination in generative models is essential [10]. In incident analysis, this phenomenon can lead to fabrication—creating non-existent attack logs—or cause critical false positives by misclassifying attack techniques into incorrect categories. Particularly, Local LLMs operable in closed-network environments exhibit relatively higher hallucination frequencies compared to large models due to their lower contextual retention capabilities. Therefore, to ensure the integrity of analysis results, a grounding technique that compares model-generated sentences with original logs in real-time and an agent-based verification mechanism structurally controlling this process must be implemented [11].

2.3. Limitations of LLM Utilization in Closed-Network Environments

Closed-network environments are widely used in organizations requiring high security levels, such as military, government, and national infrastructure, where connections to the external internet are fundamentally blocked [4]. In this environment, LLM utilization methods like cloud-based LLM API calls, online training, and real-time model updates are impossible, and the external transfer of internal logs and incident analysis data is also strictly restricted. Therefore, meaning-based analysis techniques designed for cloud LLMs cannot be applied in the same form. Instead, an analysis system capable of independent operation within the network, without reliance on external infrastructure, is required. Due to these constraints, simply deploying the model internally is insufficient for effectively utilizing LLM-based analysis in closed-network environments.
To bridge this gap, it is essential to internalize advanced reasoning frameworks that allow smaller models to perform at the level of high-performance LLMs. Specifically, the adoption of Chain-of-Thought (CoT) [14] to elicit structured logical deduction and ReAct [15] to synergize reasoning with step-by-step action is required to ensure analysis intelligence within isolated infrastructures.
First, an offline update procedure based on knowledge distillation is necessary. This must safely incorporate the latest external analytical knowledge while optimizing model size to deliver high performance within the closed network’s limited computing resources, thereby transferring intelligence. Second, an agent-based operational system is required to manage and verify the model’s inference process step-by-step, ensuring stable analysis quality. Third, since security analysis requires providing evidence for results and ensuring traceability, an audit system must be implemented to structurally suppress model hallucinations and verify the factual accuracy of generated reports against source data [16].

2.4. Research on Utilization in Local LLM and Closed-Network Environments

As an alternative that eliminates cloud-based LLM dependency while securing meaning-based analysis capabilities, research on and practical application of Local LLMs, which run directly on internal infrastructure without external network connections, are gradually expanding. Local LLMs are regarded as a suitable approach for closed-network environments and high-security organizations because the models are deployed on internal servers, providing natural language understanding and inference without transmitting sensitive logs, code, or incident analysis data externally. Recent research and the open-source ecosystem have proposed diverse model families centered on small-to-medium LLMs, each demonstrating distinct strengths depending on the analytical task. For instance, the Llama family [16] excels in general-purpose task execution, the Mistral family [17] emphasizes efficiency in long-context processing, the Qwen family [18] specializes in code generation and analysis, and the DeepSeek family enhances reinforcement learning-based reasoning tendencies [19]. These models can address diverse needs such as general natural language understanding, large-scale log analysis, code-based security analysis, and complex causal reasoning. Accumulating reports that even 7B–8B-scale models can perform meaning-based analysis and stepwise reasoning have recently expanded their application scope.
Table 1 compares the structural characteristics of large LLMs and 7B–8B-scale Local LLMs. Large LLMs offer superior reasoning performance and extensive context processing capabilities, but their resource consumption and security policy constraints make deployment impossible in closed-network environments. Conversely, while Local LLMs enable independent operation on internal servers, their limited parameter scale may result in performance gaps compared to large models when inferring complex causal relationships in multi-stage attacks or performing security assessments requiring advanced expertise. Therefore, research on optimization techniques and knowledge transfer methodologies must proceed in parallel to maintain the resource efficiency of Local LLMs within closed networks while securing analysis intelligence comparable to large models.
However, existing research on Local LLMs has often focused on single tasks (e.g., summarization, classification, specific analytical functions) or limited functional automation. Cases that organize, end to end, complex analytical procedures requiring multi-stage reasoning, context integration, and result verification, such as incident analysis, remain limited. Furthermore, there is insufficient discussion of structural designs that simultaneously achieve the two core requirements of closed-network environments, “secure incorporation of external knowledge” and “ensuring analysis result reliability (hallucination suppression, evidence presentation, reproducibility)”, specifically regarding methods for safely transferring analytical knowledge from external high-performance models into closed networks and for stepwise control of the analysis process.
Building on these research gaps, this study proposes a framework combining a Teacher–Student-based offline knowledge distillation method [20] that reflects the latest analytical knowledge without external network access, and an AI agent-based orchestration structure that decomposes and controls the incident analysis process into normalization, candidate selection, semantic inference, and reporting/verification [21]. Specifically, this study prototypes a standard closed-network operation model for defense and national critical infrastructure, where security regulations are most stringent. This environment is constrained by physical network segmentation, which fundamentally blocks internet access, and requires reliance solely on on-premise server resources within individual departments or control rooms instead of cloud resources. Reflecting this, the experimental environment for this study was also built using a single GPU workstation with physically blocked external communication. The aim is to empirically verify that the proposed framework is practically effective and can be applied immediately even in actual high-security operational environments.

2.5. Research on Evaluation Metrics for Automated Incident Analysis

The output of automated intrusion incident analysis models is not simple classification but a hybrid form combining descriptive explanations and structured mappings. Therefore, verifying model performance comprehensively using a single metric is difficult. Previous studies generally combine three mutually complementary metric families: text similarity and quality assessment, verification of factual consistency in generated results, and quantitative classification accuracy evaluation.
The first family is text similarity and quality evaluation. These metrics measure how similar the model’s generated summaries or explanations are to the ground truth data (reference sentences). ROUGE (Recall-Oriented Understudy for Gisting Evaluation), for example, measures surface-level textual similarity based on n-gram overlap with reference sentences and is widely used as a standard metric for automatic summarization evaluation [22]. Recently, to evaluate contextual meaning similarity beyond simple word overlap, BERTScore, which calculates semantic proximity between tokens based on contextual embeddings, has been used alongside ROUGE to complement the limitations of surface-level metrics [23].
The second family is factuality and reliability. Since security analysis reports can have catastrophic consequences if they contain false information, it is essential to verify that the content generated by the model is factual and based on the input data (logs, events). Metrics like FactCC (Factual Consistency Checking) are used specifically to detect hallucinations, a chronic problem of generative models [13]. They quantify the phenomenon where models invent non-existent facts or distort reality, either by binary classification or by estimating the probability that a generated sentence logically contradicts the source text.
The third family is quantitative classification and mapping accuracy. Incident analysis involves mapping identified attack techniques to a standardized framework (MITRE ATT&CK). In this multi-label classification problem, the micro-averaged F1-score, which remains informative under label imbalance, serves as the core metric to evaluate how accurately the model extracts the correct tags.
This study adopts the methodology of these prior studies but focuses on verifying the performance of the proposed framework in a multidimensional manner. The verification centers on the accuracy of threat identification (Accuracy, F1-score), which matters more in practice than sentence fluency, and on the consistency of evidence and the hallucination occurrence rate, which form the foundation of trust in the analysis results.
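As an illustration of the mapping-accuracy metric, micro-averaged F1 pools true positives, false positives, and false negatives across all scenarios before computing precision and recall. The following is a minimal sketch under stated assumptions: the function name `micro_f1` and the toy technique sets are hypothetical and are not taken from the study's evaluation code.

```python
from typing import Iterable, Set


def micro_f1(gold: Iterable[Set[str]], pred: Iterable[Set[str]]) -> float:
    """Micro-averaged F1 over per-scenario sets of ATT&CK technique IDs."""
    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        tp += len(g & p)  # techniques correctly predicted
        fp += len(p - g)  # predicted but absent from the ground truth
        fn += len(g - p)  # present in the ground truth but missed
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy data (hypothetical): two scenarios with gold and predicted technique sets.
gold = [{"T1105", "T1059.001"}, {"T1110"}]
pred = [{"T1105"}, {"T1110", "T1021.001"}]
score = micro_f1(gold, pred)
```

Because counts are pooled before averaging, frequent techniques weigh more than rare ones; a macro average would instead weight every technique equally.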

3. Local LLM-Based Incident Analysis Framework

The proposed framework is designed to perform the entire incident analysis process independently and completely within a closed network environment, isolated from external networks. The overall architecture is divided into three layers based on data flow, security classification, and functional roles: (i) External Knowledge Layer, (ii) Internal Inference Layer within the Closed Network, and (iii) AI Agent Orchestration Layer. Each layer is modularized into detailed functional modules that interact organically, with the overall structure shown in [Figure 1].

3.1. External Knowledge Layer

This layer addresses performance limitations caused by the parameter size constraints of the closed-network Local LLM. It achieves this by converting the analytical expertise and reasoning processes possessed by an external high-performance model (Teacher Model) into data for transfer.
• Threat Dataset: We collect multidimensional raw logs, including publicly available threat intelligence (CTI), recent incident response (IR) reports, and Sysmon and Windows event logs simulating attack payloads based on MITRE ATT&CK [24]. By securing time-series data that captures attack tactics and procedures beyond simple single events, we build a high-quality training source enabling the model to understand the context surrounding attacks.
• Teacher Model: To establish a high-fidelity knowledge source for distillation, we leverage a GPT-4-level Large Language Model (LLM) [25], which is pre-trained on extensive security domain knowledge and recent vulnerability intelligence. In this framework, the Teacher Model is not merely utilized as a general-purpose processor but is transformed into a specialized ‘Expert Inference Engine’ through a structured prompt architecture.
As shown in Figure 2, we developed a specialized Expert Reasoning Internalization Prompt that enforces a Senior Security Analyst persona and multi-dimensional cognitive constraints. This architecture requires the model to interpret complex semantic correlations between fragmented log artifacts and to define their technical impact on the system’s security posture. Specifically, the prompt constrains the model to output structured 4-stage Chain-of-Thought (CoT) reasoning, comprising Detection, Analysis, Hypothesis, and Action, alongside S-P-O (Subject-Predicate-Object) triplets. This ensures that the synthesized intelligence captures the underlying adversarial intent rather than superficial patterns, providing the essential ground-truth data for the Student Model’s imitation learning process. To make knowledge synthesis as deterministic as possible and mitigate stochastic hallucinations, the Teacher model (GPT-4) was configured with a Temperature of 0.1, Top-p of 0.9, and a Frequency Penalty of 0.1. These settings prioritize logical consistency over linguistic variety during the generation of the 4-stage CoT paths.
• Knowledge Synthesis: Collected logs are input into the Teacher Model to synthesize step-by-step Chain-of-Thought (CoT) data based on the structured prompt architecture shown in Figure 2. This process is structured into four explicit cognitive stages: (1) Evidence Identification (detecting suspicious artifacts and anomalous indicators), (2) Threat Analysis (interpreting technical impacts and mapping to MITRE ATT&CK), (3) Contextual Hypothesis (inferring potential attack intent and predicting next stages), and (4) Final Determination (formulating a conclusive verdict and prioritized actions). This structured synthesis guides the Student model to learn the logical reasoning path of security experts via imitation learning, structurally enhancing the inference capabilities of the compact model to discern complex attack contexts.
Table 2. Structural Framework of Distilled Knowledge for Incident Analysis.
| Analysis Stage | Function in Knowledge Distillation | Practical Example in Training Dataset |
|---|---|---|
| Stage 1: Evidence Identification | Extracts specific indicators of compromise (IoC) or suspicious artifacts from raw logs. | Identified suspicious use of certutil.exe with the -urlcache flag in the command line. |
| Stage 2: Threat Hypothesis | Formulates a potential attack scenario based on identified evidence. | Hypothesis: The attacker is attempting ‘Ingress Tool Transfer’ to download secondary malware. |
| Stage 3: Technical Verification | Cross-references related logs to validate the hypothesis. | Verified outgoing network connection to an external IP immediately following the process execution. |
| Stage 4: Final Determination | Finalizes the mapping to a standardized framework. | Conclusion: Mapping to MITRE ATT&CK T1105 |
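One way to picture a distilled training sample is as a single JSONL record that carries the log excerpt, the four reasoning stages of Table 2, and the S-P-O triplets. The sketch below is illustrative only: the field names (`log`, `cot`, `triplets`, `label`), the helper `make_record`, and the sample log line are assumptions, since the schema itself is not specified here.

```python
import json

# Stage names follow Table 2; the key spellings are assumed for illustration.
STAGES = (
    "evidence_identification",
    "threat_hypothesis",
    "technical_verification",
    "final_determination",
)


def make_record(log: str, cot: dict, triplets: list, label: str) -> str:
    """Serialize one distilled sample as a JSONL line, rejecting incomplete CoT."""
    missing = [s for s in STAGES if not cot.get(s)]
    if missing:
        raise ValueError(f"missing reasoning stages: {missing}")
    record = {
        "log": log,
        "cot": {s: cot[s] for s in STAGES},  # fixed stage order
        "triplets": triplets,
        "label": label,
    }
    return json.dumps(record, ensure_ascii=False)


# Hypothetical sample built from the Table 2 example.
line = make_record(
    log="certutil.exe -urlcache -f http://HOST_01/payload.exe payload.exe",
    cot={
        "evidence_identification": "certutil.exe invoked with the -urlcache flag.",
        "threat_hypothesis": "Possible Ingress Tool Transfer of secondary malware.",
        "technical_verification": "Outbound connection right after process execution.",
        "final_determination": "Map to MITRE ATT&CK T1105.",
    },
    triplets=[["certutil.exe", "downloads", "payload.exe"]],
    label="T1105",
)
```

Rejecting records with a missing stage at write time keeps the Student from learning truncated reasoning paths.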
• Data Sanitization: The synthesized data undergoes rigorous security review to anonymize or mask sensitive information, including Personally Identifiable Information (PII) and identifiers such as real IP addresses and internal account names, replacing them with virtual tokens to prevent data leakage. We implemented a standardized masking protocol to maintain analytical context, as shown in Table 3.
Furthermore, this step standardizes data quality by normalizing unstructured text into structured learning schemas such as JSONL, which accelerates model convergence and maximizes training efficiency.
• Secure Transfer: The refined knowledge assets are securely migrated into the air-gapped internal environment using security-hardened transfer protocols or physical security media (e.g., USB/CD), strictly adhering to cross-domain data transfer regulations. This process ensures the complete isolation of the internal learning layer, preserving the data sovereignty and security integrity of the Student Model by preventing any direct external connectivity during the internalization process.
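The masking step in the Data Sanitization stage can be sketched as follows. The key property is that the same real value always maps to the same virtual token, so correlations across log lines survive masking. This is a minimal sketch, not the protocol of Table 3: the class `Masker`, the token formats `IP_nn` and `USER_nn`, and the sample accounts are all hypothetical.

```python
import re

IPV4 = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


class Masker:
    """Replace sensitive values with stable virtual tokens (assumed token format)."""

    def __init__(self, accounts):
        self._tokens = {}  # real value -> virtual token
        self._counts = {}  # token kind -> number of tokens issued so far
        self._account_re = (
            re.compile(r"\b(?:" + "|".join(map(re.escape, accounts)) + r")\b")
            if accounts else None
        )

    def _token(self, kind, value):
        # Reuse the token if this value was seen before, otherwise issue a new one.
        if value not in self._tokens:
            self._counts[kind] = self._counts.get(kind, 0) + 1
            self._tokens[value] = f"{kind}_{self._counts[kind]:02d}"
        return self._tokens[value]

    def mask(self, text):
        text = IPV4.sub(lambda m: self._token("IP", m.group()), text)
        if self._account_re:
            text = self._account_re.sub(lambda m: self._token("USER", m.group()), text)
        return text


masker = Masker(accounts=["svc_backup", "jkim"])
first = masker.mask("jkim logged in from 10.20.30.40")
second = masker.mask("10.20.30.40 connected to 192.168.1.5 as svc_backup")
```

A single `Masker` instance should be reused for a whole incident so that, for example, `IP_01` refers to the same host in every record.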

3.2. Internal Learning & Model Layer

This layer involves establishing high-dimensional inference knowledge imported via physical security media within an isolated internal infrastructure. It optimizes and builds a secure analytics engine that operates independently without external network dependencies.
• Integrity Validation & Loading: Verifies the integrity of the training dataset brought into the closed network by comparing its hash values to detect any tampering during transmission. After final confirmation of data integrity within a sandbox in the secure zone, it is fed into the training pipeline. This is an essential security procedure to prevent model bias caused by contaminated data.
• Local LLM Training: To ensure high-performance incident analysis within a physically isolated environment, the internal training process is conducted on independent GPU servers within the secure zone. Rather than employing computationally expensive full-parameter updates, this framework utilizes the Low-Rank Adaptation (LoRA) technique. This approach enables the 8B-scale Student Model to effectively internalize the Teacher model’s high-dimensional reasoning logic while maintaining a minimal VRAM footprint. As detailed in Table 4, this internalization process is executed through a systematic three-stage pipeline: (1) secure loading and integrity verification of reasoning triplets, (2) architectural adaptation using low-rank matrices to prevent catastrophic forgetting, and (3) objective function optimization to clone the expert’s cognitive reasoning path for standalone deployment.
• Student Model: The final Student Model inherits a 4-stage analytical framework (CoT) via fine-tuning, enabling deep inference within an 8B-parameter footprint. As shown in Figure 3, the model is optimized for standalone GPU environments using FP16 precision to ensure real-time processing of high-load security logs within a 16GB VRAM limit. This implementation leverages an internalized reasoning trigger (Detection, Analysis, Hypothesis, and Action) to activate specialized security vocabulary while maintaining a Zero-drift policy through deterministic decoding. This integrated approach ensures both high-throughput performance (sub-50ms) and absolute data sovereignty without external API dependencies.
• Inference Engine Deployment and Practical Application: The developed Student Model serves as the core inference engine for autonomous AI agents. By implementing optimization techniques such as FP16 precision, the system ensures high resource efficiency, enabling real-time processing of high-volume security logs even on a single GPU server.
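The integrity check in the Integrity Validation & Loading step amounts to recomputing a digest inside the closed network and comparing it with the value recorded before transfer. The sketch below assumes SHA-256 and a digest delivered alongside the media; neither the algorithm nor the helper names `sha256_of` and `verify_dataset` are specified in this paper.

```python
import hashlib
import hmac
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream the file so large datasets are hashed without loading them whole."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_dataset(path: Path, expected_hex: str) -> bool:
    """Return True only if the imported file matches the recorded digest."""
    return hmac.compare_digest(sha256_of(path), expected_hex.lower())
```

A training pipeline would call `verify_dataset` on each imported file and refuse to load any file for which it returns `False`.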

3.3. AI Agent Orchestration Layer

This layer maximizes the analytical potential of the Student Model by systematically controlling and automating the end-to-end incident analysis pipeline. The AI Agent functions as a central controller that decomposes high-level security queries into executable tasks, ensuring both data integrity and reasoning precision through a four-stage process.
• Data Normalization & Parsing: Raw, heterogeneous log datasets collected from various security devices (e.g., EDR, Firewalls) are converted into a unified, structured schema. This stage eliminates system-specific noise and redundant metadata, ensuring data consistency and optimizing token efficiency for the subsequent deep inference stage.
• Classification: To overcome the input constraints (Context Window) of the Local LLM while preventing the loss of stealthy attack traces, the agent employs a Two-Stage Hybrid Filtering mechanism.
• Stage 1 (Priority Scoring): Selects ‘candidate events’ that exhibit high correlation with known attack indicators and high command-line entropy.
• Stage 2 (Anomaly-based Sampling): Acts as a safety net by preserving ‘atypical’ or ‘long-tail’ logs that may have low priority scores but exhibit high statistical deviation from the system’s normal baseline (e.g., rare process-parent relationships or unusual execution timing). This dual approach ensures that stealthy multi-stage attack components, which often masquerade as low-priority background noise, are not pruned and are successfully delivered to the Student Model for contextual reasoning.
• Deep Inference & Self-Confidence Scoring: The agent leverages the distilled intelligence of the Student Model to conduct in-depth spatio-temporal correlation analysis. In this stage, a Self-Confidence Scoring mechanism computes a confidence score from the log-probabilities of the generated tokens. If the cumulative confidence score ($S_{conf}$) falls below a predefined threshold (e.g., 0.8), the agent flags the reasoning as ‘Uncertain’ to prevent potential hallucinations and triggers an immediate alert for human expert intervention (Human-in-the-Loop). Fragmented clues are then reconstructed into comprehensive threat scenarios and mapped to the MITRE ATT&CK framework.
Table 5. A Representative Case Example of Deep Inference Logic and Tactical Mapping.
| Correlation Type | Fragmented Clues (Input Telemetry) | Reconstructed Threat Scenario (Inference) | MITRE ATT&CK Mapping (Output) |
|---|---|---|---|
| Temporal (Time-series) | w3wp.exe spawns cmd.exe; 10 mins later: net.exe used to scan network. | The attacker exploited a web vulnerability to gain a shell and performed internal reconnaissance. | T1190: Exploit Public-Facing Application; T1018: Remote System Discovery |
| Spatial (Cross-asset) | Failed logins on Host A; successful login on Host B from Host A via RDP. | The attacker attempted brute-force on Host A and successfully moved laterally to Host B. | T1110: Brute Force; T1021.001: Remote Desktop Protocol |
| Behavioral (Causal Chain) | PowerShell with Base64 string; outbound connection to unknown IP. | An obfuscated script was executed to establish a Command and Control (C2) channel. | T1059.001: PowerShell; T1132: Data Encoding |
| Objective (Final Goal) | Accessing ntds.dit file; compressed file created in \Temp. | The attacker is attempting to dump credentials for full domain takeover. | T1003.003: NTDS Dumping; T1560: Archive Collected Data |
• Reporting & Dual-Verification (Traceability Matrix): In the final stage, the agent synthesizes the inference results into a structured security report. To move beyond simple suppression and achieve the near-total elimination of hallucinations, the agent executes a Dual-Verification protocol. This involves constructing a Traceability Matrix that programmatically links every claim in the report to a specific, verifiable anchor in the original source logs (e.g., EventID, Timestamp, ProcessID). Any statement lacking a 1:1 evidentiary mapping is automatically flagged for removal or manual audit. This ensures the absolute integrity and traceability required for high-stakes security operations in closed critical infrastructure networks.
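The Self-Confidence Scoring step described above can be illustrated with a minimal sketch. The text does not give the exact formula for $S_{conf}$, so the length-normalised form used here (the exponential of the mean token log-probability) is an assumption, as are the function names `confidence_score` and `triage`; only the 0.8 threshold and the ‘Uncertain’ label come from the description.

```python
import math
from typing import Sequence


def confidence_score(token_logprobs: Sequence[float]) -> float:
    """Assumed form of S_conf: exp(mean token log-probability), a value in (0, 1]."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(sum(token_logprobs) / len(token_logprobs))


def triage(token_logprobs: Sequence[float], threshold: float = 0.8) -> str:
    """Label the reasoning 'Uncertain' (escalate to a human analyst) below the threshold."""
    if confidence_score(token_logprobs) >= threshold:
        return "Confident"
    return "Uncertain"


# Illustrative inputs: one steady generation, one with a low-probability stretch.
confident = [math.log(0.95)] * 20
shaky = [math.log(0.95)] * 10 + [math.log(0.40)] * 10

print(triage(confident), triage(shaky))
```

Normalising by length keeps long reports from being penalised merely for containing more tokens; a raw sum of log-probabilities would need a length-dependent threshold instead.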

4. Experiment and Results Analysis

4.1. Experimental Environment

To validate the effectiveness of the proposed Teacher–Student-based domain-specific knowledge distillation technique, experiments were conducted in a closed network environment physically isolated from external networks.
• Hardware and Infrastructure Specifications: All experiments were conducted on a high-performance server environment equipped with one NVIDIA A100 80GB GPU. To meet physical network segregation requirements under actual security regulations, external API calls and cloud resource utilization were completely excluded. This configuration demonstrates that high-level semantic analysis is achievable even with a single GPU server. Although experiments were conducted on an A100-80GB server, peak memory consumption during inference remained below 15.3GB, indicating that the proposed framework is deployable on commodity 24 GB GPUs (e.g., RTX 3090-class devices).
• Software and Inference Optimization: All models were executed in identical hardware and software environments (PyTorch 2.1.2, Transformers 4.36.0, etc.). Considering the limited computational resources characteristic of closed networks, inference was performed using FP16 (Half-Precision) to achieve an efficient balance between inference speed and VRAM usage.
Table 6. Hardware and Software Experimental Environment.
| Category | Item | Specification |
|---|---|---|
| Hardware | CPU | Intel Xeon Gold 6226R @ 2.90GHz |
| Hardware | GPU | NVIDIA A100-PCIE-80GB (1ea) |
| Hardware | RAM/Storage | 256 GB DDR4/2TB NVMe SSD |
| Software | OS/Language | Ubuntu 22.04 LTS/Python 3.10 |
| Software | Framework | PyTorch 2.1.2 + CUDA 11.8 |
| Software | Libraries | Transformers 4.36.0, PEFT 0.7.0, Accelerate 0.25.0 |
| Software | Inference Precision | FP16 (Half-Precision) |
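As a rough plausibility check on the peak memory figure reported above, the weights-only footprint of an FP16 model can be estimated as the parameter count times two bytes. The sketch below assumes about 8.03 billion parameters for a Llama-3-8B-class Student model and assumes the reported figure is expressed in binary gigabytes; neither value is stated in this section, and KV-cache and activation overhead are ignored.

```python
def fp16_weight_gib(n_params: float, bytes_per_param: int = 2) -> float:
    """Weights-only memory footprint in GiB (no KV cache, no activations)."""
    return n_params * bytes_per_param / 2**30


ASSUMED_PARAMS = 8.03e9    # assumption: Llama-3-8B-class parameter count
REPORTED_PEAK_GB = 15.3    # peak inference memory reported in Section 4.1
COMMODITY_GPU_GB = 24.0    # e.g., RTX 3090-class device

weights = fp16_weight_gib(ASSUMED_PARAMS)
headroom = COMMODITY_GPU_GB - REPORTED_PEAK_GB
print(f"weights ~ {weights:.1f} GiB; headroom on a 24 GB card ~ {headroom:.1f} GB")
```

Under these assumptions the weights alone account for most of the reported peak, which is consistent with the claim that agent orchestration adds little memory overhead.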

4.2. Dataset Composition

To rigorously validate the intrusion incident analysis capabilities of Local LLMs, this study constructed an analysis dataset based on the Atomic Red Team project, which adopts the MITRE ATT&CK framework [24]—a global threat intelligence standard. Atomic Red Team defines the Tactics and Techniques (TTPs) of real attack groups (APTs) as standardized test cases, providing an environment optimized for evaluating behavior-based inference performance from unstructured security data. This study constructed a total of 3,500 scenarios. Each scenario includes system event logs (Sysmon, Windows Event Log) and network traffic records generated during execution in a virtual environment.

4.2.1. Scenario Reconstruction: Hierarchical Design Based on Analysis Depth

Considering that simple command listings can distort the performance of analysis models, this study redefined individual ‘Atomic Tests’ as complete behavioral scenarios where attack objectives, execution commands, and prerequisites are organically combined. To measure the model’s analysis depth multidimensionally, scenarios were restructured into two tiers as follows.
• Single-stage Scenario: Evaluates ‘micro-level identification capability’—the ability to accurately identify and technically interpret independent malicious actions such as registry manipulation, permission authorization, and account creation.
• Multi-stage Scenario: Mimics the MITRE ATT&CK [24] kill chain model by causally linking two or more units belonging to different tactical phases. This aims to validate the ‘macro-level semantic inference capability’—reconstructing the full attack context by identifying correlations among temporally and spatially dispersed logs.
Table 7. Comparison of Single-stage vs. Multi-stage (Log A-B-C) Evaluation Scenarios.
| Tier | Scenario Type | Fragmented Clues (Log Inputs) | Target Inference Result (Chain of Thought) |
|---|---|---|---|
| Single | Credential Access | Log A: procdump.exe -ma lsass.exe lsass.dmp | Identify: T1003.001 (LSASS Dumping). Analysis: Technical risk of memory dumping for password extraction. |
| Single | Persistence | Log A: net user /add attacker_account /password123 | Identify: T1136.001 (Local Account). Analysis: Creation of a backdoor account for persistent access. |
| Multi | Full Intrusion Chain | Log A: powershell.exe -enc ...; Log B: mstsc.exe /v:10.0.1.50; Log C: vssadmin.exe delete shadows | Inference: 1. Initial C2 establishment (A); 2. Lateral jump to critical asset (B); 3. Ransomware preparation by inhibiting recovery (C). Conclusion: Macro-level scenario reconstruction of an active Ransomware attack. |
| Multi | Information Stealing | Log A: reg.exe add HKCU\Software\...; Log B: net view /all; Log C: 7z.exe a data.zip C:\Confidential | Inference: 1. Establishing persistence (A); 2. Internal reconnaissance (B); 3. Staging for data exfiltration (C). Conclusion: Comprehensive mapping of an Information Stealing objective. |
These hierarchical scenarios served as foundational material for the Teacher LLM [25] to generate high-quality analytical reasoning processes during the offline knowledge distillation process proposed in Section 3. In the experimental phase, they function as ground truth to assess how accurately the model can derive actual attack techniques from fragmented logs.

4.2.2. Dataset Partitioning and Learning Contamination Prevention Strategy

To ensure model generalization performance and prevent data contamination caused by overlap between training and evaluation data, scenarios were split based on their IDs. The 3,500 scenarios were divided at a ratio of approximately 7:1.5:1.5 (2,500/500/500), with strict control over the purpose of each set as outlined in Table 8.
• Training Dataset (2,500 cases): Used as source data to transfer high-quality Chain-of-Thought (CoT) [14] reasoning processes and report-writing knowledge generated by an external Teacher model (GPT-4 [25]) into the Student model after undergoing a security refinement process.
• Validation Dataset (500 cases): Used for monitoring during training to optimize hyperparameters and determine early stopping.
• Test Dataset (500 cases): A dataset completely excluded from the knowledge distillation and training processes. This serves as the final objective measure to verify whether the Local LLM can perform accurate analysis based on internalized reasoning mechanisms, rather than mere data memorization, when confronted with threat scenarios it has not previously encountered.
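The ID-based partitioning described above can be sketched as follows. This is a minimal illustration rather than the authors' procedure: the function name `split_by_scenario_id`, the ID format, and the fixed random seed are assumptions. The point it shows is that the split operates on scenario IDs, so all logs belonging to one scenario land in exactly one set.

```python
import random


def split_by_scenario_id(scenario_ids, n_train=2500, n_val=500, seed=0):
    """Shuffle unique scenario IDs once, then cut them into three disjoint sets."""
    ids = sorted(set(scenario_ids))      # deduplicate; sort for reproducibility
    random.Random(seed).shuffle(ids)     # seeded shuffle (seed is an assumption)
    train = set(ids[:n_train])
    val = set(ids[n_train:n_train + n_val])
    test = set(ids[n_train + n_val:])
    return train, val, test


# Hypothetical ID scheme for the 3,500 scenarios.
all_ids = [f"scn-{i:04d}" for i in range(3500)]
train_ids, val_ids, test_ids = split_by_scenario_id(all_ids)

# No scenario may appear in more than one set.
leak_free = not (train_ids & val_ids or train_ids & test_ids or val_ids & test_ids)
print(len(train_ids), len(val_ids), len(test_ids), leak_free)
```

Individual log records would then be routed by looking up their scenario ID in these sets, which prevents fragments of a test scenario from appearing in the distillation data.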

4.3. Performance Comparison by Baseline Model

This study selected four state-of-the-art open-source Local LLMs with parameter scales ranging from 7B to 8B as baselines. This selection aims to control performance bias due to model parameter size and precisely analyze the pure effects of model architecture and training strategies on intrusion incident analysis performance. The detailed performance of the selected baseline models is shown in Table 9. This class is evaluated as possessing the optimal performance-to-resource efficiency achievable in a closed-network environment using a single GPU. The selected models were chosen to represent key capabilities for security analysis: general instruction execution, long-text context processing, code interpretation, and stepwise reasoning. To ensure experimental fairness, all models were run in the same environment.
Because closed-network environments cannot rely on external infrastructure, we also quantitatively reviewed the computational resource requirements of each model. This review shows that the proposed model operates stably on a single on-premise GPU server without cloud infrastructure dependency and is resource-efficient enough to perform computationally demanding semantic analysis.

4.4. Definition of Evaluation Metrics

This section defines four key evaluation metrics used to objectively and reproducibly validate the performance of the proposed Local LLM-based intrusion analysis framework. Since the analysis output is an ‘analysis report’ format combining natural language descriptions and structured mapping data, rather than simple classification values, the following metrics are calculated to measure the model’s performance comprehensively.
• Overall Accuracy (Intrusion Detection Accuracy): Represents the percentage of times the model’s final conclusion (Malicious/Benign) for each scenario matches the correct label. Following conservative security monitoring principles, all cases where judgment is reserved (Uncertain) are strictly considered undetected.
• ATT&CK Micro-F1 Score (Tactic Mapping Performance): Evaluates the match between attack techniques identified by the model and the MITRE ATT&CK answer key. The task is treated as multi-label classification, and the Micro-average method is adopted to address data imbalance.
• Script Analysis Micro-F1 Score (Core Malicious Behavior Identification Performance): This ‘key artifact extraction’ metric measures whether the model accurately captures key malicious behavior clues within obfuscated scripts or commands. Its purpose is to quantify the sophistication of analysis rather than the fluency of generated sentences.
• Hallucination Rate: Represents the proportion of false information within generated reports that cannot be substantiated by the input data. It determines the presence or absence of evidence for each sentence using the FactCC methodology, forming the basis for traceability that allows analysts to immediately verify the authenticity of judgments.
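The metric definitions above can be made concrete with a short sketch. The function names and the toy inputs are illustrative assumptions. In particular, the per-sentence support flags passed to `hallucination_rate` stand in for the output of a FactCC-style consistency check, which is not reproduced here.

```python
def overall_accuracy(predictions, labels):
    """'Uncertain' never equals a gold label, so it is counted as undetected."""
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


def micro_f1(predicted_sets, gold_sets):
    """Micro-averaged F1: pool TP/FP/FN over all scenarios before computing F1."""
    tp = fp = fn = 0
    for pred, gold in zip(predicted_sets, gold_sets):
        tp += len(pred & gold)
        fp += len(pred - gold)
        fn += len(gold - pred)
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def hallucination_rate(sentence_supported):
    """Share of report sentences with no supporting evidence in the input logs."""
    return sentence_supported.count(False) / len(sentence_supported)


# Toy inputs (hypothetical, for illustration only).
acc = overall_accuracy(
    ["Malicious", "Uncertain", "Benign", "Malicious"],
    ["Malicious", "Malicious", "Benign", "Benign"],
)
f1 = micro_f1(
    [{"T1059.001", "T1132"}, {"T1003.001"}],
    [{"T1059.001"}, {"T1003.001", "T1136.001"}],
)
rate = hallucination_rate([True, True, False, True])
print(acc, round(f1, 3), rate)
```

Pooling counts before computing F1 is what makes the score micro-averaged: frequent techniques contribute in proportion to how often they occur, rather than every technique counting equally as in macro-averaging.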

4.5. Experimental Results Analysis and Resource Efficiency Evaluation

4.5.1. Comprehensive Performance Comparison Analysis

The proposed framework and four baseline models were evaluated on the same intrusion incident test dataset (500 scenarios). The proposed model outperformed the baselines in intrusion detection accuracy, ATT&CK mapping, and hallucination rate, and performed comparably to the strongest baseline in core malicious behavior identification. Detailed results are presented in Table 10.
• Overall Accuracy (Intrusion Detection Accuracy): General-purpose models like Llama-3-8B (71.5%) and Mistral-7B (69.8%) performed well in identifying individual events but showed limitations in discerning causal relationships among temporally and spatially dispersed logs. In contrast, the proposed model achieves 88.4% accuracy, demonstrating that the transferred CoT knowledge, via offline knowledge distillation, structurally enhances the inference logic of the compact model.
• ATT&CK Micro-F1 Score: Other models remained at F1-Scores between 0.65 and 0.78 due to a lack of domain-specific knowledge required when mapping unstructured logs to the standard ATT&CK framework. The proposed model achieved 0.91, demonstrating that the ‘security expert-level reasoning approach’ inherited from the Teacher model operates effectively.
• Script Analysis Micro-F1 Score: While Qwen2.5-7B (85.4%), specialized for code analysis, showed strength in script analysis, the proposed model (84.8%) achieved comparable performance. This suggests the potential to attain specialized intelligence comparable to task-specific models by applying knowledge distillation to general-purpose models.
• Hallucination Rate: All baseline models, including DeepSeek-R1 (15.6%), exhibited prominent hallucination phenomena, such as fabricating logs that did not exist in the analysis results or misclassifying attack techniques. The proposed framework suppressed this to 6.2%, indicating that the fact-checking mechanism performed during the AI agent’s ‘report/verify’ phase successfully compensates for the structural limitations of small models, rather than relying solely on model weight performance.

4.5.2. Resource Efficiency and Closed-Network Practicality Analysis

Considering the resource constraints of physically isolated network environments, this study precisely measured the GPU utilization, inference throughput, and power consumption of the proposed model. Experimental results demonstrated that the proposed framework achieves optimized resource efficiency while maintaining high-performance analysis capabilities, proving its strong practical applicability.
• Operational Suitability and Infrastructure Economics: Even under load conditions where AI agents and Student models operate simultaneously, it was confirmed that only up to 15.3GB of VRAM is occupied. This indicates extremely low memory overhead due to agent orchestration, enabling stable operation not only on enterprise-grade GPUs but also in standalone GPU environments.
• Performance-per-Resource: The proposed model (VRAM 15.3 GB, power 150 W) consumes computational resources comparable to state-of-the-art inference-specialized models such as DeepSeek-R1 (15.4 GB, 155 W; see Table 11), yet it significantly outperforms them in detection accuracy and reliability metrics. This demonstrates that analytical capabilities are maximized within limited computational resources through ‘structural optimization’ (specifically, knowledge distillation and agent architecture) rather than simply increasing model parameters.
• Power Consumption and Sustainable Operation: Average power consumption during system operation was measured at approximately 150W. This represents a significant reduction in operational costs and cooling demands compared to enterprise-grade GPU-based systems consuming thousands of watts (W). Consequently, the proposed framework has proven to be the most practical alternative for delivering high-performance analytics solutions while adhering to the strict resource guidelines of closed networks.
Table 11. Comparison of Computing Resources and Inference Efficiency.
| Model | VRAM Usage (GB) | Throughput (tok/sec) | Avg. Power (W) | Inference Time (s/case) |
|---|---|---|---|---|
| Llama-3-8B-Instruct | 15.2 | 52.4 | 148 | 12.5 |
| Mistral-7B-Instruct-v0.3 | 14.5 | 55.1 | 142 | 11.8 |
| Qwen2.5-7B-Instruct | 14.8 | 54.8 | 145 | 11.9 |
| DeepSeek-R1 | 15.4 | 48.2 | 155 | 13.6 |
| Distilled Llama-3-8B (Ours) | 15.3 | 51.8 | 150 | 12.7 |
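To relate the power and latency columns of Table 11, the energy spent per analyzed case can be approximated as average power multiplied by inference time. The sketch below only rearranges the table's own values; it assumes the reported average power applies for the whole per-case inference time, which the table does not state.

```python
# (average power in W, inference time in s/case), copied from Table 11.
TABLE_11 = {
    "Llama-3-8B-Instruct": (148, 12.5),
    "Mistral-7B-Instruct-v0.3": (142, 11.8),
    "Qwen2.5-7B-Instruct": (145, 11.9),
    "DeepSeek-R1": (155, 13.6),
    "Distilled Llama-3-8B (Ours)": (150, 12.7),
}


def energy_per_case_joules(avg_power_w: float, seconds_per_case: float) -> float:
    """Energy = power x time; 1 W sustained for 1 s is 1 J."""
    return avg_power_w * seconds_per_case


energy = {
    model: energy_per_case_joules(power, seconds)
    for model, (power, seconds) in TABLE_11.items()
}

for model, joules in energy.items():
    # 3600 J = 1 Wh
    print(f"{model}: {joules:.0f} J/case ({joules / 3600:.2f} Wh)")
```

On these figures the distilled model's per-case energy falls within the range spanned by the baselines, in line with the observation that the accuracy gains come from distillation and orchestration rather than from additional compute.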

4.5.3. Ablation Study on Framework Components

To evaluate the individual contribution of each module within the proposed framework, we conducted a series of ablation studies by systematically removing core components. The performance was measured across four configurations: (1) Full Framework, (2) No-Agent, where the distilled Student model performs direct inference, (3) No-Verification, where the automated cross-check mechanism is disabled, and (4) No-Filtering (or Basic-Scoring), where raw logs are processed without our hybrid candidate selection.
Table 12. Quantitative Impact of Framework Components (Ablation Results).
| Configuration | Detection Accuracy (%) | ATT&CK Mapping (F1) | Hallucination Rate (%) |
|---|---|---|---|
| Full Framework (Proposed) | 88.4 | 0.91 | 6.2 |
| w/o AI Agent (Direct) | 74.2 | 0.72 | 18.5 |
| w/o Verification Layer | 86.1 | 0.88 | 14.1 |
| w/o Classification (No-Filtering) | 81.5 | 0.83 | 9.8 |
The results of the ablation study provide the following insights into the framework’s efficacy.
• Synergy of Agent-Based Orchestration: Removing the orchestration layer (No-Agent mode) led to a significant drop in detection accuracy from 88.4% to 74.2%. This degradation confirms that the structured decomposition of analysis tasks is essential for maintaining reasoning precision in complex cyber-attack scenarios.
• Criticality of the Verification Layer: Disabling the automated verification mechanism (No-Verification) more than doubled the hallucination rate, from 6.2% to 14.1%, while detection accuracy fell only slightly (88.4% to 86.1%). Only the No-Agent configuration produced a higher hallucination rate (18.5%). This indicates that the Traceability Matrix, which cross-references model outputs with original source logs, is essential for forensic reliability.
• Safety-First Hybrid Filtering: Removing the candidate selection process (No-Filtering) reduced the ATT&CK mapping F1-score to 0.83 due to information overload. We additionally evaluated the Indicator Recall Rate of the proposed Hybrid Filtering (Scoring + Anomaly-based Sampling).
• Robustness against Stealthy Attacks: Our hybrid approach achieved an Indicator Recall of 99.4%, successfully preserving low-score but high-risk ‘atypical’ events. This ensures that stealthy, multi-stage LotL attack traces are not pruned, maintaining a near-zero False Negative Rate (FNR) at the preprocessing stage. This high recall ensures the Student Model has access to all critical evidence required for deep inference without exceeding context window constraints.
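The interaction between the two filtering stages and the Indicator Recall measurement can be sketched as below. This is an illustrative simplification, not the authors' implementation: the indicator watch-list, the scoring weights, the thresholds, the event field names, and the use of parent-child pair frequency as the anomaly signal are all assumptions.

```python
import math
from collections import Counter

INDICATORS = ("-enc", "lsass", "vssadmin")  # hypothetical watch-list


def shannon_entropy(text: str) -> float:
    """Character-level Shannon entropy in bits."""
    if not text:
        return 0.0
    total = len(text)
    return -sum(n / total * math.log2(n / total) for n in Counter(text).values())


def priority_score(event: dict) -> float:
    """Stage 1: indicator hits plus entropy scaled by 6 bits (Base64 alphabet maximum)."""
    cmd = event["cmdline"].lower()
    hits = sum(indicator in cmd for indicator in INDICATORS)
    return hits + shannon_entropy(cmd) / 6.0


def hybrid_filter(events, baseline_pairs: Counter, score_min=1.0, rare_max=1):
    """Keep an event if it scores high (Stage 1) OR is rare in the baseline (Stage 2)."""
    kept = []
    for event in events:
        is_candidate = priority_score(event) >= score_min
        is_atypical = baseline_pairs[(event["parent"], event["image"])] <= rare_max
        if is_candidate or is_atypical:
            kept.append(event)
    return kept


def indicator_recall(kept, events) -> float:
    """Fraction of ground-truth malicious events that survive filtering."""
    relevant = [e for e in events if e["malicious"]]
    return sum(e in kept for e in relevant) / len(relevant)


# Hypothetical baseline of how often each parent-child pair was seen normally.
baseline = Counter({("explorer.exe", "chrome.exe"): 500, ("cmd.exe", "powershell.exe"): 40})
events = [
    {"parent": "explorer.exe", "image": "chrome.exe",
     "cmdline": "chrome.exe --type=renderer", "malicious": False},
    {"parent": "w3wp.exe", "image": "cmd.exe",
     "cmdline": "cmd.exe /c whoami", "malicious": True},
    {"parent": "cmd.exe", "image": "powershell.exe",
     "cmdline": "powershell.exe -enc SQBFAFgA", "malicious": True},
]
kept = hybrid_filter(events, baseline)
print([e["image"] for e in kept], indicator_recall(kept, events))
```

In this toy input the `w3wp.exe` to `cmd.exe` event scores low in Stage 1 but is retained by Stage 2 because the pair never occurs in the baseline, which is the safety-net behavior the text attributes to anomaly-based sampling.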

5. In-Depth Verification via Hierarchical Threat Scenarios

This section presents a rigorous evaluation of the proposed framework’s analytical performance through a hierarchical verification approach. The validation focuses on the technical precision of incident reconstruction and the operational reliability of the generated intelligence to ensure its suitability for real-world cybersecurity applications.

5.1. Verification of Technical Identification via Single-Step Scenarios

This section evaluates the framework’s micro-level identification and technical interpretation capabilities using telemetry from independent malicious activities. The primary goal is to assess whether the Student Model can accurately map granular system events to the MITRE ATT&CK framework without external connectivity.
• Verification Method: We utilized Sysmon logs generated from the Atomic Red Team framework, specifically targeting high-priority techniques such as ‘LSASS Memory Dump (T1003)’ and ‘Abnormal Service Registration (T1543)’. To test the model’s robustness, the dataset included both standard execution logs and obfuscated variations (e.g., command-line encoding, renamed binaries).
• Analysis Procedure: The framework extracts critical command-line artifacts, such as procdump.exe parameters and sc create configurations, from raw JSON-formatted logs. The agent then performs a contextual impact analysis by correlating the execution arguments with known adversarial behaviors defined in the internalized knowledge base.
• Verification Results: The model successfully deciphered attack intents within obfuscated structures, maintaining a 100% mapping accuracy (zero errors) for the tested atomic cases. As shown in Table 13, the framework correctly identified technical sub-techniques even when telemetry was fragmented. This high precision confirms that the localized Student Model can effectively alleviate the cognitive load on human analysts by providing pre-validated security context.
Table 13. Quantitative Evaluation of Technical Identification for Single-Step Attack Scenarios.
| Case ID | Targeted Technique (MITRE ATT&CK) | Input Source (Raw Telemetry) | Extracted Critical Artifacts | Inference Result (Mapping) | Accuracy |
|---|---|---|---|---|---|
| S-01 | T1003.001 (LSASS Memory Dump) | Sysmon Event ID 10 (ProcessAccess) | Target: lsass.exe, Source: procdump.exe, AccessMask: 0x1ff1ff | OS Credential Dumping | 100% |
| S-02 | T1543.003 (Windows Service) | Sysmon Event ID 1 (ProcessCreate) | Image: sc.exe, Command: create, binPath=, start=auto | Create or Modify System Process | 100% |
| S-03 | T1059.001 (PowerShell) | Sysmon Event ID 1 (ProcessCreate) | Image: powershell.exe, Arguments: -enc, -nop, -w hidden | Command and Scripting Interpreter | 100% |
| S-04 | T1070.004 (File Deletion) | Sysmon Event ID 26 (FileDelete) | Target: C:\Windows\System32\winevt\Logs\*, Command: wevtutil cl | Indicator Removal on Host | 100% |
| S-05 | T1547.001 (Registry Run Keys) | Sysmon Event ID 13 (RegistryValueSet) | Target: ...\CurrentVersion\Run, Details: malicious.exe | Boot or Logon Autostart Execution | 100% |

5.2. Verification of Multi-Stage Scenario-Based Contextual Inference

This section evaluates the macro-level inference capability of the framework to synthesize temporally and spatially dispersed events into a single, coherent causal chain. The focus is on validating the ‘Contextual Reasoning’ performance of the Student Model in reconstructing the complete lifecycle of a sophisticated intrusion.
• Verification Method: We simulated a multi-stage attack scenario comprising ‘Privilege Escalation → Discovery → Exfiltration’. To reflect real-world operational complexity, the attack logs were intentionally interspersed with high volumes of benign background traffic and system noise, with varying time intervals between each stage.
• Analysis Procedure: The AI agent executes a correlation logic based on shared environmental identifiers, such as user security identifiers (SIDs), process tree lineages, and target IP addresses. The internalized Student Model then performs a spatio-temporal analysis to reconstruct the logical progression: “Following initial compromise, the attacker escalated privileges to access specific assets and subsequently exfiltrated sensitive data”.
• Verification Results: As detailed in Table 14, the framework successfully transformed a fragmented list of logs into a structured, narrative-driven incident report. A key finding was the efficacy of the ‘Scoring-based Filtering’ mechanism, which prioritized threat-centric events. This pre-processing layer effectively reduced the total log volume processed by the LLM by approximately 85%, ensuring that the reasoning engine focused exclusively on high-priority indicators without exceeding the context window constraints.
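The identifier-based correlation step described in the Analysis Procedure can be illustrated with a small union-find sketch: events that share a user SID, a process lineage link, or a target IP are merged into one candidate chain, which is then ordered by time. The field names (`sid`, `proc`, `parent`, `dst_ip`, `ts`) and the sample events are hypothetical, and the approach is an assumed stand-in for the agent's actual correlation logic.

```python
from collections import defaultdict


def identifiers(event: dict):
    """Yield (namespace, value) keys; parent and child processes share one namespace."""
    if event.get("sid"):
        yield ("sid", event["sid"])
    for field in ("proc", "parent"):
        if event.get(field):
            yield ("proc", event[field])
    if event.get("dst_ip"):
        yield ("ip", event["dst_ip"])


def correlate(events):
    """Group events that transitively share any identifier; sort each group by time."""
    root = list(range(len(events)))

    def find(i):
        while root[i] != i:
            root[i] = root[root[i]]   # path halving
            i = root[i]
        return i

    first_seen = {}
    for idx, event in enumerate(events):
        for key in identifiers(event):
            if key in first_seen:
                root[find(idx)] = find(first_seen[key])
            else:
                first_seen[key] = idx

    chains = defaultdict(list)
    for idx, event in enumerate(events):
        chains[find(idx)].append(event)
    return [sorted(chain, key=lambda e: e["ts"]) for chain in chains.values()]


# Hypothetical telemetry: three attack stages plus one unrelated benign event.
events = [
    {"ts": 10, "sid": "S-1-5-21-1001", "proc": "p1", "parent": None, "dst_ip": None, "what": "privilege escalation"},
    {"ts": 900, "sid": "S-1-5-18", "proc": "p2", "parent": "p1", "dst_ip": None, "what": "discovery"},
    {"ts": 4000, "sid": "S-1-5-18", "proc": "p3", "parent": None, "dst_ip": "203.0.113.7", "what": "exfiltration"},
    {"ts": 500, "sid": "S-1-5-21-2002", "proc": "p9", "parent": None, "dst_ip": None, "what": "benign update"},
]
chains = correlate(events)
attack_chain = max(chains, key=len)
print([e["what"] for e in attack_chain])
```

Here the first two events are linked through process lineage even though the SID changes after escalation, and the third joins through the shared SID, so stages separated by long time gaps still end up in a single chain for the Student Model to interpret.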

5.3. Hallucination Suppression and Factual Accuracy Verification

This section evaluates the structural mechanisms designed to mitigate hallucinations—a critical vulnerability in Large Language Models (LLMs)—and confirms the factual integrity of the generated security reports.
• Verification Method: We conducted a formal audit of the reports generated by the Student Model using the ‘Reporting & Verification’ protocol. A dataset of 100 generated sentences was randomly sampled and cross-referenced with the original Sysmon and firewall telemetry to identify any instances of factual fabrication or technical misclassification.
• Analysis Procedure: The AI agent executes a verification loop where each claim in the analysis draft is mapped back to a specific log entry. If a claim lacks an explicit evidentiary anchor (e.g., a non-existent IP or process), it is flagged as a potential hallucination. The agent then triggers a ‘Self-Correction’ process to either find the correct evidence or remove the unverified statement.
• Verification Results: As shown in Table 15, the framework achieved a remarkably low hallucination rate of 6.2%, significantly lower than typical general-purpose LLMs. Crucially, 93.8% of the final report’s statements contained explicit citations of the original log lines, ensuring full traceability for human analysts. This evidence-based reasoning ensures the high reliability required for high-stakes security operations in air-gapped environments.
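The verification loop described above, in which each claim must map back to a concrete log entry, can be sketched as follows. The claim structure, the `anchor` tuple of (EventID, Timestamp, ProcessID), and the sample records are hypothetical; the sketch covers only the flagging step and omits the ‘Self-Correction’ retry.

```python
def anchor_of(log: dict) -> tuple:
    """Evidentiary anchor of a source log record."""
    return (log["EventID"], log["Timestamp"], log["ProcessID"])


def verify_report(claims, source_logs):
    """Split claims into a traceability matrix and a list of unanchored statements."""
    known_anchors = {anchor_of(log) for log in source_logs}
    matrix, flagged = [], []
    for claim in claims:
        anchor = claim.get("anchor")       # None when the draft cites no evidence
        if anchor in known_anchors:
            matrix.append((claim["text"], anchor))
        else:
            flagged.append(claim["text"])  # candidate hallucination
    return matrix, flagged


# Hypothetical source telemetry and draft report claims.
source_logs = [
    {"EventID": 1, "Timestamp": "2026-01-01T10:00:00Z", "ProcessID": 4120},
    {"EventID": 3, "Timestamp": "2026-01-01T10:00:05Z", "ProcessID": 4120},
]
claims = [
    {"text": "powershell.exe was launched with an encoded command.",
     "anchor": (1, "2026-01-01T10:00:00Z", 4120)},
    {"text": "The process contacted an external host.",
     "anchor": (3, "2026-01-01T10:00:05Z", 4120)},
    {"text": "Credentials were dumped from LSASS.", "anchor": None},
    {"text": "A scheduled task was created.",
     "anchor": (1, "2026-01-01T11:30:00Z", 999)},
]
matrix, flagged = verify_report(claims, source_logs)
print(len(matrix), flagged)
```

A claim is flagged both when it cites no anchor and when it cites one that does not exist in the source logs, which corresponds to the non-existent IP or process case mentioned in the Analysis Procedure.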

6. Conclusions

6.1. Research Summary and Implications

This study proposes a Local LLM-based intrusion analysis framework that can be effectively operated in an air-gapped environment where external network access is completely blocked. The research holds significant academic and practical implications, differing from existing studies in two major aspects.
• Offline Knowledge Distillation: We presented a methodology to safely transfer deep analytical knowledge from an external high-performance model (Teacher) to a Local LLM (Student) within the closed network while adhering to security regulations [24,25].
• AI Agent Orchestration with Dual-Verification: By controlling complex analysis procedures step by step, the framework overcomes the performance limitations of Local LLMs. Notably, it integrates a Traceability Matrix and Human-in-the-Loop (HITL) failsafes based on self-confidence scoring to structurally ensure the logical consistency and absolute reliability of analysis results in high-stakes infrastructure.

6.2. Significance of Research Findings

The framework proposed in this study demonstrated analytical capabilities approaching those of high-performance cloud LLMs, even under the specific constraints of a closed network environment. Experimental results quantitatively proved the practical applicability of the proposed model, achieving significantly higher detection accuracy (88.4%) and MITRE ATT&CK mapping performance (Micro-F1 0.91) compared to existing general-purpose Local LLMs.
• Maximized Analysis Efficiency: Through the AI agent’s preprocessing and candidate selection mechanisms, it meticulously extracts only meaningful threat indicators from vast raw logs. This dramatically refines the total volume of information analysts must review, structurally resolving the information overload problem arising from large-scale security events.
• Verification and Blocking of Hallucinations: We moved beyond mere hallucination suppression to a ‘Verify and Block’ strategy. By reducing the hallucination rate to 6.2% and providing Traceability Matrix links connecting all derived analysis statements to their original data sources, we guarantee the Integrity required for critical infrastructure. This enables analysts to immediately verify the basis for AI judgments through explicit evidentiary anchors.
• Low-cost, High-efficiency Solution for Practical Intelligence: We demonstrate that seamless operation is achievable on a single GPU server without requiring expensive multi-GPU infrastructure. This establishes an operationally feasible prototype deployable in air-gapped environments without reliance on cloud resources.

6.3. Limitations and Future Research Directions

This research was conducted primarily using the public Atomic Red Team dataset, which inherently has limitations in fully reflecting the noisy, large-scale log environments typical of real enterprise settings. Future research directions to address these limitations are as follows.
• Enhanced Filtering for Stealthy Attacks: To address the risk of filtering out atypical attack traces, we plan to refine the scoring-based filtering into a probabilistic anomaly-detection model that preserves low-score but high-risk events.
• Field Stability Validation: We aim to validate the framework’s real-world response capabilities and system stability through long-term pilot operations within actual security operations centers.
• User Feedback-Based Learning Enhancement: We plan to strengthen an active learning system where analyst feedback directly improves the model’s reasoning and reduces the frequency of human intervention.
• Self-evolving System: The ultimate goal is to build an intelligent intrusion incident analysis system that actively adapts to the latest threat trends and self-evolves, even within closed network environments.

Author Contributions

S.H. (SungHun Jang) contributed to conceptualization, methodology, analysis, and writing. M.R. (MyoungRak Lee) contributed to analysis and validation. T.S. performed a review of the paper and provided important revisions.
Data availability: No datasets were generated or analysed during the current study.

Acknowledgments

This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (RS-2025-16068069).
Declarations: Conflict of interest: The authors declare no competing interests.

References

  1. CISA; NSA; FBI. Identifying and mitigating living off the land techniques. Cybersecurity Advisory, 2023. [Google Scholar]
  2. Microsoft Threat Intelligence. Iran surge in cyber-enabled influence operations. Microsoft Digit. Def. Rep. 2024. [Google Scholar]
  3. Liu, Y.; Han, T.; Ma, S.; Zhang, J.; Yang, Y.; Tian, J.; et al. Summary of ChatGPT-related research and perspective towards the future of large language models. Meta-Radiology 2023, 1(1). [Google Scholar] [CrossRef]
  4. Stouffer, K.; Pease, N.; Tang, C.; Zimmerman, R.; Lightman, S. NIST Special Publication 800-82r3; Guide to operational technology (OT) security. 2023.
  5. Al-Abassi, A.; Karimibiermann, H. A deep learning-based framework for detecting intrusion in industrial control systems. Alex. Eng. J. 2020. [Google Scholar]
  6. Shaukat, K.; Luo, S.; Varadharajan, V. A novel deep learning-based intrusion detection system for mobile devices The cognitive load of the security analyst. In Alexandria Engineering Journal;Comput Secur.; D’Amico, S., Whitley, K., Tesone, D., Morrissey, B., Roth, R., Eds.; 2021; Volume 125, p. 103023. [Google Scholar]
  7. Ferrag, M.; Tihanyi, N.; Hamadouche, L.; Rezvy, S.; Debbah, M. Secure-LLM: a system for cybersecurity utilizing LLMs. arXiv 2023, arXiv:2305.15175. [Google Scholar]
  8. Microsoft Security. Microsoft Security Copilot: AI-powered incident response. 2023. [Google Scholar]
  9. Liu, H.; Ning, R.; Teng, Z.; Liu, J.; Zhou, Q.; Yue, S. Evaluating the logical reasoning ability of ChatGPT and GPT-4. arXiv 2023, arXiv:2304.03439. [Google Scholar] [CrossRef]
  10. Ji, Z.; Lee, N.; Frieske, R.; Yu, T.; Su, J.; Xu, Y.; et al. Survey of hallucination in natural language generation. ACM Comput Surv. 2023, 55(12), 1–38. [Google Scholar] [CrossRef]
  11. Touvron, H.; Lavril, T.; Izacard, G.; Martinet, X.; Lachaux, M.A.; Lacroix, T.; et al. Llama: open and efficient foundation language models. arXiv 2023, arXiv:2302.13971. [Google Scholar] [CrossRef]
  12. Kryscinski, W.; McCann, B.; Braverman, Y.; Socher, R. Evaluating the factual consistency of abstractive text summarization (FactCC). Proceedings of EMNLP, 2020. [Google Scholar]
  13. Wei, J.; Wang, X.; Schuurmans, D.; Maeda, M.; Edakov, D.; Ku, H.; et al. Chain-of-thought prompting elicits reasoning in large language models. Adv. Neural Inf. Process Syst. 2022, 35, 24824–37. [Google Scholar]
  14. Yao, S.; Zhao, J.; Yu, D.; Du, N.; Shafran, I.; Narasimhan, K.; et al. ReAct: synergizing reasoning and acting in language models. In Proceedings of the International Conference on Learning Representations (ICLR), 2023. [Google Scholar]
  15. Dubey, A.; Jauhri, G.; Pandey, A.; Kadian, A.; Al-Dahle, W.; Letman, J.; et al. The Llama 3 herd of models. arXiv 2024, arXiv:2407.21783. [Google Scholar] [CrossRef]
  16. Jiang, A.Q.; Sablayrolles, A.; Mensch, A.; Bamford, C.; Chaplot, D.S.; Casas de las, D.G.; et al. Mistral 7B. arXiv 2023, arXiv:2310.06825. [Google Scholar] [CrossRef]
  17. Qwen Team. Qwen2.5 technical report. arXiv 2024, arXiv:2412.15115.
  18. DeepSeek-AI. DeepSeek-V2: a strong, economical, and efficient mixture-of-experts language model. arXiv 2024, arXiv:2405.04434.
  19. Hinton, G.; Vinyals, O.; Dean, J. Distilling the knowledge in a neural network. arXiv 2015, arXiv:1503.02531.
  20. Xi, Z.; Chen, W.; Guo, X.; He, W.; Ding, Y.; Hong, B.; et al. The rise and potential of large language model based agents: a survey. arXiv 2023, arXiv:2309.07864.
  21. Lin, C.Y. ROUGE: a package for automatic evaluation of summaries. Proceedings of Text Summarization Branches Out (ACL), 2004.
  22. Zhang, T.; Kishore, V.; Wu, F.; Weinberger, K.Q.; Artzi, Y. BERTScore: evaluating text generation with BERT. In Proceedings of the International Conference on Learning Representations (ICLR), 2020.
  23. Strom, B.E.; Applebaum, A.; Miller, D.P.; Nickels, K.C.; Pennington, A.G.; Thomas, C.B. MITRE ATT&CK: Design and philosophy; Technical Report; The MITRE Corporation, 2018.
  24. OpenAI. GPT-4 technical report. arXiv 2023, arXiv:2303.08774.
Figure 1. Architecture of the Local LLM-based Incident Analysis Framework.
Figure 2. Specification of Expert Reasoning Internalization Prompt for Teacher Model Distillation.
Figure 3. Core Implementation Detail of the Standalone Inference Engine.
Table 1. Structural Comparison between Large LLMs and Local LLMs (7B–8B).

| Category | Large-scale LLMs | Local LLMs (7B–8B) |
|---|---|---|
| Parameters | 70B–400B+ | 7B–8B |
| GPU Requirements | 10–100 GPUs | 1–2 GPUs |
| Memory Requirements | Hundreds of GB | 40–80 GB |
| Deployment Environment | Cloud | On-premise server |
| External Data Transfer | Required | Not required |
| Air-gapped Operation | Impossible | Supported |
| Operating Cost | Very high | Relatively low |
Table 3. Data Sanitization and Masking Protocols.

| Category | Item | Sanitization & Masking Policy | Rationale |
|---|---|---|---|
| Network | IP Address | Replaced with INT_IP_[N] or EXT_IP_[N] | Preserve traffic directionality while masking topology. |
| Identity | Username | Generalized to SEC_USER, ADMIN_USER, etc. | Protect PII while maintaining privilege context. |
| System | Hostname | Masked as ENDPOINT_[N] | Maintain multi-stage lateral movement context. |
| Path | File Directory | Generalized to C:\Users\MASKED_USER\... | Protect user privacy during forensic analysis. |
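The masking policies in Table 3 can be illustrated with a minimal sketch. It assumes that `[N]` is a per-address running index, that internal versus external is decided by RFC 1918-style private ranges, and that only IPv4 addresses and `C:\Users\<name>` paths are handled. The class name `LogSanitizer` and both regular expressions are hypothetical and are not taken from the paper.

```python
import ipaddress
import re

# Assumed patterns: dotted-quad IPv4 and Windows user profile paths.
IP_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")
USER_PATH_RE = re.compile(r"C:\\Users\\[^\\\s]+", re.IGNORECASE)


class LogSanitizer:
    """Replaces IPs and user paths with stable placeholders (illustrative only)."""

    def __init__(self):
        self._ip_map = {}                      # raw IP -> placeholder
        self._counters = {"INT": 0, "EXT": 0}  # running index per scope

    def _mask_ip(self, match):
        raw = match.group(0)
        try:
            addr = ipaddress.ip_address(raw)
        except ValueError:
            return raw  # not a valid IPv4 address; leave the token untouched
        if raw not in self._ip_map:
            # Assumption: private ranges count as internal, everything else external.
            scope = "INT" if addr.is_private else "EXT"
            self._counters[scope] += 1
            self._ip_map[raw] = f"{scope}_IP_{self._counters[scope]}"
        return self._ip_map[raw]

    def sanitize(self, line):
        line = IP_RE.sub(self._mask_ip, line)
        # A function replacement avoids backslash-escape handling in re.sub.
        return USER_PATH_RE.sub(lambda m: r"C:\Users\MASKED_USER", line)


sanitizer = LogSanitizer()
print(sanitizer.sanitize(r"conn 10.0.1.5 -> 8.8.8.8 file C:\Users\alice\doc.txt"))
```

Because the mapping is kept per sanitizer instance, the same address always receives the same placeholder, which is what preserves traffic directionality and cross-event correlation after masking.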
Table 4. Technical Specifications of the Internal Knowledge Internalization Pipeline.

| Stage | Process Name | Technical Specification & Rationale |
|---|---|---|
| Stage 1: Data Acquisition | Reasoning Triplets | Integration of [Log+CoT+Label] with hash-based integrity checks. Purpose: Establishing a ground-truth dataset for causal reasoning. |
| Stage 2: Model Adaptation | LoRA Configuration | Parameter-efficient updates (r=16, α=32) on attention layers. Purpose: Mitigating VRAM constraints while preventing knowledge drift. |
| Stage 3: Optimization | Weight Internalization | Cross-entropy loss minimization of the reasoning path. Purpose: Enabling standalone expert-level inference in air-gapped zones. |
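The Stage 1 integrity check on [Log+CoT+Label] triplets could look like the following minimal sketch. The paper specifies only "hash-based integrity checks"; the choice of SHA-256, the canonical JSON serialization, and the function names `seal_triplet` and `verify_triplet` are assumptions made for illustration.

```python
import hashlib
import json


def triplet_digest(log, cot, label):
    """SHA-256 over a canonical JSON encoding of the triplet (assumed scheme)."""
    payload = json.dumps(
        {"log": log, "cot": cot, "label": label},
        sort_keys=True,        # fixed key order so the digest is reproducible
        ensure_ascii=False,
    ).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def seal_triplet(log, cot, label):
    """Attach the digest when the triplet leaves the teacher-side review."""
    return {"log": log, "cot": cot, "label": label,
            "sha256": triplet_digest(log, cot, label)}


def verify_triplet(record):
    """Recompute the digest inside the closed network before training."""
    expected = triplet_digest(record["log"], record["cot"], record["label"])
    return record["sha256"] == expected


# Hypothetical example record; the field contents are illustrative.
record = seal_triplet("EventID=1 Image=net.exe", "net.exe enumerates shares", "T1135")
print(verify_triplet(record))
```

Any modification of the log, the reasoning text, or the label after sealing changes the digest, so a tampered or corrupted record fails verification before it reaches the training set.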
Table 8. Dataset Composition and Split Statistics.

| Split | Scenarios | Ratio | Usage Purpose |
|---|---|---|---|
| Total | 3,500 | 100% | - |
| Training | 2,500 | 70% | Generating CoT knowledge via Teacher model & Student model training |
| Validation | 500 | 15% | Hyperparameter tuning & early stopping decision |
| Test | 500 | 15% | Final performance evaluation (completely held out from training) |
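As a quick arithmetic check on Table 8, the share of each split can be recomputed from the scenario counts. The sketch below uses only the counts reported in the table; the variable names are illustrative.

```python
# Scenario counts as reported in Table 8.
splits = {"Training": 2500, "Validation": 500, "Test": 500}

total = sum(splits.values())
shares = {name: round(100 * n / total, 1) for name, n in splits.items()}

print(total)   # total number of scenarios
print(shares)  # percentage share of each split, one decimal place
```

The counts yield about 71.4% / 14.3% / 14.3%, so the 70% / 15% / 15% figures in the Ratio column appear to be nominal targets rather than exact proportions of 3,500.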
Table 9. Specifications and Characteristics of Baseline Models.

| Model | Parameters | Context Window | Key Characteristics |
|---|---|---|---|
| Llama-3-8B-Instruct | 8B | 8K/128K | General-purpose instruction following & balanced performance |
| Mistral-7B-Instruct-v0.3 | 7B | 32K | Efficient long-context processing & lightweight architecture |
| Qwen2.5-7B-Instruct | 7B | 32K/128K | Specialized in code analysis & mathematical reasoning |
| DeepSeek-R1 | 8B | 8K~128K | Advanced reasoning capabilities based on Reinforcement Learning |
| Distilled Llama-3-8B (Ours) | 8B | 8K~128K | Security domain-specific knowledge distillation & agent optimization |
Table 10. Performance Comparison with Local LLMs.

| Model | Detection Accuracy | ATT&CK Mapping Score (F1) | Script Analysis Score | Hallucination Rate |
|---|---|---|---|---|
| Llama-3-8B-Instruct | 71.5% | 0.68 | 72.0% | 28.4% |
| Mistral-7B-Instruct-v0.3 | 69.8% | 0.65 | 68.5% | 31.2% |
| Qwen2.5-7B-Instruct | 76.2% | 0.74 | 85.4% | 19.8% |
| DeepSeek-R1 | 79.5% | 0.78 | 83.1% | 15.6% |
| Distilled Llama-3-8B (Ours) | 88.4% | 0.91 | 84.8% | 6.2% |
Table 14. Reconstruction Performance for Multi-Stage Attack Scenarios.

| Attack Stage | Log Source (Event ID) | Key Correlators (Pivot) | Logical Inference (Contextual Link) | Result |
|---|---|---|---|---|
| Phase 1: PrivEsc | Sysmon 1 / 10 | User: Admin-Svc / PID: 4820 | Identification of token manipulation for privilege gain | Linked |
| Phase 2: Discovery | Sysmon 1 (net.exe) | User: Admin-Svc / Parent: 4820 | Enumeration of network shares following successful escalation | Linked |
| Phase 3: Exfiltration | Firewall / Sysmon 3 | Src IP: 10.0.1.5 / Dest Port: 443 | Correlation between discovered assets and outbound data flow | Linked |
| Noise Filtering | Benign Sysmon Logs | No correlation found | Automatic exclusion of 1,240+ background event lines | Filtered |
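The pivot-based linking summarized in Table 14 can be sketched as a simple process-lineage walk. This is an illustration under stated assumptions, not the framework's actual correlation logic: events are assumed to arrive in chronological order, linking uses only the user and PID / parent-PID pivots (the IP and port pivots of Phase 3 are omitted), and the `Event` type, the `link_chain` function, and all PIDs other than 4820 are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Event:
    source: str                       # e.g. "Sysmon 1"
    user: str
    pid: int
    parent_pid: Optional[int] = None


def link_chain(events: List[Event], root_user: str,
               root_pid: int) -> Tuple[List[Event], List[Event]]:
    """Split events into those reachable from the root process and the rest."""
    chain_pids = {root_pid}
    linked, filtered = [], []
    for ev in events:  # assumed to be sorted by timestamp
        same_user = ev.user == root_user
        in_lineage = ev.pid in chain_pids or ev.parent_pid in chain_pids
        if same_user and in_lineage:
            linked.append(ev)
            chain_pids.add(ev.pid)    # descendants of this process also link
        else:
            filtered.append(ev)       # no pivot matched: treated as noise
    return linked, filtered


events = [
    Event("Sysmon 10", "Admin-Svc", 4820),                  # Phase 1
    Event("Sysmon 1", "Admin-Svc", 5012, parent_pid=4820),  # Phase 2 (net.exe)
    Event("Sysmon 1", "jdoe", 777, parent_pid=600),         # benign background
    Event("Sysmon 3", "Admin-Svc", 5100, parent_pid=4820),  # Phase 3
]
linked, filtered = link_chain(events, "Admin-Svc", 4820)
print([e.source for e in linked], len(filtered))
```

Keeping the set of chain PIDs growing as descendants are found is what lets later stages attach to earlier ones, while events that share no pivot fall into the filtered list, mirroring the "Noise Filtering" row.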
Table 15. Quantitative Evaluation of Factual Accuracy and Traceability.

| Metric Category | Evaluation Metric | Value | Interpretation & Impact |
|---|---|---|---|
| Integrity | Hallucination Rate | 6.2% | Minimizes false positives and operational confusion. |
| Traceability | Evidence Citation Rate | 93.8% | Every claim is backed by a verifiable log entry. |
| Precision | Technique Mapping Accuracy | 95.5% | Precise alignment with the MITRE ATT&CK framework. |
| Consistency | Logic Consistency Score | 98.1% | Internal logical flow remains coherent across reports. |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
Copyright: This open access article is published under a Creative Commons CC BY 4.0 license, which permits free download, distribution, and reuse, provided that the author and preprint are cited in any reuse.