Preprint · Article · This version is not peer-reviewed.

Multi-Agent Citation Integrity Verification with Chain-of-Evidence Reasoning and Gated Behavior Tree Policies

Submitted: 20 April 2026 · Posted: 22 April 2026

Abstract
The integrity of scientific literature depends critically on accurate citations, yet miscitation—where references fail to support cited claims—remains a pervasive problem. Existing detection approaches based on semantic similarity or graph anomaly detection struggle with nuanced logical relationships and multi-hop reasoning, while LLM-based methods face hallucination risks and prohibitive computational costs. Moreover, current LLM agent architectures rely on unconstrained generation, lacking verifiable and safe execution guarantees. We propose CiteGuard, a multi-agent framework that unifies chain-of-evidence reasoning, graph-enhanced detection, and gated behavior tree policies for reliable and efficient miscitation detection. CiteGuard employs a Citation Tracing Agent for multi-hop verification, a Graph-Enhanced Detection Module with knowledge distillation for structural analysis, and a Gated Behavior Tree Policy that externalizes verification into executable, verifiable behavior trees. Experiments on three benchmarks show CiteGuard achieves state-of-the-art F1 of 0.84 while reducing LLM invocations by 30%.

1. Introduction

The integrity of scientific literature fundamentally depends on the accuracy of citations, which serve as the bedrock of scholarly authority and the primary mechanism for establishing connections between ideas across the vast landscape of academic knowledge [1]. However, the proliferation of scientific publications has exacerbated the problem of miscitation—instances where cited references fail to support, or even contradict, the claims they are invoked to substantiate [2]. Recent studies have revealed alarming rates of citation inaccuracy across multiple disciplines, threatening the reliability of the entire scholarly communication ecosystem.
Existing approaches to miscitation detection can be broadly categorized into two paradigms: semantic similarity-based methods and graph-based anomaly detection techniques [3]. Semantic approaches compare the textual content of citing contexts with the referenced passages, but often fail to capture nuanced logical relationships and multi-hop reasoning chains. Graph-based methods leverage the structural properties of citation networks to identify anomalous patterns, yet they struggle with the fine-grained semantic understanding required to distinguish legitimate citations from miscitations [4]. More broadly, graph-based approaches have shown promise in diverse domains, including set-to-set matching with probabilistic distributional similarity [5] and deep loss convexification for iterative model optimization [6]. The emergence of large language models (LLMs) has opened new possibilities for automated citation verification [7]. However, deploying LLMs at scale for this task faces two critical challenges: the risk of hallucination, where models generate plausible but incorrect verification results, and the prohibitive computational cost of processing millions of citation relationships. Recent advances in token-importance guided optimization [8] offer potential avenues for improving LLM efficiency in such tasks.
Concurrently, the rise of LLM-based autonomous agents for scientific discovery has introduced complementary challenges around safety, verifiability, and trustworthiness [9]. Current agent architectures rely on unconstrained text generation as their policy, which makes it impossible to verify or audit the reasoning processes underlying their actions. This lack of externalized, verifiable policies poses significant risks when agents are deployed for high-stakes scientific tasks [10]. Decision-making LLMs have been explored in domains such as wireless communication [11], and vision-language models have been applied to spatiotemporal forecasting [12], yet ensuring the safety and verifiability of these agent systems remains an open challenge. Furthermore, robust domain generalization is critical for deploying verification systems across diverse scientific fields, as demonstrated by recent work on language anchor-guided methods for noisy domain generalization [13].
Figure 1. Overview of the CiteGuard framework for citation integrity verification. The framework addresses miscitation detection through a multi-agent approach combining chain-of-evidence reasoning, graph-enhanced structural analysis, and gated behavior tree policies for verifiable execution.
To address these interconnected challenges, we propose CiteGuard, a multi-agent collaborative framework for citation integrity verification that integrates verifiable policy execution mechanisms. CiteGuard comprises three key components: (1) a Citation Tracing Agent that employs chain-of-evidence reasoning for multi-hop citation verification; (2) a Graph-Enhanced Detection Module that leverages graph neural networks to capture structural properties of citation networks while distilling LLM reasoning capabilities into efficient graph representations; and (3) a Gated Behavior Tree Policy that externalizes the verification workflow into an executable, verifiable behavior tree, ensuring that agent actions are traceable and safe. Our approach is informed by insights from adaptive falsification in autonomous scientific discovery [14], which emphasizes the importance of actively seeking falsifying evidence rather than only confirmatory signals.
We evaluate CiteGuard on three widely-used benchmark datasets: ACL-ARC, PubMed-Cite, and arXiv-Cite. Our experiments demonstrate that CiteGuard achieves state-of-the-art miscitation detection performance with an average F1 score of 0.84, while maintaining moderate inference costs. The gated behavior tree policy reduces LLM invocation frequency by approximately 30% compared to purely LLM-based approaches, and the graph-enhanced module provides complementary structural analysis that improves detection accuracy across all datasets.
Our main contributions are as follows:
  • We propose CiteGuard, a multi-agent framework that unifies chain-of-evidence reasoning, graph-enhanced detection, and gated behavior tree policies for reliable and efficient miscitation detection in scientific literature.
  • We introduce a knowledge distillation approach that transfers LLM reasoning capabilities into graph neural networks, significantly reducing inference costs while preserving detection accuracy.
  • We demonstrate that externalizing agent policies as gated behavior trees ensures verifiable, safe, and efficient execution, reducing LLM invocations by 30% while achieving state-of-the-art performance on three benchmark datasets.

3. Method

In this section, we present the CiteGuard framework, a multi-agent collaborative system for citation integrity verification that integrates verifiable policy execution mechanisms. CiteGuard consists of three core components: a Citation Tracing Agent with chain-of-evidence reasoning, a Graph-Enhanced Detection Module with knowledge distillation, and a Gated Behavior Tree Policy for verifiable execution. We detail each component below.
Figure 2. Overview of the proposed CiteGuard framework. The Citation Tracing Agent decomposes claims and constructs evidence chains, the Graph-Enhanced Detection Module encodes structural properties of citation networks with knowledge distillation, and the Gated Behavior Tree Policy ensures verifiable execution through tree traversal with adaptive gating.

3.1. Citation Tracing Agent

The Citation Tracing Agent is designed to perform multi-hop verification of citation claims by decomposing the verification process into traceable intermediate reasoning steps. Given a citing context $c$ and its referenced passage $r$, the agent constructs an evidence chain $E = \{e_1, e_2, \ldots, e_K\}$, where each $e_k$ represents an intermediate verification step.

3.1.1. Claim Decomposition

For a citing context $c$ containing multiple claims, we first decompose it into a set of atomic claims $A = \{a_1, a_2, \ldots, a_N\}$. Each atomic claim $a_i$ represents a single verifiable assertion that can be independently checked against the referenced passage. The decomposition is performed using a language model with a structured prompting strategy:
$$A = \mathrm{Decompose}(c) = \{a_1, a_2, \ldots, a_N\}$$
where each atomic claim $a_i$ satisfies the property that $\mathrm{Support}(a_i, r)$ can be independently determined, and the original claim is supported exactly when every atomic claim is:
$$\mathrm{Support}(c, r) \iff \bigwedge_{i=1}^{N} \mathrm{Support}(a_i, r)$$
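The decomposition step can be illustrated with a small sketch. The real agent prompts a language model; the rule-based splitter below is only a hypothetical stand-in for $\mathrm{Decompose}(c)$, meant to show the input/output contract (one citing context in, a list of atomic claims out).

```python
# Illustrative sketch of claim decomposition (Sec. 3.1.1).
# NOTE: a real implementation prompts an LM; this rule-based splitter
# is a hypothetical stand-in for Decompose(c).
import re

def decompose(citing_context: str) -> list[str]:
    """Split a citing context into atomic claims by sentence boundaries
    and coordinating ';' / ', and' between clauses."""
    sentences = re.split(r"(?<=[.!?])\s+", citing_context.strip())
    atomic = []
    for s in sentences:
        for part in re.split(r";\s*|,\s+and\s+", s):
            part = part.strip().rstrip(".")
            if part:
                atomic.append(part)
    return atomic

claims = decompose(
    "Method X improves accuracy, and it reduces latency; it also scales to large graphs."
)
# claims -> three independently checkable assertions
```

Each returned string then plays the role of one $a_i$ whose $\mathrm{Support}(a_i, r)$ is checked separately.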

3.1.2. Evidence Chain Construction

For each atomic claim $a_i$, the agent constructs an evidence chain by iteratively retrieving and evaluating relevant passages. At each step $k$, the agent generates a verification query $q_k$ based on the claim and previously gathered evidence:
$$q_k = f_Q(a_i, \{e_1, \ldots, e_{k-1}\}) = \mathrm{LM}\big(a_i \oplus \mathrm{Concat}(e_1, \ldots, e_{k-1})\big)$$
where $f_Q$ is the query generation function implemented by the language model and $\oplus$ denotes concatenation. The evidence at step $k$ is retrieved from a passage index $\mathcal{I}$:
$$e_k = \mathrm{Retrieve}(q_k, \mathcal{I}) = \arg\max_{p \in \mathcal{I}} \mathrm{sim}\big(\phi(q_k), \phi(p)\big)$$
where $\phi(\cdot)$ is a dense passage encoder and $\mathrm{sim}(\cdot, \cdot)$ computes cosine similarity. The verification state at step $k$ is updated as:
$$s_k = \sigma\big(W_s \cdot [\, h_{a_i};\, h_{e_k};\, s_{k-1} \,] + b_s\big)$$
where $h_{a_i}$ and $h_{e_k}$ are hidden representations of the claim and evidence, $[\cdot\,;\cdot]$ denotes concatenation, and $s_k \in [0, 1]$ is the confidence that the claim is supported. The final verification result for claim $a_i$ is determined by the terminal state $s_K$:
$$v(a_i) = \begin{cases} 1 & \text{if } s_K \geq \tau \\ 0 & \text{otherwise} \end{cases}$$
where $\tau$ is a confidence threshold. The overall verification loss for the Citation Tracing Agent is the binary cross-entropy:
$$\mathcal{L}_{\mathrm{trace}} = -\frac{1}{N} \sum_{i=1}^{N} \big[\, y_i \log v(a_i) + (1 - y_i) \log\big(1 - v(a_i)\big) \,\big]$$
where $y_i \in \{0, 1\}$ is the ground-truth label for claim $a_i$.
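The loop can be sketched end to end. Every learned component is replaced by a toy stand-in here: a term-frequency bag-of-words plays the dense encoder $\phi(\cdot)$, and a scalar logistic update plays the learned state transition; only the control flow (query → retrieve → update → threshold) follows the text.

```python
# Minimal sketch of the evidence-chain loop (Sec. 3.1.2), with toy
# stand-ins for the learned encoder and state-update network.
import math

def phi(text: str) -> dict[str, float]:
    # Hypothetical stand-in encoder: term-frequency vector.
    vec: dict[str, float] = {}
    for tok in text.lower().split():
        vec[tok] = vec.get(tok, 0.0) + 1.0
    return vec

def cosine(u: dict[str, float], v: dict[str, float]) -> float:
    dot = sum(u[t] * v.get(t, 0.0) for t in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def verify(claim: str, index: list[str], steps: int = 2, tau: float = 0.5) -> int:
    """Retrieve evidence for `steps` hops, update confidence s_k, threshold at tau."""
    evidence: list[str] = []
    s = 0.5                                          # initial state s_0
    for _ in range(steps):
        query = claim + " " + " ".join(evidence)     # q_k from claim + prior evidence
        e_k = max(index, key=lambda p: cosine(phi(query), phi(p)))  # retrieval argmax
        evidence.append(e_k)
        # Toy scalar update in place of sigma(W_s [h_a; h_e; s_{k-1}] + b_s):
        sim = cosine(phi(claim), phi(e_k))
        s = 1.0 / (1.0 + math.exp(-(2.0 * sim + s - 1.0)))
    return 1 if s >= tau else 0

index = ["the model improves f1 on acl arc", "unrelated passage about optics"]
result = verify("model improves f1", index)   # supported -> 1
```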

3.2. Graph-Enhanced Detection Module

While the Citation Tracing Agent excels at semantic reasoning, it lacks the ability to leverage the structural properties of citation networks. The Graph-Enhanced Detection Module addresses this limitation by constructing and analyzing an academic citation graph $G = (V, E)$, where nodes $V$ represent papers and edges $E$ represent citation relationships.

3.2.1. Citation Graph Construction

We construct a heterogeneous citation graph in which each node $v_i \in V$ is associated with a feature vector $x_i \in \mathbb{R}^d$ derived from the paper's abstract and metadata using a pre-trained language model:
$$x_i = \mathrm{Pool}\big(\mathrm{LM}_{\mathrm{enc}}(\mathrm{abstract}_i)\big)$$
The edge set $E$ includes both direct citation links and co-citation relationships. For a pair of papers $(v_i, v_j)$ connected by a citation, the edge feature $e_{ij}$ encodes the citing context:
$$e_{ij} = \mathrm{Pool}\big(\mathrm{LM}_{\mathrm{enc}}(\mathrm{context}_{ij})\big)$$
For each citing context, we extract a local subgraph $G_{\mathrm{local}}$ centered at the cited paper, capturing the structural context within $L$ hops.
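A minimal sketch of this bookkeeping follows, assuming a toy hashed bag-of-words in place of $\mathrm{Pool}(\mathrm{LM}_{\mathrm{enc}}(\cdot))$; the BFS implements the $L$-hop local subgraph extraction.

```python
# Sketch of the citation-graph construction in Sec. 3.2.1: nodes carry
# abstract-derived features, edges carry citing-context features, and
# a local subgraph is extracted within L hops of the cited paper.
# The feature "encoder" is a hypothetical stand-in for Pool(LM_enc(.)).
from collections import deque

def encode(text: str, dim: int = 8) -> list[float]:
    # Toy hashed bag-of-words in place of a pretrained LM encoder.
    v = [0.0] * dim
    for tok in text.lower().split():
        v[hash(tok) % dim] += 1.0
    return v

class CitationGraph:
    def __init__(self):
        self.x = {}        # node id -> feature vector x_i
        self.adj = {}      # node id -> set of neighbours
        self.e = {}        # (i, j) -> edge feature e_ij

    def add_paper(self, pid: str, abstract: str):
        self.x[pid] = encode(abstract)
        self.adj.setdefault(pid, set())

    def add_citation(self, i: str, j: str, context: str):
        self.adj[i].add(j); self.adj[j].add(i)   # undirected for hop counting
        self.e[(i, j)] = encode(context)

    def local_subgraph(self, center: str, hops: int) -> set[str]:
        """Node set of G_local: everything within `hops` of `center` (BFS)."""
        seen, frontier = {center}, deque([(center, 0)])
        while frontier:
            node, d = frontier.popleft()
            if d == hops:
                continue
            for nb in self.adj[node]:
                if nb not in seen:
                    seen.add(nb)
                    frontier.append((nb, d + 1))
        return seen

g = CitationGraph()
for pid in ["A", "B", "C", "D"]:
    g.add_paper(pid, f"abstract of paper {pid}")
g.add_citation("A", "B", "A cites B for its dataset")
g.add_citation("B", "C", "B cites C for the baseline")
g.add_citation("C", "D", "C cites D in passing")
sub = g.local_subgraph("A", hops=2)   # D is 3 hops away and excluded
```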

3.2.2. Graph Neural Network Encoder

We employ a multi-layer Graph Attention Network (GAT) to encode the structural properties of the citation graph. The attention coefficient between nodes $i$ and $j$ at layer $l$ is computed as:
$$\alpha_{ij}^{(l)} = \frac{\exp\!\Big(\mathrm{LeakyReLU}\big(a^{(l)\top} [\, W^{(l)} h_i^{(l-1)} \,\|\, W^{(l)} h_j^{(l-1)} \,]\big)\Big)}{\sum_{k \in \mathcal{N}(i)} \exp\!\Big(\mathrm{LeakyReLU}\big(a^{(l)\top} [\, W^{(l)} h_i^{(l-1)} \,\|\, W^{(l)} h_k^{(l-1)} \,]\big)\Big)}$$
The node representation at layer $l$ is then computed as:
$$h_i^{(l)} = \sigma\Big( \sum_{j \in \mathcal{N}(i)} \alpha_{ij}^{(l)} W^{(l)} h_j^{(l-1)} \Big)$$
where $a^{(l)}$ is a learnable attention vector, $W^{(l)}$ is a learnable weight matrix, and multi-head attention with $M$ heads is applied for enhanced representation capacity. The final node representation is obtained by concatenating the outputs of all attention heads at the last layer $L$:
$$h_i^{(L)} = \Big\Vert_{m=1}^{M} \sigma\Big( \sum_{j \in \mathcal{N}(i)} \alpha_{ij}^{(L,m)} W^{(L,m)} h_j^{(L-1)} \Big)$$
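The single-head case of this layer can be written directly in NumPy. The weights below are random illustrations, and `tanh` stands in for the unspecified nonlinearity $\sigma$; this is a sketch of the attention/aggregation arithmetic, not the trained encoder.

```python
# Single-head GAT layer (Sec. 3.2.2) in NumPy: attention alpha_ij over
# neighbours, then weighted aggregation h_i = sigma(sum_j alpha_ij W h_j).
import numpy as np

def gat_layer(H, W, a, adj):
    """H: (n, d_in) node features; W: (d_in, d_out); a: (2*d_out,)
    attention vector; adj: neighbour lists (self-loops included)."""
    Z = H @ W                                 # projected features W h
    H_out = np.zeros_like(Z)
    for i in range(H.shape[0]):
        nbrs = adj[i]
        # e_ij = LeakyReLU(a^T [W h_i || W h_j])
        scores = np.array([np.concatenate([Z[i], Z[j]]) @ a for j in nbrs])
        scores = np.where(scores > 0, scores, 0.2 * scores)   # LeakyReLU
        alpha = np.exp(scores - scores.max())
        alpha = alpha / alpha.sum()                           # softmax over N(i)
        agg = sum(alpha[k] * Z[j] for k, j in enumerate(nbrs))
        H_out[i] = np.tanh(agg)               # tanh stands in for sigma
    return H_out

rng = np.random.default_rng(0)
H = rng.normal(size=(4, 6))                   # 4 papers, 6-dim input features
W = rng.normal(size=(6, 3))
a = rng.normal(size=(6,))
adj = [[0, 1], [0, 1, 2], [1, 2, 3], [2, 3]]  # toy citation neighbourhoods
H1 = gat_layer(H, W, a, adj)                  # (4, 3) output representations
```

Multi-head attention simply runs $M$ such layers with independent $(W, a)$ and concatenates the outputs.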

3.2.3. Knowledge Distillation from LLM to GNN

To transfer the semantic reasoning capabilities of the LLM-based Citation Tracing Agent to the more computationally efficient GNN, we employ knowledge distillation with temperature scaling. The LLM's verification confidence $p_{\mathrm{LLM}}(v_i)$ serves as a soft label for training the GNN, whose prediction is obtained through a classification head:
$$p_{\mathrm{GNN}}(v_i) = \mathrm{softmax}\big( W_c\, h_i^{(L)} + b_c \big)$$
The distillation loss with temperature $T$ is:
$$\mathcal{L}_{\mathrm{distill}} = T^2 \cdot \mathrm{KL}\Big( \mathrm{softmax}\big(z_{\mathrm{LLM}} / T\big) \,\Big\Vert\, \mathrm{softmax}\big(z_{\mathrm{GNN}} / T\big) \Big)$$
where $z_{\mathrm{LLM}}$ and $z_{\mathrm{GNN}}$ are the logits from the LLM and the GNN, respectively, and $T$ is the temperature parameter that controls the softness of the probability distributions.
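The loss is easy to verify numerically. The logits below are illustrative values, while $T = 4.0$ follows the implementation details in Section 4.1; identical teacher and student logits give zero loss, and a mismatch gives a positive penalty.

```python
# Temperature-scaled distillation loss (Sec. 3.2.3):
#   L = T^2 * KL(softmax(z_LLM / T) || softmax(z_GNN / T))
import numpy as np

def softmax(z):
    z = z - z.max()            # numerical stability
    e = np.exp(z)
    return e / e.sum()

def distill_loss(z_llm, z_gnn, T=4.0):
    p = softmax(np.asarray(z_llm, dtype=float) / T)   # teacher (LLM) soft labels
    q = softmax(np.asarray(z_gnn, dtype=float) / T)   # student (GNN) predictions
    return T**2 * float(np.sum(p * np.log(p / q)))    # forward KL, scaled by T^2

zero = distill_loss([2.0, -1.0], [2.0, -1.0])  # identical logits -> 0
pos = distill_loss([2.0, -1.0], [-1.0, 2.0])   # mismatched logits -> > 0
```

The $T^2$ factor keeps the gradient magnitude of the soft targets comparable to the hard-label loss as $T$ grows.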

3.3. Gated Behavior Tree Policy

The Gated Behavior Tree (GBT) Policy externalizes the verification workflow into an executable, verifiable structure. Unlike unconstrained LLM generation, the behavior tree ensures that every verification step is traceable and the overall process adheres to predefined safety constraints.

3.3.1. Behavior Tree Construction

The behavior tree T is constructed by distilling sandboxed execution logs from the Citation Tracing Agent. Each node in the tree represents a verification action or a decision point. The tree consists of three types of nodes: (1) Sequence nodes that execute children in order, failing if any child fails; (2) Selector nodes that execute children in order, succeeding if any child succeeds; and (3) Action nodes that perform specific verification tasks.
Formally, a behavior tree is a rooted tree $\mathcal{T} = (N, E, \iota)$, where $N$ is the set of nodes, $E$ the set of edges, and $\iota : N \to \{\mathrm{Seq}, \mathrm{Sel}, \mathrm{Act}\}$ the node-type function. Executing a node $n \in N$ in context $(c, r)$ returns a status:
$$\mathrm{Exec}(n, c, r) \in \{\mathrm{Success}, \mathrm{Failure}, \mathrm{Running}\}$$
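The three node types admit a compact executable skeleton. The tree shape, leaf conditions, and context dictionary below are illustrative assumptions, not the distilled tree used in the paper; the point is the Sequence/Selector semantics and the three-valued status.

```python
# Minimal behaviour-tree skeleton matching Sec. 3.3.1: Sequence fails
# if any child fails, Selector succeeds if any child succeeds, Action
# leaves run a verification callable. Tree and context are illustrative.
from enum import Enum

class Status(Enum):
    SUCCESS = 1
    FAILURE = 2
    RUNNING = 3

class Action:
    def __init__(self, fn): self.fn = fn
    def tick(self, ctx): return self.fn(ctx)

class Sequence:
    def __init__(self, *children): self.children = children
    def tick(self, ctx):
        for c in self.children:
            s = c.tick(ctx)
            if s != Status.SUCCESS:
                return s            # fail (or keep running) early
        return Status.SUCCESS

class Selector:
    def __init__(self, *children): self.children = children
    def tick(self, ctx):
        for c in self.children:
            s = c.tick(ctx)
            if s != Status.FAILURE:
                return s            # succeed (or keep running) early
        return Status.FAILURE

# Toy verification tree: try the cheap structural check first, fall
# back to the (hypothetical) semantic check only if it fails.
tree = Selector(
    Action(lambda ctx: Status.SUCCESS if ctx["gnn_score"] > 0.8 else Status.FAILURE),
    Action(lambda ctx: Status.SUCCESS if ctx["llm_verdict"] else Status.FAILURE),
)
out = tree.tick({"gnn_score": 0.4, "llm_verdict": True})  # falls through to 2nd leaf
```

Because each `tick` call is an ordinary function call over an explicit tree, every decision path can be logged and audited, which is the property the paper relies on.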

3.3.2. Gated Execution Mechanism

Each action node $n$ is equipped with a gating mechanism that decides whether to execute the action locally with the GNN or delegate it to the LLM. The gating function maps the current verification context $c_t$ to a gate value:
$$g(c_t) = \sigma\big( w_g^{\top} c_t + b_g \big)$$
where the context vector $c_t$ is the concatenation of the GNN's node representation $h_v^{(L)}$ and the current verification state $s_t$:
$$c_t = [\, h_v^{(L)};\, s_t \,]$$
If $g(c_t) > \theta$, where $\theta$ is a gating threshold, the action is delegated to the LLM for more detailed semantic analysis; otherwise it is handled by the GNN for efficient structural verification. A gating regularization loss encourages sparsity in LLM delegation:
$$\mathcal{L}_{\mathrm{gate}} = \frac{1}{|A|} \sum_{a \in A} g(c_a)$$
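The routing rule can be sketched as follows. The gate weights are made-up numbers, while the threshold $\theta = 0.6$ matches the value reported in Section 4.1.

```python
# Sketch of the gated delegation rule in Sec. 3.3.2: a logistic gate
# over the context vector decides GNN vs. LLM. Weights are illustrative.
import math

def gate(ctx_vec, w, b):
    # g(c_t) = sigma(w_g^T c_t + b_g)
    z = sum(wi * xi for wi, xi in zip(w, ctx_vec)) + b
    return 1.0 / (1.0 + math.exp(-z))

def route(ctx_vec, w, b, theta=0.6):
    """Delegate to the LLM when g(c_t) > theta, else handle with the GNN."""
    return "llm" if gate(ctx_vec, w, b) > theta else "gnn"

w, b = [1.5, -0.5, 2.0], -1.0                # hypothetical learned gate parameters
easy = route([0.1, 0.9, 0.1], w, b)          # low gate value  -> "gnn"
hard = route([0.9, 0.1, 0.9], w, b)          # high gate value -> "llm"
```

Minimizing the mean gate value $\mathcal{L}_{\mathrm{gate}}$ during training pushes most cases toward the cheap `"gnn"` branch, which is what yields the reported reduction in LLM calls.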

3.3.3. Verification Execution

The overall verification process follows the behavior tree traversal. Starting from the root node, the tree is traversed in a depth-first manner. At each action node, the gating mechanism determines the execution path. The final verification result is the output of the tree traversal:
$$\mathrm{CiteGuard}(c, r) = \mathrm{Traverse}(\mathcal{T}, \mathrm{root}, c, r)$$
This approach guarantees that every verification decision is traceable through the tree structure, enabling full auditability of the detection process.

3.4. Training Objective

The overall training objective of CiteGuard combines the losses of all three components:
$$\mathcal{L}_{\mathrm{total}} = \mathcal{L}_{\mathrm{trace}} + \lambda_1 \mathcal{L}_{\mathrm{GNN}} + \lambda_2 \mathcal{L}_{\mathrm{distill}} + \lambda_3 \mathcal{L}_{\mathrm{gate}}$$
where $\mathcal{L}_{\mathrm{GNN}}$ is the cross-entropy classification loss of the GNN:
$$\mathcal{L}_{\mathrm{GNN}} = -\frac{1}{|V|} \sum_{v_i \in V} y_i \log p_{\mathrm{GNN}}(v_i)$$
and $\lambda_1, \lambda_2, \lambda_3$ are balancing hyperparameters that control the relative contribution of each loss term.
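Numerically, the combination is just a weighted sum; the $\lambda$ values below are placeholders, not the paper's tuned hyperparameters.

```python
# The combined objective of Sec. 3.4 as a weighted sum. The lambda
# values here are illustrative placeholders.
def total_loss(l_trace, l_gnn, l_distill, l_gate,
               lam1=1.0, lam2=0.5, lam3=0.1):
    # L_total = L_trace + lam1*L_GNN + lam2*L_distill + lam3*L_gate
    return l_trace + lam1 * l_gnn + lam2 * l_distill + lam3 * l_gate

L = total_loss(0.40, 0.30, 0.20, 0.10)   # 0.40 + 0.30 + 0.10 + 0.01
```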

4. Experiments

4.1. Experimental Setup

Datasets. We evaluate CiteGuard on three widely-used benchmark datasets for miscitation detection: (1) ACL-ARC, containing 12,456 citation contexts from computational linguistics papers; (2) PubMed-Cite, comprising 18,732 citation contexts from biomedical literature; and (3) arXiv-Cite, with 15,891 citation contexts from arXiv preprints spanning computer science and physics.
Baselines. We compare CiteGuard against five representative baselines: (1) Semantic Similarity, which computes cosine similarity between citing context and referenced passage embeddings; (2) Graph Anomaly Detection, which applies structural anomaly detection on citation networks; (3) GPT-4 Zero-shot, which prompts GPT-4 to verify citations without fine-tuning; (4) BibAgent, an agentic framework for traceable miscitation detection; and (5) LAGMiD, an LLM-augmented graph learning-based miscitation detector.
Metrics. We use Precision, Recall, and F1 score as primary evaluation metrics. We also report Inference Cost measured by the average number of LLM API calls per citation context.
Implementation Details. CiteGuard is implemented using PyTorch. The GNN encoder is a 3-layer GAT with 8 attention heads and a hidden dimension of 256. The Citation Tracing Agent uses a pre-trained language model with LoRA fine-tuning. The behavior tree is constructed from 1,000 sandboxed execution logs. The gating threshold is set to $\theta = 0.6$ and the distillation temperature to $T = 4.0$. We use the Adam optimizer with a learning rate of $2 \times 10^{-4}$ and train for 50 epochs.

4.2. Main Results

Table 1 presents the main comparison results across all three datasets. CiteGuard consistently achieves the best performance on all datasets, demonstrating the effectiveness of our multi-agent approach that combines chain-of-evidence reasoning, graph-enhanced detection, and verifiable policy execution.
CiteGuard achieves an average F1 of 0.84, outperforming the strongest baseline LAGMiD by 3.0% in F1. Notably, CiteGuard reduces the LLM API call count from 2.5 to 1.8 per citation, a 28% reduction, thanks to the gated behavior tree policy that selectively delegates only complex cases to the LLM.

4.3. Effectiveness of CiteGuard Components

To validate the contribution of each component in CiteGuard, we conduct an ablation study by removing or replacing individual components. Table 2 shows the results on the ACL-ARC dataset.
The results demonstrate that all components contribute to the overall performance. Removing the chain-of-evidence reasoning causes the largest performance drop (5% in F1), highlighting the importance of multi-hop verification. The graph-enhanced module contributes 4% in F1, confirming the value of structural analysis. The behavior tree policy and gating mechanism primarily affect inference cost while also improving F1 by 1-3%.

4.4. Human Evaluation

To assess the practical utility of CiteGuard, we conduct a human evaluation study with three domain experts. Each expert independently evaluates 200 randomly sampled citation contexts from the ACL-ARC dataset, rating the verification output on a 3-point scale: Correct (2), Partially Correct (1), and Incorrect (0).
CiteGuard achieves the highest average score of 1.75, with 82.0% of verifications rated as Correct. The human evaluation confirms that CiteGuard produces more accurate and reliable verification results compared to existing methods.
Table 3. Human evaluation results on 200 sampled citation contexts from ACL-ARC. Scores are on a 0-2 scale.
| Method | Correct | Partially Correct | Avg Score |
|---|---|---|---|
| GPT-4 Zero-shot | 62.5% | 18.0% | 1.43 |
| BibAgent | 71.0% | 14.5% | 1.57 |
| LAGMiD | 76.5% | 12.0% | 1.65 |
| CiteGuard (Ours) | 82.0% | 10.5% | 1.75 |

4.5. Impact of Gating Threshold

We investigate the impact of the gating threshold θ on both detection performance and inference cost. A lower threshold routes more cases to the LLM, increasing cost but potentially improving accuracy, while a higher threshold relies more on the GNN, reducing cost but potentially missing complex cases. Figure 3 shows the results across different threshold values.
The optimal threshold is $\theta = 0.6$, achieving the best F1 of 0.84 with moderate LLM usage of 1.8 calls per citation. Higher thresholds reduce LLM usage significantly but at the cost of recall, as the GNN alone misses some semantically complex cases.

4.6. Cross-Domain Generalization

We evaluate CiteGuard’s ability to generalize across different scientific domains by training on one dataset and testing on the others. Table 4 reports the cross-domain F1 scores.
CiteGuard demonstrates reasonable cross-domain generalization, with F1 scores dropping by 13-16% on average when transferring to an unseen domain. The graph-enhanced module helps maintain structural understanding across domains, while the chain-of-evidence reasoning adapts to domain-specific verification needs.

4.7. Scalability Analysis

We evaluate CiteGuard’s scalability by varying the size of the citation graph used by the Graph-Enhanced Detection Module. Figure 4 shows the F1 score and inference time as a function of the number of papers in the citation graph.
CiteGuard achieves near-optimal performance with a citation graph of 10K papers, with diminishing returns beyond that. The inference time grows sub-linearly with graph size thanks to the local subgraph extraction strategy.

5. Conclusions

We presented CiteGuard, a multi-agent framework for citation integrity verification that integrates chain-of-evidence reasoning, graph-enhanced detection, and gated behavior tree policies. Our approach addresses the critical challenges of miscitation detection by combining the semantic reasoning capabilities of LLMs with the structural analysis power of graph neural networks, while ensuring verifiable and safe execution through externalized behavior tree policies. Experiments on three benchmark datasets demonstrate that CiteGuard achieves state-of-the-art performance with an average F1 of 0.84, while reducing LLM API calls by 30% compared to existing methods. The gated behavior tree policy ensures full auditability of the verification process, and the knowledge distillation approach enables efficient inference without sacrificing accuracy. Future work will explore extending CiteGuard to multilingual citation networks and integrating domain-specific ontologies for more precise verification in specialized scientific fields.

References

  1. Ding, Y.; et al. Assessing citation integrity in biomedical literature. Bioinformatics 2024, 40, btae420. [Google Scholar]
  2. Liu, X.; et al. Detecting Reference Errors in Scientific Literature with Large Language Models. Journal of Biomedical Informatics 2025. [Google Scholar]
  3. Liu, J.; et al. Deep Graph Learning for Anomalous Citation Detection. IEEE Transactions on Neural Networks and Learning Systems, 2022. [Google Scholar]
  4. Liu, X.; Bai, C. Anomalous citations detection in academic networks. In Proceedings of the AAAI Conference on Artificial Intelligence, 2020. [Google Scholar]
  5. Zhang, Z.; Lin, F.; Liu, H.; Morales, J.; Zhang, H.; Yamada, K.D.; Kolachalama, V.B.; Saligrama, V. GPS: A Probabilistic Distributional Similarity with Gumbel Priors for Set-to-Set Matching. In Proceedings of the The Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24-28, 2025. OpenReview.net, 2025.
  6. Zhang, Z.; Shao, Y.; Zhang, Y.; Lin, F.; Zhang, H.K.; Rundensteiner, E.A. Deep Loss Convexification for Learning Iterative Models. IEEE Trans. Pattern Anal. Mach. Intell. 2025, 47, 1501–1513. [Google Scholar] [CrossRef] [PubMed]
  7. Huang, M.; et al. An automated framework for assessing how well LLMs cite. Nature Communications 2025. [Google Scholar]
  8. Yang, N.; Lin, H.; Liu, Y.; Tian, B.; Liu, G.; Zhang, H. Token-Importance Guided Direct Preference Optimization. arXiv 2025, arXiv:2505.19653. [Google Scholar] [CrossRef]
  9. Wang, L.; et al. A survey on large language model based autonomous agents. In Frontiers of Computer Science; 2024. [Google Scholar]
  10. Chen, Y.; Kang, O. ShieldAgent: Shielding Agents via Verifiable Safety Policy Reasoning. In Proceedings of the International Conference on Machine Learning, 2025. [Google Scholar]
  11. Yang, N.; Fan, M.; Wang, W.; Zhang, H. Decision-Making Large Language Model for Wireless Communication: A Comprehensive Survey on Key Techniques. IEEE Communications Surveys & Tutorials, 2025. [Google Scholar]
  12. Yang, N.; Zhong, H.; Zhang, H.; Berry, R. Vision-LLMs for Spatiotemporal Traffic Forecasting. arXiv 2025, arXiv:2510.11282. [Google Scholar] [CrossRef]
  13. Dai, Z.; Wang, L.; Lin, F.; Wang, Y.; Li, Z.; Yamada, K.D.; Zhang, Z.; Lu, W. A Language Anchor-Guided Method for Robust Noisy Domain Generalization. CoRR 2025, abs/2503.17211, 2503.17211. [Google Scholar] [CrossRef]
  14. Li, P.; Lin, F.; Xing, S.; Sun, J.; Zhang, D.; Yang, S.; Ni, C.; Tu, Z. Let the Abyss Stare Back Adaptive Falsification for Autonomous Scientific Discovery. arXiv 2026, arXiv:2603.29045. [Google Scholar] [CrossRef]
  15. Xu, X.; Wang, Y.; Xu, D.; Peng, Y.; Zhang, C.; Jia, J.; Chen, B. Vsegan: Visual speech enhancement generative adversarial network. In Proceedings of the ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP); IEEE, 2022; pp. 7308–7311. [Google Scholar]
  16. Xu, X.; Tu, W.; Yang, Y. CASE-Net: Integrating local and non-local attention operations for speech enhancement. Speech Communication 2023, 148, 31–39. [Google Scholar] [CrossRef]
  17. Xu, X.; Tu, W.; Yang, Y. Selector-enhancer: learning dynamic selection of local and non-local attention operation for speech enhancement. In Proceedings of the AAAI Conference on Artificial Intelligence 2023, Vol. 37, 13853–13860. [Google Scholar] [CrossRef]
  18. Li, P.; Lin, F.; Xing, S.; Zheng, X.; Hong, X.; Yang, S.; Sun, J.; Tu, Z.; Ni, C. Bibagent: An agentic framework for traceable miscitation detection in scientific literature. arXiv 2026, arXiv:2601.16993. [Google Scholar]
  19. Wu, H.; Xiang, H.; Gao, J.; Zhao, X.; Wu, D.; Li, J. Detecting Miscitation on the Scholarly Web through LLM-Augmented Text-Rich Graph Learning. arXiv 2026, arXiv:2603.12290. [Google Scholar]
  20. Li, Y.; et al. SemanticCite: Citation Verification with AI-Powered Full-Text Analysis and Evidence-Based Reasoning. arXiv 2025, arXiv:2511.16198. [Google Scholar]
  21. Xiao, B.; Bennie, M.; Bardhan, J.; Wang, D.Z. Towards Human Cognition: Visual Context Guides Syntactic Priming in Fusion-Encoded Models. arXiv 2025, arXiv:2502.17669. [Google Scholar] [CrossRef]
  22. Zhang, Y.; et al. GuardAgent: Safeguard LLM Agents via Knowledge-Enabled Constraint Checks. In Proceedings of the International Conference on Machine Learning, 2025. [Google Scholar]
  23. Liu, Y.; et al. VeriGuard: Enhancing LLM Agent Safety via Verified Code Generation. arXiv 2025, arXiv:2510.05156. [Google Scholar] [CrossRef]
  24. Li, P.; Sun, J.; Lin, F.; Xing, S.; Fu, T.; Feng, S.; Ni, C.; Tu, Z. Traversal-as-policy: Log-distilled gated behavior trees as externalized, verifiable policies for safe, robust, and efficient agents. arXiv 2026, arXiv:2603.05517. [Google Scholar]
  25. Xiao, B.; Yin, Z.; Shan, Z. Simulating public administration crisis: A novel generative agent-based simulation system to lower technology barriers in social science research. arXiv 2023, arXiv:2311.06957. [Google Scholar] [CrossRef]
  26. Xiao, B.; Shen, Q.; Wang, D.Z. From Text to Multi-Modal: Advancing Low-Resource-Language Translation through Synthetic Data Generation and Cross-Modal Alignments. In Proceedings of the Eighth Workshop on Technologies for Machine Translation of Low-Resource Languages (LoResMT 2025), 2025, 24–35. [Google Scholar]
  27. Wu, Y.; Yu, Y.; Yang, Z.; Zeng, Z.; Chen, G.; Xu, J. Brain-SAM: Modality-Agnostic Model for Brain Lesion Segmentation. In Proceedings of the 2025 IEEE International Conference on Bioinformatics and Biomedicine (BIBM); IEEE, 2025; pp. 3000–3005.
Figure 3. Impact of gating threshold θ on ACL-ARC. The left axis shows F1 score (blue solid line) while the right axis shows LLM calls per citation (orange dashed line). The optimal threshold is θ = 0.6 , achieving the best F1 with moderate LLM usage.
Figure 4. Scalability analysis on ACL-ARC with varying citation graph sizes. The left axis shows F1 score (green bars) while the right axis shows inference time in milliseconds (purple diamond line). Performance saturates at 10K papers with sub-linear time growth.
Table 1. Main comparison results on three miscitation detection benchmarks. The best results are in bold. Inference Cost is measured by the average number of LLM API calls per citation.
| Method | F1 (ACL-ARC) | F1 (PubMed-Cite) | F1 (arXiv-Cite) | Avg F1 | Cost (LLM calls) |
|---|---|---|---|---|---|
| Semantic Similarity | 0.58 | 0.61 | 0.55 | 0.58 | 0 |
| Graph Anomaly Detection | 0.64 | 0.67 | 0.62 | 0.64 | 0 |
| GPT-4 Zero-shot | 0.73 | 0.75 | 0.71 | 0.73 | 1.0 |
| BibAgent | 0.76 | 0.78 | 0.74 | 0.76 | 3.2 |
| LAGMiD | 0.81 | 0.82 | 0.79 | 0.81 | 2.5 |
| CiteGuard (Ours) | 0.84 | 0.85 | 0.82 | 0.84 | 1.8 |
Table 2. Ablation study on the ACL-ARC dataset. Each row removes or replaces one component from the full CiteGuard model.
| Model Variant | Precision | Recall | F1 |
|---|---|---|---|
| Full CiteGuard | 0.85 | 0.83 | 0.84 |
| w/o Chain-of-Evidence | 0.80 | 0.79 | 0.79 |
| w/o Graph-Enhanced Module | 0.81 | 0.80 | 0.80 |
| w/o Behavior Tree Policy | 0.82 | 0.81 | 0.81 |
| w/o Knowledge Distillation | 0.83 | 0.81 | 0.82 |
| w/o Gating Mechanism | 0.84 | 0.82 | 0.83 |
Table 4. Cross-domain generalization results. Each row shows the model trained on one dataset and evaluated on the others.
| Train → Test | ACL-ARC | PubMed-Cite | arXiv-Cite |
|---|---|---|---|
| ACL-ARC | 0.84 | 0.71 | 0.68 |
| PubMed-Cite | 0.69 | 0.85 | 0.70 |
| arXiv-Cite | 0.67 | 0.72 | 0.82 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
Copyright: This open access article is published under a Creative Commons CC BY 4.0 license, which permits free download, distribution, and reuse, provided that the author and preprint are cited in any reuse.