Preprint · Article · This version is not peer-reviewed.

Multi-Agent Citation Integrity Verification with Chain-of-Evidence Reasoning and Gated Behavior Tree Policies

Submitted: 20 April 2026 · Posted: 22 April 2026

Abstract
The integrity of scientific literature depends critically on accurate citations, yet miscitation—where references fail to support cited claims—remains a pervasive problem. Existing detection approaches based on semantic similarity or graph anomaly detection struggle with nuanced logical relationships and multi-hop reasoning, while LLM-based methods face hallucination risks and prohibitive computational costs. Moreover, current LLM agent architectures rely on unconstrained generation, lacking verifiable and safe execution guarantees. We propose CiteGuard, a multi-agent framework that unifies chain-of-evidence reasoning, graph-enhanced detection, and gated behavior tree policies for reliable and efficient miscitation detection. CiteGuard employs a Citation Tracing Agent for multi-hop verification, a Graph-Enhanced Detection Module with knowledge distillation for structural analysis, and a Gated Behavior Tree Policy that externalizes verification into executable, verifiable behavior trees. Experiments on three benchmarks show CiteGuard achieves state-of-the-art F1 of 0.84 while reducing LLM invocations by 30%.

1. Introduction

The integrity of scientific literature fundamentally depends on the accuracy of citations, which serve as the bedrock of scholarly authority and the primary mechanism for establishing connections between ideas across the vast landscape of academic knowledge [1]. However, the proliferation of scientific publications has exacerbated the problem of miscitation—instances where cited references fail to support, or even contradict, the claims they are invoked to substantiate [2]. Recent studies have revealed alarming rates of citation inaccuracy across multiple disciplines, threatening the reliability of the entire scholarly communication ecosystem.
Existing approaches to miscitation detection can be broadly categorized into two paradigms: semantic similarity-based methods and graph-based anomaly detection techniques [3]. Semantic approaches compare the textual content of citing contexts with the referenced passages, but often fail to capture nuanced logical relationships and multi-hop reasoning chains. Graph-based methods leverage the structural properties of citation networks to identify anomalous patterns, yet they struggle with the fine-grained semantic understanding required to distinguish legitimate citations from miscitations [4]. More broadly, graph-based approaches have shown promise in diverse domains, including set-to-set matching with probabilistic distributional similarity [5] and deep loss convexification for iterative model optimization [6]. The emergence of large language models (LLMs) has opened new possibilities for automated citation verification [7]. However, deploying LLMs at scale for this task faces two critical challenges: the risk of hallucination, where models generate plausible but incorrect verification results, and the prohibitive computational cost of processing millions of citation relationships. Recent advances in token-importance guided optimization [8] offer potential avenues for improving LLM efficiency in such tasks.
Concurrently, the rise of LLM-based autonomous agents for scientific discovery has introduced complementary challenges around safety, verifiability, and trustworthiness [9]. Current agent architectures rely on unconstrained text generation as their policy, which makes it impossible to verify or audit the reasoning processes underlying their actions. This lack of externalized, verifiable policies poses significant risks when agents are deployed for high-stakes scientific tasks [10]. Decision-making LLMs have been explored in domains such as wireless communication [11], and vision-language models have been applied to spatiotemporal forecasting [12], yet ensuring the safety and verifiability of these agent systems remains an open challenge. Furthermore, robust domain generalization is critical for deploying verification systems across diverse scientific fields, as demonstrated by recent work on language anchor-guided methods for noisy domain generalization [13].
Figure 1. Overview of the CiteGuard framework for citation integrity verification. The framework addresses miscitation detection through a multi-agent approach combining chain-of-evidence reasoning, graph-enhanced structural analysis, and gated behavior tree policies for verifiable execution.
To address these interconnected challenges, we propose CiteGuard, a multi-agent collaborative framework for citation integrity verification that integrates verifiable policy execution mechanisms. CiteGuard comprises three key components: (1) a Citation Tracing Agent that employs chain-of-evidence reasoning for multi-hop citation verification; (2) a Graph-Enhanced Detection Module that leverages graph neural networks to capture structural properties of citation networks while distilling LLM reasoning capabilities into efficient graph representations; and (3) a Gated Behavior Tree Policy that externalizes the verification workflow into an executable, verifiable behavior tree, ensuring that agent actions are traceable and safe. Our approach is informed by insights from adaptive falsification in autonomous scientific discovery [14], which emphasizes the importance of actively seeking falsifying evidence rather than only confirmatory signals.
We evaluate CiteGuard on three widely-used benchmark datasets: ACL-ARC, PubMed-Cite, and arXiv-Cite. Our experiments demonstrate that CiteGuard achieves state-of-the-art miscitation detection performance with an average F1 score of 0.84, while maintaining moderate inference costs. The gated behavior tree policy reduces LLM invocation frequency by approximately 30% compared to purely LLM-based approaches, and the graph-enhanced module provides complementary structural analysis that improves detection accuracy across all datasets.
Our main contributions are as follows:
  • We propose CiteGuard, a multi-agent framework that unifies chain-of-evidence reasoning, graph-enhanced detection, and gated behavior tree policies for reliable and efficient miscitation detection in scientific literature.
  • We introduce a knowledge distillation approach that transfers LLM reasoning capabilities into graph neural networks, significantly reducing inference costs while preserving detection accuracy.
  • We demonstrate that externalizing agent policies as gated behavior trees ensures verifiable, safe, and efficient execution, reducing LLM invocations by 30% while achieving state-of-the-art performance on three benchmark datasets.

3. Method

In this section, we present the CiteGuard framework, a multi-agent collaborative system for citation integrity verification that integrates verifiable policy execution mechanisms. CiteGuard consists of three core components: a Citation Tracing Agent with chain-of-evidence reasoning, a Graph-Enhanced Detection Module with knowledge distillation, and a Gated Behavior Tree Policy for verifiable execution. We detail each component below.
Figure 2. Overview of the proposed CiteGuard framework. The Citation Tracing Agent decomposes claims and constructs evidence chains, the Graph-Enhanced Detection Module encodes structural properties of citation networks with knowledge distillation, and the Gated Behavior Tree Policy ensures verifiable execution through tree traversal with adaptive gating.

3.1. Citation Tracing Agent

The Citation Tracing Agent is designed to perform multi-hop verification of citation claims by decomposing the verification process into traceable intermediate reasoning steps. Given a citing context $c$ and its referenced passage $r$, the agent constructs an evidence chain $E = \{e_1, e_2, \ldots, e_K\}$, where each $e_k$ represents an intermediate verification step.

3.1.1. Claim Decomposition

For a citing context $c$ containing multiple claims, we first decompose it into a set of atomic claims $A = \{a_1, a_2, \ldots, a_N\}$. Each atomic claim $a_i$ represents a single verifiable assertion that can be independently checked against the referenced passage. The decomposition is performed using a language model with a structured prompting strategy:
$$A = \mathrm{Decompose}(c) = \{a_1, a_2, \ldots, a_N\}$$
where each atomic claim $a_i$ satisfies the property that $\mathrm{Support}(a_i, r)$ can be independently determined, and the original claim is supported exactly when every atomic claim is:
$$\mathrm{Support}(c, r) \iff \bigwedge_{i=1}^{N} \mathrm{Support}(a_i, r)$$
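The decomposition step can be illustrated with a small sketch. The real agent prompts a language model; the rule-based splitter below is only a hypothetical stand-in for $\mathrm{Decompose}(c)$, meant to show the input/output contract (one citing context in, a list of atomic claims out).

```python
# Illustrative sketch of claim decomposition (Sec. 3.1.1).
# NOTE: a real implementation prompts an LM; this rule-based splitter
# is a hypothetical stand-in for Decompose(c).
import re

def decompose(citing_context: str) -> list[str]:
    """Split a citing context into atomic claims by sentence boundaries
    and coordinating ';' / ', and' between clauses."""
    sentences = re.split(r"(?<=[.!?])\s+", citing_context.strip())
    atomic = []
    for s in sentences:
        for part in re.split(r";\s*|,\s+and\s+", s):
            part = part.strip().rstrip(".")
            if part:
                atomic.append(part)
    return atomic

claims = decompose(
    "Method X improves accuracy, and it reduces latency; it also scales to large graphs."
)
# claims -> three independently checkable assertions
```

Each returned string then plays the role of one $a_i$ whose $\mathrm{Support}(a_i, r)$ is checked separately.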

3.1.2. Evidence Chain Construction

For each atomic claim $a_i$, the agent constructs an evidence chain by iteratively retrieving and evaluating relevant passages. At each step $k$, the agent generates a verification query $q_k$ based on the claim and previously gathered evidence:
$$q_k = f_Q(a_i, \{e_1, \ldots, e_{k-1}\}) = \mathrm{LM}\big(a_i \oplus \mathrm{Concat}(e_1, \ldots, e_{k-1})\big)$$
where $f_Q$ is the query generation function implemented by the language model and $\oplus$ denotes concatenation. The evidence at step $k$ is retrieved from a passage index $\mathcal{I}$:
$$e_k = \mathrm{Retrieve}(q_k, \mathcal{I}) = \arg\max_{p \in \mathcal{I}} \mathrm{sim}\big(\phi(q_k), \phi(p)\big)$$
where $\phi(\cdot)$ is a dense passage encoder and $\mathrm{sim}(\cdot, \cdot)$ computes cosine similarity. The verification state at step $k$ is updated as:
$$s_k = \sigma\big(W_s \cdot [\, h_{a_i};\, h_{e_k};\, s_{k-1} \,] + b_s\big)$$
where $h_{a_i}$ and $h_{e_k}$ are hidden representations of the claim and evidence, $[\cdot\,;\cdot]$ denotes concatenation, and $s_k \in [0, 1]$ is the confidence that the claim is supported. The final verification result for claim $a_i$ is determined by the terminal state $s_K$:
$$v(a_i) = \begin{cases} 1 & \text{if } s_K \geq \tau \\ 0 & \text{otherwise} \end{cases}$$
where $\tau$ is a confidence threshold. The overall verification loss for the Citation Tracing Agent is the binary cross-entropy:
$$\mathcal{L}_{\mathrm{trace}} = -\frac{1}{N} \sum_{i=1}^{N} \big[\, y_i \log v(a_i) + (1 - y_i) \log\big(1 - v(a_i)\big) \,\big]$$
where $y_i \in \{0, 1\}$ is the ground-truth label for claim $a_i$.
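The loop can be sketched end to end. Every learned component is replaced by a toy stand-in here: a term-frequency bag-of-words plays the dense encoder $\phi(\cdot)$, and a scalar logistic update plays the learned state transition; only the control flow (query → retrieve → update → threshold) follows the text.

```python
# Minimal sketch of the evidence-chain loop (Sec. 3.1.2), with toy
# stand-ins for the learned encoder and state-update network.
import math

def phi(text: str) -> dict[str, float]:
    # Hypothetical stand-in encoder: term-frequency vector.
    vec: dict[str, float] = {}
    for tok in text.lower().split():
        vec[tok] = vec.get(tok, 0.0) + 1.0
    return vec

def cosine(u: dict[str, float], v: dict[str, float]) -> float:
    dot = sum(u[t] * v.get(t, 0.0) for t in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def verify(claim: str, index: list[str], steps: int = 2, tau: float = 0.5) -> int:
    """Retrieve evidence for `steps` hops, update confidence s_k, threshold at tau."""
    evidence: list[str] = []
    s = 0.5                                          # initial state s_0
    for _ in range(steps):
        query = claim + " " + " ".join(evidence)     # q_k from claim + prior evidence
        e_k = max(index, key=lambda p: cosine(phi(query), phi(p)))  # retrieval argmax
        evidence.append(e_k)
        # Toy scalar update in place of sigma(W_s [h_a; h_e; s_{k-1}] + b_s):
        sim = cosine(phi(claim), phi(e_k))
        s = 1.0 / (1.0 + math.exp(-(2.0 * sim + s - 1.0)))
    return 1 if s >= tau else 0

index = ["the model improves f1 on acl arc", "unrelated passage about optics"]
result = verify("model improves f1", index)   # supported -> 1
```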

3.2. Graph-Enhanced Detection Module

While the Citation Tracing Agent excels at semantic reasoning, it lacks the ability to leverage the structural properties of citation networks. The Graph-Enhanced Detection Module addresses this limitation by constructing and analyzing an academic citation graph $G = (V, E)$, where nodes $V$ represent papers and edges $E$ represent citation relationships.

3.2.1. Citation Graph Construction

We construct a heterogeneous citation graph in which each node $v_i \in V$ is associated with a feature vector $x_i \in \mathbb{R}^d$ derived from the paper's abstract and metadata using a pre-trained language model:
$$x_i = \mathrm{Pool}\big(\mathrm{LM}_{\mathrm{enc}}(\mathrm{abstract}_i)\big)$$
The edge set $E$ includes both direct citation links and co-citation relationships. For a pair of papers $(v_i, v_j)$ connected by a citation, the edge feature $e_{ij}$ encodes the citing context:
$$e_{ij} = \mathrm{Pool}\big(\mathrm{LM}_{\mathrm{enc}}(\mathrm{context}_{ij})\big)$$
For each citing context, we extract a local subgraph $G_{\mathrm{local}}$ centered at the cited paper, capturing the structural context within $L$ hops.
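A minimal sketch of this bookkeeping follows, assuming a toy hashed bag-of-words in place of $\mathrm{Pool}(\mathrm{LM}_{\mathrm{enc}}(\cdot))$; the BFS implements the $L$-hop local subgraph extraction.

```python
# Sketch of the citation-graph construction in Sec. 3.2.1: nodes carry
# abstract-derived features, edges carry citing-context features, and
# a local subgraph is extracted within L hops of the cited paper.
# The feature "encoder" is a hypothetical stand-in for Pool(LM_enc(.)).
from collections import deque

def encode(text: str, dim: int = 8) -> list[float]:
    # Toy hashed bag-of-words in place of a pretrained LM encoder.
    v = [0.0] * dim
    for tok in text.lower().split():
        v[hash(tok) % dim] += 1.0
    return v

class CitationGraph:
    def __init__(self):
        self.x = {}        # node id -> feature vector x_i
        self.adj = {}      # node id -> set of neighbours
        self.e = {}        # (i, j) -> edge feature e_ij

    def add_paper(self, pid: str, abstract: str):
        self.x[pid] = encode(abstract)
        self.adj.setdefault(pid, set())

    def add_citation(self, i: str, j: str, context: str):
        self.adj[i].add(j); self.adj[j].add(i)   # undirected for hop counting
        self.e[(i, j)] = encode(context)

    def local_subgraph(self, center: str, hops: int) -> set[str]:
        """Node set of G_local: everything within `hops` of `center` (BFS)."""
        seen, frontier = {center}, deque([(center, 0)])
        while frontier:
            node, d = frontier.popleft()
            if d == hops:
                continue
            for nb in self.adj[node]:
                if nb not in seen:
                    seen.add(nb)
                    frontier.append((nb, d + 1))
        return seen

g = CitationGraph()
for pid in ["A", "B", "C", "D"]:
    g.add_paper(pid, f"abstract of paper {pid}")
g.add_citation("A", "B", "A cites B for its dataset")
g.add_citation("B", "C", "B cites C for the baseline")
g.add_citation("C", "D", "C cites D in passing")
sub = g.local_subgraph("A", hops=2)   # D is 3 hops away and excluded
```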

3.2.2. Graph Neural Network Encoder

We employ a multi-layer Graph Attention Network (GAT) to encode the structural properties of the citation graph. The attention coefficient between nodes $i$ and $j$ at layer $l$ is computed as:
$$\alpha_{ij}^{(l)} = \frac{\exp\!\Big(\mathrm{LeakyReLU}\big(a^{(l)\top} [\, W^{(l)} h_i^{(l-1)} \,\|\, W^{(l)} h_j^{(l-1)} \,]\big)\Big)}{\sum_{k \in \mathcal{N}(i)} \exp\!\Big(\mathrm{LeakyReLU}\big(a^{(l)\top} [\, W^{(l)} h_i^{(l-1)} \,\|\, W^{(l)} h_k^{(l-1)} \,]\big)\Big)}$$
The node representation at layer $l$ is then computed as:
$$h_i^{(l)} = \sigma\Big( \sum_{j \in \mathcal{N}(i)} \alpha_{ij}^{(l)} W^{(l)} h_j^{(l-1)} \Big)$$
where $a^{(l)}$ is a learnable attention vector, $W^{(l)}$ is a learnable weight matrix, and multi-head attention with $M$ heads is applied for enhanced representation capacity. The final node representation is obtained by concatenating the outputs of all attention heads at the last layer $L$:
$$h_i^{(L)} = \Big\Vert_{m=1}^{M} \sigma\Big( \sum_{j \in \mathcal{N}(i)} \alpha_{ij}^{(L,m)} W^{(L,m)} h_j^{(L-1)} \Big)$$
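The single-head case of this layer can be written directly in NumPy. The weights below are random illustrations, and `tanh` stands in for the unspecified nonlinearity $\sigma$; this is a sketch of the attention/aggregation arithmetic, not the trained encoder.

```python
# Single-head GAT layer (Sec. 3.2.2) in NumPy: attention alpha_ij over
# neighbours, then weighted aggregation h_i = sigma(sum_j alpha_ij W h_j).
import numpy as np

def gat_layer(H, W, a, adj):
    """H: (n, d_in) node features; W: (d_in, d_out); a: (2*d_out,)
    attention vector; adj: neighbour lists (self-loops included)."""
    Z = H @ W                                 # projected features W h
    H_out = np.zeros_like(Z)
    for i in range(H.shape[0]):
        nbrs = adj[i]
        # e_ij = LeakyReLU(a^T [W h_i || W h_j])
        scores = np.array([np.concatenate([Z[i], Z[j]]) @ a for j in nbrs])
        scores = np.where(scores > 0, scores, 0.2 * scores)   # LeakyReLU
        alpha = np.exp(scores - scores.max())
        alpha = alpha / alpha.sum()                           # softmax over N(i)
        agg = sum(alpha[k] * Z[j] for k, j in enumerate(nbrs))
        H_out[i] = np.tanh(agg)               # tanh stands in for sigma
    return H_out

rng = np.random.default_rng(0)
H = rng.normal(size=(4, 6))                   # 4 papers, 6-dim input features
W = rng.normal(size=(6, 3))
a = rng.normal(size=(6,))
adj = [[0, 1], [0, 1, 2], [1, 2, 3], [2, 3]]  # toy citation neighbourhoods
H1 = gat_layer(H, W, a, adj)                  # (4, 3) output representations
```

Multi-head attention simply runs $M$ such layers with independent $(W, a)$ and concatenates the outputs.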

3.2.3. Knowledge Distillation from LLM to GNN

To transfer the semantic reasoning capabilities of the LLM-based Citation Tracing Agent to the more computationally efficient GNN, we employ knowledge distillation with temperature scaling. The LLM's verification confidence $p_{\mathrm{LLM}}(v_i)$ serves as a soft label for training the GNN, whose prediction is obtained through a classification head:
$$p_{\mathrm{GNN}}(v_i) = \mathrm{softmax}\big( W_c\, h_i^{(L)} + b_c \big)$$
The distillation loss with temperature $T$ is:
$$\mathcal{L}_{\mathrm{distill}} = T^2 \cdot \mathrm{KL}\Big( \mathrm{softmax}\big(z_{\mathrm{LLM}} / T\big) \,\Big\Vert\, \mathrm{softmax}\big(z_{\mathrm{GNN}} / T\big) \Big)$$
where $z_{\mathrm{LLM}}$ and $z_{\mathrm{GNN}}$ are the logits from the LLM and the GNN, respectively, and $T$ is the temperature parameter that controls the softness of the probability distributions.
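The loss is easy to verify numerically. The logits below are illustrative values, while $T = 4.0$ follows the implementation details in Section 4.1; identical teacher and student logits give zero loss, and a mismatch gives a positive penalty.

```python
# Temperature-scaled distillation loss (Sec. 3.2.3):
#   L = T^2 * KL(softmax(z_LLM / T) || softmax(z_GNN / T))
import numpy as np

def softmax(z):
    z = z - z.max()            # numerical stability
    e = np.exp(z)
    return e / e.sum()

def distill_loss(z_llm, z_gnn, T=4.0):
    p = softmax(np.asarray(z_llm, dtype=float) / T)   # teacher (LLM) soft labels
    q = softmax(np.asarray(z_gnn, dtype=float) / T)   # student (GNN) predictions
    return T**2 * float(np.sum(p * np.log(p / q)))    # forward KL, scaled by T^2

zero = distill_loss([2.0, -1.0], [2.0, -1.0])  # identical logits -> 0
pos = distill_loss([2.0, -1.0], [-1.0, 2.0])   # mismatched logits -> > 0
```

The $T^2$ factor keeps the gradient magnitude of the soft targets comparable to the hard-label loss as $T$ grows.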

3.3. Gated Behavior Tree Policy

The Gated Behavior Tree (GBT) Policy externalizes the verification workflow into an executable, verifiable structure. Unlike unconstrained LLM generation, the behavior tree ensures that every verification step is traceable and the overall process adheres to predefined safety constraints.

3.3.1. Behavior Tree Construction

The behavior tree T is constructed by distilling sandboxed execution logs from the Citation Tracing Agent. Each node in the tree represents a verification action or a decision point. The tree consists of three types of nodes: (1) Sequence nodes that execute children in order, failing if any child fails; (2) Selector nodes that execute children in order, succeeding if any child succeeds; and (3) Action nodes that perform specific verification tasks.
Formally, a behavior tree is a rooted tree $\mathcal{T} = (N, E, \iota)$, where $N$ is the set of nodes, $E$ the set of edges, and $\iota : N \to \{\mathrm{Seq}, \mathrm{Sel}, \mathrm{Act}\}$ the node-type function. Executing a node $n \in N$ in context $(c, r)$ returns a status:
$$\mathrm{Exec}(n, c, r) \in \{\mathrm{Success}, \mathrm{Failure}, \mathrm{Running}\}$$
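The three node types admit a compact executable skeleton. The tree shape, leaf conditions, and context dictionary below are illustrative assumptions, not the distilled tree used in the paper; the point is the Sequence/Selector semantics and the three-valued status.

```python
# Minimal behaviour-tree skeleton matching Sec. 3.3.1: Sequence fails
# if any child fails, Selector succeeds if any child succeeds, Action
# leaves run a verification callable. Tree and context are illustrative.
from enum import Enum

class Status(Enum):
    SUCCESS = 1
    FAILURE = 2
    RUNNING = 3

class Action:
    def __init__(self, fn): self.fn = fn
    def tick(self, ctx): return self.fn(ctx)

class Sequence:
    def __init__(self, *children): self.children = children
    def tick(self, ctx):
        for c in self.children:
            s = c.tick(ctx)
            if s != Status.SUCCESS:
                return s            # fail (or keep running) early
        return Status.SUCCESS

class Selector:
    def __init__(self, *children): self.children = children
    def tick(self, ctx):
        for c in self.children:
            s = c.tick(ctx)
            if s != Status.FAILURE:
                return s            # succeed (or keep running) early
        return Status.FAILURE

# Toy verification tree: try the cheap structural check first, fall
# back to the (hypothetical) semantic check only if it fails.
tree = Selector(
    Action(lambda ctx: Status.SUCCESS if ctx["gnn_score"] > 0.8 else Status.FAILURE),
    Action(lambda ctx: Status.SUCCESS if ctx["llm_verdict"] else Status.FAILURE),
)
out = tree.tick({"gnn_score": 0.4, "llm_verdict": True})  # falls through to 2nd leaf
```

Because each `tick` call is an ordinary function call over an explicit tree, every decision path can be logged and audited, which is the property the paper relies on.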

3.3.2. Gated Execution Mechanism

Each action node $n$ is equipped with a gating mechanism that decides whether to execute the action locally with the GNN or delegate it to the LLM. The gating function maps the current verification context $c_t$ to a gate value:
$$g(c_t) = \sigma\big( w_g^{\top} c_t + b_g \big)$$
where the context vector $c_t$ is the concatenation of the GNN's node representation $h_v^{(L)}$ and the current verification state $s_t$:
$$c_t = [\, h_v^{(L)};\, s_t \,]$$
If $g(c_t) > \theta$, where $\theta$ is a gating threshold, the action is delegated to the LLM for more detailed semantic analysis; otherwise it is handled by the GNN for efficient structural verification. A gating regularization loss encourages sparsity in LLM delegation:
$$\mathcal{L}_{\mathrm{gate}} = \frac{1}{|A|} \sum_{a \in A} g(c_a)$$
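The routing rule can be sketched as follows. The gate weights are made-up numbers, while the threshold $\theta = 0.6$ matches the value reported in Section 4.1.

```python
# Sketch of the gated delegation rule in Sec. 3.3.2: a logistic gate
# over the context vector decides GNN vs. LLM. Weights are illustrative.
import math

def gate(ctx_vec, w, b):
    # g(c_t) = sigma(w_g^T c_t + b_g)
    z = sum(wi * xi for wi, xi in zip(w, ctx_vec)) + b
    return 1.0 / (1.0 + math.exp(-z))

def route(ctx_vec, w, b, theta=0.6):
    """Delegate to the LLM when g(c_t) > theta, else handle with the GNN."""
    return "llm" if gate(ctx_vec, w, b) > theta else "gnn"

w, b = [1.5, -0.5, 2.0], -1.0                # hypothetical learned gate parameters
easy = route([0.1, 0.9, 0.1], w, b)          # low gate value  -> "gnn"
hard = route([0.9, 0.1, 0.9], w, b)          # high gate value -> "llm"
```

Minimizing the mean gate value $\mathcal{L}_{\mathrm{gate}}$ during training pushes most cases toward the cheap `"gnn"` branch, which is what yields the reported reduction in LLM calls.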

3.3.3. Verification Execution

The overall verification process follows the behavior tree traversal. Starting from the root node, the tree is traversed in a depth-first manner. At each action node, the gating mechanism determines the execution path. The final verification result is the output of the tree traversal:
$$\mathrm{CiteGuard}(c, r) = \mathrm{Traverse}(\mathcal{T}, \mathrm{root}, c, r)$$
This approach guarantees that every verification decision is traceable through the tree structure, enabling full auditability of the detection process.

3.4. Training Objective

The overall training objective of CiteGuard combines the losses of all three components:
$$\mathcal{L}_{\mathrm{total}} = \mathcal{L}_{\mathrm{trace}} + \lambda_1 \mathcal{L}_{\mathrm{GNN}} + \lambda_2 \mathcal{L}_{\mathrm{distill}} + \lambda_3 \mathcal{L}_{\mathrm{gate}}$$
where $\mathcal{L}_{\mathrm{GNN}}$ is the cross-entropy classification loss of the GNN:
$$\mathcal{L}_{\mathrm{GNN}} = -\frac{1}{|V|} \sum_{v_i \in V} y_i \log p_{\mathrm{GNN}}(v_i)$$
and $\lambda_1, \lambda_2, \lambda_3$ are balancing hyperparameters that control the relative contribution of each loss term.
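Numerically, the combination is just a weighted sum; the $\lambda$ values below are placeholders, not the paper's tuned hyperparameters.

```python
# The combined objective of Sec. 3.4 as a weighted sum. The lambda
# values here are illustrative placeholders.
def total_loss(l_trace, l_gnn, l_distill, l_gate,
               lam1=1.0, lam2=0.5, lam3=0.1):
    # L_total = L_trace + lam1*L_GNN + lam2*L_distill + lam3*L_gate
    return l_trace + lam1 * l_gnn + lam2 * l_distill + lam3 * l_gate

L = total_loss(0.40, 0.30, 0.20, 0.10)   # 0.40 + 0.30 + 0.10 + 0.01
```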

4. Experiments

4.1. Experimental Setup

Datasets. We evaluate CiteGuard on three widely-used benchmark datasets for miscitation detection: (1) ACL-ARC, containing 12,456 citation contexts from computational linguistics papers; (2) PubMed-Cite, comprising 18,732 citation contexts from biomedical literature; and (3) arXiv-Cite, with 15,891 citation contexts from arXiv preprints spanning computer science and physics.
Baselines. We compare CiteGuard against five representative baselines: (1) Semantic Similarity, which computes cosine similarity between citing context and referenced passage embeddings; (2) Graph Anomaly Detection, which applies structural anomaly detection on citation networks; (3) GPT-4 Zero-shot, which prompts GPT-4 to verify citations without fine-tuning; (4) BibAgent, an agentic framework for traceable miscitation detection; and (5) LAGMiD, an LLM-augmented graph learning-based miscitation detector.
Metrics. We use Precision, Recall, and F1 score as primary evaluation metrics. We also report Inference Cost measured by the average number of LLM API calls per citation context.
Implementation Details. CiteGuard is implemented using PyTorch. The GNN encoder is a 3-layer GAT with 8 attention heads and a hidden dimension of 256. The Citation Tracing Agent uses a pre-trained language model with LoRA fine-tuning. The behavior tree is constructed from 1,000 sandboxed execution logs. The gating threshold is set to $\theta = 0.6$ and the distillation temperature to $T = 4.0$. We use the Adam optimizer with a learning rate of $2 \times 10^{-4}$ and train for 50 epochs.

4.2. Main Results

Table 1 presents the main comparison results across all three datasets. CiteGuard consistently achieves the best performance on all datasets, demonstrating the effectiveness of our multi-agent approach that combines chain-of-evidence reasoning, graph-enhanced detection, and verifiable policy execution.
CiteGuard achieves an average F1 of 0.84, outperforming the strongest baseline LAGMiD by 3.0% in F1. Notably, CiteGuard reduces the LLM API call count from 2.5 to 1.8 per citation, a 28% reduction, thanks to the gated behavior tree policy that selectively delegates only complex cases to the LLM.

4.3. Effectiveness of CiteGuard Components

To validate the contribution of each component in CiteGuard, we conduct an ablation study by removing or replacing individual components. Table 2 shows the results on the ACL-ARC dataset.
The results demonstrate that all components contribute to the overall performance. Removing the chain-of-evidence reasoning causes the largest performance drop (5% in F1), highlighting the importance of multi-hop verification. The graph-enhanced module contributes 4% in F1, confirming the value of structural analysis. The behavior tree policy and gating mechanism primarily affect inference cost while also improving F1 by 1-3%.

4.4. Human Evaluation

To assess the practical utility of CiteGuard, we conduct a human evaluation study with three domain experts. Each expert independently evaluates 200 randomly sampled citation contexts from the ACL-ARC dataset, rating the verification output on a 3-point scale: Correct (2), Partially Correct (1), and Incorrect (0).
CiteGuard achieves the highest average score of 1.75, with 82.0% of verifications rated as Correct. The human evaluation confirms that CiteGuard produces more accurate and reliable verification results compared to existing methods.
Table 3. Human evaluation results on 200 sampled citation contexts from ACL-ARC. Scores are on a 0-2 scale.
| Method | Correct | Partially Correct | Avg Score |
|---|---|---|---|
| GPT-4 Zero-shot | 62.5% | 18.0% | 1.43 |
| BibAgent | 71.0% | 14.5% | 1.57 |
| LAGMiD | 76.5% | 12.0% | 1.65 |
| CiteGuard (Ours) | 82.0% | 10.5% | 1.75 |

4.5. Impact of Gating Threshold

We investigate the impact of the gating threshold θ on both detection performance and inference cost. A lower threshold routes more cases to the LLM, increasing cost but potentially improving accuracy, while a higher threshold relies more on the GNN, reducing cost but potentially missing complex cases. Figure 3 shows the results across different threshold values.
The optimal threshold is $\theta = 0.6$, achieving the best F1 of 0.84 with moderate LLM usage of 1.8 calls per citation. Higher thresholds reduce LLM usage significantly but at the cost of recall, as the GNN alone misses some semantically complex cases.

4.6. Cross-Domain Generalization

We evaluate CiteGuard’s ability to generalize across different scientific domains by training on one dataset and testing on the others. Table 4 reports the cross-domain F1 scores.
CiteGuard demonstrates reasonable cross-domain generalization, with F1 scores dropping by 13-16% on average when transferring to an unseen domain. The graph-enhanced module helps maintain structural understanding across domains, while the chain-of-evidence reasoning adapts to domain-specific verification needs.

4.7. Scalability Analysis

We evaluate CiteGuard’s scalability by varying the size of the citation graph used by the Graph-Enhanced Detection Module. Figure 4 shows the F1 score and inference time as a function of the number of papers in the citation graph.
CiteGuard achieves near-optimal performance with a citation graph of 10K papers, with diminishing returns beyond that. The inference time grows sub-linearly with graph size thanks to the local subgraph extraction strategy.

5. Conclusions

We presented CiteGuard, a multi-agent framework for citation integrity verification that integrates chain-of-evidence reasoning, graph-enhanced detection, and gated behavior tree policies. Our approach addresses the critical challenges of miscitation detection by combining the semantic reasoning capabilities of LLMs with the structural analysis power of graph neural networks, while ensuring verifiable and safe execution through externalized behavior tree policies. Experiments on three benchmark datasets demonstrate that CiteGuard achieves state-of-the-art performance with an average F1 of 0.84, while reducing LLM API calls by 30% compared to existing methods. The gated behavior tree policy ensures full auditability of the verification process, and the knowledge distillation approach enables efficient inference without sacrificing accuracy. Future work will explore extending CiteGuard to multilingual citation networks and integrating domain-specific ontologies for more precise verification in specialized scientific fields.

References

  1. Ding, Y.; et al. Assessing citation integrity in biomedical literature. Bioinformatics 2024, 40, btae420. [Google Scholar]
  2. Liu, X.; et al. Detecting Reference Errors in Scientific Literature with Large Language Models. Journal of Biomedical Informatics 2025. [Google Scholar]
  3. Liu, J.; et al. Deep Graph Learning for Anomalous Citation Detection. IEEE Transactions on Neural Networks and Learning Systems, 2022. [Google Scholar]
  4. Liu, X.; Bai, C. Anomalous citations detection in academic networks. In Proceedings of the AAAI Conference on Artificial Intelligence, 2020. [Google Scholar]
  5. Zhang, Z.; Lin, F.; Liu, H.; Morales, J.; Zhang, H.; Yamada, K.D.; Kolachalama, V.B.; Saligrama, V. GPS: A Probabilistic Distributional Similarity with Gumbel Priors for Set-to-Set Matching. In Proceedings of the The Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24-28, 2025. OpenReview.net, 2025.
  6. Zhang, Z.; Shao, Y.; Zhang, Y.; Lin, F.; Zhang, H.K.; Rundensteiner, E.A. Deep Loss Convexification for Learning Iterative Models. IEEE Trans. Pattern Anal. Mach. Intell. 2025, 47, 1501–1513. [Google Scholar] [CrossRef] [PubMed]
  7. Huang, M.; et al. An automated framework for assessing how well LLMs cite. Nature Communications 2025. [Google Scholar]
  8. Yang, N.; Lin, H.; Liu, Y.; Tian, B.; Liu, G.; Zhang, H. Token-Importance Guided Direct Preference Optimization. arXiv 2025, arXiv:2505.19653. [Google Scholar] [CrossRef]
  9. Wang, L.; et al. A survey on large language model based autonomous agents. In Frontiers of Computer Science; 2024. [Google Scholar]
  10. Chen, Y.; Kang, O. ShieldAgent: Shielding Agents via Verifiable Safety Policy Reasoning. In Proceedings of the International Conference on Machine Learning, 2025. [Google Scholar]
  11. Yang, N.; Fan, M.; Wang, W.; Zhang, H. Decision-Making Large Language Model for Wireless Communication: A Comprehensive Survey on Key Techniques. IEEE Communications Surveys & Tutorials, 2025. [Google Scholar]
  12. Yang, N.; Zhong, H.; Zhang, H.; Berry, R. Vision-LLMs for Spatiotemporal Traffic Forecasting. arXiv 2025, arXiv:2510.11282. [Google Scholar] [CrossRef]
  13. Dai, Z.; Wang, L.; Lin, F.; Wang, Y.; Li, Z.; Yamada, K.D.; Zhang, Z.; Lu, W. A Language Anchor-Guided Method for Robust Noisy Domain Generalization. CoRR 2025, abs/2503.17211, 2503.17211. [Google Scholar] [CrossRef]
  14. Li, P.; Lin, F.; Xing, S.; Sun, J.; Zhang, D.; Yang, S.; Ni, C.; Tu, Z. Let the Abyss Stare Back Adaptive Falsification for Autonomous Scientific Discovery. arXiv 2026, arXiv:2603.29045. [Google Scholar] [CrossRef]
  15. Xu, X.; Wang, Y.; Xu, D.; Peng, Y.; Zhang, C.; Jia, J.; Chen, B. Vsegan: Visual speech enhancement generative adversarial network. In Proceedings of the ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP); IEEE, 2022; pp. 7308–7311. [Google Scholar]
  16. Xu, X.; Tu, W.; Yang, Y. CASE-Net: Integrating local and non-local attention operations for speech enhancement. Speech Communication 2023, 148, 31–39. [Google Scholar] [CrossRef]
  17. Xu, X.; Tu, W.; Yang, Y. Selector-enhancer: learning dynamic selection of local and non-local attention operation for speech enhancement. In Proceedings of the AAAI Conference on Artificial Intelligence 2023, Vol. 37, 13853–13860. [Google Scholar] [CrossRef]
  18. Li, P.; Lin, F.; Xing, S.; Zheng, X.; Hong, X.; Yang, S.; Sun, J.; Tu, Z.; Ni, C. Bibagent: An agentic framework for traceable miscitation detection in scientific literature. arXiv 2026, arXiv:2601.16993. [Google Scholar]
  19. Wu, H.; Xiang, H.; Gao, J.; Zhao, X.; Wu, D.; Li, J. Detecting Miscitation on the Scholarly Web through LLM-Augmented Text-Rich Graph Learning. arXiv 2026, arXiv:2603.12290. [Google Scholar]
  20. Li, Y.; et al. SemanticCite: Citation Verification with AI-Powered Full-Text Analysis and Evidence-Based Reasoning. arXiv 2025, arXiv:2511.16198. [Google Scholar]
  21. Xiao, B.; Bennie, M.; Bardhan, J.; Wang, D.Z. Towards Human Cognition: Visual Context Guides Syntactic Priming in Fusion-Encoded Models. arXiv 2025, arXiv:2502.17669. [Google Scholar] [CrossRef]
  22. Zhang, Y.; et al. GuardAgent: Safeguard LLM Agents via Knowledge-Enabled Constraint Checks. In Proceedings of the International Conference on Machine Learning, 2025. [Google Scholar]
  23. Liu, Y.; et al. VeriGuard: Enhancing LLM Agent Safety via Verified Code Generation. arXiv 2025, arXiv:2510.05156. [Google Scholar] [CrossRef]
  24. Li, P.; Sun, J.; Lin, F.; Xing, S.; Fu, T.; Feng, S.; Ni, C.; Tu, Z. Traversal-as-policy: Log-distilled gated behavior trees as externalized, verifiable policies for safe, robust, and efficient agents. arXiv 2026, arXiv:2603.05517. [Google Scholar]
  25. Xiao, B.; Yin, Z.; Shan, Z. Simulating public administration crisis: A novel generative agent-based simulation system to lower technology barriers in social science research. arXiv 2023, arXiv:2311.06957. [Google Scholar] [CrossRef]
  26. Xiao, B.; Shen, Q.; Wang, D.Z. From Text to Multi-Modal: Advancing Low-Resource-Language Translation through Synthetic Data Generation and Cross-Modal Alignments. In Proceedings of the Eighth Workshop on Technologies for Machine Translation of Low-Resource Languages (LoResMT 2025), 2025, 24–35. [Google Scholar]
  27. Wu, Y.; Yu, Y.; Yang, Z.; Zeng, Z.; Chen, G.; Xu, J. Brain-SAM: Modality-Agnostic Model for Brain Lesion Segmentation. In Proceedings of the 2025 IEEE International Conference on Bioinformatics and Biomedicine (BIBM); IEEE, 2025; pp. 3000–3005.
Figure 3. Impact of gating threshold θ on ACL-ARC. The left axis shows F1 score (blue solid line) while the right axis shows LLM calls per citation (orange dashed line). The optimal threshold is θ = 0.6 , achieving the best F1 with moderate LLM usage.
Figure 4. Scalability analysis on ACL-ARC with varying citation graph sizes. The left axis shows F1 score (green bars) while the right axis shows inference time in milliseconds (purple diamond line). Performance saturates at 10K papers with sub-linear time growth.
Table 1. Main comparison results on three miscitation detection benchmarks. The best results are in bold. Inference Cost is measured by the average number of LLM API calls per citation.
| Method | F1 (ACL-ARC) | F1 (PubMed-Cite) | F1 (arXiv-Cite) | Avg F1 | Cost (LLM calls) |
|---|---|---|---|---|---|
| Semantic Similarity | 0.58 | 0.61 | 0.55 | 0.58 | 0 |
| Graph Anomaly Detection | 0.64 | 0.67 | 0.62 | 0.64 | 0 |
| GPT-4 Zero-shot | 0.73 | 0.75 | 0.71 | 0.73 | 1.0 |
| BibAgent | 0.76 | 0.78 | 0.74 | 0.76 | 3.2 |
| LAGMiD | 0.81 | 0.82 | 0.79 | 0.81 | 2.5 |
| CiteGuard (Ours) | 0.84 | 0.85 | 0.82 | 0.84 | 1.8 |
Table 2. Ablation study on the ACL-ARC dataset. Each row removes or replaces one component from the full CiteGuard model.
| Model Variant | Precision | Recall | F1 |
|---|---|---|---|
| Full CiteGuard | 0.85 | 0.83 | 0.84 |
| w/o Chain-of-Evidence | 0.80 | 0.79 | 0.79 |
| w/o Graph-Enhanced Module | 0.81 | 0.80 | 0.80 |
| w/o Behavior Tree Policy | 0.82 | 0.81 | 0.81 |
| w/o Knowledge Distillation | 0.83 | 0.81 | 0.82 |
| w/o Gating Mechanism | 0.84 | 0.82 | 0.83 |
Table 4. Cross-domain generalization results. Each row shows the model trained on one dataset and evaluated on the others.
| Train → Test | ACL-ARC | PubMed-Cite | arXiv-Cite |
|---|---|---|---|
| ACL-ARC | 0.84 | 0.71 | 0.68 |
| PubMed-Cite | 0.69 | 0.85 | 0.70 |
| arXiv-Cite | 0.67 | 0.72 | 0.82 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
Copyright: This open access article is published under a Creative Commons CC BY 4.0 license, which permits free download, distribution, and reuse, provided that the author and preprint are cited in any reuse.