Submitted: 23 December 2025
Posted: 24 December 2025
Abstract
Keywords:
1. Introduction

- Fine-grained Entity-Context Extractor (FECE): This module performs a detailed analysis of news content. For text, it extracts not only person (PER), location (LOC), and event (EVT) entities but also analyzes their roles within the text, along with associated modifiers, predicate verbs, and other contextual cues, to construct entity-relation triplets or descriptive phrases. For images, it employs advanced visual understanding models (e.g., open-vocabulary segmentation with semantic-assisted calibration [12], and video object segmentation models leveraging quality-aware dynamic memory [13] or global spectral filter memory networks [14]) to identify potential entity regions and extract their salient visual features.
- Dynamic Evidence Retrieval and Augmentation (DERA): Diverging from methods that rely on simple, pre-defined reference images, DERA dynamically retrieves multiple highly relevant textual descriptions and images from large-scale knowledge bases (e.g., Wikipedia, Freebase) based on the text entities and their context extracted by FECE. Such knowledge-intensive approaches draw parallels with effective knowledge integration strategies demonstrated in various advanced AI systems, including those employing hierarchical Transformers for knowledge graph embeddings [15]. We devise a cross-modal matching scoring mechanism to evaluate the relevance of each retrieved item to both the news text entities and the news image, selecting the most representative and complementary evidence (e.g., entity encyclopedic information, historical event images) for integration.
- Multi-stage Adaptive Verification (MSAV): This module orchestrates the consistency verification process in several adaptive stages. First, an initial cross-modal alignment is performed using a foundational LVLM (e.g., a fine-tuned LLaVA 1.5), providing a preliminary confidence score similar to existing "w/o compositional evidence" setups. Second, the enhanced evidence (text and images) retrieved by DERA is fed into the LVLM using adaptive prompting strategies, enabling joint reasoning with the news image and text. This stage specifically emphasizes entity representations across different evidence sources, leveraging attention mechanisms to strengthen key evidentiary information. Finally, a lightweight fusion network aggregates the results and confidence scores from both stages, producing the final entity-consistency prediction. An adversarial training strategy is integrated to improve the model’s accuracy in identifying inconsistencies across modalities.
- We propose AMCV, a novel Adaptive Multi-modal Contextual Verifier, designed to address the limitations of existing Large Vision-Language Models in cross-modal entity consistency verification by integrating sophisticated contextual understanding and verification mechanisms.
- We introduce a Dynamic Evidence Retrieval and Augmentation (DERA) module that intelligently retrieves and integrates multiple relevant pieces of textual and visual evidence from external knowledge bases, moving beyond static reference images to enhance context awareness.
- We develop a Multi-stage Adaptive Verification (MSAV) framework that performs a hierarchical verification process, combining initial LVLM-based alignment with evidence-fusion reasoning and confidence aggregation, significantly improving robustness and accuracy in identifying entity inconsistencies.
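As a minimal illustration of the textual side of FECE, the sketch below pairs entity mentions (assumed to come from an upstream NER step) with predicate verbs to form entity-relation triplets. It is a deliberately simplified stand-in: the verb list, tokenizer, and left-context heuristic are placeholders for the dependency parsing and role analysis described above, not the module's actual implementation.

```python
import re

# Toy stand-in for FECE's textual context extraction: attach each predicate
# verb to the nearest preceding entity mention (the subject) and the
# trailing clause (the object phrase), yielding (entity, predicate,
# object-phrase) triplets. VERBS and the left-context rule are placeholders
# for a real dependency parse.
VERBS = {"visited", "met", "opened", "attended"}

def extract_triplets(sentence, entities):
    tokens = re.findall(r"[\w']+", sentence)
    triplets = []
    for i, tok in enumerate(tokens):
        if tok.lower() in VERBS:
            left = " ".join(tokens[:i])
            # nearest entity appearing in the left context is the subject
            subj = next((e for e in entities if e in left), None)
            obj = " ".join(tokens[i + 1:])
            if subj and obj:
                triplets.append((subj, tok, obj))
    return triplets

print(extract_triplets("Angela Merkel visited Paris in 2015",
                       ["Angela Merkel", "Paris"]))
# → [('Angela Merkel', 'visited', 'Paris in 2015')]
```

In the full module, such triplets would be produced by a proper parser and paired with region-level visual features rather than heuristics.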
2. Related Work
2.1. Cross-Modal Entity Consistency and Vision-Language Models
2.2. Knowledge-Enhanced Multimodal Verification
3. Method
3.1. Fine-Grained Entity-Context Extractor (FECE)
3.1.1. Textual Context Extraction
3.1.2. Visual Context Extraction
3.2. Dynamic Evidence Retrieval and Augmentation (DERA)
3.2.1. Evidence Retrieval
3.2.2. Cross-Modal Matching and Selection
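The cross-modal matching score introduced with DERA can be sketched as a convex combination of two similarities: evidence-to-entity (textual) and evidence-to-image (visual). In the sketch below, the embeddings are assumed to come from upstream text/image encoders, and the weight `alpha` is a hypothetical hyperparameter, not a value reported in this work.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def rank_evidence(entity_emb, image_emb, evidences, alpha=0.6):
    """Rank retrieved evidence items by a weighted combination of their
    similarity to the news text entity and to the news image.
    Each evidence dict carries an 'id', a 'text_emb', and an 'img_emb'."""
    scored = [
        (alpha * cosine(e["text_emb"], entity_emb)
         + (1 - alpha) * cosine(e["img_emb"], image_emb), e["id"])
        for e in evidences
    ]
    return [eid for _, eid in sorted(scored, reverse=True)]
```

Selecting the top-ranked items then yields the "most representative and complementary" evidence fed into the verification stages.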
3.3. Multi-Stage Adaptive Verification (MSAV)
3.3.1. Stage 1: Initial Cross-Modal Alignment
3.3.2. Stage 2: Evidence Fusion Verification
3.3.3. Stage 3: Confidence Aggregation and Decision
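The final decision rule of the three stages can be summarized as follows. This sketch assumes Stages 1 and 2 each emit a scalar consistency confidence in [0, 1]; the fixed fusion weight and threshold are illustrative placeholders for the lightweight fusion network and adversarially trained decision layer described in the Introduction.

```python
# Minimal sketch of the Stage-3 decision rule: fuse the preliminary
# (Stage-1) and evidence-augmented (Stage-2) confidences, then threshold.
# In the full model this mapping would be learned, not hand-set.

def aggregate(stage1_conf, stage2_conf, w=0.3, threshold=0.5):
    """Weighted fusion of the two stage confidences; `w` weights Stage 1."""
    fused = w * stage1_conf + (1 - w) * stage2_conf
    label = "consistent" if fused >= threshold else "inconsistent"
    return fused, label

print(aggregate(0.4, 0.9))  # → (0.75, 'consistent')
```

The Stage-2 confidence dominates here (w = 0.3), reflecting the intuition that evidence-augmented reasoning is the more informative signal.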
4. Experiments
4.1. Experimental Setup
4.1.1. Datasets
- TamperedNews-Ent: This dataset comprises manually manipulated news image-text pairs, specifically engineered to contain inconsistencies between depicted entities and textual mentions. It includes annotations for Persons (PER), Locations (LOC), and Events (EVT), making it ideal for testing fine-grained consistency detection.
- News400-Ent: Consisting of real-world news image-text pairs, this dataset provides a challenging testbed for our method in authentic journalistic contexts. Like TamperedNews-Ent, it is annotated with PER, LOC, and EVT entities.
- MMG-Ent: This dataset focuses on document-level consistency verification and features three specialized sub-tasks:
  - LCt (Location Consistency Test): Assesses the consistency of location entities.
  - LCo (Location Comparison): Compares location consistency across similar news articles.
  - LCn (Location Novelty): Verifies whether a location is consistent with a provided reference image.
4.1.2. Baselines
- InstructBLIP: Evaluated in two settings:
  - w/o (without compositional evidence): Represents the model’s performance based solely on the original news image and text.
  - comp (with compositional evidence): Enhanced by providing additional reference images related to the entities, as per existing methodologies.
- LLaVA 1.5: Also evaluated in the same two settings:
  - w/o (without compositional evidence): Baseline performance of LLaVA 1.5.
  - comp (with compositional evidence): Performance when augmented with static compositional evidence images.
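The structural difference between the two baseline settings can be made concrete at the prompt level. The template below is purely illustrative; the actual prompts used with InstructBLIP and LLaVA 1.5 are not specified here, and the wording is a hypothetical stand-in.

```python
# Illustrative sketch of the 'w/o' vs. 'comp' inputs to an LVLM:
# the 'comp' setting prepends captions/descriptions of the extra
# reference images, while 'w/o' uses only the original news text.

def build_prompt(text, entity, reference_captions=None):
    prompt = (f"News text: {text}\n"
              f"Question: Is the entity '{entity}' shown in the image "
              f"consistent with the text? Answer yes or no.")
    if reference_captions:  # 'comp' setting: add compositional evidence
        evidence = "\n".join(f"Reference: {c}" for c in reference_captions)
        prompt = evidence + "\n" + prompt
    return prompt
```

AMCV's "enhance" setting differs from 'comp' in that the prepended evidence is retrieved dynamically by DERA rather than drawn from a static reference set.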
4.1.3. Evaluation Metric
4.1.4. Implementation Details
4.2. Performance Comparison
4.3. Ablation Study
- AMCV w/o FECE (Fine-grained Entity-Context Extractor): When FECE is replaced by a simpler entity extraction mechanism (e.g., basic Named Entity Recognition for text and only global image features instead of specific object regions), the performance drops significantly. For instance, accuracy on TamperedNews-Ent (PER) decreases from 0.80 to 0.74. This highlights the critical role of FECE in providing enriched, semantically grounded entity representations from both modalities, which are essential for accurate cross-modal alignment.
- AMCV w/o DERA (Dynamic Evidence Retrieval and Augmentation): If the DERA module is removed, and instead the model relies solely on the original news content (similar to the ‘w/o’ baseline) or a fixed, generic set of reference images (like ‘comp’ baselines), a noticeable performance decrease is observed. For TamperedNews-Ent (PER), the accuracy drops to 0.77. This validates that dynamic, context-aware retrieval of external knowledge is superior to static or absent augmentation strategies, providing crucial disambiguating information and factual grounding.
- AMCV w/o MSAV (Multi-stage Adaptive Verification): When the multi-stage adaptive verification process is simplified (e.g., by directly fusing outputs from an enhanced LVLM without the hierarchical stages and adversarial training), the model’s robustness and accuracy decline. For News400-Ent (EVT), accuracy reduces from 0.88 to 0.86. This indicates that the progressive refinement and confidence aggregation, particularly with the integrated adversarial training strategy, are vital for distinguishing subtle inconsistencies and achieving robust final predictions.
4.4. Human Evaluation
4.5. Analysis of Dynamic Evidence Retrieval (DERA)
4.6. Error Analysis
4.7. Computational Efficiency
5. Conclusion
References
- Luo, G.; Darrell, T.; Rohrbach, A. NewsCLIPpings: Automatic Generation of Out-of-Context Multimodal Media. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing; Association for Computational Linguistics, 2021; pp. 6801–6817.
- Lin, B.; Ye, Y.; Zhu, B.; Cui, J.; Ning, M.; Jin, P.; Yuan, L. Video-LLaVA: Learning United Visual Representation by Alignment Before Projection. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing; Association for Computational Linguistics, 2024; pp. 5971–5984.
- Gui, L.; Wang, B.; Huang, Q.; Hauptmann, A.; Bisk, Y.; Gao, J. KAT: A Knowledge Augmented Transformer for Vision-and-Language. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies; Association for Computational Linguistics, 2022; pp. 956–968.
- Zhou, Y.; Shen, J.; Cheng, Y. Weak to strong generalization for large language models with multi-capabilities. In The Thirteenth International Conference on Learning Representations, 2025.
- Zhou, Y.; Li, X.; Wang, Q.; Shen, J. Visual In-Context Learning for Large Vision-Language Models. In Findings of the Association for Computational Linguistics: ACL 2024, Bangkok, Thailand and virtual meeting, August 11–16, 2024; Association for Computational Linguistics, 2024; pp. 15890–15902.
- Tian, Z.; Lin, Z.; Zhao, D.; Zhao, W.; Flynn, D.; Ansari, S.; Wei, C. Evaluating scenario-based decision-making for interactive autonomous driving using rational criteria: A survey. arXiv 2025, arXiv:2501.01886.
- Zheng, L.; Tian, Z.; He, Y.; Liu, S.; Chen, H.; Yuan, F.; Peng, Y. Enhanced mean field game for interactive decision-making with varied stylish multi-vehicles. arXiv 2025, arXiv:2509.00981.
- Lin, Z.; Tian, Z.; Lan, J.; Zhao, D.; Wei, C. Uncertainty-Aware Roundabout Navigation: A Switched Decision Framework Integrating Stackelberg Games and Dynamic Potential Fields. IEEE Transactions on Vehicular Technology 2025, 1–13.
- Huang, S.; et al. AI-Driven Early Warning Systems for Supply Chain Risk Detection: A Machine Learning Approach. Academic Journal of Computing & Information Science 2025, 8, 92–107.
- Huang, S. Measuring Supply Chain Resilience with Foundation Time-Series Models. European Journal of Engineering and Technologies 2025, 1, 49–56.
- Ren, L.; et al. Real-time Threat Identification Systems for Financial API Attacks under Federated Learning Framework. Academic Journal of Business & Management 2025, 7, 65–71.
- Liu, Y.; Bai, S.; Li, G.; Wang, Y.; Tang, Y. Open-vocabulary segmentation with semantic-assisted calibration. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024; pp. 3491–3500.
- Liu, Y.; Yu, R.; Yin, F.; Zhao, X.; Zhao, W.; Xia, W.; Yang, Y. Learning quality-aware dynamic memory for video object segmentation. In Proceedings of the European Conference on Computer Vision; Springer, 2022; pp. 468–486.
- Liu, Y.; Yu, R.; Wang, J.; Zhao, X.; Wang, Y.; Tang, Y.; Yang, Y. Global spectral filter memory network for video object segmentation. In Proceedings of the European Conference on Computer Vision; Springer, 2022; pp. 648–665.
- Chen, S.; Liu, X.; Gao, J.; Jiao, J.; Zhang, R.; Ji, Y. HittER: Hierarchical Transformers for Knowledge Graph Embeddings. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing; Association for Computational Linguistics, 2021; pp. 10395–10407.
- Islam, K.I.; Kar, S.; Islam, M.S.; Amin, M.R. SentNoB: A Dataset for Analysing Sentiment on Noisy Bangla Texts. In Findings of the Association for Computational Linguistics: EMNLP 2021; Association for Computational Linguistics, 2021; pp. 3265–3271.
- Ao, J.; Wang, R.; Zhou, L.; Wang, C.; Ren, S.; Wu, Y.; Liu, S.; Ko, T.; Li, Q.; Zhang, Y.; et al. SpeechT5: Unified-Modal Encoder-Decoder Pre-Training for Spoken Language Processing. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers); Association for Computational Linguistics, 2022; pp. 5723–5738.
- Liu, F.; Bugliarello, E.; Ponti, E.M.; Reddy, S.; Collier, N.; Elliott, D. Visually Grounded Reasoning across Languages and Cultures. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing; Association for Computational Linguistics, 2021; pp. 10467–10485.
- Liu, L.; Ding, B.; Bing, L.; Joty, S.; Si, L.; Miao, C. MulDA: A Multilingual Data Augmentation Framework for Low-Resource Cross-Lingual NER. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers); Association for Computational Linguistics, 2021; pp. 5834–5846.
- Zhou, D.; Huang, J.; Bai, J.; Wang, J.; Chen, H.; Chen, G.; Hu, X.; Heng, P.A. Magictailor: Component-controllable personalization in text-to-image diffusion models. arXiv 2024, arXiv:2410.13370.
- Huang, J.; Yan, M.; Chen, S.; Huang, Y.; Chen, S. Magicfight: Personalized martial arts combat video generation. In Proceedings of the 32nd ACM International Conference on Multimedia, 2024; pp. 10833–10842.
- Guo, D.; Lu, S.; Duan, N.; Wang, Y.; Zhou, M.; Yin, J. UniXcoder: Unified Cross-Modal Pre-training for Code Representation. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers); Association for Computational Linguistics, 2022; pp. 7212–7225.
- Zhang, F.; Chen, H.; Zhu, Z.; Zhang, Z.; Lin, Z.; Qiao, Z.; Zheng, Y.; Wu, X. A survey on foundation language models for single-cell biology. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2025; pp. 528–549.
- Zhang, F.; Liu, T.; Zhu, Z.; Wu, H.; Wang, H.; Zhou, D.; Zheng, Y.; Wang, K.; Wu, X.; Heng, P.A. CellVerse: Do Large Language Models Really Understand Cell Biology? arXiv 2025, arXiv:2505.07865.
- Zhang, F.; Liu, T.; Chen, Z.; Peng, X.; Chen, C.; Hua, X.S.; Luo, X.; Zhao, H. Semi-supervised knowledge transfer across multi-omic single-cell data. Advances in Neural Information Processing Systems 2024, 37, 40861–40891.
- Liu, F.; Geng, K.; Chen, F. Gone with the Wind? Impacts of Hurricanes on College Enrollment and Completion. Journal of Environmental Economics and Management 2025, 103203.
- Liu, F.; Geng, K.; Jiang, B.; Li, X.; Wang, Q. Community-Based Group Exercises and Depression Prevention Among Middle-Aged and Older Adults in China: A Longitudinal Analysis. Journal of Prevention 2025, 1–20.
- Liu, F.; Liu, Y.; Geng, K. Medical Expenses, Uncertainty and Mortgage Applications. 2024.
- Gera, A.; Halfon, A.; Shnarch, E.; Perlitz, Y.; Ein-Dor, L.; Slonim, N. Zero-Shot Text Classification with Self-Training. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing; Association for Computational Linguistics, 2022; pp. 1107–1119.
- Zhu, C.; Hinthorn, W.; Xu, R.; Zeng, Q.; Zeng, M.; Huang, X.; Jiang, M. Enhancing Factual Consistency of Abstractive Summarization. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies; Association for Computational Linguistics, 2021; pp. 718–733.
- Lee, D.H.; Kadakia, A.; Tan, K.; Agarwal, M.; Feng, X.; Shibuya, T.; Mitani, R.; Sekiya, T.; Pujara, J.; Ren, X. Good Examples Make A Faster Learner: Simple Demonstration-based Learning for Low-resource NER. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers); Association for Computational Linguistics, 2022; pp. 2687–2700.
- Zhou, Y.; Song, L.; Shen, J. MAM: Modular Multi-Agent Framework for Multi-Modal Medical Diagnosis via Role-Specialized Collaboration. In Findings of the Association for Computational Linguistics: ACL 2025, Vienna, Austria, 2025; pp. 25319–25333.
- Huang, J.; Zhou, D.; Liu, J.; Shi, L.; Chen, S. Ifast: Weakly supervised interpretable face anti-spoofing from single-shot binocular NIR images. IEEE Transactions on Information Forensics and Security 2024.
- Agarwal, O.; Ge, H.; Shakeri, S.; Al-Rfou, R. Knowledge Graph Based Synthetic Corpus Generation for Knowledge-Enhanced Language Model Pre-training. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies; Association for Computational Linguistics, 2021; pp. 3554–3565.
- Kottur, S.; Moon, S.; Geramifard, A.; Damavandi, B. SIMMC 2.0: A Task-oriented Dialog Dataset for Immersive Multimodal Conversations. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing; Association for Computational Linguistics, 2021; pp. 4903–4912.
- Chen, J.; Tang, J.; Qin, J.; Liang, X.; Liu, L.; Xing, E.; Lin, L. GeoQA: A Geometric Question Answering Benchmark Towards Multimodal Numerical Reasoning. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021; Association for Computational Linguistics, 2021; pp. 513–523.
- Luo, M.; Zeng, Y.; Banerjee, P.; Baral, C. Weakly-Supervised Visual-Retriever-Reader for Knowledge-based Question Answering. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing; Association for Computational Linguistics, 2021; pp. 6417–6431.
- Li, Z.; Xu, B.; Zhu, C.; Zhao, T. CLMLF: A Contrastive Learning and Multi-Layer Fusion Method for Multimodal Sentiment Detection. In Findings of the Association for Computational Linguistics: NAACL 2022; Association for Computational Linguistics, 2022; pp. 2282–2294.
- Hu, G.; Lin, T.E.; Zhao, Y.; Lu, G.; Wu, Y.; Li, Y. UniMSE: Towards Unified Multimodal Sentiment Analysis and Emotion Recognition. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing; Association for Computational Linguistics, 2022; pp. 7837–7851.
- Weng, Y.; Zhu, M.; Xia, F.; Li, B.; He, S.; Liu, S.; Sun, B.; Liu, K.; Zhao, J. Large Language Models are Better Reasoners with Self-Verification. In Findings of the Association for Computational Linguistics: EMNLP 2023; Association for Computational Linguistics, 2023; pp. 2550–2575.
- Yang, X.; Feng, S.; Zhang, Y.; Wang, D. Multimodal Sentiment Detection Based on Multi-channel Graph Neural Networks. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers); Association for Computational Linguistics, 2021; pp. 328–339.



| Model | Setting | TamperedNews PER | TamperedNews LOC | TamperedNews EVT | News400 PER | News400 LOC | News400 EVT | MMG LCt | MMG LCo | MMG LCn |
|---|---|---|---|---|---|---|---|---|---|---|
| InstructBLIP | w/o | 0.66 | 0.81 | 0.76 | 0.68 | 0.75 | 0.79 | 0.63 | 0.30 | 0.59 |
| InstructBLIP | comp | 0.73 | 0.78 | 0.72 | 0.71 | 0.67 | 0.85 | - | - | - |
| LLaVA 1.5 | w/o | 0.61 | 0.79 | 0.71 | 0.63 | 0.70 | 0.57 | 0.70 | 0.48 | 0.27 |
| LLaVA 1.5 | comp | 0.78 | 0.73 | 0.77 | 0.77 | 0.70 | 0.85 | - | - | - |
| Ours (AMCV) | enhance | 0.80 | 0.82 | 0.79 | 0.79 | 0.76 | 0.88 | 0.73 | 0.52 | 0.31 |
| Model | Setting | TamperedNews PER | TamperedNews EVT | News400 PER | News400 EVT |
|---|---|---|---|---|---|
| AMCV w/o FECE | Simple NER, no visual regions | 0.74 | 0.72 | 0.72 | 0.81 |
| AMCV w/o DERA | Static ‘comp’-equivalent evidence | 0.77 | 0.76 | 0.75 | 0.84 |
| AMCV w/o MSAV | Single-stage fusion | 0.78 | 0.77 | 0.76 | 0.86 |
| AMCV (Full) | enhanced | 0.80 | 0.79 | 0.79 | 0.88 |
| Error Category | Total Errors (%) | P-I (%) | N-I (%) | Contributing Factors |
|---|---|---|---|---|
| Subtle Visual Mismatches | 35.2 | 28.1 | 7.1 | Low visual fidelity, complex scenes, obscured entities |
| Complex Contextual Nuances | 28.5 | 19.3 | 9.2 | Idiomatic expressions, sarcasm, highly abstract events |
| Ambiguous External Evidence | 17.8 | 10.5 | 7.3 | Contradictory KBs, outdated information, limited retrieval capacity |
| Domain Gaps | 12.3 | 8.2 | 4.1 | Highly specialized entities such as obscure historical figures or technical events |
| Other | 6.2 | 3.9 | 2.3 | Rare entity types and parsing errors |
| Model | Component/Setting | Inference Time (s/sample) | Memory Usage (GB) |
|---|---|---|---|
| InstructBLIP | w/o | 1.25 | 18.1 |
| InstructBLIP | comp | 1.48 | 19.5 |
| LLaVA 1.5 | w/o | 0.98 | 15.6 |
| LLaVA 1.5 | comp | 1.15 | 17.0 |
| AMCV | FECE | 0.35 | 5.2 |
| AMCV | DERA | 0.42 | 3.8 |
| AMCV | MSAV (Stage 1) | 0.98 | 15.6 |
| AMCV | MSAV (Stage 2) | 1.30 | 17.2 |
| AMCV | MSAV (Stage 3) | 0.05 | 0.5 |
| AMCV (Full) | Total | 3.10 | 17.2 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content. |
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.