Preprint
Article

This version is not peer-reviewed.

Adaptive Multi-Modal Contextual Verification for Enhanced Cross-Modal Entity Consistency

Submitted: 23 December 2025

Posted: 24 December 2025


Abstract
The rise of digital media has intensified "context-mismatched" news, where image-text discrepancies erode veracity and trust. Cross-modal Entity Consistency (CEC) verification is crucial, yet existing Large Vision-Language Models struggle with complex entity ambiguity, fine-grained event associations, and insufficient explicit reference information. To address these challenges, we propose an Adaptive Multi-modal Contextual Verifier (AMCV). AMCV incorporates a Fine-grained Entity-Context Extractor, a Dynamic Evidence Retrieval and Augmentation module leveraging external knowledge, and a Multi-stage Adaptive Verification framework. This framework integrates LVLM-based alignment with evidence-fusion reasoning and adversarial training for confidence aggregation. Evaluated zero-shot across benchmark datasets, AMCV consistently outperforms state-of-the-art baselines, showing significant improvements. Ablation studies confirm each module's critical role, and human evaluations validate AMCV's predictions align better with human judgment in challenging scenarios. Our work offers a robust framework for CEC, substantially advancing cross-modal reasoning by intelligently leveraging fine-grained contextual understanding and dynamic external knowledge.
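The three-module design described above can be sketched as a minimal pipeline. This is an illustrative toy, not the authors' implementation: the function names, the set-based entity representation, and the consistency score are all hypothetical stand-ins for the Fine-grained Entity-Context Extractor, the Dynamic Evidence Retrieval and Augmentation module, and the Multi-stage Adaptive Verification framework.

```python
# Hypothetical sketch of an AMCV-style pipeline (illustrative only).
from dataclasses import dataclass


@dataclass
class Claim:
    image_entities: set  # entities grounded in the image
    text_entities: set   # entities mentioned in the caption/article


def extract_entities(image_tags, caption_words):
    # Stand-in for the Fine-grained Entity-Context Extractor:
    # in the paper this would involve LVLM-based fine-grained extraction.
    return Claim(set(image_tags), set(caption_words))


def retrieve_evidence(claim, knowledge_base):
    # Stand-in for Dynamic Evidence Retrieval: keep external facts
    # (modeled here as sets of entities) that mention any claim entity.
    relevant = claim.image_entities | claim.text_entities
    return [fact for fact in knowledge_base if fact & relevant]


def verify(claim, evidence):
    # Stand-in for Multi-stage Adaptive Verification: a toy consistency
    # score combining direct cross-modal overlap with evidence-bridged links.
    direct = len(claim.image_entities & claim.text_entities)
    bridged = sum(
        1 for fact in evidence
        if fact & claim.image_entities and fact & claim.text_entities
    )
    total = max(1, len(claim.image_entities | claim.text_entities))
    return (direct + bridged) / total
```

For example, an image tagged "eiffel tower" and a caption mentioning "paris" share no entity directly, but an external fact linking the two lets the evidence-bridged term raise the consistency score, mirroring how the real framework uses external knowledge to resolve cases with insufficient explicit reference information.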
Copyright: This open access article is published under a Creative Commons CC BY 4.0 license, which permits free download, distribution, and reuse, provided that the author and preprint are cited in any reuse.

Preprints.org is a free preprint server supported by MDPI in Basel, Switzerland.
