Submitted: 03 February 2026
Posted: 04 February 2026
Abstract
Keywords:
I. Introduction
A. Problem Statement
B. Hypothesis
- 1) Multimodal Synergy: Integrating text and image features using a cross-attention mechanism will yield superior detection performance compared to unimodal baselines because it captures the semantic alignment (or misalignment) between the claim and the evidence.
- 2) Advanced Encoders: Using region-specific pre-trained models like MuRIL will significantly improve performance on Indian language datasets compared to generic multilingual models like mBERT, especially for transliterated text.
- 3) Global Visual Context: Vision Transformers (ViT), which utilize self-attention mechanisms, will effectively capture global visual inconsistencies often present in fake news images, outperforming local-feature-based CNNs.
C. Objectives
- To develop a robust preprocessing pipeline for cleaning code-mixed text and normalizing varied image formats.
- To design and implement a dual-stream deep learning architecture that fuses MuRIL-based text embeddings with ViT-based image embeddings.
- To evaluate the system’s performance using standard metrics (Accuracy, Precision, Recall, F1-Score) against existing state-of-the-art baselines.
- To implement and demonstrate an Explainable AI (XAI) module that visualizes model attention, thereby enhancing the interpretability of the results.
II. Related Work
A. Existing Fake News Detection Approaches
B. Gaps in Existing Literature
- Linguistic Bias: Most datasets and models are heavily skewed towards English, neglecting the high volume of misinformation in regional languages. Existing multilingual models often struggle with the code-mixed nature of Indian social media text.
- Unimodal Focus: Many systems ignore the visual component entirely, despite images being a primary driver of social media engagement.
- Black-Box Nature: High-performing deep learning models offer little introspection into their decision-making process, a critical flaw for deployment in sensitive domains like journalism.
C. Our Contribution
- 1) **Multilingual Competence**: We utilize **MuRIL** [6], a BERT model pre-trained specifically on 17 Indian languages. Unlike benchmarks that use mBERT, our choice of MuRIL allows for superior handling of transliterated data (e.g., Hindi written in Latin script), which is omnipresent in our dataset.
- 2) **Next-Gen Visual Encoding**: We depart from standard CNNs and adopt **Vision Transformers (ViT)** [5]. By treating images as sequences of patches, ViT allows our model to understand global visual context via self-attention mechanisms. This is crucial for detecting semantic mismatches that simpler CNNs might overlook.
- 3) **Explainability First**: We integrate a transparent reasoning layer using **SHAP** for text and **Grad-CAM** for images. This allows us to provide pixel-level and token-level evidence for every prediction, effectively "opening the black box" for end-users.
- 4) **Cross-Modal Fusion**: We implement a custom cross-attention fusion layer in which text embeddings attend to image embeddings, explicitly modeling the relationship between the claim and its accompanying visual evidence.
III. Methodology
A. System Architecture
- 1) Dataset Collection & Ingestion: Raw news data (headlines + images) is ingested from the source. The system supports batch processing for high-volume analysis.
- 2) Preprocessing Module:
  - Text: Cleaning (removal of URLs, special characters), Normalization (Unicode standardization), and Tokenization using the MuRIL tokenizer.
  - Image: Resizing to 224x224 pixels and Normalization using ImageNet mean/std values.
- 3) Feature Extraction:
  - Stream A (Text): The MuRIL Encoder processes the tokenized input and outputs a 768-dimensional contextual embedding corresponding to the [CLS] token.
  - Stream B (Image): The ViT-B/16 Encoder processes image patches and outputs 768-dimensional patch embeddings, capturing both local texture and global structure.
- 4) Multimodal Fusion: A Cross-Attention mechanism aligns the two embedding spaces. Text embeddings act as Queries, while Image embeddings act as Keys/Values, allowing the model to focus on image regions relevant to the text (see the sketch after this list).
- 5) Classification Heads:
  - Fake/Real Head: A fully connected layer with Sigmoid activation for binary classification.
  - Claim Head: A separate head for multi-class classification (Politics, Sports, etc.) to provide context.
- 6) Evaluation & XAI: Calculation of performance metrics and generation of saliency maps for user interpretation.
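A minimal PyTorch sketch of the fusion stage (item 4) and the two classification heads (item 5) is given below. The number of attention heads, the claim-class count, and the residual/LayerNorm details are illustrative assumptions rather than values specified in this paper.

```python
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    """Cross-attention fusion: the MuRIL text embedding queries the ViT patch embeddings."""

    def __init__(self, dim: int = 768, num_heads: int = 8, num_claim_classes: int = 5):
        super().__init__()
        # Text acts as the Query; image patches act as Keys/Values.
        self.cross_attn = nn.MultiheadAttention(embed_dim=dim, num_heads=num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.fake_real_head = nn.Linear(dim, 1)               # binary authenticity logit (Sigmoid/BCE)
        self.claim_head = nn.Linear(dim, num_claim_classes)   # topical context (Politics, Sports, ...)

    def forward(self, text_cls: torch.Tensor, image_patches: torch.Tensor):
        # text_cls: (B, 1, 768) MuRIL [CLS] embedding; image_patches: (B, 197, 768) ViT tokens.
        fused, attn_weights = self.cross_attn(query=text_cls, key=image_patches, value=image_patches)
        fused = self.norm(fused + text_cls).squeeze(1)        # residual connection + LayerNorm -> (B, 768)
        return self.fake_real_head(fused), self.claim_head(fused), attn_weights
```

Keeping the text embedding as the query mirrors item 4: the claim "looks at" the image patches most relevant to it, and the returned attention weights expose that alignment.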
B. Data Preprocessing
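The preprocessing steps listed in Section III-A (item 2) can be implemented as in the sketch below, using the `transformers` MuRIL tokenizer and `torchvision` transforms. The cleaning regexes, checkpoint name, and maximum sequence length are assumptions made for illustration.

```python
import re
import unicodedata

from PIL import Image
from torchvision import transforms
from transformers import AutoTokenizer

# MuRIL tokenizer (public checkpoint name assumed to be google/muril-base-cased).
tokenizer = AutoTokenizer.from_pretrained("google/muril-base-cased")

# 224x224 resize + ImageNet mean/std normalization, as expected by ViT-B/16.
image_transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def clean_text(text: str) -> str:
    """Unicode standardization, URL/special-character removal, whitespace collapsing."""
    text = unicodedata.normalize("NFC", text)
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)   # strip URLs
    text = re.sub(r"[^\w\s]", " ", text)                 # strip special characters (keeps Devanagari word characters)
    return re.sub(r"\s+", " ", text).strip()

def preprocess(headline: str, image_path: str):
    """Return MuRIL token tensors and a normalized image tensor for one news item."""
    tokens = tokenizer(clean_text(headline), truncation=True, max_length=128,
                       padding="max_length", return_tensors="pt")
    pixel_values = image_transform(Image.open(image_path).convert("RGB")).unsqueeze(0)
    return tokens, pixel_values
```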

C. Machine Learning Models
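The two encoder streams from Section III-A (item 3) can be loaded as shown below. For brevity both encoders are loaded through the `transformers` library (the implementation in Section III-D uses `torchvision` for ViT), the checkpoint names are assumptions, and the encoders are shown frozen for simplicity.

```python
import torch
from transformers import AutoModel, ViTModel

# Stream A encoder: MuRIL (BERT-base architecture, 768-d hidden states).
text_encoder = AutoModel.from_pretrained("google/muril-base-cased")
# Stream B encoder: ViT-B/16; the exact pre-trained checkpoint is an assumption.
image_encoder = ViTModel.from_pretrained("google/vit-base-patch16-224-in21k")

@torch.no_grad()
def extract_features(tokens, pixel_values):
    """Return the 768-d text [CLS] embedding and the 768-d ViT patch embeddings."""
    text_cls = text_encoder(**tokens).last_hidden_state[:, :1, :]                # (B, 1, 768)
    image_patches = image_encoder(pixel_values=pixel_values).last_hidden_state   # (B, 197, 768): [CLS] + 196 patches
    return text_cls, image_patches
```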

D. Implementation Details
- Libraries: `transformers` for MuRIL, `torchvision` for ViT, `shap` for text explanation, and `opencv-python` for image manipulation.
- Hardware: Training was performed on High-Performance Computing nodes equipped with NVIDIA GPUs to handle the computational load of the Transformer models.
- Training Config: We used the AdamW optimizer (lr = 2e-5) with a linear warmup scheduler to prevent early overfitting. The Binary Cross-Entropy (BCE) loss function is used for the authenticity classification:

$$\mathcal{L}_{\mathrm{BCE}} = -\frac{1}{N}\sum_{i=1}^{N}\left[\, y_i \log(\hat{y}_i) + (1 - y_i)\log(1 - \hat{y}_i) \,\right]$$

where $y_i$ is the ground-truth label and $\hat{y}_i$ is the predicted probability for the $i$-th sample.
- Reproducibility: Random seeds were fixed to 42 to ensure deterministic results across runs.
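A condensed training sketch under the configuration above (AdamW at lr = 2e-5, linear warmup, BCE loss, seed 42). Here `model` stands in for the full multimodal network mapping (tokens, pixel_values) to the authenticity logit; the epoch count and warmup ratio are illustrative assumptions.

```python
import torch
from torch.optim import AdamW
from transformers import get_linear_schedule_with_warmup

def train(model, train_loader, epochs: int = 3, warmup_ratio: float = 0.1):
    """Train the authenticity head with BCE loss; `model` maps (tokens, pixel_values) -> logit."""
    torch.manual_seed(42)                                    # fixed seed for reproducibility
    optimizer = AdamW(model.parameters(), lr=2e-5)           # AdamW, lr = 2e-5
    total_steps = len(train_loader) * epochs
    scheduler = get_linear_schedule_with_warmup(             # linear warmup, then linear decay
        optimizer,
        num_warmup_steps=int(warmup_ratio * total_steps),
        num_training_steps=total_steps,
    )
    criterion = torch.nn.BCEWithLogitsLoss()                 # numerically stable BCE on the raw logit

    model.train()
    for _ in range(epochs):
        for tokens, pixel_values, labels in train_loader:
            optimizer.zero_grad()
            logit = model(tokens, pixel_values)              # (B, 1) fake/real logit
            loss = criterion(logit.squeeze(-1), labels.float())
            loss.backward()
            optimizer.step()
            scheduler.step()
```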
IV. Results and Analysis
A. Quantitative Results
| Model Architecture | Accuracy | Precision | Recall | F1-Score |
|---|---|---|---|---|
| LSTM (Text Baseline) | 0.72 | 0.70 | 0.69 | 0.70 |
| mBERT (Text-only) | 0.76 | 0.75 | 0.74 | 0.74 |
| CNN + BERT (Simple Fusion) | 0.79 | 0.78 | 0.78 | 0.78 |
| MuRIL + ViT (Ours) | 0.82 | 0.81 | 0.80 | 0.81 |
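The reported metrics can be computed with `scikit-learn`; `y_true` and `y_pred` below stand in for the ground-truth and predicted labels on the test split.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

def evaluate(y_true, y_pred):
    """Compute the four reported metrics for binary fake/real labels."""
    return {
        "Accuracy":  accuracy_score(y_true, y_pred),
        "Precision": precision_score(y_true, y_pred),
        "Recall":    recall_score(y_true, y_pred),
        "F1-Score":  f1_score(y_true, y_pred),
    }

# Example with hypothetical labels: evaluate([1, 0, 1, 1], [1, 0, 0, 1]) -> {'Accuracy': 0.75, ...}
```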
B. Classification Analysis

C. Explainability Outcomes
- Textual Explanation: SHAP values successfully identify inflammatory words (e.g., "banned", "shocking") as high contributors to the "Fake" probability. This confirms the model is paying attention to the sensationalist vocabulary typical of clickbait.
- Visual Explanation: Grad-CAM heatmaps generated from the ViT encoder show that the model focuses on foreground subjects (e.g., politicians’ faces) rather than irrelevant backgrounds, confirming that the model learns semantically meaningful features rather than exploiting background artifacts.
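The token-level explanations described above can be produced with the `shap` text explainer. The sketch below assumes the text stream is exposed as a standard `transformers` text-classification pipeline (a simplification of the full multimodal model); the checkpoint names and the example headline are placeholders.

```python
import shap
from transformers import pipeline

# Placeholder: wrap the fine-tuned text head as a text-classification pipeline.
classifier = pipeline("text-classification",
                      model="google/muril-base-cased",
                      tokenizer="google/muril-base-cased",
                      top_k=None)

explainer = shap.Explainer(classifier)      # shap auto-wraps transformers pipelines for text
shap_values = explainer(["Shocking: popular messaging app banned nationwide from tomorrow"])
shap.plots.text(shap_values)                # highlights per-token contributions (e.g., "banned", "shocking")
```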

V. Discussion and Outcomes
VI. Conclusion
References
- Vosoughi, S.; Roy, D.; Aral, S. The Spread of True and False News Online. Science 2018, 359(6380), 1146–1151.
- Ahmed, H.; Traore, I.; Saad, S. Detection of Online Fake News Using N-Gram Analysis and Machine Learning Techniques. Springer, 2017.
- Wang, W. Y. "Liar, Liar Pants on Fire": A New Benchmark Dataset for Fake News Detection. ACL, 2017.
- Devlin, J.; et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. NAACL-HLT, 2019.
- Dosovitskiy, A.; et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. ICLR, 2021.
- Khanuja, S.; et al. MuRIL: Multilingual Representations for Indian Languages. arXiv 2021, arXiv:2103.10730.
- He, K.; et al. Deep Residual Learning for Image Recognition. CVPR, 2016.
- Lundberg, S. M.; Lee, S.-I. A Unified Approach to Interpreting Model Predictions. NeurIPS, 2017.
- Selvaraju, R. R.; et al. Grad-CAM: Visual Explanations from Deep Networks via Gradient-Based Localization. ICCV, 2017.
- Shu, K.; et al. FakeNewsNet: A Data Repository with News Content, Social Context, and Spatiotemporal Information. Big Data 2020.
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
