Preprint
Article

This version is not peer-reviewed.

MLIF-Net: Multimodal Fusion of Vision Transformers and Large Language Models for AI Image Detection

Submitted: 29 May 2025
Posted: 29 May 2025


Abstract
This paper presents the Multimodal Language-Image Fusion Network (MLIF-Net), a new architecture for distinguishing AI-generated images from real ones. MLIF-Net combines Vision Transformers (ViTs) and Large Language Models (LLMs) to build a multimodal feature fusion network that improves the accuracy of AI-generated content detection. The model uses a cross-attention mechanism to combine visual and semantic features and a multiscale contextual reasoning layer to capture both global and local image features, while an adaptive loss function improves the consistency and robustness of feature extraction. Experimental results show that MLIF-Net outperforms existing models in accuracy, recall, and Average Precision (AP). The approach enables more accurate detection of AI-generated content and may extend to other generative content tasks.
Keywords: 

1. Introduction

Artificial intelligence (AI) is advancing rapidly, and generative models, especially for image synthesis, are improving quickly. Techniques such as Generative Adversarial Networks (GANs) now produce highly realistic synthetic images. While impressive, these advances make it increasingly difficult to distinguish real images from fake ones, creating problems for misinformation, media integrity, security, and public trust.
To address this, researchers have studied machine learning-based detection methods. Hybrid models, for example, improve prediction accuracy on complex datasets [1]. Some recent studies focus on temporal modeling: Jin’s work [2] shows that attention-based temporal convolutional networks yield strong predictions, but such methods target time-series data and do not explore multimodal fusion. Models such as BERT and cross-modal transformers demonstrate the power of large-scale pretraining for capturing contextual and semantic information [3,4]. However, current models still struggle to combine visual and semantic features, a combination that is important for detecting AI-generated images.
This paper presents the Multimodal Language-Image Fusion Network (MLIF-Net), which combines Vision Transformers (ViTs) and Large Language Models (LLMs). The network uses cross-attention to fuse features and multiscale contextual reasoning to extract both global and local details, while an adaptive loss function stabilizes feature extraction. Experiments show better accuracy, recall, and average precision than existing methods.

2. Related Work

In AI-generated image detection, previous research has mainly focused on feature extraction methods, model architectures, and multimodal learning. Early approaches used convolutional neural networks (CNNs) to detect artifacts and anomalies in AI-generated images. While CNNs were effective in some cases, they struggled to capture the semantic relationships between features [5].
With transformer advancements, researchers leveraged them for cross-modal tasks. Radford et al. [6] introduced CLIP, employing contrastive learning on image-text pairs for better multimodal representation.
A few studies, such as those by Jin [7] and Dai [8], have examined adaptive learning techniques in AI-driven prediction models. These works emphasize the importance of robust loss functions and adaptive mechanisms to improve model performance under different conditions. Our work extends these ideas by using an adaptive loss function designed for multimodal feature extraction.
Despite these advances, current methods either fail to generalize across diverse datasets or do not effectively integrate multimodal information. MLIF-Net addresses these issues by using a cross-attention mechanism to combine visual and semantic features and a multiscale contextual reasoning layer to ensure thorough feature extraction.

3. Methodology

With the rise of generative models, distinguishing real from AI-generated images has become challenging. This section presents the Multimodal Language-Image Fusion Network (MLIF-Net), which leverages large language models (LLMs) and vision transformers (ViTs) for detection. MLIF-Net integrates high-level semantics from LLMs with fine-grained visual features from ViTs, enhancing its ability to discern subtle differences.
The architecture employs cross-attention fusion for visual-semantic alignment and a multiscale reasoning layer to capture both local and global image details. An adaptive loss function optimizes feature extraction while preserving semantic consistency. Experimental results demonstrate MLIF-Net’s superiority over conventional image classifiers, achieving state-of-the-art accuracy in AI image detection. Figure 1 illustrates the framework.

3.1. Model Overview

MLIF-Net consists of four parts: an image encoder, an LLM-based semantic encoder, a multimodal feature fusion layer, and a multiscale contextual reasoning layer. These components enhance image understanding by capturing both fine spatial details and high-level semantics.
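The following PyTorch sketch illustrates one way these four components could be wired together. Every module here is a placeholder, and the dimensions, head count, and token pooling are assumptions for illustration, not details reported in the paper.

import torch
import torch.nn as nn

class MLIFNetSkeleton(nn.Module):
    """Illustrative wiring of the four MLIF-Net components.
    Each stage is a placeholder; the full model would plug in a ViT
    encoder (Section 3.2), an LLM-based encoder (Section 3.3),
    cross-attention fusion (Section 3.4), and multiscale reasoning
    (Section 3.5) in place of the stand-ins below."""
    def __init__(self, dim: int = 768, num_classes: int = 2):
        super().__init__()
        self.image_encoder = nn.Identity()       # stand-in for the ViT features
        self.semantic_encoder = nn.Identity()    # stand-in for the LLM features
        self.fusion = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.reasoning = nn.Identity()           # stand-in for multiscale reasoning
        self.classifier = nn.Linear(dim, num_classes)

    def forward(self, x_vit: torch.Tensor, x_llm: torch.Tensor) -> torch.Tensor:
        v = self.image_encoder(x_vit)            # (B, Nv, dim) visual tokens
        s = self.semantic_encoder(x_llm)         # (B, Ns, dim) semantic tokens
        fused, _ = self.fusion(query=v, key=s, value=s)   # cross-attention fusion
        context = self.reasoning(fused)          # multiscale contextual reasoning
        logits = self.classifier(context.mean(dim=1))     # pool tokens, classify
        return logits.softmax(dim=-1)            # P(real), P(AI-generated)

# Example shapes: one global visual token and 32 caption tokens.
probs = MLIFNetSkeleton()(torch.randn(4, 1, 768), torch.randn(4, 32, 768))
print(probs.shape)  # torch.Size([4, 2])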

3.2. Image Encoder with Vision Transformer

The image encoder employs a Vision Transformer (ViT). It divides the input image into patches and applies self-attention to extract spatial and contextual features:
$$X_{\mathrm{ViT}} = \mathrm{ViT}(X)$$
where $X$ is the input image and $X_{\mathrm{ViT}}$ represents the patch-wise features. ViT uses self-attention layers to form a global feature representation. The self-attention mechanism of ViT is shown in Figure 2.
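As a hedged illustration of this stage, the snippet below extracts patch-level and global features with a standard pretrained ViT-B/16 from the Hugging Face transformers library; the specific checkpoint is an assumption, since the paper does not name one.

import torch
from transformers import ViTModel

# Assumed backbone: a standard pretrained ViT-B/16 checkpoint.
vit = ViTModel.from_pretrained("google/vit-base-patch16-224")
vit.eval()

pixel_values = torch.randn(1, 3, 224, 224)      # a preprocessed image (Section 4.1)
with torch.no_grad():
    outputs = vit(pixel_values=pixel_values)
x_vit_tokens = outputs.last_hidden_state        # (1, 197, 768): [CLS] + 196 patch tokens
x_vit = x_vit_tokens[:, 0]                      # (1, 768) global representation X_ViT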

3.3. Semantic Encoder based on LLMs

The model extracts image semantics using a Large Language Model (LLM). Trained on extensive text data, the LLM processes image captions to generate feature representations:
$$X_{\mathrm{LLM}} = \mathrm{LLM}(C)$$
where $C$ is the image caption and $X_{\mathrm{LLM}}$ is the learned feature representation, capturing objects, scenes, and context.
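A minimal sketch of $X_{\mathrm{LLM}} = \mathrm{LLM}(C)$, assuming a BERT-style encoder as the semantic backbone (consistent with the BERT tokenizer used in Section 4.2); a larger LLM could be substituted in the same way.

import torch
from transformers import AutoTokenizer, AutoModel

# Assumed semantic encoder: bert-base-uncased.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")
encoder.eval()

caption = "A dog playing with a ball on the beach"   # illustrative caption C
inputs = tokenizer(caption, return_tensors="pt",
                   padding="max_length", max_length=32, truncation=True)
with torch.no_grad():
    x_llm = encoder(**inputs).last_hidden_state      # (1, 32, 768) semantic features X_LLM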

3.4. Multimodal Feature Fusion Layer

Visual and semantic features are fused using a cross-attention mechanism:
$$X_{\mathrm{Fusion}} = \mathrm{CrossAttention}(X_{\mathrm{ViT}}, X_{\mathrm{LLM}})$$
CrossAttention integrates visual and semantic tokens, forming a unified representation:
$$X_{\mathrm{Fusion}} = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d}}\right)V$$
Here, $Q$, $K$, and $V$ are the query, key, and value matrices derived from the visual and semantic features, and $d$ is their dimensionality. The result $X_{\mathrm{Fusion}}$ encodes the fused multimodal information.
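The sketch below shows one plausible realization of this fusion step using torch.nn.MultiheadAttention, with queries from the visual tokens and keys/values from the semantic tokens; the head count, residual connection, and layer normalization are design assumptions not stated in the paper.

import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    """Sketch of X_Fusion = softmax(QK^T / sqrt(d)) V with Q from visual
    features and K, V from semantic features."""
    def __init__(self, dim: int = 768, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x_vit: torch.Tensor, x_llm: torch.Tensor) -> torch.Tensor:
        # Queries come from visual features, keys and values from semantic features.
        fused, _ = self.attn(query=x_vit, key=x_llm, value=x_llm)
        return self.norm(fused + x_vit)   # residual connection (assumption)

fusion = CrossAttentionFusion()
x_fusion = fusion(torch.randn(2, 1, 768), torch.randn(2, 32, 768))  # (2, 1, 768)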

3.5. Multiscale Contextual Reasoning Layer

The model processes information at different levels using a multiscale contextual reasoning layer. This enhances fusion by capturing both local and global features:
$$X_{\mathrm{Context}} = \mathrm{MultiscaleReasoning}(X_{\mathrm{Fusion}})$$
This layer refines the fused features by integrating information from multiple scales. The output $X_{\mathrm{Context}}$ improves contextual understanding for better decision-making.
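The paper does not specify the internal design of this layer, so the following is only one plausible interpretation: parallel 1-D convolutions with different receptive fields over the fused token sequence, concatenated and projected back to the model dimension.

import torch
import torch.nn as nn

class MultiscaleReasoning(nn.Module):
    """Assumed multiscale design: branches with kernel sizes 1, 3, and 5
    capture increasingly wider context, then a linear layer fuses them."""
    def __init__(self, dim: int = 768, scales=(1, 3, 5)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv1d(dim, dim, kernel_size=k, padding=k // 2) for k in scales
        )
        self.proj = nn.Linear(dim * len(scales), dim)

    def forward(self, x_fusion: torch.Tensor) -> torch.Tensor:
        # x_fusion: (B, N, dim); Conv1d expects (B, dim, N).
        h = x_fusion.transpose(1, 2)
        multi = torch.cat([branch(h) for branch in self.branches], dim=1)
        return self.proj(multi.transpose(1, 2))   # (B, N, dim) contextual features

x_context = MultiscaleReasoning()(torch.randn(2, 32, 768))  # (2, 32, 768)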

3.6. Final Decision Layer

The model classifies images using a softmax classifier on multimodal features:
$$y = \mathrm{Softmax}(W \cdot X_{\mathrm{Context}} + b)$$
Here, $W$ is the weight matrix and $b$ is the bias. The output $y \in \mathbb{R}^{2}$ represents the probabilities of the image being real or AI-generated.
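A minimal sketch of the decision layer follows; mean-pooling the contextual tokens before the linear projection is an assumption, since the paper only specifies a softmax classifier over the multimodal features.

import torch
import torch.nn as nn

classifier = nn.Linear(768, 2)                 # W and b of the decision layer
x_context = torch.randn(4, 32, 768)            # fused, context-refined tokens
logits = classifier(x_context.mean(dim=1))     # (4, 2) after assumed token pooling
y = logits.softmax(dim=-1)                     # P(real), P(AI-generated)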

3.7. Adaptive Loss Function with Multi-Loss Components

MLIF-Net is trained with an adaptive loss function combining multiple components to enhance visual and semantic learning. The total loss is:
$$\mathcal{L}_{\mathrm{total}} = \lambda_{1}\,\mathcal{L}_{\mathrm{CE}} + \lambda_{2}\,\mathcal{L}_{\mathrm{reg}} + \lambda_{3}\,\mathcal{L}_{\mathrm{semantic}}$$
where $\mathcal{L}_{\mathrm{CE}}$ is the cross-entropy loss:
$$\mathcal{L}_{\mathrm{CE}} = -\sum_{i=1}^{N} y_{i} \log(\hat{y}_{i})$$
$\mathcal{L}_{\mathrm{reg}}$ is the regularization term:
$$\mathcal{L}_{\mathrm{reg}} = \sum_{j=1}^{M} \lVert W_{j} \rVert^{2}$$
and $\mathcal{L}_{\mathrm{semantic}}$ enforces semantic coherence:
$$\mathcal{L}_{\mathrm{semantic}} = \lVert X_{\mathrm{ViT}} - X_{\mathrm{LLM}} \rVert^{2}$$
Here, $\lambda_{1}$, $\lambda_{2}$, and $\lambda_{3}$ are hyperparameters controlling the contribution of each loss component.
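A hedged sketch of the combined loss is given below; the $\lambda$ values, the set of regularized weights, and the mean pooling used to bring $X_{\mathrm{ViT}}$ and $X_{\mathrm{LLM}}$ to the same dimension for the semantic term are illustrative assumptions.

import torch
import torch.nn.functional as F

def mlif_loss(logits, targets, x_vit, x_llm, weights, lambdas=(1.0, 1e-4, 0.1)):
    """Sketch of L_total = l1*L_CE + l2*L_reg + l3*L_semantic.
    The lambda values are placeholders, not values reported by the paper."""
    l1, l2, l3 = lambdas
    loss_ce = F.cross_entropy(logits, targets)                    # L_CE
    loss_reg = sum(w.pow(2).sum() for w in weights)               # L_reg = sum ||W_j||^2
    # L_semantic = ||X_ViT - X_LLM||^2 on pooled, same-dimension features (assumption).
    loss_sem = (x_vit.mean(dim=1) - x_llm.mean(dim=1)).pow(2).sum(dim=-1).mean()
    return l1 * loss_ce + l2 * loss_reg + l3 * loss_sem

# Usage with dummy tensors:
logits = torch.randn(4, 2)
targets = torch.randint(0, 2, (4,))
loss = mlif_loss(logits, targets,
                 torch.randn(4, 1, 768), torch.randn(4, 32, 768),
                 weights=[torch.randn(768, 2, requires_grad=True)])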

3.8. Model Output

The final model output is a classification score indicating whether the image is real or AI-generated:
$$y_{\mathrm{final}} = \mathrm{Softmax}(W_{2} \cdot X_{\mathrm{Context}} + b_{2})$$
where $y_{\mathrm{final}}$ is the predicted probability vector, and the decision is made based on the class with the highest probability.
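In code, this decision rule reduces to an argmax over the softmax output; the label ordering (index 0 = real, index 1 = AI-generated) is an assumed convention.

import torch

y_final = torch.tensor([[0.12, 0.88]])          # output of the final softmax layer
pred = y_final.argmax(dim=-1)                   # tensor([1])
label = ["real", "AI-generated"][pred.item()]   # "AI-generated"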

4. Data Preprocessing

This section outlines preprocessing steps for preparing image and text data in MLIF-Net. The pipeline standardizes and augments images while tokenizing and embedding captions for compatibility with the network. Figure 3 illustrates image preprocessing (left), data augmentation (middle), and image-text alignment via cross-attention (right).

4.1. Image Preprocessing

Images are resized to $224 \times 224$ pixels and normalized to $[0, 1]$:
$$X_{\mathrm{norm}} = \frac{X}{255}$$
where $X \in \mathbb{R}^{224 \times 224 \times 3}$ is the input image. Augmentation applies random flipping, rotation within $[-20^{\circ}, 20^{\circ}]$, and color jittering (brightness, contrast, and saturation up to $\pm 0.2$):
$$X_{\mathrm{aug}} = \mathcal{A}(X_{\mathrm{norm}})$$
where $\mathcal{A}$ is the augmentation function, which enhances generalization.
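A torchvision-based sketch of this pipeline, using the stated parameters; the ordering of the augmentation operations is not specified in the paper and is chosen here for illustration.

from torchvision import transforms

# ToTensor() already scales pixel values to [0, 1], i.e. X / 255.
train_transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.RandomHorizontalFlip(),
    transforms.RandomRotation(degrees=20),                              # [-20°, 20°]
    transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2),
    transforms.ToTensor(),
])
# x_aug = train_transform(pil_image)   # pil_image: a PIL.Image loaded from disk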

4.2. Textual Data Preprocessing

Captions are tokenized using the BERT tokenizer. Each caption $C$ is split into tokens $T = [t_{1}, t_{2}, \ldots, t_{n}]$, then mapped to embeddings:
$$E_{\mathrm{tokens}} = \mathrm{BERT\_Embed}(T)$$
where $E_{\mathrm{tokens}} \in \mathbb{R}^{n \times 768}$. Sequences are padded to 32 tokens:
$$E_{\mathrm{tokens}}^{\mathrm{pad}} = \mathrm{Pad}(E_{\mathrm{tokens}})$$
The padded embeddings $E_{\mathrm{tokens}}^{\mathrm{pad}} \in \mathbb{R}^{32 \times 768}$ are input to the LLM for semantic extraction.
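A short sketch of the tokenization and padding step with the Hugging Face BERT tokenizer; the checkpoint name is an assumption.

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
encoded = tokenizer("A city street at night with neon signs",   # illustrative caption
                    padding="max_length", max_length=32,
                    truncation=True, return_tensors="pt")
print(encoded["input_ids"].shape)        # torch.Size([1, 32])
print(encoded["attention_mask"].shape)   # padding positions are masked out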

4.3. Fusion Preparation

After preprocessing, the image and caption features are fused. The Vision Transformer provides a $1 \times 768$ visual feature, while the BERT embeddings provide $32 \times 768$ semantic features. Cross-attention aligns these modalities during training, forming a unified multimodal representation for classification.
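The shape check below, using torch.nn.MultiheadAttention as a stand-in for the fusion layer, confirms that a 1 × 768 visual feature and 32 × 768 caption embeddings can be aligned into a single fused representation.

import torch
import torch.nn as nn

x_vit = torch.randn(1, 1, 768)     # (batch, 1 visual token, 768)
x_llm = torch.randn(1, 32, 768)    # (batch, 32 caption tokens, 768)
cross_attn = nn.MultiheadAttention(embed_dim=768, num_heads=8, batch_first=True)
x_fusion, _ = cross_attn(query=x_vit, key=x_llm, value=x_llm)
print(x_fusion.shape)              # torch.Size([1, 1, 768])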

5. Evaluation Metrics

5.1. Accuracy

Accuracy quantifies overall prediction correctness, computed as:
$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
where $TP$, $TN$, $FP$, and $FN$ denote true positives, true negatives, false positives, and false negatives, respectively.

5.2. Precision

Precision is the proportion of true positives among all samples predicted as positive (here, images predicted to be AI-generated). It is calculated as:
$$\mathrm{Precision} = \frac{TP}{TP + FP}$$

5.3. Recall

Recall measures the ability of the model to correctly identify the positive class (e.g., AI-generated images). It is given by:
$$\mathrm{Recall} = \frac{TP}{TP + FN}$$

5.4. Average Precision (AP)

AP represents the weighted average of precision across recall levels, assessing performance across thresholds:
$$AP = \sum_{i=1}^{N} \mathrm{Precision}(i) \cdot \Delta \mathrm{Recall}(i)$$
where $\mathrm{Precision}(i)$ and $\mathrm{Recall}(i)$ are computed at each threshold $i$, and $\Delta \mathrm{Recall}(i)$ is the change in recall between consecutive thresholds.
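All four metrics can be computed with scikit-learn on held-out predictions, as in this sketch with dummy labels and scores (1 = AI-generated is an assumed label convention).

from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, average_precision_score)

y_true  = [0, 1, 1, 0, 1, 0, 1, 1]                      # ground-truth labels
y_pred  = [0, 1, 1, 0, 0, 0, 1, 1]                      # hard labels from argmax
y_score = [0.1, 0.9, 0.8, 0.3, 0.4, 0.2, 0.7, 0.95]     # predicted P(AI-generated)

print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall   :", recall_score(y_true, y_pred))
print("AP       :", average_precision_score(y_true, y_score))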

6. Experiment Results

MLIF-Net is compared with ViT-GAN, ResNet-50-CNN, CLIP, and Xception-V2. Table 1 shows MLIF-Net achieves the highest accuracy, precision, recall, and average precision (AP).
Ablation results in Table 2 highlight the impact of each component. Removing fusion, multiscale reasoning, or the LLM encoder lowers performance, confirming their importance. Figure 4 shows how the training metrics evolve over the course of training.

7. Conclusions

In this paper, we introduced the Multimodal Language-Image Fusion Network (MLIF-Net) for detecting AI-generated images. The model leverages the strengths of both visual and semantic modalities by combining Vision Transformers with a large language model-based encoder. Our extensive experimental results demonstrate that MLIF-Net outperforms several benchmark models, including ViT-GAN, ResNet-50-CNN, CLIP, and Xception-V2, achieving the highest accuracy, precision, recall, and average precision. Ablation studies further reveal the essential role of multimodal fusion, multiscale reasoning, and the LLM encoder in contributing to the model’s effectiveness. This work opens new opportunities for the application of multimodal fusion techniques in AI detection tasks and other multimodal learning problems.

References

  1. Lu, J.; Long, Y.; Li, X.; Shen, Y.; Wang, X. Hybrid Model Integration of LightGBM, DeepFM, and DIN for Enhanced Purchase Prediction on the Elo Dataset. In Proceedings of the 2024 IEEE 7th International Conference on Information Systems and Computer Aided Education (ICISCAE). IEEE, 2024, pp. 16–20. [CrossRef]
  2. Jin, T. Attention-Based Temporal Convolutional Networks and Reinforcement Learning for Supply Chain Delay Prediction and Inventory Optimization. Preprints 2025. [CrossRef]
  3. Devlin, J.; Chang, M.-W.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of NAACL-HLT, Minneapolis, Minnesota, 2019, Vol. 1.
  4. Li, S. Leveraging Large Language Models in a Retriever-Reader Framework for Solving STEM Multiple-Choice Questions. In Proceedings of the 2024 4th International Conference on Electronic Information Engineering and Computer Science (EIECS). IEEE, 2024, pp. 658–661.
  5. Huang, B.; Carley, K.M. Syntax-aware aspect level sentiment classification with graph attention networks. arXiv preprint arXiv:1909.02606, 2019.
  6. Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning transferable visual models from natural language supervision. In Proceedings of the International conference on machine learning. PMLR, 2021, pp. 8748–8763.
  7. Jin, T. Integrated Machine Learning for Enhanced Supply Chain Risk Prediction 2025.
  8. Dai, W.; Jiang, Y.; Liu, Y.; Chen, J.; Sun, X.; Tao, J. CAB-KWS: Contrastive Augmentation: An Unsupervised Learning Approach for Keyword Spotting in Speech Technology. In Proceedings of the International Conference on Pattern Recognition. Springer, 2025, pp. 98–112. [CrossRef]
Figure 1. Multimodal Language-Image Fusion Network.
Figure 2. Self-attention mechanism in ViT.
Figure 3. Examples of data preprocessing.
Figure 4. Model indicator change chart.
Table 1. Comparison of Model Performance on COCO and FakeImage Datasets

Model            Accuracy (%)   Precision (%)   Recall (%)   AP (%)
MLIF-Net         95.3           94.7            96.1         94.2
ViT-GAN          91.4           89.5            92.3         90.5
ResNet-50-CNN    85.6           83.2            87.0         84.8
CLIP             92.0           90.1            93.3         91.5
Xception-V2      89.5           88.0            90.7         88.9
Table 2. Ablation Study Results

Model Variant                  Accuracy (%)   Precision (%)   Recall (%)   AP (%)
MLIF-Net                       95.3           94.7            96.1         94.2
Without Fusion                 89.7           88.1            91.2         88.3
Without Multiscale Reasoning   93.1           92.6            93.5         92.0
Without LLM Encoder            90.8           89.2            92.1         89.7