Cross-Attention Transformer-Based Visual-Language Fusion for Multimodal Image Analysis


Submitted: 27 February 2025
Posted: 28 February 2025


Abstract

Multimodal image analysis is a significant research direction in the field of computer vision, playing a crucial role in tasks such as image captioning and visual question answering (VQA). However, existing visual-language fusion methods often struggle to capture the fine-grained interactions between visual and language modalities, leading to suboptimal fusion results. To address this issue, this paper proposes a visual-language fusion model based on the Cross-Attention Transformer, which constructs deep interactive relationships between visual and language modalities through cross-attention mechanisms, thereby achieving effective multimodal feature fusion. The proposed model first utilizes convolutional neural networks (CNN) and pre-trained language models (e.g., BERT) to extract visual and language features separately, and then applies cross-attention modules to capture mutual dependencies in feature sequences, resulting in a unified multimodal representation vector. Experimental results demonstrate that the proposed model significantly outperforms traditional methods in tasks such as image captioning and VQA, validating its superiority in multimodal image analysis. Additionally, visualization analysis and ablation experiments further explore the contribution of the cross-attention mechanism to model performance, while discussing the model's limitations and potential future improvements.


1. Introduction

Multimodal learning has broad applications in fields such as computer vision and natural language processing (NLP). However, effectively integrating visual and language modalities and capturing their complex interactions remains a major research challenge [1]. Existing methods often rely on simple concatenation or linear transformation strategies, which fail to adequately represent the semantic relationships between modalities [2]. To address this issue, this paper proposes a multimodal fusion model based on cross-attention mechanisms, introducing multi-head cross-attention and the Convolutional Block Attention Module (CBAM) to achieve deep integration of visual and language features. The proposed model enhances feature representation in multimodal tasks, as demonstrated by its superior performance in image captioning and image-text matching. The main contributions of this paper are a fusion method combining cross-attention mechanisms with CBAM and an experimental validation of its effectiveness in multimodal fusion tasks. Among related work, Xiang et al. [3] proposed a splicing-image detection algorithm, Liu et al. [4] explored deep learning for multilingual sentiment analysis, and Qi et al. [5] improved YOLOv5 for foreign-object detection.

2. Related Concepts and Applications

2.1. Multimodal Learning and Fusion Methods

Multimodal learning is a research approach that enhances model understanding and analytical capabilities by integrating different types of data sources (modalities) [6]. It is widely used in fields such as computer vision, natural language processing, and biomedicine. In multimodal learning tasks, significant differences may exist between modalities such as text, time series, images, structured data, and genomic data. Effectively extracting features from, integrating, and comprehensively analyzing these heterogeneous modalities is a core challenge in multimodal learning [7].
<Figure 1> illustrates the basic process of multimodal learning, which is typically divided into three main stages: feature extraction, modality feature fusion, and decision-making. The "Disparate Data" section in <Figure 1> shows common types of multimodal data, including notes, time series, imaging, omics, and structured data. These data are converted into modality-specific feature representations through distinct feature extraction methods. During the feature extraction stage, specialized networks such as Convolutional Neural Networks (CNN), Recurrent Neural Networks (RNN), or Transformer models are employed to learn the characteristics of each modality [8,9,10].

In the "Fusion Model" stage, the extracted modality features are integrated using different fusion models. Based on the timing of multimodal feature fusion, fusion methods can be categorized into early fusion, intermediate fusion, and late fusion. In <Figure 1>, a late fusion strategy is applied, where features from each modality are fed into separate neural networks and then integrated at the decision layer [11]. This strategy retains the independent information of each modality while leveraging higher-level feature interactions to improve the model's understanding of complex data [12,13,14]. The integrated features are then used for decision-making and prediction tasks such as regression, classification, and clustering, enabling comprehensive analysis and reasoning over multimodal data [15].

Multimodal learning leverages complementary information from different modalities to improve the generalization and accuracy of models, and it has shown remarkable advantages in applications such as medical image analysis, natural language understanding, and multimodal sentiment analysis [16]. Thus, Figure 1 clearly illustrates the feature extraction and fusion processes for different data modalities in multimodal learning, providing a theoretical foundation and intuitive understanding for subsequent method improvements and experimental designs [17].
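To make the late-fusion strategy concrete, the minimal PyTorch sketch below encodes each modality independently and combines the per-modality predictions only at the decision layer. It is illustrative only: the toy linear encoders, the feature dimensions, and the simple averaging of logits are our assumptions rather than the pipeline shown in Figure 1.

```python
import torch
import torch.nn as nn

class LateFusionClassifier(nn.Module):
    """Toy late-fusion model: each modality is encoded by its own head and
    the per-modality logits are combined only at the decision layer."""

    def __init__(self, img_dim=2048, txt_dim=768, num_classes=3):
        super().__init__()
        # Modality-specific heads (stand-ins for CNN / language-model encoders)
        self.img_head = nn.Sequential(nn.Linear(img_dim, 256), nn.ReLU(), nn.Linear(256, num_classes))
        self.txt_head = nn.Sequential(nn.Linear(txt_dim, 256), nn.ReLU(), nn.Linear(256, num_classes))

    def forward(self, img_feat, txt_feat):
        # Decision-level (late) fusion: average the per-modality predictions
        return 0.5 * (self.img_head(img_feat) + self.txt_head(txt_feat))

# usage: 4 image-text pairs with pre-extracted features
model = LateFusionClassifier()
logits = model(torch.randn(4, 2048), torch.randn(4, 768))  # shape (4, 3)
```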

2.2. Application of Cross-Modal Attention Mechanism in Visual-Language Fusion

The Cross-Modal Attention Mechanism is a significant breakthrough in recent multimodal learning research, primarily used to handle complex interactions between visual and language modalities [18]. Traditional multimodal fusion methods usually employ simple concatenation or linear transformation to directly merge visual and language features, which makes it difficult to capture deep semantic relationships between the two modalities and results in suboptimal performance in complex scenarios. Consequently, the Cross-Modal Attention Mechanism has emerged as an effective solution [19,20,21].

The Cross-Modal Attention Mechanism builds on the classic Self-Attention mechanism, establishing mutual dependencies between visual and language features, enabling dynamic allocation of attention weights, and accurately capturing interactions between the two modalities. The visual modality (e.g., image features) and language modality (e.g., text descriptions) are separately processed through their respective feature extraction networks, and mutual attention is then performed by the Cross-Modal Attention module [22]. This module deeply understands the features of each modality and uses contextual information from one modality to re-weight the features of the other [23]. Through this process, the Cross-Modal Attention Mechanism maps visual features into the language modality space or vice versa, achieving mutual enhancement and unified representation of visual and language information [24].

The Cross-Modal Attention Mechanism fundamentally works by generating Query, Key, and Value vectors for each modality and then calculating attention weights. Specifically, for each visual feature point, the model generates a query vector based on language modality features and uses the key and value vectors of the language features for attention computation, obtaining the similarity distribution between each visual feature point and the overall language modality features [25]. Conversely, language features can also compute attention based on visual modality features. This bidirectional interaction modeling enables the Cross-Modal Attention Mechanism to achieve fine-grained fusion of visual and language features, enhancing model performance in visual-language understanding tasks [26,27,28].

In practice, the Cross-Modal Attention Mechanism is widely applied to tasks such as Image Captioning, Visual Question Answering (VQA), and Visual Semantic Retrieval. For example, in image captioning, Cross-Modal Attention dynamically associates visual features with text sequences, accurately matching important objects in the image with corresponding language descriptions and thereby generating coherent and accurate image captions [29]. Similarly, in VQA tasks, Cross-Modal Attention can focus on the most relevant regions of the image based on the semantic emphasis of the question text, enabling the model to answer questions about image content more precisely [30]. Beyond single visual-language fusion, the Cross-Modal Attention Mechanism can also handle the fusion of multiple modalities, such as integrating visual, language, and audio modalities, thereby improving performance in tasks like multimodal sentiment analysis and multimodal behavior recognition [31]. Moreover, researchers have introduced strategies such as Multi-Head Cross-Attention and Multi-Layer Cross-Attention to further enhance the representation capability of Cross-Modal Attention for different modality features [32].
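The weight computation described above can be written compactly. In the following (our notation, not reproduced from the paper), $F_V$ and $F_T$ denote the visual and textual feature sequences, the $W$ matrices are learned projections, and $d_k$ is the key dimension:

$$Q = F_V W^Q, \quad K = F_T W^K, \quad V = F_T W^V, \qquad \mathrm{CrossAttn}(F_V \rightarrow F_T) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V$$

The text-to-vision direction is obtained symmetrically by taking the queries from $F_T$ and the keys and values from $F_V$.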
In summary, the Cross-Modal Attention Mechanism constructs flexible attention connections between different modalities, effectively capturing complex cross-modal interactions and improving overall performance in multimodal tasks. As multimodal learning technology continues to evolve, the Cross-Modal Attention Mechanism is expected to exhibit great potential and value in broader application scenarios[33,34,35].

3. Architecture and Strategy

3.1. Model Architecture Design

This study proposes a visual-language fusion model based on the Cross-Modal Attention Mechanism, aiming to achieve effective integration between visual and language modalities through deep interaction, thereby enhancing multimodal data analysis performance. <Figure 2> illustrates the overall model architecture, which comprises an image feature extraction module, a text feature extraction module, and a cross-attention fusion module [36].
In the image feature extraction branch, the model uses a pre-trained DenseNet121 network to extract features from the input images. The input images are processed through a series of convolutional and pooling layers, producing high-dimensional feature maps that contain spatial and channel information. To further refine the image representation, the model introduces a Convolutional Block Attention Module (CBAM), which focuses on the most discriminative regions and channels in the image and enhances the quality of feature expression [37]. After CBAM processing, the image features are vectorized into a fixed-dimensional feature sequence $\{I_1, I_2, \dots, I_n\}$ for subsequent cross-modal attention computation.

In the text feature extraction branch, the model uses a pre-trained ALBERT model to encode the input text. The text is first tokenized, and each token is converted into a word vector through the word embedding layer [38,39,40]. Position embeddings and segment embeddings are added to retain the sequence information. The resulting vector sequence is fed into the ALBERT network, whose self-attention layers extract the contextual semantic information of the text sequence and yield the text feature representation $\{T_1, T_2, \dots, T_n\}$.

The image and text features are then fed into the cross-modal attention module. The core of this module is a cross-attention mechanism based on the Transformer architecture, which establishes flexible interactions between visual and linguistic features. In the implementation, the model first generates Query, Key, and Value vectors for the visual and language modalities, then uses the multi-head attention mechanism to compute the similarity weights between each feature point in the visual modality and the overall features of the language modality, and vice versa. This cross-attention computation preserves the independent characteristics of the image and text modalities while establishing fine-grained inter-modal interactions on that basis, producing a fused multimodal representation [41,42,43]. The visual and language features processed by the multi-head cross-attention are concatenated and further integrated through a linear transformation layer to generate the fused multimodal feature representation $\{Y_{C1}, Y_{C2}, \dots, Y_{Cn}\}$.

Finally, the fused features are passed to the output layer for decision-making and prediction. The output layer can be designed according to the task, such as sentence generation for image captioning or sentiment-class prediction for sentiment classification. The output on the right of <Figure 2> shows the model applied to sentiment classification, producing three sentiment labels (positive, neutral, negative) that express the sentiment judgment for the combined image and text input [44]. Overall, by introducing the cross-modal attention mechanism, the model effectively improves the deep interaction and fusion between visual and language modalities, yielding more accurate feature representations and better predictions in multimodal tasks [45].
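A minimal PyTorch sketch of this pipeline is given below. It is not the authors' implementation: the simplified CBAM block, the mean-pooling fusion, the projection sizes, and the specific pre-trained checkpoints (torchvision's DenseNet121 weights and HuggingFace's `albert-base-v2`) are illustrative assumptions, and a recent PyTorch/torchvision/transformers stack is assumed (e.g., `batch_first` multi-head attention requires PyTorch >= 1.9).

```python
import torch
import torch.nn as nn
from torchvision.models import densenet121
from transformers import AlbertModel

class SimpleCBAM(nn.Module):
    """Compact CBAM-style block: channel attention followed by spatial attention."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels))
        self.spatial = nn.Conv2d(2, 1, kernel_size=7, padding=3)

    def forward(self, x):                                     # x: (B, C, H, W)
        b, c, _, _ = x.shape
        avg = self.mlp(x.mean(dim=(2, 3)))                    # channel attention
        mx = self.mlp(x.amax(dim=(2, 3)))
        x = x * torch.sigmoid(avg + mx).view(b, c, 1, 1)
        sp = torch.cat([x.mean(1, keepdim=True), x.amax(1, keepdim=True)], dim=1)
        return x * torch.sigmoid(self.spatial(sp))            # spatial attention

class VisionLanguageFusion(nn.Module):
    """Sketch of the pipeline: DenseNet121 + CBAM for images, ALBERT for text,
    bidirectional multi-head cross-attention, and a 3-way sentiment classifier."""
    def __init__(self, dim=768, heads=8, num_classes=3):
        super().__init__()
        self.cnn = densenet121(weights="DEFAULT").features    # (B, 1024, H', W')
        self.cbam = SimpleCBAM(1024)
        self.img_proj = nn.Linear(1024, dim)
        self.albert = AlbertModel.from_pretrained("albert-base-v2")
        self.img_attends_txt = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.txt_attends_img = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.classifier = nn.Linear(2 * dim, num_classes)

    def forward(self, images, input_ids, attention_mask):
        feat = self.cbam(self.cnn(images))                         # refined feature maps
        img = self.img_proj(feat.flatten(2).transpose(1, 2))       # {I_1..I_n}: (B, n, dim)
        txt = self.albert(input_ids, attention_mask=attention_mask).last_hidden_state  # {T_1..T_n}
        # Cross-attention in both directions: queries from one modality,
        # keys/values from the other.
        img_ctx, _ = self.img_attends_txt(img, txt, txt)
        txt_ctx, _ = self.txt_attends_img(txt, img, img)
        fused = torch.cat([img_ctx.mean(1), txt_ctx.mean(1)], dim=-1)  # pooled fusion
        return self.classifier(fused)                              # logits over 3 sentiments
```

In this sketch the fused representation is obtained by mean-pooling each cross-attended sequence and concatenating the two pooled vectors; other pooling or token-selection choices are equally compatible with the architecture described above.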

3.2. Cross-Attention Mechanism and Fusion Strategy

The Cross-Attention Mechanism is a key strategy in multimodal learning for capturing fine-grained interactions between different modalities. Unlike traditional attention mechanisms, it dynamically maps one modality's features into another's context, establishing direct connections and deep feature interactions. This is particularly effective for visual-language fusion, cross-modal retrieval, and sentiment analysis, addressing semantic inconsistencies between modalities.

The core concept involves generating Query, Key, and Value vectors for each modality, allowing for mutual feature modeling. Features from the visual and language modalities are extracted separately using convolutional networks (CNNs) or pre-trained language models such as BERT, then projected into Query, Key, and Value vectors. This enables cross-modal comparison within a unified representation space. The mechanism involves four key steps (a minimal code sketch follows at the end of this subsection):

1. Feature Projection and Mapping: Project features into Query, Key, and Value vectors for both modalities to enable interaction.
2. Bidirectional Interaction Modeling: For each feature, generate attention distributions based on the other modality's contextual information, creating bidirectional feature interactions.
3. Dynamic Allocation of Attention Weights: Compute dot-product similarity between Query and Key, assigning attention weights that reflect the relevance between features of different modalities, enhancing mutual feature expression.
4. Multi-Head and Multi-Layer Cross-Attention: Use multiple attention heads and layers to focus on various semantic aspects, enabling comprehensive feature fusion across higher dimensions.
The Cross-Attention Mechanism’s strength lies in its ability to align visual and language features dynamically, achieving precise modality alignment and feature enhancement. It effectively addresses heterogeneity issues, such as spatial and temporal differences, making it robust in complex scenarios. Future research can integrate this mechanism with other advanced techniques like Graph Neural Networks and Variational Autoencoders to further explore universal and efficient multimodal fusion strategies.
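The four steps above can be made concrete with a deliberately single-head implementation, written out explicitly. This is a sketch under our own naming and dimension assumptions, not the model's actual code; in practice the same computation is repeated across multiple heads and layers (step 4), for example via `nn.MultiheadAttention`.

```python
import math
import torch
import torch.nn as nn

class BidirectionalCrossAttention(nn.Module):
    """Single-head cross-attention mirroring the four steps: projection,
    bidirectional interaction, dot-product weighting, and (in practice)
    repetition across heads and layers."""
    def __init__(self, dim):
        super().__init__()
        # Step 1: separate Query/Key/Value projections for each direction
        self.q_v, self.k_t, self.v_t = (nn.Linear(dim, dim) for _ in range(3))
        self.q_t, self.k_v, self.v_v = (nn.Linear(dim, dim) for _ in range(3))
        self.scale = math.sqrt(dim)

    def attend(self, q, k, v):
        # Step 3: dot-product similarity -> softmax weights -> weighted sum
        weights = torch.softmax(q @ k.transpose(-2, -1) / self.scale, dim=-1)
        return weights @ v

    def forward(self, vis, txt):          # vis: (B, n, d), txt: (B, m, d)
        # Step 2: bidirectional interaction -- each modality queries the other
        vis_ctx = self.attend(self.q_v(vis), self.k_t(txt), self.v_t(txt))
        txt_ctx = self.attend(self.q_t(txt), self.k_v(vis), self.v_v(vis))
        return vis_ctx, txt_ctx           # Step 4: repeat per head / per layer

# usage: 49 visual tokens, 20 text tokens, shared dimension 256
xattn = BidirectionalCrossAttention(dim=256)
v_ctx, t_ctx = xattn(torch.randn(2, 49, 256), torch.randn(2, 20, 256))
```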

3.3. Loss Function and Optimization Strategy

The design of loss functions and the selection of optimization strategies directly influence the convergence speed and final performance of multimodal fusion models during training. For the visual-language fusion model based on the Cross-Attention Mechanism, this paper adopts a multi-task joint loss function, combining Cross-Entropy Loss for classification tasks, Contrastive Loss for matching tasks, and Sequence Cross-Entropy Loss for generation tasks, to optimize multimodal fusion performance from multiple perspectives.

For classification tasks (e.g., sentiment classification, image question answering), the model usually needs to determine whether an input image-text pair belongs to a specific category. To this end, the Cross-Entropy Loss is used to measure the difference between the predicted category distribution and the actual label distribution, defined as shown in Formula 1:
$$L_{CE} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{c=1}^{C} y_{i,c} \log \hat{y}_{i,c}$$
where $N$ denotes the batch size, $C$ represents the number of categories, $y_{i,c}$ is the actual label of sample $i$ (one-hot encoded), and $\hat{y}_{i,c}$ represents the probability that the model predicts sample $i$ belongs to category $c$. By minimizing the Cross-Entropy Loss, the model can better learn distinguishing features between categories, thus improving classification accuracy.

For multimodal matching tasks (e.g., image-text matching), Contrastive Loss is introduced to measure the similarity between matching features and optimize the model's matching capability. The Contrastive Loss is defined as shown in Formula 2:
$$L_{CL} = \frac{1}{N} \sum_{i=1}^{N} \Big[ y_i \, D\big(f(I_i), g(T_i)\big) + (1 - y_i)\max\big(0,\; m - D(f(I_i), g(T_i))\big) \Big]$$
where $y_i$ represents the matching label of sample $i$ (1 indicates matching, 0 indicates non-matching), $D(\cdot,\cdot)$ is the distance metric (e.g., Euclidean distance), $f(I_i)$ and $g(T_i)$ represent the feature representations of image $I_i$ and text $T_i$, respectively, and $m$ is the distance threshold. When the image and text are matching, the model minimizes the feature distance $D(f(I_i), g(T_i))$; when they are not matching, the model pushes the feature distance beyond the threshold $m$, thereby improving the model's matching accuracy.

For generation tasks (e.g., image captioning), the model needs to generate coherent textual descriptions based on the image content. Sequence Cross-Entropy Loss is used to optimize the model's generation ability, defined as shown in Formula 3:
$$L_{SEQ} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{t=1}^{T} \log P\big(w_{t,i} \mid w_{1,i}, w_{2,i}, \dots, w_{t-1,i}, I_i\big)$$
where $T$ is the length of the text sequence, $w_{t,i}$ is the ground-truth word at time step $t$ for sample $i$, and $P(w_{t,i} \mid w_{1,i}, w_{2,i}, \dots, w_{t-1,i}, I_i)$ represents the probability of the model predicting word $w_{t,i}$ given the previous $t-1$ words and the image features $I_i$. By minimizing this loss, the model can generate text that is semantically consistent and syntactically coherent with the target text, thereby enhancing generation task performance.

To balance the impact of the different tasks on the model, a multi-task joint loss strategy is adopted, in which the three loss functions are weighted and combined to construct the final loss function, as shown in Formula 4:
$$L = \lambda_{CE} L_{CE} + \lambda_{CL} L_{CL} + \lambda_{SEQ} L_{SEQ}$$
where $\lambda_{CE}$, $\lambda_{CL}$, and $\lambda_{SEQ}$ are the weighting coefficients for the Cross-Entropy Loss, Contrastive Loss, and Sequence Cross-Entropy Loss, respectively. By adjusting the weights of each loss term, the model can achieve a good balance among different tasks, thereby enhancing its overall performance.

In terms of optimization strategy, this paper employs the Adam optimizer to update model parameters and utilizes a dynamic learning-rate adjustment strategy to accelerate the convergence process. The initial learning rate is set to $\alpha_0$, and if the validation loss does not improve within a certain training period, the learning rate is decayed by a factor of $\beta$. The update strategy is defined as shown in Formula 5:
$$\alpha_{new} = \beta \cdot \alpha_{old} \quad \text{if} \quad val\_loss \geq val\_loss_{previous}$$
where $\alpha_{new}$ and $\alpha_{old}$ represent the learning rates of the current and previous periods, respectively, and $val\_loss$ and $val\_loss_{previous}$ are the current and previous validation losses. When the validation loss does not decrease, the learning rate is decayed, allowing the model to make parameter adjustments with smaller step sizes and avoiding overfitting or issues caused by an excessively high learning rate.

In conclusion, through the combination of multiple loss function designs and dynamic optimization strategies, the model is able to maintain high stability and accuracy across various multimodal tasks, further enhancing the overall performance and effectiveness of multimodal fusion.
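The following sketch shows how Formulas 1-5 might be assembled in PyTorch. It is illustrative rather than the authors' training code: the loss weights, margin, decay factor, and patience values are placeholders, padding handling in the sequence loss is omitted, and Formula 5 is realized with PyTorch's `ReduceLROnPlateau`, which implements the same decay-on-plateau idea.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def contrastive_loss(img_feat, txt_feat, match, margin=0.2):
    """Formula 2: pull matching pairs together, push non-matching pairs
    at least `margin` apart (Euclidean distance)."""
    dist = F.pairwise_distance(img_feat, txt_feat)
    return (match * dist + (1 - match) * torch.clamp(margin - dist, min=0)).mean()

def joint_loss(cls_logits, cls_labels,             # classification head
               img_feat, txt_feat, match_labels,   # matching head
               seq_logits, seq_tokens,             # generation head: (B, T, vocab), (B, T)
               w_ce=1.0, w_cl=1.0, w_seq=1.0):
    """Formula 4: weighted sum of the three task losses."""
    l_ce = F.cross_entropy(cls_logits, cls_labels)                            # Formula 1
    l_cl = contrastive_loss(img_feat, txt_feat, match_labels)                 # Formula 2
    l_seq = F.cross_entropy(seq_logits.flatten(0, 1), seq_tokens.flatten())   # Formula 3
    return w_ce * l_ce + w_cl * l_cl + w_seq * l_seq

# Adam optimizer plus a Formula-5-style schedule: decay the learning rate by
# `factor` when the validation loss stops improving.
model = nn.Linear(10, 3)                                   # placeholder model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode="min",
                                                       factor=0.5, patience=2)
# after each validation epoch: scheduler.step(val_loss)
```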

4. Experiments and Analysis

This section provides a detailed description of the experimental setup, results, and data analysis to verify the effectiveness of the proposed cross-attention-based multimodal fusion model for visual-language tasks. The experiments encompass three main tasks: multimodal classification, image-text matching, and image captioning. We conducted experiments on several publicly available datasets and compared the proposed model with several state-of-the-art multimodal fusion methods to comprehensively evaluate its performance. The datasets used include:
1. COCO Caption Dataset: This dataset contains approximately 120,000 images, each with five different textual descriptions. It is used for image captioning and image-text matching tasks.
2. VQA (Visual Question Answering) Dataset: This dataset is designed for visual question answering tasks and includes approximately 200,000 images along with 260,000 questions and answers based on image content. It is used to evaluate the model's performance in multimodal classification and question answering tasks.
3. Flickr30k Dataset: This dataset consists of 30,000 images, each with five different textual descriptions. It is used to assess the performance of the image-text matching task.
Experimental environment: the hardware configuration is an NVIDIA Tesla V100 GPU with 32 GB of memory, and the software environment is PyTorch 1.8, CUDA 11.0, and Python 3.7. Parameter settings: the initial learning rate is set to 0.001 and the batch size to 32; the model uses the Adam optimizer for parameter updates, with a weight decay coefficient of 0.0001.

The proposed model is compared with the following classic models: LSTM-Attention, an image captioning model based on Long Short-Term Memory (LSTM) networks and attention mechanisms; Dual-Attention, a method that uses bilinear attention networks to fuse visual and language features; and M4C (Multimodal Multifactor Fusion), a visual question answering model based on multimodal factor fusion strategies.

As shown in <Table 1>, the results of the image captioning task on the COCO Caption and Flickr30k datasets are summarized using evaluation metrics such as BLEU-1, BLEU-2, CIDEr, and METEOR scores. The results indicate that the proposed cross-attention fusion model outperforms the other models across all metrics, with a particularly significant improvement in CIDEr scores. This demonstrates the proposed model's ability to generate text descriptions that are more consistent with the target descriptions.
In the image-text matching task, <Table 2> presents the Top-1 and Top-5 matching accuracies on the COCO Caption and Flickr30k datasets. The experimental results show that the proposed model achieves a Top-1 matching accuracy of 64.3%, which is a significant improvement compared to M4C's 59.2%. This indicates that the introduction of the cross-attention mechanism enables the model to better capture fine-grained interactions between image and text features.
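For reference, one common way to compute Top-1/Top-5 image-to-text matching accuracy from paired embeddings is sketched below. The paper does not specify its exact evaluation protocol, so this ranking-by-cosine-similarity setup is an assumption for illustration only.

```python
import torch
import torch.nn.functional as F

def topk_matching_accuracy(img_emb, txt_emb, ks=(1, 5)):
    """Image-to-text retrieval: for each image, rank all texts by cosine
    similarity and check whether the paired text (same index) is in the top-k."""
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    sim = img @ txt.T                                  # (N, N) similarity matrix
    ranks = sim.argsort(dim=-1, descending=True)       # text indices sorted per image
    target = torch.arange(sim.size(0)).unsqueeze(1)    # ground-truth index per row
    hits = ranks.eq(target)                            # True where the paired text appears
    return {f"top{k}": hits[:, :k].any(dim=-1).float().mean().item() for k in ks}

# usage with 1,000 paired image/text embeddings of dimension 768
scores = topk_matching_accuracy(torch.randn(1000, 768), torch.randn(1000, 768))
```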
To further evaluate the contribution of each module in the image captioning task, ablation experiments were conducted on the COCO Caption dataset. <Table 3> summarizes the ablation results using CIDEr scores as the evaluation metric. The results show that after removing the cross-attention mechanism, the CIDEr score drops by 8.5 points, while removing the CBAM module results in a 5.3-point decrease in CIDEr score. This indicates that both the cross-attention mechanism and CBAM module play critical roles in enhancing model performance.
Based on the experimental results, the proposed cross-attention fusion model demonstrates superior performance compared to existing multimodal fusion methods in multimodal tasks. The specific analysis is as follows:
1. Image Captioning Task: The proposed model significantly outperforms other models in BLEU and CIDEr evaluation metrics, indicating that the cross-attention mechanism effectively captures fine-grained interactions between visual and language modalities, generating more coherent and precise text descriptions.
2. Image-Text Matching Task: The proposed model achieves higher performance in Top-1 and Top-5 matching accuracies compared to existing methods, validating the cross-attention mechanism's dynamic alignment capability between different modality features, thereby improving matching accuracy.
3. Ablation Study Analysis: The ablation study results confirm that the cross-attention mechanism and CBAM module are essential for improving model performance. Removing these modules results in a significant performance drop, further demonstrating their critical roles in the model.
In conclusion, through a series of quantitative and qualitative experiments, the proposed model achieves excellent performance in multimodal fusion tasks, validating the effectiveness of the cross-attention mechanism in enhancing feature fusion capability and overall performance. Future research can further explore the combination of other attention mechanisms and multimodal feature representation methods to advance multimodal learning.

5. Conclusion

This paper proposes a visual-language multimodal fusion model based on the cross-attention mechanism, and its superior performance is validated through various experiments on image captioning, image-text matching, and other tasks. Experimental results demonstrate that the proposed model effectively captures fine-grained interactions between visual and language modalities, significantly improving classification accuracy and the quality of generated text descriptions. Compared to traditional methods, the introduction of the cross-attention mechanism enhances semantic alignment between modalities. Additionally, the ablation study further validates the critical role of the cross-attention mechanism and the CBAM module in improving model performance. Future research can explore more efficient attention mechanisms and feature fusion strategies to further advance multimodal learning.

References

  1. Tan C, Zhang W, Qi Z, et al. Generating Multimodal Images with GAN: Integrating Text, Image, and Style[J]. arXiv preprint arXiv:2501.02167, 2025.
  2. Zhang J, Xiang A, Cheng Y, et al. Research on Detection of Floating Objects in River and Lake Based on AI Image Recognition[J]. Journal of Artificial Intelligence Practice, 2024, 7(2): 97-106.
  3. Xiang A, Zhang J, Yang Q, et al. Research on splicing image detection algorithms based on natural image statistical characteristics[J]. arXiv preprint arXiv:2404.16296, 2024.
  4. Liu, Jiabei, et al. "Application of Deep Learning-Based Natural Language Processing in Multilingual Sentiment Analysis." Mediterranean Journal of Basic and Applied Sciences (MJBAS) 8.2 (2024): 243-260. [CrossRef]
  5. Qi Z, Ma D, Xu J, et al. Improved YOLOv5 Based on Attention Mechanism and FasterNet for Foreign Object Detection on Railway and Airway tracks[J]. arXiv preprint arXiv:2403.08499, 2024.
  6. Wang T, Cai X, Xu Q. Energy Market Price Forecasting and Financial Technology Risk Management Based on Generative AI[J]. Applied and Computational Engineering, 2024, 100: 29-34.
  7. Wu, X.; Liu, X.; Yin, J. Multi-Class Classification of Breast Cancer Gene Expression Using PCA and XGBoost. Preprints 2024, 2024101775. https://doi.org/10.20944/preprints202410.1775.v3. [CrossRef]
  8. Min, Liu, et al. "Financial Prediction Using DeepFM: Loan Repayment with Attention and Hybrid Loss." 2024 5th International Conference on Machine Learning and Computer Application (ICMLCA). IEEE, 2024.
  9. Wu Z. An efficient recommendation model based on knowledge graph attention-assisted network (kgatax)[J]. arXiv preprint arXiv:2409.15315, 2024.
  10. Wang, H., Hao Zhang and Yi Lin. 2024 "RPF-ELD: Regional Prior Fusion Using Early and Late Distillation for Breast Cancer Recognition in Ultrasound Images" Preprints. https://doi.org/10.20944/preprints202411.1419.v1. [CrossRef]
  11. Qi Z, Ding L, Li X, et al. Detecting and Classifying Defective Products in Images Using YOLO[J]. arXiv preprint arXiv:2412.16935, 2024.
  12. Yan, Hao, et al. "Research on image generation optimization based deep learning." Proceedings of the International Conference on Machine Learning, Pattern Recognition and Automation Engineering. 2024.
  13. Tang, Xirui, et al. "Research on heterogeneous computation resource allocation based on data-driven method." 2024 6th International Conference on Data-driven Optimization of Complex Systems (DOCS). IEEE, 2024.
  14. Wu Z, Chen J, Tan L, et al. A lightweight GAN-based image fusion algorithm for visible and infrared images[C]//2024 4th International Conference on Computer Science and Blockchain (CCSB). IEEE, 2024: 466-470.
  15. Mo K, Chu L, Zhang X, et al. DRAL: Deep reinforcement adaptive learning for multi-UAVs navigation in unknown indoor environment[J]. arXiv preprint arXiv:2409.03930, 2024.
  16. Zhang W, Huang J, Wang R, et al. Integration of Mamba and Transformer--MAT for Long-Short Range Time Series Forecasting with Application to Weather Dynamics[J]. arXiv preprint arXiv:2409.08530, 2024.
  17. Zhao Y, Hu B, Wang S. Prediction of brent crude oil price based on lstm model under the background of low-carbon transition[J]. arXiv preprint arXiv:2409.12376, 2024.
  18. Zhao Y, Hu B, Wang S. Prediction of brent crude oil price based on lstm model under the background of low-carbon transition[J]. arXiv preprint arXiv:2409.12376, 2024.
  19. Diao, Su, et al. "Ventilator pressure prediction using recurrent neural network." arXiv preprint arXiv:2410.06552 (2024).
  20. Gao, Dawei, et al. "Synaptic resistor circuits based on Al oxide and Ti silicide for concurrent learning and signal processing in artificial intelligence systems." Advanced Materials 35.15 (2023): 2210484. [CrossRef]
  21. Shi X, Tao Y, Lin S C. Deep Neural Network-Based Prediction of B-Cell Epitopes for SARS-CoV and SARS-CoV-2: Enhancing Vaccine Design through Machine Learning[J]. arXiv preprint arXiv:2412.00109, 2024.
  22. Wang, B., Chen, Y., & Li, Z. (2024). A novel Bayesian Pay-As-You-Drive insurance model with risk prediction and causal mapping. Decision Analytics Journal, 13, 100522. [CrossRef]
  23. Li, Z., Wang, B., & Chen, Y. (2024). Incorporating economic indicators and market sentiment effect into US Treasury bond yield prediction with machine learning. Journal of Infrastructure, Policy and Development, 8(9), 7671. [CrossRef]
  24. Zhao R, Hao Y, Li X. Business Analysis: User Attitude Evaluation and Prediction Based on Hotel User Reviews and Text Mining[J]. arXiv preprint arXiv:2412.16744, 2024.
  25. Guo H, Zhang Y, Chen L, et al. Research on vehicle detection based on improved YOLOv8 network[J]. arXiv preprint arXiv:2501.00300, 2024.
  26. Xu, Q., Wang, S., & Tao, Y. (2025). Enhancing Anti-Money Laundering Detection with Self-Attention Graph Neural Networks. Preprints. https://doi.org/10.20944/preprints202501.0587.v1. [CrossRef]
  27. Ziang H, Zhang J, Li L. Framework for lung CT image segmentation based on UNet++[J]. arXiv preprint arXiv:2501.02428, 2025.
  28. Weng, Y., & Wu, J. (2024). Fortifying the global data fortress: a multidimensional examination of cyber security indexes and data protection measures across 193 nations. International Journal of Frontiers in Engineering Technology, 6(2), 13-28. [CrossRef]
  29. Wu Z. Large Language Model Based Semantic Parsing for Intelligent Database Query Engine[J]. Journal of Computer and Communications, 2024, 12(10): 1-13.
  30. Wang, Zixuan, et al. "Improved Unet model for brain tumor image segmentation based on ASPP-coordinate attention mechanism." 2024 5th International Conference on Big Data & Artificial Intelligence & Software Engineering (ICBASE). IEEE, 2024.
  31. Liu, Dong. "Mt2st: Adaptive multi-task to single-task learning." arXiv preprint arXiv:2406.18038 (2024).
  32. Luo, D. (2024). Enhancing Smart Grid Efficiency through Multi-Agent Systems: A Machine Learning Approach for Optimal Decision Making. Preprints preprints, 202411, v1.
  33. Luo, D. (2024). Quantitative Risk Measurement in Power System Risk Management Methods and Applications. Preprints. https://doi.org/10.20944/preprints202411.1636.v1. [CrossRef]
  34. Luo, D. (2024). Decentralized Energy Markets: Designing Incentive Mechanisms for Small-Scale Renewable Energy Producers. Preprints. https://doi.org/10.20944/preprints202411.0696.v1. [CrossRef]
  35. Li Z, Wang B, Chen Y. Knowledge Graph Embedding and Few-Shot Relational Learning Methods for Digital Assets in USA[J]. Journal of Industrial Engineering and Applied Science, 2024, 2(5): 10-18.
  36. Weng, Y., & Wu, J. (2024). Leveraging artificial intelligence to enhance data security and combat cyber attacks. Journal of Artificial Intelligence General science (JAIGS) ISSN: 3006-4023, 5(1), 392-399. [CrossRef]
  37. Liu, Dong, et al. "Graphsnapshot: Graph machine learning acceleration with fast storage and retrieval." arXiv preprint arXiv:2406.17918 (2024).
  38. Tan C, Li X, Wang X, et al. Real-time Video Target Tracking Algorithm Utilizing Convolutional Neural Networks (CNN)[C]//2024 4th International Conference on Electronic Information Engineering and Computer (EIECT). IEEE, 2024: 847-851.
  39. Liu, D. (2024). Contemporary model compression on large language models inference. arXiv preprint arXiv:2409.01990.
  40. Li, Z., Wang, B., & Chen, Y. (2024). A contrastive deep learning approach to cryptocurrency portfolio with us treasuries. Journal of Computer Technology and Applied Mathematics, 1(3), 1-10.
  41. Weng Y, Wu J, Kelly T, et al. Comprehensive overview of artificial intelligence applications in modern industries[J]. arXiv preprint arXiv:2409.13059, 2024.
  42. Huang B, Lu Q, Huang S, et al. Multi-modal clothing recommendation model based on large model and VAE enhancement[J]. arXiv preprint arXiv:2410.02219, 2024.
  43. Li, Z., Wang, B., & Chen, Y. (2024). Knowledge Graph Embedding and Few-Shot Relational Learning Methods for Digital Assets in USA. Journal of Industrial Engineering and Applied Science, 2(5), 10-18.
  44. Zhao, P., & Lai, L. (2024). Minimax optimal q learning with nearest neighbors. IEEE Transactions on Information Theory. [CrossRef]
  45. Feng, J., Wu, Y., Sun, H., Zhang, S., & Liu, D. (2025). Panther: Practical Secure 2-Party Neural Network Inference. IEEE Transactions on Information Forensics and Security. [CrossRef]
Figure 1. Schematic diagram of the fusion and decision-making process of different modal data in multimodal learning.
Figure 2. Design of the model architecture.
Table 1. Performance comparison of various models on the COCO Caption and Flickr30k datasets for the image captioning task.
Model Dataset BLEU-1 BLEU-2 CIDEr METEOR
LSTM-Attention COCO Caption 73.2 55.4 112.8 27.1
Dual-Attention COCO Caption 76.5 58.9 123.4 28.3
M4C COCO Caption 79.1 61.2 134.7 30.2
Proposed Model COCO Caption 82.4 64.7 145.3 32.5
LSTM-Attention Flickr30k 68.9 52.1 97.6 26.2
Dual-Attention Flickr30k 72.8 55.4 108.3 27.4
M4C Flickr30k 74.5 57.1 115.7 29.1
Proposed Model Flickr30k 77.8 60.2 126.5 30.8
Table 2. Top-1 and Top-5 matching accuracy comparison of various models on the COCO Caption and Flickr30k datasets for the image-text matching task.
Model Dataset Top-1 Accuracy Top-5 Accuracy
LSTM-Attention COCO Caption 53.1 78.4
Dual-Attention COCO Caption 58.7 81.3
M4C COCO Caption 59.2 83.5
Proposed Model COCO Caption 64.3 87.2
LSTM-Attention Flickr30k 48.7 73.9
Dual-Attention Flickr30k 52.9 76.5
M4C Flickr30k 55.1 79.8
Proposed Model Flickr30k 60.4 83.1
Table 3. Ablation analysis of different model configurations on CIDEr scores for the image captioning task.
Model Configuration CIDEr Score
Full Model 145.3
Remove Cross-Attention 136.8
Remove CBAM Module 140.0
Visual Features Only 121.4
Text Features Only 118.9
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
Copyright: This open access article is published under a Creative Commons CC BY 4.0 license, which permits free download, distribution, and reuse, provided that the author and preprint are cited in any reuse.