Submitted: 29 May 2025
Posted: 29 May 2025
Abstract
Keywords:
1. Introduction
2. Related Work
3. Methodology
3.1. Model Overview
3.2. Image Encoder with Vision Transformer
3.3. Semantic Encoder based on LLMs
3.4. Multimodal Feature Fusion Layer
3.5. Multiscale Contextual Reasoning Layer
3.6. Final Decision Layer
3.7. Adaptive Loss Function with Multi-Loss Components
3.8. Model Output
4. Data Preprocessing
4.1. Image Preprocessing
4.2. Textual Data Preprocessing
4.3. Fusion Preparation
5. Evaluation Metrics
5.1. Accuracy
5.2. Precision
5.3. Recall
5.4. Average Precision (AP)
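The four metrics listed in this section can be sketched in a few lines of code. This is a minimal illustration that assumes the standard binary-classification definitions and the ranking-based form of average precision; the paper's exact formulation is not reproduced here, and every function name and the toy data are hypothetical.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels where 1 is the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn


def accuracy(y_true, y_pred):
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    total = tp + fp + fn + tn
    return (tp + tn) / total if total else 0.0


def precision(y_true, y_pred):
    tp, fp, _, _ = confusion_counts(y_true, y_pred)
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(y_true, y_pred):
    tp, _, fn, _ = confusion_counts(y_true, y_pred)
    return tp / (tp + fn) if (tp + fn) else 0.0


def average_precision(y_true, scores):
    """Mean of precision@k over the ranks k at which a positive item appears."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits, total = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if y_true[i] == 1:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


# Toy data (hypothetical, not taken from the paper).
y_true = [1, 0, 1, 1, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.2]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 decision threshold
```

Accuracy, precision, and recall depend on the chosen decision threshold, whereas average precision depends only on the ranking of the scores, which is why it is reported separately.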
6. Experiment Results
7. Conclusions
References
- Lu, J.; Long, Y.; Li, X.; Shen, Y.; Wang, X. Hybrid Model Integration of LightGBM, DeepFM, and DIN for Enhanced Purchase Prediction on the Elo Dataset. In Proceedings of the 2024 IEEE 7th International Conference on Information Systems and Computer Aided Education (ICISCAE). IEEE, 2024, pp. 16–20.
- Jin, T. Attention-Based Temporal Convolutional Networks and Reinforcement Learning for Supply Chain Delay Prediction and Inventory Optimization. Preprints 2025.
- Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of NAACL-HLT. Minneapolis, Minnesota, 2019, Vol. 1.
- Li, S. Leveraging Large Language Models in a Retriever-Reader Framework for Solving STEM Multiple-Choice Questions. In Proceedings of the 2024 4th International Conference on Electronic Information Engineering and Computer Science (EIECS). IEEE, 2024, pp. 658–661.
- Huang, B.; Carley, K.M. Syntax-aware aspect level sentiment classification with graph attention networks. arXiv preprint arXiv:1909.02606, 2019.
- Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning transferable visual models from natural language supervision. In Proceedings of the International conference on machine learning. PMLR, 2021, pp. 8748–8763.
- Jin, T. Integrated Machine Learning for Enhanced Supply Chain Risk Prediction. 2025.
- Dai, W.; Jiang, Y.; Liu, Y.; Chen, J.; Sun, X.; Tao, J. CAB-KWS: Contrastive Augmentation: An Unsupervised Learning Approach for Keyword Spotting in Speech Technology. In Proceedings of the International Conference on Pattern Recognition. Springer, 2025, pp. 98–112.




| Model | Accuracy (%) | Precision (%) | Recall (%) | AP (%) |
|---|---|---|---|---|
| MLIF-Net | 95.3 | 94.7 | 96.1 | 94.2 |
| ViT-GAN | 91.4 | 89.5 | 92.3 | 90.5 |
| ResNet-50-CNN | 85.6 | 83.2 | 87.0 | 84.8 |
| CLIP | 92.0 | 90.1 | 93.3 | 91.5 |
| Xception-V2 | 89.5 | 88.0 | 90.7 | 88.9 |
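
To make the comparison in the table above easier to read, the following sketch finds the strongest baseline for each metric and the margin by which MLIF-Net exceeds it. The numbers are copied from the table; the dictionary layout and the helper name `margin_over_best_baseline` are assumptions made for illustration only.

```python
# Scores copied from the comparison table (all values in percent).
results = {
    "MLIF-Net":      {"accuracy": 95.3, "precision": 94.7, "recall": 96.1, "ap": 94.2},
    "ViT-GAN":       {"accuracy": 91.4, "precision": 89.5, "recall": 92.3, "ap": 90.5},
    "ResNet-50-CNN": {"accuracy": 85.6, "precision": 83.2, "recall": 87.0, "ap": 84.8},
    "CLIP":          {"accuracy": 92.0, "precision": 90.1, "recall": 93.3, "ap": 91.5},
    "Xception-V2":   {"accuracy": 89.5, "precision": 88.0, "recall": 90.7, "ap": 88.9},
}

METRICS = ("accuracy", "precision", "recall", "ap")


def margin_over_best_baseline(results, proposed="MLIF-Net"):
    """For each metric, return (best baseline name, proposed minus baseline) in points."""
    baselines = {name: s for name, s in results.items() if name != proposed}
    margins = {}
    for metric in METRICS:
        best_name = max(baselines, key=lambda name: baselines[name][metric])
        gap = results[proposed][metric] - baselines[best_name][metric]
        margins[metric] = (best_name, round(gap, 1))
    return margins


margins = margin_over_best_baseline(results)
for metric, (name, gap) in margins.items():
    print(f"{metric}: best baseline {name}, margin {gap:+.1f} points")
```

On these figures CLIP is the strongest baseline on every metric, and the margins are in percentage points rather than relative improvements.
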

| Model Variant | Accuracy (%) | Precision (%) | Recall (%) | AP (%) |
|---|---|---|---|---|
| MLIF-Net | 95.3 | 94.7 | 96.1 | 94.2 |
| Without Fusion | 89.7 | 88.1 | 91.2 | 88.3 |
| Without Multiscale Reasoning | 93.1 | 92.6 | 93.5 | 92.0 |
| Without LLM Encoder | 90.8 | 89.2 | 92.1 | 89.7 |
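
The ablation table above can be summarized as the drop in each metric when a component is removed from the full model. The sketch below computes those drops from the tabulated values; the variable and function names are hypothetical, and the drops are in percentage points.

```python
# Values copied from the ablation table (all in percent).
full_model = {"accuracy": 95.3, "precision": 94.7, "recall": 96.1, "ap": 94.2}

ablations = {
    "Without Fusion":               {"accuracy": 89.7, "precision": 88.1, "recall": 91.2, "ap": 88.3},
    "Without Multiscale Reasoning": {"accuracy": 93.1, "precision": 92.6, "recall": 93.5, "ap": 92.0},
    "Without LLM Encoder":          {"accuracy": 90.8, "precision": 89.2, "recall": 92.1, "ap": 89.7},
}


def metric_drops(full, variant):
    """Points lost on each metric when moving from the full model to a variant."""
    return {metric: round(full[metric] - variant[metric], 1) for metric in full}


drops = {name: metric_drops(full_model, scores) for name, scores in ablations.items()}

# Rank the ablated components by how much accuracy is lost without them.
ranked = sorted(drops, key=lambda name: drops[name]["accuracy"], reverse=True)

for name in ranked:
    print(name, drops[name])
```

Ranked by accuracy loss, removing the fusion layer costs the most, followed by the LLM encoder and then the multiscale reasoning layer.
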
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).