Submitted: 01 April 2025
Posted: 02 April 2025
Abstract
Keywords:
I. Introduction
A. Definition of Multi-Modal Deep Learning
B. Objectives of the Review
C. Structure of the Paper
II. Bibliometric Analysis
A. Keyword Co-Occurrence Analysis
| Rank | Keyword | Total Link Strength | Occurrences |
|---|---|---|---|
| 1 | deep learning | 4369 | 536 |
| 2 | convolutional neural networks | 2785 | 307 |
| 3 | deep neural networks | 2601 | 307 |
| 4 | convolutional neural network | 2573 | 266 |
| 5 | convolution | 2281 | 221 |
| 6 | multi-modal | 1995 | 225 |
| 7 | neural networks | 1451 | 154 |
| 8 | learning systems | 1370 | 139 |
| 9 | classification | 1228 | 117 |
| 10 | feature extraction | 1196 | 102 |
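The two numeric columns above can be illustrated with a minimal sketch. It assumes the usual bibliometric convention (an assumption, since the counting method is not stated here): a keyword's occurrences are the number of papers listing it, the link strength between two keywords is the number of papers listing both, and a keyword's total link strength is the sum of its link strengths. The function name `cooccurrence_stats` and the toy `papers` list are hypothetical.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_stats(keyword_lists):
    """Return (occurrences, total_link_strength) from per-paper keyword lists."""
    occurrences = Counter()
    link_strength = Counter()  # keyed by an alphabetically sorted keyword pair
    for keywords in keyword_lists:
        # Count each keyword at most once per paper, case-insensitively.
        unique = sorted({k.lower() for k in keywords})
        occurrences.update(unique)
        for a, b in combinations(unique, 2):
            link_strength[(a, b)] += 1
    # Total link strength: sum of a keyword's link strengths to all others.
    total = Counter()
    for (a, b), weight in link_strength.items():
        total[a] += weight
        total[b] += weight
    return occurrences, total


# Hypothetical toy corpus of three papers (not data from this review).
papers = [
    ["deep learning", "multi-modal", "classification"],
    ["deep learning", "convolution"],
    ["multi-modal", "deep learning"],
]
occ, tls = cooccurrence_stats(papers)
print(occ["deep learning"], tls["deep learning"])
```

Under this convention a keyword that appears alongside many others in each paper accumulates a total link strength well above its occurrence count, which matches the pattern in the table.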
B. Other Analysis
III. Core Concepts and Challenges in Multi-Modal Deep Learning
A. Representation
B. Translation
C. Alignment
D. Fusion
| Fusion Type | Base Models | Modalities | References |
|---|---|---|---|
| Early | CNN | Image (IR and Visible) | [24] |
| | CNN, LSTM | Audio, Video | [25] |
| | LSTM, GRU, BERT | Audio, Video, Text | [26] |
| Late | CNN, RNN, GNN | Image, NPK, Microscopic data | [27] |
| | CNN, RNN | Image, Text | [28] |
| | LSTM, CNN | Image, Text, Audio | [29] |
| | NLP, CNN | Text, Audio, Image | [30] |
| Hybrid | Sparse RBM | Audio, Video | [4] |
| | DBM | Image, Text | [31] |
| | CNN/LSTM | Text, Image | [32] |
| | CNN | Audio, Video, Text | [33] |
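The three fusion types in the table can be contrasted with a small sketch. It is illustrative only and does not reproduce any of the cited models: `linear_score` is a hypothetical stand-in for a trained network, the feature vectors and weights are made up, and averaging is just one assumed way to combine decisions or to build a hybrid.

```python
import math


def linear_score(features, weights, bias=0.0):
    """Toy stand-in for a trained model: a logistic score in (0, 1)."""
    z = sum(f * w for f, w in zip(features, weights)) + bias
    return 1.0 / (1.0 + math.exp(-z))


def early_fusion(image_feat, audio_feat, joint_weights):
    # Feature-level fusion: concatenate modalities, then apply one joint model.
    fused = list(image_feat) + list(audio_feat)
    return linear_score(fused, joint_weights)


def late_fusion(image_feat, audio_feat, image_weights, audio_weights):
    # Decision-level fusion: score each modality separately, then average.
    p_image = linear_score(image_feat, image_weights)
    p_audio = linear_score(audio_feat, audio_weights)
    return 0.5 * (p_image + p_audio)


def hybrid_fusion(image_feat, audio_feat, joint_weights,
                  image_weights, audio_weights):
    # Assumed hybrid: average the early (joint) and late (per-modality) outputs.
    early = early_fusion(image_feat, audio_feat, joint_weights)
    late = late_fusion(image_feat, audio_feat, image_weights, audio_weights)
    return 0.5 * (early + late)


# Hypothetical features and weights, chosen only to exercise the functions.
image_feat, audio_feat = [0.2, 0.8], [0.5]
joint_w, image_w, audio_w = [1.0, -1.0, 2.0], [1.0, -1.0], [2.0]
print(early_fusion(image_feat, audio_feat, joint_w))
print(late_fusion(image_feat, audio_feat, image_w, audio_w))
print(hybrid_fusion(image_feat, audio_feat, joint_w, image_w, audio_w))
```

The design difference is where the modalities meet: early fusion lets one model see cross-modal feature interactions, while late fusion keeps per-modality models independent and only merges their outputs.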
E. Co-Learning
IV. Applications of Multi-Modal Deep Learning
A. Healthcare and Medical Imaging
B. Autonomous Systems and Robotics
C. Natural Language Processing and Computer Vision
D. Environmental Monitoring and Remote Sensing
E. Social Media and Sentiment Analysis
F. Mining and Minerals
| Year | Modalities | Application | Base Model | References |
|---|---|---|---|---|
| 2024 | CAD-based Images, interpolated rock mass rating (RMR) data | Mining | neural networks, SVMs, KNNs | [52] |
| 2024 | sensor, image, and sound data | Mining | LSTM, STAN | [53] |
| 2024 | MRI, CT, and PET | Medical Imaging | CNN | [21] |
| 2024 | MRI, Text | Medicine | CNN, SVR | [34] |
| 2024 | Medical images, genomics and clinical data | Medicine | CNN, ResNet | [21] |
| 2024 | Image, LIDAR | Autonomous Systems | CNN | [37] |
| 2022 | IR image, Visible image | Robotics | CNN | [24] |
| 2023 | Image | Robotics | CNN, RNN | [38] |
| 2017 | Video, Audio | NLP | CNN, LSTM | [49] |
| 2024 | Image, text | NLP | CNN, LRM | [41] |
| 2024 | Images, Time Series data | Remote Sensing | ResNet | [42] |
| 2024 | Image, Weather Data | Remote Sensing | CNN, Ensemble | [44] |
| 2024 | Text, Image | Social media | RoBERTa, ViT | [50] |
| 2024 | Text, Image | Sentiment analysis | DAE, CNN | [28] |
V. Encoding and Decoding in Multi-Modal Architectures
A. Encoding Architectures in Multi-Modal Models
B. Decoding in Multi-Modal Architectures
C. Encoder-Decoder Architectures for Translation and Alignment
VI. Summary and Conclusions
Acknowledgment
Abbreviations and Acronyms
| ACNN | Attention Convolutional Neural Network |
| AEBO | Aquila EfficientNet-B0 |
| BERT | Bidirectional Encoder Representations from Transformers |
| BiLSTM-AHCNet | Bi-Directional Long Short-Term Memory assisted Attention Hierarchical Capsule Network |
| CBAM | Convolutional Block Attention Module |
| CCA | Canonical Correlation Analysis |
| cGANs | Conditional Generative Adversarial Networks |
| CMFF | Cross-Modal Feature Fusion |
| CNN | Convolutional Neural Network |
| CT | Computed Tomography |
| DAE | Denoising Autoencoder |
| DBM | Deep Boltzmann Machine |
| DECCFNet | Dual Encoder-Based Cross-Modal Complementary Fusion Network |
| DFEM | Deep Feature Extraction Module |
| DFNN | Deep Fusion Neural Networks |
| DL | Deep Learning |
| EAO | Edge Attention Operation |
| GNN | Graph Neural Network |
| GRU | Gated Recurrent Unit |
| IOU | Intersection Over Union |
| IR | Infrared |
| KNNs | K-Nearest Neighbours |
| LIDAR | Light Detection and Ranging |
| LRM | Linear Regression Model |
| LSTM | Long Short-Term Memory |
| MAE | Mean Absolute Error |
| MAPE | Mean Absolute Percentage Error |
| MBR | Multi-Modal Bayesian Recommender |
| MDSC | Multi-Direction Strip Convolution |
| MFDNN | Multi-Modal Fusion Deep Neural Network |
| MFINet | Multi-Modal Feature Interaction Network |
| MMDFC | Multi-Modality Decision Fusion Classifier |
| MMDL | Multi-Modal Deep Learning |
| MRI | Magnetic Resonance Imaging |
| MSE | Mean Squared Error |
| NAIP | National Agriculture Imagery Program |
| NDVI | Normalised Difference Vegetation Index |
| NLP | Natural Language Processing |
| PCA | Principal Component Analysis |
| PET | Positron Emission Tomography |
| RBM | Restricted Boltzmann Machine |
| ResNet | Residual Network |
| RMR | Rock Mass Rating |
| RMSE | Root Mean Squared Error |
| RNN | Recurrent Neural Network |
| RoBERTa | Robustly Optimized BERT Pretraining Approach |
| SAR | Synthetic Aperture Radar |
| STAN | Spatiotemporal Attention Networks |
| SVMs | Support Vector Machines |
| TwinCNN | Twin Convolutional Neural Network |
| ViT | Vision Transformer |
| VQA | Visual Question Answering |
References
1. Baltrusaitis, T.; Ahuja, C.; Morency, L.P. Multimodal Machine Learning: A Survey and Taxonomy. IEEE Trans Pattern Anal Mach Intell 2019, 41, 423–443.
2. Akkus, C.; et al. Multimodal Deep Learning. arXiv 2023, arXiv:2301.04856.
3. Lu, X.; Xie, L.; Xu, L.; Mao, R.; Xu, X.; Chang, S. Multimodal fused deep learning for drug property prediction: Integrating chemical language and molecular graph. Comput Struct Biotechnol J 2024, 23, 1666–1679.
4. Ngiam, J.; Khosla, A.; Kim, M.; Nam, J.; Lee, H.; Ng, A.Y. Multimodal Deep Learning. 2011.
5. Tian, H.; Tao, Y.; Pouyanfar, S.; Chen, S.C.; Shyu, M.L. Multimodal deep representation learning for video classification. World Wide Web 2019, 22, 1325–1341.
6. Pawłowski, M.; Wróblewska, A.; Sysko-Romańczuk, S. Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 2023, 23, 2381.
7. Song, B.; Zhou, R.; Ahmed, F. Multi-modal Machine Learning in Engineering Design: A Review and Future Directions. arXiv 2023, arXiv:2302.10909.
8. Pei, X.; Zuo, K.; Li, Y.; Pang, Z. A Review of the Application of Multi-modal Deep Learning in Medicine: Bibliometrics and Future Directions. Int J Comput Intell Syst 2023.
9. Liang, P.P.; Zadeh, A.; Morency, L.-P. Foundations and Trends in Multimodal Machine Learning: Principles, Challenges, and Open Questions. arXiv 2022, arXiv:2209.03430.
10. Liang, P.P.; Zadeh, A.; Morency, L.-P. Foundations and Trends in Multimodal Machine Learning: Principles, Challenges, and Open Questions. arXiv 2022, arXiv:2209.03430.
11. Tsai, Y.-H.H.; Liang, P.P.; Zadeh, A.; Morency, L.-P.; Salakhutdinov, R. Learning Factorized Multimodal Representations. arXiv 2018, arXiv:1806.06176.
12. Guo, W.; Wang, J.; Wang, S. Deep Multimodal Representation Learning: A Survey. IEEE Access 2019, 7, 63373–63394.
13. Guo, W.; Wang, J.; Wang, S. Deep Multimodal Representation Learning: A Survey. IEEE Access 2019, 7, 63373–63394.
14. Yang, X.; Ramesh, P.; Chitta, R.; Madhvanath, S.; Bernal, E.A.; Luo, J. Deep Multimodal Representation Learning from Temporal Data. arXiv 2017, arXiv:1704.03152.
15. Pham, H.; Liang, P.P.; Manzini, T.; Morency, L.-P.; Póczos, B. Found in Translation: Learning Robust Joint Representations by Cyclic Translations Between Modalities. 2019. Available online: www.aaai.org.
16. Jia, N.; Zheng, C.; Sun, W. A multimodal emotion recognition model integrating speech, video and MoCAP. Multimed Tools Appl 2022, 81, 32265–32286.
17. Li, Z.; Xie, Y. BCRA: bidirectional cross-modal implicit relation reasoning and aligning for text-to-image person retrieval. Multimed Syst 2024, 30, 177.
18. Karpathy, A.; Fei-Fei, L. Deep Visual-Semantic Alignments for Generating Image Descriptions. arXiv 2014, arXiv:1412.2306.
19. Ma, Z.; Zhang, H.; Liu, J. MM-RNN: A Multimodal RNN for Precipitation Nowcasting. IEEE Transactions on Geoscience and Remote Sensing 2023, 61, 4101914.
20. Kline, A.; et al. Multimodal machine learning in precision health: A scoping review. npj Digital Medicine 2022.
21. S. K.B, S.; et al. An enhanced multimodal fusion deep learning neural network for lung cancer classification. Systems and Soft Computing 2024, 6, 200068.
22. Stahlschmidt, S.R.; Ulfenborg, B.; Synnergren, J. Multimodal deep learning for biomedical data fusion: A review. Brief Bioinform 2022.
23. Khan, M.; Gueaieb, W.; El Saddik, A.; Kwon, S. MSER: Multimodal speech emotion recognition using cross-attention with deep fusion. Expert Syst Appl 2024, 245, 122946.
24. Alabdulkreem, E.A.; Sedik, A.; Algarni, A.D.; Banby, G.M.E.; El-Samie, F.E.A.; Soliman, N.F. Enhanced Robotic Vision System Based on Deep Learning and Image Fusion. Computers, Materials and Continua 2022, 73, 1845–1861.
25. Hosseini, S.S.; Yamaghani, M.R.; Arabani, S.P. Multimodal modelling of human emotion using sound, image and text fusion. Signal Image Video Process 2024, 18, 71–79.
26. Zhang, Z.; et al. Multimodal Sensing for Depression Risk Detection: Integrating Audio, Video, and Text Data. Sensors 2024, 24, 3714.
27. Choudhari, A.; Bhoyar, D.B.; Badole, W.P. MFMDLYP: Precision Agriculture through Multidomain Feature Engineering and Multimodal Deep Learning for Enhanced Yield Predictions. International Journal of Intelligent Systems and Applications in Engineering. Available online: www.ijisae.org.
28. Kusal, S.; Panchal, P.; Patil, S. Pre-Trained Networks and Feature Fusion for Enhanced Multimodal Sentiment Analysis. In Proceedings of the 2024 MIT Art, Design and Technology School of Computing International Conference, MITADTSoCiCon 2024, Pune, India, 25–27 April 2024.
29. Singh, N.M.; Sharma, S.K. An efficient automated multi-modal cyberbullying detection using decision fusion classifier on social media platforms. Multimed Tools Appl 2024, 83, 20507–20535.
30. Dixit, C.; Satapathy, S.M. Deep CNN with late fusion for real time multimodal emotion recognition. Expert Syst Appl 2024, 240, 122579.
31. Srivastava, N. Deep Learning Models for Unsupervised and Transfer Learning. 2017.
32. Ko, K.K.; Jung, E.S. Improving Air Pollution Prediction System through Multimodal Deep Learning Model Optimization. Applied Sciences 2022, 12, 405.
33. Jaafar, N.; Lachiri, Z. Multimodal fusion methods with deep neural networks and meta-information for aggression detection in surveillance. Expert Syst Appl 2023, 211, 118523.
34. Wang, C.; Tachimori, H.; Yamaguchi, H.; Sekiguchi, A.; Li, Y.; Yamashita, Y. A multimodal deep learning approach for the prediction of cognitive decline and its effectiveness in clinical trials for Alzheimer’s disease. Transl Psychiatry 2024, 14, 105.
35. Song, W.; Zeng, X.; Li, Q.; Gao, M.; Zhou, H.; Shi, J. CT and MRI image fusion via multimodal feature interaction network. Network Modeling Analysis in Health Informatics and Bioinformatics 2024, 13, 13.
36. Oyelade, O.N.; Irunokhai, E.A.; Wang, H. A twin convolutional neural network with hybrid binary optimizer for multimodal breast cancer digital image classification. Sci Rep 2024, 14, 692.
37. Liu, Y.; Meng, S.; Wang, H.; Liu, J. Deep learning based object detection from multi-modal sensors: an overview. Multimed Tools Appl 2024, 83, 19841–19870.
38. Ichiwara, H.; Ito, H.; Yamamoto, K.; Mori, H.; Ogata, T. Modality Attention for Prediction-Based Robot Motion Generation: Improving Interpretability and Robustness of Using Multi-Modality. IEEE Robot Autom Lett 2023, 8, 8271–8278.
39. Liu, X.; Xu, X.; Xie, J.; Li, P.; Wei, J.; Sang, Y. FDENet: Fusion Depth Semantics and Edge-Attention Information for Multispectral Pedestrian Detection. IEEE Robot Autom Lett 2024, 9, 5441–5448.
40. Zadeh, A.; Chen, M.; Poria, S.; Cambria, E.; Morency, L.-P. Tensor Fusion Network for Multimodal Sentiment Analysis. arXiv 2017, arXiv:1707.07250.
41. He, J.; et al. Multi-modal Bayesian Recommendation System. In Proceedings of the IMCEC 2024 - IEEE 6th Advanced Information Management, Communicates, Electronic and Automation Control Conference, Chongqing, China, 24–26 May 2024; pp. 141–145.
42. Xia, H.; Chen, X.; Wang, Z.; Chen, X.; Dong, F. A Multi-Modal Deep-Learning Air Quality Prediction Method Based on Multi-Station Time-Series Data and Remote-Sensing Images: Case Study of Beijing and Tianjin. Entropy 2024, 26, 91.
43. Ren, B.; Liu, B.; Hou, B.; Wang, Z.; Yang, C.; Jiao, L. SwinTFNet: Dual-Stream Transformer With Cross Attention Fusion for Land Cover Classification. IEEE Geoscience and Remote Sensing Letters 2024, 21, 1–5.
44. Ramzan, Z.; Asif, H.M.S.; Shahbaz, M. Multimodal crop cover identification using deep learning and remote sensing. Multimed Tools Appl 2024, 83, 33141–33159.
45. Hong, D.; et al. More Diverse Means Better: Multimodal Deep Learning Meets Remote-Sensing Imagery Classification. IEEE Transactions on Geoscience and Remote Sensing 2021, 59, 4340–4354.
46. Zhang, X.; Zhou, Y.; Peng, P.; Wang, G. A Novel Multimodal Species Distribution Model Fusing Remote Sensing Images and Environmental Features. Sustainability 2022, 14, 14034.
47. Luo, H.; Wang, Z.; Du, B.; Dong, Y. A Deep Cross-Modal Fusion Network for Road Extraction With High-Resolution Imagery and LiDAR Data. IEEE Transactions on Geoscience and Remote Sensing 2024, 62, 1–15.
48. Saeed, N.; Alam, M.; Nyberg, R.G. A multimodal deep learning approach for gravel road condition evaluation through image and audio integration. Transportation Engineering 2024, 16, 100228.
49. Zadeh, A.; Chen, M.; Poria, S.; Cambria, E.; Morency, L.-P. Tensor Fusion Network for Multimodal Sentiment Analysis. arXiv 2017, arXiv:1707.07250.
50. Shetty, N.P.; Bijalwan, Y.; Chaudhari, P.; Shetty, J.; Muniyal, B. Disaster assessment from social media using multimodal deep learning. Multimed Tools Appl 2024.
51. Li, H.; Lu, Y.; Zhu, H. Multi-Modal Sentiment Analysis Based on Image and Text Fusion Based on Cross-Attention Mechanism. Electronics 2024, 13, 2069.
52. Liang, R.; et al. Multimodal data fusion for geo-hazard prediction in underground mining operation. Comput Ind Eng 2024, 193, 110268.
53. Li, Y.; Fei, J. Construction of Mining Robot Equipment Fault Prediction Model Based on Deep Learning. Electronics 2024, 13, 480.
54. Majidi, S.; Babapour, G.; Shah-Hosseini, R. An encoder–decoder network for land cover classification using a fusion of aerial images and photogrammetric point clouds. Survey Review 2024.
55. Livezey, J.A.; Glaser, J.I. Deep learning approaches for neural decoding across architectures and recording modalities. Briefings in Bioinformatics 2021.
56. Pham, H.; Liang, P.P.; Manzini, T.; Morency, L.-P.; Póczos, B. Found in Translation: Learning Robust Joint Representations by Cyclic Translations Between Modalities. 2019. Available online: www.aaai.org.
57. Karpathy, A.; Joulin, A.; Fei-Fei, L. Deep Fragment Embeddings for Bidirectional Image Sentence Mapping. arXiv 2014, arXiv:1406.5679.
58. Fukui, A.; Park, D.H.; Yang, D.; Rohrbach, A.; Darrell, T.; Rohrbach, M. Multimodal Compact Bilinear Pooling for Visual Question Answering and Visual Grounding. arXiv 2016, arXiv:1606.01847.
59. Deng, L.; Fu, R.; Li, Z.; Liu, B.; Xue, M.; Cui, Y. Lightweight cross-modal multispectral pedestrian detection based on spatial reweighted attention mechanism. Computers, Materials and Continua 2024, 78, 4071–4089.








Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).