Submitted: 26 July 2025
Posted: 28 July 2025
Abstract
Keywords:
1. Introduction
- To classify and contextualize the types of multimodal data generated in smart cities, with a focus on sensing technologies and urban data sources.
- To provide a systematic overview of multimodal machine learning (MML) techniques, such as fusion strategies, cross-modal learning and attention mechanisms, while addressing current challenges including alignment, scalability and data quality.
- To review the practical applications of MML in smart city domains, including mobility, environmental monitoring, public safety, healthcare and governance.
- To identify the current challenges related to deploying multimodal machine learning in smart city environments, including infrastructure limitations, policy constraints and ethical considerations.
- To outline future research directions and opportunities at the intersection of MML and smart city development, aiming to inform the design of robust, ethical and scalable intelligent urban systems.
2. Background and Foundations
3. Techniques in Multimodal Machine Learning for Smart Cities
3.1. Fusion Strategies in Deep Learning for Smart Cities
3.1.1. Early Fusion
3.1.2. Late Fusion
3.1.3. Hybrid Fusion
3.2. Deep Learning-Based Models for Fusion
- Convolutional Layer: This layer applies filters to the input image (or visual data) to detect low-level features, such as edges or corners. It produces feature maps that highlight important patterns in the data.
- Pooling Layer: After convolution, pooling is applied to reduce the spatial dimensions of the feature maps while retaining important information. This helps the model become invariant to small translations of the input data.
- Fully Connected Layer: The final layer combines the extracted features to make predictions or classifications based on the learned patterns. In multimodal fusion, these outputs are often combined with data from other sources (such as environmental sensor readings) at a later stage.
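The three layer types above can be sketched in plain Python. This is only an illustrative sketch, not the architecture of any surveyed system: the 5×5 image, the edge-detecting kernel, the two sensor readings and all weight values are hypothetical, and nothing is trained. It shows a convolution producing a feature map, pooling shrinking it, and a fully connected layer operating on the visual features after they are concatenated with sensor readings.

```python
def conv2d(image, kernel):
    """Valid 2D convolution (cross-correlation, as in most deep learning libraries)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h, out_w = len(image) - kh + 1, len(image[0]) - kw + 1
    return [
        [
            sum(image[i + a][j + b] * kernel[a][b] for a in range(kh) for b in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


def max_pool(fmap, size=2):
    """Non-overlapping max pooling; rows/columns that do not fill a window are dropped."""
    return [
        [
            max(fmap[i + a][j + b] for a in range(size) for b in range(size))
            for j in range(0, len(fmap[0]) - size + 1, size)
        ]
        for i in range(0, len(fmap) - size + 1, size)
    ]


def fully_connected(features, weights, bias):
    """One dense layer: each output is a weighted sum of all input features plus a bias."""
    return [sum(f * w for f, w in zip(features, row)) + b for row, b in zip(weights, bias)]


# Hypothetical 5x5 grayscale image with a vertical edge between columns 1 and 2.
image = [[0, 0, 1, 1, 1] for _ in range(5)]
# Hypothetical 2x2 kernel that responds to dark-to-bright vertical edges.
edge_kernel = [[-1, 1], [-1, 1]]

feature_map = conv2d(image, edge_kernel)   # 4x4 map, high where the edge is
pooled = max_pool(feature_map, size=2)     # 2x2 map, spatial size halved
visual_features = [v for row in pooled for v in row]

# Hypothetical normalized environmental sensor readings (second modality).
sensor_features = [0.3, 0.7]
fused = visual_features + sensor_features  # concatenate the two modalities

# Hypothetical, untrained weights for a 2-output dense layer over the 6 fused features.
weights = [
    [0.5, 0.0, 0.5, 0.0, 1.0, 0.0],
    [0.0, 0.5, 0.0, 0.5, 0.0, 1.0],
]
bias = [0.0, 0.0]
logits = fully_connected(fused, weights, bias)
print(pooled, logits)
```

In a real system the kernel and dense weights would be learned, and the point at which the sensor vector joins the visual features is a design choice that depends on the fusion strategy.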
- Input Sequence: The input sequence is a mix of text tokens and image features (e.g., from CNNs for visual data).
- Query, Key and Value Vectors: Each element in the input (text or visual) is transformed into three vectors: Query (Q), Key (K) and Value (V).
- Attention Scores: The Query (Q) of each element is compared to all Keys (K) to compute attention scores.
- Softmax and Weighted Sum: The attention scores are normalized (via softmax) and used to weight the Value (V) vectors.
- Output: The final output of the self-attention layer is a contextualized representation for each element.
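The self-attention steps listed above can be traced in a small, dependency-free Python sketch. It is a toy under stated assumptions rather than any cited model: the four input vectors (standing in for two text tokens and two image features) and the projection matrices `w_q`, `w_k` and `w_v` are hypothetical and untrained, only a single head is used, and the division by the square root of the key dimension follows the standard scaled dot-product formulation.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over a mixed sequence."""
    # Project every element into Query, Key and Value vectors.
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    d_k = len(k[0])
    # Compare each Query with all Keys, scaling by sqrt(d_k).
    k_t = [list(col) for col in zip(*k)]
    scores = [[s / math.sqrt(d_k) for s in row] for row in matmul(q, k_t)]
    # Normalize the scores per element, then take the weighted sum of the Values.
    weights = [softmax(row) for row in scores]
    return matmul(weights, v), weights


# Hypothetical input: two text-token embeddings followed by two image-feature vectors.
sequence = [
    [1.0, 0.0, 0.5],
    [0.0, 1.0, 0.5],
    [0.5, 0.5, 1.0],
    [1.0, 1.0, 0.0],
]
# Hypothetical, untrained 3x2 projection matrices.
w_q = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
w_k = [[0.5, 0.5], [1.0, 0.0], [0.0, 1.0]]
w_v = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

output, weights = self_attention(sequence, w_q, w_k, w_v)
for element, row in zip(output, weights):
    print([round(w, 3) for w in row], "->", [round(o, 3) for o in element])
```

Each row of `weights` sums to one and indicates how much every element, textual or visual, contributes to the contextualized representation of the element in that row.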
3.3. Comparative Assessment of MML Techniques and Their Performance
3.4. Core Challenges in Multimodal Fusion
3.4.1. Multimodal Representation Learning
3.4.2. Cross-Modal Alignment
Temporal Misalignment
Spatial Misalignment
Scalability and Real-Time Constraints
3.4.3. Robustness to Missing or Noisy Modalities
3.4.4. Interpretability of Models
3.4.5. Dataset and Benchmark Limitations
4. Applications of Multimodal Machine Learning in Smart Cities
- Multimodal surveillance for event monitoring: During large public events, integrated systems that analyze live video, crowd noise levels and social media activity help detect and manage instances of unrest or overcrowding [110].
- Healthcare and epidemiology surveillance: In response to the COVID-19 pandemic, some regions have experimented with merging data from wearable devices, public health databases and mobility tracking to understand the spread and impact of the virus at a neighborhood level [113].
4.1. Traffic and Transportation
4.2. Environmental Monitoring
4.3. Public Safety and Surveillance
4.4. Urban Planning and Infrastructure
4.5. Citizen Engagement & Services
4.6. IoT Platform
4.7. Cloud Computing
4.8. Edge Computing
4.9. Healthcare and Health Monitoring
5. Challenges and Limitations of MML Deployment in Smart Cities
5.1. Privacy and Security Concerns
5.2. Ethical Considerations
6. Research Gaps and Future Directions in Multimodal Sensing for Smart City Applications
7. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
- Baltrušaitis, T.; Ahuja, C.; Morency, L.-P. Multimodal machine learning: A survey and taxonomy. IEEE transactions on pattern analysis and machine intelligence 2018, 41, 423–443.
- Valet, L.; Mauris, G.; Bolon, P.; Keskes, N. A fuzzy rule-based interactive fusion system for seismic data analysis. Information Fusion 2003, 4, 123–133.
- Florescu, D.; Koller, D. Using probabilistic information in data integration. In Proceedings of the International Conference on Very Large Data Bases (VLDB); 1997.
- Buccella, A.; Cechich, A.; Rodríguez Brisaboa, N. An ontology approach to data integration. Journal of Computer Science & Technology 2003, 3.
- Sharma, H.; Haque, A.; Blaabjerg, F. Machine learning in wireless sensor networks for smart cities: a survey. Electronics 2021, 10, 1012.
- Anwar, M.R.; Sakti, L.D. Integrating artificial intelligence and environmental science for sustainable urban planning. IAIC Transactions on Sustainable Digital Innovation (ITSDI) 2024, 5, 179–191.
- Ortega-Fernández, A.; Martín-Rojas, R.; García-Morales, V.J. Artificial intelligence in the urban environment: Smart cities as models for developing innovation and sustainability. Sustainability 2020, 12, 7860.
- Ullah, Z.; Al-Turjman, F.; Mostarda, L.; Gagliardi, R. Applications of artificial intelligence and machine learning in smart cities. Computer communications 2020, 154, 313–323.
- Pawłowski, M.; Wróblewska, A.; Sysko-Romańczuk, S. Effective techniques for multimodal data fusion: A comparative analysis. Sensors 2023, 23, 2381.
- Huang, X.; Wang, S.; Yang, D.; Hu, T.; Chen, M.; Zhang, M.; Zhang, G.; Biljecki, F.; Lu, T.; Zou, L. Crowdsourcing geospatial data for earth and human observations: A review. Journal of Remote Sensing 2024, 4, 0105.
- Lahat, D.; Adali, T.; Jutten, C. Multimodal data fusion: an overview of methods, challenges, and prospects. Proceedings of the IEEE 2015, 103, 1449–1477.
- Kang, H.-W.; Kang, H.-B. Prediction of crime occurrence from multi-modal data using deep learning. PloS one 2017, 12, e0176244.
- Srivastava, S.; Vargas-Munoz, J.E.; Tuia, D. Understanding urban landuse from the above and ground perspectives: A deep learning, multimodal solution. Remote sensing of environment 2019, 228, 129–143.
- Prawiyogi, A.G.; Purnama, S.; Meria, L. Smart cities using machine learning and intelligent applications. International Transactions on Artificial Intelligence 2022, 1, 102–116.
- Alam, F.; Mehmood, R.; Katib, I.; Albogami, N.N.; Albeshri, A. Data fusion and IoT for smart ubiquitous environments: A survey. IEEE Access 2017, 5, 9533–9554.
- Lifelo, Z.; Ding, J.; Ning, H.; Dhelim, S. Artificial intelligence-enabled metaverse for sustainable smart cities: Technologies, applications, challenges, and future directions. Electronics 2024, 13, 4874.
- Nasr, M.; Islam, M.M.; Shehata, S.; Karray, F.; Quintana, Y. Smart healthcare in the age of AI: recent advances, challenges, and future prospects. IEEE Access 2021, 9, 145248–145270.
- Myagmar-Ochir, Y.; Kim, W. A survey of video surveillance systems in smart city. Electronics 2023, 12, 3567.
- Bello, J.P.; Mydlarz, C.; Salamon, J. Sound analysis in smart cities. In Computational analysis of sound scenes and events; Springer: 2017; pp. 373–397.
- Lim, C.; Cho, G.-H.; Kim, J. Understanding the linkages of smart-city technologies and applications: Key lessons from a text mining approach and a call for future research. Technological Forecasting and Social Change 2021, 170, 120893.
- Musa, A.A.; Malami, S.I.; Alanazi, F.; Ounaies, W.; Alshammari, M.; Haruna, S.I. Sustainable traffic management for smart cities using internet-of-things-oriented intelligent transportation systems (ITS): challenges and recommendations. Sustainability 2023, 15, 9859.
- Mete, M.O. Geospatial big data analytics for sustainable smart cities. The International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences 2023, 48, 141–146.
- Panahi, O. Wearable sensors and personalized sustainability: Monitoring health and environmental exposures in real-time. European Journal of Innovative Studies and Sustainability 2025, 1, 11–19.
- Shahzad, S.K.; Ahmed, D.; Naqvi, M.R.; Mushtaq, M.T.; Iqbal, M.W.; Munir, F. Ontology driven smart health service integration. Computer Methods and Programs in Biomedicine 2021, 207, 106146.
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Advances in neural information processing systems 2017, 30.
- Driver, J.; Spence, C. Cross–modal links in spatial attention. Philosophical Transactions of the Royal Society of London. Series B: Biological Sciences 1998, 353, 1319–1331.
- Ektefaie, Y.; Dasoulas, G.; Noori, A.; Farhat, M.; Zitnik, M. Multimodal learning with graphs. Nature Machine Intelligence 2023, 5, 340–350.
- Wolniak, R.; Stecuła, K. Artificial intelligence in smart cities—applications, barriers, and future directions: a review. Smart cities 2024, 7, 1346–1389.
- Sadiq, T.; Omlin, C.W. NLP-based Traffic Scene Retrieval via Representation Learning.
- Concas, F.; Mineraud, J.; Lagerspetz, E.; Varjonen, S.; Liu, X.; Puolamäki, K.; Nurmi, P.; Tarkoma, S. Low-cost outdoor air quality monitoring and sensor calibration: A survey and critical analysis. ACM Transactions on Sensor Networks (TOSN) 2021, 17, 1–44.
- Rodríguez-Ibánez, M.; Casánez-Ventura, A.; Castejón-Mateos, F.; Cuenca-Jiménez, P.-M. A review on sentiment analysis from social media platforms. Expert Systems with Applications 2023, 223, 119862.
- Luca, M.; Barlacchi, G.; Lepri, B.; Pappalardo, L. A survey on deep learning for human mobility. ACM Computing Surveys (CSUR) 2021, 55, 1–44.
- Zhao, F.; Zhang, C.; Geng, B. Deep multimodal data fusion. ACM computing surveys 2024, 56, 1–36.
- Li, L.H.; Yatskar, M.; Yin, D.; Hsieh, C.-J.; Chang, K.-W. Visualbert: A simple and performant baseline for vision and language. arXiv 2019.
- Su, W.; Zhu, X.; Cao, Y.; Li, B.; Lu, L.; Wei, F.; Dai, J. Vl-bert: Pre-training of generic visual-linguistic representations. arXiv 2019, arXiv:1908.08530.
- Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J. Learning transferable visual models from natural language supervision. In Proceedings of the International conference on machine learning; 2021; pp. 8748–8763.
- Jia, C.; Yang, Y.; Xia, Y.; Chen, Y.-T.; Parekh, Z.; Pham, H.; Le, Q.; Sung, Y.-H.; Li, Z.; Duerig, T. Scaling up visual and vision-language representation learning with noisy text supervision. In Proceedings of the International conference on machine learning; 2021; pp. 4904–4916.
- Tan, H.; Bansal, M. Lxmert: Learning cross-modality encoder representations from transformers. arXiv 2019.
- Lu, J.; Batra, D.; Parikh, D.; Lee, S. Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Advances in neural information processing systems 2019, 32.
- Chen, Y.-C.; Li, L.; Yu, L.; El Kholy, A.; Ahmed, F.; Gan, Z.; Cheng, Y.; Liu, J. Uniter: Universal image-text representation learning. In Proceedings of the European conference on computer vision; 2020; pp. 104–120.
- Li, J.; Selvaraju, R.; Gotmare, A.; Joty, S.; Xiong, C.; Hoi, S.C.H. Align before fuse: Vision and language representation learning with momentum distillation. Advances in neural information processing systems 2021, 34, 9694–9705.
- Li, J.; Li, D.; Xiong, C.; Hoi, S. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In Proceedings of the International conference on machine learning; 2022; pp. 12888–12900.
- Gandhi, A.; Adhvaryu, K.; Poria, S.; Cambria, E.; Hussain, A. Multimodal sentiment analysis: A systematic review of history, datasets, multimodal fusion methods, applications, challenges and future directions. Information Fusion 2023, 91, 424–444.
- Koroteev, M.V. BERT: a review of applications in natural language processing and understanding. arXiv 2021.
- Li, W.; Gao, C.; Niu, G.; Xiao, X.; Liu, H.; Liu, J.; Wu, H.; Wang, H. Unimo: Towards unified-modal understanding and generation via cross-modal contrastive learning. arXiv 2020, arXiv:2012.15409.
- Xu, X.; Wang, Y.; He, Y.; Yang, Y.; Hanjalic, A.; Shen, H.T. Cross-modal hybrid feature fusion for image-sentence matching. ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM) 2021, 17, 1–23.
- Gadzicki, K.; Khamsehashari, R.; Zetzsche, C. Early vs late fusion in multimodal convolutional neural networks. In Proceedings of the 2020 IEEE 23rd international conference on information fusion (FUSION); 2020; pp. 1–6.
- Gao, J.; Li, P.; Chen, Z.; Zhang, J. A survey on deep learning for multimodal data fusion. Neural computation 2020, 32, 829–864.
- Saleh, K.; Hossny, M.; Nahavandi, S. Driving behavior classification based on sensor data fusion using LSTM recurrent neural networks. In Proceedings of the 2017 IEEE 20th international conference on intelligent transportation systems (ITSC); 2017; pp. 1–6.
- Rudovic, O.; Zhang, M.; Schuller, B.; Picard, R. Multi-modal active learning from human data: A deep reinforcement learning approach. In Proceedings of the 2019 international conference on multimodal interaction; 2019; pp. 6–15.
- Wang, X.; Lyu, J.; Kim, B.-G.; Parameshachari, B.; Li, K.; Li, Q. Exploring multimodal multiscale features for sentiment analysis using fuzzy-deep neural network learning. IEEE Transactions on Fuzzy Systems 2024, 33, 28–42.
- Touvron, H.; Cord, M.; Douze, M.; Massa, F.; Sablayrolles, A.; Jégou, H. Training data-efficient image transformers & distillation through attention. In Proceedings of the International conference on machine learning; 2021; pp. 10347–10357.
- Takahashi, S.; Sakaguchi, Y.; Kouno, N.; Takasawa, K.; Ishizu, K.; Akagi, Y.; Aoyama, R.; Teraya, N.; Bolatkan, A.; Shinkai, N. Comparison of vision transformers and convolutional neural networks in medical image analysis: A systematic review. Journal of Medical Systems 2024, 48, 84.
- Ahmed, M.W.; Sadiq, T.; Rahman, H.; Alateyah, S.A.; Alnusayri, M.; Alatiyyah, M.; AlHammadi, D.A. MAPE-ViT: multimodal scene understanding with novel wavelet-augmented Vision Transformer. PeerJ Computer Science 2025, 11, e2796.
- Shaw, P.; Uszkoreit, J.; Vaswani, A. Self-attention with relative position representations. arXiv 2018.
- Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929.
- Tian, Y.; Sun, C.; Poole, B.; Krishnan, D.; Schmid, C.; Isola, P. What makes for good views for contrastive learning? Advances in neural information processing systems 2020, 33, 6827–6839.
- Sadiq, T.; Omlin, C.W. Scene Retrieval in Traffic Videos with Contrastive Multimodal Learning. In Proceedings of the 2023 IEEE 35th International Conference on Tools with Artificial Intelligence (ICTAI); 2023; pp. 1020–1025.
- Wang, L.; Sng, D. Deep learning algorithms with applications to video analytics for a smart city: A survey. arXiv 2015.
- Xiao, H.; Zhao, Y.; Zhang, H. Predict vessel traffic with weather conditions based on multimodal deep learning. Journal of Marine Science and Engineering 2022, 11, 39.
- Liu, Y.; Yang, C.; Liu, K.; Chen, B.; Yao, Y. Domain adaptation transfer learning soft sensor for product quality prediction. Chemometrics and Intelligent Laboratory Systems 2019, 192, 103813.
- Soni, U. Integration of traffic data from social media and physical sensors for near real time road traffic analysis. University of Twente, 2019.
- Luan, S.; Ke, R.; Huang, Z.; Ma, X. Traffic congestion propagation inference using dynamic Bayesian graph convolution network. Transportation research part C: emerging technologies 2022, 135, 103526.
- Zhuang, D.; Gan, V.J.; Tekler, Z.D.; Chong, A.; Tian, S.; Shi, X. Data-driven predictive control for smart HVAC system in IoT-integrated buildings with time-series forecasting and reinforcement learning. Applied Energy 2023, 338, 120936.
- Tan, C.; Sun, F.; Kong, T.; Zhang, W.; Yang, C.; Liu, C. A survey on deep transfer learning. In Proceedings of the International conference on artificial neural networks; 2018; pp. 270–279.
- Saini, K.; Sharma, S. Smart Road Traffic Monitoring: Unveiling the Synergy of IoT and AI for Enhanced Urban Mobility. ACM Computing Surveys 2025, 57, 1–45.
- Ouoba, J.; Lahti, J.; Ahola, J. Connecting digital cities: Return of experience on the development of a data platform for multimodal journey planning. In International Summit, Smart City 360°; Springer: 2015; pp. 91–103.
- Botea, A.; Berlingerio, M.; Braghin, S.; Bouillet, E.; Calabrese, F.; Chen, B.; Gkoufas, Y.; Nair, R.; Nonner, T.; Laumanns, M. Docit: An integrated system for risk-averse multimodal journey advising. In Smart Cities and Homes; Elsevier: 2016; pp. 345–359.
- Asgari, F. Inferring user multimodal trajectories from cellular network metadata in metropolitan areas. Institut National des Télécommunications, 2016.
- Alessandretti, L.; Karsai, M.; Gauvin, L. User-based representation of time-resolved multimodal public transportation networks. Royal Society open science 2016, 3, 160156.
- Pronello, C.; Gaborieau, J.-B. Engaging in pro-environment travel behaviour research from a psycho-social perspective: A review of behavioural variables and theories. Sustainability 2018, 10, 2412.
- Kang, Y.; Youm, S. Multimedia application to an extended public transportation network in South Korea: optimal path search in a multimodal transit network. Multimedia Tools and Applications 2017, 76, 19945–19957.
- Sokolov, I.; Kupriyanovsky, V.; Dunaev, O.; Sinyagov, S.; Kurenkov, P.; Namiot, D.; Dobrynin, A.; Kolesnikov, A.; Gonik, M. On breakthrough innovative technologies for infrastructures. The Eurasian digital railway as a basis of the logistic corridor of the new Silk Road. International Journal of Open Information Technologies 2017, 5, 102–118.
- Young, G.W.; Naji, J.; Charlton, M.; Brunsdon, C.; Kitchin, R. Future cities and multimodalities: how multimodal technologies can improve smart-citizen engagement with city dashboards. 2017.
- Kumar, S.; Datta, D.; Singh, S.K.; Sangaiah, A.K. An intelligent decision computing paradigm for crowd monitoring in the smart city. Journal of Parallel and Distributed Computing 2018, 118, 344–358.
- Zhang, J.; Xiao, W.; Coifman, B.; Mills, J.P. Vehicle tracking and speed estimation from roadside lidar. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing 2020, 13, 5597–5608.
- Jordan, S.; Chandak, Y.; Cohen, D.; Zhang, M.; Thomas, P. Evaluating the performance of reinforcement learning algorithms. In Proceedings of the International Conference on Machine Learning; 2020; pp. 4962–4973.
- Maadi, S.; Stein, S.; Hong, J.; Murray-Smith, R. Real-time adaptive traffic signal control in a connected and automated vehicle environment: optimisation of signal planning with reinforcement learning under vehicle speed guidance. Sensors 2022, 22, 7501.
- Nigam, N.; Singh, D.P.; Choudhary, J. A review of different components of the intelligent traffic management system (ITMS). Symmetry 2023, 15, 583.
- Wu, P.; Zhang, Z.; Peng, X.; Wang, R. Deep learning solutions for smart city challenges in urban development. Scientific Reports 2024, 14, 5176.
- Yu, W.; Wu, G.; Han, J. Deep Multimodal-Interactive Document Summarization Network and Its Cross-Modal Text–Image Retrieval Application for Future Smart City Information Management Systems. Smart Cities 2025, 8, 96.
- Wu, C.; Wang, T.; Ge, Y.; Lu, Z.; Zhou, R.; Shan, Y.; Luo, P. $\pi$-Tuning: Transferring Multimodal Foundation Models with Optimal Multi-task Interpolation. In Proceedings of the International Conference on Machine Learning; 2023; pp. 37713–37727.
- Bian, L. Multiscale nature of spatial data in scaling up environmental models. In Scale in remote sensing and GIS; Routledge: 2023; pp. 13–26.
- Pang, T.; Lin, M.; Yang, X.; Zhu, J.; Yan, S. Robustness and accuracy could be reconcilable by (proper) definition. In Proceedings of the International conference on machine learning; 2022; pp. 17258–17277.
- Yang, X.; Song, Z.; King, I.; Xu, Z. A survey on deep semi-supervised learning. IEEE transactions on knowledge and data engineering 2022, 35, 8934–8954.
- Alzubaidi, L.; Al-Amidie, M.; Al-Asadi, A.; Humaidi, A.J.; Al-Shamma, O.; Fadhel, M.A.; Zhang, J.; Santamaría, J.; Duan, Y. Novel transfer learning approach for medical imaging with limited labeled data. Cancers 2021, 13, 1590.
- Barua, A.; Ahmed, M.U.; Begum, S. A systematic literature review on multimodal machine learning: Applications, challenges, gaps and future directions. IEEE Access 2023, 11, 14804–14831.
- Kieu, N.; Nguyen, K.; Nazib, A.; Fernando, T.; Fookes, C.; Sridharan, S. Multimodal colearning meets remote sensing: Taxonomy, state of the art, and future works. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing 2024, 17, 7386–7409.
- Lopes, P.P. PONDIÔNSTRACKER: A FRAMEWORK BASED ON GTFS-RT TO IDENTIFY DELAYS AND ESTIMATE ARRIVALS DYNAMICALLY IN PUBLIC TRANSPORTATION NETWORK.
- Wu, R.; Wang, H.; Chen, H.-T.; Carneiro, G. Deep multimodal learning with missing modality: A survey. arXiv 2024, arXiv:2409.07825.
- Seu, K.; Kang, M.-S.; Lee, H. An intelligent missing data imputation techniques: A review. JOIV: International Journal on Informatics Visualization 2022, 6, 278–283.
- Psychogyios, K.; Ilias, L.; Ntanos, C.; Askounis, D. Missing value imputation methods for electronic health records. IEEE Access 2023, 11, 21562–21574.
- Çetin, V.; Yıldız, O. A comprehensive review on data preprocessing techniques in data analysis. Pamukkale Üniversitesi Mühendislik Bilimleri Dergisi 2022, 28, 299–312.
- Younis, E.M.; Zaki, S.M.; Kanjo, E.; Houssein, E.H. Evaluating ensemble learning methods for multi-modal emotion recognition using sensor data fusion. Sensors 2022, 22, 5611.
- Ferrario, A.; Loi, M. How explainability contributes to trust in AI. In Proceedings of the 2022 ACM conference on fairness, accountability, and transparency; 2022; pp. 1457–1466.
- Bell, A.; Solano-Kamaiko, I.; Nov, O.; Stoyanovich, J. It’s just not that simple: an empirical study of the accuracy-explainability trade-off in machine learning for public policy. In Proceedings of the 2022 ACM conference on fairness, accountability, and transparency; 2022; pp. 248–266.
- Mahbooba, B.; Timilsina, M.; Sahal, R.; Serrano, M. Explainable artificial intelligence (XAI) to enhance trust management in intrusion detection systems using decision tree model. Complexity 2021, 2021, 6634811.
- Salih, A.M.; Raisi-Estabragh, Z.; Galazzo, I.B.; Radeva, P.; Petersen, S.E.; Lekadir, K.; Menegaz, G. A perspective on explainable artificial intelligence methods: SHAP and LIME. Advanced Intelligent Systems 2025, 7, 2400304.
- Zhu, X.; Wang, D.; Pedrycz, W.; Li, Z. Fuzzy rule-based local surrogate models for black-box model explanation. IEEE Transactions on Fuzzy Systems 2022, 31, 2056–2064.
- Cordts, M.; Omran, M.; Ramos, S.; Rehfeld, T.; Enzweiler, M.; Benenson, R.; Franke, U.; Roth, S.; Schiele, B. The cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE conference on computer vision and pattern recognition; 2016; pp. 3213–3223.
- Lam, D.; Kuzma, R.; McGee, K.; Dooley, S.; Laielli, M.; Klaric, M.; Bulatov, Y.; McCord, B. xview: Objects in context in overhead imagery. arXiv 2018.
- Pfeil, M.; Bartoschek, T.; Wirwahn, J.A. Opensensemap-a citizen science platform for publishing and exploring sensor data as open data. 2018.
- Bui, L. Breathing smarter: A critical look at representations of air quality sensing data across platforms and publics. In Proceedings of the 2015 IEEE First International Smart Cities Conference (ISC2); 2015; pp. 1–5.
- Sahoh, B.; Choksuriwong, A. The role of explainable Artificial Intelligence in high-stakes decision-making systems: a systematic review. Journal of Ambient Intelligence and Humanized Computing 2023, 14, 7827–7843.
- Naidu, G.; Zuva, T.; Sibanda, E.M. A review of evaluation metrics in machine learning algorithms. In Proceedings of the Computer science on-line conference; 2023; pp. 15–25.
- Anjuma, K.; Arshad, M.A.; Hayawi, K.; Polyzos, E.; Tariq, A.; Serhani, M.A.; Batool, L.; Lund, B.; Mannuru, N.R.; Bevara, R.V.K. Domain Specific Benchmarks for Evaluating Multimodal Large Language Models. arXiv 2025.
- Zhou, Y.; Gallego, G.; Lu, X.; Liu, S.; Shen, S. Event-based motion segmentation with spatio-temporal graph cuts. IEEE transactions on neural networks and learning systems 2021, 34, 4868–4880.
- Peng, W.; Bai, X.; Yang, D.; Yuen, K.F.; Wu, J. A deep learning approach for port congestion estimation and prediction. Maritime Policy & Management 2023, 50, 835–860.
- Liu, J.; Ong, G.P. Prediction of Next-Time Traffic Congestion with Consideration of Congestion Propagation Patterns and Co-occurrence. IEEE Transactions on Vehicular Technology 2025.
- Wattacheril, C.Y.; Hemalakshmi, G.; Murugan, A.; Abhiram, P.; George, A.M. Machine Learning-Based Threat Detection in Crowded Environments. In Proceedings of the 2024 International Conference on Smart Technologies for Sustainable Development Goals (ICSTSDG); 2024; pp. 1–7.
- Jiang, Q.; Kresin, F.; Bregt, A.K.; Kooistra, L.; Pareschi, E.; Van Putten, E.; Volten, H.; Wesseling, J. Citizen sensing for improved urban environmental monitoring. Journal of Sensors 2016, 2016, 5656245.
- Lim, C.C.; Kim, H.; Vilcassim, M.R.; Thurston, G.D.; Gordon, T.; Chen, L.-C.; Lee, K.; Heimbinder, M.; Kim, S.-Y. Mapping urban air quality using mobile sampling with low-cost sensors and machine learning in Seoul, South Korea. Environment international 2019, 131, 105022.
- Hu, T.; Wang, S.; She, B.; Zhang, M.; Huang, X.; Cui, Y.; Khuri, J.; Hu, Y.; Fu, X.; Wang, X. Human mobility data in the COVID-19 pandemic: characteristics, applications, and challenges. International Journal of Digital Earth 2021, 14, 1126–1147.
- Almujally, N.A.; Qureshi, A.M.; Alazeb, A.; Rahman, H.; Sadiq, T.; Alonazi, M.; Algarni, A.; Jalal, A. A novel framework for vehicle detection and tracking in night ware surveillance systems. IEEE Access 2024, 12, 88075–88085.
- Son, H.; Jang, J.; Park, J.; Balog, A.; Ballantyne, P.; Kwon, H.R.; Singleton, A.; Hwang, J. Leveraging advanced technologies for (smart) transportation planning: A systematic review. Sustainability 2025, 17, 2245.
- Zaib, S.; Lu, J.; Bilal, M. Spatio-temporal characteristics of air quality index (AQI) over Northwest China. Atmosphere 2022, 13, 375.
- Xu, Z.; Mei, L.; Lv, Z.; Hu, C.; Luo, X.; Zhang, H.; Liu, Y. Multi-modal description of public safety events using surveillance and social media. IEEE Transactions on Big Data 2017, 5, 529–539.
- Alrashdi, I.; Alqazzaz, A.; Aloufi, E.; Alharthi, R.; Zohdy, M.; Ming, H. Ad-iot: Anomaly detection of iot cyberattacks in smart city using machine learning. In Proceedings of the 2019 IEEE 9th annual computing and communication workshop and conference (CCWC); 2019; pp. 0305–0310.
- Islam, M.; Dukyil, A.S.; Alyahya, S.; Habib, S. An IoT enable anomaly detection system for smart city surveillance. Sensors 2023, 23, 2358.
- Zhong, C.; Guo, H.; Swan, I.; Gao, P.; Yao, Q.; Li, H. Evaluating trends, profits, and risks of global cities in recent urban expansion for advancing sustainable development. Habitat International 2023, 138, 102869.
- Jadhav, S.; Durairaj, M.; Reenadevi, R.; Subbulakshmi, R.; Gupta, V.; Ramesh, J.V.N. Spatiotemporal data fusion and deep learning for remote sensing-based sustainable urban planning. International Journal of System Assurance Engineering and Management 2024, 1–9. [Google Scholar] [CrossRef]
- Qiu, J.; Zhao, Y. Traffic Prediction with Data Fusion and Machine Learning. Analytics 2025, 4, 12. [Google Scholar] [CrossRef]
- Karagiannopoulou, A.; Tsertou, A.; Tsimiklis, G.; Amditis, A. Data fusion in earth observation and the role of citizen as a sensor: A scoping review of applications, methods and future trends. Remote Sensing 2022, 14, 1263. [Google Scholar] [CrossRef]
- Hsu, I.-C.; Chang, C.-C. Integrating machine learning and open data into social Chatbot for filtering information rumor. Journal of Ambient Intelligence and Humanized Computing 2021, 12, 1023–1037. [Google Scholar] [CrossRef]
- Li, X.; Liu, H.; Wang, W.; Zheng, Y.; Lv, H.; Lv, Z. Big data analysis of the internet of things in the digital twins of smart city based on deep learning. Future Generation Computer Systems 2022, 128, 167–177. [Google Scholar] [CrossRef]
- Liu, Q.; Huang, Y.; Jin, C.; Zhou, X.; Mao, Y.; Catal, C.; Cheng, L. Privacy and integrity protection for IoT multimodal data using machine learning and blockchain. ACM Transactions on Multimedia Computing, Communications and Applications 2024, 20, 1–18. [Google Scholar] [CrossRef]
- Zekić-Sušac, M.; Mitrović, S.; Has, A. Machine learning based system for managing energy efficiency of public sector as an approach towards smart cities. International journal of information management 2021, 58, 102074. [Google Scholar] [CrossRef]
- Liu, H.; Cui, W.; Zhang, M. Exploring the causal relationship between urbanization and air pollution: Evidence from China. Sustainable Cities and Society 2022, 80, 103783. [Google Scholar] [CrossRef]
- Malatesta, T.; Breadsell, J.K. Identifying home system of practices for energy use with k-means clustering techniques. Sustainability 2022, 14, 9017. [Google Scholar] [CrossRef]
- Kilicay-Ergin, N.; Barb, A.S. Semantic fusion with deep learning and formal ontologies for evaluation of policies and initiatives in the smart city domain. Applied Sciences 2021, 11, 10037. [Google Scholar] [CrossRef]
- Naoui, M.A.; Lejdel, B.; Ayad, M.; Amamra, A.; Kazar, O. Using a distributed deep learning algorithm for analyzing big data in smart cities. Smart and Sustainable Built Environment 2021, 10, 90–105. [Google Scholar] [CrossRef]
- Atitallah, S.B.; Driss, M.; Boulila, W.; Ghézala, H.B. Computer Science Review. 2020.
- Kline, A.; Wang, H.; Li, Y.; Dennis, S.; Hutch, M.; Xu, Z.; Wang, F.; Cheng, F.; Luo, Y. Multimodal machine learning in precision health: A scoping review. NPJ Digital Medicine 2022, 5, 171. [Google Scholar] [CrossRef]
- Dautov, R.; Distefano, S.; Buyya, R. Hierarchical data fusion for smart healthcare. Journal of Big Data 2019, 6, 1–23. [Google Scholar] [CrossRef]
- Nazari, E.; Chang, H.-C.H.; Deldar, K.; Pour, R.; Avan, A.; Tara, M.; Mehrabian, A.; Tabesh, H. A comprehensive overview of decision fusion technique in healthcare: A systematic scoping review. Iranian Red Crescent Medical Journal 2020, 22, 1–17. [Google Scholar]
- Haltaufderheide, J.; Ranisch, R. The ethics of ChatGPT in medicine and healthcare: A systematic review on Large Language Models (LLMs). NPJ Digital Medicine 2024, 7, 183. [Google Scholar] [CrossRef]
- Demelius, L.; Kern, R.; Trügler, A. Recent advances of differential privacy in centralized deep learning: A systematic survey. ACM Computing Surveys 2025, 57, 1–28. [Google Scholar] [CrossRef]
- Sampaio, S.; Sousa, P.R.; Martins, C.; Ferreira, A.; Antunes, L.; Cruz-Correia, R. Collecting, processing and secondary using personal and (pseudo) anonymized data in smart cities. Applied Sciences 2023, 13, 3830. [Google Scholar] [CrossRef]
- Labadie, C.; Legner, C. Building data management capabilities to address data protection regulations: Learnings from EU-GDPR. Journal of Information Technology 2023, 38, 16–44. [Google Scholar] [CrossRef]
- Oladosu, S.A.; Ike, C.C.; Adepoju, P.A.; Afolabi, A.I.; Ige, A.B.; Amoo, O.O. Frameworks for ethical data governance in machine learning: Privacy, fairness, and business optimization. Magna Sci Adv Res Rev 2024. [Google Scholar]
- Qu, Y.; Nosouhi, M.R.; Cui, L.; Yu, S. Privacy preservation in smart cities. In Smart cities cybersecurity and privacy; Elsevier: 2019; pp. 75–88.
- Rao, P.M.; Deebak, B.D. Security and privacy issues in smart cities/industries: technologies, applications, and challenges. Journal of Ambient Intelligence and Humanized Computing 2023, 14, 10517–10553. [Google Scholar] [CrossRef]
- Daoudagh, S.; Marchetti, E.; Savarino, V.; Bernabe, J.B.; García-Rodríguez, J.; Moreno, R.T.; Martinez, J.A.; Skarmeta, A.F. Data protection by design in the context of smart cities: A consent and access control proposal. Sensors 2021, 21, 7154. [Google Scholar] [CrossRef] [PubMed]
- Al-Turjman, F.; Zahmatkesh, H.; Shahroze, R. An overview of security and privacy in smart cities' IoT communications. Transactions on Emerging Telecommunications Technologies 2022, 33, e3677. [Google Scholar] [CrossRef]
- Rusinova, V.; Martynova, E. Fighting cyber attacks with sanctions: Digital threats, economic responses. Israel Law Review 2024, 57, 135–174. [Google Scholar] [CrossRef]
- Narasimha Rao, K.P.; Chinnaiyan, S. Blockchain-Powered Patient-Centric Access Control with MIDC AES-256 Encryption for Enhanced Healthcare Data Security. Acta Informatica Pragensia 2024, 13, 374–394. [Google Scholar] [CrossRef]
- Ahmed, S.; Ahmed, I.; Kamruzzaman, M.; Saha, R. Cybersecurity Challenges in IT Infrastructure and Data Management: A Comprehensive Review of Threats, Mitigation Strategies, and Future Trend. Global Mainstream Journal of Innovation, Engineering & Emerging Technology 2022, 1, 36–61. [Google Scholar]
- Balayn, A.; Lofi, C.; Houben, G.-J. Managing bias and unfairness in data for decision support: a survey of machine learning and data engineering approaches to identify and mitigate bias and unfairness within data management and analytics systems. The VLDB Journal 2021, 30, 739–768. [Google Scholar] [CrossRef]
- De Falco, C.C.; Romeo, E. Algorithms and geo-discrimination risk: What hazards for smart cities' development? In Smart Cities; Routledge: 2025; pp. 104–117.
- Le Quy, T.; Roy, A.; Iosifidis, V.; Zhang, W.; Ntoutsi, E. A survey on datasets for fairness-aware machine learning. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 2022, 12, e1452. [Google Scholar] [CrossRef]
- Herdiansyah, H. Smart city based on community empowerment, social capital, and public trust in urban areas. Glob. J. Environ. Sci. Manag. 2023, 9, 113–128. [Google Scholar]
- Sarker, I.H. Smart City Data Science: Towards data-driven smart cities with open research issues. Internet of Things 2022, 19, 100528. [Google Scholar] [CrossRef]
- Gao, L.; Guan, L. Interpretability of machine learning: Recent advances and future prospects. IEEE MultiMedia 2023, 30, 105–118. [Google Scholar] [CrossRef]
- Rashid, M.M.; Kamruzzaman, J.; Hassan, M.M.; Imam, T.; Wibowo, S.; Gordon, S.; Fortino, G. Adversarial training for deep learning-based cyberattack detection in IoT-based smart city applications. Computers & Security 2022, 120, 102783. [Google Scholar]
- Dutta, H.; Minerva, R.; Alvi, M.; Crespi, N. Data-driven Modality Fusion: An AI-enabled Framework for Large-Scale Sensor Network Management. arXiv 2025, arXiv:2502.04937. [Google Scholar]
- Huang, J.; Zhang, Z.; Zheng, S.; Qin, F.; Wang, Y. DISTMM: Accelerating distributed multimodal model training. In Proceedings of the 21st USENIX Symposium on Networked Systems Design and Implementation (NSDI 24); 2024; pp. 1157–1171. [Google Scholar]
- Zhou, D.-W.; Wang, Q.-W.; Qi, Z.-H.; Ye, H.-J.; Zhan, D.-C.; Liu, Z. Class-incremental learning: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence 2024. [Google Scholar] [CrossRef]
- Jafari, F.; Moradi, K.; Shafiee, Q. Shallow learning vs. Deep learning in engineering applications. In Shallow Learning vs. Deep Learning: A Practical Guide for Machine Learning Solutions; Springer: 2024; pp. 29–76.
- Bischl, B.; Binder, M.; Lang, M.; Pielok, T.; Richter, J.; Coors, S.; Thomas, J.; Ullmann, T.; Becker, M.; Boulesteix, A.L. Hyperparameter optimization: Foundations, algorithms, best practices, and open challenges. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 2023, 13, e1484. [Google Scholar] [CrossRef]
- Schmitt, M. Securing the digital world: Protecting smart infrastructures and digital industries with artificial intelligence (AI)-enabled malware and intrusion detection. Journal of Industrial Information Integration 2023, 36, 100520. [Google Scholar] [CrossRef]
- Pearson, M. Pioneering Urban Biodiversity: Using AI-sensors, eDNA and traditional methods to create a novel biodiversity monitoring toolkit and assessment framework. 2024.

| Modality | Examples | Data Characteristics | Urban Application Areas |
|---|---|---|---|
| Visual | CCTV footage, satellite images, drone videos | Real-time, high volume, spatial information | Traffic monitoring, public safety, event management [18,29] |
| Sensor-based | Air quality sensors, weather stations, noise monitors | Continuous, structured, environmental data | Pollution monitoring, climate modeling [23,30] |
| Textual | Tweets, public service reports, news articles | Unstructured, periodic, noisy data | Public sentiment analysis, social media monitoring [20,31] |
| Geospatial | GPS data, geolocation tracking, heatmaps | Spatiotemporal, dynamic | Traffic management, mobility optimization [21,22] |
| Behavioral | Mobile app usage, pedestrian tracking | Structured and unstructured, behavioral | Urban mobility, public health monitoring [13,32] |

| Fusion Type | Example Models | Fusion Point | Notes |
|---|---|---|---|
| Early Fusion | VisualBERT [34], VL-BERT [35] | Input-level | Joint transformers over concatenated modalities |
| Late Fusion | CLIP [36], ALIGN [37] | Output-level | Independent encoders, aligned via similarity |
| Hybrid Fusion | LXMERT [38], ViLBERT [39], UNITER [40], ALBEF [41], BLIP [42] | Mid-level / cross-modal layers | Modality-specific encoders + cross-attention |
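
The difference between input-level and output-level fusion in the table above can be illustrated with a minimal sketch. It assumes two toy modalities (a hypothetical camera feature vector and a hypothetical sensor feature vector) and hand-picked linear weights; all names and values are illustrative and do not come from the cited models. Late fusion is shown here in its classical decision-level form (averaging per-modality scores), not as the similarity-based alignment used by CLIP-style models.

```python
import math


def sigmoid(x):
    """Squash a real-valued score into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def linear_score(features, weights, bias=0.0):
    """One-layer classifier: weighted sum of features followed by a sigmoid."""
    return sigmoid(sum(f * w for f, w in zip(features, weights)) + bias)


def early_fusion(camera_feats, sensor_feats, joint_weights):
    """Input-level fusion: concatenate the modalities, then apply one joint model."""
    joint = list(camera_feats) + list(sensor_feats)
    return linear_score(joint, joint_weights)


def late_fusion(camera_feats, sensor_feats, cam_weights, sen_weights, alpha=0.5):
    """Output-level fusion: score each modality on its own, then blend the scores."""
    cam_score = linear_score(camera_feats, cam_weights)
    sen_score = linear_score(sensor_feats, sen_weights)
    return alpha * cam_score + (1.0 - alpha) * sen_score


# Hypothetical features for one road segment (illustrative values only).
camera = [0.8, 0.1]       # e.g. vehicle density, lane occupancy
sensor = [0.3, 0.9, 0.5]  # e.g. PM2.5, noise level, humidity

p_early = early_fusion(camera, sensor, joint_weights=[1.0, -0.5, 0.2, 0.7, -0.1])
p_late = late_fusion(camera, sensor, cam_weights=[1.0, -0.5], sen_weights=[0.2, 0.7, -0.1])
print(f"early fusion: {p_early:.3f}, late fusion: {p_late:.3f}")
```

In the early variant a single model sees every feature at once and can learn cross-modal interactions; in the late variant each modality can be trained, replaced or dropped independently, at the cost of combining only the final scores. Hybrid approaches sit between the two by exchanging information at intermediate layers.
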

| Architecture | Description | Applications in Smart Cities | Advantages |
|---|---|---|---|
| Convolutional Neural Networks (CNNs) | CNNs are used for image processing but have been extended to integrate visual data with other modalities such as sensors or geospatial data. | Urban mobility prediction, environmental monitoring (analyzing traffic cameras and environmental sensor data). | Well-suited for spatial data analysis and can handle large-scale image data, which is common in traffic and environmental monitoring systems [47]. |
| Transformer-based Models | Transformer models, initially developed for natural language processing (NLP), use self-attention mechanisms to learn relationships across data modalities. | Image-text matching, video captioning, traffic event prediction (integrating visual and textual data). | Captures long-range dependencies across multiple modalities. Self-attention mechanism allows for flexible attention to relevant features across modalities [25]. |
| Graph Neural Networks (GNNs) | GNNs are designed to process graph-structured data, where relationships between entities are crucial for prediction tasks. | Traffic prediction, urban mobility, public safety (modeling road networks, sensors and traffic flows). | Effectively models spatial dependencies and interconnected data across multiple sources, such as roads, sensors and events [27]. |
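
To make the graph-based row above more concrete, the following sketch shows one round of neighbourhood aggregation, the basic operation behind GNN layers, on a tiny road graph. It is a simplification under stated assumptions: the node names, speed values and the `self_weight` parameter are hypothetical, and a real GNN layer would add learned weight matrices and a non-linearity.

```python
def message_passing_step(node_feats, edges, self_weight=0.5):
    """One round of mean-neighbour aggregation on an undirected graph.

    node_feats maps node id -> scalar feature (e.g. average speed in km/h).
    edges is a list of (u, v) pairs describing connected road segments.
    """
    neighbours = {node: [] for node in node_feats}
    for u, v in edges:
        neighbours[u].append(v)
        neighbours[v].append(u)

    updated = {}
    for node, feat in node_feats.items():
        nbrs = neighbours[node]
        # Isolated nodes keep their own value as the "neighbour" mean.
        nbr_mean = sum(node_feats[n] for n in nbrs) / len(nbrs) if nbrs else feat
        updated[node] = self_weight * feat + (1.0 - self_weight) * nbr_mean
    return updated


# Hypothetical chain of four road segments A-B-C-D with observed speeds.
speeds = {"A": 50.0, "B": 20.0, "C": 40.0, "D": 60.0}
edges = [("A", "B"), ("B", "C"), ("C", "D")]

smoothed = message_passing_step(speeds, edges)
print(smoothed)
```

After one step each segment's value reflects its adjacent segments, which is how congestion on one road can inform predictions for connected roads; stacking several such steps lets information travel further across the network.
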

| MML Technique | Accuracy (%) | Computational Complexity | Suitable Data Modalities | Application |
|---|---|---|---|---|
| Deep Learning | 90 | High | Audio, Video, Text | Urban video analytics [59,60] |
| Transfer Learning | 85 | Medium | Sensor Data, Images | Sensor-to-sensor adaptation [61] |
| Ensemble Methods | 88 | Low-Medium | Sensor Data, Social Media | Social sentiment and traffic data [62] |
| Graph-based Methods | 82 | Medium | Spatial Data, Networks | Traffic network inference [63] |
| Reinforcement Learning | 87 | High | IoT Data, Control Systems | IoT-based adaptive control [64] |

| Authors | Model/Approach | Pros | Cons |
|---|---|---|---|
| Ouoba et al. [67] | Multimodal Journey Planning | Addresses fragmented environments by integrating real-time data for comprehensive planning. | May require large computational resources for real-time data processing. |
| Botea et al. [68] | Risk-Averse Journey Advising | Accounts for uncertainties in public transport schedules, improving system reliability. | Does not fully address long-term dynamic changes in urban mobility. |
| Asgari et al. [69] | Multimodal Trajectories | Uses unsupervised models, effective for forecasting traffic flow without needing labeled data. | May struggle with real-time adjustments or highly dynamic urban environments. |
| Alessandretti et al. [70] | Public Transportation Networks | Leverages data-driven models to analyze complex transport networks, improving planning and efficiency. | Might be less effective in highly decentralized, less connected urban settings. |
| Pronello et al. [71] | Travel Behavior | Provides insights into behavioral shifts using multimodal data, improving transportation planning. | Requires large amounts of data to detect subtle shifts in behavior and patterns. |
| Kang and Youm [72] | Extended Public Transport | Improves route optimization, enhancing public transportation efficiency. | Complexity increases with the number of variables and real-time adjustments needed. |
| Sokolov et al. [73] | Digital Railway Infrastructure | Integrates digital frameworks to optimize railway systems and reduce urban congestion. | High computational complexity and infrastructure requirements for implementation. |
| Young et al. [74] | Smart-Citizen Engagement | Leverages multimodal data for interactive city management, improving citizen engagement. | May face challenges in data privacy and ethical concerns when dealing with citizen data. |
| Kumar et al. [75] | Crowd Monitoring | Enhances public safety through intelligent monitoring systems for crowd management. | Can be costly and difficult to scale across large urban areas without specialized infrastructure. |
| Zhang et al. [76] | Vehicle Tracking | Improves vehicle tracking accuracy in complex urban environments by combining multiple data sources. | Requires continuous data input and faces challenges in real-time tracking in highly dynamic settings. |

| Challenge | Proposed Solution(s) | References |
|---|---|---|
| Multimodal Representation Learning | Feature Fusion, Transfer Learning | [1,51] |
| Cross-modal Alignment | Cross-modal Attention Mechanisms, Multimodal Transformers | [52,53] |
| Scalability and Real-time Constraints | Distributed Computing, Cloud Infrastructure, Edge Computing | [54,55] |
| Robustness to Missing or Noisy Modalities | Data Imputation, Robust Training Methods, Noise Reduction | [56,57] |
| Interpretability of Models | Explainable AI Techniques, Model Visualization | [58,59] |
| Dataset and Benchmark Limitations | Standardized Datasets, Synchronized Data, Robust Benchmarks, Composite Metrics, Open Benchmarks | [1,5] |

| Aspect | Description | Example in Smart Cities |
|---|---|---|
| Definition | Representation learning transforms raw data from different modalities into a shared latent space where they can be compared or combined. | In smart cities, this involves mapping data from traffic cameras and social media into a unified feature space to analyze traffic congestion in relation to weather [74,80]. |
| Goal | To create a common feature space that allows different data types to be compared and integrated effectively. | By aligning traffic images and weather data, MML models create a joint representation that helps to analyze traffic congestion under different weather conditions [60]. |
| Applications | It is widely used in cross-modal retrieval, where a system retrieves relevant data from one modality based on a query from another modality. | Cross-modal retrieval might involve retrieving relevant satellite images of an area based on textual descriptions of a traffic incident [58,81]. |
| Advantage | Learning shared representations allows for more context-aware decision-making, enabling models to integrate diverse insights more effectively. | By learning a joint representation, a smart city system can combine sensor data with social media sentiment to better respond to traffic incidents or public safety concerns. |
| Challenges | One challenge is ensuring that the learned representations are semantically meaningful across modalities, which requires careful design and training. | In smart cities, aligning audio data (e.g., traffic sounds) with visual data (e.g., traffic cameras) may require sophisticated models to capture spatial and temporal context. |
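
The idea of a shared latent space and cross-modal retrieval described in the table above can be sketched in a few lines. This is a toy illustration only: the projection matrices `W_TEXT` and `W_IMAGE` are hand-set stand-ins for encoders that would normally be learned, and the feature names, tile identifiers and values are hypothetical.

```python
import math


def project(vec, matrix):
    """Linearly project a modality-specific vector into the shared space."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query_emb, gallery):
    """Rank gallery keys by similarity to the query, most similar first."""
    return sorted(gallery, key=lambda k: cosine(query_emb, gallery[k]), reverse=True)


# Shared space axes (assumed): [incident severity, wet weather].
W_TEXT = [[1, 0, 0],      # text features: [accident, rain, festival]
          [0, 1, 0]]
W_IMAGE = [[1, 0, 0, 0],  # image features: [stopped vehicles, wet road, crowd, greenery]
           [0, 1, 0, 0]]

# Hypothetical satellite tiles described by image features.
tiles = {
    "tile_A": [0.9, 0.8, 0.1, 0.0],
    "tile_B": [0.1, 0.9, 0.0, 0.3],
    "tile_C": [0.8, 0.0, 0.2, 0.1],
}
gallery = {name: project(feats, W_IMAGE) for name, feats in tiles.items()}

# Text query "accident in the rain" as a keyword indicator vector.
query = project([1, 1, 0], W_TEXT)
ranking = retrieve(query, gallery)
print(ranking)
```

Because both modalities are mapped into the same two-dimensional space, a text query can be compared directly with image embeddings. The challenge named in the table corresponds to learning the projections so that the shared axes carry the same meaning for every modality.
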

| Aspect of Smart City | Impact of MML | Notes |
|---|---|---|
| Transportation | 20% Reduction in Traffic Congestion [20] | Achieved through traffic flow prediction, adaptive signals and incident detection. |
| Energy Management | 15% Increase in Energy Efficiency [23] | Enabled by demand forecasting and optimized grid operations via MML. |
| Public Safety | 30% Decrease in Emergency Response Times [24] | Real-time data fusion improves emergency detection and resource dispatch. |
| Environmental Monitoring | 25% Reduction in Air Pollution-related Illnesses [15] | Sensor data integration helps in pollution forecasting and alerts. |
| Urban Planning | 10% Improvement in Urban Infrastructure Efficiency [25] | Supports better zoning, infrastructure usage and investment decisions. |

| Research Direction | Description |
|---|---|
| Handling Big Data [152] | Developing scalable MML algorithms for large datasets |
| Improving Model Interpretability [153] | Exploring techniques for explaining complex MML models |
| Addressing Privacy and Ethical Concerns [142] | Investigating methods for preserving privacy in MML models |
| Enhancing Robustness [154] | Researching strategies for improving the robustness of MML |
| Exploring Novel Data Modalities [155] | Investigating the use of emerging data modalities in MML |

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
