Submitted: 01 August 2025
Posted: 04 August 2025
Abstract
Keywords:
1. Introduction
2. Related Work
2.1. Single Modal Semantic Segmentation
2.2. Multimodal Semantic Segmentation (MSS)
3. Methodology
3.1. Framework of MMFNet
3.2. Dual Branch Encoder
3.3. Multimodal Feature Fusion Block
3.4. Transformer Decoder
3.5. Loss Function
4. Experiment and Results
4.1. Dataset
4.2. Evaluation Metrics
4.3. Experiment Setup
4.4. Experimental Results and Analysis
4.4.1. Comparison Results on the Vaihingen Dataset
4.4.2. Comparison Results on the Potsdam Dataset
| Model | Backbone | Imp. IoU | Bui. IoU | Low. IoU | Tre. IoU | Car IoU | OA | mF1 | mIoU |
|---|---|---|---|---|---|---|---|---|---|
| PSPNet | Resnet-18 | 78.98 | 88.93 | 68.23 | 68.42 | 77.77 | 86.12 | 82.51 | 76.47 |
| Swin | Swin-T | 79.28 | 90.5 | 69.98 | 70.41 | 79.64 | 87.05 | 83.69 | 77.96 |
| Unetformer | Resnet-18 | 84.51 | 92.08 | 72.70 | 71.42 | 83.45 | 89.19 | 89.20 | 80.83 |
| DCSwin | Swin-T | 82.96 | 92.50 | 71.31 | 71.24 | 82.29 | 88.31 | 88.71 | 80.06 |
| CMFNet | VGG-16 | 85.55 | 93.65 | 72.23 | 74.65 | 91.25 | 89.97 | 91.01 | 83.37 |
| Vmamba | Vmamba-T | 84.82 | 91.24 | 75.16 | 75.38 | 88.04 | 81.52 | 90.40 | 82.93 |
| RS3Mamba | R18-Mamba-T | 86.95 | 94.46 | 75.50 | 76.28 | 92.98 | 90.73 | 87.39 | 85.24 |
| MFMamba | R18-Mamba-T | 87.34 | 94.90 | 75.09 | 76.81 | 92.88 | 90.89 | 91.92 | 85.41 |
| Ours | R18-Mamba-T | 87.41 | 95.71 | 76.54 | 77.50 | 93.13 | 91.32 | 92.31 | 86.06 |
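
The mIoU column appears to be the unweighted mean of the five per-class IoU values. The table does not state this definition, so the following Python sketch treats it as an assumption and checks the arithmetic on the "Ours" row; the function name `mean_iou` is illustrative only.

```python
# Per-class IoU (%) for the "Ours" row of the table above.
class_iou = {"Imp.": 87.41, "Bui.": 95.71, "Low.": 76.54, "Tre.": 77.50, "Car": 93.13}


def mean_iou(per_class):
    """Unweighted mean of per-class IoU values (assumed mIoU definition)."""
    return sum(per_class.values()) / len(per_class)


miou = mean_iou(class_iou)
print(f"{miou:.2f}")  # prints 86.06, matching the reported mIoU
```

Under this assumption the reported mIoU of 86.06 is reproduced from the class columns.
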
4.4.3. Computational Complexity Analysis
4.5. Ablation Studies
5. Conclusions
Author Contributions
Funding
Data Availability Statement
Acknowledgments
Conflicts of Interest

| Model | Backbone | Imp. IoU | Bui. IoU | Low. IoU | Tre. IoU | Car IoU | OA | mF1 | mIoU |
|---|---|---|---|---|---|---|---|---|---|
| PSPNet | Resnet-18 | 79.27 | 89.5 | 60.41 | 77.25 | 71.45 | 87.58 | 86.31 | 75.76 |
| Swin | Swin-T | 81.49 | 89.93 | 63.08 | 75.05 | 64.97 | 86.74 | 84.87 | 74.90 |
| Unetformer | Resnet-18 | 79.33 | 88.67 | 61.86 | 73.56 | 70.73 | 86.65 | 85.31 | 74.83 |
| DCSwin | Swin-T | 81.47 | 89.81 | 63.69 | 74.82 | 70.54 | 87.70 | 86.11 | 76.07 |
| CMFNet | VGG-16 | 86.59 | 94.25 | 66.75 | 82.75 | 77.03 | 91.39 | 89.50 | 81.47 |
| Vmamba | Vmamba-T | 82.58 | 89.06 | 67.35 | 77.83 | 57.07 | 82.23 | 84.56 | 74.78 |
| RS3Mamba | R18-Mamba-T | 85.21 | 92.34 | 66.51 | 82.94 | 81.21 | 90.87 | 89.28 | 81.64 |
| MFMamba | R18-Mamba-T | 86.04 | 94.02 | 66.14 | 83.64 | 77.43 | 91.37 | 89.48 | 81.46 |
| Ours | R18-Mamba-T | 87.55 | 94.37 | 68.92 | 84.21 | 82.42 | 92.06 | 90.77 | 83.50 |

| Method | FLOPs (G) | Parameters (M) | mIoU (%) |
|---|---|---|---|
| PSPNet | 64.15 | 65.60 | 75.76 |
| Swin | 60.28 | 59.02 | 74.90 |
| Unetformer | 5.99 | 11.72 | 74.83 |
| DCSwin | 40.08 | 66.95 | 76.07 |
| CMFNet | 159.55 | 104.07 | 81.47 |
| Vmamba | 12.41 | 29.94 | 74.78 |
| RS3Mamba | 19.78 | 43.32 | 81.64 |
| MFMamba | 19.12 | 62.43 | 81.46 |
| Ours | 19.15 | 69.85 | 83.50 |
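
To make the cost and accuracy trade-off in the table above easier to read, the rows can be expressed relative to the proposed model. The following is a minimal sketch using only the values listed in the table; the helper `relative_to` and the dictionary layout are illustrative choices, not part of the original work.

```python
# (FLOPs in G, parameters in M, mIoU in %) copied from the table above.
complexity = {
    "PSPNet": (64.15, 65.60, 75.76),
    "Swin": (60.28, 59.02, 74.90),
    "Unetformer": (5.99, 11.72, 74.83),
    "DCSwin": (40.08, 66.95, 76.07),
    "CMFNet": (159.55, 104.07, 81.47),
    "Vmamba": (12.41, 29.94, 74.78),
    "RS3Mamba": (19.78, 43.32, 81.64),
    "MFMamba": (19.12, 62.43, 81.46),
    "Ours": (19.15, 69.85, 83.50),
}


def relative_to(reference, table):
    """Return (FLOPs ratio, parameter ratio, mIoU gap) of each row versus a reference row."""
    ref_flops, ref_params, ref_miou = table[reference]
    return {
        name: (flops / ref_flops, params / ref_params, miou - ref_miou)
        for name, (flops, params, miou) in table.items()
        if name != reference
    }


relative = relative_to("Ours", complexity)
for name, (flops_ratio, params_ratio, miou_gap) in relative.items():
    print(f"{name:<11} FLOPs x{flops_ratio:.2f}  params x{params_ratio:.2f}  mIoU {miou_gap:+.2f}")
```

With these numbers, CMFNet needs roughly 8.3 times the FLOPs of the proposed model while trailing it by about 2 mIoU points, whereas the proposed model carries more parameters than RS3Mamba and MFMamba at a similar FLOPs budget.
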

| Dataset | Bands | Imp. OA (%) | Bui. OA (%) | Low. OA (%) | Tre. OA (%) | Car OA (%) | mF1 (%) | mIoU (%) |
|---|---|---|---|---|---|---|---|---|
| Vaihingen | NIRRG | 92.04 | 96.06 | 80.64 | 92.04 | 88.36 | 90.04 | 82.29 |
| Vaihingen | NIRRG+DSM | 93.57 | 97.21 | 81.16 | 91.50 | 88.39 | 90.77 | 83.50 |
| Potsdam | RGB | 92.45 | 97.50 | 88.94 | 86.63 | 96.11 | 91.63 | 84.91 |
| Potsdam | RGB+DSM | 92.70 | 98.01 | 88.78 | 87.65 | 96.38 | 92.31 | 86.06 |
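
The effect of adding the DSM band can be read off the table above as a simple difference in mIoU. A short sketch of that arithmetic follows, with the values copied from the table; the function name `dsm_gain` and the convention that fused inputs end in `+DSM` are assumptions made for illustration.

```python
# mIoU (%) with and without the DSM band, copied from the table above.
miou_by_bands = {
    "Vaihingen": {"NIRRG": 82.29, "NIRRG+DSM": 83.50},
    "Potsdam": {"RGB": 84.91, "RGB+DSM": 86.06},
}


def dsm_gain(results):
    """Return the mIoU difference (fused minus optical-only) for each dataset."""
    gains = {}
    for dataset, by_bands in results.items():
        fused = next(v for k, v in by_bands.items() if k.endswith("+DSM"))
        optical = next(v for k, v in by_bands.items() if not k.endswith("+DSM"))
        gains[dataset] = round(fused - optical, 2)
    return gains


gains = dsm_gain(miou_by_bands)
print(gains)
```

On these figures, the DSM input adds about 1.21 mIoU points on Vaihingen and about 1.15 on Potsdam.
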
| WCA | FreqFusion | Imp. | Bui. | Low. | Tre. | Car | OA | mF1 | mIoU |
|---|---|---|---|---|---|---|---|---|---|
| √ | × | 87.53 | 95.49 | 75.71 | 77.48 | 93.30 | 91.13 | 92.21 | 85.90 |
| × | √ | 87.32 | 95.31 | 75.44 | 77.50 | 93.09 | 90.97 | 92.12 | 85.73 |
| √ | √ | 87.41 | 95.71 | 76.54 | 77.50 | 93.13 | 91.32 | 92.31 | 86.06 |
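
A quick way to read the ablation table above is to ask which configuration scores highest on each metric. The sketch below does this with the tabulated values; the configuration labels and the helper `best_per_metric` are illustrative names, and ties are reported as ties.

```python
# Columns and rows copied from the ablation table above.
metrics = ["Imp.", "Bui.", "Low.", "Tre.", "Car", "OA", "mF1", "mIoU"]
ablation = {
    "WCA only": [87.53, 95.49, 75.71, 77.48, 93.30, 91.13, 92.21, 85.90],
    "FreqFusion only": [87.32, 95.31, 75.44, 77.50, 93.09, 90.97, 92.12, 85.73],
    "WCA + FreqFusion": [87.41, 95.71, 76.54, 77.50, 93.13, 91.32, 92.31, 86.06],
}


def best_per_metric(names, table):
    """Map each metric to the configuration(s) with the highest value."""
    best = {}
    for i, name in enumerate(names):
        top = max(values[i] for values in table.values())
        best[name] = [cfg for cfg, values in table.items() if values[i] == top]
    return best


best = best_per_metric(metrics, ablation)
for name, configs in best.items():
    print(f"{name}: {', '.join(configs)}")
```

The combined configuration leads on Bui., Low., OA, mF1, and mIoU and ties on Tre., while the WCA-only variant is slightly ahead on Imp. and Car.
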
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).