Submitted:
27 April 2025
Posted:
28 April 2025
You are already at the latest version
Abstract
Keywords:
1. Introduction
- Heterogeneous Multi-branch Local Feature Extraction: The proposed model establishes a multimodal local feature extraction network through three parallel convolutional branches. Specifically, the feature extractors perform diverse processing operations within different architectures to better capture comprehensive representations such as high-frequency textures, mesoscale structures, and detail representations for the remote sensing images. By doing this, the aligned and concatenated composite features preserve both sub-pixel edge responses and regional semantic correlations, effectively addressing feature representation limitations of single convolutional kernels.
- Hierarchical Window Attention with Pyramid Pooling: To mitigate the Transformer’s local detail loss, we propose an enhanced customized architecture termed Pyramid Self-learning Fusion Module by constructing four-level feature pyramids via patch merging. The dynamic shifted window mechanism enables cross-scale context modeling. Innovatively, dual pyramid pooling modules perform multi-granular spatial pooling on local and global features. After bilinear upsampling and convolution-based channel control, this strategy significantly improves vegetation classification by fusing local details with global semantics.
- Spatial Graph Convolution-Guided Feature Propagation: The proposed Graph Fusion Module maps local, global, and fused features into high-dimensional graph nodes, constructing spatial topology using 4-neighbor adjacency matrices. Three GCN pathways facilitate feature propagation, with learnable weights dynamically aggregating improved node features. This design goes beyond traditional grid limitations, establishing long-range dependencies in the reconstruction space to address semantic discontinuities caused by high-rise occlusions.
2. Related Works
2.1. CNN-Based Remote Sensing Image Segmentation
2.2. Transformer-Based Remote Sensing Semantic Segmentation
2.3. GCNs-Based and Other Related Deep Learning Methods
2.4. Overall Architecture
3. Methodology
3.1. Parallel Encoding Architecture
3.2. Pyramid Self-learning Fusion Module
3.3. Graph Fusion Module
3.4. Loss Function
4. Experiments and Discussion
4.1. Dataset Descriptions
- Vaihingen Dataset: The Vaihingen dataset consists of 16 high-resolution orthophotos, each measuring pixels. Each image includes three spectral channels: Near-Infrared (NIR), Red (R), and Green (G), along with a DSM at ground sampling distance (GSD). The dataset’s labels consist of 5 foreground classes: Buildings, Trees, Low vegetation, Cars, and Impervious surfaces, along with one background class (Clutter). The 16 images were split into a training set (12 images) and a testing set (4 images).
- Potsdam Dataset: The Potsdam dataset consists of 24 high-resolution orthophotos, each with a resolution of pixels. It offers four multispectral channels: Red (R), Green (G), Blue (B), and Infrared (IR), with a GSD. Similar to the Vaihingen dataset, it includes 5 foreground classes and 1 background class, but the distribution of foreground classes differs.
4.2. Experimental Setup
4.3. Evaluation Metrics
- IoU (Intersection over Union): IoU is a standard metric for evaluating image segmentation model performance, calculated as the intersection area of the predicted and ground truth regions divided by their union area. A higher IoU indicates better segmentation accuracy. In the experiments, we computed the IoU for each class and averaged them to obtain the overall IoU. A high IoU value indicates that the model performs well in handling complex boundaries and small objects.
- F1-Score: F1-Score is the harmonic mean of precision and recall and is typically used to evaluate models with class imbalance. In remote sensing image segmentation tasks, some classes may have significantly fewer samples than others. The F1-Score provides a more comprehensive evaluation of the model’s performance across different object types. A higher F1-Score indicates better balance between precision and recall.
4.4. Comparative Experiments
4.5. Ablation Experiments
4.5.1. Performance Analysis From Different Modules
4.5.2. Performance Variations on Diverse Losses
4.6. Discussion
5. Conclusion
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
- Fu, K.; Lu, W.X.; Liu, X.Y.; Deng, C.B.; Yu, H.F.; Sun, X. A comprehensive survey and assumption of remote sensing foundation modal. National Remote Sensing Bulletin 2024, 28, 1667–1680. [Google Scholar] [CrossRef]
- Yuan, T.; Hu, B. REU-Net: A Remote Sensing Image Building Segmentation Network Based on Residual Structure and the Edge Enhancement Attention Module. Applied Sciences (2076-3417) 2025, 15. [Google Scholar] [CrossRef]
- He, Y.; Seng, K.P.; Ang, L.M.; Peng, B.; Zhao, X. Hyper-CycleGAN: A New Adversarial Neural Network Architecture for Cross-Domain Hyperspectral Data Generation. Applied Sciences 2025, 15, 4188. [Google Scholar] [CrossRef]
- Zhang, L.P.; Zhang, L.F.; Du, B. Deep Learning for Remote Sensing Data: A Technical Tutorial on the State of the Art. IEEE Geoscience and Remote Sensing Magazine 2016, 4, 22–40. [Google Scholar] [CrossRef]
- Kazerouni, A.; Karimijafarbigloo, S.; Azad, R.; Velichko, Y.; Bagci, U.; Merhof, D. FuseNet: Self-Supervised Dual-Path Network For Medical Image Segmentation. In Proceedings of the 2024 IEEE International Symposium on Biomedical Imaging (ISBI); 2023; pp. 1–5. [Google Scholar]
- Radford, A.; Kim, J.W.; Hallacy, C.; et al. Learning Transferable Visual Models from Natural Language Supervision. In Proceedings of the International Conference on Machine Learning, Salt Lake City, USA, 2021. [Google Scholar]
- Yuan, L.; Chen, D.; Chen, Y.L.; Codella, N.C.F.; Dai, X.; Gao, J.; Hu, H.; Huang, X.; Li, B.; Li, C.; et al. Florence: A New Foundation Model for Computer Vision. ArXiv 2021, abs/2111.11432.
- Bao, H.; Dong, L.; Piao, S.; et al. BEiT: BERT PreTraining of Image Transformers. In Proceedings of the The Tenth International Conference on Learning Representations, Vienna, Austria; 2022. [Google Scholar]
- Sherrah, J. Fully convolutional networks for dense semantic labelling of high-resolution aerial imagery. arXiv 2016, arXiv:1606.02585 2016. [Google Scholar]
- Guo, Y.; Jia, X.; Paull, D. Effective sequential classifier training for SVM-based multitemporal remote sensing image classification. IEEE Transactions on Image Processing 2018, 27, 3036–3048. [Google Scholar] [CrossRef] [PubMed]
- Pal, M. Random forest classifier for remote sensing classification. International Journal of Remote Sensing 2007, 26, 217–222. [Google Scholar] [CrossRef]
- Li, Y.; Zhang, Y. A New Paradigm of Remote Sensing Image Interpretation by Coupling Knowledge Graph and Deep Learning. Geomatics and Information Science of Wuhan University 2022, 47, 1176–1190. [Google Scholar] [CrossRef]
- Zhang, S.; Wu, G.; Gu, J.; Han, J. Pruning convolutional neural networks with an attention mechanism for remote sensing image classification. Electronics 2020, 9, 1209. [Google Scholar] [CrossRef]
- Dong, R.; Pan, X.; Li, F. DenseU-net-based semantic segmentation of small objects in urban remote sensing images. IEEE Access 2019, 7, 65347–65356. [Google Scholar] [CrossRef]
- Zhou, X.; Wu, G.; Sun, X.; Hu, P.; Liu, Y. Attention-Based Multi-Kernelized and Boundary-Aware Network for image semantic segmentation. Neurocomputing 2024, 597, 127988. [Google Scholar] [CrossRef]
- Sun, X.; Chen, C.; Wang, X.; Dong, J.; Zhou, H.; Chen, S. Gaussian dynamic convolution for efficient single-image segmentation. IEEE Transactions on Circuits and Systems for Video Technology 2021, 32, 2937–2948. [Google Scholar] [CrossRef]
- Li, Z.; Wu, G.; Liu, Y. Prototype Enhancement for Few-Shot Point Cloud Semantic Segmentation. In Proceedings of the International Conference on Algorithms and Architectures for Parallel Processing. Springer; 2024; pp. 270–285. [Google Scholar]
- Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet classification with deep convolutional neural networks. In Proceedings of the Proc. NIPS; 2012; pp. 1097–1105. [Google Scholar]
- An, W.; Wu, G. Hybrid spatial-channel attention mechanism for cross-age face recognition. Electronics 2024, 13, 1257. [Google Scholar] [CrossRef]
- Lv, Q.; Sun, X.; Chen, C.; Dong, J.; Zhou, H. Parallel complement network for real-time semantic segmentation of road scenes. IEEE Transactions on Intelligent Transportation Systems 2021, 23, 4432–4444. [Google Scholar] [CrossRef]
- Liu, Z.; et al. Swin Transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the Proc. ICCV; 2021; pp. 10012–10022. [Google Scholar]
- Wan, Y.; Zhou, D.; Wang, C.; Liu, Y.; Bai, C. Multi-scale medical image segmentation based on pixel encoding and spatial attention mechanism. Sheng Wu Yi Xue Gong Cheng Xue Za Zhi 2024, 41, 511–519. [Google Scholar] [CrossRef]
- Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. In Proceedings of the Proc. ICLR; 2015. [Google Scholar]
- Long, J.; Shelhamer, E.; Darrell, T. Fully convolutional networks for semantic segmentation. In Proceedings of the Proc. CVPR; 2015; pp. 3431–3440. [Google Scholar]
- Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional networks for biomedical image segmentation. In Proceedings of the Proc. MICCAI; 2015; pp. 234–241. [Google Scholar]
- Chen, L.C.; Zhu, Y.; Papandreou, G.; Schroff, F.; Adam, H. Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proceedings of the Proc. ECCV; 2018; pp. 801–818. [Google Scholar]
- Chen, Y.; Wang, Y.; Jiao, P.; Feng, M. A self-attention CNN for remote sensing image semantic segmentation. IEEE Transactions on Geoscience and Remote Sensing 2021, 59, 3155–3169. [Google Scholar] [CrossRef]
- Wang, Q.; Liu, S.; Chanussot, J.; Li, X. Scene classification with recurrent attention of VHR remote sensing images. IEEE Transactions on Geoscience and Remote Sensing 2019, 57, 1155–1167. [Google Scholar] [CrossRef]
- Vaswani, A.; et al. Attention is all you need. In Proceedings of the Proc. NIPS; 2017; pp. 5998–6008. [Google Scholar]
- Dosovitskiy, A.; et al. An image is worth 16x16 words: Transformers for image recognition at scale. In Proceedings of the Proc. ICLR; 2021. [Google Scholar]
- Zheng, H.; et al. Swin-UNet: Unet-like pure transformer for medical image segmentation. In Proceedings of the Proc. MICCAI; 2021; pp. 1–10. [Google Scholar]
- Xu, Y.; Du, B.; Zhang, L. Multi-scale spatial context-aware transformer for remote sensing image segmentation. IEEE Transactions on Geoscience and Remote Sensing 2022, 60, 1–12. [Google Scholar]
- Bazi, Y.; Bashmal, L.; Al Rahhal, M.M.; Dayil, R.A.; Ajami, N. Vision transformers for remote sensing image classification. Remote Sensing 2021, 13, 516. [Google Scholar] [CrossRef]
- Liu, C.; Wu, H.; Li, Y.; Li, X. SwinFCN: A spatial attention Swin transformer backbone for semantic segmentation of high-resolution aerial images. Remote Sensing 2022, 14, 1075. [Google Scholar] [CrossRef]
- Kipf, T.N.; Welling, M. Semi-supervised classification with graph convolutional networks. In Proceedings of the Proc. ICLR; 2017. [Google Scholar]
- Liang, J.; Cao, J.; Sun, G.; Zhang, K.; Van Gool, L.; Timofte, R. SwinIR: Image restoration using Swin transformer. In Proceedings of the Proc. ICCV; 2021; pp. 1833–1844. [Google Scholar]
- Chen, Z.; Badrinarayanan, V.; Lee, C.Y.; Rabinovich, A. GradNet: Gradient-guided network for visual object tracking. In Proceedings of the Proc. ICCV; 2019; pp. 6162–6171. [Google Scholar]
- Sun, F.; Li, W.; Guan, X.; Liu, H.; Wu, J.; Gao, Y. Dual attention graph convolutional network for semantic segmentation of very high resolution remote sensing imagery. IEEE Transactions on Geoscience and Remote Sensing 2022, 60, 1–14. [Google Scholar]
- Zhang, Z.; Wang, L.; Zhang, Y. Multi-scale graph convolutional network for remote sensing image segmentation. IEEE Geoscience and Remote Sensing Letters 2022, 19, 1–5. [Google Scholar]
- Wang, X.; Girshick, R.; Gupta, A.; He, K. Non-local neural networks. In Proceedings of the Proc. CVPR; 2018; pp. 7794–7803. [Google Scholar]
- Li, Z.; Luo, Y.; Wang, Z.; Zhang, B. MAT-GCN: A multi-scale attention guided graph convolutional network for remote sensing scene classification. IEEE Transactions on Geoscience and Remote Sensing 2022, 60, 1–12. [Google Scholar]
- Gu, A.; Dao, T. Mamba: Linear-Time Sequence Modeling with Selective State Spaces. arXiv 2023, arXiv:2312.00752 2023. [Google Scholar]
- Liu, M.; Dan, J.; Lu, Z.; Yu, Y.; Li, Y.; Li, X. CM-UNet: Hybrid CNN-Mamba UNet for Remote Sensing Image Semantic Segmentation. arXiv preprint 2024. [Google Scholar]
- Zhang, Q.; Li, Z.; Xu, H. Multimodal fusion for remote sensing image segmentation using Mamba model. Journal of Applied Remote Sensing 2022, 16, 019–032. [Google Scholar]
- Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You Only Look Once: Unified, Real-Time Object Detection. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR); 2015; pp. 779–788. [Google Scholar]
- Gao, Y.; Liu, Z.; Zhang, X. Real-time remote sensing image change detection using YOLO. Journal of Remote Sensing Technology 2019, 20, 89–101. [Google Scholar]
- Vennerød, C.; Kjærran, A.; Bugge, E. Long Short-term Memory RNN. ArXiv 2021, abs/2105.06756.
- Chen, J.; Zhang, Y.; Liu, Q. Remote sensing image segmentation based on LSTM for urban change detection. International Journal of Remote Sensing 2020, 41, 2048–2061. [Google Scholar]
- Cao, H.; Wang, Y.; Chen, J.; Jiang, D.; Zhang, X.; Tian, Q.; Wang, M. Swin-Unet: Unet-like Pure Transformer for Medical Image Segmentation. In Proceedings of the ECCV Workshops; 2021. [Google Scholar]
- Chen, Y.; Lin, G.; Li, S.; Bourahla, O.E.; Wu, Y.; Wang, F.; Feng, J.; Xu, M.; Li, X. BANet: Bidirectional Aggregation Network With Occlusion Handling for Panoptic Segmentation. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); 2020; pp. 3792–3801. [Google Scholar]
- Wang, L.; Li, R.; Zhang, C.; Fang, S.; Duan, C.; Meng, X.; Atkinson, P.M. UNetFormer: A UNet-like transformer for efficient semantic segmentation of remote sensing urban scene imagery. ISPRS Journal of Photogrammetry and Remote Sensing 2022, 190, 196–214. [Google Scholar] [CrossRef]
- Xiang, J.; Liu, J.; Chen, D.; Xiong, Q.; Deng, C. CTFuseNet: A Multi-Scale CNN-Transformer Feature Fused Network for Crop Type Segmentation on UAV Remote Sensing Imagery. Remote Sensing 2023, 15, 1151. [Google Scholar] [CrossRef]








| Model | Imp. surf. | Building | Car | Low veg. | Tree | mIoU | mF1 | |||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| IoU | F1 | IoU | F1 | IoU | F1 | IoU | F1 | IoU | F1 | |||
| SwinT | 72.74 | 82.27 | 81.90 | 88.83 | 26.24 | 44.02 | 66.31 | 79.93 | 68.41 | 79.46 | 63.12 | 74.90 |
| Swin UNet | 65.75 | 77.08 | 71.51 | 81.24 | 28.05 | 46.36 | 55.10 | 69.61 | 61.01 | 73.89 | 56.28 | 69.64 |
| BANet | 72.14 | 81.87 | 81.47 | 88.44 | 32.96 | 52.26 | 65.70 | 79.38 | 68.94 | 79.84 | 64.24 | 76.36 |
| FTUNetFormer | 75.91 | 84.60 | 84.49 | 90.38 | 43.71 | 63.99 | 69.06 | 82.33 | 70.28 | 80.88 | 68.69 | 80.44 |
| CTFuse | 77.19 | 85.37 | 78.51 | 87.96 | 56.06 | 71.84 | 73.20 | 84.53 | 72.03 | 83.74 | 71.40 | 82.69 |
| PLGTransformer | 81.46 | 88.45 | 77.64 | 87.41 | 59.05 | 74.26 | 73.38 | 84.65 | 72.74 | 84.22 | 72.85 | 83.80 |
| Model | Imp. surf. | Building | Car | Low veg. | Tree | mIoU | mF1 | |||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| IoU | F1 | IoU | F1 | IoU | F1 | IoU | F1 | IoU | F1 | |||
| SwinT | 65.17 | 74.50 | 58.06 | 63.03 | 51.71 | 58.41 | 67.84 | 79.01 | 58.74 | 68.76 | 60.30 | 68.74 |
| Swin UNet | 59.52 | 69.56 | 54.59 | 60.22 | 56.77 | 62.52 | 63.38 | 75.12 | 52.06 | 62.35 | 57.26 | 65.95 |
| BANet | 63.25 | 72.82 | 55.99 | 61.28 | 53.51 | 59.82 | 66.77 | 78.09 | 54.65 | 64.62 | 58.83 | 67.33 |
| FTUNetFormer | 66.87 | 75.78 | 59.23 | 63.96 | 57.66 | 63.24 | 70.16 | 81.14 | 60.25 | 70.30 | 62.83 | 70.88 |
| CTFuse | 67.98 | 76.55 | 68.81 | 81.52 | 59.56 | 74.66 | 77.46 | 87.30 | 61.37 | 76.06 | 67.04 | 79.22 |
| PLGTransformer | 58.90 | 74.14 | 85.21 | 92.01 | 70.44 | 82.66 | 61.11 | 70.40 | 75.14 | 85.81 | 70.16 | 81.00 |
| Model | Imp. surf. | Building | Car | Low veg. | Tree | mIoU | mF1 | |||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| IoU | F1 | IoU | F1 | IoU | F1 | IoU | F1 | IoU | F1 | |||
| Base | 65.75 | 77.08 | 71.51 | 81.24 | 18.05 | 26.36 | 45.10 | 59.61 | 61.01 | 73.89 | 52.29 | 63.64 |
| Base+PSFM | 67.28 | 78.41 | 78.70 | 86.81 | 6.90 | 11.24 | 50.68 | 64.70 | 61.90 | 74.27 | 53.09 | 63.09 |
| Base+GFM | 70.57 | 80.66 | 79.66 | 87.19 | 29.19 | 39.27 | 53.71 | 67.62 | 67.07 | 78.35 | 60.04 | 70.62 |
| Base+PSFM+GFM | 81.46 | 88.45 | 77.64 | 87.41 | 59.05 | 74.26 | 73.38 | 84.65 | 72.74 | 84.22 | 72.85 | 83.80 |
| Model | Imp. surf. | Building | Car | Low veg. | Tree | mIoU | mF1 | |||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| IoU | F1 | IoU | F1 | IoU | F1 | IoU | F1 | IoU | F1 | |||
| Base | 59.52 | 69.56 | 59.72 | 64.58 | 40.61 | 45.91 | 52.27 | 63.12 | 41.01 | 51.08 | 50.63 | 58.85 |
| Base+PSFM | 61.18 | 71.40 | 76.76 | 82.10 | 42.35 | 50.52 | 61.87 | 73.26 | 61.00 | 70.97 | 60.63 | 69.65 |
| Base+GFM | 62.26 | 72.06 | 75.00 | 80.57 | 56.55 | 62.58 | 58.73 | 70.19 | 65.70 | 76.22 | 63.65 | 72.32 |
| Base+PSFM+GFM | 58.90 | 74.14 | 85.21 | 92.01 | 70.44 | 82.66 | 61.11 | 70.40 | 75.14 | 85.81 | 70.16 | 81.00 |
| Setting | mIoU | mF1 | ||
|---|---|---|---|---|
| Baseline | ✓ | ✓ | 72.85 | 83.80 |
| Multi-class Region Focal Loss Only | ✓ | 61.62 | 70.09 | |
| Multi-scale Boundary Loss Only | ✓ | 63.43 | 74.06 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content. |
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).