Submitted: 10 July 2025
Posted: 11 July 2025
Abstract
Keywords:
1. Introduction
2. Materials and Methodology
2.1. Dataset
2.2. Framework
2.2.1. Image Diffusion
2.2.2. FlashInternImage
2.2.3. Vision Transformer
2.2.4. Deformable Vision Transformer
- DViT uses a DCNv4 backbone whose deformable convolutions adapt their receptive fields dynamically to capture intricate spatial relationships, which helps the model cope with the occlusions, rotations, and complex geometric distortions common in remote sensing imagery.
- The transition from convolutional feature maps to transformer embeddings adds positional embeddings and a classification token, making the network more sensitive to spatial context and positional variance.
- The transformer encoder lets DViT model global interdependencies among image patches, addressing a limitation of convolution-only methods, which typically neglect long-range interactions.
- Normalization and dropout layers in both the convolutional and transformer modules improve training stability and generalization, supporting robust performance across diverse remote sensing scenarios.
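
The transition from convolutional feature maps to a transformer token sequence can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the toy shapes, the zero-initialized classification token, and the identity query/key/value projections are all assumptions made for illustration, and the DCNv4 backbone itself is replaced by a random feature map.

```python
import numpy as np

def feature_map_to_tokens(feat, cls_token, pos_embed):
    """Turn a (C, H, W) feature map into a (1 + H*W, C) token sequence.

    All names here are hypothetical; the feature map stands in for the
    output of a convolutional backbone.
    """
    c, h, w = feat.shape
    # Each of the H*W spatial locations becomes one C-dimensional token.
    tokens = feat.reshape(c, h * w).T                       # (H*W, C)
    # Prepend the classification token, then add positional embeddings.
    seq = np.concatenate([cls_token[None, :], tokens], axis=0)
    return seq + pos_embed                                  # (1 + H*W, C)

def self_attention(seq):
    """Single-head scaled dot-product self-attention.

    Simplification: queries, keys, and values are the tokens themselves
    (identity projections), so there are no learned weights.
    """
    d = seq.shape[1]
    scores = seq @ seq.T / np.sqrt(d)
    scores = scores - scores.max(axis=1, keepdims=True)     # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)  # row-wise softmax
    return weights @ seq, weights

rng = np.random.default_rng(0)
feat = rng.standard_normal((8, 4, 4))          # toy feature map: C=8, H=W=4
cls_token = np.zeros(8)                        # assumed zero-initialized
pos_embed = 0.02 * rng.standard_normal((17, 8))

seq = feature_map_to_tokens(feat, cls_token, pos_embed)
out, weights = self_attention(seq)
print(seq.shape, out.shape)                    # (17, 8) (17, 8)
```

Every row of `weights` spans all 17 tokens, so each output token mixes information from every spatial location in one step. This is the kind of long-range interaction a stack of local convolutions only reaches through depth.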
3. Results
3.1. Environment Setup
3.2. Evaluation of Diffusion Framework
3.3. Evaluation of Model Performances
3.3.1. Hyperparameters Setup
3.3.2. Result Analysis of Each Model
3.4. Visualization of Model Attention via Heatmap
4. Discussion
5. Conclusions
Author Contributions
Data Availability Statement
Acknowledgments
Conflicts of Interest

| Model | OA | mAcc | Kappa | Precision | Recall | F1-score |
|---|---|---|---|---|---|---|
| ViT | 0.9359 | 0.9360 | 0.9199 | 0.9361 | 0.9360 | 0.9356 |
| FlashInternImage | 0.9313 | 0.9303 | 0.9141 | 0.9367 | 0.9303 | 0.9304 |
| DViT | 0.9572 | 0.9568 | 0.9465 | 0.9590 | 0.9568 | 0.9574 |
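
The six columns above can all be derived from a single confusion matrix. The sketch below shows one common way to do so, assuming macro averaging over classes for mAcc, Precision, Recall, and F1-score; the source does not state its averaging scheme, so that choice, the function name, and the toy matrix are assumptions. Under macro averaging mAcc and Recall coincide, which agrees with the identical mAcc and Recall values in each row of the table.

```python
import numpy as np

def classification_metrics(cm):
    """Compute OA, mAcc, Kappa, and macro precision/recall/F1.

    cm[i, j] = number of samples of true class i predicted as class j.
    Assumes every class appears at least once as a true label and at
    least once as a prediction (no zero-division handling).
    """
    cm = np.asarray(cm, dtype=float)
    n = cm.sum()
    tp = np.diag(cm)
    recall = tp / cm.sum(axis=1)        # per-class recall = per-class accuracy
    precision = tp / cm.sum(axis=0)
    f1 = 2 * precision * recall / (precision + recall)
    oa = tp.sum() / n                   # overall accuracy
    # Chance agreement expected from the row and column marginals.
    pe = (cm.sum(axis=0) * cm.sum(axis=1)).sum() / n**2
    kappa = (oa - pe) / (1 - pe)        # Cohen's kappa
    return {
        "OA": oa,
        "mAcc": recall.mean(),
        "Kappa": kappa,
        "Precision": precision.mean(),
        "Recall": recall.mean(),
        "F1-score": f1.mean(),
    }

# Toy two-class example (not data from the paper).
metrics = classification_metrics([[8, 2], [1, 9]])
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

For the toy matrix, 17 of 20 samples are correct, so OA is 0.85; the chance agreement is 0.5, which gives a Kappa of 0.7.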

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).