Submitted: 21 August 2025
Posted: 21 August 2025
Abstract
Keywords:
1. Introduction
1.1. Background
1.2. Motivation
- We design an efficient architecture that integrates Structure-Aware Preprocessing (SAP) and Structure-Aware Deformable Mamba (SADM) blocks. Multi-directional scanning captures long-range dependencies, while deformable spatial attention and multi-scale structural feature extraction let the model adapt to irregular object geometries and complex structural patterns.
- We propose a Multi-Scale Feature Fusion (MSFF) framework whose cross-scale attention models inter-scale dependencies and progressively aligns features across hierarchical layers. This handles objects of varying sizes and complex boundaries while preserving both local detail and global context.
- We introduce Cross-Scale Consistency (CSC) constraints: loss terms that enforce feature consistency and prediction consistency across scales, with the aim of stabilizing training, improving generalization, and maintaining structural coherence throughout the processing pipeline.
- We evaluate the resulting HybridSeg framework against CNN-based, Transformer-based, and existing Mamba-based approaches under varying computational and environmental conditions. HybridSeg attains the highest mIoU among the compared methods (84.1%) while retaining the computational efficiency needed for deployment on resource-limited edge computing platforms (31.8 FPS, 38.9 M parameters).

| Feature | [16] | [17] | [18] | [19] | [20] | [21] | [14] | [11] | [8] | [4] | Proposed work |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Mamba/State Space Models | ✓ | ✓ | ✓ | ✓ | ✓ | ||||||
| Structure-Aware Processing | ✓ | ||||||||||
| Deformable Attention | ✓ | ✓ | |||||||||
| Multi-Scale Feature Fusion | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | |||||
| Cross-Scale Consistency | ✓ | ||||||||||
| Linear Complexity Modeling | ✓ | ✓ | ✓ | ✓ | ✓ |
2. Related Work
2.1. Distributed Visual Processing Networks
2.2. State Space Models for Network Coordination
2.3. Multi-Scale Coordination and Network-Aware Processing
3. Method
3.1. Overall Architecture
3.2. Structure-Aware Preprocessing
3.2.1. Multi-Scale Structure Extraction
3.2.2. Adaptive Feature Weighting
3.3. Structure-Aware Deformable Mamba Block
3.3.1. Multi-Directional Scanning
3.3.2. Deformable Spatial Attention
3.3.3. Multi-Scale Integration
3.4. Multi-Scale Feature Fusion
3.4.1. Progressive Feature Alignment
3.4.2. Cross-Scale Attention
3.5. UNet Decoder with Enhanced Skip Connections
3.5.1. Gated Skip Connections
3.5.2. Progressive Upsampling
3.6. Cross-Scale Consistency
3.6.1. Feature Consistency
3.6.2. Prediction Consistency
3.7. Loss Function and Training Strategy
3.7.1. Structure-Aware Loss
3.7.2. Boundary-Aware Loss
4. Experiment
4.1. Data Description
4.2. Evaluation Metrics
4.2.1. Intersection over Union (IoU)
4.2.2. mean Intersection over Union (mIoU)
4.2.3. Pixel Accuracy (PA)
4.2.4. Cross-Domain mIoU
4.3. Parameter Settings
5. Results
5.1. Training Dynamics and Convergence Analysis
5.2. Qualitative Segmentation Results
5.3. Quantitative Comparison
5.4. Performance Under Environmental Variations
5.5. Ablation Studies
5.6. Computational Requirements
5.7. SADM Scanning Analysis
6. Conclusions
References
- Minaee, S.; Boykov, Y.; Porikli, F.; Plaza, A.; Kehtarnavaz, N.; Terzopoulos, D. Image Segmentation Using Deep Learning: A Survey. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 44, 3523–3542.
- Feng, Z.; Guo, Y.; Sun, Y. Segmentation of Road Negative Obstacles Based on Dual Semantic-Feature Complementary Fusion for Autonomous Driving. IEEE Trans. Intell. Veh. 2024, 9, 4687–4697.
- An, T.; Huang, W.; Xu, D.; He, Q.; Hu, J.; Lou, Y. A Deep Learning Framework for Boundary-Aware Semantic Segmentation. In Proceedings of the 5th International Conference on Artificial Intelligence and Industrial Technology Applications (AIITA); 2025; pp. 886–890.
- Zhang, R.; Luo, X.; Lv, J.; Cao, J.; Zhu, Y.; Wang, J.; Zheng, B. Enhancing Medical Image Classification With Context Modulated Attention and Multi-Scale Feature Fusion. IEEE Access 2025, 13, 15226–15243.
- Wu, B.; Cai, Z.; Wu, W.; Yin, X. AoI-Aware Resource Management for Smart Health via Deep Reinforcement Learning. IEEE Access 2023.
- Wu, B.; Huang, J.; Yu, S. “X of Information” Continuum: A Survey on AI-Driven Multi-Dimensional Metrics for Next-Generation Networked Systems. arXiv 2025, arXiv:2507.19657.
- Wu, B.; Huang, J.; Duan, Q. FedTD3: An Accelerated Learning Approach for UAV Trajectory Planning. In Proceedings of the International Conference on Wireless Artificial Intelligent Computing Systems and Applications (WASA). Springer; 2025; pp. 13–24.
- Zeng, Q.; Zhou, J.; Tao, J.; Chen, L.; Niu, X.; Zhang, Y. Multiscale Global Context Network for Semantic Segmentation of High-Resolution Remote Sensing Images. IEEE Trans. Geosci. Remote Sens. 2024, 62, 1–13.
- Wu, B.; Wu, W. Model-Free Cooperative Optimal Output Regulation for Linear Discrete-Time Multi-Agent Systems Using Reinforcement Learning. Mathematical Problems in Engineering 2023, 2023, 6350647.
- Azad, R.; Aghdam, E.K.; Rauland, A.; Jia, Y.; Avval, A.H.; Bozorgpour, A.; Karimijafarbigloo, S.; Cohen, J.P.; Adeli, E.; Merhof, D. Medical Image Segmentation Review: The Success of U-Net. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 46, 10076–10095.
- Li, K.; Wang, D.; Liu, G.; Zhu, W.; Zhong, H.; Wang, Q. DiagSWin: A Multi-Scale Vision Transformer With Diagonal-Shaped Windows for Object Detection and Segmentation. Neural Netw. 2024, 180, 106653.
- Pan, D.; Wu, B.N.; Sun, Y.L.; Xu, Y.P. A Fault-Tolerant and Energy-Efficient Design of a Network Switch Based on a Quantum-Based Nano-Communication Technique. Sustain. Comput. Inform. Syst. 2023, 37, 100827.
- Li, K.; Li, X.; Wang, Y.; He, Y.; Wang, Y.; Wang, L.; Qiao, Y. VideoMamba: State Space Model for Efficient Video Understanding. In Proceedings of the European Conference on Computer Vision (ECCV). Springer; 2024; pp. 237–255.
- Ma, C.; Wang, Z. Semi-Mamba-UNet: Pixel-Level Contrastive and Cross-Supervised Visual Mamba-Based UNet for Semi-Supervised Medical Image Segmentation. Knowl.-Based Syst. 2024, 300, 112203.
- Wu, B.; Huang, J.; Duan, Q.; Dong, L.; Cai, Z. Enhancing Vehicular Platooning With Wireless Federated Learning: A Resource-Aware Control Framework. arXiv 2025, arXiv:2507.00856.
- Zhou, Z.; Siddiquee, M.M.R.; Tajbakhsh, N.; Liang, J. UNet++: Redesigning Skip Connections to Exploit Multiscale Features in Image Segmentation. IEEE Trans. Med. Imaging 2020, 39, 1856–1867.
- Chen, B.; Liu, Y.; Zhang, Z.; Lu, G.; Kong, A.W.K. TransAttUnet: Multi-Level Attention-Guided U-Net With Transformer for Medical Image Segmentation. IEEE Trans. Emerg. Top. Comput. Intell. 2024, 8, 55–68.
- Zhang, J.; Liu, H.; Yang, K.; Hu, X.; Liu, R.; Stiefelhagen, R. CMX: Cross-Modal Fusion for RGB-X Semantic Segmentation With Transformers. IEEE Trans. Intell. Transp. Syst. 2023, 24, 14679–14694.
- Liu, J.; Yang, H.; Zhou, H.Y.; Yu, L.; Liang, Y.; Yu, Y.; Zhang, S.; Zheng, H.; Wang, S. Swin-UMamba: Adapting Mamba-Based Vision Foundation Models for Medical Image Segmentation. IEEE Trans. Med. Imaging.
- Wang, F.; Wang, J.; Ren, S.; Wei, G.; Mei, J.; Shao, W.; Zhou, Y.; Yuille, A.; Xie, C. Mamba-Reg: Vision Mamba Also Needs Registers. In Proceedings of the Computer Vision and Pattern Recognition Conference; 2025; pp. 14944–14953.
- Shi, Y.; Xia, B.; Jin, X.; Wang, X.; Zhao, T.; Xia, X.; Xiao, X.; Yang, W. VmambaIR: Visual State Space Model for Image Restoration. IEEE Trans. Circuits Syst. Video Technol. 2025, 35, 5560–5574.
- Badrinarayanan, V.; Kendall, A.; Cipolla, R. SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 2481–2495.
- Wu, H.; Zhao, Z.; Wang, Z. META-Unet: Multi-Scale Efficient Transformer Attention Unet for Fast and High-Accuracy Polyp Segmentation. IEEE Trans. Autom. Sci. Eng. 2024, 21, 4117–4128.
- Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature Pyramid Networks for Object Detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; 2017; pp. 2117–2125.
- Fang, Z.; Hu, S.; An, H.; Zhang, Y.; Wang, J.; Cao, H.; Chen, X.; Fang, Y. PACP: Priority-Aware Collaborative Perception for Connected and Autonomous Vehicles. IEEE Trans. Mob. Comput. 2024.
- Huang, Y.; Wang, L.; Xu, J. Quantum Entanglement Path Selection and Qubit Allocation via Adversarial Group Neural Bandits. IEEE/ACM Trans. Netw.
- Huang, Y.; Zhang, L.; Xu, J. Adversarial Group Linear Bandits and Its Application to Collaborative Edge Inference. In Proceedings of the IEEE INFOCOM 2023–IEEE Conference on Computer Communications (INFOCOM). IEEE; 2023; pp. 1–10.
- Huang, Y.; Liu, Q.; Xu, J. Adversarial Combinatorial Bandits With Switching Cost and Arm Selection Constraints. In Proceedings of the IEEE INFOCOM 2024–IEEE Conference on Computer Communications (INFOCOM). IEEE; 2024; pp. 371–380.
- Rezaei Jafari, F.; Montavon, G.; Müller, K.R.; Eberle, O. MambaLRP: Explaining Selective State Space Sequence Models. Adv. Neural Inf. Process. Syst. 2024, 37, 118540–118570.
- Behrouz, A.; Hashemi, F. Graph Mamba: Towards Learning on Graphs With State Space Models. In Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining; 2024; pp. 119–130.
- Yao, Y.; Liu, Z.; Cui, Z.; Peng, Y.; Zhou, J. Selective Visual Prompting in Vision Mamba. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 39; 2025; pp. 22083–22091.
- Liu, S.; Lin, Y.; Liu, D.; Wang, P.; Zhou, B.; Si, F. Frequency-Enhanced Lightweight Vision Mamba Network for Medical Image Segmentation. IEEE Trans. Instrum. Meas. 2025, 74, 1–12.
- Fang, Z.; Wang, J.; Ma, Y.; Tao, Y.; Deng, Y.; Chen, X.; Fang, Y. R-ACP: Real-Time Adaptive Collaborative Perception Leveraging Robust Task-Oriented Communications. IEEE J. Sel. Areas Commun. 2025.
- Wu, B.; Huang, J.; Duan, Q. Real-time Intelligent Healthcare Enabled by Federated Digital Twins with AoI Optimization. IEEE Netw.
- Fang, Z.; Liu, Z.; Wang, J.; Hu, S.; Guo, Y.; Deng, Y.; Fang, Y. Task-Oriented Communications for Visual Navigation With Edge-Aerial Collaboration in Low Altitude Economy. arXiv 2025, arXiv:2504.18317.
- Ding, Z.; Huang, J.; Duan, Q.; Zhang, C.; Zhao, Y.; Gu, S. A Dual-Level Game-Theoretic Approach for Collaborative Learning in UAV-Assisted Heterogeneous Vehicle Networks. In Proceedings of the 2025 IEEE International Performance, Computing, and Communications Conference (IPCCC). IEEE; 2025; pp. 1–8.
- Van Quyen, T.; Kim, M.Y. Feature Pyramid Network With Multi-Scale Prediction Fusion for Real-Time Semantic Segmentation. Neurocomputing 2023, 519, 104–113.
- Guo, C.; Fan, B.; Zhang, Q.; Xiang, S.; Pan, C. AugFPN: Improving Multi-Scale Feature Learning for Object Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; 2020; pp. 12595–12604.
- Huang, J.; Wu, B.; Duan, Q.; Dong, L.; Yu, S. A Fast UAV Trajectory Planning Framework in RIS-assisted Communication Systems with Accelerated Learning via Multithreading and Federating. IEEE Trans. Mob. Comput.
- Mitra, N.J.; Wand, M.; Zhang, H.; Cohen-Or, D.; Kim, V.; Huang, Q.X. Structure-Aware Shape Processing. In ACM SIGGRAPH 2014 Courses; 2014; pp. 1–21.
- Wu, Z.; Wang, X.; Lin, D.; Lischinski, D.; Cohen-Or, D.; Huang, H. SAGNet: Structure-Aware Generative Network for 3D-Shape Modeling. ACM Trans. Graph. (TOG) 2019, 38, 1–14.
- Huang, S.K.; Yu, Y.T.; Huang, C.R.; Cheng, H.C. Cross-Scale Fusion Transformer for Histopathological Image Classification. IEEE J. Biomed. Health Inform. 2024, 28, 297–308.
- Fang, Z.; Hu, S.; Wang, J.; Deng, Y.; Chen, X.; Fang, Y. Prioritized Information Bottleneck Theoretic Framework With Distributed Online Learning for Edge Video Analytics. IEEE Trans. Netw.
- Zhu, S.; Li, Y.; Dai, X.; Mao, T.; Wei, L.; Yan, Y. A Multi-Resolution Hybrid CNN-Transformer Network With Scale-Guided Attention for Medical Image Segmentation. IEEE J. Biomed. Health Inform.
- Xu, H.; Liao, J.; Liu, H.; Sun, Y. Learning Semantic Alignment Using Global Features and Multi-Scale Confidence. IEEE Trans. Circuits Syst. Video Technol. 2024, 34, 897–910.






| Component | Parameter Description | Value | Component | Parameter Description | Value |
|---|---|---|---|---|---|
| Structure-Aware Preprocessing | | | SADM Architecture | | |
| Enhancement factor | Feature enhancement control | 0.5 | Scanning patterns | Directional sequences | 4 |
| Multi-scale kernels | Convolution sizes | | Hidden state dim | Mamba dimension | 256 |
| Blur parameters | Gaussian smoothing | | Deformable groups | Offset computation | 8 |
| Adaptive pooling | Global average pooling | GAP | Modulation size | Deformable kernel | |
| Weight activation | Sigmoid normalization | | Fusion mechanism | Attention weights | Softmax |
| Multi-Scale Feature Fusion | | | Gated Decoder | | |
| Pyramid levels | Feature hierarchy | 4 | Gate computation | Adaptive control | |
| Fusion coefficients | Cross-scale weights | Learnable | Skip integration | Feature combination | Element-wise |
| Attention dimension d | Feature channels | 256 | Upsampling method | Scale restoration | Bilinear |
| QKV projections | Linear mappings | | Output generation | Final convolution | |
| Loss Function Weights | | | Optimization Settings | | |
| Structure loss | Gradient matching | 0.1 | Optimization method | Gradient descent | AdamW |
| Boundary loss | Edge enhancement | 0.2 | Initial learning rate | Training rate | |
| Consistency loss | Multi-scale alignment | 0.15 | Learning schedule | Rate decay | Cosine |
| Auxiliary outputs | Intermediate supervision | 3 levels | Maximum epochs | Training duration | 800 |
| | | | Mini-batch size | Parallel samples | 16 |
| | | | Image resolution | Input dimensions | |
| | | | Regularization | Weight penalty | |
| | | | Data augmentation | Transformation set | Geometric |
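
The loss weights and learning schedule listed in the table above can be illustrated with a short sketch. This is one plausible reading, not the authors' implementation: the additive form of the total loss, the function names, and the `base_lr` and `min_lr` values are assumptions (the initial learning rate is not recoverable from the table). Only the weights 0.1, 0.2, and 0.15, the cosine schedule, and the 800-epoch budget come from the table.

```python
import math

# Weights taken from the parameter table; combining them additively is an assumption.
LOSS_WEIGHTS = {"structure": 0.1, "boundary": 0.2, "consistency": 0.15}
MAX_EPOCHS = 800  # maximum epochs from the parameter table


def total_loss(seg_loss, aux_losses, weights=LOSS_WEIGHTS):
    """Main segmentation loss plus weighted auxiliary terms (assumed additive form)."""
    return seg_loss + sum(weights[name] * value for name, value in aux_losses.items())


def cosine_lr(epoch, base_lr, max_epochs=MAX_EPOCHS, min_lr=0.0):
    """Standard cosine annealing from base_lr at epoch 0 to min_lr at max_epochs."""
    progress = min(max(epoch / max_epochs, 0.0), 1.0)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    # Hypothetical per-term loss values, for illustration only.
    terms = {"structure": 0.5, "boundary": 0.5, "consistency": 0.4}
    print(round(total_loss(1.0, terms), 4))
    # base_lr=1e-3 is a placeholder, not a value reported in the paper.
    for epoch in (0, 400, 800):
        print(epoch, cosine_lr(epoch, base_lr=1e-3))
```

With this schedule the rate starts at `base_lr`, reaches half of it at epoch 400, and decays to `min_lr` at epoch 800.
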

| Method | Person IoU (%) | Object IoU (%) | Track IoU (%) | Person F1-Score (%) | Object F1-Score (%) | Track F1-Score (%) | mIoU (%) | PA (%) |
|---|---|---|---|---|---|---|---|---|
| FCN | 73.2±1.5 | 68.5±1.8 | 81.4±1.2 | 78.6±1.3 | 74.2±1.6 | 85.3±1.1 | 74.4±1.2 | 86.7±0.9 |
| DeepLabV3+ | 76.8±1.2 | 72.1±1.5 | 84.2±1.0 | 81.5±1.1 | 77.9±1.3 | 87.6±0.9 | 77.7±1.0 | 88.4±0.8 |
| PSPNet | 75.4±1.3 | 70.9±1.6 | 83.7±1.1 | 80.2±1.2 | 76.5±1.4 | 86.9±1.0 | 76.7±1.1 | 87.8±0.8 |
| SegFormer | 78.5±1.1 | 74.3±1.3 | 85.9±0.9 | 83.1±1.0 | 79.7±1.2 | 89.2±0.8 | 79.6±0.9 | 90.1±0.7 |
| Segmenter | 79.2±1.0 | 75.1±1.2 | 86.5±0.8 | 83.8±0.9 | 80.4±1.1 | 89.7±0.7 | 80.3±0.8 | 90.6±0.6 |
| Mask2Former | 80.1±0.9 | 80.8±1.0 | 87.2±0.7 | 84.6±0.8 | 86.1±0.9 | 90.5±0.6 | 82.7±0.7 | 91.3±0.5 |
| VM-UNet | 83.9±0.7 | 77.8±1.0 | 88.1±0.6 | 88.2±0.6 | 82.9±0.9 | 91.3±0.5 | 83.3±0.6 | 93.9±0.4 |
| HybridSeg | 82.6±0.5 | 79.4±0.7 | 90.4±0.4 | 87.1±0.6 | 84.8±0.8 | 93.2±0.3 | 84.1±0.4 | 93.2±0.3 |
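
The metrics reported in this table follow standard definitions, which the sketch below computes from a confusion matrix. It is a minimal illustration under stated assumptions: the function names and the toy confusion matrix are hypothetical, and how the paper aggregates scores over images and runs is not stated here. The last helper only checks the arithmetic that each mIoU entry is the mean of the three per-class IoU values.

```python
def segmentation_metrics(confusion):
    """Per-class IoU and F1, plus mIoU and pixel accuracy (PA).

    confusion[i][j] counts pixels whose true class is i and predicted class is j.
    """
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    iou, f1 = [], []
    for c in range(n):
        tp = confusion[c][c]
        fn = sum(confusion[c]) - tp                        # missed pixels of class c
        fp = sum(confusion[r][c] for r in range(n)) - tp   # pixels wrongly labelled c
        union = tp + fp + fn
        iou.append(tp / union if union else 0.0)
        f1.append(2 * tp / (2 * tp + fp + fn) if union else 0.0)
    correct = sum(confusion[c][c] for c in range(n))
    return {
        "iou": iou,
        "f1": f1,
        "miou": sum(iou) / n,
        "pa": correct / total if total else 0.0,
    }


def mean_of_classes(per_class_scores):
    """Mean of per-class scores, rounded to one decimal as in the table."""
    return round(sum(per_class_scores) / len(per_class_scores), 1)


if __name__ == "__main__":
    # Toy 3-class confusion matrix (Person, Object, Track); values are made up.
    toy = [[8, 1, 1], [2, 6, 2], [0, 1, 9]]
    print(segmentation_metrics(toy))
    # HybridSeg per-class IoU values from the table reproduce its mIoU entry.
    print(mean_of_classes([82.6, 79.4, 90.4]))  # 84.1
```
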

| Condition | Person IoU (%) | Object IoU (%) | Track IoU (%) | mIoU (%) |
|---|---|---|---|---|
| Daylight | 84.3±0.4 | 81.2±0.6 | 91.8±0.3 | 85.8±0.3 |
| Dusk/Dawn | 82.5±0.6 | 78.9±0.8 | 90.2±0.5 | 83.9±0.5 |
| Fog (vis<200m) | 81.7±0.7 | 78.1±0.9 | 89.6±0.6 | 83.1±0.6 |
| Rain (>5mm/h) | 81.1±0.8 | 77.4±1.0 | 89.1±0.7 | 82.5±0.7 |
| Night (IR) | 80.2±0.9 | 76.7±1.1 | 88.4±0.8 | 81.8±0.8 |

| Configuration | mIoU (%) | PA (%) | FPS | Memory (MB) | Params (M) |
|---|---|---|---|---|---|
| Baseline | 74.2 | 86.5 | 42.3 | 1420 | 23.5 |
| +SAP | 76.8 | 88.1 | 39.7 | 1580 | 26.8 |
| +SADM | 79.3 | 90.2 | 35.6 | 1720 | 32.4 |
| +MSFF | 81.7 | 91.8 | 33.2 | 1850 | 36.1 |
| +CSC Loss | 82.9 | 92.5 | 33.2 | 1850 | 36.1 |
| +Gated Dec. | 84.1 | 93.2 | 31.8 | 1950 | 38.9 |
| Full model | 84.1 | 93.2 | 31.8 | 1950 | 38.9 |
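
Reading the ablation rows as cumulative (each row adds one component to the previous configuration, which the row labels suggest but the table does not state), the per-component contribution and its throughput cost follow by simple differencing. The sketch below only performs that arithmetic on the mIoU and FPS values copied from the table; the names are hypothetical.

```python
# (configuration, mIoU %, FPS) copied from the ablation table.
ABLATION = [
    ("Baseline", 74.2, 42.3),
    ("+SAP", 76.8, 39.7),
    ("+SADM", 79.3, 35.6),
    ("+MSFF", 81.7, 33.2),
    ("+CSC Loss", 82.9, 33.2),
    ("+Gated Dec.", 84.1, 31.8),
]


def incremental_changes(rows):
    """Difference of each row against the previous one (assumes cumulative rows)."""
    changes = []
    for (_, prev_miou, prev_fps), (name, miou, fps) in zip(rows, rows[1:]):
        changes.append((name, round(miou - prev_miou, 1), round(fps - prev_fps, 1)))
    return changes


if __name__ == "__main__":
    for name, d_miou, d_fps in incremental_changes(ABLATION):
        print(f"{name:12s} mIoU {d_miou:+.1f}  FPS {d_fps:+.1f}")
```

Under this reading, the CSC loss is the only step that changes mIoU without changing FPS, memory, or parameter count, which is consistent with it being a training-time constraint.
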

| Method | Params (M) | FLOPs (G) | Memory (MB) | FPS | mIoU (%) |
|---|---|---|---|---|---|
| DeepLabV3+ | 59.3 | 214.5 | 2840 | 28.7 | 77.7 |
| PSPNet | 46.7 | 189.2 | 2450 | 31.2 | 76.7 |
| SegFormer-B2 | 24.7 | 62.4 | 1680 | 45.6 | 79.6 |
| VM-UNet | 44.2 | 123.7 | 2120 | 36.8 | 83.3 |
| HybridSeg | 38.9 | 98.6 | 1950 | 31.8 | 84.1 |

| Scanning Pattern | mIoU (%) | ΔmIoU vs. All (%) | FPS | Memory (MB) |
|---|---|---|---|---|
| Horizontal | 79.8 | -4.3 | 35.2 | 1780 |
| Vertical | 79.4 | -4.7 | 35.4 | 1770 |
| Diagonal | 78.6 | -5.5 | 35.8 | 1760 |
| Anti-diagonal | 78.1 | -6.0 | 36.1 | 1750 |
| H+V | 81.6 | -2.5 | 33.6 | 1860 |
| H+V+D | 83.0 | -1.1 | 32.4 | 1920 |
| All (H+V+D+A) | 84.1 | 0.0 | 31.8 | 1950 |
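
The four scanning patterns compared above can be pictured as four different orders in which a 2D feature grid is flattened into a 1D sequence before it is fed to a state space block. The sketch below is an illustration only: the exact traversal used by SADM, whether reversed scans are included, and how the directional outputs are fused are not recoverable from this table, so the orderings and the names `scan_orders`, `scan`, and `unscan` are assumptions.

```python
def scan_orders(height, width):
    """Four flattening orders for an H x W grid, each a list of (row, col) pairs."""
    cells = [(r, c) for r in range(height) for c in range(width)]
    return {
        "horizontal": cells,                                             # row by row
        "vertical": [(r, c) for c in range(width) for r in range(height)],  # column by column
        # Assumed convention: walk lines of constant (col - row).
        "diagonal": sorted(cells, key=lambda rc: (rc[1] - rc[0], rc[0])),
        # Assumed convention: walk lines of constant (row + col).
        "anti_diagonal": sorted(cells, key=lambda rc: (rc[0] + rc[1], rc[0])),
    }


def scan(grid, order):
    """Flatten a 2D grid into a sequence following the given order."""
    return [grid[r][c] for r, c in order]


def unscan(sequence, order, height, width):
    """Place a processed sequence back onto the 2D grid."""
    grid = [[None] * width for _ in range(height)]
    for value, (r, c) in zip(sequence, order):
        grid[r][c] = value
    return grid


if __name__ == "__main__":
    grid = [[1, 2, 3], [4, 5, 6]]
    for name, order in scan_orders(2, 3).items():
        print(name, scan(grid, order))
```

Each order visits every cell exactly once, so `unscan(scan(grid, order), order, H, W)` restores the original grid; a multi-directional block would process each sequence separately and merge the restored grids.
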

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).