Submitted:
18 April 2025
Posted:
21 April 2025
Abstract
Keywords:
1. Introduction
The main contributions of this work are summarized as follows:
- (1) We introduce CFANet, a novel dual-branch RGB-D semantic segmentation network. Tailored feature extraction modules are designed for the distinct characteristics of RGB images and depth maps, and segmentation accuracy is then enhanced through adaptive cross-modal feature fusion.
- (2) BFEM builds asymmetric convolution on top of dilated convolution to alleviate the gridding effect of dilation, and achieves rich contextual learning through dense connections and Criss-Cross Attention. DFEM extracts salient unimodal features from both the channel and spatial dimensions of depth maps.
- (3) We guide the RGB branch and the depth branch to interact with each other, rather than simply treating the depth map as a complement to the RGB image.
- (4) AFFCM employs a multi-head self-attention mechanism to resolve the semantic discrepancies between RGB and depth features and adaptively align them. This effectively enhances the complementary information exchange between the two modalities and mitigates redundancy.
- (5) We adopt different strategies for feature maps at different scales: multi-scale feature maps are fused through SCM, while FRH fuses the first layer, which carries the richest detail, with the last layer, which carries the richest semantics.
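As an illustrative sketch only (not the paper's implementation), the asymmetric-convolution idea that BFEM borrows rests on the linearity of convolution: parallel 3×3, 3×1, and 1×3 branches summed together are exactly equivalent to one fused 3×3 kernel, which is why the extra branches cost nothing at inference. The kernels and input below are invented for demonstration; a pure-Python "valid" cross-correlation is used for clarity.

```python
def conv2d(img, ker):
    """'Valid' 2D cross-correlation of img (H x W) with ker (kh x kw)."""
    H, W = len(img), len(img[0])
    kh, kw = len(ker), len(ker[0])
    return [[sum(img[i + u][j + v] * ker[u][v]
                 for u in range(kh) for v in range(kw))
             for j in range(W - kw + 1)]
            for i in range(H - kh + 1)]

def embed(ker, kh, kw):
    """Zero-pad a small kernel to kh x kw, centred on the 3x3 skeleton."""
    out = [[0.0] * kw for _ in range(kh)]
    oh, ow = (kh - len(ker)) // 2, (kw - len(ker[0])) // 2
    for u, row in enumerate(ker):
        for v, x in enumerate(row):
            out[oh + u][ow + v] = x
    return out

# Three parallel branches, as in an asymmetric convolution block
# (example kernels, chosen arbitrarily for the demo).
k33 = [[1, 0, -1], [2, 0, -2], [1, 0, -1]]   # square 3x3 kernel
k31 = [[0.5], [1.0], [0.5]]                  # vertical 3x1 kernel
k13 = [[0.5, 1.0, 0.5]]                      # horizontal 1x3 kernel

img = [[float((i * 5 + j) % 7) for j in range(6)] for i in range(6)]

# Sum of the three branch outputs...
branch_sum = [[a + b + c for a, b, c in zip(r1, r2, r3)]
              for r1, r2, r3 in zip(conv2d(img, k33),
                                    conv2d(img, embed(k31, 3, 3)),
                                    conv2d(img, embed(k13, 3, 3)))]

# ...equals a single convolution with the element-wise fused 3x3 kernel.
fused = [[k33[u][v] + embed(k31, 3, 3)[u][v] + embed(k13, 3, 3)[u][v]
          for v in range(3)] for u in range(3)]
fused_out = conv2d(img, fused)

assert all(abs(a - b) < 1e-9
           for ra, rb in zip(branch_sum, fused_out)
           for a, b in zip(ra, rb))
```

The same additivity argument underlies kernel fusion in ACNet-style blocks; how BFEM combines this with dilated convolutions, dense connections, and Criss-Cross Attention is detailed in Section 3.2.1.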
2. Related Works
2.1. RGB-D Semantic Segmentation
2.2. Attention Mechanisms
3. Method
3.1. Overview of the Architecture
3.2. RGB Feature Extraction Module and Depth Map Feature Extraction Module
3.2.1. RGB Feature Extraction Module
3.2.2. Depth Map Feature Extraction Module
3.3. Adaptive Feature Complementary Fusion
3.4. Skip Connection Module and Feature Refinement Head
3.4.1. Feature Refinement Head
3.4.2. Skip Connection Module
4. Experiment
4.1. Dataset and Metrics
4.2. Implementation Details
4.3. Comparison with SOTA Methods
4.4. Visualization Results
4.5. Ablation Studies
4.5.1. Validate BFEM and DFEM
4.5.2. Validate AFEM, SCM, and FRH
4.5.3. Validate Backbone Network
5. Conclusion
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
- El Badaoui, R.; Bonmati Coll, E.; Psarrou, A.; Asaturyan, H.A.; Villarini, B. Enhanced CATBraTS for Brain Tumour Semantic Segmentation. Journal of Imaging 2025, 11, 8. [Google Scholar] [CrossRef] [PubMed]
- Chen, L.-Z.; Lin, Z.; Wang, Z.; Yang, Y.-L.; Cheng, M.-M. Spatial Information Guided Convolution for Real-Time RGBD Semantic Segmentation. IEEE Transactions on Image Processing 2021, 30, 2313–2324. [Google Scholar] [CrossRef]
- Zhao, H.; Shi, J.; Qi, X.; Wang, X.; Jia, J. Pyramid scene parsing network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017; pp. 2881–2890.
- Li, X.; He, H.; Li, X.; Li, D.; Cheng, G.; Shi, J.; Weng, L.; Tong, Y.; Lin, Z. PointFlow: Flowing semantics through points for aerial image segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021; pp. 4217–4226.
- Cao, J.; Leng, H.; Cohen-Or, D.; Lischinski, D.; Chen, Y.; Tu, C.; Li, Y. RGBxD: Learning depth-weighted RGB patches for RGB-D indoor semantic segmentation. Neurocomputing 2021, 462, 568–580. [Google Scholar] [CrossRef]
- Yan, X.; Hou, S.; Karim, A.; Jia, W. RAFNet: RGB-D attention feature fusion network for indoor semantic segmentation. Displays 2021, 70. [Google Scholar] [CrossRef]
- Chen, L.-C.; Papandreou, G.; Kokkinos, I.; Murphy, K.; Yuille, A.L. DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs. IEEE Transactions on Pattern Analysis and Machine Intelligence 2018, 40, 834–848. [Google Scholar] [CrossRef] [PubMed]
- Xiao, L.; Wu, B.; Hu, Y. Surface defect detection using image pyramid. IEEE Sensors Journal 2020, 20, 7181–7188. [Google Scholar] [CrossRef]
- Dollár, P.; Appel, R.; Belongie, S.; Perona, P. Fast feature pyramids for object detection. IEEE Transactions on Pattern Analysis and Machine Intelligence 2014, 36, 1532–1545. [Google Scholar] [CrossRef]
- Kazerouni, A.; Karimijafarbigloo, S.; Azad, R.; Velichko, Y.; Bagci, U.; Merhof, D. Fusenet: self-supervised dual-path network for medical image segmentation. In Proceedings of the 2024 IEEE International Symposium on Biomedical Imaging (ISBI); 2024; pp. 1–5. [Google Scholar]
- Zheng, Z.; Xie, D.; Chen, C.; Zhu, Z. Multi-resolution Cascaded Network with Depth-similar Residual Module for Real-time Semantic Segmentation on RGB-D Images. In Proceedings of the 2020 IEEE International Conference on Networking, Sensing and Control (ICNSC), 30 Oct.–2 Nov. 2020; pp. 1–6.
- Zhou, W.; Yang, E.; Lei, J.; Yu, L. FRNet: Feature Reconstruction Network for RGB-D Indoor Scene Parsing. IEEE Journal of Selected Topics in Signal Processing 2022, 16, 677–687. [Google Scholar] [CrossRef]
- Zhou, W.; Yuan, J.; Lei, J.; Luo, T. TSNet: Three-Stream Self-Attention Network for RGB-D Indoor Semantic Segmentation. IEEE Intelligent Systems 2021, 36, 73–78. [Google Scholar] [CrossRef]
- Cao, J.; Leng, H.; Lischinski, D.; Cohen-Or, D.; Tu, C.; Li, Y. ShapeConv: Shape-aware Convolutional Layer for Indoor RGB-D Semantic Segmentation. In Proceedings of the 18th IEEE/CVF International Conference on Computer Vision (ICCV), 11–17 October 2021; pp. 7068–7077. [Google Scholar]
- Zhang, G.; Xue, J.H.; Xie, P.; Yang, S.; Wang, G. Non-Local Aggregation for RGB-D Semantic Segmentation. IEEE Signal Processing Letters 2021, 28, 658–662. [Google Scholar] [CrossRef]
- Zhou, W.; Yue, Y.; Fang, M.; Qian, X.; Yang, R.; Yu, L. BCINet: Bilateral cross-modal interaction network for indoor scene understanding in RGB-D images. Information Fusion 2023, 94, 32–42. [Google Scholar] [CrossRef]
- Mnih, V.; Heess, N.; Graves, A.; Kavukcuoglu, K. Recurrent models of visual attention. Advances in neural information processing systems 2014, 27. [Google Scholar]
- Jaderberg, M.; Simonyan, K.; Zisserman, A. Spatial transformer networks. Advances in neural information processing systems 2015, 28. [Google Scholar]
- Huang, Z.; Wang, X.; Huang, L.; Huang, C.; Wei, Y.; Liu, W. CCNet: Criss-cross attention for semantic segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2019; pp. 603–612.
- Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018; pp. 7132–7141.
- Qin, Z.; Zhang, P.; Wu, F.; Li, X. FcaNet: Frequency channel attention networks. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2021; pp. 783–792.
- Wang, Q.; Wu, B.; Zhu, P.; Li, P.; Zuo, W.; Hu, Q. ECA-Net: Efficient channel attention for deep convolutional neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020; pp. 11534–11542.
- Yang, Z.; Zhu, L.; Wu, Y.; Yang, Y. Gated channel transformation for visual recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020; pp. 11794–11803.
- Woo, S.; Park, J.; Lee, J.-Y.; Kweon, I.S. CBAM: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), 2018; pp. 3–19.
- Hu, J.; Wang, H.; Wang, J.; Wang, Y.; He, F.; Zhang, J. SA-Net: A scale-attention network for medical image segmentation. PloS one 2021, 16, e0247388. [Google Scholar] [CrossRef]
- Fu, J.; Liu, J.; Tian, H.; Li, Y.; Bao, Y.; Fang, Z.; Lu, H. Dual attention network for scene segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019; pp. 3146–3154.
- Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
- Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin Transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2021; pp. 10012–10022.
- Dong, X.; Bao, J.; Chen, D.; Zhang, W.; Yu, N.; Yuan, L.; Chen, D.; Guo, B. CSWin Transformer: A general vision transformer backbone with cross-shaped windows. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022; pp. 12124–12134.
- Ding, X.; Guo, Y.; Ding, G.; Han, J. ACNet: Strengthening the kernel skeletons for powerful CNN via asymmetric convolution blocks. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2019; pp. 1911–1920.
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Advances in neural information processing systems 2017, 30. [Google Scholar]
- Ruan, H.; Tan, Z.; Chen, L.; Wan, W.; Cao, J. Efficient sub-pixel convolutional neural network for terahertz image super-resolution. Optics letters 2022, 47, 3115–3118. [Google Scholar] [CrossRef]
- Zhou, W.; Yuan, J.; Lei, J.; Luo, T. TSNet: Three-stream self-attention network for RGB-D indoor semantic segmentation. IEEE Intelligent Systems 2020, 36, 73–78. [Google Scholar] [CrossRef]
- Lopes, I.; Tuan-Hung, V.; de Charette, R. Cross-task Attention Mechanism for Dense Multi-task Learning. In Proceedings of the 23rd IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, 3–7 January 2023; pp. 2328–2337. [Google Scholar]
- Lin, D.; Zhang, R.; Ji, Y.; Li, P.; Huang, H. SCN: Switchable context network for semantic segmentation of RGB-D images. IEEE Transactions on Cybernetics 2018, 50, 1120–1131. [Google Scholar] [CrossRef]
- Jia, W.; Yan, X.; Liu, Q.; Zhang, T.; Dong, X. TCANet: three-stream coordinate attention network for RGB-D indoor semantic segmentation. Complex & Intelligent Systems 2024, 10, 1219–1230. [Google Scholar]
- Seichter, D.; Köhler, M.; Lewandowski, B.; Wengefeld, T.; Gross, H.-M. Efficient RGB-D semantic segmentation for indoor scene analysis. In Proceedings of the 2021 IEEE International Conference on Robotics and Automation (ICRA); 2021; pp. 13525–13531. [Google Scholar]
- Zhou, H.; Qi, L.; Huang, H.; Yang, X.; Wan, Z.; Wen, X. CANet: Co-attention network for RGB-D semantic segmentation. Pattern Recognition 2022, 124. [Google Scholar] [CrossRef]
- Wu, P.; Guo, R.; Tong, X.; Su, S.; Zuo, Z.; Sun, B.; Wei, J. Link-RGBD: Cross-Guided Feature Fusion Network for RGBD Semantic Segmentation. IEEE Sensors Journal 2022, 22, 24161–24175. [Google Scholar] [CrossRef]
- Zhou, W.; Xiao, Y.; Qiang, F.; Dong, X.; Xu, C.; Yu, L. AESeg: Affinity-Enhanced Segmenter Using Feature Class Mapping Knowledge Distillation for Efficient RGB-D Semantic Segmentation of Indoor Scenes. Neural Networks 2025, 107438. [Google Scholar] [CrossRef] [PubMed]
- Wu, Z.; Allibert, G.; Stolz, C.; Ma, C.; Demonceaux, C. Depth-Adapted CNNs for RGB-D Semantic Segmentation. arXiv preprint, 2022. [Google Scholar]
- Liu, H.; Xie, W.; Wang, S. Feature fusion and context interaction for RGB-D indoor semantic segmentation. Applied Soft Computing 2024, 167, 112379. [Google Scholar] [CrossRef]
- Yang, J.; Bai, L.; Sun, Y.; Tian, C.; Mao, M.; Wang, G. Pixel Difference Convolutional Network for RGB-D Semantic Segmentation. arXiv preprint, 2023, arXiv:2302. [CrossRef]
- Ding, H.; Jiang, X.; Shuai, B.; Liu, A.Q.; Wang, G. Semantic Segmentation With Context Encoding and Multi-Path Decoding. IEEE Transactions on Image Processing 2020, 29, 3520–3533. [Google Scholar] [CrossRef]
- Zhang, J.; Liu, H.; Yang, K.; Hu, X.; Liu, R.; Stiefelhagen, R. CMX: Cross-Modal Fusion for RGB-X Semantic Segmentation with Transformers. arXiv preprint, 2022, arXiv:2203.0483. [Google Scholar] [CrossRef]
- Seichter, D.; Koehler, M.; Lewandowski, B.; Wengefeld, T.; Gross, H.-M. Efficient RGB-D Semantic Segmentation for Indoor Scene Analysis. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), Xi'an, China, 30 May–5 June 2021; pp. 13525–13531. [Google Scholar]
- Hu, X.; Yang, K.; Fei, L.; Wang, K. ACNet: Attention Based Network to Exploit Complementary Features for RGBD Semantic Segmentation. In Proceedings of the 26th IEEE International Conference on Image Processing (ICIP), Taipei, Taiwan, 22–25 September 2019; pp. 1440–1444. [Google Scholar]

| Models | Backbone | PixAcc | mAcc | mIoU |
|---|---|---|---|---|
| TSNet [33] | ResNet-34 | 73.50 | 59.6 | 46.1 |
| DenseMTL [34] | ResNet-101 | - | - | 40.84 |
| SCN [35] | ResNet-152 | - | - | 50.7 |
| TCANet [36] | ResNet-50 | 71.3 | 60.3 | 47.8 |
| ESANet [37] | ResNet-50 | 74.42 | 59.27 | 46.79 |
| CANet [38] | ResNet-50 | 77.1 | 64.6 | 50.9 |
| Link-RGBD [39] | ResNet-50 | 76.8 | 59.6 | 49.5 |
| AESeg [40] | ResNet-50 | 77.0 | - | 50.7 |
| Z-ACN [41] | ResNet-50 | 75.88 | 63.55 | 50.05 |
| FCINet [42] | ResNet-50 | 75.9 | 63.2 | 51.7 |
| Ours (CFANet) | ResNet-50 | 78.47 | 65.31 | 53.86 |
| Models | Backbone | PixAcc | mAcc | mIoU |
|---|---|---|---|---|
| CANet [38] | ResNet-101 | 72.5 | 60.5 | 49.3 |
| PDCNet [43] | ResNet-101 | 72.4 | - | 49.2 |
| SGNet [2] | ResNet-101 | 71.0 | - | 47.5 |
| CGBNet [44] | ResNet-101 | 72.3 | - | 48.2 |
| CMX-B2 [45] | MiT-B2 | 72.8 | - | 49.7 |
| ESANet [46] | ResNet-34 | - | - | 48.2 |
| Link-RGBD [39] | ResNet-50 | 73.1 | 53.5 | 48.4 |
| ACNet [47] | ResNet-50 | - | - | 48.1 |
| ESANet [46] | ResNet-50 | - | - | 48.3 |
| FCINet [42] | ResNet-50 | 72.6 | 60.9 | 49.5 |
| Ours (CFANet) | ResNet-50 | 83.62 | 64.53 | 51.85 |
| Models | PixAcc | mAcc | mIoU |
|---|---|---|---|
| Without DFEM and BFEM | 73.26 | 62.35 | 48.31 |
| No interaction between the RGB and depth data | 74.16 | 63.43 | 49.98 |
| Replace DFEM with BFEM | 78.31 | 64.33 | 51.81 |
| Replace BFEM with ASPP | 78.31 | 64.29 | 51.51 |
| Replace DFEM with ASPP | 78.28 | 64.25 | 51.49 |
| Without AFEM | 74.56 | 63.51 | 50.23 |
| Without SCM | 78.02 | 64.12 | 50.75 |
| Without FRH | 77.83 | 64.56 | 50.94 |
| CFANet (ResNet-50) | 78.47 | 65.31 | 53.86 |
| Backbone | PixAcc | mAcc | mIoU |
|---|---|---|---|
| VGG-16 | 75.17 | 61.79 | 48.75 |
| ResNet-18 | 76.75 | 63.58 | 50.52 |
| ResNet-101 | 78.02 | 63.66 | 52.08 |
| ResNet-50 | 78.47 | 65.31 | 53.86 |
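For reference, the three metrics reported in the tables above can all be derived from a class confusion matrix C, where C[i][j] counts pixels of ground-truth class i predicted as class j: PixAcc is the trace over the total, mAcc averages per-class recall, and mIoU averages per-class intersection-over-union. The 3-class matrix below is a made-up example, not data from the paper:

```python
def metrics(C):
    """Compute (PixAcc, mAcc, mIoU) from an n x n confusion matrix."""
    n = len(C)
    total = sum(sum(row) for row in C)
    correct = sum(C[i][i] for i in range(n))
    pix_acc = correct / total                      # overall pixel accuracy
    accs, ious = [], []
    for i in range(n):
        gt = sum(C[i])                             # pixels labelled class i
        pred = sum(C[j][i] for j in range(n))      # pixels predicted class i
        accs.append(C[i][i] / gt)                  # per-class recall
        ious.append(C[i][i] / (gt + pred - C[i][i]))  # per-class IoU
    return pix_acc, sum(accs) / n, sum(ious) / n

# Hypothetical 3-class confusion matrix for illustration.
C = [[50, 5, 5],
     [10, 30, 0],
     [0, 10, 40]]
pix_acc, m_acc, m_iou = metrics(C)
print(round(pix_acc, 4), round(m_acc, 4), round(m_iou, 4))  # 0.8 0.7944 0.6623
```

Note that PixAcc is dominated by frequent classes, which is why mIoU, being class-balanced, is the headline metric in the comparisons above.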
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).