Submitted: 18 November 2024
Posted: 19 November 2024
Abstract

Keywords:
1. Introduction
2. Related Work
2.1. Vision Transformer
2.2. Atrous Spatial Pyramid Pooling
2.3. Selective Feature Fusion
3. Proposed Methods
3.1. CNN-ViT Encoder
3.1.1. Subsampled Residual Block
3.1.2. Residual Block
3.1.3. Vision Transformer

3.2. Adaptive Fusion Decoder
3.2.1. Separate Enhancement Addition Fusion Module
3.2.2. Separate Enhancement Concatenation Fusion Module
3.2.3. Adaptive Fusion Module
3.2.4. Up Convolution Module
3.2.5. Deep ASPP Module
3.3. Training Loss Function
4. Experimental Results
4.1. CNN-ViT Encoder with Various ViT Configurations
4.2. Adaptive Fusion Decoder with Various Fusion Modules
4.3. Comparisons on NYU Depth V2 Dataset
4.4. Comparisons on KITTI Dataset
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Conflicts of Interest
References
- F. Flacco and A. De Luca, "Real-time computation of distance to dynamic obstacles with multiple depth sensors," IEEE Robotics and Automation Letters, vol. 2, no. 1, pp. 56-63, Jan. 2017, doi: 10.1109/LRA.2016.2535859.
- O. Natan and J. Miura, "End-to-end autonomous driving with semantic depth cloud mapping and multi-agent," IEEE Trans. on Intelligent Vehicles, vol. 8, no. 1, pp. 557-571, Jan. 2023, doi: 10.1109/TIV.2022.3185303.
- P. Kauff, N. Atzpadin, C. Fehn, M. Müller, O. Schreer, A. Smolic and R. Tanger, “Depth map creation and image-based rendering for advanced 3DTV services providing interoperability and scalability,” Signal Processing: Image Communication, vol. 22, no. 2, pp. 217-234, February 2007.
- G. G. Gordon, "Face recognition based on depth maps and surface curvature," Proc. SPIE 1570, Geometric Methods in Computer Vision, September 1991, https://doi.org/10.1117/12.48428.
- M. Ding, Y. Huo, H. Yi, Z. Wang, J. Shi, Z. Lu, P. Luo, “Learning depth-guided convolutions for monocular 3D object detection,” Proc. of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2020, pp. 1000-1001.
- C. Zhang, L. Wang, and R. Yang, “Semantic segmentation of urban scenes using dense depth maps,” ECCV 2010. Lecture Notes in Computer Science, vol. 6314. Springer, https://doi.org/10.1007/978-3-642-15561-1_51.
- J. Zbontar and Y. LeCun, “Stereo matching by training a convolutional neural network to compare image patches,” Journal of Machine Learning Research, vol. 17, no. 1, pp. 1-32, 2016.
- J. Pang, W. Sun, J. S. Ren, C. Yang and Q. Yan, "Cascade residual learning: A two-stage convolutional neural network for stereo matching," Proc. of IEEE International Conference on Computer Vision Workshops, Venice, pp. 878-886, 2017.
- J. Chang and Y. Chen, "Pyramid stereo matching network," Proc. of IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, pp. 5410-5418, 2018.
- H. Hirschmuller, “Stereo processing by semiglobal matching and mutual information,” IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 30, no. 2, pp. 328-341, 2007.
- K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv:1409.1556 [cs.CV], 2014.
- K. He, X. Zhang, S. Ren and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770-778, 2016.
- A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby, “An image is worth 16x16 words: Transformers for image recognition at scale,” arXiv preprint arXiv:2010.11929, 2020.
- A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, “Attention is all you need,” Advances in Neural Information Processing Systems, vol. 30, 2017.
- Z. Zhou, M. M. R. Siddiquee, N. Tajbakhsh and J. Liang, "UNet++: Redesigning Skip Connections to Exploit Multiscale Features in Image Segmentation," IEEE Trans. on Medical Imaging, vol. 39, no. 6, pp. 1856-1867, June 2020, doi: 10.1109/TMI.2019.2959609.
- D. Kim, W. Ka, P. Ahn, D. Joo, S. Chun, and J. Kim, “Global-local path networks for monocular depth estimation with vertical cutdepth,” arXiv preprint arXiv:2201.07436, 2022.
- D. Eigen, C. Puhrsch, and R. Fergus, “Depth map prediction from a single image using a multi-scale deep network,” Proc. of Advances in Neural Information Processing Systems, vol. 27, 2014.
- C. Godard, O. Mac Aodha, and G. Brostow, “Unsupervised monocular depth estimation with left-right consistency,” Proc. of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.
- W. J. Yang, W. N. Tsung and P. C. Chung, "Video-based depth estimation autoencoder with weighted temporal feature and spatial edge guided modules," IEEE Transactions on Artificial Intelligence, vol. 5, no. 2, pp. 613-623, Feb. 2024, doi: 10.1109/TAI.2023.3324624.
- Y. Bazi, L. Bashmal, M. M. A. Rahhal, R. A. Dayil, N. A. Ajlan, “Vision transformers for remote sensing image classification,” Remote Sensing, vol. 13, 516, 2021, https://doi.org/10.3390/rs13030516.
- R. Strudel, R. Garcia, I. Laptev, C. Schmid, “Segmenter: Transformer for semantic segmentation,” Proc. of the IEEE/CVF International Conference on Computer Vision (ICCV), 2021, pp. 7262-7272.
- J. Yang, L. An, A. Dixit, J. Koo, S. I. Park, “Depth estimation with simplified transformer,” Proc. of Computer Vision and Pattern Recognition (CVPR), 2022, https://arxiv.org/abs/2204.13791v3.
- L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy and A. L. Yuille, “DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs,” IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 40, no. 4, pp. 834-848, Apr. 2018.
- N. Silberman, D. Hoiem, P. Kohli, and R. Fergus, “Indoor segmentation and support inference from RGBD images,” in Proc. of European Conference on Computer Vision, 2012, pp. 746-760, Springer.
- M. Yang, K. Yu, C. Zhang, Z. Li, and K. Yang, “Denseaspp for semantic segmentation in street scenes,” Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, pp. 3684-3692, 2018.
- J. H. Lee, M. K. Han, D. W. Ko, and I. H. Suh, “From big to small: Multi-scale local planar guidance for monocular depth estimation,” arXiv preprint arXiv:1907.10326, 2019.
- A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Kopf, E. Yang, M. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala, “PyTorch: An imperative style, high-performance deep learning library,” in Advances in Neural Information Processing Systems, vol. 32, 2019.
- A. Geiger, P. Lenz, and R. Urtasun, “Are we ready for autonomous driving? The KITTI vision benchmark suite,” in 2012 IEEE Conference on Computer Vision and Pattern Recognition, pp. 3354-3361, June 2012.
- D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
- W. Yu, M. Luo, P. Zhou, C. Si, Y. Zhou, X. Wang, ... & S. Yan, "Metaformer is actually what you need for vision," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.

| ViT Positions (n1 n2 n3 n4 n5) | FLOPs (G) | Params (MB) | δ1↑ | δ2↑ | δ3↑ | RMS↓ | AbsRel↓ |
|---|---|---|---|---|---|---|---|
| 00000 (no ViTs) | 6.367 | 1.696 | 0.622 | 0.881 | 0.966 | 0.453 | 0.225 |
| 00005 | 13.229 | 64.815 | 0.875 | 0.967 | 0.991 | 0.371 | 0.105 |
| 00014 | 13.522 | 89.828 | 0.879 | 0.968 | 0.991 | 0.366 | 0.106 |
| 00023 | 13.522 | 89.828 | 0.880 | 0.971 | 0.992 | 0.360 | 0.101 |
| 00032 | 13.522 | 89.828 | 0.881 | 0.969 | 0.991 | 0.360 | 0.102 |
| 00041 | 13.522 | 89.828 | 0.878 | 0.969 | 0.992 | 0.365 | 0.102 |
| 00113 | 14.357 | 102.33 | 0.879 | 0.968 | 0.991 | 0.365 | 0.106 |
| 00122 | 14.357 | 102.33 | 0.881 | 0.968 | 0.990 | 0.361 | 0.105 |
| 00131 | 14.357 | 102.33 | 0.876 | 0.968 | 0.990 | 0.370 | 0.109 |
| 00212 | 14.603 | 102.33 | 0.878 | 0.969 | 0.991 | 0.363 | 0.101 |
| 00221 | 14.603 | 102.33 | 0.880 | 0.968 | 0.991 | 0.362 | 0.103 |
| 00311 | 14.849 | 102.33 | 0.879 | 0.970 | 0.992 | 0.363 | 0.104 |
| 01112 | 16.797 | 108.62 | 0.878 | 0.968 | 0.991 | 0.361 | 0.103 |
| 01121 | 16.797 | 108.62 | 0.882 | 0.970 | 0.992 | 0.357 | 0.100 |
| 01211 | 17.043 | 108.62 | 0.874 | 0.967 | 0.990 | 0.364 | 0.104 |
| 02111 | 18.030 | 108.62 | 0.880 | 0.968 | 0.991 | 0.358 | 0.105 |
| 11111 | 24.317 | 112.31 | 0.878 | 0.967 | 0.991 | 0.364 | 0.103 |

| Fusion Modules | Params (MB) | δ1↑ | δ2↑ | δ3↑ | RMS↓ | AbsRel↓ |
|---|---|---|---|---|---|---|
| SFF (baseline) | 1.665 | 0.696 | 0.907 | 0.971 | 0.651 | 0.206 |
| SEAFM | 0.836 | 0.718 | 0.919 | 0.975 | 0.615 | 0.192 |
| SECFM | 2.159 | 0.717 | 0.917 | 0.973 | 0.626 | 0.195 |
| AFM | 3.320 | 0.747 | 0.930 | 0.978 | 0.589 | 0.181 |

| Network | δ1↑ | δ2↑ | δ3↑ | RMS↓ | AbsRel↓ |
|---|---|---|---|---|---|
| BTS [26] | 0.762 | 0.940 | 0.984 | 0.565 | 0.167 |
| GLPDepth [16] | 0.605 | 0.872 | 0.962 | 0.769 | 0.235 |
| RVTAF Net* | 0.773 | 0.942 | 0.984 | 0.560 | 0.162 |
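
The tables above report δ1, δ2, δ3, RMS, and AbsRel without restating their definitions. The sketch below assumes the definitions conventionally used in monocular depth estimation: δk is the fraction of pixels whose ratio max(pred/gt, gt/pred) is below 1.25^k, RMS is the root-mean-square error, and AbsRel is the mean absolute error relative to the ground truth. The function name `depth_metrics` and the toy depth values are hypothetical and are not taken from the paper; the authors' evaluation code may differ (for example in depth caps or valid-pixel masks).

```python
import math


def depth_metrics(pred, gt):
    """Compute threshold accuracies, RMS and AbsRel for flat lists of positive depths.

    Assumes the conventional definitions; not the authors' evaluation code.
    """
    if len(pred) != len(gt) or not gt:
        raise ValueError("pred and gt must be non-empty and of equal length")
    n = len(gt)

    # Per-pixel ratio max(pred/gt, gt/pred); 1.0 means a perfect prediction.
    ratios = [max(p / g, g / p) for p, g in zip(pred, gt)]

    # delta_k: fraction of pixels whose ratio is below 1.25**k (higher is better).
    metrics = {
        f"delta{k}": sum(r < 1.25 ** k for r in ratios) / n
        for k in (1, 2, 3)
    }

    # Root-mean-square error (lower is better).
    metrics["rms"] = math.sqrt(sum((p - g) ** 2 for p, g in zip(pred, gt)) / n)

    # Mean absolute relative error (lower is better).
    metrics["abs_rel"] = sum(abs(p - g) / g for p, g in zip(pred, gt)) / n
    return metrics


# Toy depths in metres (hypothetical values, not data from the paper).
pred = [1.0, 2.2, 3.0, 6.0]
gt = [1.0, 2.0, 2.0, 4.0]
m = depth_metrics(pred, gt)
print(m)
```

In practice these quantities are computed only over pixels with valid ground truth, so a masking step would precede the call.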

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).