Submitted:
24 April 2025
Posted:
26 April 2025
Abstract
Keywords:
1. Introduction
- The proposed network structure, targeted at low-cost MR glasses, achieves an effective fusion of the sparse depth map and the RGB image to produce accurate, high-precision dense depth maps.
- A novel adaptive dynamic range bins estimator is proposed to quickly estimate the depth distribution of the captured scene, making the resulting depth maps well suited to the intended applications.
- The proposed networks with two decoding variants have been successfully implemented on the Jorjin MR glasses [19] for hand-gesture MR and augmented reality (AR) applications.
2. Related Work
2.1. Depth Estimation
2.2. Depth Completion
2.3. Adaptive Bins Estimation
3. Proposed Method
3.1. Lightweight Encoder
3.2. Multilevel Shared-Decoder
3.2.1. Upsampling by SimpleUp
3.2.2. Upsampling by UpCSPN-k
3.3. Adaptive Bins Estimator
3.3.1. Feature Embedding MLPs (En)
| Layer type | Input channels | Output channels | Activation |
|---|---|---|---|
| FC | Cn | 64 | GeLU |
| FC | 64 | Ce=32 | - |
3.3.2. Bins Initialization MLPs (INIT)
| Layer type | Input channels | Output channels | Activation |
|---|---|---|---|
| FC | 32 | 64 | GeLU |
| FC | 64 | b | ReLU |
| Global AvgPool | - | - | - |
3.3.3. Bins Splitter MLPs (Sn)
| Layer type | Input channels | Output channels | Activation |
|---|---|---|---|
| FC | 32 | 64 | GeLU |
| FC | 64 | - | - |
| Global AvgPool | - | - | - |
3.3.4. Bias Prediction MLPs (Bn)
| Layer type | Input channels | Output channels | Activation |
|---|---|---|---|
| FC | 32 | 32 | GeLU |
| FC | 64 | 2 | - |
| Global AvgPool | - | - | - |
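To make the size of these estimator heads concrete, the following is a minimal PyTorch sketch of the feature embedding MLP (En, Section 3.3.1) and the bins initialization MLP (INIT, Section 3.3.2), built directly from the layer tables above. The tensor layout (per-position feature vectors of shape (B, N, Cn)), the placement of the global average pooling after the second FC layer, and the module names are assumptions made for illustration; this is a sketch, not the authors' implementation. The splitter (Sn) and bias (Bn) heads follow the same two-FC pattern.

```python
import torch
import torch.nn as nn

class FeatureEmbeddingMLP(nn.Module):
    """En (Sec. 3.3.1): projects a Cn-channel feature vector to Ce = 32 channels."""
    def __init__(self, c_in, c_embed=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(c_in, 64),
            nn.GELU(),
            nn.Linear(64, c_embed),
        )

    def forward(self, x):          # x: (B, N, Cn) per-position feature vectors
        return self.net(x)         # (B, N, 32)

class BinsInitMLP(nn.Module):
    """INIT (Sec. 3.3.2): predicts b initial (unnormalized) bin widths."""
    def __init__(self, c_embed=32, num_bins=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(c_embed, 64),
            nn.GELU(),
            nn.Linear(64, num_bins),
            nn.ReLU(),
        )

    def forward(self, x):          # x: (B, N, 32) embedded features
        widths = self.net(x)       # per-position bin-width responses, (B, N, b)
        return widths.mean(dim=1)  # global average pooling over positions -> (B, b)
```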
3.4. Loss Functions
3.4.1. Scale-Invariant Log Loss
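This heading refers to the scale-invariant log (SILog) loss introduced by Eigen et al. [8]. Since no equation is reproduced in this outline, a minimal PyTorch sketch of its commonly used form is given below; the constants lam = 0.85 and alpha = 10 follow the AdaBins formulation [12] and are assumptions here, as the paper's exact weighting may differ.

```python
import torch

def silog_loss(pred, target, valid_mask, lam=0.85, alpha=10.0):
    """Scale-invariant log loss over valid pixels (Eigen et al. [8], AdaBins-style weighting [12])."""
    g = torch.log(pred[valid_mask]) - torch.log(target[valid_mask])  # log-depth residuals
    return alpha * torch.sqrt((g ** 2).mean() - lam * g.mean() ** 2)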
3.4.2. Bin-Centers Distribution Loss
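Following AdaBins [12], a bin-centers distribution loss is typically the bidirectional Chamfer distance (CD, see Abbreviations) between the set of predicted bin centers and the set of valid ground-truth depth values; a minimal sketch under that assumption is below. The use of squared distances and the equal weighting of the two directions are assumptions for illustration, not details taken from this paper.

```python
import torch

def bin_centers_chamfer_loss(bin_centers, gt_depths):
    """Bidirectional Chamfer distance between predicted bin centers and ground-truth depths.
    bin_centers: (B, b) predicted bin centers; gt_depths: (B, M) valid ground-truth depth values."""
    dist = torch.cdist(bin_centers.unsqueeze(-1), gt_depths.unsqueeze(-1)) ** 2  # (B, b, M) squared |c_i - d_j|
    return dist.min(dim=2).values.mean() + dist.min(dim=1).values.mean()
```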
3.5. LR Depth Generation for Training
- Step 1.
- Subblocking: To obtain a p×p low-resolution depth map from the ground-truth depth map d, we first divide the ground-truth depth map into a p×p grid of subblocks.
- Step 2a.
- Median pooling: We simply take the median value of each subblock as the LR depth value to obtain the simulated LR depth map.
- Step 2b.
- Max-depth filtering (optional): In practice, the depth range of most depth sensors on MR glasses is limited to a short distance. To simulate this short-range depth during training, we apply max-depth filtering to the sampled low-resolution depth map: whenever a depth value exceeds the max-depth threshold, max-depth filtering resets it to zero (i.e., marks the pixel as invalid). A code sketch of the full procedure follows this list.
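A minimal sketch of this training-time LR depth simulation is given below. It assumes the ground-truth depth map is a single-channel tensor whose height and width are divisible by p (any remainder is cropped), and it follows the p×p-grid interpretation of Step 1; the function name and interface are illustrative rather than the authors' implementation.

```python
import torch

def simulate_lr_depth(d, p, max_depth=None):
    """Simulate a p x p low-resolution depth map from a ground-truth depth map d of shape (H, W):
    divide d into a p x p grid of subblocks (Step 1), take the median of each subblock (Step 2a),
    and optionally zero out values beyond max_depth (Step 2b)."""
    h, w = d.shape
    bh, bw = h // p, w // p                        # subblock size; remainder rows/cols are cropped
    blocks = d[:bh * p, :bw * p].reshape(p, bh, p, bw).permute(0, 2, 1, 3).reshape(p, p, -1)
    lr = blocks.median(dim=-1).values              # median pooling per subblock -> (p, p)
    if max_depth is not None:
        lr = torch.where(lr > max_depth, torch.zeros_like(lr), lr)  # invalidate far pixels
    return lr                                      # simulated LR depth map
```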
4. Experimental Results
4.1. Datasets and Implementation Settings
4.2. Proposed Network Implementation on Low-cost MR Glasses
4.3. Performance Evaluation Results
| Network | Encoder | Decoder | Weights (M) | MAC (G) | RMSE (m) |
|---|---|---|---|---|---|
| FastDepth [10] | ResNet50 | NNConv5 | 25.60 | 4.190 | 0.568 |
| FastDepth [10] | MobileNet | NNConv5 | 3.19 | 0.740 | 0.599 |
| MonoDepth [13] | ResNet50 | DispNet | 30.00 | 4.800 | >4.392 |
| Proposed | DMobileNet_s | SimpleUp | 2.18 | 0.675 | 0.223 |
| Proposed | DMobileNet_s | SimpleUp+AdaDRBins | 2.28 | 1.150 | 0.199 |
| Proposed | DMobileNet_s | UpCSPN-7+AdaDRBins | 43.57 | 9.445 | 0.185 |
4.4. Ablation Study
| Decoder | Bins Estimator | RMSE (m) | REL | δ<1.25 (%) | δ<1.25² (%) | δ<1.25³ (%) |
|---|---|---|---|---|---|---|
| SimpleUp | - | 0.223 | 0.046 | 97.36 | 99.43 | 99.85 |
| UpCSPN-3 | - | 0.206 | 0.039 | 97.87 | 99.56 | 99.89 |
| UpCSPN-5 | - | 0.197 | 0.036 | 97.87 | 99.55 | 99.89 |
| UpCSPN-7 | - | 0.202 | 0.036 | 97.87 | 99.54 | 99.89 |
| SimpleUp | AdaBins | 0.197 | 0.037 | 98.04 | 99.64 | 99.91 |
| SimpleUp | AdaDRBins | 0.199 | 0.038 | 98.06 | 99.64 | 99.91 |
| UpCSPN-3 | AdaDRBins | 0.190 | 0.036 | 98.21 | 99.67 | 99.92 |
| UpCSPN-5 | AdaDRBins | 0.191 | 0.036 | 98.21 | 99.67 | 99.92 |
| UpCSPN-7 | AdaDRBins | 0.185 | 0.034 | 98.24 | 99.67 | 99.92 |
| Decoder | Bins Estimator | #Bins | Weights (M) | MAC (G) | FPS |
|---|---|---|---|---|---|
| SimpleUp | - | - | 2.18 | 0.49 | 1170 |
| UpCSPN-3 | - | - | 41.85 | 8.04 | 367 |
| UpCSPN-5 | - | - | 42.13 | 8.36 | 356 |
| UpCSPN-7 | - | - | 42.57 | 8.60 | 337 |
| SimpleUp | AdaBins | 256 | 4.52 | 4.43 | 425 |
| SimpleUp | AdaDRBins | 32 | 2.28 | 1.15 | 560 |
| UpCSPN-3 | AdaDRBins | 32 | 42.85 | 8.70 | 340 |
| UpCSPN-5 | AdaDRBins | 32 | 43.14 | 8.92 | 330 |
| UpCSPN-7 | AdaDRBins | 32 | 43.57 | 9.26 | 310 |
5. Discussion
6. Conclusions
Supplementary Materials
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
Abbreviations
| MR | Mixed Reality |
| AR | Augmented Reality |
| 3D | Three Dimensional |
| RGB | Red, Green and Blue |
| LiDAR | Light Detection And Ranging |
| ToF | Time-of-Flight |
| LED | Light Emitting Diode |
| CNN | Convolutional Neural Network |
| CSPN | Convolutional Spatial Propagation Network |
| SPN | Spatial Propagation Network |
| LR | Low Resolution |
| HR | High Resolution |
| MLP | Multilayer Perceptron |
| SILog | Scale Invariant Log |
| CD | Chamfer Distance |
| NYU | New York University |
| ONNX | Open Neural Network Exchange |
| CPU | Central Processing Unit |
| GPU | Graphics Processing Unit |
References
- Fabrizio, F.; De Luca, A. Real-time computation of distance to dynamic obstacles with multiple depth sensors. IEEE Robotics and Automation Letters, vol. 2, no. 1, pp. 56-63, Jan. 2017. [CrossRef]
- Kauff, P.; Atzpadin, N.; Fehn, C.; Müller, M.; Schreer, O.; Smolic, A.; Tanger, R. Depth map creation and image-based rendering for advanced 3DTV services providing interoperability and scalability. Signal Processing: Image Communication, vol. 22, no. 2, pp. 217-234, February 2007. [CrossRef]
- Fehn, C. Depth-image-based rendering (DIBR), compression, and transmission for a new approach on 3D-TV. in Stereoscopic Displays and Virtual Reality Systems XI, vol. 5291. SPIE, 2004, pp. 93-104. [CrossRef]
- Yang, W.-J.; Yang, J.-F.; Chen, G.-C.; Chung, P.-C.; Chung, M.-F. An assigned color depth packing method with centralized texture depth packing formats for 3D VR broadcasting services. IEEE Journal on Emerging and Selected Topics in Circuits and Systems, vol. 9, no. 1, pp. 122–132, 2018. [CrossRef]
- Natan, O.; Miura, J. End-to-end autonomous driving with semantic depth cloud mapping and multi-agent. IEEE Trans. on Intelligent Vehicles, vol. 8, no. 1, pp. 557-571, Jan. 2023. [CrossRef]
- Suarez, J.; Murphy, R. R. Hand gesture recognition with depth images: A review. in RO-MAN: The 21st IEEE International Symposium on Robot and Human Interactive Communication. 2012, pp. 411–417. [CrossRef]
- Hirschmuller, H. Stereo processing by semiglobal matching and mutual information. in IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 30, no. 2, pp. 328-341, 2007. [CrossRef]
- Eigen, D.; Puhrsch, C.; Fergus, R. Depth map prediction from a single image using a multi-scale deep network. in Proc. of Advances in Neural Information Processing Systems, pp. 2366-2374, 2014.
- Alhashim, I.; Wonka, P. High quality monocular depth estimation via transfer learning. arXiv preprint arXiv:1812.11941, 2018. [CrossRef]
- Wofk, D.; Ma, F.; Yang, T.-J.; Karaman, S.; Sze, V. FastDepth: Fast monocular depth estimation on embedded systems. in 2019 International Conference on Robotics and Automation (ICRA). IEEE, 2019, pp. 6101–6108. [CrossRef]
- Laina, I.; Rupprecht, C.; Belagiannis, V.; Tombari, F.; Navab, N. Deeper depth prediction with fully convolutional residual networks. in 2016 Fourth international conference on 3D vision (3DV). IEEE, pp. 239–248, 2016. [CrossRef]
- Bhat, S. F.; Alhashim, I.; Wonka, P. Adabins: Depth estimation using adaptive bins. in Proc. of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 4009–4018. [CrossRef]
- Godard, C.; Aodha, O. M.; Brostow, G. J. Unsupervised monocular depth estimation with left-right consistency. in Proc. of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 270-279, 2017. [CrossRef]
- Godard, C.; Aodha, O. M.; Firman, M.; Brostow, G. J. Digging into self-supervised monocular depth estimation. in Proc. of the IEEE/CVF International Conference on Computer Vision, pp. 3828–3838, 2019. [CrossRef]
- Zama Ramirez, P.; Poggi, M.; Tosi, F.; Mattoccia, S.; Di Stefano, L. Geometry meets semantics for semi-supervised monocular depth estimation. In Computer Vision – ACCV 2018. vol 11363. Springer, Cham. [CrossRef]
- Kuznietsov, Y.; Stuckler, J.; Leibe, B. Semi-supervised deep learning for monocular depth map prediction. Proc. of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6647–6655, July 2017. [CrossRef]
- Marti, E.; de Miguel, M. A.; Garcia, F.; Perez, J. A Review of sensor technologies for perception in automated driving. in IEEE Intelligent Transportation Systems Magazine, vol. 11, no. 4, pp. 94-108, winter 2019. [CrossRef]
- Foix, S.; Alenya, G.; Torras, C. Lock-in time-of-flight (ToF) cameras: A survey. in IEEE Sensors Journal, vol. 11, no. 9, pp. 1917-1926, Sept. 2011. [CrossRef]
- Jorjin J7EF Plus AR glasses. https://www.jorjin.com/products/ar-vr-glasses/j-reality/j7ef/.
- Ma, F.; Karaman, S. Sparse-to-dense: Depth prediction from sparse depth samples and a single image. in 2018 IEEE International Conference on Robotics and Automation (ICRA). IEEE, pp. 4796–4803, 2018. [CrossRef]
- Tang, J.; Tian, F.-P; Feng, W.; Li, J.; Tan, P. Learning guided convolutional network for depth completion. in IEEE Transactions on Image Processing, vol. 30, pp. 1116–1129, 2020. [CrossRef]
- Cheng, X.; Wang, P.; Yang, R. Depth estimation via affinity learned with convolutional spatial propagation network. In Proceedings of the European Conference on Computer Vision (ECCV); 2018; pp. 103–119. [Google Scholar]
- Liu, S.; De Mello, S.; Gu, J.; Zhong, G.; Yang, M.-H.; Kautz, J. Learning affinity via spatial propagation networks. Proc. of Advances in Neural Information Processing Systems, vol. 30, 2017.
- Chen, S.; Shi, Y.; Xiong, Z.; Zhu, X. X. HTC-DC Net: Monocular height estimation from single remote sensing images. IEEE Trans. Geoscience and Remote Sensing, vol. 61, no. 5623018, Oct. 2023. [CrossRef]
- Miclea, V.-C.; Nedevschi, S. SemanticAdaBins - Using semantics to improve depth estimation based on adaptive bins in aerial scenarios. In Proc. of 4th International Conference on Electrical, Computer, Communications and Mechatronics Engineering (ICECCME), 2024. [CrossRef]
- Yang, X.; Yuan, L.; Wilber, K.; Sharma, A.; Gu, X.; Qiao, S. PolyMaX: General dense prediction with mask transformer. In Proc. of IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2024. [CrossRef]
- Jaisawal, P. K.; Papakonstantinou, S. Monocular fisheye depth estimation for UAV applications with segmentation feature integration. In Proc. of IEEE 43rd Digital Avionics Systems Conference (DASC), 2024. [CrossRef]
- Lee, C. Y.; Kim, D. J.; Suh, Y. J.; Hwang, D. K. Improving monocular depth estimation through knowledge distillation: better visual quality and efficiency. IEEE Access, vol. 13, pp. 2763 – 2782, Dec. 2024. [CrossRef]
- Bhat, S. F.; Birkl, R.; Wofk, D.; Wonka, P.; Muller, M. ZoeDepth: Zero-shot transfer by combining relative and metric depth. arXiv preprint arXiv:2302.12288, 2023. [CrossRef]
- Howard, A. G.; Zhu M.; Chen, B.; Kalenichenko, D.; Wang, W.; Weyand, T.; Andreetto, M.; Adam, H. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017. [CrossRef]
- Fan, H.; Su, H.; Guibas, L. A point set generation network for 3D object reconstruction from a single image. in 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2463–2471, 2017.
- Silberman, N.; Hoiem, D.; Kohli, P.; Fergus, R. Indoor segmentation and support inference from rgbd images. in European Conference on Computer Vision. Springer, pp. 746–760, 2012. [CrossRef]
- Zhang, Y.; Cao, C.; Cheng, J.; Lu, H. Egogesture: A new dataset and benchmark for egocentric hand gesture recognition. IEEE Transactions on Multimedia, vol. 20, no. 5, pp. 1038–1050, 2018. [CrossRef]
- Paszke, A.; Gross, S.; Massa, F.; Lerer, A.; Bradbury, J.; Chanan, G.; Killeen, T.; Lin, Z.; Gimelshein, N.; Antiga, L.; Desmaison, A.; Kopf, A.; Yang, E.; DeVito, Z.; Raison, M.; Tejani, A.; Chilamkurthy, S.; Steiner, B.; Fang, L.; Bai, J.; Chintala, S. Pytorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems, vol. 32, pp. 8026–8037, 2019.
- EPSON MOVERIO BO-IC400 User's Guide. https://download3.ebz.epson.net/dsc/f/03/00/16/15/46/95ef7d99574b1d519141f71c055ea50600b6b390/UsersGuide_BO-IC400_EN_Rev04.pdf.
- Lee, J.; Kim, T.; Bang, S.; Oh, S.; Kwon, H. Evasion attacks on deep learning-based helicopter recognition systems. In Hindawi Journal of Sensors, vol. 2024, Article ID 1124598. [CrossRef]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).