Submitted: 31 May 2023
Posted: 01 June 2023
Abstract
Keywords:
1. Introduction
- We take the pelvis as the root joint and predict the depth of every other joint relative to it, which reduces the prediction difficulty. We also design multiple encoder-decoder modules that progressively improve the accuracy of the predicted joint depths.
- We adopt the Transformer architecture to make the pipeline fully end-to-end, with no redundant post-processing. In addition, we propose dual decoders that progressively improve the recognition accuracy of the network.
- To the best of our knowledge, our method outperforms all known end-to-end methods and most two-stage methods in predicting the relative depth of 3D human joints on the MuPoTS-3D dataset.
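The root-relative depth representation in the first contribution can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, joint names, and depth values below are hypothetical, and depths are assumed to be given in millimetres along the camera axis.

```python
from typing import Dict


def to_root_relative(depths: Dict[str, float], root: str = "pelvis") -> Dict[str, float]:
    """Express every joint depth as an offset from the root joint's depth."""
    if root not in depths:
        raise KeyError(f"root joint {root!r} missing from depths")
    root_depth = depths[root]
    return {joint: z - root_depth for joint, z in depths.items()}


def to_absolute(relative: Dict[str, float], root_depth: float) -> Dict[str, float]:
    """Invert the transform once an absolute root depth is known."""
    return {joint: dz + root_depth for joint, dz in relative.items()}


# Hypothetical absolute depths (mm) for one person; values are illustrative only.
absolute = {"pelvis": 3000.0, "head": 2950.0, "left_knee": 3100.0}
relative = to_root_relative(absolute)
restored = to_absolute(relative, absolute["pelvis"])
```

In this form the root joint always has relative depth zero, so a network that regresses `relative` only has to model offsets within one body rather than each joint's distance from the camera.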
2. Related Work
2.1. 3D Human Posture Estimation
2.2. Transformer in Vision
3. Methodology
3.1. Overall Architecture
3.2. Feature Encoder
3.3. Posture Decoder
3.4. Joint Decoder
3.5. Training and Inference
4. Experiments
5. Results
5.1. Results on the Dataset
5.2. Ablation Study
5.3. Visualization Results
6. Conclusion
Funding
Conflicts of Interest
References
- Pham, H.H.; Khoudour, L.; Crouzil, A.; Zegers, P.; Velastin, S.A. Skeletal movement to color map: A novel representation for 3D action recognition with inception residual networks. 2018 25th IEEE International Conference on Image Processing (ICIP). IEEE, 2018, pp. 3483–3487.
- Hassan, M.; Choutas, V.; Tzionas, D.; Black, M.J. Resolving 3D human pose ambiguities with 3D scene constraints. Proceedings of the IEEE/CVF international conference on computer vision, 2019, pp. 2282–2292.
- Sigal, L.; Isard, M.; Haussecker, H.; Black, M.J. Loose-limbed people: Estimating 3D human pose and motion using non-parametric belief propagation. International journal of computer vision 2012, 98, 15–48.
- Yazdani, A.; Novin, R.S.; Merryweather, A.; Hermans, T. Occlusion-Robust Multi-Sensory Posture Estimation in Physical Human-Robot Interaction. arXiv preprint arXiv:2208.06494, 2022.
- Zimmermann, C.; Welschehold, T.; Dornhege, C.; Burgard, W.; Brox, T. 3d human pose estimation in rgbd images for robotic task learning. 2018 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2018, pp. 1986–1992.
- Clever, H.M.; Kapusta, A.; Park, D.; Erickson, Z.; Chitalia, Y.; Kemp, C.C. 3d human pose estimation on a configurable bed from a pressure image. 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2018, pp. 54–61.
- Tsoli, A.; Mahmood, N.; Black, M.J. Breathing life into shape: Capturing, modeling and animating 3D human breathing. ACM Transactions on graphics (TOG) 2014, 33, 1–11.
- Hasler, N.; Stoll, C.; Sunkel, M.; Rosenhahn, B.; Seidel, H.P. A statistical model of human pose and body shape. Computer graphics forum. Wiley Online Library, 2009, Vol. 28, pp. 337–346.
- Trumble, M.; Gilbert, A.; Malleson, C.; Hilton, A.; Collomosse, J. Total capture: 3d human pose estimation fusing video and inertial sensors. Proceedings of 28th British Machine Vision Conference, 2017, pp. 1–13.
- Chen, C.H.; Ramanan, D. 3d human pose estimation= 2d pose estimation+ matching. Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 7035–7043.
- Pavlakos, G.; Zhou, X.; Derpanis, K.G.; Daniilidis, K. Coarse-to-fine volumetric prediction for single-image 3D human pose. Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 7025–7034.
- Martinez, J.; Hossain, R.; Romero, J.; Little, J.J. A simple yet effective baseline for 3d human pose estimation. Proceedings of the IEEE international conference on computer vision, 2017, pp. 2640–2649.
- Fang, H.S.; Xu, Y.; Wang, W.; Liu, X.; Zhu, S.C. Learning pose grammar to encode human body configuration for 3d pose estimation. Proceedings of the AAAI conference on artificial intelligence, 2018, Vol. 32.
- Reddy, N.D.; Guigues, L.; Pishchulin, L.; Eledath, J.; Narasimhan, S.G. Tessetrack: End-to-end learnable multi-person articulated 3d pose tracking. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2021, pp. 15190–15200.
- Pavlakos, G.; Zhu, L.; Zhou, X.; Daniilidis, K. Learning to estimate 3D human pose and shape from a single color image. Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 459–468.
- Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; others. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
- Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-end object detection with transformers. Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020, Proceedings, Part I 16. Springer, 2020, pp. 213–229.
- Zhu, X.; Su, W.; Lu, L.; Li, B.; Wang, X.; Dai, J. Deformable detr: Deformable transformers for end-to-end object detection. arXiv preprint arXiv:2010.04159, 2020.
- Cheng, B.; Schwing, A.; Kirillov, A. Per-pixel classification is not all you need for semantic segmentation. Advances in Neural Information Processing Systems 2021, 34, 17864–17875.
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
- Dai, J.; Qi, H.; Xiong, Y.; Li, Y.; Zhang, G.; Hu, H.; Wei, Y. Deformable convolutional networks. Proceedings of the IEEE international conference on computer vision, 2017, pp. 764–773.
- Li, S.; Chan, A.B. 3d human pose estimation from monocular images with deep convolutional neural network. Computer Vision–ACCV 2014: 12th Asian Conference on Computer Vision, Singapore, Singapore, 1–5 November 2014, Revised Selected Papers, Part II 12. Springer, 2015, pp. 332–347.
- Zhen, J.; Fang, Q.; Sun, J.; Liu, W.; Jiang, W.; Bao, H.; Zhou, X. Smap: Single-shot multi-person absolute 3d pose estimation. Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020, Proceedings, Part XV 16. Springer, 2020, pp. 550–566.
- Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft coco: Common objects in context. Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, 6–12 September 2014, Proceedings, Part V 13. Springer, 2014, pp. 740–755.
- Mehta, D.; Sotnychenko, O.; Mueller, F.; Xu, W.; Sridhar, S.; Pons-Moll, G.; Theobalt, C. Single-shot multi-person 3d pose estimation from monocular rgb. 2018 International Conference on 3D Vision (3DV). IEEE, 2018, pp. 120–130.
- Mehta, D.; Rhodin, H.; Casas, D.; Fua, P.; Sotnychenko, O.; Xu, W.; Theobalt, C. Monocular 3d human pose estimation in the wild using improved cnn supervision. 2017 international conference on 3D vision (3DV). IEEE, 2017, pp. 506–516.
- Moon, G.; Chang, J.Y.; Lee, K.M. Camera distance-aware top-down approach for 3d multi-person pose estimation from a single rgb image. Proceedings of the IEEE/CVF international conference on computer vision, 2019, pp. 10133–10142.
- Chen, K.; Wang, J.; Pang, J.; Cao, Y.; Xiong, Y.; Li, X.; Sun, S.; Feng, W.; Liu, Z.; Xu, J.; others. MMDetection: Open mmlab detection toolbox and benchmark. arXiv preprint arXiv:1906.07155, 2019.
- Rogez, G.; Weinzaepfel, P.; Schmid, C. Lcr-net: Localization-classification-regression for human pose. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 3433–3441.
- Mehta, D.; Sotnychenko, O.; Mueller, F.; Xu, W.; Elgharib, M.; Fua, P.; Seidel, H.P.; Rhodin, H.; Pons-Moll, G.; Theobalt, C. XNect: Real-time multi-person 3D motion capture with a single RGB camera. ACM Transactions on Graphics (TOG) 2020, 39, 82:1–82:17.
- Guo, W.; Corona, E.; Moreno-Noguer, F.; Alameda-Pineda, X. Pi-net: Pose interacting network for multi-person monocular 3d pose estimation. Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2021, pp. 2796–2806.
- Jin, L.; Xu, C.; Wang, X.; Xiao, Y.; Guo, Y.; Nie, X.; Zhao, J. Single-stage is enough: Multi-person absolute 3D pose estimation. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 13086–13095.
- Cheng, Y.; Wang, B.; Yang, B.; Tan, R.T. Monocular 3D multi-person pose estimation by integrating top-down and bottom-up networks. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2021, pp. 7649–7659.
- von Marcard, T.; Henschel, R.; Black, M.; Rosenhahn, B.; Pons-Moll, G. Recovering Accurate 3D Human Pose in The Wild Using IMUs and a Moving Camera. European Conference on Computer Vision (ECCV), 2018.







| Method | Head | Neck | Shoulder | Elbow | Wrist | Hip | Knee | Ankle | Avg |
|---|---|---|---|---|---|---|---|---|---|
| Two-stage | | | | | | | | | |
| Lcr-net [29] | 49.4 | 67.4 | 57.1 | 51.4 | 41.3 | 84.6 | 56.3 | 36.3 | 53.8 |
| XNect [30] | - | - | 81.4 | 67.2 | 53.2 | - | 75.8 | 54.3 | 72.1 |
| CDMP [27] | 79.1 | 92.6 | 85.1 | 79.4 | 67.0 | 96.6 | 85.7 | 73.1 | 81.8 |
| Pi-net [31] | 78.3 | 91.8 | 87.8 | 81.9 | 68.5 | 94.2 | 85.3 | 74.8 | 82.5 |
| 3DPose [25] | - | - | - | - | - | - | - | - | 89.6 |
| End-to-end | | | | | | | | | |
| Mehta [26] | 62.1 | 81.2 | 77.9 | 57.7 | 47.2 | 97.3 | 66.3 | 47.6 | 66.0 |
| DRM [32] | 94.1 | 78.6 | 83.0 | 72.1 | 94.5 | 78.6 | 73.0 | 98.7 | 84.3 |
| EDD (Ours) | 93.8 | 78.5 | 86.4 | 76.9 | 95.5 | 86.0 | 79.6 | 98.5 | 87.4 |
| Root Point | Iterative inference depth | |||||
| Pelvis | Head | Shoulder | Elbow | Avg | ||
| ✓ | 91.7 | 92.7 | 82.7 | 72.6 | 84.6 | |
| ✓ | 91.6 | 92.3 | 84 | 73.6 | 85.6 | |
| ✓ | ✓ | 93.5 | 93.8 | 86.4 | 76.9 | 87.4 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).