Submitted:
13 August 2024
Posted:
20 August 2024
Abstract
Keywords:
Introduction
- Skeleton-based Models: These models represent the human body as a collection of joints connected by bones. The coordinates of the joints are tracked over time to understand the movement and posture.
- Contour-based Models: These models focus on the outer contour of the body, capturing the silhouette to infer pose and movement.
- Volume-based Models: These models create a volumetric representation of the body, capturing the full 3D structure, which is useful for more detailed analysis.
- 2D Pose Estimation: This technique estimates the locations of body joints (key points) in the 2D coordinates of an image or video frame. It serves as a foundation for more advanced computer vision tasks such as 3D human pose estimation, motion prediction, and human parsing.
- 3D Pose Estimation: This technique estimates the spatial position of the body in 3D by adding a depth (z) coordinate to each key point. It provides a more comprehensive understanding of the body’s posture and movement [3].
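The skeleton-based representation and the 2D/3D distinction described above can be illustrated with a minimal sketch. The joint names follow the COCO 17-key-point convention and the bone list is a small illustrative subset; both are assumptions made for this example, not the configuration used in this work.

```python
import math

# COCO-style joint names (assumed here for illustration only).
JOINTS = [
    "nose", "left_eye", "right_eye", "left_ear", "right_ear",
    "left_shoulder", "right_shoulder", "left_elbow", "right_elbow",
    "left_wrist", "right_wrist", "left_hip", "right_hip",
    "left_knee", "right_knee", "left_ankle", "right_ankle",
]

# A bone is a pair of connected joints (illustrative subset of the skeleton).
BONES = [
    ("left_shoulder", "left_elbow"),
    ("left_elbow", "left_wrist"),
    ("left_hip", "left_knee"),
    ("left_knee", "left_ankle"),
]


def bone_lengths(pose):
    """Return the length of every bone whose two joints are present.

    `pose` maps a joint name to (x, y) for a 2D pose or (x, y, z) for a 3D pose.
    """
    lengths = {}
    for a, b in BONES:
        if a in pose and b in pose:
            lengths[(a, b)] = math.dist(pose[a], pose[b])
    return lengths


# The same bone measured in 2D (image plane) and in 3D (with depth).
pose_2d = {"left_shoulder": (0.0, 0.0), "left_elbow": (3.0, 4.0)}
pose_3d = {"left_shoulder": (0.0, 0.0, 0.0), "left_elbow": (3.0, 4.0, 12.0)}

print(bone_lengths(pose_2d))
print(bone_lengths(pose_3d))
```

The 2D length is only the projection of the bone onto the image plane, which is why the 3D pose yields a longer bone for the same x and y coordinates.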
Literature Review
- A.
- Early Models and Approaches
- Pictorial Structures: The concept of pictorial structures was introduced by Fischler and Elschlager in the 1970s. This approach represented objects using a collection of parts and their spatial relationships [4]. Felzenszwalb and Huttenlocher later made this method practical and tractable using the distance transform trick, which significantly improved its efficiency and accuracy [5].
- Datasets: Earlier models were evaluated on small datasets such as Parse and Buffy. These datasets were not suitable for training complex models because of their limited size and variability. The introduction of larger datasets, such as the extended Leeds Sports Pose (LSP) dataset containing 10,000 images, marked a significant milestone in the development of HPE models [6].
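The distance transform mentioned above for pictorial structures can be sketched in one dimension. For a part whose match cost at location p is f[p] and whose deformation cost is the squared distance to the parent location q, the best placement for every q is the lower envelope of a set of parabolas, which can be computed in linear time. The function name and the quadratic cost are assumptions chosen for this illustration, and all costs are assumed to be finite.

```python
import math


def distance_transform_1d(f):
    """Return d[q] = min over p of (f[p] + (q - p)**2) for every q.

    Computes the lower envelope of the parabolas rooted at each p in O(n),
    instead of the O(n^2) brute-force minimisation.
    """
    n = len(f)
    d = [0] * n
    v = [0] * n                  # v[k]: location of the k-th envelope parabola
    z = [0.0] * (n + 1)          # z[k]..z[k+1]: range where parabola k is lowest
    k = 0
    z[0], z[1] = -math.inf, math.inf
    for q in range(1, n):
        while True:
            # Intersection of the parabola at q with the rightmost envelope parabola.
            s = ((f[q] + q * q) - (f[v[k]] + v[k] * v[k])) / (2 * q - 2 * v[k])
            if s <= z[k]:
                k -= 1           # the rightmost parabola is hidden; discard it
            else:
                break
        k += 1
        v[k] = q
        z[k] = s
        z[k + 1] = math.inf
    k = 0
    for q in range(n):
        while z[k + 1] < q:
            k += 1
        d[q] = (q - v[k]) ** 2 + f[v[k]]
    return d


print(distance_transform_1d([0, 100, 100]))  # → [0, 1, 4]
```

In a tree-structured model, this transform would be applied to each child part's cost map so that the parent can look up the best child placement in constant time per location.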
- B.
- Advancements with Larger Datasets
- COCO Dataset: The COCO (Common Objects in Context) dataset is a large-scale object detection, segmentation, and captioning dataset that has become a standard benchmark for HPE. It provides a diverse set of images with annotated key points, making it ideal for training and evaluating HPE models [7].
- MPII Human Pose Dataset: The MPII dataset is another extensive dataset that includes around 25,000 images with annotated body joints. It covers a wide range of human activities and poses, providing a robust benchmark for HPE algorithms [8].
- C.
- Key Libraries and Frameworks
- OpenPose: Developed by Zhe Cao and his team in 2019, OpenPose is a real-time multi-person key point detection library capable of detecting up to 135 key points across the body, feet, hands, and face. It uses a bottom-up approach, which is efficient for handling multiple persons in an image. OpenPose is trained on the COCO and MPII datasets and has become one of the most popular tools in HPE [9].

Figure 1. Overall pipeline of real-time 2D pose estimation.

- DeepCut: Presented by Leonid Pishchulin et al. in 2016, DeepCut uses a bottom-up approach with Integer Linear Programming to partition and label detected key points and form a skeleton representation. It addresses the challenge of multi-person pose estimation by simultaneously detecting and associating body parts [10].
- AlphaPose: Developed in 2016, AlphaPose uses a top-down approach for human pose estimation. It detects human bodies first and then localizes key points within the detected regions. AlphaPose supports various operating systems and is known for its high accuracy and robustness [11].
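In a bottom-up pipeline such as the one described for OpenPose, detected key points must be grouped into persons. The Part Affinity Fields named in the title of [9] provide a 2D vector field per limb type, and a candidate limb can be scored by averaging the alignment between that field and the direction from one key point to the other. The sketch below is a simplified illustration under that assumption; the function name, the nearest-pixel sampling, and the sample count are hypothetical choices, not the OpenPose implementation.

```python
import math


def paf_score(paf, p1, p2, num_samples=10):
    """Score a candidate limb from p1 to p2 against a part affinity field.

    paf[y][x] is a (vx, vy) vector; p1 and p2 are (x, y) pixel coordinates.
    Returns the mean dot product between the field and the unit limb direction,
    sampled at evenly spaced points along the segment (num_samples >= 2).
    """
    dx, dy = p2[0] - p1[0], p2[1] - p1[1]
    norm = math.hypot(dx, dy)
    if norm == 0:
        return 0.0
    ux, uy = dx / norm, dy / norm
    total = 0.0
    for i in range(num_samples):
        t = i / (num_samples - 1)
        x = int(round(p1[0] + t * dx))   # nearest-pixel lookup (assumption)
        y = int(round(p1[1] + t * dy))
        vx, vy = paf[y][x]
        total += vx * ux + vy * uy
    return total / num_samples


# Toy 5x5 field that points to the right everywhere.
field = [[(1.0, 0.0) for _ in range(5)] for _ in range(5)]

print(paf_score(field, (0, 2), (4, 2)))  # limb aligned with the field
print(paf_score(field, (2, 0), (2, 4)))  # limb perpendicular to the field
```

A high score suggests that the two key points belong to the same limb of the same person, while a score near zero or below suggests they should not be connected. A top-down method avoids this association step by running a single-person estimator inside each detected bounding box.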
- D.
- Convolutional Neural Networks (CNNs) in HPE
- VGG-19: VGG-19 is a convolutional network that is 19 layers deep and was trained on the ImageNet database. It classifies images into 1,000 object categories and is known for its performance in image recognition tasks. In this project, we use the first 10 layers of VGG-19 for feature extraction, which provides the basis for detecting key points in human poses [12].
- High-Resolution Net (HRNet): Introduced by Jingdong Wang et al., HRNet maintains high-resolution representations throughout the entire network. It has been used for semantic segmentation, object detection, and HPE, providing high accuracy and detailed pose estimations [13].
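To make the VGG-19 feature-extraction step more concrete, the sketch below traces the tensor shape through the first 10 convolutional layers of the standard VGG-19 configuration. It assumes that "the first 10 layers" refers to the first 10 convolutional layers (with the three max-pooling layers between them), and the 368x368 input size is an illustrative choice; neither detail is stated in the text.

```python
# Standard VGG-19 layout up to the 10th convolution:
# 2 x conv64, pool, 2 x conv128, pool, 4 x conv256, pool, 2 x conv512.
VGG19_FIRST_10 = [
    ("conv", 64), ("conv", 64), ("pool", None),
    ("conv", 128), ("conv", 128), ("pool", None),
    ("conv", 256), ("conv", 256), ("conv", 256), ("conv", 256), ("pool", None),
    ("conv", 512), ("conv", 512),
]


def feature_shape(height, width, channels=3):
    """Return (channels, height, width) after the truncated VGG-19 stack."""
    for kind, out_channels in VGG19_FIRST_10:
        if kind == "conv":
            # 3x3 convolution, stride 1, padding 1: spatial size is unchanged.
            channels = out_channels
        else:
            # 2x2 max pooling, stride 2: spatial size is halved (floor).
            height //= 2
            width //= 2
    return channels, height, width


print(feature_shape(368, 368))  # → (512, 46, 46)
```

Under these assumptions the feature map is 8 times smaller than the input in each spatial dimension, so each feature-map cell corresponds to an 8x8 pixel block of the original image.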
Methodology
- A.
- Data Collection and Preprocessing
- B.
- Model Architecture
- C.
- Model Training
- D.
- Data Flow

- E.
- Deployment in Mobile Application
Results and Analysis
- A.
- Accuracy and Loss Analysis




- B.
- Output





- C.
- Output from Mobile Application



Conclusion
- Integrating face recognition and detection for defense applications.
- Enhancing the system to predict user movements, useful in defense and gaming.
- Applying pose estimation in CGI for movies and video games.
- Using 3D cameras for capturing three-dimensional human poses, providing better visualization and accuracy.
References
- N. Barla, “V7Labs,” [Online]. Available: https://www.v7labs.com/blog/human-pose-estimation-guide.
- P. Ganesh, “Towards Data Science,” 15 March 2019. [Online]. Available: https://towardsdatascience.com/human-pose-estimationsimplified-6cfd88542ab3.
- M. A. Fischler and R. A. Elschlager, “The Representation and Matching of Pictorial Structures,” IEEE Transactions on Computers, vol. C-22, no. 1, pp. 67-92, 1973.
- P. F. Felzenszwalb and D. P. Huttenlocher, “Pictorial structures for object recognition,” International Journal of Computer Vision, vol. 61, no. 1, pp. 55-79, 2005.
- S. Johnson and M. Everingham, “Clustered Pose and Nonlinear Appearance Models for Human Pose Estimation,” in Proceedings of the British Machine Vision Conference, 2010, pp. 1-11.
- Y. Yorozu, M. Hirano, K. Oka, and Y. Tagawa, “Electron spectroscopy studies on magneto-optical media and plastic substrate interface,” IEEE Transl. J. Magn. Japan, vol. 2, pp. 740–741, August 1987 [Digests 9th Annual Conf. Magnetics Japan, p. 301, 1982].
- T.-Y. Lin et al., “Microsoft COCO: Common Objects in Context,” in European Conference on Computer Vision (ECCV), 2014, pp. 740-755.
- M. Andriluka et al., “2D Human Pose Estimation: New Benchmark and State of the Art Analysis,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014, pp. 3686-3693.
- Z. Cao, G. Hidalgo, T. Simon, S.-E. Wei, and Y. Sheikh, “OpenPose: Realtime Multi-Person 2D Pose Estimation using Part Affinity Fields,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 43, no. 1, pp. 172-186, 2019.
- L. Pishchulin et al., “DeepCut: Joint Subset Partition and Labeling for Multi-Person Pose Estimation,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 4929-4937.
- H. Fang et al., “RMPE: Regional Multi-person Pose Estimation,” in IEEE International Conference on Computer Vision (ICCV), 2017, pp. 2334-2343.
- K. Simonyan and A. Zisserman, “Very Deep Convolutional Networks for Large-Scale Image Recognition,” arXiv preprint arXiv:1409.1556, 2014.
- J. Wang et al., “Deep High-Resolution Representation Learning for Visual Recognition,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 43, no. 10, pp. 3349-3364, 2020.
- T.-Y. Lin, M. Maire, S. Belongie, L. Bourdev, R. Girshick, J. Hays, P. Perona, D. Ramanan, C. L. Zitnick, and P. Dollár, “Microsoft COCO: Common Objects in Context,” in European Conference on Computer Vision (ECCV), Zurich, Switzerland, 2014, pp. 740-755.
- K. Simonyan and A. Zisserman, “Very Deep Convolutional Networks for Large-Scale Image Recognition,” in International Conference on Learning Representations (ICLR), San Diego, USA, 2015.
- Z. Cao, T. Simon, S.-E. Wei, and Y. Sheikh, “Realtime Multi-Person 2D Pose Estimation using Part Affinity Fields,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, USA, 2017, pp. 7291-7299.
- D. P. Kingma and J. Ba, “Adam: A Method for Stochastic Optimization,” in International Conference on Learning Representations (ICLR), San Diego, USA, 2015.
- X. Peng and K. Saenko, “Synthetic to Real Adaptation with Generative Correlation Alignment Networks,” in IEEE Winter Conference on Applications of Computer Vision (WACV), Lake Tahoe, USA, 2018, pp. 1982-1991.
- A. Newell, K. Yang, and J. Deng, “Stacked Hourglass Networks for Human Pose Estimation,” in European Conference on Computer Vision (ECCV), Amsterdam, Netherlands, 2016, pp. 483-499.
- S. Johnson and M. Everingham, “Learning Effective Human Pose Estimation from Inaccurate Annotation,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Colorado Springs, USA, 2011, pp. 1465-1472.
- D. Shrestha and D. Valles, “Evolving Autonomous Navigation: A NEAT Approach for Firefighting Rover Operations in Dynamic Environments,” 24th Annual IEEE International Conference on Electro Information Technology (EIT2024). [In Press].
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).