Wu, C.; Wei, X.; Li, S.; Zhan, A. MSTPose: Learning-Enriched Visual Information with Multi-Scale Transformers for Human Pose Estimation. Electronics2023, 12, 3244.
Wu, C.; Wei, X.; Li, S.; Zhan, A. MSTPose: Learning-Enriched Visual Information with Multi-Scale Transformers for Human Pose Estimation. Electronics 2023, 12, 3244.
Wu, C.; Wei, X.; Li, S.; Zhan, A. MSTPose: Learning-Enriched Visual Information with Multi-Scale Transformers for Human Pose Estimation. Electronics2023, 12, 3244.
Wu, C.; Wei, X.; Li, S.; Zhan, A. MSTPose: Learning-Enriched Visual Information with Multi-Scale Transformers for Human Pose Estimation. Electronics 2023, 12, 3244.
Abstract
Human pose estimation is a complex detection task in which the network needs to capture the rich information contained in the images. In this paper, we propose MSTPose (Multi-Scale Transformer for human Pose estimation). Specifically, MSTPose leverages a high-resolution Convolution neural network (CNN) to extract texture information from images. For the feature maps from three different scales produced by the backbone network, each branch performs the coordinate attention operations. The feature maps are then spatially and channel-wise flattened, combined with keypoint tokens generated through random initialization, and fed into a parallel Transformer structure to learn spatial dependencies between features. As the Transformer outputs one-dimensional sequential features, the mainstream two-dimensional heatmap method is abandoned in favor of one-dimensional coordinate vector regression. The experiments show that MSTPose outperforms other CNN-based pose estimation models and demonstrates clear advantages over CNN + Transformer networks of similar types.
Computer Science and Mathematics, Artificial Intelligence and Machine Learning
Copyright:
This is an open access article distributed under the Creative Commons Attribution License which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.