Preprint Article, Version 1 (not peer-reviewed). Preserved in Portico.

Lightweight Context-aware Feature Transformer Network for Human Pose Estimation

Version 1 : Received: 10 January 2024 / Approved: 10 January 2024 / Online: 11 January 2024 (10:24:37 CET)

A peer-reviewed article of this Preprint also exists.

Ma, Y.; Shi, Q.; Zhang, F. A Lightweight Context-Aware Feature Transformer Network for Human Pose Estimation. Electronics 2024, 13, 716.

Abstract

We propose the Context-aware Feature Transformer Network (CaFTNet), a novel network for human pose estimation. To address the limited modeling of global dependencies in convolutional neural networks, we design Transformerneck to strengthen the expressive power of features. Transformerneck directly substitutes the 3×3 convolution in the bottleneck of HRNet with a Contextual Transformer (CoT) block while reducing the complexity of the network. Specifically, CoT first produces keys carrying static contextual information through a 3×3 convolution. Then, relying on the query and the contextualized keys, dynamic contexts are generated through two concatenated 1×1 convolutions. The static and dynamic contexts are finally fused to form the output. Additionally, to further refine the fused multi-scale features, we propose an Attention Feature Aggregation Module (AFAM). Technically, given an intermediate input, AFAM successively derives attention maps along the channel and spatial dimensions. An Adaptive Refinement Module (ARM) then activates the obtained attention maps. Finally, the input is adaptively refined through multiplication with the activated attention maps. Together, these components allow our lightweight network to provide strong cues for keypoint detection. Experiments are conducted on the COCO and MPII datasets. The model achieves 76.2 AP on COCO val2017. Compared to other methods with a CNN backbone, CaFTNet reduces the number of parameters by 72.9%. On MPII, our method uses only 60.7% of the parameters while achieving results comparable to other CNN-backbone methods.
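To make the two modules concrete, the following is a minimal PyTorch-style sketch of a CoT-style block and an AFAM-style refinement as described in the abstract. The class names, layer sizes, sigmoid gating, and the learnable gate standing in for the Adaptive Refinement Module are illustrative assumptions, not the authors' released implementation.

import torch
import torch.nn as nn


class CoTBlockSketch(nn.Module):
    # Sketch of a Contextual Transformer (CoT)-style block: static context from a
    # 3x3 convolution over the keys, dynamic context from two concatenated 1x1
    # convolutions over the query and contextualized keys, and a final fusion of
    # both contexts. The sigmoid gating is a simplification of the original block.
    def __init__(self, dim, kernel_size=3):
        super().__init__()
        self.key_embed = nn.Sequential(              # keys with static context (3x3 conv)
            nn.Conv2d(dim, dim, kernel_size, padding=kernel_size // 2, bias=False),
            nn.BatchNorm2d(dim),
            nn.ReLU(inplace=True),
        )
        self.value_embed = nn.Sequential(            # value embedding (1x1 conv)
            nn.Conv2d(dim, dim, 1, bias=False),
            nn.BatchNorm2d(dim),
        )
        self.attn_embed = nn.Sequential(             # two concatenated 1x1 convolutions
            nn.Conv2d(2 * dim, dim // 2, 1, bias=False),
            nn.BatchNorm2d(dim // 2),
            nn.ReLU(inplace=True),
            nn.Conv2d(dim // 2, dim, 1),
        )

    def forward(self, x):
        k_static = self.key_embed(x)                         # static contextual keys
        v = self.value_embed(x)
        attn = self.attn_embed(torch.cat([x, k_static], 1))  # query (= x) + static keys
        k_dynamic = torch.sigmoid(attn) * v                  # dynamic context
        return k_static + k_dynamic                          # fuse static and dynamic contexts


class AFAMSketch(nn.Module):
    # Sketch of an Attention Feature Aggregation Module: channel attention, then
    # spatial attention, with a learnable per-channel gate standing in for the
    # Adaptive Refinement Module (ARM); the input is refined by multiplication.
    def __init__(self, dim, reduction=16):
        super().__init__()
        self.channel_mlp = nn.Sequential(
            nn.Conv2d(dim, dim // reduction, 1, bias=False),
            nn.ReLU(inplace=True),
            nn.Conv2d(dim // reduction, dim, 1, bias=False),
        )
        self.spatial_conv = nn.Conv2d(2, 1, 7, padding=3, bias=False)
        self.arm_gate = nn.Parameter(torch.ones(1, dim, 1, 1))  # assumed ARM gate

    def forward(self, x):
        # channel attention map from pooled descriptors
        avg = x.mean(dim=(2, 3), keepdim=True)
        mx = x.amax(dim=(2, 3), keepdim=True)
        c_attn = self.channel_mlp(avg) + self.channel_mlp(mx)
        x = x * torch.sigmoid(self.arm_gate * c_attn)            # activated channel refinement
        # spatial attention map from channel-pooled features
        s = torch.cat([x.mean(dim=1, keepdim=True), x.amax(dim=1, keepdim=True)], dim=1)
        return x * torch.sigmoid(self.spatial_conv(s))           # activated spatial refinement


if __name__ == "__main__":
    feat = torch.randn(1, 64, 64, 48)       # an HRNet-style feature map
    out = AFAMSketch(64)(CoTBlockSketch(64)(feat))
    print(out.shape)                         # torch.Size([1, 64, 64, 48])

As a usage note, the two sketches are drop-in residual-style modules: the CoT-style block keeps the input resolution and channel count, so it can replace a 3×3 convolution inside a bottleneck, and the AFAM-style module can be applied to a fused multi-scale feature map before the keypoint head.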

Keywords

Human pose estimation; Expressive power of features; Feature refinement; Global dependencies

Subject

Computer Science and Mathematics, Computer Vision and Graphics
