Submitted: 25 July 2024
Posted: 26 July 2024
Abstract
Keywords:
1. Introduction
1.1. Human Action Recognition Based on Skeleton Data
1.2. Graph Autoencoders
1.3. Proposed Spatiotemporal Model for Human Action Recognition
- The development of a novel spatiotemporal graph-autoencoder network for skeleton-based HAR, which effectively captures the complex spatial and temporal dynamics of human movements, offering a significant advancement in feature extraction and representation.
- Outperforming most existing methods on two widely used skeleton-based HAR datasets, NTU RGB+D and NTU RGB+D 120.
- Achieving notable performance improvements by incorporating additional modalities, as demonstrated in the experimental evaluation presented in Section 4.
2. Related Work
2.1. Graph Convolutional Networks (GCNs)
2.2. GCN-Based Skeleton Action Recognition
- Static and dynamic techniques: In static techniques, the GCN topology remains fixed throughout inference, whereas in dynamic techniques it is inferred from the input during inference.
- Topology-shared and topology-non-shared techniques: In topology-shared techniques, a single topology is shared across all channels, whereas topology-non-shared techniques use different topologies for different channels or channel groups.
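The two distinctions above can be illustrated with a minimal NumPy sketch. It is not the implementation of any cited method: the function names, the `theta`/`phi` projections, the `tanh` refinement, and the `alpha` scale are all assumptions chosen for illustration. The first layer uses one fixed adjacency for every channel (static, topology-shared); the second infers a separate topology per output channel from the input (dynamic, topology-non-shared).

```python
import numpy as np

def normalize_adjacency(adj):
    """Symmetrically normalize A + I as D^{-1/2} (A + I) D^{-1/2}."""
    a_hat = adj + np.eye(adj.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(a_hat.sum(axis=1))
    return a_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def static_shared_gc(x, adj_norm, weight):
    """Static, topology-shared layer: one fixed (V, V) graph for all channels.

    x: (V, C_in) joint features, weight: (C_in, C_out).
    """
    return adj_norm @ x @ weight

def dynamic_nonshared_gc(x, adj_norm, weight, theta, phi, alpha=1.0):
    """Dynamic, topology-non-shared layer (illustrative assumption).

    A (V, V, C_out) topology is inferred from x, so each output channel
    aggregates over its own input-dependent graph.
    """
    z = x @ weight                                   # (V, C_out) transformed features
    q, k = x @ theta, x @ phi                        # (V, C_out) each
    refine = np.tanh(q[:, None, :] - k[None, :, :])  # (V, V, C_out) pairwise relations
    topo = adj_norm[:, :, None] + alpha * refine     # per-channel topology
    # out[i, c] = sum_j topo[i, j, c] * z[j, c]
    return np.einsum("ijc,jc->ic", topo, z)
```

With `alpha=0.0` the refinement vanishes and the dynamic layer reduces to the static one, which makes the relationship between the two families easy to check.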
3. Materials and Methods
3.1. Datasets
- The subject samples are divided into two halves: 20 individuals provide training samples, while the remaining 20 provide testing samples. This standard is named cross-subject (x-sub).
- The testing samples are derived from the views of camera 1, while the training samples are derived from the views of cameras 2 and 3. This standard is named cross-view (x-view).
- The subject samples are divided into two halves: 53 individuals provide training data, and the remaining 53 provide testing data. This standard is named cross-subject (x-sub).
- The 32 setups are divided into two halves: sequences with even-numbered setup numbers provide training samples, and the remaining sequences with odd-numbered setup numbers provide testing samples. This standard is named cross-setup (x-setup).
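The evaluation protocols above can be expressed as a small split-assignment helper. This is a sketch under stated assumptions: it relies on the `SsssCcccPpppRrrrAaaa` naming scheme of the NTU RGB+D sample files (setup, camera, performer, replication, action), and the function names are hypothetical. The list of training subject IDs for x-sub is not given in the text, so it is passed in as the parameter `train_subjects` rather than hard-coded.

```python
import re

# Assumed sample naming scheme: S<setup>C<camera>P<subject>R<replication>A<action>
_NAME_RE = re.compile(r"S(\d{3})C(\d{3})P(\d{3})R(\d{3})A(\d{3})")

def parse_sample_name(name):
    """Extract the integer setup, camera, subject, replication, and action IDs."""
    match = _NAME_RE.search(name)
    if match is None:
        raise ValueError(f"unrecognized sample name: {name!r}")
    keys = ("setup", "camera", "subject", "replication", "action")
    return dict(zip(keys, (int(g) for g in match.groups())))

def assign_split(name, protocol, train_subjects=None):
    """Return "train" or "test" for one sample under the given protocol."""
    info = parse_sample_name(name)
    if protocol == "x-sub":
        if train_subjects is None:
            raise ValueError("x-sub needs the set of training subject IDs")
        return "train" if info["subject"] in train_subjects else "test"
    if protocol == "x-view":
        # Camera 1 provides testing samples; cameras 2 and 3 provide training samples.
        return "test" if info["camera"] == 1 else "train"
    if protocol == "x-setup":
        # Even-numbered setups provide training samples; odd-numbered ones testing.
        return "train" if info["setup"] % 2 == 0 else "test"
    raise ValueError(f"unknown protocol: {protocol!r}")
```

For example, `assign_split("S002C003P005R001A010", "x-setup")` returns `"train"` because setup 2 is even-numbered.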
3.2. Preliminaries
3.3. Spatiotemporal Graph Autoencoder Network for Skeleton-Based HAR Algorithm
3.4. Spatiotemporal Input Representations
3.5. Modalities of GA-GCN
4. Results
4.1. Implementation Details
4.2. Experimental Results
5. Discussion
5.1. Comparison of GA-GCN Modalities
5.2. Comparison with the State-of-the-Art
6. Conclusions and Future Work
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Shahroudy, A.; Liu, J.; Ng, T.T.; Wang, G. NTU RGB+D: A Large Scale Dataset for 3D Human Activity Analysis. IEEE Conference on Computer Vision and Pattern Recognition, 2016.
- Liu, J.; Shahroudy, A.; Perez, M.; Wang, G.; Duan, L.Y.; Kot, A.C. NTU RGB+D 120: A Large-Scale Benchmark for 3D Human Activity Understanding. IEEE Transactions on Pattern Analysis and Machine Intelligence 2019. [CrossRef]
- Kay, W.; Carreira, J.; Simonyan, K.; Zhang, B.; Hillier, C.; Vijayanarasimhan, S.; Viola, F.; Green, T.; Back, T.; Natsev, P.; et al. The kinetics human action video dataset. arXiv preprint arXiv:1705.06950 2017.
- Liu, J.; Shahroudy, A.; Wang, G.; Duan, L.Y.; Kot, A.C. Skeleton-based online action prediction using scale selection network. IEEE transactions on pattern analysis and machine intelligence 2019, 42, 1453–1467. [CrossRef]
- Johansson, G. Visual perception of biological motion and a model for its analysis. Perception & psychophysics 1973, 14, 201–211. [CrossRef]
- Li, M.; Chen, S.; Chen, X.; Zhang, Y.; Wang, Y.; Tian, Q. Actional-structural graph convolutional networks for skeleton-based action recognition. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 3595–3603.
- Liu, J.; Shahroudy, A.; Xu, D.; Kot, A.C.; Wang, G. Skeleton-based action recognition using spatio-temporal LSTM network with trust gates. IEEE transactions on pattern analysis and machine intelligence 2017, 40, 3007–3021. [CrossRef]
- Liu, J.; Wang, G.; Duan, L.Y.; Abdiyeva, K.; Kot, A.C. Skeleton-based human action recognition with global context-aware attention LSTM networks. IEEE Transactions on Image Processing 2017, 27, 1586–1599. [CrossRef]
- Hamilton, W.; Ying, Z.; Leskovec, J. Inductive representation learning on large graphs. Advances in neural information processing systems 2017, 30.
- Kipf, T.N.; Welling, M. Variational graph auto-encoders. arXiv preprint arXiv:1611.07308 2016.
- Chen, Y.; Zhang, Z.; Yuan, C.; Li, B.; Deng, Y.; Hu, W. Channel-wise topology refinement graph convolution for skeleton-based action recognition. Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 13359–13368.
- Cai, Y.; Ge, L.; Liu, J.; Cai, J.; Cham, T.J.; Yuan, J.; Thalmann, N.M. Exploiting spatial-temporal relationships for 3d pose estimation via graph convolutional networks. Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 2272–2281.
- Malik, J.; Elhayek, A.; Guha, S.; Ahmed, S.; Gillani, A.; Stricker, D. DeepAirSig: End-to-End Deep Learning Based in-Air Signature Verification. IEEE Access 2020, 8, 195832–195843. [CrossRef]
- Bruna, J.; Zaremba, W.; Szlam, A.; Lecun, Y. Spectral Networks and Locally Connected Networks on Graphs. International Conference on Learning Representations (ICLR); Apr 14–16; Banff, AB, Canada, 2014.
- Defferrard, M.; Bresson, X.; Vandergheynst, P. Convolutional neural networks on graphs with fast localized spectral filtering. Advances in neural information processing systems 2016, 29.
- Kipf, T.N.; Welling, M. Semi-supervised classification with graph convolutional networks. International Conference on Learning Representations (ICLR), 2017.
- Duvenaud, D.K.; Maclaurin, D.; Iparraguirre, J.; Bombarell, R.; Hirzel, T.; Aspuru-Guzik, A.; Adams, R.P. Convolutional networks on graphs for learning molecular fingerprints. Advances in neural information processing systems 2015, 28.
- Niepert, M.; Ahmed, M.; Kutzkov, K. Learning convolutional neural networks for graphs. International conference on machine learning. PMLR, 2016, pp. 2014–2023.
- Veličković, P.; Cucurull, G.; Casanova, A.; Romero, A.; Lio, P.; Bengio, Y. Graph attention networks. International Conference on Learning Representations (ICLR), 2018.
- Liu, Z.; Zhang, H.; Chen, Z.; Wang, Z.; Ouyang, W. Disentangling and unifying graph convolutions for skeleton-based action recognition. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 143–152.
- Shi, L.; Zhang, Y.; Cheng, J.; Lu, H. Two-stream adaptive graph convolutional networks for skeleton-based action recognition. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp. 12026–12035.
- Tang, Y.; Tian, Y.; Lu, J.; Li, P.; Zhou, J. Deep progressive reinforcement learning for skeleton-based action recognition. Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 5323–5332.
- Veeriah, V.; Zhuang, N.; Qi, G.J. Differential recurrent neural networks for action recognition. Proceedings of the IEEE international conference on computer vision, 2015, pp. 4041–4049.
- Yan, S.; Xiong, Y.; Lin, D. Spatial temporal graph convolutional networks for skeleton-based action recognition. Thirty-second AAAI conference on artificial intelligence, 2018. [CrossRef]
- Ye, F.; Pu, S.; Zhong, Q.; Li, C.; Xie, D.; Tang, H. Dynamic gcn: Context-enriched topology learning for skeleton-based action recognition. Proceedings of the 28th ACM International Conference on Multimedia, 2020, pp. 55–63.
- Zhao, R.; Wang, K.; Su, H.; Ji, Q. Bayesian graph convolution lstm for skeleton based action recognition. Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 6882–6892.
- Huang, Z.; Shen, X.; Tian, X.; Li, H.; Huang, J.; Hua, X.S. Spatio-temporal inception graph convolutional networks for skeleton-based action recognition. Proceedings of the 28th ACM International Conference on Multimedia, 2020, pp. 2122–2130.
- Zhang, P.; Lan, C.; Zeng, W.; Xing, J.; Xue, J.; Zheng, N. Semantics-guided neural networks for efficient skeleton-based human action recognition. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 1112–1121.
- Cheng, K.; Zhang, Y.; Cao, C.; Shi, L.; Cheng, J.; Lu, H. Decoupling gcn with dropgraph module for skeleton-based action recognition. Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXIV 16. Springer, 2020, pp. 536–553.
- Chen, Y.; Dai, X.; Liu, M.; Chen, D.; Yuan, L.; Liu, Z. Dynamic convolution: Attention over convolution kernels. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 11030–11039.
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
- Li, S.; Li, W.; Cook, C.; Zhu, C.; Gao, Y. Independently recurrent neural network (indrnn): Building a longer and deeper rnn. Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 5457–5466.
- Li, C.; Zhong, Q.; Xie, D.; Pu, S. Co-occurrence feature learning from skeleton data for action recognition and detection with hierarchical aggregation. arXiv preprint arXiv:1804.06055 2018.
- Si, C.; Chen, W.; Wang, W.; Wang, L.; Tan, T. An attention enhanced graph convolutional lstm network for skeleton-based action recognition. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp. 1227–1236.
- Shi, L.; Zhang, Y.; Cheng, J.; Lu, H. Skeleton-based action recognition with directed graph neural networks. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp. 7912–7921.
- Cheng, K.; Zhang, Y.; He, X.; Chen, W.; Cheng, J.; Lu, H. Skeleton-based action recognition with shift graph convolutional network. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 183–192.
- Song, Y.F.; Zhang, Z.; Shan, C.; Wang, L. Stronger, faster and more explainable: A graph convolutional baseline for skeleton-based action recognition. Proceedings of the 28th ACM international conference on multimedia, 2020, pp. 1625–1633.
- Korban, M.; Li, X. Ddgcn: A dynamic directed graph convolutional network for action recognition. Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XX 16. Springer, 2020, pp. 761–776.
- Shi, L.; Zhang, Y.; Cheng, J.; Lu, H. Decoupled spatial-temporal attention network for skeleton-based action-gesture recognition. Proceedings of the Asian Conference on Computer Vision, 2020.
- Plizzari, C.; Cannici, M.; Matteucci, M. Skeleton-based action recognition via spatial and temporal transformer networks. Computer Vision and Image Understanding 2021, 208, 103219. [CrossRef]
- Chen, Z.; Li, S.; Yang, B.; Li, Q.; Liu, H. Multi-scale spatial temporal graph convolutional network for skeleton-based action recognition. Proceedings of the AAAI conference on artificial intelligence, 2021, pp. 1113–1122. [CrossRef]
- Trivedi, N.; Sarvadevabhatla, R.K. PSUMNet: Unified Modality Part Streams are All You Need for Efficient Pose-based Action Recognition. Computer Vision–ECCV 2022 Workshops: Proceedings, Part V. Springer, 2023, pp. 211–227.
- Liu, J.; Shahroudy, A.; Xu, D.; Wang, G. Spatio-temporal lstm with trust gates for 3d human action recognition. Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part III 14. Springer, 2016, pp. 816–833. [CrossRef]
- Ke, Q.; Bennamoun, M.; An, S.; Sohel, F.; Boussaid, F. Learning clip representations for skeleton-based 3d action recognition. IEEE Transactions on Image Processing 2018, 27, 2842–2855. [CrossRef]






| Modality | bone | vel | fast-motion |
|---|---|---|---|
| joint | FALSE | FALSE | FALSE |
| joint motion | FALSE | TRUE | FALSE |
| bone | TRUE | FALSE | FALSE |
| bone motion | TRUE | TRUE | FALSE |
| joint fast motion | FALSE | FALSE | TRUE |
| joint motion fast motion | FALSE | TRUE | TRUE |
| bone fast motion | TRUE | FALSE | TRUE |
| bone motion fast motion | TRUE | TRUE | TRUE |
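The `bone` and `vel` flags in the table above can be read as two optional transforms applied to the raw joint coordinates. The sketch below is a plausible reading, not the authors' code: it assumes a bone is the vector from a joint's parent to the joint, that motion is the difference between consecutive frames with the last frame zero-padded, and that the hypothetical `parents` list encodes the skeleton. The `fast-motion` flag is not defined in the text, so it is deliberately left out.

```python
import numpy as np

# Flag combinations for the four modalities without fast motion, as in the table.
MODALITY_FLAGS = {
    "joint": {"bone": False, "vel": False},
    "joint motion": {"bone": False, "vel": True},
    "bone": {"bone": True, "vel": False},
    "bone motion": {"bone": True, "vel": True},
}

def make_modality(joints, parents, bone=False, vel=False):
    """Derive one input modality from joint coordinates.

    joints:  (T, V, C) array of T frames, V joints, C coordinates.
    parents: length-V sequence; parents[v] is the parent joint of v
             (the root points to itself, giving a zero-length bone).
    """
    x = np.asarray(joints, dtype=float)
    if bone:
        # Bone vector: joint position minus its parent's position (assumed convention).
        x = x - x[:, list(parents), :]
    if vel:
        # Motion: difference between consecutive frames; the last frame stays zero.
        motion = np.zeros_like(x)
        motion[:-1] = x[1:] - x[:-1]
        x = motion
    return x
```

A call such as `make_modality(joints, parents, **MODALITY_FLAGS["bone motion"])` then yields the bone motion stream from the same raw skeleton sequence.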
| Methods | Accuracy (%) |
|---|---|
| GA-GCN joint modality | 95.14 |
| GA-GCN joint motion modality | 93.05 |
| GA-GCN bone modality | 94.77 |
| GA-GCN bone motion modality | 91.99 |
| GA-GCN ensemble of joint, joint motion, bone, and bone motion modalities (on our machine) | 96.51 |
| GA-GCN joint fast motion modality | 94.63 |
| GA-GCN joint motion fast motion modality | 92.61 |
| GA-GCN bone fast motion modality | 94.41 |
| GA-GCN bone motion fast motion modality | 91.54 |
| GA-GCN ensemble of joint fast motion, joint motion fast motion, bone fast motion, and bone motion fast motion modalities (on our machine) | 96.36 |
| GA-GCN with 8 modalities: joint, joint motion, bone, bone motion, joint fast motion, joint motion fast motion, bone fast motion, and bone motion fast motion | 96.8 |
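The ensemble rows in the table above combine the predictions of several single-modality models. The fusion rule is not stated here, so the following is only a sketch of one common choice, score-level fusion, in which per-modality softmax scores are summed with optional weights before taking the arg-max. The function names and the `weights` parameter are assumptions.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def ensemble_predict(logits_per_modality, weights=None):
    """Fuse per-modality class scores and return predicted labels.

    logits_per_modality: dict mapping modality name -> (N, num_classes) array.
    weights: optional dict mapping modality name -> float (defaults to 1.0 each).
    """
    names = sorted(logits_per_modality)
    if weights is None:
        weights = {name: 1.0 for name in names}
    fused = sum(
        weights[name] * softmax(np.asarray(logits_per_modality[name], dtype=float))
        for name in names
    )
    return fused.argmax(axis=-1)

def accuracy(pred, labels):
    """Fraction of samples whose predicted label matches the ground truth."""
    return float((np.asarray(pred) == np.asarray(labels)).mean())
```

Fusing scores rather than hard labels lets a confident modality override an uncertain one, which is one way an ensemble can exceed the accuracy of each individual stream.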
| Methods | NTU RGB+D X-Sub (%) | NTU RGB+D X-View (%) |
|---|---|---|
| Ind-RNN [32] | 81.8 | 88.0 |
| HCN [33] | 86.5 | 91.1 |
| ST-GCN [24] | 81.5 | 88.3 |
| 2s-AGCN [21] | 88.5 | 95.1 |
| SGN [28] | 89.0 | 94.5 |
| AGC-LSTM [34] | 89.2 | 95.0 |
| DGNN [35] | 89.9 | 96.1 |
| Shift-GCN [36] | 90.7 | 96.5 |
| DC-GCN+ADG [29] | 90.8 | 96.6 |
| PA-ResGCN-B19 [37] | 90.9 | 96.0 |
| DDGCN [38] | 91.1 | 97.1 |
| Dynamic GCN [25] | 91.5 | 96.0 |
| MS-G3D [20] | 91.5 | 96.2 |
| CTR-GCN [11] | 92.4 | 96.8 |
| DSTA-Net [39] | 91.5 | 96.4 |
| ST-TR [40] | 89.9 | 96.1 |
| 4s-MST-GCN [41] | 91.5 | 96.6 |
| PSUMNet [42] | 92.9 | 96.7 |
| GA-GCN | 92.3 | 96.8 |
| Methods | NTU RGB+D 120 X-Sub (%) | NTU RGB+D 120 X-Set (%) |
|---|---|---|
| ST-LSTM [43] | 55.7 | 57.9 |
| GCA-LSTM [8] | 61.2 | 63.3 |
| RotClips+MTCNN [44] | 62.2 | 61.8 |
| ST-GCN [24] | 70.7 | 73.2 |
| SGN [28] | 79.2 | 81.5 |
| 2s-AGCN [21] | 82.9 | 84.9 |
| Shift-GCN [36] | 85.9 | 87.6 |
| DC-GCN+ADG [29] | 86.5 | 88.1 |
| MS-G3D [20] | 86.9 | 88.4 |
| PA-ResGCN-B19 [37] | 87.3 | 88.3 |
| Dynamic GCN [25] | 87.3 | 88.6 |
| CTR-GCN [11] | 88.9 | 90.6 |
| DSTA-Net [39] | 86.6 | 89.0 |
| ST-TR [40] | 82.7 | 84.7 |
| 4s-MST-GCN [41] | 87.5 | 88.8 |
| PSUMNet [42] | 89.4 | 90.6 |
| GA-GCN | 88.8 | 90.5 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).