1. Introduction
Effective feature representation of videos is key to action recognition. Spatiotemporal features [1, 2], subspace features [3, 4], and label information [5] have been investigated for action recognition. Correlations between multiple features may provide distinctive information; hence, feature correlation mining has been explored to improve recognition when labeled data are scarce [4, 6]. However, these approaches have two limitations in learning discriminant features. First, although existing algorithms evaluate the common structures shared among different actions, they do not take interclass separability into account. Second, current semi-supervised approaches solve the nonconvex optimisation problem through impressive derivations, but the global optimum cannot be guaranteed mathematically by the alternating least squares (ALS) iterative method.
To overcome the limitations of using multiple features for training, we propose modelling intraclass compactness and inter-manifold separability simultaneously, and then capturing high-level semantic patterns via multiple-feature analysis. For the optimisation process, we introduce the PBB algorithm because of its effectiveness in obtaining an optimal solution [7]. The PBB method is a nonmonotone line-search technique for the minimisation of differentiable functions on closed convex sets [8].
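As a concrete illustration of this class of methods, the following is a minimal Python sketch of a projected Barzilai-Borwein iteration with a nonmonotone (GLL-style) line search; the safeguards, step bounds, and function names are our own illustrative choices, not the exact algorithm of [7, 8]:

```python
import numpy as np

def pbb_minimise(grad_f, f, project, x0, max_iter=200, M=10,
                 gamma=1e-4, alpha0=1.0, tol=1e-8):
    """Projected Barzilai-Borwein sketch with a nonmonotone line search.
    `project` maps a point back onto the closed convex feasible set."""
    x = project(x0)
    alpha = alpha0
    f_hist = [f(x)]
    for _ in range(max_iter):
        g = grad_f(x)
        # Projected gradient direction scaled by the current BB step
        d = project(x - alpha * g) - x
        if np.linalg.norm(d) < tol:
            break
        # Nonmonotone line search: compare against the maximum of the
        # last M objective values, so the objective may oscillate
        f_ref = max(f_hist[-M:])
        lam = 1.0
        while lam > 1e-12 and f(x + lam * d) > f_ref + gamma * lam * g.dot(d):
            lam *= 0.5
        s = lam * d
        x_new = x + s
        y = grad_f(x_new) - g
        # Safeguarded BB step size for the next iteration
        sy = s.dot(y)
        alpha = s.dot(s) / sy if sy > 1e-12 else alpha0
        alpha = min(max(alpha, 1e-10), 1e10)
        x = x_new
        f_hist.append(f(x))
    return x, f_hist
```

For example, minimising $\|x-b\|^2$ over the nonnegative orthant with `project = lambda z: np.maximum(z, 0.0)` converges to the clipped vector $\max(b, 0)$.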
Inspired by research using multiple features [5, 6], our framework was extended in a multiple-feature-based manner to improve recognition. We propose characterising high-level semantic patterns through low-level action features using multiple-feature analysis. Multiple features were extracted from different views of labeled and unlabeled action videos. Based on the constructed graph model, pseudo-information for the unlabeled videos can be generated by label propagation and feature correlations. For each type of feature, nearby samples preserve consistency separately, while label prediction for the unlabeled training data jointly exploits the global consistency of the multiple features. Thus, an adaptive semi-supervised action classifier was trained. The main contributions can be summarized as follows:
(1) This work is the first to simultaneously consider manifold learning and Grassmannian kernels in semi-supervised action recognition, as we assume that action video samples lie in a Grassmannian manifold space. By modelling an embedding manifold subspace, both interclass separability and intraclass compactness are considered.
(2) To solve the unconstrained minimisation problem, we incorporate the PBB method to avoid matrix inversion, and apply a globalisation strategy via adaptive step sizes that allows the objective function values to decrease nonmonotonically, leading to improved convergence and accuracy.
(3) Extensive experiments verified that our method outperforms competing approaches on three benchmarks in a semi-supervised setting. We believe this study presents valuable insights into adaptive feature analysis for semi-supervised action recognition.
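As context for the pseudo-label generation mentioned above, one standard formulation of graph-based label propagation (not necessarily the paper's exact scheme; all names are illustrative) iterates a normalised affinity matrix over a partially labeled indicator matrix:

```python
import numpy as np

def propagate_labels(W, Y0, alpha=0.99, n_iter=100):
    """Graph label propagation sketch: iterate Y <- alpha*S@Y + (1-alpha)*Y0,
    where S is the symmetrically normalised affinity matrix and Y0 holds
    one-hot rows for labeled videos and zero rows for unlabeled ones."""
    d = W.sum(axis=1)
    S = W / np.sqrt(np.outer(d, d))  # D^{-1/2} W D^{-1/2}
    Y = Y0.copy()
    for _ in range(n_iter):
        Y = alpha * (S @ Y) + (1 - alpha) * Y0
    return Y.argmax(axis=1)  # pseudo-labels for all videos
```

On a graph with two tightly connected clusters and one labeled seed per cluster, the unlabeled nodes inherit the label of their cluster's seed.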
4. Experiments
The proposed method, called Kernel Grassmann Manifold Analysis (KGMA), is summarised in Algorithm 1. Conventional variants that use SPG [10] and the ALS method instead of PBB, called kernel spectral projected gradient analysis (KSPG) and kernel alternating least squares analysis (KALS), respectively, were also adopted to solve objective function (8) for comparison in our experiments.
Features. For handcrafted features, we followed [10] to extract improved dense trajectories (IDT) and Fisher vectors (FV), as shown in Figure 2. For deep-learned features, we retrained temporal segment network (TSN) [2] models on the $15\times c$ samples and then extracted the global-pool features of the $15\times c$ samples using the pretrained TSN model, concatenating RGB + flow into 2048 dimensions with power L2-normalisation, as listed in Table 1.
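As an illustration, the power L2-normalisation applied to the concatenated RGB + flow descriptor might look like the following sketch; the 1024-D per-stream size and the random placeholder values are assumptions, not the actual TSN outputs:

```python
import numpy as np

def power_l2_normalise(feat, alpha=0.5, eps=1e-12):
    """Signed power normalisation followed by L2 normalisation, a common
    post-processing step for FV and deep features before a linear classifier."""
    feat = np.sign(feat) * np.abs(feat) ** alpha
    return feat / (np.linalg.norm(feat) + eps)

# Hypothetical 1024-D RGB and flow global-pool features for one video
rgb = np.random.randn(1024)
flow = np.random.randn(1024)
video_feat = power_l2_normalise(np.concatenate([rgb, flow]))  # 2048-D, unit norm
```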
We verified the proposed algorithm using three kernels: the projection kernel ${k}^{\left[proj\right]}$, the canonical correlation kernel ${k}^{\left[CC\right]}$, and the combined kernel ${k}^{[proj+CC]}$. In some cases ${k}^{\left[proj\right]}$ outperforms ${k}^{\left[CC\right]}$, and in others the reverse holds, suggesting that a combination of the kernels is better suited to different data distributions. For ${k}^{[proj+CC]}$, the mixing coefficients ${\delta}^{\left[proj\right]}$ and ${\delta}^{\left[CC\right]}$ were fixed at one. We obtained better results by combining the two kernels.
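Under the common definitions of these Grassmannian kernels (the projection kernel as the squared Frobenius norm of the basis product, and the canonical correlation kernel as the largest canonical correlation between subspaces), a sketch might look as follows; the helper names are illustrative, and the paper's exact kernel definitions may differ in scaling:

```python
import numpy as np

def orthonormal_basis(X):
    """Orthonormal basis of the column space of X, i.e. a point on the
    Grassmann manifold, obtained via thin QR."""
    q, _ = np.linalg.qr(X)
    return q

def k_proj(Y1, Y2):
    """Projection kernel: squared Frobenius norm of Y1^T Y2."""
    return np.linalg.norm(Y1.T @ Y2, 'fro') ** 2

def k_cc(Y1, Y2):
    """Canonical correlation kernel: largest singular value of Y1^T Y2,
    i.e. the maximal canonical correlation between the two subspaces."""
    return np.linalg.svd(Y1.T @ Y2, compute_uv=False)[0]

def k_combined(Y1, Y2, delta_proj=1.0, delta_cc=1.0):
    """Weighted combination of the two kernels; the experiments above fix
    both mixing coefficients to one."""
    return delta_proj * k_proj(Y1, Y2) + delta_cc * k_cc(Y1, Y2)
```

For identical $p$-dimensional subspaces, $Y^{T}Y=I_p$, so the projection kernel evaluates to $p$ and the canonical correlation kernel to 1.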
Datasets. Three datasets were used in the experiments: JHMDB, HMDB51, and UCF101 [1]. The JHMDB dataset has 21 action categories; the average recognition accuracies over three training–test splits are reported. The HMDB51 dataset records 51 action categories; the mAP over three training–test splits is reported. The UCF101 dataset includes 101 action categories, containing 13,320 video clips; the average accuracy on the first split is reported.
For the JHMDB dataset, we followed the standard data partitioning (three splits) provided by the authors. For the other datasets, we used the first split provided by the authors and applied the original testing sets for fair comparison. Because the semi-supervised training set contained unlabeled data, we performed the following procedure to reform the training set for each dataset, where c denotes the number of classes (c = 21, 51, and 101 for JHMDB, HMDB51, and UCF101, respectively).
Using JHMDB as an example, we first randomly selected 30 training samples per category to form a training set ($30\times c$ samples). From this training set, we randomly sampled m videos (m = 3, 5, 10, and 15) per category as labeled samples. Therefore, if $m=10$, $10\times c$ labeled samples are available, leaving $(30\times c - 10\times c)$ videos as unlabeled samples for the semi-supervised training setting. We used the standard test set for testing. Owing to the randomly selected training samples, the experiments were repeated 10 times to avoid bias.
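The sampling procedure above can be sketched at the index level as follows (the `labels` array and the seed are illustrative placeholders):

```python
import numpy as np

def make_semisupervised_split(labels, n_train=30, m=10, seed=0):
    """Per class, draw n_train training videos, then mark m of them as
    labeled and the remaining n_train - m as unlabeled.
    `labels` is an array of class ids, one per video; returns index arrays."""
    rng = np.random.default_rng(seed)
    labeled, unlabeled = [], []
    for c in np.unique(labels):
        idx = rng.permutation(np.where(labels == c)[0])[:n_train]
        labeled.extend(idx[:m])      # m labeled samples for this class
        unlabeled.extend(idx[m:])    # n_train - m unlabeled samples
    return np.array(labeled), np.array(unlabeled)
```

For JHMDB with c = 21 and m = 10, this yields $10\times 21 = 210$ labeled and $20\times 21 = 420$ unlabeled training indices.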
To demonstrate the superiority of our approach (KGMA), we adopted the following methods for comparison: SVM (with χ² and linear kernels), SFUS [35], SFCM [3], MFCU [4], KSPG, and KALS. Notably, SFUS, SFCM, MFCU, KSPG, and KALS are semi-supervised action recognition approaches. Using the publicly available codes, we could facilitate a fair comparison.
Table 1. Comparison with deep-learned features (average accuracy ± std) when $15\times c$ training videos are labeled.

Method            JHMDB             HMDB51            UCF101
SFUS              0.6942 ± 0.0121   0.5217 ± 0.0114   0.7910 ± 0.0087
SFCM              0.7125 ± 0.0099   0.5394 ± 0.0108   0.8070 ± 0.0101
MFCU              0.7154 ± 0.0088   0.5556 ± 0.0098   0.8429 ± 0.0085
SVM (${\chi}^{2}$)   0.6931 ± 0.0106   0.5190 ± 0.0095   0.8138 ± 0.0108
SVM (linear)      0.7140 ± 0.0086   0.5385 ± 0.0077   0.8450 ± 0.0087
KSPG              0.7287 ± 0.0114   0.5697 ± 0.0833   0.8552 ± 0.0111
KALS              0.7218 ± 0.0087   0.5607 ± 0.0098   0.8411 ± 0.0095
KGMA              0.7361 ± 0.0096   0.5762 ± 0.1040   0.8673 ± 0.0087
For the semi-supervised parameters $\eta, \beta, \mu$ of SFUS, SFCM, MFCU, KSPG, KALS, and KGMA, we followed the same settings used in [3, 4], ranging over $\{{10}^{-4},{10}^{-3},{10}^{-2},{10}^{-1},1,{10}^{1},{10}^{2},{10}^{3},{10}^{4}\}$. Because the PBB parameters were not sensitive in our algorithm, we initialised them as in [7], as indicated in Algorithm 1. Notably, since KGMA applies PBB to find the optimal value of objective function (8), the convergence is nonmonotonic with oscillating objective function values, as shown in Figure 3. The absolute error alone therefore made it difficult to decide when to stop iterating; the relative error of the objective function values proved a better stopping criterion than the absolute error. We chose the constant $\epsilon ={10}^{-4}$ as the iteration-stopping criterion in (9).
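The relative-error stopping rule can be sketched in a few lines (the guard constant protecting against division by zero is an illustrative choice):

```python
def should_stop(f_prev, f_curr, eps=1e-4):
    """Relative-error stopping rule: robust to the oscillating objective
    values produced by nonmonotone PBB iterations, unlike a fixed
    absolute-error threshold."""
    return abs(f_curr - f_prev) / max(abs(f_prev), 1e-12) < eps
```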
Mathematical Comparisons. The recognition results with handcrafted features on the three datasets are shown in Figure 2. We compare our method with deep-learned features in Table 1.
Regarding the presented objective function (8), Figure 3 summarizes the computational results of the three optimization methods. Using the 2048-dimensional deep-learned TSN features on the JHMDB dataset, the model was trained with only 15 labeled and 15 unlabeled samples per class under the same semi-supervised parameters $\eta, \beta, \mu$; the performance differences when solving the same objective function could then be compared in terms of running time, number of iterations, absolute error, relative error, and objective function value. Figure 3 shows the convergence curves of the three optimization methods. Since both SPG and PBB are nonmonotonic optimization methods with relatively large fluctuations in objective function values, we omitted their first 29 iterations in Figure 3 and displayed only the data from the 30th iteration onward, so as to better illustrate the monotonic convergence of ALS.
As shown in Table 2, for a randomly selected video sample, ALS exhibited the fewest iterations and the shortest running time of 0.1220 seconds after extracting the deep features with TSN. In contrast, PBB exhibited the most iterations and the longest running time of 0.4212 seconds, while SPG's performance was intermediate between ALS and PBB. Considering Figure 3 and Table 2, it is evident that, despite using the PBB optimization method, our KGMA algorithm still achieves the highest accuracy on the kernelized Grassmann manifold space. Nevertheless, solving (9) with SPG yields only a marginal improvement over ALS, which is likely attributable to our novel kernelized Grassmann manifold space.
Performance on Action Recognition. A linear SVM was utilised as the baseline. Based on the comparisons, we observe the following: 1) KGMA achieved the best performance; our semi-supervised algorithm outperformed the linear SVM, a widely used supervised classifier. 2) All methods achieved better performance when using more labeled training data, as shown in Figure 2, or when enlarging the semi-supervised parameter range (i.e., $\eta, \beta, \mu$), as in Figure 4. 3) Averaging the accuracies over the $3\times c$, $5\times c$, $10\times c$, and $15\times c$ cases, the recognition of KGMA on JHMDB, HMDB51, and UCF101 improved by 2.97%, 2.59%, and 2.40%, respectively. When using TSN features, the recognition of KGMA on the same datasets improved by 2.21%, 3.77%, and 2.23%, respectively. Evidently, our semi-supervised method can improve recognition by leveraging unlabeled data, compared with a linear SVM trained on labeled data only. Figure 2 illustrates that our algorithm benefits from the multiple-feature analysis, the kernelized Grassmann space, and the iterative techniques of the PBB method.
These results can be attributed to several factors. First, our method not only leverages semi-supervised learning but also models intraclass action variation and interclass action ambiguity simultaneously; therefore, it gains a more significant improvement than other approaches when few labeled samples are available. Second, we uncover the action feature subspace on the Grassmannian manifold by incorporating Grassmannian kernels, and solve the objective function optimisation mathematically with an adaptive line-search strategy and the PBB method. Hence, the proposed algorithm works well in the few-labeled-samples case.
Convergence Study. For objective function (4), we conducted experiments with the TSN feature, fixed the semi-supervised parameters $\eta, \beta, \mu$, and then executed both the ALS and PBB methods 10 times. The results are listed in Table 2. Although the convergence of ALS exhibits no oscillation and requires fewer iterations, the PBB method can outperform ALS for three reasons. First, the PBB method uses a nonmonotone line-search strategy to globalise the process [8], which can reach the globally optimal objective function value rather than being trapped in local optima as with the monotone ALS method. Second, adaptive step sizes are an essential characteristic determining efficiency in projected gradient methodology [8], whereas this iteration-step technique is not considered in ALS. Finally, the efficient convergence properties of the projected gradient method have been demonstrated because the PBB is well defined [8].
Computational Complexity. In the training stage, we computed the Laplacian matrix L, the complexity of which is $O\left({n}^{2}\right)$. To optimise the objective function, we computed the projected gradient and trace operators of several matrices; the complexity of these operations is $O\left({n}^{3}\right)$.
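A minimal sketch of the $O(n^2)$ Laplacian construction follows; the k-NN Gaussian affinity is an assumed, common choice, and the paper's exact graph construction may differ:

```python
import numpy as np

def graph_laplacian(X, k=5, sigma=1.0):
    """Unnormalised graph Laplacian L = D - W from a k-NN Gaussian affinity.
    Building the n x n pairwise-distance matrix dominates the O(n^2) cost."""
    n = X.shape[0]
    d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    W = np.zeros((n, n))
    for i in range(n):
        nn = np.argsort(d2[i])[1:k + 1]            # skip the point itself
        W[i, nn] = np.exp(-d2[i, nn] / (2 * sigma ** 2))
    W = np.maximum(W, W.T)                         # symmetrise the k-NN graph
    return np.diag(W.sum(axis=1)) - W
```

By construction, L is symmetric and its rows sum to zero.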
Parameter Sensitivity Study. We verified that KGMA benefits from the intraclass and interclass manifold discriminant analysis, as shown in Figure 4. We analysed the impact of manifold learning on JHMDB and HMDB51, setting $\eta ={10}^{3}$ and $\mu ={10}^{1}$ at their optimal values over split 2, for $15\times c$ labeled training data. As $\beta$ varied from ${10}^{-4}$ to ${10}^{4}$, the accuracy oscillated significantly and reached a peak at $\beta ={10}^{4}$. As shown in Figure 4, $\beta$ controls the proportion between the intraclass local geometric structure and the interclass global manifold structure: when the intraclass local geometric structure is treated as a constant 1, a large ratio $\frac{\beta}{1}$ means that the interclass global manifold structure takes a larger proportion of the objective function, and vice versa. When $\beta =0$, no inter-manifold structure is utilised; conversely, as $\beta \to +\infty$, the intraclass structure vanishes. When the Grassmann manifold space leverages an adequate balance of intraclass action variation and interclass action ambiguity, the proposed algorithm can further enhance the discriminatory power of the transformation matrix.