Multi-modal Data Fusion Method for Human Behavior Recognition Based on Two IA-Net and CHMM

The multi-modal data fusion method based on IA-Net and CHMM proposed here is designed to solve the problem that incomplete target behavior information in a complex family environment leads to low accuracy of human behavior recognition. Two improved neural networks (STA-ResNet50, STA-GoogleNet) are combined with LSTM to form two IA-Nets, which extract RGB and skeleton modal behavior features from video respectively. The two modal feature sequences are input to a CHMM to construct a probabilistic fusion model for multi-modal behavior recognition. The experimental results show that the human behavior recognition model proposed in this paper achieves higher accuracy than previous fusion methods on the HMDB51 and UCF101 datasets. New contributions: an attention mechanism is introduced to improve the efficiency of video target feature extraction and utilization; a skeleton-based feature extraction framework is proposed, which can be used for human behavior recognition in complex environments; and probability theory and neural networks are combined in the field of human behavior recognition, providing a new method for multi-modal information fusion.


Introduction
In information-based family care for the elderly, human behavior recognition [1][2][3] has important nursing value for grasping the on-site situation, judging abnormal behavior, preventing accidents, and ensuring the safety of elderly life. How to accurately recognize behaviors in complex environments is a research hot spot at home and abroad [4][5]. Behavior recognition using target features acquired by a single modality is susceptible to environmental factors such as lighting, viewing angle, and background; missing and incomplete features lead to inaccurate recognition results [6][7][8]. A multi-modal fusion model can not only capture multi-modal data and resolve the data-loss problem of single-sensor behavior recognition, but also improve recognition accuracy by exploiting the complementarity of different modal data. Multi-modal information [13][14] is usually fused adaptively to obtain higher recognition accuracy than individual features, but there is usually no theoretical justification for the weight assigned to each feature. A target recognition method based on fuzzy theory is proposed in [15]: to improve the accuracy of the fusion model, an improved logsig function is introduced to express the importance of the information, and the weights are then calculated using fuzzy relationships to improve recognition accuracy. However, the sensor weights used to obtain the target characteristics are given in advance, so the recognition results are vulnerable to human factors. The article [16] presents a behavior recognition method based on the Hidden Markov Model (HMM), which uses a probabilistic fusion method to provide a theoretical basis for multi-modal data fusion.
However, this method requires the model parameters to be set adaptively, introduces too many system parameters, and slows down model training and calculation. At present, there is no in-depth study on recognition under complex backgrounds and target occlusion.
Two improved attention networks (IA-Net) [17][18][19] and a coupled hidden Markov model (CHMM) [20] are combined for behavior recognition in this paper. Each IA-Net combines an improved spatio-temporal attention network, STA-ResNet50 or STA-GoogleNet, with a Long Short-Term Memory (LSTM) network. Model advantages: incomplete features reduce the classification accuracy of the first level, and the two-level fusion mechanism can repair this at a higher level. Previous HMM-based behavior recognition methods needed to establish an adaptive HMM classifier for each behavior; this framework instead uses the LSTM model to automatically extract system parameters, which improves the learning of the classifiers in the CHMM. Finally, IA-ResNet50 and IA-GoogleNet are used for behavior feature extraction to avoid the negative impact of unsupervised local features (such as HOG), strengthening important features and weakening redundant ones.

Improved feature extraction network
In a family environment with a complex background, the premise of recognizing dangerous behavior of the elderly, treating them in time, and ensuring their life safety is the accurate recognition of their behavior. However, due to object occlusion, lighting, environment, and other factors in the home, behavior recognition can be inaccurate, which poses a threat to the lives of the elderly. To improve the reliability of human behavior recognition, extracting effective behavior feature information is vitally important. This paper uses Microsoft Kinect equipment to obtain RGB and skeleton video from different angles, which not only effectively overcomes the influence of a complex background and avoids occlusion problems, but also supports construction of the multi-modal fusion behavior recognition model, as shown in Fig. 1.

Fig.1. Multi-modal human behavior recognition model
As can be seen from Fig. 1, RGB and 3D skeleton videos are first input to the STA-GoogleNet+LSTM (denoted LSTMr) and STA-ResNet50+LSTM (denoted LSTMg) networks to obtain two streams of feature sequences. Secondly, the multi-modal features are sent to the CHMM as inputs to construct the probabilistic fusion model for human behavior recognition.
Because of information redundancy, and because different features in a neural network have distinct effects on the recognition result, the SE-block [18] is introduced and analyzed in this paper. It is found that the SE-block uses two fully connected (FC) layers, which causes the number of network parameters to increase sharply, while the sigmoid function simultaneously leads to neuron inactivation. An improved SE-block (ISE-block) is therefore proposed and embedded into ResNet50 and GoogleNet respectively. The specific implementation process is shown in Fig. 2. Given a video input X with C′ characteristic channels, a feature map with C channels is obtained after a series of convolution transformations F_tr. Finally, this module is introduced into the residual branches of ResNet50 and the inception modules of GoogleNet, together with the STA-Net.
In the ISE residual module of Fig. 2, the FC+ReLU+FC+Sigmoid path of the SE-block [17] is replaced with conv_1+ReLU+conv_2+Sigmoid to obtain the ISE-block. To avoid the excessive computational cost caused by the growth in parameters, the conv_1 convolution replaces the FC layer and is connected through the ReLU function to the conv_2 convolution, whose sigmoid output yields weights normalized to 0~1. Finally, the ISE-block is introduced into the different networks to form the ISE-Nets, so that features of different importance are assigned different weights and the efficiency of feature extraction is improved.
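The ISE-block path described above can be sketched in a few lines. The following is a minimal NumPy illustration, not the paper's implementation: the squeeze step is a global average pool, and the two 1×1 convolutions act on the pooled per-channel vector, so here they reduce to small matrix multiplications. The weight shapes and channel-reduction ratio are illustrative assumptions.

```python
import numpy as np

def ise_block(feature_map, w1, w2):
    """Illustrative ISE-block: squeeze (global average pool), then
    conv_1 + ReLU + conv_2 + sigmoid to produce per-channel weights
    in (0, 1), which rescale the input channels.
    Shapes: feature_map (C, H, W), w1 (C//r, C), w2 (C, C//r)."""
    # Squeeze: global average pooling over the spatial dimensions -> (C,)
    z = feature_map.mean(axis=(1, 2))
    # Excitation: conv_1 + ReLU, then conv_2 + sigmoid
    s = np.maximum(w1 @ z, 0.0)            # conv_1 + ReLU
    s = 1.0 / (1.0 + np.exp(-(w2 @ s)))    # conv_2 + sigmoid, weights in (0, 1)
    # Scale: re-weight each channel of the original feature map
    return feature_map * s[:, None, None]
```

Because the sigmoid output lies strictly in (0, 1), each channel is attenuated in proportion to its learned importance, which is the re-weighting effect the ISE-block is intended to provide.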
The video length required by different actions is not the same; however, the classical neural networks can only accept video input of a fixed length (7 frames), resulting in low behavior recognition accuracy on videos of arbitrary length. To extract the features of continuous actions of different durations more fully, this paper connects an LSTM model behind each improved neural network to handle the representation of complex human behavior in long videos. After the LSTM is connected to the fully connected layers of IA-ResNet and IA-GoogleNet, the relationships among the features of continuous action sequences of various lengths are obtained.

Two IA-Nets+CHMM fusion calculation
To recognize the behavior of the elderly more accurately, the multi-modal information obtained in the second part is fused. The specific method is as follows. In the process of home care for the elderly, to reduce the impact of the background on human behavior recognition, STA-ResNet+LSTM is used to obtain the skeleton stream, in which each joint feature is expressed by a quaternion, so that a complete 3D skeleton pose over 25 joints is expressed as a 4 × 25 = 100-dimensional vector. From another perspective, STA-GoogleNet+LSTM directly extracts the RGB behavior features. The RGB and skeleton features are then taken as input to the CHMM model, yielding the multi-modal fused human behavior recognition model shown in Fig. 3. The two network models have the same structure but different feature vector types; their outputs are input to the CHMM model for fusion to produce the human behavior recognition result. Here, X_t^r and X_t^g are the RGB and skeleton input vectors respectively, and h_{t-1} is an intermediate hidden variable,
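The 4 × 25 = 100-dimensional skeleton representation above amounts to flattening 25 joints of 4 quaternion components each. A tiny sketch (the function name is hypothetical, not from the paper):

```python
import numpy as np

def skeleton_to_vector(joints):
    """Flatten 25 joints, each a 4-component quaternion, into the
    4 x 25 = 100-dimensional skeleton behavior vector described above.
    `joints` is an array of shape (25, 4)."""
    joints = np.asarray(joints, dtype=float)
    assert joints.shape == (25, 4), "expect 25 joints x 4 quaternion components"
    return joints.reshape(-1)  # shape (100,)
```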

Fig.2. ISE-Net structure
h_0 is the initial value, σ(·) and tanh(·) are the activation functions, w is the network weight, and b is the bias. To facilitate modeling, the superscripts r and g are defined to denote RGB-related and skeleton-related information. The outputs of the LSTMs are y_t^r and y_t^g, the RGB and skeleton sequences, and y_t^r, y_t^g are input to the CHMM as observation signals. According to graphical model theory, the CHMM is divided into two basic models to simplify its calculation. Each is a simple dynamic Bayesian network (DBN) that can be characterized by Markov chain theory and is defined by the parameters λ = (π, A, B), where π represents the prior (initial state) distribution, A is the state transition matrix, and B is the observation matrix.
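The σ(·)/tanh(·) gating with weights w, bias b, and hidden state h_{t-1} mentioned above is the standard LSTM cell update. A minimal NumPy sketch under that assumption (the i, f, o, g gate ordering and stacked weight layout are illustrative conventions, not taken from the paper):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_cell(x_t, h_prev, c_prev, w, b):
    """One LSTM step. x_t: input (D,), h_prev/c_prev: hidden and cell
    state (H,), w: stacked gate weights (4*H, D+H), b: bias (4*H,)."""
    H = h_prev.shape[0]
    z = w @ np.concatenate([x_t, h_prev]) + b
    i = sigmoid(z[:H])          # input gate
    f = sigmoid(z[H:2*H])       # forget gate
    o = sigmoid(z[2*H:3*H])     # output gate
    g = np.tanh(z[3*H:])        # candidate cell update
    c_t = f * c_prev + i * g    # cell state update
    h_t = o * np.tanh(c_t)      # hidden output, bounded in (-1, 1)
    return h_t, c_t
```

Iterating this cell over the frame-level CNN features produces the observation sequences y_t^r and y_t^g that feed the CHMM.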

The two streams, STA-ResNet50+LSTMr and STA-GoogleNet+LSTMg, produce two observation sequences, and the information in the two sequences can then be fused through probabilistic reasoning to produce a final state estimate with high accuracy.

In the STA-ResNet50/STA-GoogleNet+LSTM+CHMM calculation process, the CHMM is divided into two HMMs to calculate the optimal hidden-state probability. Each behavior-related HMM is defined by the three parameters λ = (π, A, B) [16], and the state estimate can be obtained from Bayes' theorem [7]. Starting from the first observation y_1, the state probability at time t is determined recursively from

P(s_t | y_1:t) ∝ b_{s_t}(y_t) Σ_{s_{t-1}} a_{s_{t-1}, s_t} P(s_{t-1} | y_1:t-1),

and the state can be optimized using the observation sequence:

ŝ_t = argmax_s P(s_t = s | y_1:t).

Where s_t^g and s_t^r respectively represent the states of the g and r sequences at time t, an optimized joint state is then estimated from the two observed sequences:

ŝ_t = argmax_s P(s_t = s | y_1:t^r, y_1:t^g).
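The recursion above can be sketched concretely. The following is a minimal NumPy illustration of HMM forward filtering for λ = (π, A, B) with discrete observations, plus a naive independence-based fusion of the two stream posteriors; the paper's exact CHMM coupling may differ, so treat this as a sketch of the probabilistic reasoning, not the authors' implementation.

```python
import numpy as np

def forward_filter(pi, A, B, obs):
    """HMM forward filtering: returns P(s_t | y_1..t) for each t.
    pi: prior over states, A[i, j] = P(s_t = j | s_{t-1} = i),
    B[j, k] = P(y = k | s = j), obs: sequence of symbol indices."""
    alpha = pi * B[:, obs[0]]
    alpha /= alpha.sum()                 # normalize to a posterior
    out = [alpha]
    for y in obs[1:]:
        alpha = (alpha @ A) * B[:, y]    # predict, then correct
        alpha /= alpha.sum()
        out.append(alpha)
    return np.array(out)                 # shape (T, n_states)

def fuse_states(post_r, post_g):
    """Naive fusion sketch: combine the per-stream posteriors
    (assumed conditionally independent) and take the MAP state."""
    joint = post_r * post_g
    joint /= joint.sum(axis=1, keepdims=True)
    return joint.argmax(axis=1)          # optimized state per time step
```

With the RGB and skeleton observation sequences filtered separately, `fuse_states` plays the role of the final argmax over the joint posterior in the estimate above.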

Experiments and Discussion
To better verify the performance of the proposed model and provide a basis for intelligent care of the elderly, the experiments and analyses are organized around five aspects: experimental environment, selection of experimental parameters, datasets, extraction of behavior features, and evaluation of fusion model performance.

Experimental environments
The computer used in the experiments is an HP Pavilion 15 with the Windows 10 operating system, an Intel quad-core (TM) i5-7300 processor at 2.6 GHz, 8 GB of memory, and an NVIDIA GeForce GTX 1050 graphics card; the test environment is MATLAB 2021a.

Selection of experimental parameters
To train a high-precision neural network, the HMDB51 dataset [21] is used for repeated training and optimization over 30 rounds, as shown in Table 1. If the learning rate is set too small, convergence becomes very slow, while too large a value prevents convergence, so the best learning rate is 0.0001. The 'adam' optimization algorithm, which gives high accuracy in video classification, is adopted. As the mini-batch size increases, the training accuracy increases; however, at 32 the network cannot train due to hardware constraints and the program is interrupted, so the mini-batch size is set to 16. To prevent overfitting, a dropout rate of 0.7 is selected experimentally, which gives better validation accuracy. Finally, the execution environment is set to the dedicated 'GPU' to improve the training speed of the model.
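The selected hyperparameters can be collected in one place. A hypothetical configuration dictionary mirroring the values chosen above (the key names are illustrative, not a real API):

```python
# Hypothetical training configuration reflecting the selections above
train_opts = {
    "optimizer": "adam",        # high accuracy in video classification
    "learning_rate": 1e-4,      # smaller -> slow convergence; larger -> no convergence
    "mini_batch_size": 16,      # 32 exceeded the hardware's memory
    "dropout": 0.7,             # chosen experimentally to limit overfitting
    "max_rounds": 30,           # repeated training and optimization rounds
    "execution_environment": "gpu",
}
```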

Datasets
The two mainstream behavior recognition datasets, UCF101 [22] and HMDB51, are used to evaluate the performance of the proposed model. UCF101 contains 101 kinds of behaviors and 13320 videos, mainly divided into five categories: body movements only, human-human interaction, human-object interaction, playing musical instruments, and various sports. The HMDB51 dataset has 51 action categories with 6766 video samples. In this paper, 60% of the videos are used for training, 20% for validation, and the remaining 20% for testing.
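The 60/20/20 partition above is a straightforward shuffled split. A small sketch (the function name and seeding convention are illustrative):

```python
import random

def split_dataset(videos, seed=0):
    """Shuffle a list of video identifiers and split it 6:2:2 into
    train / validation / test subsets, as used in the experiments."""
    videos = list(videos)
    random.Random(seed).shuffle(videos)  # reproducible shuffle
    n = len(videos)
    n_train, n_val = int(0.6 * n), int(0.2 * n)
    return (videos[:n_train],                      # 60% training
            videos[n_train:n_train + n_val],       # 20% validation
            videos[n_train + n_val:])              # remaining 20% test
```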

IA-Net extracting behavior features
IA-Net is used to visualize the actions 'Driving' and 'brush_hair' on the two datasets: the first convolution layer conv_1 and the last convolution layer fire2 are visualized on different frames of each action, and the results are shown in Fig. 4 and Fig. 5. In the first convolution layer, the contour information of the action is captured, as shown in the red boxes of the figures. As the number of convolutions increases, the extracted feature information becomes finer, as shown in the blue boxes. Therefore, to obtain more accurate behavior information, the attention mechanism is added at the end of each convolution block when improving the network, to increase the weight of important features and weaken redundant ones.

Performance evaluation of the model
To assess the accuracy of the proposed model in identifying the behavior of the elderly in complex family environments, the two types of behavior features extracted in Section 3.4 are input to the CHMM for fusion, forming the final behavior recognition model. Experiments are performed on the HMDB51 and UCF101 datasets to check the performance of the proposed model.

Evaluating performance on HMDB51 dataset:
To evaluate the performance of the model, the two improved neural networks are combined with the CHMM algorithm to form the final recognition model. Firstly, the accuracy of behavior recognition is obtained through model training. Secondly, the experimental results are analyzed to find the specific behaviors that are misidentified, providing a basis for model optimization. Finally, to further observe the discrimination ability of the model, a dimensionality-reducing clustering method is used to analyze the misidentified behaviors. The performance of the model is judged from these three aspects.
(1) Experimental process. The HMDB51 dataset contains 51 classes of actions, from which ten classes are selected in turn. The training process for classes 1-10 is shown in Fig. 6; the accuracy is 89.04%.

Fig.6. Training accuracy of HMDB51 dataset
After five rounds of experiments, the training accuracy is obtained as in Table 2; the overall training accuracy on the HMDB51 dataset is 87.68%.

Table 2. Training accuracy on the HMDB51 dataset

(2) Analysis of the experimental process. To observe the model's recognition of each specific action on the HMDB51 validation dataset, the confusion matrix is used. There are about 20 videos for each of the 10 action classes; the vertical axis represents the true action labels and the horizontal axis the predicted results. Fig. 7 shows that the validation accuracy is 96.37%.

Fig.7. Confusion matrix on the HMDB51 validation dataset
Meanwhile, the generalization ability of the model is evaluated on the test set, with an accuracy of 87.88%, as shown in Fig. 8. Among the classes, the 'drive' action recognition rate is only 76.0%, because many video segments in this class (sailing, aerial driving, and so on) almost lose sight of the driving subject, for example at the bottom of a sailboat. To further validate the model's ability to discriminate behaviors, this paper visualizes the low-dimensional distribution of different actions: 10 representative action classes from HMDB51 are selected for unsupervised clustering, as shown in Fig. 9. Overall, the classification results are relatively good. However, individual behaviors such as 'brush hair' have sparsely distributed points because of large differences in motion amplitude and angle. Three 'chew' actions are misidentified as 'climb' because three people in the dataset chew in stair and ladder scenes, causing recognition errors; this provides a basis for later model improvement.

Evaluating performance on UCF101 dataset:
(1) Experimental process. From the UCF101 dataset, 10 types of actions were randomly selected, each containing about 107 videos of around 3 seconds, totaling 1241 videos. The videos are randomly partitioned 6:2:2, with 60% as the training set, 20% as the validation set, and the remaining 20% as the test set. The validation accuracy on this dataset reaches 99%, as shown in Fig. 10. When the test set is used for prediction, the accuracy is 97.92%, as shown in Fig. 12. The actions 'Fencing' and 'Breaststroke' are recognized as 'Baseball pitch' due to the background and lighting, resulting in a low recognition rate.

(3)Visual Fusion Results
To further validate the model's ability to discriminate behaviors, 10 types of actions from the UCF101 dataset are selected and clustered, as shown in Fig. 13. The clusters of 'Billiards' and 'Basketball' overlap because both are ball sports whose main features are the human and the sphere; the many similar features lead to partial confusion in the recognition results.

Comparison and analysis of experimental results
On the UCF101 and HMDB51 datasets, this model is compared with state-of-the-art behavior recognition methods. The results are shown in Table 3 and Table 4, covering both single-modal and multi-modal fusion models. Table 3 shows that on the HMDB51 dataset, the recognition rates of the traditional single-modal HDL and Mandal algorithms are only 61.65% and 70.40%, far below those of multi-modal fusion methods; in addition, some of the latest models significantly improve recognition accuracy by optimizing the network structure. Compared with the other methods, the accuracy of the two-IA-Net+CHMM fusion model in this paper reaches 87.88%, a clearly remarkable result. This is because it makes full use of the spatio-temporal attention model to give different weights to different features and adopts probabilistic reasoning for fusion; it is 3.08 percentage points higher than the most advanced ST-ResNet model in Table 3.

On the UCF101 dataset, our method is compared with several two-stream methods and some of the latest methods, as shown in Table 4. Behavior recognition relies more on RGB: from 3D CNN to the Mandal model, recognition accuracy increases from 82.3% to 95.7%, indicating that RGB has strong advantages in feature extraction and provides important information for multi-modal data fusion. Except for MF+MVF, our method outperforms most of the latest methods; specifically, its result is 3.02% higher than the STIAM fusion method and 1.95% higher than the Shou fusion method. Finally, comparing the test results of the RGB-based and RGB+skeleton-based models, the highest accuracies are 95.7% and 94.9% respectively, which shows that the recognition accuracy of the multi-modal fusion model is higher than that of the single-modal one.
The accuracy of our model is as high as 97.92%, indicating its advanced performance.
Our model achieves excellent results on both the UCF101 and HMDB51 datasets. The main reason is that the model introduces the ISE-block into the ResNet and GoogleNet networks to form two IA-Nets, which extract behavior features with different weights from the RGB and skeleton streams respectively and strengthen the proportion of key features. Appending the LSTM network then aids the classification of long-duration video behavior. Finally, the CHMM probabilistic method is used for data fusion, improving the accuracy of human behavior recognition.

Conclusion
A fusion method combining two IA-Nets and a CHMM is proposed in this article. Based on multi-modal data fusion, probabilistic reasoning, and deep learning, the improved IA-ResNet50 and IA-GoogleNet networks are designed, and the CHMM is then used to fuse the feature information of the two models. The complementary advantages of different modal features are exploited to improve the behavior recognition rate. However, the problem of interactive behavior recognition has not been deeply discussed; relevant research will be carried out in the future to extend the application scenarios and scope of our model.