Study on Temperature Variance for SimCLR-based Activity Recognition

Human Activity Recognition (HAR) is the process of automatically detecting human activities from streams of data generated by various sensors, including inertial sensors, physiological sensors, location sensors, cameras, time, and many others. In this paper, we propose a robust SimCLR model for human activity recognition together with a study of temperature variance. SimCLR, a contrastive learning technique for visual representations, is optimized by regulating the temperature and incorporated to improve HAR performance in healthcare.


Pranjal Kumar, NIT Hamirpur, H.P., India-177005. Tel.: +918637511985. Fax: +91-1972-223834. E-mail: pranjal@nith.ac.in

Introduction
Precise human activity recognition requires considering the links between actors, objects and their surroundings, often over long time periods. One reason video comprehension is so difficult is that it requires an understanding of the interactions between actors, objects and other context in the scene. Moreover, these interactions cannot always be seen in a single frame and therefore require reasoning over long periods of time. Some existing approaches model only spatial relationships between actors and objects, but not how those interactions evolve over time. Alternative approaches model long-range temporal interactions [1] but do not capture spatial relations. Although certain methods model spatio-temporal interactions between objects [2,3], their explicit object representations require further supervision. Early works in this field modelled human-object interaction [4,5], various objects [6], and the contextual relations between human actions and scenes [7,8]. Human vision has also been shown to be context dependent [9]. A major problem in video understanding is the recognition of human actions and group activities [10]. Action and activity recognition techniques have been widely used, for example in social behavior understanding, sports video analysis and video monitoring. To better understand a video scene with several people, it is important to understand both the individual actions and the collective activity of everyone involved. Recently, SimCLR was applied to healthcare, and to HAR in particular, for the first time [11]. In this paper, we suggest several ways to improve the functionality and efficiency of human action recognition through optimizations of the contrastive loss [12,13].
The main contributions of the proposed methodology are summarized below:
-We provide a detailed study of the behaviour of contrastive learning, with special emphasis on the temperature coefficient, on sensor data for human activity recognition.
-We improve SimCLR performance by regulating the temperature coefficient.

Action Recognition
In earlier works, hand-crafted features were used to encode motion information [14,15]. Advances in deep learning first saw 2D image convolutional neural networks (CNNs) repurposed for video as "two-stream" networks [16,17], followed by space-time 3D CNNs [18-21]. These architectures, however, concentrate on extracting broad, video-level features and are not suited to studying fine-grained relationships. Graph neural networks (GNNs) explicitly model the interactions between entities by representing them as nodes in a directed or undirected graph [22][23][24], with interactions passing through a neighbourhood defined for each node. Self-attention [25] and non-local operators [26] can also be considered GNNs in which each element of a feature map is a node and all nodes are fully connected. Such models have performed outstandingly on several natural language processing and informatics tasks, inspiring numerous follow-up methods [27][28][29][30][31].

Human Object Interaction
The objective of Human Object Interaction (HOI) detection is to locate humans and objects and to recognise their interactions. Previous studies [32][33][34][35][36][37] show promising results for HOI detection by decoupling it into object detection and interaction classification. Specifically, human and object detections are first obtained from a pre-trained object detector, and the detections are then paired into human-object proposals for interaction classification. Recent approaches [38,37,36] introduce a proxy detection problem that optimises HOI detection indirectly. First, the interaction proposal is predefined on the basis of human priors. UnionDet [37], for example, defines the interaction proposal as the union box of the human and object boxes. PPDM [36] uses the centre point between the human and the object as the interaction point.

Framework
The SimCLR [39] architecture consists of the following primary modules.
-A data augmentation module that randomly transforms a given data example, producing two correlated views of the same example.

Results & Discussion

Dataset
MotionSense [40], a publicly available dataset, was used in our evaluation. It comprises data from 24 individuals who carried an iPhone 6s in the front pocket of their pants while performing 6 different activities: walking downstairs, walking upstairs, walking, jogging, sitting and standing. In this study, 6630 windows of tri-axial accelerometer data sampled at 50 Hz were used, each 400 timestamps long with 50 percent overlap.
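The windowing described above (400 timestamps per window, 50 percent overlap) can be sketched as follows. This is a minimal illustration in plain Python, not the preprocessing code used in the study: the function name `sliding_windows`, the synthetic recording, and the choice to discard an incomplete trailing window are all assumptions.

```python
from typing import List, Sequence, Tuple

Sample = Tuple[float, float, float]  # one tri-axial accelerometer reading (x, y, z)


def sliding_windows(
    signal: Sequence[Sample], window_size: int = 400, overlap: float = 0.5
) -> List[List[Sample]]:
    """Split a recording into fixed-length windows with fractional overlap.

    Trailing samples that do not fill a complete window are discarded
    (an assumption; the source does not say how they were handled).
    """
    if not 0.0 <= overlap < 1.0:
        raise ValueError("overlap must be in [0, 1)")
    # Distance between consecutive window starts: 200 for 400 samples at 50%.
    step = max(1, int(window_size * (1.0 - overlap)))
    windows = []
    for start in range(0, len(signal) - window_size + 1, step):
        windows.append(list(signal[start:start + window_size]))
    return windows


# Synthetic recording of 1000 samples, standing in for real accelerometer data.
recording = [(0.01 * t, 0.0, 9.8) for t in range(1000)]
windows = sliding_windows(recording)
print(len(windows), len(windows[0]))
```

With a 50 percent overlap, each new window starts 200 samples after the previous one, so every sample except those at the edges of a recording appears in two windows.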

-Experiment 1: Temperature variation and loss function
In this section, we conduct extensive studies on the temperature coefficient to understand how it affects the behaviour of the proposed network, using activity prediction precision as the assessment metric. Figs. 1, 2, 3 and 4 show the loss plots for T = 0.07, 0.1, 0.2 and 1, respectively.
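To make the role of the temperature concrete, the sketch below evaluates the NT-Xent (normalized temperature-scaled cross-entropy) loss, the contrastive loss used by SimCLR, at the four temperatures studied here. It is an illustration under stated assumptions rather than the training code of this study: the two-dimensional embeddings are toy values, and the names `cosine_similarity` and `nt_xent_loss` are chosen for this example.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nt_xent_loss(view_a: List[Vector], view_b: List[Vector], temperature: float) -> float:
    """Mean NT-Xent loss over 2N anchors; (view_a[i], view_b[i]) are positives."""
    embeddings = list(view_a) + list(view_b)
    n = len(view_a)
    total = 0.0
    for i in range(2 * n):
        positive = (i + n) % (2 * n)  # index of the other view of the same example
        # Temperature-scaled similarities to every other embedding in the batch.
        logits = {
            k: cosine_similarity(embeddings[i], embeddings[k]) / temperature
            for k in range(2 * n)
            if k != i
        }
        # Numerically stable log-sum-exp over the positive and all negatives.
        peak = max(logits.values())
        log_denominator = peak + math.log(
            sum(math.exp(v - peak) for v in logits.values())
        )
        total += log_denominator - logits[positive]
    return total / (2 * n)


# Toy embeddings (assumed values): two examples, each seen under two views.
view_a = [[1.0, 0.0], [0.0, 1.0]]
view_b = [[0.9, 0.1], [0.1, 0.9]]
losses = {t: nt_xent_loss(view_a, view_b, t) for t in (0.07, 0.1, 0.2, 1.0)}
for t, loss in losses.items():
    print(f"T={t}: loss={loss:.4f}")
```

On these well-aligned toy embeddings the loss grows with T, and for very large T it approaches log(2N-1) because all similarities are flattened towards the same value. A small T instead sharpens the softmax, so the loss is dominated by the hardest negatives.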

-Experiment 2: Visualisation of results via t-SNE plots after full model evaluation
In this experiment, we analyze the performance of group activity recognition under different temperature settings. Figs. 5, 6, 7 and 8 show the output of the visualization model for T = 0.07, 0.1, 0.2 and 1, respectively.

Comparative Study with Baseline Models
In this section, we compare our best models with state-of-the-art methods. Linear and fine-tuned evaluations were conducted on the MotionSense dataset to assess the impact of using different transformations for SimCLR pre-training. Results are shown in Table 1.
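The transformations used for pre-training are not listed in this section. As an illustration only, the sketch below shows how a transformation can turn one accelerometer window into the two correlated views that SimCLR contrasts. The three transformations (`add_noise`, `scale`, `negate`), their parameters, and the helper `two_views` are assumptions made for this example and are not taken from the study.

```python
import random
from typing import Callable, List, Tuple

Sample = Tuple[float, ...]  # one tri-axial reading (x, y, z)
Window = List[Sample]
Transform = Callable[[Window, random.Random], Window]


def add_noise(window: Window, rng: random.Random, sigma: float = 0.05) -> Window:
    """Add independent Gaussian noise to every value (sigma is an assumed setting)."""
    return [tuple(v + rng.gauss(0.0, sigma) for v in sample) for sample in window]


def scale(window: Window, rng: random.Random, sigma: float = 0.1) -> Window:
    """Multiply the whole window by one random factor drawn around 1.0."""
    factor = rng.gauss(1.0, sigma)
    return [tuple(v * factor for v in sample) for sample in window]


def negate(window: Window, rng: random.Random) -> Window:
    """Flip the sign of every value; rng is unused but keeps the signature uniform."""
    return [tuple(-v for v in sample) for sample in window]


def two_views(window: Window, transform: Transform, rng: random.Random) -> Tuple[Window, Window]:
    """Apply the same random transformation twice to obtain two correlated views."""
    return transform(window, rng), transform(window, rng)


rng = random.Random(0)  # fixed seed so the example is reproducible
window = [(0.0, 0.0, 9.8)] * 400  # synthetic stand-in for one 400-sample window
view_a, view_b = two_views(window, add_noise, rng)
print(len(view_a), view_a[0] != view_b[0])
```

Comparing transformations then amounts to pre-training once per `transform` and evaluating each resulting encoder in the same way. A deterministic transformation such as `negate` yields two identical views, so in practice it would be combined with a random one.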

Conclusion
In this work, we have adapted the SimCLR contrastive learning framework from visual representation learning to HAR, one of the most relevant tasks for digital health applications. We have studied the effect of temperature variance on the contrastive loss and thereby improved HAR performance.