Submitted:
17 October 2024
Posted:
18 October 2024
Abstract

Keywords:
1. Introduction
- Prepared the dataset using a number of methods, resulting in noise-free sequences that retain each child’s traits and record full movement information (e.g., gestures and position).
- Developed a high-performing predictive model that combines a Temporal Swin Transformer with ResNet-3D to extract and learn the spatial and temporal features of autistic children’s SMMs during the Pre-Meltdown Crisis.
- Conducted an empirical investigation to optimize the model structure and training parameters.
2. Related Works
2.1. Physical Activity Recognition for Normal People
2.2. Physical Activity Recognition for Autistic People
3. Theoretical Study: Stereotyped Motor Movements (SMMs) of Autistic Children
| SMM | Description | Sample from public |
|---|---|---|
| Face | Grimacing, lips or tongue movements, opening the mouth, mouth stretching, sucking objects | ![]() |
| Head and neck | Head tilting, shaking, nodding, hair twirling, headbanging, neck stretching | ![]() |
| Trunk | Body rocking, spinning or rotation of the entire body | ![]() |
| Shoulders | Bending, arching the back, shrugging the shoulders | ![]() |
| Arm OR leg | Arms flapping, bilateral repetitive movements involving the arms and hands such as crossing the arms on the chest, and tapping one’s feet | ![]() |
| Hand OR finger | Hand flapping, slapping, nail-biting, finger wiggling, shaking, tapping, waving, clapping, opening-closing, rotating or twirling the hand or fingers, thumb-sucking, pointing, fanning fingers, fluttering fingers in front of the face, picking skin, and self-scratching. | ![]() |
| Hand OR finger with object | Shaking, tapping, banging, twirling an object, rubbing, repetitive ordering, arranging toys in patterns, adding objects to a line, manipulating objects. | ![]() |
| Gait | Pacing, jumping, running, skipping, spinning. | ![]() |
| Self-directed | Covering the ears, mouthing, smelling, rubbing the eyes, tapping the chin, slapping self or an object or surface, and self-mutilating behavior. | ![]() |
4. "MeltdownCrisis" Dataset Collection and Preprocessing
4.1. Dataset Collection
4.2. Dataset Preprocessing
5. Methodology: Hybrid 3D Convolutional and Transformer Model
5.1. Video Data Augmentation
- Random Horizontal Flip: A data augmentation technique that mirrors frames horizontally, increasing the effective variety of the training set. Exposing the model to several variants of the same frame can help improve its accuracy. The method is applied with a “p=0.5” parameter, which is the probability that a frame is flipped.
- Random Rotation: A data augmentation technique that rotates each frame by a random angle within a predetermined range, which makes the model less sensitive to the orientation of the objects in the frames. In our case, frames are rotated randomly within a range of -15 to 15 degrees. For example, one frame may be rotated by -10 degrees, another by 5 degrees, and another by 14 degrees. The rotation angle is drawn at random for each frame, so each frame may have a different rotation angle within the defined range.
- Resize Video (256x256): Each frame of the video is resized to 256 pixels in width and height so that all frames have the same size, which is frequently required for input into neural network models. This guarantees (1) consistency: all video frames have the same dimensions, which is critical for batch processing in machine learning pipelines; (2) standardization: models frequently demand inputs of a specific size, so resizing all frames to 256x256 pixels ensures compatibility with the model design; and (3) efficiency: reducing the size of the frames lowers computational load and memory use, which is useful when training on large datasets.
- Random Crop (224x224): A random region of each frame is selected and cropped to 224x224 pixels. This transformation introduces variability in the training data, which can help improve the model’s robustness and generalization.
- Color Jitter: Random adjustments are made to (1) brightness, which controls the intensity of light in a frame: increasing it makes the frame brighter, while lowering it makes the frame darker; (2) contrast, which adjusts the difference between the bright and dark regions of a frame: higher contrast makes the shadows deeper and the highlights brighter, whilst lower contrast makes the frame look flatter; (3) saturation, which controls the strength of colors in a frame: increasing it makes colors more vibrant, and reducing it moves the frame closer to grayscale; and (4) hue, which shifts the colors of the frame along the color spectrum and can therefore change the colors of the items in it. This approach makes models more resilient to fluctuations in lighting conditions and color discrepancies. In our case, the method takes a video file, applies color jitter transformations to its frames, and returns the transformed frames as tensors.
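
The geometric part of the augmentations above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the authors' pipeline: the function names are hypothetical, a frame is modeled as a list of pixel rows, the rotation angle is only sampled (not applied), and color jitter is left out because the text does not give its ranges.

```python
import random

def sample_augmentation(rng, resized=256, crop_size=224, max_angle=15.0, flip_p=0.5):
    """Draw one set of augmentation parameters matching the values in the text."""
    max_offset = resized - crop_size  # largest valid crop origin along each axis
    return {
        "flip": rng.random() < flip_p,                # horizontal flip with probability p
        "angle": rng.uniform(-max_angle, max_angle),  # rotation angle in degrees
        "top": rng.randint(0, max_offset),            # crop origin, bounds inclusive
        "left": rng.randint(0, max_offset),
    }

def hflip(frame):
    """Mirror a frame (list of pixel rows) left to right."""
    return [row[::-1] for row in frame]

def crop_frame(frame, top, left, size):
    """Cut a size x size window whose upper-left corner is (top, left)."""
    return [row[left:left + size] for row in frame[top:top + size]]

rng = random.Random(0)  # fixed seed so the sketch is repeatable
params = sample_augmentation(rng)
frame = [[(r, c) for c in range(256)] for r in range(256)]  # dummy 256x256 frame
flipped = hflip(frame) if params["flip"] else frame
out = crop_frame(flipped, params["top"], params["left"], 224)
print(params, len(out), len(out[0]))
```

Calling `sample_augmentation` once per frame reproduces the per-frame randomness described for the rotation; calling it once per clip would instead keep the transformation consistent across the frames of a video.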
5.2. Dataset Transformation
- Transformations: Each frame is resized to 224x224 pixels. This standardizes the frame size, which is essential for effective batch processing in neural networks.
- Tensor Conversion: Frames are transformed from PIL images to PyTorch tensors. This change converts the picture data format from PIL’s HxWxC to PyTorch’s CxHxW, while also scaling pixel values from [0, 255] to [0, 1].
- Normalization: Tensor values are normalized per channel by subtracting a mean and dividing by a standard deviation, using mean = [0.485, 0.456, 0.406] and std = [0.229, 0.224, 0.225]. These are the widely used ImageNet channel statistics, so the normalized data has approximately, rather than exactly, zero mean and unit standard deviation. Such standardization improves the speed and consistency of deep learning model training.
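
The tensor conversion and normalization steps come down to simple per-pixel arithmetic. The sketch below works on nested Python lists instead of PIL images and PyTorch tensors, and the function name is hypothetical; it only illustrates the HxWxC to CxHxW reordering, the scaling to [0, 1], and the per-channel normalization with the mean and std values quoted above.

```python
MEAN = (0.485, 0.456, 0.406)  # per-channel mean quoted in the text (R, G, B)
STD = (0.229, 0.224, 0.225)   # per-channel standard deviation quoted in the text

def to_chw_normalized(frame_hwc):
    """Convert an HxWxC frame of 0..255 integers to CxHxW normalized floats."""
    height, width = len(frame_hwc), len(frame_hwc[0])
    return [
        [
            # scale to [0, 1], then subtract the channel mean and divide by its std
            [(frame_hwc[y][x][c] / 255.0 - MEAN[c]) / STD[c] for x in range(width)]
            for y in range(height)
        ]
        for c in range(len(MEAN))
    ]

# Tiny 2x2 RGB frame: a pixel near the channel means, white, black, near-mean again.
frame = [[(124, 116, 104), (255, 255, 255)],
         [(0, 0, 0), (124, 116, 104)]]
chw = to_chw_normalized(frame)
print(len(chw), len(chw[0]), len(chw[0][0]))  # channels, height, width
```

A pixel close to the channel means maps to values near 0, while pure white and pure black map to roughly +2.2 and -2.1 in the red channel, which shows the range the network actually sees.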
5.3. Local Feature Extraction using 3D-ResNet pretrained model
- Convolutional Layer (3D): This layer performs 3D convolution on the input video frames. Unlike 2D convolution, which operates only on the spatial dimensions (height and width), 3D convolution operates on both spatial and temporal dimensions (height, width, and depth/time). This layer is responsible for extracting spatio-temporal features from the video frames. The implementation details of the Conv3D layer are given in Table 2.
- Batch Normalization (3D): This layer normalizes the Conv3D layer’s output for each mini-batch. It helps to accelerate the training process while also boosting the neural network’s performance and stability. Batch normalization for 3D data is applied across the feature maps while preserving temporal coherence. Here, the input channels are equal to 128, which matches the output channels of the previous Conv3D layer.
- The Rectified Linear Unit (ReLU): An element-wise activation function that adds non-linearity to the model. It returns the input unchanged if it is positive and zero otherwise, which enables the network to learn more complicated patterns. The inplace option is set to True so that the operation overwrites its input and saves memory.
- Max Pooling (3D): This layer down-samples the feature maps, lowering their dimensionality while keeping the most important features. Max pooling reduces computing effort and helps control overfitting by providing an abstracted form of the representation (see Table 3).
- 3D Convolutional Layer: This layer reduces the number of channels from 128 to 3, preparing the output for the next stage. The Input Channels are equal to 128, the Output Channels are equal to 3 and the Kernel Size is equal to (1, 1, 1), indicating point-wise convolution.
| Parameters | Description |
|---|---|
| Input Channels | 3 (RGB channels) |
| Output Channels | 128 |
| Kernel Size | (3, 7, 7), indicating the convolution kernel spans 3 frames in time and 7x7 pixels in space. |
| Stride | (1, 2, 2), meaning the kernel moves 1 frame at a time in the temporal dimension and 2 pixels in spatial dimensions. |
| Padding | (1, 3, 3), which preserves the number of frames and, combined with the spatial stride of 2, halves the height and width (e.g., 224 to 112). |
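
To see what the layer settings above do to the size of a clip, the usual output-size formula for convolution and pooling, floor((n + 2p - k) / s) + 1 per axis, can be applied layer by layer. The sketch below is a worked calculation only: the 16-frame clip length is an assumption (the text does not state it), the pooling values are those listed for the max pooling layer (kernel (1, 3, 3), stride (1, 2, 2), padding (0, 1, 1)), and the point-wise convolution is assumed to use stride 1 and no padding.

```python
def out_size(size, kernel, stride, padding):
    """Output length of a convolution or pooling window along one axis."""
    return (size + 2 * padding - kernel) // stride + 1

def out_shape(shape, kernel, stride, padding):
    """Apply out_size independently to the (frames, height, width) axes."""
    return tuple(out_size(n, k, s, p) for n, k, s, p in zip(shape, kernel, stride, padding))

clip = (16, 224, 224)  # (frames, height, width); 16 frames is an assumed clip length
after_conv = out_shape(clip, kernel=(3, 7, 7), stride=(1, 2, 2), padding=(1, 3, 3))
after_pool = out_shape(after_conv, kernel=(1, 3, 3), stride=(1, 2, 2), padding=(0, 1, 1))
# Point-wise convolution: stride 1 and zero padding are assumptions.
after_pointwise = out_shape(after_pool, kernel=(1, 1, 1), stride=(1, 1, 1), padding=(0, 0, 0))
print(after_conv, after_pool, after_pointwise)
```

Under these assumptions the number of frames is unchanged throughout, while height and width go from 224 to 112 after the convolution and to 56 after pooling; the point-wise convolution changes only the channel count (128 to 3).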
5.4. Global Feature Extraction Using Swin_3D_b Model
- Feature Patching: This stage includes breaking down the 3D features retrieved from the 3D-ResNet backbone into smaller patches. Each patch is handled as a separate token for the transformer. This approach improves the efficiency with which the spatial and temporal elements of the incoming video are handled.
- Linear Embedding: The patches are passed through a linear embedding layer. This layer converts the patches into dimensions that the transformer can handle. It simply translates the raw patches into a higher-dimensional space, allowing the transformer to better grasp the data.
- Swin3D_b Transformer Layers: These are the core layers of the Swin3D transformer. The Swin3D transformer employs a hierarchical structure with shifted windows to capture both local and global features in the video. It consists of numerous layers of self-attention and feed-forward networks, which enable the model to learn complicated relationships between video frames. Figure 3 shows the typical architecture of a Swin-3D-based Transformer block. The Swin Transformer design, including the 3D edition, is composed of alternating layers of window-based multi-head self-attention (W-MSA) and shifted window multi-head self-attention (SW-MSA), together with feed-forward networks (MLP layers) and normalization layers (LayerNorm). The following is a breakdown of the components depicted in Figure 3 and how they fit into the Swin Transformer model [27]:
  - MLP (Multi-Layer Perceptron): The MLP is a feed-forward network made up of two fully connected layers with a GELU activation function in between. This component appears in both the standard transformer block and the Swin transformer block.
  - LayerNorm (Layer Normalization): Layer normalization is applied before the MLP and self-attention layers to help stabilize and speed up training. It normalizes the inputs to these layers to zero mean and unit variance, which is useful for training deep neural networks.
  - SW-MSA (Shifted Window Multi-Head Self-Attention): Swin Transformer computes self-attention inside local windows, and the window partition is shifted between consecutive layers to allow for cross-window connections. This shifted window method captures both local and global dependencies efficiently.
  - W-MSA (Window-based Multi-Head Self-Attention): This is the standard window-based self-attention method, which calculates attention within fixed-size windows. It focuses on capturing local dependencies among the tokens inside each non-overlapping window.
- Multi-Layer Perceptron Head: Following processing via the transformer layers, the resultant features are sent into a Multi-Layer Perceptron (MLP) head. The MLP head is generally made up of one or more fully connected layers with activation functions. This component serves to refine the extracted features by the transformer before they are fed into the classification layer.
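
The effect of feature patching and of the shifted window partition can be illustrated along a single axis. The sketch below is a simplified illustration, not the Swin3D implementation: the function names, the (16, 56, 56) feature map and the (2, 4, 4) patch size are assumptions, and the attention mask that real shifted-window attention applies to tokens wrapped around by the cyclic shift is ignored.

```python
def token_count(shape, patch):
    """Number of tokens when a (T, H, W) feature map is cut into 3D patches."""
    count = 1
    for n, p in zip(shape, patch):
        count *= n // p
    return count

def window_ids(length, window, shift=0):
    """Window index of each token along one axis after a cyclic shift."""
    return [((i - shift) % length) // window for i in range(length)]

regular = window_ids(8, window=4)           # W-MSA layout: fixed windows
shifted = window_ids(8, window=4, shift=2)  # SW-MSA layout: shift by window // 2
print(regular)
print(shifted)
print(token_count((16, 56, 56), patch=(2, 4, 4)))
```

In the regular layout, tokens 3 and 4 sit in different windows and cannot attend to each other; after the shift they share a window, which is how alternating W-MSA and SW-MSA layers pass information across window borders.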
5.5. Classification
6. Experimental Study
- Experiment 1 combined local and global features with data augmentation and 5-fold cross-validation. The model included EfficientNet-b0, a Transformer with a batch size of 16, a TimeDistributed layer, an LSTM, and a Dense layer, resulting in a validation accuracy of 71.67%.
- Experiment 2 focused on local features, using data augmentation but not cross-validation. The model used InceptionResNetV2, Flatten, Dense, two LSTM layers, Dropout, and another Dense layer, yielding a 75% validation accuracy.
- Experiment 3 used local and global features, VGG16, a Transformer with a batch size of 16, a TimeDistributed layer, an LSTM, and a Dense layer with data augmentation and 5-fold cross-validation, and achieved 77.56% accuracy.
- Experiment 4 followed the same setup as Experiment 3 but used ResNet50 instead of VGG16, resulting in an accuracy of 80.71%.
- Experiment 5, which focused on local features with data augmentation and no cross-validation, employed a 2D convolutional layer, one LSTM, and a Dense layer to achieve 81% accuracy.
- Experiment 6, which was likewise local-focused with data augmentation and no cross-validation, achieved 83% accuracy by using VGG16, Flatten, one LSTM, and a Dense layer.
- Experiment 7, using local features, data augmentation, and no cross-validation, used InceptionV3, Flatten, Dense, 2 LSTM layers, Dropout, and another Dense layer to achieve 87.5% accuracy.
- Experiment 8 achieved 89.46% accuracy by combining local and global features, data augmentation, and 5-fold cross-validation using ResNet18, a Transformer with a batch size of 16, a TimeDistributed layer, an LSTM, and a Dense layer.
- In Experiment 9, local features with data augmentation and no cross-validation were combined with ResNet50, Flatten, Dense, two LSTM layers, Dropout, and another Dense layer, resulting in a 91.25% accuracy.
| Experiment | Features Type | Data Augmentation | Cross Validation | Model Layers | Validation Accuracy |
|---|---|---|---|---|---|
| 1 | Local + Global | Yes | Yes | EfficientNet-b0+ Transformer-batch-16 + TimeDistributed Layer + LSTM + Dense Layer | 71.67% |
| 2 | Local | Yes | No | InceptionResNetV2 + Flatten + Dense + 2-LSTM + Dropout + Dense Layer | 75% |
| 3 | Local + Global | Yes | Yes | VGG16 + Transformer-batch-16 + TimeDistributed Layer + LSTM + Dense Layer | 77.56% |
| 4 | Local + Global | Yes | Yes | ResNet50 + Transformer-batch-16 + TimeDistributed Layer + LSTM + Dense Layer | 80.71% |
| 5 | Local | Yes | No | 2DConv + 1-LSTM Layer + Dense Layer | 81% |
| 6 | Local | Yes | No | VGG16 + Flatten + 1-LSTM+ Dense Layer | 83% |
| 7 | Local | Yes | No | InceptionV3 + Flatten + Dense + 2-LSTM + Dropout+ Dense Layer | 87.5% |
| 8 | Local + Global | Yes | Yes | ResNet18 + Transformer-batch-16 + TimeDistributed Layer + LSTM + Dense Layer | 89.46% |
| 9 | Local | Yes | No | ResNet50 + Flatten + Dense + 2-LSTM + Dropout + Dense Layer | 91.25% |
| Ours | Local + Global | Yes | Yes | 3D-ResNet + Swin-3D-b Transformer + Dense Layer with cross validation | 92.00% |
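
Several of the experiments above rely on 5-fold cross-validation. As a minimal sketch of how such a split can be produced, the code below shuffles sample indices and partitions them into five disjoint validation folds. The function name, the sample count of 23 and the seed are purely illustrative, and the split is neither stratified by class nor grouped by child; how the folds were actually built is not stated in the text.

```python
import random

def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_indices, val_indices) for each of k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k disjoint, nearly equal folds
    for i, val in enumerate(folds):
        # Training set is every fold except the one held out for validation.
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, val

splits = list(k_fold_indices(n_samples=23, k=5))
for fold_number, (train, val) in enumerate(splits, start=1):
    print(fold_number, len(train), len(val))
```

Each sample is used for validation exactly once, and the reported score is then the average of the per-fold validation scores.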
7. Qualitative and Quantitative Evaluation
7.1. Quantitative Evaluation
- Accuracy rate [scikit-learn]: It is the proportion of videos correctly classified as normal behavior or abnormal behavior. It measures the overall classification performance.
- Precision [scikit-learn]: Also called confidence, it denotes the proportion of predicted positive cases that are real positives.
- Recall [scikit-learn]: It is referred to as the true positive rate or sensitivity. It is defined as the ratio of the total number of correctly classified positive/abnormal behavior instances divided by the total number of positive/abnormal behavior instances.
- F1-score [scikit-learn]: It is the harmonic mean of precision and recall.
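
The four metrics above follow directly from the counts of true and false positives and negatives. The sketch below is a dependency-free illustration of those definitions for a binary task (abnormal behavior as the positive class); the function name and the toy labels are made up for the example, and the text itself indicates scikit-learn as the source of the metrics.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall and F1 for a binary classification task."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)  # true positives
    fp = sum(t != positive and p == positive for t, p in pairs)  # false positives
    fn = sum(t == positive and p != positive for t, p in pairs)  # false negatives
    tn = len(pairs) - tp - fp - fn                               # true negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]  # toy ground truth: 1 = abnormal behavior
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]  # toy predictions
scores = binary_metrics(y_true, y_pred)
print(scores)
```

With these toy labels there are 3 true positives, 1 false positive, 1 false negative and 5 true negatives, so accuracy is 0.8 while precision, recall and F1 are all 0.75.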
7.2. Qualitative Evaluation
8. Conclusions and Future Works
Funding
Acknowledgments
References
- Anzulewicz, A.; Sobota, K.; Delafield-Butt, J.T. Toward the Autism Motor Signature: Gesture patterns during smart tablet gameplay identify children with autism. Scientific reports 2016, 6, 31107. [CrossRef]
- Jazouli, M.; Majda, A.; Merad, D.; Aalouane, R.; Zarghili, A. Automatic detection of stereotyped movements in autistic children using the Kinect sensor. International Journal of Biomedical Engineering and Technology 2019, 29, 201–220. [CrossRef]
- spectrum disorders, A., 2020.
- Baumeister, A.A.; Forehand, R. Stereotyped acts. In International review of research in mental retardation; Elsevier, 1973; Vol. 6, pp. 55–96.
- Rad, N.M.; Furlanello, C. Applying deep learning to stereotypical motor movement detection in autism spectrum disorders. In Proceedings of the 2016 IEEE 16th International Conference on Data Mining Workshops (ICDMW). IEEE, 2016, pp. 1235–1242.
- Masmoudi, M.; Jarraya, S.K.; Hammami, M. Meltdowncrisis: Dataset of autistic children during meltdown crisis. In Proceedings of the 2019 15th International Conference on Signal-Image Technology & Internet-Based Systems (SITIS). IEEE, 2019, pp. 239–246.
- DSM-V., 2013.
- Zhao, C.; Chen, M.; Zhao, J.; Wang, Q.; Shen, Y. 3d behavior recognition based on multi-modal deep space-time learning. Applied Sciences 2019, 9, 716. [CrossRef]
- Saha, S.; Singh, G.; Sapienza, M.; Torr, P.H.; Cuzzolin, F. Deep learning for detecting multiple space-time action tubes in videos. arXiv preprint arXiv:1608.01529 2016.
- Singh, G.; Saha, S.; Sapienza, M.; Torr, P.H.; Cuzzolin, F. Online real-time multiple spatiotemporal action localisation and prediction. In Proceedings of the IEEE international conference on computer vision, 2017, pp. 3637–3646.
- Qiu, Z.; Sun, J.; Guo, M.; Wang, M.; Zhang, D. Survey on deep learning for human action recognition. In Proceedings of the Data Science: 5th International Conference of Pioneering Computer Scientists, Engineers and Educators, ICPCSEE 2019, Guilin, China, September 20–23, 2019, Proceedings, Part II 5. Springer, 2019, pp. 3–21.
- Tang, Y.; Tian, Y.; Lu, J.; Li, P.; Zhou, J. Deep progressive reinforcement learning for skeleton-based action recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 5323–5332.
- Li, N.; Guo, H.W.; Zhao, Y.; Li, T.; Li, G. Active temporal action detection in untrimmed videos via deep reinforcement learning. IEEE Access 2018, 6, 59126–59140. [CrossRef]
- Ranasinghe, K.; Naseer, M.; Khan, S.; Khan, F.S.; Ryoo, M.S. Self-supervised video transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 2874–2884.
- Hussain, A.; Hussain, T.; Ullah, W.; Baik, S.W. Vision transformer and deep sequence learning for human activity recognition in surveillance videos. Computational Intelligence and Neuroscience 2022, 2022, 3454167. [CrossRef]
- Wensel, J.; Ullah, H.; Munir, A. Vit-ret: Vision and recurrent transformer neural networks for human activity recognition in videos. IEEE Access 2023. [CrossRef]
- Hosseyni, S.R.; Taheri, H.; Seyedin, S.; Rahmani, A.A. Human Action Recognition in Still Images Using ConViT. arXiv preprint arXiv:2307.08994 2023.
- Xing, Z.; Dai, Q.; Hu, H.; Chen, J.; Wu, Z.; Jiang, Y.G. Svformer: Semi-supervised video transformer for action recognition. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2023, pp. 18816–18826.
- Gonçalves, N.; Costa, S.; Rodrigues, J.; Soares, F. Detection of stereotyped hand flapping movements in Autistic children using the Kinect sensor: A case study. In Proceedings of the 2014 IEEE international conference on autonomous robot systems and competitions (ICARSC). IEEE, 2014, pp. 212–216.
- Dundi, U.R.; Kanaparthi, V.P.K.; Bandaru, R.; Umaiorubagam, G.S. Computer Vision Aided Machine Learning Framework for Detection and Analysis of Arm Flapping Stereotypic Behavior Exhibited by the Autistic Child. In Proceedings of the International Conference on Computational Intelligence in Data Science. Springer, 2023, pp. 203–217.
- Jones, R.; Wint, D.; Ellis, N. The social effects of stereotyped behaviour. Journal of Intellectual Disability Research 1990, 34, 261–268. [CrossRef]
- Ghanizadeh, A. Clinical approach to motor stereotypies in autistic children. Iranian journal of pediatrics 2010, 20, 149.
- Lam, K.S.; Aman, M.G. The Repetitive Behavior Scale-Revised: independent validation in individuals with autism spectrum disorders. Journal of autism and developmental disorders 2007, 37, 855–866. [CrossRef]
- Noris, B. Machine vision-based analysis of gaze and visual context: an application to visual behavior of children with autism spectrum disorders. PhD thesis, Citeseer, 2011.
- Albinali, F.; Goodwin, M.S.; Intille, S. Detecting stereotypical motor movements in the classroom using accelerometry and pattern recognition algorithms. Pervasive and Mobile Computing 2012, 8, 103–114. [CrossRef]
- Chandola, Y.; Virmani, J.; Bhadauria, H.; Kumar, P. Deep Learning for Chest Radiographs: Computer-Aided Classification; Elsevier, 2021.
- Liu, X.; Wang, Z.; Wan, J.; Zhang, J.; Xi, Y.; Liu, R.; Miao, Q. RoadFormer: road extraction using a swin transformer combined with a spatial and channel separable convolution. Remote Sensing 2023, 15, 1049. [CrossRef]
- Wang, H.; Wang, Z.; Du, M.; Yang, F.; Zhang, Z.; Ding, S.; Mardziel, P.; Hu, X. Score-CAM: Score-weighted visual explanations for convolutional neural networks. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition workshops, 2020, pp. 24–25.
- Selvaraju, R.R.; Cogswell, M.; Das, A.; Vedantam, R.; Parikh, D.; Batra, D. Grad-cam: Visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE international conference on computer vision, 2017, pp. 618–626.





| Parameters | Description |
|---|---|
| Kernel Size | (1, 3, 3), indicating pooling over 1 frame in time and 3x3 pixels in space. |
| Stride | (1, 2, 2), meaning the kernel moves 1 frame at a time in the temporal dimension and 2 pixels in spatial dimensions. |
| Padding | (0, 1, 1), which, combined with the spatial stride of 2, halves the height and width of the feature maps while keeping the number of frames. |

| Fold | Val_Accuracy | Precision | Recall | F-score |
|---|---|---|---|---|
| Fold 1 | 0.910 | 0.91 | 0.9 | 0.904 |
| Fold 2 | 0.920 | 0.92 | 0.902 | 0.912 |
| Fold 3 | 0.915 | 0.915 | 0.901 | 0.908 |
| Fold 4 | 0.930 | 0.93 | 0.902 | 0.916 |
| Fold 5 | 0.925 | 0.935 | 0.903 | 0.918 |
| The mean of all Folds | 0.92 | 0.922 | 0.9016 | 0.9116 |
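
The last row of the table above can be checked by averaging the five per-fold values of each column. The short sketch below does exactly that with the numbers copied from the table; the dictionary keys are illustrative names only.

```python
from statistics import mean

# Per-fold values copied from the table above (Fold 1 to Fold 5).
folds = {
    "val_accuracy": [0.910, 0.920, 0.915, 0.930, 0.925],
    "precision": [0.91, 0.92, 0.915, 0.93, 0.935],
    "recall": [0.9, 0.902, 0.901, 0.902, 0.903],
    "f_score": [0.904, 0.912, 0.908, 0.916, 0.918],
}
averages = {name: round(mean(values), 4) for name, values in folds.items()}
print(averages)
```

The averages agree with the reported means of 0.92, 0.922, 0.9016 and 0.9116.
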

| Paper | Population | Approach | Dataset | Recognition Rate |
|---|---|---|---|---|
| [13] | Normal people | Multi-task Network | THUMOS’14; ActivityNet v1.2 | 61.2%; 42.3% |
| [2] | Autistic people | Nearest neighbour classifier | Autistic Dataset | 91.57% |
| [16] | Normal people | Recurrent transformer (ReT) | 20 Actions Database; 50 Action Database; 101 Action Database | 80.0%; 73.8%; 71.7% |
| [14] | Normal people | Self-supervised learning approach for video transformers | Kinetics-400; SSv2; UCF-101; HMDB-51 | 78.1%; 59.2%; 90.8%; 67.2% |
| [18] | Normal people | SVFormer-B | UCF-101; Kinetics-400 | 86.7%; 69.4% |
| [20] | Autistic people | Evaluating arm-flapping stereotypic behavior in autistic children using computer vision and machine learning approaches | Videos by mimicking the arm flapping stereotypic behavior | 95% |
| Ours | Autistic children | Hybrid 3D Convolutional Transformer | Meltdown Crisis Dataset | 92% |

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).








