Preprint
Article

This version is not peer-reviewed.

Intelligent Behaviour Recognition in Developmental Disorder Screening of Adolescents Using Augmented Deep CNN Models

Submitted: 21 July 2025
Posted: 22 July 2025


Abstract
Adolescent activity detection plays a critical role in ensuring safety, supporting early diagnosis of developmental disorders, and enhancing educational environments. This paper presents a deep learning-based approach for recognizing and classifying adolescent activities using the ResNet-18 convolutional neural network (CNN). The proposed system is trained and evaluated on a comprehensive dataset of adolescents’ activities, incorporating data augmentation techniques to improve generalization. Experimental results demonstrate that the ResNet-18 model achieves high accuracy in detecting various adolescent behaviours, outperforming traditional CNN models. The lightweight nature of ResNet-18 ensures real-time performance, making it suitable for practical applications in smart surveillance, healthcare monitoring, and educational tools. Future work will focus on integrating more diverse datasets and optimizing the model for deployment in real-world environments. This study highlights the potential of ResNet-18 as an effective solution for intelligent adolescent activity recognition systems.

Introduction

Detecting adolescent activities is critical not only for keeping adolescents safe but also for enabling early intervention and safeguarding their health and wellbeing. Real-time monitoring of adolescents’ activities has been shown to identify potential hazards and health problems effectively. For example, movement imitation therapy has been shown to improve neurological outcomes in children who experienced adverse perinatal conditions, underscoring the importance of supporting healthy motor behaviour from birth [1]. Physiological detection systems in immersive virtual reality environments have likewise been used to monitor motor learning progress among adolescents with motor disabilities; such systems can track the evolution of motor activity and its effect on outcomes in real time, providing a rich view of the learning process [2]. When applied to EEG data, CNNs extract key features without manual feature engineering, advancing the detection of hand motions, as Kok et al. demonstrated in 2024 [3]; their work hinted at broader applications. In healthcare, eye tracking combined with deep learning models such as ResNet-18 has greatly improved the accuracy and effectiveness of early autism detection systems in adolescents [4]. Equally important for adolescent safety, IoT-based smart cradles allow infants’ activities to be monitored in real time, so that parents can follow their children’s daily behaviour and health conditions more effectively [5].

Current techniques for identifying adolescents’ actions face substantial limitations in feature extraction, real-time performance, and generalization. Standard machine learning methods often rely on manually engineered features, which struggle to capture the intricate and varied behavioural patterns of children, especially in dynamic environments [6]. Deep learning models such as VGG and AlexNet, despite their strong feature-extraction capacity, suffer from high computational complexity, rendering them impractical for real-time detection and prone to over-fitting on large-scale, diverse datasets. To address these obstacles, ResNet-18 leverages residual learning and a lightweight network structure, mitigating vanishing gradients and enabling the extraction of high-level features from images, which significantly improves detection accuracy and robustness in adolescent action recognition tasks [7]. CNNs can also perform behaviour recognition directly on edge devices [8], enabling real-time, energy-efficient evaluation in the field and moving monitoring from theory to practice. Moreover, ResNet-18’s computational efficiency enables deployment on real-time monitoring platforms and mobile devices, making it flexible across diverse adolescent behaviour observation scenarios [9]. Adopting ResNet-18 therefore overcomes the shortcomings of existing techniques and enables more accurate and efficient adolescent activity detection. This research aims to design an efficient adolescent behaviour recognition system based on the ResNet-18 model that achieves accurate, real-time detection of adolescents’ activities.

Methods

While convolutional neural networks are powerful models, their performance degrades as depth increases [10]. ResNet introduced skip connections to bypass this degradation, allowing layers to learn residual functions instead. This architecture facilitates the learning of identity mappings, prevents gradients from vanishing, and speeds convergence [11]. By passing inputs directly to subsequent layers, skip connections enable ResNet to model complex patterns accurately even at great depth, vastly improving generalization beyond what was previously achievable for very deep models and reshaping the field of computer vision [12]. Figure 1 illustrates the architecture’s 18 weighted layers (17 convolutional layers and one fully connected layer). A typical convolutional layer is described by several parameters; for instance, the layer labelled "3 × 3, Conv, 256, /2" consists of 256 filters with a 3 × 3 window and a stride of 2. With only 18 layers yet strong feature extraction, ResNet-18 is a lightweight but capable model: its compact form maintains near state-of-the-art results while limiting resource usage, which suits resource-constrained scenarios such as real-time processing. Model pruning can shrink ResNet-18 further while preserving most of its accuracy when a small footprint matters most.
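To make the architecture concrete, the following is a minimal sketch, assuming torchvision's standard ResNet-18 with ImageNet pretraining (the paper does not state whether pretrained weights were used) and a 20-class classification head:

```python
# Minimal sketch: torchvision ResNet-18 adapted to 20 activity classes.
# ImageNet pretraining is an assumption, not a detail reported in the paper.
import torch
import torch.nn as nn
from torchvision import models

NUM_CLASSES = 20
model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
model.fc = nn.Linear(model.fc.in_features, NUM_CLASSES)  # replace the 1000-way head

x = torch.randn(1, 3, 224, 224)            # dummy RGB input
assert model(x).shape == (1, NUM_CLASSES)  # logits for the 20 categories
```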
Recent analyses show how residual block architecture profoundly affects deep networks. Increasing the number of residual blocks improves precision, but the benefit diminishes as depth grows, so a careful balance is needed [13]. Multi-armed bandits have been used to prune blocks judiciously, preserving performance while reducing complexity [14]. Substituting 3 × 3 convolutions for standard 1 × 1 convolutions within residual blocks strengthens feature extraction, improving results on the CIFAR datasets [15]. Further refinements emerged through ResNEsts and DenseNEsts, which optimize shortcuts and address limited feature reuse in deep networks [16]. In audio classification, residual block design likewise governs both precision and computational efficiency [17]. The interplay of depth, shortcuts, and internal connections thus strongly determines a model’s capabilities. The dataset used in this work is derived from Stanford 40 Action [18]: 20 categories related to adolescents’ activities are retained, and adult-specific actions are removed. Data augmentation techniques, such as random cropping, rotation, and flipping, are applied to improve model generalization, and training optimizes the cross-entropy loss with the Adam optimizer and learning-rate scheduling to enhance convergence (a loading sketch follows below). Figure 2 shows random images from the 20 action categories, which cover the main adolescent actions; adult-specific behaviours such as smoking were removed.
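As an illustration of this dataset preparation, the sketch below assumes the Stanford 40 Action images have been unpacked into one folder per retained class (so adult-specific folders such as smoking are already deleted); paths and hyperparameters are illustrative, not taken from the paper:

```python
# Sketch: load the 20-class subset via torchvision's ImageFolder convention.
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

base_tf = transforms.Compose([
    transforms.Resize((224, 224)),   # ResNet-18's usual input size
    transforms.ToTensor(),
])
train_set = datasets.ImageFolder("data/stanford40_adolescent/train", transform=base_tf)
train_loader = DataLoader(train_set, batch_size=32, shuffle=True, num_workers=4)
assert len(train_set.classes) == 20  # only the retained categories remain
```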
This section outlines the key mathematical components that support the design, optimization, and evaluation of the ResNet-18-based adolescent activity recognition model.
The model is trained using the categorical cross-entropy loss function, which measures the dissimilarity between the true class labels and the predicted probability distribution generated by the network.
$$ L_{CE} = -\sum_{i=1}^{C} y_i \log(\hat{y}_i) $$
Here, C denotes the total number of classes, y_i is the one-hot encoded ground-truth label, and ŷ_i is the predicted probability for class i [19].
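For illustration, a toy computation of this loss in PyTorch; note that `torch.nn.functional.cross_entropy` takes raw logits and integer class indices, applying log-softmax internally:

```python
# Sketch: categorical cross-entropy on a toy 3-class example.
import torch
import torch.nn.functional as F

logits = torch.tensor([[2.0, 0.5, -1.0]])  # raw network outputs
target = torch.tensor([0])                 # true class index (one-hot index)
loss = F.cross_entropy(logits, target)     # equals -log(softmax(logits)[0, 0])
manual = -F.log_softmax(logits, dim=1)[0, 0]
assert torch.isclose(loss, manual)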
To overcome vanishing gradients and performance degradation in deep networks, the ResNet architecture introduces residual connections. These connections enable the network to learn identity mappings, allowing gradients to propagate more effectively during backpropagation.
$$ y = F(x, \{W_i\}) + x $$
In this equation, x is the input feature map, F denotes the residual function, typically composed of convolution, batch normalization, and ReLU layers, and {W_i} represents the parameters of these layers [20].
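A minimal sketch of such a block (identity shortcut only; the strided blocks in ResNet-18 additionally use a 1 × 1 projection on the shortcut path):

```python
# Sketch: the basic residual block y = F(x, {W_i}) + x.
import torch.nn as nn

class BasicBlock(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))  # first conv-BN-ReLU of F(x)
        out = self.bn2(self.conv2(out))           # second conv-BN of F(x)
        return self.relu(out + x)                 # add identity shortcut, then ReLU
```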
To facilitate efficient convergence and prevent overshooting during training, a learning rate decay policy is employed. The learning rate at training epoch t is updated as follows.
$$ \eta_t = \eta_0 \gamma^t $$
Here, η_0 is the initial learning rate, γ is the decay factor (typically less than 1), and t is the current epoch index [21].
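In PyTorch this policy corresponds to `ExponentialLR`; the η_0 and γ values below are assumptions for illustration (the paper does not report them), and `model` is reused from the earlier sketch:

```python
# Sketch: eta_t = eta_0 * gamma^t via one scheduler step per epoch.
from torch.optim import Adam
from torch.optim.lr_scheduler import ExponentialLR

optimizer = Adam(model.parameters(), lr=1e-3)     # eta_0 (assumed value)
scheduler = ExponentialLR(optimizer, gamma=0.95)  # gamma < 1 (assumed value)
# call scheduler.step() once at the end of each training epoch
```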
At the final output layer, the model produces unnormalized logits, which are transformed into class probabilities via the softmax function.
$$ \hat{y}_i = \frac{e^{z_i}}{\sum_{j=1}^{C} e^{z_j}} $$
where z_i is the logit for class i, and the denominator ensures a valid probability distribution over all C classes [22].
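A quick numeric check of the softmax transform on a toy logit vector:

```python
# Sketch: logits -> probabilities; printed values are approximate.
import torch

z = torch.tensor([1.0, 2.0, 0.1])
p = torch.softmax(z, dim=0)                       # ~[0.2424, 0.6590, 0.0986]
assert torch.isclose(p.sum(), torch.tensor(1.0))  # valid probability distribution
```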
The model optimization uses the Adam algorithm, which combines momentum-based updates and adaptive learning rates. The parameter update rule is defined as:
$$ \theta_t = \theta_{t-1} - \eta \, \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} $$
where m̂_t and v̂_t are the bias-corrected first and second moment estimates of the gradients, ϵ is a small constant for numerical stability, and η is the learning rate [23].
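One Adam update can be written out directly to mirror this rule; the β, ϵ, and η values below are the common defaults from Kingma and Ba [23], not values reported in the paper:

```python
# Sketch: a single hand-rolled Adam step matching the update rule above.
import torch

theta = torch.tensor([0.5, -0.3])        # parameters
grad  = torch.tensor([0.1, -0.2])        # gradient at theta
m = torch.zeros_like(theta)              # first-moment accumulator
v = torch.zeros_like(theta)              # second-moment accumulator
beta1, beta2, eps, eta, t = 0.9, 0.999, 1e-8, 1e-3, 1

m = beta1 * m + (1 - beta1) * grad       # update biased first moment
v = beta2 * v + (1 - beta2) * grad**2    # update biased second moment
m_hat = m / (1 - beta1**t)               # bias-corrected first moment
v_hat = v / (1 - beta2**t)               # bias-corrected second moment
theta = theta - eta * m_hat / (v_hat.sqrt() + eps)
```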
To enhance model generalization and reduce overfitting, data augmentation is used as a stochastic transformation process. Given an original image x, an augmented version x˜ is sampled from a transformation distribution T.
$$ \tilde{x} \sim T(x) $$
The transformation set T includes operations such as random cropping, rotation, horizontal flipping, and color jittering. These augmentations introduce diversity into the training data and improve the robustness of the learned features.
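A concrete T built from the operations named above (the magnitudes are illustrative assumptions); applying it twice to the same image draws two different samples x̃:

```python
# Sketch: a stochastic augmentation pipeline T(x) with torchvision.
from torchvision import transforms

T = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomRotation(degrees=15),                # rotation bound assumed
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.ColorJitter(brightness=0.2, contrast=0.2), # jitter strength assumed
    transforms.ToTensor(),
])
# x1, x2 = T(img), T(img)   # two independent draws from T(x) for a PIL image
```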
Together, these mathematical tools form the basis of the learning architecture, contributing to both the accuracy and efficiency of the proposed adolescent activity recognition system.
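Putting the pieces together, a condensed training-loop sketch follows, reusing `model`, `train_loader`, `optimizer`, and `scheduler` from the earlier sketches; the epoch count is an assumption:

```python
# Sketch: cross-entropy training with Adam and per-epoch exponential LR decay.
import torch
import torch.nn.functional as F

device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)
for epoch in range(30):                                # epoch count assumed
    model.train()
    for images, labels in train_loader:
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()
        loss = F.cross_entropy(model(images), labels)  # L_CE on raw logits
        loss.backward()
        optimizer.step()
    scheduler.step()                                   # eta_t = eta_0 * gamma^t
```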

Results

Experimental results demonstrate that the ResNet-18 model achieves high accuracy in discerning various adolescent behaviours, surpassing conventional CNN models in both predictive precision and computational efficiency. The confusion matrix [24] analysis shows robust classification across most activity groupings, indicating the model’s ability to differentiate between closely related activities. As shown in Figure 3, the training accuracy of the ResNet-18 model rises rapidly and stabilizes near 100% after approximately 10 epochs, while the validation accuracy stabilizes at around 80%, reflecting the model’s generalization across diverse adolescent action categories. Moreover, as illustrated in Figure 4, the model’s Precision, Recall, and F1 score improve steadily during the early training epochs and settle around 0.80.
The confusion matrix in Figure 5 offers a quick per-class view of the model’s classification validity. Visually distinctive activities such as climbing, riding a bike, and jumping show high true-positive rates, while other categories score lower. Minor misclassifications arise when the model tries to distinguish visually similar activities such as reading and writing on books, blowing bubbles, and drinking water; these can be attributed to overlapping visual characteristics [25]. Despite these minor confusions, the concentration of correct predictions along the matrix diagonal attests to the model’s robustness in separating complex adolescent behaviours.
The ROC curve [26] in Figure 6 further confirms the model’s accuracy in adolescent activity recognition: with an AUC of 0.96, the curve rises steeply toward the top-left corner of the chart, indicating strong class separability.
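For reference, a sketch of how the metrics behind Figures 5 and 6 could be computed with scikit-learn; the `val_loader` and the reuse of `model` and `device` from the earlier sketches are assumptions:

```python
# Sketch: confusion matrix (Figure 5) and macro one-vs-rest AUC (Figure 6).
import numpy as np
import torch
from sklearn.metrics import confusion_matrix, roc_auc_score

model.eval()
probs_list, labels_list = [], []
with torch.no_grad():
    for images, labels in val_loader:
        p = torch.softmax(model(images.to(device)), dim=1)
        probs_list.append(p.cpu().numpy())
        labels_list.append(labels.numpy())
probs = np.concatenate(probs_list)
labels = np.concatenate(labels_list)

cm = confusion_matrix(labels, probs.argmax(axis=1))  # rows: true, cols: predicted
auc = roc_auc_score(labels, probs, multi_class="ovr", average="macro")
```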
Compared with other state-of-the-art models in adolescent activity recognition, ResNet-18 offers both competitive and complementary strengths. A YOLOv8s-based model [27] achieved 88.7% accuracy in recognizing adolescents’ activities in a kindergarten setting, separating hierarchical structures from the rest of the data and detecting moving objects in real time [28]. However, YOLOv8s excels at object detection, whereas ResNet-18 provides strong feature extraction for activity classification. A Graph Convolutional Network (GCN) model reached a peak accuracy of 82.24% after feature extraction and fine-tuning were introduced [29]; although GCNs effectively model spatial relations and temporal dependencies, they tend to require elaborate data preprocessing and substantial computational resources. In contrast, ResNet-18 provides a simpler, more efficient solution without compromising classification performance. A deep learning framework has also been released to recognize behaviours associated with neurodevelopmental disorders from videos of adolescents at play [30]. While models like YOLOv8s and GCN deliver notable accuracy improvements in specific contexts, ResNet-18 offers a well-balanced solution combining high accuracy, computational efficiency, and versatility across diverse adolescent activity recognition tasks.

Conclusion

This study effectively applies the ResNet-18 model to adolescent activity recognition. ResNet-18 exploits its residual learning framework and lightweight structure to achieve high accuracy and efficient real-time performance. The model attains better classification accuracy and computational efficiency than traditional CNN architectures and can serve a variety of applications such as smart surveillance, healthcare monitoring, and educational tools. In future work, we will enrich the dataset with more diverse and increasingly complex activity types, incorporate attention mechanisms, and further optimize the model for edge deployment, thereby improving its adaptability and robustness.

Data Availability

All data generated or analysed during this study are included in this published article.

References

  1. Z. Zhussupova, D. Ayaganov, G. Zharmakhanova, G. Nurlanova, L. Tekebayeva, and A. Mamedbayli, “The influence of movement imitation therapy on neurological outcomes in children who have experienced adverse perinatal conditions,” West Kazakhstan Medical Journal, vol. 66, pp. 331–342, 12 2024.
  2. S. Houzangbe, M. Lemay, and D. E. Levac, “Towards physiological detection of a “just-right” challenge level for motor learning in immersive virtual reality: a pilot study. (preprint),” JMIR Research Protocols, vol. 13, p. e55730, 12 2023.
  3. C. L. Kok, C. K. Ho, T. H. Aung, Y. Y. Koh, and T. H. Teo, “Transfer learning and deep neural networks for robust intersubject hand movement detection from eeg signals,” Applied Sciences, vol. 14, no. 17, 2024.
  4. I. A. Ahmed, E. M. Senan, T. H. Rassem, M. A. H. Ali, H. S. A. Shatnawi, S. M. Alwazer, and M. Alshahrani, “Eye tracking-based diagnosis and early detection of autism spectrum disorder using machine learning and deep learning techniques,” Electronics, vol. 11, p. 530, 02 2022.
  5. S. Kusuda, Y. Aihara, I. Kusakawa, and S. Hosono, Eds., Handbook of Positional Plagiocephaly. Springer Nature Singapore, 2024.
  6. S. Agrawal, G. S. N. P., B. K. Singh, G. B., and M. V., “Eeg based classification of learning disability in children using pretrained network and support vector machine,” in Biomedical Engineering Science and Technology, B. K. Singh, G. Sinha, and R. Pandey, Eds. Cham: Springer Nature Switzerland, 2024, pp. 143–153.
  7. M. Qasim and E. Verdu, “Video anomaly detection system using deep convolutional and recurrent models,” Results in Engineering, vol. 18, p. 101026, 2023.
  8. K. H. H. Aung, C. L. Kok, Y. Y. Koh, and T. H. Teo, “An embedded machine learning fault detection system for electric fan drive,” Electronics, vol. 13, no. 3, 2024.
  9. H. M. Alyahya, M. M. Ben Ismail, and A. Al-Salman, “Intelligent resnet-18 based approach for recognizing and assessing arabic children’s handwriting,” in 2023 International Conference on Smart Computing and Application (ICSCA), 2023, pp. 1–7.
  10. J. Li, P. Zheng, and L. Wang, “Remote sensing image change detection based on lightweight transformer and multi-scale feature fusion,” IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, pp. 1–14, 2025.
11. B. Bhavana and S. L. Sabat, “A low-complexity deep learning model for modulation classification and detection of radar signals,” IEEE Transactions on Aerospace and Electronic Systems, pp. 1–10, 2024.
  12. K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” 2015.
  13. W. Cheng, “Application and analysis of residual blocks in galaxy classification,” Applied and Computational Engineering, 2023.
  14. S. Y. B. Mohamed Akrem Benatia, Yacine Amara and A. Hocini, “Block pruning residual networks using multi-armed bandits,” Journal of Experimental & Theoretical Artificial Intelligence, vol. 0, no. 0, pp. 1–16, 2023.
  15. X. Hu, G. Sheng, D. Zhang, and L. Li, “A novel residual block: replace conv1x1 with conv3x3 and stack more convolutions,” PeerJ Computer Science, vol. 9, p. e1302, 2023.
  16. K.-L. Chen, C.-H. Lee, H. Garudadri, and B. D. Rao, “Resnests and densenests: Block-based dnn models with improved representation guarantees,” Advances in neural information processing systems, vol. 34, pp. 3413–3424, 2021.
  17. J. Naranjo-Alcazar, S. Perez-Castanos, I. Martin-Morato, P. Zuccarello, and M. Cobos, “On the performance of residual block design alternatives in convolutional neural networks for end-to-end audio classification,” 2019.
  18. B. Yao, X. Jiang, A. Khosla, A. L. Lin, L. Guibas, and L. Fei-Fei, “Human action recognition by learning bases of action attributes and parts,” in 2011 International Conference on Computer Vision, 2011, pp. 1331–1338.
  19. I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT Press, 2016.
  20. K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 770–778.
  21. Y. Bengio, “Practical recommendations for gradient-based training of deep architectures,” in Neural Networks: Tricks of the Trade. Springer, 2012, pp. 437–478.
  22. C. M. Bishop, Pattern Recognition and Machine Learning. Springer, 2006.
  23. D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
  24. J. T. Townsend, “Theoretical analysis of an alphabetic confusion matrix,” Perception & Psychophysics, vol. 9, pp. 40–50, 1971.
25. L. Lingxuan, Y. Zejun, M. Zhenwei, and Z. Xing, “Method for bearing fault quantitative diagnosis based on mtf and improved residual network,” Journal of Northeastern University (Natural Science), vol. 45, no. 5, pp. 697–706, 2024.
  26. Z. H. Hoo, J. Candlish, and D. Teare, “What is a roc curve?” Emergency Medicine Journal, vol. 34, no. 6, pp. 357–359, 2017.
  27. A. Longon, “Interpreting the residual stream of resnet18,” 2024.
  28. T. Bisla, R. Shukla, M. Dhawan, M. R. Islam, T. Koyasu, K. Teramoto, Y. Kataoka, and K. Horio, “Kid activity recognition: A comprehensive study of kid activity recognition with monitoring activity level using yolov8s algorithms,” in 2024 3rd International Conference on Artificial Intelligence for Internet of Things (AIIoT), 2024, pp. 1–6.
  29. S. Mohottala, P. Samarasinghe, D. Kasthurirathna, and C. Abhayaratne, “Graph neural network-based child activity recognition,” in 2022 IEEE International Conference on Industrial Technology (ICIT), 2022, pp. 1–8.
  30. M. Kohli, A. K. Kar, V. G. Prakash, and A. P. Prathosh, “Deep learning-based human action recognition framework to assess children on the risk of autism or developmental delays,” in Neural Information Processing, M. Tanveer, S. Agarwal, S. Ozawa, A. Ekbal, and A. Jatowt, Eds. Singapore: Springer Nature Singapore, 2023, pp. 459–470.
Figure 1. Structure of the ResNet-18 model.
Figure 2. Random images selected from 20 action categories.
Figure 3. Training and validation accuracy over epochs.
Figure 4. Precision, recall, and F1-score over epochs.
Figure 5. Confusion matrix.
Figure 6. Receiver operating characteristic curve.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
Copyright: This open access article is published under a Creative Commons CC BY 4.0 license, which permits free download, distribution, and reuse, provided that the author and preprint are cited in any reuse.