Preprint
Article

This version is not peer-reviewed.

ASD Recognition through Weighted Integration of Landmark-Based Handcrafted and Pixel-Based Deep Learning Features

Submitted: 26 December 2025

Posted: 29 December 2025


Abstract
Autism Spectrum Disorder (ASD) is a neurodevelopmental condition that impairs communication skills, with individuals often experiencing mild to severe challenges that may require specialised care. While numerous researchers are developing automated ASD recognition systems, achieving high performance remains challenging due to the lack of effective features. In this study, we propose a novel dual-stream model that combines handcrafted facial-landmark features and pixel-level deep learning features to classify ASD and non-ASD faces. The system processes images through two distinct streams to capture complementary features. In the first stream, facial landmarks are extracted using MediaPipe, which initially captures 478 points, from which 137 symmetric landmarks are selected. The face is aligned by applying an in-plane rotation using the angle calculated from the outer eye corners (landmarks 33 and 263). Geometric features and 52 blendshape features are then fed into Dense layers (128 units) with dropout for regularisation. These features are merged and refined through additional Dense layers (128 and 64 units) to produce the final output of Stream-1. In the second stream, the RGB image is resized and normalised using the preprocessing function corresponding to the chosen backbone (e.g., ResNet50V2, DenseNet121, InceptionV3), and features are then extracted using a Convolutional Neural Network (CNN) enhanced with Squeeze-and-Excitation (SE) blocks. Global Average Pooling (GAP) reduces dimensionality, followed by a Dense layer (256 units with dropout) and a final Dense layer (64 units) to produce the features of Stream-2. The outputs of both streams are concatenated, and a softmax gate with weighted concatenation is applied to combine the features. A final Dense layer (128 units with dropout) refines the fused features before a softmax layer produces the probabilistic classification score. This hybrid approach, integrating landmark-based and RGB-based features, significantly enhances the model's ability to distinguish between ASD and non-ASD faces. On the Kaggle dataset, the model achieved an accuracy of 96.43%, with a precision of 97.10%, a recall of 95.71%, and an F1 score of 96.40%. On the YTUIA dataset, accuracy increased to 97.83%, with precision, recall, and F1 score each at 97.78%. These results surpass the highest previously reported performance of 95.00% on Kaggle and 95.90% on YTUIA. Future work will focus on further optimising the model's performance.
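
As a concrete illustration of the Stream-1 front end, the sketch below uses MediaPipe's FaceLandmarker task to obtain the 478 facial landmarks and the 52 blendshape scores mentioned above. The model file name, the input image path, and the (empty) symmetric-index list are placeholders; the paper's actual 137 symmetric indices are not listed in the abstract.

```python
import mediapipe as mp
from mediapipe.tasks import python as mp_tasks
from mediapipe.tasks.python import vision

# FaceLandmarker returns 478 landmarks per face and, when requested,
# 52 blendshape scores. "face_landmarker.task" is the downloadable
# MediaPipe model bundle (the path here is a placeholder).
options = vision.FaceLandmarkerOptions(
    base_options=mp_tasks.BaseOptions(model_asset_path="face_landmarker.task"),
    output_face_blendshapes=True,
    num_faces=1,
)
landmarker = vision.FaceLandmarker.create_from_options(options)

result = landmarker.detect(mp.Image.create_from_file("face.jpg"))
points = result.face_landmarks[0]                            # 478 normalised points
blendshapes = [c.score for c in result.face_blendshapes[0]]  # 52 scores

# Placeholder: the paper selects 137 symmetric landmarks; the exact index
# list is not given in the abstract, so it is left empty here.
SYMMETRIC_IDX: list[int] = []
selected = [(points[i].x, points[i].y) for i in SYMMETRIC_IDX]
```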
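The in-plane alignment step can be sketched as follows: the roll angle of the line joining landmarks 33 and 263 (the outer eye corners) is computed, and the image is rotated about its centre to level that line. This is a minimal OpenCV version of the step, not the authors' exact code.

```python
import cv2
import numpy as np

def align_face(image, landmarks):
    """Rotate `image` so the outer-eye-corner line is horizontal.

    `landmarks` holds (x, y) pixel coordinates indexed by the MediaPipe
    FaceMesh convention; 33 and 263 are the outer eye corners.
    """
    (x1, y1), (x2, y2) = landmarks[33], landmarks[263]
    angle = np.degrees(np.arctan2(y2 - y1, x2 - x1))  # in-plane (roll) angle
    h, w = image.shape[:2]
    rot = cv2.getRotationMatrix2D((w / 2.0, h / 2.0), angle, 1.0)
    return cv2.warpAffine(image, rot, (w, h), flags=cv2.INTER_LINEAR)
```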
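A minimal Keras sketch of the Stream-1 dense stack described above: the geometric and blendshape features each pass through Dense(128) with dropout, are merged, and are refined through Dense(128) and Dense(64). Dropout rates and activations are assumptions; the abstract specifies only the unit counts.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_stream1(geo_dim, blend_dim=52):
    geo_in = layers.Input(shape=(geo_dim,), name="geometric")
    blend_in = layers.Input(shape=(blend_dim,), name="blendshapes")
    g = layers.Dropout(0.3)(layers.Dense(128, activation="relu")(geo_in))
    b = layers.Dropout(0.3)(layers.Dense(128, activation="relu")(blend_in))
    x = layers.Concatenate()([g, b])               # merge the two feature sets
    x = layers.Dense(128, activation="relu")(x)
    out = layers.Dense(64, activation="relu")(x)   # Stream-1 embedding
    return tf.keras.Model([geo_in, blend_in], out, name="stream1")
```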
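For Stream-2, the abstract states that the RGB image is normalised with the preprocessing function matching the chosen backbone. In Keras this corresponds to the per-architecture preprocess_input functions, as sketched below; the 224x224 input size is an assumption (InceptionV3 is more commonly used at 299x299).

```python
import tensorflow as tf
from tensorflow.keras.applications import densenet, inception_v3, resnet_v2

# Backbone-specific normalisation (e.g., ResNet50V2 and InceptionV3 scale
# pixels to [-1, 1]; DenseNet121 applies ImageNet mean/std normalisation).
PREPROCESS = {
    "ResNet50V2": resnet_v2.preprocess_input,
    "DenseNet121": densenet.preprocess_input,
    "InceptionV3": inception_v3.preprocess_input,
}

def prepare_rgb(image, backbone="ResNet50V2", size=(224, 224)):
    x = tf.image.resize(tf.cast(image, tf.float32), size)
    return PREPROCESS[backbone](x)
```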
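The SE enhancement of the CNN features can be sketched as the standard squeeze-and-excitation block: global average pooling squeezes each channel to a scalar, two Dense layers produce per-channel gates, and the feature map is rescaled. The reduction ratio of 16 is the common default from the SE literature, not a value given in the abstract.

```python
from tensorflow.keras import layers

def se_block(x, ratio=16):
    """Squeeze-and-Excitation: recalibrate channels of a 4-D feature map."""
    channels = x.shape[-1]
    s = layers.GlobalAveragePooling2D()(x)                      # squeeze
    s = layers.Dense(channels // ratio, activation="relu")(s)   # bottleneck
    s = layers.Dense(channels, activation="sigmoid")(s)         # channel gates
    s = layers.Reshape((1, 1, channels))(s)
    return layers.Multiply()([x, s])                            # excite
```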
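Finally, one plausible reading of the softmax gate with weighted concatenation is sketched below: a Dense layer with softmax produces one weight per stream, each stream embedding is scaled by its weight, and the weighted embeddings are concatenated before the Dense(128)-plus-softmax head. The gating layout and dropout rate are interpretive assumptions, not the authors' confirmed design.

```python
from tensorflow.keras import layers

def gated_fusion(f1, f2, num_classes=2):
    """f1: Stream-1 (landmark) embedding; f2: Stream-2 (RGB) embedding."""
    both = layers.Concatenate()([f1, f2])
    gate = layers.Dense(2, activation="softmax")(both)  # two weights summing to 1
    w1 = layers.Lambda(lambda g: g[:, 0:1])(gate)
    w2 = layers.Lambda(lambda g: g[:, 1:2])(gate)
    fused = layers.Concatenate()([
        layers.Multiply()([f1, w1]),   # each weight broadcasts over features
        layers.Multiply()([f2, w2]),
    ])
    x = layers.Dropout(0.3)(layers.Dense(128, activation="relu")(fused))
    return layers.Dense(num_classes, activation="softmax")(x)
```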
Copyright: This open access article is published under a Creative Commons CC BY 4.0 license, which permits free download, distribution, and reuse, provided that the author and preprint are cited in any reuse.
