3. Related Works
Facial Expression Recognition (FER) has been widely studied, with various approaches proposed for both static and dynamic systems. Dynamic AFER systems, which analyze image sequences, can capture the temporal evolution of facial expressions. However, these systems are often computationally expensive, which limits their real-time applicability.
Perveen et al. [11] proposed a dynamic kernel-based approach that captures local spatio-temporal representations of facial movements using a universal Gaussian Mixture Model with Mean Interval Kernel (uGMM-MIK). This method preserves local similarities between frames while managing changes in the global context, demonstrating that probability-based kernels provide superior discriminative performance and matching kernels offer improved computational efficiency. Similarly, Zhang et al. [12] introduced a hybrid deep learning model combining spatial and temporal Convolutional Neural Networks (CNNs). Their approach integrates these features into a deep fusion network based on a Deep Belief Network (DBN), leading to improved recognition performance on video-based datasets by effectively capturing both spatial and temporal dynamics.
To capture facial dynamics, spatio-temporal features are essential. Aghamaleki and Ashkani Chenarlogh [16] proposed a multi-stream CNN architecture that combines handcrafted features like LBP and Sobel edge maps with CNNs. This method addresses the challenge of limited training data by integrating both handcrafted and learned features, enhancing the model’s ability to recognize dynamic facial expressions. Shahid et al. [17] further refined this approach by analyzing eleven sub-local facial regions. Their method uses contour and region shape harmonics to model facial variations, improving robustness against challenges such as alignment issues, illumination changes, and occlusions. This localized approach allows for more accurate capture of facial expression dynamics in real-world conditions.
Efficient feature selection plays a crucial role in improving recognition performance while reducing computational costs. Pham et al. [18] proposed a novel loss function to enhance CNN-based FER performance by minimizing intra-class variation and maximizing inter-class variation, resulting in more discriminative features. Vaijayanthi and Arunnehru [19] used dense Scale-Invariant Feature Transform (SIFT) descriptors to capture temporal facial dynamics. These descriptors, combined with machine learning algorithms, enhance the recognition of facial expressions under varying conditions, particularly in dynamic AFER systems that require robust feature extraction from video sequences.
Various classification methods, such as SVMs, have been employed for facial expression classification. Sen et al. [14] explored the use of Directed Acyclic Graph SVMs (DAGSVM) for multi-class emotion recognition. Their method efficiently handles multiple emotions, providing faster processing while preserving the discriminative power of SVMs. Kartheek et al. [20] introduced Windmill Graph-based Feature Descriptors (WGFD) for FER. This graph-based method captures both local and distant relationships between pixels and, combined with a multi-class SVM, outperforms traditional methods on benchmark FER datasets, demonstrating the effectiveness of graph-based feature descriptors in facial emotion recognition.
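To illustrate why a DAG arrangement of binary classifiers can speed up multi-class prediction, the following minimal sketch evaluates a decision DAG over pairwise classifiers: each evaluation eliminates one candidate class, so K classes need only K - 1 pairwise decisions instead of the K(K - 1)/2 used by one-vs-one voting. It is not the implementation of Sen et al. [14]; the function names, the emotion labels, and the one-dimensional nearest-prototype rule that stands in for a trained binary SVM are all assumptions made for illustration.

```python
# Hypothetical 1-D class prototypes; a real system would use trained binary SVMs.
PROTOTYPES = {"anger": 0.0, "happiness": 1.0, "sadness": 2.0, "surprise": 3.0}


def nearest_prototype_decider(class_a, class_b, sample):
    """Stand-in for a binary SVM: return whichever class prototype is closer."""
    dist_a = abs(sample - PROTOTYPES[class_a])
    dist_b = abs(sample - PROTOTYPES[class_b])
    return class_a if dist_a <= dist_b else class_b


def dag_predict(classes, pairwise_decide, sample):
    """Walk a decision DAG: compare the first and last remaining candidates
    and discard the loser until a single class is left."""
    candidates = list(classes)
    evaluations = 0
    while len(candidates) > 1:
        first, last = candidates[0], candidates[-1]
        winner = pairwise_decide(first, last, sample)
        if winner == first:
            candidates.pop()      # eliminate `last`
        else:
            candidates.pop(0)     # eliminate `first`
        evaluations += 1
    return candidates[0], evaluations


label, evaluations = dag_predict(list(PROTOTYPES), nearest_prototype_decider, 2.2)
print(label, evaluations)
```

With four classes, the walk performs three pairwise decisions, whereas one-vs-one voting would evaluate all six class pairs.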
Real-time FER systems require methods that can process facial expressions efficiently, as high computational costs often hinder their performance. Lopez-Gil and Garay-Vitoria [21] addressed this challenge by classifying individual photograms (frame-by-frame images) in video sequences. Their approach combines multiple classifiers to improve efficiency and accuracy, making it suitable for real-time applications. Similarly, Perveen et al. [11] focused on real-time FER by using dynamic kernels, which manage computational complexity while preserving the discriminative power of spatio-temporal features.
These related works highlight the importance of both temporal and spatial feature extraction, classification techniques, and computational efficiency in dynamic AFER systems. Our proposed approach builds on these foundations by using statistical spatio-temporal geometric features and feature selection techniques to strike a balance between classification accuracy and computational efficiency, making it well-suited for real-time applications.
3.1. Limitations & Challenges
The review of existing methods highlights several challenges and limitations associated with dynamic AFER systems. One of the most recurrent issues is the necessity of sequence normalization. In practice, the number of frames per image sequence may vary from one sample to another, while most classification techniques require feature vectors of identical size. To address this, many methods define a fixed number of frames and perform sequence normalization accordingly. Typically, this involves duplicating frames for shorter sequences and removing frames from longer ones. Although this operation mitigates the mismatch issue, it alters the original input data by adding or removing information. Alternatively, some authors chose to consider only the initial frame (neutral state) and the frame corresponding to the peak expression (apex), addressing the size constraint at the expense of sequence integrity.
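The sequence normalization described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure of any cited work: the function name `normalize_sequence` and the evenly spaced index rule are hypothetical, and frames are represented by string labels instead of images.

```python
def normalize_sequence(frames, target_len):
    """Resample a frame sequence to exactly `target_len` frames.

    Shorter sequences end up with duplicated frames; longer sequences
    lose frames. Either way, the original data is altered.
    """
    if not frames:
        raise ValueError("sequence must contain at least one frame")
    if target_len < 1:
        raise ValueError("target_len must be at least 1")
    n = len(frames)
    if target_len == 1:
        return [frames[0]]
    # Evenly spaced source indices from the first to the last frame
    # (integer arithmetic keeps the result deterministic).
    indices = [i * (n - 1) // (target_len - 1) for i in range(target_len)]
    return [frames[j] for j in indices]


# A 3-frame sequence stretched to 6 frames: frames are duplicated.
short = normalize_sequence(["f0", "f1", "f2"], 6)

# A 12-frame sequence reduced to 6 frames: half of the frames are discarded.
long = normalize_sequence([f"f{i}" for i in range(12)], 6)

print(short)
print(long)
```

Both outputs have the required fixed length, but the first contains repeated frames and the second omits six of the original twelve, which is the loss of sequence integrity discussed above.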
Unlike static systems, dynamic systems are expected to process image sequences rather than single images. To achieve a spatio-temporal representation, several methods extract feature vectors from each frame and then concatenate them into a single descriptor vector. Although this strategy preserves all frames, it often results in very large feature vectors, which can be problematic. Since AFER systems are typically expected to be computationally efficient and suitable for real-time applications, excessive feature vector size can significantly impact processing speed and memory requirements.
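The growth of a concatenated descriptor can be made concrete with a small calculation. The sketch below assumes, purely for illustration, 68 facial landmarks with two coordinates each per frame; these numbers and the helper names are hypothetical and are not taken from the reviewed methods.

```python
LANDMARKS_PER_FRAME = 68   # assumption: a common landmark count, not from the source
COORDS_PER_LANDMARK = 2    # assumption: (x, y) coordinates only


def extract_frame_features(frame_index):
    """Placeholder extractor: returns a zero vector of the per-frame size."""
    return [0.0] * (LANDMARKS_PER_FRAME * COORDS_PER_LANDMARK)


def concatenate_descriptor(num_frames):
    """Concatenate the per-frame feature vectors into one descriptor."""
    descriptor = []
    for frame_index in range(num_frames):
        descriptor.extend(extract_frame_features(frame_index))
    return descriptor


# Descriptor length grows linearly with the number of frames.
lengths = {n: len(concatenate_descriptor(n)) for n in (10, 30, 60)}
print(lengths)
```

Under these assumptions, a 60-frame sequence already yields a descriptor of more than eight thousand values, and sequences of different lengths produce descriptors of different sizes, which again calls for sequence normalization.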
In summary, the five key requirements for designing an effective dynamic AFER system are:
Preservation of sequence integrity, without adding or discarding information;
Construction of a spatio-temporal representation that combines spatial and temporal information;
Efficient processing and recognition suitable for real-time applications;
High accuracy or recognition rate compared to state-of-the-art methods;
Strong robustness when evaluated across different benchmark datasets.
Although this list is not exhaustive, it served as the foundation for the design of our proposed dynamic AFER approach.