1. Introduction
In our daily lives, facial expressions play an important role. Facial expression is a form of non-verbal communication that helps humans convey their feelings or messages more effectively. Facial expressions were first explored by Ekman and Friesen in 1971 [1], who proposed the Facial Action Coding System (FACS) to distinguish facial expressions using action units (AUs). Facial expressions fall into two categories: micro-expressions and macro-expressions. A micro-expression is a form of reflexive behaviour in which an individual reveals their actual feelings through facial muscle movements. In contrast, a macro-expression is recognizable and easy to capture with the naked eye; it can represent an individual's genuine feeling or an acted emotion. Both micro- and macro-expressions support effective communication and help in understanding a person's current feelings. Common facial expressions include happiness, sadness, fear, anger, surprise, disgust, confusion and neutral [2].
Facial expression recognition (FER) is a cutting-edge technology in computer vision and affective computing that uses facial features extracted from images or videos to automatically identify and interpret human emotions. The FER methodology typically comprises three main stages: face detection, feature extraction, and facial expression classification. Although FER is widely studied, it still faces critical issues that affect its effectiveness and efficiency. For example, the facial expression datasets frequently used in FER studies suffer from several limitations, such as inconsistent class labelling, limited diversity, and insufficient scale.
In order to overcome the limitations of facial expression databases, generative models have been used in several studies. A generative model is a machine learning model that can produce new data similar to the data it was trained on. Generative models focus on learning the underlying distribution of the input data, in contrast to discriminative models, which emphasize distinguishing between categories or predicting labels. Once trained, a generative model can produce entirely new, synthetic data that shares features with the original dataset. Generative models can generate images, videos, audio, text, text-to-image and text-to-video outputs, image captions, 3D models and other data. There are many types of generative models, including the Generative Adversarial Network (GAN), the Variational Autoencoder (VAE), autoregressive models, diffusion models and others, and they are rapidly being employed across a variety of areas such as scientific research, healthcare, simulation, art and design, content creation, and natural language processing (NLP). In this work, generative models are deployed to generate new facial expression images. A LAUN-improved StarGAN for facial emotion recognition was proposed in [3]; StarGAN and the LAUN-improved StarGAN were utilized to create a series of higher-quality synthetic facial emotion images for every emotion. [4] employed a conditional generative adversarial network (CGAN) for unsupervised domain adaptation in facial emotion recognition. Furthermore, [5] proposed an unsupervised-learning micro-expression generative adversarial network (ULME-GAN) to generate micro-expression sequences; to improve accuracy, AU-matrix re-encoding (AUMR) was deployed, and a transfer learning approach was utilized to train the proposed generator network. A multi-sequence-based micro-expression (ME) generation approach for ME recognition was introduced by [6]. A facial expression recognition method using maximum-margin Gaussian Mixture Models (GMMs) was proposed by [7]. By capturing complex spatio-temporal relations between facial muscles, [8] suggested a method for recognizing facial expressions.
The performance and resilience of FER systems have been significantly improved by recent developments in deep learning architectures, especially convolutional neural networks (CNNs). To elevate FER performance, pre-trained CNN models have been proposed in several recent works. Pre-trained CNN models are deep learning models that have already been trained on a large dataset such as ImageNet. This eliminates the need to train from scratch, which is time-consuming and resource-intensive, making these models more efficient and often better performing. Pre-trained CNN models are commonly used in applications such as object detection, face recognition, image classification and medical imaging. Several popular pre-trained CNN models can be adopted in this work, including VGG, ResNet, AlexNet, SqueezeNet, GoogleNet, Inception and MobileNet. An Xception CNN paired with K-fold cross-validation was adopted by [9]. To identify students' moods from their facial expressions, [10] proposed a CNN model. Furthermore, [11] employed a two-dimensional (2D) CNN to identify facial emotions and evaluated the model on a self-collected facial emotion database containing five emotions. To enhance CNN performance, a unique Venturi architecture with six hidden layers and one output layer was proposed by [12]. Moreover, a CNN-based facial expression recognition algorithm was proposed by [13]. A CNN was also used in Naik and Mehta's Hand-over-Face Gesture-based Facial Emotion Recognition Method (HFG_FERM) [14]. A feature redundancy-reduced convolutional neural network (FRR-CNN) was proposed by [15] to recognize facial expressions and was evaluated on CK+ and JAFFE. To recognize facial expressions in the wild, [16] employed three CNN models: a light CNN, a dual-branch CNN and a pre-trained CNN. Using a residual network, [17] presented micro-expression identification. For facial expression recognition, [18] presented Faster Regions with Convolutional Neural Network Features (Faster R-CNN). In addition, [19] introduced an automated facial emotion classification system based on a CNN and features extracted with Speeded-Up Robust Features (SURF).
With enhanced feature extraction and preprocessing techniques, [20] presented a real-time facial emotion recognition system. Based on textural patterns and convolutional neural networks, [21] presented a facial emotion recognition approach. In addition, facial expression recognition using local learning with deep and handcrafted features was presented in [22]. Using facial expressions, [23] demonstrated an emotion identification system for drivers while driving, employing FERDERnet to recognize the drivers' facial expressions. Additionally, an Identity-Aware CNN (IACNN) was created by [24] for facial expression recognition. An attention network (FERAtt) was proposed by [25] for facial expression recognition, with two variants implemented: a model with attention and classification (FERAtt+Cls), and a model with attention, classification, and representation (FERAtt+Rep+Cls). Moreover, [26] used a 3D CNN and transfer learning to recognize facial micro-expressions, and [27] adopted the TLCNN model to recognize micro-expressions (MEs) with a small sample size. Furthermore, deep convolutional networks were proposed by [28] for facial emotion identification by applying normalization, Action Units (AUs) and a CNN. For FER, [29] presented an attention-mechanism-based CNN in which the micro-movements of the face are captured and the texture information of the image is obtained using LBP features. Based on local facial regions, [30] presented a compact and efficient facial emotion detection network, while [31] used an enhanced spatial-temporal learning network (ESTLNet) to conduct dynamic facial expression recognition.
This research introduces a novel approach to facial expression recognition that combines a generative model with pre-trained CNN models. A diffusion model is used to generate new AI-generated images, forming an AI-created database; CASME Ⅱ serves as the training data for the generative model. The AI-generated images are then fed into pre-trained CNN models for classification, and the performance of the pre-trained CNN models is compared.
The rest of the paper is organized as follows:
Section 2 describes the datasets, the generative model, and the other models used in this work.
Section 3 defines the experimental settings and discusses the models' performance.
Section 4 concludes the findings of this work, and
Section 5 suggests ideas for future work.
2. Methodology
2.1. Methodology Overview
In this research, an AI-created facial expression database is built for facial expression recognition. A generative AI model is used to create an AI-generated dataset, and pre-trained CNN models are utilized to assess it. CASME Ⅱ is the source dataset utilized in this work. First, CASME Ⅱ is used to train the diffusion model, which then produces a new AI-generated facial expression dataset. Two types of datasets are therefore used in this work: CASME Ⅱ and the proposed AI-generated dataset. Images from both datasets are pre-processed before being fed into the pre-trained CNN models for the classification task; the preprocessing approaches employed are resizing, rescaling and conversion to grayscale. The pre-processed images are used to train and test every pre-trained CNN model adopted in this work. The experiments are segmented into four settings: train and test on CASME Ⅱ; train and test on the proposed AI-generated dataset; train on CASME Ⅱ and test on the proposed AI-generated dataset; and train on the proposed AI-generated dataset and test on CASME Ⅱ. Inception V3, VGG-16 and ResNet 50 are the pre-trained CNN models proposed in this work; they carry out the classification task to distinguish the seven facial expression categories: happiness, sadness, disgust, repression, fear, surprise and others. Lastly, the performance of each model is obtained and compared. The proposed methodology's overview is given in Figure 1.
2.2. Datasets
2.2.1. CASME Ⅱ
The Chinese Academy of Sciences Micro-expression Ⅱ (CASME Ⅱ) is an enhanced spontaneous micro-expression database created by [32] in 2014 as an improved version of the CASME dataset developed by [33]. Seven types of facial expression are labelled in this database: happiness, sadness, disgust, repression, fear, surprise and others. The samples were collected in a controlled laboratory environment, and the participants' micro-expressions were captured with a high-speed camera set up facing directly at the participant's face. To elicit micro-expressions, every participant viewed short video clips during the sample-collection process. The micro-expression samples from every subject were recorded at 200 fps with a pixel size of 280×340. Out of around 3000 facial movements in the database, 247 micro-expressions labelled with action units and emotions were selected. In total, the dataset contains 17124 static images across the seven expression categories. The distribution of each facial expression's image count is recorded in Table 1, and sample images of the seven micro-expressions are shown in Figure 2.
2.2.2. Proposed AI-Generated Dataset
In this work, a self-proposed AI-generated facial expression dataset is introduced. The proposed dataset is created with a diffusion model trained on the original CASME Ⅱ images obtained from the authors. The CASME Ⅱ images are resized from 280×340 to 48×48 pixels and converted to grayscale before being fit to the diffusion model for training. The diffusion model is then fine-tuned to match the specification of the pre-processed images and is fully trained before being used to generate new facial expression images. The process of generating the new AI-generated facial expression dataset is shown in Figure 3. The trained model is used to generate new facial expression images covering the seven expression categories: happiness, sadness, surprise, disgust, fear, repression and others. The generated images are grayscale with a pixel size of 48×48, and the dataset consists of 15464 images. The distribution of each facial expression's image count is recorded in Table 2, and sample images of the seven micro-expressions are shown in Figure 4.
2.3. Preprocessing
Raw data often contains unnecessary and noisy information, such as background variation; hence, preprocessing is introduced. Often known as data cleaning and data organization, preprocessing is a series of steps that prepare raw data before it is fed into a machine learning or deep learning model. It provides cleaner, more precise data and helps the models learn from the data better.
In this work, two sets of preprocessing are carried out: one for the diffusion model and one for the pre-trained CNN models. In the preprocessing for the diffusion model, the images obtained from CASME Ⅱ are resized from 280×340 to 48×48 pixels and converted from RGB to grayscale. These steps were taken because of the resource constraints of training the diffusion model.
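As an illustration of these two steps, grayscale conversion and resizing can be sketched with NumPy alone. The helper names below are our own, and nearest-neighbour resampling is an illustrative choice; the paper does not specify the interpolation method used:

```python
import numpy as np

def to_grayscale(rgb):
    """Convert an H x W x 3 RGB image to grayscale using ITU-R BT.601 luminosity weights."""
    return rgb @ np.array([0.299, 0.587, 0.114])

def resize_nearest(img, out_h, out_w):
    """Nearest-neighbour resize of a 2-D image to (out_h, out_w)."""
    rows = np.arange(out_h) * img.shape[0] // out_h
    cols = np.arange(out_w) * img.shape[1] // out_w
    return img[rows][:, cols]

# A CASME II frame is 280x340 RGB; the diffusion model expects 48x48 grayscale.
frame = np.random.randint(0, 256, size=(280, 340, 3)).astype(np.float64)
small = resize_nearest(to_grayscale(frame), 48, 48)
```

In practice a library resizer (e.g., with area or bilinear interpolation) would give smoother 48×48 images; the sketch only shows the shape of the transformation.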
In the preprocessing for the pre-trained CNN models, the images from both CASME Ⅱ and the proposed AI-generated dataset are first resized from 48×48 to 160×160 pixels. Data augmentation is then utilized to artificially increase the number of images and help the architectures learn the features better: images are rotated by up to 15 degrees and horizontally flipped; width and height shifts are applied with a range of 0.1; and brightness is adjusted within the range 0.8 to 1.2. After these adjustments, the fill mode is set to nearest and pixel values are rescaled by 1/255. Finally, each dataset is split into 80% for training and 20% for testing.
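A minimal NumPy sketch of part of this pipeline follows, covering the horizontal flip, the brightness jitter, the 1/255 rescaling and the 80/20 split. The rotation and shift augmentations are omitted for brevity, and the function names are our own rather than from the original implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(img, rng):
    """Random horizontal flip, brightness jitter in [0.8, 1.2], then rescale to [0, 1]."""
    if rng.random() < 0.5:
        img = img[:, ::-1]                  # horizontal flip
    img = img * rng.uniform(0.8, 1.2)       # brightness adjustment
    return np.clip(img, 0, 255) / 255.0     # rescale by 1/255

def train_test_split(images, train_frac, rng):
    """Shuffle an image stack and split it into train/test subsets (e.g., 80/20)."""
    idx = rng.permutation(len(images))
    cut = int(len(images) * train_frac)
    return images[idx[:cut]], images[idx[cut:]]

images = rng.integers(0, 256, size=(100, 160, 160)).astype(np.float64)
train, test = train_test_split(images, 0.8, rng)
sample = augment(train[0], rng)
```

In a Keras-style pipeline, the same settings would typically be expressed as generator parameters (rotation range, shift ranges, brightness range, fill mode, rescale) rather than hand-written loops.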
2.4. Diffusion Model
The diffusion model is one of the most popular generative models and is mainly used for image generation, though it is also applied to denoising and general data generation. Unlike other generative models, it transforms data gradually over many small steps. A diffusion model involves two different processes: a forward process and a reverse process. During the forward process, the model takes an image and slowly adds noise to it over a series of steps; the noise increases at every step until the image becomes nearly indistinguishable from random noise. This simulates a Markov chain in which the data gets gradually corrupted. The reverse process begins once the noise has been added. It is also called denoising because it attempts to eliminate the added noise and reconstruct the image. A neural network is employed for this purpose: it is trained to recognize the original image from the noisy input and to restore the image to its original state.
2.4.1. Forward Process (Diffusion Process)
The forward process is also known as the diffusion process. In the forward process, the model progressively destroys the original input image by adding Gaussian noise to it; noise keeps being added until the image becomes indistinguishable from pure noise. Figure 5 illustrates the forward process.
In theory, a clean data sample x_0 is corrupted step by step with the introduction of small amounts of Gaussian noise over many time steps t = 1, …, T. This can be described as a Markov chain, in which the current noisy sample x_t relies solely on the prior sample x_{t-1}. It is mathematically expressed as:

q(x_t | x_{t-1}) = N(x_t; sqrt(1 − β_t) x_{t-1}, β_t I)

where β_t is a small variance term that controls the amount of noise introduced at each step. This process turns the structured data into random noise over a number of steps.

Instead of modelling the noise addition at every step, the whole forward process can be defined in a single closed-form formula that directly ties the noisy sample x_t to the original data x_0:

x_t = sqrt(ᾱ_t) x_0 + sqrt(1 − ᾱ_t) ε

where α_t = 1 − β_t, ᾱ_t = ∏_{s=1}^{t} α_s, and ε ~ N(0, I) is a random noise vector. By using this sampling equation, a noisy version of the input can be obtained at any timestep t. As t increases, the term sqrt(ᾱ_t) declines, indicating that the original data contributes less and the sample becomes dominated by noise.
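The closed-form forward step can be sketched in a few lines of NumPy. The linear β schedule (from 1e-4 to 0.02 over T = 1000 steps) is a common illustrative choice, not a setting taken from this paper:

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)      # variance schedule beta_t
alphas = 1.0 - betas                    # alpha_t = 1 - beta_t
alpha_bars = np.cumprod(alphas)         # alpha_bar_t = product of alpha_s up to t

def q_sample(x0, t, rng):
    """Closed-form forward diffusion:
    x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps, eps ~ N(0, I)."""
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps

rng = np.random.default_rng(0)
x0 = rng.standard_normal((48, 48))      # stand-in for a 48x48 grayscale image
x_noisy = q_sample(x0, t=T - 1, rng=rng)  # near t = T the sample is almost pure noise
```

Because ᾱ_t shrinks toward zero as t grows, the signal coefficient sqrt(ᾱ_t) vanishes and x_t approaches a standard Gaussian, matching the description above.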
2.4.2. Reverse Process (Denoising Process)
The reverse process is also known as the denoising process. In the reverse process, the model learns to remove noise from the image and restore it to its original form. Once trained, the model gains the ability to produce new images by applying this step-by-step reverse diffusion method. The reverse process is presented in Figure 6.
In theory, the reverse process starts with random noise, and the reverse step from x_t to x_{t-1} is given by a Gaussian:

p_θ(x_{t-1} | x_t) = N(x_{t-1}; μ_θ(x_t, t), Σ_θ(x_t, t))

where a neural network parameterized by θ predicts both the mean μ_θ(x_t, t) and the variance Σ_θ(x_t, t). Since the actual reverse distribution is unknown, the network is trained to estimate it.

During training, the model is trained to predict the noise ε that was inserted in the forward process, using a simple objective function:

L_simple = E_{x_0, ε, t} [ ‖ε − ε_θ(x_t, t)‖² ]

so the network attempts to identify the noise component of a noisy sample x_t. Once trained, the process is reversed using the formula:

x_{t-1} = (1 / sqrt(α_t)) ( x_t − ((1 − α_t) / sqrt(1 − ᾱ_t)) ε_θ(x_t, t) ) + σ_t z

where z ~ N(0, I) is a small random noise sample and σ_t is a variance term (commonly σ_t² = β_t). By repeating this process from t = T down to t = 1, noise is progressively eliminated, producing a realistic and high-quality sample.
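One reverse (denoising) step can be sketched in NumPy as follows. Since no trained noise-prediction network is available here, a zero "predicted noise" stand-in replaces ε_θ, so the loop demonstrates only the sampling mechanics, not real image generation; the schedule is the same illustrative linear one used for the forward sketch:

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def p_sample(x_t, t, eps_pred, rng):
    """One DDPM reverse step:
    x_{t-1} = (x_t - (1 - alpha_t)/sqrt(1 - alpha_bar_t) * eps_pred) / sqrt(alpha_t)
              + sigma_t * z, with sigma_t^2 = beta_t and z ~ N(0, I) for t > 0."""
    coef = (1.0 - alphas[t]) / np.sqrt(1.0 - alpha_bars[t])
    mean = (x_t - coef * eps_pred) / np.sqrt(alphas[t])
    if t == 0:
        return mean                      # no noise is added at the final step
    return mean + np.sqrt(betas[t]) * rng.standard_normal(x_t.shape)

rng = np.random.default_rng(0)
x = rng.standard_normal((48, 48))        # start from pure Gaussian noise
for t in reversed(range(T)):
    eps_pred = np.zeros_like(x)          # stand-in for the trained network eps_theta(x_t, t)
    x = p_sample(x, t, eps_pred, rng)
```

With a real trained ε_θ, the same loop progressively removes noise and yields a synthetic 48×48 facial expression image.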
2.5. Convolutional Neural Network (CNN)
Object identification, facial recognition, image classification, medical imaging and other tasks are commonly performed using a deep learning model known as the convolutional neural network (CNN). It originated from the LeNet architecture, which was invented by [34]. A CNN is applied to analyse inputs such as pictures or numeric data and does not require much preprocessing; its design is inspired by the neuron networks of the human brain. The convolutional layers of a CNN collect image features; max pooling and average pooling are the common pooling layers employed; and fully connected layers are utilized to classify the extracted features. A CNN model generally comprises an input layer, convolution layers, pooling layers, fully connected layers, and an output layer.
2.5.1. Input Layer
The input layer is the first layer of a CNN model. It is responsible for receiving raw data, such as images, which is then passed to the convolution layers for feature extraction.
2.5.2. Convolution Layer
The convolutional layer is where most of the processing is done and is a crucial part of a CNN. It requires input data, a filter (kernel), and a feature map. The input data is processed with the convolution operation to extract important characteristics and capture spatial correlations. In the convolution procedure, the kernel slides over the input data, and every entry of the resulting feature map is the outcome of convolving the filter with the corresponding local region of the input.
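The sliding-window operation described above can be sketched directly (as is conventional in CNNs, the kernel is applied without flipping; the function name and toy sizes are our own):

```python
import numpy as np

def conv2d_valid(x, k):
    """'Valid' 2-D convolution as used in CNNs: slide kernel k over input x and
    take the elementwise product-sum at every position, producing the feature map."""
    kh, kw = k.shape
    oh, ow = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    out = np.empty((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * k)
    return out

x = np.arange(25, dtype=float).reshape(5, 5)   # toy 5x5 input
k = np.ones((3, 3)) / 9.0                      # 3x3 averaging filter
fmap = conv2d_valid(x, k)                      # 3x3 feature map
```

A 3×3 kernel over a 5×5 input yields a 3×3 feature map, illustrating how the spatial extent shrinks without padding.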
2.5.3. Activation Layer
An activation layer is usually applied after every convolution layer or fully connected layer; it is utilized in order to integrate non-linearity into the architecture. Activation functions come in a variety of forms, such as the Rectified Linear Unit (ReLU), Leaky ReLU, the Sigmoid function, the Hyperbolic Tangent (Tanh), and the SoftMax function, but the most commonly exploited activation function is ReLU, which is mathematically defined as:

f(x) = max(0, x)
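In code, the elementwise ReLU definition above amounts to a single NumPy call:

```python
import numpy as np

def relu(x):
    """Rectified Linear Unit: f(x) = max(0, x), applied elementwise."""
    return np.maximum(0, x)

out = relu(np.array([-2.0, -0.5, 0.0, 3.0]))  # [0. 0. 0. 3.]
```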
2.5.4. Pooling Layer
The pooling layer is typically included in CNNs used for deep learning tasks. Its objective is to retain the most important features while shrinking the spatial dimensions of the input tensor. Average pooling and max pooling are the two different pooling layers; both concepts are illustrated in Figure 7.
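Both pooling variants can be sketched with a single NumPy reshape over non-overlapping windows (the helper name and the 2×2 window size are illustrative choices):

```python
import numpy as np

def pool2d(x, size=2, mode="max"):
    """Non-overlapping pooling: split x into size x size windows and keep
    either the maximum (max pooling) or the mean (average pooling) of each window."""
    h, w = x.shape[0] // size, x.shape[1] // size
    windows = x[:h * size, :w * size].reshape(h, size, w, size)
    return windows.max(axis=(1, 3)) if mode == "max" else windows.mean(axis=(1, 3))

x = np.array([[1., 3., 2., 4.],
              [5., 7., 6., 8.],
              [9., 2., 1., 0.],
              [3., 4., 5., 6.]])
mx = pool2d(x, 2, "max")    # [[7., 8.], [9., 6.]]
avg = pool2d(x, 2, "avg")   # [[4., 5.], [4.5, 3.]]
```

Either way, the 4×4 input is reduced to a 2×2 output, halving each spatial dimension while summarizing every window.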
2.5.5. Fully Connected Layer (FC Layer)
A fully connected (FC) layer is occasionally referred to as a dense layer. It usually appears at the end of a neural network. Each node in the FC layer is directly connected to every node in the output of the preceding (e.g., pooling) layer. The FC layer's output is then passed to the corresponding layer for image classification.
2.5.6. Output Layer
The output layer is the final layer of a CNN and mainly handles the prediction and classification tasks. Several types of activation functions are applied in the output layer, namely the Sigmoid function, the SoftMax function and the Linear function. The Sigmoid function is adopted in binary classification, where there are exactly two classes; the SoftMax function is employed for multi-class classification, where there are two or more classes; and the Linear function is suitable for regression tasks, where it is used to predict continuous values such as stock market prices. The output layer is mathematically defined as:

y = f(Wx + b)

where f represents the activation function (e.g., SoftMax, Sigmoid, Linear), W represents the weights learned during the training process, x is the input vector (the features), and b is the bias term.
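As a concrete sketch of this output layer with a SoftMax activation (the dimensions, 16 input features and 7 expression classes, are illustrative choices matching the seven expression categories used in this work):

```python
import numpy as np

def softmax(z):
    """Numerically stable SoftMax: exponentiate shifted logits and normalize."""
    e = np.exp(z - z.max())
    return e / e.sum()

def output_layer(x, W, b):
    """Output layer y = f(Wx + b) with f = SoftMax for multi-class classification."""
    return softmax(W @ x + b)

rng = np.random.default_rng(0)
W = rng.standard_normal((7, 16))   # 7 expression classes, 16 input features
b = np.zeros(7)
x = rng.standard_normal(16)
y = output_layer(x, W, b)          # class probabilities summing to 1
```

The predicted expression is then simply the index of the largest probability, e.g. `np.argmax(y)`.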
2.6. Transfer Learning
In CNNs, there is a methodology known as transfer learning. It enables an architecture trained on a larger dataset to be applied to a new but similar task. Training a CNN model from scratch often requires a significant amount of time, computational power and labelled data. Transfer learning eliminates the need to train a CNN model from scratch by leveraging the knowledge gained from a previously trained architecture that has been trained on a sizable dataset (e.g., ImageNet, which contains over 1000 object classes). The previously trained model is then applied to a new target domain. Popular pre-trained CNN models include AlexNet, SqueezeNet, GoogleNet, ResNet, VGG, Inception, MobileNet, EfficientNet and DenseNet. The concept of transfer learning is illustrated in Figure 8. The knowledge gained from Domain A is saved and utilized in Domain B: in a conventional CNN, the data samples enter at the input layer and pass through the convolution layers to train Network A. Once training is complete in Network A, the knowledge is transferred to Network B, and a different set of inputs is used to further train the convolution layers of Network B. The learned representations then pass through Network B's fully connected layers for further training and classification into the different categories, and the result is displayed at the output layer.
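A common way to apply transfer learning is to freeze part of the pre-trained network so that only the remaining layers are updated on the new domain, as in the 30% freezing-layer setting reported later in this work. The small framework-free sketch below uses a hypothetical `Layer` class and `freeze_fraction` helper of our own; in Keras the equivalent would be setting `layer.trainable = False` on the chosen layers:

```python
class Layer:
    """Minimal stand-in for a network layer with a trainable flag."""
    def __init__(self, name):
        self.name = name
        self.trainable = True

def freeze_fraction(layers, fraction):
    """Freeze the first `fraction` of layers so that only the remaining
    (task-specific) layers are updated when fine-tuning on the new domain."""
    cut = int(len(layers) * fraction)
    for layer in layers[:cut]:
        layer.trainable = False
    return cut

network = [Layer(f"conv_{i}") for i in range(50)]   # e.g. the 50 layers of ResNet 50
frozen = freeze_fraction(network, 0.30)             # freeze the first 30% (15 layers)
```

Freezing the early layers preserves the generic low-level features learned on the large source dataset while letting the later layers specialize to the target expressions.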
2.7. Pre-Trained CNN Models
Pre-trained CNN models are deep learning architectures that have been trained on sizable and diverse datasets. They adopt transfer learning, easing the need to train from scratch and building on the knowledge of a previously trained model in order to perform a new task. Pre-trained CNN models save considerable resources, including time, computational power and data, and are able to provide state-of-the-art performance. A variety of popular pre-trained CNN models can be deployed for FER, namely AlexNet, SqueezeNet, VGG-16, VGG-19, ResNet, Inception, MobileNet and EfficientNet. In this study, the pre-trained CNN models chosen are ResNet 50, VGG-16 and Inception V3.
2.7.1. ResNet 50
ResNet-50, a deep convolutional neural network architecture with 50 layers, is extensively utilized for image recognition applications. Microsoft Research debuted it in 2015 as part of the ResNet family, which won the ImageNet competition that year [35]. One of ResNet-50's standout features is its use of "skip connections", which promote gradient flow during backpropagation and mitigate the vanishing gradient problem that can occur in very deep networks. With 49 convolutional layers and a final fully connected layer, the architecture is composed of a series of bottleneck blocks, each with three layers: a 1×1 convolution for dimensionality reduction, a 3×3 convolution, and another 1×1 convolution for restoring dimensions. Because of its deep yet computationally efficient design, ResNet-50 enables high-accuracy image classification and transfer learning applications. In Figure 9, the architecture of ResNet 50 is shown.
2.7.2. VGG-16
VGG-16 [36], a well-known deep convolutional neural network architecture, was introduced at the 2014 ImageNet Large Scale Visual Recognition Challenge (ILSVRC) and was developed by the Visual Geometry Group at the University of Oxford. It has 16 weight layers, comprising 13 convolutional and 3 fully connected layers, plus a SoftMax layer for classification. The primary idea of VGG-16 is the repeated application of tiny 3×3 convolution filters, which enables the network to capture intricate characteristics while keeping the number of parameters under control. The convolutional layers are grouped into five blocks, each followed by a max-pooling layer to reduce spatial dimensions. VGG-16 is popular for its clear and consistent architecture, which makes it powerful and simple to use for image classification, feature extraction, and transfer learning problems. The architecture of VGG-16 is presented in Figure 10.
2.7.3. Inception V3
Inception V3 [37] is a deep CNN architecture developed by Google as part of the Inception (GoogLeNet) series; it is the third version of the Inception model and was introduced in 2015. The goal of Inception V3 was to carry out large-scale image recognition tasks both more accurately and faster. Inception V3 begins with an input size of 299×299×3 and passes through a few convolution and pooling layers. It then goes through several kinds of Inception modules, namely Inception modules A, B and C. The network identifies multi-scale spatial patterns by using parallel convolutional paths with various filter sizes, such as 1×1, 3×3 and 5×5, together with pooling layers. Large convolutions are factorized into smaller convolutions to enhance efficiency; for example, a 5×5 layer is replaced with two 3×3 layers. The feature maps are down-sampled from 35×35 to 17×17 and subsequently to 8×8 using the Reduction-A and Reduction-B modules, which increases depth while keeping the important information. Batch normalization and auxiliary classifiers are utilized across the network to improve training stability and regularization. The final stage consists of global average pooling, a dropout layer for overfitting prevention, a fully connected layer, and SoftMax activation for class probabilities. In Figure 11, a summary of the Inception V3 architecture is illustrated.
Figure 1. The proposed methodology's overview.
Figure 2. Sample from the CASME Ⅱ database.
Figure 3. Process of generating the new dataset.
Figure 4. Sample from the proposed AI-generated facial expression dataset.
Figure 5. Forward process.
Figure 6. Reverse process.
Figure 7. Pooling layers.
Figure 8. Concept of transfer learning.
Figure 9. Architecture of ResNet 50.
Figure 10. Architecture of VGG-16.
Figure 11. Overview of the Inception V3 architecture.
Table 1. Distribution of the number of images in the CASME Ⅱ database.
| Type of Expression | Number of Images |
| Happiness | 2360 |
| Sadness | 150 |
| Surprise | 1729 |
| Disgust | 4204 |
| Fear | 127 |
| Repression | 2187 |
| Others | 6367 |
| Total | 17124 |
Table 2. Distribution of the number of images in the proposed AI-generated facial expression dataset.
| Type of Expression | Number of Images |
| Happiness | 2780 |
| Sadness | 1632 |
| Surprise | 1845 |
| Disgust | 3120 |
| Fear | 1760 |
| Repression | 2187 |
| Others | 2140 |
| Total | 15464 |
Table 3. Performance of every pre-trained CNN model on CASME Ⅱ.
| Models | Accuracy (%) |
| | 1 | 2 | 3 | 4 | 5 | Average |
| VGG-16 | 99.13 | 99.35 | 99.30 | 99.42 | 99.46 | 99.33 |
| ResNet 50 | 85.62 | 86.49 | 90.17 | 89.42 | 89.98 | 88.34 |
| Inception V3 | 98.16 | 99.32 | 99.45 | 99.23 | 99.39 | 99.11 |
Table 7. Performance of every pre-trained CNN model while trained and tested on CASME Ⅱ (Data Augmentation).
| Models | Accuracy (%) |
| | 1 | 2 | 3 | 4 | 5 | Average |
| VGG-16 | 92.47 | 98.04 | 98.42 | 98.56 | 98.66 | 97.23 |
| ResNet 50 | 99.07 | 99.18 | 99.39 | 99.50 | 98.87 | 99.20 |
| Inception V3 | 94.69 | 99.19 | 99.60 | 99.55 | 99.52 | 98.51 |
Table 8. Performance of every pre-trained CNN model while trained and tested on the proposed AI-generated dataset (Data Augmentation).
| Models | Accuracy (%) |
| | 1 | 2 | 3 | 4 | 5 | Average |
| VGG-16 | 92.51 | 95.46 | 96.29 | 97.52 | 96.88 | 95.73 |
| ResNet 50 | 98.77 | 98.90 | 99.13 | 99.22 | 99.15 | 99.03 |
| Inception V3 | 95.61 | 99.00 | 99.27 | 99.33 | 99.47 | 98.54 |
Table 9. Performance of every pre-trained CNN model while trained on CASME Ⅱ and tested on the proposed AI-generated dataset (Data Augmentation).
| Models | Accuracy (%) |
| | 1 | 2 | 3 | 4 | 5 | Average |
| VGG-16 | 98.32 | 97.82 | 98.86 | 98.16 | 98.94 | 98.42 |
| ResNet 50 | 99.63 | 99.51 | 99.40 | 99.59 | 99.58 | 99.54 |
| Inception V3 | 99.35 | 99.42 | 99.44 | 99.55 | 99.67 | 99.49 |
Table 10. Performance of every pre-trained CNN model while trained on the proposed AI-generated dataset and tested on CASME Ⅱ (Data Augmentation).
| Models | Accuracy (%) |
| | 1 | 2 | 3 | 4 | 5 | Average |
| VGG-16 | 97.37 | 98.39 | 98.53 | 98.48 | 99.12 | 98.38 |
| ResNet 50 | 98.12 | 99.22 | 99.30 | 99.35 | 99.35 | 99.07 |
| Inception V3 | 99.36 | 97.79 | 99.44 | 99.44 | 99.50 | 99.11 |
Table 11. Performance of every pre-trained CNN model while trained and tested on CASME Ⅱ (Freezing Layer 30%).
| Models | Accuracy (%) |
| | 1 | 2 | 3 | 4 | 5 | Average |
| VGG-16 | 98.65 | 95.78 | 92.04 | 97.57 | 98.54 | 96.52 |
| ResNet 50 | 86.76 | 90.65 | 93.92 | 96.65 | 98.05 | 93.21 |
| Inception V3 | 98.38 | 99.40 | 98.20 | 99.40 | 99.56 | 98.99 |
Table 12. Performance of every pre-trained CNN model while trained and tested on the proposed AI-generated dataset (Freezing Layer 30%).
| Models | Accuracy (%) |
| | 1 | 2 | 3 | 4 | 5 | Average |
| VGG-16 | 92.65 | 95.69 | 99.02 | 99.36 | 99.34 | 97.21 |
| ResNet 50 | 96.44 | 97.56 | 97.72 | 97.88 | 97.90 | 97.50 |
| Inception V3 | 98.87 | 99.46 | 99.50 | 99.57 | 98.31 | 99.14 |
Table 13. Performance of every pre-trained CNN model while trained on CASME Ⅱ and tested on the proposed AI-generated dataset (Freezing Layer 30%).
| Models | Accuracy (%) |
| | 1 | 2 | 3 | 4 | 5 | Average |
| VGG-16 | 99.15 | 99.29 | 99.07 | 99.52 | 99.59 | 99.32 |
| ResNet 50 | 98.06 | 99.39 | 99.47 | 99.52 | 99.55 | 99.20 |
| Inception V3 | 99.30 | 99.41 | 99.47 | 99.50 | 99.53 | 99.44 |
Table 14. Performance of every pre-trained CNN model while trained on the proposed AI-generated dataset and tested on CASME Ⅱ (Freezing Layer 30%).
| Models | Accuracy (%) |
| | 1 | 2 | 3 | 4 | 5 | Average |
| VGG-16 | 98.57 | 93.67 | 98.24 | 93.85 | 98.26 | 96.52 |
| ResNet 50 | 97.90 | 99.29 | 97.71 | 99.20 | 99.27 | 98.67 |
| Inception V3 | 99.51 | 99.52 | 99.50 | 99.55 | 99.57 | 99.53 |
Table 15. Performance assessment of the proposed works against other state-of-the-art methods.
| Model | Dataset | Accuracy |
| 3D-CNN [26] | CASME Ⅱ | 97.60% |
| TLCNN [27] | CASME Ⅱ | 69.10% |
| SVM with Knowledge Distillation [38] | CASME Ⅱ | 72.60% |
| Pre-trained ResNet 50 [17] | CASME Ⅱ | 98.41% |
| Modified VGG-Net with CGAN [39] | Trained on AI-generated Oulu and tested on AI-generated CK+ | 90.37% |
| VGG-16 with LAUN improved StarGAN [3] | AI-generated MMI | 98.30% |
| Proposed Pre-trained VGG-16 | CASME Ⅱ | 99.33% |
| Proposed Pre-trained ResNet 50 (with Data Augmentation) | Trained on CASME Ⅱ and tested on the proposed AI-generated dataset | 99.54% |
| Proposed Pre-trained Inception V3 (with 30% Freezing Layers) | Trained on the proposed AI-generated dataset and tested on CASME Ⅱ | 99.53% |