Classical machine learning systems require manual feature extraction; the extracted features are then fed to a classifier for training. This process is workable for simpler problems. For more complex problems such as emotion recognition, however, traditional ML techniques may fail or require substantial additional effort. In contrast, deep learning models extract features automatically, which allows them to excel on such tasks.
Figure 3 shows the difference between a traditional machine learning pipeline and a deep learning approach. Deep learning is built on artificial neural networks (ANNs). A neural network is organized in layers of neurons, with connections formed between the neurons of adjacent layers. Although conventional deep networks can extract features automatically, a CNN additionally reduces the number of parameters in the network, so the model can be trained with less computation time [39].
A convolutional layer allows images of any size or complexity to be processed with relatively few parameters. A CNN also contains other layer types, such as pooling layers for dimensionality reduction, fully connected layers, and batch normalization layers for faster training and convergence.
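As a rough illustration of why a convolutional layer needs few parameters, the sketch below compares the parameter count of a fully connected layer with that of a convolutional layer. The 48 × 48 grayscale input, the 64 output units, and the 3 × 3 kernel are hypothetical values chosen for illustration, not settings taken from this work.

```python
def dense_params(n_in: int, n_out: int) -> int:
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out


def conv2d_params(kernel: int, c_in: int, c_out: int) -> int:
    """Weights plus biases of a kernel x kernel convolution.

    The count does not depend on the image height or width.
    """
    return kernel * kernel * c_in * c_out + c_out


# Hypothetical 48 x 48 single-channel input (assumption for illustration).
height = width = 48
print(dense_params(height * width * 1, 64))  # 147520
print(conv2d_params(3, 1, 64))               # 640
```

Because the convolution kernel is shared across all spatial positions, its parameter count stays at 640 even if the input image grows, whereas the fully connected count grows with the number of pixels.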
Figure 4 shows the architecture of fully connected neural networks and CNN layers.
Convolutional neural networks combine input, convolutional, pooling, and fully connected (FC) layers. The generalized architecture of CNN models is shown in Figure 5. The final classification layer, placed after the FC layers, predicts one of seven emotions: surprise, sad, fear, joy, anger, neutral, and disgust.
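The final classification step can be sketched as follows. The text does not state which activation the classification layer uses; a softmax over the seven class scores is assumed here, and the function names are hypothetical.

```python
import math

EMOTIONS = ["surprise", "sad", "fear", "joy", "anger", "neutral", "disgust"]


def softmax(logits):
    """Convert raw class scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_emotion(logits):
    """Return the most probable emotion label and its probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return EMOTIONS[best], probs[best]


# Hypothetical scores produced by the last FC layer for one face image.
label, confidence = predict_emotion([0.2, -1.0, 0.1, 3.5, 0.0, 1.2, -0.5])
print(label)  # joy
```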
3.2.1. Neural Architecture Search Network
The Neural Architecture Search Network (NASNet) is a model developed by the Google ML team in 2017 while working on new ways to build ConvNets based on Neural Architecture Search (NAS) [40].
The Neural Architecture Search method finds the best architectures using a gradient-based approach. Zoph and Le [41] noted that the connections and configuration of a neural network can be specified by a variable-length string. Such a string can therefore be generated by a recurrent network that acts as a "controller", with each generated string describing a "child network".
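To make the idea of describing a network by a variable-length string concrete, the sketch below decodes a flat token sequence into a list of layer configurations. The three fields per layer (filter height, filter width, number of filters) are a simplification assumed for illustration, and the names `ConvLayer` and `decode` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ConvLayer:
    filter_height: int
    filter_width: int
    num_filters: int


FIELDS_PER_LAYER = 3  # assumed encoding: three tokens describe one layer


def decode(tokens):
    """Turn a variable-length token sequence into layer configurations."""
    if len(tokens) % FIELDS_PER_LAYER != 0:
        raise ValueError("token sequence does not describe whole layers")
    return [
        ConvLayer(*tokens[i:i + FIELDS_PER_LAYER])
        for i in range(0, len(tokens), FIELDS_PER_LAYER)
    ]


# A longer sequence simply describes a deeper child network.
print(decode([3, 3, 32, 5, 5, 64]))
```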
The child networks are then trained on real data, and their accuracy on the validation set is measured. Using this accuracy as a reward signal, policy gradients are computed to update the controller, as shown in Figure 6. Over subsequent iterations, the controller learns to give higher probability to architectures that achieve high accuracy, so the sampled child networks become increasingly accurate. Using NAS, Zoph and Le [41] obtained an efficient ConvNet model that performs better than most manually designed architectures. Our proposed model was tested and achieved a test error rate of 1.65 while being faster than the existing models.
3.2.2. Composition of Proposed Network
NASNet is a CNN architecture built using the scalable NAS method described above; the Google ML team's technique is based on reinforcement learning. A parent network, a recurrent neural network (RNN) called the "controller", monitors the performance of the child CNN, called the "child network", and adjusts how the child network is generated. These adjustments affect the number of layers, the weights, and other settings in order to improve the efficiency of the child network, as shown in Figure 7. The operation blocks from which the RNN controller assembles the child network are listed in Table 3.
Using all of the above operation blocks, the RNN creates the NASNet architecture. The architecture is trained with different image sizes to produce two variants: NASNetLarge and NASNetMobile. NASNetMobile has 5,326,716 parameters, while NASNetLarge has 88,949,818 parameters; NASNetMobile is therefore considerably more lightweight than NASNetLarge. The smallest unit of each model is the block. A cell is a combination of blocks, formed from different operation blocks such as those listed above, and many cells make up the overall architecture.
The RNN controller optimizes the cells, which are built from a fixed set of blocks, for a specific dataset. Each block is an operational module, and the operations a block can perform include, inter alia, max pooling, convolutions, average pooling, identity mapping, and separable convolutions.
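A block can be pictured as a small module that applies one operation to each of two inputs and merges the results. The sketch below is a deliberately simplified, one-dimensional illustration: feature maps are plain Python lists, only three of the listed operations are included, the pooling window of 3 is an assumption, and the two results are merged by element-wise addition. All function names are hypothetical.

```python
def identity(x):
    return list(x)


def _windows(x, size=3):
    """Sliding windows centred on each position; edges use a shorter window."""
    radius = size // 2
    return [x[max(0, i - radius): i + radius + 1] for i in range(len(x))]


def max_pool(x):
    return [max(w) for w in _windows(x)]


def avg_pool(x):
    return [sum(w) / len(w) for w in _windows(x)]


OPS = {"identity": identity, "max_pool": max_pool, "avg_pool": avg_pool}


def block(h0, h1, op0, op1):
    """Apply one operation per input and merge by element-wise addition."""
    a = OPS[op0](h0)
    b = OPS[op1](h1)
    return [u + v for u, v in zip(a, b)]


print(block([1, 2, 3], [4, 5, 6], "identity", "max_pool"))  # [6, 8, 9]
```

In the search setting, the controller would choose the two operation names (here `op0` and `op1`) for every block; the dictionary lookup stands in for that choice.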
Each block maps the current and previous hidden-state inputs (H1 and H0) to a single output feature map, as shown in Figure 8. The proposed network combines them using element-wise addition, which is more sophisticated and performs better than vector-based addition. When a feature map is used as input, two types of convolutional cells are used, as follows:
Normal Cell: These convolutional cells return a feature map with the same dimensions as the input. For example, if the cell receives a block input with a feature map of size H × W and uses stride 1, the output has the same size as the input feature map.
Reduction Cell: These are also convolutional cells, but they return a feature map whose height and width are reduced by a factor of 2 (e.g., with stride 2, each dimension is halved) [42].
The taxonomy of the proposed model is shown in Figure 9.
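The effect of the two cell types on the spatial size of a feature map can be checked with a few lines of arithmetic. The sketch assumes "same" padding, so that the output size is the input size divided by the stride and rounded up; the function names are hypothetical.

```python
def cell_output_size(height: int, width: int, stride: int):
    """Spatial output size under 'same' padding: ceil(size / stride)."""
    return -(-height // stride), -(-width // stride)


def normal_cell_size(height: int, width: int):
    """A normal cell uses stride 1 and keeps H x W unchanged."""
    return cell_output_size(height, width, stride=1)


def reduction_cell_size(height: int, width: int):
    """A reduction cell uses stride 2 and halves both dimensions."""
    return cell_output_size(height, width, stride=2)


print(normal_cell_size(48, 48))     # (48, 48)
print(reduction_cell_size(48, 48))  # (24, 24)
```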
The scaling of the network is governed by several factors: the number of filters in the first layer (F), the number of cells to be stacked (N), and the cell structure.
The values of F and N are fixed during the search. Afterwards, N and the number of filters F in the first layer are adjusted to change the depth and width of the network. Once the search is complete, models of different sizes are constructed to suit the datasets. The cells are then connected to form the best possible proposed network. The convolutional networks differ in the structures of the normal and reduction cells, which are what the RNN controller searches for. In the search space, each cell receives two hidden states as input. An example of hidden states can be seen in Figure 10. Hidden states can also be produced by convolution and pooling operations. The best cells are selected for the proposed network according to the search results. This makes the search faster and makes the resulting cells more general.
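How N and F shape the final network can be sketched as a simple stacking plan. The sketch assumes that N normal cells are followed by one reduction cell, that this pattern repeats for a fixed number of stages, and that the filter count doubles at each reduction; the number of stages and the function name `build_plan` are hypothetical.

```python
def build_plan(num_cells_n: int, first_filters_f: int, num_stages: int = 3):
    """Return the cell sequence as (cell_type, num_filters) pairs."""
    plan = []
    filters = first_filters_f
    for stage in range(num_stages):
        # N normal cells keep the spatial size and the filter count.
        plan.extend([("normal", filters)] * num_cells_n)
        if stage < num_stages - 1:
            # Assumed: filters double whenever the spatial size is halved.
            filters *= 2
            plan.append(("reduction", filters))
    return plan


# Increasing N deepens the network; increasing F widens it.
for cell in build_plan(num_cells_n=2, first_filters_f=32, num_stages=2):
    print(cell)
```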
3.2.3. Reinforcement Learning (RL)
The proposed network is trained with reinforcement learning, with the accuracy achieved by a child network denoted by R. This accuracy R is used as the reward signal with which RL trains the RNN controller. To find an optimized architecture, the controller is trained to maximize its expected reward, denoted J(θc), as given in Equation 1.
Because the reward signal R is non-differentiable, a policy gradient method is used to update θc iteratively. The REINFORCE rule is applied as set out in Equation 2.
An empirical approximation of the above quantity is calculated according to Equation 3.
Here, m is the number of different architectures that the controller samples in one batch, and T is the number of hyperparameters the controller has to predict to design a neural network architecture. Rk is the validation accuracy that the k-th neural network architecture achieves after being trained on a training dataset. The approximation in Equation 3 is an estimate of the gradient; however, it has the disadvantage of high variance. To reduce this variance, the baseline function described in Equation 4 is used.
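For reference, the following is a sketch of how Equations 1 to 4 would read if they follow the standard REINFORCE formulation of Zoph and Le [41]. The action sequence a_{1:T} (the T predictions of the controller) and the distribution P are notation assumed here and are not defined in the text; b denotes the baseline.

```latex
\begin{align}
J(\theta_c) &= \mathbb{E}_{P(a_{1:T};\,\theta_c)}\left[R\right] \tag{1}\\
\nabla_{\theta_c} J(\theta_c) &= \sum_{t=1}^{T} \mathbb{E}_{P(a_{1:T};\,\theta_c)}\left[\nabla_{\theta_c} \log P\left(a_t \mid a_{(t-1):1};\,\theta_c\right) R\right] \tag{2}\\
\nabla_{\theta_c} J(\theta_c) &\approx \frac{1}{m}\sum_{k=1}^{m}\sum_{t=1}^{T} \nabla_{\theta_c} \log P\left(a_t \mid a_{(t-1):1};\,\theta_c\right) R_k \tag{3}\\
\nabla_{\theta_c} J(\theta_c) &\approx \frac{1}{m}\sum_{k=1}^{m}\sum_{t=1}^{T} \nabla_{\theta_c} \log P\left(a_t \mid a_{(t-1):1};\,\theta_c\right) \left(R_k - b\right) \tag{4}
\end{align}
```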
The baseline b is a moving average of the accuracies of the architectures in previous batches. The structures found in the search space can be seen in Figure 11.
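The training procedure of this subsection can be illustrated with a toy policy-gradient loop. This is a minimal sketch under strong simplifying assumptions, not the procedure used in this work: the controller is a table of independent logits rather than an RNN, the reward is a stand-in function (`toy_reward`, hypothetical) instead of the validation accuracy of a trained child network, and the baseline is an exponential moving average of past batch rewards.

```python
import math
import random


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def toy_reward(arch, target):
    """Stand-in for validation accuracy: fraction of decisions matching target."""
    return sum(a == t for a, t in zip(arch, target)) / len(target)


def train_controller(target, num_choices=3, m=16, steps=400,
                     lr=0.5, decay=0.9, seed=0):
    rng = random.Random(seed)
    T = len(target)                      # decisions per architecture
    theta = [[0.0] * num_choices for _ in range(T)]
    baseline = 0.0
    for _ in range(steps):
        grads = [[0.0] * num_choices for _ in range(T)]
        rewards = []
        probs = [softmax(row) for row in theta]
        for _ in range(m):               # m architectures per batch
            arch = [rng.choices(range(num_choices), weights=p)[0] for p in probs]
            r = toy_reward(arch, target)
            rewards.append(r)
            for t in range(T):
                for c in range(num_choices):
                    # Gradient of log-softmax at the sampled decision.
                    dlogp = (1.0 if c == arch[t] else 0.0) - probs[t][c]
                    grads[t][c] += dlogp * (r - baseline) / m
        for t in range(T):               # gradient ascent on expected reward
            for c in range(num_choices):
                theta[t][c] += lr * grads[t][c]
        # Moving average of the rewards of previous batches.
        baseline = decay * baseline + (1 - decay) * sum(rewards) / m
    return [max(range(num_choices), key=row.__getitem__) for row in theta]


best = train_controller(target=[2, 0, 1, 1])
print(best, toy_reward(best, [2, 0, 1, 1]))
```

Subtracting the baseline does not change the expected gradient but reduces its variance, which is the motivation given above for introducing b.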