1. Introduction
Over recent years, there has been growing interest in spiking neural networks (SNNs), which have found applications in various domains such as pattern recognition and clustering. Regarded as the third generation of artificial neural networks (ANNs), SNNs differ from classic ANNs by processing data as sequences of spikes known as spike trains. In terms of computation, an SNN therefore requires only a single bit line toggling between logical levels '0' and '1', in contrast to classic ANNs, which operate on real- or integer-valued inputs. SNNs excel at processing both temporal and spatial patterns, rendering them computationally more powerful than ANNs [1].
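As an illustrative aside (not part of the original formulation), the binary spike-train representation can be sketched with simple rate coding, where a normalized input value sets the per-timestep firing probability; `rate_encode` and its parameters are hypothetical names introduced here for illustration:

```python
import numpy as np

def rate_encode(intensity, timesteps=20, rng=None):
    """Encode a normalized intensity in [0, 1] as a binary spike train.

    At each timestep the neuron emits a spike ('1') with probability
    equal to the intensity, so stronger inputs yield denser trains.
    """
    rng = np.random.default_rng(rng)
    return (rng.random(timesteps) < intensity).astype(np.uint8)

# A strong input produces a dense train; a zero input stays silent.
print(rate_encode(0.8, timesteps=10, rng=42))
print(rate_encode(0.0, timesteps=10, rng=42))
```

Each timestep carries a single bit, which is the property that lets SNN hardware replace multiply-accumulate operations on real values with sparse, event-driven additions.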
Spiking neural networks transmit spike signals between neurons, operating as event-driven or clock-driven computing systems in which power consumption is concentrated in the currently active parts of the network. This design saves energy in inactive regions, enabling SNNs to perform distributed and asynchronous computing with minimal network time delays and enhanced real-time capability [2,3]. While convolutional neural networks (CNNs) have proven highly successful for natural image classification [4], their training and operation demand substantial computing resources. Notably, SNNs offer superior high-speed operation, whereas CNNs excel at classification tasks.
While SNNs exhibit impressive computational capabilities, they still lack effective learning mechanisms aligned with biological processes [5]. The predominant learning principle in SNNs, spike-timing-dependent plasticity (STDP), proves inadequate for training multilayer neural networks. Consequently, there is growing interest in a conversion-based training approach for SNNs: a conventional artificial neural network is first trained with the backpropagation algorithm, and its parameters, such as weights and biases, are then converted by suitable methods for use in an SNN. Various approaches have been explored to adapt existing neural networks to SNNs. Cao et al. customized a standard CNN to meet SNN requirements, albeit with some performance loss [6]. Diehl et al. [7] improved network performance by converting a CNN to an SNN through weight normalization, reducing conversion errors. Hunsberger et al. [8] enhanced conversion performance by incorporating Leaky Integrate-and-Fire (LIF) neurons into the SNN. Theoretically, SNNs can match or surpass the performance of CNNs [9], yet achieving equivalent practical performance remains challenging.
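To make the LIF neuron model concrete, the following is a minimal discrete-time sketch; the parameter values and the helper name `lif_simulate` are illustrative choices, not code from the cited works:

```python
def lif_simulate(input_current, v_thresh=1.0, v_reset=0.0, tau=10.0, dt=1.0):
    """Simulate a discrete-time Leaky Integrate-and-Fire neuron.

    The membrane potential v leaks toward its resting value and
    integrates the input drive; when v crosses v_thresh, the neuron
    emits a spike and v is reset.
    """
    v = v_reset
    spikes = []
    for i_t in input_current:
        # Leaky integration: exponential decay toward rest plus input drive
        v = v + (dt / tau) * (-(v - v_reset)) + i_t
        if v >= v_thresh:
            spikes.append(1)
            v = v_reset   # hard reset after a spike
        else:
            spikes.append(0)
    return spikes

# A constant sub-threshold drive accumulates until periodic spikes occur
print(lif_simulate([0.3] * 10))   # → [0, 0, 0, 1, 0, 0, 0, 1, 0, 0]
```

The leak term is what distinguishes LIF from a plain integrate-and-fire unit: without sustained input, the potential decays, so spike timing carries temporal information.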
There are examples of intelligent systems converting data directly from sensors [10,11], controlling manipulators [12] and robots [13], performing recognition or detection tasks [14,15], tactile sensing [16], or processing neuromedical data [17]. Li et al. [18] incorporated the mechanism of LIF neurons into MLP models and proposed a full-precision LIF operation to communicate between patches, including horizontal and vertical LIF in different directions. Their SNN-MLP models achieve 81.9%, 83.3%, and 83.5% top-1 accuracy on the ImageNet dataset with only 4.4G, 8.5G, and 15.2G FLOPs, respectively. Zhang et al. [19] proposed a multiscale dynamic coding improved spiking actor network (MDC-SAN) for reinforcement learning to achieve effective decision-making; population coding at the network scale is integrated with dynamic neuron coding at the neuron scale to form a powerful spatial-temporal state representation. Cuadrado et al. [20] proposed a U-Net-like SNN for dense optical flow estimation that encourages both a minimal norm of the error vector and a minimal angle between ground-truth and predicted flow; in addition, the use of 3D convolutions helped capture the dynamic nature of the data by increasing the temporal receptive fields. Zou et al. [21] presented a dedicated end-to-end sparse deep learning approach for event-based pose tracking, achieving a 20% reduction in FLOPs; it is built entirely on the framework of spiking neural networks, consisting of a Spike-Element-Wise (SEW) ResNet and a spiking spatiotemporal transformer.
Facial expression recognition is a pivotal field in computer comprehension of human emotions and a crucial element of human-computer interaction. It involves identifying facial expressions in static photos or video sequences to infer emotional and psychological changes in individuals. In the 1970s, the American psychologists Ekman and Friesen defined six fundamental human expressions through extensive experiments: happiness, anger, surprise, fear, disgust, and sadness.
However, recognizing such expressions under naturalistic conditions poses significant challenges due to variations in head pose, illumination, and occlusions, and to the nuanced nature of unposed expressions. The Facial Expression Recognition Challenge, a prominent track in three machine learning contests, is notably demanding. For instance, a manual test conducted on the official FER2013 dataset revealed a human recognition accuracy of approximately 65% on the original data, showing that label recognition is challenging even for humans. On a small, clean subset officially extracted from the original dataset, human recognition accuracy was around 68%.
The analysis of human facial characteristics and the recognition of emotional state are considered very challenging tasks. The main difficulty stems from the non-uniform nature of the human face and various limitations related to lighting, shadows, facial pose, and orientation [22]. In the Large Scale Visual Recognition Challenge (ILSVRC) 2012, the CNN-based AlexNet model notably enhanced facial expression recognition (FER) accuracy. Subsequently, more intricate CNN variants emerged, such as VGGNet [23], GoogLeNet [24], and ResNet [25]. However, these deep learning models are complex and have large numbers of parameters, making them unsuitable for embedded computers and mobile devices. It is worth noting that current research on spiking neural networks is still at the model-exploration stage, with relatively few studies focusing on practical applications; in particular, there is a lack of research introducing SNNs into facial expression recognition.
The structure of the paper is organized as follows. Sect. 2 examines background topics on spiking neurons, the STDP fine-tuning method, the construction of the convolutional SNN, and its loss function. Sect. 3 presents the experimental study, examines the collected results, and discusses the main findings. Sect. 4 describes the feature visualization results of the SNN. Finally, Sect. 5 concludes the paper and outlines directions for future work.
This paper makes several significant contributions. It proposes a highly efficient convolutional SNN capable of facial expression recognition. The method fully exploits the SNN's clock-driven and synaptic-sparsity properties, significantly reducing the number of model parameters while approaching the accuracy of ANNs, thereby lowering computational cost and speeding up training. Moreover, the paper proposes a novel fine-tuning approach for SNNs based on Spike-Timing-Dependent Plasticity (STDP). This method effectively integrates unsupervised learning inspired by biological neural computation to enhance supervised learning in SNNs, improving recognition accuracy and model generalization.
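As a rough illustration of the pair-based STDP rule that underlies such fine-tuning (the parameter values and the helper `stdp_update` are illustrative, not the paper's exact scheme):

```python
import math

def stdp_update(w, t_pre, t_post, a_plus=0.01, a_minus=0.012,
                tau_plus=20.0, tau_minus=20.0, w_min=0.0, w_max=1.0):
    """Pair-based STDP: potentiate when the presynaptic spike precedes
    the postsynaptic spike, depress when it follows it."""
    dt = t_post - t_pre
    if dt > 0:       # pre before post -> long-term potentiation
        w += a_plus * math.exp(-dt / tau_plus)
    elif dt < 0:     # post before pre -> long-term depression
        w -= a_minus * math.exp(dt / tau_minus)
    return min(max(w, w_min), w_max)   # clip to the allowed weight range

w = 0.5
w = stdp_update(w, t_pre=10.0, t_post=15.0)   # pre leads post: w increases
print(round(w, 4))
```

The exponential dependence on the spike-time difference means that only temporally correlated pre/post activity changes a synapse appreciably, which is what makes STDP an unsupervised, biologically inspired complement to backpropagation-trained weights.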