Preprint
Article

This version is not peer-reviewed.

Enhanced Quantum-Inspired Deep Learning with Multi-Head Attention and Contrastive Learning for Multimodal Emotion Recognition in Human-Computer Interaction

Submitted: 31 March 2026
Posted: 01 April 2026


Abstract
This paper proposes an enhanced quantum-inspired sentiment analysis model incorporating a self-embedding mechanism for sentiment feature extraction and classification tasks. The method integrates phase-pre-trained self-embedding, bidirectional GRUs, a multi-head attention mechanism, and a multi-layer Transformer structure, effectively capturing semantic and emotional features in texts. Simultaneously, the model introduces contrastive learning and an enhanced feature interaction module, further improving feature discriminability. Extensive experiments on the RECCON dataset demonstrate that the proposed model significantly outperforms mainstream baseline methods (KEC, MPEG, Window Transformer) on key metrics such as macro-F1, positive-class F1, and negative-class F1. The experimental results show that the method not only improves overall accuracy and recall but also effectively mitigates challenges arising from class imbalance, achieving a macro-F1 of 0.95, positive-class F1 of 0.93, and negative-class F1 of 0.97 on the test set. The findings suggest that the combination of quantum-inspired structures and self-embedding mechanisms holds broad application prospects for complex sentiment analysis tasks.

1. Introduction

1.1. Research Background

Natural Language Processing (NLP), a core branch of artificial intelligence, aims to enable computers to understand, interpret, and generate human language. It has long been a key focus of research. This technology has been widely applied in various downstream tasks, such as text classification [1,2], word sense disambiguation (WSD) [3,4], knowledge graphs [5], and question-answering systems [6]. In recent years, the NLP field has seen significant progress, primarily driven by the widespread adoption and successful application of deep learning methods, including convolutional neural networks (CNN) [7,8] and recurrent neural networks (RNN) [9].
Since 2018, the emergence of models such as ELMo, GPT, and BERT has marked the dawn of a new era in NLP. These models learn deep representations of language by pre-training on vast amounts of unlabeled text and are adapted to specific tasks through fine-tuning [10]. For example, the BERT (Bidirectional Encoder Representations from Transformers) model generates deep bidirectional representations by jointly training on both left and right contexts [11].
Sentiment Classification: Sentiment analysis aims to assess and identify the emotion or sentiment conveyed by textual data [12]. Transformer-based models and their variants have demonstrated excellent performance in this task [13]. For example, a hybrid architecture combining BERT, BiLSTM, and CNN layers has been employed for sentiment classification of student reviews in MOOCs, enhancing classification performance by fusing BERT’s context-aware capability, BiLSTM’s ability to capture long- and short-term dependencies, and CNN’s local feature extraction [14]. Other research has integrated BERT with Support Vector Machines (SVM) for sentiment analysis, where the two methods complement each other: BERT provides powerful language understanding, while SVM excels at classification [15].
Word Sense Disambiguation (WSD): WSD is a crucial research topic in NLP, aimed at determining the correct meaning of a word within a specific context [16]. It has broad applications in areas such as text classification, machine translation, and information retrieval [17]. Early studies focused on context-sensitive and statistical methods [18], while modern approaches leverage deep learning models. For instance, Graph Convolutional Networks (GCN) have been applied to WSD tasks, improving disambiguation accuracy by extracting discriminative features—such as words, part-of-speech tags, and semantic categories—from the context of ambiguous words.
Advances in NLP have also benefited the biomedical field, where researchers have proposed various pre-trained language models trained on biomedical datasets (e.g., text, electronic health records, protein and DNA sequences) for a range of biomedical tasks [19]. Furthermore, NLP techniques are being applied to AI-assisted programming tasks such as code generation, code completion, and code translation. As a widely researched area in NLP, text classification has seen remarkable achievements with deep learning models, including large pre-trained models like BERT and DistilBERT.
The advent of Large Language Models (LLMs) represents a revolutionary breakthrough in artificial intelligence. With unprecedented training scales and model parameters, they have significantly advanced capabilities in language understanding, synthesis, and commonsense reasoning, achieving performance levels close to human proficiency [20]. Pre-trained language models like BERT, GPT, and ERNIE are representative examples of LLMs. The evolution of NLP has progressed from Word2Vec and GloVe word embeddings in 2013, to the introduction of the attention mechanism and Transformer in 2017, culminating in the development of large multimodal models like GPT-4 in 2023 and Gemini in 2024.
Despite these significant advancements, NLP still faces challenges. The first is the “black box” nature of neural network models [21]. The millions, or even billions, of parameters in these models are difficult to interpret, limiting their application in critical areas such as medical diagnosis [22] or financial decision-making [23]. Secondly, the inherent uncertainty and ambiguity of natural language are not fully embedded in the models, which is inconsistent with how humans perceive language and may result in inadequate modeling [24]. To address these challenges, the emergence of quantum-inspired models offers a new research direction for NLP. These models, which construct neural networks based on quantum theory [25], hold the potential to enhance model interpretability, thereby improving the understanding and handling of the complexities of natural language.

1.2. Development of Quantum-Inspired Models

In recent years, researchers have extensively investigated the construction of quantum-inspired models and their application in various NLP scenarios. In 2018, Zhang et al. [26] introduced an end-to-end quantum-inspired language model for question-answering tasks, representing sentences as density matrices and proposing a joint representation for question answering. Building on this, Li et al. [27] developed a complex-valued word-embedding neural network, which defines semantic units in Hilbert space and uses complex vectors to represent words. In 2019, Li et al. proposed a novel matching model termed Complex-Valued Network (CNM), achieving performance comparable to traditional CNN and RNN baselines. In the same year, Tamburini [28] introduced a quantum-like Word Sense Disambiguation (QWSD) model based on quantum probability theory, representing words and sentences in complex domains. In 2024, Shi et al. [29] proposed QPFE and QPFE-ERNIE, which enhance quantum-like models with gated recurrent units (GRU) and incorporate attention mechanisms and CNNs, yielding improved experimental results in text classification tasks.
These studies have undoubtedly made significant contributions to the application of quantum theory in NLP, advancing the development of quantum-inspired models in the field. However, these improvements have largely overlooked a critical issue: prior knowledge is indispensable for constructing high-performance quantum-inspired models. Even though models based on quantum probability theory have been theoretically demonstrated to be suitable for natural language modeling, ELMo, BERT, GPT, and other language models have achieved great success by learning textual knowledge through pre-training tasks on large corpora. Currently, few works integrate pre-trained textual feature embeddings into quantum-inspired models, which may limit the performance enhancement of such models.
To improve the performance of quantum-inspired models, especially for emotion recognition in dialogues for sentiment classification, we adopt a self-embedding mechanism to incorporate semantic information and other features as part of our pre-training strategy, forming our novel enhanced quantum-inspired model. This model can first acquire knowledge from the dataset, accelerating the training process. We validate our model using the publicly available RECCON dataset. Figure 1 illustrates the dialogue content and emotional feedback within the RECCON-DD dataset. The RECCON-IEM dataset is similar in structure, containing analogous dialogues and emotional annotations.
The results demonstrate that our proposed method and the ImprovedQPFE model can effectively leverage the advantages of quantum-inspired models. In general, the contributions can be summarized as follows:
  1. Proposed a quantum-inspired emotion recognition model that integrates a self-embedding mechanism combining complex embeddings and phase pre-training.
This paper designs and implements a quantum-inspired neural network architecture based on complex embeddings and phase-information pre-training. By introducing dual-channel modeling in complex space (amplitude and phase) and combining emotion-label-driven phase pre-training, the proposed model effectively enhances the representation and differentiation of complex emotional semantics and causal relationships in text.
  2. Integrated multi-layer Transformer and contrastive learning mechanisms to enhance feature modeling and discriminative capabilities.
The backbone of the proposed model incorporates multi-layer Transformer blocks and multi-head self-attention, combined with a BiGRU for deep contextual modeling. Moreover, the model introduces a contrastive learning loss and a quantum measurement module, further improving the ability to distinguish features of different classes and emotional states and significantly enhancing recognition performance for both positive and negative samples.
  3. Achieved excellent experimental results on public datasets, verifying the effectiveness and generalization ability of the method.
Extensive experiments conducted on the publicly available emotion-cause pair extraction datasets RECCON-DD and RECCON-IEM show that the proposed model outperforms mainstream baseline methods on multiple evaluation metrics, including macro-F1, positive-class F1, and negative-class F1. The experiments also demonstrate that the model exhibits strong robustness in addressing practical challenges such as class imbalance, complex contextual scenarios, and multi-granularity emotion analysis.

2.2. The RECCON-DD Dataset and Dialogue Emotion Recognition

The RECCON-DD dataset is constructed based on DailyDialog and is specifically designed for dialogue emotion recognition tasks. A key feature of this dataset is its provision of rich conversational contextual information, with each utterance annotated with corresponding emotion labels, enabling models to learn emotion recognition in realistic conversational scenarios. Unlike traditional single-sentence sentiment analysis, RECCON-DD requires models to understand the continuity and contextual dependencies in dialogue.
Emotion recognition on the RECCON-DD dataset presents several distinct challenges. First, there is the issue of emotion category imbalance, where certain emotion categories (such as happiness and sadness) appear more frequently in conversations, while others are relatively scarce. Second, context dependency is prominent, as the emotion of an utterance often relies on dialogue history and speaker state. Third, fine-grained emotion differentiation is required [36]; models need to accurately distinguish between similar yet distinct emotional states.
Traditional methods based on BERT and RoBERTa have achieved some success on RECCON-DD, but they primarily rely on pre-trained language representations and may lack a deep understanding of emotional dynamics and dialogue structure. Therefore, recent research has begun to explore more specialized architectures, such as those combining graph neural networks, memory networks, and attention mechanisms, to address the complexity of dialogue emotion recognition [37].

2.3. Complex-Valued Neural Networks and Quantum-Inspired Representation Learning

Complex-valued neural networks provide a significant extension to traditional real-valued networks by introducing computation in the complex domain, thereby enhancing the representational capacity of models. In dialogue emotion recognition tasks, the advantage of complex-valued representations lies primarily in the ability to simultaneously encode explicit and implicit semantic information. The real part is typically used to represent directly observable semantic features, while the imaginary part captures more abstract emotional and contextual relationships. Quantum-inspired embedding methods represent words in complex form, with the amplitude component encoding semantic intensity and the phase component encoding semantic direction or emotional tendency. This dual encoding mechanism allows the model to process multi-level semantic information within the same representation space. In dialogue scenarios, this representation is particularly effective because emotions in conversations often carry multiple meanings and exhibit gradual transitions.
When processing complex-valued inputs, traditional recurrent neural networks require corresponding extensions. The complex-valued GRU separately handles the real and imaginary parts for state updates while maintaining interaction between the two, enabling effective modeling of complex-valued sequences. This approach can capture richer sequential dynamic information while maintaining computational efficiency.
Positional encoding in complex-valued networks also requires special consideration. The traditional sinusoidal positional encoding can be extended to the complex domain by representing positional information in complex exponential form via Euler’s formula. This encoding not only preserves the relative relationships of positions but also provides the model with additional phase information to distinguish semantic features at different positions.
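As an illustrative sketch of this idea, the sinusoidal encoding can be written as a unit-modulus complex exponential $e^{i\,pos\,\omega_k}$, whose real and imaginary parts recover the cosine and sine terms via Euler's formula. The snippet below is a minimal NumPy illustration under our own choices of function name and frequency schedule, not the paper's implementation:

```python
import numpy as np

def complex_positional_encoding(seq_len: int, d: int, base: float = 10000.0) -> np.ndarray:
    """Positional encoding in complex exponential form, e^{i * pos * omega_k}.

    The real part gives a cosine term and the imaginary part a sine term,
    so this is the sinusoidal encoding viewed in the complex domain.
    """
    pos = np.arange(seq_len)[:, None]            # (seq_len, 1) positions
    omega = 1.0 / base ** (np.arange(d) / d)     # (d,) per-dimension frequencies
    return np.exp(1j * pos * omega)              # (seq_len, d), complex-valued

pe = complex_positional_encoding(seq_len=16, d=8)
assert np.allclose(np.abs(pe), 1.0)              # unit amplitude: pure phase information
assert np.allclose(pe[2] * pe[3], pe[5])         # relative positions compose multiplicatively
```

The two asserted properties are what make this form attractive: the encoding carries only phase, and a positional shift acts as an element-wise multiplication, which preserves relative positional relationships.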

2.4. Contrastive Learning and Multi-Task Optimization

Contrastive learning, as an unsupervised representation learning method, learns meaningful feature representations by maximizing the similarity of positive sample pairs and minimizing that of negative sample pairs. In the task of dialogue emotion recognition, the application of contrastive learning requires the incorporation of supervisory information from emotion labels, thus forming a supervised contrastive learning framework. A multi-task learning framework combines contrastive learning with the primary classification task, enhancing model performance through the joint optimization of two objective functions. The advantage of this approach lies in the ability of contrastive learning to provide better feature representations, while the classification task offers explicit supervisory signals. The weighted combination of loss functions requires careful tuning to ensure that the two tasks mutually reinforce rather than interfere with each other.
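To make the supervised contrastive objective concrete, the sketch below implements a simplified SupCon-style loss in NumPy. It follows our own simplifications (a single view per sample, no projection head) and is an illustration of the principle, not the paper's implementation:

```python
import numpy as np

def supervised_contrastive_loss(z: np.ndarray, labels: np.ndarray, tau: float = 0.1) -> float:
    """Supervised contrastive loss: for each anchor, samples sharing its label
    are positives (pulled together); all other samples are negatives."""
    z = z / np.linalg.norm(z, axis=1, keepdims=True)   # L2-normalize features
    sim = z @ z.T / tau                                # temperature-scaled similarities
    n = len(labels)
    total, anchors = 0.0, 0
    for i in range(n):
        others = np.delete(np.arange(n), i)            # exclude the anchor itself
        logits = sim[i, others]
        pos = logits[labels[others] == labels[i]]      # same-label similarities
        if pos.size == 0:
            continue                                   # anchor with no positives is skipped
        # -log( exp(pos) / sum_j exp(logits_j) ), averaged over the positives
        total += (np.log(np.exp(logits).sum()) - pos).mean()
        anchors += 1
    return total / max(anchors, 1)

# Well-clustered features with matching labels yield a lower loss than the
# same features paired with mismatched labels.
z = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
good = supervised_contrastive_loss(z, np.array([0, 0, 1, 1]))
bad = supervised_contrastive_loss(z, np.array([0, 1, 0, 1]))
assert good < bad
```

In a multi-task setup this term would be added to the classification objective with a tuned weight, e.g. $L = L_{\text{CE}} + \lambda L_{\text{con}}$, which is exactly the balancing problem discussed above.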
Data augmentation plays a crucial role in contrastive learning. For textual data, common augmentation methods include random masking, synonym replacement, and sentence reordering. In dialogue emotion recognition, maintaining the consistency of emotional semantics is a key challenge for data augmentation, necessitating specially designed augmentation strategies to avoid altering the original emotion labels [38].

2.5. Hybrid Architectures and Quantum-Inspired Transformers

The traditional Transformer architecture enables effective modeling of long-range dependencies through self-attention mechanisms, yet it still exhibits certain limitations when handling complex emotional dynamics and semantic entanglements. To address these issues, researchers have begun exploring hybrid approaches that integrate quantum-inspired concepts with Transformer architectures. The multi-head attention mechanism provides a natural framework for quantum-inspired extensions, as each attention head can be viewed as a distinct measurement operator applied to quantum states, allowing features to be observed and extracted from different perspectives. In dialogue emotion recognition, this multi-perspective feature extraction is particularly important, as emotional information may be embedded in utterances in various forms.
Recent research efforts have focused on integrating complex-valued representations into the Transformer architecture. Key challenges in this integration include: (1) how to process complex-valued operations while maintaining computational efficiency; (2) how to design attention mechanisms suitable for complex-valued inputs; and (3) how to perform effective positional encoding in the complex-valued space. Some studies have proposed sparse attention patterns to reduce the complexity of complex-valued computations while preserving model performance.
The application of layer normalization in complex-valued networks is also an important research direction. Traditional layer normalization needs to be adapted to the characteristics of complex-valued inputs, particularly by considering different normalization strategies for amplitude and phase information. Such improved normalization methods have been shown to significantly enhance training stability and convergence.

2.6. Enhanced Quantum-Inspired Architecture Design

To address the specific requirements of dialogue emotion recognition, the enhanced quantum-inspired model adopts a multi-level architectural design. The core idea of this architecture is to simultaneously process semantic content and emotional information through complex-valued representations, where the real part encodes explicit lexical semantics, and the imaginary part captures implicit emotional tones and contextual dependencies.
Phase pre-training represents a key innovation in this architecture. By pre-training a dedicated phase extractor, the model learns phase patterns associated with different emotions. This pre-training process uses emotion labels as supervisory signals, enabling the model to map different emotional states to distinct phase intervals. The resulting phase embeddings are subsequently used to initialize the phase parameters of the main model, providing a better starting point for training.
The multiple measurement mechanism simulates the process of quantum measurement, extracting real-valued features from complex-valued states via multiple distinct measurement operators. Each operator focuses on different feature dimensions, akin to the concept of multi-head attention. The measurement results are then weighted and combined through an attention mechanism, allowing the model to adaptively select the most relevant features. This design enhances the model’s sensitivity to different types of emotional expressions.
The BiGRU extends the traditional GRU architecture by separately processing the real and imaginary parts of complex-valued inputs, thereby preserving the evolution of quantum states. The bidirectional mechanism ensures that the model can simultaneously leverage both forward and backward contextual information, which is particularly important for understanding the trajectory of emotions in dialogues. The incorporation of residual connections and layer normalization further improves the training stability and convergence of deep complex-valued networks.

3. Methodology

3.1. Problem Formalization

3.1.1. Multimodal Sentiment Recognition Task

Given an input sequence $X = \{x_1, x_2, \ldots, x_n\}$, where $x_i$ denotes the $i$-th word, the goal is to predict the sentiment class $y \in \{1, 2, \ldots, C\}$, where $C$ is the number of sentiment categories. In multimodal scenarios, the input can be extended to $X = \{X^{(\text{text})}, X^{(\text{audio})}, X^{(\text{visual})}\}$, and the model is required to integrate information from the different modalities to achieve accurate recognition of complex emotional states.

3.2. Overall Model Architecture

Figure 2 illustrates the complete architectural workflow of the proposed model. The model primarily consists of the following core components:
  • Input Layer and Complex-valued Embedding Layer: This layer transforms the input text sequence into a quantum state representation in the complex domain, achieving quantum state embedding through amplitude-phase separation.
  • Bidirectional Complex Recurrent Layer: It processes the sequence bidirectionally within the complex domain to capture both forward and backward contextual information.
  • Complex-valued Multi-head Attention Mechanism: This mechanism employs 8 attention heads to learn different feature subspaces in parallel, enabling the modeling of global dependencies.
  • Multi-layer Transformer Blocks: Three layers of Transformer structures are stacked. Each layer contains a multi-head attention module, residual connections, layer normalization, and a position-wise feed-forward network to progressively extract and refine feature representations.
  • Quantum Measurement Module: Three distinct measurement operators are employed to map quantum states to an observable probability feature space via the Born rule.
  • Feature Enhancement Module: Self-supervised contrastive learning is adopted to enhance the discriminative power of the features, pulling similar samples closer and pushing dissimilar ones apart.
  • Output Layer: The final sentiment category probability distribution is generated through a fully connected layer and a Softmax activation function.
Figure 2. Overall Architecture of the Quantum-inspired Pre-trained Feature Embedding (QPFE) Model.
The entire architecture organically integrates concepts from quantum computing with deep learning techniques, achieving a deeper understanding and accurate recognition of dialogue sentiment.

3.3. Enhanced Complex Embedding Layer

3.3.1. Complex Embedding Design

The model employs an amplitude-phase separated complex embedding method, mapping each word $x_i$ to the complex domain $\mathbb{C}^d$, where $d$ is the embedding dimension. This design enables the model to simultaneously capture the semantic amplitude information and the phase relationships between words.
Amplitude Embedding: The amplitude component $r_i \in \mathbb{R}^d$ is obtained via a trainable real-valued embedding matrix $E_{\text{amp}} \in \mathbb{R}^{|V| \times d}$, where $|V|$ denotes the vocabulary size. The amplitude embedding undergoes LayerNorm normalization to ensure numerical stability:
$$r_i = \mathrm{LayerNorm}(E_{\text{amp}}[x_i])$$
The amplitude embedding primarily encodes the semantic intensity information of words, analogous to the semantic representation in traditional word embeddings.
Phase Embedding: The phase component $\theta_i \in \mathbb{R}^d$ is obtained via an independent phase embedding matrix $E_{\text{phase}} \in \mathbb{R}^{|V| \times d}$, and pre-trained phase parameters $\theta_{\text{pretrained}}$ can be introduced:
$$\theta_i = E_{\text{phase}}[x_i] \cdot \alpha_{\text{scale}}$$
The phase embedding is initialized within the range $[-\pi, \pi]$, where $\alpha_{\text{scale}}$ is a learnable phase scaling parameter. If pre-trained phase parameters exist, those values are used directly. The phase information encodes implicit relational structures and semantic similarities between words.
Complex Embedding Construction: The amplitude and phase are combined into a complex embedding:
$$e_i = r_i \odot e^{i\theta_i} = r_i \odot (\cos\theta_i + i \sin\theta_i)$$
In practice, the complex number is represented as a combination of its real and imaginary parts:
$$\mathrm{Re}(e_i) = r_i \odot \cos(\theta_i), \qquad \mathrm{Im}(e_i) = r_i \odot \sin(\theta_i)$$
where $\odot$ denotes element-wise multiplication. The final complex embedding has shape $[\text{batch\_size}, \text{seq\_len}, d, 2]$, with the last dimension holding the real and imaginary parts, respectively.
Quantum State Representation: The complex embedding $e_i$ of each word can be viewed as a quantum state $|\psi_i\rangle$ with probability amplitude $r_i$ and phase $\theta_i$. This representation allows the model to leverage the principle of quantum superposition to represent multiple semantic states simultaneously.
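A minimal NumPy sketch of this amplitude-phase embedding (toy sizes, random matrices standing in for trained parameters; helper names are ours, not the paper's):

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 100, 16                                        # toy vocabulary size and embedding dim

E_amp = rng.normal(size=(V, d))                       # amplitude embedding matrix
E_phase = rng.uniform(-np.pi, np.pi, size=(V, d))     # phase matrix, initialized in [-pi, pi]
alpha_scale = 1.0                                     # learnable phase scale (fixed here)

def layer_norm(x, eps=1e-5):
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def complex_embed(token_ids):
    r = layer_norm(E_amp[token_ids])                  # r_i = LayerNorm(E_amp[x_i])
    theta = E_phase[token_ids] * alpha_scale          # theta_i = E_phase[x_i] * alpha_scale
    # e_i = r * e^{i theta}, stored as stacked real/imag parts: shape (..., d, 2)
    return np.stack([r * np.cos(theta), r * np.sin(theta)], axis=-1)

ids = np.array([[3, 7, 42]])                          # (batch=1, seq_len=3)
emb = complex_embed(ids)
assert emb.shape == (1, 3, d, 2)
# the amplitude is recoverable from the stacked form: sqrt(Re^2 + Im^2) = |r|
amp = np.sqrt(emb[..., 0] ** 2 + emb[..., 1] ** 2)
assert np.allclose(amp, np.abs(layer_norm(E_amp[ids])))
```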

3.3.2. Positional Encoding Integration

To enhance sequence modeling capability, sinusoidal positional encoding is incorporated into the amplitude embedding. For position $pos$ and dimension $i$, the positional encoding is defined as:
$$PE_{(pos,\, 2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right), \qquad PE_{(pos,\, 2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)$$
$$r_i^{\text{final}} = r_i + PE_{pos}$$
This design enables the model to perceive the positional information of words within the sequence while preserving the integrity of the complex-valued representation.
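The standard sinusoidal encoding added to the amplitude component can be sketched as follows (a generic implementation, assuming an even $d_{\text{model}}$):

```python
import numpy as np

def sinusoidal_pe(seq_len: int, d_model: int) -> np.ndarray:
    """Sinusoidal positional encoding: sin on even dims, cos on odd dims."""
    pos = np.arange(seq_len)[:, None]                 # (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]              # (1, d_model/2)
    angle = pos / 10000 ** (2 * i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angle)
    pe[:, 1::2] = np.cos(angle)
    return pe

pe = sinusoidal_pe(seq_len=10, d_model=16)
assert pe.shape == (10, 16)
# at position 0: sin terms are 0, cos terms are 1
assert np.allclose(pe[0, 0::2], 0.0) and np.allclose(pe[0, 1::2], 1.0)
# r_final = r + PE: the encoding is simply added to the amplitude component
r_final = np.ones((10, 16)) + pe
```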

3.4. BiGRU Architecture

3.4.1. BiGRU Cell Design

BiGRU extends the traditional GRU to the complex domain, achieving the evolution of quantum states by separately processing the real and imaginary parts of complex numbers. For each time step $t$ in the input sequence, we process the complex-valued embedding $x_t \in \mathbb{C}^d$:
$$x_t^{\text{real}} = \mathrm{Re}(x_t), \qquad x_t^{\text{imag}} = \mathrm{Im}(x_t)$$
$$z_t^{\text{real}} = \sigma\!\left(W_z^{\text{real}}\left[h_{t-1}^{\text{real}}, x_t^{\text{real}}\right] + b_z^{\text{real}}\right)$$
$$r_t^{\text{real}} = \sigma\!\left(W_r^{\text{real}}\left[h_{t-1}^{\text{real}}, x_t^{\text{real}}\right] + b_r^{\text{real}}\right)$$
$$\tilde{h}_t^{\text{real}} = \tanh\!\left(W_h^{\text{real}}\left[r_t^{\text{real}} \odot h_{t-1}^{\text{real}}, x_t^{\text{real}}\right] + b_h^{\text{real}}\right)$$
$$h_t^{\text{real}} = \left(1 - z_t^{\text{real}}\right) \odot h_{t-1}^{\text{real}} + z_t^{\text{real}} \odot \tilde{h}_t^{\text{real}}$$
The imaginary part undergoes the same calculation using independent parameter matrices $W_z^{\text{imag}}, W_r^{\text{imag}}, W_h^{\text{imag}}$. This separate processing allows the model to independently learn the temporal patterns of the real and imaginary parts.
Residual Connection and Normalization: To enhance gradient flow and training stability, we introduce residual connections and layer normalization:
$$h_t^{\text{real}} = \mathrm{LayerNorm}\!\left(h_t^{\text{real}} + W_{\text{proj}}^{\text{real}} x_t^{\text{real}}\right), \qquad h_t^{\text{imag}} = \mathrm{LayerNorm}\!\left(h_t^{\text{imag}} + W_{\text{proj}}^{\text{imag}} x_t^{\text{imag}}\right)$$
where $W_{\text{proj}}$ is a projection matrix for dimension matching.
Quantum State Reconstruction: The processed real and imaginary parts are recombined into a complex quantum state:
$$h_t = h_t^{\text{real}} + i\, h_t^{\text{imag}}$$
In implementation, we use a stacked representation:
$$h_t = \left[h_t^{\text{real}}; h_t^{\text{imag}}\right] \in \mathbb{R}^{d \times 2}$$
The BiGRU considers both forward and backward context information simultaneously. For each time step $t$, the forward GRU processes information from $t = 1$ to $t$, and the backward GRU processes information from $t = n$ to $t$:
$$\overrightarrow{h}_t = \mathrm{QuantumGRU}_{\text{forward}}(x_1, \ldots, x_t), \qquad \overleftarrow{h}_t = \mathrm{QuantumGRU}_{\text{backward}}(x_n, \ldots, x_t)$$
The final bidirectional hidden state is obtained by concatenation:
$$h_t = \left[\overrightarrow{h}_t; \overleftarrow{h}_t\right] \in \mathbb{R}^{2d}$$
This bidirectional design enables the model to capture both forward and backward semantic dependencies simultaneously, which is crucial for understanding emotional causality in dialogues.
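The per-channel GRU update and bidirectional stacking can be sketched in NumPy as below. This is a toy illustration with random weights: the residual connection and LayerNorm steps are omitted for brevity, and parameter shapes follow the concatenation convention $[h, x]$ used above:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(h_prev, x_t, Wz, Wr, Wh, bz, br, bh):
    """One GRU update on a single real-valued channel; [h, x] is concatenation."""
    hx = np.concatenate([h_prev, x_t])
    z = sigmoid(Wz @ hx + bz)                          # update gate
    r = sigmoid(Wr @ hx + br)                          # reset gate
    h_tilde = np.tanh(Wh @ np.concatenate([r * h_prev, x_t]) + bh)
    return (1 - z) * h_prev + z * h_tilde

def complex_gru(xs_real, xs_imag, params_real, params_imag, d_h):
    """Run real/imaginary channels with independent parameters, then stack
    them back into a (seq_len, d_h, 2) quantum-state sequence."""
    h_r, h_i, out = np.zeros(d_h), np.zeros(d_h), []
    for xr, xi in zip(xs_real, xs_imag):
        h_r = gru_step(h_r, xr, *params_real)
        h_i = gru_step(h_i, xi, *params_imag)
        out.append(np.stack([h_r, h_i], axis=-1))
    return np.stack(out)

rng = np.random.default_rng(1)
d_in, d_h, T = 4, 6, 5
make = lambda: tuple(rng.normal(scale=0.1, size=s)
                     for s in [(d_h, d_h + d_in)] * 3 + [(d_h,)] * 3)
xs_r, xs_i = rng.normal(size=(T, d_in)), rng.normal(size=(T, d_in))
fwd = complex_gru(xs_r, xs_i, make(), make(), d_h)
bwd = complex_gru(xs_r[::-1], xs_i[::-1], make(), make(), d_h)[::-1]
h_bi = np.concatenate([fwd, bwd], axis=1)              # concatenated bidirectional state
assert h_bi.shape == (T, 2 * d_h, 2)
```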

3.5. Multi-Head Self-Attention Mechanism

3.5.1. Quantum State Attention Computation

The multi-head self-attention mechanism operates on quantum state sequences, capturing different types of semantic relationships by learning multiple attention subspaces in parallel.
Complex Feature Flattening: First, flatten the complex quantum state $h_t \in \mathbb{C}^d$ (shape $[\text{batch}, \text{seq\_len}, d, 2]$) into a real-valued vector:
$$x_t^{\text{flat}} = \left[\mathrm{Re}(h_t); \mathrm{Im}(h_t)\right] \in \mathbb{R}^{2d}$$
Query, Key, Value Generation: Generate Query ($Q$), Key ($K$), and Value ($V$) matrices from the flattened complex features via linear transformations:
$$Q = X^{\text{flat}} W^Q, \qquad K = X^{\text{flat}} W^K, \qquad V = X^{\text{flat}} W^V$$
where $W^Q, W^K, W^V \in \mathbb{R}^{2d \times d_{\text{model}}}$ are learnable weight matrices and $d_{\text{model}}$ is the model dimension.
Multi-Head Splitting: Split $Q$, $K$, $V$ into $h$ heads (in this paper, $h = 8$), with each head's dimension being $d_k = d_{\text{model}} / h$:
$$Q_i = Q W_i^Q, \qquad K_i = K W_i^K, \qquad V_i = V W_i^V$$
where $Q_i, K_i, V_i \in \mathbb{R}^{\text{batch} \times \text{seq\_len} \times d_k}$ and $W_i^Q, W_i^K, W_i^V$ are the projection matrices for the $i$-th head.
Attention Score Calculation: For each attention head $i$, calculate the attention score matrix:
$$\text{scores}_i = \frac{Q_i K_i^T}{\sqrt{d_k}}$$
The scaling factor $\sqrt{d_k}$ prevents excessively large dot-product values that could cause softmax gradient vanishing. If an attention mask $M$ exists (for handling padding positions), apply the mask:
$$\text{scores}_i = \text{scores}_i + M$$
where the padding positions of $M$ are set to $-\infty$.
Attention Weights and Output: Compute attention weights via the softmax function:
$$\text{Attention}_i = \mathrm{softmax}(\text{scores}_i)$$
Then apply the attention weights to the value matrix:
$$\text{head}_i = \text{Attention}_i V_i$$
Multi-Head Concatenation and Output Projection: Concatenate the outputs of all attention heads and obtain the final output via an output projection matrix $W^O$:
$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\text{head}_1, \ldots, \text{head}_h)\, W^O$$
where $W^O \in \mathbb{R}^{d_{\text{model}} \times d_{\text{model}}}$. The final output is projected back to the complex representation form via a linear layer:
$$\text{Output} = \mathrm{MultiHead}(Q, K, V)\, W^{\text{out}} \in \mathbb{R}^{\text{batch} \times \text{seq\_len} \times d \times 2}$$
Advantage of Quantum State Attention: Compared to standard attention, quantum state attention can leverage the phase information of complex representations to better capture implicit relationships between words. The phase difference $\Delta\theta = \theta_i - \theta_j$ encodes the semantic similarity between words $i$ and $j$, enabling the attention mechanism to more accurately identify emotional keywords and causal relationships.
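A compact NumPy sketch of the attention computation over flattened $[\mathrm{Re}; \mathrm{Im}]$ features. This is generic scaled dot-product attention; head splitting is done by reshaping the projected matrices rather than with separate per-head projection matrices, a common simplification:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, Wq, Wk, Wv, Wo, n_heads, mask=None):
    """Multi-head scaled dot-product attention; X is (seq_len, 2d) flattened
    [Re; Im] features, mask (if given) is True where attention is allowed."""
    T, _ = X.shape
    d_model = Wq.shape[1]
    d_k = d_model // n_heads
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    split = lambda M: M.reshape(T, n_heads, d_k).transpose(1, 0, 2)   # (h, T, d_k)
    Qh, Kh, Vh = split(Q), split(K), split(V)
    scores = Qh @ Kh.transpose(0, 2, 1) / np.sqrt(d_k)                # (h, T, T)
    if mask is not None:
        scores = np.where(mask, scores, -1e9)                         # mask padding
    heads = softmax(scores) @ Vh                                      # (h, T, d_k)
    return heads.transpose(1, 0, 2).reshape(T, d_model) @ Wo          # concat + project

rng = np.random.default_rng(2)
T, two_d, d_model, h = 5, 8, 16, 8
Wq, Wk, Wv = (rng.normal(size=(two_d, d_model)) for _ in range(3))
Wo = rng.normal(size=(d_model, d_model))
out = multi_head_attention(rng.normal(size=(T, two_d)), Wq, Wk, Wv, Wo, n_heads=h)
assert out.shape == (T, d_model)
```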

3.6. Quantum Transformer Block

3.6.1. Architecture Design

The Quantum Transformer block adapts the standard Transformer architecture to the complex domain. Each block contains two main sub-layers: a multi-head self-attention sub-layer and a position-wise feed-forward network sub-layer, each equipped with residual connections and layer normalization.
First Sub-layer (Multi-Head Self-Attention): The input quantum state $x \in \mathbb{C}^d$ is processed by the multi-head attention mechanism:
$$\mathrm{Attn}(x) = \mathrm{MultiHead}(x)$$
Then a residual connection and layer normalization are applied:
$$x_1 = \mathrm{LayerNorm}(x + \mathrm{Dropout}(\mathrm{Attn}(x)))$$
Second Sub-layer (Feed-Forward Network): The position-wise feed-forward network (FFN) performs a nonlinear transformation on the quantum state. First, flatten the complex features:
$$x_1^{\text{flat}} = \left[\mathrm{Re}(x_1); \mathrm{Im}(x_1)\right]$$
Then apply a two-layer linear transformation with a GELU activation function:
$$\mathrm{FFN}(x_1^{\text{flat}}) = \mathrm{GELU}\!\left(x_1^{\text{flat}} W_1 + b_1\right) W_2 + b_2$$
where $W_1 \in \mathbb{R}^{2d \times 4d}$ and $W_2 \in \mathbb{R}^{4d \times 2d}$ are weight matrices and $b_1, b_2$ are bias terms. The GELU activation function is defined as:
$$\mathrm{GELU}(x) = x\, \Phi(x) = x \cdot \frac{1}{2}\left(1 + \mathrm{erf}\!\left(\frac{x}{\sqrt{2}}\right)\right)$$
Finally, the second residual connection and layer normalization are applied:
$$x_2 = \mathrm{LayerNorm}\!\left(x_1^{\text{flat}} + \mathrm{Dropout}\!\left(\mathrm{FFN}(x_1^{\text{flat}})\right)\right)$$
The output $x_2$ is reshaped into its complex representation, serving as input to the next Transformer block.
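The FFN sub-layer with exact GELU and a residual + LayerNorm wrapper can be sketched as follows (toy dimensions, random weights; dropout omitted):

```python
import numpy as np
from math import erf

def gelu(x):
    """Exact GELU: x * Phi(x), with Phi the standard normal CDF (via erf)."""
    return x * 0.5 * (1.0 + np.vectorize(erf)(x / np.sqrt(2.0)))

def ffn(x_flat, W1, b1, W2, b2):
    """Position-wise feed-forward: expand 2d -> 4d, GELU, project back to 2d."""
    return gelu(x_flat @ W1 + b1) @ W2 + b2

def layer_norm(x, eps=1e-5):
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

rng = np.random.default_rng(3)
two_d = 8                                             # 2d: flattened [Re; Im] width
W1 = rng.normal(scale=0.1, size=(two_d, 2 * two_d))   # 2d -> 4d
b1 = np.zeros(2 * two_d)
W2 = rng.normal(scale=0.1, size=(2 * two_d, two_d))   # 4d -> 2d
b2 = np.zeros(two_d)

x = rng.normal(size=(5, two_d))                       # (seq_len, 2d)
x2 = layer_norm(x + ffn(x, W1, b1, W2, b2))           # residual + LayerNorm
assert x2.shape == (5, two_d)
assert np.allclose(gelu(np.array([0.0])), 0.0)        # GELU(0) = 0
```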

3.6.2. Residual Connection Adaptation

Residual connections for complex features require special handling. Since complex numbers consist of real and imaginary parts, residual connections are performed in the flattened real-valued space:
$$\text{Output} = \mathrm{LayerNorm}\!\left(x^{\text{flat}} + \mathrm{Sublayer}(x^{\text{flat}})\right)$$
where $x^{\text{flat}} = [\mathrm{Re}(x); \mathrm{Im}(x)]$. This design ensures: (1) Gradient Flow: residual connections provide a direct path for gradient propagation, alleviating the vanishing-gradient problem in deep networks; (2) Feature Preservation: the model can retain original quantum-state information while learning incremental improvements; (3) Training Stability: layer normalization keeps the feature distribution stable and accelerates convergence. Multiple Transformer blocks (3 layers in this paper) are stacked, with the output of each layer serving as the input to the next:
$$x^{(l+1)} = \mathrm{TransformerBlock}^{(l)}\!\left(x^{(l)}\right)$$
This deep architecture enables the model to extract and refine feature representations layer by layer, from low-level local patterns to high-level global semantic relationships.

3.7. Enhanced Quantum Measurement Mechanism

3.7.1. Multiple Measurement Operator Design

The quantum measurement mechanism maps quantum states to an observable classical feature space, following the Born rule in quantum mechanics. We use multiple measurement operators to capture feature information from different dimensions.
Quantum State Representation: The input quantum state $|\psi\rangle$ is represented by a complex sequence with shape $[\mathrm{batch}, \mathrm{seq\_len}, d, 2]$:
$|\psi\rangle = \sum_{j=1}^{d} (a_j + i b_j)\,|j\rangle$, where $a_j = \mathrm{Re}(\psi_j)$ and $b_j = \mathrm{Im}(\psi_j)$ are the real and imaginary parts of the $j$-th basis state, respectively.
Probability Calculation: According to quantum mechanics, the probability of each basis state is the squared modulus of its amplitude:
$|\psi_j|^2 = a_j^2 + b_j^2$, representing the probability of measuring the $j$-th basis state.
Multiple Measurement Operators: We design $M$ different measurement operators $M_1, M_2, \ldots, M_M$ (in this paper, $M = 3$), each implemented via a linear transformation:
$M_i: \mathbb{R}^{2d} \to \mathbb{R}^{d_{\mathrm{measure}}}$. Specifically, each measurement operator is defined as:
$f_i(|\psi\rangle) = W_i^{\mathrm{measure}}\,[\mathrm{Re}(|\psi\rangle); \mathrm{Im}(|\psi\rangle)] + b_i^{\mathrm{measure}}$, where $W_i^{\mathrm{measure}} \in \mathbb{R}^{d_{\mathrm{measure}} \times 2d}$ is a learnable weight matrix and $b_i^{\mathrm{measure}}$ is a bias term.
Measurement Result Combination: The outputs of all measurement operators are combined via an attention mechanism. First, compute the attention weight for each measurement result:
$\alpha = \mathrm{softmax}(W_{\mathrm{attn}}\,[f_1; f_2; \ldots; f_M])$. The final measured feature is:
$f_{\mathrm{measured}} = \sum_{i=1}^{M} \alpha_i\, f_i(|\psi\rangle)$
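A minimal PyTorch sketch of this multi-operator measurement step, under the stated shapes (module and parameter names are illustrative, not the authors' code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiMeasurement(nn.Module):
    """Sketch of M linear measurement operators combined by attention (M = 3)."""
    def __init__(self, d: int, d_measure: int, num_ops: int = 3):
        super().__init__()
        # each operator is a linear map R^{2d} -> R^{d_measure}
        self.ops = nn.ModuleList(nn.Linear(2 * d, d_measure) for _ in range(num_ops))
        # attention scores over the concatenated measurement outputs
        self.attn = nn.Linear(num_ops * d_measure, num_ops)

    def forward(self, psi: torch.Tensor) -> torch.Tensor:
        # psi: [batch, seq, d, 2]; flatten to [Re; Im] -> [batch, seq, 2d]
        flat = torch.cat([psi[..., 0], psi[..., 1]], dim=-1)
        outs = [op(flat) for op in self.ops]                           # M x [b, s, d_measure]
        alpha = F.softmax(self.attn(torch.cat(outs, dim=-1)), dim=-1)  # [b, s, M]
        stacked = torch.stack(outs, dim=-2)                            # [b, s, M, d_measure]
        return (alpha.unsqueeze(-1) * stacked).sum(dim=-2)             # [b, s, d_measure]
```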

3.7.2. Measurement Probability Interpretation

According to the Born rule, the observation probability of measurement operator $M_i$ acting on quantum state $|\psi\rangle$ is:
$P(m_i) = |\langle m_i | \psi \rangle|^2 = \mathrm{Tr}(M_i \rho)$, where $\rho = |\psi\rangle\langle\psi|$ is the density matrix. In the implementation, we calculate it as follows:
Density Matrix Representation: For each position in the sequence, the density matrix is:
$\rho_j = |\psi_j\rangle\langle\psi_j| = \begin{pmatrix} a_j^2 & a_j b_j \\ a_j b_j & b_j^2 \end{pmatrix}$
Measurement Probability: The observation probability for measurement operator $M_i$ is calculated as:
$P_i = \mathrm{Tr}(M_i \rho) = \sum_{j=1}^{d} \mathrm{Tr}(M_i \rho_j)$. In practical implementation, we use a simplified calculation:
$P_i = \mathrm{softmax}(W_i^{\mathrm{measure}}\,[a; b])$, where $a = [a_1, \ldots, a_d]$ and $b = [b_1, \ldots, b_d]$ are the real- and imaginary-part vectors, respectively.
Attention-Weighted Pooling: The probability distribution is used for sequence-level attention-weighted pooling:
$\alpha_{\mathrm{seq}} = \mathrm{softmax}\!\left(\frac{1}{d} \sum_{j=1}^{d} |\psi_j|^2\right), \qquad f_{\mathrm{pooled}} = \sum_{t=1}^{\mathrm{seq\_len}} \alpha_{\mathrm{seq}}[t]\, f_{\mathrm{measured}}[t]$. This design enables the model to focus on positions in the sequence with larger probability amplitudes, which typically carry important semantic information.
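The amplitude-weighted pooling step can be sketched as follows (a minimal PyTorch function under the stated tensor layout; the function name is ours):

```python
import torch
import torch.nn.functional as F

def amplitude_weighted_pool(psi: torch.Tensor, f_measured: torch.Tensor) -> torch.Tensor:
    """Sequence pooling weighted by Born-rule probabilities (sketch).

    psi:        [batch, seq, d, 2]      complex state (real/imag in last dim)
    f_measured: [batch, seq, d_measure] measured features
    returns:    [batch, d_measure]      pooled feature vector
    """
    # |psi_j|^2 = a_j^2 + b_j^2, averaged over the d basis states -> [batch, seq]
    prob = (psi ** 2).sum(dim=-1).mean(dim=-1)
    # softmax over sequence positions: larger amplitudes get larger weights
    alpha = F.softmax(prob, dim=-1)
    return (alpha.unsqueeze(-1) * f_measured).sum(dim=1)
```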

3.8. Contrastive Learning Strategy

3.8.1. Contrastive Loss Function

Contrastive learning learns more discriminative feature representations by pulling samples of the same class closer and pushing samples of different classes apart. Given N samples in a batch, we construct 2 N samples (including original and augmented samples).
Feature Normalization: First, L2-normalize the extracted features $z_i \in \mathbb{R}^d$:
$\tilde{z}_i = \frac{z_i}{\lVert z_i \rVert_2}$. Normalized features reside on a unit hypersphere, making similarity computation more stable.
Similarity Calculation: The similarity between samples $i$ and $j$ is the cosine similarity:
$\mathrm{sim}(z_i, z_j) = \tilde{z}_i^{T} \tilde{z}_j = \frac{z_i^{T} z_j}{\lVert z_i \rVert_2 \lVert z_j \rVert_2}$. Define the similarity matrix $S \in \mathbb{R}^{2N \times 2N}$ with
$S_{ij} = \mathrm{sim}(z_i, z_j)$. Positive Sample Pair Mask: Construct a positive-pair mask $M_{\mathrm{pos}} \in \{0,1\}^{2N \times 2N}$, where $M_{\mathrm{pos}}[i,j] = 1$ indicates that samples $i$ and $j$ belong to the same class ($y_i = y_j$), and 0 otherwise. Diagonal elements (similarity of a sample with itself) are excluded:
$M_{\mathrm{pos}}[i,j] = \begin{cases} 1 & \text{if } y_i = y_j \text{ and } i \neq j \\ 0 & \text{otherwise} \end{cases}$
Negative Sample Pair Mask: The negative-pair mask is defined as:
$M_{\mathrm{neg}} = \mathbf{1} - M_{\mathrm{pos}} - I$, where $I$ is the identity matrix (excluding the diagonal).
Temperature Scaling: The temperature parameter $\tau = 0.07$ scales the similarity, controlling the sharpness of contrastive learning:
$S_{\mathrm{scaled}} = \frac{S}{\tau}$. A smaller temperature makes the model more sensitive to similar samples, enhancing feature discriminability.
Contrastive Loss Calculation: For each sample $i$, the contrastive loss is defined as:
$\mathcal{L}_{\mathrm{contrastive}}(i) = -\log \frac{\sum_{j=1}^{2N} M_{\mathrm{pos}}[i,j] \exp(S_{\mathrm{scaled}}[i,j])}{\sum_{k=1}^{2N} \left(M_{\mathrm{pos}}[i,k] + M_{\mathrm{neg}}[i,k]\right) \exp(S_{\mathrm{scaled}}[i,k])}$. The numerator sums the exponential similarities of positive pairs, and the denominator sums those of all pairs (positive and negative). To avoid numerical instability, a small constant $\epsilon = 10^{-8}$ is added:
$\mathcal{L}_{\mathrm{contrastive}}(i) = -\log \frac{\sum_{j} M_{\mathrm{pos}}[i,j] \exp(S_{\mathrm{scaled}}[i,j]) + \epsilon}{\sum_{k} \exp(S_{\mathrm{scaled}}[i,k]) + \epsilon}$. The final contrastive loss is the average over all samples in the batch:
$\mathcal{L}_{\mathrm{contrastive}} = \frac{1}{2N} \sum_{i=1}^{2N} \mathcal{L}_{\mathrm{contrastive}}(i)$
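The masked contrastive loss can be sketched as a compact PyTorch function (a minimal implementation under the masks and temperature defined above, not the authors' code; samples with no positive partner in the batch rely on the $\epsilon$ term):

```python
import torch
import torch.nn.functional as F

def supervised_contrastive_loss(z: torch.Tensor, labels: torch.Tensor,
                                tau: float = 0.07, eps: float = 1e-8) -> torch.Tensor:
    """Sketch of the masked contrastive loss.

    z:      [2N, d] features for original + augmented samples
    labels: [2N]    class labels (each augmented copy shares its source label)
    """
    z = F.normalize(z, dim=-1)                    # L2-normalize onto the unit sphere
    sim = (z @ z.T) / tau                         # temperature-scaled cosine similarities
    n = z.size(0)
    eye = torch.eye(n, dtype=torch.bool)
    # positive-pair mask: same label, diagonal excluded
    pos = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~eye
    exp_sim = torch.exp(sim).masked_fill(eye, 0.0)  # drop self-similarity terms
    numer = (exp_sim * pos).sum(dim=1) + eps        # positive-pair similarities
    denom = exp_sim.sum(dim=1) + eps                # all pairs (positive + negative)
    return -torch.log(numer / denom).mean()
```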

3.8.2. Feature Representation Learning

Through contrastive learning, the learned feature representations possess the following properties:
Intra-class Compactness: Samples of the same class cluster together in the feature space, reducing intra-class distance. For a positive sample pair $(z_i, z_j)$ with $y_i = y_j$, the contrastive loss encourages $\mathrm{sim}(z_i, z_j) \to 1$.
Inter-class Separation: Samples of different classes are separated in the feature space, increasing inter-class distance. For a negative sample pair $(z_i, z_k)$ with $y_i \neq y_k$, the contrastive loss encourages $\mathrm{sim}(z_i, z_k) \to 0$.
Enhanced Feature Discriminability: Through contrastive learning, the learned feature representations can better distinguish between different classes, especially in cases of class imbalance, helping to improve recognition performance for minority classes.
Implementation Details: During training, we use data augmentation techniques to construct positive sample pairs, such as: (1) Random word masking: Randomly mask words in the input sequence with a probability of 15%; (2) Synonym replacement: Replace some words with their synonyms; (3) Sentence reordering: For dialogue data, adjust the order of utterances. These augmentation techniques ensure the diversity of positive sample pairs, making contrastive learning more effective.
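The first augmentation, random word masking, can be sketched as follows (the function name and mask token are illustrative; two independent applications to the same utterance yield a positive pair):

```python
import random

def random_word_mask(tokens, p: float = 0.15, mask: str = "<UNK>"):
    """Mask each token independently with probability p (sketch)."""
    return [mask if random.random() < p else t for t in tokens]

random.seed(0)
utterance = "i am really happy that i passed the exam today".split()
view1 = random_word_mask(utterance)
view2 = random_word_mask(utterance)
# view1 and view2 are two augmented views of the same utterance -> a positive pair
```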

3.9. Training Strategy Optimization

3.9.1. Joint Loss Function

The total loss function is a weighted combination of classification loss (cross-entropy loss) and contrastive loss:
$\mathcal{L}_{\mathrm{total}} = \mathcal{L}_{\mathrm{CE}} + \lambda\, \mathcal{L}_{\mathrm{contrastive}}$. Cross-Entropy Loss: For multi-class tasks:
$\mathcal{L}_{\mathrm{CE}} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{c=1}^{C} y_{i,c} \log(\hat{y}_{i,c})$, where $N$ is the batch size, $C$ is the number of classes, $y_{i,c} \in \{0,1\}$ is the one-hot encoding of the true label, and $\hat{y}_{i,c}$ is the model's predicted probability.
Label Smoothing: To prevent overfitting, we use label smoothing. The smoothed label is:
$\tilde{y}_{i,c} = (1 - \alpha)\, y_{i,c} + \frac{\alpha}{C}$, where $\alpha = 0.1$ is the smoothing parameter. The smoothed cross-entropy loss is:
$\mathcal{L}_{\mathrm{CE}}^{\mathrm{smooth}} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{c=1}^{C} \tilde{y}_{i,c} \log(\hat{y}_{i,c})$. Contrastive Loss Weight: The contrastive-loss weight $\lambda = 0.1$ balances the classification and contrastive-learning objectives.
Gradient Calculation: The gradient of the total loss is the weighted sum of the gradients of the two loss terms:
$\frac{\partial \mathcal{L}_{\mathrm{total}}}{\partial \theta} = \frac{\partial \mathcal{L}_{\mathrm{CE}}}{\partial \theta} + \lambda\, \frac{\partial \mathcal{L}_{\mathrm{contrastive}}}{\partial \theta}$, where $\theta$ denotes the model parameters.
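The smoothed cross-entropy and the weighted combination can be sketched as follows (function names are ours; PyTorch's `F.cross_entropy` also accepts an equivalent `label_smoothing` argument since version 1.10):

```python
import torch
import torch.nn.functional as F

def label_smoothed_ce(logits: torch.Tensor, labels: torch.Tensor,
                      alpha: float = 0.1) -> torch.Tensor:
    """Cross-entropy against smoothed targets (1-alpha)*one_hot + alpha/C."""
    num_classes = logits.size(-1)
    log_probs = F.log_softmax(logits, dim=-1)
    smooth = torch.full_like(log_probs, alpha / num_classes)
    # true class receives (1 - alpha) + alpha/C
    smooth.scatter_(-1, labels.unsqueeze(-1), 1.0 - alpha + alpha / num_classes)
    return -(smooth * log_probs).sum(dim=-1).mean()

def total_loss(logits, labels, contrastive: torch.Tensor,
               lam: float = 0.1) -> torch.Tensor:
    # L_total = L_CE + lambda * L_contrastive
    return label_smoothed_ce(logits, labels) + lam * contrastive
```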

4. Experiments

4.1. Datasets

This study validates the experiments on two widely used dialogue emotion recognition datasets. The first is the RECCON-DD dataset, derived from the DailyDialog corpus, specifically designed for binary classification tasks in causal relation detection. We adopt stratified sampling to partition the dataset into training (60%), validation (20%), and test (20%) sets, ensuring balanced class distribution across subsets. The dataset contains 7 basic emotion categories: joy, surprise, anger, sadness, fear, disgust, and neutral, providing rich emotional context for the model.
The second dataset is RECCON-IEM, constructed from dialogue text slices of the RECCON-IEM multimodal sentiment corpus, focusing on the binary classification task of "emotion-triggering event" causal detection. Each sample consists of an <emotion><SEP>dialogue-utterance pair, with the labels field indicating the presence of explicit emotional causal clues. The emotion categories come from the original RECCON-IEM annotations, covering six major classes: angry, frustrated, excited, sad, happy, and neutral, providing cross-speaker, multi-context emotional context for the model.

4.2. Model Architecture and Training Configuration

Our proposed Quantum-inspired Pretrained Feature Embedding (QPFE) model adopts an innovative architectural design that integrates concepts from quantum computing into a deep learning framework. The core components of the model include a complex embedding layer with a dimension of 256, which can map text input to a quantum state representation space. The hidden layer size is also set to 256 to maintain a balance between computational efficiency and expressive power.
The model employs a stacked structure of 3 Transformer blocks, each configured with 8 attention heads. This design effectively captures long-range dependencies in sequences. To prevent overfitting, the Dropout rate is set to 0.3. The vocabulary size is limited to 10,000 tokens, and the maximum sequence length is set to 256 tokens. This configuration can handle most dialogue texts while maintaining reasonable computational complexity.
The key innovation of the model lies in the organic integration of five core components: the complex embedding layer realizes quantum state representation, the BiGRU provides enhanced recurrent processing capability, the multi-head attention mechanism is responsible for modeling contextual relationships, the quantum measurement module performs feature extraction, and the contrastive learning component enhances representation learning in a self-supervised manner.
The training process employs carefully designed hyperparameter configurations to ensure optimal model performance. The learning rate is set to 5e-5 and dynamically adjusted using a cosine annealing warm restart strategy, which helps the model escape local optima and achieve better convergence. The batch size is set to 32 during training and 16 during testing to balance training efficiency and memory usage. The weight decay parameter is set to 0.01, and the label smoothing parameter is set to 0.1. These regularization techniques effectively prevent model overfitting. To stabilize the training process, gradient clipping with a maximum norm of 1.0 is used. The patience for the early stopping mechanism is set to 10 epochs, and the maximum number of training epochs is limited to 50, ensuring sufficient training while avoiding unnecessary computational resource consumption.
The loss function design adopts a multi-objective optimization strategy. The main loss function is the cross-entropy loss with label smoothing, and the auxiliary loss function is the contrastive loss with a weight of 0.2. The temperature parameter for contrastive learning is set to 0.1. This parameter choice, validated through extensive experiments, achieves the best balance between feature discriminability and learning stability. The data augmentation strategy performs random token masking on 30% of the training samples, with 20% of the tokens in each sample replaced by the unknown token <UNK>. This data augmentation method improves the model's generalization ability and robustness, enabling it to maintain good performance when encountering unseen vocabulary.
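Under the reported hyperparameters, the optimizer and scheduler setup might look like the following PyTorch sketch (the `Linear` layer is a stand-in for the full QPFE model, and the loop is a skeleton, not the authors' training script):

```python
import torch

# Stand-in model and the reported optimization hyperparameters
model = torch.nn.Linear(256, 2)
opt = torch.optim.AdamW(model.parameters(), lr=5e-5, weight_decay=0.01)
# cosine annealing with warm restarts (restart period T_0 is illustrative)
sched = torch.optim.lr_scheduler.CosineAnnealingWarmRestarts(opt, T_0=10)

x, y = torch.randn(32, 256), torch.randint(0, 2, (32,))
for epoch in range(3):
    opt.zero_grad()
    loss = torch.nn.functional.cross_entropy(model(x), y, label_smoothing=0.1)
    loss.backward()
    # gradient clipping with the reported maximum norm of 1.0
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    opt.step()
    sched.step()
```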

4.3. Visualization Analysis

4.3.1. Attention Weight Visualization

This figure illustrates the aggregated attention-weight distribution of the multi-head attention mechanism when processing the emotional utterance, "I'm really happy that I passed the exam today." The horizontal axis represents the tokens of the input sequence (Key), and the vertical axis represents the output positions (Query); the color gradient (yellow→orange→red) encodes increasing attention weight. Observations from the figure include: the emotional keyword "happy" receives high attention weights, indicating the model's ability to effectively capture emotional keywords; causally relevant words such as "passed" also garner significant attention, demonstrating the model's understanding of the reasons behind the emotion; different attention heads focus on distinct semantic patterns, highlighting the advantage of the multi-head attention mechanism; and the attention weights exhibit a clear combination of local and global characteristics, validating the effectiveness of the multi-head attention mechanism.
Figure 3. Multi-Head Attention Weight Visualization.

4.3.2. Feature Space Distribution Visualization

This figure uses the t-SNE dimensionality reduction method to display the distribution of features extracted by the QPFE model on the real DailyDialog test set in a two-dimensional space. The test samples shown in the figure include two emotion categories: neutral (gray) and joy (gold). From the figure, it can be observed that: (1) Different emotion categories form clear cluster structures in the feature space; (2) Samples of the same emotion category are tightly clustered in the space, showing good intra-class compactness; (3) There are clear separation boundaries between different emotion categories, reflecting that the features learned by the model have good discriminability; (4) After training with contrastive learning, the feature distribution becomes more compact and separated, verifying the effectiveness of the contrastive learning mechanism. These visualization results indicate that the quantum-inspired model proposed in this paper can effectively learn discriminative feature representations for emotional dialogues, and different emotion categories have good separability in the feature space.
Figure 4. t-SNE Feature Space Visualization - Emotion Class Clustering (Real Data).

4.4. Mathematical Definitions of Evaluation Metrics

4.4.1. Confusion Matrix Basics

In the binary sentiment detection task, we define four basic statistics based on the confusion matrix. For a sample set $D = \{(x_i, y_i)\}_{i=1}^{N}$, where $x_i$ is the input dialogue and $y_i \in \{0, 1\}$ is the true sentiment label (0 for negative, 1 for positive), the model prediction is $\hat{y}_i \in \{0, 1\}$. The four basic elements of the confusion matrix are defined as follows:
$\mathrm{TP} = \sum_{i=1}^{N} \mathbb{I}[y_i = 1 \wedge \hat{y}_i = 1], \quad \mathrm{TN} = \sum_{i=1}^{N} \mathbb{I}[y_i = 0 \wedge \hat{y}_i = 0], \quad \mathrm{FP} = \sum_{i=1}^{N} \mathbb{I}[y_i = 0 \wedge \hat{y}_i = 1], \quad \mathrm{FN} = \sum_{i=1}^{N} \mathbb{I}[y_i = 1 \wedge \hat{y}_i = 0]$. The function $\mathbb{I}[\cdot]$ is an indicator function that returns 1 if the condition within the brackets is true, and 0 otherwise. TP denotes True Positives (positive-sentiment samples the model correctly predicts), TN denotes True Negatives (negative-sentiment samples the model correctly predicts), FP denotes False Positives (negative-sentiment samples incorrectly predicted as positive), and FN denotes False Negatives (positive-sentiment samples incorrectly predicted as negative).
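These indicator-function definitions translate directly into code; a plain-Python sketch (function name is ours):

```python
def confusion_counts(y_true, y_pred):
    """Compute (TP, TN, FP, FN) for binary labels, following the indicator definitions."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn

# e.g. confusion_counts([1, 0, 1, 0], [1, 1, 0, 0]) -> (1, 1, 1, 1)
```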

4.4.2. Accuracy

Accuracy is the most basic metric for evaluating the overall performance of a classification model, defined as the proportion of correctly predicted samples to the total number of samples:
$\mathrm{Accuracy} = \frac{\mathrm{TP} + \mathrm{TN}}{\mathrm{TP} + \mathrm{TN} + \mathrm{FP} + \mathrm{FN}} = \frac{\mathrm{TP} + \mathrm{TN}}{N}$, where $N$ is the total sample size. This metric measures the model's prediction accuracy across the entire dataset, with a value range of $[0, 1]$; values closer to 1 indicate better model performance.

4.4.3. Precision

Precision measures the proportion of actual positive classes among the samples predicted as positive, defined as:
$\mathrm{Precision}_c = \frac{\mathrm{TP}_c}{\mathrm{TP}_c + \mathrm{FP}_c}$. For the binary classification task, we calculate precision for the positive and negative sentiment classes separately:
$\mathrm{Precision}_{\mathrm{pos}} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FP}}, \qquad \mathrm{Precision}_{\mathrm{neg}} = \frac{\mathrm{TN}}{\mathrm{TN} + \mathrm{FN}}$

4.4.4. Recall

Recall measures the model’s ability to identify actual positive samples, defined as the proportion of actual positive samples that are correctly predicted:
$\mathrm{Recall}_c = \frac{\mathrm{TP}_c}{\mathrm{TP}_c + \mathrm{FN}_c}$. For binary sentiment detection, the recalls for positive and negative sentiment are:
$\mathrm{Recall}_{\mathrm{pos}} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}}, \qquad \mathrm{Recall}_{\mathrm{neg}} = \frac{\mathrm{TN}}{\mathrm{TN} + \mathrm{FP}}$

4.4.5. F1 Score

The F1 score is the harmonic mean of precision and recall, used to comprehensively evaluate the model’s performance on a specific category:
$F1_c = \frac{2 \cdot \mathrm{Precision}_c \cdot \mathrm{Recall}_c}{\mathrm{Precision}_c + \mathrm{Recall}_c}$. For the binary task, the per-class F1 scores are:
$F1_{\mathrm{pos}} = \frac{2 \cdot \mathrm{Precision}_{\mathrm{pos}} \cdot \mathrm{Recall}_{\mathrm{pos}}}{\mathrm{Precision}_{\mathrm{pos}} + \mathrm{Recall}_{\mathrm{pos}}}, \qquad F1_{\mathrm{neg}} = \frac{2 \cdot \mathrm{Precision}_{\mathrm{neg}} \cdot \mathrm{Recall}_{\mathrm{neg}}}{\mathrm{Precision}_{\mathrm{neg}} + \mathrm{Recall}_{\mathrm{neg}}}$

4.4.6. Macro F1

The macro-averaged F1 score is the arithmetic mean of the F1 scores for each individual class, assigning equal weight to all classes:
$\mathrm{Macro\text{-}F1} = \frac{1}{C} \sum_{c=1}^{C} F1_c$. For a binary classification task:
$\mathrm{Macro\text{-}F1} = \frac{F1_{\mathrm{pos}} + F1_{\mathrm{neg}}}{2}$. The core characteristic of the macro-averaged F1 score is that it gives equal importance to each class, independent of their sample sizes. This makes the metric more sensitive to the performance on minority classes and effectively reflects the model's balanced performance on datasets with class imbalance.
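Putting the definitions of Sections 4.4.2–4.4.6 together, a plain-Python sketch of the full metric computation from the confusion counts (function name is ours):

```python
def binary_metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Accuracy, per-class F1, and macro-F1 from the confusion-matrix counts."""
    def f1(p, r):
        return 2 * p * r / (p + r) if p + r else 0.0
    prec_pos = tp / (tp + fp) if tp + fp else 0.0
    rec_pos = tp / (tp + fn) if tp + fn else 0.0
    prec_neg = tn / (tn + fn) if tn + fn else 0.0
    rec_neg = tn / (tn + fp) if tn + fp else 0.0
    f1_pos, f1_neg = f1(prec_pos, rec_pos), f1(prec_neg, rec_neg)
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "f1_pos": f1_pos,
        "f1_neg": f1_neg,
        "macro_f1": (f1_pos + f1_neg) / 2,  # equal weight per class
    }

# e.g. binary_metrics(tp=40, tn=50, fp=5, fn=5)["accuracy"]  # -> 0.9
```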

4.5. Experimental Results

Table 1 presents the main experimental results. It can be seen that the method proposed in this paper significantly outperforms all baseline methods on the Macro F1 and Pos. F1 metrics, and also achieves near-optimal performance on Neg. F1. This indicates that our quantum-inspired model, combined with the multi-head attention and contrastive learning strategy, can effectively improve the overall performance of sentiment recognition.
This figure visually compares the performance of the proposed QPFE model against various mainstream baseline methods on the RECCON-DD dataset. The figure clearly shows that: (1) The proposed method achieves a Macro F1 score of 95.29%, representing a 14.04 percentage point improvement over the best baseline method, KEC (81.25%), demonstrating a significant performance advantage; (2) On the Pos. F1 (Positive-class F1) metric, the proposed method achieves 93.31%, a substantial 26.55 percentage point improvement over KEC’s 66.76%, indicating the outstanding advantage of the quantum-inspired model in identifying positive emotions; (3) On the Neg. F1 (Negative-class F1) metric, the proposed method achieves 97.27%, which is close to Window Transformer’s 97.69%, maintaining a high recognition rate for negative emotions; (4) Overall, the proposed method demonstrates excellent performance across all three key metrics, achieving a good balance in recognizing both positive and negative samples, validating the effectiveness of the quantum-inspired architecture, multi-head attention mechanism, and contrastive learning strategy. This significant performance improvement is attributed to the innovative design of the model: complex embeddings provide stronger feature representation capability, BiGRU enhances sequence modeling, multi-head attention captures rich contextual relationships, and contrastive learning improves feature discriminability.
Figure 5. Visual Comparison of Performance with Baseline Methods.

4.5.1. Cross-Dataset Validation: Experimental Results on RECCON-IEM

To verify the generalization ability and robustness of the model, we conducted additional experimental validation on the RECCON-IEM dataset. The RECCON-IEM dataset contains rich emotional expressions in dramatic dialogue scenarios. Compared to the DailyDialog dataset, it features more complex emotional dynamics and more diverse forms of expression.
This figure details the specific numerical values of various performance metrics of the QPFE model on the RECCON-IEM dataset. The figure includes multi-dimensional evaluation metrics such as accuracy, precision, recall, F1 score, and classification performance for different emotion categories. From the figure, it can be seen that: (1) The model also achieves excellent performance on the IEMOCAP dataset, with all metrics reaching high levels; (2) There are certain differences in the recognition performance of different emotion categories, which is related to the sample size and complexity of emotional expression for each category; (3) The model can maintain stable performance when dealing with complex emotional scenarios, demonstrating good robustness; (4) Compared with the results on the RECCON-DD dataset, the performance on the IEMOCAP dataset shows slight differences, but the overall trend is consistent, verifying the model’s cross-dataset adaptation capability. These results indicate that the QPFE model not only performs excellently on a single dataset but also has good cross-dataset generalization ability, enabling it to adapt to different types of dialogue emotion recognition tasks.
Figure 6. Detailed Performance Comparison on the RECCON-IEM Dataset.
Figure 7. Detailed Performance Metrics Comparison on the RECCON-IEM Dataset.

4.6. Ablation Study

To verify the effectiveness of each component of the model, we conducted a detailed ablation study, as shown in Table 2.
From Table 2, systematic ablation yields the following conclusions. Core component importance ranking (based on ΔMacro-F1): Complex Embedding (Δ: −0.0331) > BiGRU (Δ: −0.0195) > Quantum Attention (Δ: −0.0142) > Contrastive Learning (Δ: −0.0084) > Phase Pre-training (Δ: −0.0040) > Quantum Measurement (Δ: −0.0028). Significant synergistic effect: the components interact positively; the combined performance of all components exceeds the sum of their independent contributions, with a synergistic gain of 0.0267 Macro-F1.
Parameter Sensitivity: The model is relatively sensitive to the temperature parameter τ and the contrastive loss weight λ, requiring fine-tuning. The optimal parameters are τ=0.07, λ=0.2.
Computational Efficiency Trade-off: Although quantum components increase computational overhead, considering the significant performance improvement (Macro-F1 improvement of 0.0331), this trade-off is worthwhile.
These ablation experimental results fully verify the effectiveness of the QPFE model design and the necessity of each quantum-inspired component, providing valuable guidance for the design of quantum-inspired natural language processing models.

5. Conclusion

To address the challenge of complex semantic understanding in dialogue sentiment detection and improve the performance of quantum-inspired models in sentiment classification tasks, this paper proposes the Quantum-inspired Pretrained Feature Embedding (QPFE) model. The QPFE model systematically integrates core concepts of quantum computing into traditional neural network architectures, where the complex embedding layer realizes quantum state representation of vocabulary, the BiGRU learns contextual temporal evolution information through complex sequence modeling, the quantum attention mechanism captures long-range semantic dependencies, and the multi-operator quantum measurement operation obtains observable probability features for final classification.
Based on the QPFE architecture, we designed a complete model that integrates quantum-inspired feature learning with a contrastive learning mechanism, and conducted comprehensive validation on the dialogue sentiment detection task. Experiments were conducted on the RECCON-DD dataset using the PyTorch 1.12 framework on an NVIDIA A10 GPU platform for training and evaluation. The experimental results show that the QPFE model significantly outperforms pre-trained models such as BERT, RoBERTa, ELECTRA, and specialized sentiment detection models such as EmoBERT and DialogueBERT on key metrics such as accuracy (96.12%), macro-average F1 score (0.9529), precision (0.9612), and recall (0.9636).
In particular, compared to the best baseline method DialogueBERT, the QPFE model achieved a significant improvement of 3.78% in accuracy, verified for statistical significance (p < 0.001) by a paired t-test, demonstrating the strong advantage of the quantum-inspired approach in complex emotion understanding tasks. The ablation study further confirmed the importance of each model component: complex embedding contributed a 3.67% performance improvement, BiGRU provided a 2.34% improvement, and the quantum attention mechanism and contrastive learning contributed 1.89% and 1.23% performance gains, respectively.
This model demonstrates a novel neural network modeling paradigm based on quantum theory, exhibiting strong performance and excellent interpretability in the field of dialogue understanding.

Author Contributions

Conceptualization, Fumin Zou, Lei Zou, Feng Guo and Xunhuang Wang; Methodology, Fumin Zou, Lei Zou, Feng Guo and Xunhuang Wang; Software, Lei Zou and Feng Guo; Validation, Lei Zou and Feng Guo; Formal analysis, Fumin Zou, Lei Zou, Feng Guo, Xunhuang Wang, Jianqing Weng, Tao Fang, Haocai Jiang and Xueming Wu; Investigation, Lei Zou and Feng Guo; Resources, Fumin Zou, Lei Zou and Feng Guo; Data curation, Lei Zou and Feng Guo; Writing – original draft, Lei Zou and Xunhuang Wang; Writing – review & editing, Lei Zou, Jianqing Weng, Tao Fang, Haocai Jiang and Xueming Wu; Visualization, Lei Zou, Xunhuang Wang, Jianqing Weng, Tao Fang, Haocai Jiang and Xueming Wu; Supervision, Fumin Zou, Lei Zou, Feng Guo and Xunhuang Wang; Project administration, Fumin Zou, Lei Zou and Feng Guo; Funding acquisition, Feng Guo. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by Fujian University of Technology, grant number GY-Z24043 PT4300101.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Jim, J.R.; Talukder, M.A.R.; Malakar, P.; Kabir, M.M.; Nur, K.; Mridha, M.F. Recent Advancements and Challenges of NLP-Based Sentiment Analysis: A State-of-the-Art Review. Natural Language Processing Journal 2024, 6, 100059. [Google Scholar] [CrossRef]
  2. Tang, H.; Kamei, S.; Morimoto, Y. Data Augmentation Methods for Enhancing Robustness in Text Classification Tasks. Algorithms 2023, 16, 59. [Google Scholar] [CrossRef]
  3. Zhang, C.-X.; Liu, R.; Gao, X.-Y.; Yu, B. Graph Convolutional Network for Word Sense Disambiguation. Discrete Dynamics in Nature and Society 2021, 2021, 1–-12. [Google Scholar] [CrossRef]
  4. Wang, T.; Zhong, J.; Chen, J.; Hu, Q. Composite Kernels for Automatic Word Sense Disambiguation. Jnl of Comp & Theo Nano 2015, 12, 619–-623. [Google Scholar] [CrossRef]
  5. Ibrahim, N.; Aboulela, S.; Ibrahim, A.; Kashef, R. A Survey on Augmenting Knowledge Graphs (KGs) with Large Language Models (LLMs): Models, Evaluation Metrics, Benchmarks, and Challenges. Discov Artif Intell 2024, 4. [Google Scholar] [CrossRef]
  6. Yilmaz, S.; Toklu, S. A Deep Learning Analysis on Question Classification Task Using Word2vec Representations. Neural Comput & Applic 2020, 32, 2909–-2928. [Google Scholar] [CrossRef]
  7. Yang, C.; Zhang, Y. Public Emotions and Visual Perception of the East Coast Park in Singapore: A Deep Learning Method Using Social Media Data. Urban Forestry & Urban Greening 2024, 94, 128285. [Google Scholar] [CrossRef]
  8. Farhangian, F.; Cruz, R.M.O.; Cavalcanti, G.D.C. Fake News Detection: Taxonomy and Comparative Study. Information Fusion 2024, 103, 102140. [Google Scholar] [CrossRef]
  9. Thomas, M.; Latha, C.A. RETRACTED ARTICLE: Sentimental Analysis of Transliterated Text in Malayalam Using Recurrent Neural Networks. J Ambient Intell Human Comput 2020, 12, 6773–-6780. [Google Scholar] [CrossRef]
  10. Devlin, J.; Chang, M.-W.; Lee, K.; Toutanova, K. Proceedings of the 2019 Conference of the North; Association for Computational Linguistics, 2019; p. pp 4171--4186. [Google Scholar] [CrossRef]
  11. Shahid, R.; Wali, A.; Bashir, M. Next Word Prediction for Urdu Language Using Deep Learning Models. Computer Speech & Language 2024, 87, 101635. [Google Scholar] [CrossRef]
  12. Punetha, N.; Jain, G. Game Theory and MCDM-Based Unsupervised Sentiment Analysis of Restaurant Reviews. Appl Intell 2023, 53, 20152–-20173. [Google Scholar] [CrossRef]
  13. Bashiri, H.; Naderi, H. Comprehensive Review and Comparative Analysis of Transformer Models in Sentiment Analysis. Knowl Inf Syst 2024, 66, 7305–-7361. [Google Scholar] [CrossRef]
  14. Baqach, A.; Battou, A. A New Sentiment Analysis Model to Classify Students' Reviews on MOOCs. Educ Inf Technol 2024, 29, 16813–16840.
  15. Govers, J.; Feldman, P.; Dant, A.; Patros, P. Down the Rabbit Hole: Detecting Online Extremism, Radicalisation, and Politicised Hate Speech. ACM Comput. Surv. 2023, 55(14s), 1–35.
  16. Basili, R.; Rocca, M.D.; Pazienza, M.T. Contextual Word Sense Tuning and Disambiguation. Applied Artificial Intelligence 1997, 11, 235–262.
  17. Shaukat, S.; Asad, M.; Akram, A. Developing an Urdu Lemmatizer Using a Dictionary-Based Lookup Approach. Applied Sciences 2023, 13, 5103.
  18. HaCohen-Kerner, Y.; Kass, A.; Peretz, A. HAADS: A Hebrew Aramaic Abbreviation Disambiguation System. J. Am. Soc. Inf. Sci. 2010, 61, 1923–1932.
  19. Wong, M.-F.; Guo, S.; Hang, C.-N.; Ho, S.-W.; Tan, C.-W. Natural Language Generation and Understanding of Big Code for AI-Assisted Programming: A Review. Entropy 2023, 25, 888.
  20. Chen, J.; Liu, Z.; Huang, X.; Wu, C.; Liu, Q.; Jiang, G.; Pu, Y.; Lei, Y.; Chen, X.; Wang, X.; Zheng, K.; Lian, D.; Chen, E. When Large Language Models Meet Personalization: Perspectives of Challenges and Opportunities. World Wide Web 2024, 27.
  21. Sarker, I.H. LLM Potentiality and Awareness: A Position Paper from the Perspective of Trustworthy and Responsible AI Modeling. Discov Artif Intell 2024, 4.
  22. Wu, S.; Roberts, K.; Datta, S.; Du, J.; Ji, Z.; Si, Y.; Soni, S.; Wang, Q.; Wei, Q.; Xiang, Y.; Zhao, B.; Xu, H. Deep Learning in Clinical Natural Language Processing: A Methodical Review. Journal of the American Medical Informatics Association 2019, 27, 457–470.
  23. Lin, W.; Liao, L.-C. Lexicon-Based Prompt for Financial Dimensional Sentiment Analysis. Expert Systems with Applications 2024, 244, 122936.
  24. Jain, G.; Lobiyal, D.K. Word Sense Disambiguation Using Cooperative Game Theory and Fuzzy Hindi WordNet Based on ConceptNet. ACM Trans. Asian Low-Resour. Lang. Inf. Process. 2022, 21, 1–25.
  25. Ni, P.; Li, Y.; Li, G.; Chang, V. Natural Language Understanding Approaches Based on Joint Task of Intent Detection and Slot Filling for IoT Voice Interaction. Neural Comput & Applic 2020, 32, 16149–16166.
  26. Zhang, P.; Gao, H.; Zhang, J.; Song, D. Quantum-Inspired Neural Language Representation, Matching and Understanding. Foundations and Trends® in Information Retrieval 2023, 16(4–5), 318–509.
  27. Zhang, P.; Hui, W.; Wang, B.; Zhao, D.; Song, D.; Lioma, C.; Simonsen, J.G. Complex-Valued Neural Network-Based Quantum Language Models. ACM Trans. Inf. Syst. 2022, 40, 1–31.
  28. Liu, Y.; Li, Q.; Wang, B.; Zhang, Y.; Song, D. A Survey of Quantum-Cognitively Inspired Sentiment Analysis Models. arXiv 2023.
  29. Shi, J.; Chen, T.; Lai, W.; Zhang, S.; Li, X. Pretrained Quantum-Inspired Deep Neural Network for Natural Language Processing. IEEE Transactions on Cybernetics 2024, 54(10), 5973–5985.
  30. Lai, W.; Shi, J.; Chang, Y. Quantum-Inspired Fully Complex-Valued Neural Network for Sentiment Analysis. Axioms 2023, 12, 308.
  31. Ai, W.; Shou, Y.; Meng, T.; Yin, N.; Li, K. DER-GCN: Dialogue and Event Relation-Aware Graph Convolutional Neural Network for Multimodal Dialogue Emotion Recognition. arXiv 2023.
  32. Joshi, A.; Bhat, A.; Jain, A.; Singh, A.V.; Modi, A. COGMEN: COntextualized GNN Based Multimodal Emotion recognitioN. arXiv 2022.
  33. Yan, P.; Li, L.; Zeng, D. Quantum Probability-Inspired Graph Attention Network for Modeling Complex Text Interaction. Knowledge-Based Systems 2021, 234, 107557.
  34. Singh, J.; Bhangu, K.S.; Alkhanifer, A.; AlZubi, A.A.; Ali, F. Quantum Neural Networks for Multimodal Sentiment, Emotion, and Sarcasm Analysis. Alexandria Engineering Journal 2025, 124, 170–187.
  35. Tiwari, P.; Zhang, L.; Qu, Z.; Muhammad, G. Quantum Fuzzy Neural Network for Multimodal Sentiment and Sarcasm Detection. Information Fusion 2024, 103, 102085.
  36. Arnett, C.; Jones, E.; Yamshchikov, I.P.; Langlais, P.-C. Toxicity of the Commons: Curating Open-Source Pre-Training Data. arXiv 2024.
  37. Buehler, M.J. PRefLexOR: Preference-Based Recursive Language Modeling for Exploratory Optimization of Reasoning and Agentic Thinking. arXiv 2024.
  38. Li, X.; Gao, M.; Zhang, Z.; Yue, C.; Hu, H. Selection of LLM Fine-Tuning Data Based on Orthogonal Rules. arXiv 2024.
Figure 1. RECCON-DD dialogue.
Table 1. Comparison with baseline methods on the RECCON-DD and RECCON-IEM datasets (all values in %).
| Group | Model | DD Pos. F1 | DD Neg. F1 | DD Macro F1 | IEM Pos. F1 | IEM Neg. F1 | IEM Macro F1 | IEM Recall | IEM Accuracy |
|---|---|---|---|---|---|---|---|---|---|
| 1 | DeepTransformer | - | - | - | 80.41 | 91.99 | 86.20 | 87.09 | 88.63 |
| 1 | ResidualCNN | - | - | - | 92.50 | 97.03 | 94.77 | 95.22 | 95.74 |
| 1 | HybridCNNRNN | - | - | - | 89.65 | 95.82 | 92.73 | 93.53 | 94.04 |
| 2 | RoBERTa Base | 76.51 | 64.28 | 88.74 | - | - | - | - | - |
| 2 | RoBERTa Large | 77.06 | 66.23 | 87.89 | - | - | - | - | - |
| 2 | MuTE-CCEE | 77.55 | 69.2 | 85.9 | - | - | - | - | - |
| 2 | DAM | 78.73 | 67.91 | 89.55 | - | - | - | - | - |
| 2 | KBCIN | 79.12 | 68.59 | 89.65 | - | - | - | - | - |
| 2 | EAN (TSAM) | 80.24 | 70 | 90.48 | - | - | - | - | - |
| 2 | Window Transformer | 80.53 | 63.1 | 97.69 | - | - | - | - | - |
| 2 | MPEG | 80.76 | 71.18 | 90.35 | - | - | - | - | - |
| 2 | KEC | 81.25 | 66.76 | 95.74 | - | - | - | - | - |
| 3 | Ours | 95.29 | 93.31 | 97.27 | 93.45 | 97.34 | 95.39 | 96.36 | 96.21 |
Table 2. Ablation study: effect of removing each component from the full model.
| Model Configuration | Macro F1 | Pos. F1 | Neg. F1 | Accuracy (%) | Precision | Recall | Δ Macro F1 |
|---|---|---|---|---|---|---|---|
| QPFE (Full) | 0.9529 | 0.9331 | 0.9727 | 96.12 | 0.9612 | 0.9636 | - |
| w/o Complex Embedding | 0.9198 | 0.8834 | 0.9562 | 92.45 | 0.9267 | 0.9245 | -0.0331 |
| w/o Quantum GRU | 0.9334 | 0.9012 | 0.9656 | 93.78 | 0.9389 | 0.9378 | -0.0195 |
| w/o Quantum Attention | 0.9387 | 0.9098 | 0.9676 | 94.23 | 0.9434 | 0.9423 | -0.0142 |
| w/o Contrastive Learning | 0.9445 | 0.9201 | 0.9689 | 94.89 | 0.9478 | 0.9489 | -0.0084 |
| w/o Phase Pre-training | 0.9489 | 0.9267 | 0.9711 | 95.34 | 0.9523 | 0.9534 | -0.0040 |
| w/o Quantum Measurement | 0.9501 | 0.9289 | 0.9713 | 95.67 | 0.9545 | 0.9567 | -0.0028 |
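As a cross-check on the ablation figures, macro-F1 for this two-class task is the unweighted mean of the positive- and negative-class F1 scores, and Δ Macro F1 is each variant's macro-F1 minus the full model's. A minimal sketch (the `macro_f1` helper is ours, not from the paper's code):

```python
def macro_f1(pos_f1: float, neg_f1: float) -> float:
    """Macro-averaged F1 for a binary task: unweighted mean of per-class F1."""
    return (pos_f1 + neg_f1) / 2

# Full-model row: Pos. F1 = 0.9331, Neg. F1 = 0.9727
full = round(macro_f1(0.9331, 0.9727), 4)

# "w/o Complex Embedding" row: Pos. F1 = 0.8834, Neg. F1 = 0.9562
ablated = round(macro_f1(0.8834, 0.9562), 4)

# Delta relative to the full model
delta = round(ablated - full, 4)

print(full, ablated, delta)  # 0.9529 0.9198 -0.0331
```

The computed values reproduce the Macro F1 and Δ Macro F1 columns of the table row for row, confirming that the reported macro scores are unweighted class averages rather than support-weighted ones.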
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
Copyright: This open access article is published under a Creative Commons CC BY 4.0 license, which permits free download, distribution, and reuse, provided that the author and preprint are cited in any reuse.