Preprint
Article

This version is not peer-reviewed.

Multi-Modal Data-Driven Sentiment Analysis in Online Public Opinion During Public Health Emergencies

Submitted:

03 June 2026

Posted:

04 June 2026


Abstract
During public health emergencies (PHEs), social media generates massive online streaming features, yet only a sparse subset is truly relevant to sentiment analysis. Moreover, multimodal fusion must account for cross-modal interactions between text and images. To address these two issues, this study proposes an online multimodal sentiment analysis framework for PHE-related public opinion. Firstly, to handle the sparse relevant features among numerous online streaming features, we develop a Multimodal Online Divide-and-Conquer Markov Blanket Learning algorithm that incrementally selects robust sentiment-relevant features. Secondly, to capture cross-modal interactions, we design a Cross-Modal Interactive Enhanced Fusion Network with a two-stage cross-modal interactive attention mechanism, complemented by adaptive modal weighting and residual connections. Experiments on a large-scale PHE multimodal dataset show that our full model (O-DC + Our Fusion) achieves the best performance: 89.3% precision, 85.3% recall, 87.2% F1-score, and 87.1% accuracy, significantly outperforming state-of-the-art baselines.

1. Introduction

In recent years, public health emergencies (PHEs) have occurred frequently worldwide, such as the COVID-19 pandemic and Ebola virus outbreaks. During such events, social media platforms serve as central arenas where the public expresses emotions, disseminates information, and seeks support[1]. This data is inherently multimodal: text directly expresses viewpoints and emotions, while images contain rich contextual information and emotional cues, often reinforcing, supplementing, or even correcting the emotional tendencies conveyed by the text[2,3,4].
Therefore, accurate and real-time sentiment analysis of multimodal online public opinion during PHEs holds strategic value for relevant authorities to promptly grasp public sentiment trends, identify panic and rumors, assess the socio-psychological impact of intervention policies, and implement targeted risk communication and psychological support[5,6,7]. However, PHE public opinion data presents core challenges: social media generates massive online streaming features, yet only a sparse subset is truly relevant to sentiment analysis. Moreover, multimodal fusion must account for cross-modal interactions between text and images, rendering traditional sentiment analysis methods inadequate[8].
Current research faces two major issues. ① Online feature selection: PHE public opinion data arrives continuously as a stream, and the feature space changes dynamically. Traditional batch-mode feature selection methods cannot adapt to this setting, in which the sparse subset of sentiment-relevant features must be selected from massive online streaming features. Although online feature selection research exists, it mostly focuses on single-modal text and lacks efficient online learning mechanisms for complex interdependencies among features. ② Cross-modal fusion: Simple feature concatenation or late decision fusion struggles to capture complex non-linear interactions and fine-grained semantic alignment between text and images. Despite advances in deep learning-based multimodal fusion, two issues remain overlooked in the context of public health emergencies. One is modeling true bidirectional, multi-level interactions across modalities, beyond coarse alignment, to extract complementary cues. The other is dynamic, context-aware fusion: the contribution of text versus image should not be fixed. Consider a post where the image shows a packed hospital corridor while the text merely says “a bit worried.” Here, the visual emotion clearly dominates, yet typical methods fail to adapt accordingly.
Addressing the aforementioned challenges, the study proposes an innovative solution with the following main contributions: Firstly, we propose the Online Divide-and-Conquer Markov Blanket Learning Algorithm. We extend Markov blanket theory to the streaming multimodal feature selection scenario. This algorithm can approximate the sparse subset that is truly relevant to sentiment analysis from massive online streaming features via incremental updates, while ensuring time and space efficiency. Secondly, we design the Cross-Modal Interactive Enhanced Fusion Network. We introduce a Two-Stage Cross-Modal Interactive Attention mechanism and an Adaptive Modal Weighting module, achieving deep semantic fusion from local to global and from static to dynamic. Combined with residual connections, we construct a powerful and stable core model for multimodal emotion recognition.
The remainder of this paper is structured as follows: Section 2 reviews related work. Section 3 formally defines the problem and introduces the overall framework. Section 4 details the principles and derivation of the Online Divide-and-Conquer Markov Blanket Learning Algorithm. Section 5 elaborates on the design details of the Cross-Modal Interactive Enhanced Fusion Network. Section 6 presents the experimental setup, result analysis, and discussion. Section 7 concludes the paper and outlines future research directions.

3. Problem Definition and Overall Framework

3.1. Problem Formulation

Let a multimodal data case arriving at time step $i$ be $S_i = (X_i^T, X_i^I, Y_i)$, where $X_i^T \in \mathbb{R}^{d_T}$ denotes the raw high-dimensional feature vector for the text modality, $X_i^I \in \mathbb{R}^{d_I}$ denotes the raw feature vector for the image modality, and $Y_i \in \mathcal{Y} = \{\text{positive}, \text{neutral}, \text{negative}\}$ is the sentiment label.
The total raw feature space is $F_i = X_i^T \cup X_i^I$, whose dimension $d = d_T + d_I$ can be very large and grows with the emergence of new words and visual patterns. Our objective is to design an online learning model $M$ such that at any time step $i$, it can dynamically select a feature subset $\mathcal{S}_i \subseteq F_i$ from the high-dimensional streaming feature space $F_i$, with low redundancy and strong relevance to $Y_i$. Then, based on $\mathcal{S}_i$, a powerful classification function $f: \tilde{X}_i^T \times \tilde{X}_i^I \to \mathcal{Y}$ is constructed to accurately predict $Y_i$, where $\tilde{X}_i^T, \tilde{X}_i^I$ are the feature representations after selection.

3.2. Overall Framework

The proposed framework consists of two core stages. ① Online Multimodal Feature Selection Stage: For each newly arrived case $S_i$, its raw text and image features are input into the Online Divide-and-Conquer Markov Blanket Learning Algorithm. This algorithm maintains two dynamic “feature pools”: the text-relevant feature set $PC^T$ and the image-relevant feature set $PC^I$. Based on incremental mutual information computation, the algorithm decides whether to include new features in the corresponding pool or discard old ones. The outputs are the filtered text feature subset $\tilde{X}_i^T$ and image feature subset $\tilde{X}_i^I$. ② Cross-Modal Interactive Enhanced Fusion and Classification Stage: The filtered features are first encoded by modality-specific encoders to obtain deep representations $H^T$ and $H^I$. Subsequently, they enter the Cross-Modal Interactive Enhanced Fusion Network. This network generates a context-aware fused representation $H_{fusion}$ through two-stage interactive attention and adaptive weight assignment. Finally, $H_{fusion}$ passes through residual connections and a fully connected classification layer to output the sentiment prediction $\hat{Y}_i$. The model is updated via an online cross-entropy loss.

4. Multimodal Online Divide-And-Conquer Markov Blanket Learning Algorithm

In real-world streaming data applications, such as social media monitoring during public health emergencies (PHEs), raw features naturally arrive in multimodal form. The primary raw features include textual content and visual content. Traditional online feature selection methods often operate on pre-processed, unimodal representations, lacking a principled framework to jointly and efficiently process heterogeneous data streams.
Online Markov Blanket (MB) learning aims to dynamically identify the minimal optimal feature set for predicting a target variable from a stream of incoming features. However, existing methods face three core challenges in multimodal settings: (1) Cross-modal redundancy, where features from different modalities convey overlapping information; (2) Heterogeneous dependency modeling, due to the distinct statistical nature of text (discrete) and image (continuous) features; and (3) Computational inefficiency, especially when exhaustive subset enumeration is required.
To address these challenges, we propose the Multimodal Online Divide-and-Conquer Markov Blanket Learning (M-O-DC) algorithm.

4.1. Markov Blanket and Mutual Information

In probabilistic graphical models, the Markov blanket $MB(Y)$ of a target variable $Y$ is defined as the set of variables that renders $Y$ conditionally independent of all other variables in the graph. Formally, for any variable $X_i \notin MB(Y)$, we have $X_i \perp Y \mid MB(Y)$. The Markov blanket of $Y$ consists of its parents, children, and the other parents of its children (spouses). For classification tasks, $MB(Y)$ constitutes the minimal sufficient feature subset for predicting $Y$, as it contains all the information necessary to determine $Y$ while discarding irrelevant and redundant features.
Mutual information (MI) is a fundamental measure of dependency between two random variables:
$$I(X; Y) = \sum_{x \in X} \sum_{y \in Y} p(x, y) \log \frac{p(x, y)}{p(x)\, p(y)}$$
Conditional mutual information $I(X; Y \mid Z)$ quantifies the association between $X$ and $Y$ given $Z$, and is defined analogously using conditional probabilities:
$$I(X; Y \mid Z) = \sum_{x, y, z} p(x, y, z) \log \frac{p(x, y \mid z)}{p(x \mid z)\, p(y \mid z)}$$
A feature $X_i$ belongs to $MB(Y)$ if and only if it provides unique predictive information about $Y$ that is not contained in any other subset of $MB(Y)$. This property is fundamental for online feature selection: we aim to incrementally maintain an approximation of $MB(Y)$ as features arrive in a stream.
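The two definitions above can be checked numerically on small probability tables. The following is a minimal sketch, not the authors' implementation: the function names and the XOR toy distribution are illustrative assumptions. The toy case shows a variable pair with zero marginal MI whose conditional MI is positive, which is the kind of dependence that spouse discovery relies on.

```python
import math

def mutual_information(joint):
    """I(X;Y) in nats from a joint probability table joint[x][y]."""
    px = [sum(row) for row in joint]
    py = [sum(col) for col in zip(*joint)]
    mi = 0.0
    for i, row in enumerate(joint):
        for j, pxy in enumerate(row):
            if pxy > 0.0:  # the 0 * log 0 terms are taken as 0
                mi += pxy * math.log(pxy / (px[i] * py[j]))
    return mi

def conditional_mutual_information(joint3):
    """I(X;Y|Z) in nats from joint3[x][y][z] = p(x, y, z)."""
    nx, ny, nz = len(joint3), len(joint3[0]), len(joint3[0][0])
    cmi = 0.0
    for z in range(nz):
        pz = sum(joint3[x][y][z] for x in range(nx) for y in range(ny))
        if pz == 0.0:
            continue
        for x in range(nx):
            pxz = sum(joint3[x][y][z] for y in range(ny))
            for y in range(ny):
                pyz = sum(joint3[xx][y][z] for xx in range(nx))
                pxyz = joint3[x][y][z]
                if pxyz > 0.0:
                    # p(x,y|z) / (p(x|z) p(y|z)) = p(x,y,z) p(z) / (p(x,z) p(y,z))
                    cmi += pxyz * math.log(pxyz * pz / (pxz * pyz))
    return cmi

# Hypothetical toy distribution: X and Z are fair coins, Y = X XOR Z.
xor = [[[0.25 if y == (x ^ z) else 0.0 for z in range(2)]
        for y in range(2)] for x in range(2)]
xy = [[sum(xor[x][y]) for y in range(2)] for x in range(2)]  # marginalise Z
```

Here `mutual_information(xy)` is zero while `conditional_mutual_information(xor)` equals log 2: conditioning on Z reveals a dependence between X and Y that is invisible marginally.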

4.2. M-O-DC Algorithm

4.2.1. Problem Setup and Multimodal Input Definition

Let the streaming feature space consist of a sequence of textual features $F^T = \{X_i^T\}$ and image features $F^I = \{X_i^I\}$, arriving sequentially. The sentiment label $Y$ is influenced by both modalities. We define the candidate PC set per modality, denoted $C\_PC_Y^m$ for $m \in \{T, I\}$, as the dynamically maintained set of features that are likely parents or children of $Y$ within modality $m$.

4.2.2. Decoupled Multimodal Processing

Input: Current text PC set $C\_PC_Y^T$, image PC set $C\_PC_Y^I$, new feature $X_i^m$, estimated target distribution $\hat{p}_t(y)$, and modality-specific thresholds $\tau_{inc}^m$, $\tau_{red}^m$. Here $\tau_{inc}^m$ is the inclusion threshold: when a new feature $X_i^m$ arrives, if its mutual information with the target variable $Y$, denoted $I(X_i^m; Y)$, exceeds this threshold, the algorithm considers $X_i^m$ a candidate for the Parent-Child (PC) set.
Output: Updated multimodal Markov Blanket candidate set $MB(Y)$.
Step A: Intra-modal PC Learning
For a newly arrived feature $X_i^m$:
① Independence Test: Compute $I(X_i^m; Y)$. If $I(X_i^m; Y) < \tau_{inc}^m$, discard $X_i^m$ as irrelevant.
② Redundancy Removal: Compare $X_i^m$ with features in $C\_PC_Y^m$ using symmetric uncertainty (SU). Remove any $X_k^m \in C\_PC_Y^m$ that becomes redundant given $X_i^m$.
③ PC Update: If $X_i^m$ provides significant non-redundant information, add it to $C\_PC_Y^m$.
Step B: Cross-modal Spouse Learning
When a new PC node $X_i^m$ is identified, trigger cross-modal spouse discovery:
① For each cross-modal candidate $Z_j^n \in C\_PC_Y^n$ ($n \neq m$), check if $X_i^m \in PC(Z_j^n)$ or vice versa.
② Apply the Cross-modal Spouse Identification Theorem: If $Z_j^n$ and $X_i^m$ are strongly associated, and the presence of $X_i^m$ increases the conditional mutual information, i.e., $I(Y; Z_j^n \mid X_i^m) > I(Y; Z_j^n)$, then $Z_j^n$ is a cross-modal spouse of $Y$.
③ Add qualifying $Z_j^n$ to the spouse set $SP(Y)$, which is inherently multimodal.
This design ensures that a newly discovered textual PC can help identify a relevant visual spouse, and vice versa, enabling a complete causal structure discovery across modalities.
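Step A can be sketched for discrete features as follows. This is an illustrative sketch under stated assumptions, not the paper's procedure: the text does not spell out the redundancy criterion, so the rule used here (an existing feature is dropped when its symmetric uncertainty with the new feature exceeds a redundancy threshold and it is no more relevant to the label) is an assumption, as are the function names, the `pc` dictionary layout, and the use of empirical entropies over a stored sample.

```python
import math
from collections import Counter

def entropy(values):
    """Empirical Shannon entropy (nats) of a sequence of discrete values."""
    n = len(values)
    return -sum(c / n * math.log(c / n) for c in Counter(values).values())

def mutual_info(xs, ys):
    """Empirical I(X;Y) = H(X) + H(Y) - H(X,Y)."""
    return entropy(xs) + entropy(ys) - entropy(list(zip(xs, ys)))

def symmetric_uncertainty(xs, ys):
    """SU(X,Y) = 2 I(X;Y) / (H(X) + H(Y)), which lies in [0, 1]."""
    denom = entropy(xs) + entropy(ys)
    return 0.0 if denom == 0.0 else 2.0 * mutual_info(xs, ys) / denom

def update_pc(pc, name, values, labels, tau_inc, tau_red):
    """One Step-A update; pc maps feature name -> observed values."""
    # (1) Independence test: discard features that carry too little information.
    if mutual_info(values, labels) < tau_inc:
        return dict(pc)
    su_new = symmetric_uncertainty(values, labels)
    kept = {}
    for other, other_values in pc.items():
        overlap = symmetric_uncertainty(values, other_values)
        su_other = symmetric_uncertainty(other_values, labels)
        # (2) Assumed rule: an old feature is redundant if it overlaps strongly
        # with the new one and is no more relevant to the label.
        if overlap > tau_red and su_new >= su_other:
            continue
        kept[other] = other_values
    # (3) Add the new feature unless a surviving feature already covers it.
    covered = any(symmetric_uncertainty(values, v) > tau_red
                  for v in kept.values())
    if not covered:
        kept[name] = values
    return kept
```

For example, with labels `[0, 0, 1, 1, 0, 0, 1, 1]`, a noisy copy of the label is admitted first, an independent feature is rejected by the inclusion test, and a later exact copy of the label replaces the noisy one.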

4.3. Online Mutual Information Estimation for Text and Image

4.3.1. Text Features

Text features are typically discretized.
Sufficient Statistics: Maintain a count matrix $C^T(f_t, y)$ for each discrete feature $f_t$.
Online Update Rule:
$$C^T(X, y) \leftarrow C^T(X, y) + \mathbb{I}(f_t = X, Y = y) - \text{expired\_contribution}$$
MI Computation (with Laplace Smoothing):
$$I^T(Y; f_t) = \sum_{X, y} \tilde{P}^T(X, y) \log \frac{\tilde{P}^T(X, y)}{\tilde{P}^T(X)\, \tilde{P}(y)}$$
where $\tilde{P}$ denotes smoothed probability estimates.
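A count-based estimator of this kind could look like the sketch below. It is written under assumptions that the text leaves open: the "expired contribution" is modelled as a fixed-length sliding window, smoothing adds a pseudo-count `alpha` to every cell, and the class name `OnlineTextMI` is hypothetical.

```python
import math
from collections import deque

class OnlineTextMI:
    """Sliding-window count matrix for one discrete text feature."""

    def __init__(self, n_values, n_classes, window, alpha=1.0):
        self.counts = [[0] * n_classes for _ in range(n_values)]
        self.recent = deque()
        self.window = window
        self.alpha = alpha  # Laplace pseudo-count; must be positive

    def update(self, x, y):
        """Add the new (feature value, label) pair and expire the oldest."""
        self.counts[x][y] += 1
        self.recent.append((x, y))
        if len(self.recent) > self.window:
            old_x, old_y = self.recent.popleft()
            self.counts[old_x][old_y] -= 1  # expired contribution

    def mi(self):
        """Smoothed estimate of I(Y; f_t) in nats."""
        n_values, n_classes = len(self.counts), len(self.counts[0])
        total = len(self.recent) + self.alpha * n_values * n_classes
        p = [[(c + self.alpha) / total for c in row] for row in self.counts]
        px = [sum(row) for row in p]
        py = [sum(col) for col in zip(*p)]
        return sum(p[i][j] * math.log(p[i][j] / (px[i] * py[j]))
                   for i in range(n_values) for j in range(n_classes))
```

Because old pairs leave the window, the estimate follows the stream: a feature that stops co-varying with the label sees its MI decay towards zero once the window has turned over.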

4.3.2. Image Features

Image features are high-dimensional continuous vectors.
Sufficient Statistics: Use online Kernel Density Estimation (KDE) to approximate class-conditional densities.
Incremental Parameter Updates (Exponential Moving Averages):
$$\mu_y^I \leftarrow \lambda \mu_y^I + (1 - \lambda) X^{I,new}$$
$$\sigma_y^I \leftarrow \lambda \sigma_y^I + (1 - \lambda)\left(X^{I,new} - \mu_y^I\right)^2$$
MI Estimation via KL Divergence:
$$I^I(Y; f^I) \approx \sum_y P(y)\, D_{KL}\left(P(f^I \mid y) \,\|\, P(f^I)\right)$$
where $P(f^I \mid y)$ is modeled using online-updated Gaussian approximations.
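The exponential-moving-average updates and the KL-based estimate can be illustrated for a single scalar image feature. This sketch makes several assumptions beyond the text: the spread statistic is treated as a variance updated with the squared deviation, the marginal $P(f^I)$ is replaced by a moment-matched Gaussian so that the KL divergence has a closed form, and the class name `OnlineGaussianMI` is hypothetical. Under the Gaussian approximation the result is a proxy score rather than an exact MI, and it can exceed the entropy of the label.

```python
import math

class OnlineGaussianMI:
    """EMA-tracked class-conditional Gaussians for one scalar image feature."""

    def __init__(self, n_classes, lam=0.95):
        self.lam = lam
        self.mu = [0.0] * n_classes
        self.var = [1.0] * n_classes
        self.count = [0] * n_classes

    def update(self, x, y):
        """EMA update of the mean and variance of class y."""
        if self.count[y] == 0:
            self.mu[y] = x  # initialise the mean with the first sample
        else:
            self.mu[y] = self.lam * self.mu[y] + (1 - self.lam) * x
            dev = x - self.mu[y]
            self.var[y] = self.lam * self.var[y] + (1 - self.lam) * dev * dev
        self.count[y] += 1

    def mi(self):
        """Sum over y of P(y) * KL(N(mu_y, var_y) || moment-matched marginal)."""
        total = sum(self.count)
        prior = [c / total for c in self.count]
        m = sum(p * mu for p, mu in zip(prior, self.mu))
        v = sum(p * (var + (mu - m) ** 2)
                for p, var, mu in zip(prior, self.var, self.mu))

        def kl(mu_y, var_y):
            # Closed-form KL divergence between two univariate Gaussians.
            return 0.5 * (math.log(v / var_y)
                          + (var_y + (mu_y - m) ** 2) / v - 1.0)

        return sum(p * kl(mu, var)
                   for p, mu, var in zip(prior, self.mu, self.var))
```

When both classes see the same values the score is zero; when the class means are far apart relative to the within-class spread, the score grows.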

4.3.3. Cross-modal Conditional MI Approximation

Computing $I(Y; Z_j^I \mid X_i^T)$ is challenging due to heterogeneity. M-O-DC uses a chain factorization with a cross-modal independence assumption:
$$I(Y; Z_j^I \mid X_i^T) \approx I(Y; Z_j^I) - \sum_k w_k\, I(Z_j^I; C_k^T)$$
where $C_k^T$ are the top-k text features in $C\_PC_Y^T$ most correlated with $Z_j^I$. This approximation is effective when text and image features are conditionally independent given $Y$.

4.4. Multimodal Dynamic Adaptation Mechanisms

4.4.1. Modality-Adaptive Thresholds

Thresholds $\tau_{inc}^m$ and $\tau_{red}^m$ are dynamically adjusted per modality:
$$\tau_{inc}^m \leftarrow \tau_{inc}^m \left(1 + \alpha \frac{|C\_PC_Y^m|}{N_{max}^m}\right)$$
This accounts for differences in feature density and redundancy between text and image streams.

4.4.2. Multimodal Concept Drift Detection

Public sentiment may evolve at different rates across modalities. M-O-DC defines a multimodal causal drift statistic:
$$D_t(m, n) = \frac{1}{|C\_PC_Y^m|} \sum_{X_i^m \in C\_PC_Y^m} \left( I(Y; X_i^m) - \text{window\_averaged\_MI} \right)$$
If $D_t(m, n) > k \sigma_D$, a drift is detected, triggering adaptive forgetting in the affected modality.
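The two adaptation rules in this section reduce to a few lines of arithmetic. The sketch below is a hedged reading of them: the drift statistic is taken as the mean signed gap between each PC feature's current MI and its window-averaged MI (an absolute gap is an equally plausible reading), $\sigma_D$ is assumed to be the standard deviation of past values of the statistic, and all function names and default values are hypothetical.

```python
import statistics

def adapt_threshold(tau, pc_size, n_max, alpha=0.5):
    """Raise the inclusion threshold as the PC set fills up."""
    return tau * (1.0 + alpha * pc_size / n_max)

def drift_statistic(current_mi, window_avg_mi):
    """Mean gap between current and window-averaged MI over the PC features.

    Both arguments map feature name -> MI value for one modality.
    """
    gaps = [current_mi[f] - window_avg_mi[f] for f in current_mi]
    return sum(gaps) / len(gaps)

def drift_detected(d_t, history, k=3.0):
    """Flag drift when d_t exceeds k standard deviations of past statistics."""
    if len(history) < 2:
        return False  # not enough history to estimate sigma_D
    return d_t > k * statistics.pstdev(history)
```

With a PC set half full (`pc_size=50`, `n_max=100`) and `alpha=0.5`, a threshold of 0.1 rises to 0.125, so admission becomes stricter as the pool grows.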

5. Cross-Modal Interactive Enhanced Fusion Network

5.1. Deep Encoder Design

5.1.1. Formalization of Multimodal Feature Encoding

The features filtered by M-O-DC, $\tilde{X}^T \in \mathbb{R}^{d_T}$ and $\tilde{X}^I \in \mathbb{R}^{d_I}$, are encoded by modality-specific deep encoders.
Text Encoder (based on Transformer architecture):
$$H^T = \mathrm{Encoder}_T(\tilde{X}^T) \in \mathbb{R}^{L \times d_h}$$
where $\mathrm{Encoder}_T = \mathrm{Stack}\{\mathrm{MultiHeadSelfAttn}, \mathrm{FFN}\}_{l=1}^{N_T}$ and $L = \max(d_T / d_{token}, L_{min})$ is the sequence length.
Image Encoder (based on Vision Transformer):
$$H^I = \mathrm{Encoder}_I(\tilde{X}^I) \in \mathbb{R}^{N \times d_h}$$
where $\tilde{X}^I$ is split into $N = \frac{H \times W}{P^2}$ patches, each projected from dimension $P \times P \times C$ to $d_h$.

5.2. Two-Stage Cross-Modal Interactive Attention

5.2.1. Stage 1: Bidirectional Cross-Modal Attention

We compute cross-modal attention weights in both directions to establish bidirectional interactive mappings. For text-to-image attention, text features serve as queries and image features as keys:
Text-to-Image Attention: $Q^T = H^T W_Q^T \in \mathbb{R}^{L \times d_k}$, $K^I = H^I W_K^I \in \mathbb{R}^{N \times d_k}$, $V^I = H^I W_V^I \in \mathbb{R}^{N \times d_h}$
$W_Q^T, W_K^I, W_V^I$: learnable weight matrices for query, key, and value projections
$d_k$: dimension of query/key vectors (usually $d_k = d_h / h$, where $h$ is the number of attention heads)
$Q^T$: query matrix from text; $K^I$: key matrix from image; $V^I$: value matrix from image.
The attention weight matrix is:
$$A_{T \to I}(l, n) = \frac{\exp\left(\frac{1}{\sqrt{d_k}} \langle q_l^T, k_n^I \rangle\right)}{\sum_{m=1}^{N} \exp\left(\frac{1}{\sqrt{d_k}} \langle q_l^T, k_m^I \rangle\right)}$$
where ⟨·,·⟩ denotes the inner product.
The text-to-image attention output (cross-modal feature) is $\mathrm{Attn}_{T \to I} = A_{T \to I} V^I \in \mathbb{R}^{L \times d_h}$. Image-to-text attention is computed symmetrically: $Q^I = H^I W_Q^I \in \mathbb{R}^{N \times d_k}$, $K^T = H^T W_K^T \in \mathbb{R}^{L \times d_k}$, $V^T = H^T W_V^T \in \mathbb{R}^{L \times d_h}$.
$$A_{I \to T}(n, l) = \frac{\exp\left(\frac{1}{\sqrt{d_k}} \langle q_n^I, k_l^T \rangle\right)}{\sum_{l'=1}^{L} \exp\left(\frac{1}{\sqrt{d_k}} \langle q_n^I, k_{l'}^T \rangle\right)}$$
Image-to-Text Attention (symmetric): $\mathrm{Attn}_{I \to T} = A_{I \to T} V^T \in \mathbb{R}^{N \times d_h}$
The contextual representations after first-stage interaction (enhanced by residual connection and FFN) are:
$$C^{T \to I} = \mathrm{LayerNorm}(H^T + \mathrm{FFN}(\mathrm{Attn}_{T \to I}))$$
$$C^{I \to T} = \mathrm{LayerNorm}(H^I + \mathrm{FFN}(\mathrm{Attn}_{I \to T}))$$
where $\mathrm{FFN}$ is a two-layer MLP used for feature transformation: $\mathrm{FFN}(x) = \mathrm{GELU}(x W_1 + b_1) W_2 + b_2$.
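The attention step of Stage 1 can be traced on small matrices. The sketch below is a single-head, pure-Python illustration with hypothetical helper names; it covers only the scaled dot-product attention in one direction (call it twice with the roles swapped for the other direction) and leaves out the residual connection, LayerNorm, and FFN.

```python
import math

def matmul(a, b):
    """Plain list-of-lists matrix product."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def softmax_rows(m):
    """Row-wise softmax with max subtraction for numerical stability."""
    out = []
    for row in m:
        top = max(row)
        exps = [math.exp(v - top) for v in row]
        norm = sum(exps)
        out.append([e / norm for e in exps])
    return out

def cross_attention(h_query, h_context, w_q, w_k, w_v):
    """Return (attention weights, attended values) for one direction.

    h_query: rows of the querying modality (e.g. text tokens, L x d_h).
    h_context: rows of the attended modality (e.g. image patches, N x d_h).
    """
    q = matmul(h_query, w_q)
    k = matmul(h_context, w_k)
    v = matmul(h_context, w_v)
    scale = 1.0 / math.sqrt(len(q[0]))
    scores = [[scale * sum(a * b for a, b in zip(q_row, k_row))
               for k_row in k] for q_row in q]
    weights = softmax_rows(scores)
    return weights, matmul(weights, v)
```

For text-to-image attention the weight matrix has one row per text token and one column per image patch, and each row sums to one; swapping the first two arguments gives the image-to-text direction with the transposed shape.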

5.2.2. Stage 2: Deep Interactive Normalization

The second stage performs deep processing of the weighted features generated in the first stage. We apply softmax normalization to the interaction matrices, transforming attention weights into probability distributions that quantify the matching relationships between local textual and visual features. This normalization process not only quantifies local feature alignments but also mines potential cross-modal semantic associations that may not be directly observable.
We implement this using Multi-Head Co-Attention (MHCA):
$$\tilde{C}^T = \mathrm{MHCA}(C^{T \to I}, C^{I \to T}, C^{I \to T})$$
$$\tilde{H}^T = \mathrm{LayerNorm}(\tilde{C}^T + \mathrm{FFN}(\tilde{C}^T))$$
$$\tilde{C}^I = \mathrm{MHCA}(C^{I \to T}, C^{T \to I}, C^{T \to I})$$
$$\tilde{H}^I = \mathrm{LayerNorm}(\tilde{C}^I + \mathrm{FFN}(\tilde{C}^I))$$
Compared to methods that only perform direct feature weighting based on initial attention weights, our two-stage hierarchical processing captures finer complementary information and interaction patterns between modalities, enhancing the expressive power of cross-modal features for sentiment semantics.

5.3. Adaptive Modal Weighting

5.3.1. Enhanced Cross-Modal Features

We first enhance the cross-modal features by adding the original representations to the attention outputs from Stage 1: $\hat{H}^T = H^T + \mathrm{Attn}_{T \to I} \in \mathbb{R}^{L \times d_h}$ and $\hat{H}^I = H^I + \mathrm{Attn}_{I \to T} \in \mathbb{R}^{N \times d_h}$. These enhanced features integrate both the original modality information and the cross-modal influences.

5.3.2. Gated Weight Generation Network

To dynamically adjust the fusion weights for text and image modalities based on the current context, we pool each enhanced feature sequence into a global vector, concatenate the two vectors, and apply a learnable gating network.
First, we obtain a global representation by mean pooling over the sequence dimension:
$$\bar{h}^T = \frac{1}{L} \sum_{l=1}^{L} \hat{h}_l^T \in \mathbb{R}^{d_h}$$
$$\bar{h}^I = \frac{1}{N} \sum_{n=1}^{N} \hat{h}_n^I \in \mathbb{R}^{d_h}$$
Then we concatenate them to form a joint vector:
$$z = [\bar{h}^T; \bar{h}^I] \in \mathbb{R}^{2 d_h}$$
The gating network consists of two fully connected layers with ReLU activation, followed by a sigmoid layer to produce the text weight:
$$h_1 = \mathrm{ReLU}(W_1 z + b_1), \quad W_1 \in \mathbb{R}^{d_h \times 2 d_h},\ b_1 \in \mathbb{R}^{d_h}$$
$$h_2 = \mathrm{ReLU}(W_2 h_1 + b_2), \quad W_2 \in \mathbb{R}^{d_h \times d_h},\ b_2 \in \mathbb{R}^{d_h}$$
$$g_t = \sigma(W_3 h_2 + b_3), \quad W_3 \in \mathbb{R}^{1 \times d_h},\ b_3 \in \mathbb{R}$$
$$g_i = 1 - g_t$$
This design forms a closed learning loop from cross-modal attention computation to feature weight assignment, enabling the model to adaptively balance modality contributions based on the semantic content of each post.
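The gating computation is small enough to trace by hand. The following sketch mirrors the pooling, concatenation, two ReLU layers, and sigmoid described above; the function names and the toy parameter values in the usage note are illustrative assumptions, and in the actual model the parameters would be learned.

```python
import math

def mean_pool(h):
    """Average the rows of a sequence matrix into one vector."""
    return [sum(col) / len(h) for col in zip(*h)]

def dense(w, x, b):
    """Affine layer w @ x + b for list-based weights."""
    return [sum(wi * xi for wi, xi in zip(row, x)) + bi
            for row, bi in zip(w, b)]

def relu(v):
    return [max(0.0, t) for t in v]

def modality_gate(h_text, h_image, w1, b1, w2, b2, w3, b3):
    """Return (g_t, g_i): the text weight and the image weight, g_i = 1 - g_t."""
    z = mean_pool(h_text) + mean_pool(h_image)  # list concatenation, length 2*d_h
    h1 = relu(dense(w1, z, b1))
    h2 = relu(dense(w2, h1, b2))
    logit = dense(w3, h2, [b3])[0]
    g_t = 1.0 / (1.0 + math.exp(-logit))
    return g_t, 1.0 - g_t
```

With all parameters at zero the gate returns equal weights of 0.5 for both modalities; training moves the weights away from this neutral point depending on the pooled content of each post.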

5.3.3. Weighted Fusion

Using the adaptive weights, we compute the modality-weighted representations:
$$\tilde{H}_{weight}^T = g_t \cdot \tilde{H}^T, \quad \tilde{H}_{weight}^I = g_i \cdot \tilde{H}^I$$
where $\tilde{H}^T$ and $\tilde{H}^I$ are the outputs of the second-stage attention (Equations 21 and 23). These weighted features are then fused via concatenation:
$$F_{fused} = [\tilde{H}_{weight}^T; \tilde{H}_{weight}^I] \in \mathbb{R}^{(L+N) \times d_h}$$

5.4. Residual Fusion and Classification

5.4.1. Residual-Enhanced Feature Fusion

To optimize gradient flow and reduce information loss during feature transmission, we introduce residual connections. The fused features are passed through a feed-forward network with a skip connection:
$$H_{res} = \mathrm{LayerNorm}(F_{fused} + \mathrm{FFN}(F_{fused})) \in \mathbb{R}^{(L+N) \times d_h}$$
This residual design preserves original information while enhancing the expressive capacity of cross-modal features, mitigating overfitting and improving model generalization.

5.4.2. Self-Attentive Pooling

We employ self-attentive pooling to aggregate the sequence of fused features into a single vector:
$$\alpha = \mathrm{softmax}(H_{res} W_a) \in \mathbb{R}^{L+N}, \quad W_a \in \mathbb{R}^{d_h \times 1}$$
$$h_{pool} = \sum_{i=1}^{L+N} \alpha_i h_i^{res} \in \mathbb{R}^{d_h}$$
where $h_i^{res}$ denotes the $i$-th row of $H_{res}$.
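Self-attentive pooling amounts to a softmax over per-row scores followed by a weighted sum of rows. A minimal sketch, with the function name and the toy inputs as illustrative assumptions:

```python
import math

def self_attentive_pool(h_res, w_a):
    """Pool the rows of h_res into one vector using softmax(h_res @ w_a)."""
    scores = [sum(x * w for x, w in zip(row, w_a)) for row in h_res]
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    norm = sum(exps)
    alpha = [e / norm for e in exps]
    d_h = len(h_res[0])
    pooled = [sum(a * row[j] for a, row in zip(alpha, h_res))
              for j in range(d_h)]
    return alpha, pooled
```

With a zero scoring vector the weights are uniform and the result equals mean pooling; a scoring vector that strongly favours one row makes the pooled vector approach that row.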

5.4.3. Sentiment Classification

The pooled features are passed through a fully connected layer with ReLU activation, followed by Dropout for regularization:
$$h_{hidden} = \mathrm{ReLU}(W_{fc1} h_{pool} + b_{fc1}), \quad W_{fc1} \in \mathbb{R}^{d_h \times d_h},\ b_{fc1} \in \mathbb{R}^{d_h}$$
$$h_{drop} = \mathrm{Dropout}(h_{hidden})$$
The final classification layer uses softmax to output probabilities for three sentiment categories (negative, neutral, positive):
$$p(y \mid S) = \mathrm{softmax}(W_{cls} h_{drop} + b_{cls}), \quad W_{cls} \in \mathbb{R}^{3 \times d_h},\ b_{cls} \in \mathbb{R}^{3}$$

5.5. Loss Function and Optimization

5.5.1. Multi-Task Loss

We employ multi-task learning with three loss components.
Classification Loss (cross-entropy):
$$L_{cls} = -\frac{1}{B} \sum_{i=1}^{B} \sum_{c=1}^{C} y_{i,c} \log p_{i,c}$$
where $B$ is the batch size, $y_{i,c}$ is the ground-truth one-hot indicator, and $p_{i,c}$ is the predicted probability for class $c$.
Contrastive Learning Loss to enhance representation learning:
$$L_{cont} = -\frac{1}{B} \sum_{i=1}^{B} \log \frac{\exp(\mathrm{sim}(z_i, z_i^+) / \tau)}{\sum_{j=1}^{B} \exp(\mathrm{sim}(z_i, z_j^+) / \tau)}$$
where $z_i = h_{pool,i}$ (the pooled feature for sample $i$), $z_i^+$ is a positive sample obtained through data augmentation (e.g., random masking of text or image patches), $\mathrm{sim}(\cdot, \cdot)$ denotes cosine similarity, and $\tau$ is a temperature parameter.
Causal Regularization Loss to encourage causally stable representations:
$$L_{causal} = \sum_{m \in \{T, I\}} \left\| \nabla_{\tilde{H}^m} L_{cls} \right\|_F^2$$
where $\nabla_{\tilde{H}^m} L_{cls}$ is the gradient of the classification loss with respect to the modality-specific representations, and $\|\cdot\|_F$ denotes the Frobenius norm. This term penalizes sensitivity to small perturbations, promoting features that capture causal factors rather than spurious correlations.
The total loss is:
$$L = L_{cls} + \lambda_1 L_{cont} + \lambda_2 L_{causal}$$
where $\lambda_1$ and $\lambda_2$ are hyperparameters balancing the contributions.
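The first two loss terms and their weighted combination can be written out directly. The sketch below uses plain Python lists and hypothetical function names; the causal regularization term needs gradients of the classification loss and is therefore passed in as a precomputed number rather than derived here, and the default temperature of 0.1 is an assumption.

```python
import math

def cross_entropy(probs, labels):
    """Mean negative log-probability assigned to the true class."""
    return -sum(math.log(p[y]) for p, y in zip(probs, labels)) / len(labels)

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def contrastive_loss(z, z_pos, tau=0.1):
    """InfoNCE-style loss: z[i] should match z_pos[i] among all z_pos[j]."""
    total = 0.0
    for i, z_i in enumerate(z):
        logits = [cosine(z_i, z_j) / tau for z_j in z_pos]
        top = max(logits)  # log-sum-exp with max subtraction
        log_denom = top + math.log(sum(math.exp(l - top) for l in logits))
        total += log_denom - logits[i]
    return total / len(z)

def total_loss(l_cls, l_cont, l_causal, lam1, lam2):
    """L = L_cls + lambda_1 * L_cont + lambda_2 * L_causal."""
    return l_cls + lam1 * l_cont + lam2 * l_causal
```

As a sanity check, uniform predictions over three classes give a cross-entropy of log 3, and two orthogonal unit vectors paired with themselves at temperature 1 give a contrastive loss of log(1 + 1/e).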

5.5.2. Optimization

We optimize the model using AdamW with weight decay, which limits weight growth and overfitting more effectively than standard Adam. The learning rate is set to $5 \times 10^{-5}$, and the model is updated online as new data streams arrive (batch size = 1, update every 100 samples). All weight matrices are initialized using Xavier uniform initialization, and biases are initialized to zero.

6. Experiments and Results Analysis

6.1. Experimental Data

(1) Dataset Collection
In China, Weibo and Xiaohongshu have emerged as pivotal platforms for public opinion and sentiment expression. Owing to their extensive user bases, image content related to public health emergencies on these platforms typically encapsulates richer information, aggregating substantial volumes of authentic user perspectives and emotional responses, while encompassing diverse stances, viewpoints, and affective tendencies. Consequently, this study utilizes data sourced from Weibo and Xiaohongshu to validate the effectiveness of the proposed method.
This study crawls public posts published on Weibo and Xiaohongshu between January 2020 and December 2022 and collects public opinion data on three public health emergencies, comprising 15,677 text-image pairs. The data distribution across the events is as follows: 10,245 pairs related to COVID-19, 3,210 pairs to Avian Influenza, and 2,222 pairs to Monkeypox.
(2) Data Annotation
The annotation of the dataset involved a structured dual-process approach, combining both manual and semi-automated methods tailored to the data modality. For image sentiment, a manual dual-annotator protocol was employed. Each image was independently labeled as “Negative,” “Neutral,” or “Positive” by two annotators. Inter-annotator agreement was measured using Cohen’s Kappa coefficient (0.7123), indicating substantial consistency. Any discordant labels were resolved through subsequent discussion between the annotators to reach a consensus. For text sentiment annotation, a hybrid semi-automated pipeline was implemented. Initial sentiment labels (“Negative,” “Neutral,” “Positive”) were generated automatically using the SnowNLP toolkit. These automated labels were then validated by two independent human annotators. Instances where the manual annotations disagreed with the automated output or with each other were flagged. These discrepancies were subsequently reviewed and adjudicated through manual re-evaluation to ensure final label accuracy. The finalized sentiment distribution across the three public health event datasets is presented in Table 1.
(3) Experimental Environment and Evaluation Metrics
The experiments were conducted using the PyCharm IDE with Python 3.9. All models were trained on an NVIDIA RTX 3060 GPU with 32 GB of dedicated memory; the configuration is listed in Table 2. The AdamW optimizer was employed for training across all experiments, with an initial learning rate of 5×10⁻⁵. To ensure fair comparisons and mitigate confounding variables, identical training and testing splits were used for all models. Specifically, the dataset was partitioned into 80% for training and 20% for testing.
This study uses four standard evaluation metrics: Accuracy (A), Precision (P), Recall (R), and F1-score (F1).
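For reference, the four metrics can be computed from label lists as in the sketch below. The averaging scheme is an assumption: the results section reports Macro-F1, so precision and recall are macro-averaged over the three classes here as well, and the function name and integer class encoding are hypothetical.

```python
def macro_metrics(y_true, y_pred, n_classes=3):
    """Accuracy plus macro-averaged precision, recall, and F1."""
    precisions, recalls, f1s = [], [], []
    for c in range(n_classes):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        precisions.append(prec)
        recalls.append(rec)
        f1s.append(f1)
    accuracy = sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)
    return {
        "accuracy": accuracy,
        "precision": sum(precisions) / n_classes,
        "recall": sum(recalls) / n_classes,
        "f1": sum(f1s) / n_classes,
    }
```

Note that a macro-averaged F1 is the mean of per-class F1 scores, so it need not equal the harmonic mean of the macro-averaged precision and recall.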

6.2. Main Results and Analysis

6.2.1. The Classification Results

The model proposed in this paper was applied to three public health emergencies, and the classification results are shown in Table 3.
Based on the sentiment classification results of three public health events—COVID-19, Avian Influenza, and Monkeypox—the sentiment analysis model demonstrates strong overall performance, with an average accuracy of 87.08% and a macro-average F1 score of 87.24%, indicating good generalization ability and practical application potential. In terms of sentiment-specific recognition, the model performs best in identifying negative sentiments, with F1 scores for the negative class exceeding 87.75% across all three events, reaching as high as 91.38% for Monkeypox. This highlights the model’s clear advantage in capturing negative public opinion during crisis events. Positive sentiment recognition is also stable, with precision consistently above 91% and a low false positive rate. However, the model shows a notable weakness in recognizing neutral sentiment, with F1 scores for the neutral class falling below 85.5% across all events—reaching as low as 82.12%—and generally low recall, reflecting difficulty in accurately identifying texts with weak or ambiguous emotional polarity. Overall, the model is ready for practical deployment, particularly for negative public opinion monitoring in public health contexts.

6.2.2. Baseline Models

We organize the baselines into three categories (online feature selection, text-only, and multimodal fusion approaches) and further design combined experiments that pair different feature selection and fusion strategies to isolate the contribution of each component.
① Online Feature Selection Baselines: Online Group LASSO (OGL): regularization parameter $\lambda = 0.01$, group structure defined by modality. Alpha-investing (AI): initial alpha budget $\alpha_0 = 0.05$, $\omega = 1.0$.
② Text-Only Baselines: Online SVM (Text): based on TF-IDF features, RBF kernel, $C = 1.0$, online learning rate $\eta = 0.01$. Online BERT: based on DistilBERT-base, hidden layer dimension 768, learning rate 2e-5, updated every 100 samples.
③ Multimodal Fusion Baselines: Early Fusion (EF): text and image features concatenated at the input layer (text 768-dim + image 512-dim = 1280-dim). Late Fusion (LF): text and image predictions are combined using a weighted average, with weights determined on the validation set. Cross-Modal Attention (CMA): number of attention heads = 8, hidden dimension = 256. MARN: memory unit size = 128, number of memory slots = 16. CLIP (finetuned): we employ the pretrained CLIP (ViT-B/32) model, which consists of a Vision Transformer for images and a Transformer for text. The image and text features are concatenated and fed into a linear classifier for sentiment prediction. Fine-tuning follows the same online learning protocol (batch size = 1, update every 100 samples, learning rate = 2e-5) to ensure fair comparison.
Comparative Experimental Design: The proposed O-DC method is combined with different fusion approaches: O-DC+EF; O-DC+LF; O-DC+CMA. Our fusion network is combined with different feature selection methods: OGL+Our Fusion; AI+Our Fusion.

6.2.3. Results and Discussion

To comprehensively evaluate the effectiveness of the proposed online dynamic cross-modal fusion framework (O-DC + Our Fusion), we compare it against a diverse set of baseline models, including text-only classifiers, multimodal fusion architectures, and online feature selection methods. As reported in Table 4, all models are evaluated on the test stream in terms of Precision (P), Recall (R), Macro-F1, and Accuracy (A).
Among the text-only baselines, Online BERT significantly outperforms Online SVM, achieving an F1-Score that is 5.7 percentage points higher (76.5% vs. 70.8%) and an accuracy that is 5.7 percentage points higher (78.2% vs. 72.5%). This demonstrates the superiority of pre-trained language models in capturing semantic nuances from textual content in public health emergency events. However, both text-only models exhibit relatively low recall (below 67%), indicating a tendency to miss a substantial portion of relevant samples, particularly in imbalanced streaming settings.
Incorporating visual information through multimodal fusion leads to consistent performance gains. For instance, EF + OGL improves recall to 79.2% and F1-Score to 78.9%, suggesting that early integration of visual features helps recover false negatives. The Cross-Modal Attention (CMA) model further boosts performance, achieving 81.3% recall and 80.8% F1-Score, which underscores the importance of modality interaction mechanisms. The MARN model, equipped with memory-augmented networks, achieves the highest recall among all baselines (91.3%) and an accuracy of 83.1%, validating the effectiveness of modeling temporal dependencies in online multimodal streams.
The inclusion of a CLIP-based vision-language pretrained model (CLIP finetuned) provides a strong contemporary baseline. CLIP achieves 88.1% precision, 83.9% recall, 85.9% F1-Score, and 86.2% accuracy, outperforming all traditional multimodal fusion methods and text-only models. This result confirms that leveraging large-scale pretraining on image-text pairs yields powerful multimodal representations even in specialized domains like PHE sentiment analysis. However, CLIP still exhibits a precision-recall trade-off: its recall (83.9%) is lower than MARN’s (91.3%), indicating that some sentiment signals are missed, while its precision (88.1%) is higher than MARN’s (82.7%), reflecting fewer false positives.
Despite these advances, baseline methods still face trade-offs between precision and recall. The proposed O-DC+CMA combination partially mitigates this issue, achieving a balanced performance with 85.6% precision and 83.2% F1-Score. Nevertheless, it still lags behind in recall (71.8%), suggesting that attention-based cross-modal interaction alone is insufficient for comprehensive feature selection in dynamic environments.
Remarkably, the full proposed model (O-DC + Our Fusion) achieves the best overall performance, leading in precision, F1-Score, and accuracy: 89.3% precision, 85.3% recall, 87.2% F1-Score, and 87.1% accuracy. Compared to MARN, the strongest traditional fusion baseline, our method improves F1-Score by 5.6% and accuracy by 4.0%, while maintaining a superior balance between precision and recall, although MARN retains the highest recall (91.3%). More importantly, compared to the state-of-the-art pretrained model CLIP, our method yields gains of +1.2% in precision, +1.4% in recall, +1.3% in F1-Score, and +0.9% in accuracy. This demonstrates that the integration of online dynamic selection (M-O-DC) with our specially designed two-stage interactive fusion network adds significant value over simply fine-tuning a generic vision-language model. The improvement can be attributed to two factors: (1) M-O-DC adaptively filters irrelevant and redundant features in the streaming setting, reducing noise and focusing on task-relevant information; (2) the two-stage cross-modal attention with adaptive weighting captures fine-grained semantic alignments and dynamically balances modality contributions, which is crucial for PHE posts where modality importance varies greatly.
The online streaming performance evolution (Figure 1) further illustrates the robustness of our method. Prior to the simulated concept drift at batch 20, our model exhibits the fastest growth and reaches the highest F1-Score plateau. Upon drift, both MARN and O-DC+CMA experience noticeable degradation, while CLIP (not shown in Figure 1 due to its offline nature) would likely suffer similarly if evaluated online. In contrast, our proposed method maintains its performance with only a marginal dip and quickly recovers, showcasing its resilience to distributional shifts. This resilience is attributed to the online dynamic selection mechanism, which adaptively identifies and emphasizes informative features while discarding noisy or outdated ones.
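The text describes the reaction to drift qualitatively ("marginal dip", "quickly recovers"). One possible way to quantify such behaviour from a per-batch F1 curve is sketched below. The function name, the `window` and `tol` parameters, and the toy F1 series are all assumptions made for illustration; the series is synthetic and is not data from Figure 1.

```python
def drift_dip_and_recovery(f1_per_batch, drift_batch, window=5, tol=0.01):
    """Summarise how a per-batch F1 curve reacts to a drift point.

    drift_batch is a 0-based index into f1_per_batch.
    Returns (dip, recovery):
      dip      = mean F1 over the `window` batches before the drift
                 minus the lowest F1 observed from the drift onward
      recovery = number of batches after the drift until F1 is back
                 within `tol` of the pre-drift mean (None if never)
    """
    pre = f1_per_batch[max(0, drift_batch - window):drift_batch]
    post = f1_per_batch[drift_batch:]
    if not pre or not post:
        raise ValueError("need at least one batch before and after the drift")

    pre_mean = sum(pre) / len(pre)
    lowest = min(post)
    dip = pre_mean - lowest

    recovery = None
    # Start searching at the lowest point so an early pre-dip batch is not
    # mistaken for a recovery.
    for offset in range(post.index(lowest), len(post)):
        if post[offset] >= pre_mean - tol:
            recovery = offset
            break
    return dip, recovery


# Synthetic curve: stable at 0.80, drift at batch 20, then gradual recovery.
toy_f1 = [0.80] * 20 + [0.70, 0.74, 0.78, 0.80, 0.80]
dip, recovery = drift_dip_and_recovery(toy_f1, drift_batch=20)
print(f"dip = {dip:.2f}, recovered after {recovery} batches")
```

A smaller dip and a shorter recovery correspond to the resilience claimed for the proposed method; the same summary could be computed for each curve in Figure 1 if the per-batch values were available.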
Figure 2 compares the number of selected features over 50 batches for three online feature selection methods integrated with our fusion network. With the proposed O-DC+Our Fusion method, the feature dimension declines rapidly from approximately 900 to around 380 within the first 20 batches and then stabilizes at a low, consistent level. This indicates its ability to quickly identify and retain only the most informative features while discarding redundant or noisy ones. In contrast, OGL+Our Fusion exhibits larger fluctuations and maintains a higher average dimension (about 600), reflecting group-wise sparsity constraints that are less adaptive to streaming dynamics. AI+Our Fusion initially selects a large number of features (about 1000) and gradually reduces them to around 480, but with noticeable variance. The efficiency advantage of the proposed method is evident: it achieves the highest F1-Score (87.2%) while using the fewest features, suggesting that its online dynamic selection mechanism not only reduces computational overhead but also enhances feature quality, leading to better generalization in evolving data streams.
Figure 3 presents a grouped bar chart comparing the performance of eight models across Precision, Recall, F1-Score, and Accuracy. Several observations can be drawn. First, text-only baselines (Online SVM and Online BERT) exhibit the lowest overall performance, particularly in Recall (below 67%), indicating their limitation in capturing sufficient positive samples in streaming scenarios. Second, multimodal fusion methods (EF+OGL, CMA, MARN) significantly improve Recall, with MARN achieving the highest Recall (91.3%), but at the cost of reduced Precision (82.7%), revealing a precision-recall trade-off. Third, the CLIP (finetuned) model, leveraging large-scale vision-language pretraining, achieves strong performance (Precision 88.1%, F1-Score 85.9%, Accuracy 86.2%), outperforming traditional multimodal methods but still exhibiting a slight imbalance between precision and recall. Fourth, the O-DC+CMA combination improves Precision to 85.6% but suffers a sharp Recall drop to 71.8%, suggesting that attention-based cross-modal interaction alone cannot fully balance the two metrics. Most importantly, the proposed O-DC+Our Fusion model achieves the best overall performance, leading in Precision (89.3%), F1-Score (87.2%), and Accuracy (87.1%), while maintaining competitive Recall (85.3%). This balanced superiority demonstrates that the integration of online dynamic selection with our fusion network effectively resolves the precision-recall dilemma and outperforms even strong pretrained models like CLIP in the streaming multimodal setting.

6.2.4. Ablation Study Analysis

To rigorously evaluate the contribution of each key component in our proposed online dynamic cross-modal fusion framework, we conduct a series of ablation experiments by systematically removing individual modules. The results are summarized in Table 5, where the Full Model achieves the best performance with 86.9% accuracy and 85.4% F1-Score, serving as the baseline for comparison. Each variant is described below, along with the corresponding performance degradation and efficiency impact.
w/o O-DC (using full features): This variant removes the online dynamic selection (O-DC) module and instead utilizes all original features (text 768-d + image 512-d) without any dimensionality reduction. Compared to the Full Model, accuracy drops by 2.8% to 84.1%, and F1-Score decreases by 2.7% to 82.7%. Notably, the processing time increases to 82 ms, which is 14 ms longer than the Full Model. This indicates that O-DC not only preserves discriminative information but also reduces computational overhead by filtering out redundant features, thereby enhancing both effectiveness and efficiency.
w/o 2nd-stage Attention: This variant eliminates the second-stage cross-modal attention mechanism, which is designed to refine feature interactions after initial fusion. The removal leads to an accuracy decline of 1.3% (to 85.6%) and an F1-Score drop of 1.4% (to 84.0%). The processing time decreases slightly to 64 ms, suggesting that the second-stage attention adds a modest computational cost while delivering noticeable performance gains. The results confirm that deeper attention-based refinement is beneficial for capturing complex modality relationships.
w/o Adaptive Weighting: Here, the adaptive weighting module that dynamically adjusts the contribution of each modality based on streaming context is removed. Accuracy falls to 85.9% (Δ = -1.0%), and F1-Score to 84.3% (Δ = -1.1%), with a processing time of 66 ms. The moderate performance drop underscores the importance of context-aware modality fusion, especially in evolving data streams where the reliability of text and visual signals may vary over time.
w/o Residual: This variant removes residual connections within the fusion network. Accuracy decreases by 0.8% to 86.1%, and F1-Score by 0.7% to 84.7%, while processing time remains similar to the Full Model (67 ms vs. 68 ms). Although the impact is relatively smaller, residual connections still contribute to stabilizing training and improving representational capacity, as evidenced by the consistent performance gain.
The ablation results clearly demonstrate that each module plays a vital role in the overall framework. The most significant performance degradation occurs when O-DC is removed, highlighting its critical function in online feature selection. The second-stage attention and adaptive weighting also contribute substantially to accuracy and F1-Score, while residual connections provide marginal yet consistent improvements. Moreover, the Full Model achieves the best trade-off between performance and efficiency, with processing time only slightly higher than the fastest variant but substantially lower than the feature-heavy w/o O-DC version. These findings validate the necessity and effectiveness of each designed component in our proposed method for online multimodal sentiment classification.
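To make the roles of the adaptive weighting and residual components ablated above more concrete, the following is a minimal pure-Python sketch of input-dependent modality weighting with an optional residual path. It is not the network evaluated in this paper: the gate vectors, the dot-product scoring, the averaged residual, and the assumption that both modalities have already been projected to a common dimension are all illustrative choices.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a short list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def adaptive_fuse(text_feat, image_feat, gate_text, gate_image, use_residual=True):
    """Fuse two equal-length feature vectors with input-dependent weights.

    gate_text / gate_image are hypothetical learned gate vectors; each
    modality's relevance score is the dot product of its features and gate.
    """
    if len(text_feat) != len(image_feat):
        raise ValueError("features must share a common dimension")

    score_t = sum(g * x for g, x in zip(gate_text, text_feat))
    score_i = sum(g * x for g, x in zip(gate_image, image_feat))
    w_t, w_i = softmax([score_t, score_i])  # weights sum to 1

    fused = [w_t * t + w_i * v for t, v in zip(text_feat, image_feat)]
    if use_residual:
        # Assumed residual path: add the plain average of the inputs back in.
        fused = [f + 0.5 * (t + v) for f, t, v in zip(fused, text_feat, image_feat)]
    return fused, (w_t, w_i)


# Equal gates give equal weights; a stronger text gate shifts weight to text.
fused_eq, weights_eq = adaptive_fuse([1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, 1.0])
_, weights_text = adaptive_fuse([1.0, 0.0], [0.0, 1.0], [2.0, 0.0], [0.0, 0.0])
print(fused_eq, weights_eq, weights_text)
```

Setting `use_residual=False` or replacing the softmax weights with fixed values of 0.5 gives rough analogues of the "w/o Residual" and "w/o Adaptive Weighting" variants in this simplified setting.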

7. Conclusion

This study presented a novel online multimodal sentiment analysis framework for public opinion during public health emergencies. To address the challenges of streaming, high-dimensional, and heterogeneous data, we proposed: (1) an Online Divide-and-Conquer Markov Blanket Learning Algorithm for dynamic and efficient feature selection, and (2) a Cross-Modal Interactive Enhanced Fusion Network with two-stage attention and adaptive weighting for deep semantic fusion. Theoretical analysis provided guarantees on convergence, information preservation, and generalization. Comprehensive experiments on a large-scale PHE dataset demonstrated that our framework outperforms state-of-the-art baselines in accuracy, F1-score, online learning efficiency, and robustness to concept drift, while maintaining interpretability.
Future work will focus on: (1) extending the framework to incorporate more modalities, particularly video and audio; (2) designing fully adaptive mechanisms for the threshold parameters in O-DC; and (3) exploring more advanced architectures for handling complex, long-term concept drift.

Figure 1. Online Streaming Performance Evolution.
Figure 2. Feature Dimension Dynamics.
Figure 3. Performance Comparison of Different Models.
Table 1. The Sentiment Annotation Results of Public Opinion in Three Sudden Events.
| Public Health Emergency | Total Sample Size | Positive | Neutral | Negative |
|---|---|---|---|---|
| COVID-19 | 10,245 | 2,377 | 4,252 | 3,616 |
| Avian Influenza | 3,210 | 803 | 1,445 | 962 |
| Monkeypox | 2,222 | 489 | 934 | 799 |
| Total | 15,677 | 3,669 | 6,631 | 5,377 |
Table 2. Experimental Parameter Settings.
| Stage | Parameters | Value |
|---|---|---|
| O-DC | Text threshold $\tau_{text}$ | 0.35 |
| O-DC | Image threshold $\tau_{image}$ | 0.40 |
| Optimizer | Learning rate | $5 \times 10^{-5}$ |
| Optimizer | β | (0.9, 0.999) |
| Training (online learning) | Batch size | 1 |
| Training (online learning) | Update frequency | every 100 samples |
Table 3. Sentiment Classification Result Analysis of Three Public Health Opinion Events.
| Events | Sentiment Category | Precision (P) | Recall (R) | F1-score (F) | Accuracy (A, per event) |
|---|---|---|---|---|---|
| COVID-19 | Positive | 91.15% | 82.54% | 86.64% | 86.62% |
| | Neutral | 88.45% | 82.78% | 85.52% | |
| | Negative | 90.31% | 85.35% | 87.75% | |
| Avian Influenza | Positive | 91.37% | 89.29% | 90.30% | 87.15% |
| | Neutral | 84.76% | 81.78% | 83.27% | |
| | Negative | 90.48% | 88.12% | 89.25% | |
| Monkeypox | Positive | 91.23% | 86.72% | 88.92% | 87.47% |
| | Neutral | 83.97% | 80.34% | 82.12% | |
| | Negative | 92.09% | 90.75% | 91.38% | |
| Average of the three events | | 89.31% | 85.30% | 87.24% | 87.08% |
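The final row of Table 3 is consistent with an unweighted mean over the nine event/category rows for precision, recall, and F1-score, and over the three per-event values for accuracy. The short check below reproduces it; the variable names are illustrative, and the averaging scheme is inferred from the numbers rather than stated in the text.

```python
# (precision, recall, f1) per sentiment category, copied from Table 3 (percent).
per_class = {
    "COVID-19": [(91.15, 82.54, 86.64), (88.45, 82.78, 85.52), (90.31, 85.35, 87.75)],
    "Avian Influenza": [(91.37, 89.29, 90.30), (84.76, 81.78, 83.27), (90.48, 88.12, 89.25)],
    "Monkeypox": [(91.23, 86.72, 88.92), (83.97, 80.34, 82.12), (92.09, 90.75, 91.38)],
}
# One accuracy value per event, copied from Table 3 (percent).
per_event_accuracy = {"COVID-19": 86.62, "Avian Influenza": 87.15, "Monkeypox": 87.47}

# Flatten the nine rows and average each metric column.
rows = [row for event_rows in per_class.values() for row in event_rows]
avg_p, avg_r, avg_f1 = (round(sum(col) / len(col), 2) for col in zip(*rows))
avg_acc = round(sum(per_event_accuracy.values()) / len(per_event_accuracy), 2)

print(f"P={avg_p:.2f}  R={avg_r:.2f}  F1={avg_f1:.2f}  Acc={avg_acc:.2f}")
```

These averages round to the 89.3% precision, 85.3% recall, 87.2% F1-score, and 87.1% accuracy reported for the full model.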
Table 4. Main Performance Comparison (Average on Test Stream).
| Model | Precision (P) | Recall (R) | F1-Score (F) | Accuracy (A) |
|---|---|---|---|---|
| Online SVM (Text) | 72.2% | 65.5% | 70.8% | 72.5% |
| Online BERT | 75.5% | 66.3% | 76.5% | 78.2% |
| EF + OGL | 64.7% | 79.2% | 78.9% | 80.1% |
| CMA | 77.3% | 81.3% | 80.8% | 82.3% |
| MARN | 82.7% | 91.3% | 81.6% | 83.1% |
| CLIP (finetuned) | 88.1% | 83.9% | 85.9% | 86.2% |
| O-DC + CMA | 85.6% | 71.8% | 83.2% | 84.7% |
| O-DC + Our Fusion | 89.3% | 85.3% | 87.2% | 87.1% |
Table 5. Ablation Study Results.
| Model Variant | Accuracy (%) | ΔAcc | F1-Score (%) | ΔF1 | Processing Time (ms) |
|---|---|---|---|---|---|
| Full Model | 86.9 | | 85.4 | | 68 |
| w/o O-DC (using all features) | 84.1 | -2.8 | 82.7 | -2.7 | 82 |
| w/o 2nd-stage Attention | 85.6 | -1.3 | 84.0 | -1.4 | 64 |
| w/o Adaptive Weighting | 85.9 | -1.0 | 84.3 | -1.1 | 66 |
| w/o Residual Connections | 86.1 | -0.8 | 84.7 | -0.7 | 67 |
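The Δ columns of Table 5 can be recomputed from the variant scores. The sketch below takes the full-model reference as 86.9% accuracy and 85.4% F1-Score, the values stated in the text of Section 6.2.4; the variable names are illustrative.

```python
# Full-model reference as stated in Section 6.2.4 (accuracy %, F1 %, time in ms).
full = {"acc": 86.9, "f1": 85.4, "ms": 68}

# Variant scores copied from Table 5.
variants = {
    "w/o O-DC": {"acc": 84.1, "f1": 82.7, "ms": 82},
    "w/o 2nd-stage Attention": {"acc": 85.6, "f1": 84.0, "ms": 64},
    "w/o Adaptive Weighting": {"acc": 85.9, "f1": 84.3, "ms": 66},
    "w/o Residual Connections": {"acc": 86.1, "f1": 84.7, "ms": 67},
}

# (delta accuracy, delta F1, delta processing time) relative to the full model.
deltas = {
    name: (
        round(v["acc"] - full["acc"], 1),
        round(v["f1"] - full["f1"], 1),
        v["ms"] - full["ms"],
    )
    for name, v in variants.items()
}

for name, (d_acc, d_f1, d_ms) in deltas.items():
    print(f"{name:26s} dAcc={d_acc:+.1f}  dF1={d_f1:+.1f}  dTime={d_ms:+d} ms")
```

All four ΔAcc and ΔF1 entries in Table 5 match this reference, and the time differences reproduce the 14 ms overhead of the w/o O-DC variant.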
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
Copyright: This open access article is published under a Creative Commons CC BY 4.0 license, which permits free download, distribution, and reuse, provided that the author and preprint are cited in any reuse.