Submitted: 27 October 2025
Posted: 28 October 2025
Abstract
Keywords:
1. Introduction
2. Related Work
- Picard’s foundational study established the concept of machine emotion perception [1]. Subsequent research on sentiment analysis [2,3] made progress but did not handle fuzzy semantics, in which emotional meaning varies with linguistic ambiguity and context, as theorized by Zadeh’s fuzzy set principles [4].
- Recent advances introduced fuzzy models that better capture ambiguous emotional expressions [5], an ability critical for healthcare and elderly communication where subtle cues often precede physical decline.
- However, these methods increase computational cost and are unsuitable for real-time or low-power systems, especially in medical AI [8].
- Building upon these directions, the proposed Fuzzy-Semantic Evaluation Framework (FSEF) not only integrates accuracy, calibration, and latency into a unified structure but also emphasizes reproducibility and privacy-preserving deployment for real-world IoT environments [15].
3. Methods
3.1. Theoretical Framework
- Accuracy (A) captures external validity—the degree to which predicted emotion labels coincide with human annotation.
- Calibration (C) measures how consistently the model’s confidence values correspond to empirical correctness. For probability-based models, this is assessed through metrics such as ECE and Brier Score. For zero-shot similarity-based models, these metrics are applied to sigmoid-normalized similarity scores as confidence reliability proxies, allowing a consistent comparison of relative reliability across inference paradigms.
- Latency (L) quantifies computational realism: the average inference time required per utterance on the designated single-core, 2 GB IoT device.
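The calibration dimension described above can be illustrated with a minimal sketch. It assumes the standard binary forms of the Brier Score and ECE, ten equal-width confidence bins, and a plain logistic sigmoid for turning similarity scores into confidence proxies. The function names, the bin count, and the absence of any scaling inside the sigmoid are illustrative assumptions, not details taken from the paper.

```python
import math
from typing import Sequence


def sigmoid(x: float) -> float:
    """Map a raw similarity score to (0, 1); any scaling factor is omitted (assumption)."""
    return 1.0 / (1.0 + math.exp(-x))


def brier_score(confidences: Sequence[float], outcomes: Sequence[int]) -> float:
    """Mean squared difference between confidence and the 0/1 outcome."""
    return sum((c - y) ** 2 for c, y in zip(confidences, outcomes)) / len(confidences)


def expected_calibration_error(
    confidences: Sequence[float], outcomes: Sequence[int], n_bins: int = 10
) -> float:
    """Weighted average gap between mean confidence and accuracy per bin."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for c, y in zip(confidences, outcomes):
        # Equal-width bins over [0, 1]; a confidence of exactly 1.0 goes to the last bin.
        idx = min(int(c * n_bins), n_bins - 1)
        bins[idx].append((c, y))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(y for _, y in members) / len(members)
        ece += (len(members) / n) * abs(avg_conf - accuracy)
    return ece
```

For a zero-shot model, one would first pass each cosine similarity through `sigmoid` and then feed the resulting values to the same two functions, so that both inference paradigms are scored with identical code.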
3.1.1. Mathematical Formulation of the Framework
3.1.2. Rationale and Semantic Context
3.2. Framework Architecture
3.2.1. Model Selection and Lightweight Design

3.2.2. Inference Mechanisms: Zero-shot and Fine-tuned Models
3.3. Experimental Environment
3.4. Training Setup

3.5. Deployment and Inference Configuration
3.6. Evaluation Metrics and Statistical Rationale
| Metric | Level of Evaluation | Purpose / Interpretation |
| DecisionAcc(θ) | Sample-level | Measures coverage of partially correct predictions (whether at least one correct label is captured). |
| DecisionPrec(θ) | Sample-level | Evaluates selectivity—the proportion of predicted labels that are correct (low false positives). |
| DecisionRec(θ) | Sample-level | Evaluates completeness—the proportion of true labels correctly recovered (low false negatives). |
| DecisionF1(θ) | Sample-level | Balances precision and recall through a harmonic mean, representing overall correctness. |
| Jaccard Index | Set-level | Measures intersection over union between predicted and true label sets, summarizing overlap accuracy. |
| Micro-F1 | Global-level | Aggregates predictions across all classes, emphasizing performance on frequent labels. |
| Top-1 Accuracy | Global-level | Assesses correctness of the model’s single highest-confidence prediction. |
- The two metric types serve complementary analytical roles: decision-based metrics describe conditional precision and recall at the sample level, and global metrics capture aggregate performance across the dataset. This integrated design enables a balanced and interpretable assessment of accuracy within a unified evaluation framework.
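As a rough illustration of the decision-based metrics in the table above, the sketch below thresholds per-label confidences at θ, computes each quantity per sample, and averages over samples. The per-sample averaging, the zero value assigned when a prediction set is empty, and the function and key names are assumptions made for this example; the paper's exact definitions may differ.

```python
from typing import Dict, List, Set


def decision_metrics(
    scores: List[Dict[str, float]], true_labels: List[Set[str]], theta: float
) -> Dict[str, float]:
    """Sample-averaged decision metrics for label sets predicted at threshold theta."""
    totals = {"acc": 0.0, "prec": 0.0, "rec": 0.0, "f1": 0.0, "jaccard": 0.0}
    for sample_scores, truth in zip(scores, true_labels):
        # Predicted set: every label whose confidence reaches the threshold.
        predicted = {label for label, conf in sample_scores.items() if conf >= theta}
        hits = len(predicted & truth)
        prec = hits / len(predicted) if predicted else 0.0  # assumption: empty set scores 0
        rec = hits / len(truth) if truth else 0.0
        f1 = 2 * prec * rec / (prec + rec) if (prec + rec) > 0 else 0.0
        union = len(predicted | truth)
        totals["acc"] += 1.0 if hits > 0 else 0.0  # at least one correct label captured
        totals["prec"] += prec
        totals["rec"] += rec
        totals["f1"] += f1
        totals["jaccard"] += hits / union if union else 0.0
    n = len(scores)
    return {name: value / n for name, value in totals.items()}
```

Under these definitions, a low threshold tends to yield large predicted sets with high DecisionAcc and DecisionRec but low DecisionPrec, which is the pattern one would expect from a model with a high activation rate.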
| Dimension | Focus of Evaluation | Representative Metrics | Interpretive Meaning |
| Accuracy (A) | External correctness and completeness of predictions | DecisionAcc(θ), DecisionPrec(θ), DecisionRec(θ), DecisionF1(θ), Jaccard Index, Micro-F1, Top-1 Accuracy | Measures how well the model identifies emotional labels at sample and dataset levels. |
| Calibration (C) | Reliability of model confidence estimates | ECE, Brier Score | Indicates how closely the model’s confidence values correspond to empirical correctness across probability- and similarity-based outputs. |
| Latency (L) | Computational efficiency and real-time feasibility | Average latency | Quantifies inference speed and resource efficiency on the IoT device. |
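The average latency listed for the Latency (L) dimension could be measured along the following lines. This is only a sketch: `predict` stands for any hypothetical per-utterance inference callable, and the warm-up count and the use of wall-clock time from `time.perf_counter` are assumptions rather than the paper's stated protocol.

```python
import time
from typing import Callable, Sequence


def mean_latency_ms(
    predict: Callable[[str], object], utterances: Sequence[str], warmup: int = 2
) -> float:
    """Average wall-clock inference time per utterance, in milliseconds."""
    # Warm-up calls are excluded from timing so one-off setup costs do not skew the mean.
    for text in utterances[:warmup]:
        predict(text)
    start = time.perf_counter()
    for text in utterances:
        predict(text)
    elapsed = time.perf_counter() - start
    return 1000.0 * elapsed / len(utterances)
```

Timing utterances one at a time, rather than in batches, matches the per-utterance definition of latency and is closer to how a single-core device would serve requests.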
- The Area Under the Risk–Coverage Curve (AURC) quantifies the global trade-off between prediction confidence and risk. It is computed by sorting samples according to confidence scores and integrating the empirical risk as coverage increases from 0 to 1. Lower AURC values indicate better overall reliability across confidence thresholds.
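The AURC computation described above can be sketched as follows. The example assumes a binary notion of per-sample correctness (for instance, Top-1 correct or not) and approximates the integral by averaging the empirical risk over all coverage levels k/n; both choices, along with the function name, are assumptions for illustration.

```python
from typing import Sequence


def aurc(confidences: Sequence[float], correct: Sequence[int]) -> float:
    """Average empirical risk over coverage levels, most confident samples first."""
    order = sorted(range(len(confidences)), key=lambda i: confidences[i], reverse=True)
    errors = 0
    risk_sum = 0.0
    for k, i in enumerate(order, start=1):
        errors += 0 if correct[i] else 1
        # Risk at coverage k/n: error rate among the k most confident samples.
        risk_sum += errors / k
    return risk_sum / len(order)
```

When confidence ranks correct predictions ahead of incorrect ones, errors enter the covered set late and the value stays low; when the ranking is reversed, the value rises.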
4. Results
4.1. Accuracy Evaluation
4.2. Reliability and Calibration Analysis
4.3. Robustness Evaluation
4.4. Latency and Deployability
5. Discussion
5.1. Verification of the A–C–L Framework
5.2. Calibration Reliability and Responsible Inference
5.3. Architectural Moderation and Resource-Constrained Feasibility
5.4. Aging-Friendly Application and Societal Contribution
5.5. Methodological and Theoretical Implications
6. Future Work and Limitations
7. Conclusions
References
- Picard, R.W. Affective Computing; MIT Press: Cambridge, MA, 1997.
- Cambria, E.; Poria, S.; Bajpai, R.; Schuller, B. Affective computing and sentiment analysis. IEEE Intelligent Systems 2020, 35, 102–107.
- Poria, S.; Cambria, E.; Gelbukh, A. Aspect extraction for opinion mining with a deep convolutional neural network. Knowledge-Based Systems 2016, 108, 42–49.
- Zadeh, L.A. The concept of a linguistic variable and its application to approximate reasoning-I. Information Sciences 1975, 8, 199–249.
- Xiong, Y.; et al. Fuzzy speech emotion recognition considering semantic ambiguity. Journal of Intelligent & Fuzzy Systems 2024.
- Guo, C.; Pleiss, G.; Sun, Y.; Weinberger, K.Q. On calibration of modern neural networks. In Proceedings of the International Conference on Machine Learning (ICML); 2017; pp. 1321–1330.
- Lakshminarayanan, B.; Pritzel, A.; Blundell, C. Simple and scalable predictive uncertainty estimation using deep ensembles. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS); 2017; pp. 6402–6413.
- Mehrtash, A.; et al. Confidence calibration and predictive uncertainty estimation for deep medical image segmentation. IEEE Transactions on Medical Imaging 2020, 39, 3868–3878.
- Sanh, V.; Debut, L.; Chaumond, J.; Wolf, T. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv 2019, arXiv:1910.01108.
- Wang, W.; Bao, F.; Dong, L.; Wei, H.; Xu, K. MiniLM: Deep self-attention distillation for task-agnostic compression of pre-trained transformers. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS); 2020.
- Zhou, Z.; Chen, X.; Li, E.; Zeng, L.; Luo, K.; Zhang, J. Edge intelligence: Paving the last mile of artificial intelligence with edge computing. Proceedings of the IEEE 2019, 107, 1738–1762.
- Yang, K.; et al. Mobile emotion recognition via multiple physiological signals using convolution-augmented transformer. In Proceedings of the 2022 International Conference on Multimedia Retrieval; 2022.
- Doshi-Velez, F.; Kim, B. Towards a rigorous science of interpretable machine learning. arXiv 2017, arXiv:1702.08608.
- Arrieta, A.B.; Díaz-Rodríguez, N.; Del Ser, J.; et al. Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI. Information Fusion 2020, 58, 82–115.
- Deng, S.; et al. Edge intelligence: The confluence of edge computing and artificial intelligence. IEEE Internet of Things Journal 2020, 7, 7457–7469.


| Model | Top-1 | micro-F1 | DecisionF1 | Jaccard | DecisionAcc | DecisionPrec | DecisionRec |
| MiniLM-ZeroShot | 0.223 | 0.086 | 0.088 | 0.046 | 0.97 | 0.047 | 0.971 |
| DistilBERT-ZeroShot | 0.225 | 0.091 | 0.093 | 0.049 | 0.96 | 0.050 | 0.955 |
| MiniLM-OvR | 0.334 | 0.271 | 0.494 | 0.178 | 0.55 | 0.516 | 0.499 |
| DistilBERT-OvR | 0.413 | 0.371 | 0.516 | 0.276 | 0.57 | 0.532 | 0.528 |
| Model | Metric | Clean | Noisy | Absolute Change (Δ) | Observation |
| MiniLM-OvR | micro-F1 | 0.271 | 0.265 | decreases by 0.006 | Slight reduction, indicating stable performance. |
| MiniLM-OvR | ECE | 0.058 | 0.060 | increases by 0.002 | Calibration remains stable under noisy input. |
| MiniLM-OvR | AURC | 0.309 | 0.316 | increases by 0.007 | Minor increase in risk–coverage area. |
| DistilBERT-OvR | micro-F1 | 0.371 | 0.364 | decreases by 0.007 | Consistent with MiniLM-OvR, showing slightly better robustness overall. |
| DistilBERT-OvR | ECE | 0.059 | 0.061 | increases by 0.002 | Very small change, showing stable calibration. |
| DistilBERT-OvR | AURC | 0.315 | 0.318 | increases by 0.003 | Negligible change in risk–coverage balance. |
| Model ID | Encoder | Classification Method | Average Latency (ms) | micro-F1 | Jaccard | Observation |
| MiniLM-ZeroShot | MiniLM-L6-v2 | Prototype Matching (Cosine) | 29.5 | 0.087 | 0.046 | Lightweight zero-shot baseline; easy to deploy but low precision and high activation rate. |
| DistilBERT-ZeroShot | DistilBERT-base-nli-stsb | Prototype Matching (Cosine) | 75.0 | 0.091 | 0.049 | Slower but slightly higher semantic coverage; calibration weak. |
| MiniLM-OvR | MiniLM-L6-v2 | One-vs-Rest Logistic Regression | 26.7 | 0.271 | 0.178 | Lowest latency and balanced accuracy–calibration performance. |
| DistilBERT-OvR | DistilBERT-base-nli-stsb | One-vs-Rest Logistic Regression | 74.0 | 0.371 | 0.276 | Highest accuracy, moderate latency. |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).