Submitted: 22 February 2026
Posted: 28 February 2026
Abstract
Keywords:
1. Introduction
2. Background and Related Work
2.1. MLOps: Principles and Limitations
2.2. Healthcare AI: Deployment Challenges
2.3. Regulatory Frameworks: GDPR and the EU AI Act
2.4. Research Gap
3. The Clinical MLOps Framework
3.1. Layer 1 — Privacy-Preserving Deployment Patterns
3.2. Layer 2 — Clinical Observability Mechanisms
3.3. Layer 3 — Compliance-Oriented Audit Trail Architecture
3.4. Layer 4 — Human-in-the-Loop Governance Protocols
4. Demonstrative Pipeline: Patient Deterioration Prediction with MIMIC-IV
4.1. Dataset Description
4.2. Prediction Task
4.3. Pipeline Architecture
| Component | Technology | Clinical MLOps Layer |
|---|---|---|
| Data extraction & versioning | DVC + PostgreSQL | Layer 3 (Lineage) |
| Feature engineering | Apache Spark / PySpark | Layer 1 (Minimization) |
| Experiment tracking | MLflow | Layer 3 (Audit Trail) |
| Model training | scikit-learn / XGBoost | Layer 2 (Uncertainty) |
| CI/CD pipeline | GitHub Actions | Layer 3 (Change Mgmt) |
| Model serving | FastAPI + Docker | Layer 1 (Isolation) |
| Monitoring | Grafana + Prometheus | Layer 2 (Observability) |
| Drift detection | Evidently AI | Layer 2 (Drift) |
| Secrets management | HashiCorp Vault | Layer 1 (Encryption) |
| Audit logging | Immutable object store (S3) | Layer 3 (Logging) |
| Human review interface | Custom dashboard | Layer 4 (HITL) |
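The audit-logging row above pairs an immutable object store with Layer 3. The record format is not given here, so the following is only a minimal sketch of one common way to make a log tamper-evident before it is written to such a store: each entry carries the SHA-256 hash of its predecessor. The names `append_entry`, `verify_chain`, and `GENESIS_HASH`, as well as the event fields, are hypothetical and chosen for illustration.

```python
import hashlib
import json

GENESIS_HASH = "0" * 64  # hypothetical sentinel used as the first "previous hash"


def _digest(body: dict) -> str:
    """SHA-256 over a canonical (sorted-key) JSON encoding of the entry body."""
    payload = json.dumps(body, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def append_entry(log: list, event: dict) -> dict:
    """Append an event, chaining it to the hash of the previous entry."""
    prev_hash = log[-1]["hash"] if log else GENESIS_HASH
    body = {"seq": len(log), "prev_hash": prev_hash, "event": event}
    entry = {**body, "hash": _digest(body)}
    log.append(entry)
    return entry


def verify_chain(log: list) -> bool:
    """Recompute every hash; an edited, reordered, or mid-chain deleted entry fails.

    Truncation of the tail is NOT detected here; that would need the latest
    hash to be anchored somewhere outside the log.
    """
    prev_hash = GENESIS_HASH
    for seq, entry in enumerate(log):
        if entry["seq"] != seq or entry["prev_hash"] != prev_hash:
            return False
        body = {"seq": entry["seq"], "prev_hash": entry["prev_hash"], "event": entry["event"]}
        if entry["hash"] != _digest(body):
            return False
        prev_hash = entry["hash"]
    return True


# Illustrative events only; field names are assumptions, not the paper's schema.
audit_log = []
append_entry(audit_log, {"type": "inference", "model_version": "v1", "risk_score": 0.42})
append_entry(audit_log, {"type": "override", "reviewer": "clinician-001", "accepted": False})
print(verify_chain(audit_log))  # True for an untouched chain
```

Hash chaining complements, rather than replaces, write-once storage: the object store prevents overwrites, while the chain lets an auditor check offline that no record was altered or removed from the middle of the sequence.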
4.4. Feature Engineering and Data Minimization
4.5. Model Training and Uncertainty Quantification
4.6. Fairness Evaluation
4.7. Drift Simulation and Monitoring
4.8. Audit Trail Implementation
5. Discussion
5.1. Gap Analysis: Standard MLOps vs. Clinical MLOps
5.2. Regulatory Alignment
5.3. Limitations
5.4. Future Work
6. Conclusions
References
- Angelopoulos, A. N.; Bates, S. A gentle introduction to conformal prediction and distribution-free uncertainty quantification. arXiv 2022, arXiv:2107.07511.
- Bifet, A.; Gavalda, R. Learning from time-changing data with adaptive windowing. In Proceedings of the 2007 SIAM International Conference on Data Mining; 2007; pp. 443–448.
- Char, D. S.; Shah, N. H.; Magnus, D. Implementing machine learning in health care — addressing ethical challenges. New England Journal of Medicine 2018, 378(11), 981–983.
- European Parliament. General Data Protection Regulation (GDPR). Regulation (EU) 2016/679; 2016.
- European Parliament. Artificial Intelligence Act. Regulation (EU) 2024/1689; 2024.
- Finlayson, S. G.; Subbaswamy, A.; Singh, K.; Bowers, J.; Kupke, A.; Zittrain, J.; ... & Saria, S. The clinician and dataset shift in artificial intelligence. New England Journal of Medicine 2021, 385(3), 283–286.
- Grasselli, G.; Zangrillo, A.; Zanella, A.; Antonelli, M.; Cabrini, L.; Castelli, A.; ... & COVID-19 Lombardy ICU Network. Baseline characteristics and outcomes of 1591 patients infected with SARS-CoV-2 admitted to ICUs of the Lombardy region, Italy. JAMA 2020, 323(16), 1574–1581.
- Hardt, M.; Price, E.; Srebro, N. Equality of opportunity in supervised learning. In Advances in Neural Information Processing Systems; 2016; Vol. 29.
- Harutyunyan, H.; Khachatrian, H.; Kale, D. C.; Ver Steeg, G.; Galstyan, A. Multitask learning and benchmarking with clinical time series data. Scientific Data 2019, 6(1), 96.
- Johnson, A. E. W.; Bulgarelli, L.; Shen, L.; Gayles, A.; Shammout, A.; Horng, S.; ... & Mark, R. G. MIMIC-IV, a freely accessible electronic health record dataset. Scientific Data 2023, 10(1), 1.
- Kreuzberger, D.; Kühl, N.; Hirschl, S. Machine learning operations (MLOps): Overview, definition, and architecture. IEEE Access 2023, 11, 31866–31879.
- Kusner, M. J.; Loftus, J.; Russell, C.; Silva, R. Counterfactual fairness. In Advances in Neural Information Processing Systems; 2017; Vol. 30.
- McMahan, B.; Moore, E.; Ramage, D.; Hampson, S.; y Arcas, B. A. Communication-efficient learning of deep networks from decentralized data. In Artificial Intelligence and Statistics; 2017; pp. 1273–1282.
- Obermeyer, Z.; Powers, B.; Vogeli, C.; Mullainathan, S. Dissecting racial bias in an algorithm used to manage the health of populations. Science 2019, 366(6464), 447–453.
- Paleyes, A.; Urma, R.-G.; Lawrence, N. D. Challenges in deploying machine learning: A survey of case studies. ACM Computing Surveys 2022, 55(6), 1–29.
- Platt, J. Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. Advances in Large Margin Classifiers 1999, 10(3), 61–74.
- Sculley, D.; Holt, G.; Golovin, D.; Davydov, E.; Phillips, T.; Ebner, D.; ... & Dennison, D. Hidden technical debt in machine learning systems. In Advances in Neural Information Processing Systems; 2015; Vol. 28.
- Topol, E. J. High-performance medicine: The convergence of human and artificial intelligence. Nature Medicine 2019, 25(1), 44–56.
- Wiens, J.; Saria, S.; Sendak, M.; Ghassemi, M.; Liu, V. X.; Doshi-Velez, F.; ... & Goldenberg, A. Do no harm: A roadmap for responsible machine learning for health care. Nature Medicine 2019, 25(9), 1337–1340.

| Subgroup | AUC-ROC | Sensitivity | Specificity | PPV |
|---|---|---|---|---|
| Overall | 0.847 | 0.781 | 0.833 | 0.612 |
| Male | 0.851 | 0.793 | 0.828 | 0.624 |
| Female | 0.839 | 0.764 | 0.841 | 0.597 |
| Age 18–44 | 0.821 | 0.748 | 0.861 | 0.541 |
| Age 45–64 | 0.844 | 0.779 | 0.834 | 0.608 |
| Age 65–79 | 0.853 | 0.788 | 0.829 | 0.621 |
| Age 80+ | 0.838 | 0.761 | 0.842 | 0.589 |
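The subgroup figures above can be turned into a simple disparity check. The sketch below compares sensitivity (the true positive rate) across groups within each attribute, in the spirit of the equality-of-opportunity criterion of Hardt et al. listed in the references. Which fairness metric was actually applied is not stated here, and the 0.03 tolerance is an arbitrary illustrative value, not a clinical or regulatory threshold. The helper names `tpr_gap` and `flag_gaps` are hypothetical.

```python
# Sensitivity per subgroup, copied from the table above.
SENSITIVITY = {
    "sex": {"Male": 0.793, "Female": 0.764},
    "age": {"18-44": 0.748, "45-64": 0.779, "65-79": 0.788, "80+": 0.761},
}


def tpr_gap(rates: dict) -> tuple:
    """Return (gap, best_group, worst_group) for one attribute."""
    best = max(rates, key=rates.get)
    worst = min(rates, key=rates.get)
    return round(rates[best] - rates[worst], 3), best, worst


def flag_gaps(by_attribute: dict, tolerance: float) -> dict:
    """Map each attribute to its largest sensitivity gap and a tolerance flag."""
    report = {}
    for attribute, rates in by_attribute.items():
        gap, best, worst = tpr_gap(rates)
        report[attribute] = {
            "gap": gap,
            "best": best,
            "worst": worst,
            "exceeds": gap > tolerance,
        }
    return report


# The tolerance is an assumption made for this example only.
report = flag_gaps(SENSITIVITY, tolerance=0.03)
for attribute, row in report.items():
    print(attribute, row)
```

With these inputs the sex gap is 0.029 (Male vs. Female) and the age gap is 0.040 (65–79 vs. 18–44), so only the age attribute exceeds the illustrative tolerance. The same pattern extends to specificity or PPV by swapping the input dictionary.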

| Gap in Standard MLOps | Clinical Risk | Clinical MLOps Control |
|---|---|---|
| No demographic fairness monitoring | Undetected bias in patient subgroups | Layer 2: Disaggregated performance monitoring |
| Automated retraining on drift detection | Unvalidated model in production | Layer 2: Human-escalation on drift alert |
| No clinical outcome linkage | Model calibration invisible to operators | Layer 2: Outcome-linked calibration monitoring |
| No immutable inference logging | AI Act non-compliance | Layer 3: Append-only audit trail |
| No uncertainty quantification at serving | Overconfident predictions used uncritically | Layer 2: Conformal prediction intervals |
| Container security not enforced | Data exfiltration risk | Layer 1: Network isolation + vulnerability scanning |
| No human override tracking | AI Act Article 14 non-compliance | Layer 4: Override logging and review |
| Model lineage not formally recorded | Audit trail incomplete | Layer 3: Immutable model registry with lineage |
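Two rows of the table above, uncertainty quantification at serving and human escalation, can be illustrated together. The following is a minimal sketch of split conformal prediction for a binary classifier, following the general recipe in Angelopoulos and Bates (see References). It is an assumption that prediction sets, rather than intervals, suit a binary deterioration label; the function names and the routing rule in `needs_human_review` are hypothetical and not taken from the paper.

```python
import math


def conformal_threshold(cal_probs, cal_labels, alpha):
    """Split-conformal quantile of the scores 1 - p(true label).

    cal_probs holds predicted probabilities of the positive class on a
    held-out calibration set; cal_labels holds the matching 0/1 labels.
    """
    scores = sorted((1.0 - p) if y == 1 else p for p, y in zip(cal_probs, cal_labels))
    n = len(scores)
    rank = math.ceil((n + 1) * (1.0 - alpha))  # finite-sample corrected rank
    if rank > n:
        return 1.0  # too few calibration points: every label is kept
    return scores[rank - 1]


def prediction_set(prob_positive, threshold):
    """Labels whose nonconformity score does not exceed the threshold."""
    scores = {0: prob_positive, 1: 1.0 - prob_positive}
    return {label for label, score in scores.items() if score <= threshold}


def needs_human_review(pred_set):
    """Ambiguous (both labels) or empty sets are routed to a clinician."""
    return len(pred_set) != 1


# Toy calibration data, invented for illustration only.
cal_probs = [0.9, 0.8, 0.2, 0.1, 0.7, 0.3, 0.6, 0.4, 0.95]
cal_labels = [1, 1, 0, 0, 1, 0, 0, 1, 1]
threshold = conformal_threshold(cal_probs, cal_labels, alpha=0.2)

for p in (0.95, 0.5, 0.05):
    labels = prediction_set(p, threshold)
    print(p, sorted(labels), needs_human_review(labels))
```

Under the usual exchangeability assumption, sets built this way contain the true label with probability at least 1 − alpha. Dataset shift breaks that assumption, which is one reason to pair this mechanism with drift monitoring instead of relying on it alone.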
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2026 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.