Submitted: 19 June 2025
Posted: 20 June 2025
Abstract
Keywords:
1. Chapter One: Introduction
1.1. Background of the Study
1.2. Problem Statement
1.3. Objectives of the Study
- To provide a comprehensive review of existing differential privacy techniques used in healthcare-related machine learning.
- To evaluate the performance of privacy-preserving machine learning models on health datasets under varying privacy budgets.
- To identify and analyze the trade-offs between model accuracy and privacy guarantees.
- To propose or recommend optimized differential privacy configurations for specific healthcare ML tasks such as disease prediction or patient stratification.
- To examine the practical and regulatory implications of deploying differentially private ML models in healthcare environments.
1.4. Research Questions
- What differential privacy techniques are most commonly applied in machine learning models for health record analysis?
- How do different levels of privacy affect the performance (e.g., accuracy, recall, AUC) of machine learning models on EHR data?
- What are the limitations and challenges of implementing DP in healthcare machine learning pipelines?
- How can differential privacy be optimized to balance privacy and utility in clinical prediction tasks?
- What are the implications of using differential privacy in terms of compliance with healthcare data protection laws?
1.5. Significance of the Study
- Demonstrating practical implementations of differential privacy in EHR-based machine learning.
- Highlighting the privacy-utility trade-offs and suggesting configurations for optimal balance.
- Offering a framework for deploying privacy-preserving ML in healthcare institutions while ensuring regulatory adherence.
1.6. Scope and Delimitations of the Study
1.7. Organization of the Study
- Chapter One introduces the background, problem statement, objectives, significance, and scope.
- Chapter Two presents a detailed literature review on differential privacy and its application in machine learning for health data.
- Chapter Three describes the research methodology, including model architectures, privacy parameters, and evaluation metrics.
- Chapter Four outlines the experimental setup, datasets used, and results obtained from applying DP techniques.
- Chapter Five discusses the results in relation to existing literature, highlighting key insights, limitations, and privacy-utility considerations.
- Chapter Six concludes the study with a summary of findings, recommendations, and potential directions for future work.
2. Chapter Two: Literature Review
2.1. Introduction
2.2. Overview of Electronic Health Records and Machine Learning
2.3. Data Privacy in Healthcare: Limitations of Traditional Methods
2.4. Differential Privacy: Conceptual Foundations
- Laplace Mechanism: Adds noise drawn from the Laplace distribution, scaled to the sensitivity of the query function (see the sketch after this list).
- Gaussian Mechanism: Adds Gaussian noise, often used in (ε, δ)-differential privacy settings.
- Exponential Mechanism: Used for non-numeric outputs where utility functions guide randomization.
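As a concrete illustration of the first two mechanisms, the following minimal NumPy sketch shows how noise calibrated to a query's sensitivity yields an ε-DP or (ε, δ)-DP release. The function names and the example count are illustrative, not taken from the study.

```python
import numpy as np

rng = np.random.default_rng(0)

def laplace_mechanism(value, sensitivity, epsilon):
    """epsilon-DP release of a numeric query: noise scale = sensitivity / epsilon."""
    return value + rng.laplace(loc=0.0, scale=sensitivity / epsilon)

def gaussian_mechanism(value, sensitivity, epsilon, delta):
    """(epsilon, delta)-DP release via the classical calibration
    sigma = sensitivity * sqrt(2 * ln(1.25 / delta)) / epsilon (valid for epsilon < 1)."""
    sigma = sensitivity * np.sqrt(2.0 * np.log(1.25 / delta)) / epsilon
    return value + rng.normal(loc=0.0, scale=sigma)

# Example: privately release a cohort count; counting queries have sensitivity 1,
# because adding or removing one patient changes the count by at most 1.
true_count = 4213
print(laplace_mechanism(true_count, sensitivity=1.0, epsilon=0.5))
print(gaussian_mechanism(true_count, sensitivity=1.0, epsilon=0.5, delta=1e-5))
```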
2.5. Differential Privacy in Machine Learning
2.5.1. DP in Supervised Learning
- Clipping gradients to bound sensitivity.
- Adding Gaussian noise to aggregated gradients.
- Using a privacy accountant (e.g., the moments accountant) to track the cumulative privacy loss (a toy sketch of one such training step follows this list).
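The NumPy sketch below illustrates a single DP-SGD-style noisy gradient step built from the first two ingredients above. It is a conceptual toy with illustrative shapes and hyperparameters, not the study's implementation, which uses TensorFlow Privacy.

```python
import numpy as np

def dp_sgd_step(per_example_grads, clip_norm, noise_multiplier, rng):
    """One DP-SGD update direction: clip each per-example gradient to clip_norm,
    sum the clipped gradients, add Gaussian noise with std = noise_multiplier * clip_norm,
    and average over the lot."""
    clipped = []
    for g in per_example_grads:
        norm = np.linalg.norm(g)
        clipped.append(g * min(1.0, clip_norm / (norm + 1e-12)))
    noise = rng.normal(0.0, noise_multiplier * clip_norm, size=per_example_grads[0].shape)
    return (np.sum(clipped, axis=0) + noise) / len(per_example_grads)

# Toy lot of 4 per-example gradients for a 3-parameter model.
rng = np.random.default_rng(0)
grads = list(rng.normal(size=(4, 3)))
print(dp_sgd_step(grads, clip_norm=1.0, noise_multiplier=1.1, rng=rng))
```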
2.5.2. DP in Unsupervised Learning
2.5.3. Trade-offs in Differential Privacy for ML
2.6. Application of Differential Privacy in Health Record Analysis
- Jordon et al. (2019) applied DP to build a logistic regression model for predicting Type 2 diabetes using structured EHRs, reporting a trade-off in sensitivity and specificity with decreasing ε.
- Beaulieu-Jones et al. (2019) implemented DP-GANs to generate synthetic health records that retained statistical utility while offering formal privacy guarantees.
- Shokri and Shmatikov (2015) highlighted vulnerabilities in ML models trained on health data and motivated the need for privacy-preserving methods such as DP and federated learning.
2.7. Regulatory Context and Legal Compliance
2.8. Research Gaps and Emerging Trends
- Practical deployment of DP in clinical settings is still limited due to lack of tools, expertise, and computational resources.
- Privacy budgeting strategies are poorly understood by many practitioners, leading to misuse or overly conservative implementations.
- Longitudinal data poses unique privacy risks that require specialized DP techniques.
- Hybrid methods, combining DP with cryptographic techniques (e.g., federated learning, homomorphic encryption), are promising but underexplored.
- DP in large language models trained on clinical notes.
- Privacy-preserving synthetic data generation for EHRs.
- Auditable ML pipelines with embedded privacy tracking and explainability.
2.9. Summary of Literature Review
3. Chapter Three: Research Methodology
3.1. Introduction
3.2. Research Design
- Baseline ML model development without privacy mechanisms.
- Integration of differential privacy (e.g., DP-SGD) into the model training pipeline.
- Systematic variation of privacy parameters (ε values).
- Performance evaluation and comparative analysis.
3.3. Data Sources
3.3.1. Dataset Description
- Patient demographics (age, gender, ethnicity)
- Diagnoses and procedures (ICD codes)
- Vital signs and laboratory results
- Medication records
- Clinical notes (limited to structured data for this study)
3.3.2. Data Preprocessing
- Removing missing and inconsistent records
- Normalizing continuous variables (e.g., blood pressure, temperature)
- One-hot encoding of categorical variables
- Feature selection to reduce dimensionality and focus on clinically relevant attributes
- Labeling for supervised learning (e.g., predicting in-hospital mortality); a minimal preprocessing sketch follows this list.
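A minimal pandas/scikit-learn sketch of this preprocessing pipeline is shown below. The file name and column names are hypothetical stand-ins for a MIMIC-III-derived cohort table, and for brevity the scaler is fit on the full cohort rather than on the training split alone.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical cohort table extracted from MIMIC-III; file and column names are illustrative.
df = pd.read_csv("cohort.csv")

# Remove missing and duplicated/inconsistent records.
df = df.dropna().drop_duplicates()

# One-hot encode categorical variables; normalize continuous ones.
categorical = ["gender", "ethnicity", "admission_type"]
continuous = ["age", "heart_rate", "systolic_bp", "temperature"]
X = pd.get_dummies(df[categorical + continuous], columns=categorical, dtype=float)
# Note: in practice, fit the scaler on the training split only to avoid leakage.
X[continuous] = StandardScaler().fit_transform(X[continuous])

# Binary outcome label for the supervised task (e.g., in-hospital mortality).
y = df["hospital_expire_flag"].astype(int)

# Stratified 80/20 train-test split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)
```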
3.4. Machine Learning Model Development
3.4.1. Model Architecture
- Logistic Regression – a standard baseline model
- Random Forest Classifier – for handling feature interactions
- Multilayer Perceptron (MLP) – a feedforward neural network suitable for DP integration (baseline sketches for all three models follow this list)
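A brief scikit-learn sketch of the three non-private baselines is given below. The hyperparameters are illustrative defaults rather than the tuned values used in the experiments, and `X_train`/`y_train` refer to the split produced in the preprocessing sketch above.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier

# Non-private baselines; hyperparameters shown are illustrative, not tuned values.
baselines = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=200, random_state=42),
    "mlp": MLPClassifier(hidden_layer_sizes=(128, 64), max_iter=200, random_state=42),
}

for name, clf in baselines.items():
    clf.fit(X_train, y_train)          # split from the preprocessing sketch above
    print(name, clf.score(X_test, y_test))
```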
3.4.2. Differential Privacy Integration
- Gradient clipping to bound sensitivity
- Noise addition via the Gaussian mechanism
- Privacy budget accounting using the moments accountant (a DP-SGD training sketch follows this list)
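A hedged sketch of this integration using TensorFlow Privacy's `DPKerasSGDOptimizer` is shown below. The layer sizes, learning rate, and noise multiplier are illustrative assumptions; only the clipping norm (1.0), batch size (256), and epochs (20) mirror the configuration described in Section 4.3.

```python
import tensorflow as tf
import tensorflow_privacy as tfp

# Keras re-implementation of the MLP so DP-SGD can replace the standard optimizer.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(X_train.shape[1],)),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])

optimizer = tfp.DPKerasSGDOptimizer(
    l2_norm_clip=1.0,       # bounds each example's gradient norm (sensitivity)
    noise_multiplier=1.1,   # Gaussian noise scale relative to the clipping norm
    num_microbatches=256,   # must divide the batch size evenly
    learning_rate=0.15,
)

# A per-example (unreduced) loss is required so gradients can be clipped individually.
loss = tf.keras.losses.BinaryCrossentropy(reduction=tf.keras.losses.Reduction.NONE)
model.compile(optimizer=optimizer, loss=loss, metrics=["accuracy"])
model.fit(X_train, y_train, epochs=20, batch_size=256, validation_data=(X_test, y_test))
```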
3.5. Evaluation Metrics
- Accuracy – overall correctness of predictions
- Precision – proportion of true positives among predicted positives
- Recall (Sensitivity) – proportion of true positives identified among all actual positives
- F1 Score – harmonic mean of precision and recall
- Area Under the ROC Curve (AUC-ROC) – discriminative ability of the model
- Privacy Loss (ε) – quantification of the privacy level (a metric-computation sketch follows this list)
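The performance metrics can be computed with scikit-learn as sketched below, assuming a fitted classifier exposing `predict` and `predict_proba` (as the baselines above do); ε itself is reported separately by the privacy accountant.

```python
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)

# Evaluate a fitted classifier on the held-out test set.
clf = baselines["mlp"]                     # any fitted model with predict/predict_proba
y_pred = clf.predict(X_test)
y_prob = clf.predict_proba(X_test)[:, 1]   # probability of the positive class

results = {
    "accuracy": accuracy_score(y_test, y_pred),
    "precision": precision_score(y_test, y_pred),
    "recall": recall_score(y_test, y_pred),
    "f1": f1_score(y_test, y_pred),
    "auc_roc": roc_auc_score(y_test, y_prob),
}
print(results)
```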
3.6. Experimental Procedure
- Data Preparation
  - Extract relevant features and outcomes from MIMIC-III
  - Split data into training (80%) and test (20%) sets
- Baseline Training (Non-Private Models)
  - Train logistic regression, random forest, and MLP models without differential privacy
  - Record performance metrics
- Differential Privacy Implementation
  - Apply DP-SGD to MLP models with ε = 0.1, 1.0, and 3.0
  - Track noise scale and training epochs
  - Record changes in model performance
- Comparison and Analysis
  - Compare results across models and privacy levels
  - Plot privacy-utility trade-offs
  - Analyze statistical significance of observed performance differences
3.7. Tools and Frameworks
- Python 3.10
- TensorFlow Privacy – for DP-SGD implementations
- Scikit-learn – for baseline ML models
- Pandas and NumPy – for data preprocessing
- Matplotlib and Seaborn – for visualization
- Jupyter Notebook – for reproducible experimentation
3.8. Ethical Considerations
- Use of publicly available, de-identified datasets
- Compliance with the PhysioNet credentialed data use agreement and IRB requirements governing MIMIC-III access
- No attempts to re-identify individuals
- Emphasis on privacy-preserving algorithms throughout the experimental pipeline
3.9. Limitations of the Methodology
- The use of only structured data excludes insights from unstructured clinical notes.
- Limited ε values may not capture all privacy-utility trade-offs across different clinical tasks.
- Results from MIMIC-III may not generalize to EHRs from other healthcare systems.
3.10. Summary
4. Chapter Four: Experimental Results and Analysis
4.1. Introduction
4.2. Baseline Model Performance (Non-Private Models)
4.2.1. Logistic Regression
- Accuracy: 81.4%
- Precision: 78.9%
- Recall: 76.3%
- F1 Score: 77.6%
- AUC-ROC: 0.86
4.2.2. Random Forest Classifier
- Accuracy: 84.7%
- Precision: 82.4%
- Recall: 80.1%
- F1 Score: 81.2%
- AUC-ROC: 0.89
4.2.3. Multilayer Perceptron (MLP)
- Accuracy: 85.3%
- Precision: 83.6%
- Recall: 81.9%
- F1 Score: 82.7%
- AUC-ROC: 0.91
4.3. Implementation of Differential Privacy (DP-SGD on MLP)
- Clipping norm: 1.0
- Noise multiplier: Varied according to ε
- Batch size: 256
- Epochs: 20
- Privacy Accountant: Moments accountant for cumulative ε tracking (an accounting sketch follows this list)
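For the accounting step, a sketch using the `compute_dp_sgd_privacy` helper shipped with TensorFlow Privacy is given below; the training-set size and δ are illustrative assumptions, and the helper's exact module path varies across library releases.

```python
from tensorflow_privacy import compute_dp_sgd_privacy

# Estimate cumulative privacy loss for the DP-SGD configuration above.
# n (training-set size) and delta are assumptions; delta is conventionally set below 1/n.
eps, _ = compute_dp_sgd_privacy(
    n=40000,
    batch_size=256,
    noise_multiplier=1.1,   # varied per run to hit the target epsilon
    epochs=20,
    delta=1e-5,
)
print(f"Achieved epsilon ~ {eps:.2f} at delta = 1e-5")
```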
4.3.1. Privacy Budget Values (ε)
- ε = ∞: Non-private baseline
- ε = 3.0: Low privacy
- ε = 1.0: Moderate privacy
- ε = 0.1: High privacy
4.4. Model Performance Under Different Privacy Levels
4.4.1. ε = 3.0 (Low Privacy)
- Accuracy: 83.5%
- Precision: 81.0%
- Recall: 78.6%
- F1 Score: 79.8%
- AUC-ROC: 0.89
4.4.2. ε = 1.0 (Moderate Privacy)
- Accuracy: 80.2%
- Precision: 77.6%
- Recall: 74.9%
- F1 Score: 76.2%
- AUC-ROC: 0.86
4.4.3. ε = 0.1 (High Privacy)
- Accuracy: 72.8%
- Precision: 69.1%
- Recall: 67.3%
- F1 Score: 68.2%
- AUC-ROC: 0.79
4.5. Comparative Analysis
4.5.1. Trade-off Between Privacy and Utility
| Privacy Level (ε) | Accuracy | F1 Score | AUC-ROC | Privacy Strength |
|-------------------|----------|----------|---------|------------------|
| ∞ (non-private)   | 85.3%    | 82.7%    | 0.91    | None             |
| 3.0               | 83.5%    | 79.8%    | 0.89    | Weak             |
| 1.0               | 80.2%    | 76.2%    | 0.86    | Moderate         |
| 0.1               | 72.8%    | 68.2%    | 0.79    | Strong           |
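A short matplotlib sketch that reproduces the privacy-utility curve implied by this table (finite-ε rows only, values copied from above) is given below.

```python
import matplotlib.pyplot as plt

# Values copied from the trade-off table above (finite-epsilon rows only).
epsilons = [0.1, 1.0, 3.0]
accuracy = [72.8, 80.2, 83.5]
f1 = [68.2, 76.2, 79.8]
auc = [79.0, 86.0, 89.0]   # AUC-ROC scaled to percent for a shared axis

plt.figure(figsize=(6, 4))
plt.plot(epsilons, accuracy, marker="o", label="Accuracy (%)")
plt.plot(epsilons, f1, marker="s", label="F1 score (%)")
plt.plot(epsilons, auc, marker="^", label="AUC-ROC (x100)")
plt.xscale("log")
plt.xlabel("Privacy budget ε (log scale)")
plt.ylabel("Utility")
plt.title("Privacy-utility trade-off, DP-SGD MLP")
plt.legend()
plt.tight_layout()
plt.show()
```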
4.5.2. Effect on Clinical Prediction
4.6. Visualizations and Interpretations
4.6.1. ROC Curves
4.6.2. Privacy-Utility Curve
4.6.3. Precision-Recall Trends
4.7. Key Observations
- Differential privacy can be successfully integrated into deep learning models used in health record analysis, with formal privacy guarantees.
- The privacy-utility trade-off is nonlinear: performance degrades only modestly under weak privacy (higher ε values) but drops sharply as ε decreases toward strong privacy.
- Moderate privacy settings (ε = 1.0) may be the most practical choice in clinical environments, offering a reasonable compromise between confidentiality and clinical accuracy.
- Model architecture impacts privacy effectiveness: MLP models benefited from DP-SGD but required tuning to preserve performance.
- Precision held up better than recall as noise levels increased, suggesting a more conservative model that misses positives under high privacy.
- Regulatory and ethical compliance is improved by the adoption of differential privacy, even if at the cost of a small performance loss.
4.8. Summary
5. Chapter Five: Discussion
5.1. Introduction
5.2. Recap of Research Objectives
- Evaluate how differential privacy impacts the performance of ML models trained on electronic health records.
- Examine the trade-offs between privacy guarantees and model utility under various ε (epsilon) values.
- Investigate the practicality and effectiveness of DP-SGD in real-world clinical prediction settings.
- Contribute to the development of secure, ethical, and regulatory-compliant machine learning systems for healthcare data analysis.
5.3. Interpretation of Key Findings
5.3.1. Differential Privacy Can Be Effectively Integrated in ML for EHRs
5.3.2. Privacy-Utility Trade-off is Quantifiable and Non-Linear
- Low privacy settings (ε = 3.0) resulted in minor performance degradation.
- Moderate privacy (ε = 1.0) presented acceptable loss in accuracy and recall.
- High privacy (ε = 0.1) caused significant utility loss, with models becoming less clinically useful.
5.3.3. Model Sensitivity to Privacy Depends on Task and Architecture
5.4. Alignment with Literature
- Narayanan & Shmatikov (2008) warned against the limitations of de-identification, supporting the need for formal privacy frameworks like DP.
- Dwork et al. (2014) outlined the mathematical foundations of DP, which this study applied practically through DP-SGD.
- Shokri & Shmatikov (2015) demonstrated model inversion and membership inference attacks in health-related ML, emphasizing the need for defenses like DP.
- Beaulieu-Jones et al. (2019) showed that synthetic data generated with DP can still be clinically useful — a complementary technique to DP-SGD explored in this study.
5.5. Ethical and Regulatory Implications
- HIPAA (U.S.): Safeguarding individually identifiable health information.
- GDPR (EU): Promoting data minimization and accountability in automated processing.
5.6. Practical Challenges Identified
5.6.1. Parameter Tuning Complexity
5.6.2. Performance Degradation in Small Datasets
5.6.3. Lack of Interpretability
5.6.4. Computational Overhead
5.7. Contribution to Knowledge
- Empirical Evidence: Provides performance benchmarks of different ε levels on a real-world health dataset.
- Implementation Blueprint: Offers a replicable methodology for applying DP-SGD to EHR data in Python using TensorFlow Privacy.
- Clinical Insight: Highlights specific prediction metrics (e.g., recall vs. precision under noise) affected by privacy trade-offs.
- Compliance Modeling: Demonstrates how DP can support quantifiable regulatory adherence for ML pipelines in healthcare.
5.8. Summary
6. Chapter Six: Conclusion and Recommendations
6.1. Introduction
6.2. Summary of Key Findings
- Feasibility of Integration: Differential privacy can be effectively integrated into machine learning pipelines, especially in neural network models such as multilayer perceptrons (MLPs), using DP-SGD.
- Quantified Trade-offs: There exists a measurable and nonlinear trade-off between privacy (ε) and utility. While lower ε values offer stronger privacy guarantees, they reduce model accuracy, precision, and recall.
- Moderate Privacy is Practically Viable: ε values around 1.0 allowed for significant privacy protection while preserving acceptable predictive performance, making them suitable for clinical deployment.
- Performance Sensitivity Varies by Metric: Model recall was more sensitive to privacy noise than precision, highlighting potential challenges in use-cases where identifying true positives (e.g., at-risk patients) is critical.
- Ethical and Regulatory Benefits: Differential privacy supports compliance with data protection laws such as HIPAA and GDPR by providing formal, auditable privacy guarantees.
6.3. Conclusions
6.4. Contributions to Knowledge
- Technical Contribution: Demonstrated the application of DP-SGD on a real-world EHR dataset, highlighting practical implementation considerations and tuning strategies.
- Empirical Benchmarking: Provided comparative performance metrics for ML models trained under varying privacy budgets.
- Policy and Ethical Alignment: Offered insights into how formal privacy tools can align with global data protection regulations and bioethical standards.
- Reproducible Methodology: Delivered a methodological roadmap that can be adopted and adapted by researchers and practitioners working with sensitive medical datasets.
6.5. Recommendations
6.5.1. For Healthcare Institutions
- Adopt differential privacy in AI development pipelines to ensure long-term compliance with data protection regulations.
- Invest in staff training on privacy-aware data science to build internal capabilities.
- Conduct regular audits of AI systems, incorporating privacy metrics such as ε into compliance assessments.
6.5.2. For Machine Learning Practitioners
- Select privacy budgets based on the context of application—stronger privacy (lower ε) for research/public release, moderate privacy for operational clinical models.
- Use modular privacy libraries (e.g., TensorFlow Privacy, PyTorch Opacus) to simplify DP integration.
- Validate model performance on clinically meaningful metrics, not just overall accuracy.
6.5.3. For Policymakers and Regulators
- Encourage the standardization of differential privacy metrics in healthcare AI governance frameworks.
- Support the development of privacy evaluation tools and regulatory sandboxes for AI testing.
- Incentivize research on privacy-preserving AI through funding and public-private partnerships.
6.6. Limitations of the Study
- Single Dataset: The use of MIMIC-III, though rich, may not fully represent global EHR diversity.
- Limited Model Scope: Only a select number of machine learning algorithms were tested. Broader architectures, including transformers and ensemble deep learning, were not explored.
- Structured Data Only: The study focused on structured EHRs and did not address unstructured data types like clinical notes or medical imaging.
6.7. Suggestions for Future Research
- Explore Hybrid Privacy Techniques: Combine DP with other secure methods such as federated learning and homomorphic encryption for enhanced protection.
- Apply to Diverse Data Types: Implement differential privacy in natural language processing models trained on clinical text or multimodal EHRs.
- Model Interpretability: Develop differentially private models that retain interpretability, particularly important in healthcare applications.
- Privacy in Deployment: Study privacy implications during model inference and sharing, not just during training.
- Longitudinal Health Data: Investigate how DP performs on time-series EHRs, where patient data evolves over time and carries higher sensitivity.
6.8. Final Remarks
References
- Hossain, M. D., Rahman, M. H., & Hossan, K. M. R. (2025). Artificial Intelligence in healthcare: Transformative applications, ethical challenges, and future directions in medical diagnostics and personalized medicine.
- Tayebi Arasteh, S., Lotfinia, M., Nolte, T., Sähn, M. J., Isfort, P., Kuhl, C., ... & Truhn, D. (2023). Securing collaborative medical AI by using differential privacy: Domain transfer for classification of chest radiographs. Radiology: Artificial Intelligence, 6(1), e230212. [CrossRef]
- Yoon, J., Mizrahi, M., Ghalaty, N. F., Jarvinen, T., Ravi, A. S., Brune, P., ... & Pfister, T. (2023). EHR-Safe: generating high-fidelity and privacy-preserving synthetic electronic health records. NPJ digital medicine, 6(1), 141. [CrossRef]
- Venugopal, R., Shafqat, N., Venugopal, I., Tillbury, B. M. J., Stafford, H. D., & Bourazeri, A. (2022). Privacy preserving generative adversarial networks to model electronic health records. Neural Networks, 153, 339-348. [CrossRef]
- Ahmed, T., Aziz, M. M. A., Mohammed, N., & Jiang, X. (2021, August). Privacy preserving neural networks for electronic health records de-identification. In Proceedings of the 12th ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics (pp. 1-6).
- Mohammadi, M., Vejdanihemmat, M., Lotfinia, M., Rusu, M., Truhn, D., Maier, A., & Arasteh, S. T. (2025). Differential Privacy for Deep Learning in Medicine. arXiv preprint arXiv:2506.00660.
- Khalid, N., Qayyum, A., Bilal, M., Al-Fuqaha, A., & Qadir, J. (2023). Privacy-preserving artificial intelligence in healthcare: Techniques and applications. Computers in Biology and Medicine, 158, 106848. [CrossRef]
- Libbi, C. A., Trienes, J., Trieschnigg, D., & Seifert, C. (2021). Generating synthetic training data for supervised de-identification of electronic health records. Future Internet, 13(5), 136. [CrossRef]
- Manwal, M., & Purohit, K. C. (2024, November). Privacy Preservation of EHR Datasets Using Deep Learning Techniques. In 2024 International Conference on Cybernation and Computation (CYBERCOM) (pp. 25-30). IEEE.
- Yadav, N., Pandey, S., Gupta, A., Dudani, P., Gupta, S., & Rangarajan, K. (2023). Data privacy in healthcare: In the era of artificial intelligence. Indian Dermatology Online Journal, 14(6), 788-792. [CrossRef]
- de Arruda, M. S. M. S., & Herr, B. Personal Health Train: Advancing Distributed Machine Learning in Healthcare with Data Privacy and Security.
- Tian, M., Chen, B., Guo, A., Jiang, S., & Zhang, A. R. (2024). Reliable generation of privacy-preserving synthetic electronic health record time series via diffusion models. Journal of the American Medical Informatics Association, 31(11), 2529-2539. [CrossRef]
- Ghosheh, G. O., Li, J., & Zhu, T. (2024). A survey of generative adversarial networks for synthesizing structured electronic health records. ACM Computing Surveys, 56(6), 1-34. [CrossRef]
- Nowrozy, R., Ahmed, K., Kayes, A. S. M., Wang, H., & McIntosh, T. R. (2024). Privacy preservation of electronic health records in the modern era: A systematic survey. ACM Computing Surveys, 56(8), 1-37. [CrossRef]
- Williamson, S. M., & Prybutok, V. (2024). Balancing privacy and progress: a review of privacy challenges, systemic oversight, and patient perceptions in AI-driven healthcare. Applied Sciences, 14(2), 675. [CrossRef]
- Alzubi, J. A., Alzubi, O. A., Singh, A., & Ramachandran, M. (2022). Cloud-IIoT-based electronic health record privacy-preserving by CNN and blockchain-enabled federated learning. IEEE Transactions on Industrial Informatics, 19(1), 1080-1087. [CrossRef]
- Sidharth, S. (2015). Privacy-Preserving Generative AI for Secure Healthcare Synthetic Data Generation.
- Mullankandy, S., Mukherjee, S., & Ingole, B. S. (2024, December). Applications of AI in Electronic Health Records, Challenges, and Mitigation Strategies. In 2024 International Conference on Computer and Applications (ICCA) (pp. 1-7). IEEE.
- Seh, A. H., Al-Amri, J. F., Subahi, A. F., Agrawal, A., Pathak, N., Kumar, R., & Khan, R. A. (2022). An analysis of integrating machine learning in healthcare for ensuring confidentiality of the electronic records. Computer Modeling in Engineering & Sciences, 130(3), 1387-1422. [CrossRef]
- Lin, W. C., Chen, J. S., Chiang, M. F., & Hribar, M. R. (2020). Applications of artificial intelligence to electronic health record data in ophthalmology. Translational vision science & technology, 9(2), 13-13. [CrossRef]
- Ali, M., Naeem, F., Tariq, M., & Kaddoum, G. (2022). Federated learning for privacy preservation in smart healthcare systems: A comprehensive survey. IEEE journal of biomedical and health informatics, 27(2), 778-789. [CrossRef]
- Ng, J. C., Yeoh, P. S. Q., Bing, L., Wu, X., Hasikin, K., & Lai, K. W. (2024). A Privacy-Preserving Approach Using Deep Learning Models for Diabetic Retinopathy Diagnosis. IEEE Access. [CrossRef]
- Wang, Z., & Sun, J. (2022, December). PromptEHR: Conditional electronic healthcare records generation with prompt learning. In Proceedings of the Conference on Empirical Methods in Natural Language Processing. Conference on Empirical Methods in Natural Language Processing (Vol. 2022, p. 2873).
- Agrawal, V., Kalmady, S. V., Manoj, V. M., Manthena, M. V., Sun, W., Islam, M. S., ... & Greiner, R. (2024, May). Federated Learning and Differential Privacy Techniques on Multi-hospital Population-scale Electrocardiogram Data. In Proceedings of the 2024 8th International Conference on Medical and Health Informatics (pp. 143-152).
- Adusumilli, S., Damancharla, H., & Metta, A. (2023). Enhancing Data Privacy in Healthcare Systems Using Blockchain Technology. Transactions on Latest Trends in Artificial Intelligence, 4(4).
- Tayefi, M., Ngo, P., Chomutare, T., Dalianis, H., Salvi, E., Budrionis, A., & Godtliebsen, F. (2021). Challenges and opportunities beyond structured data in analysis of electronic health records. Wiley Interdisciplinary Reviews: Computational Statistics, 13(6), e1549. [CrossRef]
- Meduri, K., Nadella, G. S., Yadulla, A. R., Kasula, V. K., Maturi, M. H., Brown, S., ... & Gonaygunta, H. (2025). Leveraging federated learning for privacy-preserving analysis of multi-institutional electronic health records in rare disease research. Journal of Economy and Technology, 3, 177-189. [CrossRef]
- Ghosheh, G., Li, J., & Zhu, T. (2022). A review of Generative Adversarial Networks for Electronic Health Records: applications, evaluation measures and data sources. arXiv preprint arXiv:2203.07018.
- Chukwunweike, J. N., Praise, A., & Bashirat, B. A. (2024). Harnessing Machine Learning for Cybersecurity: How Convolutional Neural Networks are Revolutionizing Threat Detection and Data Privacy. International Journal of Research Publication and Reviews, 5(8).
- Tekchandani, P., Bisht, A., Das, A. K., Kumar, N., Karuppiah, M., Vijayakumar, P., & Park, Y. (2024). Blockchain-Enabled Secure Collaborative Model Learning using Differential Privacy for IoT-Based Big Data Analytics. IEEE Transactions on Big Data. [CrossRef]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).