Preprint Article

This version is not peer-reviewed.

Federated Learning for Privacy-Preserving Medical Data Sharing in Drug Development

A peer-reviewed version of this preprint was published in:
Applied and Computational Engineering 2025, 134(1), 80-84. https://doi.org/10.54254/2755-2721/2025.20847

Submitted: 21 October 2024

Posted: 22 October 2024


Abstract
This study explores the potential of Federated Learning (FL) to enable privacy-preserving sharing of medical data and collaboration in drug development. Traditional centralized data processing limits effective collaboration across institutions because of data privacy and compliance concerns; federated learning avoids the risk of privacy breaches through a distributed architecture that lets participants train artificial intelligence (AI) models together without sharing raw data. This paper systematically describes the core mechanisms of federated learning, including key technologies such as model parameter updating, differential privacy, and homomorphic encryption, and their applications in drug development and medical data processing. Case studies such as NVIDIA Clara's federated learning platform and COVID-19 resource prediction show that federated learning improves the efficiency of multi-party collaboration and model performance while ensuring data privacy. In addition, this study examines the scalability and generality of federated learning in the medical field, noting that the technology is not only suited to drug development but also has broad cross-industry potential, especially in areas such as finance and insurance where data privacy is critical.

1. Introduction

AI has long been hailed as a driver of the next industrial revolution, and it continues to penetrate industries such as education, business, finance, and manufacturing, as well as social media platforms and healthcare. As the data age matures and advanced computer algorithms emerge, there are better opportunities to build new artificial intelligence models and to use them for faster computation and greater convenience. In healthcare especially, however, centralizing large amounts of data for AI faces multiple potential challenges around privacy and regulation.
Hypothetically, if we could integrate this data more efficiently, breaking through the existing challenges while mitigating these risks, it would open a whole new area of research [1]. Medical data is often scattered across siloed systems, and security and privacy concerns complicate its effective use; advances in AI, however, have brought opportunities for integration and collaboration on this fragmented data.
Federated Learning (FL) is therefore the ideal solution to this problem. It allows data to remain local while fostering collaboration between agencies to build more robust AI models together without sacrificing data privacy and security. Through this approach, organizations can share information while protecting sensitive medical data, creating more possibilities for data utilization in drug development and driving innovation in medicine with privacy protection.

3. Methodology

While federated learning (FL) provides a high level of privacy protection, some risks remain, such as the reconstruction of individual training data through model inversion. One response is to inject noise and distort updates during the training of each node, hiding the contribution of individual nodes and limiting the granularity of the information shared between training nodes. However, existing research on privacy protection has focused on common machine learning benchmark datasets (such as MNIST) and stochastic gradient descent algorithms.
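A minimal sketch of this noise-injection defense, assuming a Gaussian mechanism applied to a client's model update; the function name and the `clip_norm`/`noise_scale` values are illustrative, not taken from the paper:

```python
import numpy as np

def privatize_update(delta_w, clip_norm=1.0, noise_scale=0.1, rng=None):
    """Clip a client's model update to a bounded L2 norm and add Gaussian
    noise, hiding the contribution of any single training node.
    Hyperparameter values here are illustrative assumptions."""
    rng = rng or np.random.default_rng(0)
    norm = np.linalg.norm(delta_w)
    # Scale the update so its L2 norm is at most clip_norm.
    clipped = delta_w * min(1.0, clip_norm / max(norm, 1e-12))
    # Noise proportional to the clipping bound limits what any one
    # node's update can reveal.
    return clipped + rng.normal(0.0, noise_scale * clip_norm, size=delta_w.shape)
```

With `noise_scale=0` the function reduces to plain norm clipping; the noise level trades privacy against update fidelity.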
In this study, we implemented and evaluated a federated learning system for drug development data sharing. Through experiments on clinical trial data, we demonstrate the feasibility of privacy-preserving medical data technology in drug development.
Our key contributions include: (1) to our knowledge, the implementation and evaluation of the first privacy-preserving federated learning system for drug development data analysis; (2) a comparison of how the federated averaging algorithm handles momentum optimization and unbalanced training nodes; (3) an empirical study of the sparse vector technique (SVT) for obtaining a strong differential privacy guarantee.

3.1. Federated Learning Framework

This paper uses the federated averaging algorithm to investigate a federated learning system based on a client-server architecture (shown on the left in Figure 5), in which a central server maintains a global DNN model and coordinates local stochastic gradient descent (SGD) updates on the clients. This section describes the training process of the client model, the aggregation process on the server, and the deployment of the privacy protection module on the client side.

3.2. Patient Data Model Training Process

We assume that each institution participating in federated learning has a fixed local dataset and sufficient computing resources to conduct mini-batch SGD updates. Clients share the same DNN structure and loss function. In round t of federated training, the local model is initialized by reading the global model parameters w(t) from the server and is updated to a local model w_k(t) through multiple SGD iterations. After a fixed number of local iterations, the model difference Δw_k(t) is shared with the aggregation server.
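The client-side round described above can be sketched as follows; a least-squares model stands in for the DNN, and `local_update`, `lr`, and `local_steps` are illustrative names and values, not the paper's:

```python
import numpy as np

def local_update(w_global, X, y, lr=0.01, local_steps=5):
    """One client's training round: initialize from the global weights
    w(t), run a fixed number of (full-batch) gradient steps on the local
    data, and return the model difference Δw shared with the server."""
    w = w_global.copy()
    for _ in range(local_steps):
        grad = X.T @ (X @ w - y) / len(y)  # gradient of the mean squared error
        w -= lr * grad
    return w - w_global  # only Δw leaves the client, never the raw data
```

The raw `(X, y)` stay on the client; the server only ever sees the parameter difference.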
In drug development, models on clinical trial data are often optimized using momentum-based SGD. The momentum term carries gradient information over from previous steps, helping to speed up training and damp fluctuations. We explore design options for handling this state in federated learning. We recommend restarting the momentum variables at each round of federated training (Algorithm 1, applied here with the Adam optimizer) so that stale local state does not interfere with the global model update.
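A sketch of the momentum-restart rule, using plain momentum SGD in place of Adam for brevity; the function and parameter names are illustrative:

```python
import numpy as np

def local_round_with_momentum(w_global, step_grads, lr=0.1, beta=0.9):
    """One federated round with momentum SGD. The momentum buffer is
    re-initialized to zero at the start of every round, so local optimizer
    state from earlier rounds cannot leak into the global update.
    (The paper applies this restart with Adam; plain momentum is used
    here for brevity.)"""
    w = w_global.copy()
    v = np.zeros_like(w)  # momentum restarted each round
    for g in step_grads:  # per-step gradients stand in for local SGD
        v = beta * v + g
        w -= lr * v
    return w - w_global
```

Without the reset, `v` would carry information computed against an outdated global model into the next round.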

3.3. Patient Data Privacy Protection Model

The client has full control over the shared data, and the local training data never leaves the client. Nevertheless, a model inversion attack may extract private patient information from the update Δw(t) or the global model w(t). We employ selective parameter updating and the sparse vector technique (SVT) to provide strong protection against indirect data leakage.
Selective parameter updating: at the end of client training, the full model may overfit and memorize local training data, and sharing such a model may lead to data leakage. The selective parameter sharing method therefore limits the amount of information each client shares. Clients upload only a fraction of Δw_k(t): a parameter is shared only if its absolute value exceeds the threshold τ_k(t). In addition, data privacy is further protected by clipping the shared values into a fixed range. The combination of gradient clipping and selective parameter sharing can further strengthen differential privacy through SVT.
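The selective-sharing step can be sketched as follows; choosing τ as a quantile of |Δw|, and the `share_fraction`/`clip` defaults, are illustrative assumptions rather than the paper's settings:

```python
import numpy as np

def select_and_clip(delta_w, share_fraction=0.4, clip=0.05):
    """Share only the largest-magnitude fraction of Δw; every shared value
    is clipped into [-clip, clip] and all other entries are withheld
    (zeroed). Here the threshold τ is the (1 - share_fraction) quantile
    of |Δw|, an illustrative stand-in for the paper's τ(t)."""
    tau = np.quantile(np.abs(delta_w), 1.0 - share_fraction)
    mask = np.abs(delta_w) >= tau
    shared = np.where(mask, np.clip(delta_w, -clip, clip), 0.0)
    return shared, mask
```

Because most entries are zero, the shared update is sparse, which also reduces communication cost.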

3.4. Server Data Model Aggregation

The server distributes the global model in each round of federated learning and receives synchronized updates from all clients (Algorithm 3). Because clients run different numbers of local iterations, the updates Δw_k(t) they produce may reflect different amounts of training. The contribution of each client should therefore be weighted during aggregation, especially when dealing with unbalanced data in drug development. The sparsity of partially shared models also effectively reduces communication overhead.
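Server-side weighted aggregation can be sketched as below; weighting by local dataset size is one common choice, and the names here are illustrative:

```python
import numpy as np

def aggregate(w_global, client_deltas, client_sizes):
    """Server side of federated averaging: weight each client's Δw by its
    local data size so that clients with more data (or more local
    iterations) contribute proportionally, then apply the averaged
    update to the global model."""
    total = float(sum(client_sizes))
    update = sum((n / total) * d for n, d in zip(client_sizes, client_deltas))
    return w_global + update
```

A uniform average is the special case where all `client_sizes` are equal.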

3.5. Experimental Data

To evaluate the practical application of the federated learning system, we selected a multimodal drug clinical trial dataset containing patient trial data from different institutions. These data are unevenly distributed, and institutions use different test equipment and protocols, so the data characteristics follow different distributions. We divided the dataset into a 242-patient training set and a 43-patient held-out test set. To make federated training realistic, we further divided the training set into 13 non-overlapping subsets, one assigned to each client.

3.6. Experimental Result

Compared with training on a centralized dataset, the federated learning system achieves comparable model performance without sharing client data. In the drug development scenario, FL training, despite a longer convergence time (about 600 rounds), still reaches performance similar to that of a model trained on the centralized dataset. In addition, in our experiments the FL training time was bounded by the computing speed of the slowest client.
Figure 6. Comparison of segmentation performance on the test set. Left: FL vs. non-FL training; right: partial model sharing.

4. Conclusions and Discussion

Through this study, we demonstrate the great potential of federated learning (FL) for privacy-preserving medical data sharing, especially in drug development. Our experimental results show that, despite significant differences in the data characteristics of different clients, the federated learning system can effectively train a model across multiple independent institutions without centrally pooling sensitive patient data.
Momentum restart and weighted average: From the experimental results, the restart strategy for momentum variables significantly improves the convergence rate of the model, proving the necessity of restarting momentum in each round. This strategy avoids the interference of momentum variables between clients, thus ensuring the stability of the global model. Compared to simple model averaging, weighted averaging of momentum variables further improves the performance of the global model, especially when dealing with unbalanced training iterations between clients. The weighted average is better able to accommodate the different data sizes and training resources of different institutions, which is a major advantage in drug development, as pharmaceutical companies and research institutions often differ in the amount of data and computational resources.
Local model sharing and differential privacy: in our experiments, the local model sharing strategy performed well; when clients shared 40% of the model parameters, performance was almost identical to centralized training. High model accuracy can therefore be maintained by selectively sharing local model parameters, even while privacy is protected. Experiments also show that the differential privacy (DP) parameters have a strong effect on model performance. By controlling the share of parameters protected by DP, we found that sharing fewer model parameters performed better at the same privacy cost. This has a key implication for privacy protection in drug development: by optimizing the proportion of shared parameters, the best balance between privacy protection and model performance can be achieved.
Implications for drug development: data security and privacy are important considerations throughout the drug development process, especially in the clinical trial phase. Federated learning technologies allow different pharmaceutical companies and research institutions to collaborate on clinical trial data without compromising patient privacy, accelerating drug discovery and development. For example, pharmaceutical companies can jointly develop more accurate drug response models while keeping data local, bringing new drugs to market more efficiently. In addition, federated learning can effectively address the challenge of unbalanced data in drug development, which is especially important for equitable model training across research institutions.
Future challenges and directions for improvement: although this study demonstrates the feasibility and effectiveness of federated learning in drug development, several challenges remain. First, communication overhead is a significant issue, especially when many clients are involved. Although the sparse vector technique reduced some of the communication burden in our experiments, more efficient communication protocols are still needed to further reduce the bandwidth requirements of federated learning. Second, model heterogeneity among clients may affect the performance of the global model; future research could explore adaptive aggregation strategies that personalize model updates to each client's characteristics. In addition, further optimization of differential privacy techniques is an important direction, especially the question of how to achieve stronger privacy guarantees on large-scale drug development data without significantly degrading model performance.
This study demonstrates the great potential of federated learning in drug development. Through techniques such as momentum restart, local model sharing, and differential privacy, we showed how to achieve efficient model training while protecting data privacy. Future research could further optimize communication efficiency, model aggregation strategies, and privacy protection techniques to advance the application of federated learning in real-world drug development.

References

  1. Bakas, S. , et al.: Identifying the best machine learning algorithms for brain tumor segmentation, progression assessment, and overall survival prediction in the BRATS challenge. arXiv:1811.02629 (2018).
  2. Hitaj, B. , Ateniese, G., Perez-Cruz, F.: Deep models under the GAN: information leakage from collaborative deep learning. In: SIGSAC. pp. 603–618. ACM (2017).
  3. Kingma, D.P. , Ba, J.: Adam: A method for stochastic optimization. arXiv:1412.6980.
  4. Li, L. , Fan, Y. , Tse, M., & Lin, K. Y. A review of applications in federated learning. Computers & Industrial Engineering 2020, 149, 106854. [Google Scholar] [CrossRef]
  5. Geyer, R.C. , Klein, T., Nabi, M.: Differentially private federated learning: A client level perspective. arXiv:1712.07557 (2017).
  6. Truex, S.; , Baracaldo, N.; Anwar, A.; Steinke, T.; Ludwig, H.; Zhang, R.; Zhou, Y. A hybrid approach to privacy-preserving federated learning. In Proceedings of the 12th ACM workshop on artificial intelligence and security, November 2019; pp. 1–11.
  7. Xu, K. , Zhou, H., Zheng, H., Zhu, M., & Xin, Q. (2024). Intelligent Classification and Personalized Recommendation of E-commerce Products Based on Machine Learning. arXiv:2403.19345.
  8. Xu, K. , Zheng, H., Zhan, X., Zhou, S., & Niu, K. (2024). Evaluation and Optimization of Intelligent Recommendation System Performance with Cloud Resource Automation Compatibility. [CrossRef]
  9. Zheng, H. , Xu, K. , Zhou, H., Wang, Y., & Su, G. Medication Recommendation System Based on Natural Language Processing for Patient Emotion Analysis. Academic Journal of Science and Technology 2024, 10, 62–68. [Google Scholar] [CrossRef]
  10. Zheng, H.; Wu, J.; Song, R.; Guo, L.; Xu, Z. Predicting Financial Enterprise Stocks and Economic Data Trends Using Machine Learning Time Series Analysis. Applied and Computational Engineering 2024, 87, 26–32. [Google Scholar] [CrossRef]
  11. El Ouadrhiri, A. , & Abdelhadi, A. Differential privacy for deep and federated learning: A survey. IEEE access 2022, 10, 22359–22380. [Google Scholar] [CrossRef]
  12. Xu, R. , Baracaldo, N. , Zhou, Y., Anwar, A., & Ludwig, H. Hybridalpha: An efficient approach for privacy-preserving federated learning. In Proceedings of the 12th ACM workshop on artificial intelligence and security (pp. 13-23)., November 2019. [Google Scholar]
  13. Li, J. , Wang, Y. , Xu, C., Liu, S., Dai, J., & Lan, K. Bioplastic derived from corn stover: Life cycle assessment and artificial intelligence-based analysis of uncertainty and variability. Science of The Total Environment 2024, 946, 174349. [Google Scholar] [CrossRef]
  14. Xiao, J. , Wang, J. , Bao, W., Deng, T., & Bi, S. Application progress of natural language processing technology in financial research. Financial Engineering and Risk Management 2024, 7, 155–161. [Google Scholar] [CrossRef]
  15. Truong, N. , Sun, K. , Wang, S., Guitton, F., & Guo, Y. Privacy preservation in federated learning: An insightful survey from the GDPR perspective. Computers & Security 2021, 110, 102402. [Google Scholar] [CrossRef]
  16. Mo, F.; Haddadi, H.; Katevas, K.; Marin, E.; Perino, D.; Kourtellis, N. PPFL: Privacy-preserving federated learning with trusted execution environments. In Proceedings of the 19th Annual International Conference on Mobile Systems, Applications, and Services, June 2021; pp. 94–108.
  21. Liang, P. , Song, B. , Zhan, X., Chen, Z., & Yuan, J. Automating the training and deployment of models in MLOps by integrating systems with machine learning. Applied and Computational Engineering 2024, 67, 1–7. [Google Scholar] [CrossRef]
  22. Wu, B. , Gong, Y. , Zheng, H., Zhang, Y., Huang, J., & Xu, J. Enterprise cloud resource optimization and management based on cloud operations. Applied and Computational Engineering 2024, 67, 8–14. [Google Scholar] [CrossRef]
  23. Liu, B. , & Zhang, Y. Implementation of seamless assistance with Google Assistant leveraging cloud computing. Journal of Cloud Computing 2023, 12, 1–15. [Google Scholar] [CrossRef]
  24. Zhang, M. , Yuan, B. , Li, H., & Xu, K. LLM-Cloud Complete: Leveraging Cloud Computing for Efficient Large Language Model-based Code Completion. Journal of Artificial Intelligence General science (JAIGS) ISSN: 3006-4023 2024, 5, 295–326. [Google Scholar] [CrossRef]
  25. Li, P., Hua, Y., Cao, Q., Zhang, M. Improving the Restore Performance via Physical-Locality Middleware for Backup Systems. In Proceedings of the 21st International Middleware Conference, December 2020; pp. 341–355.
  26. Zhou, S. , Yuan, B., Xu, K., Zhang, M., & Zheng, W. THE IMPACT OF PRICING SCHEMES ON CLOUD COMPUTING AND DISTRIBUTED SYSTEMS. Journal of Knowledge Learning and Science Technology ISSN: 2959-6386 (online) 2024, 3, 193–205. [Google Scholar] [CrossRef]
  27. Adnan, M. , Kalra, S. , Cresswell, J. C., Taylor, G. W., & Tizhoosh, H. R. Federated learning and differential privacy for medical image analysis. Scientific reports 2022, 12, 1953. [Google Scholar] [CrossRef] [PubMed]
  28. Ju, Chengru, and Yida Zhu. "Reinforcement Learning Based Model for Enterprise Financial Asset Risk Assessment and Intelligent Decision Making." (2024).
  29. Yu, Keke, et al. "Loan Approval Prediction Improved by XGBoost Model Based on Four-Vector Optimization Algorithm." (2024).
  30. Zhou, S.; Sun, J.; Xu, K. AI-Driven Data Processing and Decision Optimization in IoT through Edge Computing and Cloud Architecture. Preprints 2024, 2024100736. [CrossRef]
  31. Sun, J. , Zhou, S., Zhan, X., & Wu, J. (2024). Enhancing Supply Chain Efficiency with Time Series Analysis and Deep Learning Techniques. Preprints 2024, 2024090983. [CrossRef]
  32. Zheng, H. , Xu, K. , Zhang, M., Tan, H., & Li, H. Efficient resource allocation in cloud computing environments using AI-driven predictive analytics. Applied and Computational Engineering 2024, 82, 6–12. [Google Scholar] [CrossRef]
Figure 1. Architecture of a federated learning system.
Figure 5. Left: illustration of the federated learning system; right: distribution of the training subjects (N=242) across the participating federated clients (C=13) studied in this paper.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
Copyright: This open access article is published under a Creative Commons CC BY 4.0 license, which permits free download, distribution, and reuse, provided that the author and preprint are cited in any reuse.