Preprint (this version is not peer-reviewed)

MediVault: An Auditable and Secure Federated Learning System for Privacy-Preserving Healthcare Collaboration

Submitted: 17 April 2026
Posted: 20 April 2026


Abstract
Abstract
Healthcare analytics is often limited by fragmented data and strict privacy requirements, which make it difficult to share patient-level records across organisations and to build robust predictive models. Federated learning (FL) provides an alternative by keeping data local and exchanging model updates instead of raw records. However, many existing FL solutions remain difficult to deploy in healthcare settings, as they provide limited support for auditability, governance-oriented evidence, and system-level transparency. This paper presents MediVault, an auditable and secure federated learning-based system for privacy-preserving healthcare collaboration. MediVault combines round-based federated training, protected update exchange, audit-ready telemetry, and an interactive dashboard that exposes non-sensitive evidence of collaboration, model progress, and protocol execution. In addition, the system supports controlled reporting to improve stakeholder communication during pilot deployments. We evaluate MediVault on two public healthcare classification datasets, Breast Cancer Wisconsin (Diagnostic) and Heart Disease, under settings designed to reflect multi-site heterogeneity. Experiments are conducted using two interpretable linear models, logistic regression and linear SVM, under matched settings. Results show that federated training remains competitive with centralised training across both datasets. These findings suggest that an auditable and secure FL workflow can preserve predictive utility while also supporting the transparency, governance readiness, and practical system behaviour needed for privacy-preserving multi-organisation healthcare collaboration.

1. Introduction

Digital healthcare systems generate large amounts of data, but turning these data into useful decisions is still difficult. Electronic health records (EHRs), laboratory systems, medical imaging platforms, and patient-facing devices all produce valuable information. These data could support earlier risk detection, more personalised care, and better clinical pathway planning. However, building deployable clinical intelligence systems from such data is still challenging [1,2,3,4,5]. Patient-level data are distributed across different hospitals and systems, governed by privacy and ethical requirements, and often stored in different formats. This makes cross-site model development slow and costly. As a result, many studies still rely on data from a single institution, which can limit external validity. The issue is even more serious for rare-disease prediction, where cohorts are small and geographically distributed, and collaboration across multiple organisations is often needed to obtain sufficient statistical power [6].
Federated learning (FL) has become an important approach for multi-site modelling under privacy constraints [7,8,9]. In FL, each site keeps its data locally and shares only model updates rather than raw records. This fits well when sensitive data needs to remain at the source. FL has already been explored in healthcare applications such as risk prediction, medical imaging, and population health analytics [10,11,12,13,14,15]. A key advantage of FL is that it allows models to learn from multiple sites without centralising patient records, which may improve generalisability while reducing the need for large-scale data transfer.
Despite these advantages, many healthcare FL studies are still difficult to translate into deployable services. First, clinical adoption requires auditability and accountability. Stakeholders often need to know what model was trained, which sites participated, what configuration was used, how performance changed over time, and whether the process was stable and compliant. However, many FL implementations focus mainly on the training loop and provide only limited support for audit-ready logging, end-to-end traceability, and governance-oriented reporting [9,11]. Second, in the healthcare domain, FL must deal with heterogeneity [16]. Data across sites are often non-IID (Independent and Identically Distributed) because of differences in patient populations, clinical practice, coding styles, and measurement processes. This can slow convergence and lead to unstable or inconsistent results if it is not properly monitored [8,9]. Third, practical deployment also requires clear communication. Clinicians and operational teams often need short and understandable summaries of performance, limitations, and readiness, rather than only raw metrics or low-level system logs. Finally, real deployment requires workflow integration, including round orchestration, monitoring, API-based service boundaries, and predictable runtime behaviour that can be demonstrated during evaluation and stakeholder review.
Existing FL frameworks provide strong technical foundations [17,18,19], but many deployment-related components, such as dashboards, evidence trails, stakeholder reporting, and controlled baseline comparisons, still need to be implemented separately. Because of this, there is still a gap between algorithm-focused FL prototypes and healthcare-ready systems that treat operational transparency, audit readiness, and stakeholder communication as first-class requirements.
To address this gap, we present MediVault, a privacy-first platform designed to operationalise federated learning for healthcare pilots, with particular attention to deployability, auditability, and communication. MediVault is motivated by multi-organisation scenarios such as rare-disease risk prediction, where institutions need to collaborate while maintaining local control of sensitive records. Rather than treating FL as a standalone training routine, MediVault provides an integrated workflow from orchestration to evaluation: it combines federated coordination, site-level local training, audit-ready telemetry, and a dashboard for evaluation and governance. The platform allows teams to (i) run round-based training across distributed sites, (ii) compare FL against a centralised baseline using the same model and hyperparameters, and (iii) produce governance- and clinician-oriented summaries based on logged evidence. In addition, the platform supports evidence-grounded reporting to improve stakeholder communication. We evaluate MediVault on two public healthcare classification datasets under both IID and non-IID settings to reflect realistic site heterogeneity. The results show that federated training is competitive with centralised training, while MediVault also provides the transparency, auditability, and communication features needed for practical healthcare pilots.
The rest of this paper is organised as follows. Section 2 reviews related work on federated learning in healthcare, privacy-preserving collaboration, and deployment-oriented FL systems. Section 3 presents the MediVault architecture, workflow, and key system components. Section 4 describes the experimental setup and evaluates the platform from both system-level and model-level perspectives. Finally, Section 5 concludes the paper and outlines directions for future work.

2. Background

2.1. Federated Learning in Healthcare

Federated learning (FL) is a distributed learning paradigm in which multiple sites train local models on private data and periodically share model updates with a coordinator, which aggregates them into a global model [7,9]. In healthcare, FL is particularly attractive because medical data are naturally distributed across institutions and are subject to strict privacy, legal, and governance constraints. By keeping data local and exchanging model updates instead of raw records, FL provides a practical alternative to centralised learning for collaborative healthcare analytics.
Although FL is promising for healthcare, several practical and methodological challenges remain. First, statistical heterogeneity is common. Hospitals may differ in patient demographics, clinical workflows, measurement devices, and coding practices, which can lead to client drift and unstable training [8,9]. Second, system heterogeneity can arise from differences in compute capacity, network conditions, and local operational constraints. Third, privacy risks do not disappear simply because raw data are not shared. Prior studies have shown that gradients or model updates may still leak sensitive information under certain attack settings, such as gradient inversion or feature leakage attacks [20,21]. Therefore, healthcare FL systems must consider not only predictive performance, but also update confidentiality, secure aggregation, and operational trust.

2.2. Update Confidentiality and Secure Aggregation

Protecting privacy in FL requires distinguishing two related goals: update confidentiality and secure aggregation. Update confidentiality aims to prevent eavesdroppers or untrusted intermediaries from reading individual client updates in transit or at rest. Secure aggregation aims to ensure that the coordinator learns only an aggregate statistic, such as a sum or an average, rather than any single participant’s update.
Secure aggregation is commonly implemented using ideas from secure multi-party computation (SMPC). A representative approach is additive masking, where each client masks its update in such a way that the masks cancel out during aggregation, allowing the server to recover only the sum of updates [22]. Such protocols are attractive because they can be efficient and provide a clear privacy guarantee under an explicit threat model, often an honest-but-curious server. However, practical deployment still requires solutions for dropout handling, key management, and robustness to stragglers.
Homomorphic encryption (HE) provides another way to protect model updates by allowing computation directly over encrypted values [23]. In FL, HE is often used so that the coordinator performs additive aggregation on ciphertexts rather than plaintext updates. This can provide stronger confidentiality for transmitted updates, but it also introduces additional computation and communication overhead. As a result, many systems often restrict operations to simple additions, smaller models, or lightweight HE settings in order to balance privacy and efficiency.

2.3. Auditability, Governance Evidence, and Trust in Cross-Organisation Collaboration

In cross-organisation healthcare collaborations, technical privacy mechanisms alone are often not enough for deployment readiness. Partners may also require operational evidence that collaboration occurred as claimed and that protocol constraints were followed. For example, they may want evidence that local training was performed on-site, no raw data were exported, and only restricted message contents were exchanged. This introduces a requirement beyond predictive performance: auditability and governance readiness. Prior work in healthcare FL has highlighted the importance of trust, transparency, and practical deployment considerations, but many FL systems mainly expose training metrics and engineering logs rather than a structured, non-sensitive evidence layer designed for post-hoc review and partner assurance [11,12].

2.4. Related Work

Federated learning has been widely applied to support collaborative healthcare analytics [24,25,26,27]. Existing works have explored FL in a range of healthcare settings, including electronic health records, medical imaging, remote monitoring, and other smart-health applications, showing that multi-site model development is feasible even when direct data sharing is restricted [25,26]. In addition, these reviews consistently note that FL in the healthcare domain still faces challenges such as non-IID data, communication overhead, privacy risk, and deployment complexity [24,26,27].
A number of existing works have therefore focused on strengthening the security and privacy of FL beyond simple local-data retention. In particular, secure aggregation has become a central topic, since the coordinator should ideally learn only an aggregated statistic rather than any individual client update. Studies in [22,24,26] have explored several directions for secure aggregation, including masking-based schemes, secret sharing, and cryptographic approaches such as homomorphic encryption, while also emphasising practical requirements such as communication efficiency, robustness to user dropout, and compatibility with high-dimensional model updates. These approaches strengthen confidentiality beyond the assumption that data merely remain local, although they may introduce additional computational and engineering overhead in real deployments [23,25,27].
At the same time, research has shown that “data stay local” does not fully remove privacy risks in federated learning. Attacks on gradients and parameter updates demonstrate that sensitive information may still be inferred from shared model updates, which motivates stronger update-protection mechanisms in privacy-sensitive domains such as healthcare [20,21]. This concern is also reflected in recent healthcare FL reviews, which argue that privacy protection often requires a combination of local training, protected update exchange, and system-level safeguards rather than reliance on local data retention alone [24,25,26].
From a systems perspective, existing FL research provides strong foundations for federated coordination, local training, privacy-aware aggregation, and healthcare-oriented application design [24,25,26,27]. However, most existing works primarily focus on learning architectures, privacy mechanisms, or application taxonomies. Much less attention is given to governance-oriented transparency at the system level, such as protocol timelines, secure message evidence, and non-sensitive audit views that can support partner assurance in real healthcare collaborations. For these reasons, MediVault differs from existing work in two ways: (i) it combines federated learning with protected update exchange through HE-based update protection and an SMPC-inspired secure aggregation workflow; and (ii) it introduces a secure collaboration evidence layer through the dashboard, exposing protocol-level artefacts that support auditability without revealing raw data or per-sample information. This positions MediVault as a deployment-oriented healthcare FL system that addresses not only privacy-preserving training, but also the operational transparency needed for cross-organisation trust.

3. Proposed System

The proposed system, called MediVault, is an auditable and secure federated learning-based system that supports collaborative model training across multiple healthcare data custodians without centralising patient-level data. The system is designed for healthcare analytics scenarios in which institutions need to collaborate while preserving local data control. MediVault follows a federated learning (FL) setting in which participating healthcare sites train locally on their private datasets and share only protected model updates. Rather than treating FL as an isolated training routine, MediVault provides an integrated workflow that combines round orchestration, protected update exchange, secure aggregation, and audit-ready system evidence.

3.1. System Architecture

Figure 1 shows the overall MediVault workflow. In each training round, the federated coordinator broadcasts the current global model to the participating healthcare sites. Each site then performs local training on its private dataset and computes a local model update. Before transmission, the local update is protected through an HE-based update protection step. The protected updates are then combined through an SMPC-inspired secure aggregation process, so that the coordinator receives only an aggregated result rather than plaintext individual updates. This aggregated result is used to form the updated global model, which is broadcast again for the next round.
MediVault combines a federated coordinator, site-level local training at peer nodes, and a protected update pipeline for secure submission and aggregation. Together, these elements support a protected round-based workflow in which local model updates are generated at each site, protected before transmission, securely aggregated, and then used to update the global model for the next round. The round-based training procedure is described next, followed by the two protection mechanisms used for secure update handling and aggregation.

3.2. Federated Learning Workflow

Assume that training proceeds in synchronous rounds t = 1, …, T. Let w^(t) denote the global model parameters at round t. Each peer i ∈ {1, …, N} holds a private local dataset D_i. The workflow below describes how local updates are generated and then passed to the protected update pipeline for secure submission and aggregation.
  • Broadcast: The coordinator broadcasts the current global model w^(t) and round identifier t to all participating peers.
  • Local training: Each peer performs local optimisation for E epochs (or steps) and obtains updated parameters w_i^(t). The local model update is then computed as
    Δw_i^(t) = w_i^(t) − w^(t).
  • Protected submission: Each peer protects Δw_i^(t) using the protected update pipeline described in Section 3.3 and Section 3.4, and submits only the protected update to the coordinator.
  • Aggregation and model update: The coordinator aggregates the protected updates and applies the resulting global update:
    w^(t+1) = w^(t) + η · (1/N) · Σ_{i=1}^{N} Δw_i^(t),
    where η is the server learning rate.
In MediVault, the summation is not carried out over plaintext individual updates. Instead, aggregation is performed through the protected update pipeline described below.
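For concreteness, the round structure above can be sketched in plaintext form. This is a minimal illustration only: the function names are ours rather than MediVault's API, and in the actual system the summation itself is routed through the protected update pipeline.

```python
# Minimal plaintext sketch of one synchronous FedAvg-style round
# (illustrative names; the real system protects the summation step).

def fedavg_round(w_global, local_train_fns, eta=1.0):
    """Broadcast w^(t), collect local updates, apply the averaged update."""
    updates = []
    for train_local in local_train_fns:
        w_i = train_local(list(w_global))                           # local optimisation
        updates.append([wi - wg for wi, wg in zip(w_i, w_global)])  # Δw_i = w_i - w
    n = len(updates)
    # w^(t+1) = w^(t) + eta * (1/N) * sum_i Δw_i^(t)
    avg = [sum(d[j] for d in updates) / n for j in range(len(w_global))]
    return [wg + eta * a for wg, a in zip(w_global, avg)]

# Toy usage: two "peers" whose local training shifts each weight differently.
peer_a = lambda w: [x + 0.2 for x in w]
peer_b = lambda w: [x - 0.1 for x in w]
w_next = fedavg_round([0.0, 1.0], [peer_a, peer_b])   # average shift = 0.05
```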

3.3. HE-Based Update Protection for Encrypted Aggregation

To protect the confidentiality of peer updates during transmission and aggregation, MediVault uses an additive homomorphic cryptosystem, specifically Paillier. Let Enc(·) and Dec(·) denote encryption and decryption under the corresponding public and private keys.
Each peer encrypts its protected update vector element-wise:
c_i^(t) = Enc(Δw̃_i^(t)),
where Δw̃_i^(t) denotes the peer update after optional masking. Due to the additive homomorphism of Paillier, the coordinator can combine ciphertexts without decrypting individual updates:
c_sum^(t) = ⊕_{i=1}^{N} c_i^(t) = Enc( Σ_{i=1}^{N} Δw̃_i^(t) ),
where ⊕ denotes ciphertext-domain addition. The coordinator decrypts only the aggregated ciphertext:
Δw̃_sum^(t) = Dec(c_sum^(t)).
This design prevents the coordinator from directly observing plaintext individual updates during aggregation. In the current prototype, this encrypted aggregation remains practical because the evaluated models are lightweight linear models, which keep the protected update dimensionality manageable.
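To illustrate the additive homomorphism this design relies on, the following toy sketch implements a deliberately insecure miniature Paillier scheme. The tiny primes and the fixed-point scale of 10^4 are our choices for demonstration only; a real deployment would use a vetted HE library with keys of 2048 bits or more.

```python
import math
import random

# Toy Paillier for illustration only: tiny primes and a fixed RNG seed make
# this deliberately insecure.
p, q = 1009, 1013                     # small primes (insecure key size)
n, n2 = p * q, (p * q) ** 2
lam = math.lcm(p - 1, q - 1)
mu = pow(lam, -1, n)                  # valid because the generator is g = n + 1

def enc(m, rng=random.Random(7)):
    r = rng.randrange(1, n)
    while math.gcd(r, n) != 1:        # ensure r is a unit mod n
        r = rng.randrange(1, n)
    return (pow(1 + n, m, n2) * pow(r, n, n2)) % n2

def dec(c):
    return (((pow(c, lam, n2) - 1) // n) * mu) % n

# Peers fixed-point-encode scalar updates (scale 10^4) before encrypting.
SCALE = 10_000
updates = [0.1234, -0.0567]                        # two peers' updates
cts = [enc(int(round(u * SCALE)) % n) for u in updates]

# Coordinator: ciphertext-domain addition is modular multiplication, and
# only the aggregated ciphertext is ever decrypted.
c_sum = (cts[0] * cts[1]) % n2
total = dec(c_sum)
if total > n // 2:                                 # map back to signed range
    total -= n
agg = total / SCALE                                # 0.1234 - 0.0567 = 0.0667
```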

3.4. SMPC-Inspired Secure Aggregation via Additive Masking

MediVault further reduces exposure of individual updates by combining HE with an SMPC-inspired additive masking mechanism. The goal is that the coordinator receives only encrypted, masked updates and recovers only an aggregated result.
Let Δw_i^(t) denote peer i’s local model update at round t, and let m_i^(t) denote a pseudo-random mask vector derived from a shared seed and the round identifier t. Peer i forms a masked update as
Δw̃_i^(t) = Δw_i^(t) + s_i · m_i^(t),
where s_i ∈ {+1, −1} controls mask cancellation. The peer then encrypts and transmits only
Enc(Δw̃_i^(t)).
Using HE additivity, the coordinator combines ciphertexts and decrypts only the aggregated masked sum:
Dec( ⊕_{i=1}^{N} Enc(Δw̃_i^(t)) ) = Σ_{i=1}^{N} ( Δw_i^(t) + s_i · m_i^(t) ).
In the current prototype, masking is implemented in a two-party setting (N = 2) by assigning opposite signs to the two peers so that masks cancel after aggregation:
Δw̃_1^(t) + Δw̃_2^(t) = Δw_1^(t) + Δw_2^(t).
Thus, the coordinator recovers only the aggregated update and not any individual plaintext update. The aggregated update is then used in a FedAvg-style global model update. Extending this masking mechanism to multi-party settings with dropout tolerance is left as future work.
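The two-party cancellation can be sketched as follows, at the plaintext level only (the encryption step of Section 3.3 is omitted to isolate the masking); the seed-derivation scheme and function names are illustrative assumptions.

```python
import random

# Plaintext-level sketch of two-party additive masking; encryption omitted.
# The seed-derivation scheme shown here is an assumption, not MediVault's.

def mask_for_round(shared_seed, round_id, dim):
    """Pseudo-random mask m^(t) derived from a shared seed and round id."""
    rng = random.Random(f"{shared_seed}:{round_id}")
    return [rng.uniform(-1.0, 1.0) for _ in range(dim)]

def masked_update(delta, mask, sign):
    # Δw̃_i = Δw_i + s_i · m, with s_1 = +1 and s_2 = -1 in the 2-party case
    return [d + sign * m for d, m in zip(delta, mask)]

SEED, ROUND, DIM = "round-key", 3, 4
m = mask_for_round(SEED, ROUND, DIM)
delta1 = [0.5, -0.2, 0.1, 0.0]
delta2 = [0.3, 0.4, -0.1, 0.2]
tilde1 = masked_update(delta1, m, +1)   # peer 1 adds the shared mask
tilde2 = masked_update(delta2, m, -1)   # peer 2 subtracts the same mask
# After aggregation the masks cancel: Δw̃_1 + Δw̃_2 = Δw_1 + Δw_2
agg = [a + b for a, b in zip(tilde1, tilde2)]
```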
In addition, MediVault records round-level metadata, including round identifiers, participating peers, protected message metadata, aggregation status, and model-level summaries. These records are surfaced through the dashboard to support auditability and governance review without exposing patient-level data.
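A round-level evidence record of this kind might look as follows. The field names and the JSON-lines format are illustrative assumptions rather than MediVault's actual schema; the point is that only metadata and a ciphertext digest are logged, never plaintext updates or patient data.

```python
import hashlib
import json
import time

# Illustrative round-level audit record: metadata plus a ciphertext hash,
# never plaintext updates or patient-level data. Field names are assumptions.

def audit_record(round_id, peer_id, ciphertext: bytes, status="received"):
    return {
        "round": round_id,
        "peer": peer_id,
        "payload_bytes": len(ciphertext),
        "ciphertext_sha256": hashlib.sha256(ciphertext).hexdigest(),
        "status": status,
        "logged_at": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
    }

entry = audit_record(3, "site-A", b"\x01\x02\x03\x04")
line = json.dumps(entry)   # append-only JSON lines suit post-hoc review
```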

4. Evaluation

This section evaluates MediVault from two aspects: (i) system-level evidence, showing that the current implementation supports end-to-end execution with a working dashboard, protected update exchange, and an auditable protocol timeline; and (ii) model-level utility, comparing federated training against a centralised baseline under both IID and non-IID data partitions. We report results for two lightweight linear classifiers: logistic regression (LOGREG) and linear SVM (LINSVM).

4.1. Implementation and Dashboard Views

A key contribution of MediVault is that the secure collaboration workflow is not only specified conceptually but also demonstrated through an operational dashboard. Figure 2, Figure 3 and Figure 4 provide end-to-end evidence of: (i) global task configuration and round-level learning status at the coordinator; and (ii) peer-side execution where each site trains locally and submits encrypted, masked model updates rather than raw patient records. These views support a deployment-oriented narrative: the current implementation operationalises secure multi-party collaboration while preserving data locality.
In addition to the primary workflow, MediVault provides a dedicated secure collaboration evidence layer, as shown in Figure 5 and Figure 6. This layer is designed to improve auditability and partner confidence by exposing protocol-level artefacts, such as message metadata, encryption timings, and aggregation steps, while remaining non-sensitive. Such evidence is particularly relevant for healthcare collaborations where governance requirements demand operationally verifiable traces without disclosure of patient-level information. In addition, Figure 7 shows an optional reporting interface that generates narrative summaries from non-sensitive aggregated evidence rather than raw patient records. This interface is intended to support stakeholder communication by translating logged metrics and protocol-level evidence into a more accessible form, and can be achieved using either a cloud-based generative AI service or a local model.

4.2. Experimental Setup

We evaluate MediVault on two public binary classification datasets: breast_cancer [28] and heart_disease [29]. Each dataset is split into train and test partitions using a fixed random seed (seed = 7) and an 80/20 split. All reported metrics are computed on the held-out test set.
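Assuming a conventional shuffled hold-out, the fixed-seed 80/20 split can be reproduced along these lines; this stdlib-only sketch is an assumption, since the paper does not specify its exact splitting routine.

```python
import random

# Stdlib-only sketch of a fixed-seed shuffled hold-out split (assumption:
# the paper does not specify its exact splitting routine).

def train_test_split_indices(n_samples, test_frac=0.2, seed=7):
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)            # deterministic shuffle
    n_test = int(round(n_samples * test_frac))
    return idx[n_test:], idx[:n_test]           # train, test

train_idx, test_idx = train_test_split_indices(569)  # breast_cancer has 569 rows
```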

4.2.1. Models, Baselines, and FL Setting

We compare two lightweight linear models that are common in clinical risk prediction:
  • LOGREG: logistic regression (probabilistic linear classifier).
  • LINSVM: linear SVM (margin-based linear classifier).
For each model, we compare:
  • Centralised baseline (Non-FL): the model trained on the union of all training data.
  • Federated learning (FL): peers train locally and submit model updates to a coordinator. The coordinator applies a FedAvg-style aggregation over received updates and evaluates the global model each round.

4.2.2. Peer Partitions (IID vs Non-IID)

To study heterogeneity, training data are partitioned across peers under:
  • IID: each peer receives a roughly representative sample of the overall data distribution.
  • Non-IID: peer data distributions are intentionally skewed so that different peers no longer follow the same underlying distribution, reflecting realistic site heterogeneity.
We evaluate settings with 2 and 5 peers to examine how scaling the number of sites influences convergence and performance.
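One common way to realise such partitions is random sharding for the IID case and label-sorted contiguous shards for the non-IID case. The exact skew mechanism is not specified in the paper, so the sketch below is an assumption:

```python
import random

# Illustrative partitioner: label-sorted contiguous shards stand in for the
# paper's unspecified non-IID skew mechanism.

def partition(labels, n_peers, iid=True, seed=7):
    rng = random.Random(seed)
    idx = list(range(len(labels)))
    if iid:
        rng.shuffle(idx)                      # representative random split
        return [idx[p::n_peers] for p in range(n_peers)]
    idx.sort(key=lambda i: labels[i])         # sort by label -> skewed shards
    bounds = [p * len(idx) // n_peers for p in range(n_peers + 1)]
    return [idx[bounds[p]:bounds[p + 1]] for p in range(n_peers)]

labels = [0] * 50 + [1] * 50
iid_parts = partition(labels, 2, iid=True)
skew_parts = partition(labels, 2, iid=False)
# Non-IID: each peer sees a single class; IID: roughly balanced classes.
```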

4.2.3. Metrics and Reporting Protocol

We report three standard metrics for medical risk prediction:
  • Accuracy (ACC): overall classification correctness.
  • Area Under the ROC Curve (AUC): threshold-independent ranking quality.
  • F1-score (F1): balances precision and recall, which is useful under potential class imbalance.
For each setting, we report: (i) Final round performance (fixed-budget deployment view), (ii) Best over rounds (attainable peak, relevant for early stopping), and (iii) Mean±Std across rounds (stability). In Table 1, Table 2 and Table 3, Δ denotes FL minus Base under the same dataset/model/partition/peer configuration.
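The three reporting views can be computed from a per-round metric trace as in the following sketch (the trace values are illustrative, not results from the paper):

```python
import statistics

# Sketch of the three reporting views over a per-round metric trace
# (the numbers below are illustrative, not results from the paper).

def summarise_rounds(metric_by_round):
    return {
        "final": metric_by_round[-1],                  # fixed-budget view
        "best": max(metric_by_round),                  # early-stopping view
        "mean": statistics.mean(metric_by_round),      # stability view
        "std": statistics.stdev(metric_by_round),
    }

acc_trace = [0.71, 0.78, 0.82, 0.81, 0.84, 0.83]       # illustrative ACC per round
summary = summarise_rounds(acc_trace)
```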

4.2.4. Secure Update Confidentiality and Secure Aggregation

MediVault follows an update-confidential FL design that combines homomorphic encryption (HE) with an SMPC-inspired additive masking mechanism (see the full protocol description in the Proposed System section). As shown in Figure 5 and Figure 6, we validate the operational behaviour of this design by exposing non-sensitive protocol artefacts in the dashboard: (i) each peer submits only encrypted, masked updates (no plaintext updates and no patient-level records); (ii) the coordinator performs additive combination on ciphertexts and decrypts only the aggregated sum; and (iii) the secure-aggregation trace view provides message-level evidence such as vector dimensionality, payload size, mask identifiers/signs, and ciphertext hashes or samples, together with an ordered protocol timeline for auditability. These dashboard traces demonstrate that encrypted update exchange and masked aggregation are executed end-to-end in the prototype, supporting partner assurance without revealing local training data.

4.3. Auditability and Governance Evidence

Beyond predictive performance, MediVault is evaluated on auditability—the ability to provide non-sensitive, machine-recorded evidence that a privacy-preserving collaboration occurred. This is particularly important for cross-organisation healthcare deployments where partners must justify data governance decisions and demonstrate compliance-oriented controls.
As shown in Figure 5 and Figure 6, the implementation exposes an evidence layer that logs: (i) secure message metadata (peer identifier, round index, payload size, encryption time, mask identifiers, and protocol events); and (ii) an ordered protocol timeline of events (round start, message receipt, secure combine, global update, and evaluation). These artefacts are designed to be non-patient-level yet operationally verifiable, supporting post-hoc inspection and partner assurance without disclosing local records or per-sample information.
We therefore treat auditability as a first-class evaluation axis alongside accuracy metrics: the dashboard evidence demonstrates that MediVault provides a practical governance view for secure collaboration in addition to model training outcomes.

4.4. Results: Performance Comparison (Centralised vs Federated)

Table 1, Table 2 and Table 3 summarise results for LOGREG and LINSVM. On breast_cancer, both models achieve near-ceiling performance centrally, and FL remains competitive: LOGREG largely matches the centralised baseline (final-round ACC ≈ 0.986), while LINSVM shows small but consistent drops under FL (e.g., up to ∼2–3% absolute ACC under 2-peer non-IID). On heart_disease, LOGREG yields the strongest centralised baseline and FL provides gains under IID partitioning (e.g., ACC improves from 0.853 to 0.868). Under non-IID partitions, final-round metrics can drop (ACC 0.838) but best-over-rounds remains competitive in this setting, suggesting that monitoring and early stopping may be practical strategies under heterogeneous deployments. LINSVM exhibits more sensitivity across configurations: it can match or improve baseline in some cases (e.g., 2-peer non-IID ACC 0.836 vs 0.803) but degrades under others (e.g., 5-peer IID ACC 0.787 vs 0.803), indicating higher variance in heterogeneous or small-sample regimes.
Figure 8, Figure 9 and Figure 10 visualise final-round performance for Base vs FL across datasets, partitions, peer counts, and models. Across conditions, FL is competitive with the centralised baseline. Differences are small on breast_cancer due to near-ceiling performance, while larger sensitivity is observed on heart_disease, particularly under non-IID partitions and for LINSVM.
To complement the summary tables, we plot round-wise ACC/AUC/F1 trajectories under representative IID and non-IID settings to illustrate convergence behaviour and stability.
Figure 11 shows that under IID partitioning, FL converges smoothly and can match or exceed the centralised baseline. Under non-IID partitioning, convergence remains observable but with larger fluctuations and a lower final-round operating point. This pattern is consistent with client drift and slower stabilisation under heterogeneous sites.
Figure 12 indicates that both LOGREG and LINSVM achieve near-ceiling performance for breast_cancer, and FL closely tracks the centralised baseline under both IID and non-IID partitions. In this dataset, differences are small and are more visible through stability than through final-point performance.

4.5. Discussion

The evaluation shows that MediVault provides both system-level and model-level value for privacy-preserving healthcare collaboration. At the system level, the dashboard and evidence views demonstrate that the current implementation supports local training, protected update exchange, secure aggregation, and auditable protocol traces without exposing patient-level data. This is important for healthcare deployments, where partner trust depends not only on privacy-preserving computation but also on visible and reviewable operational evidence.
At the model level, federated learning with both LOGREG and LINSVM remains competitive with the centralised baseline across the evaluated datasets. On breast_cancer, both models operate near a saturated performance regime, so differences between centralised and federated settings are small. On heart_disease, LOGREG is generally more robust across partition and peer configurations, while LINSVM shows greater sensitivity, especially under heterogeneous settings. This suggests that model choice matters in practical FL deployments, particularly when data distributions differ across sites.
The results also confirm the expected effect of heterogeneity. Non-IID partitions increase variance and can lower final-round performance, especially on heart_disease. At the same time, the best-over-rounds results indicate that competitive operating points are still reachable, which supports the use of monitoring and early stopping in practical deployments.
Finally, the optional reporting interface shown in Figure 7 illustrates how non-sensitive aggregated evidence can be presented in a more accessible form for stakeholders. Together with the protocol-level metadata exposed by the dashboard, this suggests that MediVault can support not only privacy-preserving training, but also the transparency and governance readiness needed for real multi-site healthcare collaboration.

5. Conclusion

This paper presented MediVault, a proof-of-concept system that enables privacy-preserving healthcare collaboration without sharing raw patient records. MediVault combines federated learning with (prototype) encrypted update exchange and an SMPC-inspired secure aggregation workflow, and exposes an auditable evidence layer through a working dashboard to support governance and partner trust. Experiments on breast_cancer and heart_disease using logistic regression and linear SVM show that federated training achieves performance comparable to a centralised baseline under both IID and non-IID partitions, with expected sensitivity to data heterogeneity.
Several directions remain for future work. First, the current secure aggregation and homomorphic-encryption mechanisms are demonstrated at prototype level, and future work should consider stronger adversarial settings and more formal security analysis. Second, the present experiments use benchmark tabular datasets rather than real clinical deployment data, so further validation will require more representative healthcare datasets and appropriate governance or ethical pathways. Third, broader evaluation is needed under more realistic network conditions, larger numbers of peers, and more extensive analyses of robustness, fairness, and distribution shift.

Funding

The project is funded by the Cyber Security Academic Startup Accelerator Programme (Year 9), Project Code 10173383.

References

  1. Zuo, Z.; Li, J.; Xu, H.; Al Moubayed, N. Curvature-based feature selection with application in classifying electronic health records. Technological Forecasting and Social Change 2021, 173, 121127. [CrossRef]
  2. Brasil, S.; Pascoal, C.; Francisco, R.; dos Reis Ferreira, V.; Videira, P.A.; Valadão, G. Artificial intelligence (AI) in rare diseases: is the future brighter? Genes 2019, 10, 978.
  3. Lee, J.; Liu, C.; Kim, J.; Chen, Z.; Sun, Y.; Rogers, J.R.; Chung, W.K.; Weng, C. Deep learning for rare disease: A scoping review. Journal of Biomedical Informatics 2022, 135, 104227. [CrossRef]
  4. Visibelli, A.; Roncaglia, B.; Spiga, O.; Santucci, A. The impact of artificial intelligence in the odyssey of rare diseases. Biomedicines 2023, 11, 887. [CrossRef]
  5. Decherchi, S.; Pedrini, E.; Mordenti, M.; Cavalli, A.; Sangiorgi, L. Opportunities and challenges for machine learning in rare diseases. Frontiers in Medicine 2021, 8, 747612.
  6. Schaefer, J.; Lehne, M.; Schepers, J.; Prasser, F.; Thun, S. The use of machine learning in rare diseases: a scoping review. Orphanet Journal of Rare Diseases 2020, 15, 145.
  7. McMahan, B.; Moore, E.; Ramage, D.; Hampson, S.; y Arcas, B.A. Communication-efficient learning of deep networks from decentralized data. In Proceedings of Artificial Intelligence and Statistics. PMLR, 2017, pp. 1273–1282.
  8. Li, T.; Sahu, A.K.; Zaheer, M.; Sanjabi, M.; Talwalkar, A.; Smith, V. Federated optimization in heterogeneous networks. Proceedings of Machine Learning and Systems 2020, 2, 429–450.
  9. Kairouz, P.; McMahan, H.B. Advances and open problems in federated learning. Foundations and Trends in Machine Learning 2021, 14, 1–210.
  10. Sheller, M.J.; Reina, G.A.; Edwards, B.; Martin, J.; Bakas, S. Multi-institutional deep learning modeling without sharing patient data: A feasibility study on brain tumor segmentation. In Proceedings of the International MICCAI Brainlesion Workshop. Springer, 2018, pp. 92–104.
  11. Rieke, N.; Hancox, J.; Li, W.; Milletarì, F.; Roth, H.R.; Albarqouni, S.; Bakas, S.; Galtier, M.N.; Landman, B.A.; Maier-Hein, K.; et al. The future of digital health with federated learning. NPJ Digital Medicine 2020, 3. [CrossRef]
  12. Xu, J.; Glicksberg, B.S.; Su, C.; Walker, P.; Bian, J.; Wang, F. Federated learning for healthcare informatics. Journal of Healthcare Informatics Research 2021, 5, 1–19. [CrossRef]
  13. Austin, J.A.; Lobo, E.H.; Samadbeik, M.; Engstrom, T.; Philip, R.; Pole, J.D.; Sullivan, C.M. Decades in the making: the evolution of digital health research infrastructure through synthetic data, common data models, and federated learning. Journal of Medical Internet Research 2024, 26, e58637.
  14. Shafik, W. Digital healthcare systems in a federated learning perspective. In Federated learning for digital healthcare systems; Elsevier, 2024; pp. 1–35.
  15. Bashir, A.K.; Victor, N.; Bhattacharya, S.; Huynh-The, T.; Chengoden, R.; Yenduri, G.; Maddikunta, P.K.R.; Pham, Q.V.; Gadekallu, T.R.; Liyanage, M. Federated learning for the healthcare metaverse: Concepts, applications, challenges, and future directions. IEEE Internet of Things Journal 2023, 10, 21873–21891. [CrossRef]
  16. Milasheuski, U.; Barbieri, L.; Tedeschini, B.C.; Nicoli, M.; Savazzi, S. On the impact of data heterogeneity in federated learning environments with application to healthcare networks. In Proceedings of the 2024 IEEE conference on artificial intelligence (CAI). IEEE, 2024, pp. 1017–1023.
  17. Bonawitz, K.; Ivanov, V.; Kreuter, B.; Marcedone, A.; McMahan, H.B.; Patel, S.; Ramage, D.; Segal, A.; Seth, K. Practical secure aggregation for privacy-preserving machine learning. In Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security, 2017, pp. 1175–1191.
  18. Beutel, D.J.; Topal, T.; Mathur, A.; Qiu, X.; Fernandez-Marques, J.; Gao, Y.; Sani, L.; Li, K.H.; Parcollet, T.; de Gusmão, P.P.B.; et al. Flower: A friendly federated learning research framework. arXiv preprint arXiv:2007.14390 2020.
  19. He, C.; Li, S.; So, J.; Zeng, X.; Zhang, M.; Wang, H.; Wang, X.; Vepakomma, P.; Singh, A.; Qiu, H.; et al. Fedml: A research library and benchmark for federated machine learning. arXiv preprint arXiv:2007.13518 2020.
  20. Zhu, L.; Liu, Z.; Han, S. Deep Leakage from Gradients. Advances in Neural Information Processing Systems 2019, 32, 1–11.
  21. Melis, L.; Song, C.; De Cristofaro, E.; Shmatikov, V. Exploiting unintended feature leakage in collaborative learning. In Proceedings of the 2019 IEEE symposium on security and privacy (SP). IEEE, 2019, pp. 691–706.
  22. Bonawitz, K.; Ivanov, V.; Kreuter, B.; Marcedone, A.; McMahan, H.B.; Patel, S.; Ramage, D.; Segal, A.; Seth, K. Practical secure aggregation for privacy-preserving machine learning. In Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security, 2017, pp. 1175–1191.
  23. Gentry, C. Fully homomorphic encryption using ideal lattices. In Proceedings of the Forty-First Annual ACM Symposium on Theory of Computing, 2009, pp. 169–178.
  24. Antunes, R.S.; André da Costa, C.; Küderle, A.; Yari, I.A.; Eskofier, B. Federated learning for healthcare: Systematic review and architecture proposal. ACM Transactions on Intelligent Systems and Technology (TIST) 2022, 13, 1–23.
  25. Chaddad, A.; Wu, Y.; Desrosiers, C. Federated learning for healthcare applications. IEEE Internet of Things Journal 2023, 11, 7339–7358.
  26. Nguyen, D.C.; Pham, Q.V.; Pathirana, P.N.; Ding, M.; Seneviratne, A.; Lin, Z.; Dobre, O.; Hwang, W.J. Federated learning for smart healthcare: A survey. ACM Computing Surveys (CSUR) 2022, 55, 1–37.
  27. Dhade, P.; Shirke, P. Federated learning for healthcare: a comprehensive review. Engineering Proceedings 2024, 59, 230.
  28. Wolberg, W.H.; Mangasarian, O.L.; Street, N.; Street, W.N. Breast Cancer Wisconsin (Diagnostic). UCI Machine Learning Repository, 1995. [CrossRef]
  29. Janosi, A.; Steinbrunn, W.; Pfisterer, M.; Detrano, R. Heart Disease. UCI Machine Learning Repository, 1989. [CrossRef]
Figure 1. Overview of the MediVault workflow
Figure 2. Coordinator dashboard overview showing the current learning status, update dimensionality, and aggregated non-sensitive site snapshot.
Figure 3. Peer A view. Local data remain on-site; only encrypted and masked updates are exchanged, while local non-sensitive aggregates are visualised.
Figure 4. Peer B view. Local data remain on-site; only encrypted and masked updates are exchanged, while local non-sensitive aggregates are visualised.
Figure 5. Encrypted and masked updates submitted by peers. The coordinator decrypts only the sum of masked ciphertexts; masks cancel in aggregate and the global update is applied in the current SMPC-inspired prototype.
Figure 6. Secure message table and protocol timeline. The dashboard records message sizes, encryption time, mask identifiers, and protocol events to support auditability.
Figure 7. Optional LLM Insights interface: evidence-grounded summaries generated from aggregated metrics and site-level aggregates, without using patient-level records.
Figure 8. Final-round Accuracy (ACC) comparison between centralized baseline (Base) and federated learning (FL) (LOGREG vs LINSVM), partitions (IID/Non-IID).
Figure 9. Final-round AUC comparison between centralized baseline (Base) and federated learning (FL) (LOGREG vs LINSVM), partitions (IID/Non-IID).
Figure 10. Final-round F1-score (F1) comparison between centralized baseline (Base) and federated learning (FL), (LOGREG vs LINSVM), partitions (IID/Non-IID).
Figure 11. Round-wise performance curves for heart_disease comparing LOGREG and LINSVM (Base vs FL). (a) IID, 5 peers. (b) Non-IID, 5 peers, where heterogeneity increases variance and affects the final operating point.
Figure 12. Round-wise performance curves for breast_cancer comparing LOGREG and LINSVM (Base vs FL). (a) IID, 5 peers, where both models achieve near-ceiling performance. (b) Non-IID, 5 peers, where FL remains stable with only limited degradation.
Table 1. Accuracy (ACC) comparison between centralized training (Base) and MediVault federated learning (FL). BC = breast_cancer; HD = heart_disease; Non = Non-IID. Δ denotes FL minus Base. “Best” is the maximum ACC achieved across rounds.
Dataset Model Part. Peers | Final: Base FL Δ | Best: Base FL Δ | Mean±Std: Base FL
BC LINSVM IID 2 0.982 0.965 -0.018 0.982 0.974 -0.009 0.977±0.007 0.969±0.008
BC LINSVM IID 5 0.982 0.974 -0.009 0.982 0.974 -0.009 0.977±0.007 0.972±0.006
BC LINSVM Non 2 0.982 0.956 -0.026 0.982 0.965 -0.018 0.977±0.007 0.951±0.013
BC LINSVM Non 5 0.982 0.974 -0.009 0.982 0.974 -0.009 0.977±0.007 0.969±0.009
BC LOGREG IID 2 0.986 0.986 +0.000 0.986 0.986 +0.000 0.981±0.008 0.976±0.011
BC LOGREG IID 5 0.986 0.979 -0.007 0.986 0.979 -0.007 0.981±0.008 0.967±0.012
BC LOGREG Non 2 0.986 0.979 -0.007 0.986 0.979 -0.007 0.981±0.008 0.974±0.010
BC LOGREG Non 5 0.986 0.979 -0.007 0.986 0.979 -0.007 0.981±0.008 0.965±0.015
HD LINSVM IID 2 0.803 0.803 +0.000 0.836 0.820 -0.016 0.812±0.016 0.805±0.006
HD LINSVM IID 5 0.803 0.787 -0.016 0.836 0.820 -0.016 0.812±0.016 0.797±0.010
HD LINSVM Non 2 0.803 0.836 +0.033 0.836 0.852 +0.016 0.812±0.016 0.833±0.011
HD LINSVM Non 5 0.803 0.803 +0.000 0.836 0.820 -0.016 0.812±0.016 0.802±0.008
HD LOGREG IID 2 0.853 0.868 +0.015 0.882 0.882 +0.000 0.864±0.008 0.867±0.006
HD LOGREG IID 5 0.853 0.868 +0.015 0.882 0.882 +0.000 0.864±0.008 0.869±0.006
HD LOGREG Non 2 0.853 0.838 -0.015 0.882 0.882 +0.000 0.864±0.008 0.855±0.018
HD LOGREG Non 5 0.853 0.838 -0.015 0.882 0.882 +0.000 0.864±0.008 0.864±0.012
Table 2. Area Under the ROC Curve (AUC) comparison between centralized training (Base) and federated learning (FL). BC = breast_cancer; HD = heart_disease; Non = Non-IID. Δ denotes FL minus Base. “Best” is the maximum AUC achieved across rounds.
Dataset Model Part. Peers | Final: Base FL Δ | Best: Base FL Δ | Mean±Std: Base FL
BC LINSVM IID 2 0.999 0.998 -0.001 0.999 0.998 -0.001 0.999±0.001 0.998±0.001
BC LINSVM IID 5 0.999 0.999 -0.001 0.999 0.999 -0.001 0.999±0.001 0.999±0.001
BC LINSVM Non 2 0.999 0.998 -0.002 0.999 0.998 -0.001 0.999±0.001 0.998±0.001
BC LINSVM Non 5 0.999 0.997 -0.002 0.999 0.999 -0.001 0.999±0.001 0.998±0.002
BC LOGREG IID 2 1.000 0.999 -0.001 1.000 0.999 -0.001 1.000±0.000 0.998±0.002
BC LOGREG IID 5 1.000 0.998 -0.001 1.000 0.998 -0.001 1.000±0.000 0.996±0.003
BC LOGREG Non 2 1.000 0.999 -0.001 1.000 0.999 -0.001 1.000±0.000 0.997±0.002
BC LOGREG Non 5 1.000 0.998 -0.001 1.000 0.998 -0.001 1.000±0.000 0.995±0.003
HD LINSVM IID 2 0.864 0.881 +0.017 0.915 0.916 +0.001 0.894±0.018 0.903±0.013
HD LINSVM IID 5 0.864 0.878 +0.014 0.915 0.916 +0.001 0.894±0.018 0.897±0.012
HD LINSVM Non 2 0.864 0.866 +0.002 0.915 0.900 -0.015 0.894±0.018 0.878±0.021
HD LINSVM Non 5 0.864 0.881 +0.017 0.915 0.916 +0.001 0.894±0.018 0.900±0.014
HD LOGREG IID 2 0.911 0.915 +0.004 0.924 0.926 +0.003 0.912±0.008 0.918±0.004
HD LOGREG IID 5 0.911 0.920 +0.009 0.924 0.931 +0.007 0.912±0.008 0.920±0.004
HD LOGREG Non 2 0.911 0.917 +0.006 0.924 0.931 +0.007 0.912±0.008 0.917±0.007
HD LOGREG Non 5 0.911 0.920 +0.009 0.924 0.931 +0.007 0.912±0.008 0.920±0.004
Table 3. F1-score (F1) comparison between centralized training (Base) and federated learning (FL). BC = breast_cancer; HD = heart_disease; Non = Non-IID. Δ denotes FL minus Base. “Best” is the maximum F1 achieved across rounds.
Dataset Model Part. Peers | Final: Base FL Δ | Best: Base FL Δ | Mean±Std: Base FL
BC LINSVM IID 2 0.986 0.972 -0.014 0.986 0.978 -0.008 0.981±0.007 0.976±0.007
BC LINSVM IID 5 0.986 0.979 -0.007 0.986 0.979 -0.007 0.981±0.007 0.978±0.005
BC LINSVM Non 2 0.986 0.964 -0.022 0.986 0.970 -0.016 0.981±0.007 0.959±0.012
BC LINSVM Non 5 0.986 0.979 -0.007 0.986 0.979 -0.007 0.981±0.007 0.976±0.009
BC LOGREG IID 2 0.989 0.989 +0.000 0.989 0.989 +0.000 0.982±0.009 0.981±0.009
BC LOGREG IID 5 0.989 0.983 -0.006 0.989 0.983 -0.006 0.982±0.009 0.972±0.010
BC LOGREG Non 2 0.989 0.983 -0.006 0.989 0.983 -0.006 0.982±0.009 0.980±0.008
BC LOGREG Non 5 0.989 0.983 -0.006 0.989 0.983 -0.006 0.982±0.009 0.971±0.012
HD LINSVM IID 2 0.824 0.824 +0.000 0.849 0.836 -0.013 0.825±0.010 0.824±0.006
HD LINSVM IID 5 0.824 0.812 -0.012 0.849 0.836 -0.013 0.825±0.010 0.814±0.010
HD LINSVM Non 2 0.824 0.848 +0.025 0.849 0.862 +0.013 0.825±0.010 0.845±0.011
HD LINSVM Non 5 0.824 0.824 +0.000 0.849 0.836 -0.013 0.825±0.010 0.823±0.008
HD LOGREG IID 2 0.839 0.847 +0.009 0.867 0.867 +0.000 0.844±0.009 0.848±0.007
HD LOGREG IID 5 0.839 0.852 +0.014 0.867 0.867 +0.000 0.844±0.009 0.851±0.007
HD LOGREG Non 2 0.839 0.820 -0.019 0.867 0.867 +0.000 0.844±0.009 0.835±0.019
HD LOGREG Non 5 0.839 0.847 +0.009 0.867 0.867 +0.000 0.844±0.009 0.847±0.011
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
Copyright: This open access article is published under a Creative Commons CC BY 4.0 license, which permits free download, distribution, and reuse, provided that the author and preprint are cited in any reuse.