Submitted: 08 July 2025
Posted: 09 July 2025
Abstract
Keywords:
1. Introduction
Our main contributions are as follows:
- 1) We propose a multi-stage dimensionality reduction framework for clustering in vertical federated learning, built on Product Quantization (PQ) and Multidimensional Scaling (MDS). Locally, feature compression and codebook generation sharply reduce the data volume, while the quantization error introduced by PQ acts as an inherent noise injection that strengthens privacy. We further introduce a one-dimensional MDS embedding that maps the cluster centers of each codebook to distance-preserving indices. No raw codebook is ever uploaded, no data reconstruction takes place on the server, and the risk of leaking the local data distribution is eliminated at the source.
- 2) The multi-stage dimensionality reduction mechanism significantly reduces both the transmitted data volume and the communication frequency while maintaining clustering accuracy, thereby improving communication efficiency.
- 3) The combination of PQ dimensionality reduction and MDS embedding mitigates the clustering bias caused by feature imbalance across parties in vertical federated learning.
- 4) Extensive experiments on the MNIST dataset validate that our algorithm satisfies federated learning privacy requirements while preserving clustering accuracy.
2. Related Work
2.1. K-means Clustering Algorithm
| Algorithm 1: k-means algorithm |
| --- |
| Input: Dataset $X = \{x_1, \dots, x_n\}$, number of clusters $K$, maximum iterations $T$. |
| Output: Partitioning result of $K$ clusters. |
| Steps: 1. Randomly select $K$ samples as the initial cluster centers. 2. Assign each sample to the cluster whose center is nearest (Euclidean distance). 3. Update each cluster center to the mean of the samples assigned to it. 4. Repeat steps 2-3 until the centers no longer change or $T$ iterations are reached. |
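To make the procedure concrete, the following is a minimal NumPy sketch of Algorithm 1; the function name `kmeans`, the random seeding, and the empty-cluster handling are our own choices, not part of the original description.

```python
import numpy as np

def kmeans(X, K, T=100, seed=0):
    """Minimal k-means following Algorithm 1 (function name is ours)."""
    rng = np.random.default_rng(seed)
    # Step 1: randomly select K samples as the initial cluster centers.
    centers = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(T):
        # Step 2: assign each sample to the nearest center (Euclidean).
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: move each center to the mean of its assigned samples;
        # keep the old center if a cluster ends up empty.
        centers_new = np.array([X[labels == j].mean(axis=0) if (labels == j).any()
                                else centers[j] for j in range(K)])
        # Step 4: stop early once the centers no longer change.
        if np.allclose(centers_new, centers):
            break
        centers = centers_new
    return centers, labels
```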
2.2. Vertical Federated Learning
2.3. PQ Quantization Technique
- 1) Space Decomposition: Decompose a vector of dimension $D$ into $M$ subspaces, each of dimension $D/M$ (requiring $D \bmod M = 0$). Formally, the original vector $v \in \mathbb{R}^D$ is partitioned into $M$ subvectors $v_i \in \mathbb{R}^{D/M}$, and each subvector is quantized independently.
- 2) Subspace Quantization: Perform k-means clustering on each subspace to generate a codebook containing $k$ cluster centers.
- 3) Encoding Representation: The original vector is represented by the combination of indices of the nearest cluster centers of its subvectors. For example, a 128-dimensional vector can be compressed into eight 8-bit indices (requiring only 64 bits of storage, a compression ratio of up to 97%).
PQ introduces two sources of error:
- 1) Reconstruction Error: Each subspace generates $K$ centroids via k-means, $C_i = \{c_{i,1}, \dots, c_{i,K}\}$. A subvector $v_i$ is quantized to its nearest centroid $q_i(v_i) = \arg\min_{c \in C_i} \|v_i - c\|$, with reconstruction error $\epsilon_i = \|v_i - q_i(v_i)\|^2$. This error decreases as $K$ increases.
- 2) Subspace Partitioning Error: A $D$-dimensional vector is partitioned into $M$ subspaces (each of dimension $d = D/M$), so an original vector $v \in \mathbb{R}^D$ is written as $v = [v_1, \dots, v_M]$. Independent quantization across subspaces destroys inter-dimensional correlations, and the overall reconstruction error decomposes as $\|v - q(v)\|^2 = \sum_{i=1}^{M} \|v_i - q_i(v_i)\|^2$. When the subspace partitioning does not align with the principal components of the data distribution (e.g., without PCA preprocessing), this error increases significantly.
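The sketch below illustrates the PQ pipeline and the error decomposition above, assuming $D \bmod M = 0$. It uses scikit-learn's KMeans for the subspace clustering; all function names are illustrative, not the paper's implementation.

```python
import numpy as np
from sklearn.cluster import KMeans

def pq_train(X, M, K, seed=0):
    """Train one codebook of K centroids per subspace (assumes D % M == 0)."""
    d = X.shape[1] // M
    return [KMeans(n_clusters=K, n_init=10, random_state=seed)
            .fit(X[:, i * d:(i + 1) * d]).cluster_centers_ for i in range(M)]

def pq_encode(X, codebooks):
    """Replace each subvector by the index of its nearest centroid."""
    d = codebooks[0].shape[1]
    codes = np.empty((len(X), len(codebooks)), dtype=np.uint16)
    for i, C in enumerate(codebooks):
        sub = X[:, i * d:(i + 1) * d]
        codes[:, i] = np.linalg.norm(sub[:, None, :] - C[None, :, :],
                                     axis=2).argmin(axis=1)
    return codes

def pq_error(X, codebooks, codes):
    """Total squared reconstruction error: the sum over subspaces."""
    d = codebooks[0].shape[1]
    return sum(np.sum((X[:, i * d:(i + 1) * d] - C[codes[:, i]]) ** 2)
               for i, C in enumerate(codebooks))
```

Increasing $K$ shrinks `pq_error` at the cost of larger codebooks, matching the trade-off described above.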
2.4. MDS Mapping Method
| Algorithm 2: MDS one-dimensional embedding |
| --- |
| Input: $m \times n$ data matrix $X$, distance metric (default: Euclidean distance). |
| Output: One-dimensional coordinate vector $z \in \mathbb{R}^m$. |
| Steps: 1. Compute the pairwise distance matrix between the $m$ rows of $X$ and square it elementwise to obtain $D^{(2)}$. 2. Apply double centering: $B = -\frac{1}{2} J D^{(2)} J$ with $J = I - \frac{1}{m}\mathbf{1}\mathbf{1}^{T}$. 3. Compute the largest eigenvalue $\lambda_1$ of $B$ and its eigenvector $u_1$. 4. The one-dimensional embedding is $z = \sqrt{\lambda_1}\, u_1$. |
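A compact sketch of the classical (Torgerson) variant of Algorithm 2 follows. The final min-max normalization reflects how Section 3.1 uses the mapped values as indices; the exact scaling is our assumption.

```python
import numpy as np

def mds_embed_1d(X):
    """Classical MDS to one dimension (a sketch of Algorithm 2)."""
    m = len(X)
    # Squared Euclidean distances between the m rows of X.
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2)
    # Double centering: B = -1/2 * J * D^(2) * J, with J = I - (1/m) 11^T.
    J = np.eye(m) - np.ones((m, m)) / m
    B = -0.5 * J @ sq @ J
    # The top eigenpair of the symmetric matrix B gives the coordinates.
    vals, vecs = np.linalg.eigh(B)          # eigenvalues in ascending order
    z = np.sqrt(max(vals[-1], 0.0)) * vecs[:, -1]
    # Min-max normalization so the values can serve as codebook indices
    # (the paper's exact normalization is our assumption).
    return (z - z.min()) / (z.max() - z.min() + 1e-12)
```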
3. Method
3.1. Algorithm Design

The proposed algorithm proceeds in the following stages:
- Encrypted Entity Alignment: Select common samples, i.e., extract the samples shared by each party's dataset.
- Local Initialization and Training of the PQ Quantizer:
  - 1) Pad dimensions: Check whether the local dimension $\dim$ is divisible by $sub\_dim$. If not, compute the number of dimensions to pad, $p = sub\_dim - (\dim \bmod sub\_dim)$, and append $p$ zeros to the end of each original vector.
  - 2) Generate the subspace codebooks based on the training data.
- Secondary Mapping: Apply the MDS one-dimensional embedding algorithm to the cluster centers in each subspace codebook, then use the normalized mapped values as codebook indices.
- Data Transmission: The codebooks are stored locally; only the indices are transferred to the server.
- Server-Side Global Cluster Center Aggregation: The server executes the k-means algorithm on the indices uploaded by the clients to obtain abstract global cluster centers. To use these abstract centers, each client applies the same mapping to its local data and computes distances to the global abstract centers to determine the true cluster assignments. The algorithm requires only one round of communication (a runnable sketch of the full pipeline is given after Algorithm 3 below).

Table 1. Symbol explanation:
| Symbol | Description |
|---|---|
| $D$ | Original data dimensionality |
| $sub\_dim$ | Subspace data dimensionality |
| $sub\_k$ | k-value (cluster centers per subspace) for the g-th client |
| $X_g$ | Dataset of the g-th client |
| $C_g$ | Codebook of the g-th client |
| $C_{g,i}$ | Cluster center set of the i-th subspace for the g-th client |
| $B_g$ | Data index set of the g-th client |
| $B$ | Global data index set |
| Algorithm 3: Proposed algorithm |
| --- |
| Input: Distributed dataset $X = \{X_1, \dots, X_G\}$, where $X_g$ is the data of the g-th client; number of clusters $k$; maximum iterations $T$. |
| Output: Global cluster centers $\mu_1, \dots, \mu_k$. |
| Steps: 1. Each client extracts the aligned common samples from $X_g$. 2. Each client pads its features, trains its subspace codebooks $C_g$, and encodes its data with PQ. 3. Each client maps the centers of each codebook to normalized one-dimensional MDS values and replaces the PQ codes by these values to form its index set $B_g$. 4. Each client uploads $B_g$; the codebooks remain local. 5. The server concatenates the uploaded index sets into $B$ and runs k-means ($k$ clusters, at most $T$ iterations) on $B$ to obtain the abstract global cluster centers. |
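To tie the stages together, here is a hedged end-to-end sketch of Algorithm 3 that reuses `pq_train`, `pq_encode`, and `mds_embed_1d` from the earlier sketches. Names such as `client_upload` and `server_aggregate` are ours, and the aggregation details are assumptions consistent with Section 3.1.

```python
import numpy as np
from sklearn.cluster import KMeans
# Reuses pq_train, pq_encode, and mds_embed_1d from the sketches above;
# parameter names follow Table 1 (sub_dim, sub_k).

def pad_features(X, sub_dim):
    """Zero-pad so the local dimension is divisible by sub_dim."""
    p = (-X.shape[1]) % sub_dim            # p = sub_dim - (dim mod sub_dim), or 0
    return np.hstack([X, np.zeros((len(X), p))]) if p else X

def client_upload(X, sub_dim, sub_k):
    """Client side: codebooks stay local; only MDS-mapped indices leave."""
    X = pad_features(X, sub_dim)
    M = X.shape[1] // sub_dim
    codebooks = pq_train(X, M, sub_k)
    codes = pq_encode(X, codebooks)        # (n, M) centroid indices
    # Map each subspace's centroids to normalized 1-D MDS values, then
    # replace every PQ code by the MDS value of its centroid.
    mds_vals = [mds_embed_1d(C) for C in codebooks]
    return np.stack([mds_vals[i][codes[:, i]] for i in range(M)], axis=1)

def server_aggregate(index_blocks, k, T=100):
    """Server side: one k-means pass over the concatenated index sets
    (rows are assumed aligned across clients by the entity alignment)."""
    B = np.hstack(index_blocks)            # global data index set B
    km = KMeans(n_clusters=k, max_iter=T, n_init=10, random_state=0).fit(B)
    return km.cluster_centers_, km.labels_ # abstract global centers
```

Only the index matrices cross the network, so a single upload per client completes the protocol, consistent with the one-round communication claim.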
3.2. Privacy Enhancement
- Serial Composition: For a given dataset $D$, assume there exist randomized algorithms $A_1, A_2, \dots, A_n$ with privacy budgets $\epsilon_1, \epsilon_2, \dots, \epsilon_n$, respectively. The composed algorithm $A = (A_1, A_2, \dots, A_n)$ provides $\left(\sum_{i=1}^{n} \epsilon_i\right)$-DP protection. That is, for the same dataset, applying a series of differentially private algorithms sequentially provides protection equivalent to the sum of the privacy budgets.
- Parallel Composition: For disjoint datasets $D_1, D_2, \dots, D_n$, assume there exist randomized algorithms $A_1, A_2, \dots, A_n$ with privacy budgets $\epsilon_1, \epsilon_2, \dots, \epsilon_n$, respectively. The composed algorithm provides $\left(\max_{i} \epsilon_i\right)$-DP protection. That is, for disjoint datasets, applying different differentially private algorithms separately in parallel provides privacy protection equivalent to the maximum privacy budget among the composed algorithms.
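As a small illustration of the two composition rules, the following sketch tracks privacy budgets; the helper names are ours.

```python
def serial_budget(epsilons):
    """Sequential composition on one dataset: budgets add up."""
    return sum(epsilons)

def parallel_budget(epsilons):
    """Parallel composition on disjoint datasets: the maximum applies."""
    return max(epsilons)

# Three mechanisms with eps = 0.5, 1.0, 2.0: running all of them on the
# same dataset costs 3.5-DP in total, while running each on a disjoint
# partition costs only 2.0-DP.
assert serial_budget([0.5, 1.0, 2.0]) == 3.5
assert parallel_budget([0.5, 1.0, 2.0]) == 2.0
```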
4. Experiments and Results
4.1. Experimental Settings
4.1.1. Dataset
4.1.2. Parameter Settings
- Total number of clients: 2
- Total number of client features: 784
- Client feature ratios: 1:1, 1:6, 1:13
- PQ quantization subspace dimensions: 1, 2, 4, 8
- Number of PQ quantization cluster centers per subspace: 10, 64, 128, 256
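For reproducibility, the settings above can be collected into a single grid; this dictionary is a hypothetical organization of the listed parameters, not code from the paper.

```python
# Hypothetical experiment grid mirroring Section 4.1.2.
EXPERIMENT_GRID = {
    "num_clients": 2,
    "total_features": 784,                     # 28x28 MNIST pixels
    "feature_ratios": ["1:1", "1:6", "1:13"],  # feature split across clients
    "sub_dims": [1, 2, 4, 8],                  # PQ subspace dimensions
    "sub_ks": [10, 64, 128, 256],              # cluster centers per subspace
}
```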
4.1.3. Evaluation Metrics
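The result tables report NMI and ARI; a minimal evaluation helper using scikit-learn is sketched below (the wrapper name is ours).

```python
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

def evaluate_clustering(y_true, y_pred):
    """NMI and ARI against ground-truth labels, as reported in the tables."""
    return {"NMI": normalized_mutual_info_score(y_true, y_pred),
            "ARI": adjusted_rand_score(y_true, y_pred)}
```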
4.2. Performance Analysis
4.2.1. Comparison Between the Proposed Method and Centralized K-means
| Subspace Dimension | Cluster Centers per Subspace | Proposed Method (NMI) | Proposed Method (ARI) | Without MDS, Indices at Server (NMI) | Without MDS, Indices at Server (ARI) | Without MDS, Restored Data at Server (NMI) | Without MDS, Restored Data at Server (ARI) | Centralized (NMI) | Centralized (ARI) |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 10 | 0.53403 | 0.42542 | 0.50209 | 0.41110 | 0.51871 | 0.40490 | 0.49581 | 0.36387 |
| 2 | 10 | 0.49353 | 0.37211 | 0.47981 | 0.38815 | 0.49406 | 0.38168 | | |
| 4 | 10 | 0.51018 | 0.38628 | 0.44081 | 0.33740 | 0.52128 | 0.40873 | | |
| 8 | 10 | 0.52476 | 0.43289 | 0.42186 | 0.31696 | 0.52207 | 0.40520 | | |
| 1 | 64 | 0.49707 | 0.36712 | 0.48437 | 0.37222 | 0.51585 | 0.40136 | | |
| 2 | 64 | 0.53773 | 0.42763 | 0.44766 | 0.33290 | 0.49617 | 0.36369 | | |
| 4 | 64 | 0.50124 | 0.37786 | 0.42137 | 0.32132 | 0.49500 | 0.38318 | | |
| 8 | 64 | 0.48329 | 0.39052 | 0.41128 | 0.34143 | 0.49390 | 0.36302 | | |
| 1 | 128 | 0.48375 | 0.36266 | 0.50666 | 0.41889 | 0.49081 | 0.36068 | | |
| 2 | 128 | 0.49855 | 0.38596 | 0.46257 | 0.35915 | 0.49043 | 0.36039 | | |
| 4 | 128 | 0.48026 | 0.35906 | 0.42605 | 0.34344 | 0.49403 | 0.38208 | | |
| 8 | 128 | 0.47704 | 0.37444 | 0.40413 | 0.31093 | 0.48159 | 0.37014 | | |
| 1 | 256 | 0.52029 | 0.40768 | 0.49181 | 0.40905 | 0.49049 | 0.36069 | | |
| 2 | 256 | 0.48873 | 0.36059 | 0.49902 | 0.39875 | 0.48150 | 0.35965 | | |
| 4 | 256 | 0.49801 | 0.39486 | 0.43289 | 0.35194 | 0.49628 | 0.36436 | | |
| 8 | 256 | 0.46309 | 0.36877 | 0.41015 | 0.34877 | 0.48375 | 0.36400 | | |

(The centralized baseline does not depend on the PQ parameters, so its scores are reported once in the first row.)
4.2.2. Impact of MDS on Algorithm Performance
4.2.3. Impact of Client Feature Quantity on Performance
| Number of Clients | Feature Ratio Among Clients | Subspace Dimension | Cluster Centers per Subspace | Proposed Method (NMI) | Proposed Method (ARI) |
|---|---|---|---|---|---|
| 2 | 1:1 | 1 | 10 | 0.53403 | 0.42542 |
| 2 | 1:1 | 2 | 10 | 0.49353 | 0.37211 |
| 2 | 1:1 | 4 | 10 | 0.51018 | 0.38628 |
| 2 | 1:1 | 8 | 10 | 0.52476 | 0.43289 |
| 2 | 1:6 | 1 | 10 | 0.51950 | 0.40631 |
| 2 | 1:6 | 2 | 10 | 0.51160 | 0.39005 |
| 2 | 1:6 | 4 | 10 | 0.51162 | 0.38794 |
| 2 | 1:6 | 8 | 10 | 0.52514 | 0.42123 |
| 2 | 1:13 | 1 | 10 | 0.49672 | 0.36139 |
| 2 | 1:13 | 2 | 10 | 0.49207 | 0.36924 |
| 2 | 1:13 | 4 | 10 | 0.51211 | 0.38790 |
| 2 | 1:13 | 8 | 10 | 0.53030 | 0.44559 |
5. Conclusion