Submitted: 04 March 2025
Posted: 05 March 2025
Abstract
Keywords:
1. Introduction
- Effective Dimensionality Reduction Using UMAP: We successfully applied the nonlinear dimensionality reduction technique UMAP to the clinical dataset, reducing complexity while preserving essential data structures. This allowed for clearer visualization and better handling of nonlinear relationships between medical variables.
- Identification of Patient Groups with Varying Cardiac Risk Levels: By utilizing the K-means clustering algorithm on the reduced dataset, we identified two distinct groups of patients with different levels of risk for heart attacks. This highlights the potential of unsupervised ML methods in uncovering hidden patterns in medical data.
- Insights into Critical Biomarkers for Heart Attack Risk: We identified troponin, KCM, and glucose levels, along with gender, as significant factors in stratifying cardiovascular risk among patients. This finding can aid clinicians in focusing on key biomarkers for early detection and intervention.
- Contribution to Personalized Medicine and Preventive Cardiology: Our approach demonstrates how AI and machine learning techniques can enhance risk stratification accuracy, leading to more targeted interventions for high-risk patients and improved prevention strategies in cardiology.
2. Materials and Methods
- Exploratory Data Analysis (EDA): We began the study with a thorough analysis of the medical data, identifying the most relevant features for subsequent analysis. During this phase, we evaluated the distribution of variables, checked for missing data, and explored possible relationships between variables. Our goal in the EDA was to gain a clear understanding of the data and prepare the dataset for the next phases of analysis. Detecting outliers and normalizing variables were critical steps to ensure data homogeneity and readiness for modeling. By thoroughly understanding the data's characteristics, we aimed to minimize biases and enhance the accuracy of our modeling efforts.
- Dimensionality Reduction: The original dataset contained multiple medical features that could have nonlinear relationships. To facilitate clustering and pattern visualization, we implemented dimensionality reduction techniques. We applied Uniform Manifold Approximation and Projection (UMAP), a nonlinear dimensionality reduction method designed to preserve both local and global data structures. We chose UMAP over linear techniques like Principal Component Analysis (PCA) because of its superior ability to maintain complex, nonlinear relationships within the data, a crucial aspect when dealing with medical datasets where variables often interact in intricate ways. This phase reduced the dataset to a two-dimensional space, simplifying the subsequent clustering stage and making patterns easier to discern.
- K-Means Clustering Algorithm: After reducing the data to two dimensions using UMAP, we applied the K-means clustering algorithm to identify natural groupings among the patients. K-means partitions the data into a predefined number of clusters, aiming to minimize intra-cluster distances and maximize separation between clusters. In this study, we chose to establish two clusters after evaluating several options and determining that this number provided the most meaningful segmentation of patients based on critical biomarkers like troponin, KCM and glucose levels. This decision was guided by methods such as the elbow method and silhouette analysis, ensuring that our clustering approach was both data-driven and clinically relevant.
- Validation and Visualization: After applying K-means, we validated the model using internal validation metrics and visual inspection of the clusters. We refined the clustering based on the cluster cohesion and separation observed in the visualization. Once satisfied with the cluster quality, we proceeded to interpret the clusters clinically, identifying relevant patterns and differences between patient groups that could inform risk stratification and intervention strategies.
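The K-means step described in the list above minimizes intra-cluster distances. As a compact statement of that objective, the within-cluster sum of squares can be written as below. The notation is introduced here for illustration and does not appear in the original text: x_i is the two-dimensional UMAP embedding of patient i, C_k is the set of patients assigned to cluster k, mu_k is the centroid of that cluster, and K is the number of clusters (K = 2 in this study).

```latex
J(C_1, \dots, C_K) \;=\; \sum_{k=1}^{K} \sum_{x_i \in C_k} \lVert x_i - \mu_k \rVert^{2},
\qquad
\mu_k \;=\; \frac{1}{\lvert C_k \rvert} \sum_{x_i \in C_k} x_i .
```

K-means alternates between assigning each point to its nearest centroid and recomputing each centroid as the mean of its assigned points. Neither step can increase J, so the procedure converges, although only to a local minimum that depends on the initial centroids.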
2.1. Medical Dataset and EDA
- Age: Patient's age in years.
- Gender: Male or Female (represented as 0 and 1, respectively).
- Pulse Rate: Heart rate measured in beats per minute.
- High Blood Pressure (Systolic Pressure): Maximum arterial pressure during heart contraction.
- Low Blood Pressure (Diastolic Pressure): Minimum arterial pressure between heartbeats.
- Glucose Level: Blood glucose concentration in mg/dL.
- CK-MB (Creatine Kinase MB, labeled KCM in the dataset): An enzyme found primarily in the heart and, to a lesser extent, in skeletal muscle.
- Troponin Level: Blood troponin concentration in ng/mL, a specific biomarker for myocardial damage.
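The outlier detection and normalization steps mentioned for the EDA can be sketched as follows. The text does not state which outlier rule or scaling method was used, so the 1.5 × IQR rule and z-score standardization below are assumptions chosen for illustration. The function names are hypothetical. The input values are the KCM readings from the ten sample rows in Appendix A.

```python
import statistics

# KCM (CK-MB) values copied from the ten sample rows in Appendix A.
kcm = [1.80, 6.75, 1.99, 13.87, 1.08, 1.83, 0.71, 300.0, 2.35, 2.84]


def iqr_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR].

    Assumption: the 1.5*IQR rule is a common convention, not necessarily
    the rule applied in the study.
    """
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


def z_scores(values):
    """Standardize values to zero mean and unit (population) standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


if __name__ == "__main__":
    print("Flagged KCM values:", iqr_outliers(kcm))
    print("Standardized KCM:", [round(z, 2) for z in z_scores(kcm)])
```

On these ten rows, the rule flags only the reading of 300. Whether such a value is a measurement error or a genuine extreme is a clinical judgment that the sketch does not make.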
2.2. Dimensionality Reduction
2.3. Application of K-Means
- Elbow Method: This technique involves running K-means clustering on the dataset for a range of K values and computing the within-cluster sum of squares (WCSS). By plotting WCSS against the number of clusters, we look for an "elbow" point where the rate of decrease sharply changes, indicating diminishing returns with additional clusters [16]. In our analysis, the elbow point suggested that K=2 was optimal.
- Silhouette Coefficient: This metric measures how well each data point fits within its assigned cluster compared to other clusters. It ranges from -1 to 1, where a higher value indicates better clustering quality [17]. We calculated the silhouette scores for different K values and found that the highest average silhouette score occurred at K=2, reinforcing the result from the elbow method.
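The two selection criteria above can be illustrated with a small, dependency-free sketch. It is not the implementation used in the study, which the text does not name. The K-means routine uses a simple deterministic farthest-first initialization, which is an assumption made here for reproducibility, and the demonstration data are synthetic points rather than patient records. All function names are hypothetical.

```python
import math


def kmeans(points, k, n_iter=100):
    """Lloyd's algorithm with farthest-first initialization (an assumption)."""
    centroids = [points[0]]
    while len(centroids) < k:
        # Next centroid: the point farthest from all centroids chosen so far.
        centroids.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        )
    labels = None
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points
        ]
        if new_labels == labels:
            break
        labels = new_labels
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels, centroids


def wcss(points, labels, centroids):
    """Within-cluster sum of squared distances (the elbow-method quantity)."""
    return sum(math.dist(p, centroids[l]) ** 2 for p, l in zip(points, labels))


def mean_silhouette(points, labels):
    """Average silhouette coefficient; requires at least two clusters."""
    clusters = {}
    for p, l in zip(points, labels):
        clusters.setdefault(l, []).append(p)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")
    scores = []
    for p, l in zip(points, labels):
        own = clusters[l]
        if len(own) == 1:
            scores.append(0.0)  # convention for singleton clusters
            continue
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        b = min(
            sum(math.dist(p, q) for q in other) / len(other)
            for m, other in clusters.items()
            if m != l
        )
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)


def scan_k(points, ks):
    """Return {k: (wcss, mean silhouette or None for k == 1)}."""
    results = {}
    for k in ks:
        labels, centroids = kmeans(points, k)
        sil = mean_silhouette(points, labels) if k > 1 else None
        results[k] = (wcss(points, labels, centroids), sil)
    return results


if __name__ == "__main__":
    # Synthetic 2-D points forming two well-separated groups (not study data).
    offsets = [(-0.4, -0.3), (0.2, 0.5), (0.5, -0.2), (-0.1, 0.4), (0.3, 0.1), (-0.5, 0.2)]
    demo = offsets + [(6 + dx, 6 + dy) for dx, dy in offsets]
    for k, (w, s) in scan_k(demo, range(1, 5)).items():
        print(k, round(w, 3), None if s is None else round(s, 3))
```

In practice, a library implementation with multiple random restarts would usually be preferred over this sketch, since a single initialization can leave K-means in a poor local minimum.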
3. Simulation Results
3.1. EDA Application
3.2. Dimensionality Reduction Applications
- Preservation of Local and Global Structure: UMAP is specifically designed to preserve both the local and global structures of the data. This means it attempts to maintain close relationships between similar data points as well as the broader relationships among groups of points in the high-dimensional space. By doing so, UMAP provides a more faithful representation of the data's intrinsic geometry in a lower-dimensional space. This characteristic is crucial when dealing with complex datasets where important patterns may exist at different scales.
- Manifold Approximation: UMAP operates under the assumption that the data lies on a low-dimensional manifold within the high-dimensional space. It seeks to find a representation of this manifold in a lower-dimensional space. This approach can result in a clearer separation of clusters or patterns, making it easier to identify distinct groups within the data. UMAP's ability to capture nonlinear relationships enhances the visualization and interpretability of data.
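For readers who want the two properties above in symbolic form, the following summarizes the objective described in the UMAP paper by McInnes et al. (listed in the references). The symbols are chosen here for illustration: x_i is a patient record in the original feature space, y_i its low-dimensional embedding, rho_i the distance from x_i to its nearest neighbor, sigma_i a per-point scale set from the chosen number of neighbors, and a, b curve parameters derived from UMAP's minimum-distance setting.

```latex
% High-dimensional membership strength (nonzero only for the nearest neighbors of x_i)
v_{j \mid i} = \exp\!\left( -\frac{\max\bigl(0,\; d(x_i, x_j) - \rho_i\bigr)}{\sigma_i} \right),
\qquad
v_{ij} = v_{j \mid i} + v_{i \mid j} - v_{j \mid i}\, v_{i \mid j}

% Low-dimensional similarity between embedded points
w_{ij} = \left( 1 + a\, \lVert y_i - y_j \rVert^{2b} \right)^{-1}

% Fuzzy-set cross-entropy minimized over the embedding
C = \sum_{i \neq j} \left[ v_{ij} \log \frac{v_{ij}}{w_{ij}}
    + (1 - v_{ij}) \log \frac{1 - v_{ij}}{1 - w_{ij}} \right]
```

The first term of C pulls points with high membership strength together, which corresponds to preserving local neighborhoods. The second term pushes apart points that are not neighbors in the original space, which is what shapes the arrangement of groups relative to one another. How faithfully the broader structure is retained depends on the number of neighbors and the minimum-distance setting, which the text does not report.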
3.3. K-Means Application

4. Discussion
5. Conclusions
- Early Risk Identification: Enables the detection of patterns and risk factors that might go unnoticed in traditional analyses.
- Personalized Treatments: Facilitates patient stratification, potentially leading to more personalized and effective interventions.
- Impact on Precision Medicine: Incorporating biomarker analysis into risk assessment offers an opportunity to implement targeted and evidence-based interventions. For instance, patients in Group 2 could benefit from comprehensive management programs to reduce cardiovascular risk and improve long-term clinical outcomes, aligning with the principles of personalized medicine [32].
- Optimization of Healthcare Resources: Helps prioritize medical care toward patients at higher risk, enhancing efficiency in resource allocation.
Author Contributions
Funding
Data Availability Statement
Acknowledgments
Conflicts of Interest
Appendix A
| Age | Gender | Impulse | Pressurehight | Pressurelow | Glucose | KCM | Troponin |
|---|---|---|---|---|---|---|---|
| 64 | 1 | 66 | 160 | 83 | 160 | 1.80 | 0.012 |
| 21 | 1 | 94 | 98 | 46 | 296 | 6.75 | 1.060 |
| 55 | 1 | 64 | 160 | 77 | 270 | 1.99 | 0.003 |
| 64 | 1 | 70 | 120 | 55 | 270 | 13.87 | 0.122 |
| 55 | 1 | 64 | 112 | 65 | 300 | 1.08 | 0.003 |
| 58 | 0 | 61 | 112 | 58 | 87 | 1.83 | 0.004 |
| 32 | 0 | 40 | 179 | 68 | 102 | 0.71 | 0.003 |
| 63 | 1 | 60 | 214 | 82 | 87 | 300 | 2.370 |
| 44 | 0 | 60 | 154 | 81 | 135 | 2.35 | 0.004 |
| 67 | 1 | 61 | 160 | 95 | 100 | 2.84 | 0.011 |
References
- World Health Organization (WHO), “Cardiovascular diseases (CVDs).” Accessed: Oct. 23, 2024. [Online]. Available: https://www.who.int/news-room/fact-sheets/detail/cardiovascular-diseases-(cvds).
- Y. Chen et al., “Rhodiola rosea: A Therapeutic Candidate on Cardiovascular Diseases,” Oxid Med Cell Longev, vol. 2022, 2022. [CrossRef]
- J. Heaton, “Ian Goodfellow, Yoshua Bengio, and Aaron Courville: Deep learning,” Genet Program Evolvable Mach, vol. 19, no. 1–2, pp. 305–307, Jun. 2018. [CrossRef]
- A. Esteva et al., “A guide to deep learning in healthcare,” Nat Med, vol. 25, no. 1, pp. 24–29, Jan. 2019. [CrossRef]
- E. J. Topol, “High-performance medicine: the convergence of human and artificial intelligence,” Nat Med, vol. 25, no. 1, pp. 44–56, Jan. 2019. [CrossRef]
- M. Capó, A. Pérez, and J. A. Lozano, “An efficient approximation to the K-means clustering for massive data,” Knowl Based Syst, vol. 117, no. 1, pp. 56–69, Feb. 2017. [CrossRef]
- I. T. Jolliffe and J. Cadima, “Principal component analysis: a review and recent developments,” Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, vol. 374, no. 2065, Apr. 2016. [CrossRef]
- L. McInnes, J. Healy, and J. Melville, “UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction,” Journal of Machine Learning Research, Sep. 2020, [Online]. Available: http://arxiv.org/abs/1802. 0342.
- F. S. Apple, Y. Sandoval, A. S. Jaffe, and J. Ordonez-Llanos, “Cardiac Troponin Assays: Guide to Understanding Analytical Characteristics and Their Impact on Clinical Care,” Clin Chem, vol. 63, no. 1, pp. 73–81, Jan. 2017. [CrossRef]
- A. S. V. Shah et al., “High sensitivity cardiac troponin and the under-diagnosis of myocardial infarction in women: prospective cohort study,” BMJ, vol. 350, no. 7992, p. g7873, Jan. 2015. [CrossRef]
- P. Cunningham and Z. Ghahramani, “Linear Dimensionality Reduction: Survey, Insights, and Generalizations,” Journal of Machine Learning Research, vol. 16, pp. 2859–2900, Jun. 2015.
- H. Abdi and L. J. Williams, “Principal component analysis,” WIREs Computational Statistics, vol. 2, no. 4, pp. 433–459, Jul. 2010. [CrossRef]
- D. Gonzalez-Franco, J. E. Preciado-Velasco, J. E. Lozano-Rizk, R. Rivera-Rodriguez, J. Torres-Rodriguez, and M. A. Alonso-Arevalo, “Comparison of Supervised Learning Algorithms on a 5G Dataset Reduced via Principal Component Analysis (PCA),” Future Internet, vol. 15, no. 10, Oct. 2023. [CrossRef]
- L. McInnes, J. Healy, N. Saul, and L. Großberger, “UMAP: Uniform Manifold Approximation and Projection,” J Open Source Softw, vol. 3, no. 29, p. 861, Sep. 2018. [CrossRef]
- D. Xu and Y. Tian, “A Comprehensive Survey of Clustering Algorithms,” Annals of Data Science, vol. 2, no. 2, pp. 165–193, Jun. 2015. [CrossRef]
- P. Bholowalia and A. Kumar, “EBK-Means: A Clustering Technique based on Elbow Method and K-Means in WSN,” Int J Comput Appl, vol. 105, no. 9, pp. 17–24, Nov. 2014, [Online]. Available: https://api.semanticscholar. 1759.
- O. Arbelaitz, I. Gurrutxaga, J. Muguerza, J. M. Pérez, and I. Perona, “An extensive comparative study of cluster validity indices,” Pattern Recognit, vol. 46, no. 1, pp. 243–256, Jan. 2013. [CrossRef]
- P. Schober, C. Boer, and L. A. Schwarte, “Correlation Coefficients: Appropriate Use and Interpretation,” Anesth Analg, vol. 126, no. 5, pp. 1763–1768, May 2018. [CrossRef]
- P. K. Whelton et al., “2017 ACC/AHA/AAPA/ABC/ACPM/AGS/APhA/ASH/ASPC/NMA/PCNA Guideline for the Prevention, Detection, Evaluation, and Management of High Blood Pressure in Adults,” J Am Coll Cardiol, vol. 71, no. 19, pp. e127–e248, May 2018. [CrossRef]
- B. S. Everitt, S. Landau, M. Leese, and D. Stahl, Cluster Analysis, 5th ed. in Wiley Series in Probability and Statistics. UK: Wiley, 2011. [CrossRef]
- L. van der Maaten and G. Hinton, “Visualizing Data using t-SNE,” Netherlands, Aug. 2008. Accessed: Oct. 25, 2024. [Online]. Available: https://jmlr.csail.mit.edu/papers/volume9/vandermaaten08a/vandermaaten08a.
- V. Regitz-Zagrosek and G. Kararigas, “Mechanistic Pathways of Sex Differences in Cardiovascular Disease,” Physiol Rev, vol. 97, no. 1, pp. 1–37, Jan. 2017. [CrossRef]
- E. A. Ashley, “The Precision Medicine Initiative,” JAMA, vol. 313, no. 21, p. 2119, Jun. 2015. [CrossRef]
- K. Thygesen et al., “Fourth Universal Definition of Myocardial Infarction (2018),” Circulation, vol. 138, no. 20, Nov. 2018. [CrossRef]
- N. Daccord et al., “High-quality de novo assembly of the apple genome and methylome dynamics of early fruit development,” Nat Genet, vol. 49, no. 7, pp. 1099–1106, Jul. 2017. [CrossRef]
- American Diabetes Association, “Standards of Medical Care in Diabetes—2021 Abridged for Primary Care Providers,” Clinical Diabetes, vol. 39, no. 1, pp. 14–43, Jan. 2021. [CrossRef]
- B. A. Swinburn et al., “The Global Syndemic of Obesity, Undernutrition, and Climate Change: The Lancet Commission report,” The Lancet, vol. 393, no. 10173, pp. 791–846, Feb. 2019. [CrossRef]
- M. Brownlee, “The Pathobiology of Diabetic Complications,” Diabetes, vol. 54, no. 6, pp. 1615–1625, Jun. 2005. [CrossRef]
- W. C. Knowler et al., “Reduction in the incidence of type 2 diabetes with lifestyle intervention or metformin.,” N Engl J Med, vol. 346, no. 6, pp. 393–403, Feb. 2002. [CrossRef]
- S. M. Grundy et al., “Diagnosis and management of the metabolic syndrome: an American Heart Association/National Heart, Lung, and Blood Institute Scientific Statement.,” Circulation, vol. 112, no. 17, pp. 2735–52, Oct. 2005. [CrossRef]
- V. Kumar, J. K. Thakur, and M. Prasad, “Histone acetylation dynamics regulating plant development and stress responses,” Cell Mol Life Sci, vol. 78, no. 10, pp. 4467–4486, May 2021, doi: 10.1007/s00018-021-03794-x.
- E. J. Topol, “High-performance medicine: the convergence of human and artificial intelligence,” Nat Med, vol. 25, no. 1, pp. 44–56, Jan. 2019. [CrossRef]






| | age | pulse | Pressure_hight | Pressure_low | glucose | KCM | troponin |
|---|---|---|---|---|---|---|---|
| Mean | 56.19 | 78.34 | 127.17 | 72.26 | 146.63 | 15.27 | 0.36 |
| Std | 13.65 | 51.63 | 26.12 | 14.03 | 74.92 | 46.33 | 1.15 |
| Min | 14 | 20 | 42 | 38 | 35 | 0.32 | 0.001 |
| 25% | 47 | 64 | 110 | 62 | 98 | 1.66 | 0.006 |
| 50% | 58 | 74 | 124 | 72 | 116 | 2.85 | 0.014 |
| 75% | 65 | 85 | 143 | 81 | 169 | 5.81 | 0.086 |
| Max | 103 | 1111 | 223 | 154 | 541 | 300 | 10.3 |
| | Group 1 | Group 2 |
|---|---|---|
| Age | 58 | 55 |
| Gender | Only 3% are men | 97% are men |
| Pressure_hight | 127 | 127 |
| Pressure_low | 73 | 72 |
| glucose | 143 | 150 |
| KCM | 8.18 | 18.65 |
| troponin | 0.1186 | 0.4761 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).