Submitted: 20 August 2025
Posted: 21 August 2025
Abstract
Keywords:
Introduction
Methods and Materials
Ethical Approval
Source of Study Population
Inclusion Criteria
Exclusion Criteria
- Pregnant women and those uncertain of their pregnancy status (N = 278).
- Individuals with mismatches between self-reported and genetically determined sex (N = 320).
- Participants related up to the second degree, identified using a kinship coefficient cutoff of 0.0884 (N = 33,369).
- Individuals with a prior or current diagnosis of vascular or heart problems reported at baseline (N = 25,340).
- Participants using cholesterol-lowering medications (N = 34,243).
- Individuals who reported ceasing smoking or alcohol consumption due to health reasons or medical advice (N = 58,752).
- Participants with missing data on key confounding variables (N = 61,961).
- Non-hypertensive participants, defined as those not meeting the hypertension criteria outlined in the inclusion section.
Study Population and Period
Genotyping and Imputation
Study Variables
- Sociodemographic factors: age (in years) and sex (male/female).
- Lifestyle factors: alcohol consumption (current, previous, never) and cigarette smoking status (current, previous, never).
- Clinical factors: body mass index (BMI), total cholesterol (TC) and low-density lipoprotein (LDL) levels, and diabetes mellitus (DM).
Definition of the Outcome
Computation of Genetic Liabilities
- Filtering SNPs with low call rates or violations of the Hardy-Weinberg equilibrium,
- Excluding SNPs with minor allele frequency (MAF) < 1%,
- Linkage disequilibrium (LD) pruning using a window size of 250 kb, step size = 50, and an r² threshold of 0.1.
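The MAF and LD-pruning steps listed above can be illustrated with a minimal Python sketch. It is not the authors' pipeline. The function names, the 0/1/2 dosage coding, and the greedy pairwise strategy are assumptions made for illustration. The call-rate and Hardy-Weinberg filters are omitted because their thresholds are not stated, and the sliding-window step size (50) is not modelled: every pair of SNPs within 250 kb is compared.

```python
import numpy as np

def maf(genotypes: np.ndarray) -> np.ndarray:
    """Minor allele frequency per SNP from a (samples x SNPs) matrix coded 0/1/2."""
    alt_freq = np.nanmean(genotypes, axis=0) / 2.0
    return np.minimum(alt_freq, 1.0 - alt_freq)

def qc_and_prune(genotypes, positions_bp, maf_min=0.01,
                 window_bp=250_000, r2_max=0.1):
    """Drop SNPs with MAF < maf_min, then greedily prune by LD.

    A SNP is kept unless its squared correlation with an already-kept SNP
    lying within window_bp exceeds r2_max. Returns original column indices.
    """
    positions_bp = np.asarray(positions_bp)
    passing = np.flatnonzero(maf(genotypes) >= maf_min)
    # Visit the surviving SNPs in genomic order.
    order = passing[np.argsort(positions_bp[passing])]
    kept = []
    for j in order:
        in_ld = False
        for k in kept:
            if abs(positions_bp[j] - positions_bp[k]) > window_bp:
                continue  # outside the window, not compared
            r = np.corrcoef(genotypes[:, j], genotypes[:, k])[0, 1]
            if r ** 2 > r2_max:
                in_ld = True
                break
        if not in_ld:
            kept.append(int(j))
    return kept

# Toy data (hypothetical): 8 samples, 4 SNPs, one SNP per row before transposing.
snps = np.array([
    [0, 1, 2, 0, 1, 2, 0, 1],  # SNP 0
    [0, 1, 2, 0, 1, 2, 0, 1],  # SNP 1: identical to SNP 0 (r^2 = 1)
    [0, 0, 0, 0, 0, 0, 0, 0],  # SNP 2: monomorphic, fails the MAF filter
    [1, 0, 1, 2, 1, 0, 0, 2],  # SNP 3: weakly correlated with SNP 0
], dtype=float)
G = snps.T
kept = qc_and_prune(G, [1000, 2000, 3000, 4000])
print(kept)
```

In this toy example SNP 2 is removed by the MAF filter and SNP 1 by LD pruning, leaving SNPs 0 and 3.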
Statistical Analyses
Prediction Model Development and Performance Assessment
Implementation
Cox Proportional Hazards Model (CoxPH)
Penalized Regression Models
- Penalized Cox Regression (CoxNet)
- Penalized Logistic Regression (GLMnet)
Neural Network Model (Nnet)
Assessing Model Prediction Performance
Results
Study Characteristics and Statistical Analysis
Prediction Performance of the Models
Model Performance at Multiple Follow-Up Time Points
Discussion
Comparison Between Traditional and Machine Learning Models That Include the Effect of Time
Comparison Based on the Different Follow-Up Time Lengths
Gender Differences
Age Differences
Strengths and Limitations
Conclusion
Supplementary Materials
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgment
Conflicts of Interest
Figure legends
- Figure 1. Flowchart of study participant selection and exclusion criteria. The UK Biobank (UKB) included over 500,000 participants when this study began. We used k-means clustering to extract 425,054 participants of European ancestry with genetic data. The final dataset included 116,216 participants who met the inclusion criteria.
- Figure 2. Correlation matrix plot showing the correlation coefficients between numerical features. TC and LDL were highly correlated (r² > 0.8), so LDL was excluded from further analysis (prediction model construction). BMI: body mass index; TC: total cholesterol; LDL: low-density lipoprotein cholesterol.


| Characteristics | Overall (N=116,216) | Control (N=102,450) | Case (N=13,766) | HR (95% CI) | p-value |
|---|---|---|---|---|---|
| Age (years); Mean (SD) | 57.6 (7.53) | 57.6 (7.53) | 60.9 (6.70) | 1.67 (1.57, 1.78) | <0.001 |
| Body mass index (kg/m²); Mean (SD) | 28.0 (4.83) | 28.0 (4.83) | 27.9 (4.77) | 0.98 (0.93, 1.03) | 0.44* |
| Total cholesterol (mmol/L); Mean (SD) | 6.05 (1.06) | 6.05 (1.06) | 5.96 (1.07) | 0.92 (0.88, 0.97) | 0.004 |
| Low-density lipoprotein (mmol/L); Mean (SD) | 3.84 (0.81) | 3.84 (0.81) | 3.81 (0.81) | 0.96 (0.91, 1.02) | 0.16* |
| Sex (Male); n (%) | 55269 (47.6%) | 54476 (47.4%) | 793 (57.6%) | 1.50 (1.35, 1.67) | <0.001 |
| Diabetes Mellitus (Yes); n (%) | 4638 (4.0%) | 4545 (4.0%) | 93 (6.8%) | 1.75 (1.42, 2.16) | <0.001 |
| Smoking status | | | | | <0.001 |
| Current; n (%) | 37098 (31.9%) | 36579 (31.9%) | 519 (37.7%) | 1.33 (1.19, 1.48) | |
| Previous; n (%) | 1320 (1.1%) | 1283 (1.1%) | 37 (2.7%) | 1.15 (0.83, 1.59) | |
| Drinking status | | | | | <0.001 |
| Current; n (%) | 108867 (93.7%) | 107635 (93.7%) | 1232 (89.5%) | 0.61 (0.49, 0.78) | |
| Previous; n (%) | 3333 (2.9%) | 3263 (2.8%) | 70 (5.1%) | 2.67 (1.92, 3.72) | |

n (%); Mean (SD); HR = Hazard Ratio; CI = Confidence Interval.

| Characteristic | HR | 95% CI | p-value |
|---|---|---|---|
| Continuous Stroke Genetic Liability | | | |
| Model 1 (unadjusted) | 1.11 | 1.06, 1.18 | <0.001 |
| Model 2 (Model 1 + age and sex) | 1.12 | 1.06, 1.18 | <0.001 |
| Model 3 (Model 2 + clinical factors)* | 1.12 | 1.06, 1.18 | <0.001 |
| Model 4 (Model + lifestyle factors)** | 1.12 | 1.06, 1.18 | <0.001 |

| Models | Model Features | Sample | AUC % (95% CI) | Brier Score |
|---|---|---|---|---|
| CoxPH | Conventional factors + genetics | Whole population (n = 116,216) | 67.0 (64, 70) | 0.01 |
| CoxPH | Conventional factors + genetics | Men (n = 55,269) | 68.0 (64, 71) | 0.01 |
| CoxPH | Conventional factors + genetics | Women (n = 60,947) | 65.0 (60, 69) | 0.01 |
| CoxPH | Conventional factors + genetics | Age > 59 (n = 55,370) | 60.0 (56, 64) | 0.02 |
| CoxPH | Conventional factors + genetics | Age < 59 (n = 60,846) | 58.0 (53, 63) | 0.02 |
| CoxNet | Conventional factors + genetics | Whole population (n = 116,216) | 67.0 (64, 70) | 0.01 |
| CoxNet | Conventional factors + genetics | Men (n = 55,269) | 67.0 (64, 71) | 0.01 |
| CoxNet | Conventional factors + genetics | Women (n = 60,947) | 65.0 (60, 69) | 0.01 |
| CoxNet | Conventional factors + genetics | Age > 59 (n = 55,370) | 60.0 (57, 64) | 0.02 |
| CoxNet | Conventional factors + genetics | Age < 59 (n = 60,846) | 56.0 (51, 61) | 0.01 |
| GLMnet | Conventional factors + genetics | Whole population (n = 116,216) | 67.0 (64, 70) | 0.001 |
| GLMnet | Conventional factors + genetics | Men (n = 55,269) | 67.0 (63, 70) | 0.002 |
| GLMnet | Conventional factors + genetics | Women (n = 60,947) | 65.0 (61, 69) | 0.001 |
| GLMnet | Conventional factors + genetics | Age > 59 (n = 55,370) | 61.0 (58, 65) | 0.002 |
| GLMnet | Conventional factors + genetics | Age < 59 (n = 60,846) | 57.0 (52, 62) | 0.001 |
| Nnet | Conventional factors + genetics | Whole population (n = 116,216) | 66.0 (63, 69) | 0.001 |
| Nnet | Conventional factors + genetics | Men (n = 55,269) | 65.0 (62, 69) | 0.002 |
| Nnet | Conventional factors + genetics | Women (n = 60,947) | 64.0 (59, 68) | 0.001 |
| Nnet | Conventional factors + genetics | Age > 59 (n = 55,370) | 60.0 (56, 63) | 0.002 |
| Nnet | Conventional factors + genetics | Age < 59 (n = 60,846) | 55.0 (50, 60) | 0.001 |

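The table above reports two metrics, AUC and Brier score. A minimal pure-Python sketch of how each can be computed from predicted probabilities follows. The function names and the toy data are hypothetical, and this is not the authors' evaluation code. It also does not handle censoring, which a time-to-event evaluation would need to address.

```python
def brier_score(y_true, p_pred):
    """Mean squared difference between predicted probability and the 0/1 outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, p_pred)) / len(y_true)

def auc(y_true, p_pred):
    """Probability that a random case is scored above a random control.

    Ties between a case and a control count as half a concordant pair.
    """
    cases = [p for y, p in zip(y_true, p_pred) if y == 1]
    controls = [p for y, p in zip(y_true, p_pred) if y == 0]
    concordant = sum((c > k) + 0.5 * (c == k) for c in cases for k in controls)
    return concordant / (len(cases) * len(controls))

# Toy example (hypothetical data): two controls followed by two cases.
y = [0, 0, 1, 1]
p = [0.1, 0.4, 0.35, 0.8]
print(auc(y, p))          # 3 of 4 case-control pairs are concordant
print(brier_score(y, p))
```

When the outcome is rare, the Brier score is small almost regardless of discrimination: a constant prediction equal to the event rate π scores π(1 − π). This is one reason to read it alongside the AUC.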
| Years | Cases | Survivors (without the event) | Censored | CoxPH (AUC %) | CoxNet (AUC %) | GLMnet (AUC %) | Nnet (AUC %) |
|---|---|---|---|---|---|---|---|
| 2 | 75 | 34789 | 0 | 64.62 | 64.63 | 63.18 | 62.47 |
| 4 | 168 | 34696 | 0 | 65.26 | 65.29 | 69.98 | 68.78 |
| 6 | 282 | 34582 | 0 | 65.91 | 65.91 | 68.68 | 67.10 |
| 8 | 389 | 20450 | 14025 | 67.35 | 67.37 | 66.72 | 65.21 |
| 10 | 417 | 0 | 34447 | N/A | N/A | 50.72 | 53.40 |

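The Cases, Survivors, and Censored columns above can be read as a three-way split of participants at each follow-up horizon. The sketch below shows one way to produce such counts. It assumes definitions inferred from the column totals rather than stated in the source: a case has an observed event at or before the horizon, a survivor is still under follow-up beyond it, and a censored participant left follow-up at or before the horizon without an event. The function name and toy data are hypothetical.

```python
def status_at_horizon(times, events, horizon):
    """Count cases, survivors and censored participants at a follow-up horizon.

    times:  follow-up time in years for each participant
    events: 1 if the event was observed at that time, 0 if follow-up just ended
    """
    counts = {"cases": 0, "survivors": 0, "censored": 0}
    for t, e in zip(times, events):
        if e == 1 and t <= horizon:
            counts["cases"] += 1       # event observed by the horizon
        elif t > horizon:
            counts["survivors"] += 1   # still event-free and followed at the horizon
        else:
            counts["censored"] += 1    # follow-up ended before the horizon, no event
    return counts

# Toy cohort (hypothetical): follow-up in years and event indicators.
times = [1.5, 3.0, 7.5, 9.0, 9.5]
events = [1, 1, 0, 0, 1]
for horizon in (2, 8, 10):
    print(horizon, status_at_horizon(times, events, horizon))
```

Under these definitions, once the horizon exceeds the longest follow-up no survivors remain, which matches the zero in the 10-year row.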
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).