Submitted: 01 August 2023
Posted: 03 August 2023
Abstract
Keywords:
1. Introduction
2. Materials and Methods
2.1. Traditional Statistical Models
2.1.1. Logistic Regression
2.1.2. Cox Proportional Hazards Model
2.1.3. Forward Stepwise Variable Selection
2.2. Machine Learning Methods
2.2.1. LASSO
2.2.2. Elastic Net
2.3. Performance Measurements
3. Results
3.1. Impact of Over-Selection
3.2. Comparison of Prediction Methods
3.3. Real Data Analysis
4. Discussion
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
Abbreviations
| Abbreviation | Definition |
|---|---|
| AIC | Akaike information criterion |
| AUC | area under the curve |
| BIC | Bayesian information criterion |
| C-SVS | Cox regression with forward stepwise selection |
| EN | elastic net |
| LASSO | least absolute shrinkage and selection operator |
| L-SVS | logistic regression with forward stepwise selection |
| ML | machine learning |
| MLE | maximum likelihood estimation |
| RF | random forest |
| ROC | receiver operating characteristic |
| R-SVS | regression methods with stepwise selection |
| SLN | sentinel lymph node |
| SVS | stepwise variable selection |
Appendix A. Tables
| | (S1) LASSO | (S1) EN | (S1) L-SVS | (S2) LASSO | (S2) EN | (S2) L-SVS |
|---|---|---|---|---|---|---|
| (i) | | | | | | |
| Total Selections | 31.15 | 74.59 | 5.32 | 28.59 | 56.61 | 5.32 |
| True Selections | 4.03 | 4.49 | 3.25 | 4.28 | 4.60 | 3.25 |
| AUC-Training | 0.92 | 0.95 | 0.84 | 0.92 | 0.95 | 0.85 |
| AUC-Validation | 0.70 | 0.69 | 0.71 | 0.72 | 0.71 | 0.71 |
| (ii) | | | | | | |
| Total Selections | 34.64 | 101.47 | 5.63 | 44.56 | 84.76 | 6.88 |
| True Selections | 6.73 | 7.92 | 3.60 | 7.76 | 8.19 | 4.31 |
| AUC-Training | 0.92 | 0.96 | 0.83 | 0.96 | 0.97 | 0.87 |
| AUC-Validation | 0.69 | 0.68 | 0.67 | 0.73 | 0.72 | 0.70 |
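The selection counts and AUC values reported in the table above can be illustrated with a minimal sketch. It assumes that "Total Selections" is the number of features a method retains, that "True Selections" is the number of retained features that are truly associated with the outcome, and that AUC is the usual rank-based estimate (the probability that a randomly chosen positive case scores higher than a randomly chosen negative case, with ties counted as one half). The function names `selection_counts` and `auc`, and the toy inputs, are hypothetical and not taken from the paper; the non-integer counts in the table suggest that such quantities were averaged over simulation replicates.

```python
from typing import Sequence, Set, Tuple


def selection_counts(selected: Set[str], truth: Set[str]) -> Tuple[int, int]:
    """Return (total selections, true selections) for one fitted model.

    `selected` holds the names of the features the model kept;
    `truth` holds the names of the features that truly drive the outcome.
    Both names are illustrative assumptions.
    """
    return len(selected), len(selected & truth)


def auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Rank-based AUC: share of (positive, negative) pairs ranked correctly.

    Ties in the score contribute 0.5. This O(n_pos * n_neg) loop is meant
    for clarity, not speed.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # Toy example: 3 features kept, 2 of them are true signals.
    print(selection_counts({"x1", "x2", "x9"}, {"x1", "x2", "x3"}))
    # Toy example: 3 of the 4 positive/negative pairs are ordered correctly.
    print(auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
```

Computing the same AUC once on the data used for fitting and once on held-out data gives the training and validation rows; a large gap between the two is the usual sign of over-fitting.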
| | (S1) LASSO | (S1) EN | (S1) C-SVS | (S2) LASSO | (S2) EN | (S2) C-SVS |
|---|---|---|---|---|---|---|
| (i) & 30% Censoring | | | | | | |
| Total Selections | 16.51 | 24.30 | 4.80 | 19.28 | 25.89 | 5.37 |
| True Selections | 4.11 | 4.46 | 3.53 | 4.45 | 4.53 | 3.78 |
| (training) | 22.79 | 26.25 | 17.32 | 25.76 | 28.24 | 19.68 |
| (validation) | 9.06 | 8.76 | 9.79 | 10.45 | 10.20 | 10.41 |
| C-index (training) | 0.74 | 0.76 | 0.71 | 0.76 | 0.77 | 0.73 |
| C-index (validation) | 0.64 | 0.64 | 0.65 | 0.66 | 0.66 | 0.66 |
| (ii) & 10% Censoring | | | | | | |
| Total Selections | 20.16 | 23.89 | 5.74 | 20.96 | 25.00 | 5.86 |
| True Selections | 4.61 | 4.82 | 4.44 | 4.78 | 4.82 | 4.43 |
| (training) | 27.60 | 29.72 | 22.51 | 29.73 | 31.61 | 23.83 |
| (validation) | 12.89 | 12.88 | 14.95 | 14.66 | 14.39 | 15.69 |
| C-index (training) | 0.74 | 0.75 | 0.71 | 0.75 | 0.76 | 0.72 |
| C-index (validation) | 0.66 | 0.66 | 0.67 | 0.67 | 0.67 | 0.68 |
| (iii) & 30% Censoring | | | | | | |
| Total Selections | 26.68 | 36.83 | 6.57 | 30.35 | 37.73 | 7.85 |
| True Selections | 7.52 | 8.26 | 4.89 | 8.44 | 8.61 | 5.69 |
| (training) | 30.12 | 34.26 | 20.61 | 34.18 | 36.50 | 24.90 |
| (validation) | 9.89 | 9.87 | 9.07 | 13.07 | 12.87 | 11.48 |
| C-index (training) | 0.78 | 0.81 | 0.73 | 0.81 | 0.82 | 0.76 |
| C-index (validation) | 0.66 | 0.66 | 0.64 | 0.69 | 0.69 | 0.67 |
| (iv) & 10% Censoring | | | | | | |
| Total Selections | 29.46 | 36.69 | 8.52 | 32.47 | 37.53 | 9.65 |
| True Selections | 8.56 | 8.85 | 6.69 | 9.10 | 9.32 | 7.71 |
| (training) | 36.21 | 38.68 | 27.61 | 39.76 | 41.38 | 32.14 |
| (validation) | 14.77 | 14.44 | 15.62 | 18.44 | 18.26 | 19.54 |
| C-index (training) | 0.78 | 0.79 | 0.74 | 0.80 | 0.80 | 0.76 |
| C-index (validation) | 0.67 | 0.67 | 0.68 | 0.70 | 0.70 | 0.70 |
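The C-index rows in the table above can be illustrated with a short sketch of Harrell's concordance index for right-censored data. This is an assumption about the estimator, since the table does not state which variant was used. A pair of subjects is treated as comparable when the one with the shorter follow-up time had an observed event; the pair is concordant when that subject also has the higher predicted risk. Tied risk scores count one half, and pairs with tied times are skipped for simplicity. The function name `c_index` and the toy data are hypothetical.

```python
from typing import Sequence


def c_index(times: Sequence[float],
            events: Sequence[int],
            risks: Sequence[float]) -> float:
    """Harrell's C for right-censored survival data (simplified sketch).

    times  : observed follow-up time per subject
    events : 1 if the event was observed, 0 if the subject was censored
    risks  : predicted risk score (higher means earlier expected event)
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a censored subject cannot anchor a comparable pair
        for j in range(n):
            if times[i] < times[j]:
                # Subject i failed first, so i should have the higher risk.
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


if __name__ == "__main__":
    # Four subjects; the second one is censored at time 4.
    times = [2, 4, 6, 8]
    events = [1, 0, 1, 1]
    risks = [0.9, 0.2, 0.3, 0.6]
    # Comparable pairs: (1,2), (1,3), (1,4), (3,4); only (3,4) is discordant.
    print(c_index(times, events, risks))
```

A value of 0.5 corresponds to random ordering and 1.0 to perfect ordering, which is why the validation values in the table are read against 0.5 rather than 0.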
| Method | # Selected Features | AUC (Training) | AUC (Validation) | Selected Features |
|---|---|---|---|---|
| LASSO | 13 | 1.0000 | 0.6619 | add_trt, GE_CCL1, GE_CLEC6A, GE_HLA_DQA1, GE_IL1RL1, GE_IL25, GE_MAGEA12, GE_MASP1, GE_MASP2, GE_PRAME, GE_S100B, GE_SAA1, GE_USP9Y |
| Elastic Net | 13 | 1.0000 | 0.6381 | add_trt, GE_CCL1, GE_CLEC6A, GE_HLA_DQA1, GE_IL1RL1, GE_IL1RL2, GE_IL25, GE_MASP1, GE_MASP2, GE_PRAME, GE_S100B, GE_SAA1, GE_USP9Y |
| L-SVS | 4 | 0.9833 | 0.6905 | add_trt, GE_IL1RL1, GE_IL17F, GE_IL1RL2 |
| Method | # Selected Features | p-value (Training) | p-value (Validation) | C-index (Training) | C-index (Validation) | Selected Features |
|---|---|---|---|---|---|---|
| LASSO | 5 | 3.7153 | 1.3623 | 0.8848 | 0.6959 | add_trt, GE_CCL3, GE_CCL4, GE_IL17A, GE_NEFL |
| Elastic Net | 11 | 3.8872 | 1.1965 | 0.9058 | 0.6701 | add_trt, GE_CCL3, GE_CCL4, GE_CRP, GE_CXCL1, GE_CXCR4, GE_HLA_DRB4, GE_IL17A, GE_IL8, GE_MAGEA12, GE_NEFL |
| C-SVS | 4 | 3.1182 | 1.4518 | 0.9634 | 0.6907 | add_trt, GE_NEFL, GE_IFNL1, GE_MAGEC1 |
| LASSO: Feature | LASSO: Coef. | LASSO: p-value | Elastic Net: Feature | Elastic Net: Coef. | Elastic Net: p-value | C-SVS: Feature | C-SVS: Coef. | C-SVS: p-value |
|---|---|---|---|---|---|---|---|---|
| add_trt | 4.558 | 0.004 | add_trt | 3.021 | 0.251 | add_trt | 5.352 | 0.001 |
| GE_CCL3 | 3.634 | 0.153 | GE_CCL3 | 0.094 | 0.908 | GE_NEFL | | 0.006 |
| GE_CCL4 | 1.163 | 0.563 | GE_CCL4 | 6.818 | 0.693 | GE_IFNL1 | 0.146 | 0.003 |
| GE_IL17A | 6.268 | 0.073 | GE_CRP | | 0.191 | GE_MAGEC1 | | 0.011 |
| GE_NEFL | | 0.023 | GE_CXCL1 | 7.499 | 0.164 | | | |
| | | | GE_CXCR4 | | 0.548 | | | |
| | | | GE_HLA_DRB4 | | 0.412 | | | |
| | | | GE_IL17A | 5.879 | 0.202 | | | |
| | | | GE_IL8 | | 0.802 | | | |
| | | | GE_MAGEA12 | | 0.846 | | | |
| | | | GE_NEFL | | 0.736 | | | |
Appendix B. Figures





References
- Christodoulou E et al. (2019) A systematic review shows no performance benefit of machine learning over logistic regression for clinical prediction models. J Clin Epidemiol., 110:12-22.
- Cox, D. R. (1972) Regression Models and Life-Tables. Journal of the Royal Statistical Society. Series B (Methodological), 34(2), 187–220.
- Engelhard MM et al. (2021) Incremental Benefits of Machine Learning—When Do We Need a Better Mousetrap. JAMA Cardiol., 6(6):621–623.
- Farrow NE et al. (2021) Characterization of Sentinel Lymph Node Immune Signatures and Implications for Risk Stratification for Adjuvant Therapy in Melanoma. Ann Surg Oncol., 28(7):3501-3510.
- Jing B et al. (2022) Comparing Machine Learning to Regression Methods for Mortality Prediction Using Veterans Affairs Electronic Health Record Clinical Data. Medical Care, 60(6):470-479.
- Kattan MW. (2003) Comparison of Cox regression with other methods for determining prediction models and nomograms. J Urol., 170(6 Pt 2):S6-10.
- Khera R et al. (2021) Use of Machine Learning Models to Predict Death After Acute Myocardial Infarction. JAMA Cardiol., 6(6):633–641.
- Kuhle S et al. (2018) Comparison of logistic regression with machine learning methods for the prediction of fetal growth abnormalities: a retrospective cohort study. BMC Pregnancy Childbirth, 18(1):333.
- Piros P et al. (2019) Comparing machine learning and regression models for mortality prediction based on the Hungarian Myocardial Infarction Registry. Knowledge-Based Systems, 179(1):1-7.
- Simon R et al. (2003) Pitfalls in the use of DNA microarray data for diagnostic and prognostic classification. Journal of the National Cancer Institute, 95:14–8.
- Song X et al. (2021) Comparison of machine learning and logistic regression models in predicting acute kidney injury: A systematic review and meta-analysis. International Journal of Medical Informatics, 151:104484.
- Stylianou N et al. (2015) Mortality risk prediction in burn injury: Comparison of logistic regression with machine learning approaches. Burns, 41(5):925-934.
- Tibshirani R. (1996) Regression Shrinkage and Selection via the Lasso. Journal of the Royal Statistical Society. Series B (Methodological), 58(1), 267-288.
- Tibshirani R. (1997) The lasso Method for Variable Selection in the Cox Model. Statistics in Medicine, 16(4):385–395.
- Tolles J and Meurer WJ (2016) Logistic Regression: Relating Patient Characteristics to Outcomes. JAMA, 316(5):533–534.
- Zou H. and Hastie T. (2005) Regularization and Variable Selection via the Elastic Net. Journal of the Royal Statistical Society. Series B (Statistical Methodology), 67(2), 301-320.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).