Submitted: 31 May 2024
Posted: 07 June 2024
Abstract
Keywords:
1. Introduction
2. Materials and Methods
2.1. Sparsity-Ranked Lasso
2.2. Black-Box Algorithms
2.3. Issues with Black-Box Algorithms
2.4. PMLB Processing Steps
2.5. Modeling Procedures
3. Results
3.1. Data Set Characteristics
3.2. Overall Model Performance
3.3. Case Studies
3.3.1. Exemplars
3.3.2. SRL Underperforming RF
4. Discussion
5. Conclusions
Data Availability Statement
Acknowledgments
Conflicts of Interest
Abbreviations
| MDPI | Multidisciplinary Digital Publishing Institute |
| DOAJ | Directory of open access journals |
| SRL | Sparsity-ranked lasso |





| Characteristic | Overall, N = 110 | categorical, N = 69 | continuous, N = 41 |
| --- | --- | --- | --- |
| n_instances | 856.21 (1,619.0) | 611.93 (795.8) | 1,267.32 (2,406.2) |
| n_features | 10.15 (7.0) | 12.07 (7.6) | 6.93 (4.4) |
| n_categorical_features | 5.02 (7.0) | 7.97 (7.4) | 0.05 (0.3) |
| n_continuous_features | 5.14 (6.0) | 4.10 (6.5) | 6.88 (4.5) |
| n_classes | 148.57 (858.0) | 2.00 (0.0) | 395.24 (1,380.8) |
| imbalance | 0.08 (0.1) | 0.11 (0.2) | 0.04 (0.1) |

Note: values are mean (SD).

| | LSO | NN | RF | SRL | SV | XB |
| --- | --- | --- | --- | --- | --- | --- |
| **Continuous** | | | | | | |
| CV Rsq; mean (SD) | 65.4 (22) | 41.8 (40) | 75.4 (18) | 69.4 (23) | 67.6 (22) | 76.5 (18) |
| Test Rsq; mean (SD) | 68.1 (23) | 58.3 (29) | 74.6 (23) | 71 (24) | 66.4 (24) | 71.6 (23) |
| Best performance (%) | 15.4 | 15.4 | 33.3 | 23.1 | 2.6 | 10.3 |
| Within 5% of best (%) | 46.2 | 35.9 | 66.7 | 66.7 | 43.6 | 51.3 |
| Run time (s?); mean (SD) | 3.2 (2) | 8.1 (8) | 15.8 (17) | 4.9 (3) | 10.3 (12) | 15.1 (5) |
| **Binary** | | | | | | |
| Test AUC; mean (SD) | 82.4 (17) | 83.4 (16) | 85.1 (18) | 85.9 (15) | 73.3 (18) | 85.3 (16) |
| Best performance (%) | 22.7 | 27.3 | 37.9 | 34.8 | 6.1 | 39.4 |
| Within 5% of best (%) | 65.2 | 56.1 | 69.7 | 78.8 | 18.2 | 71.2 |
| Run time (s?); mean (SD) | 7.2 (8) | 12.7 (10) | 13.9 (14) | 11.6 (11) | 8.3 (8) | 14.9 (3) |

| Term | CV Rsq: Estimate (CI) | CV Rsq: p | Test Rsq: Estimate (CI) | Test Rsq: p | AUC: Estimate (CI) | AUC: p |
| --- | --- | --- | --- | --- | --- | --- |
| Intercept | 69.4 (63, 76) | < 0.001 | 71 (64, 79) | < 0.001 | 85.9 (82, 90) | < 0.001 |
| LSO | -4 (-9, 1) | 0.11 | -2.9 (-9, 3) | 0.32 | -3.5 (-6, -1) | 0.018 |
| NN | -2.4 (-7, 3) | 0.35 | -7.2 (-13, -1) | 0.016 | -2.5 (-5, 0) | 0.092 |
| RF | 5.9 (1, 11) | 0.019 | 3.5 (-2, 9) | 0.22 | -0.8 (-4, 2) | 0.60 |
| SV | -1.8 (-7, 3) | 0.46 | -4.7 (-10, 1) | 0.11 | -12.6 (-16, -10) | < 0.001 |
| XB | 7 (2, 12) | 0.005 | 0.6 (-5, 6) | 0.85 | -0.6 (-3, 2) | 0.70 |
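
One way to read the coefficient table above is as contrasts against SRL. This is an assumption, not something the tables state outright: it rests on the intercept (69.4 for CV Rsq) matching the SRL mean in the performance table, and on the usual reference-level (dummy) coding. Under that assumption, intercept plus estimate should roughly recover each model's mean. The short Python sketch below checks this for the CV Rsq column using only numbers transcribed from the two tables; the function and variable names are illustrative.

```python
# Values transcribed from the tables above (continuous outcomes, CV Rsq, in %).
INTERCEPT = 69.4  # assumed to be the SRL reference level
ESTIMATES = {"LSO": -4.0, "NN": -2.4, "RF": 5.9, "SV": -1.8, "XB": 7.0}
REPORTED_MEANS = {
    "LSO": 65.4, "NN": 41.8, "RF": 75.4,
    "SRL": 69.4, "SV": 67.6, "XB": 76.5,
}


def implied_means(intercept, estimates, reference="SRL"):
    """Return the mean implied for each model under reference-level coding."""
    means = {reference: intercept}
    for model, estimate in estimates.items():
        means[model] = round(intercept + estimate, 1)
    return means


implied = implied_means(INTERCEPT, ESTIMATES)

# Gap between the implied mean and the mean reported in the performance table.
gaps = {
    model: round(implied[model] - REPORTED_MEANS[model], 1)
    for model in REPORTED_MEANS
}

for model in sorted(gaps):
    print(f"{model}: implied {implied[model]:.1f}, "
          f"reported {REPORTED_MEANS[model]:.1f}, gap {gaps[model]:+.1f}")
```

With these inputs, every model except NN agrees to within rounding. The NN row does not: the implied and reported CV Rsq means differ by a wide margin. The tables alone do not explain this gap, so it is best treated as an open question about how the NN row was summarized rather than as evidence for any particular cause.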

| Model | Test R-squared | Test RMSE | Runtime (s) |
| --- | --- | --- | --- |
| SRL | 0.773 | 3.12 | 13.2 |
| LSO | 0.741 | 3.34 | 9.0 |
| RF | 0.767 | 3.17 | 47.5 |
| SV | 0.744 | 3.32 | 34.5 |
| NN | 0.673 | 3.78 | 17.1 |
| XB | 0.770 | 3.14 | 4.1 |

| Model | ROC AUC | Runtime (s) |
| --- | --- | --- |
| SRL | 0.885 | 5.8 |
| LSO | 0.894 | 2.5 |
| RF | 0.899 | 7.9 |
| SV | 0.821 | 9.6 |
| NN | 0.917 | 10.5 |
| XB | 0.811 | 18.1 |

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).