Submitted: 15 August 2023
Posted: 16 August 2023
Abstract

Keywords:
1. Introduction
2. Introductory example case
3. Materials and Methods
3.1. Sample data sets
3.2. Experimentation
4. Results
4.1. Regression occasionally generalizes poorly compared to alternative methods
4.2. Regression inadequately captures the structural characteristics of certain data sets
4.3. Variables chosen by the most successful algorithms are more generalizable
5. Discussion
6. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
Sample Availability

Regression:

| Variable | Estimate | Std. Error | Z-value | Pr(>\|z\|) | Signif. |
|---|---|---|---|---|---|
| (Intercept) | 0.01541 | 0.08657 | 0.178 | 0.859 | |
| X | 0.03846 | 0.0827 | 0.465 | 0.642 | |
| Y | 1.56726 | 0.11461 | 13.674 | <2e-16 | *** |
| Z | -0.06873 | 0.0824 | -0.834 | 0.404 | |
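
The Z-value and Pr(>|z|) columns of the table above can be recomputed from the estimates and standard errors alone. The following is a minimal sketch, assuming the reported statistics are Wald z-tests evaluated against a standard normal distribution, which is what the column headers suggest but which is not confirmed here. The names `coefficients` and `wald_test` are illustrative and not taken from the article.

```python
import math

# Estimates and standard errors copied from the regression table above.
coefficients = {
    "(Intercept)": (0.01541, 0.08657),
    "X": (0.03846, 0.0827),
    "Y": (1.56726, 0.11461),
    "Z": (-0.06873, 0.0824),
}


def wald_test(estimate, std_error):
    """Return the Wald z statistic and its two-sided p-value.

    Assumption: the statistic follows a standard normal distribution
    under the null hypothesis that the coefficient is zero.
    """
    z = estimate / std_error
    # Two-sided tail probability of the standard normal: 2 * (1 - Phi(|z|)).
    p = math.erfc(abs(z) / math.sqrt(2))
    return z, p


results = {name: wald_test(est, se) for name, (est, se) in coefficients.items()}

for name, (z, p) in results.items():
    print(f"{name:12s} z = {z:7.3f}   p = {p:.3g}")
```

Under this assumption, the recomputed values agree with the table to the printed precision: only Y has a p-value below the conventional 0.05 threshold, while X and Z do not.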

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).