Submitted: 31 May 2026
Posted: 02 June 2026
Abstract
Keywords:
1. Introduction
- A framework that integrates singular value decomposition (SVD) with machine learning classifiers for high-dimensional colon cancer datasets.
- A systematic comparison of traditional classifiers with classifiers tuned by a genetic algorithm (GA), evaluated under nested cross-validation.
- An in-depth analysis of the strengths and weaknesses of applying evolutionary algorithms to genomic data classification.
- An interpretive analysis of difficult cases, such as the drop in sensitivity observed for the GA-SVM classifier.
2. Related Work
3. Dataset and Methodology
3.1. Dataset and Preprocessing
3.2. Singular Value Decomposition (SVD)
- U is an m×m orthogonal matrix whose columns are the left singular vectors of A,
- V is an n×n orthogonal matrix whose columns are the right singular vectors of A,
- Σ is an m×n diagonal matrix containing the non-negative singular values arranged in descending order.
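For reference, the factorization that these three definitions describe is conventionally written as below for an m×n matrix A of rank r. The rank-k truncation A_k is the form usually used for dimensionality reduction; the symbols r, k, σ_i, u_i, and v_i are assumed notation, and whether this study truncates in exactly this way is not stated in the surviving text.

```latex
A = U \Sigma V^{\top}, \qquad
\Sigma = \operatorname{diag}(\sigma_1, \dots, \sigma_r, 0, \dots, 0), \qquad
\sigma_1 \ge \sigma_2 \ge \dots \ge \sigma_r > 0,
\\[4pt]
A_k = \sum_{i=1}^{k} \sigma_i \, u_i v_i^{\top}, \qquad k \le r,
```

where u_i and v_i denote the i-th columns of U and V.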
4. Algorithms
4.1. Support Vector Machine (SVM)
4.2. Random Forest (RF)
- Step 1: Draw B bootstrap samples from the original training data.
- Step 2: Grow an unpruned decision tree on each bootstrap sample.
- Step 3: At each node, randomly select m_try of the predictor variables as split candidates.
- Step 4: Choose the best split among these m_try variables only.
- Step 5: Repeat until all B trees are grown.
- Step 6: Aggregate the predictions of the B trees by majority vote.
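The six steps can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the implementation used in the study: the split criterion (Gini impurity), the toy data, and all function names are hypothetical choices, and a node whose sampled variables cannot separate its samples is simply turned into a majority-class leaf.

```python
import random
from collections import Counter

def gini(labels):
    """Gini impurity of a non-empty list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(X, y, features):
    """Step 4: best (impurity, feature, threshold) among the candidate features only."""
    best = None
    for f in features:
        # Every unique value except the largest keeps both children non-empty.
        for t in sorted({row[f] for row in X})[:-1]:
            left = [y[i] for i, row in enumerate(X) if row[f] <= t]
            right = [y[i] for i, row in enumerate(X) if row[f] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if best is None or score < best[0]:
                best = (score, f, t)
    return best

def grow_tree(X, y, m_try, rng):
    """Steps 2-4: grow an unpruned tree. A leaf is a class label, a node is a tuple."""
    if len(set(y)) == 1:
        return y[0]
    features = rng.sample(range(len(X[0])), m_try)  # Step 3: random candidates
    split = best_split(X, y, features)
    if split is None:  # assumption: fall back to a majority-class leaf
        return Counter(y).most_common(1)[0][0]
    _, f, t = split
    left = [i for i, row in enumerate(X) if row[f] <= t]
    right = [i for i, row in enumerate(X) if row[f] > t]
    return (f, t,
            grow_tree([X[i] for i in left], [y[i] for i in left], m_try, rng),
            grow_tree([X[i] for i in right], [y[i] for i in right], m_try, rng))

def predict_tree(node, row):
    while isinstance(node, tuple):
        f, t, left, right = node
        node = left if row[f] <= t else right
    return node

def random_forest(X, y, n_trees=25, m_try=1, seed=0):
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):  # Steps 1 and 5: one bootstrap sample per tree
        idx = [rng.randrange(len(X)) for _ in X]
        forest.append(grow_tree([X[i] for i in idx], [y[i] for i in idx], m_try, rng))
    return forest

def predict_forest(forest, row):
    """Step 6: majority vote over all trees."""
    votes = Counter(predict_tree(tree, row) for tree in forest)
    return votes.most_common(1)[0][0]

# Toy data (hypothetical): two well-separated classes in two dimensions.
X = [[0.1, 0.2], [0.3, 0.1], [0.2, 0.4], [0.4, 0.3],
     [5.1, 5.2], [5.3, 4.9], [4.8, 5.4], [5.2, 5.0]]
y = [0, 0, 0, 0, 1, 1, 1, 1]
forest = random_forest(X, y)
```

Here `n_trees` plays the role of B, and `m_try` is the number of variables sampled at each node.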
4.3. Logistic Regression (LR)
- β_0 is the intercept term,
- β_1, …, β_p are the regression coefficients,
- x_1, …, x_p represent the predictor variables.
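The model these terms belong to is the standard logistic regression model, written out below. The symbol names are assumed, since the original notation did not survive: β_0 is taken as the intercept, β_1, …, β_p as the regression coefficients, and x_1, …, x_p as the predictor variables.

```latex
P(y = 1 \mid x)
  = \frac{1}{1 + \exp\!\bigl(-(\beta_0 + \beta_1 x_1 + \dots + \beta_p x_p)\bigr)}
\quad\Longleftrightarrow\quad
\log \frac{P(y = 1 \mid x)}{1 - P(y = 1 \mid x)}
  = \beta_0 + \sum_{j=1}^{p} \beta_j x_j .
```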
4.4. Gradient Boosting Machine (GBM)
- ν is the learning rate,
- F_{m-1}(x) is the output prediction at the previous step.
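A common way to write the boosting update that these two quantities enter is given below. This is a sketch in assumed notation, not necessarily the authors' exact formula: ν stands for the learning rate, F_{m-1}(x) for the prediction at the previous step, and h_m(x) for the weak learner fitted at step m to the negative gradient of the loss L.

```latex
F_m(x) = F_{m-1}(x) + \nu \, h_m(x), \qquad 0 < \nu \le 1,
\\[4pt]
h_m \approx \arg\min_{h} \sum_{i=1}^{n}
  \Bigl( -\frac{\partial L\bigl(y_i, F(x_i)\bigr)}{\partial F(x_i)}
  \Big|_{F = F_{m-1}} - h(x_i) \Bigr)^{2} .
```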
4.5. Genetic Algorithm (GA)
- Selection: the fittest individuals in the current population are selected to reproduce.
- Crossover: new individuals are created by combining two parent solutions.
- Mutation: random changes are applied to the offspring solutions.
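The three operators can be illustrated on bit-string individuals as follows. This is a minimal sketch under stated assumptions: tournament selection, one-point crossover, bit-flip mutation, and elitism are common but hypothetical choices here, and the "one-max" objective (count the 1 bits) is only a toy fitness function.

```python
import random

def tournament_select(population, fitness, rng, k=3):
    """Selection: return the fittest of k randomly drawn individuals."""
    contenders = rng.sample(population, k)
    return max(contenders, key=fitness)

def one_point_crossover(parent_a, parent_b, rng):
    """Crossover: splice two parents at a random cut point."""
    cut = rng.randrange(1, len(parent_a))
    return parent_a[:cut] + parent_b[cut:]

def bit_flip_mutation(individual, rng, rate=0.05):
    """Mutation: flip each bit independently with probability `rate`."""
    return [1 - gene if rng.random() < rate else gene for gene in individual]

def evolve(fitness, n_genes=20, pop_size=30, generations=40, seed=0):
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(n_genes)]
                  for _ in range(pop_size)]
    for _ in range(generations):
        # Elitism (an assumption, not listed above): carry the best forward.
        children = [max(population, key=fitness)]
        while len(children) < pop_size:
            a = tournament_select(population, fitness, rng)
            b = tournament_select(population, fitness, rng)
            child = one_point_crossover(a, b, rng)
            children.append(bit_flip_mutation(child, rng))
        population = children
    return max(population, key=fitness)

best = evolve(fitness=sum)  # toy objective: maximise the number of 1 bits
```

Because the best individual is always carried over, the best fitness in the population cannot decrease from one generation to the next.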
4.6. GA-SVM Algorithm
- Definition of Fitness Function: A fitness function is defined to measure the performance of each candidate solution.
- Definition of GA Parameters: The search space of the SVM parameters and the maximum number of GA iterations are set.
- Execution of Genetic Algorithm: The optimal parameters of the SVM model are determined using the GA operations.
- Evaluation of SVM Model: The resulting model is evaluated using classification accuracy, sensitivity, specificity, and AUC.
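The search part of this procedure can be sketched as a real-coded GA over two SVM parameters. Everything specific below is an assumption made for illustration: the parameters are taken to be C and gamma, the search bounds are expressed in log2 units, and `stand_in_fitness` is a hypothetical smooth surface used so that the block runs without a machine learning library. In an actual run the fitness would presumably be a cross-validated performance score of an SVM trained with the candidate parameters.

```python
import random

# Hypothetical search space in log2 units: C in [2^-5, 2^15], gamma in [2^-15, 2^3].
BOUNDS = [(-5.0, 15.0), (-15.0, 3.0)]

def clip(value, low, high):
    return max(low, min(high, value))

def ga_tune(fitness, bounds=BOUNDS, pop_size=20, max_iter=30, sigma=1.0, seed=0):
    """Search `bounds` for the individual that maximises `fitness`."""
    rng = random.Random(seed)
    pop = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(pop_size)]
    for _ in range(max_iter):
        pop.sort(key=fitness, reverse=True)
        parents = pop[: pop_size // 2]  # selection: keep the fitter half
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            w = rng.random()
            # Crossover: random convex combination of the two parents.
            child = [w * x + (1 - w) * y for x, y in zip(a, b)]
            # Mutation: Gaussian noise, clipped back into the search space.
            child = [clip(g + rng.gauss(0, sigma), lo, hi)
                     for g, (lo, hi) in zip(child, bounds)]
            children.append(child)
        pop = parents + children
    return max(pop, key=fitness)

def stand_in_fitness(individual):
    """Placeholder fitness peaking at log2(C) = 3, log2(gamma) = -7 (hypothetical)."""
    log_c, log_gamma = individual
    return -((log_c - 3.0) ** 2 + (log_gamma + 7.0) ** 2)

best = ga_tune(stand_in_fitness)
C, gamma = 2.0 ** best[0], 2.0 ** best[1]  # decode back to parameter values
```

Searching in log2 units lets one mutation step size cover several orders of magnitude of C and gamma, which is why the individuals are decoded only at the end.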
5. Experimental Design and Validation Strategy
5.1. Nested Cross-Validation Strategy
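As a point of reference, the control flow of nested cross-validation can be sketched as follows. This is a generic illustration rather than the exact protocol of this study: the fold counts, the interleaved (unshuffled, unstratified) fold assignment, and the `score` callback with its `toy_score` stand-in are all hypothetical. The key property is that model selection in the inner loop sees only the outer-training samples.

```python
def k_fold_indices(indices, k):
    """Split a list of sample indices into k (train, test) pairs."""
    folds = [indices[i::k] for i in range(k)]
    splits = []
    for f in range(k):
        train = [j for m in range(k) if m != f for j in folds[m]]
        splits.append((train, folds[f]))
    return splits

def nested_cv(n_samples, candidates, score, outer_k=5, inner_k=3):
    """score(candidate, train_idx, test_idx) -> float is supplied by the caller."""
    outer_scores = []
    for outer_train, outer_test in k_fold_indices(list(range(n_samples)), outer_k):
        # Inner loop: choose the candidate using outer-training samples only.
        def inner_mean(candidate):
            splits = k_fold_indices(outer_train, inner_k)
            return sum(score(candidate, tr, va) for tr, va in splits) / inner_k
        chosen = max(candidates, key=inner_mean)
        # Outer loop: evaluate the chosen candidate on held-out samples.
        outer_scores.append(score(chosen, outer_train, outer_test))
    return outer_scores

def toy_score(candidate, train_idx, test_idx):
    """Hypothetical stand-in: pretends the candidate value 1.0 is ideal."""
    return 1.0 / (1.0 + abs(candidate - 1.0))

outer_scores = nested_cv(20, [0.1, 1.0, 10.0], toy_score)
```

In practice `score` would fit a classifier on `train_idx` and return its performance on `test_idx`; the mean of `outer_scores` is then the performance estimate.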
5.2. Performance Evaluation Metrics
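The four metrics named for model evaluation (accuracy, sensitivity, specificity, and AUC) can be computed from labels and scores as sketched below. The definitions are the standard ones; the function names, the choice of 1 as the positive class, and the toy labels are assumptions for illustration.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (TP, TN, FP, FN) for a binary problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    return tp, tn, fp, fn

def classification_metrics(y_true, y_pred, positive=1):
    # Assumes both classes occur in y_true, otherwise a denominator is zero.
    tp, tn, fp, fn = confusion_counts(y_true, y_pred, positive)
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": tp / (tp + fn),  # true positive rate
        "specificity": tn / (tn + fp),  # true negative rate
    }

def auc(y_true, scores, positive=1):
    """AUC as the probability that a random positive outranks a random negative.

    Ties count as one half. Assumes both classes occur in y_true.
    """
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy example (hypothetical labels and scores).
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 1]
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2, 0.1, 0.6]
metrics = classification_metrics(y_true, y_pred)
area = auc(y_true, scores)
```

On this toy input one positive and one negative sample are misclassified, so accuracy, sensitivity, and specificity are all 0.75, while 15 of the 16 positive-negative pairs are ranked correctly, giving an AUC of 0.9375.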
6. Discussion
6.1. Failure Analysis of GA-SVM in High Dimensional Data
Appendix A. Additional Performance Details
Conclusion




© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.


