Enhancing Cancer Classification Using Machine Learning and Singular Value Decomposition for High-Dimensional Data

Afiaa Raheem Khudhair; Saja Mohammad Hussein; Taylan Demir

doi:10.20944/preprints202606.0129.v1

Submitted:

31 May 2026

Posted:

02 June 2026


Abstract

Classifying high-dimensional genomic data is difficult because the number of variables far exceeds the number of observations. In this study, we evaluate several machine learning approaches for colon cancer diagnosis using a high-dimensional gene expression dataset of 62 samples and 2000 genes. Singular value decomposition (SVD) was used as a feature extraction step to reduce the dimensionality of the dataset. The classification performance of support vector machine (SVM), random forest (RF), logistic regression (LR), and gradient boosting machine (GBM) models was compared against that of two hybrid algorithms, GA-SVM and GA-RF, which combine a genetic algorithm (GA) with the classifier. Performance was estimated through nested cross-validation using accuracy, sensitivity, specificity, and AUC. In the experiments, SVM performed better than the other classifiers, and the Random Forest model also proved to be an efficient predictor. In contrast, the performance of GA-SVM was highly inconsistent, reflecting the inherent drawbacks of evolution-based algorithms for small sample sizes. These results indicate that combining dimensionality reduction with appropriate machine learning models can substantially improve classification accuracy in high-dimensional biomedical problems, which is helpful for future research on cancer diagnosis.

Keywords:

colon cancer classification; machine learning; singular value decomposition; high-dimensional data; support vector machine; genetic algorithm

Subject:

Computer Science and Mathematics - Artificial Intelligence and Machine Learning

1. Introduction

Advances in high-throughput genomic technology have produced large datasets characterized by many features but few samples. Such high-dimensional data are typical in cancer genomics research, where thousands of gene expression values are measured for only a few samples. These data are difficult to classify because of overfitting and high computational requirements. Machine learning algorithms have become critical in addressing these issues and are widely applied in cancer detection, prediction, biomarker identification, and personalized treatment. The most popular supervised learning methods include support vector machines (SVM), random forests (RF), logistic regression (LR), gradient boosting machines (GBM), and variations of these models. They have shown good predictive performance in biomedical applications because they can detect intricate patterns in large datasets [1].

In general, machine learning builds a predictive model from historical data and uses that model to classify or predict previously unseen observations. Depending on whether labeled data are available, machine learning algorithms are divided into supervised and unsupervised approaches. Supervised learning uses labeled data to learn the relationship between the predictors and the response variable, while unsupervised learning uses unlabeled data to identify patterns. Supervised learning algorithms are particularly relevant to cancer classification because diagnostic labels exist. Despite these successes, analyzing genomic datasets remains difficult because of the “large-p small-n” problem, in which the number of variables far exceeds the number of observations. In such cases, dimensionality reduction is important for improving efficiency, reducing noise, and increasing classification accuracy. Singular value decomposition (SVD) is one of the most effective matrix factorization approaches for this purpose.

Model optimization is another crucial factor in the success of a classification model. Evolutionary computation approaches such as genetic algorithms (GA) have been used to optimize the parameters of classification models, but their use in high-dimensional genomic settings with small sample sizes still needs further investigation. Motivated by this problem, this study explores the classification of the colon cancer gene expression dataset using several machine learning techniques: SVM, RF, LR, and GBM. In addition, two hybrid algorithms, GA-SVM and GA-RF, are analyzed to study the impact of evolutionary computation on prediction accuracy. Because the dataset is high-dimensional, dimensionality reduction using SVD is performed before any model is built. Model performance is assessed using classification accuracy, sensitivity, specificity, and AUC. The major contributions of this study are:

A framework that integrates singular value decomposition with classification using machine learning methods for high-dimensional colon cancer datasets.
A systematic comparison of traditional classifiers against those enhanced using genetic algorithm in a nested cross-validation scenario.
An in-depth analysis of the merits and demerits of applying evolutionary algorithms to genomic data classification.
Insights into difficult examples from an interpretation perspective, such as the case of sensitivity drop-off for the GA-SVM classifier.

The rest of the paper is structured as follows. Section 2 discusses related work on machine learning-based approaches for cancer classification. Section 3 presents the dataset, the data preprocessing, and the SVD-based dimensionality reduction. Section 4 outlines the classification algorithms and the genetic algorithm. Section 5 describes the experimental design and validation strategy. Section 6 discusses the results and draws conclusions.

2. Related Work

The classification of high-dimensional biomedical data has received much attention in recent years owing to the rapid development of genomic technology and the growing availability of genomic data. The typical difficulty is that there are many variables but few samples, so models generalize poorly and overfit. This has made feature selection and feature extraction necessary for improving cancer classification systems.

SVM algorithms have been used extensively for classifying high-dimensional datasets because of their solid theoretical basis and their construction of reliable decision surfaces. Several studies have shown that SVM combined with dimensionality reduction is effective. For example, an SVM-based algorithm for dimensionality reduction and classification (SVMDRC) proved useful for classifying hyperspectral imagery, where dimensionality reduction improved classification accuracy [2].

Within genomic data analysis, SVD has proved to be an efficient mathematical technique for identifying informative low-dimensional structure in high-dimensional biological data. For instance, Anisi et al. (2016) used SVD together with ensemble SVM approaches and iterative feature elimination for microarray data analysis. Their algorithm showed significant improvements in classification results on several cancer datasets, including leukemia, breast cancer, and colon cancer. Earlier, Ghosh (2001) examined the use of SVD for tumor classification based on gene expression microarray data. That work showed that matrix decomposition methods can effectively extract the underlying structure of high-dimensional biological data, which makes classifiers built on such data easier to interpret. However, it considered only regression-based approaches and not the many other machine learning algorithms in use today.

Ensemble techniques such as Random Forest (RF) and Gradient Boosting Machine (GBM) have become popular because of their predictive power and resilience to noisy data. The RF classifier has demonstrated high accuracy in application fields such as medical diagnosis, engineering, and risk estimation [3]. GBM algorithms have likewise shown high flexibility in handling nonlinear predictor relationships [4]. In addition, evolutionary algorithms such as Genetic Algorithms (GA) have been combined with various learning systems to improve parameter tuning and classification accuracy. Hybrid techniques such as GA-SVM and GA-RF aim to optimize model parameters automatically in order to increase accuracy. However, it is not clear whether this approach is effective in very high-dimensional genomic spaces with very few samples.

Although previous work has addressed dimensionality reduction, machine learning classification, and evolutionary optimization, few studies have carried out a detailed comparison of standard machine learning methods and GA-assisted models within a rigorous nested cross-validation framework for colon cancer gene expression datasets. Research on the performance of GA-assisted optimization in small-sample, high-dimensional domains is also limited. In view of these gaps, this study explores the use of Singular Value Decomposition, machine learning algorithms, and Genetic Algorithms for colon cancer diagnosis. The focus is on analyzing and comparing the stability of the models, as well as their classification accuracy, sensitivity, specificity, and predictive capability.

3. Dataset and Methodology

3.1. Dataset and Preprocessing

The dataset used in this analysis is the well-known Colon Cancer Gene Expression dataset first presented by Alon et al. It comprises 62 samples and 2001 attributes: 2000 gene expression values and one binary target variable indicating the class label, coded as 1 for tumor and 0 for normal samples. The dataset is a classic example of a high-dimensional problem, since the number of predictors far exceeds the sample size (p >> n). Models fitted to such data are prone to overfitting, so dimensionality reduction is vital before any classifier is applied. Before model development, the response variable was separated from the predictors, and the predictors were converted into matrix format to allow the use of matrix factorization methods. Because the original dataset contains thousands of gene expression variables, its dimensionality had to be reduced while retaining the essential information. To obtain an unbiased assessment and avoid data leakage, all preprocessing was fitted only on the training portions of the nested cross-validation procedure, and the resulting transformations were then applied to the corresponding test portions.

3.2. Singular Value Decomposition (SVD)

Singular Value Decomposition (SVD) is among the most effective matrix decomposition methods for dimensionality reduction and data representation in high-dimensional domains. It has proven very useful in statistics, machine learning, signal processing, image compression, bioinformatics, and genomic data analysis. In high-dimensional settings where statistical estimation is complicated by the large number of variables, SVD decomposes the data and reveals its main structure. Unlike feature selection methods, SVD maps the original variables to an orthogonal feature space in which most of the variation in the original data is retained, at a much smaller computational cost for the subsequent modeling step. Let A be an m×n real matrix. The Singular Value Decomposition of A is given by

A = U Σ V^{T}

(1)

where

U is an m×m orthogonal matrix whose columns are the left singular vectors of A,
V is an n×n orthogonal matrix whose columns are the right singular vectors of A,
Σ is an m×n diagonal matrix containing the non-negative singular values arranged in descending order.

The singular values measure the amount of variability captured by each corresponding component. Larger singular values indicate more informative directions in the data space. In expanded form, the decomposition can be written as

A = U Σ V^{T} = [u_{1}, u_{2}, \dots, u_{m}] [\begin{matrix} σ_{1} & 0 & \dots & 0 & \dots & 0 \\ 0 & σ_{2} & \dots & 0 & \dots & 0 \\ ⋮ & ⋮ & ⋱ & ⋮ & & ⋮ \\ 0 & 0 & \dots & σ_{r} & \dots & 0 \\ ⋮ & ⋮ & & ⋮ & ⋱ & ⋮ \\ 0 & 0 & \dots & 0 & \dots & 0 \end{matrix}] [\begin{matrix} v_{1}^{T} \\ v_{2}^{T} \\ ⋮ \\ v_{n}^{T} \end{matrix}]

(2)

Here r denotes the rank of the matrix A, that is, the number of non-zero singular values. In the current work, SVD was used as a feature extraction method for the colon cancer gene expression dataset. Instead of selecting a subset of the original features, SVD maps the original feature space into a lower-dimensional one by constructing new features as linear combinations of the existing ones. A variance-based criterion was used to determine the required number of singular components. Specifically, the cumulative proportion of variance explained by the first k singular values was calculated according to

P (k) = \frac{\sum_{i = 1}^{k} σ_{i}^{2}}{\sum_{i = 1}^{r} σ_{i}^{2}},

(3)

The smallest value of k satisfying

P (k) \geq 0.98

(4)

was selected.

Therefore, at least 98% of the total variance included in the dataset was captured by this compact representation. In order to prevent any data leakage, fitting of the SVD was done only on the training portion of the data using nested cross-validation. Once the projection matrix was computed from the training portion of the data, it was similarly used for transforming the corresponding test portion of the data into the lower dimensional space. This low-dimensional representation was further used as an input feature for the ML models considered for investigation in this research, which includes SVM, RF, LR, and GBM.
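The selection rule in Equations (3) and (4) can be illustrated with a small worked example. The sketch below assumes the singular values are already available in descending order; the values in `sigma` are hypothetical and are not taken from the colon cancer data, and the function names are illustrative only.

```python
from itertools import accumulate


def cumulative_variance(singular_values):
    """Return [P(1), ..., P(r)], with P(k) = sum_{i<=k} s_i^2 / sum_{i<=r} s_i^2."""
    squared = [s * s for s in singular_values]
    total = sum(squared)
    if total == 0:
        raise ValueError("all singular values are zero")
    return [partial / total for partial in accumulate(squared)]


def smallest_k(singular_values, threshold=0.98):
    """Smallest k whose cumulative proportion P(k) reaches the threshold."""
    for k, p in enumerate(cumulative_variance(singular_values), start=1):
        if p >= threshold:
            return k
    # Guard against floating-point round-off in the final ratio.
    return len(singular_values)


# Hypothetical singular values (NOT from the paper), sorted in descending order.
sigma = [10.0, 5.0, 2.0, 1.0]
print(cumulative_variance(sigma))  # squared values 100, 25, 4, 1 over a total of 130
print(smallest_k(sigma))           # 3, because 129/130 is the first ratio >= 0.98
```

In a nested cross-validation setting, this rule would be applied to the singular values of each training portion separately, so the selected k may differ from fold to fold.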

4. Algorithms

4.1. Support Vector Machine (SVM)

The Support Vector Machine (SVM) is one of the most popular algorithms of supervised learning that is utilized to solve problems related to classification based on high-dimensional data. Due to solid theoretical background and excellent generalization ability, SVM has been successfully implemented in a wide variety of fields, including bioinformatics, cancer detection, gene expression studies, and medical decision support systems [5]. The key concept of SVM is to identify the best possible separating hyperplane in the feature space [6]. Figure 1 below shows an example of graphical representation of the principles of SVM classification.

For a binary classification problem, the separating hyperplane can be represented as

f (x) = w^{T} x + b,

(5)

where w denotes the weight vector and b represents the bias term. The classification rule is defined by

\hat{y} = \begin{cases} + 1, & \text{if } f (x) > 0, \\ - 1, & \text{if } f (x) < 0, \end{cases}

(6)

where $\hat{y}$ denotes the predicted class label. The optimal hyperplane is obtained by solving the following optimization problem:

\min_{w, b} \frac{1}{2} {‖w‖}^{2}

(7)

subject to

y_{i} (w^{T} x_{i} + b) \geq 1, i = 1, 2, \dots, n.

(8)

This formulation maximizes the margin between the two classes while requiring every training instance to be classified correctly. In many practical applications, especially genomic classification tasks, the relationship between the predictors and the response variable is nonlinear. To handle this, kernel functions are used to map the data implicitly into a higher-dimensional space. Among the available kernels, the RBF (Radial Basis Function) kernel gives very good results on high-dimensional biomedical datasets [7]. The RBF kernel is defined as

K (x_{i}, x_{j}) = \exp (- γ {‖x_{i} - x_{j}‖}^{2}),

(9)

where γ controls the influence of individual training samples on the decision boundary. The performance of the SVM classifier is mainly determined by two parameters. The parameter C governs the trade-off between maximizing the margin and minimizing misclassification errors, while γ determines the flexibility of the nonlinear decision boundary. In this study, we use an SVM with the RBF kernel as the classifier. Hyperparameters were optimized in the inner loop of the nested cross-validation, so the values of C and γ were chosen using the training fold only. This ensures that no information leaks from the test data and that the model can be evaluated fairly using its classification accuracy, sensitivity, specificity, and area under the ROC curve (AUC).
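To make the role of γ in Equation (9) and the decision rule in Equations (5) and (6) concrete, the following minimal sketch evaluates both by hand. It is only an illustration in plain Python, not the implementation used in the study; the function names and the input vectors are hypothetical, and assigning the undefined case f(x) = 0 to the negative class is an arbitrary choice made here.

```python
import math


def rbf_kernel(x_i, x_j, gamma):
    """K(x_i, x_j) = exp(-gamma * ||x_i - x_j||^2), as in Equation (9)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x_i, x_j))
    return math.exp(-gamma * squared_distance)


def linear_decision(w, b, x):
    """Evaluate f(x) = w^T x + b and map it to a label as in Equation (6)."""
    score = sum(w_k * x_k for w_k, x_k in zip(w, x)) + b
    # Equation (6) leaves f(x) = 0 undefined; it is sent to -1 here by assumption.
    return 1 if score > 0 else -1


# Hypothetical two-dimensional points with squared distance 1 + 4 = 5.
u, v = [1.0, 2.0], [2.0, 4.0]
for gamma in (0.01, 0.5, 10.0):
    # A larger gamma makes the similarity decay faster with distance.
    print(gamma, rbf_kernel(u, v, gamma))

print(rbf_kernel(u, u, 0.5))                         # identical points give 1.0
print(linear_decision([1.0, -1.0], 0.5, [2.0, 1.0]))  # f(x) = 1.5, so label 1
```

A very large γ makes each training sample influence only its immediate neighborhood, which is one reason the pair (C, γ) is usually tuned jointly.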

4.2. Random Forest (RF)

Random Forest is an ensemble learning method that uses a large number of decision trees to achieve accurate classifications and avoid overfitting. The purpose of the development of this technique was to increase the prediction power of decision trees through bootstrap aggregating and random attribute selection while building decision trees [8]. The core idea behind Random Forest is the construction of many decision trees using bootstrap sampling of the initial training set. In each tree, splitting is performed based on only a random subset of attributes. Such a random approach reduces the correlation between trees, resulting in better generalizability of the model [8]. The Random Forest algorithm can be described as follows:

Step 1: Draw B bootstrap samples from the original training data.
Step 2: Build unpruned decision trees using each of these bootstrap samples.
Step 3: Randomly choose $m_{try}$ predictor variables from all the predictor variables at each node.
Step 4: Choose the optimal split using just these $m_{try}$ predictor variables.
Step 5: Continue this procedure until all trees are built.
Step 6: Obtain the prediction from all the trees by using majority vote.

Finally, the classification output is derived as

\hat{Y} = \mathrm{mode} \{h_{1} (X), h_{2} (X), \dots, h_{B} (X)\},

(10)

where $h_{b} (X)$ denotes the prediction of the b-th decision tree and B represents the total number of trees in the forest. Random Forest has several strengths for genomic classification tasks. First, it can efficiently process datasets that include several thousand predictor variables. Second, it is highly robust to multicollinearity and noise. Third, it performs well even when the number of observations is much lower than the number of variables [8]. In this study, we used the Random Forest classifier to predict the outcome for the SVD-transformed colon cancer dataset. Model quality was measured using cross-validation and classification metrics such as accuracy, sensitivity, specificity, and area under the ROC curve.
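The three randomized ingredients of the procedure above (bootstrap sampling in Step 1, the random choice of $m_{try}$ predictors in Step 3, and the majority vote in Equation (10)) can be sketched in a few lines. This is a simplified illustration that omits tree construction itself; the helper names are hypothetical, and resolving ties in favor of the label seen first is an assumption made for this sketch.

```python
import random
from collections import Counter


def bootstrap_indices(n, rng):
    """Step 1: draw n row indices with replacement from a training set of size n."""
    return [rng.randrange(n) for _ in range(n)]


def choose_mtry(n_features, m_try, rng):
    """Step 3: pick m_try distinct candidate predictors for one node."""
    return rng.sample(range(n_features), m_try)


def majority_vote(tree_predictions):
    """Equation (10): the most frequent label among the B tree predictions."""
    # Ties go to the label encountered first (an arbitrary choice in this sketch).
    return Counter(tree_predictions).most_common(1)[0][0]


rng = random.Random(0)  # fixed seed so that repeated runs draw the same samples
print(bootstrap_indices(6, rng))       # some indices repeat, others are left out
print(choose_mtry(10, 3, rng))         # three distinct feature indices
print(majority_vote([1, 0, 1, 1, 0]))  # three of the five trees vote for class 1
```

Rows that a bootstrap sample leaves out are what makes the individual trees differ from one another, which in turn is what the vote averages over.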

4.3. Logistic Regression (LR)

Logistic Regression (LR) is among the most commonly used statistical techniques for binary classification problems. Because of its simplicity, interpretability, and computational speed, it has seen extensive use in medical diagnosis, bioinformatics, and genomic classification [9]. For cancer classification, logistic regression is especially suitable when the outcome is binary, i.e., tumor vs. normal tissue. Let Y denote a binary response variable, where

Y = \begin{cases} 1, & \text{tumor sample}, \\ 0, & \text{normal sample}. \end{cases}

(11)

The logistic regression model estimates the probability that an observation belongs to the positive class. This probability is expressed as,

P (Y = 1 | X = x) = π (x),

(12)

where $π (x)$ denotes the conditional probability of class membership. The logistic function is defined as,

π (x) = \frac{\exp (β_{0} + β_{1} x_{1} + β_{2} x_{2} + \dots + β_{p} x_{p})}{1 + \exp (β_{0} + β_{1} x_{1} + β_{2} x_{2} + \dots + β_{p} x_{p})},

(13)

where

$β_{0}$ is the intercept term,
$β_{1}, \dots, β_{p}$ are the regression coefficients,
$x_{1}, \dots, x_{p}$ represent the predictor variables.

The probability of belonging to the negative class is

P (Y = 0 | X = x) = 1 - π (x) .

(14)

Taking the logarithm of the odds yields the logit model

\log (\frac{π (x)}{1 - π (x)}) = β_{0} + \sum_{j = 1}^{p} β_{j} x_{j} .

(15)

The parameters of the model are estimated by maximizing the likelihood function, which means finding the values for the coefficients that maximize the likelihood of obtaining the observed training data [10]. In this research study, logistic regression analysis was used on the colon cancer data set after transformation using singular value decomposition, and the performance was measured using nested cross-validation with accuracy, sensitivity, specificity, and AUC [9].
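Equations (13) and (15) and the likelihood criterion can be checked numerically with a short sketch. The code below is illustrative only: the coefficients and data points are hypothetical, and the function names are not taken from the study. It evaluates the logistic function, confirms that the logit recovers the linear predictor, and computes the Bernoulli log-likelihood that maximum likelihood estimation seeks to maximize.

```python
import math


def linear_predictor(beta0, beta, x):
    """beta_0 + sum_j beta_j * x_j, the right-hand side of Equation (15)."""
    return beta0 + sum(b * v for b, v in zip(beta, x))


def pi(beta0, beta, x):
    """Equation (13), written in a numerically stable form."""
    z = linear_predictor(beta0, beta, x)
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def logit(p):
    """Log-odds, the left-hand side of Equation (15)."""
    return math.log(p / (1.0 - p))


def log_likelihood(beta0, beta, X, y):
    """Bernoulli log-likelihood of labels y (coded 0/1) given the rows of X."""
    total = 0.0
    for x_i, y_i in zip(X, y):
        p = pi(beta0, beta, x_i)
        total += y_i * math.log(p) + (1 - y_i) * math.log(1.0 - p)
    return total


# Hypothetical coefficients and observation: linear predictor 0.5 + 2 - 3 = -0.5.
p = pi(0.5, [2.0, -1.0], [1.0, 3.0])
print(p, logit(p))  # the logit returns the linear predictor up to round-off

# Coefficients that separate the two toy points score a higher log-likelihood.
X, y = [[2.0], [-2.0]], [1, 0]
print(log_likelihood(0.0, [1.0], X, y), log_likelihood(0.0, [0.0], X, y))
```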

4.4. Gradient Boosting Machine (GBM)

Gradient Boosting Machine (GBM) is an ensemble learning algorithm that combines many weak learners, typically decision trees, into a strong classifier. Unlike bagging-based algorithms such as Random Forest, GBM does not build its trees independently on random samples; instead, it builds trees sequentially so that each new tree compensates for the mistakes of the previous ensemble [11]. The principle behind GBM is to minimize a differentiable loss function through stage-wise additive modeling. Let the model output after m iterations be denoted by $F_{m} (x)$.

The model is updated iteratively according to

F_{m} (x) = F_{m - 1} (x) + η h_{m} (x),

(16)

where

$h_{m} (x)$ is the weak learner built at the m-th step,
$η$ is the learning rate,
$F_{m - 1} (x)$ is the output prediction at the previous step.

At each boosting step, the algorithm fits a new weak learner to the negative gradient of the loss function. The pseudo-residuals are computed as

r_{i m} = - {[\frac{\partial L (y_{i}, F (x_{i}))}{\partial F (x_{i})}]}_{F (x) = F_{m - 1} (x)}

(17)

where $L (\cdot)$ denotes the selected loss function. The weak learner is trained on these pseudo-residuals, and the model is then updated according to Equation (16). This continues until a specified number of boosting iterations is completed. The goal of GBM is to minimize the empirical loss

L = \sum_{i = 1}^{n} l (y_{i}, F (x_{i})),

(18)

where n denotes the number of observations and $l (\cdot)$ represents the loss associated with each prediction. The benefits of applying GBM to high-dimensional biomedical data include the ability to model nonlinear relationships, automatic feature selection, and good predictive performance. Nevertheless, the learning rate and the number of iterations must be chosen carefully to prevent overfitting [11]. In this study, GBM was applied to the SVD-transformed colon cancer data and evaluated with nested cross-validation. The evaluation measures were classification accuracy, sensitivity, specificity, and AUC.
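The update in Equation (16) and the pseudo-residuals in Equation (17) can be traced on a toy problem. The sketch below makes two simplifying assumptions that are not stated in the text: it uses the squared-error loss l(y, F) = (y - F)^2 / 2, for which the pseudo-residual is simply y - F, and it uses one-split regression stumps on a single feature as weak learners. The data and function names are hypothetical.

```python
def fit_stump(x, residuals):
    """Weak learner h_m: a one-split regression stump on a single feature."""
    overall_mean = sum(residuals) / len(residuals)
    best = None
    for threshold in sorted(set(x))[:-1]:  # candidate split points
        left = [r for v, r in zip(x, residuals) if v <= threshold]
        right = [r for v, r in zip(x, residuals) if v > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    if best is None:  # every x is identical: fall back to a constant learner
        return lambda v: overall_mean
    _, threshold, left_mean, right_mean = best
    return lambda v: left_mean if v <= threshold else right_mean


def gradient_boost(x, y, n_rounds=20, eta=0.1):
    """Stage-wise boosting with squared-error loss (an assumption of this sketch)."""
    f0 = sum(y) / len(y)  # constant initial model F_0
    predictions = [f0] * len(y)
    learners, losses = [], []
    for _ in range(n_rounds):
        # Equation (17) for squared-error loss: r_im = y_i - F_{m-1}(x_i).
        residuals = [y_i - f_i for y_i, f_i in zip(y, predictions)]
        h = fit_stump(x, residuals)
        learners.append(h)
        # Equation (16): F_m(x) = F_{m-1}(x) + eta * h_m(x).
        predictions = [f_i + eta * h(v) for f_i, v in zip(predictions, x)]
        # Equation (18): empirical loss after this round.
        losses.append(sum(0.5 * (y_i - f_i) ** 2 for y_i, f_i in zip(y, predictions)))

    def model(v):
        return f0 + eta * sum(h(v) for h in learners)

    return model, losses


# Hypothetical one-dimensional data with a step between x = 3 and x = 4.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [0.0, 0.0, 0.0, 1.0, 1.0, 1.0]
model, losses = gradient_boost(x, y)
print(losses[0], losses[-1])   # the empirical loss shrinks over the rounds
print(model(1.0), model(6.0))  # predictions move toward 0 and 1 respectively
```

With a small learning rate each round removes only a fraction of the remaining residual, which is why the learning rate and the number of rounds have to be tuned together.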

4.5. Genetic Algorithm (GA)

Genetic Algorithm (GA) is an optimization algorithm that applies concepts of natural selection and biological evolution to solve problems. It was invented by John Holland and became one of the most popular methods of optimization in machine learning, engineering, robotics, and artificial intelligence [12]. The primary purpose of GA is to discover the best solutions through improvement of a population of solutions in each iteration of the process. Contrary to traditional optimization methods, GA operates with a whole population of potential solutions. By using evolutionary operators, GA evolves and improves solutions based on their fitness value defined for each problem. There are three basic operators used in GA:

Selection: selecting the fittest individuals in the current population to reproduce.
Crossover: creating new individuals from two parent solutions.
Mutation: introducing random changes into offspring solutions.

The flowchart of the genetic algorithm used in the experiment can be seen in Figure 2.

The optimization process, which is illustrated in Figure 2, starts with generating the first population of possible solutions. Individuals are evaluated according to their fitness, which is assessed with the help of the fitness function. Based on the evaluation results, the fittest individuals will engage in reproduction; crossover and mutation take place to produce new offspring. The offspring population is again evaluated, and the process continues until a termination condition is met. Finally, the last population will include the fittest solution. Mathematically, the optimization objective can be expressed as,

\max F (x),

where $F (x)$ denotes the fitness function and x represents a candidate solution encoded as a chromosome. The fitness value determines the probability of being chosen for reproduction, so individuals with higher fitness are more likely to pass their genetic information to future generations. Over repeated iterations of selection, crossover, and mutation, the population gradually converges toward an optimal solution [12]. In this work, the GA was applied to tune the machine learning models, that is, to find suitable parameters for high-dimensional colon cancer classification.
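The loop of selection, crossover, and mutation described above can be sketched for a single real-valued parameter. This is a minimal illustration and not the configuration used in the study: tournament selection, arithmetic crossover, Gaussian mutation, elitism, the population size, and the toy fitness function are all assumptions introduced here.

```python
import random


def genetic_algorithm(fitness, bounds, pop_size=20, generations=30,
                      mutation_rate=0.2, seed=0):
    """Maximize fitness(x) over a single real-valued parameter x within bounds."""
    rng = random.Random(seed)
    low, high = bounds
    population = [rng.uniform(low, high) for _ in range(pop_size)]
    history = []  # best fitness observed in each generation

    for _ in range(generations):
        ranked = sorted(population, key=fitness, reverse=True)
        history.append(fitness(ranked[0]))
        next_population = [ranked[0]]  # elitism: carry the best individual over

        def select():
            # Selection: tournament of size 2, the fitter candidate wins.
            a, b = rng.sample(ranked, 2)
            return a if fitness(a) >= fitness(b) else b

        while len(next_population) < pop_size:
            parent_1, parent_2 = select(), select()
            # Crossover: random convex combination of the two parents.
            alpha = rng.random()
            child = alpha * parent_1 + (1.0 - alpha) * parent_2
            # Mutation: occasional Gaussian perturbation, clipped to the bounds.
            if rng.random() < mutation_rate:
                child += rng.gauss(0.0, 0.1 * (high - low))
            next_population.append(min(max(child, low), high))
        population = next_population

    best = max(population, key=fitness)
    history.append(fitness(best))
    return best, history


# Hypothetical fitness with its maximum at x = 3 (not a classifier accuracy).
best_x, history = genetic_algorithm(lambda x: -(x - 3.0) ** 2, (0.0, 10.0))
print(best_x, history[0], history[-1])
```

Because the best individual is always copied into the next generation, the recorded best fitness never decreases. When the fitness is a noisy estimate, such as cross-validated accuracy on a few dozen samples, that same mechanism can lock in a parameter set that only looked good by chance.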

4.6. GA-SVM Algorithm

SVM classification performance depends heavily on the choice of hyperparameters. For this reason, a Genetic Algorithm (GA) was used to tune the SVM hyperparameters [12]. The GA explores the hyperparameter search space in search of the parameter set that maximizes classification performance. The optimization is driven by a fitness function. Candidate solutions are represented as chromosomes that hold hyperparameter values, and selection, crossover, and mutation are applied throughout the evolution process to obtain better solutions. Figure 3 shows the flowchart of the GA-SVM optimization process.

Figure 3 highlights that the optimization process involves five distinct steps:

Definition of Fitness Function: A fitness function is defined to measure the performance of the candidate solutions.
Definition of GA Parameters: The search space of the SVM parameters and the maximum number of GA iterations are set.
Execution of Genetic Algorithm: The optimal parameters of the SVM model are determined using GA operations.
Training of SVM with Optimized Parameters: The SVM model is trained with the determined optimal parameters.
Evaluation of SVM Model: Classification accuracy, sensitivity, specificity, and AUC values are used in the evaluation process.

The fitness function employed during optimization can be expressed as

F (θ) = \mathrm{Accuracy} (θ)

where θ represents the vector of SVM hyperparameters and $F (θ)$ denotes the corresponding classification accuracy. The parameter combination producing the highest fitness value is selected as the optimal solution and subsequently used for constructing the final GA-SVM classification model.

5. Experimental Design and Validation Strategy

5.1. Nested Cross-Validation Strategy

To obtain an unbiased evaluation of the different models, a nested cross-validation approach was employed throughout this work. Nested cross-validation is especially suitable for datasets with high dimensionality and relatively few observations because it reduces the optimistic bias that hyperparameter tuning would otherwise introduce [13]. The procedure consists of two loops: the outer loop estimates model performance, while the inner loop optimizes hyperparameters. In the outer loop, the dataset was split into K folds; at each iteration, one fold was used for testing and the remaining folds were used to build the model. Within each outer training set, an inner cross-validation was conducted to determine the best parameters of the machine learning algorithm under consideration. For the SVM classifier, the penalty coefficient C and the kernel coefficient γ were determined. For the Random Forest algorithm, the number of trees and the number of candidate variables per split were tuned. For GA-SVM and GA-RF, the Genetic Algorithm parameters were optimized together with the classifier parameters. Once the optimal parameters were determined, the model was retrained on the full training set of that outer fold and then tested on the corresponding held-out test fold. This was repeated until every outer fold had served as the test fold. Performance was summarized by the mean and standard deviation of the results across the outer folds. The cross-validation procedure used in this study is illustrated in Figure 4.
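The two-loop structure can be summarized in a short skeleton. The sketch below is a simplified illustration under several assumptions: the numbers of outer and inner folds (5 and 3) are placeholders because the text does not state K, the folds are not stratified by class, and `evaluate` is a hypothetical callback that would fit the preprocessing (for example the SVD) and the classifier on `train_idx` only and return a score on `test_idx`.

```python
import random
import statistics


def k_fold(indices, k, rng):
    """Shuffle the indices and deal them into k disjoint folds."""
    shuffled = list(indices)
    rng.shuffle(shuffled)
    return [shuffled[i::k] for i in range(k)]


def complement(folds, held_out):
    """All indices from every fold except the one at position held_out."""
    return [j for pos, fold in enumerate(folds) if pos != held_out for j in fold]


def nested_cv(n_samples, param_grid, evaluate, outer_k=5, inner_k=3, seed=0):
    """evaluate(train_idx, test_idx, param) must return a score (higher is better)."""
    rng = random.Random(seed)
    outer_folds = k_fold(range(n_samples), outer_k, rng)
    outer_scores = []
    for i, test_idx in enumerate(outer_folds):
        train_idx = complement(outer_folds, i)
        # Inner loop: built from the outer training indices only.
        inner_folds = k_fold(train_idx, inner_k, rng)

        def inner_score(param):
            scores = [evaluate(complement(inner_folds, m), val_idx, param)
                      for m, val_idx in enumerate(inner_folds)]
            return statistics.mean(scores)

        best_param = max(param_grid, key=inner_score)
        # Refit on the whole outer training set, then score the held-out fold once.
        outer_scores.append(evaluate(train_idx, test_idx, best_param))
    return statistics.mean(outer_scores), statistics.stdev(outer_scores), outer_scores


def toy_evaluate(train_idx, test_idx, param):
    # Stand-in score that ignores the data and peaks at param == 1.0.
    return 1.0 - abs(param - 1.0)


mean_score, std_score, fold_scores = nested_cv(62, [0.5, 1.0, 2.0], toy_evaluate)
print(mean_score, std_score, len(fold_scores))
```

The key property is that the outer test indices never reach the inner loop, so the hyperparameters are chosen without any access to the samples used for the final score.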

5.2. Performance Evaluation Metrics

To assess the classification models, several statistical performance measures commonly used for classification algorithms in biomedical research were adopted. The first measure is classification accuracy, defined as

Accuracy = \frac{TP + TN}{TP + TN + FP + FN},

(20)

where TP, TN, FP, and FN denote true positives, true negatives, false positives, and false negatives, respectively. Sensitivity (also known as Recall or True Positive Rate) measures the ability of a classifier to correctly identify positive samples and is given by

Sensitivity = \frac{TP}{TP + FN},

(21)

while specificity evaluates the ability of the classifier to correctly identify negative samples,

Specificity = \frac{TN}{TN + FP} .

(22)

Moreover, the Area Under the Receiver Operating Characteristic Curve (AUC) was included to gain a more complete understanding of classifier performance. A larger AUC reflects a greater degree of separation between the two classes. In addition to these criteria, confusion matrices were used to analyze classification errors and class-wise prediction patterns in detail. Together, nested cross-validation and these performance metrics allow the predictive accuracy of the SVM, RF, LR, GBM, GA-SVM, and GA-RF classifiers to be compared on the colon cancer dataset.
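Equations (20) to (22) and the AUC can be computed directly from label vectors, as in the sketch below. The labels and scores are hypothetical and chosen only to make the arithmetic easy to follow. The AUC is computed from its pairwise definition (the fraction of positive-negative pairs in which the positive sample receives the higher score, with ties counted as one half), and the code assumes that both classes are present.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, TN, FP, FN) for labels coded 1 = tumor, 0 = normal."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn


def classification_metrics(y_true, y_pred):
    """Accuracy, sensitivity, and specificity as in Equations (20) to (22)."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": tp / (tp + fn),  # assumes at least one positive sample
        "specificity": tn / (tn + fp),  # assumes at least one negative sample
    }


def auc(y_true, scores):
    """Pairwise AUC: P(score of a positive > score of a negative), ties count 0.5."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical fold with three tumor and two normal samples.
y_true = [1, 1, 1, 0, 0]
print(classification_metrics(y_true, [1, 1, 0, 0, 1]))  # TP=2, TN=1, FP=1, FN=1
print(auc(y_true, [0.9, 0.8, 0.3, 0.2, 0.4]))           # 5 of 6 pairs ordered correctly

# A classifier that labels every sample as normal: sensitivity 0, specificity 1.
print(classification_metrics(y_true, [0, 0, 0, 0, 0]))
print(auc(y_true, [0.0, 0.0, 0.0, 0.0, 0.0]))           # constant scores give 0.5
```

The last two calls show why specificity must be read together with sensitivity: a model that never predicts the positive class attains perfect specificity while detecting no tumors.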

6. Discussion

This section analyzes the classification results of the examined methods, shown in Table 1, and their efficacy on the high-dimensional colon cancer dataset. Several performance measures were used for this purpose: classification accuracy, sensitivity, specificity, and Area Under the Receiver Operating Characteristic Curve (AUC).

Sensitivity is especially important in medical applications because it measures the ability of a classifier to identify positive (tumor) samples. SVM, RF, and LR achieved perfect sensitivity (1.00), meaning that all cancerous samples were correctly classified. This property is essential for a diagnostic classifier, since missed cases can have serious consequences. In contrast, GA-RF and GBM exhibited poor sensitivity (0.60), meaning that about 40% of the positive samples were misclassified. The worst performer was GA-SVM, with a sensitivity of 0.00.

Specificity represents the classifier's capacity to identify negative samples. GA-RF and GA-SVM achieved perfect specificity (1.00), with no false positives. However, perfect specificity does not make a classifier reliable when its sensitivity is low. The SVM classifier combined a specificity of 0.94 with perfect sensitivity, showing balanced behavior. The logistic regression classifier had the lowest specificity (0.70).

Classification accuracy provides a further basis for comparing the classifiers. SVM achieved the highest accuracy (0.95), followed by RF (0.94), which indicates that these classifiers were able to capture the structure of the data in the multi-dimensional space. LR and GA-SVM achieved lower accuracies (0.73 and 0.67, respectively). The Area Under the ROC Curve (AUC) was also taken into account to examine the discriminative power of each classifier independently of any particular threshold. The higher the AUC, the better the classifier distinguishes between the two classes. Because it combines sensitivity and specificity across thresholds, a high AUC indicates strong overall performance of the classifier under consideration.

One notable observation is that adding GA techniques does not always improve classification performance. Although GA optimization is expected to increase accuracy by tuning parameters, it can become unstable for very high-dimensional data with few samples. The GA-SVM algorithm illustrates this behavior: it achieved perfect specificity but extremely poor sensitivity.

In conclusion, the SVM classifier proved to be the most reliable on the colon cancer dataset examined. It combined perfect sensitivity and the highest classification accuracy with high specificity (0.94), indicating excellent generalization ability. The Random Forest classifier was also very effective and could serve as an alternative. The GA-SVM model, however, showed poor predictive capacity and needs considerable modification before it can be used in practice.

6.1. Failure Analysis of GA-SVM in High Dimensional Data

The experimental results reveal a notable property of the GA-SVM model, as shown in Table 1. Although the Genetic Algorithm was intended to improve the classification performance of SVM, the resulting model did not detect a single positive colon cancer case: its sensitivity was 0.00 and its AUC was 0.50. This behavior may be attributed to several characteristics of high-dimensional genomic datasets.

First, the colon cancer gene expression dataset has a very large number of variables and an unusually small number of observations (p ≫ n). In such situations, the search space explored by the Genetic Algorithm is very large relative to the amount of information provided by the training samples. As a result, the algorithm is more likely to converge to suboptimal solutions that have no practical clinical value.

A second possible cause is the objective function used during optimization. Genetic Algorithms usually search for parameter values that optimize a global criterion such as classification accuracy. With imbalanced classes or small samples, optimizing for global accuracy can unintentionally bias the model towards predicting the majority class. The classifier then loses its ability to detect minority-class cases, and its sensitivity drops. Table 3 supports this explanation: the GA-SVM classifier produced zero true positives and six false negatives, and at the same time zero false positives. These results show that the classifier assigns all observations to the negative class, which yields perfect specificity but no sensitivity.

A third point relates to overfitting. Optimizing the SVM hyperparameters simultaneously with an evolutionary algorithm can make the resulting model highly flexible. Although SVM is recognized as stable and effective on high-dimensional data, adding a genetic algorithm may increase the variance, and hence the instability, of the model, especially when observations are limited. Such problems typically appear in genomics, where numerous attributes are measured for only a few observations.

A comparison with the standard SVM classifier strengthens this finding. Without Genetic Algorithm optimization, the SVM attained an accuracy of 0.95, a sensitivity of 1.00, a specificity of 0.94, and an AUC of 0.97. This suggests that the standard SVM classifier was already close to optimal: the genetic algorithm optimization did not improve its predictive capability and instead degraded it.

In summary, these results indicate that genetic algorithm optimization must be used with caution in high-dimensional genomic classification. Evolutionary optimization methods have proven very useful in various machine learning tasks; however, they do not guarantee improvement, especially in tasks that involve a small number of samples. Further work may examine different fitness functions and class-balanced optimization criteria.
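The effect of the fitness function described above can be illustrated with a small worked example. The sketch below compares plain accuracy with balanced accuracy (the mean of sensitivity and specificity), which is one possible class-balanced criterion and not necessarily the one the authors would adopt. The counts TP = 0, FN = 6, FP = 0 follow the description of Table 3 in the text; the value TN = 12 is an assumption chosen so that the accuracy matches the reported 0.67, and the second "candidate" solution is entirely hypothetical.

```python
def accuracy(tp, fn, fp, tn):
    """Fraction of all samples classified correctly."""
    return (tp + tn) / (tp + fn + fp + tn)


def balanced_accuracy(tp, fn, fp, tn):
    """Mean of sensitivity and specificity; each class counts equally."""
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return (sensitivity + specificity) / 2


# All-negative solution: TP, FN, FP as described for GA-SVM in the text;
# TN = 12 is an ASSUMPTION chosen to reproduce the reported accuracy of 0.67.
all_negative = dict(tp=0, fn=6, fp=0, tn=12)

# HYPOTHETICAL alternative solution that detects 4 of 6 positive samples
# at the cost of 5 false positives (invented for illustration).
candidate = dict(tp=4, fn=2, fp=5, tn=7)

for name, c in [("all-negative", all_negative), ("candidate", candidate)]:
    print(name,
          "accuracy=%.3f" % accuracy(**c),
          "balanced=%.3f" % balanced_accuracy(**c))
```

Under these assumed counts, an accuracy-based fitness ranks the all-negative solution above the candidate, whereas balanced accuracy ranks the candidate higher and scores the all-negative solution at chance level (0.5). This is consistent with the explanation that a global accuracy criterion can steer the search towards the majority class.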

Table 1. Results of the accuracy measures for the algorithms.

Appendix A. Additional Performance Details

The following tables provide detailed classification results, reported for completeness and reference.

Table 2. Misclassification rate measures.

Table 3. Confusion matrix.

Conclusion

In this study, the performance of several machine learning techniques for classifying colon cancer was assessed using a gene expression dataset with many dimensions and few observations. To address the difficulties of the p ≫ n setting, Singular Value Decomposition was applied to reduce the dimensionality of the data before classification. The performance of the SVM, RF, LR, GBM, GA-RF, and GA-SVM classifiers was then analyzed.

The experimental results showed that the Support Vector Machine outperformed the other classification methods. SVM achieved the highest accuracy (0.95) and the highest AUC (0.97), together with perfect sensitivity (1.00) and a specificity of 0.94, indicating that it was effective at diagnosing colon cancer. The Random Forest algorithm was also a good classifier.

The findings further indicate that applying a Genetic Algorithm did not necessarily improve classification performance. Although GA optimization is widely used for tuning model parameters, the GA-SVM model suffered a considerable reduction in performance, with a sensitivity of 0.00 and an AUC of only 0.50. Evolutionary optimization techniques may therefore become unstable when applied to very high-dimensional datasets with a small number of samples.

An important contribution of this paper is the analysis of how classifiers behave on high-dimensional data. Reducing dimensionality with SVD was found to improve computational efficiency without losing the critical discriminatory information contained in the gene expression data. In addition, simple and well-established classification techniques were found to outperform more advanced hybrid techniques when samples are limited.

Further studies could employ other dimensionality reduction methods, class balancing during optimization, and more sophisticated ensemble techniques. Incorporating larger genomic datasets and using neural networks might also yield further insights for developing cancer classifiers. The outcomes of this study add to the existing literature on machine learning for biomedical data analysis and offer practical guidance on choosing effective classification algorithms for genomic data.

References

Gomiasti, F. S.; Warto, W.; Kartikadarma, E.; Gondohanindijo, J.; Setiadi, D. R. I. M. Enhancing lung cancer classification effectiveness through hyperparameter-tuned support vector machine. J. Comput. Theor. Appl. 2024, 1(4), 396–406.
Ankrah, B. N.; Brew, L.; Acquah, J. Multi-class classification of genetic mutation using machine learning models. Comput. J. Math. Stat. Sci. 2024, 3(2), 280–315.
Kolluru, P. K. SVM based dimensionality reduction and classification of hyperspectral data; University of Twente, 2013.
Biau, G.; Scornet, E. A random forest guided tour. Test 2016, 25(2), 197–227.
Yang, D. Singular value decomposition for high dimensional data; University of Pennsylvania, 2012.
Noble, W. S. What is a support vector machine? Nat. Biotechnol. 2006, 24(12), 1565–1567.
Alon, U.; Barkai, N.; Notterman, D. A.; Gish, K.; Ybarra, S.; Mack, D.; Levine, A. J. Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. Proc. Natl. Acad. Sci. 1999, 96(12), 6745–6750.
Breiman, L. Random forests. Mach. Learn. 2001, 45(1), 5–32.
Hussein, S. M. Performance Classification for Lasso Weights with Penalized Logistic Regression for High-Dimensional Data. J. Econ. Adm. Sci. 2024, 30(139), 149–160.
Friedman, J. H.; Hastie, T.; Tibshirani, R. Regularization paths for generalized linear models via coordinate descent. J. Stat. Softw. 2010, 33, 1–22.
Liu, Y.; Wu, S.; Wu, Z.; Zhou, S. Application of gradient boosting machine in satellite-derived bathymetry using Sentinel-2 data for accurate water depth estimation in coastal environments. J. Sea Res. 2024, 201, 102538.
Qasim, O.; Alhafedh, M. A. Improved Classification Performance of Support Vector Machine Technique Using the Genetic Algorithm. Al-Rafidain J. Comput. Sci. Math. 2018, 12(2), 49–60.
Varma, S.; Simon, R. Bias in error estimation when using cross-validation for model selection. BMC Bioinform. 2006, 7(1), 91.

Figure 1. Graphical representation of a linear Support Vector Machine showing the separating hyperplane, support vectors, and maximum-margin boundaries between two classes.

Figure 2. Flowchart illustrating the principal stages of the genetic algorithm, including population initialization, fitness evaluation, selection, crossover, mutation, and termination.

Figure 3. Flowchart illustrating the GA-SVM optimization framework used for hyperparameter selection and model evaluation.

Figure 4. Schematic representation of the nested cross-validation framework used for model training, hyperparameter optimization, and independent performance evaluation.

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Copyright: This open access article is published under a Creative Commons CC BY 4.0 license, which permits free download, distribution, and reuse, provided that the author and preprint are cited in any reuse.
