1. Introduction
Class imbalance has long been a significant challenge in machine learning. Imbalanced problems, in which the classes are unevenly distributed, can bias classifiers and degrade their accuracy. To tackle this issue, researchers have developed various methods, including sampling methods and specialized algorithms [1,2,3,4]. Oversampling techniques such as the extreme anomalous oversampling technique (EXOT) [5], which synthesizes positive instances based on anomaly scores, are one example of such methods. Another strategy involves increasing the number of minority samples while maintaining low misclassification rates for the majority class [6]. Additionally, imbalanced problems can be tackled using re-sampling algorithms such as spread sub-sample, Class Balancer, SMOTE, and Resample [7]. However, the effectiveness of different sampling procedures may vary depending on the specific dataset and the nature of the problem [8,9]. In cancer diagnosis, for instance, imbalanced problems are often encountered, and balancing techniques have yielded significant improvements in classifier performance. It is crucial to utilize customized approaches because different cancer datasets respond differently to various balancing strategies and classifiers [10].
The performance of machine learning classifiers can be negatively affected by imbalanced problems, as it depends strongly on the distribution of the training data [11]. In a study conducted by [12], several techniques for dealing with class imbalance, including the Synthetic Minority Oversampling Technique (SMOTE) in conjunction with Tomek links and Wilson's Edited Nearest Neighbor Rule (ENN), were introduced for balancing datasets. However, a systematic comparison of these techniques with optimal tree forest classifiers is currently lacking in the literature. Similarly, the studies in [13,14] assessed the performance of individual classifiers, such as One-Class SVM, Cost-Sensitive SVM, Optimum Path Forest, and the C4.5 decision tree, in the context of imbalanced data.
The study in [15] investigates ensemble techniques, such as the random forest classifier combined with over-sampling methods, to address class imbalance. Moreover, the study in [16] addresses the effects of class imbalance on forest point cloud data using a weighted random forest classifier. Most studies suggest that oversampling techniques outperform under-sampling techniques when assessed via the area under the ROC curve (AUC). While some research [17,18] suggests that oversampling may not always improve performance, it is noted that oversampling can generate more complex decision trees. Specialized tree construction methods, such as minority condensation trees, have been proposed to enhance the performance of random forests in imbalanced environments [19]. Additionally, incorporating resampling techniques can improve the efficiency of ensemble methods like random forests, although these improvements may not always be statistically significant [20]. Combining algorithmic strategies with resampling techniques has been shown to produce good results [21]. While the choice of splitting index (Gini index or information gain) does not appear to affect accuracy for balanced or imbalanced data [22], other methods have shown enhanced results. For example, balanced bagging, an ensemble technique, improves the geometric mean, AUC score, and accuracy of decision tree classification for imbalanced problems [23]. Additionally, hybrid sampling, which combines techniques such as random oversampling and under-sampling, can create balanced datasets that enhance the performance of C4.5 decision trees on imbalanced problems [24]. However, a thorough investigation into the most effective combination of decision tree and random forest classifiers, hyper-parameter tuning, and resampling techniques in the presence of extreme class imbalance is still lacking [25].
The concept of optimal tree ensembles (OTE) has become popular in machine learning owing to its ability to improve predictive performance and model robustness. Among the various techniques used, out-of-bag (OOB) samples and sub-sampling are crucial for optimizing tree ensembles. Several papers [26,27] explore the effectiveness of using balanced data and optimal tree forest methods, specifically by utilizing OOB estimates and sub-sampling. The authors of [28] introduce an improved random forest algorithm that assigns weights to decision trees based on their OOB errors. OOB errors can be utilized to estimate a model's performance without needing a separate validation set [29]. The use of OOB observations in modified tree selection techniques for the optimal trees ensemble (OTE) is examined in [30], which proposes that OOB observations can improve predictive accuracy in both individual and group performance evaluations. Similarly, [31,32] discuss the Modified Balanced Random Forest (MBRF) algorithm, which employs under-sampling based on clustering techniques; this approach differs from the OOB and sub-sampling methods mentioned earlier. To tackle imbalanced data, [33,34,35] propose a random forest algorithm based on Generative Adversarial Networks (GANs), presenting a distinct alternative to OOB and sub-sampling. Furthermore, [36] combines the skew-insensitivity of Hellinger distance decision trees with the diversity of random forest and rotation forest, providing another perspective on optimizing tree-based methods for imbalanced data. In summary, the literature suggests that OOB estimates and sub-sampling are valuable techniques in the context of balanced data and optimal tree forests and can enhance the performance of tree-based models, especially on imbalanced datasets.
Overall, extreme class imbalance can be addressed effectively through the use of ensemble tree-based classifiers such as random forests, potentially enhanced by resampling methods or specialized tree construction techniques. The best approach may differ depending on the specific dataset and context, but a combination of algorithmic and data-level interventions appears to be a promising strategy for improving classifier performance in the presence of class imbalance.
Random forest [37,38] comes with an upper bound on the overall prediction error as a function of the accuracy and diversity of the base tree models, i.e.,

Ê(forest) ≤ ρ̄ · Ê(tree),

where Ê(forest) is the total prediction error of the forest, Ê(tree) is the prediction error estimate of any individual tree in the forest, and ρ̄ is the weighted correlation between the residuals from two independent classification trees, expressed as the mean correlation over the entire random forest. A tree ensemble that ensures both the accuracy and the diversity of its base models is therefore expected to yield promising results. This study proposes an ensemble method that uses these concepts to learn effectively from datasets with class-imbalance problems. First, the given training data are balanced by generating new synthetic observations for the minority class; datasets with extreme class imbalance, obtained from several publicly available sources, are considered throughout this study. Once a dataset is balanced, classification trees are grown on bootstrap/sub-samples of the data, the prediction accuracy of each tree is estimated, and the top-performing trees are selected for the final ensemble. Based on this notion, the study attempts to increase the accuracy of the individual tree models in the forest, in addition to randomizing their construction: accuracy is increased by balancing the given training data and by selecting models based on their individual prediction performance.
The main contributions of this paper are:
Mitigating the impact of extreme class imbalance on the random forest ensemble.
Generating synthetic data from the minority class observations during the training of the tree forest.
Exploring the concept of tree selection in combination with data balancing to achieve an overall improved ensemble.
The remainder of this article is organized as follows. The proposed methods are given in Section 2, and the experiments and results are described in Section 3. Section 4 presents the simulations, covering both simulated scenarios for the proposed methods. The final section concludes by summarizing the main findings and providing suggestions for future work.
2. Materials and Methods
2.1. Balancing the Training Data
Let Υ = (X, Y) represent the given training dataset, where X is an n × p feature matrix, with n the number of samples and p the number of features. Y is a binary response vector of length n with elements in {0, 1}, where 1 indicates membership in one class and 0 indicates membership in the other.
Assume that the given training data come with a severely skewed class distribution, i.e., N_min ≪ N_maj, where N_maj and N_min denote the numbers of majority- and minority-class observations. To balance the training data before building the tree ensembles, the following procedure is considered. The dataset Υ is balanced by adding d = N_maj - N_min synthetic observations, generated from d bootstrap samples drawn from the minority class, each having the size of the minority class, i.e., N_min. Let the v-th bootstrap sample be B_v, where v = 1, 2, …, d; each bootstrap sample has N_min observations and p features and may contain both continuous and categorical features. Each bootstrap sample is employed to add one row to the data: the mean x̄_r of each numeric column and the mode x̃_r of each categorical column are computed, r = 1, 2, …, p, and the resulting values are concatenated in the original order of the features. In the generated matrix, each row corresponds to one bootstrapped sample, and the last column contains the class label of the minority class. In this way, a new set of synthetic minority observations is generated, denoted Υ_g.

As shown in Equation (3), the given training data are combined with the generated data (Υ_g) to create a balanced dataset (Υ_b):

Υ_b = Υ ∪ Υ_g.   (3)
For each class, there is then an equal number of data points in Υ_b, as shown in Equation (4):

N_min + d = N_maj.   (4)

In order to grow optimal trees, the balanced data (Υ_b) are used instead of the original data (Υ). Two methods are used to grow and select optimal trees from these balanced data. The first method uses the corresponding OOB observations to assess each tree individually, and the trees are ranked based on this individual performance. The second method grows trees on random subsets of the training data and assesses them on the held-out observations. In this study, the balanced data (Υ_b) are therefore used with two enhanced tree ensembles, one based on OOB observations and one based on sub-sample observations.
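To make the procedure concrete, the following is a minimal R sketch of this balancing step. It assumes the response column is named Y and that the minority label is supplied by the caller; the function name balance_data is ours, not from the paper.

```r
# Minimal sketch of the balancing step in Section 2.1 (assumes N_min < N_maj
# and a data frame whose response column is named Y).
balance_data <- function(data, minority) {
  min_rows <- data[data$Y == minority, , drop = FALSE]
  n_min    <- nrow(min_rows)
  d        <- (nrow(data) - n_min) - n_min              # d = N_maj - N_min
  new_rows <- do.call(rbind, lapply(seq_len(d), function(v) {
    boot  <- min_rows[sample(n_min, n_min, replace = TRUE), , drop = FALSE]
    synth <- lapply(boot, function(col) {
      if (is.numeric(col)) mean(col)                      # mean of a numeric feature
      else names(sort(table(col), decreasing = TRUE))[1]  # mode of a categorical one
    })
    as.data.frame(synth, stringsAsFactors = FALSE)
  }))
  new_rows$Y <- minority              # synthetic rows carry the minority label
  rbind(data, new_rows)               # balanced data, Equation (3)
}
```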
2.2. Enhanced Tree Ensembles via Out-of-Bag (OOB) Observations
The first approach, referred to here as ETE(OOB), uses OOB observations to assess trees grown on the balanced training data Υ_b. From the balanced data Υ_b, a total of T classification trees are grown on bootstrap samples B_t, t = 1, 2, 3, …, T, where T is the number of bootstrap samples used in building the trees. Let O_t be the OOB observations corresponding to bootstrap sample B_t, i.e., the observations of Υ_b not included in B_t, and let G(B_t) denote the classification tree grown on B_t. Using Equation (6), the OOB error Err_t of the tree grown from sample B_t is computed as

Err_t = (1 / |O_t|) Σ_{(x, y) ∈ O_t} I(y ≠ ŷ),   (6)

where y is the true class label of an OOB observation, ŷ is the corresponding value predicted by the classification tree G(B_t), and I(·) is an indicator function that takes the value 1 if y does not equal ŷ and 0 otherwise.
After the desired number of classification trees G(B_t) has been grown, the trees are arranged in ascending order based on their prediction errors Err_t on the OOB observations, and the top H trees, those with the lowest errors, are selected. Denoting the error of the top-ranked tree by Err_(1), that of the second-ranked tree by Err_(2), and so on, the ordering is shown in Equation (7):

Err_(1) ≤ Err_(2) ≤ ⋯ ≤ Err_(T).   (7)

A certain number of trees from the above-ranked trees is selected for the final ensemble, which is then used to predict the new/test data.
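A compact R sketch of this procedure is given below, assuming the balanced data frame data_b has a factor response Y; the function name ete_oob and its defaults are ours. Majority-vote prediction with the selected trees is sketched after Algorithm 1.

```r
library(rpart)

# Sketch of the OOB-based ensemble: grow T trees on bootstrap samples of the
# balanced data, score each tree on its own out-of-bag rows (Equation (6)),
# and keep the H trees with the lowest OOB error (Equation (7)).
ete_oob <- function(data_b, T = 1500, H = round(0.2 * T)) {
  n <- nrow(data_b)
  fits <- lapply(seq_len(T), function(t) {
    idx  <- sample(n, n, replace = TRUE)              # bootstrap sample B_t
    oob  <- setdiff(seq_len(n), idx)                  # OOB observations O_t
    tree <- rpart(Y ~ ., data = data_b[idx, ], method = "class")
    pred <- predict(tree, newdata = data_b[oob, ], type = "class")
    list(tree = tree, err = mean(pred != data_b$Y[oob]))
  })
  errs <- vapply(fits, function(f) f$err, numeric(1))
  fits[order(errs)[seq_len(H)]]                       # the H best trees
}
```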
2.3. Enhanced Tree Ensembles Using Sub-Sample (SS) Observations
The second proposed method, referred to here as ETE(SS), uses a sub-sample-based approach in which the trees are grown on sub-samples drawn from the balanced data (Υ_b). Unlike the OOB approach, the observations left out of each sample act as test data for evaluating the predictive performance of the corresponding tree. Let S_t, t = 1, 2, 3, …, T, be random samples of size m < n drawn without replacement, and let S̄_t represent the corresponding remaining subsets of n - m observations. G(S_t) is the classification tree built on S_t, and its error on S̄_t is denoted Err_t; these errors are estimated on S̄_t, t = 1, 2, …, T, for all T classification trees. As before, the trees are ranked in ascending order of these errors, i.e., Err_(1) ≤ Err_(2) ≤ ⋯ ≤ Err_(T), and the remaining procedure is the same as for ETE(OOB). This method might be useful in small-sample situations where one wants to retain a large amount of training data to build the trees. It can also be tuned by selecting optimal values for the initial number of trees grown and the number of trees selected for the final ensemble. The pseudo-code and flow chart of the proposed ensembles are provided in Algorithm 1 and Figure 1.
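A hedged R sketch of the sub-sample variant follows; it differs from ete_oob only in how each training set is drawn and scored, and the name ete_ss is again ours.

```r
# Sub-sample variant: trees are grown on samples drawn without replacement
# (here 90% of the rows, as in Section 3), and the held-out rows score each tree.
ete_ss <- function(data_b, T = 1500, H = round(0.2 * T), frac = 0.9) {
  n <- nrow(data_b)
  fits <- lapply(seq_len(T), function(t) {
    idx  <- sample(n, size = floor(frac * n))         # sub-sample S_t, size m < n
    out  <- setdiff(seq_len(n), idx)                  # held-out rows, size n - m
    tree <- rpart::rpart(Y ~ ., data = data_b[idx, ], method = "class")
    pred <- predict(tree, newdata = data_b[out, ], type = "class")
    list(tree = tree, err = mean(pred != data_b$Y[out]))
  })
  errs <- vapply(fits, function(f) f$err, numeric(1))
  fits[order(errs)[seq_len(H)]]                       # keep the H best trees
}
```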
Algorithm 1: Pseudo-code of the proposed method.
1: Input: training data Υ consisting of n observations and p variables;
2: N_maj ← number of majority class observations;
3: N_min ← number of minority class observations;
4: Υ_b ← balanced data;
5: B_v ← bootstrap sample;
6: if N_min < N_maj then d ← N_maj - N_min;
7: for v = 1 : d do
8:   Using the training data (Υ), take a bootstrap sample B_v of size N_min from the minority class;
9:   If a feature is continuous, find its mean;
10:  If a feature is categorical, find its mode;
11:  Concatenate the values from Steps 9 and 10 to form a new row arranged according to the original order of the features;
12:  Add the new row to the generated data (Υ_g);
13: end for
14: Combine the training data (Υ) with the generated data (Υ_g) to obtain the balanced data (Υ_b);
15: for t = 1 : T do
16:  Take a bootstrap/sub-sample (B_t or S_t) from the balanced training data (Υ_b);
17:  Store the OOB/out-of-sample observations;
18:  Grow a classification tree (G(B_t) or G(S_t)) on the bootstrap/sub-sample;
19:  Use the OOB/out-of-sample observations to estimate the prediction error (Err_t);
20: end for
21: Arrange the trees in ascending order with respect to their OOB/out-of-sample errors;
22: Select the top-ranked H trees as the final ensemble.
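Once the top-ranked trees are selected, a new observation is classified by majority vote over the selected trees; a small sketch (the helper name is ours) is:

```r
# Majority-vote prediction with the selected trees; `fits` is the list
# returned by ete_oob() or ete_ss() above.
predict_ensemble <- function(fits, newdata) {
  votes <- sapply(fits, function(f)
    as.character(predict(f$tree, newdata = newdata, type = "class")))
  apply(votes, 1, function(v) names(which.max(table(v))))
}
```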
3. Experiments and Results
This study evaluated the performance of the proposed methods, ETE(OOB) and ETE(SS), on extremely imbalanced classification datasets using a set of benchmark problems. The effectiveness of these methods was compared with several state-of-the-art approaches, including the optimal tree ensemble (OTE) [39,40], SMOTE random forest (RF(smote)) [41,42], over-sampling random forest (RF(over)), under-sampling random forest (RF(under)) [43], k-nearest neighbors (k-NN) [44], support vector machine (SVM) [45], classification tree, and artificial neural network (ANN) [46].
For ETE(OOB), 1500 trees are grown on bootstrap samples from the training data, and the OOB observations are used for tree assessment. Similarly, for ETE(SS), 1500 trees are grown on random samples drawn without replacement; each sample consists of 90% of the training data, with the remaining 10% used for assessing the individual trees. The parameter H is set to 20% of T. Various hyperparameters of the individual tree model are tuned using the tune.rpart function from the R package e1071 [47]; the optimal tree depth and number of splits are determined by testing the values 5, 10, 15, 20, and 25. To optimize the random forest models, the tune.randomForest function from the e1071 package is used to adjust the key hyperparameters, namely node size (nodesize), the number of trees (ntree), and the size of the feature subset (mtry). The values evaluated are nodesize ∈ {10, 15, 20, 25, 30}, ntree ∈ {1000, 1500, 2000}, and mtry = sqrt(p). The support vector machine (SVM) uses automatic sigma estimation from the kernlab package [48], with default settings for the other parameters. For the k-nearest neighbors classifier (k-NN), the tune.knn function from the e1071 package is used to select the optimal value of k from the range 3 to 15. To ensure consistent comparisons, the same training and testing data were used across all models: ETE(OOB), ETE(SS), OTE, RF(smote), RF(over), RF(under), k-NN, SVM, tree, and ANN. Model performance was evaluated using two metrics: classification error rate and precision. All experiments were conducted in R version 4.3.1 on a 1.30 GHz Intel Core i5-1235U with 8 GB of memory, running a 64-bit operating system.
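The tuning calls described above can be sketched as follows, assuming a balanced training frame data_b with factor response Y (and numeric features for k-NN); the grids are copied from the text, while the object names are ours.

```r
library(e1071)   # the rpart and randomForest packages must also be installed

tree_tune <- tune.rpart(Y ~ ., data = data_b,
                        maxdepth = c(5, 10, 15, 20, 25),
                        minsplit = c(5, 10, 15, 20, 25))

rf_tune <- tune.randomForest(Y ~ ., data = data_b,
                             nodesize = c(10, 15, 20, 25, 30),
                             ntree    = c(1000, 1500, 2000),
                             mtry     = floor(sqrt(ncol(data_b) - 1)))  # sqrt(p)

# tune.knn takes a numeric feature matrix and the class vector directly.
knn_tune <- tune.knn(x = data_b[setdiff(names(data_b), "Y")],
                     y = data_b$Y, k = 3:15)
```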
An overview of the datasets is provided in Table 1. The first column displays the numbers assigned to the datasets, and the second column shows their names. The third and fourth columns present the number of instances/observations (n) and features (p), respectively. The fifth column provides the class distribution, the sixth column shows the imbalance ratio (IR), and the last column gives the data source. The results from the various training and testing scenarios are presented in Table 2, Table 3, Table 4 and Table 5.
The training and testing datasets were consistent across all methods under consideration. The results, summarized in Table 2, Table 3, Table 4 and Table 5, along with the box plots in Figure 2, Figure 3, Figure 4, Figure 5, Figure 6, Figure 7, Figure 8, Figure 9 and Figure 10, provide a comprehensive comparison of the proposed methods, ETE(OOB) and ETE(SS), against the other techniques in terms of classification error rate and precision.
Table 2 highlights the classification error rates of the proposed methods in comparison with the other state-of-the-art approaches; the results for ETE(OOB) and ETE(SS) are listed in the second and third columns. The data clearly indicate that the proposed methods consistently outperform the other methods in terms of classification error rate, achieving the lowest error rates across most datasets, with errors as low as 0 to 0.0005 on several of them. Generally, ETE(OOB) performs best on some of the datasets, while ETE(SS) performs better on others, and each of the two methods fell short on one dataset. In contrast, only one of the competing random forest variants achieved a low error rate on a single dataset, while the remaining competitors, including k-NN, SVM, tree, and ANN, did not perform best on any of the datasets.
Table 3 offers a detailed comparison of the proposed methods with the other state-of-the-art techniques across the datasets, focusing on precision; the proposed methods delivered better results on the majority of the datasets, achieving precision values ranging from 93.36% to 100%, which underscores their effectiveness. The competing methods provided better precision on only a few of the datasets. The box plots in Figure 2, Figure 3, Figure 4, Figure 5, Figure 6, Figure 7, Figure 8, Figure 9 and Figure 10 visually confirm the superior performance of ETE(OOB) and ETE(SS) in terms of classification error rate and precision. Furthermore, the same conclusions can be drawn from Table 4 and Table 5, which report the classification error rate and precision obtained with 90% training and 10% testing data.
A multi-line plot demonstrates the consistent improvement in performance, i.e., the reduction in error rates, achieved by the proposed methods, ETE(OOB) and ETE(SS), as the number of trees varies. These figures analyze the impact of the number of trees (H) retained in the ensemble on the error rate: the x-axis represents H as a percentage (the proportion of grown trees selected for inclusion in the ensemble), while the y-axis shows the corresponding classification error rate. Figure 11 illustrates that increasing the number of selected trees consistently reduces the error rate on several of the datasets. Similarly, Figure 12 shows that the error rate decreases as the number of trees in the ensemble increases. Both figures illustrate the relationship between the percentage of trees selected (10%, 30%, 50%, 70%, 90%) for the ensemble (x-axis) and the resulting error rates (y-axis), and they collectively emphasize the importance of optimizing the number of trees in an ensemble to achieve better classification performance.
4. Simulation
In this section, we present two scenarios for creating synthetic datasets for simulation. The first scenario generates a synthetic imbalanced dataset, while the second generates a synthetic balanced dataset. We then evaluate the performance of the proposed methods, ETE(OOB) and ETE(SS), on these datasets. The first scenario is intended to demonstrate when the proposed methods are beneficial, while the second showcases a data-generating environment in which they may not be appropriate. In total, 10,000 instances were synthetically generated from 19 variables. Each variable produces observations following a normal distribution with its own mean and variance, except for the eighth variable, which follows a multinomial distribution with four categories. The binary response Y is generated from the variables using a logit-type function, as shown in Equation (9):

P(Y = 1 | x) = 1 / (1 + exp(-(β_0 + Σ_j β_j x_j))).   (9)

For the first scenario, the ratio of majority-class to minority-class observations was 5.67, with 8500 instances in the majority class and 1500 in the minority class; the parameter values in Equation (9) were fixed accordingly.
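A hypothetical R sketch of this data-generating setup follows. The coefficients and intercept below are illustrative placeholders, not the values used in the paper; the intercept can be shifted to control the degree of imbalance.

```r
set.seed(1)

n <- 10000
# 19 variables: normal with varying means/variances, except x8 (4 categories).
X <- as.data.frame(lapply(1:19, function(j) rnorm(n, mean = j, sd = sqrt(j))))
names(X) <- paste0("x", 1:19)
X$x8 <- factor(sample(1:4, n, replace = TRUE))

num  <- as.matrix(X[paste0("x", setdiff(1:19, 8))])  # the 18 numeric variables
beta <- runif(18, -0.2, 0.2)      # placeholder coefficients (not from the paper)
eta  <- drop(num %*% beta) - 2    # placeholder intercept; shift to tune imbalance
prob <- 1 / (1 + exp(-eta))       # logit-type link, Equation (9)
X$Y  <- factor(rbinom(n, 1, prob))
table(X$Y)                        # check the resulting class distribution
```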
This setup was used to generate 10,000 observations, and all the methods presented in this study were applied using the same experimental setup as for the benchmark datasets. The second model was constructed similarly, except that its class-wise distribution was roughly 50/50 (a class ratio of 1.02), indicating that the data were balanced. The only difference between the two models is therefore that the former has an imbalanced class distribution whereas the latter does not. A total of 100 realizations were performed. Bar plots of the results on the simulated datasets are shown in Figure 13. They indicate that the proposed methods, ETE(OOB) and ETE(SS), attain lower classification error rates than the other state-of-the-art methods, especially on the imbalanced dataset. These findings suggest that ETE(OOB) and ETE(SS) are more effective than the other approaches in the presence of a severe class imbalance problem in the data. In the first scenario, the proposed methods outperformed their rivals; in the second scenario, however, they did not perform as well: in the absence of a class imbalance problem, the balancing step offers no advantage, leading to sub-optimal outcomes.
5. Conclusions
The performance of machine learning models can be negatively affected by a significant imbalance in the class distribution, with one class greatly outnumbering the other. To address this issue, balancing the data by creating additional observations for the minority class, typically the class of interest, in combination with a tree ensemble, has led to improved prediction accuracy. The proposed methods, together with synthetic data generation for balancing, effectively tackle the challenges presented by highly imbalanced classification problems. This study introduced two methods, the OOB-based and sub-sample-based enhanced tree ensembles ETE(OOB) and ETE(SS), that successfully address the issue of imbalanced data. These methods were shown to outperform traditional machine learning approaches, such as OTE, SMOTE random forest (RF(smote)), over-sampling random forest (RF(over)), under-sampling random forest (RF(under)), k-nearest neighbors (k-NN), support vector machine (SVM), classification tree, and artificial neural network (ANN), across multiple metrics in analyses of various benchmark and simulated datasets.
For future work in this direction, methods such as SMOTE and the Adaptive Synthetic Minority Oversampling Technique (ADASYN) could be used to create synthetic data tailored to the dataset, which may enhance model performance by better balancing the minority class. New features that are particularly informative for the minority class could also be engineered to provide the model with more relevant information. Feature selection techniques could be applied to identify and use the features most relevant for classification in imbalanced scenarios, and cost-sensitive learning methods could be incorporated to shift the model's focus towards the minority class by assigning higher costs to errors made on this class. For better results, these approaches could be combined with the enhanced tree ensembles.
Our method is designed specifically for binary classification problems. It should also be noted that the computational cost of the method grows rapidly with the size of the data. This challenge can be addressed by using parallel computing techniques, which distribute computational tasks across multiple processors, improving efficiency and reducing processing time. Implementing parallel computing could greatly improve the scalability of our method, making it more suitable for larger datasets.
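As a sketch of how the tree-growing loop could be parallelized in R (fork-based, so on Unix-alike systems; the helper mirrors one iteration of ete_oob() above):

```r
library(parallel)
library(rpart)

grow_tree <- function(t, data_b) {    # one bootstrap tree plus its OOB error
  n    <- nrow(data_b)
  idx  <- sample(n, n, replace = TRUE)
  oob  <- setdiff(seq_len(n), idx)
  tree <- rpart(Y ~ ., data = data_b[idx, ], method = "class")
  err  <- mean(predict(tree, data_b[oob, ], type = "class") != data_b$Y[oob])
  list(tree = tree, err = err)
}

fits <- mclapply(seq_len(1500), grow_tree, data_b = data_b,
                 mc.cores = max(1L, detectCores() - 1L))
```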
Figure 1. Flow chart of the proposed methods.
Figure 2. Box plots comparing ETE(OOB) and ETE(SS) with other state-of-the-art methods, displaying the classification error rate for a range of datasets using 70% training and 30% testing.
Figure 3. Box plots comparing ETE(OOB) and ETE(SS) with other state-of-the-art methods, displaying the classification error rate for a range of datasets using 70% training and 30% testing.
Figure 4. Box plots comparing ETE(OOB) and ETE(SS) with other state-of-the-art methods, displaying the classification error rate for a range of datasets using 70% training and 30% testing.
Figure 5. Box plots comparing ETE(OOB) and ETE(SS) with other state-of-the-art methods, displaying the classification error rate for a range of datasets using 70% training and 30% testing.
Figure 6. Box plots comparing ETE(OOB) and ETE(SS) with other state-of-the-art methods, displaying the classification error rate for a range of datasets using 70% training and 30% testing.
Figure 7. Box plots comparing ETE(OOB) and ETE(SS) with other state-of-the-art methods, displaying the precision for a range of datasets using 70% training and 30% testing.
Figure 8. Box plots comparing ETE(OOB) and ETE(SS) with other state-of-the-art methods, displaying the precision for a range of datasets using 70% training and 30% testing.
Figure 9. Box plots comparing ETE(OOB) and ETE(SS) with other state-of-the-art methods, displaying the precision for a range of datasets using 70% training and 30% testing.
Figure 10. Box plots comparing ETE(OOB) and ETE(SS) with other state-of-the-art methods, displaying the precision for a range of datasets using 70% training and 30% testing.
Figure 11. Box plots comparing ETE(OOB) and ETE(SS) with other state-of-the-art methods, displaying the precision for a range of datasets using 70% training and 30% testing.
Figure 12. A multi-line plot examining the effect on the error rate of varying the number of trees (H) selected for the proposed ensembles.
Figure 13. A multi-line plot examining the effect on the error rate of varying the number of trees (H) selected for the proposed ensembles.
Figure 14. Bar plots displaying the classification error rate and precision of the proposed methods, ETE(OOB) and ETE(SS), on both simulated datasets, including comparisons with other state-of-the-art techniques.
Table 1. Summary of the datasets with class-imbalance problems.
| No | Dataset (DS) | Instances | Features | Class-based Distribution | IR | Source |
|----|--------------|-----------|----------|--------------------------|----|--------|
| 1 | Breast Cancer | 569 | 31 | 357/212 | 1.6839:1 | https://www.kaggle.com/datasets/utkarshx27/breast-cancer-wisconsin-diagnostic-dataset |
| 2 | Credit Card | 284807 | 30 | 284807/492 | 578.876:1 | https://www.kaggle.com/datasets/mlg-ulb/creditcardfraud |
| 3 | Drug Classification | 200 | 6 | 145/54 | 2.685:1 | https://openml.org/search?type=data&status=active&id=43382 |
| 4 | Eeg eye | 5856 | 14 | 5708/148 | 38.567:1 | https://openml.org/search?type=data&status=active&sort=runs&id=1471 |
| 5 | Glass Classification | 213 | 9 | 144/69 | 2.086:1 | https://openml.org/search?type=data&status=active&id=43750 |
| 6 | Pc4 | 1339 | 37 | 1279/60 | 21.316:1 | https://openml.org/search?type=data&status=active&sort=runs&id=1049 |
| 7 | Madelon | 1358 | 500 | 1300/58 | 22.413:1 | https://openml.org/search?type=data&status=active&sort=runs&id=1485 |
| 8 | Turing binary | 6384 | 20 | 6260/124 | 50.483:1 | https://www.openml.org/search?type=data&status=active&id=44269 |
| 9 | KDD | 2566 | 35 | 2515/51 | 49.313:1 | https://openml.org/search?type=data&status=active&id=45075 |
| 10 | Liver disorder | 220 | 6 | 200/20 | 10:1 | https://openml.org/search?type=data&status=active&id=8 |
| 11 | Wine | 143 | 13 | 106/37 | 2.864:1 | https://archive.ics.uci.edu/dataset/109/wine |
| 12 | Soy bean | 167 | 8 | 160/7 | 22.857:1 | https://archive.ics.uci.edu/dataset/913/forty+soybean+cultivars+from+subsequent+harvests |
| 13 | Ionosphere | 350 | 32 | 312/38 | 8.210:1 | https://archive.ics.uci.edu/dataset/52/ionosphere |
| 14 | Room Occupancy | 8407 | 14 | 8228/179 | 45.966:1 | https://archive.ics.uci.edu/dataset/864/room+occupancy+estimation |
| 15 | Harth | 7269 | 7 | 6771/498 | 13.596:1 | https://archive.ics.uci.edu/dataset/779/harth |
| 16 | Rocket League | 3015 | 6 | 2830/185 | 15.297:1 | https://archive.ics.uci.edu/dataset/858/rocket+league+skillshots |
| 17 | Sirtuin6 | 54 | 6 | 50/4 | 12.5:1 | https://archive.ics.uci.edu/dataset/748/sirtuin6+small+molecules-1 |
| 18 | Toxicity | 123 | 12 | 115/8 | 14.375:1 | https://archive.ics.uci.edu/dataset/728/toxicity-2 |
| 19 | Dry bean | 4357 | 16 | 4246/129 | 32.914:1 | https://archive.ics.uci.edu/dataset/602/dry+bean+dataset |
| 20 | Kc2 | 520 | 21 | 414/106 | 3.905:1 | https://openml.org/search?type=data&status=active&sort=runs&id=1063 |
Table 2. Comparison of the proposed methods, ETE(OOB) and ETE(SS), with state-of-the-art methods using 70% training and 30% testing in terms of classification error rate.
| Dataset | ETE(OOB) | ETE(SS) | OTE | RF(smote) | RF(over) | RF(under) | k-NN | SVM | ANN | Tree |
|---------|----------|---------|-----|-----------|----------|-----------|------|-----|-----|------|
| Breast Cancer | 0.0015 | 0.0311 | 0.0427 | 0.0264 | 0.0166 | 0.0422 | 0.0851 | 0.0486 | 0.3625 | 0.0723 |
| Credit Card | 0.0005 | 0.0005 | 0.0010 | 0.0046 | 0.0006 | 0.0083 | 0.1395 | 0.0071 | 0.0016 | 0.0014 |
| Drug Classification | 0.0038 | 0.0461 | 0.0624 | 0.1875 | 0.0296 | 0.0650 | 0.3542 | 0.1771 | 0.4998 | 0.0697 |
| Eeg eye | 0.0035 | 0 | 0.0063 | 0.0942 | 0.0521 | 0.0872 | 0.0553 | 0.9827 | 0.1094 | 0.2947 |
| Glass Classification | 0.0914 | 0.0974 | 0.1330 | 0.0405 | 0.0710 | 0.1518 | 0.2194 | 0.9839 | 0.2879 | 0.1993 |
| Pc4 | 0.0384 | 0.0213 | 0.0945 | 0.0411 | 0.0421 | 0.1519 | 0.1392 | 0.1006 | 0.9555 | 0.1120 |
| Madelon | 0.0222 | 0.0218 | 0.2675 | 0.3459 | 0.1397 | 0.3525 | 0.0416 | 0.0420 | 0.0435 | 0.2260 |
| Turing binary | 0.0102 | 0.0214 | 0.1321 | 0.3173 | 0.0437 | 0.1882 | 0.1392 | 0.1541 | 0.0271 | 0.1303 |
| KDD | 0.0099 | 0.0101 | 0.0355 | 0.1394 | 0.0098 | 0.1191 | 0.0198 | 0.0226 | 0.9804 | 0.0185 |
| Liver disorder | 0.0400 | 0.0453 | 0.0864 | 0.0469 | 0.0512 | 0.1153 | 0.0862 | 0.0915 | 0.7661 | 0.1126 |
| Wine | 0.0227 | 0.0173 | 0.0796 | 0.1029 | 0.0680 | 0.0324 | 0.1605 | 0.0305 | 0.2574 | 0.1235 |
| Soy bean | 0.0093 | 0.0100 | 0.0364 | 0.0706 | 0.0308 | 0.1229 | 0.0416 | 0.0420 | 0.0428 | 0.0402 |
| Ionosphere | 0.0429 | 0.0172 | 0.1168 | 0.2830 | 0.0985 | 0.1531 | 0.1107 | 0.0898 | 0.8913 | 0.1093 |
| Room Occupancy | 0 | 0 | 0.0002 | 0.0034 | 0.0003 | 0.0020 | 0.0001 | 0.00001 | 0.0211 | 0.0002 |
| Harth | 0.0121 | 0.0119 | 0.0438 | 0.2698 | 0.0379 | 0.1214 | 0.0253 | 0.0682 | 0.9318 | 0.0303 |
| Rocket League | 0.0334 | 0.0331 | 0.1032 | 0.4525 | 0.0914 | 0.1158 | 0.0622 | 0.0616 | 0.0614 | 0.0613 |
| Sirtuin6 | 0.0516 | 0.0516 | 0.2433 | 0.2680 | 0.1200 | 0.3275 | 0.0700 | 0.0965 | 0.6519 | 0.0847 |
| Toxicity | 0.0339 | 0.0352 | 0.1241 | 0.3828 | 0.1392 | 0.2750 | 0.0645 | 0.0570 | 0.0694 | 0.0692 |
| Dry bean | 0.0030 | 0.0035 | 0.0102 | 0.0500 | 0.0075 | 0.0134 | 0.0301 | 0.0103 | 0.0295 | 0.0091 |
| Kc2 | 0.0029 | 0.0026 | 0.0639 | 0.0459 | 0.0384 | 0.0236 | 0.0519 | 0.0353 | 0.0250 | 0.0277 |
Table 3. Comparison of the proposed methods, ETE(OOB) and ETE(SS), with state-of-the-art methods using 70% training and 30% testing in terms of precision.
| Dataset | ETE(OOB) | ETE(SS) | OTE | RF(smote) | RF(over) | RF(under) | k-NN | SVM | ANN | Tree |
|---------|----------|---------|-----|-----------|----------|-----------|------|-----|-----|------|
| Breast Cancer | 1.0000 | 0.9629 | 0.9364 | 0.9670 | 0.9795 | 0.9517 | 0.9696 | 0.9353 | 0.7200 | 0.9029 |
| Credit Card | 0.9993 | 0.9992 | 0.6503 | 0.9923 | 0.9241 | 0.9505 | 0.9860 | 0.9889 | 0.6123 | 0.7284 |
| Drug Classification | 1.0000 | 0.9378 | 0.8655 | 0.8277 | 0.9714 | 0.9361 | 0.6401 | 0.8328 | 0.2006 | 0.9031 |
| Eeg eye | 0.9929 | 1 | 0.7578 | 0.9216 | 0.9576 | 0.9049 | 0.9327 | 0.0700 | 0.6646 | 0.7257 |
| Glass Classification | 0.9153 | 0.9022 | 0.8050 | 0.9739 | 0.8984 | 0.8166 | 0.6440 | 0.1300 | 0.2665 | 0.7204 |
| Pc4 | 0.9981 | 0.9971 | 0.9755 | 0.9904 | 0.9885 | 0.9059 | 0.8811 | 0.9207 | 0 | 0.9287 |
| Madelon | 0.9554 | 0.9563 | 0.4063 | 0.6586 | 0.8614 | 0.6449 | 0.1750 | 0.0192 | 0 | 0.7171 |
| Turing binary | 0.9796 | 0.9799 | 0.0162 | 0.6648 | 0.9757 | 0.4800 | 0.2519 | 0.2176 | 0.0541 | 0.0140 |
| KDD | 1 | 1 | 0.9987 | 0.8649 | 0.9896 | 0.2506 | 0.9802 | 0.9794 | 0 | 0.9830 |
| Liver disorder | 0.9860 | 0.9830 | 0.9835 | 0.9713 | 0.9835 | 0.9134 | 0.9206 | 0.9135 | 0.1604 | 0.9365 |
| Wine | 0.9584 | 0.9720 | 0.8302 | 0.9111 | 0.9187 | 0.9218 | 0.7678 | 0.9296 | 0.0041 | 0.7702 |
| Soy bean | 0.9888 | 0.9864 | 0.4845 | 0.9309 | 0.6332 | 0.5975 | 0.0286 | 0.0288 | 0.0096 | 0.3968 |
| Ionosphere | 0.9733 | 0.6763 | 0.9292 | 0.7239 | 0.9589 | 0.9209 | 0.8954 | 0.9137 | 0 | 0.9397 |
| Room Occupancy | 1 | 1 | 0.9971 | 0.9962 | 0.9971 | 0.9936 | 0.9943 | 1 | 0 | 0.9948 |
| Harth | 0.9980 | 0.9976 | 0.9750 | 0.7336 | 0.9828 | 0.9190 | 0.9800 | 0.9317 | 0 | 0.9749 |
| Rocket League | 0.9336 | 0.9362 | 0.0759 | 0.5431 | 0.3916 | 0.0890 | 0.1240 | 0 | 0 | 0 |
| Sirtuin6 | 0.9554 | 0.9589 | 0.6373 | 0.7638 | 0.9509 | 0.7683 | 0.9200 | 0.9235 | 0 | 0.9182 |
| Toxicity | 0.9323 | 0.9314 | 0.0386 | 0.5924 | 0.4527 | 0.1049 | 0.0100 | 0 | 0 | 0 |
| Dry bean | 0.9961 | 0.9952 | 0.8161 | 0.9503 | 0.9639 | 0.9702 | 0.0353 | 0.9283 | 0 | 0.8880 |
| Kc2 | 0.9948 | 0.9952 | 0.4885 | 0.9793 | 0.9853 | 0.9925 | 0.8869 | 0.9895 | 0.7053 | 0.9869 |
Table 4. Comparison of the proposed methods, ETE(OOB) and ETE(SS), with state-of-the-art methods using 90% training and 10% testing in terms of classification error rate.
| Dataset | ETE(OOB) | ETE(SS) | OTE | RF(smote) | RF(over) | RF(under) | k-NN | SVM | ANN | Tree |
|---------|----------|---------|-----|-----------|----------|-----------|------|-----|-----|------|
| Breast Cancer | 0.0261 | 0.0108 | 0.1827 | 0.0229 | 0.0069 | 0.0072 | 0.0259 | 0.0358 | 0.5105 | 0.0241 |
| Credit Card | 0.0312 | 0.0028 | 0.1343 | 0.0177 | 0.0134 | 0.006 | 0.0224 | 0.033 | 0.4165 | 0.026 |
| Drug Classification | 0.0041 | 0.0003 | 0.0042 | 0.0459 | 0.0267 | 0.0046 | 0.0046 | 0.0042 | 0.042 | 0.0026 |
| Eeg eye | 0.0176 | 0.010 | 0.039 | 0.0494 | 0.0207 | 0.0191 | 0.0379 | 0.0374 | 0.2532 | 0.1108 |
| Glass Classification | 0.0048 | 0.0059 | 0.0188 | 0.0245 | 0.047 | 0.0407 | 0.0749 | 0.3964 | 0.0996 | 0.1457 |
| Pc4 | 0.0009 | 0.0038 | 0.0382 | 0.0327 | 0.0349 | 0.0261 | 0.0373 | 0.0063 | 0.0231 | 0.0256 |
| Madelon | 0.0043 | 0.0001 | 0.0077 | 0.016 | 0.0048 | 0.0318 | 0.0062 | 0.0479 | 0.0406 | 0.0186 |
| Turing binary | 0.0018 | 0.0001 | 0.0225 | 0.0004 | 0.0034 | 0.0064 | 0.0163 | 0.0295 | 0.0182 | 0.0434 |
| KDD | 0.0019 | 0.0009 | 0.0336 | 0.0001 | 0.003 | 0.0128 | 0.0122 | 0.0456 | 0.015 | 0.0464 |
| Liver disorder | 0.0002 | 0.005 | 0.0072 | 0.0008 | 0.0009 | 0.0366 | 0.0231 | 0.0221 | 0.0406 | 0.0282 |
| Wine | 0.0026 | 0.0356 | 0.0217 | 0.0087 | 0.0384 | 0.0091 | 0.0405 | 0.0241 | 0.0146 | 0.0306 |
| Soy bean | 0.0005 | 0.0413 | 0.0137 | 0.033 | 0.0152 | 0.0141 | 0.0454 | 0.2376 | 0.1533 | 0.4282 |
| Ionosphere | 0.0007 | 0.0014 | 0.0017 | 0.0035 | 0.0006 | 0.0004 | 0.0015 | 0.0015 | 0.0008 | 0.001 |
| Room Occupancy | 0.0001 | 0.0046 | 0.0029 | 0.0028 | 0.004 | 0.0002 | 0.0013 | 0.0045 | 0.0048 | 0.0432 |
| Harth | 0.0003 | 0.011 | 0.0231 | 0.0011 | 0.0337 | 0.002 | 0.0294 | 0.0064 | 0.0285 | 0.0094 |
| Rocket League | 0.0032 | 0.0068 | 0.0334 | 0.0104 | 0.0499 | 0.0233 | 0.0169 | 0.0115 | 0.0513 | 0.0423 |
| Sirtuin6 | 0.0021 | 0.0325 | 0.0291 | 0.0162 | 0.0029 | 0.0325 | 0.0172 | 0.0069 | 0.1549 | 0.0219 |
| Toxicity | 0.0219 | 0.0259 | 0.0143 | 0.1353 | 0.0312 | 0.3115 | 0.1628 | 0.0619 | 0.4473 | 0.4854 |
| Dry bean | 0.0005 | 0.0683 | 0.1644 | 0.0418 | 0.3127 | 0.0238 | 0.1366 | 0.437 | 0.2924 | 0.2128 |
| Kc2 | 0.0001 | 0.0499 | 0.042 | 0.0021 | 0.0144 | 0.012 | 0.0076 | 0.0438 | 0.0066 | 0.0265 |
Table 5. Comparison of the proposed methods, ETE(OOB) and ETE(SS), with state-of-the-art methods using 90% training and 10% testing in terms of precision.
| Dataset | ETE(OOB) | ETE(SS) | OTE | RF(smote) | RF(over) | RF(under) | k-NN | SVM | ANN | Tree |
|---------|----------|---------|-----|-----------|----------|-----------|------|-----|-----|------|
| Breast Cancer | 0.9947 | 0.9947 | 0.6517 | 0.839 | 0.8804 | 0.838 | 0.8338 | 0.913 | 0.2983 | 0.8999 |
| Credit Card | 0.9896 | 0.9956 | 0.6501 | 0.8423 | 0.8884 | 0.8429 | 0.8354 | 0.9143 | 0.3128 | 0.9013 |
| Drug Classification | 0.9859 | 0.9898 | 0.7349 | 0.8767 | 0.9354 | 0.8716 | 0.8433 | 0.9266 | 0.4999 | 0.9079 |
| Eeg eye | 0.9785 | 0.9896 | 0.7333 | 0.8822 | 0.9386 | 0.8639 | 0.8457 | 0.9136 | 0.5117 | 0.9074 |
| Glass Classification | 0.9999 | 1 | 0.9998 | 0.8915 | 0.9406 | 0.9334 | 0.9542 | 0.97 | 0.7993 | 0.6966 |
| Pc4 | 0.9036 | 0.9139 | 0.8982 | 0.9492 | 0.9718 | 0.8841 | 0.8656 | 0.98 | 0.7396 | 0.8371 |
| Madelon | 1 | 0.9604 | 0.4013 | 0.9645 | 0.9104 | 0.7624 | 0.2103 | 0.6461 | 0.0445 | 0.5508 |
| Turing binary | 1 | 1 | 0.9197 | 0.6508 | 0.8596 | 0.6509 | 0.9591 | 0.6373 | 0.9565 | 0.8013 |
| KDD | 1 | 0.9999 | 0.9962 | 0.7037 | 0.949 | 0.8174 | 0.8721 | 0.8737 | 0.9815 | 0.8699 |
| Liver disorder | 0.9803 | 0.9798 | 0.2393 | 0.8568 | 0.9942 | 0.9056 | 0.1289 | 0.1582 | 0.0196 | 0.6431 |
| Wine | 0.9338 | 0.9264 | 0.2349 | 0.9372 | 0.8929 | 0.6402 | 0.3683 | 0.2322 | 0.1582 | 0.3683 |
| Soy bean | 1 | 0.994 | 0.9536 | 0.8912 | 0.9757 | 0.9665 | 0.8608 | 0.9704 | 0.7479 | 0.9216 |
| Ionosphere | 0.9922 | 0.9937 | 0.8841 | 0.0706 | 0.9889 | 0.9364 | 0.9586 | 0.958 | 0.9572 | 0.9765 |
| Room Occupancy | 0.9413 | 0.9988 | 0.5066 | 0.283 | 0.6602 | 0.5903 | 0.2519 | 0.8307 | 0.1087 | 0.5031 |
| Harth | 0.9999 | 1 | 0.9999 | 0.0034 | 0.9998 | 0.9992 | 1 | 0.9999 | 0.9788 | 0.9999 |
| Rocket League | 0.9775 | 0.9786 | 0.6987 | 0.2698 | 0.8552 | 0.7579 | 0.8854 | 0 | 0.0681 | 0.8709 |
| Sirtuin6 | 0.9995 | 0.9976 | 0.9505 | 0.4525 | 0.958 | 0.9407 | 0.9396 | 0.9383 | 0.9385 | 0.9386 |
| Toxicity | 0.9428 | 0.9394 | 0.09 | 0.268 | 0.6491 | 0.2142 | 0 | 0 | 0.3481 | 0 |
| Dry bean | 0.9994 | 0.9986 | 0.9074 | 0.3828 | 0.9502 | 0.8546 | 0.9369 | 0.9429 | 0.9305 | 0.9313 |
| Kc2 | 0.9999 | 1 | 0.9911 | 0.8849 | 0.9251 | 0.9725 | 0.942 | 0.004 | 0.6507 | 0.5571 |