3. Proposed Hybrid Model for Software Defect Prediction
This section presents the proposed software defect prediction model, which integrates machine learning classifiers with Binary Multi-Objective Starfish Optimization (BMOSFO) for feature selection. The BMOSFO algorithm is employed to identify the most relevant software defect features, improving classification accuracy and computational efficiency. The machine learning models used in this study include Artificial Neural Networks (ANN), Decision Trees (DT), Random Forest (RF), K-Nearest Neighbors (KNN), Naive Bayes (NB), Logistic Regression (LR), and Support Vector Machines (SVM). Furthermore, an ensemble classifier based on the Choquet Fuzzy Integral is applied to aggregate predictions and enhance defect classification performance.
The integration of BMOSFO for feature selection with the Choquet Fuzzy Integral-based ensemble classification presents a novel approach in software defect prediction. This combination leverages the exploration-exploitation balance of BMOSFO and the interdependency modeling capability of the Choquet Integral, leading to improved classification accuracy and interpretability. This innovative hybridization addresses the limitations of traditional defect prediction models by simultaneously optimizing feature relevance and classifier interaction, making it a robust solution for complex software datasets.
Figure 1 illustrates the workflow of the proposed software defect prediction framework. The system follows a structured pipeline: (1) Dataset acquisition, (2) Preprocessing, (3) Feature selection using BMOSFO, (4) Model training and classification using multiple ML classifiers, (5) Calculation of fuzzy membership values, and (6) Final prediction aggregation via the Choquet Integral.
In BMOSFO, each candidate solution is encoded as a binary vector with one bit per dataset feature, as shown in Figure 2. If a feature’s associated bit is 1, the feature is retained in the training process; otherwise, it is removed. The BMOSFO algorithm minimizes the number of selected features while maintaining high classification accuracy. The fitness function balances model simplicity against classification error, so the selected feature set is refined progressively as the search proceeds.
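The encoding described above can be illustrated with a minimal Python sketch. The helper name `apply_mask`, the metric names in the comment, and all data values are hypothetical and chosen only for illustration; they are not taken from the NASA datasets.

```python
from typing import List, Sequence


def apply_mask(rows: Sequence[Sequence[float]], mask: Sequence[int]) -> List[List[float]]:
    """Keep only the columns whose mask bit is 1 (hypothetical helper)."""
    if any(len(row) != len(mask) for row in rows):
        raise ValueError("every row must have one value per mask bit")
    return [[value for value, bit in zip(row, mask) if bit == 1] for row in rows]


# Toy data: three modules described by four illustrative metrics
# (e.g. lines of code, cyclomatic complexity, branch count, Halstead volume).
rows = [
    [120.0, 7.0, 13.0, 410.5],
    [45.0, 2.0, 3.0, 98.2],
    [300.0, 15.0, 29.0, 1520.0],
]
mask = [1, 0, 0, 1]  # retain the first and the last metric only

reduced = apply_mask(rows, mask)
n_selected = sum(mask)  # the quantity that feature selection tries to keep small
print(reduced, n_selected)
```

The reduced table is what a wrapper classifier would be trained on, while `n_selected` corresponds to the feature-count objective.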
The datasets used in this study were obtained from the NASA repository [43], which has been publicly available since 2005. These datasets include software metrics such as branch count, Halstead’s complexity measures, McCabe’s cyclomatic complexity, and various line-of-code criteria. Table 1 summarizes the datasets, while Table 2 provides an overview of the features used in this investigation.
The NASA datasets are particularly relevant for software defect prediction due to their high dimensionality, class imbalance, and noise levels, which present significant challenges for conventional machine learning models. These datasets contain complex software metrics such as cyclomatic complexity and Halstead measures, making them ideal for evaluating the effectiveness of the proposed BMOSFO feature selection and Choquet Fuzzy Integral-based ensemble classification. By testing the proposed model on these real-world datasets, this study demonstrates the robustness and generalizability of the approach in handling complex defect prediction scenarios.
The proposed model employs multi-objective optimization to balance feature selection efficiency and predictive accuracy. BMOSFO integrates swarm intelligence techniques with multi-objective search mechanisms to optimize the classifier’s performance. Inspired by starfish movement patterns, BMOSFO incorporates exploratory and exploitative search strategies, enabling robust selection of relevant defect prediction features while minimizing computational overhead.
This section elaborates on the advantages of BMOSFO in software defect classification, detailing how binary optimization techniques enhance feature selection while maintaining model interpretability. Additionally, modifications for handling binary search spaces are introduced, ensuring compatibility with real-world software defect datasets.
3.1. Starfish Optimization Algorithm: Exploration and Exploitation Mechanisms
Metaheuristic algorithms often face the challenge of balancing exploration (global search capability) and exploitation (local search refinement). Effective optimization requires a good trade-off between these two phases to prevent premature convergence to suboptimal solutions. As metaheuristic strategies rely on randomized search mechanisms, they do not guarantee optimal solutions for every problem. This principle is emphasized by the No-Free-Lunch (NFL) theorem [44], which states that no single optimization algorithm can outperform all others across all problem domains. Consequently, the development of adaptive and problem-specific metaheuristic techniques remains an active research area.
In this study, the Starfish Optimization Algorithm (SFOA) [45] is employed as a foundation for feature selection in the proposed Binary Multi-Objective Starfish Optimization (BMOSFO) framework. The biological inspiration for SFOA is drawn from the movement, hunting, and regenerative abilities of starfish. Starfish, also known as sea stars, comprise over 2,000 species globally, typically exhibiting a five-arm radial symmetry extending from a central disk. Some species, however, possess seven or more arms, with certain variations exceeding ten [46]. The average lifespan of starfish ranges from ten to thirty-five years, depending on environmental conditions and species characteristics.
Figure 3 illustrates the biological attributes of starfish that serve as the foundation for SFOA. Figure 3(a) and (b) show the body structure of different starfish species, highlighting their characteristic symmetry. Figure 3(c) represents the reproductive behavior of starfish, which inspires the regeneration mechanism in the optimization algorithm. Lastly, Figure 3(d) depicts starfish prey interactions, which influence the algorithm’s preying strategy in the exploitation phase.
The exploration phase of SFOA mimics the foraging behavior of starfish, while exploitation is modeled through preying and regeneration strategies. SFOA utilizes a hybrid search mechanism that incorporates:
- A five-dimensional search ($L > 5$, where $L$ is the number of design variables), inspired by the five arms of a starfish, for diverse exploration.
- A one-dimensional search ($L \le 5$) to improve convergence when the feature space is smaller.
The optimization process of SFOA consists of three key stages:
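- Initialization: At the beginning of the optimization process, starfish positions are randomly generated within the predefined design space, formulated as:

$$X = \begin{bmatrix} X_{1,1} & X_{1,2} & \cdots & X_{1,L} \\ X_{2,1} & X_{2,2} & \cdots & X_{2,L} \\ \vdots & \vdots & \ddots & \vdots \\ X_{N,1} & X_{N,2} & \cdots & X_{N,L} \end{bmatrix}_{N \times L} \tag{1}$$

where $N$ is the population size, $L$ is the number of design variables, and the initial positions are computed as:

$$X_{i,j} = l_j + r\,(u_j - l_j), \quad i = 1, \dots, N, \quad j = 1, \dots, L \tag{2}$$

Here, $r$ is a random number in $(0,1)$, and $l_j$ and $u_j$ are the lower and upper bounds of the $j$-th design variable. The fitness score of each starfish is evaluated based on the objective function, enabling an adaptive search process.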
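- Exploration: SFOA employs different strategies based on the problem’s dimensionality. For $L > 5$, a five-dimensional search is used for large-scale optimization. For $L \le 5$, a one-dimensional search is applied for improved local refinement. The position update rule in the exploration phase is formulated as:

$$Y_{i,p}^{t} = \begin{cases} X_{i,p}^{t} + a\,\left(X_{best,p}^{t} - X_{i,p}^{t}\right)\cos\theta, & r \le 0.5 \\ X_{i,p}^{t} - a\,\left(X_{best,p}^{t} - X_{i,p}^{t}\right)\sin\theta, & r > 0.5 \end{cases} \tag{3}$$

where $r$ is a random number in $(0,1)$, and $Y_{i,p}^{t}$, $X_{i,p}^{t}$, and $X_{best,p}^{t}$ represent the calculated, current, and best positions in dimension $p$, respectively. The parameters $a$ and $\theta$ are given by:

$$a = (2r - 1)\,\pi, \qquad \theta = \frac{\pi}{2} \cdot \frac{t}{t_{\max}} \tag{4}$$

where $t$ is the current iteration and $t_{\max}$ is the maximum number of iterations. If the revised location is outside the margins of the design parameters, the arms are more likely to remain in the previous position rather than migrate. The exploration phase updates the position using the unidimensional search pattern if $L \le 5$. In this case, a starfish utilizes position data from others to move a single arm toward the food source:

$$Y_{i,p}^{t} = E_t\,X_{i,p}^{t} + A_1\left(X_{k_1,p}^{t} - X_{i,p}^{t}\right) + A_2\left(X_{k_2,p}^{t} - X_{i,p}^{t}\right) \tag{5}$$

where $X_{k_1,p}^{t}$ and $X_{k_2,p}^{t}$ are randomly selected $p$-dimensional locations from two starfish, $A_1$ and $A_2$ are random numbers in $(-1,1)$, and $E_t$ is computed as:

$$E_t = \frac{t_{\max} - t}{t_{\max}}\cos\theta \tag{6}$$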
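- Exploitation: The exploitation phase includes hunting and regeneration strategies. The position of each starfish is updated based on distances to the best location:

$$d_m = X_{best}^{t} - X_{m_p}^{t}, \quad m = 1, \dots, 5 \tag{7}$$

where $X_{m_p}^{t}$ denotes five randomly selected starfish. The new position is computed using:

$$Y_{i}^{t} = X_{i}^{t} + r_1\,d_{m_1} + r_2\,d_{m_2} \tag{8}$$

where $r_1$ and $r_2$ are random values in (0,1), and $d_{m_1}$, $d_{m_2}$ are randomly chosen distances from $d_m$.

Additionally, in the regeneration phase, if a starfish sacrifices an arm to avoid predators, the location is updated using:

$$Y_{i}^{t} = \exp\left(-\frac{t \times N}{t_{\max}}\right) X_{i}^{t} \tag{9}$$

with $t$ the current iteration, $t_{\max}$ the maximum number of iterations, and $N$ the population size. The final update ensures values remain within bounds:

$$X_{i,j}^{t+1} = \begin{cases} Y_{i,j}^{t}, & l_j \le Y_{i,j}^{t} \le u_j \\ l_j, & Y_{i,j}^{t} < l_j \\ u_j, & Y_{i,j}^{t} > u_j \end{cases} \tag{10}$$

where $l_j$ and $u_j$ are the lower and upper bounds of the $j$-th design variable.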
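The three stages can be sketched in Python as a single continuous SFOA iteration. This is a minimal illustration under stated assumptions rather than the reference implementation: the update rules follow the form commonly given for SFOA [45], the switching parameter `gp = 0.5`, the scalar bounds `lower` and `upper`, the function name `sfoa_step`, and the toy `sphere` objective are all assumptions introduced here.

```python
import math
import random
from typing import List

Vector = List[float]


def sfoa_step(pop: List[Vector], best: Vector, t: int, t_max: int,
              lower: float, upper: float, rng: random.Random,
              gp: float = 0.5) -> List[Vector]:
    """One SFOA iteration on a continuous population (illustrative sketch)."""
    n, dim = len(pop), len(pop[0])
    theta = math.pi / 2 * t / t_max
    energy = (t_max - t) / t_max * math.cos(theta)
    new_pop: List[Vector] = []
    if rng.random() < gp:
        # Exploration: move a few "arms" (coordinates) of every starfish.
        for x in pop:
            y = list(x)
            if dim > 5:
                for p in rng.sample(range(dim), 5):
                    a = (2 * rng.random() - 1) * math.pi
                    step = a * (best[p] - x[p])
                    if rng.random() <= 0.5:
                        cand = x[p] + step * math.cos(theta)
                    else:
                        cand = x[p] - step * math.sin(theta)
                    if lower <= cand <= upper:  # otherwise the arm stays put
                        y[p] = cand
            else:
                p = rng.randrange(dim)
                k1, k2 = rng.randrange(n), rng.randrange(n)
                a1, a2 = 2 * rng.random() - 1, 2 * rng.random() - 1
                cand = (energy * x[p] + a1 * (pop[k1][p] - x[p])
                        + a2 * (pop[k2][p] - x[p]))
                if lower <= cand <= upper:
                    y[p] = cand
            new_pop.append(y)
    else:
        # Exploitation: preying for most starfish, regeneration for the last one.
        for i, x in enumerate(pop):
            if i == n - 1:
                y = [math.exp(-t * n / t_max) * v for v in x]
            else:
                dists = [[b - v for b, v in zip(best, rng.choice(pop))]
                         for _ in range(5)]
                d1, d2 = rng.sample(dists, 2)
                r1, r2 = rng.random(), rng.random()
                y = [v + r1 * u + r2 * w for v, u, w in zip(x, d1, d2)]
            # Boundary handling: clamp every coordinate to [lower, upper].
            new_pop.append([min(upper, max(lower, v)) for v in y])
    return new_pop


def sphere(x: Vector) -> float:
    """Toy objective used only for this demonstration."""
    return sum(v * v for v in x)


rng = random.Random(0)
lower, upper, dim, n, t_max = -5.0, 5.0, 8, 6, 50
pop = [[lower + rng.random() * (upper - lower) for _ in range(dim)] for _ in range(n)]
best = min(pop, key=sphere)
initial_best = sphere(best)
for t in range(1, t_max + 1):
    pop = sfoa_step(pop, best, t, t_max, lower, upper, rng)
    best = min(pop + [best], key=sphere)  # keep the best position seen so far
print(initial_best, sphere(best))
```

Because the best position is carried over between iterations, its objective value never gets worse, and every coordinate stays inside the design space.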
3.2. BMOSFO: A Binary Multi-Objective Starfish Optimization Approach for Feature Selection
The proposed BMOSFO framework consists of three main phases: initialization, exploration, and exploitation. Unlike conventional optimization strategies, BMOSFO executes exploration and exploitation in parallel, with fitness scores computed at each iteration following positional updates. Upon convergence, the algorithm outputs both the best global solution and the convergence curve.
Figure 4 illustrates the workflow of BMOSFO, depicting the key decision points and the transition between exploration and exploitation phases. The framework ensures that non-dominated solutions are maintained in an external buffer, leveraging multi-objective selection to balance feature relevance and classification accuracy.
The BMOSFO optimization process is structured as follows:
Step 1: Population Initialization. The BMOSFO framework initializes a population of $N$ starfish, each representing a candidate feature subset. Each starfish is encoded as a binary vector of length $L$, where $L$ corresponds to the total number of features in the dataset. A value of 1 denotes the inclusion of a feature, whereas 0 indicates its exclusion. Figure 2 illustrates the binary representation of the starfish population.
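Step 2: Starfish Evaluation. Feature selection is modeled as a multi-objective binary optimization problem in this study. Each starfish $X_i$ is evaluated based on two objective functions, formulated as:

$$\min f_1(X_i) = \sum_{j=1}^{L} X_{i,j}, \qquad \max f_2(X_i) = \mathrm{Acc}_{\mathrm{KNN}}(X_i) \tag{11}$$

The first objective function quantifies the number of selected features within a given subset. Subsequently, a reduced dataset is generated, retaining only the features corresponding to indices where $X_{i,j} = 1$. The second objective function computes the classification accuracy of a K-Nearest Neighbors (KNN) classifier on the reduced dataset, acting as a wrapper-based feature selection metric:

$$\mathrm{Acc}_{\mathrm{KNN}}(X_i) = \frac{\text{number of correctly classified samples}}{\text{total number of samples}} \tag{12}$$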
Step 3: Preserving Non-Dominated Solutions. Unlike single-objective optimization, multi-objective algorithms generate Pareto-optimal solutions at each iteration. These solutions are stored in an external buffer, ensuring that superior candidates are retained. The external buffer undergoes screening in subsequent iterations, discarding solutions that are dominated by newly discovered candidates. If a new solution is non-dominated, it replaces the most crowded buffer element based on Crowding Distance (CD) [47].
Step 4: Starfish Position Update. BMOSFO dynamically alternates between exploration and exploitation phases based on a control parameter $G$ and a randomly generated value $rand$ in the range $[0, 1]$. The following rules govern the transition:
- If $rand < G$, the exploration phase is executed:
  - If $L > 5$, locations are updated using Equation 3.
  - If $L \le 5$, locations are updated using Equation 5.
- If $rand \ge G$, the exploitation phase is executed:
  - Positions are updated using Equation 8.
  - For the last starfish ($i = N$), regeneration occurs via Equation 9.
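Boundary conditions are checked using Equation 10, and final updates are applied to the binary domain via a sigmoid transfer function:

$$S\left(X_{i,j}^{t+1}\right) = \frac{1}{1 + e^{-X_{i,j}^{t+1}}} \tag{13}$$

The final binary update is performed as [48]:

$$X_{i,j}^{t+1} = \begin{cases} \neg X_{i,j}^{t}, & r < S\left(X_{i,j}^{t+1}\right) \\ X_{i,j}^{t}, & \text{otherwise} \end{cases} \tag{14}$$

where $r$ is a random number in $(0,1)$, $X$ represents the starfish’s location, $j = 1, \dots, L$ indexes the dimension, $t$ is the current iteration, $\neg$ denotes negation, and $S$ is the transfer function. Inside $S$, $X_{i,j}^{t+1}$ is the continuous value returned by Equation 10, whereas $X_{i,j}^{t}$ is the current binary value.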
Step 5: Computational Complexity of BMOSFO. The computational complexity of BMOSFO depends on the population size $N$, the number of features $L$, and the total number of iterations $t_{\max}$. The total complexity is formulated as:

The overall complexity is composed of several components:
- Initialization Complexity: Initializing the population requires .
- Fitness Evaluation Complexity: Evaluating two objective functions takes .
- Non-Dominated Solution Filtering: Using the dominance tree method requires [49].
- External Buffer Management: Crowding Distance (CD) sorting in the external buffer requires [47].
- Exploration Complexity: The starfish position update depends on whether $L > 5$ or $L \le 5$:
  - For $L > 5$: .
  - For $L \le 5$: .
- Exploitation Complexity: Exploitation requires .
- Final Buffer Update Complexity: The final update in each iteration takes:

Therefore, the total computational complexity of BMOSFO-based feature selection is:
The optimized feature set obtained through BMOSFO directly impacts the classification performance by eliminating irrelevant and redundant features, thereby reducing noise and enhancing model generalizability. By focusing on the most relevant software metrics, BMOSFO not only improves classification accuracy but also enhances computational efficiency. The selected features are then used to train multiple classifiers, ensuring that the models are trained on the most informative attributes, ultimately leading to more reliable defect prediction outcomes.
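The evaluation and Pareto-filtering steps can be sketched as follows. This is an illustrative sketch only: a leave-one-out 1-nearest-neighbour accuracy stands in for the KNN wrapper, whose exact configuration is not specified here, and the function names and toy data are hypothetical.

```python
from typing import List, Sequence, Tuple

Mask = Tuple[int, ...]
Objectives = Tuple[int, float]


def loo_1nn_accuracy(rows: Sequence[Sequence[float]], labels: Sequence[int],
                     mask: Mask) -> float:
    """Leave-one-out accuracy of a 1-NN classifier on the masked columns."""
    cols = [j for j, bit in enumerate(mask) if bit == 1]
    if not cols:
        return 0.0  # an empty feature subset cannot classify anything
    correct = 0
    for i, row in enumerate(rows):
        best_dist, best_label = float("inf"), None
        for k, other in enumerate(rows):
            if k == i:
                continue
            dist = sum((row[j] - other[j]) ** 2 for j in cols)
            if dist < best_dist:
                best_dist, best_label = dist, labels[k]
        correct += int(best_label == labels[i])
    return correct / len(rows)


def objectives(rows, labels, mask: Mask) -> Objectives:
    """(feature count, accuracy): the first is minimized, the second maximized."""
    return sum(mask), loo_1nn_accuracy(rows, labels, mask)


def dominates(a: Objectives, b: Objectives) -> bool:
    """a dominates b if it is no worse on both objectives and better on one."""
    no_worse = a[0] <= b[0] and a[1] >= b[1]
    better = a[0] < b[0] or a[1] > b[1]
    return no_worse and better


def non_dominated(candidates: List[Tuple[Mask, Objectives]]):
    """Keep the candidates that no other candidate dominates."""
    return [c for c in candidates
            if not any(dominates(o[1], c[1]) for o in candidates)]


# Toy data: column 0 separates the classes, columns 1 and 2 do not.
rows = [[0.10, 5.0, 0.3], [0.20, 1.0, 0.9], [0.15, 3.0, 0.1],
        [0.90, 4.8, 0.4], [1.00, 1.2, 0.8], [0.95, 2.9, 0.2]]
labels = [0, 0, 0, 1, 1, 1]
masks = [(0, 0, 0), (1, 0, 0), (0, 1, 0), (1, 0, 1), (1, 1, 1)]

candidates = [(mask, objectives(rows, labels, mask)) for mask in masks]
front = non_dominated(candidates)
print(front)
```

On this toy data, only the empty subset and the single informative feature survive the filter; every larger subset is dominated because it uses more features without gaining accuracy.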
3.2.1. Proposed BMOSFO Algorithm
The complete feature selection procedure, from initialization to selecting the best feature subsets based on crowding distance, is formalized in Algorithm 1. By balancing the number of selected features against classification accuracy, BMOSFO searches for non-dominated solutions across both objectives.
Algorithm 1 BMOSFO Algorithm
Input: Algorithm parameters $N$ (population size), $t_{\max}$ (maximum number of iterations), $G$ (control parameter)
Output: Optimized feature set stored in external buffer.
1: Initialize a population of $N$ starfish randomly, as shown in Figure 2.
2: Evaluate each starfish using the objective functions presented above.
3: Store the optimal Pareto solutions in an external buffer.
4: for $t = 1$ to $t_{\max}$ do
5:     Generate a random number $rand$ uniformly distributed in [0, 1].
6:     if $rand < G$ then
7:         Calculate $\theta$ and $E_t$ using Equations 4 and 6, respectively.
8:         for each starfish $i$ do
9:             if $L > 5$ then
10:                Update the starfish location using Equation 3.
11:            else
12:                Select a random index $p$.
13:                Update the $p$-index of the starfish position using Equation 5.
14:            end if
15:            Check boundary conditions.
16:        end for
17:    else
18:        for each starfish $i$ do
19:            Update the position using Equation 8.
20:            if $i = N$ then
21:                Perform the position update using Equation 9.
22:            end if
23:            Check boundary conditions.
24:        end for
25:    end if
26:    Apply the sigmoid transfer function (Equation 13) and update using Equation 14 to convert positions to binary.
27:    Recalculate objective values.
28:    Update the external buffer with new Pareto-optimal solutions.
29: end for
30: Return the best solution from the buffer based on crowding distance (CD).
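Line 26 of Algorithm 1 can be sketched in Python as follows. The sketch assumes a complement-style rule, as suggested by the negation operator mentioned in Step 4: a bit is flipped when a fresh random number falls below the sigmoid of the continuous position. The function names are hypothetical.

```python
import math
import random
from typing import List, Sequence


def sigmoid(x: float) -> float:
    """S-shaped transfer function, written to avoid overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def binarize(position: Sequence[float], bits: Sequence[int],
             rng: random.Random) -> List[int]:
    """Flip bit j when a random number falls below sigmoid(position[j])."""
    new_bits = []
    for value, bit in zip(position, bits):
        if rng.random() < sigmoid(value):
            new_bits.append(1 - bit)  # negation of the current bit
        else:
            new_bits.append(bit)      # bit is kept unchanged
    return new_bits


rng = random.Random(0)
# A very large position flips its bit; a very negative one leaves it alone.
flipped = binarize([50.0, -50.0], [0, 1], rng)
print(flipped)
```

With moderate position values the outcome is stochastic, which is what lets the binary search keep exploring neighbouring feature subsets.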
After optimizing the feature set using BMOSFO, the selected features are used to train a diverse set of machine learning classifiers. To further enhance predictive performance and robustness, an ensemble approach based on the Choquet Fuzzy Integral is employed. This ensemble method effectively aggregates classifier outputs by modeling their interdependencies, enabling more accurate defect classification.
3.3. Choquet Fuzzy Integral-Based Ensemble Classification
The proposed ensemble classification method employs the Choquet fuzzy integral to aggregate the outputs of multiple classifiers, enhancing prediction reliability and robustness. The classification process follows these steps:
- The dataset is initially divided into training and testing sets.
- The training set is further split into training and validation subsets.
- The fuzzy measure values $F$ are calculated based on the classifier weights $w_i$.
- The confidence scores of the individual classifiers are combined using the Choquet integral [50].
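The fuzzy measure value $F_i$ for the $i$-th classifier is computed as:

$$F_i = \frac{w_i}{\sum_{k=1}^{n} w_k}$$

where $w_i$ is the weight of the $i$-th classifier and $n$ is the number of classifiers. This ensures that the fuzzy measure values are normalized and sum to 1, preserving relative classifier importance.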
Let $s_{ij}$ denote the confidence score of the $j$-th class predicted by the $i$-th classifier. The fuzzy confidence score for each class $j$ is computed using:
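To determine the remaining fuzzy measure values for the classifier combination, the following equation is applied recursively:

$$F(A \cup B) = F(A) + F(B) + \lambda\,F(A)\,F(B)$$

where $A$ and $B$ are disjoint subsets of classifiers ($A \cap B = \emptyset$), and $\lambda > -1$ is a parameter that ensures the monotonicity of the fuzzy measure. Here, $\lambda$ represents the interaction strength between classifiers, regulating the non-additivity of the Choquet integral, and its value is computed as the root of:

$$\lambda + 1 = \prod_{i=1}^{n} \left(1 + \lambda F_i\right)$$

where $F_i$ is the fuzzy measure value of the $i$-th of the $n$ classifiers.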
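The interaction parameter can be found numerically. The following Python sketch assumes the standard Sugeno $\lambda$-measure condition, in which $\lambda > -1$ is the non-trivial root of $\prod_i (1 + \lambda F_i) = 1 + \lambda$, and solves it by bisection. The function names and density values are hypothetical. The densities in the first example are deliberately left unnormalized so that the root is non-trivial; when the densities sum to exactly 1, the condition is satisfied by $\lambda = 0$ and the measure is additive.

```python
from math import prod
from typing import Sequence


def residual(lam: float, densities: Sequence[float]) -> float:
    """prod(1 + lam * F_i) - (1 + lam); zero at the sought root."""
    return prod(1.0 + lam * g for g in densities) - (1.0 + lam)


def solve_lambda(densities: Sequence[float], iterations: int = 200) -> float:
    """Bisection for the non-trivial root lam > -1 (0.0 in the additive case)."""
    if len(densities) < 2 or not all(0.0 < g < 1.0 for g in densities):
        raise ValueError("need at least two densities, each strictly in (0, 1)")
    total = sum(densities)
    if abs(total - 1.0) < 1e-9:
        return 0.0  # densities already sum to 1: additive measure
    if total < 1.0:
        lo, hi = 1e-9, 1.0  # root is positive; widen until the sign changes
        while residual(hi, densities) < 0.0:
            hi *= 2.0
    else:
        lo, hi = -1.0 + 1e-12, -1e-9  # root lies in (-1, 0)
    sign_lo = residual(lo, densities) > 0.0
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if (residual(mid, densities) > 0.0) == sign_lo:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def measure_of(subset_densities: Sequence[float], lam: float) -> float:
    """Build F(A) recursively: F(A + {i}) = F(A) + F_i + lam * F(A) * F_i."""
    value = 0.0
    for g in subset_densities:
        value = value + g + lam * value * g
    return value


densities = [0.2, 0.3, 0.1]          # hypothetical, sum < 1
lam = solve_lambda(densities)
full = measure_of(densities, lam)     # measure of the full classifier set
print(lam, full)
```

A useful sanity check is that the measure of the full classifier set equals 1 for the solved $\lambda$, which follows directly from the defining condition.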
3.3.1. Prediction Score Computation
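After training, the classifiers $C_1, \dots, C_n$ generate prediction scores for each test sample. These scores are then aggregated using the Choquet integral, which computes the final class-wise prediction scores. For class $j$, the Choquet integral is applied as follows:

$$\mathrm{Ch}_j = \sum_{i=1}^{n} \left( s_{(i)j} - s_{(i+1)j} \right) F(A_i), \qquad s_{(n+1)j} = 0$$

where $F(A_i)$ is the fuzzy measure of the subset $A_i$, defined as:

$$A_i = \left\{ C_{(1)}, C_{(2)}, \dots, C_{(i)} \right\}$$

with $C_{(i)}$ the classifier that produced the $i$-th largest score. For example:
- If $i = 1$, then $A_1 = \{C_{(1)}\}$.
- If $i = 2$, then $A_2 = \{C_{(1)}, C_{(2)}\}$.

The values $s_{(i)j}$ represent the prediction scores sorted in descending order such that:

$$s_{(1)j} \ge s_{(2)j} \ge \dots \ge s_{(n)j}$$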
3.3.2. Final Classification Decision
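For each test sample, the Choquet integral computes an aggregated prediction score for every class. The final classification decision is made by selecting the class with the highest Choquet integral value:

$$\hat{y} = \arg\max_{j} \mathrm{Ch}_j$$

where $\mathrm{Ch}_j$ is the Choquet integral value for class $j$.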
Unlike simple averaging, the Choquet fuzzy integral accounts for the interactions between classifiers rather than treating them as independent entities, which prevents over-reliance on any single weak learner. This methodology enhances classification performance, especially in imbalanced and uncertain environments.
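The aggregation and decision steps can be sketched in Python as follows. The sketch assumes a Sugeno $\lambda$-measure built recursively from per-classifier densities; the function names, densities, and scores are hypothetical values used only for illustration.

```python
from typing import List, Sequence


def choquet(scores: Sequence[float], densities: Sequence[float], lam: float) -> float:
    """Choquet integral of one class's scores w.r.t. a Sugeno lambda-measure."""
    # Visit classifiers from the highest to the lowest score.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    sorted_scores = [scores[i] for i in order] + [0.0]  # sentinel after the last score
    total, measure = 0.0, 0.0
    for rank, idx in enumerate(order):
        g = densities[idx]
        measure = measure + g + lam * measure * g  # measure of the top-(rank+1) set
        total += (sorted_scores[rank] - sorted_scores[rank + 1]) * measure
    return total


def predict(score_matrix: Sequence[Sequence[float]],
            densities: Sequence[float], lam: float) -> int:
    """score_matrix[i][j] is classifier i's score for class j; returns the argmax class."""
    n_classes = len(score_matrix[0])
    fused: List[float] = [
        choquet([row[j] for row in score_matrix], densities, lam)
        for j in range(n_classes)
    ]
    return max(range(n_classes), key=fused.__getitem__)


# Three classifiers, two classes (non-defective = 0, defective = 1).
densities = [0.5, 0.3, 0.2]
score_matrix = [[0.9, 0.1],
                [0.6, 0.4],
                [0.3, 0.7]]
class0 = choquet([row[0] for row in score_matrix], densities, lam=0.0)
class1 = choquet([row[1] for row in score_matrix], densities, lam=0.0)
label = predict(score_matrix, densities, lam=0.0)
print(class0, class1, label)
```

With $\lambda = 0$ and densities that sum to 1, the integral reduces to a weighted average of the scores, which makes this configuration a convenient check; a non-zero $\lambda$ changes the subset measures and therefore the fused scores.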