ARTICLE | doi:10.20944/preprints201805.0248.v1
Subject: Mathematics & Computer Science, General & Theoretical Computer Science
Keywords: software fault prediction; data preprocessing; feature selection; rough set theory; class imbalance; noise filter; easy ensemble
Online: 17 May 2018 (13:01:51 CEST)
Software fault prediction is an important research topic in software quality assurance. Data-driven approaches provide robust mechanisms for software fault prediction. However, the prediction performance of a model depends heavily on the quality of the dataset. Many software datasets suffer from class imbalance. Under-sampling is a popular data pre-processing method for dealing with class imbalance, and Easy Ensemble (EE) is a robust approach that achieves a high classification rate and addresses the bias towards majority-class samples. However, class imbalance is not the only issue that harms classifier performance. Noisy examples and irrelevant features may further reduce the predictive accuracy of the classifier. In this paper, we propose a two-stage data pre-processing approach that incorporates feature selection and a new Rough set Easy Ensemble scheme. In the feature selection stage, we eliminate irrelevant features with a feature ranking algorithm. In the second stage, we apply the new Rough set Easy Ensemble scheme, which runs a Rough K nearest neighbor rule filter (RK) before executing Easy Ensemble (EE); we call it RKEE for short. RK can remove noisy examples from both the minority and majority classes. An experimental evaluation on real-world software projects, such as the NASA and Eclipse datasets, is performed to demonstrate the effectiveness of the proposed approach. Furthermore, this paper comprehensively investigates the factors that influence our approach, such as the impact of Rough set theory on the noise filter and the relationship between model performance and imbalance ratio. Comprehensive experiments indicate that the proposed approach shows outstanding and significant performance in terms of area under the curve (AUC).
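As a rough illustration of the second stage (a noise filter followed by Easy-Ensemble-style under-sampling), the Python sketch below may help. It is not the authors' RKEE implementation. The function names, the toy data, and the parameters k, n_subsets, and seed are all assumptions made for illustration. In particular, the abstract does not specify the rough-set rule, so a plain k-nearest-neighbor majority vote stands in for the RK filter. The feature ranking stage and the training of base classifiers are omitted.

```python
import math
import random
from collections import Counter


def knn_noise_filter(X, y, k=3):
    """Keep only samples whose label matches the majority label of their
    k nearest neighbours. Stand-in for the RK filter: a plain k-NN vote,
    with no rough-set approximation (assumption)."""
    keep = []
    for i, xi in enumerate(X):
        # Distances from sample i to every other sample, nearest first.
        neighbours = sorted(
            (math.dist(xi, xj), j) for j, xj in enumerate(X) if j != i
        )[:k]
        votes = Counter(y[j] for _, j in neighbours)
        if votes.most_common(1)[0][0] == y[i]:
            keep.append(i)
    return [X[i] for i in keep], [y[i] for i in keep]


def balanced_subsets(X, y, n_subsets=4, minority_label=1, seed=0):
    """Easy-Ensemble-style sampling: every subset keeps all minority
    samples plus an equally sized random draw from the majority class."""
    rng = random.Random(seed)
    minority = [i for i, label in enumerate(y) if label == minority_label]
    majority = [i for i, label in enumerate(y) if label != minority_label]
    subsets = []
    for _ in range(n_subsets):
        drawn = rng.sample(majority, min(len(minority), len(majority)))
        idx = minority + drawn
        subsets.append(([X[i] for i in idx], [y[i] for i in idx]))
    return subsets


# Hypothetical toy data: label 1 = faulty module (minority class).
X = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (0.3, 0.3), (0.1, 0.4), (0.4, 0.1),
     (5.0, 5.0), (5.1, 5.2), (5.2, 5.1), (5.3, 5.0),
     (0.2, 0.2)]  # last point sits in the majority cluster but is labelled 1
y = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

X_clean, y_clean = knn_noise_filter(X, y, k=3)
subsets = balanced_subsets(X_clean, y_clean, n_subsets=4)
```

In a full Easy Ensemble, one base classifier would be trained on each balanced subset and their outputs combined. Filtering before sampling means the mislabelled point never reaches any of the subsets.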