This section presents a detailed empirical investigation of twenty classifier–sampler configurations across three highly imbalanced datasets: credit card fraud detection, Yeast protein localization, and Ozone level detection. The primary objective is to examine the sensitivity and reliability of five evaluation metrics under extreme class imbalance: ROC-AUC, PR-AUC, F₂-score, MCC, and H-measure. Rather than comparing classifiers per se, the focus lies on understanding how each metric responds to variations in false positives and false negatives induced by different sampling techniques. The results highlight notable inconsistencies in ROC-AUC’s ability to reflect practical misclassification costs, whereas the alternative metrics align more closely with operational realities and domain-expert expectations.
5.1. Detailed Per-dataset Results
This section compares the behavior of twenty classifier–sampler configurations on three datasets that are thematically unrelated yet similarly skewed: the credit-card fraud dataset, the Yeast protein localization dataset, and the Ozone Level Detection dataset.
5.1.1. Fraud Dataset Results
Table 3 presents the results of 20 distinct classifier–sampler configurations, including the corresponding confusion matrix components and five evaluation metrics. All evaluations were conducted on the test set (unseen data) of the Fraud dataset. Since the study’s objective is metric evaluation, not model comparison, we examine how each metric responds to the dramatic swings in FP and FN counts that arise under extreme class imbalance. The empirical evaluation conducted on the Fraud dataset clearly demonstrates the limitations inherent in relying on ROC-AUC as an evaluation metric for rare-event binary classification tasks. Although ROC-AUC scores across various classifiers and sampling methods remain consistently high, a deeper inspection using alternative metrics reveals significant shortcomings in ROC-AUC’s reliability for highly imbalanced datasets.
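As a point of reference for how such per-configuration numbers can be reproduced, the sketch below computes four of the five metrics plus the confusion-matrix components from a fitted model’s test-set scores using scikit-learn (the H-measure is omitted because scikit-learn does not provide it). This is an illustrative sketch under assumed variable names (`model`, `X_test`, `y_test`), not the exact evaluation code used in this study; PR-AUC is approximated here by average precision.

```python
from sklearn.metrics import (
    roc_auc_score,
    average_precision_score,   # PR-AUC reported via average precision
    fbeta_score,
    matthews_corrcoef,
    confusion_matrix,
)

def metric_bundle(model, X_test, y_test, threshold=0.5):
    """Score one classifier-sampler configuration on the untouched test split."""
    scores = model.predict_proba(X_test)[:, 1]        # positive-class probabilities
    y_pred = (scores >= threshold).astype(int)        # fixed decision threshold
    tn, fp, fn, tp = confusion_matrix(y_test, y_pred, labels=[0, 1]).ravel()
    return {
        "ROC-AUC": roc_auc_score(y_test, scores),
        "PR-AUC": average_precision_score(y_test, scores),
        "F2": fbeta_score(y_test, y_pred, beta=2),
        "MCC": matthews_corrcoef(y_test, y_pred),
        "TP": tp, "FP": fp, "FN": fn, "TN": tn,
    }
```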
Consider the Logistic Regression classifier with ADASYN sampling as a notable example: its ROC-AUC is impressively high at 0.968. However, this apparently robust performance contrasts with extremely poor values for other critical metrics: an F2-score of just 0.09, an MCC of 0.126, and an H-measure of 0.638. Further exacerbating this discrepancy is the very large number of false positives (FP = 6595), which illustrates that ROC-AUC does not adequately penalize the misclassification of negative-class instances. A similarly striking contradiction is observed for Logistic Regression with SMOTE sampling. Despite achieving a high ROC-AUC of 0.968, this combination yields poor F2 (0.237), MCC (0.227), and H-measure (0.651) scores, compounded by an extremely high false positive count (FP = 2019). This trend persists across multiple combinations, highlighting ROC-AUC’s inability to reflect meaningful performance deficiencies when dealing with highly imbalanced datasets.
The inconsistency in performance, as indicated by the ROC-AUC compared to more practically relevant metrics, is further exemplified by the Logistic Regression classifier combined with Borderline-SMOTE sampling, which achieves an acceptable ROC-AUC score of 0.935. Nonetheless, substantial performance issues arise, as clearly evidenced by an F2-score of 0.448, an MCC of 0.360, and an H-measure of 0.558, coupled with a high false positive count (FP = 647). These results underscore the critical failure of ROC-AUC in accurately capturing and penalizing the actual misclassification costs associated with rare-event classes. We emphasize that the behavior observed for logistic regression under aggressive oversampling is not a failure of the classifier per se, but an expected consequence of training a linear model on heavily resampled data at extreme prevalence; the key finding is that ROC-AUC remains insensitive to the resulting explosion in false positives, whereas MCC, F₂, PR-AUC, and the H-measure appropriately penalize such configurations.
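To make the experimental setup concrete, the sketch below shows one plausible way to wire a sampler and a classifier together so that resampling affects only the training data while the test set keeps its natural prevalence. The sampler list, random seeds, and the reuse of the `metric_bundle` helper above are illustrative assumptions, not the authors’ exact configuration.

```python
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE, ADASYN, BorderlineSMOTE, SVMSMOTE
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

samplers = {
    "Baseline": None,
    "SMOTE": SMOTE(random_state=42),
    "ADASYN": ADASYN(random_state=42),
    "Borderline-SMOTE": BorderlineSMOTE(random_state=42),
    "SVM-SMOTE": SVMSMOTE(random_state=42),
}

# X, y are assumed to hold the fraud features and binary labels.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

for name, sampler in samplers.items():
    steps = [("sampler", sampler)] if sampler is not None else []
    steps.append(("clf", LogisticRegression(max_iter=1000)))
    pipe = Pipeline(steps)
    pipe.fit(X_train, y_train)        # oversampling happens only inside fit
    print(name, metric_bundle(pipe, X_test, y_test))
```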
Conversely, metrics such as MCC, F2, and H-measure exhibit greater consistency in identifying performance inadequacies, effectively distinguishing between well-performing and poorly performing models. For instance, the baseline Random Forest classifier achieves strong, stable performance across MCC (0.855), F2 (0.796), and H-measure (0.761) with low FP (5), clearly indicative of genuine classification effectiveness.
In summary, the empirical evidence firmly establishes that, despite its widespread use, ROC-AUC often provides an overly optimistic and misleading assessment of classifier performance in highly imbalanced contexts. Alternative metrics, specifically MCC, F2, and H-measure, are more effective and accurate indicators of genuine predictive performance and should be preferred in evaluation methodologies involving rare-event classification.
Table 4 summarizes the analysis conducted on the Fraud dataset, encapsulating the observed performance ranges, sensitivity to variations in false positives and false negatives, and key observations for ROC-AUC, PR-AUC, F₂-score, MCC, and H-measure. This comparative overview highlights significant discrepancies between ROC-AUC and alternative metrics, underscoring ROC-AUC's limited sensitivity to misclassification costs in highly imbalanced datasets.
Complementing the scalar summaries in Table 4, a concise cross-metric visualization aids interpretation. Figure 1 (a–e) provide small-multiples radar plots comparing the five evaluation criteria (F₂, H-measure, MCC, ROC-AUC, and PR-AUC) for Random Forest, Logistic Regression, XGBoost, and CatBoost under each resampling strategy. Axes are fixed across panels and scaled to [0,1]; polygons report fold-wise means. The purpose is illustrative: to visualize pattern and separation across metrics, complementing the confidence-interval and rank-based analyses reported later.
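For readers who wish to reproduce this style of figure, the following is a minimal matplotlib sketch of such small-multiples radar plots. The `results` dictionary (sampler → classifier → metric → fold-wise mean) and the plotting choices are assumptions for illustration, not the code behind Figure 1.

```python
import numpy as np
import matplotlib.pyplot as plt

metrics = ["F2", "H-measure", "MCC", "ROC-AUC", "PR-AUC"]
angles = np.linspace(0, 2 * np.pi, len(metrics), endpoint=False).tolist()
angles += angles[:1]                                      # close the polygon

# results: {sampler_name: {classifier_name: {metric_name: mean score in [0, 1]}}}
fig, axes = plt.subplots(1, len(results), figsize=(20, 4), subplot_kw={"polar": True})
for ax, (sampler, per_clf) in zip(axes, results.items()):
    for clf_name, vals in per_clf.items():
        vec = [vals[m] for m in metrics]
        vec += vec[:1]                                    # repeat first value to close shape
        ax.plot(angles, vec, label=clf_name)
        ax.fill(angles, vec, alpha=0.1)
    ax.set_xticks(angles[:-1])
    ax.set_xticklabels(metrics)
    ax.set_ylim(0, 1)                                     # fixed [0, 1] axes across panels
    ax.set_title(sampler)
axes[0].legend(loc="upper right", bbox_to_anchor=(0.0, 1.15), fontsize="small")
plt.tight_layout()
plt.show()
```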
Two consistent regularities are apparent across all samplers. First, ROC-AUC exhibits a ceiling effect: for every classifier and sampler, the ROC-AUC spoke lies close to the outer ring, producing minimal model separation. Second, the threshold-dependent/cost-aligned metrics (MCC and F2) reveal substantial differences that ROC-AUC masks. In particular, Logistic Regression deteriorates sharply under synthetic minority schemes: under SMOTE and ADASYN, the Logistic Regression polygon collapses on the MCC and F2 axes, while remaining near-maximal on ROC-AUC, indicating a severe precision loss (inflated false positives) that does not materially affect rank-based AUC. The tree/boosted models (Random Forest, XGBoost, CatBoost) remain comparatively stable in terms of MCC and F2 across samplers, with XGBoost and Random Forest typically forming the largest polygons (i.e., the strongest across the bundle).
The Baseline panel serves as a reference: ensembles dominate on MCC and F2, while Logistic Regression exhibits weaker performance but does not collapse. Moving to SMOTE and ADASYN, the Logistic Regression degradation intensifies (MCC and F2 shrink markedly) even though PR-AUC and H-measure decline only moderately, and ROC-AUC stays saturated. This pattern is consistent with decision-boundary distortion and score miscalibration induced by aggressive oversampling at a prevalence of 0.17%, which disproportionately inflates false positives at practically relevant thresholds. Borderline-SMOTE and SVM-SMOTE exhibit the same qualitative behavior, albeit with milder Logistic Regression degradation; ensembles, however, retain broad, well-rounded polygons, reflecting their robustness to these resampling variants.
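The false-positive inflation described above is straightforward to make visible with a simple threshold diagnostic. The sketch below (an assumed helper, not part of the study’s code) counts false positives and false negatives across a grid of decision thresholds for a fitted pipeline; because ROC-AUC is threshold-free, it cannot register this curve.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

def error_counts_by_threshold(model, X_test, y_test, thresholds=np.linspace(0.1, 0.9, 9)):
    """Tabulate FP/FN counts at a range of operating thresholds."""
    scores = model.predict_proba(X_test)[:, 1]
    rows = []
    for t in thresholds:
        y_pred = (scores >= t).astype(int)
        tn, fp, fn, tp = confusion_matrix(y_test, y_pred, labels=[0, 1]).ravel()
        rows.append({"threshold": round(float(t), 2), "FP": int(fp), "FN": int(fn)})
    return rows
```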
Taken together, the radars visualize the complementarity within the proposed metric bundle. PR-AUC and H-measure track the MCC and F2 separations (though less dramatically), reinforcing their role as threshold-free and cost-sensitive companions, respectively. Conversely, the near-constant ROC-AUC across panels underscores its limited diagnostic value in this ultra-imbalanced setting. These visual regularities align with our Kendall-τ concordance results (strong agreement among MCC, F2, H-measure, and PR-AUC; weak with ROC-AUC) and the critical-difference rankings that favor tree/boosted models. We therefore use the radars as an intuitive summary of sampler–classifier interactions and as corroborating evidence for the central claim: relying solely on ROC-AUC can misrepresent practical performance, whereas a multi-metric, cost-aligned protocol reveals operationally meaningful differences.
5.1.2. Yeast Dataset Results
Table 5 presents the results of 20 distinct classifier–sampler configurations, including the corresponding confusion matrix components and five evaluation metrics. All evaluations were conducted on the test set (unseen data) of the Yeast dataset. The empirical evaluation conducted on the Yeast dataset clearly demonstrates the limitations inherent in relying on ROC-AUC as an evaluation metric for rare-event binary classification tasks. Although ROC-AUC scores across various classifiers and sampling methods frequently appear stable or relatively high, a deeper analysis using alternative metrics reveals significant shortcomings in the reliability of ROC-AUC for highly imbalanced datasets.
For instance, the Logistic Regression classifier combined with SMOTE sampling yields an apparently high ROC-AUC score of 0.908. However, this performance is sharply contradicted by considerably lower scores in crucial alternative metrics, such as F2 (0.207), MCC (0.174), and H-measure (0.677). The substantial false positive count observed in this scenario (FP = 92) further highlights ROC-AUC’s inability to effectively reflect the practical costs associated with increased false alarms. Similarly, the XGBoost classifier combined with SMOTE sampling yields a ROC-AUC score of 0.884, which initially appears moderate. However, detailed metrics, including F2 (0.484), MCC (0.455), and H-measure (0.353), reveal critical weaknesses in performance, particularly given that even a relatively modest number of false positives (FP = 4) can negatively impact the model’s practical effectiveness.
Additionally, analysis of the Logistic Regression classifier with ADASYN sampling provides further evidence of ROC-AUC’s limitations. Despite maintaining a high ROC-AUC score (0.908), this combination demonstrates poor performance in alternative metrics: F2 at 0.146, MCC at 0.125, and H-measure at 0.677. Moreover, this classifier configuration suffers from an extremely high false positive count (FP = 142), further underscoring the ROC-AUC’s inadequate sensitivity to misclassification costs.
Conversely, metrics such as MCC, F2, and H-measure consistently provide a more accurate representation of classifier performance, effectively distinguishing between models that perform well and those that do not. For example, the baseline Random Forest classifier achieves stable and relatively high scores across MCC (0.608), F2 (0.536), and H-measure (0.577), while maintaining a low false positive count (FP = 1), clearly signalling robust classification capability.
In summary, the empirical evidence from the Yeast dataset clearly demonstrates that ROC-AUC often presents a misleadingly optimistic view of classifier performance in highly imbalanced scenarios. Alternative metrics, such as MCC, F₂, and H-measure, emerge as more reliable and practically meaningful indicators of model performance in rare-event classification problems.
Table 6 summarizes the detailed analysis conducted on the Yeast dataset, presenting the performance range, sensitivity to false positives and false negatives, and key observations for ROC-AUC, PR-AUC, F₂-score, MCC, and H-measure. This summary clearly highlights the inadequacy of ROC-AUC and supports the practical relevance and greater accuracy of alternative metrics for highly imbalanced datasets.
In addition to the scalar results in Table 6, a compact cross-metric perspective provides an integrated view.
Figure 2 (a–e) present small-multiples radar plots for the Yeast dataset (1.35% positives), comparing F2, H-measure, MCC, ROC-AUC, and PR-AUC for Random Forest, Logistic Regression, XGBoost, and CatBoost under each resampling strategy. Axes are fixed across panels, scaled to [0,1], and polygons report fold-wise means. As with the Fraud radars, the goal is illustrative: to visualize patterns and separation across metrics, complementing the following confidence-interval and rank-based analyses.
Two regularities again emerge. First, ROC-AUC remains near the outer ring for all models and samplers, yielding limited separation. Second, threshold-dependent/cost-aligned metrics (MCC and F2) reveal material differences that ROC-AUC alone obscures, with PR-AUC and H-measure generally moving in the same direction, albeit less sharply.
Dataset-specific nuances are notable. In the Baseline panel, XGBoost exhibits a pronounced collapse on F2, MCC, PR-AUC, and H-measure despite a high ROC-AUC score; this is an archetypal instance of AUC saturation masking practically relevant errors. CatBoost and Logistic Regression form larger, more rounded polygons, and Random Forest sits in between. Under SMOTE and ADASYN, Logistic Regression exhibits a mixed profile: PR-AUC and H-measure increase substantially, whereas MCC (and at times F2) decreases, indicating that oversampling improves ranking and cost-weighted separation while simultaneously inflating false positives at decision-useful thresholds (score–threshold miscalibration). Borderline-SMOTE mitigates this tension, with milder degradation in Logistic Regression on MCC and F2, and stable ensemble performance. SVM-SMOTE yields the most balanced polygons overall (especially for Logistic Regression and CatBoost), suggesting that margin-aware synthesis can enhance both ranking-based and threshold-dependent metrics on the Yeast dataset.
Taken together, these radars (i) make the ROC-AUC ceiling effect visually explicit; (ii) highlight sampler–classifier interactions that matter operationally (e.g., Baseline XGBoost collapse; Logistic Regression oversampling trade-offs); and (iii) show PR-AUC and H-measure qualitatively tracking the MCC and F2 separations. The visual patterns are consistent with the Kendall-τ concordance and critical-difference rankings reported for Yeast, reinforcing the central claim that relying solely on ROC-AUC is insufficient. In contrast, a multi-metric, cost-aligned protocol reveals differences of practical consequence.
5.1.3. Ozone Dataset Results
Table 7 presents the results of 20 distinct classifier–sampler configurations, including the corresponding confusion matrix components and five evaluation metrics. All evaluations were conducted on the Ozone dataset's test set (unseen data). The empirical evaluation conducted on the Ozone dataset provides further compelling evidence of the limitations inherent in using ROC-AUC as an evaluation metric for rare-event binary classification tasks. Despite ROC-AUC scores consistently appearing moderate to high across multiple classifiers and sampling methods, a detailed examination using alternative metrics reveals substantial shortcomings in the reliability of ROC-AUC for highly imbalanced datasets.
For example, the Logistic Regression classifier combined with SMOTE sampling achieves an ROC-AUC of 0.860, which might initially suggest acceptable model performance. However, this apparent performance contrasts sharply with notably weaker results in critical alternative metrics: an F2-score of 0.336, an MCC of 0.225, and an H-measure of 0.105. This combination also yields a substantial false positive count (FP = 57), highlighting ROC-AUC’s failure to adequately capture the practical implications of increased false alarms.
Similarly, XGBoost with Borderline-SMOTE achieves a relatively moderate ROC-AUC of 0.874; however, a deeper inspection through alternative metrics reveals significant shortcomings. Despite its ROC-AUC score, the combination yields a relatively low F2-score (0.337), MCC (0.294), and H-measure (0.095), alongside an elevated false positive count (FP = 15). These findings further underscore ROC-AUC’s inability to sensitively reflect misclassification costs. Another illustrative case is Logistic Regression with ADASYN sampling. The ROC-AUC score of 0.860 might initially seem satisfactory; however, alternative metrics such as F2 (0.338), MCC (0.228), and H-measure (0.105) clearly indicate substantial performance deficiencies. Moreover, the high false positive count (FP = 56) strongly emphasizes ROC-AUC’s limited sensitivity to the actual cost of misclassification.
In contrast, metrics such as MCC, F2, and H-measure consistently provide a more precise representation of classifier performance by distinguishing between models that perform genuinely well and those that perform inadequately. For instance, the Random Forest classifier combined with Borderline-SMOTE sampling exhibits relatively strong and balanced performance across MCC (0.407), F2 (0.417), and H-measure (0.215), with a comparatively low false positive count (FP = 9), clearly indicating effective classification. In summary, empirical evidence from the Ozone dataset strongly reinforces that ROC-AUC is often misleadingly optimistic when assessing classifier performance in highly imbalanced scenarios. Alternative metrics, particularly MCC, F2, and H-measure, provide a more reliable and practical assessment of classifier effectiveness in rare-event classification tasks.
Table 8 summarizes the comprehensive analysis of the Ozone dataset, capturing the observed performance ranges, sensitivity to variations in false positives and false negatives, and key observations for ROC-AUC, PR-AUC, F₂-score, MCC, and H-measure. This comparative overview highlights the inadequacies of ROC-AUC and underscores the greater practical relevance and accuracy of MCC, F₂, and H-measure in assessing classifier performance on highly imbalanced datasets.
Beyond the scalar summaries in Table 8, a compact cross-metric view is useful.
Figure 3 (a–e) show small-multiples radar plots for the Ozone dataset (≈3% positives), comparing F2, H-measure, MCC, ROC-AUC, and PR-AUC for Random Forest, Logistic Regression, XGBoost, and CatBoost under each resampling strategy. Axes are fixed across panels, scaled to [0,1], and polygons report fold-wise means. As in the prior datasets, these plots are illustrative and provide a compact view of pattern and separation across metrics that complements the subsequent confidence-interval and rank-based analyses.
Two regularities recur. First, ROC-AUC lies close to the outer ring for all models and samplers, yielding limited discriminatory power among classifiers. Second, the threshold-dependent/cost-aligned metrics (MCC and F2) exhibit meaningful spread, with PR-AUC and H-measure generally moving in the same qualitative direction (though less sharply), thereby visualizing the complementarity within the proposed metric bundle.
Dataset-specific nuances are evident. In the Baseline panel, Random Forest forms the broadest and most balanced polygon, leading on MCC, F2, and PR-AUC, while CatBoost is competitive, and Logistic Regression and XGBoost exhibit weaker performance, despite uniformly high ROC-AUC for all four models. Under SMOTE, polygons contract on MCC and F2 across models (with only modest changes in PR-AUC and H-measure), indicating that naive oversampling degrades performance at decision-relevant thresholds even as rank-based AUC remains high. Borderline-SMOTE and SVM-SMOTE partially recover this loss: Random Forest again dominates on MCC and F2, and CatBoost closes the gap, whereas Logistic Regression and XGBoost improve mainly on PR-AUC and H-measure with smaller gains on MCC and F2. The most pronounced divergence occurs under ADASYN: Logistic Regression exhibits a marked increase in PR-AUC (and occasionally H-measure) while collapsing on MCC and F2, a signature of oversampling-induced score/threshold miscalibration that inflates false positives at practical operating points. In contrast, the ensemble methods maintain relatively rounded polygons across samplers, reflecting greater robustness to resampling variance.
Overall, the Ozone radars in Figure 3 (a–e) (i) make the ROC-AUC ceiling effect visually explicit, (ii) reveal consequential sampler–classifier interactions (e.g., ADASYN’s trade-off for Logistic Regression), and (iii) show PR-AUC and H-measure qualitatively tracking the separations exposed by MCC and F2. These visual regularities align with the Kendall-τ concordance and critical-difference rankings reported for Ozone, reinforcing the central conclusion that relying solely on ROC-AUC is insufficient. In contrast, a multi-metric, cost-aligned protocol surfaces operationally meaningful differences among models.
5.2. Cross-Domain Kendall Rank Correlations
5.2.1. Kendall Rank Correlations Between Metrics (Fraud Dataset)
Having analyzed each dataset independently, we now synthesize results across datasets to assess the consistency of metric rankings and statistically significant differences between evaluation metrics. The pairwise Kendall rank correlation coefficients, summarized in Table 9 and illustrated in Figure 4, reveal a statistically coherent structure in how the five evaluation metrics rank the 20 classifier–sampler configurations evaluated on the Fraud dataset, which exhibits a minority class prevalence of approximately 0.17%. Throughout the analysis, τ denotes Kendall’s rank correlation coefficient, and p-values refer to the two-sided significance level derived from the exact null distribution formulated by Kendall [50]. The aim of this analysis is to quantify the degree of concordance between the performance metrics and to highlight the alignment or divergence of ROC-AUC with metrics more sensitive to rare-event classification.
Low Kendall’s τ values indicate that different metrics rank identical classifier–sampler pipelines inconsistently, implying that model selection is highly sensitive to metric choice in ultra-imbalanced settings; conversely, high concordance suggests that metric choice is less consequential for deployment decisions.
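As a sketch of how such a concordance analysis can be computed, the snippet below applies scipy’s Kendall’s τ to every metric pair; it assumes `table` is a pandas DataFrame with one row per classifier–sampler configuration (20 rows here) and one column per metric, which is an illustrative layout rather than the study’s actual data structure. With a small, tie-free sample, scipy’s default method selects the exact null distribution for the two-sided p-value.

```python
from itertools import combinations
import pandas as pd
from scipy.stats import kendalltau

def pairwise_kendall(table: pd.DataFrame) -> pd.DataFrame:
    """Kendall's tau and two-sided p-value for every pair of metric columns."""
    rows = []
    for m1, m2 in combinations(table.columns, 2):
        tau, p = kendalltau(table[m1], table[m2])   # method='auto' uses the exact p-value when feasible
        rows.append({"metric_1": m1, "metric_2": m2, "tau": tau, "p_value": p})
    return pd.DataFrame(rows)

# Example with assumed column names:
# pairwise_kendall(table[["ROC-AUC", "PR-AUC", "F2", "MCC", "H-measure"]])
```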
The results reveal a relatively weak positive correlation between PR-AUC and ROC-AUC (τ = 0.337, p = 0.0398), suggesting that although some concordance exists, it is neither strong nor robust. This weak association supports the notion that ROC-AUC may fail to reliably track changes in precision-recall performance under highly imbalanced conditions. More notably, ROC-AUC exhibits very low and statistically insignificant correlations with F₂ (τ = 0.216, p = 0.183), MCC (τ = 0.047, p = 0.770), and H-measure (τ = 0.179, p = 0.288). These findings emphasize that ROC-AUC rankings are largely disconnected from metrics that prioritize misclassification costs and the effectiveness of detecting rare events.
In contrast, strong and statistically significant correlations are observed among the alternative metrics. PR-AUC shows moderate-to-strong correlations with F₂ (τ = 0.565, p = 0.0005), MCC (τ = 0.438, p = 0.0071), and H-measure (τ = 0.695, p = 0.000003), indicating that these metrics capture similar aspects of classifier performance. Similarly, F₂ correlates strongly with MCC (τ = 0.640, p = 0.000085) and H-measure (τ = 0.533, p = 0.0010), while MCC and H-measure themselves exhibit a strong concordance (τ = 0.617, p = 0.0001).
These results highlight two critical insights: first, the ROC-AUC is weakly aligned with metrics that account for precision, recall, and misclassification asymmetry; second, alternative metrics, such as F₂, MCC, and H-measure, display substantial agreement, reinforcing their utility as complementary and reliable indicators for performance evaluation in highly imbalanced datasets.
5.2.2. Kendall Rank Correlations Between Metrics (Yeast Dataset)
The pairwise Kendall rank correlation coefficients, summarized in Table 10 and illustrated in Figure 5, reveal how the five evaluation metrics rank the 20 classifier–sampler configurations evaluated on the Yeast dataset, which exhibits a minority class prevalence of approximately 1.35%. As before, τ denotes Kendall’s rank correlation coefficient, and p-values refer to the two-sided significance level derived from the exact null distribution [50]. The aim is to assess the degree of concordance between the performance metrics and to investigate the alignment of ROC-AUC with alternative measures sensitive to rare-event classification.
The results reveal an extremely weak and statistically insignificant correlation between PR-AUC and ROC-AUC (τ = 0.105, p = 0.5424), indicating a lack of meaningful concordance between these metrics. Furthermore, ROC-AUC exhibits negligible and non-significant correlations with F₂ (τ = 0.011, p = 0.9475), MCC (τ = 0.039, p = 0.8178), and H-measure (τ = 0.043, p = 0.7947). These findings highlight the discrepancy between ROC-AUC and metrics that prioritize detecting rare events and penalize misclassification costs. In contrast, notable correlations are observed among alternative metrics. PR-AUC shows a strong and statistically significant correlation with H-measure (τ = 0.840, p < 0.0001), suggesting a high degree of agreement in how these metrics rank classifier performance. F₂ and MCC demonstrate a robust concordance (τ = 0.893, p < 0.0001), highlighting their mutual sensitivity to class imbalances. However, F₂ and H-measure (τ = 0.268, p = 0.1132) and MCC and H-measure (τ = 0.162, p = 0.3389) show weaker and statistically non-significant associations.
Overall, these results highlight two key insights: ROC-AUC exhibits minimal alignment with alternative metrics, underscoring its inadequacy in highly imbalanced scenarios; and strong correlations among specific pairs of alternative metrics (particularly F₂ and MCC) indicate that they are particularly relevant in rare-event classification tasks.
5.2.3. Kendall Rank Correlations Between Metrics (Ozone Dataset)
The pairwise Kendall rank correlation coefficients, summarized in Table 11 and illustrated in Figure 6, reveal how the five evaluation metrics rank the 20 classifier–sampler configurations evaluated on the Ozone dataset, which exhibits a minority class prevalence of approximately 3.1%. As before, τ denotes Kendall’s rank correlation coefficient, and p-values refer to the two-sided significance level derived from the exact null distribution [50]. The aim is to evaluate the degree of concordance between the performance metrics and to assess ROC-AUC’s alignment with alternative measures sensitive to rare-event classification.
The results show an extremely weak and statistically insignificant correlation between PR-AUC and ROC-AUC (τ = 0.053, p = 0.7732), suggesting almost no concordance between these metrics. More concerningly, ROC-AUC demonstrates negative correlations with F₂ (τ = -0.185, p = 0.2559), MCC (τ = -0.248, p = 0.1271), and H-measure (τ = -0.168, p = 0.3189), though these associations are not statistically significant. These findings suggest that ROC-AUC may not align with alternative metrics and can even rank classifier performance inversely in some instances, further underscoring its limitations for evaluating imbalanced data. In contrast, strong and statistically significant positive correlations are observed among the alternative metrics. PR-AUC exhibits substantial concordance with MCC (τ = 0.639, p = 0.000086) and H-measure (τ = 0.716, p < 0.0001), highlighting shared sensitivity to precision-recall trade-offs and misclassification costs. Similarly, F₂ correlates strongly with MCC (τ = 0.640, p = 0.000085) and moderately with H-measure (τ = 0.343, p = 0.0349), while MCC and H-measure also display a robust association (τ = 0.554, p = 0.0007).
These findings reinforce two critical insights: ROC-AUC is poorly aligned with alternative metrics and may produce misleading performance rankings in highly imbalanced contexts; meanwhile, the strong concordance among PR-AUC, F₂, MCC, and H-measure underscores their suitability as reliable and complementary metrics for evaluating rare-event classification performance.
5.3. Cross-Metric Synthesis and Evaluation Strategy
The synthesis of results across the Fraud, Yeast, and Ozone datasets reinforces a clear hierarchy among the evaluated metrics. Kendall’s rank correlation analyses consistently demonstrate that τ(MCC, F₂) ≫ τ(PR-AUC, MCC or F₂) ≫ τ(ROC-AUC, any other metric). This ordering highlights MCC and F₂ as capturing similar operational trade-offs, PR-AUC as offering a compatible but threshold-free perspective, and ROC-AUC as providing minimal practical guidance in ultra-imbalanced settings. Consequently, we recommend a reporting bundle of MCC + PR-AUC, with F₂ included when high recall is mission-critical; ROC-AUC should be reported only with explicit caution regarding its limitations in ultra-imbalanced settings.
Table 12 shows that the cross-domain analysis of the three datasets yields consistent conclusions.
The findings reinforce that MCC and F₂-score capture complementary aspects of model performance, reflecting trade-offs between false positives and false negatives at a fixed decision threshold. While MCC offers a symmetric, prevalence-agnostic summary, F₂ is more sensitive to recall and proves particularly useful when the cost of false negatives outweighs that of false positives. PR-AUC, although threshold-independent, aligns reasonably well with these metrics, providing a global view of ranking quality that remains valuable when decision thresholds are not yet defined. ROC-AUC, by contrast, consistently fails to align with operational needs in ultra-imbalanced settings. Its scores remain artificially high even when models exhibit severe false-positive inflation, thus obscuring practical deficiencies that MCC, F₂, and PR-AUC readily expose.
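For reference, the standard definitions behind this contrast are restated below (textbook formulas, not new material from this study): MCC symmetrically combines all four confusion-matrix cells, whereas F_β explicitly weights recall relative to precision.

$$
\mathrm{MCC} = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP+FP)\,(TP+FN)\,(TN+FP)\,(TN+FN)}},
\qquad
F_{\beta} = \frac{(1+\beta^{2})\,P\,R}{\beta^{2}\,P + R},
$$

where P denotes precision, R denotes recall, and β = 2 yields the F₂-score used throughout.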
These observations point to a clear recommendation: PR-AUC and MCC should form the core of any evaluation framework for rare-event classification. Where high recall is critical (for instance, in fraud detection or medical screening), the inclusion of F₂ offers additional insight aligned with stakeholder priorities. ROC-AUC may only be reported for completeness or legacy comparisons if accompanied by a clear disclaimer outlining its insensitivity to class imbalance and misalignment with operational costs. These conclusions are not merely theoretical; they translate into actionable strategies for practitioners working with datasets where the minority class comprises less than 3% of the population. The primary recommendation is to adopt PR-AUC to evaluate global ranking ability and MCC as a threshold-specific measure of balanced performance. In domains where false negatives carry disproportionate risk, such as missed fraud cases or undiagnosed patients, the F₂-score serves as a vital complement, emphasizing recall without compromising the need for precision.
The consistent misbehavior of ROC-AUC in our study warrants caution. In multiple cases, ROC-AUC ranked models favorably even when both MCC and PR-AUC indicated poor discriminative performance. For example, the combination of Logistic Regression with SMOTE in the fraud dataset achieved a ROC-AUC score well above 0.90, despite a significant increase in false positives (FP = 2019, MCC = 0.23), effectively masking operational failure. Such discordance between ROC and MCC rankings (especially when discrepancies exceed 10 percentile points) should be treated as a red flag in model validation pipelines.
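One way to operationalize this red flag in a validation pipeline is sketched below: compute percentile ranks of all configurations under ROC-AUC and under MCC, and flag any configuration whose ROC-AUC rank exceeds its MCC rank by more than 10 points. The `table` DataFrame and its column names are the same illustrative assumptions used earlier.

```python
import pandas as pd

def roc_mcc_red_flags(table: pd.DataFrame, gap: float = 10.0) -> pd.DataFrame:
    """Flag configurations ranked much more favorably by ROC-AUC than by MCC."""
    pct = table[["ROC-AUC", "MCC"]].rank(pct=True) * 100   # percentile ranks, 0-100
    suspicious = (pct["ROC-AUC"] - pct["MCC"]) > gap
    return table.loc[suspicious, ["ROC-AUC", "MCC"]]
```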
Oversampling methods, too, must be evaluated in a contextual manner. While techniques like SMOTE can offer measurable gains in certain domains (e.g., the Yeast dataset), they may introduce detrimental artifacts in other contexts. It is therefore critical that researchers assess the impact of oversampling not only on headline metrics but also on the raw components of the confusion matrix. Finally, in settings where the economic or human cost of misclassification is asymmetric, the flexible Fβ family offers tailored sensitivity. Selecting β between 2 and 4 allows evaluators to reflect real-world stakes by emphasizing recall where it matters most, while retaining the interpretability of a single scalar score.
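A minimal sketch of such a recall-weighted sweep follows, assuming `y_true` and `y_pred` hold the test labels and thresholded predictions from one configuration; the choice of β values simply mirrors the 2–4 range suggested above.

```python
from sklearn.metrics import fbeta_score

# Higher beta weights recall more heavily relative to precision.
for beta in (2, 3, 4):
    print(f"F{beta} = {fbeta_score(y_true, y_pred, beta=beta):.3f}")
```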