This section presents a detailed empirical investigation into the performance of twenty classifier–sampler configurations across three highly imbalanced datasets: credit card fraud detection, Yeast protein localization, and Ozone level detection. The primary objective is to examine the sensitivity and reliability of five evaluation metrics—ROC-AUC, PR-AUC, F₂-score, MCC, and H-measure—under extreme class imbalance. Rather than comparing classifiers per se, the focus lies on understanding how each metric responds to variations in false positives and false negatives induced by different sampling techniques. Results highlight notable inconsistencies in ROC-AUC’s ability to reflect practical misclassification costs, whereas the alternative metrics align more closely with operational realities and domain-expert expectations.
5.1. Detailed Per-dataset Results
This section contrasts the behaviour of twenty classifier–sampler configurations on three thematically unrelated yet similarly skewed datasets: the credit-card fraud collection, the Yeast protein-localisation set, and the Ozone Level Detection set.
5.1.1. Fraud Dataset
Table 2 presents the results of 20 distinct classifier–sampler configurations, including the corresponding confusion-matrix components and five evaluation metrics; all evaluations were conducted on the test set (unseen data) of the Fraud dataset. Since the study’s objective is metric evaluation rather than model comparison, we examine how each metric responds to the dramatic swings in FP and FN counts that arise under extreme class imbalance. The empirical evaluation conducted on the Fraud dataset clearly demonstrates the limitations inherent in relying on ROC-AUC as an evaluation metric for rare-event binary classification tasks. Although ROC-AUC scores across various classifiers and sampling methods remain consistently high, a deeper inspection of performance using alternative metrics reveals significant shortcomings in ROC-AUC’s reliability for highly imbalanced datasets.
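For reproducibility, the following minimal Python sketch illustrates how the per-configuration metric bundle and confusion-matrix counts reported in Table 2 can be computed from test-set predictions. It relies on scikit-learn; the function name and the dummy arrays are our own illustrative choices, PR-AUC is approximated by average precision, and the H-measure is omitted because scikit-learn does not provide it.

```python
# Minimal sketch (assumptions: y_test holds the true labels, y_score the
# predicted probabilities, y_pred the thresholded decisions; PR-AUC is
# approximated by average precision; H-measure is computed separately).
import numpy as np
from sklearn.metrics import (roc_auc_score, average_precision_score,
                             fbeta_score, matthews_corrcoef, confusion_matrix)

def evaluate_configuration(y_test, y_score, y_pred):
    """Metric bundle and confusion-matrix counts for one classifier-sampler
    configuration evaluated on the held-out test set."""
    tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
    return {
        "ROC-AUC": roc_auc_score(y_test, y_score),          # rank-based
        "PR-AUC": average_precision_score(y_test, y_score),
        "F2": fbeta_score(y_test, y_pred, beta=2),           # recall-weighted
        "MCC": matthews_corrcoef(y_test, y_pred),
        "TP": tp, "FP": fp, "FN": fn, "TN": tn,
    }

# Illustrative call with dummy data (not the study's predictions).
y_test = np.array([0, 0, 0, 0, 1, 0, 1, 0])
y_score = np.array([0.1, 0.2, 0.8, 0.3, 0.9, 0.2, 0.4, 0.1])
print(evaluate_configuration(y_test, y_score, (y_score >= 0.5).astype(int)))
```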
Taking the Logistic Regression classifier with ADASYN sampling as a notable example, the ROC-AUC score is an impressively high 0.968. However, this apparently robust performance contrasts with extremely poor values for other critical metrics: an F2-score of just 0.090, MCC of 0.126, and an H-measure of 0.638. Further exacerbating this discrepancy is the notably large number of false positives (FP=6595), clearly illustrating that ROC-AUC does not adequately penalize the misclassification of negative-class instances.
Another striking contradiction is observed when examining LR with SMOTE sampling. Despite achieving a high ROC-AUC score of 0.968, this combination demonstrates poor F2 (0.237), MCC (0.227), and H-measure (0.651) scores, compounded by an extremely high false-positive count (FP=2019). This trend persists across multiple combinations, highlighting ROC-AUC’s inability to reflect meaningful performance deficiencies in classifiers on highly imbalanced datasets.
The inconsistency in performance indicated by ROC-AUC compared to more practically relevant metrics is further exemplified by the LR classifier combined with Borderline-SMOTE sampling, where an acceptable ROC-AUC score of 0.935 is recorded. Nonetheless, substantial performance issues arise, as clearly evidenced by an F2-score of 0.448, MCC of 0.360, and H-measure of 0.558, coupled with a high false positive count (FP=647). These results underscore the critical failure of ROC-AUC in capturing and penalizing the actual misclassification cost associated with rare-event classes.
Conversely, metrics such as MCC, F2, and H-measure exhibit greater consistency in identifying performance inadequacies, effectively distinguishing between well-performing and poorly performing models. For instance, the baseline Random Forest classifier achieves strong, stable performance across MCC (0.855), F2 (0.796), and H-measure (0.761) with low FP (5), clearly indicative of genuine classification effectiveness.
In summary, the empirical evidence firmly establishes that
despite its widespread use, ROC-AUC frequently offers an overly optimistic and
misleading assessment of classifier performance in highly imbalanced contexts.
Alternative metrics, specifically MCC, F2, and H-measure, are more
effective and accurate indicators of genuine predictive performance and should
be preferred in evaluation methodologies involving rare-event classification.
Table 3 summarizes the analysis conducted on the Fraud dataset, encapsulating the observed performance ranges, sensitivity to variations in false positives and false negatives, and key observations for ROC-AUC, PR-AUC, F₂-score, MCC, and H-measure. This comparative overview underscores significant discrepancies between ROC-AUC and alternative metrics, highlighting ROC-AUC's insufficient sensitivity to misclassification costs in highly imbalanced datasets.
Complementing the scalar summaries in
Table 3, a concise cross-metric visualization aids interpretation.
Figure 1(a)–(e) provide small-multiples radar plots that compare five evaluation criteria—F₂, H-measure, MCC, ROC-AUC, and PR-AUC—for RF, LR, XGB, and CB under each resampling strategy. Axes are fixed across panels and scaled to [0, 1]; polygons report fold-wise means. The purpose is illustrative: to visualize pattern and separation across metrics, complementing the confidence-interval and rank-based analyses reported later.
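For readers who wish to reproduce this kind of panel, the sketch below draws a single radar plot with matplotlib under the same conventions (fixed axes scaled to [0, 1], one polygon per classifier). The scores dictionary contains placeholder values, not the fold-wise means plotted in Figure 1.

```python
# Illustrative radar panel (placeholder scores, not the values in Figure 1).
import numpy as np
import matplotlib.pyplot as plt

metrics = ["F2", "H-measure", "MCC", "ROC-AUC", "PR-AUC"]
scores = {                                   # hypothetical fold-wise means
    "RF": [0.80, 0.76, 0.86, 0.97, 0.82],
    "LR": [0.24, 0.65, 0.23, 0.97, 0.70],
    "XGB": [0.78, 0.74, 0.84, 0.97, 0.83],
    "CB": [0.75, 0.72, 0.82, 0.97, 0.80],
}

angles = np.linspace(0, 2 * np.pi, len(metrics), endpoint=False).tolist()
angles += angles[:1]                         # close the polygon

fig, ax = plt.subplots(subplot_kw={"polar": True}, figsize=(4, 4))
for model, vals in scores.items():
    vals = vals + vals[:1]
    ax.plot(angles, vals, label=model)
    ax.fill(angles, vals, alpha=0.10)

ax.set_xticks(angles[:-1])
ax.set_xticklabels(metrics)
ax.set_ylim(0, 1)                            # fixed [0, 1] scale across panels
ax.set_title("Sampler panel (illustrative)")
ax.legend(loc="upper right", bbox_to_anchor=(1.30, 1.10))
plt.tight_layout()
plt.show()
```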
Two consistent regularities are apparent across all samplers. First, ROC-AUC exhibits a ceiling effect: for every classifier and sampler, the ROC-AUC spoke lies close to the outer ring, producing minimal model separation. Second, the threshold-dependent/cost-aligned metrics—MCC and F2—expose substantial differences that ROC-AUC masks. In particular, LR deteriorates sharply under synthetic-minority schemes: under SMOTE and ADASYN, the LR polygon collapses on the MCC and F2 axes while remaining near-maximal on ROC-AUC, indicating severe precision loss (inflated false positives) that does not materially affect rank-based AUC. The tree/boosted models (RF, XGB, CB) remain comparatively stable on MCC/F2 across samplers, with XGB/RF typically forming the largest polygons (i.e., strongest across the bundle).
The Baseline panel serves as a reference: ensembles
dominate on MCC/F2, while LR trails but does not collapse. Moving to
SMOTE and ADASYN, the LR degradation intensifies—MCC and F2 shrink
markedly—even though PR-AUC and H-measure decline only moderately, and ROC-AUC
stays saturated. This pattern is consistent with decision-boundary distortion
and score miscalibration induced by aggressive oversampling at a prevalence of
0.17%, which disproportionately inflates false positives at practically
relevant thresholds. Borderline-SMOTE and SVM-SMOTE show the same qualitative
behavior but with milder LR degradation; ensembles retain broad, well-rounded
polygons, reflecting robustness to these resampling variants.
Taken together, the radars visualize the complementarity
within the proposed metric bundle. PR-AUC and H-measure track the MCC/F2
separations (though less dramatically), reinforcing their role as
threshold-free and cost-sensitive companions, respectively. Conversely, the
near-constant ROC-AUC across panels underscores its limited diagnostic value in
this ultra-imbalanced setting. These visual regularities align with our
Kendall-τ concordance results (strong agreement among MCC/F2/H/PR-AUC;
weak with ROC-AUC) and the critical-difference rankings that favor tree/boosted
models. We therefore use the radars as an intuitive summary of
sampler–classifier interactions and as corroborating evidence for the central
claim: relying solely on ROC-AUC can misrepresent practical performance,
whereas a multi-metric, cost-aligned protocol reveals operationally meaningful
differences.
5.1.2. Yeast Dataset
Table 4 presents the results of 20 distinct classifier–sampler configurations, including the corresponding confusion-matrix components and five evaluation metrics; all evaluations were conducted on the test set (unseen data) of the Yeast dataset.
The empirical evaluation conducted on the Yeast dataset clearly demonstrates
the limitations inherent in relying on ROC-AUC as an evaluation metric for
rare-event binary classification tasks. Although ROC-AUC scores across various
classifiers and sampling methods frequently appear stable or relatively high,
deeper analysis using alternative metrics uncovers significant shortcomings in
ROC-AUC’s reliability for highly imbalanced datasets.
For instance, the Logistic Regression classifier combined
with SMOTE sampling yields an apparently high ROC-AUC score of 0.908. However,
this performance is contradicted sharply by considerably lower scores in
crucial alternative metrics such as F2 (0.207), MCC (0.174), and
H-measure (0.677). The substantial false-positive count in this scenario (FP=92) further highlights ROC-AUC’s inability to effectively reflect the practical costs associated with increased false alarms.
Similarly, the XGBoost classifier combined with SMOTE
sampling produces a ROC-AUC score of 0.884, which at first glance appears
moderate. However, detailed metrics including F₂ (0.484), MCC (0.455), and
H-measure (0.353) expose critical weaknesses in performance, particularly when
considering that even a relatively modest increase in false positives (FP=4)
can negatively impact the practical effectiveness of the model.
Additionally, analysis of the Logistic Regression
classifier with ADASYN sampling provides further evidence of ROC-AUC’s
limitations. Despite maintaining a high ROC-AUC score (0.908), this combination
demonstrates poor performance in alternative metrics: F2 at 0.146,
MCC at 0.125, and H-measure at 0.677. Moreover, this classifier configuration
suffers from an extremely high false positive count (FP=142), further
underscoring ROC-AUC’s inadequate sensitivity to misclassification costs.
Conversely, metrics such as MCC, F2, and
H-measure consistently provide a more accurate representation of classifier
performance, effectively distinguishing between models performing well and
those not. For example, the baseline Random Forest classifier achieves stable
and relatively high scores across MCC (0.608), F2 (0.536), and
H-measure (0.577) while maintaining a low false positive count (FP=1), clearly
signalling robust classification capability.
In summary, the empirical evidence from the Yeast dataset
conclusively illustrates that ROC-AUC frequently presents a misleadingly
optimistic view of classifier performance in highly imbalanced scenarios.
Alternative metrics such as MCC, F2, and H-measure emerge as more
reliable and practically meaningful model performance indicators in rare-event
classification problems.
Table 5 summarizes
the detailed analysis conducted on the Yeast dataset, presenting the
performance range, sensitivity to false positives and false negatives, and key
observations for ROC-AUC, PR-AUC, F₂-score, MCC, and H-measure. This summary
clearly highlights ROC-AUC’s inadequacy and supports alternative metrics’
practical relevance and greater accuracy for highly imbalanced datasets.
In addition to the scalar results in
Table 5, a compact cross-metric perspective provides an integrated view.
Figure 2(a)–2(e) present small-multiples radar plots for the Yeast dataset (1.35% positives), comparing F₂, H-measure, MCC, ROC-AUC, and PR-AUC for RF, LR, XGB, and CB under each resampling strategy. Axes are fixed across panels, scaled to [0,1], and polygons report fold-wise means. As with the Fraud radars, the goal is illustrative: to visualize patterns and separation across metrics, complementing the following confidence-interval and rank-based analyses.
Two regularities again emerge. First, ROC-AUC remains near
the outer ring for all models and samplers, yielding limited separation.
Second, threshold-dependent/cost-aligned metrics (MCC and F2) reveal
material differences that ROC-AUC alone obscures, with PR-AUC and H-measure
generally moving in the same direction, albeit less sharply.
Dataset-specific nuances are notable. In the Baseline
panel, XGB exhibits a pronounced collapse on F2, MCC, PR-AUC, and H,
despite a high ROC-AUC spoke—an archetypal instance of AUC saturation masking
practically relevant errors. CB and LR form larger, more rounded polygons, and
RF sits in between. Under SMOTE and ADASYN, LR shows a mixed profile: PR-AUC
and H increase substantially, yet MCC (and at times F2) contracts,
indicating that oversampling improves ranking and cost-weighted separation
while simultaneously inflating false positives at decision-useful thresholds
(score–threshold miscalibration). Borderline-SMOTE moderates this tension, with
milder LR degradation on MCC/F2 and stable ensemble performance.
SVM-SMOTE yields the most balanced polygons overall—especially for LR and
CB—suggesting that margin-aware synthesis can enhance both ranking-based and
threshold-dependent metrics on Yeast.
Taken together, these radars (i) make the ROC-AUC ceiling
effect visually explicit; (ii) highlight sampler–classifier interactions that
matter operationally (e.g., XGB’s baseline collapse; LR’s oversampling
trade-offs); and (iii) show PR-AUC and H-measure qualitatively tracking the MCC/F2
separations. The visual patterns are consistent with the Kendall-τ concordance
and critical-difference rankings reported for Yeast, reinforcing the central
claim that relying solely on ROC-AUC is insufficient. In contrast, a
multi-metric, cost-aligned protocol reveals differences of practical
consequence.
5.1.3. Ozone Dataset
Table 6 presents the results of 20 distinct classifier–sampler configurations, including the corresponding confusion-matrix components and five evaluation metrics; all evaluations were conducted on the Ozone dataset's test set (unseen data). The
empirical evaluation conducted on the Ozone dataset provides further compelling
evidence of the limitations inherent in using ROC-AUC as an evaluation metric
for rare-event binary classification tasks. Despite ROC-AUC scores consistently
appearing moderate to high across multiple classifiers and sampling methods,
detailed examination using alternative metrics reveals substantial shortcomings
in ROC-AUC’s reliability for highly imbalanced datasets.
Table 6. The results on the Ozone dataset.

| Sampler | Classifier | ROC-AUC | PR-AUC | F2 | MCC | H | FP | FN | TP | TN |
|---|---|---|---|---|---|---|---|---|---|---|
| Baseline | RF | 0.882 | 0.211 | 0.071 | 0.164 | 0.080 | 1 | 16 | 1 | 537 |
| Baseline | LR | 0.881 | 0.232 | 0.135 | 0.184 | 0.102 | 4 | 15 | 2 | 534 |
| Baseline | XGB | 0.875 | 0.196 | 0.130 | 0.143 | 0.064 | 7 | 15 | 2 | 531 |
| Baseline | CB | 0.894 | 0.225 | 0.068 | 0.094 | 0.089 | 4 | 16 | 1 | 534 |
| SMOTE | RF | 0.863 | 0.362 | 0.407 | 0.381 | 0.237 | 11 | 10 | 7 | 527 |
| SMOTE | LR | 0.860 | 0.195 | 0.336 | 0.225 | 0.105 | 57 | 8 | 9 | 481 |
| SMOTE | XGB | 0.879 | 0.250 | 0.372 | 0.307 | 0.125 | 19 | 10 | 7 | 519 |
| SMOTE | CB | 0.895 | 0.251 | 0.309 | 0.240 | 0.125 | 23 | 11 | 6 | 515 |
| Borderline-SMOTE | RF | 0.833 | 0.339 | 0.417 | 0.407 | 0.215 | 9 | 10 | 7 | 529 |
| Borderline-SMOTE | LR | 0.879 | 0.223 | 0.372 | 0.262 | 0.094 | 44 | 8 | 9 | 494 |
| Borderline-SMOTE | XGB | 0.874 | 0.236 | 0.337 | 0.294 | 0.095 | 15 | 11 | 6 | 523 |
| Borderline-SMOTE | CB | 0.908 | 0.293 | 0.368 | 0.300 | 0.203 | 20 | 10 | 7 | 518 |
| SVM-SMOTE | RF | 0.856 | 0.323 | 0.309 | 0.318 | 0.209 | 8 | 12 | 5 | 530 |
| SVM-SMOTE | LR | 0.878 | 0.214 | 0.375 | 0.266 | 0.085 | 43 | 8 | 9 | 495 |
| SVM-SMOTE | XGB | 0.864 | 0.226 | 0.337 | 0.294 | 0.095 | 15 | 11 | 6 | 523 |
| SVM-SMOTE | CB | 0.906 | 0.259 | 0.385 | 0.330 | 0.111 | 16 | 10 | 7 | 522 |
| ADASYN | RF | 0.854 | 0.325 | 0.361 | 0.357 | 0.175 | 9 | 11 | 6 | 529 |
| ADASYN | LR | 0.860 | 0.204 | 0.338 | 0.228 | 0.105 | 56 | 8 | 9 | 482 |
| ADASYN | XGB | 0.884 | 0.219 | 0.333 | 0.285 | 0.083 | 16 | 11 | 6 | 522 |
| ADASYN | CB | 0.902 | 0.258 | 0.316 | 0.251 | 0.139 | 21 | 11 | 6 | 517 |
For example, the Logistic Regression classifier combined
with SMOTE sampling achieves an ROC-AUC of 0.860, which initially might suggest
acceptable model performance. However, this apparent performance contrasts
sharply with notably weaker results in critical alternative metrics: F2-score
at 0.336, MCC at 0.225, and H-measure at 0.105. This combination also records a substantial false-positive count (FP=57), highlighting ROC-AUC’s failure to adequately capture the practical implications of increased false alarms.
Similarly, XGBoost with Borderline-SMOTE achieves a
relatively moderate ROC-AUC of 0.874, but deeper inspection through alternative
metrics reveals significant shortcomings. Despite its ROC-AUC score, the
combination yields a relatively low F2-score (0.337), MCC (0.294),
and H-measure (0.095), alongside an elevated false positive count (FP=15).
These findings further underscore ROC-AUC’s inability to reflect
misclassification costs sensitively.
Another illustrative case is observed with Logistic
Regression using ADASYN sampling. The ROC-AUC score of 0.860 might initially
seem satisfactory; however, alternative metrics such as F2 (0.338),
MCC (0.228), and H-measure (0.105) clearly indicate substantial deficiencies in
performance. Moreover, the high false positive count (FP=56) strongly
emphasizes ROC-AUC’s limited sensitivity to the actual cost of
misclassification.
In contrast, metrics such as MCC, F2, and
H-measure consistently provide a more precise representation of classifier
performance by distinguishing between models performing genuinely well and
those performing inadequately. For instance, the Random Forest classifier
combined with Borderline-SMOTE sampling exhibits relatively strong and balanced
performance across MCC (0.407), F2 (0.417), and H-measure (0.215)
with a comparatively low false-positive count (FP=9), clearly indicating effective classification performance.
In summary, empirical evidence from the Ozone dataset
strongly reinforces that ROC-AUC is often misleadingly optimistic when
assessing classifier performance in highly imbalanced scenarios. Alternative
metrics, particularly MCC, F2, and H-measure, provide a more
reliable and practical assessment of classifier effectiveness in rare-event
classification tasks.
Table 7 summarizes
the comprehensive analysis of the Ozone dataset, capturing the observed
performance ranges, sensitivity to false positive and false negative
variations, and key observations for ROC-AUC, PR-AUC, F₂-score, MCC, and
H-measure. This comparative overview reinforces ROC-AUC’s inadequacies and
underscores the greater practical relevance and accuracy of MCC, F₂, and
H-measure for assessing classifier performance on highly imbalanced datasets.
Beyond the scalar summaries in
Table 7, a compact cross-metric view is useful.
Figure 3(a)–3(e) show small-multiples radar plots for the Ozone dataset (≈3% positives), comparing F₂, H-measure, MCC, ROC-AUC, and PR-AUC for RF, LR, XGB, and CB under each resampling strategy. Axes are fixed across panels, scaled to [0,1], and polygons report fold-wise means. As in the prior datasets, these plots are illustrative—a compact view of pattern and separation across metrics that complements the subsequent confidence-interval
and rank-based analyses.
Two regularities recur. First, ROC-AUC lies close to the outer ring for all models and samplers, yielding limited discriminatory power among classifiers. Second, the threshold-dependent/cost-aligned metrics—MCC and F2—exhibit meaningful spread, with PR-AUC and H-measure generally moving in the same qualitative direction (though less sharply), thereby
visualizing the complementarity within the proposed metric bundle.
Dataset-specific nuances are evident. In the Baseline
panel, RF forms the broadest, most balanced polygon, leading on MCC, F2,
and PR-AUC, while CB is competitive and LR/XGB trail—despite uniformly high
ROC-AUC for all four models. Under SMOTE, polygons contract on MCC and F2
across models (with only modest changes in PR-AUC/H), indicating that naive
oversampling degrades performance at decision-relevant thresholds even as
rank-based AUC remains high. Borderline-SMOTE and SVM-SMOTE partially recover
this loss: RF again dominates on MCC/ F2, and CB closes the gap,
whereas LR/XGB improve mainly on PR-AUC/H with smaller gains on MCC/F2.
The most pronounced divergence occurs under ADASYN: LR exhibits a marked
increase in PR-AUC (and occasionally H-measure) while collapsing on MCC and F2,
a signature of oversampling-induced score/threshold miscalibration that
inflates false positives at practical operating points. In contrast, the
ensemble methods maintain relatively rounded polygons across samplers,
reflecting greater robustness to resampling variance.
Overall, the Ozone radars (i) make the ROC-AUC ceiling
effect visually explicit, (ii) reveal consequential sampler–classifier
interactions (e.g., ADASYN’s trade-off for LR), and (iii) show PR-AUC/H
qualitatively tracking the separations exposed by MCC/F2. These
visual regularities align with the Kendall-τ concordance and
critical-difference rankings reported for Ozone, reinforcing the central
conclusion that relying solely on ROC-AUC is insufficient. In contrast, a
multi-metric, cost-aligned protocol surfaces operationally meaningful
differences among models.
5.2. Cross-Domain Kendall Rank Correlations
5.2.1. Kendall Rank Correlations Between Metrics (Fraud Dataset)
The pairwise Kendall rank correlation coefficients, summarized in
Table 8 and illustrated in Fig. 4, reveal a statistically coherent structure in how the five evaluation metrics rank the 20 classifier–sampler configurations evaluated on the Fraud dataset, exhibiting a minority class prevalence of approximately 0.17%. Throughout the analysis, τ denotes Kendall’s rank correlation coefficient, and
p-values refer to the two-sided significance level derived from the exact null distribution, following the formulation by Kendall [50].
An additional layer of analysis on the Fraud dataset was conducted using pairwise Kendall rank correlation coefficients (τ), accompanied by two-sided significance levels (p-values) calculated from the exact null distribution [50]. This analysis aimed to evaluate the degree of concordance between different performance metrics and further highlight the relative alignment or divergence of ROC-AUC with metrics more sensitive to rare-event classification.
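The concordance analysis itself is straightforward to reproduce: the sketch below computes τ and an exact two-sided p-value for every metric pair with scipy, assuming a table with one row per classifier–sampler configuration and one column per metric (the random values are placeholders for the 20 configurations in Table 2).

```python
# Sketch of the pairwise Kendall analysis (placeholder data; in practice the
# columns hold the 20 per-configuration metric values from Table 2).
from itertools import combinations
import numpy as np
import pandas as pd
from scipy.stats import kendalltau

rng = np.random.default_rng(0)
results = pd.DataFrame(
    rng.uniform(0.0, 1.0, size=(20, 5)),
    columns=["ROC-AUC", "PR-AUC", "F2", "MCC", "H-measure"],
)

for a, b in combinations(results.columns, 2):
    # method="exact" evaluates the exact null distribution (valid when there
    # are no ties), matching the two-sided p-values reported in Table 8.
    tau, p = kendalltau(results[a], results[b], method="exact")
    print(f"tau({a}, {b}) = {tau:+.3f}, p = {p:.4f}")
```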
The results reveal a relatively weak positive correlation between PR-AUC and ROC-AUC (τ = 0.337, p = 0.0398), suggesting that although some concordance exists, it is neither strong nor robust. This weak association supports the notion that ROC-AUC may fail to reliably track changes in precision-recall performance under highly imbalanced conditions. More notably, ROC-AUC exhibits very low and statistically insignificant correlations with F₂ (τ = 0.216, p = 0.183), MCC (τ = 0.047, p = 0.770), and H-measure (τ = 0.179, p = 0.288). These findings emphasize that ROC-AUC rankings are largely disconnected from metrics that prioritize misclassification costs and rare-event detection effectiveness.
In contrast, strong and statistically significant
correlations are observed among the alternative metrics. PR-AUC shows
moderate-to-strong correlations with F₂ (τ = 0.565, p = 0.0005), MCC (τ =
0.438, p = 0.0071), and H-measure (τ = 0.695, p = 0.000003), indicating that
these metrics capture similar aspects of classifier performance. Similarly, F₂
correlates strongly with MCC (τ = 0.640, p = 0.000085) and H-measure (τ =
0.533, p = 0.0010), while MCC and H-measure themselves exhibit a strong
concordance (τ = 0.617, p = 0.0001).
These results highlight two critical insights: first,
ROC-AUC is weakly aligned with metrics that account for precision, recall, and
misclassification asymmetry; second, alternative metrics such as F₂, MCC, and
H-measure display substantial agreement, reinforcing their utility as
complementary and reliable indicators for performance evaluation in highly
imbalanced datasets.
Figure 4. Kendall rank correlations (τ) heatmap between metrics on the Fraud dataset.
5.2.2. Kendall Rank Correlations Between Metrics (Yeast Dataset)
The pairwise Kendall rank correlation coefficients,
summarized in
Table 9 and illustrated in
Fig. 5, reveal a statistically coherent structure in how the five evaluation
metrics rank the 20 classifier–sampler configurations evaluated on the Yeast
dataset, exhibiting a minority class prevalence of approximately 1.35%.
Throughout the analysis, τ denotes Kendall’s rank correlation coefficient, and
p-values
refer to the two-sided significance level derived from the exact null
distribution, following the formulation by Kendall [50].
An additional layer of analysis on the Yeast dataset was
conducted using pairwise Kendall rank correlation coefficients (τ), accompanied
by two-sided significance levels (p-values) calculated from the exact null
distribution [50]. This analysis aimed to
assess the degree of concordance between different performance metrics and
further investigate ROC-AUC’s alignment with alternative measures sensitive to
rare-event classification.
The results reveal an extremely weak and statistically
insignificant correlation between PR-AUC and ROC-AUC (τ = 0.105, p = 0.5424),
indicating a lack of meaningful concordance between these metrics. Furthermore,
ROC-AUC exhibits negligible and non-significant correlations with F₂ (τ =
0.011, p = 0.9475), MCC (τ = 0.039, p = 0.8178), and H-measure (τ = 0.043, p =
0.7947). These findings underscore the disconnect between ROC-AUC and metrics that prioritize rare-event detection and penalize misclassification costs.
In contrast, notable correlations are observed among
alternative metrics. PR-AUC shows a strong and statistically significant
correlation with H-measure (τ = 0.840, p < 0.0001), suggesting a high degree
of agreement in how these metrics rank classifier performance. F₂ and MCC
demonstrate a robust concordance (τ = 0.893, p < 0.0001), highlighting their
mutual sensitivity to class imbalances. However, F₂ and H-measure (τ = 0.268, p
= 0.1132) and MCC and H-measure (τ = 0.162, p = 0.3389) show weaker and statistically
non-significant associations.
Overall, these results emphasize two key insights: ROC-AUC
shows minimal alignment with alternative metrics, reinforcing its inadequacy in
highly imbalanced scenarios; and strong correlations among specific pairs of
alternative metrics—particularly F₂ and MCC—demonstrate their consistency and
relevance for evaluating classifier performance in rare-event classification
tasks.
Figure 5. Kendall rank correlations (τ) heatmap between metrics on the Yeast dataset.
5.2.3. Kendall Rank Correlations Between Metrics (Ozone Dataset)
The pairwise Kendall rank correlation coefficients,
summarized in
Table 10 and illustrated in
Fig. 6, reveal a statistically coherent structure in how the five evaluation
metrics rank the 20 classifier–sampler configurations evaluated on the Ozone
dataset, exhibiting a minority class prevalence of approximately 3.1%.
Throughout the analysis, τ denotes Kendall’s rank correlation coefficient, and
p-values
refer to the two-sided significance level derived from the exact null
distribution, following the formulation by Kendall [50].
An additional layer of analysis on the Ozone dataset was
conducted using pairwise Kendall rank correlation coefficients (τ), accompanied
by two-sided significance levels (p-values) calculated from the exact null
distribution [50]. This analysis aimed to
evaluate the degree of concordance between different performance metrics and to
assess ROC-AUC’s alignment with alternative measures sensitive to rare-event
classification.
The results show an extremely weak and statistically insignificant
correlation between PR-AUC and ROC-AUC (τ = 0.053, p = 0.7732), suggesting
almost no concordance between these metrics. More concerningly, ROC-AUC
demonstrates negative correlations with F₂ (τ = -0.185, p = 0.2559), MCC (τ =
-0.248, p = 0.1271), and H-measure (τ = -0.168, p = 0.3189), though these
associations are not statistically significant. These findings indicate that
ROC-AUC fails to align with alternative metrics and may rank classifier
performance inversely in some instances, further underscoring its inadequacy
for imbalanced data evaluation.
In contrast, strong and statistically significant positive
correlations are observed among the alternative metrics. PR-AUC exhibits
substantial concordance with MCC (τ = 0.639, p = 0.000086) and H-measure (τ =
0.716, p < 0.0001), highlighting shared sensitivity to precision-recall
trade-offs and misclassification costs. Similarly, F₂ correlates strongly with
MCC (τ = 0.640, p = 0.000085) and moderately with H-measure (τ = 0.343, p =
0.0349), while MCC and H-measure also display a robust association (τ = 0.554,
p = 0.0007).
These findings reinforce two critical insights: ROC-AUC is
poorly aligned with alternative metrics and may produce misleading performance
rankings in highly imbalanced contexts; meanwhile, the strong concordance among
PR-AUC, F₂, MCC, and H-measure underscores their suitability as reliable and
complementary metrics for evaluating rare-event classification performance.
Figure 6. Kendall rank correlations (τ) heatmap between metrics on the Ozone dataset.
5.3. Cross-Metric Synthesis and Evaluation Strategy
The synthesis of results across the Fraud, Yeast, and Ozone datasets reinforces a clear hierarchy among the evaluated metrics. Kendall’s rank correlation analyses consistently demonstrate that τ(MCC, F₂) ≫ τ(PR-AUC, MCC or F₂) ≫ τ(ROC-AUC, any other metric). This ordering highlights MCC and F₂ as capturing similar operational trade-offs, PR-AUC as offering a compatible but threshold-free perspective, and ROC-AUC as providing minimal practical guidance in ultra-imbalanced settings. Consequently, we recommend a reporting bundle of MCC + PR-AUC, with F₂ included when high recall is mission-critical, while relegating ROC-AUC to supplementary material accompanied by explicit caution regarding its limitations.
Table 11 shows that the cross-domain analysis of the three datasets yields consistent conclusions.
The findings reinforce that MCC and F₂-score capture
complementary aspects of model performance, reflecting trade-offs between false
positives and false negatives at a fixed decision threshold. While MCC offers a
symmetric, prevalence-agnostic summary, F₂ is more sensitive to recall and
proves particularly useful when the cost of false negatives outweighs that of
false positives. PR-AUC, although threshold-independent, aligns reasonably well
with these metrics, providing a global view of ranking quality that remains
valuable when decision thresholds are not yet defined. ROC-AUC, by contrast,
consistently misaligns with operational needs in ultra-imbalanced settings. Its
scores remain artificially high even when models exhibit severe false-positive
inflation, thus obscuring practical deficiencies that MCC, F₂, and PR-AUC
readily expose.
These observations point to a clear recommendation: PR-AUC
and MCC should form the core of any evaluation framework for rare-event
classification. Where high recall is critical—for instance, in fraud detection
or medical screening—the inclusion of F₂ offers additional insight aligned with
stakeholder priorities. ROC-AUC may only be reported for completeness or legacy
comparisons if accompanied by a clear disclaimer outlining its insensitivity to
class imbalance and misalignment with operational costs.
These conclusions are not merely theoretical; they
translate into actionable strategies for practitioners working with datasets
where the minority class comprises less than 3% of the population. The primary
recommendation is to adopt PR-AUC to evaluate global ranking ability and MCC as
a threshold-specific measure of balanced performance. In domains where false
negatives carry disproportionate risk, such as missed fraud cases or
undiagnosed patients, the F₂-score is a vital complement, emphasizing recall without
discarding the need for precision.
The consistent misbehavior of ROC-AUC in our study warrants
caution. In multiple cases, ROC-AUC ranked models favorably even when both MCC
and PR-AUC indicated poor discriminative performance. For example, the
combination of Logistic Regression with SMOTE in the fraud dataset achieved a
ROC-AUC well above 0.90 despite a massive spike in false positives (FP = 2019,
MCC = 0.23), effectively masking operational failure. Such discordance between
ROC and MCC rankings—especially when discrepancies exceed 10 percentile
points—should be treated as a red flag in model validation pipelines.
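One way to operationalise this red flag is sketched below: convert each metric's model ranking into percentile ranks and flag any model whose ROC-AUC and MCC percentiles differ by more than 10 points. The pandas recipe and the example scores are illustrative placeholders (loosely based on the Fraud results), not a prescribed procedure.

```python
# Sketch of the suggested red-flag check: flag models whose ROC-AUC and MCC
# percentile ranks diverge by more than 10 points (placeholder scores).
import pandas as pd

scores = pd.DataFrame(
    {"ROC-AUC": [0.968, 0.955, 0.930, 0.975],
     "MCC": [0.227, 0.610, 0.540, 0.855]},
    index=["LR+SMOTE", "XGB+SMOTE", "CB+SMOTE", "RF+Baseline"],
)

pct = scores.rank(pct=True) * 100          # percentile rank under each metric
gap = (pct["ROC-AUC"] - pct["MCC"]).abs()  # disagreement in percentile points
flagged = gap[gap > 10]

print(pct.round(1))
print("Red-flagged models (ROC-AUC vs. MCC gap > 10 percentile points):")
print(flagged.round(1))
```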
Oversampling methods, too, must be evaluated contextually.
While techniques like SMOTE can offer measurable gains in some domains (e.g.,
the Yeast dataset), they may introduce detrimental artifacts elsewhere. It is
therefore critical that researchers assess the impact of oversampling not only
on headline metrics but also on raw confusion-matrix components.
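A minimal way to run such an audit is sketched below: the same classifier is fitted with and without SMOTE inside an imbalanced-learn pipeline (so that resampling is applied only during training), and both headline metrics and raw confusion-matrix counts are reported on a held-out split. The synthetic dataset and parameter choices are assumptions for illustration only.

```python
# Sketch: audit oversampling on headline metrics AND raw confusion-matrix
# counts (synthetic data; requires scikit-learn and imbalanced-learn).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, fbeta_score, matthews_corrcoef
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline

X, y = make_classification(n_samples=5000, n_features=20,
                           weights=[0.98, 0.02], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

models = {
    "baseline": LogisticRegression(max_iter=1000),
    "smote": Pipeline([("smote", SMOTE(random_state=0)),
                       ("clf", LogisticRegression(max_iter=1000))]),
}

for name, model in models.items():
    y_pred = model.fit(X_tr, y_tr).predict(X_te)
    tn, fp, fn, tp = confusion_matrix(y_te, y_pred).ravel()
    print(f"{name:8s} MCC={matthews_corrcoef(y_te, y_pred):.3f} "
          f"F2={fbeta_score(y_te, y_pred, beta=2):.3f} "
          f"TP={tp} FP={fp} FN={fn} TN={tn}")
```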
Finally, in settings where the economic or human cost of
misclassification is asymmetric, the flexible Fβ family offers tailored
sensitivity. Selecting β between 2 and 4 allows evaluators to reflect
real-world stakes—emphasizing recall where it matters most, while retaining the
interpretability of a single scalar score.
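As a small worked illustration of this flexibility, the sketch below evaluates the same fixed predictions under β = 1, 2, 3, 4 with sklearn.metrics.fbeta_score; the arrays are placeholders chosen so that recall is lower than precision, making the recall emphasis of larger β visible.

```python
# Sketch: effect of beta on the F-beta score for fixed placeholder predictions
# (precision = 2/3, recall = 1/2, so the score drifts toward recall as beta grows).
import numpy as np
from sklearn.metrics import fbeta_score

y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 0, 0, 1, 0, 0, 0, 0, 0])  # 2 TP, 2 FN, 1 FP

for beta in (1, 2, 3, 4):
    print(f"F{beta} = {fbeta_score(y_true, y_pred, beta=beta):.3f}")
```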