Preprint Article

This version is not peer-reviewed.

Why ROC-AUC Is Misleading for Highly Imbalanced Data: In-Depth Evaluation of MCC, F2-score, H-measure, and AUC-based Metrics across Diverse Classifiers

Submitted: 09 January 2026

Posted: 13 January 2026


Abstract
This study re-evaluates ROC-AUC for binary classification under severe class imbalance (<3% positives). Despite its widespread use, ROC-AUC can mask operationally salient differences among classifiers when the costs of false positives and false negatives are asymmetric. On three benchmarks with positive rates of 0.17% (credit-card fraud detection), 1.35% (yeast protein localization), and 2.9% (ozone level detection), we compare ROC-AUC with the Matthews Correlation Coefficient (MCC), the F2-score, the H-measure, and PR-AUC. Our empirical analyses span 20 classifier–sampler configurations per dataset: four classifiers (Logistic Regression, Random Forest, XGBoost, and CatBoost) crossed with five sampling strategies (no resampling, SMOTE, Borderline-SMOTE, SVM-SMOTE, and ADASYN). ROC-AUC exhibits pronounced ceiling effects, yielding high scores even for underperforming models. In contrast, MCC and F2 align more closely with deployment-relevant costs and achieve the highest Kendall’s τ rank concordance across datasets; PR-AUC provides threshold-independent ranking, and the H-measure integrates cost sensitivity. We quantify uncertainty and differences using stratified bootstrap confidence intervals, DeLong’s test for ROC-AUC, and Friedman–Nemenyi critical-difference diagrams, which collectively underscore the limited discriminative value of ROC-AUC in rare-event settings. The findings support a shift to a multi-metric evaluation framework: ROC-AUC should not be used as the primary metric in ultra-imbalanced settings; instead, MCC and F2 are recommended as primary indicators, supplemented by PR-AUC and the H-measure where ranking granularity and principled cost integration are required. This evidence encourages researchers and practitioners to move beyond sole reliance on ROC-AUC when evaluating classifiers on highly imbalanced data.
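For concreteness, the contrast between ranking-based and thresholded metrics that the abstract describes can be reproduced with a few lines of scikit-learn. The sketch below is illustrative only: it uses synthetic data with roughly 1% positives in place of the paper's benchmarks, a plain logistic regression in place of the full classifier–sampler grid, and a default 0.5 threshold for the label-based metrics. The H-measure has no scikit-learn implementation and is omitted here; it is available, for example, via the R 'hmeasure' package.

    # Minimal sketch (assumes scikit-learn; synthetic data stands in for the benchmarks).
    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import (roc_auc_score, average_precision_score,
                                 matthews_corrcoef, fbeta_score)

    # Ultra-imbalanced synthetic data: ~1% positives.
    X, y = make_classification(n_samples=20000, weights=[0.99],
                               flip_y=0.01, random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    score = clf.predict_proba(X_te)[:, 1]   # ranking scores for AUC-type metrics
    pred = (score >= 0.5).astype(int)       # thresholded labels for MCC and F2

    print("ROC-AUC:", roc_auc_score(y_te, score))           # often near-ceiling
    print("PR-AUC :", average_precision_score(y_te, score)) # typically far lower
    print("MCC    :", matthews_corrcoef(y_te, pred))
    print("F2     :", fbeta_score(y_te, pred, beta=2))

Run on data this imbalanced, ROC-AUC typically sits near its ceiling while PR-AUC, MCC, and F2 spread the same models out, which is the ranking-concordance effect the study quantifies.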
Copyright: This open access article is published under a Creative Commons CC BY 4.0 license, which permits free download, distribution, and reuse, provided that the author and preprint are cited in any reuse.