Submitted:
30 December 2025
Posted:
30 December 2025
Abstract
Keywords:
1. Introduction
1.1. Our Contributions
1. Coverage-Deferral Trade-off. We identify a fundamental trade-off: Mondrian CP reduces coverage disparity by 26% but increases deferral disparity by 143% compared to global CP (p < 0.001, 100 seeds, 6 datasets).
2. Impossibility Result. We prove an analogous impossibility result for conformal prediction, parallel to Kleinberg et al. [20]: when base rates differ between groups, coverage parity and deferral parity cannot be simultaneously achieved.
3. Metric Selection. We demonstrate that standard EO metrics (TPR gap, FPR gap, Average Odds) are invariant to CP method choice because CP changes prediction sets, not point predictions. This identifies the deferral gap as the key metric capturing CP's unique fairness impact in HITL systems.
4. Practical Guidance. Through a comprehensive sweep of the shrinkage parameter γ, we characterize the trade-off curve and provide actionable recommendations: the global extreme of γ for deferral fairness, the Mondrian extreme for group-conditional coverage validity, and intermediate values for balanced objectives.
2. Related Work
3. Preliminaries
3.1. Conformal Prediction
1. Train classifier f on the training split.
2. Compute conformity scores s_i = s(x_i, y_i) on the calibration split.
3. Compute the quantile q̂ = Quantile(⌈(n+1)(1−α)⌉/n; {s_i}).
4. Form prediction sets: C(x) = {y : s(x, y) ≤ q̂}.
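The split-CP recipe above, together with the Mondrian and shrinkage variants of Sections 3.2–3.3, can be sketched as follows. The Beta-distributed toy scores, the helper names, and the convention that γ = 0 recovers global CP while γ = 1 recovers Mondrian are illustrative assumptions of this sketch, not the paper's implementation.

```python
import numpy as np

def split_cp_threshold(scores, alpha):
    """Finite-sample-corrected (1 - alpha) quantile of calibration scores."""
    n = len(scores)
    q_level = min(np.ceil((n + 1) * (1 - alpha)) / n, 1.0)
    return np.quantile(scores, q_level, method="higher")

rng = np.random.default_rng(0)
alpha = 0.1

# Toy calibration scores s_i = s(x_i, y_i), with group A slightly
# harder (shifted upward) than group B.
scores = np.concatenate([rng.beta(2, 5, 500) + 0.1, rng.beta(2, 5, 500)])
groups = np.array(["A"] * 500 + ["B"] * 500)

# Global CP: one threshold shared by everyone.
q_global = split_cp_threshold(scores, alpha)

# Mondrian CP: a separate threshold per group.
q_mondrian = {g: split_cp_threshold(scores[groups == g], alpha)
              for g in ["A", "B"]}

# Shrinkage: interpolate each group threshold toward the global one
# (gamma = 0 gives global CP, gamma = 1 gives Mondrian, by assumption).
gamma = 0.5
q_shrunk = {g: gamma * q_mondrian[g] + (1 - gamma) * q_global
            for g in ["A", "B"]}

def prediction_set(label_scores, q):
    """Keep every label whose conformity score is at most the threshold."""
    return [y for y, s in enumerate(label_scores) if s <= q]
```

Because the harder group's threshold is larger under Mondrian, its prediction sets shrink relative to the global rule, which is exactly the mechanism behind the deferral-rate asymmetry studied later.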
3.2. Mondrian Conformal Prediction
3.3. Shrinkage Interpolation
3.4. Deferral Protocol
3.5. Fairness Metrics
4. Methodology
4.1. Experimental Design
- Train (60%): Model training and preprocessing fitting
- Calibration (10%): Calibration method fitting (isotonic regression)
- CP (10%): Conformal prediction threshold calibration
- Test (20%): Final evaluation
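The four-way split above can be sketched with index partitioning; `four_way_split` and the shuffling approach are assumptions of this sketch, since the paper does not show its splitting code.

```python
import numpy as np

def four_way_split(n, seed=0):
    """Shuffle indices and split 60/10/10/20 into the four roles
    described in the experimental design."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n)
    b1, b2, b3 = int(0.6 * n), int(0.7 * n), int(0.8 * n)
    return {
        "train": idx[:b1],          # model training + preprocessing fitting
        "calibration": idx[b1:b2],  # calibration method (e.g. isotonic)
        "cp": idx[b2:b3],           # conformal threshold calibration
        "test": idx[b3:],           # final evaluation
    }

splits = four_way_split(10_000)
```

Keeping the calibration and CP splits disjoint matters: reusing the same data for probability calibration and conformal thresholds would break the exchangeability assumption behind the coverage guarantee.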
| Dataset | N | Base Rate | Base Gap | Sensitive Attr |
|---|---|---|---|---|
| Adult | 9,769 | 0.239 | 0.198 | sex (M/F) |
| COMPAS | 1,058 | 0.470 | 0.120 | race (W/B) |
| German Credit | 200 | 0.300 | 0.102 | sex (M/F) |
| Taiwan Credit | 6,000 | 0.221 | 0.031 | sex (M/F) |
| ACS Income | 20,000 | 0.385 | 0.145 | sex (M/F) |
| Bank Marketing | 9,043 | 0.117 | 0.012 | age (≥40) |
4.2. HITL Simulation
- Human accuracy h: Probability of a correct human decision
- Review rate r: Fraction of deferrals actually reviewed
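The HITL simulation with parameters h and r can be sketched as below. The fallback rule for unreviewed deferrals (take the model's point prediction) and all helper names are assumptions of this sketch, not the paper's exact protocol.

```python
import numpy as np

def simulate_hitl(y_true, y_point, deferred, h=0.9, r=1.0, seed=0):
    """Non-deferred cases take the model's point prediction; a fraction r
    of deferred cases is reviewed by a human who is correct with
    probability h; unreviewed deferrals fall back to the point
    prediction (a simplifying assumption of this sketch)."""
    rng = np.random.default_rng(seed)
    decision = y_point.copy()
    reviewed = deferred & (rng.random(len(y_true)) < r)
    human_correct = rng.random(len(y_true)) < h
    # The human outputs the true label when correct, its flip otherwise.
    decision[reviewed] = np.where(human_correct[reviewed],
                                  y_true[reviewed], 1 - y_true[reviewed])
    return decision

# Toy run: 1,000 binary cases, ~30% deferred, model ~80% accurate.
rng = np.random.default_rng(1)
y = rng.integers(0, 2, 1000)
y_hat = np.where(rng.random(1000) < 0.8, y, 1 - y)
defer = rng.random(1000) < 0.3
out = simulate_hitl(y, y_hat, defer, h=0.95, r=1.0)
```

Note that decisions for non-deferred cases are untouched by h and r, which is the structural fact behind the invariance result in Section 5.4.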
5. Experiments and Results
5.1. Standard EO Metrics are Invariant to CP Method
These metrics are functions of the model's point predictions alone, which are unchanged by CP. If researchers evaluated CP methods using only these metrics, they would conclude that all CP methods are equally fair, a potentially misleading conclusion.
5.2. Main Result: Coverage-Deferral Trade-off
- Reduces the coverage gap by 26% on average (0.032 → 0.023), though it increases in 3 of 6 datasets
- Increases deferral gap by 143% on average (0.052 → 0.125)
| Dataset | Coverage Gap (Global) | Coverage Gap (Mondrian) | Deferral Gap (Global) | Deferral Gap (Mondrian) | Ratio |
|---|---|---|---|---|---|
| ACS Income | 0.008 | 0.008 | 0.034 ± 0.007 | 0.054 ± 0.010 | 1.6x |
| Adult | 0.050 | 0.012 | 0.051 ± 0.012 | 0.129 ± 0.018 | 2.5x |
| Bank Marketing | 0.020 | 0.011 | 0.000 ± 0.001 | 0.030 ± 0.008 | 85x† |
| COMPAS | 0.018 | 0.033 | 0.057 ± 0.015 | 0.108 ± 0.022 | 1.9x |
| German Credit | 0.035 | 0.057 | 0.061 ± 0.035 | 0.247 ± 0.048 | 4.0x |
| Taiwan Credit | 0.011 | 0.013 | 0.034 ± 0.009 | 0.077 ± 0.014 | 2.2x |
| Average | 0.032 | 0.023 | 0.052 | 0.125 | 2.4x |
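The two disparity measures tabulated above can be computed as below; this is a minimal sketch assuming binary groups and 0/1 indicator arrays, with hypothetical helper names.

```python
import numpy as np

def coverage_gap(covered, groups):
    """Absolute difference in empirical coverage between the two groups,
    where covered[i] = 1 iff y_i landed in the prediction set C(x_i)."""
    g0, g1 = np.unique(groups)
    return abs(covered[groups == g0].mean() - covered[groups == g1].mean())

def deferral_gap(deferred, groups):
    """Absolute difference in deferral rates between the two groups,
    where deferred[i] = 1 iff |C(x_i)| > 1 triggered human review."""
    g0, g1 = np.unique(groups)
    return abs(deferred[groups == g0].mean() - deferred[groups == g1].mean())

# Tiny worked example: coverage 3/4 vs 4/4, deferral 2/4 vs 1/4.
groups = np.array(["F"] * 4 + ["M"] * 4)
covered = np.array([1, 1, 1, 0, 1, 1, 1, 1], dtype=float)
deferred = np.array([1, 1, 0, 0, 1, 0, 0, 0], dtype=float)
```

In this toy example both gaps equal 0.25; in the paper's experiments the two quantities move in opposite directions as the CP method changes.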
5.3. Gamma Sweep: No Optimal Balance
- Adult: γ vs. coverage_gap: ***; γ vs. deferral_gap: ***
- Bank Marketing: *** and ***, respectively
- All correlations significant at p < 0.001
5.4. HITL Invariance: A Structural Property by Design
- Point prediction EO (FPR/FNR gaps on model predictions): These are invariant to CP method because CP changes prediction sets, not the underlying point predictions. Our experiments confirm this (Table 2: FPR gap and FNR gap are identical across Global, Mondrian, and Shrinkage).
- Final HITL output EO (FPR/FNR gaps on decisions after human review): These depend on human accuracy h and review rate r, which are protocol parameters, not CP properties. Since human decisions are independent of CP method, and non-deferred cases use invariant point predictions, final output EO is primarily determined by HITL protocol parameters, not CP choice.
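The invariance argument above can be illustrated directly: an EO metric that reads only point predictions returns the same value regardless of which deferral mask a CP method induces. The toy data and the `fpr_gap` helper are assumptions of this sketch.

```python
import numpy as np

def fpr_gap(y_true, y_pred, groups):
    """FPR difference between two groups, computed on point predictions
    only (prediction sets never enter this computation)."""
    rates = []
    for g in np.unique(groups):
        negatives = (groups == g) & (y_true == 0)
        rates.append(y_pred[negatives].mean())
    return abs(rates[0] - rates[1])

rng = np.random.default_rng(2)
y = rng.integers(0, 2, 2000)
groups = rng.integers(0, 2, 2000)
y_hat = np.where(rng.random(2000) < 0.85, y, 1 - y)

# Two hypothetical CP methods induce very different deferral masks...
defer_global = rng.random(2000) < 0.10
defer_mondrian = rng.random(2000) < 0.25

# ...but any metric that only reads y_hat cannot tell them apart.
gap_under_global = fpr_gap(y, y_hat, groups)
gap_under_mondrian = fpr_gap(y, y_hat, groups)
```

The two gaps are identical by construction, which is precisely why Table 2 shows 0.00% differences across Global, Mondrian, and Shrinkage.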
6. Discussion
6.1. Connection to Impossibility Results
6.2. Why Standard EO Metrics Fail
1. Metric selection matters. Researchers evaluating CP fairness must use CP-specific metrics (coverage gap, deferral gap, set size gap), not standard EO metrics.
2. CP affects different decisions. Point predictions determine what decision is made; prediction sets determine who makes the decision (model vs. human). These are distinct fairness concerns.
3. Deferral is consequential. In HITL systems, being deferred to human review has real costs: delays, resource consumption, and potentially different treatment. Deferral disparities are fairness concerns even if final decisions are equalized.
6.3. Is Deferral Disparity Harmful?
- Credit scoring: Deferred applicants face processing delays (days to weeks), missing time-sensitive opportunities like promotional interest rates.
- Healthcare triage: Higher deferral rates mean longer waits for specialist review, potentially delaying treatment for one demographic group.
- Employment: Deferred candidates may be deprioritized in fast-moving hiring pipelines, receiving offers after positions are filled.
6.4. Practical Implications
- For coverage parity: Use Mondrian CP (group-specific thresholds). Accept increased deferral disparity.
- For deferral fairness: Use global CP (a single shared threshold). Accept coverage disparities.
- For balanced objectives: Use shrinkage with an intermediate γ. This provides a compromise, though neither criterion is fully satisfied.
- For deployment: Explicitly state which fairness criterion is being optimized and acknowledge the trade-off.
6.5. Limitations
Our deferral rule triggers only when the prediction set contains more than one label, treating only multi-label sets as uncertain. An alternative definition would also defer empty sets. In our experiments, empty sets are rare across methods at our target coverage level, so this choice has minimal impact. For more aggressive coverage levels (larger α), empty sets become more common and the deferral definition choice matters more.
Deferring on multi-label sets is natural for binary classification. For multi-class settings with more than two classes, our framework naturally extends to deferring when the set size exceeds a practitioner-specified budget k. We recommend: (i) setting k = 2 as a default to defer predictions with more than two plausible classes, (ii) using entropy-based thresholds for soft deferral decisions when class probabilities are available, or (iii) calibrating k to achieve a target deferral rate on held-out data. The theoretical trade-off (Theorem 1) extends directly: Mondrian's group-specific thresholds will still induce differential deferral rates across groups, though the magnitude may vary with k.
7. Conclusion
- Statistically robust: Significant at p < 0.001 across 100 seeds
- Cross-dataset consistent: Present in all 6 benchmark datasets
- Structurally inherent: Invariant to HITL parameters
- Monotonic: No optimal shrinkage parameter exists
Supplementary Materials
References
- Anastasios N. Angelopoulos, Stephen Bates, Jitendra Malik, and Michael I. Jordan. Uncertainty sets for image classifiers using conformal prediction. In International Conference on Learning Representations (ICLR), 2021. RAPS: Regularized Adaptive Prediction Sets.
- Julia Angwin, Jeff Larson, Surya Mattu, and Lauren Kirchner. Machine bias: There’s software used across the country to predict future criminals. And it’s biased against blacks. ProPublica, 2016. COMPAS recidivism data; foundational fairness dataset. Available online: https://www.propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing.
- Gagan Bansal, Tongshuang Wu, Joyce Zhou, Raymond Fok, Besmira Nushi, Ece Kamar, Marco Tulio Ribeiro, and Daniel S. Weld. Does the whole exceed its parts? the effect of ai explanations on complementary team performance. In CHI Conference on Human Factors in Computing Systems, 2021. Most accurate AI not always best teammate; optimize for human-AI team. [CrossRef]
- Rina Foygel Barber, Emmanuel J. Candès, Aaditya Ramdas, and Ryan J. Tibshirani. Predictive inference with the jackknife+. Annals of Statistics, 49(1):486–507, 2021. Jackknife+ and cross-conformal methods for reusing training data. [CrossRef]
- Rina Foygel Barber, Emmanuel J. Candès, Aaditya Ramdas, and Ryan J. Tibshirani. Conformal prediction beyond exchangeability. Annals of Statistics, 51(2):816–845, 2023. Handles non-i.i.d. data in conformal prediction. [CrossRef]
- Alexandra Chouldechova. Fair prediction with disparate impact: A study of bias in recidivism prediction instruments. Big Data, 5(2):153–163, 2017. Proves impossibility of calibration + equal error rates; COMPAS analysis. [CrossRef]
- Chi-Keung Chow. On optimum recognition error and reject tradeoff. IEEE Transactions on Information Theory, 16(1):41–46, 1970. Foundational: optimal reject rule for classification. [CrossRef]
- Jesse C. Cresswell, Yi Sui, Bhargava Kumar, and Noël Vouitsis. Conformal prediction sets can cause disparate impact. arXiv preprint arXiv:2410.01888, 2025.
- Dominik Dellermann, Philipp Ebel, Matthias Söllner, and Jan Marco Leimeister. Hybrid intelligence. In Business & Information Systems Engineering, volume 61, pages 637–643, 2019. Taxonomy for human-AI collaboration. [CrossRef]
- Frances Ding, Moritz Hardt, John Miller, and Ludwig Schmidt. Retiring adult: New datasets for fair machine learning. In Advances in Neural Information Processing Systems (NeurIPS) Datasets and Benchmarks Track, 2021. Folktables: ACS-based replacement for UCI Adult.
- Cynthia Dwork, Moritz Hardt, Toniann Pitassi, Omer Reingold, and Richard Zemel. Fairness through awareness. In Proceedings of the 3rd Innovations in Theoretical Computer Science Conference (ITCS), pages 214–226, 2012. Individual fairness via Lipschitz condition. [CrossRef]
- Marc N. Elliott, Allen Fremont, Peter A. Morrison, Philip Pantoja, and Nicole Lurie. Using the census bureau’s surname and geocoding lists to improve estimates of race/ethnicity and associated disparities. Health Services and Outcomes Research Methodology, 9(2):69–83, 2009. BISG: Bayesian Improved Surname Geocoding for race/ethnicity proxy estimation. [CrossRef]
- Michael Feldman, Sorelle A. Friedler, John Moeller, Carlos Scheidegger, and Suresh Venkatasubramanian. Certifying and removing disparate impact. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 259–268, 2015. Disparate impact 80% rule; data repair method. [CrossRef]
- Yonatan Geifman and Ran El-Yaniv. Selective classification for deep neural networks. In Advances in Neural Information Processing Systems (NeurIPS), volume 30, 2017. SelectiveNet: abstain on uncertain cases.
- Isaac Gibbs and Emmanuel Candès. Adaptive conformal inference under distribution shift. In Advances in Neural Information Processing Systems (NeurIPS), volume 34, 2021. Weighted conformal for distribution shift.
- Ben Green and Yiling Chen. Algorithmic risk assessments can alter human decision-making processes in high-stakes government contexts. In Proceedings of the ACM on Human-Computer Interaction (CSCW), volume 5, 2019. Risk scores change human decisions; can increase disparate impact. [CrossRef]
- Moritz Hardt, Eric Price, and Nati Srebro. Equality of opportunity in supervised learning. In Advances in Neural Information Processing Systems (NeurIPS/NIPS), pages 3315–3323, 2016. Introduced Equalized Odds and Equal Opportunity.
- Hans Hofmann. Statlog (german credit data) data set. UCI Machine Learning Repository, 1994. 1,000 loan applicants; credit risk prediction. Available online: https://archive.ics.uci.edu/dataset/144/statlog+german+credit+data.
- Jiaye Huang, Huazhen Xi, Linjun Zhang, and Rina Foygel Barber. Conformal prediction with learned features. Journal of Machine Learning Research, 25:1–45, 2024. SAPS: Sorted Adaptive Prediction Sets.
- Jon Kleinberg, Sendhil Mullainathan, and Manish Raghavan. Inherent trade-offs in the fair determination of risk scores. In 8th Innovations in Theoretical Computer Science Conference (ITCS), 2017. [CrossRef]
- Ronny Kohavi and Barry Becker. Adult data set. UCI Machine Learning Repository, 1996. Census income prediction; 48,842 instances. Available online: https://archive.ics.uci.edu/dataset/2/adult.
- Nikita Kozodoi, Johannes Jacob, and Stefan Lessmann. Fairness in credit scoring: Assessment, implementation and profit implications. European Journal of Operational Research, 297(3):1083–1094, 2022. Fairness-profit trade-off in credit scoring. [CrossRef]
- Jing Lei and Larry Wasserman. Distribution-free predictive inference for regression. Journal of the American Statistical Association, 109(507):1094–1114, 2014. Proved exact conditional coverage impossible distribution-free. [CrossRef]
- Stefan Lessmann, Bart Baesens, Hsin-Vonn Seow, and Lyn C. Thomas. Benchmarking state-of-the-art classification algorithms for credit scoring: An update of research. European Journal of Operational Research, 247(1):124–136, 2015. Benchmarking 41 algorithms on 8 credit datasets. [CrossRef]
- David Madras, Toniann Pitassi, and Richard Zemel. Predict responsibly: Improving fairness and accuracy by learning to defer. In Advances in Neural Information Processing Systems (NeurIPS), volume 31, pages 6147–6157, 2018. Learning to defer framework; fairness through abstention.
- Shira Mitchell, Eric Potash, Solon Barocas, Alexander D’Amour, and Kristian Lum. Algorithmic fairness: Choices, assumptions, and definitions. Annual Review of Statistics and Its Application, 8:141–163, 2021. Statistical perspective on fairness definitions. [CrossRef]
- Sérgio Moro, Paulo Cortez, and Paulo Rita. A data-driven approach to predict the success of bank telemarketing. Decision Support Systems, 62:22–31, 2014. Bank marketing dataset; 45,211 records. [CrossRef]
- Hussein Mozannar and David Sontag. Consistent estimators for learning to defer to an expert. arXiv preprint arXiv:2006.01862, 2020. First consistency guarantee for learning-to-defer.
- Ziad Obermeyer, Brian Powers, Christine Vogeli, and Sendhil Mullainathan. Dissecting racial bias in an algorithm used to manage the health of populations. Science, 366(6464):447–453, 2019. Racial bias in healthcare algorithm. [CrossRef]
- Harris Papadopoulos. Inductive conformal prediction: Theory and application to neural networks. Tools in Artificial Intelligence, pages 315–330, 2008. ICP theory and neural network applications.
- Harris Papadopoulos, Kostas Proedrou, Vladimir Vovk, and Alexander Gammerman. Inductive confidence machines for regression. In European Conference on Machine Learning (ECML), pages 345–356, 2002. Introduced inductive/split conformal prediction. [CrossRef]
- Yaniv Romano, Matteo Sesia, and Emmanuel J. Candès. Classification with valid and adaptive coverage. In Advances in Neural Information Processing Systems (NeurIPS), volume 33, 2020. Adaptive conformal prediction for classification (APS).
- Mauricio Sadinle, Jing Lei, and Larry Wasserman. Least ambiguous set-valued classifiers with bounded error levels. In Journal of the American Statistical Association, volume 114, pages 223–234, 2019. LAC: Minimize set size at target coverage. [CrossRef]
- David W. Scott. Multivariate Density Estimation: Theory, Practice, and Visualization. John Wiley & Sons, 1992. Foundational text for kernel density estimation, Scott’s rule. [CrossRef]
- Sahil Verma and Julia Rubin. Fairness definitions explained. In Proceedings of the ACM/IEEE International Workshop on Software Fairness (FairWare), pages 1–7, 2018. Comprehensive taxonomy of 20+ fairness definitions. [CrossRef]
- Vladimir Vovk, Ilia Nouretdinov, and Alexander Gammerman. Mondrian confidence machine. Technical report, 2003. Introduced group-conditional conformal prediction.
- Vladimir Vovk, Alexander Gammerman, and Glenn Shafer. Algorithmic Learning in a Random World. Springer, 2005. Foundational text introducing conformal prediction framework. [CrossRef]
- I-Cheng Yeh and Che-hui Lien. The comparisons of data mining techniques for the predictive accuracy of probability of default of credit card clients. Expert Systems with Applications, 36(2):2473–2480, 2009. Taiwan credit card default; 30,000 clients. [CrossRef]


| Metric | Global | Mondrian | Shrinkage | Difference |
|---|---|---|---|---|
| EOD (TPR diff) | -0.078 ± 0.001 | -0.078 ± 0.001 | -0.078 ± 0.001 | 0.00% |
| AOD (Avg Odds) | -0.067 ± 0.001 | -0.067 ± 0.001 | -0.067 ± 0.001 | 0.00% |
| FPR gap | 0.060 ± 0.001 | 0.060 ± 0.001 | 0.060 ± 0.001 | 0.00% |
| FNR gap | 0.091 ± 0.001 | 0.091 ± 0.001 | 0.091 ± 0.001 | 0.00% |
| Metric | Global | Mondrian | Shrinkage | Max Diff |
|---|---|---|---|---|
| FPR gap | 0.025 ± 0.019 | 0.025 ± 0.019 | 0.025 ± 0.019 | <0.01 |
| FNR gap | 0.055 ± 0.040 | 0.073 ± 0.062 | 0.064 ± 0.052 | 0.02 |
| Deferral gap | 0.050 ± 0.041 | 0.090 ± 0.070 | 0.065 ± 0.056 | 0.04 |
| Review Rate | Global Def. Gap | Mondrian Def. Gap | Ratio |
|---|---|---|---|
| 0.00 | 0.049 | 0.116 | 2.36x |
| 0.25 | 0.049 | 0.116 | 2.36x |
| 0.50 | 0.049 | 0.116 | 2.36x |
| 0.75 | 0.049 | 0.116 | 2.36x |
| 1.00 | 0.049 | 0.116 | 2.36x |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
