Preprint
Article

This version is not peer-reviewed.

Software Unfairness Detection in Machine Learning-based Systems: A Systematic Mapping Study

Submitted:

31 March 2026

Posted:

01 April 2026


Abstract
Machine learning-based systems are increasingly deployed in high-stakes domains such as healthcare, finance, law, and e-commerce, where their predictions directly influence critical decisions. Although these systems offer powerful data-driven support, they also introduce serious concerns related to fairness, bias, and discrimination. As a result, detecting and addressing unfairness in machine learning software has become a central research challenge. This study presents a systematic mapping of research on software unfairness detection in machine learning systems, with the aim of consolidating existing fairness definitions, identifying major problem types, examining testing approaches, reviewing commonly used datasets, and highlighting open research gaps. A structured search was conducted across five major digital libraries and additional sources, covering publications from 2010 to 2025. From 1,805 initially identified records, 67 primary studies met the inclusion and quality assessment criteria. The findings show that research activity has grown significantly since 2019, reaching a peak in 2022. Most studies were published at conferences, followed by journals and workshops. The literature addresses various themes, including analysis of existing fairness methods, bias mitigation strategies, testing techniques, and evaluation frameworks. Fairness testing was performed at unit, integration, and system levels, with integration testing being the most common. Frequently used datasets include COMPAS, Adult Census Income, and German Credit. Widely adopted tools such as IBM AI Fairness 360, Themis, and Aequitas were also identified. Overall, the mapping highlights progress made in fairness research while emphasizing the need for stronger integration of fairness into practical machine learning development.

1. Introduction

Machine learning-based systems (MLS) have become deeply embedded in contemporary decision-making across sensitive domains such as healthcare, government services, education, business, finance, and security. Their capacity to process large-scale data and extract actionable patterns has positioned MLS as critical decision-support instruments in high-stakes contexts. Because MLS outputs can significantly affect individuals and groups, making those predictions and recommendations credible (through transparency of process, transparency of outcome, fairness, and equitable treatment) is essential for societal trust and responsible, ethical deployment. When MLS produce biased outcomes, they can cause social injustice that amplifies historical inequities and perpetuates discrimination in hiring, lending, and medical prioritization, contexts where both procedural justice and substantive outcomes matter.
Machine learning itself can be broadly categorized into supervised, unsupervised, semi-supervised, and reinforcement learning methods, each with distinctive ways of learning from data and making predictions [1]. Furthermore, systematic approaches for conducting structured reviews in software engineering and MLS research have been established and guide studies like this one [2].
Software fairness has been articulated through multiple complementary definitions. One widely employed perspective states that “an algorithm is fair if it gives similar predictions to similar individuals. Any two individuals who are similar with respect to a similarity metric defined for a particular task should be classified in the same way.” [3] A second, frequently cited view holds that fairness is achieved when protected attributes (e.g., race, gender, age) are excluded from influencing outcomes [4]. These definitions motivate a family of operational metrics that enable empirical assessment, including disparate impact, demographic parity, equalized odds, and fairness through awareness, each capturing a distinct notion of equitable treatment and outcomes [5]. No single metric is adequate in every situation; rather, metrics must be selected and interpreted in relation to the application context and the kinds of harms at stake. The very multiplicity of metrics reflects both the conceptual complexity of fairness and the practical difficulty of ensuring equitable treatment in real systems [68].
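To make two of these group metrics concrete, the following illustrative Python sketch (not drawn from any primary study; the predictions and group labels are synthetic) computes the demographic parity gap and the disparate impact ratio for toy binary predictions:

```python
# Illustrative only: toy computation of demographic parity and disparate impact.
# Group "A" plays the privileged role, "B" the unprivileged role (an assumption).

def positive_rate(preds, groups, group):
    """Fraction of instances in `group` that received a positive prediction."""
    members = [p for p, g in zip(preds, groups) if g == group]
    return sum(members) / len(members)

def demographic_parity_gap(preds, groups):
    """Absolute difference in positive-prediction rates between the two groups."""
    return abs(positive_rate(preds, groups, "A") - positive_rate(preds, groups, "B"))

def disparate_impact(preds, groups):
    """Ratio of positive rates (unprivileged / privileged); the common
    'four-fifths rule' flags values below 0.8 as potential disparate impact."""
    return positive_rate(preds, groups, "B") / positive_rate(preds, groups, "A")

preds  = [1, 1, 0, 1, 0, 0, 1, 0]                 # model decisions (1 = favorable)
groups = ["A", "A", "A", "A", "B", "B", "B", "B"]

print(demographic_parity_gap(preds, groups))  # 0.75 - 0.25 = 0.5
print(disparate_impact(preds, groups))        # 0.25 / 0.75 ≈ 0.33 (< 0.8)
```

A full fairness audit would compute several such metrics, since satisfying one (e.g., demographic parity) does not imply satisfying others (e.g., equalized odds).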
Fairness challenges can manifest at all stages of the MLS lifecycle. In the preprocessing stage, fairness work targets the training data (e.g., addressing imbalance and representational issues) before model induction. In the in-processing stage, algorithm design and optimization incorporate fairness considerations directly into learning procedures or objective functions. In the post-processing stage, predictions are examined and, if needed, adjusted to avoid disparate impacts across groups [6]. Separating out these stages of the design, development, and evaluation of AI systems shows the importance of taking fairness into account across the entire process, not just as a one-off check but as ingrained practice from start to finish [76].
Although fairness concerns are highly visible in AI work today, the literature on detecting unfairness is still maturing. Previous surveys and summaries have been helpful, consolidating what is known about metrics, mitigation methods, and algorithmic treatments of bias and discrimination. However, these works tend to focus on specific technical strata (e.g., metric definitions, bias mitigation strategies, or algorithm families) without providing a consolidated mapping that simultaneously: (a) synthesizes fairness definitions, (b) categorizes problem types tackled in the field (analysis, mitigation, testing, evaluation), (c) identifies approaches to fairness testing (including where they sit in the MLS lifecycle), (d) enumerates datasets/algorithms/models used to detect unfairness (and why they exhibit bias), and (e) surfaces research gaps and trends across time and publication venues. Consequently, it remains difficult to observe field-level structure, compare practices across subcommunities, and identify consistent trends that would support standardization and reproducibility.
This fragmentation is compounded by the dispersion of contributions across software engineering, machine learning, data science, and applied computing venues. Researchers in these communities often apply different fairness definitions and evaluation protocols, complicating cross-study synthesis. At the same time, the ecosystem’s reliance on commonly used datasets and benchmarks, each with known limitations, further motivates careful, structured accounting of what has been studied, how it has been evaluated, and where persistent gaps remain. Taken together, these observations motivate an integrative perspective that moves beyond isolated algorithmic or metric-centric reviews toward a systematic mapping study (SMS) that organizes the field along multiple, interlocking dimensions (definitions, problem types, testing approaches, datasets/algorithms/models, and research gaps/trends) over a defined time window.
Accordingly, this study undertakes an SMS of software unfairness detection in MLS over the 2010–2025 period, using established SMS procedures to search, screen, and assess the literature before extracting and synthesizing data aligned with predefined research questions. Unlike narrower technical surveys, this mapping emphasizes breadth with structure: it documents where research is published (venues/years), what kinds of fairness problems are being addressed, how fairness is tested across the MLS pipeline, which datasets/algorithms/models are used to reveal or study unfairness (including reasons underpinning their biases), and what gaps and trends emerge from the collective evidence. By consolidating these elements into a single, organized account, the study provides a platform for understanding the field’s current state and for charting principled directions for future work [37].
To situate this contribution within existing scholarship, it is useful to distinguish the present SMS from prior work. Earlier reviews offer valuable treatments of fairness metrics and mitigation strategies or discuss algorithmic bias from particular vantage points. However, they do not concurrently (a) map definitions, (b) classify problem types, (c) survey fairness testing approaches by lifecycle stage, (d) catalog datasets/algorithms/models used to detect unfairness with attention to bias sources, and (e) synthesize gaps and trends across venues and years within a single, unified framework. Nor do they provide a consolidated picture that speaks directly to software engineering concerns (e.g., testing levels and lifecycle integration) alongside machine-learning-centric perspectives. This study addresses that need through an SMS design explicitly structured to capture these dimensions and report their interrelationships.
Finally, although the Introduction deliberately avoids detailed tutorials on ML algorithms, data modalities, and deep learning architectures—topics that are covered comprehensively in the Background chapter—the present section maintains the essential foundations required to motivate a fairness-centered SMS. Specifically, it retains: (i) the core operational definitions of fairness and the standard metrics by which it is evaluated, [3,4,5] (ii) the three-stage view of where fairness work occurs in the MLS lifecycle, [6] and (iii) the overarching rationale for end-to-end fairness verification and validation in decision-critical contexts. Broader technical details (e.g., supervised vs. unsupervised learning, neural architecture, and dataset taxonomies) are reserved for the Background so the Introduction can maintain a focused funnel from societal context to fairness problem, to literature limitations, to the study’s objectives and scope.

Main Objective

The goal of this study is to present the results of a systematic mapping study of software unfairness detection in MLS. The primary objective is to answer the predefined research questions with systematically analyzed results and to report the frequencies of proposed solutions, thereby identifying the kinds of research conducted in this field, their quantities, and their outcomes.
Specific Objectives
  • Identify the most valuable venues of papers in the field of unfairness detection in MLS.
  • Explore different software fairness definitions.
  • Recognize types of addressed problems (detection, analysis, or evaluation).
  • Find approaches for fairness testing (algorithms or tools) and explore fairness testing levels in MLS.
  • Provide researchers with datasets, algorithms, and models for detecting unfairness in MLS and explain the reasons behind their biases.
  • Investigate the gaps in software fairness in MLS research topics.
The rest of the report is organized as follows: research methodology (including threats to validity), background, results and discussion, related work, and conclusion and future work.

2. Research Methodology

2.1. Methods Overview

This section provides an overview of the research methodology. Following the guidelines [75,82], our process was conducted as follows: the first step was to define the research questions and determine the scope (see Table 1). The second step involved searching designated digital libraries using a carefully constructed search string. All collected papers were then screened by applying inclusion and exclusion criteria. Afterward, a quality assessment was performed to categorize the collected studies. Next, relevant papers were grouped, and keywords were classified for each study. Finally, data extraction and a structured literature review were carried out to complete the systematic mapping study. Figure 1 illustrates the main steps of the SMS. In this research, we applied the PICOC (Population, Intervention, Comparison, Outcome, and Context) criteria to define the research questions [75]. These were formulated from five perspectives:
  • Population: Machine learning-based systems, artificial intelligence, fairness techniques, methods, models, and bias.
  • Intervention: Software engineering, software unfairness detection.
  • Comparison: Research publication statistics, fairness approaches, fairness testing levels, evaluation metrics, solutions, biased algorithms, and datasets.
  • Outcomes: Software fairness definitions and techniques used in MLS, fairness testing levels, biased algorithms and datasets, solutions for mitigating unfairness, and identification of research gaps.
  • Context: Research content relevant to both academia and industry.

2.2. Search Strategy

To collect primary studies, we searched across several digital libraries, categorized as main and supplementary sources (see Table 2):
Because the focus of this study is fairness in machine learning, the search keys were divided into two categories: one for fairness-related terms and their synonyms, and another for machine learning terms and alternatives. The search process began with the keywords (“fairness” and “machine learning”), which were then expanded with related terms. The search string was refined iteratively by both authors, using feedback from trial searches to improve the coverage of relevant studies [8].
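As a hedged illustration of this strategy (the exact terms used in the study's protocol may differ; the keyword lists below are hypothetical examples), the two keyword categories can be combined into a boolean query by OR-ing the synonyms within each category and AND-ing the categories:

```python
# Hypothetical sketch of combining the two keyword categories into one
# boolean search string; actual protocol terms are not reproduced here.

fairness_terms = ["fairness", "unfairness", "bias", "discrimination"]
ml_terms = ["machine learning", "artificial intelligence", "ML-based system"]

def build_query(group_a, group_b):
    """OR the synonyms within each category, then AND the two categories."""
    clause = lambda terms: "(" + " OR ".join(f'"{t}"' for t in terms) + ")"
    return clause(group_a) + " AND " + clause(group_b)

query = build_query(fairness_terms, ml_terms)
print(query)
# ("fairness" OR "unfairness" OR "bias" OR "discrimination") AND
# ("machine learning" OR "artificial intelligence" OR "ML-based system")
```

When a library's search engine rejected the full string, the same clauses were issued separately, which mirrors the iterative refinement described above.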

2.3. Narrative for Study Screening and Selection

The initial search across all selected databases retrieved a total of 1,805 records, representing all studies containing at least one of the identified search terms related to fairness and machine learning.
In the first screening phase, duplicate entries across the digital libraries were removed. The inclusion criteria were then applied; studies had to be:
(a) published in academic journals, conferences, or magazines,
(b) written in English,
(c) full-text accessible, and
(d) published between 2010 and 2025.
After applying these filters, the pool was reduced from 1,805 to 1,370 studies (see Figure 2).
The second phase involved title and abstract screening. At this stage, the exclusion criteria were applied:
(a) duplicate records not previously detected,
(b) studies on irrelevant topics, and
(c) non-published works.
Manual checks were performed to ensure that both conference papers and their extended journal versions were appropriately retained. After this step, the number of studies was further reduced (see Figure 3).
The third phase involved full-text screening of the remaining articles. Papers were excluded if they did not directly address software fairness in MLS, if they lacked substantive discussion of unfairness detection, or if they did not provide sufficient methodological or empirical content to answer the predefined research questions.
Finally, a quality assessment (QA) was applied using the predefined QA checklist. Each paper was evaluated against three questions: (a) Does the study specify the goal of the research? (b) Does the study propose a solution for unfairness detection in MLS? (c) Does the study evaluate the proposed idea? Scores were assigned as Yes = 1, Partly = 0.5, and No = 0, with a cumulative threshold of 1.5. Studies scoring at or below this threshold were excluded. After all these steps, 67 primary studies remained and were included in the final synthesis of this systematic mapping study (see Figure 4).

2.4. Quality Assessment and Data Analysis

To evaluate the literature, we applied a quality assessment procedure using predefined quality assessment (QA) questions. Each study was reviewed manually, and scores were assigned as follows: Yes = 1, Partly = 0.5, No = 0. The three QA criteria were:
  • Does the study specify the goal of the research?
  • Does the study propose a solution for unfairness detection in MLS?
  • Does the study evaluate the proposed idea?
The cumulative score for each study was calculated. A threshold score of 1.5 was set; studies scoring at or below this threshold were excluded. After applying this assessment, the number of primary studies was reduced to 67 (see Figure 5, Appendix 1).
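The scoring scheme above can be expressed as a minimal sketch (the answer patterns shown are hypothetical examples, not scores of actual primary studies):

```python
# Minimal sketch of the QA scoring scheme: three questions scored
# Yes = 1, Partly = 0.5, No = 0; studies scoring at or below the
# cumulative threshold of 1.5 are excluded.

SCORES = {"yes": 1.0, "partly": 0.5, "no": 0.0}
THRESHOLD = 1.5

def qa_score(answers):
    """Sum the scores of a study's three QA answers."""
    return sum(SCORES[a] for a in answers)

def passes_qa(answers):
    """A study is retained only if its cumulative score exceeds the threshold."""
    return qa_score(answers) > THRESHOLD

print(passes_qa(["yes", "yes", "partly"]))  # True  (2.5 > 1.5, retained)
print(passes_qa(["yes", "partly", "no"]))   # False (1.5, at threshold, excluded)
```

Note that a study scoring exactly 1.5 is excluded, consistent with the "at or below" wording of the threshold rule.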

2.5. Data Extraction

In the data extraction stage, relevant information was collected from each selected study to answer the research questions. An Excel file was used to organize extracted information and generate charts. Keywords were defined for each research question (Table 3), and synonyms were used when necessary to avoid missing information. Extracted findings were synthesized and presented in the Results and Discussion section.

2.6. Threats to Validity

One threat to validity is that researchers working on fairness may use different terminology, which could cause important studies to be missed. To mitigate this, the search string was constructed with multiple alternative keywords, adapted to each research question and tailored to the search functionality of each digital library. When the full string retrieved no results, the keywords were separated and tested individually. This strategy minimized the risk of incomplete coverage and increased the consistency of the collected dataset.
The following section provides the background necessary to understand the scope and significance of the study. It outlines the theoretical foundations, previous research, and key concepts relevant to the topic.

3. Background

3.1. Machine Learning-Based Systems and Deep Learning

A machine learning (ML) model is a trained instance of a specific ML algorithm. ML algorithms typically fall into four main types: supervised learning, unsupervised learning, semi-supervised learning, and reinforcement learning.
Supervised Learning: These algorithms create a mathematical model from a dataset containing inputs and corresponding outputs. The dataset is known as training data, and the labels represent the desired outcomes. Supervised learning is commonly applied in classification and regression tasks [68].
Unsupervised Learning: These algorithms analyze a dataset containing only inputs in order to find hidden structures or patterns, clustering or grouping data based on feature similarities. Unlike supervised methods, they require no labeled data; examples include clustering and association. Self-supervised learning is a special case of unsupervised learning in which the model derives labels from the input data itself and uses them for learning; it is common in natural language processing and computer vision [69].
Semi-Supervised Learning: This method of learning includes a small amount of labeled data and a large amount of unlabeled data during the training process, which brings considerable improvement in accuracy [68].
Reinforcement Learning: This strategy uses the interaction between the model and environment where the model takes action to maximize cumulative feedback or rewards. Reinforcement learning is popular in robotics, gaming, and navigation [69].
Neural Networks (NNs): An NN is inspired by neurons in the human brain, where one neuron passes information on to another. Each neuron processes its inputs before passing information onward. Likewise, an NN receives input through its input layer, processes it in one or more hidden layers, and produces output from its output layer [11].
Deep Neural Networks (DNNs): A DNN has multiple layers of connected nodes, each building on the previous layer to produce a more refined prediction. Passing information forward through the layers is called forward propagation; moving through the network in reverse is called backpropagation, which adjusts the weights of the neurons so that prediction errors are minimized. The DNN alternates these two passes to learn accurate estimates and correct its errors. More simply, deep learning consists of: “many hidden layers of neural networks, performing complex manipulation over excessively large amounts of structured and unstructured data…any form of data such as images, text, sound, and time series, using sufficient amounts of training data to improve estimates” [12].
In traditional ML algorithms, the features used for making a prediction are selected by the researcher or user. In DNN algorithms, the features used as inputs for prediction are learned automatically from the datasets during training. A DNN can ingest very large amounts of data and, while processing it, learn the features of heterogeneous or homogeneous data that are relevant to the task, estimating weights and biases so as to minimize the error between the DNN's prediction and the expected outcome.
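The two passes described above can be illustrated on the simplest possible "network", a single sigmoid neuron trained by gradient descent (all data here is synthetic and illustrative only; real DNNs repeat the same forward and backward passes across many layers):

```python
import math

# Illustrative sketch: forward propagation and backpropagation on one
# sigmoid neuron. Inputs and labels are a synthetic, separable toy problem.

w, b = 0.0, 0.0                                    # trainable weight and bias
data = [(0.0, 0), (1.0, 0), (2.0, 1), (3.0, 1)]    # (input, label) pairs

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

for _ in range(2000):                 # gradient-descent training loop
    for x, y in data:
        p = sigmoid(w * x + b)        # forward propagation: input -> prediction
        grad = p - y                  # backpropagation: error signal dLoss/dz
        w -= 0.1 * grad * x           # adjust weight to reduce the error
        b -= 0.1 * grad               # adjust bias to reduce the error

print(round(sigmoid(w * 0.0 + b)))    # 0 : low input -> negative class
print(round(sigmoid(w * 3.0 + b)))    # 1 : high input -> positive class
```

The `grad = p - y` line is the derivative of the cross-entropy loss with respect to the neuron's pre-activation; in a multi-layer network the same error signal would be propagated backwards through each layer.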

3.2. Deep Learning Tasks and Architectures

Research on deep learning has generated several prominent architectures that have been successfully deployed across many applications. Convolutional Neural Networks (CNNs) are the dominant model in computer vision, learning local and hierarchical features from image data through convolutional and pooling layers. CNNs have achieved state-of-the-art performance in tasks including image classification, object detection, and medical image analysis, and they are the most frequently studied models when addressing fairness concerns related to demographic bias in vision tasks [20].
Recurrent Neural Networks (RNNs), including their best-known variant, the Long Short-Term Memory (LSTM) model, are designed for sequential or time-series data. RNNs and LSTMs are especially well suited to modeling sequential dependencies, as they build contextual representations of the data, making them important for natural language processing (NLP), speech recognition, and other fields with time-dependent data. Fairness concerns with RNNs and LSTMs arise most strongly when sequential dependencies encode historical or social inequities, as seen in predictive text and risk-assessment tools [12,13].
More recently, the transformer architecture has become the leading standard in NLP and related fields. Transformers leverage self-attention mechanisms and can therefore represent long-range dependencies better than standard RNNs, which makes them suitable for building large-scale models such as BERT and GPT. While these models represent a considerable advance in performance, growing scale and training data raise concerns about hidden biases absorbed from large text corpora. Because CNNs, RNNs/LSTMs, and transformers are foundational components of many modern MLS applications, they are also frequent targets of fairness detection and mitigation research, linking technical advances in architecture design directly to ethical considerations in machine learning [8,20].

3.3. ML and DL Tasks in Data Mining

Predictive Tasks: Classification, Regression, and Sequence Prediction.
In classification problems, the system assigns input data to a class using algorithms such as Decision Trees, k-Nearest Neighbors, or Support Vector Machines (SVM) [74]. In regression problems, the goal is to predict one or more continuous values, using algorithms such as Linear Regression or the Multi-Layer Perceptron [38]. Classification and regression models are supervised learning algorithms because they require labeled data: a training dataset in which every label is a class or a continuous value [69]. Sequence prediction is when a system predicts (or forecasts) one or more future values based on the past data it has observed; it is widely used in natural language processing, bioinformatics, time series analysis, and forecasting [83].
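To ground the classification task, the following minimal sketch implements k-Nearest Neighbors, one of the algorithms named above (the one-dimensional training points and labels are synthetic, for illustration only):

```python
from collections import Counter

# Illustrative sketch: k-Nearest Neighbors classification by majority vote
# among the k closest training points. Data is a synthetic 1-D toy set.

def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    by_dist = sorted(train, key=lambda xy: (xy[0] - query) ** 2)
    votes = Counter(label for _, label in by_dist[:k])
    return votes.most_common(1)[0][0]

train = [(0.1, "low"), (0.3, "low"), (0.4, "low"),
         (2.1, "high"), (2.4, "high"), (2.6, "high")]

print(knn_predict(train, 0.2))   # "low"  : nearest neighbors are 0.1, 0.3, 0.4
print(knn_predict(train, 2.5))   # "high" : nearest neighbors are 2.4, 2.6, 2.1
```

A regression variant would average the neighbors' numeric targets instead of taking a majority vote, which is the essential difference between the two supervised task types.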
Descriptive Tasks: Clustering, Association, and Anomaly Detection.
Clustering groups items into clusters such that similarity is high among objects (or instances) within the same cluster and low across different clusters. Clustering is an unsupervised learning strategy. Common clustering algorithms include Hierarchical Agglomerative Clustering, K-means++, K-Medoids, and Gaussian Mixture Models [81]. In addition, clustering is an area where neural networks (NNs) may be applied, and it has received more prominence in recent work.
Association focuses on discovering relationships among variables within a dataset and is widely used in association-based systems such as recommendation systems. Commonly used association algorithms include, but are not limited to, Apriori, Eclat, and FP-Growth [83].
Anomaly detection is identifying rare, usually unlabeled observations. Anomalies may represent something significant (e.g., some important incident or event) or may just be an error. In some cases, discriminating between what is important or an error depends on context, for example, fraud detection in banking; intrusion detection in cybersecurity; etc. [14].

3.4. Types of Datasets

Datasets are often categorized as structured or unstructured.
Structured Datasets: Think of a structured dataset as a table of rows and columns, in which the columns are referred to as features and the rows are instances. Structured datasets usually exist as a number of different file types, including CSV files, spreadsheets, XML or JSON files, or in a relational database like MySQL. Structured datasets are generally used in regression and classification tasks. In a more specific way, structured datasets can be classified as categorical datasets, numerical datasets, time series datasets, and spatial datasets [73]. Examples include Kaggle datasets and the UCI Machine Learning Repository.
Unstructured Datasets: These datasets lack a predefined structure and may include text, images, audio, or video data. They are frequently used in domains such as natural language processing (NLP), speech recognition, and computer vision. Unstructured datasets can be categorized into text, image, audio, and video datasets [40]. For example, Common Crawl is widely used in NLP tasks, while Reddit datasets are often used in sentiment analysis.

3.5. Software Fairness

Software fairness is a sensitive and important attribute that can be measured in any decision-making application such as search engines and recommendation systems. These systems are widely used in sensitive areas like healthcare, banking, law, and e-commerce. Such systems use algorithmic predictions based on big data to guide decisions and predict outcomes. Despite these benefits, it is necessary to ensure software fairness and avoid biased results.
Although software fairness is considered a desirable quality attribute, achieving it is not an easy task. MLS fairness is regarded as a main objective in artificial intelligence (AI) [9]. Researchers are focusing on applying techniques and tools to measure fairness in MLS. Numerous algorithms and technologies have been proposed to evaluate fairness in MLS using specific metrics [81].

3.6. Testing Fairness in MLS

MLS fairness can be affected in three stages: pre-processing, in-processing, and post-processing. In each stage, fairness is measured to ensure the production of a fair model. For instance, in the pre-processing phase, fairness is evaluated on the data before it is put into the model. In the in-processing phase, fairness is evaluated by testing algorithms and iterating to build a fair model. Finally, in the post-processing phase, test cases are evaluated that check for fairness in the predictions of MLS [14].
Many well-established fairness metrics are available for evaluating fairness in MLS, including disparate impact, independence, demographic parity, equalized odds, fairness through awareness, and positive and negative class balance. Each metric has its own formal conditions and definitions that must be satisfied by the MLS predictions [5].
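As a concrete example of one such condition, the following sketch checks equalized odds, which requires equal true-positive and false-positive rates across groups (labels, predictions, and groups below are synthetic, for illustration only):

```python
# Illustrative sketch: per-group TPR/FPR computation for an equalized-odds
# check. Equalized odds holds when both gaps are (near) zero.

def rates(y_true, y_pred, groups, group):
    """Return (TPR, FPR) for one group."""
    tp = fn = fp = tn = 0
    for t, p, g in zip(y_true, y_pred, groups):
        if g != group:
            continue
        if t == 1:
            tp, fn = tp + (p == 1), fn + (p == 0)
        else:
            fp, tn = fp + (p == 1), tn + (p == 0)
    return tp / (tp + fn), fp / (fp + tn)

y_true = [1, 1, 0, 0, 1, 1, 0, 0]                 # ground-truth labels
y_pred = [1, 0, 0, 0, 1, 1, 1, 0]                 # model predictions
groups = ["A", "A", "A", "A", "B", "B", "B", "B"]

tpr_a, fpr_a = rates(y_true, y_pred, groups, "A")  # (0.5, 0.0)
tpr_b, fpr_b = rates(y_true, y_pred, groups, "B")  # (1.0, 0.5)
print(abs(tpr_a - tpr_b), abs(fpr_a - fpr_b))      # both gaps are 0.5
```

Here the model violates equalized odds: group B enjoys a higher true-positive rate but also suffers a higher false-positive rate, a disparity that demographic parity alone would not reveal.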

3.7. Fairness and Bias in MLS

There are multiple sources of bias in MLS, which can be summarized into three categories: Data to Algorithm, Algorithm to User, and User to Data [17].
In the Data to Algorithm category, bias derives from data that was not measured or reported correctly. Additionally, errors of data representation, or errors in drawing conclusions from results, negatively impact the functionality of machine learning algorithms.
In the Algorithm to User category, bias can be created through design aspects of an algorithm, such as the use of specific optimization functions and custom rules. The manner in which results are ranked or displayed may also introduce biases affecting the fairness of the model. Furthermore, relying on benchmarks such as Adience and IJB-A for bias evaluation may itself introduce unfairness into the model, as both have documented issues with skin-color representation.
In the User to Data category, any data source used to train MLS is potentially biased because it is user-generated. Thus, any bias in user behavior or thought processes may be reflected in the data generated [70].
Types of bias in MLS can be summarized as shown in Figure 6 [35].
Fairness evaluation in MLS is concerned with identifying and mitigating bias at the various stages of the model development process. In the pre-processing stage, fairness metrics focus on the quality and balance of input data, measuring quantities such as statistical parity, disparate impact, or representation bias in order to identify and mitigate imbalances prior to training. In the in-processing stage, fairness is measured by metrics such as equal opportunity, equalized odds, and demographic parity, which are typically incorporated into the ML algorithm, either explicitly via a fairness-aware training procedure or through regularization of the objective function. Finally, the post-processing stage examines the outputs of the model, comparing predictions across groups, for example through per-group calibration or individual fairness checks, to avoid predictions that disproportionately favor or disfavor any group. These metrics help guide interventions to improve fairness without significantly reducing model performance.
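A simple post-processing intervention of the kind described above can be sketched as follows: choosing per-group decision thresholds so that positive-prediction rates match across groups (the scores and thresholds below are synthetic and purely illustrative, not a specific published method):

```python
# Illustrative sketch: post-processing via per-group decision thresholds.
# Scores are hypothetical model outputs for two groups.

def positive_rate(scores, threshold):
    """Share of instances whose score meets the group's decision threshold."""
    return sum(s >= threshold for s in scores) / len(scores)

scores_a = [0.9, 0.8, 0.7, 0.2]   # model scores, group A
scores_b = [0.6, 0.5, 0.3, 0.1]   # model scores, group B

# A single global threshold of 0.65 yields very unequal rates (0.75 vs 0.0):
print(positive_rate(scores_a, 0.65), positive_rate(scores_b, 0.65))

# Per-group thresholds can equalize the positive rates at 0.5 each:
print(positive_rate(scores_a, 0.75), positive_rate(scores_b, 0.45))
```

Such threshold adjustment leaves the trained model untouched, which is precisely why post-processing methods are attractive when retraining is costly, though they may trade off predictive accuracy against the fairness criterion being enforced.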
The following section outlines the key findings and provides an interpretation of the results in the context of existing literature.

4. Results and Discussion

RQ1: What is the distribution of papers through venues and years?
The distribution of the 67 primary studies across venue types is shown in Table 4 and Table 5. Conferences were the most common outlet (52%, n=35), followed by journals (42%, n=28), and workshops (6%, n=4) (see Figure 7). This indicates that over half of the studies were disseminated at conferences, while a substantial proportion also appeared in journals.
Table 4. Primary Studies Which Are Published in Journals.
Venue Study ID Publisher
Journal of Data and Information Quality PS26 ACM
Journal on Emerging Technologies in Computing Systems PS3 ACM
Journal of Computing Sciences in Colleges PS4 ACM
Journal of Artificial Intelligence Research PS6 JAIR
IEEE Transactions on Industrial Informatics PS8 IEEE
Journal of Machine Learning Research PS13 JMLR
Neural Computing and Applications PS20, PS28 Springer
Data Mining and Knowledge Discovery PS21, PS57 Springer
Data Science and Engineering PS23 Springer
International Journal of Data Science and Analytics PS24 Springer
DBLP CoRR journal PS26, PS60 DBLP
Knowledge-Based Systems PS36 Elsevier
International Journal of Intelligent Systems PS37 WILEY
IEEE Intelligent Systems PS40 IEEE
Algorithms PS44 MDPI
Advances in Neural Information Processing Systems PS45 NeurIPS
IEEE Transactions on Software Engineering PS55 IEEE
International Journal of Crowd Science PS58 DBLP
ACM Transactions on Software Engineering and Methodology PS59 ACM
Journal of Technology in Human Service PS63 DBLP
Electronics PS64 MDPI
Expert Systems PS66 WILEY
ACM Transactions on Knowledge Discovery from Data PS67 ACM
*PS = Primary Study (see References [26-92]).
Table 5. Primary Studies Which Are Published in Conferences/Workshops.
Venue Study ID Publisher
The World Wide Web Conference PS2, PS5 ACM
International Conference on Computer, Control, and Communication PS7 IEEE
IEEE/ACM International Workshop on Software Fairness PS9 IEEE/ACM
IEEE International Symposium on Technology and Society PS10 IEEE
IEEE TrustCom 2020 PS11 IEEE
IEEE Conference on Decision and Control PS12 IEEE
International Conference on Testing Software and Systems PS14 SPRINGER
International Conference on the Quality of Information and Communications Technology PS15, PS17 SPRINGER
International Conference on Computer-Aided Verification PS16 SPRINGER
Joint European Conference on Machine Learning and Knowledge Discovery in Databases PS18 SPRINGER
Companion Proceedings of the Web Conference 2021 PS22 ACM
Proceedings of the 23rd ACM SIGKDD PS25 ACM
Proceedings of the Conference on Fairness, Accountability, and Transparency PS27, PS30, PS38 ACM
Proceedings of the AAAI/ACM Conference on AI PS29 ACM
ACL Workshop on Gender Bias for Natural Language Processing PS31 ACL
European Software Engineering Conference and Symposium on the Foundations of Software Engineering PS32 ACM
ACM Conference (Conference’17) PS33 ACM
Annual Meeting of the Association for Computational Linguistics PS34 ACL
INNS Big Data and Deep Learning Conference PS35 SPRINGER
Proceedings of the 36th International Conference on Machine Learning PS41, PS42 PMLR
AAAI Conference on Artificial Intelligence PS43 AAAI
International Conference on Software Engineering PS46, PS47, PS51, PS52, PS54, PS56 IEEE/ACM
International Workshop on Equitable Data and Technology PS48 ACM
IEEE International Conference on Software Testing, Verification, and Validation Workshop PS49 IEEE
ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering PS50 ACM
IEEE/ACM 7th International Workshop on Metamorphic Testing PS53 IEEE/ACM
ACM SIGSOFT International Symposium on Software Testing and Analysis PS61 ACM
International Conference on Evaluation and Assessment in Software Engineering PS65 EASE
Figure 8 presents the yearly trend of publications. Research on fairness in MLS began to increase notably after 2019, reaching a peak in 2022 (n=14 studies, 21%). This sharp rise reflects the growing attention to fairness as an urgent challenge in MLS.
The dominance of conferences suggests that fairness detection in MLS remains an emerging and exploratory research area, where rapid dissemination of preliminary findings is prioritized. Journals, by contrast, tend to host more mature and detailed contributions, explaining why 42% of studies were published there. The comparable proportions highlight the complementary roles of both venues in shaping the field.
The increase after 2019 aligns with the intensifying global discourse on ethical AI and responsible machine learning, and indicates that fairness, previously a niche concern, has entered the mainstream of computing research. The diversity of outlets in which fairness detection research is published (conferences, journals, and workshops) indicates that the field is interdisciplinary in nature: contributions come from the software engineering, data mining, and algorithm design communities, highlighting the variety of technical communities exploring the topic (see Table 6).
Table 6. Primary Studies Which Are Published as Empirical Studies.
Venue Study ID Publisher
Empirical Software Engineering PS19, PS62 SPRINGER
Furthermore, much of this research is anchored in decision-making domains with high societal impact, such as hiring, lending, and healthcare, where fairness is critical to prevent discriminatory outcomes. The presence of studies in these areas suggests that fairness detection is not only a theoretical challenge but also a pressing applied problem, driving the need for robust methods to detect and mitigate bias in MLS.
RQ2: What are the different definitions of software fairness?
The 67 primary studies put forth various definitions of software fairness, revealing the diversity of approaches in the field. The definitions were often tucked away in the background or methodology sections of the papers, yet they illustrate different ways of operationalizing fairness in MLS. As shown in Table 7, we categorized the definitions into four broad categories: general fairness, individual fairness, group fairness, and fairness metrics.
Group fairness was the most popular approach, cited in 28 studies (42%). These definitions require equitable model performance across groups defined by sensitive attributes such as gender, race, or age. The studies PS15 and PS25 use demographic parity and predictive parity, i.e., equal outcomes independent of the protected features [40,50]. The study PS23 referred to group fairness as fairness in the output distribution across sensitive groups [23].
Definitions of individual fairness were found in 18 studies (27%), based on the principle that similar individuals should be treated similarly. For example, PS10 noted that fairness requires that valid inputs that differ only in a protected attribute should receive identical classifications, while PS32 defined unfairness as discrimination between individuals that differ only in protected features [10,32].
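The flip-style check implied by these definitions can be sketched as follows. The toy classifier, the input variables, and the `flip_test` helper are hypothetical stand-ins for illustration, not implementations from PS10 or PS32.

```python
# Hedged sketch of an individual-fairness flip test: valid inputs that differ
# only in the protected attribute should receive identical classifications.
# `toy_model` is a hypothetical stand-in for the system under test.

def toy_model(income, protected):
    # deliberately ignores the protected attribute, so it should pass the test
    return 1 if income >= 50 else 0

def flip_test(model, income, protected):
    """True if the decision is invariant to flipping the protected attribute."""
    return model(income, protected) == model(income, 1 - protected)

violations = [x for x in (30, 50, 70) if not flip_test(toy_model, x, 0)]
print(f"inputs violating individual fairness: {violations}")  # [] here
```

A real test harness would sample many inputs and report each violating pair as a concrete witness of discrimination.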
A smaller number of studies (11 studies, 16%) proposed general fairness concepts, such as fairness-aware models (PS66), degree of fairness (PS19), and fairness through unawareness (PS38) [19,38,66]. These are not precise categories but rather conceptual framings, typically introduced before the researchers selected a metric-based evaluation of fairness.
Finally, in 10 studies (15%), researchers evaluated fairness using operational definitions expressed as specific fairness metrics. These included disparate impact (PS25, PS6), equal opportunity (PS25, PS39), mean absolute odds (PS6, PS35), and the Theil index (PS6) [6,25,35,39]. Other definitions included counterfactual fairness (PS38, PS56) and unfairness as the inverse of algorithmic bias (PS63) [38,56,63].
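Two of the metrics above can be made concrete with a small sketch. The example predictions and the `positive_rate` helper are illustrative assumptions constructed for this article, not code from the cited studies.

```python
# Illustrative computation of disparate impact and equal opportunity
# difference on constructed toy predictions.

pred   = [1, 1, 0, 1, 0, 1, 0, 0]   # model decisions
prot   = [0, 0, 0, 0, 1, 1, 1, 1]   # sensitive group per instance
labels = [1, 1, 0, 0, 1, 1, 0, 0]   # ground truth

def positive_rate(pred, prot, group, labels=None):
    """Positive-decision rate in a group; restricted to y=1 cases if labels given."""
    triples = zip(pred, prot, labels or [1] * len(pred))
    ps = [p for p, a, y in triples if a == group and (labels is None or y == 1)]
    return sum(ps) / len(ps)

# Disparate impact: ratio of positive rates (the "80% rule" flags values < 0.8).
di = positive_rate(pred, prot, 1) / positive_rate(pred, prot, 0)
# Equal opportunity difference: gap in true-positive rates between groups.
eod = positive_rate(pred, prot, 0, labels) - positive_rate(pred, prot, 1, labels)
print(f"disparate impact: {di:.2f}, equal opportunity difference: {eod:.2f}")
```

Note that disparate impact needs only the predictions and group membership, while equal opportunity additionally requires ground-truth labels, which is one practical reason studies choose different metrics.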
Overall, the results indicate that fairness in MLS is defined inconsistently in the literature, with no single definition dominating the field. The preponderance of group fairness metrics shows that researchers focus primarily on comparing populations rather than individuals, which aligns with society's increased attention to the practical consequences of discrimination in hiring or lending decisions (PS20, PS23) [20,23]. Individual fairness is less well represented, even though it is arguably more important in theory; in practice, researchers may find it difficult to define similarity between individuals.
The emergence of alternative and hybrid definitions also illustrates researchers' efforts to improve fairness evaluation. In PS39, the authors argued that anti-classification (unawareness), calibration, and classification parity (group fairness) are related but distinct notions, representing a multidimensional approach [39]. PS35 emphasized the probabilistic nature of fairness, while PS56 analyzed prediction differences using a pairwise definition [35,56]. These refinements indicate that researchers are experimenting with different formulations to characterize fairness in different contexts.
Ultimately, the results indicate a fragmented landscape, where fairness is conceived in multiple and sometimes conflicting ways. While this diversity captures the interdisciplinary nature of fairness research, it also makes comparisons and benchmarking across studies problematic. Compelling evidence suggests a strong need for greater consolidation and uniformity of definitions of fairness, while maintaining flexibility for researchers to employ definitions that work for the context.
RQ3: What types of problems are addressed?
The primary studies focused on four main types of problems in MLS fairness detection: analyzing existing methods, developing bias mitigation techniques, proposing fairness testing approaches, and evaluating definitions and approaches. As shown in Figure 9 and detailed in Table 8, these categories exemplify how researchers conceptualize and operationalize the fairness problem.
Analyzing methods was the most common problem type, represented by 24 studies (36%). These studies systematically analyzed methods for unfairness detection and identified their strengths and weaknesses. For instance, PS5 optimized classification models by varying the fairness measures of the classifiers, and PS6 analyzed the impact of different mitigation methods on performance and fairness [5,6].
Testing problems represented 20 studies (30%); studies in this category examined whether models produced fair outcomes according to chosen fairness metrics. For example, PS10 used sensitive variables (e.g., skin color) to check decision outcomes, and PS14 suggested a verification approach that generated fairness-based test cases [10,14].
Mitigating bias was the focus of 15 studies (22%). These contributions provided new methods and algorithms to reduce discrimination in MLS. For example, PS7 developed an alternative classification method capable of learning unbiased models from biased training data, and PS23 proposed a fair outlier detection method that handled multiple sensitive attributes (e.g., gender, race, religion, and nationality) [7,23].
Evaluating definitions and approaches was the least represented category, with eight studies (12%). These works compared fairness concepts, criteria, or methods to validate their applicability. For instance, PS9 summarized the various definitions of fairness and illustrated them using case studies, while PS30 critically assessed fairness criteria and mathematical techniques across diverse groups and individuals [9,30].
The distribution of problem types highlights the multifaceted nature of fairness detection research in MLS. The predominance of analyzing studies (36%) suggests that the field is still consolidating knowledge and testing the validity of existing methods. This aligns with the observation from RQ1 that much of the work is disseminated through conferences, reflecting an exploratory stage of development.
The substantial proportion of testing studies (30%) demonstrates that researchers are increasingly focused on building systematic evaluation frameworks for fairness. This emphasis indicates a move toward practical verification tools that can be integrated into software testing pipelines. However, these studies are still fragmented across different fairness metrics, limiting comparability.
Bias mitigation, though smaller in share (22%), reflects the growing importance of intervention-oriented approaches. The fact that many of these studies introduce algorithms specifically designed to remove or reduce bias demonstrates a shift from merely diagnosing fairness problems to actively addressing them.
The limited number of evaluation-focused studies (12%) reveals a gap in the field. Without thorough cross-comparisons of fairness definitions, metrics, and mitigation strategies, it is difficult to establish consensus on best practices. This finding reinforces the need for standardized benchmarks and more comprehensive comparative evaluations in future work.
In summary, the analysis of problem types shows that fairness detection research is evolving along multiple lines—analysis, testing, mitigation, and evaluation—but with imbalances. While analysis and testing dominate, mitigation and especially evaluation remain underrepresented, pointing to important directions for further investigation.
RQ4: What different approaches of fairness testing are presented?
The review of the 67 primary studies identified several types of fairness testing methods, whose distribution can be seen in Figure 10. The most prevalent method was metric-based fairness testing (n=38, 57%), followed by synthetic data testing (n=17, 25%) and adversarial fairness testing (n=9, 13%). The remaining, less frequently used methods were feature importance testing (n=2, 3%), comprehensive audits (n=2, 3%), and causal fairness testing (n=1, 1%). The results indicate a strong presence of quantitative evaluation methods within fairness testing research.
Several studies also presented concrete tools and methods for bias detection and mitigation. For example, the IBM AI Fairness 360 toolkit and fairkit-learn both provide libraries to test fairness across datasets and algorithms (PS23, PS43) [23,43]. Other approaches ranked algorithms to minimize bias in recommendation systems (PS15), imputed and resampled data to balance datasets (PS24), and adjusted word embeddings to reduce gender bias in natural language processing tasks (PS35) [15,24,35].
With regard to levels of testing, input-level testing predominated (n=25, 38%), followed by system-level testing (n=18, 27%) and unit-level testing (n=8, 12%). Another 15 studies (23%) combined more than one level, suggesting a promising multi-layered approach to testing. For instance, PS14 applied verification-based system testing, generating cases that examined both accuracy and fairness [14]. At the unit level, PS24 performed dataset-focused testing to address sampling bias [24]. Input testing was frequently applied in comparative experiments, such as PS4, which tested sampling strategies across neural network classifiers [4].
A summary of solutions associated with different problem types is provided in Table 9, covering optimization, reweighting, outlier detection, dataset augmentation, and hybrid white-box/black-box testing approaches.
The heavy reliance (57%) on metric-based fairness testing reflects the extent to which the field depends on quantitative measures such as demographic parity, equal opportunity, and predictive parity. While these are easy to measure and compare, their persistence also suggests a degree of over-use: they report performance purely through statistical indicators and can miss deeper structural biases (PS6, PS15, PS25) [6,15,25].
The growing use of synthetic data (25%) derives in part from the recognition that real data is likely to carry embedded historical bias. Synthetic data allows artificial but related datasets to be built that facilitate fairness testing (PS22, PS36, PS38) [22,36,38]. Along similar lines, adversarial testing (13%) reflects attempts to stress-test MLS models by deliberately searching for discriminatory cases (PS39, PS49) [39,49]. Though less prevalent than other methods, its value lies chiefly in its ability to expose vulnerabilities that would otherwise go unnoticed.
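A minimal sketch of such a search for discriminatory cases, in the spirit of random fairness-testing tools like Themis, might look as follows. The toy classifier, input range, and deliberately injected bias are assumptions for illustration, not taken from the reviewed studies.

```python
# Random-testing sketch: sample inputs and collect those where flipping
# the protected attribute changes the decision. The biased toy classifier
# and the age range are hypothetical, for demonstration only.
import random

def biased_classifier(age, protected):
    # toy model with a deliberately injected dependence on the protected attribute
    return 1 if age + 10 * protected >= 40 else 0

def search_discriminatory_inputs(model, trials=1000, seed=0):
    """Collect inputs on which the decision depends on the protected attribute."""
    rng = random.Random(seed)
    found = set()
    for _ in range(trials):
        age = rng.randint(18, 70)
        if model(age, 0) != model(age, 1):
            found.add(age)
    return sorted(found)

cases = search_discriminatory_inputs(biased_classifier)
print(f"discriminatory ages found: {cases}")
```

With enough trials, the search recovers exactly the band of ages (30 to 39 in this toy setup) where the injected bias flips the decision, illustrating how random search can surface concrete discriminatory witnesses.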
The dominance of input-level testing (38%) shows that most studies recognize the value of manipulating datasets and running controlled tests to unveil unfairness; however, the fact that 23% used more than one level suggests researchers are moving toward more cohesive and integrated evaluation pipelines (see Figure 11). For example, a combined input-level and system-level test could gather fairness evidence both from the data itself and from end-to-end performance (e.g., PS14, PS24) [14,24].
The limited number of studies engaging with comprehensive auditing, causal testing, or feature importance analysis (collectively less than 10%) suggests considerable room for future inquiry in these developing areas. These underutilized methodologies could make substantial contributions: causal testing may reveal structural discrimination, audits may assess compliance with regulatory requirements, and feature importance may aid understanding of how attribute changes affect outcomes (PS38, PS63) [38,63]. Their limited uptake does not mean these areas are irrelevant to current or future fairness research.
In summary, the results show that fairness testing in MLS remains largely metric-driven and principally focused at the input level, though some studies also consider system-level integration of metrics and test paradigms. Going forward, combining the rigor of statistical metrics with adversarial, causal, or audit-based testing may offer a more holistic view of fairness and improve the robustness and trustworthiness of MLS applications.
RQ5: Which datasets are used to detect the unfairness of MLS?
The primary studies relied mainly on a limited set of benchmark datasets to conduct fairness detection experiments. As seen in Figure 12, the most used dataset was the UCI Adult Census Income dataset (n=20, 30%), followed by the German Credit dataset (n=15, 22%) and the COMPAS dataset (n=11, 16%). Together these three datasets represented almost two-thirds of dataset usage in the reviewed studies.
These benchmarks are popular because they contain sensitive demographic attributes (e.g., race, gender, and age), allowing fairness-related outcomes to be formally assessed. For example, the Adult dataset has been used to test classification models under disproportionate gender representation (PS19, PS33) [19,33]. The German Credit dataset contains sensitive variables (e.g., age, marital status), making it a common testbed for bias in financial decision-making scenarios (PS41, PS55) [41,55]. The COMPAS dataset is widely used by researchers examining fairness in the criminal justice context and is cited in multiple works (PS28, PS48), despite its documented racial bias [28,48].
Alongside the benchmark datasets, synthetic datasets were used (PS36, PS58) [36,58], allowing controlled distributions of sensitive attributes and the testing of fairness constraints under varying conditions. Synthetic datasets were particularly useful when real-world data lacked the necessary diversity or when specific scenarios had to be recreated for algorithm evaluation.
A few studies included domain-specific real-world datasets from healthcare (PS23, PS43) and social platforms to evaluate fairness in applied real-world decision-making situations; however, these studies were in the minority due to limited accessibility, privacy concerns, or a lack of standards [23,43].
Ultimately, dataset choice emerged as a key factor influencing fairness evaluation results. For instance, PS33 highlighted that if a biased dataset is used, the resulting output will be discriminatory even if the algorithms are theoretically fair, while PS41 and PS55 indicated that the representativeness of a dataset heavily influences the applicability of any derived fairness metrics [33,41,55].
The results indicate that the fairness research agenda in MLS still largely relies on a small number of benchmark datasets, especially Adult, German Credit, and COMPAS. Their popularity serves comparability and reproducibility across studies, but this reliance raises concerns: each dataset has known biases (e.g., racial bias in COMPAS, gender bias in Adult, and demographic bias in German Credit), and their repeated reuse risks imposing a narrow view of fairness rather than opening up new contexts.
The use of synthetic datasets suggests that researchers are trying to compensate for these shortcomings by establishing controlled testing environments, but synthetic data risks oversimplifying real-world complexity, potentially reducing external validity.
The small number of studies using healthcare and social media datasets indicates that fairness research has not yet fully reached sensitive real-world sectors where bias can have substantial consequences for real people. Greater use of such datasets could improve the practical relevance of fairness detection studies, though challenges such as privacy, access, and ethics must first be addressed.
To conclude, the choice of dataset has a decisive influence on the approach to and outcomes of fairness evaluation. We have highlighted an important tension in current practice: benchmark datasets make studies comparable, yet they also constrain fairness research. Future work will need to combine established benchmarks with more diverse, representative, and domain-specific datasets that cover a wider variety of social contexts.
RQ6: What are the research gaps and trends discovered in the reviewed studies?
The reviewed studies identified multiple research gaps that limit progress in fairness detection within MLS. As shown in Figure 13, the most frequently mentioned issue was the influence of training dataset quality (n=27, 40%), followed closely by preprocessing and missing value handling (n=25, 37%), and feature selection and sensitive attribute analysis (n=10, 15%). These findings reflect the importance of data-related factors in shaping fairness outcomes.
An ongoing gap across many papers was the lack of consistent fairness metrics. Many metrics exist (e.g., demographic parity, equal opportunity, predictive parity), but none fully captures fairness across contexts. PS22 and PS37 noted that researchers often applied metrics inconsistently across studies, making comparisons difficult [22,37].
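The comparability problem can be illustrated with a constructed example in which the same predictions satisfy one metric while violating another. The numbers below are hypothetical, chosen only to make the two metrics disagree; they are not data from PS22 or PS37.

```python
# Same predictions, two metrics, opposite verdicts: demographic parity is
# satisfied while equal opportunity is violated (constructed toy example).

pred   = [1, 1, 0, 0, 1, 0, 1, 0]   # model decisions
labels = [1, 1, 0, 0, 1, 1, 0, 0]   # ground truth
prot   = [0, 0, 0, 0, 1, 1, 1, 1]   # sensitive group

def rate(vals):
    return sum(vals) / len(vals)

# Demographic parity: positive-decision rate per group (equal here: 0.50 each).
parity_gap = rate([p for p, a in zip(pred, prot) if a == 0]) - \
             rate([p for p, a in zip(pred, prot) if a == 1])

# Equal opportunity: true-positive rate per group (1.00 vs 0.50 here).
tpr0 = rate([p for p, a, y in zip(pred, prot, labels) if a == 0 and y == 1])
tpr1 = rate([p for p, a, y in zip(pred, prot, labels) if a == 1 and y == 1])

print(f"parity gap: {parity_gap:.2f}, opportunity gap: {tpr0 - tpr1:.2f}")
```

A study reporting only demographic parity would call this model fair, while one reporting equal opportunity would not, which is precisely why inconsistent metric choices make cross-study comparison difficult.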
Another gap was the persistence of biased benchmark datasets. Commonly used datasets such as COMPAS and Adult Census Income embed structural biases. PS33 and PS41 cautioned that using these datasets without adjustment risks reproducing discrimination, even when fairness-aware algorithms are applied [33,41].
The need for better testing frameworks was also emphasized. Although IBM AI Fairness 360 and Themis offer useful functionality, existing frameworks do not holistically address fairness across all stages of the MLS lifecycle. PS28 and PS46 stressed the need for integrated frameworks that embed fairness checks from data collection through deployment [28,46].
Several current trends also emerged. Multiple articles applied fairness methods to deep learning and natural language processing (NLP) models; PS49 and PS54 examined fairness-aware word embeddings and neural network regularization as ways of managing bias in large-scale systems [49,54]. Another emerging trend was the use of causal inference in fairness testing, which allows researchers to investigate the impact of sensitive attributes with more rigor.
Finally, interdisciplinary collaboration was highlighted in a number of studies. In PS18 and PS44 the authors argued that fairness requires input from computer science, ethics, law, and policy to ensure that solutions are implementable and aligned with collective societal values.
The analysis of gaps shows that fairness research is still grappling with data-driven limitations, particularly the reliance on benchmark datasets with known biases and the absence of standardized evaluation metrics. These limitations reduce the reproducibility and comparability of results across studies. The fact that dataset quality and preprocessing issues were mentioned in over three-quarters of the reviewed studies (see Figure 13) underscores the urgency of improving dataset curation and handling methods.
The key takeaway regarding the need for testing frameworks built into the entire MLS lifecycle is the acknowledgment that fairness cannot be achieved by stand-alone interventions. Interventions must be instituted across entire pipelines, including data collection, pre-processing, model design, and deployment in the wild. Current tools (e.g., AI Fairness 360, Themis) provide valuable first steps for tackling bias in MLS, but do not give a complete, integrated picture.
The trends observed in deep learning and NLP show that the field is reacting to the growing influence of these models in sensitive domains, such as language translation and recommendation systems, where MLS can reinforce systemic discrimination. The use of fairness-aware embeddings and neural regularization techniques (PS49, PS54) shows that fairness considerations are becoming more prominent and are increasingly integrated into state-of-the-art machine learning [49,54]. The embrace of causal inference (PS61) also indicates a movement toward more explanatory modeling that targets causal or structural sources of unfairness, not just correlation [61].
The focus on collaboration across disciplines (PS18, PS44) means that fairness is seen not only as a technical issue, but also as a socio-technical one [18,44]. Ethicists, lawyers, and policymakers need to be involved to ensure that fairness definitions and metrics reflect real-world constraints and societal expectations.
In brief, RQ6 indicates that the field is evolving, but the stakes remain high given inconsistent metrics, biased datasets, and a lack of consistent evaluation tools. However, encouraging developments, particularly in deep learning, causal inference and collaboration across disciplines, suggest a more thorough and holistic understanding of fairness in MLS is on the horizon (see Appendix 2).

6. Conclusions

This study presents a systematic mapping study (SMS) of fairness in machine learning-based systems (MLS), synthesizing research published from 2010 to 2025. The SMS systematically addressed six research questions (RQs), investigating definitions of fairness, problem types, testing approaches, and datasets, as well as identifying research gaps and emerging trends. Drawing on evidence from sixty-seven primary studies, this study provides a thorough overview of how fairness has been defined, measured, and challenged in MLS research.
The findings showed that fairness is a contested concept with multiple operational definitions, spanning group fairness, individual fairness, and counterfactual definitions. These perspectives bring richness to the field; however, the absence of standardized, universally accepted metrics that permit comparability across studies remains a barrier (RQ2).
The analysis of problem types (RQ3) showed that research was distributed across four categories: analyzing existing approaches, developing bias mitigation strategies, proposing fairness testing methods, and evaluating definitions. The most common approaches in fairness testing (RQ4) entailed metric-based evaluation, synthetic data generation, and adversarial testing, conducted at the input, system, or unit level. Although practical toolkits (IBM AI Fairness 360, Themis, Aequitas, etc.) provide fairness assessment methods, none offers an end-to-end solution for embedding fairness throughout the MLS process.
Datasets emerged as a major driver of fairness outcomes (RQ5). The major benchmark datasets (e.g., COMPAS, German Credit, UCI Adult) are still used regularly, despite the acknowledged biases they contain and their limitations for constructing fairness-aware systems. Synthetic datasets allow controlled, contextual testing but may fail to reflect the realities of the people affected by model outcomes. The absence of diverse, representative, and unbiased datasets remains one of the greatest barriers to fairness research.
With regard to research gaps (RQ6), the SMS identified three major ones: (1) no standard fairness definitions and metrics, (2) reliance on biased benchmark datasets, and (3) a lack of fairness testing tools that are practical for deployment. At the same time, several trends are emerging: fairness-aware deep learning, fairness in natural language processing, causal inference-based approaches, and increasing calls for interdisciplinary collaboration across computer science, ethics, and policy.
Overall, we have contributed to the research field by providing a structured synthesis of fairness in MLS research that goes beyond previous reviews. By mapping definitions, approaches, datasets, and gaps into one coherent framework, this paper provides a foundation for future research. Next steps include building standardized fairness metrics, designing more representative datasets, and implementing robust fairness testing frameworks spanning the full MLS lifecycle. Ultimately, the challenges outlined above are not solely technical but also societal, and must be overcome to ensure that MLS are trusted in high-stakes applications such as healthcare, finance, and justice.

Author Contributions

Conceptualization, R.A. and N.A.; methodology, R.A.; software, R.A.; validation, R.A. and N.A.; formal analysis, R.A.; investigation, R.A.; resources, N.A.; data curation, R.A.; writing—original draft preparation, R.A.; writing—review and editing, R.A. and N.A.; visualization, R.A.; supervision, N.A.; project administration, N.A. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

No new data were created or analyzed in this study.

Acknowledgments

The author would like to express sincere gratitude to Dr. Noureddine Abbadeni for his continuous guidance, valuable feedback, and supervision throughout the development of this research. The support and resources provided by King Saud University and King Abdulaziz City for Science and Technology are also gratefully acknowledged. During the preparation of this manuscript, the author used digital libraries to collect primary studies; the author has reviewed and verified all outputs and takes full responsibility for the content of this publication.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
ML: Machine Learning
DL: Deep Learning
NN: Neural Network
DNN: Deep Neural Network
CNN: Convolutional Neural Network
RNN: Recurrent Neural Network
LSTM: Long Short-Term Memory
NLP: Natural Language Processing
BERT: Bidirectional Encoder Representations from Transformers
GPT: Generative Pre-trained Transformer

Appendix A: Quality Assessment

Table A1. Quality Assessment of Collected Primary Studies.
ID Reference Title QA1 QA2 QA3 Score ID after QA
S1 Tremblay, Monica Chiarini, Kaushik Dutta, and Debra Vandermeer. Journal of Data and Information Quality (JDIQ) 2.1 ACM (2010) Using data mining techniques to discover bias patterns in missing data. 1 1 1 3 PS1
S2 Krasanakis, Emmanouil, et al. Proceedings of the 2018 World Wide Web Conference. ACM 2018. Adaptive sensitive reweighting to mitigate bias in fairness-aware classification 1 1 0.5 2.5 PS2
S3 Wang, Weijia, and Bill Lin. ACM Journal on Emerging Technologies in Computing Systems (JETC) 15.2 (2019): 1-17. Trained biased number representation for ReRAM-based neural network accelerators. 1 1 1 3 PS3
S4 Thambawita, Vajira, et al. ACM Transactions on Computing for Healthcare 1.3 (2020): 1-29. An extensive study on cross-dataset bias and evaluation metrics interpretation for machine learning applied to gastrointestinal tract abnormality classification. 1 0.5 0 1.5
S5 Amend, Jack J., and Scott Spurlock. Journal of Computing Sciences in Colleges ACM 36.5 (2021): 14-23. Improving machine learning fairness with sampling and adversarial learning. 1 1 0.5 2.5 PS4
S6 Baniecki, Hubert, et al. The Journal of Machine Learning Research 22.1 (2021): 9759-9765. dalex: Responsible machine learning with interactive explainability and fairness in Python. 1 0.5 0 1.5
S7 Wu, Yongkai, Lu Zhang, and Xintao Wu. The World Wide Web Conference. ACM 2019. On convexity and bounds of fairness-aware classification. 1 1 0.5 2.5 PS5
S8 Caton, Simon, Saiteja Malisetty, and Christian Haas. Journal of Artificial Intelligence Research ACM 74 (2022): 1011-1035. Impact of Imputation Strategies on Fairness in Machine Learning. 1 1 1 3 PS6
S9 Kamiran, Faisal, and Toon Calders. 2009 2nd international conference on computer, control, and communication. IEEE, 2009. Classifying without discriminating. 1 1 0.5 2.5 PS7
S10 DeBrusk, Chris. MIT Sloan Management Review (2018). The risk of machine-learning bias (and how to prevent it). 1 0.5 0 1.5
S11 Zhou, Xiaokang, et al. IEEE Transactions on Industrial Informatics (2022). Distribution Bias Aware Collaborative Generative Adversarial Network for Imbalanced Deep Learning in Industrial IoT. 1 0.5 0.5 2 PS8
S12 Verma, Sahil, and Julia Rubin. 2018 ieee/acm international workshop on software fairness (fairware). IEEE, 2018. Fairness definitions explained. 1 1 0 2 PS9
S13 KIEMDE, Sountongnoma Martial Anicet, and Ahmed Dooguy KORA. 2020. Fairness of Machine Learning Algorithms for the Black Community 1 0.5 0.5 2 PS10
S14 Xie, Wentao, and Peng Wu. 2020. Fairness Testing of Machine Learning Models Using Deep Reinforcement Learning 1 1 1 3 PS11
S15 Olfat, Matt, and Yonatan Mintz. 2020. Flexible Regularization Approaches for Fairness in Deep Learning 1 1 0.5 2.5 PS12
S16 Zafar, Muhammad Bilal, et al. The Journal of Machine Learning Research 20.1 (2019): 2737-2778. Fairness constraints: A flexible approach for fair classification. 1 1 1 3 PS13
S17 Sharma, Arnab, and Heike Wehrheim, IFIP International Conference on Testing Software and Systems. Springer, Cham, 2020. Automatic fairness testing of machine learning models. 1 1 1 3 PS14
S18 Villar, David, and Jorge Casillas. International Conference on the Quality of Information and Communications Technology. Springer, Cham, 2021. Facing Many Objectives for Fairness in Machine Learning. 1 0.5 1 2.5 PS15
S19 Guan, Ji, Wang Fang, and Mingsheng Ying. International Conference on Computer Aided Verification. Springer, Cham, 2022. Verifying Fairness in Quantum Machine Learning. 1 1 1 3 PS16
S20 Shin Nakajima and Tsong Yueh Chen (2019) Generating Biased Dataset for Metamorphic Testing of Machine Learning Programs 1 0.5 0.5 2 PS17
S21 Kamishima, Toshihiro, et al. Joint European conference on machine learning and knowledge discovery in databases. Springer, Berlin, Heidelberg, 2012. Fairness-aware classifier with prejudice remover regularizer. 1 1 0.5 2.5 PS18
S22 Perera, Anjana, et al. Empirical Software Engineering 27.3 (2022). Search-based fairness testing for regression-based machine learning systems. 1 1 1 3 PS19
S23 Tian, Huan, et al. Neural Computing and Applications (2022): 1-19. Image fairness in deep learning: problems, models, and challenges. 1 0.5 0.5 2 PS20
S24 Calders, Toon, and Sicco Verwer. Data mining and knowledge discovery 21.2 (2010): 277-292. Three naive Bayes approaches for discrimination-free classification. 1 1 0.5 2.5 PS21
S25 Sun, Haipei, et al. Companion Proceedings of the Web Conference 2021. 2021. Automating fairness configurations for machine learning. 1 0.5 1 2.5 PS22
S26 Abraham, Savitha Sam. Data Science and Engineering 6.4 (2021): 485-499. FairLOF: Fairness in Outlier Detection. 1 1 1 3 PS23
S27 Wang, Yanchen, and Lisa Singh International Journal of Data Science and Analytics 12.2 (2021): 101-119. Analyzing the impact of missing values and selection bias on fairness. 1 1 1 3 PS24
S28 Corbett-Davies, Sam, et al. Proceedings of the 23rd acm sigkdd international conference on knowledge discovery and data mining. 2017. Algorithmic decision making and the cost of fairness. 1 1 0.5 2.5 PS25
S29 Berk, Richard, et al. arXiv preprint arXiv:1706.02409 (2017). A convex framework for fair regression. 1 1 0.5 2.5 PS26
S30 Celis, L. Elisa, et al. Proceedings of the conference on fairness, accountability, and transparency. 2019. Classification with fairness constraints: A meta-algorithm with provable guarantees. 1 1 1 3 PS27
S31 Prates, Marcelo OR, Pedro H. Avelar, and Luís C. Lamb. Neural Computing and Applications 32.10 (2020): 6363-6381. Assessing gender bias in machine translation: a case study with Google Translate. 1 1 1 3 PS28
S32 Fazelpour, Sina, and Zachary C. Lipton. Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society. 2020. Algorithmic Fairness from a Non-ideal Perspective 1 1 0.5 2.5 PS29
S33 Hutchinson, Ben, and Margaret Mitchell. Proceedings of the conference on fairness, accountability, and transparency. 2019. 50 years of test (un) fairness: Lessons for machine learning. 1 0.5 0.5 2 PS30
S34 Font, Joel Escudé, and Marta R. Costa-Jussa. arXiv preprint arXiv:1901.03116 (2019). Equalizing gender biases in neural machine translation with word embedding techniques. 1 1 1 3 PS31
S35 Aggarwal, Aniya, et al. Proceedings of the 2019 27th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering. 2019. Black box fairness testing of machine learning models. 1 1 1 3 PS32
S36 Jones, Gareth P., et al. arXiv preprint arXiv:2010.03986 (2020). Metrics and methods for a systematic comparison of fairness-aware machine learning algorithms. 1 1 0.5 2.5 PS33
S37 Krishna, Satyapriya, et al. arXiv preprint arXiv:2203.08670 (2022). Measuring Fairness of Text Classifiers via Prediction Sensitivity. 1 1 0.5 2.5 PS34
S38 Barocas, Solon, Moritz Hardt, and Arvind Narayanan. Nips tutorial 1 (2017) Fairness in machine learning. 1 0.5 0.5 2 PS35
S39 Varley, Michael, and Vaishak Belle. Knowledge-Based Systems 215 (2021): 106715. Fairness in machine learning with tractable models. 1 1 0.5 2.5 PS36
S40 Valdivia, Ana, Javier Sánchez-Monedero, and Jorge Casillas. International Journal of Intelligent Systems 36.4 (2021): 1619-1643. How fair can we go in machine learning? Assessing the boundaries of accuracy and fairness. 1 1 1 3 PS37
S41 Friedler, Sorelle A., et al. Proceedings of the conference on fairness, accountability, and transparency. 2019. A comparative study of fairness-enhancing interventions in machine learning. 1 1 0.5 2.5 PS38
S42 Ferrari, Elisa, and Davide Bacciu. arXiv preprint arXiv:2105.06345 (2021). Addressing Fairness, Bias, and Class Imbalance in Machine Learning: the FBI-loss. 1 1 1 3 PS39
S43 Du, Mengnan, et al. IEEE Intelligent Systems 36.4 (2020): 25-34. Fairness in deep learning: A computational perspective. 1 0.5 0.5 2 PS40
S44 Huang, Lingxiao, and Nisheeth Vishnoi. International Conference on Machine Learning. PMLR, 2019. Stable and fair classification. 1 0.5 0.5 2 PS41
S45 Celis, L. Elisa, et al. International Conference on Machine Learning. PMLR, 2021. Fair classification with noisy protected attributes: A framework with provable guarantees. 1 1 0.5 2.5 PS42
S46 Goel, Naman, Mohammad Yaghini, and Boi Faltings. Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 32. No. 1. 2018. Non-discriminatory machine learning through convex fairness criteria. 1 1 0.5 2.5 PS43
S47 Shrestha, Yash Raj, and Yongjie Yang. Algorithms 12.9 (2019): 199. Fairness in algorithmic decision-making: Applications in multi-winner voting, machine learning, and recommender systems. 1 0.5 0.5 2 PS44
S48 Mandal, Debmalya, et al. Advances in neural information processing systems 33 (2020): 18445-18456. Ensuring fairness beyond the training data. 1 0.5 0.5 2 PS45
S49 Gao, Xuanqi, et al. FairNeuron: improving deep neural network fairness with adversary games on selective neurons. 1 1 1 3 PS46
S50 Tizpaz-Niari, Saeid, et al. Fairness-aware configuration of machine learning libraries. 1 1 0.5 2.5 PS47
S51 Chakraborty, Joymallya, Suvodeep Majumder, and Huy Tu. Fair-SSL: Building fair ML Software with less data. 1 0.5 0.5 2 PS48
S52 Patel, Ankita Ramjibhai, et al. A combinatorial approach to fairness testing of machine learning models. 1 1 1 3 PS49
S53 Chen, Zhenpeng, et al. MAAT: a novel ensemble approach to addressing fairness and performance bugs for machine learning software. 1 0.5 1 2.5 PS50
S54 Li, Yanhui, et al. Training data debugging for the fairness of machine learning software. 1 0.5 0.5 2 PS51
S55 Fan, Ming, et al. Explanation-guided fairness testing through genetic algorithm. 1 1 0.5 2.5 PS52
S56 Pu, Muxin, et al. Fairness evaluation in deepfake detection models using metamorphic testing. 1 1 1 3 PS53
S57 Zheng, Haibin, et al. Neuronfair: Interpretable white-box fairness testing through biased neuron identification. 1 1 0.5 2.5 PS54
S58 Zhang, Peixin, et al. Automatic fairness testing of neural classifiers through adversarial sampling. 1 0.5 0.5 2 PS55
S59 Zhang, Peixin, et al. White-box Fairness Testing through Adversarial Sampling 1 0.5 1 2.5 PS56
S60 Fabris, Alessandro, et al Algorithmic fairness datasets: the story so far. 1 0.5 0.5 2 PS57
S61 Zhang, Jiehuang, Ying Shu, and Han Yu. Fairness in design: A framework for facilitating ethical artificial intelligence designs. 1 0.5 0.5 2 PS58
S62 Majumder, Suvodeep, et al. Fair enough: Searching for sufficient measures of fairness. 1 0.5 0.5 2 PS59
S63 Wang, Zichong, et al. Towards fair machine learning software: Understanding and addressing model bias through counterfactual thinking. 1 1 1 3 PS60
S64 Guo, Huizhong, et al. Fairrec: fairness testing for deep recommender systems. 0.5 1 0.5 2 PS61
S65 Hort, Max, et al. Search-based automatic repair for fairness and accuracy in decision-making software. 1 1 0.5 2.5 PS62
S66 Bantilan, Niels. Themis-ml: A fairness-aware machine learning interface for end-to-end discrimination discovery and mitigation. 1 1 1 3 PS63
S67 Ling, Jiasheng, et al. Machine Learning-Based Multilevel Intrusion Detection Approach 1 1 0.5 2.5 PS64
S68 Nasiri, Roya. Testing Individual Fairness in Graph Neural Networks 1 0.5 0.5 2 PS65
S69 Consuegra-Ayala et al. Bias mitigation for fair automation of classification tasks. 1 1 1 3 PS66
S70 Xu, Paiheng, et al. GFairHint: Improving Individual Fairness for Graph Neural Networks via Fairness Hint. 1 1 1 3 PS67
S71 Bahangulu et al. Algorithmic bias, data ethics, and governance: Ensuring fairness, transparency, and compliance in AI-powered business analytics applications 0.5 0.5 0.5 1.5

Appendix B: Research Gaps Summary

Table B1. Research Gaps.
Category Sub-category Research Gap Primary Study ID
1. Fairness in Machine Learning
Fairness Metrics and Definitions
Investigate incorporating fairness metrics into neural networks.
Develop additional measures for fairness.
Improve the definitions of fairness in data analysis.
Study other fairness metrics beyond individual fairness.
Much research relies on specific fairness metrics while ignoring others.
PS4
PS18
PS22
PS23
PS30
PS36
PS59
Fairness Across ML Tasks Include fairness constraints in supervised (regression, recommendation) and unsupervised (set selection, ranking) tasks.
Fairness in natural language understanding, resource allocation, representation learning, and causal learning.
Fairness in regression-based systems.
Discover limitations in fair regression.
Fairness in CNNs and DNNs.
Extend DNN fairness testing to CNNs.
PS12
PS13
PS15
PS26
PS44
PS46
PS61
Bias Mitigation Techniques
More effective mitigation strategies beyond simple ML modifications.
Retraining as a solution to genetic fairness testing.
Explore hyper-parameter configurations that lead to high fairness.
Improve semi-supervised techniques to achieve fairness with limited labeled data.
Investigate ways to adjust the learning process to account for biases.
PS29
PS47
PS48
PS50
PS52
PS66
Sensitive Attributes & Social Constructs
Study numerical attributes and groups of attributes as sensitive.
Consider income, race, religious beliefs, age, nationality.
Explore multiple and continuous sensitive features.
Study other social constructs and stereotypes.
PS7
PS21
PS31
Fairness Testing Frameworks
Extend reinforcement learning-based testing for fairness.
Extend Themis for algorithmic bias testing.
FairRec for multi-attribute group fairness testing.
White-box and black-box fairness testing.
Explore other measures of fairness in white-box testing.
More techniques for black-box testing to detect individual discrimination.
PS14
PS19
PS49
PS61
PS63
Fairness in Real-World Contexts
Evaluate fairness-accuracy tradeoffs in real-world scenarios.
Fairness in temporal settings.
Study real-world data issues and their impact on fairness.
Add data documentation to future projects.
Extend fairness to ethical values like privacy and explainability.
PS27
PS57
2. Feature Selection & Data Handling
Improve quality of datasets with test case generation approaches.
Feature selection in large datasets.
Investigating advanced feature selection techniques.
Adding datasets, algorithms, and imputation strategies.
Preprocessing of missing training data.
Pre-processing training data.
Study other datasets and fairness-accuracy tradeoffs.
Explore different characteristics in training data.
PS1
PS6
PS24
PS37
PS41
PS43
PS45
PS51
PS64
3. Model Optimization & Training Efficiency
New ways for faster training processes.
Training CNNs with low-precision weights.
Improve algorithms for efficiency and equity.
Improve stability by shifting decision boundaries.
Implement models in industrial IoT for imbalanced learning.
Improve quantum decision models with fairness guarantees.
Enhance scalability of fairness testing for large datasets and complex graphs.
PS2
PS3
PS8
PS16
PS25
PS65
PS66
4. Testing & Evaluation Frameworks
Improve white-box models for robustness.
Explore test case generation approaches.
DNN white-box testing challenges.
Try different equivalence classes for retraining.
Study bias kernels detected by verification algorithms.
Neuron coverage as a distortion metric.
Explore causal techniques, post-processing, interpretability, calibration.
Extend counterfactual thinking to text and image processing.
PS17
PS32
PS33
PS34
PS54
PS55
PS56
PS60
5. Reinforcement Learning & Reward Functions
Study other definitions of reward functions in black-box testing.
Extend RL-based testing frameworks.
Determine ideal G-ratio for fairness testing.
PS11
6. Ethics, Law, and Societal Impact
Types of laws or regulations.
Ethical difficulty in statistical machine translation.
Algorithms raise complex questions for researchers and policymakers.
More debates on fairness including technical and cultural causes.
Transparency in model complexity and fairness tradeoff.
Algorithmic choices and social context.
Fair unified solutions.
Interdisciplinary collaboration (CS, statistics, cognitive science).
PS28
PS30
PS38
PS39
PS58
7. Emerging Applications & Techniques
Deep clustering, adversarial training, and attacks.
Study proposed methods with image compression and deepfake techniques.
Investigate noise models for non-binary attributes.
Long-term studies on bias mitigation effects.
PS20
PS42
PS53
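Several of the gaps above (white-box and black-box testing, individual discrimination detection, counterfactual thinking) center on one underlying check: altering only a sensitive attribute and observing whether the prediction changes. The following is a hedged sketch of that black-box check; the model, feature layout, and helper name are hypothetical placeholders, not the method of any single mapped study:

```python
# Sketch of a black-box individual-fairness check: flip only the sensitive
# attribute and flag inputs whose prediction changes. The model and feature
# layout below are toy placeholders.

def find_discriminatory_inputs(predict, inputs, sensitive_index, domain):
    """Return inputs whose prediction changes when only the sensitive
    attribute is altered -- evidence of individual discrimination.

    predict: black-box function mapping a feature tuple to a label.
    inputs: iterable of feature tuples to probe.
    sensitive_index: position of the sensitive attribute in each tuple.
    domain: all admissible values of the sensitive attribute.
    """
    discriminatory = []
    for x in inputs:
        base = predict(x)
        for v in domain:
            if v == x[sensitive_index]:
                continue
            twin = x[:sensitive_index] + (v,) + x[sensitive_index + 1:]
            if predict(twin) != base:  # metamorphic relation violated
                discriminatory.append(x)
                break
    return discriminatory

# Toy biased model over (sex, income): approves only when sex == 1 and
# income >= 50, so flipping sex changes the outcome whenever income >= 50.
model = lambda x: int(x[0] == 1 and x[1] >= 50)
probes = [(1, 60), (0, 60), (1, 40)]
# (1, 60) and (0, 60) are flagged; (1, 40) is rejected for either sex.
print(find_discriminatory_inputs(model, probes, sensitive_index=0, domain=(0, 1)))
```

Search-based and adversarial-sampling approaches in the mapped studies differ mainly in how they generate the probe inputs; the metamorphic relation checked is the same.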

References

  1. Tremblay, Monica Chiarini, Kaushik Dutta, and Debra Vandermeer. “Using data mining techniques to discover bias patterns in missing data.” Journal of Data and Information Quality (JDIQ) 2.1 ACM (2010). [CrossRef]
  2. Krasanakis, Emmanouil, Eleftherios Spyromitros-Xioufis, Symeon Papadopoulos, and Yiannis Kompatsiaris. “Adaptive sensitive reweighting to mitigate bias in fairness-aware classification.” Proceedings of the 2018 World Wide Web Conference. ACM (2018). [CrossRef]
  3. Wang, Weijia, and Bill Lin. “Trained biased number representation for ReRAM-based neural network accelerators.” ACM Journal on Emerging Technologies in Computing Systems (JETC) 15.2 (2019): 1-17. [CrossRef]
  4. Amend, Jack J., and Scott Spurlock. “Improving machine learning fairness with sampling and adversarial learning.” Journal of Computing Sciences in Colleges ACM 36.5 (2021): 14-23. [CrossRef]
  5. Wu, Yongkai, Lu Zhang, and Xintao Wu. “On convexity and bounds of fairness-aware classification.” The World Wide Web Conference. ACM (2019). [CrossRef]
  6. Caton, Simon, Saiteja Malisetty, and Christian Haas. “Impact of Imputation Strategies on Fairness in Machine Learning.” Journal of Artificial Intelligence Research 74 (2022): 1011-1035. [CrossRef]
  7. Kamiran, Faisal, and Toon Calders. “Classifying without discriminating.” 2nd International Conference on Computer, Control, and Communication. IEEE (2009). [CrossRef]
  8. Zhou, Xiaokang, Yiyong Hu, Jiayi Wu, Wei Liang, Jianhua Ma, and Qun Jin. “Distribution Bias Aware Collaborative Generative Adversarial Network for Imbalanced Deep Learning in Industrial IoT.” IEEE Transactions on Industrial Informatics (2022). [CrossRef]
  9. Verma, Sahil, and Julia Rubin. “Fairness definitions explained.” IEEE/ACM International Workshop on Software Fairness (Fairware). IEEE (2018). [CrossRef]
  10. KIEMDE, Sountongnoma Martial Anicet, and Ahmed Dooguy KORA. “Fairness of Machine Learning Algorithms for the Black Community.” (2020). [CrossRef]
  11. Xie, Wentao, and Peng Wu. “Fairness Testing of Machine Learning Models Using Deep Reinforcement Learning.” 2020 IEEE 19th International Conference on Trust, Security and Privacy in Computing and Communications (TrustCom). IEEE (2020). [CrossRef]
  12. Olfat, Matt, and Yonatan Mintz. “Flexible Regularization Approaches for Fairness in Deep Learning.” 2020 59th IEEE Conference on Decision and Control (CDC). IEEE (2020). [CrossRef]
  13. Zafar, Muhammad Bilal, Isabel Valera, Manuel Gomez-Rodriguez, and Krishna P. Gummadi. “Fairness constraints: A flexible approach for fair classification.” The Journal of Machine Learning Research 20.1 (2019): 2737-2778. [CrossRef]
  14. Sharma, Arnab, and Heike Wehrheim. “Automatic fairness testing of machine learning models.” IFIP International Conference on Testing Software and Systems. Springer, Cham (2020). [CrossRef]
  15. Villar, David, and Jorge Casillas. “Facing Many Objectives for Fairness in Machine Learning.” International Conference on the Quality of Information and Communications Technology. Springer, Cham (2021). [CrossRef]
  16. Guan, Ji, Wang Fang, and Mingsheng Ying. “Verifying Fairness in Quantum Machine Learning.” International Conference on Computer Aided Verification. Springer, Cham (2022). [CrossRef]
  17. Nakajima, Shin, and Tsong Yueh Chen. “Generating biased dataset for metamorphic testing of machine learning programs.” IFIP International Conference on Testing Software and Systems. Springer, Cham (2019). [CrossRef]
  18. Kamishima, Toshihiro, Shotaro Akaho, Hideki Asoh, and Jun Sakuma. “Fairness-aware classifier with prejudice remover regularizer.” Joint European Conference on Machine Learning and Knowledge Discovery in Databases. Springer, Berlin, Heidelberg (2012). [CrossRef]
  19. Perera, Anjana, Aldeida Aleti, Chakkrit Tantithamthavorn, Jirayus Jiarpakdee, Burak Turhan, Lisa Kuhn, and Katie Walker. “Search-based fairness testing for regression-based machine learning systems.” Empirical Software Engineering 27.3 (2022): 1-36. [CrossRef]
  20. Tian, Huan, Tianqing Zhu, Wei Liu, and Wanlei Zhou. “Image fairness in deep learning: problems, models, and challenges.” Neural Computing and Applications (2022): 1-19. [CrossRef]
  21. Calders, Toon, and Sicco Verwer. “Three naive Bayes approaches for discrimination-free classification.” Data Mining and Knowledge Discovery 21.2 (2010): 277-292. [CrossRef]
  22. Sun, Haipei, Yiding Yang, Yanying Li, Huihui Liu, Xinchao Wang, and Wendy Hui Wang. “Automating fairness configurations for machine learning.” Companion Proceedings of the Web Conference (2021). [CrossRef]
  23. Abraham, Savitha Sam. “FairLOF: Fairness in Outlier Detection.” Data Science and Engineering 6.4 (2021): 485-499. [CrossRef]
  24. Wang, Yanchen, and Lisa Singh. “Analyzing the impact of missing values and selection bias on fairness.” International Journal of Data Science and Analytics 12.2 (2021): 101-119. [CrossRef]
  25. Corbett-Davies, Sam, Emma Pierson, Avi Feller, Sharad Goel, and Aziz Huq. “Algorithmic decision making and the cost of fairness.” Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2017). [CrossRef]
  26. Berk, Richard, Hoda Heidari, Shahin Jabbari, Matthew Joseph, Michael Kearns, Jamie Morgenstern, Seth Neel, and Aaron Roth. “A convex framework for fair regression.” arXiv preprint arXiv:1706.02409 (2017). [CrossRef]
  27. Celis, L. Elisa, Lingxiao Huang, Vijay Keswani, and Nisheeth K. Vishnoi. “Classification with fairness constraints: A meta-algorithm with provable guarantees.” Proceedings of the Conference on Fairness, Accountability, and Transparency (2019). [CrossRef]
  28. Prates, Marcelo OR, Pedro H. Avelar, and Luís C. Lamb. “Assessing gender bias in machine translation: a case study with Google Translate.” Neural Computing and Applications 32.10 (2020): 6363-6381. [CrossRef]
  29. Fazelpour, Sina, and Zachary C. Lipton. “Algorithmic Fairness from a Non-ideal Perspective.” Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society (2020). [CrossRef]
  30. Hutchinson, Ben, and Margaret Mitchell. “50 years of test (un) fairness: Lessons for machine learning.” Proceedings of the Conference on Fairness, Accountability, and Transparency (2019). [CrossRef]
  31. Font, Joel Escudé, and Marta R. Costa-Jussa. “Equalizing gender biases in neural machine translation with word embeddings techniques.” arXiv preprint arXiv:1901.03116 (2019). Association for Computational Linguistics (ACL). [CrossRef]
  32. Aggarwal, Aniya, Pranay Lohia, Seema Nagar, Kuntal Dey, and Diptikalyan Saha. “Black box fairness testing of machine learning models.” Proceedings of the 2019 27th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering (2019). [CrossRef]
  33. Jones, Gareth P., James M. Hickey, Pietro G. Di Stefano, Charanpal Dhanjal, Laura C. Stoddart, and Vlasios Vasileiou. “Metrics and methods for a systematic comparison of fairness-aware machine learning algorithms.” arXiv preprint arXiv:2010.03986 (2020). [CrossRef]
  34. Krishna, Satyapriya, Rahul Gupta, Apurv Verma, and Jwala Dhamala. “Measuring Fairness of Text Classifiers via Prediction Sensitivity.” arXiv preprint arXiv:2203.08670 (2022). Association for Computational Linguistics. [CrossRef]
  35. Oneto, Luca, and Silvia Chiappa. “Fairness in machine learning.” Recent Trends in Learning from Data: Tutorials from the INNS Big Data and Deep Learning Conference (INNSBDDL2019). Springer International Publishing (2020). [CrossRef]
  36. Varley, Michael, and Vaishak Belle. “Fairness in machine learning with tractable models.” Knowledge-Based Systems 215 (2021): 106715. [CrossRef]
  37. Valdivia, Ana, Javier Sánchez-Monedero, and Jorge Casillas. “How fair can we go in machine learning? Assessing the boundaries of accuracy and fairness.” International Journal of Intelligent Systems 36.4 (2021): 1619-1643. [CrossRef]
  38. Friedler, Sorelle A., et al. “A comparative study of fairness-enhancing interventions in machine learning.” Proceedings of the Conference on Fairness, Accountability, and Transparency (2019). [CrossRef]
  39. Ferrari, Elisa, and Davide Bacciu. “Addressing Fairness, Bias and Class Imbalance in Machine Learning: the FBI-loss.” arXiv preprint arXiv:2105.06345 (2021). [CrossRef]
  40. Du, Mengnan, et al. “Fairness in deep learning: A computational perspective.” IEEE Intelligent Systems 36.4 (2020): 25-34. [CrossRef]
  41. Huang, Lingxiao, and Nisheeth Vishnoi. “Stable and fair classification.” International Conference on Machine Learning. PMLR (2019). [CrossRef]
  42. Celis, L. Elisa, Lingxiao Huang, Vijay Keswani, and Nisheeth K. Vishnoi. “Fair classification with noisy protected attributes: A framework with provable guarantees.” International Conference on Machine Learning. PMLR (2021). [CrossRef]
  43. Goel, Naman, Mohammad Yaghini, and Boi Faltings. “Non-discriminatory machine learning through convex fairness criteria.” Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 32. No. 1. (2018). [CrossRef]
  44. Shrestha, Yash Raj, and Yongjie Yang. “Fairness in algorithmic decision-making: Applications in multi-winner voting, machine learning, and recommender systems.” Algorithms 12.9 (2019): 199. [CrossRef]
  45. Mandal, Debmalya, Samuel Deng, Suman Jana, Jeannette Wing, and Daniel J. Hsu. “Ensuring fairness beyond the training data.” Advances in Neural Information Processing Systems 33 (2020): 18445-18456. [CrossRef]
  46. Gao, Xuanqi, Juan Zhai, Shiqing Ma, Chao Shen, Yufei Chen, and Qian Wang. “FairNeuron: improving deep neural network fairness with adversary games on selective neurons.” Proceedings of the 44th International Conference on Software Engineering (2022). [CrossRef]
  47. Tizpaz-Niari, Saeid, Ashish Kumar, Gang Tan, and Ashutosh Trivedi. “Fairness-aware configuration of machine learning libraries.” Proceedings of the 44th International Conference on Software Engineering (2022). [CrossRef]
  48. Chakraborty, Joymallya, Suvodeep Majumder, and Huy Tu. “Fair-SSL: Building fair ML Software with less data.” Proceedings of the 2nd International Workshop on Equitable Data and Technology (2022). [CrossRef]
  49. Patel, Ankita Ramjibhai, Jaganmohan Chandrasekaran, Yu Lei, Raghu N. Kacker, and D. Richard Kuhn. “A combinatorial approach to fairness testing of machine learning models.” 2022 IEEE International Conference on Software Testing, Verification and Validation Workshops (ICSTW). IEEE (2022). [CrossRef]
  50. Chen, Zhenpeng, Jie M. Zhang, Federica Sarro, and Mark Harman. “MAAT: a novel ensemble approach to addressing fairness and performance bugs for machine learning software.” Proceedings of the 30th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (2022). [CrossRef]
  51. Li, Yanhui, Linghan Meng, Lin Chen, Li Yu, Di Wu, Yuming Zhou, and Baowen Xu. “Training data debugging for the fairness of machine learning software.” Proceedings of the 44th International Conference on Software Engineering (2022). [CrossRef]
  52. Fan, Ming, Wenying Wei, Wuxia Jin, Zijiang Yang, and Ting Liu. “Explanation-guided fairness testing through genetic algorithm.” Proceedings of the 44th International Conference on Software Engineering (2022). [CrossRef]
  53. Pu, Muxin, Meng Yi Kuan, Nyee Thoang Lim, Chun Yong Chong, and Mei Kuan Lim. “Fairness evaluation in deepfake detection models using metamorphic testing.” 2022 IEEE/ACM 7th International Workshop on Metamorphic Testing (MET). IEEE (2022). [CrossRef]
  54. Zheng, Haibin, Zhiqing Chen, Tianyu Du, Xuhong Zhang, Yao Cheng, Shouling Ji, Jingyi Wang, Yue Yu, and Jinyin Chen. “Neuronfair: Interpretable white-box fairness testing through biased neuron identification.” Proceedings of the 44th International Conference on Software Engineering (2022). [CrossRef]
  55. Zhang, Peixin, Jingyi Wang, Jun Sun, Xinyu Wang, Guoliang Dong, Xingen Wang, Ting Dai, and Jin Song Dong. “Automatic fairness testing of neural classifiers through adversarial sampling.” IEEE Transactions on Software Engineering 48.9 (2021): 3593-3612. [CrossRef]
  56. Zhang, Peixin, Jingyi Wang, Jun Sun, Guoliang Dong, Xinyu Wang, Xingen Wang, Jin Song Dong, and Ting Dai. “White-box fairness testing through adversarial sampling.” Proceedings of the ACM/IEEE 42nd International Conference on Software Engineering (2020). [CrossRef]
  57. Fabris, Alessandro, Stefano Messina, Gianmaria Silvello, and Gian Antonio Susto. “Algorithmic fairness datasets: the story so far.” Data Mining and Knowledge Discovery 36.6 (2022): 2074-2152. [CrossRef]
  58. Zhang, Jiehuang, Ying Shu, and Han Yu. “Fairness in design: A framework for facilitating ethical artificial intelligence designs.” International Journal of Crowd Science 7.1 (2023): 32-39. [CrossRef]
  59. Majumder, Suvodeep, Joymallya Chakraborty, Gina R. Bai, Kathryn T. Stolee, and Tim Menzies. “Fair enough: Searching for sufficient measures of fairness.” ACM Transactions on Software Engineering and Methodology 32.6 (2023): 1-22. [CrossRef]
  60. Wang, Zichong, Yang Zhou, Meikang Qiu, Israat Haque, Laura Brown, Yi He, Jianwu Wang, David Lo, and Wenbin Zhang. “Towards fair machine learning software: Understanding and addressing model bias through counterfactual thinking.” arXiv preprint arXiv:2302.08018 (2023). [CrossRef]
  61. Guo, Huizhong, Jinfeng Li, Jingyi Wang, Xiangyu Liu, Dongxia Wang, Zehong Hu, Rong Zhang, and Hui Xue. “Fairrec: fairness testing for deep recommender systems.” Proceedings of the 32nd ACM SIGSOFT International Symposium on Software Testing and Analysis (2023). [CrossRef]
  62. Hort, Max, Jie M. Zhang, Federica Sarro, and Mark Harman. “Search-based automatic repair for fairness and accuracy in decision-making software.” Empirical Software Engineering 29.1 (2024): 36. [CrossRef]
  63. Bantilan, Niels. “Themis-ml: A fairness-aware machine learning interface for end-to-end discrimination discovery and mitigation.” Journal of Technology in Human Services 36.1 (2018): 15-30. [CrossRef]
  64. Ling, Jiasheng, Lei Zhang, Chenyang Liu, Guoxin Xia, and Zhenxiong Zhang. “Machine Learning-Based Multilevel Intrusion Detection Approach.” Electronics 14, no. 2 (2025): 323. [CrossRef]
  65. Nasiri, Roya. “Testing Individual Fairness in Graph Neural Networks.” arXiv preprint arXiv:2504.18353 (2025). [CrossRef]
  66. Consuegra-Ayala, Juan Pablo, Yoan Gutiérrez, Yudivian Almeida-Cruz, and Manuel Palomar. “Bias mitigation for fair automation of classification tasks.” Expert Systems 42, no. 2 (2025): e13734. DOI: 10.1111/exsy.13734.
  67. Xu, Paiheng, Yuhang Zhou, Bang An, Wei Ai, and Furong Huang. “Gfairhint: Improving individual fairness for graph neural networks via fairness hint.” ACM Transactions on Knowledge Discovery from Data 19, no. 3 (2025): 1-22. [CrossRef]
  68. Foidl, Harald, and Michael Felderer. “Risk-based data validation in machine learning-based software systems.” Proceedings of the 3rd ACM SIGSOFT International Workshop on Machine Learning Techniques for Software Quality Evaluation (2019). [CrossRef]
  69. Riccio, Vincenzo, Gunel Jahangirova, Andrea Stocco, Nargiz Humbatova, Michael Weiss, and Paolo Tonella. “Testing machine learning based systems: a systematic mapping.” Empirical Software Engineering 25.6 (2020): 5193-5254. [CrossRef]
  70. Mehrabi, Ninareh, Fred Morstatter, Nripsuta Saxena, Kristina Lerman, and Aram Galstyan. “A survey on bias and fairness in machine learning.” ACM Computing Surveys (CSUR) 54.6 (2021): 1-35. [CrossRef]
  71. Alkatheri, Maha Saleh. “Testing Machine-Learning-based Software for Fairness: Ontology-Guided Test Cases Generation” (2022). [CrossRef]
  72. Le Quy, Tuan, Abir Roy, Vasileios Iosifidis, Wei Zhang, and Eirini Ntoutsi. “A survey on datasets for fairness-aware machine learning.” Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery (2022). [CrossRef]
  73. Caton, Simon, and Christian Haas. “Fairness in machine learning: A survey.” arXiv preprint arXiv:2010.04053 (2020). [CrossRef]
  74. Richardson, Brianna, and Juan E. Gilbert. “A Framework for Fairness: A Systematic Review of Existing Fair AI Solutions.” arXiv preprint arXiv:2112.05700 (2021). [CrossRef]
  75. Keele, Staffs. Guidelines for performing systematic literature reviews in software engineering. EBSE Technical Report, Ver. 2.3. EBSE (2007). [CrossRef]
  76. Pessach, Dana, and Erez Shmueli. “A review on fairness in machine learning.” ACM Computing Surveys (CSUR) 55.3 (2022): 1-44. [CrossRef]
  77. Kohavi, Ronny, and Barry Becker. “Adult data set.” UCI Machine Learning Repository 5 (1996): 2093. [CrossRef]
  78. Hofmann, Hans. “Statlog (German credit data).” UCI Machine Learning Repository 10 (1994): C5NC77. [CrossRef]
  79. Ofer, Dan. “ProPublica (COMPAS).” Kaggle (2016). [CrossRef]
  80. Jui, Tonni Das, and Pablo Rivas. “Fairness issues, current approaches, and challenges in machine learning models.” International Journal of Machine Learning and Cybernetics (2024): 1-31. [CrossRef]
  81. Chen, Zhenpeng, Jie M. Zhang, Max Hort, Mark Harman, and Federica Sarro. “Fairness testing: A comprehensive survey and analysis of trends.” ACM Transactions on Software Engineering and Methodology 33.5 (2024): 1–59. [CrossRef]
  82. Petersen, Kai, Robert Feldt, Shahid Mujtaba, and Michael Mattsson. “Systematic mapping studies in software engineering.” in 12th International Conference on Evaluation and Assessment in Software Engineering (EASE). BCS Learning & Development, 2008. [CrossRef]
  83. Ahuja, Ravinder, Aakarsha Chug, Shaurya Gupta, Pratyush Ahuja, and Shruti Kohli. “Classification and clustering algorithms of machine learning with their applications.” in Nature-inspired computation in data mining and machine learning, pp. 225-248. Cham: Springer International Publishing, 2019. [CrossRef]
Figure 1. The methodological process followed in this study, illustrating the sequential steps for performing a systematic mapping study (SMS), from initial research question formulation to final reporting.
Figure 2. Overview of the initial search results generated by the search string, showing the number of publications retrieved from each source: IEEE (600), ACM (599), Springer (402), ArXiv (152), and Wiley (52).
Figure 3. Number of studies retained from each digital library after applying the inclusion criteria, showing IEEE (513), ACM (453), Springer (244), ArXiv (133), and Wiley (27).
Figure 4. Number of studies remaining after applying the exclusion criteria across digital libraries (IEEE: 21, ACM: 18, Springer: 8, ArXiv: 11, Wiley: 2), followed by the addition of 11 studies from supplementary databases, yielding a total of 71 collected primary studies.
Figure 5. Total number of collected primary studies before and after applying the quality assessment, showing a reduction from 71 initial studies to 67 studies that met the required quality criteria.
Figure 6. Overview of common types of bias in machine learning, including historical bias, representation bias, measurement bias, aggregation bias, evaluation bias, and deployment bias, along with examples of how each bias can arise.
Figure 7. Distribution of primary studies by publication venue type, showing that 52% were published in conferences, 42% in journals, and 6% in workshops.
Figure 8. Distribution of the primary studies across publication years, illustrating the increasing research interest over time, with papers ranging from 2009 to 2025.
Figure 9. Distribution of primary studies by problem type, showing that 36% focus on analyzing bias, 30% on testing, 12% on evaluating, and 22% on mitigating bias.
Figure 10. Summary of proposed fairness testing approaches across the reviewed studies, showing the distribution of methods: metric-based fairness testing (38 studies), synthetic data testing (17 studies), adversarial fairness testing (9 studies), comprehensive audits (2 studies), feature importance testing (2 studies), and causal fairness testing (1 study).
Figure 11. Distribution of primary studies across different testing levels, showing that 18 studies focus on system testing, 8 on unit testing, 25 on input-level testing, and 15 apply more than one testing level.
Figure 12. Distribution of the most commonly used datasets in the reviewed studies, showing that UCI Adult was used in 20 studies, German Credit in 15 studies, and COMPAS in 11 studies.
Figure 13. Fairness-related data issues identified across the reviewed studies, including the influence of training dataset quality (27 studies), preprocessing and missing-value handling (24 studies), and feature selection and sensitive attribute analysis (9 studies).
Table 1. Research Questions.
Id Research questions Rationale
RQ1 What is the distribution of papers through venues and years? Identify the publication venues of the collected papers (journals, conferences, and workshops) and their distribution over publication years.
RQ2 What are the different definitions of software fairness? Explore software fairness definitions from primary studies.
RQ3 What types of problems are addressed? Identify the type of addressed problem (detection, analysis, or evaluation).
RQ4 What different approaches of fairness testing are presented? Describe the different approaches used in fairness testing solutions, such as algorithms or tools, and explore the fairness testing levels applied in MLS.
RQ5 Which datasets are used to detect the unfairness of MLS? Identify some datasets/algorithms/models that are used to detect unfairness in MLS and explain the reasons behind their biases.
RQ6 What are the research gaps and trends discovered in the reviewed studies? Explore the gaps in software fairness in MLS research topics.
Table 2. Digital Libraries.
Main Digital Libraries: IEEE Xplore, Springer Link, ACM Digital Library, Wiley Online Library, ArXiv
Supplementary Databases/Websites: Google Scholar, ResearchGate
Table 3. Search Keywords.
Search keywords RQs Possible Values
Journals, Conferences, Year RQ1 Journals, conferences, university names and types, published year
Definition RQ2 Fairness definition
Problem Types RQ3 Analyzing, reviewing, detecting, evaluating, or testing solutions.
Methodology RQ4 Approaches, tools, algorithms, unit testing, input testing, or system testing.
Datasets/Algorithms/Models RQ5 Bias/unfairness, datasets, models, algorithms
Research gaps RQ6 Trend, future work, research gap
Table 7. Summary of the extracted fairness definitions, categorized by concept and metric.
Category Concept/Metric Fairness Definition Study ID
General Fairness
Fairness Ethical principle ensuring equitable and unbiased treatment across individuals/groups. PS66
Fairness-aware Model A model that avoids discrimination and promotes fairness. PS66
Fairness Degree Max difference in predictions for pairs differing only in sensitive attributes. PS19
Fairness Through Unawareness Fairness by excluding sensitive attributes from decision-making. PS38
Counterfactual Fairness Prediction remains unchanged if the individual belongs to a different group. PS38
Algorithmic Bias Bias from mathematical rules favoring certain attributes. PS63
Individual Fairness
Individual Fairness Similar individuals should receive similar outcomes. PS67, PS10, PS23, PS26
Individual Discrimination Discrimination between individuals who differ only in protected attributes. PS11, PS32
Group Fairness
Group Fairness Equal outcomes across demographic groups (e.g., gender, race). PS10, PS23, PS26
Fairness Constraints Constraints like demographic parity, equal opportunity, disparate impact. PS23, PS26
Fairness Metrics
Demographic Parity Outcome is independent of the protected attribute. PS15, PS25
Equal Opportunity Equal true positive rates across groups. PS25, PS6, PS39
Predictive Parity Equal positive predictive value across groups. PS15, PS35
Disparate Impact Ratio of favorable outcomes between groups. PS25, PS6
Average Absolute Odds Average difference in false/true positive rates across groups. PS6, PS35
Theil Index Measures inequality in prediction outcomes. PS6
Calibration Predicted score should reflect actual outcomes equally across groups. PS35, PS38
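The group fairness metrics listed in Table 7 are simple to operationalize. The sketch below is an illustrative implementation (not taken from any of the reviewed tools) of three of them for a binary classifier and a binary protected attribute; the function names and the encoding of group 1 as the privileged group are assumptions for the example.

```python
import numpy as np

def demographic_parity_difference(y_pred, group):
    """Absolute difference in positive-prediction rates between the two groups."""
    y_pred, group = np.asarray(y_pred), np.asarray(group)
    return abs(y_pred[group == 0].mean() - y_pred[group == 1].mean())

def disparate_impact(y_pred, group):
    """Ratio of favorable-outcome rates: unprivileged (0) over privileged (1)."""
    y_pred, group = np.asarray(y_pred), np.asarray(group)
    return y_pred[group == 0].mean() / y_pred[group == 1].mean()

def equal_opportunity_difference(y_true, y_pred, group):
    """Absolute difference in true-positive rates between the two groups."""
    y_true, y_pred, group = map(np.asarray, (y_true, y_pred, group))
    def tpr(g):
        return y_pred[(group == g) & (y_true == 1)].mean()
    return abs(tpr(0) - tpr(1))
```

For example, predictions [1, 0, 0, 0] for group 0 versus [1, 1, 1, 0] for group 1 give a demographic parity difference of 0.5 and a disparate impact of 1/3, well below the 0.8 threshold of the common "80% rule".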
Table 8. Distribution of Problem Types and Number of Studies.
Problem Type Problem Description Number of Studies Studies IDs
Analyzing The process of systematically examining proposed unfairness detection methods to understand their effectiveness and limitations. 24 PS1, PS4, PS5, PS6, PS12, PS16, PS18, PS20, PS21, PS22, PS24, PS25, PS26, PS28, PS36, PS37, PS41, PS42, PS45, PS46, PS58, PS61, PS66
Mitigating Bias The process of implementing strategies or algorithms that ensure fair outcomes and mitigate bias in unfairness detection systems. 15 PS3, PS7, PS8, PS13, PS15, PS23, PS27, PS31, PS34, PS43, PS48, PS53, PS59, PS64, PS67
Testing The process of proposing testing solutions for checking whether the model produces fair outcomes and follows fairness metrics. 20 PS2, PS10, PS11, PS14, PS17, PS19, PS32, PS35, PS39, PS49, PS50, PS51, PS52, PS54, PS55, PS56, PS60, PS62, PS63, PS65
Evaluating The process of comparing different approaches, analyzing outcomes, and validating results. 8 PS9, PS29, PS30, PS33, PS40, PS44, PS47, PS57
Table 9. Summary of Suggested Solutions.
Problem Type Suggested Solutions
Analyzing Optimization methods, Analysis of Fairness Metrics
Mitigating Bias Mitigation methods include outlier detection, ranking, imputation, data massaging, dataset sampling, and data augmentation.
Testing White/black box testing, test case generation tools, comprehensive audits, adversarial fairness testing, feature importance testing, metric-based fairness testing, synthetic data testing, and causal fairness testing.
Evaluating Compare unfairness detection approaches, evaluate methods by using different benchmark datasets with suggested solutions, and compare multiple fairness metrics.
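Among the testing solutions above, synthetic data testing for individual discrimination follows a simple recipe: perturb only the protected attribute of each instance and check whether the model's prediction changes. The helper below is a hypothetical sketch of that idea, not the implementation of any surveyed tool; it assumes a binary protected attribute stored at a known column index.

```python
import numpy as np

def find_individual_discrimination(model_predict, X, sensitive_idx):
    """Synthetic-pair test: flip the binary protected attribute of every
    instance and return the indices whose prediction changes, i.e. the
    discriminatory instances under individual fairness."""
    X = np.asarray(X, dtype=float)
    X_flipped = X.copy()
    X_flipped[:, sensitive_idx] = 1 - X_flipped[:, sensitive_idx]
    orig = np.asarray(model_predict(X))
    flipped = np.asarray(model_predict(X_flipped))
    return np.nonzero(orig != flipped)[0]

# Toy models for illustration: one decides purely on the protected
# column (biased), the other ignores it (fair).
biased = lambda X: (np.asarray(X)[:, 0] == 1).astype(int)
fair   = lambda X: (np.asarray(X)[:, 1] > 0.5).astype(int)
```

On inputs `[[0, 0.9], [1, 0.2], [0, 0.4]]` with the protected attribute in column 0, the biased model is flagged on all three instances while the fair model is flagged on none, which is exactly the pass/fail signal a metric-based fairness test would assert on.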
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
Copyright: This open access article is published under a Creative Commons CC BY 4.0 license, which permits free download, distribution, and reuse, provided that the author and preprint are cited in any reuse.