Submitted: 31 May 2024
Posted: 03 June 2024
Abstract
Keywords:
1. Introduction


2. Background and Related Work
2.1. Representation Bias
2.2. Data Augmentation
2.3. Model Steering Involving Domain Experts
3. Interaction Approaches for Representation Debiasing
- Bias awareness: Although domain knowledge is essential for identifying representation bias, prior research has emphasised the importance of interactive explanations for helping domain experts understand the behaviour of prediction systems [35,37,46,47]. We therefore propose an interaction approach for bias awareness that guides domain experts in identifying biased predictor variables with the help of data-centric explanations [8,46,48]. We recommend allowing domain experts to explore the distributions of categories or sub-categories of predictor variables, including the representation rate and data coverage of each category or sub-category, through interactive visualisations for greater transparency. This interaction approach is aligned with the ML transparency and exploration principles of Bove et al. [49], which aim to provide better contextualised explanations of the presence of representation bias. Furthermore, we propose illustrating the model performance for each category or sub-category of predictor variables to identify those most affected by representation bias.
- Multi-variate constraint planning: Although data augmentation algorithms generate additional samples of underrepresented data, a primary reason for their limited effectiveness in mitigating representation bias is the generation of practically infeasible samples [17,31]. This occurs because data augmentation algorithms typically treat each predictor variable independently rather than jointly, and understanding the joint impact of the predictor variables requires in-depth domain knowledge. With multi-variate constraint planning, we propose empowering domain experts to impose constraints on multiple predictor variables simultaneously. This allows the generation of specific sets of samples that experts consider essential for mitigating the impact of representation bias. For example, consider the representation debiasing of a diabetes prediction dataset: if healthcare experts determine that only 50 samples of diabetic patients aged 50 to 60 with high cholesterol and high blood pressure are needed, multi-variate constraint planning can be used to meet that requirement (a minimal sketch of such constraint-driven generation follows this list). This interaction approach provides control over the data augmentation process and mitigates the issue of generating practically infeasible data points.
- Conditional sampling: The interaction approach of conditional sampling is applicable after data augmentation algorithms have been applied, allowing domain experts to select only relevant synthetic samples for upsampling the original training data. We recommend providing data filters for setting expert-defined conditions, allowing domain experts to identify and remove generated samples that appear to be misfits. This approach aims to purify the generated data so that few problematic data points are introduced during representation debiasing.
- What-if exploration: The interaction approach of what-if exploration of the generated data further allows domain experts to validate the generated samples. It aligns with the concept of “what-if” explanations [8,46,50,51] and aims to enhance understanding of the generated samples and their potential impact on prediction models. We recommend applying the prediction model to each generated sample to obtain its predicted target class and the corresponding confidence level. This allows domain experts to identify problematic data points that the prediction algorithm struggles to learn from. Additionally, domain experts should be able to adjust the values of generated samples and conduct what-if analyses to rectify such problematic generated instances (a sketch of this confidence-based screening and what-if re-prediction also follows this list).
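To make the constraint-planning example concrete, the sketch below shows one possible realisation in Python. It assumes a fitted tabular data generator exposing a sample(n) method that returns a pandas DataFrame; the column names, category labels, batch sizes, and the rejection-filtering strategy are illustrative assumptions rather than the specific mechanism prescribed by this paper.

```python
import pandas as pd

# Hypothetical expert requirement from the diabetes example above: 50 additional
# diabetic patients aged 50 to 60 with high cholesterol and high blood pressure.
# Column names and category labels are illustrative, not taken from the study data.
constraints = {
    "diabetes": lambda col: col == "diabetic",
    "age": lambda col: col.between(50, 60),
    "cholesterol": lambda col: col == "high",
    "blood_pressure": lambda col: col == "high",
}


def constrained_generation(generator, constraints, n_required=50,
                           batch_size=500, max_batches=100):
    """Collect synthetic rows that jointly satisfy all expert-defined constraints.

    `generator` is assumed to be a fitted tabular synthesizer with a `sample(n)`
    method returning a DataFrame. Rejection filtering is used here as a
    generator-agnostic approximation of constraint planning.
    """
    accepted = []
    collected = 0
    for _ in range(max_batches):
        batch = generator.sample(batch_size)
        # Combine the per-variable rules into one joint boolean mask.
        mask = pd.Series(True, index=batch.index)
        for column, rule in constraints.items():
            mask &= rule(batch[column])
        accepted.append(batch[mask])
        collected += int(mask.sum())
        if collected >= n_required:
            break
    return pd.concat(accepted, ignore_index=True).head(n_required)
```

Generators that support conditional sampling natively could enforce such constraints during generation instead of filtering afterwards, which is usually more sample-efficient; the filtering variant is shown only because it works with any generator.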
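The second sketch illustrates the conditional-sampling and what-if steps under similar assumptions: a fitted scikit-learn style classifier with predict_proba, a DataFrame of generated samples, and a hypothetical confidence threshold. Function names and the 0.6 default are illustrative choices, not part of the proposed approach itself.

```python
import pandas as pd


def screen_generated_samples(model, generated, feature_cols, confidence_threshold=0.6):
    """Attach the predicted class and confidence to every generated sample and
    flag low-confidence rows as candidates for removal or what-if editing."""
    proba = model.predict_proba(generated[feature_cols])
    screened = generated.copy()
    screened["predicted_class"] = model.classes_[proba.argmax(axis=1)]
    screened["confidence"] = proba.max(axis=1)
    screened["flagged"] = screened["confidence"] < confidence_threshold
    return screened


def what_if(model, row, feature_cols, edits):
    """Re-run the prediction after a domain expert adjusts selected values of a
    single generated sample (`row` is a one-row DataFrame)."""
    edited = row.copy()
    for column, value in edits.items():
        edited[column] = value
    proba = model.predict_proba(edited[feature_cols])
    return model.classes_[proba.argmax(axis=1)][0], float(proba.max())
```

In this sketch, conditional sampling then reduces to ordinary DataFrame filtering, for example `clean = screened[~screened["flagged"]]`, optionally combined with the expert-defined filter conditions described above.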
4. Exploratory User Study
- Data Explorer - This component is designed to create bias awareness (Figure 2 (1)). It presents the overall representation bias measures as well as the representation bias measures for each predictor variable. We included a visual representation of the distribution of each predictor variable, illustrating the representation bias for each sub-category of the selected variable. Additionally, we illustrated the corresponding impact on model performance for each sub-category and highlighted the most affected variables and sub-categories (see the representation-measure sketch after this list).
- Augmentation Controller - This component is designed to support multi-variate constraint planning (Figure 2 (2)). It allows users to specify the required number of generated samples for a specific target class and to set constraints on the predictor variable values of the generated samples.
- Generated Data Explorer - This component is designed to support conditional sampling (Figure 2 (3)) and what-if exploration (Figure 2 (4)) of the generated data. It allows users to sample generated data through conditions defined in data filters and to remove problematic samples. Moreover, it allows users to validate or modify generated samples through “what-if” analysis. By applying the prediction model to the generated samples, this component further helps users identify problematic samples with low prediction confidence levels.
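The Data Explorer presents representation bias measures per predictor variable together with the model performance per sub-category. The sketch below illustrates one simple way such summaries could be computed; the representation-rate and coverage formulations are deliberately simplified illustrations (the survey by Shahbazi et al. [33] gives formal definitions), and the function names and the coverage threshold are hypothetical.

```python
import pandas as pd
from sklearn.metrics import accuracy_score


def representation_summary(df, column, min_count=30):
    """Per-category sample count, share, representation rate (relative to the
    largest category), and a simple coverage flag for one predictor variable.
    The `min_count` coverage threshold is an illustrative choice."""
    counts = df[column].value_counts()
    return pd.DataFrame({
        "count": counts,
        "share": counts / len(df),
        "representation_rate": counts / counts.max(),
        "covered": counts >= min_count,
    })


def per_group_accuracy(model, df, column, feature_cols, target_col):
    """Model accuracy within each sub-category of `column`, used to surface the
    groups most affected by representation bias."""
    predictions = pd.Series(model.predict(df[feature_cols]), index=df.index)
    return df.groupby(column).apply(
        lambda group: accuracy_score(group[target_col], predictions.loc[group.index])
    )
```

Sub-categories with a low representation rate and a noticeably lower per-group accuracy are the natural candidates to highlight in the interface as most affected by representation bias.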
5. Discussions
5.1. Debiasing is a Continual Process
5.2. Implications on Fairness
5.3. Future Work
6. Conclusions
Acknowledgments
References
- Data quality and artificial intelligence – mitigating bias and error to protect fundamental rights. 2019.
- Ding, J.; Li, X. An Approach for Validating Quality of Datasets for Machine Learning. 2018, pp. 2795–2803. [CrossRef]
- Shahbazi, N.; Lin, Y.; Asudeh, A.; Jagadish, H.V. Representation Bias in Data: A Survey on Identification and Resolution Techniques. ACM Comput. Surv. 2023, 55. [Google Scholar] [CrossRef]
- Aldoseri, A.; Al-Khalifa, K.N.; Hamouda, A.M. Re-Thinking Data Strategy and Integration for Artificial Intelligence: Concepts, Opportunities, and Challenges. Applied Sciences 2023, 13, 7082. [Google Scholar] [CrossRef]
- Mehrabi, N.; Morstatter, F.; Saxena, N.; Lerman, K.; Galstyan, A. A Survey on Bias and Fairness in Machine Learning. 2022, arXiv:1908.09635. [Google Scholar] [CrossRef]
- Khan, B.; Fatima, H.; Qureshi, A.; Kumar, S.; Hanan, A.; Hussain, J.; Abdullah, S. Drawbacks of Artificial Intelligence and Their Potential Solutions in the Healthcare Sector. Biomedical Materials & Devices 2023, 1–8. [Google Scholar] [CrossRef]
- Ahmad, S.; Han, H.; Alam, M.; et al. Impact of Artificial Intelligence on Human Loss in Decision Making, Laziness and Safety in Education. Humanities & Social Sciences Communications 2023, 10, 311. [Google Scholar] [CrossRef] [PubMed]
- Bhattacharya, A. Applied Machine Learning Explainability Techniques. In Applied Machine Learning Explainability Techniques; Packt Publishing: Birmingham, UK, 2022. [Google Scholar]
- Armstrong, S.; Sotala, K.; Ó hÉigeartaigh, S.S. The Errors, Insights and Lessons of Famous AI Predictions – and What They Mean for the Future. Journal of Experimental & Theoretical Artificial Intelligence 2014, 26, 317–342. [Google Scholar] [CrossRef]
- Papagiannidis, E.; Mikalef, P.; Conboy, K.; van de Wetering, R. Uncovering the dark side of AI-based decision-making: A case study in a B2B context. Industrial Marketing Management 2023, 115, 253–265. [Google Scholar] [CrossRef]
- Mazumder, M.; Banbury, C.; Yao, X.; Karlaš, B.; Rojas, W.G.; Diamos, S.; Diamos, G.; He, L.; Parrish, A.; Kirk, H.R.; Quaye, J.; Rastogi, C.; Kiela, D.; Jurado, D.; Kanter, D.; Mosquera, R.; Ciro, J.; Aroyo, L.; Acun, B.; Chen, L.; Raje, M.S.; Bartolo, M.; Eyuboglu, S.; Ghorbani, A.; Goodman, E.; Inel, O.; Kane, T.; Kirkpatrick, C.R.; Kuo, T.S.; Mueller, J.; Thrush, T.; Vanschoren, J.; Warren, M.; Williams, A.; Yeung, S.; Ardalani, N.; Paritosh, P.; Zhang, C.; Zou, J.; Wu, C.J.; Coleman, C.; Ng, A.; Mattson, P.; Reddi, V.J. DataPerf: Benchmarks for Data-Centric AI Development. 2023, arXiv:2207.10062. [Google Scholar]
- Zha, D.; Bhat, Z.P.; Lai, K.H.; Yang, F.; Jiang, Z.; Zhong, S.; Hu, X. Data-centric Artificial Intelligence: A Survey. 2023, arXiv:2303.10158. [Google Scholar]
- Jakubik, J.; Vössing, M.; Kühl, N.; et al. Data-Centric Artificial Intelligence. Business & Information Systems Engineering 2024. [Google Scholar] [CrossRef]
- Singh, P. Systematic review of data-centric approaches in artificial intelligence and machine learning. Data Science and Management 2023, 6, 144–157. [Google Scholar] [CrossRef]
- Iosifidis, V.; Ntoutsi, E. Dealing with Bias via Data Augmentation in Supervised Learning Scenarios. 2018.
- Sharma, S.; Zhang, Y.; Ríos Aliaga, J.M.; Bouneffouf, D.; Muthusamy, V.; Varshney, K.R. Data augmentation for discrimination prevention and bias disambiguation. Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society, 2020, pp. 358–364.
- Kumar, T.; Mileo, A.; Brennan, R.; Bendechache, M. Image Data Augmentation Approaches: A Comprehensive Survey and Future directions. 2023. [arXiv:cs.CV/2301.02830]. [Google Scholar]
- Alkhawaldeh, I.M.; Albalkhi, I.; Naswhan, A.J. Challenges and Limitations of Synthetic Minority Oversampling Techniques in Machine Learning. World Journal of Methodology 2023, 13, 373–378. [Google Scholar] [CrossRef]
- Mikołajczyk-Bareła, A. Data augmentation and explainability for bias discovery and mitigation in deep learning. 2023, arXiv:2308.09464. [Google Scholar]
- Balestriero, R.; Bottou, L.; LeCun, Y. The Effects of Regularization and Data Augmentation are Class Dependent. 2022, arXiv:2204.03632. [Google Scholar]
- Chawla, N.V.; Bowyer, K.W.; Hall, L.O.; Kegelmeyer, W.P. SMOTE: Synthetic Minority Over-sampling Technique. Journal of Artificial Intelligence Research 2002, 16, 321–357. [Google Scholar] [CrossRef]
- Elnahas, M.; Hussein, M.; Keshk, A. Imbalanced Data Oversampling Technique Based on Convex Combination Method. IJCI. International Journal of Computers and Information 2022, 9, 15–28. [Google Scholar] [CrossRef]
- Celis, L.E.; Keswani, V.; Vishnoi, N. Data preprocessing to mitigate bias: A maximum entropy based approach. Proceedings of the 37th International Conference on Machine Learning; III, H.D.; Singh, A., Eds. PMLR, 2020, Vol. 119, Proceedings of Machine Learning Research, pp. 1349–1359.
- Temraz, M.; Keane, M.T. Solving the Class Imbalance Problem Using a Counterfactual Method for Data Augmentation. 2021, arXiv:2111.03516. [Google Scholar] [CrossRef]
- He, H.; Bai, Y.; Garcia, E.; Li, S. ADASYN: Adaptive Synthetic Sampling Approach for Imbalanced Learning. 2008, pp. 1322–1328. [CrossRef]
- Goodfellow, I.J.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative adversarial nets. Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2; MIT Press: Cambridge, MA, USA, 2014. [Google Scholar]
- Kingma, D.P.; Welling, M. Auto-Encoding Variational Bayes. 2nd International Conference on Learning Representations, ICLR 2014, Banff, AB, Canada, April 14-16, 2014, Conference Track Proceedings, 2014, [http://arxiv.org/abs/1312.6114v10].
- Xu, L.; Skoularidou, M.; Cuesta-Infante, A.; Veeramachaneni, K. Modeling Tabular data using Conditional GAN. Advances in Neural Information Processing Systems, 2019.
- Patki, N.; Wedge, R.; Veeramachaneni, K. The Synthetic data vault. IEEE International Conference on Data Science and Advanced Analytics (DSAA), 2016, pp. 399–410. [CrossRef]
- Tang, Z.; Gao, Y.; Karlinsky, L.; Sattigeri, P.; Feris, R.; Metaxas, D. OnlineAugment: Online Data Augmentation with Less Domain Knowledge. Computer Vision – ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part VII; Springer-Verlag: Berlin, Heidelberg, 2020; pp. 313–329. [Google Scholar] [CrossRef]
- Mumuni, A.; Mumuni, F. Data augmentation: A comprehensive survey of modern approaches. Array 2022, 16, 100258. [Google Scholar] [CrossRef]
- Fails, J.A.; Olsen, D.R. Interactive Machine Learning. Proceedings of the 8th International Conference on Intelligent User Interfaces; Association for Computing Machinery: New York, NY, USA, 2003; pp. 39–45. [Google Scholar] [CrossRef]
- Kulesza, T.; Stumpf, S.; Burnett, M.; Wong, W.K.; Riche, Y.; Moore, T.; Oberst, I.; Shinsel, A.; McIntosh, K. Explanatory Debugging: Supporting End-User Debugging of Machine-Learned Programs. 2010 IEEE Symposium on Visual Languages and Human-Centric Computing; IEEE: Leganes, Madrid, Spain, 2010; pp. 41–48. [Google Scholar] [CrossRef]
- Kulesza, T.; Burnett, M.; Wong, W.K.; Stumpf, S. Principles of Explanatory Debugging to Personalize Interactive Machine Learning. Proceedings of the 20th International Conference on Intelligent User Interfaces; ACM: Atlanta, GA, USA, 2015; pp. 126–137. [Google Scholar] [CrossRef]
- Teso, S.; Alkan, O.; Stammer, W.; Daly, E. Leveraging Explanations in Interactive Machine Learning: An Overview. 2022, arXiv:2207.14526. [Google Scholar]
- Teso, S.; Kersting, K. Explanatory Interactive Machine Learning. Proceedings of the 2019 AAAI/ACM Conference on AI, Ethics, and Society; Association for Computing Machinery: New York, NY, USA, 2019. [Google Scholar] [CrossRef]
- Bhattacharya, A.; Stumpf, S.; Gosak, L.; Stiglic, G.; Verbert, K. EXMOS: Explanatory Model Steering Through Multifaceted Explanations and Data Configurations. Proceedings of the CHI Conference on Human Factors in Computing Systems; Association for Computing Machinery: New York, NY, USA, 2024. [Google Scholar] [CrossRef]
- Muralidhar, N.; Islam, M.R.; Marwah, M.; Karpatne, A.; Ramakrishnan, N. Incorporating prior domain knowledge into deep neural networks. 2018 IEEE International Conference on Big Data (Big Data). IEEE, 2018, pp. 36–45.
- Spinner, T.; Schlegel, U.; Schäfer, H.; El-Assady, M. explAIner: A visual analytics framework for interactive and explainable machine learning. IEEE trans. on visualization and computer graphics 2019, 26, 1064–1074. [Google Scholar] [CrossRef] [PubMed]
- Stumpf, S.; Rajaram, V.; Li, L.; Wong, W.K.; Burnett, M.; Dietterich, T.; Sullivan, E.; Herlocker, J. Interacting meaningfully with machine learning systems: Three experiments. Int. Journal of Human-Computer Studies 2009, 67, 639–662. [Google Scholar] [CrossRef]
- Guo, L.; Daly, E.M.; Alkan, O.K.; Mattetti, M.; Cornec, O.; Knijnenburg, B.P. Building Trust in Interactive Machine Learning via User Contributed Interpretable Rules. 27th International Conference on Intelligent User Interfaces 2022. [Google Scholar]
- Honeycutt, D.R.; Nourani, M.; Ragan, E.D. Soliciting Human-in-the-Loop User Feedback for Interactive Machine Learning Reduces User Trust and Impressions of Model Accuracy. 2020, arXiv:2008.12735. [Google Scholar]
- Bhattacharya, A.; Stumpf, S.; Gosak, L.; Stiglic, G.; Verbert, K. Lessons Learned from EXMOS User Studies: A Technical Report Summarizing Key Takeaways from User Studies Conducted to Evaluate The EXMOS Platform. 2023, arXiv:2310.02063. [Google Scholar]
- Schramowski, P.; Stammer, W.; Teso, S.; Brugger, A.; Herbert, F.; Shao, X.; Luigs, H.G.; Mahlein, A.K.; Kersting, K. Making deep neural networks right for the right scientific reasons by interacting with their explanations. Nature Machine Intelligence 2020, 2, 476–486. [Google Scholar] [CrossRef]
- Feuerriegel, S.; Dolata, M.; Schwabe, G. Fair AI. Business & information systems engineering 2020, 62, 379–384. [Google Scholar]
- Bhattacharya, A.; Ooge, J.; Stiglic, G.; Verbert, K. Directive Explanations for Monitoring the Risk of Diabetes Onset: Introducing Directive Data-Centric Explanations and Combinations to Support What-If Explorations. Proceedings of the 28th International Conference on Intelligent User Interfaces; Association for Computing Machinery: New York, NY, USA, 2023. [Google Scholar] [CrossRef]
- Lakkaraju, H.; Slack, D.; Chen, Y.; Tan, C.; Singh, S. Rethinking Explainability as a Dialogue: A Practitioner’s Perspective. 2022, arXiv:[arXiv:cs.LG/2202.01875]. [Google Scholar]
- Anik, A.I.; Bunt, A. Data-Centric Explanations: Explaining Training Data of Machine Learning Systems to Promote Transparency. Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems; ACM: Yokohama Japan, 2021; pp. 1–13. [Google Scholar] [CrossRef]
- Bove, C.; Aigrain, J.; Lesot, M.J.; Tijus, C.; Detyniecki, M. Contextualization and Exploration of Local Feature Importance Explanations to Improve Understanding and Satisfaction of Non-Expert Users. 27th International Conference on Intelligent User Interfaces; ACM: Helsinki, Finland, 2022; pp. 807–819. [Google Scholar] [CrossRef]
- Lim, B.Y.; Dey, A.K.; Avrahami, D. Why and why not explanations improve the intelligibility of context-aware intelligent systems. Proceedings of the SIGCHI Conference on Human Factors in Computing Systems; Association for Computing Machinery: New York, NY, USA, 2009. [Google Scholar] [CrossRef]
- Wang, D.; Yang, Q.; Abdul, A.; Lim, B.Y. Designing Theory-Driven User-Centric Explainable AI. Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems; ACM: Glasgow, Scotland, UK, 2019; pp. 1–15. [Google Scholar] [CrossRef]
- Bhattacharya, A. Towards Directive Explanations: Crafting Explainable AI Systems for Actionable Human-AI Interactions. Extended Abstracts of the CHI Conference on Human Factors in Computing Systems (CHI EA ’24); ACM: New York, NY, USA, 2024. [Google Scholar] [CrossRef]

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
