Submitted: 16 March 2024
Posted: 18 March 2024
Abstract
Keywords:
1. Introduction
2. Literature review
3. Naive Bayes classifier
- i. User-based approach: To generate unbiased representations of the data, classify the data and compute two percentiles in each class according to the auditor's professional preferences. Draw the data bounded by the resulting two percentiles as audit evidence; and
- ii. Item-based approach: Suppose that certain data represent risky samples. After classifying the data, choose them as audit evidence.
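The user-based approach described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: it uses a single numeric feature, a hand-written Gaussian Naive Bayes model, and the 25th and 75th percentiles as a stand-in for the auditor's preferences. All function names, the toy data, and the percentile choices are hypothetical.

```python
import math
import statistics
from collections import defaultdict


def fit_gaussian_nb(values, labels):
    """Estimate (prior, mean, standard deviation) per class for one feature."""
    groups = defaultdict(list)
    for x, y in zip(values, labels):
        groups[y].append(x)
    n = len(values)
    return {
        y: (len(xs) / n, statistics.mean(xs), statistics.pstdev(xs) or 1e-9)
        for y, xs in groups.items()
    }


def predict(model, x):
    """Return the class maximising log prior plus Gaussian log likelihood."""
    def score(params):
        prior, mu, sigma = params
        return math.log(prior) - math.log(sigma) - 0.5 * ((x - mu) / sigma) ** 2
    return max(model, key=lambda y: score(model[y]))


def percentile(sorted_xs, q):
    """Linear-interpolation percentile (q in [0, 100]) of a sorted list."""
    pos = (len(sorted_xs) - 1) * q / 100
    lo, hi = math.floor(pos), math.ceil(pos)
    return sorted_xs[lo] + (sorted_xs[hi] - sorted_xs[lo]) * (pos - lo)


def user_based_sample(values, model, lower_q, upper_q):
    """Classify each value, then keep those between two percentiles of its class."""
    by_class = defaultdict(list)
    for x in values:
        by_class[predict(model, x)].append(x)
    evidence = {}
    for y, xs in by_class.items():
        xs.sort()
        lo, hi = percentile(xs, lower_q), percentile(xs, upper_q)
        evidence[y] = [x for x in xs if lo <= x <= hi]
    return evidence


# Hypothetical toy data: two well-separated groups of transaction amounts.
train_x = [10, 11, 12, 13, 14, 50, 51, 52, 53, 54]
train_y = ["low"] * 5 + ["high"] * 5
model = fit_gaussian_nb(train_x, train_y)
evidence = user_based_sample(train_x, model, lower_q=25, upper_q=75)
```

Here `evidence` maps each predicted class to the values lying between its two percentiles, so every class contributes to the audit evidence. Widening or narrowing `lower_q` and `upper_q` is where an auditor's preference would enter in this sketch.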
3.1. User-based approach
3.2. Item-based approach
3.3. Hybrid approach
4. Results
4.1. Experiment 1
4.2. Experiment 2
4.3. Experiment 3
5. Discussion
- Conventional sampling methods [4] may not profile the full diversity of data and may therefore provide biased samples. Because this study samples data after classifying them with a Naive Bayes classifier, the classifier takes the place of a sampling method in profiling the full diversity of the data. The experimental results in Section 4 indicate that the Naive Bayes classifier accurately classifies three open data sets, even when they are large. These accurate classification results indicate that we capture the full diversity of the experimental data.
- Conventional sampling methods may not account for complex patterns or correlations in data [4]. In this study, a Naive Bayes classifier handles such patterns and correlations (for example, the graph structure in Section 4.3). This design mitigates the sampling bias they cause, provided that the classifier yields accurate classification results.
- Section 4.3 indicates that a Naive Bayes classifier works well for big data in a money laundering problem. It outperforms the random forest classifier and the support vector machine with a radial basis function kernel in classifying a massive number of vertices. This illustrates that the efficiency of sampling big data can be improved: one can sample riskier nodes, which model fraudulent financial accounts, without profiling specific groups of nodes.
- Conventional sampling methods were developed for structured data and struggle to handle unstructured data such as the spam messages in Section 4.2. We resolve this difficulty by applying a Naive Bayes classifier before sampling.
- Because this study samples data from each class produced by a Naive Bayes classifier, accurate classification results eliminate sampling-frame errors and improper sample sizes.
- It is still possible that a Naive Bayes classifier provides inaccurate classification results. Before integrating a machine learning algorithm with sampling, one should test the classification accuracy.
- Implementing the item-based approach of Section 3.2 requires thresholds, and determining proper values for them requires inspecting how the prior probabilities vary. This is the second limitation of our machine learning-based sampling.
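The dependence of thresholds on prior probabilities noted in the last item can be illustrated with a small sketch. It rests on an assumption the text does not confirm: that a "risky" item is one whose largest posterior probability falls below a threshold, meaning the classifier is not confident about its class. The class names, Gaussian parameters, threshold value, and data are all hypothetical.

```python
import math


def gaussian_pdf(x, mu, sigma):
    """Density of a normal distribution with mean mu and standard deviation sigma."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))


def posteriors(x, params):
    """Bayes' rule: posterior is proportional to prior times likelihood."""
    joint = {
        y: prior * gaussian_pdf(x, mu, sigma)
        for y, (prior, mu, sigma) in params.items()
    }
    total = sum(joint.values())
    return {y: p / total for y, p in joint.items()}


def item_based_sample(values, params, threshold):
    """Keep values whose largest posterior probability is below `threshold`."""
    return [x for x in values if max(posteriors(x, params).values()) < threshold]


values = [95.0, 105.0, 130.0, 137.0, 165.0]

# Each class is described by (prior, mean, standard deviation); values are assumed.
skewed = {"normal": (0.9, 100.0, 10.0), "suspicious": (0.1, 160.0, 10.0)}
balanced = {"normal": (0.5, 100.0, 10.0), "suspicious": (0.5, 160.0, 10.0)}

risky_skewed = item_based_sample(values, skewed, threshold=0.95)
risky_balanced = item_based_sample(values, balanced, threshold=0.95)
```

With the same threshold, the two sets of priors flag different items, which is why the priors need to be inspected before a threshold is fixed.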
6. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
1. Deng, H.; Sun, Y.; Chang, Y.; Han, J. Probabilistic models for classification. In Data Classification: Algorithms and Applications; Aggarwal, C.C., Ed.; Chapman and Hall/CRC: New York, USA, 2014; pp. 65–86.
2. Schreyer, M.; Gierbl, A.; Ruud, T.F.; Borth, D. Artificial intelligence enabled audit sampling – Learning to draw representative and interpretable audit samples from large-scale journal entry data. Expert Focus 2022, 04, 106–112.
3. Zhang, Y.; Trubey, P. Machine learning and sampling scheme: An empirical study of money laundering detection. Comput. Econ. 2019, 54(3), 1043–1063.
4. Aitkazinov, A. The role of artificial intelligence in auditing: Opportunities and challenges. Int. J. Res. Eng. Sci. Manag. 2023, 6(6), 117–119.
5. Chen, Y.; Wu, Z.; Yan, H. A full population auditing method based on machine learning. Sustainability 2022, 14(24), 17008.
6. Bertino, S. A measure of representativeness of a sample for inferential purposes. Int. Stat. Rev. 2006, 74(2), 149–159.
7. Guy, D.M.; Carmichael, D.R.; Whittington, O.R. Audit Sampling: An Introduction to Statistical Sampling in Auditing, 5th ed.; John Wiley & Sons: New York, USA, 2001.
8. Schreyer, M.; Sattarov, T.; Borth, D. Multi-view contrastive self-supervised learning of accounting data representations for downstream audit tasks. In Proceedings of the Second ACM International Conference on AI in Finance Virtual Event, New York, USA; 5 3 2021.
9. Schreyer, M.; Sattarov, T.; Reimer, G.B.; Borth, D. Learning sampling in financial statement audits using vector quantised autoencoder. arXiv 2020.
10. Lee, C. Deep learning-based detection of tax frauds: an application to property acquisition tax. Data Technol. Appl. 2022, 56(3), 329–341.
11. Chen, Z.; Li, C.; Sun, W. Bitcoin price prediction using machine learning: An approach to sample dimensional engineering. J. Comput. Appl. Math. 2020, 365, 112395.
12. Liberty, E.; Lang, K.; Shmakov, K. Stratified sampling meets machine learning. In Proceedings of the 33rd International Conference on Machine Learning, New York, USA; 6 19 2016.
13. Hollingsworth, J.; Ratz, P.; Tanedo, P.; Whiteson, D. Efficient sampling of constrained high-dimensional theoretical spaces with machine learning. Eur. Phys. J. C 2021, 81(12), 1138.
14. Artrith, N.; Urban, A.; Ceder, G. Constructing first-principles diagrams of amorphous LixSi using machine-learning-assisted sampling with an evolutionary algorithm. J. Chem. Phys. 2018, 148(24), 241711.
15. Huang, F.; No, W.G.; Vasarhelyi, M.A.; Yan, Z. Audit data analytics, machine learning, and full population testing. J. Finance Data Sci. 2022, 8, 138–144.
16. Kolmogorov, A. Sulla determinazione empirica di una legge di distribuzione. G. Ist. Ital. Attuari 1933, 4, 83–91.
17. Wasserman, S.; Faust, K. Social Network Analysis: Methods and Applications, 1st ed.; Cambridge University Press: Cambridge, New York, USA, 1994.











| Statistic | Original data | Audit evidence |
|---|---|---|
| Range | [104.78, 225.24] | [104.78, 225.24] |
| Standard deviation | 24.55 | 24.53 |
| Interquartile range | 34.58 | 34.58 |
| Skewness | 0.674 | 0.673 |
| Coefficient of variation | 0.1731 | 0.173 |
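The five statistics in the table above can be computed for any data set and for the audit evidence drawn from it, so that the two can be compared. The sketch below is one possible way to do this with the Python standard library. It is not taken from the paper: the choice of the sample standard deviation, the "inclusive" quartile method, and the moment-based skewness formula are assumptions, and the toy data are hypothetical.

```python
import statistics


def describe(data):
    """Return range, standard deviation, IQR, skewness and coefficient of variation."""
    mean = statistics.mean(data)
    std = statistics.stdev(data)  # sample standard deviation (assumption)
    q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
    n = len(data)
    m2 = sum((x - mean) ** 2 for x in data) / n  # second central moment
    m3 = sum((x - mean) ** 3 for x in data) / n  # third central moment
    return {
        "range": (min(data), max(data)),
        "std": std,
        "iqr": q3 - q1,
        "skewness": m3 / m2 ** 1.5,  # moment-based skewness (assumption)
        "cv": std / mean,
    }


# Hypothetical toy data; in practice, call describe() on both the original
# data and the audit evidence and compare the two dictionaries.
original = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
stats = describe(original)
```

Close agreement between the two sets of statistics, as in the table, suggests that the audit evidence preserves the spread and shape of the original data.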

| Metric | Value |
|---|---|
| Accuracy | 0.983 |
| Precision | 0.992 |
| Recall | 0.989 |
| Specificity | 0.992 |
| F1 score | 0.99 |

| Class variable | Degree centrality | Clustering coefficient | Total number of members |
|---|---|---|---|
| 1 | [0,2) | [0,1] | 338800 |
| 2 | [2,4) | [0,1] | 117323 |
| 3 | [4,6) | [0,0.417] | 41720 |
| 4 | [6,10) | [0,0.367] | 22743 |
| 5 | [10,∞) | [0,0.28] | 15304 |

| Metric | Averaged value |
|---|---|
| Accuracy | 0.995 |
| Precision | 0.992 |
| Recall | 0.989 |
| Specificity | 0.992 |
| F1 score | 0.99 |
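The five metrics reported in the tables above follow from the counts of a binary confusion matrix. The sketch below shows the standard definitions. The function name and the counts are hypothetical and unrelated to the experiments; for a multi-class problem one would compute the metrics per class and average them, which this sketch does not show.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute standard binary classification metrics from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "precision": precision,
        "recall": recall,
        "specificity": tn / (tn + fp),
        "f1": 2 * precision * recall / (precision + recall),
    }


# Hypothetical counts: true positives, false positives, true negatives, false negatives.
metrics = classification_metrics(tp=90, fp=10, tn=80, fn=20)
```

Checking these values on held-out data before sampling is one way to test the classification accuracy that the sampling relies on.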
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).