Submitted:
27 September 2024
Posted:
30 September 2024
Abstract

Keywords:
1. Introduction
- The initial PRIM searches only in a chosen feature space; in our work, the algorithm chooses multiple random feature spaces.
- Once the rules for each class label are discovered, we innovate by pruning them with Metarules, so that the ruleset stays interpretable and no rule is completely removed from it. Metarules are association rules whose items are the generated rules.
- To build the classifier, the selection of the final rules is essential. We draw on the literature, especially the original CART and CBA papers, to select the rules that make up the final model. Because the algorithm keeps only the optimal box in each peel, the selection falls on the first rules with the highest coverage, support and confidence.
- We tested our Random PRIM based classifier (R-PRIM-Cl) on ten well-known datasets to validate the classifier and identify future challenges, and we compared the results with three well-established classifiers: Random Forest, Logistic Regression and XGBoost. We used four metrics to evaluate performance: Accuracy, Precision, Recall and F1-score.
2. Materials and Methods
2.1. Overview of the Patient Rule Induction Method
2.1.1. Definition
2.1.2. The Box Evaluation Metrics
2.2. Metarules
2.3. Related Works and Motivation
- PRIM requires too much interaction with the expert, who has to choose the feature search space;
- With a large number of rules, as happens when the expert chooses multiple search spaces, interpretability is lost and so is explainability: rules that are not interpretable cannot be evaluated against domain knowledge;
- There is no classifier that uses the discovered rules as a predictive model for future predictions.
2.4. Selected Algorithms for the Comparison
3. Proposed Methodology
3.1. Random PRIM Based Classifier
1. Initialization Step
- (a) A set of N data instances
- (b) A set of P categorical or numeric variables X = {x1, x2, …, xp}
- (c) A two-class target variable Y ∈ {0, 1}
- (d) Define the minimum support, peeling and pasting thresholds {s, α, β}
2. Procedure
- (a) Random choice of the feature search spaces
- (b) Implementation of PRIM on each subspace for each class label
- (c) Find boxes with the metrics
- (d) Implement Metarules to find the associations between rules
- (e) Execute Cross-Validation on 10 folds
- (f) Retain the rules or metarules based on Cross-Validation, the support, the density and the coverage rates
- (g) Calculate the Accuracy, Recall, Precision and F1-Score to validate the model
3. Output
- (a) The set of all boxes found, to allow the discovery of new subgroups
- (b) The set of boxes selected by the Metarules pruning step
- (c) The final box measures: coverage, density, support, dimension
- (d) The final model metrics: accuracy, precision, recall, F1-score
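Steps 2(a) and 2(c) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the helper names `random_subspaces` and `box_metrics`, the toy data, and the use of inclusive box bounds are assumptions made for illustration. The metric definitions follow the usual PRIM reading, where support is the share of all instances inside the box, density is the share of box instances that belong to the target class, and coverage is the share of target-class instances captured by the box.

```python
import random


def random_subspaces(features, n_subspaces, seed=0):
    """Draw distinct, non-empty random feature subsets (hypothetical helper)."""
    rng = random.Random(seed)
    n_possible = 2 ** len(features) - 1  # number of non-empty subsets
    chosen = set()
    while len(chosen) < min(n_subspaces, n_possible):
        size = rng.randint(1, len(features))
        subset = rng.sample(features, size)
        # Store in the original feature order so duplicates are detected.
        chosen.add(tuple(sorted(subset, key=features.index)))
    return [list(s) for s in sorted(chosen)]


def box_metrics(rows, labels, box, target=1):
    """Support, density, coverage and dimension of one box.

    rows   : list of dicts mapping feature name -> numeric value
    labels : list of class labels aligned with rows
    box    : dict mapping feature name -> (low, high), bounds inclusive (assumption)
    """
    inside = [
        all(low <= row[f] <= high for f, (low, high) in box.items())
        for row in rows
    ]
    n_inside = sum(inside)
    n_target = sum(1 for y in labels if y == target)
    hits = sum(1 for flag, y in zip(inside, labels) if flag and y == target)
    return {
        "support": n_inside / len(rows),
        "density": hits / n_inside if n_inside else 0.0,
        "coverage": hits / n_target if n_target else 0.0,
        "dimension": len(box),
    }


# Toy data, invented for illustration only.
features = ["x1", "x2", "x3"]
subspaces = random_subspaces(features, n_subspaces=4, seed=42)

rows = [
    {"x1": 1.0, "x2": 1.0, "x3": 0.0},
    {"x1": 2.0, "x2": 5.0, "x3": 1.0},
    {"x1": 3.0, "x2": 2.0, "x3": 0.0},
    {"x1": 8.0, "x2": 9.0, "x3": 1.0},
]
labels = [1, 1, 0, 0]
metrics = box_metrics(rows, labels, {"x1": (0.0, 2.5)})
print(subspaces)
print(metrics)
```

In a full pipeline, PRIM's peeling and pasting would be run inside each drawn subspace to produce the boxes; here the box is given by hand so that only the metric computation is shown.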
3.2. Validation of the Classifier
3.3. Illustrative Example
1. Initialization Step
- (a) A set of 150 flowers from the Iris dataset
- (b) A set of 4 numeric variables X = {sepal length (cm), sepal width (cm), petal length (cm), petal width (cm)}
- (c) Y is a three-class target; we coded setosa as 1 and versicolor and virginica as 0, hence Y ∈ {0, 1}
- (d) Define the minimum support, peeling and pasting thresholds: s = 10%, α = 5%, β = 5%
2. Procedure
- (a) Random choice of the feature search spaces by the algorithm: [['sepal length (cm)', 'sepal width (cm)'], ['sepal length (cm)', 'petal width (cm)'], ['sepal width (cm)', 'petal length (cm)', 'petal width (cm)'], ['sepal width (cm)', 'petal length (cm)'], ['petal length (cm)', 'petal width (cm)']]
- (b) Implementation of PRIM on each subspace for each class label
- (c) Find boxes with the metrics, as displayed in Table 1, before the Metarules and cross-validation
- (d) Implement Metarules to prune and detect overlapping
  - No overlapping detected
  - Metarules detected for a density = 100%:
    - R2 => R1
    - R3 => R7
    - R4,R8 => R5 and R5,R8 => R4 and R5,R4 => R8
    - R6 => R8
- (e) Execute Cross-Validation on 10 folds
- (f) Retain the rules or metarules based on Cross-Validation, the support, the density and the coverage rates, as displayed in Table 2
- (g) Calculate the Accuracy, Recall, Precision and F1-Score to validate the model
  - Accuracy: 0.97
  - Precision: 0.83
  - Recall: 1.00
  - F1 Score: 0.91
3. Output
- (a) The set of all boxes found, to allow the discovery of new subgroups: (R1, R2, R3, R4, R5, R6, R7, R8, R9, R10, R11, R12)
- (b) The set of boxes selected by the Metarules pruning step: (R1, R4, R5, R6, R8), with R4, R5 and R8 covering the same data and R6 included in R8
- (c) The final box measures: coverage, density, support, dimension. Ex: R1 (coverage = 78%, density = 100%, support = 26%, dimension = 2)
- (d) The final model metrics: accuracy, precision, recall, F1-score. Accuracy = 96.7%; Precision = 95%; Recall = 100%; F1 Score = 97.4%
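The metarule step in 2(d) can be sketched as a search for one-way associations between rules. The sketch below is a simplified illustration, not the authors' code: it handles single-antecedent metarules only, the function name `one_way_metarules` and the toy cover sets are hypothetical, and reading the 100% threshold as the confidence of the metarule is an assumption. A metarule A => B is reported when every instance covered by rule A is also covered by rule B, which is the situation described for a rule that is included in another.

```python
from itertools import permutations


def one_way_metarules(cover, min_conf=1.0):
    """Return (antecedent, consequent) pairs whose confidence reaches min_conf.

    cover : dict mapping rule name -> set of instance ids falling inside the box
    Confidence of A => B is |cover(A) & cover(B)| / |cover(A)|.
    """
    found = []
    for a, b in permutations(sorted(cover), 2):
        if not cover[a]:
            continue  # an empty rule supports no association
        confidence = len(cover[a] & cover[b]) / len(cover[a])
        if confidence >= min_conf:
            found.append((a, b))
    return found


# Toy cover sets, invented for illustration only.
cover = {
    "RA": {1, 2, 3},        # fully contained in RB
    "RB": {1, 2, 3, 4},
    "RC": {7, 8},           # disjoint from the others
}
metarules = one_way_metarules(cover)
print(metarules)
```

Here RA => RB holds with confidence 1, while RB => RA does not (confidence 0.75), so RA can be treated as subsumed by RB without being deleted from the ruleset. Multi-antecedent metarules, such as those listed above with two rules on the left-hand side, would require intersecting the covers of all antecedent rules first.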
4. Results
4.1. Empirical Setting
- Congressional Voting dataset (Vote)
- Mushroom Dataset (Mush)
- Breast Cancer Dataset (Cancer)
- SPECT heart dataset (Heart)
- Tic-Tac-Toe Endgame dataset (TicTac)
- Pima Diabetes Dataset (Diabetes)
- German Credit Card Dataset (Credit)
4.2. Results
5. Discussion
6. Conclusion
Author Contributions
Funding
Conflicts of Interest
References
- Dixon, M. F., Halperin, I., & Bilokon, P. (2020). Machine learning in finance (Vol. 1170). New York, NY, USA: Springer International Publishing.
- Ahmed, S.; Alshater, M.M.; El Ammari, A.; Hammami, H. Artificial intelligence and machine learning in finance: A bibliometric review. Res. Int. Bus. Finance 2022, 61, 101646.
- Nayyar, A., Gadhavi, L., & Zaman, N. (2021). Machine learning in healthcare: review, opportunities and challenges. Machine Learning and the Internet of Medical Things in Healthcare, 23-45.
- An, Q., Rahman, S., Zhou, J., & Kang, J. J. (2023). A comprehensive review on machine learning in healthcare industry: classification, restrictions, opportunities and challenges. Sensors, 23(9), 4178.
- Gao, L., & Guan, L. (2023). Interpretability of machine learning: Recent advances and future prospects. IEEE MultiMedia, 30(4), 105-118.
- Nassih, R., Berrado, A. 2020. State of the art of Fairness, Interpretability and Explainability in Machine Learning: Case of PRIM. In Proceedings of the 13th International Conference on Intelligent Systems: Theories and Applications (SITA'20). Association for Computing Machinery, New York, NY, USA, Article 32, 1–5.
- Capponi A, Lehalle C-A, eds. Black-Box Model Risk in Finance. In: Machine Learning and Data Sciences for Financial Markets: A Guide to Contemporary Practices. Cambridge University Press; 2023:687-717.
- Imrie, F.; Davis, R.; van der Schaar, M. Multiple stakeholders drive diverse interpretability requirements for machine learning in healthcare. Nat. Mach. Intell. 2023, 5, 824–829.
- Friedman, J.H., Fisher, N.I. Bump hunting in high-dimensional data. Statistics and Computing 9, 123–143, 1999.
- Berrado, A.; Runger, G.C. Using metarules to organize and group discovered association rules. Data Min. Knowl. Disc. 2007, 14, 409–431.
- Azmi, M., Runger, G. C., & Berrado, A. 2019. Interpretable regularized class association rules algorithm for classification in a categorical data space. Information Sciences, 483, 313-331.
- Maissae, H., & Abdelaziz, B. (2024). Forest-ORE: Mining Optimal Rule Ensemble to interpret Random Forest models. arXiv preprint arXiv:2403.17588.
- Maissae, H., & Abdelaziz, B. (2022). A novel approach for discretizing continuous attributes based on tree ensemble and moment matching optimization. International Journal of Data Science and Analytics, 14(1), 45-63.
- Azmi, M., & Berrado, A. (2021). CARs-RP: Lasso-based class association rules pruning. International Journal of Business Intelligence and Data Mining, 18(2), 197-217.
- Dazard, J.E.; Rao, J.S. Local Sparse Bump Hunting. J. Comput. Graph. Stat. 2010, 19, 900–929.
- Polonik, W.; Wang, Z. PRIM analysis. J. Multivar. Anal. 2010, 101, 525–540.
- Dyson, G. An application of the Patient Rule-Induction Method to detect clinically meaningful subgroups from failed phase III clinical trials. Int. J. Clin. Biostat. Biom. 2021, 7, 038.
- Yang, J.K.; Lee, D.H. Optimization of mean and standard deviation of multiple responses using patient rule induction method. Int. J. Data Warehous. Min. 2018, 14, 60–74.
- Lee, D.H.; Yang, J.K.; Kim, K.J. Multiresponse optimization of a multistage manufacturing process using a patient rule induction method. Qual. Reliab. Eng. Int. 2020, 36, 1982–2002.
- Kaveh, A.; Hamze-Ziabari, S.M.; Bakhshpoori, T. Soft computing-based slope stability assessment: A comparative study. Geomech. Eng. 2018, 14, 257–269.
- Nassih, R.; Berrado, A. Potential for PRIM based classification: a literature review. In Proc. Third Eur. Int. Conf. Ind. Eng. Oper. Manag., Pilsen, Czech Republic, 2019, 7.
- Nassih, R.; Berrado, A. Towards a patient rule induction method based classifier. In Proc. 2019 1st Int. Conf. Smart Syst. Data Sci. (ICSSD); IEEE, 2019, 1-5.
- Breiman, L. Random forests. Mach. Learn. 45 (1) (2001) 5–32.
- Biau, G., Scornet, E. A random forest guided tour. TEST 25, 197–227 (2016).
- Chen, T., & Guestrin, C. (2016). XGBoost: A Scalable Tree Boosting System. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 785-794).
- Qiu, Y., Zhou, J., Khandelwal, M., Yang, H., Yang, P., & Li, C. (2022). Performance evaluation of hybrid WOA-XGBoost, GWO-XGBoost and BO-XGBoost models to predict blast-induced ground vibration. Engineering with Computers, 38(Suppl 5), 4145-4162.
- Cox, D. R. (1958). The Regression Analysis of Binary Sequences. Journal of the Royal Statistical Society: Series B (Methodological), 20(2), 215-242.
- Breiman, L., Friedman, J. H., Olshen, R. A., & Stone, C. J. (1984). Classification and Regression Trees. Belmont, CA: Wadsworth International Group.
- Bellman, R. E. (1961). Adaptive Control Processes: A Guided Tour. Princeton University Press.
- Liu, B., Hsu, W., & Ma, Y. (1998). Integrating Classification and Association Rule Mining. Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining (KDD), 80-86.
- C.L. Blake, C.J. Merz, UCI repository of machine learning databases [http://www.ics.uci.edu/~mlearn/mlrepository.html]. Irvine, CA: University of California, Department of Information and Computer Science 55 (1998).
- J. Demšar, T. Curk, A. Erjavec, Č. Gorup, T. Hočevar, M. Milutinović, M. Možina, M. Polajnar, M. Toplak, A. Starič, M. Štajdohar, J. Umek, L. Zupančič, L. Žagar, T. Žbontar, M. Kravanja, M. Kale, and B. Zupan. (2013). Orange: Data Mining Toolbox in Python. Journal of Machine Learning Research, 14, 2349-2353.





Table 1. Boxes found, with their metrics, before the Metarules and cross-validation.

| Index | Box | Coverage | Density | Restricted dimensions | Mass (support) |
|---|---|---|---|---|---|
| R1 | 4.3 < sepal length (cm) < 5.35 AND 2.95 < sepal width (cm) < 4.4 | 0.78 | 1.00 | 2 | 0.26 |
| R2 | 4.3 < sepal length (cm) < 5.45 AND 2.15 < sepal width (cm) < 4.4 | 0.15 | 0.50 | 2 | 0.10 |
| R3 | 3.349 < sepal width (cm) < 4.4 | 0.08 | 0.38 | 1 | 0.07 |
| R4 | 4.3 < sepal length (cm) < 5.95 AND 0.1 < petal width (cm) < 0.8 | 1.00 | 1.00 | 2 | 0.33 |
| R5 | 1.0 < petal length (cm) < 3.75 AND 0.1 < petal width (cm) < 0.8 | 1.00 | 1.00 | 2 | 0.33 |
| R6 | 1.0 < petal length (cm) < 1.79 | 0.95 | 1.00 | 1 | 0.32 |
| R7 | 3.25 < sepal width (cm) < 4.4 AND 1.0 < petal length (cm) < 6.05 | 0.05 | 0.25 | 2 | 0.07 |
| R8 | 1.0 < petal length (cm) < 3.75 AND 0.1 < petal width (cm) < 0.8 | 1.00 | 1.00 | 2 | 0.33 |
| R9 | 4.3 < sepal length (cm) < 7.35 AND 0.1 < petal width (cm) < 2.25 | 0.90 | 0.35 | 2 | 0.86 |
| R10 | 3.05 < sepal width (cm) < 4.05 AND 1.35 < petal length (cm) < 5.85 | 0.45 | 0.49 | 2 | 0.31 |
| R11 | 4.3 < sepal length (cm) < 7.80 AND 3.05 < sepal width (cm) < 4.05 AND 1.35 < petal length (cm) < 6.9 AND 0.1 < petal width (cm) < 2.25 | 0.40 | 0.52 | 4 | 0.26 |
| R12 | 6.55 < sepal length (cm) < 7.80 AND 2.15 < petal width (cm) < 2.5 | 0.10 | 0.44 | 2 | 0.08 |

Table 2. Rules and metarules retained after cross-validation.

| Index | Box | Coverage | Density | Restricted dimensions | Mass (support) |
|---|---|---|---|---|---|
| (R4,R5,R8) | 4.3 < sepal length (cm) < 5.95 AND 0.1 < petal width (cm) < 0.8 AND 1.0 < petal length (cm) < 3.75 | 1.00 | 1.00 | 3 | 0.33 |
| R6 | 1.0 < petal length (cm) < 1.79 | 0.95 | 1.00 | 1 | 0.32 |
| R1 | 4.3 < sepal length (cm) < 5.35 AND 2.95 < sepal width (cm) < 4.4 | 0.78 | 1.00 | 2 | 0.26 |
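
As a rough illustration of how the retained boxes in the table above could serve as a predictive model, the following Python sketch encodes the three boxes as interval bounds and labels an instance 1 (setosa) when at least one box contains it. The decision rule "any box fires" is an assumption, since the text does not spell out how the final rules are combined, and the names `RETAINED_BOXES`, `in_box` and `predict` are hypothetical. Bounds are treated as strict inequalities, as written in the table.

```python
# Box bounds copied from the retained rules; the structure is illustrative.
RETAINED_BOXES = {
    "R4,R5,R8": {
        "sepal length (cm)": (4.3, 5.95),
        "petal width (cm)": (0.1, 0.8),
        "petal length (cm)": (1.0, 3.75),
    },
    "R6": {"petal length (cm)": (1.0, 1.79)},
    "R1": {
        "sepal length (cm)": (4.3, 5.35),
        "sepal width (cm)": (2.95, 4.4),
    },
}


def in_box(x, box):
    """True when every restricted feature of x lies strictly inside its bounds."""
    return all(low < x[f] < high for f, (low, high) in box.items())


def predict(x, boxes=RETAINED_BOXES):
    """Return (label, fired_rules); label is 1 if any box contains x (assumption)."""
    fired = [name for name, box in boxes.items() if in_box(x, box)]
    return (1 if fired else 0), fired


# Two hand-written measurements used only as examples.
small_flower = {
    "sepal length (cm)": 5.1, "sepal width (cm)": 3.5,
    "petal length (cm)": 1.4, "petal width (cm)": 0.2,
}
large_flower = {
    "sepal length (cm)": 7.0, "sepal width (cm)": 3.2,
    "petal length (cm)": 4.7, "petal width (cm)": 1.4,
}
print(predict(small_flower))
print(predict(large_flower))
```

Returning the list of fired rules alongside the label keeps the prediction traceable to the boxes that produced it, which is the interpretability argument made for rule-based classifiers.
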

| Datasets | Nb of instances | Nb of attributes | Class labels | Class distribution |
|---|---|---|---|---|
| Vote | 232 | 16 | democrat: 0, republican: 1 | 142, 90 |
| Mush | 8124 | 22 | e: 0, p: 1 | 4208, 3916 |
| Cancer | 286 | 9 | No recurrent: 0, Recurrent: 1 | 201, 85 |
| Heart | 267 | 2 | 0, 1 | 55, 212 |
| TicTac | 958 | 9 | Negative: 1, Positive: 0 | 332, 626 |
| Diabetes | 768 | 8 | Yes: 1, No: 0 | 269, 499 |
| Credit | 1000 | 10 | Bad: 1, Good: 0 | 300, 700 |

| Datasets | Recall: RF | Recall: XGB | Recall: LG | Recall: R-PRIM-Cl | Precision: RF | Precision: XGB | Precision: LG | Precision: R-PRIM-Cl |
|---|---|---|---|---|---|---|---|---|
| Vote | 94.33 | 94.33 | 93.33 | 95.32 | 98.67 | 99.33 | 98.60 | 98.74 |
| Mush | 98.12 | 98.12 | 97.72 | 97.65 | 99.72 | 100.00 | 93.67 | 96.47 |
| Cancer | 84.17 | 89.64 | 90.31 | 92.18 | 77.39 | 78.60 | 87.72 | 92.13 |
| Heart | 92.01 | 94.37 | 78.36 | 93.57 | 87.30 | 86.57 | 91.25 | 90.14 |
| TicTac | 97.35 | 98.12 | 96.58 | 98.06 | 98.79 | 97.88 | 98.10 | 96.80 |
| Diabetes | 87.30 | 85.64 | 89.51 | 88.67 | 91.31 | 90.68 | 89.67 | 94.36 |
| Credit | 89.64 | 82.45 | 91.35 | 90.03 | 88.60 | 85.34 | 91.78 | 95.24 |

| Datasets | F1 score: RF | F1 score: XGB | F1 score: LG | F1 score: R-PRIM-Cl | Accuracy: RF | Accuracy: XGB | Accuracy: LG | Accuracy: R-PRIM-Cl |
|---|---|---|---|---|---|---|---|---|
| Vote | 96.45 | 96.77 | 95.89 | 97.00 | 96.56 | 97.10 | 97.43 | 97.40 |
| Mush | 98.91 | 99.05 | 95.65 | 97.06 | 100.00 | 100.00 | 98.63 | 97.54 |
| Cancer | 80.64 | 83.76 | 89.00 | 92.15 | 71.38 | 71.78 | 87.63 | 90.04 |
| Heart | 89.59 | 90.30 | 84.32 | 91.82 | 80.20 | 83.56 | 92.13 | 96.15 |
| TicTac | 98.06 | 98.00 | 97.33 | 97.43 | 98.85 | 98.87 | 97.86 | 98.37 |
| Diabetes | 89.26 | 88.09 | 89.59 | 91.43 | 84.65 | 84.37 | 89.61 | 87.98 |
| Credit | 89.12 | 83.87 | 91.56 | 92.56 | 90.15 | 89.61 | 91.34 | 92.42 |

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).