Submitted: 01 September 2025
Posted: 02 September 2025
Abstract
Keywords:
1. Introduction
Implementation of Data Mining Techniques
Classification

Dealing with Dataset Imbalance

Addressing Data Leakage
Binning Latitude & Longitude Values (Feature Engineering)
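A minimal sketch of how latitude and longitude values could be binned without leaking test-set information, assuming the bin edges are derived from the training rows only and then reused on the test rows. The Freedman-Diaconis rule (`bins="fd"` in NumPy), the helper names `fit_bin_edges` and `apply_bins`, and the synthetic coordinates are illustrative assumptions, not the settings used in the study.

```python
import numpy as np


def fit_bin_edges(train_values, rule="fd"):
    """Derive bin edges from training values only (hypothetical helper)."""
    return np.histogram_bin_edges(train_values, bins=rule)


def apply_bins(values, edges):
    """Map values to integer bin ids in the range 0 .. len(edges) - 2.

    Only the interior edges are passed to np.digitize, so values outside
    the training range fall into the first or last bin instead of
    receiving an id that was never seen during training.
    """
    return np.digitize(values, edges[1:-1])


# Synthetic latitudes standing in for the real coordinates (assumption).
rng = np.random.default_rng(0)
lat_train = rng.normal(41.6, 0.3, size=800)
lat_test = rng.normal(41.6, 0.3, size=200)

lat_edges = fit_bin_edges(lat_train)          # fitted on training rows only
lat_train_bin = apply_bins(lat_train, lat_edges)
lat_test_bin = apply_bins(lat_test, lat_edges)  # same edges reused
```

The same two calls would be repeated for longitude; the binned ids can then be treated as categorical features.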
Train-Test Split and Scaling Prior to SMOTE-Tomek Application
GridSearchCV Restricted to the Training Set
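The ordering named in the headings above (split, then scale, then resample, then tune on the training set only) can be sketched as follows. This is a hedged illustration with scikit-learn on synthetic data: the feature matrix, the Logistic Regression estimator, and the parameter grid are assumptions, and the SMOTE-Tomek resampling step is shown only as a comment so that the sketch depends on scikit-learn alone.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic imbalanced data standing in for the real features (assumption).
X, y = make_classification(
    n_samples=600, n_features=8, weights=[0.3, 0.7], random_state=0
)

# 1. Split first, so the test rows never influence any fitted step.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# 2. Fit the scaler on the training rows only, then reuse it on the test rows.
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)

# 3. Resampling (for example SMOTE followed by Tomek-link removal) would be
#    applied here to (X_train_s, y_train) only; the test set stays untouched.

# 4. The hyperparameter search sees the training rows only.
search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.1, 1.0, 10.0]},  # illustrative grid
    scoring="f1",
    cv=3,
)
search.fit(X_train_s, y_train)

# 5. The held-out test set is used once, for the final estimate.
test_f1 = search.score(X_test_s, y_test)
print(search.best_params_, round(test_f1, 3))
```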
Independent Model Training for Each Version
Model Selection
| Model | Pros | Cons |
| Random Forest | Provides feature importance insights. | Requires more tuning to prevent overfitting. |
| | Works well with non-linear relationships present in opioid-related predictors. | Harder to interpret than simpler models. |
| | Less sensitive to outliers than Logistic Regression. | |
| eXtreme Gradient Boosting (XGBoost) | Optimized for imbalanced datasets. | Prone to overfitting if parameters are not well-tuned. |
| | Efficient with large feature spaces, which is useful when handling medical data. | Computationally intensive. |
| | Reduces bias-variance tradeoff. | |
| Logistic Regression | Simple and interpretable, making it easy to explain the results. | Assumes linear relationships, which limits performance due to lack of feature correlation in this dataset. |
| | Computationally efficient and easy to deploy. | Struggles with class imbalance. |
| Support Vector Machine (SVM) | Effective in high-dimensional spaces. | Computationally expensive and long execution time. |
| | Finds complex decision boundaries, improving classification accuracy. | Sensitive to noisy data. |
| MLP Classifier (ANN) | Detects complex, non-linear relationships. | Requires large training data, making it prone to overfitting if not regularized. |
| | Learns hierarchical representations, making it useful for demographic correlations. | Computationally demanding, requiring significant resources. |
Feature Selection

Supervised Model Variants
Anomaly Detection Process
Data Preparation
Model Selection
Supervised Model Variants
Anomaly Detection Process
Clustering


Overall Model Performance Comparison
| Model | Accuracy | Precision | Recall | F1-Score | AUC-ROC | MCC | Method |
| SVM | 0.718815 | 0.718815 | 1 | 0.836408 | 0.558413 | 0 | Base (Untuned) |
| Logistic Regression | 0.718815 | 0.718998 | 0.99942 | 0.836328 | 0.552389 | 0.014065 | Base (Untuned) |
| ANN | 0.717981 | 0.720421 | 0.993035 | 0.835041 | 0.629992 | 0.037113 | Base (Untuned) |
| XGBoost | 0.707134 | 0.731729 | 0.935577 | 0.821192 | 0.634434 | 0.096799 | Base (Untuned) |
| Random Forest | 0.675845 | 0.742068 | 0.841555 | 0.788686 | 0.601352 | 0.108623 | Base (Untuned) |
| XGBoost | 0.653734 | 0.744926 | 0.78816 | 0.765933 | 0.626191 | 0.103503 | SMOTE + Tomek Links |
| Random Forest | 0.647893 | 0.748445 | 0.768427 | 0.758305 | 0.594786 | 0.110615 | SMOTE + Tomek Links |
| ANN | 0.576971 | 0.802217 | 0.54614 | 0.649862 | 0.639797 | 0.181605 | SMOTE + Tomek Links |
| Logistic Regression | 0.523154 | 0.766055 | 0.48462 | 0.593672 | 0.552218 | 0.095958 | SMOTE + Tomek Links |
| SVM | 0.504798 | 0.852632 | 0.376088 | 0.521949 | 0.634489 | 0.202809 | SMOTE + Tomek Links |
| XGBoost | 0.659574 | 0.750968 | 0.78758 | 0.768839 | 0.633237 | 0.125162 | SMOTE + GridSearchCV Tuned |
| Random Forest | 0.639967 | 0.752644 | 0.743471 | 0.748029 | 0.616118 | 0.117752 | SMOTE + GridSearchCV Tuned |
| SVM | 0.526491 | 0.720721 | 0.557168 | 0.628478 | 0.502913 | 0.004740 | SMOTE + GridSearchCV Tuned |
| ANN | 0.573217 | 0.786885 | 0.557168 | 0.652396 | 0.623393 | 0.15415 | SMOTE + GridSearchCV Tuned |
| Logistic Regression | 0.523571 | 0.765782 | 0.485781 | 0.59446 | 0.552002 | 0.095645 | SMOTE + GridSearchCV Tuned |
| XGBoost | 0.664581 | 0.749322 | 0.801509 | 0.774537 | 0.625481 | 0.123765 | SMOTE + GridSearchCV + Manual Tuning |
| Random Forest | 0.652065 | 0.747909 | 0.778294 | 0.762799 | 0.604701 | 0.111496 | SMOTE + GridSearchCV + Manual Tuning |
| SVM | 0.526491 | 0.720721 | 0.557168 | 0.628478 | 0.502913 | 0.004740 | SMOTE + GridSearchCV + Manual Tuning |
| ANN | 0.540259 | 0.797699 | 0.482879 | 0.601591 | 0.61748 | 0.153998 | SMOTE + GridSearchCV + Manual Tuning |
| Logistic Regression | 0.521902 | 0.768372 | 0.479396 | 0.590422 | 0.551539 | 0.0994 | SMOTE + GridSearchCV + Manual Tuning |
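The untuned SVM row in the table above (recall of 1, MCC of 0, and accuracy equal to precision) is the pattern produced by a classifier that assigns every case to the positive class. The sketch below reproduces that pattern with scikit-learn under stated assumptions: a synthetic test set of 10,000 labels whose positive rate is rounded from the reported 0.718815. The sample size is arbitrary and the labels are not the study's data.

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score,
    f1_score,
    matthews_corrcoef,
    precision_score,
    recall_score,
)

n = 10_000                      # arbitrary sample size (assumption)
n_pos = round(0.718815 * n)     # positive rate taken from the table

y_true = np.array([1] * n_pos + [0] * (n - n_pos))
y_pred = np.ones(n, dtype=int)  # "always positive" classifier

acc = accuracy_score(y_true, y_pred)
prec = precision_score(y_true, y_pred)
rec = recall_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)
mcc = matthews_corrcoef(y_true, y_pred)

# With constant positive predictions, accuracy and precision both equal the
# positive rate, recall is 1, and F1 reduces to 2p / (1 + p).
print(f"accuracy={acc:.4f} precision={prec:.4f} recall={rec:.4f} "
      f"f1={f1:.4f} mcc={mcc:.4f}")
```

This is why accuracy and F1 alone can make such a model look strong, while MCC and AUC-ROC expose that it does not separate the classes.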
Model Validation Strategy
Key Takeaways and Future Recommendations
Anomaly Detection Overview

- Different scenarios or datasets (synthetic vs. real data): the numeric results may have been produced in a grid search with synthetic anomalies, in which 5% of the data points were flipped, while the bar chart may have used the final real dataset or a different version of the dataset with different labels.
- Different labeling rules (ground-truth logic): the bar chart may have been produced with a more stringent rule (for instance, requiring that at least two models agree on an anomaly), while the numeric F1 score was computed against synthetic labels. These divergences yield different classifications and therefore different performance numbers. The bar chart thus represents one specific labeling or data scenario in which One-Class SVM performs best, while the separate numeric printout comes from another scenario (or a different labeling approach) and leads to a different finding for One-Class SVM.
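The two scenarios above can be sketched side by side to show how the choice of ground truth changes each detector's F1 score. This is a hedged illustration on synthetic data: the way the 5% of rows are perturbed, the contamination settings, and the "at least two detectors agree" rule are assumptions chosen to mirror the description, not the study's actual procedure.

```python
import numpy as np
from sklearn.covariance import EllipticEnvelope
from sklearn.ensemble import IsolationForest
from sklearn.metrics import f1_score
from sklearn.neighbors import LocalOutlierFactor
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(42)
X = rng.normal(size=(400, 4))

# Scenario A: synthetic ground truth. 5% of rows are perturbed and labelled
# anomalous (the perturbation itself is an assumption).
synthetic = np.zeros(len(X), dtype=int)
idx = rng.choice(len(X), size=int(0.05 * len(X)), replace=False)
X[idx] += rng.normal(loc=4.0, scale=1.0, size=(len(idx), X.shape[1]))
synthetic[idx] = 1

detectors = {
    "Isolation Forest": IsolationForest(contamination=0.05, random_state=0),
    "Elliptic Envelope": EllipticEnvelope(contamination=0.05, random_state=0),
    "Local Outlier Factor": LocalOutlierFactor(contamination=0.05),
    "One-Class SVM": OneClassSVM(nu=0.05, gamma="scale"),
}

# fit_predict returns -1 for anomalies; convert to 1 = anomaly, 0 = normal.
flags = {
    name: (det.fit_predict(X) == -1).astype(int)
    for name, det in detectors.items()
}

# Scenario B: consensus ground truth. A row counts as anomalous only if at
# least two detectors flag it.
votes = np.sum(list(flags.values()), axis=0)
consensus = (votes >= 2).astype(int)

for name, pred in flags.items():
    f1_syn = f1_score(synthetic, pred, zero_division=0)
    f1_con = f1_score(consensus, pred, zero_division=0)
    print(f"{name}: F1 vs synthetic={f1_syn:.3f}, F1 vs consensus={f1_con:.3f}")
```

Because the consensus labels are built from the detectors' own outputs, the ranking of models under scenario B need not match the ranking under scenario A.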
Clustering
Evaluation of Models
| Metric | Interpretation |
| Silhouette Score | Measures how well each point fits in its assigned cluster. Higher is better |
| Davies-Bouldin Index | Measures the separation between clusters relative to their spread. Lower is better |
| Calinski-Harabasz Index (not for HDBSCAN) | Measures the ratio of between-cluster to within-cluster dispersion. Higher is better |
| Mean Intra-Cluster Distance (HDBSCAN only) | Measures compactness within clusters. Lower is better |
| Metric | K-Means (K=4) | K-Means (K=5) | DBSCAN (Best) | HDBSCAN |
| Silhouette Score | 0.5267 | 0.5649 | 0.2439 | 0.8983 |
| Davies-Bouldin Index | 0.7013 | 0.6512 | 0.4893 | 0.3650 |
| Calinski-Harabasz Score | 13252.17 | 12801.27 | 106.58 | N/A |
| Mean Intra-Cluster Distance | N/A | N/A | N/A | 0.0505 |
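A minimal sketch of how the first three metrics in the tables above can be computed with scikit-learn for K-Means at k = 4 and k = 5. The blob data, the `n_init` value, and the random seeds are assumptions for illustration, so the printed values will not match the reported ones.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import (
    calinski_harabasz_score,
    davies_bouldin_score,
    silhouette_score,
)

# Synthetic 2-D data standing in for the real feature space (assumption).
X, _ = make_blobs(n_samples=500, centers=5, cluster_std=0.8, random_state=0)

scores = {}
for k in (4, 5):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = {
        "silhouette": silhouette_score(X, labels),                # higher is better
        "davies_bouldin": davies_bouldin_score(X, labels),        # lower is better
        "calinski_harabasz": calinski_harabasz_score(X, labels),  # higher is better
    }

for k, s in scores.items():
    print(k, {name: round(value, 4) for name, value in s.items()})
```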
Evaluation of K-Means Clustering
- Silhouette Score: better for k = 5 (0.5649 vs. 0.5267)
- Davies-Bouldin Index: better for k = 5 (0.6512 vs. 0.7013)
- Calinski-Harabasz Score: better for k = 4 (13252.17 vs. 12801.27)
Evaluation of DBSCAN Clustering
- Silhouette Score (0.2439) is substantially lower than for K-Means or HDBSCAN, indicating less well-defined clusters.
- Davies-Bouldin Index (0.4893) is lower than for K-Means (0.7013 and 0.6512), meaning clusters are more distinct, but it is still higher than for HDBSCAN (0.3650).
- Calinski-Harabasz Score (106.58) is far below the K-Means values (13252.17 and 12801.27), highlighting a lack of clear cluster structure.
- Key Strength: can detect noise and outliers, which K-Means cannot, since it assigns every point to a cluster.
Evaluation of HDBSCAN Clustering
- Silhouette Score (0.8983) is the highest, meaning the clusters are extremely well-defined.
- Davies-Bouldin Index (0.3650) is the lowest, confirming that clusters are well-separated.
- Mean Intra-Cluster Distance (0.0505) suggests tight, compact clusters.
- Conclusion: HDBSCAN outperforms both K-Means and DBSCAN in terms of overall quality of clustering.
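Mean intra-cluster distance is not a built-in scikit-learn metric, so the sketch below shows one plausible way to compute it for a density-based clustering: the average pairwise distance within each cluster, averaged over clusters, with noise points (label -1) left out. The helper `mean_intra_cluster_distance`, the use of DBSCAN as the stand-in clusterer, and its `eps` and `min_samples` values are assumptions; the study's exact definition may differ.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.metrics import pairwise_distances


def mean_intra_cluster_distance(X, labels):
    """Average within-cluster pairwise distance, ignoring noise (label -1).

    Hypothetical helper: one value per cluster, then the mean over clusters.
    """
    X = np.asarray(X)
    labels = np.asarray(labels)
    per_cluster = []
    for label in np.unique(labels):
        if label == -1:
            continue  # density-based methods mark noise with -1
        members = X[labels == label]
        if len(members) < 2:
            continue  # a single point has no pairwise distance
        d = pairwise_distances(members)
        # The upper triangle skips the zero diagonal and duplicate pairs.
        per_cluster.append(d[np.triu_indices(len(members), k=1)].mean())
    return float(np.mean(per_cluster)) if per_cluster else float("nan")


# Synthetic data standing in for the real features (assumption).
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.5, random_state=1)
labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)

noise_fraction = float(np.mean(labels == -1))
compactness = mean_intra_cluster_distance(X, labels)
print(f"noise fraction={noise_fraction:.3f}, mean intra-cluster distance={compactness:.4f}")
```

When noise points are excluded in this way, it is worth reporting the noise fraction next to the score, since a clustering that discards many points can look more compact than one that keeps them.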
Validation of Models
Data Visualization and Interpretation
Classification
Confusion Matrices
Base Model Performance





ROC Curve Analysis


Precision-Recall Curve Analysis


Anomaly Detection
Isolation Forest: Opioid and Depressant Combinations
Elliptic Envelope: Prescription Opioid Abuse Patterns
Local Outlier Factor: Speedballs and Unregulated Substances
One-Class SVM: High-Risk Depressant Combinations
General Observations Across All Models
Heatmap
- The color scale runs from light (fewer flagged cases) to dark (more flagged cases).
- The value in each cell is the number of times that combination was flagged as anomalous by the respective model.
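The count matrix behind such a heatmap can be sketched with a pandas cross-tabulation. The flagged records, the drug combination names, and the column names `model` and `combination` below are hypothetical placeholders, not values from the study.

```python
import pandas as pd

# Hypothetical flagged records: one row per (model, drug combination) flag.
flags = pd.DataFrame(
    {
        "model": [
            "Isolation Forest",
            "Isolation Forest",
            "One-Class SVM",
            "Local Outlier Factor",
            "One-Class SVM",
            "Isolation Forest",
        ],
        "combination": [
            "Heroin + Cocaine",
            "Fentanyl + Benzodiazepine",
            "Fentanyl + Benzodiazepine",
            "Heroin + Cocaine",
            "Fentanyl + Benzodiazepine",
            "Heroin + Cocaine",
        ],
    }
)

# Rows are combinations, columns are models, cells are flag counts.
# Pairs that never occur are filled with 0.
counts = pd.crosstab(flags["combination"], flags["model"])
print(counts)
```

A table of this shape can be handed to a plotting function that annotates each cell with its count and maps larger counts to darker colors.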

Anomaly Detection Comparison

Drug Presence in Normal Vs Anomalous Cases
Clustering
K-Means Scatter Plots
Conclusions
Challenges and Limitations
Future Recommendations





Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).