Submitted:
20 April 2024
Posted:
23 April 2024
Abstract
Keywords:
- Biased Models: During training, machine learning algorithms frequently give priority to the majority class. In our example, the model might learn to perfectly identify healthy scans but completely miss the rare disease, leading to misdiagnoses [3].
- Poor Performance Metrics: Traditional accuracy metrics become unreliable when dealing with imbalanced classes. A high overall accuracy might mask the model's inability to detect the minority class effectively [4].
- Fraud Detection: Credit card transactions are mostly legitimate (majority class), with a small number being fraudulent (minority class). A model trained on imbalanced data might miss fraudulent transactions altogether.
- Spam Filtering: Most emails are legitimate (majority), with a smaller portion being spam (minority). A model trained on imbalanced data might classify some legitimate emails as spam (false positives) while missing actual spam emails (false negatives).
- Customer Churn Prediction: Most customers remain loyal (majority), with a few churning (minority). A model trained on imbalanced data might fail to identify customers at risk of churning, hindering efforts to retain them.
- Oversampling: Replicating minority class data points to improve their representation. This can be done randomly or strategically (e.g., with the SMOTE algorithm) [5].
- Undersampling: Reducing the number of majority class data points to achieve a more balanced distribution. However, this can discard valuable information [6].
- Cost-Sensitive Learning: During training, misclassified minority class instances are given larger weights, which forces the model to focus more on the minority class [7].
- Hybrid Approaches: Combining techniques like oversampling with feature selection or cost-sensitive learning can be effective.
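The resampling and weighting strategies listed above can be illustrated with a minimal sketch. The function names, the toy data (95 legitimate and 5 fraudulent transactions), and the "balanced" weighting heuristic `n_samples / (n_classes * class_count)` are assumptions made for illustration and are not taken from this paper.

```python
import random
from collections import Counter


def random_oversample(rows, labels, minority, seed=0):
    """Duplicate randomly chosen minority rows until the two classes are equal in size."""
    rng = random.Random(seed)
    minority_idx = [i for i, y in enumerate(labels) if y == minority]
    n_majority = len(labels) - len(minority_idx)
    extra = [rng.choice(minority_idx) for _ in range(n_majority - len(minority_idx))]
    keep = list(range(len(labels))) + extra
    return [rows[i] for i in keep], [labels[i] for i in keep]


def random_undersample(rows, labels, minority, seed=0):
    """Keep all minority rows and an equally sized random subset of majority rows."""
    rng = random.Random(seed)
    minority_idx = [i for i, y in enumerate(labels) if y == minority]
    majority_idx = [i for i, y in enumerate(labels) if y != minority]
    keep = sorted(minority_idx + rng.sample(majority_idx, len(minority_idx)))
    return [rows[i] for i in keep], [labels[i] for i in keep]


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    return {c: len(labels) / (len(counts) * n) for c, n in counts.items()}


# Hypothetical toy data: 95 legitimate (0) and 5 fraudulent (1) transactions.
rows = [[float(i)] for i in range(100)]
labels = [0] * 95 + [1] * 5

_, over_labels = random_oversample(rows, labels, minority=1)
_, under_labels = random_undersample(rows, labels, minority=1)
weights = balanced_class_weights(labels)
print(Counter(over_labels), Counter(under_labels), weights)
```

Oversampling keeps every original row but repeats minority rows, whereas undersampling discards 90 of the 95 majority rows in this toy case, which makes the information loss noted above concrete.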
| Email Text | Label |
|---|---|
| Discussing upcoming meeting | Ham |
| Promotional offer - 50% off! | Spam |
| Forgot your password? Reset here. | Phishing |
| Important update from your bank. | Ham |
| Free gift card! Click here to claim. | Spam |
| Lunch order for tomorrow? | Ham |
| Win a trip to Hawaii! | Spam |
| Meeting reminder: 10:00 AM | Ham |
| Your account has been suspended. | Phishing |
| Update your billing information. | Phishing |
- Missed Spam: The model might classify some spam emails as legitimate (false negatives).
- Unnecessary Filtering: The model might flag some legitimate emails as spam (false positives).
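To make the two error types above concrete, the following sketch counts false negatives and false positives for a degenerate filter that labels every email as legitimate. The 95/5 ham-to-spam split and the helper name `confusion_counts` are hypothetical and chosen only to show how a high accuracy can coexist with zero recall on the minority class.

```python
def confusion_counts(y_true, y_pred, positive):
    """Return (tp, fp, fn, tn) treating `positive` as the minority class of interest."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1
        elif pred == positive:
            fp += 1          # legitimate email flagged as spam
        elif truth == positive:
            fn += 1          # spam email that slipped through
        else:
            tn += 1
    return tp, fp, fn, tn


# Hypothetical inbox: 95 legitimate emails and 5 spam emails.
y_true = ["ham"] * 95 + ["spam"] * 5
# A degenerate model that always predicts the majority class.
y_pred = ["ham"] * 100

tp, fp, fn, tn = confusion_counts(y_true, y_pred, positive="spam")
accuracy = (tp + tn) / len(y_true)
recall = tp / (tp + fn) if (tp + fn) else 0.0
print(f"accuracy={accuracy:.2f} recall={recall:.2f} missed_spam={fn} false_alarms={fp}")
```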
2. Literature Survey
- SMOTE (Synthetic Minority Over-sampling Technique): It is used to generate synthetic samples by interpolating between existing minority class instances [23].
- ADASYN (Adaptive Synthetic Sampling): It is similar to SMOTE, but it generates more synthetic data for minority class instances that are harder to learn [24].
- Borderline SMOTE: It concentrates on the minority class instances that are nearer to the borderline with the majority class [25].
- Safe-Level SMOTE: Modifies the SMOTE algorithm by incorporating a safety level to prevent overgeneralization [26].
- SMOTE Tomek Links: It combines SMOTE with Tomek Links, which are pairs of nearest neighbors from different classes. The Tomek Links are removed to increase the separation between classes [27].
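The interpolation idea behind SMOTE can be sketched in a few lines. This is a simplified illustration rather than the reference implementation from [23]: the function name `smote_sketch`, the Euclidean distance, and the toy points are assumptions.

```python
import math
import random


def smote_sketch(minority, k=3, n_new=5, seed=0):
    """Create n_new synthetic points on segments between minority points and their neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        x = minority[i]
        # k nearest minority neighbours of x, excluding x itself.
        others = [p for j, p in enumerate(minority) if j != i]
        neighbours = sorted(others, key=lambda p: math.dist(x, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append([a + gap * (b - a) for a, b in zip(x, neighbour)])
    return synthetic


# Hypothetical two-feature minority class.
minority = [[1.0, 1.0], [2.0, 1.5], [1.5, 3.0], [3.0, 2.0]]
synthetic = smote_sketch(minority, k=2, n_new=6)
for point in synthetic:
    print([round(v, 3) for v in point])
```

Because each synthetic point lies on a segment between two existing minority points, it never leaves the region already spanned by the minority class.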
| Technique | Description | Advantages | Disadvantages |
|---|---|---|---|
| SMOTE (Synthetic Minority Over-sampling Technique) | It creates synthetic samples by interpolating between existing minority class instances. | Simple to implement | Can lead to overfitting |
| ADASYN (Adaptive Synthetic Sampling) | Focuses on generating more synthetic data for harder-to-learn minority class instances. | Addresses limitations of SMOTE | More complex to implement |
| Borderline SMOTE | Targets minority class instances close to the decision boundary with the majority class. | Aims to improve classification on the borders | May overfit on specific borderline regions |
| Safe-Level SMOTE | Introduces a "safety level" to avoid generating synthetic points too far from existing minority class instances. | Reduces overgeneralization | Requires careful selection of the safety level parameter |
| SMOTE Tomek Links | Combines SMOTE with Tomek Links (identifies noisy data points) to improve class separation. | Addresses noisy data along with oversampling | More complex to implement compared to basic SMOTE |
| FROST (Feature space RObust Synthetic saTuration) | Generates synthetic data points by amplifying the difference between a chosen feature value of a minority class instance and its neighbors. | Potentially more control over synthetic data generation | Relatively new technique, requires further research on optimal parameter settings |
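A Tomek link, as used in the SMOTE Tomek Links row above, can be detected with a short sketch: two points form a link when each is the other's nearest neighbour and their labels differ. The function name `tomek_links`, the Euclidean distance, and the one-dimensional toy data are assumptions for illustration.

```python
import math


def tomek_links(points, labels):
    """Return index pairs (i, j), i < j, that are mutual nearest neighbours with different labels."""
    def nearest(i):
        candidates = (j for j in range(len(points)) if j != i)
        return min(candidates, key=lambda j: math.dist(points[i], points[j]))

    nn = [nearest(i) for i in range(len(points))]
    return [
        (i, j)
        for i, j in enumerate(nn)
        if i < j and nn[j] == i and labels[i] != labels[j]
    ]


# Hypothetical 1-D data: points 1 and 2 sit on opposite sides of the class border.
points = [[0.0], [1.0], [1.2], [3.0], [3.1]]
labels = [0, 0, 1, 1, 1]
links = tomek_links(points, labels)
print(links)  # [(1, 2)]
```

Points 3 and 4 are also mutual nearest neighbours, but they share a label and therefore do not form a link. Removing one or both members of each detected link is what widens the gap between the classes.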
3. Proposed Algorithm
3.1. FROST-Enhanced Oversampling
1. Choose Initial Feature (B)
2. Calculate Similarity Matrix (C)
3. Identify k-Nearest Neighbors (KNN) (D)
4. Generate Synthetic Data Points? (E)
5. Calculate Difference & Amplify (F)
6. Create New Data Point with Amplified Difference (G)
7. Add New Point to Synthetic Data Set (H)
8. Repeat (E-H)
9. End (J)
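Only the step names (B) to (J) survive here, so the following is one possible reading of them and should not be taken as the authors' implementation. It assumes Euclidean distance as the similarity measure, takes the difference as neighbour value minus instance value on the chosen feature, and uses a hypothetical `amplification` parameter; none of these details is confirmed by the text.

```python
import math


def frost_like_oversample(minority, feature, k=2, amplification=1.5):
    """Sketch of steps (B)-(H): amplify the chosen-feature difference to each neighbour."""
    n = len(minority)
    # (C) Pairwise distances serve as the similarity matrix (assumption: Euclidean).
    dist = [[math.dist(a, b) for b in minority] for a in minority]
    synthetic = []
    for i, x in enumerate(minority):
        # (D) Indices of the k nearest minority neighbours of x, excluding x itself.
        neighbours = sorted((j for j in range(n) if j != i), key=lambda j: dist[i][j])[:k]
        for j in neighbours:
            # (F) Difference on the chosen feature (B), scaled by the amplification factor.
            diff = minority[j][feature] - x[feature]
            # (G) Copy x and move only the chosen feature by the amplified difference.
            new_point = list(x)
            new_point[feature] = x[feature] + amplification * diff
            # (H) Add the new point to the synthetic set.
            synthetic.append(new_point)
    return synthetic


# Hypothetical minority class with two features; feature 0 is the chosen feature.
minority = [[1.0, 10.0], [2.0, 11.0], [4.0, 12.0]]
synthetic = frost_like_oversample(minority, feature=0, k=1, amplification=2.0)
print(synthetic)  # [[3.0, 10.0], [0.0, 11.0], [0.0, 12.0]]
```

Under this reading, an amplification factor above 1 places the new value beyond the neighbour rather than between the two points, which is the main contrast with SMOTE-style interpolation.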
3.2. Methodology
- Account (from account_activity.csv)
- Customer (from customer_data.csv)
- Fraud Indicators (from fraud_indicators.csv)
- Suspicious Activity (from suspicious_activity.csv)
- Merchant (from merchant_data.csv)
- Transaction Category (from transaction_category_labels.csv)
- Amount Data (from amount_data.csv)
- Anomaly Scores (from anomaly_scores.csv)
- Transaction Metadata (from transaction_metadata.csv)
- Transaction Records (from transaction_records.csv)
- Customer (one-to-one) Account: Each customer has one account associated with them.
- Account (many-to-many) Transaction Record: An account can have many transactions, and a transaction record can be associated with multiple accounts (joint accounts).
- Transaction Record (one-to-one) Amount Data: Each transaction record has one set of amount data associated with it.
- Transaction Record (one-to-one) Transaction Metadata: Each transaction record has one set of metadata associated with it.
- Transaction Record (one-to-many) Anomaly Scores: A transaction record can have multiple anomaly scores generated by different models.
- Transaction Record (many-to-one) Transaction Category: A transaction can belong to one specific category (e.g., groceries, travel).
- Transaction Record (many-to-many) Merchant: A transaction usually involves one merchant but can involve several (e.g., on online marketplaces), and a merchant can have many transactions.
- Transaction Record (many-to-many) Suspicious Activity: A transaction record can be flagged for multiple suspicious activities, and a suspicious activity can be identified in multiple transactions.
- Suspicious Activity (many-to-many) Fraud Indicators: A suspicious activity can be triggered by multiple fraud indicators, and a fraud indicator can contribute to identifying multiple suspicious activities.
- One-to-One (1:1) - Each instance of one entity is related to exactly one instance of another. (e.g., Customer - Account)
- Many-to-One (N:1) - Several instances of one entity are linked to a single instance of another. (e.g., Transaction Record - Transaction Category)
- Many-to-Many (N:M) - Multiple instances of one entity are connected to multiple instances of another. (e.g., Transaction Record - Merchant)
- Account: Represents a customer's financial account.
- Customer: Represents a customer with personal information.
- FraudDetectionSystem: Orchestrates the fraud detection process.
- SuspiciousActivityManager: Manages the identification and flagging of suspicious transactions.
- TransactionProcessor: Processes incoming transaction data.
- TransactionRecord: Represents a single transaction record with details.
- FraudIndicators: Encapsulates rules or checks for identifying potential fraud.
- TransactionAnalyzer: Analyzes transaction data using various techniques.
- AnomalyScoreCalculator: Calculates anomaly scores based on transaction attributes.
- FraudDetectionSystem <<uses>> TransactionProcessor: The system uses the processor to handle incoming transactions.
- FraudDetectionSystem <<composes>> SuspiciousActivityManager: The system owns the manager component responsible for identifying suspicious activities.
- TransactionProcessor <<creates>> TransactionRecord: The processor creates transaction records from raw data.
- TransactionRecord <<associates with>> Account: A transaction record is associated with a specific account.
- TransactionRecord <<associates with>> Merchant: A transaction record involves a merchant.
- TransactionRecord <<uses>> TransactionAnalyzer: The record is passed to the analyzer for in-depth analysis.
- TransactionAnalyzer <<uses>> FraudIndicators: The analyzer uses fraud indicators to identify potential red flags.
- TransactionAnalyzer <<uses>> AnomalyScoreCalculator: The analyzer uses the calculator to generate anomaly scores.
- SuspiciousActivityManager <<associates with>> TransactionRecord: The manager identifies suspicious activities within transaction records.
- SuspiciousActivityManager <<associates with>> FraudIndicators: The manager considers fraud indicators when flagging suspicious activities.
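The class relationships above could be wired together roughly as follows. Only the class names come from the text; every attribute, method, rule, and threshold is a hypothetical placeholder, and the wiring is simplified (the system passes analysis results to the manager instead of the record calling the analyzer itself).

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class TransactionRecord:
    record_id: str
    account_id: str   # association with Account
    merchant_id: str  # association with Merchant
    amount: float


@dataclass
class FraudIndicators:
    # Hypothetical rule set: rule name -> predicate over a record.
    rules: Dict[str, Callable[[TransactionRecord], bool]] = field(default_factory=dict)

    def triggered(self, record: TransactionRecord) -> List[str]:
        return [name for name, rule in self.rules.items() if rule(record)]


class AnomalyScoreCalculator:
    def __init__(self, typical_amount: float):
        self.typical_amount = typical_amount

    def score(self, record: TransactionRecord) -> float:
        # Toy score (assumption): relative deviation from a typical amount.
        return abs(record.amount - self.typical_amount) / self.typical_amount


class TransactionAnalyzer:
    def __init__(self, indicators: FraudIndicators, calculator: AnomalyScoreCalculator):
        self.indicators = indicators
        self.calculator = calculator

    def analyze(self, record: TransactionRecord) -> Tuple[List[str], float]:
        return self.indicators.triggered(record), self.calculator.score(record)


class SuspiciousActivityManager:
    def __init__(self, score_threshold: float):
        self.score_threshold = score_threshold
        self.flagged: List[str] = []

    def review(self, record: TransactionRecord, triggered: List[str], score: float) -> None:
        if triggered or score > self.score_threshold:
            self.flagged.append(record.record_id)


class TransactionProcessor:
    def process(self, raw: dict) -> TransactionRecord:
        return TransactionRecord(raw["id"], raw["account"], raw["merchant"], float(raw["amount"]))


class FraudDetectionSystem:
    def __init__(self, processor: TransactionProcessor, analyzer: TransactionAnalyzer,
                 score_threshold: float):
        self.processor = processor                                  # <<uses>>
        self.analyzer = analyzer
        self.manager = SuspiciousActivityManager(score_threshold)  # <<composes>>

    def handle(self, raw: dict) -> TransactionRecord:
        record = self.processor.process(raw)
        triggered, score = self.analyzer.analyze(record)
        self.manager.review(record, triggered, score)
        return record


indicators = FraudIndicators({"large_amount": lambda r: r.amount > 1000})
analyzer = TransactionAnalyzer(indicators, AnomalyScoreCalculator(typical_amount=100.0))
system = FraudDetectionSystem(TransactionProcessor(), analyzer, score_threshold=3.0)
system.handle({"id": "t1", "account": "a1", "merchant": "m1", "amount": 120})
system.handle({"id": "t2", "account": "a1", "merchant": "m2", "amount": 5000})
print(system.manager.flagged)  # ['t2']
```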
- Data Acquisition: Obtain a labeled dataset containing historical transaction data with fraudulent and legitimate transactions clearly identified.
- Data Preprocessing: Clean and scale the data to ensure compatibility with machine learning models.
- Class Imbalance Analysis: Calculate the dataset's degree of class imbalance.
- Model Training: Train machine learning models for fraud detection with the following approaches:
  - Baseline Model: Trained on the original imbalanced dataset.
  - Oversampling with Random Replication: Traditional oversampling by replicating minority class data points.
  - Oversampling with SMOTE: Oversampling using the SMOTE algorithm.
  - Oversampling with FROST: Oversampling using the proposed FROST function with different values for k and m.
- Model Evaluation: Assess each model's performance using classification metrics such as precision, recall, F1-score, and AUC-ROC.
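Two of the steps above lend themselves to a short worked sketch: quantifying the degree of class imbalance and computing AUC-ROC. The majority-to-minority ratio is one common way to express imbalance and is an assumption here, as are the function names and the toy labels and scores. AUC-ROC is computed from its rank interpretation, that is, the probability that a randomly chosen positive receives a higher score than a randomly chosen negative.

```python
from collections import Counter


def imbalance_ratio(labels):
    """Ratio of the largest class count to the smallest class count."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())


def roc_auc(y_true, scores):
    """AUC-ROC via pairwise comparison of positive and negative scores (ties count half)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical labels (1 = fraudulent) and model scores.
y_true = [0, 0, 0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8, 0.7, 0.9]

ratio = imbalance_ratio(y_true)
auc = roc_auc(y_true, scores)
print(ratio, auc)  # 2.0 0.875
```

The pairwise form is quadratic in the number of samples, so it suits small illustrations like this one better than large datasets.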
4. Results

| Classifier | Oversampling | Accuracy | Precision | Recall | F1 Score | Confusion Matrix |
|---|---|---|---|---|---|---|
| Decision Tree | SMOTE | 0.9114754098360656 | 0.9012345679012346 | 0.9299363057324841 | 0.9153605015673981 | [[132 16] [11 146]] |
| Decision Tree | FROST | 0.9393939393939394 | 0.8032786885245902 | 1.0 | 0.8909090909090909 | [[137 12] [0 49]] |
| Random Forest | SMOTE | 0.9475409836065574 | 0.9171597633136095 | 0.9872611464968153 | 0.9509202453987731 | [[134 14] [2 155]] |
| Random Forest | FROST | 1.0 | 1.0 | 1.0 | 1.0 | [[149 0] [0 49]] |
| K-Nearest Neighbors (KNN) | SMOTE | 0.8459016393442623 | 0.7696078431372549 | 1.0 | 0.8698060941828256 | [[101 47] [0 157]] |
| K-Nearest Neighbors (KNN) | FROST | 0.8737373737373737 | 0.6818181818181818 | 0.9183673469387755 | 0.782608695652174 | [[128 21] [4 45]] |
| Gradient Boosting | SMOTE | 0.9245901639344263 | 0.8988095238095238 | 0.9617834394904459 | 0.9292307692307693 | [[131 17] [6 151]] |
| Gradient Boosting | FROST | 0.9393939393939394 | 0.9111111111111111 | 0.8367346938775511 | 0.8723404255319148 | [[145 4] [8 41]] |
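The reported metrics can be recomputed from the confusion matrices as a consistency check. The sketch below assumes the layout `[[TN, FP], [FN, TP]]` with the minority class as the positive class; this layout is not stated in the text, but it reproduces the reported values. The function name is hypothetical.

```python
def metrics_from_confusion(cm):
    """Return (accuracy, precision, recall, f1) for a matrix laid out as [[TN, FP], [FN, TP]]."""
    (tn, fp), (fn, tp) = cm
    total = tn + fp + fn + tp
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return accuracy, precision, recall, f1


# Confusion matrices as reported in the results.
reported = {
    ("Decision Tree", "SMOTE"): [[132, 16], [11, 146]],
    ("Decision Tree", "FROST"): [[137, 12], [0, 49]],
    ("Random Forest", "SMOTE"): [[134, 14], [2, 155]],
    ("Random Forest", "FROST"): [[149, 0], [0, 49]],
    ("KNN", "SMOTE"): [[101, 47], [0, 157]],
    ("KNN", "FROST"): [[128, 21], [4, 45]],
    ("Gradient Boosting", "SMOTE"): [[131, 17], [6, 151]],
    ("Gradient Boosting", "FROST"): [[145, 4], [8, 41]],
}

for (classifier, method), cm in reported.items():
    acc, prec, rec, f1 = metrics_from_confusion(cm)
    n_test = sum(sum(row) for row in cm)
    print(f"{classifier:18s} {method:5s} n={n_test:3d} "
          f"acc={acc:.4f} prec={prec:.4f} rec={rec:.4f} f1={f1:.4f}")
```

Summing each matrix gives 305 test samples for every SMOTE result and 198 for every FROST result, so the two sets of figures are computed over test sets of different size and class balance.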

5. Conclusions
Author Contributions
Conflicts of Interest
References
1. Elshaar, S.; Sadaoui, S. Semi-supervised Classification of Fraud Data in Commercial Auctions. Appl. Artif. Intell. 2019, 34, 47–63.
2. Hasani, N.; et al. Artificial intelligence in medical imaging and its impact on the rare disease community: threats, challenges and opportunities. PET Clinics 2022, 17(1), 13–29.
3. Shaikh, S.; Daudpota, S.M.; Imran, A.S.; Kastrati, Z. Towards Improved Classification Accuracy on Highly Imbalanced Text Dataset Using Deep Neural Language Models. Appl. Sci. 2021, 11, 869.
4. Hasib, K.M.; Iqbal, S.; Shah, F.M.; Al Mahmud, J.; Popel, M.H.; Showrov, I.H.; Ahmed, S.; Rahman, O. A Survey of Methods for Managing the Classification and Solution of Data Imbalance Problem. J. Comput. Sci. 2020, 16, 1546–1557.
5. Hoyos-Osorio, J.; Alvarez-Meza, A.; Daza-Santacoloma, G.; Orozco-Gutierrez, A.; Castellanos-Dominguez, G. Relevant information undersampling to support imbalanced data classification. Neurocomputing 2021, 436, 136–146.
6. Mienye, I.D.; Sun, Y. Performance analysis of cost-sensitive learning methods with application to imbalanced medical data. Informatics Med. Unlocked 2021, 25, 100690.
7. Feng, F.; Li, K.-C.; Shen, J.; Zhou, Q.; Yang, X. Using Cost-Sensitive Learning and Feature Selection Algorithms to Improve the Performance of Imbalanced Classification. IEEE Access 2020, 8, 69979–69996.
8. Dablain, D.; Krawczyk, B.; Chawla, N.V. DeepSMOTE: Fusing Deep Learning and SMOTE for Imbalanced Data. IEEE Trans. Neural Networks Learn. Syst. 2022, 34, 6390–6404.
9. Liang, X.; Jiang, A.; Li, T.; Xue, Y.; Wang, G. LR-SMOTE — An improved unbalanced data set oversampling based on K-means and SVM. Knowledge-Based Syst. 2020, 196, 105845.
10. Raghuwanshi, B.S.; Shukla, S. SMOTE based class-specific extreme learning machine for imbalanced learning. Knowledge-Based Syst. 2019, 187, 104814.
11. Xu, T.; Coco, G.; Neale, M. A predictive model of recreational water quality based on adaptive synthetic sampling algorithms and machine learning. Water Res. 2020, 177, 115788.
12. Hu, Z.; Wang, L.; Qi, L.; Li, Y.; Yang, W. A Novel Wireless Network Intrusion Detection Method Based on Adaptive Synthetic Sampling and an Improved Convolutional Neural Network. IEEE Access 2020, 8, 195741–195751.
13. Balaram, A.; Vasundra, S. Prediction of software fault-prone classes using ensemble random forest with adaptive synthetic sampling algorithm. Automated Software Engineering 2022, 29(1), 6.
14. Al Majzoub, H.; Elgedawy, I.; Akaydın, Ö.; KöseUlukök, M. HCAB-SMOTE: A hybrid clustered affinitive borderline SMOTE approach for imbalanced data binary classification. Arabian Journal for Science and Engineering 2020, 45(4), 3205–3222.
15. Smiti, S.; Soui, M. Bankruptcy prediction using deep learning approach based on borderline SMOTE. Information Systems Frontiers 2020, 22(5), 1067–1083.
16. Sun, Y.; Que, H.; Cai, Q.; Zhao, J.; Li, J.; Kong, Z.; Wang, S. Borderline SMOTE Algorithm and Feature Selection-Based Network Anomalies Detection Strategy. Energies 2022, 15, 4751.
17. Chen, Y.; Chang, R.; Guo, J. Effects of data augmentation method borderline-SMOTE on emotion recognition of EEG signals based on convolutional neural network. IEEE Access 2021, 9, 47491–47502.
18. Ayu, P.D.W.; Pradipta, G.A.; Huizen, R.R.; Artana, I. Combining CNN Feature Extractors and Oversampling Safe Level SMOTE to Enhance Amniotic Fluid Ultrasound Image Classification. International Journal of Intelligent Engineering & Systems 2024, 17(1).
19. Srinilta, C.; Kanharattanachai, S. Application of natural neighbor-based algorithm on oversampling smote algorithms. In Proceedings of the 2021 7th International Conference on Engineering, Applied Sciences and Technology (ICEAST), April 2021; IEEE; pp. 217–220.
20. Swana, E.F.; Doorsamy, W.; Bokoro, P. Tomek link and SMOTE approaches for machine fault classification with an imbalanced dataset. Sensors 2022, 22(9), 3246.
21. Ning, Q.; Zhao, X.; Ma, Z. A Novel Method for Identification of Glutarylation Sites Combining Borderline-SMOTE With Tomek Links Technique in Imbalanced Data. IEEE/ACM Trans. Comput. Biol. Bioinform. 2021, 19, 2632–2641.
22. Hairani, H.; Anggrawan, A.; Priyanto, D. Improvement Performance of the Random Forest Method on Unbalanced Diabetes Data Classification Using Smote-Tomek Link. JOIV: Int. J. Informatics Vis. 2023, 7, 258–264.
23. Chawla, N.V.; et al. SMOTE: synthetic minority over-sampling technique. Journal of Artificial Intelligence Research 2002, 16, 321–357.
24. He, H.; et al. ADASYN: Adaptive synthetic sampling approach for imbalanced learning. In Proceedings of the 2008 IEEE International Joint Conference on Neural Networks (IEEE World Congress on Computational Intelligence); IEEE, 2008.
25. Han, H.; Wang, W.Y.; Mao, B.H. Borderline-SMOTE: A new over-sampling method in imbalanced data sets learning. In Proceedings of the International Conference on Intelligent Computing, Hefei, China, 23–26 August 2005; Springer: Berlin/Heidelberg, Germany, 2005; pp. 878–887.
26. Bunkhumpornpat, C.; Sinapiromsaran, K.; Lursinsap, C. Safe-Level-SMOTE: Safe-Level-Synthetic Minority Over-Sampling TEchnique for Handling the Class Imbalanced Problem. In Proceedings of the Pacific-Asia Conference on Knowledge Discovery and Data Mining, Bangkok, Thailand, 27–30 April 2009; pp. 475–482, doi:10.1007/978-3-642-01307-2_43.
27. Zeng, M.; Zou, B.; Wei, F.; Liu, X.; Wang, L. Effective prediction of three common diseases by combining SMOTE with Tomek links technique for imbalanced medical data. In Proceedings of the 2016 IEEE International Conference of Online Analysis and Computing Science (ICOACS), Chongqing, China, 28–29 May 2016; pp. 225–228.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).