Submitted: 16 May 2025
Posted: 19 May 2025
Abstract
This study compares three classical machine learning models (Decision Tree, Random Forest, and Support Vector Machine) with a feed-forward neural network for predicting diabetes from demographic and clinical attributes. After checking for missing values, label-encoding the categorical features, and standardizing the numeric features, each model was trained on an 80/20 train/test split and evaluated using accuracy, precision, recall, and F1-score. The neural network achieved the best overall results (accuracy 0.95, F1-score 0.94), followed by Random Forest (accuracy 0.94), suggesting that deep learning best captures the nonlinear structure of the data while Random Forest offers a strong, computationally cheaper alternative.
Keywords:
diabetes prediction; machine learning; deep learning; decision tree; random forest; support vector machine; neural network
1. Introduction
2. Dataset and Preprocessing
2.1. Dataset Overview
Features:
- Demographic Attributes:
  - age: Continuous variable representing the patient's age in years.
  - gender: Categorical variable indicating biological sex (Male, Female, Other).
- Clinical Indicators:
  - hypertension: Binary indicator (0 = No, 1 = Yes) showing whether the individual has hypertension.
  - heart_disease: Binary indicator (0 = No, 1 = Yes) showing presence of heart disease.
  - smoking_history: Categorical variable indicating smoking behavior (e.g., never, former, current, not current, No Info).
  - bmi: Body Mass Index (continuous).
  - HbA1c_level: Glycated hemoglobin level, a key marker in diabetes screening (continuous).
  - blood_glucose_level: Blood glucose measurement in mg/dL (continuous).

Target Variable:
- diabetes: Binary target variable (0 = Not diabetic, 1 = Diabetic).
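For illustration, a minimal sketch of loading and inspecting the dataset with pandas; the filename is an assumption, as the original does not give one:

import pandas as pd

# Load the dataset (filename assumed; adjust to the actual path)
df = pd.read_csv("diabetes_prediction_dataset.csv")

print(df.shape)                        # number of records and columns
print(df.dtypes)                       # continuous vs. categorical columns
print(df["diabetes"].value_counts())   # class balance of the target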
2.2. Data Cleaning and Encoding
We inspected the dataset for null or missing values using df.isnull().sum(). No missing values were found in any feature, so we proceeded without imputation.

Encoding Categorical Features
Two features required transformation:
- gender: converted to numerical form using label encoding (Female = 0, Male = 1, Other = 2).
- smoking_history: also label-encoded, mapping its categories to integer values.

Feature Scaling
To ensure that all numeric features contributed equally during model training, we applied StandardScaler from scikit-learn. Features such as age, bmi, HbA1c_level, and blood_glucose_level were standardized to zero mean and unit standard deviation. A sketch of these steps appears below.
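This sketch assumes the DataFrame df from the loading snippet above; note that it standardizes the full feature matrix, so the result matches the X_scaled used in the splitting code of Section 3.2:

from sklearn.preprocessing import LabelEncoder, StandardScaler

# Label-encode the two categorical features; LabelEncoder assigns
# integers in alphabetical order (e.g., Female = 0, Male = 1, Other = 2)
for col in ["gender", "smoking_history"]:
    df[col] = LabelEncoder().fit_transform(df[col])

# Separate features and target, then standardize the feature matrix
X = df.drop(columns=["diabetes"])
y = df["diabetes"]
X_scaled = StandardScaler().fit_transform(X)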

3. Model Development
3.1. Machine Learning Models
- Decision Tree Classifier: A tree-based model that partitions data into subsets based on feature values, using Gini impurity as the splitting criterion.
- Random Forest Classifier: An ensemble method that builds multiple decision trees and averages their predictions to reduce overfitting and improve generalization.
- Support Vector Machine (SVM): A model that seeks to find the optimal hyperplane that maximizes the margin between two classes in the feature space.
The three models were instantiated as follows:

from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

# Fixed random seeds make the tree-based results reproducible
models = {
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(random_state=42),
    "SVM": SVC()
}
3.2. Data Splitting
from sklearn.model_selection import train_test_split

# Hold out 20% of the data for testing
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.2, random_state=42)
3.3. Evaluation Metrics
- Accuracy: The ratio of correctly predicted instances to the total number of instances.
- Precision: The ratio of true positive predictions to the total predicted positives.
- Recall: The ratio of true positive predictions to all actual positives.
- F1-score: The harmonic mean of precision and recall, providing a balanced measure.
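Written in terms of true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN), these metrics are:

\[
\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN},\qquad
\mathrm{Precision} = \frac{TP}{TP + FP},
\]
\[
\mathrm{Recall} = \frac{TP}{TP + FN},\qquad
F_1 = 2\cdot\frac{\mathrm{Precision}\cdot\mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}
\]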
Each model was trained on the training split and evaluated on the held-out test set:

from sklearn.metrics import classification_report

for name, model in models.items():
    model.fit(X_train, y_train)   # train before evaluating
    y_pred = model.predict(X_test)
    print(f"{name} Report:\n")
    print(classification_report(y_test, y_pred))

4. Model Performance and Confusion Matrices
[Figure: confusion matrices for the Decision Tree, Random Forest, SVM, and neural network classifiers.]
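The confusion matrix figures are not reproduced here; as a sketch, they can be recomputed from the fitted models and test split of Section 3:

from sklearn.metrics import confusion_matrix

for name, model in models.items():
    cm = confusion_matrix(y_test, model.predict(X_test))
    # Rows are actual classes, columns are predictions: [[TN, FP], [FN, TP]]
    print(f"{name}:\n{cm}\n")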



5. Deep Learning Model
We used a feed-forward neural network with the following architecture:
- First dense layer with 16 neurons (ReLU), receiving the scaled input features
- Hidden layer with 8 neurons (ReLU)
- Output layer with 1 neuron (sigmoid, for binary classification)
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

model_dl = Sequential([
    Dense(16, input_dim=X_train.shape[1], activation='relu'),
    Dense(8, activation='relu'),
    Dense(1, activation='sigmoid')
])
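The original does not list the compilation and training settings; the following is a minimal sketch using commonly assumed choices (Adam optimizer, binary cross-entropy loss, and an arbitrary epoch count and batch size):

model_dl.compile(optimizer='adam',            # assumed optimizer
                 loss='binary_crossentropy',  # standard loss for a sigmoid output
                 metrics=['accuracy'])

history = model_dl.fit(X_train, y_train,
                       validation_split=0.1,  # assumed; provides the validation curve
                       epochs=50,             # assumed epoch count
                       batch_size=32)         # assumed batch size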
5.1. Accuracy and Learning Curve
The neural network reached a test accuracy of 0.95; the full metric comparison with the other models is tabulated in Section 6.
[Figure: learning curve of the neural network's training and validation accuracy across epochs.]
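A learning curve like the one described can be plotted from the history object returned by the hypothetical training call in Section 5 (a sketch, assuming a validation split was used):

import matplotlib.pyplot as plt

# Plot training and validation accuracy per epoch
plt.plot(history.history['accuracy'], label='Training accuracy')
plt.plot(history.history['val_accuracy'], label='Validation accuracy')
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.legend()
plt.show()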

6. Results Comparison
| Model | Accuracy | Precision | Recall | F1-Score |
|----------------|----------|-----------|--------|----------|
| Decision Tree | 0.91 | 0.89 | 0.88 | 0.88 |
| SVM | 0.92 | 0.91 | 0.90 | 0.90 |
| Random Forest | 0.94 | 0.93 | 0.92 | 0.92 |
| Neural Network | 0.95 | 0.94 | 0.93 | 0.94 |
7. Insights and Discussion
- Random Forest Classifier exhibited consistently strong performance across all evaluation metrics (accuracy, precision, recall, and F1-score). Its robustness and low variance stem from the ensemble nature of the algorithm. Notably, the model performed well without extensive hyperparameter optimization, making it practical for fast deployment in real-world scenarios.
- Support Vector Machine (SVM) achieved high precision, wrongly flagging few non-diabetic individuals as diabetic (few false positives). However, due to its computational complexity, the SVM required longer training times, particularly on larger datasets, making it less suitable for real-time or large-scale deployment unless dimensionality reduction is applied beforehand.
- Neural Network (Deep Learning model) outperformed all machine learning models in both accuracy and F1-score. This suggests that the neural network was more effective in capturing complex nonlinear relationships and feature interactions in the dataset. Its superior generalization performance was evident in both the training and testing phases.
- Confusion Matrices for all models illustrated strong true positive rates (correct identification of diabetic cases), which is critical in medical diagnosis to minimize the risk of undetected conditions. Neural networks in particular demonstrated a balance between sensitivity (recall) and specificity.
- Learning Curve Analysis indicated that the neural network converged smoothly with increasing training data and did not exhibit signs of overfitting, suggesting a well-regularized model. Similarly, the Random Forest and Decision Tree models showed stable convergence but had slight variations in recall performance.
- These findings emphasize the trade-offs between interpretability, accuracy, and computational cost when selecting models for clinical prediction tasks. Traditional models such as Random Forest offer a balance of speed and performance, while neural networks require more computational resources but provide the most accurate predictions.
8. Conclusion
- Deep Learning models (specifically the neural network used) achieved the highest predictive accuracy and best balanced performance, making them suitable for high-stakes clinical environments where false negatives must be minimized.
- Random Forest proved to be a strong alternative to deep learning, offering competitive results with significantly lower training time and easier interpretability—an important factor for clinical practitioners.
- SVM, while precise, was computationally heavier and may benefit from feature selection or PCA to improve efficiency for large-scale datasets.
Future Work
- Validating the models on real-world clinical datasets collected from hospitals or electronic health record (EHR) systems to assess generalizability across populations and clinical settings.
- Exploring ensemble and hybrid models, such as stacking or blending, which may combine the strengths of multiple classifiers for improved robustness and performance.
- Implementing and deploying the trained models into practical tools such as mobile or web applications, which could serve as clinical decision support systems (CDSS) for early diabetes screening and intervention.
- Performing explainability analysis using tools like SHAP or LIME to provide interpretability and model transparency, especially important in medical applications.
References
- Pedregosa, F., et al. Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research 2011, 12, 2825-2830.
- Chollet, F. Deep Learning with Python; Manning Publications: Shelter Island, NY, 2018.
- Alrantisi, K.M.M. ML and DL (Kaggle Notebook). Available online: https://www.kaggle.com/code/khaledalrantisi1/ml-and-dl.
- TensorFlow Documentation. Available online: https://www.tensorflow.org/.
- UCI Machine Learning Repository. Available online: https://archive.ics.uci.edu/.
Disclaimer/Publisher's Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).