Glucobuddy: Detecting Diabetes Risk Using Machine Learning

Md Sultanul Arefin Afnan

doi:10.20944/preprints202505.2269.v1

Submitted: 28 May 2025

Posted: 28 May 2025


Abstract

Diabetes mellitus is a chronic disease affecting over 420 million people globally, contributing significantly to mortality, disability, and healthcare costs. Early detection and risk assessment are critical in preventing severe complications such as cardiovascular disease, kidney failure, and neuropathy. Traditional diagnostic approaches, including fasting glucose and HbA1c testing, require medical infrastructure and trained personnel, making them difficult to access in resource-limited areas. This thesis presents Glucobuddy, an intelligent system designed to predict diabetes risk levels using machine learning models and to enhance user interaction through an integrated AI chatbot. The system analyzes key health indicators including age, glucose levels, and body mass index (BMI) to classify individuals into low-risk or high-risk categories. Three machine learning algorithms—Logistic Regression, Random Forest, and Support Vector Machines (SVM)—are evaluated and compared using performance metrics such as accuracy, precision, recall, and F1-score. In addition to automated risk classification, Glucobuddy incorporates an AI-powered chatbot designed to communicate results, provide general diabetes education, answer common queries, and suggest preventive actions based on the user’s risk profile. This interactive approach aims to enhance user understanding and engagement. The proposed system offers a cost-effective, accessible, and scalable solution for early diabetes risk screening, with particular focus on underserved communities. It provides healthcare professionals and individuals with a practical tool for early intervention, contributing to improved health outcomes and reduced healthcare burdens.

Keywords: Diabetes Prediction; Machine Learning; Early Detection; Risk Classification

Subject: Computer Science and Mathematics - Artificial Intelligence and Machine Learning

Dalian Polytechnic University Graduation Project/Thesis, Major: Computer Science & Technology; 219j10@xy.dlpu.edu.cn

1. Introduction

1.1. Background

Diabetes mellitus is a chronic, progressive metabolic disorder characterized by high levels of blood glucose, which, if left unmanaged, can lead to severe and irreversible complications. The World Health Organization (WHO) estimates that over 420 million people worldwide currently live with diabetes, and this number is expected to rise to 578 million by 2030, making it one of the most significant global public health challenges [1]. Type 2 diabetes accounts for over 90% of all cases and is strongly associated with genetic predisposition and with lifestyle-related factors such as obesity, poor diet, and sedentary behavior. It is also a leading cause of morbidity and mortality, responsible for an estimated 1.5 million deaths globally in 2019 [2,3].

Early detection and intervention are critical to preventing the progression of diabetes and mitigating associated complications such as cardiovascular disease, kidney failure, retinopathy, neuropathy, and premature death. Clinical evidence strongly suggests that lifestyle interventions, including weight loss, dietary improvements, and increased physical activity, can delay or even prevent the onset of Type 2 diabetes in high-risk individuals. However, current diagnostic methods (fasting plasma glucose, oral glucose tolerance tests, and glycated hemoglobin (HbA1c) measurements) require clinical laboratory facilities and trained healthcare professionals. The cost and complexity of these procedures create barriers to accessibility, especially in low-resource and rural communities where healthcare infrastructure is often inadequate.

Technological advancements over the past decade have introduced innovative ways to tackle this global epidemic. Machine learning (ML), a branch of artificial intelligence, has shown tremendous potential in healthcare applications by processing large datasets and identifying hidden patterns that are difficult to discern through traditional statistical methods. In recent years, numerous studies have explored the application of machine learning algorithms for the prediction and diagnosis of diabetes. For example, researchers have successfully utilized models such as Decision Trees, Support Vector Machines (SVM), Random Forest, and Logistic Regression on medical datasets like the PIMA Indian Diabetes dataset to achieve high prediction accuracies. A study by Sisodia and Sisodia (2018) demonstrated the effectiveness of Decision Tree and SVM models in diabetes prediction, achieving accuracy rates of over 78%. Similarly, the work of Kavakiotis et al. (2017) provided a comprehensive review of data mining and machine learning techniques in diabetes research, further highlighting the promising capabilities of these methods [5,6].

Despite these advances, many studies focus primarily on binary classification (diabetic vs. non-diabetic), with fewer efforts directed at developing models for early risk prediction. Risk prediction, that is, estimating an individual’s probability of developing diabetes in the near future, is an emerging research area that could empower individuals to take preventive action before clinical diagnosis becomes necessary. A system designed for risk prediction could serve as a cost-effective, non-invasive, and scalable screening tool that complements existing diagnostic methods, particularly in resource-constrained settings [7].

This research proposes Glucobuddy, a novel system designed to harness the power of machine learning to predict diabetes risk levels based on easily obtainable health indicators, specifically age, blood glucose levels, and body mass index (BMI). The simplicity of these input variables makes the system widely applicable and highly practical, especially for primary care providers and community health workers in low-resource environments. The core of Glucobuddy lies in its integration of predictive ML algorithms—Logistic Regression, Random Forest, and SVM—to classify individuals into low-risk and high-risk categories, thereby facilitating early interventions.

Furthermore, Glucobuddy distinguishes itself by incorporating an AI-powered chatbot to enhance user engagement and accessibility. The chatbot serves as a digital health assistant, capable of explaining risk assessments, providing diabetes education, answering user queries, and recommending lifestyle modifications based on individual risk profiles. This conversational interface is intended to bridge the gap between complex algorithmic predictions and user understanding, empowering individuals to make informed decisions about their health [8].

The fusion of machine learning and conversational AI in Glucobuddy presents an innovative, user-friendly solution aimed at democratizing diabetes risk assessment. By addressing barriers such as cost, access, and patient engagement, this system aspires to contribute meaningfully to global diabetes prevention efforts. Its design specifically targets underserved populations, offering a scalable tool that can be deployed in both clinical and community settings to improve early detection and support proactive health management.

1.2. Overview of the Project

Diabetes remains one of the most urgent health challenges of modern times, with incidence and mortality rates continuing to rise globally. Despite substantial advancements in medical diagnostics and treatment, many individuals in low-resource settings still lack access to timely screening and early intervention. The Glucobuddy project aims to bridge this critical healthcare gap by leveraging machine learning and artificial intelligence technologies to create an affordable, non-invasive, and scalable solution for early diabetes risk prediction and personalized health guidance.

The core objective of the Glucobuddy project is to develop an intelligent system capable of classifying individuals as either low-risk or high-risk for developing diabetes. Unlike traditional diagnosis, which relies on laboratory tests and medical supervision, Glucobuddy is designed to provide predictive risk assessment using easily obtainable personal health indicators. The three input features used for prediction are age, blood glucose level, and body mass index (BMI). These parameters were chosen due to their strong association with diabetes risk and their availability in standard health screening environments [6,9,10].

The Glucobuddy system is composed of two main components: the predictive model and the conversational AI chatbot. The predictive model leverages supervised machine learning algorithms trained on historical health datasets. Three well-established models—Logistic Regression, Random Forest, and Support Vector Machines (SVM)—have been selected due to their proven effectiveness and interpretability in healthcare classification tasks. Each model will be trained and evaluated using performance metrics such as accuracy, precision, recall, and F1-score to determine the optimal algorithm for risk classification. The model with the highest balanced performance will be integrated into the final system.

In parallel, the AI chatbot component serves as the interactive layer of Glucobuddy. While the machine learning model provides a risk score, the chatbot contextualizes and communicates this information to the user. Using natural language understanding (NLU), the chatbot can explain the predicted risk, offer personalized advice on lifestyle changes, answer common questions about diabetes, and suggest follow-up actions. The chatbot is envisioned to operate 24/7, providing immediate access to health information, especially in areas where professional medical consultation may not be available. This human-computer interaction element is a significant enhancement over existing prediction systems, promoting user engagement, understanding, and empowerment.

The data pipeline for Glucobuddy includes data collection, preprocessing, model training, evaluation, and deployment. Data preprocessing will involve handling missing values, outliers, normalization, and addressing class imbalance using techniques such as Synthetic Minority Over-sampling Technique (SMOTE). Once the data is cleaned and transformed, the machine learning models will be trained and cross-validated to prevent overfitting and to ensure generalizability.

Following model deployment, the system will undergo rigorous testing to validate its real-world usability and robustness. The AI chatbot will be integrated with the prediction model through API communication, allowing real-time risk prediction and conversational feedback. User acceptance testing (UAT) and simulated patient cases will be conducted to assess both the predictive accuracy and the quality of the chatbot interactions.

The Glucobuddy project also emphasizes accessibility and scalability. The system is intended to be lightweight, enabling it to run on basic mobile and web platforms. This design decision makes it suitable for use in community clinics, rural health centers, and even at-home personal health monitoring. The ultimate vision for Glucobuddy is to serve as a preventive healthcare companion that complements existing clinical practices, reduces the diagnostic burden on healthcare systems, and empowers individuals to take proactive control of their health.

By combining the analytical power of machine learning with the conversational capabilities of AI, Glucobuddy represents an innovative approach to early diabetes risk screening. This research expects to contribute to the emerging field of AI-assisted preventive healthcare by demonstrating the feasibility and impact of integrated predictive and interactive systems. Upon successful implementation, Glucobuddy could become a model framework for similar disease risk assessment applications beyond diabetes, laying the groundwork for the next generation of accessible digital health solutions.

1.3. Current System

Diabetes diagnosis and risk assessment are traditionally conducted within clinical settings, relying on established medical protocols and laboratory-based testing. The most widely used diagnostic methods include fasting plasma glucose tests, oral glucose tolerance tests (OGTT), and glycated hemoglobin (HbA1c) tests. While these methods are considered the gold standard by organizations such as the World Health Organization (WHO) and the American Diabetes Association (ADA), they share several inherent limitations that restrict their accessibility and effectiveness, particularly in underserved regions. [1,2]

Firstly, these traditional testing methods require specialized equipment, controlled environments, and trained healthcare professionals to ensure accuracy and safety. This creates a barrier for early screening in rural and low-income areas where medical infrastructure and resources are limited. Patients in such areas often face challenges including long travel distances to health facilities, high costs of diagnostic services, and long wait times for appointments and test results. Consequently, many individuals remain undiagnosed until the disease has progressed to more advanced stages, when complications are harder and more expensive to manage.

In addition to logistical barriers, traditional diagnostic methods are invasive and time-consuming. Blood samples must be collected, processed, and analyzed in a laboratory. For example, the OGTT requires patients to fast overnight, consume a glucose solution, and undergo multiple blood draws over a two-hour period. These factors contribute to patient discomfort and poor adherence to regular screening schedules, further exacerbating the risk of delayed diagnosis.

To address some of these limitations, a number of mobile applications and health monitoring devices have emerged in recent years to provide diabetes management and tracking solutions. However, most of these technologies focus on post-diagnosis monitoring rather than early detection or risk prediction. Common applications offer features such as blood glucose logging, medication reminders, and diet tracking, but they do not typically integrate predictive algorithms capable of assessing an individual's future risk of developing diabetes based on health indicators. [11,12]

Moreover, the few predictive tools that do exist are often designed for research or academic purposes and lack the user-centered design and conversational interface needed for real-world deployment in community or primary care settings. They are usually static and offer limited interactivity, providing risk scores without adequate explanation, context, or actionable recommendations for users.

As a result, there remains a significant unmet need for a comprehensive system that not only predicts diabetes risk based on accessible health data but also actively engages users in understanding and managing their health status. There is currently no widely adopted system that combines advanced machine learning models with an AI-driven conversational interface to guide users through the risk assessment process, explain their results, and recommend personalized preventive measures.

This gap presents the opportunity for Glucobuddy to offer a transformative solution. By integrating machine learning-based risk prediction with an AI chatbot capable of delivering real-time explanations and guidance, Glucobuddy addresses the shortcomings of both traditional diagnostic methods and current mobile health applications. It aims to empower individuals to take proactive steps toward diabetes prevention, especially in areas where access to professional medical consultation is limited.

1.4. Proposed System

The proposed system, Glucobuddy, aims to address the limitations of current diabetes diagnostic and monitoring approaches by providing an intelligent, accessible, and user-friendly early risk assessment tool. Glucobuddy combines machine learning-based risk prediction with an integrated AI-powered chatbot to deliver a complete and engaging user experience. It is designed to be easily accessible through mobile devices or web platforms, making it suitable for deployment in both clinical and non-clinical environments, including remote and resource-limited communities. [13,14]

The system operates in two main phases: data-driven risk prediction and conversational feedback.

In the first phase, users are prompted to enter basic health information including age, body mass index (BMI), and blood glucose levels. These variables were selected based on their proven correlation with Type 2 diabetes risk and their widespread availability through basic health screenings. After data input, the information is preprocessed and analyzed using trained machine learning algorithms. Glucobuddy leverages three well-established models—Logistic Regression, Random Forest, and Support Vector Machines (SVM)—to classify individuals into two categories: low-risk and high-risk for developing diabetes.

The selection of multiple algorithms allows for performance comparison and ensures that the most accurate and reliable model can be integrated into the final system. The model training phase includes data cleaning, handling of missing values, normalization, and mitigation of class imbalance using Synthetic Minority Over-sampling Technique (SMOTE). These techniques improve the model’s robustness and predictive capability across diverse patient data.

Once the machine learning model generates a prediction, the second phase of the system is activated. The AI-powered chatbot takes over as the user interface, delivering the prediction results in an understandable and friendly conversational format. Rather than presenting raw numerical risk scores, the chatbot explains the result, outlines possible health implications, and provides recommendations for lifestyle changes or encourages the user to consult a healthcare professional for further evaluation.
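The hand-off from prediction to conversation described above can be sketched as a small function that turns a model probability into a risk label and a plain-language message. This is only an illustration: the function name `explain_risk`, the 0.5 decision threshold, and the message wording are assumptions and are not specified in this work.

```python
def explain_risk(probability, threshold=0.5):
    """Map a predicted probability to a (label, message) pair.

    The threshold and the message text are hypothetical placeholders.
    """
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    if probability >= threshold:
        return (
            "high-risk",
            "Your inputs suggest a higher risk of diabetes. "
            "Consider consulting a healthcare professional for further evaluation.",
        )
    return (
        "low-risk",
        "Your inputs suggest a lower risk of diabetes. "
        "Keeping a balanced diet and regular activity helps maintain this.",
    )


label, message = explain_risk(0.82)
print(label)
print(message)
```

Keeping the mapping in one function means the chatbot layer only needs the label and message, not the raw score.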

The chatbot is designed using natural language understanding (NLU) technology, enabling it to understand user queries and respond to them contextually. Users can ask follow-up questions about diabetes, healthy diets, exercise routines, or specific explanations about their risk level. The chatbot serves as a virtual health assistant available 24/7, providing reliable information and support at any time.

One of the key strengths of Glucobuddy is its focus on accessibility and scalability. The system is intentionally lightweight and designed to function efficiently on basic smartphones and computers, minimizing technological barriers to adoption. Its intuitive design ensures that users with limited technical literacy can easily navigate the interface and receive valuable health guidance.

Another important feature is privacy and data protection. Glucobuddy is designed to operate with anonymized or consent-based data to ensure compliance with healthcare data privacy regulations. No personal identifiers are stored or transmitted without user approval.

In summary, the proposed Glucobuddy system introduces an innovative combination of machine learning and conversational AI to create a user-centered solution for early diabetes risk assessment. By simplifying the prediction process and improving user engagement through interactive dialogue, Glucobuddy has the potential to complement traditional healthcare systems, reduce the diagnostic burden on clinics, and empower individuals to take a proactive role in managing their health.

This dual system approach not only predicts potential diabetes risk but also closes the communication gap by providing understandable, actionable insights directly to users. It represents a step forward in making predictive healthcare technologies accessible, interactive, and practical for widespread use, particularly in settings where healthcare access is limited.

1.5. Scope of the Project

The scope of this project focuses on the design, development, and evaluation of an intelligent system for early diabetes risk prediction using machine learning techniques, integrated with an AI-powered chatbot for user interaction and feedback. The project is structured around creating a proof-of-concept application that demonstrates the feasibility and potential of combining predictive analytics with conversational AI to enhance preventive healthcare services.

The primary objective is to develop a system that accepts basic health indicators—age, body mass index (BMI), and blood glucose levels—as inputs to predict an individual’s likelihood of developing Type 2 diabetes. The machine learning component will be limited to three widely recognized classification algorithms: Logistic Regression, Random Forest, and Support Vector Machines (SVM). These models will be trained and tested using existing publicly available health datasets. The project will evaluate each model’s performance based on standard classification metrics such as accuracy, precision, recall, and F1-score to determine the most effective predictive model.

The AI chatbot component is designed to enhance the user experience by communicating the risk assessment results and providing general lifestyle recommendations. The chatbot will use natural language processing (NLP) to understand and respond to user queries. However, the chatbot’s advice is strictly informational and not intended to replace professional medical consultation or diagnosis.

This project is intended as a research prototype and does not aim to build a fully operational commercial product. The scope does not include integration with wearable devices, continuous glucose monitors, or electronic health record (EHR) systems. Additionally, while the system will prioritize user privacy and data security, this project will not conduct formal compliance certifications such as HIPAA or GDPR audits.

The outcome of this project is expected to be a functional prototype system capable of demonstrating the effectiveness of combining machine learning and conversational AI for early diabetes risk screening. The system is designed to be scalable for future extensions but will remain within the limits of a research and educational study during this development phase.

2. Methodology (Analysis and Design)

2.1. Data Collection and Preprocessing

The success of any machine learning project heavily depends on the quality and suitability of the data used. For this research, the dataset used to train and evaluate the machine learning models is derived from a publicly available medical dataset: the PIMA Indians Diabetes Database, provided by the National Institute of Diabetes and Digestive and Kidney Diseases (NIDDK). This dataset has been widely used in diabetes-related research and serves as a standard benchmark for classification and predictive modeling tasks. [10]

Figure 1. Blood pressure distribution grouped by diabetes outcome.

The PIMA dataset contains records of female patients of Pima Indian heritage, aged 21 years and older, and includes 768 instances with 8 medical and personal attributes. For the purpose of this study, only three key features were selected based on their strong correlation with diabetes risk and ease of measurement in typical clinical or community settings:

Age (years) – An important demographic factor associated with increased risk of Type 2 diabetes.
Body Mass Index (BMI) – A widely used measure of body fat based on height and weight.
Blood Glucose Level (mg/dL) – A critical biomarker directly linked to diabetes risk.

The outcome variable in the dataset indicates whether the patient was diagnosed with diabetes (1) or not (0). For this research, the outcome variable is adapted to represent two risk levels: low risk (0) and high risk (1).

Figure 2. BMI distribution by diabetes outcome.

Figure 3. Age distribution by diabetes outcome.

Figure 4. Glucose distribution by diabetes outcome.

2.1.1. Data Cleaning

Real-world medical datasets are often incomplete, noisy, and inconsistent. Several preprocessing steps were applied to ensure the quality of the data:

Handling Missing Values: The dataset contains instances where critical measurements, such as BMI and glucose levels, were recorded as zero, which is not physiologically possible. These zero values were treated as missing and replaced using imputation techniques. The median value of the respective feature was used to fill missing values to minimize bias.
Outlier Detection and Treatment: Statistical methods such as interquartile range (IQR) analysis were applied to detect and manage outliers that could skew model performance. Extreme values were capped to reasonable physiological ranges based on clinical guidelines.
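The two cleaning steps above can be sketched with the Python standard library. The function names and sample values are illustrative, not taken from the dataset. The capping rule shown here uses the common 1.5 × IQR fences, which is an assumption; the study itself caps values to physiological ranges drawn from clinical guidelines.

```python
from statistics import median, quantiles


def impute_zeros_with_median(values):
    """Treat zeros as missing and replace them with the median of the non-zero values."""
    observed = [v for v in values if v != 0]
    fill = median(observed)
    return [fill if v == 0 else v for v in values]


def cap_outliers_iqr(values, k=1.5):
    """Clip every value to the interval [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [min(max(v, low), high) for v in values]


glucose = [148, 85, 0, 89, 137, 116, 0, 197]  # illustrative values only
cleaned = impute_zeros_with_median(glucose)
capped = cap_outliers_iqr([1, 2, 3, 4, 5, 100])  # 100 is pulled back to the upper fence
print(cleaned)
print(capped)
```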

Figure 5. Heatmap distribution for diabetes outcome.

2.1.2. Data Normalization

Since the dataset includes numerical features with varying scales (e.g., glucose levels range from 0 to 200+, whereas BMI typically ranges from 15 to 50), data normalization was essential. The min-max scaling technique was applied to scale all feature values to a range between 0 and 1. This step ensures that no single feature disproportionately influences the learning algorithm due to its scale. [16]
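As a minimal sketch of min-max scaling, the helper below learns the minimum and maximum from one list of values and returns a function that maps a value x to (x - min) / (max - min). The name `fit_min_max` and the sample BMI values are assumptions for illustration. Learning the bounds from the training portion and reusing them on new inputs is a common convention, not something stated in this study.

```python
def fit_min_max(train_values):
    """Return a function that scales values using the min and max of train_values."""
    lo, hi = min(train_values), max(train_values)
    span = hi - lo

    def transform(value):
        # A constant feature carries no information; map it to 0.0.
        return 0.0 if span == 0 else (value - lo) / span

    return transform


bmi = [18.5, 24.9, 33.6, 45.0]  # illustrative values only
scale_bmi = fit_min_max(bmi)
scaled = [scale_bmi(v) for v in bmi]
print(scaled)
```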

2.1.3. Data Balancing

The dataset is imbalanced: non-diabetic cases (500 of 768) outnumber diabetic cases (268 of 768). To mitigate the bias introduced by class imbalance, the Synthetic Minority Over-sampling Technique (SMOTE) was applied. SMOTE works by creating synthetic examples of the minority class based on existing instances, effectively balancing the dataset and improving model generalization [15].
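The core idea of SMOTE, interpolating between a minority sample and one of its nearest minority neighbours, can be sketched as follows. This is a simplified illustration rather than the full algorithm or the implementation used in this study; the function name, the choice of k, and the sample rows are assumptions.

```python
import random


def smote_like_sample(minority, k=2, rng=None):
    """Create one synthetic point on the segment between a minority sample
    and one of its k nearest minority neighbours (squared Euclidean distance)."""
    rng = rng or random.Random(0)
    base = rng.choice(minority)
    neighbours = sorted(
        (p for p in minority if p is not base),
        key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
    )[:k]
    other = rng.choice(neighbours)
    gap = rng.random()  # position along the segment, in [0, 1)
    return tuple(a + gap * (b - a) for a, b in zip(base, other))


# Illustrative scaled (age, BMI, glucose) rows for the minority class.
minority = [(0.20, 0.50, 0.30), (0.30, 0.60, 0.40), (0.25, 0.55, 0.35), (0.80, 0.90, 0.70)]
synthetic = smote_like_sample(minority)
print(synthetic)
```

Because each synthetic point lies between two real minority samples, it stays inside the region the minority class already occupies.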

2.1.4. Data Splitting

The final dataset was divided into two subsets:

Training Set (80%) – Used to train the machine learning models.
Test Set (20%) – Used to evaluate model performance on unseen data.

Stratified sampling was applied to ensure that both the training and test sets maintained the same class distribution as the original dataset.
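A stratified 80/20 split can be sketched by shuffling the indices of each class separately and taking the same fraction from each. The function name, the seed, and the illustrative class counts below are assumptions, not details of this study.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.2, seed=42):
    """Return (train_indices, test_indices) with per-class proportions preserved."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


labels = [0] * 50 + [1] * 25  # illustrative class counts only
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))
```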

2.1.5. Summary

The thorough preprocessing pipeline ensured that the dataset was clean, balanced, and appropriately scaled, thus enabling the machine learning models to perform at their best potential. These steps were critical in enhancing the predictive accuracy, robustness, and reliability of the Glucobuddy system.

2.2. Machine Learning Models

The Glucobuddy system leverages the power of supervised machine learning algorithms to predict an individual’s risk of developing Type 2 diabetes. Machine learning provides a valuable alternative to traditional statistical analysis by identifying complex, non-linear patterns in the data that may not be immediately visible through conventional methods. For this study, three popular and widely used classification algorithms were selected: Logistic Regression, Random Forest, and Support Vector Machines (SVM). Each algorithm was chosen based on its proven effectiveness, interpretability, and relevance in healthcare data analysis [6,17].

Figure 6. Machine Learning Predictions visualization.

2.2.1. Logistic Regression

Logistic Regression is one of the simplest and most commonly used binary classification algorithms. It models the probability that a given input belongs to a particular class—in this case, either low-risk or high-risk of diabetes. Logistic Regression uses a logistic function to map predicted values to probabilities between 0 and 1. The model works by fitting a linear combination of the input features (age, BMI, glucose level) to predict the log-odds of the dependent variable (risk class).
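The mapping from features to probability described above can be written in a few lines: the log-odds are a weighted sum of the inputs plus a bias, and the logistic function converts them to a value between 0 and 1. The weights and bias shown are hypothetical placeholders, not fitted coefficients from this study.

```python
import math


def predict_high_risk_probability(features, weights, bias):
    """Logistic regression prediction: sigmoid of a linear combination of the features."""
    log_odds = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-log_odds))


# Hypothetical coefficients for scaled (age, BMI, glucose) inputs.
weights = (0.8, 1.1, 2.4)
bias = -2.0

p_low = predict_high_risk_probability((0.3, 0.4, 0.2), weights, bias)
p_high = predict_high_risk_probability((0.3, 0.4, 0.9), weights, bias)
print(p_low, p_high)
```

With a positive glucose weight, raising the glucose input while holding the other features fixed raises the predicted probability, which is the kind of per-feature reading that makes the model interpretable.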

The main advantage of Logistic Regression lies in its ease of implementation, low computational cost, and interpretability. In a medical setting, where understanding the contribution of individual risk factors is crucial, Logistic Regression provides clear insights into the importance of each feature.

2.2.2. Random Forest

Random Forest is a powerful ensemble learning method based on decision trees. It constructs multiple decision trees during training and outputs the mode of their predictions for classification tasks. Random Forest mitigates the risk of overfitting, a common issue with individual decision trees, by introducing randomness in the feature selection and data sampling process.
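The aggregation step, taking the mode of the individual tree predictions, can be illustrated on its own. The sketch below assumes the per-tree predictions are already available as a list; the function name and the votes are illustrative.

```python
from collections import Counter


def majority_vote(tree_predictions):
    """Return the most common class label among the individual tree predictions.

    On a tie, Counter.most_common keeps the label that was seen first.
    """
    return Counter(tree_predictions).most_common(1)[0][0]


votes = [1, 0, 1, 1, 0]  # illustrative predictions from five trees
print(majority_vote(votes))
```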

Random Forest is particularly useful for handling complex, non-linear relationships between features and is robust to noise and outliers. Additionally, Random Forest provides measures of feature importance, helping to understand which variables contribute the most to diabetes risk prediction. Its strong performance across many biomedical datasets makes it an ideal candidate for this project.

2.2.3. Support Vector Machines (SVM)

Support Vector Machines (SVM) are another powerful supervised learning technique suitable for binary classification tasks. SVM works by finding the optimal hyperplane that separates data points of different classes with the maximum possible margin. In cases where data are not linearly separable, SVM uses kernel functions (such as polynomial or radial basis function kernels) to map the input data into a higher-dimensional space where a linear separator can be found.
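To make the kernel idea concrete, the radial basis function kernel computes a similarity exp(-gamma * ||x - z||^2) between two feature vectors, which equals 1 for identical points and decays toward 0 as they move apart. The gamma value and sample points below are illustrative assumptions.

```python
import math


def rbf_kernel(x, z, gamma=1.0):
    """Radial basis function kernel: exp(-gamma * squared Euclidean distance)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * squared_distance)


a = (0.30, 0.40, 0.20)  # illustrative scaled (age, BMI, glucose) vectors
b = (0.35, 0.45, 0.90)
print(rbf_kernel(a, a), rbf_kernel(a, b))
```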

SVM is highly effective in handling high-dimensional data and performs well when the number of features exceeds the number of samples. Its ability to model complex decision boundaries makes it a strong competitor for diabetes risk prediction. However, SVM is computationally intensive and requires careful tuning of hyperparameters, such as the regularization parameter (C) and kernel type, to achieve optimal results.

2.2.4. Model Evaluation Approach

To assess the performance of these machine learning models, standard classification metrics were used, including accuracy, precision, recall, and F1-score. Accuracy provides a measure of overall correctness, precision evaluates how many predicted positive cases were actually positive, recall measures how many actual positive cases were correctly predicted, and F1-score balances the trade-off between precision and recall. Cross-validation techniques were applied to reduce overfitting and ensure that the models generalize well to unseen data.
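The four metrics follow directly from the counts of true positives, false positives, true negatives, and false negatives. The sketch below treats label 1 as high-risk and uses illustrative labels rather than results from this study.

```python
def classification_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall, and F1 with 1 as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


y_true = [1, 1, 1, 0, 0, 0, 0, 1]  # illustrative labels only
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```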

2.2.5. Summary

By experimenting with these three algorithms, Glucobuddy aims to identify the most suitable model for accurately classifying individuals into low-risk and high-risk categories. Each algorithm offers unique strengths, and their comparative analysis provides valuable insights into the applicability of machine learning techniques for preventive healthcare solutions.

2.3. Model Design and Evaluation

The design of the Glucobuddy system follows a structured and methodical pipeline to ensure high-quality, accurate, and generalizable predictions of diabetes risk. The model design process involved data preprocessing, model selection, training, validation, and evaluation to identify the most effective predictive approach.

2.3.1. Model Design

The model pipeline begins with the input of user-provided data: age, blood glucose level, and body mass index (BMI). The data is first subjected to preprocessing steps as previously described, including handling missing values, outlier detection, normalization, and class balancing using the Synthetic Minority Over-sampling Technique (SMOTE). These steps are essential to reduce bias and prevent poor model performance caused by data quality issues.

Following preprocessing, the dataset was randomly split into two subsets: 80% for training and 20% for testing. Stratified sampling was used to maintain the proportion of high-risk and low-risk classes across both subsets.

The training data was then used to fit the machine learning models. Three algorithms—Logistic Regression, Random Forest, and Support Vector Machines (SVM)—were chosen based on their proven effectiveness in healthcare classification tasks. Each model was independently trained using the same dataset to ensure a fair comparison of performance.

2.3.2. Model Training and Hyperparameter Tuning

To optimize model performance, hyperparameter tuning was conducted for each algorithm:

For Logistic Regression, the regularization parameter (C, the inverse of the regularization strength in scikit-learn) was adjusted to balance the trade-off between bias and variance.
For Random Forest, key parameters such as the number of trees (n_estimators), maximum tree depth, and minimum samples per leaf were fine-tuned to improve performance and reduce overfitting.
For SVM, both the kernel type (linear or radial basis function) and the regularization parameter were carefully tuned to maximize classification accuracy.

Grid search combined with k-fold cross-validation (k = 5) was applied for hyperparameter optimization. Each candidate configuration is scored on five held-out folds of the training data, so the selected hyperparameters are less likely to be overfitted to a single validation split, and the averaged score gives a more reliable estimate of generalization capability.
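The tuning procedure described above can be sketched with scikit-learn's GridSearchCV. The candidate values in each grid, the F1 scoring choice, and the synthetic training data are assumptions made for illustration; the source does not list the exact grids that were searched.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in for the training split (assumption).
X_train, y_train = make_classification(
    n_samples=200, n_features=3, n_informative=3, n_redundant=0, random_state=0
)

# Hypothetical search spaces covering the parameters named in the text.
search_spaces = {
    "logistic_regression": (
        LogisticRegression(max_iter=1000),
        {"C": [0.01, 0.1, 1, 10]},
    ),
    "random_forest": (
        RandomForestClassifier(random_state=0),
        {"n_estimators": [50, 100], "max_depth": [3, None], "min_samples_leaf": [1, 5]},
    ),
    "svm": (
        SVC(),
        {"kernel": ["linear", "rbf"], "C": [0.1, 1, 10]},
    ),
}

best = {}
for name, (estimator, grid) in search_spaces.items():
    # cv=5 scores every parameter combination with 5-fold cross-validation.
    search = GridSearchCV(estimator, grid, cv=5, scoring="f1")
    search.fit(X_train, y_train)
    best[name] = (search.best_params_, search.best_score_)

for name, (params, score) in best.items():
    print(name, params, round(score, 3))
```

Only the training split is passed to the search, so the test set stays untouched until the final evaluation.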

2.3.3. Model Evaluation

Once trained, the models were evaluated on the independent test set using four primary metrics:

Accuracy: Measures the proportion of correct predictions out of total predictions.
Precision: Indicates how many of the predicted high-risk cases were actually high-risk.
Recall (Sensitivity): Measures the ability of the model to correctly identify all actual high-risk individuals.
F1-Score: Provides a balanced measure that combines both precision and recall. [18]
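As a worked example, the four metrics can be computed directly from confusion-matrix counts, treating high-risk as the positive class. The function name and the counts below are hypothetical and are not results from this study.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    # Guard against division by zero when a class is never predicted or present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical counts: 40 true positives, 10 false positives,
# 20 false negatives, 30 true negatives.
metrics = classification_metrics(tp=40, fp=10, fn=20, tn=30)
print(metrics)
```

With these counts, accuracy is 0.70, precision is 0.80, recall is about 0.667, and F1 is about 0.727, which shows how a model can look acceptable on accuracy while still missing a third of the high-risk cases.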

Additionally, Receiver Operating Characteristic (ROC) curves were generated for each model, and the area under the curve (AUC) was calculated. The ROC-AUC score offers an overall measure of model performance across all classification thresholds, with a value closer to 1.0 indicating superior discriminatory power.
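One way to see what the AUC measures: it equals the probability that a randomly chosen high-risk case receives a higher score than a randomly chosen low-risk case, with ties counted as one half. The sketch below computes it by pairwise comparison; the function name, labels, and scores are hypothetical.

```python
def roc_auc(y_true, scores):
    """ROC-AUC via pairwise ranking of positive against negative scores."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("Both classes must be present to compute ROC-AUC.")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))


# Hypothetical labels (1 = high-risk) and predicted risk scores.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc(y_true, scores)
print(auc)  # 0.75: three of the four positive/negative pairs are ordered correctly
```

The pairwise form is quadratic in the number of samples, so it is meant for explanation only; a library routine such as scikit-learn's roc_auc_score would be the practical choice.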

2.3.4. Performance Comparison and Selection

The models were compared across all metrics, and the one with the best balance of accuracy, precision, recall, and F1-score was chosen for integration with the AI chatbot component of Glucobuddy. While Random Forest and SVM were expected to perform better in capturing complex patterns, Logistic Regression was also considered valuable due to its interpretability and ease of deployment.
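The text does not define "best balance" numerically. One possible reading, sketched below, ranks the three candidate models by the unweighted mean of the four metrics. The averaging rule is an assumption; the percentages are taken from the results table reported in this paper.

```python
# Metric values (%) for the three candidate models, from the paper's results table.
results = {
    "Logistic Regression": {"accuracy": 78.6, "precision": 79.3, "recall": 76.2, "f1": 77.7},
    "Random Forest": {"accuracy": 81.0, "precision": 80.6, "recall": 79.5, "f1": 80.0},
    "SVM (Linear)": {"accuracy": 79.4, "precision": 77.8, "recall": 78.1, "f1": 77.9},
}


def mean_score(metrics):
    """Unweighted mean of the four metrics (an illustrative selection rule)."""
    return sum(metrics.values()) / len(metrics)


ranking = sorted(results, key=lambda name: mean_score(results[name]), reverse=True)
selected = ranking[0]
print(ranking)
print(selected)
```

For a screening tool, a rule that weights recall more heavily could also be justified, since missing a high-risk individual is usually costlier than a false alarm.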

2.3.5. Summary

The structured design and rigorous evaluation methodology ensured that the final Glucobuddy model is both accurate and generalizable for predicting diabetes risk. The combination of model tuning, cross-validation, and thorough performance assessment provides a solid foundation for deploying Glucobuddy as a reliable early screening tool.

3. Implementation

3.1. Model Development

The model development phase of Glucobuddy establishes the core predictive capability by preparing the environment, organizing the codebase, and implementing the data pipeline through to model serialization. All experimentation and prototyping were conducted in the server/ directory of the repository.

3.1.1. Development Environment

Python Version: 3.8
Virtual Environment: Created via python -m venv venv
Installation:
Core Libraries:
○ Data handling: pandas (v1.x), numpy (v1.x)
○ Visualization: matplotlib, seaborn
○ Modeling: scikit-learn (v0.24+)
○ Persistence: pickle

Model	Accuracy	Precision	Recall	F1-Score	ROC-AUC
Logistic Regression	78.6%	79.3%	76.2%	77.7%	84.1%
Naive Bayes	76.2%	74.1%	77.4%	75.7%	82.3%
K-Nearest Neighbors	74.1%	72.5%	73.0%	72.7%	78.9%
Random Forest	81.0%	80.6%	79.5%	80.0%	85.4%
SVM (Linear)	79.4%	77.8%	78.1%	77.9%	83.6%

Feature	Description
Pregnancies	Number of times the patient has been pregnant
Glucose	Plasma glucose concentration (mg/dL)
Blood Pressure	Diastolic blood pressure (mm Hg)
Skin Thickness	Triceps skin fold thickness (mm)
Insulin	2-hour serum insulin (μU/mL)
BMI	Body mass index (kg/m²)
DPF	Diabetes Pedigree Function (genetic risk indicator)
Age	Age of the patient (years)
Outcome	0 = non-diabetic, 1 = diabetic (target label)
