Predicting Patient No-Show Using Machine Learning Techniques in the Healthcare Sector

Today, among the most critical problems faced by hospitals and health centers are those caused by patients who do not attend their appointments. Among other consequences, this practice wastes resources and lengthens patients' waiting lists. To handle these problems, hospitals are actively trying to implement methods to reduce the idle time caused by patient no-shows. Many of the scheduling systems developed require predicting whether or not a patient will show up for an appointment. However, obtaining these estimates precisely remains a challenging problem. The goal of this work is to analyze how objective factors influence a patient's decision not to attend an appointment, to identify the main causes that contribute to that decision, and to predict whether or not the patient will attend the scheduled appointment. The resulting model is tested on a real dataset collected in a health center linked to the University of Vale do Itajaí (UNIVALI), which includes 25 features and about 5000 samples. The algorithm that produced the best results for the available dataset is the Random Forest classifier. It achieves the best recall rate (0.91), a measure of the classifier's ability to find all positive instances, and an area under the receiver operating characteristic curve of 0.969.


Introduction
The high rate of patients not showing up for examinations and medical appointments is a recurring problem in health care. "No-show" refers to a non-attending patient who neither uses nor cancels their medical appointment. This patient behavior is one of the main problems faced by health centres and has a significant impact on revenues, costs, and the use of resources. Previous studies have examined the economic consequences of patient absenteeism [1].
Each year, an average of about 30% of patients in the Brazilian state of Santa Catarina miss appointments, exams or scheduled surgeries. In 2018 alone, more than 52,000 patients did not attend scheduled procedures at health facilities in the state. This number represents 32.81% of the appointments given by regular centers. In absolute numbers, there were 52,710 patients who missed scheduled procedures in the first ten months of 2018, surpassing the figure for 2017, when 46,394 people failed to show up [2].
In other Brazilian states the situation is no different. In the city of Vitória, Espírito Santo, absences from medical consultations in health centers reached 30% of the total appointments performed in 2014-2015. According to the City of Vitória, the average cost of an appointment during these years was approximately 37 USD, which represented a loss of approximately 17.5 million USD for the government's coffers [3]. Given the impact that the non-use of health services causes in society, there is room for studies that present efficient solutions to this problem.
There is a general consensus in the literature that patient non-attendance is not random, and several studies have recognized the need to statistically analyze the factors that influence patients' no-shows. Reducing the impact of missed appointments and improving healthcare operations are some of the benefits such analysis may support. Recent studies show that there is a relationship between the number of missed appointments and patient behavior [1,4]. In addition, the work presented in [5] carried out a study in the field of hospital radiology and grouped the extracted data into three groups: patient, examination, and scheduling. The most informative aspects for predicting patients' non-attendance were those based on the type of exam and on scheduling attributes, such as the waiting time between the appointment and the performance of the exam.
Long waiting lines, lack of resources to meet demand, and financial loss are some of the consequences that a patient's absence from a scheduled appointment can cause. To reduce these adverse effects, health centers have implemented various strategies, including sanctions and reminders. Moreover, during the last decades, a sizable number of medical scheduling systems have been developed to achieve better appointment allocation based on predictive models. In this context, machine learning (ML) algorithms can serve as efficient tools to help understand patients' behavior concerning their presence at medical appointments.
A better understanding of the patient's absenteeism phenomenon allows the development of solutions to mitigate the occurrence of no-shows and contribute to the management and planning of health services. According to the context in which ML algorithms are applied, different observations are made, and new solutions can be modeled to mitigate absenteeism in medical appointments and exams. Thus, data collection and the application of ML algorithms in different contexts in healthcare should be explored, providing relevant information to assist in the decision process regarding the scheduling of appointments and exams.
There is currently one publicly available dataset for patient no-shows, which most researchers have been using. This open database is available on the Kaggle platform and refers to medical appointments scheduled in public hospitals in the city of Vitória, in the state of Espírito Santo, Brazil. The downside of this dataset is the lack of information about how the data was pre-processed, as well as its huge class imbalance. Therefore, in this work, we opted to collect our own dataset to capture the most valuable information about the problem and help us build more accurate classifiers.
To help understand the problem of absenteeism in medical appointments, we analyze the reasons behind a patient's decision not to attend their medical consultation. Starting from the initial dataset, we identify the main factors related to the patient's absence and propose a no-show classification model based on ML techniques. To compose the solution, algorithms that apply supervised ML techniques are used. As such, the present work contributes to the current state of the art by elucidating the reasons for the no-shows of healthcare patients. For that purpose, a new ML model is proposed and validated with real-life medical data, an essential tool for managing healthcare units.
The remainder of this paper is organized as follows. Section 1 presents the context of the research. Section 3 describes the methodology applied in this study, including the data analysis and the algorithms. Next, Section 4 discusses the results. Finally, Section 5 presents the final remarks.

Materials and Methods
Machine learning, as a subset of artificial intelligence, is a field of computer science that aims to develop algorithms that improve through experience and the use of incremental data [6]. In the past decades, increasing research interest has led more and more areas of science to find applications for ML algorithms. In the healthcare environment, some studies apply ML algorithms to identify preeminent factors and characteristics of patients associated with non-attendance at scheduled appointments [1,13,14]. Other studies use statistical predictive models capable of predicting whether a patient will be absent from an appointment based on their historical data [15]. Lee et al. [16] described the development process up to the implementation of a predictive model in a real clinical environment, as well as the insights acquired using the model.
Despite the relevance of patient absenteeism in medical appointments, only a few works use ML algorithms to identify and understand this problem. In this sense, ML was first applied in [17,18]. In these works, the publicly available dataset was used and guided the overall analysis. However, the lack of available attributes in that dataset did not allow the authors to build a solid predictive model. To the best of the authors' knowledge, a study of a new private dataset with different attributes, analyzing the data distinctly and validating new ML models for the no-show problem, is still missing from the literature.
This work presents a statistical analysis to enumerate the potential reasons why patients do not attend appointments. In addition, we apply ML techniques to create models that better fit the absenteeism problem. Thus, this work provides answers to some of the questions regarding patients' non-attendance, namely:

- What are the key indicators that signal that a patient will not attend a scheduled appointment?
- What is the probability of a patient not showing up for an appointment?

Research Approach
The present work brings a new contribution to elucidate the reasons for the no-show of patients and builds an ML model according to the following steps (see Figure 1):

1. Collect the patient dataset, consisting of data from both appointments and patients;
2. Apply data cleaning techniques to prepare the dataset;
3. Include peripheral databases to add more value to the initial dataset;
4. Analyze potential correlations between attributes in the dataset;
5. Start a descriptive data analysis to detect key factors and trends that contribute to patient no-shows;
6. Adapt the dataset for the training and testing phase, and try several classification algorithms to process the data;
7. Compare different performance metrics from the ML models, and select the model that provides the most accurate results for the problem at hand.
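The steps above can be sketched end to end in a few lines of pandas and scikit-learn. This is a minimal illustration, not the study's actual implementation: the column names (`attended`, etc.) and the choice of recall as the comparison metric are assumptions for the sketch.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

def run_pipeline(df: pd.DataFrame) -> float:
    """Toy version of steps 2, 6, and 7: clean, split, train, evaluate."""
    # Step 2: basic cleaning -- drop missing targets, standardize labels.
    df = df.dropna(subset=["attended"]).copy()
    df["attended"] = df["attended"].str.strip().str.lower().map({"yes": 1, "no": 0})
    # Step 6: one-hot encode features and split for training/testing.
    X = pd.get_dummies(df.drop(columns=["attended"]))
    y = df["attended"]
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.3, stratify=y, random_state=42)
    model = RandomForestClassifier(random_state=42).fit(X_tr, y_tr)
    # Step 7: report a performance metric (recall of the "attended" class).
    return recall_score(y_te, model.predict(X_te))
```

In practice each step (enrichment, correlation analysis, model comparison) expands into its own stage, but the skeleton is the same.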

Data Preprocessing
Data preparation is one of the most relevant aspects of ML. It is also one of the most time-consuming tasks, requiring on average 60% of the time and energy spent on a data science project [19]. We include in this task the collection, cleaning, enrichment, and exploration of data, which are presented as follows.

Data Collection
The dataset used in this study consists of data obtained and extracted from the University of Vale do Itajaí Center of Specialization in Physical and Intellectual Rehabilitation (CER). The CER is an outpatient care service that performs diagnosis, assessment, guidance, early stimulation, and specialized care. It has acted in functional rehabilitation and psychosocial qualification to encourage the autonomy and independence of people with disabilities [20]. Firstly, we collected the relevant information on the absenteeism problem in loco at the rehabilitation center by transcribing 4812 medical records from electronic spreadsheets for 2017 and 2019. In the initial dataset, each file is composed of the following attributes:

1. "Medical record number": unique identifier of the patient's record;
2. "Gender": male or female gender of the patient;
3. "Attended": whether or not the patient attended the scheduled appointment;
5. "No-show Reason": description of the reason why the patient did not attend the scheduled appointment;
6. "Type of Disability": the patient's motor or intellectual disability;
7. "Date of Birth": the patient's date of birth;
8. "Date of Entry into the Service": date of the patient's first appointment at the CER;
9. "City": city where the patient resides;
10. "ICD": identifier of the patient's disease;
11. "UBS": basic health unit that sent the patient to be treated at the CER.
The dataset contains a target feature, identified by the variable "Attended", in which "no" represents a patient who did not attend the medical appointment and "yes" represents a patient who showed up. Unlike a system that performs a task by explicit programming, an ML system learns from data. This means that, over time, if the training process is repeated and conducted on relevant samples, the predictions become more accurate.

Data Cleaning
This process converts (or maps) the data to another format convenient for analysis. In this work, the data manipulation was performed in the Google Colaboratory virtual environment [21], using the Python programming language [22] with the help of libraries such as pandas [23] and NumPy [24]. Firstly, we renamed the dataset columns. Secondly, we started the validation process. For the attribute "Attended", we found that some values were in a different format than expected, such as "No", "no", and "Did not attend". To deal with this inconsistency, we standardized all such values to "No" for patients who did not attend the scheduled appointment.
The expected value for the "Type of Disability" column was the letter "I" for intellectual disability and "F" for physical disability. However, we noticed seven empty values and three values outside the expected standard. In those cases, we removed the empty values and corrected the others. Another data validation concerned the appointment date format. The initial data were not in the standard day/month/year format, so we adjusted them accordingly. After this transformation, we considered only the data for 2019 and discarded the 90 medical records found for 2017.
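A minimal pandas sketch of these cleaning steps, assuming hypothetical column names (`attended`, `disability_type`, `appointment_date`) rather than the dataset's actual headers:

```python
import pandas as pd

def clean_records(df: pd.DataFrame) -> pd.DataFrame:
    """Standardize labels, drop invalid disability codes, fix dates."""
    df = df.copy()
    # Map inconsistent "Attended" values ("no", "Did not attend") to "No".
    df["attended"] = df["attended"].replace(
        {"no": "No", "Did not attend": "No", "yes": "Yes"})
    # Keep only the expected disability codes, dropping empty values.
    df = df[df["disability_type"].isin(["I", "F"])]
    # Parse dates as day/month/year and keep only the 2019 records.
    df["appointment_date"] = pd.to_datetime(df["appointment_date"], dayfirst=True)
    return df[df["appointment_date"].dt.year == 2019]
```

The correction of the three out-of-standard disability values was a manual step in the study and is not reproduced here.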

Data Enrichment
In order to add more information to the collected data, other databases were combined with the current one, and new columns were created based on the existing ones. The following items describe this process.
A. Disease Data: As the initial database contained only the patient's disease code, a new database with the names of the related diseases was merged in. With the inclusion of the disease names, data visualization and interpretation became more objective.
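The merge of disease names onto the disease codes can be sketched as a pandas left join. The codes and names below are illustrative stand-ins, not entries from the study's actual ICD table:

```python
import pandas as pd

# Toy medical records keyed by ICD code (values are illustrative only).
records = pd.DataFrame({"record_id": [1, 2, 3],
                        "icd": ["F84", "G80", "F84"]})
# Toy lookup table of ICD codes to disease names.
diseases = pd.DataFrame({"icd": ["F84", "G80"],
                         "disease_name": ["Autism spectrum disorder",
                                          "Cerebral palsy"]})

# A left join keeps every medical record; records whose ICD is absent from
# the lookup table simply get a missing disease name, which mirrors how the
# 1662 records without a registered disease code would surface.
merged = records.merge(diseases, on="icd", how="left")
```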
The database with the International Classification of Diseases (ICD) was extracted from a file in PDF format on the government portal [25] and transcribed to a file in JSON format. Initially, we adjusted the ICD codes registered in the database to the disease codes extracted from the government portal for the data merging. After the code standardization and data merging, we identified 37 diseases with different ICDs and 1662 medical records without a registered disease code. The graph in Figure 2 shows that some diseases stand out in the number of appointments.
B. Climate Data: Among the cities present in the collected records, only Itajaí has meteorological data available from the INMET source. The measurement dates from the INMET dataset were converted to day/month/year format (they were in the US format, month/day/year) for data standardization. Atmospheric pressure, wind speed and direction, and air humidity data were discarded from the original dataset, keeping only the temperature and precipitation data. After data validation and standardization, the mean and maximum temperature and precipitation for each day were calculated and merged with the dataset of medical records. The highest average temperatures were found in April and November, remaining around 25 degrees. The highest registered temperatures occurred in April and October, approaching 34 degrees. Finally, a qualitative value was assigned to represent the temperature and precipitation ranges. For temperatures, five classes were considered: very cold, cold, mild, warm, and very warm, representing temperatures less than or equal to 15 degrees, greater than 15 degrees, greater than 22 degrees, greater than 27 degrees, and greater than 32 degrees, respectively. For precipitation, we used the following values: no rain, weak, moderate, strong, and very strong. This classification refers to the maximum precipitation of the day being less than 1 millimeter (mm), greater than 1 mm, greater than 2.5 mm, greater than 10 mm, and greater than 50 mm, respectively.
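The qualitative binning just described maps directly onto `pandas.cut`. This is a sketch under assumed column names (`max_temp`, `max_precip_mm`); the thresholds are the ones stated above:

```python
import numpy as np
import pandas as pd

def label_weather(df: pd.DataFrame) -> pd.DataFrame:
    """Assign qualitative temperature and precipitation classes."""
    df = df.copy()
    # Temperature: <=15 very cold, then >15, >22, >27, >32 degrees.
    df["temp_class"] = pd.cut(
        df["max_temp"],
        bins=[-np.inf, 15, 22, 27, 32, np.inf],
        labels=["very cold", "cold", "mild", "warm", "very warm"])
    # Precipitation: <1 mm no rain, then >1, >2.5, >10, >50 mm.
    df["rain_class"] = pd.cut(
        df["max_precip_mm"],
        bins=[-np.inf, 1, 2.5, 10, 50, np.inf],
        right=False,  # "less than 1 mm" means the first bin excludes 1 mm
        labels=["no rain", "weak", "moderate", "strong", "very strong"])
    return df
```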
C. Other Related Attributes: We created new attributes for the dataset based on the existing ones. From the date of birth, we derived the patients' ages. In the same way, we extracted the month of the appointment from the registration date. Finally, we obtained the appointment shift based on the scheduled appointment time. The age attribute allowed us to analyze whether there is a relationship between the patient's age and the rate of abstention from appointments. The month of consultation helped us check whether abstention levels are higher in the months in which temperatures drop. After completing the validation process, merging the new databases, and inserting attributes into the original CER dataset, we kept 22 attributes for the initial data analysis.
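The three derived attributes can be sketched as follows; the column names and the noon cutoff for the shift are assumptions for illustration, not the study's documented choices:

```python
import numpy as np
import pandas as pd

def add_derived(df: pd.DataFrame) -> pd.DataFrame:
    """Derive age, appointment month, and shift from existing columns."""
    df = df.copy()
    dates = pd.to_datetime(df["appointment_date"], dayfirst=True)
    births = pd.to_datetime(df["birth_date"], dayfirst=True)
    df["age"] = (dates - births).dt.days // 365   # approximate age in years
    df["month"] = dates.dt.month                  # month of the appointment
    # Appointments before noon are labeled "morning", the rest "afternoon".
    hours = pd.to_datetime(df["appointment_time"], format="%H:%M").dt.hour
    df["shift"] = np.where(hours < 12, "morning", "afternoon")
    return df
```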

Data Exploration
Data exploration is the step responsible for exploring and visualizing data so that it is possible to identify patterns contained in the data sample. In this way, we enable inferences that can contribute to the understanding of the problem in question. One way to summarize the data and get an overview of the attributes is through descriptive statistics, which summarize the main numerical characteristics of the dataset (continuous or discrete), such as top, frequency, mean, and standard deviation (std). However, in a scenario with many categorical variables, other approaches may be more appropriate. An overview of the categorical attributes is given in Figure 4. The amount of abstentions by women exceeds that of men, with female patients being responsible for 13.13% of abstentions against 10.45% for males.
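Overviews of categorical attributes like the gender breakdown above can be produced with a frequency table. A minimal sketch with made-up sample values (the real figures come from the CER dataset, not this toy frame):

```python
import pandas as pd

# Toy records; values are illustrative, not the CER dataset's.
df = pd.DataFrame({"gender": ["F", "F", "M", "F", "M"],
                   "attended": ["Yes", "No", "Yes", "Yes", "No"]})

# Cross-tabulate attendance by gender as percentages of all records,
# the same kind of summary behind the 13.13% / 10.45% abstention figures.
summary = pd.crosstab(df["gender"], df["attended"], normalize="all") * 100
```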

Concerning the age of patients, we extracted two age categories for analysis: patients under 18 years old were labeled as "Minor Age" and the others as "Adult". Figure 7 shows an imbalance between these categories, with appointments for younger people being much more prevalent; adults represent only 19% of the consultations carried out at the specialized center. We also note that most of the data collected are from patients under 18 years old. This may reveal behavioral characteristics that are not exclusively related to the patient who receives care: in many cases, underage patients need someone close to accompany them to medical care, which implies a behavioral analysis of both the patient and their companion. In this study, we did not collect data related to this issue. We extracted the "appointment month" attribute from the existing column labeled "appointment date". Figure 8 shows that the month with the highest number of scheduled appointments is September, followed by October and August.

In addition, the month with the fewest medical appointments is November. As presented in Section 2.2, we included climate data in the original dataset. Figure 9 shows a way to visualize the relationship between weather conditions and the probability of the patient not attending the scheduled medical care. From Figure 9, we observe that in autumn and winter, temperatures tend to be lower in the city of Itajaí, the region of the medical center.

In July, the number of patients who did not attend the scheduled appointment reached

The no-show proportion is similar for both shifts: 12.24% for the morning shift and 10.34% for the afternoon shift. Figure 11 shows that the absolute number of abstentions based on appointment times has low variability. At the times with the highest frequency of scheduled medical appointments, the number of no-shows drops to around 9%, about 3 percentage points lower than the total number of abstentions for the morning.

In this way, although the most frequent hours present fewer abstentions, we can infer that the patients' profile has not impacted the appointments during business hours. It is relevant to clarify that the diseases treated at the study's medical center mainly concern patients with motor disabilities; for this reason, they possibly already have a routine with differentiated hours.

Model Building
Predictive analysis uses statistical and ML techniques to understand the data structure. Upon inspecting the percentage distribution of the records between the "yes" and "no" values of the target variable "attended", we find a considerable imbalance between the classes; an imbalanced classification problem is one in which the class distribution is biased. Figure 13 shows that 90% of the dataset's records are labeled as "yes" and 10% are labeled as "no". Oversampling not only corrected the imbalance, but also improved the classification performance metrics, mainly the precision and recall measures. Figure 14 shows the target class distribution after applying the oversampling algorithm.
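Random oversampling duplicates minority-class records until both classes are the same size. The paper does not spell out here which oversampling algorithm was applied, so the following is a generic sketch (with an assumed `attended` column name) rather than the study's exact procedure:

```python
import pandas as pd
from sklearn.utils import resample

def oversample(df: pd.DataFrame, target: str = "attended") -> pd.DataFrame:
    """Random oversampling: upsample the minority class to the majority size."""
    counts = df[target].value_counts()
    majority, minority = counts.idxmax(), counts.idxmin()
    minority_up = resample(df[df[target] == minority],
                           replace=True,            # sample with replacement
                           n_samples=counts.max(),  # match the majority size
                           random_state=42)
    return pd.concat([df[df[target] == majority], minority_up])
```

More elaborate schemes such as SMOTE synthesize new minority samples instead of duplicating rows, which is often preferred when duplicated records risk overfitting.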

The model must be trained on a consistent number of observations to refine its ability to classify new patterns. If possible, two distinct datasets are the best choice: one for training and a second for testing. In this case, as two dedicated datasets were not available, the original dataset was split into one part for training (70%) and another for testing (30%), known as the holdout method [32].

The train_test_split function, available in the SKLearn library, splits the data into training and testing subsets. In the dataset split step, we need to keep the same distribution of the target variable within both the training and test datasets; this avoids a random subdivision changing the proportion of the classes present in the training and test datasets from that of the original. Thus, even after the oversampling process described in Section 3.1, we apply the parameter "stratify" to the train_test_split function to preserve the proportion of classes. Then, the performance of the classification algorithm is evaluated using the average of the k accuracies resulting from k-fold cross-validation. There is no defined rule for choosing k, although splitting the data into 5 or 10 parts is most common. Figure 15 shows
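The stratified holdout and k-fold evaluation described above can be sketched as follows; synthetic data stands in for the CER records, and the 90/10 class weights mirror the imbalance reported for the original dataset:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in for the CER data, with a 90/10 class imbalance.
X, y = make_classification(n_samples=500, weights=[0.9, 0.1], random_state=42)

# 70/30 holdout; "stratify" preserves the class proportions in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)

# Performance is the mean of the k accuracies from k-fold cross-validation
# (k=5 here, within the common 5-or-10 range mentioned above).
scores = cross_val_score(RandomForestClassifier(random_state=42),
                         X_train, y_train, cv=5)
mean_accuracy = scores.mean()
```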

This section presents the results obtained after performing the data pre-processing. After the holdout process on the oversampled data described in Section 3.2, we trained the classifiers. There may be factors that implicitly influence the patient's no-show that could not be inferred from the analyzed dataset.

The results obtained from the data analysis represent a starting point in the develop-