Feature Selection Approach to Improve Malaria Diagnosis Model for High and Low Endemic Areas of Tanzania

Malaria remains an important cause of death, especially in sub-Saharan Africa, with about 228 million malaria cases worldwide and an estimated 405,000 deaths in 2018. Currently, malaria is diagnosed in health facilities using microscopy (BS) or a malaria rapid diagnostic test (MRDT), and in areas where these tools are inadequate, presumptive treatment is given. In addition, self-diagnosis and self-treatment are practiced in some households. Given the high rate of self-medication with antimalarial drugs, this study aimed to compute, using feature selection methods, the most significant features for predicting malaria in Tanzania, which can then be used to develop a machine learning model for malaria diagnosis. A dataset of malaria symptoms and clinical diagnoses was extracted from patients' files at four (4) health facilities in the regions of Kilimanjaro and Morogoro. These regions were selected to represent the high endemic areas (Morogoro) and low endemic areas (Kilimanjaro) of the country. The dataset contained 2556 instances and 36 variables. A random forest classifier, a tree-based method, was used to select the most important features for malaria prediction. Region-specific features were obtained to facilitate accurate prediction. The feature ranking indicated that fever is universally the most influential feature for predicting malaria, followed by general body malaise, vomiting and headache. However, these features are ranked differently across the regional datasets. Subsequently, six predictive models trained on the selected features were used to evaluate the features' performance. The identified features comply with the malaria diagnosis and treatment guidelines provided by WHO and Tanzania Mainland. This compliance was sought so that the resulting prediction model fits the current health care provision system.


Introduction
Malaria is a disease caused by the Plasmodium parasite, which is transmitted by the bite of an infected female Anopheles mosquito. Malaria remains an important cause of death, especially in sub-Saharan Africa, with about 228 million malaria cases worldwide and an estimated 405,000 deaths in 2018 [1]. In 2019, WHO reported that after a period of unprecedented global success, progress in malaria control had stalled since 2016 [2]. Tanzania is a country in Eastern Africa, bordered by the great lakes of Victoria to the north, Tanganyika to the west and Malawi to the south. It comprises a mainland and the Zanzibar archipelago. The prevalence of malaria in Tanzania varies, from as high as 28% in the Western zone to as low as 1% in the Northern zone [3]. The variation in prevalence between places and times of the year is affected by both climatic and non-climatic factors [4]. Climatic factors, including temperature, rainfall and relative humidity, greatly influence the pattern and levels of malaria [5]-[7]. Non-climatic factors that influence malaria risk include types of vectors, species of malaria parasite, host immunity, insecticide and drug resistance, environmental development and urbanization, population movements, and other socio-economic factors including livelihoods. Ninety-three percent (93%) of mainland Tanzania's population resides in malaria-endemic areas, and in 2015 an estimated 7.3 million clinical and confirmed cases of malaria were reported in the country [8]. The WHO has issued a guideline on the diagnosis and treatment of malaria that health professionals must follow when managing malaria cases [2], [9]. The 2014 Tanzania Mainland National Guidelines for the Diagnosis and Treatment of Malaria align with this guideline [10].

Malaria Diagnosis
For diagnosis of malaria, both the WHO and Tanzania guidelines advocate parasitological confirmation of suspected malaria cases for people of all ages with fever, headache, joint pains, malaise, vomiting, diarrhea, body aches, body weakness, poor appetite, pallor and enlarged spleen as diagnostic criteria. They also stipulate that in settings where parasitological diagnosis is not possible, the decision to provide antimalarial treatment must be based on the probability that the illness is malaria, but they give no guidance on how this prediction should be made [9], [11]. For treatment of malaria, it is advised that antimalarial medications be administered only after parasitological confirmation of the disease. In practice, however, drug dispensing shops that are only permitted to sell non-prescription medications frequently dispense prescription-only treatments [12]-[15]. This has resulted in drug resistance and drug shortages [16], [17]. The Tanzanian government has various initiatives to reduce the dispensing and use of antimalarial drugs by people who may not have malaria. In low endemic areas, other diseases with similar symptoms tend to have a higher prevalence than malaria. Respiratory tract infection, for example, has shown high prevalence in most African countries [18]. Poor access to primary healthcare and errors in differential diagnoses represent a significant challenge to global healthcare systems. One initiative is the "not every fever is Malaria" campaign, which aims to educate people that not every fever episode is a malaria case, since malaria shares symptoms with other febrile diseases such as dengue fever, typhoid fever, common cold, respiratory tract infection, dyspepsia, and pneumonia [19]. Since no single method can eradicate the problem at hand, enhancing these efforts with tools that help to better predict malaria cases is of utmost importance.
Machine learning has been successfully applied in the prediction of various diseases such as cancer, diabetes, typhoid, respiratory tract infection and even COVID-19 [20]. The study in [21] reported that machine learning diagnosis outperformed medical doctors' diagnosis, which indicates the potential of these algorithms in the diagnosis of diseases. The power of machine learning is complemented by the current data explosion: a voluminous amount of medical data is generated and updated daily. Healthcare data include paper and Electronic Health Records (EHR), which comprise clinical reports of patients, diagnostic test reports, doctors' prescriptions, pharmacy information and information related to patients' health insurance. In achieving successful prediction of malaria, identification of the features (variables) that are associated with malaria diagnosis and treatment is key. Feature selection is an efficient data preprocessing technique in data mining for reducing the dimensionality of data [22]. In medical diagnosis, it is very important to identify the most important risk factors related to a disease. Relevant feature identification helps to remove unnecessary, redundant attributes from the disease dataset, which in turn gives quicker and better prediction results [23].
This study aimed to identify significant features for the diagnosis of malaria in both low and high endemic areas of Tanzania. To achieve this goal, two questions were answered. The first is whether the features and their importance vary between high and low endemic areas in the country. The second is how much the accuracy of a malaria prediction model varies when different feature sets are independently generated for the high and low endemic areas. A model-based feature selection approach was used to select the most important features. The results show that the importance of malaria symptoms varies from place to place. In addition, the accuracy of the machine learning classifiers improved after feature selection.

Study Area
Data were collected from four health facilities in two regions of Tanzania: Morogoro and Kilimanjaro (Figure 3). The four health facilities are Mawenzi regional hospital and Majengo health center in Kilimanjaro, and Morogoro regional hospital and Mzumbe health center in Morogoro. The dataset represents patients who live in areas with low malaria transmission, represented by the Kilimanjaro region, and patients who live in areas with high malaria transmission, represented by the Morogoro region. The choice of these regions was based on the prevalence of malaria: Morogoro represents regions with high prevalence, with a malaria prevalence of 15.0%, and Kilimanjaro represents regions with low prevalence, with a malaria prevalence of 1.0%.

Method used and Participants
A malaria patient records extraction form was designed based on a summary of the Ministry of Health (MoH) patient's file and the information collected when a patient visits the health facility. Records were retrieved from the files of patients who had been treated for malaria from 2015 to 2019. The aim was to identify the past state of clinical malaria diagnosis in the local health facilities and to understand the common practice in malaria diagnosis and treatment.
The key information collected was: (i) the patients' demographic information, (ii) the symptoms presented by the patient when consulting a doctor, (iii) the tests taken and their results, (iv) the diagnosis based on the laboratory results and (v) the treatment provided. Data collection was administered by trained nurses, and all participants provided written informed consent.

Ethical clearance
The study was approved by the National Institute for Medical Research Tanzania (NIMR) before the participants were recruited and records were collected (approval number: NIMR/HQ/R.8.c/Vol.I/1352). For the survey, all participants provided written informed consent to participate in the study. For the patient records, consent was given by the health facilities with guidance from NIMR.

Dataset Description
A dataset can be described as a collection of many separate pieces of data that can be used to train an algorithm with the goal of finding predictable patterns in the whole [24]. A dataset corresponds to the contents of a single database table, or a single statistical data matrix, where every column of the table represents a particular variable and each row corresponds to a given member of the dataset in question. In the malaria dataset, the variables collected included date of visiting the facility, age, sex, residence area, symptoms observed, type of test taken, test results, diagnosis and treatment provided. These variables were collected based on the items recorded in the patients' files by the doctors when the patients visit the health facilities. The residence area was selected to observe whether patients coming from a certain area have more infections than those from other locations. The date of visit was collected to observe whether malaria is seasonal, and the symptoms were collected to see whether they are significant in malaria diagnosis. A data transformation method [25], [26] was used to transform the patient's residence area variable by grouping each residence area under the hospital the patient attended. Feature selection, one of the dimensionality reduction techniques [27], [28], was applied to remove attributes that were not significant to the target variable. A target variable is the variable whose values are to be modeled and predicted by the other variables; in this case, the target variable is the malaria diagnosis, which can be either positive or negative.

Feature Selection
Three sets of features were generated from the malaria diagnosis dataset. The first feature set was derived by applying feature selection to a dataset of only Kilimanjaro (low endemic area) patients, the second from a dataset of only Morogoro (high endemic area) patients, and the last from a dataset of both Morogoro and Kilimanjaro (combined areas) patients. A model-based feature selection method, which uses a supervised machine learning algorithm to judge the importance of each feature in the dataset, was used in this study to select the most important features. Feature selection is an important step in machine learning because including irrelevant features affects the classification performance of the model. In model-based feature selection there are two approaches, using feature importance and selecting from the model, both of which aim to select the most significant features [29].
The Random Forest algorithm was used as the feature selection algorithm to select important features from the malaria diagnosis dataset. The tree-based strategies used by random forests naturally rank features by how well they improve the purity of the node, that is, by the mean decrease in impurity over all trees. Nodes with the greatest decrease in impurity occur at the start of the trees, while nodes with the least decrease in impurity occur at the end of the trees. Thus, by pruning trees below a particular node, a subset of the most important features can be created.
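To make the notion of a decrease in impurity concrete, the following minimal sketch computes the Gini impurity decrease for a single split on a binary symptom. It is an illustration only: the choice of Gini impurity, the helper names gini and impurity_decrease, and the toy records are assumptions and are not taken from the study. A random forest averages decreases of this kind over many nodes and trees.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels: 1 - sum(p_k^2)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def impurity_decrease(feature, labels):
    """Gini decrease obtained by splitting on a binary (0/1) feature."""
    left = [y for x, y in zip(feature, labels) if x == 0]
    right = [y for x, y in zip(feature, labels) if x == 1]
    n = len(labels)
    weighted_children = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(labels) - weighted_children

# Hypothetical toy records (NOT from the study's dataset).
malaria = [1, 1, 1, 0, 0, 0, 1, 0]  # target: 1 = positive, 0 = negative
fever = [1, 1, 1, 0, 0, 0, 1, 1]    # mostly tracks the diagnosis
cough = [1, 0, 1, 0, 1, 0, 0, 1]    # unrelated to the diagnosis

decrease_fever = impurity_decrease(fever, malaria)
decrease_cough = impurity_decrease(cough, malaria)
print(f"fever: {decrease_fever:.3f}, cough: {decrease_cough:.3f}")
```

In this toy case the split on fever reduces impurity while the split on cough does not, so a tree-based ranking would place fever above cough.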
To minimize the complexity and improve the performance of the model, the top 10 important features were selected for each regional dataset and the top 15 important features for the combined malaria dataset. The evaluation criterion applied was as follows: if the accuracy of the model trained on the dataset with only the important features is higher than that of the model trained on the full-feature dataset, the selected features are considered significant for the classification of malaria and will be used for developing the malaria prediction model.
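The selection rule and the evaluation criterion described above can be sketched as follows. This is a hedged illustration: the function names top_k_features and is_significant are hypothetical, and the importance scores are placeholders rather than results from the study.

```python
def top_k_features(importances, k):
    """Return the k feature names with the largest importance scores."""
    ranked = sorted(importances.items(), key=lambda item: item[1], reverse=True)
    return [name for name, _ in ranked[:k]]

def is_significant(acc_selected, acc_full):
    """Criterion from the text: keep the subset only if it beats the full set."""
    return acc_selected > acc_full

# Placeholder scores for illustration only; NOT results from the study.
example_importances = {
    "feature_a": 0.40,
    "feature_b": 0.05,
    "feature_c": 0.30,
    "feature_d": 0.25,
}
selected = top_k_features(example_importances, k=2)
print(selected)
```

In the study, k would be 10 for a regional dataset and 15 for the combined dataset.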

Evaluation of Selected Features
The selected features were evaluated by comparing the accuracy of machine learning models trained on the dataset with all features (full set) with that of models trained on the three selected feature sets (Kilimanjaro, Morogoro and combined feature sets). Six different machine learning models were developed with each feature set, and the average accuracy was computed and used for comparison. The six classifiers chosen were K-Nearest Neighbor (KNN), Support Vector Machine (SVM), Naïve Bayes (NB), Logistic Regression (LR), Decision Tree (DT) and Random Forest (RF). These were chosen due to their popularity in disease diagnosis [20], [30], [31].
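The comparison metric can be sketched as below: accuracy is the fraction of correct predictions, and the per-feature-set score is the mean accuracy across classifiers. The labels, the two stand-in models and the function names are hypothetical and serve only to show the arithmetic; the study used six classifiers.

```python
from statistics import mean

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

def average_accuracy(y_true, predictions_by_model):
    """Mean accuracy across the predictions of several classifiers."""
    return mean(accuracy(y_true, y_pred) for y_pred in predictions_by_model.values())

# Hypothetical labels and predictions (NOT output of the study's models).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
predictions = {
    "model_a": [1, 0, 1, 0, 0, 0, 1, 0],  # 7 of 8 correct
    "model_b": [1, 1, 1, 1, 0, 1, 1, 0],  # 6 of 8 correct
}
avg = average_accuracy(y_true, predictions)
print(f"average accuracy: {avg:.4f}")
```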

Important Features for High Endemic Area Dataset
For the Morogoro dataset (high endemic area), the most important feature is the age of the patient, followed by fever, abdominal pain, visit date, dizziness, vomiting, headache, sex of the patient, general body malaise, and confusion, as shown in Figure 1.

Important Features for Combined Areas Dataset
From the combined malaria diagnosis dataset, the most important features are residence area of the patient, fever, age of the patient, general body malaise, visit date, headache, abdominal pain, backache, chest pain, sex of the patient, vomiting, confusion, dizziness, coughing and joint pain, as shown in Figure 3.
Using the full set of features, the accuracies of the six machine learning classifiers were K-Nearest Neighbor 70%, Support Vector Machine 69%, Naïve Bayes 33%, Logistic Regression 70%, Decision Tree 75% and Random Forest 82%, with an average accuracy of 70%, as shown in Figure 7. When the dataset with only the important features was used, the accuracies were K-Nearest Neighbor 73%, Support Vector Machine 71%, Naïve Bayes 63%, Logistic Regression 70%, Decision Tree 79% and Random Forest 86% (Figure 6). Overall, model accuracy improved on the datasets that contain only the important features.
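The per-classifier change implied by the accuracies reported above can be tabulated with a short script. The figures are copied from the text; the abbreviations and variable names are chosen here for illustration.

```python
# Accuracies in percent, as reported in the text above.
full_set = {"KNN": 70, "SVM": 69, "NB": 33, "LR": 70, "DT": 75, "RF": 82}
important_set = {"KNN": 73, "SVM": 71, "NB": 63, "LR": 70, "DT": 79, "RF": 86}

# Gain in percentage points when only the important features are used.
gains = {name: important_set[name] - full_set[name] for name in full_set}
largest_gain = max(gains, key=gains.get)
unchanged = [name for name, gain in gains.items() if gain == 0]

for name, gain in gains.items():
    print(f"{name}: {gain:+d} percentage points")
```

On these figures, Naïve Bayes gains the most and Logistic Regression is unchanged, which illustrates how strongly the benefit of feature selection depends on the classifier.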

Discussion
For proper management of any disease, accurate, affordable and timely diagnosis is key. In most developing countries, proper diagnosis of malaria has been a challenge due to the lack of testing equipment, the shortage of personnel to run diagnostic tests, and patients' self-medication [32]-[34]. Machine learning can relieve this burden by providing a high-accuracy disease prediction tool that does not require expensive equipment or trained personnel to run. This can in turn ensure that patients seeking treatment at facilities without the recommended equipment or personnel for parasitological tests, and patients visiting pharmacies for medication without testing, are better assessed for the probability of disease before treatment. The accuracy of such prediction models relies on proper selection of the important features used to train them.
In this study, the aim was to find the most important features in the diagnosis of malaria. We found that not only some of the symptoms but also non-symptom variables, such as the residence area, sex and age of the patient, are significant in the diagnosis of malaria. The difference in the importance of features between regions signifies that each region is unique, even within the same country, and that regions should not be treated the same. The difference may be due to geographical location, which can enhance the rate of disease transmission. Apart from the difference in level of importance, the experiments showed that some features are significant in one region but have no significance in the other. Coughing and joint pain are significant for malaria diagnosis in Morogoro but have zero significance in Kilimanjaro, while dizziness and confusion are important for malaria diagnosis in Kilimanjaro and have no importance in Morogoro.
It was also observed that the month of the year in which patients visit the health facility with malaria-related symptoms is significant in malaria diagnosis. The months that are significant fall either during the rainy season or just after it. This aligns with the WHO guidance on malaria transmission behavior. Transmission also depends on climatic conditions that may affect the number and survival of mosquitoes, such as rainfall patterns, temperature and humidity. In many places, transmission is seasonal, with the peak during and just after the rainy season, as observed in [35]-[38]. Malaria epidemics can occur when climate and other conditions suddenly favor transmission in areas where people have little or no immunity to malaria [39].
Both the WHO [2], [40] and the Tanzania Mainland malaria treatment guideline [8] propose that, for diagnosis of malaria, parasitological confirmation of suspected malaria cases should be performed for patients of all ages with fever, headache, joint pains, malaise, vomiting, diarrhea, body ache, body weakness, poor appetite, pallor and enlarged spleen as diagnostic criteria. These criteria match the features identified in this study, which indicates that the model to be developed will support the malaria treatment guidelines.
The trained models attained their highest accuracy when trained on the dataset with the selected important features. This means that good feature selection contributes to a more accurate malaria prediction model.

Conclusions
The main objective of this paper was to compute the significant features for malaria diagnosis. Our results show that it is possible to create a more accurate model for malaria prediction by applying feature selection methods to the malaria diagnosis dataset. The ranking of features by our feature selection algorithm shows that fever is universally the most influential feature for predicting malaria in all the datasets, followed by general body malaise, vomiting and headache; however, these features are ranked differently across the regional datasets. The improvement in accuracy over the original dataset varies greatly depending on which machine learning algorithm is used; therefore, to get the best possible model, it is necessary to review a wide range of combinations of feature selection techniques and machine learning algorithms. This study is limited to the selection of the important features to be used for malaria prediction. In future research, it would be interesting to look at different machine learning algorithms for building a malaria predictive model. The model built in that further study could be used by clinicians, pharmacists and other individuals to detect malaria in new patients, provided that patient data for the features used are available.

Informed Consent Statement:
The study was approved by the National Institute for Medical Research Tanzania (NIMR) before the participants were recruited and records were collected. Informed consent was obtained from all subjects involved in the study.

Data Availability Statement:
The data that support the findings of this study are available from the corresponding author, [M.M], upon reasonable request.

Conflicts of Interest:
The authors declare no conflict of interest.