Analyzing the factors affecting the COVID-19 risk level in the US counties

COVID-19 spreads swiftly, and nearly three months after the first positive case was confirmed in China, the coronavirus had started to spread all over the United States. Some states and counties reported extremely high numbers of positive cases and deaths, while others reported very few COVID-19 cases and deaths. In this paper, the factors that could affect the transmission of COVID-19 and its risk level in different counties are identified and analyzed. Using Pearson correlation, K-means clustering, and several classification models, the most critical factors were determined. Results showed that mean temperature, percent of people below poverty, percent of adults with obesity, air pressure, percentage of rural areas, and percent of uninsured people in each county were the most significant attributes.


Introduction
COVID-19 is caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), with common symptoms of fever, dry cough, shortness of breath, and other signs of respiratory infection. The World Health Organization (WHO) reported that 80% of patients experienced these symptoms mildly. However, older people (>60 years old) and those with co-morbid diseases are at a higher risk of severe symptoms and death [1,2]. In addition, younger patients with no underlying disease might also experience severe symptoms or even die [3][4][5]. The first positive case of COVID-19 in the United States was reported in the state of Washington on January 20, 2020. By March 17, 2020, COVID-19 had spread across all US states [6,7]. Figure 1 shows the aggregated COVID-19 positive case and death count maps for all US states until November 6, 2020. Reports showed that on November 6, 2020, the five states with the most positive COVID-19 cases were California, Texas, Florida, New York, and Illinois, while the five states with the most deaths were New York, Texas, California, New Jersey, and Florida [8,9].

Materials
The data used in this study was taken from various online sources covering March 2020 to November 2020. It includes COVID-19 positive cases and deaths as well as demographic, meteorological, health, and location-based data for each US county (Table 1). To evaluate the risk level of each county, we used the COVID-19 positive case and death rates (positive case and death counts divided by the population of the county) instead of the raw counts. For the meteorological data, we averaged the values over all days of each month to obtain monthly averages, because the daily time-series format of the meteorological data did not match the other datasets used. The maximum and minimum temperatures for each county were derived from the maximum and minimum temperatures across all days from March to November. Population density, along with the populations of elderly people (>65 years old) and younger people (<65 years old), was also added to the dataset. All the parameters used in this study are shown in Table 1.
Figure 2 shows the steps taken in this article to determine the key factors that most strongly affect coronavirus transmission and largely determine the risk level of each county. In the first step, we found and corrected the inaccurate records in the dataset (data cleaning). Then, we calculated the correlation coefficients between the parameters, identified highly correlated parameters, and kept only one variable from each highly correlated group. We then applied clustering with the two key parameters of positive case rate and death rate as features. Counties were labeled based on the clustering results such that each label was treated as a class. In the next step, we applied different classification models to the complete dataset, and the best model was chosen based on the accuracy rates. Finally, the significant parameters were selected for the analysis.
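The rate and monthly-averaging calculations described in this section can be sketched as follows. This is a minimal, dependency-free illustration; the county names, field names, and all numbers are hypothetical and are not taken from the study's data.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical per-county totals (illustrative values only).
counties = {
    "County A": {"population": 10_000, "cases": 150, "deaths": 3},
    "County B": {"population": 250_000, "cases": 5_000, "deaths": 125},
}

# Rates are counts divided by the county population.
rates = {
    name: {
        "case_rate": c["cases"] / c["population"],
        "death_rate": c["deaths"] / c["population"],
    }
    for name, c in counties.items()
}

# Hypothetical daily temperature records: (county, "YYYY-MM-DD", value).
daily_temp = [
    ("County A", "2020-03-01", 5.0),
    ("County A", "2020-03-02", 7.0),
    ("County A", "2020-04-01", 12.0),
    ("County B", "2020-03-01", 10.0),
    ("County B", "2020-03-02", 14.0),
    ("County B", "2020-04-01", 18.0),
]

by_month = defaultdict(list)   # (county, "YYYY-MM") -> daily values
by_county = defaultdict(list)  # county -> all daily values
for county, day, value in daily_temp:
    by_month[(county, day[:7])].append(value)
    by_county[county].append(value)

# Monthly mean per county, and min/max over the whole period.
monthly_mean = {key: mean(vals) for key, vals in by_month.items()}
temp_range = {c: (min(vals), max(vals)) for c, vals in by_county.items()}
```

With real data, the same grouping would typically be done with a dataframe library, but the arithmetic is the same.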

Data Cleaning
After combining the data from various sources, it was revealed that many values were missing (Figure 3). One reason was that county names did not match across the different datasets. Moreover, the number of ICU beds was missing for some counties; these values were imputed as zero. Missing values in the demographic data were replaced with data from the United States Census Bureau website [28]. If no data was available for a specific county, values were imputed using the average of the non-missing values of that parameter across the neighboring counties. For example, the missing precipitation, air pressure, and dewpoint values for the "District of Columbia" county were replaced with the corresponding data for "Arlington" county in the state of Virginia. The total number of counties analyzed in the current study was 3131.
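The neighbor-based imputation described above can be sketched as follows. This is only an illustration under stated assumptions: the adjacency mapping `neighbors`, the helper names, "County X", "County Y", and all numeric values are hypothetical. Only the District of Columbia / Arlington pairing comes from the text.

```python
import math

def is_missing(value):
    """Treat None and NaN as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))

def impute_from_neighbors(values, neighbors):
    """Fill each missing county value with the mean of its neighbors' known values.

    values:    county -> value (None/NaN if missing)
    neighbors: county -> list of adjacent counties (hypothetical adjacency data)
    """
    filled = dict(values)
    for county, value in values.items():
        if is_missing(value):
            known = [
                values[n]
                for n in neighbors.get(county, ())
                if n in values and not is_missing(values[n])
            ]
            if known:  # leave the gap if no neighbor has data
                filled[county] = sum(known) / len(known)
    return filled

# Hypothetical air-pressure values (hPa), for illustration only.
air_pressure = {
    "District of Columbia": None,
    "Arlington": 1016.0,
    "County X": 1012.0,
    "County Y": float("nan"),
}
neighbors = {
    "District of Columbia": ["Arlington"],
    "County Y": ["Arlington", "County X"],
}
imputed = impute_from_neighbors(air_pressure, neighbors)

# Missing ICU bed counts were set to zero in the study.
icu_beds = {"County X": 12, "County Y": None}
icu_filled = {c: 0 if is_missing(v) else v for c, v in icu_beds.items()}
```

Imputed values are always computed from the original data, so one county's imputed value is never used to fill another's.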

Correlation Check
After normalizing all independent parameters, Pearson correlation was applied to the data. Among parameters with high linear correlation (more than 0.8), only one was retained and the others were removed. The remaining parameters used for further analyses are shown in Table 2.
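A correlation filter of this kind can be sketched as below. Two choices here are assumptions, since the text does not specify them: the threshold is applied to the absolute value of the coefficient, and within a correlated group the first variable encountered is the one kept. The column names and values are hypothetical. Pearson correlation is unchanged by linear rescaling of a column, so the filter gives the same result before or after normalization.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length, non-constant sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)

def drop_correlated(columns, threshold=0.8):
    """Keep a column only if |r| <= threshold against every column already kept.

    Assumption: the first variable seen in a correlated group is retained.
    """
    kept = {}
    for name, values in columns.items():
        if all(abs(pearson(values, other)) <= threshold for other in kept.values()):
            kept[name] = values
    return kept

# Hypothetical toy columns: y is almost a multiple of x, z is only weakly related.
columns = {
    "x": [1, 2, 3, 4, 5, 6],
    "y": [2, 4, 6, 8, 10, 12.5],
    "z": [3, -1, 4, -1, 5, -9],
}
selected = drop_correlated(columns)
```

Here `y` is dropped because it is nearly collinear with `x`, while `z` survives.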

Clustering Analysis
To determine the COVID-19 risk level of each county, K-means clustering was done based on COVID-19 positive case rates and death rates. The cluster number for each county was used as its label (dependent variable) for the classification analysis.
To determine the optimal number of clusters for defining the risk level of each county, the Elbow method was used (Figure 4). The Elbow method is a visual method that identifies the optimal number of clusters from the total within-cluster sum of squared Euclidean distances (the cost). The optimal k (the number of clusters) is the value beyond which adding another cluster (k+1) does not significantly decrease the cost; in the Elbow plot, it is located at the elbow of the curve [29,30]. Figure 4 shows that k=3 could be the optimal number of clusters. As can be seen in Figure 5 (clustering results), counties were grouped into three clusters: low-positive-case low-death, medium-positive-case medium-death, and high-positive-case high-death. The point located far from the others corresponds to New York County, which had a positive case rate of 15.5% and a death rate of 1.5%. Table 3 shows some characteristics of each cluster. Cluster 1 appeared to be the low-risk cluster: it contained the highest number of counties and had the lowest average positive case rate and death rate. In contrast, cluster 3 had the lowest number of counties but the highest average positive case rate and death rate, and thus represented the high-risk counties. Cluster 2 was considered the medium-risk cluster.
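The Elbow procedure and the cluster labeling can be illustrated with a small, dependency-free sketch of Lloyd's K-means algorithm. Several things here are assumptions rather than the study's implementation: the toy points (case rate and death rate in percent) are invented, the deterministic farthest-point initialization is chosen only to keep the example reproducible, and ordering the clusters by centroid case rate to obtain risk labels is one plausible convention. In practice a library implementation with multiple random restarts would normally be used.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def farthest_point_init(points, k):
    """Deterministic seeding: start at the first point, then repeatedly add
    the point farthest from all centroids chosen so far (an assumption)."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(max(points, key=lambda p: min(sq_dist(p, c) for c in centroids)))
    return centroids

def kmeans(points, k, max_iter=100):
    """Lloyd's algorithm; returns (labels, centroids, within-cluster cost)."""
    centroids = farthest_point_init(points, k)
    for _ in range(max_iter):
        labels = [min(range(k), key=lambda j: sq_dist(p, centroids[j])) for p in points]
        updated = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            updated.append(
                tuple(sum(c) / len(members) for c in zip(*members)) if members else centroids[j]
            )
        if updated == centroids:  # assignments are stable
            break
        centroids = updated
    cost = sum(sq_dist(p, centroids[lab]) for p, lab in zip(points, labels))
    return labels, centroids, cost

# Hypothetical (case rate %, death rate %) points forming three groups.
low = [(1.0 + 0.1 * i, 0.02 + 0.002 * i) for i in range(6)]
medium = [(4.0 + 0.1 * i, 0.10 + 0.002 * i) for i in range(4)]
high = [(8.0 + 0.1 * i, 0.25 + 0.002 * i) for i in range(2)]
points = low + medium + high

# Elbow curve: cost for each candidate k.
costs = {k: kmeans(points, k)[2] for k in range(1, 6)}

# Relabel the k=3 clusters as 1 (low) to 3 (high) by centroid case rate.
labels, centroids, _ = kmeans(points, 3)
order = sorted(range(3), key=lambda j: centroids[j][0])
risk_of = {cluster: rank + 1 for rank, cluster in enumerate(order)}
risk_labels = [risk_of[lab] for lab in labels]
```

On this toy data the cost drops sharply up to k=3 and only marginally afterwards, which is the elbow pattern described above. Raw K-means cluster indices are arbitrary, so an explicit ordering step is needed before treating them as risk levels.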
In the next section, the clustering results are used to find the significant parameters that affect the COVID-19 risk level of each county by means of classification analysis.

Classification Analysis
To determine the significant factors, different classification models were applied to the data. Based on the accuracy values attained, the best model was used to select the factors with the highest importance in the transmission of COVID-19.
The classification models applied to the dataset are listed in Table 4. The data was divided into train and test sets (80% and 20%, respectively). Table 4 shows the train accuracy and the test accuracy for each model. Among the classification models, Random Forest overfitted the training data, and XGBoost performed very poorly. The SVM models, KNN, QDA, LDA, and MLR performed similarly on the test data. However, SVM and KNN do not provide interpretable coefficients. QDA performed worse than LDA and MLR because it fits a more flexible classifier than necessary. LDA and MLR performed very similarly in this study, but LDA relies on a more restrictive Gaussian assumption, and the normality assumption was not satisfied on the current dataset. Therefore, to balance the accuracy and complexity of the model, we chose the multinomial logistic regression (MLR) model as the best describing model.
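A model comparison of this kind might be set up as in the sketch below, which assumes scikit-learn is available. The data is synthetic (generated with `make_classification`), only three of the models from Table 4 are included, and the hyperparameters, scaling step, and stratified split are assumptions, not the study's settings. The resulting accuracies therefore say nothing about the study's results.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the county dataset: 16 features, 3 classes (risk clusters).
X, y = make_classification(
    n_samples=600, n_features=16, n_informative=6, n_classes=3, random_state=0
)

# 80% / 20% train-test split, as in the text (stratification is an assumption).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

models = {
    "MLR": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "LDA": LinearDiscriminantAnalysis(),
    "KNN": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)),
}

results = {}
for name, model in models.items():
    # 10-fold cross-validation accuracy on the training portion.
    cv_acc = cross_val_score(model, X_train, y_train, cv=10).mean()
    model.fit(X_train, y_train)
    results[name] = {
        "cv": cv_acc,
        "train": model.score(X_train, y_train),
        "test": model.score(X_test, y_test),
    }
```

A large gap between the train and test accuracy of a model is the overfitting symptom reported for Random Forest above.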

Evaluation
Further analyses were then performed to answer the research question: what makes one county high-risk for COVID-19 while another county is low-risk?
To extract the significant factors, a backward selection approach was used based on the p-values of the coefficients in the multinomial logistic regression model. The final significant factors were determined at the 95% confidence level. They were mean temperature, percent of people below poverty (people with income lower than the threshold determined by the United States Census Bureau [31]), percent of adults with obesity (BMI > 30 [32]), air pressure, percent of rural areas, and percent of people uninsured. It should be noted that other models with acceptable accuracy values (e.g., LDA, KNN, and SVM) suggested almost the same significant factors as MLR.
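The backward selection loop can be sketched generically as follows. This is a schematic under stated assumptions: `pvalues_for` is a hypothetical callback that would refit the multinomial logistic regression on the current feature set and return one p-value per feature, and the stopping threshold of 0.05 corresponds to the 95% confidence level. The feature names and p-values in the example are placeholders, not results from the study.

```python
def backward_select(features, pvalues_for, alpha=0.05):
    """Repeatedly drop the least significant feature until all p-values <= alpha.

    features:    iterable of candidate feature names
    pvalues_for: hypothetical callback; given the current feature list, it
                 refits the model and returns {feature: p-value}
    """
    selected = list(features)
    while selected:
        pvalues = pvalues_for(selected)
        worst = max(selected, key=lambda f: pvalues[f])
        if pvalues[worst] <= alpha:
            break  # every remaining feature is significant
        selected.remove(worst)
    return selected

# Placeholder p-values for illustration. In a real run they would change
# after every refit instead of coming from a fixed table.
toy_pvalues = {"x1": 0.001, "x2": 0.20, "x3": 0.03, "x4": 0.60}

def toy_pvalues_for(selected):
    return {name: toy_pvalues[name] for name in selected}

significant = backward_select(["x1", "x2", "x3", "x4"], toy_pvalues_for)
```

Because a multinomial model has one coefficient per feature for each non-reference cluster, a real callback must also decide how to combine those p-values into one value per feature, which the text does not specify.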

Significant Variables
In this step, the MLR model was applied only to the significant variables. The cross-validation accuracy and test accuracy were 70.45% and 66.83%, respectively. Table 5 shows the coefficients of the significant variables derived from the MLR model. The second column gives the coefficients (in log-odds units) for cluster 2 compared to cluster 1, and the third column gives the coefficients (in log-odds units) for cluster 3 compared to cluster 1. As an illustration, a one-unit increase in mean temperature multiplies the odds of a county being in cluster 2 rather than cluster 1 by exp(0.5662) = 1.7616.
Table 5 demonstrated that increasing the average temperature, percent of people below poverty, percent of adults with obesity, and percent of people uninsured would increase a county's chance of being in a cluster with a higher level of COVID-19 risk. On the other hand, increasing the air pressure and percent of rural areas would decrease that chance. Previous studies concluded that there is a positive relationship between temperature and COVID-19 cases [16,17,33]. The results of this study were in line with their conclusion: clusters 2 and 3, which had higher COVID-19 positive case and mortality rates, also had higher average temperatures.
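The odds interpretation of a log-odds coefficient can be checked numerically. The sketch below uses only the coefficient quoted in the text (0.5662 for mean temperature, cluster 2 vs. cluster 1). The function names are illustrative, and the probability helper assumes the standard multinomial logit form with cluster 1 as the reference class.

```python
import math

BETA_TEMP_2_VS_1 = 0.5662  # coefficient quoted in the text (log-odds units)

def odds_multiplier(beta, delta=1.0):
    """Factor by which the odds change when the predictor increases by delta."""
    return math.exp(beta * delta)

def class_probabilities(log_odds):
    """Multinomial logit probabilities, reference class first.

    log_odds: linear predictors of each non-reference class vs. the reference.
    """
    denom = 1.0 + sum(math.exp(v) for v in log_odds)
    return [1.0 / denom] + [math.exp(v) / denom for v in log_odds]

one_unit = odds_multiplier(BETA_TEMP_2_VS_1)        # about 1.7616
two_units = odds_multiplier(BETA_TEMP_2_VS_1, 2.0)  # the square of one_unit
equal_case = class_probabilities([0.0, 0.0])        # all three classes equally likely
```

Since the predictors were normalized before modeling, "one unit" here refers to one unit on the normalized scale rather than, for example, one degree of temperature.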
Results revealed that the percentage of people below poverty in a county was positively associated with the county belonging to a cluster with a higher level of COVID-19 risk, as shown in Table 5. Low-income people might have limited access to health products such as masks and sanitizers, which have high effects on virus transmission and vitality [34,35]. They are more likely to work outside their homes due to unstable jobs and income, and are less likely to have reliable and valid information about COVID-19 [36,37].
Moreover, low-income people are less likely to have health insurance, which corresponds to a higher percentage of people uninsured. Due to high medical expenses, they may avoid going to clinics or hospitals or using medications, which might increase the COVID-19 death rate and, as a result, increase the association of a county with a higher risk cluster.
Preprints (www.preprints.org) | NOT PEER-REVIEWED | Posted: 5 April 2021 doi:10.20944/preprints202104.0132.v1
As the Centers for Disease Control and Prevention (CDC) [38] stated, obesity increases the risk of COVID-19 death. In addition, obesity adversely affects the immune system. These statements are in line with the findings of this study: as the obesity percentage in a county increases, the chance of the county being in the high-risk cluster increases as well.
Results demonstrated that air pressure was another significant factor, and it was negatively associated with a county belonging to a higher risk cluster. Air pressure determines precipitation, wind, and weather conditions. High air pressure is associated with mild wind and calm weather [39]. As the study of Coccia [20] claimed, this decreases the transmission of COVID-19 and thus lowers the association of a county with higher risk clusters. Research conducted by Takagi et al. [40] also demonstrated that high air pressure would reduce COVID-19 prevalence.
Regarding the percent of rural areas in each county, Table 5 indicated that counties with higher percentages of rural areas mostly belonged to cluster 1 (the low-risk cluster). In rural areas, people can keep physical distance from others, which is one of the most important factors in preventing the transmission of COVID-19. However, access to advanced medical centers is sometimes more limited in rural areas.

Discussion
In this paper, several demographic, meteorological, geographic, and health factors were analyzed to determine the critical parameters that most strongly affected the transmission of COVID-19 in US counties and to determine each county's risk level. To do so, aggregated COVID-19 positive cases and deaths for each county were derived, and their rates with respect to population were calculated. Next, K-means clustering was applied to these two rates. Based on the Elbow method, three clusters were chosen and named the low-risk (low-positive-case low-death), medium-risk (medium-positive-case medium-death), and high-risk (high-positive-case high-death) clusters. The labels obtained from clustering were then used as the nominal dependent variable for classification.
In the next step, Pearson correlation was applied to all factors to remove the highly correlated ones. Sixteen variables were kept for further analyses.
Using the cluster labels, several classification models with 10-fold cross-validation were applied to the data. Based on the train and test accuracy, multinomial logistic regression (MLR) was chosen as the best model for determining the significant variables. Finally, six factors were selected: mean temperature, percent of people below poverty, percent of adults with obesity, air pressure, percent of rural areas, and percent of people uninsured. Increasing the mean temperature, percent of people below poverty, percent of adults with obesity, and percent of people uninsured would increase the likelihood of a county belonging to a higher risk cluster. In contrast, increasing the air pressure would increase the association of a county with a lower risk cluster. Counties with higher percentages of rural areas mostly belonged to the low-risk cluster.