Forecasting COVID-19 Confirmed Cases Using Machine Learning: the Case of America

This paper presents an approach based on the Multilayer Perceptron and Support Vector Machine algorithms to predict the number of COVID-19 infections in different countries of America. It is intended to serve as a tool for decision-making in tackling the pandemic that the world is currently facing. The models were trained and tested using open data from the European Union repository, in which the time series of confirmed cases was modeled until May 25, 2020. Hyperparameters such as the number of neurons per layer were set using a tabu list algorithm. The countries selected for the study were Brazil, Chile, Colombia, Mexico, Peru and the United States. The metrics used are Pearson's correlation coefficient (CP), Mean Absolute Error (MAE), and Mean Percentage Error (MPE). For the testing stage we obtained the following results: Brazil, CP=0.65, MAE=2508, MPE=17%; Chile, CP=0.64, MAE=504, MPE=16%; Colombia, CP=0.83, MAE=76, MPE=9%; Mexico, CP=0.77, MAE=231, MPE=9%; Peru, CP=0.76, MAE=686, MPE=18%; and the United States of America, CP=0.93, MAE=799, MPE=4%. These results suggest that machine learning models are useful forecasting tools, although the algorithm must be chosen according to the data and the stage of the pandemic in each country.


Introduction
The pandemic caused by the SARS-CoV-2 [1] virus is a concern for governments across the world. By studying its behavior, we can save millions of lives, as authorities can use this information to prepare for and face the crisis. The effect of the pandemic caused by this virus is in part a result of the weakness of health systems in countries worldwide. Every day, thousands of COVID-19 [2] patients arrive needing immediate assistance, which is often not available. Hence, it is necessary to build a technological tool that makes it possible to predict the number of infected people and anticipate the worst scenarios, among which we can mention the closure of the economy and the collapse of the health system. In addition, the impact depends on the characteristics of the affected population, such as their social contacts, personal economic and educational levels, and the government resources available to face the crisis [3]. Because of these different factors, there is no single contagion model, since information is scarce and the cost of measurement is still high for developing economies. This directly justifies our work, because it takes advantage of the data and information from previous experience to predict the number of infections as a time series before reaching the peak of the curve that models the behavior of the infection. We therefore propose a machine learning tool as an inexpensive and reliable prediction solution that supports fast and timely decision-making.
In this work we propose the use of a Multilayer Perceptron and a Support Vector Machine [4] to predict the time series of the number of infected cases per day, as an intelligent tool for making decisions on public health strategies in countries of America [5]. These models were trained with data previously collected for each country and were tested on the last ten days up to May 25, 2020. The hyperparameters were adjusted separately for each algorithm and country. The rest of this paper is organized as follows. Section 2, Materials and Methods, describes how we built and trained the models and adjusted the hyperparameters. Section 3, Results, compares the performance of the models for each country and presents box plots showing how the distribution of the predicted data corresponds to the testing data. Section 4, Discussion, describes related work and compares our proposal with the state of the art. Section 5 presents the conclusions.

Materials and Methods
We propose a method based on machine learning to predict the confirmed cases of COVID-19 in six different American countries, which could be used for planning in the containment stage of the pandemic. We propose two classic regression models [6] to perform this task and we compare their performance. The proposed models are the Multilayer Perceptron and the Support Vector Machine. We chose these machine learning algorithms because the amount of available data is small [7]: on average 78 records per country, which correspond to the accumulated confirmed cases day by day during the evolution of the pandemic, from the start date of measurement to May 25. The data used for this work are public and were downloaded from the open data portal of the European Union [8].

Data Description
The COVID19 database was downloaded on May 25th and was composed of 11 fields:

Proposed models
To carry out the COVID-19 forecast we proposed two machine learning regression algorithms: the Multilayer Perceptron (MLP) and the Support Vector Machine (SVM). They were trained and tested using the same data, and their hyperparameters were adjusted using information from the state of the art and the hill climbing tabu list optimization algorithm.

Multilayer Perceptron
We built a feedforward network trained with backpropagation, with two hidden layers, five neurons in the input layer and one neuron in the output layer. The number of neurons n in the first hidden layer and the number of neurons m in the second hidden layer are hyperparameters, which were selected using an optimization algorithm. Figure 1 shows the general structure of the network used [9].
The dimensions of the weight matrices of the layers depend on the values of n and m. We used the ReLU activation function in the two hidden layers. All weights in the structure were adjusted using the ADAM optimization algorithm to minimize the cost function.
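To make the dependence of the weight matrices on n and m concrete, the following is a minimal forward-pass sketch, not the authors' implementation. It assumes NumPy, a linear output neuron, zero-initialized biases, and the illustrative values n=8 and m=4; the function names are hypothetical. Training with ADAM is omitted.

```python
import numpy as np

def relu(z):
    """ReLU activation used in the two hidden layers."""
    return np.maximum(z, 0.0)

def init_weights(n, m, n_inputs=5, seed=0):
    """Create weight matrices whose shapes depend on n and m.

    Weights are drawn from a standard normal distribution (mean 0,
    variance 1); zero biases are an assumption of this sketch.
    """
    rng = np.random.default_rng(seed)
    shapes = [(n_inputs, n), (n, m), (m, 1)]
    weights = [rng.standard_normal(shape) for shape in shapes]
    biases = [np.zeros(shape[1]) for shape in shapes]
    return weights, biases

def forward(x, weights, biases):
    """Propagate a batch of 5-value inputs through the network."""
    h = np.asarray(x, dtype=float)
    # Hidden layers: affine map followed by ReLU.
    for W, b in zip(weights[:-1], biases[:-1]):
        h = relu(h @ W + b)
    # Output layer: one linear neuron (assumed, suitable for regression).
    return h @ weights[-1] + biases[-1]

# Illustrative sizes only; the paper selects n and m by optimization.
weights, biases = init_weights(n=8, m=4)
batch = np.ones((3, 5))            # three samples, five inputs each
output = forward(batch, weights, biases)
```

With n=8 and m=4 the three matrices have shapes (5, 8), (8, 4) and (4, 1), so changing n or m changes the number of trainable weights.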

Support Vector Machine
We proposed the Support Vector Machine as a regression model, which is known in the literature as Support Vector Regression. This model relies on the support vectors of the data to trace hyperplanes that delimit the zones in which the data lie. Figure 2 shows example hyperplanes for this application with two features and a linear kernel [10].

Figure 2. Support Vector Machine
The Support Vector Machine works by maximizing the margin between the hyperplanes built from the support vectors of the data, which is the reason for the name of the model. The task thus reduces to an optimization problem in which Eq. (3) is the objective function and Eq. (4) gives the constraints.
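For orientation, the textbook ε-insensitive Support Vector Regression problem can be written as follows. This is the standard formulation and is only assumed to correspond to the objective and constraints referred to above; the symbols w, b, C, ε, ξ_i and ξ_i^* follow common usage and are not taken from the original equations.

```latex
\begin{aligned}
\min_{w,\,b,\,\xi,\,\xi^{*}} \quad & \frac{1}{2}\lVert w\rVert^{2} + C\sum_{i=1}^{N}\left(\xi_i + \xi_i^{*}\right) \\
\text{subject to} \quad & y_i - \left(w^{\top}x_i + b\right) \le \varepsilon + \xi_i, \\
& \left(w^{\top}x_i + b\right) - y_i \le \varepsilon + \xi_i^{*}, \\
& \xi_i,\ \xi_i^{*} \ge 0, \qquad i = 1,\dots,N.
\end{aligned}
```

Here x_i is an input vector, y_i its target, ε the width of the tube within which errors are not penalized, and C the weight given to deviations outside that tube.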

Adjusting the hyperparameters
For the Multilayer Perceptron, we adjusted the hyperparameters in two ways: general hyperparameters were adjusted once for the data of all countries, and specific hyperparameters were adjusted for the data of each country. The former were static and were set based on the state of the art [11]. The latter were adjusted using the hill climbing tabu list algorithm [4][5][6][7][8][9][10][11][12]. Table 2 shows the hyperparameters for this machine learning structure. The values n and m were adjusted using the hill climbing tabu list optimization algorithm [13]. Figure 3 shows the convergence of the algorithm and the number of steps taken to reach the best value in terms of the MAE metric [14]. We selected this metric for optimization because the deviation in the number of infected cases per day is the most relevant quantity. In the case of the Support Vector Machine, we adjusted the hyperparameters based on the methods proposed in [15]. Table 3 shows the hyperparameters evaluated for each combination using the grid method.
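The search over n and m can be illustrated with a minimal hill climbing sketch that keeps a tabu list of recently visited points. It is a sketch under stated assumptions, not the procedure used in the paper: the neighborhood (changing n or m by one), the tabu list size, the bounds, and the stand-in cost function `toy_mae` are all hypothetical. In practice the cost would be the MAE of an MLP trained with the candidate (n, m).

```python
def tabu_hill_climb(cost, start, bounds, max_steps=50, tabu_size=10):
    """Minimize cost(n, m) over integer pairs.

    At each step the search moves to the best neighbor that is not in
    the tabu list, even if it is worse than the current point, and
    remembers the best point seen so far.
    """
    low, high = bounds
    current = start
    best, best_cost = start, cost(*start)
    tabu = [start]
    history = [best_cost]          # best cost after each step
    for _ in range(max_steps):
        n, m = current
        neighbors = [
            (n + dn, m + dm)
            for dn, dm in ((1, 0), (-1, 0), (0, 1), (0, -1))
            if low <= n + dn <= high and low <= m + dm <= high
            and (n + dn, m + dm) not in tabu
        ]
        if not neighbors:          # every move is forbidden: stop
            break
        current = min(neighbors, key=lambda point: cost(*point))
        tabu.append(current)
        if len(tabu) > tabu_size:  # forget the oldest tabu entry
            tabu.pop(0)
        current_cost = cost(*current)
        if current_cost < best_cost:
            best, best_cost = current, current_cost
        history.append(best_cost)
    return best, best_cost, history

def toy_mae(n, m):
    """Stand-in cost with a known minimum at (12, 7); not a real MAE."""
    return abs(n - 12) + abs(m - 7)

best, best_cost, history = tabu_hill_climb(toy_mae, start=(2, 2), bounds=(1, 32))
```

The `history` list plays the role of a convergence curve: it records the best cost found after each step and never increases.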

Training and testing the models
To train and test the proposed regression models, we formed input and target matrices in such a way that the rows correspond to the last 5 values of the time series and the columns to the day-by-day progress of the COVID-19 confirmed cases series. To do so, we created a sliding window of length 5, taking the last 5 values as inputs and the current value as the target. We repeated this step until we reached the point of division between the training and test data. The percentages we used were 80% and 20%, respectively. Figure 4 describes this process.
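The windowing and splitting steps can be sketched as below. This is an illustrative sketch: it stores one sample per row (an assumed layout), the function names are hypothetical, and the series is made-up toy data rather than the European Union records.

```python
def make_windows(series, window=5):
    """Build (inputs, targets) pairs with a sliding window.

    Each input holds `window` consecutive values and the target is the
    value that immediately follows them.
    """
    inputs = [list(series[i:i + window]) for i in range(len(series) - window)]
    targets = list(series[window:])
    return inputs, targets

def chronological_split(inputs, targets, train_fraction=0.8):
    """Split without shuffling, so the test set holds the latest days."""
    cut = int(len(inputs) * train_fraction)
    return (inputs[:cut], targets[:cut]), (inputs[cut:], targets[cut:])

# Toy cumulative series (made-up numbers, not real case counts).
cases = [day ** 2 for day in range(1, 21)]
inputs, targets = make_windows(cases)
(train_X, train_y), (test_X, test_y) = chronological_split(inputs, targets)
```

Keeping the split chronological matters for time series: shuffling before splitting would let the model see values from days that come after some of the test targets.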

Proposed System
The proposed system consists mainly of five parts: the inputs, the predictor module, the accumulation stage, the performance measurement stage, and the automatic adjustment and initialization stage. Figure 5 shows the model of the proposed solution.
Inputs: In this stage we organized the inputs so that each prediction was made from the number of people infected in the last 5 days of the time series for each country. Note that the inputs are shifted and the oldest one is discarded as each output is fed back as a new input, thus predicting a sequence of values. This is how we forecasted the time series.
Forecasting Algorithm: We selected the prediction algorithm based on its performance. For the proposed system the two possible algorithms are the Multilayer Perceptron and the Support Vector Machine. These time series regression models are appropriate for this application, and the one with the best performance was chosen for each country, since it was not possible to obtain a single regression model that performed best for all countries.
Accumulators: In order to measure the performance of the applied algorithms, we accumulated the predictions for 10 consecutive days and compared them with the corresponding values contained in the open data repository of the European Union from May 16 to May 25, 2020.
Performance Measurement: At this stage, we measured the performance of the two proposed algorithms on the same data set. We used the 3 evaluation metrics described in the next section. The values obtained here were used as an input for the hyperparameter optimization algorithm. In addition, these performance measures were the selection criteria for the model to be used for a specific country data set.
Automatic Initialization and Adjustment: It was important that the models to be used were correctly tuned. To achieve this, we ran an optimization algorithm that automatically sets the hyperparameter values. This helps to ensure that the selected model is the most suitable one for the task of regressing the number of SARS-CoV-2 infections. For this work, the selected optimization algorithm is hill climbing tabu list. The random weights are initialized from a normal distribution with mean 0 and variance 1.
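The feedback of each output as a new input, described for the input stage, can be sketched as a recursive forecasting loop. This is a minimal sketch: `naive_trend_model` is a hypothetical stand-in for the trained MLP or SVM, and the observed values are made up.

```python
def recursive_forecast(predict_next, history, horizon=10, window=5):
    """Forecast `horizon` steps by feeding each prediction back as input."""
    buffer = list(history[-window:])
    forecasts = []
    for _ in range(horizon):
        next_value = predict_next(buffer)
        forecasts.append(next_value)
        # Discard the oldest input and append the new prediction.
        buffer = buffer[1:] + [next_value]
    return forecasts

def naive_trend_model(last_values):
    """Stand-in predictor: last value plus the mean daily increase."""
    steps = [b - a for a, b in zip(last_values, last_values[1:])]
    return last_values[-1] + sum(steps) / len(steps)

# Made-up cumulative counts; only the last five are used as inputs.
observed = [100, 110, 120, 130, 140, 150]
forecast = recursive_forecast(naive_trend_model, observed, horizon=10)
```

Because later predictions depend on earlier ones, errors can compound over the horizon, which is one reason to compare the accumulated 10-day forecast against the observed values.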

Performance Metrics
To measure the forecasting performance of our proposed models we used the metrics presented in Eq.
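As an illustration, the three metrics named in this work (Pearson's correlation coefficient, MAE and MPE) can be computed as in the sketch below. The exact MPE definition is not recoverable from the text, so this sketch assumes the mean of the absolute relative errors expressed as a percentage; the function names and sample values are hypothetical.

```python
import math

def pearson(actual, predicted):
    """Pearson's correlation coefficient between two sequences."""
    n = len(actual)
    mean_a = sum(actual) / n
    mean_p = sum(predicted) / n
    cov = sum((a - mean_a) * (p - mean_p) for a, p in zip(actual, predicted))
    var_a = sum((a - mean_a) ** 2 for a in actual)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov / math.sqrt(var_a * var_p)

def mae(actual, predicted):
    """Mean Absolute Error, in the same units as the data (cases)."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def mpe(actual, predicted):
    """Mean percentage error, ASSUMED here to use absolute relative errors."""
    ratios = [abs(a - p) / abs(a) for a, p in zip(actual, predicted)]
    return 100.0 * sum(ratios) / len(ratios)

# Made-up values for illustration only.
actual = [100, 200, 300, 400]
predicted = [110, 190, 310, 390]
scores = (pearson(actual, predicted), mae(actual, predicted), mpe(actual, predicted))
```

MAE is scale dependent, so it is not directly comparable between countries with very different case counts, whereas the correlation coefficient and the percentage error are.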

Results
The results obtained in the adjustment, training and testing stages of the proposed models are shown below. Table 4 shows the adjusted hyperparameters and their values for each of the models. In Table 5 we highlight in bold the best performance obtained for each country: the Multilayer Perceptron gives better results for Chile, Mexico and the USA, and the Support Vector Machine for Brazil, Colombia and Peru. The largest Mean Percentage Error was found for Peru. Figures 6 to 11 show box plots of the real values versus the predicted values for each country. These charts indicate the correspondence between the distributions, giving an idea of the performance of each model. All distributions match acceptably; the largest deviation was found in the Peru chart.

Discussion
The proposed machine learning methods enable us to forecast the confirmed COVID-19 cases with good performance, as shown in Table 2. However, which model is best depends on the behavior of the data in each country: in some cases the Multilayer Perceptron performs better, and in others the Support Vector Machine. In this sense, this work contributes to demonstrating that the use of different regression models based on machine learning yields a reliable and adaptable tool for predicting the behavior of the pandemic in each country. It also contributes to demonstrating the need to fine-tune the hyperparameters of the models to obtain the best performance, as shown in Table 1. Similarly, we were able to verify that the distributions of the predicted data and the real data are similar, as observed in figures 1 to 9. Furthermore, figure 10 verifies the technique presented in [4] applied to the prediction of SARS-CoV-2 infected cases in different countries of Latin America. In [16] clustering and autoencoder techniques are used to forecast infected cases in China, which allows us to highlight the joint use of artificial intelligence algorithms as a powerful tool in this type of application for health care and decision-making. That work differs from ours in the methods used, as we apply supervised learning in all cases.
In [17], [18] and [19], classical statistical-mathematical models such as ARIMA and optimization heuristics such as the flower pollination algorithm and particle swarm optimization are used for pandemic forecasting in Iran, in China and globally, respectively. These works are similar to ours in that they seek to predict the time series of confirmed cases in the growth curve of the pandemic. However, they differ in that the authors did not use machine learning algorithms to carry out this task. In [20] the application of autoregressive neural networks to the prediction of SARS-CoV-2 confirmed cases in Egypt is presented. This work is very similar to ours, as the authors use one of the models that we propose and apply it almost identically. However, the authors did not adjust the hyperparameters of the MLP structure using an optimization algorithm such as hill climbing tabu list. Moreover, they do not compare the performance with another machine learning method, although they do compare it with a statistical method, ARIMA.
These works lead us to consider the main limitation of our research: the small amount of data available and used, which did not allow us to train more sophisticated regression models based on Deep Learning that could perform better. In this sense, it is possible that, as the infectious phenomenon evolves, the models that we propose will require retraining and a new adjustment of hyperparameters to follow the trend of the curve. Finally, due to the characteristics of the pandemic, we found a need to include other variables related to the characteristics of each country in order to improve the performance of the models used. Variables such as gross domestic product and population density could contribute significantly to forecasting the number of confirmed cases of COVID-19. It is also important to note that we did not find related works in the state of the art that forecast the confirmed cases of COVID-19 on the American continent.

Conclusion
Public health requires efficient planning in the administration of resources, which are limited in developing countries. Prediction tools allow efficient management of government resources to face the problems caused by the COVID-19 pandemic. In this way, machine learning-based prediction systems are valuable tools in decision-making that can help to save thousands of lives. For this reason, our model contributes to the development of tools for building plans aimed at facing the world's current public health problem.
This research work shows that an MLP can predict better when an optimization algorithm, such as the hill climbing tabu list algorithm, is applied to determine the hyperparameters (number of layers and number of neurons per layer). However, a global minimum [21] was not found in all cases, making it necessary to use a Support Vector Machine when the MLP did not perform well. In general, the comparison of the resulting evaluation metrics provides evidence that the proposed methods could be used as a forecasting tool in the COVID-19 pandemic before reaching the peak, while the data are increasing quickly.
With the proposed optimization algorithms, improvements were achieved in the performance metrics of the MLP model for Chile, Mexico and the USA. The Pearson correlation coefficient shows that the predictions follow the trend of the data, and the MAE and MPE show good performance. The SVM, in turn, performs well on the same metrics for Brazil, Colombia and Peru. We can infer that it is important to evaluate different machine learning models in order to approach the global minimum for each data set.