A two-stage polynomial and nonlinear growth approach for modeling the COVID-19 outbreak in Mexico

Since December 2019, the coronavirus disease (COVID-19) has rapidly spread worldwide. The Mexican government has implemented public safety measures to minimize the spread of the virus. In this paper, the authors use statistical models in two stages to estimate the total number of coronavirus (COVID-19) cases per day at the state and national level in Mexico. Two types of models are proposed: first, a polynomial model of the growth for the first part of the outbreak until the inflection point of the pandemic curve and then a second nonlinear growth model is used to estimate the middle and the end of the outbreak. Model selection will be performed using Vuong’s test. The proposed models show overall fit similar to predictive models (e.g. time series, and machine learning); however, the interpretation of parameters is less complex for decision-makers and the residuals follow the expected distribution when fitting the models without autocorrelation being an issue.


Introduction
The world is currently experiencing a pandemic caused by the novel coronavirus, formally named COVID-19 by the World Health Organization (WHO). Development of a vaccine and antiviral drugs to treat COVID-19 is still ongoing resulting in hospitalization and intensive care unit management as the only option in treating COVID-19. Thus, there is a dire need for research on modeling the outbreak of COVID-19 to help officials in their decision-making processes regarding interventions and allocation of resources [22]. At the moment this manuscript was being written, the pandemic was ongoing and most of the epidemiological models developed focused on short-term predictions, identifying the daily peak of COVID-19 cases, predicting the duration of the pandemic, and estimating the possible impact of the measures implemented for minimizing exposure to the virus and decrease the fatality rate [1] [3][11] [12][14] [15][17] [25].
As of October 24, 2020, the cumulative number of COVID-19 cases in Mexico was reported 874,171 and 42, 055, 863 cases worldwide [24]. Thus, the main objective of this paper is to model the total number of COVID-19 cases per day at the national and the state level in Mexico while simultaneously providing straightforward information to decision-makers; additionally, the authors seek to determine which model provides the most stable short-term predictions. The models developed in this research support facilitate obtaining information that supports the strategic planning activities of the Mexican states, metropolitan areas, municipalities, or cities with high population density. Mexican officials can use these models to aid in the management process involving the needs and resources of the health services such as available hospital beds, intensive care units, and respirators, personal protective equipment (PPE) for health personnel. For stakeholders, such as public health officials, having access to a daily and permanent monitoring at the center of the pandemic allows them to anticipate the purchase of health needs in advance.
Further, the authors would like to share these models so that officials and statisticians outside of Mexico can make use of them for their own decision-making procedures during the length of COVID-19 pandemic. The proposed methodology in this paper can easily be applied to COVID-19 worldwide.

Literature Review
Research exists with data-driven approaches such as AR (Autoregressive), ARIMA (Autoregressive Integrated Moving Average) ranging from simple models (exponential smoothing) to more complex models such as ARIMAX, ARCH, GARCH, ARFIRMA [2][6] [7][9] [10]. For example, an ARIMA model with data compiled by John Hopkins University was used to predict models for the daily confirmed cases in countries where the pandemic was peaking and to predict and anticipate the resources of the healthcare systems [5]. Unfortunately, these data-driven models fail to fit the data and often lack accuracy [4][5] [14]. More importantly, the parameters cannot be interpreted according to the reality of the pandemic, in other words, it is not possible to obtain information from the growth of the pandemic which causes statisticians, and officials to make their decisions based on predictive models not based on the peak of the pandemic or the growth of the pandemic. Another approach is real-time forecasting using a generalized Logistic growth model. This method has been previously used in China to generate short-term forecasting of COVID-19 cases [16] [20] and with data from Canada, France, India, South Korea, and the UK to forecast daily cases [4]. These models are incredibly useful, they provide information on the current state of the pandemic; however, in this study, we aim to demonstrate that the assumption of independence of observations is not met and that its modeling performance at the earlier stages of the pandemic is not optimal [15] [16].
The models used in this paper are based on statistical linear models, classic time series and restricted growth called limited growth or nonlinear growth models [3][8] [9][12] [13][21] [22], and real-time forecasting using generalized Logistic growth model [16] [19]. In this paper we propose to estimate in two stages of the pandemic utilizing polynomial and nonlinear growth functions while incorporating an autoregressive component with the purpose of meeting the assumption of independence of residuals. First, we propose to use a polynomial function estimated using the Prais-Winsten methodology to estimate the first stage of the pandemic (when exponential growth of COVID-19 cases is observed). Next, in the second stage of the pandemic, (when the peak of COVID-19 cases has been reached) we propose to utilize nonlinear growth functions, Logistic and Gompertz, in to predict the total cumulative number of cases and the growth rate of the spread of COVID-19. In each of the models estimated before and after the peak of the pandemic, at the second stage of modeling we added an autoregressive component of order one (AR(1)) to compare the results to the models that do not account for the violation of independence of residuals. This approach has been successfully used to model plant and animal growth where measuring the same unit can lead to violation of independence of residuals [15] [16]. Next, we will select the most useful model to modeling the pandemic by using Vuong's test [23]. We anticipate the proposed model to have good performance similar to Neural Networks (NN) or Support Vector Machines (SVM). A disadvantage of NN is that it is difficult to generate a day-to-day prediction in addition to finding the growth rate and the maximum number of cases. These are not a problem for the functions we propose. Furthermore, artificial intelligence (AI) has been used to identify, track, and forecast COVID-19 cases. However, AI models are difficult to interpret, the process by which they arrive at a decision is often referred to as a "black box" due to the complexity in understanding how AI models arrive at a certain conclusion [26]. The complex interpretation may create a barrier for decision-makers looking for straightforward solutions in the middle of a global pandemic.
Moreover, we anticipate our proposed modeling of the pandemic will meet the assumption of independence of residuals. Additionally our proposed model, in contrast to NN, will provide predictions and will make it possible to interpret parameters such as the highest number of infected, growth speed, the initial number of infected and the autoregressive parameter to measure the lag in reporting the new daily cases of infections.
Therefore, the objectives of this paper were to (1) specify the polynomial function and nonlinear growth models (Logistic and Gompertz) that include an autoregression component for dealing with the autocorrelated observations in the growth data in two stages: before and after the inflection point of the pandemic is reached, and (2) compare the different polynomial and nonlinear growth functions in their ability to describe the number of cases of the COVID-19 pandemic.

Dataset
The COVID-19 data used in this research was obtained from the publicly available data of the Mexican Secretaria de Salud Federal which contained the number of cases confirmed from February 27 th to September 24 th , 2020. The dataset also included the number of recovered patients and fatalities and can be downloaded in .csv format from the government's website (https://www.gob.mx/salud/documentos/datos-abiertos-152127). Data to test the models focused on the national level and three Mexican states: Campeche, Quintana Roo, and Tamaulipas. To demonstrate the benefit of utilizing the autoregressive term we will use data from the state of Aguascalientes. The time series beginning point (t =1) established was the day where positive COVID-19 had not been not reported. For example, the state of Tamaulipas, the first positive COVID-19 case was March 17, 2020, thus March 16, 2020 was considered (t =1). Analyses were conducted in R (version 3.6.2) and STATA (version 15.1)

Model Selection
This paper proposes the utilization of a two stage approach: First, Stage 1 model was fit throughout the pandemic and before reaching the peak of cases (before the inflection point), and in the second stage once the peak of cases has been reached a different type of model will be used. We will utilize Vuong test's for model selection [23].

Stage 1: Before the inflection point
In this paper, we examined a variety of models that successfully model the behavior of the pandemic (e.g. the maximum number of cases, growth rate, etc.) while simultaneously meeting the assumption of independence of residuals in the data. Using the Akaike Information

Stage 2: After the inflection point
To examine which were the best models once the peak of positive COVID-19 cases had been reached, we focused on the same three state data which showed good fit up to the peak of the pandemic. We sampled to data points after the inflection point: 20 days after and the day before we began writing the results of this paper (September 24, 2020). The results obtained from this stage of the pandemic will be compared with those obtained before the inflection point of the pandemic. We hypothesized that for the stage before the peak of cases one of the nonlinear growth models will have better fit.

Statistical Procedures
The complete polynomial model used in this work has the form: where is the number of total positive COVID-19 cases reported at time , the coefficients 0 , 1 , 2 , 3 are the parameters of the model to estimate with −1 the autoregressive component and is a random error term, with = 0, 1, 2, … . The value = 1 is selected equal to the day of the first positive case.
The Logistic model used the following form: where is the number of total positive COVID-19 cases reported at time t, the coefficients The Gompertz model used the following form: where is the number of total positive COVID-19 cases reported at time t, the Thus, the results will provide a p-value from Vuong's test which can be compared to a significance level α and aid in selecting the model that best fits the data. Graphically, we can observe in Figure 1 the approximate modeling of the pandemic, identify the peak (when there is a change in the growth rate), and the steps where the models of interest in the project will be used.

Results
The dates when the peak of COVID-19 cases was reached at the national level and in the three Mexican states selected for the study can be seen in Table 1.  Table 2 we can see the p-value obtained for the state of Tamaulipas on September 24, was 0.9703 for the H1A and its complement 0.0397 is the p-value for the H1B. At the significance level α = 0.05, we reject the null hypothesis in favor of H1B implying the Logistic model fits the data better than the polynomial model.  Table 3 summarizes which models fit better for each region at each of the six time points examined. In general, it is easy to see that the nonlinear growth models do not fit better than the polynomial models on dates after the peak of the pandemic has been reached. On the other hand, when examining the dates prior the peak of the pandemic there was a negligible difference between the models. More importantly, examining the model comparisons in Table 3 for dates before the peak and during the peak of the pandemic it is clear that no model worked better than the others. In other words, when modeling growth rate of COVID-19 cases there is no difference between the models. However, in the late phase of the pandemic, after the inflection point, the nonlinear growth models perform better than the polynomial model fit. Polynomial and nonlinear growth models were useful for modeling the beginning of the epidemic, see Figure 1, until reaching the maximum peak of daily cases for the pandemic [4]. On the other hand, nonlinear growth models are more accurate and effective when more information is available, and the maximum peak of daily cases has been reached. Furthermore, time series models allow practical real-time monitoring of when (a) exponential growth is beginning, (b) exponential growth is in effect, and (c) exponential growth is about to end which indicates that the epidemic is reaching its end. Finally, the nonlinear growth models are useful to describe the behavior at the end of the pandemic and allow monitoring and detection of a possible double peak of the epidemic. In general, the Logistic and Gompertz models had the better fit. However, as the pandemic advanced the Gompertz estimated the late phases of the pandemic better.

Limitations
The research we presented in this paper is not without limitations. The projections resulting from the models were estimated without considering any type of intervention. In other words, the intervention effects such as "stay at home" and "social distancing" are not considered in the model.
Due to the characteristics of the models used and the atypical characteristics of the COVID-19 pandemic, the results and information derived from the monitoring strategy, well known in epidemiology science as "SENTINEL contention model," are considered to adjust: 1.
The actual data shows a lot of variability from one day to the next; and 2.
Projections are very successful in a time frame of 1 to 10 days after which it is necessary to re-update the models to maintain the desired monitoring, precision, and reliability.

3.
In some cases, such as the state of Aguascalientes, it was necessary to add a second order autoregressive component.
The benefit of utilizing a second order autoregressive component is shown

Discussion
The modeling of the COVID-19 outbreak is a unique challenge for statisticians considering the data is limited, and there are often delays in updating the data. Our approach to modeling the pandemic can provide assistance to decision-making officials for containing and anticipating a "double peak" of the COVID-19 pandemic in Mexico. We divided the pandemic data into two stages, in stage one we implemented a polynomial model of order three with an autoregressive error component of order one applying the Prais-Winsten or Cochrane-Orcutt estimation methodology to control the autocorrelation in the data. In stage two, when the peak of COVID-19 cases has been reached we fit nonlinear growth functions, Logistic and Gompertz, to predict the total cumulative number of cases and the growth rate of the spread of COVID-19. Further, the models used in this paper are different than those used in predictive models using time series or machine learning; however, our model meets the assumption of independence of residuals and the interpretability of our model is superior to those of machine learning particularly for government officials without a statistics or machine learning background.
In summary, this paper showed the efficiency of utilizing two different types of models estimated in two stages (stages depending on the state of the pandemic). The majority of the efforts to model the COVID-19 pandemic are nonlinear models, such as Logistic and Gompertz [4] [5]. However, these models do not take into account autoregression thus making short, medium, and long-term predictions that could possibly be skewed. We recommend, due to parsimony in fitting the model, that in the early stages of the pandemic where there is an exponential growth (during or before the peak) to utilize a polynomial model of the third order estimated in with an autoregressive component. For the later, more advanced stages when the peak of the pandemic has been reached, we recommend utilizing a nonlinear growth model (Logistic or Gompertz) estimated with an autoregressive component.