Short-term forecasts of COVID-19 spread across Indian states until 1 May 2020

The first case of coronavirus illness in India was recorded on 30 January 2020, and the number of infected cases and the death toll continue to rise. In this paper, we present short-term forecasts of COVID-19 for 28 Indian states and five union territories using real-time data from 30 January to 21 April 2020. Applying Holt's second-order exponential smoothing method and the autoregressive integrated moving average (ARIMA) model, we generate 10-day ahead forecasts of the likely number of infected cases and deaths in India for 22 April to 1 May 2020. Our results show that the number of cumulative cases in India is likely to rise to 36335.63 [95% PI (30884.56, 42918.87)], while the number of deaths may increase to 1099.38 [95% PI (959.77, 1553.76)], by 1 May 2020. Further, we have divided the country into severity zones based on the cumulative cases. According to this analysis, Maharashtra is likely to be the most affected state, with around 9787.24 [95% PI (6949.81, 13757.06)] cumulative cases by 1 May 2020. However, Kerala and Karnataka are likely to shift from the red zone (i.e. highly affected) to a less affected zone. On the other hand, Gujarat and Madhya Pradesh are likely to move to the red zone. These results identify the states where the lockdown could be loosened by 3 May 2020.


Introduction
COVID-19, an ongoing epidemic that started in Wuhan city, China, in December 2019, continues to cause infections in many countries around the world [1]. Considering the scale and speed of transmission of COVID-19, the World Health Organization (WHO) declared it a pandemic on 11 March 2020 [2]. Since then, COVID-19 has become a threat to human life across the planet. It has spread rapidly in almost all countries, and no cure is currently available for the virus. Governments have introduced precautionary measures such as social distancing, sanitization of streets and markets, quarantine of suspected and infected cases, and lockdown of communities at different scales (colonies, towns, states, and countries). In India, the exponential growth seen in the USA and European countries has not been observed. This is attributed to the measures taken by the Indian government, and it indicates that measures such as lockdown strongly influence the transmission behavior of COVID-19. On the other hand, these measures cause substantial economic losses to communities, and hence cannot be imposed for long periods. Developing countries such as India, in particular, cannot afford such costs beyond a limited time. The Indian government has continuously reviewed the hour-by-hour situation in every state and has become more focused on localizing the lockdown to particularly alarming states and the few towns that are hotspots for COVID-19. For all these reasons, it is important to have short-term forecasts that can serve as a steering point for decision-makers and administrations. In this connection, data-based statistical models such as the autoregressive integrated moving average (ARIMA) model and Holt's method have shown effectiveness in short-term forecasting of diseases including dengue fever [3,4], hemorrhagic fever with renal syndrome [5], tuberculosis [6] and COVID-19 [7].
ARIMA has been reported to perform better than other prediction models, such as the support vector machine and the wavelet neural network, for drought forecasting [8]. Exponential smoothing methods have also been widely used, for example to forecast the population of West Java [9] and the inflation rate of Zambia [10], and to predict epidemics of mumps [11] and COVID-19 [12][13][14][15][16][17]. For India, however, short-term forecasting has not yet been done thoroughly. As India is diverse across its states, it is essential to study the spreading behavior of COVID-19 in different Indian states. This article presents short-term forecasts for the Indian states that are severely infected.
The main objective of the present paper is to present 10-day ahead forecasts, from 22 April to 1 May 2020, of the cumulative number of infected cases and deaths due to COVID-19. This work also analyzes Indian states at the regional level to understand the spread of infection. The current situation in India is shown in Figure 1, with the cumulative number of infected cases and deaths from 30 January to 21 April 2020.

2.1 ARIMA Model
Processes whose statistical properties do not change with time, i.e. processes with constant mean and constant variance, are known as stationary processes and form an important class of stochastic processes. Mathematically, for a stationary process the joint distribution of $X_{t_1}, \dots, X_{t_n}$ is the same as that of $X_{t_1+\tau}, \dots, X_{t_n+\tau}$. Simply put, shifting the origin of time by a quantity $\tau$ does not change the statistical properties of the process. In practice, most real-time series are not stationary, as they have no fixed mean. An important class of models is that for which the $d$-th difference of the time series is a stationary mixed autoregressive moving average (ARMA) process. These models are known as ARIMA models. The ARMA model, introduced by Box and Jenkins, is one of the most popular approaches to modeling and analyzing time series [18]. It is formed by merging two models, the autoregressive AR(p) model and the moving average MA(q) model, and is directly applicable to time series with stationary behavior. If the series is non-stationary, it must be differenced to make it stationary. The ARMA model applied after differencing is known as ARIMA(p, d, q). The general ARIMA model is given by

$$\phi(B)\,\nabla^d X_t = \theta(B)\,\varepsilon_t \qquad (4)$$

The expressions in Eq. 4 are defined as: $\phi(B) = 1 - \phi_1 B - \dots - \phi_p B^p$ is the autoregressive polynomial of order p; $\theta(B) = 1 + \theta_1 B + \dots + \theta_q B^q$ is the moving average polynomial of order q; $B$ is the backshift operator, $B X_t = X_{t-1}$; $\varepsilon_t$ is a white noise error term; $\nabla = 1 - B$ is an operator, known as the difference operator, used to make the time series stationary by differencing; and d is the order of differencing. In real-time data, taking the first difference (d=1) is usually sufficient, and occasionally the second difference (d=2) is needed to achieve stationarity. The Akaike Information Criterion (AIC) is one of the essential criteria for selecting between competing models. Mathematically,

$$\mathrm{AIC} = -2\log(L) + 2k$$

where $L$ is the maximized likelihood of the model and $k$ is the number of estimated parameters. The model with the lowest AIC is selected as the best model.
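The two ingredients described above, differencing and AIC-based comparison, can be illustrated with a minimal Python sketch. It is not the code used in this study (which relied on R); the function names and the toy series are assumptions made purely for illustration, and the AIC is written in its standard form, 2k - 2 log L, with k the number of estimated parameters and L the maximized likelihood.

```python
def difference(series, d=1):
    """Apply the difference operator d times: each pass maps x_t to x_t - x_{t-1}."""
    out = list(series)
    for _ in range(d):
        out = [b - a for a, b in zip(out, out[1:])]
    return out


def aic(log_likelihood, n_params):
    """Akaike Information Criterion in its standard form: 2k - 2 log L."""
    return 2 * n_params - 2 * log_likelihood


# Toy series with a quadratic trend (illustrative values, not COVID-19 data).
toy = [t * t for t in range(6)]      # 0, 1, 4, 9, 16, 25
first = difference(toy, d=1)         # 1, 3, 5, 7, 9  -> still trending
second = difference(toy, d=2)        # 2, 2, 2, 2     -> trend removed
```

In this toy case two differences are needed because the trend is quadratic; for a series with a roughly linear trend, one difference is enough.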
The autocorrelation function (ACF) and the partial autocorrelation function (PACF) are used to select the order of the moving average process MA(q) and the autoregressive process AR(p), respectively. To investigate the stationarity of a time series, the Kwiatkowski-Phillips-Schmidt-Shin (KPSS) [19] and Augmented Dickey-Fuller (ADF) [20] tests are used. In each test, the null hypothesis is rejected when the p-value is smaller than the significance level; the null hypothesis of the KPSS test is that the series is stationary, whereas that of the ADF test is that the series has a unit root (i.e. is non-stationary).
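As a rough illustration of these diagnostics, the sketch below computes the sample ACF by hand and encodes the p-value decision rule for the two tests. The function names are hypothetical, the p-values are assumed to come from an external ADF/KPSS implementation, and the 1.96/sqrt(n) band is the usual large-sample approximation for white noise rather than anything specific to this study.

```python
import math


def sample_acf(series, max_lag):
    """Sample autocorrelations r_0..r_max_lag (series must not be constant)."""
    n = len(series)
    mean = sum(series) / n
    dev = [x - mean for x in series]
    denom = sum(v * v for v in dev)
    return [
        sum(dev[t] * dev[t + k] for t in range(n - k)) / denom
        for k in range(max_lag + 1)
    ]


def significance_bound(n):
    """Approximate 95% band for the ACF of white noise: 1.96 / sqrt(n)."""
    return 1.96 / math.sqrt(n)


def looks_stationary(adf_p, kpss_p, alpha=0.05):
    """ADF null: unit root. KPSS null: stationarity.

    The series is treated as stationary when ADF rejects its null
    and KPSS does not reject its null.
    """
    return adf_p < alpha and kpss_p >= alpha


acf = sample_acf([1, 2, 3, 4, 5], max_lag=2)
```

Spikes in the ACF that fall outside the band suggest candidate values of q; the same reading applied to the PACF suggests candidate values of p.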

Holt's Method
The numbers of confirmed cases and deaths in India are increasing day by day, as shown in Figure 1, so the time series exhibit a trend. Simple exponential smoothing is therefore not appropriate in this case. When the data show a trend but no seasonality, Holt's method is a primary tool to handle it. Holt's method is a double exponential smoothing method (not based on the ARIMA approach) with two parameters. It decomposes the time series into two components, the level and the trend, denoted by $B_t$ and $M_t$ respectively. These two components are updated as follows:

$$B_t = \alpha X_t + (1-\alpha)(B_{t-1} + M_{t-1})$$
$$M_t = \beta (B_t - B_{t-1}) + (1-\beta) M_{t-1}$$

The forecast values $\hat{X}_{t+h}$ of the time series can be calculated by:

$$\hat{X}_{t+h} = B_t + h M_t$$

where $h$ is the number of periods into the future, and $\alpha$ and $\beta$ are the smoothing constants. Various statistical modelling tools in the R language platform were used to evaluate the time series of infected cases and deaths for prediction purposes.
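The level and trend recursion of Holt's method is short enough to sketch directly. The Python below is an illustrative implementation of standard double exponential smoothing, not the R routine used in this study; the initialization (level set to the first observation, trend set to the first difference) is one common choice and is an assumption here, as are the function name and the toy data.

```python
def holt_forecast(series, alpha, beta, horizon):
    """Holt's linear-trend smoothing; returns forecasts for h = 1..horizon."""
    if len(series) < 2:
        raise ValueError("need at least two observations")
    # Assumed initialization: level = first value, trend = first difference.
    level = series[0]
    trend = series[1] - series[0]
    for x in series[1:]:
        prev_level = level
        # Level: blend the new observation with the previous one-step forecast.
        level = alpha * x + (1 - alpha) * (level + trend)
        # Trend: blend the latest change in level with the previous trend.
        trend = beta * (level - prev_level) + (1 - beta) * trend
    # h-step-ahead forecast: last level plus h times last trend.
    return [level + h * trend for h in range(1, horizon + 1)]


# A perfectly linear toy series is extrapolated exactly.
toy_forecast = holt_forecast([10, 12, 14, 16, 18], alpha=0.5, beta=0.5, horizon=3)
```

On real case counts the forecasts depend on the chosen smoothing constants, which is why their selection matters.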

Results and Discussion
We present results for 10-day ahead forecasts (22 April to 1 May 2020) generated for the cumulative number of infected cases and deaths in India as well as in the ten most affected states: Kerala, Maharashtra, Delhi, Gujarat, Tamil Nadu, Telangana, Uttar Pradesh, Madhya Pradesh, Karnataka, and Rajasthan. In this work, we used two models, Holt's method and the ARIMA model, to forecast the cumulative infected cases and deaths of COVID-19. For the ARIMA model, we forecast the new infected cases and new deaths per day, whereas for Holt's method the cumulative numbers are forecast directly.

India forecasting:
3.2.1 ARIMA model:
When analyzing and forecasting a time series, it is good practice to plot the data and pay attention to its distinctive features. This guides the choice of a modeling approach that directly captures the identified features. Before fitting a model, the time series must be made stationary. To stabilize the variance, we applied a square root transformation to the time series of infected cases per day. To investigate stationarity, we used the KPSS and ADF tests; the results are shown in Table 1. At the 5% significance level, both tests indicate that the undifferenced series is non-stationary. After taking the first difference, both tests agree that the series is stationary, so the first difference (d=1) is sufficient to make the series reasonably stationary. Further, to estimate the other two parameters of the candidate model, the ACF and PACF of the square-root-transformed, first-differenced series are used. In Figure 2(a) and 2(b), the ACF displays one spike and the PACF also displays one spike. On the basis of the number of spikes, we initially selected ARIMA(1,1,1). Alternative models were also fitted to compete with the ARIMA(1,1,1) model. All alternative models, their AIC values and their Ljung-Box test p-values are shown in Table 2. The preferred model is the one with the minimum AIC and well-behaved residuals. Finally, we selected ARIMA(1,1,2) for forecasting: it has the lowest AIC value among the considered models, and it passed the Ljung-Box test with a p-value larger than the 0.05 level of significance. We also verified that its residuals are scattered around zero mean with constant variance.
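The selection rule described here (keep candidates whose residuals pass the Ljung-Box test, then take the smallest AIC) can be written as a short Python sketch. The AIC values and p-values below are made-up numbers for illustration only and are not the values reported in Table 2; the function name is likewise hypothetical.

```python
def select_model(candidates, alpha=0.05):
    """Pick the lowest-AIC order among candidates that pass the Ljung-Box check.

    candidates maps an (p, d, q) order to a pair (AIC, Ljung-Box p-value).
    A p-value above alpha means no evidence of residual autocorrelation.
    """
    adequate = {
        order: aic_value
        for order, (aic_value, lb_p) in candidates.items()
        if lb_p > alpha
    }
    if not adequate:
        raise ValueError("no candidate passes the Ljung-Box check")
    return min(adequate, key=adequate.get)


# Hypothetical values, NOT taken from the paper's tables.
illustrative = {
    (1, 1, 1): (212.4, 0.31),
    (1, 1, 2): (208.9, 0.47),
    (2, 1, 1): (207.5, 0.02),   # lowest AIC, but fails the residual check
    (0, 1, 1): (215.0, 0.12),
}
best = select_model(illustrative)
```

Filtering on the residual check before comparing AIC values keeps a model with autocorrelated residuals from being chosen on AIC alone.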
Using the ARIMA(1,1,2) model, we forecast 36335.53 [95% PI (30884.56, 42918.87)] cumulative infected cases by 1 May 2020; the results are shown in Table 3. For the time series of deaths per day, a single difference makes the series stationary, so we take d=1. Results of the ADF and KPSS tests are presented in Table 4. In Figure 2(c) and 2(d), the ACF shows two significant spikes and the PACF shows no significant spike. Based on the number of spikes, we initially selected ARIMA(0,1,2). Alternative models were also fitted to compete with the ARIMA(0,1,2) model. Details of the other potential models, along with their AIC values and Ljung-Box test p-values, are given in Table 5. To forecast the number of deaths per day in India, we found ARIMA(0,1,3) to be a reasonable model, as it has the minimum AIC value among the competing models. Furthermore, its residuals are randomly scattered around zero mean with a variance that does not change with time. ARIMA(0,1,3) also shows no lack of fit, with a Ljung-Box test p-value larger than 0.05. Graphical results of the forecasts of infected cases and deaths are shown in Figure 3. To eliminate the effect of the square root transformation on the per-day infected cases, we take the square of the forecasted observations.
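Because the ARIMA model is fitted to square-root-transformed daily counts while the reported figures are cumulative, two steps connect them: squaring the forecasts and adding a running sum to the last observed cumulative total. The Python sketch below shows this for point forecasts only, under the assumption that no further bias adjustment is applied; the function name and the numbers are illustrative and not taken from the study.

```python
from itertools import accumulate


def cumulative_from_sqrt_forecasts(last_cumulative, sqrt_daily_forecasts):
    """Back-transform sqrt-scale daily forecasts and convert to cumulative counts."""
    # Undo the square root transformation: square each forecast.
    daily = [f * f for f in sqrt_daily_forecasts]
    # Add the running total of new daily cases to the last observed cumulative count.
    return [last_cumulative + total for total in accumulate(daily)]


# Illustrative numbers only: 100 cases so far, three days of sqrt-scale forecasts.
cumulative = cumulative_from_sqrt_forecasts(100, [3, 4, 5])
```

Here the daily forecasts 9, 16 and 25 give cumulative totals of 109, 125 and 150.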

Holt's Method:
The time series plot of the cumulative number of confirmed cases and deaths for India, presented in Figure 1, exhibits a trend but no seasonal pattern. Given these features, Holt's method was selected in this study to produce a 10-day ahead forecast (22 April to 1 May 2020). Holt's method has two smoothing constants, α and β, whose values lie between 0 and 1. The square root transformation is used to stabilize the variance in the time series of infected cases. To obtain the optimal parameters, we applied a trial-and-error technique. Results are shown in Table 6.
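One systematic version of a trial-and-error search is a grid search over α and β that minimizes the sum of squared one-step-ahead errors. The Python sketch below illustrates that idea; it is a stand-in for, not a description of, the procedure used in this study, and the grid resolution, the error criterion, the initialization, the function names and the toy series are all assumptions.

```python
def holt_sse(series, alpha, beta):
    """Sum of squared one-step-ahead errors of Holt's method on a series."""
    # Assumed initialization: level = first value, trend = first difference.
    level, trend = series[0], series[1] - series[0]
    sse = 0.0
    for x in series[1:]:
        error = x - (level + trend)          # one-step-ahead forecast error
        sse += error * error
        prev_level = level
        level = alpha * x + (1 - alpha) * (level + trend)
        trend = beta * (level - prev_level) + (1 - beta) * trend
    return sse


def grid_search(series, steps=10):
    """Try alpha, beta in {1/steps, ..., (steps-1)/steps}; return (sse, alpha, beta)."""
    grid = [i / steps for i in range(1, steps)]
    return min((holt_sse(series, a, b), a, b) for a in grid for b in grid)


# Made-up increasing series, for illustration only.
toy = [1, 3, 4, 8, 9, 15, 20, 22, 31, 40]
best_sse, best_alpha, best_beta = grid_search(toy)
```

A finer grid or a numerical optimizer would refine the result, at the cost of more evaluations.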

Indian states forecasting:
COVID-19 is spreading very fast in India. Locating the regions of greatest spread within India will give insight for lifting the lockdown, which commenced on 25 March 2020. At the regional level, this study analyzes the cumulative number of cases but not deaths, owing to the unavailability of data. A glimpse of the current situation of the increasing number of cases in the ten states is given in Figure 4, where it is evident that Maharashtra, Gujarat, and Delhi are the most affected states in India up to 21 April 2020, and Kerala is the least affected state in our list. Each time series starts from the date when the first case was reported in the respective state.

ARIMA model:
For forecasting with the ARIMA model, the number of newly infected cases per day is analyzed instead of the cumulative number of infected cases. To select the optimum ARIMA model for each state, each state's time series is first made stationary by differencing. Next, we used the ADF and KPSS tests to check stationarity. To stabilize the variance of the Delhi, Telangana, Uttar Pradesh, and Gujarat time series, cube root transformations are used; one difference is then enough to remove the trend. To stabilize the variance of the Maharashtra, Karnataka and Rajasthan time series, square root and square transformations are used. The same procedure is adopted for all ten time series of infected cases per day. The best model for each state is chosen on the basis of the smallest AIC value. Results of the analysis for the ARIMA models are shown in Table 9. The analysis shows that Maharashtra and Gujarat will be the most affected states by 1 May 2020, with around 9787.24 and 4216 cumulative cases, respectively. Kerala's growth is declining, and it is likely to be among the less affected states, with 449 [95% PI (408, 574.99)] cumulative cases. All the models passed the Ljung-Box test, i.e. they show no lack of fit.
It is found that Kerala and Karnataka were in the red zone, and Gujarat and Madhya Pradesh were in the blue zone, until 1 April 2020 (Figure 12), but they are likely to change position by 1 May. Kerala and Karnataka are likely to shift to the blue zone, as cases are declining in both states. Conversely, Gujarat and Madhya Pradesh are likely to move to the red zone. The government should impose extra precautions in these two states, as cases are expected to rise significantly in both in the coming days.
Lockdown should remain in the red zone; the blue zone, by contrast, is not markedly affected by COVID-19, so lockdown there could be lifted with some restrictions. It is advisable to lift the lockdown in states within the green and light green zones for the proper functioning of the economy.
Further, analysis of the red and blue zones at the regional level is important for deciding on lifting the district-wise lockdown.

Conclusions
The spread of the COVID-19 epidemic has been slow in India compared with other countries such as Italy and the USA. This reflects the influence of the broad spectrum of social distancing measures put in place by the government of India, which have acted as a barrier to the growth of infected cases and deaths and have apparently helped to slow the epidemic. Our short-term forecast indicates that, at the regional level, Delhi, Rajasthan, Gujarat, Maharashtra, Uttar Pradesh, Madhya Pradesh, Telangana, and Tamil Nadu will be the most affected states in the coming days. Considering the situation, lockdown should not be lifted in these states. The number of cases in Kerala and Karnataka is found to be declining, and these states are predicted to shift from the red zone to the blue zone. Since very little future growth is predicted, lockdown may be lifted in these states, with some restrictions, for the proper functioning of economic activities. States in the green and light green zones, namely Himachal Pradesh, Goa, Uttarakhand, Bihar, Jharkhand, Chhattisgarh, Odisha, Sikkim, Assam, Arunachal Pradesh, Nagaland, Manipur, Mizoram, Tripura, and Meghalaya, show very little growth in infected cases up to 1 May; therefore, lockdown may be lifted there. At the national level, there will be around 36335.63 [95% PI (30884.56, 42918.87)] cases and 1099.38 [95% PI (959.77, 1553.76)] deaths up to 1 May 2020. The forecasts presented here are based on the assumption that current mitigation efforts will continue.

Data Availability
We obtained daily updates of the cumulative number of infected cases and deaths of the coronavirus illness for India from the Worldometer website (available online: https://www.worldometers.info/corona-virus/country/india/). To obtain the state-wise cumulative number of infected cases and deaths, we used the government of India website (available online: https://www.mygov.in/corona-data/covid19-statewise-status). We gathered data on infected cases every day at 12 midnight (GMT-5) from 30 January to 21 April 2020. We forecasted the cumulative number of infected cases and deaths of the epidemic for India as a whole, and the cumulative number of infected cases in ten Indian states that show a high burden of COVID-19 cases: Kerala, Maharashtra, Delhi, Gujarat, Tamil Nadu, Telangana, Uttar Pradesh, Madhya Pradesh, Karnataka, and Rajasthan.