1.1. Study Background
The coronavirus epidemic began in Wuhan city (China) on December 31, 2019 and has since evolved into a pandemic. In the absence of antiviral medications and vaccines, the incidence of novel COVID-19 infections has grown dramatically, resulting in huge economic losses, panic, and many deaths.
The use of statistical models to analyse epidemic data has emerged as a key field of study for predicting the number of COVID-19 deaths and coronavirus-infected individuals.
Statistical models represent the numerical data relevant to particular samples or groups. To assess trends in the data, these models are frequently presented as line graphs and scatterplots. While statistical models may display data in a variety of scenarios, those that deal with COVID-19 are currently in particularly wide use because they provide numerical information about the pandemic, such as the number of cases and deaths caused by COVID-19. These models have also proved very helpful in tracing cases throughout the globe to specific nations, regions, cities, and areas within cities, enabling the authorities in these locations to respond appropriately to the infection. Additionally, models have focused on a variety of crucial characteristics of individuals who present with COVID-19, such as age, race, gender, and pre-existing diseases. This enables researchers to determine which populations are most at risk from the pandemic [1].
Artificial intelligence (AI) techniques built on machine learning (ML) and mathematical models are being utilised to evaluate how the epidemic progresses in each country and to identify potential factors that might amplify or impede its effects [2].
1.2. Literature Review
Building on earlier research, [3] examined the relationship between dependent and independent variables in order to determine the current rate of coronavirus spread. The study statistically analysed the relationship of factors such as region, sex, birth year, infection date, and release or death date with the recorded numbers of recovered and deceased patients. The findings revealed that region, infection date, and sex were associated with the numbers of both recovered and deceased patients, whereas birth year was associated with the number of deceased patients only. Furthermore, no deaths from COVID-19 were noted among released patients, whereas 11.3 percent of deceased patients were confirmed to be COVID-19 positive after their deaths. In South Korea, the main factor associated with infection numbers was found to be the number of patients infected by an unknown source, representing more than 33% of total infected patients.
[4] studied the association between the overall numbers of COVID-19 infections and recoveries in various countries using the chain-binomial variant of Bailey’s model. They also pointed out that most studies have investigated COVID-19 cases with various regression and time series models, which are commonly used to assess the trend or growth of a disease.
[5] investigated the relationship between the transmission of viral infections and human migration. They concluded that the intensity of pedestrian traffic during the study period affected the spread of the virus after 15-20 days on average.
[6] aimed to create a time series-based system for tracking epidemics. In the first stage, the author used univariate time series models to show the evolution of the reported cases. Additionally, the models were combined to offer more precise and reliable results, and statistical probability distributions were considered in order to generate hypothetical future scenarios. In the last stage, the “time series susceptible-infected-recovered” (tsiR) model was built and used, and its epidemiological ratio ($R_0$) was calculated to determine when the epidemic ended. The time series models comprised traditional exponential smoothing and ARIMA techniques, together with feed-forward artificial neural networks (ANN) and multivariate adaptive regression splines (MARS) from the ML toolbox. The combinations included the simple mean, Granger-Newbold, and Bates-Granger techniques. The tsiR model and the $R_0$ ratio were applied to assess the spread and containment of the epidemic. The proposed method was applied to monitor the COVID-19 outbreak in Greece.
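The idea of combining forecasts from several models can be illustrated with a minimal Python sketch. It shows a simple mean combination and an inverse mean-squared-error weighting, which is a simplified Bates-Granger-style scheme that ignores correlation between the models' errors. The function names, the two models, and all numbers are hypothetical and are not taken from [6].

```python
import numpy as np

def simple_mean_combination(forecasts):
    """Average the forecasts of all models for each future time point."""
    return np.mean(np.asarray(forecasts, dtype=float), axis=0)

def inverse_mse_weights(past_forecasts, past_actuals):
    """Weight each model by the inverse of its past mean squared error.

    Simplified Bates-Granger-style weights (error correlations ignored).
    """
    past_forecasts = np.asarray(past_forecasts, dtype=float)  # (n_models, n_obs)
    past_actuals = np.asarray(past_actuals, dtype=float)      # (n_obs,)
    mse = np.mean((past_forecasts - past_actuals) ** 2, axis=1)
    inverse = 1.0 / mse
    return inverse / inverse.sum()

# Hypothetical past case counts and the forecasts two models made for them.
past_actuals = [10, 12, 15, 19]
past_forecasts = [
    [11, 13, 16, 20],  # model A: always off by 1, so MSE = 1
    [12, 14, 17, 21],  # model B: always off by 2, so MSE = 4
]
# Hypothetical one-step-ahead forecasts of the two models.
new_forecasts = np.array([[24.0], [26.0]])

weights = inverse_mse_weights(past_forecasts, past_actuals)
mean_forecast = simple_mean_combination(new_forecasts)
weighted_forecast = weights @ new_forecasts

print("weights:", weights)
print("simple mean:", mean_forecast)
print("inverse-MSE weighted:", weighted_forecast)
```

In this toy example the more accurate model receives a weight of 0.8 and the other 0.2, so the weighted forecast (24.4) sits closer to model A than the simple mean (25.0) does.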
Using Bailey’s model and secondary data, [7] calculated the removal rate, that is, the percentage of removed individuals in the infected population. Regression analysis was then performed to demonstrate the linear association between this indicator and the frequencies of all infections. Finally, they discussed the connection between the model and decision-making.
By carefully analyzing the cases reported in the country up to 22 April 2020, [8] used exploratory data analysis to create a statistical model intended to help people understand the coronavirus in India. The study’s findings illustrated the daily and weekly effects of COVID-19 in India and drew comparisons between India, its neighbours, and other badly affected nations.
[9] evaluated the impact of travel history and contact with travellers on the spread of the coronavirus in Nigeria using the ordinary least squares (OLS) estimator. Their predictions were based on data extracted from the Nigeria Centre for Disease Control (NCDC) website covering March 31, 2020, to May 29, 2020. The model evaluated the periods before and after the Nigerian federal government imposed travel restrictions. Based on the diagnostic checks performed, the fitted model fit the dataset well and showed no validity violations. Travel history and contact with travellers were observed to raise the likelihood of coronavirus infection by 85% and 88%, respectively, and the results indicate that the government made the right decision in enforcing travel restrictions. The authors concluded that the government must enforce this policy to contain the coronavirus.
Using stochastic modeling, [10] forecasted COVID-19 prevalence trends in East African countries, with a focus on Somalia, Sudan, Djibouti, and Ethiopia. The study’s findings indicated that, under the average rate scenario, the number of coronavirus-positive individuals in Ethiopia might range between 5,846 and 56,610 within four months after 30 June 2020.
[11] used an autoregressive distributed lag (ARDL) model and bounds cointegration tests to evaluate the long-run equilibrium relationship between the cumulative number of new COVID-19 infections (X) and the cumulative number of fatalities caused by COVID-19 (Y). The stability of the estimated model was also assessed: the consistency of the model parameters was evaluated using both the cumulative sum (CUSUM) of recursive residuals test and the CUSUM of squares test.
[12] examined the dynamic relationship between the numbers of new cases (NCASE) and deaths (DEATH) using the vector error correction model (VECM), the Johansen-Fisher cointegration test, and the Granger causality test. Data on daily new COVID-19 cases and COVID-19-related deaths in India, Ukraine, Canada, and the United States were gathered from the website for the period from 1 April 2020 to 26 December 2020. Summary figures showed that the United States had the largest number of COVID-19 infections, followed by India, Canada, and Ukraine. The United States also had the highest number of COVID-19-related deaths, followed by India, Ukraine, and Canada. Canada had the highest death rate, followed by the United States, Ukraine, and India. The results of the Johansen-Fisher cointegration test indicate that there is only one cointegrating equation. The Granger causality test and the VECM demonstrate that there is both a short-term and a long-term causal relationship between COVID-19 infections and deaths. The speed of adjustment is found to be 9.9%.
1.4. Panel Data Model
Panel data are data that contain observations of several phenomena collected over multiple time periods for the same group of people, entities, or units. In short, econometric panel data are multidimensional data gathered over a certain period of time.
A simple “regression model” of panel data is defined as

$$Y_{it} = \alpha + \beta X_{it} + \varepsilon_{it},$$

where $\varepsilon_{it}$ represents the residuals obtained from the panel regression analysis, $Y$ represents the dependent variable, $X$ denotes the explanatory (independent) variable, $\alpha$ and $\beta$ indicate the intercept and slope, $t$ indexes the $t$-th time period, and $i$ indexes the $i$-th cross-sectional unit. $X$ is considered to be non-stochastic, and the error term is assumed to follow the “classical assumptions”, i.e., $\varepsilon_{it} \sim N(0, \sigma^2)$. In the present research, the number of cross-sections (districts) is 37 ($i = 1, 2, 3, \ldots, 37$), and the number of time points is 30 ($t = 1, 2, 3, \ldots, 30$).
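As an illustration of the simple panel regression described above, the following Python sketch simulates a balanced panel with 37 cross-sectional units and 30 time points and estimates the intercept and slope by pooled ordinary least squares. The data are synthetic, and the parameter values, variable names, and the choice of pooled OLS as the estimator are assumptions made for this example only; they do not reproduce the study's data or results.

```python
import numpy as np

def simulate_panel(n_units=37, n_periods=30, alpha=2.0, beta=0.5,
                   sigma=1.0, seed=0):
    """Simulate a balanced panel Y_it = alpha + beta * X_it + eps_it.

    All parameter values are hypothetical and chosen for illustration.
    """
    rng = np.random.default_rng(seed)
    # Long format: one row per (unit i, period t) pair.
    unit = np.repeat(np.arange(1, n_units + 1), n_periods)
    period = np.tile(np.arange(1, n_periods + 1), n_units)
    x = rng.uniform(0.0, 10.0, size=n_units * n_periods)
    eps = rng.normal(0.0, sigma, size=x.size)  # errors ~ N(0, sigma^2)
    y = alpha + beta * x + eps
    return unit, period, x, y

def pooled_ols(x, y):
    """Estimate intercept and slope by least squares on the stacked panel."""
    design = np.column_stack([np.ones_like(x), x])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    residuals = y - design @ coef
    return coef[0], coef[1], residuals

unit, period, x, y = simulate_panel()
alpha_hat, beta_hat, residuals = pooled_ols(x, y)

print("observations:", x.size)
print("estimated intercept:", alpha_hat)
print("estimated slope:", beta_hat)
```

Pooled OLS treats all 37 x 30 = 1110 observations as one sample with a common intercept and slope, which matches the simple model form given here; fixed-effects or random-effects estimators would relax that restriction.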
Detailed discussions of panel data modelling can be found in, e.g., [13, 14, 15, 16, 17].
Because they combine time series of cross-sectional observations, panel data give “more informative data, more variability, less collinearity among variables, more degrees of freedom and more efficiency” [14].