DATA-FORENSIC DETERMINATION OF THE ACCURACY OF INTERNATIONAL COVID-19 REPORTING: USING ZIPF’S LAW FOR PANDEMIC INVESTIGATION

Severe outbreaks of infectious disease occur throughout the world, with some reaching the level of international pandemic: Coronavirus (COVID-19) is the most recent to do so. As such pandemics cause extensive loss of life, hamper industrial operations, and cause economic losses in both developing and developed countries, it is critical to establish common standards of accuracy in the determination and reporting of cases. In particular, there are current concerns that countries are hiding or incorrectly reporting cases of COVID-19. In this paper, we set out a mechanism for using Zipf’s law to establish the accuracy of international reporting of COVID-19 cases by determining whether an individual country’s COVID-19 reporting follows a power law for confirmed, recovered, and death cases. For COVID-19 confirmed cases, Uzbekistan has the highest Zipf’s law probability value (P-value) of 0.940, followed by Belize (0.929) and Qatar (0.897). For COVID-19 recovered cases, Iraq has the highest P-value of 0.901, followed by New Zealand (0.888) and Austria (0.884). For COVID-19 death cases, Bosnia and Herzegovina has the highest P-value of 0.874, followed by Lithuania (0.843) and Morocco (0.825). China, where the COVID-19 pandemic began, is a significant outlier, recording P-values lower than 0.1 for the confirmed, recovered, and death cases. This raises important questions, not only for China, but also for any country whose data exhibits P-values below this threshold. The main application of this work is to serve as an early warning for the World Health Organization (WHO) and other health regulatory bodies to perform further investigations in countries where COVID-19 datasets deviate significantly from Zipf’s law. To this end, we also provide a tool for illustrating Zipf’s law P-values on a global map in order to convey the geographic distribution of reporting anomalies.


Historical Perspective on COVID-19 Pandemic Investigations
The first recorded pandemic occurred from 165 AD to 180 AD. This pandemic was referred to as the Antonine Plague (also known as the Plague of Galen) and resulted in about 5 million deaths across the globe. Analysis of symptomology and infection pattern suggests that this was likely smallpox or measles [1].
Around 735-737 AD, the Japanese smallpox epidemic erupted (believed to be caused by a variola virus), killing up to 1 million persons [2]. Earlier, around 541-542 AD, the Plague of Justinian, believed to be the world's first bubonic plague, killed between 30 and 50 million persons [3].
Procopius described the plague as that "by which the whole human race was near to being annihilated" [4-5]. The most devastating pandemic, in terms of its impact on the global population, occurred between 1347 and 1352 AD; this is the pandemic referred to as 'The Black Death', which claimed between 75 and 200 million lives. It is believed to have been caused by the bubonic plague [6]; Benedictow in [7] described this plague as "the greatest catastrophe ever"; Michael of Piazza, a Franciscan friar, wrote contemporaneously that "the infection spread to everyone who had any intercourse with the disease" [8].
It is recorded that around 1520 AD there was an outbreak of New World smallpox, believed to be caused by a variola virus, resulting in 25 to 55 million deaths. The New World smallpox caused so much damage that Noble David Cook [9][10] estimated that "in the end, the regions least affected lost 80 percent of their populations; those most affected lost their full populations, and a typical society lost 90 percent of its population." Around 1629-1631 AD, the Italian Plague erupted, believed to originate from Yersinia pestis bacteria carried by rats and fleas. It claimed up to 1 million lives [12][13].
Around 1665 AD, the Great Plague of London claimed 75,000 to 100,000 lives; it is also believed to have had its source in rats and fleas [11].
From 1817 to 1923, the Cholera Pandemic (caused by Vibrio cholerae bacteria) killed more than 1 million people in Europe [14]. Around 1885, a third plague caused by Yersinia pestis bacteria carried by rats and fleas resulted in around 12 million deaths in China and India [15]. Also in the late 1800s, yellow fever, whose source is believed to be a mosquito-borne virus, resulted in more than 150,000 deaths, mostly in South America and sub-Saharan Africa [16][17].
As is clear from the above historical account of pandemic spread, the potential for negative global impact is very substantial indeed if unchecked. In the majority of the above cases, the reporting and compilation of pandemic statistics was substantially after the fact (sometimes by many centuries) given the limited contemporaneous statistical capabilities. In the absence of such statistics, compiled while the outbreak was still live, it would have been very difficult or impossible for authorities to make well-informed policy decisions in order to combat the pandemic spread.
It is therefore critical in the current COVID-19 pandemic that accurate compilation of international reporting is undertaken. However, given the potential for countries/individuals to falsify records, for political, offensive or financial purposes, it is necessary to have methods in place to distinguish authentic from forged records. In this paper, we propose using Zipf's law as a means for achieving this.

Motivation for the use of Zipf's law
Zipf's law was proposed in 1935 by the US linguist George K Zipf [29] and may be stated succinctly as follows: given some corpus of natural language utterances, the frequency of any given word is inversely proportional to its rank in a frequency table.
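The rank-frequency statement can be illustrated with a minimal sketch. The function names and the toy corpus are assumptions introduced here for illustration only; they are not part of the method used in this paper. Under an ideal Zipf distribution the product of rank and frequency is roughly constant, which is what the second function exposes.

```python
from collections import Counter


def rank_frequency(tokens):
    """Return (rank, word, count) rows, most frequent word first.

    Ties are broken alphabetically so the output is deterministic.
    """
    counts = Counter(tokens)
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(rank, word, count)
            for rank, (word, count) in enumerate(ordered, start=1)]


def rank_times_count(rows):
    """Under an ideal Zipf law, rank * count is approximately constant."""
    return [rank * count for rank, _, count in rows]


# Hypothetical toy corpus whose counts are exactly proportional to 1/rank.
toy_tokens = ["a"] * 12 + ["b"] * 6 + ["c"] * 4 + ["d"] * 3
toy_rows = rank_frequency(toy_tokens)
```

For the toy corpus, every rank-times-count product is the same; real corpora only approximate this behaviour.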
Newman [30] made this explicitly stochastic: when the probability of measuring a particular value of some quantity (in our case, COVID-19 cases) varies inversely as a power of that value, the quantity may be said to follow Zipf's law [30]. Mathematically, $p(x) = C x^{-\alpha}$, where $p(x)$ is the distribution of the quantity $x$, $\alpha$ is the Zipf's law exponent, and $C$ is a constant [30].
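As a short worked step, the power-law form can be written out with its logarithmic version and its normalisation. This assumes the continuous form of the law, a lower cutoff $x_{\min}$ below which the law is not taken to hold, and an exponent $\alpha > 1$; the cutoff symbol and these conditions are assumptions introduced here, not statements from the sources cited above.

```latex
\begin{align}
p(x) &= C\,x^{-\alpha}, \qquad x \ge x_{\min},\ \alpha > 1 \\
\ln p(x) &= \ln C - \alpha \ln x \\
1 = \int_{x_{\min}}^{\infty} C\,x^{-\alpha}\,dx
  &= \frac{C\,x_{\min}^{1-\alpha}}{\alpha - 1}
  \;\Longrightarrow\; C = (\alpha - 1)\,x_{\min}^{\alpha - 1}
\end{align}
```

The second line is why power-law data appear as a straight line of slope $-\alpha$ on a log-log plot.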
In this paper, we propose investigating reported COVID-19 datasets using Zipf's law to establish their veracity and accuracy, in particular because of widespread allegations that countries may have hidden or systematically underreported cases of COVID-19 [35].
We thus aim to establish probability values (P-values) from the Zipf's law calculation for each country affected by COVID-19. Furthermore, we represent each country's P-value on a global map. This is ongoing work, as more data are compiled throughout the current COVID-19 outbreak.

EXPERIMENTS
Our primary goal is to investigate internationally reported cases of COVID-19 in order to determine consistency with Zipf's law. A secondary goal is to calculate the P-value for Zipf's law on each country's COVID-19 datasets. Lastly, we illustrate Zipf's law P-values on a global map to convey the geographic distribution of reporting anomalies.
We use the power-law package developed by Clauset et al. [36] to obtain P-values for reported cases of COVID-19 per country. Following the methodology reported in [36] and evaluated in [32], we repeat the experiment 1000 times on each COVID-19 dataset in order to obtain its P-value. The steps followed to test whether COVID-19 datasets follow a power law are set out in [32] and [36]. It should be noted that the P-values are generated using the Kolmogorov-Smirnov (KS) statistic goodness-of-fit test as specified in [32] and [36].
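The general shape of a KS-based goodness-of-fit test with repeated synthetic datasets can be sketched as follows. This is a simplified illustration under stated assumptions, not the package of [36]: it treats the data as continuous (case counts are integers), fixes the lower cutoff at the smallest observation instead of estimating it, and uses function names invented here. The data must be positive and not all identical.

```python
import math
import random


def fit_alpha(xs, x_min):
    """Maximum-likelihood exponent of a continuous power law on x >= x_min."""
    return 1.0 + len(xs) / sum(math.log(x / x_min) for x in xs)


def ks_distance(xs, x_min, alpha):
    """Largest gap between the empirical CDF and the fitted power-law CDF."""
    ordered = sorted(xs)
    n = len(ordered)
    worst = 0.0
    for i, x in enumerate(ordered):
        model = 1.0 - (x / x_min) ** (1.0 - alpha)
        # The empirical CDF jumps from i/n to (i+1)/n at x; check both sides.
        worst = max(worst, abs(model - i / n), abs(model - (i + 1) / n))
    return worst


def sample_power_law(n, x_min, alpha, rng):
    """Draw n values from the fitted law by inverse-transform sampling."""
    return [x_min * (1.0 - rng.random()) ** (-1.0 / (alpha - 1.0))
            for _ in range(n)]


def bootstrap_p_value(xs, n_sims=1000, seed=0):
    """Fraction of synthetic datasets that fit worse than the observed data."""
    rng = random.Random(seed)
    x_min = min(xs)  # simplifying assumption: cutoff is not estimated
    alpha = fit_alpha(xs, x_min)
    d_observed = ks_distance(xs, x_min, alpha)
    worse = 0
    for _ in range(n_sims):
        synthetic = sample_power_law(len(xs), x_min, alpha, rng)
        s_min = min(synthetic)
        s_alpha = fit_alpha(synthetic, s_min)  # refit each synthetic dataset
        if ks_distance(synthetic, s_min, s_alpha) >= d_observed:
            worse += 1
    return worse / n_sims
```

A large P-value means the observed KS distance is typical of data genuinely drawn from the fitted power law; a small one means the observed data fit worse than almost all synthetic datasets. The default of 1000 simulations mirrors the number of repetitions mentioned above.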

Experimental Results
Tables 1 (Appendix I)
We illustrate the distribution of P-values across countries/regions for COVID-19 recovered cases in Figure 2.

Figure 2: Distribution of P-values across Countries/Regions for Recovered Cases
Again, Table 4 extracts four extremal countries'/regions' P-values from
We indicate the P-value distribution across countries/regions for COVID-19 death cases in Figure 3.
Table 2 indicates that Uzbekistan had the highest P-value of 0.940, followed by Belize with a P-value of 0.929, and Qatar with a P-value of 0.897. Table 4 indicates that Iraq's recovered cases data most closely follows Zipf's law with a P-value of 0.901, followed by New Zealand with a P-value of 0.888, and Austria with a P-value of 0.884. Table 6 indicates that Bosnia and Herzegovina had the highest P-value of 0.874, followed by Lithuania with a P-value of 0.843, and Morocco with a P-value of 0.825.
As can be seen in Figures 4, 5 [...] China, where the COVID-19 pandemic began, records P-values lower than 0.1 for the confirmed, recovered, and death cases. This has raised some questions, not only for China, but also for every other country whose power-law P-values are less than 0.1 (this threshold being the one selected to establish compliance with Zipf's law according to the reasoning in [32,36]).
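Applying the 0.1 threshold to a table of per-country P-values might look like the sketch below. The function name is an assumption introduced here, and the entry labelled as an example is a hypothetical placeholder, not a reported result; the other three values are the confirmed-case P-values quoted in this paper.

```python
def flag_for_review(p_values, threshold=0.1):
    """Return the names whose P-value is below the threshold, lowest first."""
    flagged = [(p, name) for name, p in p_values.items() if p < threshold]
    return [name for _, name in sorted(flagged)]


confirmed_p_values = {
    "Uzbekistan": 0.940,
    "Belize": 0.929,
    "Qatar": 0.897,
    "Example country": 0.05,  # hypothetical placeholder, not real data
}

countries_to_review = flag_for_review(confirmed_p_values)
```

The comparison is strict, matching the "less than 0.1" wording; a country sitting exactly at the threshold is not flagged in this sketch.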
Based on the above discussion, we can conclude that:
1. Zipf's law can be applied to COVID-19 case data with reliability monotonically improving in relation to dataset size.
2. This analysis can potentially be used as an 'early warning system' for further investigation into COVID-19 datasets not consistent with Zipf's law.

CONCLUSIONS AND FUTURE WORK
In this paper, we have established that COVID-19 datasets for many countries can be shown to be consistent with Zipf's law. However, experiments also indicate that deviation of COVID-19 datasets from Zipf's law may be indicative of incorrect data reporting. The main application of this work is thus to serve as a potential early warning system for international health regulatory bodies such as the World Health Organization (WHO) in performing further investigations in countries where COVID-19 datasets have deviated from Zipf's law.
In future work, we plan to: