A Quantitative Genomic View of the Coronaviruses: SARS-CoV2

The 2020 pandemic is caused by a member of the Coronaviruses (CoV), a large family of viruses that cause illness ranging from the common cold to more severe diseases such as Middle East Respiratory Syndrome (MERS-CoV) and Severe Acute Respiratory Syndrome (SARS-CoV2). Coronavirus disease (COVID-19) is caused by a new strain that was discovered in 2019 and had not been previously identified in humans. It is high time to investigate the quantitative and/or qualitative genomic information of SARS-CoV2 in order to strengthen healthcare facilities in the fight against this viral disease. In this article, a thorough quantitative understanding of the purine and pyrimidine spatial distribution/organization of all 89 complete SARS-CoV2 sequences (available as on date in the NCBI virus database) is developed using parameters such as the fractal dimension, Hurst exponent, Shannon entropy and GC content of the nucleotide sequences of the SARS-CoV2 genome. In addition, the SARS-CoV2 nucleotide sequences are clustered according to a phylogeny built from their closeness (Hamming distance) in terms of their respective purine-pyrimidine distributions.


Introduction
Coronavirus disease (COVID-19) is caused by SARS-CoV2, the causative agent of a potentially fatal disease that is of great global public health concern [1], [2]. Based on the large number of infected people who were exposed to the wet animal market in Wuhan City, China, it is suggested that this is the likely zoonotic origin of COVID-19 [3, 4, 5]. Person-to-person transmission of COVID-19 infection led to the isolation of patients, who were subsequently administered a variety of treatments [6,7]. As of 11 February 2020, data from the World Health Organization (WHO) have shown that more than 43000 confirmed cases have been identified in 28 countries/regions, with ≥ 99% of cases being detected in China [8]. On 30 January 2020, the WHO declared COVID-19 the sixth public health emergency of international concern [9]. SARS-CoV2 is closely related to two bat-derived severe acute respiratory syndrome-like coronaviruses, bat-SL-CoVZC45 and bat-SL-CoVZXC21 [10]. On 11 February 2020, the WHO formally named the disease triggered by 2019-nCoV as coronavirus disease 2019 (COVID-19). On the same day, the coronavirus study group of the International Committee on Taxonomy of Viruses named 2019-nCoV as severe acute respiratory syndrome coronavirus 2 (SARS-CoV2) [11]. Complete genomic sequences have been released by the NCBI in the last few weeks to help understand the evolutionary origin and molecular characteristics of this virus [12]. Ceraolo and Giorgi [13] have confirmed the high sequence similarity (> 99%) between all sequenced 2019-nCoV genomes available, with the closest BCoV sequence sharing 96.2% sequence identity, confirming the notion of a zoonotic origin of 2019-nCoV. Coronaviruses are enveloped RNA viruses that are distributed broadly among humans, other mammals, and birds and that cause respiratory, enteric, hepatic, and neurologic diseases [10,14,15].
As of 15 March 2020, there are 89 nucleotide sequences of SARS-CoV2 available in the NCBI virus database [16,17]. Each of these sequences is approximately 29 thousand bases long and is composed of the four nucleotide bases A, T, C and G. Importantly, they all differ from one another in the spatial organization of the nucleotide bases.
In this study, our aim is to discover the signature imprint of this spatial organization in SARS-CoV2. The spatial distribution of the purine and pyrimidine bases over the nucleotide sequences of SARS-CoV2 is extracted through quantitative parameters such as the fractal dimension, Hurst exponent and Shannon entropy. In addition, the density of each base is examined, and the density of the GC content is determined in order to understand the stability of the sequences. These findings would aid in the diagnosis of SARS-CoV2 virus infection in humans and potential animal hosts (using polymerase chain reaction and immunological tests), in the development of antivirals (including neutralizing antibodies), and in the identification of putative epitopes for vaccine development.

Database used and Specifications
In this work, all nucleotide sequences used for the experimental results and discussion were taken from the NCBI Virus Database (https://www.ncbi.nlm.nih.gov/labs/virus/vssi/). As of 15 March 2020, this dataset contains 89 complete SARS-CoV2 nucleotide sequences. We have transformed each nucleotide sequence into a binary sequence of 0s and 1s, as defined in Equation (1). Here the purine bases (A, G) are represented as "1" and the pyrimidine bases (C, T) as "0".
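The purine-pyrimidine encoding described above can be sketched in a few lines of Python. This is a minimal illustration and not the authors' code: the function name `to_binary` is hypothetical, and rejecting any symbol other than A, G, C or T is an assumption, since the text does not say how ambiguous bases were treated.

```python
# Purines (A, G) map to 1 and pyrimidines (C, T) map to 0, as stated in the text.
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def to_binary(sequence):
    """Return the purine/pyrimidine binary encoding of a nucleotide string."""
    bits = []
    for base in sequence.upper():
        if base in PURINES:
            bits.append(1)
        elif base in PYRIMIDINES:
            bits.append(0)
        else:
            # Assumption: ambiguous symbols (e.g. N) are rejected rather than guessed.
            raise ValueError(f"unsupported nucleotide symbol: {base!r}")
    return bits


if __name__ == "__main__":
    print(to_binary("ATGGCT"))
```

For example, `to_binary("ATGC")` yields the list `[1, 0, 1, 0]`.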
Equation (1) represents the encoding of purine and pyrimidine nucleotide bases as 1 and 0, respectively, in the transformed binary sequence. [...]nent, distribution of purines-pyrimidines) ten different clusters have been generated. In the following, we present the methods in brief.

Fractal Dimension of Indicator Matrices
Let D = {0, 1} be the set of two symbols characterizing the purine and pyrimidine bases of a nucleotide sequence, and let S(l) be the binary sequence of length l, built from the two characters of D, that corresponds to a nucleotide sequence. Here, we convert each of the binary sequences into an indicator matrix [18,19,20,21]. Several methods are described in the literature [22] for finding the self-organizing structure of DNA sequences through the indicator matrix. The indicator function for each sequence is defined as shown in Equation (2), and it gives rise to the indicator matrix $\vartheta_{hk}$, a matrix whose entries are 0 and 1. The self-organization of the purine and pyrimidine bases of all the SARS-CoV2 sequences can be obtained through the fractal dimension of the indicator matrix.
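As a rough illustration of the indicator matrix, the sketch below builds it for a short binary sequence. It assumes the convention common in the indicator-matrix literature, in which entry (h, k) is 1 when the symbols at positions h and k coincide and 0 otherwise; Equation (2) is not reproduced in the text, so this convention and the name `indicator_matrix` are assumptions rather than the authors' definition. The fractal dimension computation is deliberately left out for the same reason.

```python
def indicator_matrix(bits):
    """Build an l x l 0/1 matrix marking positions that hold the same symbol.

    Assumption: entry (h, k) is 1 when bits[h] == bits[k], else 0.
    """
    return [[1 if a == b else 0 for b in bits] for a in bits]


if __name__ == "__main__":
    # A toy sequence; a full genome of about 29000 bases would give a
    # matrix far too large to print.
    for row in indicator_matrix([1, 0, 1, 1, 0]):
        print(" ".join(str(value) for value in row))
```

Under this convention the matrix is symmetric with ones on the diagonal, which is why only its overall pattern, not individual entries, carries information about the sequence.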

Hurst Exponent
The Hurst exponent (HE) is used in time series analysis to interpret autocorrelation [23,24]. The value of HE lies between 0 and 1. Values 0 < HE < 0.5 and 0.5 < HE < 1 designate negative and positive autocorrelation of a time series, respectively, while HE = 0.5 denotes complete randomness, meaning that from any given value the series is equally likely to increase or decrease. The HE of a binary sequence $s_n$ is defined in Equation (4). The autocorrelation of the purine-pyrimidine bases of all the SARS-CoV2 sequences is obtained through the Hurst exponent.
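To make the idea concrete, the following sketch estimates a Hurst exponent with a single-window rescaled-range (R/S) calculation. The defining equation is not reproduced in the text, so the convention used here, (n/2)^HE = R/S with R the range of the cumulative mean-adjusted sums and S the standard deviation, is an assumption, as is the function name `hurst_exponent`. It should be read as one plausible formulation, not as the authors' method.

```python
import math


def hurst_exponent(bits):
    """Single-window rescaled-range estimate, assuming (n/2)**HE = R/S."""
    n = len(bits)
    if n <= 2:
        raise ValueError("need more than two observations")
    mean = sum(bits) / n
    running = 0.0
    lowest = float("inf")
    highest = float("-inf")
    squared = 0.0
    for value in bits:
        deviation = value - mean
        running += deviation  # cumulative mean-adjusted sum Y_j
        lowest = min(lowest, running)
        highest = max(highest, running)
        squared += deviation * deviation
    spread = math.sqrt(squared / n)  # S: standard deviation
    if spread == 0.0:
        raise ValueError("constant sequence has no defined rescaled range")
    rescaled_range = (highest - lowest) / spread  # R / S
    return math.log(rescaled_range) / math.log(n / 2)


if __name__ == "__main__":
    print(hurst_exponent([0, 1] * 512))          # strictly alternating
    print(hurst_exponent([0] * 512 + [1] * 512))  # two long runs
```

Under this convention a strictly alternating sequence sits at the anti-persistent extreme and a sequence made of two long runs sits at the persistent extreme, which matches the interpretation of HE given above.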

Shannon Entropy
The Shannon entropy (SE) measures the information entropy of a Bernoulli process with probability p of the two outcomes (0/1). It is defined as

$$SE = -\sum_{i=1}^{2} p_i \log_2 (p_i)$$

where $p_1 = \frac{k}{l}$ and $p_2 = \frac{l-k}{l}$; here l is the length of the binary sequence and k is the number of 1's in the binary sequence of length l [25,26].
The binary Shannon entropy is a measure of the uncertainty in a binary string. Whenever the probability p = 0, the event is certain never to occur, and so there is no uncertainty, leading to an entropy of 0. Similarly, if the probability p = 1, the result is certain, so the entropy must be 0. When p = 0.5, the uncertainty is at a maximum and consequently the SE is 1.
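The binary Shannon entropy can be computed directly from the count of 1's, as in the minimal sketch below. The function name `shannon_entropy` is hypothetical, and the convention that a term with zero probability contributes nothing is the usual one, stated here as an assumption.

```python
import math


def shannon_entropy(bits):
    """Binary Shannon entropy (in bits) of a 0/1 sequence."""
    length = len(bits)
    if length == 0:
        raise ValueError("empty sequence")
    ones = sum(bits)
    entropy = 0.0
    for count in (ones, length - ones):
        if count:  # a zero-probability outcome contributes nothing
            p = count / length
            entropy -= p * math.log2(p)
    return entropy


if __name__ == "__main__":
    print(shannon_entropy([0, 1, 0, 1]))  # balanced string
    print(shannon_entropy([1, 0, 0, 0]))  # unbalanced string
```

A balanced string gives the maximum value of 1, and a string of identical symbols gives 0, in line with the limiting cases described above.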

GC Content and Nucleotides Density
In molecular biology, the GC content is usually calculated as a percentage and is sometimes called the G + C ratio or GC-ratio [27,28]. The GC-content percentage is calculated by the formula [29,30]

$$\text{GC content} = \frac{G + C}{A + T + G + C} \times 100\%$$

where A, T, G and C denote the counts of the respective bases. DNA with low GC-content is likely to be less stable than DNA with high GC-content; however, the hydrogen bonds themselves do not have a particularly significant impact on molecular stability, which is instead caused mainly by molecular interactions of base stacking. The GC-content percentage as well as the GC-ratio can be measured by several means, but one of the simplest methods is to measure the melting temperature of the DNA double helix using spectrophotometry.
In addition to the GC content, the density of each of the nucleotides A, T, C and G is obtained separately in the present study [31,32].
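The GC content and the per-nucleotide densities can be obtained by simple counting, as the sketch below shows. The function names are hypothetical, and ignoring any symbol other than A, T, C and G when forming the denominator is an assumption made for this illustration.

```python
from collections import Counter

BASES = "ATCG"


def nucleotide_density(sequence):
    """Percentage of each of A, T, C and G among the four canonical bases."""
    counts = Counter(sequence.upper())
    total = sum(counts[base] for base in BASES)
    if total == 0:
        raise ValueError("sequence contains no A, T, C or G")
    return {base: 100.0 * counts[base] / total for base in BASES}


def gc_content(sequence):
    """GC-content percentage: (G + C) / (A + T + G + C) * 100."""
    density = nucleotide_density(sequence)
    return density["G"] + density["C"]


if __name__ == "__main__":
    print(nucleotide_density("AATG"))
    print(gc_content("ATGC"))
```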

It is well understood from the frequencies of nucleotide usage that the SARS-CoV2 sequences are not randomly composed. We therefore explicitly try to obtain the spatial distribution of the purine and pyrimidine organization among the SARS-CoV2 sequences through the parameters defined in the previous section. In addition to investigating the purine-pyrimidine distribution, we wish to explore the density of each of the nucleotides as well as the GC content, which has a significant role in stability.

Classification Based on Fractal dimension of Indicator Matrices
For each binary sequence (purine and pyrimidine) of SARS-CoV2, the fractal dimension (FD) is calculated using Equation (3). Based on the fractal dimension, we have classified all the SARS-CoV2 sequences into clusters. Three distinct fractal dimensions (0.3, 0.4755 and 0.6) have been obtained, and consequently only three clusters of sequences emerge. Table 2 lists the sequences and their corresponding FDs. The plot of the FD and the corresponding histogram are shown in Fig. 1. Each indicator matrix is larger than 29000 × 29000, and consequently we are unable to show an image of the indicator matrix here. The sequences S47, S13, S28 and S79 have an FD of 0.3, which indicates that the amount of fractality (a kind of non-linearity) is small, so the purine and pyrimidine organization is rather well organized and close to affine. Eight sequences (S48, S49, S50, S51, S53, S54, S55 and S56) have an FD of 0.4755, and the FD of all the remaining purine-pyrimidine sequences of SARS-CoV2 has been found to be 0.6, which is close to the FD of the Cantor set, a significant coincidence [33,34].

Classification Based on Hurst exponent
For each of the binary sequences of SARS-CoV2, the Hurst exponent (HE) is determined using Equation (4), and ten clusters are then formed from all the sequences using the k-means clustering technique. The Hurst exponents and the histograms of all the SARS-CoV2 sequences are plotted in Fig. 2.
It has been observed that, apart from one sequence discussed below, the HE is confined to the interval (0.643, 0.655) of length 0.0123. This suggests that the spatial distribution of the purine and pyrimidine bases of all the SARS-CoV2 sequences is positively autocorrelated. The exception is sequence S1, which has an HE of 0.712, as can be seen in Table 3. This sequence S1 (accession ID: N C 0 4551) has the highest HE and clearly has a significantly different spatial organization of purine and pyrimidine bases. The length of sequence S1 is 29903. It is worth mentioning that ten other sequences (S13, S14, S15, S39, S40, S41, S42, S57, S60 and S89) have the same length of 29903, but their HE differs significantly from that of S1. Based on the HE obtained from the binary sequences of SARS-CoV2, ten clusters have been formed using k-means clustering.
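For readers unfamiliar with the clustering step, the sketch below applies Lloyd's k-means algorithm to scalar values such as per-sequence Hurst exponents. It is a minimal illustration under stated assumptions: the function name `kmeans_1d`, the deterministic initialisation from evenly spaced sorted values, and the sample numbers are all hypothetical, and the text does not specify which k-means implementation or initialisation the authors used.

```python
def kmeans_1d(values, k, iterations=100):
    """Cluster scalar values with Lloyd's algorithm; return (centres, labels)."""
    if not 1 <= k <= len(values):
        raise ValueError("k must be between 1 and the number of values")
    ordered = sorted(values)
    # Assumption: deterministic start, centres spread across the sorted data.
    if k == 1:
        centres = [ordered[len(ordered) // 2]]
    else:
        centres = [ordered[i * (len(ordered) - 1) // (k - 1)] for i in range(k)]
    labels = []
    for _ in range(iterations):
        # Assignment step: each value joins its nearest centre.
        labels = [min(range(k), key=lambda j: abs(v - centres[j])) for v in values]
        # Update step: each centre moves to the mean of its members.
        updated = []
        for j in range(k):
            members = [v for v, lab in zip(values, labels) if lab == j]
            updated.append(sum(members) / len(members) if members else centres[j])
        if updated == centres:
            break
        centres = updated
    return centres, labels


if __name__ == "__main__":
    # Made-up values for illustration only, not taken from the paper's tables.
    sample = [0.643, 0.644, 0.655, 0.654, 0.712]
    print(kmeans_1d(sample, 3))
```

With an isolated value such as the 0.712 in the made-up sample, the outlier ends up in a cluster of its own, which is the kind of separation the clustering is meant to expose.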

Classification Based on Shannon Entropy
For all the 89 binary sequences (purine-pyrimidine) of SARS-CoV2, the [...]
Given the SE of the binary purine-pyrimidine representation of each SARS-CoV2 sequence, only three clusters have been formed using the k-means clustering technique. Cluster 1 contains 21 sequences (S68, S78, S88, S71, S13, S69, S15, S42, S74, S67, S39, S40, S41, S60, S1, S14, S89, S12, S57, S3 and S4) with SE centred at 0.999940381147619. The other 67 sequences belong to cluster 2, whose centre is at 0.999930184068656; these two clusters can, however, be considered the same. Cluster 3 contains only one sequence, S30, whose SE is 0.9999585474 (approximately 1), as already mentioned.
It is worth mentioning that the SE follows a very nearly linear trend across the purine and pyrimidine distributions of all the SARS-CoV2 sequences. This is a crucial feature of the coronavirus (SARS-CoV2), unlike other sequences examined in previous studies [35,36,37]. The amount of uncertainty is at its maximum, which indicates that purine and pyrimidine bases occur with equal likelihood across all the SARS-CoV2 sequences.

GC, A, T, C and G Density in the SARS-CoV2
In this section, we investigate the density of each nucleotide [...]. Based on the GC content density in the SARS-CoV2 sequences, ten different clusters are formed using the k-means clustering technique; Table 7 lists the sequences and the clusters to which they belong. [...] percentage of A, T, C and G with their respective histograms. In Table 8, [...] sequences of A, T, C and G are given explicitly. [...] the sequences into some clusters based on the closeness of purine-pyrimidine sequence similarity.

Hamming Distance of the SARS-CoV2
The similarity of the SARS-CoV2 sequences has been measured by calculating the distance between the vectors of encoded binary strings; for this purpose, the Hamming distance is deployed.
In order to demonstrate the methodology, the distances (Hamming distances) among the 89 SARS-CoV2 sequences, as given in Table 10, are taken into consideration. Note that if two virus sequences have a large Hamming distance between them, this implies that the two sequences are unlikely to be closely related. From the Figure [...]
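The pairwise comparison described above can be sketched as follows. The Hamming distance is only defined for strings of equal length, and the text does not say how sequences of different lengths were aligned or truncated, so this illustration simply refuses unequal lengths; that choice and the function names are assumptions.

```python
def hamming_distance(first, second):
    """Number of positions at which two equal-length sequences differ."""
    if len(first) != len(second):
        # Assumption: unequal lengths are rejected instead of padded or trimmed.
        raise ValueError("Hamming distance requires sequences of equal length")
    return sum(a != b for a, b in zip(first, second))


def distance_matrix(sequences):
    """Symmetric matrix of pairwise Hamming distances."""
    size = len(sequences)
    matrix = [[0] * size for _ in range(size)]
    for i in range(size):
        for j in range(i + 1, size):
            distance = hamming_distance(sequences[i], sequences[j])
            matrix[i][j] = matrix[j][i] = distance
    return matrix


if __name__ == "__main__":
    toy = [[1, 0, 1, 1], [1, 1, 1, 0], [0, 0, 1, 1]]
    for row in distance_matrix(toy):
        print(row)
```

A matrix of this kind is the usual input to a hierarchical clustering method when building a dendrogram.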

Conclusions and Summary
Needless to say, the novel coronavirus has led to a public health emergency of global concern according to the WHO (https://www.who.int/). One of the major reasons for such a global threat is the lack of quantitative as well as qualitative knowledge about this novel virus, including at the genomic and proteomic levels.
In this article, an attempt has been made to clarify the quantitative nature of the complete SARS-CoV2 sequences. The present study also reveals the closeness among the 89 complete sequences at the purine-pyrimidine level through phylogenetic analysis. A major finding for the 89 SARS-CoV2 sequences is that the purine and pyrimidine bases are distributed evenly in space across all these genomes, although the GC content is significantly low, as described in the results. We believe this quantitative information will enable researchers to better comprehend the genomic description of the SARS-CoV2 sequences and will at least passively help in ensuring proper healthcare facilities against this massive global emergency. In future work, we wish to understand the proteins of SARS-CoV2.

Authors Contributions and Conflicts of Interest:
The author SH formulated and carried out the study with RKR and VS. The authors SH and RKR analysed the study and wrote the manuscript, and finally all three authors checked and approved the manuscript. The authors declare that there are no conflicts of interest.

Figure 9: Phylogenetic tree of SARS-CoV2 sequences based on the distribution of purines and pyrimidines (cluster dendrogram using UPGMA distance method)