Spatial Distribution of Amino Acids of the SARS-CoV2 Proteins

The world is now going through a global emergency due to COVID-19, which needs immediate remedies to strengthen healthcare facilities and save nations. In search of such remedies, research on different aspects of SARS-CoV2, including genomic and proteomic level characterization, is important. In the present study, the spatial representation/composition of the twenty amino acids across the primary protein sequences of SARS-CoV2 has been examined through different parameters, viz. Shannon entropy and Hurst exponent, in order to obtain the autocorrelation and the amount of information over the spatial representations. The frequency distribution of each of the amino acids over the protein sequences has also been worked out.


Introduction
The global emergency due to COVID-19 is making life hard throughout the world. The genome of SARS-CoV2 (of size approximately 30 kb) is among the largest known so far for RNA viruses [1,2,3,4,5]. CoVs are classified into three different classes, α-CoVs, β-CoVs and γ-CoVs, based on genetic and antigenic criteria [6,7]. SARS-CoV2 is classified into the β-CoV group [8].
A good number of research activities have been carried out across the world [9,10,11]. Every day, new genome sequences of SARS-CoV2 are being added to databases, viz. the NCBI virus database [12,13]. At present, neither antiviral drugs with proven efficacy nor vaccines for CoV2 prevention exist [14,15]. Also, researchers have little knowledge of the molecular biology of SARS-CoV2 infection [16]. In the present state of the art, the viral infection mechanism is not fully understood, though various protein-protein interactions (PPIs) of virus and host are known [17,18]. So identifying interactions between the SARS-CoV2 virus proteins and host proteins helps to understand the mechanism of viral infection and to develop treatments and vaccines [19]. Understanding these SARS-CoV2 proteins is one of the primary aims in order to get clarity on the PPIs between the virus proteins and host proteins [20]. Biologists are yet to understand the spatial arrangement of secondary structure elements (SSEs) [21,22]. The geometric three-dimensional structure of a protein depends on the spatial arrangement of the SSEs, which has been studied in [23]. So the spatial distribution as well as the presence/absence of different amino acids over a primary protein sequence of SARS-CoV2 is important to reveal.
The spatial arrangement uncovers the rules that govern the folding of polypeptide chains [24]. Alteration of amino acids over the primary sequence might affect the function of a protein. Also, the primary sequence of a protein reveals the molecular events in evolution. The spatial arrangement of amino acids determines the conformability of proteins too [25,26].
In the present study, the spatial composition of the twenty amino acids across the primary proteins of SARS-CoV2 has been examined through parameters, viz. Hurst exponent and Shannon entropy. A frequency analysis of the amino acids over the proteins has also been worked out. It is noted that the authors have done a similar analysis for 89 genomes of SARS-CoV2 [27].

Methods
In characterizing the spatial distribution of amino acids over the primary protein sequences of SARS-CoV2, three parameters are considered: Hurst exponent, Shannon entropy and amino acid density. These methods are described briefly below. Similar works based on these methods are reported in [32,33,34].

Hurst Exponent
Fractality (an organized form of nonlinearity) is naturally characterised using the fractal dimension. For a one-dimensional sequence, the fractal dimension (D) and the Hurst exponent (HE) are linearly related as $D + HE = 2$ [35,36]. The Hurst exponent measures the autocorrelation in the sequences [37]. The HE lies in the interval (0, 1). For a rough anti-correlated sequence the HE is strictly less than 0.5, and for positively correlated sequences the HE ranges between 0.5 and 1. If HE = 0.5, then the sequence is random (white noise).
The HE of a binary sequence s n is defined as where Y (n) = 1 n
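As an illustration of how an HE might be estimated for a binary presence/absence sequence of an amino acid, the following minimal sketch uses the classical rescaled-range (R/S) estimator. The choice of estimator, the window sizes, and the helper names `indicator`, `rescaled_range` and `hurst_rs` are assumptions made for illustration and need not match the exact definition used in this study.

```python
import math
import random


def indicator(protein, residue):
    """Binary representation: 1 where `residue` occurs along `protein`, else 0."""
    return [1 if aa == residue else 0 for aa in protein]


def rescaled_range(chunk):
    """R/S statistic of one window; None if the window is constant."""
    n = len(chunk)
    mean = sum(chunk) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in chunk) / n)
    if std == 0:
        return None
    running, lowest, highest = 0.0, math.inf, -math.inf
    for v in chunk:
        running += v - mean          # cumulative deviation from the mean
        lowest = min(lowest, running)
        highest = max(highest, running)
    return (highest - lowest) / std


def hurst_rs(seq, min_window=8):
    """Slope of log(mean R/S) against log(window size), windows doubling in size."""
    n = len(seq)
    xs, ys = [], []
    w = min_window
    while w <= n // 2:
        values = [rescaled_range(seq[i:i + w]) for i in range(0, n - w + 1, w)]
        values = [v for v in values if v is not None]
        if values:
            xs.append(math.log(w))
            ys.append(math.log(sum(values) / len(values)))
        w *= 2
    if len(xs) < 2:
        raise ValueError("sequence too short or constant: HE is undefined")
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
        (x - mx) ** 2 for x in xs
    )


# Hypothetical test input: a fair coin-flip sequence (no real protein data).
random.seed(0)
coin_flips = [random.randint(0, 1) for _ in range(4096)]
print(hurst_rs(coin_flips))
```

A null sequence (amino acid absent from the protein) has no defined R/S value, so this sketch raises an error for it rather than returning a number. Short sequences give noisy estimates under R/S, which is one reason the window sizes above are only a suggestion.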

Shannon entropy
There are two kinds of Shannon entropy we wish to determine in this present study.
Binary Shannon Entropy: The Shannon entropy (SE) measures the information entropy of a Bernoulli process with probability p of the two outcomes (0/1). It is defined as
$$SE = -\sum_{i=1}^{2} p_i \log_2(p_i)$$
where $p_1 = \frac{k}{l}$ and $p_2 = \frac{l-k}{l}$; here $l$ is the length of the binary sequence and $k$ is the number of 1's in it. Whenever the probability p = 0, the event is certain never to occur, and so there is no uncertainty, leading to an entropy of 0. Similarly, if the probability p = 1, the result is certain, so the entropy must be 0. When p = 0.5, the uncertainty is at a maximum and consequently the SE is 1.
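The binary Shannon entropy described above can be sketched as follows, assuming $p_1 = k/l$ and $p_2 = (l-k)/l$ with $k$ the number of 1's in a binary sequence of length $l$. The function name `binary_shannon_entropy` is hypothetical. Under this assumption, a sequence of length 38 containing five 1's gives an SE of about 0.562, which agrees with the value reported later for the amino acid $A_9$ over the protein N80.

```python
import math


def binary_shannon_entropy(bits):
    """Shannon entropy (in bits) of a 0/1 sequence, from the fraction of 1's."""
    length = len(bits)
    if length == 0:
        raise ValueError("empty sequence")
    ones = sum(1 for b in bits if b)
    entropy = 0.0
    for p in (ones / length, (length - ones) / length):
        if p > 0:                      # a zero probability contributes nothing
            entropy -= p * math.log2(p)
    return entropy


# Five occurrences over a protein of length 38 (order does not affect the SE).
print(round(binary_shannon_entropy([1] * 5 + [0] * 33), 3))
```

Because this entropy depends only on the count of 1's and not on their positions, it complements the HE, which is sensitive to ordering.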
Amino Acid Conservation Shannon Entropy: Protein Post Translational Modification (PTM) is an important biological mechanism for expanding the genetic code [39,40]. To find the conservation of amino acids in primary protein sequences, Shannon entropy is deployed. The SE is widely used for predicting PTMs. For a given protein sequence, the SE is calculated as follows:
$$SE = -\sum_{i=1}^{20} p_{A_i} \log(p_{A_i})$$
where $p_{A_i}$ represents the occurrence frequency of amino acid $A_i$ in the sequence.
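A minimal sketch of the conservation entropy of a single protein sequence is given below. It assumes that $p_{A_i}$ is the relative frequency of $A_i$ in the sequence and that the logarithm is taken to base 2; the base is an assumption, exposed here as the parameter `base`, and the function name `conservation_entropy` is hypothetical.

```python
import math
from collections import Counter


def conservation_entropy(protein, base=2.0):
    """Shannon entropy of the amino acid composition of one protein sequence."""
    if not protein:
        raise ValueError("empty sequence")
    length = len(protein)
    entropy = 0.0
    for count in Counter(protein).values():
        p = count / length             # relative frequency of this amino acid
        entropy -= p * math.log(p, base)
    return entropy


# Toy sequences for illustration only (not SARS-CoV2 proteins).
print(conservation_entropy("AAAA"))    # single amino acid: no uncertainty
print(conservation_entropy("ACAC"))    # two equally frequent amino acids
```

Amino acids that do not occur in the sequence are simply skipped, which is equivalent to treating their contribution as zero.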

Amino Acid Density
Over the primary protein sequences of SARS-CoV2, we wish to explore the amino acid frequency distributions and the corresponding statistical descriptions [41]. The density of an amino acid over a primary protein sequence can also be found as the ratio $F(A_i)/L(P)$, where $A_i$ is an amino acid present in the primary protein sequence $P$, $L(P)$ is the length of the sequence $P$ and $F(A_i)$ is the frequency of the amino acid $A_i$ in $P$.
... through Hurst exponent are reported. The Hurst exponent would also imply the fractality (organized non-linearity) of the spatial representations. In addition, the amount of uncertainty of presence/absence of the amino acids over the protein sequences is determined through Shannon entropy. The amino acid conservation information is also determined through the Shannon entropy.
Here the HE of the 105 binary representations of the amino acid $A_1$ ranges from 0.509 to 0.7331 with standard deviation 0.04512. Based on the HEs of the binary sequences of all these 105 primary protein sequences of SARS-CoV2, ten clusters (C) are formed as presented in Table 2.
Here the HE of the 105 binary representations of the amino acid $A_4$ ranges from 0.546 to 0.664 with standard deviation 0.0876. Based on the HEs of the binary sequences of all these 105 primary protein sequences of SARS-CoV2, ten clusters (C) are formed as presented in Table 5.

There are two protein sequences, N68 and N81, without any amino acid G (conditionally essential), as can be seen in the ...
... the binary sequences has been plotted and the corresponding histogram is also given in Fig 5. The HE of the binary representations of the ordering of the amino acid $A_5$ over all the primary protein sequences would reveal the autocorrelation of the amino acid. Here the HE of the 105 binary representations of the amino acid $A_5$ ranges from 0.5 to 0.685 with standard deviation 0.136. Based on the HEs of the binary sequences of all these 105 primary protein sequences of SARS-CoV2, ten clusters (C) are formed as presented in the ...
Here the HE of the 105 binary representations of the amino acid $A_7$ ranges from 0.508 to 0.754 with standard deviation 0.0395. Based on the HEs of the binary sequences of all these 105 primary protein sequences of SARS-CoV2, ten clusters (C) are formed as presented in Table 8.
The binary representations $B_{7,2}$, $B_{7,68}$ and $B_{7,15}$ of the spatial arrangement of the amino acid L over the protein sequences N2, N68 and N15 are random, as the HEs of these sequences are 0.5 (approx.). There are 54 sequences in cluster 5 having HE 0.58. The spatial arrangements of the amino acid L over these proteins are not random, but not strongly trending either, as the HE is greater than 0.5 but less than 0.6. As usual, there are other clusters having sequences with positive autocorrelation (trending), as given in Table 8.
Table 9.
Here cluster 3 contains most of the sequences (80 in number), for which the spatial distributions of the amino acid M over the protein sequences have HE 0.61 (approx.), which indicates trending behaviour. Clearly, the spatial organizations of the amino acid M over the protein sequences N102, N80 and N81 are random. All the rest show the trending behaviour seen before.
... the binary sequences has been plotted and the corresponding histogram is also given in Fig 9. The HE of the binary representations of the ordering of the amino acid $A_9$ over all the primary protein sequences would reveal the autocorrelation of the amino acid.
Here the HE of the 105 binary representations of the amino acid $A_9$ ranges ...
One of the conditionally essential amino acids, P, does not arise in the protein ...
In cluster 1, there are two protein sequences, N96 and N97, which are absolutely free from the amino acid Q. Cluster 2 contains 45 sequences having HE 0.58, and so the spatial organization of the amino acid Q is positively trending. As usual, there are three other clusters, 1, 4 and 5, which contain positively autocorrelated sequences of the spatial distribution of the amino acid Q over the protein sequences. There is only one binary representation, $B_{11,100}$, of the amino acid Q over the protein sequence N100, which is negatively trending.
The binary representation $B_{12,7}$ of the spatial organization of the non essential ...
Table 14.
The spatial representation $B_{13,99}$ of the essential amino acid T is a null sequence having only zeros, which implies the absence of the amino acid over the protein sequence N99. The spatial distributions of the amino acid T over the 76 protein sequences belonging to cluster 1 are positively trending.
Table 17.
The conditionally essential amino acid Y is absent in the protein sequences N99 and N103. The spatial distribution of the amino acid Y over the protein N80, the only one belonging to cluster 6, is not trending, as its HE is 0.479 < 0.5. The largest cluster, cluster 1, contains 68 protein sequences where the amino acid Y is spatially spread with positive trend.
The spatial distribution $B_{17,2}$ of the amino acid D over the protein sequence N2 is random, since the HE of $B_{17,2}$ turns out to be 0.501. The largest cluster, cluster 1, contains 60 protein sequences where the amino acid D is spread with positive trend, as shown in Table 18.
There are 48 sequences in cluster 1 where the corresponding spatial distributions $B_{18,j}$ are positively trending with HE 0.712 exactly. Such an organized trend is certainly noteworthy. It is noted that the non-essential amino acid E does not appear in the protein sequences N80 and N99.
Table 21.
... positive autocorrelated spatial representations of the amino acid R.

A Collective View of the HEs
Following we have listed the protein sequences of different lengths, ranging from 13 to 419, which do not contain some amino acid(s), as listed in the following ...
The protein sequence N99 of length 13 does not contain the amino acids ...
... HEs of the spatial distribution is also shown through graphs in Fig 21. It is worth mentioning that in the correlation matrix in Table 23, the negative correlations of the spatial distributions of the proteins are also shown.
As an example, the correlation (correlation coefficient r = 0.443) of the spatial distribution (autocorrelation) of the amino acid M with the spatial distribution of the amino acid Y is shown in Fig. 22.
Table 25.
Here the SE of the 105 binary representations of the amino acid $A_4$ ranges from 0 to 0.536 with standard deviation 0.0852. Based on the SEs of the binary sequences of all these 105 primary protein sequences of SARS-CoV2, six clusters (C) are formed as presented in Table 27.
Table 31.
The amino acid $A_8$ (M) is not present in the sequence N99, which is of smallest length, and so the amount of uncertainty is zero, as found in Table 31. Cluster 1, along with the others, contains most of the proteins of SARS-CoV2, where the amino acid is present all over the proteins of various lengths with almost certainty, which is validated by its SE of 0.162.
Cluster 3 contains one protein, N80, where the spatial distribution $B_{9,80}$ has the SE 0.562, which says that the absence of the amino acid $A_9$ over the protein is without uncertainty. It is noted that the total number of occurrences of the amino acid $A_9$ over the protein N80 of length 38 is 5. The other five clusters contain the remaining 104 proteins, where the amino acid $A_9$ is spread with certainty, as the SE is less than 0.5.
Table 34.
Cluster 4 contains the proteins N96 and N97, for which the SE turns out to be zero for the binary representations $B_{11,j}$, j = 96 and 97, of the amino acid $A_{11}$. It is noted that these two proteins are absolutely free from the amino acid $A_{11}$. All the remaining clusters contain the protein sequences where the amino acid $A_{11}$ is present over the proteins with almost certainty.
... binary sequences all these 105 primary protein sequences of SARS-CoV2, five clusters (C) are formed as presented in Table 35.
The amino acid $A_{10}$ is present over all the proteins except N99 with almost certainty, since the SE of the spatial distributions turns out to be less than 0.5, as shown in Table 35. The SE of the shortest protein, N99, is greater than 0.5, which implies that the absence of the amino acid is spread over the protein with certainty.
Table 36.
The amino acid $A_{13}$ (T) is absent in the protein sequence N99, and consequently the binary representation $B_{13,99}$ of presence and absence of the amino acid is a sequence of zeros without any uncertainty (SE = 0), as shown in Table 36. The remaining proteins, belonging to other clusters, have the presence of the amino acid $A_{13}$ (T) with the least amount of uncertainty, as depicted in Table 36.
Table 37. Table 39. Table 41. Table 42.
Every term of the binary representations $B_{19,80}$, $B_{19,81}$ and $B_{19,99}$, of lengths 38, 43 and 13 respectively, is zero, and consequently the SE turns out to be zero, which implies that the absence of the amino acid is without any uncertainty.

The other proteins of the remaining clusters 1, 2, 4 and 5 have the presence of the amino acid $A_{19}$ with almost certainty.
... of their lengths is identical for many values of j. This essentially reports that the probability of the presence of the amino acid $A_i$ over those proteins is the same.
Here we explore the correlation of the amount of uncertainty of the presence/absence of the amino acids over the proteins of SARS-CoV2 in the spatial representations. Following is the correlation matrix of ten amino acids A, C, F, G, H, I, L, M, N and P versus another ten amino acids Q, S, T, V, W, Y, D, E, K and R.
Next we move to the entropy of conservation of amino acids over the 105 SARS-CoV2 proteins in the following subsection.

Amino Acid Conservation Shannon Entropy and Its Classification
For each of the 105 protein sequences, the amino acid conservation information has been determined through SE as described earlier. In the following Table 45, the Shannon entropy (SE T 2) for each sequence and the clusters (C) formed based on the SE are given. The plot of the SE over the 105 protein sequences with its histogram is given in Fig. 45.
At last, the frequency analysis of the amino acids over the proteins is given in the following subsection. Following is a correlation among the frequency distributions of each of the amino acids over the 105 proteins of SARS-CoV2. The correlation coefficients corresponding to the frequency distributions over the proteins are given in Table 46. The corresponding correlations are also given pairwise in matrix form in Fig. 48. This ensures the existence of significant correlations of frequencies of each of the amino acids over the proteins. In fact, the correlation coefficient between the frequency distributions corresponding to the amino acids A (Aliphatic) and K (Basic) is 1. The frequency plots of the amino acids A and K are given in Fig. 49. The plots show the strong correlation between the frequency distributions over the proteins. Overall, it is observed that proteins of the same length have mostly similar frequency distributions of the twenty amino acids.
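The frequency analysis can be illustrated with a short sketch that counts $F(A_i)$ for each of the twenty amino acids in one protein and divides by $L(P)$. Treating the density as the plain ratio $F(A_i)/L(P)$ is an assumption, as are the function names and the toy sequence; the ordering of the letters follows the listing A, C, F, G, H, I, L, M, N, P, Q, S, T, V, W, Y, D, E, K, R used for the correlation matrix.

```python
from collections import Counter

# Twenty amino acids, in the order used for the correlation matrix.
AMINO_ACIDS = "ACFGHILMNPQSTVWYDEKR"


def frequencies(protein):
    """F(A_i): number of occurrences of each amino acid in the sequence."""
    counts = Counter(protein)
    return {aa: counts.get(aa, 0) for aa in AMINO_ACIDS}


def densities(protein):
    """Assumed density: F(A_i) divided by the sequence length L(P)."""
    if not protein:
        raise ValueError("empty sequence")
    length = len(protein)
    return {aa: f / length for aa, f in frequencies(protein).items()}


def absent_amino_acids(protein):
    """Amino acids that never occur in the sequence."""
    return [aa for aa, f in frequencies(protein).items() if f == 0]


# Toy sequence for illustration only (not a SARS-CoV2 protein).
toy = "MKKA"
print(frequencies(toy)["K"], densities(toy)["K"], absent_amino_acids(toy))
```

Collecting these per-protein counts for all proteins gives one frequency vector per amino acid, which is the kind of input a correlation table such as Table 46 would be computed from.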

Next we are heading towards a comparative ...

Spatial Organization of Proteins of SARS-CoV
In 2003, the SARS coronavirus (SARS-CoV) caused an epidemic in China and 22 other countries [42,43]. There are 14 protein sequences available in the NCBI database (taxid: 722424). The list of these proteins (S1, S2, ..., S14) with their accessions is given in the following Table 47. It is noted that the protein with the accession ACU31032 (S14) is a spike protein of length 1241, as mentioned in the NCBI database. The spike protein (S-protein) is a large type I transmembrane protein of length not exceeding 1400 amino acids. The spike protein has an important function in the case of SARS-CoV [44,45]. Among all the proteins of SARS-CoV, the spike protein is the main antigenic component, which is responsible for inducing host immune responses, neutralizing antibodies and/or protective immunity against virus infection [46].
We therefore examine here the spatial representations of the amino acids over the spike protein along with the other 13 proteins mentioned in Table 47.
The HE, SE and frequency distributions are given in the following and compared with those of the SARS-CoV2 proteins.
It is observed that the spatial representations of the presence of all the amino acids over the spike protein S14 follow positive autocorrelation (positively trending).
Below, in Table 49, we derive the correlation coefficients of the HEs of the spatial representations of the amino acids over the 14 SARS-CoV proteins. It is noted that the SE turns out to be zero for the cases where the spatial distribution corresponds to an amino acid which is absent over a protein.
The spatial distributions of amino acids over the proteins of SARS-CoV are all without much uncertainty, except for three cases where the SEs are greater than 0.5 and the absence of amino acids dominates in terms of certainty.
The correlation coefficients of the SEs of the spatial distributions of the amino acids over the 14 SARS-CoV proteins are given in Table 50. It is observed that the correlations among the SEs of the spatial distributions of the amino acids over the proteins are not significantly high, as tabulated in Table 50. The highest positive correlation, between the SEs of the spatial distributions of the amino acid C and those of Y, turns out to be 0.572.
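The correlation coefficients discussed above can be sketched as a Pearson correlation between two per-protein series, for example the SEs (or HEs) of one amino acid against those of another across the same set of proteins. Using the Pearson coefficient is an assumption, since the type of correlation is not stated here; the function names and the toy numbers are hypothetical and are not values from the tables.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length series; NaN if either is constant."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("series must have equal length of at least 2")
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")
    return sxy / math.sqrt(sxx * syy)


def cross_correlation(rows, cols):
    """Matrix of correlations: one row amino acid against each column amino acid."""
    return {a: {b: pearson(va, vb) for b, vb in cols.items()}
            for a, va in rows.items()}


# Toy per-protein values for illustration only.
rows = {"C": [0.10, 0.20, 0.30, 0.25]}
cols = {"Y": [0.12, 0.18, 0.33, 0.20], "W": [0.30, 0.20, 0.10, 0.15]}
print(cross_correlation(rows, cols))
```

A series that is constant across proteins (for instance, an amino acid whose SE is zero everywhere) has no defined correlation, which the sketch reports as NaN instead of raising an error.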

Conclusions and Summary
In this present study the spatial arrangement of amino acids over the SARS-