Non-Uniform Aspects of SARS-CoV-2 Intraspecies Evolution Reopen Questions on Its Origin

Several hypotheses have been proposed on the origin of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) since its identification as the agent causing the current coronavirus disease 2019 (COVID-19) pandemic. So far, no hypothesis has conclusively identified the origin, and the issue has resurfaced. Here we describe the distribution pattern of mutations in SARS-CoV-2 proteins across 24 geo-locations on different continents. The results showed a consistently non-uniform distribution of unique protein variants, distinct mutations, frequencies of common conserved residues, and mutated residues across the 24 geo-locations. Furthermore, numerous mutations were identified in evolutionarily conserved invariant regions of the SARS-CoV-2 proteins across almost all geo-locations considered. This pattern of mutations potentially breaches the principle of evolutionarily conserved functional units in the beta-coronavirus genus. These mutations may lead to several novel SARS-CoV-2 variants with a high degree of transmissibility and virulence. A thorough investigation of the origin and characteristics of SARS-CoV-2 needs to be conducted in the interest of science and in order to be prepared for the challenges of potential future pandemics.


Introduction
SARS-CoV-2 is the etiological agent causing the COVID-19 pandemic. Since its very onset, understanding the origin of SARS-CoV-2 has been of utmost importance in the fight against this virus, and the potential emergence of
The percentages of each SARS-CoV-2 protein across the 24 geo-locations are presented in Figure 1. Furthermore, the S, E, M, N, ORF3a, ORF6, ORF7a, ORF7b, and ORF8 protein sequences of four other coronaviruses, Recombinant SARSr-CoV (taxid-698398), Bat SARS-CoV (taxid-442736), SARS-CoV ExoN1 (taxid-627440), and Bat SARSlike-CoV (taxid-1508227), were downloaded from the NCBI database. In this study, all mutations in SARS-CoV-2 proteins were detected relative to the SARS-CoV-2 reference sequence deposited in January 2020 by Wu and coworkers, formerly called "Wuhan seafood market pneumonia virus" (WSM, NC_045512) [36]. The frequency of total and unique protein sequences is presented in Table 3. Among the four other beta-coronaviruses, the M protein showed the fewest unique variants. The other proteins of these four CoVs had several unique variants, in contrast to the non-uniform distribution of unique variants observed in the SARS-CoV-2 proteins.

Unique protein variants and their mutations
Across the 24 geo-locations, the amino acid residues that did not carry any mutation were termed invariant residues. These invariant residues were extracted for all unique protein variants of SARS-CoV-2 from all 24 geo-locations (Table 4) (Supplementary file-I). Mutated residues common to all 24 geo-locations were also detected (Table 5) (Supplementary file-I). From Table 4, it was observed that methionine (M) at residue position 1 did not change in any of the SARS-CoV-2 proteins listed above, except in ORF10. In ORF10, all amino acid residues from position 1 to 38 were mutated. Even the methionine at position 1 was changed to glycine in a single ORF10 sequence, QKG88643, from Massachusetts, USA (collected on 18-03-2020). This mutation, M1G, was predicted to be 'neutral' by the PredictSNP web server. Note that no sequence with 100% identity and 100% query coverage to QKG88643 was found (NCBI BLAST).
Table 5: Mutation residues that were common in all 24 geo-locations.
The number of common mutations in the SARS-CoV-2 proteins across the 24 geo-locations was surprisingly low (Table 5). D614 was the only mutated position possessed by unique S protein variants from all 24 geo-locations. Similarly, unique N protein variants from all 24 geo-locations possessed mutations at R203 and G204, with changes to multiple amino acids (Table 5). The unique ORF3a variants from all 24 geo-locations had a single common mutation, at position 57, with changes to H, E, L, N, R, and Y. Not a single mutation common to all 24 geo-locations was found in E, M, ORF6, ORF7a, ORF7b, ORF8, or ORF10.
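The two notions used in this section, invariant residues and mutations common to all geo-locations, can be illustrated with a minimal sketch. It assumes equal-length, gap-free sequences already grouped by geo-location, and it treats a position as "common" when it is mutated in at least one unique variant of every geo-location. The function names and the toy 8-residue sequences are hypothetical; they are not the authors' pipeline or data.

```python
from typing import Dict, List, Set, Tuple


def mutated_positions(reference: str, variant: str) -> Set[int]:
    """Return the 1-based positions where `variant` differs from `reference`."""
    if len(reference) != len(variant):
        raise ValueError("sketch assumes equal-length, pre-aligned sequences")
    return {i for i, (r, v) in enumerate(zip(reference, variant), start=1) if r != v}


def invariant_and_common(
    reference: str, variants_by_location: Dict[str, List[str]]
) -> Tuple[Set[int], Set[int]]:
    """Invariant positions are mutated in no variant of any location.
    Common positions are mutated in at least one variant of every location."""
    per_location = {
        loc: set().union(*(mutated_positions(reference, v) for v in seqs))
        for loc, seqs in variants_by_location.items()
    }
    mutated_anywhere = set().union(*per_location.values())
    invariant = set(range(1, len(reference) + 1)) - mutated_anywhere
    common = set.intersection(*per_location.values()) if per_location else set()
    return invariant, common


# Hypothetical toy data: an 8-residue reference and three made-up locations.
reference = "MKDLQRGA"
variants = {
    "LocA": ["MKGLQRGA", "MKGLHRGA"],  # positions 3 and 5 mutated
    "LocB": ["MKGLQRGT"],              # positions 3 and 8 mutated
    "LocC": ["MRGLQRGA"],              # positions 2 and 3 mutated
}
invariant, common = invariant_and_common(reference, variants)
print(sorted(invariant))  # [1, 4, 6, 7]
print(sorted(common))     # [3]
```

In this toy case only position 3 is mutated in every location, mirroring how a single position such as D614 can be the only common mutation of a protein.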

Spike protein variants and mutations
The total frequency of unique mutations possessed by the S protein of SARS-CoV-2 across the 24 geo-locations is presented in Table 6. The highest ratio of unique mutations to unique S protein variants (495%) was observed in Peru, where 44 unique S sequences had 218 unique mutations. In contrast, California, which had the second-highest number of unique S protein variants, showed the lowest ratio (33%). Figure 2 shows the average number of mutations per unique S protein variant, that is, the ratio $M_S/U_S$ of unique mutations to unique S variants. Figure 2(B) suggests that the probability of finding a triple mutant in any randomly chosen unique S protein variant from Austria is nearly 1, since $M_S/U_S = 3.77 > 3$. Similarly, the probability of finding at least a quadruple mutant in any randomly chosen unique S protein variant from Peru is nearly 1, since $M_S/U_S = 4.95 > 4$. Strikingly, none of the unique S protein variants from the North American geo-locations possessed more than one mutation, since the ratio in each case was less than 1, although the total numbers of unique S variants and mutations were higher than elsewhere.
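As a worked check of the ratio used above, the following minimal sketch recomputes the Peru value from the counts quoted in the text (44 unique S sequences, 218 unique mutations). The function name `mutations_per_variant` is hypothetical and not taken from the study.

```python
def mutations_per_variant(unique_mutations: int, unique_variants: int) -> float:
    """Average number of unique mutations per unique variant (M/U)."""
    if unique_variants <= 0:
        raise ValueError("need at least one unique variant")
    return unique_mutations / unique_variants


# Counts quoted in the text for the S protein in Peru.
peru_ratio = mutations_per_variant(unique_mutations=218, unique_variants=44)
print(f"M_S/U_S = {peru_ratio:.2f} ({peru_ratio:.0%})")  # M_S/U_S = 4.95 (495%)
```

Since the ratio is a mean over unique variants, it summarizes the mutation load of a geo-location but does not by itself fix the number of mutations carried by any individual variant.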
A total of 23 'variant of concern (VoC)' and 25 'variant of interest (VoI)' mutations in the S protein have been reported [41,42,43,44]. The frequency of common mutations was determined continent-wise, together with the VoC and VoI mutations among the common S protein mutations of each continent (Table 7). Since Australia was the only geo-location in Oceania considered in this study, common mutations were not determined for that continent.
Table 7: Continent-wise common mutations in the S protein and list of Variants of concern (VoC), Variants of Interest (VoI) mutations in S protein.

It was found that 487 mutations in the S proteins were common to the seven geo-locations in North America, although the only mutation common across all 24 geo-locations was D614G. Furthermore, all 23 VoC mutations were present in each North American geo-location. On the other hand, the unique S proteins from the European geo-locations possessed only the D614 common mutation. In all African geo-locations, a moderate number of VoC and VoI mutations were found, although the number of common mutations across these geo-locations was not high compared with that of the others (Table 7). Also, a randomly chosen S protein variant from Ghana has a very high probability of carrying a double VoC/VoI mutant, as the ratio $M_S/U_S$ is 2.75.
Earlier, it was reported that 'RRAR' (amino acid positions 682-685), a unique furin-like cleavage site (FCS) in the S protein that is absent in beta-coronaviruses of other lineages, such as SARS-CoV, causes high infectivity and transmissibility [45,46,47]. Even in this FCS, a single mutation at position 684 was found in some unique S protein variants from California, Massachusetts, and Michigan. Details of the protein accessions and associated information are presented in Table 8. The first such mutation, A684V, was reported in Massachusetts on September 9, 2020 (Accs. ID: QTP22615). Three days later, the same mutation was identified in California (QRG20397). The mutations A684V/S were predicted to be 'neutral' (PredictSNP web server), and hence the ability of the virus to infect and transmit is expected to remain unchanged [40].

Envelope protein variants and mutations
The total frequency of unique mutations possessed by the E protein of SARS-CoV-2 across the 24 geo-locations is presented in Table 9.

Membrane protein variants and mutations
The frequency of unique mutations possessed by the M protein of SARS-CoV-2 across the 24 geo-locations is presented in Table 10. All North American geo-locations shared a total of 24 mutations in the M protein variants, at positions 2, 7, 17, 23, 28, 33, 34, 60, 69, 70, 81, 82, 85, 89, 98, 104, 109, 125, 142, 155, 173, 175, 208, and 209 (Supplementary file-I). On the other hand, not a single common mutation in the M proteins was found across the geo-locations in Asia, and the same was observed for Africa and Europe. The M proteins from India shared 9 mutations with those of each North American geo-location, at positions 2, 17, 69, 70, 82, 104, 125, 142, and 209. Among the 24 common mutations from the North American geo-locations, only two, at positions 17 and 23, were shared with M proteins from Greece.
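Continent-wise common mutations amount to a set intersection over position sets. The sketch below applies this to the M protein positions quoted above; the helper name `common_positions` is hypothetical, and only positions stated in the text are used.

```python
from typing import Iterable, Set


def common_positions(position_sets: Iterable[Set[int]]) -> Set[int]:
    """Positions present in every one of the given sets."""
    sets = list(position_sets)
    return set.intersection(*sets) if sets else set()


# M protein positions common to all North American geo-locations (from the text).
NA_COMMON = {2, 7, 17, 23, 28, 33, 34, 60, 69, 70, 81, 82, 85, 89, 98,
             104, 109, 125, 142, 155, 173, 175, 208, 209}
# Positions that India and Greece share with North America (from the text).
INDIA_SHARED = {2, 17, 69, 70, 82, 104, 125, 142, 209}
GREECE_SHARED = {17, 23}

# Consistency checks on the quoted numbers.
print(len(NA_COMMON))              # 24
print(INDIA_SHARED <= NA_COMMON)   # True
print(GREECE_SHARED <= NA_COMMON)  # True

# Positions shared by North America, India, and Greece at once.
print(sorted(common_positions([NA_COMMON, INDIA_SHARED, GREECE_SHARED])))  # [17]
```

The same helper could be applied to per-location mutation sets (for example, those in Supplementary file-I) to recover the continent-wise counts; those sets are not reproduced here.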

Nucleocapsid protein variants and mutations
The frequency of unique N protein mutations across the 24 geo-locations is presented in Table 11. The lowest number of mutations per unique variant was found in the unique N proteins from California ($M_N/U_N = 0.27 < 1$), whereas 53 unique N protein variants from Bangladesh had 86 mutations ($M_N/U_N = 1.62 > 1$) (Table 11). Every unique N protein variant from Bangladesh therefore contains at least a single mutation, as follows from the ratio $M_N/U_N = 1.62 > 1$. Likewise, each unique N variant from Bahrain, Peru, Chile, France, Greece, Hong Kong, India, Serbia, and Tunisia contains at least one mutation (for each of these geo-locations, $M_N/U_N \geq 1$).
Furthermore, 153 mutations were shared among all unique N proteins from each geo-location in North America. Only 6 mutations, at positions 3, 194, 202, 203, 204, and 377, were common across the Asian geo-locations, whereas only two mutations, at positions 203 and 204, were found in the N variants from the European geo-locations. There were 9 mutations, at positions 9, 194, 202, 203, 204, 205, 220, 235, and 238, in the N proteins detected in the African geo-locations.

ORF3a protein variants and mutations
The frequency of unique ORF3a protein mutations across the 24 geo-locations is presented in Table 12. From Table 12, it was observed that the lowest number of mutations per unique variant was found in ORF3a variants from California ($M_{3a}/U_{3a} = 0.25 \ll 1$), even though California had the highest number of unique ORF3a variants. On the other hand, 13 ORF3a variants from Greece had 25 mutations altogether. Therefore, almost every ORF3a variant from Greece was likely to contain two mutations ($M_{3a}/U_{3a} = 1.92 \approx 2$). Furthermore, each ORF3a variant from Australia, Austria, Bahrain, Chile, France, Ghana, Pakistan, Peru, Poland, Serbia, Spain, and Tunisia contains at least one mutation, namely Q57, but not more than two mutations, since the ratio $M_{3a}/U_{3a}$ lies between 1 and 2.
A total of 167 common mutations in ORF3a variants were detected across the North American geo-locations, whereas the only common mutation in the European geo-locations was Q57. It was noted that unique ORF3a variants from Texas, Pennsylvania, Florida, Michigan, and Minnesota had common mutations at positions 243, 224, 255, 229, and 238, respectively, with those from California. ORF3a variants from the African geo-locations shared five common mutations, at positions 57, 100, 155, 171, and 224. Also, three mutations, at positions 57, 175, and 223, were possessed by the ORF3a variants from each Asian geo-location. Unique ORF3a variants from California and Massachusetts shared 225 mutations out of 264 in total.

ORF6 protein variants and mutations
The frequency of unique ORF6 protein mutations across the 24 geo-locations is presented in Table 13. There were 25 common mutations in ORF6 variants in each geo-location of North America, whereas no common mutation in ORF6 was found in the European geo-locations. Likewise, in Asian and African geo-locations, no common mutation was detected for ORF6 variants.

ORF7a protein variants and mutations
The frequency of unique ORF7a protein mutations across the 24 geo-locations is presented in Table 14. The ratio $M_{7a}/U_{7a} > 3$ in Greece and Peru implied that most unique variants must have at least three mutations (Table 14). Unique ORF7a variants from Australia, Austria, Bangladesh, Chile, Egypt, Ghana, Hong Kong, India, Pakistan, and Serbia must contain at least a single mutation, as in each case the ratio was greater than, equal to, or close to 1. Furthermore, no new ORF7a sequence has been found so far among 90 infected patients in France.
Ninety-two common mutations were detected in the unique ORF7a variants from the North American geo-locations, whereas no common mutation was observed in the European geo-locations. Only one common mutation, at position 28, was found in the Asian geo-locations, and another single common mutation, at position 14, was found in the African geo-locations. ORF7a protein sequences from Austria had four mutations, at positions 79, 99, 102, and 103, that were also found in each North American geo-location. Likewise, all unique mutations in ORF7a variants detected in Greece, Poland, and Serbia were present in each North American geo-location.

ORF8 protein variants and mutations
The frequency of unique ORF8 protein mutations across the 24 geo-locations is presented in Table 16. In each geo-location, the wildtype ORF8 protein mutated several times and gave rise to a set of unique ORF8 variants. Every unique ORF8 variant from India and Bangladesh contains at least one mutation, as the ratio in each case was greater than 1 (Table 16). A total of 32 shared mutations were identified across the North American geo-locations. L84 was the only common mutation found in the Asian and African geo-locations.

ORF10 protein variants and mutations
The frequency of unique ORF10 protein mutations across the 24 geo-locations is presented in Table 17. The ratio $M_{10}/U_{10} = 0$ implied that, other than wildtype ORF10 (YP_009725255), no new ORF10 protein emerged in Chile, France, and Greece, although, overall, mutations were observed at every amino acid position from 1 to 38. In all 24 geo-locations, every unique ORF10 variant possessed only a single mutation (as in each case $0 < M_{10}/U_{10} < 2$) (Table 17).

Mutations in the invariant residue regions of various proteins of SARS-CoV-2
ORF10 is unique to SARS-CoV-2 and is not present in any other beta-coronavirus. Therefore, except for ORF10, the unique protein variants of the four other beta-coronaviruses were obtained from the NCBI database (Table 3). Each unique protein variant of these four types was then aligned with the reference protein sequence (NC_045512, China) using the Clustal-Omega web server (Supplementary file-II). Based on the alignment, invariant residue regions of length greater than three amino acids were detected (Table 18). From the alignment, it was observed that the SARS-CoV-2 reference protein sequences (NC_045512) shared a set of invariant residues with the corresponding proteins of the four other beta-coronaviruses. Several invariant regions were identified in all proteins, as indicated in Table 18. The S, E, M, N, ORF3a, ORF6, ORF7a, ORF7b, and ORF8 proteins of the five coronaviruses shared 29, 4, 9, 11, 6, 1, 3, 2, and 2 invariant residue regions, respectively. It is worth noting that the largest invariant region, with a length of 101 residues, was identified in the S protein. These invariant regions possibly serve as sets of functional units in the respective proteins, which would explain why they are conserved in the beta-coronavirus genus.
Over time and due to intraspecies evolution, SARS-CoV-2 proteins have acquired several mutations even in the invariant regions. The total frequency and respective percentage of mutations detected in each invariant residue window of all proteins are presented in Table 19. In all invariant regions of the S protein, unique variants from California, Florida, Texas, Minnesota, and Massachusetts possessed several mutations (Table 19). Notably, unique S protein variants from California, Texas, and Minnesota possessed 93, 88, and 72 distinct mutations, respectively, in the invariant region of 101 amino acid residues.
Among the 29 invariant regions, only seven carried mutations in the S proteins from Tunisia, and these were minimal, with a maximum of two per region. Likewise, S protein variants from Spain, Poland, Serbia, Greece, and France had a minimal number of mutations in nine, eight, five, four, and seven invariant regions, respectively. S protein variants from the other geo-locations possessed a smaller number of mutations in the invariant regions than those from the North American geo-locations. In more than 50% of the 29 invariant regions, S protein variants from India, Bangladesh, Austria, Egypt, and Pakistan possessed a small number of mutations (Table 19). It was noteworthy that in India, Bangladesh, Austria, Egypt, and Pakistan, at most five mutations were found in the largest invariant region of the S2 domain of the S proteins.
Several mutations were identified in the S1, S2, and S2' domains of the S protein (Table 19). The S1 domain of the S protein attaches the virion to the cell membrane by interacting with the host ACE2 receptor, initiating the infection. The S2 domain contributes to the fusion of the virion and cellular membranes by acting as a class-I viral fusion protein, and the S2' domain acts as a viral fusion peptide, which is unmasked following the S2 cleavage occurring after virus endocytosis [48]. These functions might be modified by the mutations occurring in the invariant regions (postulated to be important functional sites for the virus). Whether these mutations in the invariant regions of the S1, S2, and S2' domains would increase the infectivity of the virus is not clear, but it remains a matter of concern.
Invariant regions in the E, M, and N proteins of the five CoVs, including SARS-CoV-2, are presented in Table 20. There were 4, 9, and 11 invariant regions identified in the E, M, and N proteins, respectively. M protein variants in the North American and Oceanian geo-locations contained various mutations in each identified invariant region. In contrast, in the rest of the geo-locations, few mutations in the M proteins were detected, and only in some invariant regions (Table 20).
N proteins from California, Texas, Minnesota, Michigan, Massachusetts, Pennsylvania, Florida, India, Bangladesh, Egypt, and Australia had many mutations in each invariant region. In some of the invariant regions, few mutations were detected in the N proteins from the rest of the geo-locations.
Mutations in the invariant regions of the SARS-CoV-2 ORF proteins are listed in Table 21. There were 6, 1, 3, 2, and 2 invariant regions found in ORF3a, ORF6, ORF7a, ORF7b, and ORF8 variants, respectively. ORF3a variants in the North American and Oceanian geo-locations had several mutations in each invariant region, whereas very few mutations were detected in some (but not all) invariant regions of ORF3a in India, Bangladesh, Egypt, and Chile (Table 21).
No mutations in the invariant region of ORF6 variants were found in Tunisia, Spain, Serbia, Poland, Peru, Hong Kong, Greece, and Egypt. On the other hand, a handful of mutations in the invariant region were detected in the rest of the geo-locations. In the North American geo-locations, a relatively large number of mutations in ORF3a proteins were found in the invariant regions. A small number of mutations were found in the invariant regions of the ORF7a variants in the rest of the geo-locations, with the exception of Tunisia, Hong Kong, Greece, and France (Table 21).
No mutations were found in the ORF7b invariant regions for the ORF7b proteins from Tunisia, Spain, Serbia, Poland, Peru, Hong Kong, Greece, France, Chile, and Austria. On the contrary, a significant number of mutations were detected in the two invariant regions of ORF7b from the rest of the geo-locations.
ORF8 variants from California possessed four mutations in each of the two invariant regions, and several mutations were also detected in the two invariant regions in the other North American geo-locations. However, in most geo-locations, such as India, Tunisia, Spain, France, and Greece, no mutations were found in the two invariant regions (Table 21).
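As an illustration of the invariant-region analysis in this section, the sketch below finds runs of more than three consecutive fully conserved columns in a multiple sequence alignment and then lists the distinct mutated positions that fall inside one such region. It assumes the alignment is already available as equal-length strings with `-` for gaps and the reference in the first row; the function names and the toy 14-column alignment are hypothetical and are not the authors' code or data.

```python
from typing import Iterable, List, Tuple


def invariant_regions(aligned: List[str], min_len: int = 4) -> List[Tuple[int, int]]:
    """Return (start, end) runs of fully conserved columns, at least `min_len` long.
    Coordinates are 1-based positions in the ungapped reference (aligned[0])."""
    if not aligned or len({len(s) for s in aligned}) != 1:
        raise ValueError("sketch assumes a non-empty alignment of equal-length rows")
    regions: List[Tuple[int, int]] = []
    run_start = run_end = None
    ref_pos = 0
    for column in zip(*aligned):
        if column[0] != "-":
            ref_pos += 1  # advance only on reference residues
        conserved = column[0] != "-" and len(set(column)) == 1
        if conserved:
            if run_start is None:
                run_start = ref_pos
            run_end = ref_pos
        else:
            if run_start is not None and run_end - run_start + 1 >= min_len:
                regions.append((run_start, run_end))
            run_start = None
    if run_start is not None and run_end - run_start + 1 >= min_len:
        regions.append((run_start, run_end))
    return regions


def mutations_in_region(mutated: Iterable[int], region: Tuple[int, int]) -> List[int]:
    """Distinct mutated positions that fall inside an invariant region."""
    start, end = region
    return sorted(p for p in set(mutated) if start <= p <= end)


# Hypothetical toy alignment: reference first, then two other sequences.
alignment = [
    "MKTAYIAKQRQISF",
    "MKTAYLAKQRQ-SF",
    "MKTAYIAKQRQISW",
]
regions = invariant_regions(alignment)
print(regions)                                                # [(1, 5), (7, 11)]
print(mutations_in_region([3, 6, 8, 8, 12], regions[1]))      # [8]
```

The default `min_len=4` mirrors the "length greater than three amino acids" criterion stated in the text; dividing the number of mutated positions in a region by the region length would give a percentage of the kind reported per invariant window.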

Discussions and Remarks
Variants of S, E, M, N, ORF3a, ORF6, ORF7a, ORF7b, ORF8, and ORF10 proteins of SARS-CoV-2 from six continents comprising 24 geo-locations were analyzed. In each geo-location, a non-uniform frequency distribution of unique variants of