Trends of mutation accumulation across global SARS-CoV-2 genomes: Implications for the ecology and evolution of the novel coronavirus

1 College of Veterinary Medicine, Western University of Health Sciences, 309 East Second Street, Pomona, CA, 91766, USA 2 Central Research Facility, Indian Institute of Technology Kharagpur, Kharagpur 721302, West Bengal, India. 3 Department of Botany, Prasannadeb Women’s College, Jalpaiguri, West Bengal, India. 4 Department of Microbiology, Bose Institute, P-1/12 CIT Scheme VII M, Kolkata 700054, West Bengal, India. 5 Department of Biotechnology, University of North Bengal, Raja Rammohanpur, Darjeeling 734013, West Bengal, India.

resembling severe acute respiratory syndrome (SARS), which, subsequently, went on to be identified as 2019 novel coronavirus disease (COVID-19; causative agent: SARS coronavirus 2, abbreviated as SARS-CoV-2; see Green, 2020) that has spread to 216 countries, infecting 14,562,550 people, and killing 607,781 as of 22 July 2020 (https://covid19.who.int). The first whole genome sequence of SARS-CoV-2 was deposited in GenBank (MN908947.3) on January 5 by researchers of Shanghai Public Health Clinical Center and School of Public Health, Fudan University, Shanghai, China . SARS-CoV-2 is an enveloped, positive-sense, single-stranded RNA virus containing a 29,903 nucleotide genome having an untranslated segment of 254 and 229 nucleotides at the 5' and 3' ends respectively. Its putative genes encode a surface spike glycoprotein, an envelope layer glycoprotein, a replicase intricate, a nucleocapsid phosphoprotein, and five other non-basic proteins . High genearrangement similitudes of SARS-CoV-2 with coronaviruses found in bats (Rhinolophus sinicus) (Zhou et al., 2020;Paraskevis et al., 2020) and Sunda Pangolin (Manis javanica) (Liu et al., 2019) indicate SARS-CoV-2 to be a zoonotic disease (Lu et al., 2020). However, human to human transmission of SARS-CoV-2 is also well-established, and its infection has spread across geographical and political barriers, courtesy of unbridled human travel across the globe. The virus spread rapidly in Italy, Spain, France, UK and Iran, and then in other parts of Western Europe, USA, Brazil, Russia, Southeast Asia, South Asia and Middle East.
At the same time as the scientific community is racing to develop vaccines and therapeutics against COVID-19 (Folegatti et al., 2020), the virus on its part is busy accumulating mutations across its pan genome, some of which may well help it evade clinical interventions (Benvenuto et al., 2020;Korber et al., 2020;Saha et al., 2020;Wang et al., 2020). In this context of SARS-CoV-2 evolution, the present study analyzes 3,608 whole genome sequences of this novel coronavirus, focusing on representative quasi-species isolated from 47 different countries (or distinct geographies), to delineate SARS-CoV-2 phylogeny, and then identify the distribution pattern of global point mutations in the SARS-CoV-2 genome. Consideration of the findings in the light of the existing knowledge on molecular biology and chemistry of point mutations (Hurlbert et al., 1960;Matallana-Surget et al., 2008;Phe et al., 2009;Kimsey et al., 2018) revealed potential mechanisms that could be instrumental in accelerating mutation recruitment in the RNA genome of SARS-CoV-2.

Comparative genomics
Of the 5,485 SARS-CoV-2 whole genome sequences deposited in GenBank till 4 June 2020, an overwhelming percentage (62.9%) were from USA, followed by From these, a total of 3,608 whole genome sequences of SARS-CoV-2 strains were downloaded from the NCB-SARS-CoV-2 database (https://www.ncbi.nlm.nih.gov/sars-cov-2/) on 4 June 2020, using a download filter that eliminated all incomplete or gapped genome sequences from the curated dataset. Sequence similarity between the 3,608 genomes was computed in a pairwise manner involving all the 6,507,028 combinations possible, using FastANI, a high throughput method for average nucleotide identity analysis (Jain et al., 2018). The software MAFFT (Katoh et al., 2002) was used with default options to align all the whole genome sequences included in the dataset.
Nucleotide positions involving polymorphisms (base substitutions) were identified in the individual genomes using the software SNP-sites (Page et al., 2016), which specifically identifies single nucleotide polymorphisms (SNPs) from aligned multi-fasta sequence files. Subsequently, the variant call format (VCF) file generated from the SNP-site analysis was processed using the software VCFtools (Danecek et al., 2011) to enumerate all transition and transversion events within the entire dataset of aligned whole genome sequences. The results obtained from VCFtools analysis were further compared and validated with those obtained from the alignment-free SNP-detection tool kSNP3 (Gardner et al., 2015).

Results and Discussions
Small but phylogenetically significant divergences accumulating in global

SARS-CoV-2 genomes
To delineate mutual sequence similarities existing between the 3,608 SARS-Cov-  Figure 1A-1C). Information regarding geographical origin of the sequences analyzed was also used to label the tree ( Figure 1D). Figure 1A, where the tree topology was labeled according to the dynamic clade nomenclature system (Rambaut et al. 2020)  On the other hand, Figure   Consistent with the above phylogenetic interpretations, labeling of the tree with the third clade-nomenclature convention proposed by Tang et al. (2020) and also followed by GISAID, indicated that the two original lineages, named as S and L (essentially equivalent to 19A and 19B of the Year-Letter nomenclature system), has diversified and thus far given rise to a total of seven clades, based on nine distinct marker mutations spread over 95% of the known SARS-Cov-2 diversity ( Figure 1C). As per the data available till 15 July 2020, Clade L is apparently more populous than Clade S, and has diversified further into V and G, with G splitting further into G, GH and GR (essentially equivalent to the old A2a clade of PANGOLIN, or the 20A, 20C and 20B of Year-Letter, nomenclature systems).
Labeling of the phylogenetic tree on the basis of the geographical origin of the sequences showed that members of the original and early-diverged clades (S and L, and V, respectively) are still mostly spread over Asian countries, whereas the recently derived clades (G, GH and GR) are distributed across the globe, especially in Europe and North America ( Figure 1D). Numerically, maximum mutations were found to be located in the genes encoding Nsp3 (total 435 mutations in the SARS-CoV-2 pan genome) and the spike protein S (300 mutations in the SARS-CoV-2 pan genome).

High frequency of CU (transition) and GU (transversion) mutations across global SARS-CoV-2 genomes
Of the 2,452 single nucleotide substitution mutations encountered globally in Nucleotide positions between which the locus spans have been designated based on the 5' to 3' sequence of the 29,903 nucleotide RNA-genome (MN908947) of the reference strain from Wuhan, China.
14 Table 2. Gene-/locus-wise frequency of the different types of transversion mutation encountered in the SARS-CoV-2 pan genome. 1 Nucleotide positions between which the locus spans have been designated based on the 5' to 3' sequence of the 29,903 nucleotide RNA-genome (MN908947) of the reference strain from Wuhan, China.  (Table 2).

Microevolutionary dynamics of SARS-CoV-2
Pace of mutation accumulation due to replication errors is higher in the RNA genomes of viruses than the corresponding spontaneous mutation rates in the DNA genomes of most other living entities. This leads to an unrelenting generation of mutant genomes for any RNA virus, alongside a rivalry among the extant variants, including the more advanced ones that are added to the Viro-diversity over time (Eigen et al., 1988). In the context of the highly dynamic epidemiology of SARS-CoV-2, knowledge on its genome evolution becomes all important for the surveillance and containment of the outbreak. In fact, progressive diversification of the SARS-CoV-2 genome is taking place in sync with the pace at which it is undergoing transmission over geographies and anthropologies; and in doing so, it is playing out a 'hide and seek' game with the promises of antiviral drugs and vaccines innovated over time. Furthermore, all active genomic variants maintained within global/local RNA virus populations (i.e., quasispecies) are expected to possess equal abilities to replicate and complete the infection cycle (Eigen et al., 1988). Therefore, the more or less efficient circulation (and diversification into potential quasispecies) of the two original major-lineages of SARS-CoV-2 (clades S and L) across distinct geographies (Figure 1) reflects their equivalent pathological as well as evolutionary fitness.
In the present study, mutation frequency in the genetic material of SARS-CoV-2 was found to be 2.27 × 10 -5 nucleotide positions mutated per nucleotide analyzed.
This indicated that a majority of the individual genomes differed from the consensus sequence of the pan genome at one or more nucleotide positions. These small, but definite and widespread, genomic divergences further imply that during the present pandemic the reservoir of SARS-CoV-2 quasispecies is rapidly expanding across geographies. This rich stock of genotypic, and therefore potentially phenotypic, variants is likely to hold major implications for potential multifaceted adaptations of this novel coronavirus within human hosts, and in doing so have serious consequences on the resultant pathogenesis, disease complications and control (Domingo and Holland, 1994).

Significance of the abundance of mutations in genes encoding Nsp3, ORF3a, S and N
Viruses that have evolved to survive via changing their hosts are extremely skilled molecular manipulators; the key to their ecological fitness is attributed to their ability to subvert host defense systems to ensure survival, replication and proliferation (Bowie and Unterholzner, 2008). Coronavirus-encoded accessory proteins, in general, play critical roles in virus-host interactions and modulation of host-immune responses, thereby contributing to their pathogenicity (Narayanan et al., 2008). The clinical prognosis of SARS-CoV-2 infection (Prete et al., 2020), in conjunction with the gene content of its precisely-mapped RNA genome (Liu et al., 2019;Wu et al., 2020), indicates that this novel coronavirus also possesses sophisticated molecular mechanisms designed to subvert human immune system, thereby facilitating high transmission. The multi-domain accessory protein Nsp3, which is the largest among all SARS-CoV proteins, binds to viral RNA, nucleocapsid protein (N), and other viral proteins; in addition, it participates in polyprotein processing (Qiu and Xu, 2020).
Furthermore, Nsp3 defies host innate immunity by its de-ATP-ribosylating, de-ubiquitinating, and de-ISGylating activities (Qiu and Xu, 2020). These attributes have currently made Nsp3, especially its papain-like protease component, a lucrative target for new antiviral drugs (Báez-Santos et al., 2015). In this scenario our discovery Earlier studies had reported that the accessory protein ORF3a of SARS-CoVs has pro-apoptotic activity (Padhan et al., 2008); very recent studies further implicated this protein of SARS-CoV-2 in inducing extrinsic apoptotic pathway through a unique membrane-anchoring strategy (Ren et al., 2020). In view of these key roles of ORF3a The envelope spike protein S, and the unexposed nucleocapsid protein N, are among the most promising targets for vaccine development against SARS-Cov-2 Pang et al., 2020;Salvatori et al., 2020). However, the detection of 300 point mutations (192 transitions with 106 CU substitutions and 108 transversions with 55 GU substitutions), distributed almost evenly across the total length of the S locus in the SARS-Cov-2 pan genome (Tables S1 and S2) seriously questions the prospects of eventual effectiveness of S-targeting vaccines. On the other hand, the N gene which was initially thought to be relatively more conserved and mutation-proof (Dutta et al., 2020), was found in the present study to harbor a total of 171 point mutations (107 transitions with 59 CU substitutions and 64 transversions with 42 GU substitutions) across global SARS-Cov-2 genomes (Tables S1 and S2).
Effects of the above mentioned mutations on the structures and functions of N as well as S proteins need to be studied in-depth so as to ensure that the protein product of the right alleles are chosen as antigenic epitopes in the respective vaccination strategies.

Physicochemical underpinnings of the preponderance of CU and GU substitutions
In view of the overwhelming preponderance of CU and GU transitions in the global mutation spectrum of SARS-CoV-2 (as compared to all other transition and transversion mutations respectively) it seems likely that in the ecological context of this novel coronavirus some physicochemical and/or biochemical mutagen is more instrumental in bringing about this selective change, over and above the general replication error-induced mechanism of mutagenesis. Cytosine can convert to uracil through hydrolytic deamination, under the action of ultra-violet (UV) irradiation or the potential mediation of the enzyme activation-induced cytidine deaminase, which is abundant in mammalian cells and known to act on single-stranded nucleic acids (Bransteitter et al., 2003). CU conversion is also possible chemically under the mediation of bisulfite reagents ( Figure S1; for details see Hyatsu, 2008) that are frequently used as disinfectants, antioxidants and preservative agents. Incidentally, several control techniques involving heating, sterilization, ultraviolet germicidal irradiation (UVGI) (Tseng and Li, 2007) and/or chemical disinfectants (Matallana-Surget et al., 2008) are being used currently to reduce the risk of viral infection from contaminated surfaces. Of these, intense UV-C irradiation is at the forefront of our fight against COVID-19, so indiscriminate use of the same may well accelerate the incidence of CU mutations in global SARS-CoV-2 genomes. Furthermore, UV's specificity for targeting two adjacent pyrimidine nucleotides is long known (Miller, 1985), while in the context of DNA, UV-induced signature mutations collated from existing data on cells exposed to UVC, UVB, UVA or solar simulator light, have been confirmed as CT in ≥ 60% dipyrimidine sites, of which again ≥ 5% is CCTT (Brash, 2015). In consideration of the above facts, it seems likely that UV irradiation is  (Ma et al. 2018;Nibert 2017); the consequent blend of wild-type and changed RdRP enzymes through its replication activities give rise to a range of viral genome-variants or quasispecies, even within a single transmission event (Ou et al., 2019). Those variants which have the best viral fitness, eventually, endure and become predominant in the population. In this context, it is further noteworthy that both tautomeric and anionic Watson-Crick(W-C)-like mismatches can increase the recruitment of replication and translation errors (Koag et al., 2014;Rozov et al., 2016). A sequence-dependent kinetic network system connects G•T/U wobbles with three particular W-C mismatches comprising of two quickly exchanging tautomeric species (Genol•T/U⇌G•Tenol/Uenol, population <0.4%) and one anionic species (G•T − /U − , population ≈0.001% at unbiased pH) (Kimsey et al. 2018).

Conclusion
The current investigation of 3,608 complete whole genome sequences of SARS-CoV-2 isolates from across the world brought to the fore a number of remarkable aspects of microevolution of this novel coronavirus. Phylogenomic analysis illustrated that the two major-lineages of the virus has thus far contributed almost equivalently to the pandemic, even as members of the early lineages are still mostly spread over Asian countries and those of the relatively recent lineages have undergone more global distribution. In the coming days it would be worth exploring whether this virogeography has got any bearing on the differential death rates of COVID-19 in Asian and European/American countries (https://www.worldometers.info/coronavirus/). An overwhelming preponderance of transition mutations, and far less frequency of transversions, was observed across the pan genome of the virus, irrespective of whether the genetic locus encoded a non-structural or structural protein. In this context it is noteworthy that the 29,903-nucleotide-long SARS-CoV-2 pan genome was found to have maintained a substantive 655 transversion mutations, notwithstanding the fact that natural selection disfavors transversion mutations because they are often non-synonymous, so less likely to conserve the structural biological properties of the original amino acids. Furthermore, a molecular bias of mutations was observed in the SARS-CoV-2 pan genome involving exceedingly frequent CU and GU substitutions among all transitions and transversion events respectively. More comprehensive and multi-faceted surveillance of the microevolution of SARS-CoV-2 is needed so as to gain constant insights into the pathogenic dynamism of the virus, and improvise control and therapeutic strategies accordingly.

Supplementary Material
One MS Word file named "Supplementary_Information" accompany this paper.

Author Contributions
RC conceived the study, designed the experiments, interpreted the results, and wrote the paper, with the help of WG. CR performed the bioinformatic experiments and