Witnessing evolution of SARS-CoV-2 through comparative phylogenomics: The proximate origin is Guangdong, not Wuhan

A new form of coronavirus called severe acute respiratory disease coronavirus type 2 (SARSCoV-2) is currently causing a pandemic. A six-month evolutionary history of SARS-CoV-2 is witnessed by characterising the total genome of 821 samples using comparative phylogenomic approaches. Our analyses produced striking inclusive results that may guide scientists/professionals for the past/future of pandemic. Phylogenetic and time estimation analyses suggest the proximate origin of pandemic strain as Guangdong and the origin time as first half of September 2019, not Wuhan and December 2019, respectively. The viral genome experienced a substitution rate similar to other RNA viruses, but it is particularly high in some of the peptides encoding sequences such as leader protein, E gene, orf8, orf10, nsp10, N gene, S gene and M gene and nsp4, while low in nsp11, orf7a, 3C-like proteinase, nsp9, nsp8 and endoRNase. Most strikingly, the divergence rate of amino acid sequences is high proportional to nucleotide divergence. Additionally, specific non-synonymous mutations in nsp3 and nsp6 evolved under positive selection. The exponential growth rate (r), doubling time (Td) and R0 were estimated to be 47.43 per year, 5.39 days and 2.72, respectively. Comparison of synapomorphies distinguishing the SARS-CoV-2 and the candidate ancestor bat coronavirus indicates that mutation pattern in nsp3 and S gene enabled the new strain to invade human and become a pandemic strain. We arrive at the following main conclusions: (i) six months evolution of viral genome is nearly neutral, (ii) origin of pandemic is not Wuhan and predates formal reports, (iii) although viral population is ongoing an exponential growth, the doubling time is evolving towards shortening, and (iv) divergence rate of total genome is similar to other RNA viruses, but it is prominently high in some genes while low in some others and evolution in these genes should be closely monitored as their protein products intervening to pathogenicity, virulence and immune response.


INTRODUCTION
Human populations have experienced a period of explosive growth depending on decreases in external cause mortality during approximately last 250 years following the Industrial Revolution (1). The growing human population has also led to anthropogenic impact on the environment promoting continuous and multiple contacts between people, domestic and wild animals. Such contacts were considered as main cause for the emergence and spread of zoonotic infectious diseases in association with crowded settlement (2). Most known examples of the emerging infectious diseases throughout and in various parts of the world include the smallpox, Spanish flu (H1N1 influenza), plague, cholera, Human Immunodeficiency Virus (HIV), hepatitis C, avian flu (H5N1 influenza), swine flu (H1N1 influenza), Severe Acute Respiratory Syndrome Coronavirus (SARS), and Ebola Virus (EBOV) (3)(4)(5).
It is known that the emergence and re-emergence of most of the zoonotic infectious diseases are resulted from viruses with RNA as their genetic material, which can rapidly adapt to varying environmental conditions owing to their exceptionally shorter generation times, high mutation rates, frequent recombination and re-assortment events creating novel genotypes from cocirculating strains (quasispecies) (3,6). The occurrence of the high mutation rates, as high as a million times [10 -2 -10 -5 per site per year (7)] greater than their hosts [2,2 x 10 -9 per site per year (8)], is considered as a beneficial trait for RNA viruses due to prominent enhancement in the ability of their virulence and evolvability.
The recent pandemic outbreak resulted from a novel coronavirus, named as SARS-CoV-2 (also referred to as hCoV-19) (9), causing unusual respiratory condition with various degrees of severity was reported for the first time by the end of 2019 in Wuhan, China (10,11). SARS-CoV-2 has a linear positive ss-RNA genome with a length of 29,903 nt, consisting of a leader sequence, ORF1ab encoding replicase polyproteins involved in RNA replication and transcription processes and non-structural proteins (nsp), S gene encoding spike glycoprotein, E gene encoding envelope protein, M gene encoding membrane glycoprotein, N gene encoding nucleocapsid phosphoprotein and six ORFs encoding peptides with unknown exact functions (12).
The SARS-CoV-2 is a new strain that has rapidly distributed worldwide within approximately three months infecting millions of people. Yet, there are many unknown aspects of the virus strain preventing health professionals to take plausible measurements and assessments for upcoming of pandemics. One aspect appears to be least concerned by healthcare professional is the past and ongoing evolution of the virus strain. Recent studies with a relatively limited number of samples for exploring the evolutionary history of SARS-CoV-2 have been restricted to either comparison with the possible ancestral forms (13,14) or a certain region and relatively smaller datasets (15)(16)(17)(18)(19). Here, a comprehensive dataset of 821 SARS-CoV-2 genomes from (13,14), was generated containing 16 non-structural proteins, four structural proteins and six accessory proteins (20)(21)(22)(23). Although a few of these peptide encoding sequences overlap (23), each one was treated as independent encoding unit to understand their evolution separately and outline their potential medical implications. The dataset/sub-datasets were subjected to appropriate bioinformatics analyses to accomplish the following specific objectives: (i) to explore evolutionary destiny of SARS-CoV-2 and its likely closest relative bat-RaTG13-CoV (13) based on synapomorphic characters for the first time, (ii) to determine the place of origin and likely origin strain based on phylogenetic trees, (iii) to estimate the time of origin and the rate of base/amino acid substitutions per individual peptide encoding sequences and total genome by time estimation analyses (iv) to estimate the demographic evolutionary dynamics such as exponential growth rate (r), doubling time (Td) and basic reproduction number (R 0 ), which were computationally predicted based on the genome data using a Bayesian framework, (v) finally, to investigate adaptive substitutions across SARS-CoV-2 genomes and substitution rates of each peptide encoding sequences, which are clinically important traits as they enable the virus to escape from vaccine coverage, to expand host range and to resist against to potential drugs.

Construction of dataset and basic sequence statistics
A dataset of 821 samples was constructed by retrieving the complete genomes of SARS-CoV-2 representing 38 countries from GISAID (13 March 2020; https://www.gisaid.org/) database.
Details of the samples were presented in Table S1. Annotation of the genomes were performed using ORFFinder implemented in Geneious R9 (24) and by comparison with the reference SARS-CoV-2 genome from NCBI (NC045512). Sequences were aligned using MAFFT v7.450 (25) and manually checked by MEGA X (26). The basic sequence statistics such as nucleotide and amino acid identities, nucleotide diversity and nucleotide compositions were estimated by MEGA X.

Comparative distribution of synapomorphies
The molecular synapomorphic changes that define the clades (or branches) of human SARS-CoV-2 and bat-RaTG13-CoV were investigated by generating a new RAxML tree using randomly chosen SARS-CoV-2 genomes of 25 humans and five coronavirus samples [bat-RaTG13-CoV (MN996532), two pangolin coronavirus (EPI_ISL_410538, _410721) and two SARS-CoV (AY274119, NC004718)] (see Table S2 for sampling). The sequence data and the ML tree file were imported into PAUP 4.0b10 (27). After defining outgroups and selecting maximum parsimony optimality criterion (parsimony settings; character state optimization: DELTRAN), the logfile option was activated (File: 'Log Output to Disk'). Sequence data was then used to obtain a labelled tree reconstruction and a complete list of synapomorphies (Trees: 'Describe Trees' with 'phylogram', 'labelled internal nodes', and 'list of synapomorphies'). The resulting logfile listed all synapomorphies of the dataset. The synapomorphic changes were screened and comparatively analysed in terms of total numbers, positions and substitution types (transitions/transversions and synonymous/nonsynonymous) across the genome.

Model selection, phylogenetic analyses and network reconstruction
The bat-RaTG13-CoV sample was included to the dataset as outgroup (13) for model selection and phylogenetic analyses. The best-fit evolution model of the dataset was selected using ModelTest-NG (28), applying the default parameters under three statistical criteria (AIC, BIC, and DT). The Maximum Likelihood (ML) tree was built using RAxML v8.0.9 (29) under the GTR substitution model with gamma distribution (gamma shape: 0.507).
To deeply investigate the evolutionary relationships and the proximate origin place of pandemic form of the SARS-CoV-2, a median-joining (MJ) network (30) was constructed using a dataset comprising total genome of 25 human SARS-CoV-2 samples selected from the basal placement of each clade and/or subclade of the phylogenetic trees generated here including the samples from Wuhan (China), supposed to be origin place of the outbreak (10), and the bat-RaTG13-CoV was assigned as outgroup (Table S3). The analysis was performed in Network v.5.0.1.1 (available at http://www.fluxus-technology.com) with the default settings (ε = 0).

Estimation of mutation rates of genome and peptide encoding sequences
The individually aligned peptide encoding sequences of SARS-CoV-2 samples were concatenated by SequenceMatrix v.1.7.8 (31). The substitution rate estimation of each peptide encoding sequences and whole genome was conducted in BEAST v1.10.4 (32) using the available dates for samples with a random starting tree on the CIPRES science gateway portal (33). As the dataset were not clock-like (likelihood ratio test: −ln+c -52308.

Demographic analyses
To understand the course of the pandemic from genome data, the demographic dynamics of SARS-CoV-2 outbreak, were inferred from Bayesian coalescent model using demographic reconstruction option in Tracer v1.7. The doubling time (Td) was calculated using the following formula: Td = ln(2)/r, assuming that the pandemic is growing exponentially with a constant growth rate (r) (34). The basic reproduction number (R 0 ) was calculated using the formula by

Basic sequence statistics
The final length of the sequences of human SARS-CoV-2 was 29,238 nt (without outgroups), of which 832 sites were variable. Of these variable sites, 232 were parsimony informative while 600 were singleton mutations. Divergence indices for both nucleotide (29,238 nt) and amino acid sequences of concatenated dataset (9,746 aa) on average were 0.029 and 0.057, respectively (Table 1). Divergences for each peptide encoding sequence were also in low levels ranging from 0.014 (nt div.) and 0.025 (aa div.) in orf7a to 0.070 (nt div.) and 0.120 (aa div.) in E gene. The low nucleotide diversity (π) values were observed in each peptide encoding sequences, ranging from 0.00005 in nsp9 to 0.00115 in orf8, with an average of 0.00028 overall mean nucleotide diversity (Table 1). Nucleotide compositions of the SARS-CoV-2 genomes were U and A rich, with 32.22% U, 29.82% A, 19.62% G, 18.34% C contents on average (Table S4).

Comparison of synapomorphies distinguishing bat coronavirus and SARS-CoV-2
The phylogenetic tree constructed to identify synapomorphies and the list of synapomorphic characters was shown in Figure S1 and Table S5, respectively. The ratio of synapomorphic characters to the total number of characters at the branch leading to bat-RaTG13-CoV (2.09%) was relatively higher than that of the branch leading to SARS-COV-2 (1.88%). The proportions of transitional substitutions (78.4% in bat-RaTG13-CoV and 78.9% in SARS-COV-2) within these synapomorphic characters were apparently higher than transversions (21.6% in bat-RaTG13-CoV and 21.1% in SARS-COV-2) at both branches. Although transitions and transversions dispersed similarly throughout the genomes (Fig. 1a), the distribution of transition-and transversion-types exhibited differences ( Fig. 1b and 1c).

Phylogeny of SARS-CoV-2
The phylogeny of SARS-CoV-2 was constructed using the concatenated dataset consisting of peptide encoding sequences (29,

Gene-level selection analyses
Genomic regions under positive and purifying selection or neutral evolution were visualised estimating the magnitude of dN -dS (Fig. 5). Although the most part of the genome was under purifying selection (dN < dS), the specific sites for peptide encoding sequences were found to be under positive selection (dN > dS) with significant statistical support (Fig. 5, Table 3). Five sites were detected under episodic positive selection in four peptide encoding sequences by MEME, while seven sites were found to be under pervasive positive selection in four peptide encoding sequences by FUBAR (Table 3). Only two codons in two different peptide encoding sequences (nsp3 and nsp6) were found as positively selected in both tests, therefore, the TreeSAAP analyses focused only on these two codons. The first one was codon 1179 in the nucleic acid-binding region (NAB) of the replicase protein nsp3, leading to a replacement between alanine and valine. This substitution seems to alter both chemical [long-range nonbonded energy (E I ), polar requirement (P r ), polarity (p) and solvent accessible reduction ratio (R α )] and structural [beta structure tendency (P β ), average number of surrounding residues (N s )] properties of the relevant protein ( Table 3). The second one was codon 38 in the replicase protein nsp6, with the substitution of leucine to phenylalanine altering the property of P β of the protein. The present findings on the basic sequence characteristics of SARS-CoV-2 genomes, the low level of nucleotide diversity and high level of haplotype diversity, are consistent with the sudden expansion in a relatively short period (Table 1). However, comparison of the peptide encoding sequences revealed that divergences of amino acid sequences were always proportionally higher than that of nucleotide sequences (  (Table   S4). However, when the codon positions of peptide encoding sequences were analysed separately, robust evidences of a bias towards C and G in the first and/or second codon positions of some peptide encoding sequences, most probably resulted from the codon preferences.

DISCUSSION
A strict phylogenetic approach was applied to reveal the differences in mutational patterns on the base of molecular synapomorphic changes that defining SARS-CoV-2 and its known closest relative bat-RaTG13-CoV branches as probably being a crucial preliminary step for the therapeutic studies. A several pattern common for both branches was found ( Fig. 1) It is also worthwhile to note a decline was observed in the basic reproduction number, the R 0 , a reference parameter for pathogeny of the agent. Although there has been several reports exceeding approximately R 0 > 6, the number estimated here, (R 0 = 2.72) is amongst the lowest (11) possibly because of the preventive measures taken worldwide. Therefore, it is recommended that measures must be persistent and strongly implemented so that until the declining tendency in R 0 becomes perpetual.
Accurate estimation of nucleotide substitution rates has crucial importance for several aspects, but two are especially essential. First, tracing evolutionary paths of the virus via substitution rate and mutational spectra may help to understand alterations in viral fitness as respond to host immune system on virus genes/genotypes and thus virulence in the host (60). Second, such results may have important implications for development of vaccines, antibodies and/or drugs (61,62). The estimated substitution rate for total genome of SARS-CoV-2 (Table 2) (20)(21)(22)(23). All these functions cause several effects on host cells, consequently determining patogenity, virulence, and antigens and finally the immune response (23). Therefore, the accurate estimation of substitution rate per gene has special importance for medical preventions and treatments.
Here, we report substitution rate of each peptide encoding sequence for the first time. Although the overall rate is 1.65 x 10 -3 , it is 4.5-7 times greater in some of the genes with important functions ( Table 2). It warrants mention especially the following peptide encoding sequence/genes: (i) nsp1 which encodes a protein leading the replication, (ii) E gene which encodes envelope protein involving in constitution of viral particles, (iii) orf8 and (iv) orf10 encoding accessory proteins both inducing apoptosis and DNA synthesis (21,22). Although not higher as in these four genes, the substitution rate is considerably (more than twice; Table  2) higher than the overall rate in N, nsp4, S, M and nsp2. The high rate in N, S and M genes encoding three of total four structural proteins has special importance as key viral antigens (20), thus requiring close scrutiny in future particularly for vaccines studies. On the other hand, the substitution rate of nsp11, orf7a, nsp5, nsp9, nsp8, orf7b and nsp15 is considerably lower than the overall rate (Table 2), polypeptide products of each peptide encoding sequence have key functions both in viral replication and viral interaction with host cell, thus host immunity (21)(22)(23). For example, the peptide encoding nsp15 is termed as "suicide enzyme" (23) and new deleterious mutations in this peptide may be public wish, but results obtained here are not promising towards that direction. A similarly case is valid for orf7a as it has several significant functions in interaction with host immune system. Evaluation of the substitution rate in relation with function of each peptide encoding sequence is beyond scope of the present study.
However, the conserved sequences of nsp8, nsp9 and nsp11, protein products of which have key importance in viral replication (23) may indicate a selective processes preventing accumulation of mutation in these peptide encoding sequences (63) and this constitutes an important question for further detailed research.
As for the overall substitution rate, the selective pattern on total genome also provides general information and do not allow us to distinguish key evolutionary trend. Although the SARS-CoV-2 genomes have been evolving under nearly neutral or purifying selection, one codon for each of nsp3 and nsp6 encoding sequences were found to be positively selected (Table 3) The results of the present study lead us to the following foremost conclusions. Although, six months evolution of viral genome is nearly neutral, one codon for each of nsp3 and nsp6 encoding sequences were positively selected and should be closely monitored. The nsp3 and N, S and M genes have to be further investigated, the potential sides for vaccines or drug studies, as the substitution rate per these genes show departure from the overall rate. The proximate origin of pandemic is Guangdong, its likely date of emergence is September 10, 2019, and this knowledge may serve in more accurate modelling or prediction on the evolutionary trend of the virus. The declining trend in the R 0 , appears to be the outcome of preventive actions taken worldwide and must be maintained. Finally, investigating more comprehensive genome data improves robustness and also enhances understanding of the demographic history of the outbreak. However, there is still a long way to go in order to better understand the SARS-CoV-2 and its outbreak.

ACKNOWLEDGEMENTS
This study is dedicated to all healthcare staff and scientists for their diligence and hard work during this SARS-CoV-2 outbreak. We are grateful to the authors and originating and    were shown with red colour (that over there the magnitude of dN -dS is between -1 to +1).
The codons subject to positive selection (dN -dS > 1) were shown with white and to purifying selection (dN -dS < 1) were shown with black.       Figure S1. The phylogenetic tree reconstructed for defining synapomorphies at the clades (or branches) leading to human SARS-CoV-2 and bat-RaTG13-CoV. The sampling list used in the reconstruction of the tree was presented in Table S3. Figure S2. Phylogenetic tree of SARS-CoV-2 reconstructed from the concatenated dataset of all peptide encoding sequences using Maximum Likelihood approach. Table S1. Detailed description of the sampling list retrieved from GISAID database. Table S2. A summary of the sampling list used the phylogenetic tree to define synapomorphies of the clades (or branches) leading to human SARS-CoV-2 and bat- Table S3. A summary of the sampling list used in the network reconstruction of SARS-CoV-2 Table S4. Nucleotide composition of each peptide encoding sequences (10 CDS and 16 nsps synthesised from orf1ab) and overall genome of SARS-CoV-2. Table S5. The list of molecular characters that define the clades (or branches) of human SARS-CoV-2 and bat-RaTG13-CoV