A comprehensive comparative phylogenomics and demographic evolutionary history of the SARS-CoV-2

A new form of beta coronavirus called severe acute respiratory disease coronavirus type 2 (SARS-CoV-2) causing a recent pandemic outbreak possesses a linear positive ss-RNA genome with a length of 29,903 nt. Here, the genomes of SARS-CoV-2 from 821 samples were characterised for its better understanding of the genomic and evolutionary patterns. The phylogeny of SARS-CoV-2 was reconstructed using concatenated dataset consisting of all peptide encoding sequences under Bayesian Inference (BI) and Maximum Likelihood (ML) approaches. Comparison of all peptide encoding sequences reveals high divergence of amino acid sequences proportional to divergence of nucleotides, indicating that the viral genomic evolution has not been strictly neutral. The most part of the genome was under neutral evolution, however, the specific sites for peptide encoding sequences were evolved under positive selection. As well as providing reliable evidence on transmission routes of the SARS-CoV-2 outbreak, the phylogenetics and network analyses suggest the sample reported from Guangdong province is likely ancestor of the pandemic virus form. The overall substitution rate of SARS-CoV-2 genome was estimated to be 1.65 x 10 -3 per site per year, falling within the range for previously reported RNA viruses. Median estimation of tMRCA from Bayesian coalescent analyses corresponds to 10 September 2019. The exponential growth rate ( r ), doubling time ( Td ) and R 0 were estimated to be 47.43 per year, 5.39 days and 2.72, respectively. These findings convincingly emphasise that the use of more comprehensive genome data improves robustness and also enhances understanding of the demographic history of the outbreak.


INTRODUCTION
The world is suffering from an ecosystem crisis due to extraordinary population growth and consequent activities of human species. Human populations have experienced a period of explosive growth depending on decreases in external cause mortality during approximately last 250 years following the Industrial Revolution (1). The population of any other species does not increase so rapidly in nature like humans although it is one of the species with the longest generation time. Within the last century, the world population has increased by one billion in each 16 years on average. This continuous growth is closely related to the emergence of modern medicine, and rapid developments in the scientific discoveries which led to technological developments, a relative decrease in competition for food resources and the increased volume of trade as the main results of globalization. The growing human population has also led to anthropogenic impact on the environment promoting continuous and multiple contacts between people, domestic and wild animals. Such contacts were considered as main cause for the emergence and spread of zoonotic infectious diseases in association with crowded settlement (2). Virus (EBOV) (3)(4)(5).
It is known that the emergence and re-emergence of most of the zoonotic infectious diseases are resulted from viruses with RNA as their genetic material, which can rapidly adapt to varying environmental conditions owing to their exceptionally shorter generation times, high mutation rates, frequent recombination and re-assortment events creating novel genotypes from cocirculating strains (quasispecies) (3,6). The occurrence of the high mutation rates, as high as a million times [10 -2 -10 -5 per site per year (7)] greater than their hosts [2,2 x 10 -9 per site per year (8)], is considered as a beneficial trait for RNA viruses due to prominent enhancement in the ability of their virulence and evolvability. Despite these exceptionally high mutation rates, most of the mutations are expected to be swept away from viral populations via purifying and/or negative selection. On the other hand, the deterministic nature of the natural selection tends to increase the fitness of the populations favouring beneficial mutations through positive selection in time. The balance between these opposed selective forces ultimately shapes the evolution of the viral populations.
The recent pandemic outbreak resulted from a novel coronavirus, named as SARS-CoV-2 (also referred to as hCoV-19) (9), causing unusual respiratory condition with various degrees of severity was reported for the first time by the end of 2019 in Wuhan, China (10,11). SARS-CoV-2 has a linear positive ss-RNA genome with a length of 29,903 nt, consisting of a leader sequence, ORF1ab encoding replicase polyproteins involved in RNA replication and transcription processes and non-structural proteins (nsp), S gene encoding spike glycoprotein, E gene encoding envelope protein, M gene encoding membrane glycoprotein, N gene encoding nucleocapsid phosphoprotein and six ORFs encoding peptides with unknown exact functions (12).
The SARS-CoV-2 is a new strain which has rapidly distributed worldwide within approximately three months infecting millions of people. Yet, there are many unknown aspects of the virus strain which prevent to take plausible measurements and assessments for upcoming of pandemic. One aspect appears to be least concerned by healthcare professional is the past and ongoing evolution of the virus strain. This study primarily aims to estimate the evolution and evolutionary drivers of the virus using available 821 total genomes of SARS-CoV-2 which expands significantly the early findings based on limited number of samples (13,14). Data matrices generated were subjected to appropriate bioinformatics analyses to accomplish the following specific objectives: (i) to determine the place of origin and likely origin strain based on phylogenetic trees, (ii) to estimate the time of origin and the rate of base/amino acid substitutions per individual protein encoding genes and total genome by time estimation analyses, (iii) to characterize the viral genome for base composition, rates of synonymous and nonsynonymous substitutions in connection with presence/absence of selection forces, (iv) to estimate demographic history of virus population based on possible evolutionary drivers, and (v) to determine possible synapomorphic base position states of bat coronavirus and SARS-CoV-2 using a strict phylogenetic approach. Finally, a recent evolutionary history of the virus will be discussed in the light of the results.

Construction of dataset and basic sequence statistics
A dataset of 821 samples was constructed by retrieving the complete genomes of SARS-CoV-2 representing 38 countries from GISAID (13 March 2020; https://www.gisaid.org/) database.
All details of the sampling dataset were presented in Table S1. Annotation of the genomes were performed using ORFFinder implemented in Geneious R9 (15) and by comparison with the reference SARS-CoV-2 genome from NCBI (NC045512). Sequences were aligned using MAFFT v7.450 (16) and manually checked by MEGA X (17). The basic sequence statistics such as nucleotide and amino acid identities, nucleotide diversity and nucleotide compositions were estimated by MEGA X.

Model selection, phylogenetic analyses and network reconstruction
One sample representing bat strain (bat-RaTG13-CoV; MN996532) was included to the dataset as outgroup (18) for model selection and phylogenetic analyses. The best-fit evolution model of the dataset was selected using ModelTest-NG (19), applying the default parameters under three statistical criteria (AIC, BIC, and DT). The Maximum Likelihood (ML) tree was built using RAxML v8.0.9 (20) using the GTR substitution model with gamma distribution (gamma shape: 0.507).
To deeply investigate the evolutionary relationships and origin of location of the SARS-CoV-2, a median-joining (MJ) network (21) was constructed using a total of 25 human SARS-CoV-2 samples selected from the basal placement of each clade and/or subclade of the phylogenetic tree generated here including the samples from Wuhan (China), supposed to be origin point of the outbreak (10), and bat-RaTG13-CoV was assigned as outgroup (Table S2). The analysis was performed using the whole genome dataset under the Network v.5.0.1.1 (available at http://www.fluxus-technology.com) with the default settings (ε = 0).

Estimation of mutation rates of genome and genes
The individually aligned peptide encoding sequences (10 CDS and 16 nsps synthesised from orf1ab) of SARS-CoV-2 samples were concatenated by SequenceMatrix v. 1.7.8 (22). The substitution rate estimation of each peptide encoding sequences and whole genome was conducted in BEAST v1.10.4 (23) using the available dates for samples with a random starting tree on the CIPRES science gateway portal (24). As the dataset were not clock-like (likelihood

Demographic analyses
The demographic dynamics of SARS-CoV-2 outbreak were inferred from Bayesian coalescent model using demographic reconstruction option in Tracer v1.7. The doubling time (Td) was calculated using the following formula: Td = ln(2)/r, assuming that the pandemic is growing exponentially with a constant growth rate (r) (25). The basic reproduction number (R0) was  Table S3 for sampling). The sequence data and the ML tree file were imported into PAUP 4.0b10 (33).

Basic sequence statistics
It is known that coronaviruses with largest RNA genome exhibit relatively low level of genetic divergence owing to the presence of 3′ exonuclease proofreading activity in their replicases (34). SARS-CoV-2 genomes analysed here displayed a similar pattern. The final length of the sequences of human SARS-CoV-2 was 29,238 nt (without outgroups), of which 832 sites were variable. Of these variable sites 232 were parsimony informative while 600 were singleton mutations. Divergence indices for both nucleotide (29,238 nt) and amino acid sequences of concatenated dataset (9746 aa) on average were 0.029 and 0.057, respectively (Table 1).
Divergences for each gene were also in low levels ranging from 0.014 (nt div.) and 0.025 (aa div.) in orf7a to 0.070 (nt div.) and 0.120 (aa div.) in E gene. Comparison of all peptide encoding sequences revealed that divergences of amino acid sequences were always higher, proportional to that of nucleotide sequences ( Table 1). The greater amino acid divergences were mainly resulted from the high incidence of nucleotide substitutions at non-degenerate and twofold degenerate sites ( Table 1), indicating that the viral genome may not be experiencing strictly neutral evolution. If the viral genome has been subjected to the strictly neutral evolution, purifying selection would intensively act on substitutions at non-degenerate sites (35). Additionally, compatible with the recent expansion and relatively short history of the virus, the low nucleotide diversity (π) values were observed in each peptide encoding sequences, ranging from 0.00005 in nsp9 to 0.00115 in orf8, with an average of 0.00028 overall mean nucleotide diversity (Table 1).
Similar to the previously reported human coronaviruses, SARS-CoV-2 genomes displayed several general patterns in terms of nucleotide composition: preference of (i) U (32.22%) over C (18.34%); (ii) A (29.82%) over G (19.62%); (iii) pyrimidines (50.6%) over purines (49.4%) ( Table 2). However, when the codon positions of peptide encoding sequences were analysed separately, a codon bias towards C and G were observed in the first and/or second codon positions of some peptide encoding sequences (Table 2), most probably resulting from the differences between codon preferences.

Phylogeny of SARS-CoV-2
The phylogeny of SARS-CoV-2 was constructed using the concatenated dataset consisting of all peptide encoding sequences (29,238 nt) by both ML under RAxML and Bayesian Inference (BI) under BEAST. The phylogenetic analyses have recovered trees with almost the same topologies. However, the tree generated under BI approach was more resolved and with higher node supports, therefore, presented and discussed here (Fig. 1). The MJ network revealed that the outgroup bat-RaTG13-CoV sample was connected to the sample from Guangdong (EPI_ISL_413892) with 1132 mutations, which has basal placement also in the phylogenetic tree (Fig. 2). Although Wuhan has widely cited as the origin of outbreak  Table   3). The sequences encoding structural proteins displayed higher substitution rates than overall genome, varying from 3.60 x 10 -3 in M gene to 8.42 x 10 -3 in E gene (Table 3).

Demographic dynamics of the outbreak of SARS-CoV-2
Median estimation of tMRCA from Bayesian coalescent analyses was found as 2019.69 (95%

Gene-level selection analyses
Genomic regions under positive and purifying selection or neutral evolution were visualised estimating the magnitude of dN -dS (Fig. 4). Although the most part of the genome was under purifying selection (dN < dS), the specific sites for peptide encoding sequences were found to be under positive selection (dN > dS) with significant statistical support (Fig. 4, Table 4). Five sites were detected under episodic positive selection in four peptide encoding sequences by MEME, while only seven sites were found as under pervasive positive selection in four peptide encoding sequences by FUBAR (Table 4). Only two codons in two different peptide encoding sequences (nsp3 and nsp6) were found as positively selected in both approaches, therefore, the TreeSAAP analyses focused only on these two codons. The first one was codon 1179 in the nucleic acid-binding region (NAB) of the replicase protein nsp3, leading to a replacement between alanine and valine. The positive selection at this codon was reported for the first time.
This codon was coding threonine in pangolin CoV and serine in BetaCoV England 1.
Considering the multiple interactions of NAB with other non-structural proteins such as nsp2, nsp5, orf3a as well as other domains of nsp3 (51), this substitution seems to be significant due

Comparison of synapomorphies between bat coronavirus and SARS-CoV-2
The phylogenetic tree constructed to identify synapomorphies and list of synapomorphic characters were shown in Figure S1 and Table S4, respectively. The ratio of synapomorphic characters to the total number of characters at the branch leading to bat-RaTG13-CoV (2.09%) was relatively similar to that of the branch leading to SARS-COV-2 (1.88%). The proportions of transitional substitutions (78.4% in bat-RaTG13-CoV and 78.9% in SARS-COV-2) within these synapomorphic characters were apparently higher than transversions (21.6% in bat-RaTG13-CoV and 21.1% in SARS-COV-2) at both branches. Although transitions and transversions dispersed similarly throughout the genomes (Fig. 5a), the distribution of transition-and transversion-types exhibited differences ( Fig. 5b and 5c). Remarkably, A to G and U to C transitions were observed in high frequencies of 63.03% in bat-RaTG13-CoV and 64.87% in SARS-CoV-2 on average of total transitions (Fig. 5b). The excessive number of these type of transitions might be related to deamination of adenine to hypoxanthine and thermodynamically in favour of hypoxanthine cytosine matches, is also known to be a commonly observed pattern in retroviral and other RNA genomes (54).

CONCLUDING REMARKS
In the present study, a more comprehensive dataset was utilised than the datasets previously It is also worthwhile to note the decline in the basic reproduction number, the R0, a reference parameter for pathogeny of the agent. Although there has been several reports exceeding approximately R0 > 6, the number estimated here, (R0 = 2.72) is amongst the lowest (11 and references therein) and somewhat indicate declining over the time. One of the obvious reason may be the preventive measures taken worldwide. Therefore, it is recommended that measures must be persistent and strongly implemented so that until the declining tendency in R0 becomes perpetual.
Finally, the findings on tMRCA, growth rate, doubling time and R0 strongly indicate that investigating more comprehensive genome data improves robustness and also enhances understanding of the demographic history of the outbreak. However, there is still a long way to go in order to better understand the SARS-CoV-2 and its outbreak.             Figure S1. The phylogenetic tree reconstructed for defining synapomorphies at the clades (or branches) leading to human SARS-CoV-2 and bat-RaTG13-CoV. The sampling list used in the reconstruction of the tree was presented in Table S3.
Supplementary Tables   Table S1. Detailed description of the sampling list retrieved from GISAID database. Table S2. A summary of the sampling list used in the network reconstruction of SARS-CoV-2 TableS3. A summary of the sampling list used in the phylogenetic tree reconstructed for defining synapomorphies at the clades (or branches) leading to human SARS-CoV-2 and bat-

RaTG13-CoV
TableS4. The list of molecular synapomorphic characters that define the clades (or branches) of human SARS-CoV-2 and bat-RaTG13-CoV