Phylogenetic and Variant Analysis of 1,040 SARS-CoV-2 Genomes

The severe acute respiratory syndrome-coronavirus 2 (SARS-CoV-2) viral genome is an RNA virus consisting of approximately 30,000 bases. As part of testing efforts, whole genome sequencing of human isolates has resulted in over 1,600 complete genomes publicly available from GenBank. We have performed a comparative phylogenetic analysis of the sequences, in order to detect common mutations within the population. Analysis of variants occurring within the assembled genomes yields 417 variants occurring in at least 1% of the completed genomes, including 229 within the 5’ untranslated region (UTR), 152 within the 3’UTR, 2 within intergenic regions and 34 within coding sequences.


Introduction
SARS-CoV-2, formerly known as the "Wuhan seafood market pneumonia virus," is a novel coronavirus that first appeared at the seafood and wildlife wholesale market in Wuhan, Hubei Provence, China during late November/early December, 2019 (Zhu et al., 2020). Due to its high human-to-human transmission rate (Li et al., 2020a), longer than normal latent period ( Bar-On et al., 2020) and mortality rates in vulnerable populations, a global pandemic was declared by the World Health Organization (WHO) on March 11, 2020 for the associated COVID-19 disease (Cohen and Kupferschmidt, 2020). As of May 10, 2020, a total of 4,006,257 cases resulting in 278,892 deaths in 215 countries have been confirmed (World Health Organization, 2020).
The reference genome of the SARS-Cov-2 RNA virus (GenBank accession NC_045512) consists of 29,903 bases. Among its features are a 265 base 5' untranslated region (UTR) and a 3' UTR composed of 229 bases. Its coding regions consist of 10 open reading frames (ORFs) coding for 26 genes ( Bar-On et al., 2020), including 13,218 bases coding for the ORF1ab polyproteins, whose transcription includes a 1 bp ribosomal slippage event (Atkins et al., 2016); 3,822 bases coding for a spike surface glycoprotein (S), 228 bases coding a small envelope protein (E), 669 bases coding for a membrane glycoprotein protein (M) and 1,260 bases coding for a nucleocapsid protein (N) along with five additional ORFs (ORF3a, ORF6, ORF7a/ORF7b, ORF8, and ORF10) (Figure 1).
Since the public release of the first reference sequence (MN908947; NC_045512) from NCBI's GenBank (Sayers et al., 2019) on January 12, 2020, the number of sequences available has increased exponentially, at a current rate of approximately 70 new sequences per day (Figure 2). Isolates have been sequenced from 27 countries ( Table 1), as well as 34 states, Washington DC, and passengers from a cruise ( Table 2).
SARS-CoV-2 is postulated to have originated from zoonotic transfer of a pangolin betacoronavirus based on a phylogenetic analysis of coronavirus sequences, due to a common insertion of 12 nucleotides within the receptor binding domain of the S protein region that optimizes binding to the human ACE2 receptor, although the most similar betacoronavirus is the bat RaTG13 (Andersen et al., 2020). RNA viruses are characterized by a high mutation rate (Duffy, 2018) driven by RNA dependent RNA   polymerase (RdRp) that results in viral evolution (Pachetti et al., 2020). Given the importance of the genome sequence with host transmission (in particular the S protein), it is important to understand common mutations within the population in order to have a better handle on viral load, virus spread, virus evolution, and disease severity. A number of previous studies have examined variants within SARS-CoV-2 isolates. One prior study looking at the genome diversity of SARS-CoV-2 have identified 93 mutations occurring in at least one isolate from a set of 86 complete genomes (Phan, 2020). A second report from a set of 220 complete genomes identified eight novel recurring mutations, with specific prevalence within Asian, North American, and European populations (Pachetti et al., 2020). This study by Pachetti, et al., also showed the occurrence of mutations over time based on sequence sampling dates. A report by Yin (Yin, 2020) identified fifteen highfrequency single nucleotide polymorphisms when comparing a set of 558 SARS-CoV-2 strains. This study found four of these mutations, 241C > T, 3037C > T, 14408C > T, and 23403A > G to be more prevalent in European strains, where the COVID-19 is typically more severe. Wang et al. (Wang et al., 2020a) detected 13 variation sites among 95 full-length genomic sequences (variants occurring in at least 3 isolates), with two at positions 8,782 and 28,144 showing a high mutation rate around 30%. Khailany et al. (Khailany et al., 2020) looked at mutations within 95 complete genome sequences, and found 116 mutations occurring in at least one isolate, with the most common being 8762C > T, 28144T > C, and 29095C > T. Tang et al. (Tang et al., 2020) studied mutations across 103 strains, and determined there were mutations in 149 sites, including six nonsynonymous mutations occurring at least twice. Additionally, this study identified two different mutations (with near complete linkage between 8782T > C and 28144C > T) that separated the virus into two groups labeled L and S. Using deep sequencing reads, Tang et al. also identified 18 locations showing intrahost variants, demonstrating heterogeneity of the virus within a specific host. Wang et al. (Wang et al., 2020b) identified ten high frequency mutations within a set of 108 isolates, which they say can be used to classify SARS-CoV-2 into five main groups. The mutations published include 28144T > C,8782C > T, 23403A > G, 3037C > T, 11083G > T, 26144G > T, and 2261G > T.
Forster et al. (Forster et al., 2020) used phylogenetic network analysis to find three central variants, A, B, and C, distinguishing East Asian isolates from European and American isolates. In their study, variants found defined clusters and/or subclusters by synonymous mutations 29095T > C (subclustering A), synonymous mutation 8782T > C and nonsynonymous mutation 28144C > T (separation of clusters A and B), In addition, Shen et al. (Shen et al., 2020) observed a median of 1-4 intrahost variants, ranging from 0-51, from a set of eight patients infected with SARS-CoV-2. Included in the Shen study were two patients from the same household, one of which was likely to have infected the other. Interestingly, only 7/25 variants detected in these two individuals were shared, illustrating the high viral mutation rate.
The SARS-CoV-2 spike (S) protein is of particular interest, due to its interaction with the human ACE2 protein that helps to mediate infection of host cells. Korber, et al. (Korber et al., 2020) have set up a workflow for measuring the dynamics of nonsynonymous mutations within the S protein coding region. This study has uncovered a number of amino acid mutations in the S protein, including D614G (23403G > A), which is thought to increase transmissibility due to its rapid expansion in global samples. This mutation was shown to have a high association with two other mutations, 3037C > T and 14409C > T.

Materials and Methods
All available SARS-CoV-2 nucleotide sequences and associated annotations (including locality) were downloaded from NCBI (https://www.ncbi.nlm.nih.gov/genbank/sars-cov-2-seqs/) on 5/7/2020, resulting in 2,262 sequences. Geographical location information was reduced to the corresponding ISO 3166-1 alpha-3 country code (Supplemental Table 1), and the two-letter abbreviated state code. Sequences were then filtered to include only those listed as complete with an isolation source of "Homo sapiens", which left 1,695 sequences listed as complete genomes isolated from human samples. Sequences from this set containing gaps in their assembly (defined by the character "N") were removed from further analysis. Isolates that were 100% matches across their entire length were merged into a single representative sequence (Table 3). A final total of 1,043 sequences were used in the final analysis, including three from Kentucky that we previously analyzed (Chung et al., 2020). These sequences were then compared against each other, and a distance matrix was constructed based on the number of gaps and mismatches, as calculated by NCBI blastn (v2.10.0) (Johnson et al., 2008). A multiple sequence alignment was performed using kalign (v3.2.5) (Lassmann, 2020) which resulted in a .aln alignment file.

Results
Our analysis focused on a set of 1,043 filtered sequences, including 1,040 publicly available in GenBank, as well as three new isolates sequenced by our group in Kentucky (GenBank accessions MT365025, MT365026, and MT365027). From our group of sequences, we found 47 groups of sequences that had at least two isolates that were 100% identical, including one group with 18 isolates all from Washington state; a second group with 16 isolates primarily from Washington state, and one group of 14 isolates all from Zhejiang, China (Table 3). These clustered groups of sequences are not surprising, particularly when examining the geographical similarities, suggesting clusters of similar transmission in both time and viral strains. The majority of the sequences were from the United States (Table 1, Figure 3), with California, New York, and Washington comprising the majority of the sequences (Table 2, Figure 4).
Based on a variant threshold of minor allele frequencies of 1% or more, we detected a total of 417 locations where isolate genomes express an alternate allele other than the reference, NC_045512, including 229 in the 5'UTR, 21 in  12 USA-NY -- ORF1ab, 2 in S, 3 in ORF3a, 1 in M, 3 in ORF8, 4 in N, 2 intergenic, and 152 in the 3'UTR. The vast majority of the detected variants lie within the 5'UTR and 3'UTR, and represent deletion events. Since these may be due to a number of factors such as sequencing preparation (i.e. selection of amplicon primers) and difficulty in multiple sequence alignment construction that contribute to less reliability, we only retained seven of the UTR variants for further consideration. This filtering led to a total of 44 variants (Table 4). Of these, twenty have previously been reported (Supplemental Table 2), including the pair at locations 8782/28144 which was previously demonstrated to have a high linkage (Tang et al., 2020) and the tuples at locations 8782/18060/28144, 241/3037/23403/28144, and 241/3037/14408/23403 which were shown to have a high number of descendants (Yin, 2020). We uncovered 24 novel variants not previously described to our knowledge, which may be a result of more recent mutation events and/or fixation of a mutation within the population.
Surprisingly, two of these, 1059C > T in the nsp2 region of ORF1ab and 25563G > T in ORF3a are found at a high frequency within the population, at a rate of 43% and 49%, respectively.
Linkage between the detected variants was analyzed using Haploview (v4.2) (Barrett et al., 2005) (Figures 5 and 6).   A total of 38 associations with an r 2 value >= 0.8 were detected (Table 5). In order to further examine these associations for viral evolution, we looked at their frequencies within geographic groupings, including China, East Asia, Hong Kong and Taiwan (Figure 7); China, India, and West Asia (Figure 8); and China, Cruise A, USA, and Europe (Figure 9), as well as by sampling date by month, ranging from December, 2019 to April, 2020 (Figure 10). Intriguingly, the time analysis shows nine variants that are shifting frequency in samples more recently procured during March-April 2020, including 490T > A, 1397G > A, 23403A > G, and 29540G > A. Eight of these variants are in coding regions, with six producing non-synonymous mutations in the amino acid sequence. One of these, 23403A > G has been recently reported as a mutation in the Spike protein resulting in a more transmissible form of SARS-CoV-2 (Korber et al., 2020).

Discussion
Of the 44 common variants we found, 20 result in nonsynonymous mutations which may have functional relevance. Understanding these specific variants is critical to being able to react to the evolution of SARS-CoV-2, in particular, in developing an effective vaccine. Among previously reported variants, a few proposed functional consequences have been reported. 14408C > T within the RNA-dependent RNA polymerase gene (RdRp) may potentially affect proofreading or binding with other cofactors, thus affecting viral mutation rates (Pachetti et al., 2020). The variant 23403G > A within the S protein has     been hypothesized to increase transmissibility due to its expansion in global samples. This is thought to occur via one of two mechanisms, the first by diminishing interactions between the S1 and S2 promoters of the spike protein based on structural changes, and the second by affecting immunological response due to its location within an immune-dominant epitope (Korber et al., 2020).
Our analysis uncovered six previously unreported nonsynonymous variants, including 833T > C, 1059C > T, 11916 C > T, 18736T > C, 18998C > T, and 25563G > T. Five of these transition events occur within ORF1ab, and one occurs within ORF3a. Of these, two (833T > C and 1059C > T) are within the nsp2 region, which is postulated to play a role in the host cell survival pathway via interactions with prohibitin (PHB) and prohbitin 2 (PHB2); one (11916 C > T) is within the nsp7 region that may act as a primase and therefore be involved in viral replication; two (18376T > C and 18998C > T) are within the 3' to 5' exonuclease, which functions in proofreading; and one (25563G > T) in ORF3a which forms viroporin ion channels and may modulate virus release (SwissProt, 2020).
While the SARS-CoV-2 virus is the product of a single copy RNA genome and our analysis of variants having a high association may be more applicable to eukaryotic genomes, it is known that recombination between strains is a key contributor to coronavirus evolution (Graham and Baric, 2010;Li et al., 2020b;Rehman et al., 2020). Korber, et al. (Korber et al., 2020) demonstrate that one particular mutation in the S protein (S943P) is likely a result of recombination between strains, due to the fact that at the time of reporting, it was found only in Belgium, but in multiple lineages, suggesting it was not a result of a founder sequence. Many associations have been identified (Korber et al., 2020;Pachetti et al., 2020;Tang et al., 2020;Yin, 2020), including the 38 we found with r 2 > 0.8. Further analysis of these is necessary, in order to rule out founder effect. A follow-up study using the reverse-genetic approach would be needed to understand the biological effects of the variant nucleotides and the nature of the high association between variant sequences (e.g. compensatory mutations) (Dinnon et al., 2020;Xie et al., 2020).

Conclusion
Our analysis resulted in 44 common variants within SARS-CoV-2, 24 of which had not been previously described. From these variants, a total of 38 pairwise associations had an r 2 value > 0.8. Since COVID-19 is an emerging disease, a lot of current research efforts focused on understanding the origin and mutation of the SARS-CoV-2 virus and isolates. As a result, the amount of sequence information is increasing on a daily basis. In this study, we constructed a framework that will allow us to update our analysis rather efficiently with little intervention. We hope to expand our analysis to include more recent submissions to the NCBI GenBank database, as well as the GISAID EpiFlu database (Elbe and Buckland-Merrett, 2017). As more data becomes available, with additional isolates resulting from community-acquired transmission, a more complete picture of the evolution of SARS-CoV-2 will be possible.