Genome Wide Analysis of Severe Acute Respiratory Syndrome Coronavirus-2 Implicates World-Wide Circulatory Virus Strains Heterogeneity

Severe acute respiratory syndrome coronavirus-2 (SARS-CoV-2), a novel, evolutionarily divergent RNA virus and the etiological agent of COVID-19, is responsible for the present devastating pandemic of respiratory illness. To explore its genomic signatures, we comprehensively analyzed 2,492 complete and/or near-complete genome sequences of SARS-CoV-2 strains submitted from across the globe to the GISAID database up to 30 March 2020. Genome-wide annotation revealed 1,407 nucleotide-level mutations at different positions throughout the SARS-CoV-2 genome. Moreover, nucleotide deletion analysis found nine deletions throughout the genome, including in the polyprotein (n=6), ORF10 (n=1) and 3′-UTR (n=2). Systematic gene-level mutational and protein profile analyses revealed a large number of amino acid (aa) substitutions (n=722), making the viral proteins heterogeneous. Notably, the residues of the receptor-binding domain (RBD) that have crucial interactions with angiotensin-converting enzyme 2 (ACE2) and with cross-reacting neutralizing antibody were conserved among the analyzed SARS-CoV-2 strains, except for a replacement of lysine with arginine at position 378 in the cryptic epitope of a Shanghai isolate, hCoV-19/Shanghai/SH0007/2020 (EPI_ISL_416320). Our method of genome annotation is a promising tool for monitoring and tracking the epidemic, the associated genetic variants, and their implications for the development of effective control and prophylaxis strategies.
Further investigations should focus on in-silico structural validation and on the phenotypic consequences of the deletions and/or mismatches for the transmission dynamics of the current epidemic, as well as on the immediate implications of these genomic markers for developing prophylaxis and mitigation measures against the COVID-19 pandemic. Moreover, identifying conformational changes in mutated protein structures and untranslated cis-acting elements is important for studying the virulence, pathogenicity and transmissibility of SARS-CoV-2. This mutational diversity should be investigated in further studies, including metabolic functional pathway analysis.


Introduction
Severe acute respiratory syndrome (SARS) is an emerging pneumonia-like respiratory disease of humans, which was reported to have re-emerged in Wuhan city, China, in December 2019 1 . The identified causative agent is a highly contagious novel beta-coronavirus 2 (SARS-CoV-2). Similar to other known SARS-CoV and SARS-related coronaviruses (SARSr-CoVs) 2,3 , the viral RNA genome of the novel SARS-CoV-2 encodes several smaller open reading frames (ORFs) such as ORF1ab, ORF3a, ORF6, ORF7a, ORF7b, ORF8 and ORF10 located in the 3′ region of the genome. These ORFs are predicted to encode the replicase polyprotein, the spike (S) glycoprotein, the envelope (E), membrane (M) and nucleocapsid (N) proteins, accessory proteins, and other non-structural proteins (nsp) [3][4][5] .
However, the ongoing rapid transmission and global spread of SARS-CoV-2 have raised critical questions about the evolution and adaptation of the viral population, driven by mutations, deletions and/or recombination, as it spreads across the world and encounters diverse host immune systems and various counter-measures 6 . Initial phylogenomic analysis of three SARS-CoV-2 super-clades (S, V, and G) isolated from outbreaks in distinct geographic locations (China, USA and Europe) showed little evidence of local or regional adaptation, suggesting instead that viral evolution is mainly driven by genetic drift and founder events 7 . Nevertheless, several reports predict possible adaptation at the nucleotide and amino acid (aa) levels, as well as structural heterogeneity in the viral proteins, especially the spike (S) protein 8,9 . Interestingly, Shen et al. reported intra-host viral evolution in patients after infection, which might be related to virulence, transmissibility, and/or evolution due to the immune response 10 .
However, the previous reports are limited in that they considered very few representative complete genomes from only a few countries, targeted clade- or group-based consensus sequences, compared them to the Wuhan RefSeq genome, and focused on the structural proteins. Therefore, our study targeted the genome-wide mutational spectra to draw inferences on the evolution of the viral population.

To decipher the genetic variations, we retrieved 2,492 complete or near-complete genomes of SARS-CoV-2 available at the Global Initiative on Sharing All Influenza Data (GISAID) (https://www.gisaid.org/) up to 30 March 2020. These SARS-CoV-2 sequences came from infected patients in 58 countries of seven continents ( Supplementary Fig. 1, Supplementary Data 1). We aligned the SARS-CoV-2 genome sequences using the MAFFT online server 11 , with the complete genome sequence of the SARS-CoV-2 Wuhan-Hu-1 strain (Accession NC_045512, Version NC_045512.2) as the reference genome. The multiple sequence alignments were then opened in MEGA 7 12 to remove all ambiguous and low-quality sequences. Amino-acid heterogeneity analysis was performed with Fingerprint, a web-based protein profile analysis tool 13 . Finally, the aligned sequences were inspected in Unipro-UGENE 1.26.1 to visualize the deletions with respect to the reference genome 14 .
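To make the comparison against the reference concrete, the following is a minimal Python sketch of how substitutions and deletions could be read off one aligned genome once it sits in the same alignment as the reference. It is not the pipeline used in this study, which relied on MAFFT, MEGA 7 and UGENE; the function name `call_variants` and the toy sequences are hypothetical, and the sketch assumes both rows come from the same alignment and therefore have equal length.

```python
def call_variants(ref_aln, qry_aln):
    """Compare a query row with the reference row of one alignment.

    Returns (substitutions, deletions):
      substitutions: list of (ref_position, ref_base, query_base)
      deletions:     list of (ref_start_position, length)
    Positions are 1-based coordinates on the ungapped reference.
    Assumption: terminal gaps (unsequenced ends) are reported like any
    other deletion and should be filtered out by the caller.
    """
    substitutions, deletions = [], []
    ref_pos = 0          # current position on the ungapped reference
    del_start = None     # start of the deletion currently being extended
    del_len = 0
    for r, q in zip(ref_aln.upper(), qry_aln.upper()):
        if r == "-":
            # Column absent from the reference (insertion in some genome):
            # it does not advance the reference coordinate.
            continue
        ref_pos += 1
        if q == "-":
            if del_start is None:
                del_start, del_len = ref_pos, 0
            del_len += 1
            continue
        if del_start is not None:
            deletions.append((del_start, del_len))
            del_start = None
        # Only unambiguous bases are compared; N and IUPAC codes are skipped.
        if r in "ACGT" and q in "ACGT" and r != q:
            substitutions.append((ref_pos, r, q))
    if del_start is not None:
        deletions.append((del_start, del_len))
    return substitutions, deletions


# Toy example (not real SARS-CoV-2 sequence)
subs, dels = call_variants("ATGGCTAGCTAA", "ATGACT---TAA")
print(subs, dels)
```

Skipping ambiguous bases in this sketch plays a role loosely similar to the removal of ambiguous and low-quality sequences described above, although here it is applied per position rather than per genome.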

Results and Discussions
Nucleotide sequence alignment revealed a total of 1,407 mutations (synonymous vs nonsynonymous ratio = 2.8:1) across the entire set of genomes of the SARS-CoV-2 strains  22 . According to that study, no such deletion was reported elsewhere, and our results did not find that in the genomic data deposited in the public database either.
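The synonymous versus nonsynonymous split quoted at the start of this section depends on whether a nucleotide substitution changes the encoded amino acid. The text does not state how this classification was computed, so the following Python sketch shows only one straightforward way to do it with the standard genetic code. The helper name `classify_substitution` and the toy coding sequence are hypothetical, and the sketch assumes the coding sequence is in frame and free of ribosomal frameshifts.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (TTT, TTC, TTA, ...).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def classify_substitution(cds, pos, alt):
    """Classify a single-base substitution in an in-frame coding sequence.

    cds: coding sequence starting at the first base of a codon
    pos: 0-based index of the substituted base within cds
    alt: the substituted (new) base
    Returns (label, ref_aa, alt_aa), where label is 'synonymous' or
    'nonsynonymous'.
    """
    start = (pos // 3) * 3                  # first base of the affected codon
    ref_codon = cds[start:start + 3].upper()
    offset = pos - start
    alt_codon = ref_codon[:offset] + alt.upper() + ref_codon[offset + 1:]
    ref_aa = CODON_TABLE[ref_codon]
    alt_aa = CODON_TABLE[alt_codon]
    label = "synonymous" if ref_aa == alt_aa else "nonsynonymous"
    return label, ref_aa, alt_aa


# Toy coding sequence: ATG AAA GCT (Met-Lys-Ala)
print(classify_substitution("ATGAAAGCT", 5, "G"))  # AAA -> AAG
print(classify_substitution("ATGAAAGCT", 4, "G"))  # AAA -> AGA
```

Counting the two labels over all substitutions in coding regions would give a ratio of the kind reported above. Substitutions in untranslated regions fall outside this classification and need separate handling.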
Remarkably, ORF10 carries a 35-nucleotide deletion that includes its start codon; the start codon of the adjacent spacer region may be used for protein coding instead (Fig. 1g).
Among these deletions, three were reported previously in three strains (Japan/AI/I-