Sequence analysis and amino acid variations of structural proteins deduced from novel coronavirus SARS-CoV-2 strains, isolated in different countries

SARS-CoV-2 is a novel and highly pathogenic coronavirus that was first identified in Wuhan, China, in 2019 and has since spread to 185 countries and territories. As of April 29, 2020, more than 3.11 million cases had been recorded and more than 217,000 people had died. Despite worldwide efforts, there is currently no vaccine or drug available to protect people against SARS-CoV-2. The world urgently needs a SARS-CoV-2 vaccine or effective antiviral drugs to relieve the human suffering associated with a pandemic that kills thousands of people every day. The SARS-CoV-2 genome encodes non-structural proteins, named ORF1a/b, and structural proteins such as the spike (S) glycoprotein, nucleocapsid protein (N), small envelope protein (E) and matrix protein (M). A number of studies have shown that the CoV spike (S) glycoprotein and nucleocapsid protein (N) could be promising targets for vaccine, antibody and therapeutic drug development to combat the deadly, pandemic SARS-CoV-2. The purpose of the present paper is the sequence analysis of structural proteins deduced from novel coronavirus SARS-CoV-2 strains isolated in different countries, and the analysis of their amino acid variations. Multiple sequence alignments of S, N and E proteins from four different coronavirus species are also described. The data from these studies are expected to be useful for the design and development of vaccines, antibodies and therapeutic agents that can be used to combat the highly pathogenic SARS-CoV-2 coronavirus worldwide.


Introduction
SARS-CoV-2 is a novel and highly pathogenic coronavirus that caused an outbreak in Wuhan, China, in 2019, soon spread nationwide and then spilled over to other countries around the world. The head of the United Nations has described this as humanity's worst crisis since World War II. Despite worldwide efforts, there is currently no vaccine or drug available to protect people against SARS-CoV-2. The world urgently needs a SARS-CoV-2 vaccine or antiviral drugs to relieve the human suffering associated with a pandemic that kills thousands of people every day. SARS-CoV-2 is considered a betacoronavirus, like MERS-CoV and SARS-CoV, with a single-stranded RNA genome. Phylogenetic analysis of coronavirus genomes has revealed that SARS-CoV-2 is a new member of the betacoronavirus genus, which includes SARS-CoV, MERS-CoV, bat SARS-related coronaviruses (SARSr-CoV), as well as other coronaviruses identified in humans and animal species (Zhou, P. et al. 2020; Wu, F. et al. 2020; Lu et al., 2019). Two-thirds of the SARS-CoV-2 RNA genome (~30 kb) encodes non-structural proteins, named ORF1a/b (pp1a and pp1ab). The rest of the virus genome encodes mainly structural proteins such as the spike (S) glycoprotein, nucleocapsid protein (N), small envelope protein (E) and matrix protein (M). A number of studies have identified the CoV spike (S) glycoprotein as a leading target for vaccine, antibody and therapeutic drug development against the deadly, pandemic SARS-CoV-2. It was shown that SARS-CoV-2 shares about 80% sequence identity in the spike (S) gene with SARS-CoV and other SARSr-CoVs (Zhou et al., 2020). However, bat coronavirus RaTG13 appears to be the closest relative of SARS-CoV-2, sharing over 93.1% sequence identity. The crystal structure of the SARS-CoV-2 spike receptor-binding domain (RBD) bound to the cell receptor ACE2 was recently determined at 2.45 Å resolution by Zhou et al. (2020).
It was demonstrated that the overall ACE2-binding mode of the SARS-CoV-2 RBD is nearly identical to that of the SARS-CoV RBD, which also utilizes ACE2 as the cell receptor (Wong et al., 2004). Since the RBD is the critical region for receptor binding, it holds great promise for developing highly potent cross-reactive therapeutic agents against diverse coronavirus species, including SARS-CoV-2. The spike protein of SARS-CoV-2 is a cysteine-rich protein; a total of nine cysteine residues are found in the RBD, eight of which form four pairs of disulfide bonds (Lan et al., 2020). In addition, the spike protein has 22 potential N-glycosylation sites, two of which are in the receptor-binding domain (RBD) region.
The SARS-CoV-2 nucleocapsid (N) protein is a multifunctional RNA-binding protein that is responsible for viral RNA transcription and replication. The nucleocapsid protein consists of three domains: i) an N-terminal RNA-binding domain (NTD), ii) a C-terminal dimerization domain (CTD) and iii) a Ser/Arg (SR)-rich linker. Previous studies have shown that the NTD is responsible for RNA binding, the dimerization domain for oligomerization and the SR-rich linker for phosphorylation (Lo et al., 2013; Chang et al., 2013; Chang et al., 2006; Wootton et al., 2002). A number of studies have shown that the N protein is highly produced during infection and induces a protective immune response against SARS-CoV as well as SARS-CoV-2 (Ahmed et al., 2020; Liu et al., 2006; Shang et al., 2005; Lin et al., 2003).
The purpose of this paper is to generate data that would be useful for the design and development of vaccines, antibodies and therapeutic drugs to combat the deadly pandemic SARS-CoV-2, through sequence analysis and analysis of amino acid variations of novel coronavirus SARS-CoV-2 strains isolated worldwide.

Sequence analysis of structural proteins deduced from novel coronavirus SARS-CoV-2 strains, isolated in different countries.
All available sequences of 2019 novel coronavirus (SARS-CoV-2) strains isolated in different countries were downloaded from NCBI as of 24.04.2020. We aligned 1330 spike protein sequences of SARS-CoV-2 strains isolated in different countries and found them to be nearly 100% identical. The only difference was at position 614 of the consensus sequence (Figure 1), where the residue is either G or D. This may be due to a SNP (e.g. GAT to GGT, Asp to Gly, or vice versa) (Figure 1). Similarly, when 1334 nucleocapsid protein sequences were aligned, identity was almost 100%; the only difference was in a pair of amino acids at positions 204-205, RG to KR (Figure 2). We also aligned the membrane and envelope proteins of SARS-CoV-2 strains isolated in different countries. When 1326 envelope protein sequences were aligned, the following differences were found: one mutation at position 36 (A to V, in one of the 1326 sequences), two mutations at position 37 (L to H and L to R), and one mutation at position 55 (S to F). When 1316 membrane protein sequences were aligned, the following differences were found: two sequences have mutations at positions 2 and 3; one sequence has L to V at position 57; one has V to F at position 70; one has A to I at position 73; two have A to S at position 85; one has G to R at position 89; one has L to M at position 133; one has A to P at position 142; four have T to M at position 175; one has D to N at position 190; and one has AY to VH at positions 195 and 196. Thus, there are more amino acid variations in the membrane protein than in S, N and E. In general, these results demonstrate that there are no significant amino acid variations in the S, N or E proteins of the SARS-CoV-2 strains isolated in different countries.
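The per-position tallies reported above can be illustrated with a short Python sketch. It is not the script used in this study: the function name `column_variants` and the toy alignment are hypothetical, and the sketch assumes that all sequences have already been aligned to equal length, with `-` marking gaps and `X` marking unknown residues.

```python
from collections import Counter


def column_variants(aligned):
    """Return {1-based position: Counter of residues} for variable columns.

    `aligned` is a list of equal-length strings taken from an existing
    multiple sequence alignment. Gaps ('-') and unknown residues ('X')
    are ignored when deciding whether a column is variable.
    """
    if not aligned:
        return {}
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("all aligned sequences must have the same length")

    variants = {}
    for i in range(length):
        counts = Counter(seq[i] for seq in aligned)
        for placeholder in "-X":
            counts.pop(placeholder, None)
        if len(counts) > 1:  # more than one residue type observed
            variants[i + 1] = counts
    return variants


# Hypothetical toy alignment, not real SARS-CoV-2 data.
toy = ["MKDL", "MKGL", "MKDL", "MKGL", "MRDL"]
for position, counts in column_variants(toy).items():
    summary = ", ".join(f"{res}: {n}" for res, n in counts.most_common())
    print(f"position {position}: {summary}")
```

Applied to a real alignment, each reported position would list the residues observed there together with the number of sequences carrying each one, which is the form in which the variations above are described.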

Comparison of S, N and E proteins of closely and distantly related coronaviruses.
Previous phylogenetic analysis of coronavirus genomes demonstrated that, like SARS-CoV, MERS-CoV and bat SARS-related coronaviruses, SARS-CoV-2 is a new member of the betacoronavirus genus (Zhou et al. 2020; Wu et al. 2020; Lu et al., 2019). SARS-CoV-2 shares about 80% sequence identity in the spike (S) gene with SARS-CoV and other SARSr-CoVs (Zhou et al., 2020). However, bat coronavirus RaTG13 exhibited a high sequence identity to SARS-CoV-2, sharing over 93.1%. We performed a multiple sequence alignment (MSA) of SARS-CoV (GenBank: NC_004718.3), SARS-CoV-2 (GenBank: NC_045512.2), Bat SARS-like Coronavirus WIV1 (GenBank: KF367457.1) and Pangolin Coronavirus (GenBank: MT072864.1). The sequences have 80% average percent identity over the alignment and 68.7% identical sites. As can be seen from Figure 2 (under the alignment), the amino acids after the first furin cleavage site are, as previously reported, not well conserved, whereas the amino acids at the second cleavage site, on the right, are very well conserved. Another finding is that the region rich in N-glycosylation sites (Figure 3, top left) and the cysteine-rich region (Figure 3, top right) are well conserved in the S protein from different coronavirus species. Porcine Coronavirus (GenBank: NC_039208.1). As seen from the top alignment, the E protein is strongly conserved among closely related CoVs. An interesting, highly conserved CxxC region emerges when distantly related CoVs are aligned. The CxxC motif is employed by many redox proteins for the reduction, formation and isomerization of disulfide bonds (Fomenko, Gladyshev, 2003). Thus, distantly related CoVs (Bovine CoV and Porcine CoV) are used to highlight broadly conserved regions. Post-translational modification sites such as N-linked glycosylation and phosphorylation sites detected by ProSite, which can be seen above the alignments, are well conserved in SARS-CoV-2, SARS-CoV, bat coronavirus, bovine coronavirus, pangolin coronavirus and porcine coronavirus.
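The two alignment statistics quoted above (average percent identity and percentage of identical sites) can be sketched in Python as follows. The definitions used here are assumptions chosen for illustration and may differ in detail from those used by Geneious Prime: pairwise identity is computed over columns where at least one of the two sequences has a residue, and an identical site is a gap-free column in which every sequence carries the same residue. The function names and the toy alignment are hypothetical.

```python
from itertools import combinations


def pairwise_identity(a, b):
    """Percent identity between two aligned sequences.

    Columns where both sequences have a gap are skipped; a gap aligned
    against a residue counts as a mismatch (an assumed convention).
    """
    compared = matches = 0
    for x, y in zip(a, b):
        if x == "-" and y == "-":
            continue
        compared += 1
        if x == y:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


def mean_pairwise_identity(aligned):
    """Average of pairwise_identity over all sequence pairs."""
    pairs = list(combinations(aligned, 2))
    if not pairs:
        raise ValueError("need at least two sequences")
    return sum(pairwise_identity(a, b) for a, b in pairs) / len(pairs)


def identical_sites(aligned):
    """Percent of alignment columns with one shared, non-gap residue."""
    length = len(aligned[0])
    same = sum(
        1
        for i in range(length)
        if aligned[0][i] != "-" and len({seq[i] for seq in aligned}) == 1
    )
    return 100.0 * same / length


# Hypothetical toy alignment, not the coronavirus sequences analysed here.
toy_alignment = ["ACDEF", "ACDQF", "AC-EF"]
print(f"mean pairwise identity: {mean_pairwise_identity(toy_alignment):.1f}%")
print(f"identical sites: {identical_sites(toy_alignment):.1f}%")
```

The two numbers answer different questions: mean pairwise identity averages over sequence pairs, while identical sites requires agreement across all sequences at once, so the second value is never larger than the first for the same gap-free alignment.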

Discussion
The novel coronavirus, currently designated SARS-CoV-2, is an emerging virus about which we know relatively little. It has been observed that SARS-CoV-2 may be transmitted by infected people without symptoms, which increases the challenge of controlling a deadly pandemic without a vaccine. The goal of the present paper is the sequence analysis and analysis of amino acid variations of structural proteins deduced from novel coronavirus SARS-CoV-2 strains isolated in different countries, to provide a basis for the design and development of vaccines, antibodies and therapeutic agents to combat the highly pathogenic SARS-CoV-2 coronavirus worldwide. The S protein of SARS-CoV-2 is a type I transmembrane glycoprotein that plays an important role in virus binding and entry and is also a major inducer of neutralizing antibodies. The S protein consists of a signal peptide and two domains, an extracellular domain and a transmembrane domain. Its extracellular domain consists of two subunits, S1 and S2, the latter being the carboxy-terminal membrane fusion subunit (Wrapp et al., 2020). A furin-like cleavage site, which is lacking in the other SARS-like CoVs, has recently been predicted in SARS-CoV-2 (Coutard et al. 2020). A number of studies have shown that the S and N proteins are the most promising targets for vaccine and antibody development against coronaviruses, including SARS-CoV-2 (Chen et al., 2020). Our amino acid sequence analysis of structural proteins demonstrates that, despite the higher number of amino acid variations in the membrane protein (M), there are no significant amino acid variations in the S, N or E proteins from new strains of SARS-CoV-2 isolated in different countries. We believe these data can provide a basis for the development of vaccines and antibodies to combat the deadly SARS-CoV-2 outbreak.
SARS-CoV-2 shares about 80% sequence identity in the S gene with SARS-CoV and other SARSr-CoVs (Zhou et al., 2020). However, bat coronavirus RaTG13 exhibited a high sequence identity to SARS-CoV-2, sharing over 93.1%. Our sequence analysis showed that the N protein of SARS-CoV-2 is more similar to that of the pangolin coronavirus (Figure 6). It should be noted that this protein is 100% identical in SARS-CoV and bat coronavirus. The similarity of SARS-CoV-2 to bat SARS-CoV-like coronaviruses in the S protein and to the pangolin coronavirus in the N protein suggests that bats and pangolins may serve as reservoir hosts for the progenitor of SARS-CoV-2.
By amino acid sequence analysis we also identified sequences that are conserved in many coronaviruses, including the new coronavirus SARS-CoV-2. The spike protein sequence NNTVYDPLQPELDSFKEELDKYFKNHTSP (Figure 3) and the nucleocapsid protein sequence PKGFYAEGSRGGSQASSRSSSRSR (Figure 4) were found to be particularly well conserved in many coronaviruses. Such sequences could be important for developing vaccines and antibodies, and also for diagnostic purposes.
Notably, the spike protein of SARS-CoV-2 is a cysteine-rich protein; a total of nine cysteine residues are found in the RBD, eight of which form four pairs of disulfide bonds (Lan et al., 2020). Correct formation of disulfide bridges is essential for proper folding of cysteine-rich proteins (Mamedov et al., 2019). In addition, the spike protein has 22 potential N-glycosylation sites, two of which are in the receptor-binding domain (RBD) region.
Thus, the correct formation of disulfide bridges and, accordingly, the correct N-glycosylation status will be crucial for the correct folding of the S protein when it is produced recombinantly in a heterologous system.
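Sequence features of the kind discussed above (potential N-glycosylation sites, cysteine content and CxxC motifs) can be located with simple pattern searches. The Python sketch below is illustrative only and is not the ProSite scan used in this work. It assumes the commonly used N-glycosylation consensus N-{P}-[ST]-{P} (asparagine, any residue except proline, serine or threonine, any residue except proline); the function name `find_sites` and the toy peptide are hypothetical. A pattern match indicates only a potential site, not experimentally confirmed glycosylation.

```python
import re

# Lookahead patterns so that overlapping matches are also reported.
# N-glycosylation consensus (assumed): N, not P, S or T, not P.
SEQUON = re.compile(r"(?=N[^P][ST][^P])")
# CxxC motif: two cysteines separated by any two residues.
CXXC = re.compile(r"(?=C..C)")


def find_sites(pattern, sequence):
    """Return 1-based start positions of every match of `pattern`."""
    return [match.start() + 1 for match in pattern.finditer(sequence)]


# Hypothetical toy peptide, not a real coronavirus sequence.
peptide = "ANGTQCPKCNPSANKSA"

print("potential N-glycosylation sites:", find_sites(SEQUON, peptide))
print("CxxC motifs:", find_sites(CXXC, peptide))
print("cysteine count:", peptide.count("C"))
```

Because the consensus used here requires a fourth residue, a sequon ending exactly at the C-terminus of a sequence would not be reported; a real analysis would need to decide how to treat that edge case.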

Materials and Methods
First, all sequences were downloaded from NCBI Virus as of 24.04.2020, and the bulk files were pre-processed with in-house Python scripts (using the NumPy and pandas libraries). This yielded nearly 1300 sequences for each protein (spike, nucleocapsid, membrane and envelope). The sequences for each protein were then submitted separately to a local server (available at Akdeniz University, Biotechnology Department) to perform MSA with ClustalO (default settings), producing one MSA per protein. These MSAs were then loaded into Geneious Prime to visualize the alignments and calculate statistics for each mutation.
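The in-house pre-processing scripts are not described in detail, so the following Python sketch shows only one plausible form such a step could take. Everything in it is an assumption made for illustration: the function names, the idea of selecting records by a keyword in the FASTA header, the removal of sequences containing the ambiguous residue `X`, and the example headers. It uses only the standard library so that it is self-contained.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a list of (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line.upper())
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def select_protein(records, keyword, ambiguous="X"):
    """Keep records whose header contains `keyword` (case-insensitive)
    and whose sequence has none of the `ambiguous` residue codes."""
    kept = []
    for header, seq in records:
        if keyword.lower() not in header.lower():
            continue
        if any(code in seq for code in ambiguous):
            continue
        kept.append((header, seq))
    return kept


def to_fasta(records, width=60):
    """Serialize records back to FASTA text with wrapped sequence lines."""
    lines = []
    for header, seq in records:
        lines.append(">" + header)
        lines.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(lines) + "\n"


# Hypothetical headers and truncated toy sequences, for illustration only.
example = """>seq1 |surface glycoprotein|
MFVFLVLLPL
VSSQ
>seq2 |nucleocapsid phosphoprotein|
MSDNGPQNQ
>seq3 |surface glycoprotein|
MFVXLVLLPL
"""

spike = select_protein(parse_fasta(example), "surface glycoprotein")
print(to_fasta(spike), end="")
```

A per-protein file written this way could then be aligned with Clustal Omega using default settings, for example `clustalo -i spike.fasta -o spike_aln.fasta`, where both file names are placeholders.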