SARS-CoV-2 spike protein mutation patterns: a global scenario

Spike protein sequences of SARS-CoV-2 from over 19 countries, submitted to biological databases from around the globe, were analysed with the help of bioinformatics tools and structure prediction databases. Initial data analysis showed that entry of the virus into different geographic regions started in January 2020. Alignment of spike protein sequences of SARS-CoV-2 isolates from China and other countries revealed a critical mutation, D614G. Surprisingly, D614G was not seen in early samples submitted in January, but it gradually started appearing globally from March 2020. In contrast, spike protein mutations other than D614G, which left the pI similar but altered polarity, were found to be specific to geographical regions. In addition, a homology model of the spike protein interaction showed a predominant role of chain C of the trimeric spike protein in binding, via the receptor binding domain (RBD), to the human ACE2 receptor. Furthermore, prediction of glycosylation sites revealed about 20 potential N-glycosylation sites on the spike protein. We believe that the information presented here will not only help in a thorough understanding of infectivity but also support the scientific community in developing prophylactics and/or therapeutics for SARS-CoV-2.


Introduction
SARS-CoV-2 is an RNA virus with a positive-sense, single-stranded RNA genome of approximately 27-32 kb. It belongs to the Coronaviridae family, whose members are classified into four genera: alpha, beta, gamma and delta. Under an electron microscope the virus resembles a crown, hence the name corona. Coronaviruses are known to infect a wide range of hosts, including humans, other mammals, and birds. Infection ranges from asymptomatic to severe, with symptoms mainly in the respiratory and digestive tracts.
Interestingly, it was earlier reported or assumed that asymptomatic persons can contribute up to 80% of SARS-CoV-2 transmission (Cascella et al. 2020). However, on 9th June 2020 the WHO stated that spread of the virus by asymptomatic patients is rare, but it retracted the statement the very next day because the data were based on only a few studies. Therefore, more studies are required to establish the potential source and route of transmission of SARS-CoV-2.
SARS-CoV-2 enters human cells via the transmembrane spike (S) glycoprotein, a trimeric class I fusion protein consisting of two subunits, S1 and S2. The S1 subunit mediates attachment of the virus, and the S2 subunit subsequently mediates fusion of the viral and human cellular membranes (Hoffmann et al. 2020, Walls et al. 2020, Zhou et al. 2020). The receptor for SARS-CoV-2 has been identified as human angiotensin-converting enzyme 2 (hACE2), and recent studies determined that the receptor binding domain (RBD) of the S protein binds hACE2 with high affinity (Shang et al. 2020, Walls et al. 2020). Considering its key role, the S protein is one of the major targets for the study and development of preventive and therapeutic modalities. Glycans on proteins serve several purposes, including binding and attachment, protein folding, and masking of epitopes to evade the host defense system. It is therefore crucial to know their location on the protein when developing vaccines, a point that was previously ignored (Wolfert and Boons 2013) but that many researchers are now taking into consideration (Watanabe et al. 2020).
Bioinformatics tools and biological databases have tremendous potential for extracting information on a causative agent, which helps in monitoring the spread of disease and in discovering new drugs or preventive strategies (Bianco et al. 2012; Sosa et al. 2017).
Biological databases play a central role in bioinformatics. Structural databases in particular help to analyse three-dimensional structures and to study protein-ligand interactions, which facilitates the screening and identification of candidate targets for drug discovery. Databases are also vital for storing and updating information: they provide a platform for depositing scientific data, safeguard those data, and are key elements in the progress of scientific research.
As of 19th June 2020, a total of 8.46 million confirmed cases, 0.453 million deaths and 4.142 million recoveries had been reported globally. SARS-CoV-2 was first reported on 27th December 2019, and human-to-human transmission of the virus was reported by China on 19th January 2020. Since SARS-CoV-2 became a pandemic, lockdowns, social distancing, wearing a mask, and staying and working from home have become the new normal around the globe. It has also kept the scientific community, governments, healthcare workers, police, and diagnostic kit manufacturers on their toes to develop appropriate monitoring strategies to control the spread of the disease.
Many countries have developed their own diagnostic kits, which help them to detect the virus early and contain its spread. Despite stringent measures taken to control the disease, the constantly changing nature of the virus has created panic globally, not only among the general public but also in the scientific and healthcare communities. The only way to understand the virus better is to share information from everyone's experience and to survey the available databases. We have therefore attempted to compile the information available in biological databases and to share the critical findings that should be useful to the scientific community. This might help in monitoring and understanding the disease as well as in developing preventive and therapeutic modalities for SARS-CoV-2.

Methods
NCBI data survey: The NCBI protein database (https://www.ncbi.nlm.nih.gov) was searched using the keyword "Covid19" together with the respective country name. From the resulting list, surface glycoprotein sequences were collected in FASTA format according to the date of sequence publication or, where that was unavailable, the date of submission or sampling. Sequences published on the same date were treated as one to avoid duplication and to reduce the number of sequences. Each sequence was stored under the category given by its date of publication, sampling or submission. Information such as country of origin, date of publication, patient travel history (if available), mutation site, and the charge and position of the mutant amino acids was then extracted and tabulated (Table 1).
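The "one sequence per date" rule described above can be sketched as follows. This is a minimal illustration, not the authors' actual procedure: the `SpikeRecord` type, its field names, the placeholder accessions and the dates are all hypothetical, and the records are assumed to have already been downloaded and annotated by hand.

```python
from datetime import date
from typing import NamedTuple


class SpikeRecord(NamedTuple):
    """One collected spike sequence (hypothetical field names)."""
    accession: str
    country: str
    record_date: date  # publication date, else submission or sampling date
    sequence: str


def one_per_date(records):
    """Keep the first record seen for each (country, date) pair."""
    kept = {}
    for rec in records:
        key = (rec.country, rec.record_date)
        kept.setdefault(key, rec)  # later records with the same key are ignored
    # Return the survivors in chronological order within each country.
    return sorted(kept.values(), key=lambda r: (r.country, r.record_date))


# Made-up records for illustration only (not real accessions or dates).
records = [
    SpikeRecord("ACC001", "India", date(2020, 1, 27), "MFVFLVLLPL"),
    SpikeRecord("ACC002", "India", date(2020, 1, 27), "MFVFLVLLPL"),
    SpikeRecord("ACC003", "India", date(2020, 3, 15), "MFVFLVLLPL"),
    SpikeRecord("ACC004", "Italy", date(2020, 1, 30), "MFVFLVLLPL"),
]
selected = one_per_date(records)
```

Keying on the (country, date) pair rather than on the sequence itself mirrors the rule as stated in the text; it would also discard a distinct sequence published on the same day, which is a limitation of the rule rather than of the sketch.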
Multiple sequence alignment and domain prediction: The collected sequences were grouped by country of origin and saved in FASTA format in text files. These files were used for multiple sequence alignment with ClustalX version 2.1 (Higgins 1994; Chenna et al. 2003). The alignments were then inspected for mutations, which were tabulated (Table 1). For prediction of domains in the SARS-CoV-2 spike protein, the sequence with NCBI accession no. QHU36864 was used; it was isolated from a 61-year-old male from Wuhan, Hubei, China. The sequence was downloaded in FASTA format and submitted to the online ScanProsite tool at expasy.org for domain prediction (De Castro et al. 2006).
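The step of reading mutations off an alignment can be illustrated with a short sketch. It assumes two rows of an existing alignment (a reference and an isolate) given as equal-length strings with "-" for gaps; the function name `call_substitutions` and the toy sequences are hypothetical, and the source does not state whether the inspection was done manually or programmatically.

```python
def call_substitutions(ref_aligned, query_aligned):
    """List substitutions such as 'D614G' between two aligned sequences.

    Positions are numbered along the reference, so gap columns in the
    reference (insertions in the query) do not advance the counter.
    """
    if len(ref_aligned) != len(query_aligned):
        raise ValueError("aligned sequences must have equal length")
    substitutions = []
    ref_pos = 0
    for ref_aa, query_aa in zip(ref_aligned, query_aligned):
        if ref_aa == "-":
            continue  # insertion relative to the reference
        ref_pos += 1
        if query_aa in ("-", "X"):
            continue  # deletion or unresolved residue, not a substitution
        if ref_aa != query_aa:
            substitutions.append(f"{ref_aa}{ref_pos}{query_aa}")
    return substitutions


# Toy alignment (not real spike sequence): one substitution at position 4.
reference = "MFVD-LLPQ"
isolate = "MFVGALLPQ"
print(call_substitutions(reference, isolate))  # ['D4G']
```

Numbering along the reference is what makes a label such as D614G comparable across isolates, since insertions or deletions elsewhere in an isolate would otherwise shift the position.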

Protein structure view: Homology models of the SARS-CoV-2 spike protein, including its receptor binding domain (RBD), and of human ACE2 were viewed and analysed using PyMOL version 2.4 (Schrodinger, 2010). Colouring of protein chains and location of amino acid positions were also done in PyMOL.
Prediction of glycosylation sites: N-glycosylation sites on the SARS-CoV-2 spike protein were predicted using NetNGlyc.
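For orientation, candidate N-glycosylation sites are conventionally defined by the sequon N-X-S/T, where X is any residue except proline. The sketch below only locates this motif with a regular expression; it is not a reimplementation of NetNGlyc, which additionally scores each sequon with neural networks, so the two will not necessarily agree on the number of sites. The function name and the toy sequence are hypothetical.

```python
import re

# N-X-S/T sequon, where X is any residue except proline. The lookahead
# lets overlapping sequons (e.g. "NNTS") be reported individually.
SEQUON = re.compile(r"N(?=[^P][ST])")


def find_sequons(protein):
    """Return 1-based positions of Asn residues that start an N-X-S/T sequon."""
    return [m.start() + 1 for m in SEQUON.finditer(protein.upper())]


# Toy sequence (not real spike sequence).
toy = "MNNTSAPNPSANGT"
print(find_sequons(toy))  # [2, 3, 12]
```

In the toy sequence the asparagine at position 8 is skipped because it is followed by proline, which is the one exclusion in the sequon definition.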

Results and discussion
Analysis of the spike protein sequences collected from over 19 countries revealed that entry of the virus into the respective regions (except China) started in January 2020, after the official publication of the protein sequence by China. India, the Republic of Korea, Italy, Australia and the USA showed entry in January (based on the date of sample collection in Table 1), whereas entry into Japan, Brazil and Israel took place in February. This was followed by Nepal, South Africa, Bangladesh, Pakistan, Iran, Sri Lanka and Turkey in March and April 2020.
However, these conclusions are based on data available in the NCBI database, which might not be complete. Although entry of the virus into India, the USA and Italy started in January, the spread of the virus and the fatality rate were low in India (3.3%, https://coronavirus.jhu.edu/data/mortality). This could be due to differences in physical factors of the geographical region, ethnicity, genetic determinants and phenome of the local population, food habits, relative herd immunity, and so on. Interestingly, a major mutation was seen in sequences submitted from outside China, indicating that the virus started mutating in its new host in different geographical regions. This was evident in our multiple sequence alignment results (Table 1). However, some specific mutations other than D614G were seen in Chinese protein sequences, and all were hydrophobic in nature (Table 1). Hydrophobic amino acids such as valine, leucine, isoleucine, phenylalanine and methionine have nonpolar side chains and are considered essential for the folding of polypeptide chains and for maintaining the globular structure of a protein (Kauzmann, 1959; Perutz et al. 1965; Dill, 1990; Gowder et al. 2014). It is therefore supposed that the predominantly hydrophobic amino acids on the chain C surface might provide internal stability to the protein chains, helping to maintain the globular and trimeric structure of the spike protein (Dyson et al. 2006).
Interestingly, the D614G mutation was not seen among Indian, Italian, Japanese and American isolates collected during January and February; however, it was found in isolates collected from March 2020 onwards. This indicates progressive adaptation of the virus, and subsequent genetic drift with enhanced infectivity, in its new host in different geographical regions. Mutations were also observed at other positions of the spike protein; however, they were specific to particular strains and not commonly seen in the other protein sequences listed in Table 1. Again, the mutated amino acids were hydrophobic in nature, except in two samples, where they were acidic.
Surprisingly, none of these mutations altered the pI of the protein (Table 1).
Glycosylation prediction revealed 20 potential N-glycosylation sites on the spike protein (Figure 5). Glycosylation of glycoproteins is important because many viruses use glycans to mask surface peptide epitopes that would otherwise elicit an antibody response (Zhou et al. 2020). Therefore, it is essential to consider their presence on the protein while designing vaccines.
A recent study by Sino Biological, which developed a recombinant S1 protein carrying the D614G mutation, showed binding efficiency to ACE2 comparable to that of the original 614D protein based on ELISA analysis (https://www.sinobiological.com/research/virus/2019-ncovantigen?utm). They also state that D614G was initially found predominantly in European countries, which is contrary to the data analysed in this study, and that the mutant strain is more transmissible. The underlying study was carried out by the US-based Los Alamos National Laboratory in collaboration with Duke University and the University of Sheffield, England; its author, Bette Korber, studied a total of 6000 sequences from around the globe and found 14 mutations. They also report that this predominant strain migrated to the US East coast in mid-March 2020. At this point, however, it is not clear whether the strain was carried to the US or arose by mutation within the USA. To examine the claim that the D614G mutant is more transmissible, a literature survey was carried out, which found a recent study by Zhang et al. (2020) investigating this question. They showed that D614G indeed enhances viral transmission, based on experimental evidence that retroviruses pseudotyped with the G614 spike variant infected ACE2-expressing cells markedly more efficiently than those pseudotyped with the D614 variant.
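The effect of a charge-changing substitution on pI, as discussed in this section, can be illustrated with a simple calculation. The source does not say which tool produced its pI values, so the sketch below is an assumption throughout: it uses the Henderson-Hasselbalch equation with an illustrative set of pKa values and a bisection search, and the toy peptides are hypothetical. Absolute values depend on the pKa set chosen and will not reproduce the figures in Table 1; only the direction of the shift is meaningful here.

```python
# Illustrative pKa values (an assumption; tools differ in the set they use,
# so absolute pI values differ between tools).
POSITIVE_PKA = {"K": 10.5, "R": 12.5, "H": 6.0, "N_term": 9.0}
NEGATIVE_PKA = {"D": 3.9, "E": 4.1, "C": 8.3, "Y": 10.1, "C_term": 2.0}


def net_charge(seq, ph):
    """Net charge of a peptide at a given pH (Henderson-Hasselbalch)."""
    counts = {aa: seq.count(aa) for aa in "KRHDECY"}
    counts["N_term"] = counts["C_term"] = 1
    positive = sum(counts[g] / (1 + 10 ** (ph - pka))
                   for g, pka in POSITIVE_PKA.items())
    negative = sum(counts[g] / (1 + 10 ** (pka - ph))
                   for g, pka in NEGATIVE_PKA.items())
    return positive - negative


def isoelectric_point(seq, tol=1e-4):
    """Find the pH at which the net charge is zero, by bisection."""
    lo, hi = 0.0, 14.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if net_charge(seq, mid) > 0:
            lo = mid  # still positively charged: the pI lies at higher pH
        else:
            hi = mid
    return round((lo + hi) / 2, 2)


# Toy peptides (not real spike sequence): Asp at position 3 replaced by Gly.
wild_type = "MKDEDLGHKE"
mutant = "MKGEDLGHKE"
print(isoelectric_point(wild_type), isoelectric_point(mutant))
```

Replacing an acidic residue with a neutral one removes a negative charge, so the computed pI of the mutant is higher than that of the wild type. Substitutions between uncharged residues leave every term in `net_charge` unchanged, which is consistent with hydrophobic substitutions not altering pI.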
In conclusion, the major mutation was seen at amino acid position 614, and this mutation, D614G, changed the amino acid from acidic to neutral. D614G changed the pI of the protein from 6.05 to 6.13, but surprisingly the other mutations did not affect the pI of the