Mutation hot spots in Spike protein of SARS-CoV-2 virus

Spike (S) protein of Coronaviruses help in receptor attachment and virus entry into the host cells. While S protein is required for virus entry, it is also important as an immunogen as it is the most accessible part of the virus architecture. S protein form knob like structures (viral spikes) protruding outwards in the form of homotrimers containing an S1 and S2 as monomers. Mutations in structural proteins of virus play crucial role in determining virulence and also in many instances influencing emergence of antibody escape variants and cellular tropism. In this paper we have performed in depth analyses of spike protein sequences from various parts of the world and tried to correlate the data with possible functional relevance of such mutations.


Introduction
In recent times novel coronavirus 2019/ nCoV-19/ COVID 19/ SARS CoV2 infection has become a pandemic and matter of concern worldwide. As per the World Health Organization (https://www.who.int/emergencies/diseases/novel-coronavirus-2019/situation-reports/), globally, there have been millions of confirmed COVID 19 cases thousands have died.
The global sequencing effort have led to significant numbers of SARS-CoV-2 sequence information available now in various sequence databases specifically dedicated for COVID-19 research. Although number of sequences are increasing, the representation of various countries varies notably with more representations from few and very little from many others. Thus, this is a limiting factor in the field of sequencebased studies.
Spike protein is one of the most important structural protein of SARS-CoV-2 that plays the major role in virus attachment to its receptor followed by cellular entry [1][2][3]. It is one of the major structural proteins, 1273aa long, with two major sub domains, S1 and S2 ( Figure 1). While S1 harbours the receptor binding domain or RBD and mediates virus attachment to its ACE2 receptor, S2 carries out the function of fusion to enable successful entry.
Viruses evolve by continuously mutating its gene sequences and the frequency of mutation is also attributed to the lack of proof-reading activity of the viral replication machinery. In case of SARS-CoV-2, the virus had proof-reading properties. Yet, mutations appear albeit at a slowly. The correlation between mutations and function/s of viral protein/s becomes important while designing drug/ vaccine candidates.
Here, we have studied the profile of mutations occurring in the spike protein worldwide and also laid more in-depth analyses of the sequences available from Indian patients.
Additionally, 1648 whole genome sequences of SARS-CoV-2 deposited from Indian patients were also downloaded from GISAID database for detailed spike protein mutation analysis in Indian context. Although around 2000 sequences from Indian origin have been deposited in the GISAID database, not all are well sequenced with respect to the protein sequence of interest. Thus, only 1648 sequences with good translatable sequence information were used for current sequence analyses. Accession YP_009724390.1 was used as the reference sequence.

Sequence Analyses
Multiple Sequence Alignment (MSA) was done using Clustal Omega and MAFFT V7 tools.

Prediction of protein stability change
Since D614G is one of the most dominant mutation, the effect of this mutation on spike protein stability and flexibility was studied using DynaMut prediction that uses Normal Mode Analysis (NMA). the SARS-CoV-2 spike ectodomain structure (open state) (PDB ID: 6VYB) was downloaded from RCSB PDB, uploaded on DynaMut software (University of Melbourne, Australia) and changes in vibrational entropy; the atomic fluctuations and deformation energies were determined for mutation D614G . For atomic fluctuation and deformation energy calculations, calculations were performed by the software over first ten non-trivial modes of the molecule.

Visualization of sites of RBD mutation with respect to ACE receptor interaction face
To visualise the mutated amino acid positions in SARS-CoV2 Spike RBD region with respect to ACE receptor the crystal structure of spike RBD with ACE2 was downloaded from PDB (PDB ID: 6LZG). The structure was viewed using PyMoL software and the two proteins were differently coloured along with the amino acids that were found to have mutated.

Results and Discussion
Multiple sequence alignment of COVID-19 spike protein sequences from various geographical locations world-wide revealed the types of mutations occurring since the outbreak as the pandemic progressed. Although, there were many mutations dispersed at various sites in the spike protein sequence, few mutations occurred more frequently (Figure 2A) i.e. in multiple isolates/ sequences.

Spike protein mutations world-wide
Till date there are several thousands of SARS-CoV-2 sequencing data available for analyses on different databases. To understand the mutation profile of spike protein on world-wide level representing multiple continents or geographical sources, we used all the available annotated full-length sequences of the spike protein from all the major continents from NCBI virus database. 11,571 sequences in total were taken into consideration as shown in Table 1.
While there were commonly occurring random mutations, as many as 15 mutations were detected to be occurring in a country specific manner and represented to be present in different domains of the SARS-CoV-2 spike protein. Out of 15 mutations detected, 8 of them lie in S1 domain, 6 of them lie in S2 domain and 1 of them lie in signal peptide. The signal peptide residue mutation, L5F is seen mainly in the USA sequences. Mutation in the signal peptide might modulate in the signalling events in the intracellular secretory pathway although the exact relevance can only be ascertained by cell culture-based assays on virus-host interactions. In the S1 domain, 8 mutations were detected in different continents out of which 4 were detected in Asia. India alone represented three of these mutations of the in the S1 domain namely L54F, R78M and E583D (Table 1). Further analysis of other Asian isolates revealed mutations at other positions, namely T22I (Iran), G1219V (Japan), T791I (Taiwan) and A829T (Thailand). Although more sequence analyses alter current judgements made, the emergence of mutations signify the evolving nature of the spike protein albeit at various levels. Three of the mutations S477N, G485R and N501Y lying in the receptor binding domain (RBD) might alter the virus-ACE2 receptor interactions or could potentially lead to involvement of other receptors or co-receptors. Alteration of receptor interactions does not necessarily indicate effect on virulence and the hypothesis by itself requires time-testing i.e. monitoring positive selection of the emerging mutations and also wet lab validations. S2 domain helps in the fusion process. In S2 domain, 6 mutations were detected out of which 3 were detected in Asian isolates. The S2 mutations were more prevalent as compared to S1 especially T791I which occurred in 22.6% of sequence analysed from Taiwan and A829T which neared 60.3% in those from Thailand. Other 3 mutations were detected in France (A845S), Turkey (M900I) and USA (P1263L). These mutations have a frequency of 8.2%, 41.8% and 0.7% respectively.
Overall, there were 18 major mutations noted and each one emerged at different time points on the pandemic time scale (Table 2, Figure 2A). D614G and A829T mutations were the first major mutations to emerge as early as in the month of January albeit only in few places during that time point. Table 2 enlists the accession numbers of the sequences where each of the major mutations primarily appeared. M900I appear to be one of the recent ones with the sample collection date being in July.

Mutation 614G
Mutation at 614 from Aspartic acid to glycine is one of the mutations that appeared in more than 70% of the sequences in all the continents except in Oceania (major geographical location is Australia) where it appeared in 66.8% of the sequences (Table 3, Figure 7). Mutation from Aspartic acid to glycine is a potentially crucial change in a protein sequence as Aspartic acid is a big negatively charged, acidic amino acid whereas Glycine is a small neutral amino acid and thus a change from D to G might lead to electrostatic alterations. In a study by Feng et al [7], it was shown that when there was a mutation from G to D in the RNA dependent RNA polymerase PB1 of H5N1 virus, the binding of PB1 to viral RNA was hampered and H5N1 was attenuated. In case of COVID 19 spike protein since, residue 614 is positioned between S1 and S2 and a D to G change might affect the binding of spike with its receptor. In this study we used in silico platform of DynaMut prediction (which uses Normal Mode Analysis/ NMA) to check the effect of D614G mutation on the structural stability, flexibility, atomic fluctuations (amplitude of absolute atomic motions) and deformation energies (change in local flexibilities) ( Figure 8). This mutation was seen to stabilize the structure and decrease molecular flexibility. Since Aspartic acid to Glycine is a major electrostatic change, this might also lead to formation of an antibody escape mutant if position 614 is a part of immunogenic epitope. In case this position is a part of an epitope, this mutation might help the virus to escape the immune system and proliferate into a new more adapted cluster. Currently, D614G mutation has been a topic of debate and functional relevance of this mutation with respect to virulence has not been established. This assumption can be validated by incorporating this mutation in a laboratory isolate followed by testing cellular tropism and affinity for neutralizing antibodies.

Mutation at Receptor Binding Domain
Tables 4-6 and Figure 3 show a compilation of all the RBD related mutations. We have shown this analysis separately as the receptor binding domain has been the focus of drug targeting and vaccine development because of the role in receptor binding. We observed that RBD has seen three major mutations showing percentage varying from 1%-4% of the total sequences of each geographical location. All these mutations were from Oceania (Table 4). In one of our previous studies published as a pre-print we had analysed mainly North American sequences as the study was performed at an earlier stage of the current pandemic and the availability of sequence was a limitation. In case of North America, we previously noted three mutations in the RBD domain, A348T, G476S and V483A. However, the frequency of mutations these are currently <1%, hence these mutations are not propagating further or propagating at a very slow pace. Similarly, in the current analyses which involved multiple continents, nearly 25 minor mutations were observed in the RBD region (Table 6) which have very low frequency of occurrence in the total sequence studied. Some of these mutations like N439K was detected in samples collected in the month of July, 2020 and thus more sequencing results that eventually gets deposited over next few months might shed light on the possibility of selection of such mutations and role in virus evolution.
To check the molecular location of the mutation sites of RBD with respect to the ACE2 receptor binding interface, we checked eleven of the RBD mutation sites on the RBD-ACE2 complex structure from PDB ( Figure 4-6). Most of the major RBD mutations (except L518I) lie in the receptor binding motif (RBM) of the spike protein, the residues of which directly interact with the ACE2 receptor. As per the crystal structure used, amino acids of SARS-CoV2 RBD region observed to be interacting with ACE2 receptor were A475, Y489, F486, N487, E484, Y453, K417, Y449, G496, Q498, G446, G502 and T500. Mutation site 484 with respect to the E484Q mutation in the Indian isolates and other mutations localize around the ACE2 interacting amino acids. This imply possible effect of these mutations on receptor attachment.

Landscape of spike protein mutations in Indian patients
To correlate our basic overall understanding of the mutation hot-spots in the spike protein from other parts of the world with Indian variations, we tried to study almost all available translatable good quality sequences with respect to the spike protein from the GISAID database. Such sequences represented multiple Indian states and widely distributed sample collection time points. Multiple new mutations were detected in the spike glycoprotein sequence of the Indian isolates that were not detected in other countries (Table 7, Figure 2B-C). One particular mutation V367F lying in the RBD region of the protein was initially detected in a Chinese isolate in the month of January (Table 6) and our analysis detected this mutation only in Indian isolates collected in the month of May, 2020 with a frequency of 0.12%. This mutation was not detected in any other countries as per sequences analysed thus suggesting either minor independent mutation that might have emerged in India after the outbreak. Since the international flights were restricted inflow of infected individuals carrying this mutation from China is less likely. Since the percentage of individuals having this mutation was less and that the mutation emerged in the month of May, not detected in later months, it might not have spread very efficiently as of sequence information available currently. However, since the mutation lies in the RBD region, monitoring possible future expansion and selection of such mutations might add on to the therapeutics field where RBD is been targeted.
The major mutations that were detected in the RBD region of the spike protein in other countries were not detected in the Indian isolates except G476S. Rather, multiple independent new mutations emerged in the spike protein of Indian SARS-CoV-2 isolates as highlighted in Table 7. The new mutations which appeared in the samples collected in the months of June and July, 2020, were distributed in both the S1 and S2 domains out of which some were also from the RBD region of the protein.
Recent studies have identified two protease cleavage sites in the Spike glycoprotein, S1/S2 cleavage site (681st -684th) and S2' cleavage site (811th-815th). These two sites have been shown to be important for the proteolytic processing of the protein which increases its efficiency to interact with the host cell receptors and mediate cellular transduction process [4][5][6]. Our mutational analysis reveals multiple mutations near the S1/S2 cleavage site (A672V, Q675R, Q675H, Q677H, Q677R, N679T, S698L) and one in the S2' cleavage site (P812L) that emerged recently in the Indian isolates. These mutations might influence the efficacy of activity of the enzymes with respect to the cleavage sites although this hypothesis is subject to experimental validations. Nonetheless, considering importance of these proteolytic cleavage events in the virus entry process, mutations might alter the tropism and transduction mechanism if such mutations succeed to get selected in the virus evolutionary process.
The data presented here is based on the currently analyzed sequences. Further sequencing and mutation analyses would shed more light on the nature of this virus and might alternatively influence the mutational profile and inference drawn.   S1 domain (residues 13-685) and S2 domain (686-1273) are highlighted in red and blue respectively. Mutations lying in the Receptor binding domain (RBD ) is highlighted in green and are also part of the S1 subunit..

Mutations
Date of it first occurrence in the year 2020 (day-monthyear) Accession ID of the sequence where the mutation was detected (NCBI) 1. D614G 00-01-2020 QJW69187.1

M900I
15-07-2020 QLK97783.1 Table 2: Major mutations detected in SARS-CoV2 Spike protein and its probable period of primary appearance (based on the collection date of the sample). The NCBI accession ID has been mentioned for the sequence where the mutation was detected.      C. Schematic showing structural organization of COVID 19 spike protein. Spike protein is 1273aa long with two main domains, S1 and S2. 13-685aa form the S1 domain and rest of the S protein comprises the S2 domain. The receptor binding domain (RBD) falls within S1 domain

Acknowledgements
CSIR is acknowledged for research support and fellowship to FB. North Bengal Medical College and Hospital is also deeply acknowledged. We thank AcSIR for academic support.

Author Contributions
FB did sequence alignments. AKB analysed biochemistry of amino acid changes and proof-read the manuscript. UR analysed the results and wrote the manuscript.

Funding
There was no funding funding for this work.