Mutation hot spots in Spike protein of COVID-19

The spike protein of coronaviruses mediates receptor binding and virus entry into host cells. Besides mediating receptor-dependent entry, it is also an extremely important immunogen because it is the most accessible part of the viral architecture. SARS-CoV2 or COVID 19 has four structural proteins: N (nucleocapsid), M (membrane), E (envelope) and S (spike). Although all of these proteins are part of the virus structure, only E, M and S are exposed on the outer surface of the virus particle. The S protein forms a knob-like structure protruding outwards beyond the other structural proteins. It forms homotrimers in which each monomer contains an S1 and an S2 subunit, and together they form the viral spikes. Mutations in the structural proteins of a virus play a crucial role in virulence by determining the generation of antibody escape variants and cellular tropism. In this paper we performed an in-depth analysis of spike protein sequences from various parts of the world and tried to correlate the data with the observation that the virus currently appears considerably more virulent in certain parts of the world than in others. Here, we have focussed on isolates from North America and point out three major mutation hot spots in the S1 subunit.


Introduction
In recent times, infection with the novel coronavirus 2019/ nCoV-19/ COVID 19/ SARS CoV2 has become a pandemic and a matter of concern worldwide. According to the World Health Organization, as of 11th April 2020 the total number of confirmed COVID 19 cases worldwide had reached 1,610,909, with 99,690 deaths. Among all regions, the Americas have seen the second-highest toll of affected individuals (536,664 confirmed cases and 19,294 deaths), after Europe. In Europe, Turkey, Switzerland, Bosnia and Herzegovina, Andorra and San Marino have been declared to be facing community transmission, whereas within the Americas the entire United States of America, Canada, Brazil, Ecuador, Chile, Peru, Mexico, Panama, Dominican Republic, Colombia and Argentina have been classified as experiencing community spread. The entire Western Pacific region, including China, where this pandemic originated, has seen only sporadic spread.
In this paper we have focussed on COVID 19 isolates of North American origin to investigate a possible sequence-virulence relationship of this virus in the United States. We also studied the similarities and differences between the North American isolates and variants from other parts of the world.
The spike protein is one of the most important structural proteins of SARS CoV2 and plays the major role in virus entry. It is a 1273 amino acid (aa) long protein with two major subdomains, S1 and S2 (Figure 1). While S1 harbours the receptor binding domain (RBD) and mediates virus attachment to its ACE2 receptor, S2, which contains the fusion peptide, carries out membrane fusion to enable successful entry. For the sequence analyses of COVID 19, we used the protein sequence of the spike protein.

Methods
All available full-length sequences of the COVID-19 spike protein (residues 1-1273) from different geographical origins, namely North America (United States of America) (342), South America (2), Oceania (Australia) (1), China (63) and Europe (14), were downloaded in FASTA format from the severe acute respiratory syndrome coronavirus 2 data hub of the NCBI Virus database of the National Library of Medicine (NLM).
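The full-length filter described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names and the toy records are hypothetical, and the paper does not say how the downloaded FASTA files were screened.

```python
from typing import Dict, Iterable

SPIKE_LENGTH = 1273  # full-length spike protein, as stated in the text


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA-formatted lines into {header: sequence}.

    Records with identical headers overwrite each other.
    """
    records: Dict[str, str] = {}
    header = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            records[header] = ""
        elif header is not None:
            records[header] += line
    return records


def full_length_only(records: Dict[str, str],
                     length: int = SPIKE_LENGTH) -> Dict[str, str]:
    """Keep only records whose sequence has exactly `length` residues."""
    return {h: s for h, s in records.items() if len(s) == length}


# Toy input with hypothetical headers; real data would be read from the
# downloaded FASTA file and filtered with the default length of 1273.
toy = """>isolate_A hypothetical
MFVFL
>isolate_B hypothetical
MFV
""".splitlines()

kept = full_length_only(parse_fasta(toy), length=5)
print(sorted(kept))
```

For real data, the lines of the downloaded file would be passed to `parse_fasta` and the default length of 1273 used.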
Multiple sequence alignments were performed using the alignment tool of the NCBI Virus server as well as CLUSTAL Omega. Sequence alignments from CLUSTAL Omega were viewed using the MView tool.
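Once two sequences are aligned, substitutions such as D614G can be read off column by column. The following is a minimal sketch, not the procedure used by the alignment servers; the function name is hypothetical, and the choice of reference sequence and the skipping of gaps and unknown residues (X) are assumptions.

```python
from typing import List


def call_substitutions(ref_aligned: str, query_aligned: str) -> List[str]:
    """List substitutions such as 'D614G' between two aligned sequences.

    Positions are numbered along the reference, so gap columns in the
    reference do not advance the counter. Gaps and unknown residues (X)
    in the query are skipped rather than reported as substitutions.
    """
    if len(ref_aligned) != len(query_aligned):
        raise ValueError("aligned sequences must have equal length")
    substitutions = []
    position = 0
    for ref_res, query_res in zip(ref_aligned, query_aligned):
        if ref_res == "-":
            continue  # insertion relative to the reference
        position += 1
        if query_res in ("-", "X") or query_res == ref_res:
            continue
        substitutions.append(f"{ref_res}{position}{query_res}")
    return substitutions


# Toy example with made-up short sequences (not real spike fragments).
print(call_substitutions("MDGV", "MGGV"))  # -> ['D2G']
```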
Phylogenetic analyses were performed using the simple phylogeny tool of CLUSTAL W2 with the neighbour joining method.
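To illustrate what the neighbour joining method does, here is a compact sketch that returns the tree topology only (no branch lengths). It is not the CLUSTAL W2 implementation; the p-distance measure, the function names and the toy distances are assumptions chosen for illustration.

```python
from itertools import combinations
from typing import Dict


def p_distance(a: str, b: str) -> float:
    """Fraction of differing sites between two aligned sequences,
    ignoring columns where either sequence has a gap."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    return sum(x != y for x, y in pairs) / len(pairs)


def distance_matrix(aligned: Dict[str, str]) -> Dict[str, Dict[str, float]]:
    """Pairwise p-distances as a nested dict (no self-distances)."""
    return {a: {b: p_distance(sa, sb) for b, sb in aligned.items() if b != a}
            for a, sa in aligned.items()}


def neighbour_joining(dist: Dict[str, Dict[str, float]]) -> str:
    """Return a Newick topology built by neighbour joining.

    Taxon names must not contain commas or parentheses.
    """
    if len(dist) < 2:
        raise ValueError("need at least two taxa")
    d = {a: dict(row) for a, row in dist.items()}  # work on a copy
    while len(d) > 2:
        n = len(d)
        totals = {a: sum(v for b, v in d[a].items() if b != a) for a in d}
        # Join the pair minimising Q(i, j) = (n - 2) d(i, j) - r_i - r_j.
        i, j = min(
            combinations(d, 2),
            key=lambda p: (n - 2) * d[p[0]][p[1]] - totals[p[0]] - totals[p[1]],
        )
        new = f"({i},{j})"
        others = [k for k in d if k not in (i, j)]
        # Distance from the new internal node to every remaining node.
        new_row = {k: (d[i][k] + d[j][k] - d[i][j]) / 2 for k in others}
        del d[i]
        del d[j]
        for k in others:
            del d[k][i]
            del d[k][j]
            d[k][new] = new_row[k]
        d[new] = new_row
    a, b = d
    return f"({a},{b});"


# Toy additive distances for four hypothetical taxa.
toy = {
    "A": {"B": 3, "C": 9, "D": 10},
    "B": {"A": 3, "C": 10, "D": 11},
    "C": {"A": 9, "B": 10, "D": 7},
    "D": {"A": 10, "B": 11, "C": 7},
}
print(neighbour_joining(toy))
```

With real data, `distance_matrix` would be applied to the aligned spike sequences first. For several hundred sequences a dedicated phylogeny tool is the practical choice, since this sketch recomputes all pairwise scores at every step.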

Results and Discussion
Multiple sequence alignment of COVID-19 spike protein sequences from the United States of America showed mutations recurring at a few locations, whereas other parts of the protein were conserved. Table 1 lists the mutations observed in isolates from the USA. Although many mutations were dispersed at various sites in the spike protein sequence, a few occurred more frequently (Table 2, Figure 2). At position 614, the D to G mutation occurred in 99 of the 342 isolates (about 29%), which makes it a very frequent mutation.
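A frequency summary of this kind can be sketched as a simple tally over per-isolate mutation lists. The isolate identifiers, the toy data and the cut-off used below are hypothetical; the paper does not state a numerical threshold for calling a site a hot spot.

```python
from collections import Counter
from typing import Dict, List


def mutation_frequencies(calls: Dict[str, List[str]]) -> Counter:
    """Count in how many isolates each substitution label occurs."""
    counts: Counter = Counter()
    for labels in calls.values():
        counts.update(set(labels))  # count each label once per isolate
    return counts


def hot_spots(counts: Counter, min_isolates: int) -> List[str]:
    """Labels seen in at least `min_isolates` isolates, most frequent first."""
    frequent = [label for label, n in counts.items() if n >= min_isolates]
    return sorted(frequent, key=lambda label: (-counts[label], label))


# Toy data with hypothetical isolate names, not the paper's data set.
calls = {
    "iso1": ["D614G"],
    "iso2": ["D614G", "V483A"],
    "iso3": ["V483A"],
    "iso4": ["D614G"],
    "iso5": [],
}
counts = mutation_frequencies(calls)
print(hot_spots(counts, min_isolates=2))  # -> ['D614G', 'V483A']
```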

Figure 2: Overall distribution of mutations in the analysed isolates from North America.
The receptor binding domain (RBD) of COVID 19 falls between amino acids 331 and 524 [1]. Within the receptor binding domain, three sites were seen to be mutated: A348T, G476S and V483A. Of these, V483A occurred most frequently, followed by G476S (Figure 3).
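Whether a given substitution lies in the RBD can be checked directly from its position. The sketch below uses the RBD boundaries quoted above (331 to 524); the helper names are hypothetical.

```python
import re
from typing import Tuple

RBD_START, RBD_END = 331, 524  # RBD boundaries as given in the text [1]


def parse_label(label: str) -> Tuple[str, int, str]:
    """Split a label such as 'V483A' into (original, position, replacement)."""
    match = re.fullmatch(r"([A-Z])(\d+)([A-Z])", label)
    if match is None:
        raise ValueError(f"not a substitution label: {label!r}")
    return match.group(1), int(match.group(2)), match.group(3)


def in_rbd(label: str) -> bool:
    """True if the substituted position lies within the RBD."""
    _, position, _ = parse_label(label)
    return RBD_START <= position <= RBD_END


for label in ("A348T", "G476S", "V483A", "D614G"):
    print(label, "RBD" if in_rbd(label) else "outside RBD")
```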
We compared the sequences of North American origin with all the available sequences from South America (Figure 4), Europe (Figure 5) and China (Figure 6). Among the South American isolates, one sample showed the D to G mutation at position 614, as seen in the isolates from North America. However, the other mutations at positions 348, 476 and 483 were not present. None of these mutations were seen in the Australian isolate. Unlike sequences from Asia, Australia and China, four of the fourteen European isolates aligned showed the same mutation at position 614 as the isolates from the USA. It is thus possible that a branch of mutants of European origin entered the USA, and the European form of COVID 19 seems to be closer to the American virus type with respect to the spike protein sequence. We compared all available sequences from China, but none of the highlighted mutations were found in the Chinese sequences (Figure 6). Thus, based on this sequence analysis of the spike protein, the virus that continues to spread in America differs from the original Wuhan virus.
This study revealed that the spike protein of the COVID 19 virus in the USA is mutating at various sites. To determine whether this virus is evolving in different clusters, we performed a phylogenetic analysis of all 342 available American spike protein sequences. The virus was observed to have diverged into at least twenty clusters, of which three major clusters were prominent. Cluster 1 comprised isolates with mutation G476S (Figure 5A) and cluster 2 those with mutation V483A (Figure 5B). Both of these mutations fall in the RBD of the spike protein. Thus, the RBD mutants might be evolving in different directions, which might also be reflected in the infectivity of these isolates.
The mutation at position 614 from aspartic acid to glycine, which appeared in 99 isolates, formed a very big cluster. A change from aspartic acid to glycine is a potentially crucial change in a protein sequence, as aspartic acid is a larger, negatively charged, acidic amino acid whereas glycine is a small, neutral amino acid. Because the frequent mutations identified here all lie in the S1 subunit, the majority of the conserved or non-changing zones appear to fall in the S2 domain, which could thus be used for designing therapeutic candidates or as an antiviral target. These features should also be taken into account when designing vaccine candidates for this virus in which the S protein is used as the target.
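The charge argument for D614G can be made explicit with a small lookup. This sketch assumes the common textbook simplification of side-chain charge at roughly neutral pH (D and E negative, K and R positive, histidine treated as neutral); the function names are hypothetical.

```python
import re
from typing import Tuple

# Side-chain charge at roughly neutral pH (textbook simplification;
# histidine is treated as neutral here).
NEGATIVE = frozenset("DE")
POSITIVE = frozenset("KR")


def side_chain_charge(residue: str) -> int:
    """Return -1, 0 or +1 for a one-letter amino acid code."""
    if residue in NEGATIVE:
        return -1
    if residue in POSITIVE:
        return 1
    return 0


def charge_change(label: str) -> Tuple[int, int]:
    """Charges (before, after) for a substitution label such as 'D614G'."""
    match = re.fullmatch(r"([A-Z])(\d+)([A-Z])", label)
    if match is None:
        raise ValueError(f"not a substitution label: {label!r}")
    return side_chain_charge(match.group(1)), side_chain_charge(match.group(3))


for label in ("A348T", "G476S", "V483A", "D614G"):
    before, after = charge_change(label)
    status = "charge changes" if before != after else "charge unchanged"
    print(label, before, after, status)
```

Of the four substitutions discussed in the text, only D614G alters the side-chain charge under this simplification.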
The data presented here are based on the currently available sequences. Further sequencing from other parts of North America and other countries would shed more light on the nature of this virus.