Emergence of RBD and D614G Mutations in Spike Protein: An Insight from Indian SARS-CoV-2 Genome Analysis

Preprints (www.preprints.org) | NOT PEER-REVIEWED | Posted: 4 June 2020 | doi:10.20944/preprints202006.0032.v1 | © 2020 by the author(s). Distributed under a Creative Commons CC BY license.

The world is currently in crisis due to COVID-19, caused by the novel SARS-CoV-2. Globally, over 5 million people have been infected by SARS-CoV-2, with a 6% fatality rate. The surface spike (S) protein plays a key role in the pathogenesis of SARS-CoV-2 by mediating viral entry through human angiotensin converting enzyme 2 (hACE2) receptors on the host cell, and there is a global race to find virus-neutralizing antibodies and a vaccine against the S protein of SARS-CoV-2. Since SARS-CoV-2 has evolved into 10 different clades in a very short span, a study of spike protein mutations is essential for effective global vaccine coverage. Based on mutation analysis of the S protein from 166 Indian SARS-CoV-2 genomes, a total of 40 different SNPs, comprising 14 synonymous and 26 non-synonymous mutations, were observed; notably, the Indian S protein diverged into two major clusters, D614 and G614, with 11 different types. The majority of Indian strains fall in the A2a and O clusters. Alarmingly, we observed six SNPs in the RBD, two of them in the RBM (S438F and S494P). The S494P SNP, similar to Bat-SARS-like CoV, may indicate a low ACE2 binding affinity. Interestingly, 38% of Indian strains harbor the characteristic D614G SNP, which was found predominantly in the A2a cluster, mostly comprising USA and European strains with high disease severity. The D614G SNP correlates well with disease severity in states with a high death rate, except Maharashtra. Notably, more than 50% of D614G mutations were observed in the northern part of India and 14% in the southern part, but none in Kerala and Tamil Nadu strains. The highly conserved motif containing D614 (608-VAVLYQDVNCT-618) upstream of the S1/S2 furin cleavage site, together with a few conserved regions downstream, may indicate a specific key role in efficient interaction with host proteases in pathogenesis. Further studies are warranted to clarify the association of the D614G SNP with disease severity. Interestingly, the C2367T (Y789Y) synonymous SNP was observed in 37% of Indian strains and, notably, similar SNPs with degenerate bases were observed, which is a key indication of possible misdiagnosis by Real-Time PCR; revised strategies are needed for precise diagnosis. Circulation of a high number of signature SNPs [D614G and C2367T (Y789Y)] in certain states may be an early indication of the emergence of community transmission in India. Further large-scale genome sequence data from India will aid a deeper understanding of the diversity of circulating SARS-CoV-2 and its impact on disease severity, the origin of imported cases to India, community spread, diagnosis, and vaccine coverage.

Similar to its predecessors, SARS-CoV-2 is believed to have originated from bats, mainly horseshoe bats (Bat-CoV-RaTG13, Rhinolophus affinis from Yunnan Province). Like civets for SARS-CoV and camels for MERS-CoV, pangolins are considered to be an intermediate host for SARS-CoV-2 (Zhang et al., 2020; Jia et al., 2020). Genome analysis of SARS-CoV-2 reveals an identity of 96% and 91.02% to Bat-CoV-RaTG13 and Pangolin-CoV, respectively. Interestingly, the spike (S) protein was found to be more similar to Pangolin-CoV (97.5%) than to Bat-CoV-RaTG13 (95.4%). Based on these data, studies have predicted that SARS-CoV-2 could be a chimera of different coronaviruses that underwent genetic recombination in a single host at a given time.
The genome of SARS-CoV-2 is around 30 kb and codes for 16 non-structural proteins (nsps), four structural proteins, and accessory proteins (Gorden et al., 2020; Chen et al., 2020b). The non-structural proteins mainly form the replicase complex involved in viral replication, notably RdRp (RNA-dependent RNA polymerase), which replicates the positive-sense RNA, and the papain-like protease (PLpro) and 3-chymotrypsin-like protease (3CLpro), which mediate proteolytic cleavage of the ~800 kDa polypeptide produced upon translation of the viral genome (Yin et al., 2020; Chen et al., 2020b; Tahir ul Qamar et al., 2020). Three structural proteins are embedded in the envelope of SARS-CoV-2: (i) the membrane protein (M) and (ii) the envelope protein (E), which are responsible for viral assembly and fusion with the host cell membrane, and (iii) the spike protein (S), which plays a key role in viral pathogenesis since it mediates entry of the virus into the host cell and is also responsible for provoking the host immune response (Li, 2016; Chen et al., 2020a; Yin et al., 2020). The nucleocapsid (N) protein is involved in packaging the genetic material into the viral capsid and has been shown to be a strong immunogen among the structural proteins of the virus (Kumar et al., 2020a).
To initiate infection, the virus uses its S protein to bind to the host receptor and subsequently directs its genome into the host cell. The S protein is a class I fusion protein, which contains a characteristic coiled-coil structure in the C-terminal region (Belouzard, 2009; Li, 2016; Wrapp et al., 2020). Three segments of the S protein are responsible for viral binding and fusion: (i) a large ectodomain with two subunits, S1 and S2, which are involved in receptor binding and membrane fusion, respectively, (ii) a single-pass transmembrane anchor, and (iii) a short intracellular tail. Together they resemble a clove-like shape, with a trimeric head of three S1 subunits and a trimeric S2 stalk. The S1 subunit is further divided into two major domains, (a) the N-terminal domain (S1-NTD) and (b) the C-terminal domain (S1-CTD) (Li, 2016; Yan et al., 2020; Wrapp et al., 2020). Both domains bind sugar and protein receptors with high affinity. The S1 subunit binds to the receptor through the receptor binding motif (RBM) of the receptor binding domain (RBD), which stabilizes the S2 subunit for fusion (Belouzard, 2009; Li, 2016; Rane, 2020; Zhang et al., 2020; Yin et al., 2020; Xu et al., 2020; Yan et al., 2020; Wrapp et al., 2020). When the virus enters the host, the S protein is cleaved between S1 and S2 by host proteases such as furin and TMPRSS2 (Hoffmann et al., 2020b). Subsequently, S1 binds through the RBD to the human angiotensin converting enzyme 2 (hACE2) receptor on the host cell surface, followed by fusion of the viral and host membranes by the S2 subunits, paving the way for the viral genome to enter the host cell (Li, 2016; Hoffmann et al., 2020a; Walls et al., 2020; Wrapp et al., 2020).
The S protein of SARS-CoV-2 has undergone unique changes when compared to SARS-CoV. Interestingly, the key residues in the RBM of SARS-CoV-2 are identical to those of Pangolin-CoV but differ from those of SARS-CoV and Bat-CoV-RaTG13 (Andersen et al., 2020). SARS-CoV-2 has unique additional residues at the S1/S2 furin recognition motif (PRRARSV), which is similar to MERS-CoV but different from SARS-CoV, Bat-CoV-RaTG13, and Pangolin-CoV (Millet and Whittaker, 2015; Zhang et al., 2020). In addition to the S1/S2 cleavage site, SARS-CoV-2 has a novel cleavage site ((L/S)KPTKRS) in the S2 domain (S2'), which also aids membrane fusion and virus infectivity (Belouzard et al., 2009; Hoffmann et al., 2020a). Since the S protein plays a key role in the pathogenesis of SARS-CoV-2, researchers have tested monoclonal antibodies against the S protein and have shown a promising neutralization effect on viral entry in vitro, which indicates that it is an ideal vaccine candidate to prevent COVID-19 (Thanh Okba et al., 2020; Chen et al., 2020a; Yuan et al., 2020). There is a global race to find virus-neutralizing antibodies and a vaccine against the S protein of SARS-CoV-2 (Thanh Lurie et al., 2020; Kumar et al., 2020b). Considering the evolution of 10 different clades of SARS-CoV-2 in a short span, understanding the variation in key residues of the S protein is essential for effective global vaccine coverage (Forster et al., 2020; Biswas and Majumder, 2020). In the present study, we analysed mutations of the S protein from Indian SARS-CoV-2 to understand its diversity, trace the origin of imported cases to India, and assess the effect on diagnosis and vaccine coverage.

SARS-CoV-2 sequence retrieval
A total of 172 Indian SARS-CoV-2 whole-genome RNA sequences were retrieved from the GISAID (https://www.gisaid.org/) hCov-19 Database as of 7 May 2020; these sequences represent strains from 19 different states of India (Shu and McCauley, 2017). Using Bio-Edit version 7.0.9 (Isis Pharmaceuticals), the sequences coding for the SARS-CoV-2 S protein were trimmed by mapping to the S region of the reference genome SARS-CoV-2 Wuhan-Hu-1, China (NC_045512.2), obtained from NCBI (https://www.ncbi.nlm.nih.gov/). Mutations in the S protein were analysed after excluding sequences containing more than five ambiguous bases (N), while sequences with degenerate bases were retained.
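The filtering rule described above can be sketched in a few lines. This is an illustrative sketch only, not the Bio-Edit workflow used in the study: the function names, the in-memory dictionary of sequences, and the treatment of the threshold as "at most five N" are assumptions made for the example.

```python
# Hypothetical helpers illustrating the ">5N" exclusion rule; not the authors' tooling.
DEGENERATE_BASES = set("RYSWKMBDHV")  # IUPAC ambiguity codes other than N


def count_n(seq: str) -> int:
    """Count fully ambiguous bases (N) in a nucleotide sequence."""
    return seq.upper().count("N")


def has_degenerate_base(seq: str) -> bool:
    """True if the sequence contains an IUPAC degenerate base other than N."""
    return any(base in DEGENERATE_BASES for base in seq.upper())


def filter_sequences(records: dict[str, str], max_n: int = 5) -> dict[str, str]:
    """Keep sequences with at most max_n N bases; degenerate bases do not exclude a record."""
    return {name: seq for name, seq in records.items() if count_n(seq) <= max_n}


# Toy records (made-up names and sequences)
records = {
    "strain_a": "ACGTNNNNNN",  # six N: excluded
    "strain_b": "ACGTRYNN",    # two N plus degenerate bases: retained
    "strain_c": "ACGTACGT",    # clean: retained
}
kept = filter_sequences(records)
```

Retaining records with degenerate bases matters later in the analysis, because those positions are discussed in connection with Real-Time PCR primer binding.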

Mutation analysis
To determine single nucleotide polymorphisms (SNPs), multiple sequence alignment of the Indian SARS-CoV-2 S gene was performed with CLUSTALW, along with reference S gene sequences representing different countries, using MEGA (Molecular Evolutionary Genetic Analysis Platform) version X software (Kumar et al., 2018). Mutations at both the nucleotide and amino acid level were documented, and their prevalence in global strains and possible origin were analysed using the global SARS-CoV-2 genomic subsampling (5040 genomes) in Nextstrain/ncov (https://nextstrain.org/) (Hadfield et al., 2018). The effects of these mutations on binding efficiency to ACE2 receptors, host protease cleavage, vaccine coverage, and diagnosis were then analysed.
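As a minimal sketch of how a nucleotide substitution in the S gene maps to a codon and is classified as synonymous or non-synonymous, the following example compares a query against a reference coding sequence. It assumes both sequences are already aligned, gap-free, in frame, and of equal length; the function name `call_snps` and the toy sequences are hypothetical, and this is not the MEGA procedure used in the study.

```python
from itertools import product

# Standard genetic code, built in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def codon_number(nt_position: int) -> int:
    """1-based codon index for a 1-based nucleotide position in the coding sequence."""
    return (nt_position - 1) // 3 + 1


def call_snps(ref: str, query: str) -> list[dict]:
    """List substitutions between two aligned, gap-free, in-frame coding sequences.

    Each SNP is evaluated on its own against the reference codon; positions where
    the query carries a degenerate base (anything other than A, C, G, T) are skipped.
    """
    snps = []
    for i, (r, q) in enumerate(zip(ref.upper(), query.upper())):
        if r == q or q not in "ACGT":
            continue
        start = (i // 3) * 3
        ref_codon = ref[start:start + 3].upper()
        offset = i - start
        alt_codon = ref_codon[:offset] + q + ref_codon[offset + 1:]
        ref_aa, alt_aa = CODON_TABLE[ref_codon], CODON_TABLE[alt_codon]
        snps.append({
            "nt": f"{r}{i + 1}{q}",
            "aa": f"{ref_aa}{codon_number(i + 1)}{alt_aa}",
            "synonymous": ref_aa == alt_aa,
        })
    return snps


# Toy 3-codon gene (Met-Asp-Tyr) with one non-synonymous and one synonymous change
snps = call_snps("ATGGATTAC", "ATGGGTTAT")
```

With this numbering, nucleotide 2367 of the S gene falls in codon `codon_number(2367) == 789`, which matches the C2367T (Y789Y) notation used in the text, and nucleotide 882 falls in codon 294 (D294D).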

Phylogenetic analysis of spike gene
To assess the implications for vaccine coverage, the diversity of the Indian S protein was analysed by multiple sequence alignment using CLUSTALW, and a phylogenetic tree was constructed using the maximum likelihood algorithm in MEGA version X software (Kumar et al., 2018). Robustness of the tree topology was tested by bootstrapping with 500 replicates.
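For readers unfamiliar with bootstrapping in phylogenetics, the resampling step can be sketched as follows: alignment columns are drawn with replacement to build each pseudo-replicate, and a tree would then be inferred from every replicate. The sketch below covers only the resampling; tree inference is omitted, the function names and seed are assumptions, and it does not reproduce MEGA's internal implementation.

```python
import random


def bootstrap_alignment(alignment: dict[str, str], rng: random.Random) -> dict[str, str]:
    """Resample alignment columns with replacement, keeping columns intact across taxa."""
    length = len(next(iter(alignment.values())))
    columns = [rng.randrange(length) for _ in range(length)]
    return {name: "".join(seq[c] for c in columns) for name, seq in alignment.items()}


def bootstrap_replicates(alignment: dict[str, str], n_replicates: int = 500,
                         seed: int = 0) -> list[dict[str, str]]:
    """Generate n_replicates pseudo-alignments (500 matches the replicate count in the text)."""
    rng = random.Random(seed)
    return [bootstrap_alignment(alignment, rng) for _ in range(n_replicates)]


# Toy alignment with made-up sequences
toy = {"ref": "ACGTAC", "strain_x": "ACGTGC"}
replicates = bootstrap_replicates(toy)
```

The support value for a branch is then the fraction of replicate trees in which that branch appears.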

Synonymous mutation in Indian SARS-CoV-2 spike protein
A total of 14 synonymous SNPs were identified in Indian strains, of which seven were found only in Indian strains. Remarkably, the C2367T (Y789Y) mutation was found in 37% (62) of Indian strains and, globally, in one USA strain of the subset analysed. Of these 62 strains with C2367T, nine also carry other mutations in the S protein, particularly K77M (3) and A243S (2). Interestingly, four synonymous mutations were also observed in the RBD (I410I, G431G, V433V, L518L), two of which were unique to Indian strains (Table 1). Most synonymous SNPs are prevalent in the A2a cluster, with a few others in the A1a, A2, B1, and B4 clusters.

Non-synonymous SNPs in Indian spike protein
A total of 26 non-synonymous SNPs were identified in Indian strains, of which 17 were novel mutations found only in Indian strains (Table 2). Most non-synonymous SNPs were observed in the RBD (10), NTD (10), SP (3), CT (3), and HR1 (2) and, interestingly, no mutations were observed at the S1/S2 or S2' host protease cleavage sites (Fig-2). Among the 26 SNPs, two (D614G and S494P) have unique features. Non-synonymous mutations were widely distributed among different clusters of SARS-CoV-2 but were mainly seen in A2a, followed by A1a and A3.

(i) Signature D614G mutation
Based on Nextstrain analysis, the D614G signature mutation was observed in 38% (64) of Indian strains. It is a characteristic SNP found mainly in the A2a cluster, which comprises mostly USA and European strains with limited Asian and Asia Pacific strains. Of these 63 strains with D614G, 20 also carry other mutations in the S protein, especially D294D (6), Q271R (2), S438F (2), and G1124V (2). Interestingly, the D614G mutation was not observed in early Indian genome sequences associated with travel from Wuhan or other Asian countries. Notably, more than 50% of occurrences of this mutation were observed in the northern part of India and 14% in the southern part, but none in Kerala and Tamil Nadu strains. Figure 3 shows how D614G is confined to the A2a cluster and its global spread, based on a subset of global genome data in Nextstrain/ncov. Interestingly, the region containing D614 (608-VAVLYQDVNCT-618) was found to be highly conserved in both human-associated (SARS-CoV-1 and SARS-CoV-2) and animal-associated (Bat-CoV and Pangolin-CoV) coronaviruses (Fig-4).

(ii) Novel SNPs in RBD of the spike protein from Indian SARS-CoV-2
Alarmingly, we observed six SNPs in the RBD (R408I, C432*, I434K, S438F, S494P, and E516*), of which two were in the RBM (S438F and S494P). Of the six SNPs, three (C432*, I434K, and S438F) were observed in a single strain, EPI_ISL_424362. The S438F and S494P changes in the RBM are shown in Figure 5 along with other CoV-related S proteins. The impact of the other RBD mutations needs further study in order to unravel the role of these sites in interacting with host ACE2.

Wuhan-Hu-1 genotype and other interesting SNPs in India
Interestingly, 19% (31) of the Indian strains were found to be similar to the Wuhan genotype; among them, 41% were from Tamil Nadu (13) and 19% from Ladakh (6). Based on S protein analysis of 72 Indian strains, we found that it diverged into two clusters, D614 and G614, with 11 different variants, as shown in Figure 6. In the global subset analysed, S943P was found to be a unique mutation (11 strains) in Belgium strains of the A2a clade, and the same site was mutated to S943T in one Indian strain (A6 clade) and one Belgium strain. The P9Q, K77M, and V622I mutation sites in Indian strains were also mutated, with different amino acid substitutions, in strains from the USA (P9L), Belgium (K77N), and Indonesia (V622F), respectively.

Discussion
In the current study, mutation analysis of 166 Indian SARS-CoV-2 genomes from the GISAID hCov-19 Database (Shu and McCauley, 2017) revealed that the S protein has 40 different SNPs, 14 synonymous and 26 non-synonymous. These help us understand the possible origin of imported SARS-CoV-2, the unique strains in circulation, and the associated impact on Real-Time PCR diagnosis and vaccine coverage for Indian strains. Among the synonymous SNPs, seven were confined to India and, interestingly, the C2367T (Y789Y) mutation was observed predominantly (37%) in Indian strains and in one USA strain (EPI ISL 436898) of the subset.
Although the C2367T SNP does not alter the amino acid (Y789Y), it serves as a signature SNP that helped us understand strain circulation within the community. Notably, the C2367T (Y789Y) SNP was found together with K77M in three strains, one each from Delhi, Bihar, and Tamil Nadu, and with A243S in two strains, one each from Delhi and Bihar; this probably indicates COVID-19 infection associated with travel (Eden et al., 2020). The receptor binding domain carried four synonymous SNPs (I410I, G431G, V433V, L518L), which may indicate that these sites are prone to further mutation that could alter the binding of the S protein to the ACE2 receptor. Seven synonymous mutations were otherwise observed only in some western countries, mainly in the A2a cluster; remarkably, six Indian strains had the C882T (D294D) SNP, which was mainly seen in Canada, the USA, and Turkey, probably indicating imported SARS-CoV-2 cases in India (Eden et al., 2020). From the present study, we could understand that the same synonymous mutation sites [...] primer binding in certain SARS-CoV-2 strains which were imported from specific countries (Brufsky, 2020).
Among the 26 non-synonymous SNPs, 17 were novel mutations found only in Indian strains. The characteristic D614G SNP has been shown to occur mainly in the A2a cluster, which mostly comprises USA and European strains, and is believed to have an impact on COVID-19 severity (Phan, 2020; Brufsky, 2020; Stefanelli et al., 2020; Bhattacharyya et al., 2020; Biswas and Majumder, 2020). Interestingly, the D614G signature mutation was observed in 38% (64) of Indian strains and was more frequent in states with higher prevalence and case fatality rate (CFR) of COVID-19. Korber et al. (2020) showed a high number of D614G mutations in ICU patients.
Similarly, the D614G SNP correlates well with disease severity in Delhi, Gujarat, Madhya Pradesh, and West Bengal, where the CFR is high. In contrast, this association was not observed in Maharashtra, which had a high death rate; this may be due to the limited number of genomes analysed from the state (Fig-7). The D614G mutation was not observed in early Indian genome sequences associated with travel from Wuhan or other Asian countries. More than 50% of occurrences of this mutation were observed in the northern part of India and 14% in the southern part, but none in Kerala and Tamil Nadu strains. The D614G SNP showed signature co-mutation associations with specific states, which may indicate COVID-19 infection associated with travel and also community circulation of unique strains (Stefanelli et al., 2020; Eden et al., 2020). This signature SNP (D614G) is located between the RBD and the S1/S2 polybasic protease cleavage site of the S protein, and its effect on ACE2 receptor binding or host-mediated furin cleavage needs to be analysed to understand its role in the enhanced pathogenicity seen in USA and European patients. The region containing D614 (608-VAVLYQDVNCT-618) upstream of S1/S2, and a few regions downstream, showed highly conserved motifs among coronaviruses from different species, which may indicate a specific key role in efficient interaction with host proteases in viral pathogenesis (Walls et al., 2020; Brufsky, 2020). The study by Bhattacharyya et al. (2020) showed that the D614G SNP in this motif resulted in additional host-mediated cleavage of S1 by cathepsin L and elastase. Belouzard et al. (2009) showed that cleavage of an additional proteolytic site by cathepsin L in SARS-CoV enhances membrane fusion. This suggests a possible mechanism for the association of the D614G SNP with COVID-19 disease severity. However, further studies are needed to clearly elucidate the impact of the D614G SNP on disease severity.
The present study traces the possible origin of imported Indian SARS-CoV-2 through unique SNPs confined to specific geographic locations (Tables 1 and 2). For example, the K77M SNP was found only in Indian strains, while the same site is mutated to K77N in a Belgium strain (A2a). Similarly, a unique S943P mutation was observed only in Belgium strains, and the same site was mutated to a different amino acid (S943T) in one Indian and one Belgium strain (Korber et al., 2020). These findings point to the travel-associated origin of SARS-CoV-2 in India (Eden et al., 2020).
Alarmingly, we observed six novel SNPs in the RBD (R408I, C432*, I434K, S438F, S494P, and E516*) of the S protein from Indian SARS-CoV-2, of which two were in the RBM (S438F and S494P); the majority of them belong to the O clade. The residues in the RBM are crucial for interaction with the hACE2 receptor. Based on the S protein alignment of SARS-CoV-2, SARS-CoV, Bat-CoV-RaTG13, Pangolin-CoV, and Bat-SARS-like CoV, S494 was found to be a conserved key amino acid in the RBM of SARS-CoV-2 for interaction with ACE2, and it is identical in Pangolin-CoV (Andersen et al., 2020). The novel RBM mutation in an Indian strain (S494P) was found to be similar to Bat-SARS-like CoV, and studies of variation at the equivalent position in SARS-CoV have shown reduced ACE2 binding affinity and antigenicity (Chakraborti et al., 2005; Watabe and Kishino, 2010; Rockx et al., 2010; Wu et al., 2012). Similarly, the R408I and N439K (adjacent to S438F) SNPs in the S protein have been shown to have weak interaction with receptors (Jia et al., 2020). The impact of the other RBD mutations needs further study in order to unravel the role of these sites in interacting with host ACE2.

Effect of synonymous SNPs on real-time PCR diagnosis
Although synonymous mutations do not affect the conformation or function of the S protein, they can adversely affect the diagnosis of SARS-CoV-2 by Real-Time PCR if a mutation falls within the 3' end of a primer binding site (Tahamtan and Ardebili, 2020; Xi et al., 2020). This may explain some reported cases that were Real-Time PCR negative but chest X-ray positive (Lei et al., 2020). Primer mismatches due to SNPs may lead to false negative results, especially in the confirmatory test for SARS-CoV-2, since the E gene target used for primary screening is highly conserved; suspected patients would then not be quarantined, which may result in unknowing spread of the disease in the community (Corman et al., 2020). For better detection by Real-Time PCR, more conserved regions should therefore be targeted in three or four genes, or in different regions within the same gene, which would reduce false negative results (Corman et al., 2020; Tahamtan and Ardebili, 2020; Wu et al., 2020c). In that case, even if a mutation occurs in one of the target regions or genes, the other two or three will still be positive, so COVID-19 patients are less likely to be misdiagnosed. Combining real-time RT-PCR with clinical features from chest X-ray or CT will also improve SARS-CoV-2 diagnosis (Xi et al., 2020; Tahamtan and Ardebili, 2020; Wang et al., 2020).
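The concern about mismatches at the 3' end of a primer can be illustrated with a simple check. The sketch below assumes a forward primer and a template site written in the same 5'-to-3' orientation and of equal length; the function name, the five-base window, and the sequences are hypothetical and do not correspond to any published SARS-CoV-2 assay.

```python
def three_prime_mismatches(primer: str, template_site: str, window: int = 5) -> list[int]:
    """Return mismatch positions counted from the primer's 3' end (1 = terminal base).

    Only the last `window` bases are inspected, since mismatches close to the
    3' end are the ones most likely to impair extension.
    """
    if len(primer) != len(template_site):
        raise ValueError("primer and template site must have the same length")
    primer, template_site = primer.upper(), template_site.upper()
    mismatches = []
    for offset in range(1, min(window, len(primer)) + 1):
        if primer[-offset] != template_site[-offset]:
            mismatches.append(offset)
    return mismatches


# Made-up primer and two template variants
primer = "ACGTACGTAC"
wild_type_site = "ACGTACGTAC"
variant_site = "ACGTACGTAT"  # synonymous-style change at the 3'-terminal base

wild_type_hits = three_prime_mismatches(primer, wild_type_site)
variant_hits = three_prime_mismatches(primer, variant_site)
```

A non-empty result for a circulating variant would flag that primer for redesign or for backup by an additional target region, in line with the multi-target strategy discussed above.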

Impact on global vaccine coverage and convalescent plasma therapy
Although different strategies have been tried in the global race to discover a SARS-CoV-2 vaccine, the majority target the S protein (Korber et al., 2020). Non-synonymous SNPs in the S protein may therefore adversely affect global vaccine coverage for SARS-CoV-2 (Wrapp et al., 2020). SNP studies from different geographical regions are thus crucial for developing an effective pan-vaccine for COVID-19 (Watabe and Kishino, 2010; Jia et al., 2020; Yin, 2020).
More studies on S protein SNPs will also shed light on whether a multivalent vaccine, like pneumococcal vaccines (Daniels et al., 2016), or a seasonal vaccine, like influenza vaccines (Fiore et al., 2009), is required. Some studies have shown that convalescent plasma therapy is ineffective in COVID-19 (Zeng et al., 2020; Brown and McCullough, 2020), and this may be due to S protein variants that result in poor viral neutralization (Marano et al., 2016; Rockx et al., 2010; Wrapp et al., 2020). Considering this, a better strategy would be to match strains between donor and recipient by S protein sequencing before plasma therapy, similar to HLA matching in organ transplantation (Sheldon and Poulton, 2006).

Early indication of community spread
In the present study, we observed that two mutations, C2367T (Y789Y, 37%) and D614G (38%), mainly from the A2a cluster, were predominant in India compared to other mutations. D614G was found together with the D294D synonymous SNP in six strains, five of them from Delhi and one from Gujarat (Gandhinagar); similarly, two SNPs, Q271R from Gujarat and G1124V from West Bengal (Kolkata), uniquely co-occur with D614G.
More than 50% of occurrences of the D614G SNP were found in the northern part of India, mainly in Delhi, Gujarat, Madhya Pradesh, and West Bengal, which indicates circulation of distinctive strains in the local community. The unique C2367T (Y789Y) mutation was confined to India and was mainly observed in states with a high prevalence of COVID-19, such as Delhi, Tamil Nadu, Karnataka, Bihar, and Maharashtra, which may be an early indication of the emergence of community transmission in India (Stefanelli et al., 2020; Eden et al., 2020).
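The state-wise co-occurrence pattern described here amounts to a simple tally over per-strain mutation lists. The following sketch shows one way such a tally could be computed; the record layout, the function name, and the toy strains are assumptions for illustration and are not the study's data.

```python
from collections import Counter


def co_occurrence(strains: list[dict], anchor: str) -> Counter:
    """Count (mutation, state) pairs among strains that carry the anchor mutation."""
    counts = Counter()
    for strain in strains:
        if anchor not in strain["mutations"]:
            continue
        for mutation in strain["mutations"]:
            if mutation != anchor:
                counts[(mutation, strain["state"])] += 1
    return counts


# Toy records (made up; not the strains analysed in the study)
toy_strains = [
    {"state": "Delhi", "mutations": {"D614G", "D294D"}},
    {"state": "Delhi", "mutations": {"D614G", "D294D"}},
    {"state": "Gujarat", "mutations": {"D614G", "Q271R"}},
    {"state": "Tamil Nadu", "mutations": {"Y789Y"}},
]
pairs = co_occurrence(toy_strains, "D614G")
```

Pairs that recur within one state and are rare elsewhere are the kind of signal the text interprets as local circulation of a distinctive strain.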

Conclusion
The present study shows that Indian strains carry mutations in the RBD and RBM, mainly linked to the O clade, as well as the A2a clade-associated D614G SNP, similar to strains from the USA and Europe. The impact of these mutations on the pathogenesis of SARS-CoV-2 needs to be elucidated by further studies, using either site-directed mutagenesis or viral strains harboring these mutations. Certain characteristic SNPs in the S protein point to the possible origin of Indian strains.
Circulation of these signature SNPs in certain states may be an early indication of the emergence of community spread in India. More SARS-CoV-2 genome sequencing from India and other parts of the world would help unravel mutations in different regions of the genome and their impact on ACE2 receptor binding, S1/S2 cleavage by host proteases, disease severity associated with specific strains, drug resistance, Real-Time PCR diagnosis, and global vaccine coverage.
Understanding these factors would greatly help us to effectively combat the spread of this new, highly transmissible virus (SARS-CoV-2) in the human community.