Reporting two SARS-CoV-2 strains based on a unique trinucleotide-bloc mutation and their potential pathogenic difference

SARS-CoV-2, the novel coronavirus behind the COVID-19 pandemic, is acquiring new mutations in its genome. Although some mutations benefit the virus against the human immune response, a number of them may reduce its pathogenicity and virulence. By analyzing more than 3000 high-coverage, complete genome sequences deposited in the GISAID database, here I report a unique 28881-28883:GGG>AAC trinucleotide-bloc mutation in the SARS-CoV-2 genome that results in two sub-strains, described here as SARS-CoV-2g (28881-28883:GGG genotype) and SARS-CoV-2a (28881-28883:AAC genotype). Computational analysis and literature review suggest that this bloc mutation would bring about the 203-204:RG(arginine-glycine)>KR(lysine-arginine) amino acid changes in the nucleocapsid (N) protein, affecting the SR (serine-arginine)-rich motif of the protein, a critical region for the transcription of viral RNA and replication of the virus. Thus, the 28881-28883:GGG>AAC bloc mutation is expected to modulate the pathogenicity of SARS-CoV-2. Remarkably, analysis of existing data links the SARS-CoV-2g and SARS-CoV-2a strains with the heterogeneity of COVID-19 cases across different regions within and between countries. Sequence analysis suggests that severely affected places, such as Milan, Lombardy, New York and Paris, have a predominant presence of SARS-CoV-2g strains, whereas less affected places like Abruzzo, Lyon and Valencia have a relatively higher presence of SARS-CoV-2a, an indication that the latter strain may contribute to the reduced number of COVID-19 cases. A similar relationship is observed when the Netherlands and Portugal are compared with Spain, France and Germany. These analyses suggest that SARS-CoV-2 has already evolved into a less infective SARS-CoV-2a, affecting COVID-19 cases in different regions. The time a country or region needs to acquire SARS-CoV-2a strains may be indicative of the time it would need to pass the peak of COVID-19 cases. 
To confirm these assumptions, prompt retrospective and prospective epidemiological studies should be conducted in different countries to understand the course of pathogenicity of SARS-CoV-2a and SARS-CoV-2g. Potential drugs can be designed to target the region of the N protein encoded at 28881-28883 to modulate virus pathogenicity.


Non-technical summary
Through an extensive analysis of SARS-CoV-2 whole-genome sequences, here I report two strains of the virus, designated SARS-CoV-2a and SARS-CoV-2g, which can be differentiated by a unique change of three nucleotides (the building blocks of the virus genome) in SARS-CoV-2. I have characterized these strains through literature review and computational analysis. This bloc mutation is located in the 28881-28883 region on the reference genome map of SARS-CoV-2.
Remarkably, SARS-CoV-2a seems to be prevalent in areas/countries with relatively few COVID-19 cases (such as Portugal, the Netherlands and Belgium), whereas in highly affected countries/areas (USA, Spain, France and Germany) SARS-CoV-2g predominates. Within a country such as Italy, Abruzzo has very few COVID-19 cases and a high presence of SARS-CoV-2a.
This is a crucial observation and can be further explored through retrospective and prospective pan-national genetic and epidemiological studies. Monitoring the dynamics of these two strains might be invaluable for managing the COVID-19 pandemic, and this can be achieved by sequencing only a small region of the virus genome encompassing the 28881-28883 nucleotide bloc. Of the two strains, SARS-CoV-2g has GGG at those positions, the same as the reference sequence, and so should be considered the 'wild type', whereas in SARS-CoV-2a the sequence has mutated to AAC. This is a unique event in which three nucleotides change as a bloc in SARS-CoV-2. Most importantly, this bloc mutation affects the nucleocapsid (N) protein of the virus. The N protein is crucial for virus replication. Literature review suggests that the GGG>AAC mutation would negatively affect the N protein and thus reduce the infectivity of the virus, which can explain why COVID-19 cases are lower in areas where SARS-CoV-2a is predominant. Notably, whole-genome sequencing of SARS-CoV-2 and deposition to public databases has been progressing at an unprecedented pace during this outbreak. Up until April 10, 2020, more than 3500 high-coverage, complete genome sequences of SARS-CoV-2 had been submitted to GISAID (Global Initiative on Sharing All Influenza Data) maintained by MPII (Max Planck Institute for Informatics).
After a careful analysis of the whole-genome sequences in the GISAID database, this study has established that a unique trinucleotide-bloc mutation, 28881-28883:GGG>AAC, might have occurred recently, giving rise to a new subtype of SARS-CoV-2 with potential impacts on the course of the COVID-19 pandemic. This bloc mutation maps within the nucleocapsid (N) gene according to the SARS-CoV-2 reference genome. The N protein plays a critical role in assembling the coronavirus RNA genome and creating a shell around the enclosed nucleic acid. It also interacts with the viral membrane protein during viral assembly and assists in RNA synthesis, folding and virus budding. The protein also affects host cell responses to the viral infection, including cell cycle regulation and modulation of immune responses 2 .
The 28881-28883:GGG>AAC mutation affects the SR (serine-arginine)-rich domain of the N protein. Previously, in SARS-CoV-1, the closest neighbor of SARS-CoV-2, it was shown that an experimentally introduced deletion in the SSRSSSRSRGNSR region of the SR-rich motif significantly reduces the number of infectious virions 3 . The 28881-28883:GGG>AAC mutation affects the location adjacent to the aforementioned region, and so is expected to affect the pathogenicity of SARS-CoV-2 in a similar manner. This assumption is supported by an analysis that combines sequence information from the GISAID database with COVID-19 case counts for different regions around the globe taken from live trackers. This exercise indicates that regions with low or moderate numbers of COVID-19 cases have a prevalence of the 28881-28883:AAC genotype (SARS-CoV-2a), whereas the highly affected regions predominantly have the 28881-28883:GGG genotype (SARS-CoV-2g).
The history of previous infections suggests that viruses evolve different levels of pathogenicity through acquired mutations 4 5 . Although hundreds of mutations have been reported in the SARS-CoV-2 genome to date, the trinucleotide bloc mutation reported and characterized in this study has unique features with a potential impact on the pathogenicity of the virus.
The results suggest that by monitoring the prevalence of the SARS-CoV-2a and SARS-CoV-2g strains, countries may track the course of the COVID-19 pandemic. Potential drugs can be designed to target the SR-rich motif of the N protein to curb the pathogenicity of SARS-CoV-2. However, some assumptions need to be confirmed with more retrospective and prospective research. Special attention should be given to tracing the COVID-19 patients from whose samples the SARS-CoV-2 sequences were obtained and following up on their clinical outcomes.

28881-28883:GGG>AAC change is a unique event resulting in two sub-strains of the SARS-CoV-2 described here as SARS-CoV-2g and SARS-CoV-2a:
In all 3000 complete genome sequences of SARS-CoV-2 analyzed in this study, there was not a single instance in which a trinucleotide bloc had changed except GGG>AAC at the 28881-28883 location of the genome. The other changes are mostly single nucleotide polymorphisms (SNPs). This observation suggests that the three substitutions making up the GGG>AAC change occurred simultaneously or within a short span of time. Such a change would be expected to have significant impacts on the virus life cycle and pathogenicity, as discussed later.
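The genotype definition above can be illustrated with a minimal sketch. It assumes each sequence has already been aligned to the reference genome so that reference positions 28881-28883 can be read directly; the function names and the toy genomes are hypothetical and are not part of the original analysis.

```python
# Assumption: every input sequence is aligned to the reference genome, so the
# 1-based reference positions 28881-28883 correspond to string indices
# 28880..28882. Real GISAID sequences would need to be aligned first.
BLOC_START = 28881  # 1-based, inclusive
BLOC_END = 28883    # 1-based, inclusive


def classify_strain(aligned_seq: str) -> str:
    """Label a reference-aligned genome by its 28881-28883 genotype."""
    if len(aligned_seq) < BLOC_END:
        return "unknown"  # the sequence does not cover the bloc
    bloc = aligned_seq[BLOC_START - 1:BLOC_END].upper()
    if bloc == "GGG":
        return "SARS-CoV-2g"
    if bloc == "AAC":
        return "SARS-CoV-2a"
    return "other"  # partial change, ambiguous base, or gap


def make_toy_genome(bloc: str) -> str:
    """Hypothetical helper: a placeholder genome carrying the given bloc."""
    return "N" * (BLOC_START - 1) + bloc + "N" * 100


if __name__ == "__main__":
    for bloc in ("GGG", "AAC", "GAC"):
        print(bloc, classify_strain(make_toy_genome(bloc)))
```

Keeping an explicit "other" label makes it possible to check the claim that the three positions change only as a bloc: any sequence with a partial change would show up in that category.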

SARS-CoV-2a is a relatively new strain and has a distinct mutation profile compared to SARS-CoV-2g:
The 28881-28883:AAC genotype and the resulting SARS-CoV-2a strain are found in samples collected relatively recently, mostly from March onward. All the sequences from Wuhan, the first epicenter of COVID-19, have the 28881-28883:GGG genotype, and so does the reference genome of SARS-CoV-2. Although one SARS-CoV-2a affected person was reported in Italy on January 7 , an analysis of the sequences deposited from Japan gives a good snapshot of its recent origin.  Table-ST1). The positions that have changed in fewer than 10% of cases are generally country-specific, except 26144:G>T, which has been found in sequences from various countries. This pattern of mutational exclusiveness requires more elaborate analysis to trace the evolution of the SARS-CoV-2 strains, as it holds important clues to their pathogenicity.

Impacts of 28881-28883:GGG>AAC mutation on the pathogenicity of the SARS-CoV-2
According to the NCBI reference genome, the 28881-28883:GGG>AAC bloc mutation results in two amino acid changes, 203-204:RG>KR, in the nucleocapsid (N) protein of SARS-CoV-2. Looking at the sequence surrounding these amino acids (Figure-4), it appears that the mutation will interrupt a serine-arginine (S-R) dipeptide by introducing a lysine between the two residues.
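The mapping from the nucleotide bloc to the RG>KR change can be checked with a short sketch using the standard genetic code. Two things are assumptions not stated in the text: that the N gene starts at reference position 28274, and that the reference bases at positions 28880-28885 are AGGGGA. Both should be verified against the reference record before reuse.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

N_GENE_START = 28274   # assumption: 1-based start of the N gene
WINDOW_START = 28880   # assumption: first base of codon 203
REF_WINDOW = "AGGGGA"  # assumption: reference bases 28880-28885


def codon_number(position: int, gene_start: int = N_GENE_START) -> int:
    """1-based codon index of a 1-based genome position within a gene."""
    return (position - gene_start) // 3 + 1


def apply_substitution(window: str, window_start: int,
                       position: int, new_bases: str) -> str:
    """Replace bases in the window starting at a 1-based genome position."""
    offset = position - window_start
    return window[:offset] + new_bases + window[offset + len(new_bases):]


def translate(seq: str) -> str:
    """Translate an in-frame nucleotide string into one-letter amino acids."""
    return "".join(CODON_TABLE[seq[i:i + 3]] for i in range(0, len(seq), 3))


if __name__ == "__main__":
    mutant = apply_substitution(REF_WINDOW, WINDOW_START, 28881, "AAC")
    print("codons:", codon_number(28881), codon_number(28883))
    print("reference:", REF_WINDOW, translate(REF_WINDOW))
    print("mutant:   ", mutant, translate(mutant))
```

Under these assumptions the bloc straddles two codons, which is why a three-nucleotide change alters two adjacent amino acids rather than one.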

Distributions of SARS-CoV-2a and SARS-CoV-2g within and among countries and their potential impacts on COVID-19 cases:
This study started from the observation that although Italy has the third-largest number of reported COVID-19 cases in the world (as of April 11), the Abruzzo region of Italy has far fewer cases than Lombardy. As of April 9, 2020, some 54802 COVID-19 cases had been reported in Lombardy, whereas in Abruzzo the number was 1931. Looking at the region-specific sequences in the GISAID database, it was found that out of 30 sequences deposited from Italy by April 11, 2020, a total of 13 sequences carried 28881-28883:AAC (SARS-CoV-2a) and, most strikingly, 10 of them were from Abruzzo (Figure-6). This is ~77% of the total number of sequences (N=13) that came from that region. When other regions with high and low COVID-19 case counts were analyzed within and among countries and compared with their SARS-CoV-2 sequence entries in the GISAID database, an inverse relationship was observed between the reported number of COVID-19 cases and the relative abundance of the SARS-CoV-2a strain. As of April 9, 2020, Belgium, the Netherlands and Portugal have 31%, 50% and 60% SARS-CoV-2a, respectively, whereas Spain has only ~4% (N=83) and France ~3% (N=150). Deposited sequences from Germany, France, Belgium and the Netherlands came from different parts of those countries. Sequences from Portugal were deposited by a small number of laboratories located in different places.
In the case of the UK, 26% of strains showed the SARS-CoV-2a genotype, but location information could not be confirmed for them. When European countries with more than 50 submitted sequences (as of April 9, 2020) were analyzed and compared with their reported COVID-19 cases, it appeared that the countries with a relatively higher prevalence of SARS-CoV-2a have fewer cases of COVID-19 (Figure-7). This assumption should be taken with caution, as many factors must be responsible for the differences in COVID-19 cases between countries, including their testing and reporting policies. However, the persistent observation of SARS-CoV-2a prevalence in countries and regions with few COVID-19 cases warrants immediate molecular and epidemiological research around the world to check the impacts of the two SARS-CoV-2 strains on the course of the COVID-19 pandemic. It is worth reporting here that SARS-CoV-2a is also present in other European nations with few COVID-19 cases, such as Finland, Austria, Denmark, Iceland and Estonia. In South America, Brazil showed a sizable presence of SARS-CoV-2a. Sequences from Argentina and Chile indicate the presence of SARS-CoV-2a in those countries. More data will be needed from that region to establish whether the presence of SARS-CoV-2a strains is linked with the relatively low number of reported COVID-19 cases in South America or is just an artifact.
The most severely affected region in North America, New York, has predominantly SARS-CoV-2g. Only recently have some SARS-CoV-2a sequences been observed in samples isolated in New York. As of April 9, only 5% of sequences are SARS-CoV-2a and 95% are SARS-CoV-2g (N=145). If the assumptions made above are true, then the more infectious SARS-CoV-2g might be behind the high number of COVID-19 cases in New York, and until SARS-CoV-2a takes the upper hand, the trend would continue. However, the 241:C>T, 3037:C>T and 14408:C>T changes were present in 89% of the 145 sequences deposited by April 9, 2020. As these mutations act as precursors for 28881-28883:GGG>AAC, it is expected that the GGG>AAC change will increase in the future, giving rise to more SARS-CoV-2a and lowering the number of COVID-19 cases.
Both Australia and Canada also have SARS-CoV-2a strains in some regions (Figure-S3). In Asia, Japan shows 11 of the 95 submitted sequences as SARS-CoV-2a. Vietnam, India, Thailand and Singapore have SARS-CoV-2a strains according to the submitted sequences. However, sequences from South Korea, Malaysia and Nepal did not show the strain as of April 7, 2020. More sequences from different regions of these countries might be necessary to get a complete picture. China alone has deposited 250 whole-genome sequences as of April 9. However, most of these sequences are from samples collected in February. Samples collected at a later date should be screened for the AAC genotype, as discussed before.
A big caveat in the proposed link between SARS-CoV-2a and lower numbers of COVID-19 cases in some countries is the difference in their testing rates. Different countries test at different rates. However, the number of COVID-19 cases is not always related to the testing rate. Germany has the highest rate of testing (16/1000 people), followed by Austria (13.3/1000 people) 11 . However, Austria has only 13560 COVID-19 cases as of April 11, 2020 12 . Austria has deposited only 18 sequences in the GISAID database (as of April 11, 2020) and 3 of them are SARS-CoV-2a. More sequencing of the virus genome (or at least of the N gene) can help clarify the picture. A properly designed pan-national study will help establish the actual scenario after accounting for confounding factors such as healthcare provision, gender and age distributions, economic conditions, environment, nutrition and the control measures of each country.
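Because several of the strain proportions above rest on small numbers of deposited sequences, it can help to attach an uncertainty range to each one. The sketch below uses the Wilson score interval, a standard method that is not part of the original analysis, with three counts quoted in this section. It reads the Abruzzo figure as 10 SARS-CoV-2a sequences out of 13, which is an interpretation of the text; the names used are hypothetical.

```python
import math
from typing import Tuple


def wilson_interval(k: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a proportion k/n (z=1.96 gives about 95%)."""
    if n <= 0 or not 0 <= k <= n:
        raise ValueError("need n > 0 and 0 <= k <= n")
    p = k / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half), min(1.0, center + half)


# (SARS-CoV-2a sequences, total sequences) as quoted in the text.
# Assumption: the Abruzzo entry treats N=13 as that region's total.
COUNTS = {
    "Abruzzo": (10, 13),
    "Japan": (11, 95),
    "Austria": (3, 18),
}

if __name__ == "__main__":
    for place, (k, n) in COUNTS.items():
        low, high = wilson_interval(k, n)
        print(f"{place}: {k}/{n} = {k / n:.0%} "
              f"(approx. 95% interval {low:.0%} to {high:.0%})")
```

Wide intervals for places with few sequences are a reminder that the comparison also depends on how many genomes were deposited and how they were sampled, in addition to the testing-rate caveat discussed above.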

Discussion:
Hundreds of mutations have been reported in SARS-CoV-2 so far, and the tally is increasing as more sequences are deposited in public databases. It is often a challenge to make practical use of these sequence and mutation data. This study reports for the first time the rise and probable impacts of two strains, SARS-CoV-2a and SARS-CoV-2g, derived from the original SARS-CoV-2 strain, after analyzing available sequence and COVID-19 case data. The mutually exclusive nature of these two strains may serve as an anchor for following them both retrospectively and prospectively.
The uniqueness of the trinucleotide mutation (28881-28883:GGG>AAC) makes it a strong candidate marker for following the trend of the COVID-19 pandemic across regions. The molecular analysis presented in this paper gives grounds to assume that SARS-CoV-2a is linked with fewer cases of infection because of the mutated SR motif, which is important for viral replication. However, this needs to be confirmed by i) further laboratory experiments on the particular location in the SR motif and ii) epidemiological research matching the sequence data from different countries with their COVID-19 patients. Factors that may contribute to the GGG>AAC conversion should also be investigated. Demography, nutritional status, geographical location and environmental factors may play roles in this conversion, as empirically SARS-CoV-2g (GGG) strains seem to be predominant in megacities.
This study could explain the COVID-19 case counts in different countries for which reliable data were obtainable. However, an explanation for the difference in fatality remains elusive. In a comparison between Lombardy and Abruzzo, lethality appears to be lower in SARS-CoV-2a infected areas. This remains true when different regions of the Netherlands are compared. However, when a country-wise comparison is made, the picture is not clear-cut. Notably, Germany (with a low prevalence of SARS-CoV-2a strains and more COVID-19 cases) has much lower fatality than the Netherlands or Brazil, both of which have a higher presence of SARS-CoV-2a. An obvious explanation is the difference in healthcare provision, age distributions and other local and policy differences between countries.
Nevertheless, based on the information on the two strains of SARS-CoV-2, fatality can be discussed from a molecular perspective too. Among the mutational differences between the two strains discussed above, it is particularly important to note that the ORF3a gene remains unmutated in the SARS-CoV-2a strain, in contrast to SARS-CoV-2g, where in many cases either the 25563:G>T or the 26144:G>T mutation is present in a mutually exclusive manner. It is already known that ORF3a plays a critical role in inducing an overreaction of inflammatory cytokines, which often leads to the 'cytokine storm' 13 , one of the most important causes of fatality from COVID-19.
The complete absence of the 25563:G>T and 26144:G>T mutations in SARS-CoV-2a indicates that this strain will express an active ORF3a protein, whereas more than 40% of SARS-CoV-2g strains might be mutated in this gene (~33% 25563:G>T and ~9% 26144:G>T) (Figure-3).
This implies that SARS-CoV-2a, although less infective because of the mutated N protein, might be more lethal than those SARS-CoV-2g strains that carry ORF3a mutations. This explanation is supported by the sequence data from Germany, where 45% (N=52) of strains carry 25563:G>T and 6% (N=52) carry 26144:G>T. This extrapolation should be considered with caution, as there might be other attenuating mutations and confounding factors.
However, if 28881-28883:GGG>AAC is a decisive change that makes SARS-CoV-2a less pathogenic than SARS-CoV-2g, then positions 203-204 of the N protein (the site of the RG>KR change) should be targeted in drug design to affect the replication of the virus and thus reduce the pathogenicity of SARS-CoV-2 infection. Mathematical models that predict the course of the COVID-19 pandemic should consider the impact of the 28881-28883:GGG>AAC mutation in the SARS-CoV-2 genome to better understand the course of the infection and guide nations' preparedness. For nations without elaborate facilities for whole-genome sequencing, RT-PCR based testing targeting the 28881-28883 region should be recommended. This will give diagnostic information on COVID-19 together with information on which of the two sub-strains, SARS-CoV-2a or SARS-CoV-2g, is present in an infected person. It will also allow valuable information to be gathered about the prevalence of these two strains in those countries.
This work further recommends more active efforts to examine the genomes of SARS-CoV-2, with closer pan-national collaboration, to understand the transitions and distributions of the SARS-CoV-2a and SARS-CoV-2g strains for better understanding and management of COVID-19. Experimental and epidemiological research together with genome information will be key to making use of the analysis and assumptions presented in this paper.