SARS-CoV-2 and MERS-CoV Share the Furin Site CGG-CGG Genetic Footprint

The SARS-CoV-2 polybasic furin cleavage site is still a missing link. Remarkably, the two arginine residues of this protease recognition site are encoded by the CGG codon, which is rare in Betacoronavirus. Arginine pairs are common at viral furin cleavage sites, but they are not CGG-CGG encoded. The question is: is this genetic footprint unique to SARS-CoV-2? To address the issue, I used Perl scripts to dissect the NCBI Virus database in detail and report the arginine dimers of the Betacoronavirus proteins. The main result is that a group of Middle East respiratory syndrome-related coronavirus (MERS-CoV) isolates (camel/Nigeria/NVx/2016, host: Camelus dromedarius) also has the CGG-CGG arginine pair in the spike protein polybasic furin cleavage region. In addition, CGG-CGG encoded arginine pairs were found in the orf1ab polyprotein from HKU9 and HKU14 Betacoronavirus, as well as in the nucleocapsid phosphoprotein from a few SARS-CoV-2 isolates. To quantify the probability of finding the arginine CGG-CGG codon pair in Betacoronavirus, the likelihood ratio (LR) and a Markov model were defined. In conclusion, this genetic marker is highly unlikely to be found in wildlife betacoronaviruses, but it is there. Collectively, the results shed light on recombination as the origin of the CGG-CGG arginine pair in the S1/S2 cleavage site of the virus.

Background
First of all, the structure and availability of the NCBI Virus database information (1), which makes this work possible, must be appreciated. Arginine is a polar, non-hydrophobic amino acid with a positively charged group at physiological pH. Arginine participates in the binding of negatively charged substrates and/or protein active sites (2). Consistently, arginine is involved in viral polybasic proteolytic cleavage sites, even as a dimer, as a recognition motif of the ubiquitously expressed furin serine protease (3,4).
A notable characteristic of SARS-CoV-2, one that distinguishes it from the rest of Sarbecovirus, is the acquisition of a polybasic furin cleavage site (PRRAR) at the S1-S2 boundary of the S glycoprotein (5). This site is a major mediator of the fusion of human cell and viral membranes and of the rapid human-to-human transmission of the virus (5-7). The acquisition was achieved through the insertion of four amino acids (PRRA). However, the furin protease recognition pattern is common in viral proteins, such as the hemagglutinin (H5) protein of avian influenza viruses (3) or the spike glycoprotein of three of the seven coronaviruses known to infect humans (8): HCoV-HKU1 (RRKRR-760, coordinate based on GenBank: YP_173238.1), HCoV-OC43 (RRSRR-763, GenBank: AOL02453.1) and MERS-CoV (RSVRSV-753, GenBank: YP_009047204.1).
Another notable characteristic of SARS-CoV-2 is the CGG-CGG coding sequence of the arginine dimer in that polybasic furin cleavage site. In the genetic code, arginine is encoded by six codons: AGA, AGG, CGA, CGC, CGG and CGT. CGG is a minority arginine codon in SARS-CoV-2 (9). Consistently, CGG-CGG encoded arginine dimers have not been found at viral polybasic furin cleavage sites (10). In this sense, SARS-CoV-2 has the most extreme CpG deficiency of all known Betacoronavirus genomes, probably to avoid the human antiviral defence mediated by the zinc finger antiviral protein (ZAP) (11). On the other hand, the other thirteen arginine dimers of the SARS-CoV-2 proteome, which are strictly conserved in the closest Sarbecovirus strains, are not CGG-CGG encoded either (12).
Is the CGG-CGG encoded arginine dimer unique to the SARS-CoV-2 polybasic furin cleavage site?
Based on the NCBI Virus database as a source of information, through a bioinformatics approach and using Perl scripts, all current Betacoronavirus arginine dimers and their coding regions are reported here. Full updated results are available in a Google Drive folder (see the Web address below). Interestingly, arginine dimers were widely distributed in Betacoronavirus proteins: about 30% of them contained one or more arginine pairs. These proteins were mostly members of the non-structural orf1ab polyprotein complex, together with the structural S glycoprotein and the nucleocapsid phosphoprotein. As regards arginine codon usage within the Betacoronavirus arginine dimers, AGA was the majority codon (about 50%), followed by CGT (about 24%). CGG was a minority codon (about 5%). Table 1 summarizes the Betacoronavirus arginine dimers that were encoded by CGG-CGG.
The most remarkable discovery was the CGG-CGG arginine pair close to the furin recognition site of the spike glycoprotein from a group of MERS-CoVs (Table 1). Based on the MERS-CoV spike glycoprotein structure (13), the S2 chain spans from arginine R-748 (coordinate based on UniProtKB: A0A023SFE5) to the C-terminal histidine H-1353. In human infection, the MERS-CoV S glycoprotein is cleaved at R-748, generating the S1 and S2 subunits (8). However, it is worth noting that the CGG-CGG encoded arginine pair reported here (RR-700, coordinate based on GenBank: AVN89376.1) is located 47 residues upstream of the S1/S2 cleavage site (R-748) and, together with a lysine residue, creates a true polybasic motif (KRR-700). Figure 1 shows sequence details. The entire Betacoronavirus protein sample contained 684 MERS-CoV spike glycoprotein sequences, of which 8 (1.17%) had the CGG-CGG encoded RR-700 dimer; in the rest, the dimer was CGC-CGA encoded.
In addition, the Betacoronavirus species MERS-CoV, Rousettus and Eidolon helvum bat coronavirus HKU9, and Rabbit coronavirus HKU14 also had CGG-CGG encoded arginine dimers in their orf1ab polyprotein (Table 1). Within the SARS-CoV-2 species (apart from the S glycoprotein), only two SARS-CoV-2 isolates from North America showed a CGG-CGG arginine dimer in the orf1ab polyprotein, and a few SARS-CoV-2 isolates, also from North America, showed the first (out of four) nucleocapsid phosphoprotein arginine dimers encoded by CGG-CGG (Table 1).
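The Perl scripts used for the scan are not reproduced here. Purely as an illustration, the following Python sketch shows one way the arginine dimers of a single in-frame coding sequence could be located and their codon usage tallied. The function names, the toy fragment (which simply encodes PRRAR) and the convention of labelling a dimer by the position of its second arginine are assumptions made for this example, not the method actually applied to the database.

```python
from collections import Counter

# The six arginine codons of the standard genetic code.
ARG_CODONS = {"AGA", "AGG", "CGA", "CGC", "CGG", "CGT"}


def arginine_dimers(cds):
    """Return (position, codon1, codon2) for every in-frame arginine pair.

    `cds` is assumed to start in frame. `position` is the 1-based residue
    number of the second arginine (an assumed labelling convention).
    Overlapping pairs in a run such as RRR are each reported.
    """
    cds = cds.upper().replace("U", "T")
    usable = len(cds) - len(cds) % 3  # ignore a trailing partial codon
    codons = [cds[i:i + 3] for i in range(0, usable, 3)]
    hits = []
    for i in range(len(codons) - 1):
        if codons[i] in ARG_CODONS and codons[i + 1] in ARG_CODONS:
            hits.append((i + 2, codons[i], codons[i + 1]))
    return hits


def codon_usage(hits):
    """Count how often each arginine codon occurs within the dimers."""
    usage = Counter()
    for _, first, second in hits:
        usage[first] += 1
        usage[second] += 1
    return usage


# Toy fragment: CCT CGG CGG GCA CGT encodes P R R A R.
fragment = "CCTCGGCGGGCACGT"
hits = arginine_dimers(fragment)
print(hits)               # [(3, 'CGG', 'CGG')]
print(codon_usage(hits))  # Counter({'CGG': 2})
```

Applied to every coding sequence in a sample, the per-dimer records would give a listing of the kind shown in Table 1, and the pooled counts would give codon usage percentages of the kind quoted above.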

CGG-CGG likelihood ratio (LR) and Markov model
Based on the structure of the NCBI Virus database, the results are grouped by Geographic Region. The observed frequencies of the arginine codon pairs can be associated with probabilities. Also, following the principles of forensic genetics (14), it is appropriate to ask for the LR value of the CGG-CGG genetic footprint, as a fundamental genetic marker of the pandemic virus. Given a Geographic Region, the LR is the ratio of the probability (P1) of finding the CGG-CGG encoded RR pair if the isolate is SARS-CoV-2 (obviously, P1 = 1) to the probability (P2) of finding it in a random Betacoronavirus isolate from the same Geographic Region (its observed frequency). Only the Africa and Asia Geographic Regions showed CGG-CGG frequencies other than zero, with the following LR values:
In forensic genetics, the LR is used by juries or judges to draw inferences or conclusions and decide legal matters (14). The LR should therefore be large enough for a genetic marker to be considered unique to a given piece of forensic evidence. Here, the Africa and Asia Betacoronavirus LRs were not excessively high, which is consistent with arginine CGG-CGG not being a genetic footprint unique to SARS-CoV-2.
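The LR definition above reduces to the reciprocal of the regional frequency. The short Python sketch below makes that arithmetic explicit. The function name and the counts in the usage line are hypothetical and chosen only for illustration; they are not values from this study. Returning infinity for a region where the marker was never observed is likewise a choice made for this sketch.

```python
def likelihood_ratio(n_cgg_cgg, n_total):
    """LR = P1 / P2 for one Geographic Region.

    P1 = 1 (the CGG-CGG pair is taken as certain in SARS-CoV-2).
    P2 = n_cgg_cgg / n_total, the observed frequency of the CGG-CGG pair
    among the Betacoronavirus records sampled from that region.
    """
    if n_total <= 0:
        raise ValueError("n_total must be positive")
    if n_cgg_cgg == 0:
        # Marker never observed in this region: the ratio is unbounded.
        return float("inf")
    p1 = 1.0
    p2 = n_cgg_cgg / n_total
    return p1 / p2


# Hypothetical counts: 5 CGG-CGG records out of 1,000 gives an LR near 200.
print(likelihood_ratio(5, 1000))
```

The rarer the marker in the regional background, the larger the LR, which is why only regions with a non-zero frequency yield a finite value.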
On the other hand, to quantify the probability of the CGG-CGG presence, a first-order Markov chain was defined. The states were the arginine codons themselves. This Markov model made it possible to determine the probability of the second arginine codon given the previous codon. Since arginine has six codons, an arginine dimer can be encoded by 36 (6 x 6) possible codon pairs (like a roll of two dice: 36 possible outcomes). By normalizing the codon pair frequencies, the stochastic matrix of the Markov chain could be created, whose elements are the transition probabilities between codons (states). As an example, Table 2 shows a stochastic Markov matrix based on the arginine dimers found in a recent Asia Betacoronavirus protein sample. In this sample, if the first codon was AGG or CGA, the second was most likely AGA. If the first codon was CGG, the second was most likely CGT. The elements on the main diagonal give the probability that the second codon is the same as the first. Only AGA-AGA showed a significant presence of the same arginine codon twice in a row.
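As a minimal sketch of the normalization step described above, the following Python code builds a 6 x 6 row-stochastic matrix from a list of observed (first codon, second codon) pairs. The function name and the toy pair list are assumptions for illustration and do not reproduce the data behind Table 2. Leaving a row at zero when its first codon was never observed is a simplification of this sketch.

```python
from collections import Counter

# States of the chain: the six arginine codons.
ARG_CODONS = ["AGA", "AGG", "CGA", "CGC", "CGG", "CGT"]


def transition_matrix(pairs):
    """Row-normalize codon-pair counts into transition probabilities.

    matrix[first][second] estimates the probability that the second codon
    of a dimer is `second` given that the first codon is `first`.
    Rows with no observations are left as all zeros.
    """
    counts = Counter(pairs)
    matrix = {}
    for first in ARG_CODONS:
        row_total = sum(counts[(first, second)] for second in ARG_CODONS)
        matrix[first] = {
            second: counts[(first, second)] / row_total if row_total else 0.0
            for second in ARG_CODONS
        }
    return matrix


# Toy observations (hypothetical, not taken from the database).
toy_pairs = (
    [("AGA", "AGA")] * 3 + [("AGA", "CGT")]
    + [("CGG", "CGT")] * 2 + [("CGG", "AGA"), ("CGG", "CGG")]
)
matrix = transition_matrix(toy_pairs)
print(matrix["AGA"]["AGA"])  # 0.75
print(matrix["CGG"]["CGT"])  # 0.5
print(matrix["CGG"]["CGG"])  # 0.25
```

Each populated row sums to 1, so a row can be read directly as the distribution of the second codon given the first, and the diagonal entries give the probability of the same codon twice in a row.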

Concluding remarks
In this work, about ten million Betacoronavirus protein and coding sequences have been analysed, as well as a few million arginine pairs. This is a large sample that grows day by day, so the present results are also kept updated (Google Drive folder). Furthermore, analysis of arginine pairs from viruses of other taxonomic groups is ongoing. In conclusion, excluding the pair at the SARS-CoV-2 furin site (PRRAR), CGG-CGG encoding of arginine pairs is highly unlikely in wildlife betacoronaviruses, but it is there. However, because of that presence, recombination may have operated in the origin of the S1/S2 protease recognition site of the virus. Recombination is a common way for viruses to pick up new skills (15-19).
Full updated results: https://drive.google.com/drive/folders/1Dp04BHDyMay1sB0GX0O0IFzfZTp_VrBu?usp=sharing
Table 1. CGG-CGG encoded arginine dimers from Betacoronavirus protein sequences. The list is limited to records that exclude those from the SARS-CoV-2 polybasic furin cleavage site (PRRAR). The data are grouped by Geographic Region (19).
Figure 1. Strictly conserved amino acids are denoted by *, gaps are denoted by -. Positions of sequence amino acid residues are indicated by the numbers on the right. The MERS-CoV CGG-CGG encoded arginine doublet (RR-700), located 47 residues upstream of the S1/S2 cleavage site, is highlighted in bold and red, within the polybasic motif (KRR), highlighted in yellow. The specific SARS-CoV-2 and MERS-CoV furin protease recognition patterns and the S1/S2 cleavage positions R-685 and R-748, respectively, are also highlighted in bold and yellow.
Table 2. The stochastic matrix is a square matrix of transition probabilities between arginine codons (states). The rows are probability vectors. An element of the matrix is the probability that the second arginine codon is that of the column if the first is that of the row. Consequently, the sum of the elements of a row is 1. Data used to create this Betacoronavirus-Asia stochastic Markov matrix: total number of analysed Betacoronavirus protein sequences, 93,977; total number of protein sequences having arginine dimer(s), 34,346 (36.55%); total number of arginine dimers in the sample, 134,249; total number of SARS-CoV-2 (CGG-CGG) polybasic furin cleavage site arginine dimers, 6,859 (5.11%). To avoid distortions in calculations of transition probabilities between arginine codons, the arginine dimers of the SARS-CoV-2 furin site were excluded.