1 SARS-CoV-2 and the secret of the furin site

The SARS-CoV-2 high infectivity is due to the functional polybasic furin cleavage site in the S protein. How it was acquired is unknown. There are two challenges to face: (i) an evolutionary model, to fit the origin of the coronavirus; and (ii) a molecular mechanism for the site acquisition. Here we show genomic fingerprints which are specific of Pangolin-CoVs, Bat-SARS-like (CoVZC45, CoVZXC21), bat RatG13 and human SARS-CoV-2 coronaviruses. This, along with phylogenetic analysis, we found that these species have the same evolutionary origin in the bat, including a genetic recombination of S gene between Pangolin-CoV (2017) and RatG13 ancestors. However, this does not explain why SARS-CoV-2 is the only of them with the furin site, which consists in four amino acid (PRRA) motif. The Arginine doublet is encoded by CGGCGG codons. Surprisingly, none of the Arginine doublet of other furin site of viral proteins from several type of viruses, are encoded by the CGGCGG codons. This makes it difficult to consider a virus recombination as mechanism for the PRRA acquisition. The origin of SARS-CoV-2, is the origin of the recognition cleavage site. The bat coronavirus RaTG13 appears to be the closest relative of the SARS-CoV-2, but was isolated in 2013. So, new RatG13 samples would provide insights into the acquisition of the polybasic motif.


Introduction
The origin of the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), the pandemic virus of the coronavirus disease 2019 , is controversial. It is linked to the origin of the polybasic furin clevage site in the sipke glycoprotein (1). Furin is a protease ubiquitously expressed in human cells. In the human genome, the furin gene (FUR) is located on chromosome 15. Furin specifically cleaves substrates at single or paired basic residues in normal protein processing (2). However, furin is also involved in protein processing in infectious diseases and cancer (3), and now in COVID-19. The polybasic furin cleavage site is present in many viral proteins from different types of viruses (3). Unlike SARS-CoV and other Betacoronavirus Sarbecovirus (lineage B), the spike protein of SARS-CoV-2 is thought to be uniquely cleaved by the furin (2,4,5). We addressed both the SARS-CoV-2 evolutionary model analysis and the site acquisition, through a bioinformatic approach, based on the available genomic data, as an attempt to fit together the pieces of a puzzle. The reproducibility of the results has been considered essential.

Methods
The source of information was the National Center for Biotechnological Information (NCBI) databases and the methodology was based on the bioinformatic resources of NCBI and the European Laboratory of Molecular Biology (EMBL). The nucleotide Basic Local Alignment Search Tool (BLASTn) were performed using the NCBI-BLAST program (6). A reference SARS-CoV-2 genome was used as query against the entire NCBI nucleotide collection. The entire nucleotide collection consists of GenBank+EMBL+DDBJ+PDB+RefSeq sequences. The database is non-redundant. Identical sequences have been merged into one entry. Number of sequences: 65512295. For clarity, search was limited to records that exclude: Severe acute respiratory syndrome coronavirus 2 (taxid:2697049). Multiple sequence alignments were created by Clustal Omega (v. 1.2.4) using default parameters (7). The phylogenetic analysis were based on both complete genomes and S gene sequences. Because the purpose of the phylogenetic analyses is to make the detailed phylogenetic relationships of closely related SARS-CoV-2 coronaviruses, and to ensure that the analyses are comparable, sequences from the same coronaviruses were used in both phylogenetic analyses. The sequences were those of the SARS-CoV-2 group (Table 1), and selected coronaviruses sequences from the literature (4,8,9,10), for the construction of the phylogenetic tree. It was constructed with the Neighbor Joining method of the Clustal Omega (v. 1.2.4) software package, and the iTol Interactive Tree Of Life tool (11). Assessed clustering strength was calculated by bootstrap using 1000 replicates. Tree scale bar stands for the evolutionary distance. A node in the phylogenetic tree may represent both a common ancestor of the homologous sequences that originate from it (defining a cluster), and/or an evolutionary event of speciation.

Evidence of a common ancestor of SARS-CoV-2 and its closely related coronaviruses
The basic principles of biology are also fulfilled in the world of viruses. The principle of the Cell Theory of Rudolf Virchow (1858), Omnis cellula ex cellula (each cell derived from another pre-existing cell), could be interpreted as "each virus derives from a pre-existing virus". Thus, we were interested in locating an ancestor of the SARS-CoV-2 within the framework of an evolutionary model with its closely related coronaviruses.
Fortunately, through the BLASTn search, using as query a reference SARS-CoV-2 complete genome sequence, against the entire collection of NCBI nucleotide sequences, we found three SARS-CoV-2 genomic fingerprints that allowed us to identify its closely related coronaviruses, which are: Pangolin-CoVs (2017, 2019), Bat-SARS-like (CoVZC45, CoVZXC21) and bat RatG13 (Figure 1 and Table1). Regarding the genomic fingerprints, with coordinates based on NC_045512.2 SARS-CoV-2 isolate Wu han-Hu-1, complete genome (used as quesry), they are: fingerprint 1, 1923-3956; fingerprint 2, 21577-22539; and fingerprint 3, 27910-28257. More specifically, fingerprint 1 is at the beginning of the genome in the orf1a RNA polimerase gene, covering the nsp2 (the final 796 nucleotides) and nsp3 (the initial 1237 nucleotides) regions. The fingerprint 2 is at the beginning of S gene, covering the part encoding the N-terminal domain and the ACE2 receptor binding domain (RBD). The fingerprint 3 is the orf8 gene itself. Interestingly, these genomic fingerprints are only specific markers at gene level (RNA sequence), not at the protein level. That is, their encoded amino acid sequences are similar to those of the other Sarbecovirus.
The genomic fingerprints are shared by the closely related SARS-CoV-2 coronaviruses but not by other coronaviruses. It is an evidence that a common ancestor served as a progenitor of them. It is unlikely that these SARS-CoV-2 related species independently acquired identical markers at three different locations in the genome. As further evidence of that close phylogenetic kinship, in the N-terminal domain of the S protein, there are short sequence features (one deletion and three insertions, Figure 2) which are also shared by the SARS-CoV-2 related coronaviruses but not by other Sarbecovirus. Again, It is unlikely that these species independently acquired identical deletion/insertions at four different locations in the S gene.

Evidence of S gene recombination between Pangolin-CoV (2017) and BatCov-RatG13 ancestors
A detailed analysis of the N-terminal domain of S protein evidence high similarity between Pangolin-CoV (2017), RatG13 and SARS-CoV-2 sequences ( Figure 2). Since at the complete genome level, Pangolin-CoV (2017) is phylogenetically distant from RatG13 and SARS-CoV-2, this already point out to a recombination event, that it has been validated by through further phylogenetic analyses.
We found in the phylogenetic analysis based on the nucleotide sequence of S gene region encoding the Nterminal domain and the RBD, that Pangolin-CoVs (2017) on the one hand, and RatG13 and SARS-CoV-2 on the other, were consistently (bootstrap support 1000) grouped together in the same cluster. Pangolin-CoV (2019) and Bat-SL-CoV grouped with the other coronaviruses ( Figure 4). Comparing both phylogenetic analyses, the one based on complete genomes (Figure 3), and that based on the S gene ( Figure 4), there are a conflicting evolutionary history suggesting a recombination or horizontal gene transfer event (12), involving Pangolin-CoV (2017) and RatG13 ancestors. This could be an evidence that these species have undergone recombination of S gene. So, the Pangolin-CoV (2017) species has similar N-terminal domain and RBD to that of RatG13 and SARS-CoV-2 ( Figure 2 and 6). However, from the current sequence and phylogenetic analyses, it is not possible to elucidate the directionality of that recombination. That is, we cannot know which of them acted as donor and/or receptor on that exchange of S gene region.

The secret of the SARS-CoV-2 functional polybasic furin clevage site
A key identity mark of SRAS-CoV-2 is the polybasic furin cleavage in the S protein, but is absent in the closely related SARS-CoV-2 coronaviruses (4,5). It is a polybasic recognition motif of the human ubiquitously expressed serine protease furin, that cleaves the S protein in the conserved S1/S2 cleavage region (1,4,5). According the SRAS-CoV-2 biology, this furin site is responsible for its high infectivity and transmissibility, as well as, for the COVID-19 pathogenesis (13,14).
In SARS-CoV-2 S protein, this site is four amino acid PRRA, encoded by the inserted of 12 nucleotides in the S gene. The presence of a doublet of Arginine is a distinctive structural feature of the furin site (15). Taking as reference the S protein of the Bat coronavirus RaTG13 (the closest relative), the PRRA insertion occurred between the Serine-680 (encoded by TCA) and the Arginine-681 (encoded by CGT). However, there are three possible 12-nucleotide fragments that, inserted in a different strategic way, encoded the same PRRA sequence. The Figure 5 shows details of each possible insertion. From the current genomic data, it is now impossible to know what of these cases actually happened in the most decisive evolutionary event in the SARS-CoV-2 speciation.
However, furin sites are present in many viral proteins of all types of viruses (3). Table 2 shows several examples. From the seventh coronavirus known to infect humans (4), the site is also found in the S protein of the Betacoronavirus Embecoviruses (Lineage A) HKU1, OC43; and the Betacoronavirs Merbecovirus (lineage C) Middle East respiratory Syndrome-Related Coronavirus (MERS-CoV); but not in the Betacoronavirus Sarbecovirus (Lineage B) SARS-CoV; and the Alphacoronaviruses NL63 and 229E.
With regard to random insertion mutation. Viral RNA synthesis is performed by the RNA-dependent RNA polymerase (RdRP), including 15-16 non-structural proteins, RNA-modifying enzymes, and a 3′-5′ exonuclease activity that assists RNA synthesis with a unique RNA proofreading function necessary for maintaining the integrity of the >30 kb coronavirus genome (1,18). Moreover, SARS-CoV-2, like the RNA viruses, could be considered a quasispecies, where there are always random point mutations, or error tail, but it is metastable and the mutation rate is below the threshold of error catastrophe (1). Thus, the complexity of RNA replicase and the very structure of the virus genome, reduce the likelihood of a random insertion mutation as mechanism for the origin of the PRRA motif.
Concerning recombination with other viruses. The two Arginines of the PRRA motif, in SARS-CoV-2 are encoded by the CGGCGG codons. So, we have analysed the codon usage of representative furin sites with an Arginine double of a wide variety of viral proteins, of all types of viruses (including, dsDNA, (+)ssRNA, (-)ssRNA) ( Table 2). Surprisingly, none of the Arginine doublet is encoded by the CGGCGG codons. Then, we were interested in the Arginine codon usage in SARS-CoV-2 genome. As it is shown in Table 3, out of the six codons of the Aarginine, the CGG codon is the minority. Also surprisingly, out of the 42 Arginines of the SARS-CoV-2 S protein, only two Arginines are encoded by the CGG codon, which are those of the PRRA motif. All recombination event requires a donor. In this case, the donor should be another furin site, probably from another virus, but having the double RR and encoded also by the CGGCGG codons. We have not been able to identify this hypothetical donor. So, this makes it difficult to consider viral recombination as mechanism for PRRA acquisition in SARS-CoV-2.
As concerns a laboratory origin. It is unlikely that someone manipulated these viral changes in a laboratory, but not impossible. Furthermore, the mechanism of the acquisition of the functional polybasic furin site in SARS-CoV-2 refers to how, but not when and where. Since RatG13 sample was isolates in 2013 (8), it is plausible that a SARS-CoV-2 ancestor acquired the site after this year. Regarding where, we propose two scenarios that can plausibly explain the origin of SARS-CoV-2: (i) in animal host before jumping to human, or (ii) inside the human host. Assuming the first scenario, the current evolutionary model, tells us that that "animal host" should be the bat.
It is not unreasonable to consider that SARS-CoV-2 could have acquired the furin site within the bat. There are signs that point to this possibility. Part of the optimization that the pandemic virus acquired for binding to the human receptor ACE2 was ad quired within the bat. J.  describe the contacting residues of SARS-CoV-2 RBD in direct comparison with the contacting residues of SARS-CoV RBD (19). Interestingly, we found that SARS-CoV-2 RBD optimization is also shown in RatG13 ( Figure 6). In the same vein, SARS-CoV-2 sequence analysis, led to predicted the acquisition of three O-linked glycans around the furin site (4), also, SARS-CoV-2 and RAtG13 sequences around the site are 100% identical.
However, if a SARS-CoV-2 ancestor had acquired the furin site in the bat, it does not imply that RatG13 had also acquired the site. This is impossible to know, because we are analysing the RatG13 sample of 2013. Until no new RatG13 genomes, but captured in 2021, would be analysed, the scenario is open.
Taken together, all these lines of evidence and reasoning show that the acquisition of the polybasic furin cleavage site by SARS-CoV-2 is a "missing link" in our understanding of its evolutionary history, that can only be addressed through the discovery of new viruses. The principle of Theodosius Dobzhansky (1973) "Nothing in Biology Makes Sense Except in the Light of Evolution" becomes more relevant in the case of the pandemic virus, not only because there is the hypothesis of a laboratory origin, but also because there is a furin site that has changed the world.

Conclusions
• Genomic fingerprints in Orf1a, S and orf8 genes, and short sequence features in the N-ternimal domain of the S proteins; along with phylogenetic analysis based on complete genome, suggest that Pangolin-CoV (2017), Pangolin-CoV (2019), Bat-SL-CoV (CoVZC45, CoVZXC21), bat RatG13 and human SARS-CoV-2 coronavirus species have the same evolutionary origin in the bat, and have been separated by speciation events.
• Sequence and phylogenetic analyses suggest that Pangolin-CoV (2017) and RatG13 coronaviruses ancestors have undergone recombination in the S gene region encoding the N-terminal domain and the RBD, before pangolin coronavirus jumped to the pangolin as host animal.
• The CGG CGG codons of the Arginine doublet in the SARS-COV-2 PRRA polybasic furin cleavage site (S protein), have not been identified in any other furin site Arginine doublet, from viral proteins, including S protein, of a wide variety of viruses. This not supports recombination as mechanism for PRRA acquisition in SARS-CoV-2.
• SARS-CoV-2 origin matters. The million-dollar question: would a bat coronavirus RatG13, isolated in 2021, have the functional polybasic furin site? *: based on the NCBI-BALSTn search described in Figure 1. Percentatge of identity, using the NC_045512.2 SARS-CoV-2 isolate Wuhan-Hu-1, complete genome sequence as a query.   At the top there is a color scale on the alignment scores between query and hit sequences, which are represented for each line (red, maximum value). Also in the upper center, there is a bar with a scale (1-30000) that represents the query sequence that was NC_045512.2 SARS-CoV-2 isolate Wuhan-Hu-1, complete genome sequence (29903 nucleotide length), against the entire NCBI nucleotide collection. Description of the settings and software that was used are included in the Methods. The top hits with a complete query match were (continuous red lines) are (Sequence descripton, GenBanK id, percent identity): Synthetic construct (2019) (Table 1). For better identification, each of these coronavirus is indicated on the corresponding branches of the tree. For simplicity, only three genomes of SARS-CoV-2 have been included. Description of the settings and software that was used are included in the Methods. The black point stands for a common ancestor of the SARS-CoV-2 group, and the red point depicts the divergence of the Pangolin-CoV (2017)