Structural and genetic analysis of coronavirus spike proteins suggests pangolin as a proximate intermediate host of SARS-CoV-2 (COVID-19)

In December 2019, a novel coronavirus named SARS-CoV-2 emerged in Wuhan, China. Human-to-human transmission of the virus has since been established. The virus has so far infected more than 2 million people and spread to over 200 countries. The World Health Organization (WHO) has declared COVID-19 a global health emergency because of its spread well beyond China. It has been established that the virus originated from bats and reached humans through an intermediate host. Identifying the intermediate host is important for understanding the virus shuttle mechanism and for stopping future outbreaks. To this end, a genetic and structural analysis of coronavirus spike proteins was performed using a computer-assisted approach. To conduct the in silico analysis, 43 spike protein sequences belonging to different species were retrieved from the NCBI nucleotide database. Pairwise and multiple sequence alignments were performed to assess the similarities and differences among the retrieved sequences. To highlight relationships among the species, phylogenetic analysis was performed using the MEGA software tool. Finally, protein structure alignment (superimposition) against the reference structure was performed with UCSF Chimera. The human protein showed the highest similarity to the bat and pangolin sequences and, of these two, the phylogenetic analysis placed it closest to pangolin. These results suggest that SARS-CoV-2 was transferred from bats to humans through pangolins.
Preprints (www.preprints.org) | NOT PEER-REVIEWED | Posted: 2 May 2020 doi:10.20944/preprints202005.0022.v1


Introduction
Since its initial outbreak in Wuhan, China, the novel coronavirus disease 2019 (COVID-19) has infected more than 2.0 million individuals, caused more than 0.12 million deaths, and spread to 205 countries so far. The causative organism of COVID-19 is a betacoronavirus named SARS-CoV-2. It is a zoonotic virus that is transmitted from animals to humans. In general, coronaviruses are enveloped viruses with a spherical or pleomorphic shape that contain positive-sense RNA. The genome ranges from 26 to 32 kb, and the virion is 80-120 nm in diameter [1]. The viral genome encodes four structural proteins, viz. Envelope protein (E), Spike protein (S), Membrane protein (M) and Nucleocapsid protein (N) [2]. Each protein has an important role in the viral life cycle: the S protein mediates host cell attachment, the N protein forms the nucleocapsid, and the M and E proteins take part in viral assembly [3][4][5].
One of the important challenges is to determine the origin of SARS-CoV-2 in order to understand its transmission from animals to humans. It has now been established that SARS-CoV-2 originated from horseshoe bats [6].

Retrieval of glycoprotein sequences
The spike protein sequences of coronaviruses were retrieved from the National Center for Biotechnology Information (NCBI) nucleotide database. These sequences belong to different species, including humans. Altogether, 43 sequences from different species were retrieved. The species and their accession numbers are listed in table 1.

Pairwise alignment
Each retrieved sequence was aligned against a reference sequence to assess its similarities and differences. The human coronavirus sequence was taken as the reference, and all alignments were performed using the BLAST algorithm (https://blast.ncbi.nlm.nih.gov/Blast.cgi).
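As a rough illustration of what a pairwise comparison against a reference measures, the sketch below aligns two short sequences and reports their percent identity. It is not BLAST: BLAST is a heuristic local alignment search, whereas this is a plain Needleman-Wunsch global alignment with a linear gap penalty. The function names, the scoring values, and the toy sequences are assumptions made for illustration only and are not taken from the study.

```python
def global_align(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch global alignment (linear gap penalty).

    Scoring values are illustrative assumptions, not BLAST defaults.
    Returns the two aligned strings, padded with '-' for gaps.
    """
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if a[i - 1] == b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + sub:
                out_a.append(a[i - 1])
                out_b.append(b[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b))


def percent_identity(a, b):
    """Identical aligned positions as a percentage of the alignment length."""
    x, y = global_align(a, b)
    if not x:
        return 0.0
    same = sum(1 for p, q in zip(x, y) if p == q and p != "-")
    return 100.0 * same / len(x)


# Toy sequences, not real spike protein data.
print(global_align("ACGT", "AGT"))      # ('ACGT', 'A-GT')
print(percent_identity("ACGT", "AGT"))  # 75.0
```

Percent identity over the alignment length is only one of several conventions; a different denominator (for example the shorter sequence length) would give different numbers for the same alignment.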

Multiple sequence alignment
Multiple sequence alignment of all the retrieved sequences was performed to assess the similarities and differences among them; it is also a prerequisite for the phylogenetic analysis. The alignment was performed using the Clustal Omega web server (https://www.ebi.ac.uk/Tools/msa/clustalo/).

Phylogenetics analysis
To highlight the relationships among the species, phylogenetic analysis was performed with the Molecular Evolutionary Genetics Analysis (MEGA) 6.06 software tool. To confirm the results, the analysis was repeated with different algorithms, including parsimony analysis, maximum likelihood analysis, and unweighted pair group method with arithmetic mean (UPGMA) analysis.
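Of the tree-building methods named above, UPGMA is the simplest to show in a few lines. The following is a minimal sketch of the clustering step only, assuming a precomputed pairwise distance table; it is not how MEGA is implemented, it omits branch lengths, and the labels A to D and their distances are invented placeholders rather than data from this study.

```python
from itertools import combinations


def upgma(names, dist):
    """Build a UPGMA tree as nested tuples (topology only, no branch lengths).

    names: list of leaf labels.
    dist:  dict mapping a pair (x, y) of leaf labels to their distance.
    """
    sizes = {name: 1 for name in names}  # cluster -> number of leaves in it
    d = {frozenset(pair): value for pair, value in dist.items()}
    while len(sizes) > 1:
        # Merge the two clusters that are currently closest.
        a, b = min(combinations(sizes, 2), key=lambda p: d[frozenset(p)])
        merged = (a, b)
        na, nb = sizes.pop(a), sizes.pop(b)
        # Distance to the new cluster is the size-weighted mean of the
        # distances to its two parts (the "arithmetic mean" in UPGMA).
        for c in sizes:
            d[frozenset((merged, c))] = (
                na * d[frozenset((a, c))] + nb * d[frozenset((b, c))]
            ) / (na + nb)
        sizes[merged] = na + nb
    return next(iter(sizes))


# Invented distances for four placeholder taxa.
toy = {
    ("A", "B"): 0.10, ("A", "C"): 0.20, ("B", "C"): 0.24,
    ("A", "D"): 0.80, ("B", "D"): 0.80, ("C", "D"): 0.80,
}
print(upgma(["A", "B", "C", "D"], toy))  # ('D', ('C', ('A', 'B')))
```

In the toy output, A and B join first because they are closest, C joins that pair next, and D is the outgroup. UPGMA assumes a constant rate of change along all lineages, which is one reason to compare its tree with parsimony and maximum likelihood trees.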

Protein structure prediction and refinement
The protein structures of all the retrieved sequences listed in table 1 were predicted by either homology modeling or threading. First, templates for the sequences were searched in the Protein Data Bank (PDB) using the BLAST algorithm. Where a good template was found in the PDB, the structure was predicted by homology modeling.
Where no good template was found, the structure was predicted by threading. The quality of each predicted structure was then checked, and structures of insufficient quality were refined with the ModRefiner web server (https://zhanglab.ccmb.med.umich.edu/ModRefiner/).

Proteins' structure superimposition
Protein structure alignment was performed by superimposing the structures in UCSF Chimera 1.14. As in the pairwise sequence alignment, each predicted structure was compared against a reference structure to assess similarities and differences. The human coronavirus glycoprotein structure was taken as the reference, and all the predicted structures were compared against it.
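The core of a structure superimposition can be sketched as a least-squares fit of one set of atom coordinates onto another, followed by a root-mean-square deviation (RMSD). The example below uses the Kabsch algorithm with NumPy and assumes the two coordinate sets already have the same number of atoms in matching order. This is an assumption for illustration: it is not a description of Chimera's own procedure, and the function name and coordinates are hypothetical.

```python
import numpy as np


def kabsch_rmsd(P, Q):
    """RMSD between two (N, 3) coordinate sets after optimal superposition.

    Assumes row i of P corresponds to row i of Q (atoms already paired).
    """
    P = np.asarray(P, dtype=float)
    Q = np.asarray(Q, dtype=float)
    # Remove translation by centring both sets on their centroids.
    Pc = P - P.mean(axis=0)
    Qc = Q - Q.mean(axis=0)
    # Covariance matrix and its singular value decomposition.
    H = Pc.T @ Qc
    U, _, Vt = np.linalg.svd(H)
    # Guard against an improper rotation (reflection).
    d = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0 else -1.0
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    # Rotate P onto Q and measure the remaining deviation.
    P_fit = Pc @ R.T
    return float(np.sqrt(((P_fit - Qc) ** 2).sum() / len(P)))


# Hypothetical coordinates: Q is P rotated 90 degrees about z and shifted,
# so the two sets have the same shape and the RMSD should be close to zero.
P = np.array([[0, 0, 0], [1, 0, 0], [0, 2, 0], [0, 0, 3]], dtype=float)
Rz = np.array([[0, -1, 0], [1, 0, 0], [0, 0, 1]], dtype=float)
Q = P @ Rz.T + np.array([5.0, -2.0, 1.0])
print(kabsch_rmsd(P, Q))
```

A lower RMSD means the two structures are more alike in shape. For proteins of different lengths, the atoms to pair must first be chosen, typically from a sequence alignment, which this sketch leaves out.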

Pairwise alignment
The pairwise alignment of all the retrieved sequences was performed against the reference human coronavirus sequence. The results of the pairwise sequence alignment are mentioned in

Multiple sequence alignment
The results of the multiple sequence alignment are shown in figure 1. The longest sequence was Canine3 (1481 amino acids), while the shortest was Poultry 1 (224 amino acids). Stars in the alignment mark the residues conserved among all the compared species. No residues were conserved when all 43 sequences were compared, but some were conserved when the alignment was limited to a subset of species. For example, when the alignment is limited to Rat, Camel, and Bovine, the sequences are conserved at many positions. This highlights the similarity of the coronavirus sequences among these three species and raises the possibility that Rat, Camel, and Bovine transmitted the coronavirus directly to one another.
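The star annotation described above can be reproduced with a short sketch: a column is marked with a star only when every aligned sequence carries the same residue there. The function name and the toy aligned sequences are illustrative assumptions, not sequences from this study.

```python
def conserved_columns(aligned):
    """Return a marker line with '*' under each fully conserved column.

    aligned: list of equal-length aligned sequences ('-' marks a gap).
    """
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("aligned sequences must all have the same length")
    marks = []
    for column in zip(*aligned):
        first = column[0]
        same = first != "-" and all(residue == first for residue in column)
        marks.append("*" if same else " ")
    return "".join(marks)


# Toy alignment of three made-up sequences.
toy = ["MK-LV",
       "MKALV",
       "MR-LV"]
print(repr(conserved_columns(toy)))      # '*  **'
print(repr(conserved_columns(toy[:2])))  # '** **'
```

The second call shows why restricting the comparison to fewer sequences can only keep or increase the number of starred columns: every additional sequence is one more chance for a column to differ.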

Phylogenetics analysis
The phylogenetic analysis was performed with MEGA 6.06. To confirm the results, the analysis was performed with three different algorithms: Parsimony, Maximum Likelihood, and UPGMA. The clusters in the tree formed according to species. For example, bovines fell in one cluster and camels in another, and rats, bats, and canines likewise fell in their respective clusters, as shown in figure 2. Interestingly, all three analyses placed human, bat, and pangolin in a single cluster. This cluster is of particular significance because the focus here is the origin of the human coronavirus. Within this cluster, human was more closely related to pangolin than to Bat1. This highlights the possibility that the human coronavirus passed from bat to pangolin to human.

Proteins' structures analysis
The predicted structures of all the proteins were compared against the reference structure of the human coronavirus glycoprotein. The structure comparison is shown in figures 3 and 4.
According to the comparison, the Pangolin structure showed the highest similarity to the Human glycoprotein structure, and the Bat structures were also markedly similar. Across all the species compared, the Pangolin and Bat structures resembled the Human structure more closely than those of any other species. This result supports the pairwise alignment and phylogenetic analyses, in which Pangolin and Bat also showed the highest similarity to Human.

Discussion
An earlier study claimed that snakes were likely to be the intermediate hosts of SARS-CoV-2. The researchers compared codon usage in the SARS-CoV-2 virus against that of the cells in eight animals at the Wuhan Huanan Seafood Wholesale Market. That study found that snakes had the codon usage pattern most similar to that of SARS-CoV-2 and concluded that snakes were the most likely intermediate hosts [8]. A follow-up study compared the codon usage of three coronaviruses (SARS-CoV-2, SARS-CoV, and MERS-CoV) with that of more than 10,000 different kinds of animals and suggested that the early claim of snake-borne transmission of SARS-CoV-2 is likely to be incorrect [9]. A recently published study shows similar results and predicts pangolin as the intermediate host [9]. Our phylogenetic analysis of the spike protein using three different algorithms agrees with these findings: human, bat, and pangolin coronaviruses fell in the same cluster, with the spike protein of the pangolin virus most closely related to that of the human virus. Similar results were reported in a recently published study [12]. To further examine our findings, we performed a structural analysis comparing the spike protein structure of the human coronavirus with those of other coronaviruses. As previously predicted, the human coronavirus spike protein is more closely related to that of the pangolin coronavirus than to those of bat or other coronaviruses.
Taken together, the MSA, phylogenetic analysis, and structural analysis predict pangolin as an intermediate host. The virus originated from the live food market in Wuhan, where wild animals including bats and pangolins were kept together, providing a favourable environment for coronavirus transfer between hosts.

Conclusion
Amid the COVID-19 outbreak, a detailed understanding of how SARS-CoV-2 transfers to humans will help in preventing future outbreaks. SARS-CoV-2 transfers from bats to humans through an intermediate host.
Using the genetic and structural analysis of spike proteins from different coronaviruses, we predict that pangolins served as an intermediate host to transfer the novel virus from bats to humans.

Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest
The authors declare that they have no conflict of interest.