A COVID-19 Drug Repurposing Strategy Through Quantitative Homological Similarities by using a Topological Data Analysis based formalism

Since its declaration in March 2020, the SARS-CoV-2 global pandemic has produced more than 65 million cases and 1.5 million deaths worldwide. Despite the enormous efforts carried out by the scientific community, no effective treatments have been developed to date. We created a novel computational pipeline aimed at speeding up the identification of repurposable candidate drugs. Unlike current drug repurposing methodologies, our strategy filters the best candidates among all selected targets by means of a mathematical formalism motivated by recent advances in the fields of algebraic topology and topological data analysis (TDA). This formalism allows us to compare three-dimensional protein structures. Its use in conjunction with two in silico validation strategies (molecular docking and transcriptomic analyses) allowed us to identify a set of potential drug repurposing candidates targeting three viral proteins (3CL viral protease, NSP15 endoribonuclease, and NSP12 RNA-dependent RNA polymerase), which included rutin, dexamethasone, and vemurafenib among others. To our knowledge, this is the first time that a TDA-based strategy has been used to compare a massive number of protein structures with the final objective of performing drug repurposing.


Introduction
On March 11, 2020, the World Health Organization (WHO) declared the Coronavirus Disease 2019 (COVID-19) outbreak, produced by the novel SARS-CoV-2 virus, a global pandemic [1]. So far, three previously approved antiviral drugs and one antimalarial medication (remdesivir, lopinavir, interferon-β1, and hydroxychloroquine) have been tested for efficacy against SARS-CoV-2 infection by the WHO SOLIDARITY consortium in a large multicentric study. The results of the trial suggested that these treatments had little or no effect on a set of clinical outcomes which included overall mortality, time to initiation of mechanical ventilation, and duration of hospital stay [2].
With the second wave ongoing in many countries, herd immunity far down the road, and no date scheduled for the release of an effective vaccine, there is still a pressing need to find adequate treatments for the disease. De novo drug development and testing, including preclinical research and clinical trials, is a slow process that could take more than 12 years [3, 4]. However, the current sanitary emergency makes it imperative to shorten this time frame. Therefore, sustained efforts to identify potential candidates for drug repurposing are necessary.
In the context of COVID-19, Kumar and co-workers [5] compiled sets of genes linked to the disorder and studied their distribution in the human interactome. They first identified the hub genes of the interactome subnetworks in which the disease-related genes were placed. Then, they queried the drug-gene interaction database [6] [7] to identify FDA approved drugs which had the hub genes as their target (i.e., chloroquine, lenalidomide, pentoxifylline). Zhou and collaborators compiled a list of human proteins that physically interact with four previous human coronaviruses (SARS-CoV, MERS-CoV, HCoV-229E, and HCoV-NL63) and used network proximity measures to prioritize 16 potential anti-human coronavirus repurposable drugs including melatonin, mercaptopurine, and sirolimus [8]. Virtual screening studies based on molecular docking approaches have also been reported. To cite an example, Keretsu et al. used a protease inhibitors database (MEROPS) and the geometric structure of the 3C-like virus protease (3CLpro) to identify 15 potential inhibitors using the Surflex-Dock software [9].
Here we present a general-purpose drug repositioning workflow and its application to the specific case of COVID-19. Our procedure is based on recent developments in the field of Topological Data Analysis and its use in the study of biological geometric structures [10].
In particular, our method relies on the idea that drugs that are known to target a specific protein would likely target other proteins that present high degrees of topological similarity with the first. Therefore, the accumulated knowledge of drug-protein interactions available in public repositories such as DrugBank, in combination with the information about protein geometric structures found in the Protein Data Bank (PDB), can be used to predict new potential drug protein targets based on the computation of protein-protein topological similarities. Figure 1 contains a brief summary of the general methodology.
Following this principle, we aimed to identify candidate repurposable drugs to target SARS-CoV-2 proteins. To this end, we first retrieved information about all FDA approved drugs and their protein targets from DrugBank [11]. The available geometric structures of the target proteins, as well as the SARS-CoV-2 protein structures, were obtained from the Protein Data Bank (PDB).
Second, for each protein geometric structure, the coordinates of the alpha carbons were selected. This is often referred to as the coarse-grained representation of the protein. Then, using persistent homology theory, we generated the barcodes of each protein's coarse representation through the construction of Vietoris-Rips complexes for the first three persistent similarity measures that provide information about the protein's shape. In short, the initial point cloud used to represent a protein in a three-dimensional space is transformed into a set of three numbers, which include information about the topology of the object. A given pair of barcodes can be tested for similarity using appropriate metrics.
Persistent similarity measures were then computed between the SARS-CoV-2 protein structures and the whole set of protein structures which were known targets of FDA approved drugs. Drugs targeting proteins presenting large topological similarity values with SARS-CoV-2 proteins were selected as potential repurposing candidates and tested in a further twofold in silico validation step.
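The barcode construction described above can be illustrated for homological dimension 0 only. The following is a minimal Python sketch, not the implementation used in this work: the function name `h0_barcode` and the toy coordinates are hypothetical, and the scale parameter is assumed to be the pairwise distance at which an edge enters the Vietoris-Rips filtration (some conventions use half of that value). Under these assumptions, every connected component is born at scale 0 and dies when it merges with another component.

```python
import math
from itertools import combinations


def h0_barcode(points):
    """Return dimension-0 persistence bars (birth, death) of a point cloud.

    Assumption: an edge enters the Vietoris-Rips filtration at the Euclidean
    distance between its endpoints, so components merge in order of
    increasing pairwise distance (single linkage).
    """
    n = len(points)
    parent = list(range(n))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    edges = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i, j in combinations(range(n), 2)
    )
    bars = []
    for dist, i, j in edges:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            parent[root_i] = root_j
            bars.append((0.0, dist))  # one component dies at this scale
    bars.append((0.0, math.inf))  # the last surviving component never dies
    return bars


# Hypothetical toy "alpha carbon" coordinates in angstroms.
toy_calpha = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (7.6, 10.0, 0.0)]
bars = h0_barcode(toy_calpha)
```

Higher-dimensional bars (loops and cavities) require the full simplicial machinery and are better left to a dedicated persistent homology library.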
To validate the findings in silico, first, drugs selected in the previous step were subjected to blind docking using both the predicted target protein and drug three-dimensional structures, and the binding energies of the multiple detected pockets were computed. Second, we carried out searches for transcriptomic studies including samples infected with SARS-CoV-2 and uninfected controls, and obtained the gene expression signatures of SARS-CoV-2 infection by differential expression analysis. Then, the SARS-CoV-2 infection signatures were compared to the transcriptomic profiles produced by treatment with FDA approved drugs generated by the LINCS L1000 team.
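The comparison between an infection signature and drug-induced profiles can be sketched as a correlation ranking. The snippet below is an illustration under stated assumptions: it uses the Pearson coefficient (the text does not name the coefficient), and the function names, drug labels, and values are all hypothetical. Drugs with the most negative correlation are the ones whose profile tends to oppose the infection signature.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def rank_reversing_drugs(infection_signature, drug_signatures):
    """Sort drugs from most negative to most positive correlation."""
    scores = {
        drug: pearson(infection_signature, signature)
        for drug, signature in drug_signatures.items()
    }
    return sorted(scores.items(), key=lambda item: item[1])


# Hypothetical log fold changes over the same four genes.
infection = [2.0, 1.0, -1.0, -2.0]
drugs = {
    "drug_A": [-2.0, -1.0, 1.0, 2.0],   # mirrors the infection signature
    "drug_B": [1.0, 2.0, -2.0, -1.0],   # resembles the infection signature
    "drug_C": [0.5, -0.5, 0.5, -0.5],   # weakly related
}
ranking = rank_reversing_drugs(infection, drugs)
```

In practice both vectors must be restricted to the same genes, in the same order, before any coefficient is computed.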

Validation of the persistence similarity function
Before applying our pipeline to the identification of potentially repurposable drugs for the treatment of COVID-19, we tested the capacity of the persistent similarity measure to identify proteins with closely related three-dimensional structures. To this end, we retrieved information from two curated protein classifications (i.e., Skolnick's dataset and a random sampling of 500 protein structures derived from ten different superfamilies of the SCOPe database). Then, we tested the capacity of our method to reproduce them correctly by computing all possible pairwise persistence similarity measures between the structures included in each dataset.
[...] response (p-adj = 2.0e-20). The FDA approved drugs showing the strongest negative correlation in the LINCS L1000 analysis were niclosamide, bisacodyl, and perhexiline (r = −0.21, −0.19, −0.18). GSEA analysis of the transcriptomic signatures produced by those medications suggested that they induce significant gene expression changes in pathways linked to interleukin signaling and NF-kB activation. Genes included in the set of potential therapeutics for SARS were also found to be upregulated in the bisacodyl signature (NES = 1.61, p-adj = 2.19e-02). The JAK-STAT complex and the TCF dependent signaling pathways were found to be downregulated in the perhexiline and niclosamide signatures, respectively.
Eight thousand three hundred and eighty DEGs were identified in the DS2 analysis. Four thousand six hundred and six genes were found to be upregulated, and 3744 were found to be downregulated in SARS-CoV-2 infected samples compared to uninfected controls. Upregulated genes were enriched in components of [...] included in the interleukin-12 and 17 signaling pathways. In contrast, interleukin-4 and 13 signaling related genes tended to be downregulated by chloroquine treatment (NES = −1.45, p-adj = 4.30e-02). Genes involved in viral mRNA translation and the ISG15 antiviral mechanism were also upregulated in the gene expression profiles induced by treatment with chloroquine, phenylbutazone, and troglitazone. In addition, the SARS-CoV infection pathway was found to be upregulated in samples treated with chloroquine and troglitazone. Genes related to ADORA2B mediated anti-inflammatory cytokine production were downregulated by treatment with the three top negatively correlated drugs.
DS3 presented the lowest yield in terms of differentially expressed genes. One hundred and eighty-eight genes were found to be upregulated, whereas 31 genes were found to be downregulated in infected samples compared to controls. Twenty-nine biological processes were found to be significantly upregulated and were mainly linked to mechanisms aimed at fighting the viral infection and to immune system-related processes, including defense response to virus (p-adj = 7.2e-13), myeloid leukocyte mediated immunity (p-adj = 8.8e-15), regulation of cytokine production (p-adj = 1.5e-08), and response to interferon-gamma (p-adj = 1.9e-08) among others. Chloroquine was found to be the top negatively correlated drug (r = −0.11), followed by others such as pazopanib, spectinomycin, and troglitazone (r = −0.11, −0.11, −0.10).

The correlations observed in this dataset tended to be weaker than those computed for DS1 and DS2.
GSEA analyses of the drug signatures showed that troglitazone increased the expression of genes classified as potential therapeutics for SARS (NES = 1.46, p-adj = 4.65e-02), as well as antiviral pathways such as the ISG15 and IFN-stimulated antiviral mechanisms. Spectinomycin was found to reduce the expression of genes related to interferon-gamma signaling and the interleukin 2, 3, and 5 pathways, whereas pazopanib was found to upregulate virus-related pathways such as viral mRNA translation, influenza, and SARS-CoV-2 infection.
Supplementary File 1 includes the complete differential gene expression and enrichment analysis results for transcriptomic datasets 1, 2, and 3, whereas Supplementary File 2 contains the full LINCS L1000 analysis information.

Drugs, protein targets, and PDB structures included in this study
DrugBank queries yielded 1825 medications approved by the American Food and Drug Administration (FDA). The identified drugs had 1821 known unique protein targets, for which 27839 tridimensional structures were available in the Protein Data Bank. Barcodes associated with the first three persistent similarity measures were successfully calculated for 25800 out of the 27839 structures, whereas computational limitations prevented us from estimating the remaining 1622 structures' barcodes. We also retrieved multiple protein structures from SARS-CoV-2 that were available in PDB, including the Spike protein receptor-binding domain, the RNA-dependent RNA polymerase (NSP12), the endoribonuclease (NSP15), the ADP ribose phosphatase (NSP3), the RNA binding protein (NSP9), the 3C-like protease, and NSP8 and NSP7.
In total, we calculated the barcodes of 23 viral protein structures. Table 2 shows the complete information regarding the included SARS-CoV-2 protein structures. We compared the twenty-three PDB structures derived from SARS-CoV-2 with 25800 structures belonging to proteins that are known targets of FDA approved drugs through the computation of 593400 persistent similarity measures. Based on the results of the Skolnick, SCOPe, and cytomegalovirus analyses, we selected a stringent mean persistent similarity threshold of 0.9 in order to call two protein structures similar. Three viral structures, the 3CL protease (6M2Q), the RNA-dependent RNA polymerase (6M71), and the NSP15 endoribonuclease (6W01), presented mean persistent similarity values higher than the selected threshold with proteins known to be targeted by approved drugs. The 3CL protease was found to be associated with 284 PDB structures (Supplementary Table 1), most of them classified as aldo/keto reductases and protein kinases, which were targeted by 55 different pharmacological compounds (Supplementary Table 2). The RNA-dependent RNA polymerase was found to be significantly associated with 361 PDB structures (Supplementary Table 3), which in many cases belonged to the protein kinase and flavin-containing oxidoreductase families, and that were found to be targeted by 204 unique drugs (Supplementary Table 4). Finally, the viral NSP15 endoribonuclease presented topological similarity values higher than 0.9 with 13 PDB structures [...]
Drugs known to target those proteins presenting high topological similarity values with the SARS-CoV-2 structures were subjected to blind docking with the viral proteins. A set of potential repurposable candidates was then selected based on the topological similarity criteria, the transcriptomic effects that they exert, and the binding energies derived from the blind docking analyses.
Therefore, the selected candidates are known to target proteins with high topological similarity with a specific viral protein, present high affinities with the viral structures, and have the capacity to partially revert the transcriptomic effects induced by the viral infection. The full description of the candidates can be consulted in Table 3.
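The thresholding step can be sketched as a simple filter. This is an illustrative Python fragment under assumptions: the mean is taken over three per-dimension similarity values for each candidate structure (our reading of "mean persistent similarity"), and the function name, identifiers, and scores are hypothetical rather than taken from the results.

```python
from statistics import mean

# Threshold reported in the text for calling two structures similar.
SIMILARITY_THRESHOLD = 0.9


def similar_targets(similarities, threshold=SIMILARITY_THRESHOLD):
    """Return identifiers whose mean similarity exceeds the threshold.

    similarities maps a structure identifier to its three per-dimension
    similarity values against one viral structure (assumed layout).
    """
    return sorted(
        identifier
        for identifier, values in similarities.items()
        if mean(values) > threshold
    )


# Hypothetical scores against a single viral structure.
toy_scores = {
    "target_1": (0.95, 0.92, 0.91),  # mean about 0.927, kept
    "target_2": (0.99, 0.80, 0.85),  # mean 0.88, discarded
    "target_3": (0.97, 0.96, 0.93),  # mean about 0.953, kept
}
selected = similar_targets(toy_scores)
```

Averaging means that a single very high value cannot compensate for two mediocre ones, which is what makes the cutoff stringent.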

Discussion
On 31st December 2019, the World Health Organization (WHO) was officially notified about several cases of pneumonia in Wuhan City, China, caused by COVID-19, a disease with no effective treatment or specific vaccine. It is a disease whose history and quest for a cure are a daily struggle and are constantly being rewritten. Because specific antiviral treatments and vaccines are still under development, drug repurposing strategies suggesting the use of FDA approved drugs for other conditions quickly became the only option to treat COVID-19. However, to date no therapeutic agents have been proven effective. Several treatments have been reported to be under investigation specifically to treat COVID-19 as the result of drug repurposing strategies [28] [29] and, as this draft is being written, up to 700 research papers have already been published. The number of clinical trials using repurposed drugs such as hydroxychloroquine, remdesivir, and lopinavir/ritonavir among others, alone or in combination, is also growing exponentially, although in most cases the results are unfortunately not as good as initially expected [30,31,32].
Here, we report a novel TDA-based strategy for drug repurposing in combination with current methodologies of molecular docking, differential expression analysis of SARS-CoV-2 infected cells, and correlation with the transcriptomic profiles of FDA approved drugs. Our results indicate that the proposed TDA-based formalism is an excellent tool to address biological problems from a dual perspective. In the first place, from a structural biology point of view, we use the Vietoris-Rips complex to compute the barcodes encoding the shape of each protein structure. Next, in combination with its Persistent Betti Functions, we transform individual barcodes into one-dimensional functions to measure a degree of similarity between proteins. This allowed us to classify proteins based solely on the Cα atomic coordinates.
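The idea of turning a barcode into a one-dimensional function and comparing two such functions can be sketched as follows. This is a hedged Python illustration, not the similarity measure defined in this work: the Betti curve counts the bars alive at each scale, and the overlap ratio used here (sum of pointwise minima over sum of pointwise maxima on a common grid) is an assumption chosen only because it is bounded between 0 and 1. Function names and toy barcodes are hypothetical.

```python
def betti_curve(bars, grid):
    """Number of bars (birth, death) alive at each scale t, with birth <= t < death."""
    return [sum(1 for birth, death in bars if birth <= t < death) for t in grid]


def curve_similarity(curve_a, curve_b):
    """Overlap ratio of two curves sampled on the same grid (assumed measure).

    Returns 1.0 for identical curves and decreases toward 0 as they diverge.
    """
    overlap = sum(min(a, b) for a, b in zip(curve_a, curve_b))
    union = sum(max(a, b) for a, b in zip(curve_a, curve_b))
    return 1.0 if union == 0 else overlap / union


# Common grid of scales from 0 to 10 in steps of 0.5 (hypothetical units).
grid = [0.5 * i for i in range(21)]

# Two hypothetical barcodes that differ only in the death of one bar.
bars_a = [(0.0, 4.0), (0.0, 6.0), (1.0, 3.0)]
bars_b = [(0.0, 4.0), (0.0, 5.0), (1.0, 3.0)]

curve_a = betti_curve(bars_a, grid)
curve_b = betti_curve(bars_b, grid)
similarity = curve_similarity(curve_a, curve_b)
```

One such value can be computed per homological dimension, which gives one natural way to obtain several similarity numbers per protein pair.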
Persistent homology has been previously proposed as a method to study the topological invariants of the three-dimensional structure of biomolecules. Several studies have used TDA-based methods to classify protein structures using only the three-dimensional coordinates of the atoms from crystallographically resolved proteins. For instance, Xia and collaborators [33] performed persistent homology analysis of three-dimensional biomolecular structures in order to study their structural characteristics, flexibility prediction, and folding properties. To this end, they defined the molecular topological fingerprints (MTFs) to extract the topological information from protein structures using persistent Betti numbers [34]. K. Dey and colleagues proposed another topology-based method to create protein signatures and build a fast domain classifier using a support vector machine [35]. Interestingly, our mean persistence similarity metric was able to achieve results comparable to those obtained by the state-of-the-art structural alignment method, DALI [36], and presented a high predictive power when clustering proteins in terms of external classifications.
Molecular docking simulation is a rapid screening method to test the binding activity of compounds, and transcriptomic data represent a very rich alternative resource for inferring non-obvious relationships between drugs and genes. Previous in silico molecular docking studies have highlighted the potential of repurposed drugs for the treatment of COVID-19 [37, 38,39,40,41,42,43]. Here, we used in silico molecular docking combined with transcriptomic small molecule treatment data from LINCS L1000 to determine which [...]
Among all the SARS-CoV-2 proteins analyzed (n=23, Table 2), only three showed a persistent similarity score above 0.9 against other protein structures targeted by known drugs. Interestingly, these proteins are key components in coronavirus replication and structural assembly: the viral 3CL protease (6M2Q), a chymotrypsin-like protease that is essential for the production of non-structural proteins [44]; the nsp12 RNA-dependent RNA polymerase (6M71), the main component of the coronavirus replication and transcription machinery and therefore an excellent target for new therapeutics [45]; and the nsp15 endoribonuclease (6W01), a protein with a poorly defined role in SARS-CoV-2 infection that has been linked in other coronaviruses (SARS-CoV) to pRB downregulation, affecting host cell cycle division and coronavirus infection [46], and that also acts as an antagonist of host dsRNA sensors during coronavirus infection in macrophages to evade innate immune system defenses [47]. Hence, in this study, we selected these three SARS-CoV-2 proteins as the best candidates to find repurposed drugs to combat the disease.
Our differential expression analyses revealed that troglitazone, niclosamide and chloroquine, among mul[...]
Rutin and indomethacin were among the notable compounds selected for the 3CL main protease. Besides, they have proven to be good candidates in other studies. Rutin is a polyphenolic flavonoid that has shown a wide range of pharmacological applications due to its significant antioxidant properties [52]. Our results from GSEA analyses revealed that rutin might act in the early stages of SARS-CoV-2 infection by activating the interferon-induced ISG15 pathway. ISG15 is an interferon-induced protein that has been implicated as a central player in the host antiviral response, and it is a key element of the innate immune response against viral infection [53]. Besides, ISG15 modulates the immune system by stimulating IFN-gamma production by NK cells, which promotes the early viral response [54]. Although the possible interaction between rutin and the 3CL protease has been reported by other studies using an in silico approach [55], our results provide a transcriptomic dimension to the possible effect of rutin during infection with SARS-CoV-2. Moreover, to our knowledge this is the first time that the natural compound rutin has been related to the antiviral activity induced by the protein ISG15.
Dexamethasone, a corticosteroid used in a wide range of conditions for its anti-inflammatory and immunosuppressive effects, could be one of the most promising repurposed drugs chosen to treat COVID-19, based on results showing a lower incidence of death than in the usual care group among patients receiving invasive mechanical ventilation [56]. This compound was chosen because of its properties as an immunosuppressant to treat the cytokine storm induced by the immune response to coronavirus infection in the late stages of the disease. Nonetheless, our results indicated that dexamethasone could also be a good candidate to target the nsp15 endoribonuclease, although some repurposing studies have also suggested that it targets the main protease [57]. These data could support the idea of giving corticosteroids not just in the advanced infection stage but also at the beginning. However, a recent study [58] tested multiple pharmacological compounds derived from steroids in vitro and demonstrated that dexamethasone has no antiviral activity against SARS-CoV-2. Nevertheless, other corticosteroids that we also found could interact with the nsp15 protein, such as mifepristone, suppressed viral growth, conferring a cell survival rate of more than 95% after viral infection and drug administration in vitro [58].
Lastly, the RNA-dependent RNA polymerase nsp12 of SARS-CoV-2 is a protein that performs essential functions in the coronavirus life cycle and has no host cell homolog. This gives an advantage for antiviral drug development, since it reduces the risk of affecting any protein in human cells, as reported by many drug repurposing studies directed against the nsp12 RdRP [59,60,61,62]. Vemurafenib, sorafenib and raloxifene may be potential candidates against the nsp12 RdRP. Vemurafenib can disturb the cellular Raf/MEK/ERK signaling cascade by binding to the ATP-binding site of the BRAF(V600E) kinase and inhibiting its function [63], whereas sorafenib is another kinase inhibitor that targets VEGFR, PDGFR, and RAF kinases [64].
In conclusion, our strategy based on Quantitative Homological Similarities through a TDA-based formalism would allow researchers and clinicians to select optimal candidates for drug repurposing and hit the bullseye not only of the SARS-CoV-2 coronavirus but also of any new viruses that may appear in the future, by choosing the best targets among all virus proteins. In this specific case, targeting the nsp15 endoribonuclease and the nsp12 RNA polymerase, in addition to other promising drug targets such as the 3CL main protease, could support the development of a cocktail of anti-coronavirus treatments that could also potentially be used for the discovery of broad-spectrum antivirals. Furthermore, by choosing a precision multidrug treatment, we could rescue any specific drug failure or avoid future drug resistance due to mutations acquired in any of the proteins as a consequence of continuous virus replication and spreading, since the virus would be attacked on different fronts. Nevertheless, our results based on multidrug combinations should be validated in both in vitro and in vivo experiments, not just to prove the effectiveness of the treatment but also to select the best combination against SARS-CoV-2 infection and the consequent disease symptoms.

Data obtention
DrugBank queries were carried out [11] to retrieve the information regarding medications with known protein targets. In short, the DrugBank database version 5.1.5 was downloaded in XML format, and the dbparser package [68] and custom R scripts were employed to extract the relevant information. We only selected drugs approved by the American Food and Drug Administration (FDA) and retrieved the names and UniProt identifiers of their protein targets. Then, UniProt IDs were mapped to their respective Protein Data Bank (PDB) structures using the Retrieve/ID mapping tool available at UniProt. All the PDB structures targeted by FDA approved drugs were downloaded in PDB format and stored for downstream analysis.

Protein Data Bank queries were also performed to identify the three-dimensional structures of SARS-CoV-2 proteins.

Data preparation and barcodes computation
All protein structures in PDB format were loaded into the R environment using the bio3d package [69].
Then, the coarse-grained representation of each structure was generated by selecting only the tridimensional [...] and observed that the all-atom model contained too many details which masked useful information, such as the Betti 1 barcodes indicating alpha helix structures [70]. Barcodes were constructed using the R package TDAstats [71]. TDAstats internally makes use of the Ripser C++ library [72], an optimized and fast software for Vietoris-Rips computation and barcode construction.
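The selection of alpha carbon coordinates can be illustrated outside R. The following is a minimal Python sketch, offered as a stand-in for the bio3d-based step rather than a reproduction of it: it reads fixed-width ATOM records and keeps atoms named CA. The function names and the toy records are hypothetical, and details such as alternate locations, multiple chains, and multiple models are deliberately ignored.

```python
def atom_line(serial, name, res, chain, resseq, x, y, z):
    """Build a toy fixed-width PDB ATOM record (hypothetical demo helper)."""
    return (
        f"ATOM  {serial:>5} {name:<4} {res:>3} {chain}{resseq:>4}    "
        f"{x:8.3f}{y:8.3f}{z:8.3f}"
    )


def calpha_coordinates(pdb_text):
    """Return (x, y, z) tuples of alpha carbons found in ATOM records.

    Column positions follow the fixed-width PDB layout: atom name in
    columns 13-16 and x, y, z in columns 31-38, 39-46, and 47-54.
    """
    coords = []
    for line in pdb_text.splitlines():
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            coords.append(
                (float(line[30:38]), float(line[38:46]), float(line[46:54]))
            )
    return coords


# Hypothetical two-residue fragment.
toy_pdb = "\n".join(
    [
        atom_line(1, " N  ", "MET", "A", 1, 27.340, 24.430, 2.614),
        atom_line(2, " CA ", "MET", "A", 1, 26.266, 25.413, 2.842),
        atom_line(3, " C  ", "MET", "A", 1, 26.913, 26.639, 3.531),
        atom_line(4, " CA ", "GLN", "A", 2, 25.112, 24.880, 3.649),
    ]
)
calpha = calpha_coordinates(toy_pdb)
```

Restricting the search to ATOM records avoids confusing alpha carbons with calcium ions, which also carry the name CA but are stored as HETATM records.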

Validation of the Betti persistence similarity function
Two independent datasets were used to test the ability of the persistence Betti function to cluster proteins based on previous structural classifications. The first, termed the Skolnick dataset, includes a collection of manually curated domains from the Catalytic Site Atlas (CSA) [73]. In particular, 40 proteins are classified into four structural catalytic families, including Flavodoxin-like fold CheY-related, Plastocyanin, TIM Barrel, and Ferritin [74] [75]. To construct the second dataset we carried out a random sampling of 500 protein structures derived from 10 different protein superfamilies from SCOPe [76]. Supplementary Table (table to be inserted) contains the information regarding the superfamilies and protein structures selected for the SCOPe analysis. In short, barcodes were computed for all the involved protein structures and pairwise similarity matrices were computed using the Betti persistent similarity function.
The human cytomegalovirus (HCMV) is a widespread pathogen that belongs to the subfamily of the beta-herpesviruses. Whereas HCMV often generates persistent asymptomatic infections in healthy people, it can produce severe complications in immunosuppressed individuals [77]. Cidofovir, a nucleotide analogue that inhibits the viral DNA polymerase, is used to treat patients with severe HCMV infection [12]. To test whether our TDA-based drug repurposing strategy was able to identify potential medications for the treatment of HCMV infection, we retrieved the crystallographic structure of the HCMV DNA polymerase from PDB (PDB ID: 1T6L [78]) and computed persistent Betti similarities to all other protein structures that are known targets of FDA approved drugs.

Protein-ligand binding with AutoDock 4.2
Ligand preparation was carried out as follows: First, the FDA-approved drugs in SDF format were [...] AutoDock does not include the values of their atomic force fields, and it is, therefore, unable to perform molecular docking using them. Polar hydrogens were also added to the SARS-CoV-2 protein PDB structures, which were also transformed to the PDBQT format. Docking was carried out using AutoDock 4.2 [80], a molecular docking software developed by the Scripps Research Institute. A grid box spanning the whole protein structure was set in order to perform blind docking. AutoDock was configured following the manual recommendations [81]. We increased the parameter ga_runs from 10 to 150 to improve the accuracy of the results.
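A grid box that spans a whole structure can be derived from the bounding box of its atomic coordinates. The snippet below is a rough Python sketch under assumptions, not the configuration used in this work: the 0.375 angstrom spacing and the 5 angstrom padding are illustrative defaults, the function name is hypothetical, and the number of grid intervals per axis is rounded up to an even value, which is our understanding of the usual grid convention.

```python
import math


def blind_docking_box(coords, spacing=0.375, padding=5.0):
    """Return (center, intervals) of a box enclosing all coordinates.

    coords is an iterable of (x, y, z) tuples in angstroms. Each axis gets
    an even number of grid intervals so that the box is symmetric around
    its center (assumed convention). padding is added on every side.
    """
    coords = list(coords)
    lows = [min(c[axis] for c in coords) for axis in range(3)]
    highs = [max(c[axis] for c in coords) for axis in range(3)]
    center = tuple((lo + hi) / 2 for lo, hi in zip(lows, highs))
    intervals = tuple(
        2 * math.ceil(((hi - lo) / 2 + padding) / spacing)
        for lo, hi in zip(lows, highs)
    )
    return center, intervals


# Hypothetical extreme atoms of a small protein.
toy_atoms = [(0.0, 0.0, 0.0), (30.0, 20.0, 10.0)]
center, intervals = blind_docking_box(toy_atoms)
```

For large proteins the resulting interval counts may exceed what the docking software accepts, in which case a coarser spacing is the usual compromise.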

Differential gene expression analyses of SARS-CoV-2 infected human samples and cell lines and uninfected controls.
We carried out searches for transcriptomic datasets of patients and human-derived cell lines including samples infected with SARS-CoV-2 and uninfected controls. At the time the searches were carried out, three datasets were identified. Dataset 1 (DS1) was found in the Gene Expression Omnibus (GEO) under ID GSE150316 [82]. It includes formalin-fixed paraffin-embedded samples from multiple tissues (i.e., lung, jejunum, heart) derived from SARS-CoV-2 infected individuals and uninfected controls obtained in autopsies. We restricted our analysis to lung samples. Twenty-one samples (16 cases and five controls) were selected for downstream analysis.
Dataset 2 (DS2) [83] gathers samples derived from bronchoalveolar lavage fluids (BALF) of SARS-CoV-2 infected patients (four samples derived from two patients with two technical replicates) and three healthy controls. Samples derived from infected patients were stored at the National Genomics Data Center under accession number CRA002390, whereas control samples were downloaded from the NCBI SRA database and were available under the identifiers SRR10571724, SRR10571730, and SRR10571732. Sequence alignment using the human reference genome GRCh38 and count extraction were carried out using the Rsubread package [84].
Finally, the third dataset (DS3) was available in GEO under accession ID GSE147507 [85]. It presented a complex design including both primary cell lines derived from the human lung epithelium and transformed lung alveolar cells, which were either mock treated or infected with different viruses including the influenza A virus (IAV), the respiratory syncytial virus (RSV), and SARS-CoV-2, as well as samples derived from infected ferrets and two technical replicates of a lung sample derived from a SARS-CoV-2 infected human patient. We restricted our analysis to the cell lines NHBE, A549, and Calu-3, which were either infected with SARS-CoV-2 or mock treated. The infected human lung samples and the healthy lung biopsies were also included. Overall, twenty-eight samples were analyzed in this dataset.
For each dataset, differential gene expression analysis between SARS-CoV-2 infected samples and uninfected controls was carried out using the DESeq2 package [86].

Gene Set Enrichment Analysis (GSEA)
440 Dysregulated biological processes were identified for each transcriptomic dataset using the pre-ranked Gene Set Enrichment Analysis (GSEA) implementation of fgsea package [88]. The C5 molecular signatures collection, which contains gene sets derived from the three branches of Gene Ontology (GO) was used as a source of functional information. GO terms including more than 500 or less than 15 genes were filtered out.
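The gene set size filter described above is easy to state precisely in code. The following Python fragment is a small illustration, not part of the analysis pipeline, which relied on R packages: the function name and the toy gene sets are hypothetical, and sizes are counted over unique gene identifiers.

```python
def filter_gene_sets(gene_sets, min_size=15, max_size=500):
    """Keep gene sets with between min_size and max_size unique genes.

    Sets with fewer than 15 or more than 500 genes are dropped, matching
    the bounds stated in the text. Both bounds are inclusive.
    """
    return {
        name: genes
        for name, genes in gene_sets.items()
        if min_size <= len(set(genes)) <= max_size
    }


def make_genes(count):
    """Generate placeholder gene identifiers (hypothetical demo helper)."""
    return [f"GENE{i}" for i in range(count)]


# Hypothetical GO terms sitting on each side of both bounds.
toy_sets = {
    "GO_TOO_SMALL": make_genes(14),
    "GO_LOWER_BOUND": make_genes(15),
    "GO_UPPER_BOUND": make_genes(500),
    "GO_TOO_LARGE": make_genes(501),
}
kept = filter_gene_sets(toy_sets)
```

Very small sets give unstable enrichment scores and very large ones are rarely informative, which is the usual motivation for bounds of this kind.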
GSEA analyses were also performed for those LINCS L1000 level 5 expression signatures negatively correlated 445 with the differential gene expression profiles generated by the SARS-CoV-2 infection to determine their effect in specific pathways and biological processes. Reactome (version 73) was used as a source of pathway information and analyses were carried out using the clusterprofiler package [89]. Biological processes and pathways presenting false discovery rate (FDR) adjusted p-values were called to be significantly dysregulated.
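As an illustration of FDR adjustment, the following minimal sketch implements the Benjamini-Hochberg procedure in plain Python. Two assumptions are made here and are not stated in the text: that the FDR adjustment is Benjamini-Hochberg, and that the cutoff `ALPHA = 0.05` is used (a hypothetical value chosen only for the example).

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values, in the input order."""
    n = len(pvalues)
    # Indices of the p-values sorted in increasing order.
    order = sorted(range(n), key=lambda i: pvalues[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value to the smallest, enforcing monotonicity.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * n / rank)
        adjusted[i] = running_min
    return adjusted


pvals = [0.01, 0.04, 0.03, 0.005]
adjusted = benjamini_hochberg(pvals)
ALPHA = 0.05  # hypothetical cutoff, not taken from the text
significant = [q < ALPHA for q in adjusted]
```

For the four p-values above, the adjusted values are approximately 0.02, 0.04, 0.04, and 0.02, so all four terms would be called significant under the assumed cutoff.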

ities of pairwise three-dimensional molecules considered as surfaces
In this section we borrow well-known concepts from Algebraic Topology [90] and Persistent Homology [91] (see also [92] and references therein) to introduce a mathematical formalism that allows us to compare three-dimensional protein structures considered as surfaces. This means that we only consider a shape composed of a union of two-dimensional faces, each of them composed of one-dimensional segments, which are constructed from a finite set of zero-dimensional points contained in a three-dimensional space. Intuitively, we assume that a molecule is a kind of graph embedded in a three-dimensional space and we take into account the path followed by the molecule as it folds.

Simplicial complexes for three-dimensional molecules considered as surfaces
Throughout this paper we will identify a molecule M with a finite set of 3-dimensional data points denoted by $M = \{x_1, x_2, \ldots, x_{|M|}\} \subset \mathbb{R}^3$, where we assume that the number of points $|M|$ is a large natural number, together with a set, denoted by $S(M)$, containing the information on the molecular combinatorial structures at different scales and their relationships. To describe this kind of molecular geometry we use the so-called k-simplexes ($0 \le k \le 2$) defined from the data set M as follows.
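• A 0-simplex, also called a vertex, is generated by an individual point $x_0 \in M$ and we will denote it by $[x_0]$.
• A 1-simplex, or segment, is generated by two points $x_0, x_1 \in M$ and we will denote it by $[x_0, x_1]$.
• A 2-simplex, or two-dimensional face, is generated by three points $x_0, x_1, x_2 \in M$ and we will denote it by $[x_0, x_1, x_2]$.
Given $k = 1, 2$ and for $0 \le i \le k$ we can define the operator $R_i$ that for each k-simplex removes the i-th position:
$$R_i([x_0, \ldots, x_k]) = [x_0, \ldots, x_{i-1}, x_{i+1}, \ldots, x_k].$$
In order to have a well-defined map from a set of k-simplexes to a set of (k − 1)-simplexes, we construct a finite sequence of sets of k-simplexes, denoted by $S_k(M)$ ($k \in \mathbb{Z}$), obtained as follows.
• $S_k(M) = \emptyset$ for every integer number $k \neq 0, 1, 2$;
• otherwise, $S_k(M)$ is a set of k-simplexes generated by points of M.
Since we consider that $x_i \in \mathbb{R}^3$ for $1 \le i \le |M|$, and that M is a surface, we associate to our molecule M three non-empty sets of simplexes, $S_0(M)$, $S_1(M)$, and $S_2(M)$; recall that $S_k(M) = \emptyset \subset S(M)$ for every integer number $k \neq 0, 1, 2$. As we will see below, this will have some consequences. From the construction of $S(M)$ the following two properties hold.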
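The operator $R_i$ that removes the i-th position of a k-simplex can be sketched as follows. This is a minimal illustration, assuming that a simplex is stored as a Python tuple of point labels; the function name `remove_position` is hypothetical.

```python
def remove_position(simplex, i):
    """Apply R_i: drop the i-th entry of a simplex given as a tuple of labels."""
    if not 0 <= i < len(simplex):
        raise IndexError("position i must satisfy 0 <= i <= k")
    return simplex[:i] + simplex[i + 1:]


# The three faces R_0, R_1, R_2 of the 2-simplex [x0, x1, x2].
triangle = ("x0", "x1", "x2")
faces = [remove_position(triangle, i) for i in range(len(triangle))]
print(faces)  # [('x1', 'x2'), ('x0', 'x2'), ('x0', 'x1')]
```

Each application of $R_i$ lowers the dimension by one, which is why the sets of simplexes at consecutive levels have to be compatible with it.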

Simplicial homology for three-dimensional molecules considered as surfaces
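In order to perform geometric operations (like unions of simplexes) at each k-level and to describe the relationship between two consecutive levels (like cutting the faces of a simplex) we endow each $S_k(M)$ with the algebraic structure of a vector space over a finite field of scalars. To this end, we consider the finite field $\mathbb{Z}_2 = \{0, 1\}$. We recall that the two operations over $\mathbb{Z}_2$ are the sum and the multiplication:
$$0 + 0 = 0, \quad 0 + 1 = 1 + 0 = 1, \quad 1 + 1 = 0; \qquad 0 \cdot 0 = 0 \cdot 1 = 1 \cdot 0 = 0, \quad 1 \cdot 1 = 1.$$
Now, for $0 \le k \le 3$ we introduce the vector space of formal series of k-simplexes with coefficients over the finite field $\mathbb{Z}_2$ as
$$\mathbb{Z}_2[S_k(M)] = \Big\{ \sum_{\sigma \in S_k(M)} \lambda_\sigma \, \sigma : \lambda_\sigma \in \mathbb{Z}_2 \Big\}.$$
Observe that if $\sigma \in S_k(M)$ then we can identify this simplex with the formal series, also denoted by $\sigma$, given by $\sigma = 1 \cdot \sigma$.
For a molecule with four points and six 1-simplexes (see Figure 5), we can identify $\mathbb{Z}_2[S_0(M)] \equiv \mathbb{Z}_2^4$ by using the vectors of the canonical basis, from $(1, 0, 0, 0)$ to $(0, 0, 0, 1)$, one for each vertex. Likewise, we can identify $\mathbb{Z}_2[S_1(M)] \equiv \mathbb{Z}_2^6$ by using, one for each 1-simplex, the vectors $(1, 0, 0, 0, 0, 0)$, $(0, 1, 0, 0, 0, 0)$, $(0, 0, 1, 0, 0, 0)$, $(0, 0, 0, 1, 0, 0)$, $(0, 0, 0, 0, 1, 0)$, and $(0, 0, 0, 0, 0, 1)$.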

Now, we can identify
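From now on, we identify
• If $k = 1, 2$ then we can extend the map $R_i : S_k(M) \to S_{k-1}(M)$ by linearity to a map $R_i : \mathbb{Z}_2[S_k(M)] \to \mathbb{Z}_2[S_{k-1}(M)]$ given by $R_i\big(\sum_{\sigma} \lambda_\sigma \, \sigma\big) = \sum_{\sigma} \lambda_\sigma \, R_i(\sigma)$. Now, $R_i$ is a linear map between vector spaces for $0 \le i \le k$.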

Homology groups and features for three-dimensional molecules considered as surfaces
Next, we associate an incidence matrix (defined by a linear map between vector spaces) to each pair of consecutive levels (k − 1, k), as we show in the next example (see Table 5).
The formal way to introduce the above matrices is the following. For $1 \le k \le 2$ we can define the following linear map between vector spaces:
$$\partial_{k-1,k} : \mathbb{Z}_2[S_k(M)] \to \mathbb{Z}_2[S_{k-1}(M)], \qquad \partial_{k-1,k} = \sum_{i=0}^{k} R_i.$$
This map uses the whole set $\{R_0, R_1, \ldots, R_k\}$ of remove-the-i-th-position linear maps for $0 \le i \le k$. Observe that for $k = 0$ we have the zero map, and also for $k = 3$ we obtain the zero map. Finally, we consider that $\partial_{k-1,k} = 0$ for all integers $k$ such that $k \neq 0, 1, 2$.
To better understand the role of these maps, observe that if we have two simplexes $[x_0, x_1, x_2]$ and $[x_3, x_4, x_5]$ in $\mathbb{Z}_2[S_2(M)]$ without common faces then the union of both is described under this algebraic framework by the sum $[x_0, x_1, x_2] + [x_3, x_4, x_5]$ (see Figure 6(a)). By using the map $\partial_{1,2}$ we obtain its description in $\mathbb{Z}_2[S_1(M)]$:
$$\partial_{1,2}([x_0, x_1, x_2] + [x_3, x_4, x_5]) = [x_1, x_2] + [x_0, x_2] + [x_0, x_1] + [x_4, x_5] + [x_3, x_5] + [x_3, x_4],$$
that is, the sum of the six faces of the two simplexes (see Figure 6(a)). They represent the total number of faces in their union. However, if we consider the union of two simplexes $[x_0, x_1, x_2]$ and $[x_1, x_2, x_3]$ with a common face $[x_1, x_2]$ (see Figure 6(b)), then its description in $\mathbb{Z}_2[S_1(M)]$ is now
$$\partial_{1,2}([x_0, x_1, x_2] + [x_1, x_2, x_3]) = [x_0, x_2] + [x_0, x_1] + [x_2, x_3] + [x_1, x_3],$$
because $[x_1, x_2] + [x_1, x_2] = 0$. This fact implies that the union of both is now described by the four non-common faces, forgetting the inner common face (see Figure 6(b)).
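The two situations of Figure 6 can be reproduced with a short sketch. It assumes a representation that is not taken from the text: a formal sum over $\mathbb{Z}_2$ is stored as a Python set of simplexes (tuples of integer point labels), so that adding a simplex twice cancels it. The function name `boundary` is hypothetical.

```python
def boundary(chain):
    """Boundary over Z_2 of a chain given as a set of simplexes (tuples).

    Every simplex contributes all faces obtained by removing one position;
    a face that appears an even number of times cancels out (1 + 1 = 0).
    """
    result = set()
    for simplex in chain:
        if len(simplex) < 2:
            continue  # the boundary of a vertex is zero
        for i in range(len(simplex)):
            face = simplex[:i] + simplex[i + 1:]
            result ^= {face}  # symmetric difference = addition mod 2
    return result


# Two triangles without common faces: six edges survive.
disjoint = boundary({(0, 1, 2), (3, 4, 5)})
print(len(disjoint))  # 6

# Two triangles sharing the edge (1, 2): the common edge cancels.
glued = boundary({(0, 1, 2), (1, 2, 3)})
print(sorted(glued))  # [(0, 1), (0, 2), (1, 3), (2, 3)]

# Applying the boundary twice always gives the zero chain.
print(boundary(glued))  # set()
```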
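Moreover, for the matrices associated to the linear maps $\partial_{0,1}$ and $\partial_{1,2}$ it is not difficult to see that the matrix product $\partial_{0,1} \cdot \partial_{1,2}$ is the zero matrix. Then the vector space generated by the columns of the matrix $\partial_{1,2}$, denoted by $\operatorname{Col} \partial_{1,2}$, is contained in the vector space, denoted by $\operatorname{Nul} \partial_{0,1}$, of the solutions of the homogeneous linear system with matrix $\partial_{0,1}$. Thus, we have the following subspaces: $\operatorname{Col} \partial_{1,2} \subset \operatorname{Nul} \partial_{0,1} \subset \mathbb{Z}_2^6$. In a similar way we have that $\operatorname{Col} \partial_{0,1} \subset \operatorname{Nul} \partial_{-1,0} = \mathbb{Z}_2^4$.
It is possible to prove that $\partial_{k-1,k} \cdot \partial_{k,k+1} = 0$ holds for all $0 \le k \le 3$ (indeed, it is true for every integer number $k$) if we introduce the zero map as $\partial_{k-1,k} = 0$ for every integer $k \neq 1, 2$. This property means that the linear subspace generated by the columns of the matrix $\partial_{k,k+1}$ is contained in the linear subspace of the solutions of the homogeneous linear system with matrix $\partial_{k-1,k}$. From the rank-nullity theorem, we know that
$$\dim \mathbb{Z}_2[S_k(M)] = \dim \operatorname{Nul} \partial_{k-1,k} + \dim \operatorname{Col} \partial_{k-1,k}.$$
In particular, $\operatorname{Nul} \partial_{-1,0} = \mathbb{Z}_2^{|M|}$ and $\operatorname{Col} \partial_{2,3} = \operatorname{Col} 0 = \{0\}$ is the trivial subspace. This allows us to introduce the vector space of k-features (known as the k-th homology group) of M by
$$H_k(M) = \operatorname{Nul} \partial_{k-1,k} / \operatorname{Col} \partial_{k,k+1}.$$
In particular, $H_0(M) = \mathbb{Z}_2^{|M|} / \operatorname{Col} \partial_{0,1}$ and $H_2(M) = \operatorname{Nul} \partial_{1,2}$. We recall that two vector spaces are linearly isomorphic if and only if they have the same dimension.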
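The dimensions of the spaces of k-features can be computed from the ranks of the incidence matrices, since $\dim H_k = \dim \operatorname{Nul} \partial_{k-1,k} - \dim \operatorname{Col} \partial_{k,k+1}$. The sketch below is only an illustration in plain Python: the function names are hypothetical, and the test shape (four points, six segments, and the four triangular faces of a hollow tetrahedron) is an assumed example, not necessarily the one shown in the paper's figures.

```python
from itertools import combinations


def rank_gf2(rows):
    """Rank over Z_2 of a matrix whose rows are stored as integer bitmasks."""
    pivots = {}  # highest set bit -> reduced row owning that pivot
    for row in rows:
        while row:
            high = row.bit_length() - 1
            if high not in pivots:
                pivots[high] = row
                break
            row ^= pivots[high]  # eliminate the leading bit (addition mod 2)
    return len(pivots)


def boundary_rank(simplexes, faces):
    """Rank of the incidence matrix between k-simplexes and (k-1)-simplexes."""
    index = {face: j for j, face in enumerate(faces)}
    rows = []
    for s in simplexes:
        mask = 0
        for i in range(len(s)):
            mask ^= 1 << index[s[:i] + s[i + 1:]]
        rows.append(mask)  # one row per simplex (rank equals that of the transpose)
    return rank_gf2(rows)


def betti_numbers(vertices, edges, triangles):
    """Dimensions of H_0, H_1, H_2 for a complex with simplexes up to level 2."""
    r1 = boundary_rank(edges, vertices)   # rank of the map from level 1 to 0
    r2 = boundary_rank(triangles, edges)  # rank of the map from level 2 to 1
    return (len(vertices) - r1, len(edges) - r1 - r2, len(triangles) - r2)


# Assumed example: the surface of a tetrahedron on four points.
vertices = list(combinations(range(4), 1))
edges = list(combinations(range(4), 2))
triangles = list(combinations(range(4), 3))
print(betti_numbers(vertices, edges, triangles))  # (1, 0, 1)
```

The result says that this surface has one connected component, no independent loops, and one enclosed cavity.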

Persistent homology for three-dimensional molecules considered as surfaces
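To describe the evolution of the features of a three-dimensional molecule M we can use the so-called Vietoris-Rips complex at scale ε. It is constructed as an approximation of $S(M)$, which is usually computationally intractable. To define it, fix some real number $\varepsilon > 0$ and construct a finite sequence of sets $S_{0,\varepsilon}(M)$, $S_{1,\varepsilon}(M)$, $S_{2,\varepsilon}(M)$, where the elements of each set are defined as follows: a k-simplex $[x_0, \ldots, x_k]$ belongs to $S_{k,\varepsilon}(M)$ if and only if $\|x_i - x_j\| \le \varepsilon$ holds for all $i \neq j$.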
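A brute-force construction of the Vietoris-Rips complex at a fixed scale can be sketched as follows. The sketch assumes the usual convention that a simplex is included when all pairwise Euclidean distances between its points are at most ε; the function name `vietoris_rips` and the four-point example are hypothetical, and the cubic cost of the loop over triples makes it suitable only for small point sets.

```python
from itertools import combinations
from math import dist


def vietoris_rips(points, eps):
    """Return the vertices, edges and triangles of the complex at scale eps.

    Simplexes are tuples of point indices; an edge is kept when its two points
    are at distance <= eps, a triangle when all three of its edges are kept.
    """
    n = len(points)
    vertices = [(i,) for i in range(n)]
    edges = [
        (i, j)
        for i, j in combinations(range(n), 2)
        if dist(points[i], points[j]) <= eps
    ]
    edge_set = set(edges)
    triangles = [
        t
        for t in combinations(range(n), 3)
        if all(pair in edge_set for pair in combinations(t, 2))
    ]
    return vertices, edges, triangles


# Hypothetical example: the corners of a unit square in the plane z = 0.
square = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (1.0, 1.0, 0.0), (0.0, 1.0, 0.0)]
_, edges_small, triangles_small = vietoris_rips(square, eps=1.0)
_, edges_large, triangles_large = vietoris_rips(square, eps=1.5)
print(edges_small)           # [(0, 1), (0, 3), (1, 2), (2, 3)]
print(len(triangles_large))  # 4
```

At scale 1 only the four sides of the square appear, leaving a loop; at scale 1.5 the diagonals enter and every triple of points spans a triangle.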
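Proceeding in a similar way as above we can construct vector spaces over the finite field $\mathbb{Z}_2$, obtaining that $\mathbb{Z}_2[S_{k,\varepsilon}(M)]$ is a linear subspace that depends on $\varepsilon > 0$, for $k \ge 1$. Moreover, we have a vector space $H_{k,\varepsilon}(M)$ of k-features at scale ε. For a feature γ we denote by $a_k(\gamma)$ and $b_k(\gamma)$ the scales at which it appears and disappears, respectively. Since $a_k(\gamma) \le b_k(\gamma)$ holds we will call the interval $[a_k(\gamma), b_k(\gamma)]$ the barcode of the feature γ.
In order to implement in practice the comparison between two molecules, a range represented by an interval $[\varepsilon_{\min}, \varepsilon_{\max}]$ of real numbers is chosen. This interval reflects the smallest and largest feature scales that we will consider. A maximal choice is to take $\varepsilon_{\min} = 0$ and $\varepsilon_{\max}$ the farthest distance between points of M. We can take values $\varepsilon_{\min} = \varepsilon_1 < \varepsilon_2 < \cdots < \varepsilon_m = \varepsilon_{\max}$. Let $N_k$ ($0 \le k \le 2$) be the number of vectors in the set $\{\gamma \in H_{k,\varepsilon_j}(M) : \text{for some } 1 \le j \le m\}$, whose barcodes we can order as $I_{k,1}, I_{k,2}, \ldots, I_{k,N_k}$ and represent graphically as we show in Figure 7.
Thus, we have that the k-level barcodes associated with the partition $P := \{\varepsilon_j\}_{j=1}^{m}$ are given by $B_k(M, P) := \{I_{k,\nu} : 1 \le \nu \le N_k\}$. To simplify notation, we will write $B_k$ instead of $B_k(M, P)$.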
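To make the notion of barcode concrete, the 0-level barcodes of a point set can be obtained with a union-find structure: every point starts as its own component at scale 0, and a component disappears when an edge of the Vietoris-Rips complex first connects it to another one. This is a standard shortcut for level 0 only and is not necessarily how the barcodes were computed in this work; it also assumes a continuous scale instead of the partition P, and the function name `h0_barcodes` is hypothetical.

```python
from itertools import combinations
from math import dist, inf


def h0_barcodes(points):
    """Return the 0-level barcodes (birth, death) of a finite point set."""
    n = len(points)
    parent = list(range(n))

    def find(i):
        # Find the representative of the component of i, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Edges enter the complex in order of increasing length.
    edges = sorted(
        (dist(points[i], points[j]), i, j) for i, j in combinations(range(n), 2)
    )
    bars = []
    for length, i, j in edges:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            parent[root_i] = root_j       # two components merge at this scale
            bars.append((0.0, length))    # one of them dies here
    if n > 0:
        bars.append((0.0, inf))           # the last component never dies
    return bars


# Three collinear points at positions 0, 1 and 3 on the x axis.
line = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (3.0, 0.0, 0.0)]
print(h0_barcodes(line))  # [(0.0, 1.0), (0.0, 2.0), (0.0, inf)]
```

In practice the infinite bar would be truncated at the largest scale under consideration.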

Persistent similarity using barcodes between molecules considered as surfaces
In order to determine the degree of similarity between two barcodes from proteins we need to set a similarity metric. Based on the barcode concept, it is possible to build a model that enables us to study the structure of the proteins at different k-scales. To this end we introduce the so-called k-th Persistent Betti Function (PBF) [10], which depends on the following parameters:
• $w_k^{(j)}$ is the weight for the j-th associated k-feature. Usually $w_k^{(j)} = 1$ is considered for all k and j.
• σ is the resolution parameter. Usually σ = 1 is considered; otherwise, the resolution parameter can be changed to observe variations of the structural properties at various scales.
• Finally, κ is the kernel scale parameter.
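The role of the three parameters can be illustrated with a small sketch. It assumes one common form of the PBF found in the persistent homology literature, in which each barcode $[a_j, b_j]$ contributes a bump centred at its midpoint with width proportional to its length, $w_j \exp\big(-\big(|x - (a_j + b_j)/2| / (\sigma (b_j - a_j))\big)^{\kappa}\big)$. This exact form, the default `kappa=2.0`, and the function name are assumptions, not taken from the text; the authors' own implementation was written in R.

```python
from math import exp


def persistent_betti_function(x, barcodes, weights=None, sigma=1.0, kappa=2.0):
    """Evaluate an assumed PBF at x for a list of finite barcodes (a, b)."""
    if weights is None:
        weights = [1.0] * len(barcodes)  # usual choice: all weights equal to 1
    total = 0.0
    for (a, b), w in zip(barcodes, weights):
        length = b - a
        if length <= 0:
            continue  # zero-length barcodes carry no information here
        centre = 0.5 * (a + b)
        total += w * exp(-((abs(x - centre) / (sigma * length)) ** kappa))
    return total


bars = [(0.0, 2.0)]
print(persistent_betti_function(1.0, bars))  # 1.0 (x is the midpoint of the bar)
```

Increasing σ widens every bump, so features are blurred over a larger range of scales, while κ controls how sharply each bump decays away from its centre.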
In this way, it is possible to transform the complexity of the three-dimensional protein structure and the barcodes into unidimensional continuous functions. Therefore, only 3 Persistent Betti Functions (PBFs), one for each feature level, are needed to represent a protein tertiary structure. To compare two protein structures, namely M and N, we construct their corresponding families of PBFs, denoted by $\{f_M(x, B_k)\}_{k=0}^{2}$ and $\{f_N(x, B_k)\}_{k=0}^{2}$. We implemented the Persistent Betti Functions in R (see [10] and references therein). Then, with the help of the family of PBFs, we introduce the following k-similarity measure between two molecules M and N ($0 \le k \le 2$).
Definition 4.9. We call k-persistent similarity measure ($0 \le k \le 2$) between two molecules M and N the number
$$PS_k(M, N) = \frac{\int_{\mathbb{R}} \min(f_M(x, B_k), f_N(x, B_k)) \, dx}{\int_{\mathbb{R}} \max(f_M(x, B_k), f_N(x, B_k)) \, dx}.$$
Observe that since $0 \le \min(f_M(x, B_k), f_N(x, B_k)) \le \max(f_M(x, B_k), f_N(x, B_k))$ holds for all $x \in \mathbb{R}$ and $\max(f_M(x, B_k), f_N(x, B_k))$ decreases fast to 0 as $|x|$ increases to ∞, we obtain that $0 \le PS_k(M, N) \le 1$.
In this paper, we computed the k-Persistent Similarity ($0 \le k \le 2$) of PDB structures from DrugBank against SARS-CoV-2 proteins. We calculated the mean of the Persistent similarity for each protein