SARS-CoV-2 Infections-Gene Expression Omnibus (GEO) Data Mining, Pathway Enrichment Analysis, and Prediction of Repurposable Drugs/Compounds

The COVID-19 global pandemic has created dire consequences with an alarming rate of morbidity and mortality. There are not yet vaccine or efficacious treatment options to combat the causative SARS-CoV-2 infection. This paper describes the identification of potentially repurposable drugs for COVID-19 treatment by conducting pathway enrichment analysis on publicly available Gene Expression Omnibus datasets. We first determined SARS-CoV-2 infection-induced alterations of host gene expressions and pathways. We then identified drugs or compounds that target and counter virus-triggered cellular perturbations, suggesting their potential repurposing for COVID19 treatment. The key findings are that SARS-CoV-2 infection in host cells induces mitochondrial dysfunction, inhibits oxidative phosphorylation, and activates several immune response and proinflammatory pathways. Triptolide, the major bioactive component of a traditional Chinese medicine herb, may rescue mitochondrial dysfunction by activating oxidative phosphorylation. Further in vitro and in vivo studies are necessary to verify these results prior to clinical application.

Introduction data from over 3.5 million samples. The transcriptomic profile of SARS-CoV-2 infections, as well as several drug-induced gene expression signatures, are readily available within these databases.
In this paper we discover potentially repurposable drugs for COVID-19 treatment by conducting pathway enrichment analysis on publicly available GEO datasets. First, we describe SARS-CoV-2 infection-induced alterations of host gene expression and pathways. Then we identify drugs that target and counter virus-induced cellular perturbations, suggesting their potential repurposing for COVID-19 treatment. Further in vitro and in vivo studies will be necessary to verify these results prior to clinical application. Figure 1 outlines the 5 steps taken in this study. They are described in detail below.

GEO data identification (Step 1)
We searched the NCBI GEO datasets 20 with the terms "SARS-CoV-2 OR COVID-19" and the filters "human" for organism and "expression profiling by high throughput sequencing" as study type. We sought datasets where SARS-CoV-2 infection was studied on established laboratory cell lines and on cell cultures derived from cells isolated from SARS-CoV-2 infected humans. Out of 30 search results (July 25, 2020) we identified GSE147507 21 (bulk RNA-Seq data generated from various cell lines) and GSE152075 22 (RNA-Seq data from nasopharyngeal cells of SARS-CoV-2 positive and negative individuals). More information on the GEO dataset search results is available in supplementary methods (SM1).

RNA-Seq analysis (Step 2)
For GSE147507, CLC Genomics Workbench v20 (QIAGEN Digital Insights) 23 was used to perform RNA-Seq data analysis on raw sequencing reads and for computing differentially expressed (DE) genes between virus-infected vs. mock controls. For GSE152075, DE genes were identified from gene expression count data using Partek Flow software (Partek Inc.) 24 . A detailed description of RNA-Seq analysis methods by CLC Genomics Workbench is available in supplementary methods (SM2).

Pathway enrichment analysis (Step 3)
The Ingenuity Pathway Analysis (IPA) suite of tools (QIAGEN Digital Insights) 25 was used on DE genes to identify statistically enriched biological pathways induced by virus infection. We uploaded DE genes between the virus-infected vs. mock-control samples and applied the IPA Core Analysis tool to identify statistically significant metabolic and signaling pathways. Statistical significance was calculated using the right-tailed Fisher Exact Probability Tests; biological pathways showing p-value < 0.05 were considered statistically significant. The activity status of pathways was determined by calculating the activity z-score 26 , which is a statistical measure of how closely the gene expression pattern present in the query dataset compares to the expected pattern based on the literature findings. A positive score indicates an overall increase in the pathway activity, whereas a negative value indicates an overall decrease in activity. The IPA Comparison Analysis tool was used to compare pathway enrichment analysis results generated from the multiple datasets used in our study.

Identification of potentially repurposable drugs and drug target pathways (Steps 4 and 5)
DE genes between SARS-CoV-2-infected vs. mock-treated samples from series 16 of GSE147507 were uploaded to BaseSpace Correlation Engine (BSCE) software (Illumina Inc.) 27 . The Pharmaco Atlas tool was used to identify drugs exhibiting a negative correlation to SARS-CoV-2-induced gene expression profiles.
The top five drugs from the Pharmaco Atlas results (sorted by the negative correlation score) were used to search BSCE datasets comprising DE genes between drug treatment vs. control experiments using the filters "human" for organism, "RNA expression" for data type, and "lung" as a keyword. The identified datasets are available in supplementary methods (SM3).
All identified datasets along with the SARS-CoV-2-induced host DE genes from series 16 of GSE147507 were then added to the BSCE Meta-Analysis tool. The filter term "Canonical pathways | Broad MSigDB" available under the "BioGroup Results" option was used to compare the activity patterns of statistically enriched pathways across SARS-CoV-2 and drug-induced genes. The BSCE datasets and analyses are available in supplementary methods (SM4).

The number of SARS-CoV-2-induced genes correlates with increased viral load and level of ACE2 expression.
To understand the host viral response mechanism, we (1) measured gene expression, (2) identified the DE genes between virus-infected (positive) vs. mock-control (negative) samples from each dataset, and (3) performed pathway enrichment analysis.
The number of DE genes was used as a measure of the virus-induced host response (Table 1). We also checked the expression of several genes associated with SARS-CoV-2 pathogenesis. For example, the ACE2 and TMPRSS2 genes facilitate virus entry and protein priming, respectively 28 . We observed that both ACE2 and TMPRSS2 are expressed in NHBE and Calu3 cells but not in A549 cells. As expected ACE2 is highly expressed in the A549 cells when supplemented with the exogenous ACE2 (series 6 and 16).
Since the A549 wild type cells exhibited an absence of ACE2 and TMPRSS2 expression, we searched for an alternative gene that the virus might use to gain cellular entry. A recent finding suggests that Neuropilin Receptor 1 (NRP1) may serve as a host factor for SARS-CoV-2 infection 29 . We observed NRP1 expression in all cell lines we examined, including A549. Two endosomal cysteine proteases, Cathepsin L/B (CATL and CATB), exhibit the potential to compensate for TMPRSS2 30 and are also expressed in A549 cells.

SARS-CoV-2 infection results in mitochondrial dysfunction, activation of immune response pathways, inhibition of oxidative phosphorylation, and protein synthesis initiation.
To ascertain the SARS-CoV-2-induced alterations of signaling and metabolic pathways in host cells, we selected the GSE147507 series 16 dataset, as with 4684 DE genes it had the most robust host response. We performed pathway enrichment analysis using Ingenuity Pathway Analysis (IPA) software and show the top ten statistically enriched pathways in Figure 2. The activity status of a pathway is calculated by activity z-score algorithm 26 . The full list of statistically enriched pathways is available in a supplementary document (SD1).

Figure 2.
Top ten significantly enriched biological pathways for SARS-CoV-2-induced human gene expression. Ingenuity Pathway Analysis software was used to determine statistically significant pathways (x-axis = name of pathway; y-axis = -log of the p-value). The significance value (p-value of overlap) of the pathways was calculated by the right-tailed Fisher's Exact Test. Color intensity is proportional to the magnitude of the IPA z-score algorithm and reflects the degree of pathway activation (orange) or inhibition (blue). The gray bar indicates a pathway for which no prediction can be made due to insufficient evidence in the Ingenuity Knowledgebase.
The most statistically significant enriched pathway was mitochondrial dysfunction. The most inhibited pathway was oxidative phosphorylation, which is a mitochondrial metabolic pathway. The activated pathways include several immune-response and pro-inflammatory signaling pathways: Interferon, Sirtuin, Tumor Necrosis Factor 2 (TNFR2), and cytosolic pattern recognition receptormediated Interferon Regulatory Factor (IRF) activation. Supplementary figures provide examples of IPA pathway diagrams overlaid with DE genes observed between SARS-CoV-2-infected vs. mock controls for the most activated (Interferon signaling; SF1) and most inhibited (oxidative phosphorylation; SF2) pathways.
To ensure IPA analysis validity, we compared the IPA results with the BSCE-computed pathway enrichment results for the series 16 DE genes (SARS-CoV-2 infected vs. mock control). Whereas IPA searches a proprietary pathway knowledgebase and uses the right-tailed Fisher Exact test to identify pathway enrichments, BSCE uses an open-access Molecular Signatures Database (MSigDB) collection 31 and applies a modified gene set enrichment analysis algorithm to compute pathway enrichment Preprints (www.preprints.org) | NOT PEER-REVIEWED | Posted: 19 September 2020 doi:10.20944/preprints202009.0459.v1 scores 27 . The full list of BSCE-computed statistically enriched pathways is available in a supplementary document (SD2). We determined that the two bioinformatics tools provide similar results, particularly when comparing the statistical enrichment of immune response and proinflammatory pathways activation and mitochondrial pathways inhibition. We also performed pathway enrichment analysis of DE genes from two additional datasets (GSE140507, GSE 152075) comprising four cell types (NHBE, A549, Calu3, NSP) and three virus infection types (SARS-CoV-2, RSV, HIPV). To understand which biological processes are relevant to each dataset, we applied the IPA Comparison Analysis tool. We used hierarchical clustering for sorting and activity z-scores for displaying the comparison analysis results. Figure 3 shows the top 20 pathway analysis comparison results. The entire comparison analysis results are available in a supplementary document (SD3).  7), and A549 cells infected with RSV and HIPV (MOI 2.00) (lanes 8,9).
Our NSP cell results are in agreement with the contributors of GSE152075, who reported that host responses to SARS-CoV-2 are dependent on viral load and observed a strong inhibition of oxidative phosphorylation in SARS-CoV-2 infected cells through gene set enrichment analysis 32 . The activated pathways shown in Figure 3 are primarily involved in several immune-response and inflammatory processes, including interferon and cytokine signaling pathways, and reveal a dosedependent up-regulation in A549 infected cells (lanes 2-5). Taken together, the results from in vitro experiments with the cell lines (A549 and Calu3) and the in vivo results from NSP cells directly isolated from virus infected humans suggest that the inhibition of EIF2 and oxidative phosphorylation along with activation of immune response and pro-inflammatory pathways are the characteristic signatures of SARS-CoV-2, RSV, and HIPV infections.

Identification of drugs and compounds negatively correlated to SARS-CoV-2 gene expression patterns.
After identifying both up-and down-regulated genes and pathways in SARS-CoV-2 transcriptomic data, we next sought to identify drugs and small molecules that may reverse these virus-induced host gene expression alterations and subsequently be further investigated for the possible treatment of COVID-19. This concept is central to the connectivity map (CMap) approach of repurposing established drugs for off-label diseases or indications 18 .
We searched for drugs and compounds corresponding to DE genes from the GSE147507 Series 16 datasets using the Pharmaco Atlas application of BSCE software. BSCE searches GEO datasets and returns a list of drugs and compounds that produce a negatively correlated gene expression profile for the input query. The top 25 drugs, sorted by decreasing negative correlation score, are shown with their correlation scores in Table 2. negative We noted the presence of currently prescribed COVID-19 drugs including Dexamethasone, Hydroxychloroquine (HCQ), and Azithromycin in the list at positions three, five, and twenty-five, respectively (Table 2) 8 . The entire list of drugs is available as a supplementary document (SD4).
Triptolide is a potent activator of mitochondrial metabolic pathways, i.e., oxidative phosphorylation and TCA cycle and respiratory electron transport, whereas Hydroxychloroquine downregulates interferon and cytokine signaling.
We focused on the top five drugs from Table 2 for further analysis. Dactinomycin is a DNA binding compound that inhibits transcription 33 . Triptolide is a diterpenoid epoxide and the major bioactive component of Tripterygium wilfordii Hook.f. (TWHF), a Traditional Chinese Medicine (TCM) herb 34 . Dexamethasone is an anti-inflammatory 9-fluoro-glucocorticoid 35 . Estradiol is a naturally occurring hormone circulating endogenously in females 36 . HCQ is a chemotherapeutic agent that acts against erythrocytic forms of malarial parasites 37 .
To ascertain the effect of these five drugs on SARS-CoV-2 triggered host pathways, we first searched for drug-induced transcriptomics datasets in BSCE. Then we compared pathway enrichment profiles between the SARS-CoV-2 and drug-induced gene expression datasets.
Since SARS-CoV-2 is a respiratory virus, we focused on BSCE datasets in which drug treatment induced gene expression was studied in cell lines derived from lung tissue. We did not find any lung cell-related gene expression data in two of the top five drugs, Estradiol and HCQ. We therefore excluded Estradiol from further analysis, but continued with HCQ since some studies indicate that it has potential to treat SARS-CoV-2 38 . However, it is important to note that the HCQ data is from experiments performed in human Peripheral Blood Mononuclear Cells (PBMC), not lung tissue.
Of the SARS-CoV-2-induced pathways, we selected three up-regulated (generic transcription, cytokine signaling, and interferon signaling) and three down-regulated (translation, oxidative phosphorylation and TCA cycle and respiratory electron transport) for comparative analysis in BSCE. In Figure 4, panels A and B display the up and down-regulated pathways, respectively. In each graph, lane 1 is the input query dataset representing the result from SARS-CoV-2 vs. mock-infected samples from the GSE140507 series 16 dataset. Lanes 2-17 are the pathway enrichment analysis results of several drug treatments in different cell lines, including A549, airway smooth muscle, and PBMC.

Figure 4. Pathway activity comparison -SARS-CoV-2 infection vs. drug treatments.
Each plot represents the activity patterns of a pathway across multiple datasets. The x-axis lists the datasets and the y-axis shows the correlation score. The red bar above the midline represents the strength of overlap or enrichment between the up-regulated genes and the pathway genes. In contrast, the green bar below the midline represents the strength of overlap or enrichment between the down-regulated genes and the pathway genes. The absence of a colored bar means that the correlation is insignificant.

Discussion
Our project aim was to identify genes and pathways that might be involved in SARS-CoV-2 pathogenesis as well as discover potential drugs or compounds that could be repurposed for treatment of COVID-19. To achieve this goal, we took an in silico approach and leveraged the rich collection of publicly available transcriptomic data. To gain mechanistic insight into the SARS-CoV-2 triggered host response, we identified virus-induced host gene expression patterns from two GEO datasets and performed pathway enrichment analysis. We discovered that the number of virusinduced DE genes positively correlates with the dose of virus infection and level of ACE2 expression. We performed in depth pathway enrichment analysis of virus-induced host response using two bioinformatics tools. The key findings from pathway analysis are mitochondrial dysfunction, activation of immune-inflammatory pathways, and strong inhibition of oxidative phosphorylation.
Connecting established drugs with off-label diseases through transcriptomic data analysis is becoming a common practice in the field of bioinformatics, and has been successfully used to identify repurposed drugs through connectivity map (CMap) projects 18 . Repositioning drugs for obesity 39 , influenza 40 , ovarian cancer 41 , and gastric cancer 42 are a few prime examples of drug discovery using the transcriptomic data analysis approach. Motivated by this, we used BSCE software to identify drugs by searching drug/compound-related GEO datasets to reveal a reverse pattern of SARS-CoV-2-promoted gene expression signature. The current prescribed treatments for COVID-19 include Dexamethasone, Remdesivir, and HCQ alone or with Azithromycin. Our drug discovery method successfully identified Dexamethasone, HCQ, and Azithromycin. Our method failed to identify the other common drug, Remdesivir, due to the lack of available and relevant GEO datasets.
The BSCE-generated comparative pathway analysis shed light on the mode of action of four drugs by revealing which SARS-CoV-2-prompted pathway activity patterns could be reversed. The activation of cytokine and interferon signaling may lead to a hyperactive immune-inflammatory response, aka "cytokine storm" 16 . In severe COVID-19 cases, ARDS and multiple-organ failure occurs rapidly, resulting in death within a short time 43 . A cytokine storm is considered to be one of the major causes of ARDS and multiple-organ failure. In our study, the most potent inhibition of cytokine and interferon signaling was observed by transcriptomic data from human PBMC, where the HCQ effect was measured against heat killed GAS infection. Amongst other drugs we studied, only Dactinomycin showed a modest inhibitory effect against the two pro-inflammatory pathways.
Our pathway analysis results suggest that mitochondrial dysfunction and strong inhibition of mitochondrial oxidative phosphorylation are the prominent signatures of several viral infections, including SARS-CoV-2, RSV, and HIPV. The role of mitochondria in a viral infection host response has been previously described 44 . The SARS-CoV encoded protein, ORF-9b, localizes into the mitochondria and manipulates its function to evade host immune response 45 . The role of mitochondrial dysfunction in COVID-19 pathogenesis has also been postulated [45][46][47][48] . Some studies show that coronaviruses manipulate host mitochondria and cause mitochondrial dysfunction leading to an inflammatory response 49 . A link between mitochondrial dysfunction and COVID-19-related sepsis is also described 50 . Considering that mitochondria are cellular powerhouses and play the central role in maintaining oxidative homeostasis, rescuing mitochondrial function through therapeutic interventions could be pivotal in reducing COVID-19 morbidity and mortality.
In our analysis, we found that Triptolide might have the potential to intervene therapeutically with mitochondrial dysfunction, as it exhibited the most potent dose and time of treatment dependent activation on both oxidative phosphorylation and TCA cycle and respiratory electron transport pathways. Triptolide, a diterpene triepoxide, is a pharmacologically active compound isolated from Tripterygium wilfordii Hook F (TwHF), a perennial vine (Thunder God Vine) commonly grown in southeast China, and used in Traditional Chinese Medicine to treat autoimmune and inflammatory conditions 51 . It covers a broad spectrum of biological profiles including antiinflammatory, immunosuppressive, antifertility, antitumor activities, and neuroprotective effects. The therapeutic potential of Triptolide have been tested for nephritis, asthma, arthritis, Alzheimer's Disease, Parkinson's Disease, Pancreatic cancer, and HIV 34,51,52 . Despite its therapeutic potential, its application in clinical practice is limited due to poor water solubility and multi-organ toxicity 53 . However, several Triptolide derivatives that improve water solubility and reduce the toxicity have been developed 51 , including Minnelide, which exhibits promising results for pancreatic cancer in preclinical models 54 . These Triptolide derivatives could be tested for clinical efficacy against COVID-19.
Responding to an urgent need to find COVID-19 treatments, many research groups have taken a variety of approaches to identify repurposable drugs, including viral protein structure-based molecular docking 55 , network-based 56,57 and connectivity map-based 58 drug repurposing methods. In this paper we describe an additional approach, the application of anti-correlation based transcriptomic data analysis methods using publicly available GEO datasets to identify drugs that may be repositioned for COVID-19 treatments.
There are limitations with this in silico data mining approach. The absence of drug or compoundinduced transcriptomic data in the GEO database, such as in the case of Remdesivir, prevents its identification as a potentially repurposable drug. We mined data generated by different research groups using a variety of experimental protocols. Microarray data were compared with the results of RNA-Seq data. The inherent heterogeneity of experimental platforms might lead to erroneous conclusions. Therefore, all drug predictions described here warrant experimental and clinical verifications. However, we hope that these findings will be helpful to understand the pathogenesis of SARS-CoV-2 and can motivate researchers to test these compounds that eventually could help to find better treatment options for COVID-19.
Data Availability: All relevant data are within the manuscript and its Supporting Information files. A Figshare collection with results from RNA-Seq and pathway enrichment analyses is available at DOI: 10.6084/m9.figshare.c.5007983.
Authors Contributions: A.C. conceived the project, performed pathway enrichment analysis, and wrote the initial manuscript draft. S.C. performed RNA-Seq data analysis, created figures and tables, and deposited the data. C.I. provided editorial suggestions and reworked the manuscript to its final form.
Competing Interests: Authors declare that they do not have any competing interests.

SM2-CLC Genomics data
Raw reads for all human samples (GSE147507; series 1-9, 15, 16) were downloaded from the NCBI Sequence Read Archive 59 and data analysis was performed using CLC Genomics Workbench v20 (QIAGEN Digital Insights). Raw reads were aligned to the human reference genome (hg38) with annotations (Ensemble v91) using default settings, and unmapped reads were collected and aligned against corresponding virus reference genome sequence. The differential expression of genes between virus-infected vs mock-treated samples were computed for each series. Genes with FDRadjusted p-value <0.05 and log fold change +/-1.0 were considered DE genes.

SM3-BaseSpace Correlation Engine drug-related datasets
Dactinomycin: results from a microarray experiment in which A549 human lung cancer cells were seeded eight days before drug treatment (GSE6400) 60 . At four hours before RNA isolation, Dactinomycin (5 µ g/mL final concentration) or control (5% mannitol) solution was added to the cell cultures.
Triptolide: results from a microarray experiment describing identification of genes affected by a short treatment of A549 cells with increased concentrations of Triptolide (GSE16760) 61  Dexamethasone: results from four GEO datasets measuring gene expression in human lung tissue-derived cells treated with Dexamethasone. GSE34313 is a microarray dataset of airway smooth muscle cell responses to Dexamethasone at 4 and 24 hours 62 . GSE52778 is an RNA-Seq dataset after 18 hours of Dexamethasone treatment 63 . GSE96649 is an RNA-seq dataset from A549 cells with Dexamethasone (1uM) for treatment at 5 hours vs. untreated controls 64 . GSE17307 is a microarray dataset of Dexamethasone (1uM) treatment at 6 hours vs. untreated controls 65 .
Hydroxychloroquine: GSE74235 is an RNA-Seq dataset of human Peripheral Blood Mononuclear Cells from three healthy participants, stimulated with rheumatogenic, heat-killed Group A Streptococcus (GAS) for 24 hours, MOI 10 66 . The effect of Hydroxychloroquine (20 µ M) was measured in combination with GAS.

SM4-BaseSpace Correlation Engine pathway comparison
BSCE was used to mine the GEO datasets and compute DE genes between two conditions. For microarray experiments, statistical significance was determined using the Welch t-test. Genes with p-value cutoff 0.05 and fold change cutoff +/-1.20 were considered DE genes. For RNA-Seq experiments, raw sequencing reads were aligned to the human genome sequence using STAR 2.3 and RefSeq annotations. Differential gene expression results between the control and test sample groups were calculated using DESeq2. Genes with Benjamini-Hochberg adjusted q-value cutoff 0.05 and a fold change cutoff of +/-1.2 were considered DE. BSCE software uses a ranked-based non-parametric Running Fisher test to compare gene expression signatures between two DE gene lists and calculate a correlation score 27 . The fuchsia color indicated up-regulated genes with log2Fold change > 1. The intensity of the color is proportional to the magnitude of fold change.
The green color indicates down-regulated genes with log2Fold change < -1. The fuchsia color indicates up-regulated genes with log2Fold change > +1. The intensity of the color is proportional to the magnitude of fold change.