UCSCXenaShiny: an R package for exploring and analyzing UCSC Xena public datasets in web browser

Motivation: The UCSC Xena platform provides a large amount of processed cancer omics data from big public projects such as TCGA and from individual research groups, enabling unprecedented research opportunities. In 2019, we developed UCSCXenaTools, an R package for retrieving UCSC Xena data. However, an easier tool for dataset exploration and analysis is still lacking, especially for researchers without programming experience. Results: We developed UCSCXenaShiny, an R Shiny package for quickly exploring and downloading all datasets from UCSC Xena data hubs. In addition, a module-based analysis framework was constructed to analyze and visualize data. https://github.com/openbiox/UCSCXenaShiny or https://cran.

In 2019, we developed UCSCXenaTools, an open-source R package for retrieving and assembling public UCSC Xena data (Wang and Liu, 2019). UCSCXenaTools communicates with UCSC Xena data hubs to download datasets or dataset subsets and to query the metadata of a data hub, cohort or dataset. Although the UCSC Xena platform itself allows users to explore and analyze data, it is hard for researchers to quickly explore all available datasets, locate what they need for their research and download useful datasets. Besides, the analysis features provided by the UCSC Xena platform mainly focus on individual cohort data and therefore lack full-featured functionality.
To this end, we developed UCSCXenaShiny, an open-source R Shiny package that allows researchers in the cancer community to explore and analyze datasets from UCSC Xena data hubs in a web browser. In addition, an extensible module-based analysis framework was constructed to analyze data. Currently, several modules providing single-gene expression analysis and visualization are implemented.

Dataset exploration
UCSCXenaShiny opens a web page in the user's browser to provide its service. The "Repository" page is used to explore all available UCSC Xena datasets. Users can find desired datasets either with predefined buttons or by searching the dataset table. Once one or several datasets are selected, users can query their metadata or download them (Fig.1 and Fig.3). To improve the performance of downloading large datasets, we provide a button to download a shell script containing 'wget' commands, which can be run on Unix-like systems.
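The idea of bundling downloads into a script of 'wget' commands can be illustrated with a minimal sketch. It is written in Python for illustration only and is not the package's R implementation; the function name build_wget_script, the destination directory and the URLs are hypothetical placeholders, and the '-c' (resume) and '-P' (target directory) options are our choice rather than something stated above.

```python
from shlex import quote


def build_wget_script(urls, dest_dir="."):
    """Return the text of a POSIX shell script that downloads each URL with wget."""
    lines = ["#!/bin/sh", "# Download the selected datasets one by one."]
    for url in urls:
        # -c resumes a partially downloaded file; -P sets the directory to save into.
        # quote() protects against spaces or shell metacharacters in the values.
        lines.append(f"wget -c -P {quote(dest_dir)} {quote(url)}")
    return "\n".join(lines) + "\n"


# Hypothetical placeholder URLs, not real UCSC Xena download links.
example_urls = [
    "https://example.org/hub/dataset_a.tsv.gz",
    "https://example.org/hub/dataset_b.tsv.gz",
]

script_text = build_wget_script(example_urls, dest_dir="xena_data")
print(script_text)
```

Saving the printed text to a file and running it with 'sh' on a Unix-like system would fetch the files without going through the browser, which is the likely reason such a script helps with large datasets.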

Module and pipeline
For now, several modules targeting single-gene expression analysis are available on the "Module" page (Fig.2 and Fig.4), and a pipeline based on them is available on the "Pipeline" page (Fig.2). Usage is simple: users only need to type a gene symbol, and UCSCXenaShiny carries out all procedures, including downloading data from UCSC Xena data hubs, cleaning data, analyzing data and visualizing the result. We are happy to accept new feature requests, which can be discussed at https://github.com/openbiox/UCSCXenaShiny/issues.
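The sequence of procedures described above (download, clean, analyze, visualize) can be pictured as a chain of small steps, each feeding the next. The following Python sketch is only a schematic under our own assumptions: every function is a hypothetical stand-in, the data are made up, and nothing here reflects the actual R code or the real analysis performed by the package.

```python
def download(gene):
    # Stand-in for fetching data: returns made-up (cohort, value) pairs.
    values = [("cohortA", 3.0), ("cohortA", None), ("cohortB", 5.0), ("cohortB", 7.0)]
    return {"gene": gene, "values": values}


def clean(data):
    # Drop records with a missing expression value.
    kept = [(cohort, value) for cohort, value in data["values"] if value is not None]
    return {"gene": data["gene"], "values": kept}


def analyze(data):
    # Compute the mean expression per cohort.
    grouped = {}
    for cohort, value in data["values"]:
        grouped.setdefault(cohort, []).append(value)
    means = {cohort: sum(vals) / len(vals) for cohort, vals in grouped.items()}
    return {"gene": data["gene"], "means": means}


def visualize(summary):
    # Render a plain-text bar chart as a stand-in for a real plot.
    lines = [f"{summary['gene']} mean expression by cohort"]
    for cohort, mean in sorted(summary["means"].items()):
        lines.append(f"{cohort:<8} {'#' * round(mean)} {mean:.1f}")
    return "\n".join(lines)


def run_pipeline(gene, steps=(download, clean, analyze, visualize)):
    # Pass the output of each step to the next one, starting from the gene symbol.
    result = gene
    for step in steps:
        result = step(result)
    return result


report = run_pipeline("TP53")
print(report)
```

Because each step only depends on the output of the previous one, a step can be swapped or a new one appended without touching the others, which is the property that makes a module-based design extensible.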

Package structure
The structure and workflow of UCSCXenaShiny are described in Fig.1. Currently, the core components of this package are the "Repository" page and the "Module" page. The "Repository" page allows researchers to explore and download datasets. Table 1 summarizes the number of cohorts and datasets available at different UCSC Xena data hubs. There are 1639 datasets in total, and the TCGA project is the major contributor. UCSCXenaShiny is developed on the R Shiny platform (https://shiny.rstudio.com/); an overview of its graphical interface is shown in Fig.2.

Feature 1: dataset exploration and download
UCSCXenaShiny allows users to explore UCSC Xena datasets quickly and easily (Fig.3). A table listing all datasets is shown in Fig.3A; users can filter datasets by typing keywords in the search bar or by selecting data hubs or data types. Once the desired datasets are selected in the table, users can click the button at the bottom to check the metadata of the datasets or download them (Fig.3B-D).
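The filtering described above combines three optional criteria: a keyword, a set of data hubs and a set of data types. A minimal Python sketch of that logic is given below for illustration. The record fields ("name", "hub", "type"), the function name filter_datasets and the catalog entries are all hypothetical, and matching the keyword only against the dataset name is a simplifying assumption rather than the behaviour of the actual search bar.

```python
def filter_datasets(datasets, keyword="", hubs=None, data_types=None):
    """Keep the records that satisfy every criterion that was supplied."""
    keyword = keyword.lower()
    selected = []
    for record in datasets:
        # Case-insensitive keyword match against the dataset name.
        if keyword and keyword not in record["name"].lower():
            continue
        # An empty or missing hub / type selection means "no restriction".
        if hubs and record["hub"] not in hubs:
            continue
        if data_types and record["type"] not in data_types:
            continue
        selected.append(record)
    return selected


# Made-up catalog entries, not real UCSC Xena datasets.
catalog = [
    {"name": "cohortA_gene_expression", "hub": "hubX", "type": "gene expression"},
    {"name": "cohortA_clinical", "hub": "hubX", "type": "phenotype"},
    {"name": "cohortB_gene_expression", "hub": "hubY", "type": "gene expression"},
]

hits = filter_datasets(catalog, keyword="Expression", hubs={"hubX"})
print([record["name"] for record in hits])
```

Treating each unset criterion as "no restriction" means that the unfiltered call returns the whole table, matching the behaviour a user would expect before typing or selecting anything.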

Feature 2: single-gene expression analysis
UCSCXenaShiny provides modules implementing basic analysis functionality, and modules can be further assembled into analysis pipelines (Fig.1). For example, we constructed a few modules to analyze and visualize the expression of a single gene, including its pan-cancer distribution as a violin plot or anatomy heatmap (Maag, 2018), and its survival effects (Therneau and Grambsch, 2000) under different expression cutoffs. We combined some of them into a single-gene expression analysis pipeline so that researchers can get as much information as possible for the same task in one click. An example for gene TP53 is given in Fig.4.
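Comparing survival under different expression cutoffs first requires dividing samples into groups by expression level. The text does not specify how the cutoffs are defined, so the Python sketch below is purely an assumption: it splits samples into "high" and "low" groups at a given threshold and uses the median when none is supplied. The function name split_by_cutoff and the expression values are made up, and the survival comparison itself is not shown.

```python
from statistics import median


def split_by_cutoff(expression, cutoff=None):
    """Group sample IDs into 'high' and 'low' expression groups.

    expression maps a sample ID to its expression value.
    cutoff is the threshold; the median is used when it is None (an assumption).
    """
    if cutoff is None:
        cutoff = median(expression.values())
    groups = {"high": [], "low": []}
    for sample, value in expression.items():
        # Values strictly above the cutoff count as high expression.
        groups["high" if value > cutoff else "low"].append(sample)
    return groups


# Made-up expression values for five hypothetical samples.
toy_expression = {"s1": 2.0, "s2": 5.5, "s3": 3.1, "s4": 7.8, "s5": 4.0}

median_groups = split_by_cutoff(toy_expression)
strict_groups = split_by_cutoff(toy_expression, cutoff=6.0)
print(median_groups)
print(strict_groups)
```

Moving the cutoff changes group membership (here, raising it to 6.0 leaves a single sample in the high group), which is why the choice of cutoff can alter the apparent survival effect.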