mySORT: A web framework by using Deconvolution Approach to Estimating Immune Cell Composition from Complex Tissues

Cancer immunotherapy has achieved remarkable results in various cancer types and brings new possibilities for improving cancer patients' long-term survival. However, outcomes vary from case to case, and current protocols benefit only a small fraction of patients. One notable factor is the tumor microenvironment, especially its immune cell components, which may reflect the status of the immune response on site. Thus, understanding the content of infiltrating immune cells in tumors is not only of research interest but also crucial for precision medicine. To this end, we implement an algorithm for resolving the relative proportions of twenty-one immune cell subclasses from the transcriptome of a human tissue profiled by microarray technology. By selecting gene features and then adopting Support Vector Regression, we construct a deconvolution model and resolve the immune cell content. The excellent consistency between the estimated values and the true immune-cell composition demonstrates that this approach provides a more natural alternative for revealing the immune cell content of samples, with results as reliable as those of recent single-cell technologies. Based on this algorithm, we implemented a web-based deconvolution tool, mySORT, which provides a user-friendly interface for estimating immune cell content from uploaded gene expression profiles. mySORT also presents comprehensive 2D/3D visualization plots so that users can easily compare different samples. Finally, we synthesized pseudo-bulk expression data from single-cell transcriptomic datasets of 17 melanoma and 16 head and neck cancer patients. The deconvolution results for both the microarray-based data of the previous study and the synthetic pseudo-bulk data demonstrate the excellent performance of mySORT. We believe that mySORT can help researchers in all fields easily understand the complex immune microenvironment.
The mySORT website is freely accessible at https://symbiosis.iis.sinica.edu.tw/mySORT/.


Introduction
Cancer is a disease caused by a malignant cell population that divides without limit and metastasizes to remote sites, thereby taking over the space and resources of healthy cells. The immune system can detect not only invasive antigens but also abnormal cells in our bodies.
Unfortunately, the presence of cancer shows that it can find ways to escape the surveillance of the immune system. The recent breakthrough discovery of the immune checkpoint molecules PD-1 and CTLA-4 [1,2] has led to a new paradigm of cancer therapy based on checkpoint inhibitors, such as those targeting PD-1/PD-L1, which successfully re-activate the immune system [3][4][5]. However, the objective response rate of checkpoint inhibitors varies, and adverse side effects have been observed [6]. Consequently, identifying the factors that cause therapeutic outcomes to vary among patients is critical for improving current cancer immunotherapy.
Immune cell composition in tumors has been proposed to explain the diverse responses across patients and cancer types [7][8][9]. To reveal the immune escape mechanisms driven by the tumor, a robust approach for estimating immune cell content in the tumor microenvironment is crucial. Traditional methods such as flow cytometry and immunocytochemistry can measure only a small range of known biomarkers. These methods are also applied to only a small fraction of a biopsy and are difficult to scale up to resolve all immune cell types of interest.
Recent high-throughput technologies such as microarrays and next-generation sequencing (NGS) have revolutionized gene expression profiling, and methods to estimate immune cell composition from such profiles have been conceived [10][11][12][13]. For example, several statistical approaches for microarray data, such as quadratic programming [14], the Digital Sorting Algorithm [15], and semi-supervised non-negative matrix factorization [16], were proposed to deconvolute the immune cell composition, that is, to resolve cellular components from a measurement of pooled values [12,17]. Most of these methods focused on a small spectrum of cell types. Newman et al. adopted a novel strategy and implemented their method as a web service, CIBERSORT [13]. Benchmarked against the ground truth (cell fractions typed by flow cytometry), CIBERSORT is recognized as a superior method [18].
However, the performance varies when broken down by cell type. Using the scenario defined by CIBERSORT, we revisited the dataset, excluded potentially contaminated datasets and weakly supported cell subtypes, optimized the signature gene set, and proposed a support vector regression method to deconvolute the immune cell composition [19]. This method, mySORT, outperformed CIBERSORT in benchmark tests on microarray datasets.
In this study, we implement the web application of mySORT with an interactive user interface to process uploaded transcriptome profiles. mySORT resolves the relative proportions of twenty-one types of immune cells. Results and statistical analyses (clustering, alpha and beta diversity) are presented in a graph-rich output. Furthermore, single-cell RNA sequencing is a novel, emerging method for analyzing the complexity of the tumor microenvironment [20][21][22][23]. We use single-cell RNA sequencing data to validate the consistency between these two measures.

Usage of mySORT website
Users can upload a text or CSV file containing a gene expression matrix of single or multiple samples.
The matrix should contain gene symbols as rows and sample names as columns. If the submitted expression data have not been normalized beforehand (as indicated by the user), mySORT will log-transform them. Readers who want to know the algorithm of mySORT in detail, including pseudocode, can find it in our previous publication [19]. To help users quickly understand the operation of mySORT, we provide a demo that includes several expression datasets from different kinds of cancer tissues.
Users can categorize these expression datasets into several groups for deeper analysis. Users can also visualize both alpha and beta diversity in low-dimensional 2D or 3D plots. The calculations and visualizations described above, including hierarchical clustering and alpha and beta diversity analyses, were conducted with the R packages Vegan [24], Phyloseq [25], and Plotly.
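
The input handling described above can be sketched as follows. This is a minimal illustration, not the code behind the mySORT website: the function names, the base-2 logarithm, and the pseudocount of 1 are assumptions, since the text only states that unnormalized data are log-transformed.

```python
import csv
import io
import math


def read_expression_matrix(text, delimiter=","):
    """Parse a matrix with gene symbols as rows and sample names as columns.

    Returns the list of sample names and a dict mapping each gene symbol
    to its list of expression values (one value per sample).
    """
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    header = next(reader)
    samples = header[1:]  # first header cell labels the gene column
    matrix = {}
    for row in reader:
        if not row:
            continue  # skip blank lines
        matrix[row[0]] = [float(value) for value in row[1:]]
    return samples, matrix


def log2_transform(matrix, pseudocount=1.0):
    """Log-transform raw values; log2(x + 1) is an assumed convention."""
    return {
        gene: [math.log2(value + pseudocount) for value in values]
        for gene, values in matrix.items()
    }


# Toy input with made-up values for two samples and two genes.
demo = "gene,sampleA,sampleB\nCD3E,7,0\nCD19,3,15\n"
samples, raw = read_expression_matrix(demo)
logged = log2_transform(raw)
```

For a tab-separated text file, the same reader can be called with `delimiter="\t"`.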

Visualization plots of alpha diversity and beta diversity
We further implement alpha and beta diversity plots to visualize the overall distribution and profile of the single-cell experiment data. Here we adopt Simpson's diversity index as the alpha diversity measure to indicate the richness and evenness of immune-cell species, that is, the heterogeneity of a sample [26]. For example, samples with more infiltrating immune cell types, or samples with the same cell types but a more even distribution of cell numbers across types, tend to have higher alpha diversity. A non-metric multidimensional scaling (NMDS) plot is used to describe beta diversity, here the difference in immune cell composition among samples [27]. A shorter distance between two samples on the NMDS plot indicates that their overall similarity is relatively higher.

The table compares the web application of mySORT to CIBERSORT by several functions. mySORT demonstrates the advantage of data visualization functions over CIBERSORT.
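
The two diversity measures in this section can be illustrated with a short sketch. It assumes the Gini-Simpson form of the index, 1 - sum(p_i^2), which is what the vegan package returns for its "simpson" index, and it uses Bray-Curtis dissimilarity as the between-sample distance that would feed an NMDS ordination. The choice of Bray-Curtis is an assumption, because the text does not name the distance measure, and the function names are illustrative.

```python
def simpson_diversity(fractions):
    """Gini-Simpson index: 1 - sum(p_i^2) over cell-type proportions."""
    total = sum(fractions)
    if total <= 0:
        raise ValueError("fractions must sum to a positive value")
    return 1.0 - sum((f / total) ** 2 for f in fractions)


def bray_curtis(a, b):
    """Bray-Curtis dissimilarity between two composition vectors."""
    numerator = sum(abs(x - y) for x, y in zip(a, b))
    denominator = sum(x + y for x, y in zip(a, b))
    return numerator / denominator if denominator else 0.0


# Two toy samples with the same four cell types but different evenness.
even_sample = [0.25, 0.25, 0.25, 0.25]
skewed_sample = [0.7, 0.1, 0.1, 0.1]

alpha_even = simpson_diversity(even_sample)      # about 0.75
alpha_skewed = simpson_diversity(skewed_sample)  # about 0.48
beta = bray_curtis(even_sample, skewed_sample)   # about 0.45
```

The evenly distributed sample scores higher than the skewed one, matching the statement that a more even distribution of cell numbers gives higher alpha diversity.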

Validation of mySORT performance by real single-cell datasets
The performance of mySORT was previously benchmarked on blood biopsies from 20 adults, in which nine immune cell types were identified by flow cytometry analysis; that benchmark demonstrated that the present computation performed better than the state-of-the-art deconvolution method CIBERSORT [14]. In this study, we further used data from cutting-edge single-cell RNA sequencing technology to validate the performance of mySORT. We collected two public single-cell RNA sequencing datasets of tumor samples from 17 melanoma patients and 16 head and neck cancer patients.

Not required Required
Custom signature matrix Not allowed Allowed

Immune cell composition
The synthetic pseudo-bulk data of tumor samples were used to estimate the relative proportions of immune cell types. We found that the predictions of mySORT correlated well with the ground truth in both datasets when all immune cell types were considered (Fig. 2A and 2C). When the outcome was separated by cell type, we also observed good correlation in almost all cell types except macrophages (Fig. 2B and 2D). The lower accuracy for macrophages possibly results from the relatively low proportion of macrophage-based datasets in our signature matrix. However, the overall performance of mySORT still shows good consistency with the single-cell data and does not seem to be seriously affected by this phenomenon.

Construction of synthetic pseudo-bulk gene expression data
The single-cell RNA sequencing data of melanoma patients and head and neck cancer patients were downloaded from NCBI GEO under accessions GSE72056 and GSE103322, respectively [22,23]. Both single-cell datasets were generated with the Smart-seq2 protocol. Following the instructions of the original publication [22], we kept good-quality cells using the criteria of at least 1,700 expressed genes for melanoma samples and 2,000 expressed genes for head and neck squamous cell carcinoma (HNSCC) samples. An adequate level of housekeeping gene expression was also confirmed in every qualified cell.
Similarly, the strategy described above was also applied to the HNSCC dataset. Here, six immune cell types, including B cells, CD4 T cells, CD8 T cells, macrophages, dendritic cells, and mast cells, as well as 16 qualified HNSCC samples, were used.
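
A minimal sketch of how a pseudo-bulk profile and its ground-truth composition could be derived from annotated single cells is given below. It is an illustration under stated assumptions, not the exact procedure used in this study: the per-cell data structure and function names are hypothetical, the pseudo-bulk value is assumed to be the mean expression across the retained cells, and the toy threshold of 2 expressed genes stands in for the 1,700 and 2,000 gene criteria.

```python
from collections import Counter


def count_expressed(expr):
    """Number of genes with non-zero expression in one cell."""
    return sum(1 for value in expr.values() if value > 0)


def filter_cells(cells, min_genes):
    """Keep cells that express at least `min_genes` genes."""
    return [c for c in cells if count_expressed(c["expr"]) >= min_genes]


def pseudo_bulk(cells):
    """Average each gene over all cells (assumed aggregation rule)."""
    genes = set().union(*(c["expr"] for c in cells))
    n = len(cells)
    return {g: sum(c["expr"].get(g, 0.0) for c in cells) / n
            for g in sorted(genes)}


def true_fractions(cells):
    """Ground-truth composition from the cell-type annotation counts."""
    counts = Counter(c["cell_type"] for c in cells)
    total = sum(counts.values())
    return {cell_type: k / total for cell_type, k in counts.items()}


# Toy sample: four annotated cells with made-up expression values.
cells = [
    {"cell_type": "CD8 T cells", "expr": {"CD8A": 4.0, "CD3E": 2.0, "ACTB": 6.0}},
    {"cell_type": "B cells", "expr": {"CD19": 3.0, "ACTB": 2.0}},
    {"cell_type": "B cells", "expr": {"CD19": 1.0, "ACTB": 0.0}},  # fails the filter
    {"cell_type": "CD8 T cells", "expr": {"CD8A": 2.0, "ACTB": 4.0}},
]

kept = filter_cells(cells, min_genes=2)
bulk = pseudo_bulk(kept)
truth = true_fractions(kept)
```

The resulting `bulk` profile would be the input to deconvolution, and `truth` the reference against which the estimate is compared.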

Comparison of estimated and true immune cell composition
The single-cell data and the output of mySORT share only some immune cell types, so for comparison we restrict both to the shared cell types and rescale the ground-truth values and the predicted values so that each sums to 1.
The Pearson correlation coefficient and the root-mean-square error were then used to measure the correlation and the difference between the estimated immune cell content and the ground truth.
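
The comparison described in this section can be sketched as follows. The function names and toy numbers are illustrative assumptions, and the "Other" label is a hypothetical stand-in for cell types reported by only one of the two sources.

```python
import math


def rescale_shared(fractions, shared_types):
    """Restrict to the shared cell types and rescale so the values sum to 1."""
    values = [fractions.get(t, 0.0) for t in shared_types]
    total = sum(values)
    if total <= 0:
        raise ValueError("no signal in the shared cell types")
    return [v / total for v in values]


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def rmse(x, y):
    """Root-mean-square error between prediction and ground truth."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)) / len(x))


shared = ["B cells", "CD4 T cells", "CD8 T cells"]
predicted = {"B cells": 0.1, "CD4 T cells": 0.2, "CD8 T cells": 0.1, "Other": 0.6}
ground_truth = {"B cells": 0.2, "CD4 T cells": 0.6, "CD8 T cells": 0.2}

pred = rescale_shared(predicted, shared)      # about [0.25, 0.5, 0.25]
truth = rescale_shared(ground_truth, shared)  # about [0.2, 0.6, 0.2]
r = pearson(pred, truth)
error = rmse(pred, truth)
```

Rescaling first matters because a prediction spread over many unshared cell types would otherwise look systematically too small on the shared ones.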

System Implementation
To provide an intuitive, easy-to-understand user experience, we built mySORT on a LAMP stack.

Conclusion
The breakthrough in immunotherapy has led to novel anti-cancer drugs and therapies. Undoubtedly, more and newer strategies for overcoming the immune-suppressive ability of cancer cells will emerge in the future. The correlation between the cellular composition of a cancer and its drug response suggests that the heterogeneity of the immune cell population is a critical issue in clinical practice. Thus, we built mySORT into a user-friendly web framework. We added two cell population diversity measurements to help biomedical researchers understand the tumor microenvironment of their samples through comprehensive plots and charts.
The performance of cell deconvolution methods like mySORT largely depends on the quality and coverage of the transcriptome data of the cell population, the feature selection strategy, and the power of the implemented model. Although mySORT was more accurate than other concurrent methods on the test datasets from microarray experiments, its predictive power for some cell types is barely satisfactory. Recent advances in single-cell RNA sequencing methodology provide a variety of cell profiling data in public repositories. It is worthwhile to revise our methods to adopt these new, massive datasets and to apply deep learning models to the cell component deconvolution problem. We believe that combining these new tools to realize precision medicine will be a critical step in improving cancer treatment.

Ethics approval and consent to participate
Not applicable.

Consent for publication
Not applicable.

Availability of data and material
In this study, we used the melanoma dataset (NCBI GEO GSE72056) and the head and neck cancer dataset (NCBI GEO GSE103322).