Comparative analysis on molecular characteristics of Chromophobe renal cancer and Oncocytoma

(200 words): Chromophobe renal cell carcinoma (chRCC) and oncocytoma (RO) are renal tumor types originating from alpha intercalated cells of the collecting ducts of the kidney. Both tumor types have similar gross histological morphology and increased mitochondria, which leads to difficulties differentiating between these tumors, especially with core biopsy samples. This study aims to apply a machine learning approach to develop a molecular classifier based on transcriptomics data. Here we generated a meta-data set containing 62 chRCC and 45 RO gene expression arrays. Arrays were subjected to quality control steps, and genes were selected based on differential expression and ROC analysis. The final gene list was evaluated with UMAP based dimension reduction followed by density-based clustering with 95.5% accuracy. Molecular profiling by KEGG pathway analysis identified enrichment of fatty acid oxidation pathway in RO. We finally identified and val-idated the 30-gene signature, with an accuracy of 94.4% to distinguish chRCC from RO on UMAP analysis. Our results show that chRCC and RO have a distinct gene signature that can differentiate these tumors and complement histology for routine diagnosis of these two tumors.


Introduction
Chromophobe renal cell carcinoma (chRCC) and oncocytoma are renal tumor types originating from alpha intercalated cells of the collecting ducts of the kidney [1,2]. They account for 5% and 3-7% of all of the renal neoplasms, respectively [3][4][5][6]. Oncocytoma is benign and typically requires no treatment [6]. On the other hand, chRCC is a malignant lesion with 5-year overall survival of 90%; it is treated with a partial or radical nephrectomy [3,5]. ChRCC is a well-circumscribed tumor that is frequently larger than any other renal cancer [4,7], whereas oncocytomas can also be large, ranging in size from 1.5 cm to 13 cm, like chRCC [8]. Both have a brown appearance grossly. A central stellate scar is frequently noted in oncocytomas but may not always be present. Consequently, these tumors are difficult to distinguish grossly and by computed tomography (CT) imaging.
Histologically, chRCC is composed of sheets of cells with well-defined cell borders that have darker cytoplasm than conventional clear cell carcinoma [6]. On the other hand, oncocytoma has variable architecture but frequently consists of nests of tumor cells comprised of large, round, eosinophilic cells in loose connective tissue [6]. However, some cases have small cells with hyperchromatic nuclei, termed oncoblasts, which can emulate other eosinophilic renal cancers and pose difficulties in providing an accurate diagnosis from a core needle biopsy, where there is limited histologic context [9].
Electron microscopy may aid in the diagnosis of chRCC, but is not employed routinely in determining diagnosis by pathologists [4]. Immunohistochemical staining for cytokeratin 7, epithelial-mesenchymal antigen, and parvalbumin (PVALB) are occasionally employed, but these three markers have variable sensitivity, from 60-100%, in the diagnosis of chRCC [10,11]. To differentiate oncocytoma, immunohistochemical localization of cytokeratin 7, S100A1 [12], and kidney-specific cadherins are employed, but the overlap between the markers of chRCC and oncocytoma makes it an ineffective method of differentiation. On top of these issues, the tissue collection by core needle biopsy, commonly used in the diagnosis of small renal masses, has a misclassification rate of 20% due to low tissue volume and the presence of oncoblasts [13][14][15].
Molecular diagnostic methods have been used to identify specific genetic changes in disease and can aid in patient stratification [16]. Previous studies on chRCC and oncocytomas proposed markers like parafibromin, aquaporin 6, and synaptogyrin 3 [17]. Genetic markers like AP1M2, MAL2, PROM2, PRSS8, FLJ20171 [18] and EGLN2 [19] might be useful as well in addition to the currently available markers. The major drawbacks to these molecular classification studies are the smaller sample size and overlapping expression in chRCC and oncocytoma.
Here, we identified transcriptomic differences distinguishing chRCC from oncocytoma in a meta-dataset combined from multiple studies from the Gene Expression Omnibus (GEO) and ArrayExpress. From the gene expression differences, we elucidated pathway differences between chRCC and oncocytoma. We then implemented machine learning (ML) algorithms and developed an 80-gene transcriptomic signature and ML models to distinguish chRCC from oncocytoma.

Dataset search and selection:
ChRCC and oncocytoma transcriptomic microarray studies were identified in GEO and ArrayExpress. We included studies that contained treatment naïve "chRCC" and "oncocytoma" samples and were arrayed using the HG-U133plus 2 mRNA expression array. Based on the selection criteria, 6 studies (GSE11024, GSE11151, GSE12090, GSE2109, GSE8271, GSE19982) were chosen and their expression and phenotype data were downloaded [20,21] with GEOquery [22]. Phenotype data were prepared from the downloaded files and only chRCC and oncocytoma arrays are selected in data preparation steps. The ideal probe set for the study was determined with jetset [23] and all overlapping probes in jetset and all studies were selected (15875 probes). Five studies (GSE11151, GSE11024, GSE19982, GSE2109, GSE8271) were combined and checked for batch variation with boxplots and principal component analysis (PCA). Batch effects were corrected with ComBat from sva [24] using GSE19982 as a reference batch. GSE12090 was kept as the validation dataset. Batch effect correction and normalization were checked with PCA, uniform manifold approximation and projection (UMAP) [25], and boxplots.

Unsupervised learning, Density-based UMAP (DBU):
Density-based UMAP [26] consists of two algorithms, a dimension reduction and a clustering algorithm. We implemented UMAP as dimension reduction and clustering was done with density-based clustering (DBSCAN). The optimum parameter for UMAP was identified with a grid-search parameter identifier. This identifier implements different parameters and visual inspection of cluster density and inter-cluster distance was used to identify the optimum parameters. Number of neighbors, minimum neighbor distance, distance metric, and the number of probes were used in the grid search. The final chosen parameters are 20 neighbor, 0.01 minimum distance, 1000 probes, and manhattan distance metric. DBU was iterated 1000 times with these parameters and the resulting 2-dimensional UMAP coordinates were fed into a density-based clustering algorithm. The optimum parameter for density-based clustering was chosen with an elbow plot (eps = 2, Minimum points 5) from the "fpc" package [27]. Each iteration classified all the arrays into groups and plurality voting was implemented for final group determination. A 70% or higher agreement for one single group was used as a cut-off for group classification. If a sample does not reach 70% for a group, it is called "ambiguous". Grouping consensus heatmap was generated with gplots [28]. Alluvial plot was generated with the ggalluvial package [29].

Differential expression and network analysis and Immune cell infiltration:
Differentially expressed genes between chRCC and oncocytoma were assessed with limma [30] and fed into gene set enrichment analysis (GSEA) with fgsea [31], clusterprofiler [32] and ReactomePA [33] to identify networks and pathways involved in chRCC and oncocytoma [34]. Differential expression was presented with a heatmap generated from ComplexHeatmap [35] and the network plot was generated with ReactomaPA and enrichplot [28,31]. KEGG plots were created with the "pathview" package for specific pathways [38].

Gene Signature:
We calculated the sensitivity, specificity, accuracy and AUC for each sample percentile. Genes with the highest sensitivity and specificity were initially selected [39]. We narrowed down the list by comparing their expression levels at the different percentiles. Genes with higher fold change and relevance to chRCC or oncocytoma based on literature review were given preference. We selected a total of 80 genes. The 80 gene signature's efficiency was evaluated with heatmap, unsupervised learning (UMAP and DBU), and a random forest supervised learning model.

Statistical analysis:
All statistical analyses in this study were performed using R language and environment for statistical computing [40], version 4.10. All p-values were two-sided and a p < 0.05 was considered significant.

Meta-dataset of publicly available oncocytoma and chRCC microarray data
We queried the Gene Expression Omnibus (GEO) and ArrayExpress databases using the search terms "chRCC" and "oncocytoma" and identified 24 microarray studies ( Figure 1). After removing duplicates, 17 studies were evaluated for eligibility using the following criteria: 1) the chRCC and oncocytoma tumors used for the array were treatment naïve, 2) the array assessed mRNA expression, and 3) the array used was HGU133 plus2. Based on these eligibility criteria, six studies were selected and downloaded from GEO and ArrayExpress (Table 1). The six studies contained 107 arrays: 45 arrays for oncocytoma tumors and 62 arrays for chRCC. Five studies were used to develop a trainingtest dataset (53 chRCC and 36 oncocytoma arrays) for unsupervised and supervised machine learning models and sixth dataset (GSE12090, 9 chRCC and 9 oncocytoma arrays) was used for the validation set. All downloaded arrays were merged using the probes which overlapped across all arrays. This resulted in a combined dataset of 89 samples and 15875 probes. Principal component analysis (PCA) demonstrated batch differences among the downloaded studies ( Figure 2, A & B). Bayes empirical model was used for batch effect correction with study accession ID as batch and histological types as a covariate. Study GSE19982 was used as a reference batch for the batch correction as it contained the highest number of samples. After batch effect correction, PCA demonstrated global transcriptomic differences between the oncocytoma and chRCC histology types were greater than the transcriptomic differences among the downloaded studies ( Figure 2C and D). This resulting dataset was used for further analysis.  Using unsupervised clustering with uniform manifold approximation of projection (UMAP) for all 15875 probes, the 89 arrays formed two distinct clusters with no major batch effects from individual studies ( Figure 3A). Transcriptome of ChRCC and oncocytoma each formed their own clusters ( Figure 3B), suggesting that chRCC and oncocytoma have distinct transcriptomic profiles. Two chRCC samples were clustered with oncocytoma, suggesting sample misclassification. We therefore evaluated the robustness of the unsupervised classification with the density-based UMAP (DBU) algorithm [26]. The DBU algorithm takes a random subset of 1000 genes from the dataset and applies dimension reduction with UMAP to all samples. The UMAP output is then fed into density-based clustering to identify distinct groups.

Unsupervised clustering using gene expression data largely agrees with histology classification
DBU was iterated 1000 times and the final classification was determined with plurality voting ( Figure 3C). DBU identified 51 samples in group 1 and 38 samples in group 2 (termed "DBU1" and "DBU2," respectively). DBU1 contains 51 chRCC samples, while DBU2 contains 36 oncocytoma and 2 chRCC samples ( Figure 3D). Two samples were classified as "Ambiguous". The overall agreement between DBU and histology was 95.5%.

Differential expression, network and pathway analysis
We performed differential expression analysis between chRCC and oncocytoma after excluding misclassified and ambiguous samples. We found that 8411 out of 15875 probes are differentially expressed between the two tumors (Supplemental Figure 3). Additionally, 299 probes have at least log2 fold change of 1 between the tumors. AUC analysis of these 316 genes identified 194 genes with an AUC of 0.9 or higher between the tumors. Hierarchical clustering with these 194 genes showed a 97.8% agreement with DBU classification ( Figure 4A). These results show that there are large transcriptomic differences between chRCC and oncocytoma.
We compared differentially expressed genes (DEG) in both chRCC and oncocytoma with normal kidney cortex using gene set enrichment analysis (GSEA). GSEA using DEG between chRCC and normal kidney tissue identified 67 significantly enriched reactome pathways (adjusted p value < 0.05, Benjamini-Hochberg correction) out of 218 pathways. Notably enriched pathways are transcriptomic regulation by TP53, beta-catenin independent WNT signaling, diseases of signal transduction by growth factor receptors and second messengers, carbohydrate metabolism, interleukins 4 and 13 signaling, glycosaminoglycan metabolism, and extracellular matrix organization.  Figure 4C). TCA cycle in oncocytoma have upregulated fatty acid metabolism and isocitrate dehydrogenase, oxoglutarate dehydrogenase and succinate CoA ligase. Both tumors showed upregulated arginine biosynthesis, ascorbate and aldarate, alanine, aspartate and glutamate metabolism ( Figure 4C).

Identification of a Gene signature for chRCC/Oncocytoma classification
To develop a robust gene signature to differentiate between chRCC and oncocytoma, we identified genes with maximum diagnostic utility by calculating sensitivity, specificity, AUC, accuracy for all percentiles. We identified 84 genes which have a sensitivity + specificity of > 1.8, an AUC > .92, and an accuracy > .89. Genes with higher fold change, inter-percentile fold change and their relevance to chRCC and oncocytoma were given priority for selection. We selected 30 genes for the final gene signature (GS30). We evaluated the robustness of the signature using 1000 iterations of an unsupervised model. We developed unsupervised models with 10, 15 and 20 genes from GS30 and identified 20 genes as the optimum number for classification and that less than 20 genes introduce ambiguity in the classification. DBU with 20 genes from GS30 ( Figure 5D) have accuracy of 95.5% in plurality voting for 1000 iterations, similar to 1000 gene model and confirms GS30's ability to capitulate the difference the tumors as good as 1000 gene models. In total, 49 of chRCC and 36 of oncocytoma samples are correctly classified with these models. Two chRCC samples are classified with oncocytoma and two samples were ambiguous, similar to the 1000 gene models.
The accuracy of GS30 assessed on the validation cohort GSE12090 (9 chRCC and 9 Oncocytoma) was 94.4%. Only one chRCC sample was misclassified to oncocytoma. Heatmap and UMAP analysis showed clear distinction between the tumors ( Figure 5C and E).

Discussion
This study reports a robust way to differentiate chRCC from oncocytoma using transcriptomics. ChRCC and oncocytoma have distinct transcriptomic profiles that can differentiate from each other ( Figure 3B). Single molecules from current IHC markers showed an overlap at the transcript level, causing a broad range of sensitivity and specificity. Unsupervised models with 20 genes provided a consistent classification. Based on these models, we developed GS30, with an accuracy of .95 in 1000 unsupervised models. This gene signature shows potential for clinical use for differentiating chRCC from oncocytoma through PCR, microarray, or RNASeq based assays. In addition, these assays do not require large amount of tissue, making this a suitable candidate for a core needle biopsy.
Molecular profiling with KEGG pathway analysis on chRCC identified that calcium signaling, carbon and glycolipid metabolism, PPAR signaling pathway, and TNF signaling pathways ( Figure 4B) are distinctive compared with normal kidney tissue. ChRCC and oncocytoma both showed upregulated oxidative phosphorylation with decreased glycolysis and gluconeogenesis suggesting energy metabolic reprogramming in both tumors. Comparison in the TCA cycle pathway shows a dependence of fatty acid metabolism in RO ( Figure 4D). Signal transduction by growth factor receptor, TP53 transcriptional regulation, platelet activation, signaling, and aggregation are top differentiating pathways between tumors when comparing functional differences between chRCC and oncocytoma on Reactome pathways.
The strength of our study combining multiple datasets to increase the power of detecting transcriptomic difference. Although transcriptomic studies from The Cancer Genome Atlas contains far largest sample set, our dataset is the largest of chRCC-RO samples (n = 89). Our unsupervised algorithm identified 95.5% and 94.4% cases correctly in testing and validation dataset. However, we identified 4 chRCC samples that are not congruent with their histological diagnosis. Two of the samples did not reach consensus of 70%, are ambiguous, with almost evenly split in plurality voting. In case of remaining two samples the transcriptomic profile is indistinguishable from oncocytoma suggesting possible misclassification. Microscopy combined with immunohistochemical analysis are often insufficient in distinguishing chRCC and oncocytoma in biopsies. Hence, transcriptomics is important for further resolving the two pathologies. Since we demonstrated that oncocytoma and chRCC each have unique gene expression and pathways expressed, transcriptomics may aid in distinguishing cases found to be uncertain on biopsy and histologic analysis.

Conclusions
We show that chRCC and oncocytoma are transciptomically distinct tumors. Using our unsupervised learning method, we were able to separate these two tumors into distinct classes. Our results suggest that transcriptomics combined with supervised learning can be a good clinical tool to aid pathologists to distinguish chRCC from oncocytoma tumors for the diagnosis of these two tumors.  Institutional Review Board Statement: Ethical review and approval were waived for this study, due to the use of publicly available data with identifiers completely removed.

Informed Consent Statement:
This study was done with de-identified human tissue expression data from publicly available sources and does not require consent. Data Availability Statement: Data for this study is publicly available and can be downloaded from the gene expression omnibus or ArrayExpress website using accession number. Histopathology and other clinical information are also available with the datasets.