Preprint
Case Report

This version is not peer-reviewed.

GSTP1 as the Most Important Gene for Prostate Cancer Diagnosis

Submitted:

18 April 2026

Posted:

20 April 2026

Abstract
We show that prostate cancer tumor samples can be predicted from differentially expressed genes (DEGs; tumor versus normal samples, log2FC ≥ 1.5 and padj ≤ 0.05). Principal component analysis (PCA) indicates that the 1314 DEGs separate tumor (n=197) from normal (n=49) samples, with PC1 and PC2 explaining 15% and 12% of the variance, respectively. Moreover, variable importance analysis with a random forest ranked GSTP1 as the most important gene for predicting tumor samples. Finally, in the Bayesian network structure learnt from the data, the GSTP1 node is directly connected to CRABP2, ATP1A4, MIR31HC, ANO1, FOLR1, HOXB2, MSLN, GATA3 and AP0019942. Notably, GSTP1 is not yet used in commercial diagnostic assays, nor is it found among the most relevant DEGs. A PubMed search showed that GSTP1 already has publications implicating it in prostate cancer. The decision tree additionally selected AL109976.1, AC138647.2, DUOXA1 and CCK as potential biomarkers.

Introduction

Early detection of prostate cancer significantly improves treatment outcomes and survival rates. There is therefore a need for frameworks for selecting diagnostic biomarkers that can be applied to different cohorts. Several commercial assays already use gene expression to diagnose prostate cancer [1], e.g., Decipher, Oncotype DX (GPS), Prolaris (CCP / CCR), PTEN, ProMark and TMPRSS2-ERG Fusion. In this work, a two-phased framework is proposed to select biomarkers among tumor genes (see Figure 1).

Methods

The cohort builder of the GDC portal (https://portal.gdc.cancer.gov/), accessed on 2026-04-04, was used to obtain data from prostate cancer cases (Experimental Strategy = RNA-Seq, Data Format = tsv, Data Category = transcriptome profiling, Workflow Type = STAR - Counts, Sex at Birth = male, Primary Diagnosis = acinar cell carcinoma, Primary Site = prostate gland). Only RNA-seq data were downloaded. Of the 246 samples, 49 were from normal tissue and 197 from tumor tissue. A GitHub repository with data and scripts to reproduce the results is available at: https://github.com/datasciencebioinformatics/BiomarkerIdentification_ProstateCancer/

Results

Tumor Genes

We identified 1314 DEGs (tumor genes; tumor versus normal, log2FC ≥ 1.5 and padj ≤ 0.05; see Supplemental Table S1). Using these genes, the principal component analysis (PCA) plot shows good separation of tumor and normal samples, with 15% and 12% of the variance explained by PC1 and PC2, respectively (see Figure 2).
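The selection rule can be sketched in a few lines of Python. This is a minimal illustration rather than the code in the repository: the record layout, the toy values and the function name `select_degs` are hypothetical. The sketch also assumes that the cutoff is applied to the absolute log2 fold change, since Table 2 lists selected genes with a tumor/normal fold change below 1.

```python
# Minimal sketch of a DEG filter: keep genes with |log2FC| >= 1.5 and padj <= 0.05.
# Assumption: the threshold applies to the absolute log2 fold change.

def select_degs(results, lfc_cutoff=1.5, padj_cutoff=0.05):
    """Return the names of genes passing both thresholds.

    `results` is a list of dicts with keys 'gene', 'log2fc' and 'padj'
    (a hypothetical layout for a differential-expression table).
    """
    selected = []
    for row in results:
        padj = row["padj"]
        if padj is None:  # genes without an adjusted p-value are skipped
            continue
        if abs(row["log2fc"]) >= lfc_cutoff and padj <= padj_cutoff:
            selected.append(row["gene"])
    return selected


# Toy values for illustration only (not results from the study).
toy_results = [
    {"gene": "gene_up", "log2fc": 2.7, "padj": 1e-10},
    {"gene": "gene_down", "log2fc": -1.9, "padj": 0.001},
    {"gene": "gene_small_effect", "log2fc": 0.8, "padj": 1e-6},
    {"gene": "gene_not_significant", "log2fc": 3.0, "padj": 0.20},
    {"gene": "gene_untested", "log2fc": 2.0, "padj": None},
]

degs = select_degs(toy_results)
print(degs)
```

Only the first two toy genes pass both thresholds; the others fail on effect size, on significance, or on a missing adjusted p-value.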
Among the DEGs, GRP (ERG log2FC=2.66, padj=1.60e-17) is the only gene among our putative biomarkers that is already used experimentally for prostate cancer diagnosis (Supplemental Table S2). Figure 3 shows the difference in mean expression of GRP (TPM values) between tumor and normal samples.
Figure 3. Boxplot of normalized read counts (Transcripts Per Million, TPM) for Tumor versus Normal samples. Cutpoints = [0, 0.0001, 0.001, 0.01, 0.05, Inf], symbols [“****”, “***”, “**”, “*”, “ns”].
In addition to the DEGs, biomarkers with a fold change (FC) ≥ 50 and an average TPM ≤ 10 in control samples were identified as potential treatment targets (Supplemental Table S3).
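The two conditions for potential treatment targets can be written as a small predicate. The sketch below is a hedged illustration: it assumes that FC is the ratio of mean tumor TPM to mean normal TPM, that "control samples" means the normal-tissue samples, and that a gene with zero expression in normal tissue passes if it is expressed in tumor at all. The function name and the toy values are hypothetical.

```python
from statistics import mean

# Sketch of the target filter: FC >= 50 and average normal TPM <= 10.
# Assumptions: FC = mean(tumor TPM) / mean(normal TPM); "control" = normal tissue.

def is_candidate_target(tumor_tpm, normal_tpm, fc_cutoff=50.0, normal_tpm_cutoff=10.0):
    """Return True if a gene meets both criteria for a potential treatment target."""
    avg_tumor = mean(tumor_tpm)
    avg_normal = mean(normal_tpm)
    if avg_normal > normal_tpm_cutoff:
        return False  # too highly expressed in normal tissue
    if avg_normal == 0:
        # No expression in normal tissue: treated here as an unbounded fold change.
        return avg_tumor > 0
    return avg_tumor / avg_normal >= fc_cutoff


# Toy TPM values for illustration only.
strong_candidate = is_candidate_target([120.0, 80.0, 100.0], [1.0, 2.0, 3.0])
high_in_normal = is_candidate_target([5000.0, 3000.0], [20.0, 30.0])
weak_fold_change = is_candidate_target([30.0, 50.0], [1.0, 1.0])
print(strong_candidate, high_in_normal, weak_fold_change)
```

The second condition matters because a large fold change alone does not rule out substantial expression in normal tissue.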

Biomarker Selection

Decision Tree

Gene expression values were discretized into three categories (Low, Medium and High) to fit a decision tree model for the prediction of Tissue Type (Tumor/Normal). Figure 3 shows that the initial distribution of the n=246 samples is 20% Normal and 80% Tumor. If the expression of AL109976.1 is high or medium, THEN the Tissue Type is predicted as Normal. However, if the expression of AL109976.1 is low and the expression of AC138647.2 is low, THEN the predicted Tissue Type is Tumor.
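The text does not state how the Low, Medium and High categories were derived. One common choice, shown here purely as an assumption, is to cut each gene at its tertiles so that roughly a third of the samples fall in each category. The function name `discretize` and the toy TPM values are hypothetical.

```python
from statistics import quantiles

# Sketch of a tertile-based discretization (an assumed scheme, not the paper's).

def discretize(values):
    """Map each expression value to 'low', 'medium' or 'high' using tertile cut points."""
    low_cut, high_cut = quantiles(values, n=3)  # two cut points for three groups
    labels = []
    for value in values:
        if value <= low_cut:
            labels.append("low")
        elif value <= high_cut:
            labels.append("medium")
        else:
            labels.append("high")
    return labels


# Toy TPM values for one gene across nine samples (illustration only).
toy_tpm = [0.2, 0.5, 0.9, 1.4, 1.8, 2.5, 3.1, 4.0, 6.2]
levels = discretize(toy_tpm)
print(levels)
```

Other schemes (fixed TPM thresholds, or cut points chosen per tissue type) would give different categories and possibly a different tree.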
Figure 3. Decision tree model built from prostate cancer data for the prediction of Tissue Type (Tumor/Normal). Accuracy: mean = 0.87 ± 0.01 (sd); final model cp = 0.001.
The complete set of rules is shown in Table 1, and the average expression of the selected genes in tumor and normal samples is shown in Table 2.
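Read top to bottom, the rules in Table 1 form a simple cascade: the four genes are checked in order, and the first one with medium or high expression leads to a Normal prediction. A minimal sketch of that cascade follows, using the gene names from Table 2 in place of the Ensembl IDs. The function name and the input layout (a dict from gene name to category) are hypothetical.

```python
# Sketch of the Table 1 decision rules as an ordered cascade.
# Gene order follows the tree: AL109976.1, AC138647.2, DUOXA1, CCK.

RULE_ORDER = ("AL109976.1", "AC138647.2", "DUOXA1", "CCK")


def predict_tissue_type(expression):
    """Predict 'Normal' or 'Tumor' from discretized expression levels.

    `expression` maps each gene name to 'low', 'medium' or 'high'.
    """
    for gene in RULE_ORDER:
        if expression[gene] in ("medium", "high"):
            return "Normal"  # first gene that is not low decides the outcome
    return "Tumor"  # all four genes are low


all_low = {gene: "low" for gene in RULE_ORDER}
cck_medium = dict(all_low, CCK="medium")
first_gene_high = dict(all_low, **{"AL109976.1": "high"})

print(predict_tissue_type(all_low))
print(predict_tissue_type(cck_medium))
print(predict_tissue_type(first_gene_high))
```

Because all four genes have lower average expression in tumor than in normal samples (Table 2), a Tumor prediction requires every one of them to be in the low category.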
Among the genes selected by the decision tree, DUOXA1 (Dual Oxidase Maturation Factor 1) acts in the maturation of DUOX1, generating reactive oxygen species (ROS) that promote AKT signaling and cell survival in prostate cancer. Studies indicate that DUOXA1 may be involved in resistance to apoptosis (cell death) and may serve as a potential biomarker in gene expression panels for predicting prostate cancer progression and survival. The CCK receptor has been proposed as a potential target to inhibit adipocyte-promoted cancer progression.

Bayesian Networks

Bayesian networks have the advantage of allowing the outcome (Tissue Type) to be predicted even when measurements are not available for all variables (genes). Moreover, both the structure of a Bayesian network and its conditional probabilities can be learnt from data. In this work, tabu search and maximum likelihood estimation were used for structure and parameter learning, respectively. One can also ask how the genes selected by structure learning compare with those selected by classical variable selection methods. The genes in the Bayesian network directly connected to GSTP1 are CRABP2, ATP1A4, MIR31HC, ANO1, FOLR1, HOXB2, MSLN, GATA3 and AP0019942 (see Figure 4).
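The two properties mentioned here, parameters estimated by maximum likelihood and prediction with unmeasured genes, can be illustrated on a toy network. The sketch below assumes a hypothetical structure in which Tissue Type is the parent of two placeholder genes; it is not the structure learnt in this work, and it does not implement tabu search. All names and data are invented for illustration.

```python
from collections import Counter

# Toy records: (tissue, gene_a_level, gene_b_level). Placeholder genes and values.
records = [
    ("Tumor", "low", "low"),
    ("Tumor", "low", "high"),
    ("Tumor", "low", "low"),
    ("Tumor", "high", "low"),
    ("Normal", "high", "high"),
    ("Normal", "high", "low"),
]


def fit_cpts(rows):
    """Maximum likelihood estimates, i.e. relative frequencies of observed counts."""
    tissue_counts = Counter(row[0] for row in rows)
    prior = {tissue: count / len(rows) for tissue, count in tissue_counts.items()}
    cpts = []
    for column in (1, 2):  # one conditional table P(gene level | tissue) per gene
        pair_counts = Counter((row[0], row[column]) for row in rows)
        cpts.append({key: count / tissue_counts[key[0]]
                     for key, count in pair_counts.items()})
    return prior, cpts


def posterior(prior, cpts, evidence):
    """P(tissue | evidence). A level of None means the gene was not measured.

    In this structure an unmeasured child sums out to 1, so its factor is skipped.
    The toy evidence used below always has non-zero probability.
    """
    scores = {}
    for tissue, probability in prior.items():
        for cpt, level in zip(cpts, evidence):
            if level is not None:
                probability *= cpt.get((tissue, level), 0.0)
        scores[tissue] = probability
    total = sum(scores.values())
    return {tissue: score / total for tissue, score in scores.items()}


prior, cpts = fit_cpts(records)
partial = posterior(prior, cpts, ("high", None))   # gene B not measured
complete = posterior(prior, cpts, ("high", "high"))
print(partial)
print(complete)
```

With only gene A observed the toy model still returns a posterior over Tissue Type; adding the second measurement sharpens it.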

Random Forest

A random forest model was constructed from the 1314 putative biomarkers (log2FC ≥ 1.5 and padj ≤ 0.05) and variable importance analysis was applied. Notably, the results corroborate GSTP1 as the most important variable for the prediction of Tissue Type (Normal/Tumor).
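The idea behind impurity-based variable importance can be shown with a much simpler stand-in: the largest decrease in Gini impurity that a single threshold split on one gene can achieve. This is not the random forest procedure used here, which aggregates such decreases over many trees grown on bootstrap samples with random feature subsets, and the importance metric actually reported may be defined differently. Gene names and values below are toy data.

```python
# Sketch: single-split Gini impurity decrease as a simplified importance score.

def gini(labels):
    """Gini impurity for a two-class label list ('Tumor' / 'Normal')."""
    n = len(labels)
    if n == 0:
        return 0.0
    p_tumor = sum(1 for label in labels if label == "Tumor") / n
    return 2.0 * p_tumor * (1.0 - p_tumor)


def best_gini_decrease(values, labels):
    """Largest impurity decrease achievable by one threshold split on `values`."""
    parent = gini(labels)
    n = len(labels)
    best = 0.0
    # Every distinct value except the maximum is a candidate threshold.
    for threshold in sorted(set(values))[:-1]:
        left = [y for v, y in zip(values, labels) if v <= threshold]
        right = [y for v, y in zip(values, labels) if v > threshold]
        children = (len(left) * gini(left) + len(right) * gini(right)) / n
        best = max(best, parent - children)
    return best


labels = ["Normal", "Normal", "Normal", "Tumor", "Tumor", "Tumor"]
toy_expression = {
    "gene_informative": [9.0, 8.5, 7.9, 1.2, 0.8, 1.5],  # separates the classes
    "gene_noise": [3.0, 1.0, 2.0, 2.5, 3.5, 1.5],        # does not
}

importance = {gene: best_gini_decrease(values, labels)
              for gene, values in toy_expression.items()}
ranking = sorted(importance, key=importance.get, reverse=True)
print(ranking)
```

A gene that separates tumor from normal samples with one cut receives the maximum possible decrease, while an uninformative gene scores close to zero.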
Figure 5. Variable importance analysis with random forest. The 20 variables with the highest importance (Overall) are shown in the plot.

Conclusions

For the cohort used in this work, GSTP1 is the most important gene for the diagnosis of prostate cancer tumors. I searched PubMed for the names of the genes automatically selected by the decision tree, using the query (gene_name + “prostate cancer”[Title/Abstract] + biomarker) AND ((“2026”[Date - Publication] : “2026”[Date - Publication])), and found that the GSTP1 gene has publications implicating it in prostate cancer in 2026 [11,12,13].
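For repeated searches over several genes, the query string can be assembled programmatically. The helper below is a hypothetical sketch that only builds the text of the query and does not contact PubMed; it assumes that the "+" signs in the query stand for the AND operator.

```python
# Sketch: build the PubMed query string for one gene (no network access).
# Assumption: "+" in the query described in the text means AND.

def build_pubmed_query(gene_name, year=2026):
    """Return the search string for `gene_name`, restricted to one publication year."""
    topic = f'({gene_name} AND "prostate cancer"[Title/Abstract] AND biomarker)'
    date_range = f'(("{year}"[Date - Publication] : "{year}"[Date - Publication]))'
    return f"{topic} AND {date_range}"


queries = {gene: build_pubmed_query(gene) for gene in ("GSTP1", "DUOXA1", "CCK")}
print(queries["GSTP1"])
```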
The format for bioinformatics research in cancer differs between scientific (academic) and applied (hospital) studies. The proposed framework is intended for user agency in an applied setting.

Availability of data and materials

The data that support the findings of this study are openly available in the GDC portal. A GitHub repository with data and scripts to reproduce the results is available at: https://github.com/datasciencebioinformatics/BiomarkerIdentification_ProstateCancer/.

Authors’ contributions

Felipe Leal Valentim (FLV) conceived and implemented the project.

Funding declaration

FLV's postdoctoral position was funded by FUSP, but this work was performed independently.

Supplementary Materials

The following supporting information can be downloaded at: Preprints.org.

References

  1. Hamed, N.W.; Elbeljihy, H.S.; Hussin, S.A.; Fouda, R.M.; Oy, E.K.; W Magar, R. From prostate specific antigen to genomic signatures: Advances in biomarkers for prostate cancer diagnosis and prognosis. Transl Oncol. 2026, 66, 102719.
  2. Erho, N.; Crisan, A.; Vergara, I.A.; Mitra, A.P.; Ghadessi, M.; Buerki, C.; Bergstralh, E.J.; Kollmeyer, T.; Fink, S.; Haddad, Z.; Zimmermann, B.; Sierocinski, T.; Ballman, K.V.; Triche, T.J.; Black, P.C.; Karnes, R.J.; Klee, G.; Davicioni, E.; Jenkins, R.B. Discovery and validation of a prostate cancer genomic classifier that predicts early metastasis following radical prostatectomy. PLoS One 2013, 8(6), e66855.
  3. Knezevic, D.; Goddard, A.D.; Natraj, N.; Cherbavaz, D.B.; Clark-Langone, K.M.; Snable, J.; Watson, D.; Falzarano, S.M.; Magi-Galluzzi, C.; Klein, E.A.; Quale, C. Analytical validation of the Oncotype DX prostate cancer assay - a clinical RT-PCR assay optimized for prostate needle biopsies. BMC Genomics 2013, 14, 690.
  4. Kuhl, V.; Clegg, W.; Meek, S.; Lenz, L.; Flake, D.D., 2nd; Ronan, T.; Kornilov, M.; Horsch, D.; Scheer, M.; Farber, D.; Zalaznick, H.; Cussenot, O.; Compérat, E.; Cancel-Tassin, G.; Wild, P.J.; Chun, F.K.; Mandel, P.; Moinfar, F.; Cohen, T.; Delee, S.; Kronenwett, R.; Doedt, J. Development and validation of a cell cycle progression signature for decentralized testing of men with prostate cancer. Biomark Med 2022, 16(6), 449–459.
  5. Hamed, N.W.; Elbeljihy, H.S.; Hussin, S.A.; Fouda, R.M.; Oy, E.K.; W Magar, R. From prostate specific antigen to genomic signatures: Advances in biomarkers for prostate cancer diagnosis and prognosis. Transl Oncol. 2026, 66, 102719.
  6. Roth, J.A.; Ramsey, S.D.; Carlson, J.J. Cost-Effectiveness of a Biopsy-Based 8-Protein Prostate Cancer Prognostic Assay to Optimize Treatment Decision Making in Gleason 3 + 3 and 3 + 4 Early Stage Prostate Cancer. Oncologist 2015, 20(12), 1355–64.
  7. Hamed, N.W.; Elbeljihy, H.S.; Hussin, S.A.; Fouda, R.M.; Oy, E.K.; W Magar, R. From prostate specific antigen to genomic signatures: Advances in biomarkers for prostate cancer diagnosis and prognosis. Transl Oncol. 2026, 66, 102719.
  8. Love, M.I.; Huber, W.; Anders, S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol. 2014, 15(12), 550.
  9. Trummer, O.; Langsenlehner, U.; Krenn-Pilko, S.; Pieber, T.R.; Obermayer-Pietsch, B.; Gerger, A.; Renner, W.; Langsenlehner, T. Vitamin D and prostate cancer prognosis: a Mendelian randomization study. World J Urol 2016, 34(4), 607–11.
  10. Sanders, I.; Holdenrieder, S.; Walgenbach-Brünagel, G.; von Ruecker, A.; Kristiansen, G.; Müller, S.C.; Ellinger, J. Evaluation of reference genes for the analysis of serum miRNA in patients with prostate cancer, bladder cancer and renal cell carcinoma. Int J Urol. 2012, 19(11), 1017–25.
  11. Schut, I.C.; Waterfall, P.M.; Ross, M.; O’Sullivan, C.; Miller, W.R.; Habib, F.K.; Bayne, C.W. MUC1 expression, splice variant and short form transcription (MUC1/Z, MUC1/Y) in prostate cell lines and tissue. BJU Int. 2003, 91(3), 278–83.
  12. De Vrieze, M.; Zhang, N.; Seibold, P.; Gerhäuser, C.; Albers, P.; Krilaviciute, A. Clinical validity of circulating tumor DNA as a diagnostic biomarker for prostate cancer: a systematic review. Cancer Epidemiol Biomarkers Prev 2026.
  13. Huang, Y.; Mao, J.; Li, X. Emerging biomarkers in prostate cancer diagnosis and treatment: Insights into genetic, RNA and metabolic markers (Review). Int J Oncol. 2026, 68(2), 15.
  14. Ren, Z.; Liu, X.; Zhang, J.; Song, M.; Yang, Q.; Li, C.; Liu, D. Network pharmacology research integrating LC-MS/MS, machine learning, molecular docking, and dynamics simulation: key biomarkers and potential mechanisms of Phellinus igniarius against prostate cancer. In Silico Pharmacol. 2026, 14(1), 65.
Figure 1. A two-phased framework for biomarker selection with bioinformatics and machine learning methods.
Figure 2. PCA plot of RNA-seq samples with DEGs for tumor and normal samples.
Figure 4. Bayesian network structure learnt from the tumor genes. The figure shows only the nodes directly connected to GSTP1 or to Tissue Type.
Table 1. Decision tree rules for the prediction of tissue type.
Rule | n | loss | yprob Normal | yprob Tumor | terminal node
root | 246 | 49 | 0.20 | 0.80 |
IF ENSG00000277287.1=high,medium THEN Tissue Type is Normal | 26 | 2 | 0.92 | 0.08 | *
IF ENSG00000277287.1=low THEN Tissue Type is Tumor | 220 | 25 | 0.11 | 0.89 |
IF ENSG00000277287.1=low AND ENSG00000287325.1=high,medium THEN Tissue Type is Normal | 7 | 0 | 1.00 | 0.00 |
IF ENSG00000277287.1=low AND ENSG00000287325.1=low THEN Tissue Type is Tumor | 213 | 18 | 0.08 | 0.92 |
IF ENSG00000277287.1=low AND ENSG00000287325.1=low AND ENSG00000140254.12=high,medium THEN Tissue Type is Normal | 8 | 2 | 0.75 | 0.25 | *
IF ENSG00000277287.1=low AND ENSG00000287325.1=low AND ENSG00000140254.12=low THEN Tissue Type is Tumor | 205 | 12 | 0.06 | 0.94 |
IF ENSG00000277287.1=low AND ENSG00000287325.1=low AND ENSG00000140254.12=low AND ENSG00000187094.12=high,medium THEN Tissue Type is Normal | 7 | 3 | 0.57 | 0.43 | *
IF ENSG00000277287.1=low AND ENSG00000287325.1=low AND ENSG00000140254.12=low AND ENSG00000187094.12=low THEN Tissue Type is Tumor | 198 | 8 | 0.04 | 0.96 | *
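As a consistency check on Table 1, the nodes that are not split further can be summed to recover the total sample count and the number of training samples that the tree misclassifies. The sketch below treats the n = 7, loss = 0 node as a leaf even though it carries no asterisk, which is an assumption. The resulting figure is a resubstitution accuracy on the training data, so it need not match the mean accuracy of 0.87 reported with the decision tree figure, which presumably comes from a resampling procedure.

```python
# Leaves of the tree in Table 1 as (n, loss) pairs, read in table order.
# Assumption: the n=7, loss=0 node is a leaf although it is not marked with "*".
leaves = [
    (26, 2),   # AL109976.1 high/medium -> Normal
    (7, 0),    # AC138647.2 high/medium -> Normal
    (8, 2),    # DUOXA1 high/medium -> Normal
    (7, 3),    # CCK high/medium -> Normal
    (198, 8),  # all four genes low -> Tumor
]

total_samples = sum(n for n, _ in leaves)
misclassified = sum(loss for _, loss in leaves)
resubstitution_accuracy = (total_samples - misclassified) / total_samples

print(total_samples, misclassified, round(resubstitution_accuracy, 3))
```

The leaf counts add up to the 246 samples at the root, with 15 samples on the wrong side of their leaf.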
Table 2. Genes selected for the final decision tree model (cp = 0.001).
Gene id | Gene name | FC Tumor/Normal | Avg normal | Std normal | Avg tumor | Std tumor
ENSG00000277287.1 | AL109976.1 | 0.28 | 1.87 | 1.05 | 0.51 | 0.38
ENSG00000287325.1 | AC138647.2 | 0.22 | 1.19 | 1.19 | 0.27 | 0.33
ENSG00000140254.12 | DUOXA1 | 0.25 | 13.50 | 9.45 | 3.41 | 4.04
ENSG00000187094.12 | CCK | 0.21 | 52.31 | 65.52 | 10.88 | 19.84
* Gene id = Ensembl ID; FC Tumor/Normal = fold change; Avg normal = average TPM value among normal samples; Std normal = standard deviation of TPM values among normal samples; Avg tumor = average TPM value among tumor samples; Std tumor = standard deviation of TPM values among tumor samples.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
Copyright: This open access article is published under a Creative Commons CC BY 4.0 license, which permits free download, distribution, and reuse, provided that the author and preprint are cited in any reuse.