Working Paper, Article, Version 1. This version is not peer-reviewed.

Relevant and Non-redundant Feature Selection for Cancer Classification and Subtype Detection

Version 1 : Received: 11 June 2021 / Approved: 14 June 2021 / Online: 14 June 2021 (10:30:21 CEST)

How to cite: Rana, P.; Thai, P.; Dinh, T.; Ghosh, P. Relevant and Non-redundant Feature Selection for Cancer Classification and Subtype Detection. Preprints 2021, 2021060346

Abstract

Biologists seek to identify, from diverse omics data, a small number of features that are both relevant to the condition under study and non-redundant with one another. For example, statistical methods such as LIMMA and DESeq identify genes that are differentially expressed between a case group and a control group in transcript profiles. Researchers also apply various column subset selection algorithms to genomics datasets for a similar purpose. Unfortunately, the genes selected by such statistical or machine learning methods are often highly co-regulated, which makes their performance inconsistent. Here, we introduce a novel feature selection algorithm that selects highly disease-related and non-redundant features from a diverse set of omics datasets. We successfully applied this algorithm to three different biological problems: a) classification of disease versus normal samples, b) multiclass classification of different disease samples, and c) disease subtype detection. In terms of classification ROC-AUC, false-positive rate, and false-negative rate, our algorithm outperformed other gene selection and differential expression (DE) methods on all six types of cancer datasets from TCGA considered here, for both the binary and the multiclass classification problems. Moreover, the genes picked by our algorithm improved disease subtyping accuracy for four different cancer types over state-of-the-art methods. Hence, we posit that our proposed feature reduction method can help the community solve various problems, including the selection of disease-specific biomarkers, precision medicine design, and disease subtype detection.
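The abstract does not specify the algorithm itself, so the following is only a minimal sketch of the general idea of selecting features that are relevant yet non-redundant. It assumes a generic greedy filter, not the authors' method: relevance is taken to be the absolute Pearson correlation between a feature and the class label, and a candidate is rejected when its absolute correlation with any already selected feature reaches a threshold. The function names, the `max_redundancy` parameter, and the toy expression values are all hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def select_features(columns, labels, k, max_redundancy=0.8):
    """Greedy relevance/redundancy filter (illustrative assumption, not the paper's algorithm).

    columns: dict mapping a feature name to its per-sample values.
    labels:  per-sample class labels encoded as numbers.
    Returns up to k feature names, most relevant first.
    """
    # Rank features by relevance: |correlation with the label|, highest first.
    ranked = sorted(
        columns,
        key=lambda name: abs(pearson(columns[name], labels)),
        reverse=True,
    )
    selected = []
    for name in ranked:
        if len(selected) == k:
            break
        # Keep a candidate only if it is not too correlated with any kept feature.
        if all(
            abs(pearson(columns[name], columns[kept])) < max_redundancy
            for kept in selected
        ):
            selected.append(name)
    return selected


# Made-up toy data: three control samples (0) followed by three case samples (1).
labels = [0, 0, 0, 1, 1, 1]
expression = {
    "gene_a": [1.0, 1.2, 0.9, 3.1, 2.9, 3.0],  # strongly disease-related
    "gene_b": [1.5, 1.0, 1.2, 2.5, 3.3, 2.6],  # disease-related but co-regulated with gene_a
    "gene_c": [3.0, 1.0, 2.0, 4.0, 2.0, 3.0],  # moderately disease-related
    "gene_d": [1.0, 2.0, 1.5, 1.2, 1.9, 1.4],  # unrelated to the label
}

print(select_features(expression, labels, k=2))  # ['gene_a', 'gene_c']
```

In this toy example a purely relevance-based ranking would return `gene_a` and `gene_b`, which carry nearly the same signal; the redundancy check skips `gene_b` and picks `gene_c` instead.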

Subject Areas

feature subset selection; disease classification; subtype detection
