Preprint. Article. Version 2. Preserved in Portico. This version is not peer-reviewed.

ML Classification of Cancer Types Using High Dimensional Gene Expression Microarray Data

Version 1: Received: 29 January 2024 / Approved: 30 January 2024 / Online: 31 January 2024 (01:49:19 CET)
Version 2: Received: 17 April 2024 / Approved: 18 April 2024 / Online: 18 April 2024 (09:55:50 CEST)

How to cite: Mukhopadhyay, D.; Phanord, D.D.; Dalpatadu, R.J.; Gewali, L.P.; Singh, A.K. ML Classification of Cancer Types Using High Dimensional Gene Expression Microarray Data. Preprints 2024, 2024012067. https://doi.org/10.20944/preprints202401.2067.v2

Abstract

Cancer, a disease caused by the abnormal growth of cells in different parts of the body, is one of the leading causes of death globally. Microarray gene expression data plays a critical role in the identification and classification of cancer tissues. Owing to recent advances in Machine Learning (ML), researchers are analyzing gene expression data with a variety of ML techniques to model the progression rate and treatment of cancer patients, with great effect. However, the high dimensionality of gene expression datasets, together with the presence of highly correlated columns, leads to computational difficulties. This paper proposes the use of two ML classification techniques, Linear Discriminant Analysis (LDA) and Random Forest (RF), for classifying five types of cancer (breast, kidney, colon, lung, and prostate cancer) based on high-dimensional microarray gene expression data. Principal component analysis (PCA) was used for dimensionality reduction, and the principal component scores of the raw data were used for classification. Six distinct classification performance measures were used to evaluate these approaches; the RF method provided higher accuracy than the LDA method. The method and results of this article should be helpful to researchers who work with microarray data containing many genes.
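The workflow described in the abstract (PCA for dimensionality reduction, followed by LDA and RF classification on the principal component scores) can be illustrated with a minimal sketch. This is not the authors' code: it assumes Python with scikit-learn, uses a synthetic five-class dataset from `make_classification` as a stand-in for real microarray data, and the number of retained components (20), the standardization step, the 70/30 split, and the RF settings are all illustrative assumptions rather than values taken from the paper.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a microarray matrix (assumption, not the paper's data):
# few samples, many correlated features, five classes.
X, y = make_classification(
    n_samples=300, n_features=2000, n_informative=40, n_redundant=200,
    n_classes=5, n_clusters_per_class=1, class_sep=2.0, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)


def evaluate(classifier, n_components=20):
    """Fit scaler -> PCA -> classifier on the training split; return test accuracy."""
    model = make_pipeline(
        StandardScaler(),
        PCA(n_components=n_components, random_state=0),  # 20 components is an assumption
        classifier,
    )
    model.fit(X_train, y_train)
    return accuracy_score(y_test, model.predict(X_test))


results = {
    "LDA": evaluate(LinearDiscriminantAnalysis()),
    "RF": evaluate(RandomForestClassifier(n_estimators=200, random_state=0)),
}
for name, acc in results.items():
    print(f"{name}: test accuracy = {acc:.3f}")
```

Fitting PCA inside the pipeline means the components are estimated from the training split only, which avoids leaking test-set information into the reduced features. Accuracies on this synthetic data say nothing about the relative performance of LDA and RF reported in the paper.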

Keywords

principal components analysis; Linear Discriminant Analysis; Random Forest; precision; recall; F1; AUC; macro-averaged AUC; micro-averaged AUC

Subject

Biology and Life Sciences, Biochemistry and Molecular Biology
