Version 1: Received: 29 January 2024 / Approved: 30 January 2024 / Online: 31 January 2024 (01:49:19 CET)
Version 2: Received: 17 April 2024 / Approved: 18 April 2024 / Online: 18 April 2024 (09:55:50 CEST)
How to cite:
Mukhopadhyay, D.; Phanord, D.D.; Dalpatadu, R.J.; Gewali, L.P.; Singh, A.K. ML Classification of Cancer Types Using High Dimensional Gene Expression Microarray Data. Preprints 2024, 2024012067. https://doi.org/10.20944/preprints202401.2067.v1
APA Style
Mukhopadhyay, D., Phanord, D.D., Dalpatadu, R.J., Gewali, L.P., & Singh, A.K. (2024). ML Classification of Cancer Types Using High Dimensional Gene Expression Microarray Data. Preprints. https://doi.org/10.20944/preprints202401.2067.v1
Chicago/Turabian Style
Mukhopadhyay, D., Laxmi P Gewali and Ashok K Singh. 2024. "ML Classification of Cancer Types Using High Dimensional Gene Expression Microarray Data." Preprints. https://doi.org/10.20944/preprints202401.2067.v1
Abstract
Machine learning (ML) classifiers are used to classify a very wide dataset containing gene expression microarray data from patients with five types of cancer (breast, kidney, colon, lung, and prostate). Because the dataset has a very large number of columns, the code produced stack overflow errors on the raw data, so we used Principal Component Analysis (PCA) for dimensionality reduction and classified on the principal component (PC) scores of the raw data. PCA was run using a fast algorithm that can compute PC scores for very large datasets. High classification accuracies were obtained using just the first two PC scores. Two ML classifiers, Linear Discriminant Analysis (LDA) and Random Forest (RF), were used; RF achieved higher accuracy than LDA. The results of this article should be helpful to researchers who work with a large number of genes in microarray data.
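The workflow summarized in the abstract (reduce a very wide expression matrix to its first two PC scores, then classify with LDA and RF) can be illustrated with a minimal Python sketch. Everything specific here is an assumption made for illustration, not the authors' code or data: the matrix is synthetic, scikit-learn's randomized SVD solver stands in for the unnamed "fast algorithm", PCA is fitted on the training split only, and the split ratio, the forest size, and the name `evaluate` are arbitrary choices.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import label_binarize

rng = np.random.default_rng(0)

# Synthetic stand-in for a wide microarray matrix (NOT the paper's data):
# 5 classes, 40 samples each, 2000 "genes" (columns).
n_classes, n_per_class, n_genes = 5, 40, 2000
y = np.repeat(np.arange(n_classes), n_per_class)
centers = rng.normal(0.0, 1.0, size=(n_classes, n_genes))  # per-class mean profile
X = centers[y] + rng.normal(0.0, 2.0, size=(y.size, n_genes))

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)


def evaluate(classifier):
    """Fit PCA (2 components) + classifier on the training split; score the test split."""
    model = make_pipeline(
        # Randomized SVD is an assumed stand-in for the paper's "fast algorithm".
        PCA(n_components=2, svd_solver="randomized", random_state=0),
        classifier,
    )
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    proba = model.predict_proba(X_test)
    y_onehot = label_binarize(y_test, classes=np.arange(n_classes))
    return {
        "accuracy": accuracy_score(y_test, pred),
        "macro_f1": f1_score(y_test, pred, average="macro"),
        # One-vs-rest AUC averaged over classes (macro) ...
        "macro_auc": roc_auc_score(y_test, proba, multi_class="ovr", average="macro"),
        # ... and pooled over all class/sample pairs (micro).
        "micro_auc": roc_auc_score(y_onehot, proba, average="micro"),
    }


results = {
    "LDA": evaluate(LinearDiscriminantAnalysis()),
    "RF": evaluate(RandomForestClassifier(n_estimators=200, random_state=0)),
}

for name, metrics in results.items():
    print(name, {k: round(v, 3) for k, v in metrics.items()})
```

Because the data are simulated, the printed scores say nothing about the accuracies reported in the article; the sketch only shows how two PC scores can feed both classifiers and how the macro- and micro-averaged AUC named in the keywords can be computed.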
Keywords
Linear Discriminant Analysis; Random Forest; Precision; Recall; F1; AUC; macro-averaged AUC; micro-averaged AUC
Subject
Biology and Life Sciences, Other
Copyright:
This is an open access article distributed under the Creative Commons Attribution License which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.