Preprint | Article | Version 1 | Preserved in Portico | This version is not peer-reviewed

Using Machine Learning to Identify the Risk Factors of Pancreatic Cancer from the NIH PLCO Dataset

Version 1 : Received: 26 August 2023 / Approved: 28 August 2023 / Online: 29 August 2023 (10:11:39 CEST)

How to cite: Dutta, A. Using Machine Learning to Identify the Risk Factors of Pancreatic Cancer from the NIH PLCO Dataset. Preprints 2023, 2023081933. https://doi.org/10.20944/preprints202308.1933.v1

Abstract

Background: Pancreatic cancer (PC) has a poor prognosis and a low survival rate, so there is a pressing need to identify its risk factors. The purpose of this study is to identify a subset of factors (a.k.a. features) that predict PC from the Prostate, Lung, Colorectal and Ovarian (PLCO) cancer dataset, which consists of responses to 65 questions about demographics, cancer and health history, medication usage, and smoking habits from 154,897 participants.

Method: Selecting the subset of features that predicts PC with the highest probability poses two challenges: the problem is computationally intractable, and the PLCO dataset is highly imbalanced. We use an innovative method that draws on the dataset in a balanced way without up- or down-sampling. We then apply nine feature selection methods to select the optimal subset of features from the preprocessed and balanced dataset.

Results: Our preprocessed dataset consists of 32 risk factors (8 demographics, 5 cancer history, 13 health history, 2 medication usage, 4 smoking habits). Risk factors belonging to cancer and health history, followed by smoking habits, were consistently chosen by the feature selection methods. We also discuss results from the medical sciences literature that corroborate our findings.

Conclusions: Risk factors belonging to cancer and health history are the most prominent ones for PC. In particular, a previous diagnosis of PC is chosen as the most prominent risk factor by the majority of methods. While most of our findings are consistent with the literature, some point to novel factors that may not have received due attention from the research community.

Keywords

Pancreatic cancer; NIH PLCO dataset; feature selection; classification

Subject

Engineering, Electrical and Electronic Engineering
