Early Lung Cancer Prediction Using Neural Network with Cross-validation

Lung cancer, also known as lung carcinoma, is a malignant tumor characterized by uncontrolled cell growth in lung tissue. It is one of the leading causes of death worldwide. Early detection of this disease can help medical care units and physicians provide countermeasures to patients. The objective of this paper is to present an automated tool that takes influential risk factors of lung cancer as input and detects patients with a higher probability of being affected by the disease. A neural network classifier combined with cross-validation is proposed as the predictive tool. The proposed method is then compared with a baseline classifier, the Gradient Boosting Classifier, in order to assess its prediction performance.


Introduction-
A patient's past health records can be utilised for the early prediction of disease. Timely detection and screening play a leading role in the prevention of lung cancer. This paper focuses on identifying patients at risk of severe lung cancer at an early stage so that physicians can suggest countermeasures. Prediction at an early stage will help health care systems manage the disease carefully, and may help medical experts take informed decisions and act accordingly. Data mining and knowledge discovery are applied to past health records to identify hidden patterns and relationships in the data. A recommendation system is proposed in this paper that automatically analyses a patient's previous health records in order to determine the possibility of being affected by lung cancer. Supervised machine learning approaches are utilised for this prediction.
The proposed system automatically takes into account contributing factors such as the patient's age, alcohol consumption and smoking habits when deciding whether the patient is likely to suffer from lung cancer in the near future. It is essentially a classifier model intended to predict the likelihood of lung cancer. A neural network based framework evaluated with a 10-fold cross-validation procedure is implemented to obtain the prediction in advance. Once the model is implemented, it is evaluated, and the evaluation results are compared with those of a Gradient Boosting classifier, which serves as the baseline classifier in this context.

Related Work-
Lung cancer is one of the most common cancers in the world; after breast and prostate cancer, it is the third most common cancer. The standard of care for people with early-stage lung cancer is thoracic surgery.
Smoking is the most direct cause of lung cancer and accounts for 90% of lung cancer deaths [1][2]. Other causes of lung cancer in non-smokers are attributed to genetic factors, passive smoking, and air pollutants such as asbestos and radon gas [3][4][5][6].
Some researchers have conducted studies on female and male patients with a tendency towards lung cancer. These studies revealed a better prognosis in females than in males after adjustment for age, disease stage and smoking history [7], which may be evidence that sex is a predictor of lung cancer prognosis. Similarly, the results suggested a poorer prognosis for older patients than for younger patients [8]. This suggests that patient age is also an important prognostic factor in lung cancer survival and treatment.
Machine learning classifiers have been used to extract features from CT image datasets in order to detect lung disease in CT images of the thorax. Multi-crop convolutional neural network approaches have also been applied to lung nodule classification in order to detect malignancy. Unsupervised deep embedding clustering has been studied extensively in terms of distance functions for the detection of lung cancer [9].

Proposed Methodology-
A multi-step procedure is followed to build the proposed model and apply it to the lung cancer dataset. The objective of this study is to detect patients at risk of severe lung disease. The required steps are explained as follows-

Data Collection and Pre-processing-
To fulfil the objective of this paper, a lung cancer dataset is collected from Kaggle [Add Ref]. The dataset is a collection of attributes such as the patient's age, smoking tendency and alcohol consumption, which are promising predictors of lung cancer. Pre-processing techniques such as handling missing values and eliminating irrelevant attributes (like the patient's name) are applied to the collected dataset. Scaling the attributes to a specified range yields a transformed dataset that can be fitted to the classifier model.
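The pre-processing steps can be illustrated with a minimal sketch. The paper does not state which imputation or scaling method was used, so mean imputation and min-max scaling to [0, 1] are assumptions here, and the records and field names are hypothetical rather than taken from the dataset.

```python
# Hypothetical records; field names and values are illustrative only.
records = [
    {"name": "A", "age": 35, "smokes": 3, "alcohol": 4},
    {"name": "B", "age": 62, "smokes": 20, "alcohol": None},  # missing value
    {"name": "C", "age": 48, "smokes": 12, "alcohol": 2},
]

# Keeping only these keys drops the irrelevant "name" attribute.
FEATURES = ["age", "smokes", "alcohol"]


def impute_mean(rows, keys):
    """Replace missing (None) entries with the mean of the observed values."""
    means = {}
    for k in keys:
        observed = [r[k] for r in rows if r[k] is not None]
        means[k] = sum(observed) / len(observed)
    return [{k: (r[k] if r[k] is not None else means[k]) for k in keys}
            for r in rows]


def min_max_scale(rows, keys):
    """Rescale every attribute linearly into the range [0, 1]."""
    lo = {k: min(r[k] for r in rows) for k in keys}
    hi = {k: max(r[k] for r in rows) for k in keys}
    return [
        {k: (r[k] - lo[k]) / (hi[k] - lo[k]) if hi[k] > lo[k] else 0.0
         for k in keys}
        for r in rows
    ]


scaled = min_max_scale(impute_mean(records, FEATURES), FEATURES)
```

After this step every remaining attribute lies between 0 and 1, so no single attribute dominates the others purely because of its unit.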

Methodology-
Classification techniques are applied to a dataset to map inputs to a target class. For this purpose, a neural network architecture is proposed in this paper that accepts several factors that affect lung cancer and predicts the possibility of being affected by it. The proposed neural network is composed of several neurons. Each neuron accepts its inputs and applies an activation function to produce an output. Activation functions introduce non-linearity into the computation and, depending on the function, constrain outputs to a certain range. In other words, an activation function is the step that maps a neuron's input signal to its output signal.
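As a small illustration of how a neuron maps inputs to an output, the sketch below implements the relu and sigmoid activation functions and a single neuron. The input values and weights are hypothetical and chosen only to show the computation; they are not parameters of the trained model.

```python
import math


def relu(z):
    """Rectified linear unit: passes positive values, clips negatives to 0."""
    return max(0.0, z)


def sigmoid(z):
    """Logistic function: squashes any real value into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def neuron(inputs, weights, bias, activation):
    """Weighted sum of the inputs plus a bias, passed through an activation."""
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    return activation(z)


# Hypothetical scaled inputs and weights, for illustration only.
x = [0.5, 0.2, 0.9]
w = [0.4, -0.6, 0.1]

hidden = neuron(x, w, bias=0.0, activation=relu)             # hidden-layer style
prob = neuron([hidden], [1.0], bias=0.0, activation=sigmoid)  # output-layer style
```

Because sigmoid always returns a value between 0 and 1, the output of the final neuron can be read as a probability of belonging to the positive class.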
After configuring the neural model, the training process is executed. Training proceeds in cycles known as epochs, where one epoch is a complete pass over the training dataset. Within an epoch, the dataset is partitioned into smaller sections called batches, and the model is updated iteratively, one batch at a time, until every batch has been processed and the epoch is complete.
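The relationship between epochs and batches can be sketched as follows. The training-set size and batch size used here are hypothetical, and shuffling before batching is an assumption that reflects common practice rather than a detail given in the paper.

```python
import math
import random


def iterate_minibatches(samples, batch_size, seed=0):
    """Yield shuffled mini-batches that together cover the data exactly once."""
    order = list(samples)
    random.Random(seed).shuffle(order)
    for start in range(0, len(order), batch_size):
        yield order[start:start + batch_size]


n_samples = 47   # hypothetical training-set size
batch_size = 10  # hypothetical batch size

# One epoch: every sample is visited once, one batch (one update) at a time.
batches = list(iterate_minibatches(range(n_samples), batch_size))
updates_per_epoch = math.ceil(n_samples / batch_size)
```

With these values one epoch consists of five weight updates, the last of which uses the seven remaining samples.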

Implementation-
When designing this model, it is necessary to tune the hyper-parameters in order to maximise efficiency. This section describes the specification of the model along with its hyper-parameters. The model consists of three Dense layers with 64, 32 and 1 nodes respectively. Sigmoid and relu are two popular activation functions, and both are used here: the first two layers apply relu as the activation function and the final layer applies the sigmoid activation function.
Finally, the model is compiled with the adam optimizer and trained for 30 epochs with a batch size of 10. Fine-tuning the hyper-parameters helps the model obtain the best predictive result. The neural network has a total of 2,433 parameters, which are trained to obtain the prediction. A summary of the model is shown in Figure1.
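The parameter total can be checked with a short calculation. A fully connected layer has one weight per input-unit pair plus one bias per unit. The number of input features is not stated in the text; the value of four used below is an assumption, chosen because it reproduces the reported total of 2,433 for layers of 64, 32 and 1 nodes.

```python
def dense_params(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units


def total_params(n_features, layer_sizes):
    """Sum the parameters of consecutive fully connected layers."""
    total, fan_in = 0, n_features
    for units in layer_sizes:
        total += dense_params(fan_in, units)
        fan_in = units  # the next layer receives this layer's outputs
    return total


n_features = 4  # assumption: not stated in the text
n_params = total_params(n_features, [64, 32, 1])
print(n_params)  # 320 + 2080 + 33 = 2433
```

For a different number of input features d, the same architecture would have 64d + 2177 trainable parameters.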

Figure1. Summary of Neural Network model
This implementation is followed by 10-fold cross-validation to estimate the skill of the model. It is a resampling method in which the dataset is divided into 10 groups (folds); in each iteration one fold is used as the test data and the remaining nine folds are used as training data. The stratified K-fold technique is used so that each fold preserves approximately the same class proportions as the full dataset. In each iteration the model is fitted to the training folds and evaluated on the test fold. The evaluation scores of all iterations are then collected and their mean is calculated.
This neural network, together with the 10-fold cross-validation procedure, is applied to the lung cancer dataset. The model is evaluated and compared with a benchmark classifier, the Gradient Boosting Classifier.
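The stratified 10-fold splitting described in this section can be sketched in plain Python. This is an illustrative implementation and not the library routine used in the experiments; the label counts are hypothetical.

```python
import random
from collections import defaultdict


def stratified_kfold(labels, k, seed=0):
    """Yield (train_idx, test_idx) pairs with similar class ratios per fold."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    folds = [[] for _ in range(k)]
    rng = random.Random(seed)
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal each class's samples round-robin so every fold gets its share.
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)

    for i in range(k):
        test_idx = sorted(folds[i])
        train_idx = sorted(j for f, fold in enumerate(folds) if f != i
                           for j in fold)
        yield train_idx, test_idx


# Hypothetical labels: 30 negative and 20 positive cases.
labels = [0] * 30 + [1] * 20
splits = list(stratified_kfold(labels, k=10))
```

In the full procedure a fresh model would be trained on `train_idx` and scored on `test_idx` in each of the 10 iterations, and the 10 scores would then be averaged.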

Classifier Performance Evaluation-
Once predictions are obtained from the classifier models, it is necessary to assess the quality of the predictive results. Assessing the performance of a model requires evaluation metrics, and these metrics are used to identify the best approach to the problem. The metrics employed by this framework are described as follows. Accuracy is the ratio of correct predictions to the total number of instances considered. However, accuracy alone may not be a sufficient metric for evaluating a model's performance, since it does not distinguish between the types of wrong prediction. To address this, precision and recall must also be calculated; the F1-score combines them as their harmonic mean. The Cohen-kappa score measures the agreement between predicted and actual labels after correcting for agreement expected by chance, and the mean squared error (MSE) is the average squared difference between predicted and actual values. The best model should exhibit a lower MSE value and higher values of accuracy, F1-score and Cohen-kappa score.
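The four metrics can be computed directly from the confusion counts of a binary classifier. The sketch below uses the standard definitions; the label vectors are hypothetical and are not results from the paper.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for binary 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn


def evaluate(y_true, y_pred):
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    n = tp + tn + fp + fn
    accuracy = (tp + tn) / n
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n
    # Cohen's kappa: observed agreement corrected for chance agreement p_e.
    p_e = (((tp + fn) / n) * ((tp + fp) / n)
           + ((tn + fp) / n) * ((tn + fn) / n))
    kappa = (accuracy - p_e) / (1 - p_e) if p_e != 1 else 0.0
    return {"accuracy": accuracy, "f1": f1, "kappa": kappa, "mse": mse}


# Hypothetical labels and predictions, for illustration only.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]
scores = evaluate(y_true, y_pred)
```

Note that when MSE is computed on hard 0/1 predictions, as assumed here, it equals the misclassification rate, that is, one minus the accuracy.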

Baseline Classifier-
A Gradient Boosting classifier is implemented in this paper to serve as the baseline against which the performance of the proposed method is compared. Gradient boosting [18] is a boosting-based technique that learns by consecutively fitting new models to the errors of the models built so far, in order to provide a more accurate estimate of the response variable. Each new base model is constructed to decrease the loss function computed on the training samples. The loss function measures how far the predicted values deviate from the desired targets, and these errors are measured and analysed to arrive at an optimal prediction. A forward stagewise procedure is the most common method for updating the model: at every stage a base learner is added and the loss function is reduced, which improves accuracy.
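The stagewise idea can be illustrated with a deliberately simplified sketch. It boosts one-split regression stumps on a single feature under squared loss, for which the negative gradient is simply the residual. This is not the classifier used in the experiments, which would use a classification loss and multi-feature trees; the data, the learning rate and the number of estimators are hypothetical.

```python
def fit_stump(xs, residuals):
    """Find the single threshold split that best fits the residuals."""
    best = None
    for threshold in sorted(set(xs)):
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        if not left or not right:
            continue
        lmean, rmean = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((r - lmean) ** 2 for r in left)
               + sum((r - rmean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, lmean, rmean)
    _, threshold, lmean, rmean = best
    return lambda x: lmean if x <= threshold else rmean


def gradient_boost(xs, ys, n_estimators=50, learning_rate=0.1):
    """Add one stump per stage, each fitted to the current residuals."""
    base = sum(ys) / len(ys)          # initial constant prediction
    preds = [base] * len(xs)
    learners = []
    for _ in range(n_estimators):
        # For squared loss the negative gradient is the residual y - prediction.
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        learners.append(stump)
        preds = [p + learning_rate * stump(x) for x, p in zip(xs, preds)]
    return lambda x: base + learning_rate * sum(s(x) for s in learners)


# Hypothetical one-feature data with a clean class boundary.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [0, 0, 0, 0, 1, 1, 1, 1]
model = gradient_boost(xs, ys, n_estimators=50, learning_rate=0.1)
```

Each stage shrinks the remaining error by a fixed fraction set by the learning rate, which is why the number of estimators and the learning rate are usually tuned together.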
The transformed and pre-processed data are partitioned into training and testing sets with a ratio of 8:2. The Gradient Boosting classifier is built with 500 estimators, after which boosting terminates. The classifier is then fitted to the training dataset and predictions are obtained for the test dataset. The predictions are evaluated in terms of accuracy, F1-score, Cohen-kappa score and MSE.

Experimental Results-
The prediction performance of the proposed model, that is, the neural network with 10-fold cross-validation, is shown in Table1. A comparative analysis with the Gradient Boosting classifier in terms of the specified evaluation metrics is also provided. This analysis shows that the proposed model is superior at detecting patients with severe lung disease.

Conclusions-
A machine learning based lung cancer prediction model has been presented to support clinicians in managing patients. A neural network with a 10-fold cross-validation procedure is proposed that predicts lung cancer in advance. The predictive model accepts past medical records and is designed with fine-tuned hyper-parameters. Experimental results show promising predictions, with an accuracy of 95%, an F1-score of 0.94, a Cohen-kappa score of 0.9 and an MSE of 0.05. Incorporating more influential factors into this model may help provide more accurate predictions.