SUPERVISED CLASSIFICATION OF NORMAL AND TUMOROUS BRAIN MR IMAGES USING MACHINE LEARNING SCHEMES

Manual interpretation of these huge volumes of images is susceptible to inter-reader variability and human error. Thus, an accurate automated CAD scheme is highly desirable in clinical pathological diagnosis. In this research, a plethora of machine learning paradigms (feature extraction, dimensionality reduction and supervised classification methods) were explored, evaluated, compared and analyzed to identify the optimal pathway for the binary classification of brain MR images (normal vs neoplastic). An external validation dataset was used to test the generalizability of the optimal predictive models. Relevant and informative features were selected to construct a cross-validated decision tree, and a simple rule set was eventually built from the decision tree. The experimental results show that almost all pattern recognition paradigms achieve high accuracy with careful selection of the number of attributes. LDA+ELM with 55 features is the optimal pipeline, achieving perfect classification when training and test data come from the same source, and achieving accuracy=97.5%, AUC=0.989, sensitivity=95% and specificity=100% under the balanced external test dataset, and accuracy=99.5%, AUC=0.988, sensitivity=95% and specificity=100% under the imbalanced one. The cross-validated decision tree model also shows comparable performance: accuracy=98.8%, AUC=99.1%, sensitivity=99.6% and specificity=98.2%. Three highly relevant and robust attributes are visualized and selected for the construction of decision tree models, and finally a rule set is read directly off the decision tree. This rule set can potentially serve as a fast and accurate classification algorithm.


Background
A brain tumor is a potentially life-threatening medical condition that can afflict people regardless of their age, ethnicity or gender. Early detection and diagnosis lead to better understanding and more accurate characterization of brain tumors, which is of paramount importance for pre-treatment planning and surgery and ultimately increases the chances of patient survival.
With the rapid advancement of biomedical image analysis technologies, coupled with the development of statistical modeling algorithms, biomedical imaging has emerged as a powerful diagnostic tool alongside the gold standard of biopsy (resection of tissue samples). Among the available imaging protocols, magnetic resonance imaging (MRI) is one of the most promising owing to several properties: 1) no harmful radiation, 2) multi-spectral high resolution images, 3) superior soft tissue differentiation, and 4) high contrast. However, manual interpretation of these enormous volumes of images can be tedious, time consuming and prone to both observer fatigue and variability of opinion among human experts. Therefore, an automated, or at least interactive, computer assisted diagnosis (CAD) system is in great demand to facilitate accurate and quick detection and diagnosis of brain tumors [1].
By integrating high quality imaging tools such as MRI with machine learning approaches, classification of brain MR images (normal vs neoplastic) using either supervised or unsupervised learning methods has been proposed in a substantial number of papers [2,3]. Supervised methods normally achieve higher accuracy, but unsupervised methods are also desirable in biomedical image analysis since annotated images are not readily available [4]. Despite the vast number of advanced pattern recognition algorithms, brain MR image classification remains an open challenge, as there are no universally accepted optimum radiomics and classifiers. This is primarily due to differences in brain anatomical structures, the wide variety of MRI parameters, imaging artifacts, imaging features that create ambiguity [5], and so on. Furthermore, some methods proposed in the literature are subject to overfitting (a condition where classifiers fit the training data well but fail to generalize to unseen data), and there is a limited clinically accepted validation process to support the superior classification accuracy obtained [6].
In view of the various machine learning algorithms being proposed and the lack of a widely accepted best method, the authors conduct a comparative study of several feature extraction, feature reduction and classification methods on the brain MR binary classification problem (normal vs tumorous). The performance of these methods is analyzed using six measures: test accuracy, sensitivity, specificity, area under the receiver operating characteristic curve (AUC), training time and test time. This hierarchical procedure is known as radiomics [7]. According to [8], it is impossible to compare and validate the performance of learning algorithms without a measure of variance. Thus, the training and test phases of each experiment are conducted 30 times to produce unbiased performance measures. In addition, to gauge and analyze the generalizability of the machine learning paradigms, additional images from completely different sources are included as a test dataset.
The contributions of this research are five-fold: 1) a large number of samples (thousands of brain MR images) are used as input data for feature engineering as well as classifier training and evaluation for the classification of pathologic brain MR images, 2) repeated sampling of training and test datasets is performed for reliable and unbiased evaluation, analysis and comparison of each supervised learning method, 3) an extra independent test dataset (external validation set) is utilized to gauge and validate the generalizability of the aforementioned optimal machine learning pipeline, 4) relevant and informative features are determined that can provide new insight and perspective on this binary classification problem, and 5) a compact classification rule is constructed that can be readily used in a CAD system.

Methods
The overall pipeline of the research framework is outlined in Fig. 1. The experiments were performed using MATLAB R2017a on an Intel® Core™ i5 processor with 3.89 GB usable RAM.

Data acquisition
The brain MR images were all downloaded from popular publicly available online databases. The justification for using multiple sources of tumorous and normal brain T1-weighted MR images is to prevent the extracted features, and thus the information learned by the classifiers, from being affected by or related to the way the MR images are acquired, such as the types and parameter settings of the various MRI machines. In other words, this step is crucial for finding distinctive, repeatable and robust features or feature subsets that differentiate between tumorous and normal MR images regardless of the source of the data. The datasets, downloaded from online repositories, are listed below:
1) The slice thickness is 6 mm and the slice gap is 1 mm [9,10]. 1500 slices of these images were utilized as tumorous images.
2) Another set of tumorous brain MR image volumes was acquired from [11,12]. A total of 775 slices containing tumor were extracted.
3) Tumorous brain image volumes were obtained from [13].
Datasets 1-3 consist of tumorous images, whereas datasets 4-6 are made up of normal images. Datasets 1, 2, 4 and 5 are used for training and testing the classifiers, while datasets 3 and 6 serve as the external validation dataset to evaluate the generalization performance of the optimal machine learning schemes [16].

Image preprocessing
Image preprocessing is performed on every dataset to increase the signal-to-noise ratio and enhance the visual quality of the MR images. The preprocessing steps for the brain MR images are presented in Fig. 2. In this research, the first stage of MR image preprocessing is bias field estimation using a second order parametric surface fitting method with the Levenberg-Marquardt algorithm, following the work of [17]. The steps of the algorithm are:
1) Different from the previous work of [17], which suggested selecting data points from the image background, data points with normalized pixel intensity higher than a pre-specified threshold of 0.3 are selected. Their coordinates and gray-level values are stored in a data matrix D_i = (x_i, y_i, g_i), i = 1, 2, ..., N, where N is the number of selected data points.
2) A low order parametric equation is selected for the fitted surface, since a low order polynomial results in a smooth surface and easy estimation of parameters. The non-linear least squares Levenberg-Marquardt algorithm is utilized to estimate the coefficients.
3) The fitted equation is used to generate the final bias field signal by projecting the original image coordinates through the coefficients estimated in step 2.
4) Element-wise division of the original image by the bias field image generated in step 3 yields the bias-field-corrected MR image.
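The four steps above can be sketched in numpy. This is a minimal illustration, not the paper's implementation: since a polynomial surface is linear in its coefficients, ordinary least squares is used here in place of Levenberg-Marquardt, and a second-order surface is assumed.

```python
import numpy as np

def remove_bias_field(img, threshold=0.3):
    """Estimate a 2nd-order polynomial bias field from bright pixels and
    divide it out (sketch; OLS replaces Levenberg-Marquardt, which gives
    the same fit for a model linear in its coefficients)."""
    img = img.astype(float) / img.max()          # normalize intensities to [0, 1]
    ys, xs = np.nonzero(img > threshold)         # step 1: points above threshold
    g = img[ys, xs]                              # gray levels of selected points
    # Step 2: design matrix for f(x, y) = a0 + a1*x + a2*y + a3*x^2 + a4*x*y + a5*y^2
    A = np.column_stack([np.ones(xs.size), xs, ys,
                         xs.astype(float)**2, xs*ys, ys.astype(float)**2])
    coef, *_ = np.linalg.lstsq(A, g, rcond=None)
    # Step 3: evaluate the fitted surface over the whole image grid
    Y, X = np.mgrid[0:img.shape[0], 0:img.shape[1]]
    bias = (coef[0] + coef[1]*X + coef[2]*Y
            + coef[3]*X**2 + coef[4]*X*Y + coef[5]*Y**2)
    # Step 4: element-wise division removes the estimated bias
    return img / np.maximum(bias, 1e-6)
```

On a synthetic image that is itself a smooth polynomial surface, the corrected output is flat, which is the intended behavior of the division step.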
Next, a 5 × 5 median filter is applied to eliminate noise while retaining important edges. Then, histogram stretching, also known as inner cropping, is performed on the image. The goal is to increase contrast without modifying the shape of the original histogram (the distribution of gray levels).
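These two steps can be sketched in a few lines of numpy. The percentile cut-offs in `stretch` are illustrative assumptions, not values taken from the paper:

```python
import numpy as np

def median5(img):
    """5x5 median filter with edge padding: for each pixel, take the median
    of its 25-pixel neighbourhood (compact numpy sketch of the denoising step)."""
    p = np.pad(img.astype(float), 2, mode='edge')
    h, w = img.shape
    windows = [p[i:i + h, j:j + w] for i in range(5) for j in range(5)]
    return np.median(np.stack(windows), axis=0)

def stretch(img, low_pct=1, high_pct=99):
    """Histogram (contrast) stretching: linearly rescale gray levels between
    two percentile cut-offs and clip, preserving the histogram's shape."""
    lo, hi = np.percentile(img, [low_pct, high_pct])
    return np.clip((img - lo) / (hi - lo + 1e-12), 0.0, 1.0)
```

A single impulse-noise pixel is removed entirely by `median5`, while `stretch` maps the retained intensity range onto the full [0, 1] interval.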
Feature extraction
In feature extraction, the raw input data (image pixel intensities) are transformed into a much lower dimensional feature subspace (feature vectors). Ideally, these features should be informative, relevant and non-redundant. The main goal of feature extraction is to reduce computational cost (increase convergence speed during training) and data storage requirements without compromising the accuracy of the classification models. In this research, 6 types of feature extraction algorithms are employed: first order statistical features, gray level co-occurrence matrix (GLCM), gray level run length matrix (GLRLM), histogram of oriented gradients (HOG), local binary pattern (LBP), and discrete wavelet packet transform (DWPT). It should be noted that the images come in different dimensions; thus, before feature extraction (except for the first order statistical features), all images were resized to 256 × 256. The number of features extracted by each method is summarized in Table 1. After feature extraction, it was found that 161 LBP features were zero for every sample. In view of this, these non-informative features were removed, giving a final data dimension of 4675 × 2527.

First order statistical features
Textural features computed include the mean, variance, skewness, kurtosis, energy, range and entropy. Let g be the gray level of a pixel and G the number of gray levels in the image. The relative frequency h(g) of each gray level is the number of pixels with gray level g divided by the total number of pixels. Based on the definition of h(g) and G, the formulas for the features are described in Table 2.
Table 2. List of statistical measures and their corresponding formulas.
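The histogram-based features above can be sketched directly from the definition of h(g); the formulas follow the standard moment definitions summarized in Table 2:

```python
import numpy as np

def first_order_features(img, levels=256):
    """First-order statistical features from the gray-level histogram,
    where h[g] is the relative frequency of gray level g."""
    flat = img.astype(int).ravel()
    h = np.bincount(flat, minlength=levels) / flat.size   # relative frequency h(g)
    g = np.arange(levels, dtype=float)
    mean = np.sum(g * h)
    var = np.sum((g - mean) ** 2 * h)
    std = np.sqrt(var)
    skew = np.sum(((g - mean) / (std + 1e-12)) ** 3 * h)
    kurt = np.sum(((g - mean) / (std + 1e-12)) ** 4 * h)
    energy = np.sum(h ** 2)                               # sum of squared frequencies
    entropy = -np.sum(h[h > 0] * np.log2(h[h > 0]))       # Shannon entropy in bits
    rng = int(flat.max() - flat.min())
    return {"mean": mean, "variance": var, "skewness": skew,
            "kurtosis": kurt, "energy": energy, "entropy": entropy, "range": rng}
```

For a two-valued image split evenly between gray levels 0 and 255, the mean is 127.5, the energy 0.5 and the entropy exactly 1 bit, which makes the definitions easy to sanity-check.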

GLCM
Developed by [18], the GLCM is an excellent representation of image properties related to second order statistics [19]. Nine textural features are used in this study, defined as follows. Let P(i, j) be the (i, j) entry of the GLCM and p(i, j) the (i, j) entry of the normalized GLCM, and suppose N_r and N_c denote the numbers of rows and columns of the GLCM. The normalized GLCM entries are p(i, j) = P(i, j) / Σ_i Σ_j P(i, j). The means and standard deviations of the rows and columns of the GLCM are computed from p(i, j). The features are described in Table 3; for example, Energy = Σ_i Σ_j (p(i, j))^2.
Table 3. List of GLCM features and their corresponding formulas.
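A minimal sketch of the construction: count co-occurring gray-level pairs at a fixed offset, normalize, then evaluate features on p(i, j). Only two of the nine features are shown, and the single offset (dx, dy) is an illustrative choice:

```python
import numpy as np

def glcm_features(img, dx=1, dy=0, levels=8):
    """Build a GLCM for offset (dx, dy) and compute energy and contrast
    (sketch of the second-order statistics; img must hold ints < levels)."""
    P = np.zeros((levels, levels))
    h, w = img.shape
    for y in range(h - dy):
        for x in range(w - dx):
            P[img[y, x], img[y + dy, x + dx]] += 1    # count co-occurring pairs
    p = P / P.sum()                                   # normalized GLCM p(i, j)
    i, j = np.indices((levels, levels))
    energy = np.sum(p ** 2)                           # Energy = sum_ij p(i,j)^2
    contrast = np.sum((i - j) ** 2 * p)               # weighted gray-level difference
    return energy, contrast
```

A constant image concentrates all mass in one GLCM cell, so energy is exactly 1 and contrast 0, matching the intuition that energy measures homogeneity.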

GLRLM
The GLRLM is constructed from the frequencies of the gray-level run lengths and their corresponding gray levels in the digital image matrix. Each element of the GLRLM is denoted p(i, j), where i indexes the gray level and j indexes the run length. Before defining the textural features, let G be the number of gray levels in the image, R the longest run length, and M and N the numbers of rows and columns of the original image. The GLRLM features are described in Table 4.
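The matrix construction, and the run-length non-uniformity feature that turns out to be important later in the Results, can be sketched as follows. Only the 0-degree (horizontal) run direction is considered here, as a simplifying assumption:

```python
import numpy as np

def glrlm_rln(img, levels=4):
    """Build the gray-level run-length matrix p(i, j) over horizontal runs
    (row i: gray level, column j: run length) and compute run-length
    non-uniformity, which is high when runs are unevenly distributed."""
    h, w = img.shape
    p = np.zeros((levels, w))
    for row in img:
        run_val, run_len = row[0], 1
        for v in row[1:]:
            if v == run_val:
                run_len += 1                 # extend the current run
            else:
                p[run_val, run_len - 1] += 1 # close the run, record it
                run_val, run_len = v, 1
        p[run_val, run_len - 1] += 1         # close the final run of the row
    n_runs = p.sum()
    rln = np.sum(p.sum(axis=0) ** 2) / n_runs
    return p, rln
```

When every run has the same length the runs are perfectly evenly distributed, so the feature reduces to the total number of runs; heterogeneous (e.g. tumorous) textures produce a mixture of run lengths and a relatively larger value.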

HOG
HOG, proposed by [20], is a local feature descriptor that captures edge gradient structure with high tolerance to local geometric and photometric (illumination) transformations [21] while simultaneously maintaining high selectivity [22]. These features are well known for preserving local second-order interactions between pixels [23].

LBP
A grayscale and rotation invariant feature, known as the local binary pattern (LBP), was introduced by [24]. LBP is a simple yet efficient operator for depicting local image patterns and has been utilized as a feature for various applications [25]. Aside from the implementation of the original LBP, an improved version of the locally rotation invariant LBP [26] is applied for a better tradeoff between discriminative power and robustness.
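The basic operator of [24] can be sketched for the 8-neighbour case: each neighbour is thresholded against the centre pixel and the results are packed into an 8-bit code. The rotation-invariant "uniform" variant used in the paper additionally collapses codes that are bitwise rotations of each other; that step is omitted here for brevity:

```python
import numpy as np

def lbp_codes(img):
    """Basic 8-neighbour LBP for interior pixels: bit b is set when the
    b-th neighbour is >= the centre pixel."""
    # neighbour offsets in clockwise order starting east
    offs = [(0, 1), (1, 1), (1, 0), (1, -1), (0, -1), (-1, -1), (-1, 0), (-1, 1)]
    h, w = img.shape
    c = img[1:-1, 1:-1]                       # centre pixels (interior only)
    code = np.zeros_like(c, dtype=int)
    for bit, (dy, dx) in enumerate(offs):
        nb = img[1 + dy:h - 1 + dy, 1 + dx:w - 1 + dx]   # shifted neighbour view
        code |= (nb >= c).astype(int) << bit
    return code
```

The histogram of these codes over an image (or image block) is the LBP feature vector fed to the classifiers.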

DWPT
Unlike spectral analysis, e.g. Fourier transform (FT) that represents signal as sum of sinusoids, wavelet transform decomposes signal into basic wavelet functions of different scales in time-frequency domain.
From the perspective of computational complexity, the fast FT requires O(N log N) operations, whereas the DWT can be computed in O(N). The inherent multiresolution property of the wavelet transform is the main reason it is an excellent feature extraction method [28]. Essentially, the dyadic DWT can be implemented through a filter tree algorithm. It is worth noting that only the approximation components are decomposed further at each level of the DWT. The DWPT, an extension of the DWT, allows all nodes in the tree structure to split [29]. In other words, not only the approximation coefficients but also the detail components are decomposed, forming a full binary tree as shown in Fig. 3. Three levels of decomposition and the Haar wavelet are chosen for this experiment. As shown in Fig. 3, the first series of analysis filter banks (low pass and high pass filters) is applied along the columns of the image while the second series is applied along the rows, or vice versa. LL, LH, HL and HH are the subbands that capture the approximation, horizontal edges, vertical edges and diagonal edges respectively.
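One level of the 2-D Haar analysis filter bank can be sketched directly: low-pass (pairwise average) and high-pass (pairwise difference) filtering with downsampling along one axis, then along the other. Un-normalized averaging/differencing filters are used here for readability; the orthonormal Haar filters carry an extra 1/√2 factor:

```python
import numpy as np

def haar_decompose(img):
    """One level of the 2-D Haar filter bank, yielding the LL, LH, HL and
    HH subbands of Fig. 3 (DWPT would recursively decompose all four)."""
    def analysis(x):
        lo = (x[0::2] + x[1::2]) / 2.0    # low pass: pairwise average + downsample
        hi = (x[0::2] - x[1::2]) / 2.0    # high pass: pairwise difference + downsample
        return lo, hi
    L, H = analysis(img)                  # filter along one axis
    LL, LH = (s.T for s in analysis(L.T)) # then along the other axis
    HL, HH = (s.T for s in analysis(H.T))
    return LL, LH, HL, HH
```

In the full DWPT, this routine would be applied recursively to all four subbands (not just LL as in the DWT), building the full binary tree described above.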

Feature reduction/Feature subset selection
Feature reduction, also known as dimensionality reduction, is the process of transforming the original attribute space into a lower dimensional space before input to the classification models, while feature subset selection involves the removal of irrelevant and redundant features. The objective of both processes is to preserve informative and relevant features and discard irrelevant and redundant attributes, which can ultimately lead to improved classifier performance (generalization capability and convergence speed), simpler and more understandable models, and reduced storage requirements. In this paper, widely used feature transform methods, namely Principal Component Analysis (PCA), Linear Discriminant Analysis (LDA), Partial Least Squares (PLS) and Independent Component Analysis (ICA) [30], are employed. In addition, three filter feature selection methods, namely Relief, Discriminant Least Square Regression (DLSR) [31] and Sequential Forward Search based on LW (SFS-LW) [32], were implemented. Essentially, these two families of techniques differ: feature transform approaches project the original data dimensions into lower dimensional spaces through linear combinations of the original features, while feature selection seeks to choose subsets of features that are highly relevant to the targets, or class labels [33].
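The feature-transform family can be illustrated with PCA, the simplest of the four: new features are linear combinations of the originals, which is exactly why their individual interpretability is lost, as noted later in the Discussion. A minimal SVD-based sketch:

```python
import numpy as np

def pca_reduce(X, k):
    """Project n_samples x n_features data onto the top-k principal
    components (directions of maximal variance), returning n_samples x k
    scores. Minimal sketch of the feature-transform approach."""
    Xc = X - X.mean(axis=0)                        # centre the data
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T                           # scores: linear combos of features
```

By construction the first score column carries at least as much variance as the second, which is the property that lets a small k retain most of the signal.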

Classification models
Classification models attempt to derive the relationship between a set of input variables (normally in the form of an input vector) and its corresponding categorical target/label. In this research, the classifiers deal with a binary classification problem (identifying and distinguishing neoplastic brain MRI from normal brain MRI). The classifiers employed in this study are logistic regression (LR), the Naïve-Bayes (NB) classifier, the kNN classifier, weighted kNN [34], artificial neural network (ANN), support vector machine (SVM), and extreme learning machine (ELM) [35]. Table 5 presents the implementation of the classification algorithms mentioned above.
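Since ELM turns out to be the best performer later in the paper, a minimal sketch of it is worth spelling out: input weights are random and fixed, and only the output weights are solved, in closed form, by least squares (no iterative training). This is an illustrative reimplementation, not the code of [35]; the sigmoid activation and the seed are assumptions:

```python
import numpy as np

def train_elm(X, y, n_hidden=150, seed=0):
    """Minimal extreme learning machine for binary labels y in {0, 1}:
    random hidden layer, output weights via one least-squares solve."""
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((X.shape[1], n_hidden))   # fixed random input weights
    b = rng.standard_normal(n_hidden)                 # fixed random biases
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))            # hidden-layer activations
    beta, *_ = np.linalg.lstsq(H, y, rcond=None)      # closed-form output weights
    def predict(Xt):
        Ht = 1.0 / (1.0 + np.exp(-(Xt @ W + b)))
        return (Ht @ beta > 0.5).astype(int)          # threshold for binary decision
    return predict
```

The single non-iterative solve is what makes ELM's training time the lowest of all classifiers compared in the Discussion.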

Performance evaluation and analysis
To evaluate the performance of each machine learning approach, 6 widely accepted performance measures are utilized: test accuracy (%), sensitivity (%), specificity (%), AUC, training time (s) and test time (s). Since each scheme of feature reduction and classification is run 30 times using different sets of training data, i.e. 30 different models are trained, the mean and standard deviation of each performance metric can be computed for unbiased analysis and comparison of the methods employed. Table 6 describes the performance measures and their definitions. Different types of feature reduction and classification algorithms are then applied, and a finite range of feature subset sizes under each dimensionality reduction method is tested. Each experiment is conducted 30 times using a stratified sampling strategy for unbiased statistical analysis. With a sufficiently large sample size, most distributions can be approximated as normal according to the Central Limit Theorem, so confidence intervals based on the studentized method can be constructed after computing the standard error. However, this kind of comparison may be prone to type I error according to [38]. Fig. 4 shows the schematic diagram of the research framework.
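The four quality measures can be computed from a confusion matrix, with AUC evaluated via its rank-statistic interpretation (the probability that a random positive is scored above a random negative [8]). A minimal sketch:

```python
import numpy as np

def binary_metrics(y_true, y_score, threshold=0.5):
    """Accuracy, sensitivity, specificity and AUC for binary labels
    (1 = tumorous) given continuous classifier scores."""
    y_true = np.asarray(y_true)
    score = np.asarray(y_score, dtype=float)
    y_pred = (score >= threshold).astype(int)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    tn = np.sum((y_pred == 0) & (y_true == 0))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    acc = (tp + tn) / y_true.size
    sens = tp / (tp + fn)                    # true positive rate
    spec = tn / (tn + fp)                    # true negative rate
    pos, neg = score[y_true == 1], score[y_true == 0]
    # AUC as the fraction of positive/negative pairs ranked correctly (ties = 0.5)
    auc = np.mean((pos[:, None] > neg[None, :])
                  + 0.5 * (pos[:, None] == neg[None, :]))
    return acc, sens, spec, auc
```

Averaging these values over the 30 repeated runs gives the means and standard deviations reported in the tables.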

Identification of relevant features
Several features are identified empirically by examining the change in accuracy as a function of the number of features. An attribute is considered relevant if its addition leads to a significant improvement in accuracy. It should be noted that this empirical method is greedy and does not take interactions between features into account. Once the features whose addition leads to spikes in accuracy are identified, the goodness of split of each identified attribute is computed. Goodness of split is the measure used by classification and regression trees (CART) to choose which attribute to split on. The three features with the highest goodness of split are selected for the construction of the decision tree. Further details are discussed in section 3.3.
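The CART goodness-of-split criterion can be sketched as the decrease in Gini impurity produced by thresholding a single feature (the same impurity-decrease notion underlies the predictor-importance estimates used later in section 3.3):

```python
import numpy as np

def gini(y):
    """Gini impurity of a label vector: 1 - sum of squared class proportions."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def goodness_of_split(x, y, t):
    """Decrease in Gini impurity when splitting feature x at threshold t:
    parent impurity minus the size-weighted impurity of the two children."""
    left, right = y[x <= t], y[x > t]
    w_l, w_r = left.size / y.size, right.size / y.size
    return gini(y) - (w_l * gini(left) + w_r * gini(right))
```

A split that separates the classes perfectly recovers the full parent impurity (0.5 for a balanced binary node), which is the maximum attainable goodness.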

Construction of decision tree and classification rule
Blind use of complex, black-box predictive models in high stakes circumstances has led to serious consequences [39]. This scenario may be induced by the widespread but false belief among researchers in an accuracy-interpretability trade-off, which suggests that a more complex model is more accurate [40].
These recent findings motivated the extra steps of determining a relevant feature subset and constructing decision tree and rule-based classifiers. This not only achieves the objective of building interpretable models but also contributes to knowledge discovery, as the mechanism by which the predictive models predict the outcome (normal or tumorous) is better understood.
Table 6. Performance measures and their definitions.
Accuracy: Overall efficiency and generalizability of the classifier [41].
Sensitivity: Ability to correctly identify positive samples (tumorous brain MR images).
Specificity: Ability to correctly identify negative samples (normal brain MR images).
AUC: Equivalent to the probability that a classifier will rank a randomly chosen positive instance higher than a randomly chosen negative instance [8].
Training time: Quantifies the convergence speed of training.
Test time: Quantifies the speed of generating output labels for the test data.

Results and Discussions

3.1 Comparison among classifiers under different feature reduction schemes and number of features
As mentioned in section 2, each experiment is repeated 30 times, and this repetition count serves as the number of samples for the statistics reported. The ranges of accuracy represent 95% confidence intervals. It should be noted that these CIs can be misleading due to variation in the sampling of training data and should not be used for significance testing and comparison.

Performance evaluation using external validation set
As mentioned in section 2.1, datasets 1, 2, 4 and 5 serve as the training dataset and datasets 3 and 6 serve as the test dataset. The purpose of this extra experimentation is to gain insight into the generalizability of the machine learning pipelines that achieved perfect classification in section 3.1.
In addition, we manipulate the external validation set (datasets 3 and 6) to be balanced (100 tumorous cases and 100 normal cases) and imbalanced (100 tumorous cases and 900 normal cases) so that each perfect classifier can be evaluated under both balanced and skewed class distributions. According to Table 7, the performance of most classifiers deteriorates with the imbalanced dataset; in other words, the learning algorithms misclassify some of the newly added normal images as tumorous.

Identification of relevant features subset
Several features are identified empirically by examining the changes in accuracy in Fig. 9-11 under the DLSR, Relief and SFS-LW feature selection methods. Table 9 shows the first 15 features of the 3 feature selection schemes. Significant features were determined based on the improvement in accuracy scores after the addition of a particular feature, as shown in Fig. 12. Ten features were identified that can potentially provide information for the classification of normal and tumorous image samples. Then, the MATLAB function predictorImportance was utilized to acquire the predictor importance estimates; the results are shown in Fig. 13. The three most important features have their indices marked with an asterisk in Table 9, and their details are listed in Table 10. In addition, we visualize the most informative features in Fig. 14-16. As can be seen in Fig. 14, the 2 peaks observed in the kernel density plot strongly suggest a bimodal distribution. In Fig. 15 and Fig. 16, most of the normal and tumorous samples occupy different regions of the feature space. These graphical displays demonstrate the discriminating power of the selected features.
Fig. 12. Manual determination of significant features (shown in bold in Table 9).

Construction of decision tree and set of rules
In this section, all downloaded datasets were used to train a 10-fold cross-validated decision tree classifier. The changes in performance measures as the relevant features are gradually added to the input data are shown in Table 11. An accuracy of 98.8% is attained when the tree is trained with input data of 3 attributes. The high sensitivity (low false negative rate) is crucial, as the cost of false negatives is higher than the cost of false positive predictions [42]. To track down the best model comprising the 3 features, we examine the 10 cross-validated decision tree models. The optimal decision tree model trained with the 3 features is shown in Fig. 17 and its performance in Table 12.
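A rule set read off a shallow decision tree has a very simple shape: a short cascade of threshold tests. The sketch below shows that shape for the three selected features (1 GLRLM and 2 LBP features, per the Conclusions); the thresholds T1-T3 and the branch structure are HYPOTHETICAL placeholders for illustration, not the values learned in Fig. 17:

```python
def classify_brain_slice(run_length_nonuniformity, lbp_freq_a, lbp_freq_b):
    """Illustrative rule-set classifier in the shape of a depth-limited
    decision tree over the three selected features. All thresholds are
    hypothetical placeholders, NOT the learned splits from the paper."""
    T1, T2, T3 = 0.5, 0.3, 0.2                    # hypothetical split points
    if run_length_nonuniformity > T1:             # heterogeneous gray-level runs
        return "tumorous"
    if lbp_freq_a > T2 or lbp_freq_b > T3:        # boundary-heterogeneity patterns
        return "tumorous"
    return "normal"
```

The appeal of this form is that it needs no model runtime at all: three comparisons classify a slice, which is why such a rule set can be embedded directly in a real-time CAD tool.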

Discussion
We discuss our results following the order of Section 3, paragraph by paragraph.
To summarize, there are some important findings from Section 3.1:
a. For most learning algorithms, the accuracy of the classifiers tends to improve and then stabilize after a certain number of attributes as the number of features increases. Performance can be affected by the ability of a classification model to handle a large number of attributes. A classifier like RBF SVM, which can project a low dimensional space into a high dimensional one, can achieve better classification accuracy with fewer features and is thus resilient to overfitting. On the other hand, Naïve Bayes performs poorly across all feature reduction methods. This may be attributed to violation of the independence assumption among attributes, and the Gaussian probability distribution may not be a good estimate of the features' true distributions.
b. kNN and weighted kNN show similar trends in accuracy; however, the test time of weighted kNN is significantly higher than that of kNN, since in weighted kNN the weight of each neighbor has to be evaluated to determine the class of a new test instance.
c. The average accuracy of the ANN is generally lower than that of both ELM and SVM, particularly as the number of predictors increases, which often comes with a higher training time requirement. This may be attributed to overfitting.
d. RBF SVM is very robust and achieves relatively high test accuracy under all feature reduction schemes compared to the other classifiers, except under the LDA scheme, where ELM outperforms SVM. Nonetheless, choosing the parameters C and σ is nontrivial; they have to be estimated using a cross-validated grid search, which is computationally intensive.
e. All the dimensionality reduction techniques examined are feasible, with high accuracy achieved through appropriate selection of the number of features and the classifier.
f. Classifiers under the PCA and PLS schemes require fewer features to achieve high accuracy than under the LDA and ICA schemes.
g. The number of features has to be carefully selected under the LDA and ICA schemes, as the accuracy can vary substantially with the number of attributes included, as shown in Fig. 7 and Fig. 8.
h. DLSR feature selection is superior to Relief and SFS-LW, as classifiers fed with DLSR features attain higher accuracy with fewer features. This means DLSR selects relevant and more compact feature subsets.
i. The training times of the classifiers can be sorted in ascending order as: ELM < linear SVM < LR < RBF SVM < ANN. We do not take the Gaussian NB classifier into account, as its training phase only involves computing the mean and standard deviation of each feature of the training data.
j. The training time of RBF SVM with ICA features as input variables is also significantly higher.
According to [43], the training time complexity of SVM is proportional to the number of support vectors; thus, it is believed that the number of support vectors is higher in the ICA feature space.
k. The test time of all classifiers stagnates regardless of the number of features, except for kNN, weighted kNN and NB, since the calculation of the Euclidean distances to the nearest neighbors and of the posterior probabilities becomes more complex as more features are included.
l. Since all of the dimensionality reduction methods are viable, we are more interested in the feature selection schemes, as the attributes produced by feature transform methods (e.g. PCA, PLS, LDA and ICA) lose their interpretability and the contribution of each original feature is unknown [44]. By examining Fig. 9-11 carefully, we spotted large spikes in accuracy when certain features are included. We suspect that relevant features lie behind each of those spikes, and a few features that contribute to the improvement in accuracy are selected.
In section 3.2, we further evaluated the optimal supervised classification models using datasets from different sources. Additionally, the dataset was manipulated in terms of the proportion of tumorous and normal cases to assess the classifiers on balanced and imbalanced test data. Most of the machine learning paradigms evaluated perform differently under balanced and imbalanced test datasets, except the LDA+ELM classifier with 55 features. This shows that the generalization performance of classifiers will generally deteriorate when tested on data from other sources, in our case different MRI machines and parameter settings. Thus, we suggest that the training data samples should be vast and diverse, with relevant predictors included, which can only be achieved through extensive experimentation. From the results shown in Fig. 9-11, it is clear that certain attributes contribute to the improvement of the classifiers. We further identified these features and evaluated their importance based on the decrease in impurity (Gini index) due to splitting, and the three most important attributes were selected. The reasons for the relevance of these features are as follows:
- Gray level run length non-uniformity measures the heterogeneity of gray levels. Based on Fig. 13, a high value of run length non-uniformity indicates a tumorous case and vice versa. Theoretically, a high value of run length non-uniformity implies that the runs are not equally distributed across run lengths [45]. The authors suspect that this unequal distribution might arise from the presence of a tumor, where runs of similar pixel intensity but different lengths might occur within the tumor region for some gray levels.
- In line with what is reported in [24], good discrimination can be achieved with the occurrence probabilities of 'uniform' rotation invariant local binary patterns. In this research, the frequency (probability) of the LBP patterns with a uniformity measure of more than 2 tends to be higher in tumorous samples. This is probably caused by the heterogeneity at the boundary of the tumor tissue, which can result in higher values of this attribute, as shown in Fig. 14.
In section 3.4, using these relevant features, 10-fold cross-validated decision tree models were trained and achieved 98.8% accuracy. The determination of an informative feature subset is more critical than the utilization of a complex, black-box model. The high discriminative power and relevance of this feature subset suggest the feasibility of building simple classification rules with fairly high accuracy. Classification rules are favorable as they can serve as independent nuggets of knowledge [46]. Such a model is not only intuitive and easily interpreted by end users, but can also lead to knowledge discovery in addition to making predictions on unseen data.

Conclusions
High resolution MRI has become one of the most popular neuroimaging tools, not just for research on brain anatomical structure but also for medical diagnosis (tumor detection). Machine learning is a promising pathway for the development of automated CAD systems. The first part of the experiment shows that high accuracy can be obtained for almost all machine learning pipelines with careful selection of the number of predictors when trained and tested on datasets from the same sources. We identified the machine learning schemes with perfect classification and then evaluated these trained classifiers on an independent test dataset from a different source to further validate the performance of each approach. Experimental results show that the optimal technique is LDA+ELM (number of hidden neurons = 150) with 55 features, achieving accuracy=97.5%, AUC=0.989, sensitivity=95% and specificity=100% under the balanced test dataset, and accuracy=99.5%, AUC=0.988, sensitivity=95% and specificity=100% under the imbalanced one. This particular model excels at identifying normal MR images. In addition, we identified a few highly relevant features (1 from GLRLM and 2 from LBP) and visualized them graphically. With the data showing high separability in the scatter plots, relatively easy-to-read 10-fold cross-validated decision tree models were constructed using both datasets #1 and #2 with the three features. The performance is as follows: accuracy=98.8%, AUC=99.1%, sensitivity=99.6% and specificity=98.2%. Then, a simple classification rule set was read directly off the decision tree. This highly intuitive rule set, which involves only 3 predictors, can easily be embedded in any real-time expert system or automated CAD tool.
There are numerous possible directions for future work in this field. More advanced feature descriptors can be explored in the search for robust, repeatable and reproducible radiomic features for tumor diagnosis [47]. Less popular nonlinear dimensionality reduction methods, such as kernel PCA and locally linear embedding, can also be investigated.

Availability of data
The sources of data have been stated clearly in section 2.1.