BRAIN TUMOR DETECTION BASED ON ENSEMBLE LEARNING

In this paper, we propose methods for brain tumor detection in MRI images based on ensemble learning. We build upon prior research on ensemble methods by testing the concatenation of pre-trained models: features extracted via transfer learning are merged and then classified by individual classification algorithms or by a stacked ensemble of those algorithms. The proposed approach achieved accuracy scores of 0.98, outperforming a benchmark VGG-16 model. The paper also discusses connections to granular computing.


Introduction
Brain tumors are defined as the growth of abnormal cells in the human brain. Brain tumors are either benign (non-cancerous) or malignant (cancerous) and are generally classified based on the afflicted region; common brain tumors include meningioma, glioma, and pituitary tumors.
Brain tumors pose a major public health issue: according to the American Cancer Society, brain tumors were predicted to cause 18,020 deaths in 2020, and 23,890 people were expected to be diagnosed with malignant tumors that year.
Treatments for brain tumors include chemotherapy, radiation therapy and surgery. However, before such treatments can begin, initial evaluation of the tumors must take place. Typically, the brain is assessed either by Magnetic Resonance Imaging (MRI) or Computed Tomography (CT) scans.
Though medical images play a crucial role in patient diagnosis and treatment, analyzing such images is a time-consuming and costly task. As a result, computational and mathematical methods have been introduced to this field. Notably, various machine learning methods and techniques such as convolutional neural networks and support vector machines have been utilized to analyze medical images. In this paper, we expand upon past studies by testing the concatenation of pre-trained models.
First, sections 2, 3, and 4 will present relevant background information, including an overview of algorithms and ensembles. Next, section 5 will present the proposed approach, and sections 6 and 7 will show and discuss its results. Finally, section 8 will conclude the paper.

2.1. Bootstrap Aggregation (Bagging) Ensemble Method ([2]). The Bootstrap Aggregation method, more commonly known as the bagging method, involves homogeneous weak learners. The learners are trained in parallel and independently from one another, and the results are aggregated either by voting or averaging to arrive at a final prediction. Often, the homogeneous learners are base classifiers fitted on random subsets of the dataset.
Given L bootstrap samples, $s_L(\cdot)$ as the model, and $w_1(\cdot), w_2(\cdot), \ldots, w_L(\cdot)$ as weak learners, the two aggregation methods are given by:

Averaging: $s_L(\cdot) = \frac{1}{L}\sum_{l=1}^{L} w_l(\cdot)$

Voting: $s_L(\cdot) = \arg\max_{k}\,[\mathrm{card}(l \mid w_l(\cdot) = k)]$

2.2. Boosting Ensemble Method ([20], [11], [10]). The Boosting method involves homogeneous weak learners that are trained in a sequential manner. Each subsequent model attempts to correct the errors of the previous model, reducing the bias of the overall classifier before a prediction is made. There are two main types of boosting methods that differ in how the shortcomings of weak learners are identified: adaptive boosting and gradient boosting.
Adaptive boosting identifies such shortcomings by giving higher weight to misclassified input data and lower weight to correctly classified input data. On the other hand, gradient boosting identifies the shortcomings by utilizing gradients to minimize the loss function.
Given $s_l(\cdot)$ as the model, $c_l$ as coefficients, $w_l$ as weak learners, $E(\cdot)$ as the fitting error of the model, and $e(\cdot)$ as the loss/error function, the following hold for adaptive boosting:

$s_l(\cdot) = s_{l-1}(\cdot) + c_l \, w_l(\cdot)$

$(c_l, w_l(\cdot)) = \arg\min_{c,\,w(\cdot)} E\big(s_{l-1}(\cdot) + c\, w(\cdot)\big) = \arg\min_{c,\,w(\cdot)} \sum_{n=1}^{N} e\big(y_n,\, s_{l-1}(x_n) + c\, w(x_n)\big)$
Similarly, with $\nabla$ denoting the gradient, the following holds for gradient boosting:

$s_l(\cdot) = s_{l-1}(\cdot) - c_l \, \nabla_{s_{l-1}} E(s_{l-1})(\cdot)$

2.3. Stacking Ensemble Method ([23]). The Stacking method involves heterogeneous weak learners that are trained in parallel and independently from one another. The results are then aggregated through a meta model that makes a prediction based on the predictions from the weak learners.
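To make the three ensemble styles above concrete, the following is a minimal scikit-learn sketch, not the paper's configuration: the synthetic data, learner choices, and hyperparameter values are all illustrative assumptions.

```python
# Sketch of bagging, boosting, and stacking ensembles (illustrative only).
from sklearn.datasets import make_classification
from sklearn.ensemble import (AdaBoostClassifier, BaggingClassifier,
                              StackingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)

# Bagging: homogeneous trees trained in parallel on bootstrap samples.
bagging = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50)

# Adaptive boosting: learners trained sequentially, re-weighting errors.
boosting = AdaBoostClassifier(n_estimators=50)

# Stacking: heterogeneous learners combined by a logistic regression meta model.
stacking = StackingClassifier(
    estimators=[("svc", SVC()), ("tree", DecisionTreeClassifier())],
    final_estimator=LogisticRegression())

for name, model in [("bagging", bagging), ("boosting", boosting),
                    ("stacking", stacking)]:
    print(name, cross_val_score(model, X, y, cv=5).mean())
```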
The meta model for classification tasks is usually logistic regression and for regression tasks, it is usually linear regression. Their respective equations are given below.
Linear Regression: $y = a + bx$

Logistic Regression: $p = \frac{1}{1 + e^{-(a + bx)}}$

3.1. Support Vector Machines ([7]). A support vector machine is a classifier that constructs a hyperplane in the feature space to separate the data (input) into classes. The classifier aims to maximize the distance between the hyperplane and the nearest data point of any class.
If the data set is not linearly separable, it can be projected to higher dimensions through a kernel function.
There are four main types of kernel functions: linear, polynomial, radial basis function (RBF), and sigmoid. Their equations are given below.

Linear: $K(X, Y) = X^{\top} Y$

Polynomial: $K(X, Y) = (\gamma X^{\top} Y + C)^{d}$

RBF: $K(X, Y) = \exp(-\gamma \|X - Y\|^{2})$

Sigmoid: $K(X, Y) = \tanh(\gamma X^{\top} Y + C)$

(where $\gamma$, $d$, and $C$ are kernel parameters and $\|X - Y\|^{2}$ is the squared Euclidean distance) The regularization parameter (often referred to as the C parameter) controls the extent to which misclassification should be avoided. On the other hand, the gamma parameter ($\gamma$) defines the extent of influence of a single data point when calculating the hyperplane.
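As an illustration of the kernel and parameter choices above, a minimal scikit-learn sketch follows; the data and parameter values are assumptions.

```python
# Fitting SVMs with the four kernel types described above (illustrative only).
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, random_state=0)

for kernel in ["linear", "poly", "rbf", "sigmoid"]:
    # C controls the misclassification penalty; gamma controls the reach
    # of a single training point when shaping the decision boundary.
    clf = SVC(kernel=kernel, C=1.0, gamma="scale", degree=3)
    print(kernel, clf.fit(X, y).score(X, y))
```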

3.2. K-Nearest Neighbors ([8]). K-Nearest Neighbors is a classifier that separates the data into classes by examining the distance between stored data points and the current given point. The algorithm selects the k closest data points and classifies the given point by majority vote.
There are several distance functions available: Euclidean, Manhattan, and Minkowski.
Euclidean Function: $d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$

Manhattan Function: $d(x, y) = \sum_{i=1}^{n} |x_i - y_i|$

Furthermore, note that cross-validation is often used to select the value of k.
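The note on selecting k by cross-validation can be sketched as follows; the candidate range of k and the synthetic data are assumptions.

```python
# Selecting k by cross-validation (illustrative only).
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=300, random_state=0)

# Score each candidate k and keep the best; the default 'minkowski'
# metric with p=2 is equivalent to Euclidean distance.
scores = {k: cross_val_score(KNeighborsClassifier(n_neighbors=k),
                             X, y, cv=5).mean()
          for k in range(1, 16, 2)}
best_k = max(scores, key=scores.get)
print("best k:", best_k)
```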

3.3. Random Forest ([3]). The Random Forest classifier is a bootstrap aggregation (bagging) ensemble of decision tree classifiers. The decision trees are fitted on random subsets of the dataset and are aggregated either by voting or averaging.
There are a number of heuristics (known as attribute selection measures) that are used to define how data points at certain levels of the tree will be split. Two heuristics are given below.

Information gain (entropy): $\mathrm{Info}(D) = -\sum_{i=1}^{m} p_i \log_2(p_i)$

Gini index: $\mathrm{Gini}(D) = 1 - \sum_{i=1}^{m} p_i^2$

(where $p_i$ is the probability that an arbitrary tuple in $D$ belongs to a class $C_i$)

3.4. XGBoost ([6]). XGBoost is an ensemble of gradient-boosted decision trees.
A key aspect of XGBoost is that it uses a more regularized model formalization in order to control overfitting.
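A minimal sketch contrasting the two tree ensembles follows; it assumes the xgboost package is available, and all hyperparameter values are illustrative.

```python
# Random forest (bagged trees) vs. XGBoost (regularized gradient-boosted
# trees); values are illustrative only.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier  # assumes the xgboost package

X, y = make_classification(n_samples=500, random_state=0)

rf = RandomForestClassifier(n_estimators=100, criterion="entropy")
# reg_lambda is the L2 regularization term XGBoost adds to control
# overfitting, as mentioned above.
xgb = XGBClassifier(n_estimators=100, learning_rate=0.1, reg_lambda=1.0)

for name, model in [("random forest", rf), ("xgboost", xgb)]:
    print(name, model.fit(X, y).score(X, y))
```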

Convolutional Neural Networks
Convolutional neural networks (CNN) are a class of deep learning neural networks, often used to analyze and classify images. A CNN works by extracting features from an input image (an array of pixels). The image passes through a series of layers (usually consisting of convolutional, ReLU, and pooling layers) before the final fully connected layer classifies the image based on "voting".
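As a rough illustration of this layer sequence, a minimal Keras sketch is given below; the layer sizes are assumptions rather than any model used in this paper.

```python
# A small CNN of the kind described above: convolution -> ReLU ->
# pooling -> fully connected (illustrative sizes).
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Input(shape=(224, 224, 3)),             # input image as a pixel array
    layers.Conv2D(32, (3, 3), activation="relu"),  # feature extraction
    layers.MaxPooling2D((2, 2)),                   # spatial down-sampling
    layers.Conv2D(64, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(1, activation="sigmoid"),         # final "voting" layer
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])
```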

4.1. Convolutional Layer ([16]). A convolutional layer uses filters to extract features from previous layers while preserving corresponding spatial information.
A feature map $O_s$ is calculated as shown below.

$O_s = b_s + \sum_{r} W_{sr} * X_r$

(where $b_s$ is the bias term, $W_{sr}$ is the sub-filter for this feature map, $*$ is the convolution operation, and $X_r$ is the $r$th inputted feature map)

4.2. Activation Functions ([12], [4]). Activation functions define the output of neurons given a set of inputs. The function introduces non-linear properties into the network by calculating the 'weighted sum' of inputs before determining which neurons will push values forward into the next layer.
Some common activation functions are given below.

Sigmoid: $\sigma(x) = \frac{1}{1 + e^{-x}}$

ReLU: $f(x) = \max(0, x)$
4.3. Pooling Layers ([19], [17]). Pooling layers reduce the spatial size of activation maps while maintaining important structural elements (without unnecessary detail). The two most common methods, max pooling and average pooling, are given below.

Max Pooling: $y_{i,j,d} = \max_{(p,q) \in R_{i,j}} x_{p,q,d}$

Average Pooling: $y_{i,j,d} = \frac{1}{|R_{i,j}|} \sum_{(p,q) \in R_{i,j}} x_{p,q,d}$

(where $x$ is the input to the layer, $y$ is the output of the $l$-th layer, $R_{i,j}$ is the pooling region, and $H_l \times W_l \times D_l$ is the size of the $l$-th layer)

4.4. Granular Computing ([18]). The model proposed in this paper is an application of the granular computing paradigm. Granular computing is concerned with the processing of information units, or information granules. These information granules are collections of entities that are arranged together based on similar aspects. In the context of machine learning and ensemble learners, each ensemble can be thought of as a granule because it combines multiple learning algorithms. The proposed model and ensemble consist of multiple levels, where each level corresponds to a distinct level of granularity. For example, in a stacked ensemble, the collection of base-level classifiers is at the bottom level of granularity and the entire ensemble (including both the base-level classifiers and the meta classifier) is at the highest level of granularity.

Two granular computing concepts, granulation and organization, can be applied to the proposed model as well. Granulation is the process of decomposing an object into parts, and organization is the process of integrating parts into a complete object. The extraction of feature vectors with convolutional neural networks is essentially granulation, and combining different classifiers and learners through a stacking ensemble is organization.
Granular computing concepts can be applied to the random forest classifier as well. The bootstrap aggregation technique (bagging ensemble) found in a random forest classifier integrates numerous decision trees into one ensemble; it is an instance of the organization concept.

Proposed Approach

5.1. Dataset. The proposed method was tested on the "Brain MRI Images for Brain Tumor Detection" dataset, which contains a total of 253 images: 155 of the images contain tumors and 98 do not. Before the images were used to train and test the model, the dataset was pre-processed and augmented. The dataset was created by Navoneel Chakrabarty and can be found here: https://www.kaggle.com/navoneel/brain-mri-images-for-brain-tumor-detection/data ([5]).

5.2. Data Pre-processing. Data pre-processing involved three main steps: splitting the dataset, cropping the images, and resizing the images.

5.2.2. Cropping. The brain was cropped out of the MRI images through a three-step method: the method finds the largest contour of the brain, finds the extreme points along the contour, and crops the image. An example is given below in Figure 1.
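A rough OpenCV sketch of the three steps follows; the threshold and morphology settings are assumptions, not necessarily the values used in the paper's code.

```python
# Sketch of the three-step brain-cropping method (assumed settings).
import cv2

def crop_brain(image):
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    gray = cv2.GaussianBlur(gray, (5, 5), 0)
    thresh = cv2.threshold(gray, 45, 255, cv2.THRESH_BINARY)[1]
    thresh = cv2.erode(thresh, None, iterations=2)
    thresh = cv2.dilate(thresh, None, iterations=2)

    # Step 1: find the largest contour of the brain.
    contours, _ = cv2.findContours(thresh, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    c = max(contours, key=cv2.contourArea)

    # Step 2: find the extreme points along the contour.
    left = tuple(c[c[:, :, 0].argmin()][0])
    right = tuple(c[c[:, :, 0].argmax()][0])
    top = tuple(c[c[:, :, 1].argmin()][0])
    bottom = tuple(c[c[:, :, 1].argmax()][0])

    # Step 3: crop the image to the bounding region of those points.
    return image[top[1]:bottom[1], left[0]:right[0]]
```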

5.2.3. Resizing and additional pre-processing. Because the dataset contains images of differing dimensions and aspect ratios, the images were resized to match the input size of the pre-trained models (dimensions 224 × 224 × 3). Additional pre-processing (including interpolation and subtracting the mean RGB channels of the dataset) was done to prepare the images for the pre-trained models as well.
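A minimal sketch of this step is given below; it assumes OpenCV for the interpolation and Keras's preprocess_input for the mean-RGB subtraction expected by the VGG-16 family of pre-trained models.

```python
# Resize to the pre-trained input size and apply mean-RGB subtraction
# (assumed implementation of the step described above).
import cv2
from tensorflow.keras.applications.vgg16 import preprocess_input

def prepare(image):
    # Interpolate to the pre-trained models' input size of 224 x 224 x 3.
    resized = cv2.resize(image, (224, 224), interpolation=cv2.INTER_AREA)
    return preprocess_input(resized.astype("float32"))
```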

5.3. Data Augmentation. To increase the size of the dataset, the data was augmented through numerous random transformations. Through data augmentation, the possibility of over-fitting was reduced and the model was able to generalize better. The chosen augmentation options include a rotation range of 15 degrees, width and height translation ranges of 0.1, a shearing transformation range of 0.1, a brightness range between 0.5 and 1.5, and horizontal and vertical flips. An example of data augmentation is given below in Figure 2.

5.4. Proposed Model.

5.4.1. Feature Extraction. Because classification is performed by separate classifiers, the final fully connected layers of the pre-trained models are not included. Instead, a custom flatten layer, dropout layer, and dense layer with the sigmoid activation function are added at the end of the models to prepare the tensors for the merging step.
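A minimal Keras sketch of one such feature-extraction branch follows; the dropout rate and dense-layer width are assumptions, as the paper does not state them.

```python
# One feature-extraction branch: a pre-trained model without its fully
# connected layers, plus a custom flatten, dropout, and sigmoid dense
# layer (rate and width assumed).
from tensorflow.keras import Model, layers
from tensorflow.keras.applications import ResNet50

base = ResNet50(weights="imagenet", include_top=False,
                input_shape=(224, 224, 3))
x = layers.Flatten()(base.output)
x = layers.Dropout(0.5)(x)                       # dropout rate is an assumption
features = layers.Dense(64, activation="sigmoid")(x)  # width is an assumption
extractor = Model(base.input, features)
```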

5.4.2. Merging extracted features. The feature vectors extracted from the pre-trained convolutional neural network models are merged via concatenation. Concatenation was chosen over other merging options (e.g., averaging, adding) to ensure that none of the input is discarded. The feature vectors are merged to utilize the distinct features extracted from the different models and to improve classification.
Once the feature vectors have been merged, the resulting vector passes through another dense layer with the sigmoid activation function.
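A sketch of the merging step for one example pair of extractors follows; the branch widths and dropout rates are assumptions carried over from the branch sketch above.

```python
# Merge two feature-extraction branches by concatenation, then pass the
# merged vector through another sigmoid dense layer (assumed sizes).
from tensorflow.keras import Model, layers
from tensorflow.keras.applications import InceptionV3, ResNet50

inp = layers.Input(shape=(224, 224, 3))

def branch(base_model):
    x = layers.Flatten()(base_model(inp))
    x = layers.Dropout(0.5)(x)
    return layers.Dense(64, activation="sigmoid")(x)

resnet = ResNet50(weights="imagenet", include_top=False)
inception = InceptionV3(weights="imagenet", include_top=False)

# Concatenation keeps every extracted feature, unlike averaging or adding.
merged = layers.Concatenate()([branch(resnet), branch(inception)])
out = layers.Dense(1, activation="sigmoid")(merged)
model = Model(inp, out)
```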
The merging of pre-trained models takes place with the following combinations: VGG16 + Resnet-50, VGG16 + Inception-v3, Resnet-50 + Inception-v3, and VGG16 + Resnet-50 + Inception-v3.

5.4.3. Classification Algorithms. The classification algorithms used in the model include support vector machines, k-nearest neighbors, random forest classifiers, XGBoost, and a fully connected dense layer (with the sigmoid activation function). Before classification takes place, the hyperparameters of all of these algorithms (including the choice of kernel for the support vector machines) are selected using grid search with cross-validation.
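As an illustration of this selection step, a GridSearchCV sketch for the support vector machine follows; the parameter grids are assumptions.

```python
# Hyperparameter selection by grid search with cross-validation,
# shown for the SVM (assumed grids).
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

param_grid = {"kernel": ["linear", "poly", "rbf", "sigmoid"],
              "C": [0.1, 1, 10],
              "gamma": ["scale", 0.01, 0.001]}
search = GridSearchCV(SVC(), param_grid, cv=5)
# search.fit(features, labels) would populate search.best_params_,
# where `features` are the merged vectors and `labels` the tumor labels.
```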

5.4.4. Ensemble. The ensemble method used to combine the heterogeneous collection of classification algorithms is model stacking. Through model stacking, the predictions of various models can be combined to potentially obtain better predictive performance.
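A scikit-learn sketch of such a stacked ensemble over the four classification algorithms follows; the logistic-regression meta model follows the convention noted in section 2.3 and is an assumption here, as are all hyperparameters.

```python
# Stacking the four classification algorithms with a logistic regression
# meta model (assumed configuration).
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from xgboost import XGBClassifier  # assumes the xgboost package

stack = StackingClassifier(
    estimators=[("svm", SVC(probability=True)),
                ("knn", KNeighborsClassifier()),
                ("rf", RandomForestClassifier()),
                ("xgb", XGBClassifier())],
    final_estimator=LogisticRegression(), cv=5)
# stack.fit(features, labels) trains the base models and the meta model.
```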
The structure of the proposed model with Resnet-50 and Inception-v3 as the feature extractors and a fully connected dense layer (with the sigmoid activation function) as the classifier is shown below in Figure 3.
For other variations of the model, different combinations of feature extractors (pre-trained models) are used, and the final dense layer is simply replaced with a classification algorithm or a stacked ensemble of classifiers. The structure of an ensemble classifier stacking an XGBoost classifier, a support vector machine, a k-nearest neighbors classifier, and a random forest classifier is shown below in Figure 5.

Results
All experiments were conducted via Google Colaboratory and the results are presented in this section.
Early stopping was used when training the models to avoid overfitting; the selected parameters include monitoring validation accuracy in 'max' mode with a minimum of 10 epochs.
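A sketch of this configuration with a Keras callback is given below; reading "a minimum of 10 epochs" as a patience value is an assumption.

```python
# Early stopping on validation accuracy in 'max' mode; the patience
# value is an assumed reading of "a minimum of 10 epochs".
from tensorflow.keras.callbacks import EarlyStopping

early_stop = EarlyStopping(monitor="val_accuracy", mode="max",
                           patience=10, restore_best_weights=True)
# model.fit(..., callbacks=[early_stop]) stops training once validation
# accuracy stops improving.
```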
Note that a GPU was used during model training.
A benchmark VGG-16 model was also tested for comparison. The benchmark consists of the pre-trained VGG-16 model without its final fully connected layers; instead, a dropout layer, a flatten layer, another dropout layer, and a dense layer with the sigmoid activation function are added, in that order.
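A Keras sketch of this benchmark follows; the dropout rates and optimizer are assumptions.

```python
# Benchmark model as described: pre-trained VGG-16 without its fully
# connected layers, then dropout, flatten, dropout, and a sigmoid dense
# layer, in that order (assumed rates).
from tensorflow.keras import Sequential, layers
from tensorflow.keras.applications import VGG16

benchmark = Sequential([
    VGG16(weights="imagenet", include_top=False,
          input_shape=(224, 224, 3)),
    layers.Dropout(0.5),
    layers.Flatten(),
    layers.Dropout(0.5),
    layers.Dense(1, activation="sigmoid"),
])
benchmark.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy"])
```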

Discussion
From the results, we see that the proposed models achieved high accuracy scores in classifying the MRI images. Though exceptions did exist, accuracy greater than 0.85 on the test dataset was achieved for all combinations of feature extractors. In particular, the model that combined VGG-16 and Resnet-50 achieved a test accuracy of 0.98 for most of its classifiers/ensembles. Moreover, most of the proposed models achieved higher accuracy scores than the benchmark VGG-16 model, which achieved an accuracy of 0.843 on the test dataset.

Conclusion
In this paper, methods based on ensemble learning were proposed for detecting brain tumors in MRI images. The proposed models were successful in the given task, reaching accuracy scores of 0.98 in the case of the stacked ensemble model combining VGG-16 and Resnet-50. Though exceptions did exist, most of the proposed models in this paper outperformed the benchmark VGG-16 model.
The author hopes that research into merging model outputs and stacking ensembles continues, especially in the context of medical research.
Python programs used for data pre-processing and experimentation can be found in this Github repository: https://github.com/ethank11k/Brain-Tumor-Detection-Models/.