BRAIN TUMOR DETECTION: 2 NOVEL APPROACHES

In this paper, we propose 2 novel methods for brain tumor detection in MRI images. In the first approach, we build upon prior research on ensemble methods by testing the concatenation of pre-trained models: features extracted via transfer learning are merged and classified by classification algorithms or a stacked ensemble of those algorithms. In the second approach, we expand upon prior studies on convolutional neural networks: a convolutional neural network built from a repeated module of layers is used for classification. The first approach achieved accuracy scores of up to 0.98 and the second approach achieved a score of 0.863, both outperforming a benchmark VGG-16 model. Consideration is also given to granular computing and circuit complexity theory.


Introduction
Brain tumors are defined as the growth of abnormal cells in the human brain. Brain tumors are either benign (non-cancerous) or malignant (cancerous) and are generally classified based on the afflicted region; common brain tumors include meningioma, glioma, and pituitary tumors.
Brain tumors pose a major public health issue; according to the American Cancer Society, the total death count from brain tumors was predicted to be 18,020 in 2020, and 23,890 people were expected to be diagnosed with malignant tumors in 2020.
Treatments for brain tumors include chemotherapy, radiation therapy and surgery. However, before such treatments can begin, initial evaluation of the tumors must take place. Typically, the brain is assessed either by Magnetic Resonance Imaging (MRI) or Computed Tomography (CT) scans.
Though medical images play a crucial role in patient diagnosis and treatment, analyzing such images is a time-consuming and costly task.
As a result, computational and mathematical methods have been introduced to this field. Notably, various machine learning methods and techniques such as convolutional neural networks and support vector machines have been utilized to analyze medical images. In this paper, we expand upon past studies by testing the concatenation of pre-trained models. We also propose a convolutional neural network model consisting of a series of modules.
First, sections 2, 3, and 4 will present relevant background information, including an overview of algorithms and ensembles. Next, section 5 will present the first approach, and sections 6 and 7 will show and discuss its results. Section 8 will present the second approach, and sections 9 and 10 will show and discuss its results. Finally, section 11 will conclude the paper.

2.1. Bootstrap Aggregation (Bagging) Ensemble Method ([2]). The Bootstrap Aggregation method, more commonly known as the bagging method, involves homogeneous weak learners. The learners are trained in parallel and independently from one another, and the results are aggregated either by voting or averaging to arrive at a final prediction. Often, the homogeneous learners are base classifiers fitted on random subsets of the dataset. Given $w_l(\cdot)$ as the $L$ weak learners, the aggregated model $s(\cdot)$ is
$$s(x) = \frac{1}{L}\sum_{l=1}^{L} w_l(x)$$
(for averaging; for classification, a majority vote over the $w_l(x)$ is taken instead).
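As a minimal sketch of the bagging idea, assuming scikit-learn is available and using toy data in place of extracted image features:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier

# Toy data standing in for extracted image features (sizes are illustrative).
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Bagging: homogeneous weak learners (decision trees by default) are
# fitted on bootstrap samples and aggregated by majority vote.
bagging = BaggingClassifier(n_estimators=25, random_state=0).fit(X, y)
score = bagging.score(X, y)
```

The 25 independent trees can be inspected via `bagging.estimators_`.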

2.2. Boosting Ensemble Method ([20], [11], [10]). The Boosting method involves homogeneous weak learners that are trained in a sequential manner. Each subsequent model attempts to correct the errors of the previous model, reducing the bias of the overall classifier before a prediction is made. There are 2 main types of boosting methods that differ in how the shortcomings of weak learners are identified: adaptive boosting and gradient boosting.
Adaptive boosting identifies such shortcomings by giving higher weight to misclassified input data and lower weight to correctly classified input data. On the other hand, gradient boosting identifies the shortcomings by utilizing gradients to minimize the loss function.
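Both flavours can be sketched with scikit-learn's implementations (a hedged sketch on toy data; the estimator counts are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Adaptive boosting: misclassified samples get higher weight each round.
ada = AdaBoostClassifier(n_estimators=50, random_state=0).fit(X, y)

# Gradient boosting: each new tree fits the gradient of the loss function.
gbm = GradientBoostingClassifier(n_estimators=50, random_state=0).fit(X, y)
```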
Given $s_l(\cdot)$ as the model after $l$ steps, $c_l$ as coefficients, $w_l$ as weak learners, $E(\cdot)$ as the fitting error of the model, and $e(\cdot,\cdot)$ as the loss/error function, the following are true for adaptive boosting:
$$s_l(x) = s_{l-1}(x) + c_l\, w_l(x)$$
$$(c_l, w_l) = \arg\min_{c,w} E(s_{l-1} + c\,w) = \arg\min_{c,w} \sum_{n} e\big(y_n,\ s_{l-1}(x_n) + c\,w(x_n)\big)$$
Similarly, given $\nabla$ as the gradient, the following is true for gradient boosting:
$$s_l(x) = s_{l-1}(x) - c_l\, \nabla_{s_{l-1}} E(s_{l-1})(x)$$

2.3. Stacking Ensemble Method ([23]). The Stacking method involves heterogeneous weak learners that are trained in parallel and independently from one another. The results are then aggregated through a meta model that makes a prediction based on the predictions of the weak learners.
The meta model for classification tasks is usually logistic regression, and for regression tasks it is usually linear regression. Their respective equations are given below.
Linear Regression: $y = a + bx$
Logistic Regression: $p = \dfrac{1}{1 + e^{-(a + bx)}}$

3.1. Support Vector Machines ([7]). A support vector machine is a classifier that constructs a hyperplane in the feature space to separate the data (input) into classes. The classifier aims to maximize the distance between the hyperplane and the nearest data point of any class.
If the data set is not linearly separable, it can be projected to higher dimensions through a kernel function.
There are 4 main types of kernel functions (linear, polynomial, radial basis function (RBF), and sigmoid) and their equations are given below.
Linear: $K(X, Y) = X^\top Y$
Polynomial: $K(X, Y) = (\gamma X^\top Y + C)^d$
RBF: $K(X, Y) = \exp(-\gamma \|X - Y\|^2)$
Sigmoid: $K(X, Y) = \tanh(\gamma X^\top Y + C)$
(where $\gamma$, $d$, and $C$ are kernel parameters and $\|X - Y\|^2$ is the squared Euclidean distance.) The regularization parameter (often referred to as the C parameter) controls the extent to which misclassification should be avoided. On the other hand, the gamma parameter ($\gamma$) defines the extent of influence of a single data point when calculating the hyperplane.
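The four kernels can be compared in a few lines with scikit-learn (a sketch on toy data; the C and gamma settings are illustrative defaults):

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# One SVM per kernel; C penalizes misclassification and gamma sets the
# reach of a single training point for the non-linear kernels.
scores = {
    kernel: SVC(kernel=kernel, C=1.0, gamma="scale").fit(X, y).score(X, y)
    for kernel in ["linear", "poly", "rbf", "sigmoid"]
}
```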

3.2. K-Nearest Neighbors ([8]). K-Nearest Neighbors is a classifier that separates the data into classes by examining the distance between data points and the current given point. The algorithm selects the k closest data points and classifies the given point based on their votes.
There are several distance functions available: Euclidean, Manhattan, and Minkowski.
Euclidean Function: $d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$
Manhattan Function: $d(x, y) = \sum_{i=1}^{n} |x_i - y_i|$
Furthermore, note that cross-validation is often used to select the value of k.
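In scikit-learn the distance function is selected through the Minkowski order p, of which Euclidean (p=2) and Manhattan (p=1) are special cases (a hedged sketch on toy data):

```python
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# p=2 gives the Euclidean distance and p=1 the Manhattan distance.
knn_euclidean = KNeighborsClassifier(n_neighbors=5, p=2).fit(X, y)
knn_manhattan = KNeighborsClassifier(n_neighbors=5, p=1).fit(X, y)
```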

3.3. Random Forest ([3]). The Random Forest classifier is a bootstrap aggregation (bagging) ensemble of decision tree classifiers. The decision trees are fitted on random subsets of the dataset and their predictions are aggregated either by voting or averaging.
There are a number of heuristics (known as attribute selection measures) that are used to define how data points on certain levels of the tree will be split. 2 heuristics are given below.
Information gain (entropy): $\mathrm{Info}(D) = -\sum_{i=1}^{m} p_i \log_2(p_i)$
Gini index: $\mathrm{Gini}(D) = 1 - \sum_{i=1}^{m} p_i^2$
(where $p_i$ is the probability that an arbitrary tuple in $D$ belongs to a class $C_i$)

3.4. XGBoost ([6]). XGBoost is an ensemble of gradient boosted decision trees.
A key aspect of XGBoost is that it uses a more regularized model formalization in order to control over-fitting.
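The entropy and Gini impurity measures used by tree-based ensembles such as these can be computed directly (a pure-Python sketch, assuming the class probabilities are already known):

```python
from math import log2

def entropy(probs):
    # Info(D) = -sum(p_i * log2(p_i)); zero-probability classes contribute 0.
    return -sum(p * log2(p) for p in probs if p > 0)

def gini(probs):
    # Gini(D) = 1 - sum(p_i^2)
    return 1 - sum(p * p for p in probs)

# A 50/50 class split is maximally impure under both measures.
print(entropy([0.5, 0.5]))  # 1.0
print(gini([0.5, 0.5]))     # 0.5
```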

Convolutional Neural Networks
Convolutional neural networks are a class of deep learning neural networks, often used to analyze and classify images. A CNN works by extracting features from an input image (array of pixels). The image passes through a series of layers (usually consisting of convolutional, ReLU, and Pooling layers) before the final fully connected layer classifies the image based on "voting".

4.1. Convolutional Layer ([16]).
A convolutional layer uses filters to extract features from previous layers while preserving corresponding spatial information.
A feature map $O_s$ is calculated as shown below.
$$O_s = b_s + \sum_{r} W_{sr} * X_r$$
(where $b_s$ is the bias term, $W_{sr}$ is the sub-filter for this feature map, $*$ is the convolution operation, and $X_r$ is the $r$-th inputted feature map)

4.2. Activation Functions ([12], [4]). Activation functions define the output of neurons given a set of inputs. The function introduces non-linear properties into the network by calculating the weighted sum of inputs before determining which neurons will push forward values into the next layer.
Some common activation functions are given below.
Sigmoid: $\sigma(x) = \dfrac{1}{1 + e^{-x}}$
ReLU: $f(x) = \max(0, x)$
Tanh: $f(x) = \tanh(x)$
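These common activation functions (sigmoid and ReLU both appear in the models in this paper) can be evaluated directly:

```python
from math import exp, tanh

def sigmoid(x):
    # Squashes any real input into (0, 1).
    return 1 / (1 + exp(-x))

def relu(x):
    # Passes positive inputs through and zeroes out negative ones.
    return max(0.0, x)

print(sigmoid(0))   # 0.5
print(relu(-2.0))   # 0.0
print(tanh(0.0))    # 0.0
```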
4.3. Pooling Layers ([19], [17]). Pooling layers reduce the spatial size of activation maps while maintaining important structural elements (without unnecessary detail). The 2 most common methods are max pooling and average pooling, and they are given below.
Max Pooling: $y_{i,j,d} = \max_{(p,q) \in R_{i,j}} x_{p,q,d}$
Average Pooling: $y_{i,j,d} = \dfrac{1}{|R_{i,j}|} \sum_{(p,q) \in R_{i,j}} x_{p,q,d}$
(where $x$ is the input to the layer, $y$ is the output of the $l$-th layer, $R_{i,j}$ is the pooling region, and $H^l \times W^l \times D^l$ is the size of the $l$-th layer)

4.4. Granular Computing ([18]). The model proposed in this paper is an application of the granular computing paradigm. Granular computing is concerned with the processing of information units, or information granules. These information granules are collections of entities that are arranged together based on similar aspects. In the context of machine learning and ensemble learners, each ensemble can be thought of as a granule because it combines multiple learning algorithms. The proposed model and ensemble consist of multiple levels, where each level corresponds to a distinct level of granularity. For example, in a stacked ensemble, the collection of base-level classifiers is at the bottom level of granularity and the entire ensemble (including both the base-level classifiers and the meta classifier) is at the highest level of granularity. Two granular computing concepts, granulation and organization, can be applied to the proposed model as well. Granulation is the process of decomposing an object into parts, and organization is the process of integrating parts into a complete object. The extraction of feature vectors with convolutional neural networks is essentially granulation, and combining different classifiers and learners through a stacking ensemble is organization.
Granular computing concepts can be applied to the random forest classifier as well. The bootstrap aggregation technique (bagging ensemble) found in a random forest classifier integrates numerous decision trees into 1 ensemble; it is an application of the organization concept.

Proposed Method
The proposed method was tested on the "Brain MRI Images for Brain Tumor Detection" dataset, which contains a total of 253 images: 155 of the images contain tumors and 98 do not. The dataset was created by Navoneel Chakrabarty and can be found at https://www.kaggle.com/navoneel/brain-mri-images-for-brain-tumor-detection/data ([5]). Before the images were used to train and test the model, the dataset was pre-processed and augmented.

5.2. Data Pre-processing. Data pre-processing involved 3 main steps: splitting the dataset, cropping the images, and resizing the images.

5.2.2. Cropping. The brain was cropped out of the MRI images through a 3-step method. The method finds the largest contour of the brain, finds extreme points along the contour, and crops the image. An example is given below in Figure 1.

5.2.3. Resizing and additional pre-processing. Because the dataset contains images of differing dimensions and aspect ratios, they were resized to match the input size of the pre-trained models (dimensions 224 × 224 × 3). Additional pre-processing (including interpolation and subtracting the mean RGB channels of the dataset) was done to prepare the images for the pre-trained models as well.
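The resizing and mean-subtraction steps can be sketched with NumPy (a hedged sketch: the input image here is random, the resize is nearest-neighbour rather than the interpolation a real pipeline would use, and the mean values are computed from the sample itself):

```python
import numpy as np

# A hypothetical variable-size RGB scan (the real images differ in size).
img = np.random.default_rng(0).integers(0, 256, size=(256, 300, 3)).astype(np.float32)

# Nearest-neighbour resize to the 224 x 224 x 3 input expected by the
# pre-trained models.
rows = np.linspace(0, img.shape[0] - 1, 224).astype(int)
cols = np.linspace(0, img.shape[1] - 1, 224).astype(int)
resized = img[rows][:, cols]

# Subtract the per-channel mean RGB values.
prepared = resized - resized.mean(axis=(0, 1))
```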

5.3. Data Augmentation. To increase the size of the datasets, the data was augmented through numerous random transformations. Data augmentation reduces the possibility of over-fitting and allows the model to generalize better. The chosen augmentation options include a rotation range of 15 degrees, width and height translation ranges of 0.1, a shearing transformation range of 0.1, a brightness range between 0.5 and 1.5, and horizontal and vertical flips. An example of data augmentation is given below in Figure 2.

5.4.1. Feature extraction with pre-trained models. Because classification is done by separate classifiers, the final fully connected layers of the pre-trained models are not included. Instead, a custom flatten layer, dropout layer, and dense layer with the sigmoid activation function are added at the end of the models to prepare the tensors for the merging step. The flatten layer reshapes the tensor to a shape suitable for the classification algorithms. The dropout layer prevents the model from over-fitting, making the model more general. Finally, the dense layer with the sigmoid activation function learns non-linear relationships among the extracted features.
Input to the pre-trained model is augmented images of dimensions 224 × 224 × 3, and output from the model is feature vectors of dimensions 7 × 7 × 512. The subsequent flatten layer reshapes the tensors to a single dimension of size 25088. Next, the dropout layer drops half (0.5) of the input units. Finally, the dense layer contains 256 neurons, reshaping the tensors to size 256.
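The shape bookkeeping above can be sketched with NumPy (random values stand in for the backbone's output, and the dense weights are untrained, illustrative values):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for the 7 x 7 x 512 feature tensor from a pre-trained backbone.
features = rng.standard_normal((7, 7, 512))

# Flatten layer: 7 * 7 * 512 = 25088 values in a single dimension.
flat = features.reshape(-1)

# Dense layer with 256 units, sketched with random weights and the
# sigmoid activation.
W = rng.standard_normal((25088, 256)) * 0.01
dense_out = 1 / (1 + np.exp(-(flat @ W)))
```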

5.4.2.
Merging extracted features. The feature vectors extracted from the pre-trained convolutional neural network models are merged via concatenation. Concatenation was chosen instead of other merging options (e.g. average, add) to ensure that none of the input is discarded. The feature vectors are merged to utilize the distinct features extracted from different models and to improve classification.
Preprints (www.preprints.org) | NOT PEER-REVIEWED | Posted: 28 August 2020, doi:10.20944/preprints202008.0641.v1

Once the feature vectors have been merged, the resulting vector passes through another dense layer with the sigmoid activation function. Like before, this dense layer allows the model to find non-linear relationships among the merged features. Input to the merging step is 2 feature vectors of size 256. The output after concatenation is a feature vector of size 512. For the model variant with 3 pre-trained models, the output after concatenation is a feature vector of size 768. The next dense layer contains 256 neurons, reshaping the feature vector from size 512 (or 768) to size 256.
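The merging step amounts to a simple concatenation (a sketch with random vectors in place of real extracted features):

```python
import numpy as np

rng = np.random.default_rng(0)

# Two 256-dimensional feature vectors from different backbones.
f1, f2 = rng.standard_normal(256), rng.standard_normal(256)

# Concatenation keeps every input value, unlike averaging or addition.
merged = np.concatenate([f1, f2])                                  # size 512
merged_three = np.concatenate([f1, f2, rng.standard_normal(256)])  # size 768
```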

5.4.3. Classification Algorithms. The classification algorithms used in the model include support vector machines, k-nearest neighbors, random forest classifiers, XGBoost, and a fully connected dense layer (with the sigmoid activation function). Before classification takes place, the hyper-parameters of all of these algorithms (including the choice of kernel for support vector machines) are selected using grid search with cross-validation.
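Hyper-parameter selection by grid search with cross-validation might look as follows (a sketch on toy data; the grid values are illustrative, not the ones used in the paper):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Exhaustively evaluate every parameter combination with 5-fold
# cross-validation and keep the best-scoring one.
grid = GridSearchCV(
    SVC(),
    {"kernel": ["linear", "rbf"], "C": [0.1, 1, 10], "gamma": ["scale", "auto"]},
    cv=5,
).fit(X, y)
```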

5.4.4. Ensemble. The ensemble method used to combine the heterogeneous collection of classification algorithms is model stacking. Through model stacking, the predictions of various models can be combined to potentially obtain better predictive performance.
Some of the classification algorithms used are ensembles themselves. Random forest classifiers are bootstrap aggregation (bagging) ensembles and XGBoost is a boosting ensemble.
By training multiple decision trees in parallel and aggregating their predictions, random forest classifiers are able to improve accuracy and prevent over-fitting. On the other hand, with gradient boosting, XGBoost classifiers are able to reduce bias and improve accuracy.
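A stacked ensemble of heterogeneous classifiers can be sketched with scikit-learn (toy data; XGBoost is replaced with a random forest here to keep the sketch dependency-free, and the meta model is logistic regression as described in section 2.3):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Heterogeneous base classifiers combined by a logistic-regression
# meta model trained on their cross-validated predictions.
stack = StackingClassifier(
    estimators=[
        ("svm", SVC(random_state=0)),
        ("knn", KNeighborsClassifier()),
        ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
    ],
    final_estimator=LogisticRegression(),
    cv=5,
).fit(X, y)
```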
The structure of the proposed model with Resnet-50 and Inception-v3 as the feature extractors and a fully connected dense layer (with the sigmoid activation function) as the classifier is shown below in Figure 3.
For other variations of the model, different combinations of feature extractors (pre-trained models) are used, and the final dense layer is simply switched with a classification algorithm or a stacked ensemble of classifiers. The structure of an ensemble classifier stacking an XGBoost classifier, a support vector machine, a k-nearest neighbors classifier, and a random forest classifier is shown below in Figure 5.

Results
All experiments were conducted via Google Colaboratory and the results are presented in this section.
Early stopping was used when training the models to avoid over-fitting; the selected parameters include monitoring validation accuracy in 'max' mode with a minimum of 10 epochs.
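A minimal sketch of such an early-stopping configuration, assuming tf.keras; reading "a minimum of 10 epochs" as a patience of 10 epochs is an assumption, as the exact arguments are not stated:

```python
from tensorflow.keras.callbacks import EarlyStopping

# Illustrative configuration only: monitor validation accuracy in 'max'
# mode; patience=10 is an assumed reading of "a minimum of 10 epochs".
early_stop = EarlyStopping(monitor="val_accuracy", mode="max", patience=10)
# model.fit(..., callbacks=[early_stop])
```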
Note that a GPU was used during model training.
A benchmark VGG-16 model was also tested for comparison. The benchmark VGG-16 model consists of the pre-trained VGG-16 model without the final fully connected layers; instead, a dropout layer, flatten layer, another dropout layer, and a dense layer with the sigmoid activation function are added, in that order.
Table 1. Benchmark VGG-16 model results.
Training Accuracy: 0.993
Validation Accuracy: 0.88
Test Accuracy: 0.843

From the results, we see that the proposed models achieved high accuracy scores in classifying the MRI images. Though exceptions did exist among individual classifiers, accuracy greater than 0.85 on the test dataset was achieved for all combinations of feature extractors. In particular, the model that combined VGG-16 and Resnet-50 achieved a test accuracy of 0.98 for most of its classifiers/ensembles.
Moreover, most of the proposed models achieved higher accuracy scores when compared to the benchmark VGG-16 model which achieved an accuracy rate of 0.843 for the test dataset.

Novel Convolutional Neural Network
In this section, we propose a convolutional neural network model consisting of a series of modules.
Each module consists of a convolutional layer, a batch normalization layer, an activation layer with the sigmoid function, and a max pooling layer, in that order.
Batch normalization layers are added after each convolution to standardize layer inputs. Activation layers with the sigmoid function are added after batch normalization layers to introduce non-linear properties to the model. Max pooling layers follow the activation functions and they reduce the spatial size of the tensors.
In the beginning of the model, a zero padding layer is added to preserve information at the boundaries. Near the end of the model, a flatten layer reshapes the feature vector to be of a shape suitable for the final dense layer (which includes the sigmoid function).
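The spatial-size bookkeeping through the modules can be sketched as follows (a hedged sketch: it assumes a 'same'-padded convolution and 2x2 max pooling per module, since the paper does not state kernel sizes or strides):

```python
def after_module(size, pool=2):
    # One module: a 'same'-padded convolution keeps the spatial size,
    # then the max pooling layer halves it (assumed settings).
    return size // pool

size = 224
for _ in range(2):  # the proposed model stacks 2 such modules
    size = after_module(size)
print(size)  # 56
```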
The structure of the proposed convolutional neural network model with 2 modules (12 layers) is shown below in Figure 6.

8.1. Circuit Complexity Theory. The proposed convolutional neural network model is an application of circuit complexity theory. In the context of complexity theory, a circuit is a directed acyclic graph in which every node is associated with a computation, and the input to each successive node is the output of the preceding node. Input nodes of the graph do not have preceding nodes and output nodes of the graph do not have succeeding nodes; depth is defined as the longest path from any input node to an output node. Hence, a circuit can be representative of a convolutional neural network model.
Previous results in circuit complexity theory indicate that a shallow network may require exponentially more nodes than a deeper model. Yao showed that logic-gate circuits of depth 2 require exponential size to implement d-bit parity; on the other hand, a deep circuit of depth O(log(d)) requires only O(d) nodes ([24]).
Later, Hastad showed that certain functions computable with a polynomial-size logic-gate circuit of depth k require exponential size when the circuit is restricted to depth k − 1 ([13]). A similar result exists for circuits made of linear threshold units (formal neurons) and other families of functions ([14]).
Recent research presents similar findings for deeper models as well. Braverman provided results regarding another class of functions that cannot be represented efficiently with small-depth circuits ([1]). Also, results from a study concerning sum-product networks (networks in which every node computes either a product or a sum over real numbers) present 2 families of polynomials that can be efficiently represented with circuits of depth d but require exponential size with circuits of depth 2 ([9]).
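The depth/size trade-off for parity can be made concrete: a balanced tree of XOR gates computes d-bit parity in ceil(log2(d)) levels with only d − 1 gates, whereas the depth-2 circuits cited above require exponential size. A small sketch:

```python
def parity_tree(bits):
    # Balanced XOR tree: combine bits pairwise level by level, so d-bit
    # parity takes ceil(log2(d)) levels with d - 1 gates in total.
    depth = 0
    while len(bits) > 1:
        bits = [bits[i] ^ bits[i + 1] if i + 1 < len(bits) else bits[i]
                for i in range(0, len(bits), 2)]
        depth += 1
    return bits[0], depth

value, depth = parity_tree([1, 0, 1, 1, 0, 1, 0, 1])
print(value, depth)  # parity 1, computed in depth 3 = log2(8)
```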
With 12 layers, the proposed model aims to take advantage of the findings regarding circuits of greater depth to obtain highly accurate results more efficiently.

Results
In this section, the proposed 12 layer model is tested on the same datasets and environment (Google Colaboratory with a GPU) as before. The pre-processing steps (including cropping and data augmentation) are identical as well.

Discussion
As shown above, the proposed convolutional neural network model achieved high accuracy rates for all 3 datasets: 1.0, 0.84, and 0.863 for the training, validation, and test datasets respectively. The accuracy is reflected in the confusion matrices as well, with very low false negative and false positive rates for all 3 datasets. In fact, this model achieved a higher accuracy score on the test dataset than the benchmark VGG-16 model from Table 1.

Conclusion
In this paper, 2 novel methods for detecting brain tumors in MRI images were proposed. The proposed models were successful in the given task, reaching scores of 0.98 in the case of the stacked ensemble model combining VGG-16 and Resnet-50 and a score of 0.863 for the proposed convolutional neural network. Though exceptions did exist, most of the proposed models in this paper outperformed the benchmark VGG-16 model.
The author hopes that research into merging model outputs and stacking ensembles continues, especially in the context of medical research. The author also hopes that the application of granular computing concepts and circuit complexity theory for machine learning will continue as well.
Python programs used for data pre-processing and experimentation can be found in this Github repository: https://github.com/ethank11k/Brain-Tumor-Detection-Models/.