Convolutional Neural Networks in Computer-Aided Diagnosis of Colorectal Polyps and Cancer: A Review

As a relatively high percentage of adenomatous polyps are missed during colonoscopy, a computer-aided diagnosis (CAD) tool based on deep learning can aid the endoscopist in diagnosing colorectal polyps or colorectal cancer, decreasing the polyp miss rate and preventing colorectal cancer mortality. The Convolutional Neural Network (CNN) is a deep learning method that, over the last decade, has achieved better results in detecting and segmenting specific objects in images than conventional models such as regression, support vector machines, or artificial neural networks. In recent years, studies in the medical imaging field have shown that CNN models achieve promising results in detecting masses and lesions in various body organs, including colorectal polyps. In this review, the structure and architecture of CNN models, and how colonoscopy images are processed as input and converted to output, are explained in detail. In most primary studies on colorectal polyp detection and classification, the CNN model has been treated as a black box, since the calculations performed at the different layers during training have not been clarified precisely. Furthermore, I discuss the differences between CNNs and conventional models, inspect how to train a CNN model for diagnosing colorectal polyps or cancer, and explain how to evaluate model performance after training.

The Convolutional Neural Network (CNN) is a deep learning model that emerged in the early 1990s and is used in various computer vision tasks such as object detection and face recognition [3]. In 2012, a CNN was used to classify 1.2 million images into 1000 different classes, and its phenomenal results were presented at the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) [5] [6]. Since then, CNNs have become the prevalent method in computer vision tasks [3].
Colorectal cancer is the third most common cancer and the second leading cause of cancer mortality worldwide [7], with 1.1 million new cases and 576 thousand deaths across 185 countries in 2020 [8]. Polyps in the colon can precede colorectal cancer; thus, timely diagnosis and removal of these polyps by colonoscopy can prevent colorectal cancer and its mortality. Moreover, diagnosing colorectal cancer or malignant polyps early by colonoscopy can increase the long-term survival rate [9]. However, according to studies, up to 30% of adenomatous polyps are missed by endoscopists during colonoscopy screening. In this situation, developing a computer-aided diagnosis (CAD) tool based on deep learning, or artificial intelligence in general, can help minimize the polyp miss rate [10].
Computer-aided diagnosis (CAD) is a tool that can assist doctors and radiologists in making effective diagnostic and treatment decisions. CAD relies on medical image analysis, and medical imaging is no exception among the computer vision tasks to which CNNs are applied; modalities include magnetic resonance imaging (MRI), X-ray, computed tomography (CT) [11], and histological images [12]. In recent years, CNNs have been used in various fields of medical imaging, such as detection of diabetic retinopathy and related eye diseases [13], breast lesion detection and classification [14], brain tumor detection in MRI images [15], skin lesion classification [16], automatic colorectal polyp detection and classification in colonoscopy videos [17], and coronavirus disease 2019 detection in chest X-ray images [18]. CNNs show remarkable medical imaging performance compared with conventional models like support vector machines (SVM), artificial neural networks, and k-nearest neighbors, according to primary and secondary studies such as Shin et al. [19], Pan et al. [20], and Anwar et al. [11]. In some cases, CNNs have even attained higher accuracy than humans. For instance, Choi et al. [21] showed that CNN models were approximately 4% to 5% more accurate than expert endoscopists in colorectal cancer diagnosis and polyp classification.
This review concentrates on applying CNN models for detecting and classifying colorectal polyps in colonoscopy images or videos. This paper begins with the introduction (section I), and section II presents the differences, benefits, and drawbacks of conventional and CNN models. In section III, I describe the components of CNN models, their tasks in the learning process, and the types of outputs. Section IV discusses the preparation of colonoscopy images as inputs of the CNN model and details the training, validation and testing process. In section V, I define different evaluation metrics and how to measure them. In section VI and section VII, I explain two techniques to prevent overfitting and overcome data shortage for training the model, called data augmentation and transfer learning, respectively. Section VIII details the limitations of CNN models and the approaches that can lead to improving the diagnostic performance of these models in future studies. Finally, conclusions are drawn in section IX.

II. What are the differences between conventional machine learning models and CNNs?
In most medical studies, conventional or traditional models like logistic regression [22], linear regression, artificial neural networks [23], support vector machines [24], and random forests [25] are used for diagnostic purposes. In these models, features are extracted manually, a technique called hand-crafted feature extraction [26]. These features are chosen by people based on methods such as the histogram of oriented gradients (HOG) or the hue histogram of images [27]. CNNs, by contrast, extract features automatically; the features are learned by the CNN model itself (Figure 2), achieving higher accuracy than conventional models, especially when the data is in the form of images [26]. On the other hand, we need more data to train CNN models than conventional models, since CNNs have millions of learnable features, or parameters [26], whereas the number of features in conventional models is often less than a dozen. Thus, the computational volume and cost are much higher. To address this, it is better to use a graphics processing unit (GPU) instead of the central processing unit (CPU) for training CNNs: a GPU consists of far more cores than a CPU, although each core is individually less powerful. Since training consists largely of matrix computations, which parallelize well across many cores, GPUs process them much faster.

III. How does the Convolutional Neural Network Black-Box work?
Inputs for CNNs, which are a kind of artificial neural network, should have a grid pattern, e.g., images and videos. In this model, image features are automatically learned and extracted from low level to high level: the early layers extract simple, low-level features like edges or corners, while the late layers extract complex, high-level features like specific objects. CNN models are composed of multiple layers, which are classified into convolutional layers, pooling layers, and fully connected layers. The role of the convolutional and pooling layers is feature extraction, and the fully connected layers map the extracted features to the output of the CNN model [26].
At first, images are fed into the first convolutional layer as inputs of the CNN model. An image is a kind of matrix in which each pixel corresponds to an element of the matrix. Each convolutional layer consists of filters, or kernels, that detect image features like edges, corners, textures, and objects. Kernels are also matrices, and their elements, called weights, are the values calculated in the training process. The input image matrix convolves with the kernel; in other words, the kernel matrix sweeps through the entire input matrix along its length and width, and at each position the overlapping elements are multiplied and summed (Figure 3), producing a feature map. In the next step, the elements of the feature map matrix are summed with a constant value, called the bias. Then a non-linear activation function like ReLU (Rectified Linear Unit), sigmoid, or tanh (hyperbolic tangent) is applied to each element of the resulting matrix. A significant variation can exist between the values of the feature map's elements, and bounded activation functions diminish this variation since their range is limited; for instance, for all input values, the hyperbolic tangent and sigmoid outputs lie between minus one and one, and between zero and one, respectively (Figure 4). Between the convolutional layers are pooling layers that, through down-sampling, reduce the computation volume and remove redundancy by decreasing the matrix dimensions. In these layers, a window with a specific size moves through the input matrix; wherever the window is located, either the maximum value of the elements in the window (max pooling) or their average (average pooling) becomes the corresponding element of the pooling layer's output matrix (Figure 5).
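To make the convolution, bias, activation, and pooling steps above concrete, here is a minimal NumPy sketch; the image and kernel values are purely illustrative, not taken from any real colonoscopy data:

```python
import numpy as np

def conv2d(image, kernel, bias=0.0):
    """Valid 2-D convolution: slide the kernel over the image, sum the
    element-wise products at each position, and add the bias."""
    kh, kw = kernel.shape
    ih, iw = image.shape
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel) + bias
    return out

def relu(x):
    """ReLU activation: negative values become zero."""
    return np.maximum(x, 0)

def max_pool(x, size=2):
    """Non-overlapping max pooling: each size x size window is reduced
    to its maximum value, halving each dimension for size=2."""
    h, w = x.shape[0] // size, x.shape[1] // size
    return x[:h * size, :w * size].reshape(h, size, w, size).max(axis=(1, 3))

# A tiny 4x4 "image" and a 2x2 kernel (illustrative values only).
image = np.array([[1., 2., 0., 1.],
                  [0., 1., 3., 1.],
                  [2., 0., 1., 0.],
                  [1., 1., 0., 2.]])
kernel = np.array([[1., -1.],
                   [-1., 1.]])
feature_map = relu(conv2d(image, kernel))   # 3x3 feature map
pooled = max_pool(feature_map, size=2)      # 1x1 matrix after 2x2 pooling
```

Real CNN layers apply many kernels in parallel and operate on multi-channel inputs, but each output element is produced exactly as in this single-channel loop.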
Finally, the last convolutional layer's output matrix is flattened into a vector (a matrix with only one column) before being fed into the fully connected layers as the features extracted by the convolutional and pooling layers. The fully connected layers, acting as an artificial neural network, perform the classification [29]. The overall structure of a CNN, along with the types of layers described above, is shown in Figure 6. In the related primary studies, the output of the CNN model can be divided into four overall categories: polyp classification, polyp detection, polyp localization, and polyp segmentation. In polyp classification, the CNN model recognizes the type of polyp (e.g., adenomatous, hyperplastic, serrated, or normal) in a colonoscopy image (Figure 6). In polyp detection, the CNN model only recognizes whether a colonoscopy image contains at least one polyp. In polyp localization, the CNN model marks the position of each polyp in a colonoscopy image with a rectangle rather than its exact shape (Figure 7). In polyp segmentation, the CNN model draws a margin around each polyp it detects (Figure 8) [31]. CNN models have different architectures, each with a specific name, such as U-Net [32], VGG [33], ResNet [34], and Faster R-CNN [35].

IV. Model training, validating, and testing
The CNN model maps from the input data (i.e., images) to the corresponding output (ground truth or gold standard) through a supervised learning method. The data obtained directly from colonoscopy, or raw data, is in video format, so videos should be converted into consecutive images to be processable by CNN models as inputs. Then, the obtained images should be resized to the specific size required by the CNN model's input, although some CNN architectures like U-Net are compatible with various image sizes [37]. Many related primary studies have utilized pre-prepared and publicly available datasets, including ETIS-LARIB [38], CVC-CLINIC [39], ASU-Mayo Clinic Colonoscopy Video [40], and KVASIR [41]. In contrast, in some studies such as Ozawa et al. [17], Haj-Manouchehri et al. [42], Choi et al. [21], and Shafi et al. [30], colonoscopy videos were provided in collaboration with hospitals or institutes, and these videos had to be converted to images. Finally, the dataset should be split into three parts, e.g., 90%, 5%, and 5% of the whole dataset (or other ratios), used for CNN model training (training dataset), validation, and testing (testing dataset), respectively.
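The 90%/5%/5% split described above can be sketched as follows; the frame filenames are hypothetical placeholders standing in for images extracted from colonoscopy videos:

```python
import random

def split_dataset(items, train_frac=0.90, val_frac=0.05, seed=42):
    """Shuffle a list of image paths and split it into training,
    validation, and testing subsets (here 90% / 5% / 5%)."""
    items = list(items)
    random.Random(seed).shuffle(items)   # fixed seed for reproducibility
    n = len(items)
    n_train = int(n * train_frac)
    n_val = int(n * val_frac)
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]
    return train, val, test

# Hypothetical frame filenames extracted from colonoscopy videos.
frames = [f"frame_{i:04d}.png" for i in range(1000)]
train, val, test = split_dataset(frames)   # 900 / 50 / 50 images
```

Note that when frames come from videos, consecutive frames of the same polyp are highly correlated, so in practice the split is often done per video or per patient rather than per frame to avoid leakage between the subsets.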
In the training process, the loss function is driven to a minimum, or optimum point, by optimization algorithms referred to as back-propagation and gradient descent. The loss function (or cost function) measures the similarity between the ground truth (gold standard) labels and the output of the CNN model, and its value is updated in each epoch (iteration) through the back-propagation algorithm. Sometimes, the loss function should be calculated over a subset of the training dataset instead of the whole training dataset, due to memory limitations, increased efficiency, and decreased computational cost. This subset is named a mini-batch, and the mini-batch size is usually a power of 2 (e.g., 32). Gradient descent (Figure 9) is a kind of optimization algorithm defined as follows:

W ← W − α · ∂L/∂W

In the above equation, L represents the loss function, and W denotes the learnable parameters, such as the kernels' weights in the convolutional layers, which are updated until the loss function converges to a value and remains stable for at least 20 consecutive epochs. Also, α denotes the learning rate, a small positive constant usually between 0 and 1 [26].
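A minimal sketch of this update rule, applied to a toy one-parameter loss; the quadratic loss here is illustrative only, standing in for the loss of a full CNN:

```python
def gradient_descent_step(w, grad, lr=0.1):
    """One gradient-descent update: W <- W - alpha * dL/dW."""
    return w - lr * grad

# Toy loss L(w) = (w - 3)^2, whose gradient is dL/dw = 2 * (w - 3).
w = 0.0
for _ in range(200):
    w = gradient_descent_step(w, grad=2 * (w - 3))
# w converges toward the minimum of the loss at w = 3
```

In a real CNN, W is a large collection of kernel weights and biases, and the gradient ∂L/∂W is computed for all of them at once by back-propagation, but each parameter is updated with exactly this rule.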
In the validation phase, hyperparameters such as learning rate, mini-batch size, number of epochs, and type of loss function and optimizer are tuned, and the values are determined [26].
The CNN's performance is evaluated in the testing phase by measuring the evaluation metrics defined in section V [26]. Testing is divided into internal and external categories, depending on whether the testing dataset and the training dataset come from the same place or not, respectively. An important note is that some papers, especially in the medical literature, call the testing phase the validation phase, which differs from the validation phase described above.
Sometimes, the CNN model performs astonishingly well on the training dataset but does not perform well on the testing dataset or other datasets. In other words, there is a considerable gap between the test accuracy and the training accuracy. In this case, overfitting has occurred, and to overcome this issue, we can exploit various methods like data augmentation, transfer learning, dropout, regularization, and batch normalization [11].

V. Evaluation metrics
After the CNN model is trained on the training dataset, we have to evaluate its performance on another dataset (the testing dataset). This evaluation is accomplished by measuring accuracy, sensitivity, specificity, positive predictive value (PPV), and negative predictive value (NPV), which are computed from the four cells of a two-by-two table, also called a diagnostic table:

                         Ground Truth
                         Positive    Negative
CNN output   Positive       TP          FP
CNN output   Negative       FN          TN

Accuracy = (TP + TN) / (TP + TN + FP + FN), sensitivity = TP / (TP + FN), specificity = TN / (TN + FP), PPV = TP / (TP + FP), and NPV = TN / (TN + FN).
The TP index indicates the number of images containing at least one polyp that the CNN model has correctly classified or detected, or the number of images in which the type of polyp is malignant and the model has correctly diagnosed the polyp type. The FN index represents the number of images containing at least one polyp while the CNN model cannot classify or detect it (or them), or the number of images containing malignant polyps falsely diagnosed as benign. The TN measure indicates the number of images without any polyps correctly classified by the CNN model as non-polyp images, or the number of images containing benign polyps correctly diagnosed as the benign type. Finally, FP represents the number of images without any polyps falsely classified by the CNN model as having at least one polyp, or the number of images in which the polyp is benign while the model has falsely diagnosed it as malignant.
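These metrics can be computed directly from the four counts of the two-by-two table; the counts in this sketch are made-up example values:

```python
def evaluation_metrics(tp, fp, fn, tn):
    """Standard diagnostic metrics from the two-by-two (confusion) table."""
    return {
        "accuracy":    (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": tp / (tp + fn),   # true positive rate (recall)
        "specificity": tn / (tn + fp),   # true negative rate
        "ppv":         tp / (tp + fp),   # positive predictive value (precision)
        "npv":         tn / (tn + fn),   # negative predictive value
    }

# Example counts from a hypothetical testing dataset of 200 images.
m = evaluation_metrics(tp=80, fp=10, fn=20, tn=90)
# accuracy = 170/200 = 0.85, sensitivity = 80/100 = 0.80,
# specificity = 90/100 = 0.90, ppv = 80/90, npv = 90/110
```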

VI. Data augmentation
In medical imaging, due to privacy concerns, the amount of data is often inadequate [26], and the dataset is unbalanced, meaning the number of negative-group images is many times that of positive-group images. An unbalanced dataset leads the CNN model to perform weakly in positive-group recognition, i.e., diagnosis of malignant polyps, and causes the values of the evaluation metrics to be lower than when the dataset is balanced [43]. Moreover, in the training phase of a CNN, if the training dataset is insufficient, overfitting will ensue, as explained in section IV [44]. Furthermore, the more layers the CNN model has, the larger the training dataset needs to be [42].
Geometric transformations, as effective ways for augmentation, are applied to enlarge the training dataset. Transformations that can be used include rotation (0 degrees to 360 degrees), flip from top to bottom or right to left, zoom in or out, random brightness changes, and cropping of training images. In these ways, we acquire several new images from each original image [45]. Therefore, these transformations enhance the CNN model's diagnostic performance and the evaluation metrics values [44].
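A minimal NumPy sketch of these transformations, applied to a dummy grayscale image; real pipelines typically use an augmentation library (e.g., torchvision transforms), and the brightness range here is an arbitrary illustrative choice:

```python
import numpy as np

def augment(image, seed=None):
    """Generate several new training images from one original via
    simple geometric transforms: rotations, flips, and a brightness shift."""
    rng = np.random.default_rng(seed)
    variants = []
    for k in (1, 2, 3):                      # 90, 180, 270 degree rotations
        variants.append(np.rot90(image, k))
    variants.append(np.flipud(image))        # top-to-bottom flip
    variants.append(np.fliplr(image))        # left-to-right flip
    brightness = rng.uniform(0.8, 1.2)       # random brightness change
    variants.append(np.clip(image * brightness, 0.0, 1.0))
    return variants

# Dummy 64x64 grayscale image with pixel values in [0, 1).
image = np.random.default_rng(0).random((64, 64))
new_images = augment(image, seed=0)   # six additional images per original
```

In practice, such transforms are applied on the fly during training with random parameters each epoch, so the model effectively never sees exactly the same image twice.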

VII. Transfer learning
Transfer learning is another way to tackle data shortage. In this method, the CNN is first trained on another huge dataset (mostly ImageNet). Although ImageNet is a nonmedical dataset, pre-training on it can still be useful when the CNN is subsequently trained on medical images. The CNN is initialized with the obtained pre-trained weights, and the layer weights are updated until the evaluation metric (accuracy) reaches its optimum value. In fact, updating only the last layers' weights is usually adequate since, as mentioned in section III, the early layers learn low-level features such as edges or corners, which are common to all images. However, if the category of images on which the pre-trained and target models are trained differs considerably, the early layers' weights may also need to be updated [46].
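A minimal sketch of this recipe, assuming (purely for illustration) that a model is represented as an ordered list of layers with a trainable flag; deep learning frameworks expose the same idea through per-layer freezing of parameters:

```python
class Layer:
    """Toy stand-in for a network layer carrying pre-trained weights."""
    def __init__(self, name, weights):
        self.name = name
        self.weights = weights      # e.g., weights obtained from ImageNet
        self.trainable = True

def freeze_early_layers(layers, n_frozen):
    """Freeze the first n_frozen layers (low-level edge/corner features)
    so that only the later layers are updated on the target dataset."""
    for layer in layers[:n_frozen]:
        layer.trainable = False
    return layers

# Hypothetical five-layer model initialized with pre-trained weights.
model = [Layer(f"conv{i}", weights=[0.0]) for i in range(1, 5)]
model.append(Layer("fc", weights=[0.0]))
model = freeze_early_layers(model, n_frozen=3)
trainable_names = [layer.name for layer in model if layer.trainable]
# only the last convolutional layer and the fully connected layer train
```

How many layers to freeze is itself a hyperparameter: the more the target images resemble the pre-training images, the more layers can safely stay frozen.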

VIII. Limitations of CNNs and future prospect
Although CNN models have achieved astonishing results in computer vision tasks, they may not perform optimally in some cases, and it is better to use hybrid models (combination of hand-crafted and automatic feature extraction methods) [29]. For instance, Shin et al. [27] demonstrated that the accuracy of a hybrid model involving the combination of HOG and hue histogram methods (hand-crafted feature extraction) with the dictionary learning method (automatic feature extraction) in polyp detection was 4% higher than the CNN model.
CNN models are far more data-hungry than conventional models because they have millions of learnable features or parameters. Also, due to patient privacy, medical images are less available than other types of images. To resolve this issue and to prevent overfitting, data augmentation (section VI) and transfer learning (section VII) have been utilized in most primary studies related to the subject of this paper. In addition, there is another method for overcoming data scarcity, called Generative Adversarial Networks (GANs), which was first designed by Goodfellow et al. [47] in 2014.
GAN is a machine learning framework that generates new sample images from original sample images in various domains, including medical images. This framework can be applied to the images of a training dataset to enhance the variety and number of images and boost the robustness of CNN models [48]. In Thomaz et al. [49], a Faster R-CNN model trained on a training dataset augmented by conventional data augmentation obtained a sensitivity (recall) of 61.0% in polyp segmentation on a testing dataset. In the same study, the model trained on a training dataset augmented by a GAN improved the sensitivity (recall) to 69.2% on the same testing dataset. As a result, it is recommended to exploit the GAN framework instead of conventional data augmentation in future polyp detection and segmentation studies, since using GANs not only improves the diagnostic performance of CNN models but is also less time-consuming in creating new images compared with other data augmentation methods.
As mentioned in Section IV, colonoscopy images should be rescaled to train CNN models. This resizing reduces image quality and loses some detailed information, e.g., small polypoid lesions. These issues have a negative impact on CNN model training and thus on the models' diagnostic performance. To resolve them in future studies, we can crop each colonoscopy image into several smaller patches and feed these patches to the CNN model as input instead of rescaled images [50].
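A minimal sketch of this patch-based alternative, cropping a full-resolution frame into non-overlapping patches; the 512x512 frame and 128-pixel patch size are arbitrary example values:

```python
import numpy as np

def crop_patches(image, patch_size):
    """Crop a full-resolution image into non-overlapping square patches,
    preserving detail that rescaling the whole image would destroy."""
    h, w = image.shape[:2]
    patches = []
    for i in range(0, h - patch_size + 1, patch_size):
        for j in range(0, w - patch_size + 1, patch_size):
            patches.append(image[i:i + patch_size, j:j + patch_size])
    return patches

image = np.zeros((512, 512))          # dummy full-resolution frame
patches = crop_patches(image, 128)    # 4 x 4 = 16 patches of 128 x 128
```

Each patch is then fed to the CNN at native resolution; for localization or segmentation outputs, the per-patch predictions must afterwards be stitched back into the coordinates of the original frame.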

IX. Conclusions
In recent decades, CNN models have accomplished many computer vision tasks, like object detection, image reconstruction, and medical imaging, with notable results. In this paper, I have discussed the applications of CNNs in diagnosing colorectal polyps or cancer while explaining the differences and advantages of CNNs over conventional models. Knowing the applications and benefits of CNN models, as well as their limitations and drawbacks, will help develop computer-aided diagnosis tools. Such tools can enhance endoscopists' performance during colonoscopy, ultimately increasing the colorectal cancer survival rate by reducing the polyp miss rate.