Breast tumor detection and classification in Mammograms: Gabor wavelet vs. statistical features

Breast cancer is the second leading cause of cancer death among women. Automatic classification of breast cancer lesions in mammograms is challenging because the location, size, shape, and texture of these lesions are irregular and complex. Intensity differences have been observed between cancerous and normal breast tissue when multispectral anatomical mammographic screening scans are performed. In this work, two approaches to classifying breast tumor lesions are evaluated: the first uses Gabor wavelet features and the second uses statistical features. Support vector machine (SVM), multilayer perceptron (MLP), and k-nearest neighbor (KNN) classifiers are then used in a computer-based method for breast tumor classification.


Introduction
Medical imaging is a robust and reliable diagnostic method for breast-related diseases. Breast tumors originate in the inner lining of the milk lobules of the breast and are essentially an uncontrolled growth of abnormal cells. A recent summary by the National Cancer Registry Programs states that breast cancer accounts for 28-35% of all cancers among women in major cities (Delhi, Mumbai, Chennai, etc.) and that this share is rising rapidly [1]. The American Cancer Society estimated that around 1,658,370 women in the US would be diagnosed with breast tumors, with about 589,430 deaths by 2015 [2]. Diagnosis and treatment at an early stage are considered the most effective way to improve the chance of survival [3][4]. Mammography is the main screening tool for detecting breast cancer at an early stage, and its use has been associated with a drop of at least 30% in breast cancer deaths [5]. However, some breast lesions, such as microcalcifications, breast masses, shape distortion, and asymmetry between the breasts, may be missed by screening mammography because their morphological features are very difficult to interpret [6]. Dense breast parenchyma makes it especially hard to maintain the sensitivity of mammography, which degrades both detection and classification [7]. In many types of images, texture is considered an important attribute. Using it for lesion classification is nevertheless challenging, because the location, size, shape, and texture of a tumor are not constant. Hence there is a clear need to improve screening accuracy by developing an automatic classification system. Such a system could help radiologists screen out doubtful cases; however, the accuracy of lesion classification depends on extracting the most significant features.

In the present study, two sets of features have been analyzed: Gabor wavelet features and statistical features. The first group is extracted with the Gabor wavelet method, which can provide optimized multi-resolution information in both the time and frequency domains [8]. The second group consists of statistical features, which are based on the grey-level histogram distribution of the pixels. It includes First Order Statistics (FOS) features, the Gray Level Co-occurrence Matrix (GLCM), the Gray Level Run Length Matrix (GLRLM), and the Statistical Feature Matrix (SFM) methods. These features are based on the relationships among image pixels. Although Gabor wavelet features and statistical features such as FOS and SFM have been widely employed, separately or in combination, in many studies, their individual benefits and applicability have not been compared. This motivated the present research to investigate the efficacy and capability of these features in lesion classification.

Related Work:
Many previous studies have shown that the accuracy of breast cancer identification drops as breast density increases [9]. A computer-based classification system is therefore needed for detecting breast lesions in dense mammograms. Such computer-based diagnosis systems rely on various texture models, and many studies have used texture features for classification.
Li et al. [10] considered the size and location of the region of interest (ROI) and examined their effect. They determined that the location of the ROI plays an important role in the performance of texture features: accuracy decreases when the ROI is moved from the midpoint of the breast area to the region behind the nipple, because that region contains thick or dense tissue. Bovis et al. [11] classified mammograms based on mass and density. The authors used 377 mammograms from the DDSM database and only four texture models. They adopted two different algorithms, the first addressing a two-class problem and the second a four-class problem, and used an ANN classifier, achieving 71.4% accuracy. Petroudi et al. [12] presented a method for automatic classification of breast mass or density patterns, in which the density patterns inside the breast are represented by statistical features. In their technique, every mammogram is divided into three distinct regions: breast tissue, background, and pectoral muscle. In total, 132 mammograms were selected from the Oxford Database. Hapfelmeier et al. [13] evaluated two computer-aided diagnosis (CAD) prototypes, which include segmentation, texture feature extraction, and classification of mass lesions and microcalcifications, for assessing the risk of breast cancer. The reported classification performance of the CAD prototype was about 77.7% with 242 texture features. The analysis included 1347 ROIs from the DDSM database, with a Support Vector Machine (SVM) used for classification and Linear Discriminant Analysis (LDA) for feature selection. Oliver et al. [14] suggested a CAD method for classifying breast density using morphological and texture features. They used two databases to evaluate the accuracy of the proposed system: the mini-MIAS database, which includes 322 images, and the DDSM database, which includes 831 images. They employed Bayesian, decision tree, and k-nearest neighbor classifiers and achieved a maximum accuracy of 86%.

Many texture features depend on neighboring pixels, such as the Gray Level Co-occurrence Matrix (GLCM) and the Gray Level Run-Length Matrix (GLRLM). Haralick [15] presented the GLCM method in 1973, and many investigators have since used the GLCM to extract features useful for classification. Zarchari et al. [16] extracted 100 features using contour-based, statistical, GLCM, intensity, and Gabor methods. Georgiadis et al. [17] extracted four features from histograms, 10 features from the GLRLM, and 22 features from the GLCM. Karssemeijer et al. [18] reported 67% accuracy for BIRADS-IV images using a k-NN classifier. Sampaio [19] considered 623 images from the BIRADS-II image database; shape and texture features were studied using a cellular neural network, and an accuracy of 80% was reported. Sharma et al. [20] concluded that the SVM is the best classifier for breast density classification, with an accuracy of 89% when feature selection is used, but only for the two-class problem.
Table 1 summarizes the literature review in chronological order.

Database:
The Digital Database for Screening Mammography (DDSM) [21], which is publicly available from the University of South Florida, has been used for this study. Each case in the DDSM database contains several types of files: an "ics" file, four image files stored with lossless JPEG encoding, zero to four overlay files, and "16 bit PGM" files. The two most common breast projections are Cranio-Caudal (CC) and Medio-Lateral Oblique (MLO). The CC view is taken from above, so the area close to the chest wall is not displayed. In the MLO projection, the X-ray is taken from the side at an oblique angle, so the whole breast is visible. In the present work, a total of 480 (160×3) mammograms (MLO views) were taken from the DDSM database, 160 from each of the three categories of breast lesion diagnosis.

Texture based features extraction technique:
In this study, Gabor wavelet features and statistical features are considered and compared for the classification of mammograms. Statistical features are texture-based features, which are extracted using techniques such as FOS, GLCM, GLRLM and SFM. A brief discussion of these features is presented in this section.

The Gabor function is characterized by a direction parameter, a scale parameter, and the modulation frequency W of the Gaussian function. Gabor functions of this type have many applications, such as tuning, image compression, and texture classification. When an image is convolved with a Gabor function, the result is a Gabor-filtered image. Gabor wavelets are confined within a two-dimensional Gaussian envelope, and a bank of such wavelets covers many orientations and scales of a complex plane wave; every Gabor wavelet has a specific orientation and wavelength. The Gabor wavelets are usually treated as a discrete set of self-similar functions: given a mother Gabor wavelet, the self-similar filter bank is obtained by appropriate rotations and dilations of it. The Discrete Gabor Wavelet Transform (DGWT) of an image is obtained by convolving the image with the complex conjugate of each wavelet in this bank. The transform is parameterized by the filter mask size and by the orientation and scale values, and each wavelet in the bank is the self-similar function formed by rotation and dilation of the mother wavelet.

(i) First order statistics (FOS):
First order statistical features are determined from the distribution of the grey-level values of an image. FOS deals with the individual pixels of an image and does not depend on neighboring pixels.

(ii) Gray level co-occurrence matrix (GLCM):
The GLCM is based on the relation between neighboring pixels at different offsets and angles, where the first pixel is called the reference pixel and the second the neighboring pixel. The GLCM is used to compute texture features based on second order statistics [15].

(iii) Gray level run length matrix (GLRLM):
The run-length matrix was proposed by Galloway [22]. The GLRLM analyzes coarse structures, i.e., long runs of gray levels. It is a statistical approach based on runs of pixels.

(iv) Statistical features matrix (SFM):
The statistical feature matrix was proposed by C.M. Wu [23]. It is used for texture feature extraction based on visual perception. The correlation between two statistical feature matrices is defined simply by a distance measurement. Here, four features based on visual perception have been extracted: coarseness, contrast, periodicity, and roughness.
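Three of the statistical descriptors above (FOS, GLCM and GLRLM) can be illustrated with a minimal sketch. It assumes NumPy, a 2-D integer image already quantized to `levels` gray levels, and a single horizontal direction. The function names and the particular features computed (mean, variance, skewness, kurtosis, contrast, energy) are illustrative assumptions, not the exact feature set used in this study.

```python
import numpy as np

def first_order_stats(img):
    """FOS: statistics of individual gray levels, ignoring neighbors.
    The four features returned here are an assumed, illustrative choice."""
    x = np.asarray(img, dtype=np.float64).ravel()
    mean, var = x.mean(), x.var()
    std = np.sqrt(var)
    if std == 0:  # constant image: higher moments are undefined, report 0
        return {"mean": mean, "variance": 0.0, "skewness": 0.0, "kurtosis": 0.0}
    z = (x - mean) / std
    return {"mean": mean, "variance": var,
            "skewness": float((z ** 3).mean()),
            "kurtosis": float((z ** 4).mean())}

def glcm(img, levels, dx=1, dy=0):
    """Normalized, symmetric co-occurrence matrix for offset (dx, dy), dx, dy >= 0.
    Entry (i, j) is the relative frequency of reference level i next to level j."""
    img = np.asarray(img)
    h, w = img.shape
    ref = img[:h - dy, :w - dx]   # reference pixels
    nbr = img[dy:, dx:]           # their neighbors at the given offset
    m = np.zeros((levels, levels), dtype=np.float64)
    np.add.at(m, (ref.ravel(), nbr.ravel()), 1.0)
    m = m + m.T                   # count each pair in both directions
    return m / m.sum()

def glcm_contrast(p):
    """Second-order contrast: sum over (i - j)^2 * p(i, j)."""
    i, j = np.indices(p.shape)
    return float(((i - j) ** 2 * p).sum())

def glcm_energy(p):
    """Second-order energy (angular second moment): sum of p(i, j)^2."""
    return float((p ** 2).sum())

def glrlm_horizontal(img, levels):
    """Run-length matrix for horizontal runs.
    Entry (g, r - 1) counts runs of gray level g with length r."""
    img = np.asarray(img)
    m = np.zeros((levels, img.shape[1]), dtype=int)
    for row in img:
        run_val, run_len = row[0], 1
        for v in row[1:]:
            if v == run_val:
                run_len += 1
            else:
                m[run_val, run_len - 1] += 1
                run_val, run_len = v, 1
        m[run_val, run_len - 1] += 1  # close the last run of the row
    return m
```

In practice the GLCM and GLRLM would be computed for several offsets and angles and the resulting features concatenated into one vector per ROI; the single horizontal direction here only keeps the sketch short.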

System Overview
The proposed automated algorithm comprises classification of lesion types, ROI extraction, feature extraction, dimensionality reduction, classification, and feature efficacy evaluation. The tumor classification approach described below is applied to every instance of the window. If a window is classified as containing tumor, its central pixel is labeled as tumor; if it is classified as healthy, the central pixel is labeled as healthy. A post-processing method is then applied to remove false positives and false negatives. Additionally, we report a study comparing the efficiency of Gabor wavelet features with that of a set of statistical features, i.e., the two main groups of texture-based features that have proved successful in tumor classification. Several classification methods, namely SVM, KNN, and multilayer perceptron (MLP), are applied to evaluate the efficacy of the two feature sets. The final results are then compared using three performance criteria described below. The overall process is shown in Figure 2.

Figure 2 Framework for the proposed system
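The window-based labeling step described in this section can be sketched as follows. This is only an illustration under stated assumptions: `classify_window` is a hypothetical callable standing in for the trained classifier, the window size is odd, border pixels that cannot be centered in a full window are left as healthy, and the post-processing step is not shown because the text does not specify it.

```python
import numpy as np

def label_pixels(image, classify_window, win=5):
    """Slide a win x win window over the image and label each central pixel.

    classify_window is a hypothetical callable that takes a 2-D window and
    returns True when the window is classified as tumor.
    """
    if win % 2 == 0:
        raise ValueError("win must be odd so that the window has a central pixel")
    half = win // 2
    h, w = image.shape
    labels = np.zeros((h, w), dtype=bool)  # False = healthy, True = tumor
    for r in range(half, h - half):
        for c in range(half, w - half):
            window = image[r - half:r + half + 1, c - half:c + half + 1]
            # The decision for the whole window is assigned to its central pixel.
            labels[r, c] = bool(classify_window(window))
    return labels
```

In a real pipeline, `classify_window` would extract the Gabor or statistical features from the window and pass them to the SVM, KNN, or MLP model.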

Performance criteria
The performance of the proposed approach for classifying normal, benign and cancer mammograms is measured using accuracy. Classification accuracy depends on the number of samples correctly classified; the higher the accuracy, the better the classifier is performing.

Accuracy = (TP + TN) / (TP + TN + FP + FN)

where TP = number of true positives, FP = number of false positives, TN = number of true negatives, and FN = number of false negatives. The confusion matrix shows information about the actual and predicted classifications produced by the classifier. Table 2 shows the basic confusion matrix.

SVM is designed for two-class problems. It separates complex data by finding the optimal hyperplane. The hyperplane divides the space into two classes: one side of the plane belongs to one class and the other side belongs to the other class [24].
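As a small illustration of the accuracy measure and the confusion matrix discussed in this section, the sketch below builds a confusion matrix from true and predicted labels and computes accuracy as the fraction of correctly classified samples. The function names and the integer label encoding (for example 0 = normal, 1 = benign, 2 = cancer) are assumptions made for the example.

```python
import numpy as np

def confusion_matrix(y_true, y_pred, n_classes):
    """Rows are actual classes, columns are predicted classes."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm

def accuracy(cm):
    """Correctly classified samples (the diagonal) over all samples.
    For two classes this equals (TP + TN) / (TP + TN + FP + FN)."""
    return np.trace(cm) / cm.sum()
```

For a two-class problem with the negative class encoded as 0, the matrix reads [[TN, FP], [FN, TP]], so the diagonal sum is exactly TP + TN.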

A larger margin width gives the best hyperplane between the classes.
There are many parameters used for classification.
• ε (for round-off error) = 1.0E-12

It has been observed from the results that statistical features give higher classification accuracy for all three criteria with the SVM (linear kernel), multilayer perceptron, and K-NN classifiers. However, Gabor wavelet features lead to better results for the 512×512 and 256×256 ROIs with the K-NN (K=5) method. Although Gabor wavelets are popular in medical imaging owing to their directional sensitivity, they incur a large computational cost and are slow to process. To evaluate this computational cost, the run times needed for the individual steps have been measured and are presented in Table 7.

(a) Gabor wavelet based feature extraction method
A common 2-D Gabor filter is essentially a sinusoidal function modulated by a two-dimensional Gaussian function, where W is the modulation frequency.
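For reference, a 2-D Gabor function of this kind, its self-similar filter bank, and the discrete Gabor wavelet transform are commonly written in the standard form below. Only W is named in the text; the remaining symbols are assumptions taken from the usual Gabor wavelet literature: the Gaussian widths \sigma_x and \sigma_y, the dilation factor a > 1, the scale index m, the orientation index n, the number of orientations K, the image I, and the mask variables s and t. The formulation actually used in this study may differ in detail.

```latex
% Mother Gabor wavelet: a complex sinusoid of frequency W under a 2-D Gaussian envelope
\psi(x, y) = \frac{1}{2\pi\sigma_x\sigma_y}
  \exp\!\left[-\frac{1}{2}\left(\frac{x^{2}}{\sigma_x^{2}} + \frac{y^{2}}{\sigma_y^{2}}\right)\right]
  \exp\!\left(2\pi j W x\right)

% Self-similar filter bank obtained by dilation (scale m) and rotation (orientation n)
\psi_{mn}(x, y) = a^{-m}\,\psi(\tilde{x}, \tilde{y}), \qquad
\tilde{x} = a^{-m}(x\cos\theta + y\sin\theta), \quad
\tilde{y} = a^{-m}(-x\sin\theta + y\cos\theta), \quad
\theta = \frac{n\pi}{K}

% Discrete Gabor wavelet transform of the image I over the filter mask (s, t)
G_{mn}(x, y) = \sum_{s}\sum_{t} I(x - s,\, y - t)\,\psi^{*}_{mn}(s, t)
```

Texture features are then typically taken as statistics, such as the mean and standard deviation, of the magnitude of G_{mn} for each scale and orientation.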

Figure 3
Extracted features

Table 1
Summary of previous breast density classification studies in chronological order

Table 2
Confusion matrix

In this approach, the classification is performed based on the class of the adjacent neighbor(s), so the method is known as the nearest neighbor classifier. If more than one neighbor is used, the method is known as k-Nearest Neighbors.

While many techniques for feature extraction have been published, we are not aware of any convincing comparative study in the domain of lesion classification. We have evaluated the proficiency and ability of two widely used feature sets, Gabor wavelets and statistical features, in this application. Figure 3 shows the features extracted from the statistical and Gabor wavelet methods. Tables 3 to 6 summarize the classification accuracies achieved for different ROI sizes.
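The nearest neighbor rule described above can be sketched in a few lines. The example assumes NumPy, Euclidean distance between feature vectors, and a simple majority vote among the k nearest training samples; the distance metric and the function name are assumptions, while K=5 is the value reported for the K-NN experiments.

```python
import numpy as np

def knn_predict(train_x, train_y, query, k=5):
    """Predict the class of one feature vector by majority vote
    among its k nearest training samples (Euclidean distance assumed)."""
    train_x = np.asarray(train_x, dtype=np.float64)
    train_y = np.asarray(train_y, dtype=int)
    dist = np.linalg.norm(train_x - np.asarray(query, dtype=np.float64), axis=1)
    nearest = np.argsort(dist, kind="stable")[:k]  # indices of the k closest samples
    votes = np.bincount(train_y[nearest])          # votes per class label
    return int(np.argmax(votes))                   # ties go to the lowest label
```

With k=1 this reduces to the plain nearest neighbor classifier, where the query simply takes the class of its single closest training sample.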

Table 3
Classification accuracies obtained using 64×64 size ROI dataset. The bold values denote significantly better results than the other set.

Table 6
Classification accuracies achieved on the two different feature sets obtained from 512×512 size ROI dataset.

Table 7
Run time for the individual steps of the algorithm for 40 subjects (for 256×256)

In this paper, a comparative study of Gabor and statistical features has been carried out and, subsequently, an automated framework has been proposed that can classify mammogram images containing tumors. Three well-known classifiers, MLP, SVM and KNN, have been used in the study. The comparison shows that statistical features achieve higher accuracy than Gabor wavelet based features. Moreover, statistical features have much smaller dimensionality than Gabor wavelet based features. Gabor wavelet features occupy a large amount of memory, are highly redundant, and lead to high computational costs. The proposed algorithm is based on statistical features, is highly accurate (93.4%), and has low computational complexity. Statistical features are therefore well able to differentiate tumor tissue from other tissue types in mammograms.