Content-Based Image Retrieval Based on Late Fusion of Binary and Local Descriptors

One of the challenges in Content-Based Image Retrieval (CBIR) is to reduce the semantic gap between low-level features and high-level semantic concepts. In CBIR, images are represented in a feature space, and retrieval performance depends on the selected feature representation. Late fusion, also known as visual words integration, is applied to enhance the performance of image retrieval. Recent advances in image retrieval have shifted the focus of research towards binary descriptors, as they are reported to be computationally efficient. In this paper, we investigate the late fusion of Fast Retina Keypoint (FREAK) and Scale Invariant Feature Transform (SIFT). This combination of a binary and a local descriptor is selected because, among binary descriptors, FREAK has shown good results in classification-based problems, while SIFT is robust to translation, scaling, rotation, and small distortions. The late fusion of FREAK and SIFT combines the strengths of both feature descriptors for effective image retrieval. Experimental results and comparisons show that the proposed late fusion enhances the performance of image retrieval.


I. INTRODUCTION
In image classification and retrieval-based problems, the extraction of a meaningful image descriptor is an open research problem [1]. Due to the growing number of image archives, it is necessary to design an effective system for image search [1], [2]. The conventional annotation-based approach for image retrieval relies on text and keyword-based image search [1], [3], [4]. Manual human effort and differences in visual perception make the conventional approach less effective [1], [2]. Modern image search systems retrieve images on the basis of their visual contents, and this is referred to as CBIR [1]. Image retrieval has applications in many domains, such as image analysis, image search over the internet, medical image retrieval, remote sensing, and video surveillance [1]. Recent research aims to enhance the performance of image retrieval by retrieving the images that are similar to the query image [1], [5], [6]. In the last few years, different types of image representations have been proposed for different application domains [1], [2], [7], [8], [9], [10]. In this paper, we explore a novel image representation based on the late fusion of binary and local features. The proposed framework can describe visual contents in a meaningful way, and the main focus of this research is image retrieval.
In CBIR, the commonly used image representations are mainly based on global and local features [1], [2], [11]. The global feature representations are based on color, texture, and shape [1]. Color is one of the fundamental image features, and it does not depend on size, direction, or angle [2]. Color features lack spatial distribution and perceptual meaning [1], [2]. Texture features are divided into two main classes, and texture is extracted to capture the spatial attributes of a group of pixels [2], [5]. The main limitation of spatial texture techniques is their sensitivity [...]

[...] [23] proposed image retrieval by integrating SIFT and Local Binary Pattern (LBP). The combination of SIFT and LBP is selected to enhance the performance of image retrieval in the case of noisy backgrounds and ambiguous objects. Yu et al. [22] proposed two new image representations based on early feature fusion for effective image retrieval. The clusters are constructed by applying a weighted scheme to maintain a balance between the two features. The two new feature integrations are image-based SIFT-LBP and patch-based HOG-LBP. The image-based integration of SIFT-LBP outperforms the state-of-the-art approaches [22]. Zhang et al. [27] proposed a rotation-invariant image matching system that combines SIFT and LBP. The regions of the LBP descriptors are calculated by using the SIFT detector. Kabbai et al. [28] proposed a new approach to extract invariant features from regions of interest. The uniform pattern is applied to the LBP and Center Symmetric Local Binary Pattern (CSLBP) for a robust image matching application.
According to [6], spatial information can be extracted by dividing an image into Level 1 and Level 2 triangles. Dense SIFT is used for feature extraction, and three different classifiers are applied to determine the best retrieval performance. Zeng et al. [29] proposed an image representation that is based on a generalized histogram of quantized colors. Gaussian Mixture Models (GMMs) are applied for quantization, and the Expectation-Maximization (EM) algorithm is used for training. The Bayesian Information Criterion (BIC) is applied to determine the quantized color bins. Images are retrieved on the basis of the similarity between the respective spatiograms. Walia et al. [30] proposed a fusion framework for color-based image retrieval. Color and texture are extracted by applying the Color Difference Histogram (CDH) and the Angular Radial Transform (ART), and a modification of the CDH algorithm is proposed to make it more effective. Dubey et al. [31] proposed a rotation and scale invariant hybrid image descriptor for efficient image retrieval.
The color features are extracted by quantizing the RGB color space, while texture is extracted by structuring the patterns that are generated from locally structured elements. Color and texture are integrated to construct the inherently Rotation and Scale-invariant Hybrid image Descriptor (RSHD). Montazer et al. [32] proposed a new learning method for Radial Basis Function Neural Networks (RBFNNs). Particle Swarm Optimization (PSO) is applied to initialize the radial basis function units more accurately. The spatial information of the data and the non-linearity of the function are approximated to determine the widths of the RBFNNs. According to Wan et al. [33], modern machine learning techniques based on Convolutional Neural Networks (CNNs) can reduce the semantic gap. A CNN model pre-trained on a large dataset can be used for feature extraction, and optimized feature extraction techniques based on deep learning outperform conventional features. In our previous work [34], we proposed the visual words integration (late fusion) of two local features. Different Weighted Averages (WA) of local features are also calculated to determine the second-best performance for image retrieval. The research presented in this paper differs from our previous work [34]: here, we replace SURF with FREAK to evaluate the late fusion (visual words integration) of SIFT (local descriptor) with FREAK (binary descriptor). The experimental results demonstrate the dominant effect of FREAK when it is used in late fusion with the SIFT descriptor.

[...] representation model [35]. The details of feature extraction using FREAK, SIFT, and the late fusion of FREAK and SIFT are given in the following sub-sections.

A. Fast Retina Keypoint (FREAK)
FREAK is a binary descriptor that is computed on the basis of brightness comparison tests around the keypoints and is inspired by the human visual system [19]. In the first step, a circular sampling grid is applied to generate a retinal sampling pattern with a higher density of points near the center. A one-bit Difference of Gaussian (DoG) is applied to compute the binary descriptor. The features are computed by applying a saccadic search. In the last step, the sum of local gradients over selected pairs is used to compute the rotation of the keypoints [19].
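The brightness comparison tests described above can be illustrated with a minimal sketch. It is not the FREAK implementation: it compares raw pixel values at hand-picked pairs, whereas FREAK compares smoothed receptive fields on a retinal sampling pattern. The names `binary_descriptor`, `hamming_distance`, the toy patch, and the pair layout are all assumptions made for illustration.

```python
def binary_descriptor(patch, pairs):
    """Return one bit per pair: 1 if the first sample is brighter than the second."""
    return [1 if patch[r1][c1] > patch[r2][c2] else 0
            for (r1, c1), (r2, c2) in pairs]


def hamming_distance(a, b):
    """Count the positions at which two equal-length bit lists differ."""
    return sum(x != y for x, y in zip(a, b))


# Hypothetical 3x3 grayscale patch and four hand-picked sampling pairs.
patch = [[10, 20, 30],
         [40, 50, 60],
         [70, 80, 90]]
pairs = [((0, 0), (2, 2)), ((2, 0), (0, 2)), ((1, 1), (0, 0)), ((0, 1), (1, 0))]

descriptor = binary_descriptor(patch, pairs)

# A uniform brightness shift leaves every comparison, and so every bit, unchanged.
brighter = [[value + 100 for value in row] for row in patch]
shifted = binary_descriptor(brighter, pairs)

# Flipping the patch vertically changes the outcome of these comparisons.
flipped = binary_descriptor(patch[::-1], pairs)

print(descriptor, hamming_distance(descriptor, shifted), hamming_distance(descriptor, flipped))
```

Because the descriptor is a bit string, two descriptors can be compared with the Hamming distance, which is one reason binary descriptors are reported to be computationally efficient.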

B. Scale Invariant Feature Transform (SIFT)
There are four main steps involved in computing SIFT descriptors [13]. The first step involves the computation of interest points; for this purpose, several Gaussian-blurred images are produced, and neighboring images are compared to calculate the Difference of Gaussian (DoG). The second step involves the calculation of extrema to find the stable keypoints; low-contrast points and points along the edges are removed by applying a Taylor series expansion. The determinant and trace of the Hessian matrix are used to remove outliers. In the third step, to achieve rotation invariance, the principal orientation is calculated for the keypoints. The last step involves the computation of the SIFT descriptor. For each keypoint, a set of orientation histograms is created on 4x4 pixel neighborhoods, with 8 orientation bins each [13].
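The last step can be sketched as follows. This is a simplified illustration, not the SIFT algorithm of [13]: it assumes a square patch split into a 4x4 grid of cells with 8 orientation bins per cell (4 x 4 x 8 = 128 values), and it omits Gaussian weighting, interpolation between bins, and rotation to the principal orientation. The function name `sift_like_descriptor` and the test patch are hypothetical.

```python
import math


def sift_like_descriptor(patch, grid=4, bins=8):
    """Magnitude-weighted orientation histograms over a grid x grid layout of cells.

    Assumes a square patch whose side length is divisible by `grid`.
    """
    size = len(patch)
    cell = size // grid
    descriptor = [0.0] * (grid * grid * bins)
    for r in range(1, size - 1):
        for c in range(1, size - 1):
            # Central differences approximate the image gradient.
            dx = patch[r][c + 1] - patch[r][c - 1]
            dy = patch[r + 1][c] - patch[r - 1][c]
            magnitude = math.hypot(dx, dy)
            if magnitude == 0:
                continue
            angle = math.atan2(dy, dx) % (2 * math.pi)
            orientation_bin = int(angle / (2 * math.pi) * bins) % bins
            cell_index = (r // cell) * grid + (c // cell)
            descriptor[cell_index * bins + orientation_bin] += magnitude
    # Normalize to unit length so the descriptor is less sensitive to contrast.
    norm = math.sqrt(sum(v * v for v in descriptor))
    return [v / norm for v in descriptor] if norm > 0 else descriptor


# Hypothetical 16x16 patch whose intensity increases from left to right.
ramp = [[float(c) for c in range(16)] for _ in range(16)]
ramp_descriptor = sift_like_descriptor(ramp)
print(len(ramp_descriptor))
```

For the horizontal ramp, every gradient points in the same direction, so within each cell only the first orientation bin receives any weight.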

C. Proposed Late Fusion Based on Binary and Local Descriptors
1) In the BoVW model, a raw image $I$ is represented by its pixels, indexed by $r$ and $s$.
2) Binary and local features (FREAK and SIFT) are extracted from a set of training images, and an image $I$ is represented as $I = \{d_1, d_2, \ldots, d_T\}$, where $d_1$ to $d_T$ are the image descriptors.
3) A quantization algorithm such as k-means is applied to construct the codebook (visual vocabulary) consisting of $Z$ words, represented as $CB = \{w_1, w_2, \ldots, w_Z\}$, where $CB$ is the codebook consisting of the visual words $w_1$ to $w_Z$. Separate codebooks are constructed for FREAK and SIFT by extracting the respective features.
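The codebook construction in the steps above can be sketched with a small k-means routine. The fusion step is not spelled out in these steps, so joining the FREAK and SIFT visual-word histograms by concatenation is an assumption of this sketch, as are all function names and the toy descriptors. For bit vectors, the squared Euclidean distance used here equals the Hamming distance.

```python
import random


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_word(descriptor, codebook):
    """Index of the visual word closest to the descriptor."""
    return min(range(len(codebook)),
               key=lambda i: squared_distance(descriptor, codebook[i]))


def build_codebook(descriptors, num_words, iterations=20, seed=0):
    """Plain k-means: returns num_words cluster centers (the visual words)."""
    rng = random.Random(seed)
    codebook = [list(d) for d in rng.sample(descriptors, num_words)]
    for _ in range(iterations):
        clusters = [[] for _ in range(num_words)]
        for d in descriptors:
            clusters[nearest_word(d, codebook)].append(d)
        for i, members in enumerate(clusters):
            if members:  # keep the previous center if a cluster is empty
                codebook[i] = [sum(col) / len(members) for col in zip(*members)]
    return codebook


def word_histogram(descriptors, codebook):
    """Count how many descriptors of one image fall on each visual word."""
    histogram = [0] * len(codebook)
    for d in descriptors:
        histogram[nearest_word(d, codebook)] += 1
    return histogram


def late_fusion(freak_histogram, sift_histogram):
    """Assumed fusion rule: concatenate the two visual-word histograms."""
    return freak_histogram + sift_histogram


# Hypothetical toy descriptors: 2-D "SIFT" vectors and 4-bit "FREAK" strings.
sift_train = [[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [5.1, 4.9]]
freak_train = [[0, 0, 0, 0], [0, 0, 0, 1], [1, 1, 1, 1], [0, 1, 1, 1]]
sift_codebook = build_codebook(sift_train, num_words=2)
freak_codebook = build_codebook(freak_train, num_words=2)

image_sift = [[0.1, 0.0], [4.9, 5.2], [5.0, 5.0]]
image_freak = [[0, 0, 0, 0], [1, 1, 1, 1]]
fused = late_fusion(word_histogram(image_freak, freak_codebook),
                    word_histogram(image_sift, sift_codebook))
print(fused)
```

The fused vector has one entry per visual word in both codebooks, so its length is the sum of the two codebook sizes, and its entries sum to the total number of descriptors extracted from the image.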

IV. EXPERIMENTAL PARAMETERS AND RESULTS
The proposed late fusion is evaluated by using three image datasets [36], [37], [38]. We used the image classification parameters mentioned in [34]. From each class, 70% of the images are selected for training and 30% for testing. Because clustering with k-means is unsupervised, each experiment is repeated 10 times. During every run, images are randomly selected for training and testing. The images selected for training are used for the construction of the codebook, and image retrieval performance is reported by using the images from the test dataset. The images are retrieved by calculating the closeness among the classifier score values within the same class.
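The splitting protocol described above can be sketched as follows. The helper `split_per_class`, the file names, and the use of the run index as the random seed are assumptions made for illustration; the source does not state how the random selection is implemented.

```python
import random


def split_per_class(images_by_class, train_fraction=0.7, seed=None):
    """Randomly split each class into training and testing subsets."""
    rng = random.Random(seed)
    train, test = {}, {}
    for label, images in images_by_class.items():
        shuffled = list(images)
        rng.shuffle(shuffled)
        cut = round(train_fraction * len(shuffled))
        train[label] = shuffled[:cut]
        test[label] = shuffled[cut:]
    return train, test


# Hypothetical dataset: 10 classes with 100 image file names each.
dataset = {f"class_{k}": [f"class_{k}/img_{i}.jpg" for i in range(100)]
           for k in range(10)}

# Ten independent runs, each with a fresh random 70/30 split.
runs = [split_per_class(dataset, seed=run) for run in range(10)]
first_train, first_test = runs[0]
print(len(first_train["class_0"]), len(first_test["class_0"]))
```

Splitting within each class, rather than over the whole dataset, keeps the 70/30 ratio identical for every class in every run.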
The size of the codebook affects image retrieval performance [39], [40]. Increasing the size of the codebook improves retrieval precision up to a point; beyond that point, a larger codebook decreases retrieval precision due to over-fitting [39], [40]. We constructed codebooks of different sizes from a randomly selected set of training images to find the best retrieval performance.

A. Evaluation Measures
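We selected precision and recall to evaluate the performance of the proposed late fusion [2]. Precision measures the fraction of retrieved images that are relevant to the query:
$$\text{Precision} = \frac{K_r}{X_r}$$
where $K_r$ represents the number of retrieved images that are relevant (similar) to the query and $X_r$ indicates the number of images retrieved by the system in response to the query. Recall measures the fraction of the relevant images in the database that are retrieved:
$$\text{Recall} = \frac{K_r}{X_c}$$
where $X_c$ is the total number of images of that class in the database.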
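As a small worked example, the two measures can be computed from the labels of the retrieved images, reading $K_r$ as the number of retrieved images that belong to the query class, $X_r$ as the number of retrieved images, and $X_c$ as the number of images of that class in the database. The function name `precision_recall` and the query below are hypothetical.

```python
def precision_recall(retrieved_labels, query_label, class_total):
    """Return (precision, recall) for a single query.

    retrieved_labels: class labels of the images returned by the system.
    query_label:      class label of the query image.
    class_total:      number of images of the query class in the database (X_c).
    """
    k_r = sum(1 for label in retrieved_labels if label == query_label)
    x_r = len(retrieved_labels)
    return k_r / x_r, k_r / class_total


# Hypothetical query: 20 images retrieved, 15 of them from the query class,
# and 100 images of that class in the database.
retrieved = ["beach"] * 15 + ["mountain"] * 5
precision, recall = precision_recall(retrieved, "beach", class_total=100)
print(precision, recall)
```

Here the precision is 15/20 = 0.75 and the recall is 15/100 = 0.15. Retrieving more images can raise recall but usually lowers precision.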

B. Performance Evaluation using Corel-1000
There are 10 classes in the Corel-1000 image dataset, and each class contains 100 images. Fig. 2 shows a sample of randomly selected images from each class of Corel-1000. We selected Corel-1000 for the evaluation of the proposed framework as it has recently been used to evaluate the performance of CBIR research [22], [26], [34], [32], [41]. We varied the codebook size to find the best retrieval performance of the proposed work. The results and comparisons presented in Table I and Fig. 3 indicate that the best MAP, 74.80%, is obtained from the proposed late fusion with a codebook of size 600. The best MAP obtained by using SIFT and FREAK with a codebook of size 600 is 68.29% and 56.76%, respectively. For every codebook size, the MAP of the proposed late fusion is higher than that of SIFT and FREAK. To put this performance in context, the best MAP obtained from the proposed late fusion is compared with existing research [22], [26], [34], [32], [41]. The best values of precision and recall are shown in bold in Table II and Table III. The MAP of the proposed late fusion outperforms state-of-the-art research [22], [26], [34], [32], [41].

C. Performance Evaluation Using Corel-1500
There are 15 classes in the Corel-1500 image benchmark, and each class contains 100 images [37]. Fig. 6 shows a sample of randomly selected images from each class of Corel-1500. We evaluated the proposed framework on Corel-1500 and compared the results with state-of-the-art CBIR methods [29], [34]. Fig. 3 presents a comparison of MAP as a function of codebook size using different evaluation parameters. According to the comparison results presented in Fig. 3, the proposed framework based on the late fusion of FREAK and SIFT provides better retrieval performance, with higher values of precision and recall, than the existing research [29], [34].

D. Performance Evaluation using Oliva and Torralba (OT-Scene)
The OT-Scene image dataset contains 8 classes and a total of 2688 images. Fig. 8 shows a sample of randomly selected images from each class of the OT-Scene image dataset. We evaluated the proposed framework using the OT-Scene benchmark and compared the results with state-of-the-art CBIR methods [30], [42]. Fig. 9 presents a comparison of mean precision as a function of codebook size using different evaluation parameters. The results and comparisons presented in Fig. 9 and Table V indicate that the proposed late fusion of FREAK and SIFT provides better retrieval performance than existing research [30], [42].

V. CONCLUSION AND FUTURE DIRECTIONS
The scope of this paper is the late fusion (visual words integration) of binary and local descriptors for effective image retrieval. Given its good recognition ability and efficient computation time, we selected FREAK as the binary descriptor. SIFT is selected as the local feature as it is robust to changes in translation, scaling, and rotation, and to small distortions.
Each image is represented by the late fusion of FREAK and SIFT. The proposed research is based on the BoVW framework, and classification is performed by using SVM. Testing is performed by varying the size of the codebook to find the best image retrieval precision. The results obtained from the proposed research are compared with state-of-the-art CBIR research.
The late fusion of FREAK and SIFT outperforms several CBIR methods. In the future, we will evaluate the proposed work on real-world retrieval problems by using a pre-trained CNN model.