Retrieval of flower videos based on a query with multiple species of flowers

Searching, recognizing and retrieving a video of interest from a large collection of video data is a pressing requirement, and it has been recognized as an active area of research in computer vision, machine learning and pattern recognition. Flower video recognition and retrieval is vital in the fields of floriculture and horticulture. In this paper, we propose a model for the retrieval of videos of flowers. Initially, videos are represented by keyframes, and the flowers in the keyframes are segmented from their background. Then, features are extracted from the flower regions of the keyframes. Linear Discriminant Analysis (LDA) is adopted for the extraction of discriminating features, and a Multiclass Support Vector Machine (MSVM) classifier is applied to identify the class of the query video. Experiments have been conducted on a relatively large dataset of our own, consisting of 7788 videos of 30 different species of flowers captured with three different devices. Generally, retrieval of flower videos is addressed using a query video containing a flower of a single species. In this work, we attempt to develop a system that retrieves similar videos even for a query video containing flowers of different species.


Introduction
Digital video data has grown rapidly in recent years due to the availability of digital devices such as mobile phones and cameras. Networking technology allows users to search for and share desired videos, which has made the development of automated systems to search and retrieve videos an interesting and active research area [1]. Videos are categorized into different domains, for example sports, news, surveillance, commercials and medical; domain-specific videos are further categorized into different subcategories/classes [2]. When designing a video retrieval system, two prominent strategies are used to increase retrieval performance: first, finding more appropriate features to describe videos, and second, choosing an appropriate dimensionality reduction method to select the most discriminative features.
Developing a flower video retrieval system is a domain-specific task with many applications, for instance in the field of floriculture for commercial trade. Owing to technological developments in business, a trader can store a large volume of videos. Instead of visiting nurseries for their desired flowers, users can examine an entire flower before purchasing it or its seeds, and they can view different species of flowers along with the variants available in each species. Flowers also find applications in medicine, cosmetics, industrial extraction of oils, decoration, etc. [3]. In such cases, it is essential to develop an automated system to search and retrieve videos of flowers of the user's interest; this motivates the design of the proposed automated retrieval system. The challenges involved in designing a retrieval system for flower videos are: illumination, since lighting varies with viewing angle and season; variation in viewpoint, since a changing viewpoint alters the apparent size, shape, pose and rotation of a flower; cluttered backgrounds; intra-class variation and inter-class similarity; and multiple instances of flowers in a video.

Related Works
Generally, the video retrieval system retrieves similar videos based on query by example.
An example may be an image, keywords, a sketch, an object, a video, a video frame, etc. [4]. In the literature we find retrieval of videos based on an object [5], a frame [6], a video [2,[7][8][9] and keywords [10]. For the retrieval of videos, features and algorithms such as the optical flow tensor with Hidden Markov Models (HMMs) [7], multi-modal spectral clustering with a ranking algorithm [8], block-wise intensity comparison [2], the Scale Invariant Feature Transform (SIFT) [11], Bag-of-Features [12], and a dynamic weighted similarity measure with color and edge descriptors [9] have been used. When a set of features is used to represent a video, the dimension of the feature vector may be high, and a high-dimensional feature vector makes the video retrieval system consume more computational time. This can be reduced with feature dimensionality reduction techniques such as Principal Component Analysis (PCA) [2], the Fisher Discriminant Ratio [1], Linear Discriminant Analysis [7], semi-supervised linear discriminant analysis [13], supervised linear dimensionality reduction [12] and nonparametric discriminant analysis [14], which have been utilized to reduce the feature dimension in other video retrieval systems.

Previous work
In the proposed work, to design a flower video retrieval system, the features of our previous work [15], namely GLCM [24], LBP [25] and SIFT [22], are utilized. Instead of extracting features from the entire keyframe, features are extracted in two different modes from each keyframe of the video: first from all Flower Regions of Interest (FRoI), and second from the maximum Flower Region of Interest (Max. FRoI). A dimensionality reduction method is introduced for the features extracted from the Max. FRoI to improve the performance of the system to a great extent, which leads to fast access of videos. In the previous work [15], the query video consists of a single class of flowers; in the present work, in addition to single-class query videos, query videos containing flowers of multiple classes are also considered. The dataset considered in the present work is relatively large. A comparative study is made with the previous work to show the effectiveness of the proposed work.

Contributions of the proposed work
The contributions are summarized as follows.
1. Creation of a reasonably large dataset of videos of flowers, which shall be made publicly available for research purposes.
2. Proposal of a feature-fusion strategy to improve the performance of the existing model.
3. Adoption of a dimensionality reduction approach to improve the efficiency of the system.
4. Retrieval of videos of flowers even when a query video contains flowers of more than one class.
5. Comparison of the proposed model with an earlier proposed model and a deep learning model.

Proposed work
The proposed model comprises three stages, namely preprocessing, extraction of features and retrieval. The block diagram of the proposed flower video retrieval system using the Flower Region of Interest (FRoI) is shown in Fig. 1.

Preprocessing
The preprocessing stage involves selection of keyframes, segmentation and extraction of the Flower Region of Interest (FRoI). The proposed system initially converts a video into frames. Suppose the flower video dataset 'X' consists of 'vn' samples; it is stated as

X = {xv1, xv2, ..., xvvn}    (1)

Let the flower video xvi consist of a finite set of 'FN' frames, defined as

xvi = {f1, f2, ..., fFN}    (2)

Then the keyframes of the video xvi are selected using a GMM cluster based algorithmic model [16]: a block-wise entropy feature is extracted from each frame of the video, similar frames are grouped together using a Gaussian Mixture Model, and the frame nearest to each cluster centroid is selected as a keyframe (GMM is explained in section 3.1.1). Once the set of keyframes is selected, the video xvi is represented by 'Ky' keyframes, defined as

xvi = {k1, k2, ..., kKy}    (3)

The flowers in the keyframes are segmented from their background using the statistical region merging algorithm [17]. The set of keyframes after segmentation is defined as

xvi = {sk1, sk2, ..., skKy}    (4)
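The keyframe-selection pipeline above (block-wise entropy features, GMM clustering, centroid-nearest frame) can be sketched as follows. This is a minimal illustration, not the implementation of [16]: the block size, histogram bin count and cluster count are assumptions chosen for the example.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def blockwise_entropy(frame, block=8):
    """Entropy of the gray-level histogram of each block, concatenated."""
    h, w = frame.shape
    feats = []
    for i in range(0, h - h % block, block):
        for j in range(0, w - w % block, block):
            hist, _ = np.histogram(frame[i:i+block, j:j+block],
                                   bins=16, range=(0, 256))
            p = hist / hist.sum()
            p = p[p > 0]
            feats.append(-(p * np.log2(p)).sum())
    return np.array(feats)

def select_keyframes(frames, n_clusters=3, seed=0):
    """Cluster frames by entropy features; keep the frame nearest each centroid."""
    X = np.stack([blockwise_entropy(f) for f in frames])
    gmm = GaussianMixture(n_components=n_clusters,
                          covariance_type="diag",
                          random_state=seed).fit(X)
    keyframes = [int(np.argmin(np.linalg.norm(X - mu, axis=1)))
                 for mu in gmm.means_]
    return sorted(set(keyframes))

# toy "video": 12 random gray frames of size 32x32
rng = np.random.default_rng(0)
frames = [rng.integers(0, 256, (32, 32)).astype(np.uint8) for _ in range(12)]
keys = select_keyframes(frames, n_clusters=3)
```

On real videos the frames near a cluster centroid summarize visually similar shots, so one representative per cluster serves as a keyframe.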

Gaussian Mixture Model (GMM)
Gaussian Mixture Model (GMM) is a statistical and unsupervised learning model. GMM [18] preserves the content of the scene; the idea behind GMM is to describe pixels, some of which represent the background while others represent the foreground of the scene. A finite number of mixtures of Gaussian distributions is used to generate data points, and the sub-sampling property it preserves makes it suitable for clustering data points. The GMM parameters are estimated from the data using the Expectation-Maximization (EM) algorithm. A GMM is a weighted sum of several Gaussian densities; therefore, in the present work, GMM is used to create clusters for the selection of keyframes. Clusters are created by fitting Gaussian distributions on the data (x) with 'n' features; the Gaussian function is defined as [19,32,33]

g(x | μ, σ) = (1 / (σ√(2π))) exp(−(x − μ)² / (2σ²))

where μ is the mean and σ is the standard deviation of the data (features) 'x'.

After the segmentation of the keyframes, all flower regions are selected using connected component analysis and the selected flower regions are named Flower Regions of Interest (FRoI's) (refer Fig. 1). Then, from the FRoI's of each keyframe, features such as GLCM, LBP and SIFT are extracted for further processing.
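As a sanity check on the Gaussian component density used by the GMM, the univariate formula can be evaluated and numerically integrated; this is an illustrative sketch (the helper name `gaussian_pdf` is ours, not from the paper).

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    """Univariate Gaussian density: the component a GMM mixes."""
    return np.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))

# The density integrates to 1 (Riemann sum on a fine grid over [-8, 8],
# which captures essentially all of the standard normal's mass).
x = np.linspace(-8.0, 8.0, 20001)
area = gaussian_pdf(x, mu=0.0, sigma=1.0).sum() * (x[1] - x[0])
```

A full GMM then takes a weighted sum of such densities, one per cluster, with the weights, means and standard deviations fitted by EM.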

Extraction of Features
Video visual features such as color, texture and local invariant features play an important role in the retrieval of videos [20,21]. Some different species of flowers are similar in color; for example, red-colored rose, hibiscus and bougainvillea belong to three different species. Therefore, the color feature may not discriminate flowers of one species from another. Moreover, there exists large intra-class variability and inter-class similarity in the dataset.

Texture Features
The texture of an image/frame contains unique visual patterns. Texture features describe the object surface and are independent of object color [23]. The videos of flowers exhibit large intra-class variation, such as variation in flower color; therefore, texture features play a vital role in describing the flower region. In this work, two texture features are used, namely the Gray Level Co-occurrence Matrix and the Local Binary Pattern.

Gray Level Co-Occurrence Matrix (GLCM)
GLCM describes the texture of a flower in terms of statistical information. In the current work, the system extracts 14 different gray-level co-occurrence statistics [24,34] from each FRoI. These statistics are represented as a feature vector.
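The co-occurrence idea can be sketched in plain numpy: count pairs of quantized gray levels at a fixed offset, normalize to probabilities, and derive statistics. Only three of the 14 Haralick statistics are shown, and the 8-level quantization and horizontal offset are illustrative choices, not values from [24].

```python
import numpy as np

def glcm(img, levels=8, dx=1, dy=0):
    """Gray-level co-occurrence counts for offset (dy, dx), normalised."""
    q = (img.astype(np.float64) * levels / 256).astype(int).clip(0, levels - 1)
    M = np.zeros((levels, levels))
    h, w = q.shape
    for i in range(h - dy):
        for j in range(w - dx):
            M[q[i, j], q[i + dy, j + dx]] += 1
    return M / M.sum()

def glcm_stats(P):
    """Three of the classic Haralick statistics."""
    i, j = np.indices(P.shape)
    return {
        "contrast": ((i - j) ** 2 * P).sum(),
        "energy": (P ** 2).sum(),
        "homogeneity": (P / (1.0 + np.abs(i - j))).sum(),
    }

flat = np.full((16, 16), 128, dtype=np.uint8)   # uniform patch: no transitions
check = ((np.indices((16, 16)).sum(axis=0) % 2) * 255).astype(np.uint8)
stats_flat = glcm_stats(glcm(flat))     # contrast 0, energy 1
stats_check = glcm_stats(glcm(check))   # maximal-contrast checkerboard
```

A uniform patch yields zero contrast and unit energy, while the checkerboard concentrates all mass on the most distant level pair, which is what makes these statistics discriminative for texture.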

Local Binary Pattern (LBP)
LBP describes texture in terms of local features of the flower region. The approach of recognizing local binary patterns of image texture through their occurrence histogram proved that LBP is a powerful texture feature [25]. It is robust to gray-scale variation and transformation. In the proposed work, the system extracts LBP features [25], which are invariant to local gray-scale variations in the FRoI. LBP texture features are extracted using a 3x3 neighbourhood: the eight neighbouring pixels are thresholded against the value of the centre pixel, and the resulting binary values are weighted by powers of two and summed to obtain the LBP value of the centre pixel.
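The 3x3 thresholding-and-weighting rule described above can be written directly in numpy; this is a minimal sketch of basic LBP (the helper names and the clockwise bit order are our illustrative choices).

```python
import numpy as np

def lbp_3x3(img):
    """Basic 3x3 LBP: threshold the 8 neighbours against the centre pixel,
    weight the resulting bits by powers of two and sum them."""
    img = img.astype(np.int32)
    h, w = img.shape
    out = np.zeros((h - 2, w - 2), dtype=np.int32)
    # neighbour offsets in clockwise order starting at the top-left
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    centre = img[1:h - 1, 1:w - 1]
    for bit, (dy, dx) in enumerate(offsets):
        nb = img[1 + dy:h - 1 + dy, 1 + dx:w - 1 + dx]
        out |= (nb >= centre).astype(np.int32) << bit
    return out.astype(np.uint8)

def lbp_histogram(img, bins=256):
    """Normalised histogram of LBP codes: the actual texture descriptor."""
    codes = lbp_3x3(img)
    hist, _ = np.histogram(codes, bins=bins, range=(0, 256))
    return hist / hist.sum()

flat = np.full((5, 5), 7, dtype=np.uint8)
codes = lbp_3x3(flat)        # every neighbour equals the centre -> all bits set
hist = lbp_histogram(flat)
```

On a constant patch every neighbour passes the threshold, so every code is 255; on real textures the histogram of codes over the FRoI forms the feature vector.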

Scale Invariant Feature Transform (SIFT)
SIFT plays a vital role in video retrieval for the analysis of video content [11]. In SIFT, the set of image features is generated in four stages [22]. Around each keypoint, histograms of gradient direction and magnitude are computed over a region of 16x16 pixels, and these histograms are represented in the form of descriptors. In the current work, these feature descriptors are used to describe the FRoI's [22].
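The descriptor stage alone can be illustrated as follows: a 16x16 patch is split into 4x4 cells, each contributing an 8-bin gradient-orientation histogram weighted by gradient magnitude, giving 128 values. This is a deliberately simplified sketch; real SIFT additionally performs keypoint detection, rotation normalization, Gaussian weighting and trilinear interpolation [22].

```python
import numpy as np

def descriptor_128(patch):
    """Simplified SIFT-style descriptor for a 16x16 patch:
    4x4 cells x 8 orientation bins = 128 values, L2-normalised."""
    patch = patch.astype(np.float64)
    gy, gx = np.gradient(patch)                       # image gradients
    mag = np.hypot(gx, gy)                            # gradient magnitude
    ang = np.mod(np.arctan2(gy, gx), 2 * np.pi)       # orientation in [0, 2pi)
    bins = np.minimum((ang / (2 * np.pi) * 8).astype(int), 7)
    desc = []
    for ci in range(4):
        for cj in range(4):
            sl = (slice(4 * ci, 4 * ci + 4), slice(4 * cj, 4 * cj + 4))
            hist = np.bincount(bins[sl].ravel(),
                               weights=mag[sl].ravel(), minlength=8)
            desc.append(hist)
    desc = np.concatenate(desc)
    n = np.linalg.norm(desc)
    return desc / n if n > 0 else desc

rng = np.random.default_rng(1)
d = descriptor_128(rng.integers(0, 256, (16, 16)))
```

The L2 normalization at the end is what gives the descriptor its robustness to uniform illumination changes.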
To design the proposed model, Gray Level Co-occurrence Matrix (GLCM) [24], Local Binary Pattern (LBP) [25] and Scale Invariant Feature Transform (SIFT) [22] features are extracted. Initially, these features are extracted from the entire keyframe after segmentation [15]. Subsequently, the features are extracted from all flower regions of each keyframe of the video. Finally, the features are extracted from the Maximum Flower Region selected among all flower regions of the keyframe for the purpose of retrieval.

Entire keyframe
In this method [15], the model extracts Gray Level Co-occurrence Matrix (GLCM) [24], Local Binary Pattern (LBP) [25] and Scale Invariant Feature Transform (SIFT) [22] features from the entire keyframe after segmentation and generates a feature vector. Then, in the proposed model, these features are fused as GLCM+LBP, GLCM+SIFT, LBP+SIFT and GLCM+LBP+SIFT to improve the performance of the system.
The video xvi is then represented as a set of features, Fxvi = {fk1, fk2, ..., fkKy}, where fkj is the feature vector of the j-th keyframe; similarly, the features for all videos of the database 'X' of equation (1) are represented as FX = {Fxv1, Fxv2, ..., Fxvvn}.
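The fusion step is named in this work only as feature combinations (GLCM+LBP, GLCM+SIFT, etc.); a minimal sketch, assuming fusion means plain concatenation of the per-keyframe descriptors (the function name `fuse` and the dimensionalities are illustrative assumptions):

```python
import numpy as np

def fuse(*feature_vectors):
    """Fuse per-keyframe descriptors by simple concatenation."""
    return np.concatenate(
        [np.asarray(v, dtype=np.float64).ravel() for v in feature_vectors])

glcm_feat = np.zeros(14)    # 14 co-occurrence statistics
lbp_feat = np.zeros(256)    # 256-bin LBP code histogram
sift_feat = np.zeros(128)   # 128-d SIFT descriptor
fused = fuse(glcm_feat, lbp_feat, sift_feat)   # GLCM+LBP+SIFT combination
```

Concatenation keeps each descriptor's information intact and leaves the weighting of the modalities to the downstream classifier.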

All Flower Regions of Interest
The proposed system extracts GLCM [24], LBP [25] and SIFT [22] features from all flower regions of the keyframes, as shown in Fig. 2. In the proposed model these features are fused as GLCM+LBP, GLCM+SIFT, LBP+SIFT and GLCM+LBP+SIFT to improve the performance of the system. Let Rr be the number of selected flower regions of a keyframe ski in equation (4). Then, ski with its flower regions is defined as

ski = {r1, r2, ..., rRr}    (7)

The feature vector of all the regions of the keyframe ski in equation (7) is represented as

fski = {fr1, fr2, ..., frRr}

Then, the feature vector of all FRoI's of all keyframes of a video xvi can be defined as

Fxvi = {fsk1, fsk2, ..., fskKy}
Maximum Flower Region of Interest (Max. FRoI)
From the flower regions of each keyframe, the region with the largest area is selected as the Max. FRoI. The feature matrix of the Max. FRoI of all keyframes of a video xvi can then be defined as

MFxvi = {mfsk1, mfsk2, ..., mfskKy}    (12)

where mfskj, j = 1 to 'Ky', is the Max. FRoI feature vector of the j-th keyframe of the video xvi as shown in equation (4).
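The Max. FRoI selection can be sketched as follows, assuming the "maximum" region is the connected component with the largest pixel area in the segmented keyframe mask (the helper `max_froi` is an illustrative name, not from the paper):

```python
import numpy as np
from scipy import ndimage

def max_froi(mask):
    """Keep only the largest connected flower region (Max. FRoI)
    of a binary segmentation mask."""
    labels, n = ndimage.label(mask)          # connected component analysis
    if n == 0:
        return np.zeros_like(mask, dtype=bool)
    sizes = ndimage.sum(mask, labels, index=np.arange(1, n + 1))
    return labels == (1 + int(np.argmax(sizes)))

mask = np.zeros((8, 8), dtype=bool)
mask[0:2, 0:2] = True        # small flower region, 4 pixels
mask[4:8, 4:8] = True        # large flower region, 16 pixels
largest = max_froi(mask)     # only the 16-pixel region survives
```

Features are then extracted from this single region per keyframe, which is what makes the Max. FRoI representation compact compared with all FRoI's.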
Finally, the reduced feature vectors for all videos of the database 'X' are obtained by applying the dimensionality reduction method described in the next section.

Linear Discriminant Analysis (LDA)
LDA is a supervised dimensionality reduction method [26]. Discriminant analysis was proposed by Ronald Fisher in 1936 to find a new feature space from the original feature space. LDA plays a vital role in maximizing class separability while preserving within-class similarity: it maximizes the distance between the projected data of different classes and minimizes the distance between the projected data within a class [27,13]. Hence, in the current work, LDA is applied for the reduction of the feature dimension.
For C classes, LDA projects the original feature vector onto at most C − 1 discriminant directions; the reduced feature vector is this projection.
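The reduction step can be illustrated with scikit-learn's LDA on synthetic data; the class count, sample count and feature dimension below are illustrative, not the paper's values.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
n_classes, per_class, dim = 5, 20, 50

# well-separated synthetic "flower features": one Gaussian blob per class
X = np.vstack([rng.normal(loc=3 * c, scale=1.0, size=(per_class, dim))
               for c in range(n_classes)])
y = np.repeat(np.arange(n_classes), per_class)

# LDA projects to at most (n_classes - 1) discriminant directions
lda = LinearDiscriminantAnalysis(n_components=n_classes - 1)
Z = lda.fit_transform(X, y)   # 50-d features reduced to 4-d
```

The projection shrinks a 50-dimensional vector to 4 dimensions while keeping the classes separable, which is exactly the speed/accuracy trade-off the Max. FRoI with LDA approach exploits.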

Retrieval: Query claiming identity of class
Initially, for a given query video 'QV', the system determines the identity of the class using a Support Vector Machine (SVM). SVM is a computationally powerful tool for supervised learning [28,14,35] and a vector-space-based classification method for both linear and non-linear data. The fundamental idea of the SVM classifier is to find the optimal separating hyperplane between two classes; a multiclass extension (MSVM) is used to handle the multiple flower classes. For more information, please refer to [29,30].
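The class-identification step can be sketched with a linear multiclass SVM on synthetic feature vectors (the data, kernel and one-vs-rest scheme here are illustrative assumptions; the paper does not specify the MSVM settings):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n_classes, per_class, dim = 4, 30, 8

# synthetic per-video feature vectors, one Gaussian blob per flower class
X = np.vstack([rng.normal(loc=4 * c, size=(per_class, dim))
               for c in range(n_classes)])
y = np.repeat(np.arange(n_classes), per_class)

# one-vs-rest multiclass SVM: one separating hyperplane per class
clf = SVC(kernel="linear", decision_function_shape="ovr").fit(X, y)

# a query video's feature vector lands near the class-2 cluster (mean 8)
query = rng.normal(loc=8.0, size=(1, dim))
predicted_class = int(clf.predict(query)[0])
```

Once the class identity of the query is known, the retrieval stage only needs to rank videos within that class, which narrows the search considerably.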

Datasets
A dataset is a fundamental requirement for testing the efficiency of any automatic system.

All FRoI's
The result analysis of the proposed retrieval system trained with the features extracted from all FRoI's is shown in Fig. 4, Fig. 5 and Fig. 6.

Max. FRoI
The result analysis of the proposed retrieval system trained with the features extracted from the maximum flower region of interest is shown in Fig. 7, Fig. 8 and Fig. 9 for the SGGP, Sonycyber Shot and Canon datasets respectively. From the results we can observe that, with 70% training and 30% testing, this approach achieves an accuracy of 60.59% for the SGGP dataset, 67.07% for the Sonycyber Shot dataset and 75.79% for the Canon dataset. Further, the Max. FRoI gives improved results over all FRoI's for all three datasets.

Max. FRoI with LDA
In this section, the discriminant features obtained from the Max. FRoI using LDA are passed to the model. This improves the retrieval performance in identifying the class of the query video.

Comparative study between proposed work and previous work
In the previous work [15], features such as Gray Level Co-occurrence Matrix (GLCM) [24], Local Binary Pattern (LBP) [25] and Scale Invariant Feature Transform (SIFT) [22] are extracted from the entire keyframe, and the fusion of these features gave the model good performance: retrieval accuracies of 53.83%, 60.18% and 65.73%, shown in Fig. 10, Fig. 11 and Fig. 12 for the SGGP, Sonycyber Shot and Canon datasets respectively. In the proposed work, to further improve the retrieval performance, GLCM [24], LBP [25] and SIFT [22] features are extracted in two different modalities, as described in sections 3.2.2 and 3.2.3. The Max. FRoI and Max. FRoI with LDA methods give better results compared to the previous work [15]. The comparison between the results obtained from the previous and proposed approaches, namely features extracted from an entire keyframe, all FRoI's, Max. FRoI and Max. FRoI with LDA, is summarized in Table 4 for all datasets.

Result analysis and Discussion
We have the following observations from the proposed system of approaches namely, features extracted from an entire keyframe, all FRoI's, Max. FRoI and Max.FRoI with LDA.
1. Features extracted from entire keyframes of a video provide good results with the fusion of the features GLCM+LBP+SIFT, as shown in Fig. 10 to Fig. 12.

2. The all FRoI's approach generates almost similar results for the combination of features GLCM+LBP+SIFT compared to the features extracted from an entire keyframe, as shown in Fig. 4 to Fig. 6 for the SGGP, Sonycyber Shot and Canon datasets respectively.
3. Max. FRoI's approach generates good results for the combination of features GLCM+LBP+SIFT as shown in Fig. 7 to Fig. 9 for SGGP, Sonycyber Shot and Canon datasets respectively. From the results we can observe that, this approach generates improved results than the features extracted from entire keyframe.
4. The results of the proposed Max. FRoI with LDA approach show the effectiveness of selecting a more discriminating feature subset from the original set using LDA. The efficiency of the proposed system using Max. FRoI with LDA is improved, achieving 100% performance for the SGGP, Sonycyber Shot and Canon datasets. Table 1 and Table 2 show that the combination of features LBP+SIFT achieves good performance for the SGGP and Sonycyber Shot datasets.

Comparative study between proposed work and deep learning model
In [31], the authors proposed a flower video retrieval system using a deep learning approach, in which the similar videos for a given query video are retrieved using a Multiclass Support Vector Machine. For feature extraction, [31] considers three different modalities, namely the entire keyframe, the segmented flower region of a keyframe, and the gradient of the flower region, with features extracted using a Deep Convolutional Neural Network of the AlexNet architecture. Among these three modalities, the segmented flower region of a keyframe achieved better results on a smaller dataset. In [31], the query video consists of a single class of flowers; in the present work, in addition to single-class query videos, query videos containing flowers of multiple classes are also considered, and the dataset is relatively large. The presented model is compared against the deep learning model [31], and the comparison reveals that the proposed model is superior in terms of retrieval results. The proposed Max. FRoI with LDA system achieved 100% performance for the larger datasets, namely SGGP, Sonycyber Shot and Canon. The retrieval results in terms of Accuracy, Precision, Recall and F-measure of the existing work [31] are compared with the present work in Fig. 13, Fig. 14 and Fig. 15 for the SGGP, Sonycyber Shot and Canon datasets respectively.

Conclusion
The main aim of this work is to solve the problem of retrieval of videos of flowers through a query-by-video mechanism. The presented system works on keyframes that represent each video. Features are extracted in three different modalities, namely all flower regions in the keyframe, the maximum flower region in the keyframe and, finally, the maximum flower region with a set of discriminating features generated by LDA. The presented system is compared against our previous work and a deep learning retrieval system; the comparison reveals that the proposed system with Max. FRoI and LDA is superior to the existing models in terms of retrieval results. Further, the proposed system retrieves similar videos when the query video consists of multi-class flowers.

Future work
The research work presented in this paper can be further extended in the following ways:
1. Shot boundary or class boundary detection when a video consists of multiple species of flowers can be further explored.
2. The current research work limits the species of flowers to 30. There is scope for extending the class size and to explore different methodologies to retrieve flower videos in real time.