A Study of Tangerine Pest Recognition Using Advanced Deep Learning Methods

Abstract: To improve tangerine crop yield, recognizing and then eliminating specific pests is becoming increasingly important. The recognition task is based on features extracted from images collected from websites and outdoors. Traditional recognition methods and early deep learning models, such as KNN (k-nearest neighbors) and AlexNet, have been shown to be insufficiently accurate for this task and are therefore not preferred. In this paper, we exploit four kinds of advanced deep learning structures to classify 10 citrus pests. The experimental results show that Inception-ResNet-V3 obtains the minimum classification error.


Introduction
Tangerine is known for its fresh citrus taste and for being low in calories and high in nutrient content. According to a report by the US Department of Agriculture, a medium-sized tangerine contains 29.1 mg of vitamin C, which supplies nearly 50% of the daily requirement. It is well known that vitamin C is a powerful, natural, water-soluble antioxidant that can help the body resist infectious agents and eliminate cancer-causing free radicals [1]. In addition, tangerines are rich in potassium and low in sodium, which contributes to the relaxing of blood vessels and the maintenance of proper blood pressure. However, due to the effects of several citrus pests, the quality and quantity of tangerines have been significantly reduced. To control citrus pests, two types of approaches are commonly adopted: biological control and chemical pesticide control. No matter which method is chosen, monitoring and identifying the species of citrus pest in time can greatly help in reducing yield loss. In this paper, the recognition tasks are based on the features extracted from images that have been collected from websites and from tangerine plantations on Jeju Island (South Korea).
For low-dimensional data, using conventional classification algorithms, such as k-nearest neighbor (KNN), support vector machine (SVM), and multilayer perceptron (MLP), is reasonable. Although transfer learning cannot be used to fine-tune such models on a new input dataset, these traditional algorithms are computationally simple, so training time is not a concern, and they have few hyperparameters to tune compared with deep learning. However, for high-dimensional data (for example, when the input image size is 224 × 224), the performance of the algorithms mentioned above is not as good as that of convolutional neural networks (CNNs), even though preprocessing techniques such as principal component analysis (PCA) can be used to reduce the redundancy of the input. A study of the image classification problem was conducted in [2], which compares the accuracy of KNN, SVM, and MLP with that of CNNs. On a dataset with 10 labels, a CNN (Inception-V3) with transfer learning achieved the highest classification accuracy (88%), and CNN-2 (two convolutional layers and two fully connected layers in total) without transfer learning ranked second (43.1%). The worst was SVM, with an accuracy < 0.1%.
The development of CNNs has dramatically improved the state of the art in computer vision and natural language processing [3]. In general, a CNN is composed of four parts: convolutional layers, pooling layers, activation functions, and fully connected layers. Apart from these basic components, batch normalization layers [4] and dropout layers [5] are often added to mitigate vanishing gradients and overfitting, respectively.
CNN models can be applied to many domains, such as agriculture, stockbreeding, and medical treatment (refer to Table 1). This broad applicability gives us the confidence to apply the technology to tangerine pest recognition. We explored four kinds of advanced deep learning structures: VGG, Residual Network, Inception modules, and Xception. They have achieved great success, and most other recent models are variants of them. The experimental results show that Inception-ResNet-V3 gives the best classification performance. Our paper is structured as follows. Section 2 describes the dataset for classification. Section 3 discusses the strengths and weaknesses of the four selected structures. In Section 4, we present our experimental results and compare the performance of these algorithms. Section 5 contains the conclusion and discusses future work.

Tangerine Pests Description
Asian citrus psyllid (ACP) is regarded as a key pest because it is a carrier of huanglongbing (HLB), which is the most destructive disease in citrus-growing areas worldwide. HLB causes asymmetrical blotchy mottling of leaves. Fruit from HLB-infected trees is of no value because of poor size and quality [10].
Stink bugs are well known "plant feeders". They attack fruits, vegetables, and ornamental plants. Many species of stink bugs have been found feeding on citrus, and their favorite citrus seems to be the tangerine. The southern green stink bug (SGSB) is the one that attacks citrus most frequently [11]. When the SGSB punctures a piece of citrus, it leaves hard brownish or black spots. These injuries influence the citrus' edible qualities and, thus, lower its market value.
Asian longhorned beetle (ALB) is also a serious pest of citrus. China, Japan, and Korea are considered to be its origins. However, with the increase in global trade and movement of wood packing material, ALB is now found in more than 19 countries in the world [12]. The damage caused by the adult ALB is minor, but larval tunneling in the cambial region and the wood is deadly to citrus trees because it hinders the water and nutrient transportation.
Citrus root weevil (CRW) can cause great damage to trees throughout its life cycle. The larvae feed on the roots, tubers, or other underground portions and the adults feed along the edges of young leaves, creating semicircular notches. Monitoring for the presence of adults is difficult because they feed during the early morning and late afternoon, hiding in the foliage during the day.
Besides the four major pests described above, citrus swallowtail (CS), fork-tailed bush katydid (FTBK), green leafhopper (GL), Chinese fruit fly (CFF), fruit-piercing moth (FPM), and citrus mealybug (CM) are all picked as labels of the deep learning models. Each pest's representative pictures are shown in Figure 1.

The Image Data of Tangerine Pests
A total of 5247 images are included in our dataset (refer to Table 2). Within the same pest folder, the similarity between any two images is lower than 80%, which enhances the diversity of sampling. All images are resized to 224 × 224 by bilinear interpolation during training and testing. Following the statistical results in [13], we ensured that the object to be recognized occupies more than 50% of each image.
The effect of viewing angles on insect identification has been discussed in [14], which encouraged us to gather pest photos with diverse shooting angles. Take the ALB as an example. The same ALB photographed at different angles shows different patterns (refer to Figure 2). This strong view dependence may lead to incorrect classification for CNNs trained on single-angle inputs.

• The backgrounds of images are varying
Compared with indoor conditions, the visual noise of natural environments is more complicated and uncontrollable. In the early literature on pest detection [15][16], image segmentation algorithms were used to separate the desired objects from cluttered backgrounds in order to reduce the impact of visual noise on recognition accuracy. However, unstable factors still affect the performance of these approaches, for example, the selection of thresholds for pixels and histograms [17]. When the pest's camouflage has the same appearance as its surroundings (refer to Figure 3), image segmentation methods tend to fail.
On the other hand, increasing the background complexity of images can boost the generalization of CNNs. Overfitting is universal in machine learning, and it arises when the training examples are not diverse enough for the model to learn features that generalize. The simplest way to avoid this problem is to diversify the types of samples. The varying background is an obstacle to setting threshold values for segmentation algorithms, but here it is an advantage because it reduces the correlation between images.

Data Augmentation
Another effective way to help the model generalize is data augmentation (DA). The field of DA is not new, and various DA techniques have been applied to specific problems [18][19]. The generic practice in augmenting image data is to perform geometric transformations, such as rotation, reflection, shift, and flip.
The influence of unbalanced data on classification performance has been discussed in [20], which prompted us to increase the number of images for pests whose datasets are smaller than those of the others. An example of images generated by random combinations of geometric transformations is shown in Figure 4. Table 3 gives the parameter set of the transformations. After the DA operations, each type of pest contains 600 images.
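The balancing step described above can be illustrated with a minimal NumPy sketch. The specific transformations (flips, rotations by multiples of 90 degrees, wrap-around shifts), the `max_shift` fraction, and the function names are illustrative assumptions; the actual parameter set is the one listed in Table 3, which is not reproduced here.

```python
import numpy as np

def random_geometric_augment(image, rng, max_shift=0.1):
    """Apply a random combination of flips, a rotation, and a shift.

    The transformation set and max_shift are assumptions for illustration,
    not the settings of Table 3.
    """
    out = image
    if rng.random() < 0.5:
        out = np.fliplr(out)                      # horizontal reflection
    if rng.random() < 0.5:
        out = np.flipud(out)                      # vertical reflection
    out = np.rot90(out, k=int(rng.integers(0, 4)))  # rotate by k * 90 degrees
    h, w = out.shape[:2]
    dy = int(rng.integers(-int(max_shift * h), int(max_shift * h) + 1))
    dx = int(rng.integers(-int(max_shift * w), int(max_shift * w) + 1))
    # Wrap-around shift along the height and width axes.
    return np.roll(out, shift=(dy, dx), axis=(0, 1))

def augment_to_target(images, target=600, seed=0):
    """Add augmented copies until the class holds `target` images."""
    rng = np.random.default_rng(seed)
    result = list(images)
    while len(result) < target:
        source = images[int(rng.integers(0, len(images)))]
        result.append(random_geometric_augment(source, rng))
    return result
```

Calling `augment_to_target(images, 600)` on the square 224 × 224 images of one pest folder would pad a smaller class up to the 600 images per pest mentioned above, while leaving the original images untouched.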

Advanced Deep Learning Models
Convolutional filters have been used for many years to extract features from 2D shapes. The first successful structure of CNN is LeNet-5, which applied gradient-based learning to update weights and biases [21]. The use of LeNet-5 has considerably reduced the test errors of handwritten digit recognition from 12% to 0.8%. Based on GPU implementation, AlexNet [22] greatly increased the depth of LeNet-5, and this architecture also won the first prize in the 2012 ImageNet competition. The design of the VGG model makes the structure of CNN deeper and wider.

VGG
There are three main factors that contribute to improving the performance of a CNN:
• Depth: the number of convolutional layers
• Width: the number of convolutional filters in each layer
• Better utilization of model parameters
VGG models focus on the aspect of depth. The two famous structures of the VGG model are VGG-16 and VGG-19, which comprise 13 and 16 convolutional layers, respectively [23]. In this paper, we selected VGG-16 (refer to Figure 5) as one of the classification methods. Another difference between VGG-16 and AlexNet is that VGG-16 uses a smaller receptive window size. Using 3 × 3 convolutional filters in each convolutional layer not only increases the number of non-linear activation functions but also decreases the number of parameters. This means that we can use the same computational resources to construct a deeper CNN model. For example, one 5 × 5 convolutional layer can be replaced by two layers of 3 × 3 convolution, which cover the same receptive field. Moreover, one 5 × 5 convolutional layer has about 1.39 times as many parameters as two 3 × 3 convolutional layers with the same number of channels.
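The factor of 1.39 follows from counting weights: with the same number of input and output channels, one 5 × 5 layer has 25 weights per channel pair, whereas two 3 × 3 layers have 2 × 9 = 18. The short check below is a sketch; the channel count of 64 is an arbitrary assumption, and biases are ignored.

```python
def conv_params(kernel, c_in, c_out):
    """Number of weights in one convolutional layer (biases ignored)."""
    return kernel * kernel * c_in * c_out

channels = 64  # hypothetical channel count; the ratio does not depend on it
one_5x5 = conv_params(5, channels, channels)
two_3x3 = 2 * conv_params(3, channels, channels)
ratio = one_5x5 / two_3x3

print(one_5x5, two_3x3, round(ratio, 2))  # prints: 102400 73728 1.39
```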

Residual Network
It is generally believed that the deeper the network, the higher the classification accuracy. However, as the depth of the CNN increases, the accuracy gets saturated and then drops rapidly. The design of residual networks (RN) is aimed at solving this problem of degradation [24]. Instead of directly fitting a desired underlying mapping, He et al. [25] add an identity mapping to each building block. To save computation costs, two layers of 1 × 1 convolution are embedded in the residual block (refer to Figure 6). The first 1 × 1 convolution reduces the number of channels, and thus the computational complexity, and the second 1 × 1 convolution is responsible for restoring the dimensions. The same bottleneck architecture is followed in this paper.
The invention of Inception models breaks the mode of stacking layers, which provides us with a new way to increase the depth and width of a CNN. Inception models are more difficult to design than a VGG or RN because of the diversified modules in them. However, under the same computational budget, Inception models can be constructed more deeply and widely so as to achieve higher accuracy than a VGG or RN.
According to the above comparison, Inception-V3 is selected as the basic model. Meanwhile, we also added an identity mapping to each of its modules to accelerate training. Figure 7 shows one of the modules in the improved Inception-V3 (Inception-ResNet-V3).
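The identity mapping and the bottleneck design described for the residual block can be sketched in NumPy as follows. This is an illustrative forward pass under simplifying assumptions (a single image in height × width × channels layout, no batch normalization, no biases), not the implementation used in the experiments; all function names are hypothetical.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def conv1x1(x, w):
    """1x1 convolution: x is (H, W, C_in), w is (C_in, C_out)."""
    return x @ w

def conv3x3(x, w):
    """3x3 convolution with zero 'same' padding: w is (3, 3, C_in, C_out)."""
    h, wd, _ = x.shape
    xp = np.pad(x, ((1, 1), (1, 1), (0, 0)))
    out = np.zeros((h, wd, w.shape[-1]))
    for i in range(3):
        for j in range(3):
            out += xp[i:i + h, j:j + wd, :] @ w[i, j]
    return out

def bottleneck_block(x, w_reduce, w_mid, w_expand):
    """Residual bottleneck: output = relu(F(x) + x)."""
    f = relu(conv1x1(x, w_reduce))   # first 1x1: reduce the channel count
    f = relu(conv3x3(f, w_mid))      # 3x3 convolution on the reduced tensor
    f = conv1x1(f, w_expand)         # second 1x1: restore the channel count
    return relu(f + x)               # identity mapping (shortcut connection)
```

Because the shortcut adds `x` back unchanged, the block only has to learn the residual `F(x)`; with all weights at zero it reduces to `relu(x)`, which is one way to see why very deep stacks of such blocks remain trainable.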

Xception
Xception [30] is an extreme version of the Inception model that executes the channel-wise convolution and the cross-channel convolution separately. The basic operator of Xception is called separable convolution (refer to Figure 8), and it requires far fewer parameters than regular convolution. For channel-wise convolution, Xception adopts a smaller window (3 × 3), just as in the VGG networks. For cross-channel convolution, a 1 × 1 regular convolution is performed. Figure 9 shows a building block of Xception. Because of the multiscale processing modules inside them, Inception models outperform other types of networks in classification accuracy. On the other hand, these modules increase the difficulty of design. As new structures are proposed in [31][32], more evidence arises suggesting that separable convolution can significantly improve the utilization rate of model parameters.
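The parameter saving of separable convolution can be checked with the counts given in the caption of Figure 8: m × 3 × 3 × u for a regular convolution and m × 3 × 3 + m × 1 × 1 × u for a separable one, where m and u are the numbers of input and output channels. The channel values below are arbitrary assumptions chosen only for illustration.

```python
def regular_conv_params(m, u, k=3):
    """Weights of a regular k x k convolution with m input and u output channels."""
    return m * k * k * u

def separable_conv_params(m, u, k=3):
    """Channel-wise k x k convolution followed by a 1 x 1 cross-channel convolution."""
    return m * k * k + m * 1 * 1 * u

m, u = 128, 256  # hypothetical channel counts
regular = regular_conv_params(m, u)
separable = separable_conv_params(m, u)

print(regular, separable)  # prints: 294912 33920
```

For these channel counts the separable operator needs roughly one ninth of the weights, and the saving approaches a factor of 9 (the window area) as the number of output channels grows.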

Experiment and Results
In this section, the first four parts represent the preparations for training models. The last part compares different aspects of the selected architectures.

Experiment Setup
We randomly selected 500 images from each pest folder to train the models, with a ratio of 4:1 between the training set and the validation set. The remaining 100 images were used for testing. The framework used to implement the models is Keras with TensorFlow as the backend. The hardware is a GPU (GTX 1080Ti, 12G). The code and models are available at https://github.com/xingshulicc/citrus-pest-classification-by-advanced-deep-learning.
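As a concrete reading of this split, the sketch below divides the 600 images of one pest folder into 400 training, 100 validation, and 100 test images. The function name, the fixed seed, and the use of file paths as input are assumptions for illustration; the released repository may organize the split differently.

```python
import random

def split_pest_folder(image_paths, n_train_val=500, train_parts=4, val_parts=1, seed=0):
    """Randomly split one pest folder into training, validation, and test lists."""
    paths = list(image_paths)
    random.Random(seed).shuffle(paths)            # random selection without replacement
    train_val, test = paths[:n_train_val], paths[n_train_val:]
    n_train = n_train_val * train_parts // (train_parts + val_parts)
    return train_val[:n_train], train_val[n_train:], test

# Hypothetical file names for one pest class after data augmentation.
folder = [f"ALB_{i:03d}.jpg" for i in range(600)]
train, val, test = split_pest_folder(folder)
print(len(train), len(val), len(test))  # prints: 400 100 100
```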

Hyperparameters of Models
Besides the structure, the hyperparameters of the model can also determine the performance of the classification.
• Mini-batch size
The most popular algorithm for updating the weights and biases of deep neural networks is mini-batch stochastic gradient descent (SGD). Therefore, a good choice of mini-batch size can help improve the quality of the models. In theory, a large batch size better approximates the statistics of the entire dataset and, thus, increases the convergence precision. However, Masters et al. [33] have shown that small batch sizes (between 2 and 32) speed up training and enhance the generalization capability of the model. Based on the experimental results of [33], we selected 8 as the final batch size.
• Learning rate
A static learning rate may lead to slow convergence or fluctuation around the minimum. Thus, we let the learning rate change during training. The learning rate was initialized to 0.0001, and a decay was introduced to reduce its value after every epoch.
• Momentum
Momentum contributes to accelerating convergence, and its value is usually set to 0.9. There are two types of momentum in deep learning: classical and Nesterov [34]. Nesterov momentum has recently become a preferred choice because it can change the velocity of gradient descent in a quicker and more responsive way.
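The learning-rate decay and the Nesterov update described in the list above can be sketched as follows. The time-based decay formula and the decay value of 0.01 are assumptions, since the decay used in the experiments is not stated here; the toy objective and its learning rate of 0.1 are likewise chosen only to make the example converge quickly.

```python
def decayed_learning_rate(epoch, initial_lr=1e-4, decay=0.01):
    """Time-based decay: the rate shrinks after every epoch (decay is hypothetical)."""
    return initial_lr / (1.0 + decay * epoch)

def nesterov_step(w, v, grad_fn, lr, momentum=0.9):
    """One SGD step with Nesterov momentum.

    The gradient is evaluated at the look-ahead point w + momentum * v,
    which is what distinguishes Nesterov from classical momentum.
    """
    lookahead = w + momentum * v
    v_new = momentum * v - lr * grad_fn(lookahead)
    return w + v_new, v_new

# Toy objective f(w) = 0.5 * w**2, whose gradient is simply w.
w, v = 5.0, 0.0
for _ in range(200):
    w, v = nesterov_step(w, v, grad_fn=lambda x: x, lr=0.1)
```

After the loop, `w` is very close to the minimizer 0, and `decayed_learning_rate(0)` returns the initial value 0.0001 before decreasing monotonically with the epoch index.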

Transfer Learning
Transfer learning allows us to transfer the knowledge gained from one problem to a different but related problem. It significantly improves the efficiency of learning. A good example of transfer learning for CNNs is initialization with a pre-trained model. Moreover, Yosinski et al. [35] have found that convolutional filters at different depths play distinct roles. In the first convolutional layers, the extracted features are general and applicable to other datasets. However, the features learned by convolutional layers close to the classification layer are not general but specific to the dataset.
We initialize all convolutional layers with a pre-trained model and freeze the parameters of the first convolutional layer. Figure 10 shows a comparison between VGG-16 with pre-training and without pre-training. It can be seen that the fine-tuned VGG-16 model achieves higher classification accuracy and faster convergence.

Against Overfitting
Even though the model has been pre-trained, we notice that the problem of overfitting still exists (refer to Figure 10a). Overfitting cannot be completely eliminated but can be reduced to a low level by available methods. Besides extending the dataset, decreasing the complexity of the model is an alternative method.
The number of parameters in fully connected layers usually accounts for half of the total parameters of the network. Therefore, restricting the size of the fully connected layers becomes a priority. In general, dropout [5] is the first choice. This approach drops units from the neural network during training by multiplying their outputs by Bernoulli-distributed random variables, which produces many potential subnetworks during training. At test time, the average prediction of these thinned networks is approximated by using the full network with scaled-down weights. However, dropout is known to slow down training. Another available method is global average pooling (or global max pooling), which was first proposed in Network in Network [36]. The global average pooling layer computes the average of each feature map from the last convolutional layer and feeds the average values directly into the classification layer. Global average pooling takes less time per epoch than dropout (refer to Table 4). The structures of fully connected layers constructed by dropout and global average pooling are shown in Figure 11.
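Both techniques can be written in a few lines of NumPy. The sketch below assumes a single image with feature maps in height × width × channels layout and the original (non-inverted) dropout formulation of [5], in which activations are scaled by the keep probability at test time; the function names and shapes are illustrative assumptions.

```python
import numpy as np

def global_average_pooling(feature_maps):
    """Average each feature map: (H, W, C) -> (C,)."""
    return feature_maps.mean(axis=(0, 1))

def dropout_train(x, keep_prob, rng):
    """Training: multiply units by Bernoulli(keep_prob) random variables."""
    mask = rng.random(x.shape) < keep_prob
    return x * mask

def dropout_test(x, keep_prob):
    """Test: scale activations to approximate averaging the thinned networks."""
    return x * keep_prob

# Hypothetical output of the last convolutional layer: 7 x 7 maps, 512 channels.
features = np.ones((7, 7, 512)) * np.arange(512)
pooled = global_average_pooling(features)   # one value per feature map
```

Global average pooling has no trainable parameters, so the classification layer receives only one value per feature map, which is why it shrinks the fully connected part of the network.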

Classification Performance and Comparisons
The training results of the selected models with early stopping are shown in Figure 12. It can be seen that the overfitting in each model has been controlled at a very low degree (< 4%). With regard to convergence, VGG-16 is the fastest. It stopped at epoch 28 with 98.12% training accuracy and 94.05% validation accuracy. In contrast, the convergence of Xception is the slowest: it stopped at epoch 100, but with higher training and validation accuracy.
As for computational complexity (refer to Table 5), Inception-ResNet-V3 occupies the most computing resources, with 54,352,106 parameters in total. ResNet-50, composed of 27,804,554 parameters, ranks second. The smallest is VGG-16, which has only 14,982,474 parameters. Table 5 also lists the time consumption per epoch of each model.
According to Table 6, we find that the more complex a network is, the better the classification performance it achieves. For example, the test accuracy of Inception-ResNet-V3 is the highest, and VGG-16 unsurprisingly gets the worst classification result. This phenomenon is also consistent with the strategies used to improve the performance of CNNs. We have analyzed the confusion matrix of Inception-ResNet-V3 for the test dataset (refer to Table 7) and found that the error rate of CRW is the highest (6%). The misclassified images are shown in Figure 13. Different species of pest can present similar shapes and color patterns under special shooting conditions. The probability of this case is very low, but it is inevitable in the process of data collection. Inception-ResNet-V3 performed badly on these samples, which indicates that the model has difficulty with the classification of fine-grained images.
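A per-class error rate such as the 6% reported for CRW can be read off a confusion matrix as one minus the fraction of correctly classified samples in each row. The sketch below assumes that rows are true classes and columns are predicted classes; the 3 × 3 matrix is a made-up toy example and does not reproduce Table 7.

```python
import numpy as np

def per_class_error_rate(confusion):
    """Error rate per true class, assuming rows = true labels, columns = predictions."""
    confusion = np.asarray(confusion, dtype=float)
    correct = np.diag(confusion)        # correctly classified samples per class
    totals = confusion.sum(axis=1)      # number of test samples per class
    return 1.0 - correct / totals

# Toy confusion matrix with 100 test images per class (hypothetical values).
toy_confusion = [
    [94, 6, 0],
    [1, 99, 0],
    [0, 2, 98],
]
errors = per_class_error_rate(toy_confusion)  # approximately [0.06, 0.01, 0.02]
```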

Conclusions and Future Work
Because of pest damage, the quality and quantity of tangerines have been reduced. To improve the yield of this fruit, four types of CNN models were applied to identify species of citrus pests. We initialized the convolutional layers of each model by transfer learning and froze the parameters of the first convolutional layer during training and testing. However, we found that the overfitting problem was still considerable. Therefore, to reduce the level of overfitting, global average pooling and early stopping were used. The experimental results show that Inception-ResNet-V3 achieved the highest classification accuracy (98.73%) and the second shortest time per epoch (58 s).
For the trained model, two serious problems need to be solved in the future (refer to Table 8):
1. It is difficult to store the entire network architecture on a device with a small computational budget.
2. The runtime on different platforms is long.
In addition, diseases in tangerines can also cause huge economic losses. Therefore, we are trying to collect image data of citrus diseases and select suitable CNN models to recognize them.

Figure 1. The sampling pictures for each kind of pest.

Figure 2. ALB images with different viewing angles.

Figure 3. The images of citrus pests whose cryptic colorings are the same as those of their surroundings: (a) is GL, (b) is SGSB, and (c) is larvae of CS.

Figure 4. Original image and generated images by DA.

Figure 7. An example of a module in Inception-ResNet-V3: A and B are groups of feature maps.

Figure 8. (a) is regular convolution, (b) is separable convolution. Suppose the input channels of (a) and (b) are both m and the output channels are both u. Then the number of parameters involved in the calculation of (a) is m × 3 × 3 × u, and that of (b) is m × 3 × 3 + m × 1 × 1 × u.

Figure 10. (a) VGG-16 with pre-training; (b) VGG-16 without pre-training. The red curve in (a) and (b) is the training accuracy, and the blue curve is the validation accuracy.

Figure 11. (a) The method of dropout and (b) global average pooling.
The time consumption per epoch of Xception is larger than that of ResNet-50 and Inception-ResNet-V3. The reason for this is that the identity mappings in ResNet-50 and Inception-ResNet-V3 accelerate training.

Figure 12. The training process of the four models mentioned previously: (a) ResNet-50, (b) VGG-16, (c) Inception-ResNet-V3, and (d) Xception; (e) their final training performance.

Figure 13. The misclassified samples in the test dataset.

Table 1. The related works that use CNNs as classifiers.

Table 2. The list of species and number of samples for the pest recognition.
Furthermore, in order to strengthen the robustness of pest recognition, two measures have been taken:
• Get pictures of pests at different angles
Preprints (www.preprints.org) | NOT PEER-REVIEWED | Posted: 7 November 2018 | doi:10.20944/preprints201811.0161.v1

Table 3. The setting of parameters in geometric transformations.

Table 4. The time consumption per epoch and total parameters of networks with different fully connected layers.
According to the comparison shown in Table 4, we give preference to global average pooling with early stopping. The patience of early stopping is defined as follows:

Table 5. The computational complexity and time consumption per epoch of each model.

Table 6. The test performance of each model.

Table 7. The confusion matrix for the test dataset.

Table 8. The comparison of running performance of Inception-ResNet-V3 on a computer and a smart phone.