Pest Animal Detection and Habitat Identification in Low-Resolution Airborne Thermal Imagery

Abstract: Invasive species are significant threats to global agriculture and food security, being among the major causes of crop loss. An operative biosecurity policy requires full automation of the detection and habitat identification of potential pests and pathogens. Thermal imaging cameras mounted on Unmanned Aerial Vehicles (UAVs) can observe and detect pest animals and their habitats, and estimate their population size, around the clock. However, their effectiveness is limited by the manual detection of cryptic species in hours of captured flight video, failure to disclose habitats and the requirement of expensive high-resolution cameras. Therefore, the cost and efficiency trade-off often restricts the use of these systems. In this paper, we present an invasive animal species detection system that exploits the cost-effectiveness of consumer-level cameras while harnessing the power of transfer learning and an optimised small object detection algorithm. Our proposed optimised object detection algorithm, named Optimised YOLO (OYOLO), enhances YOLO (You Only Look Once) [27] by improving its training and structure for the remote detection of elusive targets. Our system, trained on the massive data collected from New South Wales and Western Australia, can detect invasive species (rabbits, kangaroos and pigs) in real time with a higher probability of detection (85–100%) compared to manual detection. This work will enhance the visual analysis of pest species, performs well on low-, medium- and high-resolution thermal imagery, and is equally accessible to all stakeholders and end-users in Australia via a public cloud.

Figure 1. Exemplar shots of invasive animals (rabbits, pigs and kangaroos) captured in airborne thermal imagery, showing the visual challenges for object recognition from a distance. These videos were captured using helicopter-based surveys of vast farmlands in the states of Western Australia (WA) and New South Wales (NSW), Australia.
Recent advancements in drone and imaging technologies have enabled non-invasive monitoring of pest animals [4,10,18,32]. However, manual detection of pest animals, habitat identification and estimation of pest population size is cumbersome, as it requires frame-by-frame analysis of hours of video data. Some automated approaches have been proposed in recent years [2,6,19,25,31]. However, they often lack usability due to low accuracy, ineffectiveness against occlusion, limitations of the visible spectrum and low detection speed. There is a need for an intelligent, real-time, fully automated, around-the-clock monitoring system that covers not only animal detection but also habitat identification. Thermal imagery can provide crucial information about animal habitats, which appear more active and warm in thermal heat maps.

However, such approaches work better for large mammals, as the drastic temperature gradient between mammals and their cold background distinguishes their thermal signatures, facilitating their detection and count. In the pest animal remote sensing scenario, signatures are both small and elusive, which decreases the accuracy of computer vision algorithms. Figure 1 shows some dataset shots of invasive animals (rabbits, pigs and kangaroos) captured in airborne thermal imagery that illustrate the visual challenges for object recognition from a distance. As object size becomes very small, even manual identification and tagging of correct thermal signatures is problematic. Similarly, due to striding and pooling, small-scale objects disappear in the deep convolution layers.
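As a back-of-the-envelope illustration of that last point (a sketch with a hypothetical 8-pixel target, not a measurement from our dataset): each stride-2 convolution or pooling stage halves the spatial extent of a thermal signature on the feature map, so a small animal quickly shrinks below a single cell.

```python
def downsampled_extent(object_px: int, reductions: int) -> float:
    """Approximate extent (in feature-map cells) of an object that
    spans `object_px` input pixels after `reductions` stride-2 stages."""
    return object_px / (2 ** reductions)

# A hypothetical rabbit spanning ~8 pixels in a low-resolution
# thermal frame occupies exactly one cell after three stride-2
# stages and only a quarter of a cell after five, so the deep
# layers carry almost no signal for it.
print(downsampled_extent(8, 3))  # 1.0
print(downsampled_extent(8, 5))  # 0.25
```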

Therefore, the removal of pooling and striding can improve YOLO's ability to detect smaller objects. Meanwhile, YOLOv4 [3] presents new findings. However, its scope is to increase overall speed and accuracy on the MS COCO dataset using a different bag of features, not to improve small object detection in thermal imaging.

In this work, we address the above weaknesses by introducing an optimised version of YOLO for small object detection, which we name Optimised YOLO. It enables us to propose real-time pest animal detection with improved accuracy on imagery captured from consumer-level thermal cameras mounted on an unmanned aerial vehicle. We claim the following contributions in this paper:

• We introduce real-time pest animal detection with improved accuracy and speed using a deep learning-based small object detection approach.

• We optimise traditional YOLO through improved model training and structure optimisation for detecting smaller objects.
• We validate our approach on an extensive thermal video dataset collected by the Department of Primary Industries, NSW, Australia. This dataset is very challenging due to its low resolution, the small size of pests such as rabbits, and the similar, elusive thermal signatures of pigs and kangaroos.

We have organised the paper as follows: Section 2 describes the related work. Section 3 provides a detailed description of our methodology. Section 4 illustrates our findings with the help of experimental results. Section 5 presents the discussion and future work directions, followed by concluding remarks and references.

One of the traditional approaches to animal detection and activity monitoring is the use of camera traps.

They have been used to investigate 13 broad areas of wildlife monitoring in Australia over the last twenty-four years [24]. However, the field of view and coverage of camera traps is limited, and they have not proved to be a reliable tool for monitoring cryptic pest animals and their activities [33]. An alternative is airborne monitoring through unmanned aerial vehicles (UAVs) and helicopters [1,4,9,10,18,32].

The recent revolution in the field of deep learning [30] has enabled scientists to automate various vision-based problems. Early uses of deep learning for automated animal classification required substantial pre-processing and offered limited recognition accuracy [5,7]. Later approaches adopted modern object detectors [23] and YOLO [27]. However, all these approaches use high-resolution data for training, and object scales are generally larger and clearer. Therefore, their performance decreases for small animal detection from UAVs, especially in low-resolution thermal video sequences, resulting in low accuracy, slow detection or overfitting.

Our work is related to YOLO [27] and its improved versions [3,28]. Some recent work on small object detection from a distance is also related to ours. An improved version of YOLO for UAVs, called UAV-YOLO

[22], attempted to improve small object detection through YOLO by including a few more convolution layers.

In this paper, we use a Convolutional Neural Network (CNN)-based object detection method for pest animal detection in thermal imaging.

Data collection. In order to perform this study, we first established the Australian pest animal database, which was collected by two different teams. One team, from the Department of Primary Industries, NSW, was responsible for the collection of rabbit movement and warren footage using helicopter-based surveys. The other team, at the Department of Primary Industries and Regional Development, Western

Australia was responsible for data collection related to wild pigs and kangaroos using drones.


From the thermal footage, we extracted frames to prepare the training dataset. As the video frame rate was 60 fps, we had a huge number of frames. However, the majority of frames showed no evidence of the presence of any invasive animals, so we used only those frames that confirmed the presence of the targeted pest animals.

We manually labelled the dataset using the Python-based library Labelme, a graphical image annotation tool inspired by http://labelme.csail.mit.edu. We also observed that target objects were very small in some of the frames that had been collected from a high altitude. Similarly, some of the targets were obscure, and even manual classification of their thermal signatures was challenging. We had to magnify such frames/images to label them accurately. Some sample shots of the manual annotation of our thermal dataset are shown in Figure 2.
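To illustrate the labelling step, the sketch below converts a Labelme-style rectangle annotation into the normalised `class cx cy w h` line format commonly used for YOLO training. The JSON structure follows Labelme's rectangle output; the class ordering and the example coordinates are hypothetical, not values from our dataset.

```python
CLASSES = ["rabbit", "pig", "kangaroo"]  # hypothetical class ordering

def labelme_to_yolo(ann: dict, img_w: int, img_h: int) -> list:
    """Convert Labelme rectangle shapes to YOLO label lines:
    'class_id cx cy w h', with coordinates normalised to [0, 1]."""
    lines = []
    for shape in ann["shapes"]:
        (x1, y1), (x2, y2) = shape["points"]  # two opposite corners
        xmin, xmax = sorted((x1, x2))
        ymin, ymax = sorted((y1, y2))
        cx = (xmin + xmax) / 2 / img_w
        cy = (ymin + ymax) / 2 / img_h
        w = (xmax - xmin) / img_w
        h = (ymax - ymin) / img_h
        cid = CLASSES.index(shape["label"])
        lines.append(f"{cid} {cx:.6f} {cy:.6f} {w:.6f} {h:.6f}")
    return lines

# A hypothetical 40x40-pixel rabbit box in a 640x512 thermal frame.
ann = {"shapes": [{"label": "rabbit",
                   "points": [[100, 50], [140, 90]]}]}
print(labelme_to_yolo(ann, img_w=640, img_h=512))
```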
An architectural diagram of YOLOv3 is shown in Figure 3. A "ResUnit" includes two "DBL" structures followed by one "add" layer, yielding a residual-like unit; a "ResBlock" combines several "ResUnit"s with one zero-padding layer and a "DBL" structure to generate a residual-like block. "ResBlock" is the module element of Darknet-53. Figure 3(A), on the left, shows F2, which is generated from F1 by a 1-dilated convolution; each element in F2 has a receptive field of 3 × 3. Figure 3(B), on the right, shows F3, which is generated from F2 by a 2-dilated convolution; each element in F3 has a receptive field of 7 × 7.
A region of the input on which a pixel value in the output depends is called the receptive field. CNNs progressively reduce resolution; removing subsampling can help, but it shrinks the receptive field. Dilated convolution resolves this tension. Let $F : \mathbb{Z}^2 \to \mathbb{R}$ be a discrete function, let $\Phi_n = [-n, n]^2 \cap \mathbb{Z}^2$, and let $f : \Phi_n \to \mathbb{R}$ be a discrete filter. The convolution operator $*$ can be defined as:

$$(F * f)(\mathbf{p}) = \sum_{\mathbf{s} + \mathbf{t} = \mathbf{p}} F(\mathbf{s}) \, f(\mathbf{t})$$

Let $d$ be a dilation factor and let $*_d$ be defined as:

$$(F *_d f)(\mathbf{p}) = \sum_{\mathbf{s} + d\mathbf{t} = \mathbf{p}} F(\mathbf{s}) \, f(\mathbf{t})$$

where $*_d$ is a dilated convolution, or a $d$-dilated convolution. The traditional CNN convolution is simply the 1-dilated convolution. Dilated convolution supports an exponential expansion of the receptive field without loss of resolution; Figure 3 illustrates the outcome. Figure 3(A) shows F2, which is generated from F1 by a 1-dilated convolution; each element in F2 has a receptive field of 3 × 3. Figure 3(B) shows F3, which is generated from F2 by a 2-dilated convolution; each element in F3 has a receptive field of 7 × 7.
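The d-dilated convolution defined above can be written out directly. The sketch below is a minimal pure-Python implementation (computed as cross-correlation without kernel flipping, the usual CNN convention), reproducing the F1 → F2 → F3 example: an impulse anywhere in a 7 × 7 input influences the single output element after a 1-dilated and then a 2-dilated 3 × 3 convolution, i.e. a 7 × 7 receptive field.

```python
def dilated_conv2d(F, f, d=1):
    """Valid-mode 2-D d-dilated convolution (stride 1) over 2-D
    lists of floats; cross-correlation, no kernel flip."""
    kh, kw = len(f), len(f[0])
    eff_h = d * (kh - 1) + 1  # effective (dilated) kernel extent
    eff_w = d * (kw - 1) + 1
    H, W = len(F), len(F[0])
    out = [[0.0] * (W - eff_w + 1) for _ in range(H - eff_h + 1)]
    for i in range(len(out)):
        for j in range(len(out[0])):
            out[i][j] = sum(F[i + d * u][j + d * v] * f[u][v]
                            for u in range(kh) for v in range(kw))
    return out

ones3 = [[1.0] * 3 for _ in range(3)]       # 3x3 all-ones filter
F1 = [[0.0] * 7 for _ in range(7)]
F1[3][3] = 1.0                              # single warm pixel
F2 = dilated_conv2d(F1, ones3, d=1)         # 5x5 map, RF 3x3
F3 = dilated_conv2d(F2, ones3, d=2)         # 1x1 map, RF 7x7
print(len(F2), len(F3))                     # 5 1
```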
Therefore, to increase the receptive field of YOLO to handle small objects, we integrated dilated convolutions into its architecture. For this purpose, we replaced the DDL block with a DDDL block that uses dilated convolution filtering. Finally, semantic information from three scales is concatenated to detect objects and their categories.
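The receptive-field growth that motivates this change can be checked with simple arithmetic: each stacked 3 × 3, stride-1 convolution with dilation d adds 2d to the receptive field. A small helper (illustrative; the dilation schedules below are examples, not our exact configuration):

```python
def receptive_field(dilations) -> int:
    """Receptive field (one side, in pixels) of stacked 3x3,
    stride-1 convolutions, where layer i uses dilation dilations[i].
    Each layer adds 2 * d, so rf = 1 + sum(2 * d)."""
    rf = 1
    for d in dilations:
        rf += 2 * d
    return rf

print(receptive_field([1]))        # 3  (a plain 3x3 convolution)
print(receptive_field([1, 2]))     # 7  (the F1 -> F2 -> F3 example)
print(receptive_field([1, 2, 4]))  # 15 (exponential growth)
```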

The Optimised YOLO architecture is shown in Figure 5.

In this section, we describe our image dataset, system parameters, the list of experiments and their results. We also discuss our experimental results and future work directions.

As described in the methodology, two different teams collected our thermal image dataset: one from the Department of Primary Industries, NSW, responsible for the collection of rabbit movement and warren footage using helicopter-based surveys, and one from the Department of Primary Industries and Regional Development, Western Australia, responsible for data collection related to wild pigs and kangaroos using drones.
The size of the collected video data was around 3 TB. We first tried to establish a baseline by training a YOLOv3-based detector.

There are still some weaknesses in our system that we intend to improve in our future work. Our current model is trained on only three classes of invasive pest animals: pigs, rabbits and kangaroos. We want to extend it to include several more species of pest animals. Another aspect that needs improvement is the removal of double counts, as in some instances the same animal is counted twice. As an accurate count is not claimed in this paper, we intend to develop a robust strategy to manage this problem in our future work.
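One simple strategy we might adopt for the double-count problem (an illustrative sketch, not a method validated in this paper) is to suppress a detection when it overlaps a detection from the previous frame above an Intersection-over-Union (IoU) threshold, so a slow-moving animal tracked across frames is counted only once. The box coordinates and threshold below are hypothetical.

```python
def iou(a, b) -> float:
    """IoU of two boxes given as (xmin, ymin, xmax, ymax)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter else 0.0

def count_new(prev_boxes, cur_boxes, thresh=0.5) -> int:
    """Count current-frame detections that do not overlap any
    previous-frame detection above `thresh` (likely the same
    animal seen twice across consecutive frames)."""
    return sum(1 for c in cur_boxes
               if all(iou(c, p) < thresh for p in prev_boxes))

prev = [(10, 10, 30, 30)]                       # frame t-1
cur = [(12, 11, 31, 29), (80, 80, 100, 100)]    # frame t
print(count_new(prev, cur))  # 1: first box matches frame t-1
```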