On the Importance of Passive Acoustic Monitoring Filters

Passive acoustic monitoring (PAM) is a non-invasive technique for wildlife surveillance. Acoustic surveillance is preferable in some situations, such as for marine mammals, which spend most of their time underwater, making it hard to obtain images of them. Machine learning is very useful for PAM, for example to identify species from audio recordings, but care must be taken when evaluating the capability of such a system. We define PAM-filters as the creation of experimental protocols according to the dates and locations of the recordings, aiming to avoid the presence of the same individuals, noise, and recording devices in both the training and test sets. A random division of a database yields accuracies much higher than those obtained with protocols generated with the PAM-filter. Although we work with animal vocalizations, our method converts the audio into spectrogram images, which we then describe using texture. These are well-known techniques for audio classification and have already been used for species classification. We also perform statistical tests to demonstrate the significant difference between accuracies generated with and without PAM-filters across several well-known classifiers. The configuration of our experimental protocols and the database have been made available online.


Introduction
Techniques of passive acoustic monitoring (PAM) are tools to automatically detect, localize, and monitor animals [1]. Passive refers to the fact that the system is non-invasive, as it does not interfere with the environment. It is an acoustic system because the surveillance is done through audio signals. For example, a recording device connected to the internet could acquire data from an environment and send the captured data to a classification system that identifies which species are nearby. In the case of marine animals, the use of audio data might be preferred over image data [2]. The reason is that visual survey methods for some marine animals, such as whales, may detect only a fraction of the animals present in an area. This happens because visual observers can only see them during the very short periods when they are at the surface, and also because visual surveys can be undertaken only during daylight hours and in relatively good weather.

Based on these issues, and aiming at more reliable results for PAM systems, we propose the PAM-filter, which consists of always keeping the same individual in the same set, whether training or test. In the database we use, it was possible to separate the individuals of the same class by location and date of recording, doing our best to avoid the recognition of individuals and noise. Experiments were also conducted with a randomized version of the database, and the results are markedly disparate.
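The core idea of the PAM-filter, keeping every recording session entirely on one side of the split, can be sketched with scikit-learn's GroupKFold. The metadata entries below are hypothetical stand-ins for the location and date fields of the real database:

```python
from sklearn.model_selection import GroupKFold

# Hypothetical metadata: one entry per audio cut, with the location and
# date taken from its metadata file.
cuts = [
    {"species": "A", "location": "loc1", "date": "1962-03-01"},
    {"species": "A", "location": "loc1", "date": "1962-03-01"},
    {"species": "A", "location": "loc2", "date": "1963-07-10"},
    {"species": "B", "location": "loc3", "date": "1970-01-05"},
    {"species": "B", "location": "loc3", "date": "1970-01-05"},
    {"species": "B", "location": "loc4", "date": "1971-09-20"},
]

# Group key: cuts sharing location and date are assumed to come from the
# same recording session (and possibly the same individuals and devices).
groups = [f'{c["location"]}/{c["date"]}' for c in cuts]
labels = [c["species"] for c in cuts]
X = list(range(len(cuts)))  # indices stand in for the actual features

# GroupKFold guarantees that no group is split across training and test.
gkf = GroupKFold(n_splits=2)
for train_idx, test_idx in gkf.split(X, labels, groups=groups):
    train_groups = {groups[i] for i in train_idx}
    test_groups = {groups[i] for i in test_idx}
    assert train_groups.isdisjoint(test_groups)  # no session leakage
```

A random split, by contrast, would routinely place cuts from the same session in both sets, which is exactly the leakage the PAM-filter is designed to prevent.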

In this section, we describe the database used for experimentation and the protocols developed to explore it properly. In addition, the theoretical framework is also described.

The database is composed of almost 1,600 entire tapes. Each tape comprises several minutes of recording and may contain vocalizations of several species.

Smaller cuts of these long-length audio files are available on the website, usually with vocalizations of only one species. They are divided into two sections on the website, "all cuts" and "best cuts". "Best cuts" represents high-quality, low-noise cuts. "All cuts" contains all the audio files from "best cuts" plus additional ones of lower quality and higher noise.

We chose to use the "best cuts" for the classification task, as noise reduction and segmentation are not the focus of this work. "Best cuts" contains 1,694 audio files from 32 species of marine mammals. There were 25 samples containing vocalizations of more than one species; these were removed, as we do not intend to handle multi-label classification or audio segmentation.

2 https://ocean.si.edu/ocean-life/marine-mammals/north-atlantic-right-whale
3 https://www.kaggle.com/c/the-icml-2013-whale-challenge-right-whale-redux
4 https://www.kaggle.com/c/noaa-right-whale-recognition
5 https://marinemammalscience.org/
6 https://cis.whoi.edu/science/B/whalesounds/index.cfm
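The curation step of dropping cuts that contain more than one species amounts to a simple filter over the cut list. A minimal sketch, with hypothetical file names and a hypothetical species field parsed from the metadata:

```python
# Hypothetical representation of the "best cuts" list: each cut carries
# the list of species heard in it, parsed from its metadata file.
best_cuts = [
    {"file": "cut_001.wav", "species": ["Balaenoptera acutorostrata"]},
    {"file": "cut_002.wav", "species": ["Orcinus orca"]},
    {"file": "cut_003.wav", "species": ["Orcinus orca", "Phocoena phocoena"]},
]

# Keep only single-species cuts: multi-label classification and audio
# segmentation are out of scope.
single_species = [c for c in best_cuts if len(c["species"]) == 1]

print(len(single_species))  # 2 of the 3 example cuts remain
```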

The website also provides a metadata file for each audio cut. It contains additional information, such as the date and the location of the recording. However, most metadata files do not present information for all of their fields. Several classes have most of their samples recorded in just a few locations and dates, which led us to suspect that they may contain samples of the same individual. Also, the noise pattern in such samples is homogeneous; using the metadata files, it is possible to see that the cuts are extracted from the same long-length tapes.

The first protocol is defined without any concern for the PAM-filter. We use the database randomly divided into ten-fold cross-validation; classes with fewer than ten samples were removed (Table 1).
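The random protocol can be sketched with scikit-learn: drop every class with fewer than ten samples, then split the remainder into ten stratified folds. The labels below are synthetic stand-ins for the species names:

```python
from collections import Counter
from sklearn.model_selection import StratifiedKFold

# Synthetic labels: one species name per audio cut.
labels = ["A"] * 30 + ["B"] * 12 + ["C"] * 4  # "C" has fewer than 10 samples

# Drop classes with fewer than ten samples, then split the rest randomly
# into ten stratified folds (each fold keeps the class proportions).
counts = Counter(labels)
kept = [i for i, y in enumerate(labels) if counts[y] >= 10]
y_kept = [labels[i] for i in kept]

skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
folds = list(skf.split(kept, y_kept))
print(len(folds))  # 10 folds over classes "A" and "B" only
```

Note that this split is blind to locations and dates, so cuts from the same tape can land in different folds, which is precisely what the PAM-filter protocols avoid.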

Watkins Experimental Protocol #2 (WEP#2): training/test protocol
As listed in Table 1, some species, such as Balaenoptera acutorostrata, have all their samples recorded in only two locations; increasing the number of folds to three would therefore result in a major cut of classes. We decided to remove classes which hold fewer than ten samples in either one of the folds.

Techniques operating on visual representations of audio have been developed to handle tasks such as infant cry motivation [7], music genre classification [8], and music mood classification [9]. The visual domain has also been used with animal vocalizations, in tasks such as species identification and detection [4,10].

Texture is an important visual attribute in digital images. In the case of spectrograms in particular, texture is a very prominent visual property. In this vein, the textural content of spectrograms has been used in several audio classification tasks, such as music genre classification [11], voice classification [12], bird species classification, and whale recognition [4].

In [13], the authors propose the Local Binary Pattern (LBP). The texture of an image is described by comparing each pixel with its neighbourhood: the parameter P represents the number of neighbour pixels to be taken into account, and R stands for the radius, the distance between the pixel and its neighbours. g_c and g_p denote the gray levels of the central pixel and of its neighbours, respectively.

To diversify the experiments of this work, we also executed deep learning tests. We use a pre-trained ResNet-50 (Residual Network, 50 layers) [17] fine-tuned with the training samples. As is common with convolutional neural networks, we use the spectrograms as input. The deep learning model was used here in such a way that it both provided features, in a non-handcrafted fashion, and performed the classification.

The audio files are converted to spectrogram images using the software SoX. Features extracted from the training sets were used to create a model with the library Scikit-Learn; the classes of the testing samples were then predicted by the model.
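A minimal numpy sketch of the basic LBP operator with P = 8 and R = 1 (using the plain 3x3 neighbourhood rather than circular interpolation), thresholding each neighbour g_p against the central pixel g_c and accumulating the resulting bits into a code histogram:

```python
import numpy as np

def lbp_8_1(img):
    """Basic LBP with P=8, R=1 over the 3x3 neighbourhood.

    Each neighbour g_p is compared with the central pixel g_c; the
    resulting 8 bits form a code in [0, 255] for every interior pixel.
    """
    img = np.asarray(img, dtype=np.int32)
    center = img[1:-1, 1:-1]                       # g_c for interior pixels
    # Offsets of the 8 neighbours, in a fixed clockwise order.
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    codes = np.zeros_like(center)
    for bit, (dy, dx) in enumerate(offsets):
        neighbour = img[1 + dy: img.shape[0] - 1 + dy,
                        1 + dx: img.shape[1] - 1 + dx]   # g_p, shifted view
        codes |= (neighbour >= center).astype(np.int32) << bit
    return codes

def lbp_histogram(img):
    """256-bin normalized histogram of LBP codes: the texture descriptor."""
    codes = lbp_8_1(img)
    hist = np.bincount(codes.ravel(), minlength=256).astype(float)
    return hist / hist.sum()
```

The histogram produced by `lbp_histogram` is the kind of fixed-length texture vector that can then be fed to the Scikit-Learn classifiers; in practice a library implementation (e.g. scikit-image's `local_binary_pattern`) would typically be used instead of this hand-rolled version.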

The experiments with deep learning are slightly different; a specific representation of them is presented in Figure 5. A pre-trained ResNet-50 was fine-tuned with the spectrograms of the training samples (the same as those generated in Figure 4).

Unfortunately, wildlife databases with information such as the dates, locations, individuals, and devices used in the recordings are not easily found. But our results suggest that such information must be used to create appropriate experimental protocols.