ARTICLE | doi:10.20944/preprints202010.0526.v1
Subject: Mathematics & Computer Science, Algebra & Number Theory Keywords: audio classification; dissimilarity space; siamese network; ensemble of classifiers; pattern recognition; animal audio
Online: 26 October 2020 (13:57:01 CET)
The classifier system proposed in this work combines the dissimilarity spaces produced by a set of Siamese neural networks (SNNs) built on four different backbones with different clustering techniques for training SVMs for automated animal audio classification. The system is evaluated on two animal audio datasets: one of cat vocalizations and one of bird vocalizations. Different clustering methods reduce the spectrograms in the dataset to a set of centroids that generate (in both a supervised and an unsupervised fashion) the dissimilarity space through the Siamese networks. In addition to feeding the SNNs with spectrograms, additional experiments process the spectrograms using the Heterogeneous Auto-Similarities of Characteristics (HASC) descriptor. Once the dissimilarity spaces are computed, a vector-space representation of each pattern is generated and used to train a Support Vector Machine (SVM) that classifies a spectrogram by its dissimilarity vector. Results demonstrate that the proposed approach performs competitively (without ad-hoc optimization of the clustering methods) on both animal vocalization datasets. To further demonstrate the power of the proposed system, the best stand-alone approach is also evaluated on the challenging Environmental Sound Classification dataset (ESC-50). The MATLAB code used in this study is available at https://github.com/LorisNanni.
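As a rough sketch of this pipeline (in Python rather than the authors' MATLAB, with an assumed trained Siamese embedding `embed` and illustrative names throughout), each spectrogram can be represented by its distances to clustering centroids in the embedding space, and an SVM trained on those dissimilarity vectors:

```python
# Minimal sketch of dissimilarity-space classification; `embed` (a trained
# Siamese backbone), the cluster count, and variable names are assumptions.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import SVC

def dissimilarity_vectors(features, centroids, embed):
    """Represent each spectrogram by its distances to the cluster
    centroids, measured in the Siamese embedding space."""
    emb_x = embed(features)          # (n_samples, d)
    emb_c = embed(centroids)         # (k, d)
    # Euclidean distance from every sample to every centroid
    return np.linalg.norm(emb_x[:, None, :] - emb_c[None, :, :], axis=-1)

# X_train: flattened spectrograms, y_train: class labels (assumed given)
k = 30
centroids = KMeans(n_clusters=k).fit(X_train).cluster_centers_
D_train = dissimilarity_vectors(X_train, centroids, embed)
svm = SVC(kernel="rbf").fit(D_train, y_train)
# Test time: svm.predict(dissimilarity_vectors(X_test, centroids, embed))
```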
ARTICLE | doi:10.20944/preprints201804.0258.v2
Subject: Mathematics & Computer Science, Information Technology & Data Management Keywords: audio classification; multi-resolution analysis; LSTM; auto-ml
Online: 19 July 2018 (05:53:20 CEST)
We describe a multi-resolution approach for audio classification and illustrate its application to the open dataset for environmental sound classification. The proposed approach uses a multi-resolution ensemble built from targeted feature extraction of the approximation (coarse-scale) and detail (fine-scale) portions of the signal under the action of multiple transforms. This is paired with an automatic machine learning engine for algorithm and parameter selection and with an LSTM, which maps sequences of features to a predicted class-membership probability distribution. Initial results show an improvement in multi-class classification accuracy.
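A minimal sketch of the multi-resolution idea, assuming a discrete wavelet transform as the multi-scale decomposition and per-band energy summaries; the wavelet, frame length, and network sizes are illustrative assumptions, not the paper's configuration:

```python
# Coarse (approximation) and fine (detail) wavelet features per frame,
# fed as a sequence to an LSTM classifier.
import numpy as np
import pywt
import tensorflow as tf

def frame_features(signal, frame_len=1024, wavelet="db4", level=3):
    feats = []
    for start in range(0, len(signal) - frame_len, frame_len):
        frame = signal[start:start + frame_len]
        coeffs = pywt.wavedec(frame, wavelet, level=level)
        # Summarize approximation (coeffs[0]) and detail bands by energy
        feats.append([np.sum(c ** 2) for c in coeffs])
    return np.array(feats)  # (n_frames, level + 1)

model = tf.keras.Sequential([
    tf.keras.layers.LSTM(64, input_shape=(None, 4)),  # level + 1 = 4 bands
    tf.keras.layers.Dense(50, activation="softmax"),  # e.g. 50 ESC classes
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```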
ARTICLE | doi:10.20944/preprints202108.0277.v1
Subject: Mathematics & Computer Science, Artificial Intelligence & Robotics Keywords: Bioacoustics; Machine Hearing; Bird sound recognition; Artificial Neural Networks; Audio Signal Processing
Online: 12 August 2021 (13:34:50 CEST)
The automatic classification of bird sounds is an ongoing research topic, and several results have been reported for the classification of selected bird species. In this contribution we use an artificial neural network fed with pre-computed sound features to study the robustness of bird sound classification. We investigate in detail if and how classification results depend on the number of species and on the selection of species in the subsets presented to the classifier. In more detail, a bag-of-birds approach is employed to randomly create balanced subsets of sounds from different species for repeated classification runs. The number of species present in each subset is varied between 10 and 300, randomly drawing sounds of species from a dataset of 659 bird species taken from the Xeno-Canto database. We observe that the shallow artificial neural network trained on pre-computed sound features is able to classify the bird sounds relatively well. The classification performance is evaluated using several common measures such as precision, recall, accuracy, mean average precision, and area under the receiver operating characteristic curve. All of these measures indicate a decrease in classification success as the number of species present in the subsets is increased. We analyze this dependence in detail and compare the computed results to an analytic explanation that assumes dependencies for an idealized perfect classifier. Moreover, we observe that the classification performance depends on the individual composition of the subset and varies across 20 randomly drawn subsets.
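A hedged sketch of the bag-of-birds protocol, assuming pre-computed features X with species labels y; the network size and split ratio are illustrative assumptions:

```python
# Repeatedly draw balanced random species subsets and train a shallow
# network on pre-computed features, measuring accuracy per subset size.
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

def run_bag_of_birds(X, y, all_species, n_species, n_runs=20):
    accuracies = []
    for _ in range(n_runs):
        subset = rng.choice(all_species, size=n_species, replace=False)
        mask = np.isin(y, subset)
        X_tr, X_te, y_tr, y_te = train_test_split(
            X[mask], y[mask], test_size=0.3, stratify=y[mask])
        clf = MLPClassifier(hidden_layer_sizes=(128,), max_iter=500)
        clf.fit(X_tr, y_tr)
        accuracies.append(clf.score(X_te, y_te))
    return np.mean(accuracies), np.std(accuracies)

# Accuracy is expected to drop as n_species grows from 10 toward 300,
# and to vary with the particular species drawn into each subset.
```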
ARTICLE | doi:10.20944/preprints201807.0185.v1
Subject: Engineering, Electrical & Electronic Engineering Keywords: deep learning; multi-task learning; audio event detection; audio tagging; weak learning; low-resource data
Online: 10 July 2018 (16:05:15 CEST)
In training a deep learning system to perform audio transcription, two practical problems may arise. Firstly, most datasets are weakly labelled, having only a list of the events present in each recording without any temporal information for training. Secondly, deep neural networks need a very large amount of labelled training data to achieve good performance, yet in practice it is difficult to collect enough samples for most classes of interest. In this paper, we propose factorising the final task of audio transcription into multiple intermediate tasks in order to improve training performance when dealing with this kind of low-resource dataset. We evaluate three data-efficient approaches to training a stacked convolutional and recurrent neural network for the intermediate tasks. Our results show that the different training methods have different advantages and disadvantages.
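One plausible realisation of such a factorisation, sketched in Keras under assumed layer sizes: a shared convolutional-recurrent trunk with a frame-level head (temporal activity) and a clip-level head (weak tags) trained jointly:

```python
# Multi-task stacked CRNN sketch; sizes and pooling are assumptions,
# not the paper's exact architecture.
import tensorflow as tf
from tensorflow.keras import layers

n_classes, n_frames, n_mels = 10, 500, 64
spec_in = layers.Input(shape=(n_frames, n_mels, 1))
x = layers.Conv2D(32, 3, padding="same", activation="relu")(spec_in)
x = layers.MaxPooling2D((1, 4))(x)                 # pool frequency only
x = layers.Reshape((n_frames, (n_mels // 4) * 32))(x)
x = layers.Bidirectional(layers.GRU(64, return_sequences=True))(x)

# Frame-level head: per-frame event activity (an intermediate task)
frames = layers.TimeDistributed(
    layers.Dense(n_classes, activation="sigmoid"), name="frames")(x)
# Clip-level head: aggregate frame activity into weak clip tags
tags = layers.GlobalAveragePooling1D(name="tags")(frames)

model = tf.keras.Model(spec_in, [frames, tags])
model.compile(optimizer="adam",
              loss={"frames": "binary_crossentropy",
                    "tags": "binary_crossentropy"})
```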
ARTICLE | doi:10.20944/preprints201811.0509.v1
Subject: Mathematics & Computer Science, Artificial Intelligence & Robotics Keywords: audio recognition; context-aware; deep learning
Online: 20 November 2018 (16:32:16 CET)
This paper proposes a method for recognizing audio events in urban environments that combines handcrafted audio features with a deep learning architecture (Convolutional Neural Networks, CNNs) trained to distinguish between different audio context classes. The core idea is to use the CNN as a method to extract context-aware deep audio features that can offer supplementary feature representations to any soundscape analysis classification task. Towards this end, the CNN is trained on a database of audio samples that are annotated in terms of their respective "scene" (e.g. train, street, park), and is then combined with handcrafted audio features in an early fusion approach in order to recognize the audio event of an unknown audio recording. Detailed experimentation shows that the proposed context-aware deep learning scheme, when combined with typical handcrafted features, leads to a significant boost in classification accuracy. The main contribution of this work is the demonstration that transferring audio contextual knowledge using CNNs as feature extractors can significantly improve the performance of the audio classifier, without the need to train a CNN for the target task (a rather demanding process that requires huge datasets and complex data augmentation procedures).
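A minimal sketch of the early-fusion step, assuming a scene-trained CNN exposing its penultimate-layer activations (`scene_cnn.penultimate`) and a `handcrafted_features` extractor; both names are hypothetical:

```python
# Concatenate context-aware deep features with handcrafted features,
# then train a standard classifier on the fused representation.
import numpy as np
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

def fused_representation(spectrogram, waveform):
    deep = scene_cnn.penultimate(spectrogram)  # context-aware deep features
    hand = handcrafted_features(waveform)      # e.g. MFCCs, ZCR, energy
    return np.concatenate([deep, hand])

X = np.stack([fused_representation(s, w) for s, w in zip(specs, waves)])
clf = make_pipeline(StandardScaler(), SVC()).fit(X, y)
```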
ARTICLE | doi:10.20944/preprints202108.0185.v1
Subject: Mathematics & Computer Science, Other Keywords: AES; Audio analysis; Authenticated encryption; Cryptography; Python
Online: 9 August 2021 (09:31:46 CEST)
The focus of this research is to analyze the results of encrypting audio using various authenticated encryption algorithms implemented in the Python cryptography library, for ensuring the authenticity and confidentiality of the original contents. The Advanced Encryption Standard (AES) is used as the underlying cryptographic primitive in conjunction with various modes, including Galois/Counter Mode (GCM), Counter with Cipher Block Chaining Message Authentication Code (CCM), and Cipher Block Chaining (CBC) with Keyed-Hashing, for encrypting a relatively small audio file. The encrypted audio shows similar variance when encrypting with AES-GCM and AES-CCM. With AES-CBC with Keyed-Hashing there is a noticeable reduction in the variance of the encodings and an increase in the time it takes to encrypt and decrypt the same audio file; in addition, the audio encrypted with this mode spans a longer duration. As a result, AES should be used with either GCM or CCM for efficient and reliable integration of authenticated encryption within a workflow.
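For concreteness, a minimal example of authenticated encryption of an audio file with AES-GCM via the Python cryptography library; the file name and key size are assumptions:

```python
# Authenticated encryption of an audio file with AES-GCM.
import os
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

key = AESGCM.generate_key(bit_length=256)
nonce = os.urandom(12)               # 96-bit nonce recommended for GCM
aesgcm = AESGCM(key)

with open("clip.wav", "rb") as f:    # illustrative file name
    audio = f.read()

ciphertext = aesgcm.encrypt(nonce, audio, None)  # auth tag is appended
recovered = aesgcm.decrypt(nonce, ciphertext, None)
assert recovered == audio            # decryption verifies authenticity
```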
ARTICLE | doi:10.20944/preprints202104.0766.v1
Subject: Mathematics & Computer Science, Algebra & Number Theory Keywords: PAM; Passive acoustic monitoring; audio classification; texture classification; PAM-filter; experimental protocols for audio classification; statistical tests
Online: 29 April 2021 (07:55:09 CEST)
Passive acoustic monitoring (PAM) is a non-invasive technique for monitoring wildlife. Acoustic surveillance is preferable in some situations, such as with marine mammals, which spend most of their time underwater, making it hard to obtain images of them. Machine learning is very useful for PAM, for example to identify species from audio recordings, but some care must be taken when evaluating the capability of a system. We define PAM-filters as the creation of experimental protocols according to the dates and locations of the recordings, aiming to avoid the use of the same individuals, noise, and recording devices in both the training and test sets. A random division of a database yields accuracies much higher than those obtained with protocols generated with a PAM-filter. Although we work with animal vocalizations, our method converts the audio into spectrogram images and then describes the images using texture features; these are well-known techniques for audio classification and have already been used for species classification. We also perform statistical tests to demonstrate the significant difference between accuracies generated with and without PAM-filters across several well-known classifiers. The configuration of our experimental protocols and the database have been made available online.
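A sketch of how a PAM-filter-style split might be implemented, assuming per-recording metadata with site and date columns; all column names are illustrative assumptions:

```python
# Split recordings by site and date so the same individuals, devices,
# and background noise never appear in both training and test sets.
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

meta = pd.read_csv("recordings.csv")   # one row per recording (assumed)
# Group key combining location and recording session (date)
groups = meta["site"].astype(str) + "_" + meta["date"].astype(str)

splitter = GroupShuffleSplit(n_splits=1, test_size=0.3, random_state=0)
train_idx, test_idx = next(splitter.split(meta, meta["species"], groups))
# Unlike a random split, no group (site + date) crosses the partition,
# so accuracy estimates are not inflated by leaked recording conditions.
```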
ARTICLE | doi:10.20944/preprints202007.0209.v1
Subject: Engineering, Other Keywords: Deep learning; Head Related Transfer Function (HRTF); Restoration; Ambisonics; Spatial Audio; Spherical harmonic; Audio signal processing; Denoising; Auto-Encoder; Neural Network
Online: 10 July 2020 (08:58:11 CEST)
Spherical harmonic (SH) interpolation is a commonly used method to spatially up-sample sparse Head Related Transfer Function (HRTF) datasets to denser HRTF datasets. However, depending on the number of sparse HRTF measurements and the SH order, this process can introduce distortions in the high-frequency representation of the HRTFs. This paper investigates whether it is possible to restore some of the distorted high-frequency HRTF components using machine learning algorithms. A combination of Convolutional Auto-Encoder (CAE) and Denoising Auto-Encoder (DAE) models is proposed to restore the high-frequency distortion in SH-interpolated HRTFs. Results are evaluated using both Perceptual Spectral Difference (PSD) and localisation prediction models, both of which demonstrate significant improvement after the restoration process.
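An illustrative sketch of a convolutional auto-encoder for this kind of restoration, trained to map SH-interpolated (distorted) HRTF spectra to their measured references; the architecture and spectrum length are assumptions, not the paper's model:

```python
# Denoising-style conv auto-encoder: distorted spectra in, clean out.
import tensorflow as tf
from tensorflow.keras import layers

n_bins = 256                                  # frequency bins per HRTF
inp = layers.Input(shape=(n_bins, 1))
x = layers.Conv1D(32, 9, padding="same", activation="relu")(inp)
x = layers.MaxPooling1D(2)(x)
x = layers.Conv1D(64, 9, padding="same", activation="relu")(x)
x = layers.UpSampling1D(2)(x)
out = layers.Conv1D(1, 9, padding="same")(x)  # restored spectrum

cae = tf.keras.Model(inp, out)
cae.compile(optimizer="adam", loss="mse")
# cae.fit(distorted_hrtfs, reference_hrtfs, ...) pairs distorted inputs
# with clean targets, the same objective family as a denoising AE.
```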
ARTICLE | doi:10.20944/preprints201712.0001.v3
Subject: Mathematics & Computer Science, Probability And Statistics Keywords: signal processing; bayesian methods; subaquatic audio; hydrophone; unsupervised learning
Online: 8 January 2018 (18:29:11 CET)
The problem of event detection in general noisy signals arises in many applications; usually, either a functional form for the event is available, or a previously annotated sample with instances of the event can be used to train a classification algorithm. There are situations, however, where neither functional forms nor annotated samples are available; it is then necessary to apply other strategies to separate and characterize events. In this work, we analyze 15-minute-long samples of an acoustic signal and are interested in separating sections, or segments, of the signal that are likely to contain significant events. For that, we apply a sequential algorithm whose only assumption is that an event alters the energy of the signal. The algorithm is entirely based on Bayesian methods.
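A deliberately simplified sketch of the underlying assumption (an event alters the signal's energy): flag frames whose log-energy is improbable under a baseline noise estimate. The paper's sequential Bayesian algorithm is more elaborate; the frame length and threshold here are illustrative assumptions:

```python
# Energy-based candidate segmentation for unannotated noisy signals.
import numpy as np

def candidate_segments(signal, frame_len=4096, z=3.0):
    n = len(signal) // frame_len
    energy = np.array([
        np.sum(signal[i * frame_len:(i + 1) * frame_len] ** 2)
        for i in range(n)
    ])
    log_e = np.log(energy + 1e-12)
    mu, sigma = np.median(log_e), np.std(log_e)  # robust baseline estimate
    return np.where(log_e > mu + z * sigma)[0]   # frames likely holding events
```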
ARTICLE | doi:10.20944/preprints202203.0252.v1
Subject: Mathematics & Computer Science, Other Keywords: Audio-Visual Technologies; Blended Learning; Pedagogy; Virtual Learning Environments; Virtual Reality
Online: 17 March 2022 (11:05:27 CET)
The Covid-19 pandemic caused a shift in teaching practice towards blended learning for many Higher Education institutions. This led to the rapid adoption of certain digital technologies within existing teaching structures as a means to meet student access needs and facilitate learning. Integration of these technologies caused numerous challenges for practitioners and often produced mixed results. This paper attempts to summarise and extend pre-Covid pedagogical research on leveraging digital immersive technologies for blended teaching in the post-pandemic era. Focus is given to the evolution of Virtual Learning Environments through elements of immersive audio-visual technologies, which are shown to be effective when coupled in a blended approach. It is both a review of these methodologies and a case study of the I-Ulysses Virtual Learning Environment, used as a point of comparison for evaluating the review.
ARTICLE | doi:10.20944/preprints202010.0343.v2
Subject: Engineering, Other Keywords: deep learning; sound event detection; convolutional neural networks; audio processing; embedded systems
Online: 9 November 2020 (14:21:39 CET)
For the Remotely Piloted Aircraft Systems (RPAS) market to continue its current growth rate, cost-effective "Detect and Avoid" systems that enable safe beyond visual line of sight (BVLOS) operations are critical. We propose an audio-based "Detect and Avoid" system, composed of microphones and an embedded computer, which performs real-time inferences using a sound event detection (SED) deep learning model. Two state-of-the-art SED models, YAMNet and VGGish, are fine-tuned using our dataset of aircraft sounds and their performances are compared for a wide range of configurations. YAMNet, whose MobileNet architecture is designed for embedded applications, outperformed VGGish both in terms of aircraft detection and computational performance. YAMNet's optimal configuration, with > 70% true positive rate and precision, results from combining data augmentation and undersampling with the highest available inference frequency (i.e. 10 Hz). While our proposed "Detect and Avoid" system already allows the detection of small aircraft from sound in real time, additional testing using multiple aircraft types is required. Finally, a larger training dataset, sensor fusion, or remote computations on cloud-based services could further improve system performance.
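A sketch of the standard YAMNet transfer-learning setup this describes: load the model from TensorFlow Hub, pool its per-frame embeddings, and train a small binary head; the head architecture and pooling are assumptions, not the paper's exact configuration:

```python
# YAMNet embeddings + small classifier head for aircraft detection.
import tensorflow as tf
import tensorflow_hub as hub

yamnet = hub.load("https://tfhub.dev/google/yamnet/1")

def clip_embedding(waveform_16khz):
    # YAMNet returns (scores, embeddings, spectrogram); average the
    # per-frame 1024-d embeddings into one clip-level vector.
    _, embeddings, _ = yamnet(waveform_16khz)
    return tf.reduce_mean(embeddings, axis=0)

head = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation="relu", input_shape=(1024,)),
    tf.keras.layers.Dense(1, activation="sigmoid"),  # P(aircraft present)
])
head.compile(optimizer="adam", loss="binary_crossentropy")
# head.fit(embeddings_of_labelled_clips, labels, ...)
```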
ARTICLE | doi:10.20944/preprints201812.0086.v4
Subject: Mathematics & Computer Science, Artificial Intelligence & Robotics Keywords: multi-modal information fusion; video skimming; audio and text classification; keyframe extraction
Online: 5 August 2019 (03:48:49 CEST)
In this paper, we propose a novel approach to video skimming that fuses video temporal information with keyword representations extracted from multi-modal video information, including audio, text, and visual indices. In addition, we introduce brand-safe filtering and sentiment analysis in order to retain only user-friendly content in the video skim. In experiments using videos from the YouTube-8M dataset, we show that the proposed approach conserves the semantic content of the video far better than approaches that use only partial information from the video.
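One hedged sketch of a single fusion step consistent with this description: score segments by keyword overlap with the video-level keywords and drop non-user-friendly segments; all names and the scoring rule are illustrative assumptions:

```python
# Keyword-overlap scoring plus sentiment-based brand-safe filtering.
from typing import List

def score_segment(segment_keywords: List[str],
                  video_keywords: List[str]) -> float:
    if not segment_keywords:
        return 0.0
    hits = len(set(segment_keywords) & set(video_keywords))
    return hits / len(segment_keywords)

def build_skim(segments, video_keywords, sentiment, budget=5):
    # Keep only user-friendly segments, then take the best-scoring ones.
    safe = [s for s in segments if sentiment(s["text"]) >= 0]
    ranked = sorted(safe, reverse=True,
                    key=lambda s: score_segment(s["keywords"],
                                                video_keywords))
    return ranked[:budget]
```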
ARTICLE | doi:10.20944/preprints201802.0076.v1
Subject: Engineering, Civil Engineering Keywords: non-destructive evaluation; hammering inspection; audio signal processing; machine learning; online learning
Online: 9 February 2018 (06:55:24 CET)
Developing efficient Artificial Intelligence (AI)-enabled systems to substitute for the human role in non-destructive testing is an emerging topic of considerable interest. In this study, we propose a novel impact-echo analysis system using online machine learning, which aims to achieve near-human performance in the assessment of concrete structures. Current computerized impact-echo systems commonly employ lab-scale data to validate the models. In practice, however, the echo patterns can be far more complicated due to the varying geometric shapes and materials of structures. To deal with a large variety of unseen data, we propose a sequential treatment for echo characterization. More specifically, the proposed system can adaptively update itself to approach human performance in impact-echo data interpretation. To this end, a two-stage framework is introduced, comprising echo feature extraction and a model updating scheme. Various state-of-the-art online learning algorithms are reviewed and evaluated for the task. For experimental validation, we collected 10,940 echo instances from multiple inspection sites, each annotated by human experts with healthy/defective condition labels. The results demonstrate that the proposed scheme achieves favorable echo pattern classification accuracy with high efficiency and low computational load.
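A minimal sketch of the sequential-update idea using scikit-learn's `partial_fit` interface; the model choice and feature extraction are assumptions, not the paper's algorithm:

```python
# Online classifier that refines itself as experts annotate new echoes.
import numpy as np
from sklearn.linear_model import SGDClassifier

clf = SGDClassifier(loss="log_loss")   # logistic-regression-style model
classes = np.array([0, 1])             # healthy / defective

def on_new_annotation(echo_features, label):
    """Update the model with one expert-labelled echo, without
    retraining on the full archive."""
    clf.partial_fit(echo_features.reshape(1, -1), [label],
                    classes=classes)

# Predictions improve incrementally as inspections from new sites
# arrive, approaching the human annotators' decisions over time.
```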
ARTICLE | doi:10.20944/preprints202008.0707.v1
Subject: Behavioral Sciences, Cognitive & Experimental Psychology Keywords: Anxiety; Audio-Visual stimulation; COVID-19; Environmental enrichment; Forest environments; Forest therapy; Lockdown; Mental health; Stress; Quarantine
Online: 31 August 2020 (05:20:50 CEST)
The prolonged lockdown imposed to contain the COVID-19 pandemic prevented many people from direct contact with nature and greenspaces, raising alarms about a possible worsening of mental health. This study investigates the effectiveness of a simple and affordable remedy for improving psychological well-being, based on the audio-visual stimuli of a short computer video showing forest environments, with an urban video as a control. Randomly selected participants were assigned the forest or the urban video, to watch and listen to early in the morning, and completed questionnaires: in particular, the State-Trait Anxiety Inventory (STAI) Form Y, collected at baseline and at the end of the study, and Part II of the Sheehan Patient Rated Anxiety Scale (SPRAS), collected every day immediately before and after watching the video. The virtual exposure to forest environments proved effective in reducing perceived anxiety levels in people confined by lockdown to limited spaces and environmental deprivation. Although significant, the effects were observed only in the short term, highlighting a limitation of virtual experiences. The reported effects might also serve as a benchmark for disentangling the determinants of the health effects of real forest experiences, for example the inhalation of biogenic volatile organic compounds (BVOCs).