Preprint Article, Version 1 (not peer-reviewed), preserved in Portico.

Characterization of Deep-Learning-Based Speech Enhancement Techniques in Online Audio Processing Applications

Version 1: Received: 7 March 2023 / Approved: 8 March 2023 / Online: 8 March 2023 (15:25:56 CET)

A peer-reviewed article of this preprint also exists:

Rascon, C. Characterization of Deep Learning-Based Speech-Enhancement Techniques in Online Audio Processing Applications. Sensors 2023, 23, 4394.

Abstract

Deep-learning-based speech enhancement techniques have recently attracted growing interest, since their impressive performance can potentially benefit a wide variety of digital voice communication systems. However, such performance has been evaluated mostly in offline audio-processing scenarios (i.e., feeding the model, in one go, a complete audio recording, which may extend over several seconds). It is of great interest to evaluate and characterize the current state of the art in applications that process audio online (i.e., feeding the model a sequence of audio segments and concatenating the results at the output end). Although evaluations and comparisons between speech enhancement techniques have been carried out before, as far as the author knows, the work presented here is the first to evaluate the performance of such techniques in relation to their online applicability. Specifically, this work measures how the output signal-to-interference ratio (as a separation metric), the response time, and the memory usage (as online metrics) are impacted by the input length (the size of the audio segments), as well as by the amount of noise, the amount and number of interferences, and the amount of reverberation. Three popular models were evaluated, given their availability in public repositories and their online viability: MetricGAN+, Spectral Feature Mapping with Mimic Loss, and Demucs-Denoiser. The characterization was carried out using a systematic evaluation protocol based on the SpeechBrain framework. Several intuitions are presented and discussed, and some recommendations for future work are proposed.
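To make the online evaluation setting concrete, the following is a minimal sketch (not the author's actual benchmark code) of feeding a speech-enhancement model a sequence of fixed-length audio segments, concatenating the outputs, and timing each call to estimate a real-time factor. The `enhance_segment` function is a hypothetical stand-in for any of the three evaluated models; segment length and sampling rate are illustrative assumptions.

```python
import time
import numpy as np

def enhance_segment(segment: np.ndarray) -> np.ndarray:
    """Hypothetical stand-in for a speech-enhancement model
    (e.g., MetricGAN+, Mimic-Loss mapping, or Demucs-Denoiser)."""
    return segment  # identity placeholder

def online_enhance(audio: np.ndarray, sample_rate: int, segment_len_s: float):
    """Feed the model fixed-length segments, concatenate the outputs,
    and report the real-time factor (processing time / segment duration)."""
    seg_len = int(segment_len_s * sample_rate)
    outputs, times = [], []
    for start in range(0, len(audio), seg_len):
        segment = audio[start:start + seg_len]
        t0 = time.perf_counter()
        outputs.append(enhance_segment(segment))
        times.append(time.perf_counter() - t0)
    enhanced = np.concatenate(outputs)
    real_time_factor = float(np.mean(times)) / segment_len_s
    return enhanced, real_time_factor

# Example: 10 s of noisy audio at 16 kHz, processed in 0.5 s segments.
noisy = np.random.randn(10 * 16000).astype(np.float32)
enhanced, rtf = online_enhance(noisy, sample_rate=16000, segment_len_s=0.5)
print(f"real-time factor: {rtf:.3f}  (values below 1.0 are online-viable)")
```

In an actual evaluation, `enhance_segment` would wrap a pretrained model, and memory usage and the output signal-to-interference ratio would be measured alongside the response time for each segment length.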

Keywords

speech enhancement; online applicability; real-time factor

Subject

Computer Science and Mathematics, Artificial Intelligence and Machine Learning
