Submitted: 07 March 2023
Posted: 08 March 2023
Abstract
Keywords:
1. Introduction
2. Evaluated Techniques
- Ease of replicability: a public code repository exists from which its published results can be recreated.
- Online viability: it can be deduced from the original published paper that the technique is able to run in an online manner.
- Popularity: a pre-trained model is publicly available for download and testing in one or more speech enhancement repositories (such as Hugging Face [21]).
2.1. Spectral Feature Mapping with Mimic Loss
2.2. MetricGAN+
- The metric behavior is learned not only from the clean speech and the estimated speech, but also from the noisy speech, which stabilizes the learning process of the discriminator network (whose aim is to mimic such metrics).
- Generated data from previous epochs is re-used to train the discriminator network, improving performance.
- Instead of applying the same sigmoid function to all frequencies in the generator network, a sigmoid function is learned for each frequency, providing a closer match between the sum of the clean and noise magnitudes and the magnitude of the noisy speech.
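The last improvement can be sketched concretely. The snippet below is a minimal illustration (not the authors' implementation): a learned scale per frequency bin stretches the sigmoid so the mask can exceed 1 where needed; all shapes and values here are hypothetical.

```python
import numpy as np

def learned_sigmoid_mask(scores, beta):
    """Per-frequency learnable sigmoid, MetricGAN+-style (sketch).

    scores : (frames, freqs) raw generator outputs
    beta   : (freqs,) learned scale per frequency bin; values above 1
             allow the mask to exceed 1 for that bin, so the masked
             magnitudes can better match the noisy-speech magnitude.
    """
    return beta * (1.0 / (1.0 + np.exp(-scores)))

# Hypothetical example: 4 frames, 3 frequency bins.
scores = np.zeros((4, 3))          # sigmoid(0) = 0.5 everywhere
beta = np.array([1.0, 1.2, 2.0])   # one learned scale per frequency
mask = learned_sigmoid_mask(scores, beta)
print(mask[0])  # -> [0.5 0.6 1. ]
```

Note that with a single fixed sigmoid (beta = 1 for all bins), the mask is capped at 1; the per-frequency scale removes that cap independently for each bin.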
2.3. Demucs-Denoiser
3. Evaluation Methodology
3.1. Input Lengths and Online Processing
3.2. Corpus Synthesis, Evaluated Variables
- Reverberation
- Signal-to-noise ratio
- Number of interferences
- Input signal-to-interference ratio
3.2.1. Reverberation Scale
3.2.2. Signal-to-Noise Ratio
3.2.3. Number of Interferences
3.2.4. Input Signal-to-Interference Ratio
3.3. Evaluation Corpus
3.4. Evaluation Metrics
- Output signal-to-interference ratio
- Inference response time and real-time factor
- Memory usage
3.4.1. Output Signal-to-Interference Ratio
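As a minimal sketch of the energy-ratio definition behind output SIR (assuming, as in the BSS-Eval framework cited in the references, that the target and interference components of the output are available separately; the helper name and test signals are hypothetical):

```python
import numpy as np

def output_sir_db(target_component, interference_component):
    """Output SIR in dB: 10*log10(target energy / interference energy)."""
    target_energy = np.sum(target_component ** 2)
    interference_energy = np.sum(interference_component ** 2)
    return 10.0 * np.log10(target_energy / interference_energy)

# Hypothetical components: interference at one tenth the amplitude
# of the target, i.e. one hundredth of its energy.
target = np.ones(1000)
interference = 0.1 * np.ones(1000)
print(output_sir_db(target, interference))  # -> 20.0
```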
3.4.2. Inference Response Time and Real-Time Factor
3.4.3. Memory Usage
4. Results
4.1. Output SIR vs. Input Length
4.2. Output SIR vs. Reverberation Scale
4.3. Output SIR vs. Number of Interferences
4.4. Output SIR vs. Input SIR
4.5. Output SIR vs. SNR
4.6. Real-Time Factor
4.7. Memory Usage
5. Discussion
6. Conclusions
Supplementary Materials
Funding
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
- Das, N.; Chakraborty, S.; Chaki, J.; Padhy, N.; Dey, N. Fundamentals, present and future perspectives of speech enhancement. International Journal of Speech Technology 2021, 24, 883–901.
- Wang, D.; Chen, J. Supervised Speech Separation Based on Deep Learning: An Overview. IEEE/ACM Transactions on Audio, Speech, and Language Processing 2018, 26, 1702–1726. [CrossRef]
- Eskimez, S.E.; Soufleris, P.; Duan, Z.; Heinzelman, W. Front-end speech enhancement for commercial speaker verification systems. Speech Communication 2018, 99, 101–113.
- Porov, A.; Oh, E.; Choo, K.; Sung, H.; Jeong, J.; Osipov, K.; Francois, H. Music enhancement by a novel CNN architecture. Journal of the Audio Engineering Society 2018.
- Lopatka, K.; Czyzewski, A.; Kostek, B. Improving listeners’ experience for movie playback through enhancing dialogue clarity in soundtracks. Digital Signal Processing 2016, 48, 40–49.
- Li, C.; Shi, J.; Zhang, W.; Subramanian, A.S.; Chang, X.; Kamo, N.; Hira, M.; Hayashi, T.; Boeddeker, C.; Chen, Z.; et al. ESPnet-SE: End-to-end speech enhancement and separation toolkit designed for ASR integration. In Proceedings of the 2021 IEEE Spoken Language Technology Workshop (SLT). IEEE, 2021, pp. 785–792.
- Rascon, C.; Meza, I. Localization of sound sources in robotics: A review. Robotics and Autonomous Systems 2017, 96, 184–210.
- Lai, Y.H.; Zheng, W.Z. Multi-objective learning based speech enhancement method to increase speech quality and intelligibility for hearing aid device users. Biomedical Signal Processing and Control 2019, 48, 35–45.
- Zhang, Q.; Wang, D.; Zhao, R.; Yu, Y.; Shen, J. Sensing to hear: Speech enhancement for mobile devices using acoustic signals. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies 2021, 5, 1–30.
- Rao, W.; Fu, Y.; Hu, Y.; Xu, X.; Jv, Y.; Han, J.; Jiang, Z.; Xie, L.; Wang, Y.; Watanabe, S.; et al. Conferencingspeech challenge: Towards far-field multi-channel speech enhancement for video conferencing. In Proceedings of the 2021 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU). IEEE, 2021, pp. 679–686.
- Hu, Y.; Loizou, P.C. Subjective comparison and evaluation of speech enhancement algorithms. Speech Communication 2007, 49, 588–601.
- Upadhyay, N.; Karmakar, A. Speech enhancement using spectral subtraction-type algorithms: A comparison and simulation study. Procedia Computer Science 2015, 54, 574–584.
- Nossier, S.A.; Wall, J.; Moniri, M.; Glackin, C.; Cannings, N. A comparative study of time and frequency domain approaches to deep learning based speech enhancement. In Proceedings of the 2020 International Joint Conference on Neural Networks (IJCNN). IEEE, 2020, pp. 1–8.
- Graetzer, S.; Hopkins, C. Comparison of ideal mask-based speech enhancement algorithms for speech mixed with white noise at low mixture signal-to-noise ratios. The Journal of the Acoustical Society of America 2022, 152, 3458–3470.
- Panayotov, V.; Chen, G.; Povey, D.; Khudanpur, S. Librispeech: An ASR corpus based on public domain audio books. In Proceedings of the 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2015, pp. 5206–5210. [CrossRef]
- Bagchi, D.; Plantinga, P.; Stiff, A.; Fosler-Lussier, E. Spectral Feature Mapping with MIMIC Loss for Robust Speech Recognition. In Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018, pp. 5609–5613. [CrossRef]
- Fu, S.W.; Yu, C.; Hsieh, T.A.; Plantinga, P.; Ravanelli, M.; Lu, X.; Tsao, Y. MetricGAN+: An Improved Version of MetricGAN for Speech Enhancement. In Proc. Interspeech 2021, 2021, pp. 201–205. [CrossRef]
- Défossez, A.; Synnaeve, G.; Adi, Y. Real Time Speech Enhancement in the Waveform Domain. In Proc. Interspeech 2020, 2020, pp. 3291–3295. [CrossRef]
- Hao, X.; Xu, C.; Xie, L.; Li, H. Optimizing the Perceptual Quality of Time-Domain Speech Enhancement with Reinforcement Learning. Tsinghua Science and Technology 2022, 27, 939–947. [CrossRef]
- Zeng, Y.; Konan, J.; Han, S.; Bick, D.; Yang, M.; Kumar, A.; Watanabe, S.; Raj, B. TAPLoss: A Temporal Acoustic Parameter Loss for Speech Enhancement. arXiv preprint arXiv:2302.08088 2023.
- Jain, S.M. Hugging Face. In Introduction to Transformers for NLP: With the Hugging Face Library and Models to Solve Problems; Apress: Berkeley, CA, 2022; pp. 51–67. [CrossRef]
- Creswell, A.; White, T.; Dumoulin, V.; Arulkumaran, K.; Sengupta, B.; Bharath, A.A. Generative Adversarial Networks: An Overview. IEEE Signal Processing Magazine 2018, 35, 53–65. [CrossRef]
- Fu, S.W.; Liao, C.F.; Tsao, Y.; Lin, S.D. MetricGAN: Generative Adversarial Networks based Black-box Metric Scores Optimization for Speech Enhancement. In Proceedings of the 36th International Conference on Machine Learning; Chaudhuri, K.; Salakhutdinov, R., Eds. PMLR, 2019, Vol. 97, Proceedings of Machine Learning Research, pp. 2031–2041.
- ITU-T. Perceptual evaluation of speech quality (PESQ): An objective method for end-to-end speech quality assessment of narrow-band telephone networks and speech codecs. Rec. ITU-T P.862, 2001.
- Taal, C.H.; Hendriks, R.C.; Heusdens, R.; Jensen, J. A short-time objective intelligibility measure for time-frequency weighted noisy speech. In Proceedings of the 2010 IEEE International Conference on Acoustics, Speech and Signal Processing, 2010, pp. 4214–4217. [CrossRef]
- Défossez, A.; Usunier, N.; Bottou, L.; Bach, F. Music source separation in the waveform domain. arXiv preprint arXiv:1911.13254 2019.
- Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional Networks for Biomedical Image Segmentation. In Proceedings of the Medical Image Computing and Computer-Assisted Intervention – MICCAI 2015; Navab, N.; Hornegger, J.; Wells, W.M.; Frangi, A.F., Eds.; Springer International Publishing: Cham, 2015; p. 234–241.
- Reddy, C.K.; Gopal, V.; Cutler, R.; Beyrami, E.; Cheng, R.; Dubey, H.; Matusevych, S.; Aichner, R.; Aazami, A.; Braun, S.; et al. The interspeech 2020 deep noise suppression challenge: Datasets, subjective testing framework, and challenge results. arXiv preprint arXiv:2005.13981 2020.
- Ravanelli, M.; Parcollet, T.; Plantinga, P.; Rouhe, A.; Cornell, S.; Lugosch, L.; Subakan, C.; Dawalatabad, N.; Heba, A.; Zhong, J.; et al. SpeechBrain: A General-Purpose Speech Toolkit, 2021. arXiv:2106.04624.
- Ko, T.; Peddinti, V.; Povey, D.; Seltzer, M.L.; Khudanpur, S. A study on data augmentation of reverberant speech for robust speech recognition. In Proceedings of the 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2017, pp. 5220–5224.
- Vincent, E.; Gribonval, R.; Fevotte, C. Performance measurement in blind audio source separation. IEEE Transactions on Audio, Speech, and Language Processing 2006, 14, 1462–1469. [CrossRef]
- Raffel, C.; McFee, B.; Humphrey, E.J.; Salamon, J.; Nieto, O.; Liang, D.; Ellis, D.P.; Raffel, C.C. MIR_EVAL: A Transparent Implementation of Common MIR Metrics. In Proceedings of the ISMIR, 2014, pp. 367–372.
1. Latency is the time difference between the moment an audio segment is fed to the technique and the moment the estimated target speech in that segment is played back through a loudspeaker.
2. Although, to be fair, it is a very challenging scenario, given the absence of data to “look ahead” to as well as the harsh time constraints; in conjunction, both may have contributed to this oversight.
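The latency notion in footnote 1 is closely tied to the real-time factor evaluated in Section 3.4.2. As a minimal sketch of how the real-time factor can be measured, assuming a 16 kHz sampling rate and a hypothetical stand-in for an enhancement model:

```python
import time
import numpy as np

SAMPLE_RATE = 16000  # assumed sampling rate

def identity_enhance(segment):
    """Hypothetical stand-in for a speech enhancement technique."""
    return segment.copy()

def real_time_factor(enhance, segment):
    """RTF = inference time / duration of the audio segment.

    An RTF below 1 means the technique processes audio faster than
    it arrives, a necessary condition for online operation.
    """
    start = time.perf_counter()
    enhance(segment)
    elapsed = time.perf_counter() - start
    duration = len(segment) / SAMPLE_RATE
    return elapsed / duration

segment = np.zeros(16384)  # one of the evaluated input lengths
rtf = real_time_factor(identity_enhance, segment)
print(rtf < 1.0)  # a trivial copy easily runs in real time
```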
| Evaluated Variable | Values |
|---|---|
| Input Length (samples) | 1024, 2048, 4096, 8192, 16384, 32768, 65536, 131072, Full recording |
| Reverberation Scale | 0, 1, 2, 3 |
| Signal-to-Noise Ratio (dB) | -10, -5, 0, 5, 10, 15, 20, 25, 30 |
| Number of Interferences | 0, 1, 2, 3 |
| Input Signal-to-Interference Ratio (dB) | -10, -5, 0, 5, 10, 15, 20, 25, 30, 100† |

| Variable | Default Value |
|---|---|
| Reverberation Scale | 0 |
| Signal-to-Noise Ratio (dB) | 30 |
| Number of Interferences | 0 |
| Input Signal-to-Interference Ratio (dB) | 100† |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).