Submitted: 30 September 2024
Posted: 03 October 2024
Abstract

Keywords:
1. Introduction
2. Related Works and Research Planning
3. Speech Denoising
3.1. Processing Framework
3.2. DNN Architecture
3.3. Input Arrangements
3.4. Loss Functions
4. Experiment and Performance Evaluation
4.1. Datasets for Model Training
4.2. Performance Evaluation
4.3. Further Considerations of Loss Function in the STFT and STDCT Domains
5. Conclusions
Author Contributions
Funding
Data Availability Statement
Acknowledgments
Conflicts of Interest

| Input type | Initial SNR | Classical U-Net with CCABs | | | | | | Advanced U-Net with GLFBs | | | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | | Resulting SNR (dB) | CSIG | CBAK | COVL | PESQ | STOI (%) | Resulting SNR (dB) | CSIG | CBAK | COVL | PESQ | STOI (%) |
| STFT-RI sequences | –2.5 dB | 12.97 | 3.460 | 3.055 | 3.091 | 2.811 | 80.59 | 13.73 | 3.495 | 3.114 | 3.132 | 2.851 | 81.69 |
| | 2.5 dB | 15.96 | 3.941 | 3.372 | 3.496 | 3.097 | 85.69 | 16.51 | 3.948 | 3.410 | 3.515 | 3.123 | 86.39 |
| | 7.5 dB | 18.37 | 4.322 | 3.624 | 3.814 | 3.323 | 89.03 | 18.75 | 4.334 | 3.660 | 3.838 | 3.355 | 89.65 |
| | 12.5 dB | 20.45 | 4.618 | 3.854 | 4.072 | 3.518 | 91.45 | 20.69 | 4.632 | 3.887 | 4.097 | 3.549 | 91.99 |
| | Average | 16.94 | 4.085 | 3.476 | 3.618 | 3.187 | 86.69 | 17.42 | 4.102 | 3.518 | 3.646 | 3.220 | 87.43 |
| STDCT sequences | –2.5 dB | 13.04 | 3.458 | 3.062 | 3.096 | 2.821 | 80.85 | 13.79 | 3.506 | 3.122 | 3.149 | 2.878 | 81.95 |
| | 2.5 dB | 16.04 | 3.940 | 3.375 | 3.498 | 3.102 | 85.81 | 16.53 | 3.964 | 3.421 | 3.533 | 3.145 | 86.53 |
| | 7.5 dB | 18.38 | 4.345 | 3.626 | 3.830 | 3.332 | 89.14 | 18.66 | 4.358 | 3.666 | 3.858 | 3.370 | 89.68 |
| | 12.5 dB | 20.50 | 4.661 | 3.861 | 4.101 | 3.535 | 91.61 | 20.67 | 4.660 | 3.890 | 4.116 | 3.562 | 92.00 |
| | Average | 16.99 | 4.101 | 3.481 | 3.631 | 3.198 | 86.85 | 17.41 | 4.122 | 3.525 | 3.664 | 3.239 | 87.54 |
| Waveform sequences | –2.5 dB | 13.40 | 3.338 | 3.014 | 2.996 | 2.739 | 81.18 | 13.45 | 3.313 | 3.009 | 2.980 | 2.734 | 81.13 |
| | 2.5 dB | 15.96 | 3.784 | 3.297 | 3.364 | 2.992 | 85.61 | 16.17 | 3.761 | 3.306 | 3.355 | 3.000 | 85.81 |
| | 7.5 dB | 18.21 | 4.178 | 3.547 | 3.690 | 3.218 | 88.80 | 18.30 | 4.152 | 3.543 | 3.673 | 3.218 | 88.88 |
| | 12.5 dB | 20.05 | 4.492 | 3.771 | 3.958 | 3.418 | 91.13 | 20.13 | 4.469 | 3.761 | 3.942 | 3.418 | 91.06 |
| | Average | 16.90 | 3.948 | 3.407 | 3.502 | 3.092 | 86.68 | 17.01 | 3.924 | 3.405 | 3.487 | 3.092 | 86.72 |

| Loss function | Initial SNR | Classical U-Net with CCABs | | | | | | Advanced U-Net with GLFBs | | | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | | Resulting SNR (dB) | CSIG | CBAK | COVL | PESQ | STOI (%) | Resulting SNR (dB) | CSIG | CBAK | COVL | PESQ | STOI (%) |
| (0, 1) | –2.5 dB | 12.97 | 3.460 | 3.055 | 3.091 | 2.811 | 80.59 | 13.73 | 3.495 | 3.114 | 3.132 | 2.851 | 81.69 |
| | 2.5 dB | 15.96 | 3.941 | 3.372 | 3.496 | 3.097 | 85.69 | 16.51 | 3.948 | 3.410 | 3.515 | 3.123 | 86.39 |
| | 7.5 dB | 18.37 | 4.322 | 3.624 | 3.814 | 3.323 | 89.03 | 18.75 | 4.334 | 3.660 | 3.838 | 3.355 | 89.65 |
| | 12.5 dB | 20.45 | 4.618 | 3.854 | 4.072 | 3.518 | 91.45 | 20.69 | 4.632 | 3.887 | 4.097 | 3.549 | 91.99 |
| | Average | 16.94 | 4.085 | 3.476 | 3.618 | 3.187 | 86.69 | 17.42 | 4.102 | 3.518 | 3.646 | 3.220 | 87.43 |
| (0.5, 1) | –2.5 dB | 12.75 | 3.638 | 3.053 | 3.190 | 2.809 | 81.02 | 13.58 | 3.737 | 3.125 | 3.277 | 2.876 | 82.41 |
| | 2.5 dB | 15.65 | 4.052 | 3.349 | 3.548 | 3.076 | 85.82 | 16.23 | 4.110 | 3.399 | 3.604 | 3.126 | 86.83 |
| | 7.5 dB | 18.00 | 4.388 | 3.600 | 3.843 | 3.305 | 89.21 | 18.46 | 4.424 | 3.635 | 3.880 | 3.339 | 89.91 |
| | 12.5 dB | 20.15 | 4.655 | 3.834 | 4.085 | 3.499 | 91.65 | 20.44 | 4.675 | 3.859 | 4.109 | 3.526 | 92.07 |
| | Average | 16.64 | 4.183 | 3.459 | 3.666 | 3.172 | 86.92 | 17.18 | 4.237 | 3.504 | 3.718 | 3.217 | 87.81 |
| (0, 0.5) | –2.5 dB | 13.30 | 3.298 | 3.024 | 2.973 | 2.758 | 80.37 | 13.82 | 3.397 | 3.098 | 3.062 | 2.821 | 81.36 |
| | 2.5 dB | 16.27 | 3.837 | 3.375 | 3.442 | 3.107 | 85.67 | 16.63 | 3.930 | 3.427 | 3.517 | 3.153 | 86.30 |
| | 7.5 dB | 18.63 | 4.318 | 3.666 | 3.844 | 3.396 | 89.25 | 18.84 | 4.389 | 3.701 | 3.903 | 3.434 | 89.58 |
| | 12.5 dB | 20.52 | 4.679 | 3.912 | 4.156 | 3.627 | 91.70 | 20.72 | 4.739 | 3.944 | 4.207 | 3.666 | 92.02 |
| | Average | 17.18 | 4.033 | 3.494 | 3.604 | 3.222 | 86.75 | 17.50 | 4.114 | 3.542 | 3.672 | 3.269 | 87.32 |
| (0.5, 0.5) | –2.5 dB | 13.08 | 3.750 | 3.120 | 3.267 | 2.844 | 81.14 | 13.93 | 3.901 | 3.222 | 3.402 | 2.949 | 82.65 |
| | 2.5 dB | 16.04 | 4.206 | 3.448 | 3.677 | 3.171 | 86.25 | 16.66 | 4.312 | 3.520 | 3.777 | 3.259 | 87.24 |
| | 7.5 dB | 18.44 | 4.569 | 3.715 | 4.007 | 3.442 | 89.74 | 18.89 | 4.639 | 3.769 | 4.078 | 3.509 | 90.30 |
| | 12.5 dB | 20.50 | 4.848 | 3.954 | 4.268 | 3.663 | 92.18 | 20.90 | 4.907 | 4.004 | 4.331 | 3.727 | 92.67 |
| | Average | 17.01 | 4.343 | 3.559 | 3.805 | 3.280 | 87.32 | 17.59 | 4.440 | 3.629 | 3.897 | 3.361 | 88.21 |

| Loss function | Initial SNR | Classical U-Net with CCABs | | | | | | Advanced U-Net with GLFBs | | | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | | Resulting SNR (dB) | CSIG | CBAK | COVL | PESQ | STOI (%) | Resulting SNR (dB) | CSIG | CBAK | COVL | PESQ | STOI (%) |
| (0, 1) | –2.5 dB | 13.04 | 3.458 | 3.062 | 3.096 | 2.821 | 80.85 | 13.79 | 3.506 | 3.122 | 3.149 | 2.878 | 81.95 |
| | 2.5 dB | 16.04 | 3.940 | 3.375 | 3.498 | 3.102 | 85.81 | 16.53 | 3.964 | 3.421 | 3.533 | 3.145 | 86.53 |
| | 7.5 dB | 18.38 | 4.345 | 3.626 | 3.830 | 3.332 | 89.14 | 18.66 | 4.358 | 3.666 | 3.858 | 3.370 | 89.68 |
| | 12.5 dB | 20.50 | 4.661 | 3.861 | 4.101 | 3.535 | 91.61 | 20.67 | 4.660 | 3.890 | 4.116 | 3.562 | 92.00 |
| | Average | 16.99 | 4.101 | 3.481 | 3.631 | 3.198 | 86.85 | 17.41 | 4.122 | 3.525 | 3.664 | 3.239 | 87.54 |
| (0.5, 1) | –2.5 dB | 12.91 | 3.642 | 3.077 | 3.206 | 2.837 | 81.28 | 13.19 | 3.700 | 3.103 | 3.245 | 2.857 | 81.79 |
| | 2.5 dB | 15.92 | 4.077 | 3.385 | 3.580 | 3.114 | 86.27 | 16.10 | 4.090 | 3.395 | 3.591 | 3.123 | 86.50 |
| | 7.5 dB | 18.26 | 4.412 | 3.627 | 3.870 | 3.334 | 89.48 | 18.38 | 4.407 | 3.632 | 3.871 | 3.341 | 89.67 |
| | 12.5 dB | 20.30 | 4.676 | 3.852 | 4.108 | 3.524 | 91.68 | 20.41 | 4.667 | 3.857 | 4.105 | 3.528 | 91.99 |
| | Average | 16.85 | 4.202 | 3.485 | 3.691 | 3.202 | 87.18 | 17.02 | 4.216 | 3.497 | 3.703 | 3.212 | 87.48 |
| (0, 0.5) | –2.5 dB | 13.09 | 3.345 | 3.035 | 3.008 | 2.776 | 80.32 | 13.79 | 3.401 | 3.099 | 3.065 | 2.828 | 81.41 |
| | 2.5 dB | 16.06 | 3.845 | 3.376 | 3.448 | 3.107 | 85.60 | 16.55 | 3.913 | 3.420 | 3.506 | 3.155 | 86.25 |
| | 7.5 dB | 18.43 | 4.314 | 3.658 | 3.841 | 3.390 | 89.10 | 18.84 | 4.384 | 3.703 | 3.902 | 3.440 | 89.58 |
| | 12.5 dB | 20.42 | 4.687 | 3.908 | 4.160 | 3.626 | 91.67 | 20.68 | 4.740 | 3.943 | 4.209 | 3.670 | 91.98 |
| | Average | 17.00 | 4.048 | 3.494 | 3.614 | 3.225 | 86.67 | 17.47 | 4.109 | 3.541 | 3.670 | 3.273 | 87.30 |
| (0.5, 0.5) | –2.5 dB | 13.23 | 3.792 | 3.141 | 3.303 | 2.877 | 81.48 | 13.80 | 3.889 | 3.208 | 3.386 | 2.934 | 82.40 |
| | 2.5 dB | 16.03 | 4.229 | 3.452 | 3.695 | 3.188 | 86.34 | 16.63 | 4.313 | 3.516 | 3.774 | 3.254 | 87.06 |
| | 7.5 dB | 18.52 | 4.591 | 3.726 | 4.026 | 3.460 | 89.73 | 18.86 | 4.636 | 3.764 | 4.072 | 3.503 | 90.13 |
| | 12.5 dB | 20.48 | 4.862 | 3.957 | 4.280 | 3.674 | 92.08 | 20.87 | 4.903 | 4.000 | 4.324 | 3.720 | 92.52 |
| | Average | 17.07 | 4.368 | 3.569 | 3.826 | 3.300 | 87.40 | 17.54 | 4.435 | 3.622 | 3.889 | 3.353 | 88.03 |

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).