Submitted: 27 February 2023
Posted: 27 February 2023
Abstract
Keywords:
1. Introduction
- We construct an encoder–decoder structure with gating blocks, using a decoupling-style phase-aware method that collaboratively estimates the magnitude and phase information of clean speech in parallel and avoids the compensation effect between magnitude and phase;
- We propose a convolution-augmented gated attention unit that captures time and frequency dependence with lower computational complexity and achieves better results than the conformer;
- The proposed approach outperforms previous approaches on the Voice Bank+DEMAND dataset [29], and an ablation study verifies our design choices.
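
The parallel magnitude and phase estimation named in the first contribution can be illustrated with a toy example. The sketch below is not the authors' implementation. It assumes one common decoupling-style scheme, in which a mask branch scales the noisy magnitude while keeping the noisy phase, and a complex branch adds a real/imaginary residual that corrects the phase. The function name `combine_branches` and all numeric values are hypothetical.

```python
import cmath


def combine_branches(noisy_bin, mask, residual):
    """Combine two decoder branches for a single time-frequency bin.

    Assumed scheme (illustrative only, not taken from the paper):
      coarse  = mask * |X| * exp(j * angle(X))  # magnitude branch, noisy phase kept
      refined = coarse + residual               # complex branch adjusts real/imag parts
    """
    magnitude, phase = cmath.polar(noisy_bin)
    coarse = cmath.rect(mask * magnitude, phase)
    return coarse + residual


# Hypothetical values for one bin of a noisy spectrogram.
noisy = complex(3.0, 4.0)
enhanced = combine_branches(noisy, mask=0.5, residual=complex(0.1, -0.2))
print("enhanced bin:", enhanced)
print("phase shift (rad):", cmath.phase(enhanced) - cmath.phase(noisy))
```

Because the residual is added in the complex domain, the phase of the output can differ from the noisy phase even though the mask branch alone never changes it.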
2. Related Works
2.1. MetricGAN
- Input noisy speech into the generator to generate enhanced speech;
- Input a clean–clean speech pair into the discriminator to calculate its output, and calculate the corresponding metric score through the objective evaluation function;
- Input an enhanced–clean speech pair into the discriminator to calculate its output, and calculate the corresponding metric score through the objective evaluation function;
- Calculate the loss functions of the generator and the discriminator, and update the weights of both networks.
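
The steps above can be sketched numerically. The snippet below is a minimal illustration that assumes the least-squares objectives usually associated with MetricGAN: the discriminator regresses onto the metric scores, and the generator pushes the discriminator's score for enhanced speech toward a target of 1. The symbol names (x for noisy speech, y for clean speech, G, D, and Q for a metric normalized to [0, 1]) and all numbers are illustrative assumptions, and the loss used by the proposed model may contain additional terms.

```python
def discriminator_loss(d_clean, q_clean, d_enhanced, q_enhanced):
    """Least-squares loss pulling D's outputs toward the metric scores."""
    return (d_clean - q_clean) ** 2 + (d_enhanced - q_enhanced) ** 2


def generator_loss(d_enhanced, target=1.0):
    """Least-squares loss pulling D's score for enhanced speech toward the target."""
    return (d_enhanced - target) ** 2


# Hypothetical numbers for one training pair (not from the paper).
d_clean, q_clean = 0.9, 1.0        # D(y, y) and Q(y, y)
d_enhanced, q_enhanced = 0.6, 0.7  # D(G(x), y) and Q(G(x), y)

loss_d = discriminator_loss(d_clean, q_clean, d_enhanced, q_enhanced)
loss_g = generator_loss(d_enhanced)
print(f"discriminator loss: {loss_d:.2f}, generator loss: {loss_g:.2f}")
```

In an actual training loop these two scalars would be averaged over a batch and back-propagated through the discriminator and the generator, respectively.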
2.2. Gated Attention Unit
2.3. Conformer
2.4. Limitations and Our Approach
3. Methodology
3.1. Encoder and Decoder
3.2. Two-Stage CGA Block
3.3. Metric Discriminator
3.4. Loss Function
4. Experiments
4.1. Datasets and Settings
4.2. Evaluation Indicators
5. Results and Discussion
| Method | Size (M) | PESQ | CSIG | CBAK | COVL | SSNR | STOI |
|---|---|---|---|---|---|---|---|
| Noisy | -* | 1.97 | 3.35 | 2.44 | 2.63 | 1.68 | 0.91 |
| SEGAN | 97.47 | 2.16 | 3.48 | 2.94 | 2.80 | 7.73 | 0.92 |
| DCCRN | 3.70 | 2.68 | 3.88 | 3.18 | 3.27 | - | 0.94 |
| Conv-TasNet | 5.1 | 2.84 | 2.33 | 2.62 | 2.51 | - | - |
| TDCGAN | 5.12 | 2.87 | 4.17 | 3.46 | 3.53 | 9.82 | 0.95 |
| MetricGAN+ | - | 3.15 | 4.14 | 3.16 | 3.64 | - | - |
| UNIVERSE | - | 3.33 | - | - | 3.82 | - | 0.95 |
| CDiffuSE | - | 2.52 | 3.72 | 2.91 | 3.10 | - | - |
| SE-Conformer | - | 3.13 | 4.45 | 3.55 | 3.82 | - | 0.95 |
| DB-AIAT | 2.81 | 3.31 | 4.61 | 3.75 | 3.96 | 10.79 | 0.96 |
| DPT-FSNet | 0.91 | 3.33 | 4.58 | 3.72 | 4.00 | - | 0.96 |
| DBT-Net | 2.91 | 3.30 | 4.59 | 3.75 | 3.92 | - | 0.96 |
| CGA-MGAN | 1.14 | 3.47 | 4.56 | 3.86 | 4.06 | 11.09 | 0.96 |
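
As a quick check on the table, the following sketch ranks the methods by PESQ and lists the models that report a smaller size than CGA-MGAN. The numbers are copied from the table; the size column is assumed to be in millions of parameters, and `None` marks values the table leaves unreported.

```python
# (size, PESQ) per method, copied from the table above.
# Size is assumed to mean millions of parameters; None marks an unreported value.
results = {
    "Noisy": (None, 1.97),
    "SEGAN": (97.47, 2.16),
    "DCCRN": (3.70, 2.68),
    "Conv-TasNet": (5.1, 2.84),
    "TDCGAN": (5.12, 2.87),
    "MetricGAN+": (None, 3.15),
    "UNIVERSE": (None, 3.33),
    "CDiffuSE": (None, 2.52),
    "SE-Conformer": (None, 3.13),
    "DB-AIAT": (2.81, 3.31),
    "DPT-FSNet": (0.91, 3.33),
    "DBT-Net": (2.91, 3.30),
    "CGA-MGAN": (1.14, 3.47),
}

# Rank by PESQ, highest first (ties keep their table order).
ranked = sorted(results, key=lambda name: results[name][1], reverse=True)
best, runner_up = ranked[0], ranked[1]
margin = results[best][1] - results[runner_up][1]
print(f"Best PESQ: {best} ({results[best][1]:.2f}), "
      f"margin over {runner_up}: {margin:.2f}")

# Methods with a reported size smaller than CGA-MGAN's.
smaller = [name for name, (size, _) in results.items()
           if size is not None and size < results["CGA-MGAN"][0]]
print("Smaller models:", smaller)
```

On these numbers, CGA-MGAN leads the tied runners-up (UNIVERSE and DPT-FSNet, both at 3.33) by 0.14 PESQ, and only DPT-FSNet reports a smaller size.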
5.1. Baselines and Results Analysis
5.2. Ablation Study
6. Conclusion
Author Contributions
Funding
Institutional Review Board Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
- Wang, D.; Chen, J. Supervised speech separation based on deep learning: An overview. IEEE/ACM Transactions on Audio, Speech, and Language Processing 2018, 26, 1702-1726.
- Atmaja, B.T.; Farid, M.N.; Arifianto, D. Speech enhancement on smartphone voice recording. In Proceedings of the Journal of Physics: Conference Series, 2016; p. 012072.
- Tasell, D.J.V. Hearing loss, speech, and hearing aids. Journal of Speech, Language, and Hearing Research 1993, 36, 228-244.
- Wang, J.; Liu, H.; Zheng, C.; Li, X. Spectral subtraction based on two-stage spectral estimation and modified cepstrum thresholding. Applied Acoustics 2013, 74, 450-458.
- Abd El-Fattah, M.A.; Dessouky, M.I.; Abbas, A.M.; Diab, S.M.; El-Rabaie, E.-S.M.; Al-Nuaimy, W.; Alshebeili, S.A.; Abd El-samie, F.E. Speech enhancement with an adaptive Wiener filter. International Journal of Speech Technology 2014, 17, 53-64.
- Ephraim, Y.; Malah, D. Speech enhancement using a minimum-mean square error short-time spectral amplitude estimator. IEEE Transactions on Acoustics, Speech, and Signal Processing 1984, 32, 1109-1121.
- Wang, D. On ideal binary mask as the computational goal of auditory scene analysis. In Speech separation by humans and machines; Springer: 2005; pp. 181-197.
- Narayanan, A.; Wang, D. Ideal ratio mask estimation using deep neural networks for robust speech recognition. In Proceedings of the 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, 2013; pp. 7092-7096.
- Wang, Y.; Narayanan, A.; Wang, D. On training targets for supervised speech separation. IEEE/ACM Transactions on Audio, Speech, and Language Processing 2014, 22, 1849-1858.
- Xu, Y.; Du, J.; Dai, L.-R.; Lee, C.-H. A regression approach to speech enhancement based on deep neural networks. IEEE/ACM Transactions on Audio, Speech, and Language Processing 2014, 23, 7-19.
- Paliwal, K.; Wójcicki, K.; Shannon, B. The importance of phase in speech enhancement. Speech Communication 2011, 53, 465-494.
- Yin, D.; Luo, C.; Xiong, Z.; Zeng, W. PHASEN: A phase-and-harmonics-aware speech enhancement network. In Proceedings of the AAAI Conference on Artificial Intelligence, 2020; pp. 9458-9465.
- Hu, Y.; Liu, Y.; Lv, S.; Xing, M.; Zhang, S.; Fu, Y.; Wu, J.; Zhang, B.; Xie, L. DCCRN: Deep complex convolution recurrent network for phase-aware speech enhancement. arXiv preprint arXiv:2008.00264 2020.
- Williamson, D.S.; Wang, Y.; Wang, D. Complex ratio masking for monaural speech separation. IEEE/ACM Transactions on Audio, Speech, and Language Processing 2015, 24, 483-492.
- Li, A.; Zheng, C.; Yu, G.; Cai, J.; Li, X. Filtering and Refining: A Collaborative-Style Framework for Single-Channel Speech Enhancement. IEEE/ACM Transactions on Audio, Speech, and Language Processing 2022, 30, 2156-2172.
- Wang, Z.-Q.; Wichern, G.; Le Roux, J. On the compensation between magnitude and phase in speech separation. IEEE Signal Processing Letters 2021, 28, 2018-2022.
- Yu, G.; Li, A.; Zheng, C.; Guo, Y.; Wang, Y.; Wang, H. Dual-branch attention-in-attention transformer for single-channel speech enhancement. In Proceedings of the ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2022; pp. 7847-7851.
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Advances in Neural Information Processing Systems 2017, 30.
- Subakan, C.; Ravanelli, M.; Cornell, S.; Bronzi, M.; Zhong, J. Attention is all you need in speech separation. In Proceedings of the ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2021; pp. 21-25.
- Luo, Y.; Mesgarani, N. Conv-TasNet: Surpassing ideal time–frequency magnitude masking for speech separation. IEEE/ACM Transactions on Audio, Speech, and Language Processing 2019, 27, 1256-1266.
- Pandey, A.; Wang, D. TCNN: Temporal convolutional neural network for real-time speech enhancement in the time domain. In Proceedings of the ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2019; pp. 6875-6879.
- Jia, X.; Li, D. TFCN: Temporal-Frequential Convolutional Network for Single-Channel Speech Enhancement. arXiv preprint arXiv:2201.00480 2022.
- Gulati, A.; Qin, J.; Chiu, C.-C.; Parmar, N.; Zhang, Y.; Yu, J.; Han, W.; Wang, S.; Zhang, Z.; Wu, Y. Conformer: Convolution-augmented transformer for speech recognition. arXiv preprint arXiv:2005.08100 2020.
- Chen, S.; Wu, Y.; Chen, Z.; Wu, J.; Li, J.; Yoshioka, T.; Wang, C.; Liu, S.; Zhou, M. Continuous speech separation with conformer. In Proceedings of the ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2021; pp. 5749-5753.
- Kim, E.; Seo, H. SE-Conformer: Time-Domain Speech Enhancement Using Conformer. In Proceedings of the Interspeech, 2021; pp. 2736-2740.
- Cao, R.; Abdulatif, S.; Yang, B. CMGAN: Conformer-based Metric GAN for Speech Enhancement. arXiv preprint arXiv:2203.15149 2022.
- Fu, S.-W.; Liao, C.-F.; Tsao, Y.; Lin, S.-D. MetricGAN: Generative adversarial networks based black-box metric scores optimization for speech enhancement. In Proceedings of the International Conference on Machine Learning, 2019; pp. 2031-2041.
- Hua, W.; Dai, Z.; Liu, H.; Le, Q. Transformer quality in linear time. In Proceedings of the International Conference on Machine Learning, 2022; pp. 9099-9117.
- Valentini-Botinhao, C.; Wang, X.; Takaki, S.; Yamagishi, J. Investigating RNN-based speech enhancement methods for noise-robust Text-to-Speech. In Proceedings of the SSW, 2016; pp. 146-152.
- Mao, X.; Li, Q.; Xie, H.; Lau, R.Y.; Wang, Z.; Paul Smolley, S. Least squares generative adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision, 2017; pp. 2794-2802.
- Lu, Y.; Li, Z.; He, D.; Sun, Z.; Dong, B.; Qin, T.; Wang, L.; Liu, T.-Y. Understanding and improving transformer from a multi-particle dynamic system point of view. arXiv preprint arXiv:1906.02762 2019.
- Su, J.; Lu, Y.; Pan, S.; Wen, B.; Liu, Y. RoFormer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:2104.09864 2021.
- Braun, S.; Tashev, I. A consolidated view of loss functions for supervised deep learning-based speech enhancement. In Proceedings of the 2021 44th International Conference on Telecommunications and Signal Processing (TSP), 2021; pp. 72-76.
- Rix, A.W.; Beerends, J.G.; Hollier, M.P.; Hekstra, A.P. Perceptual evaluation of speech quality (PESQ)-a new method for speech quality assessment of telephone networks and codecs. In Proceedings of the 2001 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No. 01CH37221), 2001; pp. 749-752.
- Hu, Y.; Loizou, P.C. Evaluation of objective quality measures for speech enhancement. IEEE Transactions on Audio, Speech, and Language Processing 2007, 16, 229-238.
- Taal, C.H.; Hendriks, R.C.; Heusdens, R.; Jensen, J. A short-time objective intelligibility measure for time-frequency weighted noisy speech. In Proceedings of the 2010 IEEE International Conference on Acoustics, Speech and Signal Processing, 2010; pp. 4214-4217.
- Pascual, S.; Bonafonte, A.; Serra, J. SEGAN: Speech enhancement generative adversarial network. arXiv preprint arXiv:1703.09452 2017.
- Ye, S.; Hu, X.; Xu, X. TDCGAN: Temporal dilated convolutional generative adversarial network for end-to-end speech enhancement. arXiv preprint arXiv:2008.07787 2020.
- Fu, S.-W.; Yu, C.; Hsieh, T.-A.; Plantinga, P.; Ravanelli, M.; Lu, X.; Tsao, Y. MetricGAN+: An improved version of MetricGAN for speech enhancement. arXiv preprint arXiv:2104.03538 2021.
- Serrà, J.; Pascual, S.; Pons, J.; Araz, R.O.; Scaini, D. Universal Speech Enhancement with Score-based Diffusion. arXiv preprint arXiv:2206.03065 2022.
- Dang, F.; Chen, H.; Zhang, P. DPT-FSNet: Dual-path transformer based full-band and sub-band fusion network for speech enhancement. In Proceedings of the ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2022; pp. 6857-6861.
- Yu, G.; Li, A.; Wang, H.; Wang, Y.; Ke, Y.; Zheng, C. DBT-Net: Dual-branch federative magnitude and phase estimation with attention-in-attention transformer for monaural speech enhancement. arXiv preprint arXiv:2202.07931 2022.
- Isola, P.; Zhu, J.-Y.; Zhou, T.; Efros, A.A. Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017; pp. 1125-1134.

| Method | PESQ | CSIG | CBAK | COVL | SSNR | STOI |
|---|---|---|---|---|---|---|
| CGA-MGAN | 3.47 | 4.56 | 3.86 | 4.06 | 11.09 | 0.96 |
| w/o Conv. Block | 3.37 | 4.50 | 3.80 | 3.97 | 10.97 | 0.96 |
| Using Conformer | 3.33 | 4.43 | 3.72 | 3.91 | 10.18 | 0.96 |
| w/o Gating Decoders | 3.43 | 4.52 | 3.83 | 4.02 | 10.99 | 0.96 |
| Mag-only | 3.42 | 4.54 | 3.80 | 4.03 | 10.73 | 0.96 |
| w/o Discriminator | 3.37 | 4.54 | 3.79 | 4.00 | 10.83 | 0.96 |
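
To make the ablation table easier to read, the sketch below computes how far each variant falls below the full CGA-MGAN model on every metric. All scores are copied from the table; the variable names are illustrative only.

```python
metrics = ["PESQ", "CSIG", "CBAK", "COVL", "SSNR", "STOI"]

# Scores copied from the ablation table above.
full = [3.47, 4.56, 3.86, 4.06, 11.09, 0.96]
variants = {
    "w/o Conv. Block": [3.37, 4.50, 3.80, 3.97, 10.97, 0.96],
    "Using Conformer": [3.33, 4.43, 3.72, 3.91, 10.18, 0.96],
    "w/o Gating Decoders": [3.43, 4.52, 3.83, 4.02, 10.99, 0.96],
    "Mag-only": [3.42, 4.54, 3.80, 4.03, 10.73, 0.96],
    "w/o Discriminator": [3.37, 4.54, 3.79, 4.00, 10.83, 0.96],
}

# Drop of each variant relative to the full model (positive = worse than full).
drops = {
    name: [round(f - v, 2) for f, v in zip(full, scores)]
    for name, scores in variants.items()
}
for name, values in drops.items():
    print(name, dict(zip(metrics, values)))

# Variant with the largest PESQ drop (PESQ is the first metric).
worst = max(drops, key=lambda name: drops[name][0])
print("Largest PESQ drop:", worst, drops[worst][0])
```

On these numbers, every variant scores at or below the full model on all six metrics, and replacing the proposed block with a conformer produces the largest PESQ drop (0.14).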
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).