Submitted:
22 May 2024
Posted:
22 May 2024
Abstract
Keywords:
1. Introduction
- The structure of the original neural beamforming is revised. Specifically, two main changes are made. First, the number of time-frequency domain path scanning blocks in the neural network is reduced from the original 6 to 3. This simplification improves the training efficiency and inference speed of the model while reducing its complexity and resource consumption. Second, an iteratively refined separation method is proposed, which combines the initially separated speech with the original mixed signal as an auxiliary input to the iterative network. By repeating this process over N iteration stages, the MVDR beamformer and the post-separation network promote each other, and the separation results are effectively improved.
- The proposed method is evaluated not only at each stage of the multi-stage iterative process, but also with a broader set of evaluation metrics for a more comprehensive assessment. The experimental results show that the proposed method performs well on the spatialized version of the WSJ0-2mix corpus and clearly outperforms current popular methods. In addition, the proposed method also performs well on the dereverberation task.
2. Proposed Method
2.1. Signal Model
2.2. Initial Separation
In the encoder section, the mixed signal $x_c$ is first transformed into the time-frequency representation $Y_c$ by the short-time Fourier transform (STFT). This representation is then passed through a dynamic range compression (DRC) module to obtain $Y_{\mathrm{DRC}}$. Subsequently, local features $Y_{\mathrm{Conv}}$ are extracted from $Y_{\mathrm{DRC}}$ by a 2D convolutional (Conv2D) layer. Finally, these features are passed through a rectified linear unit (ReLU) activation function to obtain the encoded feature $E_c$, so that the whole encoder section can be expressed as $E_c = \mathrm{ReLU}(\mathrm{Conv2D}(\mathrm{DRC}(\mathrm{STFT}(x_c))))$.
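For illustration, a minimal PyTorch sketch of this encoder is given below. The power-law form of the DRC, the Hann analysis window, and the real/imaginary stacking are assumptions of the example, not details confirmed by the paper; only the STFT settings (32 ms frame, 16 ms hop at 8 kHz) and the (7, 7) kernel follow Section 3.2.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Sketch of the per-channel encoder: STFT -> DRC -> Conv2D -> ReLU."""
    def __init__(self, n_fft=256, hop=128, channels=64, p=0.5):
        super().__init__()
        # 32 ms frame / 16 ms hop at 8 kHz -> n_fft = 256, hop = 128 (Section 3.2)
        self.n_fft, self.hop, self.p = n_fft, hop, p
        # (7, 7) kernel for local feature extraction; real/imag stacked as 2 input maps
        self.conv = nn.Conv2d(2, channels, kernel_size=(7, 7), padding=3)
        self.act = nn.ReLU()

    def forward(self, x_c):                      # x_c: (batch, samples), one microphone
        Y = torch.stft(x_c, self.n_fft, self.hop,
                       window=torch.hann_window(self.n_fft, device=x_c.device),
                       return_complex=True)      # (batch, freq, time)
        # DRC: power-law compression of the magnitude, phase kept (assumed form)
        Y_drc = Y.abs().clamp_min(1e-8) ** self.p * torch.exp(1j * Y.angle())
        feat = torch.stack([Y_drc.real, Y_drc.imag], dim=1)   # (batch, 2, freq, time)
        return self.act(self.conv(feat))         # encoded feature E_c
```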
In the separator section, the encoded feature $E_c$ is first normalized by layer normalization (LN) and then passed through a Conv2D layer to obtain an intermediate feature. This feature is fed into N time-frequency domain scanning blocks that use a time-frequency scanning mechanism [21,22]. Each scanning block consists of two recurrent modules: the first applies a bi-directional LSTM (BLSTM) layer along the frequency axis, while the second applies a BLSTM along the time axis. Both modules include reshaping, LN, and fully connected (FC) operations. Finally, the features are refined by a Conv2D layer and a ReLU activation function, yielding the separated mask $M_{c,q}$, so that the whole separator section can be expressed as $M_{c,q} = \mathrm{Separator}\{E_c\}_c$, where $\mathrm{Separator}\{\cdot\}_c$ denotes the separator corresponding to the signal of the cth microphone and $M_{c,q}$ denotes the mask of the qth speaker at the cth microphone.
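The PyTorch sketch below shows one plausible form of a single time-frequency scanning block, assuming residual connections and the order BLSTM → FC → LN used in dual-path style models [21,22]; the exact arrangement in the authors' network may differ.

```python
import torch
import torch.nn as nn

class TFScanBlock(nn.Module):
    """Sketch of one time-frequency scanning block: one BLSTM scans the
    frequency axis, a second BLSTM scans the time axis; each scan is followed
    by a linear projection, LayerNorm and a residual connection (assumed)."""
    def __init__(self, channels=64, hidden=128):
        super().__init__()
        self.freq_rnn = nn.LSTM(channels, hidden, bidirectional=True, batch_first=True)
        self.freq_fc = nn.Linear(2 * hidden, channels)
        self.freq_norm = nn.LayerNorm(channels)
        self.time_rnn = nn.LSTM(channels, hidden, bidirectional=True, batch_first=True)
        self.time_fc = nn.Linear(2 * hidden, channels)
        self.time_norm = nn.LayerNorm(channels)

    def forward(self, x):                         # x: (batch, C, F, T)
        B, C, F, T = x.shape
        # scan along frequency: treat every time frame as a sequence over F
        y = x.permute(0, 3, 2, 1).reshape(B * T, F, C)
        y = self.freq_norm(self.freq_fc(self.freq_rnn(y)[0]))
        x = x + y.reshape(B, T, F, C).permute(0, 3, 2, 1)
        # scan along time: treat every frequency bin as a sequence over T
        y = x.permute(0, 2, 3, 1).reshape(B * F, T, C)
        y = self.time_norm(self.time_fc(self.time_rnn(y)[0]))
        x = x + y.reshape(B, F, T, C).permute(0, 3, 1, 2)
        return x
```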
The separated masks are then element-wise multiplied (Hadamard product, denoted ⊙) with the encoded feature $E_c$ of the cth microphone to obtain the separated feature representation of each speaker.
In the decoder section, each separated feature is passed through a Conv2D layer, inverse DRC (IDRC), and inverse STFT (ISTFT) to obtain the finally separated waveform $\hat{s}^{(0)}_{c,q}$ of the qth speaker at the cth microphone, where $\mathrm{Decoder}\{\cdot\}_q$ denotes the decoder of the qth speaker and the superscript (0) denotes the first stage.
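A corresponding decoder sketch, under the same assumptions as the encoder example above (power-law DRC, a hypothetical `conv` layer mapping the masked feature back to real/imaginary maps):

```python
import torch

def decode(mask, E_c, conv, n_fft=256, hop=128, p=0.5, length=None):
    """Sketch: apply the estimated mask to the encoded feature, then invert
    the encoder (Conv2D -> inverse DRC -> inverse STFT)."""
    S = mask * E_c                                   # Hadamard product
    ri = conv(S)                                     # assumed output: (batch, 2, F, T)
    Y = torch.complex(ri[:, 0], ri[:, 1])
    # inverse DRC: undo the power-law magnitude compression (assumed form)
    Y = Y.abs().clamp_min(1e-8) ** (1.0 / p) * torch.exp(1j * Y.angle())
    window = torch.hann_window(n_fft, device=Y.device)
    return torch.istft(Y, n_fft, hop, window=window, length=length)
```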
2.3. Iterative Separation
2.4. Loss Function
3. Experimental Setup
3.1. Datasets and Microphone Structure
The effectiveness of the proposed method is evaluated on the spatialized version of the WSJ0-2mix dataset given in [23]. This dataset contains 20,000 training utterances (about 30 hours), 5,000 validation utterances (about 10 hours), and 3,000 test utterances (about 5 hours). All utterances in the training and validation sets are either extended or truncated to four seconds, and the sampling rate of all audio data is 8 kHz. The dataset comprises "min" and "max" versions: in the "min" version, the mixture is truncated to the duration of the shorter utterance, while in the "max" version it is extended to the duration of the longer utterance. The "min" version is used for training and validation, and the "max" version is used for testing, in order to remain consistent with the baseline methods. When mixing the speech of two speakers, the signal-to-interference ratio (SIR) varies from -5 dB to +5 dB. The adjusted speech signals are then convolved with RIRs to simulate the reverberation of real environments. The RIRs are generated with the image method proposed by Allen and Berkley in 1979 [24]. During the simulation, the length and width of the room vary from 5 m to 10 m, the height varies from 3 m to 4 m, the reverberation time varies from 0.2 s to 0.6 s, and the positions of the microphones and speakers are selected randomly. The microphone array consists of 8 omnidirectional microphones placed on or inside a virtual sphere whose center is located roughly at the center of the room and whose radius is randomly selected from 7.5 cm to 12.5 cm. The first four microphones are used for training and validation: two are symmetrically positioned on the surface of the sphere, while the other two are randomly positioned inside it. The last four microphones are used for testing and are randomly positioned within the area defined by the first two microphones, which allows the performance of the model to be evaluated in unseen microphone configurations.
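As an illustration of this data generation pipeline, the sketch below simulates one reverberant 8-channel mixture with the image-source method via the pyroomacoustics package. The package choice, the placeholder source signals, and the simplified (non-spherical) microphone placement are assumptions of the example; the paper's actual spatialization scripts are not described beyond the parameters quoted above.

```python
import numpy as np
import pyroomacoustics as pra

rng = np.random.default_rng(0)

# Room geometry and reverberation time drawn as described in Section 3.1
room_dim = [rng.uniform(5, 10), rng.uniform(5, 10), rng.uniform(3, 4)]
rt60 = rng.uniform(0.2, 0.6)

# Image-source method (Allen & Berkley); pyroomacoustics is one possible tool
e_absorption, max_order = pra.inverse_sabine(rt60, room_dim)
room = pra.ShoeBox(room_dim, fs=8000,
                   materials=pra.Material(e_absorption), max_order=max_order)

# Two speakers at random positions (placeholder noise instead of WSJ0 speech)
for _ in range(2):
    src_pos = [rng.uniform(0.5, d - 0.5) for d in room_dim]
    room.add_source(src_pos, signal=rng.standard_normal(8000 * 4))

# Simplified 8-microphone placement; the paper samples positions on/in a sphere
mic_pos = np.stack([[rng.uniform(0.5, d - 0.5) for d in room_dim]
                    for _ in range(8)], axis=1)          # shape (3, 8)
room.add_microphone_array(mic_pos)

room.simulate()                     # convolves the sources with the simulated RIRs
mixture = room.mic_array.signals    # (8, samples) multi-channel reverberant mixture
```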
3.2. Model Configuration
The basic model configuration for the experiments is as follows; most settings are the same as the original ones [22]. The STFT uses a 32 ms frame length and a 16 ms hop size. In the encoder section, the kernel size of the Conv2D layer is set to (7, 7) to extract local features, while in the other sections the Conv2D kernel size is set to (1, 1). In the separator section, the number of time-frequency domain path scanning blocks is reduced from the original 6 to 3 to simplify the model. Each block contains two BLSTM layers, each with 128 hidden units. In addition, a parameter-sharing strategy is adopted, i.e., the same parameters are reused across the iterative separation stages. This strategy reduces the total number of parameters in the model and thus its computational requirements.
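These hyper-parameters can be summarized in a small configuration dictionary; the structure and key names below are illustrative, not taken from the authors' code.

```python
# Hypothetical configuration collecting the hyper-parameters of Section 3.2
CONFIG = {
    "sample_rate": 8000,
    "stft": {"frame_ms": 32, "hop_ms": 16},        # 256-point frame, 128-sample hop
    "encoder": {"conv_kernel": (7, 7)},            # (1, 1) kernels elsewhere
    "separator": {
        "num_scan_blocks": 3,                      # reduced from the original 6
        "blstm_hidden": 128,
        "share_parameters_across_iterations": True,
    },
    "num_speakers": 2,
}
```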
3.3. Training Configuration
During model training, the batch size is set to 1. Utterance-level permutation invariant training (uPIT) is applied to address the source permutation problem. The Adam optimizer is used with a learning rate of 1×10^-3, and the maximum norm for gradient clipping is set to 5. Our networks and the comparison networks are trained for 150 epochs to ensure a fair comparison.
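A minimal sketch of such a training step is shown below, assuming a negative SI-SNR objective under uPIT (the extracted text does not state the exact criterion of Section 2.4) together with the Adam and gradient-clipping settings quoted above; `model`, `mix`, and `refs` are hypothetical placeholders.

```python
import itertools
import torch

def si_snr(est, ref, eps=1e-8):
    """Scale-invariant SNR in dB over the last axis."""
    est = est - est.mean(dim=-1, keepdim=True)
    ref = ref - ref.mean(dim=-1, keepdim=True)
    proj = (est * ref).sum(-1, keepdim=True) / (ref.pow(2).sum(-1, keepdim=True) + eps) * ref
    noise = est - proj
    return 10 * torch.log10(proj.pow(2).sum(-1) / (noise.pow(2).sum(-1) + eps) + eps)

def upit_loss(est, ref):
    """Utterance-level PIT: pick the speaker permutation with the best mean SI-SNR.
    est, ref: (batch, n_spk, samples)."""
    n_spk = est.shape[1]
    losses = [-si_snr(est[:, list(perm)], ref).mean(dim=1)
              for perm in itertools.permutations(range(n_spk))]
    return torch.stack(losses, dim=1).min(dim=1).values.mean()

# One training step with the settings of Section 3.3 (model/data are assumed):
# optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
# loss = upit_loss(model(mix), refs)
# loss.backward()
# torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=5)
# optimizer.step(); optimizer.zero_grad()
```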
3.4. Evaluation Metrics
The SDR from the blind source separation evaluation (BSS-Eval) toolkit [25] and the scale-invariant signal-to-distortion ratio (SI-SDR) are chosen as the main objective measures of separation accuracy. Furthermore, the perceptual evaluation of speech quality (PESQ), short-time objective intelligibility (STOI), and SIR are used to further assess the separated speech. During the evaluation, the first microphone is selected as the reference by default.
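For reference, these metrics can be computed with standard open-source tools (mir_eval, pesq, pystoi); the helper below is a sketch of such an evaluation, not the authors' scoring script. PESQ is run in narrow-band mode because the data are sampled at 8 kHz.

```python
import numpy as np
import mir_eval.separation as bss
from pesq import pesq
from pystoi import stoi

def evaluate(ref, est, fs=8000):
    """Sketch of the evaluation; ref/est are (n_spk, samples) arrays
    taken at the reference (first) microphone."""
    sdr, sir, _, perm = bss.bss_eval_sources(ref, est)      # BSS-Eval SDR / SIR
    est = est[perm]                                          # align speaker order
    si_sdrs, pesqs, stois = [], [], []
    for r, e in zip(ref, est):
        r0, e0 = r - r.mean(), e - e.mean()
        p = np.dot(e0, r0) / np.dot(r0, r0) * r0             # projection onto reference
        si_sdrs.append(10 * np.log10(np.sum(p**2) / np.sum((e0 - p)**2)))
        pesqs.append(pesq(fs, r, e, "nb"))
        stois.append(stoi(r, e, fs))
    return {"SDR": sdr.mean(), "SIR": sir.mean(), "SI-SDR": np.mean(si_sdrs),
            "PESQ": np.mean(pesqs), "STOI": np.mean(stois)}
```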
4. Results
4.1. Analysis of Iterative Results
The separation results of the proposed method at different stages are shown in Table 1. For each stage n, the table reports, for each speaker at the first microphone, the speech signal after the TFDPRNN (post-separation) module and the speech signal after the MVDR beamformer.
From Table 1, after the first iteration the SDR of the MVDR output increases by 17.46% and its SI-SDR increases by 22.66%, which shows that the first iteration brings a large gain to the model. After the second iteration, although some improvement is still observed at each iteration, the SDR and SI-SDR of the MVDR output improve very slowly, by less than 1%. This indicates that, as the number of iterations increases, the estimation of the SCM becomes more accurate and gradually approaches the inherent performance upper limit of MVDR beamforming. On the other hand, the post-separation output improves by about 51% in SDR and SI-SDR over the initial separation, and by a further 11% on these two metrics after the next iteration. In the subsequent stages the performance continues to improve, but at a rate of less than 2%. The SIR and PESQ of both outputs improve over the iterations, and their STOI remains at about 0.99. Overall, the performance of the model improves noticeably during the iterative process.
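The quoted percentages can be reproduced from Table 1, assuming the two sub-columns of each metric are the MVDR output and the post-separation output, respectively:

```python
# Relative gains quoted above, recomputed from Table 1 (values in dB):
print((22.07 - 18.79) / 18.79)   # ~0.1746 -> 17.46 % SDR gain of the MVDR output
print((20.95 - 17.08) / 17.08)   # ~0.2266 -> 22.66 % SI-SDR gain of the MVDR output
print((21.84 - 14.41) / 14.41)   # ~0.52   -> ~51 % gain of post-separation over the initial separation
print((24.17 - 21.84) / 21.84)   # ~0.107  -> ~11 % further gain at the next stage
```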
To illustrate the results more intuitively, Figure 3 compares the spectrograms for a single speaker: the original clean speech, the original reverberant speech, the mixed speech, the outputs of the neural network at different stages, and the outputs of the beamformer at different stages. Observing the neural network outputs in the left column, it is clear that as the processing stage increases, the clarity and quality of the spectrograms gradually improve. For example, compared with the spectrogram of the original reverberant signal, certain spectral components are attenuated in the output of the first neural network (e.g., the regions marked by red circles). After iterative processing, these components are gradually recovered in the output of the second neural network. In the right column, the beamformer outputs at different stages show no significant changes or improvements. Comparing the left and right columns, the spectrograms produced by the neural network exhibit relatively higher clarity than those produced by the beamformer, which reveals the potential advantage of neural networks in processing speech signals.
In general, the iteration between the MVDR beamformer and the post-separation network is fully exploited so that each promotes the optimization of the other, which ultimately improves the overall performance. Both the MVDR output and the post-separation output can serve as the final output of the model; however, since Table 1 shows that the post-separation output outperforms the MVDR beamformer, the former is used as the final output. In addition, each additional processing stage increases the real-time factor (RTF) of the model and thus the processing time. Considering this, the post-separation output after the first iteration is used as the evaluation result of our model for comparison with other methods.
4.2. Comparison with Reference Methods
- Filter-and-Sum Network (FaSNet) [26] is a time-domain method that implements beamforming with a neural network. It uses deep learning to automatically learn and optimize the weights and parameters of the beamformer. The core advantage of this method is its adaptability, allowing the network to adjust to the complexity and diversity of speech signals;
- Narrow-band BLSTM (NB-BLSTM) [27] is a frequency-domain method based on a BLSTM network. It focuses on narrow-band processing of individual frequency bands and is trained with full-band permutation invariant training to improve its performance. By processing each narrow-band frequency component separately in the frequency domain, it can effectively identify and separate the individual speakers in overlapped speech;
- Beam-TasNet [9] is a classical speech separation method that combines time-domain and frequency-domain processing. First, a time-domain neural network performs pre-separation. The pre-separated speech signals are then used to estimate the SCM of the beamformer. Finally, the separated signals are obtained by the beamformer;
- Beam-Guided TasNet [19] is a two-stage speech separation method that also combines time-domain and frequency-domain processing. In the first stage, the initial speech separation is performed by Beam-TasNet. In the second stage, the network structure remains the same as Beam-TasNet, but its input additionally includes the output of the first stage. This iterative process further refines the initial separation;
- Beam-TFDPRNN [15] is our previously proposed time-frequency speech separation method, which uses a neural beamforming structure similar to Beam-TasNet. It is more advantageous in reverberant environments because it uses a time-frequency domain network with stronger anti-reverberation ability for the pre-separation.
4.3. Dereverberation Results
5. Conclusions and Discussion
Author Contributions
Funding
Acknowledgments
Conflicts of Interest
References
1. Chen, Z.; Li, J.; Xiao, X.; Yoshioka, T.; Wang, H.; Wang, Z.; Gong, Y. Cracking the Cocktail Party Problem by Multi-Beam Deep Attractor Network. In Proceedings of the 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), December 2017; pp. 437–444.
2. Qian, Y.; Weng, C.; Chang, X.; Wang, S.; Yu, D. Past Review, Current Progress, and Challenges Ahead on the Cocktail Party Problem. Frontiers Inf. Technol. Electronic Eng. 2018, 19, 40–63.
3. Chen, J.; Mao, Q.; Liu, D. Dual-Path Transformer Network: Direct Context-Aware Modeling for End-to-End Monaural Speech Separation. 2020.
4. Luo, Y.; Mesgarani, N. Conv-TasNet: Surpassing Ideal Time-Frequency Magnitude Masking for Speech Separation. IEEE/ACM Trans. Audio Speech Lang. Process. 2019, 27, 1256–1266.
5. Subakan, C.; Ravanelli, M.; Cornell, S.; Bronzi, M.; Zhong, J. Attention Is All You Need in Speech Separation. In Proceedings of the ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, ON, Canada, June 2021; pp. 21–25.
6. Zhao, S.; Ma, Y.; Ni, C.; Zhang, C.; Wang, H.; Nguyen, T.H.; Zhou, K.; Yip, J.; Ng, D.; Ma, B. MossFormer2: Combining Transformer and RNN-Free Recurrent Network for Enhanced Time-Domain Monaural Speech Separation. 2023.
7. Gannot, S.; Vincent, E.; Markovich-Golan, S.; Ozerov, A. A Consolidated Perspective on Multimicrophone Speech Enhancement and Source Separation. IEEE/ACM Trans. Audio Speech Lang. Process. 2017, 25, 692–730.
8. Anguera, X.; Wooters, C.; Hernando, J. Acoustic Beamforming for Speaker Diarization of Meetings. IEEE Trans. Audio Speech Lang. Process. 2007, 15, 2011–2022.
9. Ochiai, T.; Delcroix, M.; Ikeshita, R.; Kinoshita, K.; Nakatani, T.; Araki, S. Beam-TasNet: Time-Domain Audio Separation Network Meets Frequency-Domain Beamformer. In Proceedings of the ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, May 2020; pp. 6384–6388.
10. Zhang, X.; Wang, Z.-Q.; Wang, D. A Speech Enhancement Algorithm by Iterating Single- and Multi-Microphone Processing and Its Application to Robust ASR. In Proceedings of the 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), March 2017; pp. 276–280.
11. Erdogan, H.; Hershey, J.R.; Watanabe, S.; Mandel, M.I.; Roux, J.L. Improved MVDR Beamforming Using Single-Channel Mask Prediction Networks. In Proceedings of Interspeech 2016, September 2016; pp. 1981–1985.
12. Gu, R.; Zhang, S.-X.; Zou, Y.; Yu, D. Towards Unified All-Neural Beamforming for Time and Frequency Domain Speech Separation. IEEE/ACM Trans. Audio Speech Lang. Process. 2023, 31, 849–862.
13. Xiao, X.; Zhao, S.; Jones, D.L.; Chng, E.S.; Li, H. On Time-Frequency Mask Estimation for MVDR Beamforming with Application in Robust Speech Recognition. In Proceedings of the 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), New Orleans, LA, USA, March 2017; pp. 3246–3250.
14. Luo, Y. A Time-Domain Real-Valued Generalized Wiener Filter for Multi-Channel Neural Separation Systems. IEEE/ACM Trans. Audio Speech Lang. Process. 2022, 30, 3008–3019.
15. Zhang, X.; Bao, C.; Zhou, J.; Yang, X. A Beam-TFDPRNN Based Speech Separation Method in Reverberant Environments. In Proceedings of the 2023 IEEE International Conference on Signal Processing, Communications and Computing (ICSPCC), Zhengzhou, China, November 2023; pp. 1–5.
16. Kavalerov, I.; Wisdom, S.; Erdogan, H.; Patton, B.; Wilson, K.; Roux, J.L.; Hershey, J.R. Universal Sound Separation. 2019.
17. Tzinis, E.; Wisdom, S.; Hershey, J.R.; Jansen, A.; Ellis, D.P.W. Improving Universal Sound Separation Using Sound Classification. In Proceedings of the ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, May 2020; pp. 96–100.
18. Shi, Z.; Liu, R.; Han, J. LaFurca: Iterative Refined Speech Separation Based on Context-Aware Dual-Path Parallel Bi-LSTM. 2020.
19. Chen, H.; Yi, Y.; Feng, D.; Zhang, P. Beam-Guided TasNet: An Iterative Speech Separation Framework with Multi-Channel Output. arXiv:2102.02998, 2022.
20. Wang, Z.-Q.; Erdogan, H.; Wisdom, S.; Wilson, K.; Raj, D.; Watanabe, S.; Chen, Z.; Hershey, J.R. Sequential Multi-Frame Neural Beamforming for Speech Separation and Enhancement. In Proceedings of the 2021 IEEE Spoken Language Technology Workshop (SLT), Shenzhen, China, January 2021; pp. 905–911.
21. Yang, L.; Liu, W.; Wang, W. TFPSNet: Time-Frequency Domain Path Scanning Network for Speech Separation. In Proceedings of the ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Singapore, May 2022.
22. Yang, X.; Bao, C.; Zhang, X.; Chen, X. Monaural Speech Separation Method Based on Recurrent Attention with Parallel Branches. In Proceedings of INTERSPEECH 2023, August 2023; pp. 3794–3798.
23. Wang, Z.-Q.; Le Roux, J.; Hershey, J.R. Multi-Channel Deep Clustering: Discriminative Spectral and Spatial Embeddings for Speaker-Independent Speech Separation. In Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB, Canada, April 2018; pp. 1–5.
24. Allen, J.B.; Berkley, D.A. Image Method for Efficiently Simulating Small-Room Acoustics. The Journal of the Acoustical Society of America 1979, 65, 943–950.
25. Févotte, C.; Gribonval, R.; Vincent, E. BSS_EVAL Toolbox User Guide, Revision 2.0. 2011.
26. Luo, Y.; Ceolini, E.; Han, C.; Liu, S.-C.; Mesgarani, N. FaSNet: Low-Latency Adaptive Beamforming for Multi-Microphone Audio Processing. arXiv:1909.13387, 2019.
27. Quan, C.; Li, X. Multi-Channel Narrow-Band Deep Speech Separation with Full-Band Permutation Invariant Training. In Proceedings of the ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Singapore, May 2022; pp. 541–545.
28. Chen, Z.; Yoshioka, T.; Lu, L.; Zhou, T.; Meng, Z.; Luo, Y.; Wu, J.; Xiao, X.; Li, J. Continuous Speech Separation: Dataset and Analysis. 2020.
29. Maciejewski, M.; Wichern, G.; McQuinn, E.; Roux, J.L. WHAMR!: Noisy and Reverberant Single-Channel Speech Separation. In Proceedings of the ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, May 2020; pp. 696–700.



| Stage | RTF | SDR (dB) MVDR | SDR (dB) post-sep. | SI-SDR (dB) MVDR | SI-SDR (dB) post-sep. | SIR (dB) MVDR | SIR (dB) post-sep. | PESQ MVDR | PESQ post-sep. | STOI MVDR | STOI post-sep. |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.024 | - | 14.41 | - | 13.91 | - | 30.67 | - | 4.13 | - | 0.98 |
| 1 | 0.049 | 18.79 | 21.84 | 17.08 | 21.13 | 27.21 | 30.67 | 3.93 | 4.13 | 0.99 | 0.98 |
| 2 | 0.074 | 22.07 | 24.17 | 20.95 | 23.48 | 33.21 | 33.21 | 3.93 | 4.23 | 0.99 | 0.99 |
| 3 | 0.104 | 22.10 | 24.48 | 21.07 | 23.84 | 33.23 | 33.93 | 3.99 | 4.25 | 0.99 | 0.99 |
| 4 | 0.125 | 22.21 | 24.91 | 21.20 | 24.26 | 33.80 | 34.36 | 3.98 | 4.26 | 0.99 | 0.99 |
| 5 | 0.148 | 22.31 | 24.60 | 21.34 | 23.98 | 34.05 | 34.24 | 4.00 | 4.25 | 0.99 | 0.99 |
| Method | Param. | SDR (dB) | SI-SDR (dB) | PESQ | SIR (dB) | STOI |
|---|---|---|---|---|---|---|
| FaSNet | 2.8 M | 11.96 | 11.69 | 3.16 | 18.97 | 0.93 |
| NB-BLSTM | 1.2 M | 8.22 | 6.90 | 2.44 | 12.13 | 0.83 |
| Beam-TasNet | 5.4 M | 17.40 | - | - | - | - |
| Beam-Guided TasNet | 5.5 M | 20.52 | 19.49 | 3.88 | 27.49 | 0.98 |
| Beam-TFDPRNN | 2.7 M | 17.20 | 16.80 | 3.68 | 26.77 | 0.96 |
| iBeam-TFDPRNN | 2.8 M | 24.17 | 23.48 | 4.23 | 33.21 | 0.99 |
| Method | SDR (dB) |  |
|---|---|---|
| Beam-TasNet | 10.8 | 14.6 |
| Beam-Guided TasNet | 16.5 | 17.1 |
| iBeam-TFDPRNN | 20.2 | 19.7 |
| Oracle mask-based MVDR | 11.4 | 12.0 |
| Oracle signal-based MVDR | ∞ | 21.1 |
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
