Submitted: 26 December 2024
Posted: 27 December 2024
Abstract
Keywords:
1. Introduction
2. Description of the Acoustics of Speech Signals
2.1. Aspects of Speech to Help Estimate Time Stamps
2.2. Speech Aspects to Help Distinguish Among Speakers
3. Analysis of Speech Signals
3.1. Basic Spectral Estimates of Speech Signals
3.2. Advanced Spectral Measures
3.3. Time Windows
3.4. Intonation Features
4. Use of Speech Features for Speaker Diarization
5. Models to Apply for Speaker Diarization
5.1. Hidden Markov Models (HMMs) and Gaussian Mixture Models (GMMs)
5.2. Universal Background Model
5.3. Embeddings
5.4. Joint Factor Analysis
5.5. Neural (Machine Learning) Methods
5.5.1. Basics of ANNs
5.5.2. ANN Training: Loss Functions and Steepest Gradient Descent
5.5.3. Convolutional Neural Networks (CNNs)
5.5.4. Recurrent Neural Networks (RNNs)
5.5.5. Attention
5.5.6. Neural Variants of Speaker Embeddings
6. Speech Activity Detection (SAD)
7. Clustering Frames by Speaker Turns
8. Summary of Speaker Diarization Methods
8.1. Modular SD
8.2. Recent ANNs for SD
9. Ways to Evaluate Speaker Diarization
10. Datasets for Speaker Diarization
11. Discussion
12. Conclusions
References
- Han, K.; Wang, D. A classification based approach to speech segregation. The Journal of the Acoustical Society of America 2012, 132, 3475–3483. [Google Scholar] [CrossRef] [PubMed]
- Le Roux, J.; Wisdom, S.; Erdogan, H.; Hershey, J.R. SDR – half-baked or well done? ICASSP 2019, 626–630.
- Sell, G.; Snyder, D.; McCree, A.; Garcia-Romero, D.; Villalba, J.; Maciejewski, M.; Manohar, V.; Dehak, N.; Povey, D.; Watanabe, S.; Khudanpur, S. Diarization is Hard: Some Experiences and Lessons Learned for the JHU Team in the Inaugural DIHARD Challenge. Interspeech 2018, 2808–2812. [Google Scholar]
- Von Neumann, T.; Kinoshita, K.; Delcroix, M.; Araki, S.; Nakatani, T.; Haeb-Umbach, R. All-neural online source separation, counting, and diarization for meeting analysis. ICASSP 2019, 91–95. [Google Scholar]
- Raj, D.; Garcia-Perera, L.P.; Huang, Z.; Watanabe, S.; Povey, D.; Stolcke, A.; Khudanpur, S. DOVER-lap: A method for combining overlap-aware diarization outputs. IEEE Spoken Language Technology Workshop 2021, 881–888. [Google Scholar]
- Kanda, N.; Xiao, X.; Gaur, Y.; Wang, X.; Meng, Z.; Chen, Z.; Yoshioka, T. Transcribe-to-diarize: Neural speaker diarization for unlimited number of speakers using end-to-end speaker-attributed ASR. ICASSP 2022, 8082–8086. [Google Scholar]
- Kanda, N.; Gaur, Y.; Wang, X.; Meng, Z.; Chen, Z.; Zhou, T.; Yoshioka, T. Joint speaker counting, speech recognition, and speaker identification for overlapped speech of any number of speakers. Interspeech, 2020. [Google Scholar]
- Haykin, S.; Chen, Z. The cocktail party problem. Neural computation 2005, 17, 1875–1902. [Google Scholar] [CrossRef]
- Arons, B. A review of the cocktail party effect. Journal of the American Voice I/O Society 1992, 12, 35–50. [Google Scholar]
- Bronkhorst, A.W. The cocktail party phenomenon: A review of research on speech intelligibility in multiple-talker conditions. Acta Acustica united with Acustica 2000, 86, 117–128. [Google Scholar]
- Bregman, A.S. Auditory Scene Analysis: The Perceptual Organization of Sound; MIT Press, 1990.
- Chen, Z.; Yoshioka, T.; Lu, L.; Zhou, T.; Meng, A.; Luo, Y.; Wu, J.; Xiao, X.; Li, J. Continuous speech separation: Dataset and analysis. ICASSP 2020, 7284–7288. [Google Scholar]
- Luo, Y.; Mesgarani, N. Conv-Tasnet: Surpassing ideal time–frequency magnitude masking for speech separation. IEEE/ACM Transactions on Audio, Speech, and Language Processing 2019, 27, 1256–1266. [Google Scholar] [CrossRef]
- Kinoshita, K.; Delcroix, M.; Araki, S.; Nakatani, T. Tackling real noisy reverberant meetings with all-neural source separation, counting, and diarization system. ICASSP 2020, 381–385. [Google Scholar]
- Shafey, L.E.; Soltau, H.; Shafran, I. Joint speech recognition and speaker diarization via sequence transduction. Interspeech, 2019. [Google Scholar]
- Xiao, X.; Kanda, N.; Chen, Z.; Zhou, T.; Yoshioka, T.; Chen, S.; Zhao, Y.; Liu, G.; Wu, Y.; Wu, J.; Liu, S. Microsoft speaker diarization system for the VoxCeleb speaker recognition challenge 2020. ICASSP 2021, 5824–5828. [Google Scholar]
- Anguera, X.; Bozonnet, S.; Evans, N.; Fredouille, C.; Friedland, G.; Vinyals, O. Speaker diarization: A review of recent research. IEEE Transactions on Audio, Speech, and Language Processing 2012, 20, 356–370.
- Tranter, S.E.; Reynolds, D.A. An overview of automatic speaker diarization systems. IEEE Transactions on Audio, Speech, and Language Processing 2006, 14, 1557–1565. [Google Scholar] [CrossRef]
- Moattar, M.H.; Homayounpour, M.M. A review on speaker diarization systems and approaches. Speech Communication 2012, 54, 1065–1103.
- Park, T.J.; Kanda, N.; Dimitriadis, D.; Han, K.J.; Watanabe, S.; Narayanan, S. A review of speaker diarization: Recent advances with deep learning. Computer Speech & Language, 2022. [Google Scholar]
- Chang, X.; Zhang, W.; Qian, Y.; Le Roux, J.; Watanabe, S. MIMO-Speech: End-to-end multi-channel multi-speaker speech recognition. IEEE Automatic Speech Recognition and Understanding Workshop 2019, 237–244. [Google Scholar]
- Yoshioka, T.; Erdogan, H.; Chen, Z.; Alleva, F. Multi-microphone neural speech separation for far-field multi-talker speech recognition. ICASSP 2018, 5739–5743. [Google Scholar]
- O’Shaughnessy, D. Speech Communications: Human and Machine; IEEE Press, 2000.
- Dissen, Y.; Goldberger, J.; Keshet, J. Formant estimation and tracking: A deep learning approach. The Journal of the Acoustical Society of America 2019, 145, 642–653. [Google Scholar] [CrossRef]
- Hu, G.; Wang, D. Monaural speech segregation based on pitch tracking and amplitude modulation. IEEE Transactions on Neural Networks 2004, 15, 1135–1150. [Google Scholar] [CrossRef]
- Spanias, A.S. Speech coding: A tutorial review. Proceedings of the IEEE 1994, 82, 1541–1582. [Google Scholar] [CrossRef]
- Davis, S.B.; Mermelstein, P. Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences. IEEE Transactions on Acoustics, Speech, and Signal Processing 1980, 28, 357–366.
- Makhoul, J. Linear prediction: A tutorial review. Proceedings of the IEEE 1975, 63, 561–580. [Google Scholar] [CrossRef]
- Rabiner, L.; Cheng, M.; Rosenberg, A.; McGonegal, C. A comparative performance study of several pitch detection algorithms. IEEE Transactions on Acoustics, Speech, and Signal Processing 1976, 24, 399–418. [Google Scholar] [CrossRef]
- Adami, A.G.; Mihaescu, R.; Reynolds, D.A.; Godfrey, J.J. Modeling prosodic dynamics for speaker recognition. ICASSP, 2003. [Google Scholar]
- Zelenák, M.; Hernando, J. The detection of overlapping speech with prosodic features for speaker diarization. ICSLP, 2011.
- Friedland, G.; Vinyals, O.; Huang, Y.; Muller, C. Prosodic and other long-term features for speaker diarization. IEEE Transactions on Audio, Speech, and Language Processing 2009, 17, 985–993. [Google Scholar] [CrossRef]
- Rabiner, L.R. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE 1989, 77, 257–286. [Google Scholar] [CrossRef]
- Diez, M.; Burget, L.; Wang, S.; Rohdin, J.; Cernocký, J. Bayesian HMM Based x-Vector Clustering for Speaker Diarization. Interspeech 2019, 346–350. [Google Scholar]
- Geiger, J.; Wallhoff, F.; Rigoll, G. GMM-UBM based open-set online speaker diarization. Interspeech, 2010. [Google Scholar]
- Singh, P.; Vardhan, H.; Ganapathy, S.; Kanagasundaram, A. LEAP diarization system for the second DIHARD challenge. Interspeech 2019, 983–987. [Google Scholar]
- Snyder, D.; Ghahremani, P.; Povey, D.; Garcia-Romero, D.; Carmiel, Y.; Khudanpur, S. Deep neural network-based speaker embeddings for end-to-end speaker verification. IEEE Spoken Language Technology Workshop 2016, 165–170. [Google Scholar]
- Sell, G.; Garcia-Romero, D. Speaker diarization with PLDA i-vector scoring and unsupervised calibration. IEEE Spoken Language Technology Workshop 2014, 413–417. [Google Scholar]
- Fergani, B.; Davy, M.; Houacine, A. Speaker diarization using one-class support vector machines. Speech Communication 2008, 50, 355–365. [Google Scholar] [CrossRef]
- Kenny, P.; Stafylakis, T.; Ouellet, P.; Alam, H.J. JFA-based front ends for speaker recognition. ICASSP 2014, 1705–1709. [Google Scholar]
- LeCun, Y.; Bengio, Y.; Hinton, G. Deep learning. Nature 2015, 521, 436–444.
- Yella, S.H.; Stolcke, A.; Slaney, M. Artificial neural network features for speaker diarization. IEEE Spoken Language Technology Workshop 2014, 402–406. [Google Scholar]
- Fini, E.; Brutti, A. Supervised online diarization with sample mean loss for multi-domain data. ICASSP 2020, 7134–7138. [Google Scholar]
- Deng, J.; Guo, J.; Xue, N.; Zafeiriou, S. ArcFace: Additive angular margin loss for deep face recognition. IEEE/CVF Conference on Computer Vision and Pattern Recognition 2019, 4690–4699.
- Kolbæk, M.; Yu, D.; Tan, Z.H.; Jensen, J. Multi-talker speech separation with utterance-level permutation invariant training of deep recurrent neural networks. IEEE/ACM Transactions on Audio, Speech, and Language Processing 2017, 25, 1901–1913. [Google Scholar] [CrossRef]
- Park, T.J.; Han, K.J.; Kumar, M.; Narayanan, S. Auto-tuning spectral clustering for speaker diarization using normalized maximum eigengap. IEEE Signal Processing Letters 2019, 27, 381–385. [Google Scholar] [CrossRef]
- Huang, Z.; Watanabe, S.; Fujita, Y.; García, P.; Shao, Y.; Povey, D.; Khudanpur, S. Speaker diarization with region proposal network. ICASSP 2020, 6514–6518. [Google Scholar]
- Liu, Y.C.; Han, E.; Lee, C.; Stolcke, A. End-to-end neural diarization: From transformer to conformer. arXiv:2106.07167, 2021.
- Zazo, R.; Lozano-Diez, A.; Gonzalez-Dominguez, J.; Toledano, D.; Gonzalez-Rodriguez, J. Language identification in short utterances using long short-term memory (LSTM) recurrent neural networks. PLoS ONE, 2016. [Google Scholar]
- Wang, Q.; Downey, C.; Wan, L.; Mansfield, P.A.; Moreno, I.L. Speaker diarization with LSTM. ICASSP 2018, 5239–5243. [Google Scholar]
- Fujita, Y.; Kanda, N.; Horiguchi, S.; Xue, Y.; Nagamatsu, K.; Watanabe, S. End-to-end neural speaker diarization with self-attention. ASRU 2019, 296–303. [Google Scholar]
- Subakan, C.; Ravanelli, M.; Cornell, S.; Bronzi, M.; Zhong, J. Attention is all you need in speech separation. ICASSP 2021, 21–25. [Google Scholar]
- Garcia-Romero, D.; Snyder, D.; Sell, G.; Povey, D.; McCree, A. Speaker diarization using deep neural network embeddings. ICASSP 2017, 4930–4934. [Google Scholar]
- Landini, F.; Profant, J.; Diez, M.; Burget, L. Bayesian HMM clustering of x-vector sequences (VBx) in speaker diarization: theory, implementation and analysis on standard tasks. Computer Speech & Language, 2022. [Google Scholar]
- Senoussaoui, M.; Kenny, P.; Stafylakis, T.; Dumouchel, P. A study of the cosine distance-based mean shift for telephone speech diarization. IEEE/ACM Transactions on Audio, Speech, and Language Processing 2013, 22, 217–227. [Google Scholar] [CrossRef]
- Medennikov, I.; Korenevsky, M.; Prisyach, T.; Khokhlov, Y.; Korenevskaya, M.; Sorokin, I.; Timofeeva, T.; Mitrofanov, A.; Andrusenko, A.; Podluzhny, I.; Laptev, A. Target-speaker voice activity detection: a novel approach for multi-speaker diarization in a dinner party scenario. Interspeech, 2020. [Google Scholar]
- Padmanabhan, M.; Bahl, L.R.; Nahamoo, D.; Picheny, M.A. Speaker clustering and transformation for speaker adaptation in speech recognition systems. IEEE Transactions on Speech and Audio Processing 1998, 6, 71–77. [Google Scholar] [CrossRef]
- Shum, S.H.; Dehak, N.; Dehak, R.; Glass, J.R. Unsupervised methods for speaker diarization: An integrated and iterative approach. IEEE Transactions on Audio, Speech, and Language Processing 2013, 21, 2015–2028.
- Li, Q.; Kreyssig, F.L.; Zhang, C.; Woodland, P.C. Discriminative neural clustering for speaker diarisation. IEEE Spoken Language Technology Workshop 2021, 574–581. [Google Scholar]
- Lin, Q.; Yin, R.; Li, M.; Bredin, H.; Barras, C. LSTM based similarity measurement with spectral clustering for speaker diarization. Interspeech, 2019. [Google Scholar]
- Xia, W.; Lu, H.; Wang, Q.; Tripathi, A.; Huang, Y.; Moreno, I.L.; Sak, H. Turn-to-diarize: Online speaker diarization constrained by transformer transducer speaker turn detection. ICASSP 2022, 8077–8081. [Google Scholar]
- Gish, H.; Siu, M.H.; Rohlicek, J.R. Segregation of speakers for speech recognition and speaker identification. ICASSP 1991, 873–876. [Google Scholar]
- Watanabe, S. A widely applicable Bayesian information criterion. The Journal of Machine Learning Research 2013, 14, 867–897. [Google Scholar]
- Siegler, M.A.; Jain, U.; Raj, B.; Stern, R.M. Automatic segmentation, classification and clustering of broadcast news audio. DARPA Speech Recognition Workshop, 1997.
- Dimitriadis, D.; Fousek, P. Developing On-Line Speaker Diarization System. Interspeech 2017, 2739–2743. [Google Scholar]
- Kinoshita, K.; Delcroix, M.; Tawara, N. Integrating end-to-end neural and clustering-based diarization: Getting the best of both worlds. ICASSP 2021, 7198–7202. [Google Scholar]
- Ng, A.; Jordan, M.; Weiss, Y. On spectral clustering: Analysis and an algorithm. Advances in Neural Information Processing Systems 2002, 14, 849–856.
- Han, K.; Narayanan, S.S. A robust stopping criterion for agglomerative hierarchical clustering in a speaker diarization system. Interspeech 2007, 1853–1856. [Google Scholar]
- Ning, H.; Liu, M.; Tang, H.; Huang, T.S. A spectral clustering approach to speaker diarization. ICSLP, 2006.
- Davidson, I.; Ravi, S.S. Clustering with constraints: Feasibility issues and the k-means algorithm. SIAM International Conference on Data Mining 2005, 138–149. [Google Scholar]
- Yu, C.; Hansen, J.H. Active learning based constrained clustering for speaker diarization. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2017. [Google Scholar]
- Yin, R.; Bredin, H.; Barras, C. Neural speech turn segmentation and affinity propagation for speaker diarization. Interspeech, 2018.
- Xue, Y.; Horiguchi, S.; Fujita, Y.; Watanabe, S.; García, P.; Nagamatsu, K. Online end-to-end neural diarization with speaker-tracing buffer. IEEE Spoken Language Technology Workshop 2021, 841–848. [Google Scholar]
- Bredin, H.; Laurent, A. End-to-end speaker segmentation for overlap-aware resegmentation. Interspeech, 2021.
- Bullock, L.; Bredin, H.; Garcia-Perera, L.P. Overlap-aware diarization: Re-segmentation using neural end-to-end overlapped speech detection. ICASSP 2020, 7114–7118. [Google Scholar]
- Zhang, A.; Wang, Q.; Zhu, Z.; Paisley, J.; Wang, C. Fully supervised speaker diarization. ICASSP 2019, 6301–6305. [Google Scholar]
- Comaniciu, D.; Meer, P. Mean shift: A robust approach toward feature space analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence 2002, 24, 603–619. [Google Scholar] [CrossRef]
- Xie, J.; Girshick, R.; Farhadi, A. Unsupervised deep embedding for clustering analysis. International conference on machine learning 2016, 478–487. [Google Scholar]
- Guo, X.; Gao, L.; Liu, X.; Yin, J. Improved deep embedded clustering with local structure preservation. International Joint Conference on Artificial Intelligence 2017, 1753–1759.
- Fujita, Y.; Kanda, N.; Horiguchi, S.; Nagamatsu, K.; Watanabe, S. End-to-end neural speaker diarization with permutation-free objectives. Interspeech, 2019. [Google Scholar]
- Han, E.; Lee, C.; Stolcke, A. BW-EDA-EEND: Streaming end-to-end neural speaker diarization for a variable number of speakers. ICASSP 2021, 7193–7197. [Google Scholar]
- Tripathi, A.; Lu, H.; Sak, H. End-to-end multi-talker overlapping speech recognition. ICASSP 2020, 6129–6133. [Google Scholar]
- Horiguchi, S.; Fujita, Y.; Watanabe, S.; Xue, Y.; Nagamatsu, K. End-to-end speaker diarization for an unknown number of speakers with encoder-decoder based attractors. Interspeech, 2020. [Google Scholar]
- Janin, D.; Baron, J.; Edwards, J.; Ellis, D.; Gelbart, D.; Morgan, N.; Peskin, B.; Pfau, T.; Shriberg, E.; Stolcke, A.; Wooters, C. The ICSI meeting corpus. ICASSP, 2003. [Google Scholar]
- Fiscus, J.G.; Ajot, J.; Garofolo, J.S. The Rich Transcription 2007 meeting recognition evaluation. International Evaluation Workshop on Rich Transcription 2007, 373–389. [Google Scholar]
- Przybocki, M.; Martin, A. 2000 NIST Speaker Recognition Evaluation. Linguistic Data Consortium, 2011.
- Ryant, N.; Singh, P.; Krishnamohan, V.; Varma, R.; Church, K.; Cieri, C.; Du, J.; Ganapathy, S.; Liberman, M. The third DIHARD diarization challenge. Interspeech, 2021. [Google Scholar]
- Carletta, J. Unleashing the killer corpus: experiences in creating the multi-everything AMI Meeting Corpus. Language Resources and Evaluation 2007, 41, 181–190. [Google Scholar] [CrossRef]
- Boeddeker, C.; Heitkaemper, J.; Schmalenstroeer, J.; Drude, L.; Heymann, J.; Haeb-Umbach, R. Front-end processing for the CHiME-5 dinner party scenario. CHiME-5 Workshop, 2018.
- Watanabe, S.; Mandel, M.; Barker, J.; Vincent, E.; Arora, A.; Chang, X.; Khudanpur, S.; Manohar, V.; Povey, D.; Raj, D.; Snyder, D. CHiME-6 challenge: Tackling multispeaker speech recognition for unsegmented recordings. International Workshop on Speech Processing in Everyday Environments, 2020.
- Cornell, S.; Wiesner, M.; Watanabe, S.; Raj, D.; Chang, X.; Garcia, P.; Masuyama, Y.; Wang, Z.Q.; Squartini, S.; Khudanpur, S. The CHiME-7 DASR Challenge: Distant meeting transcription with multiple devices in diverse scenarios. arXiv:2306.13734, 2023.
- Hansen, J.H.; Joglekar, A.; Shekhar, M.C.; Kothapally, V.; Yu, C.; Kaushik, L.; Sangwan, A. The 2019 inaugural Fearless Steps challenge: A giant leap for naturalistic audio. Interspeech, 2019.
- Perero-Codosero, J.M.; Antón-Martín, J.; Merino, D.T.; Gonzalo, E.L.; Gómez, L.A.H. Exploring Open-Source Deep Learning ASR for Speech-to-Text TV program transcription. IberSPEECH 2018, 262–266. [Google Scholar]
- Yu, F.; Zhang, S.; Fu, Y.; Xie, L.; Zheng, S.; Du, Z.; Huang, W.; Guo, P.; Yan, Z.; Ma, B.; Xu, X. M2MeT: The ICASSP 2022 multi-channel multi-party meeting transcription challenge. ICASSP 2022, 6167–6171. [Google Scholar]
- Kahn, J.; Galibert, O.; Quintard, L.; Carré, M.; Giraudel, A.; Joly, P. A presentation of the REPERE challenge. International Workshop on Content-Based Multimedia Indexing.
- Fu, Y.; Cheng, L.; Lv, S.; Jv, Y.; Kong, Y.; Chen, Z.; Hu, Y.; Xie, L.; Wu, J.; Bu, H.; Xu, X. AISHELL-4: An open source dataset for speech enhancement, separation, recognition and speaker diarization in conference scenario. Interspeech, 2021.
- Lleida, E.; Ortega, A.; Miguel, A.; Bazán-Gil, V.; Pérez, C.; Gómez, M.; De Prada, A. Albayzin 2018 evaluation: the IberSpeech-RTVE challenge on speech technologies for Spanish broadcast media. Applied Sciences, 2019. [Google Scholar]
- Chung, J.S.; Huh, J.; Nagrani, A.; Afouras, T.; Zisserman, A. Spot the conversation: speaker diarisation in the wild. Interspeech, 2020. [Google Scholar]
- Grauman, K.; Westbury, A.; Byrne, E.; Chavis, Z.; Furnari, A.; Girdhar, R.; Hamburger, J.; Jiang, H.; Liu, M.; Liu, X.; Martin, M. Ego4D: Around the world in 3,000 hours of egocentric video. IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022. [Google Scholar]
- Mao, H.H.; Li, S.; McAuley, J.; Cottrell, G. Speech recognition and multi-speaker diarization of long conversations. Interspeech.
- Bredin, H. pyannote.audio 2.1 speaker diarization pipeline: principle, benchmark, and recipe. Interspeech, 2023. [Google Scholar]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).