Submitted: 28 June 2024
Posted: 02 July 2024
Abstract
Keywords:
1. Introduction
2. Related Work
2.1. Phone-to-Audio Alignment
2.2. Phoneme Recognition
2.3. Systems Predicting Phones and Boundaries
3. Methodology
3.1. Our Proposed Method
3.2. Pre-Trained Self-Supervised Model
3.3. Data Processing and Knowledge Transfer
4. Experiments
4.1. Text-Independent Phone-to-Audio Alignment on TIMIT
4.2. Text-Independent Phone-to-Audio Alignment on SCRIBE
5. Discussion
Funding


| Model | Precision (P) | Recall (R) | F1 | R-value (r-val) |
|---|---|---|---|---|
| W2V2-CTC-10ms | 0.31 | 0.29 | 0.30 | 0.42 |
| W2V2-CTC-20ms | 0.31 | 0.30 | 0.31 | 0.42 |
| Phone recognition + W2V2-FS | | | | |
| W2V2-FS-20ms | 0.40 | 0.42 | 0.41 | 0.48 |
| W2V2-FS-10ms | 0.56 | 0.58 | 0.57 | 0.63 |
| W2V2-FC-32k-Libris | 0.57 | 0.57 | 0.57 | 0.64 |
| Direct inference | | | | |
| W2V2-FC-20ms-Libris | 0.57 | 0.59 | 0.58 | 0.63 |
| W2V2-FC-10ms-Libris | 0.55 | 0.58 | 0.56 | 0.62 |
| W2V2-FC-32k-Libris | 0.60 | 0.63 | 0.61 | 0.66 |
| Our Proposed Model | | | | |
| W2V2-PCA-C | 0.61 | 0.68 | 0.63 | 0.58 |

| Model | Precision (P) | Recall (R) | F1 | R-value (r-val) |
|---|---|---|---|---|
| charsiu Model | | | | |
| W2V2-FC-10ms-Libris | 0.93 | 0.71 | 0.80 | 0.79 |
| Our Proposed Model | | | | |
| W2V2-PCA-C | 0.89 | 0.85 | 0.87 | 0.88 |
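
The tables above report P, R, F1 and r-val without defining them in this excerpt. As a minimal sketch, assuming P and R are the precision and recall of predicted phone boundaries, F1 is their harmonic mean, and r-val is the R-value commonly used for segmentation scoring (Räsänen et al., 2009), the two derived columns could be computed as follows. The function names `f1_score` and `r_value` are illustrative and do not come from the paper.

```python
import math


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def r_value(precision: float, recall: float) -> float:
    """R-value for boundary detection (assumed definition, see lead-in).

    Precision and recall are fractions in [0, 1]; recall plays the role
    of the hit rate, and over-segmentation is recall / precision - 1.
    """
    if precision == 0:
        raise ValueError("precision must be non-zero to compute the R-value")
    over_seg = recall / precision - 1
    # Distance from the ideal point (hit rate = 1, over-segmentation = 0).
    r1 = math.hypot(1 - recall, over_seg)
    # Signed distance that penalises trading hit rate for over-segmentation.
    r2 = (-over_seg + recall - 1) / math.sqrt(2)
    return 1 - (abs(r1) + abs(r2)) / 2


# W2V2-PCA-C row of the second table: P = 0.89, R = 0.85
print(round(f1_score(0.89, 0.85), 2))  # 0.87
print(round(r_value(0.89, 0.85), 2))   # 0.88
```

Because the tabulated precision and recall are themselves rounded to two decimals, recomputing F1 or the R-value from them may differ from the reported figures in the last digit.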
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).