Submitted: 19 September 2023
Posted: 20 September 2023
Abstract
Keywords:
1. Introduction
- An MFCC+CNN-based disordered speech classification module.
- A Transformer-based grapheme-to-phoneme (G2P) converter module.
- A Tacotron2-based IPA-to-speech synthesis module.
- Build a Grapheme-to-Phoneme system to convert all the English text to IPA format.
- Build the TTS system with the processed data.
2. Related work
2.1. Linguistic E-Learning
2.2. Speech Disorders Classification
2.3. Grapheme-to-Phoneme Conversion
2.4. Speech Synthesis System
3. The MFCC+CNN-based disordered speech classification
3.1. Feature Extraction
- Pre-emphasize the audio signal to increase its energy at higher frequencies.
- Split the signal into overlapping windows (frames).
- Apply the Fourier transform to each window to convert the signal from the time domain to the frequency domain.
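- Compute the Mel spectrum by passing the Fourier-transformed signal through the Mel filter bank. The commonly used transformation from the Hertz scale to the Mel scale is $m = 2595 \log_{10}(1 + f/700)$, where $f$ is the frequency in Hz and $m$ is the corresponding Mel value.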
- Take the discrete cosine transform of the log Mel spectrum; the resulting coefficients are the MFCCs.
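The extraction steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: all function names, the pre-emphasis coefficient of 0.97, the Hamming window, the frame length and hop, and the filter and coefficient counts are hypothetical choices. The Hz-to-Mel mapping uses the common form m = 2595 log10(1 + f/700). A naive DFT keeps the example dependency-free; a practical system would use an FFT library instead.

```python
import math

def hz_to_mel(f_hz):
    """Common Hz-to-Mel mapping: m = 2595 * log10(1 + f / 700)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(m):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def pre_emphasis(signal, alpha=0.97):
    """y[n] = x[n] - alpha * x[n-1]; boosts high-frequency energy (alpha is assumed)."""
    return [signal[0]] + [signal[n] - alpha * signal[n - 1] for n in range(1, len(signal))]

def frame_signal(signal, frame_len, hop):
    """Split the signal into overlapping frames and apply a Hamming window."""
    window = [0.54 - 0.46 * math.cos(2 * math.pi * n / (frame_len - 1)) for n in range(frame_len)]
    return [[signal[start + n] * window[n] for n in range(frame_len)]
            for start in range(0, len(signal) - frame_len + 1, hop)]

def power_spectrum(frame):
    """Naive O(N^2) DFT power spectrum for bins 0..N/2."""
    n_fft = len(frame)
    spec = []
    for k in range(n_fft // 2 + 1):
        re = sum(x * math.cos(2 * math.pi * k * n / n_fft) for n, x in enumerate(frame))
        im = -sum(x * math.sin(2 * math.pi * k * n / n_fft) for n, x in enumerate(frame))
        spec.append((re * re + im * im) / n_fft)
    return spec

def mel_filterbank(n_filters, n_fft, sample_rate):
    """Triangular filters with centers equally spaced on the Mel scale."""
    max_mel = hz_to_mel(sample_rate / 2)
    mel_points = [i * max_mel / (n_filters + 1) for i in range(n_filters + 2)]
    bins = [int(math.floor((n_fft + 1) * mel_to_hz(m) / sample_rate)) for m in mel_points]
    bank = []
    for i in range(1, n_filters + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        filt = [0.0] * (n_fft // 2 + 1)
        for k in range(left, center):      # rising edge
            filt[k] = (k - left) / (center - left)
        for k in range(center, right):     # falling edge
            filt[k] = (right - k) / (right - center)
        bank.append(filt)
    return bank

def dct2(values, n_coeffs):
    """Unnormalized DCT-II, keeping the first n_coeffs coefficients."""
    n = len(values)
    return [sum(v * math.cos(math.pi * k * (i + 0.5) / n) for i, v in enumerate(values))
            for k in range(n_coeffs)]

def mfcc(signal, sample_rate, frame_len=64, hop=32, n_filters=10, n_coeffs=5):
    """Pre-emphasis -> framing -> power spectrum -> Mel filter bank -> log -> DCT."""
    bank = mel_filterbank(n_filters, frame_len, sample_rate)
    features = []
    for frame in frame_signal(pre_emphasis(signal), frame_len, hop):
        spec = power_spectrum(frame)
        energies = [sum(w * p for w, p in zip(filt, spec)) for filt in bank]
        log_energies = [math.log(e + 1e-10) for e in energies]  # epsilon avoids log(0)
        features.append(dct2(log_energies, n_coeffs))
    return features

if __name__ == "__main__":
    # Synthetic 1 kHz tone sampled at 8 kHz, standing in for a speech recording.
    tone = [math.sin(2 * math.pi * 1000 * n / 8000) for n in range(256)]
    coeffs = mfcc(tone, 8000)
    print(len(coeffs), "frames x", len(coeffs[0]), "coefficients")
```

Each inner list holds the coefficients for one frame. Stacking the frames gives a two-dimensional time-by-coefficient array, which is the kind of input an image-style CNN classifier can consume.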
3.2. Data selection
3.3. Implementation and Evaluation
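The CNN layer summary table in this paper (rows conv2d through dense_1) can be checked with simple arithmetic. The sketch below reproduces its output shapes and parameter counts under assumptions inferred from the numbers rather than stated in the text: a 150×150×3 input, 3×3 convolutions with stride 1 and no padding, and 2×2 max pooling. The helper names are hypothetical, and the batch dimension shown as `None` in the table is omitted.

```python
def conv2d(shape, filters, kernel=3):
    """Output shape and parameter count of an unpadded, stride-1 convolution."""
    h, w, c = shape
    out = (h - kernel + 1, w - kernel + 1, filters)
    params = (kernel * kernel * c + 1) * filters  # weights plus one bias per filter
    return out, params

def max_pool(shape, pool=2):
    """Max pooling halves height and width (floor division) and has no parameters."""
    h, w, c = shape
    return (h // pool, w // pool, c), 0

def dense(units_in, units_out):
    """Fully connected layer: one weight per input plus one bias, per output unit."""
    return units_out, (units_in + 1) * units_out

def summarize(input_shape=(150, 150, 3)):
    """Return (layer name, output shape, parameter count) rows for the assumed stack."""
    rows = []
    shape = input_shape
    for i, filters in enumerate([32, 64, 128, 128]):
        suffix = "" if i == 0 else f"_{i}"
        shape, params = conv2d(shape, filters)
        rows.append(("conv2d" + suffix, shape, params))
        shape, params = max_pool(shape)
        rows.append(("max_pooling2d" + suffix, shape, params))
    flat = shape[0] * shape[1] * shape[2]
    rows.append(("flatten", (flat,), 0))
    units, params = dense(flat, 512)
    rows.append(("dense", (units,), params))
    units, params = dense(units, 1)
    rows.append(("dense_1", (units,), params))
    return rows

if __name__ == "__main__":
    for name, shape, params in summarize():
        print(f"{name:16s} {str(shape):18s} {params}")
```

Under these assumptions every row matches the reported table. The single output unit in dense_1 is consistent with a binary classifier.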
4. The Transformer-based Multilingual G2P converter
4.1. Data Selection
4.2. The Transformer-based G2P converter
4.3. Implementation and Evaluation
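The results table for this module reports phoneme error rate (PER) and word error rate (WER). The sketch below shows one common way to compute them for G2P output, and it is an assumption that the authors used these exact definitions: PER is the total Levenshtein distance between reference and generated phoneme sequences divided by the total number of reference phonemes, and WER is the fraction of words whose generated sequence differs from the reference at all. The function names are hypothetical, and treating each ASCII character of a transcription as one phoneme symbol is a simplification made for illustration.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (insert, delete, substitute)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]

def per_and_wer(pairs):
    """pairs: list of (reference, generated) phoneme sequences, one pair per word."""
    total_edits = sum(edit_distance(ref, hyp) for ref, hyp in pairs)
    total_phonemes = sum(len(ref) for ref, _ in pairs)
    wrong_words = sum(1 for ref, hyp in pairs if list(ref) != list(hyp))
    return total_edits / total_phonemes, wrong_words / len(pairs)

if __name__ == "__main__":
    # Two rows from the example-output table: "displeasure" and "buoyant".
    examples = [
        ('dIspl"EZ@', 'dIspl"Z@'),
        ('b"OI@nt', 'b"OI@nt'),
    ]
    per, wer = per_and_wer(examples)
    print(f"PER = {per:.4f}, WER = {wer:.2f}")
```

Because a single wrong phoneme makes the whole word count as an error, WER is always at least as large as the share of words containing any edit. This is why the WER column in the results table is several times higher than the PER column.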
5. The Tacotron2-based IPA-to-Speech System
5.1. Data Preprocessing
5.2. Tacotron2-based IPA-to-Speech system
- A recurrent sequence-to-sequence feature prediction network with attention, which predicts a sequence of Mel spectrogram frames from an input character sequence.
- A modified version of WaveNet, which generates time-domain waveform samples conditioned on the predicted Mel spectrogram frames.
5.3. Implementation and Evaluation
6. Conclusion and Future work
Author Contributions
Funding

| Layer | Output shape | Number of parameters |
|---|---|---|
| conv2d | (None, 148, 148, 32) | 896 |
| max_pooling2d | (None, 74, 74, 32) | 0 |
| conv2d_1 | (None, 72, 72, 64) | 18496 |
| max_pooling2d_1 | (None, 36, 36, 64) | 0 |
| conv2d_2 | (None, 34, 34, 128) | 73856 |
| max_pooling2d_2 | (None, 17, 17, 128) | 0 |
| conv2d_3 | (None, 15, 15, 128) | 147584 |
| max_pooling2d_3 | (None, 7, 7, 128) | 0 |
| flatten | (None, 6272) | 0 |
| dense | (None, 512) | 3211776 |
| dense_1 | (None, 1) | 513 |

| Written Format | CMUDict | IPA symbols |
|---|---|---|
| eat | IY T | it |
| confirm | K AH N F ER M | k@n"f3rm |
| minute | M IH N AH T | "min@t |
| quick | K W IH K | kwik |
| maker | M EY K ER | "meIker |
| relate | R IH L EY T | rI"leIt |

| Dataset | Number of word pairs | Validation split |
|---|---|---|
| English-IPA | 125,912 | 20% |
| French-IPA | 122,986 | 20% |
| Spanish-IPA | 99,315 | 20% |

| Models | PER | WER |
|---|---|---|
| Hidden Markov Model | 9.02% | 42.69% |
| Joint-sequence | 5.88% | 24.53% |
| Seq2Seq | 5.45% | 23.55% |
| LSTM | 9.1% | 21.3% |
| Transformer (ours) | 2.6% | 10.7% |

| Language | Written Format | Correct Phonemes | Generated Phonemes |
|---|---|---|---|
| English | displeasure | dIspl"EZ@ | dIspl"Z@ |
| English | buoyant | b"OI@nt | b"OI@nt |
| English | immortal | Im’O:t@l | Im’O:t@l |
| Spanish | ababillarais | aBaBiLaRis | aBaBiLaRis |
| Spanish | cacofónicos | kako"fonikos | kako"fonikos |
| Spanish | cadañega | kaDaeGa | kaDaɲeGa |
| French | câlineriez | kalin9Kje | kalin9Kje |
| French | damasquiner | damaskine | damaskine |
| French | effrangé | efKãZe | efKãZe |

| Original text in LJSpeech | Converted text |
|---|---|
| The overwhelming majority of people in this country know how to sift the wheat from the chaff in what they hear and what they read. | D9 oUv3wElmIN m9dZOr9ti 2v pi:p9l iN DIs k2ntri noU hAU tu: sIft D9 wi:t fr2m D9 tSaef In w2t DeI hi:r 9nd w2t DeI rEd. |
| All the committee could do in this respect was to throw the responsibility on others. | Ol D9 k9mIti kUd du: In DIs rIspEkt wA:z tu: 8roU D9 ri:spA:ns9bIl9ti a:n 2D3z. |
| since these agencies are already obliged constantly to evaluate the activities of such groups | sIns Di:z eIdZ9si:z A:r OlrEdi 9blAIdZd kA:nst9ntli tu: Ivaelju:eIt D9 aektIvIti:z 2v s2tS gru:ps. |

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).