1. Introduction
In recent years, the rapid advancement of artificial intelligence technology has spurred the further development of intelligent coal mine construction [1]. The introduction of policies such as the “Intelligent Coal Mine Guide (2021 Edition)” and the “Trial Measures for the Acceptance Management of Intelligent Demonstration Coal Mines” has underscored the growing necessity of establishing intelligent coal mines that leverage related artificial intelligence technologies [2,3]. Given the unique characteristics of the coal industry, developing an intelligent coal mine that includes a voice interaction system tailored to the sector is crucial for ensuring the safety of coal mine production.
Northern Shaanxi, one of the most coal-rich regions in China, is at the forefront of coal production, where managing the production environment, conferences, dispatching, and command operations is a daily necessity. Integrating a voice interaction system that accommodates the local dialect is therefore vital for enhancing communication efficiency and safety in these contexts.
Management personnel in the coal mining industry predominantly communicate using the Northern Shaanxi dialect, commonly referred to as “Shaanxi Pu,” which is characterized by a distinct Northern Shaanxi accent. The dialect serves not only as a cultural emblem but also as a carrier of traditional cultural heritage. Consequently, it is imperative to compile a corpus of the Northern Shaanxi dialect and to develop speech recognition capabilities tailored to the coal mining industry. This initiative is vital for both preserving the regional culture and enhancing operational efficiency within the sector.
To address the challenge of dialect recognition within the field of speech recognition, initial research efforts by scholars involved adapting traditional speech recognition models for dialect recognition purposes. This included the application of linear predictive coding (LPC), dynamic time warping (DTW), and hidden Markov models (HMMs) as technical frameworks for dialect identification. Furthermore, the integration of Gaussian Mixture Model (GMM) technology in speech modeling has notably enhanced recognition rates. For instance, studies [4,5,6] have utilized the GMM to develop dialect recognition systems for the Mongolian dialect, the Chongqing dialect, and the Shuozhou dialect in Shanxi Province.
Because traditional dialect recognition methods rely on extensive corpora and manual annotation, they incur high costs and often yield only moderate recognition performance. The advent of deep learning has since transformed the landscape of speech recognition: it not only streamlines the recognition process but also markedly enhances recognition accuracy. For instance, the study in [7] developed an end-to-end Listen, Attend and Spell (LAS) model for Tujia speech recognition, incorporating a multi-head attention mechanism to boost the accuracy of Tujia dialect recognition. In [8], an end-to-end dialect speech recognition method based on transfer learning is introduced, which leverages shared feature extraction to enhance the recognition performance of low-resource dialects. The work in [9] presents an end-to-end speech recognition system that integrates a multi-head self-attention mechanism with a residual network (ResNet) and a bidirectional long short-term memory network (Bi-LSTM), significantly improving the recognition of Jiangxi and Hakka dialects. Nonetheless, these models could benefit from further enhancements in incorporating contextual semantic information and capturing positional details.
Currently, end-to-end (E2E) speech recognition technology has yielded substantial research outcomes [10]. Architectures such as Recurrent Neural Networks (RNNs) [11], Convolutional Neural Networks (CNNs) [12,13], self-attention-based Transformer networks [14], and the Conformer [15] have emerged as prominent backbone structures for Automatic Speech Recognition (ASR) models, garnering significant attention. However, the Conformer decoder exhibits limited text generation capabilities, while the Transformer is computationally intensive, incurs substantial memory costs, and is weaker at capturing local features.
Addressing these limitations, this study introduces a Conformer-Transformer-CTC (Connectionist Temporal Classification) fusion approach for dialectal speech recognition systems. The proposed method leverages the audio modeling prowess of Conformer as the encoder, utilizes Transformer for text generation as the decoder, and harnesses the flexible alignment capabilities of CTC to construct an end-to-end dialect speech recognition model, thereby enhancing speech recognition accuracy.
2. Related Work
The end-to-end speech recognition process entails the consolidation of acoustic, pronunciation, and linguistic factors within a single deep neural network (DNN) to streamline the modeling process and achieve direct mapping from speech input to text output [16,17]. As deep learning techniques gain widespread application across various domains, end-to-end speech recognition models have demonstrated superior recognition performance compared to traditional speech recognition models, garnering increasing interest and attention [18,19].
Currently, the prevalent end-to-end methods include: a) Connectionist Temporal Classification (CTC) [20], which computes the model’s loss and optimizes it using forward and backward algorithms [21]; study [22] further introduces an intermediate CTC loss to regularize CTC training and enhance speech recognition performance. b) The Recurrent Neural Network Transducer (RNN-T) [23], which is suitable for streaming speech recognition as it can recall past information. c) The encoder-decoder architecture with an attention mechanism [24], which has garnered significant attention and research. In recent years, the diverse modeling approaches in end-to-end Automatic Speech Recognition (ASR) systems have sparked interest in developing hybrid methods that leverage their complementary strengths [25]. For instance, the research in [26] combined CTC with the Transformer decoder to achieve higher recognition accuracy. The study in [27] integrated the self-attention mechanism and the multi-layer perceptron module into dual branches within the model, yielding exceptional recognition performance.
In essence, the integration of the strengths of different architectures has demonstrated potential for enhancing speech recognition performance in recent research [28]. In light of this, this study harnesses the benefits of end-to-end models to construct a speech recognition system for the Northern Shaanxi dialect, aiming to improve recognition performance.
3. Method
3.1. Corpora Establishment
Currently, there is a significant lack of openly accessible corpora concerning the dialect used within the coal mining sector in northern Shaanxi. To facilitate research in the realm of speech recognition for the northern Shaanxi dialect, this study has compiled a specialized dialect corpus for the northern Shaanxi coal mining industry. The methodology employed in constructing this corpus is detailed in Figure 1.
The initial phase involves the careful selection of speech materials from the region. By analyzing the distinctive features of the northern Shaanxi dialect, which are summarized in Table 1, the corpus was compiled. The selection process was guided by the textual content of coal mine dispatch logs, industry-specific terminology from the coal mining sector, and relevant industrial texts. Subsequently, recorded transcripts were created based on this selected material.
The recording protocol adopted in this study involves the collection of dialect data by 20 volunteers. The recordings are primarily based on the text from the northern Shaanxi coal mine scene-specific dialect dataset. Among the 20 volunteers, there are 13 males and 7 females, with ages ranging from 18 to 40 years. All participants are native to northern Shaanxi, and the dialects recorded are exclusively of the northern Shaanxi variety. The resultant dataset is detailed in Table 2.
To guarantee the quality and integrity of the data, this study employs professional recording equipment to capture the audio. The recordings are saved in WAV format with a 16 kHz sampling rate. After recording, the audio files are meticulously checked and annotated using Adobe Audition.
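As a simple illustration of the kind of consistency check applied before a file enters the corpus, the sketch below verifies the sampling rate and reports the duration of a recording. The function name, the mono-channel assumption, and the use of Python’s standard wave module are our own illustrative choices, not part of the workflow described above.

```python
import wave


def check_recording(path: str, expected_rate: int = 16000) -> bool:
    """Verify that a recorded file matches the corpus format (16 kHz WAV, assumed mono)."""
    with wave.open(path, "rb") as wav:
        rate, channels, frames = wav.getframerate(), wav.getnchannels(), wav.getnframes()
    duration = frames / rate
    print(f"{path}: {rate} Hz, {channels} channel(s), {duration:.2f} s")
    return rate == expected_rate and channels == 1
```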
In the final phase of data preparation, all recorded data, including participant information, the original phonetic corpus, the annotated and processed corpus, along with the corresponding text annotations, are systematically organized and stored within a unified corpus repository. This structured approach ensures that the dataset is both comprehensive and accessible for further analysis and use in speech recognition model training.
3.2. Conformer-Transformer-CTC Model Structure
To achieve a dialect speech recognition model with enhanced recognition accuracy, this paper introduces a Conformer-Transformer-CTC based speech recognition system. The system comprises several key modules: a preprocessing module, which includes both speech and text preprocessing, and a codec module, in which the encoder uses the Conformer architecture while the decoder combines the Transformer and CTC for joint decoding.
Figure 2 illustrates the end-to-end architecture of the dialect Conformer-Transformer-CTC speech recognition system. This integrated approach is designed to leverage the strengths of both the Conformer and Transformer networks, along with the benefits of CTC, for robust and accurate dialect recognition.
3.2.1. Preprocessing Module
The preprocessing module includes a speech preprocessing module and a text preprocessing module, as shown in Figure 3.
The speech preprocessing module developed in this study comprises a down-sampling module (Sub-sampling Embedding), a convolution module, and position coding (Position Encoding).
Initially, the speech features undergo down-sampling. Dialectal speech may exhibit significant variations in temporal length or rapid speaking rates. Down-sampling reduces the time dimension’s resolution, enabling the model to more effectively handle rapidly changing features.
Subsequently, the convolution module applies a 2D convolution layer followed by a ReLU activation layer. The acoustic features of dialectal speech often exhibit strong local correlations across time and frequency; two-dimensional convolution captures these acoustic patterns along both dimensions, facilitating the learning of local features within dialectal speech. Furthermore, the ReLU activation function enhances the model’s expressive capacity, particularly for acoustic features that deviate from standard language norms.
Following this, a linear layer is employed to extract dialect-specific acoustic features and patterns, yielding a more tailored feature representation. Additionally, fixed position coding, based on sine and cosine functions, is applied to better capture the sequential distribution and structure of dialectal speech features, as detailed in formulas (1) and (2):

$$PE(pos, 2i) = \sin\left(\frac{pos}{10000^{2i/d_{model}}}\right) \quad (1)$$

$$PE(pos, 2i+1) = \cos\left(\frac{pos}{10000^{2i/d_{model}}}\right) \quad (2)$$

where $PE$ is the position encoding matrix, $pos$ indicates the position of the current token within the sequence, $i$ denotes the $i$-th dimension of the embedding vector, and $d_{model}$ is the dimension of the embedding vector.
Finally, a Dropout operation randomly suppresses certain features, which decreases the model’s reliance on individual features and, in turn, strengthens its robustness and generalization capabilities.
Through the aforementioned processes, the speech preprocessing module performs dimensionality reduction, feature transformation, and position modeling for dialectal speech. This results in a feature representation with stronger discriminative power, thereby improving the performance and accuracy of dialect recognition. Using this speech preprocessing module enables the model to adapt more effectively to the characteristics of dialectal speech and to extract dialect-specific acoustic features and patterns.
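The paper does not give the exact layer configuration, but the following minimal PaddlePaddle-style sketch shows how such a front end can be assembled. The 80-dimensional input features, 256-dimensional model width, x4 time sub-sampling, and the class and function names are illustrative assumptions rather than the authors’ implementation.

```python
import math

import numpy as np
import paddle
import paddle.nn as nn


def sinusoidal_position_encoding(length: int, d_model: int) -> paddle.Tensor:
    """Fixed sine/cosine position encoding as in Eqs. (1)-(2)."""
    pos = np.arange(length)[:, None]
    div = np.exp(np.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
    pe = np.zeros((length, d_model), dtype="float32")
    pe[:, 0::2] = np.sin(pos * div)
    pe[:, 1::2] = np.cos(pos * div)
    return paddle.to_tensor(pe)  # (length, d_model)


class SpeechFrontEnd(nn.Layer):
    """Conv2D sub-sampling (x4 in time), linear projection, position encoding, dropout."""

    def __init__(self, feat_dim: int = 80, d_model: int = 256, dropout: float = 0.1):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2D(1, d_model, kernel_size=3, stride=2), nn.ReLU(),
            nn.Conv2D(d_model, d_model, kernel_size=3, stride=2), nn.ReLU(),
        )
        sub_feat = ((feat_dim - 1) // 2 - 1) // 2  # frequency bins left after sub-sampling
        self.proj = nn.Linear(d_model * sub_feat, d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, feats: paddle.Tensor) -> paddle.Tensor:
        # feats: (batch, frames, feat_dim) filterbank/MFCC features
        x = self.conv(feats.unsqueeze(1))  # (batch, channels, frames/4, reduced feat dim)
        b, c, t, f = x.shape
        x = self.proj(x.transpose([0, 2, 1, 3]).reshape([b, t, c * f]))
        return self.dropout(x + sinusoidal_position_encoding(t, x.shape[-1]))
```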
The text preprocessing module developed in this paper includes an embedding layer, position coding, and a convolution module.
Initially, the embedding layer is employed to convert text labels into dense vector representations, capturing the semantic relationships between words. Subsequently, the same position coding technique used in the speech preprocessing module is applied to encode the positional information of the words. Following this, a 1-D convolutional layer is utilized to extract implicit positional information and to capture more nuanced local semantic details. Layer normalization is applied to normalize the model’s output at this stage. Finally, the ReLU activation function is introduced to incorporate non-linearities, thereby enhancing the model’s representational capacity.
By undergoing this text processing pipeline, the speech recognition system can enhance its comprehension and expression of textual information, thereby improving the overall performance of speech recognition.
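A correspondingly minimal sketch of the text branch is shown below, reusing the imports and the position-encoding helper from the speech front-end sketch above; the vocabulary size, layer sizes, and class name are again illustrative assumptions.

```python
class TextFrontEnd(nn.Layer):
    """Embedding, position encoding, 1-D convolution, LayerNorm, and ReLU for text labels."""

    def __init__(self, vocab_size: int, d_model: int = 256, kernel_size: int = 3):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.conv = nn.Conv1D(d_model, d_model, kernel_size, padding=kernel_size // 2)
        self.norm = nn.LayerNorm(d_model)
        self.act = nn.ReLU()

    def forward(self, tokens: paddle.Tensor) -> paddle.Tensor:
        # tokens: (batch, label_len) integer character indices
        x = self.embed(tokens)  # (batch, label_len, d_model)
        x = x + sinusoidal_position_encoding(x.shape[1], x.shape[2])
        x = self.conv(x.transpose([0, 2, 1])).transpose([0, 2, 1])  # convolve over time axis
        return self.act(self.norm(x))
```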
3.2.2. Codec
The advantage of the Conformer architecture as an encoder lies in its ability to process both time-domain and frequency-domain features of the audio signal. This dual processing capability allows the Conformer to yield a rich audio representation, enhancing its understanding of the input audio signal and providing more informative features for the subsequent decoder stage. Nevertheless, the Conformer structure exhibits limited text generation capabilities within its decoder. To address this, the present study employs a Transformer-based decoder. The self-attention mechanism inherent in the Transformer is adept at managing long-range dependencies, which allows the decoder to take into account the global context when generating text. This results in the production of accurate and coherent textual outputs. The proposed architecture is depicted in Figure 4.
The Conformer model primarily consists of four key modules: the first feedforward module (Feed Forward), the multi-head attention module (Multi-Head Self-Attention), the convolutional module (Convolution Module), and the second feedforward module. For the Conformer, the computation of the output $y_i$ from the input vector $x_i$ of the $i$-th encoder block can be described by the following formulas:

$$\tilde{x}_i = x_i + \tfrac{1}{2}\,\mathrm{FFN}_{first}(x_i)$$

$$x'_i = \tilde{x}_i + \mathrm{MHSA}(\tilde{x}_i)$$

$$x''_i = x'_i + \mathrm{Conv}(x'_i)$$

$$y_i = \mathrm{Layernorm}\left(x''_i + \tfrac{1}{2}\,\mathrm{FFN}_{second}(x''_i)\right)$$

Among the components, FFN denotes the feedforward module, with $\mathrm{FFN}_{first}$ the initial feedforward module and $\mathrm{FFN}_{second}$ the second; MHSA stands for the Multi-Head Self-Attention module, Conv is an abbreviation for the Convolution module, and Layernorm represents Layer Normalization. Each of these modules incorporates a residual connection to improve gradient flow and stabilize training.
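To make the composition above concrete, the short sketch below chains the four modules with their residual connections. The sub-modules are passed in as generic callables, so their internals (and the half-step scaling) follow the standard Conformer design rather than any configuration stated in the paper.

```python
def conformer_block(x, ffn1, mhsa, conv, ffn2, layer_norm):
    """Macaron-style Conformer block: two half-step feed-forward modules sandwich the
    multi-head self-attention and convolution modules, each with a residual connection."""
    x = x + 0.5 * ffn1(x)                  # first feed-forward module (half-step residual)
    x = x + mhsa(x)                        # multi-head self-attention module
    x = x + conv(x)                        # convolution module
    return layer_norm(x + 0.5 * ffn2(x))   # second feed-forward module + final LayerNorm
```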
Let $x = (x_1, \dots, x_T)$ denote the input speech feature sequence, $h = (h_1, \dots, h_T)$ the hidden sequence produced by the encoder, and $y = (y_1, \dots, y_U)$ the output text sequence. The probability modeled by the encoder-decoder is:

$$P(y \mid x) = \prod_{u=1}^{U} P(y_u \mid y_{<u}, x)$$

At each decoding step, the conditional dependence of the output on the encoder features is calculated through the attention mechanism. The attention score is a function of the current decoder hidden state $s_{u-1}$ and the encoder output features, parameterized by the learnable parameters $W$:

$$e_{u,t} = \mathrm{Score}(s_{u-1}, h_t; W)$$

The attention distribution is obtained by normalizing these scores:

$$\alpha_{u,t} = \frac{\exp(e_{u,t})}{\sum_{t'=1}^{T} \exp(e_{u,t'})}$$

Using the attention weights and the encoder hidden states, the corresponding context vector is obtained as the weighted sum:

$$c_u = \sum_{t=1}^{T} \alpha_{u,t}\, h_t$$

The final decoder training loss function is defined as:

$$\mathcal{L}_{att} = -\log P(y \mid x) = -\sum_{u=1}^{U} \log P(y_u \mid y_{<u}, x)$$
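As a worked illustration of the attention step, the NumPy sketch below uses a single-head scaled dot-product score as one concrete choice of the scoring function above. It assumes the decoder state has already been projected to the encoder feature dimension and is not the multi-head implementation used in the model.

```python
import numpy as np


def attention_context(s_prev: np.ndarray, h: np.ndarray) -> np.ndarray:
    """One decoding step of the attention mechanism: score every encoder frame h_t
    against the decoder state s_{u-1}, normalize with a softmax, and return the
    context vector c_u as the weighted sum of encoder features."""
    e = h @ s_prev / np.sqrt(h.shape[-1])   # scores e_{u,t}, shape (T,)
    alpha = np.exp(e - e.max())
    alpha /= alpha.sum()                    # attention distribution alpha_{u,t}
    return alpha @ h                        # context vector c_u
```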
3.2.3. CTC Auxiliary Training
Prior work has investigated auxiliary tasks for improving speech recognition under the assumption that different levels of the network learn representations at different granularities [29]. In this paper, the CTC objective function is integrated into the Conformer-Transformer fusion model. CTC learns the sequence-to-sequence mapping end to end without requiring explicit alignment, which mitigates the irregular alignments of attention-based encoder-decoder (AED) models and leads to better performance.
During training, the model consists of three parts: the Conformer encoder, the Transformer decoder, and the CTC decoder. The end-to-end dialect speech recognition training proceeds as follows.
The output of the encoder is used to calculate the CTC loss. Let a training pair be $(x, y)$, where $x$ is the speech feature sequence and $y$ is its text annotation; the CTC loss function is then:

$$\mathcal{L}_{CTC} = -\log P_{CTC}(y \mid x) = -\log \sum_{\pi \in \mathcal{B}^{-1}(y)} P(\pi \mid x)$$

where $\pi$ is an alignment path over the CTC output alphabet (including the blank symbol) and $\mathcal{B}$ is the many-to-one mapping that removes blanks and repeated labels.
Combining the CTC loss with the decoder loss facilitates the convergence of the decoder while enabling the hybrid model to exploit label dependencies. Since CTC is used only to assist the decoder alignment, it receives a smaller weight in the fusion. The total loss function is defined as the weighted sum of the CTC loss and the decoder loss:

$$\mathcal{L} = \lambda\, \mathcal{L}_{CTC}(x, y) + (1 - \lambda)\, \mathcal{L}_{att}(x, y)$$

where $\lambda \in [0, 1]$ measures the relative importance of the CTC loss and the decoder loss, $x$ represents the speech features, and $y$ represents the text annotation.
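In code, this weighting reduces to a one-line combination of the two loss tensors. The sketch below is a minimal illustration; the 0.3 value for the CTC weight is an assumption, not the paper’s setting.

```python
def joint_loss(ctc_loss, att_loss, ctc_weight: float = 0.3):
    """Weighted sum of the auxiliary CTC loss and the attention (decoder) loss.
    A small ctc_weight keeps CTC in an alignment-assisting role; 0.3 is illustrative."""
    return ctc_weight * ctc_loss + (1.0 - ctc_weight) * att_loss
```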
4. Experimental Validation
4.1. Experimental Indicators
In this paper, the experimental results are evaluated on the self-built dialect speech dataset, using the word error rate (WER) as the evaluation index. To align the recognized word sequence with the reference word sequence, certain words must be substituted, deleted, or inserted. The WER is the total number of substituted, deleted, and inserted words divided by the total number of words in the reference sequence:

$$WER = \frac{S + D + I}{T} \times 100\%$$

where $I$ is the number of inserted words, $D$ the number of deleted words, $S$ the number of substituted words, and $T$ the total number of words in the reference sentence. A smaller WER indicates better recognition performance.
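A minimal reference implementation of this metric via Levenshtein alignment is sketched below; for Chinese transcripts the same routine is typically applied at the character level, and the example sentence in the final comment is purely illustrative.

```python
def word_error_rate(ref: list, hyp: list) -> float:
    """Levenshtein alignment between reference and hypothesis token sequences;
    returns (S + D + I) / T as a fraction."""
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / max(len(ref), 1)


# e.g. word_error_rate(list("采煤机启动"), list("采煤鸡启动")) -> 0.2 (i.e. 20%)
```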
4.2. Experimental Configuration
The hardware configuration used for the experiments is an Intel(R) Xeon(R) Gold 6330 processor with 32 GB of RAM and an NVIDIA GeForce RTX 3090 GPU. The software environment is Anaconda 3 and Python 3.8 with the PaddlePaddle 2.4.1 deep learning framework under the Ubuntu 20.04.2 LTS operating system. The model was trained with the Adam optimizer and an adaptive learning-rate strategy [30].
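The paper does not specify the exact schedule, so the sketch below pairs Adam with a Noam-style warm-up/decay schedule as one common “adaptive learning rate” choice in PaddlePaddle; the warm-up steps, weight decay, and the stand-in network are illustrative assumptions.

```python
import paddle

# Stand-in network; in practice this would be the Conformer-Transformer-CTC model.
model = paddle.nn.Linear(80, 256)

# Warm-up followed by decay is a common adaptive learning-rate strategy for
# Conformer/Transformer training; the Noam schedule and its values are assumptions.
scheduler = paddle.optimizer.lr.NoamDecay(d_model=256, warmup_steps=25000)
optimizer = paddle.optimizer.Adam(learning_rate=scheduler,
                                  parameters=model.parameters(),
                                  weight_decay=1e-6)

# Inside the training loop, after each update:
# optimizer.step(); optimizer.clear_grad(); scheduler.step()
```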
4.3. Experimental Data
The northern Shaanxi dialect corpus used for the model includes the prescribed specialized vocabulary of the coal mining industry, self-selected vocabulary, common sayings, coal mine dispatching calls, coal mine reports, and other related material, comprising 9770 sentences with a total duration of 25 hours. The corpus covers four types of content: vocabulary and grammar, oral culture, dialect dialogue, and dialect narration. Table 3 shows part of the data content of the northern Shaanxi dialect corpus.
4.4. Experimental Results and Analysis
Experiments are set up from three aspects: feature extraction, parameter tuning, and recognition-rate comparison. They verify the influence of the feature extraction technique and of the model parameters on the recognition rate, and show that the recognition rate of the proposed model surpasses that of mainstream models. The details are as follows:
4.4.1. Feature Extraction Experiment
Because of the distinctive phonetic characteristics of dialects, a single speech feature extraction technique struggles to achieve the best performance on the northern Shaanxi dialect corpus. Therefore, according to the phonetic and rhythmic characteristics of the northern Shaanxi dialect, multiple speech features are extracted, and the influence of different feature extraction techniques on dialect speech recognition performance is examined. The results are shown in Table 4.
The root cause can be attributed to differences in the representational capacity and information content of the individual features. MFCC features alone are relatively weak, but combining them with other features provides complementary information. The fusion of different features therefore yields a more comprehensive and richer audio representation, improving the performance of the speech recognition system.
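The paper does not name the feature-extraction toolkit or the exact feature combinations, so the sketch below only illustrates the idea of fusing complementary features, here MFCCs stacked with log-Mel filterbank energies using librosa; the dimensions and library choice are assumptions.

```python
import librosa
import numpy as np


def combined_features(path: str, sr: int = 16000) -> np.ndarray:
    """Illustrative feature fusion: stack MFCCs with log-Mel filterbank energies
    frame by frame so the recognizer sees complementary spectral views."""
    y, sr = librosa.load(path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)                              # (13, frames)
    fbank = librosa.power_to_db(librosa.feature.melspectrogram(y=y, sr=sr, n_mels=40))
    return np.concatenate([mfcc, fbank], axis=0).T                                  # (frames, 53)
```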
4.4.2. Parameter Tuning Experiment
Since the self-built northern Shaanxi dialect dataset cannot fully match a deep end-to-end model, the model hyperparameters must be tuned to obtain the best possible results. This paper tunes several key parameters: the depth and width (number of layers and dimension) of the Conformer encoder, the number of heads and per-head dimension of the self-attention mechanism, and the size of the depthwise convolution kernel. By searching over these parameters, the configuration best suited to the northern Shaanxi dialect dataset is selected.
Keeping the other parameters unchanged, the number and dimension of the Conformer blocks are adjusted: the number of blocks is set to 16, 12, and 8, with corresponding dimensions of 1024, 2048, and 4096.
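The three settings can be written as a small search grid; in the sketch below, the pairing of each block count with a dimension and the group indices follow the order listed above, which is our assumption about how the groups in Table 5 are numbered.

```python
# Illustrative search grid for the encoder depth/width experiment.
encoder_grid = [
    {"group": 1, "num_blocks": 16, "encoder_dim": 1024},
    {"group": 2, "num_blocks": 12, "encoder_dim": 2048},
    {"group": 3, "num_blocks": 8,  "encoder_dim": 4096},
]
```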
The experimental results in Table 5 show the influence of the number of encoder blocks and the encoder dimension on performance. In terms of WER, Group 2 achieves the best result: increasing the encoder dimension to 2048 provides a larger parameter space to capture the complex characteristics of the audio, and a higher encoder dimension can represent the audio data better and learn a richer feature representation, thereby improving recognition accuracy. The poorer performance of Group 1 is likely due to its larger number of encoder blocks, which yields excessive model parameters and increases the risk of overfitting. Group 3 also performs poorly, probably because its very high encoder dimension makes the model too complex to learn and generalize adequately. We therefore select the parameters that strike the best balance, improving the performance of the speech recognition system.
Keeping the other parameters unchanged, the number of heads in the encoder self-attention module is set to 2, 4, and 8, with corresponding per-head dimensions of 512, 256, and 128. The self-attention in the decoder is configured consistently for these experiments.
The experimental results in Table 6 show that the number of self-attention heads and the per-head dimension jointly influence speech recognition performance; in terms of WER, Group 1 achieves the best result. In this group, the dimension of each attention head is high, allowing each head to better capture the key information in the input sequence, and the number of heads is balanced against the dimension, thereby reducing recognition errors. Groups 2 and 3 perform worse than Group 1: as the number of self-attention heads increases, the dimension of each head decreases, and the lower per-head dimension may limit each head's capacity to represent the input sequence, preventing the model from accurately learning key features.
Keeping the other parameters unchanged, the convolution kernel size is restricted to odd values, and experiments were conducted with kernel sizes of 3, 7, and 15.
The experimental results in Table 7 indicate that Group 1 achieved the best result. Generally speaking, the larger the convolution kernel, the larger the receptive field, the more context the network "sees", and the better the resulting feature representation. However, on a dialect corpus of the current scale, a small convolution kernel better captures local details, which helps to extract fine-grained local characteristics; a large kernel instead blurs these details, which is why Group 1 outperforms the other two groups. Moreover, larger kernels also incur greater computational cost.
4.4.3. Comparison Experiments
Using the self-built corpus, several mainstream end-to-end models, including WeNet (Conformer) and WeNet (Transformer), are evaluated for comparison. The results in Table 8 show that the proposed model outperforms these mainstream models.
The experimental results show that the proposed end-to-end dialect recognition model improves sequence generation by introducing the Transformer decoder and CTC on top of the Conformer encoder, while enhancing long-sequence modeling and flexible alignment. They also show that, without the preprocessing module, recognition performance is lower than with it: through dimensionality reduction, feature transformation, and position modeling, the model adapts better to the phonetic characteristics of the northern Shaanxi dialect and extracts its distinctive acoustic features and patterns more effectively.
5. Conclusions and Outlook
This paper presents an end-to-end dialect speech recognition model for the coal mining industry in northern Shaanxi. The Conformer-Transformer-CTC fusion approach draws on the strengths of each component, including robust feature extraction, accurate sequence generation, and alignment flexibility. Through experimental analysis and comparison, we show that the proposed dialect speech recognition model is better suited to the northern Shaanxi dialect, achieving a lower error rate and better generalization performance. Future work will study how to effectively integrate external language models to improve dialect speech recognition performance, and how to extract more effective pronunciation features of different dialects to achieve better recognition results and extend the approach to other Mandarin dialects.
References
- Wang, G.; Du, Y.; Chen, X.; et al. Development and innovative practice from coal mine mechanization to automation and intelligence: Commemorating the 50th anniversary of the founding of Industry and Mine Automation. Industry and Mine Automation 2023, 49, 1–18.
- Wang, F. Eight ministries and commissions: Promote the integrated development of intelligent technology and the coal industry. China Plant Engineering 2020, 1.
- Wang, G. Interpretation of the Guide to Intelligent Construction of Coal Mines (2021 Edition): From the perspective of the writing group. Journal of Intelligent Mine 2021, 2, 2–9.
- Liu, Z.; Ma, Z.; Zhang, X.; et al. IMUT-MC: A speech corpus for Mongolian speech recognition. China Scientific Data 2022, 7, 75–87.
- Zhang, C.; Wei, P.; Lu, X.; et al. Design and implementation of a speech recognition system for the Chongqing dialect. Computer Measurement and Control 2018, 26, 256–259, 263.
- Yu, B.; Huan, J.; Liu, X.; et al. Implementation and application of a speech recognition system for the Shanxi dialect. Computer & Digital Engineering 2021, 49, 2168–2173.
- Yu, C.; Wu, J.; Chen, Y.; et al. End-to-end speech recognition framework for the Tujia language based on a multi-head attention mechanism. Computer Simulation 2022, 39, 258–262, 282.
- Yu, C.; Chen, Y.; Li, Y.; et al. Cross-language end-to-end speech recognition research based on transfer learning for the low-resource Tujia language. Symmetry 2019, 11, 179–193.
- Xu, F.; Yang, J.; Yan, W.; et al. An end-to-end dialect speech recognition model based on self-attention. Journal of Signal Processing 2021, 37, 1860–1871.
- Kim, S.; Gholami, A.; Shaw, A.; et al. Squeezeformer: An efficient transformer for automatic speech recognition. arXiv 2022, arXiv:2206.00888.
- Yin, W.; Kann, K.; Yu, M.; et al. Comparative study of CNN and RNN for natural language processing. arXiv 2017, arXiv:1702.01923.
- Ren, S.; He, K.; Girshick, R.; et al. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Transactions on Pattern Analysis and Machine Intelligence 2017, 39, 1137–1149.
- Majumdar, S.; Balam, J.; Hrinchuk, O.; et al. Citrinet: Closing the gap between non-autoregressive and autoregressive end-to-end models for automatic speech recognition. arXiv 2021, arXiv:2104.01721.
- Vaswani, A.; Shazeer, N.; Parmar, N.; et al. Attention is all you need. Advances in Neural Information Processing Systems 2017, 30, 5998–6008.
- Gulati, A.; Qin, J.; Chiu, C.C.; et al. Conformer: Convolution-augmented transformer for speech recognition. arXiv 2020, arXiv:2005.08100.
- Wang, D.; Wang, X.; Lv, S. An overview of end-to-end automatic speech recognition. Symmetry 2019, 11, 1018.
- Li, J. Recent advances in end-to-end automatic speech recognition. APSIPA Transactions on Signal and Information Processing 2022, 11.
- Zhang, Q.; Lu, H.; Sak, H.; et al. Transformer transducer: A streamable speech recognition model with transformer encoders and RNN-T loss. In Proceedings of ICASSP 2020; IEEE, 2020; pp. 7829–7833.
- Chang, F.J.; Radfar, M.; Mouchtaris, A.; et al. End-to-end multi-channel transformer for speech recognition. In Proceedings of ICASSP 2021; IEEE, 2021; pp. 5884–5888.
- Li, J.; Ye, G.; Das, A.; et al. Advancing acoustic-to-word CTC model. In Proceedings of ICASSP 2018; IEEE, 2018; pp. 5794–5798.
- Deng, K.; Cao, S.; Zhang, Y.; et al. Improving CTC-based speech recognition via knowledge transferring from pre-trained language models. In Proceedings of ICASSP 2022; IEEE, 2022; pp. 8517–8521.
- Lee, J.; Watanabe, S. Intermediate loss regularization for CTC-based speech recognition. In Proceedings of ICASSP 2021; IEEE, 2021; pp. 6224–6228.
- Li, B.; Chang, S.; Sainath, T.N.; et al. Towards fast and accurate streaming end-to-end ASR. In Proceedings of ICASSP 2020; IEEE, 2020; pp. 6069–6073.
- Li, S.; Dabre, R.; Lu, X.; et al. Improving Transformer-based speech recognition systems with compressed structure and speech attributes augmentation. In Proceedings of Interspeech 2019; pp. 4400–4404.
- Sainath, T.N.; Pang, R.; Rybach, D.; et al. Two-pass end-to-end speech recognition. arXiv 2019, arXiv:1908.10992.
- Zhang, B.; Wu, D.; Peng, Z.; et al. WeNet 2.0: More productive end-to-end speech recognition toolkit. arXiv 2022, arXiv:2203.15455.
- Peng, Y.; Dalmia, S.; Lane, I.; et al. Branchformer: Parallel MLP-attention architectures to capture local and global context for speech recognition and understanding. In Proceedings of the International Conference on Machine Learning; PMLR, 2022; pp. 17627–17643.
- Cui, M.; Deng, J.; Hu, S.; et al. Two-pass decoding and cross-adaptation based system combination of end-to-end Conformer and hybrid TDNN ASR systems. arXiv 2022, arXiv:2206.11596.
- Nozaki, J.; Komatsu, T. Relaxing the conditional independence assumption of CTC-based ASR by conditioning on intermediate predictions. arXiv 2021, arXiv:2104.02724.
- Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980.