Automatic Voice Query Service for Multi-Accented Mandarin Speech

Automatic Voice Query Service (AVQS) can greatly reduce labor costs and improve response efficiency for users. Automatic speech recognition (ASR) is one of the most important components of AVQS. However, because China has many dialect areas, the AVQS must respond to multi-accented Mandarin users with a single acoustic model in its ASR. This problem severely limits the accuracy of ASR for multi-accented speech in the AVQS. In this paper, a new framework for AVQS is proposed to improve response accuracy. First, a fusion feature combining iVector and filterbank acoustic features is used to train a Transformer-CTC model. Second, the Transformer-CTC model is used to construct an end-to-end ASR. Finally, a keywords matching algorithm for AVQS based on fuzzy mathematics theory is proposed to further improve response accuracy. The results show that the final accuracy of the proposed AVQS framework reaches 91.5%. The proposed framework can satisfy the service requirements of different areas of mainland China. This research is of great significance for exploring the application value of artificial intelligence in real-world scenarios.


I. INTRODUCTION
The Telephone/Mobile phone (T/M) Automatic Voice Query Service (AVQS) is an important application in the field of intelligent speech communication [1,2]. Users can fetch the information they need through T/M AVQS. However, the cost of a purely manual voice query service is too high because the number of customers is very large: in recent years, about 1.597 billion users have subscribed to mobile phone services in China. AVQS is therefore an effective way to reduce the cost of human resources. In AVQS, automatic speech recognition (ASR) is one of the key components [3]. However, accented Mandarin speech makes ASR a great challenge in AVQS [4,5].
Besides, the uncertainty of users' home regions increases the difficulty of ASR for Mandarin speech. China has seven main dialect regions: Mandarin, Cantonese, Wu, Xiang, Min, Gan, and Kejia [6]. The dialect is the speaker's first (native) language, while other speech may be a second language. There are great differences between a speaker's second-language pronunciation and native-language pronunciation, such as rhythm and tone variation [7], and these differences make speech recognition for a second language a great challenge [8,9]. The pronunciation of Mandarin differs from that of the other dialects; Mandarin speech spoken by a dialect speaker is therefore accented Mandarin speech, and the acoustic domain of standard Mandarin speech does not match the acoustic domain of accented Mandarin speech [10]. This mismatch further increases the difficulty of ASR for AVQS. Several approaches have been proposed to solve the mismatch problem, and they can be categorized into two main types: dictionary adaptation [11]-[16] and model adaptation [17]-[22]. Dictionary adaptation focuses on phoneme variation; for example, expanding the phoneme list and the pronunciation vocabulary is commonly used. However, dictionary adaptation may lead to lexical confusion, so its effect is limited. Model adaptation reduces pronunciation confusion through an acoustic model at the phoneme level or the Hidden Markov Model (HMM) state level [15,16]. Model adaptation focuses on acoustic variation, and it requires a large amount of accented speech to train the acoustic model directly.
AVQS is a complex application of ASR: not only must multi-accented Mandarin speech be modeled, but accented Mandarin speech of different severities must also be processed. Our purpose in this paper is therefore to explore how to improve the accuracy of multi-accented Mandarin speech recognition. Recently, speaker identification features such as iVector have been used in accented speech recognition [23]-[25], and experimental results show that fusion features combining speaker identification and acoustic features are very useful for improving the accuracy of accented speech recognition. Moreover, with the development of ASR, end-to-end ASR has achieved excellent performance [26]-[28]; in particular, connectionist temporal classification (CTC) for neural networks is very useful in end-to-end ASR. Inspired by the above studies, we propose a novel framework in this paper to improve the accuracy of ASR for AVQS. The framework includes three main parts: 1) fusion features that combine iVector and filterbank features; 2) an end-to-end ASR; and 3) a keywords matching algorithm based on fuzzy mathematics theory. In particular, the keywords matching algorithm is designed according to the mismatch between the pronunciations of accented Mandarin speech and those of standard Mandarin speech.
The contributions of this paper mainly include: 1) exploring a suitable ASR with good robustness for multi-accented Mandarin speech in AVQS applications; 2) fusing iVector and filterbank features into fusion features, which are used to train and test the ASR for multi-accented Mandarin speech; and 3) proposing a keywords matching algorithm to further improve the response accuracy of AVQS. The rest of this paper is organized as follows: Section II gives an overview of the AVQS framework; Section III introduces the end-to-end ASR for AVQS; Section IV introduces the fusion features including iVector and filterbank features; Section V introduces the keywords matching algorithm based on fuzzy mathematics theory; Section VI presents the experiment setup and results; Section VII concludes the paper.

II. FRAMEWORK OF AVQS FOR T/M SPEECH
The framework of our T/M AVQS is shown in Figure 1. It consists of two parts: the query request and the query response of T/M voice. The query request involves three steps: 1) extraction of fusion features that combine iVector and acoustic features; 2) ASR; and 3) extraction of keywords based on named entity recognition (NER). The query response involves two main steps: 1) fuzzy matching of keywords and 2) answering based on Text-To-Speech (TTS). Two types of users are served: telephone users and mobile phone users. After a user dials into the server, the ASR service processes the query request. The NER procedure then extracts the keywords from the recognition result, and the matching procedure matches them against the records pre-saved in the database. Finally, the TTS procedure converts the matching result into speech and sends the speech to the user. This process constitutes one round of voice query interaction in T/M AVQS.

III. ASR BASED ON TRANSFORMER-CTC IN AVQS
The Transformer, proposed in [29], is a typical seq2seq model that performs better than BiRNNs on machine translation. Recently, several seq2seq-based ASR approaches have been proposed to improve accuracy [30,31]. In this paper, we propose a Transformer-CTC-based ASR to obtain the content of T/M speech, following [32]. The ESPNet toolkit [33] is used as the basis for developing the ASR. The framework of the Transformer-based ASR is shown in Figure 2.
In Figure 2, the fusion features are normalized as the input sequence of the Transformer encoder, and the output label embeddings are fed to the Transformer decoder. The encoder, decoder, and CTC are used jointly to train the seq2seq ASR based on Transformer-CTC. Q is the query vector, K is the key vector, and V is the value vector. For the acoustic features of each input syllable, we construct a vector group of Q, K, and V.

A. Encoder Stack
The encoder of the Transformer is a stack of M (=6) identical layers. In each layer, the first sublayer is designed based on multi-head attention, and the second sublayer is a simple fully connected feed-forward neural network. Around each of the two sublayers, a residual connection [34] and layer normalization [35] are used. Their calculation is shown in formula (1), where $b$ and $g$ are the bias and gain parameters, $W_{hh}$ denotes the hidden-to-hidden weights, and $W_{xh}$ denotes the input-to-hidden weights. Note that the output of every sublayer is $\mathrm{LayerNorm}(x + \mathrm{Sublayer}(x))$, where $\mathrm{Sublayer}(x)$ is the function implemented by the sublayer itself.
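To make this concrete, the following is a minimal NumPy sketch (ours, not the paper's implementation) of the post-sublayer computation $\mathrm{LayerNorm}(x + \mathrm{Sublayer}(x))$; the gain g and bias b correspond to the learnable layer-normalization parameters mentioned above, and all shapes are illustrative.

```python
import numpy as np

def layer_norm(x, g, b, eps=1e-6):
    """Normalize the last dimension, then rescale with gain g and bias b."""
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return g * (x - mean) / (std + eps) + b

def residual_sublayer(x, sublayer):
    """Apply a sublayer with a residual connection followed by layer norm."""
    d = x.shape[-1]
    g, b = np.ones(d), np.zeros(d)   # freshly initialized LN parameters
    return layer_norm(x + sublayer(x), g, b)
```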

B. Decoder Stack
Similar to the encoder, the decoder is a stack of N (=6) identical layers. Unlike the encoder, each decoder layer has three sublayers; the third sublayer applies multi-head attention over the outputs of the encoder stack. Every sublayer of the decoder uses a residual connection, and layer normalization is applied after each sublayer. The self-attention sublayer is modified (masked) to prevent the model from attending to follow-up positions. Combined with the output embeddings being offset by one position, this masking ensures that every prediction depends only on the previous outputs.
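As an illustration of this masking (our own sketch, not from the paper), the following NumPy snippet builds the additive mask that blocks attention to follow-up positions before the softmax in decoder self-attention.

```python
import numpy as np

def causal_mask(L):
    """Upper-triangular mask that blocks attention to follow-up positions."""
    mask = np.triu(np.ones((L, L)), k=1)        # 1 above the diagonal
    return np.where(mask == 1, -np.inf, 0.0)    # added to attention scores

# Masked score matrix: positions can only attend to earlier positions.
scores = np.random.randn(5, 5) + causal_mask(5)
```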

C. Attention Mechanism
Attention in the Transformer can be described as a mapping from queries and key-value pairs to outputs, where queries, keys, and values are all vectors. The output is computed as a weighted sum of the values.

Scaled Dot-Product Attention
The attention mechanism in this paper is based on scaled dot-product attention (SDPA) [29]. The inputs include queries and keys of dimension $d_k$ and values of dimension $d_v$. The detailed calculation is shown in formula (2):

$$\mathrm{SDPA}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V \quad (2)$$

SDPA is used in the Transformer because of its low time and space complexity. However, the larger the dimension $d_k$, the larger the magnitude of the dot products, which pushes the softmax function into regions with extremely small gradients. The scaling coefficient $1/\sqrt{d_k}$ is used to counteract this effect.
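For concreteness, here is a minimal NumPy sketch of formula (2) for 2-D query, key, and value matrices; it is an illustration only, not the ESPNet implementation used in the experiments.

```python
import numpy as np

def sdpa(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # query-key similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V                            # weighted sum of values
```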

Multi-Head Attention
Unlike a single attention function, the Transformer linearly projects the queries, keys, and values $h$ times to dimensions $d_k$, $d_k$, and $d_v$, respectively. Multi-head attention [29] can capture representations from different positions of different subspaces, whereas a single attention head would average over and thereby suppress such scattered representations. Multi-head attention is calculated by formula (3):

$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)W^{O}, \quad \mathrm{head}_i = \mathrm{SDPA}(QW_i^{Q}, KW_i^{K}, VW_i^{V}) \quad (3)$$

where the projection parameter matrices are $W_i^{Q} \in \mathbb{R}^{d_{model} \times d_k}$, $W_i^{K} \in \mathbb{R}^{d_{model} \times d_k}$, $W_i^{V} \in \mathbb{R}^{d_{model} \times d_v}$, and $W^{O} \in \mathbb{R}^{hd_v \times d_{model}}$.
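The following minimal sketch of formula (3) reuses the sdpa function from the previous block; the projection matrices are randomly initialized here purely for illustration and are learned parameters in practice.

```python
import numpy as np

def multi_head_attention(Q, K, V, h=4):
    """Split d_model into h heads, attend per head, then project back."""
    d_model = Q.shape[-1]
    d_k = d_model // h
    rng = np.random.default_rng(0)
    W = [rng.standard_normal((d_model, d_k)) for _ in range(3 * h)]
    W_o = rng.standard_normal((h * d_k, d_model))   # output projection
    heads = [sdpa(Q @ W[3*i], K @ W[3*i+1], V @ W[3*i+2]) for i in range(h)]
    return np.concatenate(heads, axis=-1) @ W_o
```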

D. CTC
CTC is a connectionist temporal classification model proposed by Graves et al. [36]. It solves the problem that training data for ASR otherwise require pre-segmentation and post-processing of label sequences, a requirement that constrains the performance of neural networks. The CTC model performs better than the HMM because it can predict the corresponding label sequence directly from unsegmented input data. In CTC, an $L$-length sequence of Chinese characters is denoted $C = \{c_l \mid l = 1, 2, \ldots, L\}$. In addition, a blank symbol $\langle b \rangle$ is used in CTC to mark the boundary of a word.
The set $C'$ augmented with $\langle b \rangle$ is defined by formula (5):

$$C' = \{c'_l \mid l = 1, 2, \ldots, 2L+1\}, \quad c'_l = \begin{cases} \langle b \rangle & \text{if } l \text{ is odd} \\ c_{l/2} & \text{if } l \text{ is even} \end{cases} \quad (5)$$

that is, $c'_l$ is the blank symbol if $l$ is odd and a Chinese character if $l$ is even. The acoustic model is calculated by CTC as in formula (6), where $X$ denotes the input, $z$ denotes the output label sequence, and $T$ is the number of frames. In particular, CTC obeys the conditional independence assumption, so we can obtain

$$P(z \mid X) = \prod_{t=1}^{T} P(z_t \mid X) \quad (6)$$

Moreover, the length of the output label sequence must not exceed the length of the input sequence. The acoustic model is constructed based on the Transformer, and the probability of every state is calculated by formula (7):

$$P(z_t \mid X) = \mathrm{Softmax}(\mathrm{Linear}(\mathrm{Transformer}_t(X))) \quad (7)$$

where $\mathrm{Softmax}(\cdot)$ is chosen as the activation function, $\mathrm{Linear}(\cdot)$ denotes the linear layer that converts the hidden-layer vectors, and $\mathrm{Transformer}_t(\cdot)$ captures all input and output hidden-layer vectors at time $t$. The CTC model for the character sequence is then given by formula (8):

$$P(C \mid X) = \sum_{z \in \mathcal{B}^{-1}(C)} P(z \mid X) \quad (8)$$

where $\mathcal{B}$ is the mapping that removes repeated labels and blanks from a frame-level label sequence $z$.
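As a hedged illustration of how formulas (6)-(8) are used in training, the following PyTorch sketch applies a linear layer and softmax to hypothetical encoder outputs and scores them with torch.nn.CTCLoss; all shapes and the vocabulary size are our assumptions, not values from the paper.

```python
import torch
import torch.nn as nn

# T (frames), batch, d_model, and vocab size are illustrative assumptions.
T, batch, d_model, vocab = 50, 2, 256, 4233   # index 0 reserved for <b> (blank)
hidden = torch.randn(T, batch, d_model)       # stand-in for Transformer_t(X)

linear = nn.Linear(d_model, vocab)              # Linear(.) of formula (7)
log_probs = linear(hidden).log_softmax(dim=-1)  # log of Softmax(.) posteriors

targets = torch.randint(1, vocab, (batch, 10))  # character label indices
input_lengths = torch.full((batch,), T)         # frames per utterance
target_lengths = torch.full((batch,), 10)       # labels per utterance

ctc_loss = nn.CTCLoss(blank=0)   # sums over the paths B^{-1}(C) of formula (8)
loss = ctc_loss(log_probs, targets, input_lengths, target_lengths)
```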

IV. FUSION FEATURES FOR TRAINING ASR
Dehak et al. [37] proposed the iVector for speaker identification, which is a milestone in that field. iVectors can effectively improve the accuracy of accented speech recognition. Therefore, we fuse the iVector and acoustic features into fusion features for training the ASR of AVQS, with the filterbank used as the acoustic feature. The iVector $w$ is obtained from formula (9):

$$M = m + Tw \quad (9)$$

where $M$ is the speaker- and channel-dependent GMM mean supervector, $m$ is the speaker-independent (UBM) mean supervector, $w$ is the iVector of fixed dimension, and $T$ is the total variability matrix [37]. We use the open-source toolkit Kaldi to extract the iVector, and the iVector and acoustic features are combined into the fusion features. The iVector is concatenated with the filterbank features frame by frame according to the time-series structure of the filterbank features; that is, each frame of filterbank features is concatenated with one iVector.
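A minimal NumPy sketch of this frame-by-frame fusion follows, assuming a single utterance-level iVector; the dimensions (80-dim filterbank, 100-dim iVector) are illustrative assumptions, as the paper does not specify them here.

```python
import numpy as np

def fuse_features(fbank: np.ndarray, ivector: np.ndarray) -> np.ndarray:
    """Concatenate one iVector onto each filterbank frame."""
    tiled = np.tile(ivector, (fbank.shape[0], 1))   # repeat for every frame
    return np.concatenate([fbank, tiled], axis=1)   # shape (T, 80 + 100)

fbank = np.random.randn(200, 80)     # 200 frames of filterbank features
ivector = np.random.randn(100)       # one iVector for the whole utterance
fusion = fuse_features(fbank, ivector)   # fused features, shape (200, 180)
```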

V. KEYWORDS MATCHING ALGORITHM FOR AVQS BASED ON FUZZY MATHEMATICS THEORY
The difference between accented Mandarin speech and standard Mandarin speech leads to poor ASR accuracy, which in turn lowers the response accuracy of the AVQS. Therefore, a keywords matching algorithm based on fuzzy mathematics theory is proposed to further improve the response accuracy of AVQS. The algorithm operates at the pinyin syllable level. After the content of the T/M speech is obtained, the keywords are extracted by named entity recognition (NER) and transformed into pinyin sequences; the pinyin of the keywords pre-saved in the database is then matched against the pinyin sequences obtained by ASR using the keywords matching algorithm. Finally, the matching results are synthesized into speech and sent to the user. The whole process is shown in Figure 1.
The error-prone pronunciations of keywords are counted statistically, and a dictionary is constructed according to the mapping between error-prone pronunciations and correct pronunciations. Finally, the degree of membership is obtained by formula (10):

$$\mu = 1 - \frac{D}{T} \quad (10)$$

where $D$ denotes the edit distance and $T$ represents the total number of characters in one pinyin syllable. In the matching process, the result with the highest degree of membership is returned as the best match.
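The following Python sketch illustrates formula (10) with a standard Levenshtein edit distance over pinyin strings; the candidate keyword list and the use of the keyword length as T are our assumptions for illustration.

```python
def edit_distance(a: str, b: str) -> int:
    """Classic dynamic-programming Levenshtein distance."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,          # deletion
                                     dp[j - 1] + 1,      # insertion
                                     prev + (ca != cb))  # substitution
    return dp[-1]

def membership(recognized: str, keyword: str) -> float:
    """Degree of membership per formula (10): 1 - D/T, floored at 0."""
    D = edit_distance(recognized, keyword)
    return max(0.0, 1.0 - D / len(keyword))

# Return the database keyword with the highest degree of membership.
candidates = ["bei jing", "nan jing", "bei hai"]   # hypothetical records
best = max(candidates, key=lambda kw: membership("bei jin", kw))
```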

VI. EXPERIMENT
Several experiments are designed to evaluate the performance of the proposed AVQS framework. They cover three parts: 1) comparison of different ASR methods on the AVQS testing data; 2) comparison of filterbank and fusion features for AVQS response; and 3) evaluation of the keywords matching algorithm based on fuzzy mathematics theory.

A. Experiment Setup
The configuration of parameters for the Transformer-CTC-based ASR is shown in Table I; this configuration is used in all experiments.

B. Data Preparation
The AIShell-1 (AIShell) speech corpus [38], a sub-dataset of AIShell-ASR0009, is used in this paper. The corpus was recorded by 400 speakers and includes multi-speaker accented speech. In this paper, the AIShell speech is also used to obtain iVectors. The real T/M speech corpus is provided by the 114 service of China Telecom (114 is a voice service of China Telecom Corporation). In the experiment, the accented Mandarin speech is divided into 7 areas (Standard Mandarin, Cantonese, Wu, Xiang, Min, Gan, and Kejia) according to the speakers' dialect areas. According to severity, the accented Mandarin speech is split into three levels, "light", "medium", and "heavy"; the split is based on expert judgment. The details of the above speech corpora for training and testing the ASR are shown in Table II.
In Table II, 2000 utterances from the real T/M voice corpus, which were not used in training, were selected to evaluate the performance of the ASR. The voice corpus for training and testing the AVQS includes the three severities of accented Mandarin speech ("light", "medium", and "heavy") and covers all 7 dialect areas. The details of the configuration are shown in Table III: the standard Mandarin speech has 200 utterances, while each of the other areas has 300 utterances, comprising light (100), medium (100), and heavy (100).

C. Results of AVQS
Character Error Rate (CER), Sentence Error Rate (SER), Keyword Error Rate (KWER), and Response Error Rate (RER) of the AVQS (the latter optimized by the keywords matching algorithm) are used to evaluate performance. Their calculations are given by formulas (11), (12), (13), and (14), respectively.

$$\mathrm{CER} = \frac{S + D + I}{S + D + I + C} \quad (11)$$
$$\mathrm{SER} = \frac{E_S}{T_S} \quad (12)$$
$$\mathrm{KWER} = \frac{E_{kw}}{T_{kw}} \quad (13)$$
$$\mathrm{RER} = 1 - \frac{Acc_S}{T_S} \quad (14)$$

where $S$ denotes the number of substitutions, $D$ the number of deletions, $I$ the number of insertions, and $C$ the number of correct characters. This calculation differs from that in [39] because we believe the error rate should not exceed 1. In addition, $E_S$ is the number of sentences with wrong characters in the testing dataset and $T_S$ is the total number of sentences in the testing dataset; $E_{kw}$ denotes the number of keyword errors and $T_{kw}$ the total number of keywords; $Acc_S$ denotes the number of correct response sentences of the AVQS.
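For clarity, the four metrics can be computed as in the following sketch, which directly encodes formulas (11)-(14) as reconstructed above; all counts are assumed inputs.

```python
def cer(S: int, D: int, I: int, C: int) -> float:
    """Character error rate bounded by 1: (S+D+I)/(S+D+I+C)."""
    return (S + D + I) / (S + D + I + C)

def ser(E_s: int, T_s: int) -> float:
    """Sentence error rate: sentences with any wrong character / total."""
    return E_s / T_s

def kwer(E_kw: int, T_kw: int) -> float:
    """Keyword error rate: wrong keywords / total keywords."""
    return E_kw / T_kw

def rer(Acc_s: int, T_s: int) -> float:
    """Response error rate: 1 - correct responses / total test sentences."""
    return 1 - Acc_s / T_s
```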

D. Accuracy of ASR
Different ASR methods, including DNN-HMM, BiLSTM-CTC, and Transformer-CTC, are compared in this experiment. Only the filterbank features are used to train the acoustic models. The results on accented Mandarin speech are shown in Table IV, Table V, and Table VI.

E. Results of Fusion Features
The fusion features and the filterbank features are also compared on the testing dataset; only the Transformer-CTC is tested in these experiments. Table VII shows the results of the comparison between the filterbank and fusion features, reported as the average CERs over light, medium, and heavy accented speech, respectively. The fusion features clearly reduce the CERs.

F. Results of the Keywords Matching Algorithm Based on Fuzzy Mathematics Theory
The keywords matching algorithm based on fuzzy mathematics theory is evaluated on the testing dataset using the trained ASR. The results are shown in Table VIII, where KWER represents the keyword error rate within a sentence and RER represents the error rate of the AVQS response after optimization by the keywords matching algorithm. The optimization algorithm clearly improves the accuracy of the AVQS response: the highest accuracy reaches 91.5%, which means that the AVQS can satisfy the requirements of speakers across the Chinese mainland.

VII. CONCLUSION
AVQS is an interesting application of artificial intelligence, and it is of great significance to explore how to improve its performance for Mandarin speakers. ASR is a key component of the AVQS. However, nonstandard Mandarin pronunciation and the great differences among dialects severely affect the performance of AVQS.
In this paper, a novel framework is proposed to improve the performance of AVQS. The framework includes three main parts: extraction of fusion features combining iVector and acoustic features; ASR based on Transformer-CTC; and keywords matching based on fuzzy mathematics theory. The fusion features effectively improve the accuracy of multi-accented Mandarin speech recognition, and the keywords matching algorithm deals with the pronunciation-error problem; together they effectively improve the response accuracy of AVQS. Experimental results show that the highest response accuracy of the AVQS reaches 91.5%. The proposed framework effectively improves the overall response accuracy of AVQS for light, medium, and heavy accented Mandarin speech.