Assessment of speech quality during speech rehabilitation based on the solution of the classification problem

Abstract: The article treats the assessment of speech quality during speech rehabilitation as a classification problem. A classifier based on an LSTM neural network is built to divide speech signals into two classes: before the operation and immediately after it. Speech before the operation serves as the reference that the patient should approach during rehabilitation. The measure of the evaluated signal's membership in the reference class serves as the speech quality score. Rehabilitation sessions were assessed experimentally, and the resulting scores were compared with expert assessments of phrasal intelligibility.


Relevance of work
The problem of oncological diseases of the organs of the speech-forming tract is pressing. According to statistical studies [1], from 2009 to 2019 there was a steady increase, for the lip, oral cavity and pharynx localizations, in the incidence per 100,000 people, the overall incidence, and the cumulative risk of this type of disease in the 0-74 age group. At the same time, the proportion of tumors of the organs of the speech-forming tract in the total number of oncological diseases remains practically unchanged because of a decrease in the proportion of diseases of the lips. These trends are shown in Figure 1. These figures determine the relevance of research related to oncological diseases of the organs of the vocal tract.
Another feature of this localization is the effect of its treatment on quality of life. After surgical treatment the patient has to relearn to speak, which requires a speech rehabilitation procedure. The procedure is especially important because most patients are of working age, and the loss of speech function significantly reduces quality of life by preventing most communicative functions from being performed both at work and at home. It can therefore be concluded that developments aimed at increasing the effectiveness of speech rehabilitation are relevant. One line of such research is obtaining objective quantitative assessments of speech quality, which is the subject of this work.

Existing approaches to assessing speech quality
Methods for assessing speech quality (Figure 2) can be divided into two categories: objective and subjective. Subjective methods rely on experts listening to and assessing pronounced units. The units themselves can differ significantly: individual phonemes, syllables, phrases. The most prominent example of this category is assessment based on GOST R 50840-95 [2]. For rehabilitation tasks, this standard yields estimates of syllable and phrasal intelligibility [3].
Objective assessment methods, in turn, can be divided into two classes: those that compare the same signal before and after transmission, and those that use different realizations of the signal for assessment. Using the former for speech rehabilitation is extremely problematic, since recordings of a patient's speech before and after the operation are not the same signal before and after exposure. In the second class, many assessment methods have been developed with input from our team. These include approaches based on normalizing signals and then comparing them [4,5] and approaches that use recognition tools as a substitute for the expert in the GOST method [6,7]. However, the limited number of such methods, and the fact that they do not fully match the GOST reference method in the accuracy and interpretability of the estimates obtained, make the search for new approaches relevant.
In this paper, we propose a new class of such methods, based on applying machine learning to obtain an estimate of speech quality as the result of solving a classification problem.

Dataset description
The experiment used a previously collected set of recordings of phrases from GOST. Each session comprises 25 recorded phrases. There are 24 patients with two sessions (before and after surgery), 18 with three sessions, and 7 with four sessions. The total number of recordings is 3250. The sampling rate is 12000 Hz. The number of pairs of sessions suitable for constructing the classifier was 49. To construct the classifier, 80% of the sets were assigned to the training set and the remaining 20% to the test set.
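The dataset figures above can be checked with a few lines of Python. This is only a sketch: the text does not say how the 80/20 split was drawn, so the helper split_sets, its rounding rule, its random seed, and the choice to split over the 49 session pairs are all assumptions made for illustration.

```python
import random

PHRASES_PER_SESSION = 25
# Number of patients for each session count, as stated in the text.
PATIENTS_BY_SESSIONS = {2: 24, 3: 18, 4: 7}


def total_recordings(patients_by_sessions, phrases_per_session):
    """Sum recordings over all patients and all of their sessions."""
    return sum(
        n_sessions * n_patients * phrases_per_session
        for n_sessions, n_patients in patients_by_sessions.items()
    )


def split_sets(set_ids, train_fraction=0.8, seed=0):
    """Shuffle whole sets and split them into train and test parts.

    Hypothetical helper: rounding and seed are assumptions, not the
    procedure used in the study.
    """
    ids = list(set_ids)
    random.Random(seed).shuffle(ids)
    n_train = round(train_fraction * len(ids))
    return ids[:n_train], ids[n_train:]


total = total_recordings(PATIENTS_BY_SESSIONS, PHRASES_PER_SESSION)
train_ids, test_ids = split_sets(range(49))
print(total, len(train_ids), len(test_ids))
```

Splitting by whole sets rather than by individual recordings keeps all phrases of one set on the same side of the split, which avoids near-duplicate material leaking from training into testing.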
During processing, each signal was converted to spectral form using the Fourier transform with a block length of 64 ms and 50% overlap. The resulting spectrograms were then fed to the input of the classifier. This approach to constructing inputs is basic [8] and is suitable as a first iteration in constructing a classifier.
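A minimal sketch of this preprocessing step is given below, using only the Python standard library. At 12000 Hz a 64 ms block is 768 samples, and 50% overlap gives a hop of 384 samples. The Hann window, the direct DFT (a real pipeline would use an FFT routine), the use of magnitudes, and the function names are assumptions; the text specifies only the block length and the overlap.

```python
import cmath
import math


def frame_params(sample_rate_hz, block_ms, overlap):
    """Convert block length in ms and fractional overlap to sample counts."""
    frame_len = int(round(sample_rate_hz * block_ms / 1000))
    hop = int(round(frame_len * (1 - overlap)))
    return frame_len, hop


def magnitude_spectrogram(signal, frame_len, hop):
    """Return one magnitude spectrum (bins 0..frame_len/2) per block."""
    # Hann window: an assumption, the text does not name a window.
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / frame_len)
              for n in range(frame_len)]
    # Precomputed complex exponentials for a direct DFT.
    twiddle = [cmath.exp(-2j * math.pi * m / frame_len)
               for m in range(frame_len)]
    n_bins = frame_len // 2 + 1
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        block = [signal[start + n] * window[n] for n in range(frame_len)]
        spectrum = []
        for k in range(n_bins):
            acc = sum(block[n] * twiddle[(k * n) % frame_len]
                      for n in range(frame_len))
            spectrum.append(abs(acc))
        frames.append(spectrum)
    return frames


SAMPLE_RATE = 12000
frame_len, hop = frame_params(SAMPLE_RATE, block_ms=64, overlap=0.5)

# Synthetic 1 kHz tone, 128 ms long, used only to exercise the code.
tone = [math.sin(2 * math.pi * 1000 * n / SAMPLE_RATE) for n in range(1536)]
spec = magnitude_spectrogram(tone, frame_len, hop)
peak_bin = max(range(len(spec[0])), key=spec[0].__getitem__)
print(frame_len, hop, len(spec), peak_bin * SAMPLE_RATE / frame_len)
```

With these settings each frequency bin is 12000 / 768 = 15.625 Hz wide, so the test tone falls exactly on one bin.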

Speech quality assessment based on the classification problem
The main idea of the proposed approach is straightforward. At the time of the visit to the clinic, the patient's speech intelligibility is practically unimpaired despite the presence of the disease. The resulting scores of phrasal intelligibility are almost always equal to 1, and the scores of syllabic intelligibility are close to 1 (differences arise more from incorrect reading of syllables than from incorrect pronunciation). This makes it possible to use the recordings made before the operation as a speech reference for a particular patient. The approach also takes into account the patient's speech features and individual defects, because subsequent comparison is made with the speech of that same patient.
After the operation, speech intelligibility is significantly reduced. The final value depends on the volume and localization of the surgical intervention; in some cases, syllabic intelligibility may fall below 0.1.
In effect, there are two classes of recordings: before and after surgery. The proposed approach is to build a machine learning system that determines whether a presented recording was made before or after the operation. Once such a system is trained, it can be presented with recordings made during rehabilitation, and the measure of membership in the reference class can be used as an assessment of the quality of pronunciation of the phrase.
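The scoring idea can be sketched as follows, assuming the classifier outputs, for each phrase, a probability that the recording belongs to the after-operation class. The function names, the averaging over phrases within a session, and the example probabilities are illustrative assumptions, not values or procedures taken from the study.

```python
from statistics import mean


def phrase_score(p_after_operation):
    """Membership of one phrase in the reference (before-operation) class."""
    if not 0.0 <= p_after_operation <= 1.0:
        raise ValueError("probability must lie in [0, 1]")
    return 1.0 - p_after_operation


def session_score(p_after_per_phrase):
    """Average phrase score over a session (averaging is an assumption)."""
    return mean(phrase_score(p) for p in p_after_per_phrase)


# Made-up classifier outputs, for illustration only.
early_session = [0.9, 0.8, 0.95, 0.7]   # soon after surgery
late_session = [0.3, 0.2, 0.4, 0.1]     # later in rehabilitation

early = session_score(early_session)
late = session_score(late_session)
print(early, late)
```

A score near 1 means the phrase is hard to distinguish from the patient's pre-operative speech, and a score near 0 means it closely resembles speech recorded immediately after surgery.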

Speech quality assessment based on the classification problem
A neural network was chosen as the machine learning method for constructing the classifier. Such networks are typical for a variety of speech analysis problems, such as speech recognition [9,10], authentication [11,12], sentiment determination [13], and others. For this reason, it was decided to use this type of network when constructing the classifier.
Considering the small amount of data and examples of using these networks for speech analysis tasks [14], a neural network based on LSTM was chosen [15].
To combat overfitting, regularization, dropout and batch normalization were applied.
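For readers unfamiliar with LSTM, the sketch below shows the forward pass of a single LSTM layer followed by a sigmoid output unit, applied to a sequence of spectrogram frames. It is not the architecture used in the study: the layer sizes, the random untrained weights, and all function names are assumptions, and regularization, dropout and batch normalization are omitted for brevity.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def affine(weights, bias, vector):
    """Compute weights @ vector + bias for plain Python lists."""
    return [sum(w * v for w, v in zip(row, vector)) + b
            for row, b in zip(weights, bias)]


def init_layer(rng, n_out, n_in, scale=0.1):
    """Random small weights and zero biases (untrained, illustration only)."""
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


def make_lstm(rng, n_features, n_hidden):
    """One (weights, bias) pair per gate, acting on [x_t, h_{t-1}]."""
    return {gate: init_layer(rng, n_hidden, n_features + n_hidden)
            for gate in ("input", "forget", "cell", "output")}


def lstm_step(params, x, h, c):
    """Standard LSTM update for one time step."""
    z = list(x) + list(h)
    i = [sigmoid(v) for v in affine(*params["input"], z)]
    f = [sigmoid(v) for v in affine(*params["forget"], z)]
    g = [math.tanh(v) for v in affine(*params["cell"], z)]
    o = [sigmoid(v) for v in affine(*params["output"], z)]
    c_new = [fj * cj + ij * gj for fj, cj, ij, gj in zip(f, c, i, g)]
    h_new = [oj * math.tanh(cj) for oj, cj in zip(o, c_new)]
    return h_new, c_new


def classify(params, head, frames, n_hidden):
    """Run the LSTM over all frames and map the last state to a probability."""
    h = [0.0] * n_hidden
    c = [0.0] * n_hidden
    for frame in frames:
        h, c = lstm_step(params, frame, h, c)
    return sigmoid(affine(*head, h)[0])


rng = random.Random(0)
N_FEATURES, N_HIDDEN = 5, 4          # toy sizes, not those of the study
lstm = make_lstm(rng, N_FEATURES, N_HIDDEN)
head = init_layer(rng, 1, N_HIDDEN)
frames = [[rng.random() for _ in range(N_FEATURES)] for _ in range(6)]
probability = classify(lstm, head, frames, N_HIDDEN)
print(probability)
```

Because the weights are untrained, the printed probability carries no meaning; the example only shows how a variable-length sequence of frames is reduced to a single class-membership value.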
The architecture of this neural network is shown in Figure 3. The next section describes the experimental study of the proposed approach and the establishment of its applicability.

All-user training and personalized training
Training was carried out in two ways: on all users, and separately on only the user of interest. The second variant relies on a limited set of data, but yields a classifier designed to work with a specific patient. A system trained on all users generalizes better; however, because it is not focused on an individual user, it is likely to give less accurate results in the final assessment. The training results show that, for a single user, the neural network can be trained without overfitting. The final accuracy for the case without separation of users was 0.8, and there are signs of overfitting.

Obtaining final speech quality scores
Once the classifier was built, the signals recorded during rehabilitation were processed, their quality was assessed, and the resulting score was compared with the score given by an expert. Values were obtained in this way for 32 sessions and are presented in Table 1. With the quality assessments in hand, the results can be analyzed.

Discussion
To assess the results, we compute the correlation coefficient between the two sets of scores and check its statistical significance. Spearman's rank correlation coefficient is used. The calculation was carried out in SPSS.
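For reference, Spearman's coefficient is the Pearson correlation of the ranks of the two samples, with tied values sharing their average rank. The sketch below computes it in plain Python; it does not reproduce the SPSS significance test, and the example scores are invented for illustration and are not the values from the study.

```python
import math


def average_ranks(values):
    """1-based ranks; tied values receive the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while (end + 1 < len(order)
               and values[order[end + 1]] == values[order[pos]]):
            end += 1
        shared = (pos + end) / 2 + 1
        for idx in order[pos:end + 1]:
            ranks[idx] = shared
        pos = end + 1
    return ranks


def pearson(x, y):
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def spearman(x, y):
    """Spearman's rank correlation coefficient of two equal-length samples."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("samples must have equal length of at least 2")
    return pearson(average_ranks(x), average_ranks(y))


# Invented scores for five sessions, for illustration only.
classifier_scores = [0.2, 0.5, 0.4, 0.9, 0.7]
expert_scores = [0.3, 0.6, 0.6, 1.0, 0.8]
rho = spearman(classifier_scores, expert_scores)
print(round(rho, 3))
```

A rank-based coefficient is a natural choice here because the classifier score and the expert intelligibility score are on different scales and need only agree in ordering.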
The obtained values and their significance levels are presented in Table 2. The scores obtained with the single-user classifier are consistent with the expert scores, which indicates that there are no obvious contradictions between the assessment methods considered. For scores based on all users, larger discrepancies are visible and the correlation coefficient is statistically insignificant. This is because during training all users were combined into one class, regardless of the volume and localization of the surgical intervention. Thus, the earlier assumption that the method is best applied when working with one user is experimentally confirmed.

Conclusions
The experiment has shown the potential applicability of the proposed classification-based approach. The approach is effective in dividing speech into the before-operation and after-operation classes, and it is applicable when a classifier is constructed for a specific patient. Spearman's correlation coefficient between the scores obtained by this method and the scores given by an expert is 0.772 and is statistically significant. In the future, it is planned to analyze the applicability of the proposed approach when patients are grouped (by gender, location and volume of surgery).
Funding: This research was funded by Ministry of Education and Science of the Russian Federation within the framework of scientific projects carried out by teams of research laboratories of educational institutions of higher education subordinate to the Ministry of Science and Higher Education of the Russian Federation, project number FEWM-2020-0042 (АААА-А20-120111190016-9).