ARTICLE | doi:10.20944/preprints202004.0001.v1
Subject: Medicine And Pharmacology, Neuroscience And Neurology Keywords: stuttering; power spectra; speech preparation; imagined speech; simulated speech
Online: 1 April 2020 (07:52:09 CEST)
Purpose: The present study, which addressed adults who stutter (AWS), investigated power spectral dynamics in the stuttering state while participants answered written questions, using quantitative electroencephalography (qEEG). Materials and Methods: A 64-channel EEG setup was used for data acquisition in 9 AWS. Since speech, and especially stuttering, introduces considerable noise into the EEG, the three conditions of speech preparation (SP), imagined speech (IS), and simulated speech (SS) were analysed in a 7-band format, and the signals were source-localized with the standardized low-resolution electromagnetic tomography (sLORETA) tool in fluent and disfluent states. Results: After extracting a sufficient number of fluent and disfluent utterances, significant differences were noted. Consistent with previous studies, a lack of beta suppression during SP, especially in the beta2 and beta3 bands and to some extent in the gamma band, was localized to the supplementary motor area (SMA) and premotor area in the disfluent state. The delta band was the best marker of stuttering shared across all three experimental conditions: decreased delta power in the SMA of both hemispheres and the right premotor area during SP, in fronto-central regions and the right angular gyrus during IS, and in the SMA of both hemispheres during SS were notable qEEG features of disfluent speech. Conclusion: The dynamics of the beta and delta frequency bands may help explain the neural networks involved in stuttering. On this basis, neurorehabilitation may be better formulated for the treatment of speech disfluency, namely stuttering.
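For orientation, the kind of band-power measure analysed above can be computed from a Welch power spectral density; the band edges and sampling rate below are conventional placeholder values, not taken from the paper.

```python
# Illustrative relative delta-band power of one EEG channel from a Welch PSD.
import numpy as np
from scipy.signal import welch

def band_power(x, fs=500, band=(1.0, 4.0)):
    f, pxx = welch(x, fs=fs, nperseg=fs * 2)
    in_band = (f >= band[0]) & (f <= band[1])
    return np.trapz(pxx[in_band], f[in_band]) / np.trapz(pxx, f)

eeg = np.random.randn(10 * 500)            # placeholder: 10 s at 500 Hz
print(f"relative delta power: {band_power(eeg):.3f}")
```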
ARTICLE | doi:10.20944/preprints202112.0196.v1
Subject: Computer Science And Mathematics, Information Systems Keywords: Speech Rehabilitation; Speech Quality Assessment; LSTM
Online: 13 December 2021 (10:10:36 CET)
The article treats the assessment of speech quality during speech rehabilitation as a classification problem. A classifier based on an LSTM neural network is built to divide speech signals into two classes: recorded before the operation and immediately after it. Speech recorded before the operation serves as the reference that the patient should approach over the course of rehabilitation, and the degree to which an evaluated signal belongs to this reference class acts as the speech quality score. An experimental assessment of rehabilitation sessions was carried out, and the resulting scores were compared with expert assessments of phrasal intelligibility.
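A minimal sketch (not the authors' implementation) of the idea: an LSTM classifier over acoustic feature sequences whose softmax probability for the pre-operation class is read off as the speech-quality score; the feature type and all dimensions are placeholders.

```python
import torch
import torch.nn as nn

class QualityLSTM(nn.Module):
    def __init__(self, n_features=40, hidden=128, n_classes=2):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_classes)

    def forward(self, x):                 # x: (batch, time, n_features)
        _, (h, _) = self.lstm(x)          # h: (1, batch, hidden)
        return self.head(h[-1])           # logits: (batch, n_classes)

model = QualityLSTM()
feats = torch.randn(1, 200, 40)           # e.g. 200 frames of 40-dim features
probs = torch.softmax(model(feats), dim=-1)
quality_score = probs[0, 0].item()        # P(pre-operation class) as the score
print(f"similarity to pre-operation reference: {quality_score:.2f}")
```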
ARTICLE | doi:10.20944/preprints202005.0383.v1
Subject: Social Sciences, Psychology Keywords: child speech; speech production; speech perception; learning; consonant age of acquisition
Online: 24 May 2020 (16:07:44 CEST)
Purpose: Perceptual learning and production practice are basic mechanisms that children depend on to acquire adult levels of speech accuracy. In this study, we examined perceptual learning and production practice as they contributed to changes in speech accuracy in three- and four-year-old children. Our primary focus was manipulating the order of perceptual learning and baseline production practice to better understand when and how these learning mechanisms interact. Method: Sixty-five typically-developing children between the ages of three and four were included in the study. Children were asked to produce CVCCVC nonwords like /bozjəm/ and /tʌvtʃəp/ that were described as the names of make-believe animals. All children completed two separate experimental blocks: a baseline block in which participants heard each nonword once and repeated it, and a test block in which the perceptual input frequency of each nonword varied between 1 and 10. Half of the participants completed a baseline-test order; half completed a test-baseline order. Results: Greater accuracy was observed for nonwords produced in the second experimental block, reflecting a production practice effect. Perceptual learning resulted in greater accuracy during the test for nonwords that participants heard 3 or more times. However, perceptual learning did not carry over to baseline productions in the test-baseline design, suggesting that it reflects a kind of temporary priming. Finally, a post hoc analysis suggested that the size of the production practice effect depended on the age of acquisition of the consonants that comprised the nonwords. Conclusions: The study provides new details about how perceptual learning and production practice interact with each other and with phonological aspects of the nonwords, resulting in complex effects on speech accuracy and learning of form-referent pairs. These findings may ultimately help speech-language pathologists maximize their clients’ improvement in therapy.
ARTICLE | doi:10.20944/preprints202306.0223.v1
Subject: Computer Science And Mathematics, Artificial Intelligence And Machine Learning Keywords: Voice Cloning; Speech Synthesis; Speech Quality Evaluation
Online: 5 June 2023 (02:27:49 CEST)
Voice cloning, an emerging field in the speech processing area, aims to generate synthetic utterances that closely resemble the voices of specific individuals. In this study, we investigate the impact of various techniques on improving the quality of voice cloning, focusing specifically on a low-quality dataset; two high-quality corpora are also used for comparative analysis. We conduct exhaustive evaluations of the quality of the gathered corpora in order to select the most suitable recordings for training a voice cloning system. Following these measurements, we conduct a series of ablations, removing recordings with lower SNR and higher variability in utterance speed from the corpora in order to decrease their heterogeneity. Furthermore, we introduce a novel algorithm that calculates the fraction of aligned input characters by exploiting the attention matrix of the Tacotron 2 Text-to-Speech (TTS) system. This algorithm provides a valuable metric for evaluating alignment quality during the voice cloning process. We present the results of our experiments, demonstrating that the performed ablations significantly increase the quality of the synthesised audio for the challenging low-quality corpus. Notably, our findings indicate that models trained on a 3-hour corpus from a pre-trained model exhibit audio quality comparable to models trained from scratch on significantly larger amounts of data.
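The alignment-fraction algorithm itself is not spelled out in the abstract; the sketch below assumes one plausible criterion (a character counts as aligned when it is the attention maximum of at least one decoder step), so it illustrates the idea rather than reproducing the authors' method.

```python
import numpy as np

def aligned_char_fraction(attention: np.ndarray) -> float:
    """attention: (decoder_steps, input_chars) matrix from a Tacotron 2-style TTS."""
    attended = np.argmax(attention, axis=1)          # most-attended char per decoder step
    n_aligned = len(set(attended.tolist()))          # distinct chars ever attended
    return n_aligned / attention.shape[1]

# toy example: 50 decoder steps over a 20-character input
att = np.random.dirichlet(np.ones(20), size=50)
print(f"fraction of aligned input characters: {aligned_char_fraction(att):.2f}")
```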
ARTICLE | doi:10.20944/preprints201807.0106.v1
Subject: Social Sciences, Cognitive Science Keywords: auditory-visual speech perception; bipolar disorder; speech perception
Online: 6 July 2018 (05:21:19 CEST)
The focus of this study was to investigate how individuals with bipolar disorder integrate auditory and visual speech information compared to non-disordered individuals, and whether auditory-visual speech integration differs between the manic and depressive episodes of bipolar disorder. It was hypothesized that the bipolar groups' auditory-visual speech integration would be less robust than that of the control group. Further, it was predicted that those in the manic phase of bipolar disorder would integrate visual speech information more than their depressive-phase counterparts. To examine these hypotheses, the McGurk effect paradigm was used with typical auditory-visual (AV) speech as well as auditory-only (AO) and visual-only (VO) stimuli. Results showed that the disordered and non-disordered groups did not differ in auditory-visual (AV) integration or auditory-only (AO) speech perception, but did differ on visual-only (VO) stimuli. The results are interpreted as paving the way for further research in which behavioural and physiological data are collected simultaneously. This will allow us to understand the full dynamics of how auditory and visual speech information (the latter relatively impoverished in bipolar disorder) are integrated in people with bipolar disorder.
ARTICLE | doi:10.20944/preprints202210.0480.v1
Subject: Computer Science And Mathematics, Artificial Intelligence And Machine Learning Keywords: Speech Recognition; Automatic Speech Recognition; Language Identification; Wav2Vec2; Multilingual
Online: 31 October 2022 (10:06:34 CET)
This paper documents the development of a special case of a multilingual Automatic Speech Recognition model, tailored to the two languages spoken by the majority of Latin America: Portuguese and Spanish. The bilingual model combines Language Identification and Speech Recognition, is built on the Wav2Vec2.0 architecture, and is trained on several open and private speech datasets. In this model, the feature encoder is trained jointly for all tasks, while a separate context encoder is trained for each task. The model is evaluated separately on two tasks: language identification and speech recognition. The results indicate that the model achieves good speech recognition performance and average language identification performance despite being trained on a small quantity of speech material. The average accuracy of the language identification module on the MLS dataset is 66.75%. The average Word Error Rate in the same scenario is 13.89%, better than the 22.58% average achieved by the commercial speech recognizer developed by Google.
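For reference, the Word Error Rate quoted above is the word-level edit distance between reference and hypothesis, normalised by the reference length; a standard implementation (independent of the authors' pipeline) looks like this.

```python
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edits needed to turn the first i reference words into the first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

print(wer("ola como estas", "ola como esta"))   # one substitution in three words -> 0.33
```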
ARTICLE | doi:10.20944/preprints202309.0497.v1
Subject: Computer Science And Mathematics, Artificial Intelligence And Machine Learning Keywords: Arabic Hate Speech; Natural Language Processing (NLP); Machine Learning; Arabic Hate Speech Detection; Arabic Hate Speech Corpus
Online: 7 September 2023 (07:14:15 CEST)
Hate speech detection in Arabic presents a multifaceted challenge due to the language's broad and diverse linguistic terrain. With its many dialects and rich cultural subtleties, Arabic requires tailored measures to address online hate speech successfully. To address this issue, academics and developers have used natural language processing (NLP) methods and machine learning algorithms adapted to the complexities of Arabic text. However, many proposed methods have been hampered by the lack of a comprehensive dataset/corpus of Arabic hate speech. In this research, we propose a novel multi-class public Arabic dataset comprising 403,688 annotated tweets categorized as extremely positive, positive, neutral, or negative based on the presence of hate speech. Using this dataset, we additionally characterize the performance of multiple machine learning models for hate speech identification in Arabic Jordanian-dialect tweets. Specifically, the Word2Vec, TF-IDF, and AraBERT text representation models are applied to produce the word vectors supplied to the classification models. Seven machine learning classifiers are then evaluated: Support Vector Machine (SVM), Logistic Regression (LR), Naive Bayes (NB), Random Forest (RF), AdaBoost (Ada), XGBoost (XGB), and CatBoost (CatB). The experimental evaluation shows that, in this challenging and unstructured setting, the gathered and annotated dataset is effective and yields encouraging results, enabling academics to delve further into this crucial field of study.
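A minimal sketch of one of the evaluated configurations (TF-IDF features feeding an SVM), using scikit-learn with placeholder data; Arabic preprocessing, the other representations (Word2Vec, AraBERT) and the remaining classifiers are omitted.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import Pipeline

texts = ["tweet one", "tweet two", "tweet three", "tweet four"]   # placeholder tweets
labels = ["neutral", "negative", "positive", "neutral"]           # placeholder labels

pipe = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2))),   # word and bigram TF-IDF features
    ("svm", LinearSVC()),
])
pipe.fit(texts, labels)
print(pipe.predict(["tweet five"]))
```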
ARTICLE | doi:10.20944/preprints202104.0651.v1
Subject: Engineering, Electrical And Electronic Engineering Keywords: speech processing, data augmentation, speech emotion recognition, generative adversarial networks
Online: 26 April 2021 (10:49:55 CEST)
With the increasing mechanization of daily life, speech processing has become crucial for interaction between humans and machines. Deep neural networks require a database with enough data for training: the more features are extracted from the speech signal, the more samples are needed, and adequate training is only ensured when sufficient and varied data are available for each class. When there is not enough data, data augmentation methods can be used to obtain a database with enough samples. One of the obstacles to developing speech emotion recognition systems is data sparsity within each class for neural network training. The current study focuses on building a cycle-consistent generative adversarial network for data augmentation in a speech emotion recognition system. For each of the five emotions employed, a generative adversarial network is designed to generate data that closely resembles the real data in that class while remaining distinguishable from the emotions of the other classes. These networks are trained adversarially to produce feature vectors resembling each class in the original feature space, which are then added to the training sets in the database to train the classifier network. Instead of the common cross-entropy loss, Wasserstein divergence is used to train the generative adversarial networks, avoiding the vanishing gradient problem and producing high-quality artificial samples. The proposed network is tested for speech emotion recognition using EMO-DB for training, testing, and evaluation, and the quality of the artificial data is assessed with two classifiers, a Support Vector Machine (SVM) and a Deep Neural Network (DNN). The results show that, by extracting and reproducing high-level representations from acoustic features, speech emotion recognition separating five primary emotions is achieved with acceptable accuracy.
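A hedged sketch of a Wasserstein-divergence critic objective (WGAN-div style) on feature vectors; the tiny critic network, feature dimension, and the coefficients k and p are illustrative placeholders, not the paper's values.

```python
import torch

def critic_loss(critic, real, fake, k=2.0, p=6.0):
    d_real, d_fake = critic(real), critic(fake)
    # gradient penalty evaluated on interpolations between real and generated features
    eps = torch.rand(real.size(0), 1, device=real.device)
    x_hat = (eps * real + (1 - eps) * fake).requires_grad_(True)
    grad = torch.autograd.grad(critic(x_hat).sum(), x_hat, create_graph=True)[0]
    div_penalty = (grad.norm(2, dim=1) ** p).mean() * k / 2
    return d_fake.mean() - d_real.mean() + div_penalty

def generator_loss(critic, fake):
    return -critic(fake).mean()

critic = torch.nn.Sequential(torch.nn.Linear(64, 128), torch.nn.ReLU(), torch.nn.Linear(128, 1))
real = torch.randn(8, 64)   # e.g. 64-dim acoustic feature vectors of one emotion class
fake = torch.randn(8, 64)   # generator output (placeholder)
print(critic_loss(critic, real, fake))
```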
ARTICLE | doi:10.20944/preprints202103.0513.v1
Subject: Engineering, Automotive Engineering Keywords: Automatic Voice Query Service; Automatic Speech Recognition; Multi-Accented Mandarin Speech Recognition
Online: 22 March 2021 (10:55:53 CET)
An Automatic Voice Query Service (AVQS) can greatly reduce labor costs and improve response efficiency for users. Automatic speech recognition (ASR) is one of the key components of an AVQS. However, the many dialect regions in China mean that the AVQS must serve multi-accented Mandarin users with a single acoustic model in the ASR, which severely limits recognition accuracy for multi-accented speech. In this paper, a new AVQS framework is proposed to improve response accuracy. First, a fused feature combining iVector and filterbank acoustic features is used to train a Transformer-CTC model. Second, the Transformer-CTC model is used to build an end-to-end ASR system. Finally, a keyword matching algorithm for the AVQS based on fuzzy mathematics is proposed to further improve response accuracy. The results show that the final accuracy of the proposed AVQS framework reaches 91.5%, satisfying the service requirements of different regions of mainland China. This research is significant for exploring the application value of artificial intelligence in real-world scenarios.
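The fuzzy keyword matching step is only named in the abstract; the sketch below uses a simple string-similarity ratio as the membership degree of an ASR hypothesis to each service keyword, purely to illustrate the matching stage, and the keyword table is invented for the example.

```python
from difflib import SequenceMatcher

KEYWORDS = {"查询余额": "balance_inquiry", "办理业务": "service_request"}   # example entries only

def match_keyword(asr_text: str, threshold: float = 0.6):
    best, best_score = None, 0.0
    for kw, intent in KEYWORDS.items():
        score = SequenceMatcher(None, asr_text, kw).ratio()   # membership degree in [0, 1]
        if score > best_score:
            best, best_score = intent, score
    return (best, best_score) if best_score >= threshold else (None, best_score)

print(match_keyword("请帮我查询一下余额"))
```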
ARTICLE | doi:10.20944/preprints202301.0008.v1
Subject: Engineering, Electrical And Electronic Engineering Keywords: Secondary emotions; emotional speech synthesis; fundamental frequency contour; Fujisaki model; low-resource; empathetic speech
Online: 3 January 2023 (07:29:37 CET)
A low-resource emotional speech synthesis system for empathetic speech, based on modelling prosody features, is presented here. Secondary emotions, identified as being needed for empathetic speech, are modelled and synthesised in this paper. As secondary emotions are subtle in nature, they are more difficult to model than primary emotions. They are also less explored, and this is one of the few studies that models secondary emotions in speech. Current speech synthesis research uses large databases and deep learning techniques to develop emotion models. There are many secondary emotions, and developing large databases for each of them is expensive. This research therefore presents a proof of concept using hand-crafted feature extraction and modelling of these features with a low resource-intensive machine learning approach, creating synthetic speech with secondary emotions. A quantitative model-based transformation is used to shape the fundamental frequency contour of the emotional speech, while speech rate and mean intensity are modelled via rule-based approaches. Using these models, an emotional text-to-speech synthesis system that synthesises five secondary emotions (anxious, apologetic, confident, enthusiastic, and worried) is developed. A perception test to evaluate the synthesised emotional speech is also conducted.
ARTICLE | doi:10.20944/preprints202304.0575.v3
Subject: Engineering, Bioengineering Keywords: Inner Speech; Imagined Speech; EEG Decoding; Brain-Computer Interface; BCI; LSTM; Wavelet Scattering Transformation; WST.
Online: 15 May 2023 (05:43:54 CEST)
In this paper, we propose an imagined speech-based brain wave pattern recognition approach using deep learning. Multiple features were extracted concurrently from eight-channel electroencephalography (EEG) signals. To obtain classifiable EEG data with a small number of sensors, we placed the EEG sensors on carefully selected spots on the scalp. To reduce the dimensionality and complexity of the EEG dataset and to avoid overfitting during deep learning, we utilized the wavelet scattering transformation. A low-cost 8-channel EEG headset was used with MATLAB 2023a to acquire the EEG data. A Long Short-Term Memory recurrent neural network (LSTM-RNN) was used to decode the EEG signals into four commands: Up, Down, Left, and Right. The wavelet scattering transformation extracts the most stable features by passing the EEG dataset through a series of filtering stages, applied separately for each command in the dataset. The proposed imagined speech-based brain wave pattern recognition approach achieved a 92.50% overall classification accuracy, which is promising for designing trustworthy real-time imagined speech-based Brain-Computer Interface (BCI) systems. For a fuller evaluation of the classification performance, other metrics were also considered, yielding 92.74% precision, 92.50% recall, and a 92.62% F1-score.
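The study was implemented in MATLAB 2023a; as an illustration only, the sketch below restates the classifier stage in PyTorch, mapping already computed wavelet-scattering feature sequences to the four commands, with feature dimensions and layer sizes chosen arbitrarily.

```python
import torch
import torch.nn as nn

COMMANDS = ["Up", "Down", "Left", "Right"]

class ImaginedSpeechLSTM(nn.Module):
    def __init__(self, n_features=64, hidden=100, n_classes=len(COMMANDS)):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden, batch_first=True)
        self.fc = nn.Linear(hidden, n_classes)

    def forward(self, x):              # x: (batch, time, n_features)
        out, _ = self.lstm(x)
        return self.fc(out[:, -1])     # classify from the last time step

model = ImaginedSpeechLSTM()
scattering_feats = torch.randn(2, 50, 64)   # placeholder scattering coefficient sequences
pred = model(scattering_feats).argmax(dim=1)
print([COMMANDS[i] for i in pred])
```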
BRIEF REPORT | doi:10.20944/preprints202310.0690.v2
Subject: Computer Science And Mathematics, Other Keywords: Keyword Detection; Audio Models; Speech Processing
Online: 7 November 2023 (02:34:57 CET)
This study introduces an original, comprehensive system centered on identifying spoken terms that indicate a user's position, particularly the discrete values representing latitude and longitude. The system not only detects these terms but also retrieves the corresponding numerical data for accurate and efficient determination of location. The study can be applied in various fields, notably aiding the offline operations of military personnel, who often lack internet access. In such scenarios, precise location awareness is vital for strategic manoeuvres, rescue operations, and navigating unfamiliar terrain. The system supports such personnel by allowing them to extract exact location coordinates from spoken terms, thereby enhancing their situational awareness even in challenging surroundings. Beyond its military utility, the project holds broader significance: emergency response teams, disaster management personnel, and exploratory missions can all benefit from this technology during disruptions in communication infrastructure. Furthermore, travelers, adventurers, and outdoor enthusiasts can use the system to determine their positions accurately in remote areas without relying on online maps. We used offline speech recognition techniques to transcribe spoken terms precisely, achieving an accuracy of over 91.3% and a word error rate of 4.2%. For speech recognition, the OpenAI Whisper model was used; a conversion pipeline from SpeechRecognition to AudioSegmentation was implemented, followed by transforming the audio into .wav format to ensure seamless compatibility with the Whisper model and uninterrupted audio input, and an application interface was developed with Streamlit for efficient use. By training the system to identify specific linguistic cues linked to location, it achieves robust detection and extraction of the relevant terms. This approach eliminates the need for constant internet connectivity, making it exceptionally useful in remote, offline, and resource-limited situations.
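A minimal sketch of the transcription-plus-extraction step, assuming the openai-whisper package; the model size, file name, and regular expressions are placeholders rather than the paper's settings.

```python
import re
import whisper

def extract_coordinates(text: str):
    # matches e.g. "latitude 48.85 ... longitude 2.35" or "lat -33.9, long 151.2"
    lat = re.search(r"lat(?:itude)?[^-\d]*(-?\d+(?:\.\d+)?)", text, re.I)
    lon = re.search(r"lon(?:gitude)?[^-\d]*(-?\d+(?:\.\d+)?)", text, re.I)
    return (float(lat.group(1)) if lat else None,
            float(lon.group(1)) if lon else None)

print(extract_coordinates("latitude 48.85 and longitude 2.35"))   # (48.85, 2.35)

model = whisper.load_model("base")                   # runs offline once downloaded
spoken = model.transcribe("spoken_position.wav")["text"]
print(extract_coordinates(spoken))
```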
ARTICLE | doi:10.20944/preprints202310.0722.v1
Subject: Computer Science And Mathematics, Computer Science Keywords: Audio-visual speech; emotion recognition; children
Online: 13 October 2023 (07:11:19 CEST)
Detecting and understanding emotions is critical for our daily activities. As emotion recognition (ER) systems develop, we start looking at more difficult cases than acted adult audio-visual speech. In this work, we investigate the automatic classification of the audio-visual emotional speech of children. Specifically, we are interested in better exploiting the cross-modal relationships between the selected modalities: video and audio. To underscore the importance of developing ER systems for real-world environments, we present a corpus of children's emotional audio-visual speech that we collected. We select a state-of-the-art model as a baseline for comparison and present several modifications focused on deeper learning of the cross-modal relationships. By conducting experiments with our proposed approach and the selected baseline model, we observe a relative improvement in performance of 2%. Finally, we conclude that focusing more on cross-modal relationships may be beneficial for building ER systems for child-machine communication and for environments where qualified professionals work with children.
CONCEPT PAPER | doi:10.20944/preprints202108.0194.v1
Subject: Social Sciences, Sociology Keywords: congruence; voice; speech; communication; identity; personality
Online: 9 August 2021 (12:41:06 CEST)
Purpose: We present a theoretical framework that formalizes and defines the constructs of communicative congruence and communicative dysphoria that is rooted within a comprehensive and mechanistic theory of personality. Background: Voice therapists have likely encountered a patient who states that a therapeutic target voice “isn’t me.” The ability to accurately convey a person’s sense of self, or identity, through their voice, speech, and communication behaviors seems to have high relevance to both patients and clinicians alike. However, to date, we lack a mechanistic theoretical framework through which to understand and interrogate the phenomenon of congruence between one’s communication behaviors and their sense of self. Results: We review the initial notion of congruence, first proposed by Carl Rogers. We then review several theories on selfhood, identity, and personality. After reviewing these theories, we explain how our proposed constructs fit within our chosen theory, the Cybernetic Big Five Theory of Personality. We then discuss similarities and differences to a similarly named construct, the Vocal Congruence Scale. Next, we review how these constructs may come to bear on an existing theory relevant to voice therapy, the Trans Theoretical Model of Health Behavior Change. Finally, we state testable hypotheses for future exploration, which we hope will establish a foundation for future investigations into communicative congruence. Conclusion: To our knowledge, the present paper is the first to explicitly define communicative congruence and communicative dysphoria. We embed these constructs within a comprehensive and mechanistic theory of personality and, in doing so, hope to provide a rigorous and comprehensive theoretical framework that will allow us to test and better understand these proposed constructs.
ARTICLE | doi:10.20944/preprints202011.0646.v1
Subject: Computer Science And Mathematics, Computer Science Keywords: social media; hate speech; text classification
Online: 25 November 2020 (14:12:07 CET)
The exponential increase in the use of the Internet and social media over the last two decades has changed human interaction. This has led to many positive outcomes, but it has also brought risks and harms. Because the volume of harmful content online, such as hate speech, is not manageable by humans, interest in the academic community in automated means of hate speech detection has increased. In this study, we analyse six publicly available datasets by combining them into a single homogeneous dataset and classifying each instance into three classes: abusive, hateful, or neither. We create a baseline model and improve its performance scores using various optimisation techniques. After attaining a competitive performance score, we create a tool which identifies and scores a page with an effective metric in near-real time and uses the result as feedback to re-train our model. We demonstrate the competitive performance of our multilingual model on two languages, English and Hindi, leading to comparable or superior performance to most monolingual models.
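A small sketch of the dataset-harmonisation step, assuming a pandas workflow; the per-source label mapping shown is illustrative, not the mapping actually used for the six datasets.

```python
import pandas as pd

LABEL_MAP = {            # per-source label -> unified class (assumed mapping)
    "offensive": "abusive",
    "abusive": "abusive",
    "hate": "hateful",
    "hateful": "hateful",
    "normal": "neither",
    "none": "neither",
}

def harmonise(frames):
    parts = []
    for name, df in frames.items():
        out = df[["text", "label"]].copy()
        out["label"] = out["label"].str.lower().map(LABEL_MAP)
        out["source"] = name
        parts.append(out.dropna(subset=["label"]))   # drop labels with no mapping
    return pd.concat(parts, ignore_index=True)

toy = {"d1": pd.DataFrame({"text": ["t1"], "label": ["hate"]}),
       "d2": pd.DataFrame({"text": ["t2"], "label": ["normal"]})}
print(harmonise(toy))
```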
ARTICLE | doi:10.20944/preprints202303.0158.v1
Subject: Computer Science And Mathematics, Artificial Intelligence And Machine Learning Keywords: speech enhancement; online applicability; real-time factor
Online: 8 March 2023 (15:25:56 CET)
Deep-learning-based speech enhancement techniques have recently attracted growing interest, since their impressive performance can potentially benefit a wide variety of digital voice communication systems. However, such performance has been evaluated mostly in offline audio processing scenarios (i.e., feeding the model a complete audio recording in one go, which may extend over several seconds). It is therefore of great interest to evaluate and characterize the current state of the art in applications that process audio online (i.e., feeding the model a sequence of audio segments and concatenating the results at the output end). Although evaluations and comparisons between speech enhancement techniques have been carried out before, as far as the author knows, the work presented here is the first to evaluate the performance of such techniques in relation to their online applicability. Specifically, this work measures how the output signal-to-interference ratio (as a separation metric), the response time, and memory usage (as online metrics) are impacted by the input length (the size of the audio segments), in addition to the amount of noise, the amount and number of interferences, and the amount of reverberation. Three popular models were evaluated, given their availability in public repositories and their online viability: MetricGAN+, Spectral Feature Mapping with Mimic Loss, and Demucs-Denoiser. The characterization was carried out using a systematic evaluation protocol based on the Speechbrain framework. Several intuitions are presented and discussed, and some recommendations for future work are proposed.
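A sketch of the online evaluation loop described above, with an identity function standing in for the enhancement model: fixed-length segments are processed one at a time, outputs are concatenated, and the per-segment response time is reported relative to the segment duration (a real-time factor).

```python
import time
import numpy as np

def enhance(segment):              # stand-in for MetricGAN+ / Demucs-Denoiser / etc.
    return segment                 # identity "model", kept only for the timing logic

def online_process(signal, sr=16000, seg_len=0.5):
    hop = int(seg_len * sr)
    outputs, rtfs = [], []
    for start in range(0, len(signal), hop):
        seg = signal[start:start + hop]
        t0 = time.perf_counter()
        outputs.append(enhance(seg))
        rtfs.append((time.perf_counter() - t0) / (len(seg) / sr))
    return np.concatenate(outputs), float(np.mean(rtfs))

audio = np.random.randn(16000 * 3)         # 3 s of placeholder audio
enhanced, mean_rtf = online_process(audio)
print(f"mean real-time factor: {mean_rtf:.4f}")
```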
ARTICLE | doi:10.20944/preprints202302.0465.v1
Subject: Computer Science And Mathematics, Information Systems Keywords: CGA-MGAN; Gated Attention Unit; Speech Enhancement
Online: 27 February 2023 (09:24:31 CET)
In recent years, neural networks based on attention mechanisms have been used increasingly widely in speech recognition, separation, enhancement, and other fields. In particular, the convolution-augmented transformer has achieved good performance, as it can combine the advantages of convolution and self-attention. Recently, the gated attention unit (GAU) has been proposed; compared with traditional multi-head self-attention, GAU-based approaches are effective and computationally efficient. In this article, we propose a network for speech enhancement called CGA-MGAN, a MetricGAN based on convolution-augmented gated attention. CGA-MGAN captures local and global correlations in speech signals simultaneously through the fusion of convolution and gated attention units. Experiments on VoiceBank+DEMAND show that the proposed CGA-MGAN achieves excellent performance (3.47 PESQ, 0.96 STOI, and 11.09 dB SSNR) at a relatively small model size (1.14 M parameters).
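A simplified, generic single-head gated attention unit in PyTorch, offered only as an illustration of the GAU idea (gating branch, shared query/key base, squared-ReLU attention); it is not the CGA-MGAN architecture, and all sizes are placeholders.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GAU(nn.Module):
    def __init__(self, d_model=64, expansion=2, s=32):
        super().__init__()
        e = d_model * expansion
        self.norm = nn.LayerNorm(d_model)
        self.to_uv = nn.Linear(d_model, 2 * e)     # gating and value branches
        self.to_z = nn.Linear(d_model, s)          # shared base for queries and keys
        self.q_scale = nn.Parameter(torch.ones(s))
        self.k_scale = nn.Parameter(torch.ones(s))
        self.out = nn.Linear(e, d_model)

    def forward(self, x):                          # x: (batch, time, d_model)
        n = x.size(1)
        h = self.norm(x)
        u, v = self.to_uv(h).chunk(2, dim=-1)
        u, v = F.silu(u), F.silu(v)
        z = self.to_z(h)
        q, k = z * self.q_scale, z * self.k_scale
        attn = F.relu(q @ k.transpose(1, 2) / n) ** 2   # squared-ReLU attention weights
        return x + self.out(u * (attn @ v))             # gated output plus residual

x = torch.randn(2, 100, 64)                        # e.g. 100 time frames
print(GAU()(x).shape)                              # torch.Size([2, 100, 64])
```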
ARTICLE | doi:10.20944/preprints202301.0580.v1
Subject: Computer Science And Mathematics, Information Systems Keywords: Electronic monitoring; hate speech; data leakage; prediction.
Online: 31 January 2023 (08:59:39 CET)
Technological innovations and the expansion of Internet access have produced significant changes in the configuration of organizations and, consequently, in the relationships between employees and employers. This new scenario creates the need for greater monitoring in the workplace in order to control inappropriate behavior or situations that may cause harm. Two important problems are the dissemination of hate through networks and data leakage, both of which can have social, psychological, and financial impacts. Monitoring tools can therefore be incorporated to assist in surveillance and help ensure that organizational objectives are met. This paper presents a workplace computer monitoring solution that integrates spyware techniques and text sentiment classification within a distributed microservices architecture, aiming to collect a range of information and generate alerts to managers regarding hate speech and vulnerabilities. Preliminary tests have been conducted to evaluate the performance of the spyware integrated with the prediction models.
ARTICLE | doi:10.20944/preprints202211.0017.v1
Subject: Computer Science And Mathematics, Computer Science Keywords: text-to-speech; naturalness; intelligibility; Brazilian Portuguese
Online: 1 November 2022 (04:37:04 CET)
This paper compares the performance of three text-to-speech (TTS) models released between June 2021 and January 2022 in order to establish a baseline for Brazilian Portuguese. The models were trained on a Brazilian Portuguese dataset: the experimental setup uses the TTS-Portuguese dataset to fine-tune the VITS end-to-end model and the GlowTTS and GradTTS acoustic models, both paired with the HiFi-GAN vocoder. Performance metrics are divided into objective and subjective metrics. As subjective metrics, naturalness and intelligibility are measured using the mean opinion score (MOS). Results show that the GradTTS + HiFi-GAN model achieved a naturalness of 4.07 MOS, close to the performance of current commercial models.
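For reference, a naturalness MOS such as the 4.07 reported above is simply the mean of listener ratings; the sketch below also adds a normal-approximation 95% confidence interval, which the abstract itself does not report.

```python
import numpy as np

def mos(ratings, z=1.96):
    r = np.asarray(ratings, dtype=float)
    mean = r.mean()
    ci = z * r.std(ddof=1) / np.sqrt(len(r))   # half-width of the 95% confidence interval
    return mean, ci

ratings = [4, 5, 4, 4, 3, 5, 4, 4]             # placeholder listener scores on a 1-5 scale
m, ci = mos(ratings)
print(f"MOS = {m:.2f} ± {ci:.2f}")
```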
ARTICLE | doi:10.20944/preprints202102.0156.v1
Subject: Social Sciences, Anthropology Keywords: ANN; NN; Speech Recognition; interaction; hybrid method
Online: 5 February 2021 (10:58:40 CET)
Human-computer interaction has become a part of our day-to-day life, and speech is one of the most natural and comfortable ways of interacting with devices as well as with other people. Devices, particularly smartphones, have multiple sensors such as cameras and microphones. Speech recognition is the process of converting the acoustic signal received by a smartphone into a set of words. An efficient speech recognition system greatly enhances the interaction between humans and machines by making the latter more responsive to user needs. The recognized words can be applied in many applications such as command and control, data entry, and document preparation. This research paper highlights speech recognition through an ANN (Artificial Neural Network). In addition, a hybrid model is proposed for audio-visual speech recognition of the Tamil and Malay languages using a SOM (Self-Organizing Map) and an MLP (Multilayer Perceptron). The effectiveness of the different NN (Neural Network) models utilized in speech recognition is examined.
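One plausible way to wire a SOM + MLP hybrid, offered only as an assumed arrangement (the abstract does not detail the architecture): a self-organizing map quantizes acoustic feature frames into a codebook, and an MLP classifies the resulting activation histograms. The sketch assumes the minisom package and uses random placeholder data.

```python
import numpy as np
from minisom import MiniSom
from sklearn.neural_network import MLPClassifier

def som_histogram(som, frames, grid=(6, 6)):
    hist = np.zeros(grid[0] * grid[1])
    for f in frames:
        i, j = som.winner(f)               # best-matching unit for this frame
        hist[i * grid[1] + j] += 1
    return hist / len(frames)

# placeholder data: 20 utterances, each 50 frames of 13-dim features, 2 classes
utterances = [np.random.randn(50, 13) for _ in range(20)]
labels = [k % 2 for k in range(20)]

som = MiniSom(6, 6, 13, sigma=1.0, learning_rate=0.5, random_seed=0)
som.train_random(np.vstack(utterances), 500)

X = np.array([som_histogram(som, u) for u in utterances])
clf = MLPClassifier(hidden_layer_sizes=(32,), max_iter=500, random_state=0).fit(X, labels)
print(clf.score(X, labels))
```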
ARTICLE | doi:10.20944/preprints201712.0058.v1
Subject: Social Sciences, Language And Linguistics Keywords: speech synthesis; evaluation; hesitation; virtual agents; interaction
Online: 11 December 2017 (07:03:14 CET)
Conversational spoken dialogue systems that interact with the user rather than merely reading text can be equipped with hesitations to manage the dialogue flow and the users' attention. Based on a series of empirical studies, we built an elaborated hesitation synthesis strategy for dialogue systems that inserts hesitations of scalable extent wherever needed in the ongoing utterance. So far, evaluations of hesitating systems have shown that synthesis quality is affected negatively by hesitations, but that there is improvement in interaction quality. We argue that due to its conversational nature, hesitation synthesis needs interactive evaluation rather than traditional MOS-based questionnaires. To prove this point, we dually evaluate our system’s speech synthesis component: on the one hand, linked to the dialogue system evaluation, on the other hand, in the traditional MOS way. This way we are able to analyze and discuss differences that arise due to the evaluation methodology. Our results suggest that MOS scales are not sufficient to assess speech synthesis quality, which has implications for future research that are discussed in this paper. Furthermore, our results indicate that hesitations work well to increase task performance and that an elaborated strategy is necessary to avoid likability issues.
ARTICLE | doi:10.20944/preprints202311.1851.v1
Subject: Computer Science And Mathematics, Artificial Intelligence And Machine Learning Keywords: Speech enhancement; Noise suppression; Deep learning; Variational autoencoders
Online: 29 November 2023 (06:25:59 CET)
This paper presents an approach to enhancing the clarity and intelligibility of speech in digital communications compromised by various background noises. Using deep learning techniques, specifically a Variational Autoencoder (VAE) with 2D convolutional filters, we aim to suppress background noise in audio signals. Our method focuses on four simulated environmental noise scenarios: storms, wind, traffic, and aircraft. The training dataset was built from public sources (the TED-LIUM 3 dataset, which includes audio recordings from the popular TED Talk series) combined with these background noises. The audio signals were transformed into 2D power spectrograms, on which our VAE model was trained to filter out the noise and reconstruct clean audio. Our results demonstrate that the model outperforms existing state-of-the-art solutions in noise suppression. Although differences between noise types were observed, it was difficult to conclude definitively which background noise most adversely affects speech quality. Results were assessed with objective methods (mathematical metrics) and subjective methods (human listening tests). Notably, wind noise showed the smallest deviation between the noisy and cleaned audio and was perceived subjectively as the most improved scenario. Future work involves refining the phase calculation of the cleaned audio and creating a more balanced dataset to minimize differences in audio quality across scenarios. Practical applications of the model to real-time streaming audio are also envisaged. This research contributes to the field of audio signal processing by offering a deep learning solution tailored to various noise conditions, enhancing digital communication quality.
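A minimal sketch of a VAE with 2D convolutional filters over power-spectrogram patches; the patch size, layer sizes, and loss weighting are illustrative, not the paper's architecture.

```python
import torch
import torch.nn as nn

class SpecVAE(nn.Module):
    def __init__(self, latent=32):
        super().__init__()
        self.enc = nn.Sequential(
            nn.Conv2d(1, 16, 4, stride=2, padding=1), nn.ReLU(),    # 64x64 -> 32x32
            nn.Conv2d(16, 32, 4, stride=2, padding=1), nn.ReLU(),   # 32x32 -> 16x16
            nn.Flatten())
        self.to_mu = nn.Linear(32 * 16 * 16, latent)
        self.to_logvar = nn.Linear(32 * 16 * 16, latent)
        self.dec = nn.Sequential(
            nn.Linear(latent, 32 * 16 * 16), nn.ReLU(),
            nn.Unflatten(1, (32, 16, 16)),
            nn.ConvTranspose2d(32, 16, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(16, 1, 4, stride=2, padding=1))

    def forward(self, x):                       # x: (batch, 1, 64, 64) spectrogram patch
        h = self.enc(x)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)   # reparameterisation trick
        return self.dec(z), mu, logvar

def vae_loss(recon, target, mu, logvar):
    rec = nn.functional.mse_loss(recon, target)
    kld = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return rec + kld

noisy = torch.randn(4, 1, 64, 64)               # placeholder noisy spectrogram patches
clean = torch.randn(4, 1, 64, 64)               # placeholder clean targets
recon, mu, logvar = SpecVAE()(noisy)
print(vae_loss(recon, clean, mu, logvar))
```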
ARTICLE | doi:10.20944/preprints202310.1967.v1
Subject: Medicine And Pharmacology, Dentistry And Oral Surgery Keywords: tongue frenulum; ankyloglossia; swallowing; tongue mobility; speech; occlusion
Online: 31 October 2023 (07:59:09 CET)
(1) Background: The incidence of ankyloglossia ranges from 0.02% to 10.7%. The literature describes the effect of ankyloglossia on selected dysfunctions of the stomatognathic system; however, no studies could be found reporting the influence of ankyloglossia on the occurrence of several disorders within one group of subjects. The aim of the present study was to assess the effect of the lingual frenulum on swallowing, speech, occlusion, and periodontal status; (2) Methods: The subjects were 172 patients, 86 with ankyloglossia (study group) and 86 with a normal tongue frenulum (control group). In all subjects, the length of the tongue frenulum, the type of swallowing, tongue mobility, occlusion, periodontal status, and speech abnormalities were assessed; (3) Results: All subjects from the control group and all those with mild ankyloglossia showed normal tongue mobility. Limited tongue mobility was found in 29.4% of subjects with moderate and 70.6% of subjects with severe ankyloglossia. Rhotacism was observed in 21.3% of subjects with a normal frenulum, and in 2.1%, 38.3%, and 38.3% of those with mild, moderate, and severe ankyloglossia, respectively. Malocclusion or crowding was diagnosed in 7.4%, 33.9%, and 20.7% of subjects with mild, moderate, and severe ankyloglossia, respectively (62% in total), versus 21.6% of subjects in the control group. No abnormalities of the periodontium in the area of the lingual surfaces of the crowns of the lower central incisors were found in any of the examined persons. Among patients with an infantile type of swallowing, 24.4% had a normal tongue frenulum length, 11.1% mild, 28.9% moderate, and 35.6% severe ankyloglossia. Among patients presenting a mature type of swallowing, 58.7% had a normal frenulum length; (4) Conclusions: 1. A shortened tongue frenulum correlates with an infantile swallowing pattern. 2. Moderate or severe ankyloglossia significantly limits tongue mobility. 3. A short tongue frenulum is related to speech disorders.
ARTICLE | doi:10.20944/preprints202306.1186.v1
Subject: Engineering, Bioengineering Keywords: SAEF; audiology competencies; audiometry simulation; speech language; students.
Online: 16 June 2023 (07:39:25 CEST)
The information society has transformed human life, and technology is now almost everywhere, including in health and education. For example, years ago speech and language therapy students required a long time and high-cost equipment to develop competencies in the healthcare of the auditory and vestibular systems. The high cost of the equipment restricted its practical use to classes, hindering students' autonomy in developing those competencies. That situation was a real issue, even more so during the pandemic, when online education was essential. This article describes SAEF, an open-source software simulator for autonomously developing procedural audiology therapy competencies, and reports on user acceptance and the validity of the experiments and results. SAEF delivers immediate feedback and performance results. The results obtained validate students' and educators' acceptance of SAEF in audiology therapy education and encourage the authors to continue developing simulator software solutions in other health education contexts. SAEF was developed using open-source technology to facilitate its accessibility, classification, and sustainability.
ARTICLE | doi:10.20944/preprints202211.0047.v1
Subject: Medicine And Pharmacology, Otolaryngology Keywords: hearing therapy; speech therapy; cochlear implant; digital application
Online: 2 November 2022 (06:10:30 CET)
Background: In order to achieve the best possible hearing and understanding with a cochlear implant (CI), regular hearing and speech therapy is necessary after implantation. This treatment should also be accessible to the growing proportion of hearing-impaired people with a migration background, which requires an alternative to therapy conducted solely in the therapist's native language. The aim of this study was to evaluate six multilingual conversation applications with regard to their usefulness for therapy. Material and Methods: The six most commonly used applications were reviewed by native speakers in terms of accuracy of content, grammatical translation, and pronunciation for English, Spanish, Arabic, Turkish, and Russian. The number of available languages, availability, cost, and additional features were also analyzed. The accuracy of the content and grammatical translation as well as the pronunciation were statistically evaluated and the differences highlighted, and the results of the different applications were compared with the performance of a native speaker. Results: All applications tested differed significantly from the native-speaker level, with Google Translator coming closest to it. All apps offer translations for multiple languages and, with exceptions, are available in both app stores. Furthermore, all apps have additional features that facilitate therapy. Conclusion: Multilingual conversation apps can make speech therapy in a foreign language much easier when used with patients. Adaptation of the software to the specific requirements of hearing and speech therapy is necessary to achieve a linguistic level corresponding to the native language of the therapist and to enable easy use in therapy.
Subject: Computer Science And Mathematics, Artificial Intelligence And Machine Learning Keywords: Multimodal Machine Learning; Deep Learning; Hate Speech Detection
Online: 15 March 2021 (13:46:27 CET)
Hateful and abusive speech presents a major challenge for all online social media platforms. Recent advances in Natural Language Processing and Natural Language Understanding allow more accurate detection of hate speech in textual streams. This study presents a multimodal approach to hate speech detection by combining Computer Vision and Natural Language processing models for abusive context detection. Our study focuses on Twitter messages and, more specifically, on hateful, xenophobic and racist speech in Greek aimed at refugees and migrants. In our approach we combine transfer learning and fine-tuning of Bidirectional Encoder Representations from Transformers (BERT) and Residual Neural Networks (Resnet). Our contribution includes the development of a new dataset for hate speech classification, consisting of tweet ids, along with the code to obtain their visual appearance, as they would have been rendered in a web browser. We have also released a pre-trained Language Model trained on Greek tweets, which has been used in our experiments. We report a consistently high level of accuracy (accuracy score=0.970, f1-score=0.947 in our best model) in racist and xenophobic speech detection.
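A generic late-fusion sketch of the BERT + ResNet combination; the model names are placeholders (the study uses a pre-trained Greek tweet language model), and the fusion head is an assumption rather than the authors' exact design.

```python
import torch
import torch.nn as nn
from transformers import AutoTokenizer, AutoModel
from torchvision import models

class TextImageFusion(nn.Module):
    def __init__(self, text_model="bert-base-multilingual-cased", n_classes=2):
        super().__init__()
        self.tok = AutoTokenizer.from_pretrained(text_model)
        self.bert = AutoModel.from_pretrained(text_model)
        self.resnet = models.resnet50()
        self.resnet.fc = nn.Identity()                      # expose 2048-d image features
        self.head = nn.Linear(self.bert.config.hidden_size + 2048, n_classes)

    def forward(self, texts, images):                       # images: (batch, 3, 224, 224)
        enc = self.tok(texts, padding=True, truncation=True, return_tensors="pt")
        txt = self.bert(**enc).last_hidden_state[:, 0]      # [CLS] embedding
        img = self.resnet(images)
        return self.head(torch.cat([txt, img], dim=-1))     # concatenate then classify

model = TextImageFusion()
logits = model(["example tweet text"], torch.randn(1, 3, 224, 224))
print(logits.shape)                                         # torch.Size([1, 2])
```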
ARTICLE | doi:10.20944/preprints202010.0342.v1
Subject: Social Sciences, Safety Research Keywords: online hate; hate speech; online disinhibition; online safety
Online: 16 October 2020 (08:27:29 CEST)
Today’s youth have almost universal access to the internet and frequently engage in social networking activities using various social media platforms and devices. This is a phenomenon that hate groups are exploiting when disseminating their propaganda. This study seeks to better understand youth exposure to hateful material in the online space by exploring predictors of such exposure including demographic characteristics (age, gender and race), academic performance, online behaviours, online disinhibition, risk perception, and parents/guardians’ supervision of online activities. We implemented a cross-sectional study design, using a paper questionnaire, in two high schools in Massachusetts (USA), focusing on students 14 to 19 years old. Logistic regression models were used to study the association between independent variables (demographics, online behaviours, risk perception, parental supervision) and exposure to hate online. Results revealed an association between exposure to hate messages in the online space and time spent online, academic performance, communicating with a stranger on social media, and benign online disinhibition. In our sample, benign online disinhibition was also associated with students’ risk of encountering someone online that tried to convince them of racist views. This study represents an important first step in understanding youth’s risk factors of exposure to hateful material online.
ARTICLE | doi:10.20944/preprints201911.0346.v1
Subject: Medicine And Pharmacology, Neuroscience And Neurology Keywords: speech; Parkinson’s disease; deep brain stimulation; voice; articulation
Online: 28 November 2019 (02:57:03 CET)
Deep brain stimulation (DBS) of the subthalamic nucleus (STN) has become an effective and widely used tool in the treatment of Parkinson’s disease (PD). STN-DBS has varied effects on speech. Clinical speech ratings suggest worsening following STN-DBS, but quantitative intelligibility, perceptual, and acoustic studies have produced mixed and inconsistent results. Improvements in phonation and declines in articulation have frequently been reported during different speech tasks under different stimulation conditions. Questions remain about preferred STN-DBS stimulation settings. Seven right-handed, native speakers of English with PD treated with bilateral STN-DBS were studied off medication at three stimulation conditions: stimulators off, 60 Hz (low frequency stimulation - LFS), and the typical clinical setting of 185 Hz (High frequency - HFS). Spontaneous speech was recorded in each condition and excerpts were prepared for transcription (intelligibility) and difficulty judgements. Separate excerpts were prepared for listeners to rate abnormalities in voice, articulation, fluency, and rate. Intelligibility for spontaneous speech was reduced at both HFS and LFS when compared to STN-DBS off. Speech produced at HFS was more intelligible than that produced at LFS, but HFS made the intelligibility task (transcription) subjectively more difficult. Both voice quality and articulation were judged to be more abnormal with STN-DBS on. STN-DBS reduced the intelligibility of spontaneous speech at both LFS and HFS but lowering the frequency did not improve intelligibility. Voice quality ratings with STN-DBS were correlated with the ratings made without stimulation. This was not true for articulation ratings. STN-DBS exacerbated an existing voice disorder and may have introduced new articulatory abnormalities.
ARTICLE | doi:10.20944/preprints201910.0376.v1
Subject: Computer Science And Mathematics, Artificial Intelligence And Machine Learning Keywords: artificial neural network; deep learning; LSTM; speech processing
Online: 31 October 2019 (16:40:30 CET)
Speech signals are degraded in real-life environments as a result of background noise and other factors. Processing such signals for voice recognition and voice analysis systems presents important challenges. One of the conditions that makes degraded quality difficult to handle in those systems is reverberation, produced by sound wave reflections that travel from the source to the microphone in multiple directions. To enhance signals under such adverse conditions, several deep learning-based methods have been proposed and proven effective. Recently, recurrent neural networks, especially those with long short-term memory (LSTM), have shown impressive results in tasks related to the time-dependent processing of signals such as speech. One of the most challenging aspects of LSTM networks is the high computational cost of training, which has limited extended experimentation in several cases. In this work, we propose and evaluate hybrid neural network models that learn different reverberation conditions without any prior information. The results show that some combinations of LSTM and perceptron layers produce good results compared with pure LSTM networks, given a fixed number of layers. The evaluation was based on quality measurements of the signal's spectrum, the training time of the networks, and statistical validation of the results. The results support the view that hybrid networks represent an important solution for speech signal enhancement, with advantages in efficiency but without a significant drop in quality.
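A sketch of the hybrid idea, assuming a spectral-mapping setup: LSTM layers followed by perceptron (fully connected) layers that map reverberant log-spectrum frames to enhanced frames; all sizes are illustrative, not the configurations evaluated in the paper.

```python
import torch
import torch.nn as nn

class HybridDereverb(nn.Module):
    def __init__(self, n_bins=257, hidden=256):
        super().__init__()
        self.lstm = nn.LSTM(n_bins, hidden, num_layers=1, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_bins))          # per-frame enhanced spectrum

    def forward(self, x):                       # x: (batch, frames, n_bins)
        h, _ = self.lstm(x)
        return self.mlp(h)

reverberant = torch.randn(1, 100, 257)          # placeholder: 100 STFT frames
print(HybridDereverb()(reverberant).shape)      # torch.Size([1, 100, 257])
```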
ARTICLE | doi:10.20944/preprints201910.0231.v1
Subject: Computer Science And Mathematics, Robotics Keywords: Android; arduino; bluetooth; grass cutter; sensors; speech recognition
Online: 20 October 2019 (02:03:44 CEST)
We present an Arduino-based automatic robotic system for cutting grass or lawns, mostly healthy grass that needs to be cut neatly, as in a public park or a private garden. The purpose of the proposed project is to design a programmable, solar-powered, pattern-based grass-cutting robot that removes the need for time-consuming manual grass cutting and can be operated wirelessly from a safe distance using an Android smartphone via Bluetooth. The robot can cut the grass in required shapes and patterns, and the cutting blade can be adjusted to maintain different grass lengths. The main focus was to design a prototype that can work with little or no physical user interaction. The proposed work is accomplished using an Arduino microcontroller, DC geared motors, an IR obstacle detection sensor, a motor shield, a relay module, a DC battery, a solar panel, and a Bluetooth module. The grass-cutting robot can be moved remotely to the location on the lawn where the user wants to cut the grass, either directly or in desired patterns. The user can press the desired pattern button in the mobile application, and the system will start cutting grass in the corresponding design, such as a circle, spiral, rectangle, or continuous pattern. With the assistance of sensors positioned at the front of the vehicle, an automatic barrier detection system is introduced to enhance safety and prevent risks: IR obstacle detector sensors detect obstacles, and if an obstacle is found in front of the robot while traveling, it avoids the barrier by turning or stopping automatically, thereby preventing a collision. A further aim of this project is the creation of a grass cutter that relieves the user from mowing their own grass and reduces environmental and noise pollution. The proposed system is designed as a lab-scale prototype to experimentally validate the efficiency, accuracy, and affordability of the system. The experimental results show that the proposed work offers all-in-one capability (simple and pattern-based grass cutting with a mobile application, plus obstacle detection), is very easy to use, and can be assembled in a simple hardware circuit. The proposed system could be implemented on a large scale under real conditions in the future, which would be useful in robotics applications and for cutting grass on playing grounds for sports such as cricket, football, and hockey.
ARTICLE | doi:10.20944/preprints202305.1060.v1
Subject: Social Sciences, Education Keywords: EFL; language functions; speech acts; teacher’s perception; textbook evaluation
Online: 15 May 2023 (15:54:12 CEST)
This study analyzes speech acts and language functions from a pragmatic viewpoint: Halliday's (1975) language functions and Searle's (1976) speech acts were adapted to analyze the functional aspects of the conversations in English as a Foreign Language (EFL) textbooks, and the study also explores teachers' perceptions of teaching and learning with these textbooks and of the communicative knowledge required to use language functions in daily activity. The participants were thirteen Kurdish teachers of high school English in Iraqi Kurdistan who taught the Sunrise textbooks for grades 10, 11, and 12. Through semi-structured interviews, it was found that the conversations in these textbooks are insufficient from a pragmatic point of view. Recommendations are made for textbook designers, teachers, and material developers to remedy the shortcomings of the textbooks. The findings reveal that the conversation texts in the Sunrise textbooks do not meet a systematic standard of pragmatic competence for English language learners, and textbook designers should be made aware of these shortcomings if learners are to develop their speaking skills through both the student and activity books. The implications of this paper lie in comparing its results with other similar studies to check whether there is a universal pattern in performing speech acts and language functions, and in helping Kimberley Education for Life learners increase their knowledge of pragmatics in general and of the language functions and speech acts investigated in this study.
ARTICLE | doi:10.20944/preprints202211.0041.v1
Subject: Social Sciences, Language And Linguistics Keywords: older adults; whispered speech; lexical tone; vowel; duration; intensity
Online: 2 November 2022 (03:53:54 CET)
Purpose: This study aimed to examine how aging and modifications of critical acoustic parameters may affect the perception of whispered speech as a degraded signal. Method: Forty Mandarin-speaking adults were included in the study. Part 1 of the study compared the perception of Mandarin lexical tones, vowels, and syllables in older and younger adults in whispered vs. phonated speech conditions. Parts 2 and 3 further examined how modification of duration and intensity cues contributed to the perceptual outcomes. Results: Perception of whispered tones was compromised in older and younger adults. Older adults identified lexical tones less accurately than their younger counterparts, particularly for phonated T2, T3 and whispered T3. Aging also negatively affected the vowel identification of /i, u/ in the whispered condition. Syllable-level accuracy was largely dependent on the accuracy of lexical tones and vowels. Furthermore, reduced duration led to the decreased accuracy of phonated T3 and whispered T2, T3 but increased accuracy of phonated T4. Reduced intensity lowered the recognition accuracy for phonated vowels /i, ɤ, o, y/ in older adults and /i, u/ in younger adults, and it also lowered the accuracy of whispered vowels /a, ɤ/ in older adults. Contrary to our expectation, increased duration and intensity did not improve older adults’ speech perception in either phonated or whispered conditions. Conclusion: The results suggest that aging adversely affected speech perception in both phonated and whispered conditions with more challenges in identifying whispered speech for older adults. While older adults’ diminished performance may be potentially due to problems with processing the degraded temporal and spectral information of the target speech sounds, it cannot be simply compensated for by increasing the duration and intensity of the target sounds beyond the audible level.
ARTICLE | doi:10.20944/preprints202210.0424.v1
Subject: Social Sciences, Language And Linguistics Keywords: emotional speech processing; communication channel; emotion category; task type
Online: 27 October 2022 (08:04:59 CEST)
How language mediates emotional perception and experience is poorly understood. The present event-related potential (ERP) study examined the explicit and implicit processing of emotional speech to differentiate the relative influences of communication channel, emotion category and task type in the prosodic salience effect. Thirty participants (15 women) were presented with spoken words denoting happiness, sadness and neutrality in either the prosodic or semantic channel. They were asked to judge the emotional content (explicit task) and speakers’ gender (implicit task) of the stimuli. Results indicated that emotional prosody (relative to semantics) triggered larger N100 and P200 amplitudes with greater delta, theta and alpha inter-trial phase coherence (ITPC) values in the corresponding early time windows, and continued to produce larger LPC amplitudes and faster responses during late stages of higher-order cognitive processing. The relative salience of prosodic and semantics was modulated by emotion and task, though such modulatory effects varied across different processing stages. The prosodic salience effect was reduced for sadness processing and in the implicit task during early auditory processing and decision-making but reduced for happiness processing in the explicit task during conscious emotion processing. Additionally, across-trial synchronization of delta, theta and alpha bands predicted the ERP components with higher ITPC values significantly associated with stronger N100, P200 and LPC enhancement. These findings reveal the neurocognitive dynamics of emotional speech processing with prosodic salience tied to stage-dependent emotion- and task-specific effects, which can reveal insights to research reconciling language and emotion processing from cross-linguistic/cultural and clinical perspectives.
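For reference, inter-trial phase coherence (ITPC) as used above is the magnitude of the mean unit-length phase vector across trials; the sketch below computes it from Hilbert-transform phases of band-filtered single trials (the band-pass filtering step itself is omitted, and the data are placeholders).

```python
import numpy as np
from scipy.signal import hilbert

def itpc(trials: np.ndarray) -> np.ndarray:
    """trials: (n_trials, n_samples) band-filtered EEG; returns ITPC per sample."""
    analytic = hilbert(trials, axis=1)
    phases = np.angle(analytic)
    return np.abs(np.mean(np.exp(1j * phases), axis=0))   # 1 = perfect phase locking

trials = np.random.randn(30, 500)          # placeholder: 30 trials, 500 samples each
print(itpc(trials).round(2)[:10])
```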
ARTICLE | doi:10.20944/preprints202105.0777.v1
Subject: Social Sciences, Psychology Keywords: statistical learning; experiment interaction; phonology; child speech; language acquisition
Online: 31 May 2021 (13:37:09 CEST)
When participants in a statistical learning paradigm are asked to learn from two incompatible or competing inputs, they often fail to learn from one or both inputs. This study presents the results of two experiments that were both completed by one group of typically developing four-year-old children. One experiment targeted word-medial consonant patterns (phonotactics), whereas the other targeted strong-weak and weak-strong stress patterns (prosody). The order of the experiments was critical for learning outcomes in the phonotactics experiment: When children learned phonotactics first, their production accuracy increased following exposure to a high frequency input. When children learned phonotactics second, however, their production accuracy dropped when they were exposed to the high frequency input. Results from the prosody experiment were inconclusive, with limited evidence of any learning effect. Overall, the results suggest that children may conflate learning experiences, and patterns learned from an initial experimental input compete with patterns in a subsequent experiment. When considering natural language acquisition, the results suggest that an isolated episode of learning may lead to generalizations that are incompatible with later input, and possibly, with larger patterns in the language.
REVIEW | doi:10.20944/preprints202009.0197.v2
Subject: Social Sciences, Psychology Keywords: academic freedom; free speech; censorship; free inquiry; thought suppression
Online: 12 October 2020 (10:07:22 CEST)
This paper explores the suppression of ideas within academic scholarship by academics, either by self-suppression or because of the efforts of other academics. Legal, moral, and social issues distinguishing freedom of speech, freedom of inquiry, and academic freedom are reviewed. How these freedoms and protections can come into tension is then explored by an analysis of denunciation mobs who exercise their legal free speech rights to call for punishing scholars who express ideas they disapprove of and condemn. When successful, these efforts, which constitute legally protected speech, will suppress certain ideas. Real-world examples over the past five years of academics who have been sanctioned or terminated for scholarship targeted by a denunciation mob are then explored.
ARTICLE | doi:10.20944/preprints202305.0247.v1
Subject: Computer Science And Mathematics, Artificial Intelligence And Machine Learning Keywords: Speech emotion recognition; one-dimensional neural network; LSTM; CNN; MFCCs
Online: 4 May 2023 (09:45:11 CEST)
In recent years, with the popularity of smart mobile devices, interaction between devices and users, especially voice interaction, has become increasingly important. If smart devices can infer more about users' emotional states from voice data, more customized services can be provided. This paper proposes a novel machine learning model for speech emotion recognition that combines convolutional neural networks (CNN), long short-term memory neural networks (LSTM), and deep neural networks (DNN), called CLDNN. To let the designed system process audio signals in a way closer to how the human auditory system does, this article uses the Mel-frequency cepstral coefficients (MFCCs) of audio data as the input of the machine learning model. First, the MFCCs of the voice signal are extracted as the model input, and feature values are computed by several local feature learning blocks (LFLB) composed of one-dimensional CNNs. Because audio signals are time-series data, the feature values obtained from the LFLBs are then fed into an LSTM layer to enhance learning at the time-series level. Finally, fully connected layers are used for classification and prediction. Three databases, RAVDESS, EMO-DB and IEMOCAP, are used for the experiments in this paper. The experimental results show that the proposed method improves accuracy compared to other related research in speech emotion recognition.
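As a rough illustration of the CLDNN pipeline described above, a minimal Keras sketch follows; the layer sizes, the input shape (300 frames of 40 MFCCs), and the eight-class output are illustrative assumptions, not the authors' exact configuration.

```python
from tensorflow.keras import layers, models

def lflb(x, filters):
    # Local feature learning block: 1D convolution + batch norm + max pooling.
    x = layers.Conv1D(filters, kernel_size=3, padding="same", activation="relu")(x)
    x = layers.BatchNormalization()(x)
    return layers.MaxPooling1D(pool_size=2)(x)

inputs = layers.Input(shape=(300, 40))      # (time frames, MFCC coefficients), assumed
x = lflb(inputs, 64)                        # CNN part: stacked LFLBs
x = lflb(x, 128)
x = layers.LSTM(128)(x)                     # LSTM part: time-series modelling
x = layers.Dense(64, activation="relu")(x)  # DNN part: fully connected layers
outputs = layers.Dense(8, activation="softmax")(x)  # e.g. 8 RAVDESS emotion classes

model = models.Model(inputs, outputs)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```

Stacking the convolutional blocks before a single LSTM layer mirrors the CNN-to-LSTM-to-DNN ordering that gives CLDNN its name.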
CASE REPORT | doi:10.20944/preprints202212.0561.v1
Subject: Social Sciences, Psychology Keywords: Potocki–Lupski syndrome; 17p11.2; PTLS; autism; ASD; EEG; language; speech
Online: 29 December 2022 (13:00:18 CET)
Potocki-Lupski Syndrome (PTLS) is a rare condition associated with a duplication of 17p11.2 that may underlie a wide range of congenital abnormalities and heterogeneous behavioral phenotypes. Along with developmental delay and intellectual disability, autism-specific traits are often reported to be the most common among patients with PTLS. To contribute to the discussion of the role of autism spectrum disorder (ASD) in the PTLS phenotype, we present a case of a female adolescent with a de novo dup(17)(p11.2p11.2) without ASD features, focusing on in-depth clinical, behavioral, and electrophysiological (EEG) evaluations. Among EEG features, we found atypical peak-slow wave patterns and a unique saw-like sharp wave of 13 Hz not previously described in any other patient. The power spectral density of the resting-state EEG was typical in our patient; only the measures of non-linear EEG dynamics, Hjorth complexity and fractal dimension, were drastically attenuated compared with the patient’s neurotypical peers. We also summarize results from previously published reports of PTLS, which point to an occurrence of ASD in PTLS of about 21%, a figure that might be biased given methodological limitations. More consistent findings among PTLS patients were intellectual disability and speech and language disorders.
ARTICLE | doi:10.20944/preprints202212.0426.v1
Subject: Engineering, Electrical And Electronic Engineering Keywords: Speech Recognition; Keyword Spotting; Child abuse; Federated Learning; Whisper; Wav2vec2.0
Online: 22 December 2022 (09:27:37 CET)
The growth in online child exploitation material is a significant challenge for European Law Enforcement Agencies (LEAs). One of the most important sources of such online information is audio material that needs to be analyzed to find evidence in a timely and practical manner. That is why LEAs require a next-generation AI-powered platform to process audio data from online sources. We propose the use of speech recognition and keyword spotting to transcribe audiovisual data and to detect the presence of keywords related to child abuse. The considered models are based on two of the most accurate neural architectures to date: Wav2vec2.0 and Whisper. The systems are tested under an extensive set of scenarios in different languages. Additionally, keeping in mind that obtaining data from LEAs is very sensitive, we explore the use of federated learning to build more robust systems for the addressed application while maintaining the privacy of the LEAs' data. The considered models achieved a word error rate between 11% and 25%, depending on the language. In addition, the systems are able to recognize a set of spotted words with true positive rates between 82% and 98%, depending on the language. Finally, federated learning strategies show that they can maintain and even improve the performance of the systems when compared to centrally trained models. The proposed systems lay the basis for an AI-powered platform for automatic analysis of audio in the context of forensic applications within child abuse. The use of federated learning is also promising for the addressed scenario, where data privacy is an important issue to be managed.
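As a hedged sketch of the transcribe-then-spot idea, the following uses the Hugging Face transformers pipeline; the checkpoint name, file path, and keyword list are placeholders, and the paper's Whisper variant and federated training are not shown.

```python
from transformers import pipeline

# Any Wav2Vec2 (or Whisper) ASR checkpoint can back the pipeline.
asr = pipeline("automatic-speech-recognition", model="facebook/wav2vec2-base-960h")

KEYWORDS = {"keyword1", "keyword2"}  # placeholder terms, not an actual LEA lexicon

def spot_keywords(audio_path):
    # Transcribe, then flag any target words present in the transcript.
    text = asr(audio_path)["text"].lower()
    return KEYWORDS.intersection(text.split())

# spotted = spot_keywords("downloaded_clip.wav")
```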
DATA DESCRIPTOR | doi:10.20944/preprints202212.0118.v1
Subject: Computer Science And Mathematics, Information Systems Keywords: Lip reading; Visual speech recognition; Turkish dataset; Face parts detection
Online: 7 December 2022 (06:50:33 CET)
The presented dataset was obtained from daily Turkish words and phrases pronounced by various people in videos posted on YouTube. The purpose of collecting the dataset is to enable detection of the spoken word by recognizing patterns or classifying lip movements with supervised, unsupervised, and semi-supervised machine learning algorithms. Most datasets related to lip reading consist of people recorded on camera with fixed backgrounds under identical conditions, but the dataset presented here consists of images compatible with machine learning models developed for real-life challenges. It contains a total of 2335 instances taken from TV series, movies, vlogs, and song clips on YouTube. The images in the dataset vary due to factors such as the way people say words, accent, speaking rate, gender, and age. Furthermore, the instances in the dataset consist of videos with different angles, shadows, resolutions, and brightness that were not created manually. The most important feature of our lip reading dataset is that it contributes to the pool of non-synthetic Turkish datasets, which currently lacks wide variety. Machine learning studies can be carried out in many areas, such as the defense industry and social life, with this dataset.
ARTICLE | doi:10.20944/preprints202208.0109.v1
Subject: Computer Science And Mathematics, Data Structures, Algorithms And Complexity Keywords: speech emotion recognition; affective computing; data augmentations; wav2vec 2.0; SVM
Online: 4 August 2022 (14:09:21 CEST)
Data augmentation techniques have recently gained more adoption in speech processing, including speech emotion recognition. Although more data tends to be more effective, there may be a trade-off in which more data does not yield a better model. This paper reports experiments investigating the effects of data augmentation in speech emotion recognition. The investigation aims at finding the most useful type of data augmentation and the optimal number of data augmentations for speech emotion recognition. The experiments are conducted on a Japanese Twitter-based emotional speech corpus. The results show that for speaker-independent data, two data augmentations, glottal source extraction and silence removal, exhibited the best performance among the alternatives, even when more augmentation techniques were available. For the text-independent data (including speaker- and text-independent), more data augmentations tend to improve speech emotion recognition performance. The results highlight the trade-off between the number of data augmentations and the performance of speech emotion recognition, showing the necessity of choosing a proper data augmentation technique for a specific application.
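Of the two winning augmentations, silence removal is simple to sketch with librosa's energy-based splitting; the 30 dB threshold and the synthetic placeholder signal below are assumptions, and glottal source extraction is not shown.

```python
import numpy as np
import librosa

def remove_silence(y, top_db=30):
    # Keep only intervals whose energy is within top_db of the signal's peak.
    intervals = librosa.effects.split(y, top_db=top_db)
    return np.concatenate([y[start:end] for start, end in intervals])

# Placeholder signal: silence, a "speech" burst, then silence, at 16 kHz.
y = np.concatenate([np.zeros(8000), 0.1 * np.random.randn(16000), np.zeros(8000)])
y_aug = remove_silence(y)  # an augmented copy to add to the training pool
```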
ARTICLE | doi:10.20944/preprints202205.0066.v1
Subject: Computer Science And Mathematics, Artificial Intelligence And Machine Learning Keywords: code-switching; automatic speech recognition; low resource languages; language modelling
Online: 6 May 2022 (09:09:31 CEST)
We present improvements in n-best rescoring of code-switched speech achieved by n-gram augmentation as well as optimised pretraining of long short-term memory (LSTM) language models with larger corpora of out-of-domain monolingual text. In addition, we consider the application of large pretrained transformer-based architectures. Our experimental evaluation is performed on an under-resourced corpus of code-switched speech comprising four bilingual code-switched sub-corpora, each containing a Bantu language (isiZulu, isiXhosa, Sesotho, or Setswana) and English. We find in our experiments that, by combining n-gram augmentation with the optimised pretraining strategy, speech recognition errors are reduced for each individual bilingual pair by 3.51% absolute on average over the four corpora. Importantly, we find that even speech recognition at language boundaries improves by 1.14% even though the additional data is monolingual. Utilising the augmented n-grams for lattice generation, we then contrast these improvements with those achieved after fine-tuning pretrained transformer-based models such as distilled GPT-2 and M-BERT. We find that, even though these language models have not been trained on any of our target languages, they can improve speech recognition performance even in zero-shot settings. After fine-tuning on in-domain data, these large architectures offer further improvements, achieving a 4.45% absolute decrease in overall speech recognition errors and a 3.52% improvement over language boundaries. Finally, a combination of the optimised LSTM and fine-tuned BERT models achieves a further gain of 0.47% absolute on average for three of the four language pairs compared to M-BERT. We conclude that the careful optimisation of the pretraining strategy used for neural network language models can offer worthwhile improvements in speech recognition accuracy even at language switches, and that much larger state-of-the-art architectures such as GPT-2 and M-BERT promise even further gains.
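The n-best rescoring step itself is model-agnostic and can be sketched in a few lines of Python; the interpolation weight and the scoring interface are illustrative, not the paper's exact configuration.

```python
def rescore_nbest(hypotheses, lm_log_prob, lm_weight=0.5):
    """Pick the best hypothesis from (text, acoustic_log_score) pairs by
    interpolating the acoustic score with a language-model log-probability."""
    best_text, _ = max(
        ((text, ac + lm_weight * lm_log_prob(text)) for text, ac in hypotheses),
        key=lambda pair: pair[1],
    )
    return best_text

# Usage with any LM exposing a log-probability, e.g. an LSTM or fine-tuned GPT-2:
# best = rescore_nbest(nbest_list, lm_log_prob=my_lm.log_prob)
```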
ARTICLE | doi:10.20944/preprints202203.0333.v1
Subject: Engineering, Control And Systems Engineering Keywords: Hate speech detection; Social media; Machine learning; Multi-model learning
Online: 25 March 2022 (02:10:12 CET)
Users on social networking platforms have the freedom to express themselves. At the same time, this has created a forum for disagreement and hate directed at individuals or groups on the basis of factors such as race or sexual orientation. Identifying hate online is a challenging task. Researchers from all around the world have contributed major methods for detecting hate speech, but owing to the issue's complexity, there are still many unresolved problems. In this research, we offer a multi-model learning strategy for detecting hate speech on Twitter. We utilised the Kaggle TwitterHate dataset, which contains 31,962 tweets with binary hate or non-hate labels, to evaluate our technique. The suggested method is tested using commonly used machine learning classifiers within the multi-model technique. Using TF-IDF features, we obtained a detection accuracy of 96.29%, precision of 96%, recall of 96%, and F1-score of 96%.
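A minimal scikit-learn sketch of such a TF-IDF pipeline follows; the logistic-regression classifier and the two-tweet toy corpus are stand-ins, since the paper evaluates several classifiers within its multi-model technique.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy corpus; in the paper this is the 31,962-tweet Kaggle TwitterHate dataset.
tweets = ["have a great day everyone", "I hate this group of people"]
labels = [0, 1]  # 0 = non-hate, 1 = hate

clf = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),   # word uni- and bigram TF-IDF features
    LogisticRegression(max_iter=1000),
)
clf.fit(tweets, labels)
print(clf.predict(["what a lovely morning"]))
```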
ARTICLE | doi:10.20944/preprints201805.0274.v1
Subject: Computer Science And Mathematics, Artificial Intelligence And Machine Learning Keywords: artificial intelligence; semantic web; natural language; Google cloud speech; SPARQL
Online: 21 May 2018 (12:38:00 CEST)
The main restriction of the Semantic Web is the difficulty of the SPARQL language, which is necessary to extract information from the knowledge representation, also known as an ontology. To make the Semantic Web accessible to people who do not know SPARQL, friendlier interfaces are essential, and natural language is a good alternative. This paper shows the implementation of a friendly prototype interface to query and retrieve, by voice, information from websites built with Semantic Web tools. In that way, end users avoid the complicated SPARQL language. To achieve this, the interface recognizes a spoken query and converts it into text, processes the text through a Java program to identify keywords, generates a SPARQL query, extracts the information from the website, and reads it aloud for the user. In our work, the Google Cloud Speech API performs Speech-to-Text conversion, and Text-to-Speech conversion is done with SVOX Pico. As results, we measured three variables: the query success rate, the query response time, and a usability survey. The values of these variables allow the evaluation of our prototype. Finally, the proposed interface provides a new approach to the problem, using the Cloud as a Service and reducing barriers to the Semantic Web for people without technical knowledge of Semantic Web technologies.
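The keyword-to-SPARQL step might look as follows with the SPARQLWrapper library; the endpoint URL and the query template are placeholders, not the paper's implementation.

```python
from SPARQLWrapper import SPARQLWrapper, JSON

def query_by_keyword(keyword, endpoint="http://example.org/sparql"):
    # Build a SPARQL query from a keyword recognized in the spoken question.
    sparql = SPARQLWrapper(endpoint)
    sparql.setQuery(f"""
        PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
        SELECT ?label WHERE {{
            ?subject rdfs:label ?label .
            FILTER(CONTAINS(LCASE(?label), "{keyword.lower()}"))
        }}
    """)
    sparql.setReturnFormat(JSON)
    results = sparql.query().convert()
    return [b["label"]["value"] for b in results["results"]["bindings"]]

# answers = query_by_keyword("museum")  # the result list would then be read aloud
```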
CASE REPORT | doi:10.20944/preprints202105.0278.v1
Subject: Biology And Life Sciences, Biochemistry And Molecular Biology Keywords: Growth Hormone; Recurrent nerve injury; Speech therapy; Neurostimulation; Vocal cord paralysis
Online: 13 May 2021 (09:27:48 CEST)
The aim of this study is to describe the cognitive and speech results obtained after growth hormone (GH) treatment and neurorehabilitation in a man who suffered a traumatic brain injury (TBI). Seventeen months after the accident, the patient was treated with GH, together with neurostimulation and speech therapy. At admission, the left vocal cord was found to be paralyzed in the paramedian position, a situation compatible with a recurrent nerve injury. Clinical and rehabilitation assessments revealed a prompt improvement in speech and cognitive functions, and following completion of treatment, endoscopic examination showed recovery of vocal cord mobility. These results, together with previous results from our group, indicate that GH treatment is safe and effective for supporting neurorehabilitation in chronic speech impairment due to central laryngeal paralysis, as well as in impaired cognitive functions.
ARTICLE | doi:10.20944/preprints201905.0228.v1
Subject: Computer Science And Mathematics, Artificial Intelligence And Machine Learning Keywords: Deep learning, LSTM, Machine learning, Post-filtering, Signal processing, Speech Synthesis
Online: 17 May 2019 (16:16:53 CEST)
Several researchers have contemplated deep learning-based post-filters to increase the quality of statistical parametric speech synthesis; these perform a mapping of the synthetic speech to the natural speech, considering the different parameters separately and trying to reduce the gap between them. Long Short-Term Memory (LSTM) neural networks have been applied successfully for this purpose, but there are still many aspects to improve in the results and in the process itself. In this paper, we introduce a new pre-training approach for the LSTM, with the objective of enhancing the quality of the synthesized speech, particularly in the spectrum, in a more efficient manner. Our approach begins with an auto-associative training of one LSTM network, which is used as an initialization for the post-filters. We show the advantages of this initialization for the enhancement of the Mel-frequency cepstral parameters of synthetic speech. Results show that the initialization achieves better enhancement of the statistical parametric speech spectrum in most cases when compared to the common random initialization of the networks.
ARTICLE | doi:10.20944/preprints201808.0522.v1
Subject: Social Sciences, Cognitive Science Keywords: speech-to-song illusion, auditory illusion, perception, pace, emotion, language tonality
Online: 30 August 2018 (10:37:13 CEST)
The speech-to-song illusion is an auditory illusion in which the repetition of part of a sentence shifts listeners' perception from speech-like to song-like. The study aims to examine how pace, emotion, and language tonality affect people's experience of the speech-to-song illusion. It uses a between-subject (pace: fast, normal, vs. slow) and within-subject (emotion: positive, negative, vs. neutral; language tonality: tonal language vs. non-tonal language) design. Sixty Hong Kong college students were randomly assigned to one of the three pace conditions. They listened to 12 audio stimuli, each with repetitions of a short excerpt, and rated their subjective perception of the presented phrase, whether it sounded like speech or a song, on a five-point Likert scale. Paired-sample t-tests and repeated-measures ANOVAs were used to analyze the data. The findings reveal that a faster speech pace can strengthen the tendency toward the speech-to-song illusion. Neither emotion nor language tonality showed a statistically significant influence on the speech-to-song illusion. This study suggests that the perception of sound lies on a continuum, and it facilitates the understanding of song production, in which speech can turn into music by having repetitive phrases played at a relatively fast pace.
ARTICLE | doi:10.20944/preprints201802.0096.v2
Subject: Computer Science And Mathematics, Computer Networks And Communications Keywords: IoT; security; encryption; quantized speech image; SNR; PESQ; histogram; entropy; correlation
Online: 15 February 2018 (19:57:48 CET)
The Internet of Things (IoT) is a promising technology of the future and is expected to connect billions of devices. The increased number of communications is expected to generate mountains of data, and data security can become a threat. The devices in the architecture are fundamentally small in size and low powered. In general, classical encryption algorithms are computationally expensive due to their complexity and need numerous rounds for encryption, essentially wasting the constrained energy of the gadgets. A less complex algorithm, though, may compromise the desired integrity. In this paper we apply a lightweight encryption algorithm named Secure IoT (SIT) to a quantized speech image. It is a 64-bit block cipher and requires a 64-bit key to encrypt the data. The quantized speech image is constructed by first quantizing a speech signal and then splitting the quantized signal into frames. Each of these frames is then transposed to form the columns of the quantized speech image. Simulation results show the algorithm provides substantial security in just five encryption rounds.
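The quantized-speech-image construction reads directly as a few lines of NumPy; the frame length and the 8-bit quantization depth are assumptions.

```python
import numpy as np

def speech_to_image(signal, frame_len=64, levels=256):
    # Quantize the normalized signal to `levels` discrete values (8-bit here).
    s = (signal - signal.min()) / (signal.max() - signal.min() + 1e-12)
    q = np.round(s * (levels - 1)).astype(np.uint8)
    # Split into frames (padding the tail) and stack frames as image columns.
    n_frames = int(np.ceil(len(q) / frame_len))
    q = np.pad(q, (0, n_frames * frame_len - len(q)))
    return q.reshape(n_frames, frame_len).T   # shape: (frame_len, n_frames)

image = speech_to_image(np.sin(np.linspace(0, 200, 16000)))  # placeholder signal
```

Each column of the resulting image is one transposed frame, which the cipher then encrypts block by block with its 64-bit key.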
ARTICLE | doi:10.20944/preprints201907.0305.v1
Subject: Social Sciences, Language And Linguistics Keywords: politics; political speech; economic crisis; Greece; deictics; space; time; image schemas; metonymicity
Online: 27 July 2019 (00:51:33 CEST)
This paper discusses the metonymic uses of the Greek deictic adverbs εδώ [here] and εκεί [there] in the language of politics. The paper draws examples from political speeches delivered in the Hellenic Parliament during 2011 that discussed the financial situation of Greece at that time. The paper discusses the multiple senses of these deictic adverbs and suggests that the temporal and spatial denotations of εδώ and εκεί are subject to image schemas. It is argued that the image schemas in which εδώ and εκεί are rooted have a metonymic basis. The paper also suggests that the spatio-temporal senses of εδώ and εκεί go beyond their deictic function due to their metonymic basis.
ARTICLE | doi:10.20944/preprints201903.0047.v1
Subject: Computer Science And Mathematics, Artificial Intelligence And Machine Learning Keywords: Gender Recognition; Speech Signal; Deep Learning; Evolutionary Search; PSO search; Wolf Search
Online: 4 March 2019 (13:42:02 CET)
The speech entailed in the human voice essentially comprises para-linguistic information used in many voice-recognition applications. Gender voice-recognition is considered one of the pivotal attributes to be detected from a given voice, a task that involves certain complications. In order to distinguish gender from a voice signal, a set of techniques have been employed to determine relevant features to be utilized for building a model from a training set. This model is useful for determining the gender (i.e., male or female) from a voice signal. The contributions are three-fold: (i) providing analysis information about well-known voice signal features using a prominent dataset, (ii) studying various machine learning models of different theoretical families to classify voice gender, and (iii) using three prominent feature selection algorithms to find promisingly optimal features for improving classification models. Experimental results show the importance of certain sub-features over others, which is vital for enhancing the performance of classification models. Experimentation reveals that the best recall values are 99.97% and 99.7% for two models, Deep Learning (DL) and Support Vector Machine (SVM), and that with feature selection the best recall value is 100% for the SVM technique.
ARTICLE | doi:10.20944/preprints201811.0126.v1
Subject: Engineering, Electrical And Electronic Engineering Keywords: Speech/Music Classification; Enhanced Voice Service; Long Short-Term Memory; Big Data
Online: 5 November 2018 (17:02:36 CET)
Speech/music classification, which facilitates optimized signal processing based on classification results, has been extensively adopted as an essential part of various electronics applications, such as multi-rate audio codecs, automatic speech recognition, and multimedia document indexing. In this paper, a new technique to improve the robustness of the speech/music classifier for the 3GPP enhanced voice service (EVS) using long short-term memory (LSTM) is proposed. For effective speech/music classification, feature vectors implemented with the LSTM are chosen from the features of the EVS. Experiments show that LSTM-based speech/music classification produces better results than the conventional EVS under a variety of conditions and types of speech/music data.
ARTICLE | doi:10.20944/preprints201802.0108.v1
Subject: Computer Science And Mathematics, Information Systems Keywords: Mandarin; prosody generation; linguistic feature; break prediction; text-to-speech; punctuation confidence
Online: 16 February 2018 (15:39:58 CET)
This paper proposes two fully-automatic machine-extracted linguistic features from an unlimited text input for Mandarin prosody generation. One is the punctuation confidence (PC) which measures the likelihood of inserting a major punctuation mark (PM) at a word boundary. Another is the quotation confidence (QC) which measures the likelihood of a word string to be quoted as a meaningful or emphasized unit in text. Because a major PM in a text is highly correlated with a prosodic break, and a quoted word string plays an important role in human language understanding, the two features potentially could provide useful information for prosody generation. The idea is first realized by employing conditional random field (CRF)-based models to predict major PMs, quoted word string locations, and their associated confidences, i.e., the PC and the QC, for each word boundary. Then, the predicted punctuations and their confidences are combined with traditional contextual linguistic features to predict prosodic-acoustic features. Both objective and subjective tests showed that the prosody generation with the proposed linguistic features performed better than the one without the proposed features. So, the proposed PC and QC are promising features for Mandarin prosody generation.
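A CRF-based punctuation predictor with per-boundary confidences might be sketched with the sklearn-crfsuite package as below; the English toy corpus and the feature template are illustrative, not the paper's Mandarin feature set.

```python
import sklearn_crfsuite

def word_features(words, i):
    # Context-window features for the word preceding each boundary.
    return {"word": words[i],
            "prev": words[i - 1] if i > 0 else "<s>",
            "next": words[i + 1] if i < len(words) - 1 else "</s>"}

# Toy corpus: "PM" labels a boundary where a major punctuation mark follows.
sentences = [["today", "it", "rained", "heavily"]]
X = [[word_features(s, i) for i in range(len(s))] for s in sentences]
y = [["O", "O", "O", "PM"]]

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=50)
crf.fit(X, y)

# Per-boundary marginal probabilities act as the punctuation confidence (PC).
pc = crf.predict_marginals(X)   # e.g. [[{"O": 0.98, "PM": 0.02}, ...]]
```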
ARTICLE | doi:10.20944/preprints202309.1636.v1
Subject: Social Sciences, Language And Linguistics Keywords: Autism spectrum conditions; Atypical resource allocation; Listening effort; Pupillometry; Speech-in-noise recognition
Online: 26 September 2023 (03:10:24 CEST)
Purpose: School-age children with autism spectrum conditions (ASC) often experience difficulties in speech-in-noise (SiN) perception, leading to increased listening effort that impacts their well-being and academic performance. This study aimed to investigate the SiN processing challenges faced by Mandarin-speaking children with ASC and their impact on listening effort. Methods: Participants completed sentence recognition tests in both quiet and noisy conditions, with a steady-state noise masker presented at 0 dB signal-to-noise ratio in the noisy condition. We compared recognition accuracy and task-evoked pupil responses from 23 Mandarin-speaking children with ASC to 19 age-matched neurotypical (NT) counterparts to gauge their behavioral performance and listening effort during these auditory tasks. Results: The ASC group demonstrated notably decreased accuracy in noise compared to their NT peers, suggesting poorer SiN perception. Pupillometric data further revealed significantly larger peak dilations in the ASC group than in the NT group under comparable conditions. Importantly, the ASC group's peak dilation in quiet mirrored the NT group's in noise. However, the ASC group exhibited shorter peak latencies and reduced mean dilations than the NT group in similar conditions. Such patterns suggest the ASC group might initially experience a heightened cognitive load but utilize fewer cognitive resources as the task continued, indicating an atypical allocation of cognitive resources and a potential tendency towards relatively superficial and automated auditory processing. Conclusion: Our findings highlight the unique SiN processing challenges children with ASC face, underscoring the importance of a nuanced, individual-centric approach for interventions and support.
REVIEW | doi:10.20944/preprints202309.0505.v1
Subject: Medicine And Pharmacology, Otolaryngology Keywords: cochlear implant; patient-reported outcomes; pure tone average; speech in noise; music perception
Online: 7 September 2023 (11:22:04 CEST)
Electric stimulation via a cochlear implant (CI) enables people with severe to profound sensorineural hearing loss to regain speech understanding and music appreciation, allowing them to actively engage in social life. Three main manufacturers (Cochlear, MED-EL and Advanced Bionics “AB”) have been offering CI systems, presenting CI recipients and otolaryngologists with a difficult decision, as currently no comprehensive overview or meta-analysis on performance outcomes following CI implantation is available. The main goal of this scoping review is to provide evidence that data and standardized speech and music performance tests are available for performing such comparisons. To this end, a literature search was conducted to find studies that address speech and music outcomes in CI recipients. From a total of 1592 papers, 188 paper abstracts were analyzed, and 147 articles were found suitable for examination of the full text, from which 42 studies were included for synthesis. A total of 16 studies used the consonant-nucleus-consonant (CNC) word recognition test in quiet at 60 dB SPL. We found that, aside from technical comparisons, only very few publications compare speech outcomes across manufacturers of CI systems. Evidence suggests, though, that these data are available in large CI centers in Germany and the US. Future studies should therefore leverage large data cohorts to perform such comparisons, which could provide critical evaluation criteria and assist both CI recipients and otolaryngologists in making informed performance-based decisions.
REVIEW | doi:10.20944/preprints202308.2166.v1
Subject: Public Health And Healthcare, Primary Health Care Keywords: dysphagia; artificial intelligence; videofluoroscopic swallowing study; deep learning; machine learning; imaging; speech pathology
Online: 31 August 2023 (10:42:28 CEST)
Radiological imaging is an essential component of a swallowing assessment. Artificial intelligence (AI), and especially deep learning (DL) models, have enhanced the efficiency and efficacy with which imaging is interpreted, and consequently have important implications for swallowing diagnostics and intervention planning. However, the application of AI to the interpretation of videofluoroscopic swallowing studies (VFSS) is still emerging. This review showcases recent literature on the use of AI to interpret VFSS and highlights clinical implications for speech pathologists (SPs). With a surge in AI research, advances have been made in dysphagia assessment. Several studies have demonstrated successful implementation of DL algorithms to analyze VFSS. Notably, convolutional neural networks (CNNs) have been used to detect pertinent aspects of the swallowing process with high levels of precision. DL algorithms have the potential to streamline VFSS interpretation, improve efficiency and accuracy, and enable precise interpretation of instrumental dysphagia evaluation, which is especially advantageous when access to skilled clinicians is not ubiquitous. By enhancing the precision, speed, and depth of VFSS interpretation, SPs can obtain a more comprehensive understanding of swallow physiology and deliver targeted and timely intervention tailored to the individual. This has practical applications for both clinical practice and dysphagia research. As this research area grows and AI technologies progress, the application of DL to VFSS interpretation is clinically beneficial and has the potential to transform dysphagia assessment and management. With broader validation and inter-disciplinary collaborations, AI-augmented VFSS interpretation is likely to transform swallow evaluation and ultimately improve outcomes for individuals with dysphagia.
BRIEF REPORT | doi:10.20944/preprints202207.0062.v1
Subject: Medicine And Pharmacology, Otolaryngology Keywords: Total Laryngectomy; Cancer; Voice; Voice prosthesis; Otolaryngology; Head Neck Surgery; Speech Language Therapists.
Online: 5 July 2022 (05:44:14 CEST)
Background: In the present study, we assessed the feasibility and success outcomes of voice prosthesis (VP) changes performed by a speech-language pathologist (SLP). Methods: Patients treated with total laryngectomy (TL) from January 2020 to December 2020 were prospectively recruited from our medical center. Patients underwent tracheoesophageal puncture. The VP changes were performed by the senior SLP, and the following data were collected for each VP change: date of placement, change, or removal; VP type and size; reason for change or removal; and use of a washer for periprosthetic leakage. A patient-reported outcome questionnaire including 6 items was given to patients at each VP change (Appendix 1). Items were assessed with a 10-point Likert scale. Results: Fifty-two VP changes were performed by the senior SLP during the study period. The mean duration of the SLP consultation, including patient history, examination, and the VP change procedure, was 20 min (range: 15-30). The median prosthesis lifetime was 88 days. The main reasons for VP changes were transprosthetic (N=34; 79%) and periprosthetic (N=7; 21%) leakages. The SLP successfully performed all VP changes. In one case, the SLP did not change the VP but used a periprosthetic silastic to stop the periprosthetic leakage. In two cases, the SLP needed the surgeon's examination to discuss the following indications: implant mucosa inclusion and autologous fat injection. Patient satisfaction was high regarding the speed and quality of care provided by the SLP. Conclusion: The delegation of VP changes from the otolaryngologist-head and neck surgeon to the SLP may be done without significant complications. The delegation of the VP change procedure to the SLP may be of particular interest in rural regions with otolaryngologist shortages.
ARTICLE | doi:10.20944/preprints202311.0963.v1
Subject: Computer Science And Mathematics, Artificial Intelligence And Machine Learning Keywords: Hate Speech Detection; Machine Learning; Sentiment Analysis; Semi-Supervised Learning; Self-Learning; Text Mining
Online: 15 November 2023 (09:58:07 CET)
Text annotation is an essential element of natural language processing approaches. The manual annotation process performed by humans has several drawbacks, such as subjectivity, slowness, fatigue, and possible carelessness. In addition, annotators may annotate ambiguous data. We therefore developed the concept of automated annotation to obtain the best annotations using several machine-learning approaches. The proposed approach is based on an ensemble algorithm of meta-learner and meta-vectorizer techniques. The approach employs a semi-supervised learning technique for automated annotation aimed at detecting hate speech. This involves leveraging various machine learning algorithms, including Support Vector Machine (SVM), Decision Tree (DT), K-Nearest Neighbors (KNN), and Naive Bayes (NB), in conjunction with Word2Vec and TF-IDF text extraction methods. The annotation process is performed on 13,169 Indonesian YouTube comments. The proposed model used a stemming approach based on data from Sastrawi plus new data of 2,245 words. Semi-supervised learning uses 5%, 10%, and 20% labeled data, compared to performing labeling based on 80% of the datasets. In semi-supervised learning, the model learns from the labeled data, which provides explicit information, and from the unlabeled data, which offers implicit insights. This hybrid approach enables the model to generalize and make informed predictions even when limited labeled data is available, ultimately enhancing its ability to handle real-world scenarios with scarce annotated information. In addition, the proposed method uses a variety of thresholds for matching words labeled as hate speech, ranging over 0.6, 0.7, 0.8, and 0.9. The experiments showed that the KNN-Word2Vec model has the best accuracy value of 96.9% with a scenario of 5%:80%:0.9. However, several other methods, such as SVM and DT, also achieve accuracy above 90% with both text extraction methods in several test scenarios.
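The semi-supervised self-learning loop can be sketched with scikit-learn's SelfTrainingClassifier; the 0.9 threshold mirrors one of the paper's scenarios, while the logistic-regression base learner and toy comments are illustrative stand-ins for the paper's SVM/DT/KNN/NB options.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.semi_supervised import SelfTrainingClassifier

# Toy comments; -1 marks unlabeled examples awaiting pseudo-labels.
comments = ["what a kind remark", "awful hateful remark", "neutral text", "more text"]
labels = np.array([0, 1, -1, -1])

X = TfidfVectorizer().fit_transform(comments)
self_trainer = SelfTrainingClassifier(
    LogisticRegression(max_iter=1000),
    threshold=0.9,   # only predictions above 0.9 confidence become pseudo-labels
)
self_trainer.fit(X, labels)
```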
ARTICLE | doi:10.20944/preprints202211.0037.v1
Subject: Social Sciences, Language And Linguistics Keywords: non-native speech learning; talker variability; phonetically-irrelevant variability; long-term retention; cognitive abilities
Online: 2 November 2022 (03:05:23 CET)
Talker variability has been reported to facilitate generalization and retention of speech learning, but is also shown to place demands on cognitive resources. Our recent study provided evidence that phonetically-irrelevant acoustic variability in single-talker (ST) speech is sufficient to induce equivalent amounts of learning to the use of multiple-talker (MT) training. This study is a follow-up contrasting MT versus ST training with varying degrees of temporal exaggeration to examine how cognitive measures of individual learners may influence the role of input variability in immediate learning and long-term retention. Native Chinese-speaking adults were trained on the English /i/-/ɪ/ contrast. We assessed the trainees’ working memory and selective attention before training. Trained participants showed retention of more native-like cue weighting in both perception and production regardless of talker variability condition. The ST training group showed long-term benefit in word identification, whereas the MT training group did not retain the improvement. The results demonstrate the role of phonetically-irrelevant variability in robust speech learning and modulatory functions of nonlinguistic working memory and selective attention, highlighting the necessity to consider the interaction between input characteristics, task difficulty, and individual differences in cognitive abilities in assessing learning outcomes.
ARTICLE | doi:10.20944/preprints202203.0258.v1
Subject: Social Sciences, Law Keywords: hate speech; artificial intelligence; social media platforms; content moderation; freedom of expression; non-discrimination
Online: 17 March 2022 (15:26:41 CET)
Artificial Intelligence is increasingly being used by social media platforms to tackle online hate speech. The sheer quantity of content, the speed at which it is developed, and the increased pressure companies face from States to remove hate speech quickly from their platforms have led to a tricky situation. This commentary argues that automated mechanisms, which may rely on biased datasets and be unable to pick up on the nuances of language, should not be left unattended with hate speech, as this can lead to violations of freedom of expression and the right to non-discrimination.
ARTICLE | doi:10.20944/preprints202112.0134.v1
Subject: Computer Science And Mathematics, Robotics Keywords: Human Robot Interaction (HRI); social robot; Speech Emotion Recognition (SER); Gender Recognition; affective states
Online: 8 December 2021 (14:31:07 CET)
The real challenge in Human Robot Interaction (HRI) is to build machines capable of perceiving human emotions so that robots can interact with humans in a proper manner. It is well known from the literature that emotion varies according to many factors. Among these, gender is one of the most influential, so an appropriate gender-dependent emotion recognition system is recommended. In this paper, a two-level hierarchical Speech Emotion Recognition (SER) system is proposed: the first level is the Gender Recognition (GR) module for identifying the speaker's gender; the second is a gender-specific SER block. For this work, attention was focused on optimising the first level of the proposed architecture. The system was designed to be installed on social robots for monitoring hospitalised and home-dwelling elderly patients. Hence, it is important to reduce the computational effort of the software and minimize the hardware bulk in order for the system to be suitable for social robots. The algorithm was executed on Raspberry Pi hardware. For training, the Italian emotional database EMOVO was used. Results show a GR accuracy of 97.8%, comparable with values found in the literature.
Subject: Medicine And Pharmacology, Neuroscience And Neurology Keywords: hearing loss; aging; hyperactivity; excitability; loss of inhibition; neurophysiology; auditory perception; neural plasticity; speech processing
Online: 15 April 2021 (13:34:54 CEST)
Many aging adults experience some form of hearing problems that may arise from auditory peripheral damage. However, it has been increasingly acknowledged that hearing loss is not only a dysfunction of the auditory periphery but results from changes within the entire auditory system, from periphery to cortex. Damage to the auditory periphery is associated with an increase in neural activity at various stages throughout the auditory pathway. Here, we review neurophysiological evidence of hyperactivity and the auditory perceptual difficulties that may result from it, and we outline open conceptual and methodological questions related to the study of hyperactivity. We suggest that hyperactivity alters all aspects of hearing, including spectral, temporal, and spatial hearing, and, in turn, impairs speech comprehension when background sound is present. By focusing on the perceptual consequences of hyperactivity and the potential challenges of investigating hyperactivity in humans, we hope to bring animal and human electrophysiologists closer together to better understand hearing problems in older adulthood.
ARTICLE | doi:10.20944/preprints202008.0645.v1
Subject: Computer Science And Mathematics, Artificial Intelligence And Machine Learning Keywords: Speech Emotion Recognition; Emotion AI; Self-Supervised Learning; Transfer Learning; Low Resource Training; wav2vec
Online: 28 August 2020 (15:05:37 CEST)
We propose a novel transfer learning method for speech emotion recognition that obtains promising results when only a small amount of training data is available. With as few as 125 examples per emotion class, we were able to reach a higher accuracy than a strong baseline trained on 8 times more data. Our method leverages knowledge contained in pre-trained speech representations extracted from models trained on a more general self-supervised task which doesn't require human annotations, such as the wav2vec model. We provide detailed insights on the benefits of our approach by varying the training data size, which can help labeling teams work more efficiently. We compare performance with other popular methods on the IEMOCAP dataset, a well-benchmarked dataset in the Speech Emotion Recognition (SER) research community. Furthermore, we demonstrate that results can be greatly improved by combining acoustic and linguistic knowledge from transfer learning. We align acoustic pre-trained representations with semantic representations from the BERT model through an attention-based recurrent neural network. Performance improves significantly when combining both modalities and scales with the amount of data. When trained on the full IEMOCAP dataset, we reach a new state-of-the-art of 73.9% unweighted accuracy (UA).
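The first stage, extracting pre-trained self-supervised representations, can be sketched with torchaudio; a wav2vec 2.0 bundle stands in here for the wav2vec model used in the paper, and the mean pooling and linear head are assumptions.

```python
import torch
import torchaudio

bundle = torchaudio.pipelines.WAV2VEC2_BASE
model = bundle.get_model().eval()

# Placeholder audio: one second at the bundle's sample rate stands in for an
# IEMOCAP utterance loaded with torchaudio.load(...).
waveform = torch.randn(1, bundle.sample_rate)

with torch.no_grad():
    features, _ = model.extract_features(waveform)
embedding = features[-1].mean(dim=1)   # mean-pool the last layer over time

# A small head trained on few labeled examples then predicts the emotion class:
head = torch.nn.Linear(embedding.shape[-1], 4)   # e.g. 4 IEMOCAP classes
logits = head(embedding)
```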
ARTICLE | doi:10.20944/preprints201811.0163.v1
Subject: Engineering, Electrical And Electronic Engineering Keywords: Cymatics, Speech recognition, Mel-Frequency Cepstral Coefficients (MFCC), Dynamic time warping (DTW), Chladni plates
Online: 7 November 2018 (13:42:22 CET)
This paper proposes an original approach for achieving Cymatics-based visual perception of isolated speech commands. The idea is to smartly combine effective speech processing and analysis methods with the phenomenon of Cymatics. In this context, an effective approach for automatic isolated-speech message recognition is proposed. The incoming speech segment is enhanced by applying appropriate pre-emphasis filtering, noise thresholding, and zero-alignment operations. The Mel-frequency cepstral coefficients (MFCCs), Delta coefficients, and Delta-Delta coefficients are extracted from the enhanced speech segment. The Dynamic Time Warping (DTW) technique is then employed to compare these extracted features with the reference templates, and the comparison outcomes are used to make the classification decision. The classification decision is transformed into a methodical excitation, which is finally converted into systematic visual perceptions via the phenomenon of Cymatics. The system functionality is tested with an experimental setup and results are presented. The approach is novel and can be employed in various applications such as visual art, encryption, education, archeology, architecture, and the integration of impaired people.
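The MFCC-plus-DTW matching stage can be sketched with librosa as follows; the template files and the two-word command vocabulary are placeholders.

```python
import librosa

def mfcc_features(path, sr=16000, n_mfcc=13):
    # Load a command recording and extract its MFCC matrix (n_mfcc x frames).
    y, _ = librosa.load(path, sr=sr)
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)

def dtw_cost(f1, f2):
    # Accumulated-cost matrix from DTW; the last cell is the total alignment cost.
    D, _ = librosa.sequence.dtw(X=f1, Y=f2, metric="euclidean")
    return D[-1, -1]

templates = {"on": mfcc_features("template_on.wav"),     # placeholder files
             "off": mfcc_features("template_off.wav")}
query = mfcc_features("incoming_command.wav")
decision = min(templates, key=lambda word: dtw_cost(templates[word], query))
```

The label of the lowest-cost template is the classification decision that would then drive the Cymatics excitation.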
ARTICLE | doi:10.20944/preprints201810.0739.v1
Subject: Engineering, Electrical And Electronic Engineering Keywords: Event-Driven Processing, Speech recognition, Adaptive Resolution Analysis, Features extraction, Dynamic Time Warping, Classification
Online: 31 October 2018 (08:14:15 CET)
This paper proposes a novel approach, based on adaptive-rate processing and analysis, for isolated speech recognition. The idea is to smartly combine event-driven signal acquisition and windowing with adaptive-rate processing, analysis, and classification to realize effective isolated speech recognition. The incoming speech signal is digitized with an event-driven A/D converter (EDADC). The output of the EDADC is windowed with an activity selection process. These windows are then resampled uniformly with an adaptive-rate interpolator. The resampled windows are de-noised with an adaptive-rate filter, and their spectra are computed with an adaptive-resolution short-time Fourier transform (ARSTFT). The magnitude, Delta, and Delta-Delta spectral coefficients are then extracted. The Dynamic Time Warping (DTW) technique is employed to compare these extracted features with the reference templates, and the comparison outcomes are used to make the classification decision. The system functionality is tested for a case study and results are presented. An 8.2-fold reduction in the number of acquired samples is achieved by the devised approach compared to the classical one. This implies a significant computational gain and power-consumption reduction of the proposed system over classical counterparts. An average subject-dependent isolated speech recognition accuracy of 96.8% is achieved. This shows that the proposed approach is a potential candidate for automatic speech recognition applications such as rehabilitation centers, smart call centers, and smart homes.
ARTICLE | doi:10.20944/preprints202309.1339.v1
Subject: Computer Science And Mathematics, Other Keywords: linguistic E-learning; phonetic transcription; mel frequency cepstrum coefficient; grapheme-to-phoneme; transformer; speech synthesis
Online: 20 September 2023 (09:59:40 CEST)
E-learning systems have undergone great development since the pandemic. In this work, we propose three artificial intelligence-based enhancements to our linguistic interactive E-learning system, each addressing a different aspect. Compared with the original phonetic transcription exam system, our enhancements include an MFCC+CNN-based disordered speech classification module, a Transformer-based grapheme-to-phoneme converter, and a Tacotron2-based IPA-to-speech synthesis system. This work not only provides a better experience for the users of the system but also explores the use of artificial intelligence technologies in the E-learning and linguistic fields.
ARTICLE | doi:10.20944/preprints202108.0433.v1
Subject: Computer Science And Mathematics, Computer Science Keywords: Speech emotion recognition; Feature extraction; Heterogeneous parallel network; Spectral features; Prosodic features; Multi-feature fusion
Online: 23 August 2021 (12:16:40 CEST)
Speech emotion recognition remains a challenging task in natural language processing, placing strict requirements on the effectiveness of both feature extraction and the acoustic model. With that in mind, a Heterogeneous Parallel Convolution Bi-LSTM model is proposed to address these challenges. It consists of two heterogeneous branches: the left one contains two dense layers and a Bi-LSTM layer, while the right one contains a dense layer, a convolution layer, and a Bi-LSTM layer. The model exploits spatiotemporal information more effectively and achieves 84.65%, 79.67%, and 56.50% unweighted average recall on the benchmark databases EMODB, CASIA, and SAVEE, respectively. Compared with previous research results, the proposed model consistently achieves better performance.
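The two-branch topology can be sketched with the Keras functional API; the layer widths, input shape, and seven-class output (e.g. for EMODB) are assumptions.

```python
from tensorflow.keras import layers, models

inp = layers.Input(shape=(300, 39))                  # assumed (frames, features)

# Left branch: dense -> dense -> Bi-LSTM.
left = layers.Dense(128, activation="relu")(inp)
left = layers.Dense(128, activation="relu")(left)
left = layers.Bidirectional(layers.LSTM(64))(left)

# Right branch: dense -> 1D convolution -> Bi-LSTM.
right = layers.Dense(128, activation="relu")(inp)
right = layers.Conv1D(64, kernel_size=3, padding="same", activation="relu")(right)
right = layers.Bidirectional(layers.LSTM(64))(right)

# Concatenate the heterogeneous branches and classify.
merged = layers.concatenate([left, right])
out = layers.Dense(7, activation="softmax")(merged)  # e.g. 7 EMODB emotion classes
model = models.Model(inp, out)
```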
ARTICLE | doi:10.20944/preprints202106.0296.v1
Subject: Social Sciences, Psychology Keywords: reading comprehension; speech-in-noise recognition; natural F0 contours; flattened F0 contours; Chinese character decoding
Online: 10 June 2021 (13:36:17 CEST)
Theories of reading comprehension emphasize decoding and listening comprehension as two essential components. The current study aimed to investigate how Chinese character decoding and context-driven auditory semantic integration contribute to reading comprehension in Chinese middle school students. Seventy-five middle school students were tested. Context-driven auditory semantic integration was assessed with speech-in-noise tests in which the fundamental frequency (F0) contours of spoken sentences were either kept natural or acoustically flattened, with the latter requiring a higher degree of contextual information. Statistical modelling with hierarchical regression was conducted to examine the contributions of Chinese character decoding and context-driven auditory semantic integration to reading comprehension. Performance on Chinese character decoding and auditory semantic integration scores with the flattened (but not natural) F0 sentences significantly predicted reading comprehension. Furthermore, the contributions of these two factors to reading comprehension were better fitted with an additive model than a multiplicative model. These findings indicate that reading comprehension in middle schoolers is associated with not only character decoding but also the listening ability to make better use of the sentential context for semantic integration in a severely degraded speech-in-noise condition. The results add to our understanding of multi-faceted reading comprehension in children. Future research could further address the age-dependent development and maturation of reading skills by examining and controlling other important cognitive variables, and apply neuroimaging techniques such as functional magnetic resonance imaging to reveal the neural substrates for the contribution of auditory semantic integration and the observed additive model to reading comprehension.
CASE REPORT | doi:10.20944/preprints202004.0443.v1
Subject: Medicine And Pharmacology, Neuroscience And Neurology Keywords: traumatic brain injury (TBI); Dysarthria; transcranial direct current stimulation (tDCS); Quantitative Electroencephalography (QEEG); speech therapy
Online: 24 April 2020 (13:56:38 CEST)
Purpose: Dysarthria, a neurological injury of the motor component of the speech circuitry, is a common consequence of traumatic brain injury (TBI). Palilalia is a speech disorder characterized by involuntary repetition of words, phrases, or sentences. Based on the evidence supporting the effectiveness of transcranial direct current stimulation (tDCS) in some speech and language disorders, we hypothesized that using tDCS would enhance the effectiveness of speech therapy in a client with chronic dysarthria following TBI. Method: We applied the constructs of the “Be Clear” protocol, a relatively new approach to speech therapy in dysarthria, together with tDCS, in a chronic subject affected by dysarthria and palilalia after TBI. Since there was no research on the use of tDCS in such cases, regions of interest (ROIs) were identified based on deviant brain electrophysiological patterns in speech tasks and resting state, compared with normal expected patterns, using Quantitative Electroencephalography (QEEG) analysis. Results: Measures of perceptual assessments of intelligibility, an important index in the assessment of dysarthria, were superior to the primary protocol results immediately and 4 months after intervention. We did not find any factor other than the use of tDCS to justify this superiority. The percentage of repeated words, an index in palilalia assessment, showed remarkable improvement immediately after intervention but declined somewhat after 4 months. We attribute this to the subcortical origins of palilalia. Conclusion: Our present case-based findings suggest that applying tDCS together with speech therapy may improve intelligibility in similar case profiles compared to traditional speech therapy. To reconfirm the effectiveness of the above approach in cases with dysarthria following TBI, more investigation needs to be pursued.
ARTICLE | doi:10.20944/preprints201901.0029.v1
Subject: Engineering, Electrical And Electronic Engineering Keywords: Android; arduino; bluetooth; hand-gesture recognition; low cost; open source; sensors; smart cars; speech recognition
Online: 3 January 2019 (14:32:23 CET)
Gesture recognition has always been a technique to decrease the distance between the physical and the digital world. In this work, we introduce an Arduino-based vehicle system that no longer requires manual control of the car. The proposed work is achieved by utilizing the Arduino microcontroller, an accelerometer, an RF sender/receiver, and Bluetooth. Two main contributions are presented in this work. Firstly, we show that the car can be controlled with hand gestures according to the movement and position of the hand. Secondly, the proposed car system is further extended to be controlled by an Android-based mobile application with different modes (e.g., touch-button mode, voice-recognition mode). In addition, an automatic obstacle detection system is introduced to improve safety measures and avoid hazards. The proposed systems are designed as lab-scale prototypes to experimentally validate their efficiency, accuracy, and affordability. We remark that the proposed systems can be implemented under real conditions at large scale in the future, which will be useful in automobile and robotics applications.
CASE REPORT | doi:10.20944/preprints201805.0300.v1
Subject: Medicine And Pharmacology, Neuroscience And Neurology Keywords: IGF-1; MT; Blackcurrant extracts; Oxidative stress; Mecp2; Speech therapy; Neurostimulation; cyclic glycine-proline; GPE.
Online: 22 May 2018 (11:25:58 CEST)
1) This study describes the good evolution of a 6-year-old girl genetically diagnosed with Rett syndrome (RTT) after having been treated with IGF-1, MT, blackcurrant extracts (BC), and rehabilitation for 6 months. 2) The patient's normal development stopped during her first year of age. The patient showed low weight and height and met the main criteria for typical RTT. Curiously, there was pubic hair (Tanner II) and very high plasma testosterone despite low gonadotropins. No adrenal enzymatic deficits existed, and abdominal ultrasound studies were normal. Treatment consisted of IGF-1 (0.04 mg/kg/day, 5 days/week, sc) for 3 months followed by 15 days of rest, MT (50 mg/day, orally, uninterruptedly), and neurorehabilitation. The new blood tests were absolutely normal and the pubic hair disappeared. Then, a new treatment with IGF-1, MT, and BC was started for another 3 months. After it, the pubic Tanner stage increased to III, without a known cause. 3) The treatment led to clear improvements in most of the initial impairments, perhaps because of the effect of IGF-1, the antioxidant effects of MT and BC, and the increase in cyclic glycine-proline (cGP) after BC administration. 4) A continuous treatment with IGF-1, MT, and BC may reverse most of the neurologic disabilities that occur in RTT.
ARTICLE | doi:10.20944/preprints202303.0517.v1
Subject: Biology And Life Sciences, Behavioral Sciences Keywords: Autism spectrum disorder; Auditory stream segregation; Hearing assistive technology; Speech-in-noise perception; Tonal language speakers
Online: 30 March 2023 (02:52:15 CEST)
Purpose: Hearing assistive technology (HAT) has been shown to be a viable solution to the speech-in-noise perception (SPIN) issue in children with autism spectrum disorder (ASD); however, little is known about its efficacy in tonal language speakers. This study compared sentence-level SPIN performance between Chinese children with ASD and neurotypical (NT) children and evaluated HAT use in improving SPIN performance and easing SPIN difficulty. Methods: Children with ASD (n=26) and NT children (n=19) aged 6-12 performed two adaptive tests in steady-state noise and three fixed-level tests in quiet and steady-state noise with and without using HAT. Speech recognition thresholds (SRT) and accuracy rates were assessed using adaptive and fixed-level tests, respectively. Parents or teachers of the ASD group completed a questionnaire regarding children’s listening difficulty under six circumstances before and after a ten-day trial period of HAT use. Results: Although the two groups of children had comparable SRTs, the ASD group showed a significantly lower SPIN accuracy rate than the NT group. Also, a significant impact of noise was found in the ASD group’s accuracy rate, but not in the NT group’s. There was a general improvement in the ASD group’s SPIN performance with HAT and a decrease in their listening difficulty ratings across all conditions after the device trial. Conclusion: The findings indicated inadequate SPIN in the ASD group using a relatively sensitive measure to gauge SPIN performance among children. The markedly increased accuracy rate in noise during HAT-on sessions for the ASD group confirmed the feasibility of HAT for improving SPIN performance in controlled laboratory settings, and the reduced post-use ratings of listening difficulty further confirmed the benefits of HAT use in daily scenarios.
ARTICLE | doi:10.20944/preprints202103.0221.v1
Subject: Engineering, Electrical And Electronic Engineering Keywords: Speech enhancement; Kalman filter; Kalman gain; robustness metric; sensitivity metric; LPC; whitening filter; real-life noise
Online: 8 March 2021 (13:39:44 CET)
Inaccurate estimates of the linear prediction coefficients (LPC) and the noise variance introduce bias into the Kalman filter (KF) gain and degrade speech enhancement performance. Existing methods tune the biased Kalman gain mainly under stationary noise conditions. This paper introduces a tuning of the KF gain for speech enhancement in real-life noise conditions. First, we estimate the noise in each noisy speech frame using a speech presence probability (SPP) method to compute the noise variance. A whitening filter, with coefficients computed from the estimated noise, is then constructed and applied to the noisy speech, yielding pre-whitened speech from which the speech LPC parameters are computed. A KF is then constructed with the estimated parameters, in which a robustness metric offsets the bias in the Kalman gain during speech absence and a sensitivity metric does so during speech presence, achieving better noise reduction; the noise variance and the speech model parameters also serve as a speech activity detector. The reduced-bias Kalman gain enables the KF to suppress the noise significantly, yielding the enhanced speech. Objective and subjective scores on the NOIZEUS corpus demonstrate that the enhanced speech produced by the proposed method exhibits higher quality and intelligibility than several benchmark methods.
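For readers unfamiliar with the approach, the following is a minimal Python sketch of the classic LPC-based Kalman filter for speech enhancement, including the noise-driven whitening step described above. It omits the paper's SPP noise estimator and the robustness/sensitivity gain tuning; the model order, frame handling, and initialization are assumptions.

```python
import numpy as np
from scipy.linalg import solve_toeplitz
from scipy.signal import lfilter

def lpc(frame, order):
    # Autocorrelation-method LPC via the Yule-Walker equations.
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:len(frame) + order]
    a = solve_toeplitz((r[:-1], r[:-1]), r[1:])      # prediction coefficients
    var = r[0] - a @ r[1:]                           # excitation variance
    return a, var

def kalman_enhance(noisy, noise_est, order=10):
    """Enhance one noisy frame (1-D float array), given a noise estimate for it."""
    # 1) Whitening filter built from the *noise* LPC, applied to the noisy frame,
    #    so the speech LPC is computed from pre-whitened speech.
    a_n, _ = lpc(noise_est, order)
    white = lfilter(np.concatenate(([1.0], -a_n)), [1.0], noisy)
    a_s, q = lpc(white, order)
    noise_var = np.var(noise_est)                    # measurement noise variance

    # 2) AR(p) state-space model: companion transition matrix, last state = s_k.
    A = np.vstack([np.hstack([np.zeros((order - 1, 1)), np.eye(order - 1)]),
                   a_s[::-1]])
    H = np.zeros((1, order)); H[0, -1] = 1.0
    Q = np.zeros((order, order)); Q[-1, -1] = q
    x, P = np.zeros((order, 1)), np.eye(order)

    out = np.empty_like(noisy)
    for k, y in enumerate(noisy):
        x, P = A @ x, A @ P @ A.T + Q                # predict
        K = P @ H.T / (H @ P @ H.T + noise_var)      # Kalman gain (biased when the
        x = x + K * (y - H @ x)                      # LPC/noise estimates are off)
        P = (np.eye(order) - K @ H) @ P              # update
        out[k] = x[-1, 0]                            # filtered speech sample
    return out
```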
ARTICLE | doi:10.20944/preprints202101.0621.v1
Subject: Engineering, Electrical And Electronic Engineering Keywords: Speech Command; MFCC; Tsetlin Machine; Learning Automata; Pervasive AI; Machine Learning; Artificial Neural Network; Keyword Spotting
Online: 29 January 2021 (13:01:47 CET)
The emergence of Artificial Intelligence (AI)-driven Keyword Spotting (KWS) technologies has revolutionized human-to-machine interaction. Yet the challenges of end-to-end energy efficiency, memory footprint, and system complexity in current Neural Network (NN)-powered AI-KWS pipelines remain ever present. This paper evaluates KWS using a learning-automata-powered machine learning algorithm called the Tsetlin Machine (TM). Through a significant reduction in parameter requirements and by choosing logic- over arithmetic-based processing, the TM offers new opportunities for low-power KWS while maintaining high learning efficacy. We explore a TM-based KWS pipeline to demonstrate low complexity and a faster rate of convergence compared to NNs. Further, we investigate scalability with an increasing number of keywords and explore the potential for enabling low-power on-chip KWS.
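A minimal sketch of such a pipeline is shown below, assuming MFCC features (via librosa) booleanized by median thresholding and the open-source pyTsetlinMachine package; the hyperparameters and the data placeholders (`wav_paths`, `Y`, `X_test`, `Y_test`) are illustrative assumptions, not the paper's configuration.

```python
import numpy as np
import librosa                                            # assumed available for MFCCs
from pyTsetlinMachine.tm import MultiClassTsetlinMachine  # assumed package/API

def booleanize_mfcc(wav_path, n_mfcc=13, sr=16000):
    # MFCCs -> flat vector -> median thresholding to booleans, since the
    # Tsetlin Machine consumes binary (logic-level) features. Fixed-length
    # (1 s) clips are assumed so all feature vectors have the same size.
    y, _ = librosa.load(wav_path, sr=sr, duration=1.0)
    m = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc).flatten()
    return (m > np.median(m)).astype(np.uint32)

# wav_paths / Y and X_test / Y_test are placeholders for your labelled clips.
X = np.stack([booleanize_mfcc(p) for p in wav_paths])
tm = MultiClassTsetlinMachine(200, 15, 3.9)   # clauses, threshold T, specificity s
tm.fit(X, Y, epochs=30)
print("accuracy:", (tm.predict(X_test) == Y_test).mean())
```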
ARTICLE | doi:10.20944/preprints202309.1202.v1
Subject: Computer Science And Mathematics, Artificial Intelligence And Machine Learning Keywords: speech emotion recognition; deep learning; Deep Belief Network; deep neural network; Convolutional Neural Network; LSTM; attention mechanism
Online: 19 September 2023 (08:24:22 CEST)
Speech Emotion Recognition (SER) is an interesting and difficult problem. In this paper, we address it through the implementation of deep learning networks. We have designed and implemented six different deep learning networks: a Deep Belief Network (DBN), a simple deep neural network (SDNN), an LSTM network (LSTM), an LSTM network with an attention mechanism (LSTM-ATN), a Convolutional Neural Network (CNN), and a Convolutional Neural Network with an attention mechanism (CNN-ATN), with the aim, beyond solving the SER problem itself, of testing the impact of the attention mechanism on the results. Dropout and batch normalization are also used to improve the generalization ability of the models (preventing overfitting) and to speed up training. The Surrey Audio-Visual Expressed Emotion (SAVEE) database and the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) were used for training and evaluation of our models. The results showed that the networks with an attention mechanism outperformed the others, and that CNN-ATN was the best of the tested networks, achieving an accuracy of 74% on SAVEE and 77% on RAVDESS and exceeding existing state-of-the-art systems for the same datasets.
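As one illustrative reading of a CNN-ATN, the Keras sketch below combines 1-D convolutions over MFCC frames with dot-product self-attention, dropout, and batch normalization; the layer sizes and input shape are assumptions, not the paper's exact architecture.

```python
import tensorflow as tf
from tensorflow.keras import layers

def cnn_atn(n_frames=200, n_mfcc=40, n_emotions=7):
    # 1-D CNN over MFCC frames, then self-attention over time.
    inp = layers.Input(shape=(n_frames, n_mfcc))
    x = layers.Conv1D(64, 5, padding="same", activation="relu")(inp)
    x = layers.BatchNormalization()(x)        # stabilizes and speeds up training
    x = layers.Dropout(0.3)(x)                # regularization against overfitting
    x = layers.Conv1D(128, 5, padding="same", activation="relu")(x)
    x = layers.BatchNormalization()(x)
    x = layers.Attention()([x, x])            # dot-product self-attention over frames
    x = layers.GlobalAveragePooling1D()(x)
    out = layers.Dense(n_emotions, activation="softmax")(x)
    model = tf.keras.Model(inp, out)
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```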
ARTICLE | doi:10.20944/preprints202307.0347.v1
Subject: Social Sciences, Psychology Keywords: Augmentative and Alternative Communication; Autism; Picture Exchange Communication; Speech Generating Device; Vocal production; Problem behavior; Communicative behavior
Online: 5 July 2023 (15:38:56 CEST)
Previous research on the relative benefits of Augmentative and Alternative Communication (AAC) systems has yielded mixed results regarding the effectiveness of, ease of use of, and preference for the different systems. This study compares the effectiveness of two AAC tools, the Picture Exchange Communication System (PECS) and a speech-generating device (SGD), as communication aids for children with autism. Three children with severe autism who were minimally verbal or had no functional language participated. The results showed an increase in communicative behavior with both AAC intervention strategies and a slightly shorter acquisition time for the SGD training. Two of the three participants showed a preference for the SGD. A reduction in problem behaviors and an improvement in vocal production were observed for one participant. The results suggest that PECS and SGD are similarly appropriate for the development of initial requesting skills and that they can encourage vocal production in students with specific prerequisites.
REVIEW | doi:10.20944/preprints201903.0033.v1
Subject: Computer Science And Mathematics, Artificial Intelligence And Machine Learning Keywords: augmentative and alternative communication; assistive technologies; sensing modalities; signal processing; voice communication; machine learning; mobile health; speech disability
Online: 4 March 2019 (10:14:44 CET)
High-tech augmentative and alternative communication (AAC) methods are on a constant rise; however, the interaction between the user and the assistive technology still falls short of an optimal user experience centered around the desired activity. This review presents a range of signal sensing and acquisition methods used in conjunction with existing high-tech AAC platforms for individuals with speech disabilities, including imaging methods, touch-enabled systems, mechanical and electro-mechanical access, breath-activated methods, and brain-computer interfaces (BCI). The listed AAC sensing modalities are compared in terms of ease of access, affordability, complexity, portability, and typical conversational speeds. A review of the associated AAC signal processing, encoding, and retrieval highlights the roles of machine learning (ML) and deep learning (DL) in the development of intelligent AAC solutions. The demands and cost of most systems were found to hinder wide-scale use of high-tech AAC. Further research is needed to develop intelligent AAC applications that reduce the associated costs and enhance the portability of the solutions for real user environments. The consolidation of natural language processing with current solutions also needs to be further explored to improve conversational speeds. Recommendations for prospective advances in high-tech AAC are addressed in terms of developments to support mobile-health communicative applications.
ARTICLE | doi:10.20944/preprints202302.0035.v1
Subject: Medicine And Pharmacology, Otolaryngology Keywords: Speech-in-noise hearing difficulties; Hidden hearing loss (HHL); hearing aids; self-report; Reaction time; Ecologically momentary assessment (EMA)
Online: 2 February 2023 (08:37:41 CET)
Objective: This study assessed hearing aid benefits for people with a normal audiogram but hearing-in-noise problems in everyday listening situations. Design: Exploratory double-blinded case-control study in which participants completed retrospective questionnaires, ecological momentary assessments, speech-in-noise testing, and mental-effort testing with and without hearing aids. Twenty-seven adults reporting speech-in-noise problems but normal air-conduction pure-tone audiometry took part. They were randomly assigned to an experimental group, who trialled mild-gain hearing aids with advanced directional processing, or a control group fitted with hearing aids providing no gain or directionality. Results: Self-reports showed that mild-gain hearing aids reduce hearing-in-noise difficulties and provide a better hearing experience (i.e., improved understanding, participation, and mood). Despite the self-reported benefits, the laboratory tests did not reveal a benefit from the mild-gain hearing aids, with no group differences on speech-in-noise tests or mental-effort measures. Further, participants found the elevated cost of hearing aids to be a barrier to adoption. Conclusions: Hearing aids benefit the listening experience in some situations for people with a normal audiogram who report hearing difficulties in noise. Decreasing the price of hearing aids may improve accessibility for those seeking remediation of their communication needs.
ARTICLE | doi:10.20944/preprints201904.0274.v1
Subject: Social Sciences, Language And Linguistics Keywords: computer-aided translation; machine translation; speech translation; translation memory-machine translation integration; user interface; domain-adaptation; human-computer interface
Online: 25 April 2019 (07:59:18 CEST)
When using computer-aided translation systems in a typical professional translation workflow, there are several stages at which there is room for improvement. The SCATE (Smart Computer-Aided Translation Environment) project investigated several of these aspects, both from a human-computer interaction point of view and from a purely technological one. This paper describes the SCATE research on improved fuzzy matching, parallel treebanks, the integration of translation memories with machine translation, quality estimation, terminology extraction from comparable texts, the use of speech recognition in the translation process, and human-computer interaction and interface design for the professional translation environment. For each of these topics, we describe the experiments performed and the conclusions drawn, providing an overview of the highlights of the entire SCATE project.
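Translation-memory fuzzy matching, one of the SCATE topics, can be illustrated with a character-level similarity ratio; real systems (including SCATE's improved matching) use richer, linguistically informed metrics, so the Python sketch below is only a toy baseline.

```python
from difflib import SequenceMatcher

def fuzzy_matches(query, memory, threshold=0.7):
    """Return TM entries whose source side is similar enough to `query`.

    `memory` is a list of (source, target) pairs; the 0-1 similarity ratio
    here is a stand-in for a production fuzzy-match metric.
    """
    scored = ((SequenceMatcher(None, query, src).ratio(), src, tgt)
              for src, tgt in memory)
    return sorted((m for m in scored if m[0] >= threshold), reverse=True)

tm = [("Press the green button.", "Druk op de groene knop."),
      ("Press the red button twice.", "Druk tweemaal op de rode knop.")]
print(fuzzy_matches("Press the green button twice.", tm))  # best match first
```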
ARTICLE | doi:10.20944/preprints202308.0742.v1
Subject: Public Health And Healthcare, Public Health And Health Services Keywords: inner speech; spontaneous self-talk; goal-directed self-talk; big five personality traits; self-determination theory; autonomy; competence; relatedness; sport
Online: 9 August 2023 (10:31:01 CEST)
Good health and the promotion of well-being for all is the third of the 17 Global Goals included in the 2030 Agenda for Sustainable Development. Contributing to this goal, the current study examined the relationships between one aspect of athletes' well-being, namely state organic self-talk, and personality traits and basic psychological need satisfaction and frustration within their sport. Athletes (N = 691; mean age 21.65) from a variety of individual (n = 270) and team sports (n = 421) completed a multisection questionnaire capturing the targeted variables. Three-step hierarchical regression analyses revealed that: in step 1, all personality traits were, to some extent, significant predictors of athletes' organic, spontaneous self-talk dimensions and goal-directed self-talk functions; in step 2, need satisfaction contributed significantly to all spontaneous self-talk dimensions and goal-directed self-talk functions (except creating functional deactivated states) over and above personality; and in step 3, need frustration contributed significantly to the negative spontaneous self-talk dimensions and to all goal-directed self-talk functions (except instruction) over and above personality and need satisfaction. Overall, our results indicate the importance of personality traits as personal antecedents, and of perceived basic psychological need satisfaction and frustration as social-environmental antecedents, in shaping athletes' state organic self-talk.
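The three-step hierarchical regressions add predictor blocks in sequence and track the increment in explained variance at each step; the statsmodels sketch below shows that logic with illustrative variable names (the block contents are assumptions mirroring the abstract's design, not the authors' exact models).

```python
import statsmodels.api as sm

def hierarchical_regression(df, outcome, blocks):
    """Fit nested OLS models on a pandas DataFrame, adding one predictor
    block per step, and report R^2 and its increment at each step."""
    predictors, prev_r2 = [], 0.0
    for i, block in enumerate(blocks, start=1):
        predictors += block
        model = sm.OLS(df[outcome], sm.add_constant(df[predictors])).fit()
        print(f"Step {i}: R2={model.rsquared:.3f} "
              f"(delta={model.rsquared - prev_r2:.3f})")
        prev_r2 = model.rsquared

# Illustrative call with hypothetical column names:
# hierarchical_regression(df, "positive_self_talk",
#     [["extraversion", "neuroticism", "openness",
#       "agreeableness", "conscientiousness"],     # step 1: personality
#      ["need_satisfaction"],                      # step 2
#      ["need_frustration"]])                      # step 3
```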
ARTICLE | doi:10.20944/preprints202101.0005.v1
Subject: Medicine And Pharmacology, Immunology And Allergy Keywords: apraxia of speech (AOS); transcranial direct current stimulation (tDCS); primary progressive aphasia (PPA); inferior frontal gyrus (IFG); sound duration; brain stimulation
Online: 4 January 2021 (10:19:48 CET)
Transcranial direct current stimulation (tDCS) over the left Inferior Frontal Gyrus (IFG) has been found to improve apraxia of speech (AOS) in post-stroke aphasia, speech fluency in adults who stutter, and naming and spelling in primary progressive aphasia (PPA). This paper aims to determine whether tDCS over the left IFG coupled with AOS therapy improves speech fluency in patients with PPA more than sham stimulation does. Eight patients with non-fluent PPA and AOS symptoms received either active or sham tDCS, along with speech therapy, for 15 weekday sessions. Speech therapy consisted of repetition of words of increasing syllable length. Evaluations took place before, immediately after, and two months post-intervention. Words were segmented into vowels and consonants, and the duration of each vowel and consonant was measured. Segmental durations were significantly shorter after tDCS than after sham for both consonants and vowels. tDCS gains generalized to untrained words, and the effects were sustained over two months post-treatment for trained words. Taken together, these results demonstrate that tDCS over the left IFG facilitates speech production by reducing segmental duration, and they provide preliminary evidence that tDCS can maximize the efficacy of speech therapy in non-fluent PPA with AOS.
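Measuring segmental duration reduces to simple arithmetic once phone boundaries are time-aligned; the sketch below assumes segments arrive as (label, start, end) tuples in seconds (e.g., exported from a Praat TextGrid) and uses a deliberately simplistic vowel inventory.

```python
VOWELS = set("aeiou")  # simplistic vowel inventory, for illustration only

def mean_durations(segments):
    """segments: iterable of (label, start_s, end_s) time-aligned phones.
    Returns (mean vowel duration, mean consonant duration) in milliseconds."""
    vows = [1000 * (e - s) for lab, s, e in segments if lab[0].lower() in VOWELS]
    cons = [1000 * (e - s) for lab, s, e in segments if lab[0].lower() not in VOWELS]
    mean = lambda xs: sum(xs) / len(xs) if xs else float("nan")
    return mean(vows), mean(cons)

# Hypothetical alignments for one word, pre- and post-treatment:
pre  = [("k", 0.00, 0.09), ("a", 0.09, 0.31), ("t", 0.31, 0.45)]
post = [("k", 0.00, 0.07), ("a", 0.07, 0.22), ("t", 0.22, 0.31)]
print(mean_durations(pre), mean_durations(post))  # shorter segments post-treatment
```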
ARTICLE | doi:10.20944/preprints202307.0413.v1
Subject: Computer Science And Mathematics, Other Keywords: Voice user interface; Geographic Information System; human-computer interaction; multimodal interface; natural language; Web application; Natural language interaction; Voice virtual assistant; Speech recognition
Online: 6 July 2023 (10:08:55 CEST)
ARTICLE | doi:10.20944/preprints202106.0687.v1
Subject: Physical Sciences, Acoustics Keywords: automatic speech recognition (ASR); automatic assessment tools; foreign language pronunciation; pronunciation training; computer-assisted pronunciation training (CAPT); automatic pronunciation assessment; learning environments; minimal pairs
Online: 29 June 2021 (07:31:41 CEST)
General-purpose automatic speech recognition (ASR) systems have improved in quality and are being used for pronunciation assessment. However, the assessment of isolated short utterances, such as words in minimal pairs in segmental approaches, remains an important challenge, even more so for non-native speakers. In this work, we compare the performance of our own tailored ASR system (kASR) with that of Google ASR (gASR) for the assessment of Spanish minimal-pair words produced by 33 native Japanese speakers in a computer-assisted pronunciation training (CAPT) scenario. Participants in a pre-/post-test training experiment spanning four weeks were split into three groups: experimental, in-classroom, and placebo. The experimental group used the CAPT tool described in the paper, which we specially designed for autonomous pronunciation training. Statistically significant improvements were revealed for the experimental and in-classroom groups, and moderate correlations between gASR and kASR results were obtained, alongside strong correlations between the post-test scores of both ASR systems and the CAPT application scores at the final stages of application use. These results suggest that, in the current configuration, both ASR alternatives are valid for assessing minimal pairs in CAPT tools. A discussion of possible ways to improve our system and of possibilities for future research is included.
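Scoring minimal pairs with a general-purpose ASR amounts to checking which member of the pair the recognizer returns, after which the agreement between two systems can be correlated directly; in the sketch below, the `recognize` callables stand in for gASR/kASR and the data layout is an assumption.

```python
from scipy.stats import pearsonr

def minimal_pair_score(recognize, recordings):
    """recordings: list of (wav_path, target, distractor) minimal-pair items.
    `recognize(wav_path)` is assumed to return the 1-best transcription.
    Returns the fraction of items recognized as the intended target word."""
    hits = sum(recognize(wav).strip().lower() == target.lower()
               for wav, target, _ in recordings)
    return hits / len(recordings)

# Per-learner scores from both systems, then their agreement:
# g = [minimal_pair_score(gasr_recognize, items) for items in learners]
# k = [minimal_pair_score(kasr_recognize, items) for items in learners]
# r, p = pearsonr(g, k)   # moderate r would mirror the reported result
```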
ARTICLE | doi:10.20944/preprints202305.0402.v1
Subject: Social Sciences, Psychology Keywords: self; course; self-reflection; self-rumination; self-knowledge; mindfulness; prospection; autobiography; self-regulation; self-recognition; self-esteem; culture; inner speech; traumatic brain injury; Theory-of-Mind
Online: 6 May 2023 (09:32:55 CEST)
In this paper I tentatively answer 50 questions sampled from a pool of over 10,000 weekly questions formulated by students in a course entitled "The Self". The questions pertain to various key topics about self-processes, such as self-awareness, self-knowledge, self-regulation, self-talk, and self-esteem. The students' weekly questions and their answers highlight what is currently known about the self. Answers to the student questions also allow for the identification of some recurrent lessons about the self, including: all self-processes are interconnected (e.g., prospection depends on autobiography); self-terms must be properly defined (e.g., self-rumination and worry are not the same); inner speech plays an important role in self-processes; controversies are numerous (are animals self-aware?); measurement issues abound (e.g., self-reflection as an operationalization of self-awareness); deficits in some self-processes can have devastating effects (e.g., self-regulatory deficits may lead to financial problems); and there are many unknowns about the self (e.g., gender differences in Theory-of-Mind).
ARTICLE | doi:10.20944/preprints202310.1830.v1
Subject: Medicine And Pharmacology, Neuroscience And Neurology Keywords: Apraxia of speech; Trisomy 21 (Down syndrome); transcranial Direct Current Stimulation (tDCS); Rapid Syllable Transition Training (ReST); Broca’s area; Wernicke’s area; supramarginal gyrus; Sylvian Temporal Parietal Junction
Online: 30 October 2023 (07:16:51 CET)
Apraxia of speech is a persistent speech motor disorder that affects speech intelligibility. Studies of speech motor disorders using transcranial direct current stimulation (tDCS) have mostly addressed post-stroke aphasia. Only a few tDCS studies have focused on apraxia of speech or childhood apraxia of speech (CAS), and none has investigated individuals with CAS and trisomy 21 (T21, Down syndrome). This study examined the effects of tDCS combined with a motor learning task in developmental apraxia of speech co-existing with T21 (ReBEC RBR-5435x9). The accuracy of speech sound production of nonsense words (NSWs) during Rapid Syllable Transition Training (ReST) under 10 sessions of anodal tDCS (1.5 mA, 25 cm²) over Broca's area, with the cathode over the contralateral region, was compared with 10 sessions of sham tDCS and 4 control sessions in a 20-year-old male with T21 presenting moderate-severe CAS. Accuracy of NSW production progressively improved only under tDCS (a 40% gain; sham tDCS and control sessions showed gains below 20%). A decrease in speech severity from moderate-severe to mild-moderate indicated transfer effects to speech production. Speech accuracy under tDCS was correlated with Wernicke's area activation (P3 current source density), which in turn was correlated with activation of the left supramarginal gyrus and the Sylvian parietal-temporal junction. Repetitive bihemispheric tDCS paired with ReST may have facilitated speech sound acquisition in a young adult with T21 and CAS, possibly by recruiting brain regions required for phonological working memory.
ARTICLE | doi:10.20944/preprints202308.0528.v1
Subject: Engineering, Bioengineering Keywords: Speech Imagery; Mental Task; Machine Leaning; Feature Extraction; Common spatial pattern (CSP); Filter bank Common Spatial Pattern (FBCSP); Brain – Computer Interface (BCI); Principal Components Analysis (PCA); Feature Selection; Channel Selection; Mutual Information; Lagrange Formula; Deep Learning; SVM Classifier
Online: 7 August 2023 (10:23:13 CEST)
Nowadays, brain signal processing is used in a rapidly growing range of brain-computer interface (BCI) applications. Most researchers focus on developing new methods or on improving existing baseline models to identify an optimal standalone feature set. Our research focuses on four ideas: one introduces a future communication model, and the other three improve existing models or methods. 1) A new communication imagery model as an alternative to speech imagery, based on a mental task: because speech imagery is very difficult, and it is impossible to imagine sounds for all characters in all languages, we introduce a new mental-task model, called lip-sync imagery, that can be used for all characters in any language. This paper implements lip-sync imagery for two sounds (characters or letters). 2) New combination signals: selecting an inopportune frequency domain can lead to inefficient feature extraction, so domain selection is important for processing. Combining limited frequency ranges is a preliminary step toward creating a fragmentary continuous frequency. In a first model, pairs of 4 Hz intervals were used as filter banks and tested. The main purpose is to identify combinations of 4 Hz filter banks within the 4-40 Hz frequency domain as new combination signals (8 Hz in total) that yield good, efficient features by increasing distinctive patterns and decreasing similar patterns of brain activity. 3) A new bond-graph classifier to supplement the SVM classifier: when a linear SVM is used on very noisy data, its performance decreases, so we introduce a new bond-graph linear classifier to supplement the linear SVM on noisy data. 4) A deep formula-recognition model: it converts the data in the first layer into a formula model (a formula-extraction model). The main goal is to reduce noise in the subsequent layers for the formula coefficients; the output of the last layer comprises the coefficients selected by different functions in different layers. Finally, the classifier extracts the root interval of the formulas, and classification is based on this root interval. Results were obtained for all of the implemented methods, ranging from 55% to 98%: the lowest, 55%, for the deep formula-recognition model, and the highest, 98%, for the new combination signals.
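The "combination signals" idea, pairing two 4 Hz bands out of the 4-40 Hz range, can be sketched with standard tooling; in the Python below, MNE's CSP and scikit-learn's linear SVM stand in for the paper's own feature extraction and bond-graph classifier, and the chosen band pair is a guess.

```python
import numpy as np
from scipy.signal import butter, filtfilt
from mne.decoding import CSP                 # assumed available (MNE-Python)
from sklearn.svm import SVC

def bandpass(epochs, lo, hi, fs=250, order=4):
    # epochs: (n_trials, n_channels, n_samples) EEG array.
    b, a = butter(order, [lo / (fs / 2), hi / (fs / 2)], btype="band")
    return filtfilt(b, a, epochs, axis=-1)

def combination_signal(epochs, band1=(8, 12), band2=(20, 24), fs=250):
    # Sum two 4 Hz filter banks into one "combination signal" (8 Hz of total
    # bandwidth), per the abstract's idea; this band pair is illustrative.
    return bandpass(epochs, *band1, fs=fs) + bandpass(epochs, *band2, fs=fs)

def fit_pipeline(epochs, labels):
    X = combination_signal(epochs)
    csp = CSP(n_components=4)                # spatial features from the combined band
    feats = csp.fit_transform(X, labels)
    clf = SVC(kernel="linear").fit(feats, labels)
    return csp, clf
```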