BRIEF REPORT | doi:10.20944/preprints202207.0062.v1
Subject: Medicine And Pharmacology, Otolaryngology Keywords: Total Laryngectomy; Cancer; Voice; Voice prosthesis; Otolaryngology; Head Neck Surgery; Speech Language Therapists.
Online: 5 July 2022 (05:44:14 CEST)
Background: In the present study, we assessed the feasibility and success outcomes of voice prosthesis (VP) changes performed by a speech-language pathologist (SLP). Methods: Patients treated with total laryngectomy (TL) from January 2020 to December 2020 were prospectively recruited from our medical center. All patients had undergone tracheoesophageal puncture. VP changes were performed by the senior SLP, and the following data were collected for each change: date of placement, change, or removal; VP type and size; reason for change or removal; and use of a washer for periprosthetic leakage. A six-item patient-reported outcome questionnaire was administered to patients at each VP change (Appendix 1). Items were assessed on a 10-point Likert scale. Results: Fifty-two VP changes were performed by the senior SLP during the study period. The mean duration of the SLP consultation, including patient history, examination, and the VP change procedure, was 20 min (range: 15-30). The median prosthesis lifetime was 88 days. The main reasons for VP changes were transprosthetic (N=34; 79%) and periprosthetic (N=7; 21%) leakages. The SLP successfully performed all VP changes. In one case, the SLP did not change the VP but used a periprosthetic silastic washer to stop the periprosthetic leakage. In two cases, the SLP required the surgeon's examination to discuss further indications: implant mucosal inclusion and autologous fat injection. Patient satisfaction with the speed and quality of care provided by the SLP was high. Conclusion: The delegation of VP changes from the otolaryngologist-head and neck surgeon to the speech-language pathologist may be achieved without significant complications. Delegating the VP change procedure to the SLP may be valuable in rural regions with a shortage of otolaryngologists.
ARTICLE | doi:10.20944/preprints201711.0027.v1
Subject: Engineering, Control And Systems Engineering Keywords: convolution neural networks; melody extraction; singing voice activity detection; voice false alarm detection
Online: 3 November 2017 (14:51:47 CET)
Singing melody extraction is the task of identifying the melody pitch contour of the singing voice in polyphonic music. Most traditional melody extraction algorithms are based on calculating salient pitch candidates or separating the melody source from the mixture. Recently, classification-based approaches built on deep learning have drawn much attention. In this paper, we present a classification-based singing melody extraction model using deep convolutional neural networks. The proposed model consists of a singing pitch extractor (SPE) and a singing voice activity detector (SVAD). The SPE is trained to predict a high-resolution pitch label of the singing voice from a short segment of spectrogram, which allows the model to predict highly continuous pitch curves. The melody contour is further smoothed by post-processing the output of the melody extractor. The SVAD is trained to determine whether a long segment of mel-spectrogram contains a singing voice; this often produces voice false alarm errors around the boundaries of singing segments, which we reduce by exploiting the output of the SPE. Finally, we evaluate the proposed melody extraction model on several public datasets. The results show that the proposed model is comparable to state-of-the-art algorithms.
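For illustration, a minimal sketch (not the authors' implementation; layer sizes, input resolution, and the number of pitch labels are assumptions) of a classification-based singing pitch extractor: a small CNN maps a short spectrogram segment to a softmax over quantized pitch labels.

```python
import torch
import torch.nn as nn

class SingingPitchExtractor(nn.Module):
    def __init__(self, n_bins=513, n_frames=31, n_pitch_labels=721):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d((4, 1)),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d((4, 1)),
        )
        self.classifier = nn.Linear(64 * (n_bins // 16) * n_frames, n_pitch_labels)

    def forward(self, spec_segment):           # (batch, 1, n_bins, n_frames)
        h = self.features(spec_segment)
        return self.classifier(h.flatten(1))   # logits over quantized pitch labels

model = SingingPitchExtractor()
logits = model(torch.randn(8, 1, 513, 31))     # one pitch prediction per segment
```

Training such a model with a cross-entropy loss over fine pitch bins is one way to obtain the highly continuous pitch curves the abstract describes; the SVAD would be a separate classifier over longer mel-spectrogram segments.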
ARTICLE | doi:10.20944/preprints202103.0348.v1
Online: 12 March 2021 (19:58:40 CET)
In the last few years, researchers have paid increasing attention to singing voice evaluation. In their studies, they observed changes in the vibrations of the vocal folds during register transitions, and they also found that these changes are less visible and audible in skilled singers. To confirm this theory, we defined a new parameter, the Passaggio Peak Coefficient (PPC), obtained from an EGG signal, to analyse pitch and open quotient jump characteristics during vocal register transitions among 21 female and male choir members with different singing skills. The Kruskal-Wallis test showed that, at the 5% significance level, vocal skill can be distinguished among female singers based on the ability to smooth these transitions.
ARTICLE | doi:10.20944/preprints201806.0280.v2
Subject: Social Sciences, Cognitive Science Keywords: phonagnosia, acquired, developmental, apperceptive, associative, voice-identity processing, speaker recognition, core-voice system, extended system
Online: 4 December 2018 (16:31:48 CET)
The voice contains elementary social communication cues, conveying speech as well as paralinguistic information pertaining to the emotional state and the identity of the speaker. In contrast to vocal-speech and vocal-emotion processing, voice-identity processing has been less explored. This seems surprising, given the day-to-day significance of person recognition by voice. A valuable approach to unravel how voice-identity processing is accomplished is to investigate people who have a selective deficit in recognising voices. Such a deficit has been termed phonagnosia. In the present chapter, we provide a systematic overview of studies on phonagnosia and how they relate to current neurocognitive models of person recognition. We review studies that have characterised people who suffer from phonagnosia following brain damage (i.e. acquired phonagnosia) as well as studies that have examined phonagnosia cases without apparent brain lesion (i.e. developmental phonagnosia). Based on the reviewed literature, we emphasise the need for a careful behavioural characterisation of phonagnosia cases that takes into consideration the multistage nature of voice-identity processing and the resulting behavioural phonagnosia subtypes.
COMMUNICATION | doi:10.20944/preprints202307.1896.v1
Subject: Computer Science And Mathematics, Signal Processing Keywords: voice spoofing; acoustic configuration; deep learning
Online: 28 July 2023 (10:14:32 CEST)
Voice spoofing attempts to break into a specific automatic speaker verification (ASV) system by forging the user's voice, and can be carried out through methods such as text-to-speech (TTS), voice conversion (VC), and replay attacks. Recently, deep learning-based voice spoofing countermeasures have been developed. However, the problem with replay attacks is that it is difficult to construct large datasets because they require a physical recording process. To overcome this problem, this study proposes a pre-training framework based on multi-order acoustic simulation for replay voice spoofing detection. Multi-order acoustic simulation utilizes existing clean-signal and room impulse response (RIR) datasets to generate audio that simulates the various acoustic configurations of original and replayed recordings. The acoustic configuration refers to factors such as the microphone type, reverberation, time delay, and noise that may occur between a speaker and microphone during the recording process. We assume that a deep learning model trained on audio simulating these various acoustic configurations can classify the acoustic configurations of original and replayed audio well. To validate this, we performed pre-training to classify the audio generated by the multi-order acoustic simulation into three classes: the clean signal, audio simulating the acoustic configuration of the original recording, and audio simulating the acoustic configuration of the replayed recording. We then used the weights of the pre-trained model as the initial weights of the replay voice spoofing detection model trained on an existing replay voice spoofing dataset and performed fine-tuning. To validate the effectiveness of the proposed method, we evaluated the conventional method without pre-training and the proposed method using accuracy as an objective metric. The conventional method achieved 92.94% accuracy and the proposed method achieved 98.16%.
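A minimal sketch of the simulation idea, assuming RIRs come from an existing dataset (this is not the paper's implementation; the SNR handling and class labels are illustrative): class 0 is the clean signal, class 1 is the clean signal convolved once with an RIR (the "original" configuration), and class 2 is the result convolved again with a second RIR plus noise (the "replay" configuration).

```python
import numpy as np
from scipy.signal import fftconvolve

def simulate(clean, rir_pool, rng, snr_db=20.0):
    """Return (audio, class label) pairs for pre-training the 3-class classifier."""
    i, j = rng.choice(len(rir_pool), size=2, replace=False)
    original = fftconvolve(clean, rir_pool[i], mode="full")[: len(clean)]   # 1st order
    replay = fftconvolve(original, rir_pool[j], mode="full")[: len(clean)]  # 2nd order
    noise = rng.standard_normal(len(clean))
    noise *= np.sqrt(np.mean(replay**2) / (np.mean(noise**2) * 10 ** (snr_db / 10)))
    replay = replay + noise                       # replay path also picks up noise
    return [(clean, 0), (original, 1), (replay, 2)]
```

The resulting labelled audio would then pre-train the detection network before fine-tuning on a real replay spoofing dataset, as the abstract describes.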
ARTICLE | doi:10.20944/preprints202212.0387.v1
Subject: Social Sciences, Behavior Sciences Keywords: emotion discrimination; voice; frequency-tagging; EEG
Online: 21 December 2022 (06:07:12 CET)
Successfully engaging in social communication requires efficient processing of subtle socio-communicative cues. Voices convey a wealth of social information, such as the gender, identity, and emotional state of the speaker. We tested whether the brain can systematically and automatically differentiate and track a periodic stream of emotional utterances embedded in a series of neutral vocal utterances. We recorded frequency-tagged EEG responses of 20 neurotypical male adults while presenting streams of neutral utterances at a 4 Hz base rate, interleaved with emotional utterances as every third stimulus, hence at a 1.333 Hz oddball frequency. Four emotions (happy, sad, angry, and fearful) were presented as separate conditions in different streams. To control for the impact of low-level acoustic cues, we maximized variability among the stimuli and included a control condition with scrambled utterances; scrambling preserves low-level acoustic characteristics but ensures that the emotional character is no longer recognizable. Results revealed significant oddball EEG responses for all conditions, indicating that every emotion category can be discriminated from the neutral stimuli, and every emotional oddball response was significantly higher than the response to the scrambled utterances. These findings demonstrate that emotion discrimination is fast, automatic, and not merely driven by low-level perceptual features.
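For illustration, a minimal sketch (epoch handling and bin choices are assumptions, not the study's analysis pipeline) of how a frequency-tagged oddball response can be quantified: FFT a single-channel EEG epoch and express the amplitude at the 1.333 Hz oddball frequency and its harmonics as a signal-to-noise ratio against neighbouring bins; harmonics coinciding with the 4 Hz base rate would normally be excluded, so only the first two oddball harmonics are used here.

```python
import numpy as np

def oddball_snr(eeg, fs, oddball_hz=4.0 / 3.0, n_harmonics=2, n_neighbours=10):
    spectrum = np.abs(np.fft.rfft(eeg)) / len(eeg)          # amplitude spectrum
    freqs = np.fft.rfftfreq(len(eeg), d=1.0 / fs)
    snrs = []
    for k in range(1, n_harmonics + 1):
        idx = int(np.argmin(np.abs(freqs - k * oddball_hz)))  # bin of the harmonic
        noise = np.r_[idx - n_neighbours - 1: idx - 1,        # bins left of target
                      idx + 2: idx + n_neighbours + 2]        # bins right of target
        snrs.append(spectrum[idx] / spectrum[noise].mean())
    return snrs   # one SNR per retained oddball harmonic
```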
CONCEPT PAPER | doi:10.20944/preprints202108.0194.v1
Subject: Social Sciences, Sociology Keywords: congruence; voice; speech; communication; identity; personality
Online: 9 August 2021 (12:41:06 CEST)
Purpose: We present a theoretical framework, rooted in a comprehensive and mechanistic theory of personality, that formalizes and defines the constructs of communicative congruence and communicative dysphoria. Background: Voice therapists have likely encountered a patient who states that a therapeutic target voice "isn't me." The ability to accurately convey a person's sense of self, or identity, through voice, speech, and communication behaviors appears highly relevant to patients and clinicians alike. However, to date, we lack a mechanistic theoretical framework through which to understand and interrogate the phenomenon of congruence between one's communication behaviors and one's sense of self. Results: We review the initial notion of congruence, first proposed by Carl Rogers. We then review several theories of selfhood, identity, and personality. After reviewing these theories, we explain how our proposed constructs fit within our chosen theory, the Cybernetic Big Five Theory of Personality. We then discuss similarities and differences with a similarly named construct, the Vocal Congruence Scale. Next, we review how these constructs may bear on an existing theory relevant to voice therapy, the Transtheoretical Model of Health Behavior Change. Finally, we state testable hypotheses for future exploration, which we hope will establish a foundation for future investigations into communicative congruence. Conclusion: To our knowledge, the present paper is the first to explicitly define communicative congruence and communicative dysphoria. We embed these constructs within a comprehensive and mechanistic theory of personality and, in doing so, hope to provide a rigorous theoretical framework that will allow these proposed constructs to be tested and better understood.
ARTICLE | doi:10.20944/preprints202306.0223.v1
Subject: Computer Science And Mathematics, Artificial Intelligence And Machine Learning Keywords: Voice Cloning; Speech Synthesis; Speech Quality Evaluation
Online: 5 June 2023 (02:27:49 CEST)
Voice cloning, an emerging field in speech processing, aims to generate synthetic utterances that closely resemble the voices of specific individuals. In this study, we investigate the impact of various techniques on improving the quality of voice cloning, focusing specifically on a low-quality dataset; for comparison, we also use two high-quality corpora. We conduct exhaustive evaluations of the quality of the gathered corpora in order to select the most suitable recordings for training a voice cloning system. Following these measurements, we conduct a series of ablations, removing recordings with lower SNR and higher variability in utterance speed from the corpora in order to decrease their heterogeneity. Furthermore, we introduce a novel algorithm that calculates the fraction of aligned input characters by exploiting the attention matrix of the Tacotron 2 Text-to-Speech (TTS) system; this algorithm provides a valuable metric for evaluating alignment quality during the voice cloning process. We present the results of our experiments, demonstrating that the performed ablations significantly increase the quality of the synthesised audio for the challenging low-quality corpus. Notably, our findings indicate that models trained on a 3-hour corpus from a pre-trained model exhibit audio quality comparable to models trained from scratch on significantly larger amounts of data.
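A minimal sketch of one plausible reading of the alignment metric described above (the function name, masking, and exact definition are assumptions, not the authors' algorithm): the fraction of input characters that receive the attention argmax of at least one decoder step in the Tacotron 2 attention matrix.

```python
import numpy as np

def aligned_character_fraction(attention, pad_mask=None):
    """attention: (decoder_steps, encoder_chars) attention weights for one utterance."""
    focused = np.argmax(attention, axis=1)        # character most attended at each step
    aligned = np.unique(focused)                  # characters attended at least once
    n_chars = attention.shape[1] if pad_mask is None else int(pad_mask.sum())
    return len(aligned) / n_chars                 # 1.0 means every character was attended
```

A low fraction would flag synthesised utterances where the attention skipped or collapsed over parts of the input text, which is the kind of alignment failure the metric is meant to expose.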
ARTICLE | doi:10.20944/preprints201911.0346.v1
Subject: Medicine And Pharmacology, Neuroscience And Neurology Keywords: speech; Parkinson’s disease; deep brain stimulation; voice; articulation
Online: 28 November 2019 (02:57:03 CET)
Deep brain stimulation (DBS) of the subthalamic nucleus (STN) has become an effective and widely used tool in the treatment of Parkinson's disease (PD). STN-DBS has varied effects on speech. Clinical speech ratings suggest worsening following STN-DBS, but quantitative intelligibility, perceptual, and acoustic studies have produced mixed and inconsistent results. Improvements in phonation and declines in articulation have frequently been reported during different speech tasks under different stimulation conditions, and questions remain about preferred STN-DBS stimulation settings. Seven right-handed, native speakers of English with PD treated with bilateral STN-DBS were studied off medication under three stimulation conditions: stimulators off, 60 Hz (low-frequency stimulation, LFS), and the typical clinical setting of 185 Hz (high-frequency stimulation, HFS). Spontaneous speech was recorded in each condition and excerpts were prepared for transcription (intelligibility) and difficulty judgements. Separate excerpts were prepared for listeners to rate abnormalities in voice, articulation, fluency, and rate. Intelligibility of spontaneous speech was reduced at both HFS and LFS compared with STN-DBS off. Speech produced at HFS was more intelligible than that produced at LFS, but HFS made the intelligibility task (transcription) subjectively more difficult. Both voice quality and articulation were judged to be more abnormal with STN-DBS on. STN-DBS reduced the intelligibility of spontaneous speech at both LFS and HFS, and lowering the frequency did not improve intelligibility. Voice quality ratings with STN-DBS were correlated with the ratings made without stimulation; this was not true for articulation ratings. STN-DBS exacerbated an existing voice disorder and may have introduced new articulatory abnormalities.
ARTICLE | doi:10.20944/preprints202308.0553.v1
Subject: Medicine And Pharmacology, Otolaryngology Keywords: thyroidectomy; bilateral vocal folds paralysis; voice therapy; arytenoidectomy; NIM
Online: 8 August 2023 (03:38:16 CEST)
Bilateral recurrent laryngeal nerve damage following total thyroidectomy is a severe complication of thyroid surgery. Its incidence is low, thanks in part to the use of intraoperative nerve monitoring (NIM). The aim of this observational retrospective study is to evaluate the mode of onset and the recovery time of the different clinical laryngeal pictures that arise from this surgery. We enrolled 25 patients with a bilateral vocal fold mobility deficit, diagnosed between October 2017 and October 2022 in the ENT Unit of the University of Campania "L. Vanvitelli", out of a total of 1417 patients undergoing total thyroidectomy. The 25 patients (23 F, 2 M), aged from 24 to 78 years (average age 51.7), presented a bilateral vocal fold motility deficit (occurring in about 0.1% of cases). All patients underwent a 9-month diagnostic/therapeutic process, which started approximately 30 days after thyroid surgery. The outcomes of these complications vary, with functional laryngeal deficits mainly involving respiratory and phonatory activity. These clinical manifestations evolve in different ways within a wide range of possibilities, from spontaneous bilateral or unilateral recovery to functional or surgical restoration. This study allowed the acquisition of useful information about prognostic indications and an adequate therapeutic process based on the specific clinical characteristics.
ARTICLE | doi:10.20944/preprints202305.0740.v2
Subject: Computer Science And Mathematics, Computer Science Keywords: Blockchain; Symverse; Ethereum; Payment; gas; voice fishing; fin-tech
Online: 8 June 2023 (11:17:45 CEST)
With the rise of intelligent voice phishing and the increasing reliance on open banking systems, there has been a rise in cases where individuals' personal information has been exposed, resulting in significant financial losses for the victims. Non-face-to-face transactions in the financial sector face challenges such as customer identification, ensuring transaction integrity, and preventing transaction repudiation. Blockchain-based distributed ledgers have been proposed as a solution, but their adoption is limited by the difficulty of managing private keys and the burden of managing gas fees. This paper proposes a non-face-to-face P2P real-time token payment system that minimizes the risk of key loss by storing private keys in a keystore file and database through a server-based key management module. The proposed system simplifies token creation and management through a server-based token management module and implements an automatic gas-charging function for smooth token transactions. Transaction integrity and non-repudiation are ensured through a transaction confirmation module that uses transaction IDs without exposing personal information. Furthermore, advanced security measures such as blocking foreign IP access and DDoS defense are implemented to protect user data. The proposed system aims to provide a convenient, secure, and accessible online payment solution to the public by implementing a self-authentication function through a web application that is not limited to particular smartphones or application platforms.
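For illustration, a minimal sketch of the keystore step mentioned above, written with the eth_account library (an assumption for Ethereum-compatible keys; the paper's own server-side key-management module and SymVerse specifics are not reproduced here): a private key is encrypted into a keystore JSON with a password before being stored server-side and decrypted only when a transaction must be signed.

```python
import json
from eth_account import Account

def store_key(password, path="keystore.json"):
    acct = Account.create()                               # new key pair
    keystore = Account.encrypt(acct.key, password)        # password-encrypted keystore dict
    with open(path, "w") as f:
        json.dump(keystore, f)
    return acct.address                                   # address can be stored in the DB

def load_key(password, path="keystore.json"):
    with open(path) as f:
        return Account.decrypt(json.load(f), password)    # raw private key bytes for signing
```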
ARTICLE | doi:10.20944/preprints202202.0340.v1
Subject: Social Sciences, Education Keywords: cyberlearning; educational innovation; higher education; online learning; student voice
Online: 25 February 2022 (15:20:55 CET)
Many assumptions exist about online learning and its impact on college students. Hitherto, the views of those meant to be the beneficiaries of this technology have been given little consideration, despite the fact that students use cyberspace for academic work and beyond. This qualitative case-study report is based on research conducted by college students at a private university in the Eastern Province of Saudi Arabia. The aim was to examine the online learning experiences of their peers during the first wave of the coronavirus pandemic, with a view to understanding how prepared their university is for an academic genre located in cyberspace. The findings are based on the perspectives of 2,298 college students responding to a survey administered to the entire student population of around 9,000 individuals. They suggest that increasing opportunities for cyberlearning could have positive effects on students. Cautionary advice is also provided about the need to improve teaching pedagogies and combat academic dishonesty.
CASE REPORT | doi:10.20944/preprints202104.0248.v1
Subject: Engineering, Electrical And Electronic Engineering Keywords: forensic speaker comparison; voice processing; ordinary least squares; OLS
Online: 8 April 2021 (17:56:51 CEST)
This case report investigates five real cases that followed legal channels and were judged by the Mato Grosso Court in Brazil. Audio recordings served as key evidence in these lawsuits. The goal is to analyze the cases using a forensic speaker verification methodology based on the Ordinary Least Squares (OLS) algorithm and to compare the results with the analyses obtained in the real cases. The comparison considers both the time required to obtain results and the quality of those results. In Brazil, the duration of a lawsuit is very important, since the Penal Code provides for prescription after a given time, which may lead to impunity. Results show that the OLS-based analysis yields immediate, effective results when compared with those obtained using traditional methodologies in the studied Brazilian lawsuits.
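For illustration only, one plausible (assumed) way OLS can be used for speaker comparison; the report's actual methodology is not reproduced here: regress a long-term feature vector of the questioned recording onto that of the known speaker and use the goodness of fit as a similarity score.

```python
import numpy as np

def ols_similarity(questioned, known):
    """questioned, known: 1-D long-term feature vectors of equal length (hypothetical)."""
    X = np.column_stack([np.ones_like(known), known])     # intercept + slope regressors
    beta, *_ = np.linalg.lstsq(X, questioned, rcond=None)  # OLS fit
    resid = questioned - X @ beta
    ss_res = np.sum(resid ** 2)
    ss_tot = np.sum((questioned - questioned.mean()) ** 2)
    return 1.0 - ss_res / ss_tot                           # R^2 as a similarity score
```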
ARTICLE | doi:10.20944/preprints202307.0413.v1
Subject: Computer Science And Mathematics, Other Keywords: Voice user interface; Geographic Information System; human-computer interaction; multimodal interface; natural language; Web application; Natural language interaction; Voice virtual assistant; Speech recognition
Online: 6 July 2023 (10:08:55 CEST)
ARTICLE | doi:10.20944/preprints202307.1591.v1
Subject: Computer Science And Mathematics, Artificial Intelligence And Machine Learning Keywords: voice recognition; channel adversarial training; information security domain; speaker confirmation
Online: 24 July 2023 (11:36:43 CEST)
With the rapid development of big data, artificial intelligence, and Internet technologies, human-human contact and human-machine interaction have produced explosive growth in voice data. Rapidly identifying a speaker's identity and retrieving and managing his or her speech data within this massive amount of speech data has become a major challenge for intelligent speech applications in the field of information security. This research proposes a voice recognition technique based on channel adversarial training for speaker identity recognition in massive audio and video data, oriented toward the information security domain. The experimental results show that the method projects data from different scene channels onto the same space and dynamically generates interactive speaker representations. It solves the channel mismatch problem and effectively improves the recognition of speakers' voice patterns across channels and scenes. It is able to separate overlapping voices when multiple people speak at the same time and reduces speaker separation errors. It realizes speaker voice recognition for the information security field and achieves an 89% recall rate on a massive database, which gives it practical application value for intelligent applications.
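As a minimal sketch of the channel-adversarial idea (not the paper's architecture; the heads, loss weighting, and lambda value are assumptions): a gradient reversal layer lets a channel classifier be trained on the speaker embedding while its gradient pushes the embedding toward channel invariance, which is one standard way to address channel mismatch.

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)                       # identity in the forward pass

    @staticmethod
    def backward(ctx, grad_out):
        return -ctx.lam * grad_out, None          # reversed gradient to the embedding

def adversarial_losses(embedding, spk_head, chan_head, spk_label, chan_label, lam=1.0):
    ce = nn.CrossEntropyLoss()
    spk_loss = ce(spk_head(embedding), spk_label)                       # stay speaker-discriminative
    chan_loss = ce(chan_head(GradReverse.apply(embedding, lam)), chan_label)  # unlearn channel cues
    return spk_loss + chan_loss
```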
ARTICLE | doi:10.20944/preprints202003.0179.v1
Subject: Engineering, Mechanical Engineering Keywords: fast tool servo; voice coil motor; flexure mechanism; resonant controller
Online: 11 March 2020 (04:04:49 CET)
In this paper, a voice coil motor (VCM) actuated fast tool servo (FTS) system is developed for diamond turning. To guide the motion of the VCM actuator, a crossed double parallelogram flexure mechanism is selected, featuring a totally symmetric structure with high lateral stiffness. To facilitate the determination of the multi-physical parameters, analytical models of both the electromagnetic and mechanical systems are developed. The designed FTS, with balanced stroke and natural frequency, is then verified through finite element analysis. Finally, the prototype of the VCM actuated FTS is fabricated and experimentally demonstrated to have a stroke of ±59.02 μm and a first natural frequency of 253 Hz. By constructing closed-loop control with a PID controller combined with an internal-model based resonant controller, the error for tracking a harmonic trajectory with ±10 μm amplitude and 120 Hz frequency is kept within ±0.2 μm, demonstrating the capability of the FTS for high-accuracy trajectory tracking.
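As background for the control scheme mentioned above, a hedged sketch in standard textbook form (not necessarily the authors' exact controller): a quasi-resonant term tuned to the 120 Hz reference is added to the PID loop; by the internal model principle, embedding the model of the reference sinusoid in the loop yields (ideally) zero steady-state tracking error at that frequency.

```latex
C(s) \;=\; \underbrace{K_p + \frac{K_i}{s} + K_d s}_{\text{PID}}
      \;+\; \underbrace{\frac{2 k_r \omega_c s}{s^2 + 2\omega_c s + \omega_r^2}}_{\text{quasi-resonant term}},
\qquad \omega_r = 2\pi \cdot 120~\mathrm{rad/s},
```

where $k_r$ sets the resonant gain and $\omega_c$ is a small damping/cutoff term that widens the resonance for robustness; these symbols are our notation, not values from the paper.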
BRIEF REPORT | doi:10.20944/preprints202205.0060.v1
Subject: Biology And Life Sciences, Virology Keywords: COVID-19; otolaryngology; Larynx; laryngeal; laryngology; intubation; voice; Head Neck; surgery
Online: 6 May 2022 (04:37:05 CEST)
Objective: To investigate postacute laryngeal injuries and dysfunctions (PLID) in coronavirus disease 2019 (COVID-19) patients. Methods: Three independent investigators performed a systematic review of the current literature on PLID in patients with a history of COVID-19. The review was performed according to the PRISMA statement. Epidemiological, clinical, and hospitalization features, laryngeal diseases, and voice outcomes were extracted from the included papers. Results: Eight papers met our inclusion criteria (393 patients), corresponding to 5 uncontrolled prospective and 3 retrospective studies. The most prevalent PLID were vocal fold dysmotility (65%), vocal fold edema (35%), laryngopharyngeal reflux (21%), and muscle tension dysphonia (21%). Posterior glottic stenosis (12%), granuloma (14%), and posterior glottic diastasis (12%) were the most common injuries. Most patients with PLID were obese and had a history of intensive care unit hospitalization and orotracheal intubation. The delay between discharge and the laryngology office consultation ranged from 51 to 122 days. The mean duration of intubation ranged from 10 to 34 days. Seventy-eight (49%) intubated patients had been placed in the prone position. The proportion of patients requiring surgical treatment ranged from 39% to 70% (mean = 48%). There was substantial heterogeneity between studies regarding inclusion criteria, exclusion criteria, and outcomes. Conclusion: COVID-19 appears to be associated with PLID, especially in patients with a history of intubation. However, future controlled studies are needed to evaluate whether intubated COVID-19 patients report PLID more frequently than patients intubated for other conditions.
ARTICLE | doi:10.20944/preprints202008.0251.v1
Subject: Medicine And Pharmacology, Psychiatry And Mental Health Keywords: mental health assessment; vitality; mental activity; voice index; emotion analysis; noninvasiveness
Online: 11 August 2020 (05:33:57 CEST)
In many developed countries, mental health disorders have become problematic, and the economic loss due to treatment costs and interference with work is immeasurable. We therefore developed a method to assess individuals' mental health using the emotional components contained in their voice. We propose two indices of mental health: vitality, a short-term index, and mental activity, a long-term index capturing trends in vitality. To evaluate our method, we used the voices of healthy individuals (n = 14) and patients with major depression (n = 30). The patients were also assessed by specialists using the Hamilton Rating Scale for Depression (HAM-D). A significant negative correlation existed between the vitality extracted from the voices and the HAM-D scores (r = -0.33, p < .05), and we could discriminate the voice data of healthy individuals from that of patients with depression with high accuracy using vitality (p = .0085, area under the curve = 0.76). Further, we developed a method to estimate stress through emotion rather than analyzing stress directly from voice data. Through daily monitoring of vitality using smartphones, we can encourage people to visit a hospital before they become depressed, or during the early stages of depression, to prevent its adverse consequences.
ARTICLE | doi:10.20944/preprints202103.0513.v1
Subject: Engineering, Automotive Engineering Keywords: Automatic Voice Query Service; Automatic Speech Recognition; Multi-Accented Mandarin Speech Recognition
Online: 22 March 2021 (10:55:53 CET)
An automatic voice query service (AVQS) can greatly reduce labor costs and improve response efficiency for users. Automatic speech recognition (ASR) is one of the key components of an AVQS. However, the many dialect regions in China force an AVQS to serve multi-accented Mandarin users with a single acoustic model in the ASR, which severely limits the accuracy of ASR for multi-accented speech. In this paper, a new framework for AVQS is proposed to improve response accuracy. Firstly, a fused feature combining i-vector and filterbank acoustic features is used to train a Transformer-CTC model. Secondly, the Transformer-CTC model is used to construct an end-to-end ASR system. Finally, a keyword matching algorithm for the AVQS based on fuzzy mathematics is proposed to further improve response accuracy. The results show that the final accuracy of the proposed AVQS framework reaches 91.5%. The proposed framework can satisfy the service requirements of different regions in mainland China. This research is of great significance for exploring the application value of artificial intelligence in real-world scenarios.
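A minimal sketch of the feature fusion step (shapes, layer sizes, and vocabulary size are assumptions, not the paper's configuration): a per-utterance i-vector is tiled across frames, concatenated with filterbank features, and fed to a Transformer encoder whose log-softmax outputs can be trained with a CTC objective.

```python
import torch
import torch.nn as nn

class FusionTransformerCTC(nn.Module):
    def __init__(self, n_fbank=80, ivector_dim=100, d_model=256, vocab=5000):
        super().__init__()
        self.proj = nn.Linear(n_fbank + ivector_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=6)
        self.out = nn.Linear(d_model, vocab + 1)           # +1 for the CTC blank symbol

    def forward(self, fbank, ivector):                     # fbank: (B, T, 80), ivector: (B, 100)
        tiled = ivector.unsqueeze(1).expand(-1, fbank.size(1), -1)   # tile across frames
        x = self.encoder(self.proj(torch.cat([fbank, tiled], dim=-1)))
        return self.out(x).log_softmax(-1)                 # log-probs for nn.CTCLoss
```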
ARTICLE | doi:10.20944/preprints202010.0455.v1
Subject: Engineering, Automotive Engineering Keywords: KINECT; industrial robot; vision system; RobotStudio; Visual Studio; gesture control; voice control
Online: 22 October 2020 (09:57:07 CEST)
The paper presents the possibility of using the KINECT v2 module to control an industrial robot by means of gestures and voice commands. It describes the elements involved in creating software for off-line and on-line robot control. The application for the KINECT module was developed in C# in the Visual Studio environment, while the industrial robot control program was developed in the RAPID language in the RobotStudio environment. Developing a two-threaded application in RAPID allowed two independent tasks to be separated for the IRB120 robot. The main task of the robot is performed in thread no. 1 (responsible for movement). Thread no. 2, running simultaneously, ensures continuous communication with the KINECT system and provides information about gesture and voice commands in real time without any interference with thread no. 1. The applied solution allows the robot to work in industrial conditions without the communication task negatively affecting the robot's cycle times. Thanks to the development of a digital twin of the real robot station, tests of proper application functioning were first conducted in off-line mode (without using a real robot). The obtained results were then verified on-line (on the real test station). Tests of the correctness of gesture recognition were carried out; the robot recognized all programmed gestures. Another test was the recognition and execution of voice commands. A difference in task completion time between the actual and virtual stations was noticed; the average difference was 0.67 s. The last test examined the impact of interference on the recognition of voice commands: with a 10 dB difference between the command and noise, voice command recognition reached 91.43%. The developed computer programs have a modular structure, which enables easy adaptation to process requirements.
ARTICLE | doi:10.20944/preprints201904.0184.v1
Subject: Engineering, Electrical And Electronic Engineering Keywords: autoregressive models; entropy power; linear prediction model; CELP voice codecs; mutual information
Online: 16 April 2019 (11:08:04 CEST)
We write the mutual information between an input speech utterance and its reconstruction by a Code-Excited Linear Prediction (CELP) codec in terms of the mutual information between the input speech and the contributions due to the short term predictor, the adaptive codebook, and the fixed codebook. We then show that a recently introduced quantity, the log ratio of entropy powers, can be used to estimate these mutual informations in terms of bits/sample. A key result is that for many common distributions and for Gaussian autoregressive processes, the entropy powers in the ratio can be replaced by the corresponding minimum mean squared errors. We provide examples of estimating CELP codec performance using the new results and compare to the performance of the AMR codec and other CELP codecs. Similar to rate distortion theory, this method only needs the input source model and the appropriate distortion measure.
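A hedged sketch of the quantities described above, in our own notation rather than the paper's exact expressions: the entropy power of a source with differential entropy h(X), and the log ratio of entropy powers used to estimate mutual-information differences in bits/sample; for Gaussian autoregressive sources, the entropy powers in the ratio may be replaced by the corresponding minimum mean squared errors.

```latex
N(X) \;=\; \frac{1}{2\pi e}\, e^{2 h(X)}, \qquad
\Delta I \;\approx\; \tfrac{1}{2}\log_2 \frac{N_1}{N_2}
         \;\approx\; \tfrac{1}{2}\log_2 \frac{\mathrm{MMSE}_1}{\mathrm{MMSE}_2}
         \quad \text{[bits/sample]},
```

where $N_1$ and $N_2$ denote the entropy powers of the reconstruction errors at two successive stages of the codec (e.g., after the short-term predictor versus after adding the adaptive or fixed codebook contribution); the subscripts are our labels, not the paper's.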
ARTICLE | doi:10.20944/preprints201811.0126.v1
Subject: Engineering, Electrical And Electronic Engineering Keywords: Speech/Music Classification; Enhanced Voice Service; Long Short-Term Memory; Big Data
Online: 5 November 2018 (17:02:36 CET)
Speech/music classification, which facilitates optimized signal processing based on the classification results, has been widely adopted as an essential part of various electronics applications, such as multi-rate audio codecs, automatic speech recognition, and multimedia document indexing. In this paper, a new technique to improve the robustness of the speech/music classifier of the 3GPP enhanced voice service (EVS) using long short-term memory (LSTM) is proposed. For effective speech/music classification, the feature vectors used with the LSTM are selected from the features of the EVS. Experiments show that LSTM-based speech/music classification produces better results than the conventional EVS classifier under a variety of conditions and types of speech/music data.
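For illustration, a minimal sketch of an LSTM-based speech/music classifier (the feature set and layer sizes are assumptions; they are not the EVS features used in the paper): a sequence of per-frame acoustic features is summarized by an LSTM and mapped to a speech-versus-music decision for the segment.

```python
import torch
import torch.nn as nn

class SpeechMusicLSTM(nn.Module):
    def __init__(self, n_features=40, hidden=64):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 2)          # class 0 = speech, class 1 = music

    def forward(self, frames):                    # frames: (batch, T, n_features)
        _, (h, _) = self.lstm(frames)             # final hidden state summarizes the segment
        return self.head(h[-1])                   # logits for the whole segment

model = SpeechMusicLSTM()
logits = model(torch.randn(4, 100, 40))           # one decision per 100-frame segment
```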
ARTICLE | doi:10.20944/preprints201906.0104.v1
Subject: Computer Science And Mathematics, Artificial Intelligence And Machine Learning Keywords: Deep Learning, Generative Adversarial Networks (GANs), Machine Learning, Autoencoders, Voice Conversion, Ethics, CycleGANs
Online: 12 June 2019 (11:17:52 CEST)
The upsurge of Generative Adversarial Networks (GANs) over the previous five years has led to advancements in unsupervised data manipulation, sourced feature translation, and precise input-output synthesis through competitive optimization of the discriminator and generator networks. More specifically, the recent rise of cycle-consistent GANs enables style transfer from a discrete source (input A) to a target domain (input B) by preprocessing object features for a multi-discriminative adversarial network. Traditionally, cyclical adversarial networks have been exploited for unpaired image-to-image translation and domain adaptation by determining mapped relationships between an input A graphic and an input B graphic. However, this integral mechanism of domain adaptation can also be applied to the complex acoustical features of human speech. Although well-established datasets, such as the 2018 Voice Conversion Challenge repository, paved the way for female-male voice transformation, cycle-GANs have rarely been re-engineered for voices outside these datasets. More critically, cycle-GANs have massive potential to extract surface-level and hidden features to distort an input A source into a texturally unrelated target voice. By preprocessing, compressing, and packaging unique acoustical voice properties, CycleGANs can learn to decompose speech signals and implement new translation models while preserving emotion, the intent of words, rhythm, and accents. Given the potential of the CycleGAN autoencoder for realistic unsupervised voice-to-voice conversion and feature adaptation, the researchers raise the ethical implications of controlling source input A to manipulate target voice B, particularly in cases of defamation and sabotage of target B's words. This paper analyzes the potential of cycle-consistent GANs for deceptive voice-to-voice conversion by manipulating interview excerpts of political candidates.
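A minimal sketch of the cycle-consistency idea applied to acoustic features (not a full CycleGAN voice-conversion system; generator sizes and the 80-dimensional feature assumption are illustrative): two generators map speaker A features to speaker B and back, and the cycle loss keeps linguistic content intact while the adversarial terms, omitted here, push converted features toward the target speaker.

```python
import torch
import torch.nn as nn

G_ab = nn.Sequential(nn.Linear(80, 256), nn.ReLU(), nn.Linear(256, 80))  # speaker A -> B
G_ba = nn.Sequential(nn.Linear(80, 256), nn.ReLU(), nn.Linear(256, 80))  # speaker B -> A
l1 = nn.L1Loss()

def cycle_loss(feats_a, feats_b):
    """Round-trip reconstruction loss on per-frame acoustic features."""
    return l1(G_ba(G_ab(feats_a)), feats_a) + l1(G_ab(G_ba(feats_b)), feats_b)
```

In a full system this term is weighted against discriminator losses for each domain; the cycle constraint is what allows training on unpaired utterances from the two speakers.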
ARTICLE | doi:10.20944/preprints202212.0567.v1
Subject: Computer Science And Mathematics, Artificial Intelligence And Machine Learning Keywords: Voice Assistance, Machine Learning, Virtual Assistance, Artificial Intelligence, Selection Bias, Sample Population, Python 3.10, Pyttsx3, PyTorch, JSON
Online: 30 December 2022 (02:12:34 CET)
In recent times, voice assistants have become part of our day-to-day lives, allowing information retrieval through voice synthesis, voice recognition, and natural language processing. These voice assistants can be found in many modern devices from Apple, Amazon, Google, and Samsung. This project is primarily focused on virtual assistance in natural language processing, a form of AI that helps machines understand people and create feedback loops. The project uses deep learning to create a voice recognizer, with Common Voice and data collected from the local community used for model training in Google Colaboratory. After recognizing a command, the AI assistant is able to perform the most suitable action and then give a response. The motivation for this project comes from the race and gender bias that exists in many virtual assistants. The computer industry is primarily dominated by men, and because of this, many of the products produced do not take women into account; this bias has an impact on natural language processing. The project utilizes various open-source projects to implement machine learning algorithms and train the assistant to recognize different types of voices, accents, and dialects. Through this project, the goal is to use voice data from underrepresented groups to build a voice assistant that can recognize voices regardless of gender, race, or accent. Increasing the representation of women in the computer industry is important for the future of the industry, and by representing women in the initial study of voice assistants, it can be shown that they play a vital role in the development of this technology. In line with related work, this project uses first-hand data from the college population and middle-aged adults to train voice assistants to combat gender bias.
ARTICLE | doi:10.20944/preprints202008.0221.v1
Subject: Computer Science And Mathematics, Mathematical And Computational Biology Keywords: arousal level; emotion; major depression severity; voice index; Hurst exponent; zero-crossing rate; Hamilton Rating Scale for Depression
Online: 9 August 2020 (21:15:01 CEST)
Recently, the relationship between emotional arousal and depression has been studied. Focusing on this relationship, we first developed an arousal level voice index (ALVI) to measure arousal levels using the Interactive Emotional Dyadic Motion Capture database. We then calculated ALVI from the voices of depressed patients from two hospitals (Ginza Taimei Clinic [GTC] and National Defense Medical College hospital [NDMC]) and compared it with the severity of depression as measured by the Hamilton Rating Scale for Depression (HAM-D). Depending on the HAM-D score, the datasets were classified into a no-depression group (HAM-D < 8) and a depression group (HAM-D ≥ 8) for each hospital. A comparison of the mean ALVI between the groups was performed using the Wilcoxon rank-sum test, and a significant difference was found at the 10% level (p = 0.094) at GTC and at the 1% level (p = 0.0038) at NDMC. The area under the curve (AUC) of the receiver operating characteristic was 0.66 when categorizing the two groups for GTC, and the AUC for NDMC was 0.70. The relationship between arousal level and depression severity was thus indirectly suggested via ALVI.
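For illustration, a minimal sketch of the group comparison described above (ALVI values are assumed to be precomputed per patient, and the sign convention for the AUC is an assumption): a Wilcoxon rank-sum test between the no-depression and depression groups, plus the ROC AUC for separating them.

```python
import numpy as np
from scipy.stats import ranksums
from sklearn.metrics import roc_auc_score

def compare_groups(alvi, ham_d, cutoff=8):
    alvi, ham_d = np.asarray(alvi), np.asarray(ham_d)
    depressed = ham_d >= cutoff                      # HAM-D >= 8 defines the depression group
    stat, p = ranksums(alvi[~depressed], alvi[depressed])
    auc = roc_auc_score(depressed, -alvi)            # assumes lower ALVI accompanies depression
    return p, auc
```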
REVIEW | doi:10.20944/preprints201903.0033.v1
Subject: Computer Science And Mathematics, Artificial Intelligence And Machine Learning Keywords: augmentative and alternative communication; assistive technologies; sensing modalities; signal processing; voice communication; machine learning; mobile health; speech disability
Online: 4 March 2019 (10:14:44 CET)
High-tech augmentative and alternative communication (AAC) methods are on a constant rise; however, the interaction between the user and the assistive technology still falls short of an optimal user experience centered on the desired activity. This review presents a range of signal sensing and acquisition methods utilized in conjunction with existing high-tech AAC platforms for individuals with speech disabilities, including imaging methods, touch-enabled systems, mechanical and electro-mechanical access, breath-activated methods, and brain-computer interfaces (BCI). The listed AAC sensing modalities are compared in terms of ease of access, affordability, complexity, portability, and typical conversational speeds. An examination of the associated AAC signal processing, encoding, and retrieval highlights the roles of machine learning (ML) and deep learning (DL) in the development of intelligent AAC solutions. The demands and costs of most systems were found to hinder widespread use of high-tech AAC. Further research is needed to develop intelligent AAC applications that reduce the associated costs and enhance the portability of the solutions for real user environments. The consolidation of natural language processing with current solutions also needs to be explored further to improve conversational speeds. Recommendations for prospective advances in high-tech AAC are addressed in terms of developments supporting mobile health communication applications.
ARTICLE | doi:10.20944/preprints201905.0125.v1
Subject: Computer Science And Mathematics, Computer Vision And Graphics Keywords: Parkinson's disease (PD); Biomedical voice measurements; Multi-layer Perceptron Neural Network (MLP); Biogeography-based Optimization (BBO); Medical diagnosis; Bio-inspired computation
Online: 10 May 2019 (13:56:59 CEST)
In recent years, Parkinson's disease (PD), a progressive syndrome of the nervous system, has become highly prevalent worldwide. In this study, a novel hybrid technique is established by integrating a multi-layer perceptron neural network (MLP) with biogeography-based optimization (BBO) to classify PD based on a series of biomedical voice measurements. BBO is employed to determine the optimal MLP parameters and boost prediction accuracy. The inputs comprise 22 biomedical voice measurements. The proposed approach detects two PD statuses: 0 (disease status) and 1 (control status). The performance of the proposed method was compared with PSO, GA, ACO, and ES methods. The outcomes affirm that the MLP-BBO model exhibits higher precision and suitability for PD detection. The proposed diagnosis system, as a type of speech-based algorithm, detects early Parkinson's symptoms and can consequently serve as a promising, robust tool with excellent PD diagnosis performance.
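A minimal sketch of the interface between the MLP and a population-based optimizer such as BBO (layer sizes, activation, and encoding are assumptions, not the paper's configuration): each candidate solution, or "habitat", is a flat weight vector; its fitness is the classification accuracy of the decoded MLP on the 22 voice measurements, and the BBO migration/mutation loop, omitted here, searches over these vectors.

```python
import numpy as np

SIZES = [(22, 10), (10, 1)]                        # weights of a hypothetical 22-10-1 MLP
N_PARAMS = sum(i * o + o for i, o in SIZES)        # length of one habitat vector

def predict(weights, X):
    a, idx = X, 0
    for i, o in SIZES:
        W = weights[idx: idx + i * o].reshape(i, o); idx += i * o
        b = weights[idx: idx + o]; idx += o
        a = np.tanh(a @ W + b)
    return (a.ravel() > 0).astype(int)             # 1 = control status, 0 = disease status

def fitness(weights, X, y):                        # quantity the BBO loop maximizes
    return np.mean(predict(weights, X) == y)
```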
ARTICLE | doi:10.20944/preprints202103.0568.v1
Subject: Social Sciences, Psychology Keywords: Voice and sexual orientation; Human sex pheromones; Evolution of sexual orientation; Development of sexual orientation; Puberty; Causes of sexual orientation; Biology of sexual orientation
Online: 23 March 2021 (12:47:18 CET)
The biology of sexual orientation has intrigued people for generations. Many models have provided insights into this topic, but there are still unanswered questions. In humans, sexual orientation has a learned component: humans have to learn the cues by which they identify the sex of their mates, and the cues of the emotional messages that those mates broadcast. Many of those cues depend on arbitrary societal conventions. The cues are learned automatically and subconsciously during childhood, based on non-sexual experiences. When sexual orientation emerges at puberty, youngsters cannot tell how and when they acquired it. A model that deals with these phenomena is presented. A basic tenet of the model is that a sexual orientation is determined by the innate wiring of the brain. The model describes how the brain learns cues for identifying the sex of the mate, and cues for identifying the emotional messages that the mate broadcasts. The learning mechanism is conditioning; the unconditioned stimulus is the human voice, and the unconditioned responses are the triggers of the physical and emotional manifestations of sexual activity. The model suggests that innate connections from auditory detectors of men's and women's voices onto brain centers that trigger sexual activity, such as the hypothalamus, determine the sexual orientation that emerges at puberty, while innate connections from those auditory centers to emotional centers, such as the amygdala, determine the learned emotional cues. It is also proposed that, during evolution, the roles of the chemosensory system in identifying mates were taken over by the auditory system.
HYPOTHESIS | doi:10.20944/preprints202003.0134.v1
Subject: Social Sciences, Gender And Sexuality Studies Keywords: nature and nurture of sexual orientation; development of sexual orientation; Puberty; Voice and sexual orientation; causes of sexual orientation; MGN and sexual orientation; biology of sexual orientation
Online: 8 March 2020 (05:11:50 CET)
A testable theoretical model is presented, proposing which brain parts and mechanisms are responsible for the nature and the nurture components of all human sexual orientations. The model integrates observations from humans and a wide range of animals. If validated, the model would provide a proximate explanation of the biological substrates of all sexual orientations. The basic assumptions of the model are: (1) In non-sexual conditioning experiences, children automatically and subconsciously learn cues for recognizing sexual mates; that skill emerges at puberty. (2) Adults in the child's surroundings act as innocuous, unaware role-models that provide the learned cues for recognizing mates. (3) The voices of men and women serve as the innate, primary unconditioned stimuli (US) in that learning process. (4) The hypothalamus is the main area that elicits the signals of the unconditioned responses (UR); those signals trigger the learning of the associated conditioned stimuli (CS) broadcast by the role-models. (5) The amygdala, bed nuclei of the stria terminalis (bnST), and hypothalamus play similar roles in humans to those they play in other species. (6) The human medial geniculate nucleus (MGN) plays the roles played by the olfactory bulbs in rodents. (7) Detectors of innate primary US and activators of the unconditioned sexual responses (UR) are located in the MGN, amygdala, bnST, and hypothalamus axis (MASHA). The learned conditioned stimuli (CS) are recorded in the MASHA and in cortical areas. (8) The innate US-UR connections vary across three groups of children: in the first group, only men's voices trigger the UR; in the second group, only women's voices trigger the UR; and in a third group, either voice can trigger the UR. That determines the learned cues: the first group will be attracted at puberty only to men, the second only to women, and the third group to both.