1. Introduction
Voice analysis combined with artificial intelligence (AI) is rapidly becoming a vital tool for disease diagnosis and monitoring.[1,2,3,4,5,6,7] For many decades, voice-based diagnostics has been applied in otorhinolaryngology to study voice disorders.[1] Because the human voice is produced by several organs acting together, voice-based diagnostics has been shown to be applicable to other medical disciplines as well.[2] Examples include neurology, for diagnosing and monitoring dementia and Parkinson’s disease;[3] pulmonology, for respiratory diseases;[2,4] and oncology, for early detection of many types of cancer, including laryngeal, throat, oral, and lung cancer.[5] Recently, voice-based diagnostics was also found to be effective for detecting and monitoring heart failure.[6]
Voice-based diagnostics has many benefits. It is non-invasive and can be more comfortable for patients than traditional diagnostic procedures. It enables early detection. It is easily accessible and reduces costs. In particular, patients could monitor their health by submitting voice samples remotely to a healthcare provider, enabling ongoing evaluation and personalized care.
A key issue currently under intensive study is the identification of vocal biomarkers, which are quantifiable features extracted from voice signals to assess a person’s health status or predict the likelihood of certain diseases.[2,8,9] At present, feature extraction relies on the traditional pitch-asynchronous analysis methods of speech technology.[10,11] The following biomarkers are often used.[6,13]
Mel-frequency cepstral coefficients (MFCCs), the coefficients derived by computing the cepstrum of a log-magnitude spectrum based on the Mel scale of voice perception.[14,15] MFCCs are a widely used biomarker.
Linear-frequency cepstral coefficients (LFCCs), the coefficients derived in the same way but on a linear frequency scale.
The first two formants F1 and F2, which are the two lowest resonance frequencies of the vocal tract.
Linear prediction coefficients (LPC).[30,31] They are widely used in speech technology and voice analysis but not frequently in voice-based diagnostics.[1]
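To make the phrase "cepstrum of a log-magnitude spectrum" concrete, the following is a minimal sketch of a real cepstrum on a linear frequency scale. It is a simplified illustration only: it omits the Mel filter bank and the cosine transform used in practical MFCC and LFCC pipelines, and the function names `dft` and `real_cepstrum`, as well as the `floor` parameter, are assumptions introduced here rather than part of any cited method.

```python
import cmath
import math


def dft(x):
    """Naive discrete Fourier transform (O(N^2)); adequate for short frames."""
    n = len(x)
    return [
        sum(x[k] * cmath.exp(-2j * math.pi * m * k / n) for k in range(n))
        for m in range(n)
    ]


def real_cepstrum(frame, floor=1e-12):
    """Inverse DFT of the log-magnitude spectrum of one frame.

    `floor` is an assumed guard against log(0) for empty spectral bins.
    """
    n = len(frame)
    log_mag = [math.log(max(abs(s), floor)) for s in dft(frame)]
    # The log-magnitude spectrum of a real frame is real and even,
    # so the inverse transform is real; only the real part is kept.
    return [
        sum(log_mag[m] * cmath.exp(2j * math.pi * m * k / n) for m in range(n)).real / n
        for k in range(n)
    ]


# A unit impulse has a flat magnitude spectrum, so its cepstrum is zero.
impulse = [1.0] + [0.0] * 15
cep_impulse = real_cepstrum(impulse)

# Doubling the gain changes only the zeroth coefficient (by log 2).
cep_scaled = real_cepstrum([2.0 * v for v in impulse])
```

In this simplified form, an overall gain change affects only the zeroth coefficient, which is one reason cepstral coefficients are convenient as features.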
All the above biomarkers are extracted using pitch-asynchronous analysis methods, and much vital information is lost in the extraction process. As we will show in this paper, better biomarkers can be extracted on the basis of a better understanding of voice production and a pitch-synchronous parameterization method.
According to the physiology of human voice production, as presented in Section III,[18,19,20,21] human voice is generated one pitch period at a time, starting at each glottal closing instant (GCI). The elementary sound wave triggered by each GCI, called a timbron, contains full information on the timbre. Continuous voice is generated by a superposition of a sequence of timbrons triggered by a series of glottal closing events. According to the timbron theory of voice production, the time difference between two adjacent GCIs defines the pitch period, and the sound waveform in each pitch period contains full information on the timbre.
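The superposition picture can be illustrated with a toy sketch. The elementary wave used below, a single exponentially decaying sinusoid, is purely a stand-in chosen for illustration and is not the timbron model of the article; the names `toy_timbron` and `superpose`, the sampling rate, and all numerical values are likewise hypothetical.

```python
import math


def toy_timbron(n_samples, fs, freq_hz=500.0, decay_per_s=300.0):
    """Hypothetical elementary wave: one exponentially decaying sinusoid."""
    return [
        math.exp(-decay_per_s * n / fs) * math.sin(2.0 * math.pi * freq_hz * n / fs)
        for n in range(n_samples)
    ]


def superpose(gci_samples, timbron, total_samples):
    """Add one copy of the elementary wave starting at each GCI sample index."""
    signal = [0.0] * total_samples
    for start in gci_samples:
        for offset, value in enumerate(timbron):
            index = start + offset
            if index < total_samples:
                signal[index] += value
    return signal


fs = 16000                                   # assumed sampling rate in Hz
timbron = toy_timbron(n_samples=400, fs=fs)  # 25 ms elementary wave
gci_samples = [0, 128, 256, 384]             # assumed GCIs, 8 ms apart
voice = superpose(gci_samples, timbron, total_samples=800)
```

Because the elementary wave here outlasts the 8 ms spacing between GCIs, consecutive copies overlap and add, which is what "superposition" means in this context.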
To extract more information from voice signals, a pitch-synchronous analysis method has been developed.[18,19,20,21] With that method, pitch information and timbre information are cleanly separated. On average, the pitch period is 8 msec for men and 4 msec for women. Therefore, every 4 to 8 msec, a complete and accurate set of information on the timbre and pitch of the human voice can be obtained. The biomarkers extracted using a pitch-synchronous analysis method contain abundant, accurate, objective, and reproducible information from the voice signals, which could improve the usability and reliability of voice-based diagnostics.
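As a small worked example of the relation between GCIs, pitch period, and pitch frequency, the sketch below derives pitch periods from GCI positions and cuts a signal into one segment per pitch period. It assumes the GCI positions are already known as sample indices; the function names, the 16 kHz sampling rate, and the placeholder signal are hypothetical and serve only to illustrate the arithmetic.

```python
def pitch_periods(gci_samples, fs):
    """Time difference between adjacent GCIs, in seconds."""
    return [(b - a) / fs for a, b in zip(gci_samples, gci_samples[1:])]


def pitch_frequencies(gci_samples, fs):
    """Pitch frequency in Hz: the inverse of each pitch period."""
    return [1.0 / period for period in pitch_periods(gci_samples, fs)]


def pitch_synchronous_segments(signal, gci_samples):
    """One segment per pitch period, from each GCI up to the next."""
    return [signal[a:b] for a, b in zip(gci_samples, gci_samples[1:])]


fs = 16000                          # assumed sampling rate in Hz
gci_8ms = [0, 128, 256, 384]        # assumed GCIs spaced 8 msec apart
gci_4ms = [0, 64, 128, 192]         # assumed GCIs spaced 4 msec apart

periods_8ms = pitch_periods(gci_8ms, fs)    # each about 0.008 s
freqs_8ms = pitch_frequencies(gci_8ms, fs)  # each about 125 Hz
freqs_4ms = pitch_frequencies(gci_4ms, fs)  # each about 250 Hz

placeholder_signal = [0.0] * 400            # stands in for recorded samples
segments = pitch_synchronous_segments(placeholder_signal, gci_8ms)
```

With these assumed values, an 8 msec pitch period corresponds to a pitch frequency of about 125 Hz and a 4 msec period to about 250 Hz, and each segment spans exactly one pitch period.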
The organization of the article is as follows.
In Section II, the deficiencies of the traditional pitch-asynchronous analysis methods are analyzed. These methods were developed in the middle of the 20th century, when computing power was low and programming languages were underdeveloped. A lot of vital information is lost during the extraction process.
In Section III, the timbron theory of voice production, a modern version of the transient theory, is presented as a logical consequence of the temporal correlation of the voice signals and the simultaneously acquired electroglottograph (EGG) signals. According to the timbron theory, the time difference between two adjacent glottal closing instants (GCIs) defines the pitch period (the inverse of which is the pitch frequency), and the sound waveform in each individual pitch period contains full information on the timbre of the voice.
In Section IV, a pitch-synchronous method of voice analysis is presented. By applying the pitch-synchronous analysis method to a standard US English speech corpus, the formant parameters of all US English monophthong vowels are measured (Section V). The correctness of the formant parameters is tested by voice synthesis, using a program appended to the article.
The formant parameters are not the best biomarkers for voice-based diagnostics. In Section VI, the definition of timbre vectors, together with the method of extraction, is presented. From a standard US English speech corpus, the timbre vectors for all monophthong vowels are presented. In particular, the timbre distances among all US English monophthongs are presented, showing the reliability and accuracy of the method.
In Section VII, the methods of finding jitter, shimmer, and spectral irregularities are presented. The analysis is based on the recordings in the Saarbrucken voice database and shows the effectiveness of the pitch-synchronous analysis method and the usefulness of timbre vectors.
In Section VIII, the methods for detecting GCIs from voice signals are presented. Based on the reference GCIs from the EGG, the accuracy and usefulness of the methods of detecting GCIs from voice signals are discussed. The value of simultaneously acquired EGG signals is also discussed. Section IX presents results and discussions.