Pilot Study of Voice and Speech Biomarkers: Exploring Healthy Controls in a Non-Clinical Setting

Tara Chatty; Shreshtha Das; Corinthian Ewesuedo; Ezimma Onwuka; Waleed Shirwa; Paul C. Bryson; Colin K. Drummond

doi:10.20944/preprints202512.1151.v2

Submitted: 27 May 2026

Posted: 29 May 2026


Abstract

Voice-based approaches for screening and diagnostic applications, particularly in telemedicine, often rely on patient recordings collected outside clinical environments. Establishing normative baselines is essential to advance voice analytics and clinical utility. This pilot study examined acoustic parameters in 32 healthy young adults (ages 18–24) with no history of vocal pathology, neurological disorders, or speech impediments. Recordings focused on sustained vowels (/a/, /e/, /o/, /u/) for the primary study of voice characteristics, and on a standardized phonetically balanced phrase for an exploratory speech study. Analyses focused on features including fundamental frequency, jitter, shimmer, harmonics-to-noise ratio, formants (F1–F3), speaking rate, intensity, and spectral measures. Preliminary results revealed significant differences between healthy controls and a reference dataset of laryngitis patients, suggesting acoustic features can serve as objective markers of vocal fold inflammation. However, pathology-specific biomarker identification was constrained by the quality of available laryngitis data. Simple statistical comparisons proved insufficient, emphasizing the value of multivariate analysis, cepstral peak prominence (CPP), and mel-frequency cepstral coefficients (MFCC). Challenges in non-clinical data collection highlight the need for standardized, detailed annotation of patient recordings to improve diagnostic accuracy and strengthen the predictive power of future biomarker studies.

Keywords: voice; biomarker; statistical baseline; healthy adults; clinical trial

Subject: Engineering - Bioengineering

1. Introduction

The human voice, a complex acoustic output, is increasingly recognized as a promising, non-invasive, and cost-effective biomarker in healthcare [1,2]. Voice contains intricate acoustic markers that have been linked to a wide array of health conditions, including neurodegenerative diseases such as dementia, mood disorders, and even various forms of cancer [1,3]. The expanding field of vocal biomarkers, defined as features from the audio signal of the voice associated with a clinical outcome, holds significant potential for patient monitoring, disease diagnosis, grading of severity, depression assessment [4], and even drug development [5,6].

Voice data collected through everyday devices like smartphones offers a scalable, low-cost, patient-friendly path to diagnostic and public-health applications. Advances in AI, remote monitoring, and initiatives like Bridge2AI-Voice underscore the need for well-characterized healthy and pathological baselines to support vocal-biomarker development [6,7]. The current work contributes to that effort in two ways:

It introduces a new dataset of healthy young adults recorded in non-clinical, naturalistic settings—an age group largely underrepresented in existing voice-biomarker research.
The utility of these healthy recordings is tested as an external control benchmark against voices with pathology.

In the current work, the well-known Saarbrücken Voice Database (SVD) [8,9,10] provides open access to detailed pathological voice samples for analysis. By pairing an annotated healthy cohort with a widely used pathological dataset, this study provides a clearer comparative framework for advancing diagnostic voice analytics.

1.1. Core Acoustic Measurements

Researchers can extract an extensive array of acoustic features from human voice recordings to obtain quantifiable insights into vocal biomechanics and health [11,12]. While traditional parameters (see Table A1) remain staples in the literature, they often suffer from high intra- and inter-speaker variability, which can limit their diagnostic specificity. Consequently, there is a growing shift toward spectral analysis to capture the more nuanced complexities of voice quality that traditional measures miss. For our pilot program we have selected the following ten acoustic measures:

F0 Fundamental frequency, Hz

Jitter Cycle-to-cycle variation in fundamental period, expressed as a fraction or a percentage

Shimmer Cycle to cycle variation in voice amplitude, expressed as a fraction

HNR Harmonics-to-noise ratio, dB

Intensity Energy transmitted by vocal vibrations

CPP Cepstral Peak Prominence

RSPL Root-Mean-Square Sound Pressure Level

MPT Maximum Phonation Time

F3 Third formant frequency, Hz

MFCC-1 First Mel-Frequency Cepstral Coefficient
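To make the two perturbation measures in this list concrete, the following minimal sketch computes local jitter and local shimmer as the mean absolute difference between consecutive cycles divided by the mean value. The function name and the sample periods and amplitudes are hypothetical and are not taken from the study data; in practice Praat derives the cycle boundaries from the waveform.

```python
import numpy as np

def local_perturbation(values):
    """Mean absolute difference between consecutive cycles, divided by the mean value."""
    v = np.asarray(values, dtype=float)
    if v.size < 2:
        raise ValueError("need at least two cycles")
    return np.mean(np.abs(np.diff(v))) / np.mean(v)

# Hypothetical glottal cycle periods (seconds) and peak amplitudes (arbitrary units)
periods = [0.0100, 0.0102, 0.0099, 0.0101, 0.0100]
amplitudes = [0.80, 0.82, 0.79, 0.81, 0.80]

jitter_local = local_perturbation(periods)       # fraction of the mean period
shimmer_local = local_perturbation(amplitudes)   # fraction of the mean amplitude
mean_f0 = 1.0 / np.mean(periods)                 # Hz

print(f"jitter = {100 * jitter_local:.2f} %, "
      f"shimmer = {100 * shimmer_local:.2f} %, F0 = {mean_f0:.1f} Hz")
```

Multiplying either fraction by 100 gives the percentage form mentioned in the list.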

1.2. Healthy Voice Baseline as an External Control

It has been proposed that a prerequisite for using voice as a reliable biomarker is the establishment of comprehensive baselines of healthy vocal patterns [13]. These baselines are essential for accurate comparison with pathological voices, forming the foundation for screening, diagnosis, and remote monitoring [7,13,14]. Without standardized baselines across diverse populations, identifying clinically meaningful deviations becomes unreliable and limits the adoption of vocal biomarkers [13,15]. The human voice is a complex physiological signal produced through the coordinated interaction of the respiratory system, larynx, and vocal tract [16]. Because pathology in any of these structures alters normal vibratory or resonatory function, it will manifest as measurable acoustic deviations from a typical healthy voice.

In clinical research, the strongest method for tracking pathology is intra-subject comparison - evaluating a patient’s condition against their own pre-pathology state [17]. This approach accounts for individual biological variability and strengthens causal inference. However, ideal pre-pathology baselines are often unavailable due to acute onset, late diagnosis, or retrospective data collection. In such cases, external controls are used, drawing on data from other patients or prior studies [18].

Using well-characterized healthy voice data reduces metric arbitrariness and improves diagnostic reliability [19]. Because such baselines often outperform noisy person-specific controls, our study examines whether our dataset can function as a stable healthy reference.

1.3. Sustained Vowel Phonation and Connected Speech Tasks

Clinicians and researchers have utilized both sustained vowel phonation and connected speech tasks to evaluate vocal function [20]. While sustaining a note isolates how the vocal cords work, speaking in sentences shows how the voice performs in the real world. The inclusion of both sustained vowels and connected speech in voice evaluations is advocated due to the distinct yet complementary information each stimulus provides [21].

Sustained vowel phonation is widely used in voice disorder assessment because it provides a relatively “steady-state” production that minimizes the effects of articulation, intonation, stress, and speaking rate, making intrinsic vocal fold function easier to analyze [22]. These tasks are particularly valuable for measuring stability and regularity of vocal fold vibration, enabling precise evaluation of perturbation parameters such as jitter and shimmer. Their time-invariance and ease of control make sustained vowels a cornerstone of baseline acoustic analysis [22].

In contrast, connected speech offers a realistic view of how individuals use their voices in daily communication. It captures dynamic behaviors such as voice onsets and offsets, pauses, voiceless phonemes, and continuous fluctuations in pitch and intensity shaped by prosody and phonetic context. Many voice disorders manifest primarily under these complex demands, making connected speech essential for evaluating intelligibility and naturalness [23,24].

Professional organizations such as ASHA endorse combining sustained vowels with connected speech to integrate auditory-perceptual and laboratory measures [17]. Sustained vowels isolate the glottal source [25], while short phrases capture performance under natural communicative demands [2,25]. Although some studies report similar information across tasks, others show significant differences, for example higher sentence-level F0 in men and distinct jitter and shimmer patterns. Despite greater variability in connected speech, short-phrase measures can better indicate disorders such as hoarseness [25]. Using both tasks provides complementary insights essential for developing robust vocal biomarkers and evidence-based assessment protocols.

1.4. Source-Filter Theory for Voice Acoustics Analysis

The source–filter theory explains that vocal folds generate the glottal source of pitch and quality, while the vocal tract filters this signal to shape resonance and produce distinct speech sounds [26,27]. Vocal fold vibration creates a complex tone for voiced phonemes, and movements of the tongue, lips, and jaw dynamically reshape the vocal tract to amplify or suppress specific frequencies. These resonant peaks (formants) provide the acoustic cues that distinguish vowel phonemes and account for both individual voice differences and variation across articulatory contexts.

Phonation-related measures reflect vocal fold behavior, with parameters such as F0, jitter, shimmer, and CPP indexing vibratory stability, glottal closure, and voice quality. Pathologies that alter fold mass or stiffness produce corresponding acoustic changes independent of the vocal tract. In contrast, filter-related measures capture how vocal tract anatomy and articulation shape resonance. Formant frequencies (F1–F3) vary with tract length and articulator configuration, explaining why speakers with similar pitch can sound different. Although coarticulation shifts formants during connected speech, anatomical constraints limit their range [28]. Together, source- and filter-level measures provide complementary insight into how physiology and articulation jointly shape the acoustic signal. Table A2 summarizes key attributes of sustained vowels vs. connected speech (compiled from various sources).

1.5. Laryngitis Case Study

In the present work we consider a case study in laryngitis, an inflammatory condition that disrupts vocal fold vibration and produces acoustic changes such as hoarseness, reduced pitch range, and increased perturbation. We compare healthy controls with pathological samples from the Saarbrücken Voice Database (SVD) to identify differentiating acoustic features. The analysis focuses on chronic laryngitis (symptoms persisting beyond three weeks) because it reflects sustained mechanical or environmental stress rather than transient acute infection.

1.6. Research Scope

This 32-participant pilot study evaluates whether a healthy cohort can serve as a robust normative baseline for biomarker research and as an external control for a laryngitis case study using Saarbrücken Voice Database (SVD) samples. It also enables paired analysis of vowels and phrases to characterize healthy profiles and quantify how these diverge in pathology. Overall, the study aims to assess the value of this pilot cohort as a foundation for future healthy-voice baseline development, validating its performance against pathology and addressing research questions that move from defining norms to contrasting them with laryngitis and evaluating applied utility. Our research focuses on the following three questions:

What are the normative acoustic characteristics of sustained vowel phonations and short phrase productions in healthy young adults?
How do acoustic profiles of healthy young adults diverge from those of laryngitis patients in the Saarbrücken Voice Database?
How can paired vowel and phrase data enhance the robustness of voice outcome measures collected in non-clinical settings?

2. Materials and Methods

2.1. Participants and Ethical Approval

Healthy voice samples were collected from 32 participants (11 females and 21 males) between the ages of 18 and 24 on the Case Western Reserve University campus. The study was approved by the Case Western Reserve University Institutional Review Board. Participation was voluntary, and all individuals provided informed consent before contributing recordings. No personally identifying information was collected; each participant was assigned a numerical code to maintain anonymity. Table 1 shows self-reported binary gender (male/female). Subjects were drawn from the CWRU Voice Study (CVS) and the Saarbrücken Voice Database (SVD).

2.2. Data Collection

Data collection took place in a quiet conference room with minimal background noise and the subject comfortably seated. A voice recorder (Sony ICD-UX570 Digital Voice Recorder) was held approximately 12 inches from the subject at about a 45-degree angle to collect voice samples. Participants were instructed to pronounce a series of sustained vowels (/a/, /e/, /i/, /o/, /u/) and to recite the phrase, “The sun sets in Cincinnati on Saturday.” This phrase was selected because of the repeated /s/ in various positions and was thought to be relevant for connected speech evaluation, though the literature offers hundreds of sentence options [29].

Demographic data (age and gender) and self-reported health status (“Do you feel healthy today?”) were reported before voice capture. A wide variety of voice recorders could have been used for this study; while technology makes a difference in acoustics research [30] and smartphones are becoming popular for research [31], a contemporary survey of “off-the-shelf” commercially available units guided the selection of the Sony recorder [32]. Subjects were requested to speak at their natural pace without any practice runs, so sampling duration varied slightly depending on the rate of speech of the subject. File names of recordings were anonymized by subject numerical code.

2.3. Signal Processing and Feature Extraction

Minimal signal processing was required in this study and primarily involved file preprocessing that included signal normalization to standardize amplitude levels, trimming of silent segments to remove non-speech intervals, and conversion of each recording into a monophonic waveform for consistent analysis.
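The three preprocessing steps described above can be sketched with NumPy alone. This is an illustration under stated assumptions, not the team's workflow: the function name, the -40 dB silence threshold, and the choice to trim only leading and trailing silence are all hypothetical, and the input here is a synthetic tone rather than a study recording.

```python
import numpy as np

def preprocess(signal, silence_db=-40.0):
    """Downmix to mono, peak-normalize, and trim leading/trailing silence.

    silence_db is a hypothetical threshold (dB relative to peak), not a study setting.
    """
    x = np.asarray(signal, dtype=float)
    if x.ndim == 2:                       # shape (samples, channels): average the channels
        x = x.mean(axis=1)
    peak = np.max(np.abs(x))
    if peak == 0:
        return x                          # all-silent input: nothing to normalize or trim
    x = x / peak                          # peak amplitude becomes 1.0
    threshold = 10.0 ** (silence_db / 20.0)
    loud = np.flatnonzero(np.abs(x) >= threshold)
    return x[loud[0]:loud[-1] + 1]        # keep the span from first to last loud sample

# Synthetic stereo example: 0.1 s silence, 0.5 s tone, 0.1 s silence at 16 kHz
sr = 16000
t = np.arange(int(0.5 * sr)) / sr
tone = 0.3 * np.sin(2 * np.pi * 200 * t)
mono = np.concatenate([np.zeros(1600), tone, np.zeros(1600)])
stereo = np.stack([mono, mono], axis=1)

clean = preprocess(stereo)
print(clean.shape, np.max(np.abs(clean)))
```

Removing pauses inside an utterance would need a frame-based energy detector rather than this edge trim.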

Feature extraction from the pre-processed voice recordings for the core acoustic measurements outlined in Section 1.1 was performed using Python scripting, Microsoft Excel, Praat software [33], and the parselmouth.praat, librosa, and MANOVA Python modules, integrated into a Python workflow developed by the research team. While other analysis packages are available for research, Praat remains a common standard for voice analysis [34], though results may vary slightly depending on the software used [35].

While the Python routines generally provided ready-to-use acoustic measurements, fundamental frequency F0 metrics required extracting a pitch object from each recording to calculate values across the entire sound. Minimum and maximum F0 values were estimated using pitch floor and ceiling settings combined with parabolic interpolation for improved accuracy. For clinical applications requiring higher precision, more exhaustive reviews of F0 estimation methods are available [36].
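The role of the pitch floor, the pitch ceiling, and parabolic interpolation can be illustrated with a simple autocorrelation estimator. This sketch is a stand-in for, not a reproduction of, the Praat pitch algorithm used in the study; the function name and the 75 Hz and 600 Hz limits are assumptions chosen for illustration, and the input is a synthetic tone.

```python
import numpy as np

def estimate_f0(frame, sr, floor_hz=75.0, ceiling_hz=600.0):
    """Estimate F0 of one voiced frame from its autocorrelation peak.

    The lag search is limited to [sr / ceiling_hz, sr / floor_hz], and the peak
    position is refined with a parabola through three neighbouring lags.
    """
    x = np.asarray(frame, dtype=float)
    x = x - x.mean()
    ac = np.correlate(x, x, mode="full")[len(x) - 1:]   # non-negative lags only
    ac = ac / np.arange(len(x), 0, -1)                   # divide lag k by (N - k) terms
    lag_min = int(np.ceil(sr / ceiling_hz))
    lag_max = min(int(np.floor(sr / floor_hz)), len(ac) - 2)
    k = lag_min + int(np.argmax(ac[lag_min:lag_max + 1]))
    # Parabolic interpolation around the discrete peak
    a, b, c = ac[k - 1], ac[k], ac[k + 1]
    denom = a - 2.0 * b + c
    shift = 0.5 * (a - c) / denom if denom != 0 else 0.0
    return sr / (k + shift)

sr = 16000
t = np.arange(int(0.05 * sr)) / sr           # 50 ms synthetic frame
frame = np.sin(2 * np.pi * 117.0 * t)        # pure tone at 117 Hz
f0 = estimate_f0(frame, sr)
print(f"{f0:.1f} Hz")
```

Narrowing the floor and ceiling to the expected range of a speaker reduces octave errors, which is why these settings are usually chosen per cohort.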

Formant analysis was conducted using Burg’s method, a linear predictive coding technique featured in the Praat software. Parameters included a 25 ms time step and window length, a maximum number of 5 formants, and frequency limits appropriate for adult male and female voices. Pre-emphasis from 50 Hz upward was applied to enhance high-frequency components. A Formant object was generated to track time-varying formant trajectories, and the first three formants (F₁, F₂, F₃) were sampled at the temporal midpoint of each sound file and reported in Hertz. This comprehensive process ensured accurate extraction of both spectral and temporal features for subsequent statistical analysis.
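The idea behind LPC-based formant tracking is that the angles of the poles of the fitted all-pole model correspond to resonance frequencies. The sketch below uses the autocorrelation method (normal equations), which is a simpler relative of Burg's method and is not the Praat implementation; all function names, the model order, and the synthetic resonances at 700 Hz and 1200 Hz are hypothetical.

```python
import numpy as np

def lpc_resonances(x, sr, order):
    """Resonance frequencies (Hz) from the pole angles of an LPC model.

    Uses the autocorrelation method (normal equations), not Burg's recursion.
    """
    x = np.asarray(x, dtype=float) * np.hamming(len(x))
    r = np.array([np.dot(x[:len(x) - k], x[k:]) for k in range(order + 1)])
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    a = np.linalg.solve(R, r[1:])                    # predictor coefficients a_1..a_order
    poles = np.roots(np.concatenate(([1.0], -a)))
    poles = poles[poles.imag > 0]                    # one pole from each conjugate pair
    return np.sort(np.angle(poles)) * sr / (2.0 * np.pi)

def synth_resonances(freqs, bw, sr, n, seed=0):
    """Noise-excited all-pole signal with the given resonances (synthetic input)."""
    A = np.array([1.0])
    for f in freqs:
        rad = np.exp(-np.pi * bw / sr)               # pole radius from bandwidth
        A = np.polymul(A, [1.0, -2.0 * rad * np.cos(2 * np.pi * f / sr), rad * rad])
    e = np.random.default_rng(seed).standard_normal(n)
    y = np.zeros(n)
    for i in range(n):
        acc = e[i]
        for k in range(1, len(A)):
            if i - k >= 0:
                acc -= A[k] * y[i - k]
        y[i] = acc
    return y

sr = 10000
signal = synth_resonances([700.0, 1200.0], bw=80.0, sr=sr, n=10000)
est = lpc_resonances(signal, sr, order=4)
print(np.round(est))
```

Real speech needs a higher order (roughly two coefficients per expected formant plus a few extra) and pre-emphasis, which is what the Praat settings described above provide.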

Feature extraction also included Cepstral Peak Prominence (CPP), Maximum Phonation Time (MPT), and Mel-frequency Cepstral Coefficients (MFCCs). CPP is a robust indicator of voice quality and periodicity and was calculated using Python routines integrating Parselmouth (Praat) and Librosa. Librosa was also utilized to compute MPT for vocal endurance.

In this study, we computed Mel-frequency Cepstral Coefficients (MFCCs) using Praat software. MFCCs are widely used in speech recognition [37,38] because they compactly represent the vocal tract's spectral envelope [39,40]. By mapping frequencies to the non-linear Mel scale—which mimics human auditory perception—MFCCs isolate linguistic content while minimizing background noise [38]. Although influenced by formants, MFCCs are distinct parameters [41] capturing the short-term spectral "fingerprints" required for signal classification.
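The non-linear Mel mapping mentioned above can be illustrated with one widely used conversion formula, mel = 2595 log10(1 + f/700). This is an assumption for illustration only: several Mel variants exist, and the one inside Praat may differ. The helper names and the 0 to 8000 Hz range are hypothetical.

```python
import numpy as np

def hz_to_mel(f_hz):
    """Common Mel conversion; other variants exist."""
    return 2595.0 * np.log10(1.0 + np.asarray(f_hz, dtype=float) / 700.0)

def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (np.asarray(mel, dtype=float) / 2595.0) - 1.0)

def mel_centres(f_min, f_max, n_filters):
    """Centre frequencies (Hz) of n_filters bands spaced evenly on the Mel scale."""
    edges = np.linspace(hz_to_mel(f_min), hz_to_mel(f_max), n_filters + 2)
    return mel_to_hz(edges[1:-1])

centres = mel_centres(0.0, 8000.0, 12)
print(np.round(centres))
```

The centre frequencies crowd together at low frequencies and spread out at high frequencies, which is the perceptual weighting that the filterbank stage of an MFCC pipeline relies on.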

2.4. Multivariate and Univariate Statistics

Natural variability in voice measures such as F0, perturbation, and spectral features introduces the risk of Type I error in hypothesis testing. Such errors can incorrectly suggest that an observed acoustic difference reflects a true physiological or articulatory effect, producing false positives. To mitigate Type I error, this work employs both univariate and multivariate analyses.

Single-parameter analyses were performed using the Microsoft Excel Data Analysis Toolpak. Hypothesis testing involved (a) comparing the t-Statistic with t-Critical and (b) confirming the outcome by evaluating the two-tailed p-value (“P(T≤t) two-tail”) against the significance level (α). If p < α, H₀ is rejected; otherwise, it is retained. This complementary approach ensures clarity and reliability in statistical decision-making. For example, Figure A1 illustrates that at the chosen significance level α=0.05, the critical t-value was 1.997, while the calculated t-Statistic was 1.85. Because |t-Statistic| < t-Critical, we fail to reject H₀, indicating no statistically significant difference between data sets; observed variation is likely random rather than a true effect.
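The decision rule in the worked example can be written out in a few lines. The sketch assumes an equal-variance (pooled) two-sample test, which may or may not be the Toolpak option used in the study; the function names and the two small samples are hypothetical, while the values 1.85 and 1.997 are the ones quoted in the text.

```python
from math import sqrt
from statistics import mean, variance

def pooled_t(sample_a, sample_b):
    """Two-sample t statistic assuming equal variances; returns (t, degrees of freedom)."""
    na, nb = len(sample_a), len(sample_b)
    df = na + nb - 2
    pooled_var = ((na - 1) * variance(sample_a) + (nb - 1) * variance(sample_b)) / df
    t = (mean(sample_a) - mean(sample_b)) / sqrt(pooled_var * (1 / na + 1 / nb))
    return t, df

def decide(t_stat, t_crit):
    """Two-tailed rule: reject H0 only when |t| exceeds the critical value."""
    return "reject H0" if abs(t_stat) > t_crit else "fail to reject H0"

# Values quoted in the text for alpha = 0.05
decision = decide(t_stat=1.85, t_crit=1.997)

# Hypothetical samples, for illustration only
t_demo, df_demo = pooled_t([1, 2, 3, 4, 5], [2, 3, 4, 5, 6])
print(decision, t_demo, df_demo)
```

The p-value check in step (b) leads to the same decision, because p < α holds exactly when |t| exceeds the two-tailed critical value.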

Multivariate analysis of variance (MANOVA) treats correlated acoustic features jointly, reducing the family-wise error rate by minimizing the number of statistical tests performed. This approach is particularly effective for physiologically related features, such as the combination of jitter, shimmer, and CPP. Using the statsmodels library in Python, we implemented MANOVA to generate four standard multivariate metrics, of which Wilks’ Lambda serves as the primary evaluative statistic for the binary group comparisons (e.g., Male vs. Female) common in this study. Wilks’ Lambda (Λ) represents the proportion of variance not explained by the grouping variable; a Λ value approaching 0 indicates high explanatory power, while a Λ value near 1 suggests no effect. To determine significance, the Lambda value is converted to an F statistic, which provides a standard p-value for the overall model.
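For a two-group comparison, Wilks' Lambda and its F conversion can be computed directly, which helps when checking reported values by hand. The sketch below is a NumPy illustration and not the statsmodels routine used in the study; the function names and the toy data are hypothetical. It uses Λ = det(E) / det(E + H), where E is the within-group and H the between-group sums-of-squares-and-cross-products matrix, and the two-group relation F = ((1 - Λ) / Λ) · (df2 / p) with degrees of freedom (p, df2).

```python
import numpy as np

def wilks_lambda(group_a, group_b):
    """Wilks' Lambda = det(E) / det(E + H) for two groups of multivariate observations."""
    a = np.asarray(group_a, dtype=float)
    b = np.asarray(group_b, dtype=float)
    both = np.vstack([a, b])
    # Within-group (error) sums of squares and cross-products
    E = sum((g - g.mean(axis=0)).T @ (g - g.mean(axis=0)) for g in (a, b))
    # Total SSCP about the grand mean, equal to E + H
    T = (both - both.mean(axis=0)).T @ (both - both.mean(axis=0))
    return np.linalg.det(E) / np.linalg.det(T)

def two_group_f(lam, p, df2):
    """F statistic for a two-group Wilks' Lambda with p variables and df2 error df."""
    return (1.0 - lam) / lam * df2 / p

# Toy data (hypothetical): two variables, four observations per group
a = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 5.0], [4.0, 3.0]])
b = a + np.array([10.0, 0.0])            # same spread, shifted mean
lam_same = wilks_lambda(a, a)            # identical groups: no group effect
lam_shift = wilks_lambda(a, b)           # well-separated groups: small Lambda

# Lambda and degrees of freedom (6, 9) as reported for the first column of Table 4
f_table = two_group_f(0.1152, p=6, df2=9)
print(lam_same, lam_shift, round(f_table, 2))
```

Applied to Λ = 0.1152 with (6, 9) degrees of freedom, the conversion gives F of about 11.5, in line with the first column of Table 4.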

2.5. Intensity Profiles

Voice intensity profiles provide evidence of an involuntary association between variations in vocal intensity and a speaker’s underlying emotional or cognitive state. Changes in amplitude and energy, for example, often signal heightened emotions such as anger, fear, or excitement, which are typically characterized by elevated mean intensity levels compared to more subdued affective states such as sadness or contentment. Deviations from normative intensity measures may also serve as clinical indicators of medical conditions, including Parkinson’s disease or (in our case) laryngeal pathologies, both of which compromise vocal fold function and respiratory support. Diminished variability in intensity and the presence of prolonged pauses are frequently observed under conditions of increased cognitive load or mental fatigue. In the current work, intensity profiles were developed from the Praat software.
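An intensity profile is essentially a frame-by-frame level contour. The sketch below computes the root-mean-square level of short frames in decibels relative to an arbitrary reference; it is an illustration only and does not reproduce the Praat intensity algorithm, which uses its own windowing and a sound-pressure reference. The function name, frame length, hop, and reference value are assumptions.

```python
import numpy as np

def intensity_contour_db(x, sr, frame_s=0.025, hop_s=0.010, ref=1.0):
    """Return (times, levels): RMS level of each frame in dB relative to ref."""
    x = np.asarray(x, dtype=float)
    n = int(round(frame_s * sr))
    hop = int(round(hop_s * sr))
    starts = np.arange(0, len(x) - n + 1, hop)
    rms = np.array([np.sqrt(np.mean(x[s:s + n] ** 2)) for s in starts])
    levels = 20.0 * np.log10(np.maximum(rms, 1e-12) / ref)   # floor avoids log(0)
    return starts / sr, levels

sr = 16000
t = np.arange(sr) / sr                      # 1 s synthetic tone
x = np.sin(2 * np.pi * 200.0 * t)           # unit-amplitude sine
times, levels = intensity_contour_db(x, sr)
print(len(levels), round(float(levels[0]), 2))
```

A steady tone yields a flat contour, so irregularity in the contour of a sustained vowel reflects instability in the voice rather than in the measurement.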

2.6. Spectrograms

Spectrograms were obtained with the Praat software, which converts raw audio signals into time–frequency representations. This enables visualization of spectral variation that is not apparent in amplitude-based waveforms. Whereas waveforms depict only changes in signal amplitude, spectrograms decompose the acoustic signal into its constituent frequencies, allowing formant structures and other spectral features to be visually identified. These representations support the differentiation of vowels and consonants and aid in detecting artifacts such as background hums, clicks, or other non-speech events. Although interpretation of spectrograms requires domain-specific expertise, they provide a valuable exploratory tool for isolating speech impediments and assessing signal quality during preprocessing and analysis.
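The time–frequency decomposition behind a spectrogram is a short-time Fourier transform. The following NumPy sketch shows the principle on a synthetic tone; it is not the Praat spectrogram routine, and the function name, window type, FFT size, and hop length are assumptions.

```python
import numpy as np

def spectrogram_db(x, sr, n_fft=512, hop=128):
    """Return (times, freqs, S_db), where S_db[i, j] is the level of bin j in frame i."""
    x = np.asarray(x, dtype=float)
    window = np.hanning(n_fft)
    starts = np.arange(0, len(x) - n_fft + 1, hop)
    frames = np.stack([x[s:s + n_fft] * window for s in starts])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    S_db = 10.0 * np.log10(power + 1e-12)            # small offset avoids log(0)
    times = (starts + n_fft / 2) / sr                # frame centres in seconds
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / sr)
    return times, freqs, S_db

sr = 16000
t = np.arange(sr) / sr                               # 1 s synthetic tone
x = np.sin(2 * np.pi * 1000.0 * t)
times, freqs, S_db = spectrogram_db(x, sr)
peak_hz = freqs[np.argmax(S_db.mean(axis=0))]
print(S_db.shape, peak_hz)
```

A longer window sharpens frequency resolution at the cost of time resolution, which is the trade-off behind wide-band and narrow-band spectrogram settings.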

2.7. Saarbrücken Voice Database for Laryngitis Case Study

A systematic search of institutional and publicly available voice repositories was conducted to identify suitable acoustic datasets. Resources varied in sample size, data type (e.g., sustained vowels, sentences, continuous speech), and access permissions, with licensing models often limiting reproducibility. Projects such as PhysioNet [42,43] provide healthy voice samples via credentialed access, while many open-access databases feature small cohorts or narrow pathological focus. Following comparison, the Saarbrücken Voice Database (SVD) [10] was selected as the most comprehensive, fully open-access resource for our laryngitis case study. The SVD offers clinically annotated recordings of both normal and pathological voices, including sustained vowels and standardized sentences. Its laryngitis subset provides controlled pathological phonation, enabling systematic comparison with healthy voices and consistent measurement of fundamental frequency, perturbation, and formant patterns. From the SVD's laryngitis dataset, we selected 18 male subjects (aged 50–60) to ensure demographic consistency.

Although cross-linguistic differences may challenge comparisons between German and English datasets, prior research indicates that basic vowel parameters in healthy adults show notable similarity across languages [13,44,45,46]. These characteristics support the use of the SVD laryngitis subset for validating digital signal processing workflows and benchmarking feature extraction pipelines. Laryngitis is characterized by inflammation and edema of the vocal folds, which increase their effective mass and reduce their capacity for regular vibration. These physiological changes manifest as distinct visual and statistical artifacts in the fundamental frequency (F0) contour, which in this work is computed with Parselmouth (Praat).

3. Results

Results for ten voice statistics across three demographic cohorts (gender, health, pathology) produced a large number of datasets, which are provided in the Supplementary data files. For brevity, only key summary tables are presented in this Results section.

3.1. Healthy Subject Results: Vowel Pronunciation

A key research objective was to identify normative acoustic characteristics of sustained vowel phonations and short phrase productions in healthy young adults. As shown in Table 2, the three healthy subject cohorts from the CWRU Voice Study (CVS) and the Saarbrücken Voice Database (SVD) were very close in age. Key voice parameters for the pronunciation of a vowel for each cohort are also shown in Table 2.

Fundamental Frequency, F0, correlates directly with vocal pitch. As F0 increases (faster vocal fold vibration), pitch sounds higher, and as it decreases, pitch sounds lower. It is considered the physical measurement of what the auditory system constructs as pitch. In healthy young adults, F0 exhibits clear gender-dependent differences, with females typically having a higher F0 than males [29]. Research also indicates that F0 can show increases or decreases in young adults when subjected to increased cognitive load, suggesting a link between mental stressors and vocal output [30]. While there are some common frameworks for collecting data among specific organizations (NIST guidelines for speaker recognition and IEEE/ASA standards for some acoustic measurements) there is no single set of universal standards for collecting and recording voice analytics. Thus, it is to be expected that values of F0 vary widely. By way of comparison, Table 3 illustrates F0 from other research efforts.

The mean F0 values from Table 2, 215.8 Hz (SD 26.2) for CVS healthy females and 116.93 Hz (SD 26.8) for CVS healthy males, are broadly consistent with the literature values in Table 3. While Table 2 suggests subtle similarities in acoustic parameters, visual inspection alone is insufficient to confirm significant differences. Samples of minimum and maximum F0 echo this trend (Table S16). Consequently, both univariate and multivariate analyses were performed. Supplementary Tables S3–S8 illustrate that univariate comparisons across vowel pronunciations fail to capture the holistic variance between cohorts. A more robust assessment is a one-way multivariate analysis of variance (MANOVA), conducted using the following 6 parameters: F0, Jitter, Shimmer, HNR, Intensity, and CPP.

Table 4. Summary of multivariate outcomes for all healthy subjects pronouncing the vowel “a”.

	CVS Male vs CVS Female	CVS Male vs SVD Male	CVS Female vs SVD Male
Wilks’ Λ	0.1152	0.0672	0.1591
F (6,9)	11.516	27.752	10.576
Pr>F (p-value)	0.0009	< 0.0001	0.0003

In all cases the p-value is less than 0.001, so there is strong evidence of a statistically significant difference between the groups compared: the “vocal profile” (the combination of F0, Jitter, Shimmer, HNR, Intensity, and CPP) is distinct for each group. Further, Wilks’ Λ is very low in each case, suggesting that much of the variance in the voice can be explained simply by knowing whether the speaker is male or female and, in the CVS versus SVD comparison, whether the voice is domestic or international. Given that the MANOVA was significant and indicated that the groups differ, we examined the individual parameters with ANOVA and found that F0 was the primary driver of the difference, followed by F3, shimmer, and jitter.

The F3 formant has been suggested as generally the best formant to compare between test subjects for speaker identification [42]. In contrast, F1 and F2 primarily define vowel quality (which specific vowel sound is being made) [42,43].

3.2. Healthy Subject Results: Phrase Pronunciation

Following the procedure outlined in the previous section, the results for the pronunciation of a phrase are summarized in Table 5. As mentioned above, the univariate t-tests of Tables S3–S8 for pronunciation of a phrase do not provide a clear pattern of variance between cohorts. A multivariate analysis of variance was conducted for F0, Jitter, Shimmer, HNR, Intensity, and CPP, and the results are shown in Table 6. Wilks’ Λ is very low in each case, once again suggesting that the variance in the voice for the pronunciation of a phrase is similar to that for a vowel and can be explained simply by knowing the gender of the speaker.

We also explored the difference between the pronunciation of a vowel and a phrase by the same speakers. For the CVS healthy females, Wilks’ Λ = 0.12, F(6,11) = 13.92, and p < 0.001 indicate a significant difference between the data sets. For the CVS healthy males, Wilks’ Λ = 0.07, F(6,11) = 29.53, and p < 0.001 again indicate a significant difference between the data sets.

3.3. Comparison of Healthy Subjects and SVD Subject with Laryngitis

The multivariate analysis presented in Table 7, derived from the supplemental data in Tables S7 and S8, reveals a contrast in data distribution. While the comparison between healthy CVS subjects and SVD subjects with laryngitis suggests two entirely independent and statistically divergent datasets, the comparison between healthy SVD subjects and SVD subjects with laryngitis shows a surprising degree of overlap. This suggests that, although laryngitis significantly alters vocal biomarkers, the underlying SVD classification maintains a level of data consistency; the univariate analysis (Tables S9–S12) indicates that F3, jitter, and shimmer tend to drive the variance.

Results for Wilks’ Λ from Table 7 and Table 8 reveal similar patterns. The Λ statistic for SVD laryngitis data compared to CVS male and CVS female suggests that data set differences are primarily attributable to gender. In contrast, the comparison of SVD males with and without laryngitis produces a Λ very close to 1; here gender cannot account for the difference, which by implication is attributable to the presence of pathology.

3.4. F0 Contour Plots

As noted previously, laryngitis is characterized by inflammation and edema of the vocal folds, which manifest as distinct visual and statistical artifacts in the fundamental frequency (F0) contour. Figure A2 presents the F0 contours for a healthy male, a healthy female, and a male subject with laryngitis. Consistent with our earlier calculations, the mean F0 for the healthy female was 215.8 Hz, for the healthy male 116.9 Hz, and for the male with laryngitis 133.5 Hz. The observed contours align with prior findings, reinforcing the expected deviations in periodicity associated with vocal fold pathology.

3.5. Intensity Profiles

Laryngeal pathologies can alter normative intensity patterns, and deviations in these contours may serve as qualitative indicators of impaired vocal fold function or reduced respiratory support. Representative intensity trajectories from the Praat software for a healthy male, a healthy female, and a male subject with laryngitis are shown in Figure A3. Although no quantitative intensity metrics were derived, the plots still provide meaningful visual evidence: the laryngitis contour exhibits a distinctly irregular and unstable profile compared with healthy speakers, underscoring the potential diagnostic value of intensity patterns in voice analytics.

3.6. Spectrograms

Typical spectrograms produced with the Praat software for a healthy male, a healthy female, and a male subject with laryngitis are shown in Figure A4. These time–frequency representations highlight differences in formant structure, harmonic organization, and spectral stability that are not evident from waveform inspection alone. An alternate view is provided in Figure A5, where the calculated formants (F1–F5) are superimposed on a grayscale spectrogram to illustrate how vocal tract resonances align with the underlying spectral energy patterns.

3.7. MFCC

MFCCs were computed using Praat software, as described in Section 2.3. Figure 6 illustrates MFCCs C2–C12, highlighting the spectral variations between healthy and pathological voice samples. Due to their robustness, MFCCs are a standard feature set in machine learning frameworks [47].

4. Discussion

Our discussion of results is structured around our three key research questions outlined in the Introduction.

What are the normative acoustic characteristics of sustained vowel phonations and short phrase productions in healthy young adults?
How do acoustic profiles of healthy young adults diverge from those of laryngitis patients in the Saarbrücken Voice Database?
How can paired vowel and phrase data enhance the robustness of voice outcome measures collected in non-clinical settings?

4.1. Acoustical Outcomes and Significance

Analysis of the data sets yields results for sustained vowel phonations and short phrase productions that are broadly similar to the foundational "vocal profiles" reported in the literature for healthy young adults [10,21,44,46,48,49,50,51,52,53]. Because standards for collecting and recording voice data are varied and still emerging, inherent variability makes it difficult to confirm the precision of such comparisons.

The acoustic parameters captured for vowel pronunciation—specifically F0, F3, Jitter, Shimmer, HNR, Intensity, and CPP—provide a comprehensive baseline for the cohorts studied. The mean F0 values for the CVS cohorts (215.8 Hz for females; 116.9 Hz for males) align closely with established literature averages of 220.7 Hz and 124.8 Hz, respectively. These results reinforce the status of F0 as a physical measurement of perceived pitch and its reliable gender-dependent nature in healthy populations.

Visual inspection of individual parameters and univariate t-tests proved insufficient for capturing the holistic variance between groups. A one-way MANOVA was essential to demonstrate that the combination of acoustic variables forms a distinct vocal profile. While the differences found between genders might be viewed as intuitive, the statistical significance (p < 0.001) and low Wilks’ Λ values provide empirical evidence that gender and cohort origin are the primary drivers of vocal variance. Furthermore, F0 was identified as the primary driver of these differences, followed by F3, shimmer, and jitter.
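For readers unfamiliar with the statistic, the sketch below shows one standard way to compute Wilks' Λ for a one-way MANOVA: the determinant of the within-group scatter matrix divided by the determinant of the total scatter matrix. Values near 0 indicate strong group separation, and values near 1 indicate none. This is a hedged illustration in plain Python, not the analysis pipeline used in this study; the function names are hypothetical and the two-feature values in the demo are invented toy numbers, not study data.

```python
def mean_vector(rows):
    """Column-wise mean of a list of equal-length feature rows."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def scatter(rows, centre):
    """Sum of outer products of deviations from `centre` (a p x p matrix)."""
    p = len(centre)
    s = [[0.0] * p for _ in range(p)]
    for row in rows:
        d = [x - c for x, c in zip(row, centre)]
        for i in range(p):
            for j in range(p):
                s[i][j] += d[i] * d[j]
    return s


def determinant(matrix):
    """Determinant via Gaussian elimination with partial pivoting."""
    a = [row[:] for row in matrix]
    n = len(a)
    det = 1.0
    for k in range(n):
        pivot = max(range(k, n), key=lambda r: abs(a[r][k]))
        if abs(a[pivot][k]) < 1e-12:
            return 0.0
        if pivot != k:
            a[k], a[pivot] = a[pivot], a[k]
            det = -det
        det *= a[k][k]
        for r in range(k + 1, n):
            factor = a[r][k] / a[k][k]
            for c in range(k, n):
                a[r][c] -= factor * a[k][c]
    return det


def wilks_lambda(groups):
    """Wilks' Lambda = det(within-group scatter) / det(total scatter)."""
    all_rows = [row for group in groups for row in group]
    grand_mean = mean_vector(all_rows)
    p = len(grand_mean)
    within = [[0.0] * p for _ in range(p)]
    for group in groups:
        s = scatter(group, mean_vector(group))
        for i in range(p):
            for j in range(p):
                within[i][j] += s[i][j]
    total = scatter(all_rows, grand_mean)
    return determinant(within) / determinant(total)


if __name__ == "__main__":
    # Invented toy rows of (F0 in Hz, HNR); NOT data from this study.
    group_a = [[118.0, 10.1], [115.0, 11.0], [121.0, 10.4], [117.0, 9.8]]
    group_b = [[130.0, 19.0], [133.0, 19.8], [129.0, 20.1], [134.0, 18.9]]
    print(wilks_lambda([group_a, group_b]))
```

In practice a statistics package would also convert Λ into an approximate F-statistic and p-value; that step is omitted here.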

A finding of this study is the statistically significant distinction between the domestic (CVS) and international (SVD) data sets.

Vocal Profile Distinctiveness: The MANOVA results for CVS Male vs. SVD Male yielded a Wilks’ Λ of 0.0672 (p < 0.0001), indicating that the vocal profiles are significantly different.
F0 Variations: In vowel pronunciation, the SVD cohort (20.3 years old) exhibited a mean F0 of 131.78 Hz, which is notably higher than the CVS Male mean of 116.93 Hz.
HNR and Stability: The SVD cohort showed a higher Harmonics-to-Noise Ratio (HNR) of 19.409 compared to the CVS Male (10.477) and CVS Female (13.581).
Duration and Sustention: A marked difference was observed in Maximum Phonation Time (MPT); the SVD cohort averaged 1.353 seconds for vowel pronunciation, while CVS cohorts remained at or below 0.500 seconds.
Phrase Production: Similar trends persisted in phrase productions, where the SVD cohort maintained a distinct profile from the CVS cohorts (p < 0.0001), suggesting that geographic or linguistic factors may influence normative "healthy" values.

Beyond the fundamental frequency, F3 was highlighted as a key variable for distinguishing between subjects, consistent with its use in speaker identification. In contrast, F1 and F2 remained primarily responsible for defining the specific vowel quality rather than cohort differences. The significant differences found between the pronunciation of a vowel and a phrase for the same individual (p < 0.001) further underscore that normative values must be context-specific to the type of vocalization.

Identifying a robust baseline for HNR is a challenge. While our results indicate a difference in HNR based on gender, other research does not [46]. Age is conventionally believed to lower HNR, but even in the SVD male laryngitis cohort, the lower HNR may be attributable in part to medications taken by the many elderly subjects.

Because no universally accepted gold standard exists for HNR, direct comparative assessment is not possible. While our observed results appear to be consistent with prior research reports, it remains challenging to support HNR acceptability and relevance within the broader context of voice quality evaluation.
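As a point of orientation for interpreting HNR magnitudes, the sketch below uses one common autocorrelation-based definition, in which HNR in dB is 10·log10(r / (1 − r)) and r is the normalized autocorrelation peak at the pitch period. This is an assumption for illustration: the text above does not state the units or the exact algorithm settings used, other HNR definitions exist, and the function names are hypothetical.

```python
import math


def hnr_db(r_max: float) -> float:
    """HNR in dB from the normalized autocorrelation peak r_max (0 < r_max < 1)."""
    if not 0.0 < r_max < 1.0:
        raise ValueError("r_max must lie strictly between 0 and 1")
    return 10.0 * math.log10(r_max / (1.0 - r_max))


def periodic_fraction(hnr: float) -> float:
    """Fraction of signal energy in the periodic part implied by an HNR in dB."""
    ratio = 10.0 ** (hnr / 10.0)
    return ratio / (1.0 + ratio)


if __name__ == "__main__":
    # HNR values quoted in Section 4.1, assumed here to be in dB.
    for label, value in [("SVD", 19.409), ("CVS Male", 10.477), ("CVS Female", 13.581)]:
        print(label, round(periodic_fraction(value), 4))
```

Under this assumed definition, an HNR of 19.409 dB corresponds to roughly 99% of the energy being periodic, whereas 10.477 dB corresponds to roughly 92%, which illustrates how a seemingly large gap in dB can reflect a modest difference in noise energy that recording conditions alone might plausibly produce.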

4.2. Comparative Healthy vs. Pathological Vocal Characteristics

The comparison between healthy young adults and individuals with acute laryngitis highlights two facets of acoustic differences: geographic baseline variation and actual pathology. Our results indicate that, as might be expected, the CVS (healthy American) and SVD (German) cohorts differ significantly even in the absence of pathology. MANOVA results confirm robust cross-cohort separation (Wilks’ Λ = 0.0672 for males; 0.1308 for females), driven primarily by fundamental frequency, with CVS males averaging 116.93 Hz versus 131.78 Hz in SVD males. Additional discrepancies, such as the higher HNR (19.409) and longer MPT (1.353 s vs. 0.500 s) in healthy SVD males, point to the linguistic, cultural, and recording-environment factors at play in establishing distinct “normative” acoustic baselines. This baseline bias must be accounted for before interpreting voice deviations as pathological.

The divergence between healthy and pathological voices reflects our use of two databases of different origin. Within the SVD cohort, acute laryngitis produces a consistently strong multivariate signature, with very low Wilks’ Lambda values, high F-statistics, and p < 0.001 across all models. Pathology in this group is evident as clear deviations in jitter, shimmer, HNR, and related measures, yielding a robust and easily identifiable “laryngitis profile.” In contrast, the CVS (American) data do not exhibit the same uniform pattern, as some pathological markers appear muted or inconsistent relative to their SVD (German) counterparts. This asymmetry suggests that the acoustic expression of laryngitis is not universally constant but is filtered through each cohort’s baseline phonatory characteristics. As a result, even though F0 and F3 remain influential contributors to group separation, the absence of standardized recording and linguistic conditions complicates identifying pathological traits without calibration.

4.3. Paired Vowel and Phrase Data

Combining paired vowel and phrase data improves voice assessments by anchoring dynamic speech to a stable physiological baseline. While F1 and F2 define specific vowel sounds, isolated vowels act as "steady-state" snapshots because F3 is shaped primarily by fixed vocal-tract anatomy. Research suggests that F3 is the most reliable parameter for identifying individual speakers across diverse groups. This anatomical stability makes vowel phonation a sensitive marker for pathology, as significant shifts in F3 likely reflect structural abnormalities rather than simple linguistic choices. In non-clinical environments where recording standards are absent, this invariance provides an interesting calibration opportunity. By establishing a speaker-specific "normal" through vowel production, clinicians can more accurately evaluate the holistic variance and "vocal profile" found in complex connected speech.

Published formant values vary by demographics and often exclude higher-order formants (F4, F5), limiting direct comparison [63,64]. As shown in Table 9, current findings (CVS) generally align with the range of selected prior studies. Notably, F3 remained relatively stable, contrasting with the expected variability of F1 and F2.

Connected speech captures the physiological demands of real communication—prosody, coarticulation, articulation rate, and respiratory-phonatory coordination. These factors introduce variability that simple statistics often fail to resolve, as reflected in the inconsistent t-test outcomes across CVS and SVD comparisons. Measures such as CPP and MFCCs reliably detect pathology in phrases, but traditional metrics (F0, jitter, shimmer) frequently fail to reject the null hypothesis because they are low-dimensional and highly context-dependent. Yet Figure A5 demonstrates that phrase-level formants still trace back to the vowel’s anatomical baseline: the trajectory of F3 during connected speech remains anchored to the value established in the isolated vowel. This means phrases reveal how the vocal system performs under load, while vowels reveal the system’s structural constraints. Together, they expose discrepancies—such as reduced CPP or altered spectral shape—that may be masked when either task is used alone.

While Cepstral Peak Prominence (CPP) applies to both sustained phonation and continuous speech, methodological variations can complicate age discrimination [65]. Consistent with prior literature [66], Table 10 shows lower CPP values in continuous speech than in sustained vowels. Furthermore, our findings confirm that vocal pathology reduces CPP.

Cepstral peak prominence (CPP) quantifies the overall level of noise present in the vocal signal and correlates with the auditory perception of overall voice quality. This measure is particularly valuable because it is robust across all types of voice signals [67], making it preferable over traditional perturbation measures for assessing voice quality in various vocal conditions. Healthy, normal voices are typically characterized by higher CPP values. Studies have also observed increased CPP measures in healthy young speakers under cognitive loading, further highlighting its sensitivity to physiological responses.

Figure 6 illustrates MFCCs C2–C12, demonstrating distinct spectral variations between healthy and pathological voice samples. This visualization enhances the separation between the two categories, highlighting subtle acoustic differences that are often obscured in traditional displays. This suggests that comparing subjects across different demographics (countries) requires analysis more granular than simple descriptive statistics.

4.4. Limitations and Challenges

This study is subject to specific limitations related to the nature of voice samples:

In the current study, data collection was conducted in a university conference room situated adjacent to a high-traffic public area. Background noise adds random energy that Praat interprets as "perturbation," thus increasing jitter (to 0.8%) and shimmer (to 5%). We acknowledge that the acoustic environment was not clinically isolated; consistent with prior literature, the presence of ambient noise in this setting may have artificially inflated the calculated values for jitter and shimmer.
Discriminatory Power of Acoustics: As noted in recent literature [83], high variability in vowel acoustics (even for standard vowels such as /a/) suggests that acoustic measures alone may not be fully adequate for discriminating between healthy and disordered speech without supplementary modalities. A significantly larger dataset is required to accurately predict the potentially large number of parameters.
Hardware and Environmental Variance: The reference dataset is subject to unknown variability in recording hardware and acoustic environments. These inconsistencies complicate direct comparisons with the locally collected healthy controls.
Linguistic Mismatch: There is a linguistic divergence between the pathological dataset (German) and the control group (English). While the control group included diverse accents, the fundamental phonetic differences between languages may introduce confounding variables in formant and prosodic feature extraction. Within reported results, it can be difficult to differentiate between the effects of cultural accent and the influence of emotional state [72].
Class Imbalance: The laryngitis subset comprises a relatively small fraction of the total pathological data, potentially limiting the statistical robustness of the analysis for this specific condition.
Lack of Deep Phenotyping: The utility of the dataset for supervised machine learning is constrained by a lack of detailed clinical annotation. The data lacks metadata regarding severity, duration, or clinician-confirmed perceptual scores, preventing a deeper analysis of how acoustic features correlate with disease progression.
Algorithm Performance: Our selection of the commonly used Praat software provided convenience in streamlining the analytic workflow, but research has shown that other software packages may not provide equivalent outcomes [35].
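The first limitation above, that ambient noise can inflate jitter and shimmer, can be illustrated with the common "local" perturbation definition: the mean absolute difference between consecutive cycles divided by the mean cycle value, expressed as a percentage. The sketch below is a simplified illustration under that assumed definition, not a reimplementation of the software used in this study; the function name, the 125 Hz toy voice, and the noise level are hypothetical.

```python
import random


def local_perturbation_percent(values):
    """Mean absolute cycle-to-cycle difference divided by the mean value, in percent.

    Applied to glottal periods this gives local jitter; applied to per-cycle
    peak amplitudes it gives local shimmer.
    """
    if len(values) < 2:
        raise ValueError("need at least two cycles")
    diffs = [abs(b - a) for a, b in zip(values, values[1:])]
    mean_diff = sum(diffs) / len(diffs)
    mean_value = sum(values) / len(values)
    return 100.0 * mean_diff / mean_value


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the demo is repeatable
    clean_periods = [0.008] * 200  # idealized, perfectly periodic 125 Hz voice
    # Hypothetical timing error (standard deviation 40 microseconds) standing in
    # for period-detection errors caused by background noise.
    noisy_periods = [p + rng.gauss(0.0, 0.00004) for p in clean_periods]
    print("jitter, clean (%):", local_perturbation_percent(clean_periods))
    print("jitter, noisy (%):", local_perturbation_percent(noisy_periods))
```

Even though the underlying "voice" is identical in both cases, the measured jitter rises from zero once small timing errors are introduced, which is why perturbation values from uncontrolled rooms should be compared cautiously with values from sound-treated booths.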

5. Conclusions

Voice-based approaches for emerging screening and diagnostic applications, particularly in telemedicine, often require patient recordings collected outside clinical environments. However, the variability of vocal parameters and limitations in data acquisition for the current work hindered the development of reliable predictive models for pathology assessment. There is some evidence that paired vowel and phrase data provide a stable anatomical anchor and a dynamic performance measure, potentially offering more reliable detection of vocal change in uncontrolled environments. Validating steady-state formants linked to time-varying speech features may overcome demographic and recording variability and establish a more robust framework for non-clinical voice assessment.

Supplementary Materials

The following supporting information can be downloaded at the website of this paper posted on Preprints.org.

Author Contributions

Conceptualization, T.C., S.D., C.E., E.O., W.S., P.B. and C.K.D.; methodology, T.C., S.D., C.E., P.B. and C.K.D.; software, S.D., C.E., T.C. and E.O.; formal analysis, T.C., S.D., C.E., E.O., W.S. and C.K.D.; investigation, S.D., C.E., T.C., E.O., X.Y.Z. and W.S.; resources, C.K.D.; data curation, T.C., S.D., C.E., E.O. and C.K.D.; writing – original draft preparation, T.C., S.D., C.E., E.O., W.S., P.B. and C.K.D.; writing – review and editing, P.B. and C.K.D.; supervision, C.K.D.; validation, C.K.D.; project administration, C.K.D. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

The study was conducted in accordance with the Declaration of Helsinki, and approved by the Institutional Review Board of Case Western Reserve University (protocol code CWRU Study 2023-1013, “Collection of Voice Samples to Establish a Control for Voice Analytics,” 18 January 2024).

Informed Consent Statement

Written informed consent was obtained from all subjects involved in the study prior to participation. Participants were informed about the study’s purpose, procedures, potential risks, and their right to withdraw at any time without penalty.

Data Availability Statement

The data supporting the conclusions of this article are available on reasonable request from the corresponding author.

Acknowledgments

Research reported in this publication was internally funded by the Case Western Reserve University Department of Biomedical Engineering. We are grateful to the students at Case Western Reserve University who volunteered as study participants. The authors have reviewed and edited the output and take full responsibility for the content of this publication.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

F0	Fundamental frequency
F3	Third formant frequency
Jitter	Jitter variance
Shimmer	Cycle to cycle variation in voice amplitude
HNR	Harmonics-to-noise ratio
Intensity	Energy transmitted by vocal vibrations
CPP	Cepstral Peak Prominence
MFCCs	Mel-Frequency Cepstral Coefficients
RSPL	Root-Mean-Square Sound Pressure Level
MPT	Maximum Phonation Time

Appendix A

Table A1. Summary of Typical Parameters in Voice Analytics (adapted from [44,73,74,75]).

Acoustic Parameter	Definition	Significance
Fundamental Frequency (F0)	Instantaneous vibration rate of the vocal folds (Hz)	Because human speech is dynamic, F0 changes constantly during speech to create intonation, emphasis, and emotion.
Mean Fundamental Frequency	Average rate of vocal fold vibration across a sustained sound or speech sample.	It reflects overall pitch control and can shift with inflammation, strain, or other abnormalities.
Minimum Fundamental Frequency	The lowest pitch produced in a sample.	Indicates the lower limit of vocal fold vibration and can drop with edema or impaired vibratory control.
Jitter Variance	Fundamental frequency variation over a period of time.	Measure of pitch stability; relevant to vowels, not phrases. Measured as a fraction or as a percentage.
Harmonics-to-Noise Ratio (HNR)	HNR quantifies how much harmonic (periodic) energy exists compared to noise.	Low HNR signals increased vocal irregularity and potential pathology.
Formant Frequencies (F1, F2, F3)	Formants are the resonant frequency peaks shaped by the vocal tract during speech. F1, F2, and F3 are the first three resonant frequency peaks that shape vowel sounds.	Reflect articulatory configuration and are key for distinguishing vowels and detecting pathological changes. They reveal articulatory placement and can shift in predictable ways for vocal tract disrupted by pathology.
Intensity	Refers to a physical measure as the energy transmitted by vocal vibrations.	It reflects the amplitude of vocal fold oscillations and is significant because variations in vocal loudness can indicate pathology.
Cepstral Peak Prominence (CPP)	An acoustic measure that quantifies the strength and clarity of the harmonic structure in the voice signal	It reflects vocal quality and stability, with lower CPP often linked to dysphonia or voice disorders.
Mel-Frequency Cepstral Coefficients (MFCC)	Features derived from the short-term power spectrum of speech, related to human perception of sound frequency.	MFCCs represent how humans perceive sound frequencies and are typically expressed as 13 coefficients, each representing unique vocal tract characteristics.
Root-Mean-Square Sound Pressure Level (RSPL)	Average acoustic energy of a voice signal, reflecting vocal loudness and stability over time	RSPL is significant as a biomarker because abnormal variations can indicate vocal fatigue, respiratory issues, or neurological disorders.
Maximum Phonation Time (MPT)	The longest duration a person can sustain a vowel sound on one breath.	Reflects respiratory support, vocal fold efficiency, and phonatory control, significant for assessing vocal function and detecting respiratory or laryngeal disorders.

Table A2. Comparative Strengths of Sustained Vowels vs. Connected Speech (compiled from various sources [20,26,45,76,77]).

Feature	Sustained Vowels	Connected Speech
Type of Vocal Task	Isolated phonation	Dynamic speech production
Primary Info Captured	Vocal fold vibratory stability, laryngeal function	Articulatory coordination, prosody, natural voice use
Acoustic Parameters Typically Measured	F0, cepstral peak prominence (CPP), Jitter, Shimmer, HNR, SPL, and MPT	Fundamental frequency F0 variability, SPL range, CPP in context.
Advantages	Controlled, repeatable, less articulatory influence, useful for multilingual analysis	Reflects real-life communication, captures dynamic vocal attributes, may be more reliable for qualities like hoarseness
Limitations/Considerations	May not reflect natural voice use, less dynamic information, many measures easy to extract.	More complex analysis, influenced by speaking rate, intonation, and articulation, thus feature extraction can be more difficult

Figure A1. Sample probability density distribution to illustrate the use of t-Critical and t-Statistic in the assessment of the Null hypothesis.
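The decision rule illustrated in Figure A1 can be sketched in a few lines of Python: compute a t-statistic from two samples and reject the null hypothesis when its magnitude exceeds the critical value. The sketch assumes Welch's unequal-variance form, which may differ from the specific test variant used in this study, and the critical value is supplied by the caller (for example from a t-table) because the Python standard library has no inverse t-distribution. Function names and the toy samples are hypothetical.

```python
import math
import statistics


def welch_t(sample_a, sample_b):
    """Return (t_statistic, degrees_of_freedom) for Welch's two-sample t-test."""
    n_a, n_b = len(sample_a), len(sample_b)
    # Squared standard errors of each sample mean.
    se2_a = statistics.variance(sample_a) / n_a
    se2_b = statistics.variance(sample_b) / n_b
    t_stat = (statistics.mean(sample_a) - statistics.mean(sample_b)) / math.sqrt(se2_a + se2_b)
    # Welch-Satterthwaite approximation to the degrees of freedom.
    dof = (se2_a + se2_b) ** 2 / (se2_a ** 2 / (n_a - 1) + se2_b ** 2 / (n_b - 1))
    return t_stat, dof


def rejects_null(t_stat, t_critical):
    """Two-sided decision: reject when |t| exceeds the critical value."""
    return abs(t_stat) > t_critical


if __name__ == "__main__":
    # Toy samples for illustration only; NOT data from this study.
    t_stat, dof = welch_t([1, 2, 3, 4, 5], [2, 3, 4, 5, 6])
    # 2.306 is the two-sided 5% critical value for 8 degrees of freedom.
    print(t_stat, dof, rejects_null(t_stat, 2.306))
```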

Figure A2. Typical Fundamental Frequency F0 contours for a healthy male, a healthy female, and a male subject with laryngitis.

Figure A3. Typical intensity contours for a healthy male, a healthy female, and a male subject with laryngitis.

Figure A4. Typical spectrogram for a healthy male, a healthy female, and a male subject with laryngitis.

Figure A5. Typical Formant values (in red) superimposed on a spectrogram for a healthy male, a healthy female, and a male subject with laryngitis.

Figure A6. Typical MFCCs C2–C12 for a healthy female, and a male subject with laryngitis. Note that the MFCC values shown are not normalized and are the raw value of the log-energy of the spectral energies post processed by a discrete cosine transform [37,78].

References

Bensoussan, Y.; Sigaras, A.; Rameau, A.; Elemento, O.; Powell, M.; Dorr, D.; Payne, P.; Ravitsky, V.; Bélisle-Pipon, J.-C.; Johnson, A.; et al. Bridge2AI-Voice: An Ethically-Sourced, Diverse Voice Dataset Linked to Health Information.
Lyberg-Åhlander, V.; Rydell, R.; Fredlund, P.; Magnusson, C.; Wilén, S. Prevalence of Voice Disorders in the General Population, Based on the Stockholm Public Health Cohort. J Voice 2019, 33, 900–905. [CrossRef]
Skodda, S.; Grönheit, W.; Mancinelli, N.; Schlegel, U. Progression of Voice and Speech Impairment in the Course of Parkinson’s Disease: A Longitudinal Study. Parkinsons Dis 2013, 2013, 389195. [CrossRef]
Solomon, C.; Valstar, M.; Morriss, R.; Crowe, J. Objective Methods for Reliable Detection of Concealed Depression. Frontiers in ICT 2015, 2. [CrossRef]
Fagherazzi, G.; Fischer, A.; Ismael, M.; Despotovic, V. Voice for Health: The Use of Vocal Biomarkers from Research to Clinical Practice. Digit Biomark 2021, 5, 78–88. [CrossRef]
Cordella, F. The Sounds of Health: Harnessing Vocal Biomarkers for Scalable Health Tracking Available online: https://www.eitdigital.eu/newsroom/grow-digital-insights/the-sounds-of-health-harnessing-vocal-biomarkers-for-scalable-health-tracking/ (accessed on 12 July 2025).
O’Connell, K. 5 Vocal Biomarker Trends to Watch in 2025. Canary Speech 2025.
Available online: https://stimmdb.coli.uni-saarland.de/ (accessed on 26 November 2025).
Koreman, J. A German Database of Patterns of Pathological Vocal Fold Vibration. Phonus. Saarbrücken, Institut für … 1997.
Pützer, M.; Barry, W.J. Saarbruecken Voice Database 2008.
Brockmann-Bauser, M.; de Paula Soares, M.F. Do We Get What We Need from Clinical Acoustic Voice Measurements? Applied Sciences 2023, 13, 941. [CrossRef]
Patel, R.R.; Awan, S.N.; Barkmeier-Kraemer, J.; Courey, M.; Deliyski, D.; Eadie, T.; Paul, D.; Švec, J.G.; Hillman, R. Recommended Protocols for Instrumental Assessment of Voice: American Speech-Language-Hearing Association Expert Panel to Develop a Protocol for Instrumental Assessment of Vocal Function. American Journal of Speech-Language Pathology 2018, 27, 887–905. [CrossRef]
Saggio, G.; Costantini, G. Worldwide Healthy Adult Voice Baseline Parameters: A Comprehensive Review. Journal of Voice 2022, 36, 637–649. [CrossRef]
Guimarães, I.; Abberton, E. Health and Voice Quality in Smokers: An Exploratory Investigation. Logopedics Phoniatrics Vocology 2005, 30, 185–191. [CrossRef]
Mizuta, M.; Abe, C.; Taguchi, E.; Takeue, T.; Tamaki, H.; Haji, T. Validation of Cepstral Acoustic Analysis for Normal and Pathological Voice in the Japanese Language. Journal of Voice 2022, 36, 770–776. [CrossRef]
Jetté, M. Toward an Understanding of the Pathophysiology of Chronic Laryngitis. Perspect ASHA Spec Interest Groups 2016, 1, 14–25. [CrossRef]
O’Connell, N.S.; Dai, L.; Jiang, Y.; Speiser, J.L.; Ward, R.; Wei, W.; Carroll, R.; Gebregziabher, M. Methods for Analysis of Pre-Post Data in Clinical Research: A Comparison of Five Common Methods. J Biom Biostat 2017, 8, 1–8. [CrossRef]
Thorlund, K.; Dron, L.; Park, J.J.H.; Mills, E.J. Synthetic and External Controls in Clinical Trials – A Primer for Researchers. Clin Epidemiol 2020, 12, 457–467. [CrossRef]
De Los Reyes, A.; Kazdin, A. When the Evidence Says, “Yes, No, and Maybe So.” Current directions in psychological science 2008, 17, 47–51. [CrossRef]
Gerratt, B.R.; Kreiman, J.; Garellek, M. Comparing Measures of Voice Quality From Sustained Phonation and Continuous Speech. J Speech Lang Hear Res 2016, 59, 994–1001. [CrossRef]
Behlau, M.; Madazio, G.; Yamasaki, R. Dynamic Vocal Analysis: Vocal Functionality Evaluation. Codas 35, e20210083. [CrossRef]
Goy, H.; Fernandes, D.N.; Pichora-Fuller, M.K.; Lieshout, P. van Normative Voice Data for Younger and Older Adults. Journal of Voice 2013, 27, 545–555. [CrossRef]
Rodrigo, I.; Duñabeitia, J.A. Listening to the Mind: Integrating Vocal Biomarkers into Digital Health. Brain Sciences 2025, 15, 762. [CrossRef]
Glaspey, A.M.; Wilson, J.J.; Reeder, J.D.; Tseng, W.-C.; MacLeod, A.A.N. Moving Beyond Single Word Acquisition of Speech Sounds to Connected Speech Development With Dynamic Assessment. Journal of Speech, Language, and Hearing Research 2022, 65, 508–524. [CrossRef]
Lowie, W.; Verspoor, M. A Dynamic Systems Theory Approach to Second Language Acquisition. Bilingualism: Language and Cognition 2007, 10, 7–21. [CrossRef]
Kent, R.D. Vocal Tract Acoustics. Journal of Voice 1993, 7, 97–117. [CrossRef]
Jongman, A. Acoustic Phonetics II: Source-Filter Theory of Speech Production. Speech Prosody Studies Group 2023.
Available online: https://www.wiley.com/en-us/The+Handbook+of+Phonetic+Sciences%2C+2nd+Edition-p-9781405145909 (accessed on 27 April 2026).
Available online: https://www.izharishaksa.com/blog/harvard-sentences-complete-guide (accessed on 30 April 2026).
Vogel, A.P.; Maruff, P. Comparison of Voice Acquisition Methodologies in Speech Research. Behav Res Methods 2008, 40, 982–987. [CrossRef]
Awan, S.N.; Shaikh, M.A.; Awan, J.A.; Abdalla, I.; Lim, K.O.; Misono, S. Smartphone Recordings Are Comparable to “Gold Standard” Recordings for Acoustic Measurements of Voice. Journal of Voice 2025, 39, 1019–1032. [CrossRef]
Acad. Transcr. Serv. 2022.
Available online: https://www.fon.hum.uva.nl/praat/ (accessed on 27 November 2025).
Burris, C.; Vorperian, H.; Fourakis, M.; Kent, R.; Bolt, D. Quantitative and Descriptive Comparison of Four Acoustic Analysis Systems: Vowel Measurements. Journal of Speech, Language, and Hearing Research 2014, 57, 26–45. [CrossRef]
Amir, O.; Wolf, M.; Amir, N. A Clinical Comparison between Two Acoustic Analysis Softwares: MDVP and Praat. Biomedical Signal Processing and Control 2009, 4, 202–205. [CrossRef]
Parsa, V.; Jamieson, D.G. A Comparison of High Precision F0 Extraction Algorithms for Sustained Vowels. J Speech Lang Hear Res 1999, 42, 112–126. [CrossRef]
Ramadhina, D.; Magdalena, R.; Saidah, S. Individual Identification Through Voice Using Mel-Frequency Cepstrum Coefficient (MFCC) and Hidden Markov Models (HMM) Method. Journal of Measurements, Electronics, Communications, and Systems 2020, 7, 26. [CrossRef]
Banuroopa, K.; Shanmuga Priyaa, D. MFCC Based Hybrid Fingerprinting Method for Audio Classification through LSTM. International Journal of Nonlinear Analysis and Applications 2021, 12, 2125–2136. [CrossRef]
Alkhatib, B.; Eddin, M. Voice Identification Using MFCC and Vector Quantization. Baghdad Science Journal 2020, 17, 1019. [CrossRef]
Tracey, B.; Volfson, D.; Glass, J.; Haulcy, R.; Kostrzebski, M.; Adams, J.; Kangarloo, T.; Brodtmann, A.; Dorsey, E.R.; Vogel, A. Towards Interpretable Speech Biomarkers: Exploring MFCCs. Sci Rep 2023, 13, 22787. [CrossRef]
Vasquez-Serrano, P.; Reyes-Moreno, J.; Guido, R.C.; Sepúlveda-Sepúlveda, A. MFCC Parameters of the Speech Signal: An Alternative to Formant-Based Instantaneous Vocal Tract Length Estimation. Journal of Voice 2025, 39, 1431–1439. [CrossRef]
Goldberger, A.L.; Amaral, L.A.; Glass, L.; Hausdorff, J.M.; Ivanov, P.C.; Mark, R.G.; Mietus, J.E.; Moody, G.B.; Peng, C.K.; Stanley, H.E. PhysioBank, PhysioToolkit, and PhysioNet: Components of a New Research Resource for Complex Physiologic Signals. Circulation 2000, 101, E215-220. [CrossRef]
Cesari, U.; De Pietro, G.; Marciano, E.; Niri, C.; Sannino, G.; Verde, L. A New Database of Healthy and Pathological Voices. Computers & Electrical Engineering 2018, 68, 310–321. [CrossRef]
de Felippe, A.C.N.; Grillo, M.H.M.M.; Grechi, T.H. Standardization of Acoustic Measures for Normal Voice Patterns. Brazilian Journal of Otorhinolaryngology 2006, 72, 659–664. [CrossRef]
Hippargekar, P.; Bhise, S.; Kothule, S.; Shelke, S. Acoustic Voice Analysis of Normal and Pathological Voices in Indian Population Using Praat Software. Indian J Otolaryngol Head Neck Surg 2022, 74, 5069–5074. [CrossRef]
Demirhan, E.; Unsal, E.M.; Yilmaz, C.; Ertan, E. Acoustic Voice Analysis of Young Turkish Speakers. J Voice 2016, 30, 378.e21-25. [CrossRef]
Vreča, J.; Pilipović, R.; Biasizzo, A. Hardware–Software Co-Design of an Audio Feature Extraction Pipeline for Machine Learning Applications. Electronics 2024, 13, 875. [CrossRef]
Ma, E.P.-M.; Love, A.L. Electroglottographic Evaluation of Age and Gender Effects during Sustained Phonation and Connected Speech. J Voice 2010, 24, 146–152. [CrossRef]
Biever, D.M.; Bless, D.M. Vibratory Characteristics of the Vocal Folds in Young Adult and Geriatric Women. Journal of Voice 1989, 3, 120–131. [CrossRef]
Brown, W.S.; Morris, R.J.; Michel, J.F. Vocal Jitter in Young Adult and Aged Female Voices. Journal of Voice 1989, 3, 113–119. [CrossRef]
Ferrand, C.T. Harmonics-to-Noise Ratio: An Index of Vocal Aging. J Voice 2002, 16, 480–487. [CrossRef]
Banh, J.; Naumenko, K.; Goy, H.; Van Lieshout, P.; Fernandes, D.; Pichora-Fuller, K. Establishing Normative Voice Characteristics of Younger and Older Adults. Canadian Acoustics - Acoustique Canadienne 2009, 37, 190–191.
Dwire, A.; McCauley, R. Repeated Measures of Vocal Fundamental Frequency Perturbation Obtained Using the Visi-Pitch. J Voice 1995, 9, 156–162. [CrossRef]
Brockmann, M.; Drinnan, M.J.; Storck, C.; Carding, P.N. Reliable Jitter and Shimmer Measurements in Voice Clinics: The Relevance of Vowel, Gender, Vocal Intensity, and Fundamental Frequency Effects in a Typical Clinical Task. J Voice 2011, 25, 44–53. [CrossRef]
Teixeira, J.P.; Oliveira, C.; Lopes, C. Vocal Acoustic Analysis – Jitter, Shimmer and HNR Parameters. Procedia Technology 2013, 9, 1112–1122. [CrossRef]
Lovato, A.; Colle, W.D.; Giacomelli, L.; Piacente, A.; Righetto, L.; Marioni, G.; Filippis, C. de Multi-Dimensional Voice Program (MDVP) vs Praat for Assessing Euphonic Subjects: A Preliminary Study on the Gender-Discriminating Power of Acoustic Analysis Software. Journal of Voice 2016, 30, 765.e1-765.e5. [CrossRef]
Fernandes, J.; Teixeira, F.; Guedes, V.; Junior, A.; Teixeira, J. Harmonic to Noise Ratio Measurement - Selection of Window and Length. Procedia Computer Science 2018, 138, 280–285. [CrossRef]
Sheena; Mary, B.B.; Aswin, V.A.; Suprent, A. Variation of Harmonics to Noise Ratio from the Age Range of 9–18 Years Old in Both the Genders. Indian J Otolaryngol Head Neck Surg 2022, 74, 5518–5523. [CrossRef]
Orlikoff, R.F.; Kahane, J.C. Influence of Mean Sound Pressure Level on Jitter and Shimmer Measures. Journal of Voice 1991, 5, 113–119. [CrossRef]
Bele, I.V. The Speaker’s Formant. Journal of Voice 2006, 20, 555–578. [CrossRef]
Kent, R.D.; Vorperian, H.K. Static Measurements of Vowel Formant Frequencies and Bandwidths: A Review. J Commun Disord 2018, 74, 74–97. [CrossRef]
Maurer, D. Acoustics of the Vowel; 2016; ISBN 978-3-0343-2391-8.
Aalto, D.; Aaltonen, O.; Happonen, R.-P.; Jääsaari, P.; Kivelä, A.; Kuortti, J.; Luukinen, J.-M.; Malinen, J.; Murtola, T.; Parkkola, R.; et al. Large Scale Data Acquisition of Simultaneous MRI and Speech. Applied Acoustics 2014, 83, 64–75. [CrossRef]
Tang, D.; Niziolek, C.A.; Parrell, B. Formant Variability Is Related to Vowel Duration across Speakers. JASA Express Lett. 2025, 5, 115202. [CrossRef]
Buckley, D.P.; Abur, D.; Stepp, C.E. Normative Values of Cepstral Peak Prominence Measures in Typical Speakers by Sex, Speech Stimuli, and Software Type Across the Life Span. Am J Speech Lang Pathol 2023, 32, 1565–1577. [CrossRef]
Murton, O.; Hillman, R.; Mehta, D. Cepstral Peak Prominence Values for Clinical Voice Evaluation. American Journal of Speech-Language Pathology 2020, 29, 1596–1607. [CrossRef]
Anand, S.; Kopf, L.M.; Shrivastav, R.; Eddins, D.A. Using Pitch Height and Pitch Strength to Characterize Type 1, 2, and 3 Voice Signals. J Voice 2021, 35, 181–193. [CrossRef]
Brewer, C. Norms For Voice, Motor Speech, & Resonance Assessments Available online: https://theadultspeechtherapyworkbook.com/norms-for-voice/ (accessed on 14 July 2025).
Stathopoulos, E.T.; Huber, J.E.; Sussman, J.E. Changes in Acoustic Characteristics of the Voice Across the Life Span: Measures From Individuals 4–93 Years of Age. Journal of Speech, Language, and Hearing Research 2011, 54, 1011–1021. [CrossRef]
Abraham, E.A.; Geetha, A. Acoustical and Perceptual Analysis of Voice in Individuals with Parkinson’s Disease. Indian J Otolaryngol Head Neck Surg 2023, 75, 427–432. [CrossRef]
Burridge, J.; Vaux, B. Brownian Dynamics for the Vowel Sounds of Human Language. Phys. Rev. Research 2020, 2, 013274. [CrossRef]
Rabiei, M.; Gasparetto, A. A Methodology for Recognition of Emotions Based on Speech Analysis, for Applications to Human-Robot Interaction. An Exploratory Study. Paladyn, Journal of Behavioral Robotics 2014, 5. [CrossRef]
Patel, R.R.; Awan, S.N.; Barkmeier-Kraemer, J.; Courey, M.; Deliyski, D.; Eadie, T.; Paul, D.; Švec, J.G.; Hillman, R. Recommended Protocols for Instrumental Assessment of Voice: American Speech-Language-Hearing Association Expert Panel to Develop a Protocol for Instrumental Assessment of Vocal Function. American Journal of Speech-Language Pathology 2018, 27, 887–905. [CrossRef]
Titze, I.R.; Baken, R.J.; Bozeman, K.W.; Granqvist, S.; Henrich, N.; Herbst, C.T.; Howard, D.M.; Hunter, E.J.; Kaelin, D.; Kent, R.D.; et al. Toward a Consensus on Symbolic Notation of Harmonics, Resonances, and Formants in Vocalization. J Acoust Soc Am 2015, 137, 3005–3007. [CrossRef]
Kent, R. The MIT Encyclopedia of Communication Disorders; 2003; ISBN 978-0-262-27702-0.
Anand, S.; Kopf, L.M.; Shrivastav, R.; Eddins, D.A. Using Pitch Height and Pitch Strength to Characterize Type 1, 2, and 3 Voice Signals. J Voice 2021, 35, 181–193. [CrossRef]
Brinca, L.F.; Batista, A.P.F.; Tavares, A.I.; Gonçalves, I.C.; Moreno, M.L. Use of Cepstral Analyses for Differentiating Normal from Dysphonic Voices: A Comparative Study of Connected Speech versus Sustained Vowel in European Portuguese Female Speakers. J Voice 2014, 28, 282–286. [CrossRef]
Tracey, B.; Volfson, D.; Glass, J.; Haulcy, R.; Kostrzebski, M.; Adams, J.; Kangarloo, T.; Brodtmann, A.; Dorsey, E.R.; Vogel, A. Towards Interpretable Speech Biomarkers: Exploring MFCCs. Sci Rep 2023, 13, 22787. [CrossRef]

Table 1. Baseline Demographic and Clinical Characteristics of Study Participants.

Cohort	Status	Sex	n (%)	Age, Mean (SD)	Age Range
CWRU (N=32)	Healthy	M	21 (65.6%)	20.1 (1.5)	18–24
CWRU (N=32)	Healthy	F	11 (34.4%)	20.5 (0.7)	19–21
Saarbrücken	Healthy	M	21	20.3 (0.7)	20–21
Saarbrücken	Laryngitis	M	21	55.0 (4.0)	50–60

M=Male, F=Female, n = number of participants in a group, N = total number of participants, SD=Standard Deviation.

Table 2. Key voice parameters for the pronunciation of a vowel for the healthy cohorts.

Statistic	CVS Female Healthy Age: 20.5 (0.7)		CVS Male Healthy Age: 20.1 (1.5)		SVD Healthy Age: 20.3 (0.7)
F0	215.78	(26.67)	116.93	(26.80)	131.78	(32.52)
F3	2784.38	(423.12)	2662.57	(167.25)	2482.14	(330.84)
Jitter	0.00567	(<0.001)	0.00742	(<0.001)	0.00421	(<1e-5)
Shimmer	0.04550	(<0.001)	0.04291	(<0.001)	0.03200	(0.02)
HNR	13.581	(2.87)	10.477	(3.52)	19.409	(3.53)
Intensity	78.760	(3.01)	81.491	(2.45)	76.947	(2.99)
CPP	26.403	(2.84)	28.811	(3.51)	29.470	(3.81)
MFCCS-1	167.67	(28.98)	112.329	(35.95)	216.196	(42.07)
RSPL	14.724	(3.00)	12.406	(2.37)	16.911	(3.01)
MPT	0.473	(0.13)	0.500	(0.19)	1.353	(0.37)

Table 3. Average F0 from select literature for healthy subjects.

Healthy General		Female Healthy		Male Healthy
Biever (1989)	193.7	Ferand (1997)	209.7	Gray (2008)	125.0
Brown (1989)	211.0	Bahn (2009)	222.9	Bahn (2009)	177.8
Fellippe (2006)	162.8	Hippargeka (2022)	226.0	Hippargeka (2022)	131.6
		Ma (2010)	224.1	Davies (2015)	125.0
Average	189.2		220.7		124.8
Current work			215.8		116.9

Table 5. Key voice parameters for the pronunciation of a phrase for healthy cohorts.

Statistic	CVS Female Healthy Age: 20.5 (0.7)		CVS Male Healthy Age: 20.1 (1.5)		SVD Healthy Age: 20.3 (0.7)
F0	184.308	(18.17)	116.871	(22.32)	136.775	(28.35)
Jitter	0.0224	(0.0039)	n/a	n/a	0.02522	(0.0062)
Shimmer	0.0940	(0.0012)	0.10224	(0.0118)	0.91726	(0.0205)
HNR	7.1518	(1.540)	4.8537	(1.744)	10.173	(2.35)
Intensity	74.373	(1.725)	76.071	(2.055)	73.875	(2.22)
CPP	16.239	(4.683)	17.110	(5.375)	19.360	(5.72)
MFCCS-1	215.626	(25.010)	172.388	(28.439)	298.069	(31.10)
RSPL	18.739	(1.795)	17.499	(2.212)	20.196	(2.33)
MPT	2.517	(0.304)	2.391	(0.464)	1.628	(0.22)

Table 6. Summary of multivariate outcomes for all healthy subjects pronouncing a phrase.

	CVS Male vs CVS Female	CVS Male vs SVD Male	CVS Female vs SVD Male
Wilks’ Λ	0.1617	0.2088	0.1308
F (6, 16)	13.829	11.9982	17.7171
Pr>F (p-value)	< 0.0001	< 0.0001	< 0.0001
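
When only two groups are compared, Wilks' Λ converts exactly to an F statistic, which gives a quick consistency check on Table 6. The following is a minimal sketch, assuming the standard two-group relation F = ((1 − Λ)/Λ)(df2/df1) and the degrees of freedom (6, 16) shown in the row label. The function name `wilks_to_f` is hypothetical, and this is not necessarily how the statistical software used in the study computes the statistic.

```python
def wilks_to_f(wilks_lambda: float, df_num: int, df_den: int) -> float:
    """Exact F statistic for Wilks' lambda when only two groups are compared."""
    if not 0.0 < wilks_lambda <= 1.0:
        raise ValueError("Wilks' lambda must lie in (0, 1]")
    return (1.0 - wilks_lambda) / wilks_lambda * df_den / df_num


# Lambda values copied from Table 6; df = (6, 16) is taken from the row label.
table6 = {
    "CVS Male vs CVS Female": 0.1617,
    "CVS Female vs SVD Male": 0.1308,
}

for label, lam in table6.items():
    print(f"{label}: F(6, 16) = {wilks_to_f(lam, 6, 16):.2f}")
```

For these two comparisons the sketch returns roughly 13.8 and 17.7, in line with the tabulated 13.829 and 17.7171 once the rounding of Λ to four decimals is taken into account. The middle comparison (CVS Male vs SVD Male) does not reproduce under the same degrees of freedom, which may indicate a different sample size for that comparison.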

Table 7. Summary of multivariate outcomes for a vowel, comparing healthy male and female subjects with male subjects with laryngitis.

	Healthy CVS Male vs SVD Male w/Laryngitis	Healthy CVS Female vs SVD Male w/Laryngitis	Healthy SVD Male vs SVD Male w/Laryngitis
Wilks’ Λ	0.1106	0.1304	0.6109
F	16.089	13.338	1.592
Pr>F (p-value)	< 0.0001	0.0001	0.2170

Table 8. Summary of multivariate outcomes for a phrase, comparing healthy male and female subjects with male subjects with laryngitis.

	Healthy CVS Male vs SVD Male w/Laryngitis	Healthy CVS Female vs SVD Male w/Laryngitis	Healthy SVD Male vs SVD Male w/Laryngitis
Wilks’ Λ	0.2437	0.1152	0.8160
F	12.414	30.707	1.015
Pr>F (p-value)	< 0.0001	< 0.0001	0.4366

Table 9. Formants F1, F2, and F3 (Hz) for pronunciation of the vowel /a/.

Cohort	F1	F2	F3
Theatre Group (age varies) [60]	496	1368	2506
Female Youth Healthy [61]	625	2050	3050
Female Adult Healthy [62]	717	2501	3289
CVS Healthy Female	905	1393	2784
Male Adult Healthy [63]	269	2143	3182
Male Adult Healthy [62]	588	1952	2601
Healthy Male [8]	626	1145	2482
Laryngitis Male [8]	599	1114	2602
CVS Healthy Male	736	1243	2662
Coefficient of variation (overall)	0.28	0.31	0.11
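
The coefficient-of-variation row can be reproduced directly from the nine cohort rows above it. The following is a minimal sketch, assuming the values are in Hz and that the coefficient of variation is the sample standard deviation (n − 1 denominator) divided by the mean. The choice of sample rather than population standard deviation is an assumption, as is the helper name `coefficient_of_variation`.

```python
from statistics import mean, stdev

# Formant values (assumed Hz) copied row by row from Table 9.
formants = {
    "F1": [496, 625, 717, 905, 269, 588, 626, 599, 736],
    "F2": [1368, 2050, 2501, 1393, 2143, 1952, 1145, 1114, 1243],
    "F3": [2506, 3050, 3289, 2784, 3182, 2601, 2482, 2602, 2662],
}


def coefficient_of_variation(values):
    """Sample standard deviation (n - 1 denominator) divided by the mean."""
    return stdev(values) / mean(values)


cv = {name: round(coefficient_of_variation(v), 2) for name, v in formants.items()}
print(cv)  # {'F1': 0.28, 'F2': 0.31, 'F3': 0.11}
```

Under this assumption the sketch matches the tabulated 0.28, 0.31, and 0.11, showing that F1 and F2 vary far more across cohorts than F3.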

Table 10. CPP comparison.

Cohort	Vowel		Phrase
CVS Female Healthy	26.40	(2.84)	16.24	(4.68)
CVS Male Healthy	28.81	(3.51)	17.11	(5.38)
SVD Male Healthy	29.41	(3.81)	19.36	(5.72)
SVD Male Laryngitis	25.64	(4.08)	18.13	(6.05)

Note: Values are presented as mean (standard deviation).

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Copyright: This open access article is published under a Creative Commons CC BY 4.0 license, which permits free download, distribution, and reuse, provided that the author and preprint are cited in any reuse.