1. Introduction
Voice assistants have become increasingly prevalent in modern society, integrating seamlessly into smartphones, smart speakers, and various IoT devices. The widespread adoption of virtual assistants like Amazon’s Alexa, Apple’s Siri, and Google Assistant underscores the growing demand for intuitive and hands-free interaction with technology [1]. This trend is propelled by advancements in speech recognition and synthesis technologies, enabling more natural and efficient human-computer communication [2].
The primary and most widely adopted approach for designing voice assistant systems is the modular pipeline architecture. This structure divides the system into specialized components that each handle a distinct task. First, a speech-to-text (STT) engine transcribes the user’s spoken input into text. Next, a Natural Language Processing (NLP) component—typically powered by a Large Language Model (LLM)—interprets the text and generates a response. Finally, a text-to-speech (TTS) engine converts the generated text back into natural-sounding speech for the user [3]. This clear division allows each part of the system to leverage domain-specific advances: STT models focus on acoustic and linguistic accuracy, LLMs handle semantics and intent reasoning, and TTS models synthesize expressive, human-like speech.
An emerging alternative to this modular design is an integrated, end-to-end trainable system that combines all stages—recognition, understanding, and synthesis—into a single model. One prominent example is Moshi [4], a real-time speech-to-speech dialogue model developed by Kyutai. Although Moshi eliminates explicit interfaces between components (no separate automatic speech recognition (ASR), NLP, or TTS modules), its architecture remains logically modular: it uses a large language model core to reason over internally predicted text tokens, while audio tokenizers encode and decode the speech signals. The entire system is trained end-to-end, allowing for tight integration and low latency, but it still relies on internal text-based reasoning and separate token streams for input and output speech.
Overall, however, the base quality of the integrated approach remains behind that of modular pipelines. Moshi, for example, prioritizes conversational fluidity and low latency over raw quality; while it achieves decent intelligibility and speaker similarity, its authors acknowledge that it is not yet intended to compete with state-of-the-art commercial TTS systems in naturalness and expressiveness [4]. This makes the modular approach the preferred choice for voice assistant systems. With distinct STT, NLP, and TTS components, it allows for higher-quality outputs by leveraging advancements in each field. Benchmarking the performance of these individual components is critical not only for selecting the most effective models, but also for ensuring system reliability and maintainability: failures or bottlenecks in a modular pipeline can be isolated and resolved at the component level without retraining the entire system. This enables more targeted optimization and efficient integration of new models, which is essential for building robust voice assistant applications. Moreover, because quality or latency issues can be systematically traced to the responsible module, both development and debugging are streamlined.
Among the three core components of voice assistants—speech-to-text (STT), large language models (LLMs), and text-to-speech (TTS)—LLMs have received the most research attention due to rapid advancements in AI. Several benchmarks have been developed to assess their capabilities: BIG-Bench evaluates over 200 tasks spanning linguistics, logic, and world knowledge to test generalization limits [5], while MMLU measures accuracy across 57 academic subjects to assess reasoning depth [6]. HumanEval focuses on code generation accuracy using pass@k execution metrics [7], and MLPerf Inference benchmarks model latency and throughput under standardized deployment scenarios [8]. Project MPG introduces a combined “Goodness” and “Fastness” score to balance correctness with query-per-second efficiency [9]. Holistic chatbot evaluations such as the E2E Chatbot Benchmark assess semantic similarity against expert responses [10], MT-Bench-101 scores multi-turn conversational coherence [11], and Malode analyzes deployment feasibility in domain-specific contexts [12]. Finally, FACTS Grounding measures faithful, document-grounded generation using multi-judge evaluations across real-world enterprise domains [13]. Together, these benchmarks reflect a shift from accuracy-only metrics toward comprehensive evaluations incorporating latency, grounding, and real-world applicability.
There are also benchmarks that evaluate voice assistants holistically, assessing the entire end-to-end system rather than isolating individual modules, such as AudioBench [14], the End-to-End Speech Benchmark (ESB) [15], S2S-Bench [16], VoiceBench [17], and CURATe [18], which will be discussed in subsequent sections. While these works provide critical insights into the LLM component and end-to-end systems’ performance, they leave a major gap in evaluating TTS performance — particularly its responsiveness, which is essential for real-time voice interactions. Unlike LLMs, TTS systems have received relatively little benchmarking attention, despite their direct impact on perceived latency, fluidity of conversation, and overall user experience.
This work addresses that gap by introducing a benchmark dedicated to evaluating TTS responsiveness for real-time voice assistants, focusing on latency, tail latency, and processing efficiency rather than naturalness or intelligibility. It provides a standardized framework for fair comparisons, establishes performance baselines, and informs future optimization of low-latency TTS systems.
Section 2 reviews related evaluations on LLMs, STT, and TTS; Section 3 details the benchmarking methodology; Section 4 describes the experimental setup; Section 5 presents results; and Section 6 discusses broader trends and future directions.
3. Benchmarking Methodology
This project is influenced by the MLPerf Inference framework [8], a widely recognized benchmark suite for evaluating machine learning systems across different deployment scenarios. Among its four defined scenarios—single-stream, multistream, server, and offline—we adopt the single-stream setting, which best reflects real-time voice assistant usage where queries are processed sequentially for a single user. In our evaluation, we feed each model datasets of text from different categories (described in the experiments section) and record the latency for every input. We then summarize these results using descriptive statistics, primarily median latency and tail latency (e.g., 90th percentile), to capture both typical and worst-case performance across the distribution. This approach reflects realistic usage patterns rather than single-point measurements and provides a more reliable picture of model responsiveness.
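For illustration, a minimal single-stream measurement loop might look like the sketch below. This is not the released benchmarking harness; `synthesize_to_file` is a hypothetical stand-in for whichever model-specific call writes the audio file, and `texts` holds one input category.

```python
import time

def run_single_stream(synthesize_to_file, texts, out_dir="out"):
    """Process inputs sequentially (MLPerf-style single-stream) and record
    one end-to-end latency per input, in seconds."""
    latencies = []
    for i, text in enumerate(texts):
        t_start = time.perf_counter()                          # text handed to the model
        synthesize_to_file(text, f"{out_dir}/sample_{i}.wav")  # blocks until the file is written
        t_end = time.perf_counter()                            # audio file fully produced
        latencies.append(t_end - t_start)
    return latencies
```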
This paper defines TTS input processing as the sequence where text is received and converted into an output audio file. Because streaming support varies—some models offer native streaming while others rely on external tools—evaluating only file generation time ensures fair, consistent measurement of responsiveness across all models.
For the TTS models under evaluation, the task is to convert text input into an audio file accurately and efficiently. The following metrics are chosen for evaluating model responsiveness and quality:
- Latency: Measures the time from when the model begins processing the text input to when the generated audio file is fully produced. This study reports the median latency, which summarizes typical synthesis speed across the dataset while minimizing the effect of outliers. Lower median latency directly indicates faster audio generation, which is critical for real-time voice assistant performance.

  $$\text{Latency} = t_{\text{end}} - t_{\text{start}}$$

  where:
  - $t_{\text{start}}$ is the timestamp when the text input is passed to the TTS model.
  - $t_{\text{end}}$ is the timestamp when the audio file generation is completed.
- Tail Latency: Captures worst-case performance by measuring the 90th percentile latency across all samples. This reflects edge-case delays relevant to voice assistant interactions; keeping this value low is essential for maintaining smooth and responsive user experiences.

  $$\text{Tail Latency} = P_{90}\left(\{\text{Latency}_1, \ldots, \text{Latency}_n\}\right)$$

  where:
  - $P_{90}$ is the 90th percentile latency across the dataset.
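A small sketch of how the two reported statistics could then be computed from the recorded per-input latencies; NumPy's percentile routine is used here purely for illustration.

```python
import numpy as np

def latency_summary(latencies):
    """Summarize per-input latencies (seconds) into the two reported statistics."""
    samples = np.asarray(latencies, dtype=float)
    return {
        "median": float(np.median(samples)),       # typical responsiveness (M)
        "p90": float(np.percentile(samples, 90)),  # tail latency (P90)
    }
```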
In addition to latency metrics, this paper evaluates Audio Quality as a primary measure of synthesis performance. These metrics directly assess the clarity, naturalness, and intelligibility of the generated speech, focusing on phrases and sentences rather than isolated single-word utterances. In this study, audio quality is further examined alongside responsiveness to investigate whether there is any trade-off between synthesis speed and perceived quality.
To evaluate audio quality, this study uses a speech-to-text (STT) model to transcribe the audio generated by each TTS system back into text. The resulting transcription is then compared against the original input text to measure how accurately the TTS model preserved the intended content. Both the ground-truth input text and the STT-produced transcription undergo normalization following Whisper’s evaluation guidelines [26], which include lowercasing, punctuation removal, whitespace standardization, contraction normalization, and numeric formatting. This preprocessing ensures that the comparison focuses on semantic accuracy rather than superficial formatting differences, providing a fair assessment of how faithfully each TTS model conveys the intended speech content.
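The openai-whisper package ships the normalizer used in Whisper's evaluation code; a minimal sketch of applying it to both sides of the comparison, assuming that package is installed, is shown below.

```python
# pip install openai-whisper
from whisper.normalizers import EnglishTextNormalizer

normalizer = EnglishTextNormalizer()

def normalize_pair(reference_text: str, stt_transcription: str):
    """Apply Whisper-style normalization (lowercasing, punctuation and whitespace
    cleanup, contraction and number handling) to both strings before comparison."""
    return normalizer(reference_text), normalizer(stt_transcription)
```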
- Percentage of Incorrect Audio Files: The proportion of generated audio files whose transcriptions, after being converted back to text via STT and normalized as described above, do not exactly match the original input text. In other words, this metric is based on strict string equality between the normalized input and output texts.
- Overall WER: A cumulative measure of word error rate (WER) across all transcriptions. WER is a common metric used to evaluate transcription accuracy, calculated for a single sample as:

  $$\text{WER} = \frac{S + D + I}{N}$$

  where:
  - $S$ is the number of substitutions (incorrectly transcribed words),
  - $D$ is the number of deletions (missing words),
  - $I$ is the number of insertions (extra words added),
  - $N$ is the total number of words in the reference transcript.
  For multiple transcriptions, the Overall WER used by Whisper [26] aggregates WER across all samples by summing the total number of errors (substitutions, deletions, and insertions) across all transcriptions and dividing by the total number of words in all reference transcripts:

  $$\text{Overall WER} = \frac{\sum_{i}\left(S_i + D_i + I_i\right)}{\sum_{i} N_i}$$

  where $i$ indexes each individual sample in the dataset. This method provides a cumulative error rate that reflects the overall performance of the transcription system across a dataset rather than an average of individual WER values.
- Median WER (Mismatched Files): In addition to the summation-based Overall WER following Whisper’s methodology, we also compute the median WER specifically over files with mismatched transcriptions, providing further insight into transcription accuracy variability.
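For reference, a self-contained sketch of how these three quality metrics could be computed from normalized (input text, transcription) pairs. Libraries such as jiwer offer equivalent WER counting; a plain dynamic-programming version is shown here to keep the arithmetic explicit, and the function names are illustrative rather than taken from the paper's code.

```python
from statistics import median

def wer_counts(reference: str, hypothesis: str):
    """Return (S, D, I, N) for one word-level reference/hypothesis pair."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = (total_edits, S, D, I) aligning ref[:i] with hyp[:j]
    dp = [[(0, 0, 0, 0)] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(1, len(ref) + 1):
        dp[i][0] = (i, 0, i, 0)                       # delete all remaining reference words
    for j in range(1, len(hyp) + 1):
        dp[0][j] = (j, 0, 0, j)                       # insert all remaining hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            if ref[i - 1] == hyp[j - 1]:
                dp[i][j] = dp[i - 1][j - 1]           # exact match, no edit
            else:
                sub, dele, ins = dp[i - 1][j - 1], dp[i - 1][j], dp[i][j - 1]
                dp[i][j] = min(
                    (sub[0] + 1, sub[1] + 1, sub[2], sub[3]),      # substitution
                    (dele[0] + 1, dele[1], dele[2] + 1, dele[3]),  # deletion
                    (ins[0] + 1, ins[1], ins[2], ins[3] + 1),      # insertion
                    key=lambda t: t[0],
                )
    _, s, d, i_count = dp[len(ref)][len(hyp)]
    return s, d, i_count, len(ref)

def quality_metrics(pairs):
    """pairs: normalized (input_text, stt_transcription) tuples, one per audio file."""
    total_errors = total_words = 0
    mismatched_wers = []
    for ref, hyp in pairs:
        s, d, i, n = wer_counts(ref, hyp)
        total_errors += s + d + i
        total_words += n
        if ref != hyp:                                # strict string equality check
            mismatched_wers.append((s + d + i) / n if n else 0.0)
    return {
        "percent_incorrect": 100.0 * len(mismatched_wers) / len(pairs),
        "overall_wer": total_errors / total_words if total_words else 0.0,
        "median_wer_mismatched": median(mismatched_wers) if mismatched_wers else 0.0,
    }
```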
4. Experiments - Benchmarked Text Inputs
This section describes the experiments conducted to measure each model’s performance, detailing the datasets of text inputs used as well as the models tested.
4.1. Evaluation Categories
First, to reflect the models’ ability to naturally transform text into speech in conversation, we consider the elements of human speech in which certain utterance types consist solely of words or short phrases that can function as complete, standalone utterances. We also account for elements present in dialogue, including backchannel words such as "uh-huh" and "I see" and filler or transition words such as "well", "so", and "anyway", to define the range of inputs for model testing. Accordingly, this project evaluates single-word utterances and two-word phrases that can function as complete sentences. Furthermore, full-length sentences are incorporated to capture typical structures observed in standard written and spoken communication. One guideline for scientific writing suggests 12-17 words as the optimal sentence length [27], while another notes that today’s experts recommend 15-18 words per sentence [28]. To accommodate both guidelines, we incorporate 12-word and 18-word sentences into our benchmarking, arriving at the following criteria for experimentation:
To ensure a robust evaluation, while latency is computed for all input categories, audio quality metrics are assessed only on 12-word and 18-word sentences generated by LLaMA 3 8B. This choice avoids the limitations of single-word evaluations: while the benchmark includes standalone words (e.g., “Yes,” “Stop”), such cases can yield misleading accuracy scores because multiple pronunciations may be valid and the lack of surrounding context provides few cues for disambiguation. Multi-word phrases and sentences, by contrast, more closely resemble natural speech patterns and offer richer contextual information, resulting in a more reliable assessment of audio quality.
Table 1. Categorization of Benchmarked Text Input Types with Word Count, Sources, and Data Size.

| Input Type | # Words | Justification | Source | # Data Points |
| --- | --- | --- | --- | --- |
| Single Words | | | | |
| 1-syllable words | 1 | Single words containing exactly one syllable, filtered and deduplicated. | CMU Dictionary [29,30] | 5,352 |
| 2-syllable words | 1 | Single words containing exactly two syllables, filtered and deduplicated. | CMU Dictionary [29,30] | 13,549 |
| Most common in dialogues | 1 | Frequently spoken standalone words (questions, answers, imperatives, etc.). | COCA [31] | 501 |
| 2-Word Phrases | | | | |
| LLM-generated | 2 | Random 2-word phrases used as standalone expressions, deduplicated. | LLaMA 3 8B (LLM-generated) | 5,659 |
| Most common | 2 | Most frequent spoken 2-word N-grams based on raw frequency. | iWebCorpus [32] | 504 |
| Sentences with Specific Lengths | | | | |
| 12-word sentences | 12 | Minimum recommended sentence length for benchmarking. | LLaMA 3 8B (LLM-generated) | 4,994 |
| 18-word sentences | 18 | Maximum recommended sentence length for benchmarking. | LLaMA 3 8B (LLM-generated) | 5,506 |
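As an illustration of how the single-word categories in Table 1 could be derived, the sketch below filters the CMU Pronouncing Dictionary by syllable count using NLTK; the exact filtering and deduplication pipeline used in this study may differ.

```python
# pip install nltk && python -m nltk.downloader cmudict
from nltk.corpus import cmudict

def words_with_syllables(target_syllables: int):
    """Collect deduplicated alphabetic CMU-dictionary words whose first listed
    pronunciation has exactly `target_syllables` syllables (vowel phonemes in
    CMUdict carry a stress digit, so counting digits counts syllables)."""
    selected = set()
    for word, pronunciations in cmudict.dict().items():
        if not word.isalpha():
            continue  # skip entries containing apostrophes, digits, etc.
        syllables = sum(phoneme[-1].isdigit() for phoneme in pronunciations[0])
        if syllables == target_syllables:
            selected.add(word)
    return sorted(selected)

# e.g. one_syllable_words = words_with_syllables(1)
#      two_syllable_words = words_with_syllables(2)
```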
4.2. TTS Models
This study focuses on evaluating the most popular and widely-used TTS models, as identified through extensive literature and community engagement metrics—such as GitHub stars or likes—alongside models developed by reputable service providers. Since voice assistants typically do not require voice cloning or dynamic multi-voice capabilities, we prioritize models trained on single-speaker datasets. We also include models trained on multiple voices, provided that each voice is independently preset—meaning the model is trained on specific voices individually and allows users to choose from those predefined options.
In many TTS systems, the core model is responsible for converting input text into intermediate audio representations, typically mel-spectrograms. These spectrograms must then be transformed into actual audio waveforms using a vocoder. Therefore, unless the TTS model is fully end-to-end, a separate vocoder is required to synthesize the final audio output. In this study, we focus on TTS models that perform the text-to-spectrogram transformation, while still making use of vocoders where applicable. The list of models and architectures evaluated is as follows:
Table 2. Summary of TTS models evaluated in this study.

| Model | Architecture / Approach | Setup (Vocoder / Dataset) | Year | Stars/Forks | Key Features |
| --- | --- | --- | --- | --- | --- |
| BarkTTS [33] | Transformer text-to-audio, semantic token-based | EnCodec [34] | 2023 | N/A | Multilingual, includes nonverbal audio and music |
| VALL-E X [35] | Neural codec language model | EnCodec [34] | 2023 | N/A | Zero-shot voice cloning, cross-lingual synthesis |
| Neural HMM [36] | Probabilistic seq2seq with HMM alignment | HiFi-GAN v2 [37] / LJSpeech [38] | 2022 | 41.6k / 5.4k | Stable synthesis, adjustable speaking rate, low data need |
| Overflow [39] | Neural HMM + normalizing flows | HiFi-GAN v2 [37] / LJSpeech [38] | 2022 | 41.6k / 5.4k | Expressive, low WER, efficient training |
| Edge-TTS [40] | Cloud API | Cloud backend | 2021 | N/A | Lightweight, easy integration, multilingual |
| FastPitch [41] | Parallel non-autoregressive, pitch-conditioned | HiFi-GAN v2 [37] / LJSpeech [38] | 2021 | 41.6k / 5.4k | Low latency, controllable prosody, real-time capable |
| VITS [42] | End-to-end VAE + adversarial learning | None / LJSpeech [38] | 2021 | 41.6k / 5.4k | Unified pipeline, high naturalness, expressive prosody |
| Glow-TTS [43] | Flow-based, monotonic alignment | Multi-band MelGAN [44] / LJSpeech [38] | 2020 | 41.6k / 5.4k | Fast inference, robust alignment, pitch/rhythm control |
| Capacitron [45] | Tacotron 2 + VAE for prosody | HiFi-GAN v2 [37] / Blizzard 2013 [46] | 2019 | 41.6k / 5.4k | Fine-grained prosody modeling, expressive synthesis |
| Tacotron 2 (DCA/DDC) [47] | Two-stage seq2seq, attention-based | HiFi-GAN v2 [37], Multi-band MelGAN [44] / LJSpeech [38] | 2018 | 41.6k / 5.4k | High naturalness, improved alignment with DDC/DCA |
| Google TTS (gTTS) [48] | Cloud API (Google Translate) | Cloud backend | 2014 | N/A | Lightweight, easy integration, multilingual |
| Microsoft Speech API 5.4 (Microsoft Speech) [49] | Concatenative / parametric synthesis | Built-in voices / Windows | 2009 | N/A | Commercial baseline, real-time, multilingual support |
The experiments are conducted on a machine equipped with a 13th Gen Intel Core i9-13900HX processor (24 cores, 32 threads) and an NVIDIA GeForce RTX 4070 Laptop GPU with 8 GB of VRAM, providing sufficient computational resources for text-to-speech benchmarking. As this setup reflects a high-end consumer device rather than specialized server hardware, it closely mirrors real-world conditions in which virtual assistants are typically deployed, ensuring the results remain practically relevant. To ensure a fair and consistent comparison, all experiments are conducted using each model’s default configurations, with no manual hyperparameter tuning or additional training. The models are evaluated exactly as provided by their official implementations or APIs, relying on pre-trained weights and configurations publicly released by their developers. No fine-tuning or custom training was performed in this study.
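As an example of this default-configuration setup, the sketch below loads one locally hosted pre-trained model through the Coqui TTS Python API and one cloud service through the gTTS package, timing file generation in both cases. The model identifier and function names are illustrative, and both APIs are used only as documented by their maintainers.

```python
import time

from TTS.api import TTS   # pip install TTS   (Coqui TTS)
from gtts import gTTS     # pip install gTTS  (Google Translate TTS)

def local_latency(text: str, out_path: str = "local.wav",
                  model_name: str = "tts_models/en/ljspeech/glow-tts") -> float:
    """Synthesize with a pre-trained Coqui model using its default configuration;
    packaged models download and apply their paired vocoder automatically."""
    tts = TTS(model_name=model_name)   # model loading is excluded from the timing
    t0 = time.perf_counter()
    tts.tts_to_file(text=text, file_path=out_path)
    return time.perf_counter() - t0    # latency in seconds

def cloud_latency(text: str, out_path: str = "cloud.mp3") -> float:
    """Synthesize with gTTS; the latency includes the network round trip."""
    t0 = time.perf_counter()
    gTTS(text=text, lang="en").save(out_path)
    return time.perf_counter() - t0
```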
5. Results
This section provides an overview of model performance, focusing on both responsiveness and audio quality for the 12-word and 18-word sentence input types. We begin with a broad discussion of overall trends before examining the results and their implications in greater detail in the subsequent section.
5.1. Responsiveness
Overall, across both median and 90th percentile rankings, FastPitch consistently dominates performance. It ranks first or second in nearly every category, achieving first place in both median and 90th percentile for most inputs such as 2-syllable words, most common words, and LLM 2-word phrases. Even in longer sequences like 12-word and 18-word sentences, it maintains top-three positions (median rank 1, P90 rank 3). This highlights FastPitch’s exceptional balance between typical and worst-case latency, ensuring predictable real-time responsiveness across varying input complexities.
Microsoft Speech follows closely behind and frequently secures first or second place in both median and P90 rankings across almost all categories. It is ranked first in median latency for 1-syllable words, tied first for most common words, and first again for 12-word sentences. Even at 90th percentile, Microsoft Speech rarely falls below rank 3, showing remarkable stability despite being a non-neural, CPU-based system. Its consistent top-tier performance demonstrates how deterministic architectures can outperform many neural models under variable conditions.
GlowTTS and VITS also perform strongly, staying within the top five in most categories. GlowTTS achieves ranks as high as third for both median and P90 in shorter inputs (e.g., 1-syllable and 2-syllable words) and second for 18-word sentences at P90, though occasionally dropping to fifth in LLM 2-word phrases. VITS shows similar behavior, with steady mid-upper rankings (median rank 4–6) and slightly worse tail performance (P90 ranks 4–5), indicating minor degradation as inputs grow more complex.
The middle tier consists of NeuralHMM, Overflow, and Tacotron2-DCA. NeuralHMM ranks mid-table, for example 8th in median and 6th in P90 for 1-syllable words, and improves to 4th in most common words (median). However, its P90 scores degrade with input length, reaching 7th for 18-word sentences. Overflow follows a similar trajectory, ranking around 6th–10th across categories; for instance, it is 10th median and 7th P90 for 1-syllable words and 9th/8th for most common words. Tacotron2-DCA places better at median (2nd–4th in shorter phrases) but worsens in P90, hitting 10th for 2-syllable words and 9th for most common words, indicating occasional latency spikes.
Capacitron50 and Capacitron150 occupy mid-lower rankings. Capacitron50 is typically 6th–7th median (e.g., 6th for 1-syllable words, 7th for most common words) and similarly placed for P90. Capacitron150 fares slightly worse, with 9th median and 9th P90 for 1-syllable words and 7th–8th across other categories. The minimal performance gap between the 50- and 150-unit variants suggests increased model size does not significantly alleviate tail-latency issues.
At the bottom, Vall-E-X and BarkTTS consistently rank last, with extremely high median and P90 latencies across all input lengths. Vall-E-X sits consistently at 13th–14th for both median and P90, for example 13th median and 13th P90 in most common words, and 14th in 18-word sentences. BarkTTS is similarly poor, often tied for 14th (e.g., 14th median and 14th P90 in 1-syllable and 2-syllable words) and showing no improvement even in shorter inputs. These rankings reflect their computationally heavy architectures and lack of optimization for low-latency inference.
Table 3. Median (M) and 90th Percentile (P90) Latencies Across All Benchmarked Text Input Categories (s).

| Model | 1-syllable word (M / P90) | 2-syllable word (M / P90) | Most common word (M / P90) | LLM 2-word phrase (M / P90) | Most common 2-word (M / P90) | LLM 12-word sentence (M / P90) | LLM 18-word sentence (M / P90) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| BarkTTS [33] | 5.73 / 13.01 | 4.41 / 12.11 | 1.49 / 1.90 | 2.01 / 5.05 | 7.45 / 12.42 | 31.05 / 78.86 | 14.75 / 36.00 |
| Vall-E-X [35] | 2.20 / 11.72 | 1.75 / 5.59 | 2.63 / 4.36 | 7.53 / 15.78 | 1.73 / 5.93 | 10.09 / 13.90 | 17.79 / 45.38 |
| NeuralHMM [36] | 0.27 / 0.33 | 0.28 / 0.35 | 0.24 / 0.34 | 0.06 / 0.12 | 0.20 / 0.24 | 0.45 / 0.70 | 0.64 / 0.85 |
| Overflow [39] | 0.33 / 0.41 | 0.23 / 0.28 | 0.34 / 0.53 | 0.37 / 0.45 | 0.22 / 0.27 | 0.68 / 0.88 | 1.56 / 1.99 |
| EdgeTTS [40] | 1.11 / 1.58 | 1.29 / 1.59 | 0.83 / 0.98 | 1.19 / 1.98 | 0.87 / 1.16 | 1.64 / 2.26 | 1.78 / 2.41 |
| FastPitch [41] | 0.06 / 0.09 | 0.04 / 0.05 | 0.05 / 0.12 | 0.04 / 0.07 | 0.04 / 0.10 | 0.06 / 0.17 | 0.13 / 0.30 |
| VITS [42] | 0.29 / 0.33 | 0.22 / 0.31 | 0.19 / 0.28 | 0.25 / 0.29 | 0.20 / 0.23 | 0.28 / 0.42 | 0.51 / 0.73 |
| GlowTTS [43] | 0.10 / 0.14 | 0.09 / 0.13 | 0.09 / 0.14 | 0.23 / 0.29 | 0.05 / 0.07 | 0.08 / 0.13 | 0.09 / 0.15 |
| Capacitron50 [45] | 0.23 / 0.35 | 0.24 / 0.39 | 0.20 / 0.32 | 0.29 / 0.37 | 0.19 / 0.37 | 0.35 / 0.48 | 0.66 / 0.94 |
| Capacitron150 [45] | 0.25 / 0.51 | 0.28 / 0.58 | 0.22 / 0.42 | 0.33 / 0.69 | 0.22 / 0.50 | 0.64 / 0.86 | 0.39 / 0.62 |
| Tacotron2-DCA [47] | 0.10 / 0.53 | 0.18 / 1.16 | 0.16 / 0.70 | 0.09 / 0.46 | 0.18 / 0.50 | 0.44 / 0.62 | 0.53 / 0.77 |
| Tacotron2-DDC [47] | 0.17 / 16.49 | 0.26 / 15.95 | 0.37 / 8.19 | 0.45 / 21.99 | 0.16 / 8.14 | 0.91 / 1.24 | 0.93 / 1.52 |
| GTTS [48] | 0.43 / 0.65 | 0.45 / 0.58 | 0.58 / 0.65 | 0.42 / 0.62 | 0.58 / 0.64 | 0.68 / 0.78 | 0.93 / 1.39 |
| Microsoft Speech [49] | 0.06 / 0.07 | 0.06 / 0.08 | 0.06 / 0.07 | 0.07 / 0.08 | 0.06 / 0.07 | 0.07 / 0.08 | 0.07 / 0.08 |
A particularly striking outlier is Tacotron2-DDC. While its median ranks are mid-tier (e.g., 4th for 1-syllable words, 8th for 2-syllable words), its P90 ranks collapse to last place in almost every input (14th P90 for 1-syllable words, 13th–14th for longer inputs). This discrepancy stems from pathological inference behavior, where stop-token failures and alignment drift cause excessively long outputs even for short inputs.
Cloud-based systems like GTTS and EdgeTTS show distinct patterns. GTTS generally ranks in the 10th–11th range for both median and P90, such as 10th median and 10th P90 for 1-syllable words, reflecting moderate but stable performance. EdgeTTS fares slightly worse: 12th in both median and P90 for 1-syllable words and 11th–12th for most categories, with tail latencies particularly affected by network fluctuations. These results highlight how cloud services, while convenient, struggle to match the deterministic low-latency performance of local neural models like FastPitch or GlowTTS.
5.2. Audio Quality
In terms of output audio quality, Microsoft Windows Speech, GTTS, and EdgeTTS emerge as the best overall performers. All three achieve extremely low incorrect file rates — between 2–4% for 12-word inputs and around 4% for 18-word inputs — with overall WER consistently below 0.5. Median WER remains low (approx. 10–16), demonstrating both stability and precision. Their consistent accuracy across sentence lengths places them firmly at the top of the ranking. Notably, Microsoft Windows Speech achieves this despite being CPU-based and non-neural, highlighting the efficiency of its deterministic synthesis pipeline. GTTS and EdgeTTS, despite being cloud-based, maintain similarly high quality, with minimal degradation even as input length doubles.
Among locally hosted neural models, VITS demonstrates the strongest performance in audio quality. It records incorrect rates of just 5.11% (12-word) and 7.11% (18-word), combined with the lowest overall WER among local models (approx. 0.65) and consistently low median WER values (approx. 5–8%). FastPitch and GlowTTS follow closely, with slightly higher incorrect rates (7–20%) and WER values around 0.7–1.8, remaining highly competitive in quality despite their focus on speed.
Mid-tier results are observed for Tacotron2-DCA, Tacotron2-DDC, and Overflow. These models maintain moderate incorrect rates (approx. 12–20%) and overall WER between 1–3. While Tacotron2-DDC suffers from severe tail-latency issues in responsiveness, its quality metrics remain stable, suggesting that overly long outputs do not necessarily lead to higher transcription mismatches. Overflow and Tacotron2-DCA show similar performance, with modest error rates but less consistency compared to top-tier models.
Table 4. TTS Models’ Audio Quality for LLM-Generated 12-Word and 18-Word Sentence Inputs.

| Model | % Incorrect (12-word) | Overall WER (12-word) | Median WER (12-word) | % Incorrect (18-word) | Overall WER (18-word) | Median WER (18-word) |
| --- | --- | --- | --- | --- | --- | --- |
| BarkTTS [33] | 26.72 | 5.28 | 8.33 | 32.30 | 4.20 | 5.56 |
| Vall-E-X [35] | 26.36 | 4.04 | 8.33 | 32.67 | 3.75 | 5.56 |
| NeuralHMM [36] | 51.44 | 8.48 | 16.67 | 62.78 | 8.50 | 11.11 |
| Overflow [39] | 12.06 | 1.38 | 8.33 | 16.32 | 1.28 | 5.56 |
| EdgeTTS [40] | 3.02 | 0.41 | 16.67 | 4.02 | 0.38 | 11.11 |
| FastPitch [41] | 7.15 | 0.82 | 8.33 | 9.43 | 0.77 | 5.56 |
| VITS [42] | 5.11 | 0.65 | 8.33 | 7.11 | 0.65 | 5.56 |
| GlowTTS [43] | 14.92 | 1.80 | 8.33 | 20.03 | 1.69 | 5.56 |
| Capacitron 50 [45] | 17.01 | 3.90 | 16.67 | 23.87 | 4.77 | 11.11 |
| Capacitron 150 [45] | 28.19 | 7.94 | 16.67 | 23.64 | 4.89 | 11.11 |
| Tacotron 2 - DCA [47] | 16.01 | 2.01 | 8.33 | 20.40 | 1.99 | 5.56 |
| Tacotron 2 - DDC [47] | 12.56 | 2.68 | 8.33 | 19.68 | 2.84 | 5.56 |
| GTTS [48] | 2.84 | 0.40 | 16.67 | 4.25 | 0.39 | 10.82 |
| Microsoft Speech [49] | 2.86 | 0.39 | 16.67 | 4.09 | 0.39 | 10.53 |
NeuralHMM and the Capacitron series perform worse in terms of quality. NeuralHMM exhibits error rates exceeding 50% and overall WER around 8.5, indicating frequent synthesis failures or alignment errors. Capacitron50 and Capacitron150 fare slightly better but still remain in the lower tier, with 17–28% incorrect rates and overall WER between 4–8, reflecting inconsistent handling of longer phrases.
Vall-E-X and BarkTTS consistently deliver the lowest quality scores. Both models show incorrect rates exceeding 25–32% and overall WER above 3.5–5, with median WER comparable to low-performing models despite significantly heavier architectures. Their degradation worsens with longer inputs, highlighting poor optimization for short prompts and limited robustness to input length scaling.
Overall, cloud-based models (GTTS, EdgeTTS) and in-built Microsoft Windows Speech dominate the quality metrics, showing the most accurate and stable outputs, while VITS leads among locally hosted neural architectures. FastPitch and GlowTTS also maintain high quality with slightly more variability, while models like Vall-E-X, BarkTTS, and Capacitron variants consistently trail behind due to higher error rates and poor robustness.
6. Discussions & Conclusion
6.1. High-Performing Models: Strong Responsiveness with Minimal Quality Trade-Offs Across Input Categories
One of the primary aims of this benchmarking study was to identify whether any text-to-speech (TTS) models could deliver both low latency and high audio quality consistently across a range of input lengths. Most models exhibited trade-offs — excelling in one dimension while underperforming in the other. However, Microsoft Speech stands out as the clear overall leader. Despite being a CPU-based, non-neural system, it achieves top-tier responsiveness and audio quality simultaneously, ranking first or second in nearly every median and 90th percentile latency category and also securing the lowest error rates in audio quality metrics (approximately 2–4% incorrect files, overall WER around 0.39-0.41). Its deterministic design avoids pathological tail-latency spikes and maintains pristine transcription fidelity even for long inputs, making it uniquely suited for latency-critical real-time applications.
In terms of pure responsiveness, FastPitch remains the fastest neural model in this benchmark. Its feed-forward, non-autoregressive design consistently ranks first in median and tail latency across almost every category, including 1- and 2-syllable words, most common words, and short phrases. Even in longer sentences (12-word and 18-word), it maintains top-three positions, highlighting its predictability under load. However, FastPitch’s audio quality lags behind the top tier: its incorrect rates are slightly higher (around 7–9%), and its overall WER (around 0.7-0.8) is above the best performers. While still competitive and suitable for real-time systems where speed is paramount, its quality ceiling prevents it from unseating Microsoft Speech as the best all-rounder.
Conversely, EdgeTTS and Google’s GTTS dominate the audio quality leaderboard. Both achieve error rates in the 2–4% range and overall WER below 0.5, rivaling or even matching Microsoft Speech for transcription fidelity. However, their cloud-based nature introduces significant responsiveness issues. GTTS performs mid-tier in latency (10th-11th ranks across most inputs), while EdgeTTS frequently ranks near the bottom (11th-12th at P90) due to network variability and API response times. These results underline a key trade-off: while cloud services can deliver exceptional quality, their unpredictable latency makes them unsuitable for deterministic real-time deployments.
Among locally hosted neural architectures, VITS strikes the best balance between speed and quality. It ranks mid-upper for latency (median ranks 4-6, P90 ranks 4-5) and delivers the lowest WER among all neural models (approximately 0.65), with incorrect rates around 5-7%. This combination makes VITS a strong candidate where moderate latency is acceptable but higher audio fidelity is desired, such as offline synthesis or user-facing content generation.
Other models - including GlowTTS, NeuralHMM, Overflow, Tacotron2-DCA, and Capacitron variants - occupy the mid-tier, with moderate rankings in both responsiveness and quality. GlowTTS is competitive at low latency but slightly less stable at P90; NeuralHMM and Overflow degrade as input length grows; and Capacitron 50/150 gain little from scaling, remaining slower and less accurate than top performers.
At the bottom, Tacotron2-DDC, Vall-E-X, and BarkTTS show severe limitations. Tacotron2-DDC suffers catastrophic tail-latency failures (e.g., 16.49s P90 for 1-syllable words) due to attention/stop-token misalignment, despite decent median scores. Vall-E-X and BarkTTS, meanwhile, consistently rank last in both latency and quality: their heavy autoregressive pipelines result in extreme latencies (median >2s, P90 >12-45s) and high error rates (25-32%), making them impractical for any interactive use case.
In summary, Microsoft Speech emerges as the best all-round TTS system in this study - unrivaled in its ability to combine ultra-low latency with near-perfect quality. FastPitch is the fastest neural option, ideal when responsiveness outweighs quality, while EdgeTTS and GTTS lead in fidelity but are hampered by cloud-induced delays. For local neural solutions balancing both, VITS provides the strongest alternative, though it still trails Microsoft Speech in overall stability.
6.2. Unusual Patterns in Performance Metrics
While most models exhibit predictable behavior in terms of latency and synthesis quality, certain models display unexpected performance trends that merit closer examination. These anomalies provide deeper insights into the internal workings of the architectures and highlight potential optimizations or inefficiencies in model design.
A particularly notable irregularity arises in the performance of Tacotron2-DDC (Double Decoder Consistency). While most models maintain predictable scaling between median and tail latencies, Tacotron2-DDC exhibits extreme instability. For example, in the 1-syllable word category it records a median latency of 0.17 s, yet its P90 latency spikes to 16.49 s. This pattern repeats across all inputs — 2-syllable words (median 0.26 s vs. P90 15.95 s) and LLM 2-word phrases (median 0.45 s vs. P90 21.99 s) — indicating severe tail-latency issues irrespective of text length or complexity. Such behavior suggests systemic alignment failures in its dual-decoder mechanism: when the coarse and fine decoders desynchronize, the model enters extended decoding loops, inflating inference times even on otherwise short prompts. These anomalies render Tacotron2-DDC unsuitable for latency-critical deployments, despite its competitive median responsiveness.
Beyond individual strengths at specific input lengths, the latency rankings also reveal which models maintain stable performance across all text categories. Microsoft Speech stands out as the clearest example of consistency: its median latency remains at 0.06–0.07 s across every input type, from single-syllable words to 18-word sentences, with 90th percentile latency barely exceeding 0.08 s. This indicates virtually no degradation as inputs grow longer, a rare trait among all evaluated models. Combined with its leading audio quality — consistently achieving the lowest incorrect-file rates and word-error rates across both 12-word and 18-word sentences — Microsoft Speech delivers exceptional balance between responsiveness and fidelity. These qualities make it particularly well-suited for general-purpose voice assistants, where reply lengths can vary widely yet both latency and intelligibility must remain predictably high.
FastPitch also displays relatively stable trends: its median latency only rises from 0.06 s (1-syllable words) to 0.13 s (18-word sentences). It achieves competitive audio quality, though not as consistently strong as Microsoft Speech or VITS. By contrast, autoregressive models such as Tacotron2-DDC and BarkTTS degrade sharply: Tacotron2-DDC’s 90th percentile latency balloons from 16.49 s on 1-syllable words to 21.99 s on LLM 2-word phrases, while BarkTTS climbs from 13.01 s on 1-syllable words to 78.86 s on 12-word sentences. These findings highlight how Microsoft Speech — and, to a slightly lesser degree, FastPitch and GlowTTS — provide predictable real-time performance while balancing speed and quality, a critical property for interactive systems where worst-case latency and transcription accuracy both impact user experience.
A notable anomaly is observed in GlowTTS, where latency for longer inputs is sometimes comparable to — or even lower than — that of shorter ones. For instance, GlowTTS records a median latency of 0.10 s for 1-syllable words but achieves 0.09 s for the most common single words and maintains 0.23 s for LLM-generated 2-word phrases. Despite handling more content, its per-word latency does not scale upward as expected, and in some cases, longer phrases are synthesized proportionally faster. This behavior suggests that GlowTTS benefits from parallel flow-based synthesis and stable duration modeling, allowing it to maintain efficient performance even as input complexity increases. In practical applications, this implies GlowTTS can process full-sentence prompts with minimal latency overhead relative to single-word utterances, making it especially effective in scenarios where reply lengths vary dynamically.
In contrast, Vall-E-X and BarkTTS perform markedly worse than other models in terms of latency, with inefficiencies evident across every input length tested. For short single-word inputs, Vall-E-X records a median latency of 2.20 s with 90th percentile latency spiking to 11.72 s, while BarkTTS fares even worse at 5.73 s median and 13.01 s P90. Performance deteriorates further with longer sequences: for LLM-generated 18-word sentences, Vall-E-X reaches 17.79 s median and 45.38 s P90, and BarkTTS records 14.75 s median with an extreme 36.00 s at the 90th percentile.
These values place both models firmly at the bottom of the ranking across all categories, reflecting their computationally heavy autoregressive architectures and lack of optimization for low-latency inference. Their high tail latencies also translate into poor throughput: although both models prioritize high-fidelity, expressive speech generation, their computational demands render them impractical for real-time, interactive, or otherwise latency-sensitive voice assistant deployments.
These unusual performance patterns underscore the complexities inherent in text-to-speech benchmarking. Future research will need to focus on stabilizing tail latency, especially in hybrid or multi-stage decoder architectures like Tacotron2-DDC. Potential approaches include adopting blockwise or speculative decoding methods to limit worst-case sequential steps [50], introducing monotonic alignment constraints or probabilistic duration models as used in Neural HMMs [36], and redesigning decoder pipelines to separate coarse and fine-grained generation stages with explicit convergence checks. Furthermore, the unexpected speed-ups observed for longer sequences suggest opportunities to exploit batching or hierarchical context windows, where multiple phonemes or words are processed jointly to amortize computation. These findings may also guide training strategies such as curriculum learning (progressively increasing input length) or targeted fine-tuning for structured versus unstructured texts, ensuring models maintain both stability and efficiency across diverse input types. More broadly, this highlights the importance of detailed benchmarking to uncover counterintuitive trends that can inform architectural refinements and future design priorities.
6.3. Cloud-Based TTS Services: GTTS vs. EdgeTTS
Cloud-based TTS services provide an alternative to locally deployed models, offering scalable synthesis capabilities with varying constraints on performance and accessibility. Two cloud-based models evaluated in this study are GTTS (Google Text-to-Speech) and EdgeTTS. These models differ significantly in availability, with GTTS imposing usage limits while EdgeTTS allows unrestricted access. This distinction is crucial for real-time applications, where consistent availability is essential.
When examining responsiveness, GTTS demonstrates moderate yet stable latency behavior across input types. As shown in Table 3, its median latency ranges from 0.43 s on single-word inputs to 0.93 s for 18-word sentences, with 90th-percentile latencies remaining between 0.65 s and 1.39 s. While this performance does not match the ultra-low latency of models such as FastPitch or Microsoft Speech, it remains predictable and comfortably within real-time thresholds across all tested scenarios. EdgeTTS, by contrast, performs notably worse in responsiveness: its median latency spans 1.11 s to 1.78 s, and 90th-percentile latencies rise as high as 2.41 s for longer inputs. The disparity is most evident in shorter phrases, where GTTS sustains sub-0.5 s median latency, whereas EdgeTTS more than doubles that figure.
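For context, EdgeTTS is typically accessed through the community edge-tts package, whose asynchronous Communicate/save interface writes the audio file once synthesis completes; the sketch below, with an assumed preset voice, shows how such a request could be timed.

```python
# pip install edge-tts
import asyncio
import time

import edge_tts

async def edge_tts_latency(text: str, out_path: str = "edge.mp3",
                           voice: str = "en-US-AriaNeural") -> float:
    """Time one EdgeTTS synthesis request, network round trip included."""
    t0 = time.perf_counter()
    await edge_tts.Communicate(text, voice).save(out_path)  # returns once the file is written
    return time.perf_counter() - t0

# latency_s = asyncio.run(edge_tts_latency("Hello from the benchmark."))
```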
Audio quality results for 12-word and 18-word sentence synthesis, summarized in Table 4, further highlight differences between the two cloud-based systems. GTTS achieves overall word error rates (WER) of 0.40 for 12-word inputs and 0.39 for 18-word inputs, with median WERs of 16.67 and 10.82, respectively. EdgeTTS records nearly identical overall WER values (0.41 and 0.38) and slightly higher median WERs (16.67 and 11.11). Both models therefore provide comparable transcription fidelity, and both substantially outperform models like NeuralHMM (overall WER 8.48–8.50) or BarkTTS (5.28–4.20), which suffer from high misalignment and token omission rates.
Incorrect transcription rates reinforce this finding: GTTS produces 2.84% incorrect files for 12-word inputs and 4.25% for 18-word inputs, while EdgeTTS reports 3.02% and 4.02%, respectively. These errors primarily arise from phoneme-duration misalignments rather than network instability, indicating that both cloud systems are robust to input length. Taken together, GTTS offers a slight advantage in responsiveness, while EdgeTTS delivers similar audio quality but suffers higher latency — a trade-off that may constrain its use in latency-critical deployments while remaining viable for general-purpose synthesis scenarios.
The primary trade-off between gTTS and EdgeTTS is a balance of availability versus quality. Looking ahead, major TTS providers already expose configurable performance profiles: for example, OpenAI’s Realtime API for low-latency speech interactions (alongside its separate, higher-latency Chat Completions audio path), Microsoft Azure’s guidance on chunked/streamed synthesis to reduce first-byte and finish latency, and ElevenLabs’ streaming endpoints for real-time audio generation.
Commercial tiers and usage limits are likewise segmented: Google Cloud TTS prices Standard, WaveNet, and Neural2 voices separately, while Amazon Polly differentiates Standard, Neural, Long-Form, and Generative voices with distinct quotas and rates. Meanwhile, industry forecasts project continued growth of the voice user interface market through 2032, reflecting rising demand for scalable yet stable voice technology, pressure that will drive convergence toward offerings blending reliability with accessibility.
In the end, this benchmark provides a dedicated, systematic evaluation of open-source TTS models with a focus on responsiveness—a metric critical to real-time voice assistant performance yet largely absent from prior studies. By standardizing latency and tail-latency measurement across multiple architectures, the framework establishes a reproducible baseline for future research and enables fair comparison with commercial systems. Our findings highlight the strengths of flow-based and parallel neural models, as well as built-in CPU-based systems, for low-latency deployment, while exposing persistent bottlenecks in autoregressive approaches. Beyond benchmarking, this work offers practical insights for selecting TTS models in latency-sensitive applications and sets the stage for integrating similar evaluations into end-to-end voice assistant pipelines, ultimately supporting the development of more natural, real-time human–AI interactions.
Author Contributions
Conceptualization, H.P.T.D., A.C., and R.A.P.; methodology, H.P.T.D., A.C. and R.A.P.; software, H.P.T.D.; validation, H.P.T.D.; formal analysis, H.P.T.D.; investigation, H.P.T.D.; resources, H.P.T.D.; writing—original draft preparation, H.P.T.D.; writing—review and editing, A.C., R.A.P. and M.L.; visualization, H.P.T.D.; supervision, A.C. and M.L.; project administration, A.C. All authors have read and agreed to the published version of the manuscript.