Test-Time Adaptation for Personal Voice Activity Detection: VAD-Gated Test-Time Training and Speaker Embedding Adaptation

Tai-You Chen; Chien-Chia Chiu; Jung-Shan Lin; Jeih-Weih Hung

doi:10.20944/preprints202606.1411.v1

Submitted: 18 June 2026

Posted: 18 June 2026

Abstract

Personal voice activity detection (PVAD) identifies whether each detected speech frame originates from a designated target speaker. Modern PVAD systems are typically trained offline and then deployed with frozen model parameters and a fixed, pre-enrolled speaker embedding, leaving them unable to adapt to distribution shifts at inference time such as unseen acoustic environments, changing speaking styles, or mismatches between enrollment and test conditions. Test-time training (TTT) and test-time adaptation have shown promise in language, vision, and several speech tasks, yet their behavior on PVAD has not been studied. In this work, we present an empirical study of two complementary test-time adaptation mechanisms built on top of the recently proposed FDE-Mamba backbone. The first is a VAD-gated TTT adapter, which instantiates the TTT-Linear formulation within the personalization pathway and augments it with a VAD-probability gate and exponential moving-average stabilization, adapting an internal weight matrix on the speaker-conditioned feature stream of each test utterance. The second is TEA (Test-time Embedding Adaptation), a scheme that keeps all model parameters frozen and instead adapts the target speaker d-vector itself via self-supervised objectives at inference time, directly targeting enrollment–test mismatch. We evaluate both mechanisms on the LibriSpeech PVAD benchmark across two backbones (LSTM-based FDE-RNN and Mamba-based FDE-Mamba), reporting category-wise average precision, mean average precision (mAP), accuracy, recall, precision, and real-time factor. Our results show that test-time adaptation yields consistent but modest gains over the FDE-Mamba baseline (e.g., mAP from 0.9605 to 0.9641 and precision from 0.881 to 0.899), while slightly reducing recall and increasing inference cost when TEA is enabled. 
Through ablation studies, we quantify the independent and combined contribution of each component, characterize the recall–precision trade-off introduced by adaptation, and isolate the effect of post-hoc Gaussian smoothing so that its benefit is not conflated with that of the adaptation mechanisms. These findings, together with a discussion of their limitations and cost–benefit profile, provide a measured baseline and design insights for future work on adaptive PVAD, particularly under stronger acoustic and enrollment mismatches than those captured by the LibriSpeech protocol.

Keywords: personal voice activity detection; voice activity detection; test-time training; test-time adaptation; test-time embedding adaptation; Mamba; selective state space model; entropy minimization

Subject: Engineering - Electrical and Electronic Engineering

1. Introduction

Voice-based human–machine interaction has become a central interface in modern smart devices, creating sustained demand for robust and efficient speech processing front-ends. Voice Activity Detection (VAD), which classifies incoming audio frames as speech or non-speech, is a fundamental preprocessing stage for downstream systems such as automatic speech recognition and voice assistants [1,2]. Beyond filtering out non-speech regions, VAD also plays an important role in reducing power consumption and managing computational resources on resource-constrained devices.

Conventional VAD, however, cannot determine whether detected speech belongs to a specific target speaker—a limitation that becomes critical in multi-speaker environments. Personal Voice Activity Detection (PVAD) was introduced to address this gap, extending the binary VAD formulation into a three-way classification over target speaker speech (tss), non-target speaker speech (ntss), and non-speech (ns) [3]. By conditioning on a target speaker embedding derived from a speaker verification model, PVAD systems can respond selectively to a designated speaker while suppressing competing voices, making them well suited to personal voice assistant applications. A substantial body of work has refined how speaker embeddings are fused with acoustic features [4,5,6,7], culminating most recently in the Flexible Dynamic Encoder RNN (FDE-RNN) [8] and its selective state-space successor, FDE-Mamba [9], which together represent the current state of the art on this benchmark.

Despite these architectural advances, essentially all PVAD systems share a common deployment assumption: once trained, the model is frozen and operates with a fixed, pre-enrolled speaker embedding. The model parameters do not change in response to the audio actually encountered at inference time, and the speaker representation is computed once from enrollment speech and then held constant.

This static design is particularly consequential for PVAD, more so than for ordinary VAD. Because PVAD makes a speaker-conditioned decision, its accuracy depends not only on how well the model separates speech from non-speech, but also on how faithfully the fixed enrollment embedding represents the target speaker as they actually sound in the test utterance, and on whether the acoustic conditions of that utterance resemble those seen during training. Both assumptions are fragile in practice. The acoustic environment at test time—noise, reverberation, recording channel, and device—routinely differs from the training distribution, shifting the feature statistics on which the speaker-discrimination decision is based. A speaker’s voice is itself non-stationary: speaking rate, vocal effort, emotional state, and health vary across sessions, so an embedding enrolled on one occasion may be a poor match for the same speaker on another. The enrollment utterance may also be short, noisy, or recorded under conditions unlike the test signal, yielding an embedding that is biased from the outset. In each of these cases the mismatch falls precisely on the speaker-conditioning pathway that distinguishes PVAD from VAD, and a model that cannot adapt at inference time has no recourse: it must rely entirely on the generalization acquired during offline training, even when the test utterance in front of it carries information that could be used to correct the mismatch. This observation motivates asking whether PVAD can instead adapt to each test utterance as it is processed.

A growing line of research addresses exactly this problem through test-time adaptation, a family of methods in which a trained model adjusts itself to the test distribution using only the unlabeled test signal, without any ground-truth labels. The central difficulty is that adaptation must be driven by a signal that is available at inference, and methods differ in which self-supervised signal they use and which quantities they update. Entropy minimization, exemplified by TENT [10], updates a small set of model parameters (typically the affine parameters of normalization layers) so as to reduce the entropy of the model’s own predictions, on the premise that confident predictions tend to be correct ones; it has been applied to speech recognition and to speaker adaptation for lip reading [11], usually resetting the adapted parameters after each utterance. Test-Time Training (TTT) takes a different route. Introduced by Sun et al. [12] to combat train–test distribution shift, it pairs the main task with an auxiliary self-supervised task during training; at inference, the self-supervised task is used to fine-tune the model on each test input before the main prediction is made, so that the representation is specialized to the local statistics of that input. A recent reformulation [13] casts TTT as a sequence-modeling layer whose hidden state is itself a small machine-learning model: the hidden state is a weight matrix $W$, and the update rule is a step of gradient descent on a self-supervised reconstruction loss applied at every position, so that $W$ keeps learning even as it processes a test sequence. This view is appealing for a streaming, speaker-conditioned task such as PVAD for two reasons. First, the self-supervised reconstruction signal is computed directly from the incoming features, so adaptation is driven by the very utterance whose speaker decision is at stake, which is exactly the information a frozen model discards. Second, because the adapted state is a weight matrix updated online, it can absorb a distribution shift expressed jointly across many feature dimensions, rather than requiring the shift to be attributed to any single pre-specified factor. TTT has shown benefits in language modeling [13], computer vision [12], and more recently in speech tasks such as speech enhancement [14] and speech editing, where adapting to unseen noise or speaker conditions at inference improves output quality.

To date, however, the behavior of these adaptation mechanisms on PVAD has not been investigated. PVAD differs from previously studied tasks in two respects that make this an open question rather than a foregone conclusion: it is an inherently speaker-dependent task, and its decision depends jointly on the acoustic signal and an externally supplied speaker representation. It is therefore not obvious a priori whether test-time adaptation helps, how large any benefit is, or—given that PVAD has two adaptable surfaces, the internal features and the speaker embedding—which part of the system should adapt. These are empirical questions specific to PVAD, and answering them is the purpose of this study.

In this work we conduct an empirical study of test-time adaptation for PVAD, built on the FDE-Mamba backbone. We investigate two complementary mechanisms. The first, a VAD-gated TTT adapter, instantiates the TTT-Linear formulation [13] within the personalization pathway, adapting an internal weight matrix on the speaker-conditioned feature stream of each utterance. Crucially, this adapter operates on features after speaker-embedding fusion, so the distribution shift it absorbs reflects the combined influence of acoustic conditions, speaking style, and speaker characteristics on the test signal, rather than any single factor in isolation. We tailor the adapter to PVAD with a VAD-probability gate, which restricts adaptation to frames classified as speech, and an exponential moving-average update, which stabilizes the weight trajectory during inference. The second mechanism, TEA (Test-time Embedding Adaptation), performs test-time adaptation of the target speaker embedding: with all model parameters frozen, it updates the d-vector itself through self-supervised objectives, directly targeting the enrollment–test mismatch. Throughout, we treat the study as a measurement exercise: we quantify each component’s contribution independently and in combination, report a recall–precision trade-off that adaptation introduces, and separately isolate the effect of a post-hoc Gaussian smoothing step so that its contribution is not conflated with that of the adaptation mechanisms.

The contributions of this paper are as follows:

To the best of our knowledge, this is the first work to apply Test-Time Training to PVAD. We adapt the TTT-Linear formulation with a VAD-probability gate and an exponential moving-average update tailored to the speech-dependent, streaming nature of PVAD, and we clarify that the adapter absorbs a composite distribution shift over the speaker-conditioned feature stream rather than adapting the speaker embedding alone.
We introduce TEA (Test-time Embedding Adaptation), a test-time speaker embedding adaptation scheme for PVAD. While entropy-minimization-based test-time adaptation has been studied for other speech tasks, we are, to our knowledge, the first to apply it to PVAD and, distinctively, to adapt the target speaker d-vector itself rather than the model’s internal parameters.
We present a systematic empirical study that quantifies the independent and combined effect of each mechanism, analyzes the recall–precision trade-off they induce, isolates the contribution of post-hoc Gaussian smoothing, and shows that the observed behavior holds across both an LSTM-based (FDE-RNN) and a Mamba-based (FDE-Mamba) backbone. We report the gains as consistent but modest, and discuss the limitations that this implies for adaptive PVAD.

The remainder of this paper is organized as follows. Section 2 reviews PVAD and test-time adaptation. Section 3 describes the FDE-Mamba backbone and the two adaptation mechanisms. Section 4 details the experimental setup, and Section 5 reports the main results against the FDE-Mamba backbone, an ablation of each component, the computational cost, and a generalization check on the FDE-RNN backbone, followed by a discussion of limitations. Section 6 concludes.

2. Related Work

2.1. Personal Voice Activity Detection

Personal Voice Activity Detection extends conventional VAD by incorporating target speaker information, enabling the system to distinguish target speaker speech (tss) from non-target speaker speech (ntss) and non-speech (ns). The seminal work of Ding et al. [3] introduced a speaker-conditioned framework that fuses a d-vector speaker embedding [15] with acoustic features for ternary frame-level classification. Personal VAD 2.0 [4] subsequently reduced model complexity for on-device speech recognition while maintaining competitive performance.

Later work explored richer strategies for integrating speaker and acoustic information. AS-pVAD [6] proposed a frame-wise attentive score loss to sharpen target-speaker detection; COIN-AT-PVAD [16] introduced conditional intermediate attention within a stacked encoder–decoder; Makishima et al. [17] studied enrollment-free training to remove the dependency on explicit enrollment utterances; and SE-PVAD [18] replaced log-Mel features with a learnable extractor for improved input discrimination. Most recently, FDE-RNN [8] proposed a flexible dynamic architecture that decouples VAD and PVAD into a shared Dynamic Encoder RNN backbone and a detachable personalization module, using a gating-based skipping mechanism to bypass redundant computation on non-speech frames. FDE-Mamba [9] subsequently replaced the LSTM components of FDE-RNN with selective state-space (Mamba) [19] blocks, improving accuracy and reducing the real-time factor while preserving the dynamic gating design.

A common thread across these systems is a doubly static deployment assumption: the speaker embedding is computed once from enrollment speech and then held fixed, and the model itself is frozen after training. The first assumption leaves the system unable to correct a mismatch between the enrollment embedding and the target speaker as they actually sound at test time; the second leaves it unable to adapt to the acoustic conditions of the test utterance—noise, reverberation, channel, or speaking style—when these differ from the training distribution. Robustness to such adverse conditions has been pursued primarily through training-time interventions, such as multi-condition training or self-supervised pretraining [20], which must anticipate the test distribution in advance. The complementary question of whether a trained PVAD model can instead be adapted at inference time, using the test signal itself—adapting both its internal representation to the acoustic conditions of each utterance and its speaker representation to the target—has received little attention, and it is this question that the present study addresses.

2.2. Test-Time Adaptation and Test-Time Training

Test-time adaptation denotes a family of methods that adjust a trained model to the test distribution using only the unlabeled test data encountered at inference, without ground-truth labels. Methods in this family are distinguished by two design choices: the self-supervised signal that drives adaptation, and the subset of quantities that are updated. We organize the relevant literature along these axes.

TTT as a distribution-shift remedy. Test-Time Training was introduced by Sun et al. [12] to combat train–test distribution shift. The model has a Y-shaped architecture: a shared feature encoder with parameters $\theta_e$ feeds two branches, a main-task head $\theta_m$ and a self-supervised auxiliary head $\theta_s$. During training, all three parameter sets are optimized jointly. At inference, for each individual test input the auxiliary head provides a self-supervised loss $L_s$ (computed without labels), and a few gradient steps on $L_s$ update only the shared encoder $\theta_e$ (the two heads stay fixed) before the main prediction is read out. Because the adaptation is performed per input and then discarded, TTT treats each test example as a one-sample learning problem, trading additional inference computation for robustness to shifts unseen during training. Subsequent work extended this principle to a range of domains, including 3D reconstruction and medical imaging, and explored which subset of parameters to adapt (from normalization statistics to low-rank or bias-only updates) in order to control the cost and stability of the inner-loop update.

TTT as a sequence-modeling layer. A more recent line of work [13] reinterprets TTT not as an add-on adaptation procedure but as a sequence-modeling layer in its own right. The key idea is to make the layer’s hidden state a small machine-learning model whose parameters are a weight matrix $W$, and to define the recurrence as a step of self-supervised learning: at each position $t$, given an input feature (denoted $\tilde{F}_t$ to match the notation we use later for the speaker-conditioned stream), $W$ is updated by gradient descent on a reconstruction loss between a corrupted and a clean projection of $\tilde{F}_t$, and the layer output $z_t$ is $\tilde{F}_t$ mapped through the updated $W$. The authors instantiate this as TTT-Linear and TTT-MLP, where $W$ parameterizes a linear map and a two-layer network, respectively. Viewed this way, the hidden state “learns” on the test sequence with linear inference complexity, and the construction has been shown to compete with attention on long-context language modeling and to transfer to video generation and image classification. Our TTT adapter builds on this formulation; we describe the modifications it requires for PVAD in Section 3.3. Figure 1 contrasts the two formulations; symbols are defined here and, for the components specific to our model, in Section 3.

Entropy minimization and parameter-level adaptation. A parallel and widely used approach is entropy minimization. TENT [10] adapts the affine parameters of normalization layers at test time so as to minimize the entropy of the model’s predictions, on the premise that lower-entropy predictions are more reliable. Later methods refine this with confidence-based sample filtering, sequence-level objectives suited to structured outputs, and moving-average stabilization for continual adaptation [21]. What these methods share is that the adapted quantity is a set of model parameters—most often normalization or affine weights—rather than any input to the model.

Test-time adaptation in speech. Within speech processing, both threads have begun to appear. Dumpala et al. [22] introduced TTT for speech, reporting improvements on several downstream tasks under shifts in speaking style, gender, and age. TTT has since been applied to speech enhancement [14], where a self-supervised auxiliary task lets the model adapt to unseen noise at inference, and to speech editing. On the entropy-minimization side, test-time adaptation has been used for speech recognition and for speaker adaptation in lip reading [11], typically updating a subset of model parameters and resetting them after each utterance.

Two gaps follow from this survey and motivate our study. First, although TTT has been applied to several speech tasks, it has not been applied to PVAD; PVAD’s speaker-dependent, dual-input nature (a decision jointly determined by the acoustic stream and an external speaker embedding) makes its response to TTT an open empirical question rather than a predictable transfer. Second, while test-time adaptation by entropy minimization is itself well established, prior work adapts model parameters; the alternative of adapting the target speaker embedding at test time, an option that exists only for speaker-conditioned tasks such as PVAD, has not to our knowledge been examined. Our study addresses both gaps, adapting the internal feature transformation via TTT (Section 3.3) and the speaker embedding via TEA (Section 3.4).

3. Proposed Method

We study two test-time adaptation mechanisms layered on top of the FDE-Mamba backbone. Section 3.1 briefly recaps the backbone, and Section 3.2 gives an overview of where the two modifications are placed. Section 3.3 introduces the VAD-gated TTT adapter, an in-model mechanism that adapts an internal weight matrix on the speaker-conditioned feature stream of each utterance. Section 3.4 introduces TEA, an inference-time procedure that adapts the target speaker embedding while keeping all model parameters frozen. Section 3.5 clarifies that the two mechanisms act at distinct representation levels, and Section 3.6 describes the post-hoc Gaussian smoothing step that we treat as a separate, clearly delineated component.

Notation. We use bold lowercase for vectors and bold uppercase for matrices. A test utterance is a sequence of $T$ frames. The acoustic input is $x_{1:T}$ with per-frame feature $x_t \in \mathbb{R}^{F}$, where $F$ is the input feature dimension; $e \in \mathbb{R}^{E}$ is the $E$-dimensional target speaker embedding (a d-vector). The backbone hidden dimension is $D$, so the latent and speaker-conditioned features $h_t$, $F_t$, $\tilde{F}_t$ all lie in $\mathbb{R}^{D}$, and the TTT weight matrix is $W \in \mathbb{R}^{D \times D}$. Frame-level outputs are the VAD logits $v_t \in \mathbb{R}^{2}$ (non-speech vs. speech) and the PVAD logits $p_t \in \mathbb{R}^{3}$ (ns, ntss, tss). The subscript $[s]$ selects the speech component of a softmax output, $\odot$ denotes the element-wise (Hadamard) product, $(\cdot)^{\top}$ the matrix transpose, and $\|\cdot\|$ the Euclidean norm. A parenthesized superscript $(\cdot)^{(k)}$ indexes an inner-loop iteration. Table 1 summarizes the symbols.

Figure 2 contrasts the two mechanisms at a conceptual level before we detail them. They differ in what they adapt and when: TTT updates a weight matrix inside the model as it processes each sequence, whereas TEA keeps the model frozen and updates the speaker embedding fed into it, once per utterance.

3.1. FDE-Mamba Backbone

The FDE-Mamba backbone [9], whose flowchart is depicted in Figure 3, follows the two-stage design of FDE-RNN [8], comprising a Dynamic Encoder for VAD and a detachable Personalization module for PVAD, with all three recurrent components instantiated as Mamba [19] blocks. Given a sequence of acoustic features $x_{1:T}$ with $x_t \in \mathbb{R}^{F}$ and a target speaker embedding $e \in \mathbb{R}^{E}$, the backbone produces frame-level VAD logits $v_t$ and PVAD logits $p_t$. The prediction block first yields a VAD probability $p_{\mathrm{vad}}(t) = \mathrm{softmax}(v_t)_{[s]}$, the speech-class entry of the softmax over the VAD logits; the encoder block produces a latent representation $h_t$, and a weighted residual connection combines the encoder output with the raw input according to VAD confidence:

$$F_t = h_t + \bigl(1 - p_{\mathrm{vad}}(t)\bigr)\, x_t. \tag{1}$$

The fused feature is then modulated by the speaker embedding through a Feature-wise Linear Modulation (FiLM) layer [5]:

$$\tilde{F}_t = \gamma(e) \odot F_t + \beta(e), \tag{2}$$

where $\gamma(e)$ and $\beta(e)$ are the scale and shift vectors generated from $e$, and $\odot$ denotes the element-wise product. The speaker-conditioned feature $\tilde{F}_t$ is processed by the back-end Mamba block and a linear classifier to produce the PVAD logits. In the original FDE-Mamba, $\tilde{F}_{1:T}$ passes directly into the back-end block; in this work, we insert the VAD-gated TTT adapter at exactly this point, so that the adapter operates on the speaker-conditioned stream.
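As a rough illustration of Eqs. (1) and (2), the NumPy sketch below computes the VAD-weighted residual and the FiLM modulation for a single frame. It rests on assumptions that the text does not state: the input and hidden dimensions are taken to be equal ($F = D$) so that $h_t$ and $x_t$ can be added directly, the VAD logits are ordered (non-speech, speech), and the FiLM generator is modeled as two hypothetical linear maps `W_gamma` and `W_beta`.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = z - np.max(z, axis=-1, keepdims=True)
    ez = np.exp(z)
    return ez / np.sum(ez, axis=-1, keepdims=True)

def fuse_and_modulate(h_t, x_t, v_t, e, W_gamma, W_beta):
    """Apply Eq. (1) and then Eq. (2) to one frame.

    Assumptions (not from the paper): F == D, VAD logits ordered
    (non-speech, speech), and a purely linear FiLM generator.
    """
    p_vad = softmax(v_t)[1]            # speech-class probability
    F_t = h_t + (1.0 - p_vad) * x_t    # Eq. (1): weighted residual
    gamma = W_gamma @ e                # scale vector gamma(e)
    beta = W_beta @ e                  # shift vector beta(e)
    F_tilde = gamma * F_t + beta       # Eq. (2): FiLM modulation
    return p_vad, F_t, F_tilde

# Toy values chosen so the result is easy to check by hand.
D, E = 4, 3
h_t = np.ones(D)
x_t = np.full(D, 2.0)
v_t = np.array([0.0, 0.0])             # equal logits, so p_vad = 0.5
e = np.array([1.0, 0.0, 0.0])
W_gamma = np.ones((D, E))              # gives gamma(e) = 1 for this e
W_beta = np.zeros((D, E))              # gives beta(e) = 0
p_vad, F_t, F_tilde = fuse_and_modulate(h_t, x_t, v_t, e, W_gamma, W_beta)
```

With these toy values the raw input contributes with weight $1 - p_{\mathrm{vad}}(t) = 0.5$, which shows how Eq. (1) passes more of $x_t$ through when the VAD stage is less confident that the frame is speech.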

3.2. Overview of the Proposed Modifications

We modify the personalization module of FDE-Mamba at two points, as illustrated in Figure 4. First, within the interaction block, we insert a VAD-gated TTT adapter between the FiLM layer and the Mamba block. This adapter maintains an internal weight matrix $W$ that is updated at every chunk by a self-supervised reconstruction signal, gated by the VAD probability $g_t$ so that only speech frames drive the adaptation. It adapts the speaker-conditioned feature stream $\tilde{F}_{1:T}$ to the acoustic and speaker conditions of the current test utterance. Second, upstream of the interaction block, we introduce TEA (Test-time Embedding Adaptation), which adapts the target speaker embedding $e$ itself before it enters the FiLM generator. With all model parameters frozen, TEA performs a small number of gradient steps on $e$ using self-supervised objectives computed from the model’s output, directly targeting the enrollment–test mismatch.

The two modifications thus act at different points in the data flow: the TTT adapter operates inside the model on the feature representation, while TEA operates on the input embedding. Their conceptual contrast is shown in Figure 2; Figure 4 shows their concrete placement within the personalization module. Additionally, a Gaussian smoothing step (Section 3.6) is applied as model-agnostic post-processing on the output probabilities; it is unrelated to the adaptation mechanisms and is reported separately. Section 3.3 and Section 3.4 detail each adaptation mechanism in turn.

3.3. VAD-Gated Test-Time Training Adapter

Background: TTT-Linear. The TTT-Linear layer [13] treats its hidden state as a linear model whose weight matrix $W$ is updated by a step of self-supervised learning at each position. For an input token $u_t \in \mathbb{R}^{D}$, the weight is updated by gradient descent on a reconstruction loss $\ell(W; u_t) = \|W u_t - u_t\|^{2}$, which measures how well the linear map $W$ reconstructs its own input:

$$W_t = W_{t-1} - \eta\, \nabla_{W} \ell(W_{t-1}; u_t), \qquad z_t = W_t u_t, \tag{3}$$

where $\eta$ is the inner-loop learning rate. Because $W$ continues to be updated on the test sequence, the layer adapts its representation to the local statistics of each input even at inference time. Directly inserting Eq. (3) into the PVAD personalization pathway, however, raises three design questions specific to this task. We address each in turn.
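To make Eq. (3) concrete, the following sketch performs one TTT-Linear step in NumPy. It uses the closed-form gradient $\nabla_W \|W u - u\|^2 = 2\,(W u - u)\,u^{\top}$; the function names and the toy input are illustrative choices, not taken from [13].

```python
import numpy as np

def recon_loss(W, u):
    """Reconstruction loss ||W u - u||^2 for one token."""
    return float(np.sum((W @ u - u) ** 2))

def ttt_linear_step(W, u, eta):
    """One update of Eq. (3): gradient step on W, then read out z = W_new u."""
    residual = W @ u - u
    grad = 2.0 * np.outer(residual, u)   # closed-form gradient w.r.t. W
    W_new = W - eta * grad
    z = W_new @ u
    return W_new, z

# Toy token with ||u||^2 = 9 and a zero-initialized weight matrix.
W = np.zeros((3, 3))
u = np.array([1.0, 2.0, 2.0])
W1, z1 = ttt_linear_step(W, u, eta=0.01)
loss_before = recon_loss(W, u)    # 9.0
loss_after = recon_loss(W1, u)    # 0.82**2 * 9 = 6.0516
```

Starting from $W = 0$, one step gives $W_1 = 2\eta\, u u^{\top}$, so $W_1 u = 2\eta \|u\|^2 u = 0.18\,u$ and the loss on the same token drops from 9 to about 6.05.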

3.3.1. Restricting Adaptation to Speech Frames via a VAD Gate

In a generic sequence model, every position contributes equally to the inner-loop update. In PVAD, however, a large fraction of frames are non-speech, and the reconstruction signal on those frames reflects background noise rather than speaker-relevant structure. Updating $W$ uniformly across all frames would therefore let non-speech frames steer the adaptation, diluting the speaker-conditioned information the adapter is meant to refine.

We address this by gating the inner-loop update with the VAD probability. Operating on the speaker-conditioned stream $\tilde{F}_{1:T}$ (Eq. (2)), and writing $g_t = p_{\mathrm{vad}}(t)$ for the (detached) VAD probability at frame $t$ (the speech-class softmax output of the prediction block, the same quantity used in the weighted residual of Eq. (1)), the gated update direction is

$$\Delta W_t = \frac{1}{\sqrt{D}} \bigl(W \tilde{F}_t - \tilde{F}_t\bigr) \bigl(g_t \tilde{F}_t\bigr)^{\top}, \tag{4}$$

where $D$ is the hidden dimension and the factor $1/\sqrt{D}$ keeps the magnitude of the update stable across feature dimensions. The gate $g_t$ scales the contribution of frame $t$ to the update by its probability of being speech: frames the VAD stage deems non-speech ($g_t \approx 0$) contribute little to $\Delta W_t$, while confident speech frames drive the adaptation. We refer to this as a VAD-gated adapter, since the VAD probability $g_t$ acts as a gate that confines the test-time weight update to frames the model itself judges to be speech. This couples the adapter to the model’s own VAD decision rather than to an external mask, and keeps the adaptation focused on the frames where speaker discrimination matters.
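The gated direction of Eq. (4) can be sketched for a single frame as follows. The gate `g` is passed in as a plain scalar, which mirrors the "detached" treatment in the text in that no gradient flows through it; the function name and toy feature vector are illustrative.

```python
import numpy as np

def gated_delta(W, F_tilde, g):
    """Eq. (4): VAD-gated update direction for one frame.

    g is the VAD probability in [0, 1], treated as a constant.
    """
    D = F_tilde.shape[0]
    residual = W @ F_tilde - F_tilde
    return np.outer(residual, g * F_tilde) / np.sqrt(D)

D = 4
W = np.zeros((D, D))
F_tilde = np.array([1.0, -1.0, 0.5, 2.0])
delta_speech = gated_delta(W, F_tilde, g=0.9)    # confident speech frame
delta_silence = gated_delta(W, F_tilde, g=0.0)   # non-speech frame: no update
```

Because $g_t$ enters linearly, halving the VAD probability halves the frame's contribution, and a frame with $g_t = 0$ leaves $W$ untouched.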

To keep the inner-loop update stable when adapting over long utterances, we additionally clamp the update with a scaled hyperbolic tangent applied element-wise to $\Delta W_t$,

$$\Delta W_t \leftarrow c \tanh\bigl(\Delta W_t / c\bigr), \qquad c = 0.1, \tag{5}$$

which bounds every entry of the update to $(-c, c)$ and limits the influence of outlier frames whose reconstruction error is large. We found this bounded update to be important for stability: without it, $W$ can accumulate large updates and drift over a long utterance.
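A short sketch of the clamp in Eq. (5) shows its two regimes: entries much smaller than $c$ pass through almost unchanged, while large entries saturate near $\pm c$. The example matrix is made up for illustration.

```python
import numpy as np

def clamp_update(delta, c=0.1):
    """Eq. (5): element-wise soft clamp of the update.

    In exact arithmetic every entry lies in (-c, c); in floating point
    very large inputs can saturate to exactly +c or -c.
    """
    return c * np.tanh(delta / c)

delta = np.array([[0.001, -0.05],
                  [5.0, -100.0]])
clamped = clamp_update(delta)
```

Since $\tanh(x) \approx x$ for small $x$, the clamp behaves like the identity for typical updates and only intervenes on outlier frames.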

3.3.2. Stabilizing the Weight Trajectory with an EMA Update

In TTT-Linear the weight matrix is overwritten at each step (Eq. (3)). Applied unmodified during inference on PVAD, this can let $W$ drift over a long utterance, since each update is taken on a single, noisy reconstruction signal and there is no outer-loop loss to correct accumulated error. To stabilize the trajectory, we combine the gradient step with an exponential moving average (EMA). The post-step weight applies the (clamped) update of Eq. (5) with the inner-loop rate $\eta$ (the factor of 2 arises from differentiating the squared reconstruction loss $\ell = \|W \tilde{F}_t - \tilde{F}_t\|^{2}$):

$$W^{\mathrm{step}} = W - 2 \eta\, \Delta W_t. \tag{6}$$

The retained weight is then a convex combination of the previous and updated matrices,

$$W \leftarrow \alpha W + (1 - \alpha)\, W^{\mathrm{step}}, \tag{7}$$

where $\alpha \in [0, 1)$ is the EMA coefficient. A larger $\alpha$ retains more of the accumulated state and yields a smoother trajectory; $\alpha = 0$ recovers the plain TTT step. The adapter output applied to the feature stream uses the post-step weight,

$$z_t = \tilde{F}_t + W^{\mathrm{step}} \tilde{F}_t, \tag{8}$$

i.e., the adapter produces an additive correction to the speaker-conditioned feature rather than replacing it, so that the back-end block always receives the original feature plus a learned, input-dependent residual. The EMA-smoothed weight $W$ of Eq. (7) is carried to the next chunk as the running adaptation state, while the current chunk’s output uses the post-step weight $W^{\mathrm{step}}$.
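Eqs. (6)–(8) can be sketched as one function that returns both the weight carried forward and the weight used for the current output. The function name, argument order, and toy values are assumptions made for illustration.

```python
import numpy as np

def ema_ttt_update(W, delta_clamped, F_tilde, eta, alpha):
    """Apply Eqs. (6)-(8) given an already clamped update direction."""
    W_step = W - 2.0 * eta * delta_clamped        # Eq. (6): post-step weight
    W_next = alpha * W + (1.0 - alpha) * W_step   # Eq. (7): EMA-retained weight
    z = F_tilde + W_step @ F_tilde                # Eq. (8): residual output
    return W_next, W_step, z

W = np.zeros((2, 2))
delta = 0.05 * np.eye(2)
F_tilde = np.array([1.0, 2.0])
W_next, W_step, z = ema_ttt_update(W, delta, F_tilde, eta=0.5, alpha=0.9)
```

Substituting Eq. (6) into Eq. (7) gives $W \leftarrow W - 2(1 - \alpha)\,\eta\,\Delta W_t$, so under this formulation the state carried forward moves by a $(1 - \alpha)$ fraction of the step, while the current output in Eq. (8) still uses the full step. In the toy example above, `W_step` is $-0.05\,I$ and `W_next` is $-0.005\,I$.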

3.3.3. Chunked Updates for Streaming Operation

Performing one gradient step per frame (Eq. (3)) is both expensive and unstable on the relatively short, noisy feature stream of a PVAD utterance. We therefore divide the sequence into non-overlapping chunks of length $b$ and perform a single matrix-form update per chunk, accumulating the gated reconstruction signal over the $b$ frames within the chunk before applying Eqs. (6) and (7). The resulting weight $W$ is carried across chunks within an utterance, providing a running adaptation state, and is reset to its learned initialization $W_0$ at the start of each utterance. The chunk size $b$ trades off adaptation granularity against update stability, and is treated as a hyperparameter in our study (Section 5).
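The chunked procedure can be sketched end to end as below. This is a minimal sketch under stated assumptions rather than the authors' implementation: the per-frame directions of Eq. (4) are summed over each chunk (the text does not say whether a sum or a mean is used), the hyperparameter defaults are placeholders, and `W0` stands in for the learned initialization.

```python
import numpy as np

def vad_gated_ttt(F_tilde, g, W0, eta=0.1, alpha=0.9, b=4, c=0.1):
    """Run the VAD-gated TTT adapter over one utterance.

    F_tilde: (T, D) speaker-conditioned features.
    g:       (T,) VAD probabilities, treated as constants.
    Assumption: Eq. (4) is accumulated by summation within a chunk.
    """
    T, D = F_tilde.shape
    W = W0.copy()                          # reset to W_0 for each utterance
    out = np.empty_like(F_tilde)
    for start in range(0, T, b):
        X = F_tilde[start:start + b]       # (b, D) chunk of frames
        gate = g[start:start + b, None]    # (b, 1) gate per frame
        R = X @ W.T - X                    # row t holds W x_t - x_t
        delta = R.T @ (gate * X) / np.sqrt(D)     # Eq. (4), summed over chunk
        delta = c * np.tanh(delta / c)            # Eq. (5): soft clamp
        W_step = W - 2.0 * eta * delta            # Eq. (6)
        out[start:start + b] = X + X @ W_step.T   # Eq. (8) for every frame
        W = alpha * W + (1.0 - alpha) * W_step    # Eq. (7): carried forward
    return out, W

rng = np.random.default_rng(0)
T, D = 10, 3
F_tilde = rng.normal(size=(T, D))
W0 = np.zeros((D, D))
out_speech, W_speech = vad_gated_ttt(F_tilde, np.ones(T), W0)
out_silence, W_silence = vad_gated_ttt(F_tilde, np.zeros(T), W0)
```

With the gate at zero for every frame the weight never leaves `W0`, so the adapter reduces to a fixed residual map; with the gate at one the weight is updated once per chunk, including the final partial chunk when $T$ is not a multiple of $b$.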

Taken together, Eqs. (4)–(8) define the VAD-gated TTT adapter. We emphasize that the adapter operates on $\tilde{F}_{1:T}$, the feature stream after speaker fusion. The distribution shift it absorbs at test time is therefore a composite of acoustic conditions, speaking style, and speaker characteristics as they are jointly expressed in the speaker-conditioned representation; the adapter does not, and is not intended to, isolate any single one of these factors.

3.4. TEA: Test-Time Embedding Adaptation

The second mechanism, Test-time Embedding Adaptation (TEA), adapts a different part of the system. Rather than modifying any model parameter, it adapts the target speaker embedding $e$ itself, directly addressing the case where the enrollment embedding is mismatched with the test signal. With all model parameters frozen, the embedding is treated as the only free variable. Starting from the enrolled embedding $e^{(0)} = e$, TEA performs $K$ gradient steps on a self-supervised objective $L_{\mathrm{TEA}}$ computed from the test utterance,

$$e^{(k+1)} = e^{(k)} - \mu\, \nabla_{e} L_{\mathrm{TEA}}\bigl(e^{(k)}\bigr), \qquad k = 0, \dots, K - 1, \tag{9}$$

and then runs a final forward pass with the adapted embedding $e^{(K)}$ to produce the predictions. Because the procedure adapts only the input embedding and resets it for each utterance, it introduces no persistent change to the model and no additional parameters. In Eq. (9), $L_{\mathrm{TEA}}$ is a self-supervised objective requiring no ground-truth labels; we consider three instantiations:

Entropy. Following the entropy-minimization principle [10], we minimize the uncertainty of the PVAD predictions, restricted to speech frames via the VAD gate $g_{t}$:

$$L_{ent} = \frac{\sum_{t} g_{t} \left[ - \sum_{c} q_{t, c} \log q_{t, c} \right]}{\sum_{t} g_{t}}, \tag{10}$$

where $q_{t, c} = \mathrm{softmax}(p_{t})_{c}$ is the predicted PVAD probability for class $c$ at frame $t$. Minimizing $L_{ent}$ sharpens the model’s decisions on speech frames, encouraging an embedding under which the model is confident.

Reconstruction. The second objective reuses the self-supervised signal of the TTT adapter. With the speaker-conditioned feature ${\tilde{F}}_{t}$ recomputed from the candidate embedding via Eqs. (1)–(2), the loss is the reconstruction error of the adapter’s initial linear map,

$$L_{rec} = \frac{1}{T} \sum_{t} \lVert W_{0} {\tilde{F}}_{t} - {\tilde{F}}_{t} \rVert^{2}, \tag{11}$$

which favors an embedding whose induced feature stream is consistent with the representation the adapter was trained to reconstruct.

Joint. The third objective is a weighted combination,

$$L_{joint} = L_{ent} + λ L_{rec}, \tag{12}$$

with $λ$ a balancing coefficient, set to $λ = 0.5$ in our experiments. We compare all three objectives empirically in Section 5.
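As a concrete reading of Eqs. (10) and (12), the following sketch computes the VAD-gated entropy from per-frame PVAD logits and combines it with a given reconstruction loss. The function names and the small `eps` guard are illustrative assumptions; in an actual system these quantities would be differentiable tensors rather than Python floats.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over one frame's PVAD logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [v / total for v in exps]


def gated_entropy(pvad_logits: List[List[float]], gates: List[float],
                  eps: float = 1e-12) -> float:
    """Eq. (10): entropy of q_t weighted by the VAD gate g_t, normalized by sum of g_t."""
    weighted = 0.0
    for p_t, g_t in zip(pvad_logits, gates):
        q_t = softmax(p_t)
        weighted += g_t * -sum(q * math.log(q + eps) for q in q_t)
    return weighted / max(sum(gates), eps)


def joint_loss(l_ent: float, l_rec: float, lam: float = 0.5) -> float:
    """Eq. (12): entropy plus lambda times reconstruction error."""
    return l_ent + lam * l_rec
```

A frame with uniform logits over the three classes contributes the maximum entropy $\log 3$, whereas a frame with gate $g_{t} = 0$ contributes nothing regardless of its logits.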

Although TEA shares the entropy-minimization principle with prior test-time adaptation methods [10], it differs in what is adapted: where those methods update model parameters (typically normalization or affine weights), TEA updates the target speaker embedding. This choice is specific to speaker-conditioned tasks and, to our knowledge, has not previously been studied for PVAD.
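The update of Eq. (9) can be sketched as a short loop over the embedding alone. This is an illustration under stated assumptions, not the authors' code: `loss_fn` is a hypothetical stand-in for $L_{TEA}$ evaluated through the frozen model, and the gradient is approximated by central finite differences so that the example stays self-contained; a real implementation would obtain it by automatic differentiation.

```python
from typing import Callable, List

Vector = List[float]


def numerical_grad(loss_fn: Callable[[Vector], float], e: Vector,
                   h: float = 1e-5) -> Vector:
    """Central finite-difference gradient of loss_fn with respect to e."""
    grad = []
    for i in range(len(e)):
        up, down = e[:], e[:]
        up[i] += h
        down[i] -= h
        grad.append((loss_fn(up) - loss_fn(down)) / (2.0 * h))
    return grad


def tea_adapt(e_enrolled: Vector, loss_fn: Callable[[Vector], float],
              mu: float = 1e-3, K: int = 2) -> Vector:
    """Eq. (9): K gradient steps on the embedding; model weights are never touched."""
    e = list(e_enrolled)  # copy, so the enrolled embedding is reused unchanged next utterance
    for _ in range(K):
        g = numerical_grad(loss_fn, e)
        e = [e_i - mu * g_i for e_i, g_i in zip(e, g)]
    return e


# Toy usage: a quadratic loss whose minimum plays the role of a better-matched embedding.
target = [1.0, -1.0]
toy_loss = lambda e: sum((a - b) ** 2 for a, b in zip(e, target))
adapted = tea_adapt([0.0, 0.0], toy_loss, mu=0.1, K=2)
```

Because the loop starts from a copy of the enrolled embedding, the adaptation leaves no persistent state, in line with the per-utterance reset described above.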

3.5. Two Complementary Adaptation Levels

The two mechanisms adapt distinct parts of the system and should not be read as redundant. The VAD-gated TTT adapter modifies an internal weight matrix that transforms the speaker-conditioned feature stream; its effect is a learned, input-dependent residual applied to ${\tilde{F}}_{1 : T}$ within the model. TEA, by contrast, leaves every model weight untouched and moves the speaker embedding $e$, which enters the model upstream through the FiLM layer (Eq. (2)). One mechanism therefore operates in the feature-transformation space and the other in the speaker-representation space. We avoid characterizing them as cleanly handling separate sources of distribution shift (for instance, attributing noise to one and speaker mismatch to the other) because the speaker-conditioned representation entangles these factors; rather, we describe them as acting at different representation levels and evaluate their independent and combined effects directly.

3.6. Gaussian Smoothing Post-Processing

Finally, we consider a post-hoc smoothing step that is independent of the two adaptation mechanisms and of the model. After the per-frame probabilities are produced, a one-dimensional Gaussian filter of standard deviation $σ$ is applied along the time axis to both the VAD and PVAD probability sequences prior to thresholding:

$${\hat{q}}_{t, c} = \sum_{τ} K_{σ} (τ) \, q_{t - τ, c}, \qquad K_{σ} (τ) \propto \exp \left( - \frac{τ^{2}}{2 σ^{2}} \right). \tag{13}$$

This suppresses isolated single-frame errors (brief false detections within non-speech regions and momentary dropouts within speech regions) by enforcing temporal continuity in the probability sequence. We stress that Gaussian smoothing is a generic post-processing operation unrelated to test-time adaptation; we include it because it appears in our evaluation pipeline, and we report its contribution separately (Section 5) so that it is not conflated with the effect of the TTT adapter or TEA.
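Eq. (13) amounts to a normalized one-dimensional convolution along time. The sketch below applies it to a single probability sequence; the kernel truncation at four standard deviations and the edge handling (clamping indices to the sequence boundaries) are assumptions, since the text does not specify them.

```python
import math
from typing import List


def gaussian_kernel(sigma: float, truncate: float = 4.0) -> List[float]:
    """Normalized weights K_sigma(tau) for tau in [-radius, radius]."""
    radius = int(truncate * sigma + 0.5)
    weights = [math.exp(-(tau ** 2) / (2.0 * sigma ** 2))
               for tau in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def smooth(probs: List[float], sigma: float = 5.0) -> List[float]:
    """Eq. (13) for one class: q_hat_t = sum over tau of K_sigma(tau) * q_{t - tau}."""
    kernel = gaussian_kernel(sigma)
    radius = len(kernel) // 2
    T = len(probs)
    smoothed = []
    for t in range(T):
        acc = 0.0
        for k, w in enumerate(kernel):
            tau = k - radius
            idx = min(max(t - tau, 0), T - 1)  # clamp at the sequence edges (assumption)
            acc += w * probs[idx]
        smoothed.append(acc)
    return smoothed
```

With $σ = 5$, an isolated single-frame probability of 1 surrounded by zeros is reduced to roughly 0.08, well below a typical decision threshold, while a constant sequence passes through unchanged.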

4. Experimental Setup

4.1. Dataset and Data Preparation

We follow the experimental protocol of FDE-RNN [8] and FDE-Mamba [9] exactly, to ensure direct comparability with prior work. All models are trained and evaluated on the LibriSpeech corpus [23]. To simulate multi-speaker scenarios with speaker turns, we adopt the concatenated-utterance scheme of Ding et al. [3]: for each sample, the number of speakers is drawn from $[1, 3]$, non-repeating speakers are randomly selected with one designated as the target, a single utterance is sampled per speaker, and the utterances are shuffled and concatenated. Target speaker embeddings are 256-dimensional d-vectors [15] obtained by passing additional, independently sampled utterances of the target speaker through a pre-trained speaker verification model. Following Yu et al. [16], in 20% of training samples the designated target is replaced by a speaker absent from the concatenated utterance, encouraging robustness to unseen enrollment conditions. Multi-condition training with reverberation is applied, with room impulse responses and additive noise drawn from standard corpora [24]. There is no speaker or utterance overlap between the training and test partitions, and all embeddings are constructed within their respective partitions.
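The sample-construction procedure can be sketched as follows. This is a simplified illustration rather than the authors' pipeline: the corpus is represented by a hypothetical dictionary mapping speaker IDs to utterance IDs, the speaker count is assumed to be drawn uniformly from {1, 2, 3}, and audio concatenation, frame labels, and augmentation are omitted.

```python
import random
from typing import Dict, List


def make_sample(utts_by_speaker: Dict[str, List[str]], rng: random.Random,
                training: bool = True, absent_target_prob: float = 0.2) -> dict:
    """Build one concatenated multi-speaker sample description."""
    speakers = list(utts_by_speaker)
    n_speakers = rng.randint(1, 3)             # inclusive range [1, 3]
    chosen = rng.sample(speakers, n_speakers)  # non-repeating speakers
    target = rng.choice(chosen)                # one is designated as the target
    # One utterance per speaker, then shuffle the concatenation order.
    segments = [(spk, rng.choice(utts_by_speaker[spk])) for spk in chosen]
    rng.shuffle(segments)
    # In 20% of training samples the target is replaced by an absent speaker.
    if training and rng.random() < absent_target_prob:
        outsiders = [s for s in speakers if s not in chosen]
        if outsiders:
            target = rng.choice(outsiders)
    return {"target": target, "segments": segments}
```

When the target is absent from the mixture, every speech frame is non-target speech, which is the case the 20% replacement is meant to expose the model to during training.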

4.2. Evaluation Metrics

Consistent with prior work [3,8], we report: Average Precision (AP) for each category, presented as the combined ns&ntss AP and the tss AP; mean Average Precision (mAP), the average of AP across categories, used as the primary ranking metric; Accuracy, the frame-level classification accuracy; and Recall and Precision for the tss category, which characterize, respectively, how much of the target-speaker speech is detected (sensitivity) and how reliable the positive target-speaker detections are. To assess inference cost, we additionally report the real-time factor (RTF), peak memory allocation, and kilo-FLOPs per frame.
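For reference, the frame-level metrics can be computed as in the sketch below. The AP shown is the common non-interpolated definition (mean of the precision values at the ranks of the positive frames); whether the evaluation in this study uses exactly this variant is an assumption, and the function names are illustrative.

```python
from typing import List, Tuple


def precision_recall(y_true: List[str], y_pred: List[str],
                     positive: str = "tss") -> Tuple[float, float]:
    """Frame-level precision and recall for one category."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def average_precision(is_positive: List[bool], scores: List[float]) -> float:
    """Non-interpolated AP: mean precision at the rank of each positive frame."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits, total = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if is_positive[i]:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


def mean_average_precision(ap_per_category: List[float]) -> float:
    """mAP as the plain average of the per-category AP values."""
    return sum(ap_per_category) / len(ap_per_category)
```

Precision falls when non-target frames are labeled tss, and recall falls when target frames are missed, which is the trade-off examined in Section 5.5.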

4.3. Implementation Details

All models are implemented in PyTorch. Input features are 40-dimensional log-Mel filterbank coefficients, and the hidden dimension of all blocks is set to $D = 64$, following [9]. We train with a batch size of 64 and an outer-loop learning rate of $10^{- 3}$. The VAD-gated TTT adapter uses a chunk size of $b = 16$ and an EMA coefficient of $α = 0.7$; the inner-loop learning rate $η$ is a learned scalar (a trainable parameter initialized to 0, updated by backpropagation during outer-loop training). For TEA, the embedding learning rate $μ$, number of steps $K$, and loss type (entropy, reconstruction, or joint) are swept as described in Section 5. The same adaptation mechanisms are additionally applied to the LSTM-based FDE-RNN backbone to assess cross-architecture generality. Unless otherwise stated, all systems are evaluated with the Gaussian smoothing post-processing described in Section 3.6, using a standard deviation of $σ = 5$ applied to both the VAD and PVAD probability sequences. We treat this smoothing as a standard, model-agnostic component of the inference pipeline, and focus our analysis on the behavior of the TTT and TEA adaptation mechanisms.
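The settings above can be collected in a single configuration object, sketched here as a frozen dataclass. The class and field names are hypothetical; the values are those stated in this section, together with the 256-dimensional d-vector of Section 4.1 and the TEA setting ($μ = 10^{-3}$, $K = 2$) selected in Section 5.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PVADConfig:
    """Hyperparameters as reported in the text; names are illustrative."""
    n_mels: int = 40              # log-Mel filterbank dimension
    hidden_dim: int = 64          # D, following the FDE-Mamba setup
    embedding_dim: int = 256      # d-vector size (Section 4.1)
    batch_size: int = 64
    outer_lr: float = 1e-3        # outer-loop (training) learning rate
    ttt_chunk_size: int = 16      # b
    ttt_ema_alpha: float = 0.7    # alpha
    ttt_eta_init: float = 0.0     # eta is learned, initialized to 0
    tea_lr: float = 1e-3          # mu, selected by the sweep in Section 5
    tea_steps: int = 2            # K, selected by the sweep in Section 5
    tea_joint_lambda: float = 0.5 # lambda in the joint objective
    smoothing_sigma: float = 5.0  # Gaussian smoothing standard deviation


config = PVADConfig()
```

Freezing the dataclass prevents accidental mutation between ablation runs; a variant is created explicitly, for example with `dataclasses.replace(config, tea_steps=0)`.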

5. Results and Discussion

We first establish the overall effect of the proposed test-time adaptation against the FDE-Mamba backbone (Section 5.1). We then dissect the contribution of each component through an ablation study (Section 5.2), report the computational cost (Section 5.3), and verify that the same mechanisms transfer to the LSTM-based FDE-RNN backbone (Section 5.4). We close with a discussion of a recall–precision trade-off and the limitations of the study (Section 5.5). Throughout, “TTT” denotes the VAD-gated TTT adapter, “TEA” the test-time embedding adaptation, and “Gauss” Gaussian smoothing with $σ = 5$.

5.1. Main Results

Our primary comparison is against the FDE-Mamba backbone [9], which is the direct, already-published baseline that this work extends. Table 2 reports the FDE-RNN baseline for reference, the FDE-Mamba backbone, and the proposed full system (TTT + TEA + Gauss). The proposed system improves the primary PVAD metric, mAP, from 0.9605 to 0.9641 ($+0.0036$) over FDE-Mamba, with the ns&ntss AP rising from 0.9699 to 0.9740 and precision from 0.881 to 0.899; accuracy improves from 89.87% to 90.19%. We state plainly that the improvement over the FDE-Mamba backbone is modest in magnitude. Its value, in the context of this empirical study, lies in showing that inference-time adaptation, which requires no retraining and no additional labels, yields a consistent gain on a strong, recently published PVAD baseline, and in characterizing precisely where that gain comes from, which we do next. To check that the best-mAP result is not an artifact of the stochasticity in the TEA inner loop, we repeated the evaluation of the proposed system five times with the trained model held fixed; the PVAD mAP was $0.96406 \pm 0.00001$ (mean ± std over five runs), confirming that the test-time adaptation procedure is highly stable at inference. We emphasize that this quantifies only the inference-time stochasticity, not the variance across independent training runs (see Section 5.5).

5.2. Ablation Study

To attribute the overall gain to its sources, Table 3 reports every combination of the three components on the FDE-Mamba backbone. We make four observations.

The TTT adapter contributes a small, precision-oriented gain. Adding TTT alone raises mAP from 0.9605 to 0.9623 ($+0.0018$) and precision from 0.881 to 0.900, while accuracy is essentially unchanged. The adapter sharpens speaker discrimination rather than improving overall frame classification.

TEA helps only in combination with TTT. Used on its own (“+ TEA”), TEA does not improve over the backbone—mAP decreases to 0.9564—indicating that the inference-time perturbation of the embedding is not, by itself, beneficial under the current protocol. Added on top of TTT, however, it raises mAP from 0.9623 to 0.9633, and the full combination attains the best mAP of 0.9641. We report this dependence explicitly: the value of TEA in this study is conditional on the presence of the TTT adapter.

Gaussian smoothing is the largest single contributor to accuracy, and is independent of the adaptation mechanisms. Smoothing alone raises accuracy from 89.87% to 90.47%, the largest accuracy improvement of any single component and larger than that of the TTT adapter, while its effect on mAP is comparatively small ($+0.0013$). Because smoothing is a generic, model-agnostic post-processing step, we report its contribution separately so that it is not conflated with the adaptation mechanisms: the accuracy gains it provides and the precision/mAP gains the adapter provides arise from different mechanisms.

A recall–precision trade-off accompanies adaptation. Across the table, adding TTT raises precision but lowers tss recall (0.870 → 0.849); TEA and smoothing partially restore it (to 0.856), but the full system does not recover the recall of the unadapted backbone. We return to this in Section 5.5.

A sweep over the TEA objective (entropy, reconstruction, joint), learning rate $μ$, and step count $K$ is reported in Table 4 and Table 5; we select the configuration with the highest mAP for the main results. Two findings stand out. First, TEA is highly sensitive to the learning rate: all three objectives degrade sharply as $μ$ increases, collapsing by $μ = 10^{- 1}$ (the entropy objective falls to 0.526). A small learning rate ($μ = 10^{- 3}$) is essential, with the reconstruction and joint objectives performing best there (mAP 0.9641). Second, more adaptation steps do not help: mAP peaks at $K = 2$ and declines thereafter, dropping to 0.9567 at $K = 10$ (Table 5). Both findings indicate that TEA is useful only as a small, carefully bounded correction; over-adaptation harms performance.

5.3. Computational Cost

Table 6 reports the inference cost. The TTT adapter is inexpensive: it adds about 3% more parameters (145,524 → 150,053) and raises the real-time factor (RTF) only modestly (0.00519 → 0.00708, a $1.4 \times$ increase), with no change in FLOPs per frame. TEA, however, is costly. Because it performs $K$ gradient steps on the embedding per utterance at inference, each requiring a forward and backward pass through the model, both RTF and FLOPs scale with $K$. At the chosen setting $K = 2$, RTF rises to 0.02876 ($5.5 \times$ the FDE-Mamba backbone) and kilo-FLOPs per frame increase roughly sevenfold (221.9 → 1553.5). We note plainly that this erases the RTF advantage of the Mamba backbone over FDE-RNN (0.02813): the full system with TEA is marginally slower than the LSTM baseline. The peak memory also grows substantially (43.5 MB → 136 MB), reflecting the activations retained for the backward pass. This cost–benefit profile is unfavorable: TEA at $K = 2$ buys a further $+0.0008$ mAP over TTT alone while increasing RTF by roughly $4 \times$. The TTT adapter, by contrast, delivers most of the benefit at low cost. We return to this trade-off in Section 5.5.
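The multipliers quoted in this subsection follow directly from the reported figures, as the short calculation below shows. The dictionary keys are illustrative names; the values are those given in the text.

```python
# Figures as reported in the text (RTF, kilo-FLOPs per frame, parameters, peak memory in MB).
reported = {
    "rtf_mamba": 0.00519,
    "rtf_mamba_ttt": 0.00708,
    "rtf_full": 0.02876,
    "rtf_rnn": 0.02813,
    "kflops_mamba": 221.9,
    "kflops_full": 1553.5,
    "params_mamba": 145_524,
    "params_ttt": 150_053,
    "mem_mamba_mb": 43.5,
    "mem_full_mb": 136.0,
}

ratios = {
    # TTT adapter alone versus the backbone
    "ttt_rtf": reported["rtf_mamba_ttt"] / reported["rtf_mamba"],
    "ttt_params_pct": 100.0 * (reported["params_ttt"] - reported["params_mamba"])
                      / reported["params_mamba"],
    # Full system (with TEA at K = 2) versus the backbone and versus TTT alone
    "full_rtf": reported["rtf_full"] / reported["rtf_mamba"],
    "tea_over_ttt_rtf": reported["rtf_full"] / reported["rtf_mamba_ttt"],
    "full_kflops": reported["kflops_full"] / reported["kflops_mamba"],
    "full_memory": reported["mem_full_mb"] / reported["mem_mamba_mb"],
    # Full system versus the LSTM-based FDE-RNN baseline
    "full_vs_rnn_rtf": reported["rtf_full"] / reported["rtf_rnn"],
}

for name, value in ratios.items():
    print(f"{name}: {value:.3f}")
```

The last ratio is slightly above 1, that is, the full system's RTF is about 2% higher than that of FDE-RNN.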

5.4. Generalization to the FDE-RNN Backbone

While the focus of this study is the FDE-Mamba backbone, we additionally verify that the proposed mechanisms are not specific to a selective state-space model by applying the same TTT adapter and TEA to the LSTM-based FDE-RNN. Table 7 shows the same qualitative pattern: TTT raises mAP from 0.9419 to 0.9434, and the full combination reaches 0.9443, with tss AP improving monotonically as components are added. The gains are again modest and of comparable magnitude to those on FDE-Mamba. This consistency across an LSTM and a selective state-space backbone indicates that the contribution of the adaptation mechanisms is largely independent of the underlying sequence model. We present this as a supporting observation rather than a primary result.

5.5. Discussion and Limitations

The recall–precision trade-off. Adding the TTT adapter raises precision (0.881 → 0.900) but lowers tss recall (0.870 → 0.849): the adapter makes the model more conservative in declaring a frame to be target-speaker speech, committing fewer false positives but missing more true target frames. TEA and Gaussian smoothing partially recover recall (to 0.856 in the full system), but the full system still does not match the recall of the unadapted backbone. This trade-off is consistent with the precision-oriented nature of the reconstruction-based adaptation signal, and it is the main reason the accuracy gains are smaller than the AP gains. Applications that prioritize not missing target speech may prefer the unadapted backbone, while those that prioritize suppressing false target detections may benefit from the adapter.

Cost versus benefit of TEA. The two mechanisms differ sharply in cost-effectiveness. The TTT adapter delivers most of the benefit (mAP $0.9605 \to 0.9623$) for a $1.4 \times$ increase in RTF and no extra FLOPs. TEA, in contrast, adds only a further $+0.0008$ mAP at $K = 2$ yet raises RTF by roughly $4 \times$ and FLOPs sevenfold (Table 6), because each adaptation step requires a full forward–backward pass. As a result, the complete system runs at about the same speed as the LSTM-based FDE-RNN baseline, largely forfeiting the efficiency advantage of the Mamba backbone. For deployments where inference cost matters, the TTT adapter alone is the more sensible operating point; TEA is justified only when the small additional mAP is worth a severalfold latency increase. We report this plainly rather than presenting the best-mAP system as unconditionally preferable.

Sensitivity of TEA. TEA is also delicate to configure. Table 4 and Table 5 show that performance degrades sharply as the learning rate is raised beyond $10^{- 3}$, collapsing by $10^{- 1}$, and declines when more than a few adaptation steps are taken (to an mAP of 0.9567 at $K = 10$). Both are symptoms of over-adaptation, in which the embedding drifts away from a useful speaker representation when pushed too hard by the self-supervised objective. The method is therefore beneficial only as a small, tightly bounded correction. This fragility is itself a finding: unconstrained test-time adaptation of the speaker embedding is not robust for PVAD, and careful regularization of the adaptation step is necessary.

Why the gains are modest. The gains from test-time adaptation, while consistent, are small. We attribute this in part to the evaluation protocol, which presents limited distribution shift for either mechanism to absorb. On the acoustic side, although multi-condition training adds reverberation and noise, the train and test distributions are drawn from the same corpus and augmentation pipeline, so the train–test acoustic mismatch that the TTT adapter is designed to correct is mild; on the speaker side, the enrollment embeddings are generally well matched to the test condition, limiting what TEA can recover. Both mechanisms may offer larger benefits under deliberately mismatched conditions—for the TTT adapter, test utterances from unseen acoustic environments, channels, or speaking styles; for TEA, short, noisy, or cross-session enrollment—which the present protocol does not isolate.

Limitations. Three limitations follow. First, although we verified that the proposed system is stable across repeated evaluations of a fixed trained model (Section 5.1), each model configuration was trained only once; given the small effect sizes, repeating the training across several random seeds and reporting the resulting variance would further strengthen the conclusions, and we therefore regard the current gains as suggestive rather than definitive. Second, the benefit of TEA is contingent on the TTT adapter and does not appear in isolation, so its standalone value remains unestablished. Third, the LibriSpeech-based protocol uses concatenated read speech with artificially clean speaker boundaries, well-matched enrollment, and only synthetic acoustic augmentation; evaluation under genuine overlapping speech, real acoustic-environment mismatch, and mismatched enrollment remains important future work, and is the setting in which we would expect test-time adaptation—both the acoustic adaptation of the TTT adapter and the speaker adaptation of TEA—to be most useful.

6. Conclusion and Future Work

We presented an empirical study of test-time adaptation for personal voice activity detection, introducing a VAD-gated Test-Time Training (TTT) adapter and a test-time speaker embedding adaptation scheme (TEA) on top of the FDE-Mamba backbone. To our knowledge, this is the first application of TTT to PVAD, and the first to adapt the target speaker embedding itself at inference time for this task. Across both an LSTM-based and a selective state-space backbone, the two mechanisms produce consistent but modest improvements in mAP and precision, concentrated on speaker-discrimination metrics rather than overall accuracy, at the cost of a small reduction in recall. We further showed that a model-agnostic Gaussian smoothing step accounts for the largest single accuracy gain and should be reported separately from the adaptation mechanisms.

Taken together, these findings provide a measured baseline for adaptive PVAD and clarify the cost–benefit profile of different adaptation choices: the VAD-gated TTT adapter offers low-cost, inference-only gains with minimal architectural changes, while TEA yields additional but smaller improvements at a noticeable increase in computational cost and latency. We also highlighted several limitations of the present study, including the reliance on a single training seed per configuration, the use of a LibriSpeech-based protocol that does not explicitly enforce severe acoustic or enrollment mismatch, and the fact that TEA’s benefit appears most clearly when combined with TTT rather than as a standalone mechanism.

These limitations point directly to several directions for future work. First, extending the evaluation to more challenging and realistic scenarios—such as far-field capture, reverberant and noisy environments, heavily mismatched or short enrollments, and overlapping multi-speaker speech—would better stress the conditions under which test-time adaptation is most likely to help and may reveal larger gains than those observed here. Second, conducting multi-seed training and broader hyperparameter sweeps would allow a more rigorous assessment of the statistical significance and robustness of the reported improvements. Third, it would be valuable to explore alternative adaptation targets and schedules, including cross-utterance or continual test-time adaptation for PVAD, while explicitly addressing stability and catastrophic forgetting, as well as integrating TTT and TEA with stronger pretraining or self-supervised front-ends. We hope that the empirical baseline and analysis provided in this work will serve as a reference point for future research on adaptive PVAD and, more broadly, on test-time adaptation for speaker-conditioned streaming speech tasks.

Author Contributions

Conceptualization, T.-Y.C., C.-C.C. and J.-W.H.; methodology, T.-Y.C., C.-C.C. and J.-W.H.; software, T.-Y.C. and C.-C.C.; validation, T.-Y.C., C.-C.C., J.-S.L. and J.-W.H.; formal analysis, J.-S.L. and J.-W.H.; investigation, J.-W.H.; resources, J.-S.L. and J.-W.H.; data curation, T.-Y.C., C.-C.C. and J.-W.H.; writing—original draft preparation, J.-W.H.; writing—review and editing, J.-S.L. and J.-W.H.; visualization, J.-W.H.; supervision, J.-W.H.; project administration, J.-W.H.; funding acquisition, J.-W.H. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

Publicly available datasets were analyzed in this study: LibriSpeech (https://www.openslr.org/12) and MUSAN (https://www.openslr.org/17).

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Eyben, F.; Weninger, F.; Squartini, S.; Schuller, B. Real-life voice activity detection with LSTM Recurrent Neural Networks and an application to Hollywood movies. In Proceedings of the 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, 2013; pp. 483–487.
2. Jia, F.; Majumdar, S.; Ginsburg, B. MarbleNet: Deep 1D Time-Channel Separable Convolutional Neural Network for Voice Activity Detection. In Proceedings of the ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2021; pp. 6818–6822.
3. Ding, S.; Wang, Q.; Chang, S.Y.; Wan, L.; Moreno, I.L. Personal VAD: Speaker-Conditioned Voice Activity Detection. In Proceedings of Odyssey: The Speaker and Language Recognition Workshop, 2020; pp. 433–439.
4. Ding, S.; Rikhye, R.; Liang, Q.; He, Y.; Wang, Q.; Narayanan, A.; O’Malley, T.; McGraw, I. Personal VAD 2.0: Optimizing Personal Voice Activity Detection for On-Device Speech Recognition. In Proceedings of Interspeech, 2022; pp. 3744–3748.
5. Perez, E.; Strub, F.; de Vries, H.; Dumoulin, V.; Courville, A. FiLM: Visual Reasoning with a General Conditioning Layer. In Proceedings of the AAAI Conference on Artificial Intelligence, 2018; pp. 3942–3951.
6. Liu, F.; Xiong, F.; Hao, Y.; Zhou, K.; Zhang, C.; Feng, J. AS-pVAD: A Frame-Wise Personalized Voice Activity Detection Network with Attentive Score Loss. In Proceedings of the ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2024; pp. 11476–11480.
7. Zeng, B.; Cheng, M.; Tian, Y.; Liu, H.; Li, M. Efficient Personal Voice Activity Detection with Wake Word Reference Speech. In Proceedings of ICASSP, 2024; pp. 12241–12245.
8. Yu, E.L.; Wang, C.C.; Hung, J.W.; Huang, S.C.; Chen, B. Flexible VAD-PVAD Transition: A Detachable PVAD Module for Dynamic Encoder RNN VAD. In Proceedings of Interspeech, 2025; pp. 5793–5797.
9. Chiu, C.C.; Chen, T.Y.; Wang, T.W.; Chen, B.; Hung, J.W. FDE-Mamba: Selective State Space Modeling for Personal Voice Activity Detection. Appl. Sci. 2026, 16, 4688.
10. Wang, D.; Shelhamer, E.; Liu, S.; Olshausen, B.; Darrell, T. Tent: Fully Test-Time Adaptation by Entropy Minimization. In Proceedings of the International Conference on Learning Representations (ICLR), 2021.
11. Shen, H.; Jiang, W.; Huang, J. Speaker Adaptation for Lip Reading with Robust Entropy Minimization and Adaptive Pseudo Labels. In Proceedings of the International Conference on Computing, Machine Learning and Data Science (CMLDS ’24), New York, NY, USA, 2024.
12. Sun, Y.; Wang, X.; Liu, Z.; Miller, J.; Efros, A.A.; Hardt, M. Test-Time Training with Self-Supervision for Generalization under Distribution Shifts. In Proceedings of the International Conference on Machine Learning (ICML), 2020; pp. 9229–9248.
13. Sun, Y.; Li, X.; Dalal, K.; Xu, J.; Vikram, A.; Zhang, G.; Dubois, Y.; Chen, X.; Wang, X.; Koyejo, S.; et al. Learning to (Learn at Test Time): RNNs with Expressive Hidden States. arXiv 2024, arXiv:2407.04620.
14. Behera, A.; Easow, R.A.; Parvathala, V.; Murty, K.S.R. Test-Time Training for Speech Enhancement. In Proceedings of Interspeech, 2025; pp. 2375–2379.
15. Wan, L.; Wang, Q.; Papir, A.; Moreno, I.L. Generalized End-to-End Loss for Speaker Verification. In Proceedings of ICASSP, 2018; pp. 4879–4883.
16. Yu, E.L.; Chang, R.X.; Hung, J.W.; Huang, S.C.; Chen, B. COIN-AT-PVAD: A Conditional Intermediate Attention PVAD. In Proceedings of the Asia Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), 2024; pp. 1–5.
17. Makishima, N.; Ihori, M.; Tanaka, T.; Takashima, A.; Orihashi, S.; Masumura, R. Enrollment-Less Training for Personalized Voice Activity Detection. In Proceedings of Interspeech, 2021; pp. 346–350.
18. Yu, E.L.; Ho, K.H.; Hung, J.W.; Huang, S.C.; Chen, B. Speaker Conditional Sinc-Extractor for Personal VAD. In Proceedings of Interspeech, 2024; pp. 2115–2119.
19. Gu, A.; Dao, T. Mamba: Linear-Time Sequence Modeling with Selective State Spaces. In Proceedings of the Conference on Language Modeling (COLM), 2024.
20. Bovbjerg, H.S.; Jensen, J.; Østergaard, J.; Tan, Z.H. Self-Supervised Pretraining for Robust Personalized Voice Activity Detection in Adverse Conditions. In Proceedings of the ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2024; pp. 10126–10130.
21. Dong, J.; Jia, H.; Chatterjee, S.; Ghosh, A.; Bailey, J.; Dang, T. E-BATS: Efficient Backpropagation-Free Test-Time Adaptation for Speech Foundation Models. In Proceedings of the Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025.
22. Dumpala, S.H.; Sastry, C.S.; Oore, S. Test-Time Training for Speech. arXiv 2023, arXiv:2309.10930.
23. Panayotov, V.; Chen, G.; Povey, D.; Khudanpur, S. LibriSpeech: An ASR Corpus Based on Public Domain Audio Books. In Proceedings of ICASSP, 2015; pp. 5206–5210.
24. Snyder, D.; Chen, G.; Povey, D. MUSAN: A Music, Speech, and Noise Corpus. arXiv 2015, arXiv:1510.08484.

Figure 1. Two formulations of Test-Time Training (TTT). (a) Original TTT [12]: a Y-shaped architecture comprising a shared encoder $θ_{e}$, a main-task head $θ_{m}$, and a self-supervised auxiliary head $θ_{s}$. At training time, all three parameter sets are updated jointly. At test time, the auxiliary loss $L_{s}$ is used to update only the shared encoder $θ_{e}$; both heads remain fixed. (b) Sequence-layer TTT [13] (used in this work): the hidden state is a weight matrix $W$ that is updated by a self-supervised reconstruction step at every position, during both training and inference, with no separate auxiliary head. Components shown in blue are updated by the self-supervised signal.

Figure 2. Conceptual contrast of the two test-time adaptation mechanisms. (a) TTT treats a weight matrix $W$ as a fast state that is updated by a self-supervised step (gated by the VAD probability $g_{t}$) while the model processes the feature stream ${\tilde{F}}_{t}$; the model parameter changes during inference. (b) TEA freezes all model parameters and instead updates the target speaker embedding $e$ once per utterance by minimizing a self-supervised loss $L$ on the predictions. The two thus adapt different quantities (a parameter vs. an input) at different granularities.

Figure 3. The original FDE-Mamba flowchart [9], comprising a Dynamic Encoder RNN (left) for VAD and a personalization module (right) for PVAD. The interaction block (blue dashed box) is the locus in which we insert the VAD-gated TTT adapter; see Figure 4 for the updated personalization module.

Figure 4. The updated personalization module of FDE-Mamba, incorporating the two proposed test-time adaptation mechanisms (highlighted in blue). The target speaker embedding $e_{target}$ enters TEA, which adapts the embedding at inference time before passing it to the FiLM generator; the generator produces the modulation parameters $γ, β$. The encoder feature $F$ and $γ, β$ are then fused by the FiLM layer. Within the interaction block (blue dashed box), we insert the VAD-gated TTT adapter between the FiLM layer and the Mamba block: it adapts an internal weight matrix $W$ (via the EMA self-update loop) on the speaker-conditioned feature stream, gated by the VAD probability $g_{t}$ supplied by the dynamic encoder outside this module. The Gaussian smoothing post-processing step is added after the Linear + softmax layer. Components introduced in this work are shown in bold blue.

Table 1. Summary of notation used in Section 3.

Symbol	Meaning
T	Number of frames in the utterance
$F, E, D$	Input-feature, speaker-embedding, and hidden dimensions
$x_{t} \in R^{F}$	Acoustic feature at frame t (log-Mel)
$e \in R^{E}$	Target speaker embedding (d-vector)
$h_{t} \in R^{D}$	Encoder latent representation
$F_{t} \in R^{D}$	Weighted-residual fused feature (Eq. (1))
${\tilde{F}}_{t} \in R^{D}$	Speaker-conditioned feature after FiLM (Eq. (2))
$γ (e), β (e) \in R^{D}$	FiLM scale and shift vectors
$v_{t}, p_{t}$	VAD logits ( $\in R^{2}$ ), PVAD logits ( $\in R^{3}$ )
$p_{vad} (t), g_{t}$	VAD speech probability; VAD gate $g_{t} = p_{vad} (t)$
$q_{t, c}$	Softmax PVAD probability of class $c \in \{ns, ntss, tss\}$ (non-speech, non-target-speaker speech, target-speaker speech) at frame t
$W, W_{0} \in R^{D \times D}$	TTT weight matrix; its learned initialization
$z_{t} \in R^{D}$	TTT adapter output at frame t
$η, α, b$	TTT inner-loop rate (learned), EMA coefficient ( $α = 0.7$ ), chunk size ( $b = 16$ )
$μ, K, λ$	TEA learning rate, step count, joint-loss weight ( $λ = 0.5$ )
$σ$	Gaussian-smoothing standard deviation
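To make the roles of $g_{t}$, $α$, and $b$ in Table 1 concrete, the following sketch shows one plausible reading of a VAD-gated, EMA-stabilized, chunked weight update. It is not the authors' implementation: the reconstruction loss $\frac{1}{2}\|W x - x\|^{2}$, the fixed rate `eta = 0.1` (the paper learns $η$), the averaging over the chunk, and the direction in which $α$ is applied are all assumptions, and every function name is hypothetical.

```python
def matvec(W, x):
    """Multiply a D x D matrix (list of rows) by a length-D vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def gated_ema_chunk_update(W, frames, gates, eta=0.1, alpha=0.7):
    """One update of W from a chunk of frames.

    Assumed loss per frame: 0.5 * ||W x - x||^2, whose gradient is (W x - x) x^T.
    Each frame's gradient is weighted by its VAD gate g_t, so non-speech frames
    (g_t near 0) barely move W. The gradient step is then blended with the old
    weights through an EMA with coefficient alpha (assumed direction).
    """
    D = len(W)
    grad = [[0.0] * D for _ in range(D)]
    for x, g in zip(frames, gates):
        residual = [y - xi for y, xi in zip(matvec(W, x), x)]
        for i in range(D):
            for j in range(D):
                grad[i][j] += g * residual[i] * x[j] / len(frames)
    return [
        [alpha * W[i][j] + (1.0 - alpha) * (W[i][j] - eta * grad[i][j]) for j in range(D)]
        for i in range(D)
    ]


def adapt_utterance(W0, frames, gates, b=16, eta=0.1, alpha=0.7):
    """Sweep an utterance in chunks of b frames, starting from the initialization W0."""
    W = [row[:] for row in W0]
    for start in range(0, len(frames), b):
        W = gated_ema_chunk_update(W, frames[start:start + b], gates[start:start + b], eta, alpha)
    return W


# Toy example with D = 2 (illustrative values only).
W0 = [[0.5, 0.0], [0.0, 0.5]]
frames = [[1.0, 0.0], [0.0, 1.0]]
W_speech = adapt_utterance(W0, frames, gates=[1.0, 1.0])   # confident speech
W_silence = adapt_utterance(W0, frames, gates=[0.0, 0.0])  # confident non-speech
print(W_speech[0][0], W_silence[0][0])
```

Under this reading the EMA simply damps each step by a factor of $1 - α$, and a chunk whose gates are all zero leaves $W$ at its previous value.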

Table 2. Main results on the LibriSpeech PVAD task. All variants in this table include the Gaussian smoothing post-processing described in Section 3.6; the ablation in Table 3 decomposes the individual contributions of TTT, TEA, and smoothing. The proposed system applies the VAD-gated TTT adapter and TEA on top of the FDE-Mamba backbone under this default pipeline. Improvements ($Δ$) are reported relative to the FDE-Mamba + Gauss baseline. Best value in each column is in bold.

Model	Acc. (%)	Recall	Prec.	ns&ntss	tss	mAP
FDE-RNN [8] (reproduced)	86.85	0.847	0.834	0.9587	0.9101	0.9419
FDE-Mamba [9] (baseline)	89.87	0.870	0.881	0.9699	0.9452	0.9605
Proposed (TTT+TEA+Gauss)	90.19	0.856	0.899	0.9740	0.9477	0.9641
$Δ$ vs. FDE-Mamba	$+ 0.32$	$- 0.014$	$+ 0.018$	$+ 0.0041$	$+ 0.0025$	$+ 0.0036$
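The $Δ$ row can be reproduced directly from the two rows above it. The short check below uses only the values printed in Table 2; the dictionary keys are labels chosen for this sketch.

```python
# Values copied from the FDE-Mamba and Proposed rows of Table 2.
baseline = {"acc": 89.87, "recall": 0.870, "prec": 0.881,
            "ns&ntss": 0.9699, "tss": 0.9452, "mAP": 0.9605}
proposed = {"acc": 90.19, "recall": 0.856, "prec": 0.899,
            "ns&ntss": 0.9740, "tss": 0.9477, "mAP": 0.9641}

# Round to four decimals to remove floating-point noise from the subtraction.
delta = {name: round(proposed[name] - baseline[name], 4) for name in baseline}

for name, value in delta.items():
    print(f"{name:8s} {value:+.4f}")
```

Recall is the only column that decreases, which matches the precision/recall trade-off noted in the abstract.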

Table 3. Ablation of the three components on the FDE-Mamba backbone. The baseline (top row, no adaptation) has 145,524 parameters; all other variants, which include the TTT adapter, have 150,053 parameters. Best value in each column is in bold.

TTT	TEA	Gauss	Acc. (%)	Recall	Prec.	ns&ntss	tss	mAP
			89.87	0.870	0.881	0.9699	0.9452	0.9605
		✓	90.47	0.864	0.899	0.9694	0.9469	0.9618
✓			89.99	0.849	0.900	0.9723	0.9460	0.9623
✓		✓	90.08	0.853	0.900	0.9734	0.9464	0.9632
	✓		89.31	0.835	0.890	0.9672	0.9388	0.9564
	✓	✓	89.41	0.839	0.892	0.9686	0.9413	0.9578
✓	✓		90.15	0.851	0.902	0.9732	0.9460	0.9633
✓	✓	✓	90.19	0.856	0.899	0.9740	0.9477	0.9641
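One way to read the ablation is to ask how much mAP changes when a single component is switched on, for each setting of the other two. The sketch below does this from the mAP column of Table 3 only; the helper name `toggle_effects` and the tuple encoding (TTT, TEA, Gauss) are choices made for this illustration.

```python
# mAP from Table 3, keyed by (TTT, TEA, Gauss) with 1 = enabled.
MAP = {
    (0, 0, 0): 0.9605, (0, 0, 1): 0.9618,
    (1, 0, 0): 0.9623, (1, 0, 1): 0.9632,
    (0, 1, 0): 0.9564, (0, 1, 1): 0.9578,
    (1, 1, 0): 0.9633, (1, 1, 1): 0.9641,
}
NAMES = ("TTT", "TEA", "Gauss")


def toggle_effects(idx):
    """mAP change from enabling component idx, for each setting of the other two."""
    effects = {}
    for cfg, score in MAP.items():
        if cfg[idx] == 0:
            enabled = tuple(1 if i == idx else v for i, v in enumerate(cfg))
            effects[cfg] = round(MAP[enabled] - score, 4)
    return effects


for idx, name in enumerate(NAMES):
    print(name, toggle_effects(idx))
```

On these numbers, enabling TEA without the TTT adapter lowers mAP (by about 0.004), whereas enabling it on top of the TTT adapter raises mAP by about 0.001; TTT and Gaussian smoothing each help in every context listed.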

Table 4. TEA objective and learning-rate sweep on FDE-Mamba (PVAD mAP, applied on top of the TTT adapter, $K = 2$ steps). Best mAP value in bold.

Objective	$μ = 10^{- 3}$	$5 \times 10^{- 3}$	$10^{- 2}$	$5 \times 10^{- 2}$	$10^{- 1}$
Entropy	0.9610	0.9387	0.9194	0.7062	0.5257
Reconstruction	0.9641	0.9551	0.8245	0.4484	0.4578
Joint	0.9641	0.9563	0.8424	0.4485	0.4581

Table 5. Effect of the TEA step count K on FDE-Mamba (PVAD mAP, reconstruction objective, $μ = 10^{- 3}$). Best value in bold.

Steps K	1	2	5	10
PVAD mAP	0.9639	0.9641	0.9633	0.9567

Table 6. Inference cost (RTF measured on GPU with warm cache). The TTT adapter is cheap (RTF rises from 0.00519 to 0.00708, with no change in the reported kFLOPs/frame); TEA cost scales roughly linearly with the step count K.

Model	Params	RTF	Peak Mem. (MB)	kFLOPs/frame
FDE-RNN (reproduced)	92,372	0.02813	16.60	136.43
FDE-Mamba	145,524	0.00519	43.45	221.93
+ TTT	150,053	0.00708	54.72	221.93
+ TTT + TEA ( $K = 1$ )	150,053	0.01845	135.97	887.73
+ TTT + TEA ( $K = 2$ )	150,053	0.02876	135.97	1553.52
+ TTT + TEA ( $K = 5$ )	150,053	0.06137	134.97	3550.90
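The scaling of TEA cost with K can be checked from Table 6 by dividing the increase over the "+ TTT" row by the number of steps. The sketch below uses only the tabulated RTF and kFLOPs/frame values; treating the "+ TTT" row as K = 0 is a convention adopted here.

```python
# (RTF, kFLOPs/frame) from Table 6 for FDE-Mamba + TTT with K TEA steps.
# K = 0 denotes the "+ TTT" row without TEA.
COST = {
    0: (0.00708, 221.93),
    1: (0.01845, 887.73),
    2: (0.02876, 1553.52),
    5: (0.06137, 3550.90),
}

base_rtf, base_flops = COST[0]
per_step = {}
for k, (rtf, flops) in COST.items():
    if k == 0:
        continue
    # Average extra cost attributable to each TEA step.
    per_step[k] = ((rtf - base_rtf) / k, (flops - base_flops) / k)
    print(f"K={k}: +{per_step[k][0]:.5f} RTF/step, +{per_step[k][1]:.2f} kFLOPs/frame/step")

# Cost of one TEA step relative to a single pass without TEA.
flops_ratio = per_step[1][1] / base_flops
print(f"per-step kFLOPs relative to the no-TEA pass: {flops_ratio:.2f}x")
```

On the tabulated values, each TEA step adds roughly 666 kFLOPs/frame (about three times the cost of the pass without TEA) and roughly 0.011 RTF, so at K = 2 the RTF (0.02876) is comparable to that of the reproduced FDE-RNN (0.02813).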

Table 7. Generalization check: the same adaptation mechanisms applied to the FDE-RNN backbone. The reconstruction objective is used for TEA. Best value in each column is in bold.

Model	Acc. (%)	Recall	Prec.	ns&ntss	tss	mAP
FDE-RNN (reproduced)	86.85	0.847	0.834	0.9587	0.9101	0.9419
+ TTT	87.17	0.853	0.836	0.9583	0.9136	0.9434
+ TTT + Gauss	87.53	0.858	0.841	0.9598	0.9151	0.9439
+ TTT + TEA	87.25	0.859	0.834	0.9585	0.9148	0.9440
+ TTT + TEA + Gauss	87.55	0.863	0.842	0.9602	0.9168	0.9443

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Copyright: This open access article is published under a Creative Commons CC BY 4.0 license, which permits free download, distribution, and reuse, provided that the author and preprint are cited in any reuse.
