1. Introduction
Voice-based human–machine interaction has become a central interface in modern smart devices, creating sustained demand for robust and efficient speech processing front-ends. Voice Activity Detection (VAD), which classifies incoming audio frames as speech or non-speech, is a fundamental preprocessing stage for downstream systems such as automatic speech recognition and voice assistants [1, 2]. Beyond filtering out non-speech regions, VAD also plays an important role in reducing power consumption and managing computational resources on resource-constrained devices.
Conventional VAD, however, cannot determine whether detected speech belongs to a specific target speaker, a limitation that becomes critical in multi-speaker environments. Personal Voice Activity Detection (PVAD) was introduced to address this gap, extending the binary VAD formulation into a three-way classification over target speaker speech (tss), non-target speaker speech (ntss), and non-speech (ns) [3]. By conditioning on a target speaker embedding derived from a speaker verification model, PVAD systems can respond selectively to a designated speaker while suppressing competing voices, making them well suited to personal voice assistant applications. A substantial body of work has refined how speaker embeddings are fused with acoustic features [4, 5, 6, 7], culminating most recently in the Flexible Dynamic Encoder RNN (FDE-RNN) [8] and its selective state-space successor, FDE-Mamba [9], which together represent the current state of the art on this benchmark.
Despite these architectural advances, essentially all PVAD systems share a common deployment assumption: once trained, the model is frozen and operates with a fixed, pre-enrolled speaker embedding. The model parameters do not change in response to the audio actually encountered at inference time, and the speaker representation is computed once from enrollment speech and then held constant.
This static design is particularly consequential for PVAD, more so than for ordinary VAD. Because PVAD makes a speaker-conditioned decision, its accuracy depends not only on how well the model separates speech from non-speech, but also on how faithfully the fixed enrollment embedding represents the target speaker as they actually sound in the test utterance, and on whether the acoustic conditions of that utterance resemble those seen during training. Both assumptions are fragile in practice. The acoustic environment at test time—noise, reverberation, recording channel, and device—routinely differs from the training distribution, shifting the feature statistics on which the speaker-discrimination decision is based. A speaker’s voice is itself non-stationary: speaking rate, vocal effort, emotional state, and health vary across sessions, so an embedding enrolled on one occasion may be a poor match for the same speaker on another. The enrollment utterance may also be short, noisy, or recorded under conditions unlike the test signal, yielding an embedding that is biased from the outset. In each of these cases the mismatch falls precisely on the speaker-conditioning pathway that distinguishes PVAD from VAD, and a model that cannot adapt at inference time has no recourse: it must rely entirely on the generalization acquired during offline training, even when the test utterance in front of it carries information that could be used to correct the mismatch. This observation motivates asking whether PVAD can instead adapt to each test utterance as it is processed.
A growing line of research addresses exactly this problem through test-time adaptation, a family of methods in which a trained model adjusts itself to the test distribution using only the unlabeled test signal. The central difficulty is that adaptation must be driven by a signal that is available at inference, and methods differ in which self-supervised signal they use and which parameters they update. Entropy minimization, exemplified by TENT [10], updates a small set of model parameters (typically the affine parameters of normalization layers) so as to reduce the entropy of the model’s own predictions, on the premise that confident predictions tend to be correct ones; it has been applied to speech recognition and to speaker adaptation for lip reading [11], usually with the adapted parameters reset after each utterance. Test-Time Training (TTT) takes a different route. Introduced by Sun et al. [12] to combat train–test distribution shift, it pairs the main task with an auxiliary self-supervised task during training; at inference, the self-supervised task is used to fine-tune the model on each test input before the main prediction is made, so that the representation is specialized to the local statistics of that input. A recent reformulation [13] casts TTT as a sequence-modeling layer whose hidden state is itself a small machine-learning model: the hidden state is a weight matrix W, and the update rule is a step of gradient descent on a self-supervised reconstruction loss applied at every position, so that W keeps learning even as it processes a test sequence. This view is appealing for a streaming, speaker-conditioned task such as PVAD for two reasons. First, the self-supervised reconstruction signal is computed directly from the incoming features, so adaptation is driven by the very utterance whose speaker decision is at stake, which is exactly the information a frozen model discards. Second, because the adapted state is a weight matrix updated online, it can absorb a distribution shift expressed jointly across many feature dimensions, rather than requiring the shift to be attributed to any single pre-specified factor. TTT has shown benefits in language modeling [13], computer vision [12], and more recently in speech tasks such as speech enhancement [14] and speech editing, where adapting to unseen noise or speaker conditions at inference improves output quality.
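As a minimal illustration of the update rule described above, the following pure-Python sketch treats the hidden state as a weight matrix (called W here) and takes one gradient step on a reconstruction loss at every position of the sequence. It is not the implementation of [13] or of this work: the identity projections, the corruption that zeroes the second half of each frame, the step size eta, and all function names are assumptions made purely for illustration.

```python
# Toy TTT-style layer: the hidden state W is a d x d weight matrix that takes
# one gradient step on a reconstruction loss at every position of a sequence.
# Assumptions (illustrative only): identity projections, a corruption that
# zeroes the second half of each frame, and a fixed step size eta.

def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def corrupt(x):
    """Hypothetical corruption: keep the first half of the frame, zero the rest."""
    half = len(x) // 2
    return x[:half] + [0.0] * (len(x) - half)

def recon_loss(W, x):
    """Self-supervised loss ||W c - x||^2 with c = corrupt(x)."""
    pred = matvec(W, corrupt(x))
    return sum((p - t) ** 2 for p, t in zip(pred, x))

def ttt_step(W, x, eta):
    """One gradient-descent step on recon_loss with respect to W."""
    c = corrupt(x)
    err = [p - t for p, t in zip(matvec(W, c), x)]
    # The gradient with respect to W is the outer product 2 * err * c^T.
    return [[w - eta * 2.0 * e * cj for w, cj in zip(row, c)]
            for row, e in zip(W, err)]

def ttt_forward(frames, eta=0.05):
    """Process a sequence, updating W at every position before reading out."""
    d = len(frames[0])
    W = [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]
    outputs = []
    for x in frames:
        W = ttt_step(W, x, eta)       # hidden-state update (the "learning")
        outputs.append(matvec(W, x))  # read-out with the updated state
    return W, outputs
```

Calling `ttt_forward(frames)` on a list of equal-length feature vectors returns the final weight matrix and one output per frame. No labels are involved: the only training signal is the reconstruction error computed from the incoming frames themselves.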
To date, however, the behavior of these adaptation mechanisms on PVAD has not been investigated. PVAD differs from previously studied tasks in two respects that make this an open question rather than a foregone conclusion: it is an inherently speaker-dependent task, and its decision depends jointly on the acoustic signal and an externally supplied speaker representation. It is therefore not obvious a priori whether test-time adaptation helps, how large any benefit is, or—given that PVAD has two adaptable surfaces, the internal features and the speaker embedding—which part of the system should adapt. These are empirical questions specific to PVAD, and answering them is the purpose of this study.
In this work we conduct an empirical study of test-time adaptation for PVAD, built on the FDE-Mamba backbone. We investigate two complementary mechanisms. The first, a VAD-gated TTT adapter, instantiates the TTT-Linear formulation [13] within the personalization pathway, adapting an internal weight matrix on the speaker-conditioned feature stream of each utterance. Crucially, this adapter operates on features after speaker-embedding fusion, so the distribution shift it absorbs reflects the combined influence of acoustic conditions, speaking style, and speaker characteristics on the test signal, rather than any single factor in isolation. We tailor the adapter to PVAD with a VAD-probability gate, which restricts adaptation to frames classified as speech, and an exponential moving-average update, which stabilizes the weight trajectory during inference. The second mechanism, TEA (Test-time Embedding Adaptation), performs test-time adaptation of the target speaker embedding: with all model parameters frozen, it updates the d-vector itself through self-supervised objectives, directly targeting the enrollment–test mismatch. Throughout, we treat the study as a measurement exercise: we quantify each component’s contribution independently and in combination, report a recall–precision trade-off that adaptation introduces, and separately isolate the effect of a post-hoc Gaussian smoothing step so that its contribution is not conflated with that of the adaptation mechanisms.
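The gating and smoothing ideas behind the first mechanism can be sketched as follows. This is a hedged illustration rather than the adapter itself, which is specified in Section 3: the hard threshold tau on the speech probability, the EMA coefficient beta, the step size eta, the plain reconstruction loss ||W x - x||^2, and every function name are hypothetical choices made for the example.

```python
# Sketch of a speech-gated, EMA-smoothed online weight update.
# Assumptions (illustrative only): hard gate at threshold tau, EMA coefficient
# beta, step size eta, and a plain reconstruction loss ||W x - x||^2.

def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def recon_error(W, x):
    """Squared reconstruction error of frame x under weights W."""
    return sum((p - t) ** 2 for p, t in zip(matvec(W, x), x))

def gated_ema_update(W, x, p_speech, eta=0.05, beta=0.9, tau=0.5):
    """Return the weights after seeing frame x with speech probability p_speech."""
    if p_speech < tau:
        return W  # gate closed: non-speech frames do not drive adaptation
    err = [p - t for p, t in zip(matvec(W, x), x)]
    # Candidate weights after one gradient step on the reconstruction loss.
    W_fast = [[w - eta * 2.0 * e * xj for w, xj in zip(row, x)]
              for row, e in zip(W, err)]
    # The EMA blends old and candidate weights, damping the weight trajectory.
    return [[beta * w_old + (1.0 - beta) * w_new
             for w_old, w_new in zip(row_old, row_new)]
            for row_old, row_new in zip(W, W_fast)]

def adapt_utterance(frames, speech_probs, d):
    """Run the gated update over one utterance, starting from zero weights."""
    W = [[0.0] * d for _ in range(d)]
    for x, p in zip(frames, speech_probs):
        W = gated_ema_update(W, x, p)
    return W
```

In this simplified form the EMA acts as a damping factor on each step, and frames whose speech probability falls below tau leave the weights untouched, so silence and background noise cannot pull the adapted state away from the speech statistics.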
The contributions of this paper are as follows:
To the best of our knowledge, this is the first work to apply Test-Time Training to PVAD. We adapt the TTT-Linear formulation with a VAD-probability gate and an exponential moving-average update tailored to the speech-dependent, streaming nature of PVAD, and we clarify that the adapter absorbs a composite distribution shift over the speaker-conditioned feature stream rather than adapting the speaker embedding alone.
We introduce TEA (Test-time Embedding Adaptation), a test-time speaker embedding adaptation scheme for PVAD. While entropy-minimization-based test-time adaptation has been studied for other speech tasks, we are, to our knowledge, the first to apply it to PVAD and, distinctively, to adapt the target speaker d-vector itself rather than the model’s internal parameters.
We present a systematic empirical study that quantifies the independent and combined effect of each mechanism, analyzes the recall–precision trade-off they induce, isolates the contribution of post-hoc Gaussian smoothing, and shows that the observed behavior holds across both an LSTM-based (FDE-RNN) and a Mamba-based (FDE-Mamba) backbone. We report the gains as consistent but modest, and discuss the limitations that this implies for adaptive PVAD.
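To make the second contribution concrete, the sketch below adapts only a speaker embedding by gradient descent on prediction entropy while a classifier stays frozen. It is a toy under stated assumptions, not the TEA procedure: the linear three-class classifier z = A x + B e, the use of entropy as the sole objective, the learning rate, the number of steps, and all names are hypothetical.

```python
import math

# Toy embedding adaptation: the classifier (A, B) is frozen and only the
# speaker embedding e is updated to reduce the mean prediction entropy.
# Assumptions (illustrative only): linear classifier z = A x + B e over the
# three classes (tss, ntss, ns), entropy as the only objective, fixed lr.

def softmax(z):
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    s = sum(exps)
    return [v / s for v in exps]

def entropy(p):
    return -sum(pi * math.log(pi) for pi in p if pi > 0.0)

def logits(A, B, x, e):
    """Frozen classifier: one logit per class from frame x and embedding e."""
    return [sum(a * v for a, v in zip(ra, x)) + sum(b * v for b, v in zip(rb, e))
            for ra, rb in zip(A, B)]

def mean_entropy(A, B, frames, e):
    return sum(entropy(softmax(logits(A, B, x, e))) for x in frames) / len(frames)

def entropy_grad(A, B, frames, e):
    """Gradient of the mean entropy with respect to the embedding e."""
    grad = [0.0] * len(e)
    for x in frames:
        p = softmax(logits(A, B, x, e))
        h = entropy(p)
        # For softmax outputs, dH/dz_j = -p_j * (log p_j + H).
        dz = [-pj * (math.log(pj) + h) if pj > 0.0 else 0.0 for pj in p]
        # Chain rule through z = A x + B e gives dH/de = B^T dH/dz.
        for k in range(len(e)):
            grad[k] += sum(B[j][k] * dz[j] for j in range(len(dz))) / len(frames)
    return grad

def adapt_embedding(A, B, frames, e, lr=0.05, steps=5):
    """Return an adapted copy of e; A and B are never modified."""
    e = list(e)
    for _ in range(steps):
        g = entropy_grad(A, B, frames, e)
        e = [ek - lr * gk for ek, gk in zip(e, g)]
    return e
```

The point of the design is that the update touches only the embedding vector, so the adaptation has very few degrees of freedom and the enrolled embedding can simply be restored for the next utterance.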
The remainder of this paper is organized as follows. Section 2 reviews PVAD and test-time adaptation. Section 3 describes the FDE-Mamba backbone and the two adaptation mechanisms. Section 4 details the experimental setup, and Section 5 reports the main results against the FDE-Mamba backbone, an ablation of each component, the computational cost, and a generalization check on the FDE-RNN backbone, followed by a discussion of limitations. Section 6 concludes.