
Speculative Decoding for Multimodal Models: A Survey

Submitted: 16 April 2026
Posted: 20 April 2026


Abstract
Multimodal generative models have demonstrated remarkable capabilities across diverse domains, from visual understanding and image generation to video processing, audio synthesis, and embodied control. These capabilities, however, incur substantial inference overhead due to autoregressive decoding or iterative generation, which is further compounded by modality-specific challenges such as extensive visual token redundancy, strict real-time latency constraints in robotic control, and prolonged sequential generation in text-to-image synthesis. Speculative decoding has emerged as a promising paradigm to accelerate inference without degrading output quality, yet existing surveys remain focused on text-only large language models. In this survey, we provide a systematic and comprehensive review of speculative decoding methods for multimodal models, spanning Vision-Language, Vision-Language-Action, Video-Language, Speech, Text-to-Image (Vision Auto-Regressive), and Diffusion models. We organize the literature into a unified taxonomy with two primary axes, covering the draft generation stage and the verification and acceptance stage, complemented by an analysis of inference framework support. Through this taxonomy, we identify recurring cross-modal design patterns, including token compression, KV cache optimization, target-informed transfer, drafter-target alignment, verification cost reduction, relaxed acceptance, and verify-to-draft feedback, and examine how successful techniques transfer across modalities. We further provide a systematic comparison of existing methods under both self-reported and standardized benchmarking settings. Finally, we discuss open challenges and outline future directions. We have also created a GitHub repository that organizes the papers featured in this survey at https://github.com/zyfzs0/Multimodal-Models-Speculative-Decoding-Survey, and we will actively maintain it as new research emerges. We hope this survey can serve as a valuable resource for researchers and practitioners working on accelerating multimodal inference.

1. Introduction

The sequential nature of autoregressive generation and the iterative nature of diffusion processes severely limit the practical deployment of multimodal generative models, despite their high fidelity in complex reasoning and synthesis tasks. High-resolution images, long video streams, high-frequency robotic control loops, streaming audio codecs, and iterative diffusion trajectories amplify the standard LLM memory-bandwidth bottleneck into a multimodal scaling wall. Visual token sequences exceeding 1000 tokens lead to rapid KV cache growth and substantial prefill costs; Vision–Language–Action (VLA) models face strict real-time control latency limits; speech models must meet stringent time-to-first-audio requirements under multi-codebook generation; Text-to-Image (Vision Auto-Regressive, T2I) models suffer from prolonged token-by-token generation times; and Diffusion Transformers incur high per-step compute across many iterative denoising steps. Together, these factors make sequential and iterative inference latency a primary obstacle to practical multimodal deployment.
Speculative decoding accelerates autoregressive generation without degrading output quality [1,2,3]. A lightweight draft model efficiently proposes K candidate tokens, and a full-capacity target model verifies these tokens in parallel. This paradigm shifts the computational burden from memory-bandwidth-bound sequential generation to compute-bound batch verification.
Despite the rapid proliferation of speculative decoding techniques, no existing survey addresses their application beyond text-only LLMs [3]. Extending speculative decoding to multimodal models is not a straightforward application of text-based methods. Multimodal speculation requires specialized drafting architectures that handle large cross-modal contexts and multi-scale generation. It also demands new verification criteria that relax strict probabilistic matching in favor of feature-level thresholds, phrase-level semantics, perceptual tolerance, and continuous-space coupling. As shown in Figure 1, innovations in Vision–Language Models (VLMs), Vision–Language–Action (VLA) agents, Video–Language models, Speech systems, T2I (Vision AR) generators, and Diffusion-based generators are siloed within their respective sub-communities, lacking a unified framework. Figure 2 contrasts the standard text-only speculative decoding pipeline with the architectures adopted in each multimodal domain, highlighting the domain-specific adaptations required for drafting and verification.
The goal of this survey is to provide a unified overview of speculative decoding for multimodal models. As illustrated in Figure 3, we organize the literature around two primary axes, the draft generation stage and the verification and acceptance stage, and complement them with an analysis of inference framework support. Together, these components address distinct yet interconnected research questions and provide a systematic view of multimodal speculative decoding. Specifically,
  • Draft Generation (Section 3): The draft generation stage determines how candidate tokens are produced efficiently. We survey methods along three dimensions: draft architecture (independent drafters, shared-backbone drafters, and drafter-free speculation), draft execution strategies (multi-token expansion and multi-candidate branching), and draft optimization techniques related to token compression, KV cache optimization, target-informed transfer, and drafter-target alignment.
  • Verification and Acceptance (Section 4): The verification stage determines how drafted tokens are validated against the target model. We survey verification execution strategies (linear, tree-based, path-level, and iterative verification) and optimization techniques including cost reduction, relaxed acceptance criteria, and verify-to-draft feedback loops.
  • Inference Frameworks (Section 5): We survey existing frameworks that provide system-level support for speculative decoding, covering their unique features and optimizations for multimodal workloads.
We further provide a systematic comparison of existing methods under both self-reported and standardized benchmarking settings (Section 6), and discuss open challenges to outline future directions (Section 7).

2. Background

2.1. Standard Speculative Decoding

Standard speculative decoding accelerates autoregressive inference by decomposing generation into a draft-then-verify paradigm [1,2,3]. Given a context $x_{<t}$, a lightweight draft model $M_{\text{draft}}$ with distribution $p_{\text{draft}}(x \mid x_{<t})$ autoregressively generates $K$ candidate tokens $\tilde{x}_t, \ldots, \tilde{x}_{t+K-1}$. The full-capacity target model $M_{\text{target}}$ with distribution $p_{\text{target}}(x \mid x_{<t})$ then verifies all $K$ candidates in a single parallel forward pass. Each candidate token $\tilde{x}_{t+i}$ is accepted with probability:

$$\alpha_{t+i} = \min\left(1, \frac{p_{\text{target}}(\tilde{x}_{t+i} \mid x_{<t+i})}{p_{\text{draft}}(\tilde{x}_{t+i} \mid x_{<t+i})}\right). \tag{1}$$

The first rejected token truncates the drafted sequence. The target model then resamples a token from a modified (residual) distribution to correct the error, and the process repeats. This exact-match acceptance criterion ensures that the final output distribution matches $p_{\text{target}}(x \mid x_{<t})$ exactly.

The expected speedup hinges on the token acceptance rate $\mathbb{E}[\alpha]$ and the computational overhead of generating $K$ tokens via $M_{\text{draft}}$. While effective for text, applying this exact discrete formulation to multimodal contexts often yields a suboptimal $\mathbb{E}[\alpha]$ due to domain mismatches, necessitating the specialized mechanisms categorized in this survey.
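To make the draft-then-verify loop concrete, the sketch below implements one speculation round with the acceptance rule of Equation (1) and residual resampling on rejection. It is a minimal illustration: the `drafter(ids)` and `target(ids)` interfaces, which return a next-token probability distribution for every prefix position, are hypothetical stand-ins rather than any particular library's API.

```python
import torch

def speculative_decode_step(target, drafter, ctx, K):
    """One draft-then-verify round of standard speculative decoding."""
    # 1) Draft: sample K candidate tokens autoregressively from the drafter.
    ids, draft_q = ctx.clone(), []
    for _ in range(K):
        q = drafter(ids)[-1]                          # q(. | prefix), shape [vocab]
        ids = torch.cat([ids, torch.multinomial(q, 1)])
        draft_q.append(q)

    # 2) Verify: a single parallel target pass scores all K candidates.
    p_all = target(ids)                               # [len(ctx) + K, vocab]
    out = []
    for i in range(K):
        tok = ids[len(ctx) + i]
        p, q = p_all[len(ctx) - 1 + i], draft_q[i]
        if torch.rand(()) < torch.clamp(p[tok] / q[tok], max=1.0):
            out.append(int(tok))                      # accept w.p. min(1, p/q), Eq. (1)
        else:
            # Reject: resample from the normalized residual max(0, p - q),
            # which keeps the overall output distribution exactly p_target.
            r = torch.clamp(p - q, min=0.0)
            out.append(int(torch.multinomial(r / r.sum(), 1)))
            return out                                # truncate at the first rejection
    out.append(int(torch.multinomial(p_all[-1], 1)))  # bonus token if all K accepted
    return out
```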

2.2. Multimodal Generation Architectures

We analyze the distinct architectures and computational bottlenecks across six multimodal domains.
Vision–Language Models (VLMs) fuse pre-trained visual encoders (e.g., ViT [4]) with large language models [5,6]. High-resolution images are tokenized into long sequences (often 1000+ tokens per image), causing substantial prefill overhead and rapid memory exhaustion during the autoregressive phase. Visual token redundancy offers unique opportunities for draft compression.
Vision–Language–Action Models (VLAs) map visual input and textual instructions directly to low-level robotic control tokens [7]. Operating in physical environments imposes hard, real-time inference latency limits that pure autoregressive decoding struggles to meet.
Video–Language Models (VideoLMs) extend VLMs across the temporal dimension, processing streams of frame-level visual tokens [8]. Token count scales linearly with video length, exacerbating Key-Value (KV) cache bottlenecks and introducing a requirement for temporal coherence during drafting.
Speech and Audio Models generate discretized audio using neural audio codecs representing waveforms at multiple quantization levels [9]. Because human hearing processes audio semantically, these models exhibit perceptual tolerance: acoustically equivalent sounds can have differing discrete representations, violating the premise of Equation (1) and requiring relaxed verification criteria.
Text-to-Image (Vision Auto-Regressive, T2I) Models synthesize images by predicting discrete codebook indices within an autoregressive transformer [10]. These models suffer from prolonged generation times due to long flattened token sequences, motivating speculative acceleration of the sequential token prediction process.
Diffusion Models generate structured outputs by iteratively removing noise from continuous latent spaces [11,12]. Because they do not operate on discrete vocabularies or standard left-to-right decoding, they require distinct “proposal-and-verification” paradigms framed around trajectory dynamics rather than token-level probabilities. Diffusion is not naturally compatible with prefix-wise verification: it operates over continuous trajectories without a discrete prefix structure, making partial correctness and local rollback ill-defined. Diffusion models are also widely used for text-to-image generation through latent diffusion; however, because their speculative decoding mechanisms differ fundamentally from autoregressive approaches, they are treated as a separate category in this survey.

3. Draft Generation Stage

The draft generation stage directly determines overall speedup through two factors: the speculation accuracy of the drafter, measured by the average number of accepted tokens per step, and drafting latency. Balancing high speculation accuracy against low drafting latency is particularly challenging in the multimodal setting, as each modality introduces distinct constraints on candidate generation.
In this section, we classify various drafting strategies following the taxonomy in Figure 3 into three dimensions: draft architecture (Section 3.1), draft execution (Section 3.2), and draft optimization (Section 3.3).

3.1. Draft Architecture

Figure 4 illustrates the structural overview of these categories, Figure 5 presents the sub-taxonomy of draft architecture, and Table 1 summarizes their formulations.

3.1.1. Independent Drafter

Independent drafters balance speculation accuracy against drafting efficiency by decoupling the draft model from the target. In the multimodal setting, the key design question is how the drafter handles visual and other non-text inputs. We categorize independent drafters by their treatment of these modalities.
Text-Only Independent Drafters
Text-only drafters entirely discard visual and acoustic inputs during drafting, simplifying the drafting distribution to $p_{\text{draft}}(y \mid x_{\text{text}})$, where $y$ denotes the output token and $x_{\text{text}}$ the textual input.
SPD-MLLM [13] provides the foundational feasibility proof: applying speculative decoding to a VLM with a language-only drafter achieves meaningful speedup, demonstrating that the draft stage need not process image tokens at all, since only the target must verify correctness.
Vision–Language Independent Drafters
To mitigate the acceptance-rate degradation caused by discarding non-text inputs, this family of drafters explicitly processes multimodal tokens, modeling $p_{\text{draft}}(y \mid x_{\text{text}}, x_{\text{vision}})$. Design choices range from lightweight visual adaptors to full multimodal projectors, and from purely independent forward passes to architectures that receive precomputed features from the target to avoid redundant visual encoding.
ViSpec [14] introduces a Q-Former-style [15] vision adaptor that compresses visual features into a small set of query tokens and a global visual vector injected at every text position. SpecVLM [16] addresses the prefill bottleneck through an elastic visual compressor supporting multiple adaptive compression modes. HiViS [17] removes explicit visual token processing from the drafter entirely, instead conditioning on a precomputed semantic embedding exported by the target model. SpecFLASH [18] combines latent-aware visual token compression with semi-autoregressive drafting, allowing it to propose blocks of candidate tokens at once. MSD [19] decouples text and vision in the drafting pipeline, processing each modality according to its characteristics, while MASSV [20] trains a compact independent LM equipped with a lightweight multimodal projector. AASD [21] employs a dedicated speculative module that reuses the target’s KV cache via learned projections. DREAM [22] conditions the drafter on target intermediate features via adaptive cross-attention. STAR [23] reframes drafter design as a neural architecture search (NAS) problem [24], jointly optimizing architecture and target interaction. Beyond single-drafter designs, IbED [25] (In-batch Ensemble Drafting) runs multiple independent strategies as a batch, and TABED [26] extends this with training-free test-time adaptation that selects ensemble weights minimizing KL divergence against the target distribution. EdgeSD [27] introduces a vision-decoding disaggregation (VED) architecture for edge-cloud VLM speculative decoding, decoupling the visual encoder and LLM backbone across separate edge servers; it further contributes bandwidth-aware dynamic image token merging and an adaptive token tree via delta-stepping. HSD [28] (Hierarchical Speculative Decoding) targets the document parsing scenario, where VLMs must generate long structured outputs (e.g., full-page Markdown): a lightweight pipeline-based parser serves as the draft model, the page is partitioned into semantically independent regions whose draft–verify cycles execute in parallel, and a second page-level verification stage reassembles all accepted region outputs against the full VLM with a tolerance-based acceptance criterion (τ-matching).
Vision–Language–Action Independent Drafters
Spec-VLA [29] applies speculative decoding to Vision–Language–Action models by pairing a compact VLA drafter with the full-scale target policy; the drafter generates candidate action tokens autoregressively, which the target verifies under a relaxed acceptance criterion that tolerates functionally equivalent control signals. KERV [30] (Kinematic-Rectified Speculative Decoding) further advances VLA speculation by combining token-domain drafting with kinematic-domain compensation: when verification rejects a draft token, KERV activates a Kalman Filter that predicts the remaining actions from the robot’s kinematic history, avoiding GPU-side recomputation entirely. HeiSD [31] introduces hybrid speculative decoding for VLA models, dynamically switching between retrieval-based and drafter-based SD according to trajectory kinematics. A kinematic-based fused metric combining curvature radius and cumulative displacement determines the hybrid boundary: high-metric (straight, fast-moving) segments use retrieval-based SD for near-zero drafting overhead, while low-metric (curved, fine-grained) segments fall back to a trained drafter. To make retrieval-based SD practical, HeiSD introduces an adaptive verify-skip mechanism that selectively bypasses verification when feature similarity to historical trajectories is high, and a sequence-wise relaxed acceptance strategy that groups kinematically correlated tokens and accepts the entire sequence when its aggregate bias remains small.
Video–Language Independent Drafters
Sparrow [32] targets long-video scenarios where extensive visual token sequences cause attention dilution, offloading visual computation to the target model and eliminating the drafter’s visual KV cache through target-informed knowledge transfer and attention constraints (detailed in Section 3.3). To handle the multi-document retrieval setting, VideoSpeculateRAG [33] introduces speculative decoding into the video RAG pipeline: each retrieved document is independently paired with the video and question and processed in parallel by a lightweight draft VLM, with a larger verifier scoring each candidate through δ-tolerant reliability filtering followed by entity-alignment reranking. Addressing the orthogonal challenge of pipeline efficiency, ParallelVLM [34] pairs a same-family smaller VLM as a training-free independent drafter with a fully parallel pipeline where prefilling and decoding of draft and target execute concurrently, hiding draft overhead under the target’s latency. An Unbiased Verifier-Guided Pruning (UV-Prune) strategy selects draft visual tokens based on vision–text similarity variations across the target model’s early layers, avoiding the positional bias of attention-guided pruning.
Speech Independent Drafters
Speech speculative decoding exploits the low-entropy, strongly conditioned nature of acoustic generation. SSD [35] (Speech Speculative Decoding) employs a compact, independently trained audio language model as the drafter, generating candidate codec token sequences that are then verified by the full TTS model. SpecASR [36] applies a related paradigm to automatic speech recognition, pairing a lightweight draft ASR model with the full target recognizer. SMUD [37] introduces an alternative paradigm: rather than using a separate draft model, a CTC greedy search provides a pseudo-draft by generating a preliminary sequence and masking low-confidence regions. Mask boundaries are then refined via a single non-autoregressive decoder forward pass, acting as an efficient one-shot pseudo-draft step. UGSD [38] (Uncertainty-Guided Speculative Decoding) frames speech emotion captioning as an edge–cloud collaborative pipeline: a lightweight SALM drafts captions on-device, and only token blocks whose maximum entropy exceeds a threshold are escalated to a cloud-side LALM verifier. The verifier applies a rank-based acceptance rule, and the block length adapts dynamically based on recent acceptance history.
Text-to-Image (Vision AR) Independent Drafters
LANTERN [39] and LANTERN++ [40] train compact visual AR models as drafters for image generation, retaining the standard dual-model framework while introducing relaxed latent-space acceptance to overcome the low token-match rates associated with visual codebook prediction. GSD [41] pairs an independent drafter with dynamic token clustering, grouping visually equivalent tokens to boost acceptance without training. VVS [42] and CSpD [43] similarly employ independent drafters but shift their innovations to verification skipping and continuous-density acceptance, respectively. MuLo-SD [44] (Multi-Scale Local Speculative Decoding) introduces a multi-scale drafting paradigm: a low-resolution AR model generates coarse draft tokens that are expanded to the target resolution via a trained up-sampler, exploiting the natural hierarchy of image resolutions. The verification-side innovations of these T2I methods, including latent-space neighborhood acceptance, grouped acceptance, and local spatial relaxation, are analyzed in Section 4.2.2.
The optimization strategies underlying these independent drafters, including visual token compression, KV cache sharing, target-informed knowledge transfer, and drafter–target alignment, are discussed later in this section, while their verification-side design choices are covered in Section 4.

3.1.2. Shared-Backbone Drafter

Several methods eliminate the separate draft model by repurposing the target model itself for efficient drafting. This shared-backbone approach reduces the two-model system to a single model that operates at two computational granularities.
In the VLM domain, FastVLM [45] eliminates the separate draft model entirely through self-speculative decoding. The first n layers of a single VLM backbone serve as the drafter; the full L-layer model performs verification. Because draft and verification share the same weights and layer ordering, computation from the first n layers transfers directly to the verification pass; an imitation network bridges the representation gap between stages.
In the VLA domain, SpecPrune-VLA [46] implements self-speculative decoding through action-aware visual token pruning: the pruned model serves as its own drafter within a shared backbone, eliminating the need for a separate draft model, a practical advantage for embodied deployment where memory is constrained.
In the VideoLM domain, STD [47] implements a KV-split variant of shared-backbone drafting. Rather than splitting by layer depth, it splits by attention density: the same backbone serves both roles, but drafting uses a sparse subset of attention entries while verification restores the full dense attention. This approach exploits the empirical observation that VideoLM attention is consistently sparse during decoding, making a single backbone sufficient for both fast drafting and accurate verification. Rather than splitting attention, HIPPO [48] realizes shared-backbone drafting through pipelined overlapping: it overlaps target verification of batch t with draft generation of batch t + 1 , hiding verification latency behind drafting computation via a double-buffer KV cache management scheme. Its algorithmic contribution, holistic video token scoring that fuses global semantic relevance, temporal redundancy, and spatial redundancy signals, is detailed in Section 3.3.1.
In the speech domain, Codec-MTP [49] equips the target model’s internal layers with lightweight multi-token prediction heads, enabling it to self-propose blocks of future codec tokens in a single forward pass. CTC-SSD [50] (Self-Speculative Decoding) reuses the CTC encoder head of a speech-aware LLM as the drafter: the greedy CTC hypothesis is generated non-autoregressively, and high-confidence outputs are accepted under a relaxed criterion. When verification fails, AR decoding resumes from the accepted prefix. This three-stage pipeline simultaneously improves WER (through complementary CTC-LLM error patterns) and accelerates inference.
In the T2I (Vision AR) domain, the SJD family [51,52,53,54] applies this self-drafting philosophy to visual autoregressive generation through Jacobi-style parallel prediction. SJD [51] initializes multiple token positions simultaneously and iterates the target AR model in Jacobi mode, accepting tokens that reach probabilistic convergence; no auxiliary model is required. SJD++ [52] reuses high-confidence tokens across iterations to accelerate convergence, while MC-SJD [53] stabilizes convergence via maximal coupling. SJD-PV [55] inherits the Jacobi self-drafting framework, using the target model’s iterative refinement to propose candidate tokens which the target then validates via phrase-level joint acceptance. These methods share the backbone philosophy of FastVLM [45] and STD [47] but differ in mechanism: rather than layer-splitting or KV-splitting, they exploit iterative fixed-point convergence of the full model.

3.1.3. Drafter-Free Speculation

In contrast to the preceding categories, drafter-free methods accelerate diffusion inference without constructing a separate draft model or repurposing a sub-graph of the target. Instead, they exploit mathematical properties of the diffusion process itself, including coupling structure, timestep exchangeability, and feature smoothness, to speculate over multiple steps at once. Because no auxiliary model is trained or maintained, these approaches are structurally lightweight, though their applicability remains specific to continuous generative processes.
Accelerated Diffusion Sampling [56] constructs training-free proposals via reflection maximal coupling: given the current noisy sample $x_t$, it proposes a future sample $x_{t-k}$ by exploiting the geometric structure of the SDE, requiring no learned draft model. ASD [57] discovers that diffusion timesteps are exchangeable under stochastic localization, enabling the model to evaluate multiple timestep orderings in parallel without constructing any draft; the target model itself validates the reordered proposals. SpeCa [58] caches intermediate features across denoising steps and uses Taylor expansion to forecast future activations, converting cached computation into speculative proposals that bypass redundant model evaluations.

3.2. Draft Execution

Beyond the choice of drafting mechanism, the structure of the speculation, i.e., how candidates are organized in width and depth, strongly influences both speculation accuracy and the number of tokens decoded per step. The majority of surveyed methods default to standard single-token autoregressive drafting (or, for diffusion models, the standard single-step denoising schedule), generating one candidate token per forward pass prior to verification. The two sub-categories below, multi-token expansion and multi-candidate branching, highlight methods that innovate beyond this default paradigm by producing multiple tokens or multiple candidate sequences per drafting round. Conceptually, multi-token expansion increases speculation depth along a single trajectory, while multi-candidate branching expands the width of the candidate space. Figure 6 and Figure 7 illustrate the draft execution sub-taxonomy and strategies, and Table 2 contrasts the two approaches.

3.2.1. Multi-Token Expansion

Generating multiple future tokens per forward pass reduces the number of serial draft steps required per speculation round.
In the VLM domain, SpecFLASH [18] equips the drafter with semi-autoregressive heads that produce K tokens simultaneously. Placeholder tokens fill positions for not-yet-generated tokens within a block; blocks are decoded autoregressively (each block conditions on the previous block’s output) while tokens within each block are generated in parallel. This block-wise decoding reduces the number of serial forward passes by a factor of K. AASD [21] and FastVLM [45] adopt standard γ-token (i.e., γ = K draft tokens per step) speculative decoding for parallel verification, although their primary contributions lie in target–draft interaction and backbone sharing rather than dedicated multi-token head design.
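To illustrate this block-wise pattern, a minimal sketch is given below; the `MASK_ID` placeholder token and the single-pass `drafter` interface are illustrative assumptions rather than SpecFLASH's actual implementation.

```python
import torch

MASK_ID = 0  # hypothetical placeholder id for not-yet-generated positions

def draft_block(drafter, ctx, block_size):
    """Semi-autoregressive block drafting: one forward pass fills a whole
    block of placeholders; successive blocks condition on earlier blocks."""
    pads = torch.full((block_size,), MASK_ID, dtype=ctx.dtype)
    logits = drafter(torch.cat([ctx, pads]))   # [len(ctx) + block_size, vocab]
    return logits[-block_size:].argmax(-1)     # K draft tokens in parallel
```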
In the VideoLM domain, Sparrow [32] employs a multi-token prediction strategy to bridge the training–inference distribution gap, generating multiple draft tokens per step via recursive self-conditioning.
In the speech domain, depth parallelism is the dominant strategy because ASR and TTS are strongly conditioned, low-entropy generation tasks that favor extensive long-sequence speculation over multi-branch exploration. Codec-MTP [49] predicts n future codec tokens in a single forward pass, reducing decoding steps proportionally. SpecASR [36] adaptively extends draft length, dynamically adjusting based on prediction confidence to minimize verification rounds.
In the T2I (Vision AR) domain, the SJD family [51,52,53,54] achieves depth parallelism natively through its Jacobi iteration framework: multiple token positions are predicted simultaneously in each forward pass and iteratively refined until convergence, generating blocks of tokens per round without auxiliary prediction heads. SJD2 [54] further structures each Jacobi round as a denoising trajectory from Gaussian noise. CSpD [43] adapts multi-token block prediction to continuous-valued visual AR models, combining denoising trajectory alignment with token pre-filling to construct density-aligned candidate blocks. MuLo-SD [44] introduces a multi-scale drafting strategy for visual AR models: a low-resolution drafter generates coarse candidate tokens, which are then up-sampled via a learned up-sampler into the target resolution, producing a full-resolution candidate patch from each low-resolution draft token without increasing draft sequence length.

3.2.2. Multi-Candidate Branching

Several methods expand the candidate space by generating multiple draft branches simultaneously, forming a speculation tree that the target verifies in a single pass.
In the VLM domain, Spec-LLaVA [59] constructs a dynamic speculation tree with online structural and budget pruning: structural pruning removes low-probability or redundant branches, while budget pruning limits the tree to the top-n candidate tokens, preventing tree explosion; the target model then verifies this tree structure. Rather than branching at the token level, SV (Speculative Verdict) [60] operates at the reasoning level: multiple lightweight VLMs independently generate diverse reasoning paths, a consensus filter based on negative log-likelihood scores selects high-agreement paths, and the verdict model synthesizes a new final answer in a single inference call, functioning as an evidence synthesizer rather than a voter. While the above methods use fixed or pre-defined tree structures, SAGE [61] dynamically adjusts the tree shape using the drafter’s prediction entropy at each step, allocating wider branching for high-entropy (uncertain) predictions and deeper speculation for low-entropy predictions. EdgeSD [27] takes adaptivity further by formulating tree generation as a single-source shortest path (SSSP) problem, solved via a parallel delta-stepping algorithm that maximizes the expected number of accepted tokens under strict computational budgets of edge servers.
In the VideoLM domain, SpecVLM (Video) [62] adopts an EAGLE-style [63] static tree structure with tree attention masks; the draft model generates multi-branch candidate trees over pruned video tokens, which the target verifies via structured tree attention. Instead of token-level trees, VideoSpeculateRAG [33] takes a document-parallel branching approach, generating one complete candidate answer per retrieved document and treating each document–video–question triple as an independent branch; the verifier selects the best branch through path-level two-stage scoring. Finally, HIPPO [48] shares the backbone between draft and verification (Section 3.1.2), but its pipelined overlap of consecutive batches across time steps is more naturally viewed as an execution-level multi-candidate strategy that transforms the serial draft-then-verify cycle into a production-line pipeline.
In the T2I (Vision AR) domain, LANTERN++ [40] applies static tree drafting to visual autoregressive generation, constructing a fixed tree topology that enables multi-branch speculation over visual codebook tokens (Section 4.2.2).

3.3. Draft Optimization

As shown in Figure 8 and Figure 9, four recurring optimization strategies emerge that transcend domain boundaries. These patterns, namely token compression, KV cache optimization, target-informed knowledge transfer, and drafter-target alignment, are observed across vision–language, VLA, video, speech, and T2I (Vision AR) systems, where high-dimensional inputs introduce substantial redundancy and require careful draft–target coordination.

3.3.1. Token/Input Compression

Multimodal inputs are fundamentally redundant; drafters that compress aggressively while targets retain full resolution during verification can substantially reduce drafting latency without sacrificing speculation accuracy.
Semantic Visual Compression
The core idea is to replace dense raw visual tokens with a compact semantic surrogate before drafting. ViSpec [14] uses a Q-Former-style [15] adaptor to compress visual tokens into a small set of queries plus a global vector. HiViS [17] eliminates visual-prefill cost entirely, replacing visual tokens with fused semantic embeddings from the target model. SpecFLASH [18] applies latent-aware compression using target sub-top-layer features.
Adaptive and Dynamic Compression
Here the compression ratio is input-dependent rather than fixed, so the drafter can spend computation only where it is most useful. SpecVLM [16] adaptively applies pooling, convolution, or pruning based on token redundancy, supporting multiple compression modes within a single framework. DREAM [22] uses target intermediate features to guide visual token selection, retaining only informative tokens. MSD [19] decouples text and vision processing, effectively bypassing visual tokens during drafting. EdgeSD [27] introduces a bandwidth-aware dynamic image token merging (ITM) method that progressively merges similar image tokens across transformer layers using cosine similarity on key vectors, reducing both computational cost and inter-server transmission latency in vision-decoding disaggregated architectures.
Attention-Based Token Pruning
These methods use attention signals from prefill or early layers to identify low-salience tokens that can be pruned. STD [47] selects the top-K visual KV cache entries per layer and per head based on prefill-stage attention scores for VideoLMs. SpecVLM (Video) [62] applies training-free pruning that retains high-attention tokens via top-p selection while spatially sub-sampling low-attention regions to preserve geometric structure; the feedback mechanism that drives this pruning is detailed in Section 4.2.3. HIPPO [48] fuses three complementary scoring signals (global semantic relevance, temporal redundancy, and spatial redundancy) into a holistic score for video token selection. SpecPrune-VLA [46] extends attention-based pruning to Vision–Language–Action models, fusing global action history with local attention signals to identify which visual tokens are expendable with respect to the current action decision, demonstrating that attention-guided compression applies beyond VLMs.
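The attention-guided selection pattern these methods share can be reduced to a few lines; the sketch below aggregates salience by averaging over heads and query positions, a simplification of the per-layer, per-head rules used by the individual methods.

```python
import torch

def prune_visual_tokens(attn, keep_ratio=0.25):
    """Keep the most-attended visual tokens for the drafter (sketch).
    `attn`: prefill attention from text queries to visual tokens,
    assumed shape [num_heads, num_text_tokens, num_visual_tokens]."""
    salience = attn.mean(dim=(0, 1))               # aggregate over heads and queries
    k = max(1, int(keep_ratio * salience.numel()))
    keep = salience.topk(k).indices.sort().values  # preserve spatial token order
    return keep                                    # indices of retained visual tokens
```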
Visual Computation Elimination
Rather than compressing the visual stream, this strategy removes it from the drafter altogether. Sparrow [32] eliminates the visual KV cache from the drafter entirely via text-anchored window attention (VATA), relying instead on text hidden states that already encode visual semantics distilled from the target model.
Similarity-Based Token Selection
Token retention is driven by cross-modal similarity dynamics rather than raw attention magnitude. ParallelVLM [34] introduces UV-Prune, which replaces attention-score-based token selection with vision–text cosine similarity variation measured across the target model’s early layers. By tracking how each video token’s cross-modal relevance evolves through the network, UV-Prune avoids the positional bias of attention-guided methods and is fully compatible with FlashAttention.

3.3.2. KV Cache Optimization

Multimodal KV caches are substantially larger than text-only caches due to visual and temporal token sequences, making KV management a critical bottleneck for speculative decoding efficiency.
Several complementary strategies address this challenge across modalities. AASD [21] reuses the target’s KV cache directly through learned projections, compressing the multimodal KV before cross-attention to make the speculative module tractable. FastVLM [45] takes a different approach, sharing the backbone between drafter and verifier to maintain a single KV cache for both stages and eliminate redundant KV computation. SpecVLM [16] optimizes KV management at the prefill stage, reducing the memory footprint of visual token caching, while STAR [23] treats KV-affecting design choices (visual token count, attention head count, interaction layer position) as NAS [24] search dimensions, explicitly optimizing KV scale and memory access patterns. STD [47] exploits attention sparsity by selecting only the top-K KV entries, reducing I/O cost while sharing all parameters with the target model so no additional GPU memory is needed.

3.3.3. Target-Informed Transfer

These methods reduce draft–target mismatch by feeding the drafter with signals derived from the target model itself.
DREAM [22] selects target intermediate layers using attention entropy and injects features via cross-attention at each draft step. STAR [23] identifies optimal distillation layers (high attention concentration, low cross-layer variation) and injects target features through the NAS-searched interaction architecture. Sparrow [32] applies target-informed transfer at two stages: at inference time, hidden state reuse (HSR) feeds the drafter with the target’s penultimate-layer text hidden state, which already encodes internalized visual semantics from the target model; at training time, intermediate-layer visual state bridging (IVSB) extracts visual hidden states from the target’s interaction-active middle layers as supervision for the drafter, filtering out low-level visual noise. ParallelVLM [34] transfers the target model’s early-layer vision–text similarity signals to guide draft-side visual token pruning, providing alignment-aware token selection without runtime feature injection overhead.
Draft recycling, where rejected tokens are locally repaired and reused rather than discarded, is a feedback mechanism triggered by the verification stage and is detailed in Section 4.2.3.

3.3.4. Drafter-Target Alignment

Complementing the inference-time signal injection of Target-Informed Transfer, a parallel design pattern aligns the drafter with the target during training, embedding compatibility into the drafter’s weights or architecture so that no runtime overhead is incurred.
Architectural Inheritance
The drafter’s capacity is distilled from the target’s architecture via weight sharing or knowledge transfer. SSD [35] trains a compact audio language model via knowledge distillation from the full TTS target (CosyVoice-2), exploiting the strong acoustic conditioning in speech synthesis to maintain high acceptance rates with minimal drafter capacity. FastVLM [45] bridges the gap between the shallow (n-layer) draft path and the full (L-layer) target through an imitation network trained to mimic the remaining $L - n$ layers, with all backbone parameters frozen.
Feature-Level Distillation
Training explicitly matches the drafter to the target at the logit or hidden-state level. MASSV [20] employs a two-stage training protocol: projector pretraining on paired image-text data followed by self-data distillation from the target VLM, transferring multimodal reasoning capabilities into a compact drafter. Spec-LLaVA [59] applies online logit distillation during drafter training, aligning the drafter’s output distribution with the target’s at each token position. SpecVLM [16] similarly employs online logit distillation alongside its elastic visual compressor, ensuring distribution-level compatibility between drafter and target during decoding.
Representation Alignment
Auxiliary modules are trained so that compressed inputs remain compatible with the target’s internal feature space. ViSpec [14] trains a Q-Former-style [15] vision adaptor to produce compressed visual tokens and a global visual vector aligned with the target’s visual processing. HiViS [17] conditions the drafter on precomputed semantic embeddings exported from the target model, augmented with step-aware residual vectors that encode decoding-position-specific information. SpecFLASH [18] co-trains the drafter with latent-aware compression using the target’s sub-top-layer features, ensuring the compressed visual tokens remain semantically compatible with the target’s expectations.
This training-time alignment pattern is distinct from, and often complementary to, inference-time Target-Informed Transfer. For example, Sparrow [32] combines both: training-time intermediate-layer visual state bridging (IVSB) aligns the drafter’s representations, while inference-time hidden state reuse (HSR) provides runtime target signals.

4. Verification and Acceptance Stage

In each decoding step, the drafted tokens are verified in parallel to ensure the outputs align with the target model. This process determines the number of tokens accepted per step, a key factor impacting the overall speedup. We organize verification into execution strategies (Section 4.1) and verification optimization (Section 4.2).

4.1. Verification Execution

As shown in Figure 10 and Figure 11, and Table 3, this section summarizes various verification execution strategies, encompassing linear verification (Section 4.1.1), tree-based verification (Section 4.1.2), path-level verification (Section 4.1.3), and iterative / Jacobi-style verification (Section 4.1.4).

4.1.1. Linear Verification (Standard)

Standard verification evaluates K draft tokens left to right. The first token whose acceptance probability (Equation (1)) falls below a uniform random threshold terminates the draft; the target model’s sample at that position replaces it, and drafting resumes. This procedure preserves the target distribution exactly.
Linear verification remains the most widely adopted baseline across multimodal speculative decoding.
In the VLM domain, the following methods all employ standard linear verification without modifying the verification algorithm itself: SPD-MLLM [13], MASSV [20], SpecVLM [16], MSD [19], SpecFLASH [18], HiViS [17], AASD [21], DREAM [22], FastVLM [45], STAR [23], IbED [25], and TABED [26].
In the VideoLM domain, STD [47], HIPPO [48], Sparrow [32], and ParallelVLM [34] retain standard Leviathan-style accept/reject rules.
In the speech domain, SSD [35] employs linear verification along a single draft sequence with speech-specific acceptance modifications. UGSD [38] also performs sequential left-to-right verification, combining it with a relaxed rank-based acceptance criterion and uncertainty-gated cloud offloading. CTC-SSD [50] employs linear verification of the CTC draft hypothesis through a single LLM forward pass, accepting the hypothesis if all token likelihoods exceed a threshold; otherwise it falls back to AR decoding from the longest accepted prefix. SMUD [37] dynamically verifies candidates across two parallel decoding hypotheses (“still inside mask” vs. “exited mask”), selecting the winning path based on a mixed CTC-AR probability score.
In the T2I (Vision AR) domain, LANTERN [39], GSD [41], and VVS [42] employ sequential left-to-right verification for visual token generation, though LANTERN and GSD relax the acceptance criterion via latent-space neighborhood or grouped acceptance, and VVS dynamically skips verification for high-confidence tokens.
While the execution strategy in all these methods is linear, many introduce domain-specific acceptance relaxations or cost optimizations, which are discussed later in this section.

4.1.2. Tree-Based Verification

Tree-based verification evaluates multiple candidate continuations simultaneously using structured attention masks, forming a token tree. The target model processes the entire tree in parallel and selects the longest accepted path. Tree-based verification increases per-step cost compared to linear verification but yields more accepted tokens per step, providing net speedups when per-token draft accuracy is moderate.
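The core data structure is the tree attention mask, sketched below: each node attends only to itself and its ancestors, so a single target forward pass scores every branch. The parent-array encoding of the tree is an illustrative convention.

```python
import torch

def tree_attention_mask(parents):
    """Ancestor mask for token-tree verification (sketch).
    `parents[i]` is the parent of node i in drafting order (-1 for the root);
    mask[i, j] is True iff node i may attend to node j."""
    n = len(parents)
    mask = torch.zeros(n, n, dtype=torch.bool)
    for i in range(n):
        j = i
        while j != -1:          # walk from each node up to the root
            mask[i, j] = True
            j = parents[j]
    return mask

# Example: parents = [-1, 0, 0, 1] is a root with two children and one
# grandchild; node 3 attends to positions {0, 1, 3} only.
```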
In the VLM domain, Spec-LLaVA [59] is the primary method employing strict tree-based verification. A token-tree attention mask enables the target to evaluate all branches in a single forward pass. Verification proceeds leaf-to-root: deeper paths are tried first (since they yield more accepted tokens on success), and a mismatch at any depth truncates the speculative block at the failure point. ViSpec [14] similarly uses a tree-based speculative mechanism during generation. SAGE [61] extends this paradigm with entropy-guided shaping that dynamically determines the tree topology the target verifies. EdgeSD [27] employs tree-based verification via masked tree attention on the cloud-side target VLM, verifying the adaptive token tree generated by the edge-side drafter in a single forward pass and selecting the longest accepted branch.
In the VLA domain, HeiSD [31] adapts tree-based verification by constructing a sequence-wise tree from top-K retrieved drafts, where each node represents a kinematically correlated token group (position, rotation, gripper) rather than a single token. Verification proceeds via depth-first search over chains, combining sequence-wise relaxed acceptance with adaptive verify-skip to select the longest acceptable action sequence.
In the VideoLM domain, SpecVLM (Video) [62] adopts EAGLE-style [63] static tree structures with tree attention masks, verifying multi-branch candidate trees generated from pruned video tokens.
In the speech domain, SpecASR [36] employs a two-pass sparse tree structure that branches at positions of high uncertainty and dynamically constructs a tree, moving beyond pure linear sequences to multi-sequence branching for the target to evaluate.
In the T2I (Vision AR) domain, LANTERN++ [40] extends tree-based verification to continuous domains: the target evaluates all branches of a static speculation tree and selects the longest accepted path, where acceptance is defined over codebook embedding neighborhoods rather than exact token matches.

4.1.3. Path-Level Verification

Path-level verification is a coarse-grained consistency check that operates over entire candidate trajectories rather than individual tokens. Several recent methods do not introduce an explicit verifier, but their selection rules play an analogous role by validating global hypotheses against the target model or process.
In the VLM domain, SV (Speculative Verdict) [60] performs trajectory-level validation for reasoning tasks. The verdict model does not perform token-by-token accept/reject decisions. Instead, it receives multiple complete reasoning paths as evidence and synthesizes a new final answer in a single inference call. A consensus filter based on cross-model NLL scores (“how likely would other models find this answer reasonable?”) pre-screens paths before presenting them to the verdict model, reducing its input cost.
In the VideoLM domain, VideoSpeculateRAG [33] applies candidate-level verification to video RAG, verifying full answer candidates rather than token prefixes, effectively treating each retrieved document as an independent speculative branch. Each retrieved document produces an independent candidate answer via a lightweight draft VLM; the verifier then performs two-stage candidate reranking combining tolerance-based reliability filtering with entity-alignment scoring.
In the speech domain, Codec-MTP [49] performs sequence-level selection. Rather than accepting or rejecting individual tokens, Codec-MTP applies HMM/Viterbi global optimal path selection over multiple candidate codec token sequences. This sequence-level verification selects the most likely path given the full generative model, analogous to diffusion’s trajectory-level verification but operating over discrete codec states. A top-k reduction of the state space makes Viterbi path scoring computationally feasible.
In the T2I (Vision AR) domain, CSpD [43] enforces density-level consistency for continuous-valued visual AR models through acceptance-rejection sampling over the draft–target density ratio. SJD-PV [55] (Speculative Jacobi Decoding with Phrase Verification) elevates verification granularity from individual tokens to the phrase level. Observing that visual semantics are encoded across contiguous token sequences, SJD-PV constructs a phrase library via BPE-style iterative merging on large-scale datasets to extract recurring semantic priors. During verification, if a draft sequence matches a library entry, a joint acceptance score $\log R_p = \sum_k \left( \log p(v_k) - \log q(v_k) \right)$ validates the phrase as a single unit, where $R_p$ is the phrase-level acceptance ratio, $v_k$ is the k-th token in the matched phrase, and $p(\cdot)$, $q(\cdot)$ are the target and draft token probabilities, respectively. This joint criterion is more efficient than token-wise verification because aggregation prevents individual low-confidence tokens from prematurely truncating a high-confidence block. As a training-free module, SJD-PV augments existing SJD variants.
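A sketch of this phrase-level test follows directly from the ratio above; interpreting min(1, R_p) as the acceptance probability of the whole phrase is our reading of the criterion rather than a verbatim reproduction of SJD-PV.

```python
import math
import random

def accept_phrase(p_probs, q_probs):
    """Accept a matched phrase as a single unit with probability min(1, R_p),
    where log R_p = sum_k (log p(v_k) - log q(v_k)); `p_probs` and `q_probs`
    are the target and draft probabilities of the phrase tokens."""
    log_rp = sum(math.log(p) - math.log(q) for p, q in zip(p_probs, q_probs))
    return random.random() < min(1.0, math.exp(log_rp))
```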
In the diffusion domain, path-level verification evaluates whether a proposed denoising trajectory segment constitutes a valid draw from the target diffusion process, requiring coupling-based acceptance criteria [56].

4.1.4. Iterative / Jacobi-Style Verification

Iterative verification evaluates whether the Jacobi fixed-point iteration has converged, rather than comparing draft tokens against a separate target model’s output. This paradigm is unique to self-drafting methods that use the target model itself in Jacobi mode.
The SJD family [51,52,53,54] defines probabilistic stability criteria: tokens whose predictions remain stable across consecutive iterations are accepted simultaneously, bypassing the standard draft–target verification framework entirely. MC-SJD [53] strengthens convergence detection by applying maximal coupling between iterations, maximizing the probability that consecutive steps sample identical tokens. SJD2 [54] refines unaccepted tokens along a structured denoising trajectory rather than re-sampling independently, yielding smoother convergence and higher per-step acceptance rates.
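The shared fixed-point mechanic can be sketched as follows. The greedy `model` interface and the accept-on-full-convergence rule are simplifications: actual SJD variants use probabilistic stability criteria and accept the longest converged prefix each iteration.

```python
import torch

def jacobi_block_decode(model, ctx, n_tokens, vocab, max_iters=16):
    """Jacobi-style self-drafting sketch: the target model refines a whole
    block of guesses in parallel until a fixed point; no auxiliary drafter.
    `model(ids)` is assumed to return greedy next-token ids per prefix position."""
    guess = torch.randint(0, vocab, (n_tokens,))  # arbitrary initialization
    for _ in range(max_iters):
        preds = model(torch.cat([ctx, guess]))
        preds = preds[len(ctx) - 1 : len(ctx) - 1 + n_tokens]  # block predictions
        if torch.equal(preds, guess):             # fixed point: accept the block
            break
        guess = preds                             # Jacobi update of all positions
    return guess
```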

4.2. Verification Optimization

As with draft generation, several verification-side optimization strategies recur across modalities, as shown in Figure 12 and Figure 13.

4.2.1. Cost Reduction

The target model’s verification forward pass typically dominates the speculative decoding pipeline’s computational cost; reducing its per-step cost is therefore critical for achieving net latency gains.
Confidence-Based Verification Skipping
The shared idea is to bypass expensive target verification when the draft already appears reliable enough. VVS [42] implements this at token granularity, building on two observations: verification redundancy (many draft tokens would be accepted regardless) and stale feature reusability (cached target features remain informative across consecutive steps). This verification skipping parallels SpeCa [58]’s forecast gating, where accurate cached predictions bypass full recomputation, but operates at the token level rather than the feature level. UGSD [38] takes a complementary approach: rather than skipping verification for confident tokens, it skips cloud offloading entirely; low-entropy token blocks remain on the edge device without invoking the cloud verifier, reducing communication and computation overhead while reserving cloud capacity for the most uncertain predictions. CTC-SSD [50] implements an analogous entropy-based gating at the CTC level: if all frame-level entropies of the CTC output fall below a threshold $\tau_{\mathrm{CTC}}$, the greedy CTC hypothesis is accepted as final without any LLM verification pass, entirely bypassing the most expensive stage of the pipeline. HeiSD [31] extends verification skipping from token-level to trajectory-segment level: when feature similarity between retrieved drafts and historical trajectories exceeds a learned threshold, the entire verification pass is bypassed, with the threshold adapting online via task-completion feedback. While the preceding methods skip verification proactively, KERV [30] avoids target recomputation reactively: upon rejection, a Kalman Filter predicts the remaining action trajectory from kinematic history, entirely bypassing GPU-side re-drafting.
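The entropy gate common to these methods reduces to a simple check; the max-entropy aggregation and threshold below are illustrative assumptions rather than the published configurations.

```python
import torch

def should_verify(draft_probs, tau=2.0):
    """Return True if a drafted block is uncertain enough to warrant full
    (or cloud-side) verification. `draft_probs`: [block_len, vocab]."""
    ent = -(draft_probs * draft_probs.clamp_min(1e-12).log()).sum(-1)
    return bool(ent.max() > tau)   # confident (low-entropy) blocks skip verification
```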
Cross-Stage Computation Reuse
Verification becomes cheaper here because computation produced during drafting is reused instead of recomputed. FastVLM [45] shares the KV cache between draft and verification stages within a single backbone: the first n layers’ KV entries, computed during drafting, transfer directly to full-model verification, eliminating redundant multimodal encoding. STD [47] similarly reassigns computational responsibility: the target model M (with full KV cache) handles verification exclusively, while the sparse surrogate M s (with reduced KV) handles drafting, sharing all parameters so no additional model memory is required. HIPPO [48] takes a complementary approach: rather than eliminating redundant KV computation, it hides verification latency by pipelining target verification of batch t with draft generation of batch t + 1 via double-buffer KV cache management.
Input Pruning for Shared-Backbone Verification
The savings here come from shrinking the token set that the shared backbone must process in both stages. SpecPrune-VLA [46] achieves this through its action-aware pruning framework: visual token count entering the shared backbone is reduced, making both draft and verification forward passes cheaper. Its self-speculative design further avoids maintaining a separate draft model.

4.2.2. Relaxed Acceptance

Standard speculative decoding enforces exact distribution matching. Multiple modalities, including VLA, speech, T2I (Vision AR), and diffusion, benefit from relaxing this requirement when perceptual or functional equivalence suffices. For diffusion models, verification is naturally defined as the consistency of the proposed trajectory under the target dynamics, rather than token-wise acceptance.
Perceptual and Functional Tolerance
Exact token agreement is replaced by modality-specific notions of perceptual or functional equivalence. In the VLA domain, Spec-VLA [29] accepts draft action tokens whose continuous control signals fall within a task-dependent distance tolerance of the target action. KERV [30] deepens this paradigm by replacing Spec-VLA’s static acceptance threshold with a kinematic-based dynamic adjustment strategy that tunes the tolerance based on real-time kinematic variability; its Kalman Filter fallback mechanism for rejected tokens is discussed earlier in the drafting section. HeiSD [31] introduces sequence-wise relaxed acceptance: rather than evaluating tokens individually, it groups kinematically correlated action dimensions (position, rotation, gripper) into sequences and accepts an entire sequence when its aggregate bias $\mathrm{bias}_{\mathrm{seq}}$ (computed over the full sequence) remains within tolerance, even if the bias of an individual action dimension $a_j$, denoted $\mathrm{bias}_{a_j}$, is larger. In the VideoLM domain, VideoSpeculateRAG [33] applies tolerance-based candidate selection: rather than requiring exact token matching, it retains all candidates whose reliability scores fall within a margin δ of the maximum, then reranks this tolerant set by entity-alignment similarity to select the final answer. In the VLM domain, HSD [28] introduces τ-matching for document parsing: a drafted region-level Markdown sequence is accepted if its edit distance to the target falls below a formatting tolerance threshold τ, permitting minor whitespace and markup variations that are semantically equivalent. In the speech domain, SSD [35] relaxes the standard acceptance criterion by introducing a perceptual tolerance parameter β: acoustically equivalent tokens, i.e., those producing perceptually indistinguishable audio despite differing in discrete codec representation, are accepted even when exact distribution matching fails. UGSD [38] adopts a rank-based acceptance rule for speech emotion captioning: a drafted token is accepted if it falls within the top-R most probable tokens under the cloud verifier’s distribution, replacing strict exact matching with a practical relaxation suited to open-ended caption generation. CTC-SSD [50] relaxes acceptance differently: the greedy CTC hypothesis is accepted if all token likelihoods under the LLM distribution exceed a threshold $\tau_{\mathrm{SLM}}$, replacing exact token matching with a plausibility check that permits the verifier to endorse acoustically grounded but lexically distinct hypotheses. Across all these modalities, the common insight is that multiple distinct tokens can function equivalently when evaluated through a modality-appropriate perceptual or task-specific metric.
Latent-Space and Continuous-Density Acceptance
Acceptance is defined by proximity in a latent or continuous space rather than exact token identity. In visual generation, exact token matching yields impractically low acceptance rates due to codebook redundancy. LANTERN [39] and LANTERN++ [40] relax exact token matching to latent-space neighborhood acceptance: a draft token is accepted if it falls within a distance threshold of the target’s prediction in the codebook embedding space. GSD [41] adopts grouped acceptance, treating visually equivalent token clusters as interchangeable. CSpD [43] extends relaxation to continuous-valued token spaces through density-ratio acceptance-rejection sampling, enabling theoretically grounded verification in continuous output spaces. MuLo-SD [44] introduces local spatial relaxation: rather than rejecting all tokens after the first failure in raster-scan order, it accepts each draft token independently when the pooled probability over its k-nearest latent neighbors exceeds a threshold τ, and resamples only within a spatial neighborhood of radius l (in discrete token positions) around rejected positions, exploiting the locality of visual AR attention patterns.
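A minimal sketch of latent-neighborhood acceptance in this style is shown below; the neighborhood size `k` and the pooled-probability rule are our simplified reading of the LANTERN-family criteria, not an exact reproduction.

```python
import torch

def relaxed_accept(draft_tok, p_target, q_draft, codebook, k=8):
    """Accept a draft visual token if the target probability mass pooled over
    the token's k nearest codebook neighbors is high relative to the draft
    probability. `codebook`: [vocab, dim] embedding table."""
    d = torch.cdist(codebook[draft_tok].unsqueeze(0), codebook).squeeze(0)
    nbrs = d.topk(k, largest=False).indices   # k nearest codes in latent space
    ratio = p_target[nbrs].sum() / q_draft[draft_tok]
    return bool(torch.rand(()) < torch.clamp(ratio, max=1.0))
```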
Coupling-Based and Exchangeability-Based Acceptance
Here correctness is established through structural properties of the diffusion process itself rather than discrete token matching. Accelerated Diffusion Sampling [56] verifies proposals through reflection maximal coupling: the proposed future sample is accepted if the coupled SDE trajectory remains within a valid region defined by the score function geometry, providing a theoretically grounded acceptance criterion that preserves the target diffusion distribution. ASD [57] provides theoretical guarantees through the exchangeability property of stochastic localization: because permuting denoising timestep orderings does not change the output distribution, self-proposed multi-step jumps are provably correct without requiring explicit verification.
Forecast Gating
The verifier selectively trusts cheap feature forecasts and falls back to full recomputation only when the forecast degrades. SpeCa [58] applies a gating mechanism to Taylor-forecasted features: if the discrepancy between the cached forecast and the actual model computation exceeds a threshold, the forecast is rejected and the full model recomputes that step.
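The gating logic admits a compact sketch: a first-order extrapolation of cached features is trusted between checks and rejected when its relative error at a verification step exceeds a threshold. The `compute_step` callable and the specific error metric below are assumptions for illustration, not SpeCa’s exact formulation.

```python
import torch

def gated_feature_forecast(f_prev: torch.Tensor, f_prev2: torch.Tensor,
                           compute_step, threshold: float,
                           verify: bool = True) -> torch.Tensor:
    """Minimal sketch of forecast gating in the spirit of SpeCa: features
    are extrapolated with a cheap first-order (Taylor-style) forecast, and
    the full model recomputes the step only when the forecast's error,
    checked at verification steps, exceeds `threshold`. `compute_step`
    stands in for one full forward pass and is an assumed callable.
    """
    forecast = f_prev + (f_prev - f_prev2)   # first-order extrapolation from cache
    if not verify:
        return forecast                      # trust the forecast between checks
    actual = compute_step()                  # full recomputation at a check step
    rel_err = torch.norm(actual - forecast) / (torch.norm(actual) + 1e-8)
    return forecast if rel_err <= threshold else actual  # reject degraded forecasts
```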
Dual-Hypothesis Boundary Detection
The key idea is to keep two competing boundary hypotheses alive until the verifier can resolve the masked-region ambiguity. Unlike traditional speculative decoding, where the verifier merely checks whether a proposed token is correct, SMUD [37] tackles the fundamental ambiguity of masked decoding: when the AR decoder processes a token, it does not know whether the token belongs inside the mask region or has exited into the known post-mask text. SMUD solves this by simultaneously maintaining two parallel decoding hypotheses: $H_{\mathrm{in}}$ (assuming decoding continues inside the mask) and $H_{\mathrm{out}}$ (assuming the mask has ended). The verifier selects the correct trajectory by comparing a joint CTC–decoder score, effectively using the autoregressive decoder as a soft verifier to probabilistically determine the true mask boundary while preserving CTC prefix scores.

4.2.3. Verify-to-Draft Feedback

Rather than treating verification as a one-directional judgment, several methods create feedback loops where verification-side information actively improves subsequent draft quality.
Training-Time Alignment
Feedback is injected during training by baking draft–target compatibility into the drafter’s parameters. The target-informed transfer mechanisms described in the drafting section, including AASD [21]’s T-D Attention and DREAM [22]’s entropy-adaptive feature injection, function as implicit feedback loops: by aligning draft representations to target representations during training, these methods ensure that the verification step encounters fewer out-of-distribution tokens at inference time, increasing the effective acceptance rate $\alpha$ without adding inference-time overhead. FastVLM [45] creates a more explicit feedback loop: rejected tokens from verification are corrected by the full model and fed back to improve the imitation network through iterative imitation learning, progressively tightening draft–target alignment over decoding rounds.
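As a hedged illustration of training-time alignment, the following sketch combines logit-level distillation with feature-level alignment, mirroring the $\mathcal{L}_{\mathrm{KD}}$ and $\mathcal{L}_{\mathrm{feat}}$ terms of Figure 9(d); the specific losses and the weighting `lam` are assumptions, not any single method’s recipe.

```python
import torch
import torch.nn.functional as F

def alignment_loss(draft_logits: torch.Tensor, target_logits: torch.Tensor,
                   draft_feats: torch.Tensor, target_feats: torch.Tensor,
                   lam: float = 1.0) -> torch.Tensor:
    """Illustrative drafter-target alignment objective: logit-level
    knowledge distillation (L_KD) plus feature-level alignment (L_feat).
    The loss choices and weighting are assumed for the sketch.
    """
    # L_KD: match the drafter's token distribution to the target's.
    kd = F.kl_div(F.log_softmax(draft_logits, dim=-1),
                  F.softmax(target_logits, dim=-1), reduction="batchmean")
    # L_feat: pull drafter hidden states toward target hidden states.
    feat = F.smooth_l1_loss(draft_feats, target_feats)
    return kd + lam * feat
```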
Verifier Attention as Pruning Signal
Verifier attention is reused as an explicit signal for the drafter’s next round of token selection. SpecVLM (Video) [62] applies this principle to video token pruning. After each verification step, the verifier’s attention distribution over video tokens determines which tokens the drafter retains in subsequent rounds, creating an adaptive pruning loop where verification-side information continuously refines draft-side input selection.
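A minimal sketch of this pruning loop follows, assuming the verifier’s attention has already been aggregated over heads and layers to a single score per video token (the aggregation itself is an assumed preprocessing step).

```python
import torch

def prune_video_tokens(video_tokens: torch.Tensor,
                       verifier_attention: torch.Tensor,
                       keep_ratio: float = 0.25) -> torch.Tensor:
    """Sketch of verifier-guided token pruning in the spirit of SpecVLM
    (Video): after each verification step, the verifier's attention mass
    over video tokens selects which tokens the drafter keeps next round.
    `verifier_attention` holds one aggregated score per video token.
    """
    num_keep = max(1, int(keep_ratio * video_tokens.size(0)))
    keep_idx = verifier_attention.topk(num_keep).indices.sort().values
    return video_tokens[keep_idx]   # pruned token set for the next draft round
```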
Draft Recycling
Rejected draft content is treated as partial progress to be repaired or reused, rather than thrown away. SpecASR [36] (speech) transforms verification from “reject and restart” to “repair and reuse”: rejected draft tokens are locally recycled, merged, reused, or expanded at uncertain positions, converting wasted draft computation into usable partial results. SJD++ [52] (T2I, Vision AR) achieves analogous recycling through high-confidence token reuse: within Jacobi iterations, tokens whose predictions remained stable across two consecutive steps are locked rather than re-sampled, converting otherwise-discarded iteration results into convergence acceleration. Both methods share the insight that partially correct draft information carries value beyond binary accept/reject decisions; the reject signal from verification is a structured feedback that guides subsequent drafting rather than a mere termination condition.
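The token-locking mechanism of SJD++ can be sketched compactly. Here `predict` stands in for one parallel forward pass returning per-position argmax tokens, an assumed interface; the real method operates on sampled Jacobi trajectories.

```python
import torch

def jacobi_step_with_locking(seq: torch.Tensor, prev_seq: torch.Tensor,
                             locked: torch.Tensor, predict):
    """Illustrative high-confidence token reuse in the spirit of SJD++:
    within a Jacobi iteration, positions whose predictions were identical
    across two consecutive iterations are locked and no longer re-sampled,
    converting otherwise-discarded iteration results into faster convergence.
    """
    new_seq = predict(seq)                          # parallel refinement pass
    stable = (new_seq == seq) & (seq == prev_seq)   # unchanged for two steps
    locked = locked | stable                        # lock stable positions
    new_seq = torch.where(locked, seq, new_seq)     # locked tokens are reused
    return new_seq, seq, locked                     # (current, previous, locks)
```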

5. Frameworks and Systems

As speculative decoding for multimodal models matures, it becomes essential to examine the production inference frameworks that support it. Table 4 summarizes the major frameworks, their speculative decoding capabilities, and their current level of multimodal support.

vLLM

vLLM [64] is a high-throughput LLM serving engine built on PagedAttention for efficient KV cache management and continuous batching. The engine provides the most comprehensive speculative decoding support among production frameworks, implementing model-based methods (EAGLE, EAGLE-3, MTP, and draft models) as well as simpler n-gram and suffix decoding. vLLM’s speculative decoding is algorithmically validated to be lossless, maintaining the same output distribution as standard decoding. As of v0.12.0, vLLM has begun integrating multimodal-aware speculative decoding: the Qwen3-VL model class natively supports both EAGLE and EAGLE-3 speculation (PR #29594), and broader multimodal draft model support is under active development (Issue #33458). However, deeper multimodal-specific optimizations such as visual token compression or heterogeneous KV cache management have not yet been incorporated into the speculative decoding pipeline.
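As a hedged illustration, the snippet below enables model-based speculation through vLLM’s offline API. It assumes the `speculative_config` interface of recent vLLM releases; the accepted key names and method strings may differ across versions, and both model paths are placeholders.

```python
from vllm import LLM, SamplingParams

# Hedged sketch: enabling EAGLE-3 speculation via vLLM's offline API.
# `speculative_config` keys and method strings are assumptions based on
# recent releases and may vary by version; model paths are placeholders.
llm = LLM(
    model="path/to/qwen3-vl-target",           # hypothetical multimodal target
    speculative_config={
        "method": "eagle3",                    # model-based speculation method
        "model": "path/to/eagle3-draft-head",  # hypothetical trained draft head
        "num_speculative_tokens": 5,           # draft length per round
    },
)
outputs = llm.generate(["Describe the image."], SamplingParams(max_tokens=64))
```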

SGLang

SGLang [65] is an inference engine optimized for structured generation and multi-turn conversations through RadixAttention, which efficiently reuses shared KV cache prefixes. It supports speculative decoding via EAGLE and EAGLE-3 and has been actively expanding support for multimodal models, including Vision–Language Models. Its development roadmap for 2025 explicitly prioritized speculative decoding optimizations, including adaptive batch-size-aware speculation. However, like vLLM, SGLang’s speculative decoding currently targets text-only token prediction without multimodal-specific adaptations.

TensorRT

NVIDIA provides two inference frameworks under the TensorRT brand that support speculative decoding, targeting datacenter and edge deployment scenarios, respectively.

TensorRT-LLM

TensorRT-LLM [66] is NVIDIA’s datacenter-oriented inference optimization framework built on a PyTorch-native architecture, applying mixed-precision quantization (FP8, FP4, INT4 AWQ, INT8 SmoothQuant), layer fusion, and kernel auto-tuning for GPU-optimized deployment. It offers the broadest speculative decoding method coverage among NVIDIA’s frameworks, supporting draft-target model pairs, EAGLE (1/2/3), Medusa, ReDrafter, Multi-Token Prediction (MTP), lookahead decoding, and n-gram methods. While TensorRT-LLM supports multimodal model serving (Qwen2-VL, LLaVA-NeXT, Llama 3.2 Vision, among others) and speculative decoding independently, its speculative decoding pipeline is primarily designed for text-only LLMs and does not yet provide end-to-end multimodal-aware speculation.

TensorRT Edge-LLM

TensorRT Edge-LLM [67] is a separate lightweight C++ inference runtime purpose-built for embedded platforms such as NVIDIA DRIVE AGX Thor and Jetson Thor. Unlike TensorRT-LLM, it focuses on minimal dependencies and low resource footprint for real-time edge applications. Its speculative decoding support is limited to EAGLE-3, but notably it provides end-to-end multimodal-aware speculative decoding: it supports multi-batch EAGLE-3 speculation for both LLMs and VLMs, with native support for Qwen2/2.5/3-VL, InternVL3, and Phi-4-Multimodal. Industry partners including Bosch, ThunderSoft, and MediaTek have adopted TensorRT Edge-LLM for in-vehicle AI assistants and cabin monitoring, making it one of the few production frameworks where multimodal speculative decoding is deployed in real-world applications.

LMDeploy

LMDeploy [68] provides high-performance inference through its TurboMind C++ backend and PyTorch backend, featuring continuous batching and efficient CUDA kernels. It implements speculative decoding through Medusa-style TreeMask verification. Although LMDeploy supports multimodal model deployment (VLMs), its speculative decoding capabilities remain text-focused and are still marked as experimental.

Hugging Face Transformers

The Hugging Face Transformers library [69] provides a widely adopted assisted_generation API that implements basic speculative decoding with draft models. Among the surveyed methods, HIPPO [48] explicitly builds on Transformers v4.57.0, and DREAM [22] uses “official Hugging Face implementations” as its model backends. More broadly, methods targeting LLaVA-series, Qwen-VL-series, and InternVL-series models inherit the Hugging Face interfaces, making it the de facto prototyping platform for multimodal speculative decoding research.
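A minimal sketch of this API follows; the model paths are placeholders, and the draft model must share the target’s tokenizer vocabulary for assisted generation to apply.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Minimal sketch of Transformers' assisted generation; model paths are
# placeholders for a compatible target/draft pair.
tok = AutoTokenizer.from_pretrained("path/to/target-model")
target = AutoModelForCausalLM.from_pretrained("path/to/target-model")
draft = AutoModelForCausalLM.from_pretrained("path/to/small-draft-model")

inputs = tok("Speculative decoding works by", return_tensors="pt")
# Passing `assistant_model` switches generate() into speculative decoding:
# the draft proposes tokens and the target verifies them in parallel.
out = target.generate(**inputs, assistant_model=draft, max_new_tokens=64)
print(tok.decode(out[0], skip_special_tokens=True))
```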

Speculative Decoding Algorithm Libraries

Beyond production serving frameworks, several open-source algorithm libraries provide reusable implementations of speculative decoding techniques. EAGLE [63] and its successor EAGLE-2 offer feature-level auto-regression heads with tree-structured verification, directly adopted by LANTERN [39], SpecVLM [16], SAGE [61], and DREAM [22]. Medusa [70] provides multi-head parallel drafting, used as a baseline by DREAM [22]. Spec-Bench [3] offers standardized evaluation protocols (speedup ratio, acceptance length, temperature sensitivity) that multimodal methods widely adopt for reporting results, despite being focused on text-only scenarios.
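For reference, the two headline metrics in Spec-Bench-style reporting can be computed as sketched below; this is an illustrative helper, not Spec-Bench’s own code.

```python
def report_metrics(baseline_time: float, sd_time: float,
                   accepted_lengths: list[int]) -> tuple[float, float]:
    """Illustrative computation of Spec-Bench-style headline metrics:
    walltime speedup over the autoregressive baseline, and mean acceptance
    length, i.e., the average number of tokens committed per target
    verification step.
    """
    speedup = baseline_time / sd_time
    mean_accept_len = sum(accepted_lengths) / len(accepted_lengths)
    return speedup, mean_accept_len
```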

Discussion

A narrowing but still persistent gap remains between speculative decoding and multimodal model serving in production inference frameworks. While all major frameworks now support both capabilities independently, only a few have begun combining them: TensorRT Edge-LLM provides end-to-end multimodal EAGLE-3 speculation for edge deployment, and vLLM has introduced EAGLE/EAGLE-3 support for Qwen3-VL as of v0.12.0, with broader multimodal draft model support under active development. However, these initial integrations remain limited in scope, typically restricted to specific model families and a single speculation method, without incorporating deeper multimodal-specific optimizations such as visual token compression, heterogeneous KV cache management, or relaxed verification criteria tailored to vision tokens. The surveyed research methods universally implement their multimodal SD innovations as standalone codebases (typically built on Hugging Face Transformers or EAGLE), rather than as extensions to production serving systems. Bridging this gap fully, i.e., integrating the full spectrum of multimodal-aware speculation techniques into frameworks like vLLM and SGLang with broad model coverage and production-grade robustness, remains a critical engineering direction for enabling practical deployment (Section 7).

6. Comparison and Benchmarking

This section provides a systematic comparison of representative multimodal speculative decoding methods. We first present a cross-modal comparative summary (Section 6.1), then discuss the first standardized VLM benchmark (Section 6.2), and finally address benchmarking for other modalities (Section 6.3).

6.1. Cross-Modal Comparative Summary

Table 5 highlights four fundamental differences from text-only speculative decoding: (1) Drafting architecture is modality-dependent: independent drafters dominate VLMs, while other modalities exhibit a broader mix of shared-backbone, Jacobi self-drafting, and drafter-free approaches; (2) T2I generation reveals two distinct paradigms: relaxed-acceptance dual-model methods versus Jacobi self-drafting; (3) Tuning-free deployment remains a primary goal, yet the highest reported speedups still generally require dedicated drafter training; (4) Verification criteria diverge by output space: VLMs and Video–Language Models largely retain exact-match acceptance, while modalities with continuous, latent, perceptual, or task-tolerant outputs more often require relaxed, convergence-based, or coupling-based criteria.
First, Independent Drafting dominates Vision–Language and Video–Language Models, mirroring text-only practice. However, multimodal drafters increasingly incorporate target-model features to improve speculation accuracy (DREAM [22], STAR [23], AASD [21]). Conversely, domains with strong self-drafting structures, such as video KV-splits (STD [47]), layer-sharing (FastVLM [45]), and Jacobi iteration (the SJD [51] family), adopt Shared Backbone mechanisms. Speech methods similarly span Independent (e.g., SSD [35]) and Shared Backbone (e.g., Codec-MTP [49]) paradigms. Diffusion models uniquely introduce Drafter-Free Speculation approaches that generate trajectory segments rather than discrete tokens.
Second, this modality dependence is particularly visible in Text-to-Image (Vision AR) generation, where methods bifurcate along the drafting axis. Methods like LANTERN [39] and GSD [41] retain the dual-model framework but relax acceptance criteria in the visual latent space, which CSpD [43] adapts for continuous-valued representations. MuLo-SD [44] takes an orthogonal approach, combining a low-resolution independent drafter with multi-scale up-sampling to propose candidate patches beyond sequential token extension. In contrast, the SJD [51] family eliminates separate drafters entirely through Jacobi self-drafting, and SJD-PV [55] further improves the acceptance rate by introducing phrase-level joint verification that exploits semantic continuity across consecutive visual tokens.
Third, Tuning-free adaptation remains a primary objective across all modalities (indicated by the numerous ✓ marks in Table 5). Several tuning-free methods achieve substantial speedups through structural innovation, including MC-SJD [53] ($3.8$–$4.2\times$), EdgeSD [27] ($3$–$5\times$), and GSD [41] ($3.8\times$). Nevertheless, closing the performance gap with trained drafters remains an open challenge: models like Spec-LLaVA [59], STAR [23], and DREAM [22] still report the highest VLM speedups, all relying on dedicated distillation or fine-tuning.
Finally, the choice of acceptance criterion tracks the output space. VLMs and Video–Language Models, which generate discrete tokens, overwhelmingly adopt exact-match verification. In contrast, modalities operating over continuous, latent, or perceptual outputs—VLA, Speech/Audio, Text-to-Image (Vision AR), and Diffusion Transformer (DiT)—more frequently rely on relaxed, convergence-based, or continuous-state criteria to achieve practical acceptance rates.

6.2. Unified VLM Benchmarking

While Table 5 collects self-reported speedups under heterogeneous settings, a fair comparison requires controlled evaluation. MMSpec [71] provides the first standardized benchmark for speculative decoding in VLMs, evaluating vision-agnostic methods (EAGLE-1/2/3, Medusa) and vision-aware methods (MSD, ViSpec) across six subtasks on Qwen2.5-VL-7B and LLaVA-1.5-7B. The benchmark reveals that vision-agnostic methods can degrade below the autoregressive baseline on VLMs, while vision-aware methods consistently outperform them, confirming that modeling visual-conditioned token distributions is critical. Speedup also varies substantially across subtasks, highlighting the need for adaptive speculation strategies.

6.3. Other Modalities Benchmarking

The absence of similarly standardized benchmarks for Text-to-Image (Vision AR), VLA, Video, Speech, and Diffusion domains (Section 7) makes direct comparison across these modalities difficult, and the speedup figures in Table 5 remain incomparable because they reflect heterogeneous target models and evaluation protocols.

7. Open Challenges and Future Directions

While speculative decoding accelerates multimodal inference, applying the paradigm to high-dimensional multimodal spaces introduces several unsolved problems. Addressing these challenges requires advances in algorithm design, theoretical formulation, and hardware optimization.

7.1. The Multimodal Drafting Bottleneck

In standard LLM speculative decoding, drafting accounts for a negligible fraction of the total inference time. However, in multimodal models, even small drafter models must process high-resolution images, streaming audio, or long videos, resulting in a substantial “multimodal drafting bottleneck.” As target models shift to complex “any-to-any” generation (e.g., interleaved text, image, and audio generation), building an ultra-lightweight drafter capable of generating reliable multi-format proposals becomes increasingly difficult. Future research should explore universally compressible representations or training-free self-speculation mechanisms (such as early-exiting or feature caching) to keep drafting overhead strictly bounded.

7.2. Extended Sequence Lengths and Memory Wall

The growth of sequence lengths in video understanding (VideoLMs) and high-fidelity audio generation places immense pressure on memory bandwidth. Although speculative decoding alleviates parts of the memory-bandwidth bottleneck, longer KV caches and increasingly expensive attention memory accesses still cause memory transactions to overshadow arithmetic operations as sequence length grows; VideoLMs in particular shift the bottleneck toward memory-bound processing over long temporal sequences. While current compression strategies (Section 3.3.1) prune redundant visual tokens to mitigate this, excessive pruning risks degrading fine-grained semantic grounding. Future work should integrate KV cache quantization, offloading strategies, or linear-attention mechanisms directly into the speculative decoding verification phase.

7.3. Rigorous Verification Theory for Continuous Spaces

For many discrete-token generation settings, classical speculative decoding provides strong mathematical guarantees of lossless recovery (i.e., reproducing the exact output distribution of the target model). In contrast, models operating in continuous latent spaces, such as image and video Diffusion Transformers (DiTs) or visually quantized representations in visual autoregressive (VAR) models, lack equivalent theoretical bounds. Current verification strategies for these models (e.g., feature distance thresholds or semantic relaxations) rely heavily on empirical hyperparameter tuning. Developing a rigorous verification theory for continuous random variables that guarantees output distribution fidelity while allowing for extensive speculation is a critical open problem.
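To ground the discussion, the sketch below extends the classical density-ratio acceptance rule to continuous samples, in the spirit of CSpD; the `log_p`/`log_q` densities and the residual resampler are assumed callables, and sampling exactly from the residual $(p - q)_+$ distribution is itself nontrivial in practice.

```python
import torch

def continuous_accept(x: torch.Tensor, log_p, log_q, resample_corrected):
    """Sketch of density-ratio acceptance-rejection for continuous-valued
    tokens: a draft sample x ~ q is accepted with probability
    min(1, p(x)/q(x)). `log_p`/`log_q` are assumed log-density callables;
    `resample_corrected` stands in for drawing from the residual
    distribution proportional to max(p - q, 0) upon rejection.
    """
    accept_prob = torch.exp(torch.clamp(log_p(x) - log_q(x), max=0.0))
    if torch.rand(()).item() < accept_prob.item():
        return x                    # draft sample preserved
    return resample_corrected()     # corrected sample on rejection
```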

7.4. Strict Real-Time Constraints: Audio Streaming and VLA Control

Beyond latency reduction, Vision–Language–Action (VLA) models and speech systems are often deployed in environments where strict real-time constraints (e.g., high-frequency robotic control loops or time-to-first-audio) are non-negotiable. While speculative decoding increases overall throughput, naive block-level speculation can introduce unacceptable jitter or artificially delay the emission of the first acoustic or action frames. Designing drafters that generate structurally sound future actions or audio codecs far ahead of the target, without sacrificing streaming experience or suffering catastrophic rejection during physical execution, remains a key engineering challenge in real-time multimodal deployment.

7.5. Cross-Pollinating Representation-Level Verification to VLMs

Current VLM speculative decoding largely relies on strict token matching, forcing the drafter to predict the target’s exact discrete token sequence. However, as demonstrated by several T2I (Vision AR) methods (Section 4.2.2), relaxing verification to the latent or representation space can substantially improve acceptance rates. A promising future direction is to adapt this representation-level verification to discrete VLM tokens. Because visual semantics are fundamentally continuous, verifying drafts based on embedding similarity or semantic equivalence rather than exact lexical matches may help overcome current tuning-free VLM speedup limitations without requiring costly self-correction.

7.6. Algorithm–Hardware Co-Design

Advanced speculative decoding algorithms frequently employ dynamic, tree-structured speculation to verify multiple candidate trajectories simultaneously (Section 4.1.2). This dynamic branching creates irregular compute geometries, sparse attention masks, and dynamic batch sizes, which underutilize modern AI accelerators (e.g., GPUs and TPUs) optimized for static, dense matrix multiplications. To unlock the theoretical speedups of width-parallel speculative decoding, the field needs hardware–algorithm co-design, including specialized CUDA kernels for sparse tree-attention and optimized memory access patterns tailored for parallel target verification.

7.7. Evaluation Standardization and Reproducibility

The speculative decoding literature currently suffers from fragmented evaluation. Reported speedup multipliers vary widely depending on the baseline implementation (e.g., vanilla PyTorch vs. vLLM/TensorRT), hardware platform, batch size, and prompt characteristics. Unlike text generation where speedup is easily quantified by tokens per second, multimodal generation speedup depends on the input length (e.g., the number of images/frames relative to the generated text). While recent efforts such as MMSpec [71] have introduced standardized benchmarks for vision–language models (Section 6.2), no equivalent evaluation platform exists for Vision–Language–Action control, Video–Language understanding, Speech/Audio models, Text-to-Image (Vision AR) generation, or Diffusion Transformers. This cross-modal benchmarking gap hinders fair comparison of algorithmic contributions and obscures which innovations genuinely transfer across modalities. Establishing standardized cross-modal benchmarking frameworks that isolate algorithmic efficiency from system-level engineering and define modality-appropriate quality metrics is essential for transparent and reproducible progress.

8. Conclusions

This survey presented a comprehensive, unified taxonomy of speculative decoding for multimodal models. Moving beyond the text-centric roots of the paradigm, we systematically analyzed draft architecture, execution strategies, optimization patterns, verification criteria, and inference framework support across six distinct domains: Vision–Language Models (VLMs), Vision–Language–Action (VLA) agents, Video–Language models, Speech systems, Text-to-Image (Vision Auto-Regressive, T2I) generators, and Diffusion-based generators. By formalizing the problem space, we identified key recurring design patterns, including visual token compression, KV cache optimization, target-informed transfer, drafter-target alignment, verification cost reduction, relaxed acceptance criteria, and verify-to-draft feedback loops.
Each modality introduces unique computational bottlenecks that demand tailored solutions, from extensive spatial pruning and multi-scale drafting in text-to-image synthesis to continuous-space and phrase-level verification criteria, strict real-time latency constraints in robotic control, and tolerance for stochastic variance in diffusion processes. As foundation models increasingly adopt native multimodal generation, sequential and iterative inference latency is becoming a critical deployment constraint. Recent standardized VLM benchmarks suggest that text-only speculation can degrade substantially on vision–language tasks, underscoring the importance of the domain-aware techniques reviewed herein. Speculative decoding provides a promising pathway toward interactive-latency AI systems without sacrificing generative quality. This survey provides a technical foundation and roadmap for researchers and practitioners working to accelerate multimodal inference.

References

  1. Leviathan, Y.; Kalman, M.; Matias, Y. Fast inference from transformers via speculative decoding. In Proceedings of the International Conference on Machine Learning. PMLR, 2023; pp. 19274–19286. [Google Scholar]
  2. Chen, C.; Borgeaud, S.; Irving, G.; Lespiau, J.B.; Sifre, L.; Jumper, J. Accelerating large language model decoding with speculative sampling. arXiv 2023, arXiv:2302.01318. [Google Scholar] [CrossRef]
  3. Xia, H.; Yang, Z.; Dong, Q.; Wang, P.; Li, Y.; Ge, T.; Liu, T.; Li, W.; Sui, Z. Unlocking efficiency in large language model inference: A comprehensive survey of speculative decoding. Findings of the Association for Computational Linguistics: ACL 2024, 7655–7671. [Google Scholar]
  4. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
  5. Liu, H.; Li, C.; Wu, Q.; Lee, Y.J. Visual instruction tuning. Advances in neural information processing systems 2023, 36, 34892–34916. [Google Scholar]
  6. Team, Q. Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond. arXiv 2023, arXiv:2308.12966. [Google Scholar]
  7. Zitkovich, B.; Yu, T.; Xu, S.; Xu, P.; Xiao, T.; Xia, F.; Wu, J.; Wohlhart, P.; Welker, S.; Wahid, A.; et al. Rt-2: Vision-language-action models transfer web knowledge to robotic control. In Proceedings of the Conference on Robot Learning. PMLR, 2023; pp. 2165–2183. [Google Scholar]
  8. Lin, B.; Ye, Y.; Zhu, B.; Cui, J.; Ning, M.; Jin, P.; Yuan, L. Video-llava: Learning united visual representation by alignment before projection. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, 2024; pp. 5971–5984. [Google Scholar]
  9. Défossez, A.; Copet, J.; Synnaeve, G.; Adi, Y. High fidelity neural audio compression. arXiv 2022, arXiv:2210.13438. [Google Scholar] [CrossRef]
  10. Esser, P.; Rombach, R.; Ommer, B. Taming transformers for high-resolution image synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021; pp. 12873–12883. [Google Scholar]
  11. Ho, J.; Jain, A.; Abbeel, P. Denoising diffusion probabilistic models. Advances in neural information processing systems 2020, 33, 6840–6851. [Google Scholar]
  12. Song, Y.; Sohl-Dickstein, J.; Kingma, D.P.; Kumar, A.; Ermon, S.; Poole, B. Score-based generative modeling through stochastic differential equations. arXiv 2020, arXiv:2011.13456. [Google Scholar]
  13. Gagrani, M.; Goel, R.; Jeon, W.; Park, J.; Lee, M.; Lott, C. On speculative decoding for multimodal large language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024; pp. 8285–8289. [Google Scholar]
  14. Kang, J.; Shu, H.; Li, W.s.; Zhai, Y.; Chen, X. ViSpec: Accelerating vision-language models with vision-aware speculative decoding. arXiv 2025, arXiv:2509.15235. [Google Scholar]
  15. Li, J.; Li, D.; Savarese, S.; Hoi, S. BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In Proceedings of the International conference on machine learning. PMLR, 2023; pp. 19730–19742. [Google Scholar]
  16. Huang, H.; Yang, F.; Liu, Z.; Yin, X.; Li, D.; Ren, P.; Barsoum, E. SpecVLM: Fast Speculative Decoding in Vision-Language Models. arXiv 2025, arXiv:2509.11815. [Google Scholar]
  17. Xie, Z.; Wang, P.; Qiu, S.; Cheng, J. HiViS: Hiding Visual Tokens from the Drafter for Speculative Decoding in Vision-Language Models. arXiv 2025, arXiv:2509.23928. [Google Scholar]
  18. Wang, Z.; Li, R.; Du, H.; Zhou, J.T.; Zhang, Y.; Yang, X. FLASH: Latent-Aware Semi-Autoregressive Speculative Decoding for Multimodal Tasks. arXiv 2025, arXiv:2505.12728. [Google Scholar]
  19. Lin, L.; Lin, Z.; Zeng, Z.; Ji, R. Speculative decoding reimagined for multimodal large language models. arXiv 2025, arXiv:2505.14260. [Google Scholar] [CrossRef]
  20. Ganesan, M.; Segal, S.; Aggarwal, A.; Sinnadurai, N.; Lie, S.; Thangarasa, V. MASSV: Multimodal adaptation and self-data distillation for speculative decoding of vision-language models. arXiv 2025, arXiv:2505.10526. [Google Scholar]
  21. Yang, C.; Chen, R.; Zhang, M.; Pang, W.; Chen, Y.; Xu, R.; Fu, K.; Wang, C.; Gao, L. AASD: Accelerate Inference by Aligning Speculative Decoding in Multimodal Large Language Models. In Proceedings of the 2025 62nd ACM/IEEE Design Automation Conference (DAC); IEEE, 2025; pp. 1–7. [Google Scholar]
  22. Hu, Y.; Xia, T.; Liu, Z.; Raman, R.; Liu, X.; Bao, B.; Sather, E.; Thangarasa, V.; Zhang, S.Q. Dream: Drafting with refined target features and entropy-adaptive cross-attention fusion for multimodal speculative decoding. arXiv 2025, arXiv:2505.19201. [Google Scholar]
  23. Liu, Z.; Hu, Y.; Xia, T.; Bao, B.; Sather, E.; Thangarasa, V.; Zhang, S.Q. STAR: Speculative Decoding with Searchable Drafting and Target-Aware Refinement for Multimodal Generation. OpenReview, 2025. [Google Scholar]
  24. Cai, H.; Gan, C.; Wang, T.; Zhang, Z.; Han, S. Once-for-all: Train one network and specialize it for efficient deployment. arXiv 2019, arXiv:1908.09791. [Google Scholar]
  25. Lee, M.; Kang, W.; Ahn, B.; Classen, C.; Yan, M.; Koo, H.I.; Lee, K. In-batch Ensemble Drafting: Robust Speculative Decoding for LVLMs. In Proceedings of the First Workshop on Scalable Optimization for Efficient and Adaptive Foundation Models, 2025. [Google Scholar]
  26. Lee, M.; Kang, W.; Ahn, B.; Classen, C.; Galim, K.; Oh, S.; Yan, M.; Koo, H.I.; Lee, K. TABED: Test-Time Adaptive Ensemble Drafting for Robust Speculative Decoding in LVLMs. arXiv 2026, arXiv:2601.20357. [Google Scholar] [CrossRef]
  27. Huang, H.; Zhan, W.; Duan, H.; Peng, K.; Min, G.; Zhao, Z.; Zhao, Z.; Ye, Y. EdgeSD: Efficient Speculative Decoding with Vision-Decoding Disaggregation for MLLM Inference in Edge-Cloud Networks. IEEE Transactions on Mobile Computing, 2026. [Google Scholar]
  28. Liao, W.; Li, H.; Xie, P.; Cai, X.; Shen, Y.; Xin, Y.; Qin, Q.; Ye, S.; Li, T.; Hu, M.; et al. Training-Free Acceleration for Document Parsing Vision-Language Model with Hierarchical Speculative Decoding. arXiv 2026, arXiv:2602.12957. [Google Scholar]
  29. Wang, S.; Yu, R.; Yuan, Z.; Yu, C.; Gao, F.; Wang, Y.; Wong, D.F. Spec-VLA: Speculative Decoding for Vision-Language-Action Models with Relaxed Acceptance. arXiv 2025, arXiv:2507.22424. [Google Scholar]
  30. Zheng, Z.; Mao, Z.; Li, M.; Chen, J.; Sun, X.; Zhang, Z.; Cao, D.; Mei, H.; Chen, X. KERV: Kinematic-Rectified Speculative Decoding for Embodied VLA Models. In Proceedings of the 63rd ACM/IEEE Design Automation Conference (DAC), 2026. [Google Scholar]
  31. Zheng, Z.; Mao, Z.; Tian, S.; Li, M.; Chen, J.; Sun, X.; Zhang, Z.; Liu, X.; Cao, D.; Mei, H.; et al. HeiSD: Hybrid Speculative Decoding for Embodied Vision-Language-Action Models with Kinematic Awareness. arXiv 2026, arXiv:2603.17573. [Google Scholar]
  32. Zhang, L.; Zhang, Z.; Hong, W.; Qiao, P.; Li, D. Sparrow: Text-Anchored Window Attention with Visual-Semantic Glimpsing for Speculative Decoding in Video LLMs. arXiv 2026, arXiv:2602.15318. [Google Scholar]
  33. Li, G.; Liu, P. FastV-RAG: Towards Fast and Fine-Grained Video QA with Retrieval-Augmented Generation. arXiv 2026, arXiv:2601.01513. [Google Scholar]
  34. Kong, Q.; Shen, Y.; Ji, Y.; Li, H.; Wang, C. ParallelVLM: Lossless Video-LLM Acceleration with Visual Alignment Aware Parallel Speculative Decoding. arXiv 2026, arXiv:2603.19610. [Google Scholar]
  35. Lin, Z.; Zhang, Y.; Yuan, Y.; Yan, Y.; Liu, J.; Wu, Z.; Hu, P.; Yu, Q. Accelerating Autoregressive Speech Synthesis Inference With Speech Speculative Decoding. arXiv 2025, arXiv:2505.15380. [Google Scholar] [CrossRef]
  36. Wei, L.; Zhong, S.; Xu, S.; Wang, R.; Huang, R.; Li, M. SpecASR: Accelerating LLM-based Automatic Speech Recognition via Speculative Decoding. In Proceedings of the 2025 62nd ACM/IEEE Design Automation Conference (DAC); IEEE, 2025; pp. 1–7. [Google Scholar]
  37. Okabe, K.; Yamamoto, H. Simultaneous Masked and Unmasked Decoding with Speculative Decoding Masking for Fast ASR without Accuracy Loss. In Proceedings of Interspeech 2025, 2025; pp. 634–638. [Google Scholar]
  38. Xue, X.; Lu, J.; Gao, Y.; Huang, G.; Dang, T.; Jia, H. Edge–Cloud Collaborative Speech Emotion Captioning via Token-Level Speculative Decoding in Audio-Language Models. arXiv 2026, arXiv:2603.11397. [Google Scholar]
  39. Jang, D.; Park, S.; Yang, J.Y.; Jung, Y.; Yun, J.; Kundu, S.; Kim, S.Y.; Yang, E. LANTERN: Accelerating Visual Autoregressive Models with Relaxed Speculative Decoding. arXiv 2025, arXiv:2410.03355. [Google Scholar]
  40. Jang, D.; Park, S.; Yang, J.Y.; Jung, Y.; Yun, J.; Kundu, S.; Kim, S.Y.; Yang, E. LANTERN++: Enhanced Relaxed Speculative Decoding with Static Tree Drafting for Visual Auto-regressive Models. In ICLR 2025 SCOPE Workshop, 2025. [Google Scholar]
  41. So, J.; Shin, J.; Kook, H.; Park, E. Grouped Speculative Decoding for Autoregressive Image Generation. arXiv 2025, arXiv:2508.07747. [CrossRef]
  42. Dong, H.; Li, Y.; Lu, R.; Tang, C.; Xia, S.T.; Wang, Z. VVS: Accelerating Speculative Decoding for Visual Autoregressive Generation via Partial Verification Skipping. arXiv 2025, arXiv:2511.13587. [Google Scholar] [CrossRef]
  43. Wang, Z.; Zhang, R.; Ding, K.; Yang, Q.; Li, F.; Xiang, S. Continuous speculative decoding for autoregressive image generation. arXiv 2024, arXiv:2411.11925. [Google Scholar] [CrossRef]
  44. Peruzzo, E.; Sautière, G.; Habibian, A. Multi-Scale Local Speculative Decoding for Image Generation. arXiv 2026, arXiv:2601.05149. [Google Scholar] [CrossRef]
  45. Bajpai, D.J.; Hanawal, M.K. FastVLM: Self-Speculative Decoding for Fast Vision-Language Model Inference. In Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics, 2025; pp. 1166–1183. [Google Scholar]
  46. Wang, H.; Xu, J.; Pan, J.; Zhou, Y.; Dai, G. SpecPrune-VLA: Accelerating Vision-Language-Action Models via Action-Aware Self-Speculative Pruning. arXiv 2025, arXiv:2509.04043. [Google Scholar]
  47. Zhang, X.; Du, C.; Yu, S.; Wu, J.; Zhang, F.; Gao, W.; Liu, Q. Sparse-to-Dense: A Free Lunch for Lossless Acceleration of Video Understanding in LLMs. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics, 2025; Volume 2, pp. 734–742. [Google Scholar]
  48. Lv, Q.; Liu, T.; Wu, W.; Xu, X.; Zhou, B.; Wu, F.; Zhang, C. HIPPO: Accelerating Video Large Language Models Inference via Holistic-aware Parallel Speculative Decoding. arXiv 2026, arXiv:2601.08273. [Google Scholar] [CrossRef]
  49. Nguyen, T.D.; Kim, J.H.; Choi, J.; Choi, S.; Park, J.; Lee, Y.; Chung, J.S. Accelerating codec-based speech synthesis with multi-token prediction and speculative decoding. In Proceedings of the 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP); IEEE, 2025; pp. 1–5.
  50. Saon, G.; Thomas, S.; Fukuda, T.; Nagano, T.; Dekel, A.; Lastras, L. Self-Speculative Decoding for LLM-based ASR with CTC Encoder Drafts. arXiv 2026, arXiv:2603.11243. [Google Scholar]
  51. Teng, Y.; Shi, H.; Liu, X.; Ning, X.; Dai, G.; Wang, Y.; Li, Z.; Liu, X. Accelerating Auto-regressive Text-to-Image Generation with Training-free Speculative Jacobi Decoding. arXiv 2025, arXiv:2410.01699. [Google Scholar]
  52. Teng, Y.; Jiang, Z.; Shi, H.; Liu, X.; Ning, X.; Dai, G.; Wang, Y.; Li, Z.; Liu, X. SJD++: Improved Speculative Jacobi Decoding for Training-free Acceleration of Discrete Auto-regressive Text-to-Image Generation. arXiv 2025, arXiv:2512.07503. [Google Scholar]
  53. So, J.; Kook, H.; Jang, C.; Park, E. MC-SJD: Maximal Coupling Speculative Jacobi Decoding for Autoregressive Visual Generation Acceleration. arXiv 2025, arXiv:2510.24211. [Google Scholar]
  54. Teng, Y.; Wang, F.; Liu, X.; Chen, Z.; Shi, H.; Wang, Y.; Li, Z.; Liu, W.; Zou, D.; Liu, X. Speculative Jacobi-Denoising Decoding for Accelerating Autoregressive Text-to-image Generation. arXiv 2025, arXiv:2510.08994. [Google Scholar]
  55. Yu, Z.; Zhang, B.; Shan, B.; Liu, X.; Zhou, D.; Liang, G.; Ye, G.; Ye, Y. SJD-PV: Speculative Jacobi Decoding with Phrase Verification for Autoregressive Image Generation. arXiv 2026, arXiv:2603.06666. [Google Scholar]
  56. De Bortoli, V.; Galashov, A.; Gretton, A.; Doucet, A. Accelerated diffusion models via speculative sampling. arXiv 2025, arXiv:2501.05370. [Google Scholar] [CrossRef]
  57. Hu, H.; Das, A.; Sadigh, D.; Anari, N. Diffusion Models are Secretly Exchangeable: Parallelizing DDPMs via Autospeculation. arXiv 2025, arXiv:2505.03983. [Google Scholar] [CrossRef]
  58. Liu, J.; Zou, C.; Lyu, Y.; Ren, F.; Wang, S.; Li, K.; Zhang, L. SpeCa: Accelerating diffusion transformers with speculative feature caching. In Proceedings of the 33rd ACM International Conference on Multimedia, 2025; pp. 10024–10033. [Google Scholar]
  59. Huo, M.; Zhang, J.; Wang, H.; Xu, J.; Chen, Z.; Tai, H.; Chen, Y. Spec-LLaVA: Accelerating Vision-Language Models with Dynamic Tree-Based Speculative Decoding. arXiv 2025, arXiv:2509.11961. [Google Scholar]
  60. Liu, Y.; Qin, L.; Wang, S. Small drafts, big verdict: Information-intensive visual reasoning via speculation. arXiv 2025, arXiv:2510.20812. [Google Scholar] [CrossRef]
  61. Tong, Y.; Zhang, T.; Wan, Y.; Lin, K.; Yuan, J.; Hu, C. SAGE: Accelerating Vision-Language Models via Entropy-Guided Adaptive Speculative Decoding. arXiv 2026, arXiv:2602.00523. [Google Scholar]
  62. Ji, Y.; Zhang, J.; Xia, H.; Chen, J.; Shou, L.; Chen, G.; Li, H. SpecVLM: Enhancing speculative decoding of video LLMs via verifier-guided token pruning. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, 2025; pp. 7216–7230. [Google Scholar]
  63. Li, Y.; Wei, F.; Zhang, C.; Zhang, H. EAGLE: Speculative sampling requires rethinking feature uncertainty. arXiv 2024, arXiv:2401.15077. [Google Scholar] [CrossRef]
  64. Kwon, W.; Li, Z.; Zhuang, S.; Sheng, Y.; Zheng, L.; Yu, C.H.; Gonzalez, J.; Zhang, H.; Stoica, I. Efficient memory management for large language model serving with PagedAttention. In Proceedings of the 29th Symposium on Operating Systems Principles, 2023; pp. 611–626. [Google Scholar]
  65. Zheng, L.; Yin, L.; Xie, Z.; Sun, C.; Huang, J.; Yu, C.H.; Cao, S.; Kozyrakis, C.; Stoica, I.; Gonzalez, J.E.; et al. Sglang: Efficient execution of structured language model programs. Advances in neural information processing systems 2024, 37, 62557–62583. [Google Scholar]
  66. NVIDIA. TensorRT-LLM. 2024. Available online: https://github.com/NVIDIA/TensorRT-LLM (accessed on 11 March 2026).
  67. NVIDIA. TensorRT-Edge-LLM: High-Performance Inference for LLMs and VLMs on Embedded Platforms. GitHub repository. 2026. Available online: https://github.com/NVIDIA/TensorRT-Edge-LLM (accessed on 11 March 2026).
  68. MMRazor and MMDeploy Teams. LMDeploy: A Toolkit for Compressing, Deploying, and Serving LLMs. Available online: https://github.com/InternLM/lmdeploy (accessed on 11 March 2026).
  69. Wolf, T.; Debut, L.; Sanh, V.; Chaumond, J.; Delangue, C.; Moi, A.; Cistac, P.; Rault, T.; Louf, R.; Funtowicz, M.; et al. Huggingface’s transformers: State-of-the-art natural language processing. arXiv 2019, arXiv:1910.03771. [Google Scholar]
  70. Cai, T.; Li, Y.; Geng, Z.; Peng, H.; Lee, J.D.; Chen, D.; Dao, T. Medusa: Simple LLM inference acceleration framework with multiple decoding heads. arXiv 2024, arXiv:2401.10774. [Google Scholar] [CrossRef]
  71. Shen, H.; Wang, X.; Zhang, P.; Hsieh, Y.; Han, Q.; Wan, Z.; Zhang, Z.; Zhang, J.; Xiong, J.; Liu, Z.; et al. MMSpec: Benchmarking Speculative Decoding for Vision-Language Models. arXiv 2026, arXiv:2603.14989. [Google Scholar] [CrossRef]
Figure 1. Timeline of multimodal speculative decoding methods surveyed in this work. Colors indicate modality: methods span Vision–Language, Vision–Language–Action, Video–Language, Speech, T2I (Vision AR), and Diffusion models.
Figure 2. Overview of speculative decoding architectures across modalities. (a) Text-only: a small LLM drafts tokens verified by a large LLM. (b) VLM: drafting conditioned on text and (optionally compressed) visual tokens; the target VLM verifies with full visual encoding. (c) VLA: a compact or pruned VLA drafts action tokens, verified by the full VLA model. (d) VideoLM: video frames are tokenized with optional compression; the drafter generates tokens guided by target features. (e) Speech: acoustic inputs are tokenized via neural codecs; the target verifies with relaxed acceptance criteria. (f) T2I (Vision AR): visual tokens are drafted and verified in parallel; accepted tokens are decoded into images via a visual decoder (e.g., VQGAN). (g) Diffusion: a draft model proposes multi-step trajectory jumps, while the target model validates the proposed trajectory and rolls back upon divergence.
Figure 3. Taxonomy of speculative decoding for multimodal models. * denotes methods requiring training/fine-tuning; unmarked methods are training-free. Blue: draft architecture. Violet: draft execution. Teal: draft optimization. Orange: verification execution. Red: verification optimization. Green: frameworks.
Figure 4. Architectural overview of the three draft architecture categories. (a) Independent Drafter: a separate, smaller model generates draft tokens that the target verifies. (b) Shared-Backbone Drafter: the target model itself serves dual roles, using a sub-graph for fast drafting and the full model for verification. (c) Drafter-Free Speculation: applicable to diffusion models, where mathematical properties of the denoising process enable speculative steps without any auxiliary model.
Figure 5. Sub-taxonomy of draft architecture (Section 3.1). Methods are categorized by their structural relationship to the target model. * denotes methods requiring training/fine-tuning.
Figure 6. Sub-taxonomy of draft execution strategies (Section 3.2). Multi-token expansion explores depth; multi-candidate branching explores width. * denotes methods requiring training/fine-tuning.
Figure 7. Illustration of two draft execution strategies. (a) Multi-Token Expansion: multiple future tokens ($x_{t+1}$, $x_{t+2}$, $x_{t+3}$) are predicted in a single draft forward pass along a single trajectory, reducing the number of serial drafting steps. (b) Multi-Candidate Branching: multiple candidate paths ($c_1, c_2, \ldots$) are generated simultaneously, forming a speculation tree that is pruned and expanded over successive steps; the target model verifies all branches in parallel.
Figure 8. Sub-taxonomy of draft optimization strategies (Section 3.3). These strategies recur across modalities. Target-Informed Transfer covers inference-time signal injection; Drafter-Target Alignment covers training-time compatibility. * denotes methods requiring training/fine-tuning.
Figure 9. Illustration of four draft optimization strategies in multimodal speculative decoding. (a) Token/Input Compression: multimodal inputs are compressed from $N$ tokens to $M \ll N$ tokens via semantic or adaptive compression, reducing drafting latency while preserving information for the target. (b) KV Cache Optimization: the target model’s KV cache is shared with or reused by the drafter through learned projections or shared-backbone designs, eliminating redundant multimodal encoding. (c) Target-Informed Transfer: intermediate features or hidden representations from the target model are transferred to guide the drafter at inference time, tightening draft–target alignment without retraining. (d) Drafter-Target Alignment: the drafter is trained to align with the target through feature-level distillation ($\mathcal{L}_{\mathrm{feat}}$) or logit-level knowledge distillation ($\mathcal{L}_{\mathrm{KD}}$), embedding compatibility into the drafter’s weights.
Figure 10. Sub-taxonomy of verification execution strategies (Section 4.1). Linear verification remains dominant; tree, path, and iterative approaches address domain-specific needs. * denotes methods requiring training/fine-tuning.
Figure 11. Illustration of four verification execution strategies in multimodal speculative decoding. (a) Linear Verification: draft tokens are verified sequentially left to right; the first rejected token is replaced by the target model’s correction, and subsequent draft tokens are discarded. (b) Tree-Based Verification: multiple candidate continuations form a draft tree; the target model evaluates all branches in a single forward pass via tree attention masks and selects the longest accepted path. (c) Path-Level Verification: entire candidate paths are scored globally, and the highest-scoring path is selected rather than performing token-by-token accept/reject decisions. (d) Iterative/Jacobi-Style Verification: a fixed-length draft sequence initialized with noise is iteratively refined through forward passes; tokens that converge to stable predictions across consecutive iterations are progressively locked until the entire sequence reaches a fixed point.
Figure 12. Sub-taxonomy of verification optimization strategies (Section 4.2). These strategies recur across modalities. * denotes methods requiring training/fine-tuning.
Figure 13. Illustration of three verification optimization strategies in multimodal speculative decoding. (a) Cost Reduction: verification overhead is reduced by selectively skipping verification for high-confidence tokens or leveraging auxiliary information, avoiding redundant target model forward passes. (b) Relaxed Acceptance: the strict distributional match criterion is replaced with representation-level or latent-space proximity checks; draft outputs, whether discrete tokens, continuous latent states, or intermediate features, are accepted when they fall within an acceptable region around the target’s predictions, accommodating perceptual or functional equivalence across modalities. (c) Feedback Loop: verification outcomes are fed back to improve subsequent drafting; accepted tokens accumulate context while rejected tokens are recycled or reused rather than discarded, creating a closed-loop mechanism that progressively improves draft quality.
Table 1. Summary of formulations for draft architecture in multimodal speculative decoding. $p_{1 \ldots K}$ denotes the draft probability outputs for $K$ candidate positions, $M_{\mathrm{draft}}$ and $M_{\mathrm{target}}$ are the draft and target models (defined in Section 2.1), $\tilde{z}_{1 \ldots K}$ are proposed latent states, $F_{\mathrm{proposal}}$ is a training-free proposal function, and $s_t$ is the diffusion state at step $t$. Methods are categorized by their structural relationship to the target model.
Methods | Formulation ($p_1 \ldots p_K$ or $\tilde{x}_{1 \ldots K}$) | Architecture
Independent Drafters | $p_{1 \ldots K} = M_{\mathrm{draft}}(x_{\mathrm{prompt}})$ (separate model) | Small VLM / LM, Vision Predictor, ConvNet Head, Retrieval-Based
Shared Backbone | $p_{1 \ldots K} = M_{\mathrm{target}}(x_{\mathrm{prompt}}, \mathrm{skip})$ (target sub-graph) | Early Exiting, Layer Skipping, Sparse KV Routing, Jacobi Self-Drafting
Drafter-Free Speculation | $\tilde{z}_{1 \ldots K} \sim F_{\mathrm{proposal}}(s_t)$ (no auxiliary model) | Coupling Jumps, Exchangeability, Feature Forecasting
Table 2. Unified view of draft execution strategies. $t$ denotes the current decoding position, $h_t$ the hidden state at position $t$, and $f_k$ the prediction function for the $k$-th future token. $c_n$ is the $n$-th candidate sequence, $\eta_n$ its strategy-specific branching parameter (e.g., temperature or prompt variant), and $N$ the total number of candidates. Multi-token expansion predicts multiple future positions along a single trajectory, while multi-candidate branching explores multiple candidate continuations at the same step.
Strategy | State ($Q$) & Pattern | Drafter Mechanism
Multi-Token Expansion | $Q = \{t+1, \ldots, t+K\}$; $p_{t+k} = f_k(h_t)$, $k = 1, \ldots, K$ | Block Prediction, Jacobi Refinement, Semi-AR Heads
Multi-Candidate Branching | $Q = \{c_1, c_2, \ldots, c_N\}$; $c_n \sim M_{\mathrm{draft}}(x \mid \eta_n)$ | Branch Expansion, Speculation Trees, Parallel Proposals
Table 3. Evolution of correctness criteria in speculative decoding. x i denotes the i-th candidate token, p target / draft the target/draft model probability (Equation (1)), D a distance metric, ϕ target / draft the representation function of the target/draft model, τ an acceptance threshold, and N δ ( · ) a δ -neighborhood in the codebook embedding space. Classical text-only methods enforce distributional equivalence, while multimodal generation often adopts relaxed notions of consistency to accommodate continuous representations or perceptual invariances.
Table 3. Evolution of correctness criteria in speculative decoding. x i denotes the i-th candidate token, p target / draft the target/draft model probability (Equation (1)), D a distance metric, ϕ target / draft the representation function of the target/draft model, τ an acceptance threshold, and N δ ( · ) a δ -neighborhood in the codebook embedding space. Classical text-only methods enforce distributional equivalence, while multimodal generation often adopts relaxed notions of consistency to accommodate continuous representations or perceptual invariances.
| Criteria | Formulation | Motivation |
| --- | --- | --- |
| Distributional Match (Section 4.1.1) | Accept $x_i$ with probability $\min\!\left(1, \frac{p_{\text{target}}(x_i)}{p_{\text{draft}}(x_i)}\right)$ | Distribution-preserving decoding (lossless equivalence to target model) |
| Representation Consistency (Section 4.2.2) | $D\!\left(\phi_{\text{target}}(x_i), \phi_{\text{draft}}(x_i)\right) < \tau$ | Continuous-state validation (no discrete token identity) |
| Latent-Space Equivalence (Section 4.2.2) | $x_i^{\text{draft}} \in \mathcal{N}_{\delta}\!\left(x_i^{\text{target}}\right)$ | Perceptual / codebook invariance (multiple tokens map to the same meaning) |
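The first row of Table 3 is the classical lossless acceptance test of text-only speculative decoding; the relaxed rows replace token identity with a distance test. A minimal PyTorch sketch of both, with illustrative tensor shapes ($p_{\text{target}}, p_{\text{draft}} \in \mathbb{R}^{K \times V}$ over a vocabulary of size $V$):

```python
import torch

def verify_distributional(p_target, p_draft, drafted_ids):
    # Classical lossless acceptance: keep x_i with probability
    # min(1, p_target(x_i) / p_draft(x_i)); on the first rejection,
    # resample from the residual distribution (p_target - p_draft)_+.
    accepted = []
    for i, x in enumerate(drafted_ids):
        ratio = p_target[i, x] / p_draft[i, x]
        if torch.rand(()) <= torch.clamp(ratio, max=1.0):
            accepted.append(x)
        else:
            residual = torch.clamp(p_target[i] - p_draft[i], min=0.0)
            bonus = torch.multinomial(residual / residual.sum(), 1)
            return accepted, bonus.item()        # corrected continuation
    return accepted, None                        # all K drafts kept

def verify_relaxed(feat_target, feat_draft, tau):
    # Representation consistency: accept when the feature distance
    # D(phi_target(x_i), phi_draft(x_i)) falls below a threshold tau.
    return torch.norm(feat_target - feat_draft, dim=-1) < tau
```

The first function is exact in distribution regardless of draft quality; the second trades that guarantee for higher acceptance rates on continuous or perceptually redundant representations.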
Table 4. Production inference frameworks with speculative decoding support. “SD Methods” lists the supported speculative decoding algorithms. “MM” indicates native multimodal model support. “MM SD” indicates whether the speculative decoding pipeline can operate in a multimodal-aware manner rather than falling back to text-only speculation. “△” denotes partial support limited to specific model families or methods.
| Framework | SD Methods | MM | MM SD |
| --- | --- | --- | --- |
| vLLM [64] | EAGLE, EAGLE-3, MTP, Draft Model, n-gram, Suffix | ✓ | △^a |
| SGLang [65] | EAGLE, EAGLE-3, MTP, Standalone Draft, n-gram | ✓ | × |
| TensorRT-LLM [66] | EAGLE (1/2/3), Medusa, ReDrafter, MTP, Lookahead, Draft Model, n-gram | ✓ | × |
| TensorRT Edge-LLM [67] | EAGLE-3 | ✓ | ✓^b |
| LMDeploy [68] | Medusa (TreeMask) | ✓ | × |
| HF Transformers [69] | Draft Model (assisted_generation) | ✓ | ×^c |

^a As of v0.12.0, EAGLE/EAGLE-3 multimodal SD is supported for Qwen3-VL (PR #29594); broader multimodal draft model support is under development (Issue #33458). ^b Supports multi-batch EAGLE-3 for VLMs including Qwen2/2.5/3-VL, InternVL3, and Phi-4-Multimodal on embedded platforms. ^c Serves as the de facto prototyping backend for multimodal SD research methods, though its assisted_generation API itself lacks multimodal-specific optimizations.
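As footnote c notes, HF Transformers is the usual prototyping path: assisted generation accepts a compatible draft model through the `assistant_model` argument of `generate`. A minimal usage sketch (model names are illustrative placeholders; the draft and target here share the Llama tokenizer so their token ids are compatible):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assisted generation: the target verifies tokens proposed by `drafter`.
tok = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
target = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
drafter = AutoModelForCausalLM.from_pretrained(
    "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
)

inputs = tok("Summarize the scene in one sentence:", return_tensors="pt")
out = target.generate(**inputs, assistant_model=drafter, max_new_tokens=64)
print(tok.decode(out[0], skip_special_tokens=True))
```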
Table 5. Systematic comparative summary of representative multimodal speculative decoding methods. Methods are grouped by modality and analyzed through the lens of our proposed two-stage taxonomy (Drafting and Verification). “✓” denotes tuning-free deployment, while “×” signifies that the method requires auxiliary training or distillation. Speedups are self-reported in the respective original papers under varying configurations (e.g., target models, hardware platforms, benchmarks, and batch sizes) and are therefore not directly comparable across methods. Some entries are speculation-inspired reasoning frameworks rather than canonical lossless speculative decoding acceleration, some are evaluated in a document-parsing VLM setting, and some report block-efficiency improvement rather than direct wall-time speedup.
| Methods | Drafting: Mechanism | Drafting: Architecture / Approach | Tuning-free | Verification: Criterion | Verification: Pattern | Representative Target | Speedup (rep.) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| **Vision–Language Models** | | | | | | | |
| SPD-MLLM | Independent | Text-Only LM | × (Pretrain) | Strict | Linear | LLaVA (LLaMA) | 2.37× |
| MASSV | Independent | Small VLM | × (Distillation) | Strict | Linear | Qwen2.5-VL, Gemma3 | 1.46× |
| ViSpec | Independent | Visual Adaptor | × (Tuning) | Strict | Tree | LLaVA-1.6, Qwen2.5-VL | 3.22× |
| SpecVLM | Independent | Visual Compressor | × (Tuning) | Strict | Linear | LLaVA, Qwen2-VL | 2.9× |
| FastVLM | Shared Backbone | Early Exiting | × (Imitation) | Strict | Linear | LLaVA-1.5 | 1.55×–1.85× |
| MSD | Independent | Text-Vision Decouple | ✓ | Strict | Linear | LLaVA-1.5 | 2.46× |
| SpecFLASH | Independent | Semi-AR Heads | × (Tuning) | Strict | Linear | LLaVA, Qwen-VL | 2.55× |
| Spec-LLaVA | Independent | Token Tree | × (Tuning) | Strict | Tree | LLaVA-1.5 | 3.28× |
| SAGE | Independent | Adaptive Tree | × (Tuning) | Strict | Tree | LLaVA-OV, Qwen2.5-VL | 3.36× |
| HiViS | Independent | Visual Token Hiding | × (Tuning) | Strict | Linear | Qwen2.5-VL | 3.15× |
| AASD | Independent | T-D Attention | × (T-D Attn) | Strict | Linear | LLaVA | 2.0× |
| DREAM | Independent | Target-Informed | × (Tuning) | Strict | Linear | LLaVA-v1.6 | 3.6× |
| STAR | Independent | NAS + Target Feat. | × (OFA Tuning) | Strict | Linear | LLaVA-v1.6 | 3.8× |
| IbED | Independent | Multi-Prompt Ensemble | ✓ | Strict | Linear | LLaMA, LLaVA-1.5 | 1.06×–1.23× |
| TABED | Independent | Test-Time Adaptive Weighting | ✓ | Strict | Linear | LLaVA-1.5, LLaVA-NeXT | 1.74× |
| EdgeSD | Independent | VED + ITM + Tree | ✓ | Strict | Tree | LLaVA-OV, InternVL2.5 | 3.04×–5.12× |
| SV (Verdict) | Independent | Small VLM | ✓ | Path-Level (NLL) | Path | Qwen2.5-VL | N/A |
| HSD | Independent | Pipeline Draft | ✓ | Relaxed ($\tau$-Tol.) | Linear (Hier.) | Qwen3-VL | 4.89× |
| **Vision–Language–Action Models** | | | | | | | |
| Spec-VLA | Independent | Small VLA | × (Training) | Relaxed (Action) | Linear | OpenVLA | 1.42× |
| SpecPrune-VLA | Shared Backbone | Action-Aware Pruning | ✓ | Strict | Linear | OpenVLA-OFT, $\pi_0$ | 1.46×–1.57× |
| KERV | Independent | KF-Rectified Draft | × (Draft Train) | Relaxed (Kinematic) | Linear | OpenVLA (LIBERO) | 1.48×–1.57× |
| HeiSD | Independent | Hybrid Retrieval + Drafter | × (Draft Train) | Relaxed (Seq-Wise) | Tree | OpenVLA (LIBERO) | 2.45× |
| **Video–Language Models** | | | | | | | |
| STD | Shared Backbone | Sparse KV Routing | ✓ | Strict | Linear | Qwen2-VL, LLaVA-OV | 1.94× |
| SpecVLM (Video) | Independent | Token Tree | ✓ | Strict | Tree | LLaVA-OV | 2.68× |
| HIPPO | Shared Backbone | Pipeline Overlap | ✓ | Strict | Linear | LLaVA-OV | 3.51× |
| Sparrow | Independent | HSR + VATA | × (Training) | Strict | Linear | LLaVA-OV, Qwen2.5-VL | 2.82× |
| VideoSpeculateRAG | Independent | Small VLM + Per-Doc Parallel | ✓ | Tolerance-Based ($\delta$) | Path | Qwen2.5-VL-32B | 2× |
| ParallelVLM | Independent | UV-Prune + Pipeline Overlap | ✓ | Strict | Linear | LLaVA-OV, Qwen2.5-VL | 3.36× |
| **Speech and Audio Models** | | | | | | | |
| SSD | Independent | Small Audio LM | × (Distillation) | Relaxed ($\beta$-Tol.) | Linear | CosyVoice-2 | 1.4× |
| SpecASR | Independent | Adaptive Len. + Tree | × (Tuning) | Strict (w/ Repair) | Tree | Llama, Vicuna | 3.04×–3.79× |
| SMUD | Independent | CTC Pseudo-Draft | ✓ | Dual-Hypothesis | Linear | E-Branchformer ASR | 1.4× |
| Codec-MTP | Shared Backbone | Block Prediction | ✓ | Path-Level (Viterbi) | Path | VALL-E / USLM | 4.0×–5.0× |
| UGSD | Independent | Edge–Cloud Uncertainty Gate | ✓ | Relaxed (Top-R Rank) | Linear | Qwen2.5-Omni-3B / Qwen3-Omni | 1.40× |
| CTC-SSD | Shared Backbone | CTC Encoder Draft | ✓ | Relaxed (Likelihood $\tau$) | Linear | Granite-Speech-1B | 4.4× |
| **Text-to-Image (Vision AR) Models** | | | | | | | |
| LANTERN | Independent | Small AR Model | × (Tuning) | Relaxed (Latent) | Linear | LlamaGen | 1.75×–1.82× |
| LANTERN++ | Independent | Static Tree | × (Tuning) | Relaxed (Latent) | Tree | LlamaGen | 2.56× |
| GSD | Independent | Dynamic Clustering | ✓ | Relaxed (Grouped) | Linear | AR Image Models | 3.8× |
| SJD | Shared Backbone | Jacobi Iteration | ✓ | Convergence | Iterative | LlamaGen, Emu3 | 1.5×–2.0× |
| SJD2 | Shared Backbone | Denoise Trajectory | × (Fine-tune) | Convergence | Iterative | LlamaGen | 4.0× |
| MC-SJD | Shared Backbone | Gumbel Coupling | ✓ | Convergence | Iterative | LlamaGen | 3.8×–4.2× |
| VVS | Independent | Conf. Skip | ✓ | Conf. Skip | Linear | LlamaGen | 2.8× |
| SJD++ | Shared Backbone | Token Reuse | × (Fine-tune) | Convergence | Iterative | LlamaGen | 2.4× |
| CSpD | Independent | Density-Ratio Sampling | ✓ | Path-Level (Density) | Path | MAR | 2.33× |
| SJD-PV | Shared Backbone | Phrase Library + Jacobi | ✓ | Phrase-Level (Joint) | Path | Lumina-mGPT | 2.71× |
| MuLo-SD | Independent | Multi-Scale Drafting | × (Training) | Relaxed Local (Neighborhood) | Linear | Tar-1.5B | 1.7× |
| **Diffusion Models** | | | | | | | |
| SpeCa | Drafter-Free Speculation | Feature Caching | ✓ | Relaxed (Feat.) | Path | DiT, FLUX, HunyuanVideo | 7.3× |
| Accel. Diff. | Drafter-Free Speculation | Coupling Jumps | ✓ | Path-Level (Coupling) | Path | DiT, EDM | 2.0×–3.0× |
| ASD | Drafter-Free Speculation | Stochastic Exchange | ✓ | Convergence | Path | DDPM | 1.8×–4.0× |
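Because every speedup in Table 5 is a function of the acceptance rate $\alpha$ and draft length $K$, the classical expected-tokens-per-step formula from text-only speculative decoding, $\mathbb{E}[\text{tokens}] = (1 - \alpha^{K+1})/(1 - \alpha)$, is a useful sanity check on reported numbers: it upper-bounds the per-step gain before drafting and verification overheads are subtracted. A minimal illustration:

```python
def expected_tokens_per_step(alpha: float, K: int) -> float:
    # Expected number of tokens produced per target forward pass under
    # the classical i.i.d.-acceptance model of speculative decoding.
    return (1 - alpha ** (K + 1)) / (1 - alpha)

# e.g., a 0.8 per-token acceptance rate with K = 4 drafted tokens yields
# ~3.36 tokens per verification step -- an optimistic ceiling consistent
# with the roughly 2-3x wall-time speedups many Table 5 entries report.
print(expected_tokens_per_step(0.8, 4))  # ~3.36
```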