Preprint
Article

This version is not peer-reviewed.

SA-MAP: Similarity-Attention Boosted Token Compression via Merging-and-Pruning for Efficient Speech LLMs

  † These authors contributed equally to this work.

Submitted:

08 June 2026

Posted:

09 June 2026


Abstract
Speech Large Language Models (Speech LLMs) have shown strong performance across a range of tasks, including Automatic Speech Recognition (ASR) and general audio understanding. However, the quadratic complexity of self-attention constrains long-form scalability and incurs prohibitive inference overhead, so token compression has become a key strategy for lightweight and efficient inference. We propose SA-MAP, a plug-and-play compression framework for Similarity-Attention driven joint token Merging And Pruning that exploits the inherent temporal dependencies of speech. The framework operates in two sequential stages: the first stage merges adjacent tokens to maximize information retention, while the second stage prunes tokens to safeguard token diversity, thereby balancing compression intensity against information integrity. Extensive evaluations on mainstream Speech LLMs show that SA-MAP consistently achieves state-of-the-art (SOTA) results, outperforming established baselines. Notably, when applied to Qwen2-Audio and Kimi-Audio, SA-MAP achieves a 50% lossless token reduction in understanding tasks, and a 40% compression ratio in ASR with only marginal degradation in Word Error Rate (WER). The code is available at https://github.com/Tencent/AngelSlim/tree/SA-MAP.

1. Introduction

Benefiting from the advances in Large Language Models (LLMs) [1,2,3], Speech Large Language Models (Speech LLMs) [4,5,6,7] have achieved remarkable progress, demonstrating exceptional capabilities in both Automatic Speech Recognition (ASR) and complex audio understanding. A typical Speech LLM comprises three core components: (1) an audio encoder that converts raw waveforms into discrete or continuous representations; (2) a backbone LLM that generates responses autoregressively (which constitutes the primary computational bottleneck); and (3) an optional speech synthesizer for waveform reconstruction.
Most existing Speech LLMs remain constrained to short-duration audio (e.g., 30 s) due to prohibitive computational costs and latency. Because the backbone LLM is the primary inference bottleneck, the dense audio tokens in the input sequence account for the vast majority of the inference overhead. While high-frame-rate acoustic embeddings afford fine-grained resolution, they introduce substantial redundancy, particularly during silent intervals or stationary noise. Furthermore, the quadratic complexity of self-attention limits long-form audio scalability, increasing both inference latency and memory overhead. Consequently, efficient compression of audio tokens is essential to improve end-to-end throughput and enable long-context speech processing.
Current compression strategies generally fall into two categories: merging and pruning. In speech acceleration, merging-based methods (e.g., A-ToMe [8], FastAdaSP [9]) predominate due to the sequential nature of speech, progressively merging adjacent audio tokens during the prefill phase, whereas the pruning method of [10] selects the top-k salient audio tokens based on attention weights. However, these compression modules are primarily integrated into the LLM transformer blocks, which often introduces a compatibility bottleneck: direct access to internal attention scores is frequently precluded by highly optimized kernels such as FlashAttention [11], which prioritize throughput via fused operations.
Based on the core features they leverage, compression methods can be divided into similarity-based and attention-based paradigms. The former leverages inter-token correlations to retain a maximally diverse token set but risks omitting critical fine-grained details, whereas the latter exploits attention sparsity to extract salient tokens but often retains redundant duplicates. Hybrid approaches such as VisPruner [12] and VisionZip [13] attempt to combine these metrics across multiple stages; however, because similarity and attention are applied in isolation, they fail to exploit the complementary nature of the two features.
Notably, as shown in Figure 1, an inherent task divergence between Vision Language Models (VLMs) and Speech LLMs exacerbates the incompatibility of direct method migration. VLMs rely on sparse visual cues for reasoning and context comprehension, whereas Speech LLMs execute multiple tasks within a single model: they address sequence-intensive tasks (e.g., ASR) that require leveraging all audio tokens to ensure transcription accuracy, while simultaneously handling sparsity-oriented tasks (e.g., Emotion Recognition (ER), Spoken Question Answering (SQA)) where only a few tokens carry key predictive information. This duality necessitates a more prudent token reduction strategy that accounts for temporal correlations and varying semantic densities. Existing VLM works focus primarily on local spatial redundancy and instruction relevance of visual information, failing to capture these speech-specific temporal dynamics; thus, direct migration of such techniques to Speech LLMs cannot yield optimal performance and gains.
To address these limitations, we introduce SA-MAP, a plug-and-play framework for joint token merging and pruning, driven by the synergy of similarity and attention. To maximize hardware efficiency and ensure compatibility with optimized inference kernels, we deploy the SA-MAP module before the LLM. Unlike methods that require deep integration into the Transformer layers, our approach only uses attention scores from a single audio encoder layer, significantly reducing architectural coupling and memory overhead. Tailored to the temporal dependencies and redundancy of speech tokens, we propose a two-stage merging-pruning pipeline. In the merging stage, we group adjacent tokens based on intra-group global feature similarity and perform attention-guided weighted merging to collapse redundant segments while preserving critical semantic information. In the subsequent pruning stage, we implement a pruning kernel that integrates both attention and similarity information for diversity-driven pruning. By incorporating a similarity thresholding mechanism, SA-MAP adaptively calibrates the ratio between merging and pruning for each individual sample, balancing compression aggressiveness against task-specific accuracy.
Extensive experiments validate that SA-MAP effectively reduces the computational overhead of various Speech LLMs, with evaluations performed on three representative models: Qwen2-Audio, Kimi-Audio, and GLM-ASR. For dense ASR tasks, at a 60% token retention ratio, SA-MAP only incurs a marginal absolute WER degradation of 1.33%–2.26%. Notably, compared to existing methods, SA-MAP facilitates a further 40.5%–68.7% reduction in the performance gap relative to the uncompressed models. Furthermore, for general audio understanding tasks, SA-MAP achieves a 50% lossless token reduction, effectively halving the sequence length with negligible impact on semantic accuracy.
In summary, the contributions of our work are as follows:
  • We introduce an adaptive, hardware-friendly, plug-and-play framework driven by the synergy of similarity and attention for joint token Merging and Pruning.
  • Similarity and attention are synergistically integrated: merging employs intra-group global similarity with attention-guided weighted aggregation, while pruning uses a similarity-attention integrated kernel for diversity-driven pruning.
  • Extensive validation on multiple public speech datasets and mainstream Speech LLMs confirms the method’s effectiveness and generalization across ASR and audio understanding tasks.

3. Method

This section details the design principles and core components of SA-MAP. Section 3.1 covers attention mechanisms and token compression in Speech LLMs as preliminaries. Sections 3.2 and 3.3 examine the two core modules: the attention-enhanced token merging module and the attention-similarity based token pruning module. Section 3.4 presents the adaptive merging-pruning fusion framework, followed by the optional trainable tuning scheme in Section 3.5.

3.1. Preliminaries

Architecture of Speech LLMs. Existing Speech LLMs typically consist of three core components: an audio encoder, a modality projector, and an LLM. Audio encoders, such as Whisper large-v3 [25] (adopted in Qwen2-Audio and Kimi-Audio) and the audio tokenizer (utilized in Kimi-Audio), transform raw audio waveforms into continuous or discrete tokens that the LLM can process. The projection module aligns these audio tokens with the word embedding space of the LLM. The LLM then fuses the aligned audio and text information to generate the corresponding responses.
Audio Token compression. Most audio token compression modules are embedded within the backbone of LLMs. High compression ratio is achieved via multi-layer stacking, a design that entails a high level of hardware customization. Notably, some of these modules demand direct access to the attention scores of intermediate layers, rendering them generally incompatible with highly optimized acceleration libraries (e.g., FlashAttention). To maximize hardware efficiency, we draw inspiration from relevant studies in VLMs and deploy the proposed SA-MAP compression module upstream of the LLM, only utilizing attention scores from a single audio encoder layer.
Attention in the audio encoder. Typical audio encoders, e.g., Whisper large-v3, are based on the Transformer architecture. We assess the importance of each audio token by analyzing the attention scores within the audio encoder. Specifically, we compute the attention score according to Eq. 1:
$$A_h = \mathrm{Softmax}\left(\frac{Q_h K_h^{\top}}{\sqrt{D_h}}\right), \tag{1}$$
where, within multi-head attention, $A_h$ denotes the attention score of head $h$, $D_h$ the head dimension, and $Q_h$ and $K_h$ the query and key, respectively.
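The per-head attention score above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the encoder's actual implementation: the helper names `softmax` and `head_attention` and the toy query and key values are hypothetical, and the scaling by the square root of the head dimension follows the standard scaled dot-product form.

```python
import math

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def head_attention(Q, K):
    """Attention scores of one head: Softmax(Q K^T / sqrt(D_h)), computed row by row."""
    scale = math.sqrt(len(Q[0]))  # D_h is the per-head feature dimension
    return [
        softmax([sum(q * k for q, k in zip(q_row, k_row)) / scale for k_row in K])
        for q_row in Q
    ]

# Toy head with N = 3 tokens and D_h = 2 (illustrative values only).
Q_h = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K_h = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
A_h = head_attention(Q_h, K_h)
```

Each row of `A_h` is a probability distribution over key positions, which is what makes column-wise aggregation a plausible measure of how much attention a token receives.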

3.2. Attention-Enhanced Token Merging

Merging similar audio tokens can eliminate redundant audio information while preserving critical content. To this end, we propose a token merging method, Intra-Group Global Similarity-based Attention-Weighted Token Merging (G-SAM), deployed upstream of the LLM and implemented in two sequential steps. First, audio token grouping leverages the similarity between audio features to cluster adjacent, highly similar audio tokens into groups for subsequent merging. Second, to maximize the information retained after merging, we use attention information from the audio encoder as an importance metric and perform weighted merging on the tokens within each group.
Audio Token Grouping. Given the audio features $H_a \in \mathbb{R}^{N \times D}$ generated by the projection module preceding the LLM, we compute the pairwise cosine similarity across tokens to derive the similarity matrix $L$:
$$L_{i,j} = \frac{H_a^{i} \cdot H_a^{j}}{\|H_a^{i}\| \cdot \|H_a^{j}\|}. \tag{2}$$
Unlike A-ToMe and FastAdaSP, which only use the similarity between adjacent tokens for clustering, our method considers the similarity between a candidate token and every token already in the group. This captures more comprehensive correlation patterns within the group and avoids the locality bias of relying solely on adjacent relationships, which yields more coherent clusters and better preserves the key semantic information in the audio.
As illustrated in Figure 2, we fix the merging similarity threshold $\lambda$ and iterate through the audio token list $S = \{1, \dots, N\}$, sequentially incorporating adjacent indices into a merged cluster. Let $s_i = \{k, \dots, k+m-1\}$ denote the current merged cluster, whose cardinality is $m$. We then compute the average similarity between the next adjacent token (with index $k+m$) and the tokens in $s_i$, formally defined as:
$$l_{k+m,\,s_i} = \frac{1}{m} \sum_{t=k}^{k+m-1} L_{k+m,\,t}. \tag{3}$$
If $l_{k+m,\,s_i} \geq \lambda$, we incorporate the token indexed by $k+m$ into $s_i$; otherwise, we initialize a new cluster $s_{i+1} = \{k+m\}$. This procedure is repeated until all audio tokens are assigned to clusters, yielding $K$ merged clusters, formally denoted as $S^{*} = \{s_i \mid i = 1, \dots, K\}$.
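The grouping procedure described above can be sketched as follows. This is a minimal, dependency-free illustration rather than the authors' code: the function names `cosine` and `group_adjacent` and the toy features are hypothetical, indices are 0-based, feature vectors are assumed to be non-zero, and ties at the threshold are merged (`>=`), which is an assumption since Algorithm A1 writes the comparison as a strict `>`.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def group_adjacent(features, lam):
    """Greedy left-to-right grouping of adjacent tokens.

    A token joins the current cluster when its mean cosine similarity to
    ALL tokens already in that cluster reaches the threshold lam;
    otherwise it starts a new cluster.
    """
    if not features:
        return []
    clusters = [[0]]
    for idx in range(1, len(features)):
        current = clusters[-1]
        mean_sim = sum(cosine(features[idx], features[t]) for t in current) / len(current)
        if mean_sim >= lam:
            current.append(idx)
        else:
            clusters.append([idx])
    return clusters

# Toy sequence: two similar frames, two similar frames, then a return to the first pattern.
features = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [1.0, 0.0]]
clusters = group_adjacent(features, lam=0.8)
```

Because clusters only ever grow by the next adjacent index, the last token is not merged back into the first cluster even though the two are identical, which reflects the temporal (adjacent-only) grouping of the method.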
Weighted Token Merging. We extract attention scores from the penultimate (-2) layer of the audio encoder to calculate the importance of each audio token, as this layer retains the fine-grained semantic information of audio features. We observe that attention maps differ significantly across heads, with each head capturing distinct audio semantic dimensions. To retain the most discriminative importance information encoded by attention, we take the maximum over the head dimension and then derive the importance score $W_k$ of each audio token, with $W \in \mathbb{R}^{N}$:
$$W_k = \frac{1}{N} \sum_{j=0}^{N} \max_{i} A_{i,j,k}, \tag{4}$$
where $N$ denotes the length of the audio token sequence, $H$ the number of attention heads, $i$ indexes the heads, and $A \in \mathbb{R}^{H \times N \times N}$ holds the attention scores.
Then, based on Eq. 5, we aggregate the audio features of each subset $s_i$ in the cluster set $S^{*}$ into a single token feature, yielding $K$ compressed tokens $\tilde{H}_a = \{\tilde{H}_a^{i} \mid i = 1, \dots, K\}$; the corresponding index list is updated to $\tilde{S} = \{1, \dots, K\}$, where
$$\tilde{H}_a^{i} = \frac{\sum_{j \in s_i} W_j \cdot H_a^{j}}{\sum_{j \in s_i} W_j}, \tag{5}$$
and $\tilde{H}_a^{i}$ denotes the merged feature corresponding to the cluster subset $s_i$.
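The importance scores and the weighted merge can be illustrated with a short sketch. The names `importance_scores` and `weighted_merge` and all numeric values are hypothetical, and the code makes one simplifying assumption: the average runs over the N query positions indexed 0 to N-1, and each cluster is assumed to have a positive total weight.

```python
def importance_scores(attn):
    """Per-token importance: max over heads, then mean over query positions.

    attn[h][j][k] is the attention of head h from query j to key k.
    """
    n = len(attn[0])
    return [
        sum(max(head[j][k] for head in attn) for j in range(n)) / n
        for k in range(n)
    ]

def weighted_merge(features, clusters, weights):
    """Collapse each cluster into one feature via an importance-weighted average."""
    dim = len(features[0])
    merged = []
    for cluster in clusters:
        total = sum(weights[j] for j in cluster)
        merged.append([
            sum(weights[j] * features[j][d] for j in cluster) / total
            for d in range(dim)
        ])
    return merged

# Two heads, three tokens (each row sums to 1, as softmax output would).
attn = [
    [[0.6, 0.3, 0.1], [0.2, 0.7, 0.1], [0.3, 0.3, 0.4]],
    [[0.2, 0.2, 0.6], [0.1, 0.1, 0.8], [0.5, 0.4, 0.1]],
]
features = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
W = importance_scores(attn)
merged = weighted_merge(features, clusters=[[0, 1], [2]], weights=W)
```

In this toy case token 1 receives slightly more attention than token 0, so the first merged feature leans toward token 1 instead of sitting at the plain midpoint that average merging would give.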

3.3. Attention-Similarity Based Token Pruning

Furthermore, we design an attention-enhanced, similarity-driven pruning module, denoted ADPruner, which integrates attention into the similarity matrix to optimize the kernel of a determinantal point process (DPP), maximizing the diversity of the retained audio tokens while preserving their critical semantic information.
DPP [26] was initially introduced to model fermion repulsion in quantum physics and has since been widely applied in list-wise diversity modeling. Formally, it is a probability measure defined on the power set of a discrete set $S$, characterized by a positive semi-definite (PSD) kernel matrix $L \in \mathbb{R}^{N \times N}$ indexed by the elements of $S$. The probability of sampling a subset $\hat{S} \subseteq S$ is:
$$P(\hat{S}) = \frac{\det(L_{\hat{S}})}{\det(L + I)} \propto \det(L_{\hat{S}}), \tag{6}$$
where $L_{\hat{S}}$ represents the principal submatrix of $L$ corresponding to $\hat{S}$.
Via the DPP sampling procedure, the optimal subset $\hat{H}_a$ is determined as:
$$\hat{S}^{*} = \arg\max_{\hat{S} \subseteq S,\ |\hat{S}| = m} \det(L_{\hat{S}}), \qquad \hat{H}_a = \{H_a^{i}\}_{i \in \hat{S}^{*}}. \tag{7}$$
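As a small worked case (an illustration added here, not taken from the source), consider a subset of two tokens with unit self-similarity and cosine similarity $s$ between them. The determinant of the corresponding principal submatrix then has a closed form that shows why a DPP favors diverse subsets:

```latex
L_{\hat{S}} =
\begin{pmatrix}
1 & s \\
s & 1
\end{pmatrix},
\qquad
\det(L_{\hat{S}}) = 1 - s^{2}.
```

The value is largest when the two tokens are orthogonal ($s = 0$) and vanishes when they are duplicates ($s = \pm 1$), so near-identical tokens are unlikely to be retained together.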
In the context of VLM token pruning, CDPruner [22] leverages DPP to model the diversity of the retained subset of visual tokens. Given a set of visual tokens $H_v$, the kernel matrix $L$ is defined based on the pairwise cosine similarity of visual features. Similarly, when applied to audio token pruning, the default kernel matrix $L$ is constructed from the pairwise cosine similarity of audio features (Eq. 2). If the DPP kernel matrix relies solely on feature similarity, it tends to overlook fine-grained critical information when preserving diversity. To address this issue, we propose an attention-enhanced DPP kernel matrix, which accounts for both the feature similarity and the importance of audio tokens during pruning. As in the merging stage, we extract attention scores from the penultimate layer of the audio encoder to compute the importance matrix $\hat{A} \in \mathbb{R}^{N \times N}$ for audio tokens, $\hat{A} = \frac{1}{H} \sum_{i=0}^{H} A_{i,j,k}$.
Finally, we integrate feature similarity and token importance for audio token pruning, yielding our attention-enhanced DPP-based pruning method (ADPruner), as illustrated in Figure 2. Specifically, we weight the original kernel matrix by the derived importance scores to construct a new conditional kernel matrix:
$$\hat{L} = \mathrm{diag}(\hat{A}) \cdot L \cdot \mathrm{diag}(\hat{A}). \tag{8}$$
The updated log-probability of the subset $\hat{S}$ under the DPP is:
$$\log\det(\hat{L}_{\hat{S}}) = \sum_{i \in \hat{S}} \log(\hat{A}_i^{2}) + \log\det(L_{\hat{S}}), \tag{9}$$
which jointly considers both the feature similarity and the importance of the retained audio tokens. We then obtain the optimal subset via MAP inference [27].
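The kernel construction and the greedy selection can be sketched as below. This is a hedged illustration, not the authors' implementation: it assumes a per-token importance vector (one scalar per token) in place of the matrix notation of the text, uses a standard greedy MAP procedure with incremental Cholesky-style updates, and all names (`conditional_kernel`, `greedy_map`) and numbers are hypothetical.

```python
import math

def conditional_kernel(sim, importance):
    """Importance-weighted kernel: L_hat = diag(a) * L * diag(a)."""
    n = len(sim)
    return [[importance[i] * sim[i][j] * importance[j] for j in range(n)]
            for i in range(n)]

def greedy_map(kernel, m, eps=1e-12):
    """Greedy DPP MAP inference: repeatedly add the token with the largest
    marginal gain d_i^2, updating the gains after every selection."""
    n = len(kernel)
    if n == 0 or m <= 0:
        return []
    c = [[] for _ in range(n)]               # incremental Cholesky-style rows
    d2 = [kernel[i][i] for i in range(n)]    # marginal gains d_i^2
    j = max(range(n), key=lambda i: d2[i])
    if d2[j] < eps:
        return []
    selected = [j]
    while len(selected) < min(m, n):
        d_j = math.sqrt(d2[j])
        remaining = [i for i in range(n) if i not in selected]
        for i in remaining:
            e_i = (kernel[j][i] - sum(a * b for a, b in zip(c[j], c[i]))) / d_j
            c[i].append(e_i)
            d2[i] -= e_i * e_i
        j = max(remaining, key=lambda i: d2[i])
        if d2[j] < eps:                      # nothing new left to add
            break
        selected.append(j)
    return selected

# Tokens 0 and 1 are near-duplicates; token 3 is distinct but has low importance.
sim = [[1.0, 0.95, 0.1, 0.0],
       [0.95, 1.0, 0.2, 0.0],
       [0.1, 0.2, 1.0, 0.0],
       [0.0, 0.0, 0.0, 1.0]]
importance = [1.0, 0.9, 0.8, 0.3]
kernel = conditional_kernel(sim, importance)
kept = greedy_map(kernel, 2)
top2_by_importance = sorted(range(4), key=lambda i: -importance[i])[:2]

# Numerical check of the log-determinant factorisation on the kept pair.
a, b = kept
lhs = math.log(kernel[a][a] * kernel[b][b] - kernel[a][b] ** 2)
rhs = (math.log(importance[a] ** 2) + math.log(importance[b] ** 2)
       + math.log(sim[a][a] * sim[b][b] - sim[a][b] ** 2))
```

In this toy case, ranking by importance alone would keep the two near-duplicate tokens 0 and 1, while the importance-weighted DPP keeps tokens 0 and 2, trading a little importance for diversity.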

3.4. Adaptive Merging-Pruning Fusion Framework

Merging combines highly similar tokens, reducing redundancy while preserving key information. However, under high compression rates, merging alone may force the fusion of dissimilar tokens, which can distort speech representations. In contrast, pruning keeps only the most important tokens and removes the others, which may discard useful information. To address these issues, SA-MAP integrates the improved merging and pruning modules, as shown in Figure 2. We introduce a similarity threshold that adaptively adjusts the ratio of merging to pruning for each sample, thereby achieving high compression rates while robustly maintaining model performance. The choice of threshold affects accuracy, as illustrated in Figure 3. In practice, the threshold is selected through a manual search on a small dataset, which keeps the tuning cost low. As described in Algorithm A1 (Appendix A), the compression process consists of two stages. In the first stage, adjacent tokens are selectively merged based on the threshold to preserve as much information as possible. In the second stage, pruning is applied to the merged tokens to improve representation diversity and reduce redundancy. By combining merging and pruning, our method achieves an effective balance between compression efficiency and information preservation.
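To make the per-sample adaptivity concrete, the sketch below counts how many tokens are removed by merging and how many are left for pruning under a fixed threshold and a fixed token budget. It is an illustration under assumptions: the helper names (`count_clusters`, `merge_prune_split`), the toy samples, and the use of `>=` at the threshold are hypothetical, and the pruning stage is represented only by the number of tokens it would have to drop.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def count_clusters(features, lam):
    """Number of clusters produced by threshold-based adjacent grouping."""
    clusters = []
    for idx, feat in enumerate(features):
        if clusters:
            current = clusters[-1]
            mean_sim = sum(cosine(feat, features[t]) for t in current) / len(current)
            if mean_sim >= lam:
                current.append(idx)
                continue
        clusters.append([idx])
    return len(clusters)

def merge_prune_split(features, lam, m):
    """Return (tokens removed by merging, tokens removed by pruning) for budget m."""
    k = count_clusters(features, lam)
    return len(features) - k, max(k - m, 0)

# Two samples with the same length (8 tokens) and the same budget (keep 3).
smooth = [[1, 0], [1, 0.1], [1, 0.2], [0, 1], [0.1, 1], [1, 1], [1, 0.9], [0, 1]]
varied = [[1, 0], [0, 1], [1, 0], [0, 1], [1, 1], [1, -1], [0, 1], [1, 0]]
smooth_split = merge_prune_split(smooth, lam=0.9, m=3)
varied_split = merge_prune_split(varied, lam=0.9, m=3)
```

With the same threshold and budget, the locally redundant sample is compressed mostly by merging (4 merged, 1 pruned), whereas the rapidly changing sample is compressed entirely by pruning (0 merged, 5 pruned).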

3.5. Optional: Trainable Compression Module

We introduce a trainable compression module to enhance token pruning within the merging-pruning joint framework, which integrates two lightweight, end-to-end optimized components: a learnable merging weight generator and a trainable importance scorer. Each consists of two linear projection layers, similar to the key and query transformations in standard self-attention. The former adaptively determines token fusion weights based on input-dependent attention patterns, while the latter estimates token importance through simplified self-attention interactions for refining the DPP kernel. The details are shown in Appendix C.
Tuning merging weights. As shown in Eq. 4 and Eq. 5, we perform weighted fusion of similar tokens in the merging module. In our joint framework, the attention scores from the penultimate layer of the audio encoder are used for this computation. Alternatively, the weighting coefficients in Eq. 5 can be derived from a trainable weighting module. Specifically, we design a lightweight module that generates these coefficients dynamically, in a manner analogous to computing attention scores. This module consists of two linear projection layers, similar to the key and query transformations in standard self-attention. We first project the audio features using two separate learnable weight matrices, $\Theta_q$ and $\Theta_k$:
$$Q = H_a \Theta_q, \qquad K = H_a \Theta_k, \tag{10}$$
where $\Theta_q, \Theta_k \in \mathbb{R}^{D \times D}$. The weighting coefficient $W_i$ for each token $i$ is then computed from sigmoid-activated scaled dot-product scores over the input tokens:
$$A = \mathrm{Sigmoid}\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right), \qquad W_i = \frac{1}{N} \sum_{j=0}^{N-1} A_{ij}. \tag{11}$$
Subsequently, these coefficients are used in Eq. 5 to perform the weighted merging, allowing the model to adaptively determine the contribution of each token based on the input context.
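The forward pass of such a weight generator can be sketched without any framework. This is only an illustration of the computation, not the trainable module itself: the projection matrices below are fixed toy values standing in for learned parameters, and the helper names (`matmul`, `sigmoid`, `merging_weights`) are hypothetical.

```python
import math

def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def merging_weights(H, theta_q, theta_k):
    """Project features to queries and keys, apply a sigmoid to the scaled
    dot products, and average each row to get one weight per token."""
    Q = matmul(H, theta_q)
    K = matmul(H, theta_k)
    scale = math.sqrt(len(Q[0]))
    A = [[sigmoid(sum(q * k for q, k in zip(q_row, k_row)) / scale) for k_row in K]
         for q_row in Q]
    return [sum(row) / len(row) for row in A]

# Three tokens with D = 2; toy projections standing in for learned parameters.
H = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
theta_q = [[1.0, 0.0], [0.0, 1.0]]
theta_k = [[0.5, 0.0], [0.0, 2.0]]
W = merging_weights(H, theta_q, theta_k)
```

Because the sigmoid is applied elementwise (there is no row-wise normalization as in softmax), every weight lies strictly between 0 and 1, and tokens that interact strongly with many others receive larger weights.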
Tuning importance scorer. We introduce a lightweight, learnable importance scorer to calculate the importance scores of tokens in the DPP kernel. The design of this scorer draws inspiration from the Token Selector module proposed in VisionSelector [28]. Specifically, the input representation is projected into queries and keys via two separate linear layers (Eq. 10), after which a simplified self-attention matrix is computed to capture pairwise token interactions. The importance score for each token is then derived by averaging its interaction scores with all tokens, as follows:
$$P = \frac{Q K^{\top}}{\sqrt{d_k}}, \qquad A_i = \frac{1}{N} \sum_{j=0}^{N-1} P_{ij}. \tag{12}$$
Finally, the resulting importance score vector is incorporated into the similarity matrix to construct a refined DPP kernel, as described by Eq. 8, which guides the subsequent pruning of tokens.
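A forward pass of the scorer, followed by the kernel refinement, might look like the sketch below. It is an assumption-laden illustration: identity projections stand in for learned parameters, the similarity matrix is the cosine similarity of the same toy features, and the function names (`importance_scorer`, `cosine_matrix`, `refined_kernel`) are hypothetical.

```python
import math

def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def importance_scorer(H, theta_q, theta_k):
    """Mean scaled dot-product interaction of each token with all tokens."""
    Q, K = matmul(H, theta_q), matmul(H, theta_k)
    scale = math.sqrt(len(Q[0]))
    P = [[sum(q * k for q, k in zip(q_row, k_row)) / scale for k_row in K]
         for q_row in Q]
    return [sum(row) / len(row) for row in P]

def cosine_matrix(H):
    """Pairwise cosine similarity of non-zero feature vectors."""
    norms = [math.sqrt(sum(x * x for x in row)) for row in H]
    return [[sum(a * b for a, b in zip(H[i], H[j])) / (norms[i] * norms[j])
             for j in range(len(H))] for i in range(len(H))]

def refined_kernel(sim, scores):
    """Scale the similarity kernel on both sides by the importance scores."""
    n = len(sim)
    return [[scores[i] * sim[i][j] * scores[j] for j in range(n)] for i in range(n)]

H = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]      # toy stand-in for learned projections
scores = importance_scorer(H, identity, identity)
kernel = refined_kernel(cosine_matrix(H), scores)
```

Since no activation is applied to the raw scores, they may in principle be negative; the refined kernel stays symmetric and positive semi-definite regardless, because each score enters it on both sides of the similarity matrix.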

4. Experiments

To validate the effectiveness of our method, we conduct a series of experiments. This section details our experimental setup, and the evaluation benchmarks used.

4.1. Experiment Setting

We evaluate on open speech datasets covering ASR and speech understanding tasks. For ASR, we select LibriSpeech [29], Fleurs [30], AISHELL-1 [31], AISHELL-2 iOS [32], and WenetSpeech [33]. For speech understanding, we choose MELD [34], VocalSound [35], TUT2017 [36], Nonspeech7k [37], and MMAU [38]. We conduct experiments on Qwen2-Audio, Kimi-Audio, and GLM-ASR, and select A-ToMe and FastAdaSP for comparison. In addition, we migrate methods from the visual domain to audio models, including the pure pruning approaches VisPruner and CDPruner, as well as VisionZip, which combines pruning and merging. Our experiments are conducted using the Kimi-Audio-Evalkit toolbox [7]. For ASR and speech understanding tasks, we report WER and accuracy, respectively.
Model Configuration: We consistently integrate SA-MAP after the encoder and the projection module of the model. For Kimi-Audio, which uses both discrete and continuous features, we use the continuous features to calculate the similarity matrix. For a fair comparison, we modify A-ToMe and FastAdaSP, which originally perform merging within the LLM blocks, so that they operate after the projector instead. Unlike the toolbox, we do not employ GPT-4o-mini as the judge model for the sparse tasks; instead, we use string matching to ensure a fair comparison.

4.2. Main Results

In this section, we compare our method with other methods on both dense and sparse tasks. For the full experimental results, please refer to Appendix B.
Performance on Dense ASR Tasks. On dense ASR benchmarks, preserving fine-grained acoustic details is critical. As shown in Table 1, SA-MAP’s two-stage merging-pruning approach clearly outperforms methods that rely on a single metric or on local dependencies alone. For instance, at a 60% retention rate on Qwen2-Audio, SA-MAP reduces the absolute average WER by 1.53% and 0.96% compared to the best-performing pure merging method (FastAdaSP) and the strongest pure pruning method (CDPruner), respectively, narrowing the performance gap between these methods and the uncompressed baseline by 53.4% and 41.9%. Furthermore, compared with the hybrid approach VisionZip, SA-MAP yields a substantially larger reduction of 5.83% in absolute average WER, closing the performance gap by 81.4%.
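The gap figures quoted above can be recomputed from the average WER column of Table 1. The short sketch below assumes the gap reduction is defined as the WER difference between a competitor and SA-MAP, divided by the competitor's gap to the uncompressed model; that definition and the function name `gap_closed` are assumptions made for illustration.

```python
def gap_closed(baseline, competitor, ours):
    """Share of the competitor's WER gap to the uncompressed baseline
    that is removed by switching to our method."""
    return (competitor - ours) / (competitor - baseline)

# Average WER (%) on Qwen2-Audio at 60% token retention, taken from Table 1.
baseline, ours = 4.06, 5.39
competitors = {"FastAdaSP": 6.92, "CDPruner": 6.35, "VisionZip": 11.22}

for name, wer in competitors.items():
    delta = wer - ours
    share = 100 * gap_closed(baseline, wer, ours)
    print(f"{name}: absolute WER reduction {delta:.2f}, gap closed {share:.1f}%")
```

Under this definition the CDPruner and VisionZip cases reproduce the reported 41.9% and 81.4%; the FastAdaSP case comes out at about 53.5% from the rounded table averages, which is within rounding of the reported 53.4%.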
Across different compression ratios, our approach consistently outperforms existing methods, particularly under high compression settings. For example, on Qwen2-Audio with retention rates ranging from 70% to 50%, SA-MAP further narrows the performance gap with the unpruned model by 41.84%–45.76% compared with the best-performing existing pruning method. Similar trends are observed across different backbones. On Kimi-Audio, SA-MAP achieves an additional reduction of 17.92%–41.20% at retention rates between 70% and 50%. Likewise, on GLM-ASR-Nano with retention rates from 70% to 60%, SA-MAP further reduces the gap by 40.45%–52.48%. These results indicate that SA-MAP maintains stable and superior performance across a wide range of compression levels and backbones. Details are reported in Table A1 to Table A3.
Performance on Sparse Understanding Tasks. For sparse understanding tasks, SA-MAP relies on adaptive compression to balance efficiency and semantic preservation, as shown in Table 2. On Qwen2-Audio at 50% compression, it achieves an average accuracy of 64.82%, an absolute improvement of 0.40% over the unpruned baseline. Compared with the hybrid method VisionZip and the merging method FastAdaSP, SA-MAP delivers further absolute gains of 0.64% and 1.50%, respectively, demonstrating its robustness under aggressive compression. On Kimi-Audio, SA-MAP achieves the highest average accuracy among the compressed methods (73.23%) and the top score on 4 out of 7 sub-tasks. Notably, on the challenging MMAU benchmark, it surpasses the unpruned baseline on the music subset (by 0.90% on Qwen2-Audio) and on the sound subset (by 0.60% on Kimi-Audio), demonstrating that our joint attention–similarity modeling effectively removes acoustic redundancy while preserving critical semantic information. Across other subsets, SA-MAP maintains stable recognition performance, highlighting its robustness in complex acoustic environments.
Table 1. Performance comparison of different pruning methods on dense ASR tasks (Qwen2-Audio / Kimi-Audio / GLM-ASR), measured in WER (%).

| Method | LibriSpeech dev_clean | LibriSpeech dev_other | LibriSpeech test_clean | LibriSpeech test_other | Fleurs zh | Fleurs en | AISHELL-1 | AISHELL-2 | WenetSpeech test-meeting | WenetSpeech test-net | Average |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Qwen2-Audio | 1.67 | 3.65 | 1.74 | 4.03 | 3.63 | 5.20 | 1.52 | 3.08 | 8.40 | 7.64 | 4.06 |
| Retain 60% Tokens (40% Compression Ratio) | | | | | | | | | | | |
| VisionZip | 7.31 | 10.35 | 7.08 | 10.10 | 8.02 | 6.85 | 6.99 | 13.85 | 17.88 | 23.75 | 11.22 |
| VisPruner | 7.42 | 9.74 | 7.20 | 9.69 | 7.00 | 8.39 | 6.91 | 9.67 | 13.37 | 14.07 | 9.35 |
| CDPruner | 4.22 | 6.05 | 4.18 | 6.53 | 4.88 | 7.17 | 2.70 | 4.62 | 12.29 | 10.90 | 6.35 |
| A-ToMe | 4.12 | 6.81 | 4.20 | 6.98 | 8.00 | 8.13 | 4.18 | 5.56 | 14.05 | 14.15 | 7.62 |
| FastAdaSP | 4.91 | 7.26 | 4.95 | 7.51 | 5.47 | 7.31 | 3.28 | 4.69 | 11.51 | 12.30 | 6.92 |
| SA-MAP | 2.59 | 5.00 | 2.72 | 5.02 | 4.37 | 5.94 | 2.69 | 4.42 | 11.05 | 10.11 | 5.39 |
| Kimi-Audio | 1.23 | 2.39 | 1.38 | 2.45 | 2.87 | 4.92 | 0.61 | 2.57 | 6.33 | 5.39 | 3.01 |
| Retain 60% Tokens (40% Compression Ratio) | | | | | | | | | | | |
| VisionZip | 6.35 | 7.94 | 5.93 | 7.71 | 7.63 | 8.91 | 6.84 | 10.01 | 14.65 | 19.51 | 9.55 |
| VisPruner | 5.36 | 6.90 | 5.07 | 6.75 | 5.73 | 7.82 | 4.36 | 7.65 | 13.70 | 18.56 | 8.19 |
| CDPruner | 5.70 | 6.96 | 5.44 | 6.78 | 6.73 | 8.66 | 5.07 | 9.21 | 14.95 | 18.41 | 8.79 |
| A-ToMe | 4.13 | 6.04 | 3.92 | 5.87 | 6.26 | 7.57 | 3.58 | 6.88 | 12.97 | 14.99 | 7.22 |
| FastAdaSP | 4.49 | 6.15 | 4.28 | 6.02 | 4.27 | 7.64 | 2.22 | 5.67 | 12.62 | 14.75 | 6.81 |
| SA-MAP | 2.32 | 3.67 | 2.49 | 3.75 | 4.37 | 6.26 | 2.05 | 3.99 | 11.47 | 12.35 | 5.27 |
| GLM-ASR-Nano | 2.14 | 4.05 | 2.18 | 4.53 | 3.44 | 4.11 | 2.47 | 3.48 | 8.43 | 6.65 | 4.15 |
| Retain 70% Tokens (30% Compression Ratio) | | | | | | | | | | | |
| VisionZip | 9.18 | 12.75 | 8.27 | 13.27 | 8.17 | 8.08 | 13.9 | 20.55 | 19.36 | 20.45 | 13.40 |
| VisPruner | 9.22 | 12.33 | 8.34 | 12.13 | 6.95 | 8.33 | 11.68 | 14.57 | 18.11 | 16.56 | 11.82 |
| CDPruner | 5.38 | 7.75 | 5.28 | 7.76 | 4.99 | 5.93 | 4.88 | 5.94 | 14.21 | 11.70 | 7.38 |
| A-ToMe | 4.55 | 8.04 | 4.66 | 7.83 | 5.08 | 5.77 | 5.40 | 5.95 | 13.16 | 12.08 | 7.25 |
| FastAdaSP | 6.45 | 10.34 | 6.50 | 10.41 | 4.34 | 6.70 | 6.42 | 7.02 | 16.10 | 14.10 | 8.84 |
| SA-MAP | 3.41 | 5.53 | 3.43 | 5.55 | 3.76 | 4.83 | 3.53 | 4.52 | 12.07 | 9.60 | 5.62 |
Table 2. Performance comparison of different pruning methods on audio understanding tasks (Qwen2-Audio / Kimi-Audio), measured in accuracy (%).

| Method | MELD | MMAU test-mini music | MMAU test-mini sound | MMAU test-mini speech | Nonspeech7k | TUT2017 | VocalSound | Average |
|---|---|---|---|---|---|---|---|---|
| Qwen2-Audio | 51.19 | 60.48 | 72.07 | 60.36 | 86.48 | 32.35 | 88.00 | 64.42 |
| Retain 50% Tokens (50% Compression Ratio) | | | | | | | | |
| VisionZip | 51.73 | 60.48 | 72.07 | 61.56 | 85.38 | 31.73 | 86.33 | 64.18 |
| VisPruner | 50.88 | 61.68 | 73.87 | 62.16 | 84.69 | 30.74 | 86.22 | 64.32 |
| CDPruner | 51.61 | 61.38 | 71.77 | 61.26 | 86.62 | 31.73 | 87.50 | 64.55 |
| A-ToMe | 50.15 | 61.38 | 71.17 | 56.46 | 83.45 | 29.20 | 84.32 | 62.30 |
| FastAdaSP | 50.81 | 59.88 | 72.07 | 59.46 | 83.86 | 31.23 | 85.96 | 63.32 |
| SA-MAP | 51.96 | 61.38 | 72.07 | 63.06 | 85.10 | 32.53 | 87.64 | 64.82 |
| Kimi-Audio | 57.48 | 63.77 | 79.28 | 65.77 | 93.10 | 61.11 | 94.24 | 73.54 |
| Retain 50% Tokens (50% Compression Ratio) | | | | | | | | |
| VisionZip | 57.78 | 63.47 | 79.88 | 61.86 | 90.21 | 60.56 | 93.23 | 72.43 |
| VisPruner | 56.67 | 65.57 | 79.88 | 63.66 | 90.21 | 60.25 | 93.21 | 72.78 |
| CDPruner | 57.40 | 63.17 | 79.58 | 65.17 | 89.52 | 60.31 | 93.15 | 72.61 |
| A-ToMe | 57.25 | 64.67 | 79.28 | 63.66 | 92.14 | 60.68 | 93.40 | 73.01 |
| FastAdaSP | 56.79 | 63.47 | 79.88 | 66.37 | 90.48 | 60.37 | 93.34 | 72.96 |
| SA-MAP | 56.40 | 65.57 | 79.88 | 66.37 | 90.76 | 60.19 | 93.43 | 73.23 |

4.3. Trainable Compression Results

In Table 3, we evaluate SA-MAP’s trainable compression on LibriSpeech using Qwen2-Audio under multiple compression ratios. Results show that it consistently mitigates performance loss, especially at high compression ratios. This enables SA-MAP to maintain strong recognition accuracy, confirming the effectiveness of the trainable compression strategy.

4.4. Ablation Study

We conduct ablation studies on LibriSpeech to validate SA-MAP’s key components and configurations. Experiments are performed across Qwen2-Audio, Kimi-Audio, and GLM-ASR, with performance measured by WER (%, lower is better). Results are detailed in Figure 4, Figure 5 and Figure 6.
Impact of Attention-Weighted Merging. Figure 4 demonstrates that our attention-weighted merging (lighter bars) consistently outperforms standard average merging across all models. For instance, on Qwen2-Audio with a 60% retention ratio, the WER drops by 1.42% relative to average merging as used in A-ToMe. Similar trends are observed on Kimi-Audio and GLM-ASR. This confirms that weighting tokens by attention scores preserves critical acoustic information that is otherwise diluted by background noise in simple averaging.
Impact of Attention-Enhanced Pruning. We compare ADPruner against the baseline CDPruner in Figure 5. ADPruner exhibits superior robustness, particularly at high compression rates. Notably, on Kimi-Audio with 50% token retention, ADPruner maintains a low WER of 7.67%, achieving a 6.17% reduction compared to CDPruner. This indicates that attention-guided pruning successfully retains semantically rich tokens that magnitude-based metrics often discard.
Impact of GSA-Merging (Combined with ADPruner). Figure 6 illustrates that GSA-Merging achieves the best efficiency-accuracy trade-off. Unlike A-ToMe or FastAdaSP, which rely on adjacent token similarity, GSA captures correlation patterns among all tokens within a group. This global perspective mitigates locality bias, yielding more coherent clusters and a more complete feature representation. Notably, on Kimi-Audio with 50% token retention, SA-MAP achieves a 22.2% relative WER improvement over FastAdaSP, confirming its ability to better preserve key semantic information in audio data.
Impact of Threshold Selection. Taking Qwen2-Audio as an example, we select LibriSpeech dev_clean as the reference set, and Figure 3 presents the WER of SA-MAP under different thresholds. The figure shows that the choice of threshold affects the final accuracy. We determine the thresholds corresponding to various compression rates through a search on this sample set. For instance, a compression rate of 30% is achieved by setting the threshold to 0.72.

5. Conclusion

This paper introduces SA-MAP, an adaptive, hardware-friendly, plug-and-play framework for joint audio token merging and pruning in Speech LLMs, driven by the synergy of similarity and attention. The first stage merges adjacent tokens to maximize semantic fidelity, while the second executes pruning to preserve token diversity. By setting a similarity threshold and combining merging with pruning, our method effectively balances compression efficiency and information retention, remaining reliable even under high compression settings. Experiments on representative Speech LLMs demonstrate that our method achieves a 50% lossless token reduction in understanding tasks and a 40% compression ratio in ASR with only marginal degradation.

Author Contributions

Hong Liu, Rui Cen: methodology, software, validation, formal analysis, investigation, resources, data curation, writing—original draft preparation, writing—review and editing, visualization. Guanghua Yu: writing—review and editing, supervision, project administration. Jianchen Zhu: supervision, project administration. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A. Algorithm

Algorithm A1 Adaptive Merging-Pruning Fusion Method
  • Input: audio features $H_a$, attention scores $A$, audio index list $S$, similarity threshold $\lambda$, number of retained tokens $m$
  • Output: optimized subset $\hat{H}_a$
  • // First: attention-enhanced token merging
  • for $i, j \in S$ do
  •     $L_{i,j} = \dfrac{H_a^i \cdot H_a^j}{\|H_a^i\| \cdot \|H_a^j\|}$
  • end for
  • $S^* = \{\}$, $s = \{\}$, initialize the subset
  • for $i = 1, 2, \ldots, N$ do
  •     $W_i = \frac{1}{N} \sum_{j=0}^{N} \max_h A_{h,j,i}$
  •    if $s$ is empty then
  •      Add $i$ to $s$
  •    else
  •       $l_{i,|s|} = \frac{1}{|s|} \sum_{k \in s} L_{i,k}$
  •      if $l_{i,|s|} > \lambda$ then
  •         Add $i$ to $s$
  •      else
  •         Add $s$ to $S^*$, set $s = \{\}$
  •      end if
  •    end if
  • end for
  • for $s_i$ in $S^*$ do
  •     $\tilde{H}_a^i = \dfrac{\sum_{j \in s_i} W_j \cdot H_a^j}{\sum_{j \in s_i} W_j}$
  • end for
  • the index list is updated to $\tilde{S} = \{1, \ldots, K\}$
  • // Second: attention-enhanced DPP-based token pruning
  • initialize the output subset $\hat{H}_a$
  • $\hat{L} = \mathrm{diag}(\hat{A}) \cdot L \cdot \mathrm{diag}(\hat{A})$
  • $c_i = [\,]$, $d_i^2 = \hat{L}_{i,i}$
  • $j = \arg\max_{i \in \tilde{S}} \log(d_i^2)$, $\hat{S} = \{j\}$
  • while $|\hat{S}| < m$ do
  •    for $i \in \tilde{S} \setminus \hat{S}$ do
  •       $e_i = (\hat{L}_{j,i} - \langle c_j, c_i \rangle) / d_j$
  •       $c_i = [c_i, e_i]$, $d_i^2 = d_i^2 - e_i^2$
  •    end for
  •     $j = \arg\max_{i \in \tilde{S} \setminus \hat{S}} \log(d_i^2)$, $\hat{S} = \hat{S} \cup \{j\}$
  •    add $\tilde{H}_a^j$ to $\hat{H}_a$
  • end while
  • return $\hat{H}_a$
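
The two stages of Algorithm A1 can be sketched in NumPy as follows. This is an illustrative sketch under stated assumptions, not the released implementation. The function names (`token_importance`, `merge_adjacent`, `greedy_dpp`, `sa_map_sketch`) are hypothetical. A token that fails the threshold test is assumed to open the next group, and the trailing group is assumed to be flushed after the loop. The importance of a merged token, which the listing does not define, is assumed to be the mean of its members' weights. Selected indices are sorted to keep temporal order. The pruning stage uses the standard fast greedy MAP inference for DPPs, vectorized over candidates.

```python
import numpy as np


def token_importance(attn):
    """W_i: mean over queries j of the max over heads h of attn[h, j, i]."""
    return attn.max(axis=0).mean(axis=0)


def cosine_kernel(x, eps=1e-8):
    """Pairwise cosine similarity between the rows of x."""
    unit = x / np.maximum(np.linalg.norm(x, axis=1, keepdims=True), eps)
    return unit @ unit.T


def merge_adjacent(feats, weights, lam):
    """Stage 1: group consecutive tokens, then take attention-weighted means.

    Expects at least one token (feats has shape (N, D) with N >= 1).
    """
    sim = cosine_kernel(feats)
    groups, current = [], [0]
    for i in range(1, feats.shape[0]):
        # Mean similarity between token i and every token in the open group.
        if sim[i, current].mean() > lam:
            current.append(i)
        else:
            groups.append(current)
            current = [i]  # assumption: token i opens the next group
    groups.append(current)  # assumption: flush the trailing group
    merged = np.stack([
        (weights[g][:, None] * feats[g]).sum(axis=0) / weights[g].sum()
        for g in groups
    ])
    # Assumption: a merged token inherits the mean importance of its members.
    merged_w = np.array([weights[g].mean() for g in groups])
    return merged, merged_w, groups


def greedy_dpp(kernel, m, eps=1e-10):
    """Stage 2: greedy MAP inference for a DPP with a PSD kernel (m >= 1)."""
    k = kernel.shape[0]
    m = min(m, k)
    c = np.zeros((m, k))
    d2 = np.diag(kernel).astype(float)
    j = int(np.argmax(d2))
    selected = [j]
    while len(selected) < m:
        t = len(selected) - 1
        # Incremental Cholesky-style update of the marginal gains.
        e = (kernel[j] - c[:t].T @ c[:t, j]) / np.sqrt(d2[j])
        c[t] = e
        d2 = d2 - e ** 2
        d2[selected] = -np.inf  # never pick the same token twice
        j = int(np.argmax(d2))
        if d2[j] < eps:  # remaining tokens add no diversity
            break
        selected.append(j)
    return selected


def sa_map_sketch(feats, attn, lam, m):
    """Merge adjacent tokens, then keep at most m diverse, important tokens."""
    w = token_importance(attn)
    merged, merged_w, groups = merge_adjacent(feats, w, lam)
    kernel = merged_w[:, None] * cosine_kernel(merged) * merged_w[None, :]
    keep = sorted(greedy_dpp(kernel, m))  # assumption: keep temporal order
    return merged[keep], keep, groups


# Tokens 0 and 1 point the same way; token 1 has slightly lower quality.
dup = np.array([[1.0, 0.9, 0.0], [0.9, 0.81, 0.0], [0.0, 0.0, 0.64]])
print(greedy_dpp(dup, 2))  # [0, 2]: the near-duplicate token 1 is skipped
```

The small kernel at the end shows why the pruning stage is diversity-aware: token 1 has a larger diagonal entry than token 2, yet once token 0 is selected its marginal gain collapses to zero, so the greedy step prefers token 2.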

Appendix B. Full Experimental Results

This section provides the full experimental results to further validate the effectiveness and robustness of our proposed method. See Tables A1 to A5.
Threshold Settings: For ASR, the thresholds set for Qwen2-Audio at compression rates of 30%, 40%, and 50% are 0.72, 0.65, and 0.6, respectively. For Kimi-Audio at the same rates, the thresholds are 0.753, 0.93, and 0.94. For GLM-ASR-Nano, the thresholds at compression rates of 30% and 40% are 0.93 and 0.9, respectively. For speech understanding tasks, the thresholds set for Qwen2-Audio at compression rates of 40%, 50%, and 70% are 0.82, 0.80, and 0.85, respectively. For Kimi-Audio at the same rates, the thresholds are 0.74, 0.7, and 0.6.
Table A1. Performance comparison of different pruning methods on ASR dense tasks (Qwen2-Audio)
Method LibriSpeech/dev_clean LibriSpeech/dev_other LibriSpeech/test_clean LibriSpeech/test_other Fleurs/zh Fleurs/en AISHELL-1 AISHELL-2-ios WenetSpeech/test-meeting WenetSpeech/test-net Average
base 1.67 3.65 1.74 4.03 3.63 5.20 1.52 3.08 8.40 7.64 4.06
Retain 70% Tokens (30% Compression Ratio)
VisionZip 3.16 5.68 3.26 5.82 4.90 5.53 3.31 7.70 12.61 15.98 6.80
VisPruner 3.93 5.89 3.81 6.00 4.70 6.08 3.10 5.14 10.35 10.22 5.92
CDPruner 2.58 4.35 2.59 4.75 3.96 5.99 1.83 3.51 9.77 8.89 4.82
A-ToMe 2.54 4.86 2.59 5.00 4.93 6.09 2.26 3.86 10.35 10.29 5.28
FastAdaSP 2.44 4.54 2.54 4.83 4.00 5.62 1.94 3.43 9.53 9.10 4.80
SA-MAP 1.98 4.07 2.00 4.26 3.67 5.39 1.83 3.70 9.26 8.71 4.49
Retain 60% Tokens (40% Compression Ratio)
VisionZip 7.31 10.35 7.08 10.10 8.02 6.85 6.99 13.85 17.88 23.75 11.22
VisPruner 7.42 9.74 7.20 9.69 7.00 8.39 6.91 9.67 13.37 14.07 9.35
CDPruner 4.22 6.05 4.18 6.53 4.88 7.17 2.70 4.62 12.29 10.90 6.35
A-ToMe 4.12 6.81 4.20 6.98 8.00 8.13 4.18 5.56 14.05 14.15 7.62
FastAdaSP 4.91 7.26 4.95 7.51 5.47 7.31 3.28 4.69 11.51 12.30 6.92
SA-MAP 2.59 5.00 2.72 5.02 4.37 5.94 2.69 4.42 11.05 10.11 5.39
Retain 50% Tokens (50% Compression Ratio)
VisionZip 18.17 20.45 15.92 20.80 14.42 10.06 14.04 22.42 27.52 34.19 19.80
VisPruner 15.44 18.33 15.11 18.05 12.34 14.41 15.42 19.89 19.75 22.19 17.09
CDPruner 7.80 9.88 7.95 10.12 9.34 10.55 6.83 8.20 16.08 19.13 10.59
A-ToMe 7.79 10.96 7.98 11.19 21.31 13.81 11.05 12.10 21.40 22.10 13.97
FastAdaSP 11.78 14.77 11.96 14.82 10.51 12.82 8.24 9.39 17.21 20.85 13.24
SA-MAP 4.49 7.20 4.53 7.24 5.53 7.49 4.41 7.41 13.67 14.02 7.60
Table A2. Performance comparison of different pruning methods on ASR dense tasks (GLM-ASR)
Method LibriSpeech/dev_clean LibriSpeech/dev_other LibriSpeech/test_clean LibriSpeech/test_other Fleurs/zh Fleurs/en AISHELL-1 AISHELL-2-ios WenetSpeech/test-meeting WenetSpeech/test-net Average
base 2.14 4.05 2.18 4.53 3.44 4.11 2.47 3.48 8.43 6.65 4.15
Retain 70% Tokens (30% Compression Ratio)
VisionZip 9.18 12.75 8.27 13.27 8.17 8.08 13.90 20.55 19.36 20.45 13.40
VisPruner 9.22 12.33 8.34 12.13 6.95 8.33 11.68 14.57 18.11 16.56 11.82
CDPruner 5.38 7.75 5.28 7.76 4.99 5.93 4.88 5.94 14.21 11.70 7.38
A-ToMe 4.55 8.04 4.66 7.83 5.08 5.77 5.40 5.95 13.16 12.08 7.25
FastAdaSP 6.45 10.34 6.50 10.41 4.34 6.70 6.42 7.02 16.10 14.10 8.84
SA-MAP 3.41 5.53 3.43 5.55 3.76 4.83 3.53 4.52 12.07 9.60 5.62
Retain 60% Tokens (40% Compression Ratio)
VisionZip 19.18 24.16 17.91 24.27 15.66 16.22 27.40 36.09 30.56 32.89 24.43
VisPruner 18.02 22.68 17.01 22.06 12.48 15.27 22.82 28.81 27.46 26.78 21.34
CDPruner 9.55 12.53 9.88 11.90 6.68 9.38 8.40 9.78 21.10 16.53 11.57
A-ToMe 8.24 12.51 8.39 12.19 8.37 8.23 9.27 10.48 18.00 16.85 11.25
FastAdaSP 13.23 17.52 13.55 17.60 7.84 7.46 8.61 9.46 17.60 15.88 12.88
SA-MAP 6.17 8.48 6.24 8.08 4.81 6.54 5.84 7.52 16.29 13.82 8.38
Table A3. Performance comparison of different pruning methods on ASR dense tasks (Kimi-Audio)
Method LibriSpeech/dev_clean LibriSpeech/dev_other LibriSpeech/test_clean LibriSpeech/test_other Fleurs/zh Fleurs/en AISHELL-1 AISHELL-2-ios WenetSpeech/test-meeting WenetSpeech/test-net Average
base 1.23 2.39 1.38 2.45 2.87 4.92 0.61 2.57 6.33 5.39 3.01
Retain 70% Tokens (30% Compression Ratio)
VisionZip 2.83 4.22 2.77 4.17 4.46 6.09 2.71 5.70 9.92 12.59 5.55
VisPruner 2.54 3.68 2.52 3.83 3.87 6.17 1.67 4.36 9.40 11.50 4.95
CDPruner 2.72 3.87 2.60 3.90 4.38 6.42 1.81 4.75 10.14 11.48 5.21
A-ToMe 2.09 3.54 2.07 3.52 3.74 5.60 1.22 3.73 8.89 9.67 4.41
FastAdaSP 2.00 3.40 2.12 3.38 3.22 5.76 0.95 3.37 8.68 8.98 4.19
SA-MAP 1.48 2.64 1.50 2.73 3.36 5.11 1.03 3.28 8.66 9.97 3.98
Retain 60% Tokens (40% Compression Ratio)
VisionZip 6.35 7.94 5.93 7.71 7.63 8.91 6.84 10.01 14.65 19.51 9.55
VisPruner 5.36 6.90 5.07 6.75 5.73 7.82 4.36 7.65 13.70 18.56 8.19
CDPruner 5.70 6.96 5.44 6.78 6.73 8.66 5.07 9.21 14.95 18.41 8.79
A-ToMe 4.13 6.04 3.92 5.87 6.26 7.57 3.58 6.88 12.97 14.99 7.22
FastAdaSP 4.49 6.15 4.28 6.02 4.27 7.64 2.22 5.67 12.62 14.75 6.81
SA-MAP 2.32 3.67 2.49 3.75 4.37 6.26 2.05 3.99 11.47 12.35 5.27
Retain 50% Tokens (50% Compression Ratio)
VisionZip 13.75 15.53 12.44 15.00 13.70 13.84 14.00 17.35 22.64 29.64 16.79
VisPruner 11.90 14.30 11.47 13.41 10.38 13.20 10.88 14.92 22.03 29.29 15.18
CDPruner 12.79 14.70 12.14 14.31 11.75 14.27 13.47 18.90 23.23 29.62 16.52
A-ToMe 8.66 11.62 8.31 11.24 14.66 11.55 10.04 12.44 20.31 22.48 13.13
FastAdaSP 10.71 12.98 10.36 12.20 7.17 11.87 7.24 11.15 20.37 25.07 12.91
SA-MAP 5.19 6.93 5.13 6.72 6.81 8.50 5.06 6.52 17.12 20.36 8.83
Table A4. Performance comparison of different pruning methods on sparse tasks (Qwen2-Audio)
Audio Understanding
Method MELD mmau-test-mini/music mmau-test-mini/sound mmau-test-mini/speech Nonspeech7k TUT2017 VocalSound Average
base 51.19 60.48 72.07 60.36 86.48 32.35 88.00 64.42
Retain 60% Tokens (40% Compression Ratio)
VisionZip 51.38 59.88 71.77 61.86 86.48 32.28 87.33 64.43
VisPruner 52.30 60.18 72.97 62.16 86.21 31.05 86.69 64.51
CDPruner 50.88 61.08 73.57 60.96 87.17 32.65 87.94 64.89
A-ToMe 51.46 60.78 70.27 61.86 84.83 31.73 86.44 63.91
FastAdaSP 50.50 59.88 72.37 62.16 85.93 32.41 87.11 64.34
SA-MAP 51.69 61.08 73.27 62.46 87.31 32.84 87.97 65.23
Retain 50% Tokens (50% Compression Ratio)
VisionZip 51.73 60.48 72.07 61.56 85.38 31.73 86.33 64.18
VisPruner 50.88 61.68 73.87 62.16 84.69 30.74 86.22 64.32
CDPruner 51.61 61.38 71.77 61.26 86.62 31.73 87.50 64.55
A-ToMe 50.15 61.38 71.17 56.46 83.45 29.20 84.32 62.30
FastAdaSP 50.81 59.88 72.07 59.46 83.86 31.23 85.96 63.32
SA-MAP 51.96 61.38 72.07 63.06 85.10 32.53 87.64 64.82
Retain 30% Tokens (70% Compression Ratio)
VisionZip 51.53 60.78 72.97 61.26 77.66 29.20 78.08 61.64
VisPruner 50.08 60.78 72.67 61.56 70.90 28.77 84.91 61.38
CDPruner 50.84 58.68 72.07 60.96 81.38 31.36 85.55 62.98
A-ToMe 48.73 61.38 70.87 48.65 67.72 27.04 79.78 57.74
FastAdaSP 48.66 59.58 71.77 51.95 69.24 28.40 79.00 58.37
SA-MAP 50.42 60.78 71.77 61.86 79.17 32.04 87.25 63.33
Table A5. Performance comparison of different pruning methods on sparse tasks (Kimi-Audio)
Audio Understanding
Method MELD mmau-test-mini/music mmau-test-mini/sound mmau-test-mini/speech Nonspeech7k TUT2017 VocalSound Average
base 57.48 63.77 79.28 65.77 93.10 61.11 94.24 73.54
Retain 60% Tokens (40% Compression Ratio)
VisionZip 58.59 64.67 79.88 65.17 91.45 60.80 93.62 73.45
VisPruner 57.63 64.97 79.28 65.77 92.00 60.49 93.71 73.41
CDPruner 57.86 62.57 78.98 64.56 91.17 60.80 93.68 72.80
A-ToMe 57.82 64.97 78.38 68.17 92.28 60.06 93.90 73.65
FastAdaSP 57.63 64.67 80.78 64.86 91.17 60.43 93.71 73.32
SA-MAP 57.25 66.47 80.78 66.07 91.31 60.49 93.73 73.73
Retain 50% Tokens (50% Compression Ratio)
VisionZip 57.78 63.47 79.88 61.86 90.21 60.56 93.23 72.43
VisPruner 56.67 65.57 79.88 63.66 90.21 60.25 93.21 72.78
CDPruner 57.40 63.17 79.58 65.17 89.52 60.31 93.15 72.61
A-ToMe 57.25 64.67 79.28 63.66 92.14 60.68 93.40 73.01
FastAdaSP 56.79 63.47 79.88 66.37 90.48 60.37 93.34 72.96
SA-MAP 56.40 65.57 79.88 66.37 90.76 60.19 93.43 73.23
Retain 30% Tokens (70% Compression Ratio)
VisionZip 53.87 64.97 78.98 60.66 86.90 57.84 92.76 70.85
VisPruner 53.91 64.67 79.58 63.06 86.90 58.64 92.98 71.39
CDPruner 54.03 65.27 79.28 60.96 86.76 59.01 92.06 71.05
A-ToMe 54.37 65.57 78.68 62.76 89.79 57.22 90.39 71.25
FastAdaSP 54.41 62.87 77.18 61.86 87.45 58.15 91.56 70.50
SA-MAP 55.37 64.07 79.88 65.47 86.62 55.99 92.76 71.45

Appendix C. Trainable Compression Module

To further improve the accuracy of token compression, we add two types of lightweight trainable modules to our joint merging-pruning framework: merging weights and an importance scorer. Both are optimized through end-to-end training.
Two-stage training. To ensure better convergence, we adopt a two-stage training approach. For instance, given a target compression ratio of m, the first stage fixes the merging ratio at n and trains the merging weights. The second stage starts from the first-stage weights, sets the pruning ratio to m − n, and trains the importance scorer. The training framework is shown in Figure A1.
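As a small worked illustration of this split, the token budget of each stage can be computed as below. The sketch assumes, which the text does not state explicitly, that both the merging ratio n and the pruning ratio m − n are fractions of the original audio token count; `stage_token_counts` is a hypothetical helper name.

```python
def stage_token_counts(num_tokens: int, m: float, n: float):
    """Tokens left after merging (stage 1) and after pruning (stage 2).

    Assumes n and m - n are both fractions of the original token count.
    """
    if not 0.0 <= n <= m < 1.0:
        raise ValueError("expected 0 <= n <= m < 1")
    after_merge = round(num_tokens * (1.0 - n))
    after_prune = round(num_tokens * (1.0 - m))
    return after_merge, after_prune


# Overall compression m = 50%, of which merging accounts for n = 30%.
after_merge, after_prune = stage_token_counts(1000, m=0.5, n=0.3)
print(after_merge, after_prune)  # 700 500
```

Under this reading, merging removes 300 of 1000 tokens and pruning removes a further 200, so the importance scorer in the second stage only ever sees the 700 merged tokens.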
Figure A1. Overview of our Trainable Compression Framework.
Training Settings: In the trainable mode, we conducted experiments on Qwen2-Audio. The training data consisted of 960 hours of speech from LibriSpeech, and all stages were trained for one epoch. During end-to-end training, only the newly added linear layers were updated, while all other parameters were kept frozen. In the first stage, we trained the merging weights using a learning rate of 2e-5 for a 30% compression rate, and 5e-5 for compression rates of 40% and 50%. In the second stage, we trained the importance scorer with a learning rate of 5e-4. After training, inference still followed the merging–pruning strategy of SA-MAP.

References

  1. Bai, J.; Bai, S.; Chu, Y.; Cui, Z.; Dang, K.; Deng, X.; Fan, Y.; Ge, W.; Han, Y.; Huang, F.; et al. Qwen technical report. arXiv 2023, arXiv:2309.16609. [Google Scholar] [CrossRef]
  2. Touvron, H.; Martin, L.; Stone, K.; Albert, P.; Almahairi, A.; Babaei, Y.; Bashlykov, N.; Batra, S.; Bhargava, P.; Bhosale, S.; et al. Llama 2: Open foundation and fine-tuned chat models. arXiv 2023, arXiv:2307.09288. [Google Scholar] [CrossRef]
  3. Team, G.; Kamath, A.; Ferret, J.; Pathak, S.; Vieillard, N.; Merhej, R.; Perrin, S.; Matejovicova, T.; Ramé, A.; Rivière, M.; et al. Gemma 3 technical report. arXiv 2025, arXiv:2503.19786. [Google Scholar] [CrossRef]
  4. Chu, Y.; Xu, J.; Yang, Q.; Wei, H.; Wei, X.; Guo, Z.; Leng, Y.; Lv, Y.; He, J.; Lin, J.; et al. Qwen2-audio technical report. arXiv 2024, arXiv:2407.10759. [Google Scholar] [CrossRef]
  5. Zeng, A.; Du, Z.; Liu, M.; Wang, K.; Jiang, S.; Zhao, L.; Dong, Y.; Tang, J. Glm-4-voice: Towards intelligent and human-like end-to-end spoken chatbot. arXiv 2024, arXiv:2412.02612. [Google Scholar]
  6. Xu, K.T.; Xie, F.L.; Tang, X.; Hu, Y. Fireredasr: Open-source industrial-grade mandarin speech recognition models from encoder-decoder to llm integration. arXiv 2025, arXiv:2501.14350. [Google Scholar]
  7. Ding, D.; Ju, Z.; Leng, Y.; Liu, S.; Liu, T.; Shang, Z.; Shen, K.; Song, W.; Tan, X.; Tang, H.; et al. Kimi-audio technical report. arXiv 2025, arXiv:2504.18425. [Google Scholar] [CrossRef]
  8. Li, Y.; Wu, Y.; Li, J.; Liu, S. Accelerating transducers through adjacent token merging. arXiv 2023, arXiv:2306.16009. [Google Scholar] [CrossRef]
  9. Lu, Y.; Song, J.; Yang, C.H.H.; Watanabe, S. FastAdaSP: Multitask-Adapted Efficient Inference for Large Speech Language Model. arXiv 2024, arXiv:2410.03007. [Google Scholar]
  10. Lee, T.; Lee, H. Token Pruning in Audio Transformers: Optimizing Performance and Decoding Patch Importance. arXiv 2025, arXiv:2504.01690. [Google Scholar] [CrossRef]
  11. Dao, T.; Fu, D.; Ermon, S.; Rudra, A.; Ré, C. Flashattention: Fast and memory-efficient exact attention with io-awareness. Adv. Neural Inf. Process. Syst. 2022, 35, 16344–16359. [Google Scholar]
  12. Zhang, Q.; Cheng, A.; Lu, M.; Zhang, R.; Zhuo, Z.; Cao, J.; Guo, S.; She, Q.; Zhang, S. Beyond Text-Visual Attention: Exploiting Visual Cues for Effective Token Pruning in VLMs. arXiv 2025, arXiv:cs. [Google Scholar] [CrossRef]
  13. Yang, S.; Chen, Y.; Tian, Z.; Wang, C.; Li, J.; Yu, B.; Jia, J. VisionZip: Longer is Better but Not Necessary in Vision Language Models. arXiv 2024, arXiv:cs. [Google Scholar]
  14. An, K.; Chen, Y.; Deng, C.; Gao, C.; Gao, Z.; Gong, B.; Li, X.; Li, Y.; Lv, X.; Ji, Y.; et al. Fun-ASR Technical Report. arXiv 2025, arXiv:2509.12508. [Google Scholar] [CrossRef]
  15. Ranjan Behera, S.; Dhiman, A.; Gowda, K.; Narayani, A.S. FastAST: Accelerating Audio Spectrogram Transformer via Token Merging and Cross-Model Knowledge Distillation. arXiv E-Prints 2024, arXiv–2406. [Google Scholar]
  16. Lin, Y.; Fu, Y.; Zhang, J.; Liu, Y.; Zhang, J.; Sun, J.; Li, H.H.; Chen, Y. Speechprune: Context-aware token pruning for speech information retrieval. In Proceedings of the 2025 IEEE International Conference on Multimedia and Expo (ICME); IEEE, 2025; pp. 1–6. [Google Scholar]
  17. Rao, Y.; Zhao, W.; Liu, B.; Lu, J.; Zhou, J.; Hsieh, C.J. Dynamicvit: Efficient vision transformers with dynamic token sparsification. Adv. Neural Inf. Process. Syst. 2021, 34, 13937–13949. [Google Scholar]
  18. Bolya, D.; Fu, C.Y.; Dai, X.; Zhang, P.; Feichtenhofer, C.; Hoffman, J. Token merging: Your vit but faster. arXiv 2022, arXiv:2210.09461. [Google Scholar]
  19. Shao, K.; Tao, K.; Qin, C.; You, H.; Sui, Y.; Wang, H. HoliTom: Holistic Token Merging for Fast Video Large Language Models. arXiv 2025, arXiv:2505.21334. [Google Scholar] [CrossRef]
  20. Chen, L.; Zhao, H.; Liu, T.; Bai, S.; Lin, J.; Zhou, C.; Chang, B. An image is worth 1/2 tokens after layer 2: Plug-and-play inference acceleration for large vision-language models. In Proceedings of the European Conference on Computer Vision, 2024; Springer; pp. 19–35. [Google Scholar]
  21. Alvar, S.R.; Singh, G.; Akbari, M.; Zhang, Y. Divprune: Diversity-based visual token pruning for large multimodal models. In Proceedings of the Proceedings of the Computer Vision and Pattern Recognition Conference, 2025; pp. 9392–9401. [Google Scholar]
  22. Zhang, Q.; Liu, M.; Li, L.; Lu, M.; Zhang, Y.; Pan, J.; She, Q.; Zhang, S. Beyond Attention or Similarity: Maximizing Conditional Diversity for Token Pruning in MLLMs. arXiv 2025, arXiv:2506.10967. [Google Scholar] [CrossRef]
  23. Chen, L.; Zhao, H.; Liu, T.; Bai, S.; Lin, J.; Zhou, C.; Chang, B. An Image is Worth 1/2 Tokens After Layer 2: Plug-and-Play Inference Acceleration for Large Vision-Language Models. arXiv 2024, arXiv:cs. [Google Scholar]
  24. Zhang, Y.; Fan, C.K.; Ma, J.; Zheng, W.; Huang, T.; Cheng, K.; Gudovskiy, D.; Okuno, T.; Nakata, Y.; Keutzer, K.; et al. Sparsevlm: Visual token sparsification for efficient vision-language model inference. arXiv 2024, arXiv:2410.04417. [Google Scholar]
  25. Radford, A.; Kim, J.W.; Xu, T.; Brockman, G.; McLeavey, C.; Sutskever, I. Robust Speech Recognition via Large-Scale Weak Supervision, 2022. arXiv arXiv:eess.
  26. Macchi, O. The coincidence approach to stochastic point processes. Adv. Appl. Probab. 1975, 7, 83–122. [Google Scholar] [CrossRef]
  27. Chen, Z.; Wang, W.; Cao, Y.; Liu, Y.; Gao, Z.; Cui, E.; Zhu, J.; Ye, S.; Tian, H.; Liu, Z.; et al. Expanding performance boundaries of open-source multimodal models with model, data, and test-time scaling. arXiv 2024, arXiv:2412.05271. [Google Scholar]
  28. Zhu, J.; Zhu, Y.; Lu, X.; Yan, W.; Li, D.; Liu, K.; Fu, X.; Zha, Z.J. VisionSelector: End-to-End Learnable Visual Token Compression for Efficient Multimodal LLMs. arXiv 2025, arXiv:2510.16598. [Google Scholar]
  29. Panayotov, V.; Chen, G.; Povey, D.; Khudanpur, S. Librispeech: an asr corpus based on public domain audio books. In Proceedings of the 2015 IEEE international conference on acoustics, speech and signal processing (ICASSP); IEEE, 2015; pp. 5206–5210. [Google Scholar]
  30. Conneau, A.; Ma, M.; Khanuja, S.; Zhang, Y.; Axelrod, V.; Dalmia, S.; Riesa, J.; Rivera, C.; Bapna, A. Fleurs: Few-shot learning evaluation of universal representations of speech. In Proceedings of the 2022 IEEE Spoken Language Technology Workshop (SLT). IEEE, 2023; pp. 798–805. [Google Scholar]
  31. Bu, H.; Du, J.; Na, X.; Wu, B.; Zheng, H. Aishell-1: An open-source mandarin speech corpus and a speech recognition baseline. In Proceedings of the 2017 20th conference of the oriental chapter of the international coordinating committee on speech databases and speech I/O systems and assessment (O-COCOSDA); IEEE, 2017; pp. 1–5. [Google Scholar]
  32. Du, J.; Na, X.; Liu, X.; Bu, H. Aishell-2: Transforming mandarin asr research into industrial scale. arXiv 2018, arXiv:1808.10583. [Google Scholar] [CrossRef]
  33. Zhang, B.; Lv, H.; Guo, P.; Shao, Q.; Yang, C.; Xie, L.; Xu, X.; Bu, H.; Chen, X.; Zeng, C.; et al. Wenetspeech: A 10000+ hours multi-domain mandarin corpus for speech recognition. In Proceedings of the ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP); IEEE, 2022; pp. 6182–6186. [Google Scholar]
  34. Poria, S.; Hazarika, D.; Majumder, N.; Naik, G.; Cambria, E.; Mihalcea, R. Meld: A multimodal multi-party dataset for emotion recognition in conversations. In Proceedings of the Proceedings of the 57th annual meeting of the association for computational linguistics, 2019; pp. 527–536. [Google Scholar]
  35. Gong, Y.; Yu, J.; Glass, J. Vocalsound: A dataset for improving human vocal sounds recognition. In Proceedings of the ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP); IEEE, 2022; pp. 151–155. [Google Scholar]
  36. Mesaros, A.; Heittola, T.; Virtanen, T. TUT database for acoustic scene classification and sound event detection. In Proceedings of the 2016 24th European signal processing conference (EUSIPCO); IEEE, 2016; pp. 1128–1132. [Google Scholar]
  37. Rashid, M.M.; Li, G.; Du, C. Nonspeech7k dataset: Classification and analysis of human non-speech sound. IET Signal Process. 2023, 17, e12233. [Google Scholar] [CrossRef]
  38. Sakshi, S.; Tyagi, U.; Kumar, S.; Seth, A.; Selvakumar, R.; Nieto, O.; Duraiswami, R.; Ghosh, S.; Manocha, D. Mmau: A massive multi-task audio understanding and reasoning benchmark. arXiv 2024, arXiv:2410.19168. [Google Scholar] [CrossRef]
Figure 1. Task divergence between VLMs and Speech LLMs.
Figure 2. Architecture of the proposed SA-MAP with adaptive merging-pruning fusion. By introducing a similarity threshold, SA-MAP dynamically balances token merging and pruning, according to speech temporal structures. Similar tokens are first adaptively merged to maximize information retention, followed by pruning to reduce redundancy and improve diversity under high compression rates.
Figure 3. Impact of Threshold.
Figure 4. The ablation study of attention-weighted merging on LibriSpeech dataset.
Figure 5. WER of merging methods combined with ADPruner on LibriSpeech dataset.
Figure 6. WER of merging methods combined with ADPruner on LibriSpeech dataset.
Table 3. Performance of SA-MAP with and without tuning on the LibriSpeech dataset.
Method LibriSpeech/dev_clean LibriSpeech/dev_other LibriSpeech/test_clean LibriSpeech/test_other Average
Qwen2-Audio 1.67 3.65 1.74 4.03 2.77
Retain 70% Tokens (30% Compression Ratio)
SA-MAP 1.98 4.07 2.00 4.26 3.08
SA-MAP (tuning) 1.81 3.88 1.96 4.05 2.93
Retain 60% Tokens (40% Compression Ratio)
SA-MAP 2.59 5.00 2.72 5.02 3.83
SA-MAP (tuning) 2.16 4.24 2.27 4.58 3.31
Retain 50% Tokens (50% Compression Ratio)
SA-MAP 4.49 7.20 4.53 7.24 5.87
SA-MAP (tuning) 3.30 5.81 3.54 6.03 4.67
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
Copyright: This open access article is published under a Creative Commons CC BY 4.0 license, which permits free download, distribution, and reuse, provided that the author and preprint are cited in any reuse.

Preprints.org is a free preprint server supported by MDPI in Basel, Switzerland.
