SA-MAP: Similarity-Attention Boosted Token Compression via Merging-and-Pruning for Efficient Speech LLMs

Hong Liu; Cen Rui; Guanghua Yu; Jianchen Zhu

doi:10.20944/preprints202606.0741.v1

Submitted: 08 June 2026

Posted: 09 June 2026


Abstract

Speech Large Language Models (Speech LLMs) have exhibited exceptional efficacy across various tasks, including Automatic Speech Recognition (ASR) and general audio understanding. However, the quadratic complexity of self-attention mechanisms constrains long-form scalability and incurs prohibitive inference overhead. Consequently, token compression has become a key strategy for lightweight and efficient inference. We propose SA-MAP, a plug-and-play compression framework for Similarity-Attention driven joint token Merging And Pruning that exploits the inherent temporal dependencies of speech. Specifically, the framework operates in two sequential stages: the first stage merges adjacent tokens to maximize information retention, while the second stage prunes tokens to safeguard token diversity, thereby balancing compression intensity and information integrity. Extensive evaluations on mainstream Speech LLMs demonstrate that SA-MAP consistently achieves state-of-the-art (SOTA) results, outperforming established baselines. Notably, when applied to Qwen2-Audio and Kimi-Audio, SA-MAP achieves a 50% lossless token reduction in understanding tasks, and a 40% compression ratio in ASR with only marginal degradation in Word Error Rate (WER). The code is available at https://github.com/Tencent/AngelSlim/tree/SA-MAP.

Keywords: speech LLMs; token compression; token merging; token pruning; efficient inference

Subject: Computer Science and Mathematics - Artificial Intelligence and Machine Learning

1. Introduction

Benefiting from the advances in Large Language Models (LLMs) [1,2,3], Speech Large Language Models (Speech LLMs) [4,5,6,7] have achieved remarkable progress, demonstrating exceptional capabilities in both Automatic Speech Recognition (ASR) and complex audio understanding. A typical Speech LLM comprises three core components: (1) an audio encoder that converts raw waveforms into discrete or continuous representations; (2) a backbone LLM that generates responses autoregressively (which constitutes the primary computational bottleneck); and (3) an optional speech synthesizer for waveform reconstruction.

Most existing Speech LLMs remain predominantly constrained to short-duration audio (e.g., 30 s) due to prohibitive computational costs and latency. Given that the backbone LLM constitutes the primary inference bottleneck, the high density of audio tokens in the input sequence accounts for the vast majority of the inference overhead. While high-frame-rate acoustic embeddings afford fine-grained resolution, they introduce substantial redundancy, particularly during silent intervals or stationary noise. Furthermore, the quadratic complexity of self-attention mechanisms imposes a significant bottleneck on long-form audio scalability, escalating both inference latency and memory overhead. Consequently, efficient compression of audio tokens is imperative to enhance end-to-end throughput and facilitate long-context speech processing.

Current compression strategies generally fall into two categories: merging and pruning. In speech acceleration, merging-based methods (e.g., A-ToMe [8], FastAdaSP [9]) predominate due to the sequential nature of speech, progressively merging adjacent audio tokens during the prefill phase, whereas the pruning method [10] selects the top-k salient audio tokens based on attention weights. However, these compression modules are primarily integrated into the LLM transformer blocks, which often introduces a compatibility bottleneck: direct access to internal attention scores is frequently precluded by highly optimized kernels like FlashAttention [11], which prioritize throughput via fused operations.

Based on the core features leveraged, compression methods can be divided into similarity-based and attention-based paradigms. The former leverages inter-token correlations to retain a maximally diverse token set but risks omitting critical fine-grained details, whereas the latter exploits attention sparsity to extract salient tokens but often retains redundant duplicates. Hybrid approaches like VisPruner [12] and VisionZip [13] attempt to combine these metrics through multiple stages; however, as similarity and attention are applied in isolation, they fail to achieve a synergistic convergence of these complementary features.

Notably, as shown in Figure 1, an inherent task divergence between Vision Language Models (VLMs) and Speech LLMs exacerbates the incompatibility of direct method migration. VLMs rely on sparse visual cues for reasoning and context comprehension, whereas Speech LLMs execute multiple tasks within a single model: they address sequence-intensive tasks (e.g., ASR) that require leveraging all audio tokens to ensure transcription accuracy, while simultaneously handling sparsity-oriented tasks (e.g., Emotion Recognition (ER), Spoken Question Answering (SQA)) where only a few tokens carry key predictive information. This duality necessitates a more prudent token reduction strategy that accounts for temporal correlations and varying semantic densities. Existing VLM works focus primarily on local spatial redundancy and instruction relevance of visual information, failing to capture these speech-specific temporal dynamics; thus, direct migration of such techniques to Speech LLMs cannot yield optimal performance and gains.

To address these limitations, we introduce SA-MAP, a plug-and-play, similarity-attention synergistically driven framework for joint token merging and pruning. To maximize hardware efficiency and ensure compatibility with optimized inference kernels, we deploy the SA-MAP module before the LLM. Unlike methods requiring deep integration into the Transformer layers, our approach only utilizes attention scores from a single audio encoder layer, significantly reducing architectural coupling and memory overhead. Tailored to the temporal dependencies and redundancy of speech tokens, we propose a two-stage merging-pruning pipeline. In the first (merging) stage, we group adjacent tokens based on intra-group global feature similarity and perform attention-guided weighted merging to collapse redundant segments while preserving critical semantic information. In the second (pruning) stage, we implement a pruning kernel that integrates both attention and similarity information for diversity-driven pruning. By incorporating a similarity thresholding mechanism, SA-MAP adaptively calibrates the ratio between merging and pruning for each individual sample, achieving a dynamic equilibrium between compression aggressiveness and task-specific accuracy.

Extensive experiments validate that SA-MAP effectively reduces the computational overhead of various Speech LLMs, with evaluations performed on three representative models: Qwen2-Audio, Kimi-Audio, and GLM-ASR. For dense ASR tasks, at a 60% token retention ratio, SA-MAP only incurs a marginal absolute WER degradation of 1.33%–2.26%. Notably, compared to existing methods, SA-MAP facilitates a further 40.5%–68.7% reduction in the performance gap relative to the uncompressed models. Furthermore, for general audio understanding tasks, SA-MAP achieves a 50% lossless token reduction, effectively halving the sequence length with negligible impact on semantic accuracy.

In summary, the contributions of our work are as follows:

We introduce an adaptive, hardware-friendly, plug-and-play framework driven by the synergy of similarity and attention for joint token Merging and Pruning.
Similarity and attention are synergistically integrated: merging employs intra-group global similarity with attention-guided weighted aggregation, while pruning uses a similarity-attention integrated kernel for diversity-driven pruning.
Extensive validation on multiple public speech datasets and mainstream Speech LLMs confirms the method’s effectiveness and generalization across ASR and audio understanding tasks.

2. Related Work

Speech Large Language Models. Breakthroughs in Large Language Models (LLMs) [1,2,3] have driven research to bridge the gap between Natural Language Processing (NLP) and speech signal understanding, spawning Speech LLMs [4,5,6,14]. Diverging from conventional single-task systems like ASR or TTS, Speech LLMs integrate acoustic perception and language reasoning capabilities to support complex tasks such as spoken QA and cross-modal retrieval. A standard Speech LLM comprises an audio encoder, a modality adapter, and a pre-trained LLM backbone. To preserve fine-grained temporal dynamics, most models retain high resolution during tokenization, resulting in extreme token density [4,14]. However, this high temporal fidelity incurs heavy computational overhead. In long-form scenarios, token sequences expand sharply: roughly 750 tokens for 30 s of audio and over 15,000 tokens for 10 min [4]. Due to the quadratic complexity of self-attention, this raises inference latency and memory overhead and hinders edge deployment. Thus, developing cost-efficient techniques to address these challenges is critical.
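As a rough check on these figures, and assuming the token rate stays constant over the whole recording (an assumption; the text only gives the two endpoints), the implied token rate and the growth in the number of pairwise attention scores work out as follows:

```latex
\frac{750\ \text{tokens}}{30\ \text{s}} = 25\ \text{tokens/s},
\qquad
25\ \text{tokens/s} \times 600\ \text{s} = 15{,}000\ \text{tokens},
\qquad
\left(\frac{15{,}000}{750}\right)^{2} = 20^{2} = 400 .
```

Under this assumption, a 20-fold longer input corresponds to about a 400-fold increase in attention-score entries per layer, which is the scaling that motivates compressing audio tokens before they reach the LLM.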

Audio Token Reduction. Token compression in Speech LLMs is dominated by merging-based approaches, stemming from the inherent temporal dependencies of speech signals. Most existing frameworks embed compression modules directly within the Transformer layers of the LLM. For instance, FastAST [15] incorporates token merging and cross-modal distillation into the Spectrogram Transformer to optimize audio classification efficiency. Among other representative merging methods, A-ToMe [8] iteratively merges pairwise adjacent audio tokens across stacked layers to achieve adaptive compression, while FastAdaSP [9] leverages intermediate-layer attention scores to guide the merging weights of adjacent token clusters. Additionally, the pruning-based SpeechPrune [16] performs audio token pruning within Transformer layers using a two-stage strategy based on token-text similarity and attention. However, these methods exhibit several critical shortcomings. First, a sole reliance on pairwise adjacent similarity ignores the correlations between arbitrary tokens within a group, potentially compromising representation accuracy. Second, by embedding modules within Transformer layers and requiring layer-wise stacking for high compression ratios, these approaches introduce excessive hyperparameters and elevate deployment difficulty. Third, the dependency on intermediate-layer attention scores (as seen in FastAdaSP and SpeechPrune) renders these methods incompatible with optimized attention kernels like FlashAttention, leading to suboptimal hardware efficiency.

Visual Token Reduction. Compared to speech, visual inputs in multimodal large language models (MLLMs) exhibit even more pronounced redundancy [17,18]. Evidence suggests that over 50% of tokens in typical MLLM sequences exert negligible influence during inference [19,20]. Consequently, compression techniques are more mature in VLMs, where modules are typically decoupled from the backbone and deployed between the Vision Transformer (ViT) and the LLM. These methods leverage cross-modal information fusion to perform token pruning or merging based on attention or similarity. Specifically, similarity-based methods [18,21,22] typically calculate the cosine similarity between features, merging or pruning tokens whose similarity exceeds a threshold to mitigate redundancy, though they risk blurring critical fine-grained details. Attention-based methods [23,24] utilize intermediate attention weights to gauge importance and select the top-K salient tokens, which tends to retain duplicate tokens with high attention. Notably, hybrid strategies like VisionZip [13] and VisPruner [12] adopt a two-stage pruning/merging strategy that exploits importance and similarity in separate phases. Yet, a fundamental limitation persists: by applying similarity and attention in isolation, these methods fail to resolve potential conflicts between diversity and saliency, precluding a truly synergistic optimization of token information.

3. Method

This section details the design principles and core components of SA-MAP. Section 3.1 covers attention mechanisms and token compression in Speech LLMs as preliminaries. Sections 3.2 and 3.3 examine the two core modules: the attention-enhanced token merging module and the attention-similarity based token pruning module. Section 3.4 presents the adaptive merging-pruning fusion framework, followed by a trainable tuning scheme in Section 3.5.

3.1. Preliminaries

Architecture of Speech LLMs. Existing Speech LLMs typically consist of three core components: an audio encoder, a modality projector, and an LLM. Audio encoders, such as Whisper large-v3 [25] (adopted in Qwen2-Audio and Kimi-Audio), and audio tokenizers (utilized in Kimi-Audio) transform raw audio waveforms into continuous or discrete tokens that are interpretable and processable by the LLM. The projection module aligns these audio tokens with the word embedding space of the large language model. The LLM then fuses the aligned audio and text information to generate corresponding responses.

Audio Token Compression. Most audio token compression modules are embedded within the backbone of LLMs. A high compression ratio is achieved via multi-layer stacking, a design that entails a high level of hardware customization. Notably, some of these modules demand direct access to the attention scores of intermediate layers, rendering them generally incompatible with highly optimized acceleration libraries (e.g., FlashAttention). To maximize hardware efficiency, we draw inspiration from relevant studies in VLMs and deploy the proposed SA-MAP compression module upstream of the LLM, only utilizing attention scores from a single audio encoder layer.

Attention in Audio Encoder. Typical audio encoders, e.g., Whisper large-v3, are based on the Transformer architecture. We assess the importance of each audio token by analyzing the attention scores within the audio encoder. Specifically, we compute the attention score according to Eq. 1.

$$A_{h} = \mathrm{Softmax}\left(\frac{Q_{h} K_{h}^{\top}}{\sqrt{D_{h}}}\right), \tag{1}$$

where, within the context of multi-head attention, $A_{h}$ stands for the attention score of each individual head, $D_{h}$ signifies the head dimension, and $Q_{h}$ and $K_{h}$ refer to the query and key, respectively.
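A minimal NumPy sketch of Eq. 1 is given below, mainly to fix the tensor shapes used in the rest of this section. The function name `head_attention` and the input layout `(H, N, D_h)` are illustrative assumptions, not taken from the released code.

```python
import numpy as np

def head_attention(q, k):
    """Per-head attention scores of Eq. 1.

    q, k: arrays of shape (H, N, D_h) (assumed layout: heads, tokens, head dim).
    Returns an array of shape (H, N, N) whose rows each sum to 1.
    """
    logits = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])
    # Subtract the row-wise maximum for numerical stability; softmax is unchanged.
    logits = logits - logits.max(axis=-1, keepdims=True)
    weights = np.exp(logits)
    return weights / weights.sum(axis=-1, keepdims=True)

# Toy input: 2 heads, 4 tokens, head dimension 3.
rng = np.random.default_rng(0)
attn = head_attention(rng.normal(size=(2, 4, 3)), rng.normal(size=(2, 4, 3)))
```

The resulting `attn[i, j, k]` is the weight that query token `j` places on key token `k` in head `i`.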

3.2. Attention-Enhanced Token Merging

Merging similar audio tokens can eliminate redundant audio information while preserving critical content. To this end, we propose a novel token merging method, Intra-Group Global Similarity-based Attention-Weighted Token Merging (G-SAM), deployed upstream of the LLM and implemented in two sequential steps. First, audio token grouping leverages the similarity between audio features to cluster adjacent, highly similar audio tokens into groups for subsequent merging. Second, to maximize the information retained after merging, we incorporate attention information from the audio encoder as an importance metric and perform weighted merging on intra-group tokens.

Audio Token Grouping. Given the audio features $H_{a} \in \mathbb{R}^{N \times D}$ generated by the projection module preceding the LLM, we compute the pairwise cosine similarity across tokens to derive the similarity matrix $L$:

$$L_{i,j} = \frac{H_{a}^{i} \cdot H_{a}^{j}}{\lVert H_{a}^{i} \rVert \cdot \lVert H_{a}^{j} \rVert}. \tag{2}$$

Unlike the strategies adopted by A-ToMe and FastAdaSP, which only leverage the similarity between adjacent tokens for clustering, our method considers the similarity among any tokens within the group. This design can capture more comprehensive correlation patterns between tokens in the group, effectively avoiding the locality bias caused by relying solely on adjacent relationships, thereby enhancing the rationality of clustering results and the completeness of feature representation, and better preserving the key semantic information in audio data.

As illustrated in Figure 2, we fix the merging similarity threshold $\lambda$ and iterate through the audio token list $S = \{1, \dots, N\}$ to sequentially incorporate multiple adjacent indices into a merged cluster. Let $s_{i} = \{k, \dots, k + m - 1\}$ denote the set of audio tokens in the current merged cluster, where the cardinality of $s_{i}$ is $m$. We then compute the average similarity between the next adjacent token (with index $k + m$) and the tokens in $s_{i}$, which is formally defined as:

$$l_{k+m,\, s_{i}} = \frac{1}{m} \sum_{t=k}^{k+m-1} L_{k+m,\, t}. \tag{3}$$

If $l_{k+m,\, s_{i}} \geq \lambda$, we incorporate the token indexed by $k + m$ into $s_{i}$; otherwise, we initialize a new cluster $s_{i+1} = \{k + m\}$. This procedure is repeated until all audio tokens are assigned to clusters. Ultimately, we obtain $K$ merged clusters, formally denoted as $S^{*} = \{s_{i} \mid i = 1, \dots, K\}$.
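The grouping procedure of Eq. 2 and Eq. 3 can be sketched in a few lines of NumPy. This is an illustrative reading of the text rather than the released implementation: the function names are hypothetical, indices are 0-based (the text uses 1-based indices), and features are assumed to have non-zero norm.

```python
import numpy as np

def cosine_similarity_matrix(h):
    """Pairwise cosine similarity (Eq. 2) for features h of shape (N, D)."""
    unit = h / np.linalg.norm(h, axis=1, keepdims=True)
    return unit @ unit.T

def group_adjacent_tokens(h, lam):
    """Left-to-right grouping: a token joins the current cluster when its mean
    similarity to ALL tokens already in that cluster (Eq. 3) is at least lam."""
    sim = cosine_similarity_matrix(h)
    clusters = [[0]]
    for idx in range(1, h.shape[0]):
        current = clusters[-1]
        if sim[idx, current].mean() >= lam:
            current.append(idx)
        else:
            clusters.append([idx])
    return clusters

# Toy sequence: two tokens near (1, 0), two near (0, 1), then one at (1, 0) again.
h = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [1.0, 0.0]])
clusters = group_adjacent_tokens(h, lam=0.8)
print(clusters)  # [[0, 1], [2, 3], [4]]
```

Note that the last token is not merged back into the first cluster even though it matches it: only contiguous runs are grouped, which keeps the temporal order of the sequence intact.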

Weighted Token Merging. We extract attention scores from the penultimate layer of the audio encoder to calculate the importance of each audio token, as this layer retains the fine-grained semantic information of audio features. We observe that attention maps across heads exhibit significant discrepancies, with each head capturing distinct audio semantic dimensions. To maximize the discriminative importance information encoded by attention, we compute the maximum value across the head dimension and then derive the importance score $W_{k}$ of each audio token, with $W \in \mathbb{R}^{N}$:

$$W_{k} = \frac{1}{N} \sum_{j=1}^{N} \max_{i \in \{1, \dots, H\}} A_{i,j,k}, \tag{4}$$

where $N$ denotes the length of the audio token sequence, $H$ stands for the number of attention heads, and the attention scores are $A \in \mathbb{R}^{H \times N \times N}$.

Then, based on Eq. 5, we aggregate the audio features of each subset $s_{i}$ in the cluster set $S^{*}$ (derived from the aforementioned clustering process) into a single token feature, ultimately yielding $K$ compressed tokens denoted as $\tilde{H}_{a} = \{\tilde{H}_{a}^{i} \mid i = 1, \dots, K\}$, and the corresponding index set is updated to $\tilde{S} = \{1, \dots, K\}$, in which:

$$\tilde{H}_{a}^{i} = \frac{\sum_{j \in s_{i}} W_{j} \cdot H_{a}^{j}}{\sum_{j \in s_{i}} W_{j}}, \tag{5}$$

where $\tilde{H}_{a}^{i}$ denotes the merged feature corresponding to the cluster subset $s_{i}$.
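The importance score of Eq. 4 and the weighted merge of Eq. 5 can be illustrated as follows. This is a sketch under stated assumptions: the attention tensor is laid out as (heads, queries, keys), clusters are lists of 0-based indices, and the function names are hypothetical.

```python
import numpy as np

def token_importance(attn):
    """Eq. 4: maximum over heads, then mean over query positions.

    attn: (H, N, N) with attn[i, j, k] = weight of query j on key k in head i.
    Returns a vector of N importance scores, one per key token.
    """
    return attn.max(axis=0).mean(axis=0)

def merge_clusters(h, weights, clusters):
    """Eq. 5: importance-weighted average of the features in each cluster."""
    merged = []
    for idx in clusters:
        w = weights[idx]
        merged.append((w[:, None] * h[idx]).sum(axis=0) / w.sum())
    return np.stack(merged)

# Toy example: tokens 0 and 1 form one cluster, token 2 stays alone.
h = np.array([[1.0, 0.0], [3.0, 0.0], [0.0, 2.0]])
weights = np.array([1.0, 3.0, 5.0])
merged = merge_clusters(h, weights, [[0, 1], [2]])
print(merged)  # first row is (1*1 + 3*3) / (1 + 3) = 2.5 along the first axis
```

With equal weights the first merged token would sit at 2.0; the larger weight on token 1 pulls it to 2.5, which is how attention biases the merged feature toward the more important member of a cluster.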

3.3. Attention-Similarity Based Token Pruning

Furthermore, we design an attention-enhanced, similarity-driven pruning module, denoted as ADPruner, which integrates attention into the similarity matrix to optimize the Determinantal Point Process (DPP) kernel, maximizing the diversity of audio tokens while preserving their critical semantic information.

DPP [26] was initially introduced to model fermion repulsion in quantum physics and has since been widely applied in list-wise diversity modeling. Formally, it is a probability measure defined on the power set of a discrete set $S$, characterized by a positive semi-definite (PSD) kernel matrix $L \in \mathbb{R}^{N \times N}$ indexed by elements of $S$. The probability of sampling a subset $\hat{S} \subseteq S$ is:

$$P(\hat{S}) = \frac{\det(L_{\hat{S}})}{\det(L + I)} \propto \det(L_{\hat{S}}), \tag{6}$$

in which $L_{\hat{S}}$ represents the principal submatrix of $L$ corresponding to $\hat{S}$.

Via the DPP sampling procedure, the optimal subset $\hat{H}_{a}$ is determined as:

$$\hat{S}^{*} = \underset{\hat{S} \subseteq S,\ |\hat{S}| = m}{\arg\max}\ \det(L_{\hat{S}}), \qquad \hat{H}_{a} = \{H_{a}^{i} \mid i \in \hat{S}^{*}\}. \tag{7}$$
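To make the objective in Eq. 7 concrete, the toy example below enumerates every size-m subset and keeps the one with the largest determinant. Exhaustive search is only feasible for a handful of tokens and is used here purely for illustration; the helper name `best_subset_by_det` is hypothetical.

```python
import itertools
import numpy as np

def best_subset_by_det(kernel, m):
    """Brute-force solution of Eq. 7 for a small kernel matrix."""
    best, best_det = None, -np.inf
    for subset in itertools.combinations(range(kernel.shape[0]), m):
        det = np.linalg.det(kernel[np.ix_(subset, subset)])
        if det > best_det:
            best, best_det = subset, det
    return best, best_det

# Tokens 0 and 1 are near-duplicates; token 2 is orthogonal to token 0.
h = np.array([[1.0, 0.0], [1.0, 0.1], [0.0, 1.0]])
unit = h / np.linalg.norm(h, axis=1, keepdims=True)
kernel = unit @ unit.T          # cosine-similarity kernel as in Eq. 2
subset, det = best_subset_by_det(kernel, m=2)
print(subset)  # (0, 2)
```

The near-duplicate pair (0, 1) yields a determinant close to zero, so the search keeps tokens 0 and 2: a larger determinant corresponds to a more mutually dissimilar subset.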

In the context of VLM token pruning, CDPruner [22] leverages DPP to model the diversity of the retained subset of visual tokens. Given a set of visual tokens $H_{v}$, the kernel matrix $L$ is defined based on the pairwise cosine similarity of visual features. Similarly, when applied to audio token pruning, the default kernel matrix $L$ is constructed using the pairwise cosine similarity of audio features (see Eq. 2). If the construction of the DPP kernel matrix relies solely on feature similarity (a single dimension), it tends to overlook fine-grained critical information when preserving diversity. To address this issue, we propose an attention-enhanced DPP kernel matrix, enabling the simultaneous consideration of both the feature similarity and the importance of audio tokens during pruning. We similarly extract attention scores from the penultimate layer of the audio encoder to compute the importance matrix $\hat{A} \in \mathbb{R}^{N \times N}$ for audio tokens, $\hat{A}_{j,k} = \frac{1}{H} \sum_{i=1}^{H} A_{i,j,k}$.

Finally, we integrate feature similarity and token importance for audio token pruning, yielding our proposed attention-enhanced DPP-based pruning method (ADPruner), as illustrated in Figure 2. Specifically, we weight the original kernel matrix by the derived importance scores to construct a novel conditional kernel matrix:

$$\hat{L} = \mathrm{diag}(\hat{A}) \cdot L \cdot \mathrm{diag}(\hat{A}). \tag{8}$$

The updated log-probability of the subset $\hat{S}$ for DPP is:

$$\log\det(\hat{L}_{\hat{S}}) = \sum_{i \in \hat{S}} \log(\hat{A}_{i}^{2}) + \log\det(L_{\hat{S}}), \tag{9}$$

which jointly considers both feature similarity and importance of the retained audio tokens. We then obtain the optimal subset via MAP inference [27].
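The sketch below builds the kernel of Eq. 8 and selects tokens with a naive greedy search that recomputes the log-determinant at every step. It is not the fast MAP inference algorithm of [27], only a simple stand-in with the same objective. It also assumes that a per-token importance vector is already available; how that vector is read off the importance matrix is not spelled out here, so `importance` is a hypothetical input.

```python
import numpy as np

def attention_enhanced_kernel(sim, importance):
    """Eq. 8: scale row i and column i of the similarity kernel by importance[i]."""
    return importance[:, None] * sim * importance[None, :]

def greedy_map(kernel, m):
    """Naive greedy selection: repeatedly add the token that maximises log det."""
    selected, remaining = [], list(range(kernel.shape[0]))
    for _ in range(m):
        def gain(candidate):
            idx = selected + [candidate]
            sign, logdet = np.linalg.slogdet(kernel[np.ix_(idx, idx)])
            return logdet if sign > 0 else -np.inf
        best = max(remaining, key=gain)
        selected.append(best)
        remaining.remove(best)
    return sorted(selected)

# Tokens 0 and 1 are near-duplicates, but token 0 has low importance.
h = np.array([[1.0, 0.0], [1.0, 0.1], [0.0, 1.0]])
unit = h / np.linalg.norm(h, axis=1, keepdims=True)
sim = unit @ unit.T
importance = np.array([0.1, 1.0, 0.9])
kernel_hat = attention_enhanced_kernel(sim, importance)
kept = greedy_map(kernel_hat, m=2)
print(kept)  # [1, 2]

# Numerical check of Eq. 9 on the selected subset.
lhs = np.linalg.slogdet(kernel_hat[np.ix_(kept, kept)])[1]
rhs = np.log(importance[kept] ** 2).sum() + np.linalg.slogdet(sim[np.ix_(kept, kept)])[1]
```

In this toy case the importance term decides which of the two near-duplicates survives (token 1 rather than token 0), while the determinant term still prevents both from being kept together.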

3.4. Adaptive Merging-Pruning Fusion Framework

Merging combines highly similar tokens, reducing redundancy while preserving key information. However, under high compression rates, using merging alone may force the fusion of dissimilar tokens, which can distort speech representations. In contrast, pruning keeps only the most important tokens and removes the others, which may lead to the loss of useful information. To address these issues, SA-MAP integrates improved merging and pruning modules, as shown in Figure 2. We introduce a threshold that adaptively adjusts the ratio of merging to pruning for each sample, thereby achieving high compression rates while robustly maintaining model performance. Changing the threshold leads to different accuracy, as illustrated in Figure 3. In practice, the threshold is selected through a manual search on a small dataset, which keeps the tuning cost low. As described in Appendix 1, the proposed compression process consists of two stages. In the first stage, adjacent tokens are selectively merged based on the threshold to preserve as much information as possible. In the second stage, pruning is applied to the merged tokens to improve representation diversity and reduce redundancy. By jointly optimizing merging and pruning, our method achieves an effective balance between compression efficiency and information preservation.
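One way to picture the per-sample merge-to-prune ratio is the bookkeeping below. It rests on an assumption that the text does not state explicitly: the final token count is fixed by a retention ratio, merging removes whatever the threshold allows, and pruning removes the remainder. The function `split_budget` and the sample sizes are hypothetical.

```python
def split_budget(n_tokens, n_clusters, retention):
    """Return (target, removed_by_merging, removed_by_pruning) for one sample.

    ASSUMPTION: the final length is round(retention * n_tokens), and pruning
    only runs when merging alone has not already reached that target.
    """
    target = max(1, round(retention * n_tokens))
    removed_by_merging = n_tokens - n_clusters
    removed_by_pruning = max(0, n_clusters - target)
    return target, removed_by_merging, removed_by_pruning

# Two hypothetical 750-token samples at 60% retention with the same threshold.
redundant_sample = split_budget(750, 500, 0.6)  # many similar neighbours merge
varied_sample = split_budget(750, 700, 0.6)     # few neighbours pass the threshold
print(redundant_sample)  # (450, 250, 50)
print(varied_sample)     # (450, 50, 250)
```

Under this reading, a redundant recording is compressed mostly by merging and a varied one mostly by pruning, even though both use the same threshold and end at the same length.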

3.5. Optional: Trainable Compression Module

We introduce a trainable compression module to enhance token pruning within the merging-pruning joint framework, which integrates two lightweight, end-to-end optimized components: a learnable merging weight generator and a trainable importance scorer. Each consists of two linear projection layers, similar to the key and query transformations in standard self-attention. The former adaptively determines token fusion weights based on input-dependent attention patterns, while the latter estimates token importance through simplified self-attention interactions for refining the DPP kernel. The details are shown in Appendix C.

Tuning merging weights. As shown in Eq. 4 and Eq. 5, we perform weighted fusion of similar tokens in the merger module. In our joint framework, the attention scores from the penultimate layer of the audio encoder are used for this computation. Furthermore, the weighting coefficients in Eq. 5 can be derived from a trainable weighting module. Specifically, we design a lightweight module that generates these coefficients dynamically in a manner analogous to computing attention scores. This module consists of two linear projection layers, similar to the key and query transformations in standard self-attention. We first project the audio features using two separate learnable weight matrices, $\Theta_{q}$ and $\Theta_{k}$:

$$Q = H_{a} \Theta_{q}, \qquad K = H_{a} \Theta_{k}, \tag{10}$$

where $\Theta_{q}, \Theta_{k} \in \mathbb{R}^{D \times D}$. The weighting coefficient $W_{i}$ for each token $i$ is then computed as the scaled dot-product attention scores over the input elements:

$$A = \mathrm{Sigmoid}\left(\frac{Q K^{\top}}{\sqrt{d_{k}}}\right), \qquad W_{i} = \frac{1}{N} \sum_{j=0}^{N-1} A_{ij}. \tag{11}$$
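A forward pass of Eq. 10 and Eq. 11 can be sketched in NumPy as below. Training is omitted, the projection matrices are random placeholders rather than learned values, and the key dimension is taken to be the width of the key projection; all names are illustrative.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def learned_merge_weights(h, theta_q, theta_k):
    """Eq. 10-11: sigmoid of scaled query-key scores, averaged over each row."""
    q = h @ theta_q
    k = h @ theta_k
    scores = sigmoid(q @ k.T / np.sqrt(k.shape[1]))
    return scores.mean(axis=1)

# Placeholder parameters: 4 tokens with feature dimension 8.
rng = np.random.default_rng(0)
h = rng.normal(size=(4, 8))
theta_q = rng.normal(size=(8, 8)) * 0.1
theta_k = rng.normal(size=(8, 8)) * 0.1
weights = learned_merge_weights(h, theta_q, theta_k)
```

Because of the sigmoid, every weight lies strictly between 0 and 1, so the denominator of Eq. 5 stays positive for any non-empty cluster.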

Subsequently, these coefficients are used in Eq. 5 to perform a weighted merging of components, allowing the model to adaptively determine the contribution of each part based on the input context.

Tuning importance scorer. We introduce a lightweight, learnable importance scorer to calculate the importance scores of tokens in the DPP kernel. The design of this scorer draws inspiration from the Token Selector module proposed in VisionSelector [28]. Specifically, the input representation is projected into queries and keys via two separate linear layers, as expressed in Eq. 10, after which a simplified self-attention matrix is computed to capture pairwise token interactions. The importance score for each token is then derived by averaging its interaction scores with all other tokens, as follows:

$$P = \frac{Q K^{\top}}{\sqrt{d_{k}}}, \qquad A_{i} = \frac{1}{N} \sum_{j=0}^{N-1} P_{ij}. \tag{12}$$

Finally, the resulting importance score vector is incorporated into the similarity matrix to construct a refined DPP kernel, as described by Eq. 8, which guides the subsequent pruning of tokens.
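The scorer of Eq. 12 and its use in the kernel of Eq. 8 can be sketched as follows, again with random placeholder projections and hypothetical function names. The base kernel is assumed to be the cosine similarity of Eq. 2.

```python
import numpy as np

def importance_scores(h, theta_q, theta_k):
    """Eq. 12: row means of the unnormalised Q K^T / sqrt(d_k) matrix."""
    q = h @ theta_q
    k = h @ theta_k
    p = q @ k.T / np.sqrt(k.shape[1])
    return p.mean(axis=1)

def refined_kernel(h, scores):
    """Eq. 8 with the cosine-similarity matrix of Eq. 2 as the base kernel."""
    unit = h / np.linalg.norm(h, axis=1, keepdims=True)
    sim = unit @ unit.T
    return scores[:, None] * sim * scores[None, :]

rng = np.random.default_rng(0)
h = rng.normal(size=(5, 8))
theta_q = rng.normal(size=(8, 8)) * 0.1
theta_k = rng.normal(size=(8, 8)) * 0.1
scores = importance_scores(h, theta_q, theta_k)
kernel = refined_kernel(h, scores)
```

Since Eq. 12 has no squashing function, individual scores may be negative. The kernel remains positive semi-definite regardless, and by Eq. 9 the subset determinants depend only on the squared scores.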

4. Experiments

To validate the effectiveness of our method, we conduct a series of experiments. This section details our experimental setup and the evaluation benchmarks used.

4.1. Experiment Setting

We choose open speech datasets for evaluation, covering ASR and speech understanding tasks. For speech recognition tasks, we select LibriSpeech [29], Fleurs [30], AISHELL1 [31], AISHELL2-ios [32] and WenetSpeech [33]. For speech understanding tasks, we choose MELD [34], VocalSound [35], TUT2017 [36], Nonspeech7k [37] and MMAU [38]. We conduct experiments on Qwen2-Audio, Kimi-Audio, and GLM-ASR. We select A-ToMe and FastAdaSP for comparison. In addition, we migrate methods from the visual domain to audio models, including the pure pruning approaches VisPruner and CDPruner, as well as VisionZip, which combines both pruning and merging. Our experiments are conducted using the Kimi-Audio-Evalkit toolbox [7]. For ASR and speech understanding tasks, we employ WER and accuracy metrics for evaluation, respectively.

Model Configuration: We consistently integrate SA-MAP after the encoder and the projection module of the model. For Kimi-Audio, which utilizes both discrete and continuous features, we select the continuous features for calculating the similarity matrix. For a fair comparison, we relocate A-ToMe and FastAdaSP, which originally perform merging within the LLM blocks, so that they operate after the projector instead. Unlike the toolbox, we do not employ GPT-4o-mini as the judge model for the sparse tasks. Instead, we use a string matching approach to ensure a fair comparison.

4.2. Main Results

In this section, we compare our method with other methods on both dense and sparse tasks. For the full experimental results, please refer to Appendix B.

Performance on Dense ASR Tasks. As shown in Table 1, on dense ASR benchmarks, preserving fine-grained acoustic details is critical. SA-MAP's two-stage merging-pruning approach demonstrates clear superiority over methods that rely solely on a single metric or local dependencies. For instance, at a 60% retention rate on Qwen2-Audio, SA-MAP reduces the absolute average WER by 1.53% and 0.96% compared to the best-performing pure merging method (FastAdaSP) and the strongest pure pruning method (CDPruner), respectively, narrowing the performance gap between these methods and the uncompressed baseline by 53.4% and 41.9%. Furthermore, when compared with the hybrid approach VisionZip, SA-MAP yields a substantially higher reduction of 5.83% in absolute average WER, effectively closing the performance gap by 81.4%.
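The gap-closure percentages above can be reproduced from the Qwen2-Audio averages in Table 1. The formula below is an assumed reading of "narrowing the performance gap" (the share of a competitor's excess WER over the uncompressed model that SA-MAP removes); the function name is illustrative.

```python
def gap_closed(baseline_wer, other_wer, ours_wer):
    """Share of the competitor's gap to the uncompressed model removed by ours."""
    return (other_wer - ours_wer) / (other_wer - baseline_wer)

# Average WER (%) on Qwen2-Audio at 60% retention, taken from Table 1.
BASELINE, SA_MAP = 4.06, 5.39
competitors = {"FastAdaSP": 6.92, "CDPruner": 6.35, "VisionZip": 11.22}

closure = {name: gap_closed(BASELINE, wer, SA_MAP) for name, wer in competitors.items()}
for name, value in closure.items():
    print(f"{name}: {100 * value:.1f}% of the gap closed")
```

Using the rounded table averages, this gives roughly 53.5%, 41.9% and 81.4%, in line with the figures quoted in the text; the small difference for FastAdaSP is presumably due to rounding of the averages.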

Across different compression ratios, our approach consistently outperforms existing methods, particularly under high compression settings. For example, on Qwen2-Audio with retention rates ranging from 70% to 50%, SA-MAP further narrows the performance gap with the unpruned model by 41.84%–45.76% compared with the best-performing existing pruning method. Similar trends are observed across different backbones. On Kimi-Audio, SA-MAP achieves an additional reduction of 17.92%–41.20% at retention rates between 70% and 50%. Likewise, on GLM-ASR-Nano with retention rates from 70% to 60%, SA-MAP further reduces the gap by 40.45%–52.48%. These results indicate that SA-MAP maintains stable and superior performance across a wide range of compression levels and backbones. Details are reported in Table A1–Table A3.

Performance on Sparse Understanding Tasks. For sparse understanding tasks, SA-MAP emphasizes adaptive compression to balance efficiency and semantic preservation, as shown in Table 2. On Qwen2-Audio at 50% compression, it achieves an average accuracy of 64.82%, an absolute improvement of 0.40% over the unpruned baseline. Compared with VisionZip and FastAdaSP, SA-MAP further delivers absolute gains of 0.64% and 1.50%, respectively, demonstrating its robustness under aggressive compression. On Kimi-Audio, SA-MAP achieves the highest overall average accuracy among the compressed models (73.23%), outperforming all compared compression methods and achieving top performance on 4 out of 7 sub-tasks. Notably, on the challenging MMAU benchmark, it surpasses the unpruned baseline by 0.90% on music and by 0.6% on sound, demonstrating that our joint attention–similarity modeling effectively removes acoustic redundancy while preserving critical semantic information. Across other subsets, SA-MAP maintains stable recognition performance, highlighting its robustness in complex acoustic environments.

Table 1. WER (%) comparison of different token compression methods on dense ASR tasks (Qwen2-Audio / Kimi-Audio / GLM-ASR)

Method	LibriSpeech dev_clean	LibriSpeech dev_other	LibriSpeech test_clean	LibriSpeech test_other	Fleurs zh	Fleurs en	AISHELL-1	AISHELL-2	WenetSpeech test-meeting	WenetSpeech test-net	Average
Qwen2-Audio	1.67	3.65	1.74	4.03	3.63	5.20	1.52	3.08	8.40	7.64	4.06
Retain 60% Tokens (40% Compression Ratio)
VisionZip	7.31	10.35	7.08	10.10	8.02	6.85	6.99	13.85	17.88	23.75	11.22
VisPruner	7.42	9.74	7.20	9.69	7.00	8.39	6.91	9.67	13.37	14.07	9.35
CDPruner	4.22	6.05	4.18	6.53	4.88	7.17	2.70	4.62	12.29	10.90	6.35
A-ToMe	4.12	6.81	4.20	6.98	8.00	8.13	4.18	5.56	14.05	14.15	7.62
FastAdaSP	4.91	7.26	4.95	7.51	5.47	7.31	3.28	4.69	11.51	12.30	6.92
SA-MAP	2.59	5.00	2.72	5.02	4.37	5.94	2.69	4.42	11.05	10.11	5.39
Kimi-Audio	1.23	2.39	1.38	2.45	2.87	4.92	0.61	2.57	6.33	5.39	3.01
Retain 60% Tokens (40% Compression Ratio)
VisionZip	6.35	7.94	5.93	7.71	7.63	8.91	6.84	10.01	14.65	19.51	9.55
VisPruner	5.36	6.90	5.07	6.75	5.73	7.82	4.36	7.65	13.70	18.56	8.19
CDPruner	5.70	6.96	5.44	6.78	6.73	8.66	5.07	9.21	14.95	18.41	8.79
A-ToMe	4.13	6.04	3.92	5.87	6.26	7.57	3.58	6.88	12.97	14.99	7.22
FastAdaSP	4.49	6.15	4.28	6.02	4.27	7.64	2.22	5.67	12.62	14.75	6.81
SA-MAP	2.32	3.67	2.49	3.75	4.37	6.26	2.05	3.99	11.47	12.35	5.27
GLM-ASR-Nano	2.14	4.05	2.18	4.53	3.44	4.11	2.47	3.48	8.43	6.65	4.15
Retain 70% Tokens (30% Compression Ratio)
VisionZip	9.18	12.75	8.27	13.27	8.17	8.08	13.90	20.55	19.36	20.45	13.40
VisPruner	9.22	12.33	8.34	12.13	6.95	8.33	11.68	14.57	18.11	16.56	11.82
CDPruner	5.38	7.75	5.28	7.76	4.99	5.93	4.88	5.94	14.21	11.70	7.38
A-ToMe	4.55	8.04	4.66	7.83	5.08	5.77	5.40	5.95	13.16	12.08	7.25
FastAdaSP	6.45	10.34	6.50	10.41	4.34	6.70	6.42	7.02	16.10	14.10	8.84
SA-MAP	3.41	5.53	3.43	5.55	3.76	4.83	3.53	4.52	12.07	9.60	5.62

Table 2. Performance comparison of different pruning methods on Audio Understanding tasks (Qwen2-Audio / Kimi-Audio)

Method	MELD	mmau-test-mini			Nonspeech7k	TUT2017	VocalSound	Average
Method	MELD	music	sound	speech	Nonspeech7k	TUT2017	VocalSound	Average
Qwen2-Audio	51.19	60.48	72.07	60.36	86.48	32.35	88.00	64.42
Retain 50% Tokens (50% Compression Ratio)
VisionZip	51.73	60.48	72.07	61.56	85.38	31.73	86.33	64.18
VisPruner	50.88	61.68	73.87	62.16	84.69	30.74	86.22	64.32
CDPruner	51.61	61.38	71.77	61.26	86.62	31.73	87.50	64.55
A-ToMe	50.15	61.38	71.17	56.46	83.45	29.20	84.32	62.30
FastAdaSP	50.81	59.88	72.07	59.46	83.86	31.23	85.96	63.32
SA-MAP	51.96	61.38	72.07	63.06	85.10	32.53	87.64	64.82
Kimi-Audio	57.48	63.77	79.28	65.77	93.10	61.11	94.24	73.54
Retain 50% Tokens (50% Compression Ratio)
VisionZip	57.78	63.47	79.88	61.86	90.21	60.56	93.23	72.43
VisPruner	56.67	65.57	79.88	63.66	90.21	60.25	93.21	72.78
CDPruner	57.40	63.17	79.58	65.17	89.52	60.31	93.15	72.61
A-ToMe	57.25	64.67	79.28	63.66	92.14	60.68	93.40	73.01
FastAdaSP	56.79	63.47	79.88	66.37	90.48	60.37	93.34	72.96
SA-MAP	56.40	65.57	79.88	66.37	90.76	60.19	93.43	73.23

4.3. Trainable Compression Results

In Table 3, we evaluate SA-MAP’s trainable compression on LibriSpeech using Qwen2-Audio under multiple compression ratios. Tuning consistently reduces the performance loss, and the benefit grows with the compression ratio: the average WER falls from 3.08% to 2.93% at 30% compression, from 3.83% to 3.31% at 40%, and from 5.87% to 4.67% at 50%. These results confirm the effectiveness of the trainable compression strategy.

4.4. Ablation Study

We conduct ablation studies on LibriSpeech to validate SA-MAP’s key components and configurations. Experiments are performed on Qwen2-Audio, Kimi-Audio, and GLM-ASR, and performance is reported as WER (%, lower is better). Results are detailed in Figure 4, Figure 5 and Figure 6.

Impact of Attention-Weighted Merging. Figure 4 demonstrates that our attention-weighted merging (lighter bars) consistently outperforms standard average merging across all models. For instance, on Qwen2-Audio with a 60% retention ratio, the WER drops by 1.42% relative to average merging of the kind used in A-ToMe. Similar trends are observed on Kimi-Audio and GLM-ASR. This indicates that weighting tokens by attention scores preserves critical acoustic information that is otherwise diluted by background noise in simple averaging.
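
The difference between the two merging rules can be illustrated with a minimal sketch. The function names, the fallback for zero weights, and the toy vectors are illustrative assumptions rather than the released implementation; the weighted rule is the normalized weighted mean that appears in Algorithm A1.

```python
from typing import List, Sequence


def average_merge(tokens: Sequence[Sequence[float]]) -> List[float]:
    """Plain mean of the token vectors in one merge group."""
    n = len(tokens)
    return [sum(column) / n for column in zip(*tokens)]


def attention_weighted_merge(
    tokens: Sequence[Sequence[float]], weights: Sequence[float]
) -> List[float]:
    """Weighted mean: each token contributes in proportion to its attention score."""
    total = sum(weights)
    if total <= 0:
        # Assumption: fall back to the plain mean when no token carries weight.
        return average_merge(tokens)
    return [
        sum(w * t[d] for w, t in zip(weights, tokens)) / total
        for d in range(len(tokens[0]))
    ]


# Toy group: one salient token (high attention) and one background token.
group = [[1.0, 0.0], [0.0, 1.0]]
scores = [3.0, 1.0]  # hypothetical attention scores
print(average_merge(group))                     # [0.5, 0.5]
print(attention_weighted_merge(group, scores))  # [0.75, 0.25]
```

In the toy group, plain averaging gives both tokens equal influence, whereas the weighted rule keeps the merged vector close to the token with the higher attention score.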

Impact of Attention-Enhanced Pruning. We compare ADPruner against the baseline CDPruner in Figure 5. ADPruner exhibits superior robustness, particularly at high compression rates. Notably, on Kimi-Audio with 50% token retention, ADPruner maintains a low WER of 7.67%, achieving a 6.17% reduction compared to CDPruner. This indicates that attention-guided pruning successfully retains semantically rich tokens that magnitude-based metrics often discard.

Impact of GSA-Merging (Combined with ADPruner). Figure 6 illustrates that GSA-Merging achieves the best efficiency-accuracy trade-off. Unlike A-ToMe and FastAdaSP, which rely on the similarity of adjacent tokens, GSA captures correlation patterns among all tokens within a group. This group-level perspective mitigates locality bias, yielding more coherent clusters and more complete feature representations. Notably, on Kimi-Audio with 50% token retention, SA-MAP achieves a 22.2% relative WER improvement over FastAdaSP, confirming its ability to better preserve key semantic information in audio data.
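
To make the contrast with adjacent-only similarity concrete, the sketch below groups tokens by their mean cosine similarity to every token already in the current group, following the grouping loop of Algorithm A1. The function names and the toy features are assumptions made for illustration only.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def group_by_mean_similarity(
    features: Sequence[Sequence[float]], threshold: float
) -> List[List[int]]:
    """Scan tokens in temporal order and grow contiguous groups."""
    groups: List[List[int]] = []
    current: List[int] = []
    for i, feat in enumerate(features):
        if current:
            # Mean similarity between token i and all members of the open group.
            mean_sim = sum(cosine(feat, features[k]) for k in current) / len(current)
            if mean_sim <= threshold:
                groups.append(current)  # close the group, token i starts a new one
                current = []
        current.append(i)
    if current:
        groups.append(current)
    return groups


# Four toy tokens whose direction drifts by 20 degrees per step.
angles = [0, 20, 40, 60]
feats = [[math.cos(math.radians(a)), math.sin(math.radians(a))] for a in angles]
print(group_by_mean_similarity(feats, threshold=0.9))  # [[0, 1], [2, 3]]
```

Each adjacent pair in this toy sequence has a cosine similarity of about 0.94, so a purely adjacent criterion with the same threshold would chain all four tokens into one group even though the first and last differ by 60 degrees. The group-level mean stops the chain once the new token has drifted too far from the group as a whole.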

Impact of Threshold Selection. Taking Qwen2-Audio as an example, we use LibriSpeech dev_clean as the reference set, and Figure 3 presents the WER of SA-MAP under different thresholds. The figure shows that the choice of threshold affects the final accuracy. We therefore determine the threshold for each compression rate through a search on this reference set. For instance, for Qwen2-Audio on ASR, a compression rate of 30% corresponds to a threshold of 0.72 (see Appendix B).
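
The search procedure is not specified in detail; one plausible reading is a grid search that evaluates each candidate threshold on the reference set and keeps the one with the lowest WER. The sketch below assumes a caller-supplied `evaluate_wer` function (hypothetical) and uses a synthetic stand-in so that it runs on its own; the numbers it produces are not measured results.

```python
from typing import Callable, Iterable, Tuple


def search_threshold(
    candidates: Iterable[float], evaluate_wer: Callable[[float], float]
) -> Tuple[float, float]:
    """Return (best_threshold, best_wer) over a grid of candidate thresholds."""
    scored = [(evaluate_wer(t), t) for t in candidates]
    if not scored:
        raise ValueError("no candidate thresholds given")
    best_wer, best_threshold = min(scored)  # lowest WER wins, ties go to the lower threshold
    return best_threshold, best_wer


def toy_wer(threshold: float) -> float:
    """Synthetic stand-in for a real evaluation run; not measured data."""
    return abs(threshold - 0.7)


grid = [0.60, 0.65, 0.70, 0.75, 0.80]
best_threshold, best_wer = search_threshold(grid, toy_wer)
print(best_threshold, best_wer)
```

In practice `evaluate_wer` would run the compressed model on the reference set at a fixed target compression rate, so the search is repeated once per compression rate.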

5. Conclusion

This paper introduces SA-MAP, an adaptive, hardware-friendly, plug-and-play framework for joint audio token merging and pruning in Speech LLMs, driven by the synergy of similarity and attention. The first stage merges adjacent tokens to maximize semantic fidelity, while the second stage prunes tokens to preserve token diversity. By setting a similarity threshold and jointly optimizing merging and pruning, our method balances compression efficiency and information retention and remains reliable even under high compression settings. Experiments on representative Speech LLMs demonstrate that our method achieves a 50% lossless token reduction in understanding tasks and a 40% compression ratio in ASR with only marginal WER degradation.

Author Contributions

Hong Liu, Rui Cen: methodology, software, validation, formal analysis, investigation, resources, data curation, writing—original draft preparation, writing—review and editing, visualization. Guanghua Yu: writing—review and editing, supervision, project administration. Jianchen Zhu: supervision, project administration. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A. Algorithm

Algorithm A1 Adaptive Merging-Pruning Fusion Method

Input: audio features $H_{a}$, attention scores $A$, audio token index list $S$, similarity threshold $\lambda$, number of retained tokens $m$
Output: optimized subset $\hat{H}_{a}$
// First: attention-enhanced token merging
for $i, j \in S$ do
  $L_{i,j} = \frac{H_{a}^{i} \cdot H_{a}^{j}}{\| H_{a}^{i} \| \cdot \| H_{a}^{j} \|}$
end for
$S_{*} = \{\}$, $s = \{\}$  // initialize the group list and the current group
for $i = 1, 2, \dots, N$ do
  $W_{i} = \frac{1}{N} \sum_{j=0}^{N} \max_{h} A_{h,j,i}$
  if $s$ is empty then
    add $i$ to $s$
  else
    $l_{i,s} = \frac{1}{|s|} \sum_{k \in s} L_{i,k}$
    if $l_{i,s} > \lambda$ then
      add $i$ to $s$
    else
      add $s$ to $S_{*}$, set $s = \{i\}$
    end if
  end if
end for
add the final group $s$ to $S_{*}$
for $s_{i}$ in $S_{*}$ do
  $\tilde{H}_{a}^{i} = \frac{\sum_{j \in s_{i}} W_{j} \cdot H_{a}^{j}}{\sum_{j \in s_{i}} W_{j}}$
end for
the index list is updated to $\tilde{S} = \{1, \dots, K\}$
// Second: attention-enhanced DPP-based token pruning
$\hat{H}_{a} = \{\}$  // initialize the output subset
$\hat{L} = \operatorname{diag}(\hat{A}) \cdot L \cdot \operatorname{diag}(\hat{A})$
$c_{i} = []$, $d_{i}^{2} = \hat{L}_{i,i}$ for all $i \in \tilde{S}$
$j = \arg\max_{i \in \tilde{S}} \log(d_{i}^{2})$, $\hat{S} = \{j\}$, add $\tilde{H}_{a}^{j}$ to $\hat{H}_{a}$
while $|\hat{S}| < m$ do
  for $i \in \tilde{S} \setminus \hat{S}$ do
    $e_{i} = (\hat{L}_{j,i} - \langle c_{j}, c_{i} \rangle) / d_{j}$
    $c_{i} = [c_{i}, e_{i}]$, $d_{i}^{2} = d_{i}^{2} - e_{i}^{2}$
  end for
  $j = \arg\max_{i \in \tilde{S} \setminus \hat{S}} \log(d_{i}^{2})$, $\hat{S} = \hat{S} \cup \{j\}$
  add $\tilde{H}_{a}^{j}$ to $\hat{H}_{a}$
end while
return $\hat{H}_{a}$
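
The second stage of Algorithm A1 resembles the standard greedy MAP procedure for determinantal point processes, applied to a similarity kernel rescaled by attention scores. The following is a minimal pure-Python sketch under that reading; the function names, the early-stop tolerance, and the toy inputs are assumptions for illustration and are not taken from the released code.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def build_kernel(
    features: Sequence[Sequence[float]], attention: Sequence[float]
) -> List[List[float]]:
    """Kernel diag(a) * L * diag(a), with L the cosine-similarity matrix."""
    n = len(features)
    return [
        [attention[i] * cosine(features[i], features[j]) * attention[j] for j in range(n)]
        for i in range(n)
    ]


def greedy_dpp_select(kernel: Sequence[Sequence[float]], m: int) -> List[int]:
    """Greedily pick up to m indices that maximize the kernel determinant."""
    n = len(kernel)
    c: List[List[float]] = [[] for _ in range(n)]  # incremental Cholesky rows c_i
    d2 = [kernel[i][i] for i in range(n)]          # marginal gains d_i^2
    remaining = list(range(n))
    selected: List[int] = []
    while remaining and len(selected) < m:
        # argmax of log(d_i^2) equals argmax of d_i^2 because log is monotonic.
        j = max(remaining, key=lambda i: d2[i])
        if d2[j] <= 1e-12:
            break  # assumption: stop once the remaining tokens are redundant
        selected.append(j)
        remaining.remove(j)
        d_j = math.sqrt(d2[j])
        for i in remaining:
            e = (kernel[j][i] - sum(a * b for a, b in zip(c[j], c[i]))) / d_j
            c[i].append(e)
            d2[i] -= e * e
    return selected


# Toy input: tokens 0 and 1 are duplicates, token 2 points in a different direction.
feats = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
attn = [1.0, 0.9, 0.8]  # hypothetical attention scores
kernel = build_kernel(feats, attn)
print(greedy_dpp_select(kernel, 2))  # [0, 2]
```

Token 1 has a higher attention score than token 2 but is skipped because it duplicates the already selected token 0, which is the diversity-preserving behavior the pruning stage is meant to provide.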

Appendix B. Full Experimental Results

This section provides the full experimental results to further validate the effectiveness and robustness of our proposed method. See Tables A1 through A5.

Threshold Settings: For ASR, the thresholds set for Qwen2-Audio at compression rates of 30%, 40%, and 50% are 0.72, 0.65, and 0.6, respectively. For Kimi-Audio at the same rates, the thresholds are 0.753, 0.93, and 0.94. For GLM-ASR-Nano, the thresholds at compression rates of 30% and 40% are 0.93 and 0.9, respectively. For speech understanding tasks, the thresholds set for Qwen2-Audio at compression rates of 40%, 50%, and 70% are 0.82, 0.80, and 0.85, respectively. For Kimi-Audio at the same rates, the thresholds are 0.74, 0.7, and 0.6.
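
For convenience, the reported settings can be collected in a small lookup table. The sketch below only transcribes the values listed above; the dictionary layout, the task labels, and the `lookup_threshold` helper are hypothetical names introduced for illustration.

```python
from typing import Dict, Tuple

# (model, task) -> {compression rate in percent: similarity threshold},
# transcribed from the threshold settings reported in this appendix.
THRESHOLDS: Dict[Tuple[str, str], Dict[int, float]] = {
    ("Qwen2-Audio", "asr"): {30: 0.72, 40: 0.65, 50: 0.6},
    ("Kimi-Audio", "asr"): {30: 0.753, 40: 0.93, 50: 0.94},
    ("GLM-ASR-Nano", "asr"): {30: 0.93, 40: 0.9},
    ("Qwen2-Audio", "understanding"): {40: 0.82, 50: 0.80, 70: 0.85},
    ("Kimi-Audio", "understanding"): {40: 0.74, 50: 0.7, 70: 0.6},
}


def lookup_threshold(model: str, task: str, compression_percent: int) -> float:
    """Return the reported threshold, or raise KeyError if none was reported."""
    try:
        return THRESHOLDS[(model, task)][compression_percent]
    except KeyError:
        raise KeyError(
            f"no threshold reported for {model!r}, {task!r}, {compression_percent}%"
        ) from None


print(lookup_threshold("Qwen2-Audio", "asr", 30))  # 0.72
```

Settings that are not reported, such as GLM-ASR-Nano at 50% compression, raise an error instead of being interpolated.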

Table A1. Performance comparison of different pruning methods on ASR dense tasks (Qwen2-Audio)

Method	LibriSpeech				Fleurs		AISHELL-1	AISHELL-2-ios	WenetSpeech		Average
Method	dev_clean	dev_other	test_clean	test_other	zh	en	AISHELL-1	AISHELL-2-ios	test-meeting	test-net	Average
base	1.67	3.65	1.74	4.03	3.63	5.20	1.52	3.08	8.40	7.64	4.06
Retain 70% Tokens (30% Compression Ratio)
VisionZip	3.16	5.68	3.26	5.82	4.90	5.53	3.31	7.70	12.61	15.98	6.80
VisPruner	3.93	5.89	3.81	6.00	4.70	6.08	3.10	5.14	10.35	10.22	5.92
CDPruner	2.58	4.35	2.59	4.75	3.96	5.99	1.83	3.51	9.77	8.89	4.82
A-ToMe	2.54	4.86	2.59	5.00	4.93	6.09	2.26	3.86	10.35	10.29	5.28
FastAdaSP	2.44	4.54	2.54	4.83	4.00	5.62	1.94	3.43	9.53	9.10	4.80
SA-MAP	1.98	4.07	2.00	4.26	3.67	5.39	1.83	3.70	9.26	8.71	4.49
Retain 60% Tokens (40% Compression Ratio)
VisionZip	7.31	10.35	7.08	10.10	8.02	6.85	6.99	13.85	17.88	23.75	11.22
VisPruner	7.42	9.74	7.20	9.69	7.00	8.39	6.91	9.67	13.37	14.07	9.35
CDPruner	4.22	6.05	4.18	6.53	4.88	7.17	2.70	4.62	12.29	10.90	6.35
A-ToMe	4.12	6.81	4.20	6.98	8.00	8.13	4.18	5.56	14.05	14.15	7.62
FastAdaSP	4.91	7.26	4.95	7.51	5.47	7.31	3.28	4.69	11.51	12.30	6.92
SA-MAP	2.59	5.00	2.72	5.02	4.37	5.94	2.69	4.42	11.05	10.11	5.39
Retain 50% Tokens (50% Compression Ratio)
VisionZip	18.17	20.45	15.92	20.80	14.42	10.06	14.04	22.42	27.52	34.19	19.80
VisPruner	15.44	18.33	15.11	18.05	12.34	14.41	15.42	19.89	19.75	22.19	17.09
CDPruner	7.80	9.88	7.95	10.12	9.34	10.55	6.83	8.20	16.08	19.13	10.59
A-ToMe	7.79	10.96	7.98	11.19	21.31	13.81	11.05	12.10	21.40	22.10	13.97
FastAdaSP	11.78	14.77	11.96	14.82	10.51	12.82	8.24	9.39	17.21	20.85	13.24
SA-MAP	4.49	7.20	4.53	7.24	5.53	7.49	4.41	7.41	13.67	14.02	7.60

Table A2. Performance comparison of different pruning methods on ASR dense tasks (GLM-ASR)

Method	LibriSpeech				Fleurs		AISHELL-1	AISHELL-2-ios	WenetSpeech		Average
Method	dev_clean	dev_other	test_clean	test_other	zh	en	AISHELL-1	AISHELL-2-ios	test-meeting	test-net	Average
base	2.14	4.05	2.18	4.53	3.44	4.11	2.47	3.48	8.43	6.65	4.15
Retain 70% Tokens (30% Compression Ratio)
VisionZip	9.18	12.75	8.27	13.27	8.17	8.08	13.90	20.55	19.36	20.45	13.40
VisPruner	9.22	12.33	8.34	12.13	6.95	8.33	11.68	14.57	18.11	16.56	11.82
CDPruner	5.38	7.75	5.28	7.76	4.99	5.93	4.88	5.94	14.21	11.70	7.38
A-ToMe	4.55	8.04	4.66	7.83	5.08	5.77	5.40	5.95	13.16	12.08	7.25
FastAdaSP	6.45	10.34	6.50	10.41	4.34	6.70	6.42	7.02	16.10	14.10	8.84
SA-MAP	3.41	5.53	3.43	5.55	3.76	4.83	3.53	4.52	12.07	9.60	5.62
Retain 60% Tokens (40% Compression Ratio)
VisionZip	19.18	24.16	17.91	24.27	15.66	16.22	27.40	36.09	30.56	32.89	24.43
VisPruner	18.02	22.68	17.01	22.06	12.48	15.27	22.82	28.81	27.46	26.78	21.34
CDPruner	9.55	12.53	9.88	11.90	6.68	9.38	8.40	9.78	21.10	16.53	11.57
A-ToMe	8.24	12.51	8.39	12.19	8.37	8.23	9.27	10.48	18.00	16.85	11.25
FastAdaSP	13.23	17.52	13.55	17.60	7.84	7.46	8.61	9.46	17.60	15.88	12.88
SA-MAP	6.17	8.48	6.24	8.08	4.81	6.54	5.84	7.52	16.29	13.82	8.38

Table A3. Performance comparison of different pruning methods on ASR dense tasks (Kimi-Audio)

Method	LibriSpeech				Fleurs		AISHELL-1	AISHELL-2-ios	WenetSpeech		Average
Method	dev_clean	dev_other	test_clean	test_other	zh	en	AISHELL-1	AISHELL-2-ios	test-meeting	test-net	Average
base	1.23	2.39	1.38	2.45	2.87	4.92	0.61	2.57	6.33	5.39	3.01
Retain 70% Tokens (30% Compression Ratio)
VisionZip	2.83	4.22	2.77	4.17	4.46	6.09	2.71	5.70	9.92	12.59	5.55
VisPruner	2.54	3.68	2.52	3.83	3.87	6.17	1.67	4.36	9.40	11.50	4.95
CDPruner	2.72	3.87	2.60	3.90	4.38	6.42	1.81	4.75	10.14	11.48	5.21
A-ToMe	2.09	3.54	2.07	3.52	3.74	5.60	1.22	3.73	8.89	9.67	4.41
FastAdaSP	2.00	3.40	2.12	3.38	3.22	5.76	0.95	3.37	8.68	8.98	4.19
SA-MAP	1.48	2.64	1.50	2.73	3.36	5.11	1.03	3.28	8.66	9.97	3.98
Retain 60% Tokens (40% Compression Ratio)
VisionZip	6.35	7.94	5.93	7.71	7.63	8.91	6.84	10.01	14.65	19.51	9.55
VisPruner	5.36	6.90	5.07	6.75	5.73	7.82	4.36	7.65	13.70	18.56	8.19
CDPruner	5.70	6.96	5.44	6.78	6.73	8.66	5.07	9.21	14.95	18.41	8.79
A-ToMe	4.13	6.04	3.92	5.87	6.26	7.57	3.58	6.88	12.97	14.99	7.22
FastAdaSP	4.49	6.15	4.28	6.02	4.27	7.64	2.22	5.67	12.62	14.75	6.81
SA-MAP	2.32	3.67	2.49	3.75	4.37	6.26	2.05	3.99	11.47	12.35	5.27
Retain 50% Tokens (50% Compression Ratio)
VisionZip	13.75	15.53	12.44	15.00	13.70	13.84	14.00	17.35	22.64	29.64	16.79
VisPruner	11.90	14.30	11.47	13.41	10.38	13.20	10.88	14.92	22.03	29.29	15.18
CDPruner	12.79	14.70	12.14	14.31	11.75	14.27	13.47	18.90	23.23	29.62	16.52
A-ToMe	8.66	11.62	8.31	11.24	14.66	11.55	10.04	12.44	20.31	22.48	13.13
FastAdaSP	10.71	12.98	10.36	12.20	7.17	11.87	7.24	11.15	20.37	25.07	12.91
SA-MAP	5.19	6.93	5.13	6.72	6.81	8.50	5.06	6.52	17.12	20.36	8.83

Table A4. Performance comparison of different pruning methods on sparse tasks (Qwen2-Audio)

Audio Understanding
Method	MELD	mmau-test-mini			Nonspeech7k	TUT2017	VocalSound	Average
Method	MELD	music	sound	speech	Nonspeech7k	TUT2017	VocalSound	Average
base	51.19	60.48	72.07	60.36	86.48	32.35	88.00	64.42
Retain 60% Tokens (40% Compression Ratio)
VisionZip	51.38	59.88	71.77	61.86	86.48	32.28	87.33	64.43
VisPruner	52.30	60.18	72.97	62.16	86.21	31.05	86.69	64.51
CDPruner	50.88	61.08	73.57	60.96	87.17	32.65	87.94	64.89
A-ToMe	51.46	60.78	70.27	61.86	84.83	31.73	86.44	63.91
FastAdaSP	50.50	59.88	72.37	62.16	85.93	32.41	87.11	64.34
SA-MAP	51.69	61.08	73.27	62.46	87.31	32.84	87.97	65.23
Retain 50% Tokens (50% Compression Ratio)
VisionZip	51.73	60.48	72.07	61.56	85.38	31.73	86.33	64.18
VisPruner	50.88	61.68	73.87	62.16	84.69	30.74	86.22	64.32
CDPruner	51.61	61.38	71.77	61.26	86.62	31.73	87.50	64.55
A-ToMe	50.15	61.38	71.17	56.46	83.45	29.20	84.32	62.30
FastAdaSP	50.81	59.88	72.07	59.46	83.86	31.23	85.96	63.32
SA-MAP	51.96	61.38	72.07	63.06	85.10	32.53	87.64	64.82
Retain 30% Tokens (70% Compression Ratio)
VisionZip	51.53	60.78	72.97	61.26	77.66	29.20	78.08	61.64
VisPruner	50.08	60.78	72.67	61.56	70.90	28.77	84.91	61.38
CDPruner	50.84	58.68	72.07	60.96	81.38	31.36	85.55	62.98
A-ToMe	48.73	61.38	70.87	48.65	67.72	27.04	79.78	57.74
FastAdaSP	48.66	59.58	71.77	51.95	69.24	28.40	79.00	58.37
SA-MAP	50.42	60.78	71.77	61.86	79.17	32.04	87.25	63.33

Table A5. Performance comparison of different pruning methods on sparse tasks (Kimi-Audio)

Audio Understanding
Method	MELD	mmau-test-mini			Nonspeech7k	TUT2017	VocalSound	Average
Method	MELD	music	sound	speech	Nonspeech7k	TUT2017	VocalSound	Average
base	57.48	63.77	79.28	65.77	93.10	61.11	94.24	73.54
Retain 60% Tokens (40% Compression Ratio)
VisionZip	58.59	64.67	79.88	65.17	91.45	60.80	93.62	73.45
VisPruner	57.63	64.97	79.28	65.77	92.00	60.49	93.71	73.41
CDPruner	57.86	62.57	78.98	64.56	91.17	60.80	93.68	72.80
A-ToMe	57.82	64.97	78.38	68.17	92.28	60.06	93.90	73.65
FastAdaSP	57.63	64.67	80.78	64.86	91.17	60.43	93.71	73.32
SA-MAP	57.25	66.47	80.78	66.07	91.31	60.49	93.73	73.73
Retain 50% Tokens (50% Compression Ratio)
VisionZip	57.78	63.47	79.88	61.86	90.21	60.56	93.23	72.43
VisPruner	56.67	65.57	79.88	63.66	90.21	60.25	93.21	72.78
CDPruner	57.40	63.17	79.58	65.17	89.52	60.31	93.15	72.61
A-ToMe	57.25	64.67	79.28	63.66	92.14	60.68	93.40	73.01
FastAdaSP	56.79	63.47	79.88	66.37	90.48	60.37	93.34	72.96
SA-MAP	56.40	65.57	79.88	66.37	90.76	60.19	93.43	73.23
Retain 30% Tokens (70% Compression Ratio)
VisionZip	53.87	64.97	78.98	60.66	86.90	57.84	92.76	70.85
VisPruner	53.91	64.67	79.58	63.06	86.90	58.64	92.98	71.39
CDPruner	54.03	65.27	79.28	60.96	86.76	59.01	92.06	71.05
A-ToMe	54.37	65.57	78.68	62.76	89.79	57.22	90.39	71.25
FastAdaSP	54.41	62.87	77.18	61.86	87.45	58.15	91.56	70.50
SA-MAP	55.37	64.07	79.88	65.47	86.62	55.99	92.76	71.45

Appendix C. Trainable Compression Module

To further improve token compression, we add two lightweight trainable modules to our joint merging-pruning framework: learnable merging weights and an importance scorer. These modules are optimized through end-to-end training to improve accuracy.

Two-phase training. To ensure better convergence, we adopt a two-stage training approach. Given an overall compression ratio of $m$, the first stage fixes the merging ratio at $n$ and trains the merging weights. The second stage starts from the first-stage weights and sets the pruning ratio to $m - n$ to train the importance scorer. The training framework is shown in Figure A1.
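
The split between the two stages can be made concrete with a small worked example. The sketch assumes that both ratios are measured against the original token count, which the text does not state explicitly; the function name and the example values are hypothetical.

```python
from typing import Tuple


def token_budgets(num_tokens: int, total_ratio: float, merge_ratio: float) -> Tuple[int, int]:
    """Return (tokens left after merging, tokens left after pruning).

    Assumption: total_ratio (m) and merge_ratio (n) are both fractions of the
    original token count, so pruning removes a further (m - n) fraction.
    """
    if not 0.0 <= merge_ratio <= total_ratio < 1.0:
        raise ValueError("expected 0 <= merge_ratio <= total_ratio < 1")
    after_merge = round(num_tokens * (1.0 - merge_ratio))
    after_prune = round(num_tokens * (1.0 - total_ratio))
    return after_merge, after_prune


# Hypothetical example: 1000 tokens, m = 0.5 overall, n = 0.25 from merging.
print(token_budgets(1000, 0.5, 0.25))  # (750, 500)
```

Under this reading, merging reduces 1000 tokens to 750 and pruning then removes another 250, which is the $m - n = 0.25$ share assigned to the second stage.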

Figure A1. Overview of our Trainable Compression Framework.

Training Settings: In the trainable mode, we conducted experiments on Qwen2-Audio. The training data consisted of 960 hours of speech from LibriSpeech, and all stages were trained for one epoch. During end-to-end training, only the newly added linear layers were updated, while all other parameters were kept frozen. In the first stage, we trained the merging weights using a learning rate of 2e-5 for a 30% compression rate, and 5e-5 for compression rates of 40% and 50%. In the second stage, we trained the importance scorer with a learning rate of 5e-4. After training, inference still followed the merging–pruning strategy of SA-MAP.

References

Bai, J.; Bai, S.; Chu, Y.; Cui, Z.; Dang, K.; Deng, X.; Fan, Y.; Ge, W.; Han, Y.; Huang, F.; et al. Qwen technical report. arXiv 2023, arXiv:2309.16609.
Touvron, H.; Martin, L.; Stone, K.; Albert, P.; Almahairi, A.; Babaei, Y.; Bashlykov, N.; Batra, S.; Bhargava, P.; Bhosale, S.; et al. Llama 2: Open foundation and fine-tuned chat models. arXiv 2023, arXiv:2307.09288.
Team, G.; Kamath, A.; Ferret, J.; Pathak, S.; Vieillard, N.; Merhej, R.; Perrin, S.; Matejovicova, T.; Ramé, A.; Rivière, M.; et al. Gemma 3 technical report. arXiv 2025, arXiv:2503.19786.
Chu, Y.; Xu, J.; Yang, Q.; Wei, H.; Wei, X.; Guo, Z.; Leng, Y.; Lv, Y.; He, J.; Lin, J.; et al. Qwen2-audio technical report. arXiv 2024, arXiv:2407.10759.
Zeng, A.; Du, Z.; Liu, M.; Wang, K.; Jiang, S.; Zhao, L.; Dong, Y.; Tang, J. Glm-4-voice: Towards intelligent and human-like end-to-end spoken chatbot. arXiv 2024, arXiv:2412.02612.
Xu, K.T.; Xie, F.L.; Tang, X.; Hu, Y. Fireredasr: Open-source industrial-grade mandarin speech recognition models from encoder-decoder to llm integration. arXiv 2025, arXiv:2501.14350.
Ding, D.; Ju, Z.; Leng, Y.; Liu, S.; Liu, T.; Shang, Z.; Shen, K.; Song, W.; Tan, X.; Tang, H.; et al. Kimi-audio technical report. arXiv 2025, arXiv:2504.18425.
Li, Y.; Wu, Y.; Li, J.; Liu, S. Accelerating transducers through adjacent token merging. arXiv 2023, arXiv:2306.16009.
Lu, Y.; Song, J.; Yang, C.H.H.; Watanabe, S. FastAdaSP: Multitask-Adapted Efficient Inference for Large Speech Language Model. arXiv 2024, arXiv:2410.03007.
Lee, T.; Lee, H. Token Pruning in Audio Transformers: Optimizing Performance and Decoding Patch Importance. arXiv 2025, arXiv:2504.01690.
Dao, T.; Fu, D.; Ermon, S.; Rudra, A.; Ré, C. Flashattention: Fast and memory-efficient exact attention with io-awareness. Adv. Neural Inf. Process. Syst. 2022, 35, 16344–16359.
Zhang, Q.; Cheng, A.; Lu, M.; Zhang, R.; Zhuo, Z.; Cao, J.; Guo, S.; She, Q.; Zhang, S. Beyond Text-Visual Attention: Exploiting Visual Cues for Effective Token Pruning in VLMs. arXiv 2025, arXiv:cs.
Yang, S.; Chen, Y.; Tian, Z.; Wang, C.; Li, J.; Yu, B.; Jia, J. VisionZip: Longer is Better but Not Necessary in Vision Language Models. arXiv 2024, arXiv:cs.
An, K.; Chen, Y.; Deng, C.; Gao, C.; Gao, Z.; Gong, B.; Li, X.; Li, Y.; Lv, X.; Ji, Y.; et al. Fun-ASR Technical Report. arXiv 2025, arXiv:2509.12508.
Ranjan Behera, S.; Dhiman, A.; Gowda, K.; Narayani, A.S. FastAST: Accelerating Audio Spectrogram Transformer via Token Merging and Cross-Model Knowledge Distillation. arXiv E-Prints 2024, arXiv–2406.
Lin, Y.; Fu, Y.; Zhang, J.; Liu, Y.; Zhang, J.; Sun, J.; Li, H.H.; Chen, Y. Speechprune: Context-aware token pruning for speech information retrieval. In Proceedings of the 2025 IEEE International Conference on Multimedia and Expo (ICME); IEEE, 2025; pp. 1–6.
Rao, Y.; Zhao, W.; Liu, B.; Lu, J.; Zhou, J.; Hsieh, C.J. Dynamicvit: Efficient vision transformers with dynamic token sparsification. Adv. Neural Inf. Process. Syst. 2021, 34, 13937–13949.
Bolya, D.; Fu, C.Y.; Dai, X.; Zhang, P.; Feichtenhofer, C.; Hoffman, J. Token merging: Your vit but faster. arXiv 2022, arXiv:2210.09461.
Shao, K.; Tao, K.; Qin, C.; You, H.; Sui, Y.; Wang, H. HoliTom: Holistic Token Merging for Fast Video Large Language Models. arXiv 2025, arXiv:2505.21334.
Chen, L.; Zhao, H.; Liu, T.; Bai, S.; Lin, J.; Zhou, C.; Chang, B. An image is worth 1/2 tokens after layer 2: Plug-and-play inference acceleration for large vision-language models. In Proceedings of the European Conference on Computer Vision, 2024; Springer; pp. 19–35.
Alvar, S.R.; Singh, G.; Akbari, M.; Zhang, Y. Divprune: Diversity-based visual token pruning for large multimodal models. In Proceedings of the Computer Vision and Pattern Recognition Conference, 2025; pp. 9392–9401.
Zhang, Q.; Liu, M.; Li, L.; Lu, M.; Zhang, Y.; Pan, J.; She, Q.; Zhang, S. Beyond Attention or Similarity: Maximizing Conditional Diversity for Token Pruning in MLLMs. arXiv 2025, arXiv:2506.10967.
Chen, L.; Zhao, H.; Liu, T.; Bai, S.; Lin, J.; Zhou, C.; Chang, B. An Image is Worth 1/2 Tokens After Layer 2: Plug-and-Play Inference Acceleration for Large Vision-Language Models. arXiv 2024, arXiv:cs.
Zhang, Y.; Fan, C.K.; Ma, J.; Zheng, W.; Huang, T.; Cheng, K.; Gudovskiy, D.; Okuno, T.; Nakata, Y.; Keutzer, K.; et al. Sparsevlm: Visual token sparsification for efficient vision-language model inference. arXiv 2024, arXiv:2410.04417.
Radford, A.; Kim, J.W.; Xu, T.; Brockman, G.; McLeavey, C.; Sutskever, I. Robust Speech Recognition via Large-Scale Weak Supervision. arXiv 2022, arXiv:eess.
Macchi, O. The coincidence approach to stochastic point processes. Adv. Appl. Probab. 1975, 7, 83–122.
Chen, Z.; Wang, W.; Cao, Y.; Liu, Y.; Gao, Z.; Cui, E.; Zhu, J.; Ye, S.; Tian, H.; Liu, Z.; et al. Expanding performance boundaries of open-source multimodal models with model, data, and test-time scaling. arXiv 2024, arXiv:2412.05271.
Zhu, J.; Zhu, Y.; Lu, X.; Yan, W.; Li, D.; Liu, K.; Fu, X.; Zha, Z.J. VisionSelector: End-to-End Learnable Visual Token Compression for Efficient Multimodal LLMs. arXiv 2025, arXiv:2510.16598.
Panayotov, V.; Chen, G.; Povey, D.; Khudanpur, S. Librispeech: an asr corpus based on public domain audio books. In Proceedings of the 2015 IEEE international conference on acoustics, speech and signal processing (ICASSP); IEEE, 2015; pp. 5206–5210.
Conneau, A.; Ma, M.; Khanuja, S.; Zhang, Y.; Axelrod, V.; Dalmia, S.; Riesa, J.; Rivera, C.; Bapna, A. Fleurs: Few-shot learning evaluation of universal representations of speech. In Proceedings of the 2022 IEEE Spoken Language Technology Workshop (SLT). IEEE, 2023; pp. 798–805.
Bu, H.; Du, J.; Na, X.; Wu, B.; Zheng, H. Aishell-1: An open-source mandarin speech corpus and a speech recognition baseline. In Proceedings of the 2017 20th conference of the oriental chapter of the international coordinating committee on speech databases and speech I/O systems and assessment (O-COCOSDA); IEEE, 2017; pp. 1–5.
Du, J.; Na, X.; Liu, X.; Bu, H. Aishell-2: Transforming mandarin asr research into industrial scale. arXiv 2018, arXiv:1808.10583.
Zhang, B.; Lv, H.; Guo, P.; Shao, Q.; Yang, C.; Xie, L.; Xu, X.; Bu, H.; Chen, X.; Zeng, C.; et al. Wenetspeech: A 10000+ hours multi-domain mandarin corpus for speech recognition. In Proceedings of the ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP); IEEE, 2022; pp. 6182–6186.
Poria, S.; Hazarika, D.; Majumder, N.; Naik, G.; Cambria, E.; Mihalcea, R. Meld: A multimodal multi-party dataset for emotion recognition in conversations. In Proceedings of the 57th annual meeting of the association for computational linguistics, 2019; pp. 527–536.
Gong, Y.; Yu, J.; Glass, J. Vocalsound: A dataset for improving human vocal sounds recognition. In Proceedings of the ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP); IEEE, 2022; pp. 151–155.
Mesaros, A.; Heittola, T.; Virtanen, T. TUT database for acoustic scene classification and sound event detection. In Proceedings of the 2016 24th European signal processing conference (EUSIPCO); IEEE, 2016; pp. 1128–1132.
Rashid, M.M.; Li, G.; Du, C. Nonspeech7k dataset: Classification and analysis of human non-speech sound. IET Signal Process. 2023, 17, e12233.
Sakshi, S.; Tyagi, U.; Kumar, S.; Seth, A.; Selvakumar, R.; Nieto, O.; Duraiswami, R.; Ghosh, S.; Manocha, D. Mmau: A massive multi-task audio understanding and reasoning benchmark. arXiv 2024, arXiv:2410.19168.

Figure 1. Task divergence between VLMs and Speech LLMs.

Figure 2. Architecture of the proposed SA-MAP with adaptive merging-pruning fusion. By introducing a similarity threshold, SA-MAP dynamically balances token merging and pruning, according to speech temporal structures. Similar tokens are first adaptively merged to maximize information retention, followed by pruning to reduce redundancy and improve diversity under high compression rates.

Figure 3. Impact of Threshold.

Figure 4. The ablation study of attention-weighted merging on LibriSpeech dataset.

Figure 5. WER of merging methods combined with ADPruner on LibriSpeech dataset.

Figure 6. WER of merging methods combined with ADPruner on LibriSpeech dataset.

Table 3. Performance of SA-MAP with and without tuning on the LibriSpeech dataset.

Method	LibriSpeech				Average
Method	dev_clean	dev_other	test_clean	test_other	Average
Qwen2-Audio	1.67	3.65	1.74	4.03	2.77
Retain 70% Tokens (30% Compression Ratio)
SA-MAP	1.98	4.07	2.00	4.26	3.08
SA-MAP (tuning)	1.81	3.88	1.96	4.05	2.93
Retain 60% Tokens (40% Compression Ratio)
SA-MAP	2.59	5.00	2.72	5.02	3.83
SA-MAP (tuning)	2.16	4.24	2.27	4.58	3.31
Retain 50% Tokens (50% Compression Ratio)
SA-MAP	4.49	7.20	4.53	7.24	5.87
SA-MAP (tuning)	3.30	5.81	3.54	6.03	4.67

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

Copyright: This open access article is published under a Creative Commons CC BY 4.0 license, which permits free download, distribution, and reuse, provided that the author and preprint are cited in any reuse.