Submitted: 08 June 2026
Posted: 09 June 2026
Abstract
Keywords:
1. Introduction
- We introduce an adaptive, hardware-friendly, plug-and-play framework for joint token merging and pruning, driven by the synergy of similarity and attention.
- Similarity and attention are integrated in both stages: merging uses intra-group global similarity with attention-guided weighted aggregation, while pruning uses a similarity-attention integrated kernel for diversity-driven pruning.
- Extensive validation on multiple public speech datasets and mainstream Speech LLMs confirms the method’s effectiveness and generalization across ASR and audio understanding tasks.
2. Related Work
3. Method
3.1. Preliminaries
3.2. Attention-Enhanced Token Merging
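The Introduction summarises merging as intra-group similarity combined with attention-guided weighted aggregation. The sketch below illustrates that idea only and should not be read as the SA-MAP implementation. The contiguous fixed-size grouping, the mean pairwise cosine criterion, the `threshold` value, and all function names are assumptions made for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all-zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0

def attention_weighted_merge(tokens, attn):
    """Collapse a group of tokens into one vector, weighting each by its attention score."""
    total = sum(attn)
    if total <= 0:  # fall back to a plain mean when there is no attention mass
        attn, total = [1.0] * len(tokens), float(len(tokens))
    dim = len(tokens[0])
    return [sum(a * t[d] for a, t in zip(attn, tokens)) / total for d in range(dim)]

def merge_tokens(tokens, attn, group_size=2, threshold=0.9):
    """Merge each contiguous group whose mean pairwise cosine similarity >= threshold.

    Assumption: groups are contiguous windows of `group_size` tokens, and a merged
    token inherits the summed attention of its members.
    """
    merged, merged_attn = [], []
    for start in range(0, len(tokens), group_size):
        group = tokens[start:start + group_size]
        scores = attn[start:start + group_size]
        pairs = [(i, j) for i in range(len(group)) for j in range(i + 1, len(group))]
        if pairs:
            mean_sim = sum(cosine(group[i], group[j]) for i, j in pairs) / len(pairs)
        else:
            mean_sim = 0.0  # a single-token group is never merged
        if pairs and mean_sim >= threshold:
            merged.append(attention_weighted_merge(group, scores))
            merged_attn.append(sum(scores))
        else:
            merged.extend(group)
            merged_attn.extend(scores)
    return merged, merged_attn

# Toy input: the first two tokens point in the same direction, the last two do not.
tokens = [[2.0, 0.0], [4.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
attn = [3.0, 1.0, 1.0, 1.0]
out_tokens, out_attn = merge_tokens(tokens, attn)
print(out_tokens, out_attn)
```

In this toy input the first pair collapses into a single token that sits closer to the higher-attention member, while the dissimilar second pair is left untouched.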
3.3. Attention-Similarity Based Token Pruning
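The Introduction describes pruning as diversity-driven selection under a kernel that integrates similarity and attention. One common way to realise such a selection is greedy MAP inference for a determinantal point process. The sketch below takes that route under the assumption that the kernel is L[i][j] = a_i * cos(x_i, x_j) * a_j. Both the kernel form and the function names are illustrative and are not taken from the paper.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all-zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0

def build_kernel(tokens, attn):
    """Assumed kernel: attention scales each token, cosine similarity couples pairs."""
    n = len(tokens)
    return [[attn[i] * attn[j] * cosine(tokens[i], tokens[j]) for j in range(n)]
            for i in range(n)]

def greedy_diverse_select(kernel, k, eps=1e-10):
    """Greedy DPP MAP: repeatedly add the token with the largest marginal gain.

    gain[i] is the Schur complement of token i given the tokens chosen so far;
    factors[i] holds the incremental Cholesky row used to update it.
    """
    n = len(kernel)
    gain = [kernel[i][i] for i in range(n)]
    factors = [[] for _ in range(n)]
    chosen, candidates = [], set(range(n))
    while candidates and len(chosen) < k:
        # Ties are broken towards the smaller index to keep the result deterministic.
        j = max(candidates, key=lambda i: (gain[i], -i))
        if gain[j] <= eps:  # every remaining token is redundant given the chosen set
            break
        chosen.append(j)
        candidates.remove(j)
        root = math.sqrt(gain[j])
        for i in candidates:
            proj = sum(a * b for a, b in zip(factors[j], factors[i]))
            e = (kernel[j][i] - proj) / root
            factors[i].append(e)
            gain[i] -= e * e
    return sorted(chosen)  # sorted so the kept tokens stay in temporal order

# Token 1 duplicates token 0; token 2 is orthogonal but has lower attention.
tokens = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
attn = [1.0, 0.9, 0.5]
kept = greedy_diverse_select(build_kernel(tokens, attn), k=2)
print(kept)
```

Ranking by attention alone would keep tokens 0 and 1 here. The kernel-based selection instead drops the duplicate and keeps the orthogonal token, which is the behaviour a diversity-driven criterion is meant to produce.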
3.4. Adaptive Merging-Pruning Fusion Framework
3.5. Optional: Trainable Compression Module
4. Experiments
4.1. Experiment Setting
4.2. Main Results
| Method | LibriSpeech dev_clean | LibriSpeech dev_other | LibriSpeech test_clean | LibriSpeech test_other | Fleurs zh | Fleurs en | AISHELL-1 | AISHELL-2 | WenetSpeech test-meeting | WenetSpeech test-net | Average |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Qwen2-Audio | 1.67 | 3.65 | 1.74 | 4.03 | 3.63 | 5.20 | 1.52 | 3.08 | 8.40 | 7.64 | 4.06 |
| Retain 60% Tokens (40% Compression Ratio) | |||||||||||
| VisionZip | 7.31 | 10.35 | 7.08 | 10.10 | 8.02 | 6.85 | 6.99 | 13.85 | 17.88 | 23.75 | 11.22 |
| VisPruner | 7.42 | 9.74 | 7.20 | 9.69 | 7.00 | 8.39 | 6.91 | 9.67 | 13.37 | 14.07 | 9.35 |
| CDPruner | 4.22 | 6.05 | 4.18 | 6.53 | 4.88 | 7.17 | 2.70 | 4.62 | 12.29 | 10.90 | 6.35 |
| A-ToMe | 4.12 | 6.81 | 4.20 | 6.98 | 8.00 | 8.13 | 4.18 | 5.56 | 14.05 | 14.15 | 7.62 |
| FastAdaSP | 4.91 | 7.26 | 4.95 | 7.51 | 5.47 | 7.31 | 3.28 | 4.69 | 11.51 | 12.30 | 6.92 |
| SA-MAP | 2.59 | 5.00 | 2.72 | 5.02 | 4.37 | 5.94 | 2.69 | 4.42 | 11.05 | 10.11 | 5.39 |
| Kimi-Audio | 1.23 | 2.39 | 1.38 | 2.45 | 2.87 | 4.92 | 0.61 | 2.57 | 6.33 | 5.39 | 3.01 |
| Retain 60% Tokens (40% Compression Ratio) | |||||||||||
| VisionZip | 6.35 | 7.94 | 5.93 | 7.71 | 7.63 | 8.91 | 6.84 | 10.01 | 14.65 | 19.51 | 9.55 |
| VisPruner | 5.36 | 6.90 | 5.07 | 6.75 | 5.73 | 7.82 | 4.36 | 7.65 | 13.70 | 18.56 | 8.19 |
| CDPruner | 5.70 | 6.96 | 5.44 | 6.78 | 6.73 | 8.66 | 5.07 | 9.21 | 14.95 | 18.41 | 8.79 |
| A-ToMe | 4.13 | 6.04 | 3.92 | 5.87 | 6.26 | 7.57 | 3.58 | 6.88 | 12.97 | 14.99 | 7.22 |
| FastAdaSP | 4.49 | 6.15 | 4.28 | 6.02 | 4.27 | 7.64 | 2.22 | 5.67 | 12.62 | 14.75 | 6.81 |
| SA-MAP | 2.32 | 3.67 | 2.49 | 3.75 | 4.37 | 6.26 | 2.05 | 3.99 | 11.47 | 12.35 | 5.27 |
| GLM-ASR-Nano | 2.14 | 4.05 | 2.18 | 4.53 | 3.44 | 4.11 | 2.47 | 3.48 | 8.43 | 6.65 | 4.15 |
| Retain 70% Tokens (30% Compression Ratio) | |||||||||||
| VisionZip | 9.18 | 12.75 | 8.27 | 13.27 | 8.17 | 8.08 | 13.90 | 20.55 | 19.36 | 20.45 | 13.40 |
| VisPruner | 9.22 | 12.33 | 8.34 | 12.13 | 6.95 | 8.33 | 11.68 | 14.57 | 18.11 | 16.56 | 11.82 |
| CDPruner | 5.38 | 7.75 | 5.28 | 7.76 | 4.99 | 5.93 | 4.88 | 5.94 | 14.21 | 11.70 | 7.38 |
| A-ToMe | 4.55 | 8.04 | 4.66 | 7.83 | 5.08 | 5.77 | 5.40 | 5.95 | 13.16 | 12.08 | 7.25 |
| FastAdaSP | 6.45 | 10.34 | 6.50 | 10.41 | 4.34 | 6.70 | 6.42 | 7.02 | 16.10 | 14.10 | 8.84 |
| SA-MAP | 3.41 | 5.53 | 3.43 | 5.55 | 3.76 | 4.83 | 3.53 | 4.52 | 12.07 | 9.60 | 5.62 |

| Method | MELD | mmau-test-mini music | mmau-test-mini sound | mmau-test-mini speech | Nonspeech7k | TUT2017 | VocalSound | Average |
|---|---|---|---|---|---|---|---|---|
| Qwen2-Audio | 51.19 | 60.48 | 72.07 | 60.36 | 86.48 | 32.35 | 88.00 | 64.42 |
| Retain 50% Tokens (50% Compression Ratio) | ||||||||
| VisionZip | 51.73 | 60.48 | 72.07 | 61.56 | 85.38 | 31.73 | 86.33 | 64.18 |
| VisPruner | 50.88 | 61.68 | 73.87 | 62.16 | 84.69 | 30.74 | 86.22 | 64.32 |
| CDPruner | 51.61 | 61.38 | 71.77 | 61.26 | 86.62 | 31.73 | 87.50 | 64.55 |
| A-ToMe | 50.15 | 61.38 | 71.17 | 56.46 | 83.45 | 29.20 | 84.32 | 62.30 |
| FastAdaSP | 50.81 | 59.88 | 72.07 | 59.46 | 83.86 | 31.23 | 85.96 | 63.32 |
| SA-MAP | 51.96 | 61.38 | 72.07 | 63.06 | 85.10 | 32.53 | 87.64 | 64.82 |
| Kimi-Audio | 57.48 | 63.77 | 79.28 | 65.77 | 93.10 | 61.11 | 94.24 | 73.54 |
| Retain 50% Tokens (50% Compression Ratio) | ||||||||
| VisionZip | 57.78 | 63.47 | 79.88 | 61.86 | 90.21 | 60.56 | 93.23 | 72.43 |
| VisPruner | 56.67 | 65.57 | 79.88 | 63.66 | 90.21 | 60.25 | 93.21 | 72.78 |
| CDPruner | 57.40 | 63.17 | 79.58 | 65.17 | 89.52 | 60.31 | 93.15 | 72.61 |
| A-ToMe | 57.25 | 64.67 | 79.28 | 63.66 | 92.14 | 60.68 | 93.40 | 73.01 |
| FastAdaSP | 56.79 | 63.47 | 79.88 | 66.37 | 90.48 | 60.37 | 93.34 | 72.96 |
| SA-MAP | 56.40 | 65.57 | 79.88 | 66.37 | 90.76 | 60.19 | 93.43 | 73.23 |
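
The Average column of the ASR table appears to be the unweighted mean of the ten per-test-set scores, and each compression ratio is the complement of the retained-token percentage. The short check below uses two Qwen2-Audio rows copied from that table. The helper names are illustrative, and the unweighted-mean reading is an inference from these rows rather than something the table states.

```python
from statistics import mean

# Rows copied from the ASR table: Qwen2-Audio baseline and SA-MAP at 60% retained tokens.
qwen2_audio_base = [1.67, 3.65, 1.74, 4.03, 3.63, 5.20, 1.52, 3.08, 8.40, 7.64]
sa_map_60 = [2.59, 5.00, 2.72, 5.02, 4.37, 5.94, 2.69, 4.42, 11.05, 10.11]

def compression_pct(retain_pct):
    """Compression ratio (in percent) implied by a retained-token percentage."""
    return 100 - retain_pct

def relative_change_pct(compressed, base):
    """Relative change of the compressed score with respect to the baseline, in percent."""
    return 100.0 * (compressed - base) / base

base_avg = round(mean(qwen2_audio_base), 2)  # matches the table's 4.06
sa_avg = round(mean(sa_map_60), 2)           # matches the table's 5.39

print(compression_pct(60), base_avg, sa_avg)
print(round(relative_change_pct(sa_avg, base_avg), 1))
```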
4.3. Trainable Compression Results
4.4. Ablation Study
5. Conclusion
Author Contributions
Funding
Conflicts of Interest
Appendix A. Algorithm
Algorithm A1: Adaptive Merging-Pruning Fusion Method
Appendix B. Full Experiments Results
| Method | LibriSpeech dev_clean | LibriSpeech dev_other | LibriSpeech test_clean | LibriSpeech test_other | Fleurs zh | Fleurs en | AISHELL-1 | AISHELL-2-ios | WenetSpeech test-meeting | WenetSpeech test-net | Average |
|---|---|---|---|---|---|---|---|---|---|---|---|
| base (Qwen2-Audio) | 1.67 | 3.65 | 1.74 | 4.03 | 3.63 | 5.20 | 1.52 | 3.08 | 8.40 | 7.64 | 4.06 |
| Retain 70% Tokens (30% Compression Ratio) | |||||||||||
| VisionZip | 3.16 | 5.68 | 3.26 | 5.82 | 4.90 | 5.53 | 3.31 | 7.70 | 12.61 | 15.98 | 6.80 |
| VisPruner | 3.93 | 5.89 | 3.81 | 6.00 | 4.70 | 6.08 | 3.10 | 5.14 | 10.35 | 10.22 | 5.92 |
| CDPruner | 2.58 | 4.35 | 2.59 | 4.75 | 3.96 | 5.99 | 1.83 | 3.51 | 9.77 | 8.89 | 4.82 |
| A-ToMe | 2.54 | 4.86 | 2.59 | 5.00 | 4.93 | 6.09 | 2.26 | 3.86 | 10.35 | 10.29 | 5.28 |
| FastAdaSP | 2.44 | 4.54 | 2.54 | 4.83 | 4.00 | 5.62 | 1.94 | 3.43 | 9.53 | 9.10 | 4.80 |
| SA-MAP | 1.98 | 4.07 | 2.00 | 4.26 | 3.67 | 5.39 | 1.83 | 3.70 | 9.26 | 8.71 | 4.49 |
| Retain 60% Tokens (40% Compression Ratio) | |||||||||||
| VisionZip | 7.31 | 10.35 | 7.08 | 10.10 | 8.02 | 6.85 | 6.99 | 13.85 | 17.88 | 23.75 | 11.22 |
| VisPruner | 7.42 | 9.74 | 7.20 | 9.69 | 7.00 | 8.39 | 6.91 | 9.67 | 13.37 | 14.07 | 9.35 |
| CDPruner | 4.22 | 6.05 | 4.18 | 6.53 | 4.88 | 7.17 | 2.70 | 4.62 | 12.29 | 10.90 | 6.35 |
| A-ToMe | 4.12 | 6.81 | 4.20 | 6.98 | 8.00 | 8.13 | 4.18 | 5.56 | 14.05 | 14.15 | 7.62 |
| FastAdaSP | 4.91 | 7.26 | 4.95 | 7.51 | 5.47 | 7.31 | 3.28 | 4.69 | 11.51 | 12.30 | 6.92 |
| SA-MAP | 2.59 | 5.00 | 2.72 | 5.02 | 4.37 | 5.94 | 2.69 | 4.42 | 11.05 | 10.11 | 5.39 |
| Retain 50% Tokens (50% Compression Ratio) | |||||||||||
| VisionZip | 18.17 | 20.45 | 15.92 | 20.80 | 14.42 | 10.06 | 14.04 | 22.42 | 27.52 | 34.19 | 19.80 |
| VisPruner | 15.44 | 18.33 | 15.11 | 18.05 | 12.34 | 14.41 | 15.42 | 19.89 | 19.75 | 22.19 | 17.09 |
| CDPruner | 7.80 | 9.88 | 7.95 | 10.12 | 9.34 | 10.55 | 6.83 | 8.20 | 16.08 | 19.13 | 10.59 |
| A-ToMe | 7.79 | 10.96 | 7.98 | 11.19 | 21.31 | 13.81 | 11.05 | 12.10 | 21.40 | 22.10 | 13.97 |
| FastAdaSP | 11.78 | 14.77 | 11.96 | 14.82 | 10.51 | 12.82 | 8.24 | 9.39 | 17.21 | 20.85 | 13.24 |
| SA-MAP | 4.49 | 7.20 | 4.53 | 7.24 | 5.53 | 7.49 | 4.41 | 7.41 | 13.67 | 14.02 | 7.60 |

| Method | LibriSpeech dev_clean | LibriSpeech dev_other | LibriSpeech test_clean | LibriSpeech test_other | Fleurs zh | Fleurs en | AISHELL-1 | AISHELL-2-ios | WenetSpeech test-meeting | WenetSpeech test-net | Average |
|---|---|---|---|---|---|---|---|---|---|---|---|
| base (GLM-ASR-Nano) | 2.14 | 4.05 | 2.18 | 4.53 | 3.44 | 4.11 | 2.47 | 3.48 | 8.43 | 6.65 | 4.15 |
| Retain 70% Tokens (30% Compression Ratio) | |||||||||||
| VisionZip | 9.18 | 12.75 | 8.27 | 13.27 | 8.17 | 8.08 | 13.90 | 20.55 | 19.36 | 20.45 | 13.40 |
| VisPruner | 9.22 | 12.33 | 8.34 | 12.13 | 6.95 | 8.33 | 11.68 | 14.57 | 18.11 | 16.56 | 11.82 |
| CDPruner | 5.38 | 7.75 | 5.28 | 7.76 | 4.99 | 5.93 | 4.88 | 5.94 | 14.21 | 11.70 | 7.38 |
| A-ToMe | 4.55 | 8.04 | 4.66 | 7.83 | 5.08 | 5.77 | 5.40 | 5.95 | 13.16 | 12.08 | 7.25 |
| FastAdaSP | 6.45 | 10.34 | 6.50 | 10.41 | 4.34 | 6.70 | 6.42 | 7.02 | 16.10 | 14.10 | 8.84 |
| SA-MAP | 3.41 | 5.53 | 3.43 | 5.55 | 3.76 | 4.83 | 3.53 | 4.52 | 12.07 | 9.60 | 5.62 |
| Retain 60% Tokens (40% Compression Ratio) | |||||||||||
| VisionZip | 19.18 | 24.16 | 17.91 | 24.27 | 15.66 | 16.22 | 27.40 | 36.09 | 30.56 | 32.89 | 24.43 |
| VisPruner | 18.02 | 22.68 | 17.01 | 22.06 | 12.48 | 15.27 | 22.82 | 28.81 | 27.46 | 26.78 | 21.34 |
| CDPruner | 9.55 | 12.53 | 9.88 | 11.90 | 6.68 | 9.38 | 8.40 | 9.78 | 21.10 | 16.53 | 11.57 |
| A-ToMe | 8.24 | 12.51 | 8.39 | 12.19 | 8.37 | 8.23 | 9.27 | 10.48 | 18.00 | 16.85 | 11.25 |
| FastAdaSP | 13.23 | 17.52 | 13.55 | 17.60 | 7.84 | 7.46 | 8.61 | 9.46 | 17.60 | 15.88 | 12.88 |
| SA-MAP | 6.17 | 8.48 | 6.24 | 8.08 | 4.81 | 6.54 | 5.84 | 7.52 | 16.29 | 13.82 | 8.38 |

| Method | LibriSpeech dev_clean | LibriSpeech dev_other | LibriSpeech test_clean | LibriSpeech test_other | Fleurs zh | Fleurs en | AISHELL-1 | AISHELL-2-ios | WenetSpeech test-meeting | WenetSpeech test-net | Average |
|---|---|---|---|---|---|---|---|---|---|---|---|
| base (Kimi-Audio) | 1.23 | 2.39 | 1.38 | 2.45 | 2.87 | 4.92 | 0.61 | 2.57 | 6.33 | 5.39 | 3.01 |
| Retain 70% Tokens (30% Compression Ratio) | |||||||||||
| VisionZip | 2.83 | 4.22 | 2.77 | 4.17 | 4.46 | 6.09 | 2.71 | 5.70 | 9.92 | 12.59 | 5.55 |
| VisPruner | 2.54 | 3.68 | 2.52 | 3.83 | 3.87 | 6.17 | 1.67 | 4.36 | 9.40 | 11.50 | 4.95 |
| CDPruner | 2.72 | 3.87 | 2.60 | 3.90 | 4.38 | 6.42 | 1.81 | 4.75 | 10.14 | 11.48 | 5.21 |
| A-ToMe | 2.09 | 3.54 | 2.07 | 3.52 | 3.74 | 5.60 | 1.22 | 3.73 | 8.89 | 9.67 | 4.41 |
| FastAdaSP | 2.00 | 3.40 | 2.12 | 3.38 | 3.22 | 5.76 | 0.95 | 3.37 | 8.68 | 8.98 | 4.19 |
| SA-MAP | 1.48 | 2.64 | 1.50 | 2.73 | 3.36 | 5.11 | 1.03 | 3.28 | 8.66 | 9.97 | 3.98 |
| Retain 60% Tokens (40% Compression Ratio) | |||||||||||
| VisionZip | 6.35 | 7.94 | 5.93 | 7.71 | 7.63 | 8.91 | 6.84 | 10.01 | 14.65 | 19.51 | 9.55 |
| VisPruner | 5.36 | 6.90 | 5.07 | 6.75 | 5.73 | 7.82 | 4.36 | 7.65 | 13.70 | 18.56 | 8.19 |
| CDPruner | 5.70 | 6.96 | 5.44 | 6.78 | 6.73 | 8.66 | 5.07 | 9.21 | 14.95 | 18.41 | 8.79 |
| A-ToMe | 4.13 | 6.04 | 3.92 | 5.87 | 6.26 | 7.57 | 3.58 | 6.88 | 12.97 | 14.99 | 7.22 |
| FastAdaSP | 4.49 | 6.15 | 4.28 | 6.02 | 4.27 | 7.64 | 2.22 | 5.67 | 12.62 | 14.75 | 6.81 |
| SA-MAP | 2.32 | 3.67 | 2.49 | 3.75 | 4.37 | 6.26 | 2.05 | 3.99 | 11.47 | 12.35 | 5.27 |
| Retain 50% Tokens (50% Compression Ratio) | |||||||||||
| VisionZip | 13.75 | 15.53 | 12.44 | 15.00 | 13.70 | 13.84 | 14.00 | 17.35 | 22.64 | 29.64 | 16.79 |
| VisPruner | 11.90 | 14.30 | 11.47 | 13.41 | 10.38 | 13.20 | 10.88 | 14.92 | 22.03 | 29.29 | 15.18 |
| CDPruner | 12.79 | 14.70 | 12.14 | 14.31 | 11.75 | 14.27 | 13.47 | 18.90 | 23.23 | 29.62 | 16.52 |
| A-ToMe | 8.66 | 11.62 | 8.31 | 11.24 | 14.66 | 11.55 | 10.04 | 12.44 | 20.31 | 22.48 | 13.13 |
| FastAdaSP | 10.71 | 12.98 | 10.36 | 12.20 | 7.17 | 11.87 | 7.24 | 11.15 | 20.37 | 25.07 | 12.91 |
| SA-MAP | 5.19 | 6.93 | 5.13 | 6.72 | 6.81 | 8.50 | 5.06 | 6.52 | 17.12 | 20.36 | 8.83 |

Audio Understanding

| Method | MELD | mmau-test-mini music | mmau-test-mini sound | mmau-test-mini speech | Nonspeech7k | TUT2017 | VocalSound | Average |
|---|---|---|---|---|---|---|---|---|
| base (Qwen2-Audio) | 51.19 | 60.48 | 72.07 | 60.36 | 86.48 | 32.35 | 88.00 | 64.42 |
| Retain 60% Tokens (40% Compression Ratio) | ||||||||
| VisionZip | 51.38 | 59.88 | 71.77 | 61.86 | 86.48 | 32.28 | 87.33 | 64.43 |
| VisPruner | 52.30 | 60.18 | 72.97 | 62.16 | 86.21 | 31.05 | 86.69 | 64.51 |
| CDPruner | 50.88 | 61.08 | 73.57 | 60.96 | 87.17 | 32.65 | 87.94 | 64.89 |
| A-ToMe | 51.46 | 60.78 | 70.27 | 61.86 | 84.83 | 31.73 | 86.44 | 63.91 |
| FastAdaSP | 50.50 | 59.88 | 72.37 | 62.16 | 85.93 | 32.41 | 87.11 | 64.34 |
| SA-MAP | 51.69 | 61.08 | 73.27 | 62.46 | 87.31 | 32.84 | 87.97 | 65.23 |
| Retain 50% Tokens (50% Compression Ratio) | ||||||||
| VisionZip | 51.73 | 60.48 | 72.07 | 61.56 | 85.38 | 31.73 | 86.33 | 64.18 |
| VisPruner | 50.88 | 61.68 | 73.87 | 62.16 | 84.69 | 30.74 | 86.22 | 64.32 |
| CDPruner | 51.61 | 61.38 | 71.77 | 61.26 | 86.62 | 31.73 | 87.50 | 64.55 |
| A-ToMe | 50.15 | 61.38 | 71.17 | 56.46 | 83.45 | 29.20 | 84.32 | 62.30 |
| FastAdaSP | 50.81 | 59.88 | 72.07 | 59.46 | 83.86 | 31.23 | 85.96 | 63.32 |
| SA-MAP | 51.96 | 61.38 | 72.07 | 63.06 | 85.10 | 32.53 | 87.64 | 64.82 |
| Retain 30% Tokens (70% Compression Ratio) | ||||||||
| VisionZip | 51.53 | 60.78 | 72.97 | 61.26 | 77.66 | 29.20 | 78.08 | 61.64 |
| VisPruner | 50.08 | 60.78 | 72.67 | 61.56 | 70.90 | 28.77 | 84.91 | 61.38 |
| CDPruner | 50.84 | 58.68 | 72.07 | 60.96 | 81.38 | 31.36 | 85.55 | 62.98 |
| A-ToMe | 48.73 | 61.38 | 70.87 | 48.65 | 67.72 | 27.04 | 79.78 | 57.74 |
| FastAdaSP | 48.66 | 59.58 | 71.77 | 51.95 | 69.24 | 28.40 | 79.00 | 58.37 |
| SA-MAP | 50.42 | 60.78 | 71.77 | 61.86 | 79.17 | 32.04 | 87.25 | 63.33 |

Audio Understanding

| Method | MELD | mmau-test-mini music | mmau-test-mini sound | mmau-test-mini speech | Nonspeech7k | TUT2017 | VocalSound | Average |
|---|---|---|---|---|---|---|---|---|
| base (Kimi-Audio) | 57.48 | 63.77 | 79.28 | 65.77 | 93.10 | 61.11 | 94.24 | 73.54 |
| Retain 60% Tokens (40% Compression Ratio) | ||||||||
| VisionZip | 58.59 | 64.67 | 79.88 | 65.17 | 91.45 | 60.80 | 93.62 | 73.45 |
| VisPruner | 57.63 | 64.97 | 79.28 | 65.77 | 92.00 | 60.49 | 93.71 | 73.41 |
| CDPruner | 57.86 | 62.57 | 78.98 | 64.56 | 91.17 | 60.80 | 93.68 | 72.80 |
| A-ToMe | 57.82 | 64.97 | 78.38 | 68.17 | 92.28 | 60.06 | 93.90 | 73.65 |
| FastAdaSP | 57.63 | 64.67 | 80.78 | 64.86 | 91.17 | 60.43 | 93.71 | 73.32 |
| SA-MAP | 57.25 | 66.47 | 80.78 | 66.07 | 91.31 | 60.49 | 93.73 | 73.73 |
| Retain 50% Tokens (50% Compression Ratio) | ||||||||
| VisionZip | 57.78 | 63.47 | 79.88 | 61.86 | 90.21 | 60.56 | 93.23 | 72.43 |
| VisPruner | 56.67 | 65.57 | 79.88 | 63.66 | 90.21 | 60.25 | 93.21 | 72.78 |
| CDPruner | 57.40 | 63.17 | 79.58 | 65.17 | 89.52 | 60.31 | 93.15 | 72.61 |
| A-ToMe | 57.25 | 64.67 | 79.28 | 63.66 | 92.14 | 60.68 | 93.40 | 73.01 |
| FastAdaSP | 56.79 | 63.47 | 79.88 | 66.37 | 90.48 | 60.37 | 93.34 | 72.96 |
| SA-MAP | 56.40 | 65.57 | 79.88 | 66.37 | 90.76 | 60.19 | 93.43 | 73.23 |
| Retain 30% Tokens (70% Compression Ratio) | ||||||||
| VisionZip | 53.87 | 64.97 | 78.98 | 60.66 | 86.90 | 57.84 | 92.76 | 70.85 |
| VisPruner | 53.91 | 64.67 | 79.58 | 63.06 | 86.90 | 58.64 | 92.98 | 71.39 |
| CDPruner | 54.03 | 65.27 | 79.28 | 60.96 | 86.76 | 59.01 | 92.06 | 71.05 |
| A-ToMe | 54.37 | 65.57 | 78.68 | 62.76 | 89.79 | 57.22 | 90.39 | 71.25 |
| FastAdaSP | 54.41 | 62.87 | 77.18 | 61.86 | 87.45 | 58.15 | 91.56 | 70.50 |
| SA-MAP | 55.37 | 64.07 | 79.88 | 65.47 | 86.62 | 55.99 | 92.76 | 71.45 |
Appendix C. Trainable Compression Module

| Method | LibriSpeech dev_clean | LibriSpeech dev_other | LibriSpeech test_clean | LibriSpeech test_other | Average |
|---|---|---|---|---|---|
| Qwen2-Audio | 1.67 | 3.65 | 1.74 | 4.03 | 2.77 |
| Retain 70% Tokens (30% Compression Ratio) | |||||
| SA-MAP | 1.98 | 4.07 | 2.00 | 4.26 | 3.08 |
| SA-MAP (tuning) | 1.81 | 3.88 | 1.96 | 4.05 | 2.93 |
| Retain 60% Tokens (40% Compression Ratio) | |||||
| SA-MAP | 2.59 | 5.00 | 2.72 | 5.02 | 3.83 |
| SA-MAP (tuning) | 2.16 | 4.24 | 2.27 | 4.58 | 3.31 |
| Retain 50% Tokens (50% Compression Ratio) | |||||
| SA-MAP | 4.49 | 7.20 | 4.53 | 7.24 | 5.87 |
| SA-MAP (tuning) | 3.30 | 5.81 | 3.54 | 6.03 | 4.67 |
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).