Submitted: 25 February 2025
Posted: 26 February 2025
Abstract
Keywords:
1. Introduction
2. Background and Preliminaries
2.1. Transformer Models and Their Computational Complexity
2.2. Model Compression Techniques for Transformers
- Quantization: Reducing the precision of model parameters from 32-bit floating point to lower-bit representations (e.g., 8-bit or 4-bit) significantly decreases memory usage and accelerates inference; a minimal sketch follows this list.
- Knowledge distillation: A smaller student model is trained to mimic the behavior of a larger teacher model, achieving efficiency without significant loss in accuracy.
- Low-rank factorization: Decomposing weight matrices into low-rank components reduces the number of parameters and computational complexity [12].
- Efficient architectures: Specialized architectures such as Longformer, Linformer, and Reformer modify the Transformer structure to handle longer sequences with reduced computational overhead [13].
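To ground the quantization entry above, the following minimal sketch (our illustration, not a method from the surveyed works) applies PyTorch's post-training dynamic quantization to a toy feed-forward block standing in for one Transformer FFN sub-layer; the layer sizes and tensor shapes are arbitrary assumptions.

```python
import torch
import torch.nn as nn

# Toy feed-forward block standing in for a Transformer FFN sub-layer.
model = nn.Sequential(
    nn.Linear(256, 1024),
    nn.ReLU(),
    nn.Linear(1024, 256),
)
model.eval()

# Post-training dynamic quantization: nn.Linear weights go from float32 to int8.
quantized = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

x = torch.randn(8, 128, 256)   # (batch, tokens, hidden) -- arbitrary shapes
with torch.no_grad():
    out = quantized(x)
print(out.shape)               # torch.Size([8, 128, 256])
```

Dynamic quantization quantizes activations on the fly at inference time and therefore needs no calibration data; static and 4-bit schemes trade extra setup for larger savings.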
2.3. Definition and Taxonomy of Token Pruning
- Soft token pruning: Instead of completely discarding tokens, their contributions are downweighted or aggregated into fewer representations. This allows the model to retain some information while reducing the effective sequence length.
- Adaptive token pruning: The decision to prune is made dynamically based on the importance of each token, often using learnable gating mechanisms or attention-based importance scores [17]; see the sketch after this list.
- Hybrid approaches: Some methods combine token pruning with other compression techniques, such as knowledge distillation or weight pruning, to maximize efficiency.
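As a concrete (and deliberately simplified) instance of attention-guided, adaptive pruning, the sketch below keeps the top-k tokens of a layer ranked by the attention mass they receive; the function name, keep ratio, and tensor shapes are our own assumptions rather than any specific published method.

```python
import torch

def prune_by_attention(hidden, attn, keep_ratio=0.5):
    """Keep the top-k tokens ranked by the attention they receive (illustrative only).

    hidden: (batch, seq_len, dim) token representations from one layer
    attn:   (batch, heads, seq_len, seq_len) attention weights from the same layer
    """
    _, seq_len, dim = hidden.shape
    k = max(1, int(seq_len * keep_ratio))

    # Importance of token j = attention it receives, averaged over heads and queries.
    importance = attn.mean(dim=1).mean(dim=1)                      # (batch, seq_len)

    # Keep the k most-attended tokens, preserving their original order.
    topk = importance.topk(k, dim=-1).indices.sort(dim=-1).values  # (batch, k)
    pruned = torch.gather(hidden, 1, topk.unsqueeze(-1).expand(-1, -1, dim))
    return pruned, topk

# Toy usage with random tensors.
hidden = torch.randn(2, 16, 64)
attn = torch.softmax(torch.randn(2, 4, 16, 16), dim=-1)
pruned, kept = prune_by_attention(hidden, attn)
print(pruned.shape)   # torch.Size([2, 8, 64])
```

A soft variant would downweight or merge the discarded tokens rather than drop them outright, as described in the soft-pruning entry above.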
2.4. Challenges in Token Pruning
- Token importance estimation: Accurately determining which tokens to prune in a computationally efficient manner is nontrivial. Many approaches rely on attention scores, gradient-based saliency measures, or reinforcement learning; a gradient-saliency sketch follows this list.
- Preserving model accuracy: Aggressive token pruning can lead to performance degradation, particularly for tasks requiring fine-grained token interactions such as named entity recognition and machine translation [18].
- Implementation complexity: Unlike weight pruning, which can be applied statically, token pruning requires runtime modifications to the model’s execution graph, making deployment more complex.
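As an illustration of the gradient-based saliency measures mentioned in the token-importance bullet above, the sketch below scores tokens with an input-times-gradient heuristic on a toy classifier; the embedding size, sequence length, and keep count are illustrative assumptions, not values from the surveyed papers.

```python
import torch
import torch.nn as nn

# Toy setup: score tokens by |embedding * gradient| (input-times-gradient saliency).
embed = nn.Embedding(1000, 64)
classifier = nn.Linear(64, 2)

tokens = torch.randint(0, 1000, (1, 16))   # (batch, seq_len)
labels = torch.tensor([1])

x = embed(tokens)                          # (1, 16, 64); non-leaf, so retain its grad
x.retain_grad()
logits = classifier(x.mean(dim=1))         # mean-pool tokens, then classify
loss = nn.functional.cross_entropy(logits, labels)
loss.backward()

saliency = (x * x.grad).abs().sum(dim=-1)  # (1, 16): one score per token
keep = saliency.topk(8, dim=-1).indices    # indices of the 8 most salient tokens
print(keep)
```

The extra backward pass is precisely the kind of overhead the importance-estimation bullet refers to, which is one reason attention-score criteria are often preferred at inference time.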
2.5. Scope and Organization of the Survey
3. Token Pruning Methodologies
3.1. Rule-Based Token Pruning
3.1.1. Attention Score-Based Pruning
3.1.2. Entropy-Based Pruning
3.1.3. Fixed-Length Pruning
3.2. Learning-Based Token Pruning
3.2.1. Reinforcement Learning-Based Pruning
3.2.2. Gated Token Pruning
3.2.3. Saliency-Based Pruning
3.3. Hybrid Token Pruning Approaches
3.3.1. Progressive Token Pruning
3.3.2. Multi-Stage Pruning
3.3.3. Integration with Other Compression Techniques
3.4. Comparison of Token Pruning Strategies
3.5. Summary and Insights
4. Effectiveness and Trade-Offs of Token Pruning
4.1. Impact on Model Accuracy
- Effect on Classification Tasks: In text classification tasks, token pruning has been found to work well since many input tokens contribute redundantly to the final decision [43]. Studies have shown that models can retain up to 95% of their accuracy while reducing sequence length by 50%.
- Impact on Sequence Labeling: For tasks such as named entity recognition (NER) and part-of-speech (POS) tagging, aggressive token pruning may lead to loss of fine-grained information, as every token contributes to the final output [44]. Adaptive pruning methods often perform better in such cases.
- Challenges in Generative Tasks: Tasks such as machine translation and text generation are particularly sensitive to token pruning, as the quality of generated text depends on maintaining long-range dependencies. In these cases, softer pruning strategies such as saliency-based approaches are often more effective [45].
4.2. Computational Efficiency Gains
- Reduction in FLOPs: Studies have shown that token pruning can reduce floating point operations (FLOPs) by up to 60% while maintaining comparable accuracy [49].
- Inference Speedup: Models with token pruning achieve 1.5× to 3× speedups on real-world benchmarks without requiring additional hardware modifications [50].
- Memory Savings: Since self-attention layers have quadratic complexity with respect to sequence length, reducing the number of tokens directly decreases memory consumption, making models more feasible for deployment on edge devices.
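A back-of-envelope calculation makes the quadratic argument concrete; the 2·n²·d matmul cost estimate and the example sequence lengths below are standard rough figures, not measurements from the cited studies.

```python
def attention_flops(n, d, layers=12):
    """Rough FLOPs of the QK^T and AV matmuls per forward pass
    (about 2*n*n*d each per layer); projections and the FFN are ignored."""
    return layers * (2 * n * n * d + 2 * n * n * d)

full = attention_flops(n=512, d=768)
half = attention_flops(n=256, d=768)   # sequence halved by token pruning
print(f"attention FLOPs ratio: {half / full:.2f}")   # 0.25, i.e. ~4x cheaper attention
```

Because the feed-forward sub-layers scale only linearly in sequence length, whole-model savings are smaller than this 4× attention-only figure, which is consistent with the roughly 60% end-to-end FLOPs reductions reported above.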
4.3. Robustness and Generalization
- Task-Specific Adaptability: Methods like reinforcement learning-based pruning can adapt pruning policies based on task requirements, improving robustness across datasets [51].
- Sensitivity to Domain Shifts: Token pruning strategies trained on one dataset may not generalize well to out-of-domain data. Hybrid approaches that combine multiple selection criteria tend to be more resilient to domain shifts.
- Pruning Stability: Applying token pruning in earlier layers generally results in more stable performance compared to late-layer pruning, as early layers encode redundant representations that can be safely removed.
4.4. Comparison with Other Compression Techniques
4.5. Summary and Key Insights
- Token pruning can achieve up to 60% computational savings with minimal impact on accuracy for classification and retrieval-based tasks.
- Generative and sequence labeling tasks require more careful pruning strategies to avoid loss of critical information.
- Adaptive and hybrid pruning approaches offer greater generalization and robustness compared to static rule-based methods.
- Token pruning is highly complementary to other model compression techniques and can be integrated into broader efficiency frameworks.
5. Applications of Token Pruning
5.1. Natural Language Processing (NLP)
5.1.1. Text Classification
5.1.2. Question Answering (QA)
5.1.3. Machine Translation
5.2. Computer Vision
5.2.1. Image Classification
5.2.2. Object Detection
5.2.3. Video Understanding
5.3. Speech Processing
5.3.1. Automatic Speech Recognition (ASR)
5.3.2. Speaker Identification
5.4. Deployment in Real-World Systems
5.4.1. Edge AI and Mobile Applications
5.4.2. Cloud-Based NLP Services
5.4.3. Energy-Efficient AI
5.5. Summary and Key Insights
- Token pruning significantly accelerates NLP models, particularly for classification, question answering, and retrieval tasks [71].
- Vision transformers benefit from pruning redundant image patches, leading to faster and more efficient object detection and classification [72].
- Speech models use token pruning to reduce audio sequence lengths, improving real-time processing efficiency [73].
- Token pruning is highly beneficial for deploying deep learning models on edge devices, mobile platforms, and cloud environments.
6. Challenges and Future Directions in Token Pruning
6.1. Challenges in Token Pruning
6.1.1. Accuracy vs. Efficiency Trade-Off
6.1.2. Dynamic vs. Static Pruning
6.1.3. Pruning Granularity
6.1.4. Generalization Across Tasks and Domains
6.1.5. Compatibility with Other Efficiency Techniques
6.1.6. Robustness to Adversarial Attacks
6.2. Future Research Directions
6.2.1. Neural Architecture Search for Token Pruning
6.2.2. Self-Supervised Learning for Token Importance Estimation
6.2.3. Multimodal Token Pruning
6.2.4. Hardware-Aware Pruning Strategies
6.2.5. Energy-Efficient and Green AI Pruning
6.3. Summary and Key Takeaways
- The trade-off between efficiency and accuracy remains a core challenge in token pruning research [86].
- Dynamic pruning methods offer adaptability but introduce additional computational complexity.
- Generalization across tasks and domains is a major concern, requiring more flexible pruning mechanisms.
- Future research should explore NAS-based pruning, self-supervised token importance estimation, and multimodal pruning strategies.
- Hardware-aware pruning and energy-efficient AI are promising areas for real-world deployment of token pruning techniques.
7. Conclusion
7.1. Key Contributions of Token Pruning
- Efficiency Gains: Token pruning significantly reduces computational overhead, leading to faster inference times and lower memory consumption, making Transformer models more scalable for real-world deployment.
- Task-Specific Adaptability: Various pruning strategies have been developed to cater to different NLP, vision, and speech processing tasks, demonstrating the versatility of token pruning across multiple domains.
- Integration with Other Efficiency Methods: Token pruning complements other model compression techniques, such as weight pruning, quantization, and knowledge distillation, enabling holistic model optimization.
- Real-World Impact: Token pruning has been successfully deployed in edge computing, cloud-based AI services, and mobile applications, proving its practical value in reducing latency and energy consumption [89].
7.2. Challenges and Open Questions
- The accuracy-efficiency trade-off remains a crucial factor, as excessive pruning may degrade model performance.
- Dynamic pruning techniques introduce additional computational complexity, necessitating more efficient selection mechanisms [90].
- Ensuring robustness and generalization across different datasets and domains is an ongoing challenge in pruning research [91].
- The integration of token pruning with hardware-aware optimizations is an important future direction for maximizing real-world performance gains.
7.3. Future Outlook
- Advances in self-supervised learning and neural architecture search may lead to more intelligent and adaptive token pruning strategies.
- Multimodal pruning techniques can enable efficient processing in models that handle text, images, and speech simultaneously.
- Green AI initiatives will likely drive research toward more energy-efficient token pruning methods, reducing the environmental impact of large-scale deep learning models.
- Hardware-aware token pruning methods tailored for specialized accelerators (e.g., GPUs, TPUs, edge processors) will enhance the deployability of pruned models in real-world applications.
7.4. Final Remarks
References
- Shixing Yu, Tianlong Chen, Jiayi Shen, Huan Yuan, Jianchao Tan, Sen Yang, Ji Liu, and Zhangyang Wang. Unified visual transformer compression. ArXiv, abs/2203.08243, 2022. [CrossRef]
- Saurabh Goyal, Anamitra Roy Choudhury, Saurabh Raje, Venkatesan Chakaravarthy, Yogish Sabharwal, and Ashish Verma. PoWER-BERT: Accelerating BERT inference via progressive word-vector elimination. In Hal Daumé III and Aarti Singh, editors, Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pages 3690–3699. PMLR, 13–18 Jul 2020. URL https://proceedings.mlr.press/v119/goyal20a.html.
- Chris J.C. Burges. From ranknet to lambdarank to lambdamart: An overview. Technical report, June 2010. URL https://www.microsoft.com/en-us/research/publication/from-ranknet-to-lambdarank-to-lambdamart-an-overview/.
- Weihao Yuan, Xiaodong Gu, Zuozhuo Dai, Siyu Zhu, and Ping Tan. Neural window fully-connected crfs for monocular depth estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3916–3925, 2022.
- Sucheng Ren, Zhengqi Gao, Tianyu Hua, Zihui Xue, Yonglong Tian, Shengfeng He, and Hang Zhao. Co-advise: Cross inductive bias distillation. In Proceedings of the IEEE/CVF Conference on computer vision and pattern recognition, pages 16773–16782, 2022.
- Po-Sen Huang, Xiaodong He, Jianfeng Gao, Li Deng, Alex Acero, and Larry Heck. Learning deep structured semantic models for web search using clickthrough data. In Proceedings of the ACM International Conference on Information and Knowledge Management (CIKM), October 2013. URL https://www.microsoft.com/en-us/research/publication/learning-deep-structured-semantic-models-for-web-search-using-clickthrough-data/.
- Hans Thisanke, Chamli Deshan, Kavindu Chamith, et al. Semantic segmentation using vision transformers: A survey. arXiv preprint arXiv:2305.03273, 2023.
- Ken Chatfield, Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. Return of the devil in the details: Delving deep into convolutional nets. arXiv preprint arXiv:1405.3531, 2014.
- Ellen M. Voorhees and Angela Ellis, editors. Proceedings of the Twenty-Eighth Text REtrieval Conference, TREC 2019, Gaithersburg, Maryland, USA, November 13-15, 2019, volume 1250 of NIST Special Publication, 2019. National Institute of Standards and Technology (NIST). URL https://trec.nist.gov/pubs/trec28/trec2019.html.
- Elena Voita, David Talbot, Fedor Moiseev, Rico Sennrich, and Ivan Titov. Analyzing multi-head self-attention: Specialized heads do the heavy lifting, the rest can be pruned. arXiv preprint arXiv:1905.09418, 2019.
- Yassine Zniyed, Thanh Phuong Nguyen, et al. Efficient tensor decomposition-based filter pruning. Neural Networks, 178:106393, 2024.
- Bohan Zhuang, Jing Liu, Zizheng Pan, Haoyu He, Yuetian Weng, and Chunhua Shen. A survey on efficient training of transformers. In Proceedings of the Thirty-Second International Joint Conference on Artificial Intelligence (IJCAI), pages 6823–6831, August 2023. [CrossRef]
- Yury A. Malkov and D. Yashunin. Efficient and robust approximate nearest neighbor search using hierarchical navigable small world graphs. IEEE Transactions on Pattern Analysis and Machine Intelligence, 42:824–836, 2020. [CrossRef]
- Yifei Liu, Mathias Gehrig, Nico Messikommer, et al. Revisiting token pruning for object detection and instance segmentation. arXiv preprint arXiv:2306.07050, 2023.
- Xiaolong Wang, Ross Girshick, Abhinav Gupta, and Kaiming He. Non-local neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 7794–7803, 2018.
- Lu Hou, Zhiqi Huang, Lifeng Shang, Xin Jiang, Xiao Chen, and Qun Liu. Dynabert: Dynamic bert with adaptive width and depth. Advances in Neural Information Processing Systems, 33:9782–9793, 2020.
- Kaleel Mahmood, Rigel Mahmood, and Marten Van Dijk. On the robustness of vision transformers to adversarial examples. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 7838–7847, 2021.
- Yufei Xu, Jing Zhang, Qiming Zhang, and Dacheng Tao. Vitpose: Simple vision transformer baselines for human pose estimation. arXiv preprint arXiv:2204.12484, 2022.
- Sayak Paul and Pin-Yu Chen. Vision transformers are robust learners. In Proceedings of the AAAI conference on Artificial Intelligence, volume 36, pages 2071–2081, 2022.
- Wenhai Wang, Enze Xie, Xiang Li, Deng-Ping Fan, Kaitao Song, Ding Liang, Tong Lu, Ping Luo, and Ling Shao. Pvt v2: Improved baselines with pyramid vision transformer. Computational Visual Media, 8(3):415–424, 2022. [CrossRef]
- Sebastian Hofstätter and Allan Hanbury. Let’s measure run time! extending the ir replicability infrastructure to include performance aspects. SIGIR Open-Source IR Replicability Challenge (OSIRRC), 2019.
- Zhaoyang Huang, Xiaoyu Shi, Chao Zhang, Qiang Wang, Ka Chun Cheung, Hongwei Qin, Jifeng Dai, and Hongsheng Li. Flowformer: A transformer architecture for optical flow. In Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XVII, pages 668–685. Springer, 2022.
- Yuxin Fang, Bencheng Liao, Xinggang Wang, Jiemin Fang, et al. You only look at one sequence: Rethinking transformer in vision through object detection. Advances in Neural Information Processing Systems, 34:26183–26197, 2021.
- Yongming Rao, Zuyan Liu, Wenliang Zhao, et al. Dynamic spatial sparsification for efficient vision transformers and convolutional neural networks. arXiv preprint arXiv:2207.01580, 2022.
- Joel Mackenzie, Zhuyun Dai, Luke Gallagher, and Jamie Callan. Efficiency implications of term weighting for passage retrieval. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 1821–1824, 2020.
- Xianing Chen, Qiong Cao, Yujie Zhong, Jing Zhang, Shenghua Gao, and Dacheng Tao. Dearkd: data-efficient early knowledge distillation for vision transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12052–12062, 2022.
- Yixing Fan, Jiafeng Guo, Yanyan Lan, Jun Xu, Chengxiang Zhai, and Xueqi Cheng. Modeling diverse relevance patterns in ad-hoc retrieval. CoRR, abs/1805.05737, 2018. URL http://arxiv.org/abs/1805.05737.
- Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. Pytorch: An imperative style, high-performance deep learning library. In NeurIPS, 2019.
- Luyu Gao, Zhuyun Dai, Tongfei Chen, Zhen Fan, Benjamin Van Durme, and Jamie Callan. Complementing lexical retrieval with semantic residual embedding, 2020.
- G. Salton, A. Wong, and C. S. Yang. A vector space model for automatic indexing. Commun. ACM, 18(11):613–620, November 1975. ISSN 0001-0782. [CrossRef]
- M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL Visual Object Classes Challenge 2012 (VOC2012) Results, 2012. URL http://www.pascal-network.org/challenges/VOC/voc2012/workshop/index.html.
- Chenyan Xiong, Zhuyun Dai, Jamie Callan, Zhiyuan Liu, and Russell Power. End-to-End Neural Ad-hoc Ranking with Kernel Pooling. In Proceedings of the 40th International ACM SIGIR Conference on Research & Development in Information Retrieval. ACM, 2017.
- Wenhui Wang, Furu Wei, Li Dong, Hangbo Bao, Nan Yang, and Ming Zhou. Minilm: Deep self-attention distillation for task-agnostic compression of pre-trained transformers, 2020.
- Xinyu Ma, Jiafeng Guo, Ruqing Zhang, Yixing Fan, Xiang Ji, and Xueqi Cheng. Prop: Pre-training with representative words prediction for ad-hoc retrieval, 2020.
- Ellen M. Voorhees and Donna K. Harman. TREC: Experiment and Evaluation in Information Retrieval (Digital Libraries and Electronic Publishing). The MIT Press, 2005. ISBN 0262220733.
- Eric Jang, Shixiang Gu, and Ben Poole. Categorical reparameterization with gumbel-softmax. arXiv preprint arXiv:1611.01144, 2016.
- Daniel Bolya, Cheng-Yang Fu, Xiaoliang Dai, Peizhao Zhang, and Judy Hoffman. Hydra attention: Efficient attention with many heads. In Leonid Karlinsky, Tomer Michaeli, and Ko Nishino, editors, Computer Vision – ECCV 2022 Workshops, pages 35–49, Cham, 2023. Springer Nature Switzerland. ISBN 978-3-031-25082-8.
- Lee Xiong, Chenyan Xiong, Ye Li, Kwok-Fung Tang, Jialin Liu, Paul Bennett, Junaid Ahmed, and Arnold Overwijk. Approximate nearest neighbor negative contrastive learning for dense text retrieval, 2020.
- Shin-Jae Lee, Minsoo Jeon, Dongseung Kim, and Andrew Sohn. Partitioned parallel radix sort. Journal of Parallel and Distributed Computing, 62(4):656–668, 2002.
- Zhiwei Hao, Jianyuan Guo, Ding Jia, Kai Han, Yehui Tang, Chao Zhang, Han Hu, and Yunhe Wang. Learning efficient vision transformers via fine-grained manifold distillation. Advances in Neural Information Processing Systems, 35:9164–9175, 2022.
- Sheng-Chieh Lin, Jheng-Hong Yang, and Jimmy Lin. Distilling dense representations for ranking using tightly-coupled teachers, 2020.
- Yanghao Li, Hanzi Mao, Ross Girshick, and Kaiming He. Exploring plain vision transformer backbones for object detection. arXiv preprint arXiv:2203.16527, 2022.
- Yukun Zheng, Zhen Fan, Yiqun Liu, Cheng Luo, Min Zhang, and Shaoping Ma. Sogou-qcl: A new dataset with click relevance label. In The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, SIGIR ’18, pages 1117–1120, New York, NY, USA, 2018. ACM. ISBN 978-1-4503-5657-2. [CrossRef]
- Kai Han, Yunhe Wang, Hanting Chen, Xinghao Chen, Jianyuan Guo, Zhenhua Liu, Yehui Tang, An Xiao, Chunjing Xu, Yixing Xu, et al. A survey on vision transformer. IEEE transactions on pattern analysis and machine intelligence, 45(1):87–110, 2022. [CrossRef]
- Xiao Wang, Craig Macdonald, Nicola Tonellotto, and Iadh Ounis. Pseudo-relevance feedback for multiple representation dense retrieval. In Faegheh Hasibi, Yi Fang, and Akiko Aizawa, editors, ICTIR ’21, pages 297–306. ACM, 2021. [CrossRef]
- Herve Jegou, Matthijs Douze, and Cordelia Schmid. Product quantization for nearest neighbor search. IEEE transactions on pattern analysis and machine intelligence, 33(1):117–128, 2010. [CrossRef]
- Yi Tay, Mostafa Dehghani, Dara Bahri, and Donald Metzler. Efficient transformers: A survey. ACM Computing Surveys, 55(6):1–28, 2022. [CrossRef]
- Yassine Zniyed, Thanh Phuong Nguyen, et al. Enhanced network compression through tensor decompositions and pruning. IEEE Transactions on Neural Networks and Learning Systems, 2024. [CrossRef]
- Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
- Zhuang Liu, Jianguo Li, Zhiqiang Shen, Gao Huang, Shoumeng Yan, and Changshui Zhang. Learning efficient convolutional networks through network slimming. In Proceedings of the IEEE international conference on computer vision, pages 2736–2744, 2017.
- Bhaskar Mitra and Nick Craswell. An introduction to neural information retrieval. Foundations and Trends in Information Retrieval, pages 1–117, April 2018. URL https://www.microsoft.com/en-us/research/publication/introduction-neural-information-retrieval/.
- Artem Babenko and Victor Lempitsky. The inverted multi-index. IEEE transactions on pattern analysis and machine intelligence, 37(6):1247–1260, 2014. [CrossRef]
- Rodrigo Nogueira, Wei Yang, Jimmy Lin, and Kyunghyun Cho. Document expansion by query prediction. CoRR, abs/1904.08375, 2019. URL http://arxiv.org/abs/1904.08375.
- Alexander Kirillov, Ross Girshick, Kaiming He, and Piotr Dollár. Panoptic feature pyramid networks. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6399–6408, 2019.
- Rodrigo Nogueira, Wei Yang, Kyunghyun Cho, and Jimmy Lin. Multi-stage document ranking with bert, 2019.
- Daniël Rennings, Felipe Moraes, and Claudia Hauff. An Axiomatic Approach to Diagnosing Neural IR Models. In Leif Azzopardi, Benno Stein, Norbert Fuhr, Philipp Mayr, Claudia Hauff, and Djoerd Hiemstra, editors, Advances in Information Retrieval, Lecture Notes in Computer Science, pages 489–503, Cham, 2019. Springer International Publishing. ISBN 978-3-030-15712-8. doi: 10/ggcmnb.
- Omer Levy and Yoav Goldberg. Neural word embedding as implicit matrix factorization. In NIPS, 2014.
- Aleksandr Chuklin, Ilya Markov, and Maarten de Rijke. Click Models for Web Search. Morgan & Claypool, 2015. ISBN 9781627056489. [CrossRef]
- Charles R. Harris, K. Jarrod Millman, Stéfan J. van der Walt, Ralf Gommers, Pauli Virtanen, David Cournapeau, Eric Wieser, Julian Taylor, Sebastian Berg, Nathaniel J. Smith, Robert Kern, Matti Picus, Stephan Hoyer, Marten H. van Kerkwijk, Matthew Brett, Allan Haldane, Jaime Fernández del Río, Mark Wiebe, Pearu Peterson, Pierre Gérard-Marchant, Kevin Sheppard, Tyler Reddy, Warren Weckesser, Hameer Abbasi, Christoph Gohlke, and Travis E. Oliphant. Array programming with NumPy. Nature, 585(7825):357–362, September 2020. doi: 10.1038/s41586-020-2649-2. [CrossRef]
- Zhaowei Cai and Nuno Vasconcelos. Cascade r-cnn: Delving into high quality object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 6154–6162, 2018.
- Hang Li. Learning to Rank for Information Retrieval and Natural Language Processing. Morgan & Claypool Publishers, 2011. ISBN 1608457079, 9781608457076.
- Chong Yu, Tao Chen, Zhongxue Gan, and Jiayuan Fan. Boost vision transformer with gpu-friendly sparsity and quantization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22658–22668, 2023.
- Paul Pu Liang, Manzil Zaheer, Yuan Wang, and Amr Ahmed. Anchor & transform: Learning sparse embeddings for large vocabularies, 2020.
- Omar Khattab and Matei Zaharia. Colbert: Efficient and effective passage search via contextualized late interaction over bert. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’20, page 39–48, New York, NY, USA, 2020. Association for Computing Machinery. ISBN 9781450380164. [CrossRef]
- Nils Reimers and Iryna Gurevych. The curse of dense low-dimensional information retrieval for large index sizes, 2020.
- Zhengkai Tu, Wei Yang, Zihang Fu, Yuqing Xie, Luchen Tan, Kun Xiong, Ming Li, and Jimmy Lin. Approximate nearest neighbor search and lightweight dense vector reranking in multi-stage retrieval architectures. In Proceedings of the 2020 ACM SIGIR on International Conference on Theory of Information Retrieval, pages 97–100, 2020.
- Han Cai, Junyan Li, Muyan Hu, Chuang Gan, and Song Han. Efficientvit: Lightweight multi-scale attention for high-resolution dense prediction. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 17302–17313, 2023.
- Tharun Medini, Beidi Chen, and Anshumali Shrivastava. {SOLAR}: Sparse orthogonal learned and random embeddings. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=fw-BHZ1KjxJ.
- Jiachen Lu, Jinghan Yao, Junge Zhang, Xiatian Zhu, Hang Xu, Weiguo Gao, Chunjing Xu, Tao Xiang, and Li Zhang. Soft: Softmax-free transformer with linear complexity. Advances in Neural Information Processing Systems, 34:21297–21309, 2021.
- Yinhan Liu, Myle Ott, Naman Goyal, et al. Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692, 2019.
- Massih-Reza Amini and Gaussier Eric. Recherche d’Information - applications, modèles et algorithmes. Algorithmes. Eyrolles, April 2013. URL https://hal.archives-ouvertes.fr/hal-00881257. I-XIX, 1-233.
- Luyu Gao and Jamie Callan. Unsupervised corpus aware language model pre-training for dense passage retrieval, 2021.
- Wenhui Wang, Hangbo Bao, Li Dong, Johan Bjorck, Zhiliang Peng, Qiang Liu, Kriti Aggarwal, Owais Khan Mohammed, Saksham Singhal, Subhojit Som, et al. Image as a foreign language: Beit pretraining for vision and vision-language tasks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19175–19186, 2023.
- Francois Chollet. Deep Learning with Python. Manning Publications Co., Greenwich, CT, USA, 1st edition, 2017. ISBN 1617294438, 9781617294433.
- S. Robertson. The Probabilistic Relevance Framework: BM25 and Beyond. Foundations and Trends® in Information Retrieval, 3(4):333–389, 2009.
- Yijiang Liu, Huanrui Yang, Zhen Dong, Kurt Keutzer, Li Du, and Shanghang Zhang. Noisyquant: Noisy bias-enhanced post-training activation quantization for vision transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 20321–20330, 2023.
- Jean-Baptiste Cordonnier, Andreas Loukas, and Martin Jaggi. On the relationship between self-attention and convolutional layers. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=HJlnC1rKPB.
- Weijun Hong, Guilin Li, Weinan Zhang, Ruiming Tang, Yunhe Wang, Zhenguo Li, and Yong Yu. Dropnas: Grouped operation dropout for differentiable architecture search. arXiv preprint arXiv:2201.11679, 2022.
- Yanghao Li, Chao-Yuan Wu, Haoqi Fan, Karttikeya Mangalam, Bo Xiong, Jitendra Malik, and Christoph Feichtenhofer. Mvitv2: Improved multiscale vision transformers for classification and detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4804–4814, 2022.
- Andrew Howard, Mark Sandler, Grace Chu, et al. Searching for mobilenetv3. arXiv preprint arXiv:1905.02244, 2019.
- Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask r-cnn. In Proceedings of the IEEE international conference on computer vision, pages 2961–2969, 2017.
- Zhe Chen, Yuchen Duan, Wenhai Wang, Junjun He, Tong Lu, Jifeng Dai, and Yu Qiao. Vision transformer adapter for dense predictions. arXiv preprint arXiv:2205.08534, 2022.
- Zhihang Yuan, Chenhao Xue, Yiqi Chen, Qiang Wu, and Guangyu Sun. Ptq4vit: Post-training quantization for vision transformers with twin uniform quantization. In Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XII, pages 191–207. Springer, 2022.
- Salman Khan, Muzammal Naseer, Munawar Hayat, Syed Waqas Zamir, Fahad Shahbaz Khan, and Mubarak Shah. Transformers in vision: A survey. ACM computing surveys (CSUR), 54(10s):1–41, 2022. [CrossRef]
- Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. Introduction to Information Retrieval. Cambridge University Press, New York, NY, USA, 2008. ISBN 0521865719, 9780521865715.
- Yingqi Qu, Yuchen Ding, Jing Liu, Kai Liu, Ruiyang Ren, Xin Zhao, Daxiang Dong, Hua Wu, and Haifeng Wang. Rocketqa: An optimized training approach to dense passage retrieval for open-domain question answering, 2020.
- Leonid Boytsov and Eric Nyberg. Flexible retrieval with NMSLIB and FlexNeuART. In Proceedings of Second Workshop for NLP Open Source Software (NLP-OSS), pages 32–43, Online, November 2020. Association for Computational Linguistics. URL https://aclanthology.org/2020.nlposs-1.6. [CrossRef]
- Vikas Verma, Alex Lamb, Christopher Beckham, Amir Najafi, Ioannis Mitliagkas, Aaron Courville, David Lopez-Paz, and Yoshua Bengio. Manifold mixup: Better representations by interpolating hidden states, 2019.
- Gianni Amati and Cornelis Joost Van Rijsbergen. Probabilistic models of information retrieval based on measuring the divergence from randomness. ACM Trans. Inf. Syst., 20(4):357–389, October 2002. ISSN 1046-8188. [CrossRef]
- Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International conference on machine learning, pages 448–456. PMLR, 2015.
- Rodrigo Nogueira and Kyunghyun Cho. Passage re-ranking with bert, 2019.
Comparison of token pruning strategies (Section 3.4):

| Method | Computational Cost | Adaptability | Implementation Complexity |
|---|---|---|---|
| Attention-Based | Low | Moderate | Low |
| Entropy-Based | Low | Low | Low |
| Fixed-Length | Low | Low | Low |
| Reinforcement Learning | High | High | High |
| Gated Pruning | Moderate | High | Moderate |
| Saliency-Based | High | High | High |
| Hybrid (Progressive) | Moderate | High | Moderate |
Comparison of token pruning with other compression techniques (Section 4.4):

| Method | Accuracy Retention | Computational Savings | Deployment Complexity | Adaptability |
|---|---|---|---|---|
| Weight Pruning | Moderate | Moderate | High | Low |
| Quantization | High | High | Moderate | Moderate |
| Knowledge Distillation | High | High | High | Low |
| Token Pruning | Variable | High | Moderate | High |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content. |
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
