Submitted: 27 August 2025
Posted: 28 August 2025
Abstract
Keywords:
1. Introduction
- By analyzing the attention mechanism, we propose a new perspective: attention does not require sequential inputs to be arranged in order.
- We introduce bucket attention, an approach to organizing the context of LLMs that handles context of any length using a fixed-size space.
- We propose a method for converting a pre-trained model based on traditional attention into one based on bucket attention.
- We propose a method to train a model with bucket attention from scratch.
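The first contribution rests on a property of softmax attention that can be checked directly: for a single query, the output is a weighted sum over key-value pairs, so jointly permuting those pairs leaves the result unchanged. The sketch below is a minimal illustration in plain Python, not code from this paper; the function names and the random test data are assumptions made for the example, and it presumes that any positional information is already carried inside the keys and values rather than implied by their order.

```python
import math
import random


def attention(query, keys, values):
    """Single-query softmax attention over a set of (key, value) pairs."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    top = max(scores)  # subtract the maximum score for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / total
            for j in range(dim)]


def max_abs_diff(a, b):
    """Largest element-wise absolute difference between two vectors."""
    return max(abs(x - y) for x, y in zip(a, b))


# Illustrative random data (hypothetical sizes: 6 pairs of dimension 4).
rng = random.Random(0)
d, n = 4, 6
query = [rng.gauss(0, 1) for _ in range(d)]
keys = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(n)]
values = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(n)]

# Apply the same random permutation to keys and values.
order = list(range(n))
rng.shuffle(order)
shuffled_keys = [keys[i] for i in order]
shuffled_values = [values[i] for i in order]

out_original = attention(query, keys, values)
out_shuffled = attention(query, shuffled_keys, shuffled_values)

# The two outputs agree up to floating-point rounding.
permutation_gap = max_abs_diff(out_original, out_shuffled)
print(permutation_gap)
```

The invariance holds only when each key stays paired with its own value; a causal mask or an order-dependent positional scheme applied after shuffling would break it.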
2. Analysis
2.1. Attention is Unordered
2.2. Finite-Dimensional Context
3. Bucket Attention
3.1. Training-Free Approach
Algorithm 1. Single Bucket: Calculate and Maintain State
Algorithm 2. Multiple Buckets: Calculate and Maintain State
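The following sketch is not Algorithm 1 or Algorithm 2. It is the standard online-softmax accumulation, included only to illustrate how a state of fixed size can be calculated and maintained while key-value pairs arrive one at a time and in any order. It assumes a single fixed query, which is a much narrower setting than bucket attention; the class name `RunningAttentionState` and the toy data are hypothetical.

```python
import math


class RunningAttentionState:
    """Fixed-size attention summary for one fixed query (hypothetical helper)."""

    def __init__(self, query, value_dim):
        self.query = query
        self.scale = 1.0 / math.sqrt(len(query))
        self.max_score = -math.inf          # largest score seen so far
        self.denominator = 0.0              # sum of exp(score - max_score)
        self.numerator = [0.0] * value_dim  # sum of exp(score - max_score) * value

    def update(self, key, value):
        score = self.scale * sum(q * k for q, k in zip(self.query, key))
        new_max = max(self.max_score, score)
        # Rescale the old accumulators to the new reference maximum.
        if self.denominator > 0.0:
            rescale = math.exp(self.max_score - new_max)
        else:
            rescale = 0.0
        weight = math.exp(score - new_max)
        self.denominator = self.denominator * rescale + weight
        self.numerator = [acc * rescale + weight * v
                          for acc, v in zip(self.numerator, value)]
        self.max_score = new_max

    def output(self):
        return [acc / self.denominator for acc in self.numerator]


def full_attention(query, keys, values):
    """Reference: softmax attention computed over all pairs at once."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [sum(w * v[j] for w, v in zip(weights, values)) / total
            for j in range(len(values[0]))]


# Toy data, chosen only for illustration.
query = [1.0, 0.0]
keys = [[1.0, 2.0], [0.0, 1.0], [-1.0, 0.5]]
values = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]

# Stream the pairs in reverse order; the state never grows.
state = RunningAttentionState(query, value_dim=2)
for key, value in reversed(list(zip(keys, values))):
    state.update(key, value)

streamed = state.output()
reference = full_attention(query, keys, values)
stream_gap = max(abs(a - b) for a, b in zip(streamed, reference))
print(stream_gap)
```

The state holds one scalar maximum, one scalar denominator, and one vector the size of a value, regardless of how many pairs have been absorbed. The exactness shown here depends on the query being known in advance; summarizing context for queries that are not yet known is a harder problem and is not addressed by this sketch.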
3.2. Partitioning of Buckets
3.3. Training From Scratch
4. Implementation
4.1. Training-Free Approach
4.2. Training From Scratch
5. Conclusions

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).