Submitted: 18 August 2025
Posted: 19 August 2025
Abstract
Keywords:
1. Introduction
- By analyzing the attention mechanism, we propose a new perspective: attention does not require its sequential inputs to be kept in order.
- We introduce bucket attention, an approach to organizing the context of LLMs that handles context of any length in a fixed-size space.
- We propose a method to convert a pre-trained model based on traditional attention into one based on bucket attention.
- We propose a method to train a model with bucket attention from scratch.
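
The first contribution above can be illustrated with a small numerical check. The following is a minimal sketch, not the paper's code: the function `attention` and all variable names are illustrative, and it assumes a single query, no causal mask, and that any positional information is already contained in the key and value vectors rather than in their order.

```python
import numpy as np

def attention(q, K, V):
    """Scaled dot-product attention for one query q over keys K and values V."""
    scores = K @ q / np.sqrt(q.shape[-1])     # one score per key
    weights = np.exp(scores - scores.max())   # numerically stable softmax
    weights /= weights.sum()
    return weights @ V                        # weighted sum of the values

rng = np.random.default_rng(0)
n, d = 8, 4                                   # hypothetical context length and head size
q = rng.normal(size=d)
K = rng.normal(size=(n, d))
V = rng.normal(size=(n, d))

# Shuffle the key/value pairs together; each key stays paired with its value.
perm = rng.permutation(n)
out_original = attention(q, K, V)
out_shuffled = attention(q, K[perm], V[perm])

# The two outputs agree up to floating-point rounding.
print(np.allclose(out_original, out_shuffled))
```

Because the output is a weighted sum over key/value pairs, reordering the pairs only reorders the terms of that sum. This says nothing by itself about how buckets are built; it only shows why the order of stored pairs is free to change.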
2. Analysis
2.1. Attention is Unordered
2.2. Finite-Dimensional Context
3. Bucket Attention
3.1. Training-Free Approach
Algorithm 1. Single Bucket: Calculate and Maintain State

Algorithm 2. Multiple Buckets: Calculate and Maintain State
3.2. Partitioning of Buckets
3.3. Training From Scratch
4. Conclusion

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).