Bucket Attention: Fixed-Size Space for Any Length of Context

Zipeng Ye

doi:10.20944/preprints202508.1313.v1

Submitted: 18 August 2025

Posted: 19 August 2025

Abstract

In this study, we analyze the attention mechanism and propose a novel perspective where sequential inputs within the attention mechanisms do not require strict order. We introduce an innovative approach, called bucket attention, which organizes context in large language models (LLMs) and effectively handles contexts of any length while utilizing a fixed-size space. Furthermore, we present techniques to convert pre-trained models based on traditional attention into the bucket attention framework, along with a method to train models with bucket attention from scratch. These approaches offer practical solutions to improve the efficiency and scalability of LLMs.

Keywords: large language models; attention

Subject: Computer Science and Mathematics - Artificial Intelligence and Machine Learning

1. Introduction

In recent years, attention mechanisms Vaswani et al. (2017) have played a key role in natural language processing, driving the rapid advancement of large language models (LLMs) such as BERT Devlin et al. (2019), GPT Radford et al. (2018) and DeepSeek-R1 Guo et al. (2025). As the capabilities of the attention mechanism have been recognized, transformers have gradually replaced convolutional neural networks in many applications. Hypernetworks Ha et al. (2017) offer another perspective on the relationship between a network and its inputs. Studies indicate that attention mechanisms can be viewed as hypernetworks Ye et al. (2022a,b); Schug et al. (2024) whose inputs are the context. Context is therefore crucial for LLMs, as it provides conditions and extensive information that profoundly influence the outputs.

Many tasks Dherin et al. (2025) require handling extremely long contexts, which often exceed standard context-length limits. Processing such contexts typically requires a large amount of memory and time. A variety of methods have been proposed to alleviate these challenges. Sparse attention Roy et al. (2021); Tay et al. (2020); Sun et al. (2021) computes only a portion of the attention scores, reducing memory usage and improving computation speed by limiting the interaction scope or density of each element. In essence, it closely resembles pruning the KV cache Ge et al. (2023); Shi et al. (2024). A straightforward variant computes local attention based on the position of each element Dai et al. (2019); Beltagy et al. (2020); Zaheer et al. (2020). Locality-Sensitive Hashing (LSH) Petrick et al. (2022); Liu et al. (2024) is another tool for identifying correlated elements. A second family of approaches employs hierarchical structures to capture contextual information at different levels, gradually integrating long-distance dependencies and thus reducing spatial and temporal costs Pappagari et al. (2019); Nawrot et al. (2021); Liu and Lapata (2019). A third approach leverages a sleep mechanism Ye (2025) to compress the context into network parameters, but it cannot handle dynamic long-context tasks.

However, these methods have not completely resolved the issue. To better address this challenge, we propose an innovative attention mechanism that utilizes fixed-size storage, enabling the attention mechanism to effectively manage contexts of any length. Key contributions of this paper include the following:

Through analyzing the attention mechanism, we propose a novel perspective: sequential inputs do not need to be organized in order within attention mechanisms.
We introduce bucket attention, a novel approach to organizing the context of LLMs that can handle context of any length using a fixed-size space.
We propose a method to convert a pre-trained model based on traditional attention into that of bucket attention.
We propose a method to train a model with bucket attention from scratch.

2. Analysis

LLMs generally employ next token prediction and unidirectional attention. Therefore, throughout this paper, attention specifically refers to unidirectional attention.

2.1. Attention is Unordered

First, it is important to recognize that the attention mechanism is inherently unordered, which is why position encoding becomes essential. The organization of natural language is sequential, but this order is carried only by the position encoding, not by the attention mechanism itself. A notable consequence is that we can completely rearrange the order of elements within the KV cache without changing the attention output, provided each key stays paired with its value. Thus, we can modify the organization of the KV cache, structuring it as any efficient data structure rather than just a sequence. Numerous studies Pappagari et al. (2019); Nawrot et al. (2021); Liu and Lapata (2019) have organized the context into hierarchical structures. However, merely rearranging the order of the context is insufficient to fit an arbitrarily long context into a fixed space. To accomplish this, it is crucial to compress or trim the context while minimizing information loss. Information theory tells us that it is impossible to compress a random sequence into a very small space. Natural language, however, is not a random sequence: it contains a great deal of redundancy, allowing large portions of text to be summarized effectively.
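The permutation claim can be checked numerically. The following is a minimal sketch, assuming a single query attending to a small random cache whose keys already carry any positional information; the function name `attention` and the array sizes are illustrative choices, not part of the paper.

```python
import numpy as np

def attention(q, K, V):
    """Single-query softmax attention over a KV cache (one row per token)."""
    d_k = K.shape[-1]
    logits = K @ q / np.sqrt(d_k)
    weights = np.exp(logits - logits.max())  # subtract max for stability
    weights /= weights.sum()
    return weights @ V

rng = np.random.default_rng(0)
K = rng.normal(size=(16, 8))   # 16 cached keys, d_k = 8
V = rng.normal(size=(16, 4))   # matching values, d_v = 4
q = rng.normal(size=8)

# Shuffle the cache; each key moves together with its value.
perm = rng.permutation(16)
out_original = attention(q, K, V)
out_shuffled = attention(q, K[perm], V[perm])
print(np.allclose(out_original, out_shuffled))
```

The two outputs agree up to floating-point error because the softmax-weighted sum does not depend on the order of its terms.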

2.2. Finite-Dimensional Context

We assume that the concepts in the world are finite, and consequently, the logical relationships between them are also finite. Therefore, the state of context should reside in a finite-dimensional space. This gives us hope that context can be described using this finite-dimensional space, rather than relying on a potentially infinite sequence. It is challenging to enumerate keywords and key relationships, but models learn the language structure, extract essential contextual information, and summarize it into key concepts. This summary can be interpreted as a state; unlike the abstract hidden state of a recurrent neural network (RNN), however, it is fundamentally a form of attention.

Based on the above assumptions and analysis, we present a conceptual attention mechanism within a finite-dimensional space. The relationship between the novel attention mechanism and the traditional attention mechanism is similar to that between Lebesgue integration and Riemann integration: Riemann integration partitions the domain, whereas Lebesgue integration partitions the range. Analogously, traditional attention indexes the context by token position, whereas the novel mechanism groups it by key. We partition the context into a fixed number of intervals by keys and aggregate the values within each interval, thereby constructing attention that uses a fixed space.

We use equations to provide a more precise description of our approach. The formula for calculating attention is as follows.

$$A_{i+1} = \operatorname{softmax}\left(\frac{Q_{i+1} K_{0 \dots i}^{\top}}{\sqrt{d_k}}\right) V_{0 \dots i} \triangleq \operatorname{softmax}\left(\frac{Q_{i+1} [K_0 \dots K_i]^{\top}}{\sqrt{d_k}}\right) [V_0 \dots V_i], \tag{1}$$

where $K_{0 \dots i}$ and $V_{0 \dots i}$ are the keys and values of tokens 0 through $i$. We partition the space of keys into $N$ groups (buckets) and we have the following.

$$A_{i+1} = \operatorname{softmax}\left(\frac{Q_{i+1} [K_{G_0} \dots K_{G_{N-1}}]^{\top}}{\sqrt{d_k}}\right) [V_{G_0} \dots V_{G_{N-1}}], \tag{2}$$

where $G_0 \dots G_{N-1}$ are groups of keys and each group includes multiple elements. We need to approximately convert each group containing multiple elements into a single element. Since we group them by keys, we first select a representative element $\vec{K}_t$ from each group $G_t$, where $\vec{K}_t$ is a unit vector indicating direction. Assume $K_j$ belongs to the $t(j)$-th group (bucket) $G_{t(j)}$, so that $K_j \doteq \alpha_j \vec{K}_{t(j)}$.

\begin{align}
A_{i+1} &= \operatorname{softmax}\left(\frac{Q_{i+1} [\alpha_0 \vec{K}_{t(0)} \dots \alpha_i \vec{K}_{t(i)}]^{\top}}{\sqrt{d_k}}\right) [V_0 \dots V_i] \tag{3} \\
&= \operatorname{softmax}\left(\frac{Q_{i+1} [\beta_0 \vec{K}_0 \dots \beta_{N-1} \vec{K}_{N-1}]^{\top}}{\sqrt{d_k}}\right) [\bar{V}_0 \dots \bar{V}_{N-1}], \tag{4}
\end{align}

where $\beta_t$ is derived by weighting the $\alpha_j$ from the same group and $\bar{V}_t$ is derived by weighting the $V_j$ from the same group, i.e.,

$$\begin{cases} \beta_t = \log\left(\sum_{j \in G_t} e^{\alpha_j}\right) \\ \gamma_j = e^{\alpha_j / \sqrt{d_k}} / e^{\beta_t / \sqrt{d_k}} \\ \bar{V}_t = \sum_{j \in G_t} \gamma_j V_j \end{cases} \tag{5}$$

Therefore, by employing a grouping method, we can compress an arbitrarily long context into fixed-length N buckets.
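As an illustration of Eqs. (4) and (5), the sketch below compresses a whole KV cache into N buckets in one pass and then attends over the buckets. It is a minimal sketch under stated assumptions: the bucket directions `K_dirs` are given, each key is assigned to the bucket with the largest projection (a hypothetical assignment rule, since the analysis does not fix one), and the function names are illustrative.

```python
import numpy as np

def compress_to_buckets(K, V, K_dirs):
    """Compress keys K (n, d_k) and values V (n, d_v) into N buckets, Eq. (5).

    K_dirs holds the N unit representative directions, shape (N, d_k).
    Returns beta (N,) and V_bar (N, d_v); empty buckets keep beta = -inf.
    """
    n, d_k = K.shape
    N = K_dirs.shape[0]
    proj = K @ K_dirs.T                  # projections onto every direction
    t = proj.argmax(axis=1)              # assumed rule: largest projection
    alpha = proj[np.arange(n), t]        # alpha_j, projected length of K_j
    beta = np.full(N, -np.inf)
    V_bar = np.zeros((N, V.shape[1]))
    for b in range(N):
        members = t == b
        if not members.any():
            continue
        a = alpha[members]
        beta[b] = a.max() + np.log(np.exp(a - a.max()).sum())  # log-sum-exp
        gamma = np.exp((a - beta[b]) / np.sqrt(d_k))            # gamma_j
        V_bar[b] = gamma @ V[members]
    return beta, V_bar

def bucket_attention(q, K_dirs, beta, V_bar):
    """Attention over the non-empty buckets, Eq. (4)."""
    d_k = K_dirs.shape[1]
    used = np.isfinite(beta)
    logits = beta[used] * (K_dirs[used] @ q) / np.sqrt(d_k)
    w = np.exp(logits - logits.max())
    return (w / w.sum()) @ V_bar[used]
```

When every bucket holds exactly one key that lies along its direction, the compression is lossless and `bucket_attention` reproduces Eq. (1); with several keys per bucket it is an approximation whose quality depends on how well the directions fit the keys.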

3. Bucket Attention

3.1. Training-Free Approach

Building on the above analysis, we can intuitively develop an inference-time scaling algorithm that requires no training and efficiently stores arbitrarily long contexts using only a fixed space during inference. From the formulas above, during inference we need to update the bucket cache of $\vec{K}_t$, $\bar{V}_t$ and $\beta_t$. Compared to typical inference with the KV cache, we need to store one additional variable, $\beta_t$. The details are shown in Algorithm 1.

Algorithm 1 Single Bucket: Calculate $A_{i+1}$ and Maintain State

Input: $K_{i+1}, Q_{i+1}, V_{i+1}$ and prefilled $\vec{K}, \bar{V}, \beta$
$A_{i+1} \leftarrow \operatorname{softmax}\left(\frac{Q_{i+1} [\beta_0 \vec{K}_0 \dots \beta_{N-1} \vec{K}_{N-1}]^{\top}}{\sqrt{d_k}}\right) [\bar{V}_0 \dots \bar{V}_{N-1}]$
$t \leftarrow$ the bucket index of $K_{i+1}$
$\alpha_{i+1} \leftarrow$ the projected length of $K_{i+1}$ onto $\vec{K}_t$
$\beta_t' \leftarrow \log(e^{\beta_t} + e^{\alpha_{i+1}})$
$\bar{V}_t \leftarrow (e^{\beta_t / \sqrt{d_k}} \bar{V}_t + e^{\alpha_{i+1} / \sqrt{d_k}} V_{i+1}) / e^{\beta_t' / \sqrt{d_k}}$
$\beta_t \leftarrow \beta_t'$
return $A_{i+1}$
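The streaming update of Algorithm 1 might be sketched as follows. This is an illustrative sketch, not a reference implementation: the class name `SingleBucketCache`, the largest-projection bucket assignment, and the use of $-\infty$ to mark empty buckets are assumptions made here.

```python
import numpy as np

class SingleBucketCache:
    """Fixed-size bucket cache holding K_dirs, V_bar and beta (Algorithm 1)."""

    def __init__(self, K_dirs, d_v):
        self.K_dirs = K_dirs                         # (N, d_k) unit vectors
        self.beta = np.full(len(K_dirs), -np.inf)    # -inf marks an empty bucket
        self.V_bar = np.zeros((len(K_dirs), d_v))
        self.scale = np.sqrt(K_dirs.shape[1])        # sqrt(d_k)

    def attend(self, q):
        """Compute A_{i+1}; requires at least one non-empty bucket."""
        used = np.isfinite(self.beta)
        logits = self.beta[used] * (self.K_dirs[used] @ q) / self.scale
        w = np.exp(logits - logits.max())
        return (w / w.sum()) @ self.V_bar[used]

    def update(self, k, v):
        """Fold the new key/value pair into a single bucket."""
        proj = self.K_dirs @ k
        t = int(proj.argmax())                       # assumed bucket rule
        alpha = proj[t]                              # projected length
        beta_new = np.logaddexp(self.beta[t], alpha)
        w_old = np.exp((self.beta[t] - beta_new) / self.scale)
        w_new = np.exp((alpha - beta_new) / self.scale)
        self.V_bar[t] = w_old * self.V_bar[t] + w_new * v
        self.beta[t] = beta_new

cache = SingleBucketCache(np.eye(3), d_v=2)
cache.update(np.array([2.0, 0.0, 0.0]), np.array([1.0, 0.0]))
cache.update(np.array([1.0, 0.0, 0.0]), np.array([0.0, 1.0]))
print(cache.beta, cache.V_bar[0])
```

The weights `w_old` and `w_new` are the two terms of the value update divided through by $e^{\beta_t' / \sqrt{d_k}}$, which avoids forming large exponentials. The memory footprint stays at N buckets however many tokens are folded in.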

Furthermore, we can extend each layer of attention to span multiple buckets rather than being confined to just one. In practice, this is also essential. Note that in high-dimensional spaces, any two randomly chosen vectors are almost orthogonal. This supports the assumption that the representative elements of any two buckets are nearly orthogonal, so we can avoid performing Gram-Schmidt orthogonalization on them. Consequently, a vector can be decomposed into multiple buckets by projection, which could be dense or sparse according to its principal components. The details are shown in Algorithm 2 and Figure 1. To improve computational efficiency, we use sparse bucket attention, which limits the number of buckets involved at each step, that is, $|T_{i+1}| \ll N$.
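The near-orthogonality of random high-dimensional vectors is easy to observe empirically. The sketch below, with illustrative dimensions and sample sizes chosen here, compares the mean absolute cosine between random unit vectors in a low-dimensional and a high-dimensional space.

```python
import numpy as np

rng = np.random.default_rng(0)

def mean_abs_cosine(d, n=200):
    """Mean |cos| over all distinct pairs of n random unit vectors in R^d."""
    X = rng.normal(size=(n, d))
    X /= np.linalg.norm(X, axis=1, keepdims=True)
    G = X @ X.T                                # pairwise cosines
    off_diag = G[~np.eye(n, dtype=bool)]       # drop the self-similarities
    return np.abs(off_diag).mean()

low_dim = mean_abs_cosine(4)
high_dim = mean_abs_cosine(1024)
print(low_dim, high_dim)
```

The typical cosine shrinks roughly like $1/\sqrt{d}$, so at dimensions common for attention heads the bucket directions overlap only slightly. Learned or data-driven directions are not random, however, so this is supporting intuition rather than a guarantee.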

Algorithm 2 Multiple Buckets: Calculate $A_{i+1}$ and Maintain State

Input: $K_{i+1}, Q_{i+1}, V_{i+1}$ and prefilled $\vec{K}, \bar{V}, \beta$
$A_{i+1} \leftarrow \operatorname{softmax}\left(\frac{Q_{i+1} [\beta_0 \vec{K}_0 \dots \beta_{N-1} \vec{K}_{N-1}]^{\top}}{\sqrt{d_k}}\right) [\bar{V}_0 \dots \bar{V}_{N-1}]$
$T_{i+1} \leftarrow \{ t \mid t \text{ is a bucket index of } K_{i+1} \}$
$\{\alpha_t^{i+1}\}_{t \in T_{i+1}} \leftarrow$ the projected lengths of $K_{i+1}$ onto each $\vec{K}_t$
for each $t$ in $T_{i+1}$ do
    $\beta_t' \leftarrow \log(e^{\beta_t} + e^{\alpha_t^{i+1}})$
    $\bar{V}_t \leftarrow (e^{\beta_t / \sqrt{d_k}} \bar{V}_t + e^{\alpha_t^{i+1} / \sqrt{d_k}} V_{i+1}) / e^{\beta_t' / \sqrt{d_k}}$
    $\beta_t \leftarrow \beta_t'$
end for
return $A_{i+1}$
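The state update of Algorithm 2 can be sketched in the same style. The sketch assumes that the index set $T_{i+1}$ consists of the buckets with the largest projections, which is one possible reading of "principal components"; the function name `update_multi_bucket` and the parameter `top` are hypothetical.

```python
import numpy as np

def update_multi_bucket(K_dirs, beta, V_bar, k, v, top=2):
    """Fold (k, v) into the `top` best-matching buckets, in place.

    K_dirs: (N, d_k) unit directions; beta: (N,); V_bar: (N, d_v).
    Returns the chosen bucket indices T.
    """
    scale = np.sqrt(K_dirs.shape[1])           # sqrt(d_k)
    proj = K_dirs @ k                          # projected lengths alpha_t
    T = np.argsort(proj)[-top:]                # assumed rule: largest projections
    for t in T:
        alpha = proj[t]
        beta_new = np.logaddexp(beta[t], alpha)
        w_old = np.exp((beta[t] - beta_new) / scale)
        w_new = np.exp((alpha - beta_new) / scale)
        V_bar[t] = w_old * V_bar[t] + w_new * v
        beta[t] = beta_new
    return T

K_dirs = np.eye(4)
beta = np.full(4, -np.inf)                     # all buckets start empty
V_bar = np.zeros((4, 2))
T = update_multi_bucket(K_dirs, beta, V_bar,
                        k=np.array([3.0, 1.0, 2.0, 0.0]),
                        v=np.array([1.0, 1.0]))
print(sorted(T.tolist()), beta)
```

Only `top` buckets are touched per token, so the update cost is independent of both the context length and, apart from the projection step, of N.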

We may need to consider different application scenarios to decide whether to use bucket attention. Essentially, this approach is a variant and enhancement of the KV cache, allowing flexible switching between traditional KV cache and bucket attention.

3.2. Partitioning of Buckets

Partitioning buckets is an intriguing issue to investigate. The simplest method is to divide the space uniformly, but this does not capture the distribution of the data. A more effective approach is to sample and partition based on data density, utilizing algorithms similar to K-Nearest Neighbors (KNN) Cover and Hart (1967).
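One density-aware way to obtain the representative directions is to cluster a sample of keys by direction. The sketch below uses spherical k-means, which is a technique substituted here for illustration and not the nearest-neighbor style procedure mentioned above; the function name `fit_bucket_directions` and its parameters are hypothetical, and keys are assumed to be non-zero.

```python
import numpy as np

def fit_bucket_directions(keys, N, iters=20, seed=0):
    """Fit N unit bucket directions to sampled keys with spherical k-means."""
    rng = np.random.default_rng(seed)
    X = keys / np.linalg.norm(keys, axis=1, keepdims=True)   # keep direction only
    dirs = X[rng.choice(len(X), size=N, replace=False)].copy()
    for _ in range(iters):
        assign = (X @ dirs.T).argmax(axis=1)   # nearest direction by cosine
        for t in range(N):
            members = X[assign == t]
            if len(members) == 0:
                continue                       # leave an unused direction as is
            m = members.sum(axis=0)
            norm = np.linalg.norm(m)
            if norm > 0:
                dirs[t] = m / norm             # renormalized mean direction
    return dirs

sample_keys = np.random.default_rng(1).normal(size=(200, 8))
bucket_dirs = fit_bucket_directions(sample_keys, N=5)
print(bucket_dirs.shape)
```

Because the directions follow the sampled keys, dense regions of key space receive more buckets than a uniform partition would give them.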

An important issue to consider is the selection of the total number of buckets $N$ and the number of principal components $|T_{i+1}|$. As the number of buckets increases, the algorithm becomes more similar to traditional attention. The guiding principle is therefore to use as many buckets as the available memory allows. Furthermore, $N$ can be selected independently for each attention layer: although it is common practice to keep the parameters consistent across layers, they can also be varied empirically. Another selection method involves estimating the intrinsic dimensionality of the corpus through error analysis.

3.3. Training From Scratch

If we train a model from scratch instead of converting a pre-trained model, we can simplify the process further. We can allow the model to learn the buckets implicitly, similar to learning an embedding, rather than explicitly partitioning them. We call this method implicit bucket attention, where the neural network outputs the decomposition coefficients of keys and queries in the bucket space instead of directly outputting values. This approach has the advantage of embedding the bucket partitions within the model parameters while reducing computational effort. Moreover, we can directly employ dense bucket attention without incurring any additional computational cost.

We describe the computation method of implicit bucket attention using equations.

$$Q_{i+1} K_{0 \dots i}^{\top} = \theta^{i+1} \alpha^{\top} \tag{6}$$

$$\text{where} \quad \begin{cases} K_j = [\alpha_0^j \vec{K}_0 \dots \alpha_{N-1}^j \vec{K}_{N-1}] \triangleq \alpha^j \\ K_{0 \dots i} = \begin{bmatrix} \alpha^0 \\ \vdots \\ \alpha^i \end{bmatrix} = [\alpha_0 \vec{K}_0 \dots \alpha_{N-1} \vec{K}_{N-1}] \triangleq \alpha \\ Q_{i+1} = [\theta_0^{i+1} \vec{K}_0 \dots \theta_{N-1}^{i+1} \vec{K}_{N-1}] \triangleq \theta^{i+1} \end{cases}, \tag{7}$$

where $\alpha$ and $\theta$ are outputs of the networks. The form of the above equation resembles multi-head latent attention (MLA) Liu et al. (2024), where the buckets can be viewed as a type of latent space. Note that $\alpha$ remains a sequence and we need to accumulate $\alpha$ to obtain $\beta$ to represent the state.

$$Q_{i+1} K_{0 \dots i}^{\top} = \theta^{i+1} {\beta^{i+1}}^{\top} = \theta^{i+1} [\beta_0^{i+1} \dots \beta_{N-1}^{i+1}]^{\top}, \tag{8}$$

where

$$\beta_t^{i+1} = \log\left(\sum_{j=0}^{i} e^{\alpha_t^j}\right). \tag{9}$$

The computation method for the value is the same as that in the training-free approach and will be omitted here.
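The accumulation in Eq. (9) is a running log-sum-exp over tokens, taken independently per bucket. A minimal sketch follows, in which random numbers stand in for the network outputs $\alpha_t^j$ and the sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens, N = 10, 4
alpha = rng.normal(size=(n_tokens, N))   # alpha[j, t]: token j, bucket t

# Training-time view: the state after every prefix, computed in one call.
# Row i holds beta^{i+1} of Eq. (9), i.e. the state after tokens 0..i.
beta_states = np.logaddexp.accumulate(alpha, axis=0)

# Inference-time view: only the latest state is kept, updated per token.
beta = np.full(N, -np.inf)               # empty state
for a in alpha:
    beta = np.logaddexp(beta, a)

print(np.allclose(beta, beta_states[-1]))
```

The prefix form suits teacher-forced training, where all states are needed at once, while the loop form keeps only N numbers per layer during decoding.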

4. Conclusion

In conclusion, our analysis of attention mechanisms has led to the insight that sequential inputs within these mechanisms do not need to adhere to a strict order. We have introduced bucket attention, a novel approach that organizes context within large language models (LLMs) and manages contexts of any length using a fixed-size space. Additionally, we have presented techniques to convert pre-trained models based on traditional attention into the bucket attention paradigm, along with a method for training models with bucket attention from scratch. These approaches offer practical routes to improving the efficiency and scalability of LLMs.

References

Vaswani, A., N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A.N. Gomez, Ł. Kaiser, and I. Polosukhin. 2017. Attention is all you need. Advances in neural information processing systems 30.
Devlin, J., M.W. Chang, K. Lee, and K. Toutanova. 2019. Bert: Pre-training of deep bidirectional transformers for language understanding. Proceedings of the 2019 conference of the North American chapter of the association for computational linguistics: human language technologies, volume 1 (long and short papers), pp. 4171–4186.
Radford, A., K. Narasimhan, T. Salimans, I. Sutskever, et al. 2018. Improving language understanding by generative pre-training.
Guo, D., D. Yang, H. Zhang, J. Song, R. Zhang, R. Xu, Q. Zhu, S. Ma, P. Wang, X. Bi, et al. 2025. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948.
Ha, D., A.M. Dai, and Q.V. Le. 2017. HyperNetworks. Proceedings of the 5th International Conference on Learning Representations, ICLR 2017, Conference Track Proceedings, Toulon, France, April 24-26; pp. 1–18.
Ye, Z., M. Xia, R. Yi, J. Zhang, Y.K. Lai, X. Huang, G. Zhang, and Y.J. Liu. 2022a. Audio-driven talking face video generation with dynamic convolution kernels. IEEE Transactions on Multimedia 25: 2033–2046.
Ye, Z., Z. Sun, Y.H. Wen, Y. Sun, T. Lv, R. Yi, and Y.J. Liu. 2022b. Dynamic neural textures: Generating talking-face videos with continuously controllable expressions. arXiv preprint arXiv:2204.06180.
Schug, S., S. Kobayashi, Y. Akram, J. Sacramento, and R. Pascanu. 2024. Attention as a hypernetwork. arXiv preprint arXiv:2406.05816.
Dherin, B., M. Munn, H. Mazzawi, M. Wunder, and J. Gonzalvo. 2025. Learning without training: The implicit dynamics of in-context learning. arXiv preprint arXiv:2507.16003.
Roy, A., M. Saffar, A. Vaswani, and D. Grangier. 2021. Efficient content-based sparse attention with routing transformers. Transactions of the Association for Computational Linguistics 9: 53–68.
Tay, Y., D. Bahri, L. Yang, D. Metzler, and D.C. Juan. 2020. Sparse sinkhorn attention. Proceedings of the International conference on machine learning. PMLR; pp. 9438–9447.
Sun, Z., Y. Yang, and S. Yoo. 2021. Sparse attention with learning to hash. In Proceedings of the International Conference on Learning Representations.
Ge, S., Y. Zhang, L. Liu, M. Zhang, J. Han, and J. Gao. 2023. Model tells you what to discard: Adaptive kv cache compression for llms. arXiv preprint arXiv:2310.01801.
Shi, L., H. Zhang, Y. Yao, Z. Li, and H. Zhao. 2024. Keep the cost down: A review on methods to optimize LLM’s KV-cache consumption. arXiv preprint arXiv:2407.18003.
Dai, Z., Z. Yang, Y. Yang, J. Carbonell, Q.V. Le, and R. Salakhutdinov. 2019. Transformer-xl: Attentive language models beyond a fixed-length context. arXiv preprint arXiv:1901.02860.
Beltagy, I., M.E. Peters, and A. Cohan. 2020. Longformer: The long-document transformer. arXiv preprint.
Zaheer, M., G. Guruganesh, K.A. Dubey, J. Ainslie, C. Alberti, S. Ontanon, P. Pham, A. Ravula, Q. Wang, L. Yang, et al. 2020. Big bird: Transformers for longer sequences. Advances in neural information processing systems 33: 17283–17297.
Petrick, F., J. Rosendahl, C. Herold, and H. Ney. 2022. Locality-sensitive hashing for long context neural machine translation. Proceedings of the 19th International Conference on Spoken Language Translation (IWSLT 2022); pp. 32–42.
Liu, M., T. Rabbani, T. O’Halloran, A. Sankaralingam, M.A. Hartley, F. Huang, C. Fermüller, and Y. Aloimonos. 2024. HashEvict: A Pre-Attention KV Cache Eviction Strategy using Locality-Sensitive Hashing. arXiv preprint arXiv:2412.16187.
Pappagari, R., P. Zelasko, J. Villalba, Y. Carmiel, and N. Dehak. 2019. Hierarchical transformers for long document classification. Proceedings of the 2019 IEEE automatic speech recognition and understanding workshop (ASRU); IEEE, pp. 838–844.
Nawrot, P., S. Tworkowski, M. Tyrolski, Ł. Kaiser, Y. Wu, C. Szegedy, and H. Michalewski. 2021. Hierarchical transformers are more efficient language models. arXiv preprint arXiv:2110.13711.
Liu, Y., and M. Lapata. 2019. Hierarchical transformers for multi-document summarization. arXiv preprint arXiv:1905.13164.
Ye, Z. 2025. The Sleep Mechanism of LLMs.
Cover, T., and P. Hart. 1967. Nearest neighbor pattern classification. IEEE transactions on information theory 13: 21–27.
Liu, A., B. Feng, B. Wang, B. Wang, B. Liu, C. Zhao, C. Dengr, C. Ruan, D. Dai, D. Guo, et al. 2024. Deepseek-v2: A strong, economical, and efficient mixture-of-experts language model. arXiv preprint arXiv:2405.04434.

Figure 1. Comparison between the traditional KV cache of attention and bucket attention.

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

Copyright: This open access article is published under a Creative Commons CC BY 4.0 license, which permits free download, distribution, and reuse, provided that the author and preprint are cited in any reuse.
