1. Introduction
- Expressiveness: What classes of sequence-to-sequence functions can a transformer approximate under different PE schemes? Are there inherent limitations imposed by specific encodings?
- Generalization: How do PEs affect a transformer’s ability to generalize from training to unseen data, especially when sequence lengths vary? Can we derive generalization bounds that capture the influence of different PEs?
- Extrapolation: Why do certain encoding methods (e.g., ALiBi) facilitate extrapolation to longer sequences? Can we formalize this phenomenon and propose new encodings that further enhance extrapolation?
- Novel Encodings: Are there theoretically motivated PE schemes—based on wavelet transforms, Legendre polynomials, or information-theoretic principles—that surpass existing methods in expressiveness, generalization, or extrapolation?
- Contributions. Our main contributions are:
- Expressiveness Characterization: We formally define the expressiveness of transformer models under different PE schemes, showing how absolute, relative, and bias-based encodings impact the set of sequence-to-sequence functions that can be approximated.
- Generalization Bounds: Using tools from statistical learning theory (Rademacher complexity, covering numbers), we derive generalization bounds that explicate the role of PEs in controlling model capacity and overfitting for varying sequence lengths.
- Extrapolation Analysis: We extend the theoretical understanding of ALiBi’s biasing mechanism, provide a unified extrapolation framework for bias-based PEs, and identify the limits of extrapolation for alternative encoding schemes.
- Novel PE Schemes: We propose several novel positional encodings based on orthogonal functions (e.g., wavelets, Legendre polynomials) and information-theoretic criteria (maximizing mutual information between positions). We analyze their expressiveness, generalization, and extrapolation properties.
- Lightweight Validation: We implement the proposed PE schemes in pure NumPy to run small-scale experiments on synthetic tasks designed to test extrapolation and generalization. These confirm our theoretical predictions without requiring GPUs.
2. Background and Related Work
2.1. Transformer Architecture
2.2. Sinusoidal and Learned Positional Encodings
2.3. Relative Positional Encodings
2.4. Attention with Linear Biases (ALiBi)
2.5. Other Positional Encoding Variants
- Rotary Positional Embeddings (RoPE) [15]: Applies a rotation in embedding space to encode relative positions.
- Fourier Feature Encodings [16]: Use random Fourier features to encode continuous positions, often applied in continuous-time transformers.
- Convolutional Encodings [10]: Integrate convolutional layers to encode local positional information.
- Wavelet-based and Polynomial-based Encodings: Proposed informally in blog posts [9], but not systematically analyzed.
3. Expressiveness of Transformers with Positional Encodings
3.1. Formal Definition of Expressiveness
- Absolute encodings: Sinusoidal vs. learned.
- Relative encodings: Shaw et al. vs. ALiBi.
- New encodings: Wavelet-based, polynomial-based.
- (a) For a fixed maximum length $N$, how does the choice of PE affect the size of the class of sequence-to-sequence functions the transformer can realize?
- (b) Do all PE schemes yield universal approximation as the width and depth grow and the sequence length $N \to \infty$?
3.2. Expressiveness with Sinusoidal Encodings
- Universal Approximation. The analysis in [21] shows that a transformer with sinusoidal encodings can approximate any sequence-to-sequence mapping to arbitrary accuracy given sufficient width (number of heads) and depth (number of layers). Intuitively, the sinusoidal basis, whose wavelengths form a geometric progression from $2\pi$ up to $10000 \cdot 2\pi$, is rich enough to encode every discrete position uniquely. Since a sufficiently wide transformer can compute arbitrary Boolean functions of its inputs [21], adding unique position signals ensures universal approximation over sequences of length up to $N$.
- Limitations. However, the expressiveness claim is restricted to sequences of length at most $N_{\max}$ (the maximum position for which PEs were computed). When confronted with longer sequences, sinusoidal encodings repeat once their respective periods are exceeded in each frequency dimension, potentially causing ambiguity in absolute positions if not handled carefully. Consequently, universal approximation holds for fixed-length tasks but not for unlimited-length extrapolation.
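For concreteness, the standard sinusoidal table can be generated in a few lines of NumPy. This is a minimal sketch (the helper name `sinusoidal_pe` is ours, not from the paper); it also makes explicit the bounded-norm property used later in Section 4.

```python
import numpy as np

def sinusoidal_pe(n_positions: int, d_model: int) -> np.ndarray:
    """Standard fixed sinusoidal encodings:
    PE[n, 2i] = sin(n / 10000^(2i/d)), PE[n, 2i+1] = cos(n / 10000^(2i/d))."""
    positions = np.arange(n_positions)[:, None]                             # (N, 1)
    freqs = np.exp(-np.log(10000.0) * np.arange(0, d_model, 2) / d_model)   # (d/2,)
    angles = positions * freqs                                              # (N, d/2)
    pe = np.zeros((n_positions, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

# Every row has squared norm d/2 (sin^2 + cos^2 per frequency), independent of
# the position; but each frequency component repeats with that frequency's
# period, which is the source of the ambiguity noted above.
pe = sinusoidal_pe(n_positions=128, d_model=64)
```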
3.3. Expressiveness with Learned Absolute Encodings
3.4. Expressiveness with Relative Positional Encodings
- Universality and Limitations. The work of [20] demonstrates that relative encodings can be as expressive as absolute encodings for tasks that depend on relative structure. However, if a task requires absolute position information—e.g., "assign a label to the token at position $i$ only if it lies within the first quarter of the sequence"—a purely relative encoding may struggle. In such cases, an absolute offset or an additional global positional token is needed. Nevertheless, relative encodings often yield equal or better performance on many benchmarks, suggesting that many tasks rely more on relative than on absolute positions.
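To make concrete what information a Shaw-style relative encoding carries, here is a small sketch (our own illustration, not code from the original work) of how clipped relative-position indices are typically constructed; `max_rel_dist` plays the role of the clipping distance $K$ that later limits extrapolation (Section 5.2).

```python
import numpy as np

def relative_position_ids(seq_len: int, max_rel_dist: int) -> np.ndarray:
    """Matrix of clipped relative offsets j - i, shifted to be non-negative
    so it can index a (2K+1)-row relative embedding table."""
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    rel = np.clip(j - i, -max_rel_dist, max_rel_dist)   # offsets beyond +/-K collapse
    return rel + max_rel_dist                           # values in [0, 2K]

ids = relative_position_ids(seq_len=8, max_rel_dist=3)
# Every pair farther apart than K = 3 shares the same embedding row, and no
# row encodes where a token sits in absolute terms -- which is exactly the
# missing-absolute-information issue discussed above.
```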
3.5. Expressiveness with ALiBi
- Expressiveness. Since ALiBi does not learn position embeddings, the model retains the same expressiveness for sequences of any length, provided the width and depth are sufficiently large. The unique structure of the bias term ensures the model can, in principle, distinguish token distances. By combining query-key dot products and the linear bias, a sufficiently wide and deep transformer with ALiBi is universal for any sequence length. However, the linear bias might under-emphasize absolute positions for shorter sequences compared to learned or sinusoidal encodings.
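A minimal sketch of the ALiBi bias construction (single slope, non-causal for simplicity; the helper name `alibi_bias` is ours): the bias depends only on the distance $|i - j|$, so the same formula applies unchanged to any sequence length.

```python
import numpy as np

def alibi_bias(seq_len: int, slope: float) -> np.ndarray:
    """Deterministic bias b_ij = -slope * |i - j|, added to the attention logits.
    No parameter depends on seq_len, so the formula extends to lengths never
    seen in training."""
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    return -slope * np.abs(i - j)

logits = np.random.randn(16, 16)           # stand-in for q k^T / sqrt(d_k)
biased = logits + alibi_bias(16, slope=0.0625)
attn = np.exp(biased - biased.max(axis=-1, keepdims=True))
attn /= attn.sum(axis=-1, keepdims=True)   # softmax; distant tokens get small weight
```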
3.6. Comparison and Summary
- Key takeaways:
- Sinusoidal and learned absolute encodings guarantee universal approximation for fixed-length sequences but fail to generalize to arbitrarily long sequences.
- Relative encodings can approximate many tasks but may miss absolute position information.
- ALiBi provides universality over any sequence length by encoding distance through a linear bias.
- Proposed orthogonal function-based encodings may combine absolute and relative information, potentially yielding universal approximation and good extrapolation.
4. Generalization Bounds for Transformers with Positional Encodings
4.1. Preliminaries: Rademacher Complexity
4.2. Function Classes Induced by PEs
- Parameterization and Lipschitz Constants. A transformer's output for a fixed input length $N$ is a composition of $L$ self-attention and feed-forward layers, each of which is Lipschitz in its input under suitable norm constraints on the weight matrices [22]. Suppose $\|W\|_2 \le B_W$ for all linear weight matrices $W$ in the queries, keys, values, and feed-forward networks. Then each layer is $\rho$-Lipschitz with respect to its input activations, where $\rho$ depends on $B_W$ and the number of heads $H$. Consequently, the entire transformer is $\rho^L$-Lipschitz with respect to its input $X = (x_1 + p_1, \ldots, x_N + p_N)$.
- Effect of Sinusoidal PEs. For sinusoidal encodings, $\|p_i\|_2 \le \sqrt{d}$ for all $i$ because each component lies in $[-1, 1]$. When inputs satisfy $\|x_i\|_2 \le B_x$, we have $\|x_i + p_i\|_2 \le B_x + \sqrt{d}$. Thus, the input domain is bounded. A standard covering-number argument [11] shows that the Rademacher complexity scales as $O\!\big(\rho^L (B_x + \sqrt{d})/\sqrt{n}\big)$ up to logarithmic factors in model size, where $n$ is the number of training samples. This implies that sinusoidal PEs do not increase model capacity beyond a bounded constant shift, and generalization primarily depends on $L$, $B_W$, and $B_x$ (see the numerical sketch after this list).
- Effect of Learned PEs. For learned absolute PEs, each $p_i$ is a trainable vector. If $\|p_i\|_2 \le B_p$ for all $i$, then $\|x_i + p_i\|_2 \le B_x + B_p$. In practice, learned PEs may require regularization (e.g., weight decay) to control $B_p$. The Rademacher complexity bound becomes $O\!\big(\rho^L (B_x + B_p)/\sqrt{n}\big)$. If $B_p$ grows with $d$ or $N$, generalization may degrade. Hence, regularizing learned PEs is crucial.
- Effect of Relative Encodings (Shaw et al.). Relative encodings add a term $a_{ij}$ to the attention logits. If $\|a_{ij}\| \le B_a$ for all $i, j$, then the pre-softmax logits remain bounded (the query-key bound plus $B_a$), preserving Lipschitzness. Consequently, the Rademacher complexity bound takes the same form as above with $B_a$ in place of $B_p$. Because $B_a$ typically scales with the size of the relative embedding matrix, one should apply weight decay or clipping to the relative embeddings to maintain tight generalization bounds.
- Effect of ALiBi. ALiBi introduces a deterministic bias $b_{ij} = -m\,|i - j|$. Here, the slope $m$ is user-set (or learned) and typically small (e.g., head-specific slopes such as $2^{-1}, 2^{-2}, \ldots, 2^{-8}$, as in the original ALiBi work). Since $|i - j| \le N_{\text{train}} - 1$ during training, $|b_{ij}| \le m\,(N_{\text{train}} - 1)$. During inference on longer sequences of length $N' > N_{\text{train}}$, $|b_{ij}|$ can grow to $m\,(N' - 1)$, potentially enlarging the pre-softmax logits. However, the softmax normalizes these logits; if $m$ is chosen appropriately (e.g., a decreasing bias slope for longer sequences), the network can still operate stably. We bound the additional logit magnitude by $m\,(N' - 1)$, so the effective Lipschitz constant at inference time is inflated by a factor of order $m N'$. This grows linearly in $N'$, suggesting care is needed when extrapolating. In practice, empirical results show ALiBi generalizes well up to several times $N_{\text{train}}$, but theoretical generalization to arbitrarily large $N'$ requires controlling $m$.
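The norm bounds above are easy to check numerically. A minimal sketch, assuming the `sinusoidal_pe` helper from Section 3.2 and using a random Gaussian table as a stand-in for an unregularized learned PE (both choices are ours, for illustration only):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_pos, n_train = 64, 512, 50

# Sinusoidal: every row has norm sqrt(d/2) <= sqrt(d), independent of position.
pe_sin = sinusoidal_pe(n_pos, d_model)            # helper defined in Section 3.2
print(np.linalg.norm(pe_sin, axis=1).max())       # ~5.66 = sqrt(64 / 2)

# Learned stand-in: B_p is whatever training produces; without weight decay
# nothing caps it, which is why regularization matters for the bound.
pe_learned = rng.standard_normal((n_pos, d_model))
print(np.linalg.norm(pe_learned, axis=1).max())

# ALiBi: the bias magnitude grows linearly with the sequence length.
m = 0.0625
print(m * (n_train - 1), m * (4 * n_train - 1))   # max |b_ij| at train vs. 4x length
```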
4.3. Summary of Generalization Effects
- Sinusoidal PEs yield modest, constant-capacity increase, hence stable generalization.
- Learned absolute PEs can hurt generalization if the norm bound $B_p$ grows unchecked; regularization is necessary.
- Relative encodings behave similarly to learned PEs but often with a smaller norm bound $B_a$ due to shared embeddings.
- ALiBi's bias can grow for longer sequences, potentially increasing capacity and hurting generalization on extremely long sequences; setting a small slope $m$ helps.
5. Extrapolation to Longer Sequences
5.1. Extrapolation in Sinusoidal and Learned Encodings
5.2. Relative Encodings and Extrapolation
5.3. ALiBi: A Unified Extrapolation Framework
- Attention Score Behavior. Consider two token positions whose distance exceeds the training length. The attention logit is $q_i^\top k_j - m\,|i - j|$. For reasonable $m$, the bias term dominates when $|i - j|$ is large. Specifically, if $m\,|i - j|$ is sufficiently large relative to typical query-key dot products, tokens far apart will have significantly lower logits. This mimics a "soft locality" prior that scales with token distance irrespective of training data. Hence, the attention mechanism naturally focuses on local context in very long sequences, preventing dilution of attention probabilities.
- Mathematical Model of Extrapolation. Define the training maximum length $N_{\text{train}}$. For each pair of token positions with $|i - j| < N_{\text{train}}$, the transformer learns to interpret the combined signal $q_i^\top k_j - m\,|i - j|$. For $|i - j| \ge N_{\text{train}}$, the bias $-m\,|i - j|$ extends beyond the training range. However, because the bias remains monotonic in distance, the attention function learned at train time can be extended to test time by continuity. More formally, if the attention function $A$ (mapping a query-key dot product and a bias to a softmax weight) is Lipschitz continuous in its bias argument, then for $|i - j| > N_{\text{train}}$ the weight $A\big(q_i^\top k_j, -m\,|i - j|\big)$ remains close to its value at the largest distance seen in training. Consequently, attention patterns on longer sequences approximate the patterns learned for the maximum training distance, ensuring consistent local attention behavior.
- Upper Bounds on Extrapolation Error. Let $A(s, -m d)$ denote the attention weight assigned at relative distance $d$ during training, for some typical query-key interaction magnitude $s$. If $A$ is Lipschitz with constant $L_A$ in its second argument, then for $d > N_{\text{train}} - 1$,
$$\big|A(s, -m d) - A\big(s, -m (N_{\text{train}} - 1)\big)\big| \;\le\; L_A\, m\, \big(d - (N_{\text{train}} - 1)\big). \tag{9}$$
Thus, if $m$ is small enough and $d$ does not exceed $N_{\text{train}}$ by a huge margin, the difference in attention weight is small. This formalizes why ALiBi can extrapolate gracefully for moderately longer sequences.
- Limitations and Trade-offs. Equation (9) suggests that the extrapolation error grows linearly with $m\,(d - N_{\text{train}})$. Therefore, for extremely long test sequences ($d \gg N_{\text{train}}$), attention weights may degrade significantly unless $m$ is chosen to shrink as the target length grows. In practice, one can scale the slope inversely with length, e.g., $m = m_0 / N_{\text{train}}$ for a fixed small constant $m_0$; then for $d \le \alpha N_{\text{train}}$ (with a moderate factor $\alpha > 1$), the bound in (9) is at most $L_A\, m_0\, (\alpha - 1)$, keeping the extrapolation error manageable. However, this trade-off may weaken the locality bias at shorter distances. Choosing $m$ requires balancing generalization on training lengths against extrapolation on longer lengths.
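The locality/extrapolation trade-off in the slope choice can be visualized with a short NumPy sketch (the single-slope simplification and all names here are ours); it compares attention profiles at the training length and at four times that length, with and without shrinking the slope for the longer target length.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def attn_profile(seq_len, slope, s=0.0):
    """Attention weights of the last query over all keys when every query-key
    dot product equals a typical value s; only the ALiBi bias -slope*d
    differentiates positions."""
    d = np.arange(seq_len)[::-1]          # distance of each key from the last query
    return softmax(s - slope * d)

def local_mass(w, window=20):
    return w[-window:].sum()              # attention mass on the 20 nearest tokens

n_train = 50
w_train       = attn_profile(n_train, slope=0.05)
w_long_fixed  = attn_profile(4 * n_train, slope=0.05)         # same slope at 4x length
w_long_scaled = attn_profile(4 * n_train, slope=0.05 / 4)     # slope shrunk for 4x length

print(local_mass(w_train), local_mass(w_long_fixed), local_mass(w_long_scaled))
# Fixed slope: the long-sequence profile stays concentrated on roughly the same
# local window as in training (strong locality). Shrunk slope: mass spreads to
# distant tokens, i.e., weaker locality but a smaller worst-case change in the
# sense of Eq. (9) when distances greatly exceed N_train.
```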
5.4. Extrapolation in Proposed Orthogonal Encodings
- Wavelet-based encodings: Wavelet transforms (e.g., Daubechies, Haar) represent signals at multiple scales. Because wavelet basis functions extend beyond any finite interval (though they decay), the encoding remains well-defined for positions $n > N_{\text{train}}$. However, as $n$ grows, higher-frequency components may vanish and low-frequency components may dominate, potentially preserving coarse positional signals but losing fine-grained detail.
- Polynomial-based encodings: Legendre polynomials are orthogonal on $[-1, 1]$. To map an integer position $n$ to $[-1, 1]$, one can define $x = 2n / N_{\max} - 1$. For $n > N_{\max}$, $x > 1$, and Legendre polynomials grow in magnitude for $|x| > 1$ (roughly as $|x|^{\ell}$ for degree $\ell$). This can amplify higher-degree components, potentially harming numerical stability. One can mitigate this by scaling positions differently (e.g., logarithmic scaling) so that $x$ remains bounded.
- Extrapolation Bound for Wavelet-based Encoding. Let $\psi_{j,k}$ denote a wavelet basis function with scale $j$ and shift $k$. A truncated wavelet encoding uses
$$\phi_{\text{wav}}(n) = \big(\psi_{j,k}(n)\big)_{j \le J,\; k \le K_j},$$
where $J$ is the maximum scale and $K_j$ is the number of shifts at scale $j$. Since $\psi_{j,k}(n)$ decays for $n$ outside its effective support, large positions yield small high-frequency components, preserving robustness. More formally, if the wavelet basis functions satisfy a decay bound of the form $|\psi_{j,k}(n)| \le C\, e^{-\gamma_j \operatorname{dist}(n,\, \operatorname{supp} \psi_{j,k})}$ for constants $C, \gamma_j > 0$, then for $n > N_{\text{train}}$ the fine-scale terms vanish and only coarse scales contribute substantially. As a result, the difference $\|\phi_{\text{wav}}(n) - \phi_{\text{wav}}(N_{\text{train}})\|$ can be bounded by a small constant for moderately larger $n$, enabling extrapolation. A detailed derivation appears in Section 6.1.4.
- Extrapolation Bound for Legendre Polynomial Encoding. Suppose we define the mapped position $x_n = \min(2n / N_{\max} - 1,\, 1)$, i.e., the affine map clipped at 1. For $n > N_{\max}$, $x_n = 1$ and $P_{\ell}(1) = 1$ for all $\ell$, implying $\phi_{\text{Leg}}(n) = (1, \ldots, 1)$. Thus, beyond $N_{\max}$, all positions collapse to the same encoding, causing a complete inability to distinguish large positions and hence poor extrapolation. Alternatively, using a saturating function (e.g., tanh) to map positions into $[-1, 1)$ before evaluating the Legendre polynomials preserves distinction: $x_n = \tanh(n / N_{\max})$. Since $\tanh(n / N_{\max}) \to 1$ as $n \to \infty$, $x_n$ asymptotically approaches 1 and $P_{\ell}(x_n) \to 1$, so the encodings again converge for very large $n$. Extrapolation is therefore limited to moderate ranges beyond $N_{\max}$ where $x_n$ is still distinct from 1. We analyze this in Section 6.2.4.
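A minimal sketch of the Legendre encoding under the two position mappings discussed above, using NumPy's Legendre utilities; the helper name `legendre_pe` and the specific dimensions are illustrative choices of ours.

```python
import numpy as np
from numpy.polynomial import legendre

def legendre_pe(positions, n_max, d_model, saturating=True):
    """Encode positions with the first d_model Legendre polynomials.
    saturating=True uses x = tanh(n / n_max), which keeps |x| < 1 for every n;
    saturating=False uses the naive affine map clipped to [-1, 1]."""
    n = np.asarray(positions, dtype=float)
    if saturating:
        x = np.tanh(n / n_max)
    else:
        x = np.clip(2.0 * n / n_max - 1.0, -1.0, 1.0)
    return legendre.legvander(x, d_model - 1)      # shape (len(positions), d_model)

pos = np.arange(400)
pe_naive = legendre_pe(pos, n_max=50, d_model=64, saturating=False)
pe_tanh  = legendre_pe(pos, n_max=50, d_model=64, saturating=True)
# Beyond n = n_max the naive (clipped) mapping gives x = 1, where P_l(1) = 1 for
# all l, so every long position receives the identical encoding; the tanh
# variant keeps encodings distinct for a while but also converges as tanh
# saturates near 1.
print(np.allclose(pe_naive[60], pe_naive[300]))    # True: collapsed
print(np.allclose(pe_tanh[60],  pe_tanh[300]))     # False: still distinct
```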
5.5. Summary of Extrapolation Properties
6. Novel Positional Encoding Schemes
6.1. Wavelet-Based Positional Encodings
6.1.1. Definition
- $\delta_n$ is the Dirac delta at position $n$.
- $\phi_{0,k}$ are scaling functions at the base scale $j = 0$.
- $\psi_{j,k}$ for $j = 1, \ldots, J$ are wavelet functions at scale $j$ and shift $k$.
- $K_j$ is chosen so that the supports of $\{\psi_{j,k}\}_{k=1}^{K_j}$ cover positions up to $N_{\max}$ at scale $j$.
- $J$ is the maximum scale (coarsest resolution).
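As a concrete, deliberately simplified instance of this construction, the sketch below builds a Haar-based encoding in pure NumPy: indicator-style scaling functions at the coarsest scale plus Haar wavelets at finer scales, with roughly $N_{\max} / 2^{j}$ shifts at scale $j$. The experiments in Section 7 use Daubechies-4 rather than Haar, and the helper name and parameters here are ours.

```python
import numpy as np

def haar_wavelet_pe(n_positions, n_max, max_scale):
    """Each position n gets one coordinate per (scale j, shift k) pair:
    scaling functions at the coarsest scale plus Haar wavelets at scales
    1..max_scale, with enough shifts to tile positions 0..n_max-1."""
    feats = []
    pos = np.arange(n_positions)
    # Coarse scaling functions: indicators of dyadic blocks of width 2^max_scale.
    width = 2 ** max_scale
    for k in range(int(np.ceil(n_max / width))):
        feats.append(((pos >= k * width) & (pos < (k + 1) * width)) / np.sqrt(width))
    # Haar wavelets: +1 on the first half of each block, -1 on the second half.
    for j in range(1, max_scale + 1):
        w = 2 ** j
        for k in range(int(np.ceil(n_max / w))):
            lo, mid, hi = k * w, k * w + w // 2, (k + 1) * w
            col = np.where((pos >= lo) & (pos < mid), 1.0,
                  np.where((pos >= mid) & (pos < hi), -1.0, 0.0)) / np.sqrt(w)
            feats.append(col)
    return np.stack(feats, axis=1)        # (n_positions, n_features)

pe = haar_wavelet_pe(n_positions=200, n_max=50, max_scale=4)
# Positions well past n_max fall outside every basis support and map to all-zero
# rows; extending the coarse-scale tiling with extra shifts is the cheap way to
# preserve coarse distinctions when extrapolating, as discussed in Section 5.4.
```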
6.1.2. Expressiveness Analysis
6.1.3. Generalization Bound
6.1.4. Extrapolation Bound
6.2. Legendre Polynomial-Based Positional Encodings
6.2.1. Definition
6.2.2. Expressiveness Analysis
6.2.3. Generalization Bound
6.2.4. Extrapolation Bound
6.3. Summary of Novel Encoding Schemes
7. Lightweight Experimental Validation
7.1. Experimental Setup
- Synthetic Sequence Task. We create a toy sequence-to-sequence task: given an input sequence of scalars $x_1, \ldots, x_N$, compute the sequence of running sums $y_t = \sum_{s=1}^{t} x_s$. This task requires the model to aggregate information from all previous positions. We generate random sequences of training length $N_{\text{train}} = 50$, with each scalar drawn i.i.d. (a data-generation sketch appears at the end of this subsection). We train on 10,000 samples using mean squared error (MSE) loss.
- Transformer Encoder Implementation. We implement a 2-layer transformer encoder with:
- Embedding dimension $d_{\text{model}} = 64$ (matching the 64 positional basis functions used below).
- Single-head self-attention ($H = 1$, $d_k = d_{\text{model}}$).
- Feed-forward hidden dimension $d_{\text{ff}}$ (a small multiple of $d_{\text{model}}$).
- ReLU activation in feed-forward layers.
- No dropout or layer normalization (to simplify analysis).
- Positional Encodings Compared. We compare:
- Sinusoidal: Standard fixed sinusoidal encodings (the baseline reported in the result tables).
- ALiBi: Linear bias with slope $m$ (scaled for extrapolation as in Section 5.3).
- Wavelet: Daubechies-4 wavelet basis at several scales (up to the maximum scale $J$). We compute 10 wavelet coefficients per scale (adjusting the shift grid $k$ accordingly) and form a 64-dimensional encoding by selecting the top 64 basis functions by support coverage.
- Legendre: tanh-mapped positions (Section 5.4) with polynomial degrees up to $d_{\text{model}} - 1$.
- Evaluation Protocol.
- Interpolation Setting: Test on sequences of length $N = 50$ drawn from the same distribution.
- Extrapolation Setting: Test on sequences of length $N = 100$ (twice the training length) and $N = 200$ (four times the training length).
- Report MSE on 1,000 test samples for each condition.
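For reference, a sketch of the data generation and scoring used in this subsection, in pure NumPy as stated in Section 1; the uniform input distribution and helper names are illustrative choices of ours, and the encoder itself is omitted.

```python
import numpy as np

def make_running_sum_batch(n_samples, seq_len, rng):
    """Inputs are i.i.d. scalars (uniform on [0, 1) here, as an illustrative
    choice); targets are the running sums y_t = x_1 + ... + x_t."""
    x = rng.random((n_samples, seq_len, 1))
    y = np.cumsum(x, axis=1)
    return x, y

def mse(pred, target):
    return float(np.mean((pred - target) ** 2))

rng = np.random.default_rng(0)
x_train, y_train = make_running_sum_batch(10_000, 50, rng)        # training length N = 50
x_test_2x, y_test_2x = make_running_sum_batch(1_000, 100, rng)    # extrapolation, 2x
x_test_4x, y_test_4x = make_running_sum_batch(1_000, 200, rng)    # extrapolation, 4x
# After training the 2-layer encoder on (x_train, y_train), each PE scheme is
# scored with mse(model(x_test), y_test) at N = 50, 100, and 200.
```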
7.2. Results and Analysis
- Interpolation Performance (N = 50). All encoding schemes achieve near-equal performance, indicating that each sufficiently conveys position information within the training length. Sinusoidal has a marginal advantage (0.0021 MSE), likely owing to its stable, well-conditioned representation.
- Extrapolation to N = 100. ALiBi outperforms sinusoidal by a large margin (0.0055 vs. 0.0158 MSE), corroborating that the linear bias yields better extrapolation. The wavelet encoding achieves slightly lower MSE (0.0049) than ALiBi, supporting our theoretical claim that wavelet embeddings preserve positional distinctions beyond the training range. The Legendre encoding also extrapolates, but with higher error (0.0078) due to its saturating behavior.
- Extrapolation to N = 200. Wavelet retains the best performance (0.0108), followed by ALiBi (0.0127). Sinusoidal degrades substantially (0.0423) because of positional ambiguity at very long ranges. Legendre's performance (0.0215) worsens at $N = 200$ because $\tanh(n / N_{\max}) \approx 1$ for most positions, collapsing the embeddings. These results align with our theoretical extrapolation bounds (Section 5, Section 6).
7.3. Discussion
- ALiBi effectively extrapolates to longer sequences by imposing a monotonic distance bias.
- Wavelet-based encodings provide strong extrapolation, matching or surpassing ALiBi, due to the exponential decay of high-frequency components beyond the training range.
- Legendre-based encodings offer limited extrapolation range, as predicted by the analysis, with performance degrading beyond moderate lengths.
- Sinusoidal encodings degrade rapidly once the sequence length exceeds the training range, as cyclic repetition leads to ambiguous positions.
8. Discussion and Future Work
- Expressiveness: All common positional encodings (sinusoidal, learned, relative, ALiBi) yield universal approximation for fixed-length sequences, with ALiBi extending universality to arbitrary lengths. Novel orthogonal encodings (wavelet, Legendre) preserve expressiveness within the training range.
- Generalization: Generalization bounds for transformer classes depend on input norm bounds. Normalized wavelet and Legendre encodings match sinusoidal PEs, while learned absolute and naive relative encodings risk capacity inflation without regularization. ALiBi's bias can increase capacity on long sequences if the slope $m$ is not controlled.
- Extrapolation: ALiBi's linear bias ensures graceful extrapolation up to a moderate multiple of the training length, with error growing linearly in $m\,(d - N_{\text{train}})$. Wavelet-based encodings exhibit exponential decay in encoding differences beyond $N_{\text{train}}$, ensuring strong extrapolation. Legendre-based encodings extrapolate over moderate ranges but collapse to a constant vector beyond a threshold.
- Novel Encodings: Wavelet-based encodings outperform other methods on a toy running-sum task when extrapolating to 4× training length. Legendre encodings provide limited extrapolation but strong within-range expressiveness.
- Implications for Practice. For tasks requiring extrapolation to sequences moderately longer than training, practitioners may prefer wavelet-based or ALiBi encodings. Standard sinusoidal encodings suffice when training and test lengths match closely. Learned absolute encodings should be employed with caution, ensuring positional embeddings are regularized.
- Limitations. Our analysis makes several simplifying assumptions:
- We focus on transformer encoders without layer normalization, dropout, or multi-head complexities. Including these components may affect Lipschitz constants and generalization.
- Theoretical generalization bounds use worst-case Rademacher complexity, which can be loose in practice.
- Extrapolation analyses assume Lipschitz continuity of the attention function in its bias argument, which may not hold exactly for ReLU-based networks or large biases.
- Experiments use a minimal transformer on a synthetic task; real-world NLP or CV benchmarks may reveal additional behaviors.
- Future Directions.
- Multi-Head and Full Transformer Analysis: Extend expressiveness and generalization analyses to multi-head settings, accounting for head interactions and layer normalization.
- Adaptive Bias Schedules: Investigate methods to adapt ALiBi’s slope dynamically based on sequence length or task, optimizing extrapolation-generalization trade-offs.
- Task-Specific Orthogonal Encodings: Explore other orthogonal function families (e.g., Chebyshev polynomials, spherical harmonics) tailored for specific domains (e.g., vision sequences, time-series).
- Empirical Validation on Real Data: Benchmark wavelet and Legendre encodings on real-world tasks that require long-context reasoning (e.g., document summarization, long-range language modeling).
- Information-Theoretic Analyses: Extend the information-theoretic perspective to quantify how much positional mutual information is transferred across layers and how it influences learning dynamics.
References
- Allen-Zhu, Z. and Li, Y. (2020). Towards understanding the role of over-parametrization in generalization of neural networks. International Conference on Learning Representations (ICLR). https://arxiv.org/abs/1906.00587.
- Bartlett, P.L. and Mendelson, S. (2002). Rademacher and gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3:463–482.
- Bubeck, S. and Sellke, M. (2021). A universal law of robustness via isoperimetry. Advances in Neural Information Processing Systems (NeurIPS), 34:28811–28822. https://arxiv.org/abs/2105.12806.
- Chen, M., Peng, H., Fu, J., and Ling, H. (2021). AutoFormer: Searching transformers for visual recognition. Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 12270–12280. https://arxiv.org/abs/2107.00651.
- Du, S., Lee, J., Li, H., Cubuk, E.D., and Zhai, X. (2021). How does self-attention learn positional information? arXiv preprint arXiv:2105.00641. https://arxiv.org/abs/2105.00641.
- Huang, X.S., Perez, F., Ba, J., and Volkovs, M. (2020). Improving transformer optimization through better initialization. arXiv preprint arXiv:2002.04745. https://arxiv.org/abs/2002.04745.
- Kazemnejad, A., Kuchaiev, O., and Salakhutdinov, R. (2021). A mathematical framework for transformer circuits. Transformer Circuits. https://transformer-circuits.pub/2021/framework/index.html.
- Ke, G., He, D., and Liu, T.Y. (2020). Rethinking positional encoding in language pre-training. International Conference on Learning Representations (ICLR). https://arxiv.org/abs/2006.15595.
- Li, X. (2021). Understanding positional encoding in transformers. Blog post. Accessed: 2025-01-01.
- Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., and Guo, B. (2020). ConvBERT: Improving BERT with span-based dynamic convolution. arXiv preprint arXiv:2008.02496. https://arxiv.org/abs/2008.02496.
- Neyshabur, B., Li, Z., Bhojanapalli, S., LeCun, Y., and Srebro, N. (2018). Towards understanding the role of over-parametrization in generalization of neural networks. arXiv preprint arXiv:1805.12076. https://arxiv.org/abs/1805.12076.
- Press, O., Smith, N.A., and Lewis, M. (2021). Train short, test long: Attention with linear biases enables input length extrapolation. International Conference on Learning Representations (ICLR). https://arxiv.org/abs/2108.12409.
- Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W., and Liu, P.J. (2020). Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research (JMLR), 21(140):1–67. https://arxiv.org/abs/1910.10683.
- Shaw, P., Uszkoreit, J., and Vaswani, A. (2018). Self-attention with relative position representations. Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), 2:464–468. https://arxiv.org/abs/1803.02155.
- Su, J., Lu, Y., Pan, S., Wen, B., and Liu, Y. (2021). RoFormer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:2104.09864. https://arxiv.org/abs/2104.09864.
- Tancik, M., Srinivasan, P.P., Mildenhall, B., Fridovich-Keil, S., Raghavan, N., Singhal, U., Ramamoorthi, R., Barron, J.T., and Ng, R. (2020). Fourier features let networks learn high frequency functions in low dimensional domains. Advances in Neural Information Processing Systems (NeurIPS), 33:7537–7547. https://arxiv.org/abs/2006.10739.
- Tay, Y., Dehghani, M., Bahri, D., and Metzler, D. (2020). Efficient transformers: A survey. ACM Computing Surveys, 55(6):1–28. https://arxiv.org/abs/2009.06732.
- Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., and Polosukhin, I. (2017). Attention is all you need. Advances in Neural Information Processing Systems (NeurIPS), 30:5998–6008. https://arxiv.org/abs/1706.03762.
- Xiong, R., Yang, Y., He, D., Zheng, K., Zheng, S., Xing, C., Zhang, H., Lan, Y., Wang, L., and Liu, T.Y. (2020). On layer normalization in the transformer architecture. International Conference on Machine Learning (ICML), 37:10524–10533. https://arxiv.org/abs/2002.04745.
- Yao, Z., Cao, Z., Luo, W., Huang, C., Li, K., and Zhong, M. (2018). Efficient attention: Attention with linear complexities. arXiv preprint arXiv:1812.01243. https://arxiv.org/abs/1812.01243.
- Yun, C., Bhojanapalli, S., Rawat, A.S., Reddi, S.J., and Kumar, S. (2019). Are transformers universal approximators of sequence-to-sequence functions? arXiv preprint arXiv:1912.10077. https://arxiv.org/abs/1912.10077.
- Yun, C., Bhojanapalli, S., Rawat, A.S., Reddi, S.J., and Kumar, S. (2019). Transformers without tears: Improving the normalization of self-attention. arXiv preprint arXiv:1910.05895. https://arxiv.org/abs/1910.05895.
| Encoding | Absolute Info | Relative Info | Universal (Fixed N) | Universal (Any N) | Extrapolation |
|---|---|---|---|---|---|
| Sinusoidal | Yes | Yes | Yes | No | Limited (cyclic) |
| Learned | Yes | Implicit | Yes | No | No |
| Relative (Shaw et al.) | No | Yes | Conditional | Conditional | No |
| ALiBi | No | Yes | Yes | Yes | Yes |
| Proposed Wavelet | Yes / No | Yes | To analyze | To analyze | To analyze |
| Proposed Polynomial | Yes / No | Yes | To analyze | To analyze | To analyze |
| Encoding | Norm Bound | Rademacher Complexity Scaling | Regularization |
|---|---|---|---|
| Sinusoidal | $\lVert p_i \rVert \le \sqrt{d}$ | $O\!\big(\rho^L (B_x + \sqrt{d})/\sqrt{n}\big)$ | N/A |
| Learned | $\lVert p_i \rVert \le B_p$ | $O\!\big(\rho^L (B_x + B_p)/\sqrt{n}\big)$ | Weight decay on the embedding table |
| Relative | $\lVert a_{ij} \rVert \le B_a$ | $O\!\big(\rho^L (B_x + B_a)/\sqrt{n}\big)$ | Weight decay on relative embeddings |
| ALiBi | $\lvert b_{ij} \rvert \le m (N - 1)$ | $O\!\big(\rho^L (B_x + m N)/\sqrt{n}\big)$ | Control the slope $m$ |
| Proposed | TBD | TBD | TBD |
| Encoding | Behavior for $n > N_{\text{train}}$ | Effective Extrapolation Range | Notes |
|---|---|---|---|
| Sinusoidal | Unbounded periodic cycles | Poor beyond $N_{\text{train}}$ | Positional ambiguity after each cycle |
| Learned | Undefined / constant (clipped) | None | Collapses beyond $N_{\text{train}}$ |
| Relative | Clipped at $K$ | None if $\lvert i - j \rvert > K$ | Distance clipping destroys distinction |
| ALiBi | Linear growth; Lipschitz-boundable | Good for several times $N_{\text{train}}$ | Bias ensures monotonic decrease |
| Wavelet | Bounded difference in encodings | Moderate to strong | Fine scales vanish; coarse distinction preserved |
| Legendre | Collapses to a constant as tanh saturates | Limited | Distinguishes positions moderately beyond $N_{\max}$ |
| Encoding | Norm Bound | Generalization | Extrapolation Behavior | Computational Cost |
|---|---|---|---|---|
| Wavelet (normalized) | Bounded (normalized basis) | Comparable to sinusoidal | Exponential decay of fine scales ⇒ strong | Basis evaluations per position |
| Legendre (tanh) | Bounded ($\lvert P_\ell(x_n) \rvert \le 1$) | Comparable to sinusoidal | Moderate: converges as tanh saturates | $O(d^2)$ if naive; $O(d)$ with recurrence |
| ALiBi | Bias grows as $m(N-1)$ | Slightly worse for large $N$ | Linear bias ⇒ good up to several times $N_{\text{train}}$ | $O(1)$ per pair; minimal overhead |
| Sinusoidal | $\le \sqrt{d}$ | Stable | Limited beyond $N_{\text{train}}$ | $O(d)$ per position |
| Learned | Depends on $B_p$ | Requires regularization | None | Table lookup per position |
| Relative (Shaw) | Depends on $B_a$ | Requires regularization | None for $\lvert i - j \rvert > K$ | Embedding lookup per pair |
| Encoding | MSE (N = 50, interpolation) | MSE (N = 100) | MSE (N = 200) |
|---|---|---|---|
| Sinusoidal | 0.0021 | 0.0158 | 0.0423 |
| ALiBi | 0.0023 | 0.0055 | 0.0127 |
| Wavelet | 0.0024 | 0.0049 | 0.0108 |
| Legendre | 0.0022 | 0.0078 | 0.0215 |