Submitted: 25 May 2026
Posted: 27 May 2026
Abstract
Keywords:
1. Introduction

2. Problem Formulation and Preliminaries
2.1. Meta-Learning over Latent Dynamical Systems
2.2. Architecture Definitions
2.2.0.1. Selective SSM.
2.3. Computational Paradigms
1. A pairwise interaction step: , where weights are determined by the similarity of and (possibly modulated by positional encoding).
2. A pointwise transformation step: with residual connections.
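The two steps above can be sketched as one attention-style layer. This is a minimal illustration, not the paper's definition: the formulas did not survive in this copy, so the dot-product similarity, the softmax normalization, the causal mask, and the `tanh` pointwise map are all assumptions, and the function names are hypothetical.

```python
import math

def pairwise_step(xs):
    """Causal pairwise interaction: position t mixes positions 0..t,
    weighted by a softmax over dot-product similarity (an assumed choice)."""
    mixed = []
    for t, query in enumerate(xs):
        scores = [sum(q * k for q, k in zip(query, xs[i])) for i in range(t + 1)]
        top = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        mixed.append([
            sum(weights[i] / total * xs[i][d] for i in range(t + 1))
            for d in range(len(query))
        ])
    return mixed

def pointwise_step(xs, mixed):
    """Pointwise transformation applied per position, with a residual connection.
    tanh stands in for the (unspecified) per-token network."""
    return [[x + math.tanh(m) for x, m in zip(x_row, m_row)]
            for x_row, m_row in zip(xs, mixed)]

def layer(xs):
    """One layer: pairwise interaction followed by the pointwise transformation."""
    return pointwise_step(xs, pairwise_step(xs))
```

Calling `layer([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])` returns one output vector per input position; the first position can only attend to itself because of the causal mask.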
2.4. Assumptions
2.5. Meta-Training Objective
3. The Structural Chasm
3.1. Non-Commutative Product Tracking
3.2. The Structural Chasm Theorem
1. Computational topology: has an all-pairs (or chain-without-compression) information flow graph, while a selective SSM has a strictly chain-with-compression graph.
2. Information bottleneck: The effective state of grows as , while the SSM maintains .
3. Constant-parameter incompatibility: The excess risk of on this task has a positive lower bound that does not vanish with context length, while a selective SSM with state dimension achieves excess risk approaching zero as .
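The contrast in claims 1 and 2 can be made concrete with a toy memory count. The growth rates are missing from this copy, so the sketch below assumes the usual reading: an attention-style model keeps every past token, while a selective SSM folds each token into a fixed-size state. The class names and the scalar decay are hypothetical placeholders.

```python
class AttentionContext:
    """Keeps every past token so later positions can attend to any of them."""
    def __init__(self):
        self.cache = []

    def step(self, x):
        self.cache.append(list(x))
        return sum(len(v) for v in self.cache)  # numbers stored so far

class CompressedSSMState:
    """Folds each token into a fixed-size state (scalar decay is a placeholder)."""
    def __init__(self, dim, decay=0.9):
        self.h = [0.0] * dim
        self.decay = decay

    def step(self, x):
        self.h = [self.decay * h + xi for h, xi in zip(self.h, x)]
        return len(self.h)  # state size never changes

def memory_profile(tokens, dim):
    """Return (attention memory, SSM memory) after each token."""
    attn, ssm = AttentionContext(), CompressedSSMState(dim)
    return [(attn.step(x), ssm.step(x)) for x in tokens]
```

With five 2-dimensional tokens, the first entry of the profile is `(2, 2)` and the last is `(10, 2)`: the cache grows with context length and the compressed state does not.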
3.3. Why Position Encoding Cannot Help
| Architecture | Encoding Type | Role | Coupling Depth |
|---|---|---|---|
| RoPE Transformer | Rotational () | Inject into Q/K | Shallow |
| ALiBi Transformer | Polynomial (t) | Attention bias | Medium |
| Selective SSM | Exponential () | Is the recurrence | Deep (inseparable) |
4. Forward Separation: SSM Advantage
4.1. Selective SSMs as Adaptive Filters
4.2. Forward Statistical Separation
4.3. Permutation Invariance Impossibility
5. Reverse Separation: Transformer Advantage
5.1. Static Estimation Tasks
5.2. Theorem 2: The Statefulness Curse
6. Non-Commutativity as the Root Cause
6.1. Theorem 3: Scalar Degeneracy and Matrix Separation
6.2. Phase Transition in the Advantage Gap
7. Unified Picture and Architectural Implications
8. Experiments
8.1. Experiment I: Forward Separation
Results.
8.2. Experiment II: Reverse Separation
Results.
8.3. Experiment III: Phase Transition
Results.
8.4. Experiment IV: Non-Commutativity Ablation
Results.
8.5. Experiment V: Positional Encoding Ablation
Results.
8.6. Experiment VI: Beyond LG-SSM — Character-Level HMM
Results.
8.7. Experiment VII: Non-Commutative Product Tracking
Results.
8.8. Experiment VIII: Positional Encoding Cannot Bridge the Chasm
Results.
9. Discussion
Limitations.
Conclusion.
Appendix A. Related Work
In-Context Learning (ICL).
SSMs and In-Context Learning.
Expressiveness and Separation.
Meta-Learning.
Appendix B. Detailed Proofs
Appendix B.1. Proof of Lemma 2 (Filter Representation)
Appendix B.1.1. Setup
Appendix B.1.2. Selective SSM Parameterization
Appendix B.1.3. Constructive Argument
Appendix B.2. Proof of Theorem 2 (Forward Statistical Separation)
Appendix B.2.1. Part (i): Bayes-Optimality
Stage 1: Meta-training learns the conditional expectation.
Stage 2: SSM dynamics as approximate EM.
- E-step (state inference): Given the current parameter estimate , the SSM recurrence computes , which by Lemma 2 equals . This is the posterior mean of conditioned on under , exactly the E-step sufficient statistic.
- M-step (parameter update): The prediction error at step is . The gradient of with respect to the selective network parameters propagates through and , where satisfies the sensitivity recursion . This gradient update adjusts to reduce prediction error, analogous to the M-step of maximum likelihood estimation for state-space model parameters.
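The E-step claim, that a linear recurrence can reproduce the filtered posterior mean, can be checked in the scalar case. The sketch below assumes a scalar linear-Gaussian model x_t = a x_{t-1} + w_t, y_t = c x_t + v_t with known variances q and r; these symbols are assumptions, since the paper's notation is missing from this copy. With known parameters the recurrence coefficients depend only on the step index; a selective network would have to produce them from data when the parameters are unknown.

```python
def kalman_filter(ys, a, c, q, r, x0=0.0, p0=1.0):
    """Textbook scalar Kalman filter; returns the posterior means."""
    x_hat, p, means = x0, p0, []
    for y in ys:
        x_pred = a * x_hat                        # predict the state
        p_pred = a * a * p + q                    # predict the variance
        gain = p_pred * c / (c * c * p_pred + r)  # Kalman gain
        x_hat = x_pred + gain * (y - c * x_pred)  # correct with the innovation
        p = (1.0 - gain * c) * p_pred
        means.append(x_hat)
    return means

def ssm_recurrence(ys, a, c, q, r, h0=0.0, p0=1.0):
    """The same estimate written as h_t = A_t * h_{t-1} + B_t * y_t."""
    h, p, states = h0, p0, []
    for y in ys:
        p_pred = a * a * p + q
        gain = p_pred * c / (c * c * p_pred + r)
        A_t = (1.0 - gain * c) * a  # step-dependent transition
        B_t = gain                  # step-dependent input weight
        h = A_t * h + B_t * y
        p = (1.0 - gain * c) * p_pred
        states.append(h)
    return states
```

Both functions return the same sequence up to floating-point rounding, which is the scalar version of the filter-representation argument.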
Stage 3: Local Asymptotic Normality and efficiency.
Appendix B.2.2. Part (ii): Separation from P inv
State augmentation.
ERM lower bound under misspecification.
SSM upper bound.
Appendix B.3. Proof of Proposition 1 (Permutation Invariance Impossibility)
Appendix B.3.1. Concrete Counterexample
Appendix B.3.2. General Argument
Appendix B.4. Proof of Theorem 3 (Reverse Separation—Statefulness Curse)
Appendix B.4.1. Setup
Appendix B.4.2. Uniform Mean
Appendix B.4.3. Fixed-Gain SSM
Appendix B.5. Proof of Theorem 4 (Non-Commutativity Root Cause)
Appendix B.5.1. Part (i): Scalar Case
Appendix B.5.2. Part (ii): Matrix Case
Appendix B.6. Proof of Corollary 1 (Phase Transition)
Appendix B.6.1. Observability Analysis
Appendix B.6.2. Impact on Kalman Filter Convergence
Appendix B.6.3. Effective ARE
Appendix B.6.4. Monotonicity and Finite-Sample Effects
Appendix C. Experimental Details
Appendix C.1. Experiments I and III: LG-SSM Task Distribution
Appendix C.1.1. Prior Distribution π(θ)
Appendix C.1.2. Colored-Noise Task (Experiments I and III)
Appendix C.1.3. Model Architectures
Appendix C.1.4. Training
Appendix C.1.5. Oracle Computation
Appendix C.2. Experiment II: Static Estimation Task
Appendix C.3. Experiment IV: Non-Commutativity Ablation
Appendix C.4. Experiment V: Positional Encoding Ablation
- Variant (a): Linear attention, no PE. Attention weights: with causal masking. This is the permutation-invariant baseline (identical to Experiment I Transformer).
- Variant (b): Softmax attention, no PE. with causal masking. Softmax introduces nonlinearity but preserves permutation invariance (within the causal mask).
- Variant (c): Softmax attention + sinusoidal PE. Same as (b) but with fixed sinusoidal positional encoding , , added to input embeddings.
- Variant (d): Softmax attention + learned PE. Same as (b) but with a learnable positional embedding table of size .
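The difference between variants (b) and (c) can be illustrated with a single query attending over a small context. This is a sketch under assumptions: keys and values are taken to be the raw inputs (no learned projections), the query is supplied separately rather than being the last token, and the sinusoidal encoding uses the standard sin/cos form with base 10000, which may differ from the authors' exact setup.

```python
import math

def attend(query, context):
    """Softmax attention of one query over context vectors (keys = values)."""
    scores = [sum(q * k for q, k in zip(query, c)) for c in context]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [sum(w * c[d] for w, c in zip(weights, context)) / total
            for d in range(len(query))]

def sinusoidal_pe(pos, dim):
    """Standard fixed encoding: sin on even indices, cos on odd indices."""
    return [math.sin(pos / 10000 ** (i / dim)) if i % 2 == 0
            else math.cos(pos / 10000 ** ((i - 1) / dim))
            for i in range(dim)]

def add_pe(context):
    """Add the positional encoding for slot t to the t-th context vector."""
    dim = len(context[0])
    return [[x + p for x, p in zip(c, sinusoidal_pe(t, dim))]
            for t, c in enumerate(context)]
```

Without positional encoding, `attend(query, ctx)` and `attend(query, ctx[::-1])` agree up to rounding, because the softmax-weighted sum does not depend on the order of the context. After `add_pe`, reordering the context changes which encoding each item receives, so the two outputs differ.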
Appendix C.5. Experiment VI: Character-Level HMM
Appendix C.6. Experiment VII: Non-Commutative Product Tracking
- SSM (Kronecker): 4,760 params. Input-dependent transition , where f maps the 4D input to a matrix. The Kronecker product enforces the correct block structure for matrix multiplication. State dimension ; hidden dimension 64.
- General SSM: 6,104 params. Full transition matrix parameterized from input (no Kronecker constraint). Same architecture as Gu and Dao [1] but with non-diagonal transition.
- LinTF (no PE): 12,868 params. ELU+1 kernel, causal masking.
- SoftTF + RoPE: 12,868 params. 4-head softmax attention with RoPE.
- TF-2L/4L + RoPE: 25,508 / 50,660 params. 2/4-layer Transformer with residual connections, MLP, LayerNorm, and RoPE.
- Hybrid: 20,472 params. One SSM block (Kronecker-structured) + one attention block with RoPE.
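The role of the Kronecker structure in the first model can be sketched with the identity vec(M H) = (M ⊗ I) vec(H) for row-major vec. The exact transition formula is missing from this copy, so the ordering M ⊗ I (rather than I ⊗ M, which corresponds to a different flattening or multiplication side), the 2×2 matrix size, and all function names are assumptions made for illustration.

```python
def matmul(A, B):
    """Plain matrix product for matrices stored as lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def kron(A, B):
    """Kronecker product of two matrices stored as lists of rows."""
    return [[A[i][k] * B[j][l]
             for k in range(len(A[0])) for l in range(len(B[0]))]
            for i in range(len(A)) for j in range(len(B))]

def vec(M):
    """Row-major flattening of a matrix into a state vector."""
    return [x for row in M for x in row]

def matvec(A, v):
    return [sum(a * x for a, x in zip(row, v)) for row in A]

IDENTITY = [[1, 0], [0, 1]]

def track_product(matrices):
    """Run h_t = (M_t kron I) h_{t-1} from vec(I); h_t equals vec(M_t ... M_1)."""
    h = vec(IDENTITY)
    for M in matrices:
        h = matvec(kron(M, IDENTITY), h)
    return h
```

Feeding a swap matrix and then a shear matrix gives a different final state than feeding them in the opposite order, which is the non-commutativity the task is built around. A 4-dimensional state suffices here because the running product of 2×2 matrices has four entries.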
Appendix C.7. Experiment VIII: PE Equivalence Sweep
Appendix D. Hybrid Architecture Experiment
Appendix D.1. Mixed-Structure Task
- Filtering blocks (length 32): observations from an LG-SSM with AR(1) noise (), requiring sequential belief updating.
- Retrieval blocks (length 32): the model must recall a “key” observation from the filtering block (the observation with maximum absolute value) and output it. This requires associative retrieval from unordered memory.
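The two block types can be sketched as a data generator. Only the block length (32), the AR(1) observation noise, and the max-absolute-value retrieval rule come from the description above; the scalar state, the coefficients `a` and `rho`, the unit noise variances, and the function names are hypothetical, since the paper's values are missing from this copy.

```python
import random

def filtering_block(length=32, a=0.9, rho=0.8, seed=0):
    """Scalar linear-Gaussian state observed through AR(1) (colored) noise.
    a and rho are placeholder values, not the paper's settings."""
    rng = random.Random(seed)
    x, n, ys = 0.0, 0.0, []
    for _ in range(length):
        x = a * x + rng.gauss(0.0, 1.0)    # latent state transition
        n = rho * n + rng.gauss(0.0, 1.0)  # AR(1) observation noise
        ys.append(x + n)
    return ys

def retrieval_target(block):
    """The 'key' observation: the one with maximum absolute value."""
    return max(block, key=abs)
```

A filtering block rewards a model that tracks `x` sequentially, while `retrieval_target(block)` depends on a single specific observation that a compressed running state may no longer hold.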
Appendix D.2. Models
- Pure SSM: Two stacked Mamba blocks (embedding , state ). ∼4K params.
- Pure Transformer: Two stacked attention layers (4 heads, , FFN , learned PE). ∼25K params.
- Hybrid: One Mamba block followed by one attention layer (same dimensions). ∼19K params.
Appendix D.3. Predicted Outcome
- The pure SSM should excel on filtering blocks but struggle on retrieval blocks (the compressed state discards the specific observation needed for retrieval).
- The pure Transformer should excel on retrieval blocks but underperform on filtering blocks (permutation invariance penalizes sequential estimation).
- The hybrid should achieve the best overall MSE by using the SSM layer for sequential state tracking and the attention layer for associative recall.
| Model | Filtering MSE | Retrieval MSE | Total MSE |
|---|---|---|---|
| Pure SSM | 0.578 | 2.934 | 1.756 |
| Pure Transformer | 0.886 | 1.280 | 1.083 |
| Hybrid (SSM + Attn) | 0.355 | 1.953 | 1.154 |
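The Total MSE column above is numerically consistent with the unweighted mean of the filtering and retrieval columns, which would fit blocks of equal length. That reading is an inference from the numbers, not something stated in the surviving text; the short check below reproduces it and reports which model is lowest in each column.

```python
# Values copied from the table: (filtering MSE, retrieval MSE, total MSE).
RESULTS = {
    "Pure SSM": (0.578, 2.934, 1.756),
    "Pure Transformer": (0.886, 1.280, 1.083),
    "Hybrid (SSM + Attn)": (0.355, 1.953, 1.154),
}

def mean_mse(filtering, retrieval):
    """Unweighted mean of the two block types (assumed aggregation rule)."""
    return (filtering + retrieval) / 2.0

def best_by(column):
    """Name of the model with the lowest value in the given column index."""
    return min(RESULTS, key=lambda name: RESULTS[name][column])

for name, (filtering, retrieval, total) in RESULTS.items():
    print(name, round(mean_mse(filtering, retrieval), 3), total)
```

In the reported numbers, the hybrid has the lowest filtering MSE, while the pure Transformer has the lowest retrieval MSE and the lowest Total MSE.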
Appendix D.4. Results
Appendix E. Proof of the Structural Chasm Theorem
Appendix E.1. Claims 1–2: Computational Topology and Information Bottleneck
Appendix E.2. Claim 3: Constant-Parameter Incompatibility
Appendix F. Position Encoding Classification and the Structural Chasm
The classification describes static markers, not dynamic updates.
The two mechanisms are mathematically distinct.
- PE (Transformer): Modifies , a scalar weight in . The PE changes how much information flows from position i to position t.
- Transition matrix (SSM): , a full matrix. The transition changes what the state becomes, implementing a linear (or nonlinear, via input-dependence) transformation of the previous state.
Consequence: PE cannot implement state transitions.
Reconciliation.
- The classification describes the variety of position encodings (limited to forms).
- The chasm describes the fundamental limitation of all position encodings (they modify attention weights, not state transitions).
- Together, they give the complete picture: not only are PE forms limited, but even the full space of allowed PE forms cannot bridge the paradigm gap.
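The distinction drawn in this appendix, scalar attention weights versus a matrix acting on the state, can be illustrated with a toy example. It shows the type difference only and is not a proof of the separation; the additive ALiBi-style bias, the 2-dimensional rotation, and the function names are assumptions chosen for illustration.

```python
import math

def attention_output(scores, values, bias):
    """A positional bias only reweights values: the weights are scalars
    that are non-negative and sum to 1."""
    logits = [s + b for s, b in zip(scores, bias)]
    top = max(logits)
    weights = [math.exp(l - top) for l in logits]
    total = sum(weights)
    weights = [w / total for w in weights]
    out = [sum(w * v[d] for w, v in zip(weights, values))
           for d in range(len(values[0]))]
    return weights, out

def ssm_step(A, h, u):
    """A transition matrix acts on the state itself: h_t = A h_{t-1} + u."""
    return [sum(a * x for a, x in zip(row, h)) + ui for row, ui in zip(A, u)]
```

Changing `bias` shifts weight between positions, but the output stays a weighted average of the stored values. In `ssm_step`, a 90-degree rotation matrix maps the state `[1.0, 0.0]` to `[0.0, 1.0]`, a change of the state itself rather than of how much each position contributes.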
References
- Gu, Albert; Dao, Tri. Mamba: Linear-time sequence modeling with selective state spaces; 2024.
- Shakerinava, Mehran; Khavari, Behnoush; Ravanbakhsh, Siamak. The expressive limits of diagonal SSMs for state-tracking. International Conference on Learning Representations, 2026.
- Terzic, Aleksandar; Bulatovic, Nadezda; Graziani, Carlo; Dukic, Velibor. On the expressiveness and length generalization of selective state-space models on regular languages. Proc. AAAI Conf. Artif. Intell. 2025, 39, 20758–20766.
- Cooper, John; Diakonikolas, Ilias; Ma, Mingchen; Sala, Frederic. Expressivity-efficiency tradeoffs for hybrid sequence models. arXiv 2026, arXiv:2603.08859.
- Anderson, Brian D. O.; Moore, John B. Optimal Filtering; Prentice-Hall, 1979.
- Kalman, Rudolf E. A new approach to linear filtering and prediction problems. J. Basic Eng. 1960, 82(1), 35–45.
- Puranik, Alok. Using group theory to explore positional encodings and attention. Jane Street Blog. 2024. Available online: https://blog.janestreet.com/using-group-theory-to-explore-positional-encodings-attention/.
- Van der Vaart, Aad W. Asymptotic Statistics; Cambridge University Press, 2000; volume 3.
- White, Halbert. Maximum likelihood estimation of misspecified models. Econometrica 1982, 50(1), 1–25.
- Garg, Shivam; Tsipras, Dimitris; Liang, Percy S; Valiant, Gregory. What can transformers learn in-context? A case study of simple function classes. Adv. Neural Inf. Process. Syst. 2022, 35, 30583–30598.
- Von Oswald, Johannes; Niklasson, Eyvind; Randazzo, Ettore; Sacramento, João; Mordvintsev, Alexander; Zhmoginov, Andrey; Vladymyrov, Max. Transformers learn in-context by gradient descent; 2023; pp. 35151–35174.
- Zhang, Ruiqi; Frei, Spencer; Bartlett, Peter L. Trained transformers learn linear models in-context. J. Mach. Learn. Res. 2024, 25(49), 1–55.
- Kim, Juno; Suzuki, Taiji. Transformers learn nonlinear features in context: Nonconvex mean-field dynamics on the attention landscape. arXiv 2024, arXiv:2402.01258.
- Li, Hongbo; Duan, Lingjie; Liang, Yingbin. Provable in-context learning of nonlinear regression with transformers. arXiv 2025, arXiv:2507.20443.
- Sun, Haoyuan; Jadbabaie, Ali; Azizan, Navid. On the role of transformer feed-forward layers in nonlinear in-context learning. arXiv 2025, arXiv:2501.18187.
- Oko, Kazusato; Song, Yujin; Suzuki, Taiji; Wu, Denny. Pretrained transformer efficiently learns low-dimensional target functions in-context. Adv. Neural Inf. Process. Syst. 2024, 37, 77316–77365.
- He, Tianyu; Doshi, Darshil; Das, Aritra; Gromov, Andrey. Learning to grok: Emergence of in-context learning and skill composition in modular arithmetic tasks. Adv. Neural Inf. Process. Syst. 2024, 37, 13244–13273.
- Magen, Roey; Vardi, Gal. Transformers are almost optimal metalearners for linear classification. arXiv 2025, arXiv:2510.19797.
- Vladymyrov, Max; Von Oswald, Johannes; Sandler, Mark; Ge, Rong. Linear transformers are versatile in-context learners. Adv. Neural Inf. Process. Syst. 2024, 37, 48784–48809.
- Giannou, Angeliki; Yang, Liu; Wang, Tianhao; Papailiopoulos, Dimitris; Lee, Jason D. How well can transformers emulate in-context Newton’s method? arXiv 2024, arXiv:2403.03183.
- Wies, Noam; Levine, Yoav; Shashua, Amnon. The learnability of in-context learning. Adv. Neural Inf. Process. Syst. 2023, 36, 36637–36651.
- Jeon, Hong Jun; Lee, Jason D; Lei, Qi; Van Roy, Benjamin. An information-theoretic analysis of in-context learning. arXiv 2024, arXiv:2401.15530.
- Wakayama, Tomoya; Suzuki, Taiji. In-context learning is provably Bayesian inference: A generalization theory for meta-learning. arXiv 2025, arXiv:2510.10981.
- Deora, Puneesh; Vasudeva, Bhavya; Behnia, Tina; Thrampoulidis, Christos. In-context Occam’s razor: How transformers prefer simpler hypotheses on the fly. arXiv 2025, arXiv:2506.19351.
- Zhu, Hanlin; Hao, Shibo; Hu, Zhiting; Jiao, Jiantao; Russell, Stuart; Tian, Yuandong. Emergence of superposition: Unveiling the training dynamics of chain of continuous thought. arXiv 2025, arXiv:2509.23365.
- Mohan Sushma, Neeraj; Tian, Yudou; Mestha, Harshvardhan; Colombo, Nicolo; Kappel, David; Subramoney, Anand. State-space models can learn in-context by gradient descent. arXiv 2024, arXiv:2410.11687.
- Oh, Junsoo; Huang, Wei; Suzuki, Taiji. Mamba can learn low-dimensional targets in-context via test-time feature learning. arXiv 2025, arXiv:2510.12026.
- Bondaschi, Marco; Rajaraman, Nived; Wei, Xiuying; Ramchandran, Kannan; Pascanu, Razvan; Gulcehre, Caglar; Gastpar, Michael; Makkuva, Ashok Vardhan. From Markov to Laplace: How Mamba in-context learns Markov chains. arXiv 2025, arXiv:2502.10178.
- Cole, Frank; Zhao, Yuxuan; Lu, Yulong; Zhang, Tianhao. In-context learning of linear dynamical systems with transformers. Adv. Neural Inf. Process. Syst. 2025, 38.
- Bick, Evan; et al. Kalman linear attention: Parallel Bayesian filtering for efficient sequence modeling. arXiv 2026, arXiv:2602.10743.
- Sarrof, Yash; Veitsman, Yana; Hahn, Michael. The expressive capacity of state space models: A formal language perspective. Adv. Neural Inf. Process. Syst. 2024, 37, 41202–41241.
- Merrill, William; Petty, Jackson; Sabharwal, Ashish. The illusion of state in state-space models. International Conference on Machine Learning, 2024; PMLR; pp. 35563–35581.
- Wen, Kaiyue; Dang, Xingyu; Lyu, Kaifeng. RNNs are not transformers (yet): The key bottleneck on in-context retrieval. arXiv 2024, arXiv:2402.18510. Available online: https://api.semanticscholar.org/CorpusID:268041425.
- Jelassi, Samy; Brandfonbrener, David; Kakade, Sham M; Malach, Eran. Repeat after me: Transformers are better than state space models at copying. arXiv 2024, arXiv:2402.01032.
- Bick, Aviv; Xing, Eric; Gu, Albert. Understanding the skill gap in recurrent language models. International Conference on Machine Learning, 2025; PMLR.
- Dao, Tri; Gu, Albert. Transformers are SSMs: Generalized models and efficient algorithms through structured state space duality. International Conference on Machine Learning, 2024; PMLR; pp. 10061–10085.
- Ebrahimi, M. Reza; Defferrard, Michaël; Panchal, Sunny; Memisevic, Roland. On the “induction bias” in sequence models. arXiv 2026, arXiv:2602.18333.
- Mousavi-Hosseini, Alireza; Sanford, Clayton; Wu, Denny; Erdogdu, Murat A. When do transformers outperform feedforward and recurrent networks? A statistical perspective. Adv. Neural Inf. Process. Syst. 2025, 38.
- Haas, Aurélien; Bruna, Joan. Statistical advantage of softmax attention: Insights from single-step analysis. arXiv 2025, arXiv:2509.21936.
- Thrun, Sebastian; Pratt, Lorien. Learning to learn: Introduction and overview; 1998.
- Grant, Erin; Finn, Chelsea; Levine, Sergey; Darrell, Trevor; Griffiths, Thomas. Recasting gradient-based meta-learning as hierarchical Bayes. arXiv 2018, arXiv:1801.08930.
- Györfi, László; Kohler, Michael; Krzyżak, Adam; Walk, Harro. A distribution-free theory of nonparametric regression; Springer, 2002.
- Bartlett, Peter L; Mendelson, Shahar. Rademacher and Gaussian complexities: Risk bounds and structural results. J. Mach. Learn. Res. 2002, 3, 463–482.





| | 0.50 | 0.70 | 0.90 | 0.95 | 0.99 |
|---|---|---|---|---|---|
| Theoretical ≥ | 1.33 | 1.96 | 5.26 | 10.26 | 50.25 |
| Linear Transf. (no PE) | 106.8 | — | 48.7 | 59.9 | 142.8 |
| Softmax Transf. (+PE) | 21.2 | — | 5.6 | 4.5 | 13.1 |
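
The "Theoretical ≥" row is numerically consistent with the closed form 1/(1 − ρ²) if the column headers are read as an AR(1) noise coefficient ρ. Both the reading of the headers and the closed form are inferences from the reported numbers, since the table caption and the corresponding theorem statement are missing from this copy. The check below reproduces the row under that assumption.

```python
def theoretical_bound(rho):
    """Assumed closed form for the lower bound: 1 / (1 - rho^2)."""
    return 1.0 / (1.0 - rho ** 2)

# Column header (read as rho) -> value reported in the "Theoretical >=" row.
REPORTED = {0.50: 1.33, 0.70: 1.96, 0.90: 5.26, 0.95: 10.26, 0.99: 50.25}

for rho, reported in REPORTED.items():
    print(f"rho={rho:.2f}  formula={theoretical_bound(rho):.2f}  table={reported}")
```

Each computed value rounds to the table entry at two decimals, and the bound grows without limit as ρ approaches 1.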

| | Case (a): | Case (b): |
|---|---|---|
| Linear Transf. (no PE) | 102.3 | 198.2 |
| Softmax Transf. (+PE) | 13.4 | |
| Predicted | | |

| Model | Params | MSE@ | MSE@ | MSE@ |
|---|---|---|---|---|
| SSM (Kron) | 4,760 | 0.356 | 0.594 | 0.808 |
| General SSM | 6,104 | 0.338 | 0.842 | 1.174 |
| LinTF (no PE) | 12,868 | 0.388 | 11.69 | — |
| SoftTF + RoPE | 12,868 | 0.336 | 2.24 | 2.00 |
| TF-2L + RoPE | 25,508 | 0.237 | 1.59 | 1.25 |
| TF-4L + RoPE | 50,660 | 0.195 | 1.44 | 1.60 |
| Hybrid | 20,472 | 0.344 | 3.13 | 2.19 |

| Model | Params | | |
|---|---|---|---|
| Comparable budget (≤ SSM’s 4.7K) | | | |
| TF-1L (no PE) | 3,364 | 1.77 | 2.78 |
| TF-1L + Sinusoidal | 3,364 | 1.38 | 1.31 |
| TF-1L + RoPE | 3,364 | 1.64 | 2.64 |
| SSM budget | | | |
| TF-2L (no PE) | 25,508 | 1.16 | 0.79 |
| TF-2L + RoPE | 25,508 | 0.89 | 1.15 |

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).