Submitted: 25 May 2026
Posted: 27 May 2026
Abstract
Keywords:
1. Introduction
1.1. The Phenomenon of In-Context Learning
- Why does the number of required demonstrations vary dramatically across tasks (1–3 for sentiment analysis vs. 10+ for natural language inference)?
- Does there exist an optimal number of demonstrations, and can it be computed from model properties?
- What is the information-theoretic lower bound on the error—even with arbitrarily many demonstrations?
1.2. Related Work
Bayesian implicit inference.
Gradient descent analogy.
Algorithm selection and stability.
Information-theoretic analysis.
Bayesian model averaging and pretraining analysis.
Asymptotic analysis of linear attention.
Generalization of pretrained Transformers.
PAC-Bayes theory.
1.3. Contributions
- Core PAC-Bayes ICL framework (Section 3): We establish generalization upper bounds parameterized by two computable quantities, ambiguity A and saliency S, that can be evaluated from the pretrained model without retraining. Unlike the conditional mutual information approach of [7], which yields non-computable bounds, and the Bayesian regret analysis of [8], which assumes a perfectly pretrained model and studies pretraining error, our bounds apply to any frozen pretrained Transformer at inference time with finitely many demonstrations. We further derive a closed-form sample complexity, an information-theoretic lower bound, and a permutation variance bound (Theorems 2–6).
- Linear attention instantiation (Section 4): Under a linear attention model, we derive closed-form expressions for A and S in terms of model dimension, prior variance, noise variance, and signal strength, yielding architecture-dependent bounds (Theorems 1–9). While [9] characterizes linear attention ICL in an asymptotic regime, our bounds are non-asymptotic and yield explicit sample complexity formulas.
- Minimax optimality (Section 5.1): Using Le Cam’s two-point method and KL tensorization, we prove a matching lower bound, establishing that the rate is tight (Theorems 10–13). No prior work provides minimax lower bounds for ICL generalization.
- Catoni fast-rate bounds (Section 5.2): With Catoni’s temperature-parameterized bound, we obtain a fast excess-risk rate when the empirical risk is small, which lowers the sample complexity relative to the McAllester-type bound (Theorems 14–16). This is the first fast-rate PAC-Bayes result in the ICL setting.
- Data-dependent priors (Section 6.1): Via sample splitting, we construct informed priors that can eliminate the ambiguity term entirely (Theorems 17–19). No existing ICL analysis employs data-dependent priors.
- Bernstein variance-adaptive bounds (Section 6.2): Exploiting low loss variance, we obtain a second path to fast rates through variance decay (Theorems 20–5). Variance-adaptive analysis for ICL has not appeared in prior work.
1.4. Paper Organization
2. Problem Setup and Key Definitions
2.1. ICL as a Statistical Learning Problem
- Prior P: the model’s zero-shot predictive distribution;
- Posterior Q: the model’s predictive distribution after observing k demonstrations.
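To make the prior/posterior pairing concrete, the sketch below computes the KL divergence (in nats) between a posterior predictive distribution Q and a prior predictive distribution P over a finite label set. The two example distributions and the function name `kl_categorical` are hypothetical and serve only as illustration; they are not taken from the paper's experiments.

```python
import math

def kl_categorical(q, p):
    """KL(Q || P) in nats for two discrete distributions over the same labels."""
    if len(q) != len(p):
        raise ValueError("q and p must have the same number of labels")
    total = 0.0
    for qi, pi in zip(q, p):
        if qi == 0.0:
            continue  # 0 * log(0 / p) is taken as 0 by convention
        if pi == 0.0:
            return math.inf  # Q puts mass where P has none
        total += qi * math.log(qi / pi)
    return total

# Hypothetical binary task: an uninformative zero-shot prior P and a
# posterior Q that has become confident after seeing demonstrations.
zero_shot = [0.5, 0.5]
after_demos = [0.9, 0.1]
kl = kl_categorical(after_demos, zero_shot)
print(f"KL(Q || P) = {kl:.4f} nats")
```

A larger value means the demonstrations moved the model's prediction further from its zero-shot behaviour, which is the quantity a PAC-Bayes bound charges for.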
2.2. PAC-Bayes Preliminaries
2.3. Ambiguity and Saliency
2.4. KL Reduction: Rigorous Bound and Modeling Assumption
3. Core Generalization Framework
3.1. Upper Bounds
3.2. Sample Complexity
| Task analogue | A | S | |||
|---|---|---|---|---|---|
| Easy (SST-2–like) | 0.5 | 0.3 | 0.15 | 0.10 | 270 |
| Medium (AG News–like) | 2.0 | 0.4 | 0.30 | 0.15 | 237 |
| Hard (SNLI–like) | 5.0 | 0.2 | 0.45 | 0.20 | 567 |
3.3. Information-Theoretic Lower Bound
3.4. Permutation Sensitivity
4. Linear Attention: Architecture-Dependent Bounds
4.1. Linear Attention as Bayesian Linear Regression
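- Task parameter: $\theta \sim \mathcal{N}(0, \sigma_\theta^2 I_d)$;
- Observations: $y_i = x_i^\top \theta + \varepsilon_i$, with $\varepsilon_i \sim \mathcal{N}(0, \sigma^2)$;
- Signal strength: a parameter s measuring the effective task signal captured by the attention head.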
4.2. Closed-Form Ambiguity and Saliency
4.3. Simplified Bounds via Classical Inequalities
4.4. The Saturation Phenomenon
4.5. End-to-End Architecture-Dependent Bound
4.6. Simplified Sample Complexity
| d | SNR | ||||
|---|---|---|---|---|---|
| 5 | 1.0 | 1.0 | 1.73 | 1.0 | 238 |
| 10 | 1.0 | 1.0 | 3.47 | 1.0 | 325 |
| 10 | 1.0 | 0.5 | 5.49 | 2.0 | 426 |
| 20 | 0.5 | 1.0 | 2.23 | 0.5 | 263 |
| 50 | 1.0 | 1.0 | 17.33 | 1.0 | 1018 |
5. Minimax Optimality and Fast Rates
5.1. Minimax Lower Bounds
- : , ;
- : , ;
- Upper bound: (Theorem 2);
- Lower bound: (Theorem 11).
5.2. Catoni Fast-Rate Bounds
6. Extensions
6.1. Data-Dependent Priors via Sample Splitting
6.2. Bernstein Variance-Adaptive Bounds
| Bound Family | Excess Risk Rate | Key Assumption | Best Regime |
|---|---|---|---|
| McAllester (Thm 2) | | General | |
| Catoni (Thm 14) | | Empirical risk small | Large k, low emp. risk |
| Bernstein (Thm 20) | | Variance V small | Confident model |
7. Experimental Validation
7.1. Setup
Data generation.
Configurations.
| Config | d | | | s | | | |
|---|---|---|---|---|---|---|---|
| A | 5 | 1.0 | 1.0 | 0.15 | 1.73 | 0.375 | 5 |
| B | 10 | 1.0 | 1.0 | 0.10 | 3.47 | 0.500 | 7 |
| C | 10 | 1.0 | 0.5 | 0.10 | 5.49 | 0.500 | 11 |
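The numbers in the configuration table can be cross-checked against each other. Because most column headers did not survive in this copy, the sketch below assumes the columns are configuration, d, prior variance, noise variance, s, A, S, and saturation point. Under that reading, the tabulated A values agree with (d/2)·ln(1 + prior variance / noise variance), and the last column agrees with ⌈A/S⌉. Both formulas are inferred from the tabulated values rather than quoted from the text, and the function names are hypothetical.

```python
import math

# Rows of the configuration table: (name, d, prior_var, noise_var, s, A, S, k_sat).
# The column meanings are an assumption; the headers are missing in the source.
CONFIGS = [
    ("A", 5, 1.0, 1.0, 0.15, 1.73, 0.375, 5),
    ("B", 10, 1.0, 1.0, 0.10, 3.47, 0.500, 7),
    ("C", 10, 1.0, 0.5, 0.10, 5.49, 0.500, 11),
]

def ambiguity(d, prior_var, noise_var):
    """Assumed closed form: A = (d / 2) * ln(1 + prior_var / noise_var), in nats."""
    return 0.5 * d * math.log(1.0 + prior_var / noise_var)

def saturation_point(A, S):
    """Assumed definition: the smallest integer k with k * S >= A."""
    return math.ceil(A / S)

for name, d, prior_var, noise_var, s, A_tab, S_tab, k_tab in CONFIGS:
    A = ambiguity(d, prior_var, noise_var)
    print(name, round(A, 2), saturation_point(A, S_tab))
```

If the assumed reading is right, each printed row matches the corresponding A and saturation-point entries of the table.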
7.2. Experiment 1: Generalization Bound Tightness
Observations.
7.3. Experiment 2: Risk Scaling
Observations.
7.4. Experiment 3: Saturation Point Verification
Observations.
7.5. Experiment 4: McAllester vs. Catoni Crossover
Observations.
7.6. Real Transformer Setup
Model and protocol.
Tasks.
Estimation protocol.
7.7. Experiment 5: Bound Verification with GPT-2
Observations.
7.8. Experiment 6: KL Behavior Under Softmax Attention
Observations.
7.9. Experiment 7: Multi-Task Complexity Estimation
Observations.
8. Discussion
8.1. Implications for Model Design
- Dimensionality reduction: Reducing the effective dimension d (via projection heads or attention bottlenecks) linearly reduces the demonstration requirement.
- Pretraining quality: Improving the SNR through better pretraining data directly improves ICL sample efficiency.
- Diminishing returns: Beyond the saturation point, additional demonstrations improve the bound only at a much slower rate.
8.2. ICL vs. Fine-Tuning
8.3. ICL vs. Standard Machine Learning
8.4. Limitations
1. KL reduction assumption: Proposition 1 provides a rigorous KL bound for the Gaussian model with explicit linearization error (Corollary 1). For general Transformer architectures, Assumption A1 remains a modeling hypothesis. Experiment 6 shows that softmax attention exhibits approximately constant (rather than linearly decreasing) KL divergence across k; the core PAC-Bayes bound (Theorem 2) nevertheless remains valid when evaluated with the measured KL values.
2. Bounded loss: The bounded-loss requirement excludes unbounded losses such as cross-entropy on raw logits. Extending to sub-Gaussian losses is a natural direction.
3.
4. Computability of A and S: Under the linear attention model, A and S are exactly computable. For general LLMs, computing them requires integration over the (unknown) input distribution; in practice, A and S must be estimated from a held-out sample, which introduces estimation error not captured by the bound.
5. Linear attention model: While providing clean closed-form results, the linear attention model omits softmax nonlinearity, multi-head interaction, and positional encoding effects.
6. i.i.d. demonstrations: The framework assumes demonstrations are i.i.d., which may not hold in conversational or curriculum-based ICL settings.
8.5. Open Questions
1. Can the KL linear reduction be extended to softmax attention via local linearization around the softmax operating point?
2. How do multi-task ICL bounds look when tasks share ambiguity and saliency structure?
3. Can the permutation variance bound (Theorem 6) be tightened using task-specific ordering heuristics?
4. Does a data-dependent saliency that decays with k lead to tighter bounds than the constant-S assumption?
5. What are the computational-statistical tradeoffs when the attention capacity limits the effective number of demonstrations the model can process?
9. Conclusions
Appendix A. Proofs for Core Framework (Section 3)
Appendix A.1. Proof of Theorem 2 (Basic ICL Bound)
Appendix A.2. Proof of Theorem 3 (Saliency-Enhanced Bound)
Appendix A.3. Proof of Theorem 4 (Sample Complexity)
Appendix A.4. Proof of Theorem 5 (ICL Lower Bound)
Appendix A.5. Proof of Theorem 6 (Permutation Variance)
Appendix B. Proofs for Linear Attention (Section 4)
Appendix B.1. Proofs of Theorems 1 and 2
Appendix B.2. Proof of Proposition 1 and Corollary 1
Appendix B.3. Proofs of Theorems 3–7
Appendix B.4. Proof of Theorem 8 (Linear Attention ICL Bound)
Appendix B.5. Proof of Theorem 9
Appendix C. Proofs for Minimax Optimality and Fast Rates (Section 5)
Appendix C.1. Proof of Theorem 10 (Minimax Lower Bound)
1. Risk gap: (the difference in optimal risk between the two tasks).
2. Per-sample KL divergence: .
Appendix C.2. Proof of Theorem 11
Appendix C.3. Proof of Theorem 14 (Catoni ICL Bound)
- Coefficient: .
- Denominator: .
Appendix C.4. Proof of Theorem 15
Appendix C.5. Proof of Theorem 4
Appendix D. Proofs for Extensions (Section 6)
Appendix D.1. Proofs for Data-Dependent Priors
Appendix D.2. Proofs for Bernstein Variance-Adaptive Bounds
Appendix E. Classical Results Used
1. KL non-negativity (Gibbs inequality): $\mathrm{KL}(Q \,\|\, P) \ge 0$ whenever $Q \ll P$. See [29], Theorem 2.6.3.
2. Hoeffding’s lemma [18]: If X satisfies $\mathbb{E}[X] = 0$ and $a \le X \le b$, then $\mathbb{E}[e^{\lambda X}] \le \exp\!\left(\lambda^2 (b-a)^2 / 8\right)$.
3. Donsker–Varadhan variational inequality [17]: For any measurable f and $Q \ll P$: $\mathbb{E}_Q[f] \le \mathrm{KL}(Q \,\|\, P) + \log \mathbb{E}_P[e^{f}]$.
4. McAllester’s PAC-Bayes bound [12]: See Theorem 1.
5. Fano’s inequality [19]: .
6. Bounded differences (McDiarmid’s inequality) [30]: Variance bound via Efron–Stein; see Theorem 6.
7. Gaussian KL closed form: For and : . See [29].
8. Catoni’s PAC-Bayes bound [13]: Temperature-parameterized bound with .
9. Le Cam’s two-point method [22]: .
10. KL tensorization: $\mathrm{KL}(P^{\otimes k} \,\|\, Q^{\otimes k}) = k\,\mathrm{KL}(P \,\|\, Q)$ for product distributions.
11. Split-sample PAC-Bayes [16]: PAC-Bayes bound with data-dependent priors via sample splitting.
12. Bernstein PAC-Bayes [15]: Variance-aware PAC-Bayes bound with structure.
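Two of the classical facts listed above, the Gaussian KL closed form and KL tensorization, are easy to check numerically. The sketch below uses the standard multivariate Gaussian KL formula; the function name `kl_gaussian` and the example parameters are hypothetical, and the paper's own statement of the closed form may be written in a different but equivalent way.

```python
import numpy as np

def kl_gaussian(mu_q, cov_q, mu_p, cov_p):
    """KL( N(mu_q, cov_q) || N(mu_p, cov_p) ) in nats (standard closed form)."""
    mu_q = np.atleast_1d(mu_q).astype(float)
    mu_p = np.atleast_1d(mu_p).astype(float)
    cov_q = np.atleast_2d(cov_q).astype(float)
    cov_p = np.atleast_2d(cov_p).astype(float)
    d = mu_q.size
    diff = mu_p - mu_q
    _, logdet_q = np.linalg.slogdet(cov_q)
    _, logdet_p = np.linalg.slogdet(cov_p)
    trace_term = np.trace(np.linalg.solve(cov_p, cov_q))  # tr(cov_p^{-1} cov_q)
    maha = diff @ np.linalg.solve(cov_p, diff)            # Mahalanobis term
    return 0.5 * (trace_term + maha - d + logdet_p - logdet_q)

# Tensorization check: the KL between k-fold products of identical
# univariate Gaussians equals k times the single-coordinate KL.
k = 4
single = kl_gaussian(0.3, 1.5, 0.0, 1.0)
joint = kl_gaussian(np.full(k, 0.3), 1.5 * np.eye(k), np.zeros(k), np.eye(k))
print(single, joint, k * single)
```

The same function also illustrates non-negativity: it returns zero when both arguments describe the same distribution and a positive value otherwise.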
Appendix F. Simulation Details
Appendix F.1. Pseudocode
Algorithm 1: Bayesian Linear Regression ICL Simulation
Appendix F.2. Hyperparameters
| Parameter | Value |
|---|---|
| Dimension d | |
| Prior variance | 1.0 |
| Noise variance | |
| Signal strength s | |
| Max demonstrations | 200 (Exp. 2), 50 (others) |
| Number of trials N | 2000 |
| Confidence parameter | 0.05 |
| Random seed | 42 |
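The body of Algorithm 1 is not available in this copy, so the following is only a minimal sketch of what a Bayesian linear regression ICL simulation with these hyperparameters could look like. It assumes isotropic Gaussian inputs, a Gaussian prior on the task vector, the Bayes posterior mean as the in-context predictor, and squared-error loss. All of these are assumptions, as are the function name `simulate_icl_risk` and the reduced trial count used to keep the run short.

```python
import numpy as np

def simulate_icl_risk(d=5, prior_var=1.0, noise_var=1.0,
                      ks=(1, 5, 20, 50), n_trials=200, seed=42):
    """Monte Carlo estimate of the test risk after k demonstrations.

    Assumed setup (not confirmed by the source): theta ~ N(0, prior_var * I),
    x ~ N(0, I), y = x . theta + noise, predictor = posterior mean of theta.
    """
    rng = np.random.default_rng(seed)
    risks = {}
    for k in ks:
        errors = np.empty(n_trials)
        for t in range(n_trials):
            theta = rng.normal(0.0, np.sqrt(prior_var), size=d)
            X = rng.normal(size=(k, d))
            y = X @ theta + rng.normal(0.0, np.sqrt(noise_var), size=k)
            # Gaussian posterior: precision = X^T X / noise_var + I / prior_var
            precision = X.T @ X / noise_var + np.eye(d) / prior_var
            post_mean = np.linalg.solve(precision, X.T @ y / noise_var)
            # Fresh query from the same task
            x_new = rng.normal(size=d)
            y_new = x_new @ theta + rng.normal(0.0, np.sqrt(noise_var))
            errors[t] = (x_new @ post_mean - y_new) ** 2
        risks[k] = float(errors.mean())
    return risks

risks = simulate_icl_risk()
for k, r in risks.items():
    print(f"k={k:3d}  estimated risk={r:.3f}")
```

Raising `n_trials` to the tabulated 2000 and extending `ks` to the tabulated maximum would bring the sketch closer to the listed settings, at the cost of a longer run.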
Appendix F.3. Reproducibility
References
1. Brown, T.B.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. Language models are few-shot learners. Adv. Neural Inf. Process. Syst. 2020, Vol. 33, 1877–1901.
2. Xie, S.M.; Raghunathan, A.; Liang, P.; Ma, T. An explanation of in-context learning as implicit Bayesian inference. International Conference on Learning Representations, 2022.
3. von Oswald, J.; Niklasson, E.; Randazzo, E.; Sacramento, J.; Mordvintsev, A.; Zhmoginov, A.; Vladymyrov, M. Transformers learn in-context by gradient descent. International Conference on Machine Learning, 2023; pp. 35151–35174.
4. Akyürek, E.; Schuurmans, D.; Andreas, J.; Ma, T.; Zhou, D. What learning algorithm is in-context learning? Investigations with linear models. International Conference on Learning Representations, 2023.
5. Bai, Y.; Chen, F.; Wang, H.; Xiong, C.; Mei, S. Transformers as statisticians: Provable in-context learning with in-context algorithm selection. Adv. Neural Inf. Process. Syst. 2023, Vol. 36.
6. Li, Y.; Ildiz, M.E.; Papailiopoulos, D.; Oymak, S. Transformers as algorithms: Generalization and stability in in-context learning. International Conference on Machine Learning, 2023.
7. Jeon, H.J.; Lee, J.D.; Lei, Q.; Van Roy, B. An information-theoretic analysis of in-context learning. International Conference on Machine Learning, 2024.
8. Zhang, R.; Frei, S.; Bartlett, P.L. What and how does in-context learning learn? Bayesian model averaging, parameterization, and generalization. International Conference on Artificial Intelligence and Statistics, 2025.
9. Lu, Y.M.; Letey, M.I.; Zavatone-Veth, J.A.; Maiti, A.; Pehlevan, C. Asymptotic theory of in-context learning by linear attention. Proc. Natl. Acad. Sci. 2025, 122.
10. Liu, S.; Cai, Z.; Chen, G.; Li, X. Towards better understanding of in-context learning ability from in-context uncertainty quantification. Transactions on Machine Learning Research, 2025.
11. McAllester, D.A. PAC-Bayesian model averaging. In Proceedings of the Twelfth Annual Conference on Computational Learning Theory, 1999; pp. 164–170.
12. McAllester, D. PAC-Bayesian stochastic model selection. Mach. Learn. 2003, 51, 5–21.
13. Catoni, O. PAC-Bayesian Supervised Classification: The Thermodynamics of Statistical Learning. In Institute of Mathematical Statistics Lecture Notes – Monograph Series; Institute of Mathematical Statistics, 2007; Vol. 56.
14. Seldin, Y.; Laviolette, F.; Cesa-Bianchi, N.; Shawe-Taylor, J.; Auer, P. PAC-Bayesian inequalities for martingales. IEEE Trans. Inf. Theory 2012, 58, 7086–7093.
15. Tolstikhin, I.O.; Seldin, Y. PAC-Bayes-Empirical-Bernstein inequality. Adv. Neural Inf. Process. Syst. 2013, Vol. 26.
16. Lever, G.; Laviolette, F.; Shawe-Taylor, J. Tighter PAC-Bayes bounds through distribution-dependent priors. Theor. Comput. Sci. 2013, 473, 4–28.
17. Donsker, M.D.; Varadhan, S.R.S. Asymptotic evaluation of certain Markov process expectations for large time, I. Commun. Pure Appl. Math. 1975, 28, 1–47.
18. Hoeffding, W. Probability inequalities for sums of bounded random variables. J. Am. Stat. Assoc. 1963, 58, 13–30.
19. Fano, R.M. Transmission of Information: A Statistical Theory of Communications; MIT Press, 1961.
20. Efron, B.; Stein, C. The jackknife estimate of variance. Ann. Stat. 1981, 9, 586–596.
21. Zhang, R.; Frei, S.; Bartlett, P.L. Trained transformers learn linear models in-context. J. Mach. Learn. Res. 2024, 25, 1–55.
22. Le Cam, L. Convergence of estimates under dimensionality restrictions. Ann. Stat. 1973, 1, 38–53.
23. Dziugaite, G.K.; Roy, D.M. Computing nonvacuous generalization bounds for deep (stochastic) neural networks with many more parameters than training data. In Uncertainty in Artificial Intelligence; 2017.
24. Radford, A.; Wu, J.; Child, R.; Luan, D.; Amodei, D.; Sutskever, I. Language models are unsupervised multitask learners. OpenAI Blog 2019, 1, 9.
25. Socher, R.; Perelygin, A.; Wu, J.; Chuang, J.; Manning, C.D.; Ng, A.Y.; Potts, C. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, 2013; pp. 1631–1642.
26. Zhang, X.; Zhao, J.; LeCun, Y. Character-level convolutional networks for text classification. Adv. Neural Inf. Process. Syst. 2015, Vol. 28, 649–657.
27. Bowman, S.R.; Angeli, G.; Potts, C.; Manning, C.D. A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, 2015; pp. 632–642.
28. Maurer, A. A note on the PAC-Bayesian theorem. arXiv 2004.
29. Cover, T.M.; Thomas, J.A. Elements of Information Theory, 2nd ed.; Wiley-Interscience, 2006.
30. McDiarmid, C. On the method of bounded differences. In Surveys in Combinatorics; London Mathematical Society Lecture Note Series; Cambridge University Press, 1989; Vol. 141, pp. 148–188.




| Method | Inf-time | Computable | Sample compl. | Architecture | Lower bound | Fast rate | Var-aware |
|---|---|---|---|---|---|---|---|
| Bayesian inf. [2] | |||||||
| Gradient desc. [3] | ✓ | ||||||
| Alg. selection [5] | |||||||
| Info-theoretic [7] | ✓ | ||||||
| BMA + pretrain [8] | ✓† | ||||||
| Asymptotics [9] | ✓ | ✓ | ✓ | ||||
| This work | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| k | | | Gap | Bound | Status |
|---|---|---|---|---|---|
| 1 | 0.812 | 0.029 | 0.783 | 1.999 | ✓ |
| 2 | 0.795 | 0.031 | 0.764 | 1.368 | ✓ |
| 5 | 0.773 | 0.048 | 0.725 | 0.774 | ✓ |
| 10 | 0.654 | 0.104 | 0.550 | 0.418 | † |
| 15 | 0.576 | 0.184 | 0.392 | 0.316 | † |
| 20 | 0.496 | 0.236 | 0.260 | 0.274 | ✓ |
| 30 | 0.448 | 0.287 | 0.161 | 0.223 | ✓ |
| 50 | 0.405 | 0.324 | 0.081 | 0.173 | ✓ |
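The table above can be checked for internal consistency. In the sketch below, the two columns whose headers are missing get the placeholder names `r_hi` and `r_lo`; Gap equals their difference in every row. The Bound column is reproduced to three decimals by sqrt((KL_k + ln(1/δ)) / (2k)) with KL_k = max(A − kS, 0), using A = 5·ln 3 ≈ 5.49 and S = 0.5 (the Config C values) and δ = 0.05 (the confidence parameter in Appendix F.2). This bound form and the linear KL decay are inferred from the tabulated numbers, not quoted from Theorem 2. The footnote for † did not survive; in this table the † rows are exactly those where Gap exceeds Bound.

```python
import math

# (k, r_hi, r_lo, gap, bound, status) copied from the table. "r_hi" and "r_lo"
# are placeholder names; "ok" stands for the check mark, "dagger" for the † mark.
ROWS = [
    (1, 0.812, 0.029, 0.783, 1.999, "ok"),
    (2, 0.795, 0.031, 0.764, 1.368, "ok"),
    (5, 0.773, 0.048, 0.725, 0.774, "ok"),
    (10, 0.654, 0.104, 0.550, 0.418, "dagger"),
    (15, 0.576, 0.184, 0.392, 0.316, "dagger"),
    (20, 0.496, 0.236, 0.260, 0.274, "ok"),
    (30, 0.448, 0.287, 0.161, 0.223, "ok"),
    (50, 0.405, 0.324, 0.081, 0.173, "ok"),
]

A = 5 * math.log(3)   # assumed: Config C ambiguity, (d / 2) * ln(1 + SNR)
S = 0.5               # assumed: Config C saliency
DELTA = 0.05          # confidence parameter from the hyperparameter table

def bound(k):
    """Inferred bound: sqrt((max(A - k*S, 0) + ln(1/delta)) / (2k))."""
    kl = max(A - k * S, 0.0)
    return math.sqrt((kl + math.log(1.0 / DELTA)) / (2.0 * k))

for k, r_hi, r_lo, gap, tab_bound, status in ROWS:
    mark = "ok" if gap <= tab_bound else "dagger"
    print(k, round(r_hi - r_lo, 3), round(bound(k), 3), mark)
```

Under these assumptions the KL term reaches zero between k = 10 and k = 15, after which the bound decays purely as the confidence term over 2k.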
| k | | | | Bd(KL) | Bd(A) | Valid |
|---|---|---|---|---|---|---|
| 1 | 0.534 | 0.800 | 0.084 | 1.241 | 1.353 | ✓ |
| 2 | 0.568 | 1.000 | 0.143 | 0.886 | 0.956 | ✓ |
| 4 | 0.525 | 0.788 | 0.118 | 0.624 | 0.676 | ✓ |
| 8 | 0.529 | 0.850 | 0.112 | 0.441 | 0.478 | ✓ |
| 16 | 0.515 | 0.731 | 0.089 | 0.311 | 0.338 | ✓ |
| 24 | 0.518 | 0.800 | 0.098 | 0.254 | 0.276 | ✓ |
| Task | (nats) | |||||
|---|---|---|---|---|---|---|
| SST-2 | 0.661 | 0.041 | 16 | 0.670 | 0.497 | 0.173 |
| AG News | 0.785 | 0.033 | 24 | 0.330 | 0.572 | −0.242 |
| SNLI | 0.558 | 0.023 | 24 | 0.670 | 0.645 | 0.025 |
| Aspect | Standard ML | ICL (this work) |
|---|---|---|
| Posterior Q | Over parameters | Over predictions |
| Prior P | Design choice | Pretrained zero-shot |
| KL term measures | Parameter deviation | Prediction shift (ambiguity A) |
| Complexity driver | (model size) | A (task–model mismatch) |
| Sample complexity | , | |
| Saturation | Eventual (-dependent) | Hard, at |
| Learning mechanism | Parameter update (SGD) | Attention conditioning (frozen) |
| Fast rate condition | Low noise (Bernstein) | Low ambiguity (close prior) |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).