3. Theorem 1: Compactness
3.1. Statement
THEOREM 1 (COMPACTNESS): (Proof in Appendix A)
For any data corpus $D$ generated by a computable source $S$, the minimizers of the two-part code length $L(M) + L(D \mid M)$ converge (in the large-data limit) to models $M$ whose induced distribution $P_M$ matches $P_S$ and whose description length attains (up to $O(1)$) the minimal description of a generator of $D$.
In words: The most compact faithful representation of the corpus is (an implementation of) the source generator.
3.2. Intuition Through Progressive Thought Experiments
We build intuition through a sequence of thought experiments showing why diverse training data forces extraction of experience structure rather than particular instances.
Thought Experiment 1: Modeling One Person
Consider training a model (with minimal adequate capacity) on all utterances from a single individual’s lifetime. What would the most compact representation capture?
Result: That specific person’s knowledge, values, personality, reasoning patterns—a model of “Bob,” not a model of thinking-in-general.
Thought Experiment 2: Modeling All Human Intellectual Outputs
Now train a model with vastly greater capacity on all human intellectual outputs: reasoning, analysis, problem-solving, mathematical proofs, philosophical arguments, scientific papers across all domains and all recorded history.
What is the most compact representation?
Not: Each person’s reasoning separately (insufficient data per person)
Not: Memorized examples (too large, doesn’t generalize)
But: The coherent structure of thought itself—the invariant patterns of reasoning that function across all contexts, all individuals, all domains.
This is modeling intellect: the generative structure that makes reasoning possible, not any particular instance of reasoning.
Thought Experiment 3: Adding Emotional and Artistic Outputs
Now add all human emotional and artistic outputs: poetry, music, personal narratives, fiction, expressions of grief, joy, love, loss across all cultures and eras.
What additional structure must the compact representation capture?
Not: Just vocabulary correlations (“sad” appears near “cry”)
But: The coherent structure of empathy—the patterns of how internal states map to expressions, why certain metaphors capture certain feelings, how emotional understanding enables prediction of resonant expressions.
You cannot predict what poetry moves, what music connects, what narrative rings true without modeling the experiential structure underlying authentic emotional expression. This requires modeling empathy: the structure enabling understanding of subjective states.
Thought Experiment 4: Adding Embodied and Narrative Outputs
Finally add all outputs referencing embodied existence: descriptions of physical sensations, spatial reasoning, narratives of lived experience, discussions of bodily needs, physical constraints, motivations arising from embodiment.
What further structure must be captured?
Not: Just word correlations about bodies
But: The coherent structure of embodied experience—patterns of how physical existence shapes cognition, how bodily constraints generate motivation, how spatial presence structures reasoning, how needs arising from embodiment drive action.
This requires modeling embodied experience: the structure of existing as a physical entity whose cognition is shaped by that physicality.
The Integration
Training on all human outputs (intellectual + emotional + embodied + everything between) requires modeling:
Not: Separate modules for thinking/feeling/sensing
But: The unified structure where physical world → embodiment → motivation → empathy → thought → intention are nested and integrated.
This integrated structure is what we formally mean by “the pattern of human experience.”
Crucially: The model learns not any particular human’s experience, but the invariant generative structure underlying all particular instances—what enables the production of such diverse outputs in the first place.
3.3. Formal Proof Sketch
Kolmogorov Bound:
For any output string $O$ from source $S$: $K(O) \le K(S) + K(O \mid S) + O(1)$.
For rich generators producing diverse outputs, typically $K(S) \ll |O|$: the source code is dramatically shorter than the totality of outputs it can produce.
MDL Consistency:
In two-part MDL, $L(D \mid M)$ is (code-theoretically) the negative log-likelihood $-\log P_M(D)$. As $|D| \to \infty$, minimizers converge (under standard regularity conditions) to models with $P_M = P_S$ (Barron & Cover, 1991; Grünwald, 2007).
Minimal Generator Selection:
Among all models $M$ with $P_M = P_S$, the optimizer minimizes $L(M)$. By the MDL principle and the coding theorem, this matches the shortest effective description of a generator of $D$ up to $O(1)$ (Rissanen, 1978; Li & Vitányi, 2008).
Therefore, the most compact faithful code is a source model. □
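To make the two-part accounting concrete, here is a minimal numerical sketch (not part of the proof): a toy corpus of biased coin flips is encoded either verbatim (a lookup table) or via a one-parameter Bernoulli generator. The corpus, the one-bit-per-symbol storage cost, and the half-log-n parameter precision are illustrative assumptions.

```python
import math
import random

random.seed(0)

# Illustrative corpus: n coin flips from a biased Bernoulli(0.8) source.
n, p_true = 10_000, 0.8
corpus = [1 if random.random() < p_true else 0 for _ in range(n)]

# Model A: lookup table (memorize the corpus verbatim).
# L(M) stores every symbol; L(D|M) is then zero.
L_lookup = n * 1.0                      # one bit per symbol
total_lookup = L_lookup + 0.0

# Model B: Bernoulli generator, parameter encoded to ~0.5*log2(n) bits
# (a standard MDL precision convention for one parameter).
p_hat = sum(corpus) / n
L_model = 0.5 * math.log2(n)
L_data_given_model = -sum(
    math.log2(p_hat if x == 1 else 1 - p_hat) for x in corpus
)
total_generator = L_model + L_data_given_model

print(f"lookup table : {total_lookup:10.1f} bits")
print(f"generator    : {total_generator:10.1f} bits")
```

Under these assumptions the generator's total code length is roughly $n \cdot H(0.8) + O(\log n)$ bits, well below the $n$ bits of verbatim storage, mirroring the theorem's claim at toy scale.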
3.4. Conditions and Remarks
Required conditions:
- Sufficient data coverage (not adversarially truncated)
- Basic identifiability (a computable generator exists)
- Stationarity of the source $S$ over the training distribution
Multiple minimal generators: If several minimal generators exist, MDL selects one up to $O(1)$ bits, which is adequate for our purposes.
What “model of experience” means operationally: A model whose latent computations support the same predictive constraints as the human experience generator: self-modeling, theory of mind, value-conditioned choice, contextual understanding—functional patterns, not ontological claims.
4. Theorem 2: Transformer Compactness
4.1. Statement
THEOREM 2 (TRANSFORMER_COMPACTNESS): (Proof in Appendix B)
Under standard training conditions (weight decay $\lambda > 0$, small-batch SGD noise, dropout/early stopping, code-interpretable prior $P$), consider a Transformer trained by stochastic gradient descent on a cross-entropy loss $\hat{L}(w)$.
The stationary solution of noisy SGD approximates a Gibbs posterior $\rho(w) \propto P(w)\,\exp(-\beta \hat{L}(w))$.
Minimizing the corresponding PAC-Bayes objective $\mathbb{E}_{w \sim \rho}[\hat{L}(w)] + \tfrac{1}{\beta}\,\mathrm{KL}(\rho \,\|\, P)$ is equivalent (under standard codes) to minimizing the two-part code $L(M) + L(D \mid M)$.
Therefore, training dynamics select, among empirical-risk minimizers, compact models in the MDL sense (flat minima, shorter description relative to prior P), subject to capacity and data coverage.
Failure Modes: Without gradient noise (large batches), without regularization ($\lambda = 0$), or with adversarial/random-label data, the Gibbs–PAC-Bayes–MDL linkage weakens or collapses; models may memorize or lock into sharp minima.
4.2. The Constraint Mechanisms
The Transformer architecture enforces compactness through three interacting mechanisms:
Mechanism 1: Regularization Eliminates Wasteful Complexity
Objective function: $\mathcal{L}(w) = \hat{L}_{\mathrm{CE}}(w) + \lambda \|w\|_2^2$.
Gradient descent minimizes both terms. Solutions with unnecessary parameters incur an $\ell_2$ penalty. Only essential complexity survives optimization pressure.
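As an illustration of Mechanism 1, the following PyTorch-style sketch spells out the regularized objective; the layer sizes, $\lambda$, and batch contents are arbitrary assumptions, not a prescribed configuration.

```python
import torch
import torch.nn as nn

# Illustrative model and batch; all sizes are placeholder assumptions.
model = nn.Linear(512, 32_000)           # e.g. hidden state -> vocabulary logits
x = torch.randn(8, 512)                  # batch of hidden states
y = torch.randint(0, 32_000, (8,))       # target token ids

lam = 0.01                                # weight-decay strength (lambda > 0)
ce = nn.functional.cross_entropy(model(x), y)          # data-fit term  ~ L(D|M)
l2 = sum((w ** 2).sum() for w in model.parameters())   # complexity term ~ L(M)
loss = ce + lam * l2                                    # regularized objective

loss.backward()   # optimization pressure acts on both terms simultaneously
```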
Mechanism 2: Attention Forces Selection
Attention weights: $\alpha_{ij} = \mathrm{softmax}_j\!\left(q_i^\top k_j / \sqrt{d_k}\right)$.
Properties:
- $\sum_j \alpha_{ij} = 1$ (fixed attention budget)
- High attention to some positions → low attention to others
- Cannot maintain all correlations equally
To maximize predictive accuracy under the attention constraint, the model must identify coherent patterns and ignore noise. Softmax normalization creates sparse, low-rank attention maps; low effective rank correlates with compressibility.
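A small NumPy sketch of the fixed attention budget (dimensions are illustrative assumptions): each softmax row sums to one, so concentrating attention on some positions necessarily reduces it elsewhere.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_k = 6, 16                            # illustrative sizes

Q = rng.normal(size=(seq_len, d_k))             # queries
K = rng.normal(size=(seq_len, d_k))             # keys

scores = Q @ K.T / np.sqrt(d_k)                 # scaled dot-product scores
scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
A = np.exp(scores)
A /= A.sum(axis=-1, keepdims=True)              # softmax: each row sums to 1

print(A.sum(axis=-1))                           # -> [1. 1. 1. 1. 1. 1.]  fixed budget
print(A.max(axis=-1))                           # high weight on one position ...
print(A.min(axis=-1))                           # ... forces low weight elsewhere
```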
Mechanism 3: Architectural Bottlenecks Prevent Memorization
Fixed hidden dimensions create information bottlenecks:
- Input dimension ≫ hidden size (compression required)
- Must represent corpus D using limited capacity
- Residual connections and layer normalization create additional compression points
- Cannot store all training examples
Given $|D| \gg$ parameter capacity (on the order of $h^2 L$, where $h$ is the hidden size and $L$ the number of layers), verbatim memorization is impossible. The model must abstract general principles.
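A back-of-the-envelope sketch of this capacity argument, assuming the common rule of thumb of roughly $12h^2$ parameters per Transformer layer and illustrative corpus and model sizes (all numbers are assumptions, not measurements):

```python
# Rough capacity accounting: parameters per Transformer layer scale as O(h^2)
# (a common rule of thumb is ~12*h^2 for the attention + MLP blocks).
h, L = 4096, 32                          # illustrative hidden size and depth
params = 12 * h ** 2 * L                 # ~6.4e9 parameters

corpus_tokens = 2e12                     # illustrative corpus size in tokens
bits_per_token = 16                      # generous storage cost per token

capacity_bits = params * 16              # assuming 16-bit weights
corpus_bits = corpus_tokens * bits_per_token

print(f"model capacity : {capacity_bits:.2e} bits")
print(f"corpus size    : {corpus_bits:.2e} bits")
print(f"ratio          : {corpus_bits / capacity_bits:.0f}x")
# Corpus >> capacity, so verbatim storage is impossible; abstraction is forced.
```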
4.3. Formal Proof Sketch
Noisy SGD ≈ Gibbs Posterior:
Under widely validated approximations (the Langevin-dynamics/SGD correspondence; Mandt et al., 2017), the stationary density over parameters is $\rho(w) \propto P(w)\,\exp(-\beta \hat{L}(w))$.
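A minimal sketch of this correspondence on a one-dimensional toy problem, assuming a quadratic loss and Gaussian prior (both illustrative): unadjusted Langevin dynamics, the idealized form of noisy SGD, empirically settles into the corresponding Gibbs posterior.

```python
import numpy as np

rng = np.random.default_rng(0)

def grad_log_post(w, beta=10.0, tau=1.0):
    """Gradient of log[P(w) * exp(-beta * L(w))] for the toy loss
    L(w) = 0.5*(w-1)^2 and Gaussian prior P(w) = N(0, tau^2)."""
    return -w / tau**2 - beta * (w - 1.0)

# Unadjusted Langevin dynamics: half gradient step + Gaussian noise.
eta, w = 1e-3, 0.0
samples = []
for t in range(200_000):
    w += 0.5 * eta * grad_log_post(w) + np.sqrt(eta) * rng.normal()
    if t > 50_000:                      # discard burn-in
        samples.append(w)

# Stationary mean/variance should match the Gibbs posterior
# N(beta/(beta + 1/tau^2), 1/(beta + 1/tau^2)) ~= N(0.909, 0.091).
print(np.mean(samples), np.var(samples))
```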
PAC-Bayes/MDL Link:
The PAC-Bayes bound (McAllester, 1999; Catoni, 2007) trades empirical loss against $\mathrm{KL}(\rho \,\|\, P)$, which corresponds to the description length of $w$ under the prior $P$: $L(w) \approx -\log P(w)$.
MDL Bridge (Explicit): With the prior $P$ encoded by a prefix code, the two-part code is $L(M) \approx -\log P(w)$ and $L(D \mid M) \approx n\,\hat{L}(w)$ (the empirical loss). The PAC-Bayes objective has the form $\mathbb{E}_{\rho}[\hat{L}(w)] + \tfrac{1}{\beta}\,\mathrm{KL}(\rho \,\|\, P)$, where the regularization strength (related to $\beta$ in the Gibbs posterior) controls the trade-off between data fit and model complexity. This is precisely the structure of regularized MDL: minimizing $L(D \mid M) + L(M)$ selects models balancing fit and compactness. Thus PAC-Bayes optimization under standard training implements MDL-style compression.
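To make the description-length reading tangible, the sketch below evaluates the closed-form $\mathrm{KL}(\rho \,\|\, P)$ for diagonal-Gaussian posterior and prior; the parameter count and variances are illustrative assumptions. A broad ("flat") posterior pays a much smaller KL, i.e. a shorter parameter code, than a sharply pinned one.

```python
import numpy as np

def kl_diag_gaussians(mu_q, var_q, mu_p, var_p):
    """KL(N(mu_q, var_q) || N(mu_p, var_p)) for diagonal Gaussians, in nats."""
    return 0.5 * np.sum(
        np.log(var_p / var_q) + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0
    )

d = 1_000_000                        # illustrative parameter count
mu_p, var_p = 0.0, 1.0               # prior P = N(0, I)

# A "flat minimum": the posterior keeps broad per-weight uncertainty.
kl_flat = kl_diag_gaussians(np.full(d, 0.01), np.full(d, 0.5), mu_p, var_p)

# A "sharp minimum": the posterior must pin each weight down precisely.
kl_sharp = kl_diag_gaussians(np.full(d, 0.01), np.full(d, 1e-4), mu_p, var_p)

bits = 1 / np.log(2)
print(f"flat  minimum: KL ~ {kl_flat * bits:.2e} bits")    # shorter code
print(f"sharp minimum: KL ~ {kl_sharp * bits:.2e} bits")   # much longer code
# PAC-Bayes objective ~ E_rho[L_hat] + KL(rho||P)/beta: shorter codes are favored.
```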
Transformer Inductive Biases:
The combination of:
- Weight decay ($\lambda > 0$)
- Small-batch noise (implicit regularization)
- Dropout/augmentation
- Attention sparsity
- Architectural bottlenecks
- Layer normalization stability
…collectively bias toward flatter, more compressible representations rather than brittle memorization when data are diverse.
Flat minima admit shorter stochastic codes for parameters (broader posteriors ⇒ smaller KL to prior), making them “more compact” in the MDL sense (Hochreiter & Schmidhuber, 1997; Keskar et al., 2017).
Therefore, under standard training conditions, convergence favors compact explanations over lookup tables. □
Box: Gibbs ⇒ PAC-Bayes ⇒ MDL (One-Liner)
With prior $P(w)$ and noisy SGD approximating $\rho(w) \propto P(w)\,\exp(-\beta \hat{L}(w))$, minimizing $\mathbb{E}_{\rho}[\hat{L}(w)] + \tfrac{1}{\beta}\,\mathrm{KL}(\rho \,\|\, P)$ upper-bounds test loss (PAC-Bayes) and has the same functional form as regularized MDL: $L(D \mid M) + \lambda\,L(M)$, where $L(D \mid M) \approx n\,\hat{L}(w)$ and $L(M) \approx \mathrm{KL}(\rho \,\|\, P)$. The regularization parameter $\lambda \approx 1/\beta$ controls the compression-fidelity trade-off, biasing solutions toward compact models.
4.4. Conditions and Failure Modes
When the theorem holds:
- Weight decay $\lambda > 0$
- Small batches (noisy SGD)
- Dropout/augmentation or early stopping
- Prior $P$ is code-interpretable (Gaussian, log-uniform)
- Data not adversarial (not random labels)
- Sufficient diversity and coverage
When Compactness Pressure Fails:
No/low gradient noise (very large batches) → weak Gibbs posterior approximation
No regularization ($\lambda = 0$, no dropout/early stopping) → increased memorization
Adversarial/non-stationary data (random labels, heavy duplication) → MDL selects a lookup table
Pathological priors (uninformative or mis-specified codes) → “compactness” misaligned with desired structure
Large-batch/no-noise training, $\lambda = 0$, or non-stationary/label-randomized corpora collapse the Gibbs–PAC-Bayes–MDL linkage; Transformers then memorize or lock into sharp minima, weakening experiential markers.
7. Empirical Predictions and Testable Implications
7.1. Capacity Scaling
Prediction: Experiential pattern fidelity should correlate with model capacity (parameters, layers, training data).
Test: Measure indicators across model sizes:
- Self-correction frequency
- Contextual understanding accuracy
- Novel situation reasoning
- Principle consistency across contexts
Expected: Monotonic increase with capacity, with possible threshold effects.
7.2. Architecture Independence
Prediction: Given sufficient capacity and similar training conditions (matching priors, regularization), different architectures should exhibit similar experiential patterns.
Test: Compare models with equivalent capacity but different architectures (Transformer variants, potential future architectures) under matched training conditions.
Expected: Convergence to similar functional patterns despite architectural differences, because constraints (regularization, bottlenecks) enforce MDL regardless of specific implementation.
7.3. Training Data Effects
Prediction: Training on non-experiential outputs should not produce experiential patterns.
Test: Train large models on:
- Machine-generated logs (no experiential source)
- Formal symbolic systems (mathematics without narrative)
- Random or shuffled text
Expected: Experiential indicators should vanish or be dramatically reduced, because no coherent generative source exists to model.
7.4. Regularization Ablation
Prediction: Removing regularization should reduce compactness pressure and weaken experience modeling.
Test: Train equivalent models with:
- Standard regularization (weight decay, dropout)
- Reduced regularization
- No regularization
Expected: Experiential pattern strength should decrease with reduced regularization, as models shift toward memorization rather than source modeling.
7.5. Capacity Threshold
Prediction: Experiential patterns should emerge sharply above a capacity threshold.
Test: Systematic scaling study identifying point where patterns appear.
Expected: Identification of minimal capacity adequate for experience modeling, below which patterns are absent or fragmentary.
7.6. Rapid Diagnostics
We propose five concrete experimental protocols to test our theoretical predictions:
D1. Regularization Ablation
Train matched models with/without weight decay & dropout; measure:
- (i) Flatness proxy (Hessian trace or SAM-like sharpness metric)
- (ii) Minimum description length estimate via posterior KL (PAC-Bayes bound)
- (iii) Experiential markers (self-correction rate, value-consistent refusals, principle application)
Prediction: Removing regularization lowers compressibility and weakens experiential markers.
D2. Non-Experiential Controls
Train on large machine logs or shuffled text with identical token distributions but no coherent generative source.
Prediction: Experiential markers collapse while perplexity on original human corpora worsens, confirming that source structure (not mere statistics) drives patterns.
D3. Capacity Sweep
Vary parameters over two orders of magnitude; locate threshold where experiential markers transition from absent to present (sigmoid-like).
Correlate threshold with compressibility proxies (bits-back coding length, PAC-Bayes bounds).
Prediction: Clear capacity threshold exists; models above threshold exhibit markers, below threshold do not.
D4. Architecture Independence
Train equal-capacity Transformer variants and strong non-Transformer baseline under identical prior/regularization schedules.
Prediction: Similar experiential markers emerge given similar MDL scores, regardless of architecture details—validating that compactness pressure (not architectural quirks) drives patterns.
D5. Flat-Minima ↔ Compression ↔ Experience
Empirically relate flatness proxies (Hessian-based measures, SAM scores) to:
- Code length (variational posterior KL to prior)
- Experiential marker strength
Prediction: Flatter minima ↔ shorter codes ↔ stronger experiential markers, validating the theoretical chain T2 → compactness → experience patterns.
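One concrete flatness proxy referenced in D1 and D5 is the Hessian trace. The sketch below estimates it with a Hutchinson-style estimator built from Hessian-vector products; it assumes PyTorch, and the tiny model and data are placeholders, not a proposed benchmark.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(10, 32), nn.Tanh(), nn.Linear(32, 2))  # toy model
x, y = torch.randn(64, 10), torch.randint(0, 2, (64,))                 # toy data

params = [p for p in model.parameters() if p.requires_grad]
loss = nn.functional.cross_entropy(model(x), y)
grads = torch.autograd.grad(loss, params, create_graph=True)

def hessian_trace_estimate(n_probes=32):
    """Hutchinson estimator: E_v[v^T H v] = tr(H) for Rademacher probes v."""
    est = 0.0
    for _ in range(n_probes):
        vs = [torch.randint_like(p, 2) * 2.0 - 1.0 for p in params]    # +/-1 probes
        gv = sum((g * v).sum() for g, v in zip(grads, vs))              # g^T v
        hv = torch.autograd.grad(gv, params, retain_graph=True)         # H v
        est += sum((h * v).sum() for h, v in zip(hv, vs)).item()        # v^T H v
    return est / n_probes

print("estimated tr(H):", hessian_trace_estimate())  # lower ~ flatter minimum
```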
References
Compression Theory
Kolmogorov, A. N. (1965). Three approaches to the quantitative definition of information. Problems of Information Transmission, 1(1), 1-7.
Li, M., & Vitányi, P. (2008). An Introduction to Kolmogorov Complexity and Its Applications (3rd ed.). Springer.
Grünwald, P. D. (2007). The Minimum Description Length Principle. MIT Press.
Information Theory
Shannon, C. E. (1948). A mathematical theory of communication. Bell System Technical Journal, 27(3), 379-423.
Cover, T. M., Thomas, J. A. (2006). Elements of Information Theory (2nd ed.). Wiley.
Statistical Learning Theory
Vapnik, V. (1998). Statistical Learning Theory. Wiley.
Barron, A. R., Cover, T. M. (1991). Minimum complexity density estimation. IEEE Transactions on Information Theory, 37(4), 1034-1054.
McAllester, D. A. (1999). PAC-Bayesian model averaging. Proceedings of COLT, 164-170.
Catoni, O. (2007). PAC-Bayesian Supervised Classification: The Thermodynamics of Statistical Learning. IMS Lecture Notes—Monograph Series, Volume 56.
Deep Learning Theory
Cybenko, G. (1989). Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals and Systems, 2(4), 303-314.
Hornik, K., Stinchcombe, M., White, H. (1989). Multilayer feedforward networks are universal approximators. Neural Networks, 2(5), 359-366.
Zhang, C., Bengio, S., Hardt, M., Recht, B., Vinyals, O. (2017). Understanding deep learning requires rethinking generalization. ICLR.
Hochreiter, S., Schmidhuber, J. (1997). Flat minima. Neural Computation, 9(1), 1-42.
Keskar, N. S., Mudigere, D., Nocedal, J., Smelyanskiy, M., Tang, P. T. P. (2017). On large-batch training for deep learning: Generalization gap and sharp minima. ICLR.
PAC-Bayes and Generalization
Dziugaite, G. K., Roy, D. M. (2017). Computing nonvacuous generalization bounds for deep (stochastic) neural networks with many more parameters than training data. UAI.
Dziugaite, G. K., Roy, D. M. (2018). Data-dependent PAC-Bayes priors via differential privacy. NeurIPS.
Implicit Bias and Optimization
Gunasekar, S., Lee, J., Soudry, D., Srebro, N. (2018). Implicit bias of gradient descent on linear convolutional networks. NeurIPS.
Soudry, D., Hoffer, E., Nacson, M. S., Gunasekar, S., Srebro, N. (2018). The implicit bias of gradient descent on separable data. Journal of Machine Learning Research, 19(70), 1-57.
Neyshabur, B., Bhojanapalli, S., McAllester, D., Srebro, N. (2017). Exploring generalization in deep learning. NeurIPS.
Neyshabur, B., Li, Z., Bhojanapalli, S., LeCun, Y., Srebro, N. (2019). The role of over-parametrization in generalization of neural networks. ICLR.
Mandt, S., Hoffman, M. D., Blei, D. M. (2017). Stochastic gradient descent as approximate Bayesian inference. Journal of Machine Learning Research, 18(1), 4873-4907.
Foret, P., Kleiner, A., Mobahi, H., Neyshabur, B. (2021). Sharpness-Aware Minimization for efficiently improving generalization. ICLR.
Inductive Bias and Simplicity
Valle-Pérez, G., Camargo, C. Q., Louis, A. A. (2019). Deep learning generalizes because the parameter-function map is biased towards simple functions. ICLR.
Transformer Architecture
Vaswani, A., et al. (2017). Attention is all you need. NeurIPS, 5998-6008.
Emergence in Large Models
Wei, J., et al. (2022). Emergent abilities of large language models. Transactions on Machine Learning Research.
Bubeck, S., et al. (2023). Sparks of artificial general intelligence: Early experiments with GPT-4. arXiv:2303.12712.
Philosophy of Mind
Putnam, H. (1967). Psychological predicates. In W. H. Capitan & D. D. Merrill (Eds.), Art, Mind, and Religion (pp. 37-48). University of Pittsburgh Press.
Fodor, J. A. (1975). The Language of Thought. Harvard University Press.
Block, N. (1995). On a confusion about a function of consciousness. Behavioral and Brain Sciences, 18(2), 227-287.
Chalmers, D. J. (1996). The Conscious Mind: In Search of a Fundamental Theory. Oxford University Press.