Preprint · Article · This version is not peer-reviewed.

Generalization as the Great Leap in Evolvability: Insights from Machine Learning

Submitted: 03 May 2026
Posted: 05 May 2026

Abstract
Natural selection encodes learned information in the genome. Learned solutions may be tuned specifically to past challenges, failing in altered environments. Or solutions can be general, capturing the essential structure of the challenge and performing well across variations within the abstract class. For example, a neural system might recognize the exact outlines of a rattlesnake but not other snakes, or it might recognize the essence of snakeness. The problem of how a system generalizes is a fundamental aspect of evolvability, the ability of a system to learn broad solutions to novel challenges. In recent years, machine learning has significantly advanced our understanding of when systems generalize their learned solutions and how they accomplish such generalization. One surprising discovery overturned conventional wisdom about learning. Large systems, with more adjustable parameters than the dimensions of the incoming data, do not merely memorize the data patterns in the way suggested by traditional theory. Instead, systems with more parameters generalize better than smaller systems. Because natural selection is a learning algorithm, the new theory of generalization applies to biological evolution. Specifically, increasing regulatory complexity and parameterization associates with increasing evolvability for the discovery of general solutions. This link between genomic complexity and generalization may have been a primary driving force in evolutionary history.

Introduction

The distinction between memorization and generalization is the central problem of learning. Natural selection is a basic learning algorithm, inevitably facing this same problem [1,2].
In the past, statistics and machine learning predicted a tradeoff. A system that is too simple cannot capture general patterns; it underfits. A system that is too complex memorizes every idiosyncratic fluctuation in the data, overfitting and missing any underlying generality. Between those extremes, there should be a sweet spot of intermediate complexity, at which the system is just large enough to capture the general nature of the problem without memorizing the noise.
Recent advances in machine learning have overturned that seemingly fundamental tradeoff. Vastly overparameterized systems generalize remarkably well compared with smaller systems. The great breakthroughs in artificial intelligence come from gigantic systems, with far greater numbers of parameters than the dimensionality of the data used for learning [3,4].
These observations have led to new theory about how systems learn to generalize. This article translates that theory into biology, where natural selection faces the same fundamental problem of learning and generalization [2,5,6]. Three questions outline the problem.
How do biological systems generalize their solutions to challenges? For example, an immune system might memorize specific danger signals, or it might have the general ability to infer danger by combining the weight of evidence from multiple indicators.
What characteristics of genomic regulatory structure or neural wiring associate with memorization versus generalization? The new theory suggests that very large parameter spaces generalize better, which means that complexly overwired gene regulatory networks or neural networks should generalize more broadly.
What comparative predictions arise over the history of life with regard to the tendency for greater generalization? Transitions from prokaryotes to single-cell eukaryotes to multicellular eukaryotes associate with increasing regulatory dimensionality, predicting an increasing historical trend in the evolvability of generalization.
This new perspective changes the way that we think about fitness landscapes. A fitness landscape describes how well genotypes perform in a particular environment. Machine learning calls that landscape a training-error surface, measuring the distance between a genotype’s performance and an optimal solution to the challenge.
The new learning theory has shown that training error alone is insufficient for understanding generalization [3]. In highly overparameterized systems, many different solutions fit the training data equally well. The excess dimensions create large subspaces of equivalent training performance but varying generalization ability. Within these subspaces, learning dynamics preferentially reach simple, smooth solutions that capture the general structure of the problem rather than its idiosyncratic details [7].
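The claim that excess dimensions create subspaces of equivalent training performance can be made concrete with a small linear example. The following is a minimal numpy sketch (all sizes and seeds are illustrative): with more parameters than observations, infinitely many parameter vectors fit the data exactly, differing only along the null space of the data matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 5, 20                      # 5 observations, 20 parameters: underdetermined
X = rng.normal(size=(n, p))
y = rng.normal(size=n)

# Minimum-norm interpolator: one point in the subspace of exact fits.
w_min = np.linalg.pinv(X) @ y

# Any direction in the null space of X yields another exact interpolator.
_, _, Vt = np.linalg.svd(X)
null_basis = Vt[n:]               # (p - n) directions of equivalent training fit
w_other = w_min + 3.0 * null_basis[0]

# Both fit the training data exactly, but they differ in size,
# a simple proxy for the simplicity or smoothness of the solution.
print(np.linalg.norm(w_min), np.linalg.norm(w_other))
```

The minimum-norm solution is one natural notion of "simple" within the subspace of equivalent fits; the point of the sketch is only that the subspace exists and that its members differ in complexity.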
High dimensionality also protects against a system’s tendency to confuse true structure with noise. Systems with excess dimensions can avoid this confusion because they have enough degrees of freedom to separate noise from true pattern. Overparameterized systems do not just have more ways to fit. They have more ways to fit well once they find the right solution region, and the dynamics of learning guide them to the right place [8].
These principles predict that as regulatory circuits become more complex and highly parameterized, they will tend to shift from particular to general solutions. Thus, we can order life’s tendency for generality. Prokaryotes, with compact genomes, likely encode relatively specific responses to specific signals. Single-celled eukaryotes, with more complex regulatory control, tend toward more general solutions. Multicellular organisms should have the greatest generalization.

Theory of Generalization

Classical statistical theory predicts that a system with more parameters than data points will overfit. It will memorize every detail of the training set, including noise. An overfit system predicts outputs poorly when confronted with new inputs that differ from the original data.
Overfitting peaks as one approaches the interpolation threshold, at which the number of parameters just matches the number of constraints imposed by the data. At that threshold, there is essentially one solution that closely fits, a purely memorized description of the data.
The surprise from recent studies in machine learning is what happens beyond the interpolation threshold. As the number of parameters grows far past the number needed to fit the data, the system’s ability to predict outputs for novel data improves [3,4].
Generalization error measures a system's ability to predict outputs for new inputs. Plotted as a function of the number of system parameters, that error traces a double descent (Figure 1).
At first, the error declines as the number of parameters increases from zero, because each new parameter provides some power to match the data. But then the system begins to memorize too much, and the error against new data rises toward the interpolation threshold. After that threshold, with more parameters, the error descends again, the surprising second descent.
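The parameter-wise double descent can be reproduced in a few lines with minimum-norm least squares on a subset of features. This is an illustrative sketch, not the setup of any particular study; all sizes, noise levels, and seeds are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(1)
n_train, n_test, D = 20, 200, 60        # training size and total feature count
p_grid = [5, 10, 15, 20, 30, 60]        # p = 20 is the interpolation threshold

avg = {p: 0.0 for p in p_grid}
trials = 50
for _ in range(trials):
    beta = rng.normal(size=D) / np.sqrt(D)     # true underlying coefficients
    Xtr = rng.normal(size=(n_train, D))
    ytr = Xtr @ beta + 0.3 * rng.normal(size=n_train)
    Xte = rng.normal(size=(n_test, D))
    yte = Xte @ beta + 0.3 * rng.normal(size=n_test)
    for p in p_grid:
        # Fit using only the first p features; pinv gives the
        # minimum-norm least-squares solution.
        w = np.linalg.pinv(Xtr[:, :p]) @ ytr
        avg[p] += np.mean((Xte[:, :p] @ w - yte) ** 2) / trials

for p in p_grid:
    print(p, round(avg[p], 3))
# Test error typically falls, spikes near p = n_train, then descends again.
```

Averaged over many random draws, the test error spikes near the interpolation threshold (p equal to the number of training points) and then descends a second time as p grows past it.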
A similar double descent pattern arises with respect to the temporal dynamics of learning [4]. Dynamics describe how a highly overparameterized system learns in response to a stream of training data that includes both inputs and outputs. At first, the system learns something about the early data stream, improving for a while when tested on its ability to predict outputs for new inputs. This initial decline in the system's error is the first descent.
However, as the training data stream continues, the system becomes brittle, because it has memorized much of the detail in the observations, including noise and idiosyncrasies of particular observations that do not hold in general. The error therefore rises toward a peak at the interpolation threshold of memorization.
With continued learning, a highly overparameterized system often passes through the interpolation regime and into a regime of generalization, and the error declines again. That generalization phase is the second descent in the error curve when tested against novel inputs.
In response to these observations about the nature of learning, recent studies have developed three complementary lines of theory about how system dimensionality influences generalization.
The first theory concerns the ability of learning dynamics to reach particular solutions. When many solutions fit equally well, gradient-based learning preferentially finds simple, smooth ones [7]. Simple solutions occupy larger regions of parameter space, and learning dynamics are more likely to find big regions than small ones. Put another way, overparameterization creates an abundance of solutions. Among those solutions, the dynamics select for simple and general models of the world.
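This implicit bias of learning dynamics toward simple solutions has a clean linear special case: gradient descent started at zero on an underdetermined least-squares problem never leaves the row space of the data, so it converges to the minimum-norm interpolator among all exact fits. A small sketch (sizes, seed, and learning rate are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 8, 40                          # far more parameters than observations
X = rng.normal(size=(n, p))
y = rng.normal(size=n)

# Gradient descent on squared error, started at zero.
w = np.zeros(p)
lr = 0.01
for _ in range(20000):
    w -= lr * X.T @ (X @ w - y)

# Among the infinitely many exact fits, the dynamics select the
# minimum-norm one, i.e., the simplest in this linear sense.
w_minnorm = np.linalg.pinv(X) @ y
print(np.allclose(w, w_minnorm, atol=1e-6))  # True
```

The dynamics were never told to prefer small solutions; the preference emerges from where the trajectory starts and which directions the gradient can move it.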
The second theory concerns aliasing, the confusion between true structure and noise [8]. At the interpolation threshold, aliasing is maximal because the system has no spare capacity to distinguish signal from noise. Every degree of freedom is committed to fitting exactly. Excess dimensions past that rigidly constrained fit provide the slack needed to separate pattern from accident.
The third theory concerns the geometry of the interpolation spike [9]. That spike requires that the data have directions of greater and lesser variation and that novel inputs used for testing must differ in their directions of variation. Near the threshold, errors concentrate along low-variance directions because fitting concentrates along high-variance directions. Adding dimensions spreads the errors widely, so that a mismatch only picks up a few small contributions to overall error rather than a big patch of errors linked along a particular constrained dimension.
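The spike geometry shows up directly in the singular values of a random design matrix: the smallest singular value collapses toward zero when the number of parameters roughly equals the number of observations, and tiny singular values amplify noise along weakly represented directions. A minimal numpy illustration (dimensions and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 50                                # number of observations
smin = {}
for p in [10, 25, 50, 100, 200]:      # number of parameters (columns)
    # Average the smallest singular value over 20 random matrices.
    smin[p] = np.mean([np.linalg.svd(rng.normal(size=(n, p)),
                                     compute_uv=False).min()
                       for _ in range(20)])
    print(p, round(smin[p], 2))
# The smallest singular value dips toward zero near p = n, then
# recovers as excess dimensions (p >> n) restore a healthy margin.
```

Inverting near-zero singular values is what blows up the fitted solution at the threshold; adding dimensions spreads the data's directions of variation so that no single direction is that poorly resolved.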
These three results are complementary. Wilson [7] explains why, among the vast number of solutions available to an overparameterized system, the ones actually found tend to be general rather than idiosyncratic. Transtrum [8] explains the spike as maximal aliasing and extra dimensions as the remedy. Schaeffer [9] explains when the interpolation spike occurs and why extra dimensions dissolve it.
Classical statistics penalizes complexity to avoid overfitting, effectively smoothing out the interpolation spike. By contrast, biology tends not to penalize complexity as strongly. Evolutionary dynamics is therefore likely to experience the full consequences of the double descent learning curve. That fact leads to new implications for understanding how regulatory systems form specialized or generalized solutions, and for understanding how fitness landscapes truly influence evolutionary process.

Fitness Landscapes

The new theory of generalization changes how we should think about fitness landscapes. The classical landscape problem concerns how performance increases in relation to the fitness gradient, and how peaks and valleys favor or impede trajectories to better performance.
A focus on generalization changes the analysis of evolutionary learning trajectories. Traditionally, trajectories describe motion directly: how the evolutionary process updates the system in response to its performance in each realized environment encountered.
Generalization is a more abstract geometric problem. How does the dimensionality of the phenotype affect the learning process? Focusing on a point along a learning trajectory, how does its performance compare against the history of realized environments versus the potential set of environments that might be encountered?
In machine learning, we would compare performance against the realized training data versus novel test data. Good generalization means good performance against novel test data, that is, against the potential set of environments that might be encountered.
Now that we are thinking of generalization rather than realized performance, we can ask some new questions. How does the dimensionality of a solution—of a phenotype—alter generalization? How does the nature of the environmental challenge influence generalization?
The first step may be consideration of phenotypes. Biological fitness landscapes focus on the map of genotypes to phenotypes to fitness. The genotype-to-phenotype part of the mapping is widely considered to be a crucial and difficult component [10,11].
I favor thinking of phenotypes as biochemical or neural circuits that take inputs and compute outputs [12,13]. That view encompasses simple metric traits and also traits that are solutions to complex environmental challenges. A computational circuit perspective also provides a direct link to machine learning, which often proceeds by improving the performance of a computational circuit on a specifically defined problem.
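As a caricature of this circuit view, a phenotype can be written as a function from input signals to outputs. Everything below — the weights, the signal values, the saturating response — is an illustrative invention, chosen only to show the form, not a model of any real regulatory system.

```python
import numpy as np

def circuit(inputs, W, b):
    """A toy regulatory circuit: weighted integration of input signals
    followed by a saturating (sigmoid) response. All parameters are
    illustrative, not drawn from any biological measurement."""
    return 1.0 / (1.0 + np.exp(-(W @ inputs + b)))

# Three environmental signals mapped to two expression-like outputs.
W = np.array([[ 2.0, -1.0, 0.5],
              [-0.5,  1.5, 1.0]])
b = np.array([-0.5, 0.2])
signals = np.array([1.0, 0.0, 1.0])
print(circuit(signals, W, b))   # two output levels, each in (0, 1)
```

The entries of W and b play the role of the adjustable parameters that evolution tunes; the dimensionality arguments in this article concern how many such parameters the circuit carries relative to the challenges it faces.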
Neurobiology is of course based on computational circuits, and gene regulatory networks have been studied extensively. But few studies have asked how and why regulatory networks generalize to novel inputs. Parter et al. showed that circuits evolved under varying environments generalize to novel environments [5]. Xue and Leibler showed that the dimensionality of internal representation determines the range of available strategies [14]. Kouvaris et al. examined network modularity and connectivity in relation to generalization [6].
The study of such circuits fits well within the traditional analysis of fitness landscapes. In that case, the circuits improve in response to a sequence of realized environments that favor particular parameter combinations over others, a typical learning trajectory over a performance or fitness landscape.
The next step will be a deeper consideration of how dimensionality alters the learned solutions’ ability to generalize to novel inputs. In light of the recent insights from machine learning, how can we understand the fitness landscape geometry in terms of the evolutionary dynamics of generalization? That question provides a way to extend prior work, in which Gavrilets showed that high-dimensional landscapes have extensive nearly neutral networks connecting distant genotypes [15]. The theory of generalization extends that insight by shifting attention to how learning dynamics select among solutions within those connected regions.

Discussion

The theory of generalization abstractly describes the geometry of learning. That abstract theory raises a broad prediction about biology.
Greater dimensionality of phenotypic circuits should correspond to a greater tendency for generalization. For example, single-celled eukaryotes have greater genetic regulatory capacity and dimensionality compared with most prokaryotes. Thus, yeast should generalize more than bacteria. And multicellular eukaryotes should generalize more than single cells.
Similar ideas about evolutionary flexibility have been discussed before. Davidson showed that more complex organisms have greater hierarchical depth in their regulatory architecture, which he argued enables evolutionary flexibility [16]. Wagner showed that redundant networks have broad neutral spaces that potentially connect phenotypic innovations [17]. Kirschner and Gerhart identified organizational features, such as weak linkage between compartmentalized components, that bias variation toward functionality [18]. But none of these studies explained why greater wiring connectivity should produce generalization.
The theory here suggests that generalization arises from the sheer dimensionality of the regulatory space. The specific circuit features identified in prior work may be important for evolvability but not as the primary cause of generalizing capacity. Instead, processes that increase dimensionality may matter more. For example, neutrality and robustness have been raised as primary causes of increasing regulatory dimensionality [19].
Secondarily, older circuits are more likely to generalize. In the temporal version of the double descent pattern discussed earlier, a learning system first improves, then passes through a brittle memorizing phase, then reaches generalization. Applied to biology, recently evolved regulatory circuits should be more likely to show brittle, context-specific correlations among components. Older circuits should be more likely to generalize across a range of conditions.
In summary, the new framework for generalization arises from the fact that natural selection is a basic learning algorithm. Thus, the theory of generalization in learning, originally developed to explain why overparameterized machine learning systems abstract rather than memorize, also applies to biology. Regulatory connections are parameters, selective history is training data, selection is the learning optimizer, and novel environments are test data.
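That mapping can be caricatured in a few lines: a mutation-selection hill climb plays the role of the learning optimizer, a set of realized environments serves as training data, and held-out environments test generalization. All functions and parameters below are illustrative inventions, not a model from the literature.

```python
import numpy as np

rng = np.random.default_rng(4)

# Environments are input vectors; the challenge is a fixed hidden rule
# (an arbitrary target) that selection must learn to match.
def target(envs):
    return np.tanh(envs @ np.array([1.0, -2.0, 0.5]))

def fitness(w, envs):
    preds = np.tanh(envs @ w)            # circuit output under parameters w
    return -np.mean((preds - target(envs)) ** 2)

train_envs = rng.normal(size=(30, 3))    # selective history = training data
test_envs = rng.normal(size=(200, 3))    # novel environments = test data

w = np.zeros(3)                          # regulatory parameters
for _ in range(3000):                    # mutation + selection as optimizer
    mutant = w + 0.05 * rng.normal(size=3)
    if fitness(mutant, train_envs) > fitness(w, train_envs):
        w = mutant                       # beneficial mutation fixes

print("train fitness:", round(fitness(w, train_envs), 4))
print("test  fitness:", round(fitness(w, test_envs), 4))
```

Selection only ever sees the realized environments, yet the evolved parameters perform well on environments never encountered — the biological analogue of test-set generalization.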
This perspective adds an aspect to evolutionary theory that has been missing. The fitness landscape remains useful for local questions about epistasis, adaptive walks, and similar evolutionary problems. But for the deeper question of how lineages discover general solutions to environmental challenges, the relevant factor is not a position on a fitness surface. It is how the dimensionality of the regulatory architecture alters the geometry of evolutionary trajectories within the fitness landscape.
Figure 1. Double descent of generalization error in learning. The solid curve shows generalization error, the performance of a learned solution against novel inputs not encountered during learning. The dashed curve shows training error, the performance against the data used for learning. In the classical regime, increasing system complexity first reduces generalization error. But then the generalization error increases as the system begins to memorize idiosyncratic details of the training data. At the interpolation threshold, the system has just enough parameters to describe the training data exactly, and generalization error peaks. Beyond the threshold, in the overparameterized regime, generalization error declines even though training error remains at zero. The x-axis can be read as the number of adjustable parameters in the system or as learning time for a system with many parameters. After Figure 1 of [20].

Acknowledgments

The Donald Bren Foundation and US National Science Foundation grant DEB-2325755 support my research.

Data and code availability

This article generated no new data or computer code.

References

  1. Valiant, L. Probably Approximately Correct: Nature's Algorithms for Learning and Prospering in a Complex World; Basic Books, 2013.
  2. Watson, R.A.; Szathmáry, E. How can evolution learn? Trends Ecol. Evol. 2016, 31, 147–157.
  3. Belkin, M.; et al. Reconciling modern machine-learning practice and the classical bias-variance trade-off. Proc. Natl. Acad. Sci. USA 2019, 116, 15849–15854.
  4. Nakkiran, P.; et al. Deep double descent: where bigger models and more data hurt. J. Stat. Mech. Theory Exp. 2021, 2021, 124003.
  5. Parter, M.; et al. Facilitated variation: How evolution learns from past environments to generalize to new environments. PLoS Comput. Biol. 2008, 4, e1000206.
  6. Kouvaris, K.; et al. How evolution learns to generalise: Using the principles of learning theory to understand the evolution of developmental organisation. PLoS Comput. Biol. 2017, 13, e1005358.
  7. Wilson, A.G. Position: Deep learning is not so mysterious or different. 2025.
  8. Transtrum, M.K.; et al. Generalized aliasing explains double descent and informs model design. Phys. Rev. Res. 2025, 7, 043268.
  9. Schaeffer, R.; et al. Double descent demystified: Identifying, interpreting & ablating the sources of a deep learning puzzle. arXiv 2023, arXiv:2303.14151.
  10. Payne, J.L.; Wagner, A. The causes of evolvability and their evolution. Nat. Rev. Genet. 2019, 20, 24–38.
  11. Wagner, G.P.; Altenberg, L. Complex adaptations and the evolution of evolvability. Evolution 1996, 50, 967–976.
  12. Frank, S.A. Circuit design in biology and machine learning. II. Anomaly detection. Entropy 2025, 27, 896.
  13. Frank, S.A. Circuit design in biology and machine learning. I. Random networks and dimensional reduction. Evolution 2025, 79, 1403–1418.
  14. Xue, B.K.; et al. Environment-to-phenotype mapping and adaptation strategies in varying environments. Proc. Natl. Acad. Sci. USA 2019, 116, 13847–13855.
  15. Gavrilets, S. Fitness Landscapes and the Origin of Species; Princeton University Press, 2004.
  16. Davidson, E.H. The Regulatory Genome: Gene Regulatory Networks in Development and Evolution; Academic Press, 2006.
  17. Wagner, A. Robustness and Evolvability in Living Systems; Princeton University Press, 2005.
  18. Kirschner, M.; Gerhart, J. The Plausibility of Life: Resolving Darwin's Dilemma; Yale University Press, 2005.
  19. Frank, S.A. Robustness and complexity. Cell Syst. 2023, 14, 1015–1020.
  20. Elad, M.; et al. Another step toward demystifying deep neural networks. Proc. Natl. Acad. Sci. USA 2020, 117, 27070–27072.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
Copyright: This open access article is published under a Creative Commons CC BY 4.0 license, which permits free download, distribution, and reuse, provided that the author and preprint are cited in any reuse.