1. Introduction
The dominant paradigm in contemporary artificial intelligence—training a monolithic neural network on a static dataset, then deploying it as a frozen inference engine—achieves remarkable task performance but lacks the adaptive, self-organizing properties that characterize biological intelligence. Reinforcement learning (RL) agents exhibit goal-directed autonomy but remain confined to narrow task domains with fixed architectures. AI agent frameworks (AutoGPT, Voyager; Wang et al., 2023) achieve flexible multi-step behavior through large language model (LLM) orchestration, but communicate between components via serialized text, discarding the rich continuous representations that neural networks learn internally.
We argue that progress toward genuinely adaptive AI requires addressing three structural deficits simultaneously: (i) the absence of principled, intrinsically motivated autonomy; (ii) the shallowness of inter-module communication in modular systems; and (iii) the external management of all adaptation and learning. No existing architecture addresses all three.
This paper introduces the NeuroCore framework, a formal mathematical treatment of a modular neural architecture organized around a central thesis: that intelligent self-organization can emerge from the interaction between a minimal executive controller and a rich ecosystem of specialist modules, provided the controller has appropriate motivational drives (curiosity, anti-stagnation pressure), regulatory capabilities (adaptive constraint satisfaction, homeostatic stabilization), and consequential agency over its own module configuration.
The minimal Core thesis. A defining commitment of NeuroCore is that the Core possesses no higher cognitive capabilities. It cannot understand language, perceive images, reason about physics, or generate content. All such capabilities reside exclusively in specialist modules. The Core provides only autonomy (the capacity and motivation to act), regulation (the capacity to maintain stability and enforce constraints), and action (the capacity to execute decisions in both the external environment and the internal module landscape). This design is motivated by the observation that biological intelligence emerges not from a single monolithic system but from compact subcortical controllers (basal ganglia, brainstem nuclei) orchestrating a diverse ensemble of cortical specialist areas.
Our contributions are as follows:
1. We formalize the stagnation-modification tradeoff (§5.3): in systems where self-modification is costly, rational agents converge to modification-avoidance. We prove conditions under which a stagnation penalty restores non-trivial self-modification (Theorem 1).
2. We prove a general non-convergence theorem (§7.1) for coupled self-modifying multi-objective systems, establishing that the joint optimization of NeuroCore’s subsystems does not admit guaranteed convergence.
3. We establish partial stability guarantees (§7.2–7.4): bounded representational drift via a homeostatic Lyapunov function (Theorem 3), local convergence under frozen modules (Proposition 1), and modification frequency bounds (Proposition 2).
4. We derive information-theoretic module manipulation costs (§6) that are principled proxies for true representational disruption, and formalize the meta-RL Serotonin System as a generalization of constrained MDPs for non-stationary settings (§5.2).
2. Related Work
Modular neural architectures. Mixture of Experts (MoE; Shazeer et al., 2017; Fedus et al., 2022) routes inputs to specialist sub-networks via a gating function, but all experts share one architecture and are trained jointly. Routing Networks (Rosenbaum et al., 2018) use RL to select among heterogeneous function blocks for multi-task learning. Neural Module Networks (Andreas et al., 2016) compose task-specific architectures from a library of modules. Deep Model Reassembly (Yang et al., 2022) demonstrated that blocks from heterogeneous vision architectures can be stitched together with reasonable performance. NeuroCore differs from all of these in that modules can be dynamically manipulated (fine-tuned, composed, swapped, added, removed) at runtime by an RL-driven controller, and communication occurs through learned continuous-representation interfaces rather than shared token vocabularies.
Neuroscience-inspired AI. Doya (2002) proposed a mapping between neuromodulators and RL meta-parameters: dopamine → reward prediction error, serotonin → temporal discount factor, norepinephrine → inverse temperature (exploration), acetylcholine → learning rate. The Global Workspace Theory (GWT) implementation by VanRullen and Kanai (2021) connects pretrained modules via a shared latent workspace with attention-based routing. Recent work on Lifelong RL via Neuromodulation (2024) demonstrates that neuromodulatory signals can gate plasticity for continual learning. NeuroCore builds directly on Doya’s framework but elevates the serotonergic subsystem from a passive parameter to an active meta-RL controller.
Self-modifying systems. Schmidhuber’s self-referential weight matrices (1993; Irie et al., 2022) allow networks to modify their own weights at runtime. The Gödel Machine (Schmidhuber, 2007) is a theoretical framework for self-improving systems with provable guarantees. The Darwin Gödel Machine (Sakana AI, 2025) achieves practical self-modifying code. Learning-theoretic analysis (2025) shows that policy-level learnability is preserved only if the policy-reachable family has uniformly bounded capacity. NeuroCore’s self-modification operates at the module level rather than individual weights, and we provide explicit stability analysis for this coarser granularity.
Safe and constrained RL. Constrained MDPs (Altman, 1999) and Constrained Policy Optimization (Achiam et al., 2017) enforce safety constraints via Lagrangian relaxation with reactive multiplier updates. Gao, Schulman, and Hilton (2022) established scaling laws for reward model overoptimization, showing that optimizing against an imperfect proxy inevitably degrades the true objective. Our meta-RL Serotonin System generalizes Lagrangian methods to handle the non-stationarity introduced by self-modification, at the cost of additional optimization complexity.
3. Preliminaries and Notation
We work within the framework of Markov Decision Processes (MDPs) extended to accommodate non-stationarity, multiple objectives, and hierarchical action spaces.
Definition 1 (Non-Stationary Constrained MDP). A non-stationary constrained MDP is a tuple ℳ(t) = (S, A, Pₜ, Rₜ, γ(t), {Cᵢ}ᵢ₌₁ᵐ, {dᵢ}ᵢ₌₁ᵐ) where S is the state space, A = A_ext ∪ A_int is a hierarchical action space (external and internal actions), Pₜ : S × A → Δ(S) is a time-varying transition kernel (non-stationary due to self-modification), Rₜ : S × A → ℝ is a time-varying reward function, γ(t) ∈ (0, 1) is a time-varying discount factor, Cᵢ : S × A → ℝ are m constraint cost functions, and dᵢ ∈ ℝ are constraint thresholds.
Definition 2 (Distributional Return). The distributional return Zπ(s, a) is a random variable representing the discounted sum of future rewards under policy π: Zπ(s, a) = Σₜ₌₀^∞ γᵗ R(sₜ, aₜ) with s₀ = s, a₀ = a. The distributional Bellman operator Tπ is defined by TπZ(s, a) =ᵈ R(s, a) + γ Z(s′, a′), where s′ ∼ P(· | s, a), a′ ∼ π(· | s′), and =ᵈ denotes equality in distribution.
We use the following notation throughout: hᶜ ∈ ℝᵈᶜ for the Core’s latent state; πᵈ for the Dopamine System’s policy parameterized by θᵈ; πₛ for the Serotonin System’s meta-policy parameterized by θₛ; τᵢ for the Τ-Interface connecting module Mᵢ to the Core; ℳ for the module registry; D_KL for Kullback-Leibler divergence; I(X; Y) for mutual information; and CKA for centered kernel alignment.
4. The NeuroCore Framework
Definition 3 (NeuroCore System). A NeuroCore system is a tuple Ψ = (C, ℳ, {τᵢ}ᵢ∈ℳ), where C = (πᵈ, πₛ, π_B, hᶜ) is the Core (comprising the Dopamine policy, Serotonin meta-policy, Behavior policy, and latent state), ℳ = {M₁, ..., Mₙ} is the module registry, and {τᵢ} are the Τ-Interfaces. The system state at time t is Ψ(t) = (C(t), ℳ(t), {τᵢ(t)}).
4.1. Module Specification
Definition 4 (Module). A module Mᵢ = (fᵢ, dᵢⁱⁿ, dᵢᵒᵘᵗ, θᵢ, ωᵢ) consists of a forward function fᵢ : ℝᵈᵢⁱⁿ → ℝᵈᵢᵒᵘᵗ (architecture-specific), input and output dimensionalities, learnable parameters θᵢ, and metadata ωᵢ (architecture type, computational cost, task domain). Modules are heterogeneous: the set {fᵢ}ᵢ∈ℳ may include Transformers, CNNs, diffusion models, RNNs, state-space models, graph neural networks, and others. The Core imposes no architectural constraints on modules.
4.2. Τ-Interfaces
Definition 5 (Τ-Interface). A Τ-Interface τᵢ = (τᵢⁱⁿ, τᵢᵒᵘᵗ) is a pair of differentiable mappings

τᵢⁱⁿ : ℝᵈᶜ → ℝᵈᵢⁱⁿ,  τᵢᵒᵘᵗ : ℝᵈᵢᵒᵘᵗ → ℝᵈᶜ,

that bridge the Core’s representation space and the module’s native representation space while preserving gradient flow. The architecture of each τ component depends on the output geometry of the connected module (cross-attention pooling for sequence outputs, spatial attention for spatial outputs, learned linear projections for vector outputs). Τ-Interfaces may themselves be parameterized neural networks of arbitrary complexity.
Remark. The essential requirement is that Τ-Interfaces operate on continuous activations, never on serialized text tokens. If module Mᵢ’s native output is text (e.g., an LLM’s token sequence), τᵢᵒᵘᵗ operates on the LLM’s final hidden-layer activations, not the decoded text. This is what distinguishes NeuroCore from AI agent frameworks: information is transmitted in its native continuous form, not lossy-compressed into discrete symbols.
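As an illustration, a τᵒᵘᵗ for a sequence-output module might pool the module’s final hidden-layer activations with learned attention-style weights and project the result into the Core space. Every name and shape below is an assumption made for this sketch, not part of the formal definition:

```python
import numpy as np

def tau_out_sequence(hidden_states, w_pool, W_proj):
    """Sketch of a tau_out for a sequence-output module: attention-style
    pooling over final hidden-layer activations (never decoded tokens),
    followed by a learned projection into the Core space R^{d_c}."""
    scores = hidden_states @ w_pool              # (seq_len,) pooling logits
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                     # soft pooling weights
    pooled = weights @ hidden_states             # (d_module,) pooled activation
    return W_proj @ pooled                       # (d_core,) Core-space vector

rng = np.random.default_rng(0)
z = tau_out_sequence(rng.normal(size=(16, 32)),   # 16 tokens, module width 32
                     rng.normal(size=32),
                     rng.normal(size=(8, 32)))    # toy Core dimension 8
```

The key property the sketch preserves is that the mapping is differentiable end to end: gradients can flow from the Core back into the module’s activations.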
4.3. Core State Update
At each decision step, the Core aggregates information from active modules via gated attention fusion:

hᶜ(t+1) = CoreRNN( hᶜ(t), Attn( hᶜ(t), { τᵢᵒᵘᵗ(zᵢ(t)) }ᵢ∈active ) ),

where CoreRNN is a recurrent state-update function (GRU variant) and Attn computes multi-head attention with the Core state as query and module projections as keys/values. The attention mechanism naturally handles variable numbers of active modules. Note that the Core’s ‘understanding’ of any situation is entirely mediated by modules: it has no direct access to raw sensory input or linguistic content.
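A single-head, toy-scale sketch of this update follows; the simplified one-gate recurrence, the dimensions, and the weight shapes are illustrative assumptions (a production CoreRNN would be a full GRU with multi-head attention):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8                                      # toy Core latent dimension

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def core_update(h, module_feats, Wq, Wk, Wv, Wz, Wh):
    """One gated-attention fusion step: the Core state is the attention
    query, module projections are keys/values; a GRU-style gate mixes
    the fused context into the new Core state."""
    q = Wq @ h                             # query from Core state
    K = module_feats @ Wk.T                # keys from module projections
    V = module_feats @ Wv.T                # values from module projections
    ctx = softmax(K @ q / np.sqrt(d)) @ V  # fused module context
    hc = np.concatenate([h, ctx])
    z = 1.0 / (1.0 + np.exp(-(Wz @ hc)))   # update gate
    return (1.0 - z) * h + z * np.tanh(Wh @ hc)

W = [rng.normal(scale=0.1, size=s) for s in
     [(d, d), (d, d), (d, d), (d, 2 * d), (d, 2 * d)]]
h_next = core_update(rng.normal(size=d), rng.normal(size=(3, d)), *W)
```

Because attention pools over however many module projections are present, the same weights handle three active modules here or thirty in another step.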
5. Neuromodulatory Subsystems
5.1. The Dopamine System
The Dopamine System implements the Core’s primary learning and decision-making mechanism through distributional reinforcement learning. Grounded in the empirical finding that biological dopamine neurons encode distributional return predictions (Dabney et al., 2020; Muller et al., 2024), the system maintains a full return distribution via the Implicit Quantile Network (IQN; Dabney et al., 2018):

Zθ(s, a; τ) = f( ψ(s, a) ⊙ φ(τ) ),  τ ∼ U[0, 1],

trained via the quantile Huber loss:

ℒ_QR(θ) = E_{τ, τ′} [ ρ^κ_τ(δ_{τ,τ′}) ],  ρ^κ_τ(δ) = |τ − 𝟙{δ < 0}| · Huber_κ(δ) / κ,

where δ_{τ,τ′} = R(s, a) + γ Zθ′(s′, a′; τ′) − Zθ(s, a; τ) is the distributional TD error.
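To make the objective concrete, here is a minimal numpy sketch of a pairwise quantile Huber loss in the IQN style; the function name, array shapes, and the default κ = 1 are illustrative assumptions, not the paper’s implementation:

```python
import numpy as np

def quantile_huber_loss(pred_quantiles, target_samples, taus, kappa=1.0):
    """Quantile Huber loss over all (prediction, target) pairs.

    pred_quantiles: (N,) quantile estimates Z_theta(s, a; tau_i)
    target_samples: (M,) samples of the Bellman target r + gamma * Z'(s', a')
    taus:           (N,) quantile fractions tau_i in (0, 1)
    """
    # pairwise distributional TD errors delta_ij = target_j - pred_i
    delta = target_samples[None, :] - pred_quantiles[:, None]   # (N, M)
    abs_d = np.abs(delta)
    huber = np.where(abs_d <= kappa,
                     0.5 * delta ** 2,
                     kappa * (abs_d - 0.5 * kappa))
    # asymmetric quantile weight |tau_i - 1{delta_ij < 0}|
    weight = np.abs(taus[:, None] - (delta < 0).astype(np.float64))
    return float((weight * huber / kappa).mean())
```

The asymmetric weight is what makes each output track a specific quantile: underestimates and overestimates are penalized in proportion to τ and 1 − τ.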
Epistemic uncertainty is estimated through a deep ensemble of K distributional critics:

U_epi(s, a) = Var_{k=1,...,K} ( E_τ [ Z_{θₖ}(s, a; τ) ] ).

Intrinsic motivation is derived from world-model prediction error, normalized by the model’s own variance estimate:

r_int(sₜ, aₜ) = ‖ f_wm(sₜ, aₜ) − sₜ₊₁ ‖² / ( σ̂²_wm(sₜ, aₜ) + ε ),

where f_wm and σ̂²_wm denote the world model’s mean prediction and predictive variance.
The denominator mitigates the ‘noisy TV’ problem (Burda et al., 2019): inherently stochastic transitions yield both high prediction error and high variance, producing low intrinsic reward. Only systematically incorrect predictions—genuine model ignorance—generate strong motivation.
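The normalization can be sketched in a few lines; the function name and the ε smoothing term are illustrative choices under the assumption of a diagonal-variance world model:

```python
import numpy as np

def intrinsic_reward(pred_next, actual_next, pred_var, eps=1e-8):
    """Variance-normalized world-model surprise (sketch).

    High error + high predicted variance (a 'noisy TV') -> small reward;
    high error + low predicted variance (genuine ignorance) -> large reward.
    """
    sq_err = float(np.sum((pred_next - actual_next) ** 2))
    return sq_err / (float(np.sum(pred_var)) + eps)

# identical prediction error, different model confidence:
ignorant = intrinsic_reward(np.zeros(4), np.ones(4), np.full(4, 0.01))
noisy_tv = intrinsic_reward(np.zeros(4), np.ones(4), np.full(4, 10.0))
```

With the same squared error, the confident-but-wrong case (`ignorant`) yields a much larger reward than the high-variance case (`noisy_tv`), which is exactly the noisy-TV mitigation described above.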
5.2. The Serotonin System as Meta-RL Controller
The Serotonin System is formulated as a meta-reinforcement-learning controller that learns an adaptive regulatory policy. This design choice is motivated by neuroscience: biological serotonin does not merely enforce fixed thresholds but actively modulates temporal discounting (Doya, 2002), risk sensitivity, learning rates, and the balance between habitual and goal-directed control (Daw et al., 2002)—functions that collectively constitute a learned regulatory policy on slower timescales than dopaminergic processing. Experimentally, Cardozo Pinto et al. (2024) confirmed opponent dopamine-serotonin dynamics using dual-color optogenetics in mouse nucleus accumbens.
Definition 6 (Serotonin Meta-Policy). The Serotonin System’s meta-policy πₛ parameterized by θₛ outputs a regulatory action vector

aₜ^reg = πₛ(Hₜ; θₛ) = ( γ(t), α(t), β(t), η_sc(t), λ₁(t), ..., λₘ(t) ),

where Hₜ is a history buffer of recent Core states and constraint violations. The components are: γ(t) ∈ (0, 1), the temporal discount factor; α(t), β(t) ≥ 0, exploration-exploitation coefficients for prediction-error and epistemic-uncertainty bonuses respectively; η_sc(t) ∈ (0, 1], the gradient scaling factor controlling Core-to-module gradient flow; and λᵢ(t) ≥ 0, adaptive constraint multipliers.
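One way to realize Definition 6’s range constraints is to squash unconstrained network outputs; the sigmoid/softplus choices below are assumptions for illustration, not prescribed by the framework (note a plain sigmoid only approaches 1, so the η_sc ∈ (0, 1] boundary is approximate here):

```python
import numpy as np

def regulatory_action(raw, m):
    """Map unconstrained meta-policy head outputs to the valid
    regulatory ranges of Definition 6 (squashing choices assumed)."""
    sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
    softplus = lambda x: np.log1p(np.exp(x))
    gamma_t = sigmoid(raw[0])             # gamma(t) in (0, 1)
    alpha_t = softplus(raw[1])            # alpha(t) >= 0
    beta_t = softplus(raw[2])             # beta(t)  >= 0
    eta_sc = sigmoid(raw[3])              # eta_sc(t) in (0, 1), approx (0, 1]
    lambdas = softplus(raw[4:4 + m])      # lambda_i(t) >= 0
    return gamma_t, alpha_t, beta_t, eta_sc, lambdas

g, a, b, e, lam = regulatory_action(np.zeros(6), m=2)
```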
The meta-RL objective minimizes long-horizon aggregate constraint violation:

J(θₛ) = E [ Σₜ ( Σᵢ₌₁ᵐ max(0, Cᵢ(sₜ, aₜ) − dᵢ) + ℒ_homeo(t) ) ],

where ℒ_homeo is the homeostatic activation regularizer:

ℒ_homeo = c₁ ‖ μ(hᶜ) − μ₀ ‖² + c₂ ‖ σ(hᶜ) − σ₀ ‖² + c₃ D_KL( p(hᶜ) ‖ p₀ ),

with target statistics μ₀, σ₀, p₀ established during initial training (analogous to biological homeostatic set points; Turrigiano, 2008).
Advantage over standard constrained MDPs. Lagrangian methods update constraint multipliers reactively: increasing λ after violations, decreasing after compliance. In NeuroCore, where module modifications introduce discontinuous changes in the transition dynamics, reactive methods must re-discover appropriate multipliers after each modification, producing oscillatory constraint violations. The meta-RL formulation learns a policy that can anticipate constraint violations from behavioral precursors in hᶜ and preemptively adjust regulation. The cost is a second RL optimization loop with its own convergence challenges (§7).
5.3. The Stagnation-Modification Tradeoff
We now formalize a fundamental problem in self-modifying systems: if self-modification is costly and potentially destructive, rational agents will converge to never modifying.
Definition 7 (Modification-Avoidant Policy). A policy π is modification-avoidant if Pr(a ∈ A_int | π) < ε for small ε > 0, i.e., it essentially never selects internal (self-modification) actions.
Theorem 1 (Stagnation Convergence). Let ℳ be a non-stationary constrained MDP with action space A = A_ext ∪ A_int, where each internal action a ∈ A_int incurs immediate cost C(a) > 0 and causes a stochastic perturbation to the transition dynamics Pₜ. If (i) the expected post-modification return E[Vπ′(s)] is uncertain with variance σ²_mod > 0, (ii) the discount factor satisfies γ < 1, and (iii) there exists a viable external-only policy π_ext with Vπᵉˣᵗ(s) > 0 for some states, then for sufficiently small learning rates, the policy gradient update converges to a modification-avoidant policy π* with Pr(a ∈ A_int | π*) → 0.
Proof. Consider the expected return of an internal action a_m at state s versus the best external action a_e:

Q^π(s, a_m) = −C(a_m) + γ E_{s′, ℳ′} [ V^π_{ℳ′}(s′) ],  Q^π(s, a_e) = R(s, a_e) + γ E_{s′} [ V^π(s′) ].

The modification action incurs immediate cost C(a_m) > 0 and perturbs the MDP dynamics, meaning the agent’s current value function Vπ becomes inaccurate for the post-modification MDP. By the performance difference lemma (Kakade & Langford, 2002), the return under the current policy in the perturbed MDP satisfies:

| V^π_{ℳ′}(s) − V^π_ℳ(s) | ≤ ( γ R_max / (1 − γ)² ) · max_{s,a} D_TV( P(· | s, a), P′(· | s, a) ).

Since the perturbation’s direction and magnitude are uncertain, the agent’s estimate of Q(s, a_m) has variance proportional to σ²_mod / (1 − γ)², which is amplified by the discount factor for long horizons. Under policy gradient methods, actions with high-variance returns and guaranteed immediate costs have their gradients dominated by the cost term. Formally, the gradient contribution is:

∇_θ log π_θ(a_m | s) · ( Q̂(s, a_m) − b(s) ),  with E[ Q̂(s, a_m) − b(s) ] < 0 for sufficiently large C(a_m).
The expected gradient is negative (pushing probability mass away from a_m), while the noise has zero mean. Over sufficient iterations, π(a_m | s) → 0 for all modification actions, yielding a modification-avoidant policy. ■
Corollary 1. The modification-avoidant equilibrium is locally stable: small perturbations that occasionally trigger modifications will be corrected by the policy gradient, returning the system to avoidance.
To break the stagnation equilibrium, we introduce an anti-stagnation mechanism based on the rate of epistemic uncertainty reduction. Let U(t) = Eₛ[U_epi(s, a)] be the average epistemic uncertainty:

r_stag(t) = −ζ · max( 0, ε_min − ( U(t − ΔT) − U(t) ) / ΔT ),

penalizing the agent whenever uncertainty has fallen by less than ε_min per step over the trailing window of ΔT steps.
Proposition (Stagnation penalty sufficiency). If ζ > C_max / (ε_min · ΔT), where C_max = max_a C(a), then the modification-avoidant policy is no longer a local optimum of the augmented objective r_total = r_ext + α r_int + β U_epi + r_stag.
Proof sketch. Under a modification-avoidant policy, epistemic uncertainty in unvisited regions does not decrease (the system exploits only known states). After ΔT steps without uncertainty reduction, the stagnation penalty contributes −ζ · ε_min per step. The cumulative penalty over a horizon of H steps is −ζ · ε_min · H. For ζ > C_max / (ε_min · ΔT), a single modification action at cost C_max followed by uncertainty reduction exceeding ε_min produces higher expected return than continued stagnation for H > C_max / (ζ · ε_min), creating a policy gradient toward non-trivial modification. ■
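The proposition’s threshold can be checked arithmetically; `stagnation_breaks_even` is a hypothetical helper comparing the penalty accrued over one window against a single modification cost:

```python
def stagnation_breaks_even(zeta, eps_min, dT, C_max):
    """Check the Proposition's condition: with zeta > C_max / (eps_min * dT),
    paying C_max once beats accruing -zeta * eps_min per step for dT steps."""
    accrued_penalty = zeta * eps_min * dT   # stagnation penalty over one window
    return accrued_penalty > C_max          # True -> modification is preferred

# threshold here is C_max / (eps_min * dT) = 1.0 / (0.1 * 10) = 1.0
above = stagnation_breaks_even(zeta=2.0, eps_min=0.1, dT=10, C_max=1.0)
below = stagnation_breaks_even(zeta=0.5, eps_min=0.1, dT=10, C_max=1.0)
```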
We additionally introduce a metabolic utilization cost for dormant modules:

r_util(t) = −κ_util · H(dormant(t)),

where H(dormant) measures the information-theoretic cost of maintaining modules that have not contributed to the Core state in the past N steps, creating pressure to either utilize or prune unused modules.
The composite reward signal is:

r_total(t) = r_ext(t) + α(t) · r_int(t) + β(t) · U_epi(t) + r_stag(t) + r_util(t),

where α(t) and β(t) are regulatory outputs of the Serotonin meta-policy (Eq. 9).
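Assembling the composite signal is then a one-liner (all names are illustrative; the two penalty terms are non-positive by construction):

```python
def composite_reward(r_ext, r_int, u_epi, r_stag, r_util, alpha_t, beta_t):
    """Composite reward: extrinsic return plus Serotonin-weighted intrinsic
    and epistemic bonuses, minus stagnation and utilization penalties."""
    return r_ext + alpha_t * r_int + beta_t * u_epi + r_stag + r_util

r = composite_reward(1.0, 0.5, 0.2, -0.1, -0.05, alpha_t=0.5, beta_t=0.1)
```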
5.4. Dual-Timescale Dynamics
The Dopamine and Serotonin systems operate on separated timescales:

ηₛ(t) / ηᵈ(t) → 0,

i.e., the Serotonin meta-policy updates much more slowly than the Dopamine policy.
This separation is both biologically motivated (serotonergic modulation operates on seconds-to-minutes timescales versus millisecond dopaminergic firing) and mathematically convenient, enabling analysis via two-timescale stochastic approximation theory (Borkar, 2008).
6. Information-Theoretic Self-Modification Costs
A principled cost function for module manipulation must reflect the true representational disruption caused by the manipulation. We derive such costs from information-theoretic quantities.
Definition 8 (Integration Depth). The integration depth of module Mᵢ with the Core is measured by the mutual information between the module’s Τ-Interface output and the Core’s state update:

Dᵢ = I( τᵢᵒᵘᵗ(zᵢ) ; hᶜ(t+1) | hᶜ(t), { τⱼᵒᵘᵗ(zⱼ) }ⱼ≠ᵢ ).
This measures how much information module Mᵢ uniquely contributes to the Core state, conditional on the Core’s prior state and all other modules. A module with Dᵢ ≈ 0 is redundant; a module with high Dᵢ is irreplaceable.
Module manipulation costs are derived from Dᵢ:

C(remove, Mᵢ) = κ₁ · Dᵢ, (22a)
C(swap, Mᵢ → Mⱼ) = κ₂ · Dᵢ · D_CKA(Mᵢ, Mⱼ), (22b)
C(fine-tune, Mᵢ, n) = κ₃ · Dᵢ · √n, (22c)
C(add, Mⱼ) = κ₄, (22d)

where D_CKA is the centered kernel alignment distance between the two modules’ representation spaces, n is the number of fine-tuning steps, and κ₁ ... κ₄ are scaling constants. The square-root dependence in Eq. 22c reflects the empirical observation that fine-tuning disruption grows sublinearly with the number of steps (due to LoRA’s low-rank constraint).
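A hedged sketch of such a cost table follows; the exact functional forms, the flat admission cost for `add`, and the unit κ constants are assumptions consistent with the sublinear fine-tuning dependence described in the text:

```python
import math

def manipulation_costs(D_i, d_cka, n_steps, k1=1.0, k2=1.0, k3=1.0, k4=1.0):
    """Illustrative cost family derived from integration depth D_i.
    kappa_1..kappa_4 and the 'add' term are assumptions for the sketch."""
    return {
        "remove": k1 * D_i,                          # deeper integration -> costlier
        "swap": k2 * D_i * d_cka,                    # scaled by CKA distance
        "fine_tune": k3 * D_i * math.sqrt(n_steps),  # sublinear in steps
        "add": k4,                                   # flat admission cost (assumed)
    }

costs = manipulation_costs(D_i=2.0, d_cka=0.3, n_steps=100)
```

A redundant module (Dᵢ ≈ 0) is nearly free to remove or swap under this scheme, while an irreplaceable one is expensive to touch in any way, matching the intuition behind Definition 8.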
Proposition (Cost-disruption correlation). Under mild regularity conditions (Lipschitz continuity of fᵢ and τᵢ), the removal cost C(remove, Mᵢ) is a lower bound on the expected performance degradation:

E[ ΔPerf(remove Mᵢ) ] ≥ α₀ · C(remove, Mᵢ) − ε_approx,

where α₀ depends on the Core’s reliance on module i’s information and ε_approx is an approximation error from finite-sample mutual information estimation. This justifies Dᵢ as a meaningful proxy for true disruption.
7. Stability Analysis
This section provides the paper’s central theoretical results on the stability and convergence properties of the NeuroCore system. We first establish a negative result (non-convergence in general), then derive three partial stability guarantees that collectively characterize the system’s operating regime.
7.1. Non-Convergence Theorem
Theorem 2 (General non-convergence). Let Ψ(t) be the NeuroCore system state. The joint optimization of (πᵈ, πₛ, {τᵢ}, {θᵢ}) does not, in general, converge to a fixed point, limit cycle, or bounded attractor in the parameter space.
Proof. We establish non-convergence via three independent mechanisms, any one of which suffices.
(a) Self-modification-induced non-stationarity. Let T₁, T₂, ... be the sequence of module modification events. At each Tₖ, the Core modifies some module Mᵢ, transitioning the system from MDP ℳₖ = (S, A, Pₖ, Rₖ, γₖ) to ℳₖ₊₁. The value function Vπᵈ optimized under ℳₖ is generally invalid under ℳₖ₊₁. By the simulation lemma (Kearns & Singh, 2002), the value function error satisfies:

| V^π_{ℳₖ}(s) − V^π_{ℳₖ₊₁}(s) | ≤ ( γ R_max / (1 − γ)² ) · max_{s,a} D_TV( Pₖ(· | s, a), Pₖ₊₁(· | s, a) ),

where D_TV is the total variation distance. Unless the modification process itself converges (i.e., the agent eventually stops modifying), the MDP sequence {ℳₖ} does not converge, and neither does the value function. But proving convergence of the modification process requires proving convergence of the overall system—a circularity.
(b) Two-player non-cooperative game. The Dopamine System maximizes r_total (which includes returns from potentially constraint-violating actions), while the Serotonin System minimizes constraint violations (which requires suppressing some high-return behaviors). This constitutes a two-player general-sum game. Daskalakis, Goldberg, and Papadimitriou (2009) proved that computing Nash equilibria in general-sum games is PPAD-complete. Mazumdar, Ratliff, and Sastry (2020) showed that gradient dynamics in continuous games do not converge to Nash equilibria in general, and can exhibit limit cycles or chaotic trajectories in parameter space even for simple bilinear games. The NeuroCore game is substantially more complex (nonlinear, high-dimensional, non-stationary), precluding convergence guarantees.
(c) Non-convex individual objectives. Each subsystem’s optimization involves deep neural networks. Even for single-agent deep RL with nonlinear function approximation, Tsitsiklis and Van Roy (1997) demonstrated divergence of temporal-difference learning. The coupling between multiple such systems—where each system’s loss landscape depends on the other systems’ parameters—creates a composite landscape where saddle points, local minima, and degenerate plateaus are ubiquitous. ■
7.2. Bounded Representational Drift
Theorem 3 (Homeostatic drift bound). Let p(hᶜ(t)) denote the distribution of Core activations at time t and p₀ the initial distribution. If the homeostatic regularization coefficient c₃ in Eq. 12 satisfies c₃ ≥ c*(ε, G_max), where G_max = sup_t ‖ ∇_{hᶜ} ℒᵈ ‖ is the maximum policy gradient magnitude, then:

lim sup_{t→∞} D_KL( p(hᶜ(t)) ‖ p₀ ) ≤ ε.
Proof. Define the Lyapunov function V(t) = D_KL(p(hᶜ(t)) ‖ p₀). The Core state update (Eq. 4) at each step applies a parameterized transformation to hᶜ, causing a distributional shift. By Pinsker’s inequality and the data processing inequality, the KL-shift per step is bounded:

V(t+1) − V(t) ≤ √2 · G_max.

The homeostatic loss ℒ_homeo contributes a restoring gradient:

⟨ ∇V, −c₃ ∇ D_KL( p(hᶜ) ‖ p₀ ) ⟩ ≤ −c₃ · √V(t).

Combining, the net drift satisfies:

ΔV(t) ≤ √2 · G_max − c₃ · √V(t).

Setting ΔV(t) = 0 and solving for the equilibrium gives V* = 2G²_max / c₃². For V* ≤ ε, we require c₃ ≥ G_max √(2/ε) =: c*(ε, G_max). When V(t) > V*, the restoring term dominates the drift term, pulling V(t) back below the bound. ■
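The equilibrium V* = 2G²_max/c₃² can be sanity-checked numerically by iterating the worst-case drift recursion ΔV ≤ √2·G_max − c₃·√V; with G_max = 1 and c₃ = 2 the iterate should settle near V* = 0.5 (the function name and parameter choices are illustrative):

```python
import math

def worst_case_drift(G_max, c3, steps=5000):
    """Iterate the worst-case recursion
    V(t+1) = max(0, V(t) + sqrt(2)*G_max - c3*sqrt(V(t)))
    and return the final iterate, to compare against V* = 2*G_max**2 / c3**2."""
    V = 0.0
    for _ in range(steps):
        V = max(0.0, V + math.sqrt(2) * G_max - c3 * math.sqrt(V))
    return V

V_eq = worst_case_drift(G_max=1.0, c3=2.0)
V_star = 2 * 1.0 ** 2 / 2.0 ** 2   # claimed equilibrium: 0.5
```

The fixed point is locally stable here (the recursion’s derivative at V* has magnitude below one), so the iteration converges geometrically to the claimed bound.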
Remark. Theorem 3 provides a tunable tradeoff: stronger homeostatic regulation (larger c₃) yields tighter drift bounds but constrains the Core’s representational plasticity. The Serotonin meta-policy can learn to balance this tradeoff by adjusting the effective strength of homeostatic regulation across different phases of operation.
7.3. Local Convergence Under Frozen Modules
Proposition 1 (Two-timescale local convergence). If the module configuration is held fixed (ℳ constant, no internal actions), the NeuroCore system reduces to a two-timescale stochastic approximation:

θᵈ(t+1) = θᵈ(t) + ηᵈ(t) [ gᵈ(θᵈ(t), θₛ(t)) + Mᵈ(t+1) ],
θₛ(t+1) = θₛ(t) + ηₛ(t) [ gₛ(θᵈ(t), θₛ(t)) + Mₛ(t+1) ],

where gᵈ, gₛ are the expected gradient mappings and Mᵈ, Mₛ are martingale difference noise terms. Under standard conditions—(i) Lipschitz continuity of gᵈ and gₛ, (ii) ηᵈ / ηₛ → ∞, (iii) standard step-size conditions Ση = ∞ and Ση² < ∞, and (iv) bounded iterates—the fast system θᵈ tracks the quasi-static equilibrium θ*ᵈ(θₛ) and the slow system θₛ evolves on the equilibrium manifold (Borkar, 2008, Theorem 2, Chapter 6). Local convergence to a locally asymptotically stable equilibrium of the slow-timescale ODE is guaranteed.
Proof sketch. Under frozen modules, the MDP is stationary: Pₜ = P, Rₜ = R. The fast system (Dopamine) sees a fixed MDP with fixed regulatory signals and satisfies standard policy gradient convergence conditions for the tabular/compatible function approximation case. The slow system (Serotonin) observes the time-averaged behavior of the converged fast system and optimizes a smooth objective over it. The two-timescale framework of Borkar (2008) and Bhatnagar et al. (2009) applies directly. ■
7.4. Self-Modification Frequency Bound
Proposition 2 (Modification frequency bound). Let N_mod(T) = |{Tₖ : Tₖ ∈ [0, T]}| be the number of module modifications in the interval [0, T]. Under the constraint budget B and minimum modification cost C_min = min_{a ∈ A_int} C(a), the expected modification frequency is bounded:

E[ N_mod(T) ] / T ≤ B / C_min.
Moreover, the inter-modification interval ΔT_mod = Tₖ₊₁ − Tₖ satisfies E[ΔT_mod] ≥ C_min / B.
Proof. Each modification action consumes at least C_min from the per-step constraint budget B. The Serotonin System’s constraint enforcement (Eq. 10–11) ensures that the long-run average constraint cost is bounded by B. By the renewal reward theorem, the long-run modification rate is at most B / C_min. ■
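Proposition 2 reduces to simple arithmetic on the budget and minimum cost; the helper name and the numbers below are illustrative:

```python
def modification_bounds(B, C_min):
    """Proposition 2 as arithmetic: the long-run modification rate is at
    most B / C_min, equivalently the expected inter-modification interval
    is at least C_min / B."""
    max_rate = B / C_min
    min_interval = C_min / B
    return max_rate, min_interval

# per-step budget 0.5, cheapest modification costs 5.0:
rate, interval = modification_bounds(B=0.5, C_min=5.0)
```

With these numbers the system can modify itself at most once per ten steps on average, which is the regime Proposition 1 needs: long quiet intervals in which the fast system can reconverge.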
Combined with Proposition 1, this result establishes that the system spends most of its time in the frozen-module regime where local convergence applies, punctuated by infrequent modification events that perturb the system to a new operating point. The key design question is whether the inter-modification intervals E[ΔT_mod] are long enough for the fast system to approximately reconverge—a condition that depends on the relative magnitudes of C_min, B, and the fast system’s convergence rate.
7.5. Summary of Stability Properties
The stability landscape of NeuroCore can be summarized as follows. Globally, the system does not converge (Theorem 2). Locally, during inter-modification intervals with frozen modules, the two-timescale system converges to local equilibria (Proposition 1). The Core’s activation distribution remains within a bounded KL-ball around its initial distribution (Theorem 3). The modification frequency is bounded (Proposition 2), ensuring that non-stationarity is controlled. The system is best characterized not as convergent but as metastable: it occupies a sequence of quasi-stable operating points, transitioning between them via infrequent self-modification events, with homeostatic regulation preventing catastrophic drift during transitions.
8. Testable Predictions
We formulate seven falsifiable predictions, each specifying a predicted outcome, comparison condition, and metric. Hypotheses are categorized as core (failure undermines the framework) or auxiliary (failure indicates a specific design revision).
H1 (Core). Continuous-representation Τ-Interfaces yield ≥15% higher cumulative reward and ≥2× faster learning on multi-modal tasks compared to text-serialized communication, controlling for total parameter count.
H2 (Core). After 10⁶ environment steps, the Core’s modification policy will exhibit entropy H ≥ 1.5 nats and long-run return exceeding a frozen-module baseline by ≥10%, demonstrating non-trivial emergent self-modification.
H3 (Core). The meta-RL Serotonin System produces ≥30% fewer cumulative constraint violations than a Lagrangian baseline under non-stationarity induced by periodic forced module modifications.
H4 (Auxiliary). Removing the stagnation penalty (ζ = 0) causes internal action frequency to fall below 10⁻⁴ per step, confirming Theorem 1’s prediction of modification avoidance.
H5 (Auxiliary). The integration depth Dᵢ (Eq. 21) correlates with actual performance degradation upon module removal with r ≥ 0.7, validating the information-theoretic cost function.
H6 (Auxiliary). The KL-divergence bound from Theorem 3 holds empirically for ≥95% of training, with excursions limited to brief transients following module modifications.
H7 (Core). NeuroCore with a minimal Core orchestrating specialist modules outperforms a monolithic model of equivalent parameter count on a heterogeneous task suite, with ≥40% lower inter-task performance variance.
9. Discussion
On the minimal Core thesis. One might object that a Core without higher cognitive capabilities cannot make intelligent self-modification decisions. We counter that the Core does not need to ‘understand’ its modules cognitively—it learns through reinforcement which abstract actions in which abstract states lead to improved long-run return. The analogy is to the basal ganglia, which orchestrate cortical function without performing cortical computation. The intelligence is in the learned policy, not in explicit comprehension.
On meta-RL versus constrained MDPs. The meta-RL Serotonin System is the framework’s most speculative element. It is more expressive than Lagrangian methods and better aligned with biological serotonin’s known functions, but introduces a second optimization loop with its own convergence challenges. If H3 fails empirically, a hybrid design—meta-RL for temporally extended regulatory decisions, Lagrangian methods for hard safety constraints—represents a principled fallback.
On non-convergence. Theorem 2 is not a flaw but a fundamental property of any simultaneously self-modifying and multi-objective system. Biological brains also do not converge to fixed states; they exhibit lifelong plasticity and perpetual non-stationarity. The partial stability results (Theorem 3, Propositions 1–2) collectively establish that the system operates in a bounded, metastable regime—not convergent, but constrained. Whether this metastable regime supports productive behavior is an empirical question addressable through our testable predictions.
Limitations. The framework has several limitations. First, the homeostatic bound (Theorem 3) assumes bounded policy gradients, which may not hold during training instabilities. Second, the mutual information cost (Eq. 21) requires estimation from finite samples, introducing approximation error. Third, the meta-RL Serotonin System’s convergence is not guaranteed even in the frozen-module regime (it may cycle). Fourth, and most fundamentally, the ‘emergence’ hypothesis—that effective self-modification strategies will arise from the interplay of stagnation pressure and modification costs—remains entirely unvalidated. The framework provides the substrate; whether emergence occurs is unknown.
Safety implications. A self-modifying system with autonomous agency raises significant safety concerns. The Serotonin System provides learned regulation, the homeostatic bounds limit drift, and the modification costs create inertia. However, the fundamental alignment challenge—ensuring a self-modifying system’s objectives remain aligned with human values as the system changes itself—is not solved by this framework. We note that the architecture’s modularity provides natural intervention points: human operators can freeze modules, adjust constraint budgets, or disable self-modification without disrupting the system’s ability to operate as a fixed architecture.
10. Conclusion
We have presented the NeuroCore framework: a formal treatment of modular neural architectures in which a minimal executive Core, equipped with distributional RL, a meta-RL regulatory system, and consequential self-modification capabilities, orchestrates a heterogeneous collection of specialist modules through continuous-representation interfaces. Our theoretical contributions establish the stagnation-modification tradeoff (Theorem 1), the fundamental non-convergence of coupled self-modifying systems (Theorem 2), and partial stability guarantees that collectively characterize a bounded metastable operating regime (Theorem 3, Propositions 1–2). The information-theoretic cost functions provide principled proxies for self-modification disruption.
The framework’s central thesis—that intelligent self-organization emerges from the interplay between a minimal controller and a rich modular ecosystem, given appropriate motivational and regulatory mechanisms—is a testable claim, not an established fact. We have provided seven falsifiable predictions to support or refute this thesis. The mathematics establishes what is guaranteed (bounded drift, bounded modification frequency, local convergence between modifications) and what is not (global convergence, emergence of effective strategies). This honest delineation of theoretical boundaries is, we believe, as valuable as the positive results.
Appendix A: Notation Summary

| Symbol | Description |
| --- | --- |
| hᶜ ∈ ℝᵈᶜ | Core latent state vector |
| πᵈ, θᵈ | Dopamine System policy and parameters |
| πₛ, θₛ | Serotonin System meta-policy and parameters |
| Mᵢ = (fᵢ, dᵢⁱⁿ, dᵢᵒᵘᵗ, θᵢ, ωᵢ) | Module specification tuple |
| τᵢ = (τᵢⁱⁿ, τᵢᵒᵘᵗ) | τ-Interface (bidirectional learned transformation) |
| ℳ | Module registry |
| A = A_ext ∪ A_int | Hierarchical action space |
| Z_θ(s, a; τ) | Distributional return (IQN quantile function) |
| U_epi(s, a) | Epistemic uncertainty (ensemble variance) |
| Dᵢ = I(zᵢ; hᶜ \| ...) | Integration depth (conditional mutual information) |
| γ(t), α(t), β(t), η_sc(t) | Serotonin regulatory outputs |
| λᵢ(t) | Adaptive constraint multipliers |
| ℒ_homeo | Homeostatic activation regularization loss |
| r_stag, r_util | Stagnation and utilization penalties |
| D_KL, D_TV, D_CKA | KL divergence, total variation, centered kernel alignment |
References
- Achiam, J.; Held, D.; Tamar, A.; Abbeel, P. Constrained Policy Optimization. Proc. ICML, 2017.
- Altman, E. Constrained Markov Decision Processes; Chapman & Hall/CRC, 1999.
- Andreas, J.; Rohrbach, M.; Darrell, T.; Klein, D. Neural Module Networks. Proc. CVPR, 2016.
- Bhatnagar, S.; Sutton, R.S.; Ghavamzadeh, M.; Lee, M. Natural Actor-Critic Algorithms. Automatica 2009, 45, 2471–2482.
- Borkar, V.S. Stochastic Approximation: A Dynamical Systems Viewpoint; Cambridge University Press, 2008.
- Burda, Y.; Edwards, H.; Storkey, A.; Klimov, O. Exploration by Random Network Distillation. Proc. ICLR, 2019.
- Cardozo Pinto, D.F.; et al. Dopamine and serotonin oppositely modulate reinforcement learning. Nature, 2024.
- Dabney, W.; Rowland, M.; Bellemare, M.G.; Munos, R. Distributional Reinforcement Learning with Quantile Regression. Proc. AAAI, 2018.
- Dabney, W.; et al. A distributional code for value in dopamine-based reinforcement learning. Nature 2020, 577, 671–675.
- Daskalakis, C.; Goldberg, P.W.; Papadimitriou, C.H. The Complexity of Computing a Nash Equilibrium. SIAM J. Comput. 2009, 39, 195–259.
- Daw, N.D.; Kakade, S.; Dayan, P. Opponent interactions between serotonin and dopamine. Neural Networks 2002, 15, 603–616.
- Doya, K. Metalearning and neuromodulation. Neural Networks 2002, 15, 495–506.
- Fedus, W.; Zoph, B.; Shazeer, N. Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity. JMLR 2022, 23, 1–39.
- Gao, L.; Schulman, J.; Hilton, J. Scaling Laws for Reward Model Overoptimization. Proc. ICML, 2023.
- Hu, E.J.; et al. LoRA: Low-Rank Adaptation of Large Language Models. Proc. ICLR, 2022.
- Irie, K.; Schlag, I.; Csordás, R.; Schmidhuber, J. A Modern Self-Referential Weight Matrix That Learns to Modify Itself. Proc. ICML, 2022.
- Kakade, S.; Langford, J. Approximately Optimal Approximate Reinforcement Learning. Proc. ICML, 2002.
- Kearns, M.J.; Singh, S.P. Near-Optimal Reinforcement Learning in Polynomial Time. Machine Learning 2002, 49, 209–232.
- Mazumdar, E.; Ratliff, L.J.; Sastry, S.S. On Gradient-Based Learning in Continuous Games. SIAM J. Math. Data Sci. 2020, 2, 103–131.
- Muller, T.H.; et al. Distributional reinforcement learning in prefrontal cortex. Nature Neuroscience 2024.
- Pathak, D.; Agrawal, P.; Efros, A.A.; Darrell, T. Curiosity-driven Exploration by Self-Supervised Prediction. Proc. ICML, 2017.
- Rosenbaum, C.; Klinger, T.; Riemer, M. Routing Networks. Proc. ICLR, 2018.
- Schmidhuber, J. A ‘Self-Referential’ Weight Matrix. Proc. ICANN, 1993; pp. 446–450.
- Schmidhuber, J. Gödel Machines: Fully Self-Referential Optimal Universal Self-Improvers. In Artificial General Intelligence; Springer, 2007; pp. 199–226.
- Shazeer, N.; et al. Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer. Proc. ICLR, 2017.
- Tsitsiklis, J.N.; Van Roy, B. An Analysis of Temporal-Difference Learning with Function Approximation. IEEE Trans. Autom. Control 1997, 42, 674–690.
- Turrigiano, G.G. The Self-Tuning Neuron: Synaptic Scaling of Excitatory Synapses. Cell 2008, 135, 422–435.
- VanRullen, R.; Kanai, R. Deep Learning and the Global Workspace Theory. Trends in Neurosciences 2021, 44, 692–704.
- Wang, G.; et al. Voyager: An Open-Ended Embodied Agent with Large Language Models. arXiv 2023, arXiv:2305.16291.
- Yang, J.; et al. Deep Model Reassembly. Proc. NeurIPS, 2022.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).