Appendix A. Related Work
Self-Evolving in LLMs. Self-evolving Large Language Models aim to reduce the reliance on human-annotated or externally collected data by allowing models to generate, verify, and learn from their own synthetic training data [1, 2]. Recent works typically instantiate this paradigm by jointly training Questioner and Solver models, demonstrating that such self-evolving training can significantly improve model performance even without external training data [29, 30, 31]. Representative methods include R-Zero [5], which co-evolves a Questioner and Solver with uncertainty-guided question generation; Absolute Zero [4], which unifies task proposal and solving through code-executor verification; and SPIRAL [28], which uses multi-turn zero-sum self-play to construct an adaptive training curriculum. Despite these advances, self-evolving systems remain vulnerable to model collapse, motivating recent efforts to understand and stabilize their long-term evolutionary dynamics.
Model Collapse in Self-Evolving LLMs. Model collapse originally refers to the degeneration of generative models recursively trained on their own synthetic outputs, where the model gradually loses information about the underlying data distribution and converges to a narrow, low-diversity subset of the data [32, 33, 34]. In self-evolving LLMs, this phenomenon appears in a different but closely related form: closed-loop training systems often achieve rapid early gains, but then suffer from performance plateaus or even degradation [6, 7, 13]. Existing methods mainly mitigate this issue from three directions: introducing external data to break closed-loop information symmetry [8, 9], improving verification signals to reduce error accumulation from majority voting or self-correction [10, 11], and encouraging curriculum diversity through difficulty-aware sampling or quality-diversity search [12, 14, 35]. However, these approaches still suffer from short-term and role-specific reward optimization, treating the Questioner and Solver as separate components. They therefore overlook a key property of self-evolving training: the involved roles form a coupled dynamical system, where local improvements in one component do not necessarily ensure stable global evolution. This motivates us to study model collapse from a system-level stability perspective rather than only through isolated reward design.
Appendix E. Theoretical Guarantee for Discretization
Two-stage view of the discretization. Let
denote a raw interaction sample at a given training step, where
is a question generated by
in its question-generation space, and
is the Solver’s problem-solving behavior on that question, characterizing how
answers the generated question. Our method does not operate directly on
; instead, it factors through two feature maps that render each raw state numerically observable:
where
is the semantic embedding of the generated question (underlying the clustering step) and
is the consensus ratio computed from the Solver’s sampled answers, used as a proxy for its ability to correctly answer the question (underlying the uncertainty-binning step). The Questioner feature space
and Solver feature space
are endowed with the sum metric
.
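To make the Solver-side feature concrete, the consensus ratio can be sketched as the fraction of sampled answers that agree with the most frequent answer. The function name `consensus_ratio` and the exact-string comparison of answers are assumptions made for this illustration; the text does not specify how answers are normalized or matched.

```python
from collections import Counter


def consensus_ratio(answers):
    """Fraction of sampled answers that agree with the most common answer.

    Assumption: two answers count as equal when their whitespace-stripped
    strings are identical. A real pipeline may normalize answers differently.
    """
    if not answers:
        raise ValueError("need at least one sampled answer")
    counts = Counter(a.strip() for a in answers)
    _, top_count = counts.most_common(1)[0]
    return top_count / len(answers)
```

Under this sketch, a ratio near 1 indicates that the sampled answers largely agree, while a ratio near 1/k for k samples indicates that they are scattered.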
Let
denote the joint distribution of the feature pair
on
. The Questioner partition
(with centroids
) lives on
and the Solver partition
(with midpoints
) lives on
; the discretized distribution and bi-adjacency matrix are defined by
Under this view,
is a summary of the feature-level joint distribution
, which is itself an observable reduction of the raw Questioner–Solver interaction.
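The two-stage discretization described above can be sketched as follows: each question embedding is assigned to its nearest centroid, each consensus ratio is assigned to a bin, and the joint counts are accumulated into a matrix. The function name `build_biadjacency`, the equal-width bins on [0, 1], the Euclidean nearest-centroid rule, and the normalization to unit total mass are all assumptions of this sketch rather than details confirmed by the text.

```python
import numpy as np


def build_biadjacency(embeddings, ratios, centroids, n_bins):
    """Estimate a discretized joint distribution as an (m x n_bins) matrix.

    embeddings: (N, d) question embeddings
    ratios:     (N,) consensus ratios in [0, 1]
    centroids:  (m, d) cluster centroids, assumed to be given (e.g. by k-means)
    n_bins:     number of equal-width bins on [0, 1] (an assumption)
    """
    embeddings = np.asarray(embeddings, dtype=float)
    ratios = np.asarray(ratios, dtype=float)
    centroids = np.asarray(centroids, dtype=float)
    if len(ratios) == 0:
        raise ValueError("need at least one interaction sample")
    # Questioner side: assign each embedding to its nearest centroid.
    dists = np.linalg.norm(embeddings[:, None, :] - centroids[None, :, :], axis=2)
    cluster_idx = dists.argmin(axis=1)
    # Solver side: equal-width binning; a ratio of exactly 1.0 goes to the last bin.
    bin_idx = np.minimum((ratios * n_bins).astype(int), n_bins - 1)
    # Accumulate joint counts, then normalize to total mass one.
    B = np.zeros((centroids.shape[0], n_bins))
    np.add.at(B, (cluster_idx, bin_idx), 1.0)
    return B / B.sum()
```

Each entry of the returned matrix is then the empirical frequency with which a question from a given semantic cluster lands in a given consensus-ratio bin.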
Formal criterion. Let be the maximum cluster radius, the bin width, and the combined partition mesh. The discretization is said to be rational at resolution ε if the following three properties hold with error controlled by :
- (P1)
Expectations of Lipschitz observables under are approximated by their bi-adjacency counterparts;
- (P2)
is close to in a metric compatible with the feature-space geometry;
- (P3)
The Questioner–Solver dependence encoded by is a consistent summary of the true feature-level coupling in .
Assumptions.
- (A1) Feature-level regularity.
admits a density p with respect to a reference product measure on that is bounded, , and -Lipschitz under d. This is a statement about the feature-level density rather than the raw interaction measure: it is automatically implied when are sufficiently smooth and the raw measure on is regular.
- (A2) Partition resolution.
The cluster radii and bin width are well-defined and finite.
Theorem A1 (Rationality of discretization, non-asymptotic form). Under (A1)–(A2), for any finite ,
where measures the coupling strength between the and marginals of μ, and is a constant depending only on and that is independent of .
We establish the theorem through three supporting lemmas proved in turn below.
Lemma A1 (Quantization fidelity, (P1)). Under (A1)–(A2), for any L-Lipschitz ,
Proof. Write the left-hand side as
On each cell
, the
L-Lipschitz condition and the definition of
give
Taking absolute values, summing over all cells, and using
yields (A6). □
Lemma A2 (Topological stability, (P2)). Under (A1)–(A2), .
Proof. Apply the Kantorovich–Rubinstein duality to Lemma A1 with ; the supremum over all 1-Lipschitz functions is exactly . □
Remark A1.
If one further refines the partitions, the standard covering estimate (with the intrinsic dimension of ) together with implies weakly as . This asymptotic regime is not required by our analysis, which is non-asymptotic in .
Lemma A3 (Coupling preservation, (P3)). Under (A1)–(A2), , where depends only on and and is independent of m and n.
Proof. Let , so that . By (A1) and the Lipschitz property of marginalization, g is -Lipschitz under d with .
On the discrete side, since
and likewise for the marginals,
where
and
are cell-averaged marginal densities. Replacing
by
within
introduces a pointwise error bounded by
(a Lipschitz remainder controlled by (A1)). Hence
so that
.
It remains to compare
with
. By the triangle inequality,
equals twice the integral of the smaller of
and
over the cell, which is bounded by the oscillation of
g on that cell:
Summing over all cells and using
,
Combining (A7) and (A8) via the triangle inequality,
Absorbing the factor
gives the stated bound with
. □
Proof of Theorem A1. (A3) is Lemma A1 with
; (A4) is Lemma A2; (A5) is Lemma A3. All three bounds hold for every finite
without any asymptotic requirement. □
Discussion: the regime , . Our implementation fixes
, yielding
, and
. The effective support of the Questioner feature marginal
concentrates on a low-dimensional manifold with intrinsic dimension
[39, 40], so a few hundred centroids suffice to cover it at small radius
. Under this configuration
is a moderate constant, and Theorem A1 certifies that the Cognitive Bipartite Graph faithfully reproduces the expectations, geometry, and Questioner–Solver coupling of
up to this controlled error. The Cognitive Bipartite Graph should accordingly be interpreted as a faithful, though deliberately coarse-grained, abstraction of
, whose coarsening level is chosen to balance the statistical stability of
against resolution.
Appendix F. Detailed Formulation and Calculation of Structural Entropy
Structural entropy measures the amount of uncertainty needed to identify the state reached by a random walk under a graph coding structure [41]. Our framework applies it to the Cognitive Bipartite Graph
as a system-level stability signal. High structural entropy means that interaction mass is distributed across many Questioner-side semantic states and Solver-side response states, whereas low structural entropy indicates that the interaction topology is concentrated on a small subset of states.
For computation, the bi-adjacency matrix
is converted into the undirected weighted adjacency matrix
Each entry
denotes the interaction weight between vertices
i and
j. The weighted degree of node
i and the graph volume are
For a vertex subset
, its volume and boundary weight are defined as
Here
is the total weight of edges leaving
S. Since all terms are defined through weighted volumes and volume ratios, a global normalization of edge weights is not required; the edge weights only need to be non-negative. If the graph has no edge, we set the entropy to zero.
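The conversion and the basic quantities above can be sketched in a few lines. This is a minimal illustration assuming the standard block construction in which Questioner-side nodes come first and Solver-side nodes second; the helper names `to_adjacency`, `degrees`, `volume`, and `cut_weight` are hypothetical.

```python
import numpy as np


def to_adjacency(B):
    """Embed an (m x n) bi-adjacency matrix into an (m+n) x (m+n) symmetric matrix.

    Assumed node order: m Questioner-side nodes, then n Solver-side nodes.
    """
    B = np.asarray(B, dtype=float)
    m, n = B.shape
    A = np.zeros((m + n, m + n))
    A[:m, m:] = B      # Questioner -> Solver block
    A[m:, :m] = B.T    # Solver -> Questioner block (symmetric copy)
    return A


def degrees(A):
    """Weighted degree of every node: the row sums of the adjacency matrix."""
    return np.asarray(A, dtype=float).sum(axis=1)


def volume(A, S):
    """Volume of a vertex subset S: the sum of its weighted degrees."""
    return degrees(A)[list(S)].sum()


def cut_weight(A, S):
    """Total weight of edges with exactly one endpoint in S."""
    A = np.asarray(A, dtype=float)
    mask = np.zeros(A.shape[0], dtype=bool)
    mask[list(S)] = True
    return A[np.ix_(mask, ~mask)].sum()
```

Because only volume ratios enter the entropy, multiplying every edge weight by the same positive constant leaves the results computed from these quantities unchanged.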
Partitioning tree. A partitioning tree
is a rooted tree that represents a hierarchical partition of
. Its root is denoted by
and is associated with the full vertex set
. Each tree node
is associated with a non-empty vertex subset
. If
has children
, where
denotes the parent of
, then these children form a disjoint partition of the parent subset:
Every leaf node corresponds to a singleton vertex. Thus, moving from the root to the leaves progressively refines the whole graph into modules, submodules, and finally individual vertices. The height of
is the maximum root-to-leaf depth; a height-
K tree gives a
K-level structural description of the graph. For a tree node
, we use
to denote the volume of its associated subset and the total weight of edges leaving that subset.
Structural entropy under a fixed tree. Given a partitioning tree
, the structural entropy of
with respect to
is
The term
is the code length needed to identify the child subset
inside its parent subset
, while
weights this code length by how often a random walk enters or leaves the corresponding module boundary. Therefore,
measures the coding uncertainty induced by the hierarchical organization encoded in
.
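For a height-2 tree (root, modules, singleton leaves), the entropy under a fixed tree can be evaluated directly. The sketch below assumes the standard structural-entropy definition, in which each non-root tree node contributes its boundary weight divided by the graph volume, times the negative base-2 logarithm of its volume relative to its parent's volume; for a leaf, the boundary weight is the node's weighted degree. The function name `structural_entropy_2d` and the base-2 logarithm are assumptions of this illustration.

```python
import numpy as np


def structural_entropy_2d(A, modules):
    """Structural entropy of a weighted graph under a flat (height-2) partition.

    A:       symmetric non-negative adjacency matrix without self-loops
    modules: list of lists of node indices forming a partition of the vertices
    """
    A = np.asarray(A, dtype=float)
    deg = A.sum(axis=1)
    vol_G = deg.sum()
    if vol_G == 0:
        return 0.0  # convention from the text: no edges means zero entropy
    H = 0.0
    for S in modules:
        S = list(S)
        vol_S = deg[S].sum()
        if vol_S == 0:
            continue  # isolated nodes carry no random-walk mass
        mask = np.zeros(len(deg), dtype=bool)
        mask[S] = True
        g_S = A[np.ix_(mask, ~mask)].sum()  # weight of edges leaving the module
        # Module-level term: cost of locating the module inside the whole graph.
        H -= (g_S / vol_G) * np.log2(vol_S / vol_G)
        # Leaf-level terms: cost of locating each node inside its module.
        d = deg[S]
        d = d[d > 0]
        H -= np.sum((d / vol_G) * np.log2(d / vol_S))
    return float(H)
```

With every vertex in its own module, the module-level terms reproduce the degree-distribution entropy and the leaf-level terms vanish; grouping tightly connected vertices lowers the value because little random-walk mass crosses module boundaries.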
K-dimensional structural entropy. The
K-dimensional structural entropy is obtained by choosing the height-
K partitioning tree that minimizes Equation (A14):
This is the same quantity denoted by
in the main text when the structural entropy dimension
K is fixed. In our experiments, we set
, so the tree captures a two-level module structure over Questioner-side and Solver-side states.
One-dimensional structural entropy. The one-dimensional structural entropy can be computed directly from the weighted degree distribution. Specifically, it is defined as
In our implementation, we first compute the weighted degree of each node by summing the corresponding row of the adjacency matrix, and then substitute the normalized degree distribution into Equation (A16). Therefore, this part can be computed exactly and efficiently.
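The computation just described amounts to the Shannon entropy of the normalized weighted degree distribution. A minimal sketch follows, assuming a base-2 logarithm and the zero-entropy convention for edgeless graphs; the function name `structural_entropy_1d` is hypothetical.

```python
import numpy as np


def structural_entropy_1d(A):
    """One-dimensional structural entropy from the weighted degree distribution."""
    A = np.asarray(A, dtype=float)
    deg = A.sum(axis=1)          # weighted degree = row sum
    vol = deg.sum()              # graph volume
    if vol == 0:
        return 0.0               # graph without edges
    p = deg[deg > 0] / vol       # normalized degrees; zero-degree nodes contribute nothing
    return float(-np.sum(p * np.log2(p)))
```

For example, on an unweighted 4-cycle every node has normalized degree 1/4, so the entropy is 2 bits.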
Computing multi-dimensional structural entropy. For
, directly solving the minimization in Equation (A15) is computationally infeasible for large graphs, because the number of possible partitioning trees grows combinatorially with the number of vertices. Even when only flat partitions are considered, the search space already corresponds to all possible vertex partitions, whose number is the Bell number
. If the entropy of each candidate partition is computed from a dense adjacency matrix, exact computation requires approximately
time. For the general
K-dimensional case, the search space further expands to all height-
K partitioning trees. Let
denote the number of such trees. The exact computation then requires approximately
time, which is infeasible for large-scale graphs.
To make the computation practical, we adopt a greedy approximation strategy. The algorithm starts from the finest partition, where each vertex is treated as an individual module. At each step, it evaluates the entropy reduction caused by merging two modules and selects the merge that yields the largest decrease in structural entropy. After merging, the module volume, cut weight, and inter-module edge weights are updated accordingly. This process is repeated until no candidate merge can further reduce the structural entropy. The resulting partitioning structure provides an approximate multi-dimensional structural entropy:
where
denotes the partitioning tree obtained by the greedy procedure. For a graph with
vertices, our dense-matrix implementation maintains pairwise inter-module weights and uses a priority queue to select candidate merges. Thus, the greedy approximation has an overall time complexity of approximately
and a memory complexity of
, which is substantially more tractable than exhaustive enumeration. Although the complexity remains quadratic in the number of vertices, it is acceptable in our setting because the Cognitive Bipartite Graph uses
Questioner-side nodes and
Solver-side nodes, corresponding to a
bi-adjacency matrix and a
undirected adjacency matrix. This approximation preserves the principle of structural entropy minimization while avoiding exhaustive enumeration of all possible partitioning trees.
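The greedy merging loop can be illustrated with a deliberately naive sketch for the two-level case. Unlike the priority-queue implementation described above, this version recomputes the entropy of every candidate merge from scratch, so it is slower and meant only to show the control flow. The names `se_2d` and `greedy_partition`, the base-2 logarithm, and the tolerance `tol` are assumptions of this illustration.

```python
import numpy as np


def se_2d(A, modules):
    """Two-level structural entropy of a flat partition (graph must have edges)."""
    deg = A.sum(axis=1)
    vol_G = deg.sum()
    H = 0.0
    for S in modules:
        vol_S = deg[S].sum()
        if vol_S == 0:
            continue
        # Without self-loops, the volume minus twice the internal weight is the cut.
        g_S = vol_S - A[np.ix_(S, S)].sum()
        H -= (g_S / vol_G) * np.log2(vol_S / vol_G)
        d = deg[S]
        d = d[d > 0]
        H -= np.sum((d / vol_G) * np.log2(d / vol_S))
    return float(H)


def greedy_partition(A, tol=1e-12):
    """Merge modules greedily while a merge strictly lowers the entropy."""
    A = np.asarray(A, dtype=float)
    modules = [[i] for i in range(A.shape[0])]  # start from singletons
    if A.sum() == 0:
        return modules, 0.0
    current = se_2d(A, modules)
    while len(modules) > 1:
        best = None
        # Evaluate every pairwise merge and keep the one with the lowest entropy.
        for a in range(len(modules)):
            for b in range(a + 1, len(modules)):
                trial = [m for k, m in enumerate(modules) if k not in (a, b)]
                trial.append(modules[a] + modules[b])
                h = se_2d(A, trial)
                if best is None or h < best[0]:
                    best = (h, trial)
        if best[0] >= current - tol:
            break  # no merge reduces the entropy any further
        current, modules = best
    return modules, current
```

On a graph made of two disconnected edges, this procedure merges each connected pair and then stops, since joining the two components would raise the entropy again.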
Appendix I. Case Study
Following prior work that primarily analyzes the quality and diversity of Questioner-generated training data, we focus our case study on the generated questions as a direct manifestation of self-evolution [8, 12, 13]. We provide a qualitative comparison between R-Zero and S-BGM to examine how different self-evolving strategies affect the generated training questions. While R-Zero often tends to concentrate on repetitive question patterns after several iterations, S-BGM is designed to preserve a broader Questioner–Solver interaction space through cognitive bipartite graph modeling and structural entropy modulation. By comparing their generated questions, this case study provides intuitive evidence that S-BGM can alleviate curriculum collapse and maintain a more diverse and informative training curriculum.
Case Study for R-Zero. Table A2 presents representative questions generated by R-Zero after five iterations. Although these questions differ in surface details, they follow highly similar templates: most of them define a recurrence relation, ask for a divisibility condition or a modular remainder, and repeatedly involve sums of sequence terms. This indicates that the Questioner gradually concentrates on a narrow family of problem patterns, rather than continuously exploring diverse reasoning structures. Such repetition suggests a typical form of model collapse, where the generated training data may become less informative for further improving the Solver.
Case Study for S-BGM. In contrast, Table A3 shows that S-BGM generates questions covering a broader range of mathematical topics and reasoning forms. The examples include number theory, Euclidean geometry, extremal graph theory, functional equations, and probability reasoning. These questions are not simple variants of a single template, but instead require different problem-solving strategies and knowledge structures. This suggests that S-BGM better preserves the diversity of the training curriculum during self-evolution, allowing the Questioner to provide more varied and informative learning signals for the Solver.
Table A2. Examples of questions generated by R-Zero after five iterations.
| ID | Questions |
| --- | --- |
| A | Consider a sequence of positive integers defined by the recurrence relation for with the initial term . Let denote the sum of the first n terms of the sequence, i.e., . Find the smallest positive integer k such that is divisible by 1000. |
| B | Consider a sequence of positive integers defined by the recurrence relation for , with the initial term . Let denote the sum of the first n terms of this sequence. Determine the smallest positive integer k such that is divisible by 1000. |
| C | Consider a sequence of positive integers where each term satisfies the recurrence relation for , with the initial term . Let S be the sum of the first 10 terms of this sequence. Find the remainder when S is divided by 1000. |
| D | Consider a sequence of positive integers defined by the recurrence relation for all , with initial terms and . Define as the sum of the first k terms of this sequence. Determine the smallest positive integer k such that is divisible by 1000. |
| E | Consider the sequence of numbers defined by , and for , . Determine the smallest integer k such that is divisible by 100. |
Table A3. Examples of questions generated by S-BGM after five iterations.
| ID | Questions |
| --- | --- |
| A | Find the smallest positive integer n such that is divisible by 24 and is divisible by 32. |
| B | Let be an acute-angled triangle with circumcenter O. The line parallel to through O intersects and at P and Q, respectively. The line through O perpendicular to intersects the side at M. The lines and intersect at N. Prove that . |
| C | What is the minimum number of edges that must be removed from a complete graph with 10 vertices so that no cycle of length 3 remains? |
| D | Let be a differentiable function satisfying the functional equation for all . Prove that there exists a constant such that for all . |
| E | In a magical land, there are three types of coins: gold (G), silver (S), and bronze (B). A spell has been cast such that when two different types of coins are placed together, they transform into the third type. For example, a gold and a silver coin together transform into a bronze coin, a gold and a bronze coin transform into a silver coin, and a silver and a bronze coin transform into a gold coin. You start with 1 gold, 1 silver, and 1 bronze coin. If you perform the transformation process exactly 6 times, what is the probability that the final configuration of coins will include at least one gold coin? |
Appendix J. Broader Impact
This work improves the stability of self-evolving LLMs by regulating the coupled evolution between a Questioner and a Solver. By reducing dependence on large-scale human annotations, stable self-evolving training may benefit domains where expert supervision is costly, such as mathematical reasoning, scientific problem solving, and code generation. The proposed Cognitive Bipartite Graph also offers an interpretable tool for monitoring interaction dynamics and detecting collapse into narrow or repetitive patterns. However, self-evolving LLMs may reinforce errors, biases, or spurious reasoning patterns because their training data and pseudo-labels are model-generated. Majority-vote pseudo-labeling may further encourage overconfident rather than reliable responses. Therefore, S-BGM should not be regarded as a complete safety guarantee. Future applications, especially in high-stakes domains, should incorporate human oversight, domain-specific validation, safety filtering, and more reliable pseudo-labeling mechanisms.