Preprint
Article

This version is not peer-reviewed.

Assembly Theory - Formalizing Assembly Spaces, Discovering Patterns and Bounds

Submitted:

27 December 2025

Posted:

29 December 2025


Abstract
Assembly theory defines structural complexity as the minimum number of steps required to construct an object in an assembly space. We formalize the assembly space as an acyclic digraph of strings. Key results include analytical bounds on the minimum and maximum assembly indices as functions of string length and alphabet size, and relations between the assembly index (ASI), assembly depth, depth index, Shannon entropy, and expected waiting times for strings drawn from uniform distributions. We identify patterns in minimum- and maximum-ASI strings and provide construction methods for the latter. While computing ASI is NP-complete, we develop efficient implementations that enable ASI computation of long strings. We establish a counterintuitive, inverse relationship between a string ASI and its expected waiting time. Geometric visualizations reveal that ordered decimal representations of low ASI bitstrings of even length N naturally cluster on diagonals and oblique lines of the squares with sides equal to 2N/2. Comparison with grammar-based compression (Re-Pair) shows that ASI provides superior compression by exploiting global combinatorial patterns. These findings advance complexity measures with applications in computational biology (where DNA sequences must violate Chargaff's rules to achieve minimum ASI), graph theory, and data compression.

1. Introduction

Assembly theory (AT) quantifies the minimum number of constructive steps required to generate an object, providing a principled measure of its structural complexity. It is built around two fundamental concepts: an assembly pool (a predefined set of basic items together with the set of subitems assembled so far) and an assembly step (joining a pair of items taken from the assembly pool in a predefined order and cloning [1] the assembled subitem into the assembly pool if it is not already there). Together they define the assembly index (ASI) as the minimal number of steps necessary to construct the object. Since its original formulation in 2017 [2], AT has been generalized to a broad range of theoretical and practical settings. A considerable body of work on AT [3,4,5,6,7,8,9,10,11,12,13,14] has showcased its scale invariance: its implementations, whether based on graphs, building blocks, DNA nucleobases, bounded structures, or more abstract units assembled in various spaces, share the same fundamental characteristics.
Intrinsic purpose-driven mechanics implementing global combinatorics (heuristics) of AT provides a genuinely different perspective from classical notions of algorithmic or statistical complexity, despite the fact that AT appears to share many methodological features with classical dictionary-based [15,16,17] and more particularly grammar-based [18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33] data compression schemes. This resemblance stems from the fact that both AT and compression algorithms seek to capture redundancy, identify reusable substructures, and derive concise generative descriptions of data. However, instead of quantifying randomness or compressibility, AT focuses on causal construction pathways and structural regularities encoded in minimal assembly pathways (i.e., in the ASI). This perspective naturally raises the question of how such regularities can be efficiently detected, represented, and exploited by formal computational methods.
In this study, we focus on the general string version of AT. The paper is organized as follows. Section 2 introduces definitions and basic theorems used in the paper. Section 3 concerns the minimum complexity strings of AT, showing certain relations between the minimum ASI, assembly depth, depth index, and the Shannon entropy. Section 4 concerns the maximum complexity strings of AT, provides certain bounds of the maximum ASI as a function of a string length and the number of basic symbols, as well as certain methods of constructing high ASI strings. Section 5 concerns results common to the minimum and maximum complexity strings of AT, including a correlation between the expected waiting time for a string of outcomes obtainable from a uniform categorical distribution and the ASI of this string. Section 6 discusses the supremacy of ASI compression over all polynomial-time compression algorithms. Finally, Section 7 summarizes the findings of this study. In general, the proofs of theorems and lemmas, methods, and tables are given in Appendices.

2. General Framework

Consider a string $C_k^{(N,b)}$ (we often write it simply as $C_k$) of length $N$ made of $b$ distinct basic symbols (unit-length strings) $c$. We shall label the symbols $c$ using decimal digits $\{0, 1, \dots, b-1\}$, and letters if $b > 10$, where $b_k := b - k$. We can assume that all strings $C_k$ belong to a set $C$ and are vertices of a graph, where certain strings are connected via edges $e$ from a set $E$. Definition 2.1 and Lemma 2.1 were already stated in our previous studies [10,12]. We restate them here for clarity.
Definition 2.1  
(Assembly Space). An assembly space $\Omega = (C, E)$ is an acyclic digraph of strings $C = \{C_k\}$, $k \in \mathbb{N}$, where all $b \in \mathbb{N}$ unit-length strings (basic symbols) are source vertices and the remaining strings are 2-in-regular assembly step vertices, and $E$ is a set of edges.
Definition 2.2  
(Assembly Step). An assembly step $s$ is the formation of a new string $C_z$ in an assembly space $\Omega$ from two, not necessarily different, strings $C_x$, $C_y$ by concatenation, $C_z = C_x C_y = \mathrm{strcat}(C_x, C_y)$, which establishes the edges $e = (C_x, C_z)$ and $e = (C_y, C_z)$.
Thus, any edge $e = (C_x, C_z)$ unambiguously resolves to either $e = (C_x, C_x C_y)$ or $e = (C_x, C_y C_x)$. For example, the edge $e = ([010], [0101])$ unambiguously resolves to $e = ([010], [010][1])$. Even though all vertices of $\Omega$ are strings, it is convenient to separate this set into a set $B := \{C_k^{(N,b)} \in C : N = 1\} \subset C$ of source vertices and a set $S := \{C_k^{(N,b)} \in C : N \neq 1\} \subset C$ of 2-in-regular assembly step vertices.
Definitions 2.1 and 2.2 are consistent: all vertices are unique (in any standard graph, all vertices should be unique) and all are strings. Since an assembly step always consists of joining two parts only [2], this can be thought of as the left and right fragments of the newly formed string, and those strings that can be the result of the concatenation of two shorter strings are assembly step 2-in-regular vertices. The uniqueness of each vertex is a sufficient criterion to establish the admissibility of an assembly step and to introduce the notion of an assembly pool: vertices (strings) present in the assembly space can not be assembled again as new vertices of Ω , as they would not be unique.
Definition 2.3  
(String Assembly Space). An assembly space $\Omega_{C_s}$ of a string $C_s$ is an assembly space (Definition 2.1) containing the vertex $C_s$ and all the vertices leading to the string $C_s$.
There can be more than one assembly space for a target string, reflecting different assembly pathways leading to it. However, the ASI $a^{(N,b)}(C_s)$ of a string $C_s^{(N,b)}$ is the minimum cardinality $|S(\Omega_{C_s})|$ of the set of assembly step vertices $S(\Omega_{C_s})$ over all assembly spaces $\Omega_{C_s}$ of the string $C_s$.
Strings in an assembly space (Definition 2.1) can form 2-grams, 3-grams, 4-grams, and, in general, $n$-grams, wherein the following holds.
Lemma 2.1.  
For all $b$, a 4-gram is the shortest string that allows for more than one ASI.
$N = 2$ provides $b^2$ available 2-grams with unit ASI. $N = 3$ provides $b^3$ available 3-grams with ASI equal to two. Only $N = 4$ provides $b^4$ 4-grams that include $b^2$ 4-grams with ASI equal to two, that is, $b$ 4-grams $C_{k,\min}^{(4,b)} = [c_1 c_1 c_1 c_1]$ and $b(b-1)$ 4-grams $C_{l,\min}^{(4,b)} = [c_1 c_2 c_1 c_2]$ with $c_1 \neq c_2$, while the ASI of the remaining $b^4 - b^2$ 4-grams is three. For example, to assemble the 4-gram $C_{k,\min}^{(4,4)} = [0202]$, we need to assemble the 2-gram $[02]$ and reuse it, while there is nothing available to reuse in the case of the 4-gram $C_l^{(4,4)} = [0123]$.
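The ASI values quoted above can be checked by exhaustive search for very short strings. The following is only a minimal sketch, not the efficient implementation referred to in the abstract: the function name `assembly_index` and the iterative-deepening strategy are assumptions made for illustration. It relies on the observation that any string used on a pathway to the target must be a substring of the target.

```python
def assembly_index(target: str) -> int:
    """Brute-force ASI of a short string (illustrative sketch only)."""
    n = len(target)
    if n <= 1:
        return 0
    basics = frozenset(target)  # basic symbols (source vertices)
    # Only substrings of the target (length >= 2) can lie on a pathway to it.
    subs = {target[i:j] for i in range(n) for j in range(i + 2, n + 1)}

    def reachable(pool, budget, seen):
        # pool: frozenset of strings assembled so far
        if target in pool:
            return True
        if budget == 0 or pool in seen:
            return False
        seen.add(pool)
        items = basics | pool
        candidates = {x + y for x in items for y in items} & subs
        return any(reachable(pool | {z}, budget - 1, seen)
                   for z in candidates - pool)

    # Try 1, 2, ..., N - 1 assembly steps; N - 1 steps always suffice.
    for budget in range(1, n):
        if reachable(frozenset(), budget, set()):
            return budget
    return n - 1
```

The search is exponential in the string length, so it is suitable only for checking small examples such as the 4-grams discussed in Lemma 2.1.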
Definition 2.4  
(Clear/Mixed String). We call a string (an n-gram) containing only one symbol a clear string (a clear n-gram). We call a string (an n-gram) which is not clear a mixed string (a mixed n-gram).
Theorem 2.1.  
Each $k_r$ copies of an $n_r$-gram $C_r^{(n_r,b)}$ contained in a string $C_m^{(N,b)}$ decrease its ASI at least by $k_r (n_r - 1) - a^{(n_r,b)}(C_r)$. That is,
$$a^{(N,b)}(C_m) \le N - 1 - \sum_{r=1}^{R} \left[ k_r (n_r - 1) - a^{(n_r,b)}(C_r) \right], \tag{1}$$
where $R$ is the total number of repeated $n_r$-grams.
For example, due to the presence of three copies of the 5-gram $[01001]$, each with $a^{(5,6)}([01001]) = 3$, in the string
$$C_k^{(24,6)} = [12|01001|21|01001|235|01001|52], \tag{2}$$
its ASI amounts to $a^{(24,6)}(C_k) = 24 - 1 - (3 \cdot (5 - 1) - 3) = 14$. The relation (1) provides only an upper bound on the ASI, as it does not describe the situation in which an $n$-gram with $n > 2$ is assembled from an $(n-1)$-gram that is also present in the string $C_m^{(N,b)}$ outside of this $n$-gram. For example, the ASI $a^{(14,9)}([56|101|78|101|43|01]) = 10$, while $14 - 1 - (2(3 - 1) - 2) = 11$.
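The arithmetic of relation (1) can be sketched in a few lines. The helper name `asi_upper_bound` and the tuple layout `(k_r, n_r, a_r)` are assumptions introduced here for illustration; the repeated $n$-grams and their ASIs must be supplied by the caller.

```python
def asi_upper_bound(length, repeats):
    """Upper bound of relation (1).

    length  -- string length N
    repeats -- iterable of (k_r, n_r, a_r): number of copies, n-gram length,
               and ASI of each repeated n-gram (hypothetical input layout)
    """
    saving = sum(k * (n - 1) - a for k, n, a in repeats)
    return length - 1 - saving


# Three copies of the 5-gram [01001] (ASI 3) in a string of length 24.
print(asi_upper_bound(24, [(3, 5, 3)]))
# Two copies of the 3-gram [101] (ASI 2) in a string of length 14.
print(asi_upper_bound(14, [(2, 3, 2)]))
```

For the second string the bound gives 11, one more than the ASI of 10 stated in the text, which illustrates that (1) is not tight in general.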
Consider the strings C k and C l , where C l is C k in reverse order. The string C k can have a distribution of n r -grams leading to its ASI decrease (1). But the string C l will have the same distribution of n r -grams in reverse order. Consequently, the strings C k and C l have the same ASI.
Definition 2.5  
(AT String Equivalence). We consider two strings C k and C l to be AT-equivalent if they have the same ASI because C l is C k in reverse order and/or if C l and C k differ at most in symbol permutations.
If it is not otherwise clear from the context, in this study, we consider equivalent strings (e.g., $\{[001222], [111200], \dots\}$) as one string.
Another measure of a string’s complexity is the assembly depth [34].
Definition 2.6  
(Assembly Depth). The assembly depth (ASD) is
$$d_s^{(N_k,b)}(C_k) := \max\left\{ d^{(N_l,b)}(C_l),\ d^{(N_m,b)}(C_m) \right\} + 1, \tag{3}$$
where $d_0^{(1,b)}(c) := 0$, and $d^{(N_l,b)}(C_l)$ and $d^{(N_m,b)}(C_m)$ are the ASDs of the two substrings $C_l$, $C_m$ of the string $C_k$ that were joined in step $s$. For $N > 3$, if there are several assembly pathways with different depths $w_j$ leading to a string, which happens if at least two independent assembly steps are possible, the minimum pathway depth is the ASD of this string. Hence, the ASD captures the notion of an independent assembly step.
The ASD naturally divides the lengths of strings into sections $2^{d-1} < N \le 2^d$ (cf. Figure 4 and Figure 5).
Theorem 2.2.  
If an assembly space Ω contains strings having the same (non-zero) ASD they were assembled in independent assembly steps.
In other words, if two strings C l , C m in Ω have the same ASD, their assembly pathways are unrelated to each other; by the defining equation (3), neither of them could have been used in the assembly pathway of the other.
Corollary 2.1.  
If ASI and ASD of a string are equal to each other, an assembly space of this string cannot contain independent assembly steps.
Lemma 2.2.  
For all $b$, the maximum length $N$ of any string that can be assembled with the ASD $d_s(N)$ (3) satisfies
$$N \le 2^{d_s(N)}. \tag{4}$$
Theorem 2.3.  
For all $b$, the minimum ASD (minASD) as a function of the string length $N$ is given by
$$d_{\min}(N) = \lceil \log_2(N) \rceil, \tag{5}$$
where $\lceil x \rceil$ denotes the ceiling function.
The ASD need not be a monotonically nondecreasing function of the assembly step. For example,
$$[11]_{d_1=1};\ [110]_{d_2=2};\ [01]_{d_3=1};\ [00]_{d_4=1};\ [0001]_{d_5=2};\ [0001110]_{d_6=3}. \tag{6}$$
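The depths in this example follow directly from the recursion (3). The sketch below evaluates that recursion along one given pathway; the function name `pathway_depths` and the representation of a pathway as a list of `(left, right)` pairs are assumptions made for illustration. It returns the depth along the supplied pathway only and does not minimize over alternative pathways.

```python
def pathway_depths(steps):
    """Depth of each string assembled along a given pathway, per relation (3).

    steps -- list of (left, right) pairs in assembly order; each part must be
             a basic symbol (depth 0) or a string assembled in an earlier step
    """
    depth = {}
    for left, right in steps:
        for part in (left, right):
            if len(part) > 1 and part not in depth:
                raise ValueError(f"{part!r} has not been assembled yet")
        depth[left + right] = 1 + max(depth.get(left, 0), depth.get(right, 0))
    return depth


# The six assembly steps of the example above.
steps = [("1", "1"), ("11", "0"), ("0", "1"),
         ("0", "0"), ("00", "01"), ("0001", "110")]
for string, d in pathway_depths(steps).items():
    print(string, d)
```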
Figure 1. Assembly spaces of bitstrings $C_{\max}^{(N,2)}$ (a, c) and $C_{\min}^{(N,2)}$ (b, d) for $N = 2^s = 16$ (a, b) and $N = 15 \neq 2^s$ (c, d), where the assembly index is a number in a string (final string for (a, c)) and the assembly depth corresponds to the level. For $N = 2^s$, $d_{a_{\max}}^{(2^s,b)} = d_{a_{\min}}^{(2^s,b)} = s$. In general, for $N \neq 2^s$, the assembly depth $d_{a_{\max}}^{(N,b)} < d_{a_{\min}}^{(N,b)}$.
Figure 2. Assembly space Ω , assembly index, and assembly depth. The assembly space of all eight binary 3-grams with all pathways (a). Blue edge provides the 1st string, red edge provides the 2nd string in the assembly step, and the order is irrelevant for two green edges or green edge provides the 1st or 2nd string in dependence of the color of the complementary edge. Dotted edges and question marks indicate alternative pathways. The assembly space of the bitstring C 6 ( 7 , 2 ) = [ 0001110 ] showing that its assembly index a ( 7 , 2 ) ( C 6 ) = 6 (b). The evolution of assembly spaces of strings [ 0101 ] and [ 0110 ] (c-e). Strings [ 0101 ] and [ 0110 ] are initially assembled from 3-grams and basic symbols, increasing the assembly depth (c). New pathways increasing the number of independent assembly steps are found (d), and the edges of Ω are reconfigured, decreasing the number of assembly steps of the string [ 0101 ] from three to two steps and the assembly depth of both 4-grams from three to two. Five assembly spaces of the bitstrings [ 0101 ] , [ 010 ] (two alternatives, one encircled), [ 011 ] , and [ 0110 ] (e)
We cannot consider the ASD apart from the ASI. For example, the ASD of the string $C_{\max}^{(7,2)} = [0001110]$ is $d_{a_{\max}}^{(7,2)} = \lceil \log_2(7) \rceil = 3$, even though this string can be assembled in six steps with three larger pathway depths $w_6 \in \{4, 5, 6\}$ as
$$\begin{array}{llll}
[00]_{d_1=1}, & [00]_{w_1=1}, & [00]_{w_1=1}, & [00]_{w_1=1}, \\
[01]_{d_2=1}, & [01]_{w_2=1}, & [01]_{w_2=1}, & [000]_{w_2=2}, \\
[11]_{d_3=1}, & [11]_{w_3=1}, & [0001]_{w_3=2}, & [0001]_{w_3=3}, \\
[110]_{d_4=2}, & [0001]_{w_4=2}, & [00011]_{w_4=3}, & [00011]_{w_4=4}, \\
[0001]_{d_5=2}, & [000111]_{w_5=3}, & [000111]_{w_5=4}, & [000111]_{w_5=5}, \\
[0001110]_{d_6=3}, & [0001110]_{w_6=4}, & [0001110]_{w_6=5}, & [0001110]_{w_6=6}.
\end{array} \tag{7}$$
Similarly, the ASD of the string $C_{\max}^{(8,2)} = [00011101]$ is $d_{a_{\max}}^{(8,2)} = \lceil \log_2(8) \rceil = 3$, as
$$\begin{array}{llll}
[00]_{d_1=1}, & [00]_{w_1=1}, & [00]_{w_1=1}, & [01]_{w_1=1}, \\
[01]_{d_2=1}, & [01]_{w_2=1}, & [01]_{w_2=1}, & [001]_{w_2=2}, \\
[11]_{d_3=1}, & [11]_{w_3=1}, & [0001]_{w_3=2}, & [0001]_{w_3=3}, \\
[0001]_{d_4=2}, & [0001]_{w_4=2}, & [00011]_{w_4=3}, & [00011]_{w_4=4}, \\
[1101]_{d_5=2}, & [000111]_{w_5=3}, & [000111]_{w_5=4}, & [000111]_{w_5=5}, \\
[00011101]_{d_6=3}, & [00011101]_{w_6=4}, & [00011101]_{w_6=5}, & [00011101]_{w_6=6}.
\end{array} \tag{8}$$
However, the non-maximum and non-minimum ASI string $C_k^{(8,2)} = [01001011]$ has only two 2-grams that can be assembled in independent steps. Hence, its ASD cannot be decreased to $\lceil \log_2(8) \rceil = 3$:
$$\begin{array}{ll}
[01]_{d_1=1}, & [01]_{w_1=1}, \\
[11]_{d_2=1}, & [010]_{w_2=2}, \\
[010]_{d_3=2}, & [010010]_{w_3=3}, \\
[010010]_{d_4=3}, & [0100101]_{w_4=4}, \\
[01001011]_{d_5=4}, & [01001011]_{w_5=5}.
\end{array} \tag{9}$$
In general, the $\Omega$ that contains a $2^d$-gram having the ASD $d$ can also contain $\{2^{d-1}+1, 2^{d-1}+2, \dots, 2^d - 1\}$-grams having the ASD $d$ and based on the shorter $n$-grams of length $n < 2^{d-1} + 1$.
Theorem 2.4.  
For all $b$, the ASD of any maximum ASI string $C_{\max}^{(N,b)}$ corresponds to the minASD (5) of Theorem 2.3, that is,
$$d_{a_{\max}}^{(N,b)} = \lceil \log_2(N) \rceil. \tag{10}$$
For example, as shown in Figure 1c,d, the string $C_{\max}^{(15,2)} = [010101000011100]$ has the ASI $a_{\max}^{(15,2)} = 10$ and the ASD $d_{a_{\max}}^{(15,2)} = 4$, while the string $C_{\min}^{(15,2)} = [010010100101001]$ has a smaller ASI $a_{\min}(15) = 5$ but a larger ASD $d_{a_{\min}}^{(15,2)} = 5$. On the other hand, the ASD of the maxASI string $C_{(N-5)}^{(16,2)} = [0101010000001110]$ and of the minASI string $C_{\min}^{(16,2)} = [0101010101010101]$, shown in Figure 1a,b, is the same.
Here, we introduce the following definition, which, as we shall see, is also related to the independent assembly step.
Definition 2.7  
(Depth Index). We call the number of steps $\hat{a}_{\min}(N)$ needed to reach 1, starting from $N \equiv N_0$ and assigning
$$N_{s+1} = \begin{cases} N_s - 1 & \text{if } N_s \text{ is odd}, \\ N_s - 2 & \text{if } N_s = 2^s + 2,\ s \in \mathbb{N}, \\ N_s / 2 & \text{otherwise}, \end{cases} \tag{11}$$
the depth index (DPI).
The relation (11) yields the same number of steps as the Chandah-sutra method (OEIS A014701) and, unlike the minimum ASI, is an analytical function of $N$. For example, $\hat{a}_{\min}(2^s) = s$ and $\hat{a}_{\min}(2^s - 1) = 2(s - 1)$.
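The recursion (11) is straightforward to evaluate. The following is a minimal sketch; the function name `depth_index` is an assumption, and the closed form used for comparison, $\lfloor \log_2 N \rfloor$ plus the Hamming weight of $N$ minus one, is the standard step count of the binary (Chandah-sutra) method.

```python
def depth_index(n: int) -> int:
    """Number of steps of recursion (11) needed to reach 1 from n."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    steps = 0
    while n > 1:
        m = n - 2
        if n % 2 == 1:
            n -= 1                              # odd: subtract one
        elif m >= 2 and m & (m - 1) == 0:
            n -= 2                              # n = 2^s + 2 with s >= 1
        else:
            n //= 2                             # otherwise: halve
        steps += 1
    return steps


def binary_method_steps(n: int) -> int:
    """floor(log2 n) + Hamming weight of n - 1 (binary method step count)."""
    return (n.bit_length() - 1) + bin(n).count("1") - 1


print([depth_index(n) for n in range(1, 17)])
```

The $N_s - 2$ branch changes the pathway taken (for example, $10 \to 8$ instead of $10 \to 5$) but not the number of steps.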
Applying AT to evolutionary biology, we can assume that initially a new string of length $N$ is formed in an assembly space from a basic symbol and a string of length $N - 1$. Subsequently, this string assembly space evolves to reduce the cardinality $|S(\Omega_{C_s})|$ of the set of assembly step vertices until it equals the ASI of this string, that is, until $|S(\Omega_{C_s})| = a^{(N,b)}(C_s)$. Hence, assembly spaces evolve by reconfiguring the network of edges to decrease the ASD of newly assembled strings, possibly finding shorter pathways for these strings, but only if such a decrease does not increase the ASI (length $N = 15$, shown in Figure 1d, is the shortest length for which $5 = d_{a_{\min}}(15) > \lceil \log_2(15) \rceil = 4$).
The concepts of assembly space, string assembly space, ASI, and ASD, as well as the evolution of assembly spaces, are illustrated in Figure 2. Although the alternative pathways shown in Figure 2a demonstrate that the mixed 2-gram $[01]$ is not required to construct all 3-grams in two steps, this 2-gram is necessary to achieve the ASI of longer strings (cf. the constrained AT* version shown in list (38) and discussed in Section 6).

3. Minimum Complexity Strings of AT

In this section, we consider factors relevant to the minimum string complexity in the context of AT. Theorem 3.1 was already stated in our previous study [10]; we restate it here for clarity.
Theorem 3.1.  
For all $b$, the minimum ASI (minASI) $a^{(N)}(C_{\min})$ as a function of $N$ corresponds to the length of the shortest addition chain for $N$ (OEIS A003313).
Any shortest addition chain for n starts with one, not zero, as zero is the neutral element of addition. For the same reason, two is considered the smallest prime, as one is the neutral element of multiplication. Hence, the fundamental theorem of arithmetic can be thought of as the shortest multiplication chain for N. It is conjectured that the largest prime factor must appear in some shortest addition chain for N [35].
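For small $N$, the minASI of Theorem 3.1 can be reproduced by searching for a shortest addition chain directly. The sketch below uses a plain iterative-deepening search; the function name `shortest_addition_chain_length` and the pruning rule are assumptions chosen for illustration, and the search is practical only for small $N$.

```python
def shortest_addition_chain_length(n: int) -> int:
    """Length of a shortest addition chain for n (cf. OEIS A003313)."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    if n == 1:
        return 0

    def dfs(chain, limit):
        last = chain[-1]
        if last == n:
            return True
        steps_left = limit - (len(chain) - 1)
        # Doubling is the fastest growth, so prune if n is out of reach.
        if steps_left == 0 or (last << steps_left) < n:
            return False
        # A shortest chain can be taken strictly increasing; try large sums first.
        sums = sorted({a + b for a in chain for b in chain
                       if last < a + b <= n}, reverse=True)
        for s in sums:
            chain.append(s)
            if dfs(chain, limit):
                return True
            chain.pop()
        return False

    limit = (n - 1).bit_length()  # ceil(log2(n)), a lower bound on the length
    while not dfs([1], limit):
        limit += 1
    return limit


print([shortest_addition_chain_length(n) for n in range(1, 17)])
```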
Theorem 3.2.  
The strings $C_{\min}^{(2^s,b)}$ can contain at most two distinct symbols if $b > 1$. Other minASI strings of length $N \neq 2^s$ can contain at most three distinct symbols if $b > 2$.
Proof. 
Minimum ASI strings of length $N = 2^s$ are formed by joining the newly assembled string to itself, where a clear or mixed 2-gram is assembled in the first step. Minimum ASI strings of other lengths admit a 2-gram and a 3-gram containing this 2-gram and an additional basic symbol.
To formally prove the first part, we can also use mathematical induction on the assembly step $s$. If $s = 1$, then the minASI strings $C_{\min}^{(2,b)}$ are 2-grams of the form $[c_1 c_2]$, where $c_1, c_2 \in B(\Omega)$. If $c_1 = c_2$, the string contains one distinct symbol, and if $c_1 \neq c_2$, the string contains two distinct symbols. In both cases, the string has the form (A13) and the number of distinct symbols does not exceed two. Now assume that for some $k \in \mathbb{N}$, all minASI strings $C_{\min}^{(2^k,b)}$ contain at most two distinct symbols. We must show that $C_{\min}^{(2^{k+1},b)}$ also contains at most two distinct symbols. We construct $C_{\min}^{(2^{k+1},b)}$ by joining two identical minASI strings $C_{\min}^{(2^k,b)}$
$$C_{\min}^{(2^k,b)} C_{\min}^{(2^k,b)} = C_{\min}^{(2^{k+1},b)}, \tag{12}$$
with each other. By the inductive hypothesis, each $C_{\min}^{(2^k,b)}$ contains at most two distinct symbols. Therefore, their concatenation also contains at most two distinct symbols. By induction, for all $s \in \mathbb{N}$, the minASI string $C_{\min}^{(2^s,b)}$ contains at most two distinct symbols.
We will now show that other minASI strings of length $N \neq 2^s$ can contain at most three distinct symbols if $b > 2$. We provide the construction of minASI strings with three symbols. In the first step $s = 1$, we assemble a 2-gram $[c_1 c_2]$, where $c_1, c_2 \in B(\Omega)$ and $c_1 \neq c_2$. Next, we join the existing 2-gram $[c_1 c_2]$ with a new symbol $c_3 \in B(\Omega)$, where $c_3 \notin \{c_1, c_2\}$. This forms a 3-gram $[c_1 c_2 c_3]$, introducing a third distinct symbol and further increasing the ASI by 1. We continue assembling by joining the longest string formed so far with itself or with previously formed strings, maintaining the minimal ASI increase.
Assume a contrario that there exists a minASI string $C_{\min}^{(N,b)}$ of length $N \neq 2^s$ that contains four or more distinct symbols. However, incorporating such a fourth symbol is equivalent to assembling a maxASI 4-gram, which contradicts the minimality of $C_{\min}^{(N,b)}$ (only a 2-gram must be assembled from basic symbols and a 3-gram must be assembled from a basic symbol and a 2-gram). Thus, Theorem 3.2 is proven. □
By Theorem 2.1, the maximum ASI decrease is provided by minASI $2^s$-grams and amounts to $k(2^s - 1) - s$. On the other hand, 1-symbol $n$-grams are guaranteed to have the minimum ASI for all $n$. Therefore, the maxASI string will tend to contain only one such long $n$-gram for each $b$. We have found, for example, that the sum of the lengths of an $n_0$-gram of zeros and an $n_1$-gram of ones should preferably be greater than or equal to $N/2$ of a maxASI string.
The assembly spaces of minASI strings of length $N \neq 2^s$ are not unique [36]. For example, the string $C_{\min}^{(5,b)} = [01010]$ can be assembled in three steps in four assembly spaces with $S(\Omega) = \{[01], [010]\}$, $S(\Omega) = \{[01], [0101]\}$, $S(\Omega) = \{[10], [010]\}$, or $S(\Omega) = \{[10], [1010]\}$.
Strings with non-minimum ASI can contain all symbols. For example, the string [37]
$$C_k = [01234012340123401234], \tag{13}$$
has the ASI $a^{(20,5)}(C_k) = 6 = a_{\min}(20) + 1$ and contains all five basic symbols $B(\Omega) = \{0, 1, 2, 3, 4\}$.
The minASD as a function of the length of a string $d_{\min}(N)$ (5), the ASD of a minASI string $d_{a_{\min}}(N)$ (which we call here the minASI ASD), the minASI as a function of the length of a string $a_{\min}(N)$ (OEIS A003313), and the DPI $\hat{a}_{\min}(N)$ (OEIS A014701) define four distinct sets illustrated in Figure 4, wherein $d_{\min}(N) \le d_{a_{\min}}(N) \le a_{\min}(N) \le \hat{a}_{\min}(N)$. We observed certain salient regularities among them.
Theorem 3.3.  
If a minASI string has length $N = 2^s$, $s \in \mathbb{N}_0$, then the minimum ASD, minASI ASD, minASI, and DPI are equal to $s$.
Theorem 3.3 can be generalized as follows.
Theorem 3.4.  
The minASD, minASI ASD, minASI, and DPI of a minASI string are equal to $s \in \mathbb{N}$ iff $\hat{N}_1 := 2^{s-1} + 2^l$, $l = 0, 1, \dots, s-1$, $s \ge 1$, or, in other words,
$$\hat{N}_1 := 2^{s-1} + 2^l,\ l = 0, 1, \dots, s-1,\ s \ge 1 \iff d_{\min}(\hat{N}_1) = d_{a_{\min}}(\hat{N}_1) = a_{\min}(\hat{N}_1) = \hat{a}_{\min}(\hat{N}_1) = s. \tag{14}$$
Figure 3. Lengths of all strings having the property $d_{\min}(\hat{N}_1) = a_{\min}(\hat{N}_1) = d_{a_{\min}}(\hat{N}_1) = \hat{a}_{\min}(\hat{N}_1) = s$ (a). Lengths $\tilde{N}_3$, $\tilde{N}_5$, $\tilde{N}_9$ of certain strings having the property $a_{\min}(\tilde{N}_*) = d_{\min}(\tilde{N}_*) + 1$ (b-d). Lengths $\tilde{N}_{15}$, $\tilde{N}_{27}$ of certain strings having the property $a_{\min}(\tilde{N}_*) = d_{\min}(\tilde{N}_*) + 2$ (e,f).
The assembly spaces of other minASI strings can contain independent assembly steps. The first such case occurs for $N = 7$, where, for example, the $S(\Omega)$
$$[01]_{d_1=1} \to \{[001], [0101]\}_{d_2=d_3=2} \to [0010101]_{d_4=3} \tag{15}$$
results in a string having $a_{\min}(7) = 4$ and $d_{a_{\min}}(7) = \lceil \log_2(7) \rceil = 3$, since both $[001]$ and $[0101]$ were assembled from the 2-gram $[01]$ in two independent assembly steps at the same depth $d_2 = d_3 = 2$, which is consistent with Theorem 2.2.
Lemma 3.1.  
The minASI strings [10] (strings (15)) of lengths
$$\tilde{N}_3 := 2^{d-1} + 3 \cdot 2^l,\ l = 0, 1, \dots, d-3,\ d \ge 3 \implies a_{\min}(\tilde{N}_3) = d + 1 = \lceil \log_2(\tilde{N}_3) \rceil + 1, \tag{16}$$
have only one independent assembly step in their assembly spaces, and excluding this step, they are assembled by joining the longest string assembled so far with itself. Therefore, their ASI is one greater than the minASD (5).
Lemma 3.2.  
The minASI strings of lengths
$$\tilde{N}_5 := 2^{d-1} + 5 \cdot 2^l,\ l = 0, 1, \dots, d-4,\ d \ge 4 \implies a_{\min}(\tilde{N}_5) = d + 1 = \lceil \log_2(\tilde{N}_5) \rceil + 1, \tag{17}$$
have only one independent assembly step in their assembly spaces, and excluding this step, they are assembled by joining the longest string assembled so far with itself. Therefore, their ASI is one greater than the minASD (5).
Lemma 3.3.  
The minASI strings of lengths
$$\tilde{N}_9 := 2^{d-1} + 9 \cdot 2^l,\ l = 0, 1, \dots, d-5,\ d \ge 5 \implies a_{\min}(\tilde{N}_9) = d + 1 = \lceil \log_2(\tilde{N}_9) \rceil + 1, \tag{18}$$
have only one independent assembly step in their assembly spaces, and excluding this step, they are assembled by joining the longest string assembled so far with itself. Therefore, their ASI is one greater than the minASD (5).
Lemmas 3.1–3.3 allow for the following generalization.
Theorem 3.5.  
The minASI strings of lengths
$$\tilde{N}_{2n+1} := 2^{d-1} + \left(2^{k-4} + 1\right) 2^l,\ k \ge 5,\ d \ge k - 2,\ l = 0, 1, \dots, d - (k - 2) \implies a_{\min}(\tilde{N}_{2n+1}) = d + 1 = \lceil \log_2(\tilde{N}_{2n+1}) \rceil + 1, \tag{19}$$
have only one independent assembly step in their assembly spaces, and excluding this step, they are assembled by joining the longest string assembled so far with itself. Therefore, their ASI is one greater than the minASD (5).
Proof. 
The lengths of the strings (19) are listed in rows in Table A1, starting after the length of the substring assembled in an independent assembly step, marked green. Hence, the first row contains the lengths of the strings of Lemma 3.1 shown on the diagonal of Figure 3b, and so on. □
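The families of lengths in relation (19) can be enumerated mechanically. The sketch below is illustrative only: the function name `theorem_3_5_lengths` is an assumption, and the code lists the lengths for given $d$ and $k$ without computing any ASI. The cases $k = 5, 6, 7$ reproduce the lengths of Lemmas 3.1, 3.2, and 3.3, respectively.

```python
def theorem_3_5_lengths(d: int, k: int) -> list[int]:
    """Lengths 2^(d-1) + (2^(k-4) + 1) * 2^l for l = 0, ..., d - (k - 2)."""
    if k < 5 or d < k - 2:
        raise ValueError("relation (19) requires k >= 5 and d >= k - 2")
    odd_factor = 2 ** (k - 4) + 1          # 3, 5, 9, 17, ...
    return [2 ** (d - 1) + odd_factor * 2 ** l
            for l in range(d - (k - 2) + 1)]


for k in (5, 6, 7):
    for d in range(k - 2, 7):
        print(f"k={k}, d={d}: {theorem_3_5_lengths(d, k)}")
```

Every listed length $N$ satisfies $\lceil \log_2(N) \rceil = d$, so the claim $a_{\min} = d + 1$ places these strings exactly one step above the minASD (5).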
Theorem 3.6.  
The minASI strings [10] of lengths
$$\tilde{N}_7 := 2^{d-1} + 7 \cdot 2^{d-4} \in \{15, 30, 60, \dots\},\ d \ge 4 \implies a_{\min}(\tilde{N}_7) = d_{a_{\min}}(\tilde{N}_7) = d_{\min}(\tilde{N}_7) + 1 = \hat{a}_{\min}(\tilde{N}_7) - 1 = \lceil \log_2(\tilde{N}_7) \rceil + 1, \tag{20}$$
are assembled by joining the longest string assembled so far with itself. Their ASI and ASD are the same, one greater than the minASD (5) and one smaller than the DPI.
Theorem 3.6 seems to allow for the following generalization, which we have validated numerically based on the sequence OEIS A003313 for $N \le 10^5$. For $d$, $l$, and $\tilde{N}_{2n+1}$ defined by the relation (19), the following holds:
$$\tilde{N}_{2n+1,a} := \tilde{N}_{2n+1} + 2^d = 3 \cdot 2^{d-1} + \left(2^{k-4} + 1\right) 2^l,\ k = 5 \implies a_{\min}(\tilde{N}_{2n+1,a}) = d + 2 = \lceil \log_2(\tilde{N}_{2n+1,a}) \rceil + 1, \tag{21a}$$
$$\tilde{N}_{2n+1,b} := \tilde{N}_{2n+1} + 2^{d+1} = 5 \cdot 2^{d-1} + \left(2^{k-4} + 1\right) 2^l,\ k \in \{5, 6\} \implies a_{\min}(\tilde{N}_{2n+1,b}) = d + 3 = \lceil \log_2(\tilde{N}_{2n+1,b}) \rceil + 1. \tag{21b}$$
The lengths of the strings (21a) and (21b) are listed in rows in Table A1. Furthermore, we have numerically validated the following conjecture. The minASI strings of lengths
$$\begin{aligned}
\tilde{N}_{15} &:= 2^{d-1} + 15 \cdot 2^l, & l &= 0, 1, \dots, d-5, & d &\ge 5, \\
\tilde{N}_{27} &:= 2^{d-1} + 27 \cdot 2^l, & l &= 0, 1, \dots, d-6, & d &\ge 6, \\
\tilde{N}_{50.9} &:= 50 \cdot 2^{d-6} + 9 \cdot 2^l, & l &= 0, 1, \dots, d-6, & d &\ge 6,
\end{aligned}$$
have the property of
$$a_{\min}(\tilde{N}_*) = d + 2 = \lceil \log_2(\tilde{N}_*) \rceil + 2. \tag{22}$$
Figure 4. The minimum assembly depth ( log 2 ( N ) , blue), the assembly depth of the minimum assembly index string (magenta), the minimum assembly index (OEIS A003313, red; log 2 ( N ) , red, dash-dot), depth index (OEIS A014701, green), and log 2 ( N ) + N 1 2 , where N 1 is the Hamming weight of the binary representation of N (black) for 1 < N 65 .
Figure 5. The Shannon entropy of the most balanced bitstrings having the minimum assembly index for $1 < N \le 65$.
Strings of lengths (14), (16), and (20), revealed in [10] based on the degree of causation, showed that there are certain regularities among the minASI strings. Here, we extended these results to strings of lengths (19), (21), and (22). In general, Theorems 3.4–3.6 (in particular Theorem 3.6) and the two conjectures stated above show a peculiar interdependence among the minASD (5), minASI ASD, minASI, and DPI, as shown in Figure 4. In particular, they show that
  • the Ω of minASI strings having ASI equal to DPI cannot contain strings assembled in independent assembly steps,
  • the Ω s of other minASI strings can contain at least two such strings, and therefore
  • the assembly space of a maxASI string will tend to maximize the number of strings assembled in independent assembly steps in the $\Omega$, taking into account the saturation of the $\Omega$, as it cannot contain more than $b^n$ distinct $n$-grams, and hence to minimize the possible ASD.
We note that, in general, the difference between the DPI and minASI is larger than 1.
Lemma 3.4.  
The minASI bitstrings assembled along the pathway given by the DPI (Definition 2.7) and beginning with $C_{\min}(2) = [c_1 c_2]$, $c_1 \neq c_2$, are balanced bitstrings if $N$ is even or nearly balanced bitstrings ($N_0 = N_1 + 1$) if $N$ is odd.
In other words, Definition 2.7 removes the imbalance propagation. For example, an imbalanced pathway $\{2, 4, 5, 10, 20\}$ ($N_1 = 8$) becomes a balanced pathway $\{2, 4, 8, 10, 20\}$ ($N_1 = 10 = N/2$), and $2^s + 2 \to 2^{s-1} + 1 \to 2^{s-1}$ becomes $2^s + 2 \to 2^s \to 2^{s-1}$. We have also examined the Shannon entropy
$$H\left(C_{\min}(N)\right) = -p_0 \log_2(p_0) - p_1 \log_2(p_1), \tag{23}$$
of the most balanced minASI bitstrings, where $p_0 = N_0/N$ and $p_1 = N_1/N$ are the fractions of the respective symbols $\{0, 1\}$ within the string ($N_1$ is the Hamming weight). Although by Theorem 3.1 the minASI as a function of the length of a string does not depend on $b$, we have chosen the most balanced bitstrings, as they maximize the Shannon entropy (23). This choice is also supported by physics [38,39,40]. For the same reason, we preferred the pathway $\{2, 3, 5, 10, 15\}$ (cf. Figure 1d) over $\{2, 3, 6, 12, 15\}$, for example, as the string assembled using the former pathway is more balanced ($N_1 = 6$) than the one assembled using the latter ($N_1 = 5$). Similarly, we preferred the pathway that provided a more balanced string over the one that provided independent assembly steps. $N = 14$ is the first exception: $C_{\min}(14)$ assembled in five steps along the pathway $\{2, (3, 4), 7, 14\}$, with the independent assembly steps 3 and 4, has the Hamming weight $N_1 = 6$, whereas $C_{\min}(14)$ assembled in five steps along the pathway $\{2, 4, 8, 12, 14\}$, with no independent assembly steps, has the Hamming weight $N_1 = N/2 = 7$. The resulting pathways of the minASI strings maximizing the number of independent assembly steps or the binary Shannon entropy (23) are listed in Table A2 for $N \le 65$. As shown in Figure 5, the Shannon entropy (23) of the most balanced minASI bitstrings rapidly converges to one, with exceptions for lengths $N \in \{15, 23, 27, 39, 43, 45, 51, 59, 63, \dots\}$ substantially corresponding to lengths at which the DPI is larger than the minASI (cf. Figure 4), which highlights the interdependence between the minASI and DPI.
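The effect of balance on the entropy (23) can be illustrated numerically for the Hamming weights quoted above. This is a minimal sketch; the function name `binary_entropy` is an assumption, and the usual convention $0 \cdot \log_2 0 = 0$ is applied.

```python
import math


def binary_entropy(n_ones: int, length: int) -> float:
    """Shannon entropy (23) of a bitstring with Hamming weight n_ones."""
    if not 0 <= n_ones <= length or length == 0:
        raise ValueError("need 0 <= n_ones <= length and length > 0")
    p1 = n_ones / length
    p0 = 1.0 - p1
    # Terms with p = 0 contribute nothing (0 * log2(0) is taken as 0).
    return -sum(p * math.log2(p) for p in (p0, p1) if p > 0)


# N = 15: pathway {2, 3, 5, 10, 15} gives N1 = 6, {2, 3, 6, 12, 15} gives N1 = 5.
print(round(binary_entropy(6, 15), 4), round(binary_entropy(5, 15), 4))
# N = 14: Hamming weights 6 and 7 for the two pathways discussed above.
print(round(binary_entropy(6, 14), 4), round(binary_entropy(7, 14), 4))
```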
We demonstrated in this section that minimum complexity AT strings cannot be fully derived by any rule-based system. The same is true for the maxASI strings, as we shall show in a subsequent section.

4. Maximum Assembly Index Strings of AT

A seven-bit string is the longest string that can have the maxASI $a_{\max}^{(7,2)} = N - 1 = 6$. It contains two clear 3-grams
$$C_{(N-1)}^{(7,2)} = [0001110], \tag{24}$$
and its length cannot be increased without a repetition of a 2-gram, which keeps the ASI at the same level $a_{\max}^{(8,2)} = 8 - 2 = 6$. This led us to seek a general method to construct the longest possible string having the maxASI $a_{\max}^{(N,b)}\left(C_{(N-1)}\right) = N - 1$, as a function of the radix $b$. We denote the length of this string by $N_{(N-1)}$ and we call this string a $C_{(N-1)}$ string. We developed two methods of constructing a string $C_{(N-1)}$ of the form
$$C_{(N-1)} = [000\,111\,2 \ldots (b-2)\,(b-1)(b-1)(b-1)\,\ldots\,0]. \tag{25}$$
In both methods, we start with an initial balanced string of length $3b$ containing $b$ clear 3-grams ordered as
$$[000\,111\,2 \ldots (b-2)\,(b-1)(b-1)(b-1)]. \tag{26}$$
The 2-grams that can be inserted into the initial string (26) can be arranged in a $b \times b$ matrix
$$\begin{pmatrix}
\cancel{00} & 01 & 02 & \cdots & 0(b-1) \\
10 & \cancel{11} & 12 & \cdots & 1(b-1) \\
20 & 21 & \cancel{22} & \cdots & 2(b-1) \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
(b-2)0 & (b-2)1 & (b-2)2 & \cdots & (b-2)(b-1) \\
(b-1)0 & (b-1)1 & (b-1)2 & \cdots & \cancel{(b-1)(b-1)}
\end{pmatrix}, \tag{27}$$
where the crossed-out entries on the diagonal cannot be reused, as they would form repetitions in this string. Owing to the order of the 3-grams in the string (26), we can also cross out the entries on the first superdiagonal of the matrix (27). By construction, the starting string (26) has length $3b$ and contains only $b + (b-1) = 2b - 1$ 2-grams $[l\,l]$ and $[m\,(m+1)]$, where $0 \le l \le b-1$ and $0 \le m \le b-2$.
In the 1st step of the 1st method, we begin by forming a string containing the 2-grams on the first subdiagonal of the matrix (27), starting with 10,
$$[10\,21\,32 \ldots (b-2)(b-3)\,(b-1)(b-2)], \tag{28}$$
and we append it to the string (26). With this step, we also eliminate the 2-grams on the second superdiagonal, starting with the 2-gram 02, as well as the 2-gram $(b-1)1$. In the 2nd step, we form a string containing the 2-grams on the third superdiagonal, beginning with the 2-gram 03,
$$[03\,14 \ldots (b-5)(b-2)\,(b-4)(b-1)], \tag{29}$$
and append it to the string formed so far. With this step, we also remove the 2-gram $(b-2)0$ and the middle part of the second subdiagonal containing $\{31, 42, \dots, (b-2)(b-4)\}$. And so on. Finally, we append 0 if $b$ is even. This process is illustrated in Figure 6 and generates the following $C_{(N-1)}^{(N,b)}$ strings:
$$\begin{aligned}
C_{(N-1)}^{(3,1)} &= [000], \\
C_{(N-1)}^{(7,2)} &= [000111|0], \\
C_{(N-1)}^{(13,3)} &= [000111222|10|20], \\
C_{(N-1)}^{(21,4)} &= [000111222333|102132|03|0], \\
C_{(N-1)}^{(31,5)} &= [000111222333444|10213243|0314|20|40], \\
C_{(N-1)}^{(43,6)} &= [000111222333444555|1021324354|031425|0415|2053|0], \\
C_{(N-1)}^{(57,7)} &= [000111222333444555666|102132435465|03142536|041526|2064|0516|30], \\
C_{(N-1)}^{(73,8)} &= [000111222333444555666777|10213243546576|0314253647|04152637|2075|051627|306174|0], \\
C_{(N-1)}^{(91,9)} &= [\ldots|1021324354657687|031425364758|0415263748|2086|05162738|30617285|0718|40], \\
&\;\;\vdots
\end{aligned} \tag{30}$$
We discuss the 2nd method in Appendix A.15.
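The 2-gram structure of the listed strings can be checked mechanically. The sketch below is ours and does not compute the ASI; it only verifies a necessary structural property: every off-diagonal 2-gram of the matrix (27) occurs exactly once, every diagonal 2-gram occurs exactly twice (overlapping inside its clear 3-gram), and the length equals $b^2 + b + 1$. The strings are copied from the list (30) for $b \le 6$; the function name is hypothetical.

```python
from collections import Counter
from itertools import product

# Strings copied from the list (30); the bars only separate the construction steps.
PAPER_STRINGS = {
    1: "000",
    2: "000111|0",
    3: "000111222|10|20",
    4: "000111222333|102132|03|0",
    5: "000111222333444|10213243|0314|20|40",
    6: "000111222333444555|1021324354|031425|0415|2053|0",
}


def has_c_n_minus_1_structure(string: str, b: int) -> bool:
    """Check the 2-gram counts and the length expected of a C_(N-1) string (for b <= 10)."""
    counts = Counter(string[i:i + 2] for i in range(len(string) - 1))
    symbols = [str(d) for d in range(b)]
    for x, y in product(symbols, repeat=2):
        expected = 2 if x == y else 1  # diagonal 2-grams overlap inside a clear 3-gram
        if counts[x + y] != expected:
            return False
    return len(string) == b * b + b + 1


for b, raw in PAPER_STRINGS.items():
    s = raw.replace("|", "")
    print(b, len(s), has_c_n_minus_1_structure(s, b))
```

Counting overlapping windows is a deliberate simplification: a diagonal 2-gram appearing twice inside a clear 3-gram is not a disjoint repetition and therefore cannot be reused in an assembly step.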
Figure 6. 2-gram matrices for $1 \le b \le 16$ that illustrate the generation of $C_{(N-1)}$ strings according to the method presented in this section. Coloured 2-grams are appended to the initial string of clear 3-grams in the order indicated by arrows starting from the 1st column or row. Finally, 0 is appended at the end if $b$ is even.
The maximum length of a string that has the ASI of $N - 1$ (a $C_{(N-1)}$ string) is given by
$$N_{(N-1)} = 3b + (b-1)^2 = b N_c + 1 = b^2 + b + 1 \tag{31}$$
(OEIS A353887), where $N_c = b + 1$ is the number of occurrences of all but one symbol within the string, so this string has odd length and is nearly balanced. To be the longest, a $C_{(N-1)}$ string must contain all 2-grams from the matrix (27) and all clear 3-grams. A string that contains all $k$-grams is a linear, minimal de Bruijn string (e.g., $\{[00110], [01100], \dots\}$ for $b = k = 2$), which must start and end with the same symbol. Its length is given by $N_{\text{de Bruijn}} = b^k + k - 1$, and there are $(b!)^{b^{k-1}}$ such strings. We consider $k = 2$ and have to add $b$ symbols to turn the $b$ clear 2-grams of a de Bruijn string into $b$ clear 3-grams, which clearly does not introduce any disjoint occurrences [41]. This leads to the formula (31). For $b = 1$, the upper and the lower bound on the ASI are the same, $a_{\max}^{(N,1)} = a_{\min}^{(N)}$, and this is the only case where the maxASI is not a monotonically non-decreasing function of $N$. Otherwise, the maxASI must rise. If it were to become constant for $N > \hat{N}_{max}$, then at some even larger $N$ it would inevitably become lower than the minASI bound of Theorem 3.1, which also rises, and this would be a contradiction.
Subsequently, we considered other strings with the maxASI. A $C_{(N-1)}$ string contains all 2-grams. Hence, inserting any basic symbol $*$ into any position inevitably leads to a repetition of a 2-gram. Without loss of generality (w.l.o.g.), we prepend it to the $C_{(N-1)}$ string, obtaining a string $[{*}000111\ldots]$. Another symbol $\diamond$ can be introduced to this string without an additional 2-gram repetition, for example, as another prefix $[\diamond{*}000111\ldots]$, leading to the repetition of the 2-gram $\diamond{*}$ or ${*}0$, but not both of them (here we allow $\diamond = *$). Hence, both the length and the ASI of this string increase by one. Finally, 0 can be prepended to this string without an additional 2-gram repetition, provided that $\diamond = 1$ and $* = 0$. For $b = 2$, there are only four$^{1}$ bitstrings containing only two copies of a 2-gram and having the ASI $a_{\max}^{(10,2)} = 8$,
$$[01|0000|1110], \quad [1|0000|11101], \quad [101|0000|111], \quad [1|0000|10111], \tag{32}$$
where the leftmost one is the only unbalanced string within this set ($N_0 \ne N_1$). Each 10-bit string (32) has one clear 4-gram surrounded by the same symbol ($[1|0000|1]$) and one clear 3-gram ($[111]$), which can be concatenated ($[1|0000|111]$) in eight bits or separated ($[1|0000|1 \ldots 111]$) in nine bits. The first three strings (32) correspond to the former case, while the last one corresponds to the latter. Each of the strings (32) contains three pairs of mutually exclusive 2-grams, so that only one pair can be reused to decrease the maximum $N - 1$ ASI by one.
We developed two methods of constructing generalized $C_{(N-b)}$ strings (32) for $b > 2$. They have $b - 1$ clear 4-grams ($c_l c_l c_l c_l$), one clear 3-gram, and $(b-1) \times 3$ mutually exclusive 2-grams $\{c_k c_l, c_l c_l, c_l c_m\}$ overlapped in patterns $[c_k | c_l c_l c_l c_l | c_k]$ surrounding each clear 4-gram, so that only one pair from each pattern can be reused, in $3^{b-1}$ ways, to reduce the maxASI to $N - b$. The 1st method is based on the balanced $C_{(N-2)}$ string (32) $[101|0000|111]$ and generates the most balanced $C_{(N-b)}$ strings. It starts with a string of length $6b - 2$ containing all $b - 1$ clear 4-grams in the form
$$[(b-1)01|0000|12|1111|2 \ldots (b-1)|(b-2)(b-2)(b-2)(b-2)|(b-1)(b-1)(b-1)], \tag{33}$$
to which the missing $b^2 - 2b$ symbols are appended using the patterns shown in Figure 7, yielding the $C_{(N-b)}$ strings of the form
$$\begin{aligned}
C_{(N-1)}^{(3,1)} &= [000], \\
C_{(N-2)}^{(10,2)} &= [101|0000|111], \\
C_{(N-3)}^{(19,3)} &= [201|0000|12|1111|222|102], \\
C_{(N-4)}^{(30,4)} &= [301|0000|12|1111|23|2222|333|103|202|13], \\
C_{(N-5)}^{(43,5)} &= [401|0000|12|1111|23|2222|34|3333|444|104|30313202|1424], \\
C_{(N-6)}^{(58,6)} &= [501|0000|12|1111|23|2222|34|3333|45|4444|555|105|404142430313202|152535], \\
C_{(N-7)}^{(75,7)} &= [601|0000|12|1111|23|2222|34|3333|45|4444|56|5555|666|106|505152535404142430313202|16263646], \\
C_{(N-8)}^{(94,8)} &= [701|0000|12|1111|23|2222|34|3333|45|4444|56|5555|67|6666|777|107|60616263646505152535404142430313202|1727374757], \\
C_{(N-b)}^{(b^2+4b-2,\,b)} &= [(b-1)01|0000|12|1111|2 \ldots (b-1)|(b-2)(b-2)(b-2)(b-2)|(b-1)(b-1)(b-1)|10(b-1)|(b-2)0(b-2) \ldots 2|1(b-1) \ldots].
\end{aligned} \tag{34}$$
Figure 7. 2-gram matrices for $1 \le b \le 9$ that illustrate the generation of the most balanced $C_{(N-b)}$ strings (33). Coloured 2-grams are appended to the initial string of clear 4-grams in the order indicated by colours.
We discuss the 2nd method, generating non-balanced C ( N b ) strings, in Appendix A.16.
The mutually exclusive 2-grams of any $C_{(N-b)}$ string comprise $b - 1$ adjoining clear 2-grams on the diagonal of the 2-gram matrix (27) and $2(b-1)$ 2-grams on its sub- and super-diagonals, as shown in Figure 7, so that each clear 4-gram is surrounded by the same symbol and only one pair from each pattern can be reused. In particular, $C_{(N-b)}$ strings can have the forms (A25) or (34), with all clear 4-grams separated by 2-grams and forming 7-gram patterns of the form $[c_l c_k c_l c_l c_l c_l c_k]$. However, this is not necessary. For example, the string $C_{(N-3)} = [1|0000|102|1111|21|222|01]$ does not follow this rule. $C_{(N-b)}$ strings are not the maximum-length maxASI strings having the property that the maxASI is incremented by one with every three symbols. For example, the string (34) for $b = 3$ can be suffixed with 202, yielding a string with $a^{(22,3)} = 18$, which is still not the maximum-length maxASI string with this property for $b = 3$ (cf. Table A5).

5. Results Common to the Minimum and Maximum Complexity Strings

The minASI and maxASI bounds are illustrated in Figure 8 for b = { 2 , 3 , 4 } . For each b, maxASI initially equals N 1 up to the 1st threshold (31). Then it flattens and decreases by one every 3 symbols: once for b = 2 and 2 b times for b > 2 , as we conjecture based on the results for b = 3 .
Figure 8. The minimum assembly index (red; $\log_2(N)$, red, dash-dot) and the maximum assembly index (green) for $b \in \{2, 3, 4, 5\}$ and $N \le 81$.
Figure 9. The ASI of all bitstrings of lengths $N = 8$ (a), $N = 10$ (b), $N = 12$ (c), $N = 14$ (d), and $N = 16$ (e), as a function of their decimal representations, in squares with sides equal to $2^{N/2}$. Red dots indicate maxASI strings.
The ASI of a string is also related to its expected waiting time (EWT). The EWT is the expected number of trials until a specific string appears in a sequence of independent and identically distributed (i.i.d.) random variables taking values in a finite set of $b > 1$ equiprobable outcomes, that is, drawn from a uniform categorical distribution. The general formula for the EWT of a string, which can be determined using Markov chains, is [42]
$$E\left[C_l^{(N,b)}\right] = \sum_{k=1}^{N} b^k \cdot I_k, \tag{35}$$
where $I_k$ is an indicator function that equals 1 if the prefix of length $k$ of the string $C_l^{(N,b)}$ is also its suffix, and 0 otherwise. Thus, the EWT is bounded from below and from above as
$$b^N \le E\left[C_l^{(N,b)}\right] \le \frac{b}{b-1}\left(b^N - 1\right). \tag{36}$$
The upper EWT bound (36) is attained by clear strings, which are also minASI strings; in this case $I_k = 1,\ \forall k$, and the formula (35) simplifies to the geometric series $E\left[C_l^{(N,b)}\right] = \sum_{k=1}^{N} b^k$, which evaluates to the upper EWT bound (36). The minASI strings containing two or three symbols do not satisfy this relation and hence have a shorter EWT. For example, the minASI alternating strings have the EWT given by $E\left[C_l^{(N,b)}\right] = \sum_{m=1}^{N/2} b^{2m}$ if $N$ is even and $E\left[C_l^{(N,b)}\right] = \sum_{m=1}^{\lceil N/2 \rceil} b^{2m-1}$ if $N$ is odd, that is, for $b = 2$, $E\left[C_k^{(N,2)}\right] \in \{2, 4, 10, 20, 42, 84, \dots\}$, generated by the alternating updates $\{\times 2, \times 2 + 2, \times 2, \dots\}$ (OEIS A026644). The non-alternating minASI bitstrings can have the EWT even closer to the lower bound (36). For example, $E[010010100101001] = 33828$ (cf. Figure 1d) lies firmly at the lower end of the spectrum, being positioned just $3.24\%$ of the way through the total range. The lower EWT bound (36) is attained by the strings having no proper prefix that matches a suffix, in which case only the term $k = N$, where the whole string matches itself, contributes. The maxASI strings feature EWT values equal to or close to this bound.
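The EWT formula (35) and the bounds (36) translate directly into a few lines of code. The following sketch is ours (the function names are hypothetical) and assumes a uniform distribution over $b$ symbols, as in the text.

```python
from typing import Tuple


def expected_waiting_time(string: str, b: int) -> int:
    """EWT (35): sum of b**k over all k for which the length-k prefix equals the length-k suffix."""
    n = len(string)
    return sum(b ** k for k in range(1, n + 1) if string[:k] == string[-k:])


def ewt_bounds(n: int, b: int) -> Tuple[int, int]:
    """Lower and upper EWT bounds (36) for strings of length n over b > 1 symbols."""
    lower = b ** n
    upper = b * (b ** n - 1) // (b - 1)  # geometric series; the division is exact
    return lower, upper


s = "010010100101001"
ewt = expected_waiting_time(s, 2)
low, high = ewt_bounds(len(s), 2)
print(ewt, low, high, (ewt - low) / (high - low))

# EWTs of the alternating bitstrings of lengths 1 to 6
print([expected_waiting_time(("01" * 3)[:n], 2) for n in range(1, 7)])
```

The first printed line gives the position of the example string within the range spanned by the bounds, and the second reproduces the alternating-string sequence quoted above.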
When AT is applied to evolutionary biology, the inverse relationship between the EWT of a string and its ASI is anticipated, yet it also feels counterintuitive: common sense would suggest that strings of low complexity should appear earlier than more complex ones.
We found a correlation between the maxASI and the decimal representation of a string, providing an intuitive geometric interpretation of the complexity landscape. For example, Figure 9 shows the ASI of all bitstrings of even lengths $N \in \{8, 10, 12, 14, 16\}$ distributed in squares of sides equal to $2^{N/2}$, starting from the top-left corner ($\mathrm{val}_2[00\ldots] = 0$) and ending at the bottom-right corner ($\mathrm{val}_2[11\ldots] = 2^N - 1$). The decimal value of each pixel is given by $r\,2^{N/2} + c$, where $0 \le r, c \le 2^{N/2} - 1$ are the row and column indices. The low-ASI strings naturally occur on diagonals and oblique lines because of the internal symmetries they exhibit. For example, minASI strings naturally occur on the main diagonal, as $r\left(2^{N/2} + 1\right)$ for $r = c$ corresponds to a bitstring containing a prefix of length $N/2$ equal to its suffix. Thus, the EWT (35) of a main-diagonal bitstring $C_d$ is bounded from below by $E\left[C_d^{(N,2)}\right] \ge 2^N + 2^{N/2}$, and its ASI, taking into account the relation (1), is bounded by
$$a_{\min}^{(N,2)} \le a^{(N,2)}(C_d) \le N - 1 - 2\left(N/2 - 1\right) + a_{\max}^{(N/2,2)} = a_{\max}^{(N/2,2)} + 1. \tag{37}$$
For example, for $N = 8$, the ASIs of all main-diagonal strings $C_d$ satisfy $3 \le a^{(8,2)}(C_d) \le 4$, and they include the four minASI strings $\{[00\ldots], [01\ldots]\} \times 2$ corresponding to the equidistributed decimals 0, 85, 170, and 255. For $N = 10$, the depleted sixteen maxASI strings correspond to the decimal values $\{94, 122, 270, 286, 376, 450, 482, 488, 535, 541, 573, 647, 737, 753, 901, 929\}$. For $N = 12$, the ten minASI strings $\{[00\ldots], [001\ldots], [010\ldots], [01\ldots], [011\ldots]\} \times 2$ are equidistributed on the main square diagonal every 585 and every 1365. Finally, for $N = 16$, the small-ASI strings are aligned on diagonals and skewed lines, with the four minASI strings again equidistributed on the main square diagonal, while the 506 maxASI strings tend to attain row-like and column-like arrangements.
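The mapping between decimal values and pixels can be made concrete with a short sketch. The helper names below are ours; the code only assumes the row-major layout $r\,2^{N/2} + c$ described above.

```python
from typing import Tuple


def grid_position(value: int, n: int) -> Tuple[int, int]:
    """Row and column of the decimal value of an n-bit string in a 2**(n/2)-sided square."""
    if n % 2:
        raise ValueError("n must be even")
    side = 2 ** (n // 2)
    if not 0 <= value < side * side:
        raise ValueError("value does not fit into n bits")
    return divmod(value, side)  # value = row * side + column


def is_on_main_diagonal(value: int, n: int) -> bool:
    """True if the first n/2 bits of the string equal its last n/2 bits."""
    row, col = grid_position(value, n)
    return row == col


for value in (0, 85, 170, 255):
    print(format(value, "08b"), grid_position(value, 8), is_on_main_diagonal(value, 8))
```

A value lies on the main diagonal exactly when its row and column indices, read as $N/2$-bit strings, are identical, which is the prefix-equals-suffix property used to derive the bound (37).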

6. Supremacy of the ASI Compression over Polynomial-Time Compression Algorithms

Calculating the ASI of a string can be contrasted with compression algorithms that run in polynomial time, such as basic dictionary (LZ-type) schemes [43,44,45] or grammar-based algorithms, such as Re-Pair. Indeed, Turing’s universality and algorithmic complexity show that any computable transformation, including the ASI, can be simulated symbolically with no loss of information [43]. This naturally raised doubts regarding the trivial computability of the ASI and its potential equivalence to popular lossless compression algorithms [43].
Yet these doubts proved moot. As the length of a string increases, computing the ASI becomes a hyper-exponentially hard, exhaustive process over all possible combinations and repeated substrings. This is due to the enormous combinatorial space of possible construction pathways and the need to evaluate globally which fragments are most efficient to create early rather than late. Such a pathway minimization cannot be resolved entirely through local heuristics: there is no local strategy for finding optimal substrings in AT. This ASI problem is widely known in the AT literature (cf. [13,46,47,48]). It was conjectured [10] that the ASI problem is NP-complete. A certificate-dependent (assembly-step) proof of this conjecture within predefined assembly spaces [12,47] was found [48] by providing a Karp reduction from the vertex cover problem. Complete, self-contained proofs of the NP-completeness of the ASI problem, for both the decision and optimization variants and for all assembly spaces, were given [14] by establishing a correspondence between string assembly spaces and straight-line grammars. Since a naïve algorithm (an exhaustive search of the entire assembly tree) always exists, the ASI is computable for all finite objects [7]. Thus, one can say that determining the ASI is computable and algorithmically trivial, but it is also computationally non-trivial, because the assembly tree grows exponentially with $N$. One can see this non-triviality by noting that integrating machine-learning–driven methods does not improve ASI computation, even for moderately sized molecules [49]. Interestingly, human intuition can often identify structural patterns or regularities, enabling a rapid, non-algorithmic, qualitative estimation of the ASI that remains inaccessible to sequential algorithms.
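To make the naïve algorithm tangible, the sketch below performs an exhaustive, iteratively deepened search for the ASI of a short string. It is our illustration, not the authors' efficient implementation: it assumes that an assembly step concatenates two items of the pool (basic symbols or previously assembled substrings), restricts intermediate items to substrings of the target, and uses only a simple length-doubling bound for pruning. It is practical only for strings of roughly ten symbols.

```python
from typing import FrozenSet, Set


def assembly_index(target: str) -> int:
    """Minimum number of joins needed to assemble target (exhaustive; short strings only)."""
    n = len(target)
    if n <= 1:
        return 0
    # Only substrings of the target can be useful intermediate items.
    substrings = {target[i:j] for i in range(n) for j in range(i + 2, n + 1)}
    basics = frozenset(target)  # the basic symbols are always available

    def search(pool: FrozenSet[str], steps_left: int, seen: Set[FrozenSet[str]]) -> bool:
        longest = max((len(item) for item in pool), default=1)
        if longest * 2 ** steps_left < n:  # lengths can at most double per step
            return False
        if pool in seen:  # this set of items was already explored with the same budget
            return False
        seen.add(pool)
        items = pool | basics
        for x in items:
            for y in items:
                new = x + y
                if new == target:
                    return True
                if steps_left > 1 and new in substrings and new not in pool:
                    if search(pool | {new}, steps_left - 1, seen):
                        return True
        return False

    for limit in range(1, n):
        if search(frozenset(), limit, set()):
            return limit
    return n - 1  # appending one symbol at a time always suffices


for s in ("00000000", "0001110", "0000100010"):
    print(s, assembly_index(s))
```

Even with memoization of the visited pools, the number of states grows rapidly with the string length, which illustrates why bounding and pruning are needed for longer strings.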
The computational non-triviality of the ASI raised a second question: whether a worse and slower algorithm [43], such as the NP-complete ASI exhaustive search, merits consideration if it provides the same explanatory power and predictive accuracy as known algorithms. Indeed, if additional constraints are imposed on AT, such as restricting assembly steps to substrings of predefined length or allowing/excluding certain assembly steps [12,48], then certain variants of calculating the ASI may yield the same or even worse results than known compression algorithms, given the large number of the latter.
Figure 10. The ASI (a) and the number of Re-Pair steps (b) of all bitstrings of length $N = 10$, as a function of their decimal representations ($\{0, 1, \dots, 1023\}$), in squares with sides equal to 32 pixels. The correlation between the ASI and Re-Pair is visibly imperfect: the number of Re-Pair steps is higher than the ASI for 100 strings (e.g., 7 = Re-Pair(34) > ASI(34) = 5 for $[0000100010]$, 8 = Re-Pair(95) > ASI(95) = 7 for $[0001011111]$).
As an example, let us compare AT, a constrained AT* version that prohibits the mixed 2-gram $[10]$ in the assembly pool (cf. Figure 2b), and the grammar-based Re-Pair method. The Re-Pair method involves counting all 2-grams in a string, selecting the most frequent 2-gram that occurs at least twice, replacing all its occurrences with a new symbol, and repeating these steps until no 2-gram occurs more than once. This is illustrated in the list (38) for an exemplary maxASI bitstring $[1111110000001001101010110]$ of length $N = 25$. As shown, the ASI of this string is fifteen, while the ASI* obtained by the constrained AT* version is sixteen, which equals the number of steps required to compress this string into the Re-Pair compressed version $[DAEB1BACC0A0]$.
$$\begin{array}{rlll}
\text{step} & \text{ASI} & \text{ASI}^{*}\ (\text{no 2-gram } [10]) & \text{Re-Pair} \\
1. & [00] & [00] & [11] \times 7 \to A,\ [AAA000000100A01010A0] \\
2. & [01] & [11] & [00] \times 6 \to B,\ [AAABBB1BA01010A0] \\
3. & [10] & [01] & [01] \times 2 \to C,\ [AAABBB1BACC0A0] \\
4. & [11] & [000] & [AA] \times 2 \to D,\ [DABBB1BACC0A0] \\
5. & [0110] & [111] & [BB] \times 2 \to E,\ [DAEB1BACC0A0] \\
6. & [1111] & [111111] & [DA] \\
7. & [111111] & [111111000] & [DAE] \\
8. & [11111100] & [111111000000] & [DAEB] \\
9. & [1111110000] & [1111110000001] & [DAEB1] \\
10. & [111111000000] & [111111000000100] & [DAEB1B] \\
11. & [11111100000010] & [11111100000010011] & [DAEB1BA] \\
12. & [111111000000100110] & [1111110000001001101] & [DAEB1BAC] \\
13. & [11111100000010011010] & [111111000000100110101] & [DAEB1BACC] \\
14. & [111111000000100110101] & [11111100000010011010101] & [DAEB1BACC0] \\
15. & [1111110000001001101010110] & [111111000000100110101011] & [DAEB1BACC0A] \\
16. & & [1111110000001001101010110] & [DAEB1BACC0A0]
\end{array} \tag{38}$$
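The Re-Pair step count used in this comparison can be sketched as follows. This is our reading of the list (38), not the authors' code: the count is taken as the number of replacement rules plus the number of joins needed to concatenate the compressed sequence, overlapping occurrences are counted when ranking 2-grams, and ties are broken in favour of the 2-gram that appears first. The tie-breaking rule is an assumption, and a different rule can change the count for some strings.

```python
from collections import Counter


def repair_step_count(string: str) -> int:
    """Number of Re-Pair rules plus joins of the compressed sequence (tie-breaking assumed)."""
    seq = list(string)
    rules = 0
    while len(seq) > 1:
        # Count overlapping adjacent pairs; Counter keeps the order of first appearance.
        counts = Counter(zip(seq, seq[1:]))
        pair, freq = max(counts.items(), key=lambda item: item[1])
        if freq < 2:
            break
        new_symbol = ("R", rules)  # a fresh non-terminal, distinct from the basic symbols
        out, i = [], 0
        while i < len(seq):  # replace non-overlapping occurrences from left to right
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
                out.append(new_symbol)
                i += 2
            else:
                out.append(seq[i])
                i += 1
        seq = out
        rules += 1
    return rules + len(seq) - 1


for s in ("0000100010", "0001011111"):
    print(s, repair_step_count(s))
```

For the two strings highlighted in the caption of Figure 10, no ties occur, so the counts do not depend on the assumed tie-breaking rule.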
However, an unconstrained AT provides different results. The distributions of the ASI and of the Re-Pair steps for all bitstrings of length $N = 10$, illustrated in Figure 10, show that ASI compression is superior to that provided by the Re-Pair algorithm. We encourage the reader to conduct an independent comparison of the ASI (e.g., for the maxASI strings provided in Tables A3–A6) with the results produced by well-known compression algorithms.

7. Discussion

The present study provides a general mathematical framework for the string version of AT and demonstrates how this formalization enables efficient computation of assembly indices across diverse sequences. Clarifying the theory’s conceptual structure provides suggestions for its computational implementation. This allowed us to formalize bounding and pruning processes (an approach also proposed in [13,46]) and to develop software capable of effectively computing the ASI of a string, significantly reducing the number of potential configurations to explore. This, in turn, resulted in substantial efficiency gains, allowing us to find the exemplary long maxASI strings (cf. Tables A3–A6) and determine the maxASI bounds shown in Figure 8, overcoming important limitations exhibited by all the other known implementations of the algorithms to compute the ASI of big objects [50].
Fortunately, the non-triviality of computing the ASI does not close the field but rather delineates a structured landscape for algorithmic exploration. The framework developed here is equally applicable to symbolic representations of various real world patterns within the context of AT, such as detecting biosignatures [6], describing the evolution of music [51], and, as we conjecture, natural language processing (AI-generated text detection, authorship style analysis, plagiarism detection, time series analysis, etc.), financial data changes (distinction between random and predictable behaviour), sensor data anomaly analysis, medical signal analysis, etc.
The implications of our findings extend beyond the formal structure of AT itself. Traditional physics, while offering predictive power from past initial conditions to future states, lacks a functional perspective necessary to differentiate meaningful novelty from random fluctuations. AT may offer a unifying language for describing how information becomes structure. It can do so by showing how biological sequences are governed by foundational mathematical principles, for instance, when seeking the rules underlying the assembly of macromolecules such as proteins and DNA. Evolutionary biology explains the survival and prevalence of certain traits, but it does not address the mechanisms for generating novel phenotypic variants. We have shown, for example, that a DNA strand of length $N$ containing four nucleobases cannot represent a minASI string without violating Chargaff’s rules and Theorem 3.2. Theorem 3.2 establishes that a minASI string can contain at most three distinct symbols (if $N \ne 2^s$, as otherwise it can contain at most two), while the first Chargaff’s rule (for double-stranded DNA) states that the four nucleobases A, T and G, C should be pairwise balanced. On the other hand, biological systems tend to deviate from the equilibrium of maximum entropy [10,52] corresponding to balanced strings.
For $b = 1$, the information entropy (23) vanishes, as the bit ($b = 2$) is the smallest amount and the quantum of information. Yet AT explains the assembly of such minASI strings, extending beyond the minimal two-valued system necessary to convey any information. Our results clarify that AT is a formal system for describing the generative and causal structure of objects, positioning it as a theoretical extension of information theory, emphasizing not the description of information but the effort required to construct it.

Author Contributions

W.B.: first concept of a general method for constructing the $C_{(N-1)}$ string and its implementation (the 1st method); the concept of the 2-gram matrix (27); proposition leading to Theorem 2.1; outline of Section 1 and Section 7; numerous clarity corrections and improvements; P.M.: outline of the 2nd method (A.15); observation of the relation between Theorems 2.3 and 2.4; crucial observations leading to the proofs of Lemmas 3.2 and 3.4; novel Strings (19); the concept of Table A1; Conjecture 3.1; numerous clarity corrections and improvements; A.T.: formal proof of Theorem 3.2; proof of Theorem 3.3; conceptualization of the proof of Theorem 2.4 and equation (4); numerous clarity corrections and improvements; S.T.: development of efficient software for computing the ASI of a string, generating the strings listed in Tables A3–A6; hint on the visualisation of the ASI distributions shown in Figure 8; numerous clarity corrections and improvements; S.Ł.: the remaining part of the study.

Funding

This research received no external funding.

Data Availability Statement

The code and results of this study are available in a public repository at https://github.com/szluk/AssemblyTheory (accessed on 28 December 2025).

Acknowledgments

Mariola Bala for her motivation, Rafał Bobrowski for assessing the time required to calculate a maxASI of a string as a function of its length, Robert Dobosz for providing valuable hints on earlier research [53] linking information and probability theories, Rafał Winiarski for noting that the relation (1) is inequality, anonymous referee for pointing out the coincidence of the DPI with log 2 ( N ) + N 1 2 , shown in Figure 4, and SŁ’s daughter Anna for her support in finding the pattern of strings (34). SŁ thanks his wife, Magdalena Bartocha, for her everlasting support, and his partner and friend, Renata Sobajda, for her prayers.

Conflicts of Interest

Authors Wawrzyniec Bieniawski and Piotr Masierak were employed by the company Łukaszyk Patent Attorneys. The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:
AT assembly theory;
N length of a string;
$b$ number of basic symbols $c \in \{0, 1, \dots, b-1\}$;
$C_s$, $C_s^{(N,b)}$ a string;
ASI, $a^{(N,b)}(C_s)$ assembly index of a string $C_s$ (minASI - minimum, maxASI - maximum);
S ( Ω C s ) assembly space of a string C s ;
ASD, d s ( N k , b ) ( C s ) assembly depth of a string C s (minASD - minimum, minASI ASD - the ASD of a minASI string);
DPI depth index (OEIS A014701);
EWT, $E\left[C_k^{(N,b)}\right]$ expected waiting time;

OEIS Sequences

The following OEIS sequences are referred to in this manuscript:
A003313 Length of shortest addition chain for n (minASI);
A014701 Number of multiplications to compute n-th power by the Chandah-sutra method (DPI);
A026644 Number of moves to solve Chinese rings puzzle;
A048645 Integers with one or two 1-bits in their binary expansion;
A173786 Triangle read by rows: $T(n, k) = 2^n + 2^k$, $0 \le k \le n$.
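Two of the sequences above, A003313 (minASI) and A014701 (DPI), can be computed for small $n$ with the sketch below. It is our illustration: the DPI uses the standard closed form for the binary method, $\lfloor \log_2 n \rfloor$ plus the number of 1-bits of $n$ minus one, and the shortest addition chain is found by an exhaustive, iteratively deepened search, which is feasible only for small $n$. The function names are hypothetical.

```python
from typing import List


def dpi(n: int) -> int:
    """OEIS A014701: multiplications used by the binary (square-and-multiply) method."""
    return (n.bit_length() - 1) + bin(n).count("1") - 1


def shortest_addition_chain_length(n: int) -> int:
    """OEIS A003313, by iterative deepening over strictly increasing addition chains."""
    if n == 1:
        return 0

    def extend(chain: List[int], steps_left: int) -> bool:
        last = chain[-1]
        if last == n:
            return True
        if steps_left == 0 or (last << steps_left) < n:  # doubling bound
            return False
        sums = {a + c for i, a in enumerate(chain) for c in chain[i:]}
        for candidate in sorted((v for v in sums if last < v <= n), reverse=True):
            chain.append(candidate)
            if extend(chain, steps_left - 1):
                return True
            chain.pop()
        return False

    limit = (n - 1).bit_length()  # ceil(log2(n)) is a lower bound
    while not extend([1], limit):
        limit += 1
    return limit


# Lengths up to 32 at which the DPI exceeds the shortest addition chain length
print([n for n in range(1, 33) if dpi(n) > shortest_addition_chain_length(n)])
```

Restricting the search to strictly increasing chains loses no generality, since any addition chain can be sorted and deduplicated without increasing its length.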

Appendix A

Table A1. Certain lengths of minASI strings, which are defined by the ASI and the minASI ASD for $2 \le s \le 7$.
s a min ( N ˜ * ) = 1 a min ( N ˜ * ) = 2 a min ( N ˜ * ) = 3 a min ( N ˜ * ) = 4 a min ( N ˜ * ) = 5 a min ( N ˜ * ) = 6 a min ( N ˜ * ) = 7 a min ( N ˜ * ) = 8 a min ( N ˜ * ) = 9 ... N ˜ 2 n + 1
2 2 4 3 7 14 28 56 112 224 ... N ˜ 3
15 30 60 120 240 480 N ˜ 3 , a
23 46 92 184 368 N ˜ 3 , b
3 2 4 8 3 11 22 44 88 176 ... N ˜ 3
27 54 108 216 432 N ˜ 3 , a
43 86 172 344 N ˜ 3 , b
2 4 8 5 13 26 52 104 208 ... N ˜ 5
45 90 180 360 ... N ˜ 5 , b
4 2 4 8 16 3 19 38 76 152 ... N ˜ 3
51 102 204 408 N ˜ 3 , a
83 166 332 N ˜ 3 , b
2 4 8 16 5 21 42 84 168 ... N ˜ 5
85 170 340 ... N ˜ 5 , b
2 4 8 16 9 25 50 100 200 ... N ˜ 9
5 2 4 8 16 32 3 35 70 140 ... N ˜ 3
99 198 396 N ˜ 3 , a
163 326 N ˜ 3 , b
2 4 8 16 32 5 37 74 148 ... N ˜ 5
165 330 ... N ˜ 5 , b
2 4 8 16 32 9 41 82 164 ... N ˜ 9
2 4 8 16 32 17 49 98 196 ... N ˜ 17
6 2 4 8 16 32 64 3 67 134 ... N ˜ 3
195 390 N ˜ 3 , a
323 N ˜ 3 , b
2 4 8 16 32 64 5 69 138 276 N ˜ 5
325 650 N ˜ 5 , b
2 4 8 16 32 64 9 73 146 ... N ˜ 9
2 4 8 16 32 64 17 81 162 ... N ˜ 17
2 4 8 16 32 64 33 97 194 ... N ˜ 33
7 2 4 8 16 32 64 128 3 131 ... N ˜ 3
387 N ˜ 3 , a
2 4 8 16 32 64 128 5 133 266 N ˜ 5
645 N ˜ 5 , b
2 4 8 16 32 64 128 9 137 ... N ˜ 9
2 4 8 16 32 64 128 17 145 ... N ˜ 17
2 4 8 16 32 64 128 33 161 ... N ˜ 33
2 4 8 16 32 64 128 65 193 ... N ˜ 65
Table A2. Pathways leading to minASI strings (maximizing the number of independent assembly steps - MIA, maximizing the binary Shannon entropy - MBL) for $2 \le N \le 65$ (see Section 3 for details).
N d min ( N ) = log 2 ( N ) d a min ( N ) a min ( N ) a ^ min ( N ) MIA pathway MBL pathway (Hamming weight N 1 ) String
2 1 1 1 1 { 2 }  (1) N ^ 1
3 2 2 2 2 { 2 , 3 }  (1) N ^ 1
4 2 2 2 2 { 2 , 4 }  (2) N ^ 1
5 3 3 3 3 { 2 , 4 , 5 }  (2) N ^ 1
6 3 3 3 3 { 2 , 4 , 6 }  (3) N ^ 1
7 3 3 4 4 { 2 , ( 3 , 4 ) , 7 } { 2 , 4 , 6 , 7 }  (3) N ˜ 3
8 3 3 3 3 { 2 , 4 , 8 }  (4) N ^ 1
9 4 4 4 4 { 2 , 4 , 8 , 9 }  (4) N ^ 1
10 4 4 4 4 { 2 , 4 , 8 , 10 }  (5) N ^ 1
11 4 4 5 5 { 2 , ( 3 , 4 ) , 7 , 11 } { 2 , 4 , 8 , 10 , 11 }  (5) N ˜ 3
12 4 4 4 4 { 2 , 3 , 6 , 12 } { 2 , 4 , 8 , 12 }  (6) N ^ 1
13 4 4 5 5 { 2 , 4 , ( 5 , 8 ) , 13 } { 2 , 4 , 8 , 12 , 13 }  (6) N ˜ 5
14 4 4 5 5 { 2 , ( 3 , 4 ) , 7 , 14 } { 2 , 4 , 8 , 12 , 14 }  (7) N ˜ 3
15 4 5 5 6 { 2 , 3 , 5 , 10 , 15 }  (6) N ˜ 7
16 4 4 4 4 { 2 , 4 , 8 , 16 }  (8) N ^ 1
17 5 5 5 5 { 2 , 4 , 8 , 16 , 17 }  (8) N ^ 1
18 5 5 5 5 { 2 , 4 , 8 , 16 , 18 }  (9) N ^ 1
19 5 5 6 6 { 2 , ( 3 , 4 ) , 8 , 11 , 19 } { 2 , 4 , 8 , 10 , 18 , 19 }  (9) N ˜ 3
20 5 5 5 5 { 2 , 3 , 5 , 10 , 20 } { 2 , 4 , 8 , 16 , 20 }  (10) N ^ 1
21 5 5 6 6 { 2 , 4 , ( 5 , 8 ) , 16 , 21 } { 2 , 4 , 8 , 16 , 20 , 21 }  (10) N ˜ 5
22 5 5 6 6 { 2 , ( 3 , 4 ) , 7 , 11 , 22 } { 2 , 4 , 8 , 16 , 20 , 22 }  (11) N ˜ 3
23 5 6 6 7 { 2 , 3 , 5 , 10 , 20 , 23 }  (9) N ˜ 2 n + 1 , b
24 5 5 5 5 { 2 , 4 , 8 , 12 , 24 }  (12) N ^ 1
25 5 5 6 6 { 2 , 4 , 8 , ( 9 , 16 ) , 25 } { 2 , 4 , 8 , 16 , 24 , 25 }  (12) N ˜ 9
26 5 5 6 6 { 2 , 4 , ( 5 , 8 ) , 13 , 26 } { 2 , 4 , 8 , 16 , 24 , 26 }  (13) N ˜ 5
27 5 6 6 7 { 2 , 3 , 6 , 12 , 24 , 27 } { 2 , 4 , 5 , 9 , 18 , 27 }  (12) N ˜ 2 n + 1 , a
28 5 5 6 6 { 2 , ( 3 , 4 ) , 7 , 14 , 28 } { 2 , 4 , 8 , 16 , 24 , 28 }  (14) N ˜ 3
29 5 6 7 7 { 2 , 4 , 8 , ( 9 , 10 ) , 20 , 29 } { 2 , 4 , 8 , 16 , 24 , 28 , 29 }  (14)
30 5 6 6 7 { 2 , 3 , 5 , 10 , 15 , 30 } { 2 , 4 , 6 , 10 , 20 , 30 }  (15) N ˜ 7
31 5 6 7 8 { 2 , 4 , ( 5 , 8 ) , 13 , 26 , 31 } { 2 , 4 , 8 , 10 , 20 , 30 , 31 }  (15) N ˜ 15
32 5 5 5 5 { 2 , 4 , 8 , 16 , 32 }  (16) N ^ 1
33 6 6 6 6 { 2 , 4 , 8 , 16 , 32 , 33 }  (16) N ^ 1
34 6 6 6 6 { 2 , 4 , 8 , 16 , 32 , 34 }  (17) N ^ 1
35 6 6 7 7 { 2 , ( 3 , 4 ) , 7 , 14 , 28 , 35 } { 2 , 4 , 8 , 16 , 32 , 34 , 35 }  (17) N ˜ 3
36 6 6 6 6 { 2 , 4 , 8 , 16 , 32 , 36 }  (18) N ^ 1
37 6 6 7 7 { 2 , 4 , ( 5 , 8 ) , 16 , 32 , 37 } { 2 , 4 , 8 , 16 , 32 , 36 , 37 }  (18) N ˜ 5
38 6 6 7 7 { 2 , ( 3 , 4 ) , 8 , 11 , 19 , 38 } { 2 , 4 , 8 , 16 , 32 , 36 , 38 }  (19) N ˜ 3
39 6 6 7 8 { 2 , 4 , ( 5 , 8 ) , 13 , 26 , 39 } { 2 , 4 , ( 5 , 8 ) , 13 , 26 , 39 }  (18)
40 6 6 6 6 { 2 , 4 , 8 , 16 , 32 , 40 } { 2 , 4 , 8 , 16 , 32 , 40 }  (20) N ^ 1
41 6 6 7 7 { 2 , 4 , 8 , ( 9 , 16 ) , 25 , 41 } { 2 , 4 , 8 , 16 , 32 , 40 , 41 }  (20) N ˜ 9
42 6 6 7 7 { 2 , ( 3 , 4 ) , 7 , 14 , 28 , 42 } { 2 , 4 , 8 , 16 , 32 , 40 , 42 }  (21) N ˜ 5
43 6 7 7 8 { 2 , 3 , 5 , 10 , 20 , 40 , 43 }  (17) N ˜ 2 n + 1 , b
44 6 6 7 7 { 2 , ( 3 , 4 ) , 7 , 11 , 22 , 44 } { 2 , 4 , 8 , 16 , 32 , 40 , 44 }  (22) N ˜ 3
45 6 7 7 8 { 2 , 3 , 5 , 10 , 20 , 40 , 45 } { 2 , 4 , 5 , 9 , 18 , 27 , 45 }  (20) N ˜ 2 n + 1 , b
46 6 7 7 8 { 2 , 3 , 5 , 10 , 20 , 23 , 46 } { 2 , 4 , 6 , 10 , 20 , 40 , 46 }  (23) N ˜ 2 n + 1 , b
47 6 7 8 9 { 2 , ( 3 , 4 ) , 7 , 11 , 22 , 44 , 47 } { 2 , 4 , 6 , 10 , 20 , 40 , 46 , 47 }  (23) N ˜ 15
48 6 6 6 6 { 2 , 4 , 8 , 12 , 24 , 48 }  (24) N ^ 1
49 6 7 7 7 { 2 , 4 , 8 , 12 , 24 , 48 , 49 }  (24) N ˜ 17
50 6 6 7 7 { 2 , 4 , 8 , ( 9 , 16 ) , 25 , 50 } { 2 , 4 , 8 , 16 , 32 , 40 , 48 , 50 }  (25) N ˜ 9
51 6 7 7 8 { 2 , 4 , 8 , 16 , 17 , 34 , 51 }  (24) N ˜ 2 n + 1 , a
52 6 6 7 7 { 2 , 4 , ( 5 , 8 ) , 13 , 26 , 52 } { 2 , 4 , 8 , 16 , 32 , 40 , 48 , 52 }  (26) N ˜ 5
53 6 7 8 8 { 2 , 4 , ( 5 , 8 ) , 16 , 32 , 48 , 53 } { 2 , 4 , 8 , 16 , 32 , 40 , 48 , 52 , 53 }  (26)
54 6 7 7 8 { 2 , 3 , 6 , 12 , 24 , 27 , 54 } { 2 , 4 , 6 , 12 , 24 , 48 , 54 }  (27) N ˜ 2 n + 1 , a
55 6 7 8 9 { 2 , ( 3 , 4 ) , 7 , 11 , 22 , 44 , 55 } { 2 , 4 , 8 , 16 , 18 , 36 , 54 , 55 }  (27)
56 6 6 7 7 { 2 , ( 3 , 4 ) , 7 , 14 , 28 , 56 } { 2 , 4 , 8 , 16 , 32 , 48 , 56 }  (28) N ˜ 3
57 6 7 8 8 { 2 , ( 3 , 4 ) , 7 , 14 , 28 , 56 , 57 } { 2 , 4 , 8 , 16 , 32 , 48 , 56 , 57 }  (28)
58 6 7 8 8 { 2 , ( 3 , 4 ) , 7 , 14 , 28 , 29 , 58 } { 2 , 4 , 8 , 16 , 32 , 48 , 56 , 58 }  (29)
59 6 7 8 9 { 2 , ( 3 , 4 ) , 7 , 14 , 28 , 56 , 59 } { 2 , 4 , 5 , 9 , 18 , 27 , 54 , 59 }  (26) N ˜ 27
60 6 7 7 8 { 2 , 4 , 8 , 12 , 24 , 48 , 60 } { 2 , 4 , 6 , 10 , 20 , 30 , 60 }  (30) N ˜ 7
61 6 8 8 9 { 2 , 4 , 8 , 12 , 24 , 48 , 60 , 61 } { 2 , 4 , 8 , 16 , 20 , 40 , 60 , 61 }  (30)
62 6 7 8 9 { 2 , ( 3 , 4 ) , 7 , 14 , 28 , 31 , 62 } { 2 , 4 , 8 , 16 , 20 , 40 , 60 , 62 }  (31) N ˜ 15
63 6 7 8 10 { 2 , ( 3 , 4 ) , 7 , 14 , 21 , 42 , 63 } { 2 , 4 , 5 , 9 , 18 , 27 , 45 , 63 }  (28)
64 6 6 6 6 { 2 , 4 , 8 , 16 , 32 , 64 }  (32) N ^ 1
65 7 7 7 7 { 2 , 4 , 8 , 16 , 32 , 64 , 65 }  (32) N ^ 1
Table A3. Exemplary maxASI bitstrings for $10 \le N \le 39$.
N $C_{\max}^{(N,2)}$ $a_{\max}^{(N,2)}$ N $C_{\max}^{(N,2)}$ $a_{\max}^{(N,2)}$
10 [1010000111] 8 25 [0000000101010110011111110] 15
11 [00010111100] 8 26 [01001100000111111101010110] 16
12 [101010000111] 9 27 [000000011111110101011001000] 16
13 [1000001110101] 9 28 [0110101011111110000000110010] 17
14 [10011000010111] 10 29 [01100000001010100111011111110] 17
15 [000001010111110] 10 30 [100100000000110010101101111111] 17
16 [1001100001010111] 11 31 [0101010010000000111111101101100] 18
17 [00000010101111110] 11 32 [01001100000000101011011111111001] 18
18 [100110100001010111] 12 33 [100000000010011111111011101101010] 18
19 [0111110110000010100] 12 34 [1000000000100111111110111011010101] 18
20 [10011010000101011111] 13 35 [10101000000010110010011111110001101] 19
21 [000000010101100111110] 13 36 [101010000000101100100111111100011101] 19
22 [0010111111101001100000] 14 37 [1011010101000000010010001111111001101] 19
23 [00000001010101100111110] 14 38 [10111010101000000010010001111111001101] 20
24 [011001111111010100000001] 15 39 [111001100100011010000001010101101101111] 20
Table A4. Exemplary maxASI bitstrings for $40 \le N \le 85$.
N $C_{\max}^{(N,2)}$ $a_{\max}^{(N,2)}$
40 [0011001011111110101000000011011000101101] 20
41 [00000111111001110101001011011011000100110] 21
42 [001101111110101010110000000111100100100101] 21
43 [0111100111110110010100000011100011000101101] 21
44 [11101010101011111100100100011000000010110110] 22
45 [111010101010111111001001000110000000111011010] 22
46 [0111100111111010001110000000110010010101001101] 22
47 [01111001111110100011100000001100100101010110110] 23
48 [011110011111101000111000000011001001010101101100] 23
49 [0111100111111010001110000000110010010101011011000] 23
50 [10100111111100010001111010000001011011000011100110] 23
51 [101001111111000100011110100000010110110010011010101] 24
52 [1010011111110001000111101000000101101100100100110101] 24
53 [10100111111100010001111010000001011011001001001101010] 24
54 [101001111111000100011110100000010110110010010011010101] 25
55 [1010011111110001000111101000000101101100100100110101010] 25
56 [10100111111100010001111010000001011011001001001101010101] 25
57 [101001111111000100011110100000010110110010010011010101010] 25
58 [1010011111110001000111101000000101101100100100110101010101] 26
59 [10001011100111001111111011010000001110000011001001001010101] 26
60 [101010111011110011111110100100001100100010001001010110000000] 26
61 [1010101110111100111111101001000011001000100010010101100000001] 26
62 [10101011101111001111111010010000110010001000100101011011000001] 27
63 [101010111011110011111110100100001100100010001001010110110000000] 27
64 [1010101110111100111111101001000011001000100010010101101100000001] 27
65 [10101011101111001111111010010000110010001000100101011011011000001] 28
66 [101010101011001000111110110111110001000000110011001110011010010010] 28
67 [1010101010110010001111101101111100010000001100110011100110100100101] 28
68 [10101010101100100011111011011111000100000011001100111001110001011110] 28
69 [101010101011001000111110110111110001000000110011001110011100010100101] 29
70 [1010101010110010001111101101111100010000001100110011100111000101001001] 29
71 [10101010101100100011111011011111000100000011001100111001110001010010010] 29
72 [101010101011001000111110110111110001000000110011001110011100010100100101] 30
73 [1010101010110010001111101101111100010000001100110011100111000101001001001] 30
74 [10101010101100100011111011011111000100000011001100111001110001010010000001] 30
75 [101010101011001000111110110111110001000000110011001110011100010100100000001] 30
76 [1010101010110010001111101101111100010000001100110011100111000101001000010000] 31
77 [10101010101100100011111011011111000100000011001100111001110001010010000000000] 31
78 [101010101011001000111110110111110001000000110011001110011100010100100000000001] 31
79 [1001011101101011111110100110110010011101010110100101000110011110111100000001011] 31
80 [10010111011010111111101001101100100111010101101001010001100111101111000000010101] 32
81 [100101110110101111111010011011001001110101011010010100011001111011110000000101010] 32
82 [1001011101101011111110100110110010011101010110100101000110011110111100000001010100] 32
83 [10010111011010111111101001101100100111010101101001010001100111101111000000010101000] 33
84 [100101110110101111111010011011001001110101011010010100011001111011110000000101011100] 33
85 [1001011101101011111110100110110010011101010110100101000110011110111100000001010111000] 33
Table A5. Exemplary maxASI strings for $b = 3$ and $13 \le N \le 54$.
N $C_{\max}^{(N,3)}$ $a_{\max}^{(N,3)}$
13 [0002220111210] 12
14 [00022201112101] 12
15 [000222011121012] 13
16 [0002220111210120] 14
17 [20011121002201021] 14
18 [222111210100001202] 15
19 [0221110100122200021] 16
20 [02211101001222000211] 16
21 [022111010012220002112] 17
22 [0221110100122200002021] 18
23 [02211101001222000211201] 18
24 [022111010012220002011210] 19
25 [0222212112002010001111021] 20
26 [02222121120020100011110210] 20
27 [012221211200201000111102202] 21
28 [0122212112002010001111022010] 21
29 [01222121120020100011110220210] 22
30 [012221211200201000111102202102] 22
31 [0122212112002010001111022021020] 23
32 [01222121120020100011110220210200] 23
33 [012221211200201000111102202102000] 24
34 [0122212112002010001111022021020001] 24
35 [01222121120020100011110220210200000] 25
36 [012221211200201000111102202102000001] 25
37 [0122212112002010001111022021020010101] 26
38 [01222121120020100011110220210200101101] 26
39 [012221211200201000111102202102001011012] 26
40 [0122212112002010001111022021020010110000] 27
41 [01222121120020100011110220210200101100002] 27
42 [012221211200201000111102202102001011000022] 28
43 [0122212112002010001111022021020010110000222] 28
44 [01222121120020100011110220210200101100000110] 29
45 [012221211200201000111102202102001011000001110] 29
46 [2111020110222211012201112212121010020000001202] 29
47 [21110201102222110122011122121210100200000010220] 30
48 [211102011022221101220111221212101002000000120210] 30
49 [2111020110222211012201112212121010020000001202112] 31
50 [21110201102222110122011122121210100200000012021120] 31
51 [211102011022221101220111221212101002000000120212210] 32
52 [2111020110222211012201112212121010020000001202122120] 32
53 [21110201102222110122011122121210100200000012021200220] 33
54 [211102011022221101220111221212101002000000120212002202] 33
Table A6. Exemplary maxASI strings for $b = 4$ and $21 \le N \le 49$.
N $C_{\max}^{(N,4)}$ $a_{\max}^{(N,4)}$
21 [000111222333102132030] 20
22 [0001112223331021320302] 20
23 [00011122233310213203012] 21
24 [010000111222333102132030] 22
25 [0100001112223331021320302] 22
26 [01000011122233310213203023] 23
27 [010000111222333102132030221] 24
28 [0001102013331121301222230323] 24
29 [00011320133311121022232302030] 25
30 [301000012111123222233310320213] 26
31 [3010000121111232222333103202130] 26
32 [30100001211112322223331032021303] 27
33 [301000012111123222233310320213313] 28
34 [3010000121111232222333203102133130] 28
35 [30100001211112322223332031021331300] 29
36 [301000012111123222233320310213313110] 30
37 [3010000121111232222333203102133131101] 30
38 [30100001211112322223332031021331311011] 31
39 [301000012111123222233320310213313110221] 32
40 [3010000121111232222333203102133131102210] 32
41 [30100001211112322223332031021331311022101] 33
42 [301000012111123222233320310213313110322011] 33
43 [3010000121111232222333203102133131102210103] 34
44 [30100001211112322223332031021331311022101030] 34
45 [301000012111123222233320310213313110221201300] 35
46 [3010000121111232222333203102133131102212013002] 36
47 [30100001211112322223332031021331311022120130023] 36
48 [301000012111123222233320310213313110221201300230] 37
49 [3010000121111232222333203102133131102212013002303] 38

Appendix A.1. Proof of Theorem 2.1

Consider the following two strings of the same length $N + 8$, with $* \neq 01$ and the same distributions of other repetitions (if any):
$$C_k = [\dots 01 \dots 01 \dots 01 \dots * \dots], \quad C_l = [\dots 01 \dots 01 \dots 22 \dots 22 \dots]. \tag{A1}$$
Assembling a 2-gram takes one assembly step, and each appending of a 2-gram to an assembled string counts as another assembly step. Hence, in the general case (i.e., for strings $C_k$, $C_l$ that also contain other symbols), the string $C_k$ requires six additional assembly steps (one to assemble $01$, three to append it, one to assemble $*$, and one to append it), the same as the string $C_l$ (two to assemble $01$ and $22$, and four to append them). Thus, a string containing three copies of the same 2-gram has the same ASI as a string containing two pairs of the same 2-grams, provided that both strings have the same distributions of other repetitions and the same lengths.
Consider the following two strings of the same length $N + 6$ with the same distributions of other repetitions:
$$C_k = [\dots 01 \dots 01 \dots 01 \dots], \quad C_l = [\dots 010 \dots 010 \dots]. \tag{A2}$$
The assembly of a 3-gram takes two steps. Hence, in the general case, the string $C_k$ requires four additional assembly steps, the same as the string $C_l$. Thus, a string containing three copies of the same 2-gram has the same ASI as a string containing two copies of the same 3-gram, provided that both strings have the same distributions of other repetitions.
Similarly, a string containing the same two 3-grams has the same ASI as a string containing two pairs of the same 2-grams, provided that both strings have the same distributions of other repetitions and have the same lengths.
Consider the following two strings of the same length $N + 9$ with the same distributions of other repetitions:
$$C_k = [\dots 0101 \dots 0101 \dots], \quad C_l = [\dots 010 \dots 010 \dots 010 \dots]. \tag{A3}$$
The assembly of such a 4-gram takes two steps. Hence, in the general case, the string $C_k$ requires five additional assembly steps, the same as the string $C_l$. Thus, a string containing two copies of the same minASI 4-gram has the same ASI as a string containing three copies of the same 3-gram, provided that both strings have the same distributions of other repetitions and the same lengths.
Consider the following two strings of the same length $N + 8$ with the same distributions of other repetitions:
$$C_k = [\dots 0001 \dots 0001 \dots], \quad C_l = [\dots 110 \dots 10 \dots 110 \dots]. \tag{A4}$$
The assembly of such a 4-gram takes three steps. Hence, in the general case, the string $C_k$ requires five additional assembly steps, the same as the string $C_l$. Thus, a string containing two copies of the same maxASI 4-gram has the same ASI as a string containing a 2-gram and two copies of the same 3-gram based on this 2-gram, provided that both strings have the same distributions of other repetitions.
Consider the following two strings of the same length $N + 10$ with the same distributions of other repetitions:
$$C_k = [\dots 110 \dots 00 \dots 110 \dots 00 \dots], \quad C_l = [\dots 110 \dots 10 \dots 110 \dots * \dots], \tag{A5}$$
where $* \notin \{11, 10\}$. In the general case, the string $C_k$ requires seven additional assembly steps, the same as the string $C_l$. Thus, a string containing two copies of the same 2-gram and two copies of the same 3-gram not based on this 2-gram has the same ASI as a string containing a 2-gram and two copies of the same 3-gram based on this 2-gram, provided that both strings have the same distributions of other repetitions and the same lengths.
In general, the above considerations show that
  • $k$ copies of a 2-gram in a string decrease the ASI of this string by at least $k - 1$;
  • $k$ copies of a 3-gram in a string decrease the ASI of this string by at least $2k - 2$;
  • $k$ copies of a minASI 4-gram in a string decrease the ASI of this string by at least $3k - 2$;
  • $k$ copies of a maxASI 4-gram in a string decrease the ASI of this string by at least $3k - 3$;
where the phrase "at least" indicates that other repetitions, e.g. 2-grams forming multiple 4-grams, can further decrease the ASI of the string. W.l.o.g., consider the following string
$$C_m^{(N,b)} = [\dots [c_1 c_2 \dots c_n] \dots [c_1 c_2 \dots c_n] \dots], \tag{A6}$$
containing two copies of an $n$-gram $C_l^{(n,b)} = [c_1 c_2 \dots c_n]$. The $n$-gram $C_l^{(n,b)}$ can be assembled in at least $a^{(n,b)}(C_l)$ steps and appended to the assembled string $C_m$ in one step. Suppose that the ASI of the $n$-gram $C_l^{(n,b)}$ is $a^{(n,b)}(C_l) = n - 1$, i.e., the $n$-gram does not have any repetitions that can be reused and $n < 8$. Then one copy of this $n$-gram, as expected, does not decrease the ASI of the string $C_m^{(N,b)}$, as $1 \cdot (n - 1) - (n - 1) = 0$, while $k$ copies decrease it by $(n - 1)(k - 1)$. On the other hand, if $a^{(n,b)}(C_l) < n - 1$, then even a single copy of this $n$-gram will decrease the ASI of $C_m$.
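The four lower bounds listed above can be checked with a few lines of arithmetic. The following is a minimal sketch, assuming that without reuse every symbol of every copy costs one appending step, and that with reuse the n-gram is assembled once and then appended once per copy; the function names are illustrative and do not come from the paper.

```python
def steps_without_reuse(n: int, k: int) -> int:
    """Steps to append k copies of an n-gram symbol by symbol (no reuse)."""
    return n * k


def steps_with_reuse(ngram_asi: int, k: int) -> int:
    """Steps to assemble the n-gram once (ngram_asi steps) and append it k times."""
    return ngram_asi + k


def saving(n: int, ngram_asi: int, k: int) -> int:
    """Lower bound on the ASI decrease caused by k copies of the n-gram."""
    return steps_without_reuse(n, k) - steps_with_reuse(ngram_asi, k)


# Columns: k, 2-gram, 3-gram, minASI 4-gram (ASI 2), maxASI 4-gram (ASI 3).
for k in range(1, 6):
    print(k, saving(2, 1, k), saving(3, 2, k), saving(4, 2, k), saving(4, 3, k))
```

For an n-gram without internal repetitions (ASI equal to n - 1), `saving(n, n - 1, k)` reduces to (n - 1)(k - 1), which vanishes for a single copy.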

Appendix A.2. Proof of Theorem 2.2

Without loss of generality (w.l.o.g.), assume a contrario that $\Omega$ contains two strings $C_l$, $C_m$ having the same ASD, i.e., $d^{(N_l,b)}(C_l) = d^{(N_m,b)}(C_m) \geq 0$, that were not assembled in independent assembly steps, i.e., that $C_m$ was used in the assembly of $C_l$ along with a basic symbol $c$ in some previous step $s$. Then
$$d_s^{(N_l,b)}(C_l) = \max\{d^{(N_m,b)}(C_m),\, d^{(1,b)}(c)\} + 1 = d^{(N_m,b)}(C_m) + 1 \neq d^{(N_m,b)}(C_m), \tag{A7}$$
which contradicts our assumption and completes the proof.

Appendix A.3. Proof of Lemma 2.2

Assume a contrario that $N > 2^{d_s^{(N)}}$. Then for the ASD $d_s^{(N)} = 0$ we have $N > 2^0 = 1$, which is a contradiction, as all basic symbols $c$ are unit-length strings and $N = 1$. Similarly, for $d_s^{(N)} = 1$, $N > 2$ is also a contradiction in the case of 2-grams, and so on. This is a consequence of the ASD Definition (3).

Appendix A.4. Proof of Theorem 2.3

$d_s^{(N)} \geq \log_2(N)$ follows from relation (4). $d_{\min}^{(2)} = \lceil \log_2(2) \rceil = 1$ satisfies both Definition (3) and our hypothesis (5), and the same holds for $N = 3$. Using induction on the length $N$, assume that for some $N > 3$ we can assemble a minASD string with the ASD (5). We need to show that for $N + 1$ we can assemble a string with the ASD satisfying
$$d_{\min}^{(N+1)} = \lceil \log_2(N + 1) \rceil. \tag{A8}$$
Since, by Definition (3), the ASD as a function of $N$ is monotonically nondecreasing and can increase by at most one between $N$ and $N + 1$, we have
$$d_{\min}^{(N+1)} = d_{\min}^{(N)} = \lceil \log_2(N) \rceil \quad \text{or} \quad d_{\min}^{(N+1)} = d_{\min}^{(N)} + 1 = \lceil \log_2(N) \rceil + 1, \tag{A9}$$
where we used relations (5) and (A8). Solving relation (A9) for $N$ yields
$$d_{\min}^{(N+1)} = \begin{cases} d_{\min}^{(N)} = s & \text{if } 2^{s-1} < N < 2^s, \\ d_{\min}^{(N)} + 1 = s + 1 & \text{if } N = 2^s, \end{cases} \tag{A10}$$
and completes the proof.
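The case distinction that closes this proof can be verified numerically. The sketch below assumes that the minimum ASD equals the ceiling of the binary logarithm of N, as in hypothesis (5); `d_min` and `next_d_min` are hypothetical helper names introduced only for this check.

```python
def d_min(n: int) -> int:
    """Ceiling of log2(n) for n >= 1, computed exactly with integer arithmetic."""
    return (n - 1).bit_length()


def next_d_min(n: int) -> int:
    """Predict d_min(n + 1) from d_min(n) using the two cases of the proof."""
    s = d_min(n)
    is_power_of_two = (n & (n - 1)) == 0
    # The depth grows by one only when n itself is a power of two.
    return s + 1 if is_power_of_two else s


# The prediction agrees with the direct formula on a range of lengths.
assert all(next_d_min(n) == d_min(n + 1) for n in range(1, 10_000))
```

Using `int.bit_length` avoids floating-point rounding issues that `math.log2` can show near large powers of two.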

Appendix A.5. Proof of Theorem 2.4

Using the property of the ceiling function $n = \lceil x \rceil \Leftrightarrow n - 1 < x \leq n$, valid for $n \in \mathbb{N}$, $x \in \mathbb{R}$, we have
$$d_{a_{\max}}^{(N,b)} = \lceil \log_2(N) \rceil \Leftrightarrow d_{a_{\max}}^{(N,b)} - 1 < \log_2(N) \leq d_{a_{\max}}^{(N,b)}. \tag{A11}$$
The non-strict inequality in (A11) corresponds to the non-strict inequality (4), valid for any $N$ and any ASD. Therefore, we need to prove that the strict inequality $d_{a_{\max}}^{(N,b)} < \log_2(N) + 1$ holds for all $C_{\max}$ strings. Assume, for contradiction, that there exists a maxASI string $C_{\max}^{(N,b)}$ such that
$$d_{a_{\max}}^{(N,b)} \geq \log_2(N) + 1 = \log_2(2N) \Leftrightarrow 2^{d_{a_{\max}}^{(N,b)}} \geq 2N \Leftrightarrow N \leq 2^{d_{a_{\max}}^{(N,b)} - 1}. \tag{A12}$$
But this relation does not hold for the maxASI string $C_{\max}^{(N,b)}$.

Appendix A.6. Proof of Theorem 3.1

Strings $C_{\min}$ for which $a^{(N)}(C_{\min}) = \min_k \{a^{(N,b)}(C_k)\}$, $k \in \{1, 2, \dots, b^N\}$, can be formed in subsequent steps $s$ by joining the longest string assembled so far with itself until $N = 2^s$ is reached. Therefore, if $N = 2^s$, then $\min_k \{a^{(2^s)}(C_k)\} = s = \log_2(N)$. Only $b^2$ strings have such an ASI if $N = 2^s$, including respectively $b$ and $b(b-1)$ strings
$$C_k^{(2^s,b)} = [* * \dots], \quad C_l^{(2^s,b)} = [* \star * \star \dots], \quad \star \neq *, \tag{A13}$$
and the assembly space of each of the strings (A13) is unique. At each assembly step, the length of the string doubles.
An addition chain for $N \in \mathbb{N}$ having the shortest length $s \in \mathbb{N}$ (commonly denoted as $l(N)$) is defined as a sequence $1 = a_0 < a_1 < \dots < a_s = N$ of integers such that $\forall j \geq 1$, $a_j = a_k + a_l$ for $k \leq l < j$. Hence, $j = 1 \Rightarrow k = l = 0$, and the first step in forming an addition chain for $N$ is always $a_1 = a_0 + a_0 = 1 + 1 = 2$, which is equivalent to saying that the ASI of any 2-gram is one. The second step in forming an addition chain can be $a_2 = a_0 + a_1 = 1 + 2 = 3$ or $a_2 = a_1 + a_1 = 2 + 2 = 4$, which corresponds to assembling a 3-gram based on the previously assembled 2-gram or to assembling a minASI 4-gram (A13) from this 2-gram. The maxASI 4-gram can be assembled in a third step, $a_3 = a_2 + a_0 = 3 + 1 = 4$, which corresponds to joining a basic symbol to a 3-gram. Therefore, four is the smallest number achievable in two ways, in accordance with Theorem 2.1.
Thus, finding the shortest addition chain for $N$ corresponds to finding the ASI of a string containing basic symbols and/or 2-grams and/or 3-grams containing these 2-grams if $N \neq 2^s$, since by Theorem 2.1 only they provide the same assembly indices $\{0, 1, 2\}$ with no internal repetitions.
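The addition-chain definition used in this proof can be made concrete with a small exhaustive search. The following is a sketch only: it uses iterative deepening over chain lengths, which is practical for small N, and it is not the implementation used for the results in this paper. The function name `shortest_addition_chain` is an assumption made for this example.

```python
def shortest_addition_chain(n: int) -> list:
    """Return one shortest addition chain 1 = a_0 < ... < a_s = n."""
    if n < 1:
        raise ValueError("n must be a positive integer")

    def extend(chain, limit):
        last = chain[-1]
        if last == n:
            return chain
        steps_left = limit - (len(chain) - 1)
        # Doubling is the fastest possible growth, so prune hopeless branches.
        if steps_left == 0 or (last << steps_left) < n:
            return None
        # Every new element is a sum of two (not necessarily distinct) earlier ones.
        sums = {a + b for a in chain for b in chain if last < a + b <= n}
        for total in sorted(sums, reverse=True):
            found = extend(chain + [total], limit)
            if found is not None:
                return found
        return None

    limit = (n - 1).bit_length()  # lower bound: ceil(log2(n)) steps
    while True:
        found = extend([1], limit)
        if found is not None:
            return found
        limit += 1


chain = shortest_addition_chain(15)
print(chain, len(chain) - 1)  # the chain length l(15) is 5
```

The number of steps, `len(chain) - 1`, is the quantity that the text relates to the minimum ASI of a string of length N.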

Appendix A.7. Proof of Theorem 3.3

To prove that the minASI ASD equals the minASI, we use mathematical induction on the length $N$ of the string. For the base case ($N = 2^0 = 1$), the string consists of a single basic symbol $c \in P_0^{(b)}$, where $P_s^{(b)}$ denotes the assembly pool. Hence, its ASI is $a_{\min}^{(1)} = 0$ and its ASD is $d_{a_{\min}}^{(1)} = 0$. Therefore, $d_{a_{\min}}^{(1)} = a_{\min}^{(1)} = 0$. Assume now that for all strings of length $2^s$ less than $N$, the ASD equals the minASI, that is
$$d_{a_{\min}}^{(2^s)} = a_{\min}^{(2^s)} \quad \forall\, 2^s < N.$$
For some integer $s$, we construct the minASI string as follows. First, we assemble a 2-gram from two basic symbols:
$$c_1 c_2 = C^{(2,b)}, \quad c_1, c_2 \in P_0^{(b)}.$$
Its ASI is $a_{\min}^{(2)} = 1$ and its ASD is $d_{a_{\min}}^{(2)} = 1$. Then for each $s \geq 2$ we have $C^{(2^{s-1},b)}$ with the ASI $a_{\min}^{(2^{s-1})} = s - 1$ and the ASD $d_{a_{\min}}^{(2^{s-1})} = s - 1$, and we construct $C^{(2^s,b)}$ by joining two copies of $C^{(2^{s-1},b)}$:
$$C^{(2^{s-1},b)}\, C^{(2^{s-1},b)} = C^{(2^s,b)}.$$
The ASI of the string $C^{(2^s,b)}$ is equal to
$$a_{\min}^{(2^s)} = a_{\min}^{(2^{s-1})} + 1 = (s - 1) + 1 = s,$$
and, similarly, its ASD is equal to
$$d_{a_{\min}}^{(2^s)} = \max\{d_{a_{\min}}^{(2^{s-1})},\, d_{a_{\min}}^{(2^{s-1})}\} + 1 = (s - 1) + 1 = s.$$
Therefore, $a_{\min}^{(2^s)} = d_{a_{\min}}^{(2^s)} = s$. At any step, we assemble strings (A13), and no two assembly steps can be independent, which follows from Theorem 3.1. Equation (A10) establishes that $N = 2^s$ is the largest $N$ for which $d_{\min}^{(N)} = s$. This proves $d_{\min}^{(2^s)} = d_{a_{\min}}^{(2^s)} = a_{\min}^{(2^s)} = s$. Finally, the even part of Definition 2.7 of the DPI is the only defining part of this definition iff $N = 2^s$. Hence, $d_{\min}^{(2^s)} = d_{a_{\min}}^{(N)} = a_{\min}^{(2^s)} = \hat{a}_{\min}^{(2^s)} = s$.
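The doubling construction used in this induction can be traced step by step. The sketch below assumes the depth rule stated in the proof (the depth of a joined string is the maximum of the depths of its two parts plus one); the function name and the tuple it returns are illustrative choices, not part of the paper.

```python
def assemble_by_doubling(c1: str, c2: str, s: int):
    """Assemble a string of length 2**s; return (string, assembly steps, depth)."""
    if s < 1:
        raise ValueError("s must be at least 1")
    # Joining two basic symbols gives the 2-gram: one step, depth one.
    string, steps, depth = c1 + c2, 1, 1
    for _ in range(2, s + 1):
        string = string + string       # join the longest string with itself
        steps += 1                     # each join is one assembly step
        depth = max(depth, depth) + 1  # both operands share the same depth
    return string, steps, depth


for s in range(1, 6):
    string, steps, depth = assemble_by_doubling("0", "1", s)
    print(s, len(string), steps, depth)
```

In every iteration the step count and the depth advance together, which is the equality of ASI and ASD that the proof establishes for lengths that are powers of two.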

Appendix A.8. Proof of Theorem 3.4

The lengths (14) (OEIS A173786 or OEIS A048645) are the generalization of the strings of length $N = 2^{s-1} + 2^{s-1} = 2^s$ of Theorem 3.3. For other lengths of the strings (14), the base case for $s = 2$, $l = 0$ describes the assembly of a 3-gram by joining a symbol to a 2-gram made in the first step, so that both the ASI and the ASD of this 3-gram increase by one, and so on. For any $s$ we can join a symbol to a string of length $N = 2^{s-1}$ assembled in $s - 1$ steps or join two such strings, as shown in Figure 3a.
To see that $\hat{a}_{\min}^{(\hat{N}_1)} = s$ (14) holds for $\hat{N}_1 \neq 2^s$, note that there is only one odd part of Definition 2.7 of the DPI that restores $N = 2^s$. For example, we reach one starting from $\hat{N}_1 = 20$ in five consecutive steps $20 \to 10 \to 5 \to 4 \to 2 \to 1$.
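The reduction in the example above can be reproduced with a short routine. Definition 2.7 is not restated in this appendix, so the sketch below assumes the rule that the example suggests: an even length is halved and an odd length is decremented by one, until one is reached. The name `depth_index_path` is hypothetical.

```python
def depth_index_path(n: int) -> list:
    """Return the sequence of lengths visited while reducing n to one."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    path = [n]
    while n > 1:
        # Assumed rule: halve even lengths, decrement odd lengths.
        n = n // 2 if n % 2 == 0 else n - 1
        path.append(n)
    return path


path = depth_index_path(20)
print(path, len(path) - 1)  # [20, 10, 5, 4, 2, 1] 5
```

The number of steps, `len(path) - 1`, is five for N = 20, matching the example, and six for N = 15.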

Appendix A.9. Proof of Lemma 3.1

We begin at $d = 3$ by assembling a $C_{\min}^{(7)}$ using a 4-gram and a 3-gram assembled independently (e.g., using an assembly space (15)) with $a_{\min}^{(7)} = 4$ and $d_{a_{\min}}^{(7)} = \lceil \log_2(7) \rceil = 3$. For $d = 4$, the string (16) $C_{\min}^{(11)}$ can be assembled by joining the string $C_{\min}^{(8)}$ assembled in three steps and the 3-gram, while the string $C_{\min}^{(14)}$ can be assembled by joining two strings $C_{\min}^{(7)}$ made in the previous step. For any $d$, the shortest string (16) $C_{\min}^{(\tilde{N}_3)}$ can be assembled by joining the string $C_{\min}^{(2^{d-1})}$ (A13) assembled in $d - 1$ steps and the 3-gram, while the remaining strings $C_{\min}^{(\tilde{N}_3)}$ can be assembled by joining two strings made in a previous step $d - 1$, as shown in Figure 3b.

Appendix A.10. Proof of Lemma 3.2

We begin at $d = 4$ by assembling a $C_{\min}^{(13)}$ through $\{2, 4, (5, 8), 13\}$ with $a_{\min}^{(13)} = d_{\min}^{(13)} + 1 = 5$. For any $d$, the shortest string (17) $C_{\min}^{(\tilde{N}_5)}$ can be assembled by joining the string $C_{\min}^{(2^{d-1})}$ (A13) assembled in $d - 1$ steps with the 5-gram assembled in the independent assembly step, while the remaining strings $C_{\min}^{(\tilde{N}_5)}$ can be assembled by joining two strings made in a previous step $d - 1$, as shown in Figure 3c.

Appendix A.11. Proof of Lemma 3.3

We begin at $d = 5$ by assembling a $C_{\min}^{(25)}$ with $a_{\min}^{(25)} = d_{\min}^{(25)} + 1 = 6$. For any $d$, we assemble the shortest strings (18) as
$$\{2, 4, 8, (9, 16), 25\}, \{32, 41\}, \{64, 73\}, \{128, 137\}, \dots,$$
with one independent assembly step $(9, 16)$ to assemble the string of length $N = 2^{d-1}$ and joining the 9-gram at the last step, while the remaining strings $C_{\min}^{(\tilde{N}_9)}$ can be assembled by joining two strings made in a previous step $d - 1$, as shown in Figure 3d.

Appendix A.12. Proof of Theorem 3.6

The equality of ASI and ASD of the strings (20) follows from the proof of Theorem 3.4. Furthermore, $\forall d$,
$$\begin{aligned} 2^{d-1} &< 2^{d-1} + 7 \cdot 2^{d-4} < 2^d && /\, {-2^{d-1}} \\ 0 &< 7 \cdot 2^{d-4} < 2^d - 2^{d-1} = 2^{d-1} && /\, {\cdot\, 2^{4-d}} \\ 0 &< 7 < 2^{d-1}\, 2^{4-d} = 8 \end{aligned}$$
shows that $d_{a_{\min}}^{(\tilde{N}_7)} = \lceil \log_2(\tilde{N}_7) \rceil + 1$. Finally, $\hat{a}_{\min}^{(\tilde{N}_7)} = \lceil \log_2(\tilde{N}_7) \rceil + 2$ follows from the DPI Definition 2.7: six steps are required to reach one starting from fifteen, and one additional step each for thirty, sixty, etc., which completes the proof.
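The chain of inequalities in this proof can be checked numerically for the first few depths. This is a minimal sketch, assuming the lengths have the form $2^{d-1} + 7 \cdot 2^{d-4}$ that appears in the displayed inequality and that $d \geq 4$, so that the second term is an integer; `n_tilde_7` is a name introduced only for this example.

```python
def n_tilde_7(d: int) -> int:
    """Length 2**(d-1) + 7 * 2**(d-4), taken here for d >= 4."""
    if d < 4:
        raise ValueError("d must be at least 4")
    return 2 ** (d - 1) + 7 * 2 ** (d - 4)


for d in range(4, 10):
    n = n_tilde_7(d)
    # The length lies strictly between two consecutive powers of two,
    # so the ceiling of its binary logarithm equals d.
    assert 2 ** (d - 1) < n < 2 ** d
    assert (n - 1).bit_length() == d
    print(d, n)
```

The first lengths produced are 15, 30 and 60, the same lengths named at the end of the proof.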

Appendix A.13. Support for Conjecture Section 3

The shortest strings of length N ˜ 15 (22a) can be assembled with the pathways
{ 2 , 4 , ( 5 , 8 ) , 13 , 26 , 31 } { 39 , 47 } , { 78 , 79 } ,
shown in Figure 3e; the shortest strings of length N ˜ 27 (22b) can be assembled with the pathways
{ 2 , ( 3 , 4 ) , 7 , 14 , 28 , 31 , 59 } { 14 28 , 56 , 84 , 91 } { 11 18 , 36 , 72 , 144 , 155 }
shown in Figure 3f; and for any d, the shortest strings of length N ˜ 50.9 (22c) can be assembled as
{ 2 , 4 , 8 , ( 9 , 16 ) , 25 , 50 , 59 } , { 100 , 109 } , { 200 , 209 }
The remaining strings of length N ˜ 15 , N ˜ 27 , and N ˜ 50.9 (Section 3) can be assembled by joining two strings made in a previous step d 1 .

Appendix A.14. Proof of Lemma 3.4

By Theorems 3.1 and 3.2, a minASI string of length N = 2 s assembled beginning with C min ( 2 ) = [ * ] is a balanced bitstring. To assemble a longer string of other lengths, we assign N s + 1 = N s + 1 or N s + 1 = N s + 2 . However, the Definition 2.7 removes the longest string of an odd length N = 2 s + 1 from the sequence if it is not the first one in the sequence. Strings longer than this string of length N = 2 s + 1 are assembled by joining the longest string assembled so far with itself ( N s + 1 = 2 N s ) or by joining a basic symbol chosen to preserve the balance of the string ( N s + 1 = N s + 1 ).

Appendix A.15. The 2nd Method for Generating C (N-1) Strings

This method is similar to the 1st method discussed in Section 4. We also start with a string of clear 3-grams (26) and the matrix of 2-grams (27) with a crossed diagonal and the first superdiagonal. In the first step, we append the 2-gram $0\,(b-1)$ (the top right 2-gram of the matrix of 2-grams (27)) at the end of the string (26). Next, we generally perform the following pairs of iterations:
(1) we check subsequent subdiagonals until we find one that does not contain a 2-gram present in the string formed so far, we append it at the end of this string and proceed to step 2;
(2) we check subsequent superdiagonals until we find one that does not contain a 2-gram present in the string formed so far, we append it at the end of this string and proceed to step 1.
Finally, we append 0 if $b$ is even. The method is illustrated in Figure A1 and generates the $C_{(N-1)}$ strings in the form
$$\begin{aligned}
C_{(N-1)}^{(3,1)} &= [000], \\
C_{(N-1)}^{(7,2)} &= [000111 \mid 0], \\
C_{(N-1)}^{(13,3)} &= [000111222 \mid 02 \mid 10], \\
C_{(N-1)}^{(21,4)} &= [000111222333 \mid 03 \mid 102132 \mid 0], \\
C_{(N-1)}^{(31,5)} &= [000111222333444 \mid 04 \mid 10213243 \mid 0314 \mid 20], \\
C_{(N-1)}^{(43,6)} &= [000111222333444555 \mid 05 \mid 1021324354 \mid 031425 \mid 304152 \mid 0], \\
C_{(N-1)}^{(57,7)} &= [000111222333444555666 \mid 06 \mid 102132435465 \mid 03142536 \mid 405162 \mid 041526 \mid 30], \\
C_{(N-1)}^{(73,8)} &= [000111222333444555666777 \mid 07 \mid 10213243546576 \mid 0314253647 \mid 3041526374 \mid 051627 \mid 506172 \mid 0], \\
C_{(N-1)}^{(91,9)} &= [\dots \mid 08 \mid 1021324354657687 \mid 031425364758 \mid 304152637485 \mid 05162738 \mid 607182 \mid 061728 \mid 40], \\
C_{(N-1)}^{(111,10)} &= [\dots \mid 09 \mid 102132435465768798 \mid 03142536475869 \mid 30415263748596 \mid 0516273849 \mid 5061728394 \mid 071829 \mid 708192 \mid 0], \\
C_{(N-1)}^{(133,11)} &= [\dots \mid 0a \mid 102132435465768798a9 \mid 031425364758697a \mid 30415263748596a7 \mid 05162738495a \mid 60718293a4 \mid 061728394a \mid 8091a2 \mid 08192a \mid 50], \\
C_{(N-1)}^{(157,12)} &= [\dots \mid 0b \mid 102132435465768798a9ba \mid 031425364758697a8b \mid 30415263748596a7b8 \mid 05162738495a6b \mid 5061728394a5b6 \mid 0718293a4b \mid 708192a3b4 \mid 091a2b \mid 90a1b2 \mid 0], \\
C_{(N-1)}^{(183,13)} &= [\dots \mid 0c \mid 102132435465768798a9bacb \mid 031425364758697a8b9c \mid 30415263748596a7b8c9 \mid 05162738495a6b7c \mid 5061728394a5b6c7 \mid 0718293a4b5c \mid 8091a2b3c4 \mid 08192a3b4c \mid a0b1c2 \mid 0a1b2c \mid 60].
\end{aligned}$$
Figure A1. 2-gram matrices for $1 \le b \le 13$ that illustrate the generation of $C_{(N-1)}$ strings according to the 2nd method. Coloured 2-grams are appended to the initial string of clear 3-grams in the order indicated by arrows starting from the 1st column or row. Finally, 0 is appended at the end if $b$ is even.
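The shorter strings generated by the 2nd method can be screened for reusable repetitions. The sketch below is not the paper's ASI algorithm: it only tests the condition that no 2-gram occurs at two non-overlapping positions, which is taken here as an assumption about what allows a string of length N to retain an ASI of N - 1. The function name is hypothetical, and the strings are transcribed from the list above with the separators removed.

```python
def has_reusable_2gram(s: str) -> bool:
    """True if some 2-gram occurs at two non-overlapping positions of s."""
    first_seen = {}
    for i in range(len(s) - 1):
        gram = s[i:i + 2]
        # Occurrences at i and i + 1 overlap (as in 000), so they cannot be reused.
        if gram in first_seen and i - first_seen[gram] >= 2:
            return True
        first_seen.setdefault(gram, i)
    return False


listed = [
    "000",
    "000111|0",
    "000111222|02|10",
    "000111222333|03|102132|0",
    "000111222333444|04|10213243|0314|20",
]
for entry in listed:
    string = entry.replace("|", "")
    print(len(string), has_reusable_2gram(string))
```

Each of these strings prints `False`, whereas a string such as `0101` returns `True` because the 2-gram `01` can be reused.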

Appendix A.16. Method for Generating Non-Balanced C (N-b) Strings

The $C_{(N-b)}$ strings can also have the following non-balanced (e.g. $N_0 = 6$, $N_1 = 4$ for $C_{(N-2)}$) form
$$\begin{aligned}
C_{(N-1)}^{(3,1)} &= [000], \\
C_{(N-2)}^{(10,2)} &= [01 \mid 0000 \mid 1110], \\
C_{(N-3)}^{(19,3)} &= [01 \mid 0000 \mid 12 \mid 1111 \mid 2221020], \\
C_{(N-4)}^{(30,4)} &= [01 \mid 0000 \mid 12 \mid 1111 \mid 23 \mid 2222 \mid 333102132030], \\
C_{(N-5)}^{(43,5)} &= [01 \mid 0000 \mid 12 \mid 1111 \mid 23 \mid 2222 \mid 34 \mid 3333 \mid 444 \mid 1021324303142040], \\
C_{(N-6)}^{(58,6)} &= [01 \mid 0000 \mid 12 \mid 1111 \mid 23 \mid 2222 \mid 34 \mid 3333 \mid 45 \mid 4444 \mid 555 \mid 1021324354031425041520530], \\
C_{(N-7)}^{(75,7)} &= [01 \mid 0000 \mid 12 \mid 1111 \mid 23 \mid 2222 \mid 34 \mid 3333 \mid 45 \mid 4444 \mid 56 \mid 5555 \mid 666102132435465031425360415262064051630], \\
C_{(N-8)}^{(94,8)} &= [01 \mid 0000 \mid 12 \mid 1111 \mid 23 \mid 2222 \mid 34 \mid 3333 \mid 45 \mid 4444 \mid 56 \mid 5555 \mid 67 \mid 6666 \mid 7771021324354657603142536470415263720750516273061740], \\
C_{(N-b)}^{(b^2 + 4b - 2,\, b)} &= [01 \mid 0000 \mid 12 \mid 1111 \mid \dots \mid (b-2)(b-1) \mid (b-2)(b-2)(b-2)(b-2) \mid (b-1)(b-1)(b-1)\, 1021 \dots (b-1)(b-2) \dots 0].
\end{aligned}$$
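The length formula $b^2 + 4b - 2$ and the symbol counts quoted for $b = 2$ can be confirmed directly on the shortest listed strings. This is a small consistency check rather than an ASI computation; the dictionary below transcribes the strings for $b = 2, 3, 4$ from the list above, and the helper name is an assumption.

```python
from collections import Counter

# Strings for b = 2, 3, 4 as listed above; "|" only marks the blocks.
non_balanced = {
    2: "01|0000|1110",
    3: "01|0000|12|1111|2221020",
    4: "01|0000|12|1111|23|2222|333102132030",
}


def expected_length(b: int) -> int:
    """Length b**2 + 4*b - 2 of the non-balanced string over b symbols."""
    return b * b + 4 * b - 2


for b, entry in non_balanced.items():
    string = entry.replace("|", "")
    assert len(string) == expected_length(b)
    print(b, len(string), dict(Counter(string)))
```

For $b = 2$ the counter reports six zeros and four ones, the non-balanced split mentioned in the text.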

References

  1. Wootters, WK; Zurek, WH. A single quantum cannot be cloned. Nature 1982, 299(5886), 802–3. Available online: http://www.nature.com/articles/299802a0. [CrossRef]
  2. Marshall, SM; Murray, ARG; Cronin, L. A probabilistic framework for identifying biosignatures using Pathway Complexity. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences Available from. 2017, 375(2109), 20160342. [Google Scholar] [CrossRef] [PubMed]
  3. Imari Walker, S; Cronin, L; Drew, A; Domagal-Goldman, S; Fisher, T; Line, M. Probabilistic Biosignature Frameworks. In Planetary Astrobiology; Meadows, V, Arney, G, Schmidt, B, Des Marais, DJ, Eds.; University of Arizona Press, 2019; pp. 1–1. Available online: https://uapress.arizona.edu/book/planetary-astrobiology.
  4. Planetary astrobiology. In University of Arizona space science series; Meadows, VS, Arney, GN, Schmidt, BE, Des Marais, DJ, Eds.; The University of Arizona Press: Tucson; Lunar and Planetary Institute: Houston, 2020. [Google Scholar]
  5. Liu, Y; Mathis, C; Bajczyk, MD; Marshall, SM; Wilbraham, L; Cronin, L. Exploring and mapping chemical space with molecular assembly trees. Science Advances Available from. 2021, 7(39), eabj2465. [Google Scholar] [CrossRef] [PubMed]
  6. Marshall, SM; Mathis, C; Carrick, E; Keenan, G; Cooper, GJT; Graham, H; et al. Identifying molecules as biosignatures with assembly theory and mass spectrometry. Nature Communications 2021, 12(1), 3033. Available online: https://www.nature.com/articles/s41467-021-23258-x. [CrossRef]
  7. Marshall, SM; Moore, DG; Murray, ARG; Walker, SI; Cronin, L. Formalising the Pathways to Life Using Assembly Spaces. Entropy 2022, 24(7), 884. Available online: https://www.mdpi.com/1099-4300/24/7/884. [CrossRef]
  8. Sharma, A; Czégel, D; Lachmann, M; Kempes, CP; Walker, SI; Cronin, L. Assembly theory explains and quantifies selection and evolution. Nature 2023, 622(7982), 321–8. Available online: https://www.nature.com/articles/s41586-023-06600-9. [CrossRef]
  9. Jirasek, M; Sharma, A; Bame, JR; Mehr, SHM; Bell, N; Marshall, SM; et al. Investigating and Quantifying Molecular Complexity Using Assembly Theory and Spectroscopy. ACS Central Science Available from. 2024, 10(5), 1054–64. [Google Scholar] [CrossRef]
  10. Łukaszyk, S; Bieniawski, W. Assembly Theory of Binary Messages. Mathematics 2024, 12(10), 1600. Available online: https://www.mdpi.com/2227-7390/12/10/1600. [CrossRef]
  11. Raubitzek, S; Schatten, A; König, P; Marica, E; Eresheim, S; Mallinger, K. Autocatalytic Sets and Assembly Theory: A Toy Model Perspective. Entropy 2024, 26(9), 808. Available online: https://www.mdpi.com/1099-4300/26/9/808. [CrossRef]
  12. Łukaszyk, S. On the "Assembly Theory and its Relationship with Computational Complexity". 2024. Available online: https://www.preprints.org/manuscript/202412.1492/v1.
  13. Patarroyo, KY; Sharma, A; Seet, I; Packmore, I; Walker, SI; Cronin, L. Quantifying the Complexity of Materials with Assembly Theory. arXiv 2025, arXiv:2502.09750. Available online: http://arxiv.org/abs/2502.09750.
  14. Masierak, P. Computational Complexity of Determining the Assembly Index. 2025. [CrossRef]
  15. Ziv, J; Lempel, A. Compression of individual sequences via variable-rate coding. IEEE Transactions on Information Theory 1978, 24(5), 530–6. Available online: https://ieeexplore.ieee.org/document/1055934. [CrossRef]
  16. Storer, JA; Szymanski, TG. Data compression via textual substitution. Journal of the ACM 1982, 29(4), 928–51. [Google Scholar] [CrossRef]
  17. Welch, TA. A Technique for High-Performance Data Compression. Computer 1984, 17(6), 8–19. Available online: https://ieeexplore.ieee.org/document/1659158. [CrossRef]
  18. Charikar, M; Lehman, E; Liu, D; Panigrahy, R; Prabhakaran, M; Sahai, A; et al. The Smallest Grammar Problem. IEEE Transactions on Information Theory 2005, 51(7), 2554–76. Available online: http://ieeexplore.ieee.org/document/1459058/. [CrossRef]
  19. Kieffer, JC; Yang, En-Hui. Grammar-based codes: a new class of universal lossless source codes. IEEE Transactions on Information Theory 2000, 46(3), 737–54. Available online: http://ieeexplore.ieee.org/document/841160/. [CrossRef]
  20. Kieffer, J; Yang, En-hui; Park, T; Yakowitz, S. Complexity of preprocessor in MPM data compression system. Proceedings DCC ’98 Data Compression Conference (Cat. No.98TB100225), 1998; IEEE Comput. Soc: Snowbird, UT, USA; p. 554. Available online: http://ieeexplore.ieee.org/document/672292/.
  21. Lehman, E. Approximation Algorithms for Grammar-Based Data Compression. Ph.D. Thesis, Massachusetts Institute of Technology (MIT), 2002. Available online: https://compression.ru/download/articles/grammar/lehman_phd_2002_approximation_algorithms.pdf.
  22. Kieffer, JC; Eh, Yang. Compression and Explanation using Hierarchical Grammars. The Computer Journal 2000, 43(3), 212–22. Available online: https://www.researchgate.net/publication/2826982_1_INTRODUCTION_Compression_and_Explanation_using_Hierarchical_Grammars.
  23. Kieffer, JC; Yang, En-Hui; Nelson, GJ; Cosman, P. Universal lossless compression via multilevel pattern matching. IEEE Transactions on Information Theory 2000, 46(4), 1227–45. Available online: http://ieeexplore.ieee.org/document/850665/. [CrossRef]
  24. Kieffer, J; Flajolet, P; Yang, Eh. Universal Lossless Data Compression Via Binary Decision Diagrams. arXiv 2011, 1111.1432. Available online: http://arxiv.org/abs/1111.1432. [CrossRef]
  25. Nevill-Manning, CG. Compression and Explanation using Hierarchical Grammars. The Computer Journal 1997, 40(2 and 3), 103–16. Available online: https://academic.oup.com/comjnl/article-lookup/doi/10.1093/comjnl/40.2_and_3.103. [CrossRef]
  26. Larsson, NJ; Moffat, A. Offline dictionary-based compression. Proceedings DCC’99 Data Compression Conference (Cat. No. PR00096), 1999; pp. 296–305. Available online: https://ieeexplore.ieee.org/document/755679, ISSN 1068-0314.
  27. Larsson, NJ; Moffat, A. Off-line dictionary-based compression. Proceedings of the IEEE 2000, 88(11), 1722–32. Available online: http://ieeexplore.ieee.org/document/892708/. [CrossRef]
  28. Nevill-Manning, C; Witten, I. Compression and Explanation using Hierarchical Grammars. In The Computer Journal; Source; CiteSeer, 1999; Volume 40, 2. [Google Scholar]
  29. Nevill-Manning, CG; Witten, IH. Identifying Hierarchical Structure in Sequences: A linear-time algorithm. Journal of Artificial Intelligence Research 1997, 7, 67–82. Available online: https://jair.org/index.php/jair/article/view/10192. [CrossRef]
  30. Apostolico, A; Lonardi, S. Off-line compression by greedy textual substitution. Proceedings of the IEEE 2000, 88(11), 1733–44. Available online: http://ieeexplore.ieee.org/document/892709/. [CrossRef]
  31. Apostolico, A; Lonardi, S. Compression of biological sequences by greedy off-line textual substitution. Proceedings DCC 2000. Data Compression Conference, Snowbird, UT, USA, 2000; IEEE Comput. Soc; pp. 143–52. Available online: http://ieeexplore.ieee.org/document/838154/.
  32. Sakamoto, H; Maruyama, S; Kida, T; Shimozono, S. A Space-Saving Approximation Algorithm for Grammar-Based Compression. IEICE Transactions on Information and Systems 2009, E92-D(2), 158–65. Available online: http://www.jstage.jst.go.jp/article/transinf/E92.D/2/E92.D_2_158/_article. [CrossRef]
  33. Takabatake, Y; I, T; Sakamoto, H. A Space-Optimal Grammar Compression. LIPIcs, Volume 87, ESA 2017; 2017; pp. 67:1–67:15. Available online: https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.ESA.2017.67.
  34. Pagel, S; Sharma, A; Cronin, L. Mapping Evolution of Molecules Across Biochemistry with Assembly Theory. 2024. Available online: https://arxiv.org/abs/2409.05993.
  35. Knuth, DE. The Art of Computer Programming, Volume 2: Seminumerical Algorithms, 3rd ed., forty-first printing; Addison-Wesley: Boston, 2021. [Google Scholar]
  36. Clift, NM. Calculating optimal addition chains. Computing 2011, 91(3), 265–84. [Google Scholar] [CrossRef]
  37. Cronin, L. Exploring assembly index of strings is a good way to show why assembly & entropy are intrinsically different. 2024. Available online: https://x.com/leecronin/status/1850289225935257665.
  38. Łukaszyk, S. Black Hole Horizons as Patternless Binary Messages and Markers of Dimensionality. In Future Relativity, Gravitation, Cosmology; Nova Science Publishers, 2023; Chapter 15, pp. 317–74. Available online: https://novapublishers.com/shop/future-relativity-gravitation-cosmology/.
  39. Łukaszyk, S. Life as the Explanation of the Measurement Problem. Journal of Physics: Conference Series 2024, 2701(1), 012124. [Google Scholar] [CrossRef]
  40. Łukaszyk, S. Black hole merger as an event converting two qubits into one. Frontiers in Quantum Science and Technology 2025, 4, 1656200. Available online: https://www.frontiersin.org/articles/10.3389/frqst.2025.1656200/full. [CrossRef]
  41. Gabric, D; Shallit, J; Zhong, XF. Avoidance of split overlaps. Discrete Mathematics 2021, 344(2), 112176. Available online: https://linkinghub.elsevier.com/retrieve/pii/S0012365X20303629. [CrossRef]
  42. Guibas, LJ; Odlyzko, AM. String overlaps, pattern matching, and nontransitive games. Journal of Combinatorial Theory, Series A 1981, 30(2), 183–208. Available online: https://linkinghub.elsevier.com/retrieve/pii/0097316581900054. [CrossRef]
  43. Ozelim, L; Uthamacumaran, A; Abrahão, FS; Hernández-Orozco, S; Kiani, NA; Tegnér, J. Assembly Theory Reduced to Shannon Entropy and Rendered Redundant by Naive Statistical Algorithms. arXiv. 2025. Available online: http://arxiv.org/abs/2408.15108.
  44. Abrahão, FS; Hernández-Orozco, S; Kiani, NA; Tegnér, J; Zenil, H. Assembly Theory is an approximation to algorithmic complexity based on LZ compression that does not explain selection or evolution. PLOS Complex Systems 2024, 1(1), e0000014. Available online: https://journals.plos.org/complexsystems/article?id=10.1371/journal.pcsy.0000014. [CrossRef]
  45. Uthamacumaran, A; Abrahão, FS; Kiani, NA; Zenil, H. On the salient limitations of the methods of assembly theory and their classification of molecular biosignatures. npj Systems Biology and Applications 2024, 10(1), 82. Available online: https://www.nature.com/articles/s41540-024-00403-y. [CrossRef]
  46. Vimal, D; Parzych, G; Smith, OM; Parkar, D; Bergen, S; Daymude, JJ. Open, Reproducible Calculation of Assembly Indices. arXiv 2025, arXiv:2507.08852v1. Available online: http://arxiv.org/abs/2507.08852.
  47. Flamm, C; Merkle, D; Stadler, PF. Assembly in Directed Hypergraphs. Proceedings of the Royal Society A: Mathematical, Physical and Engineering Sciences 2025, 481(2324), 20250331. [Google Scholar] [CrossRef]
  48. Kempes, CP; Lachmann, M; Iannaccone, A; MF, G; RC, M; Walker, SI; et al. Assembly theory and its relationship with computational complexity. npj Complexity 2025, 2(1), 27. [Google Scholar] [CrossRef]
  49. Gebhard, TD; Bell, A; Gong, J; Hastings, JJA; Fricke, GM; Cabrol, N; et al. Inferring molecular complexity from mass spectrometry data using machine learning. Machine Learning and the Physical Sciences workshop, NeurIPS 2022, 2022. [Google Scholar]
  50. Cronin, L; Parra, JCM; Patarroyo, KY. Assembly Addition Chains. arXiv. 2025. Available online: https://arxiv.org/abs/2512.18030.
  51. Krzyżanowski, W. Procesy ewolucji kulturowej muzyki w środowisku technologii cyfrowych [Processes of the cultural evolution of music in the digital technology environment]. Doctoral dissertation, Adam Mickiewicz University in Poznań, Faculty of Arts Studies, Poznań, 2025. Written under the supervision of Prof. UAM Dr. hab. Piotr Podlipniak.
  52. Vopson, MM. The second law of infodynamics and its implications for the simulated universe hypothesis. AIP Advances 2023, 13(10), 105308. Available online: https://pubs.aip.org/adv/article/13/10/105308/2915332/The-second-law-of-infodynamics-and-its. [CrossRef]
  53. Mugur-Schachter, M. On a Crucial Problem in Probabilities and Solution. arXiv. 2008. Available online: https://arxiv.org/abs/0801.2654.
1 Sixteen if we relax Definition 2.5 (cf. Figure 9b).
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
Copyright: This open access article is published under a Creative Commons CC BY 4.0 license, which permits free download, distribution, and reuse, provided that the author and preprint are cited in any reuse.