Preprint
Article

This version is not peer-reviewed.

On the Salient Regularities of Strings of Assembly Theory

Submitted: 22 October 2024

Posted: 23 October 2024


Abstract
Using assembly theory of strings of any natural radix $b$ we find some of their salient regularities. In particular, we show that the upper bound of the assembly index depends quantitatively on the radix $b$ and the longest length $N$ of a string that has the assembly index of $N-k$ is given by $N_{(N-k)}=b^2+b+3k-2$ for $k \in \{1,2\}$ and by $N_{(N-3)}=b^2+b+6$. We also provide particular forms of such strings. Knowing the latter bound, we conjecture that the maximum assembly index of a string of length $N>N_{(N-2)}$ is given by $a_{\text{max}}^{(N,b)} = \lfloor N/2 \rfloor + b(b+1)/2$. For $k=1$ such odd length strings are nearly balanced and there are four such different strings if $b=2$ and seventy-two if $b=3$. We also show that a string containing $k$ copies of an $n$-plet has the assembly index of at most $N-1 - [(n-1)k - a]$, where $a$ is the assembly index of this $n$-plet. Finally, we show that the assembly depth of a minimum assembly index string is equal to the assembly index of this string, while the assembly depth of a maximum assembly index string has a value between $\left\lceil \log_2(N) \right\rceil$ and the assembly index of this string. Since these results are also valid for $b=1$, assembly theory subsumes information theory.

1. Introduction

Assembly theory (AT), formulated in 2017, introduced the concept of an initial pool [1].
Definition 1.
We call a set $P_0^{(b)} \equiv \{0, 1, \dots, b-1\}$ that contains $b \in \mathbb{N}$ different basic symbols $c$ the initial assembly pool.
The reader will find numerous results on AT in refs. [1,2,3,4,5,6,7,8,9,10], for example. Here, we extend the results of our previous study [9] concerning bitstrings to strings of any natural radix $b$. We consider the formation of strings $C_k^{(N,b)}$ of length $N$ containing symbols from the initial assembly pool $P_0^{(b)}$ within the AT framework in consecutive assembly steps from basic symbols $c$ and strings (doublets, triplets, $n$-plets) assembled in previous steps.
In fact, any embodiment of AT, with basic symbols representing LEGO® blocks, chemical bonds, graphs, monomers, etc. assembled in any $n$-dimensional space ($n \in \mathbb{C}$) [11], corresponds to the string AT version. This is because in AT an assembly step always consists in joining only two parts, which can be thought of as the left and right fragments of the newly formed string. Put simply, AT explains and quantifies selection and evolution [7], but it is through the word (aka string or message), in particular a nucleotide sequence in the case of $b = 4$, that all AT things come into existence [12].
Definition 2.
We call a set $P_s^{(b)}$ that contains basic symbols and strings assembled in previous steps $\{1, 2, \dots, s-1\}$ the working assembly pool.
An assembly step $s$ may consist of
$$c_1 \circ c_2 = C_k^{(2,b)}, \quad C_l^{(N_l,b)} \circ c_2 = C_k^{(N_l+1,b)}, \quad c_1 \circ C_m^{(N_m,b)} = C_k^{(1+N_m,b)}, \quad C_l^{(N_l,b)} \circ C_m^{(N_m,b)} = C_k^{(N_l+N_m,b)}, \tag{1}$$
where $c_1, c_2 \in P_0^{(b)}$, $C_l^{(N_l,b)}, C_m^{(N_m,b)} \in P_{s-1}^{(b)}$, and $C_k \in P_s^{(b)}$. We note that the joining operator "$\circ$", in general, does not commute. Using Definitions 1 and 2, the assembly index (ASI) of a string is the minimal achievable value of the difference between the cardinalities of the working and initial assembly pools leading to this string, since at each assembly step the cardinality of the working assembly pool increases by one. Therefore, the working assembly pool (Definition 2) cannot be identified with the initial assembly pool (Definition 1); the initial assembly pool must not contain strings of basic symbols (see Section H).

2. Results

Theorems 1 and 2 were already stated in our previous study [9] for $b = 2$. We restate them here for all $b$ for clarity.
Theorem 1.
A quadruplet is the shortest string that allows for more than one ASI for all b.
Proof. 
$N = 2$ provides $b^2$ available doublets with unit ASI. $N = 3$ provides $b^3$ available triplets with ASI equal to two. Only $N = 4$ provides $b^4$ quadruplets that include $b$ quadruplets $C_{k,\min}^{(4,b)} = [****]$ and $b(b-1)$ quadruplets $C_{l,\min}^{(4,b)} = [*\star*\star]$ with ASI equal to two, while the ASI of the remaining quadruplets is three. For example, to assemble the quadruplet $C_{k,\min}^{(4,4)} = [0202]$, we need to assemble the doublet $[02]$ and reuse it from the first step pool $P_1$, while there is nothing available to reuse in the case of the quadruplet $C_l^{(4,4)} = [0123]$. □
Where the symbol value can be arbitrary, we write $*$, assuming that it is the same within the string. If we allow for a second possibility different from $*$, we write $\star$. Furthermore, we consider the degenerate case of just one basic symbol ($b = 1$).
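The counts in the proof of Theorem 1 can be checked for short strings by exhaustive search. The following Python sketch is not the authors' software: the function name `assembly_index` is hypothetical, symbols are assumed to be single characters, and the search relies on the observation that every intermediate string of a minimal pathway must be a contiguous substring of the target. The cost grows quickly with the length, so it is only practical for short strings.

```python
from itertools import product


def assembly_index(target: str) -> int:
    """Brute-force ASI of a short string (hypothetical helper, not from the paper).

    Breadth-first search over working pools: each step joins two members of
    the pool (basic symbols or previously assembled strings) into a new
    string that must occur as a contiguous substring of the target.
    """
    n = len(target)
    if n <= 1:
        return 0
    # All substrings of length >= 2; only these can be useful intermediates.
    substrings = {target[i:j] for i in range(n) for j in range(i + 2, n + 1)}
    symbols = frozenset(target)  # basic symbols actually used by the target
    frontier = {frozenset()}     # sets of strings assembled so far
    steps = 0
    while True:
        steps += 1
        next_frontier = set()
        for built in frontier:
            pool = symbols | built
            for left, right in product(pool, repeat=2):
                joined = left + right
                if joined == target:
                    return steps
                if joined in substrings and joined not in built:
                    next_frontier.add(built | {joined})
        frontier = next_frontier


if __name__ == "__main__":
    for s in ("0202", "0000", "0123"):
        print(s, assembly_index(s))
```

Under these assumptions the quadruplets $[0202]$ and $[0000]$ evaluate to two steps and $[0123]$ to three, in line with the proof above.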
Theorem 2.
The smallest ASI $a^{(N)}(C_{\min})$ as a function of $N$ corresponds to the shortest addition chain for $N$ (OEIS A003313) for all $b$.
Proof. 
Strings $C_{\min}$ for which $a^{(N)}(C_{\min}) = \min_k \{a^{(N,b)}(C_k)\}$, $k \in \{1, 2, \dots, b^N\}$, can be formed in subsequent steps $s$ by joining the longest string assembled so far with itself until $N = 2^s$ is reached. Therefore, if $N = 2^s$, then $\min_k \{a^{(2^s)}(C_k)\} = s = \log_2(N)$. Only $b^2$ strings have such ASI if $N = 2^s$, including respectively $b$ and $b(b-1)$ strings
$$C_k^{(2^s,b)} = [** \dots], \quad C_l^{(2^s,b)} = [*\star*\star \dots], \tag{2}$$
and the assembly pathway of each of the strings (2) is unique. At each assembly step, its length doubles.
An addition chain for $N \in \mathbb{N}$ having the shortest length $s \in \mathbb{N}$ (commonly denoted as $l(N)$) is defined as a sequence $1 = a_0 < a_1 < \dots < a_s = N$ of integers such that $\forall j \ge 1$, $a_j = a_k + a_l$ for $l \le k < j$. The first step in creating an addition chain for $N$ is always $a_1 = 1 + 1 = 2$ and this corresponds to assembling a doublet $[**]$ or $[*\star]$ from the initial assembly pool $P_0^{(b)}$. Thus, the lower bound for $s$ of the addition chain for $N$, $s \ge \log_2(N)$, is achieved for $N = 2^s$ by $b^2$ strings (2).
The second step in creating an addition chain can be $a_2 = 1 + 1 = 2$ or $a_2 = 1 + 2 = 3$. Thus, finding the shortest addition chain for $N$ corresponds to finding the ASI of a string containing basic symbols and/or doublets and/or triplets containing these doublets for $N \ne 2^s$, since due to Theorem 1 only they provide the same assembly indices $\{0, 1, 2\}$. □
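As an illustration of Theorem 2, the shortest addition chain length $l(N)$ can be computed for small $N$ by iterative deepening. This is a minimal sketch, not the authors' code; the function name is hypothetical, and the search is restricted to strictly increasing chains, which is the form used in the definition above.

```python
def shortest_addition_chain_length(n: int) -> int:
    """Length l(n) of a shortest addition chain for n (OEIS A003313).

    Iterative-deepening depth-first search over strictly increasing chains
    1 = a_0 < a_1 < ... < a_s = n, where every a_j is a sum of two earlier
    (not necessarily distinct) elements.
    """
    if n < 1:
        raise ValueError("n must be a positive integer")
    if n == 1:
        return 0

    def search(chain: list[int], limit: int) -> bool:
        last = chain[-1]
        if last == n:
            return True
        steps_left = limit - (len(chain) - 1)
        # Each step can at most double the largest element.
        if steps_left == 0 or (last << steps_left) < n:
            return False
        tried = set()
        for i in range(len(chain) - 1, -1, -1):
            for j in range(i, -1, -1):
                candidate = chain[i] + chain[j]
                if last < candidate <= n and candidate not in tried:
                    tried.add(candidate)
                    chain.append(candidate)
                    if search(chain, limit):
                        return True
                    chain.pop()
        return False

    limit = (n - 1).bit_length()  # ceil(log2(n)) for n >= 2
    while not search([1], limit):
        limit += 1
    return limit


if __name__ == "__main__":
    print([shortest_addition_chain_length(n) for n in range(1, 17)])
```

For $N = 2^s$ the result equals $s$, which is the lower bound $\log_2(N)$ reached by the strings described in the proof.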
At least some of the following seven simple theorems are useful for further consideration.
Theorem 3.
The strings $C_{\min}^{(2^s,b)}$ can contain at most two symbols if $b > 1$. Other minimal ASI strings of length $N \ne 2^s$ can contain at most three symbols if $b > 2$.
Proof. 
Minimal ASI strings of length N = 2 s are formed by joining the newly assembled string to itself, where a clear or mixed doublet is created in the first step. Minimal ASI strings of other lengths admit a doublet and a triplet containing this doublet and an additional basic symbol.
To formally prove the first part, we can also use mathematical induction on the assembly step $s$. If $s = 1$, then the minimal strings $C_{\min}^{(2,b)}$ are doublets of the form $[c_1 c_2]$, where $c_1, c_2 \in P_0^{(b)}$. If $c_1 = c_2$, the string contains one distinct symbol, and if $c_1 \ne c_2$, the string contains two distinct symbols. In both cases, the number of distinct symbols does not exceed two. Now assume that for some $k \in \mathbb{N}$, all minimal strings $C_{\min}^{(2^k,b)}$ contain at most two distinct symbols. We must show that $C_{\min}^{(2^{k+1},b)}$ also contains at most two distinct symbols. Consider constructing $C_{\min}^{(2^{k+1},b)}$ by joining two identical minimal strings $C_{\min}^{(2^k,b)}$
$$C_{\min}^{(2^k,b)} \circ C_{\min}^{(2^k,b)} = C_{\min}^{(2^{k+1},b)}, \tag{3}$$
with each other. By the inductive hypothesis, each $C_{\min}^{(2^k,b)}$ contains at most two distinct symbols. Therefore, their concatenation also contains at most two distinct symbols. By induction, for all $s \in \mathbb{N}$, the minimal string $C_{\min}^{(2^s,b)}$ contains at most two distinct symbols.
We will now show that other minimal ASI strings of length $N \ne 2^s$ can contain at most three distinct symbols if $b > 2$. We provide the construction of minimal ASI strings with three symbols. In the first step $s = 1$, we create a doublet $[c_1 c_2]$ where $c_1, c_2 \in P_0^{(b)}$ and $c_1 \ne c_2$. Next, we combine the existing doublet $[c_1 c_2]$ with a new symbol $c_3 \in P_0^{(b)}$ where $c_3 \notin \{c_1, c_2\}$. This forms a triplet $[c_1 c_2 c_3]$, introducing a third distinct symbol and further increasing the ASI by 1. We continue assembling by joining the longest string formed so far with itself or with previously formed strings, maintaining the minimal increase in ASI.
Assume a contrario that there exists a minimal ASI string $C_{\min}^{(N,b)}$ of length $N \ne 2^s$ that contains four or more distinct symbols. To incorporate a fourth symbol, at least one additional assembly step is required beyond what is needed for the three symbols. This additional step implies an increase in ASI, which contradicts the minimality of $C_{\min}^{(N,b)}$. Thus, Theorem 3 is proven. □
Theorem 4.
A string containing the same three doublets has the same ASI as a string containing two pairs of the same doublets, provided that both strings have the same distributions of other repetitions and have the same lengths.
Proof. 
Without loss of generality (w.l.o.g.), consider the following two strings of the same length $N + 8$ with the same distributions of other repetitions (if there are any other repetitions)
$$C_k = [\dots 01 \dots 01 \dots 01 \dots * \dots], \quad C_l = [\dots 01 \dots 01 \dots 22 \dots 22 \dots], \tag{4}$$
where $* \ne 01$. Creating a doublet takes one assembly step. Each appending of a doublet to an assembled string counts as another assembly step. Hence, in a general case (i.e., for strings $C_k$, $C_l$ containing also other symbols), the string $C_k$ requires six additional assembly steps, the same as the string $C_l$, which completes the proof. □
Theorem 5.
A string containing the same three doublets has the same ASI as a string containing the same two triplets, provided that both strings have the same distributions of other repetitions.
Proof. 
W.l.o.g. consider the following two strings of the same length $N + 6$ with the same distributions of other repetitions
$$C_k = [\dots 01 \dots 01 \dots 01 \dots], \quad C_l = [\dots 010 \dots 010 \dots]. \tag{5}$$
Creating a triplet takes two assembly steps. Hence, in the general case, the string $C_k$ requires four additional assembly steps, the same as the string $C_l$, which completes the proof. □
Theorem 6.
A string containing the same two triplets has the same ASI as a string containing two pairs of the same doublets, provided that both strings have the same distributions of other repetitions and have the same lengths.
Proof. 
The proof stems from Theorems 4 and 5. □
Theorem 7.
A string containing the same two quadruplets of the minimum ASI has the same ASI as a string containing the same three triplets, provided that both strings have the same distributions of other repetitions and have the same lengths.
Proof. 
W.l.o.g. consider the following two strings of the same length $N + 9$ with the same distributions of other repetitions
$$C_k = [\dots 0101 \dots 0101 \dots], \quad C_l = [\dots 010 \dots 010 \dots 010 \dots]. \tag{6}$$
Creating such a quadruplet takes two assembly steps. Hence, in a general case, the string $C_k$ requires five additional assembly steps, the same as the string $C_l$, which completes the proof. □
Theorem 8.
A string containing the same two quadruplets of the maximum ASI has the same ASI as a string containing a doublet and the same two triplets based on this doublet, provided that both strings have the same distributions of other repetitions.
Proof. 
W.l.o.g. consider the following two strings of the same length $N + 8$ with the same distributions of other repetitions
$$C_k = [\dots 0001 \dots 0001 \dots], \quad C_l = [\dots 110 \dots 10 \dots 110 \dots]. \tag{7}$$
Creating such a quadruplet takes three assembly steps. Hence, in a general case, the string $C_k$ requires five additional assembly steps, the same as the string $C_l$, which completes the proof. □
Theorem 9.
A string containing the same two doublets and the same two triplets not based on this doublet has the same ASI as a string containing a doublet and the same two triplets based on this doublet, provided that both strings have the same distributions of other repetitions and have the same lengths.
Proof. 
W.l.o.g. consider the following two strings of the same length $N + 10$ with the same distributions of other repetitions
$$C_k = [\dots 110 \dots 00 \dots 110 \dots 00 \dots], \quad C_l = [\dots 110 \dots 10 \dots 110 \dots * \dots], \tag{8}$$
where $* \notin \{11, 10\}$. In a general case, the string $C_k$ requires seven additional assembly steps, the same as the string $C_l$, which completes the proof. □
In general, Theorems 1-9 show that
  • $k$ copies of a doublet in a string decrease the ASI of this string at least by $k - 1$;
  • $k$ copies of a triplet in a string decrease the ASI of this string at least by $2k - 2$;
  • $k$ copies of a minimum ASI quadruplet in a string decrease the ASI of this string at least by $3k - 2$;
  • $k$ copies of a maximum ASI quadruplet in a string decrease the ASI of this string at least by $3k - 3$;
where the phrase "at least" is meant to indicate that other repetitions, e.g. doublets forming multiple quadruplets, can further decrease the ASI of the string. This observation allows us to state the following theorem.
Theorem 10.
A string $C_m^{(N,b)}$ containing $k_1$ copies of an $n_1$-plet $C_1^{(n_1,b)}$, $k_2$ copies of an $n_2$-plet $C_2^{(n_2,b)}$, etc. has the ASI of at most
$$a^{(N,b)}(C_m) = N - 1 - \sum_{r=1}^{R} \left[ (n_r - 1) k_r - a_{C_r}^{(n_r,b)} \right], \tag{9}$$
where $R$ is the total number of repeated $n_r$-plets.
Proof. 
W.l.o.g. consider the following string
$$C_m^{(N,b)} = [\dots [c_1 c_2 \dots c_n] \dots [c_1 c_2 \dots c_n] \dots], \tag{10}$$
containing two copies of an $n$-plet $C_l^{(n,b)} = [c_1 c_2 \dots c_n]$. The $n$-plet $C_l^{(n,b)}$ can be assembled in $a_{C_l}^{(n,b)}$ steps and appended to the assembled string $C_m$ in one step. Consider that the ASI of the $n$-plet $C_l^{(n,b)}$ is $a_{C_l}^{(n,b)} = n - 1$, i.e. the $n$-plet does not have any repetitions that can be reused. Then one copy of this $n$-plet, as expected, does not decrease the ASI of the string $C_m^{(N,b)}$, as $(n - 1) \cdot 1 - (n - 1) = 0$, while more copies $k$ decrease it by $(n - 1)(k - 1)$. On the other hand, if $a_{C_l}^{(n,b)} < n - 1$, then even a single copy of this $n$-plet will decrease the ASI of $C_m$. □
For example, due to the presence of three copies of the 5-plet $[01001]$, each with $a^{(5,6)}([01001]) = 3$, in a string
$$C_k^{(24,6)} = [120100121010012350100152], \tag{11}$$
its ASI amounts to $a^{(24,6)}(C_k) = 24 - 1 - (4 \cdot 3 - 3) = 14$. We note that the maximum ASI decrease is provided by $2^s$-plets of the minimum ASI and amounts to $(n - 1)k - \log_2(n) = (2^s - 1)k - s$.
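The arithmetic of this example can be reproduced with a few lines of Python. This is only a sketch of the bound of Theorem 10; the function name `asi_upper_bound` and the `(n, k, a)` tuple layout are assumptions introduced for illustration.

```python
def asi_upper_bound(length: int, repeats: list[tuple[int, int, int]]) -> int:
    """Upper bound on the ASI following Theorem 10.

    `repeats` lists one (n, k, a) triple per repeated n-plet:
    n = length of the n-plet, k = number of its copies in the string,
    a = assembly index of the n-plet itself.
    """
    return length - 1 - sum((n - 1) * k - a for n, k, a in repeats)


if __name__ == "__main__":
    string = "120100121010012350100152"
    copies = string.count("01001")  # non-overlapping occurrences of the 5-plet
    print(copies, asi_upper_bound(len(string), [(5, copies, 3)]))
```

With the string of the example, `copies` is 3 and the bound evaluates to 14. An $n$-plet without internal repetitions ($a = n - 1$) that occurs once leaves the bound at $N - 1$.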
Another quantity related to the string assembly is the assembly depth (AD) defined [13] as
$$d_s^{(N,b)} \equiv \max\left( d_{(s-1)L}^{(N,b)}, d_{(s-1)R}^{(N,b)} \right) + 1, \tag{12}$$
where $d_{(s-1)L}^{(N,b)}$ and $d_{(s-1)R}^{(N,b)}$ are the ADs of the two parts of this string that were joined in step $s$, and where $d_0^{(N,b)} \equiv 0$. If there are several assembly paths leading to a string with different ADs, which happens if at least two assembly steps can occur independently, ref. [13] assigns the minimum AD to the string. Here, we relax this assumption. Any string has a unique assembly index but can have different ADs if its ASI is not minimal.
Theorem 11.
The AD of a minimal ASI string $C_{\min}^{(N)}$ is equal to the ASI of this string, $d_{a_{\min}}^{(N,b)} = a_{\min}^{(N)}$.
Proof. 
We will prove this by mathematical induction on the length $N$ of the string. For the base case ($N = 1$) the string consists of a single basic symbol $c \in P_0^{(b)}$. Hence, its ASI is $a_{\min}^{(1)} \equiv 0$ and its AD is $d_s^{(1,b)} \equiv 0$. Therefore, $d_s^{(1,b)} = a_{\min}^{(1)} = 0$.
Assume now that for all strings of length $k$ less than $N$, the AD equals the minimal ASI, that is
$$d_{a_{\min}}^{(k,b)} = a_{\min}^{(k)} \quad \forall k < N. \tag{13}$$
We need to show that $d_{a_{\min}}^{(N,b)} = a_{\min}^{(N)}$. Consider two cases based on whether $N$ is a power of two or not.
If $N$ is a power of two, i.e., $N = 2^k$ for some integer $k$, we construct the minimal ASI string as follows. First, we assemble a doublet from two basic symbols:
$$c_1 \circ c_2 = C^{(2,b)}, \quad c_1, c_2 \in P_0^{(b)}. \tag{14}$$
Its ASI is $a_{\min}^{(2)} = 1$ and its AD is $d_s^{(2,b)} = 1$. Then for each $k \ge 2$ we have $C^{(2^{k-1},b)}$ with $a_{\min}^{(2^{k-1})} = k - 1$ and $d_s^{(2^{k-1},b)} = k - 1$, and we construct $C^{(2^k,b)}$ by joining two copies of $C^{(2^{k-1},b)}$
$$C^{(2^{k-1},b)} \circ C^{(2^{k-1},b)} = C^{(2^k,b)}. \tag{15}$$
The ASI of $C^{(2^k,b)}$ is equal to
$$a_{\min}^{(2^k)} = a_{\min}^{(2^{k-1})} + 1 = k, \tag{16}$$
and the AD is equal to
$$d_s^{(2^k,b)} = \max\left( d_{(s-1)L}^{(2^{k-1},b)}, d_{(s-1)R}^{(2^{k-1},b)} \right) + 1 = (k - 1) + 1 = k. \tag{17}$$
Therefore, $d_s^{(2^k,b)} = a_{\min}^{(2^k)} = k$ in this case.
If $N$ is not a power of two, we construct the minimal ASI string as follows. We take the shortest addition chain for $N$. Next, we build the string by combining previously assembled strings according to the addition chain. At each assembly step, the AD increases by one, as we are combining previously assembled strings (i.e. we do not perform independent assembly steps). Since, by Theorem 2, the minimal ASI corresponds to the length of the shortest addition chain $l(N)$, we have
$$d_s^{(N,b)} = l(N) = a_{\min}^{(N)}, \tag{18}$$
also in this case.
In both cases, whether $N$ is a power of two or not, we have:
$$d_s^{(N,b)} = a_{\min}^{(N)}. \tag{19}$$
This completes the proof. □
Hence, for all $b$, a string with $a_{\min}^{(N)}$ can be constructed only with the AD $d_{a_{\min}}^{(N)} = a_{\min}^{(N)}$.
Theorem 12.
The AD of a maximal ASI string $C_{\max}^{(N)}$ satisfies
$$\lceil \log_2(N) \rceil \le d_{a_{\max}}^{(N,b)} \le a_{\max}^{(N,b)}, \tag{20}$$
where $\lceil x \rceil$ denotes the ceiling function.
Proof. 
Since in each assembly step we combine two previously assembled substrings or basic symbols, the AD increases by at most one. Therefore, the AD cannot exceed the total number of assembly steps and we obtain
$$d_{a_{\max}}^{(N,b)} \le a_{\max}^{(N,b)}. \tag{21}$$
Now we will prove that $d_{a_{\max}}^{(N,b)} \ge \lceil \log_2(N) \rceil$. Assume, for contradiction, that there exists a maximal ASI string $C_{\max}^{(N,b)}$ such that:
$$d_{a_{\max}}^{(N,b)} < \lceil \log_2(N) \rceil. \tag{22}$$
Let $d \equiv d_{a_{\max}}^{(N,b)}$, so $d < \lceil \log_2(N) \rceil$. In an assembly tree of AD $d$, the maximum number of leaves that can be combined is $2^d$, because at each assembly step we can combine at most two subassemblies. Therefore, the maximum length $N_{\max}$ of a string that can be assembled with AD $d$ satisfies:
$$N_{\max} \le 2^d. \tag{23}$$
But since $d < \lceil \log_2(N) \rceil$, we have:
$$2^d < 2^{\lceil \log_2(N) \rceil} \le 2^{\log_2(N) + 1 - \epsilon} = N \cdot 2^{1 - \epsilon}, \tag{24}$$
for some $\epsilon > 0$. This implies that
$$N_{\max} < N \cdot 2^{1 - \epsilon}. \tag{25}$$
However, since $\epsilon > 0$, $2^{1 - \epsilon} < 2$, so:
$$N_{\max} < N \cdot 2^{1 - \epsilon} < 2N. \tag{26}$$
This bound has to be compared with the requirement $N_{\max} \ge N$ (since we are assembling a string of length $N$). More precisely, since $d$ is an integer and $d < \lceil \log_2(N) \rceil$, we have $d \le \lceil \log_2(N) \rceil - 1 < \log_2(N)$, which implies:
$$2^d \le 2^{\lceil \log_2(N) \rceil - 1} < 2^{\log_2(N)} = N. \tag{27}$$
Thus, the maximum length of a string that can be assembled with AD $d$ is $N_{\max} < N$, contradicting the assumption that we are assembling a string of length $N$. This completes the proof. □
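For example, the string $C_{\max}^{(8,2)} = [00011101]$ can be assembled with AD $\lceil \log_2(8) \rceil \le d_{a_{\max}}^{(8,2)} \le a_{\max}^{(8,2)}$ as follows, where each column is one assembly pathway and $d_s$ is the AD of the string assembled in step $s$:
$$\begin{array}{llll}
00\ (d_1 = 1), & 00\ (d_1 = 1), & 00\ (d_1 = 1), & 01\ (d_1 = 1), \\
01\ (d_2 = 1), & 01\ (d_2 = 1), & 01\ (d_2 = 1), & 001\ (d_2 = 2), \\
11\ (d_3 = 1), & 11\ (d_3 = 1), & 0001\ (d_3 = 2), & 0001\ (d_3 = 3), \\
0001\ (d_4 = 2), & 0001\ (d_4 = 2), & 00011\ (d_4 = 3), & 00011\ (d_4 = 4), \\
1101\ (d_5 = 2), & 000111\ (d_5 = 3), & 000111\ (d_5 = 4), & 000111\ (d_5 = 5), \\
00011101\ (d_6 = 3), & 00011101\ (d_6 = 4), & 00011101\ (d_6 = 5), & 00011101\ (d_6 = 6).
\end{array} \tag{28}$$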
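The four pathways of this example can be replayed programmatically using the recursive definition of the AD. The sketch below is illustrative only: the function name is hypothetical, symbols are assumed to be single characters, and the way each intermediate string is split into a left and a right part is our reading of the pathways rather than something stated explicitly in the text.

```python
def assembly_depth(joins: list[tuple[str, str]]) -> int:
    """Assembly depth of the last string produced by an ordered list of joins.

    Each (left, right) pair is one assembly step producing left + right.
    Basic symbols (single characters) have depth 0; every other part must
    have been assembled in an earlier step.
    """
    depth: dict[str, int] = {}

    def depth_of(part: str) -> int:
        return 0 if len(part) == 1 else depth[part]

    product = None
    for left, right in joins:
        product = left + right
        depth[product] = max(depth_of(left), depth_of(right)) + 1
    return depth[product]


# Assumed left/right splits for the four pathways of [00011101].
PATHWAYS = [
    [("0", "0"), ("0", "1"), ("1", "1"), ("00", "01"), ("11", "01"), ("0001", "1101")],
    [("0", "0"), ("0", "1"), ("1", "1"), ("00", "01"), ("0001", "11"), ("000111", "01")],
    [("0", "0"), ("0", "1"), ("00", "01"), ("0001", "1"), ("00011", "1"), ("000111", "01")],
    [("0", "1"), ("0", "01"), ("0", "001"), ("0001", "1"), ("00011", "1"), ("000111", "01")],
]

if __name__ == "__main__":
    print([assembly_depth(p) for p in PATHWAYS])
```

Each pathway has six steps, so the ASI is the same in all four cases, while the AD of the final string takes the values 3, 4, 5 and 6.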
Conjecture 1.
For $N$ such that $a_{\min}^{(N)} > a_{\min}^{(N-1)}$ the AD $d_s^{(N,b)}$ of a string $C_k^{(N,b)}$ has a range of values between the minimum ASI for $N$ and the ASI of this string, that is
$$\lceil \log_2(N) \rceil \le d_s^{(N,b)}(C_k) \le a^{(N,b)}(C_k). \tag{29}$$
The seven-bit string is the longest string that can have the maximum ASI $a_{\max}^{(7,2)} = 7 - 1 = 6$. There are four such bitstrings containing two clear triplets and the starting bit at the end or the ending bit at the start, that is
$$[***\star\star\star*] \quad \text{and} \quad [\star***\star\star\star], \tag{30}$$
and their lengths cannot be increased without a repetition of a doublet, which keeps the ASI at the same level $a_{\max}^{(8,2)} = 8 - 2 = 6$.
This observation and Theorem 2 motivated us to develop a general method to construct the longest possible string having the ASI $a_{\max}^{(N,b)}(C_{(N-1)}) = N - 1$, as a function of the radix $b$. We denote the length of this string by $N_{(N-1)}$ or $N_{(N-1)}(b)$, and we call this string a $C_{(N-1)}$ string.
After several trial attempts, we eventually reached two stable methods (cf. Appendices, Methods A and B). In both methods, we start with an initial balanced string of length $3b$ containing $b$ clear triplets ordered as
$$[0001112 \dots (b-2)(b-1)(b-1)(b-1)]. \tag{31}$$
The doublets that can be inserted into the initial string (31) can be arranged in a $b \times b$ matrix
$$\begin{bmatrix}
\cancel{00} & 01 & 02 & \dots & 0(b-1) \\
10 & \cancel{11} & 12 & \dots & 1(b-1) \\
20 & 21 & \cancel{22} & \dots & 2(b-1) \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
(b-2)0 & (b-2)1 & (b-2)2 & \dots & (b-2)(b-1) \\
(b-1)0 & (b-1)1 & (b-1)2 & \dots & \cancel{(b-1)(b-1)}
\end{bmatrix}, \tag{32}$$
where the crossed out entries on the diagonal cannot be reused, as they would create repetitions in this string. If we assume that we shall not insert doublets between the clear triplets of the string (31), we can also cross out the entries in the first superdiagonal of the matrix (32). The strings of odd lengths generated by these general methods are not only the longest but also the most balanced. This can be stated in the following theorem.
Theorem 13 ($N_{(N-1)}$).
The longest length of a string that has the ASI of $N - 1$ is given by
$$N_{(N-1)} = 3b + (b - 1)^2 = b^2 + b + 1 \tag{33}$$
(OEIS A353887) and this string is nearly balanced, that is
$$N_{(N-1)} = b N_c + 1, \tag{34}$$
where $N_c = b + 1$ is the number of occurrences of all but one symbol within the string, and its Shannon entropy is
$$H(C_{(N-1)}) = -\sum_{c=0}^{b-1} p_c \log_2(p_c) = -(b-1) \frac{N_{(N-1)} - 1}{b N_{(N-1)}} \log_2 \frac{N_{(N-1)} - 1}{b N_{(N-1)}} - \frac{N_{(N-1)} - 1 + b}{b N_{(N-1)}} \log_2 \frac{N_{(N-1)} - 1 + b}{b N_{(N-1)}} = \frac{1 - b^2}{b^2 + b + 1} \log_2 \frac{b + 1}{b^2 + b + 1} - \frac{b + 2}{b^2 + b + 1} \log_2 \frac{b + 2}{b^2 + b + 1} \approx \log_2(b). \tag{35}$$
The proof of Theorem 13 is given in Appendix D. A $C_{(N-1)}$ string must contain all clear triplets and all doublets. Although the case for $b = 1$ is degenerate, as no information can be conveyed using only one symbol ($H(C_{(N-1)}) = 0$ in this case), nothing precludes the assembly of such defunct strings and the formula (33) yields the correct result; the string $[000]$ is the longest string with $a_{\max}^{(N,1)} = N - 1$ by Theorem 1, as for $b = 1$ the upper and the lower bound on the ASI are the same, $a_{\max}^{(N,1)} = a_{\min}^{(N)}$ (OEIS A003313). This is the only case where the maximum ASI is not a monotonically nondecreasing function of $N$.
Subsequently, we considered other $C_{(N-k)}$ strings for $k > 1$ with the maximum ASI $a_{\max}(C_{(N-k)}) = N - k$.
Theorem 14 ($N_{(N-2)}$).
For all $b > 1$ the longest length of a string that has the ASI of $N - 2$ is given by $N_{(N-2)} = N_{(N-1)} + 3$ or equivalently by
$$N_{(N-2)} = b^2 + b + 4, \tag{36}$$
and
$$N_{(N-2)} = (b - 2) N_c + (N_c + 1) + (N_c + 3) = b N_c + 4, \tag{37}$$
where $N_c = b + 1$ is the number of occurrences of all but two symbols within the string, and its Shannon entropy is
$$H(C_{(N-2)}) = -\frac{b^2 - b - 2}{b^2 + b + 4} \log_2 \frac{b + 1}{b^2 + b + 4} - \frac{b + 2}{b^2 + b + 4} \log_2 \frac{b + 2}{b^2 + b + 4} - \frac{b + 4}{b^2 + b + 4} \log_2 \frac{b + 4}{b^2 + b + 4}. \tag{38}$$
The entropy $H(C_{(N-2)}) \le \log_2(b)$ for $b \ge 1.6398$.
The proof of Theorem 14 is given in Appendix F. A $C_{(N-2)}$ string must contain only two copies of a doublet. Hence, a clear quadruplet $(bbbb)$ and a pattern binding different symbols adjoining this quadruplet, such as $[abbbbcabc]$, $[abbbbaba]$, etc. must be present, so that any $C_{(N-2)}$ string contains only one pair of repeated doublets $ab$, $bb$, or $\{bc, ba\}$ (see also Appendix C). For example, for $N = 10$, sixteen bitstrings
$$[0100011110],\ [0111100010],\ [0111101000],\ \underline{[0100001110]},\ [0001011110],\ [0001111010],\ [0101111000],\ [0111000010] \tag{39}$$
(an additional eight are given by swapping 0 with 1) have the ASI $a = N - 2 = 8$, where the underlined string (39) is the one that is created for $b = 2$ in Appendix F.
Theorem 15 ($N_{(N-3)}$).
The longest length of a string that has the ASI of $N - 3$ is given by $N_{(N-3)} = N_{(N-1)} + 5$ or equivalently by
$$N_{(N-3)} = b^2 + b + 6, \tag{40}$$
and
$$N_{(N-3)} = (b - 2) N_c + (N_c + 2) + (N_c + 4) = b N_c + 6, \tag{41}$$
where $N_c = b + 1$ is the number of occurrences of all but two symbols within the string, and its Shannon entropy is
$$H(C_{(N-3)}) = -\frac{b^2 - b - 2}{b^2 + b + 6} \log_2 \frac{b + 1}{b^2 + b + 6} - \frac{b + 3}{b^2 + b + 6} \log_2 \frac{b + 3}{b^2 + b + 6} - \frac{b + 5}{b^2 + b + 6} \log_2 \frac{b + 5}{b^2 + b + 6}. \tag{42}$$
The entropy $H(C_{(N-3)}) \le \log_2(b)$ for $b \ge 1.8317$.
The proof of Theorem 15 is given in Appendix G. A $C_{(N-3)}$ string must contain only three copies of a doublet, or two pairs of different doublets.
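The length formulas and symbol distributions of Theorems 13-15 are easy to check numerically. The following sketch uses hypothetical helper names and evaluates the Shannon entropy directly from the symbol counts stated in the theorems rather than from the closed-form expressions.

```python
from math import log2


def longest_length(b: int, k: int) -> int:
    """N_(N-k) for k in {1, 2, 3}, as stated in Theorems 13-15."""
    return b * b + b + {1: 1, 2: 4, 3: 6}[k]


def symbol_counts(b: int, k: int) -> list[int]:
    """Occurrences of each symbol in a C_(N-k) string (N_c = b + 1)."""
    nc = b + 1
    if k == 1:
        return [nc] * (b - 1) + [nc + 1]
    if b < 2:
        raise ValueError("k = 2 and k = 3 are evaluated here for b > 1 only")
    first, second = {2: (1, 3), 3: (2, 4)}[k]
    return [nc] * (b - 2) + [nc + first, nc + second]


def shannon_entropy(counts: list[int]) -> float:
    """Shannon entropy (in bits) of a symbol-count distribution."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)


if __name__ == "__main__":
    for b in (2, 3, 4):
        for k in (1, 2, 3):
            counts = symbol_counts(b, k)
            print(b, k, longest_length(b, k), counts,
                  round(shannon_entropy(counts), 4), round(log2(b), 4))
```

For every integer $b > 1$ the counts add up to the corresponding length, and the entropy stays below $\log_2(b)$ because the distributions are nearly, but not exactly, uniform.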
In general, the strings of Theorems 14 and 15, generated by Method A or B, are terminated with 0 and owe their properties to the following distributions of symbols
$$C_{(N-2)} = [0100001 \dots 10 \dots 0], \quad C_{(N-3)} = [010100001 \dots 10 \dots 0]. \tag{43}$$
Here, $C_{(N-2)}$ contains three pairs of doublets $[00]$, $[01]$, and $[10]$, which are connected in such a way that only one pair can be reused from the assembly pool to decrease the maximum $N - 1$ ASI by one.
Conjecture 2.
($N_{\max} > N_{(N-k)}$). If $b > 1$ and $N > N_{(N-2)}$ then
$$a_{\max}^{(N,b)} = \begin{cases} a_{\max}^{(N-1,b)} + 1 & \text{iff } N = 2l, \\ a_{\max}^{(N-1,b)} & \text{iff } N = 2l + 1. \end{cases} \tag{44}$$
Equivalently, if $b > 1$ and $N > N_{(N-2)}$ then
$$a_{\max}^{(N,b)} = \left\lfloor \frac{N}{2} \right\rfloor + \frac{b(b+1)}{2}, \tag{45}$$
($b(b+1)/2$ are triangular numbers, OEIS A000217). In other words, if $N > N_{(N-2)}$, then the ASI increments by one every time the string length $N$ increments by two.
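The conjectured closed form is simple to evaluate. The sketch below (hypothetical function name) computes it and checks that it agrees with $N - 2$ and $N - 3$ at the lengths $b^2 + b + 4$ and $b^2 + b + 6$ of Theorems 14 and 15, which is a consistency check of the formulas and not a proof of the conjecture.

```python
def conjectured_max_asi(n: int, b: int) -> int:
    """Conjecture 2: floor(N / 2) + b (b + 1) / 2 for b > 1 and N > N_(N-2)."""
    return n // 2 + b * (b + 1) // 2


if __name__ == "__main__":
    for b in range(2, 6):
        n2 = b * b + b + 4  # N_(N-2)
        n3 = b * b + b + 6  # N_(N-3)
        print(b, conjectured_max_asi(n2, b) == n2 - 2,
              conjectured_max_asi(n3, b) == n3 - 3)
    print(conjectured_max_asi(29903, 4))
```

The value for $N = 29903$ and $b = 4$ is 14961, the same number quoted in the Discussion.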
First, we note that the maximum ASI must rise. Otherwise, if it were constant for $N$ larger than some $N_{\max}$, then at some even larger $N$ it would inevitably become lower than the minimum ASI bound, which also rises, and this would be a contradiction.
W.l.o.g. we aim to prove this conjecture for $b = 2$. We note that inserting any doublet into a $C_{(N-3)}^{(12,2)}$ string (A20) creates a triplet. Using equation (9) of Theorem 10 we have
$$\begin{aligned}
a_s &= a_{s-2} + 1, \quad N_s = N_{s-2} + 2, \\
a_s &= N_s - 1 - \sum_{s=1}^{R_s} \left[ (n_s - 1) k_s - a_{C_s}^{(n_s,b)} \right], \\
a_{s-2} &= N_{s-2} - 1 - \sum_{p=1}^{R_{s-2}} \left[ (n_p - 1) k_p - a_{C_p}^{(n_p,b)} \right], \\
a_s - a_{s-2} &= (N_{s-2} + 2) - 1 - \sum_{s=1}^{R_s} \left[ (n_s - 1) k_s - a_{C_s}^{(n_s,b)} \right] - \left( N_{s-2} - 1 - \sum_{p=1}^{R_p} \left[ (n_p - 1) k_p - a_{C_p}^{(n_p,b)} \right] \right) \\
&= 2 - \sum_{s=1}^{R_s} \left[ (n_s - 1) k_s - a_{C_s}^{(n_s,b)} \right] + \sum_{p=1}^{R_p} \left[ (n_p - 1) k_p - a_{C_p}^{(n_p,b)} \right] = 1, \\
\sum_{s=1}^{R_s} \left[ (n_s - 1) k_s - a_{C_s}^{(n_s,b)} \right] &= \sum_{p=1}^{R_p} \left[ (n_p - 1) k_p - a_{C_p}^{(n_p,b)} \right] + 1,
\end{aligned} \tag{46}$$
for any step $s$ if only $N_s \ge N_{(N-2)}$.
Assume that $\forall s$, $a_{C_s}^{(n_s,b)} = n_s - 1$ and $\forall p$, $a_{C_p}^{(n_p,b)} = n_p - 1$. Then
$$\begin{aligned}
\sum_{s=1}^{R_s} (n_s - 1)(k_s - 1) &= \sum_{p=1}^{R_p} (n_p - 1)(k_p - 1) + 1, \\
\sum_{s=1}^{R_s} n_s k_s - \sum_{s=1}^{R_s} n_s - \sum_{s=1}^{R_s} k_s + R_s &= \sum_{p=1}^{R_p} n_p k_p - \sum_{p=1}^{R_p} n_p - \sum_{p=1}^{R_p} k_p + R_p + 1.
\end{aligned} \tag{47}$$
The proof of Conjecture 2 must show the conditions for the equations (46) and (47) to hold.
The bounds of Theorems 14-15 and Conjecture 2 are illustrated in Figure 1 and Figure 2.
These results disprove our upper bound Conjecture 1 for $b = 2$ stated in ref. [9]. However, along with Theorem 12, they led us to the method of determining the ASI of a maximum ASI string, based on independent doublets and powers of two, as shown in Table 1. First, a string is sequenced every two symbols to find the number of unique doublets $\times 2(b)$. Subsequently, these doublets form $\times 4$ quadruplets, quadruplets form $\times 8$ octuplets, and so on, depending on the length of the string $N$. Only the first quantity, $\times 2(b)$, depends on the radix $b$. The columns "last $2^s$" indicate whether the assembled string should be terminated with a single substring of length $2^s$, in descending order. The empty fields in the respective columns for $N > 1$ indicate that a given $\times 2^s$ substring can be interpreted as either a "regular" single $\times 2^s$ substring or a last $\times 2^s$ substring if $\times 2^s = 1$. It turns out that for $b > 1$, the ASI depends only on the distribution of doublets ($\times 2(b)$). In particular, the $N_{(N-1)}$ strings (A3) or (A4) contain the maximum of $\lfloor N_{(N-1)} / 2 \rfloor$ independent doublets, and the $N_{(N-2)}$ string (A14) contains the maximum of $N_{(N-2)} / 2 - 1$ independent doublets.
For example, the $N_{(N-3)}$ string (A21) of length $N_{(N-3)} = 18$ for $b = 3$ can be assembled as
$$\begin{array}{l}
0 \circ 1 = [01],\ 0 \circ 0 = [00],\ 1 \circ 1 = [11],\ 1 \circ 2 = [12],\ 2 \circ 2 = [22],\ 1 \circ 0 = [10],\ 2 \circ 0 = [20] \quad (\times 2(b = 3) = 7), \\
{}[01] \circ [01] = [0101],\ [00] \circ [00] = [0000],\ [11] \circ [12] = [1112],\ [22] \circ [10] = [2210] \quad (\times 4 = 4), \\
{}[0101] \circ [0000] = [01010000],\ [1112] \circ [2210] = [11122210] \quad (\times 8 = 2), \\
{}[01010000] \circ [11122210] = [0101000011122210] \quad (\times 16 = 1), \\
{}[0101000011122210] \circ [20] = [010100001112221020] \quad (\text{last } \times 2), \\
\hline
7 + 4 + 2 + 1 + 1 = 15 \text{ steps}.
\end{array}$$
Similarly, the $N_{(N-1)}$ string (A3) of length $N_{(N-1)} = 21$ for $b = 4$ can be assembled, as shown in Table 1, as
$$\begin{array}{l}
0 \circ 0 = [00],\ 0 \circ 1 = [01],\ 1 \circ 1 = [11],\ 2 \circ 2 = [22],\ 2 \circ 3 = [23],\ 3 \circ 3 = [33],\ 1 \circ 0 = [10],\ 2 \circ 1 = [21],\ 3 \circ 2 = [32],\ 0 \circ 3 = [03] \quad (\times 2(b = 4) = 10), \\
{}[00] \circ [01] = [0001],\ [11] \circ [22] = [1122],\ [23] \circ [33] = [2333],\ [10] \circ [21] = [1021],\ [32] \circ [03] = [3203] \quad (\times 4 = 5), \\
{}[0001] \circ [1122] = [00011122],\ [2333] \circ [1021] = [23331021] \quad (\times 8 = 2), \\
{}[00011122] \circ [23331021] = [0001112223331021] \quad (\times 16 = 1), \\
{}[0001112223331021] \circ [3203] = [00011122233310213203] \quad (\text{last } \times 4), \\
{}[00011122233310213203] \circ 0 = [000111222333102132030] \quad (\text{last } \times 1), \\
\hline
10 + 5 + 2 + 1 + 1 + 1 = 20 \text{ steps}.
\end{array}$$
For $N < 15$ and for other small $N$ this combinatorics is valid also for $b = 1$, where obviously $\max \times 2^s = 1$. For example, the string of length $N = 15$ can be assembled in six steps as
$$\begin{array}{l}
0 \circ 0 = [00] \quad (\times 2(b = 1) = 1), \\
{}[00] \circ [00] = [0000] \quad (\times 4(b = 1) = 1), \\
{}[0000] \circ [0000] = [00000000] \quad (\times 8(b = 1) = 1), \\
{}[00000000] \circ [0000] = [000000000000] \quad (\text{last } \times 4), \\
{}[000000000000] \circ [00] = [00000000000000] \quad (\text{last } \times 2), \\
{}[00000000000000] \circ [0] = [000000000000000] \quad (\text{last } \times 1), \\
\hline
1 + 1 + 1 + 1 + 1 + 1 = 6 \text{ steps}.
\end{array}$$
However, this is the first exception for $b = 1$, as the ASI of this string is five if it is assembled using the doublet $[00]$ and the triplet $[000]$.
We further note that the method illustrated in Table 1 cannot be used to construct the maximum ASI string. For example, both of the following two distributions of doublets satisfy the distributions of Table 1. However, only the left one correctly reflects the ASI of the assembled string.
$$\begin{array}{ll}
0 \circ 0 = [00],\ 0 \circ 1 = [01],\ 1 \circ 1 = [11] \quad (\times 2(b = 2) = 3), & 0 \circ 0 = [00],\ 1 \circ 0 = [10],\ 1 \circ 1 = [11] \quad (\times 2(b = 2) = 3), \\
{}[00] \circ [01] = [0001] \quad (\times 4 = 1), & [00] \circ [10] = [0010] \quad (\times 4 = 1), \\
{}[0001] \circ [11] = [000111] \quad (\text{last } \times 2), & [0010] \circ [11] = [001011] \quad (\text{last } \times 2), \\
\hline
3 + 1 + 1 = 5 \text{ steps}, & 3 + 1 + 1 = 5 \text{ steps}.
\end{array}$$
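The step counts of the worked examples above can be reproduced with a short routine. Table 1 is not reproduced here, so the sketch below encodes only our reading of the counting used in those examples: unique aligned blocks of length $2, 4, 8, \dots$ are counted, and one further step is added for each trailing "last $\times 2^s$" block. The function name is hypothetical. The result is the length of one valid pathway and therefore an upper bound on the ASI, not the ASI itself.

```python
def dyadic_step_count(string: str) -> int:
    """Number of steps of the dyadic pathway illustrated by the examples.

    For every block size 2, 4, 8, ... not exceeding the length, count the
    distinct aligned full blocks (each is assembled once from two blocks of
    half the size). The top-level blocks given by the binary expansion of
    the length are then joined left to right.
    """
    n = len(string)
    if n < 2:
        return 0
    steps = 0
    size = 2
    while size <= n:
        full_blocks = n // size
        steps += len({string[i * size:(i + 1) * size] for i in range(full_blocks)})
        size *= 2
    # One join per additional set bit of n ("last x 2^s" blocks).
    steps += bin(n).count("1") - 1
    return steps


if __name__ == "__main__":
    for s in ("010100001112221020", "000111222333102132030",
              "0" * 15, "000111", "001011"):
        print(len(s), dyadic_step_count(s))
```

Both six-bit strings give five steps, although, as noted above, only $[000111]$ actually has an ASI of five: $[001011]$ contains the doublet $[01]$ twice and can be assembled in four steps.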

3. Discussion

Applications of AT seem to be promising. It offers a new lens for studying the construction of biological molecules like DNA and proteins. By analyzing the steps needed to assemble these molecules from basic building blocks, researchers can gain deeper insights into the evolutionary constraints and optimizations that shape biological pathways. This perspective also sheds light on the efficient construction of cellular structures and helps to identify the minimal assembly steps that define biological complexity, reinforcing the idea that life is characterized by highly organized pathways. Furthermore, AT provides an essential tool for understanding the growth of complexity in biological systems over evolutionary time. By quantifying the assembly steps required to form increasingly complex organisms, scientists can map the trajectory of evolutionary development and identify key transitions that lead to higher levels of structural and functional complexity. It can guide the design and optimization of synthetic biological systems by minimizing the number of steps required to build new biological pathways, making bioengineering more efficient and scalable. The ability to model and simplify complex biological processes using AT could lead to the development of more robust and adaptable synthetic organisms.
Strings of length $N_{(N-1)}$ (e.g., (A3) or (A4)) are necessarily the most balanced: all but one symbol occur $b+1$ times and one symbol occurs $b+2$ times within a string $C_{(N-1)}$. However, if the length of a string is constant, it will tend to evolve to decrease the Shannon entropy [14,15] and, hence, will become less balanced. For example, the Shannon entropy of the SARS-CoV-2 genome containing $N = 29903$ nucleobases decreased from $H = 1.3565$ to $1.3562$ within two years after the Wuhan outbreak [9,14]. This tendency to imbalance seems to be associated with some minimum energy condition. The energy of a black hole that can be thought of as a balanced bitstring [16] can be two times the energy of the entropy variation sphere that it generates [17]. The maximum ASI of a $C_{\text{max}}^{(29903,4)}$ string given by equation (45) is $a_{\text{max}}^{(29903,4)} = 14961$.
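The last figure can be checked by direct arithmetic. The minimal Python sketch below evaluates the length formulas and the conjectured bound $a_{\text{max}}^{(N,b)} = \lfloor N/2 \rfloor + b(b+1)/2$ quoted in the abstract. The function names are ours, and restricting the code to $b \ge 2$ is an assumption made here because the case $b = 1$ is treated separately in Appendices F and G.

```python
def n_longest(b, k):
    """Longest length N of a radix-b string whose assembly index is N - k (k = 1, 2, 3)."""
    if b < 2:
        raise ValueError("this sketch assumes b >= 2")
    if k in (1, 2):
        return b * b + b + 3 * k - 2
    if k == 3:
        return b * b + b + 6
    raise ValueError("only k = 1, 2, 3 are covered by the stated formulas")


def a_max_conjectured(n, b):
    """Conjectured maximum assembly index of a radix-b string of length n > N_(N-2)."""
    if n <= n_longest(b, 2):
        raise ValueError("the conjecture is stated only for n > N_(N-2)")
    return n // 2 + b * (b + 1) // 2


def balanced_counts(b):
    """Symbol occurrences in a C_(N-1) string: b - 1 symbols occur b + 1 times, one occurs b + 2 times."""
    return [b + 1] * (b - 1) + [b + 2]


print(a_max_conjectured(29903, 4))                  # 14961
print(sum(balanced_counts(4)) == n_longest(4, 1))   # True
```

The same function reproduces the last row of Table 1 ($N = 33$) for $b = 2, 3, 4$.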
Authors’ contributions: WB: first concept of a general method for constructing the string of length $N_{(N-1)}$ leading to Theorem 13; the concept of the doublet matrix (32); outline of the general Method A; proposition of Theorem 9; the idea of a string with exactly two copies of all doublets and the formula for its length; numerous clarity corrections and improvements. PM: outline of the general Method B; the hint for ASI combinatorics; creation of software supporting Conjecture 2; numerous clarity corrections and improvements. AT: formal proof of Theorem 3; proof that the Shannon entropy (35) can be approximated by $\log_2(b)$ for large $b$; proof of Theorem 11; proof of Theorem 12; the 1st paragraph of the discussion in Section 3; numerous clarity corrections and improvements. SŁ: the remaining part of the study.

Funding

This research received no external funding.

Data Availability Statement

The public repository for the code written in the MATLAB computational environment and C++ is given under the link https://github.com/szluk/Evolution_of_Information (accessed on 19 September 2024).

Acknowledgments

The authors thank Mariola Bala for her motivation. SŁ thanks his wife, Magdalena Bartocha, for her everlasting support, and his partner and friend, Renata Sobajda, for her prayers.

Conflicts of Interest

Authors Wawrzyniec Bieniawski and Piotr Masierak were employed by the company Łukaszyk Patent Attorneys. The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Appendix A. Method A for Generating a $C_{(N-1)}$ String

Figure A1. Doublet matrices for $1 \le b \le 16$ that illustrate the generation of strings of length $N_{(N-1)}$ according to Method A. Colored doublets are appended to the initial string of clear triplets in the order indicated by arrows starting from the 1st column or row. Finally, 0 is appended at the end if $b$ is even.
We start with a string of clear triplets (31). In the 1st step, we create a string containing the doublets on the first subdiagonal of the matrix (32), starting with 10,

$$[102132\dots(b-2)(b-3)(b-1)(b-2)], \tag{A1}$$

and we append it to the string (31). With this step, we also eliminate the doublets on the second superdiagonal starting with the doublet 02, as well as the doublet $(b-1)1$. In the 2nd step, we create a string containing the doublets on the third superdiagonal, beginning with the doublet 03,

$$[0314\dots(b-5)(b-2)(b-4)(b-1)], \tag{A2}$$

and append it to the string created so far. With this step, we also remove the doublet $(b-2)0$ and the middle part of the second subdiagonal containing $\{31, 42, \dots, (b-2)(b-4)\}$. And so on. Finally, we append 0 if $b$ is even. This process is illustrated in Figure A1 and for $3 \le b \le 13$ generates the following $C_{(N-1)}$ strings (for $b \ge 9$ only the part appended to the initial string (31) is listed)

$$
\begin{aligned}
b = 3{:}\quad & [\mathtt{0001112221020}], \\
b = 4{:}\quad & [\mathtt{000111222333102132030}], \\
b = 5{:}\quad & [\mathtt{0001112223334441021324303142040}], \\
b = 6{:}\quad & [\mathtt{0001112223334445551021324354031425041520530}], \\
b = 7{:}\quad & [\mathtt{000111222333444555666102132435465031425360415262064051630}], \\
b = 8{:}\quad & [\mathtt{0001112223334445556667771021324354657603142536470415263720750516273061740}], \\
b = 9{:}\quad & [\mathtt{1021324354657687031425364758041526374820860516273830617285071840}], \\
b = 10{:}\quad & [\mathtt{102132435465768798031425364758690415263748592097051627384930617283960718294081950}], \\
b = 11{:}\quad & [\mathtt{102132435465768798a9031425364758697a0415263748596a20a805162738495a3061728394a70718293a408192a6091a50}], \\
b = 12{:}\quad & [\mathtt{102132435465768798a9ba031425364758697a8b0415263748596a7b20b905162738495a6b3061728394a5b80718293a4b408192a3b7091a2b50a1b60}], \\
b = 13{:}\quad & [\mathtt{102132435465768798a9bacb031425364758697a8b9c0415263748596a7b8c20ca05162738495a6b7c3061728394a5b6c90718293a4b5c408192a3b4c8091a2b3c50a1b2c70b1c60}].
\end{aligned}
\tag{A3}
$$
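The defining property of these strings can be verified mechanically: a $C_{(N-1)}$ string of length $b^2+b+1$ contains every off-diagonal doublet exactly once, while every diagonal doublet $cc$ occurs exactly twice, at the two overlapping positions inside its clear triplet $ccc$. The Python sketch below is a check written for this purpose, not the authors' software; the names `METHOD_A`, `doublet_counts`, and `is_full_doublet_string` and the alphabet string are our assumptions. It is applied to the Method A strings for $3 \le b \le 8$ listed above.

```python
from collections import Counter

# C_(N-1) strings produced by Method A for b = 3, ..., 8, copied from the list above.
METHOD_A = {
    3: "0001112221020",
    4: "000111222333102132030",
    5: "0001112223334441021324303142040",
    6: "0001112223334445551021324354031425041520530",
    7: "000111222333444555666102132435465031425360415262064051630",
    8: "0001112223334445556667771021324354657603142536470415263720750516273061740",
}

ALPHABET = "0123456789abc"  # assumed digit symbols, sufficient for b <= 13


def doublet_counts(s):
    """Count all adjacent (overlapping) doublets of s."""
    return Counter(s[i:i + 2] for i in range(len(s) - 1))


def is_full_doublet_string(s, b):
    """True if s has length b^2 + b + 1, each off-diagonal doublet once, each diagonal doublet twice."""
    if len(s) != b * b + b + 1:
        return False
    counts = doublet_counts(s)
    symbols = ALPHABET[:b]
    for x in symbols:
        for y in symbols:
            expected = 2 if x == y else 1
            if counts[x + y] != expected:
                return False
    return True


for b, s in METHOD_A.items():
    print(b, len(s), is_full_doublet_string(s, b))
```

Changing a single symbol of any of these strings breaks the property, because one doublet is then missing and another is repeated.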

Appendix B. Method B for Generating a $C_{(N-1)}$ String

Figure A2. Doublet matrices for $1 \le b \le 13$ that illustrate the generation of strings of length $N_{(N-1)}$ according to Method B. Colored doublets are appended to the initial string of clear triplets in the order indicated by arrows starting from the 1st column or row. Finally, 0 is appended at the end if $b$ is even.
This method is similar to Method A. We also start with the string of clear triplets (31) and the matrix of doublets (32) with its diagonal and first superdiagonal crossed out. In the first step, we append the doublet $0(b-1)$ (the top-right doublet of the matrix (32)) at the end of the string (31). Next, we repeat the following pair of steps:
  1. we check subsequent subdiagonals until we find one that does not contain a doublet present in the string created so far, append it at the end of this string, and proceed to step 2;
  2. we check subsequent superdiagonals until we find one that does not contain a doublet present in the string created so far, append it at the end of this string, and proceed to step 1.
Finally, we append 0 if $b$ is even. The method is illustrated in Figure A2 and for $3 \le b \le 13$ generates the $C_{(N-1)}$ strings in the form (for $b \ge 9$ only the part appended to the initial string (31) is listed)

$$
\begin{aligned}
b = 3{:}\quad & [\mathtt{0001112220210}], \\
b = 4{:}\quad & [\mathtt{000111222333031021320}], \\
b = 5{:}\quad & [\mathtt{0001112223334440410213243031420}], \\
b = 6{:}\quad & [\mathtt{0001112223334445550510213243540314253041520}], \\
b = 7{:}\quad & [\mathtt{000111222333444555666061021324354650314253640516204152630}], \\
b = 8{:}\quad & [\mathtt{0001112223334445556667770710213243546576031425364730415263740516275061720}], \\
b = 9{:}\quad & [\mathtt{0810213243546576870314253647583041526374850516273860718206172840}], \\
b = 10{:}\quad & [\mathtt{091021324354657687980314253647586930415263748596051627384950617283940718297081920}], \\
b = 11{:}\quad & [\mathtt{0a102132435465768798a9031425364758697a30415263748596a705162738495a60718293a4061728394a8091a208192a50}], \\
b = 12{:}\quad & [\mathtt{0b102132435465768798a9ba031425364758697a8b30415263748596a7b805162738495a6b5061728394a5b60718293a4b708192a3b4091a2b90a1b20}], \\
b = 13{:}\quad & [\mathtt{0c102132435465768798a9bacb031425364758697a8b9c30415263748596a7b8c905162738495a6b7c5061728394a5b6c70718293a4b5c8091a2b3c408192a3b4ca0b1c20a1b2c60}].
\end{aligned}
\tag{A4}
$$

Appendix C. A String with Exactly Two Copies of All Doublets and No Repeated Triplets

A string that has exactly two copies of all doublets and no repeated triplets can have the form (for $b = \{1, 2, 3, 4, 5\}$)

$$
\begin{aligned}
&[0000], \\
&[00001111010], \\
&[0000111122221021202010], \\
&[0000111122223333102132101202303203130], \\
&[00001111222233334444102132431012023034041304242143203140]
\end{aligned}
\tag{A5}
$$

and has a length of

$$N_{2D} = 2b^2 + b + 1. \tag{A6}$$

A suboptimal method of generating it (with repeated triplets) is illustrated in Figure A3.
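The length formula $N_{2D} = 2b^2 + b + 1$ and the two-copy property can be checked against the five strings given in this appendix. In the Python sketch below (a check written for illustration; `TWO_COPY`, `n_2d`, and `off_diagonal_counts` are our names), only off-diagonal doublets are counted, since a diagonal doublet $cc$ occurs at three overlapping positions inside its clear quadruplet $cccc$, which amounts to two non-overlapping copies. The absence of repeated triplets is not tested here.

```python
from collections import Counter

# Strings with exactly two copies of all doublets for b = 1, ..., 5, as given in this appendix.
TWO_COPY = {
    1: "0000",
    2: "00001111010",
    3: "0000111122221021202010",
    4: "0000111122223333102132101202303203130",
    5: "00001111222233334444102132431012023034041304242143203140",
}


def n_2d(b):
    """Length of a string with exactly two copies of all doublets, N_2D = 2b^2 + b + 1."""
    return 2 * b * b + b + 1


def off_diagonal_counts(s):
    """Count adjacent doublets xy of s with x != y."""
    return Counter(s[i:i + 2] for i in range(len(s) - 1) if s[i] != s[i + 1])


for b, s in TWO_COPY.items():
    counts = off_diagonal_counts(s)
    two_copies = len(counts) == b * b - b and all(v == 2 for v in counts.values())
    print(b, len(s) == n_2d(b), two_copies)
```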
Figure A3. Doublet matrices for $1 \le b \le 8$ that illustrate the generation of strings of length $N_{2D}$ containing exactly two copies of all doublets. Colored doublets are appended to the initial string of clear quadruplets in the order indicated by arrows starting from the 1st column or row. Finally, $0(b-1)0$ is appended at the end. The 1st superdiagonal is appended as $01234\dots$.

Appendix D. Proof of the $C_{(N-1)}$ String Theorem

The $N_{(N-1)}$ given by formula (33) is an odd number for all $b$. The first element $3b$ is the length of the initial string (31) containing $b$ clear triplets, and $b^2 - b - (b-1)$ is the number of doublets available in the matrix (32) after crossing out $b$ doublets on its diagonal and $b-1$ doublets on its superdiagonal that are present in the starting string (31). By definition, a $C_{(N-1)}$ string cannot have any repetitions. To be the longest, it must contain all doublets in the matrix (32) and all clear triplets. Furthermore, to be the most patternless, this string must maximize the Shannon entropy, i.e., it must be the most balanced. For a string of the form (34), the fractions in the Shannon entropy are

$$p_0 = \frac{N_c + 1}{N_{(N-1)}}, \quad p_{1,2,\dots,b-1} = \frac{N_c}{N_{(N-1)}}, \tag{A7}$$

where w.l.o.g. we assume that the symbol occurring $N_c(b) + 1$ times within the string is $c = 0$. To see that the Shannon entropy (35) of a $C_{(N-1)}$ string can be approximated by $\log_2(b)$ for large $b$, first notice that $1 - b^2 < 0$ and $b^2 + b + 1 > 0$, $\forall b > 1$. Furthermore, $\forall b > 0$, $b + 1 \le b^2 + b + 1$, which implies that the first term

$$\log_2\left(\frac{b+1}{b^2+b+1}\right) < 0. \tag{A8}$$

Similarly, the second term

$$\log_2\left(\frac{b+2}{b^2+b+1}\right) < 0. \tag{A9}$$

Hence, the entropy (35) can be approximated by the dominant contribution from the first term, which is $\log_2(b)$.
The strings given by (33) are not the shortest possible ones. The lengths of strings satisfying equation (34) and $\min(b N_c(b) + 1) > N_{(N-1)}(b-1)$ are given by $b^2 + 1$ (OEIS A002522). Such strings can be constructed to contain all possible doublets but without any triplets, starting with an initial balanced string of length $2b$ containing $b$ clear doublets ordered from the main diagonal of the doublet matrix (32). Furthermore, their entropies are smaller than the entropies of the strings given by equation (33). Namely, $\forall b > 1$

$$\frac{1-b^2}{b^2+b+1}\log_2\left(\frac{b+1}{b^2+b+1}\right) - \frac{b+2}{b^2+b+1}\log_2\left(\frac{b+2}{b^2+b+1}\right) > \frac{b(1-b)}{b^2+1}\log_2\left(\frac{b}{b^2+1}\right) - \frac{b+1}{b^2+1}\log_2\left(\frac{b+1}{b^2+1}\right). \tag{A10}$$

Now, assume a contrario that a string $C'_{(N-1)}$ longer than $N_{(N-1)}$ can be constructed, say of length $N'_{(N-1)} = N_{(N-1)} + 1$. But in this case, the corresponding $H(C'_{(N-1)}) < H(C_{(N-1)})$. The string of the length given by formula (33) maximizes the Shannon entropy if it must additionally satisfy the relation (34). Thus, Theorem 13 is proven.
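The entropy comparison used in this proof can be illustrated numerically. The Python sketch below is a numerical illustration only and does not replace the argument; the function names are ours. It computes the Shannon entropy from the symbol counts of both string families, namely one symbol occurring $b+2$ times and $b-1$ symbols occurring $b+1$ times for length $b^2+b+1$, and one symbol occurring $b+1$ times and $b-1$ symbols occurring $b$ times for length $b^2+1$, and compares both values with $\log_2(b)$.

```python
import math


def entropy_from_counts(counts):
    """Shannon entropy (in bits) of a string described by its symbol occurrence counts."""
    n = sum(counts)
    return -sum(c / n * math.log2(c / n) for c in counts)


def h_long(b):
    """Entropy of a balanced string of length b^2 + b + 1."""
    return entropy_from_counts([b + 2] + [b + 1] * (b - 1))


def h_short(b):
    """Entropy of a balanced string of length b^2 + 1."""
    return entropy_from_counts([b + 1] + [b] * (b - 1))


for b in (2, 3, 4, 8, 16):
    print(b, round(h_short(b), 6), round(h_long(b), 6), round(math.log2(b), 6))
```

For every $b$ tried, the shorter strings have the smaller entropy, and both values approach $\log_2(b)$ from below as $b$ grows.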

Appendix E. Number of $C_{(N-1)}^{(13,3)}$ Strings

For $b = 3$, only two doublets can be introduced without repetitions into the initial string (31), leading to twelve unique strings of length $N_{(N-1)} = 13$

$$
\begin{aligned}
&[0001112220210],\ [0001112221020],\ [2021000111222],\ [2102000111222], \\
&[0001112022210],\ [0001112102220],\ [2100020111222],\ [0002011122210], \\
&[0200011122210],\ [2000210111222],\ [2100011120222],\ [2100011122202].
\end{aligned}
\tag{A11}
$$

Finally, we have to multiply the cardinality of this set by $3! = 6$ to account for permutations of the symbols. For example, the first string $[0001112220210]$ is equivalent to the five strings $[0002221110120]$, $[1110002221201]$, $[1112220001021]$, $[2220001112102]$, and $[2221110002012]$. Hence, there are seventy-two different strings of length $N_{(N-1)}(3) = 13$.
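The counting argument can be reproduced by relabeling the symbols of the twelve strings with all $3! = 6$ permutations of $\{0, 1, 2\}$ and collecting the distinct results. The Python sketch below does this with the standard library; the names `BASE_STRINGS` and `relabelings` are ours.

```python
from itertools import permutations

# The twelve strings of length 13 listed in this appendix.
BASE_STRINGS = [
    "0001112220210", "0001112221020", "2021000111222", "2102000111222",
    "0001112022210", "0001112102220", "2100020111222", "0002011122210",
    "0200011122210", "2000210111222", "2100011120222", "2100011122202",
]


def relabelings(s, symbols="012"):
    """All strings obtained from s by permuting the symbols of the alphabet."""
    result = set()
    for perm in permutations(symbols):
        table = str.maketrans(symbols, "".join(perm))
        result.add(s.translate(table))
    return result


all_strings = set()
for s in BASE_STRINGS:
    all_strings |= relabelings(s)

print(sorted(relabelings("0001112220210")))
print(len(all_strings))  # 72
```

The first printed list contains the first string together with the five equivalent strings named in the text.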

Appendix F. Proof of the $C_{(N-2)}$ String Theorem

For $b = 1$, $N_{(N-2)}(1) = N_{(N-1)}(1) + 2 = 5$, as the ASI of $[00000]$ is the same as the ASI of $[000000]$.
A $C_{(N-1)}$ string contains all doublets. Hence, inserting any basic symbol into any position inevitably leads to a repetition of a doublet. W.l.o.g. we append it at the start of the $C_{(N-1)}$ string, obtaining a string

$$C_k = [*000111222\dots], \quad a_{\text{max}}^{(N_{(N-1)}+1,\,b)}(C_k) = N - 2. \tag{A12}$$

Another symbol can be introduced to this string without an additional doublet repetition provided that it adjoins the previously introduced symbol, which gives a string

$$C_l = [\star * 000111222\dots], \quad a_{\text{max}}^{(N_{(N-1)}+2,\,b)}(C_l) = N - 2, \tag{A13}$$

leading to the repetition of the doublet $\star *$ or $*0$ but not both of them (here we allow $\star = *$). Hence, both the length and the ASI of this string increase by one. Finally, 0 can be appended at the start of this string without an additional doublet repetition provided that $\star \neq 0$ and $* = 0$, and the string becomes

$$C_{(N-2)} = [0 \star 0000111222\dots], \quad a_{\text{max}}^{(N_{(N-1)}+3,\,b)}(C_{(N-2)}) = N - 2, \tag{A14}$$

leading to the mutually exclusive repetition of the doublet $0\star$, $\star 0$, or 00, so that again both the length and the ASI of this string increase by one. An insertion of another symbol into the string (A14) at any position will maintain or even decrease the ASI of the newly formed string. For example, appending 0 at the start of the $C_{(N-2)}$ string (A14), where $\star = 1$,

$$[0010000111222\dots] \tag{A15}$$

creates a 001 triplet based on the 00 doublet, leading to a decrease of the ASI of this longer string to $a = N - 4$, as compared to $a = N - 2$ of the string (A14). Thus, Theorem 14 is proven.
For a string of the form (37), the fractions in the Shannon entropy are

$$p_0 = \frac{N_c + 3}{N_{(N-2)}}, \quad p_1 = \frac{N_c + 1}{N_{(N-2)}}, \quad p_{2,\dots,b-1} = \frac{N_c}{N_{(N-2)}}, \tag{A16}$$

where w.l.o.g. we assume that the symbol occurring $N_c(b) + 3$ times within the string is $c = 0$, which leads to the Shannon entropy (38).

Appendix G. Proof of the $C_{(N-3)}$ String Theorem

$N_{(N-3)}(1) = N_{(N-1)}(1) + 4 = 7$, as the ASI of strings of seven and eight identical symbols is three. Appending 0 at the start of the $C_{(N-2)}$ string (A14) decreases the ASI of this longer string to $a = N - 4$ (cf. Appendix F). Thus, w.l.o.g. we append $* \neq 0$ at the start of the $C_{(N-2)}$ string (A14)

$$C_k = [*010000111222\dots], \quad a_{\text{max}}^{(N_{(N-1)}+4,\,b)}(C_k) = N - 3. \tag{A17}$$

If $* = 1$, we have the same three doublets 10. Otherwise, we have two pairs of the same doublets $*0$ and 10. Both cases are equivalent by Theorem 4. An insertion of another symbol into this string may maintain or even decrease the ASI of the newly formed string. To maximize its ASI, another symbol must adjoin $*$. Hence, we append $\star$ at the start, where $\star \neq *$ and $* \neq 0$; the string

$$C_l = [\star * 010000111222\dots], \quad a_{\text{max}}^{(N_{(N-1)}+5,\,b)}(C_l) = N - 3, \tag{A18}$$

has an increased length and ASI. W.l.o.g. for $b = 2$ we have four bitstrings (A18), wherein three of them

$$
\begin{aligned}
C_1^{(12,2)} &= [000100001110], & a_{C_1}^{(12,2)} &= 12 - 4 = 8, \\
C_2^{(12,2)} &= [110100001110], & a_{C_2}^{(12,2)} &= 8, \\
C_3^{(12,2)} &= [100100001110], & a_{C_3}^{(12,2)} &= 8,
\end{aligned}
\tag{A19}
$$

have the same non-maximum ASI and only one

$$C_{(N-3)}^{(12,2)} = [010100001110], \quad a_{\text{max}}^{(N_{(N-1)}+5,\,2)}\left(C_{(N-3)}^{(12,2)}\right) = 12 - 3 = 9, \tag{A20}$$

cannot be further extended along with the increment of the ASI. Therefore

$$C_{(N-3)}^{(N,b)} = [01010000111222\dots10\dots], \quad a_{\text{max}}^{(N_{(N-1)}+5,\,b)}\left(C_{(N-3)}^{(N,b)}\right) = N - 3, \tag{A21}$$

and the ASI of this newly formed string increases again. However, the insertion of another symbol into this string will maintain or even decrease the ASI of the newly formed string. Thus, Theorem 15 is proven.

Appendix H. Misunderstanding Assembly Pools

Consider the following mapping [18] between a working assembly pool $P_3^{(5)}$, containing five basic symbols and three strings made of these symbols in three steps, and the initial assembly pool of radix $b = 8$

$$
\begin{array}{ccc}
P_3^{(5)} & & P_0^{(8)} \\
0 & \leftrightarrow & 0 \\
1 & \leftrightarrow & 1 \\
2 & \leftrightarrow & 2 \\
3 & \leftrightarrow & 3 \\
4 & \leftrightarrow & 4 \\
20 & \leftrightarrow & 5 \\
201 & \leftrightarrow & 6 \\
2012 & \leftrightarrow & 7
\end{array}
\tag{A22}
$$

Now consider the string

$$C_k^{(11,5)} = [20123242012] \tag{A23}$$

assembled beginning with the initial assembly pool $P_0^{(5)}$ and having the ASI $a^{(11,5)}(C_k) = 7$, only two steps above $a_{\text{min}}^{(11)} = 5$, as we can assemble this string as the string

$$C_l^{(8,8)} = [20123247] \tag{A24}$$

of length $N = 8$ in 7 steps with the initial assembly pool $P_0^{(8)}$ and then, using the mapping (A22), it will correspond to the string (A23). However, as we have shown in Section 2, $N_{(N-1)}(8) = 73 \neq 7$. In fact, the latter string (A24) should be assembled as

$$C_m^{(5,8)} = [73247] \tag{A25}$$

with the ASI $a^{(5,8)}(C_m) = 5 - 1 = 4$ and with the initial assembly pool $P_0^{(8)}$, as $2012 \leftrightarrow 7$ according to the mapping (A22). Hence, considering a set $P_3^{(5)}$ as the initial assembly pool is a gross misunderstanding; there is only one initial assembly pool for a given $b$ and many different working assembly pools for $b > 1$ and $s > 1$ ($P_1^{(1)} = \{0, 00\}$). Furthermore, basic objects must have the same vanishing AD (12).
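The effect of the mapping (A22) on the strings (A24) and (A25) can be followed with a few lines of Python. In the sketch below, the dictionary `MAPPING` and the helper `expand` are our names; the code only substitutes the symbols 5, 6, and 7 of the radix-8 pool by the strings 20, 201, and 2012, and it does not compute any assembly index.

```python
# Symbols of the radix-8 pool that stand for assembled strings of the radix-5 pool, per (A22).
MAPPING = {"5": "20", "6": "201", "7": "2012"}


def expand(s, mapping=MAPPING):
    """Rewrite a radix-8 string as the radix-5 string it corresponds to under the mapping."""
    return "".join(mapping.get(symbol, symbol) for symbol in s)


print(expand("20123247"))  # 20123242012
print(expand("73247"))     # 20123242012
```

Both radix-8 strings expand to the same radix-5 string (A23), which is why the shorter form (A25) is the appropriate one to consider with the initial assembly pool $P_0^{(8)}$.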

References

  1. S. M. Marshall, A. R. G. Murray, and L. Cronin, “A probabilistic framework for identifying biosignatures using Pathway Complexity,” Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, vol. 375, p. 20160342, Dec. 2017.
  2. S. Imari Walker, L. Cronin, A. Drew, S. Domagal-Goldman, T. Fisher, and M. Line, “Probabilistic biosignature frameworks,” in Planetary Astrobiology (V. Meadows, G. Arney, B. Schmidt, and D. J. Des Marais, eds.), pp. 1–1, University of Arizona Press, 2019.
  3. V. S. Meadows, G. N. Arney, B. E. Schmidt, and D. J. Des Marais, eds., Planetary astrobiology. University of Arizona space science series, Tucson: The University of Arizona Press ; Houston : Lunar and Planetary Institute, 2020. OCLC: 1151198948.
  4. Y. Liu, C. Mathis, M. D. Bajczyk, S. M. Marshall, L. Wilbraham, and L. Cronin, “Exploring and mapping chemical space with molecular assembly trees,” Science Advances, vol. 7, p. eabj2465, Sept. 2021.
  5. S. M. Marshall, C. Mathis, E. Carrick, G. Keenan, G. J. T. Cooper, H. Graham, M. Craven, P. S. Gromski, D. G. Moore, S. I. Walker, and L. Cronin, “Identifying molecules as biosignatures with assembly theory and mass spectrometry,” Nature Communications, vol. 12, p. 3033, May 2021.
  6. S. M. Marshall, D. G. Moore, A. R. G. Murray, S. I. Walker, and L. Cronin, “Formalising the Pathways to Life Using Assembly Spaces,” Entropy, vol. 24, p. 884, June 2022.
  7. A. Sharma, D. Czégel, M. Lachmann, C. P. Kempes, S. I. Walker, and L. Cronin, “Assembly theory explains and quantifies selection and evolution,” Nature, vol. 622, pp. 321–328, Oct. 2023.
  8. M. Jirasek, A. Sharma, J. R. Bame, S. H. M. Mehr, N. Bell, S. M. Marshall, C. Mathis, A. MacLeod, G. J. T. Cooper, M. Swart, R. Mollfulleda, and L. Cronin, “Investigating and Quantifying Molecular Complexity Using Assembly Theory and Spectroscopy,” ACS Central Science, vol. 10, pp. 1054–1064, May 2024.
  9. S. Łukaszyk and W. Bieniawski, “Assembly Theory of Binary Messages,” Mathematics, vol. 12, p. 1600, May 2024.
  10. S. Raubitzek, A. Schatten, P. König, E. Marica, S. Eresheim, and K. Mallinger, “Autocatalytic Sets and Assembly Theory: A Toy Model Perspective,” Entropy, vol. 26, p. 808, Sept. 2024.
  11. S. Łukaszyk and A. Tomski, “Omnidimensional Convex Polytopes,” Symmetry, vol. 15, Mar. 2023.
  12. “Book of John [1.3],” c90.
  13. S. Pagel, A. Sharma, and L. Cronin, “Mapping Evolution of Molecules Across Biochemistry with Assembly Theory,” 2024.
  14. M. M. Vopson, “The second law of infodynamics and its implications for the simulated universe hypothesis,” AIP Advances, vol. 13, p. 105308, Oct. 2023.
  15. S. Łukaszyk, “Shannon entropy of chemical elements,” European Journal of Applied Sciences, vol. 11, pp. 443–458, Jan. 2024.
  16. S. Łukaszyk, Black Hole Horizons as Patternless Binary Messages and Markers of Dimensionality, ch. 15, pp. 317–374. Nova Science Publishers, 2023.
  17. S. Łukaszyk, “Life as the explanation of the measurement problem,” Journal of Physics: Conference Series, vol. 2701, p. 012124, Feb. 2024.
  18. L. Ozelim, A. Uthamacumaran, F. S. Abrahão, S. Hernández-Orozco, N. A. Kiani, J. Tegnér, and H. Zenil, “Assembly Theory Reduced to Shannon Entropy and Rendered Redundant by Naive Statistical Algorithms,” 2024.
Figure 1. Lower assembly index bound (red) and upper bounds (green) for $1 \le b \le 4$, lower assembly depth bound (blue) of $C_{\text{max}}^{(N,b)}$ strings for $b > 1$, and $\log_2(N)$ (red, dash-dot) for $0 < N \le 33$.
Figure 2. Lower assembly index bound (red) and the upper bounds (green) for $1 \le b \le 94$ and $0 < N \le 8873$.
Table 1. Distributions of n-plets in strings of maximum ASI.
$N$ ×2($b=1$) ×2($b=2$) ×2($b=3$) ×2($b=4$) ×4 ×8 ×16 ×32 last×8 last×4 last×2 last×1 $a_{\text{max}}^{(N,1)}$ $a_{\text{max}}^{(N,2)}$ $a_{\text{max}}^{(N,3)}$ $a_{\text{max}}^{(N,4)}$
1 0 0 0 0 0 0 0 0 N N N 0 0 0 0
2 1 1 1 1 0 0 0 0 N N N 1 1 1 1
3 1 1 1 1 0 0 0 0 N N Y 2 2 2 2
4 1 2 2 2 1 0 0 0 N N N 2 3 3 3
5 1 2 2 2 1 0 0 0 N N Y 3 4 4 4
6 1 3 3 3 1 0 0 0 N Y N 3 5 5 5
7 1 3 3 3 1 0 0 0 N Y Y 4 6 6 6
8 1 3 4 4 2 1 0 0 N N N 3 6 7 7
9 1 3 4 4 2 1 0 0 N N Y 4 7 8 8
10 1 4 5 5 2 1 0 0 N Y N 4 8 9 9
11 1 3 5 5 2 1 0 0 N Y Y 5 8 10 10
12 1 4 6 6 3 1 0 0 Y N N 4 9 11 11
13 1 3 6 6 3 1 0 0 Y N Y 5 9 12 12
14 1 4 6 7 3 1 0 0 Y Y N 5 10 12 13
15 1 3 6 7 3 1 0 0 Y Y Y 6 10 13 14
16 1 4 7 8 4 2 1 0 N N N N 4 11 14 15
17 1 3 6 8 4 2 1 0 N N N Y 5 11 14 16
18 1 4 7 9 4 2 1 0 N N Y N 5 12 15 17
19 1 3 6 9 4 2 1 0 N N Y Y 6 12 15 18
20 1 4 7 10 5 2 1 0 N Y N N 5 13 16 19
21 1 3 6 10 5 2 1 0 N Y N Y 6 13 16 20
22 1 4 7 10 5 2 1 0 N Y Y N 6 14 17 20
23 1 3 6 10 5 2 1 0 N Y Y Y 7 14 17 21
24 1 4 7 11 6 3 1 0 Y N N N 5 15 18 22
25 1 3 6 10 6 3 1 0 Y N N Y 6 15 18 22
26 1 4 7 11 6 3 1 0 Y N Y N 6 16 19 23
27 1 3 6 10 6 3 1 0 Y N Y Y 7 16 19 23
28 1 4 7 11 7 3 1 0 Y Y N N 6 17 20 24
29 1 3 6 10 7 3 1 0 Y Y N Y 7 17 20 24
30 1 4 7 11 7 3 1 0 Y Y Y N 7 18 21 25
31 1 3 6 11 7 3 1 0 Y Y Y Y 8 18 21 25
32 1 4 7 11 8 4 2 1 N N N N 5 19 22 26
33 1 3 6 11 8 4 2 1 N N N Y 6 19 22 26
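Several columns of Table 1 follow from the binary expansion of $N$. The Python sketch below encodes our reading of the table, which is not a formula stated in the text and was checked only against the rows shown ($1 \le N \le 33$): the ×4 to ×32 columns equal $\lfloor N/4 \rfloor, \dots, \lfloor N/32 \rfloor$, the "last" columns mark the set bits of $N$ below its leading bit, and the $b = 1$ column equals $\lfloor \log_2 N \rfloor$ plus the number of set bits of $N$ minus one. All function names are hypothetical.

```python
def group_counts(n):
    """Entries of the x4, x8, x16, and x32 columns for a string of length n."""
    return {k: n // k for k in (4, 8, 16, 32)}


def last_flags(n):
    """Powers of two marked 'Y' in the 'last' columns: set bits of n below its leading bit."""
    leading = 1 << (n.bit_length() - 1)
    return [k for k in (8, 4, 2, 1) if k < leading and n & k]


def a_radix1(n):
    """Entry of the b = 1 column: floor(log2(n)) + (number of set bits of n) - 1."""
    return (n.bit_length() - 1) + bin(n).count("1") - 1


for n in (6, 13, 24, 33):
    print(n, group_counts(n), last_flags(n), a_radix1(n))
```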
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
Copyright: This open access article is published under a Creative Commons CC BY 4.0 license, which permits free download, distribution, and reuse, provided that the author and preprint are cited in any reuse.