Preprint
Article

This version is not peer-reviewed.

On the Salient Limits of Strings in Assembly Theory

Submitted:

19 September 2024

Posted:

20 September 2024


Abstract
We show that the maximum length $N$ of a string that is composed of $b$ different basic symbols according to the principles of assembly theory and has the assembly index $N-1$ is $b^2+b+1$, and that this odd-length string is nearly balanced.

1. Introduction

Assembly theory (AT), formulated in 2017, introduced the concept of an initial pool [1].
Definition 1.
We call a set $P_0(b) := \{0, 1, \dots, b-1\}$ that contains $b$ different basic symbols $c$, where $b$ is a finite natural radix, the initial assembly pool.
The reader will find numerous results on AT published since 2017 in refs. [1,2,3,4,5,6,7,8,9].
In this short note, we extend the results of our previous study [9] to strings of any natural radix $b$. Within the AT framework, we consider the formation of strings $C_k^{(N,b)}$ of length $N$, containing symbols from the initial assembly pool, in consecutive assembly steps from basic symbols $c$ and strings assembled in previous steps.
Definition 2.
We call a set $P_s(b)$ that contains basic symbols and strings assembled in previous steps $\{1, 2, \dots, s\}$ the working assembly pool.
Using Definitions 1 and 2, the assembly index of a string is the minimal achievable difference between the cardinalities of the working and initial assembly pools leading to this string, since the cardinality of the working assembly pool increases by one at each assembly step. Therefore, in contrast to the working assembly pool (Definition 2), the initial assembly pool (Definition 1) must not contain strings of basic symbols. To illustrate this, consider the following mapping between such a faulty initial assembly pool, containing five basic symbols and three strings of these symbols, and the initial assembly pool of radix $b = 8$
$$
\begin{array}{rcl}
P_0(5) & \to & P_0(8) \\
0 & \to & 0 \\
1 & \to & 1 \\
2 & \to & 2 \\
3 & \to & 3 \\
4 & \to & 4 \\
20 & \to & 5 \\
201 & \to & 6 \\
2012 & \to & 7
\end{array}
\tag{1}
$$
Now consider the string
$$ C_k^{(11,5)} = [20123242012] \tag{2} $$
assembled beginning with the initial assembly pool $P_0(5)$ and having the assembly index $a^{(11,5)}(C_k) = 7$, only two steps above $a_{\min}^{(11)} = 5$. We can assemble the string
$$ C_l^{(8,8)} = [20123247] \tag{3} $$
of length $N = 8$ in 7 steps with the initial assembly pool $P_0(8)$ and then, using the mapping (1), it will correspond to the string (2). However, as we shall show in the following section, $N_{\max}(8) = 73 \neq 7$. In fact, the latter string (3) should be assembled as
$$ C_m^{(5,8)} = [73247] \tag{4} $$
with the assembly index $a^{(5,8)}(C_m) = 5 - 1 = 4$ and with the initial assembly pool $P_0(8)$, as $2012 \to 7$ according to (1).
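The correspondence between strings (2), (3) and (4) can be checked mechanically. The following is a minimal sketch, assuming that the mapping (1) is applied symbol by symbol; the names `MAPPING` and `expand` are illustrative and do not come from the text.

```python
# Mapping (1): symbols 5, 6, 7 of the radix-8 pool stand for the strings
# 20, 201, 2012 of the faulty radix-5 pool; symbols 0..4 map to themselves.
MAPPING = {"5": "20", "6": "201", "7": "2012"}


def expand(string_radix8: str) -> str:
    """Rewrite a radix-8 string using only the five basic symbols 0..4."""
    return "".join(MAPPING.get(symbol, symbol) for symbol in string_radix8)


print(expand("20123247"))  # string (3)
print(expand("73247"))     # string (4)
```

Both calls print string (2), which shows that strings (3) and (4) describe the same radix-5 string and differ only in how much of it is hidden inside the symbol 7.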
The following two theorems were already stated in our previous study [9] for $b = 2$. We restate them here for all $b$ for clarity.
Theorem 1.
A string of length $N = 4$ is the shortest string that allows for more than one assembly index for all $b$.
Proof.
$N = 2$ provides $b^2$ available strings with unit assembly indices. $N = 3$ provides $b^3$ available strings with assembly indices equal to two. Only $N = 4$ provides $b^4$ strings that include $b$ strings $C_k^{(4,b)} = [****]$ and $b(b-1)$ strings $C_k^{(4,b)} = [*\star*\star]$ with assembly indices equal to two, while the assembly index of the remaining strings is three. For example, to assemble the string $C_k^{(4,4)} = [0202]$, we need to assemble the string $[02]$ and reuse it from $P_1$, while there is nothing available to reuse in the case of the string $C_l^{(4,4)} = [0123]$. □
Where the symbol value can be arbitrary, we write $*$, assuming that it is the same within the string. If we allow for a second value different from $*$, we write $\star$. Thus, $C_k^{(2,b)} = [**]$, for example, is a placeholder for all $b$ strings of two equal symbols, while $C_l^{(2,b)} = [*\star]$ is a placeholder for all $b(b-1)$ strings of two different symbols.
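The counts in the proof of Theorem 1 can be reproduced by exhaustive search for small $N$ and $b$. The sketch below assumes that one assembly step joins two elements of the working pool (basic symbols or previously assembled strings) and adds the result to the pool; `assembly_index` and `index_histogram` are illustrative names, and the search is only practical for short strings.

```python
from collections import Counter
from itertools import product


def assembly_index(target: str) -> int:
    """Minimal number of joining steps needed to assemble `target`."""
    if len(target) <= 1:
        return 0
    # Only substrings of the target can be useful intermediate strings.
    useful = {target[i:j] for i in range(len(target))
              for j in range(i + 2, len(target) + 1)}
    basics = set(target)

    def search(pool: frozenset, steps_left: int) -> bool:
        if target in pool:
            return True
        if steps_left == 0:
            return False
        items = basics | pool
        for left in items:
            for right in items:
                joined = left + right
                if joined in useful and joined not in pool:
                    if search(pool | {joined}, steps_left - 1):
                        return True
        return False

    steps = 1
    while not search(frozenset(), steps):  # iterative deepening
        steps += 1
    return steps


def index_histogram(length: int, radix: int) -> Counter:
    """Count the strings of a given length by their assembly index."""
    symbols = "0123456789"[:radix]
    return Counter(assembly_index("".join(s))
                   for s in product(symbols, repeat=length))


print(index_histogram(4, 3))  # Counter({3: 72, 2: 9})
```

For $b = 3$ and $N = 4$ the nine strings with index two are the $b$ strings $[****]$ and the $b(b-1)$ strings $[*\star*\star]$, in agreement with the proof.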
Theorem 2.
The smallest string assembly index $a^{(N)}(C_{\min})$ as a function of $N$ corresponds to the shortest addition chain for $N$ (OEIS A003313) for all $b$.
Proof.
Strings $C_{\min}$ for which $a^{(N)}(C_{\min}) = \min_k \{a^{(N,b)}(C_k)\}$, $\forall k \in \{1, 2, \dots, b^N\}$, can be formed in subsequent steps $s$ by joining the longest string assembled so far with itself until $N = 2^s$ is reached. Therefore, if $N = 2^s$, then $\min_k \{a^{(2^s)}(C_k)\} = s = \log_2(N)$. Only $b^2$ strings have such an assembly index in this case, including $b$ strings
$$ C_k^{(2^s,b)} = [** \dots], \tag{5} $$
and $b(b-1)$ strings
$$ C_k^{(2^s,b)} = [*\star*\star \dots], \tag{6} $$
and the assembly pathway of each of the strings (5) and (6) is unique. At each assembly step, its length doubles.
An addition chain for $N \in \mathbb{N}$ having the shortest length $s \in \mathbb{N}$ (commonly denoted as $l(N)$) is defined as a sequence $1 = a_0 < a_1 < \dots < a_s = N$ of integers such that $\forall j \geq 1$, $a_j = a_k + a_l$ for $l \leq k < j$. The first step in creating an addition chain for $N$ is always $a_1 = 1 + 1 = 2$ and this corresponds to assembling a doublet $[**]$ or $[*\star]$ from the initial assembly pool $P_0$. Thus, the lower bound for $s$ of the addition chain for $N$, $s \geq \log_2(N)$, is achieved for $N = 2^s$ by strings (5) and (6).
The second step in creating an addition chain can be $a_2 = 2 + 2 = 4$ or $a_2 = 1 + 2 = 3$. Thus, finding the shortest addition chain for $N$ corresponds to finding an assembly index of a string containing basic symbols and/or doublets and/or triplets containing these doublets for $N \neq 2^s$ since, due to Theorem 1, only they provide the same assembly indices $\{0, 1, 2\}$. □
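The lengths $l(N)$ of the shortest addition chains used in Theorem 2 can be computed by a small iterative-deepening search. This is a sketch under the definition given in the proof; the function name `shortest_addition_chain_length` is illustrative, and the search is meant for small $N$ only.

```python
def shortest_addition_chain_length(n: int) -> int:
    """Length l(n) of a shortest addition chain 1 = a_0 < ... < a_s = n."""
    if n == 1:
        return 0

    def extend(chain: list, limit: int) -> bool:
        last = chain[-1]
        if last == n:
            return True
        steps_used = len(chain) - 1
        if steps_used == limit:
            return False
        # Even doubling at every remaining step cannot reach n: prune.
        if last << (limit - steps_used) < n:
            return False
        # Try every sum a_k + a_l with l <= k that keeps the chain increasing.
        for k in range(len(chain) - 1, -1, -1):
            for l in range(k, -1, -1):
                candidate = chain[k] + chain[l]
                if last < candidate <= n and extend(chain + [candidate], limit):
                    return True
        return False

    limit = 1
    while not extend([1], limit):
        limit += 1
    return limit


print([shortest_addition_chain_length(n) for n in range(1, 17)])
# [0, 1, 2, 2, 3, 3, 4, 3, 4, 4, 5, 4, 5, 5, 5, 4]
```

The printed values are the first sixteen terms of OEIS A003313; in particular $l(11) = 5$, the value of $a_{\min}^{(11)}$ quoted in the Introduction, and $l(2^s) = s$.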

2. Results

The seven-bit string is the longest string that can have the maximum assembly index $a_{\max}^{(7,2)} = 7 - 1 = 6$. There are four such bitstrings containing two clear triplets and the starting bit at the end or the ending bit at the start, that is
$$ [***\star\star\star*] \quad \text{and} \quad [\star***\star\star\star], \tag{7} $$
and their lengths cannot be increased without a repetition of a doublet, which inevitably reduces the assembly index to $a_{\max}^{(8,2)} = 8 - 2 = 6$.
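The four bitstrings (7) can be enumerated directly. The sketch below assumes the criterion suggested by the paragraph above: a string of length $N$ keeps the assembly index $N - 1$ exactly when no doublet occurs twice at non-overlapping positions (the overlapping pair inside a clear triplet such as 000 is allowed). The function name `count_max_index_strings` is illustrative.

```python
def count_max_index_strings(radix: int, length: int) -> int:
    """Count strings in which no doublet repeats without overlap."""

    def extend(string: list, first_seen: dict) -> int:
        if len(string) == length:
            return 1
        total = 0
        for symbol in range(radix):
            added = None
            if string:
                doublet = (string[-1], symbol)
                start = len(string) - 1          # start index of the new doublet
                earlier = first_seen.get(doublet)
                if earlier is None:
                    first_seen[doublet] = start
                    added = doublet
                elif start - earlier >= 2:
                    continue                     # non-overlapping repetition
            string.append(symbol)
            total += extend(string, first_seen)
            string.pop()
            if added is not None:
                del first_seen[added]            # undo before the next branch
        return total

    return extend([], {})


print(count_max_index_strings(2, 7))  # 4
print(count_max_index_strings(2, 8))  # 0
```

Under this criterion every diagonal doublet can occur at most twice and every other doublet at most once, so a string holds at most $2b + b(b-1) = b^2 + b$ doublets, which is why the count drops to zero at length eight for $b = 2$.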
This observation and Theorem 2 motivated us to develop a general procedure to construct the longest possible string that has the assembly index $a_{\max}^{(N,b)} = N - 1$, as a function of the radix $b \geq 3$. We denote the length of this string by $N_{\max}(b)$.
After a few trial attempts (cf. Appendices A and B), we eventually reached a stable procedure. We start with an initial balanced string of length $3b$ containing $b$ clear triplets ordered as
$$ [0001112 \dots (b-2)(b-1)(b-1)(b-1)]. \tag{8} $$
The doublets that can be inserted into the initial string (8) can be arranged in a $b \times b$ matrix
$$
\begin{bmatrix}
\cancel{00} & 01 & 02 & \dots & 0(b-1) \\
10 & \cancel{11} & 12 & \dots & 1(b-1) \\
20 & 21 & \cancel{22} & \dots & 2(b-1) \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
(b-2)0 & (b-2)1 & (b-2)2 & \dots & (b-2)(b-1) \\
(b-1)0 & (b-1)1 & (b-1)2 & \dots & \cancel{(b-1)(b-1)}
\end{bmatrix},
\tag{9}
$$
where the crossed-out entries on the diagonal cannot be reused, as they would create repetitions in this string. If we assume that we shall not insert doublets between the clear triplets of the string (8), then we can also cross out the entries on the first superdiagonal of the matrix (9), since the string (8) already contains them.
In the first step, we create a string containing the doublets on the first subdiagonal of the matrix (9), starting with 10,
$$ [102132 \dots (b-2)(b-3)(b-1)(b-2)], \tag{10} $$
and we append it to the string (8). With this step, we also eliminate the doublets on the second superdiagonal starting with the doublet 02, as well as the doublet $(b-1)1$. In the second step, we create a string containing the doublets on the third superdiagonal, beginning with the doublet 03,
$$ [0314 \dots (b-5)(b-2)(b-4)(b-1)], \tag{11} $$
and append it to the string created so far. With this step, we also remove the doublet $(b-2)0$ and the middle part of the second subdiagonal containing $\{31, 42, \dots, (b-2)(b-4)\}$. The subsequent steps proceed in the same way.
We shall illustrate this process for $b = 8$. The matrix
$$
\begin{bmatrix}
00 & 01 & 02 & 03 & 04 & 05 & 06 & 07 \\
10 & 11 & 12 & 13 & 14 & 15 & 16 & 17 \\
20 & 21 & 22 & 23 & 24 & 25 & 26 & 27 \\
30 & 31 & 32 & 33 & 34 & 35 & 36 & 37 \\
40 & 41 & 42 & 43 & 44 & 45 & 46 & 47 \\
50 & 51 & 52 & 53 & 54 & 55 & 56 & 57 \\
60 & 61 & 62 & 63 & 64 & 65 & 66 & 67 \\
70 & 71 & 72 & 73 & 74 & 75 & 76 & 77
\end{bmatrix}
\tag{12}
$$
contains all the doublets that were used to create the string of length $N_{\max}(8) = 73$
$$ [000111222333444555666777 | 10213243546576 | 0314253647 | 04152637 | 2075 | 051627 | 306174 | 0], \tag{13} $$
for $b = 7$ we would obtain the string of length $N_{\max}(7) = 57$
$$ [000111222333444555666 | 102132435465 | 03142536 | 041526 | 2064 | 0516 | 30], \tag{14} $$
for $b = 6$ we would obtain the string of length $N_{\max}(6) = 43$
$$ [000111222333444555 | 1021324354 | 031425 | 0415 | 2053 | 0], \tag{15} $$
for $b = 5$ we would obtain the string of length $N_{\max}(5) = 31$
$$ [000111222333444 | 10213243 | 0314 | 04 | 20], \tag{16} $$
for $b = 4$ we would obtain the string of length $N_{\max}(4) = 21$
$$ [000111222333 | 102132 | 03 | 0], \tag{17} $$
and $b = 3$ leads to the following string of length $N_{\max}(3) = 13$
$$ [000111222 | 10 | 20]. \tag{18} $$
The final string is always terminated by 0.
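The six strings listed above for $b = 3, \dots, 8$ can be checked for length, doublet repetitions and balance. The sketch below assumes, as the text suggests, that the index $N - 1$ is kept exactly when no doublet occurs twice at non-overlapping positions; the names `STRINGS`, `has_repeated_doublet` and `check` are illustrative.

```python
from collections import Counter

# Strings for b = 8, ..., 3 copied from the text, with the segment bars kept.
STRINGS = {
    8: "000111222333444555666777|10213243546576|0314253647|04152637|2075|051627|306174|0",
    7: "000111222333444555666|102132435465|03142536|041526|2064|0516|30",
    6: "000111222333444555|1021324354|031425|0415|2053|0",
    5: "000111222333444|10213243|0314|04|20",
    4: "000111222333|102132|03|0",
    3: "000111222|10|20",
}


def has_repeated_doublet(string: str) -> bool:
    """True if some doublet occurs twice at non-overlapping positions."""
    first_seen = {}
    for start in range(len(string) - 1):
        doublet = string[start:start + 2]
        earlier = first_seen.setdefault(doublet, start)
        if start - earlier >= 2:
            return True
    return False


def check(radix: int, segmented: str) -> bool:
    """Length b^2 + b + 1, no repeated doublet, and nearly balanced counts."""
    string = segmented.replace("|", "")
    counts = Counter(string)
    return (
        len(string) == radix ** 2 + radix + 1
        and not has_repeated_doublet(string)
        and counts["0"] == radix + 2
        and all(counts[str(c)] == radix + 1 for c in range(1, radix))
    )


for radix, segmented in STRINGS.items():
    print(radix, check(radix, segmented))
```

Each line prints `True`: every listed string has length $b^2 + b + 1$, contains the symbol 0 exactly $b + 2$ times and every other symbol $b + 1$ times, and repeats no doublet outside the clear triplets.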
The strings of odd lengths generated by the general procedure outlined above are not only the longest, but also the most balanced. This leads to the following theorem.
Theorem 3.
The longest length $N_{\max}(b)$ of a string composed of $b$ different basic symbols that has the assembly index of $N - 1$ is given by
$$ N_{\max}(b) = 3b + (b-1)^2 = b^2 + b + 1 \tag{19} $$
(OEIS A353887) and this string is nearly balanced, that is
$$ N_{\max}(b) = b N_c(b) + 1, \tag{20} $$
where $N_c = b + 1$ is the number of occurrences of all but one symbol within the string.
Proof.
The $N_{\max}(b)$ given by formula (19) is an odd number for all $b$. As shown in Table 1, the first term $3b$ is the length of the initial string (8) containing $b$ clear triplets and $(b-1)^2$ is the number of entries in the doublet matrix (9) of the previous $b$. By definition, a string of length $N_{\max}(b)$ cannot have any repetitions; it can only contain doublets and clear triplets that do not contain these doublets. Therefore, to be the most patternless, this string must maximize the Shannon entropy, that is, it must be the most balanced. For the string of the form (20) the fractions in the Shannon entropy are
$$ p_0 = \frac{N_c(b) + 1}{N_{\max}(b)}, \qquad p_{1,2,\dots,b-1} = \frac{N_c(b)}{N_{\max}(b)}, \tag{21} $$
(where without loss of generality we assume that the symbol occurring $N_c(b) + 1$ times within the string is $c = 0$) and the Shannon entropy is
$$
\begin{aligned}
H_{\max}(b) &= -\sum_{c=0}^{b-1} p_c \log_2(p_c) \\
&= -(b-1)\,\frac{N_{\max}(b) - 1}{b\,N_{\max}(b)} \log_2\!\left(\frac{N_{\max}(b) - 1}{b\,N_{\max}(b)}\right) - \frac{N_{\max}(b) - 1 + b}{b\,N_{\max}(b)} \log_2\!\left(\frac{N_{\max}(b) - 1 + b}{b\,N_{\max}(b)}\right) \\
&= \frac{1 - b^2}{b^2 + b + 1} \log_2\!\left(\frac{b + 1}{b^2 + b + 1}\right) - \frac{b + 2}{b^2 + b + 1} \log_2\!\left(\frac{b + 2}{b^2 + b + 1}\right) \leq \log_2(b).
\end{aligned}
\tag{22}
$$
The strings given by the equation (19) are not the shortest possible ones. Strings satisfying the equation (20) and satisfying $\min(b N_c(b) + 1) > N_{\max}(b-1)$ have lengths given by $b^2 + 1$ (OEIS A002522). However, they do not contain all the possible doublets and, furthermore, their entropies are smaller than the entropies of the strings given by the equation (19).
Now, assume a contrario that a string longer than $N_{\max}(b)$ can be constructed, say of length $N'_{\max}(b) = N_{\max}(b) + 1$. But in this case, the corresponding $H'_{\max}(b) < H_{\max}(b)$. The string of the length given by the formula (19) maximizes the Shannon entropy if it must additionally satisfy the relation (20). □
Although the case $b = 1$ (only one symbol) is degenerate, the formula (19) yields the correct result; the string $[000]$ is the longest string with $a_{\max}^{(N,1)} = N - 1$, as for $b = 1$ we simply have $a_{\max}^{(N,1)} = a_{\min}^{(N)}$ (OEIS A003313).
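The values of $N_{\max}(b)$ and $H_{\max}(b)$ listed in Table 1 follow directly from formulas (19) to (22). A minimal sketch that recomputes them is given below; the function names `n_max` and `h_max` are illustrative.

```python
from math import log2


def n_max(b: int) -> int:
    """Formula (19): N_max(b) = 3b + (b - 1)^2 = b^2 + b + 1."""
    return b * b + b + 1


def h_max(b: int) -> float:
    """Shannon entropy (22) of a nearly balanced string of length N_max(b)."""
    n = n_max(b)
    n_c = b + 1                                    # occurrences of symbols 1..b-1
    probabilities = [(n_c + 1) / n] + [n_c / n] * (b - 1)
    return sum(p * log2(1 / p) for p in probabilities)


for b in range(1, 14):
    print(f"{b:2d}  {3 * b:2d} + {(b - 1) ** 2:3d} = {n_max(b):3d}  {h_max(b):.4f}")
```

For example, the rows for $b = 2$ and $b = 8$ give 0.9852 and 2.9991, and every value stays at or below $\log_2(b)$, as stated in (22).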

3. Conclusions

There is one string of length $N_{\max}(1) = 3$, there are four strings of length $N_{\max}(2) = 7$, and seventy-two strings of length $N_{\max}(3) = 13$ (cf. Appendix A). Their number for $b \geq 4$ requires further research.

Author Contributions

WB: First concept of a general procedure for constructing the string of length $N_{\max}(b)^{(A)}$; determining $N_{\max}(b)$ for $3 \leq b \leq 9$; third concept of a general procedure for constructing the string of length $N_{\max}(b)$ leading to Theorem 3; noting that $N_{\max}(b)$ must be more balanced than $N_{\max}(b)^{(B)}$; numerous clarity corrections and improvements. PM: Second concept of a general procedure for constructing the string of length $N_{\max}(b)^{(B)}$; numerous clarity corrections and improvements. SŁ: The remaining part of the study.

Acknowledgments

The authors thank Andrzej Tomski for numerous clarity corrections and improvements and Mariola Bala for motivation. SŁ thanks his wife, Magdalena Bartocha, for her everlasting support.

Appendix A. Method A

In the first method of creating the longest patternless string that we developed, we started with the string of clear triplets (8), which we augmented with $b - 1$ doublets $\{02, 03, \dots, 0(b-1), 10\}$ to form the string
$$ [0001 \dots (b-2)(b-1)(b-1)(b-1)\,0203 \dots 0(b-1)10], \tag{A1} $$
of length
$$ N_{\max}(b)^{(A)} = 3b + 2(b-1) = 5b - 2. \tag{A2} $$
The introduction of the doublets from the first row of the matrix (9) and the doublet 10 into the string (8) also introduces other doublets. For $b = 3$ the augmented string (A1) has the length $N_{\max}(3) = 13$, as the insertion of 0210 at the end of the string (8) introduces the doublet 21. Thus, by construction, the doublet $(b-1)1$ (last row, second column) cannot be reused. For $b = 3$ only two doublets can be introduced without repetitions, leading to twelve unique strings of length $N = 13$
$$
\begin{gathered}
[000111222 | 0210], \quad [000111222 | 1020], \quad [20 | 21 | 000111222], \quad [21 | 02 | 000111222], \\
[0001112 | 02 | 22 | 10], \quad [0001112 | 10 | 22 | 20], \quad [21 | 000 | 20 | 111222], \quad [000 | 20 | 111222 | 10], \\
[02 | 000111222 | 10], \quad [20 | 00 | 21 | 0111222], \quad [21 | 0001112 | 02 | 22], \quad [21 | 000111222 | 02].
\end{gathered}
\tag{A3}
$$
Finally, we have to multiply the cardinality of this set by $3! = 6$ to account for permutations of the symbols. For example, the first string $[0001112220210]$ is equivalent to the five strings $[0002221110120]$, $[1110002221201]$, $[1112220001021]$, $[2220001112102]$, and $[2221110002012]$. Hence, there are seventy-two different strings of length $N_{\max}(3) = 13$. This method turned out to be valid for $b \in \{1, 3\}$ only, as $N_{\max}(4)^{(A)} = 18 \neq 21$ and not all available doublets were used.
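The multiplication by $3! = 6$ can be made explicit by relabelling the symbols of the twelve strings (A3). The sketch below only counts the strings produced in this way; the names `BASE_STRINGS` and `relabelings` are illustrative, and the strings are written without the segment bars.

```python
from itertools import permutations

# The twelve strings (A3) with the segment bars removed.
BASE_STRINGS = [
    "0001112220210", "0001112221020", "2021000111222", "2102000111222",
    "0001112022210", "0001112102220", "2100020111222", "0002011122210",
    "0200011122210", "2000210111222", "2100011120222", "2100011122202",
]


def relabelings(string: str, symbols: str = "012") -> set:
    """All strings obtained by permuting the symbol labels."""
    result = set()
    for image in permutations(symbols):
        table = str.maketrans(symbols, "".join(image))
        result.add(string.translate(table))
    return result


all_strings = set().union(*(relabelings(s) for s in BASE_STRINGS))
print(sorted(relabelings(BASE_STRINGS[0])))
print(len(all_strings))  # 72
```

The first line lists the string $[0001112220210]$ together with its five equivalents quoted above, and the twelve base strings yield seventy-two distinct strings, because no two of them are relabellings of each other.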

Appendix B. Method B

The second method we developed is an extension of the first one (Appendix A). We start with the augmented string (A1) and the doublet matrix
$$
\begin{bmatrix}
00 & 01 & 02 & 03 & \dots & 0(b-1) \\
10 & 11 & 12 & 13 & \dots & 1(b-1) \\
20 & 21 & 22 & 23 & \dots & 2(b-1) \\
30 & 31 & 32 & 33 & \dots & 3(b-1) \\
\vdots & \vdots & \vdots & \vdots & \ddots & \vdots \\
(b-2)0 & (b-2)1 & (b-2)2 & (b-2)3 & \dots & (b-2)(b-1) \\
(b-1)0 & (b-1)1 & (b-1)2 & (b-1)3 & \dots & (b-1)(b-1)
\end{bmatrix}.
\tag{A4}
$$
For $b > 3$, there are $(b-3)(b-2)/2$ and $(b-2)(b-1)/2 - 1$ (in total $(b-3)(b-1)$) doublets available, respectively, in the upper triangle (beginning with 13) and the lower triangle (beginning with 21) of the doublet matrix (A4). We can insert the doublets from the upper triangle into the augmented string (A1) as follows
$$ [0001112\,\underline{13}\,223\,\underline{1424}\,334 \dots (b-3)(b-2)\,\underline{1(b-1)2(b-1) \dots (b-3)(b-1)}\,(b-2)(b-2)(b-1)(b-1)(b-1)\,0203 \dots 0(b-1)10], \tag{A5} $$
creating a string of length
$$ N_{\max}(b)^{(B)} = 3b + 2(b-1) + (b-3)(b-2) = b^2 + 4. \tag{A6} $$
This method also turned out to be valid for $b = 3$ only, as $N_{\max}(4)^{(B)} = 20 \neq 21$, even though the strings of length $N_{\max}(b)^{(B)}$ contain all doublets of the matrix (9) without repetition. However, the strings (A5) created by this method are non-balanced and do not contain all $b$ available clear triplets. For example, for $b = 4$, the non-balanced string (A5) of length $N_{\max}(4)^{(B)} = 20$ that contains all possible doublets (but no clear triplet 222) is
$$ [00011121322333020310]. \tag{A7} $$
This led us to the third method described in Section 2.
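The lengths produced by Methods A and B can be compared with formula (19), and the properties stated for the string (A7) can be checked directly. This is a minimal sketch; the function and variable names are illustrative.

```python
from collections import Counter


def n_method_a(b: int) -> int:
    return 5 * b - 2          # equation (A2)


def n_method_b(b: int) -> int:
    return b * b + 4          # equation (A6)


def n_max(b: int) -> int:
    return b * b + b + 1      # equation (19)


# Radices for which each method reaches the length N_max(b).
print([b for b in range(1, 10) if n_method_a(b) == n_max(b)])  # [1, 3]
print([b for b in range(1, 10) if n_method_b(b) == n_max(b)])  # [3]

string_b4 = "00011121322333020310"   # string (A7)
doublets = {string_b4[i:i + 2] for i in range(len(string_b4) - 1)}
all_doublets = {x + y for x in "0123" for y in "0123"}

print(len(string_b4) == n_method_b(4))   # True
print(doublets == all_doublets)          # True: every doublet occurs
print("222" in string_b4)                # False: no clear triplet 222
print(sorted(Counter(string_b4).items()))
```

The last line prints the symbol counts 6, 5, 4 and 5 for the symbols 0 to 3, which shows why the string (A7) is not balanced.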

References

  1. Marshall, S.M.; Murray, A.R.G.; Cronin, L. A probabilistic framework for identifying biosignatures using Pathway Complexity. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences 2017, 375, 20160342.
  2. Imari Walker, S.; Cronin, L.; Drew, A.; Domagal-Goldman, S.; Fisher, T.; Line, M. Probabilistic Biosignature Frameworks. In Planetary Astrobiology; Meadows, V.; Arney, G.; Schmidt, B.; Des Marais, D.J., Eds.; University of Arizona Press, 2019; pp. 1–1.
  3. Meadows, V.S.; Arney, G.N.; Schmidt, B.E.; Des Marais, D.J., Eds. Planetary astrobiology; University of Arizona space science series, The University of Arizona Press; Houston: Lunar and Planetary Institute: Tucson, 2020. OCLC: 1151198948.
  4. Liu, Y.; Mathis, C.; Bajczyk, M.D.; Marshall, S.M.; Wilbraham, L.; Cronin, L. Exploring and mapping chemical space with molecular assembly trees. Science Advances 2021, 7, eabj2465.
  5. Marshall, S.M.; Mathis, C.; Carrick, E.; Keenan, G.; Cooper, G.J.T.; Graham, H.; Craven, M.; Gromski, P.S.; Moore, D.G.; Walker, S.I.; Cronin, L. Identifying molecules as biosignatures with assembly theory and mass spectrometry. Nature Communications 2021, 12, 3033.
  6. Marshall, S.M.; Moore, D.G.; Murray, A.R.G.; Walker, S.I.; Cronin, L. Formalising the Pathways to Life Using Assembly Spaces. Entropy 2022, 24, 884.
  7. Sharma, A.; Czégel, D.; Lachmann, M.; Kempes, C.P.; Walker, S.I.; Cronin, L. Assembly theory explains and quantifies selection and evolution. Nature 2023, 622, 321–328.
  8. Jirasek, M.; Sharma, A.; Bame, J.R.; Mehr, S.H.M.; Bell, N.; Marshall, S.M.; Mathis, C.; MacLeod, A.; Cooper, G.J.T.; Swart, M.; Mollfulleda, R.; Cronin, L. Investigating and Quantifying Molecular Complexity Using Assembly Theory and Spectroscopy. ACS Central Science 2024, 10, 1054–1064.
  9. Łukaszyk, S.; Bieniawski, W. Assembly Theory of Binary Messages. Mathematics 2024, 12, 1600.
Table 1. The maximum length $N_{\max}(b)$ of a string having the assembly index $a_{\max}^{(N_{\max}(b),b)} = N_{\max}(b) - 1$ and the corresponding Shannon entropy $H_{\max}(b)$, as a function of the radix $b$.
| $b$ | $N_{\max}(b) = 3b + (b-1)^2$ | $H_{\max}(b)$ |
|---|---|---|
| 1 | 3 + 0 = 3 | 0 |
| 2 | 6 + 1 = 7 | 0.9852 |
| 3 | 9 + 4 = 13 | 1.5766 |
| 4 | 12 + 9 = 21 | 1.9952 |
| 5 | 15 + 16 = 31 | 2.3190 |
| 6 | 18 + 25 = 43 | 2.5831 |
| 7 | 21 + 36 = 57 | 2.8061 |
| 8 | 24 + 49 = 73 | 2.9991 |
| 9 | 27 + 64 = 91 | 3.1692 |
| 10 | 30 + 81 = 111 | 3.3214 |
| 11 | 33 + 100 = 133 | 3.4590 |
| 12 | 36 + 121 = 157 | 3.5846 |
| 13 | 39 + 144 = 183 | 3.7002 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
Copyright: This open access article is published under a Creative Commons CC BY 4.0 license, which permits free download, distribution, and reuse, provided that the author and preprint are cited in any reuse.

Preprints.org is a free preprint server supported by MDPI in Basel, Switzerland.
