Algebraic rules for the percentage composition of oligomers in genomes

The article presents the author's results of studying hidden rules of the structural organization of long DNA sequences in eukaryotic and prokaryotic genomes. The results concern rules of percentages (or probabilities) of n-plets in genomes. To reveal such rules, the author uses a tensor family of matrix representations of the interrelated DNA alphabets of 4 nucleotides, 16 doublets, 64 triplets, and 256 tetraplets. If the percentages of each of these n-plets in the tested genomic DNA-texts are placed into the appropriate cells of the appropriate matrices, unexpected rules of invariance of the total sums of their percentages in certain tetra-groupings of n-plets are revealed. The author connects these results with a supposition of P. Jordan, one of the creators of quantum mechanics and quantum biology, that life's missing laws are the rules of chance and probability of the quantum world. Algebraic features of the genomic matrices of percentages of n-plets are analyzed and discussed. The results can be used for the further development of quantum biology.


Introduction.
The article continues the publications [Petoukhov, 2020a-c] of the author's results of studying hidden rules of the structural organization of long DNA nucleotide sequences (that is, DNA-texts) in eukaryotic and prokaryotic genomes.
P. Jordan, one of the founders of quantum mechanics, who also introduced the term "quantum biology," noted the main difference between living and inanimate objects: inanimate objects are controlled by the average random movement of their millions of particles, whose individual influence is negligible, while in a living organism selected genetic molecules have a dictatorial influence on the whole organism. Besides this, Jordan claimed that life's missing laws were the rules of chance and probability of the quantum world [Jordan, 1932; McFadden, Al-Khalili, 2018]. From the standpoint of Jordan's statement, the study of probabilities or percentages of n-plets (monoplets, doublets, triplets, etc., that is, oligomers of length n) in long DNA sequences is important for discovering hidden biological laws and for developing quantum biology. In his previous articles [Petoukhov, 2020a-c], the author described the universal hyperbolic rules of the oligomer cooperative organization of DNA nucleotide sequences in eukaryotic and prokaryotic genomes. The formulated rules concerned the total amounts of certain classes of n-plets. In this new preprint, the author focuses on searching for possible rules of probabilities (or percentages) of n-plets in genomes, in line with Jordan's supposition about the existence of such rules.
This research uses the well-known fact of the binary-oppositional features of DNA nucleotides (adenine A, thymine T, cytosine C, and guanine G), which allows constructing a family of square tables for the DNA alphabets of 4 nucleotides, 16 doublets, 64 triplets, …, 4^n n-plets. Each n-plet occupies its own strictly individual place in this family of tables, which form a tensor family of square matrices. Any DNA sequence of nucleotides (for example, CAGGTACAT...) can be represented as a sequence of oligomers of a fixed length n (for example, as a sequence of triplets CAG-GTA-CAT-...), and one can calculate the percentage of each of the n-plets in this special representation of the DNA sequence as a chain of n-plets. By placing the calculated percentage of each n-plet into the cell occupied by this n-plet in the appropriate square matrix, we obtain the numerical matrices of probabilities of all n-plets in the given DNA-text.
Analysis of this family of probability matrices for n-plets reveals hidden regularities in the structural organization of the studied genomic DNA-texts. These regularities are described and discussed below for the cases n = 1, 2, 3, 4 [Petoukhov, 2020a-c].

The matrix representation of the DNA alphabets on the basis of binary-oppositional traits of nucleotides.
As is known, the DNA alphabet of 4 nucleotides A, T, C, and G is endowed with a system of binary-oppositional traits or indicators [Fimmel, Petoukhov, 2020; Petoukhov, 2008; Petoukhov, 2008; Petoukhov, He, 2010; Stambuk, 1999]: 1) two of these molecules are purines with two rings (A and G), and the other two are pyrimidines with one ring (C and T); in terms of these oppositional indicators, C = T = 1, A = G = 0; 2) two of the letters are keto molecules (T and G), and the other two are amino molecules (C and A); in terms of these oppositional indicators, C = A = 1, T = G = 0.
In the DNA alphabet of 4 nucleotides, each of the letters C, A, T, and G is uniquely determined by its named binary indicators. With this in mind, it is convenient to present the sets of 4 DNA nucleotides, their 16 doublets, and 64 triplets in the form of square tables, whose columns are numbered with the binary indicators "pyrimidine or purine" (C = T = 1, A = G = 0) and whose rows are numbered with the binary indicators "amino or keto" (C = A = 1, T = G = 0). In such tables, all 4 nucleotides, 16 doublets, and 64 triplets of DNA automatically occupy their individual places in a strict order (Fig. 2.1).

        1    0
   1    C    A
   0    T    G

        11   10   01   00
   11   CC   CA   AC   AA
   10   CT   CG   AT   AG
   01   TC   TA   GC   GA
   00   TT   TG   GT   GG

         111   110   101   100   011   010   001   000
   111   CCC   CCA   CAC   CAA   ACC   ACA   AAC   AAA
   110   CCT   CCG   CAT   CAG   ACT   ACG   AAT   AAG
   101   CTC   CTA   CGC   CGA   ATC   ATA   AGC   AGA
   100   CTT   CTG   CGT   CGG   ATT   ATG   AGT   AGG
   011   TCC   TCA   TAC   TAA   GCC   GCA   GAC   GAA
   010   TCT   TCG   TAT   TAG   GCT   GCG   GAT   GAG
   001   TTC   TTA   TGC   TGA   GTC   GTA   GGC   GGA
   000   TTT   TTG   TGT   TGG   GTT   GTG   GGT   GGG

Fig. 2.1. The square tables of the DNA alphabets of 4 nucleotides, 16 doublets, 64 triplets, and 256 tetraplets, which are constructed by the method of binary numbering of their rows and columns and which are members of the tensor family of matrices [C, A; T, G]^(n) for n = 1, 2, 3, 4 (see explanations in the text).

These four tables are not simple tables: they form a single tensor family of matrices. The second, third, and fourth tensor powers of the (2*2)-matrix [C, A; T, G] automatically give the (4*4)-matrix of 16 doublets, the (8*8)-matrix of 64 triplets, and the (16*16)-matrix of 256 tetraplets (Fig. 2.1). Using the same method of binary numbering of the rows and columns of square matrices of the DNA alphabets of n-plets, one can similarly construct the square table of 1024 pentaplets, and so on. These new tables are also members of the unified tensor family of symbolic matrices [C, A; T, G]^(n) for the values n = 5, 6, ...
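The two constructions described above, binary numbering of rows and columns and tensor powers of [C, A; T, G], can be compared with a short Python sketch. The function names alphabet_matrix and tensor_power are hypothetical and serve only for illustration; listing the rows and columns in decreasing binary order follows the order 11, 10, 01, 00 of the tables.

```python
from itertools import product

# Binary indicators taken from the text.
PYRIMIDINE = {"C": 1, "T": 1, "A": 0, "G": 0}  # numbers the columns
AMINO = {"C": 1, "A": 1, "T": 0, "G": 0}       # numbers the rows


def alphabet_matrix(n):
    """Place every n-plet into a 2^n x 2^n table by binary numbering."""
    size = 2 ** n
    table = [[None] * size for _ in range(size)]
    for letters in product("CATG", repeat=n):
        row_bits = "".join(str(AMINO[x]) for x in letters)
        col_bits = "".join(str(PYRIMIDINE[x]) for x in letters)
        # Rows and columns run in decreasing binary order: 11, 10, 01, 00.
        row = size - 1 - int(row_bits, 2)
        col = size - 1 - int(col_bits, 2)
        table[row][col] = "".join(letters)
    return table


def tensor_power(n):
    """n-th Kronecker power of [C, A; T, G], with product read as concatenation."""
    base = [["C", "A"], ["T", "G"]]
    result = [[""]]
    for _ in range(n):
        result = [
            [left + right for left in left_row for right in base_row]
            for left_row in result
            for base_row in base
        ]
    return result


for row in alphabet_matrix(2):
    print(" ".join(row))
```

For n = 2 the sketch prints the four rows of the doublet table, and for the small values of n checked here both functions return the same tables.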
The tensor family of matrices [C, A; T, G]^(n) was first used by the author for a comparative analysis of the percentages of different n-plets in the DNA-texts of various genomes. Let us explain the analytical approach using the specific example of the DNA of the first human chromosome, which contains a sequence of about 250 million nucleotides C, A, T, and G (the initial data on this chromosome were taken from GenBank: https://www.ncbi.nlm.nih.gov/nuccore/NC_000001.11). Recall that genomic sequences in GenBank usually contain some letters N, indicating that any nucleotide can be in this place (https://www.ncbi.nlm.nih.gov/books/NBK21136/). For this reason, the total number of nucleotides A, T, C, and G counted in a GenBank sequence is slightly less than the complete length of the DNA sequence indicated in GenBank. In practice, this is not essential for the resulting percentages of separate nucleotides in the analyzed genomic sequences.
At the first step of the author's approach, the percentages of each of the nucleotides C, A, T, and G in this chromosome are calculated: %C ≈ 0.2085, %G ≈ 0.2089, %A ≈ 0.2910, %T ≈ 0.2917 (here and below, percentages are shown as fractions of one and rounded to the fourth decimal place). These values are placed into the appropriate cells of the matrix of nucleotides [C, A; T, G] instead of the nucleotide symbols, which gives a numeric matrix of nucleotide percentages (Fig. 2). One can note that %C ≈ %G and %A ≈ %T in accordance with the second Chargaff's rule [Albrecht-Buehler, 2006; Chargaff, 1971; Prabhu, 1993].
At the second step of the described approach, the DNA-text of the analyzed chromosome is represented as a text of doublets (for example, the text TAACCCTA… is represented as TA-AC-CC-TA-…), and the percentages of each of the 16 doublets are calculated. These percentages are then placed into the appropriate cells of the (4*4)-matrix [C, A; T, G]^(2) shown in Fig. …
At the third step of the described approach, the DNA-text of the analyzed chromosome is represented as a text of triplets (for example, the text TAACCCTAG… is represented as TAA-CCC-TAG-…), and the percentages of each of the 64 triplets are calculated. These percentages are then placed into the appropriate cells of the …
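The counting steps described above can be sketched in Python as follows. This is a minimal illustration, not the author's program: the function name nplet_percentages is hypothetical, and skipping every n-plet that contains the letter N (or any other symbol outside A, C, G, T) is an assumption, since the text does not say how such n-plets are treated.

```python
from collections import Counter


def nplet_percentages(text, n):
    """Fractions of n-plets in the non-overlapping reading of a DNA-text."""
    text = text.upper()
    counts = Counter()
    for start in range(0, len(text) - n + 1, n):
        nplet = text[start:start + n]
        # Assumption: n-plets with N or other ambiguity symbols are skipped.
        if set(nplet) <= set("ACGT"):
            counts[nplet] += 1
    total = sum(counts.values())
    return {nplet: count / total for nplet, count in counts.items()}


# The example from the text: TAACCCTAG is read as TAA-CCC-TAG.
print(nplet_percentages("TAACCCTAG", 3))
# The doublet reading of TAACCCTA is TA-AC-CC-TA.
print(nplet_percentages("TAACCCTA", 2))
```

The resulting dictionary can then be written into the cells of the symbolic matrix of the same n, cell by cell, to obtain the numeric matrix of percentages.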

Tetra-groupings of the percentage composition of n-plets.
At first glance, the set of percentages in the resulting matrices (Figs. 2.3-2.5) is quite chaotic. It has the following features regarding the percentages of separate n-plets (hereinafter, the symbol N denotes any of the nucleotides A, T, C, and G):
• The percentages of n-plets depend significantly on the order of letters in them.
• The total sum Σ%CN of the percentages of all 4 doublets CN, which start with the nucleotide C, is practically equal to %C: Σ%CN = %CC + %CA + %CT + %CG ≈ 0.0541 + 0.0727 + 0.0713 + 0.0103 ≈ 0.2085 ≈ %C.
• The total sum Σ%NC of the percentages of all 4 doublets NC, which have the nucleotide C at their second position, is practically equal to %C as well: Σ%NC = %CC + %AC + %TC + %GC ≈ 0.0541 + 0.0503 + 0.0601 + 0.0440 ≈ 0.2085 ≈ %C.
• The total sum Σ%CNN of the percentages of all 16 triplets CNN, which have the nucleotide C at their first position, is practically equal to %C as well: Σ%CNN ≈ 0.2084 ≈ %C.
• The total sum Σ%NCN of the percentages of all 16 triplets NCN, which have the nucleotide C at their second position, is practically equal to %C as well: Σ%NCN ≈ 0.2085 ≈ %C.
• The total sum Σ%NNC of the percentages of all 16 triplets NNC, which have the nucleotide C at their third position, is practically equal to %C as well: Σ%NNC ≈ 0.2085 ≈ %C.
• The total sum Σ%CNNN of the percentages of all 64 tetraplets CNNN, which have the nucleotide C at their first position, is practically equal to %C as well: Σ%CNNN ≈ 0.2085 ≈ %C.
• The total sum Σ%NCNN of the percentages of all 64 tetraplets NCNN, which have the nucleotide C at their second position, is practically equal to %C as well: Σ%NCNN ≈ 0.2085 ≈ %C.
• The total sum Σ%NNCN of the percentages of all 64 tetraplets NNCN, which have the nucleotide C at their third position, is practically equal to %C as well: Σ%NNCN ≈ 0.2085 ≈ %C.
• The total sum Σ%NNNC of the percentages of all 64 tetraplets NNNC, which have the nucleotide C at their fourth position, is practically equal to %C as well: Σ%NNNC ≈ 0.2085 ≈ %C.
Similar equalities turn out to be valid also for the total sums Σ of the considered n-plets with the nucleotides A, T, and G at the analogous positions, as shown in Fig. 3. Briefly speaking, the following relations (1) hold true, with a high level of accuracy, for the percentages of the nucleotides C, G, A, and T and of the considered n-plets in human chromosome 1:

%C ≈ Σ%CN ≈ Σ%NC ≈ Σ%CNN ≈ Σ%NCN ≈ Σ%NNC ≈ Σ%CNNN ≈ Σ%NCNN ≈ Σ%NNCN ≈ Σ%NNNC
%G ≈ Σ%GN ≈ Σ%NG ≈ Σ%GNN ≈ Σ%NGN ≈ Σ%NNG ≈ Σ%GNNN ≈ Σ%NGNN ≈ Σ%NNGN ≈ Σ%NNNG
%A ≈ Σ%AN ≈ Σ%NA ≈ Σ%ANN ≈ Σ%NAN ≈ Σ%NNA ≈ Σ%ANNN ≈ Σ%NANN ≈ Σ%NNAN ≈ Σ%NNNA
%T ≈ Σ%TN ≈ Σ%NT ≈ Σ%TNN ≈ Σ%NTN ≈ Σ%NNT ≈ Σ%TNNN ≈ Σ%NTNN ≈ Σ%NNTN ≈ Σ%NNNT     (1)

These equalities (1) can also be written in the form (2), as equalities of 4-dimensional vectors of percentage sums:

[%C, %G, %A, %T] ≈ [Σ%CN, Σ%GN, Σ%AN, Σ%TN] ≈ [Σ%NC, Σ%NG, Σ%NA, Σ%NT]
≈ [Σ%CNN, Σ%GNN, Σ%ANN, Σ%TNN] ≈ [Σ%NCN, Σ%NGN, Σ%NAN, Σ%NTN] ≈ [Σ%NNC, Σ%NNG, Σ%NNA, Σ%NNT]
≈ [Σ%CNNN, Σ%GNNN, Σ%ANNN, Σ%TNNN] ≈ [Σ%NCNN, Σ%NGNN, Σ%NANN, Σ%NTNN]
≈ [Σ%NNCN, Σ%NNGN, Σ%NNAN, Σ%NNTN] ≈ [Σ%NNNC, Σ%NNNG, Σ%NNNA, Σ%NNNT]     (2)

Knowing the percentages of nucleotides %A, %T, %C, and %G, it is possible to predict with high accuracy the sums of percentages of n-plets of the noted classes. The ability to make such predictions on the basis of equalities (1) or (2) holds not only for the considered human chromosome 1 but also for the many eukaryotic and prokaryotic genomes analyzed by the author so far. One should note that the percentages of nucleotides %A, %T, %C, and %G can differ essentially between genomes. (Appendix I contains one of many possible examples of percent matrices, related to the genome of the bacterium Bradyrhizobium japonicum, where %A ≈ 0.1819, %T ≈ 0.1815, %C ≈ 0.3184, and %G ≈ 0.3182, in contrast to the considered case of human chromosome 1.) This indicates a universal cooperative organization of n-plets in genomic DNA-texts, which is reflected in very special block-mosaic structures of the percent matrices of n-plets (Figs. 2.3-2.5).
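The positional sums that enter the equalities (1) can be checked with a few lines of Python. The sketch below uses only the doublet values quoted in the text for the nucleotide C; the helper name positional_sum is hypothetical, and positions are counted from zero.

```python
# Doublet fractions quoted in the text for human chromosome 1
# (only the doublets that contain C at the first or second position).
DOUBLETS = {
    "CC": 0.0541, "CA": 0.0727, "CT": 0.0713, "CG": 0.0103,
    "AC": 0.0503, "TC": 0.0601, "GC": 0.0440,
}


def positional_sum(percentages, letter, position):
    """Sum the fractions of all n-plets with `letter` at the 0-based `position`."""
    return sum(
        value for nplet, value in percentages.items()
        if nplet[position] == letter
    )


first = positional_sum(DOUBLETS, "C", 0)   # %CC + %CA + %CT + %CG
second = positional_sum(DOUBLETS, "C", 1)  # %CC + %AC + %TC + %GC
print(round(first, 4), round(second, 4))   # 0.2084 0.2085
```

With the four-digit rounded addends the first sum comes out as 0.2084, which agrees with %C ≈ 0.2085 within the rounding of the individual doublet values.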
The four columns in Fig. 3.1 show that in all considered cases of percentages of four kinds of nucleotides and percent sums of n-plets, which have these nucleotides as their positional attributes, there exist regular tetra-subsets of percentages (or percentage tetra-groupings).
It can be noted that the sum of the percentages (that is, the probabilities) of n-plets in each of these four subsets can be interpreted as the square of the length of a vector whose components are equal to the square roots of the probabilities of the corresponding n-plets (these square roots can be considered as amplitudes of the probabilities of n-plets). For example, the sum %CC + %CG + %CA + %CT is the square of the length of the 4-dimensional vector VCN = [√%CC, √%CG, √%CA, √%CT]. From this point of view, equalities (2) mean the constancy of the length of the corresponding 2^n-dimensional vectors, whose coordinates are the amplitudes of the probabilities of the corresponding n-plets. This metric approach allows for developing new methods of comparative vector analysis in genetics, which are now being studied in our laboratory.
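The vector interpretation can be illustrated numerically for the vector VCN from the example above. The following Python sketch uses the doublet values quoted earlier for human chromosome 1; the function names are hypothetical.

```python
import math


def amplitude_vector(fractions):
    """Square roots of probabilities, read as probability amplitudes."""
    return [math.sqrt(p) for p in fractions]


def squared_length(vector):
    """Square of the Euclidean length of a vector."""
    return sum(x * x for x in vector)


# %CC, %CG, %CA, %CT as quoted in the text for human chromosome 1.
cn_fractions = [0.0541, 0.0103, 0.0727, 0.0713]
v_cn = amplitude_vector(cn_fractions)

# The squared length of VCN reproduces the sum of the four percentages.
print(squared_length(v_cn), sum(cn_fractions))
```

Up to floating-point error the two printed numbers coincide, so comparing the lengths of such amplitude vectors is the same as comparing the corresponding percentage sums.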

DNA epi-chains and the tetra-grouping matrices of percentages of n-plets.
This section presents some results of an analogous study of the percentages of n-plets in special subsequences of long nucleotide sequences in single-stranded DNA. These subsequences are termed "DNA epi-chains" [Petoukhov, 2019, 2020a]. The author's initial results testify that the above-described equalities (1) and (2) for the total sums of percentages of n-plets hold for these epi-chains as well.
Each genomic DNA epi-chain of k-th order (k = 2, 3, 4, ...) contains k times fewer nucleotides than the DNA strand and has its own arrangement of the nucleobases A, T, C, and G. Unexpectedly, despite these textual differences, the matrices of percentages of n-plets in these quite different nucleotide sequences contain analogous tetra-groupings with the same equalities (1) and (2) for the total sums of the corresponding n-plets (at this stage of the research, the author has studied the percent matrices only for epi-chains with relatively small orders k).
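A minimal Python sketch of extracting an epi-chain is given below. It rests on an assumption that is not stated in this section: an epi-chain of order k is taken to consist of every k-th nucleotide of the strand, starting from a chosen position, which is consistent with the statement that it contains k times fewer nucleotides. The function name epi_chain and the parameter start are hypothetical.

```python
def epi_chain(text, k, start=0):
    """Every k-th nucleotide of the DNA-text, beginning at 0-based `start`.

    Assumption: this is the reading of "epi-chain of order k" used here;
    the section itself only states that it is k times shorter.
    """
    if k < 1:
        raise ValueError("the order k must be a positive integer")
    return text[start::k]


strand = "CAGGTACAT"
print(epi_chain(strand, 2))           # nucleotides at positions 1, 3, 5, ...
print(epi_chain(strand, 2, start=1))  # nucleotides at positions 2, 4, 6, ...
print(epi_chain(strand, 3))           # nucleotides at positions 1, 4, 7, ...
```

Each extracted epi-chain is an ordinary DNA-text, so the percentages of its n-plets can be counted and placed into the matrices in the same way as for the whole strand.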
To illustrate this result, let us consider the percent matrices for the second-, third-, and fourth-order epi-chains (k = 2, 3, 4) in the DNA of the first human chromosome (Fig. 4.2).
This example illustrates that the total sums of the percentages in each of the tetra-groupings of n-plets practically do not depend on the percentages of the individual n-plets in these sums. This resembles the phenomenon of perceiving a musical melody, which can be reproduced in different octaves, that is, with a significant change in the sound frequency of each of its notes, and yet the melody remains recognizable. Many such phenomena of perception, in which the integral form is relatively independent of its individual components, are studied in Gestalt psychology. The described universal regularities in the preservation of the total percentages in tetra-groupings of n-plets, relatively independent of the percentages of individual n-plets in genomic DNA, allow the author to develop gestalt genetics. Gestalt genetics is interrelated to some degree with Gestalt psychology, which studies the corresponding genetically inherited properties of our brain regarding the perception of the environment.

Percentage matrices of n-plets and algebras of hyperbolic numbers.
This section shows that each of the vectors of percentage sums of n-plets in the equalities (2) corresponds to a mosaic matrix that is connected with the well-known algebras of 2^n-dimensional hyperbolic numbers (hyperbolic numbers are also termed double numbers, Lorentz numbers, split-complex numbers, and perplex numbers). Consequently, the arrangements of the described tetra-groupings of n-plets inside these (2^n*2^n)-matrices of percentages obey special algebraic rules. This is additionally interesting since, as is known, the structures of some genetically inherited biological phenomena are related to 2^n-dimensional hyperbolic numbers [Petoukhov, 2020b]. The 2-dimensional hyperbolic numbers form an algebra over the field of real numbers [Harkin, Harkin, 2004; Kantor, Solodovnikov, 1989].
The same analysis with similar results can be done for tetra-groupings of matrix cells regarding 4-dimensional vectors of percentages sums of tetraplets from the equalities (2).
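For readers unfamiliar with hyperbolic numbers, the following Python sketch shows the standard algebra of 2-dimensional hyperbolic numbers a + b*j with j*j = +1 and their usual matrix representation [a, b; b, a]. It illustrates only the general algebra and not the specific decomposition of the percentage matrices; the class name Hyperbolic and the helper matmul2 are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hyperbolic:
    """Hyperbolic (split-complex) number a + b*j, where j*j = +1."""
    a: float
    b: float

    def __add__(self, other):
        return Hyperbolic(self.a + other.a, self.b + other.b)

    def __mul__(self, other):
        # (a + b j)(c + d j) = (ac + bd) + (ad + bc) j, because j*j = 1.
        return Hyperbolic(self.a * other.a + self.b * other.b,
                          self.a * other.b + self.b * other.a)

    def matrix(self):
        """The (2*2)-matrix representation [a, b; b, a]."""
        return [[self.a, self.b], [self.b, self.a]]


def matmul2(x, y):
    """Product of two (2*2)-matrices given as nested lists."""
    return [[sum(x[i][k] * y[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


p, q = Hyperbolic(2, 3), Hyperbolic(5, 7)
print(p * q)                              # Hyperbolic(a=31, b=29)
print(matmul2(p.matrix(), q.matrix()))    # [[31, 29], [29, 31]]
```

The product of the matrices equals the matrix of the product, which is why matrices with this two-block mosaic form a representation of the algebra of hyperbolic numbers.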

Some concluding remarks.
The presented results of the study of the regularities in the distribution of the percentages of n-plets in long DNA-texts of various organisms are consistent with Jordan's claim that life's missing laws are the rules of chance and probability of the quantum world [Jordan, 1932; McFadden, Al-Khalili, 2018]. The author's results show the existence of previously unknown genetic regularities. These results were obtained on the basis of new methods of analysis and modeling of DNA-texts, which are connected with the mathematical formalisms of quantum mechanics and quantum informatics, algebras of multi-dimensional hypercomplex numbers, metric analysis, and the theory of noise-immune coding in informatics.
Considering the views of Jordan and Schrödinger about the dictatorial role of the structured informatics of genetic molecules for the whole organism [McFadden, Al-Khalili, 2018], it is natural to think that the structural features of DNA informatics leave their mark on all genetically inherited biological systems and phenomena. This is consistent with the fact that all physiological systems must be structurally aligned with genetic coding in order to be transmitted in genetically encoded form to offspring. This is also consistent with the point of view that the main task of living organisms is to transfer genetic information along the chain of generations.
The obtained results can be used, in particular, for developing our knowledge about the principles of brain activity and about the relation between 'living' and non-living matter. These themes are actively discussed in the scientific community. For example, concerning the relation between 'living' and non-living matter, W. Pauli stated that the mental and the material domains are governed by common ordering principles and can be understood as "complementary aspects of the same reality" [Pauli, 1994; Geesink, Meijer, 2016].
Regarding the metric features of biological phenomena, it can be noted that thoughts about metric spaces are spread by a number of authors even to mathematics as a high form of intellectual activity. For example, the book [Hofstadter, 1980, page 612] notes that a mathematician feels that in mathematics there is a certain metric that unites ideas -that all mathematics is a network of results that are interconnected by a huge number of connections; had we been able to introduce this highly developed sense of mathematical closeness -the mental metric of a mathematician -into the program, we could create a primitive artificial mathematician.
In accordance with these statements of Hofstadter, the book [Nalimov, 2015, p. 115] emphasizes: "In other words, artificial intelligence could be brought closer to mathematical thinking, if it were possible to realize the metric properties of the human thinking space ... We are ready to go further and say that consciousness itself is geometrically structured: existentially, a person is geometric ... In our minds, when constructing texts through which we perceive the World, something very similar to what happens in morphogenesis happens. We are ready to see in the depths of consciousness the same geometric images that are revealed in morphogenesis".
In general, the presented results of the author's studies of the structural rules of genetic informatics provide evidence in favor of the effectiveness of a model approach to living organisms as quantum-informational algebraic-harmonic essences based on modular principles.