Function Overcomes Taxonomy for Organella Genes Triplet Composition

Copyright: © 2022 by the authors. Submitted to Int. J. Mol. Sci. for possible open access publication under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

1 Institute of computational modelling SB RAS; 660036 Russia, Krasnoyarsk, Akademgorodok, 44, bldg. 55; {msad,msen,annamo}@icm.krasn.ru
2 V.F. Voino-Yasenetsky Krasnoyarsk state medical university; 660022 Russia, Krasnoyarsk, Partizana Zheleznjaka str., 1
3 Siberian federal university; 660041 Russia, Krasnoyarsk, Svobodny prosp., 79; mutovina.ole4ka@mail.ru, viktoriia.fedotovskaia@gmail.com, shpaginat98@gmail.com, nedorez.ya@gmail.com, yuliya-putintseva@rambler.ru
4 Federal Research & Clinic Center of FMBA of Russia; 660037 Russia, Krasnoyarsk, Kolomenskaya str., 26
* Correspondence: msad@icm.krasn.ru; Cell tel.: +7-902-990-4597 (M.S.)

Abstract: A comprehensive presentation of a variety of biologically sounding properties of [...]


An exploration of the interplay between the structure of genomic entities, the function encoded in them, and the taxonomy of their bearers is a crucial issue in up-to-date molecular biology, molecular genetics, and bioinformatics. A genome is a complex object of immense size; however, plants have a specific part of a genome that belongs to chloroplasts.

The identity of the function encoded in chloroplasts is of great value for studying the interplay mentioned above: one meets no effects resulting from a difference in the function encoded in the genome. That is true for all organelle genomes. Thus, a three-sided problem is reduced to a two-sided one: there is an interplay between structure (whatever it could be) and the taxonomy of the bearers of chloroplast genomes. A comprehensive analysis of the interplay between structure and taxonomy in chloroplasts may reveal the details of the evolution of various plant species.

[...] into clusters (apparently identified separated groups). If clustering took place, then the composition of the clusters was studied; there are two options. The former is that a cluster preferably comprises genes of various functions that belong to the same species. The latter is that a cluster preferably comprises genes of the same function that belong to various species. The first option makes taxonomy the leading factor determining the cluster composition, and the second option makes function this factor. Speaking in advance, a substantial prevalence of function over taxonomy has been observed.

The novel method has no parameters to be adjusted. It is a very significant advantage [...]

[...] n_ω are changed for the frequencies [...] The constraint (2) allows 63 (linearly independent) triplets, thus making the space 63-dimensional.

Generally, a frequency dictionary W(q,t) is defined in the following way. For a given DNA sequence, put a window of length q at the first nucleotide of the sequence and record the identified string (word) in the list of entries. Then move the window along the sequence by t nucleotides and record the next word in the list. Proceed in this way until the last complete coverage is reached. The total number of words in a dictionary is then [...] where N is the sequence length. Counting the number n_ω of identical words ω, one gets a finite dictionary; changing the numbers for frequencies, one gets the frequency dictionary W(q,t). Obviously, there exists t (different, in general [...]

[...] in our studies; here f_ω ([...], correspondingly) is the frequency of a triplet ω counted over S_i (over S_j, correspondingly).

We used both GenBank and EMBL-bank to retrieve the genomes. All manipulations with the genomes (gene extraction, etc.) followed the annotation.

2.2.1. T-genome description

We studied the sets of tRNA genes of chloroplast genomes of gymnosperm plants. The database lists 4015 genes in total; Table 1 [...]

[...] and protists (three species) into the data set. The latter are able to convert sunlight into a bioproduct, which is why they were included. This database contains four species of gymnosperms and 86 species of leaf plants.

The intergenic fragments of a genome were identified with the annotation and retrieved from the genome sequences with ad hoc software. We rely completely on the annotation, so that the former were defined as subsequences located in a genome [...]
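The window procedure that defines the frequency dictionary W(q,t) earlier in this section can be illustrated with a minimal Python sketch. The function name `frequency_dictionary` and the example sequence are hypothetical and are not taken from the paper. The sketch assumes a linear (not circular) sequence, so that under this assumption the number of words equals ⌊(N − q)/t⌋ + 1; the expression used in the original text is not available here and may differ.

```python
from collections import Counter


def frequency_dictionary(seq, q=3, t=3):
    """Return (frequencies, total) for words of length q read with step t.

    ASSUMPTION: the sequence is treated as linear, so the last word is the
    last window that fits completely inside the sequence.
    """
    seq = seq.upper()
    # Window start positions: 0, t, 2t, ... while the window still fits.
    words = [seq[i:i + q] for i in range(0, len(seq) - q + 1, t)]
    counts = Counter(words)          # n_w: number of copies of each word w
    total = len(words)               # total number of words in the dictionary
    freqs = {w: n / total for w, n in counts.items()}
    return freqs, total


# Hypothetical toy sequence, read as non-overlapping triplets (q = 3, t = 3).
freqs, total = frequency_dictionary("ATGGCGATGTAA", q=3, t=3)
print(total, freqs)
```

With q = 3 and t = 3 the reading is codon-like (non-overlapping triplets), while t = 1 gives all overlapping triplets of the same sequence.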

To define local density, each point must be supplied with a bell-shaped function; we used the well-known Gaussian curve f_l(r) [...] (5): here the index l enumerates the points in the dataset, r_l is the location of the l-th point on the plane, and µ is the contrast parameter. To define the local density, one must sum up all the functions (5) and depict the sum function (6):

$$F(r) = \sum_{l=1}^{M} f_l(r). \qquad (6)$$

Here M is the total number of points in the dataset. The level lines of function (6) effectively identify clusters: these are the areas on the square where the value of F(r) exceeds a given one.
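The local density construction described above can be sketched as follows. This is an illustration only: the exact form of the Gaussian curve (5) is not recoverable from the text, so the bell shape exp(−|r − r_l|²/µ²) used below is an assumption, and the names `local_density` and `cluster_mask` as well as the toy points are hypothetical.

```python
import numpy as np


def local_density(points, grid_x, grid_y, mu):
    """Sum of Gaussian bells centred at the points, evaluated on a grid.

    ASSUMPTION: each bell is exp(-|r - r_l|^2 / mu^2); the normalisation and
    the exact role of the contrast parameter mu in the paper may differ.
    """
    points = np.asarray(points, dtype=float)       # shape (M, 2)
    gx, gy = np.meshgrid(grid_x, grid_y)           # each of shape (ny, nx)
    dx = gx[None] - points[:, 0, None, None]       # shape (M, ny, nx)
    dy = gy[None] - points[:, 1, None, None]
    bells = np.exp(-(dx ** 2 + dy ** 2) / mu ** 2)
    return bells.sum(axis=0)                       # sum over all M points


def cluster_mask(density, level):
    """Grid nodes where the density exceeds the given level."""
    return density > level


# Hypothetical toy data: two close points and one distant point.
pts = [(0.2, 0.2), (0.25, 0.2), (0.8, 0.8)]
grid = np.linspace(0.0, 1.0, 21)
F = local_density(pts, grid, grid, mu=0.1)
mask = cluster_mask(F, level=0.5)
```

A smaller µ yields narrower bells and therefore a finer split into clusters, which is how the contrast parameter affects the cluster pattern in this sketch.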

The second idea is the core one: it stipulates that the binary sequences be the coefficients of a polynomial (of degree N − 1). So, a symbol-by-symbol comparison of two (symbol) sequences is replaced by a product (a convolution, to be exact) of two polynomials. Finally, the third idea is to use the Fourier transform to calculate the convolution; moreover, the Fast Fourier Transform should be used here to accelerate the calculations significantly.

The method works as follows. Let T_1 and T_2 be the nucleotide sequences to be compared, and let N_1 and N_2 be their lengths, respectively. Convert each of them into four binary sequences; next, both must be expanded to the [...] The expansion is provided by adding as many zeros (from the right, for the sake of definiteness) as necessary to reach M.
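The conversion of a nucleotide sequence into four binary sequences with zero padding on the right may look as follows. This is a sketch under stated assumptions: the function name `to_indicators` and the alphabet order A, C, G, T are hypothetical, and the target length is passed as a parameter because its definition is not available at this point of the text.

```python
import numpy as np

ALPHABET = "ACGT"  # ASSUMPTION: the order of the letters is arbitrary


def to_indicators(seq, length=None):
    """Return a 4 x length array of 0/1 values.

    Row k holds 1 at the positions where seq carries ALPHABET[k];
    positions beyond len(seq) stay zero (padding from the right).
    """
    seq = seq.upper()
    if length is None:
        length = len(seq)
    if length < len(seq):
        raise ValueError("target length is shorter than the sequence")
    symbols = np.array(list(seq))
    out = np.zeros((len(ALPHABET), length), dtype=np.int8)
    for k, letter in enumerate(ALPHABET):
        out[k, :len(seq)] = symbols == letter
    return out


# Hypothetical example: a sequence of length 5 padded to length 8.
ind = to_indicators("ACGTA", length=8)
print(ind)
```

Every nucleotide sets exactly one of the four rows to 1, so symbols outside the alphabet (for instance N) leave all four rows at zero in this sketch.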

Let us now introduce the convolution S = A ⊗ B of two number sequences A = {a_i}, [...] The brute-force way to calculate a convolution of two sequences is rather hard. To [...]

2. Change P and Q into |ℵ| = 4 binary sequences as described above.
3. Expand the sequences with zeros for the further application of FFT to get sequences of length N_1 + N_2 − 1. To do so, all 2 × |ℵ| binary sequences must be padded with zeros (to the right, for certainty) to that length. An effective implementation of FFT requires a sequence to have a length equal to a power of 2, so we must add zeros to get the length $N = 2^{\lceil \log_2 (N_1 + N_2 - 1) \rceil}$.
4. Apply FFT to each of the binary sequences: [...]
5. Following (8), combine the relevant couples of P_ν and Q_ν (with ν running over A, C, G and T) and sum them up: [...]
6. Apply the inverse FFT to the sum obtained at step 5, denoted $\tilde{S}$ here, thus getting the convolution $S = F^{-1}(\tilde{S})$.
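The steps above can be sketched with NumPy as follows. This is an illustration rather than the authors' implementation: the function name `match_counts_fft` is hypothetical, and the sketch assumes that the second sequence is reversed before the transform, so that the convolution counts coinciding symbols for every relative shift of the two sequences. The exact combination rule of formula (8) is not recoverable from the text.

```python
import numpy as np

ALPHABET = "ACGT"


def match_counts_fft(p, q):
    """Count coinciding symbols of p and q for every relative shift.

    Entry n of the result is the number of positions i with
    p[i] == q[i + d], where d = len(q) - 1 - n.
    ASSUMPTION: q is reversed so that a plain convolution does the counting.
    """
    n1, n2 = len(p), len(q)
    full = n1 + n2 - 1                        # length of the full convolution
    size = 1 << (full - 1).bit_length()       # next power of two >= full
    p_sym = np.array(list(p.upper()))
    q_sym = np.array(list(q.upper()))[::-1]   # reversed second sequence
    spectrum = np.zeros(size // 2 + 1, dtype=complex)
    for letter in ALPHABET:
        a = (p_sym == letter).astype(float)   # binary sequence for p
        b = (q_sym == letter).astype(float)   # binary sequence for q
        # rfft pads the input with zeros on the right up to `size`.
        spectrum += np.fft.rfft(a, size) * np.fft.rfft(b, size)
    conv = np.fft.irfft(spectrum, size)[:full]
    return np.rint(conv).astype(int)          # remove floating-point noise


print(match_counts_fft("ACGT", "ACGT"))
```

Summing the four products in the transform domain and applying a single inverse transform is equivalent to summing four separate convolutions, since the transform is linear; this saves three inverse transforms.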

Here we present the effects revealed from the triplet composition of chloroplast genomes or their fragments. We start with the structure-function interplay of tRNA genes, then turn to the symmetry found in chloroplast genomes, and then to the analysis of transposons in chloroplast genomes.

[...] Subfig. 1(a), while the distribution of amino acids is shown in Subfig. 1(b). The clusters [...] color (see Table 1). Such a mixed presentation makes it difficult to see individual amino [...] In other words, Fig. 1(b) is a combination of all the maps from Figs. 10 and 11.

The clusters in Fig. 1 are basically determined by the contrast parameter (correlation [...] there is no doubt that the cluster (if identified as two separated entities) differs heavily in the abundance of the genes. On the contrary, the numbers of genes with synonymous codons are pretty close; for such genes, see Table 2.

The amazing fact is that the distribution of the isodecoders over the codons is very biased (cf. Table 2 and Figs. 10, 11). In other words, the numbers in Table 2 differ strongly if one splits them into the numbers of genes encoding different isodecoders. In fact, a single isodecoder is usually heavily underrepresented in a genome; only three amino acids fall outside this pattern: alanine, serine, and valine. Table 3 shows the list of such gene types with underrepresented isodecoders; their distribution on an elastic map is shown in Fig. 2. However, it is unclear whether this bias occurs naturally in chloroplast genomes or results from the details of sequencing and/or annotation. Probably, one should expect a contribution from both factors.

The coloring scheme for amino acids in Table 3 is the same as in Table 1; however, the labels differ from Table 2 and mark the isodecoders. Fig. 2 shows the distribution of [...]

Similarly, Fig. 7 shows the distribution of NADH+ genes located in chloroplasts over the elastic map in inner coordinates. There are eleven types of these genes with the following coloring scheme: ndhA is light pink, ndhB is brown, ndhC is gray, ndhD is black, ndhE is coral, ndhF is lime, ndhG is pink, ndhH is yellow, ndhI is yellow green, ndhJ is plum, and ndhK is red.

We studied various specific genetic systems (photosystem genes, tRNA genes, etc.) with the same investigation tool, namely a triplet frequency dictionary, and the patterns [...]

In general, tRNA genes tend to gather into separate clusters following the amino acid residue transfer specificity. We checked whether the species (families, to be exact) are spread among the clusters in some order: the answer is negative. There is no order or preference in the distribution of taxa over the clusters. In other words, each sufficiently abundant cluster comprises a list of species pretty close to that observed in any other one. However, the rare escapees (i.e., the genes located solely far from the "mother" [...]

Two more groups of genes retrieved from chloroplast genomes come from the energy consumption complex: these are ATP synthase and NADH+ genes. They are also known for their conservative evolution. Hence, these groups of genes provide good genetic material for the study of the interplay between structure, function, and the taxonomy of bearers.

The distribution of ATP synthase genes (NADH+ genes, respectively) is shown in Fig. 6 (Fig. 7, respectively). Both groups show near-perfect clustering that follows the [...] Remarkably, the "quality" of the distribution here is much better than that observed for tRNA genes. Probably, this fact results from the choice of the function of a gene family: a similarly high "quality" of distribution was observed for the ATP synthase gene family located in fungal mitochondria [41,42].

The obvious next step in the investigation of the distribution of these genes may consist in a study of the combined distribution of ATP synthase genes and NADH+ genes, regardless of the specificity of their localization in genomes. These two families of genes seem to be very good for such a study. Of course, there might be (and probably are) some minor differences between, say, NADH+ chloroplast genes of species A and those of species B. However, their functional proximity is strong enough to neglect [...]

The composition of the clusters is also non-random: they comprise the dictionaries preferably belonging to the same taxon. However, this preference could be observed for sufficiently distant (whatever this distance might mean) taxa.