Microbial Diversity Based on Multifractal Analysis of Metagenomes

Species diversity in the microbiome is a cutting-edge concept in metagenomic research. In this study, we propose a multifractal analysis for metagenomic research. From the chaos game representation (CGR) visualization of simulated and real metagenomes, we find that self-similarity exists in the visualization of metagenomes. We then compute the multifractal dimensions of simulated and real metagenomes for varying values of q. For simulated metagenomes, we also compute their diversity indices: the species richness index, Shannon's diversity index and Simpson's diversity index. From the Pearson correlation coefficients between the multifractal dimensions and these traditional species diversity indices, we find that the correlations with the species richness index and the Shannon diversity index reach their maxima at q = 0 and q = 1, respectively, and that the correlation with Simpson's diversity index reaches its maximum near q = 2. The traditional diversity indices can therefore be unified within the framework of multifractal analysis. These results coincide with similar results in macrobial ecology. Finally, we apply our methods to real metagenomes of 100 infants' gut microbiomes sampled when the infants are newborn, 4 months old and 12 months old. Our results show that the multifractal dimensions of infants' gut microbiomes can discriminate between these ages.


Introduction
Species diversity in ecology has long been studied [1,2]. Generally, diversity indices can be divided into two classes: α diversity indices and β diversity indices. All diversity indices referred to in this report are α diversity indices. In macrobial (plant/animal) ecology, α diversity can be characterized by species richness, the Shannon diversity index and the Simpson diversity index. Usually, in macrobial ecology, species richness increases with the area surveyed. Generally, the species-area relationship (SAR) can be formulated as $S(A) = cA^{z}$, where $S(A)$ is the number of species in area $A$, and $c$ and $z$ are constants. SAR is a famous formula in ecological study [3]. On the basis of SAR, Harte and Kinzig pointed out that the formula indicates self-similarity between species number and area [4]. As the main feature of fractals, self-similarity can be described by a family of scaling exponents $z_q$ indexed by the order $q$. Generally, for $q < 0$, $z_q$ emphasizes the rare species; for $q > 0$, $z_q$ emphasizes the common species. In particular, $z_0$ describes the relationship between the logarithm of species richness, $\ln S(A)$, and the logarithm of the area, $\ln A$; $z_1$ describes the relationship between the logarithm of the Shannon diversity (SHD) index and the logarithm of the area; and $z_2$ describes the relationship between the logarithm of the Simpson diversity (SID) index and the logarithm of the area.
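As a minimal sketch of how the SAR exponent can be estimated, assume the usual power-law form S(A) = c·A^z, so that ln S is linear in ln A and z is the slope of a least-squares line. The function name `fit_sar` and the sample values are hypothetical and serve only as an illustration.

```python
import math

def fit_sar(areas, species_counts):
    """Least-squares fit of ln S = ln c + z * ln A; returns (c, z)."""
    xs = [math.log(a) for a in areas]
    ys = [math.log(s) for s in species_counts]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    z = sxy / sxx                        # slope on the log-log scale
    c = math.exp(mean_y - z * mean_x)    # intercept mapped back to c
    return c, z

# Hypothetical data generated exactly from S = 5 * A^0.25 (not measured values).
areas = [1.0, 10.0, 100.0, 1000.0]
counts = [5.0 * a ** 0.25 for a in areas]
c, z = fit_sar(areas, counts)
```

On real survey data the points scatter around the line, and the fitted z then summarizes how quickly richness grows with area.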
In microbial diversity studies, identifying bacterial strains in metagenome and microbiome samples using computational analyses of short-read sequences remains a difficult problem [5]. The main difference between macrobial and microbial diversity indices is therefore that the concept of "species" has been replaced by "OTUs". The number of operational taxonomic units (OTUs) within a community is akin to species richness within macrobial systems [6]. As in macrobial ecology, species richness, the Shannon diversity index and the Simpson diversity index have been used to describe the species diversity of microbial communities [7]. Up to now, no report has unified these diversity indices into one framework.
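The three indices can be computed directly from a vector of OTU abundances. The sketch below uses the common textbook definitions: richness as the number of OTUs present, Shannon entropy with natural logarithm, and the Gini-Simpson form 1 - Σ p_i². The Simpson index has several variants, and which one is used in this study is not stated, so the choice here is an assumption.

```python
import math

def diversity_indices(abundances):
    """Return richness, Shannon and Gini-Simpson indices for OTU abundances."""
    total = sum(abundances)
    # Relative abundances of the OTUs that are actually present.
    p = [a / total for a in abundances if a > 0]
    richness = len(p)
    shannon = -sum(pi * math.log(pi) for pi in p)
    gini_simpson = 1.0 - sum(pi * pi for pi in p)
    return {"richness": richness, "shannon": shannon, "gini_simpson": gini_simpson}

# Hypothetical community: four equally abundant OTUs and one absent OTU.
even = diversity_indices([25, 25, 25, 25, 0])
```

For an even community of n OTUs the Shannon index equals ln n and the Gini-Simpson index equals 1 - 1/n, which gives a quick sanity check.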
Fractal analysis has been applied to DNA sequence analysis for more than 30 years [8,9]. For example, the Chaos Game Representation (CGR) is a classical method [10]. CGR maps a DNA sequence into the unit square by an iterated map in which the four corners of the square correspond to the four nucleotides A, C, G and T respectively. According to [11], CGRs have also been subjected to multifractal analysis (which measures the degree of self-similarity within the image). On the basis of the visualization of a DNA sequence, one can define its multifractal spectrum and, in practical computation, rewrite the defining formula in a form from which the multifractal dimensions can be computed.
Inspired by [10], the research group of Vélez studied the Caenorhabditis elegans genome [12] and the human genome [13] [16]. They proposed the general concept of an additive DNA signature of a set (collection) of DNA sequences. Examples are the composite DNA signature (which combines information from nDNA fragments and organellar genomes) and the assembled DNA signature (which combines information from many short DNA subfragments, e.g. 100 base pairs, of a given DNA fragment). They concluded that such additive signatures could be used with raw unassembled next-generation sequencing (NGS) read data when high-quality sequencing data are not available.
Motivated by [21], in this study we apply fractal and multifractal methods to the species diversity analysis of microbiomes. First, we visualize the simulated and real metagenomes. Then we compute the multifractal dimensions of the simulated metagenomes and study the relationship between their multifractal dimensions and species diversity indices. Last, we compute the multifractal dimensions of real metagenomes of 100 infants' gut microbiomes sampled when the infants are newborn, 4 months old and 12 months old.
Data set 1: Simulated high-diversity metagenome set used in [22]. The high-diversity set includes 100 metagenomes generated from the genomes of ten distantly related major bacterial species, which account for more than 90% of all reads in the Chinese group. The species used in data set 1 are listed in Table 1. The abundances in data set 1 are listed in Table S1 of the Supplementary Materials.
Data set 2: Simulated low-diversity metagenome set generated from the genomes of ten closely related major bacterial species used in [22]. The species used in data set 2 are listed in Table 2. The abundances in data set 2 are listed in Table S2 of the Supplementary Materials.
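To illustrate what generating a metagenome from known abundances means, the following sketch draws fixed-length reads from a few genomes, choosing the source genome with probability proportional to its abundance. It is not the simulator used in [22]; the function name, the toy genomes, the read length and the absence of sequencing errors are all assumptions made for illustration.

```python
import random

def simulate_metagenome(genomes, abundances, n_reads, read_len, seed=0):
    """Sample error-free reads; genome choice is weighted by abundance."""
    rng = random.Random(seed)
    names = list(genomes)
    weights = [abundances[name] for name in names]
    reads = []
    for _ in range(n_reads):
        name = rng.choices(names, weights=weights, k=1)[0]
        genome = genomes[name]
        # Uniform start position such that the read fits inside the genome.
        start = rng.randrange(len(genome) - read_len + 1)
        reads.append(genome[start:start + read_len])
    return reads

# Hypothetical toy genomes and abundances (not the species of Table 1 or Table 2).
toy_genomes = {"sp_A": "ACGT" * 50, "sp_B": "GGCC" * 50}
toy_abundances = {"sp_A": 0.9, "sp_B": 0.1}
reads = simulate_metagenome(toy_genomes, toy_abundances, n_reads=100, read_len=20)
```

Repeating the call with different seeds gives different read collections from the same community, which is the situation studied with the 100 simulated metagenomes.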
In the CGR of a metagenome, the four corners of the unit square correspond to the four nucleotides A, C, G and T respectively. In order to avoid "large number annihilating small number", we discarded the first 10 points of each read. The visualization of a simulated metagenome in data set 1 is shown in Figure 1 as an example.
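The CGR of a set of reads can be sketched as follows. Each read starts at the centre of the unit square, every nucleotide moves the current point halfway towards its corner, and the first 10 points of each read are dropped. The corner assignment (A, C, G, T at (0,0), (0,1), (1,1), (1,0)), the starting point and the skipping of ambiguous bases such as N are assumptions, since the exact mapping formula is not reproduced here.

```python
# Assumed corner assignment; the paper's own assignment may differ.
CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}

def cgr_points(read, skip=10):
    """Return the CGR points of one read, discarding the first `skip` points."""
    x, y = 0.5, 0.5
    points = []
    steps = 0
    for base in read.upper():
        corner = CORNERS.get(base)
        if corner is None:       # ambiguous base (e.g. N): ignored in this sketch
            continue
        x = 0.5 * (x + corner[0])
        y = 0.5 * (y + corner[1])
        steps += 1
        if steps > skip:
            points.append((x, y))
    return points

def metagenome_cgr(reads, skip=10):
    """Pool the CGR points of all reads of a metagenome into one point set."""
    pooled = []
    for read in reads:
        pooled.extend(cgr_points(read, skip))
    return pooled

pts = cgr_points("A" * 12)   # two points remain after the first 10 are dropped
```

Plotting the pooled points of all reads gives a picture of the kind shown in Figure 1.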

Fractal and multifractal spectrum of metagenome
We found that all CGRs (e.g. Figure 1) appear to be self-similar, so we study their fractal and multifractal properties. On the basis of the visualization of a metagenome sequence, one can define its multifractal spectrum by (1).
Furthermore, one can define the multifractal dimension according to (2). In metagenomic research, for a given community (i.e. given abundance values of bacteria), a WGS dataset of a metagenome is actually a collection of reads sampled from the given community. Here, we simulate 100 metagenomes from a given abundance of ten bacteria. Figure 3 shows the multifractal dimensions of 100 simulated metagenomes from data set 1 and 100 simulated metagenomes from data set 2.
From Figure 3, we find that the multifractal dimension curves of different simulated metagenomes with the same abundances are unstable when q < 0 but stable otherwise.
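A box-counting estimate of the multifractal dimension can be sketched as below. It uses the standard definition of generalized dimensions: the unit square is covered by boxes of side 2^-k, μ_i is the fraction of CGR points in box i, and the dimension of order q is the slope of ln(Σ μ_i^q)/(q - 1) against the logarithm of the box side (with Σ μ_i ln μ_i for q = 1). This may differ in detail from equations (1) and (2); the names and the range of grid levels are assumptions.

```python
import math
from collections import Counter

def _slope(xs, ys):
    """Slope of the least-squares line through the points (xs, ys)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx

def generalized_dimension(points, q, levels=range(1, 6)):
    """Box-counting estimate of the order-q dimension of points in [0, 1]^2."""
    n = len(points)
    log_eps, stat = [], []
    for k in levels:
        m = 2 ** k                                    # m x m grid, box side 1/m
        boxes = Counter((min(int(x * m), m - 1), min(int(y * m), m - 1))
                        for x, y in points)
        mu = [count / n for count in boxes.values()]  # only non-empty boxes
        if abs(q - 1.0) < 1e-12:
            stat.append(sum(p * math.log(p) for p in mu))
        else:
            stat.append(math.log(sum(p ** q for p in mu)) / (q - 1.0))
        log_eps.append(math.log(1.0 / m))
    return _slope(log_eps, stat)

# Sanity check on a uniform 32 x 32 grid of points: every order should give 2.
grid = [((i + 0.5) / 32, (j + 0.5) / 32) for i in range(32) for j in range(32)]
dims = {q: generalized_dimension(grid, q) for q in (-1, 0, 1, 2)}
```

For negative q the sum is dominated by the sparsest boxes, whose occupancy depends strongly on which reads happen to be sampled; this is one plausible reason for instability at negative orders.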

The relationship between multifractal spectrum and microbial diversity index of metagenomes
In order to study the relationship between the multifractal spectrum and the diversity indices of metagenomes, we simulated 100 metagenomes whose abundances are known, and computed their species richness index, Shannon diversity index, Simpson diversity index and multifractal dimensions. The Pearson correlation coefficients between the multifractal dimensions and each diversity index are then computed for varying q.
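The scan over q can be sketched as follows: for each order q, correlate the multifractal dimensions of all metagenomes with one diversity index, then report the q at which the Pearson coefficient is largest. The function names and the numbers are hypothetical toy values chosen for illustration; they are not results of this study.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def best_order(dims_by_q, index_values):
    """dims_by_q maps q to the list of dimensions, one per metagenome."""
    corr = {q: pearson(vals, index_values) for q, vals in dims_by_q.items()}
    return max(corr, key=corr.get), corr

# Toy values for four metagenomes (hypothetical, for illustration only).
index_values = [1.0, 2.0, 3.0, 4.0]
dims_by_q = {
    0: [1.0, 2.0, 3.0, 4.5],
    1: [2.0, 4.0, 6.0, 8.0],
    2: [1.0, 3.0, 2.0, 4.0],
}
q_star, corr = best_order(dims_by_q, index_values)
```

Running the same scan once per diversity index gives one correlation curve per index, from which the maximizing orders can be read off.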

Application of multifractal dimensions of metagenomes to infants' gut microbiomes
In order to apply the multifractal analysis to real metagenomes, we selected fecal WGS datasets from 100 infants, 300 metagenomes in total (three samples per infant: newborn, 4 months and 12 months). As an example, we plot the multifractal dimensions of the gut microbiome of one selected baby in Figure 9. The plot shows the multifractal dimensions of the gut microbiomes of an infant as a newborn (baby), at 4 months and at 12 months, together with those of its mother. Figure 9 suggests that the multifractal dimensions of the infant's gut microbiome increase with age. In order to observe the overall characteristics of these multifractal dimensions, we plotted the mean values of the 100 multifractal dimensions of gut microbiomes in Figure 10.
In order to evaluate the power of the multifractal dimensions of gut microbiomes to discriminate infants' ages, we used the multifractal dimensions of the 12M, 4M, baby and Mother gut microbiomes as input to a Support Vector Machine (SVM) [24]. Table 3 lists the accuracy rates of discriminating the metagenomes of 12M, 4M, baby and Mothers by SVM. Within infants' gut microbiomes, the accuracy rate decreases in the order 12M versus baby, 12M versus 4M, baby versus 4M.
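A pairwise discrimination of this kind can be sketched with a small linear SVM. The code below is not the SVM implementation of [24]: it is a Pegasos-style stochastic subgradient solver for a linear soft-margin SVM without bias term, applied to features centred on the training mean. The feature vectors are hypothetical multifractal dimensions at q = 0, 1, 2 drawn around made-up class centres.

```python
import random

def train_linear_svm(X, y, lam=0.01, epochs=50, seed=0):
    """Pegasos-style training of a bias-free linear SVM; labels are -1 or +1."""
    rng = random.Random(seed)
    w = [0.0] * len(X[0])
    order = list(range(len(X)))
    t = 0
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            t += 1
            eta = 1.0 / (lam * t)                     # decaying step size
            margin = y[i] * sum(wj * xj for wj, xj in zip(w, X[i]))
            w = [(1.0 - eta * lam) * wj for wj in w]  # regularization shrinkage
            if margin < 1.0:                          # hinge loss is active
                w = [wj + eta * y[i] * xj for wj, xj in zip(w, X[i])]
    return w

def predict(w, x):
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) >= 0 else -1

# Hypothetical feature vectors: label -1 for "baby", +1 for "12M".
rng = random.Random(1)
def sample(centre, n):
    return [[c + rng.uniform(-0.05, 0.05) for c in centre] for _ in range(n)]

baby_centre, m12_centre = (1.0, 0.9, 0.8), (1.4, 1.3, 1.2)
train_X = sample(baby_centre, 20) + sample(m12_centre, 20)
train_y = [-1] * 20 + [1] * 20
test_X = sample(baby_centre, 10) + sample(m12_centre, 10)
test_y = [-1] * 10 + [1] * 10

# Centre all features on the training mean so that no bias term is needed.
mean = [sum(col) / len(train_X) for col in zip(*train_X)]
def centred(X):
    return [[xj - mj for xj, mj in zip(x, mean)] for x in X]

w = train_linear_svm(centred(train_X), train_y)
hits = sum(predict(w, x) == label for x, label in zip(centred(test_X), test_y))
accuracy = hits / len(test_y)
```

With real data the classes overlap, so the accuracy should be estimated on held-out samples or by cross-validation, and a kernel SVM from an established library may be preferable to this minimal solver.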

Discussions and conclusions
In this study, we studied metagenomes by multifractal analysis. From the results above, we can draw the following conclusions.
(i) From the CGR visualization of metagenomes, we can see that statistical self-similarity exists in these plots.
(iii) In the research on real metagenomes, the multifractal dimensions of the gut microbiomes of one mother and her baby are shown in Figure 9. This plot shows that the multifractal dimensions of the baby's gut microbiome increase with age (newborn, 4M and 12M). Figure 10 shows that this law holds on average for all the babies. The discriminating power of the multifractal dimensions of infants' gut microbiomes, shown in Table 3, indicates that infants' ages can be discriminated by the multifractal spectrum of the CGR visualization of their gut microbiomes.
Supplementary Materials: Table S1: The abundances in data set 1; Table S2: The abundances in data set 2.
Author Contributions: Conceptualization, X.X. and Y.M.; methodology, X.X. and Z.Y.; software, X.X. and Y.M.; validation, X.X., Y.M., Z.Y. and G.H.; formal analysis, X.X.; resources, X.X.; data curation, X.X.; writing (original draft preparation), X.X. and Z.Y.; writing (review and editing), X.X. and Z.Y.; visualization, X.X.; supervision, Z.Y. All authors have read and agreed to the published version of the manuscript.
Data Availability Statement: The source codes for our algorithm and the datasets used in this report can be provided via email.