Exploring patterns in modularity of protein interaction networks across the tree of life using Spectral Entropy

Modularity and organizational hierarchy are important concepts in understanding the 1 structure and evolution of interactions in complex biological systems. In this work, we introduce 2 and use a spectral characterization measure (Spectral Entropy) to quantify modularity in protein3 to-protein interaction (PPI) networks in species across the tree of life. We evaluated the relation 4 between the size of a PPI network and its (Spectral Entropy-based) modularity, and found a 5 sigmoidal response between the two. We also found significant differences in the distribution 6 of Spectral Entropy values among the three domains of life (Bacteria, Archaea, Eukaryotes). To 7 explore further correlations with biological traits, we focused solely on bacterial PPI networks, 8 which are the most numerous among the three domains and had associated trait metadata, and 9 investigated how modularity impacts or is impacted by growth, aerobicity, selection and location 10 on the tree of life. We found no relation between maximal growth rate and Spectral Entropy, but 11 a strong dependence between G-C content (a proxy for selection) and Spectral Entropy. We also 12 discovered that Spectral Entropy is negatively affected by phylogenetic placement (evolutionary 13 distance from the last universal common ancestor). The general nature of the Spectral Entropy 14 measure of hierarchical modularity in networks suggests that it will be useful in other settings 15 where structural properties of real-world networks are being compared. 16


Introduction
Despite the interplay of components in a living system, there is heterogeneity in 20 the nature of interactions among them, and this leads to a differential integration of the 21 components into smaller subgroups. An abstract notion invoked to understand such 22 heterogeneity of grouping is called 'modularity,' and it has been used in various contexts 23 and studied with intense interest in the biological and complex systems communities 24 for several decades in different forms [1][2][3][4] (see [2] for a detailed review of modularity 25 concepts in biology and its various applications). Modularity can be understood as an 26 organizational pattern within a biological system, in which some components interact 27 more among themselves (forming a module) than with others (outside the module) 28 [2,4,5]. Although initial work on biological modularity focused on physical, develop- 29 mental, or macro-ecological aspects of biological systems [5], with the advent of large 30 scale data about gene regulation, protein interactions, and metabolic pathways, and 31 emerging tools in network science literature, there has been an explosion of work about 32 Over the last two decades, researchers have developed metrics for quantifying 48 modularity in various complex systems [5,[10][11][12][13] and for characterizing topological 49 characteristics of (network-based) modular organization in their interactions [14][15][16]. 50 However, the issue of identifying hierarchical and overlapping organization of modules 51 in complex system architecture has been less well-studied [11,17]. 52 In this work, we focus on quantifying hierarchical modularity, taking into account 53 naturally existing modular overlaps and different scales of modularity, through spectral 54 characterization [11,12] in protein-to-protein interaction (PPI) networks. We formulate 55 a network statistic of Spectral Entropy, which we hypothesize quantifies hierarchical 56 organization. Spectral Entropy tracks disorganization in the distribution of the adjacency 57 spectrum of a network. A transition from a disorganized, non-hierarchical network 58 topology to a more rigidly hierarchical structure would result in a decrease in Spectral 59 Entropy. To test our proposed measure, we compare experimentally verified PPI network 60 architectures from 1803 species across the tree of life [18] and look for patterns in modular 61 hierarchies. We then narrow down our investigation to 1196 bacterial PPI networks 62 and explore the relationships of modularity with growth, directed evolution, and other 63 biological traits.  The evolutionary history of the curated set of PPIs was obtained by [18] and is 75 derived from a high-resolution phylogenetic tree [19], which is used to quantify the 76 'evolutionary distance' of each species from the root of the tree as measured by the 77 total branch length (quantified as nucleotide substitutions per site) . The phylogenetic 78 taxonomy, the names of species, and associated genome accession numbers were taken 79 from the NCBI Taxonomy database [20]. characteristics. Therefore, we decided to focus on testing the dependence of growth, 84 directed evolution, and phylogenetic placement on spectral modularity in bacteria.

85
To estimate the growth rate of bacteria, we used the gRodon measure [21], which 86 estimates maximal growth rates of prokaryotic organisms from genome-wide codon 87 usage statistics. We only used the genomes for which reference genomes were available 88 from RefSeq. Relative G-C content of the reference genomes is an important trait for 89 bacteria, which is shown to be under selection [22], but the cause of the relationship is 90 debated [23]. One of the leading explanations involves aerobicity (bacterial response to 91 the presence of oxygen) and the damage induced by oxidative stress through double 92 strand breaks [24,25]. Therefore, we compare the aerobicity of bacterial species, collected 93 from literature survey and databases [26], with their relative G-C content, evolutionary 94 distance and Spectral Entropy (through pairwise t-tests for each aerobicity class: aerobe, 95 microaerophilic, anaerobe, and obligate anaerobe). 96

97
For each species t, we created an unweighted, undirected network (V t , E t ) of N t genes. Each network is characterized by its adjacency matrix For each A, we computed the singular value vectors σ(A) ∈ R N t . Note that since

98
A is a real symmetric matrix, the singular values are exactly the absolute values of the 99 eigenvalues, i.e. σ i = |λ i |. We then log 2 transform our spectral values while censoring 100 those less than 1, i.e. we apply the transformation: We then normalize the resultant vector, to give us a vectorσ ∈ [0, 1] N t : This vectorσ(A t ) ∈ [0, 1] N t summarizes structural information associated with the 102 species-specific network (N t , E t ).

103
Our decision to log-transform was made by conducting a Shapiro-Wilk normality 104 test on pre-and post-transformed data, which ultimately favored the transformation. 105 We consider two potential metrics as proxies for potential hierarchical structure in the 106 graph spectra. First, using the definitions below, we consider K-bin Spectral Entropy.

107
This is essentially a Shannon index on K-bin discretization of the vectorσ. Second, we 108 consider estimates of the Gini coefficient (typically used as a metric of for quantifying 109 income inequality and economic stratification, we adapt this measure as a measure of 110 heterogeneity in the network) for the values ofσ. 115 Let x = (x i ) i ∈ [0, 1] N be a vector taking N values between zero and one. We define the K-bin entropy s K (x of this vector as follows. We will bin the entries of x into a K bin discretization of [0, 1], giving us a vector in the discrete alphabet [K] = {0, ..., K − 1}. We then form a probability vector p(x) from the relative frequencies of the letters of this discrete frequency. Put concisely, we define the probability vector p(x) ∈ [0, 1] K as We define s K (x) as the Shannon entropy of this distribution: Thus, we define the K-bin Spectral Entropy of the matrix A t as We calculate the spectral K-bin entropy for all the PPI networks, at different values  diagonal. Both adjacency spectra and Laplacian spectra [27] embody core structural 124 information about graphs, and are used for latent space embeddings and other forms of 125 network dimensionality reduction [12]. We limited ourselves to adjacency spectra, both 126 because this is more typical for spectral analysis in biological networks [28] and more 127 computationally tractable for large networks. We leave the investigation and comparison 128 of the two for future work. In Section 3, we demonstrate the plausibility of Spectral Entropy using simulations of ranndom networks generated by a non-linear preferential attachment (PA) model. The PA model is a growing random graph model, constructed iteratively. It is parametrized by the final number of nodes (N), the number of edges added at each time-step (m), and a non-negative exponent (α ≥ 0) that determines the power of preferential attachment model. A PA graph becomes realized by the following process. We initially begin with a fully connected graph of m nodes N 0 = (V 0 , E 0 ). Then, for each iteration t = 1, ..., (N − m), we add a single node of index i t = m + t, and draw m edges from the new node to nodes j in V t−1 . For any node j ∈ V t−1 , the likelihood of the edge {i t , j} being drawn is at iteration t is proportional to the α-power of the degree d j of the node j in the previous graph N t−1 : For our simulations used to generate Fig. 1 B-D, we take N = 1000, m = 5, and allowed 131 α to range over the values specified. We used the sample_pa function from the igraph 132 R package. In Section 3, we use adjacency spectral embedding (ASE) to represent the nodes of graphs as points in Euclidean space [29]. We may refer to these embedded points as the (estimated) latent positions of nodes, since ASE embeddings have been shown to provide a consistent estimator of the latent positions in random dot product graphs [29]. For an undirected, unweighted network N , the ASE is computed as follows. The network N has a Boolean adjacency matrix A ∈ {0, 1} N×N which is real and symmetric, and thus taking its singular value decomposition results in the unitary diagonalization: where U is an orthogonal matrix with adjoint inverse U * , and Σ is the diagonal matrix with singular values σ i = |λ i | of A along its diagonal. The d-dimensional ASE of the nodes is given by the rows of the rank d matrix where U (d) ∈ R N×d is the matrix of the first d columns of U, and Σ (d) ∈ R d×d the 135 first d rows and columns of Σ. The rows of X 1 , ..., X N of X provide d-dimensional  works. We will now ground that construction, and justify its use as a proxy for hierarchi-144 cal modularity.

145
Modularity, sometimes also referred to as community structure, is a well-studied 146 phenomenon in network science [11,28,[30][31][32][33]. It refers to a partitioning of a network into 147 high-affinity modules. In the case of observed or inferred biological and social networks, 148 these correspond to emergent groupings among the agents represented as nodes. These ics [34,35], many computer scientists developed algorithms utilizing the second-smallest 160 eigenvector of the graph Laplacian for graph bisection and partitioning [36,37]. This 161 literature became particularly popular for image and mesh segmentation [38,39]. Within 162 a related literature on spectral clustering, a method emerged, in which the largest k 163 eigenvalues and eigenvectors of the normalized Laplacian matrix 1 were used to partition 164 a graph into the k communities [40].

165
The adjacency spectral characterization of modularity is, to our knowledge, a more 166 recent phenomenon than the Laplacian approach. Chauhan  the non-degenerate eigenvalues will cluster tightly around mean points corresponding 179 to tiers of the hierarchy. By contrast, we expect less rigidly organized networks to have a 180 much more crowded eigenspectrum. Most random graph models that are not explicitly 181 modular tend to generate relatively 'smooth' distributions of eigenvalues [42,43]. 182 Our reasoning is that, as the networks under our consideration range from topo-183 logically diversified, small-world interactomes into well-specialized blueprints, their 184 structure will less resemble primitive small-world architectures, and instead move in the 185 direction of hierarchical stochastic block models. 3 This will cause a decrease in Spectral 186 Entropy, as our eigenvalue distribution will become clustered is a handful of bins, both 187 near zero and around the mean value for each tier of the emergent hierarchy. In this way, and sub-modules begin to appear in the network structure.

191
Before considering our biological data in the next section, we will first demonstrate the plausibility of this hypothesis using simulations of a non-linear preferential attachment (PA) model [44], also called the non-linear Barabasi-Albert (BA) model [9]. In this model, a random graph is constructed iteratively, with new nodes added and attached to existing nodes at each step. A preferential attachment rule is utilized, in that the likelihood of the selection of a pre-existing node for attachment to the new nodes depends upon its current degree. More precisely, at iteration t, if pre-existing node i has degree k, then the probability Π that a newly added node attaches to i is governed by its degree: More precise master equations can be provided, dependent on parameters specifying 192 the initial network and the growth rates. When α = 1, we have linear dependence of 193 the probability of attachment on degree. This is the classic BA model, which generates 194 balanced, scale-free networks. When α = 0, we have an Erdős-Rényi random graph.

195
Although admittedly an imperfect model, we claim that α moving away from zero, and attachment regimes (coined in [44]). We see that, as we increase the preferential attach-203 ment parameter α from zero, Spectral Entropy decreases. This plausibly suggests that 204 Spectral Entropy is decreasing as some element of hierarchical structure emerges.

205
Interestingly, as α continuous to increase well past the balanced α = 1 region, we 206 see that Spectral Entropy begins rising again. This does not contradict our hypothesis.

207
As the power of preferential attachment rises, we expect a winner-take-all topology. centered around mid-tier mini-hubs, will be less likely to appear. 211 We illustrate this pattern more clearly in Fig. 1, by using the adjacency spectral For each α, we chose to represent the network with the highest Spectral Entropy, that with the lowest, and that with the median (respectively, ranks 1, 25, 50). For the ASEs, we also normalized by the maximal Euclidean norm of the embedded nodes, and took the absolute value in each ASE coordinate (|X 1 | and |X 2 | to avoid negative rotations (raw embeddings can be found in supplementary Fig. A1). Despite its small working range, Spectral Entropy is a sensitive measure to significant structural differences.
dot product graphs (RDPGs), a special type of generative random graph model. An nodes i and j is then proportional to their inner product: p i,j ∝ X T (i) X (j) . Up to orthogonal 222 transformations, ASE recovers the latent positions of RDPGs [31,45].

223
In our context, the two-dimensional ASE has the following interpretation. Those  2), we see that networks with smaller Spectral Entropy values have a relatively higher 253 N-bin entropy than 500-bin entropy, although this effect is small.  The signal with the Eukaryotes is interesting, but inference is limited because of 265 small sample size and unavailability of trait data. Therefore, we concentrate the rest 266 of the paper on the bacterial interactomes due to higher data availability and quality 267 (see [18]) for further exploration of the impact of traits on hierarchical modularity.

268
Linear regression between Spectral entropy and evolutionary distance, as measured by 269 phylogenetic branch distance [18,19] shows a significant negative relation ( Figure 4A).

270
As the evolutionary distance measures phylogenetic distance from the root of a common 271 phylogenetic tree, this trend suggests an increase in hierarchical modularity with higher 272 'evolutionary distance'.

273
Another feature that could affect modularity of bacterial interactomes is selective 274 pressure. Research has suggested relative G-C content as an indicator of selection [22,46], 275 and we explored the possible relationship between Spectral Entropy and relative G-C 276 content of genomes. We found a strong positive relationship between the relative G-C content and Spectral entropy (p-value less than 10 −12 ) ( Figure 4B). This implies a lower 278 modularity for the species with high G-C content. Pairwise t-tests showed significant different among the following pairs of groups: For C, no significant differences; For D, aerobes and microaerophilic bacteria (p-val less than 0.05), aerobes and anaerobes (p-val less than 10 −6 ); For E, aerobes and anaerobes (p-val less than 0.05), obligate anaerobes and aerobes (p-val less than 0.005), microaerophilic bacteria and obligate anaerobes (p-val less than 0.05), obligate anaerobes and anaerobes (p-val less than 0.05) Given the linkage of relative G-C content with aerobicity and oxidative-stress 280 induced damaged in previous literature [24,47], it is not surprising that there was a 281 significant difference in the relative G-C content of aerobes and microaerophilic bacteria 282 (p-val less than 0.05) and between aerobes and anaerobes (p-val less than 10 −6 ). All 283 other comparisons were non-significant ( Figure 4D). The same analysis for evolutionary 284 distance with aerobicity did not show any statistically significant differences between 285 any groups ( Figure 4C), whereas a comparison of Spectral entropy values revealed 286 significant differences between aerobes and anaerobes (p-val less than 0.05) as well as 287 between obligate anaerobes and each of aerobes (p-val less than 0.005), microaerophilic 288 bacteria (p-val less than 0.05) and anaerobes (p-val less than 0.05) ( Figure 4E). Growth (doubling time) is another important trait for bacteria, which we tested for 291 correlation with Spectral Entropy, but did not find any significant relation (Figure 4).

292
In some previous works, growth has been shown to be correlated with modularity of 293 bacterial metabolic networks [48], but we find no such evidence with PPI networks.  [21]). Linear model between the two variables show no statistical relationship between Spectral Entropy and evolutionary distance.

295
Leveraging the availability of high-quality, robust data about PPI networks [18] 296 for a large number of species from across the tree of life has spurred many analyses to 297 understand aspects of the structuring and functioning of these important intra-cellular 298 networks, such as network resilience [18,49], and higher order informative scales [50]. within the modules themselves [53]. These mechanisms in eukaryotes can therefore lead 312 to hierarchical modular structure in protein interactomes.

313
Previous work has shown that modularity increases and tapers off with increased 314 size of metabolic networks for bacterial species [54], but we saw an opposite pattern (see Figure 3B). Naively, one might expect that having a high value of hierarchical 319 modularity is difficult for larger number of entities in a network, as random mutations 320 and neutral processes can cause changes in interaction patterns [55]  For bacteria, we found a negative relationship between evolutionary distance [18,19] 334 and Spectral entropy, implying increased hierarchical modularity with greater "evolu-335 tion" from the last universal common ancestor. This could signify that a combination of 336 selective and neutral processes that have resulted in larger divergence have somehow 337 influenced the hierarchical modularity in the species ( Figure 4A).

338
A similar mechanism may underlie the positive correlation between relative G-C 339 content in bacterial genomes and Spectral entropy ( Figure 4B). A number of previous 340 works have shown that there is a selection pressure associated with G-C content [22,46].

341
Environmental factors that can affect the genomic G-C-content have been hypothesized, 342 such as the availability of oxygen [25], nitrogen fixation ability [56] and UV light [57].

343
These effects are comparatively weak [23], but a certain selective force remain operational 344 as suggested by past studies [22], although there is no definite consensus [23]. Through 345 our work, we found a negative correlation between hierarchical modularity and G-C 346 content. This accords with a previous study that suggested that directional selection can 347 create modularity [10]. G-C content has also been shown to affect repair of double strand 348 breaks in prokaryotic genomes, and oxygen-rich environments tend to favor higher 349 G-C content [24,47]. Somehow this distills to the fact that organisms that have high 3A). Archaea, which tend to be found in anaerobic conditions, have higher hierarchical 354 modularity than bacteria ( Figure 3A).

355
Comparisons among bacterial aerobicity classes revealed differences in Spectral 356 entropy, especially among aerobes and anaerobes (and obligate anaerobes) ( Figure   357 4E), and this affirms our idea that in organisms where high oxidative stress is affecting 358 genomes, high hierarchical modularity is less probable. This might again be due to 359 random mutations causing changes in the interaction patterns [55] that disrupt the 360 modular architecture of the system. This is the same reason why larger PPI networks 361 might have comparatively lower modularity, but here the effect is definitely stronger.

362
Some previous studies had linked growth rate as the selective force behind the 363 relative G-C content patterns [46], but had focused on relatively small group of organisms.

364
Our analysis showed no significant relation between growth rate and Spectral entropy, 365 which we expected given the complex nature of growth rates. Entropy and illustrate its applicability to a data-rich set of complex biological systems.  Figure A1. We present the same ASEs as in Fig. 1D, except without taking the absolute value in each component. As can be seen, except for the disorganized linear PA regime (α = 1), the ASEs are identical to those in Fig. 1D up to rotation. Under the interpretation of ASE as a consistent estimator for RDPGs, this is irrelevant.