Network module detection using recursive local graph sparsification and clustering

Here we present a fast and highly scalable network module detection method that preserves community structure by recursively integrating graph sparsification and clustering. Our algorithm, called SparseClust, participated in the most recent DREAM community challenge on disease module identification, an open competition to comprehensively assess module identification methods across a wide range of biological networks.


Introduction
Many community detection algorithms for complex networks have been presented over the last decades. However, determining the quality of individual algorithms, in terms of accuracy and computing time, remains an open issue (Yang et al., 2016). To obtain optimal clustering performance with respect to both criteria, especially for large and densely connected networks, graph sparsification should be considered as a general preprocessing step for any clustering approach. This leads to the critical problem of efficiently identifying edges that are more likely to lie within rather than between clusters. Methodologies for sparsifying graphs include random edge sampling (Karger, 1994; Tumminello et al., 2005) and complex network backbone detection (Glattfelder and Battiston, 2009). However, as these approaches are designed to identify the most interesting edges or to construct spanning-tree-like network backbones, they are less suited to preserving community structure (Satuluri et al., 2011). In addition, methods such as network backbone detection become increasingly computationally expensive for large networks (Hamann et al., 2016). Other approaches use edge centrality measures, such as edge betweenness centrality (Newman and Girvan, 2004). The betweenness centrality of an edge is proportional to the number of shortest paths between any two nodes in the network that pass through that edge. As a consequence, edges with high betweenness centrality can be seen as bottlenecks in the network. They are therefore more likely to be inter-cluster edges, which should be filtered first. However, one major disadvantage of edge betweenness centrality is that it becomes prohibitively expensive to compute for larger graphs (Newman and Girvan, 2004).
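To make the notion of edge betweenness concrete, the following is a minimal sketch in plain Python that counts, for every edge, the shortest paths passing through it, using breadth-first search from each node with dependency accumulation. The function name `edge_betweenness`, the adjacency-dictionary input and the toy graph are illustrative assumptions and are not part of SparseClust. The full pass over all source nodes is what makes the measure costly on large graphs.

```python
from collections import deque

def edge_betweenness(adj):
    """Unnormalized edge betweenness of an undirected, unweighted graph.

    adj maps each node to the set of its neighbors (hypothetical input format).
    """
    score = {frozenset((u, v)): 0.0 for u in adj for v in adj[u]}
    for s in adj:
        # Breadth-first search from s: count shortest paths (sigma)
        # and remember the predecessors on those paths.
        dist = {s: 0}
        sigma = {v: 0 for v in adj}
        sigma[s] = 1
        preds = {v: [] for v in adj}
        order = []
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Walk back from the farthest nodes, crediting each edge with the
        # fraction of shortest paths from s that run through it.
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):
            for v in preds[w]:
                credit = sigma[v] / sigma[w] * (1.0 + delta[w])
                score[frozenset((v, w))] += credit
                delta[v] += credit
    # Every node pair was visited from both endpoints, so halve the totals.
    return {edge: total / 2.0 for edge, total in score.items()}

# Toy graph: two triangles (a, b, c) and (d, e, f) joined by the bridge c-d.
bridge_graph = {
    "a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"},
    "d": {"c", "e", "f"}, "e": {"d", "f"}, "f": {"d", "e"},
}
eb = edge_betweenness(bridge_graph)
bottleneck = max(eb, key=eb.get)  # the bridge c-d carries all 3 x 3 cross-triangle paths
```

In this toy graph the bridge between the two triangles receives the highest score, which is why betweenness-based filtering would remove it first.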
Here, we present a fast, highly scalable heuristic for iterative local graph sparsification and module detection (see Figure 1). The main idea behind our sparsification algorithm is to retain edges that are likely to be part of the same cluster, i.e. intra-cluster edges, and therefore to remove those edges that connect modules, i.e. inter-cluster edges. Each sparsification step is followed by a clustering step. The procedure iterates until no more modules larger than a predefined size exist or the sparsification does not change the clustering any further.

Methods
Our approach recursively iterates graph sparsification and clustering until convergence, i.e. until no more graph sparsification is possible or no more modules larger than a predefined size exist. To identify intra-cluster edges, Satuluri et al. (2011) proposed to rank edges using a Jaccard similarity coefficient, based on the assumption that an edge is likely to lie within a module if its two adjacent nodes have a high overlap in neighbors. Here, we use the inverse log-weighted similarity of the two nodes adjacent to the edge. It denotes the number of common neighbors of both nodes, with each common neighbor weighted by the inverse logarithm of its degree. We prefer this measure over the Jaccard similarity, as it is based on the assumption that two vertices should be considered more similar if they share a low-degree common neighbor, given that high-degree common neighbors are more likely to have occurred by chance (Adamic, 2003).
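The difference between the two measures can be illustrated with a small sketch in plain Python. The helper names (`build_adjacency`, `jaccard`, `inv_log_weighted`) and the toy graph are illustrative assumptions. The Jaccard variant shown here compares plain neighbor sets, which is one of several common conventions.

```python
import math

def build_adjacency(edges):
    """Neighbor sets of an undirected graph given as a list of node pairs."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj

def jaccard(adj, u, v):
    """Shared neighbors divided by all neighbors of u and v."""
    union = adj[u] | adj[v]
    return len(adj[u] & adj[v]) / len(union) if union else 0.0

def inv_log_weighted(adj, u, v):
    """Sum of 1 / log(degree) over the common neighbors of two distinct nodes.

    A common neighbor of two distinct nodes has degree >= 2, so log(degree) > 0.
    """
    return sum(1.0 / math.log(len(adj[k])) for k in adj[u] & adj[v])

# u and v share the low-degree node "low" (degree 2);
# x and y share the hub "hub" (degree 10).
edges = [("u", "low"), ("v", "low"), ("x", "hub"), ("y", "hub")]
edges += [("hub", "p%d" % i) for i in range(8)]
adj = build_adjacency(edges)

jaccard_uv, jaccard_xy = jaccard(adj, "u", "v"), jaccard(adj, "x", "y")
invlog_uv, invlog_xy = inv_log_weighted(adj, "u", "v"), inv_log_weighted(adj, "x", "y")
```

In this example the Jaccard coefficient cannot distinguish the two node pairs, whereas the inverse log-weighted similarity ranks the pair sharing the low-degree neighbor higher than the pair sharing the hub.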
The algorithm, as described in pseudocode below, first performs an initial clustering on the full graph G. In addition, a similarity matrix S is formed by computing the inverse log-weighted similarity over all nodes V. The clustering results are then analyzed, retaining all predicted modules within a given size, i.e. a maximum of 100 genes for the DREAM challenge on Disease Module Identification (Choobdar et al., 2018). All larger modules are subsequently sparsified and clustered until convergence. For each large module, the similarity matrix is evaluated and all edges in the module-specific subgraph with a similarity coefficient below a given threshold are removed. This threshold is given as a percentage-based cutoff with respect to the module-specific cumulative distribution function of the similarities. This makes our sparsification approach local and adaptive to each individual module. A local similarity threshold is preferred over a global one, given that clusters may have different densities. A global threshold might therefore retain far more edges in the denser clusters while disconnecting the less dense clusters (Satuluri et al., 2011). For subsequent module detection, we choose the Infomap algorithm because of its accuracy, scalability and computational complexity (Yang et al., 2016), although any other clustering algorithm or combination of algorithms could be used as well.
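The procedure described above can be sketched in plain Python as follows. This is a simplified illustration under stated assumptions, not the SparseClust implementation: connected components stand in for Infomap as the clustering step, the percentage cutoff is taken over the similarities of the edges inside each module, and all function and parameter names (`sparsify_and_cluster`, `max_size`, `cutoff`) are hypothetical.

```python
import math
from itertools import combinations

def connected_components(nodes, edges):
    """Stand-in clustering step (assumption): connected components, not Infomap."""
    adj = {n: set() for n in nodes}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    seen, parts = set(), []
    for start in nodes:
        if start in seen:
            continue
        seen.add(start)
        stack, part = [start], set()
        while stack:
            x = stack.pop()
            part.add(x)
            for y in adj[x] - seen:
                seen.add(y)
                stack.append(y)
        parts.append(part)
    return parts

def inv_log_similarity(nodes, edges):
    """Inverse log-weighted similarity per edge, computed once on the full graph."""
    adj = {n: set() for n in nodes}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return {(u, v): sum(1.0 / math.log(len(adj[k])) for k in adj[u] & adj[v])
            for u, v in edges}

def sparsify_and_cluster(nodes, edges, sim, max_size, cutoff,
                         cluster=connected_components):
    """Alternate clustering and local sparsification until convergence."""
    final, todo = [], [(set(nodes), list(edges))]
    while todo:
        mod_nodes, mod_edges = todo.pop()
        for part in cluster(mod_nodes, mod_edges):
            part_edges = [e for e in mod_edges if e[0] in part and e[1] in part]
            if len(part) <= max_size or not part_edges:
                final.append(part)  # small enough, or nothing left to remove
                continue
            # Local threshold: the cutoff quantile of this module's own similarities.
            scores = sorted(sim[e] for e in part_edges)
            threshold = scores[min(int(cutoff * len(scores)), len(scores) - 1)]
            kept = [e for e in part_edges if sim[e] >= threshold]
            if len(kept) == len(part_edges):
                final.append(part)  # sparsification no longer changes the module
            else:
                todo.append((part, kept))  # strictly fewer edges, so the loop ends
    return final

# Toy network: two 4-cliques joined by the single bridge 3-4.
nodes = list(range(8))
edges = (list(combinations(range(4), 2))
         + list(combinations(range(4, 8), 2)) + [(3, 4)])
sim = inv_log_similarity(nodes, edges)
modules = sparsify_and_cluster(nodes, edges, sim, max_size=4, cutoff=0.8)
```

Because the threshold is recomputed from the similarities inside each oversized module, a dense and a sparse module are pruned relative to their own similarity distributions rather than against one global value. Passing a different `cluster` function would swap in another community detection method.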

Discussion
Given the scarcity of graph sparsification approaches in R, in contrast to a plethora of state-of-the-art community detection algorithms, our goal is to build our approach as a one-stop solution in R that performs network pre-processing and clustering in a single function call. For the DREAM challenge on Disease Module Identification (Choobdar et al., 2018), which provided several biological networks, we used two different sparsification thresholds, i.e. 80 % for the protein interaction, signal and cancer networks, and 95 % for the coexpression and homology networks. These selections were based on visual inspection of the clusters, with respect to the initial network densities and the resulting number and sizes of clusters. Having entered the DREAM competition on Disease Module Identification (Choobdar et al., 2018) only within its last week, we see our approach as a conceptual prototype that focuses on speed as much as accuracy.

Figure 1. Overview of the SparseClust algorithm. The input network is repeatedly sparsified and subsequently clustered, until no more sparsification can be performed.