In the following, we explore the idea of performing community detection by finding a suitable set of nodes that separates the communities, as defined in Definition 1, in a mathematically rigorous manner. To meet the demand of the derived QUBO formulation for a separation-edge estimator, we subsequently introduce a promising heuristic approach based on the concept of modularity.
3.1. Separation-node sets
The approach presented in this paper consists of two steps:
- (1) identifying a set of nodes separating communities and thus revealing the fundamental community structure (Section 3.2 and Section 3.3)
- (2) classifying the community of each separation-node to finalize the community detection (Section 3.4)
Step (2) can be performed either with a trivial greedy approach introduced in Section 3.4 or with a slight adaptation of the well-known QUBO formulation of modularity maximization [33]. The main objective of this paper is therefore the development of a QUBO approach realizing (1). To provide a more formal definition of (1), we now introduce the concept of separation-node sets. In the following, we use $\mathcal{S}$ to denote the set of all separation-node sets.
Definition 1.
For a graph $G = (V, E)$ and a ground truth community structure $C$ partitioning $V$, we call $S \subseteq V$ a set of separation-nodes iff the connected components $\hat{C}$ partitioning the graph induced by $V \setminus S$ are distributed such that $\hat{C}$ is a refinement of $C$.
Equivalently, one could demand the existence of a refinement map $f: \hat{C} \to C$ mapping each connected component $\hat{c} \in \hat{C}$ onto a community $c \in C$ such that $\hat{c} \subseteq c$. Utilizing the notion of separation-node sets, (1) can be formulated as finding a smallest set of separation-nodes whose associated refinement map $f$ is ideally bijective. An example of a set of separation-nodes satisfying these conditions is depicted in Figure 1b, which is part of Figure 1 displaying the proposed approach. As will become apparent in the evaluation, such well-behaved separation-node sets can also be found in graphs whose topologies are close to real-world applications.
The surjectivity of $f$ ensures that each community gets detected, and its injectivity ensures that no community gets split. In the following, we call a separation-node set injective, surjective, or bijective iff its refinement map satisfies the respective condition. In order to formulate a QUBO problem whose optimal solution represents a minimal separation-node set, we start by stating an alternative, more convenient definition of minimal separation-node sets.
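Definition 1 and the refinement map can be checked mechanically on small graphs. The following is a minimal sketch, assuming the graph is given as an adjacency dictionary and the ground truth as a node-to-label dictionary; the function names are hypothetical and not taken from the authors' implementation.

```python
from collections import deque

def connected_components(adj, removed):
    """Connected components of the graph induced by all nodes not in `removed`."""
    seen, components = set(removed), []
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        component, queue = {start}, deque([start])
        while queue:
            u = queue.popleft()
            for w in adj[u]:
                if w not in seen:
                    seen.add(w)
                    component.add(w)
                    queue.append(w)
        components.append(component)
    return components

def refinement_map(adj, community, S):
    """Return [(component, community label), ...], or None if S is not a separation-node set."""
    mapping = []
    for component in connected_components(adj, S):
        labels = {community[v] for v in component}
        if len(labels) != 1:  # component spans several communities: no refinement
            return None
        mapping.append((component, labels.pop()))
    return mapping

def is_injective(mapping):
    """No community is hit by more than one component (no community gets split)."""
    images = [label for _, label in mapping]
    return len(images) == len(set(images))

def is_surjective(mapping, community):
    """Every ground truth community is hit by at least one component."""
    return {label for _, label in mapping} == set(community.values())

# Path graph 0-1-2-3-4-5 with communities {0, 1, 2} and {3, 4, 5} (toy data).
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}
community = {0: "A", 1: "A", 2: "A", 3: "B", 4: "B", 5: "B"}
bijective_example = refinement_map(adj, community, {2})
split_example = refinement_map(adj, community, {1, 3})
```

In this toy example, removing node 2 yields one component per community (a bijective separation-node set), whereas removing nodes 1 and 3 splits community A into two components, so the resulting set is surjective but not injective.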
Theorem 1.
For an adequate penalty term P ensuring the separation-node set properties, the following equation states an equivalent definition of the set containing all minimal separation-node sets:
Here, we used $x_i$ as a 0-flag for separation-nodes (i.e., $x_i = 0$ iff $v_i$ is a separation-node), $a_{ij}$ to denote the entries of the adjacency matrix, $c$ as a mapping of nodes to their ground truth community, and the Kronecker delta $\delta$. For a penalty term P ensuring the validity of the separation-node set definition by penalizing adjacent node pairs from strictly different communities where neither node is an element of the sought-after separation-node set, see the following definition:
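Read together with the symbol descriptions above, the objective of Theorem 1 and its penalty term can be sketched as follows. The notation ($\mathcal{S}_{\min}$, $x_i$, $a_{ij}$, $c(\cdot)$, $\delta$, and the penalty weight $\lambda$) and the exact arrangement of the terms are assumptions inferred from the surrounding text, not a verbatim statement of the theorem.

```latex
\mathcal{S}_{\min}
  = \Bigl\{ \{ v_i \in V \mid x_i = 0 \} \;\Bigm|\;
      x \in \operatorname*{arg\,min}_{x \in \{0,1\}^{|V|}}
      \Bigl( -\sum_{i} x_i + P(x) \Bigr) \Bigr\},
\qquad
P(x) = \lambda \sum_{i < j} a_{ij}\,\bigl(1 - \delta(c(v_i), c(v_j))\bigr)\, x_i x_j
```

In this reading, minimizing $-\sum_i x_i$ keeps as many nodes as possible outside the separation-node set, while $P(x)$ adds a cost of $\lambda$ for every edge between different communities whose endpoints both have $x = 1$, i.e., for every such edge that is not covered by a separation-node.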
Therefore, the task of finding a smallest set of separation-nodes for any given graph maps naturally onto a QUBO problem. Its formulation can be reduced to approximating $1 - \delta(c(v_i), c(v_j))$ for adjacent node pairs $(v_i, v_j)$. This can be understood as estimating the probability that an edge connects adjacent nodes belonging to different communities, or more formally, that it is a separation-edge.
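To make the reduction concrete, the following sketch builds such a QUBO for a toy graph and solves it by exhaustive search. It assumes an objective of the form "minus the number of non-separation-nodes plus a penalty for every uncovered separation-edge"; the function names, the penalty weight, and the perfect estimator used here are illustrative assumptions, not the authors' implementation.

```python
from itertools import product

def build_qubo(n, edges, estimator, penalty):
    """QUBO as {(i, j): coefficient} with i <= j; x_i = 0 flags a separation-node."""
    Q = {(i, i): -1.0 for i in range(n)}  # reward every node that is NOT removed
    for i, j in edges:
        i, j = min(i, j), max(i, j)
        # penalize an estimated separation-edge whose endpoints both stay in the graph
        Q[(i, j)] = Q.get((i, j), 0.0) + penalty * estimator(i, j)
    return Q

def energy(Q, x):
    """Evaluate the QUBO objective; for binary x, x_i * x_i equals x_i."""
    return sum(coeff * x[i] * x[j] for (i, j), coeff in Q.items())

def brute_force_minima(Q, n):
    """All minimizers by exhaustive search (only feasible for tiny n)."""
    best, minima = None, []
    for x in product((0, 1), repeat=n):
        e = energy(Q, x)
        if best is None or e < best:
            best, minima = e, [x]
        elif e == best:
            minima.append(x)
    return best, minima

# Path graph 0-1-2-3-4-5 with ground truth communities {0, 1, 2} and {3, 4, 5}.
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5)]
community = [0, 0, 0, 1, 1, 1]
perfect = lambda i, j: 0.0 if community[i] == community[j] else 1.0  # stands in for 1 - delta

# Assumed penalty weight > 1, so leaving a separation-edge uncovered never pays off.
Q = build_qubo(6, edges, perfect, penalty=2.0)
best, minima = brute_force_minima(Q, 6)
separation_sets = [{i for i, xi in enumerate(x) if xi == 0} for x in minima]
```

With the perfect estimator, the two minimizers correspond to removing either endpoint of the single separation-edge (2, 3). In practice the estimator would be replaced by a heuristic, and the exhaustive search by a QUBO solver.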
Most interestingly, we can show that solving the QUBO problem stated in Equation (5) is NP-hard for a specific estimator. To see this, we start by observing a substantial similarity between our QUBO formulation and the QUBO formulation of the Max-Clique problem as stated in [34]:
$$\min_{x \in \{0,1\}^{|V'|}} \; -\sum_{i} x_i + \lambda \sum_{i<j} (1 - a'_{ij})\, x_i x_j$$
for a given graph $G' = (V', E')$, a penalty weight $\lambda$, and the corresponding adjacency matrix A with entries $a'_{ij}$. Choosing the estimator of $1 - \delta(c(v_i), c(v_j))$ to be $1 - a'_{ij}$, it becomes apparent that the QUBO formulations are identical if we use a complete graph of size $|V'|$ as the input to our QUBO formulation. Leaving an extensive mathematical analysis of the NP-hardness for more realistic estimators to future work, this shows that the problem of finding a minimal separation-node set is NP-hard when treating the estimator as a variable. This result supports the pursuit of the proposed approach of using quantum computing to find a minimal separation-node set.
Returning to the initial goal of finding bijective separation-node sets, we now explore their surjectivity. A significant finding regarding surjectivity is illustrated in Figure 2, which shows that there is no free lunch when using Theorem 1 to find surjective separation-node sets, i.e., the minimal separation-node sets obtained this way are not necessarily surjective. This necessitates the addition of a penalty term to the QUBO formulation in order to ensure surjectivity when building upon Theorem 1. For the formulation of a suitable penalty term, see Appendix A.2.
As our formulation results in a PUBO (polynomial unconstrained binary optimization) problem of degree
, we conjecture that this constraint cannot be realized in QUBO form without the addition of ancillary variables. Using the standard quadratization approach with the Rosenberg polynomial [
35], a QUBO formulation of this term demands superpolynomially many ancillary variables, i.e.,
. In the context of quantum annealing, this scaling beyond a quadratic number of qubits makes the surjective separation-node approach overly complex compared to the standard modularity maximization. In the gate model, the QAOA can be used to solve PUBO problems in principle, but as current hardware limitations prohibit adequate evaluation, we leave the exploration of the surjectivity constraint to future work.
As a consequence of not enforcing surjectivity, the number of communities may be incorrect after step (1), which detects the fundamental community structure by separation-node set identification. By slightly modifying step (2), this could in principle be compensated for by iteratively increasing the number of possible communities until no further improvement of the modularity can be achieved. One way to do this could be the elbow method known from clustering [36]. For the alternative greedy approach to step (2), merging communities could be allowed.
Fortunately, our experiments show that topological structures precluding a free lunch are scarce in practice. Therefore, we omit the explicit demand for surjective separation-node sets in the following.
Analogously to surjectivity, there exist graph topologies like the one displayed in Figure 3 showing that there is no free lunch when using Theorem 1 to find injective separation-node sets. Hence, it appears necessary in principle to ensure injectivity explicitly using a penalty term when building upon Theorem 1 as well. The formulation of such a penalty term also turns out to be rather tedious, as can be seen in Appendix A6. In this case, we end up with an even higher dimensional PUBO problem for the injectivity than for the surjectivity. Fortunately, compared to surjectivity, the injectivity of a separation-node set is of less importance, as step (2) could easily be adapted to cope with its absence. As in the case of surjectivity, we observe topological structures preventing a free lunch quite rarely in our experiments, so we likewise dismiss an explicit demand for the separation-node sets to be injective in practice.
In summary, the apparent infrequency of topological structures preventing a free lunch regarding bijectivity makes the QUBO formulation stated in Theorem 1 a well-founded starting point for the proposed QUBO-based community detection via separation-node sets.
While this approach provides exact results for a perfect classification of separation-edges, it fully relies on a suitable estimation heuristic. Although many known measures for various edge properties exist (as described in Section 2), none proved entirely suitable for detecting separation-edges in the pretests conducted for this paper. Consequently, we now motivate a novel approach tailored to exactly this task, based on the concept of modularity.
3.2. Modularity-based separation-edge estimation
Motivated by the proven optimality of modularity and by the fact that, at its core, modularity is based on estimating whether each node pair is likely to belong to the same or to different communities, we start by showing how this idea can be used to estimate $1 - \delta(c(v_i), c(v_j))$. For this, recall the definition of the entries of the modularity matrix:
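A standard form of this definition, following the usual modularity matrix, is sketched below. The symbols $B_{ij}$, $k_i$ (degree of node $v_i$), and $m$ (number of edges), as well as the overall normalization by $2m$, are assumptions inferred from the description that follows.

```latex
B_{ij} = \frac{1}{2m} \left( a_{ij} - \frac{k_i k_j}{2m} \right)
```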
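As before, $a_{ij}$ are the entries of the respective adjacency matrix, while $\frac{k_i k_j}{2m}$ denotes the expected value of the number of edges between $v_i$ and $v_j$, with $k_i$ the degree of $v_i$ and $m$ the number of edges. Upon closer inspection of the modularity matrix entries $B_{ij}$, we observe two main cases:
- $B_{ij} > 0$ iff less connectivity between $v_i$ and $v_j$ was to be expected, indicating that $v_i$ and $v_j$ likely belong to the same community
- $B_{ij} < 0$ iff more connectivity between $v_i$ and $v_j$ was to be expected, indicating that $v_i$ and $v_j$ likely belong to different communities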
As the matrix entries are normalized to the interval of by the division with , we can see that using proper rescaling to the interval of , i.e., via , this allows for an estimation of the term in principle.
In practice, however, this approach yields very poor estimates, as only the entries $B_{ij}$ of the modularity matrix corresponding to an existing edge $(v_i, v_j)$ are relevant. For these, it quickly becomes apparent that $B_{ij}$ is typically larger than 0, so every edge tends to be classified as an intra-community edge, making this exact idea infeasible in practice. These considerations motivate an adaptation of modularity for the estimation of separation-edges, as proposed in the following.
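The observation that modularity matrix entries of existing edges are typically positive can be reproduced on a toy graph. The sketch below assumes the unnormalized form $a_{ij} - k_i k_j / (2m)$ (a positive scaling factor would not change the signs); the function name and the example graph are illustrative.

```python
def modularity_matrix(n, edges):
    """Entries a_ij - k_i * k_j / (2m) for a simple undirected graph on nodes 0..n-1."""
    degree = [0] * n
    for i, j in edges:
        degree[i] += 1
        degree[j] += 1
    two_m = 2 * len(edges)
    adjacent = {frozenset(e) for e in edges}
    return [
        [
            (1.0 if frozenset((i, j)) in adjacent else 0.0) - degree[i] * degree[j] / two_m
            for j in range(n)
        ]
        for i in range(n)
    ]

# Two triangles {0, 1, 2} and {3, 4, 5} joined by the single separation-edge (2, 3).
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
B = modularity_matrix(6, edges)
edge_entries = {(i, j): B[i][j] for i, j in edges}
```

In this example every existing edge, including the separation-edge (2, 3), receives a positive entry, so the sign of the entry alone cannot distinguish separation-edges from intra-community edges.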
3.3. Edge neighborhood connectivity based separation-edge estimation
Exploiting the mathematical structure of modularity for a straightforward separation-edge estimation, we now introduce a promising generalization of the previous approach, which we call the neighborhood connectivity of an edge. Instead of merely taking the direct connection between two nodes into account (i.e., an edge), the neighborhood connectivity of an edge considers connections between the neighborhoods of the nodes. In this context, the r-neighborhood of a node v is defined as the set of nodes whose shortest path to v has length r.
Based on this idea, we can rephrase the basic case of our generalization, i.e., modularity, as merely counting the number of unique edges on paths of length 1 between the 0-neighborhoods of the respective nodes $v_i$ and $v_j$. The generalization proposed here introduces the following two new notions:
- (1) Consider connections between r-neighborhoods with radius r
- (2) Also consider paths of length 2
Stating this more precisely in mathematical form, we now define the neighborhood connectivity of an edge given a path length l and a neighborhood size r:
In this definition, denotes the number of unique edges contained in paths of length l connecting the r-neighborhoods of the given nodes which do not involve nodes or edges contained by the -neighborhoods (as this would result in possible double counting of edges). Analogously to the definition of modularity, denotes the expected value corresponding to and acts as a normalization factor denoting the highest possible number can assume.
These values can be calculated with a simple breadth-first search of depth r that iterates over the neighborhood layers, using the two endpoints of the edge as starting nodes. For the calculation of the expected value, the configuration model has proven to be an adequate choice (which is in line with modularity). For details, we refer to our implementation, which can be made available upon request to the authors.
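The layer-wise breadth-first search mentioned above can be sketched as follows. This covers only the computation of the r-neighborhoods (nodes at shortest-path distance exactly r); counting the connecting edges and the expected value are not shown, and the function name and adjacency-dictionary representation are assumptions.

```python
def neighborhood_layers(adj, v, r):
    """Return [N_0(v), ..., N_r(v)], where N_k(v) holds the nodes at distance exactly k from v."""
    layers, seen = [{v}], {v}
    for _ in range(r):
        # the next layer consists of all unseen neighbors of the current layer
        next_layer = {w for u in layers[-1] for w in adj[u] if w not in seen}
        seen |= next_layer
        layers.append(next_layer)
    return layers

# Two triangles {0, 1, 2} and {3, 4, 5} joined by the edge (2, 3) (toy data).
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}
layers_of_2 = neighborhood_layers(adj, 2, 2)
```

For the edge (2, 3), calling the function once per endpoint yields the neighborhood layers between which connecting paths would then be counted.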
Our preferred method of combining the individual results for all considered pairs of path length l and neighborhood size r into the neighborhood connectivity of a given edge is the dot product with a weight vector w whose entries sum to 1:
As we know that the standard modularity value is of little use, we chose . We consider the remaining weights as hyperparameters, for which have proven to be suitable values according to conducted experiments.
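The weighted combination can be sketched as a plain dot product over the considered (l, r) pairs. All numbers below are placeholders chosen purely for illustration: they are not the connectivity values or weights used in the paper, and setting the weight of the modularity-like base case (l = 1, r = 0) to zero is an assumption based on the remark that this value is of little use.

```python
def combine_connectivities(values, weights):
    """Dot product of per-(l, r) connectivity values with weights that sum to 1."""
    if values.keys() != weights.keys():
        raise ValueError("values and weights must cover the same (l, r) pairs")
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return sum(weights[key] * values[key] for key in values)

# Hypothetical per-(l, r) connectivity values of one edge and hypothetical weights.
values = {(1, 0): 0.6, (1, 1): -0.2, (2, 0): 0.4}
weights = {(1, 0): 0.0, (1, 1): 0.5, (2, 0): 0.5}
combined = combine_connectivities(values, weights)
```

Treating the weights as a dictionary keyed by (l, r) keeps the hyperparameters explicit and makes it easy to try other weightings.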