Hyperedge Entanglement in High-Order Multilayer Networks

Many real complex systems present multilayer structure where high-order metadata on one layer refers to dyadic data on a lower layer. Significant progresses to analyse high-order metadata under the assumption of community organization have been done. However, there are no planted communities in real-world networks, and the necessity of new frameworks to analyze high-order metadata regardless of community organization has been raised.Here, we propose to adopt hyperedge organization. Predicting ‘entanglements’ between a hyperedge and nodes scattered in the rest of the network might suggest structural or functional liaisons, without assumption of any community organization. We introduce a novel concept: hyperedge entanglement (HE), which associates to each hyperedge an entangled hyperedge, by means of a network operator that predicts significant ‘interactions at distance’ between network nodes and existing hyperedges. We also introduce a new challenge termed hyperedge entanglement prediction (HEP), and an algorithm to perform this task. We evaluated HEP performance on social, biological and synthetic data where, given only topology and hyperedges (such as communities or functional modules), the goal is to predict whether nodes not connected to a certain hyperedge might be candidates for a significant entanglement. Finally, as real application in diseasome systems biomedicine, we perform HEP on the human protein interactome to predict unknown gene entanglements with the COPD disease gene hyperedge. HEP predictions are validated by biological experiments, enlarging our understanding of molecular mechanisms behind COPD/aneurysm comorbidity.


Introduction
Research on the modular dyadic organization of complex networked systems has become increasingly popular in many scientific disciplines diverse from physics, including sociobehavioral 1-3 , biological 3,4 and maritime 5 science. Progress has been achieved also in modeling community organization in artificial networks by random graphs without 6 and with geometry [7][8][9][10] . Community detection is an attractive area of investigation in the field of complex networks 11 , and recent studies investigated how network embedding can enhance community detection [12][13][14][15][16] . However, how topological community organization should be interpreted and harmonized with node metadata group information -and what are the limitations of such endeavoris a controversial topic currently under investigation 17 . When Science is stuck in front of a dilemma or conceptual obstaclelet us term it a 'Gordian Knot'then it is the moment that new ideas emerge. These ideas are not always the final solutions to the problem but represent a detour towards a new definition of the conceptual framework inside which flourishing of new interpretations and methods can bring toward 'unleashing the Gordian Knot'. Here, we are in front of a 'Community Knot'; and this study aims to select a different conceptual and theoretical 'blade' and to sharpen it, in the hope that others in the future might success to 'cut the Community Knot'.
The current conceptual framework is that many real networks are composed of several communities, also referred to as clusters or modules. An important and accepted property to distinguish a community in respect to others, is that its nodes have a higher probability to link to one another than to nodes that belong to a different community 18, 19 . These groups of densely connected nodes can have entirely different meanings depending on the complex system under observation. Frequently, communities might be matched with groups of nodes that, according to metadata, share common properties or play similar roles. In a social network, a community can have a certain overlap with a tight group of friends, people sharing the same hobbies or the same job. In a protein-protein interaction (PPI) network it can have a certain correspondence to proteins involved in a specific process, function or pathway. Whereas, in a citation network, it might be to a certain extent associated to a set of scientific papers with a related topic. Yet the matching between the node metadata information and the topological community segregation is not always respected. In some cases, this is due to a multiplayer data structure.
For instance, we can have a scientific co-authorship network, where each node is an author, and each link indicates a collaboration in a study between two authors, whereas the metadata information is the research institution of the authors. In other cases, such as for PPI networks, the network topology is incomplete and metadata information too, therefore some modules indicating a cohort of genes whose proteins are involved in a disease are composed by nodes which are scattered in the network without a true community organization between them, that is also one of the cases we will investigate in the present study. In all these scenarios, we can still say that those groups of nodes are involved in a hyperedge. It means that they are linked (all together) at high-order by means of the metadata information without a detailed specification of their dyadic interactions.
In this study we set a conceptual and computational framework that, at the best of our knowledge, has not been investigated in the previous literature. We start from the observation that many real complex systems present multilayer structure where high-order metadata on one layer refers to dyadic data on another layer. For instance, hyperedge organization (in the first layer) might be associated to node functional segregation and integration of dyadic networks (in the second layer). Predicting 'entanglements' between a hyperedge and nodes scattered in the rest of the network might suggest either structural (direct) or functional (indirect) liaisons.
Hence, we introduce the novel concept of hyperedge entanglement (HE), which associates to each hyperedge an entangled hyperedge, by means of a network operator that predicts significant 'interactions at distance' between network nodes and existing hyperedges. To this aim we also introduce a new challenge termed hyperedge entanglement prediction (HEP), and an algorithm to perform this task.
Depending on the area of application, entanglement can be interpreted as missing association due to a lack of observation or as a future connection. However, we would like to clarify that the topic is related, but not equivalent, to the classical link prediction. Indeed, in the classical link prediction problem, given the network topology, the goal is to identify the connections between nodes that have not been observed and that are likely to occur 20 . In other words, the links to predict are dyadic interactions between two nodes. Differently, in the problem that we are going to address in this study, the entanglement to predict is between a node and a hyperedge. In addition, we can also predict an entangled hyperedge which is an ensemble of nodes that are significantly entangled to the 'seed' hyperedge. In the rest of the study, we will discriminate these two different tasks respectively as node2hyperedge entanglement (N2HE) and hyperedge entanglement prediction (HEP); being N2HE a subroutine of HEP. The motivation for studying HEP is that connections between a hyperedge and other nodes in the rest of the network are pivotal bridges which might suggest either missed node membership or integration with other functionally overlapping hyperedges.
Examples of applications could be: i) to spot unknown associations between genes and a disease module whose proteins are scattered in the PPI network; ii) to predict contacts between terrorists around the world and an existing terrorist cell; iii) to predict contagions, in a contact tracing network, between a cluster of COVID 19 infected people that emerged in a certain time and location, and unknown people that are susceptible to the disease. Among these three examples, we recovered the data to investigate the first real case scenario, whereas we hope that in future studies we might procure the data to investigate the others. After evaluation of HEP performance on social, biological and synthetic networks, we indeed offer an example of real application in systems biomedicine of the diseasome 21 . The diseasome is the bipartite network where diseases on one layer connect to associated genes on the other layer. Hence a disease can be represented by a hyperedge of genes that can be projected on the PPI network nodes. We compute HEP on the human protein interactome in order to predict the entanglement hyperedge that contains genes significantly entangled with the chronic obstructive pulmonary disease (COPD) hyperedge. We consider this peculiar disease because it impacts large part of the worldwide population, and because we have specific expertise to biologically validate some of the predictions by techniques such as co-immunoprecipitation and gene silencing. The result of this validation confirms HEP computational prediction, and spread light on unknown molecular mechanisms behind COPD/aneurysm comorbidity.

Hyperedge entanglement: definition and algorithm
In this section, we will firstly introduce the theoretical definitions related to the hyperedge entanglement concepts and then we will provide the description of the algorithm for computing the significance (a p-value) of node2hyperedge entanglement. From here forward, we will consider as a working framework a high-order multilayer network (Fig. 1), which is a multiplex composed of two layers: one high-order (Fig. 1A) and one dyadic (Fig. 1B). The first highorder layer is a hypergraph ( , ), where is the set of vertices (or nodes) and is the set of hyperedges; the second dyadic layer is a graph ( , ), where is the same set of vertices as the hypergraph and is the set of edges (or links). In contrast to the dyadic graph, where an edge connects exactly two vertices, in the hypergraph a hyperedge can connect any number of vertices. Finally, in this study we will refer to the terms network and graph as interchangeable, but we warn the reader that this is a simplification that we apply only for easing the readability of the study. Although these two terms are often used in the literature of complex systems interchangeably, the graph is in general a pure mathematical structure, whereas the network might be better interpreted as a graph that emerges from the underlying dynamics of a process in a complex system.

Hyperedge entanglement (HE)
The hyperedge entanglement is an operation that associates to a given hyperedge ℎ an entangled hyperedge ℎ (Fig. 1A), whose members are the nodes ∈ that satisfy both the node2hyperedge entanglement necessary (ENC) and sufficient (ESC) conditions, that we will describe below. Intuitively, the entangled hyperedge ℎ includes all the nodes that have a significant 'interaction at distance' with the group of nodes in a hyperedge from which they are disconnected both in the hypergraph HG and the graph G (Fig. 1).
The hyperedge entanglement should not be confused with the entanglement concept in quantum physics 22 , where the quantum state of each entangled particle of the group cannot be described independently of the state of the others. Indeed, in our scenario, the given hyperedge ℎ can be described independently of the entangled hyperedge ℎ , while the opposite is not true. In other words, in the quantum entanglement there is mutual (bidirectional) dependency between the particles, whereas in the here defined hyperedge entanglement there is unidirectional dependency of the entangled hyperedge ℎ on the given hyperedge ℎ .

node2hyperedge entanglement necessary condition (ENC)
Given a node and a hyperedge ℎ (Fig. 1), the ENC is satisfied and becomes a candidate node for entanglement if and only if: In other words, the ENC guarantees that the candidate node is disconnected with the other nodes in the hyperedge ℎ , in both the hypergraph (by not being a member of the hyperedge) and the graph (by not having links to the nodes of the hyperedge). Note that the condition (1) is a necessary and sufficient condition to become a candidate node for entanglement, whereas being a candidate node is a necessary condition for the entanglement, because we can entangle only a node and a hyperedge which are disconnected both in HG and G.

node2hyperedge entanglement sufficient condition (ESC)
Given ( Fig. 1) a hyperedge ℎ , a candidate node (which is defined as candidate because it satisfies the ENC for ℎ ) and a graph G (which can include not only topological information but any meta-information such as the nodes coordinates in a geometrical space), the ESC depends from the type of entanglement operator EO (which estimates the entanglement of with ℎ ) and the statistic S defined on the EO (which estimates the significance of the entanglement). Formally, we can write that the ESC is satisfied and becomes a member of the entangled hyperedge ℎ if and only if: This means that if the statistic is lower than or equal to a certain significance threshold , then the ESC is satisfied. Without lack of generality, in our case we define the ESC according to an entanglement operator that is based on topological information in the graph between the node and the nodes of the hyperedge ℎ . In particular, we develop an algorithm that combines a topological entanglement operator TEO (based on link prediction in the graph ) and a statistical test (based on a null model), in order to compute a node2hyperedge entanglement pvalue. The p-value assesses the extent to which the candidate node is significantly entangled to the hyperedge ℎ . If the p-value is significant (according to a significance threshold ), then the ESC is satisfied, and the node becomes a member of the entangled hyperedge ℎ : Obviously, the significance threshold value determines the number of nodes that satisfy the ESC, the lower the the smaller the size of the ℎ , hence the significance threshold can be interpreted also as a tuning parameter that allows to restrict the selection of nodes entangled with a hyperedge ℎ to a required number.
We stress that the design of the EO is not confined to the mere adoption of topological information. For instance, we define also a second interesting class of entanglement operators that are based on network geometry (geometrical entanglement operator: GEO), and that estimate the entanglement of a node with the hyperedge ℎ on the basis of their distance in the geometrical space in which a network lies or is embedded 12 : Theoretically, also hybrid operators TGEO that integrate topological and geometrical information can be designed. However, here we will focus on TEO and we will leave to a future study the chance to explore GEO and TGEO. The reason to prioritize TEO investigation is that the literature offers already solid evidences that topological link prediction is currently performing better than geometrical-embedding link prediction 4 .

Global node2hyperedge entanglement sufficient condition in case of more than two layers
Although the examples discussed in this study will consider only two layers, here for completeness we will discuss also the general case of a high-order multilayer network composed of n graph layers and m hypergraph layers: We can define a node2hyperedge global entanglement sufficient condition (GESC) when a node respect the ENC necessary condition (1) for ℎ (where t indicates the hypergraph layer ) in all possible n graphs, and becomes a global candidate node .
Then, the GESC for the global candidate node across the high-order multilayer network is: In brief, the mean statistic ̅ computed across all the n graphs should be lower than or equal to a significance threshold in order to confirm that the global candidate node has a global significant entanglement with ℎ . Note that when n = 1 and m > 1 this formula is equivalent to ESC (2), although it is defined for each ℎ .

Algorithm for node2hyperedge entanglement (N2HE) p-value
In this section and Fig. 2, we describe the node2hyperedge entanglement (N2HE) algorithm: -given in input a hyperedge ℎ , a candidate node , a graph , a link-prediction-based topological entanglement operator ( ) and a significance threshold ; -N2HE computes and offers in output a node2hyperedge entanglement p-value, to assess if the ESC is satisfied.
The link predictor ( ) is an operator that, exploiting topological information of the graph , can associate a similarity score to any disconnected node pair, suggesting the likelihood for a link between them to exist. Hence, here we define a topological entanglement operator that estimates an ensemble 'interaction at distance' between and ℎ nodes by means of link prediction. Although in principle any link prediction algorithm can be adopted, without lack of generality we are going to focus on local approaches, which are able to perform a likelihood prediction of a missing link between two nodes, by exploiting only local information that is related to their neighborhood 4 . The rationale for this choice is that local approaches allow the prediction just for a selected subset of missing links (such as the ones that could connect a node to a community). This can save a significant amount of computational time for application on real networks, where we might be interested to predict exclusively the entanglement of one node with one hyperedge. In contrast, many highly-performing global approaches such as Structural Perturbation Method 23 , since they are based on global operations, compute all missing links likelihoods at one time, requiring much more time. Hence, local methods allow to design efficient N2HE predictors that adapt their computational time to the query of the user, while preserving performance. Local approaches are therefore the proper candidates to perform computational demanding tests such as the ones we will present in this study.
One of the simplest and well-known local-based link prediction methods is the common neighbors (CN) index, which states that the higher the number of CN between two nodes, the higher the likelihood to be connected 24 . Similar indexes have been developed introducing a normalization factor to the CN rule, such as in the Resource-Allocation (RA) 25 , Jaccard (J) and Adamic-Adar (AA) 26 indexes. In 2013 Cannistraci et al. 4 introduced the local-community paradigm (LCP) theory, suggesting that local-based topological link prediction should complement the information content related with the common neighbours nodes using also the topological knowledge emerging from the cross-interactions between them. Initially detected in brain-network self-organization topology 4 and then extended to any monopartite and bipartite 27 complex network, the LCP theory derives from a purely topological inspired interpretation of a local-learning-rule of neuronal networks named Hebbian learning rule.
Based on the LCP theory, Cannistraci et al. 4 designed several local, parameter-free and modelbased deterministic rules for topological link prediction in both monopartite and bipartite networks, which in a later study have been also re-interpreted as network automata that participate to learn and form new structures on a network by growing according to some local mechanisms of self-organization known as Cannistraci-Hebb (CH) models 28 . While all the previously mentioned indexes for link prediction on monopartite networks are based on paths of length two (L2), a recent and brilliant study by Kovács et al. 29 proposed an algorithm based on paths of length three (L3) for prediction on protein interactomes. A subsequent study on network automata for link prediction of Muscoloni et al. 28  a. Fig. 2b,c: using the link predictor ( ), compute the similarity score , between node and every node ∈ that is not connected to in the graph , i.e. ( , ) ∉ . (Note: all the disconnected pairs are considered, because this will return useful in the implementation of the hyperedge entanglement predictor described in the next section that computes the entangled hyperedge of any hyperedge ℎ ).
b. Fig. 2d,e: compute the node2hyperedge entanglement score, as the average similarity score between candidate node and the nodes of the hyperedge ℎ : where avg is an average operator, such as mean, median or mode.
c. Fig. 2f: generate a set of random hyperedges ℎ 1 … ℎ with the same size as the hyperedge ℎ . Here, we set = 1000, but in relation to available computational resources this number can be also increased. The members of each hyperedge are sampled uniformly at random among all the nodes except the node , its neighbors in the graph and the members of the hyperedge ℎ . More formally, they are sampled from the set: Fig. 2f: for each random hyperedge ℎ , compute the node2hyperedge entanglement score with node , resulting in a null-distribution of node2hyperedge entanglement scores 1 … . e. Fig. 2f: compute an empirical p-value representing the node2hyperedge entanglement p-value between the node and the hyperedge ℎ : where is the number of node2hyperedge entanglement scores from the null-distribution 1 … that are greater than or equal to the observed one .
In this study we set a significance threshold of = 0.05, therefore if ≤ 0.05 then the candidate node and the hyperedge ℎ satisfy the ESC, hence node becomes a member of the entangled hyperedge ℎ .

Hyperedge entanglement predictor (HEP)
In this section and Fig. 2, we describe the algorithm of the hyperedge entanglement predictor (HEP): -given in input a graph ( , ), a hypergraph ( , ) and an algorithm for node2hyperedge entanglement N2HE (such as the one based on link-prediction and proposed in the previous section); -HEP is an algorithm that finds the entangled hyperedge ℎ associated to every hyperedge ℎ ∈ .
More in details, for each pair ( ∈ , ℎ ∈ ), where is a node ∈ that satisfies the necessary condition ENC (1) for ℎ ∈ , the HEP computes the node2hyperedge entanglement p-value , assessing whether the sufficient condition ESC (2) for ∈ ℎ is also satisfied.
Then, for each hyperedge ℎ ∈ , the set of nodes ∈ that satisfy both ENC and ESC are the members of the entangled hyperedge ℎ . This could be also simplified as: for each hyperedge ℎ ∈ , the candidate nodes ∈ that satisfy the ESC are the members of the entangled hyperedge ℎ . Within the algorithmic pipeline of the HEP, and therefore of the N2HE algorithm for the node2hyperedge entanglement p-value (Fig. 2), we have considered several variants, which are summarized below:  is computed between a certain node and several hyperedges, we consider as an option a Benjamini-Hochberg correction (C) to adjust for multiple hypothesis testing, whereas the alternative option is to not perform any correction.
Note that when the node2hyperedge entanglement p-value is computed between a certain node and several hyperedges, the computation of the similarity scores at step (a) of the N2HE algorithm only needs to be performed once. In addition, we clarify that in this study we set a significance threshold of 0.05 for the p-values, but the user of the HEP algorithm is free to choose it. If the members of the entangled hyperedge are too many with a significance threshold of 0.05, the downstream analysis can be focused on the most important ones by setting a stricter threshold.
The computational time complexity of the HEP algorithm is the maximum between ( 2 ) and the complexity of the chosen link predictor. Considering the link predictors adopted in this study, in case of sparse networks, the complexity is ( 2 ). Please, refer to Suppl. Information for a detailed analysis. The MATLAB implementation of the HEP algorithm is available at: https://github.com/biomedical-cybernetics/hyperedge_entanglement.

node2hyperedge entanglement in real networks
In order to test the algorithmic variants on real data, we collected 5 real networks for which metadata representing the nodes membership to a certain group are available. Nodes belonging to the same group share a common feature and can be considered as a hyperedge. Having a network and a set of hyperedges (a hypergraph) on the same nodes of the network, we fall within our working framework of a high-order multilayer network. Three out of five analysed networks are social and have non-overlapping hyperedges, representing the membership of the nodes to a social community. The remaining two networks are biological and have overlapping hyperedges, representing annotations related to biological metadata. This ensures that we can test our algorithm on different types of complex networked systems. Please refer to the Methods section for a detailed description of the networks and to Suppl. Table 1 for a summary of their topological properties.
The basic idea of our evaluation framework is to remove a test node which is member of a hyperedge ℎ -as well as the direct network connections between this test node and the other ℎ hyperedge members -and to test whether, once removed from ℎ , appears in the entangled hyperedge ℎ and not in other entangled hyperedges. According to these concepts, we build a positive and negative set of node-hyperedge pairs. The positive set is made by all the node-hyperedge pairs such that each test node has at least one link (that is removed) to the other members of the hyperedge and at least one link to nodes that are not members. For each node-hyperedge pair in the positive set we define a related negative pair, made by the same test node and by another hyperedge whose members have the highest average shortest path to the node (considering only pairs that satisfy the ENC). All the possible pairs ( ∈ ℎ , ℎ ) in a hypergraph are tested and a measure of performance based on the correct positive and negative entanglements predicted by HEP is applied. More in details, once obtained predictions for all the pairs in the positive and negative set, we evaluate the performance using the Matthews correlation coefficient (MCC) because it is a measure that accounts for effect size and unbalance in data 31 . The precise algorithm implemented for the evaluation procedure is provided in the Method section.
During the implementation of this evaluation framework, for each node-hyperedge pair evaluated, after a test node is removed from a hyperedge, the network connectivity changes, therefore the computation of the link predictor similarity scores in the N2HE algorithm needs to be performed again regardless of the fact that the same node has been already evaluated for another hyperedge. For this reason, local link predictors are specifically useful in this scenario, since they allow to efficiently compute the similarity scores only for those specific disconnected node pairs that are needed, saving a significant amount of computational time.
On the contrary, global link predictors would require to compute at each round of the evaluation the similarity scores of all the missing links in the network, making this evaluation unfeasible on middle and large networks.  Cannistraci 32 , discussing the importance of minimizing external links in PPI networks). We confirm that also in our tests L3-methods overcome L2-methods as reported in previous literature by Kovács et al. 29 . The E. Coli metabolic network (Fig. 3e) seems instead to represent an exception with respect to the other trends. Indeed, RA-L2, CH3-L2 and CH2-L2 obtain the best performances, followed by the other L2-based and then L3-based predictors. The molecular network of the Escherichia coli K-12 MG1655 strain accounts for many different reactions where other pairs of metabolites help to take place (exchange of a proton or a phosphate moiety, for example), playing similar role to ATP or ADP. Therefore, the cooccurrence of these currency metabolites (ATP, ADP, water, and so on) in many reactions leads to high clustering, and therefore a better fit with the L2-based models.
The last panel (Fig. 3f) reports the mean and standard error of the MCC over the 5 real networks considered. It highlights that the L3-based methods represent overall the most robust choice, followed by the subset of L2-based methods that explicitly minimize the external links in the CH model (CH2-L2, CH3-L2 and RA-L2), and then by the other L2-based predictors.
Altogether, from the results one might notice that even for social networks, typically associated to a L2-based structural organization, the L3-based methods obtain overall higher performance.
In these social datasets the hyperedges represent community-related metadata, groups of nodes that tend to have more connections between themselves than with the rest of the network.
Therefore, the majority of the L2 paths are made by links between the members of the hyperedge. In this evaluation framework, for each node-hyperedge pair in the positive set, the network links between the node and the other members of the hyperedge are removed, therefore most of existing L2 paths are eliminated. This is likely the reason why L2-based methods are outperformed by L3-based methods in HEP on social networks.

Latent geometry plays a role in node2hyperedge entanglement
A point of weakness of the evaluation framework for real networks is that we have to build a positive set by leave-one-out node dismantling of hyperedges and modifying the graph connectivity. In addition, the negative set is based on a node-hyperedge distance that is approximated using topological shortest paths. In this section we want to integrate the previous results by introducing a second (and radically different) evaluation framework which exploits a random geometric graph generative model for realistic artificial complex networks. In this context, 'realistic' means that we can control several topological features that are typical of real networks, therefore we can investigate how HEP changes in relation to fundamental modifications of the network topology. Indeed, having the latent geometry that is behind the observable topology allows to build the positive and negative set based on node-hyperedge geometrical distances, leaving unperturbed (by any dismantling) the graph and hypergraph structure on which we apply the HEP.
In particular, in the computational evaluations of this section we adopt the nonuniform popularity-similarity-optimization (nPSO) model 7,8 , a generative model recently introduced in order to grow random geometric graphs in the hyperbolic space, reproducing networks that have realistic features such as high clustering, small-worldness, scale-freeness and richclubness 12,33 , with the additional possibility to control the community organization (for instance by setting a predefined number of communities, their internal distribution and their mixing). In this framework, the hyperedges are associations between the nodes that are members of the same community. The nodes of the network have coordinates in the hyperbolic disk, and the lower the hyperbolic distance between two nodes the higher the likelihood to be connected.
Based on this knowledge we build a positive and negative set of node-hyperedge pairs. After computing all the pairwise hyperbolic distances between the nodes, for each node-hyperedge pair that satisfies the ENC we assign a node-hyperedge distance equal to the minimum hyperbolic distance between the node and the members of the hyperedge. The positive set is composed by the 5% of pairs with the lowest node-hyperedge distance, while the negative set by the 5% of pairs with the highest distance. This is a very conservative definition of a positive and negative set, because only the pairs that are significantly different from the central part of the distribution are tested, making sure that the evaluation is reliable.
For each node-hyperedge pair ( , ℎ ) in the positive and negative set, we run the algorithm for node2hyperedge entanglement p-value . By considering a significance threshold of 0.05 we obtain a prediction on whether the node becomes a member of the entangled hyperedge ℎ . After obtaining predictions for all the pairs in the positive and negative set, we evaluate the performance using the Matthews correlation coefficient (MCC). Note that if the option of Benjamini-Hochberg correction is considered, then for each pair ( , ℎ ) the p-value should also be computed between the node and every other hyperedge for which the ENC is satisfied.
In order to test the overall performance of the methods, we generated synthetic networks with the nPSO model 7,8 ranging over several parameter combinations. In particular, we fixed the size of the networks to N = 1000 nodes, we tuned the exponent of the power-law degree distribution to γ = [2,3], half of the average node degree to m = [4,8], the temperature (inversely related to the clustering) to T = [0.1, 0.3, 0.5, 0.7] and the number of communities (in this case hyperedges) to C = [10,20]. For each parameter combination, we generated 10 network realizations. We selected these specific values for the parameters because they are close to the ones of real networks and therefore this allows a type of evaluation that is approximating as much as possible the real scenario. On the other hand, the performance has a considerable boost going from γ = 2 to γ = 3 for L3 link predictors. The reason might be that with a decrease in power-lawness (from γ = 2 to γ = 3) -and therefore a decrease of hubs geometrical centrality in the hyperbolic disk -the geometrical distances between the hyperedges increase, and this helps to boost the performance in predicting to which entangled hyperedges belongs a candidate node.

Hyperedge entanglement in network medicine: a case study on COPD
The two different computational evaluations provided in the previous sections were crucial to assess the behaviour of the proposed HEP on data with gold standard and ground-truth hyperedge metadata. In this section we aim to evaluate the impact of hyperedge entanglement theory and the associated HEP on a real case scenario in complex systems biomedicine, where the metadata are incomplete to the point that nodes in the same hyperedge are not topologically connected between them. This is a challenging problem. Most algorithms currently available in network science for community detection or network generative modelling assume that a social group or a disease module should display high dyadic modularity for which nodes inside the hyperedge should have higher connectivity between them than with the rest of the network.
For HEP instead, as we will investigate in this section, this might not be a problem and does not represent an obstacle to provide meaningful predictions.
Scientific evidence suggests that proteins of genes associated with complex diseases tend to connect in protein-protein interaction (PPI) networks, participating in the same biological pathways 21,34 , and this is also reflected in the proteome 35 and the diseasome 21 . We remind that the diseasome consists of a bipartite network where diseases on one layer connects to associated genes on the other layer 21 . Bipartite networks allow interactions only across the two layers, this means that a disease imposes a hyperedge on its associated genes, which (by definition) do not interact between each other in the gene layer. Hence, we can translate and integrate this information in a high-order multilayer network model: one layer contains the diseasome-derived hypergraph (Fig. 5a, where each hyperedge links together a cohort of genes associated to a disease), another layer contains the PPI network (Fig. 5b).
Previous analysis of the diseasome showed that genes that contribute to common disorders: (i) show an increased tendency for their proteins to interact with each other via protein-protein interactions 36 ; (ii) tend to be co-expressed in general or in specific tissues 36 ; (iii) tend to share Gene Ontology (GO) terms 36,37 . These biological evidences are at support of a diseasome-based high-order multilayer network model such as the one we propose here. Indeed, pathways in PPI networks are often composed of group of interconnected proteins responsible for specific biological functions 21 , and a disease represents a variation-induced perturbation of a specific PPI disease hyperedge that might contain different biological pathways, producing pathophysiological abnormalities.
Given any disease hyperedge and any candidate protein (any protein node that satisfies the ENC, see Results section 2.1.2) in a PPI network, HEP assesses the extent to which that candidate protein (and its gene) is significantly entangled to that disease. Hence, the entangled hyperedge (of a disease hyperedge) can provide insights on proteins whose genes might be significantly associated to a certain disease and its biological pathways. This has relevance because, despite major efforts in high-throughput mapping, the missing human PPIs exceed the experimentally documented interactions 38,39 . Our ability to predict previously undetected associations between parts of the diseasome using network science tools offers the possibility to gain insights about the mechanism underlying a complex disease 40 .
To illustrate the benefits of such a modelling approach, we apply HEP algorithm to predict the genes that are members of the entangled hyperedge of COPD (Fig. 5a). COPD is a chronic inflammatory lung disease generally associated with cigarette smoking (with a smaller number of cases due to factors such as air pollution and mere genetics) and leading to obstructed airflow in bronchi. Although genetic studies have identified several risk loci for COPD [41][42][43] , the mechanisms and molecular interactions that rule its pathophysiology need substantial investigation. We based our analysis on a PPI network which is a reference in systems biology and it is compiled from 15 different sources (16,418 nodes and 235,566 edges) 36 . We and genome-wide association study (GWAS) databases 46,47 . Given the PPI network and the 300 disease hyperedges, we apply the HEP algorithm, which provides a p-value for each association between candidate genes (genes that in the PPI network are candidate nodes for entanglement to a given hyperedge) and a disease hyperedge.
At the end of this procedure, given the p-values for a certain gene and all the disease hyperedges of which it is candidate, a Benjamini-Hochberg adjustment is performed in order to correct for multiple hypothesis testing. Hence, in this context, the other 299 disease hyperedges are used only for multiple hypothesis testing adjustment and to obtain a robust estimation of the genehyperedge that is significantly entangled to COPD disease hyperedge, which is the focus of this case study.
In Fig. 5b, we display the COPD disease hyperedge (29 genes on the right side) and the respective entangled hyperedge (32 genes on the left side) predicted by at least one of the HEP algorithmic variants. Their pathway enrichment analysis using DAVID 48  In Fig. 5b, MFAP5 is the only gene exclusively involved with elastic fibres pathway and for which a significant entanglement to COPD has been predicted by HEP algorithm. An earlier study suggested that MFAP5 might be implicated in the extracellular matrix pathway with COPD proteins 45 . Moreover, MFAP5 has been previously shown to be differentially expressed in severe emphysema and bronchitis in lung tissue 52 , two hallmarks of COPD. Hence, we moved forward to perform wet lab experiments that could support and clarify the reason of this significantly predicted entanglement of MFAP5 with the COPD hyperedge.  showing that MFAP5 suppresses the expression of TGFB2, EGLN2, FBLN5, SFTPD, CHRNA5 and TUFM, and enhances the expression of MMP12 at mRNA level (Fig. 6c), supporting the validity of the entangled suggested by the HEP algorithmic variants. Fig. 6d provides a summary of the new information gained from the biological experiments discussed in this section, in comparison to knowledge already in literature. It is evident the disruptive impact of the HEP algorithm in the process to gain new knowledge.
Since previous studies reported a level of comorbidity between COPD and aneurysm 53,54 , we performed an additional analysis in order to spot a possible relation between the genes associated to aneurysm and the COPD entanglement subnetwork genes (Fig. 5b)

Discussion
The first important achievement of this study is that we have introduced a new concept and a novel challenge in network science, which take respectively the names of entangled hyperedge and of hyperedge entanglement prediction (HEP). The formal definition of this prediction problem is fundamental to investigate how elementary modelling at the minimal scale of dyadic node-node link formation can impact modelling of hyperedge formation at larger scale (higher order) in the network organization. In addition, the same prediction problem can be employed to address practical questions such as: how to forecast the possible relationship between a member of a social network and a consolidated social group; or how to spot in biomedicine an unknown association between a certain gene and a disease module.
The second important attainment is that we have proposed the first algorithm for hyperedge entanglement prediction which consists of many variants that allow to investigate the multifaceted local rules of self-organization and link formation of a complex network, and their L2 or L3 organization.
The third important accomplishment is that we have introduced two different evaluation frameworks for the prediction performance of HEP algorithms. The first evaluation scheme is on real networks, where we build a positive set by dismantling existing hyperedges and a negative set by exploiting topological information. The second evaluation scheme is on synthetic networks generated with the nPSO model 7,8 , where the positive and negative sets are constructed based on node-hyperedge geometrical distances.
In brief, the first important result that we achieve is that HEP is feasible and performs much better than random prediction. This is not trivial, because it represents a necessary and fundamental evidence at support of the fact that HEP is a well-posed problem in network science, for which algorithms can provide meaningful computational solutions.
The second important result is that local link prediction variants -adopted within the pipeline of the HEP algorithm -are enough (in the sense that local topological information is enough) to allow efficient (fast and meaningful) prediction. Furthermore, in line with some recent  Fig. 5b), predispose to aortic aneurysmal disease by affecting aortic stiffness and elasticity 56 .
A final important remark is that in our HEP framework, a social group or disease module do not need to be characterized by a high dyadic modularity for which nodes inside the hyperedge should have higher connectivity between them than with the rest of the network. In practice, we do not constrain the hyperedge to be a block module or a segregated community in the network. This is particularly important in predictions in presence of incomplete metadata such as the ones of the COPD disease module, whose genes have only one connection between them in the PPI network topology.
To conclude, there is also a last and remarkable result of HEP that is coming from its application in cardiovascular molecular biomedicine. HEP was recently used with success to identify SDC4, a heparan sulfate proteoglycan, as a potential target of PCSK9 that mediates

Resource Allocation (RA)
The formula of the RA-L2 model is 25 : where: u and v are the two seed nodes of the candidate interaction; is the intermediate node on the considered path of length two; is the respective node degree; and the summation is executed over all the paths of length two.
The formula of the RA-L3 model is 29 : where: u and v are the two seed nodes of the candidate interaction; 1 , 2 are the intermediate nodes on the considered path of length three; 1 , 2 are the respective node degrees; and the summation is executed over all the paths of length three.

Cannistraci-Hebb (CH)
The formulas of the CH1-L2, CH2-L2 and CH3-L2 models are 28 : where: u and v are the two seed nodes of the candidate interaction; is the intermediate node on the considered path of length two; is the respective node degree; is the respective internal node degree; is the respective external node degree; and the summation is executed over all the paths of length two. The asterisk on a degree variable indicates that a unitary term is added, in order to avoid the saturation of the value.
The formulas of the CH1-L3, CH2-L3 and CH3-L3 models are 28 : where: u and v are the two seed nodes of the candidate interaction; 1 , 2 are the intermediate nodes on the considered path of length three; 1 , 2 are the respective node degrees; 1 , 2 are the respective internal node degreed; 1 , 2 are the respective external node degrees; and the summation is executed over all the paths of length three. The asterisk on a degree variable indicates that a unitary term is added, in order to avoid the saturation of the value.
Note that the CH3 model, based only on the minimization of the external node degrees, has been introduced in this study.

Cannistraci-Alanis-Ravasi (CAR)
The formula of the CAR model is 4 : where: u and v are the two seed nodes of the candidate interaction; ( , ) is the number of common neighbours between them and ( , ) is the number of local community links (links between the common neighbours).

Cannistraci-Adamic-Adar (CAA)
The formula of the CAA model is 4 : where: u and v are the two seed nodes of the candidate interaction; is the intermediate node on the considered path of length two; is the respective node degree; is the respective internal node degree; and the summation is executed over all the paths of length two.

Cannistraci-Jaccard (CJC)
The formula of the CAA model is 4 :

Real networks dataset
The methods have been tested on three small real networks and on two large biological networks. For such networks metadata are available, representing the membership of the nodes to a certain group. Nodes belonging to the same group share a common feature. We consider such association between multiple nodes sharing a common feature as a hyperedge. The Some topological properties of the networks used in this study are summarized in Suppl. Table   1.

Algorithm for the evaluation procedure of HEP in real networks
In general, after removal from the hyperedge ℎ , the test node becomes a test candidate node for entanglement with ℎ . Then, we compute by HEP whether this test candidate node is predicted as a member of the ℎ entangled hyperedge. At the same time, we want to test that the test candidate node is not predicted by HEP as member of the entangled hyperedge ℎ of another hyperedge ℎ which by definition is considered a wrong assignment because topologically very far from the test candidate node .
In details, the algorithm for the evaluation procedure of HEP in real networks is the following.
For each node-hyperedge pair ( ∈ ℎ , ℎ ) in hypergraph: a. If the node has at least one link to the other members of the hyperedge ℎ and at least one link to nodes that are not members, we remove the node from ℎ , as well as the network links between and the other members of ℎ . After the removal satisfies the ENC and therefore it becomes a candidate node for entanglement . This generates a pair ( , ℎ ) to test for entanglement significance by the ESC.
b. We run the algorithm for node2hyperedge entanglement p-value between the node and the hyperedge ℎ , which represents a test on the positive set (here a correct prediction is if the p-value will be significant). Meanwhile for we define the 'negative' hyperedge ℎ and we run the algorithm for node2hyperedge entanglement p-value between the node and the hyperedge ℎ , which represents a test on the negative set (here a correct prediction is if the p-value will be not significant). We obtain the p-values and .
c. By considering a significance threshold of 0.05 for the p-values and , we obtain a prediction on whether ( , ℎ ) and ( , ℎ ) also satisfy the ESC, indicating if the node becomes a member of the entangled hyperedges ℎ and ℎ respectively.
Once obtained predictions for all the pairs in the positive and negative set, we evaluate the performance using the Matthews correlation coefficient (MCC). Note that if the option of Benjamini-Hochberg correction is considered, then in step (b) the p-value should be computed between the node and every hyperedge for which ENC is satisfied.

Generation of synthetic networks using the nPSO model
The Standard Western blotting procedure was done. The denatured samples were run in lanes and separated by SDS polyacrylamide electrophoresis. The protein lanes were (1) "input," which refers to protein lysate before IP, and (2) "IP" which is the immunoprecipitated proteins only (Fig. 6b). In some gels/blots (not shown), we added a lane that contained pure recombinant human MFAP5 protein as a reference band to confirm the MFAP5 band in the input and IP Detection of TGFB2 or ELN bands from the same blot where MFAP5 was detected, demonstrates that these proteins were co-immunoprecipitated with MFAP5.

Pre-Amplification
For the pre-Amplification of cDNA samples, it was performed with the Perfecta PreAmp SuperMix (5X) kit (Quantabio). It was made an assay primers pool using 2.0 µl of each primer (see Suppl. Table 8 water were mixed to make 20 µl of the pre-amplification reaction mixture for TaqMan assays. Each sample was, then, incubated in a thermal cycle, following these steps: initial denaturation (95°C, 2 minutes), PreAmpl cycling (14 cycles of 95°C, 10 seconds; 60°C, 3 minutes) and hold (4°C). Finally, the concurrent samples were used for qPCR.

qPCR
For qPCR analysis, it was used a 384 well plate with 2 l of PreAmp cDNA and 8 l of primer cocktail per well. The primer cocktail was made with 5 l of Taqman Fast Universal PCR Master Mix (2X), 2.75 l of nuclease-free water (QuantaBio), and 0.25 l of respective primer (see Suppl. Table 8). PCR was done using 79020HT Fast Real-Time PCR Systems, and the data were analyzed using Prism GraphPad 8. Statistical significance was determined after testing between Control and MFAP5 silenced sample conditions by student's t-test, two-tailed, and unpaired.

Hypergeometric test
Consider to randomly draw n objects, without replacement, from a finite population of size N that contains exactly K objects with a desired feature. The hypergeometric distribution is a discrete probability distribution that describes the probability of randomly drawing k objects with that feature (successes). The probability mass function of a random variable X that follows the hypergeometric distribution is: The function is positive when max(0, + − ) ≤ ≤ min ( , ).
In the hypergeometric test for over-representation of the desired feature in a sample, the pvalue is computed using the hypergeometric distribution as the probability of randomly drawing k or more successes. If in your sample you observe * objects with the desired feature, the p-value of the hypergeometric test for over-representation is given by: In our simulations related to the Results section 2.4, for each of the two biological pathways considered (platelet degranulation, molecules associated with elastic fibres), we performed the hypergeometric test for over-representation of genes associated to aneurysm in the COPDpathway subnetwork. Therefore, we have defined the parameters of the hypergeometric distribution as follows: N is equal to the number of nodes in the PPI network; K is equal to the number of nodes associated to aneurysm, meaning that they are either members of the aneurysm hyperedge or they are neighbors of its members in the PPI; n is the number of nodes in the COPD-pathway subnetwork, meaning that they are in the pathway considered and they are either in the COPD hyperedge, in its entangled hyperedge or in a L2/L3 path between them (see Fig. 5b); k is equal to the number of nodes in the COPD-pathway subnetwork and associated to aneurysm.
The actual values of the parameters and the resulting p-values are the following: • platelet degranulation: N=16418, K=918, n=39, k=12, p=8.35e-07; • molecules associated with elastic fibres: N=16418, K=918, n=23, k=10, p=1.67e-07. The figure represents the hyperedge entanglement framework in high-order multilayer network proposed in this study. a. The first (top) high-order layer is a hypergraph ( , ), where is the set of vertices (or nodes) and is the set of hyperedges. b. The second (bottom) layer is a graph ( , ), where is the same set of vertices as the hypergraph and is the set of edges (or links). Orange nodes are the members of a hyperedge and the information is projected on the network. Black nodes represent candidate nodes for entanglement (nodes that satisfy the ENC with the hyperedge). White nodes are not candidate nodes (because they do not satisfy the ENC), they are intermediate nodes in the network paths which enforce the entanglement between black and orange nodes. The HEP algorithm, which exploits link prediction on the network, predict (HE-prediction arrow) which ones of the black nodes are significantly entangled with the orange nodes (Hyperedge) and therefore are members of the entangled hyperedge (E-Hyperedge).

Figure 2. Flow chart of HEP algorithm.
The figure is a flow-chart representation of the HEP algorithmic steps which include also the N2EH algorithm (b-f panels). a. high-order multilayer network in input to the algorithm; b. choice of the link prediction option; c. computation of the node-node similarities between disconnected node pairs; d. choice of the average operator option for entanglement; e. computation of the node2hyperedge entanglement score; f. statistical test comparing the observed value of the node2hyperedge entanglement score with a null distribution, resulting in an empirical p-value; g. choice of the p-value correction option (if required); h. as output of the algorithm, a p-value for every candidate-node (a node that satisfies the ENC) to hyperedge pair, indicating if the pair also satisfy the ESC. For more details please refer to the Results section 2.1.   Table 7 for the complete overview of all the algorithmic variants. For visual clarity, the standard error of the MCC is not shown, but it is reported in Suppl. Fig. 1.

Figure 5. Prediction of entanglement hyperedge of COPD disease hyperedge a.
Visual representation of the HEP algorithm performed on the high-order multilayer graph, having the COPD disease hyperedge at the upper layer and the PPI network at the bottom layer. b. On the left, the 32 genes (triangle symbol) of the entangled hyperedge (predicted information). On the centre, the 47 intermediate genes (circle symbol) that belong to paths of length two (L2) or length three (L3) and that are included in at least one of the two enriched pathways: platelet degranulation (red colour) and molecules associated with elastic fibres (green colour). On the right, the 29 genes (square symbol) of the COPD hyperedge (known information). Grey nodes are COPD genes or entangled genes that are not part of the pathways, but they are still reported for completeness. Solid lines are for interactions between the nodes in the PPI network, dashed lines are for the interactions experimentally validated in this study. Colours of nodes and links highlight their association to the two pathways, as indicated in the legend. Grey links are between nodes that are not associated to the same pathway. Note that the link likelihoods are computed independently of each other and therefore the implementation can be easily parallelized in order to speed up the running time.

-L3-based link predictors
The L3-based algorithms for topological link prediction, analogously to the L2-based, consist of a main loop exploring all the non-observed links, and evaluating the likelihood of one link at each iteration.
In this case, for a given evaluation of the candidate link between nodes i and j, the dominant operation is to find the L3 paths between i and j. In order to do this, for each neighbour l of i, we need to check if each neighbour of l is connected to j. The time complexity of this computation is dependent on the degree of i multiplied by the degree of l, therefore on average it is dependent on the squared average node degree of the network k: Given that we perform This suggests that running the algorithm over a sparse network achieves complexity ( 2 ), which increases to ( 4 ) for intermediate values of network density.

-HEP
For each node-hyperedge pair that satisfies the ENC, the HEP computes the node2hyperedge entanglement p-value (for details see the Results sections 2.1.5 and 2.1.6). Given nodes and hyperedges, the node-hyperedges pairs to evaluate are ( * ).
As preliminary step, we can compute the similarity scores for all the pairs of disconnected nodes in the network, with time complexity dependent on the link predictor, as discussed in the previous sections. Therefore such operation does not have to be repeated for each nodehyperedge pair. Given a node-hyperedge pair and the node-node similarities already computed, the algorithm for the node2hyperedge entanglement p-value requires the following operations (see section 2.1.5): • Computation of the node2hyperedge entanglement score, using one of the three average operators: mean, median, mode. Mean and median can be executed with time complexity linear to the number of samples to average. The mode option, as described in section 2.1.6, is based on a kernel smoothing function estimate evaluated at 100 points, which can also be implemented with linear time complexity 63 . The number of samples to average is equal to the average size of the hyperedge, that we can here refer with . The time complexity of this step is ( ).
• Generation of random hyperedges. In order to sample the members of one random hyperedge the time complexity is equal to the average hyperedge size, and this procedure is repeated times. The time complexity of this step is ( * ).
• Computation of the null-distribution of node2hyperedge entanglement scores. This is equivalent to computing times a node2hyperedge entanglement score, therefore the time complexity is ( * ).
• Computation of the empirical p-value. This operation requires comparisons, therefore the time complexity is ( ).
The dominant operation among the four listed is ( * ). A reasonable assumption is that the average hyperedge size is approximately equal to = .
Since the four operations above are repeated ( * ) times, we obtain: ( * * * ) = ( 2 * ) Note that in our simulations the number of repetitions is fixed to 1000, therefore it can be considered as a constant factor and indicate the time complexity of the statistical test as ( 2 ).
At last, for each node, the Benjamini-Hochberg correction can be performed as an option to adjust for multiple hypothesis testing over the hyperedges, with an overall cost of ( * ). Since generally ≫ , we can consider such complexity lower than or equal to ( 2 ).
In summary, the time complexity of the HEP algorithm is the maximum between ( 2 ) and the complexity of the link predictor. Considering the link predictors adopted in this study, in case of sparse networks, the complexity is ( 2 ).

Figure S1. MCC evaluation of node2hyperedge entanglement in synthetic networks.
The figure reports the same results as Fig. 4, but the performance related to the two values of each nPSO parameter is shown in separate barplots. In addition, the standard error of the MCC is also reported.