On the distributional characterization of graph models of water distribution networks in Wasserstein spaces

: This paper is focused on two topics very relevant in water distribution networks (WDNs): vulnerability assessment and the optimal placement of water quality sensors. The main novelty element of this paper is to represent the data of the problem, in this case all objects in a graph underlying a water distribution network, as discrete probability distributions. For vulnerability (and the related issue of resilience) the metrics from network theory, widely studied and largely adopted in the water research community, reflect connectivity expressed as closeness centrality or, betweenness centrality based on the average values of shortest paths between all pairs of nodes. Also network efficiency and the related vulnerability measures are related to average of inverse distances. In this paper we propose a different approach based on the discrete probability distribution, for each node, of the node-to-node distances. For the optimal sensor placement, the elements to be represented as discrete probability distributions are sub-graphs given by the locations of water quality sensors. The objective functions, detection time and its variance as a proxy of risk, are accordingly represented as a discrete e probability distribution over contamination events. This problem is usually dealt with by EA algorithm. We’ll show that a probabilistic distance, specifically the Wasserstein (WST) distance, can naturally allow an effective formulation of genetic operators. Usually, each node is associated to a scalar real number, in the optimal sensor placement considered in the literature, average detection time, but in many applications, node labels are more naturally expressed as histograms or probability distributions: the water demand at each node is naturally seen as a histogram over the 24 hours cycle. The main aim of this paper is twofold: first to show how different problems in WDNs can take advantage of the representational flexibility inherent in WST spaces. Second how this flexibility translates into computational procedures.


Introduction
In this paper we address 2 problems: the assessment of network vulnerability for which we introduce a new index which naturally captures the impact of the removal of nodes/edges of the network and the optimal placement of water quality sensors for the early detection of contamination events. The main contribution of this paper is the proposal of a unifying mathematical framework in which the graph underlying the WDN, and its elements are represented as discrete probability distributions. Thanks to this representation the elements of both problems can be seen as points in a space endowed with distance between distributions.
Let's first consider the vulnerability assessment. A first set of measures, widely used in the water research community, is derived from the topology of the network as captured by the analysis of the distance matrix whose entries are the shortest paths between all pair of nodes. The analysis of this matrix yields centrality measures like closeness centrality, betweenness centrality expressed as average values of the node to node distances. Also network efficiency and the related vulnerability measures are related to average of inverse distances. These measures are widely studied and largely adopted in the water research community. Other metrics originate from the analysis of the adjacency matrix and its Laplacian. The spectral analysis of these matrices gives thru the spectral radius and the smallest nonzero eigenvalue respectively another a widely used set of connectivity measures. This paper stems from the observation that there is a lot more information hidden in these matrices than is captured by the average values used to compute them. In this paper we represent the data of the problem, in this case all objects in the graph underlying a water distribution network, as discrete probability distributions and measure their similarity thru a distributional distance. The distance between probability distributions is a central topic in statistical learning and more generally in Machine learning: there are many distances, as Kullback-Leibler and Jensen-Shannon. In this paper we focus on the Wasserstein distance, and we embed input data as probability distributions in a Wasserstein space. To turn this observation into a tool for network analysis able to capture the difference between nodes and net-works through their distances we need a sound theoretical and computational framework. A main contribution of this paper is to propose one such framework to characterize its mathematical features and the computational solutions and to show its representational flexibility and computational efficiency on two different paradigmatical problems in the analysis of WDNs. The same mathematical structure is generalized to the multi-objective optimal sensor problem. The model set-up assumes that the introduction of a contaminant can be associated to any node of the network. Each contamination event can be considered as a scenario in which the objective functions are evaluated. In order to get a distributional representation, instead of taking the average over scenarios, we consider, for a sensor placement the sample of detection times over all scenarios and the associated discrete probability distributions. The distance between two sensor placements can then be computed as the Wasserstein distance between the associate distributions.
The Wasserstein distance , first introduced in [1], has received its modern formulation in [2]. WST has a very rich mathematical structure whose complexity and flexibility are analyzed in a landmark volume [3]. A difficulty with WST is its computational cost which has hampered its diffusion outside the computer vision community. Recently a number of specialized computational approaches have drastically reduced the computational hurdles [4].

Related Works
The literature on vulnerability and resilience in WDNs is extremely large. The analysis of water distribution systems using tools from complex networks theory has been introduced in [5] and extended in [6] to weighted and directed network models. Approaches base on graph theory have been the subject of several papers [7,8]. Other approaches have been focused on the general issue of resilience [9][10][11]. Giustolisi et al. [12] proposes a new statistical distribution of the node degree. Diao [13] proposes a global resilience analysis in order to assess the resilience with respect to pipe failures, excess demand and substance intrusion. Candelieri et al. [14] integrates network analysis and hydraulic simulation [15,16].
A new approach proposed in [17] uses a representation of the network as a discrete probability distribution over the domain of node-to-node distances. The difference in vulnerability between two networks is expressed thru the WST distance between distributions based on the Wasserstein distance. As shown in ANS this provides also a link criticality index [18].
The problem of optimal sensor placement has been widely analyzed at least in the last 2 decades. The seminal contributions are [19,20]. The problem has been largely modelled as a multi-objective optimization problem and addressed by evolutionary algorithms [21,22]. An early contribution for riskbased sensor placement for contaminant detection is [23]. Naserizade et al. [24] introduces a risk-based multi-objective model, which considers quantile analysis in the evaluation of the objective functions for optimal sensor placement. Candelieri et al. [25] proposes optimal sensor placement to minimize detection and its standard deviation. Zhang et al. [26] proposes a new metric in order to identify the relative importance of each sensor in maintaining the detection performance.
The problem of optimal sensor placement has been addressed in [27,28], proposing a Wasserstein based multi-objective evolutionary algorithm for the risk aware optimization of sensor placement.

Our Contributions
The main novelty in this paper is to represent the data of the problem as discrete probability distributions and measure their similarity thru a distributional distance more general and flexible than the Euclidean distance. In this paper we focus on the Wasserstein distance, and we embed input data as probability distributions in a Wasserstein space. This new theoretical and computational framework is suitable to two problems. For vulnerability assessment this distributional representation is shown to capture effectively the topological in-formation embedded in network related matrices and yield more meaningful metrics than centrality and vulnerability measures based on average values. The computational outcomes show that the Wasserstein distance probabilistic distance measures have a good capacity to measure the dissimilarity between different networks not only at network but also at edge level. In this sense it can be said that they generalize the information given by clustering methods.
In this framework not only, we can compute the difference in vulnerability between any two networks but also the contribution of each component to the increase of vulnerability induced by the removal of that component. For the other target problem, optimal sensor placement, the contribution of the paper is the proposal of a new multi-objective evolutionary optimization algorithm: the WST distance enables a new genetic operator which synthetizes the distance between two placements. The performance of the resulting MOES/WST algorithm is computationally shown to be significantly superior than the standard NSGA-II. Both in terms of hypervolume and of cover-age of the approximate Pareto set.

Organization of the paper
Sect. 2 is devoted to the probabilistic representation of the problem's vulnerability and optimal sensor placement. as discrete probability distribution. Sect. 3 contains the basic notions of the Wasserstein distance and a summary of the computational issue connected with its evaluation. Sect. 4 proposes the new vulnerability index based on the Wasserstein distance. Sect. 5 introduces the problem of sensor placement and explains how the "fitness" of a sensor placement can be captured by a histogram. Sect. 6 introduces the archiving structure for the output of the hydraulic simulation. Sect. 7 gives the new selection and cross over operators based on the Wasserstein distance, the structure of the resulting multi-objective evolutionary algorithm and the computational results about optimal sensor placement. Sect. 8 contains the conclusions and perspectives.

Probabilistic Representation of Graphs
In general terms a histogram is a function that counts the elements in a sample of n observations of a random variable that fall into each of the disjoint categories (known as bins). Therefore, if k is the number of bins, the histogram satisfies the condition: To construct a histogram, the first step is to "bin" the range of valuesthat is, divide the support of the random variable into a number of intervals -and then compute the "weight" of the bin counting how many sampled values fall into each interval. The bins are usually specified as adjacent, consecutive, non-overlapping intervals and are usually of the same size.
In this section a new analysis is performed in terms of node-node discrete distance distributions where the weights are ( ) the percentage of nodes which are connected to at a distance with each node = 1, … , of the graph ( , ).
The distance distribution over the whole graph is given by the average over all nodes of the node-to-node distance distributions: Figure 1. The node-to-node distance distribution for node 92 (left) and node 296 (right) of the Neptun network.

The Wasserstein Distance
Measuring the distance between distributions can be accomplished by many alternative models including Kullback-Leibler (and its symmetrized version Jensen-Shannon), Hellinger, total variation and -square divergence. In this paper we focus on the Wasserstein (WST) distance whose basic notions are given in Sect. 3.1 while Sect. 3.2 is devoted to the computation of the barycenter between distributions and the extension of k-means clustering to the Wasserstein space. It is important to remark that the presentation is quite basic omitting any mathematical characterization of WST for which the reader is referred to [3,4].
The Wasserstein distance can be traced back to the works of Gaspard Monge [1] and Lev Kantorovich [2] and is based on the solution of an optimal transport problems. WST enables to synthetizes the comparison between two multi-dimensional distributions through a single metric using all information in the distributions.

Basic Notion
The WST distance between continuous probability distributions is: where ( ( ) , ( ) ) is also called ground distance (usually it is the Euclidean norm), Γ ( ) , ( ) is the set of joint distributions ( ( ) , ( ) ) whose marginals are respectively ( ) and ( ) , and > 1 is an index. In some cases, WST can be written in an explicit form. Let ( ) and ( ) be the cumulative distribution for one-dimensional distributions ( ) and ( ) on the real line and

The Wasserstein Distance for Discrete Distributions
Let denote a discrete distribution with support points with = 1, … , and their associated probabilities such that ∑ = 1 with ≥ 0 and ∈ for = 1, … , . Usually, = ℝ is the -dimensional Euclidean space with the norm and are called the support vectors.
can also be a symbolic set provided with a symbol-to-symbol similarity. can also be written using the notation: where (⋅) is the Kronecker delta.
The WST distance between two distributions ( ) = ( ) , ( The cost of transport between ( ) and ( ) , ( ) , ( ) , is defined by the -th power of the norm ( ) , ( ) (usually the Euclidean distance). We define two index sets = {1, … , } and likewise, such that: Equations 8 and 9 represent the in-flow and out-flow constraint, respectively. The terms are called matching weights between support points ( ) and ( ) or the optimal coupling for ( ) and ( ) . The discrete version of the WST distance is usually called Earth Mover Distance (EMD). For instance, when measuring the distance between grey scale images, the histogram weights are given by the pixel values and the coordinates by the pixel positions. Another way to look at the computation of the EMD is as a network flow problem. In the specific case of histograms, the entries denote how much of the bin has to be moved to bin .
In the case of one-dimensional histograms, the computation of WST can be performed by a simple sorting and the application of the following equation.

A Toy Example
Two graphs and ′ are considered with their distributions ( ) and ( ) that will be referred to as and ′. In order to give an instance of the computation of the node-to-node distributions, a small synthetic water distribution network, Anytown (Figure 2), is considered in [29]. The associated graph consists of 25 nodes and 44 edges. ′ is the graph without the red edge.

Data and software resources
Anytown is a small synthetic water distribution network [29] whose associated graph has 25 nodes and 44 edges. Neptun is the WDN of the Romanian city of Timisoara, with an associated graph of 333 nodes and 339 edges [30].
The multi-objective evolutionary algorithm used for the optimal sensor placement is based on the Python framework Pymoo [31].
The Water Network Tool for Resilience (WNTR) [32] is a Python package based on EPANET 2.0.

Resilience
In this section we show how the Wasserstein distance enables a new index of vulnerability which can be applied both network wise and as a criticality index of each network component.

Wasserstein distance as a measure of network vulnerability
This remarkable result is displayed in Figures 4 and 5 for the Neptun network. Given the graph = ( , ) associated to the network, each edge is represented by a pair of adjacent nodes ( , ). The removal of ( , ) yields \{(i, j)} for which we compute the aggregate node-node distance distribution ( \{( , )}) and the Wasserstein distance ( ( ), ( \{( , )}), whose value is represented by the the color associated to each edge ( , ) by the Wasserstein distance. Usual measures of network vulnerability are given by the loss of efficiency generated removing one or more network elements (nodes and graphs). The distributional approach taken here is different: we consider the original network and that obtained removing components, and we consider the WST distance between the two network as an index of vulnerability. It is important to note that breakages must occur, at the same time, on all the different red edges to imply a hydraulic disconnection. Breakages affecting only one pipe may imply a reduction in the efficiency of the network and an increase in vulnerability. It is important to remark that the edge Wasserstein criticality index agrees, with the result of the spectral clustering in Figure 4 (right).
The computational results show that probabilistic distance measures have a good capacity to discriminate between different networks not only at a network level but also for different edges. This assessment of link criticality could assess important tasks WDN management by just using topological and geometric information. The Wasserstein metric can be regarded as a natural extension of the Euclidean distance to statistical distributions via a single metric while still exploiting all the information present in the distributions.

Problem Formulation
We consider a graph = ( , ) We assume a set of possible locations for placing p sensors, that is ⊆ . Thus, a sensor placement (SP) is a subset of sensor locations, with the subset's size less or equal to depending on the available budget. An SP is represented by a binary vector ∈ {0,1} | | whose components are = 1 if a sensor is located at node , = 0 otherwise. Thus, an SP is given by the nonzero components of . For a WDN the vertices in represent junctions, tanks, reservoirs or consumption points, and edges in represent pipes, pumps, and valves. Let ⊆ denote the set of contamination events ∈ which must be detected by a sensor placement , and is the detection time of a sensor placed in node I for a contamination event in node a. A probability distribution is placed over possible contamination events associated to the nodes. In the computations we assume -as usual in the literature -a uniform distribution, but in general discrete distributions are also possible. In this paper we consider as objective functions the detection time and its standard deviation. We consider a general model of sensor placement: In our study we assume that α = 1 |A| ⁄ , therefore f (s) is: where ̂ = ∑ ,..,| | is the MDT of event . is the average over the contamination events of the detection time for each event. For each event and sensor placement the Minimum Detection Time is defined as = min : with ̂ the minimum time step at which concentration reaches or exceeds a given threshold for the event . As a measure of risk, we consider as the standard deviation of the sample average approximation of .

Representation of Sensor Placements as Histograms
If bins are the time subintervals of the simulation horizon and the weights are the number of events detected in each time subinterval (or their relative frequency) the histogram obtained is a representation of the information acquired in the hydraulic simulation. Denote the time steps in the simulation Δ = − where = 1, … , are equidistanced in the simulation time horizon (0, = 86400). = Δ , Δ = 1, = 24. We consider the discrete random variable | | where = { ∈ : ̂ ∈ Δ }. = | | cardinality is the number of contamination events, the bins are Δ and = | |. To each sensor placement we can associate not only the placement matrix ( ) but also the histogram ℎ ( ) whose bins are Δ and weights are | |. We have added an "extra" bin (86400 to 90000) whose weight | | represents number of contamination events which were undetected during the simulation up to 86400. In this way ∑ | | = 1. The "ideal" placement is that in which | |=|A|.Intuitively a "good" sensor placement has a relatively large mass in lower ), the larger the probability mass in the higher the worse is sensor placement (Figure 5 left). The worst SPs are those in the extra bin.

WNTR
The Water Network Tool for Resilience (WNTR) [32] is a Python package based on EPANET 2.0. The simulation is computationally costly as we need one execution for each contamination event. Figure 6. A schematic representation of the Net1 synthetic example (left). Concentrations of the contaminant introduced at node 9 for node 1, 2, 8, 9, 10, 11 (right).
The hydraulic simulation has been performed for 24 hours, in steps of 1 hour. Assuming = and = (i.e., the most computationally demanding problem configuration) the time required to run a simulation for the synthetic example called Net1 (available with EPANET and WNTR, and whose associated graph is depicted in Figure 1) is 2 seconds. The detection threshold is 10% of the initial concentration.

Single Sensor Placement and Placement Matrix
Denote with ∈ ℝ ( )×| | the so-called "sensor matrix", with = 1, … , | | an index identifying the location where the sensor is deployed at. Each entry of ( ) , represents the concentration of the contaminant for the event ∈ at the simulation step = 0, . . , , with = ∆ . Thus, in our study = 24, ∆ = 1 and = 24. Without loss of generality, we assume that the contaminant is injected at the beginning of the simulation (i.e., = 0). A "sensor placement matrix", ( ) ∈ ℝ ( )×| | can be also defined whose entry ℎ is the maximum concentration over those detected by the sensors in , for the event a and at time step . Suppose to have a sensor placement consisting of sensors with associated sensor matrices , … , , then ( ) is the matrix with entries ℎ = max ,…, ∀ ∈ . The columns of ( ) having maximum concentration at row = 0 (i.e., injection time) are those associated to events with injection occurring at the deployment locations of the sensors in . Indeed, we can now explicit the computation of ̂ in ( ) and ( ): ̂ is the minimum time step at which concentration reaches or exceeds a given threshold τ for the scenario a, that is ̂ = min ,…, {ℎ ≥ }. Each bin of the histogram ℎ ( ) represents the number of events that are detected in a specific time range by . These values can be extracted from the placement matrix ( ) Indeed, each column of this matrix represents an event, and the detection time of this event is given by the row in which the contaminant concentration exceed a given threshold. A metric in the information space has been introduced in Sect. 3 using histograms and the Wasserstein distance.

The Algorithm MOEA/WST
MOEA/WST is instantiated in the above framework according to the following steps: 1. The individuals of the population, that are sensor placements, in the evolutionary algorithm are represented as discrete probability distributions, namely histograms. 2. The space of histograms is endowed with a metric given by the WST distance between them. 3. The results of WST based computations are mapped back into the search space.
This section is focused on analyzing how the mathematics presented in the previous sections enables the construction of two new genetic operators.

The Selection Operator
For the crossover operation, a problem specific selection method has been developed. First, we randomly sample from the actual Pareto set two pairs of individuals ( , ) and ( , ). Then we choose the pair ( , ) as the parents of the new offspring, where = arg max ∈{ , } ( , ). This favors exploration and diversification. In this paper we used for ( , ) the Wasserstein distance between the histograms corresponding to the sensor placement and . Figure 9. Visualization of the selection mechanism.
If at least one individual of the pair of parents is not feasible (i.e., the placement contains more sensors than the budget ) the Constraint Violation ( ) is considered instead. Let's = [ ] be a generic individual and the budget, the Constraint Violation is defined as follow Then we choose the pair of parents ( , ) with = arg min ∈{ , } ( ) + ( ) .

The Crossover Operator
Denote with , ∈ Ω two feasible parents and with (FatherPool) and ′ (MotherPool) the two associated sets = { : = 1} and ′ = { : ′ = 1}. To obtain two feasible children, and ' are initialized as [0, … ,0]. In turn, and ' samples an index from and from ′, respectively, without replacement. Therefore, the new operator rules out children with more than non-zero components.
In Figure 10, an example comparing the behaviour of our crossover compared to a typical 1-point crossover. This problem specific crossover generates two "feasible-by-design" children from two feasible parents. Figure 11 displays the comparative computational results on the Netpun network of MOEA/WST and the benchmark NSGA-II. It is apparent that MOEA/WST reaches a large value of the dominated hypervolume and therefore a much better approximation of the Pareto set way before NSGA-II. Figure 11. Neptun computational results.

Conclusion and Perspectives
The key objective of this work is to show that a distributional framework is suitable to solve two key problems in the design of WDNs: resilience assessment and optimal sensor placement. In particular the Wasserstein distance and the associated analytical methods offer significant advantages over comparing distributions using a set of parametric values such as mean, variance and higher order moments. Indeed, the analysis of these parameters only does not take the whole distribution into account. Even if the roots of WST are in abstract spaces of probability distributions, WST and the associated optimal transport map offer a visually intuitive representation of the similarity between distributions.
The approach proposed is extremely flexible and can be generalized to different failure modes (pipe failures, excess demand and substance intrusion) and different probability functions to account for spatially varied contamination probabilities. A novel data structure has been also proposed for archiving the results of the simulation. This distributional approach has been shown to enable more effective genetic operators, able to capture the specific structure of the optimal sensor placement problem and give a much better computational performance than standard evolutionary algorithms. This distributional approach could be extended to other problems in the design and management of WDNs in the wider context of multitask learning (Ponti, A., (2021 d). This new approach can be naturally extended to different domains of networked infrastructure, as transport and energy and also informational networks for problems as fake-news detection and collaborative filtering in recommender system.