The Quality of the Covariance Selection Through Detection Problem and AUC Bounds

We consider the problem of quantifying the quality of a model selection problem for a graphical model by formulating it as a detection problem. Model selection problems usually minimize a distance between the original distribution and the model distribution. For the special case of Gaussian distributions, the model selection problem simplifies to the covariance selection problem, discussed by Dempster [2], where the likelihood criterion is maximized or, equivalently, the Kullback-Leibler (KL) divergence is minimized to compute the model covariance matrix. While this solution is optimal for Gaussian distributions in the sense of the KL divergence, it is not optimal when compared with other information divergences and criteria such as the area under the curve (AUC). In this paper, we analytically compute upper and lower bounds for the AUC and discuss the quality of the model selection problem using the AUC and its bounds as an accuracy measure in the detection problem. We define the correlation approximation matrix (CAM) and show that the analytical computation of the KL divergence, the AUC and its bounds depends only on the eigenvalues of the CAM. We also show the relationship between the AUC, the KL divergence and the ROC curve by optimizing with respect to the ROC curve. In the examples provided, we pick tree structures as the simplest graphical models. We perform simulations on fully-connected graphs and compute the tree structured models by applying the widely used Chow-Liu algorithm [3]. Examples show that the quality of tree approximation models is generally poor, as measured by information divergences, the AUC and its bounds, when the number of nodes in the graphical model is large. We show both analytically and by simulations that 1 − AUC for the tree approximation model decays exponentially as the dimension of the graphical model increases.


I. INTRODUCTION
Graphical models are useful tools for describing the geometric structure of networks in numerous applications such as energy, social, sensor, neuronal, and transportation networks [4] that deal with high dimensional data.
Learning from these high dimensional data requires large computation power, which is not always available. Hardware limitations in different applications force us to compromise between the accuracy of the learning algorithm and its time complexity by using the best possible approximation algorithm and imposing structure instead. In other words, the main concern is to trade off model complexity against accuracy by choosing a simpler, yet informative model. Many approximation algorithms have been proposed for model selection to impose structure given data. For Gaussian distributions, the covariance selection problem is presented and studied in [2] and [5]. (This paper was presented in part, for the special case of tree approximation, at the 2016 Information Theory and Applications Workshop [1].)
The ultimate purpose of the covariance selection problem is to reduce the computation complexity in various applications. One of the special approximation models is the tree approximation model. Tree approximation algorithms are among the algorithms that reduce the number of computations to get quicker approximate solutions to a variety of problems. If a tree model is used, then distributed estimation algorithms such as message passing algorithm [6] and the belief propagation algorithm [7] can easily be applied and are guaranteed to converge to the maximum likelihood solution.
There are algorithms in the literature such as the Chow-Liu minimum spanning tree (MST) [3], the first order Markov chain approximation [8] and penalized likelihood methods such as LASSO [9] and graphical LASSO [10] that can be used to approximate the correlation matrix and the inverse correlation matrix with a sparser graph representation while retaining good accuracy. For Gaussian distributions, the Chow-Liu MST algorithm finds the optimal tree structure under a Kullback-Leibler (KL) divergence cost function [2]. It constructs a weighted graph by computing the pairwise mutual information values and then applies an MST algorithm such as the Kruskal algorithm [11] or the Prim algorithm [12]. The first order Markov chain approximation uses a regret cost function and a greedy algorithm to output a first order Markov chain structured graph [8]. Penalized likelihood methods use an L1-norm penalty term to sparsify the graph representation by eliminating some of the edges.
Sparse modeling has many applications in distributed signal processing and machine learning over graphs. One of its applications is in the smart grid. The smart grid is a promising solution that delivers reliable energy to consumers through the power grid when there are uncertainties such as distributed renewable energy generation sources. Smart grid technologies such as smart meters and communication links are added to the power grid in order to obtain high dimensional, real-time data and information ("Big Data") and to overcome uncertainties and unforeseen faults.
The future grid will incorporate distributed renewable energy generation such as solar PVs, with these sources being highly correlated. Thus, modeling is an essential part for signal processing and implementation of the smart grid.
We discuss the quality of the model selection problem, focusing on the Gaussian case, i.e. the covariance selection problem. We ask the following important question: "is the covariance approximation of the covariance matrix for the Gaussian model a good approximation?" To answer this question, we need to pick a closeness criterion that is coherent, general enough to handle a wide variety of problems, and asymptotically justified [13]. In many applications the Kullback-Leibler (KL) divergence has been proposed as a closeness criterion between the original distribution and its model approximation distribution [2], [3]. Other closeness measures and divergences are also used for the model selection problem. One example is the use of the reverse KL divergence as the closeness criterion in variational methods to learn the desired approximation structure [14].
In this paper we bring a different perspective to the model approximation problem by formulating a general detection problem. The detection problem leads to the calculation of the log-likelihood ratio test (LLRT) statistic, the receiver operating characteristic (ROC) curve, the KL divergence and the reverse KL divergence, as well as the area under the curve (AUC), where the AUC is used as the accuracy measure for the detection problem. The detection problem formulation gives us a broader view and different ways of determining whether a particular model is a good approximation or not. For Gaussian data, the LLRT statistic simplifies to an indefinite quadratic form. A key quantity that we define is the correlation approximation matrix (CAM), the product of the original correlation matrix and the inverse of the model approximation correlation matrix. For Gaussian data this matrix contains all the information needed to compute the information divergences, the ROC curve and the area under this curve, i.e. the AUC. We also show the relationship between the CAM, the AUC, the Jeffreys divergence [15], the KL divergence and the reverse KL divergence. We present an analytical expression for the AUC of a given CAM that can be efficiently evaluated numerically. We show the relation between the AUC, the KL divergence, the LLRT statistic and the ROC curve. We also present analytical upper and lower bounds for the AUC which depend only on the eigenvalues of the CAM. Throughout the discussion section, we pick the tree approximation model as a well-known subset of all graphical models. Tree approximations are considered since they are widely used in the literature and since inference and estimation are much simpler on trees than on graphs that have cycles or loops. We perform simulations over synthetic and real data for several examples to explore and discuss our results. Simulation results indicate that 1 − AUC decreases exponentially as the graph dimension increases, which is consistent with the analytical results obtained from the AUC upper and lower bounds.
The rest of this paper is organized as follows. In section II we give a general framework for the detection problem and the corresponding sufficient test statistic, the log-likelihood ratio test. Moreover, the sufficient test statistic for Gaussian data as well as its distribution under both hypotheses are also presented in this section. The ROC curve and the AUC definition as well as analytical expression for the AUC are given in section III. Section IV provides analytical lower and upper bounds for the AUC as function of the CAM. Moreover, Section V presents the tree approximation model and provides some simulations over synthetic examples as well as real solar data examples and investigates quality of tree approximation based on the numerically evaluated AUC and also its analytical upper and lower bounds. Finally, Section VI summarizes results of this paper.

II. DETECTION PROBLEM FRAMEWORK
In this section, we present a framework to quantify the quality of a model selection problem. More specifically, we formulate a detection problem to distinguish between the covariance matrix of a multivariate normal distribution and an approximation of the aforementioned covariance matrix based on the given model.

A. Model selection problem
We want to approximate a multivariate distribution by the product of lower order component distributions [16].
Let the random vector $X \in \mathbb{R}^n$ have a zero-mean distribution with parameter $\Theta$, i.e. $X \sim f_X(x)$. Moreover, the random vector $X$ has the graph representation $G = (V, E)$, where $V$ and $E$ are the set of all vertices and the set of all edges of the graph representing $X$, respectively. We want to approximate the random vector $X$ with another random vector associated with the desired model. Let the model random vector $X_M \in \mathbb{R}^n$ have a zero-mean distribution with parameter $\Theta_M$ associated with the desired model, i.e. $X_M \sim f_{X_M}(x)$. Also, let $G_M = (V, E_M)$ be the graph representation of the model random vector $X_M$, where $V$ and $E_M \subseteq E$ are the set of all vertices and the set of all edges of the graph representing $X_M$, respectively.

B. General detection framework
The model selection problem is extensively studied in the literature [2]. In many state-of-the-art works, minimizing the KL divergence between two distributions or maximizing the likelihood criterion is proposed in order to quantify the quality of the model approximation. A different way to look at the problem of quantifying the quality of the model approximation algorithm is to formulate it as a detection problem [17]. Given the set of data in the detection problem, the goal is to distinguish between two hypotheses, the null hypothesis and the alternative hypothesis. To set up a detection problem, we define these two hypotheses for the model selection problem as follows:
- The null hypothesis, $H_0$: the data is generated using the known distribution,
- The alternative hypothesis, $H_1$: the data is generated using the model approximation distribution.
Given the null hypothesis and the alternative hypothesis, we need to define a test statistic to quantify the detection problem. By the Neyman-Pearson (NP) lemma [18], the likelihood ratio test is the most powerful test. We define the log-likelihood ratio test (LLRT) statistic as
$$l(x) = \log \frac{f_X(x|H_1)}{f_X(x|H_0)},$$
where $f_X(x|H_0)$ is the distribution of the random vector $X$ under the null hypothesis and $f_X(x|H_1)$ is the distribution of the random vector $X$ under the alternative hypothesis. Moreover, let the random variables $l_0(X)$ and $l_1(X)$ be the LLRT statistics under hypothesis $H_0$ and hypothesis $H_1$, respectively.
We then define the false-alarm probability and the detection probability by comparing the LLRT statistic under each hypothesis with a given threshold, $\tau$, and computing the following probabilities:
- The false-alarm probability, $P_0(\tau)$, under the null hypothesis $H_0$: $P_0(\tau) = \Pr(l_0(X) \ge \tau)$,
- The detection probability, $P_1(\tau)$, under the alternative hypothesis $H_1$: $P_1(\tau) = \Pr(l_1(X) \ge \tau)$.
The most powerful test is defined by setting the false-alarm rate $P_0(\tau) = \bar{P}_0$ and then computing the threshold value $\tau_0$ such that $\Pr(l_0(X) \ge \tau_0) = \bar{P}_0$.
Definition 1. The KL divergence between two multivariate continuous distributions $p(X)$ and $q(X)$ is defined as
$$D(p(X)||q(X)) = \int_{\mathcal{X}} p(x) \log \frac{p(x)}{q(x)}\, dx,$$
where $\mathcal{X}$ is the feasible set.
Throughout this paper we may also refer to the KL divergence between two random variables, or between two covariance matrices in the zero-mean Gaussian case, to denote the KL divergence between the corresponding distributions.
Proposition 1. The expectation of the LLRT statistic under each hypothesis is
$$E(l_0(X)) = -D(f_X(x|H_0)||f_X(x|H_1)), \qquad E(l_1(X)) = D(f_X(x|H_1)||f_X(x|H_0)),$$
where $l_0(X)$ and $l_1(X)$ are the LLRT statistics under $H_0$ and $H_1$, respectively.
Proof: The proof follows directly from the definition of the KL divergence.
Remark: Relationship between the NP lemma and the KL divergence is previously stated in [19] with the similar straightforward calculation, where the LLRT statistic loses power when the wrong distribution is used instead of the true distribution for one of these hypotheses.
In a regular detection problem framework, the NP decision rule is to accept the hypothesis $H_1$ if the LLRT statistic, $l(x)$, exceeds a critical value, and to reject it otherwise. The critical value is set based on the rejection probability of the hypothesis $H_0$, i.e. the false-alarm probability. Note that we pursue a different goal in the approximation problem scenario. We seek a model distribution, $f_{X_M}(x)$, that is as close as possible to the given distribution, $f_X(x)$. The closeness criterion is based on the modified detection problem framework, where we compute the LLRT statistic and compare it with a threshold. In the ideal case where there is no approximation error, the detection probability must be equal to the false-alarm probability for the optimal detector at all possible thresholds, i.e. the receiver operating characteristic (ROC) curve [20], which represents the best detectors for all threshold values, should be a line of slope 1 passing through the origin.
In the sequel, we assume that the random vector $X$ has a zero-mean Gaussian distribution. Thus, the covariance matrix of the random vector $X$ is the parameter of interest in the model selection problem, i.e. covariance selection.

C. Multivariate Gaussian distribution
Let the random vector $X \in \mathbb{R}^n$ have a zero-mean jointly Gaussian distribution with covariance matrix $\Sigma_X$, i.e. $X \sim \mathcal{N}(0, \Sigma_X)$, where the covariance matrix is positive definite, $\Sigma_X \succ 0$. In this paper, the null hypothesis, $H_0$, is the hypothesis that the parameter of interest is known and is equal to $\Sigma_X$, while the alternative hypothesis, $H_1$, is the hypothesis that the random vector $X$ is replaced by the model random vector $X_M$. In this scenario, the model random vector $X_M$ has a zero-mean jointly Gaussian distribution (the model approximation distribution) with covariance matrix $\Sigma_{X_M} \succ 0$, i.e. $X_M \sim \mathcal{N}(0, \Sigma_{X_M})$. Thus, the LLRT statistic for the jointly Gaussian random vectors $X$ and $X_M$ simplifies to
$$l(x) = -c + k(x),$$
where $c = -\frac{1}{2}\log(|\Sigma_X \Sigma_{X_M}^{-1}|)$ is a constant and $k(x) = x^T K x$, where $K = \frac{1}{2}(\Sigma_X^{-1} - \Sigma_{X_M}^{-1})$ is an indefinite matrix with both positive and negative eigenvalues.
We define the correlation approximation matrix (CAM) associated with the covariance selection problem and dissimilarity parameters of the CAM as follows.
Definition 2. Correlation approximation matrix. The CAM for the covariance selection problem is defined as
$$\Delta \triangleq \Sigma_X \Sigma_{X_M}^{-1}.$$
Theorem 1. Covariance Selection [2]. Given a multivariate Gaussian distribution with covariance matrix $\Sigma_X \succ 0$ and a model $M$, there exists a unique approximated multivariate Gaussian distribution with covariance matrix $\Sigma_{X_M} \succ 0$ that minimizes the KL divergence $D(\Sigma_X||\Sigma_{X_M})$ and satisfies the following two covariance selection rules:
$$(\Sigma_{X_M})_{i,j} = (\Sigma_X)_{i,j} \ \text{ for } i = j \text{ and for all } (i,j) \in E_M, \qquad (\Sigma_{X_M}^{-1})_{i,j} = 0 \ \text{ for all } (i,j) \in E_M^c,$$
where the set $E_M^c = E - E_M$ represents the complement of the set $E_M$.
Proof: The proof for Gaussian distributions is given in Dempster's 1972 paper [2].
Remark: Using the CAM definition, the constant $c$ can be written as $c = -\frac{1}{2}\log(|\Delta|)$. Moreover, given any covariance matrix and its model covariance matrix satisfying Theorem 1, we have $\mathrm{tr}(\Delta) = n$. Thus, from Theorem 1 and the KL divergence for jointly Gaussian distributions, $D(\Sigma_X||\Sigma_{X_M}) = \frac{1}{2}\left(\mathrm{tr}(\Delta) - n - \log(|\Delta|)\right)$, we conclude $c = D(\Sigma_X||\Sigma_{X_M})$.
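The remark above can be checked numerically. The following is a minimal sketch, assuming a hypothetical 3-node correlation matrix and a chain model 1-2-3 whose missing entry is the product of the correlations along the path (this satisfies the covariance selection rules for a normalized correlation matrix). The names `sigma_x`, `sigma_m` and `gaussian_kl` are illustrative and do not come from the paper.

```python
import numpy as np

# Hypothetical 3-node example: correlation matrix of the full graph.
sigma_x = np.array([[1.0, 0.5, 0.3],
                    [0.5, 1.0, 0.4],
                    [0.3, 0.4, 1.0]])

# Chain model 1-2-3: entries on the kept edges are copied, and the
# missing (1,3) entry is the product of the correlations on the path.
sigma_m = sigma_x.copy()
sigma_m[0, 2] = sigma_m[2, 0] = sigma_x[0, 1] * sigma_x[1, 2]


def gaussian_kl(sigma_p, sigma_q):
    """KL divergence D(N(0, sigma_p) || N(0, sigma_q))."""
    n = sigma_p.shape[0]
    ratio = np.linalg.solve(sigma_q, sigma_p)   # sigma_q^{-1} sigma_p
    _, logdet = np.linalg.slogdet(ratio)
    return 0.5 * (np.trace(ratio) - n - logdet)


cam = sigma_x @ np.linalg.inv(sigma_m)          # Delta = Sigma_X Sigma_XM^{-1}
c = -0.5 * np.linalg.slogdet(cam)[1]            # c = -1/2 log|Delta|
kl = gaussian_kl(sigma_x, sigma_m)
precision_m = np.linalg.inv(sigma_m)

print("trace(CAM) =", np.trace(cam))            # expected to equal n = 3
print("c =", c, "KL =", kl)                     # expected to coincide
print("model precision entry (1,3) =", precision_m[0, 2])
```

In this sketch the trace of the CAM equals the dimension and the constant c coincides with the KL divergence, while the model precision matrix has a zero at the removed edge.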

D. Covariance selection example
Here we choose tree approximation model as an example. Figure 1 indicates two graphs: (a) the complete graph and (b) its tree approximation model. The correlation coefficient between each pair of nodes has been written on each edge. The correlation matrix for each of them is .
The CAM contains all information about the tree approximation. Here we consider cases in which the Gaussian random variables have finite, nonzero variances.
Remark: Without loss of generality, throughout this paper we are working with normalized correlation matrices, i.e. the diagonal elements of the correlation matrices are normalized to be equal to one.

E. Distribution of the LLRT statistic
The random vector $X$ has a Gaussian distribution under both hypotheses $H_0$ and $H_1$. Thus, under both hypotheses, the real random variable $k(X) = X^T K X$ has a generalized chi-squared distribution, i.e. $k(X)$ is equal to a weighted sum of chi-squared random variables with both positive and negative weights. Let us define $W = \Sigma_X^{-1/2} X$ under $H_0$ and $Z = \Sigma_{X_M}^{-1/2} X$ under $H_1$, where $\Sigma_X^{1/2}$ and $\Sigma_{X_M}^{1/2}$ are the square roots of the covariance matrices $\Sigma_X$ and $\Sigma_{X_M}$, respectively. Then the random vectors $W \sim \mathcal{N}(0, I)$ and $Z \sim \mathcal{N}(0, I)$ have zero-mean Gaussian distributions with the same covariance matrix $I$, the identity matrix of dimension $n$. Note that the CAM is a positive definite matrix with eigenvalues $\lambda_i > 0$, $1 \le i \le n$. Thus, up to an orthogonal rotation that leaves $W$ and $Z$ standard normal, the random variable $k(X)$ under hypotheses $H_0$ and $H_1$ can be written as
$$k_0(X) = \frac{1}{2}\sum_{i=1}^{n} (1 - \lambda_i) W_i^2 \quad \text{and} \quad k_1(X) = \frac{1}{2}\sum_{i=1}^{n} (\lambda_i^{-1} - 1) Z_i^2,$$
respectively, where the random variables $W_i$ and $Z_i$ are the $i$-th elements of the random vectors $W$ and $Z$, respectively.
Moreover, the random variables $W_i^2$ and $Z_i^2$ follow the first order central chi-squared distribution. Note that $l_0(X) = -c + k_0(X)$ and $l_1(X) = -c + k_1(X)$.
Remark: As a simple consequence of the covariance selection theorem, the sum of the weights of the generalized chi-squared random variable, i.e. the expectation of $k(X)$, is zero under the hypothesis $H_0$, $E(k_0(X)) = \frac{1}{2}\sum_{i=1}^{n}(1 - \lambda_i) = 0$ [2], and this sum is nonnegative under the hypothesis $H_1$, $E(k_1(X)) = \frac{1}{2}\sum_{i=1}^{n}(\lambda_i^{-1} - 1) \ge 0$.
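The two sums in this remark can be evaluated directly from the eigenvalues of the CAM. The sketch below assumes the weights $\frac{1}{2}(1-\lambda_i)$ under $H_0$ and $\frac{1}{2}(\lambda_i^{-1}-1)$ under $H_1$, and reuses a hypothetical 3-node chain example. The helper `cam_eigenvalues` is illustrative, not part of the paper. It obtains the eigenvalues from a symmetric matrix that is similar to the CAM, so a symmetric eigensolver can be used.

```python
import numpy as np

# Hypothetical 3-node correlation matrix and its chain (1-2-3) model.
sigma_x = np.array([[1.0, 0.5, 0.3],
                    [0.5, 1.0, 0.4],
                    [0.3, 0.4, 1.0]])
sigma_m = sigma_x.copy()
sigma_m[0, 2] = sigma_m[2, 0] = sigma_x[0, 1] * sigma_x[1, 2]


def cam_eigenvalues(sigma_x, sigma_m):
    """Eigenvalues of Sigma_X Sigma_XM^{-1} via the similar symmetric matrix
    L^{-1} Sigma_X L^{-T}, where Sigma_XM = L L^T (Cholesky factor)."""
    chol = np.linalg.cholesky(sigma_m)
    half = np.linalg.solve(chol, sigma_x)        # L^{-1} Sigma_X
    sym = np.linalg.solve(chol, half.T)          # L^{-1} Sigma_X L^{-T}
    sym = 0.5 * (sym + sym.T)                    # remove round-off asymmetry
    return np.linalg.eigvalsh(sym)


lam = cam_eigenvalues(sigma_x, sigma_m)
weights_h0 = 0.5 * (1.0 - lam)                   # weights of k_0(X)
weights_h1 = 0.5 * (1.0 / lam - 1.0)             # weights of k_1(X)

# Sum of both KL directions (the log-determinant terms cancel).
n = sigma_x.shape[0]
jeffreys = 0.5 * (np.trace(np.linalg.solve(sigma_m, sigma_x)) - n) \
    + 0.5 * (np.trace(np.linalg.solve(sigma_x, sigma_m)) - n)

print("eigenvalues of the CAM:", lam)
print("E[k_0] =", weights_h0.sum())              # expected to be zero
print("E[k_1] =", weights_h1.sum(), "Jeffreys =", jeffreys)
```

As a side observation in this example, the expectation of $k_1(X)$ coincides with the sum of the two KL divergences, which follows from $E(l_1(X)) = -c + E(k_1(X))$ and Proposition 1.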

III. THE ROC CURVE AND THE AUC COMPUTATION
A. The receiver operating characteristic curve
The receiver operating characteristic (ROC) curve is the parametric curve where the detection probability is plotted versus the false-alarm probability for all thresholds, i.e. each point on the ROC curve represents a pair $(P_0(\tau), P_1(\tau))$ for a given threshold $\tau$. Setting $z = P_0(\tau)$ and $\eta = P_1(\tau)$, the ROC curve is $\eta = h(z)$. If $P_0(\tau)$ has an inverse function, then the ROC curve is $h(z) = P_1(P_0^{-1}(z))$. In general, the ROC curve, $h(z)$, has the following properties [20]:
- $h(z)$ is concave,
- $h'(z)$ is positive and decreasing,
- $\int_0^1 h'(z)\, dz \le 1$.
Note that, for the ROC curve, the slope of the tangent line at a given threshold, $h'(z)$, gives the likelihood ratio for the value of the test.
Remark: For the ROC curve of our Gaussian random vectors, $h'(z)$ is positive, continuous and decreasing on the interval $[0, 1]$, with right continuity at 0 and left continuity at 1.
Lemma 2. Given the ROC curve, $h(z)$, we can compute the following KL divergences:
$$D(l_0(X)||l_1(X)) = -\int_0^1 \log(h'(z))\, dz,$$
$$D(l_1(X)||l_0(X)) = \int_0^1 h'(z)\log(h'(z))\, dz \overset{(*)}{=} -\int_0^1 \log\left(\frac{d h^{-1}(\eta)}{d\eta}\right) d\eta,$$
where (*) holds if the ROC curve, $\eta = h(z)$, has an inverse function.
Proof: These results are a consequence of the Radon-Nikodým theorem [21]. Simple, alternative calculus-based proofs are given in Appendix A.

B. Area under the curve
As discussed previously, we examine the ROC curve with the goal that the model approximation results in an ROC curve that is a line of slope 1 passing through the origin. This is in contrast to the conventional detection problem, where we want to distinguish between the two hypotheses and ideally have an ROC curve that is a unit step function. The area under the curve (AUC) is defined as the integral of the ROC curve (figure 2) and is a measure of accuracy in decision problems:
$$\mathrm{AUC} = \int_0^1 h(z)\, dz = \int_0^1 P_1(\tau)\, dP_0(\tau), \qquad (2)$$
where $\tau$ is the detection problem threshold.
Remark: The AUC is a measure of accuracy for the detection problem and $1/2 \le \mathrm{AUC} \le 1$. Note that, in conventional decision problems, the AUC is desired to be as close as possible to 1, while in the approximation problem presented here we want the AUC to be close to $1/2$.
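Definition 5. Let $f_0(t)$ and $f_1(t)$ be the probability density functions (PDFs) of the random variables $l_0(X)$ and $l_1(X)$, respectively.
where $f_1(t) \star f_0(t) \triangleq \int_{-\infty}^{\infty} f_1(\tau) f_0(t + \tau)\, d\tau$ is the cross-correlation between $f_1(t)$ and $f_0(t)$. Proof: A proof based on the definition of the AUC (2) is given in [1].
Let us define the difference LLRT statistic random variable as $\gamma(X) = l_1(X) - l_0(X)$. Then, we get $\mathrm{AUC} = \Pr(\gamma(X) > 0) = 1 - F_{\gamma(X)}(0)$, where $F_{\gamma(X)}(\gamma)$ is the cumulative distribution function (CDF) of the random variable $\gamma(X)$.
The two conditional random variables $l_0(X)$ and $l_1(X)$ are independent. Thus, the cross-correlation between the corresponding two distributions is the distribution of the difference LLRT statistic, $\gamma(X)$. We can write the random variable $\gamma(X)$ as $\gamma(X) = -c + k_1(X) - (-c + k_0(X)) = k_1(X) - k_0(X)$.
Replacing the definitions of $k_0(X)$ and $k_1(X)$, we have
$$\gamma(X) = \frac{1}{2}\sum_{i=1}^{n} (\lambda_i^{-1} - 1) Z_i^2 - \frac{1}{2}\sum_{i=1}^{n} (1 - \lambda_i) W_i^2.$$
We can rewrite the difference LLRT statistic, $\gamma(X)$, in an indefinite quadratic form as
$$\gamma(X) = \frac{1}{2} V^T (\Lambda - I) V,$$
where $V = [Z^T, W^T]^T \sim \mathcal{N}(0, I)$ has dimension $2n$ and $\Lambda = \mathrm{diag}(\lambda_1^{-1}, \ldots, \lambda_n^{-1}, \lambda_1, \ldots, \lambda_n)$.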
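Since the AUC equals $\Pr(\gamma(X) > 0)$, it can be estimated by simulation before any analytical machinery is used. The following minimal sketch assumes that $\gamma(X)$ is the weighted sum $\frac{1}{2}\sum_i(\lambda_i^{-1}-1)Z_i^2 - \frac{1}{2}\sum_i(1-\lambda_i)W_i^2$ with independent standard normal $Z_i$ and $W_i$. The function name `auc_monte_carlo` and the eigenvalue lists are hypothetical. For a single eigenvalue the event reduces to a Cauchy tail probability, which gives a closed-form reference value for this sketch only.

```python
import math
import random


def auc_monte_carlo(eigenvalues, num_samples=100_000, seed=0):
    """Estimate AUC = Pr(gamma > 0) by sampling the weighted chi-squared sum."""
    rng = random.Random(seed)
    positive = 0
    for _ in range(num_samples):
        gamma = 0.0
        for lam in eigenvalues:
            z = rng.gauss(0.0, 1.0)   # component under H1
            w = rng.gauss(0.0, 1.0)   # component under H0
            gamma += 0.5 * (1.0 / lam - 1.0) * z * z
            gamma -= 0.5 * (1.0 - lam) * w * w
        if gamma > 0.0:
            positive += 1
    return positive / num_samples


# Single eigenvalue 0.25: gamma > 0 iff |Z/W| > 0.5, and Z/W is standard Cauchy.
reference = 1.0 - (2.0 / math.pi) * math.atan(0.5)
estimate = auc_monte_carlo([0.25], num_samples=100_000, seed=1)
print("Monte Carlo:", estimate, "closed form:", reference)

# Hypothetical eigenvalues that sum to n = 3, as the trace condition requires.
print("three-node example:", auc_monte_carlo([0.6, 1.0, 1.4], 20_000, seed=2))
```

The single-eigenvalue case ignores the trace condition $\mathrm{tr}(\Delta) = n$ and is used only because it has a simple reference value.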

C. Generalized Asymmetric Laplace distribution
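The difference LLRT statistic random variable, $\gamma(X)$, follows the generalized asymmetric Laplace (GAL) distribution [23]. For a given $i \in \{1, \ldots, n\}$, we define the random variable $\gamma_i(X)$ as
$$\gamma_i(X) = \frac{1}{2}(\lambda_i^{-1} - 1) Z_i^2 - \frac{1}{2}(1 - \lambda_i) W_i^2. \qquad (4)$$
Then the difference LLRT statistic random variable can be written as $\gamma(X) = \sum_{i=1}^{n} \gamma_i(X)$. The PDF of each $\gamma_i(X)$, equation (5), is expressed in terms of $K_0(\cdot)$, the modified Bessel function of the second kind [24]. The moment generating function (MGF) for this distribution is
$$M_{\gamma_i(X)}(t) = \left(1 - \alpha_i t - \alpha_i t^2\right)^{-1/2}$$
for all $t$ that satisfy $1 - \alpha_i t - \alpha_i t^2 > 0$, where $\alpha_i = \lambda_i + \lambda_i^{-1} - 2$ is the dissimilarity parameter associated with the eigenvalue $\lambda_i$ of the CAM. From (4), the MGF derivation for the GAL distribution is straightforward: it is the product of two MGFs of the chi-squared distribution.
The distribution of the difference LLRT statistic random variable, $\gamma(X)$, is
$$f_{\gamma(X)}(\gamma) = \circledast_{i=1}^{n} f_{\gamma_i(X)}(\gamma),$$
where $\circledast_{i=1}^{n}$ is the notation we use for the convolution of $n$ functions. Note that, although the distribution of the random variables $\gamma_i(X)$ in (5) has a discontinuity at $\gamma = 0$, the distribution of the random variable $\gamma(X)$ is continuous if there are at least two distributions with non-zero parameters $\alpha_i$ in the aforementioned convolution. Moreover, the MGF for $f_{\gamma(X)}(\gamma)$ can be computed by multiplying the MGFs of the $\gamma_i(X)$ as
$$M_{\gamma(X)}(t) = \prod_{i=1}^{n} M_{\gamma_i(X)}(t) = \prod_{i=1}^{n} \left(1 - \alpha_i t - \alpha_i t^2\right)^{-1/2} \qquad (6)$$
for all $t$ in the intersection of the domains of the $M_{\gamma_i(X)}(t)$. The smallest such intersection is $-1 < t < 0$.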

D. Analytical expression for AUC
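To compute the CDF of the random variable $\gamma(X)$, we need to evaluate a multi-dimensional integral of jointly Gaussian distributions [25] or we need to approximate this CDF [26]. More efficiently, as discussed in [27] for the real valued case, the CDF of the random variable $\gamma(X)$ can be expressed as a single-dimensional integral of a complex function, in which a parameter $\beta > 0$ is chosen such that the matrix $I + \frac{\beta}{2}(\Lambda - I)$ is positive definite, which simplifies the evaluation of the multivariate Gaussian integral [27].
Special case: When $\Lambda = I$, i.e. the given covariance obeys the model structure, then
$$\mathrm{AUC} = 1 - F_{\gamma(X)}(0) = 1 - \frac{1}{2\pi}\int_{-\infty}^{\infty} \frac{1}{\beta + j\omega}\, d\omega = \frac{1}{2}.$$
Picking an appropriate value for the parameter $\beta$, the AUC can be numerically computed by evaluating the following one-dimensional complex integral:
$$\mathrm{AUC} = 1 - \frac{1}{2\pi}\int_{-\infty}^{\infty} \frac{1}{\beta + j\omega}\, \frac{1}{\sqrt{\left|I + \frac{\beta + j\omega}{2}(\Lambda - I)\right|}}\, d\omega.$$
Furthermore, since $\Lambda \succ 0$, choosing $\beta = 2$ and changing variable as $\nu = \omega/2$, we have
$$\mathrm{AUC} = 1 - \frac{1}{2\pi}\int_{-\infty}^{\infty} \frac{1}{1 + j\nu}\, \frac{1}{\sqrt{\left|\Lambda + j\nu(\Lambda - I)\right|}}\, d\nu.$$
Moreover, $|\Lambda + j\nu(\Lambda - I)| = \prod_{i=1}^{n} (1 + \alpha_i \nu^2 - j\alpha_i \nu)$. This equation shows that the AUC depends only on the $\alpha_i$.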
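A one-dimensional integral of this kind is simple to evaluate numerically. The sketch below is one possible implementation under stated assumptions, not the authors' code. It assumes that $F_{\gamma(X)}(0) = \frac{1}{2\pi}\int_{-\infty}^{\infty} \frac{1}{1+j\nu}\prod_i (1+\alpha_i\nu^2 - j\alpha_i\nu)^{-1/2}\, d\nu$ with $\alpha_i = \lambda_i + \lambda_i^{-1} - 2$, uses the conjugate symmetry of the integrand to integrate only the real part over $\nu \ge 0$, and applies the substitution $\nu = \tan\theta$ with a midpoint rule. The function name `auc_from_alphas` is hypothetical.

```python
import cmath
import math


def auc_from_alphas(alphas, num_points=20_000):
    """AUC = 1 - F_gamma(0), with F_gamma(0) computed by numerical inversion.

    The integrand is g(nu) = 1/(1 + j nu) * prod_i (1 + a_i nu^2 - j a_i nu)^(-1/2).
    Its real part is even in nu, so F_gamma(0) = (1/pi) * int_0^inf Re g(nu) d nu.
    """
    step = (math.pi / 2.0) / num_points
    total = 0.0
    for k in range(num_points):
        theta = (k + 0.5) * step                 # midpoint in (0, pi/2)
        nu = math.tan(theta)
        value = 1.0 / complex(1.0, nu)
        for alpha in alphas:
            # Real part of each factor is positive, so the principal
            # square root is continuous in nu.
            value /= cmath.sqrt(complex(1.0 + alpha * nu * nu, -alpha * nu))
        total += value.real / math.cos(theta) ** 2   # d nu = d theta / cos^2
    cdf_at_zero = total * step / math.pi
    return 1.0 - cdf_at_zero


# No approximation error (all alpha_i = 0): the AUC should be 1/2.
print(auc_from_alphas([0.0, 0.0]))

# Single eigenvalue 0.25, i.e. alpha = 0.25 + 4 - 2 = 2.25; a Cauchy tail
# argument gives 1 - (2/pi) * atan(0.5) as a reference for this case.
print(auc_from_alphas([2.25]), 1.0 - (2.0 / math.pi) * math.atan(0.5))
```

The choice of quadrature is a design convenience: mapping the half line to $(0, \pi/2)$ avoids truncating the integration range, and the integrand stays bounded at both ends of the interval.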

IV. ANALYTICAL BOUNDS FOR THE AUC
Having presented an analytical expression for the AUC in the previous section, we now present analytical lower and upper bounds for the AUC.

A. Lower bound for the AUC (Chernoff bound application)
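Given the MGF for the difference LLRT statistic distribution (6), we can apply the Chernoff bound [28] to find a lower bound for the AUC (or, equivalently, an upper bound for the CDF of the difference LLRT statistic random variable, $\gamma(X)$, evaluated at zero).

Proposition 2. A lower bound for the AUC is
$$\Pr(\gamma(X) > 0) \ge \max\left\{\frac{1}{2},\; 1 - \prod_{i=1}^{n} \frac{2}{\sqrt{4 + \alpha_i}}\right\}.$$
Proof: One-half is a trivial lower bound for the AUC. To achieve a non-trivial lower bound, we apply the Chernoff bound [28] as follows:
$$\Pr(\gamma(X) \le 0) \le \min_{-1 < t < 0} M_{\gamma(X)}(t).$$
To complete the proof we need to solve the right-hand-side (RHS) optimization problem.
Step 1: The first derivative of $M_{\gamma(X)}(t)$ is
$$\frac{d}{dt} M_{\gamma(X)}(t) = M_{\gamma(X)}(t) \sum_{i=1}^{n} \frac{\alpha_i (1 + 2t)}{2\left(1 - \alpha_i t - \alpha_i t^2\right)}.$$
Clearly, the first derivative is zero for $t = -1/2$, which is in the feasible domain of the MGF of the difference LLRT statistic. Note that the smallest feasible domain is $-1 < t < 0$.
Step 2: The second derivative of $M_{\gamma(X)}(t)$ at the stationary point is
$$\frac{d^2}{dt^2} M_{\gamma(X)}(t)\Big|_{t = -1/2} = M_{\gamma(X)}(-\tfrac{1}{2}) \sum_{i=1}^{n} \frac{4\alpha_i}{4 + \alpha_i}.$$
Therefore, we conclude that the second derivative is positive, and thus the optimal solution to the RHS optimization problem is at $t = -\frac{1}{2}$. Substituting this value into the definition of the moment generating function results in the bound
$$\Pr(\gamma(X) \le 0) \le \prod_{i=1}^{n} \frac{2}{\sqrt{4 + \alpha_i}},$$
which completes the proof.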
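The Chernoff argument can be illustrated with a few lines of code. This is a minimal sketch assuming the MGF $M_{\gamma(X)}(t) = \prod_i (1 - \alpha_i t - \alpha_i t^2)^{-1/2}$ with $\alpha_i = \lambda_i + \lambda_i^{-1} - 2$, which is the form suggested by the domain condition $1 - \alpha_i t - \alpha_i t^2 > 0$. The function names and the eigenvalues are hypothetical. It evaluates the bound at $t = -1/2$ and checks on a grid that no other $t$ in $(-1, 0)$ gives a smaller MGF value.

```python
import math


def dissimilarity(eigenvalues):
    """alpha_i = lambda_i + 1/lambda_i - 2 for each CAM eigenvalue."""
    return [lam + 1.0 / lam - 2.0 for lam in eigenvalues]


def mgf_gamma(t, alphas):
    """Assumed MGF of gamma(X), valid for -1 < t < 0."""
    return math.prod(1.0 / math.sqrt(1.0 - a * t - a * t * t) for a in alphas)


def auc_chernoff_lower_bound(alphas):
    """max(1/2, 1 - M(-1/2)), the Chernoff bound evaluated at t = -1/2."""
    return max(0.5, 1.0 - mgf_gamma(-0.5, alphas))


# Hypothetical CAM eigenvalues: ten copies of 0.25, so alpha_i = 2.25.
alphas = dissimilarity([0.25] * 10)
bound = auc_chernoff_lower_bound(alphas)
print("lower bound on AUC:", bound)     # 1 - 0.8**10, since M(-1/2) = 0.8 per factor

# Grid check that t = -1/2 minimizes the MGF on (-1, 0).
grid = [-0.99 + 0.01 * k for k in range(99)]
print(all(mgf_gamma(-0.5, alphas) <= mgf_gamma(t, alphas) + 1e-15 for t in grid))
```

Because each factor contributes $2/\sqrt{4+\alpha_i} < 1$ whenever $\alpha_i > 0$, the bound on $1 - \mathrm{AUC}$ shrinks geometrically as more nonzero dissimilarity parameters are added.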

B. Upper Bound for the AUC
In this section, we present a parametric upper bound for the AUC, but first, we need to present the following results.
Proof: This lemma is a special case of the invariance property of the KL divergence [29]. By picking an appropriate measurable mapping, here an appropriate quadratic function for each of the above equations, we conclude the lemma.
where $D_l = \min\{D(l_1(X)||l_0(X)),\; D(l_0(X)||l_1(X))\}$.
Proof: Proof is given in the appendix B.

Proposition 3. The parametric upper bound for the AUC is
$$\Pr(\gamma(X) > 0) = \frac{1}{1 - e^{-a}} - \frac{1}{a} \quad \text{and} \quad D^* \ge \log(a) + \frac{a}{e^{a} - 1} - 1 - \log(1 - e^{-a}),$$
where $a > 0$ is a positive parameter and
$$D^* = \min\{D(f_X(x|H_1)||f_X(x|H_0)),\; D(f_X(x|H_0)||f_X(x|H_1))\}. \qquad (8)$$
Proof: The proof is based on Lemma 5 and the feasible region presented in Theorem 6. From Lemma 5, we have $D_l \le D^*$. Then, using the result in Theorem 6, we get the parametric upper bound.
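The parametric form can be turned into a number for a given divergence value. The sketch below assumes the boundary curve $\mathrm{AUC}(a) = \frac{1}{1-e^{-a}} - \frac{1}{a}$ and $D(a) = \log(a) + \frac{a}{e^{a}-1} - 1 - \log(1-e^{-a})$, both increasing in $a$, and solves $D(a) = D^*$ by bisection. The function names are hypothetical. It also compares the result with the asymptotic expression $1 - e^{-D^* - 1}$ stated in Proposition 5.

```python
import math


def kl_on_boundary(a):
    """D(a) = log(a) + a/(e^a - 1) - 1 - log(1 - e^{-a}) for a > 0."""
    one_minus_exp = -math.expm1(-a)              # 1 - e^{-a}, accurate for small a
    return math.log(a / one_minus_exp) + a * math.exp(-a) / one_minus_exp - 1.0


def auc_on_boundary(a):
    """AUC(a) = 1/(1 - e^{-a}) - 1/a for a > 0."""
    return 1.0 / (-math.expm1(-a)) - 1.0 / a


def auc_upper_bound(d_star):
    """Largest AUC on the boundary curve compatible with divergence d_star."""
    low, high = 1e-6, 1.0
    if d_star <= kl_on_boundary(low):
        return 0.5                               # divergence is numerically zero
    while kl_on_boundary(high) < d_star:         # bracket the root
        high *= 2.0
    for _ in range(200):                         # bisection on the increasing D(a)
        mid = 0.5 * (low + high)
        if kl_on_boundary(mid) < d_star:
            low = mid
        else:
            high = mid
    return auc_on_boundary(high)


for d_star in (0.15, 1.3, 5.0):
    print(d_star, auc_upper_bound(d_star), 1.0 - math.exp(-d_star - 1.0))
```

For the values printed here the parametric bound stays below $1 - e^{-D^*-1}$, and the two get close as $D^*$ grows.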
Proof: Applying the inequality $\frac{2x}{2 + x} < \log(1 + x)$ for $x > 0$, we achieve the result.

Proposition 5. Asymptotic behavior of the upper bound. The parametric upper bound for the AUC has the following asymptotic behavior:
$$\Pr(\gamma(X) > 0) \le 1 - e^{-D^* - 1},$$
where $D^*$ is given in (8).
Applying the exponential function to both sides of the above inequality, we conclude the upper bound. Also, figure 5 shows the feasible region and the asymptotic behavior in regular scale.

V. EXAMPLES AND SIMULATION RESULTS
In this section, we consider some examples of covariance matrices for the Gaussian random vector $X$. We pick the tree structure as the graphical model for the covariance selection problem. In our simulations, we compare the numerically evaluated AUC with its lower and upper bounds and discuss their asymptotic behavior as the dimension of the graphical model, $n$, increases.

A. Tree approximation model
The maximum order of the lower order distributions in the tree approximation problem is two, i.e. no more than pairs of variables. Let $X_T \sim \mathcal{N}(0, \Sigma_{X_T})$ have the graph representation $G_T = (V, E_T)$, where $E_T \subseteq E$ is a set of edges that represents a tree structure. Let $X_l \sim \mathcal{N}(0, \Sigma_{X_l})$ have the graph representation $G_l = (V, E_l)$, where $E_l \subseteq E_T$ is the set of all edges in the graph of $X_l$. The joint PDF for the elements of the random vector $X_l$ can be represented by joint PDFs of two variables and marginal PDFs in the following convenient form:
$$f_{X_l}(x) = \prod_{(i,j) \in E_l} \frac{f_{X_i, X_j}(x_i, x_j)}{f_{X_i}(x_i)\, f_{X_j}(x_j)} \prod_{i \in V} f_{X_i}(x_i). \qquad (9)$$
Using equation (9) we can then easily construct a tree using iterative algorithms (such as the Chow-Liu algorithm [3] combined with the Kruskal algorithm [11] or the Prim algorithm [12]) by adding edges one at a time [31].
Consider the sequence of random variables $X_l$ with $0 \le l \le |E_T|$, where $X_l$ is recursively generated by augmenting the graph representation of $X_{l-1}$ with a new edge, $(i, j) \in E_l$. For the special case of Gaussian distributions, $\Sigma_{X_l}$ has the following recursive formulation [31]:
$$\Sigma_{X_l}^{-1} = \Sigma_{X_{l-1}}^{-1} + \Sigma_{i,j}^{\dagger} - \Sigma_{i}^{\dagger} - \Sigma_{j}^{\dagger},$$
where $\Sigma_{i,j}^{\dagger} = [e_i\ e_j]\, \Sigma_{i,j}^{-1}\, [e_i\ e_j]^T$ and $\Sigma_{i}^{\dagger} = e_i \Sigma_{i}^{-1} e_i^T$, where $e_i$ is a unit vector with 1 at the $i$-th place and $\Sigma_{i,j}$ and $\Sigma_i$ are the 2-by-2 and 1-by-1 principal sub-matrices of $\Sigma_X$, with initial step $\Sigma_{X_0} = \mathrm{diag}(\Sigma_X)$, where $\mathrm{diag}(\Sigma_X)$ represents a diagonal matrix with the diagonal elements of $\Sigma_X$.
Remark: For all $0 \le l \le |E_T|$, we have
Tree approximation models are interesting to study since there are algorithms, such as the Chow-Liu algorithm [3] combined with the Kruskal [11] or the Prim [12] algorithm, that efficiently compute the model covariance matrix from the graph covariance matrix.
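The tree construction described in this subsection can be sketched in a few lines. The code below is illustrative only and rests on assumptions. It uses the Gaussian pairwise mutual information $-\frac{1}{2}\log(1-\rho_{ij}^2)$ as edge weight, a Kruskal-style maximum-weight spanning tree, and the standard edge-by-edge precision update for Gaussian trees (add the inverse of the 2-by-2 principal sub-matrix of the new edge and subtract the two 1-by-1 inverses). The function names `chow_liu_tree` and `tree_covariance` and the 3-node matrix are hypothetical.

```python
import numpy as np


def chow_liu_tree(corr):
    """Maximum-weight spanning tree (Kruskal) with mutual-information weights."""
    n = corr.shape[0]
    edges = [(-0.5 * np.log(1.0 - corr[i, j] ** 2), i, j)
             for i in range(n) for j in range(i + 1, n)]
    edges.sort(reverse=True)                     # heaviest edges first
    parent = list(range(n))

    def find(u):                                 # union-find with path halving
        while parent[u] != u:
            parent[u] = parent[parent[u]]
            u = parent[u]
        return u

    tree = []
    for _, i, j in edges:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:                     # edge joins two components
            parent[root_i] = root_j
            tree.append((i, j))
    return tree


def tree_covariance(corr, tree):
    """Build the tree model covariance by adding one edge at a time."""
    precision = np.diag(1.0 / np.diag(corr))     # start from diag(Sigma_X)
    for i, j in tree:
        idx = [i, j]
        precision[np.ix_(idx, idx)] += np.linalg.inv(corr[np.ix_(idx, idx)])
        precision[i, i] -= 1.0 / corr[i, i]
        precision[j, j] -= 1.0 / corr[j, j]
    return np.linalg.inv(precision)


corr = np.array([[1.0, 0.5, 0.3],
                 [0.5, 1.0, 0.4],
                 [0.3, 0.4, 1.0]])
tree = chow_liu_tree(corr)
sigma_t = tree_covariance(corr, tree)
print("tree edges:", tree)
print(np.round(sigma_t, 4))
```

In this example the weakest edge is dropped, and the corresponding entry of the tree covariance becomes the product of the correlations along the remaining path, while the entries on the kept edges and the diagonal are unchanged.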

B. Toeplitz example
Here, we assume that the covariance matrix $\Sigma_X$ has a Toeplitz structure with ones on the diagonal and the correlation coefficient $\rho > -\frac{1}{n-1}$ as off-diagonal elements. For the tree structure model, all possible tree structured distributions satisfying (9) have the same KL divergence to the original graph, i.e. $D(f_X(x)||f_{X_T}(x))$ is constant over all possible connected tree approximation models for this example. The reason is that all the weights computed by the Chow-Liu algorithm to construct the weighted graph associated with this problem are the same and are equal to $-\frac{1}{2}\log(1 - \rho^2)$, which depends only on the correlation coefficient $\rho$. In the sequel, we test our results for two tree structured networks: a star network and a chain network.

1) Star approximation:
The star covariance matrix is as follows (all the nodes are connected to the first node; all $n$ possible star networks have the same performance): $(\Sigma_{X_{star}})_{1,j} = \rho$ for $j \ne 1$ and $(\Sigma_{X_{star}})_{i,j} = \rho^2$ for distinct $i, j \ne 1$, with ones on the diagonal. For this example, the KL divergence and the Jeffreys divergence can be computed in closed form as
$$D(X||X_{star}) = \frac{1}{2}\left[(n-1)\log(1+\rho) - \log(1 + (n-1)\rho)\right]$$
and
$$D_J(X, X_{star}) = \frac{(n-1)(n-2)\rho^2}{2(1 + (n-1)\rho)},$$
respectively, where $D_J(X, X_{star}) = D(X||X_{star}) + D(X_{star}||X)$ is the Jeffreys divergence [15]. Moreover, for large values of $n$ we have $D(X||X_{star}) \approx \frac{n}{2}\log(1+\rho)$ and $D_J(X, X_{star}) \approx \frac{n}{2}\rho$.
[Figure: panels for $\rho = 0.1$ (left) and $\rho = 0.9$ (right). In both figures, the numerically evaluated AUC is compared with its bounds.]
2) Chain approximation:
The chain covariance matrix is as follows (nodes are connected like a first order Markov chain, 1 to $n$): $(\Sigma_{X_{chain}})_{i,j} = \rho^{|i-j|}$. For this example, the KL divergence and the Jeffreys divergence can be computed in closed form as
$$D(X||X_{chain}) = D(X||X_{star})$$
and
$$D_J(X, X_{chain}) = \frac{\rho^2}{(1 + (n-1)\rho)(1-\rho)}\left[\frac{n(n-1)}{2} - \frac{n(1-\rho^n)}{1-\rho} + \frac{1 - (n+1)\rho^n + n\rho^{n+1}}{(1-\rho)^2}\right],$$
respectively. Moreover, for large values of $n$ we have the approximation $D_J(X, X_{chain}) \approx \frac{n}{2}\frac{\rho}{1-\rho}$. Figure 7 plots $1 - \mathrm{AUC}$ versus the dimension of the graph, $n$, for the correlation coefficients $\rho = 0.1$ and $\rho = 0.9$, together with its upper and lower bounds.
In both figure 6 and figure 7, $1 - \mathrm{AUC}$ and its bounds rapidly go to 0, which means that the AUC goes to one as we increase the number of nodes, $n$, in the graph. More precisely, the bounds for $1 - \mathrm{AUC}$ decay exponentially as the dimension of the graph, $n$, increases, which is consistent with the theory obtained for the analytical bounds. Furthermore, we can conclude from these figures that a smaller $\rho$ results in a better tree approximation, i.e. covariance matrices with smaller correlation coefficients are closer to a tree structured model. Moreover, comparing the AUC for the star network approximation with the AUC for the chain network approximation, we conclude that the star network is a much better approximation than the chain network, even though both approximation networks have the same KL divergence. We can also interpret this fact through the analytical bounds obtained in this paper. The star network is a better approximation than the chain network since the decay rate of $1 - \mathrm{AUC}$ for the star network is less than its decay rate for the chain network.
[Figure: panels for $\rho = 0.1$ (left) and $\rho = 0.9$ (right). In both figures, the numerically evaluated AUC is compared with its bounds.]
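The comparison between the star and the chain approximations can be reproduced numerically for a small graph. This sketch assumes the equicorrelation (Toeplitz) matrix described above, a star model with entries $\rho$ to the hub and $\rho^2$ elsewhere, and a chain model with entries $\rho^{|i-j|}$. The helper names and the values $n = 8$, $\rho = 0.5$ are hypothetical. It checks that both trees have the same KL divergence while their Jeffreys divergences differ.

```python
import numpy as np


def equicorrelation(n, rho):
    """Ones on the diagonal and rho on every off-diagonal entry."""
    return (1.0 - rho) * np.eye(n) + rho * np.ones((n, n))


def star_covariance(n, rho):
    """Tree model with node 0 as hub: rho to the hub, rho^2 between leaves."""
    sigma = np.full((n, n), rho ** 2)
    sigma[0, :] = sigma[:, 0] = rho
    np.fill_diagonal(sigma, 1.0)
    return sigma


def chain_covariance(n, rho):
    """First order Markov chain model: entries rho^|i-j|."""
    idx = np.arange(n)
    return rho ** np.abs(idx[:, None] - idx[None, :])


def gaussian_kl(sigma_p, sigma_q):
    """KL divergence D(N(0, sigma_p) || N(0, sigma_q))."""
    n = sigma_p.shape[0]
    ratio = np.linalg.solve(sigma_q, sigma_p)
    return 0.5 * (np.trace(ratio) - n - np.linalg.slogdet(ratio)[1])


def jeffreys(sigma_p, sigma_q):
    return gaussian_kl(sigma_p, sigma_q) + gaussian_kl(sigma_q, sigma_p)


n, rho = 8, 0.5
sigma_x = equicorrelation(n, rho)
star, chain = star_covariance(n, rho), chain_covariance(n, rho)

kl_star, kl_chain = gaussian_kl(sigma_x, star), gaussian_kl(sigma_x, chain)
dj_star, dj_chain = jeffreys(sigma_x, star), jeffreys(sigma_x, chain)
dj_star_closed = (n - 1) * (n - 2) * rho ** 2 / (2 * (1 + (n - 1) * rho))

print("KL:       star =", kl_star, " chain =", kl_chain)
print("Jeffreys: star =", dj_star, " chain =", dj_chain)
print("closed-form Jeffreys (star) =", dj_star_closed)
```

Equal KL divergences together with a smaller Jeffreys divergence for the star are in line with the observation that the star network is the better of the two approximations.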

C. Solar data
In this example, the covariance matrix is calculated from the datasets presented in [32]. The two datasets are obtained from the National Renewable Energy Laboratory (NREL) website [33]. The first dataset is the Oahu solar measurement grid, which consists of 19 sensors (17 horizontal sensors and two tilted sensors), and the second is the NREL solar data for 6 sites near Denver, Colorado. Both datasets are normalized using the standard normalization method and the zenith angle normalization method [32], and then the unbiased estimate of the correlation matrix is computed.$^{8}$
1) The Oahu solar measurement grid dataset: From the data obtained from 19 solar sensors on the island of Oahu, we computed the spatial covariance matrix during the summer season at 12:00 PM, averaged over a window of 5 minutes. Then, the AUC and the KL divergence are computed for tree structures generated using the Markov Chain Monte Carlo (MCMC) method. Figure 8 shows the distribution of the tree structures generated using the MCMC method versus the KL divergence (left) and versus $\log_{10}(1-\mathrm{AUC})$ (right).$^{9}$
Looking back at Figure 4, for very small values of $1-\mathrm{AUC}$ the relationship between the KL divergence and the boundary of the feasible region for $-\log(1-\mathrm{AUC})$ is linear. This means that if the upper bound is tight, then the relationship between the KL divergence and $-\log(1-\mathrm{AUC})$ is almost linear. In Figure 8, the maximum value of $1-\mathrm{AUC}$ for this model is less than $10^{-3}$, which explains why the two distributions in Figure 8 are scaled mirror images of each other.
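The near-linear relationship can be made explicit from the parametric boundary (11) and (12) derived in the appendix. The following is a short worked sketch for large values of the parameter $a$, which corresponds to small $1-\mathrm{AUC}$; it describes the boundary of the feasible region only, not the divergence of any particular tree model.

```latex
\begin{align*}
1-\mathrm{AUC} &= \frac{1}{a} - \frac{1}{e^{a}-1} \approx \frac{1}{a}, \\
D &= \log a + \frac{a}{e^{a}-1} - 1 - \log\left(1-e^{-a}\right) \approx \log a - 1, \\
D &\approx -\log\left(1-\mathrm{AUC}\right) - 1 \qquad (a \to \infty).
\end{align*}
```

Under this approximation the boundary has unit slope in the $(-\log(1-\mathrm{AUC}),\, D)$ plane, which is consistent with the mirrored shape of the two histograms.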
2) The Colorado dataset: From the solar data obtained from 6 sensors near Denver, Colorado, we computed the spatial covariance matrix during the summer season at 12:00 PM, averaged over a window of 5 minutes. Then, the ...

In the Colorado dataset, there are two sensors that are very close to each other compared to the distance between all other pairs of sensors. As a result, if the edge between these two sensors is in the approximated tree structure, we get a smaller AUC and KL divergence than when that edge is not in the tree structure. This explains why the distribution of all trees in this case looks like a mixture of two distributions.
3) Two-dimensional sensor network: In this example, we create a 2-dimensional (2D) sensor network using the Gaussian kernel [34], where $d(i, j)$ is the Euclidean distance between the $i$-th sensor and the $j$-th sensor in the 2D space. All sensors are located randomly in the 2D space.$^{10}$ We set $\sigma = 1$ and generate a 2D sensor network with 20 sensors. For the 2D sensor network example, Figure 10 shows the distribution of the tree structures generated using the MCMC method versus the KL divergence (left) and versus $\log_{10}(1-\mathrm{AUC})$ (right). Again we see the mirroring effect in Figure 10, as we have an almost linear relationship between the KL divergence and $-\log(1-\mathrm{AUC})$. Note that the generated covariance matrix has one dominant eigenvalue in most cases. Furthermore, Figure 11 plots $1-\mathrm{AUC}$ as well as its analytical upper and lower bounds versus the dimension of the graph, $n$, for $\sigma = 1.3$ (left) and $\sigma = 1.8$ (right). To generate this figure, we randomly generated 1000 sensor networks and plotted the averaged AUC. As we can see in this figure, $1-\mathrm{AUC}$ and its bounds decay exponentially, which is consistent with the theoretical results of this paper.

[Figure 10 caption, partially recovered. Right: distribution of the generated trees (normalized histogram) using MCMC versus $\log_{10}(1-\mathrm{AUC})$ for the 2D sensor network example with 20 sensors and $\sigma = 1$.]
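The construction of one sensor network and its maximum likelihood tree approximation can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the kernel is assumed to be $\exp(-d(i,j)^2 / (2\sigma^2))$, the Chow-Liu tree is obtained with a plain Prim maximum spanning tree on the Gaussian mutual information $-\frac{1}{2}\log(1-\rho_{ij}^2)$, and all function names are hypothetical.

```python
import numpy as np

def gaussian_kernel_cov(points, sigma):
    """Assumed kernel: exp(-d(i,j)^2 / (2 sigma^2)) with Euclidean d(i,j)."""
    diff = points[:, None, :] - points[None, :, :]
    d2 = np.sum(diff ** 2, axis=-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def chow_liu_edges(R):
    """Maximum spanning tree (Prim) on pairwise Gaussian mutual information."""
    n = R.shape[0]
    W = -0.5 * np.log1p(-np.clip(R ** 2, 0.0, 1.0 - 1e-15))
    np.fill_diagonal(W, -np.inf)
    in_tree = np.zeros(n, dtype=bool)
    in_tree[0] = True
    best_w = W[0].copy()               # best weight linking each node to the tree
    best_src = np.zeros(n, dtype=int)  # tree node that achieves that weight
    edges = []
    for _ in range(n - 1):
        cand = np.where(in_tree, -np.inf, best_w)
        j = int(np.argmax(cand))
        edges.append((int(best_src[j]), j))
        in_tree[j] = True
        improve = (W[j] > best_w) & ~in_tree
        best_w[improve] = W[j][improve]
        best_src[improve] = j
    return edges

def tree_precision(R, edges):
    """Precision matrix of the tree model that matches R on nodes and edges."""
    n = R.shape[0]
    J = np.zeros((n, n))
    deg = np.zeros(n, dtype=int)
    for i, j in edges:
        idx = np.ix_([i, j], [i, j])
        J[idx] += np.linalg.inv(R[idx])   # add the inverse of each edge marginal
        deg[i] += 1
        deg[j] += 1
    # subtract the node marginals that were counted more than once
    J[np.diag_indices(n)] -= (deg - 1) / np.diag(R)
    return J

def kl_to_tree(R, J_tree):
    """KL divergence D(N(0, R) || N(0, inv(J_tree)))."""
    n = R.shape[0]
    _, logdet_R = np.linalg.slogdet(R)
    _, logdet_J = np.linalg.slogdet(J_tree)
    return 0.5 * (np.trace(J_tree @ R) - n - logdet_J - logdet_R)
```

A typical use is to draw sensor positions with a seeded `np.random.default_rng`, build `R = gaussian_kernel_cov(points, sigma)`, and evaluate `kl_to_tree(R, tree_precision(R, chow_liu_edges(R)))`. Sensors that fall very close together make the kernel matrix nearly singular, so the region over which the positions are drawn matters in practice.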

VI. CONCLUSION
In this paper, we formulate a detection problem and use it to investigate the quality of the model selection problem. More specifically, we consider Gaussian distributions and discuss the covariance selection quality of a given model. We present the correlation approximation matrix (CAM) and show its relationship with information-theoretic divergences such as the KL divergence, the reverse KL divergence and the Jeffreys divergence, as well as with the ROC curve and the area under it, i.e., the AUC, as a measure of accuracy in the detection problem framework. Moreover, this paper presents an analytical expression for the AUC that can be evaluated numerically in an efficient way, together with analytical lower and upper bounds for the AUC. It is shown that the value of the AUC and its bounds depend only on the eigenvalues of the CAM. We pick the tree structure as an example of an approximation model and use the Chow-Liu MST algorithm to compute the maximum likelihood tree structure approximation. Then, the quality of the Chow-Liu MST algorithm is investigated using the formulated detection problem. Through some examples, we show that, in general, the tree approximation is not a good model as the number of nodes in the graphical model increases, which is the case in high-dimensional problems such as smart grid and big data. This result is also consistent with the analytical results provided in this paper, namely that $1-\mathrm{AUC}$ decays exponentially as the dimension of the graph increases.
The detection framework presented in this paper can be generalized to non-Gaussian models. Moreover, the analytical AUC bounds obtained in this paper can also be used in other applications that use the AUC as a relevant criterion. One example is in medicine, where the AUC is used for diagnostic tests that distinguish positive instances from negative instances [35]; there, instead of changing the coordinates, we can look at the exponent of the AUC bounds.

ACKNOWLEDGMENT
The authors would like to thank Prof. Peter Harremoës for his helpful discussions on information divergences and assistance with Theorem 6. This work was supported in part by NSF grant ECCS-1310634 and the University of Hawaii REIS project.

• C1: $\int_0^1 h'(z)\,dz = 1$
• C2: $h'(z) \ge 0$
• C3: $h'(z)$ is decreasing

where $h'(z)$ is the derivative of the ROC curve, $h(z)$. Also, for a given ROC curve $h(z)$, we can compute the AUC as

$$\Pr(\gamma(X) > 0) = \int_0^1 h(z)\,dz.$$

Then, using integration by parts, we can show that

$$1 - \Pr(\gamma(X) > 0) = \int_0^1 z\,h'(z)\,dz.$$

To compute the feasible region stated in Theorem 6, we need to optimize both of the KL divergences, $D(l_1(X)\|l_0(X))$ and $D(l_0(X)\|l_1(X))$, with respect to the derivative of the ROC curve for a fixed AUC, $\Pr(\gamma(X) > 0)$, while conditions C1, C2 and C3 hold. To solve this optimization problem, we use the method of Lagrange multipliers and first write the Lagrangian. We need two coefficients, $a$ and $b$, corresponding to the conditions in optimization problem (10). Then, we can write the Lagrangian as a function of the derivative of the ROC curve, $z$, $a$ and $b$ as follows:

$$L(h'(z), z, a, b) = -\int \cdots$$

Note that the Lagrangian $L(h'(z), z, a, b)$ is a convex function of $h'(z)$. Thus, we can compute its minimum by taking its derivative with respect to $h'(z)$. Doing so, we get

$$\frac{\partial L(h'(z), z, a, b)}{\partial h'(z)} = \int_0^1 \left(az + b - \frac{1}{h'(z)}\right) dz.$$
Setting $\frac{\partial L(h'(z), z, a, b)}{\partial h'(z)} = 0$, we get $h'(z) = \frac{1}{az+b}$ for all $z \in [0, 1]$. From C3, since $h'(z)$ is decreasing, we can conclude that $a > 0$. Moreover, from C1, at the optimum we have $\int_0^1 h'(z)\,dz = 1$, and thus we can compute one of the coefficients as $b = \frac{a}{e^{a}-1}$.
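The step from C1 to the value of $b$ can be spelled out as follows. This is a short worked derivation that uses only $h'(z) = 1/(az+b)$, condition C1 and $a > 0$.

```latex
\begin{align*}
\int_0^1 \frac{dz}{az+b} = \frac{1}{a}\log\frac{a+b}{b} = 1
\;\Longrightarrow\; \frac{a+b}{b} = e^{a}
\;\Longrightarrow\; b = \frac{a}{e^{a}-1}.
\end{align*}
```

Since $a > 0$ implies $b > 0$, the resulting $h'(z)$ is positive on $[0, 1]$, so condition C2 holds as well.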
Computing the AUC integral and the KL divergence using the ROC curve, we get the following parametric boundary for the feasible region:

$$\Pr(\gamma(X) > 0) = \frac{1}{1-e^{-a}} - \frac{1}{a} \qquad (11)$$

and

$$D = \log(a) + \frac{a}{e^{a}-1} - 1 - \log\left(1-e^{-a}\right) \qquad (12)$$

where $D = D(l_1(X)\|l_0(X))$.
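The parametric boundary (11) and (12) can be evaluated and cross-checked against direct numerical integration of the optimal ROC slope $h'(z) = 1/(az+b)$. The sketch below assumes that the divergence in (12) equals $-\int_0^1 \log h'(z)\,dz$, which is consistent with the derivative of the Lagrangian given above; the function names are hypothetical.

```python
import numpy as np

def boundary_point(a):
    """Return (AUC, D) on the boundary for a > 0, following (11) and (12)."""
    auc = 1.0 / (-np.expm1(-a)) - 1.0 / a
    d = np.log(a) + a / np.expm1(a) - 1.0 - np.log(-np.expm1(-a))
    return auc, d

def _trapezoid(y, x):
    """Composite trapezoidal rule on the grid x."""
    return float(np.sum(0.5 * (y[1:] + y[:-1]) * np.diff(x)))

def boundary_point_numeric(a, m=200001):
    """Same point, obtained by integrating h'(z) = 1 / (a z + b) numerically."""
    b = a / np.expm1(a)                      # coefficient fixed by condition C1
    z = np.linspace(0.0, 1.0, m)
    slope = 1.0 / (a * z + b)
    one_minus_auc = _trapezoid(z * slope, z)  # 1 - AUC = integral of z h'(z)
    d = _trapezoid(np.log(a * z + b), z)      # assumed: D = -integral of log h'(z)
    return 1.0 - one_minus_auc, d
```

Sweeping `a` over a grid of positive values with `boundary_point` traces the boundary of the feasible region; moderate values of `a` keep `np.expm1(a)` within floating-point range.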
Second step: Here we minimize $D(l_0(X)\|l_1(X))$. The Lagrangian for this step is similar to that of the first step, but it is more straightforward if we define $g(\eta) = h^{-1}(\eta)$. Note that, using integration by parts, we can show that the AUC is

$$\Pr(\gamma(X) > 0) = \int_0^1 \eta\,g'(\eta)\,d\eta.$$

Now, we can write the Lagrangian for the optimization problem with respect to $g'(\eta)$. The Lagrangian is convex; thus, taking the derivative and setting it equal to zero,

$$\frac{\partial L(g'(\eta), \eta, a, b)}{\partial g'(\eta)} = 0,$$

we can compute the parametric boundary for the feasible region. The parametric boundary in this case is the same as the solution in (11) and (12) with $D = D(l_0(X)\|l_1(X))$. Thus, combining these two steps, for the optimal boundary we have $D^{*} = \min\{D(l_1(X)\|l_0(X)),\; D(l_0(X)\|l_1(X))\}$.