A Procrustes' procedure for obtaining divergences

Divergences have become a very useful tool for measuring similarity (or dissimilarity) between probability distributions. Depending on the field of application, one measure may be more appropriate than another. In this paper we introduce a family of divergences that we call γ-divergences. They are based on the convexity property of the functions that generate them. We demonstrate that these divergences satisfy all the usually required properties, and we extend them to weighted probability distributions. We investigate their properties in the context of kernel theory. Finally, we apply our findings to the analysis of simulated and real time series.


Introduction
There exists a profusion of divergences between probability distributions. Remarkably, when several of them are applied to the same statistical problem, they do not in general lead to identical results. Therefore, it is useful to have a large set of divergences at hand. Divergences have different origins: some are statistical, others originated in information theory, and others were borrowed from the realm of pure mathematics. The Fisher metric is a conspicuous example of the first kind [1], and the Kullback-Leibler divergence is a well-known example originating in information theory [2]. Measures of similarity (or dissimilarity) between probability distributions have become of great interest in physics (classical and quantum), biology and many other areas of science [3][4][5][6].
In recent years the use of divergence measures has taken on great relevance in statistical and quantum mechanics. In the first case, particularly notable has been their use in the context of out-of-equilibrium systems. The obligatory reference at this point is the work of G. Crooks and D. Sivak, where the authors give a physical interpretation to different divergence measures when they are applied to the study of conjugate ensembles of nonequilibrium trajectories: the relative entropy is related to the dissipation, the Jeffreys divergence is the average dissipation of the forward and reverse evolutions (hysteresis), the Jensen-Shannon divergence has been proposed as a measure of the arrow of time, and the Chernoff divergence is the work cumulant generating function [7].
Due to the statistical nature of quantum mechanics, it is possible to establish a correspondence between divergences among probability distributions and divergences among quantum states. Several correlation measures in multipartite systems have been defined from quantum divergence measures. Within the framework of quantum mechanics, a close relationship has been shown between measures of separation between quantum states of an open system and the different ways of evolution for two system-environment setups [8].
In a collateral way, the extension to the quantum realm of some measures of distinguishability between quantum states has been used for the classification and analysis of multilayer networks. This is the case of the quantum Jensen-Shannon divergence [9].
Although there is a profuse bibliography on these topics, we have chosen references [7] and [8] because their authors compare the characteristics of different divergences applied to the same system. Besides, not every divergence is adequate for every problem, so having a variety of divergences can be useful both for purely theoretical studies and in the context of applications. Sometimes it has been possible to introduce families of divergences, labelling each member with a parameter [10,11] or by giving a general structure that depends on a function with certain characteristics. An example is the family of Csiszár divergences, or f-divergences.
Let Ω = {ω_1, ..., ω_N} be a discrete sample space. The set of probability distributions on Ω can be identified with the simplex P_N = {P = (p_1, ..., p_N) : p_i ≥ 0, Σ_{i=1}^N p_i = 1}. Given two distributions P, Q ∈ P_N, the Csiszár divergence (or f-divergence) associated with a function f is

$$D_f(P||Q) = \sum_{i=1}^N q_i \, f\!\left(\frac{p_i}{q_i}\right),$$

where f is a convex function such that f(1) = 0. This family has been extensively studied in the context of information geometry [13]. A remarkable result is that when p_i and q_i = p_i + δp_i are two close probability distributions, the divergence D_f(P||Q) is proportional to the Fisher (Riemannian) metric:

$$D_f(P||Q) \approx \frac{f''(1)}{2} \sum_{i=1}^N \frac{(\delta p_i)^2}{p_i}. \qquad (1)$$

A Csiszár divergence is symmetric if and only if the function f(u) satisfies

$$f(u) = u\, f(1/u) + \beta\,(u-1),$$

with β a constant. The above mentioned Kullback-Leibler divergence corresponds to f(t) = t log t, which is not symmetric, while the Jensen-Shannon divergence is obtained by taking f(t) = (t+1) log(2/(t+1)) + t log t, which clearly is symmetric [14].
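To make the Csiszár structure concrete, the following minimal sketch (our own illustration, not code from the paper; the function and variable names are ours) evaluates D_f(P||Q) for the two generators quoted above.

```python
# Minimal sketch of the Csiszar f-divergence D_f(P||Q) = sum_i q_i f(p_i/q_i),
# assuming strictly positive distributions P and Q.
import numpy as np

def f_divergence(p, q, f):
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sum(q * f(p / q)))

# generator of the Kullback-Leibler divergence
f_kl = lambda t: t * np.log(t)
# generator of the (symmetric) Jensen-Shannon divergence quoted above
f_js = lambda t: (t + 1) * np.log(2 / (t + 1)) + t * np.log(t)

P = np.array([0.5, 0.3, 0.2])
Q = np.array([0.4, 0.4, 0.2])
print(f_divergence(P, Q, f_kl))  # Kullback-Leibler divergence of P from Q
print(f_divergence(P, Q, f_js))  # Jensen-Shannon divergence of P and Q
```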
In this work we present a new family of divergences, which we call γ-divergences, based on the attributes of convex functions. By mimicking the structure of the Euclidean metric, we propose a family of divergences that verify the basic properties of a "good" divergence [15]. This paper is organized as follows. In section 2 we introduce the family of distances based on convex functions and demonstrate that these distances meet the requirements to be considered divergences. Later on, we describe the characteristics of the γ-divergence and extend it to weighted and N-dimensional distributions. In section 3 we sketch a way to check whether a γ-divergence is a true metric; these results also provide a way to extend the γ-divergence to the realm of quantum mechanics. In section 4 we apply the divergence to the detection of dynamical changes in generated sequences and in electroencephalographic signals. Finally, in section 5 we discuss the results obtained and propose future work.

The γ-Divergence family
The concept of metric space has a long and fruitful history. It dates back to the works of M. Fréchet and F. Hausdorff at the beginning of the twentieth century. Let us recall that a metric space is a pair (X, d), where X is a topological space and d : X × X → R is such that:
1. d(x, y) ≥ 0 for every x, y ∈ X, with equality if and only if x = y;
2. d(x, y) = d(y, x);
3. d(x, y) + d(y, z) ≥ d(x, z) for every x, y, z ∈ X.
Obviously the idea of a metric space is inspired by Euclidean space. In the context of measures of dissimilarity between probability distributions, conditions 2 and 3 are not always satisfied. A long-studied problem concerns the properties that a map φ : [0, ∞) → [0, ∞) should have in order for the space (X, φ(d)) to be a metric space. For example, it is well known that if (X, d) is a metric space, then (X, d/(1+d)) and (X, d^α) with α ∈ (0, 1) also satisfy the metric properties.
We will focus our attention on X = P_N. Let P = {p_i} and Q = {q_i} be two discrete probability distributions belonging to P_N. The square of the Euclidean distance between these two distributions can be written in the form

$$E(P, Q) = \sum_{i=1}^N (p_i - q_i)^2, \qquad (2)$$

or in the equivalent form

$$E(P, Q) = 4 \sum_{i=1}^N \left[ \frac{p_i\, g(p_i) + q_i\, g(q_i)}{2} - \left(\frac{p_i + q_i}{2}\right) g\!\left(\frac{p_i + q_i}{2}\right) \right], \qquad (3)$$

where g(x) = x is the identity function. The space (P_N, √E) is a metric space. The above mentioned Jensen-Shannon divergence (JSD) can be written in the form

$$D_{JS}(P||Q) = 4 \sum_{i=1}^N \left[ \frac{p_i\, g(p_i) + q_i\, g(q_i)}{2} - \left(\frac{p_i + q_i}{2}\right) g\!\left(\frac{p_i + q_i}{2}\right) \right], \qquad (4)$$

which has the same structure as (3), but now with g(x) = (1/2) log x. A remarkable property of the JSD is that its square root, d_JS = √D_JS, is a metric [15]. In both examples the products x · g(x) are convex functions. In some sense we can think of the JSD as a "deformation", through the function g(x), of the Euclidean metric.
This simple observation leads us to propose, for each function g(x) such that G(x) = x g(x) is convex, the functional

$$D_\gamma(P||Q) = \sum_{i=1}^N \gamma_g(p_i, q_i), \qquad \gamma_g(p, q) := \frac{p\, g(p) + q\, g(q)}{2} - \left(\frac{p+q}{2}\right) g\!\left(\frac{p+q}{2}\right). \qquad (5)$$

From now on we will call D_γ(P||Q) the γ-divergence.
The main properties of the γ-divergence are presented in the following theorem.

Theorem. Let g(x) be such that G(x) = x g(x) is a convex function. Then the functional defined in (5) satisfies the properties:
1. D_γ(P||Q) ≥ 0;
2. D_γ(P||Q) = 0 if and only if P = Q;
3. D_γ(P||Q) = D_γ(Q||P).

Proof. Let us recall that for a convex function G(x), for all t ∈ [0, 1] and p, q ∈ [0, 1], the Jensen inequality reads

$$G(t p + (1-t) q) \leq t\, G(p) + (1-t)\, G(q).$$

Therefore the convexity of G(x) = x g(x) and the Jensen inequality for t = 1/2 imply that each term γ_g(p_i, q_i) is non-negative, and hence D_γ is non-negative as well. The symmetry of D_γ is evident from the definition of γ_g.
Obviously D_γ(P||Q) = 0 if P ≡ Q. Conversely, if D_γ(P||Q) = 0 then P = Q. Indeed, for fixed y, the equation γ_g(x, y) = 0 has x = y as its only solution. ♦
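As a complement to the theorem, the following minimal sketch (our own code and naming, not from the paper) evaluates the γ-divergence of Eq. (5) for several of the generating functions g(x) used later in the paper, and checks the Euclidean case g(x) = x.

```python
# gamma-divergence of Eq. (5): D_gamma(P||Q) = sum_i gamma_g(p_i, q_i), with
# gamma_g(p, q) = (p g(p) + q g(q))/2 - ((p + q)/2) g((p + q)/2).
# Strictly positive distributions are assumed (log and sqrt need p_i, q_i > 0).
import numpy as np

def gamma_divergence(p, q, g):
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    G = lambda x: x * g(x)          # G(x) = x g(x) must be convex
    return float(np.sum((G(p) + G(q)) / 2 - G((p + q) / 2)))

P = np.array([0.5, 0.3, 0.2])
Q = np.array([0.4, 0.4, 0.2])

for name, g in [("identity", lambda x: x), ("log", np.log),
                ("exp", np.exp), ("sqrt", np.sqrt), ("sinh", np.sinh)]:
    print(name, gamma_divergence(P, Q, g))

# with g(x) = x, D_gamma is one quarter of the squared Euclidean distance
assert np.isclose(gamma_divergence(P, Q, lambda x: x),
                  0.25 * np.sum((P - Q) ** 2))
```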

Similarity
Let us investigate the relation between the Csiszár divergences and the γ-divergences. Let q_i = p_i + δp_i for each i, with Σ_{i=1}^N δp_i = 0, and let g(x) be twice differentiable. In this case it is direct to evaluate

$$D_\gamma(P||Q) \approx \frac{1}{8} \sum_{i=1}^N \left[ 2 g'(p_i) + p_i\, g''(p_i) \right] (\delta p_i)^2,$$

where g'(x) := dg/dx. If we require that, at second order, this approximation agree with the expansion of a Csiszár divergence at the same order, eq. (1), it must hold that

$$2 g'(x) + x\, g''(x) = \frac{c}{x},$$

with c a constant proportional to f''(1). By integrating this equation, we get

$$g(x) = \alpha \log x + \frac{\beta}{x} + \delta,$$

where α, β and δ are constants. The last two terms produce a contribution to G(x) = x g(x) that is affine in x and therefore does not contribute to the γ-divergence. This leads us to conclude that the only γ-divergence that is also a Csiszár divergence is the JSD. This is a very important result, since it enables us to study the γ-divergences as a new class of divergences.
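The expansion above can be checked numerically. The sketch below (our own, hypothetical code) compares D_γ with the second-order expression for g(x) = log x and a small, zero-sum perturbation δp.

```python
# Numerical check of the second-order expansion
# D_gamma(P||Q) ~ (1/8) sum_i (2 g'(p_i) + p_i g''(p_i)) (delta p_i)^2
# for g(x) = log x, where g'(x) = 1/x and g''(x) = -1/x^2.
import numpy as np

def gamma_divergence(p, q, g):
    G = lambda x: x * g(x)
    return float(np.sum((G(p) + G(q)) / 2 - G((p + q) / 2)))

p = np.array([0.30, 0.25, 0.20, 0.15, 0.10])
delta = 1e-3 * np.array([1.0, -2.0, 0.5, 1.0, -0.5])   # sums to zero
q = p + delta

exact = gamma_divergence(p, q, np.log)
approx = np.sum((2 / p + p * (-1 / p ** 2)) * delta ** 2) / 8  # = sum(delta^2 / (8 p))
print(exact, approx)   # the two values agree to leading order
```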

Weighted γ-Divergences
In several contexts it is useful to assign different relevance to each probability distribution, for example in Bayesian inference. This motivates us to propose a way to define a generalised γ-divergence between weighted probability distributions. In expression (5) we can interpret that each probability distribution has been assigned a weight 1/2. Then, if we assign a weight π_P to the distribution P = {p_i} and a weight π_Q to the distribution Q = {q_i}, with π_P + π_Q = 1, we can introduce a weighted γ-divergence in the form

$$D_\gamma^{\pi_P, \pi_Q}(P||Q) = \sum_{i=1}^N \left[ \pi_P\, p_i\, g(p_i) + \pi_Q\, q_i\, g(q_i) - m_i\, g(m_i) \right],$$

where m_i = π_P p_i + π_Q q_i. This assignment assures that D_γ^{1/2,1/2}(P||Q) = D_γ(P||Q). It is also possible to generalize the γ-divergence to more than two probability distributions. Again we resort to Jensen's inequality. Let G(x) be a real convex function and let {x_1, ..., x_K} be K real numbers belonging to the interval [0, 1]. Then

$$G\!\left(\sum_{k=1}^K \pi_k\, x_k\right) \leq \sum_{k=1}^K \pi_k\, G(x_k),$$

where π_k ∈ [0, 1] and Σ_{k=1}^K π_k = 1. Then, by identifying G(x) = x g(x) and x_k = P^k_i, the difference between the two sides of this inequality defines an extension of the function γ_g previously introduced, but now for K probability distributions:

$$D_\gamma^{\pi_1, \ldots, \pi_K}(P^1, \ldots, P^K) = \sum_{i=1}^N \left[ \sum_{k=1}^K \pi_k\, P^k_i\, g(P^k_i) - m_i\, g(m_i) \right],$$

with m_i := Σ_{k=1}^K π_k P^k_i. We are assuming that the sample space is N-dimensional, and P^k_i represents the probability of occurrence of event i according to the distribution P^k.
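The following sketch (our own code and naming) implements the weighted γ-divergence and its K-distribution extension exactly as written above; equal weights recover Eq. (5).

```python
# Weighted gamma-divergence between K distributions P^1..P^K with weights pi_k:
# D = sum_i [ sum_k pi_k P^k_i g(P^k_i) - m_i g(m_i) ],  m_i = sum_k pi_k P^k_i.
import numpy as np

def weighted_gamma_divergence(dists, weights, g):
    dists = np.asarray(dists, dtype=float)      # shape (K, N)
    weights = np.asarray(weights, dtype=float)  # shape (K,), sums to 1
    G = lambda x: x * g(x)
    m = weights @ dists                         # mixture distribution m_i
    return float(np.sum(weights @ G(dists) - G(m)))

P = np.array([0.5, 0.3, 0.2])
Q = np.array([0.2, 0.5, 0.3])
R = np.array([0.3, 0.3, 0.4])

# two distributions with weights (pi_P, pi_Q) = (0.7, 0.3)
print(weighted_gamma_divergence([P, Q], [0.7, 0.3], np.log))
# equal weights recover the unweighted divergence of Eq. (5)
print(weighted_gamma_divergence([P, Q], [0.5, 0.5], np.log))
# three distributions with equal weights
print(weighted_gamma_divergence([P, Q, R], np.ones(3) / 3, np.log))
```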

Kernel properties of the γ-Divergences
Let Ξ be an arbitrary set. A function φ : Ξ × Ξ → R is called a negative definite kernel if it is symmetric and if, for any n ≥ 2, any subset of points {z_1, ..., z_n} belonging to Ξ, and any collection of real numbers {c_1, ..., c_n} such that Σ_{i=1}^n c_i = 0, the inequality

$$\sum_{j,k=1}^n c_j\, c_k\, \varphi(z_j, z_k) \leq 0 \qquad (16)$$

is valid [16]. It is easy to show that the kernel associated with the Euclidean distance is negative definite. It is also possible to prove that the kernel associated with the JSD is negative definite.
The condition of being negative definite is very relevant because a theorem by I. Schoenberg affirms that (Z, d) is a metric space if its square d² is a negative definite kernel.
The general structure of a γ-divergence is a sum of terms of the form

$$\varphi(x, y) = \frac{x\, g(x) + y\, g(y)}{2} - \left(\frac{x+y}{2}\right) g\!\left(\frac{x+y}{2}\right). \qquad (17)$$

If we were able to prove that φ(x, y) is a negative definite kernel, then we could assure that (P_N, √D_γ) is a metric space. Another consequence of Schoenberg's theorem is that if inequality (16) is satisfied for the kernel (17), the metric space (P_N, √D_γ) can be isometrically embedded in a Hilbert space; that is, there exists a map Φ : P_N → H such that

$$\sqrt{D_\gamma(P||Q)} = ||\Phi(P) - \Phi(Q)||,$$

where || · || is the norm induced by the Hilbert space inner product. This is a very significant result if we are interested in extending the γ-divergence to the realm of quantum mechanics. It is worth mentioning that in the case of the Euclidean metric the embedding Φ induces the Wootters distance between states |Ψ_1⟩ and |Ψ_2⟩ [17]:

$$d_W(|\Psi_1\rangle, |\Psi_2\rangle) = \arccos\left(|\langle \Psi_1 | \Psi_2 \rangle|\right).$$

In general, for any divergence D_γ, the challenge is to find the map Φ. When extending the divergences to the realm of quantum mechanics, some of the properties valid in the classical case are not always easily demonstrable in the quantum case. Recently it has been possible to prove that the quantum extension of the JSD is a true metric; Schoenberg's theorem was crucial in that proof [18]. Therefore we expect it will also be useful in proving the metric property for the quantum extension of the γ-divergences.
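As a numerical sanity check (not a proof, and our own code), one can test inequality (16) on random points and zero-sum coefficients for the kernel (17):

```python
# Randomized test of the negative-definite condition (16) for the kernel (17)
# with the JSD-generating choice g(x) = (1/2) log x. Passing the test is only
# evidence, not a proof, of negative definiteness.
import numpy as np

def phi(x, y, g):
    G = lambda t: t * g(t)
    return (G(x) + G(y)) / 2 - G((x + y) / 2)

g = lambda t: 0.5 * np.log(t)
rng = np.random.default_rng(1)
for _ in range(1000):
    z = rng.uniform(0.01, 1.0, size=6)       # points in (0, 1]
    c = rng.normal(size=6)
    c -= c.mean()                            # enforce sum(c_i) = 0
    K = phi(z[:, None], z[None, :], g)       # matrix phi(z_j, z_k)
    assert c @ K @ c <= 1e-12                # inequality (16)
print("inequality (16) held on all random trials")
```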

Applications
In this section we use γ-divergences as a tool for detecting dynamical changes in artificially generated sequences and real EEG signals.

Detection of dynamical changes
We used Monte Carlo simulations to study the efficiency of the γ-divergences in detecting dynamical changes in artificially generated binary sequences [19]. For this purpose, we generated M binary sequences, each one the concatenation of two sub-sequences s_1 and s_2 with lengths L_{s_1} and L_{s_2}, so that the total length is L = L_{s_1} + L_{s_2}. Each sub-sequence had a probability distribution P_{s_1} = [p_{s_1}, 1 − p_{s_1}] and P_{s_2} = [p_{s_2}, 1 − p_{s_2}] (Fig. 1A). We defined a pointer ρ which moves step by step across the sequence. At each step, we took two windows of the same length (L_win), one on the right (s_r) and the other on the left (s_l) of ρ (Fig. 1B). We estimated the probability distributions of both windows, P_{s_l} and P_{s_r}, and calculated the γ-divergence between them, D_γ(ρ) = D_γ(P_{s_l}||P_{s_r}). This procedure was repeated for each ρ ∈ [L_win + 1, L − L_win] (Fig. 1C). The point ρ* where the divergence reaches its maximum value, D_γ(ρ*) = D_{γmax}, was taken as the transition point between sub-sequences s_1 and s_2. We carried out this analysis for the M generated sequences and then calculated their mean value and standard deviation. Although the method was developed here for binary sequences, it can be applied to any type of discrete sequence, or to continuous sequences after a previous quantization. A minimal sketch of this sliding-pointer procedure is given at the end of this subsection.

Figure 2 shows the γ-divergence for four specific choices of g(x) applied to a combined binary sequence. The first L_{s_1} = 3000 points were generated with probability distribution P = [0.5, 0.5] and the following L_{s_2} = 3000 points with Q = [0.4, 0.6]. The analysis was made using the functions g(x) = e^x, log(x), √x and sinh(x), all of which satisfy the convexity condition required by the theorem of section 2. We used a window length of L_win = 1000 data points. The solid line represents the mean value of D_γ, and the shaded bands are the standard deviation over the M = 1000 realizations. The vertical dashed line marks the point where sequences 1 and 2 are joined. For a better visualization of the results, the curves are shown over the interval [1001, 5000] of the original sequence, since D_γ is zero over the first and last 1000 points. We can see that, for all functions g(x), the maximum divergence D_{γmax} is reached at the exact point where the sequence changes its probability distribution. In all cases, the maximum divergence values are much higher than the standard deviations, demonstrating their statistical significance.

We then studied the detection limit of the γ-divergence; in other words, for binary sequences we wanted to see what the smallest difference between probability distributions was that can be detected by this family of divergences. To this end we generated four combined binary sequences with increasingly closer probability distributions. We analyzed the sequences with the functions g(x) = e^x, log(x), √x and sinh(x), using a window length of L_win = 1000 data points. Fig. 3 shows the analysis for the function g(x) = e^x.
As the two probability distributions P and Q become closer, the maximum divergence value D_{γmax} decreases: it is at the limit of detection for P = [0.5, 0.5] and Q = [0.45, 0.55] (Fig. 3C), and it becomes impossible to distinguish between them when P = [0.5, 0.5] and Q = [0.51, 0.49] (Fig. 3D). Similar results were obtained for the functions log(x), √x and sinh(x).
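Below is the minimal sketch of the sliding-pointer procedure announced above (our own implementation with hypothetical names; window length, probabilities and g follow the example of Fig. 2).

```python
# Sliding-pointer detection of a change in the generating distribution of a
# binary sequence: at each pointer position rho, estimate the distributions of
# the left and right windows and compute their gamma-divergence.
import numpy as np

def gamma_divergence(p, q, g):
    G = lambda x: x * g(x)
    return float(np.sum((G(p) + G(q)) / 2 - G((p + q) / 2)))

def detect_change(seq, L_win, g, eps=1e-12):
    L, D = len(seq), np.zeros(len(seq))
    for rho in range(L_win, L - L_win):
        left, right = seq[rho - L_win:rho], seq[rho:rho + L_win]
        p = np.array([np.mean(left == 0), np.mean(left == 1)]) + eps
        q = np.array([np.mean(right == 0), np.mean(right == 1)]) + eps
        D[rho] = gamma_divergence(p / p.sum(), q / q.sum(), g)
    return D

rng = np.random.default_rng(2)
s1 = rng.choice([0, 1], size=3000, p=[0.5, 0.5])   # first sub-sequence
s2 = rng.choice([0, 1], size=3000, p=[0.4, 0.6])   # second sub-sequence
seq = np.concatenate([s1, s2])

D = detect_change(seq, L_win=1000, g=np.exp)
print("estimated transition point:", np.argmax(D))  # expected close to 3000
```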

Transition detection in an EEG sleep recording
In the second example, we used the γ-divergence to identify the transitions between sleep states in an electroencephalogram (EEG) signal from a sleeping patient. Sleep is a dynamic activity during which many processes vital to health and well-being take place. It is essential for maintaining mood, memory, and cognitive performance [20][21][22]. Specialists define two primary sleep stages: rapid eye movement (REM) sleep and non-REM (NREM) sleep. REM is an active sleeping period showing intense brain activity: brain waves remain fast and desynchronized, similar to those in the waking state. This is also the stage in which the majority of dreams occur. In the NREM sleep stage the physiological activity decreases, the brain waves get slower and have greater amplitude, breathing and heart rate slow down, and blood pressure drops. The NREM phase is composed of three stages: N1, N2, and N3. The N1 stage is characterized by perceived drowsiness, or the transition from being awake to falling asleep, observed as a slowing down of the brain waves and muscle activity. Stage N2 is a period of light sleep during which eye movement stops; brain waves become slower (Theta waves, 4-7 Hz) with occasional bursts of rapid waves (12-14 Hz), called sleep spindles, coupled with spontaneous periods of mixed muscle tone. Lastly, stage N3 is characterized by the presence of slow Delta waves (0.5-4 Hz), interspersed with smaller, faster waves [23]. N3 is a deep sleep stage, without eye movement and with decreased muscle activity, resembling a comatose state.

Usually, sleepers pass through these four stages (REM, N1, N2, and N3) cyclically. A complete sleep cycle takes on average 90 to 110 minutes, with each stage lasting between 5 and 15 minutes. These cycles must be maintained for healthy body function in the awake state [24]. Developing tools that can detect changes in sleep-stage dynamics from the EEG signal is therefore highly relevant for studying patients with sleep disorders [25]. We used the γ-divergence to detect changes in the dynamics of the EEG that allow us to distinguish one state from another.
The data were taken from the Physionet database (The Sleep-EDF Database [Expanded]) [26,27], and are freely available at [28]. The EEG was recorded from the Fpz-Cz bipolar channel, with a sampling frequency of 100 Hz. Initially, we extracted segments of the original signal belonging to five different sleep states: Awake, REM, N1, N2 and N3. Each segment had 6000 points (corresponding to 60 s of recording), and the segments were joined into a single signal (Fig. 4A). The signal was preprocessed with a bandpass filter between 0.5 and 60 Hz. We quantized the signal using the permutation-vector approach [29], with parameters d = 4 and τ = 1. We then applied the γ-divergence following the method used in the previous section for the binary sequences, with the functions g(x) = e^x, log(x), √x and sinh(x). Figure 4B shows that the divergence detects the transitions between the different sleep states for all the functions g(x). The log(x) function shows the highest values and the best differentiation between stages.
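The following sketch (our own code with hypothetical names; the EEG loading and filtering steps are omitted, and a synthetic signal stands in for the recording) illustrates the permutation-vector quantization with d = 4, τ = 1 and the same running-window γ-divergence used for the binary sequences.

```python
# Ordinal-pattern (permutation-vector) quantization of a continuous signal,
# followed by the running-window gamma-divergence over the symbol sequence.
import numpy as np
from itertools import permutations
from math import factorial

def ordinal_symbols(x, d=4, tau=1):
    """Map each embedded vector of d samples (lag tau) to a pattern index."""
    patterns = {perm: k for k, perm in enumerate(permutations(range(d)))}
    windows = np.lib.stride_tricks.sliding_window_view(x, (d - 1) * tau + 1)[:, ::tau]
    return np.array([patterns[tuple(int(i) for i in np.argsort(w))] for w in windows])

def gamma_divergence(p, q, g):
    G = lambda t: t * g(t)
    return float(np.sum((G(p) + G(q)) / 2 - G((p + q) / 2)))

def running_divergence(symbols, L_win, g, n_symbols, eps=1e-12):
    D = np.zeros(len(symbols))
    for rho in range(L_win, len(symbols) - L_win):
        left = np.bincount(symbols[rho - L_win:rho], minlength=n_symbols) + eps
        right = np.bincount(symbols[rho:rho + L_win], minlength=n_symbols) + eps
        D[rho] = gamma_divergence(left / left.sum(), right / right.sum(), g)
    return D

# synthetic stand-in for two 6000-point segments with different dynamics
rng = np.random.default_rng(3)
x = np.concatenate([np.sin(0.3 * np.arange(6000)) + 0.5 * rng.normal(size=6000),
                    rng.normal(size=6000)])
symbols = ordinal_symbols(x, d=4, tau=1)
D = running_divergence(symbols, L_win=1000, g=np.log, n_symbols=factorial(4))
print("estimated transition:", np.argmax(D))   # expected near 6000
```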
The transitions between Awake-REM and N2-N3 are more pronounced than those between REM-N1 and N1-N2. There is a clear differentiation between N2 and N3 because N3 is the deepest sleep stage: in this stage the body becomes largely insensitive to outside stimuli, resembling a comatose state, and the EEG signal is mostly composed of slow waves (Delta and Theta), making the brain dynamics sharply different from the other states. Lower γ-divergence values were found between the N1 and N2 states, showing that both states share similar characteristics in their dynamics. In particular, N1 is characterized by drowsiness, with a slowing down of the brain waves and muscle activity, while N2 is a period of light sleep during which eye movement stops; the two stages are very similar, with the difference that in N2 occasional bursts of rapid waves (12-14 Hz), called sleep spindles, appear.
A similar result is found between REM and N1, which also present low divergence values. This is expected considering that the EEG of REM sleep contains frequencies present in the "awake" state and in the lighter stage of sleep, N1 [30]. Despite the similarities of REM and N1, there are still enough differences between them that statistically distinct values are obtained. The presence of 11-16 Hz activity (sleep spindles) in N1 and of more abundant alpha activity in REM sleep means that these two stages present activity in an overlapping frequency range, which explains the proximity of the divergence values obtained. Difficulty in detecting N1 and REM sleep has also been found using other measures [31,32].

Figure 4: A) The EEG signal is composed of five sub-signals belonging to different sleep stages (Awake, REM, N1, N2, N3); each stage has 6000 points, corresponding to 60 s of recording (dashed horizontal lines). B) D_γ obtained with the running-window method for the four g(x) functions compared in the study. The signal was quantized with permutation vectors with parameters d = 4 and τ = 1. For all functions, the maximum value D_{γmax} is reached at the exact point where a transition between sleep states occurs.

Discussion
In the first place, we showed the existence of a close relationship between the square of the Euclidean metric and the Jensen-Shannon divergence, establishing that both belong to the same family of functionals. In some sense we can think of the JSD as a deformation of the Euclidean distance. Based on this, we introduced a family of divergences, which we call γ-divergences, that depend on the properties of convex functions. Subsequently, we introduced generalized (weighted) versions of these divergences. We also approached the new divergences from the theory of negative definite kernels; this allows us to explore an extension of the γ-divergence to quantum states. This connection will be explored extensively in a future work.
Finally, we applied the γ-divergence to simulated and real sequences. We showed that all the functions g(x) considered could detect, with high significance, the exact point where sequences with different probability distributions were joined. Moreover, we studied the detection threshold, the point beyond which the divergence can no longer distinguish between the two signals. Later, we analysed an EEG signal from a sleeping patient and were able to detect the points where the signal changes its dynamics due to a change in sleep state, showing that the γ-divergence can be an alternative tool for detecting different sleep states.
It is also important to mention that, when using divergences, we must consider the significance of the value obtained. This significance is what allows us to say whether the distance found is real and not just a reflection of statistical fluctuations. In this work we used the standard deviation calculated over M realizations as a measure of significance. However, often only a single realization is available, and in that case this method cannot be used. Therefore, a theoretical study of the statistics of these divergences is necessary. Some studies on this topic have already been carried out by Grosse et al. for the Jensen-Shannon divergence [19]. Similar studies should be carried out for this family of γ-divergences. This, however, exceeds the scope of the present work and will be addressed in the near future.