Gamma-Divergence. An introduction to a new divergence family.

Divergences have become a very useful tool for measuring similarity (or dissimilarity) between probability distributions. Depending on the field of application, a more appropriate measure may be necessary. In this paper we introduce a family of divergences called γ-Divergences, based on the convexity property of the functions that generate them. We demonstrate that these divergences verify all the usually required properties, and we extend them to weighted probability distributions. In addition, we define a generalized entropy closely related to the γ-Divergences. Finally, we apply our findings to the analysis of simulated and real time series.


1 Introduction
There exists a great number of divergences between probability distributions. Remarkably, when different divergences are applied to the same statistical problem they do not, in general, lead to identical results; it is therefore useful to have a wide set of divergences available. Divergences have different origins: some are purely statistical, while others originated in information theory. The Fisher metric is a conspicuous example of the first kind; the Kullback-Leibler and Jensen-Shannon divergences [1] belong to the second class. These measures of similarity (or dissimilarity) between probability distributions have become of great interest in many areas of science, such as classical and quantum physics and biology [2-7].
It is well known that not every distance or divergence is adequate for every problem, so having a variety of divergences can be useful both for theoretical studies and for applications. Sometimes it has been possible to introduce families of divergences, labelling each member with a parameter [8,9] or by giving a general structure that depends on a function with a certain characteristic. An example of the latter is the family known as Csiszár divergences, or f-Divergences, defined as

$$D_f(P\|Q) = \sum_{i=1}^{N} q_i\, f\!\left(\frac{p_i}{q_i}\right),$$

where f(x) is a convex function such that f(1) = 0, and p_i and q_i are discrete probability distributions. This family has been extensively studied in the context of information geometry. A remarkable result is that when p_i and q_i = p_i + δp_i are close probability distributions, the divergence D_f(P||Q) is proportional to the Fisher (Riemannian) metric. The Kullback-Leibler and Jensen-Shannon divergences mentioned above correspond to f(t) = t log t and f(t) = (t + 1) log(2/(t + 1)) + t log t, respectively [10].
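To make the construction concrete, here is a minimal numerical sketch (our own illustration, not part of the original formulation): the helper `f_divergence` and the generator lambdas are hypothetical names, and the Jensen-Shannon generator follows the normalization quoted above.

```python
import numpy as np

def f_divergence(p, q, f):
    """Csiszar f-Divergence: D_f(P||Q) = sum_i q_i * f(p_i / q_i)."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = q > 0  # bins with q_i = 0 are skipped in this simple sketch
    return float(np.sum(q[mask] * f(p[mask] / q[mask])))

# Generators for Kullback-Leibler and Jensen-Shannon (as quoted above)
f_kl = lambda t: t * np.log(t)
f_js = lambda t: (t + 1) * np.log(2 / (t + 1)) + t * np.log(t)

p = np.array([0.2, 0.3, 0.5])
q = np.array([0.4, 0.4, 0.2])
print(f_divergence(p, q, f_kl))  # Kullback-Leibler divergence
print(f_divergence(p, q, f_js))  # Jensen-Shannon divergence (up to normalization)
```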
In this work we present a new family of divergences, which we call γ-Divergences, based on a property of convex functions. By mimicking the structure of the Euclidean metric, we propose a family of divergences that verifies the basic conditions of a "good" divergence [11].
This work is organized as follows. In section 2 we introduce the family of divergences based on convex functions and demonstrate that they meet the requirements to be considered divergences. Later on, we show the characteristics of this γ-Divergence and extend it to weighted and N-dimensional distributions. In section 3 we introduce a generalized entropy based on the notion of convexity, show that it meets all the requirements a generalized entropy should meet, and investigate its relationship with the γ-Divergence defined before. In section 4 we apply the divergence to the detection of dynamical changes in generated sequences and in electroencephalographic signals. Finally, in section 5 we discuss the results obtained and propose future work.
2 The γ-Divergence family

Let P = {p_i}_{i=1}^N and Q = {q_i}_{i=1}^N be two discrete probability distributions for an N-state random variable X. The square of the Euclidean metric between these two distributions can be written in the form

$$D_E(P\|Q) = \sum_{i=1}^{N} (p_i - q_i)^2,$$

or in the equivalent form

$$D_E(P\|Q) = 2 \sum_{i=1}^{N} \left[ p_i\, g(p_i) + q_i\, g(q_i) - (p_i + q_i)\, g\!\left(\frac{p_i + q_i}{2}\right) \right],$$

where g(x) = x is the identity function. The square root of the Euclidean distance is a true metric, in the sense that it verifies the triangle inequality. Let

$$D_{JS}(P\|Q) = \frac{1}{2} \sum_{i=1}^{N} \left[ p_i \log\!\left(\frac{2 p_i}{p_i + q_i}\right) + q_i \log\!\left(\frac{2 q_i}{p_i + q_i}\right) \right]$$

be the Jensen-Shannon divergence, and let d_JS = √D_JS be its square root. It is known that d_JS is a true metric [1]. Its square can be rewritten in the form

$$D_{JS}(P\|Q) = \sum_{i=1}^{N} \left[ p_i\, g(p_i) + q_i\, g(q_i) - (p_i + q_i)\, g\!\left(\frac{p_i + q_i}{2}\right) \right],$$

where g(x) = (1/2) log(x). Thus the Euclidean distance and the JSD share the same structure. Furthermore, the function x·g(x) is convex both in the case of the Euclidean distance and in the case of the JSD. This simple observation leads us to propose, for each function g(x) such that x·g(x) is convex, a divergence

$$D_\gamma(P\|Q) = \sum_{i=1}^{N} \gamma_g(p_i, q_i),$$

where

$$\gamma_g(p_i, q_i) = p_i\, g(p_i) + q_i\, g(q_i) - (p_i + q_i)\, g\!\left(\frac{p_i + q_i}{2}\right). \quad (6)$$

Theorem: Let P = {p_i}_{i=1}^N and Q = {q_i}_{i=1}^N be two probability distributions for a given N-state random variable X, and let g : R⁺ → R be such that f(x) := x·g(x) is a strictly convex function. Then the functional D_γ(P||Q) defined above satisfies:

1. D_γ(P||Q) = D_γ(Q||P) (symmetry);
2. D_γ(P||Q) ≥ 0 (non-negativity);
3. D_γ(P||Q) = 0 ⟺ P ≡ Q.

Proof: The divergence D_γ is a sum of N terms γ_g(p_i, q_i); therefore, if each of these terms is symmetric, non-negative, and null if and only if p_i = q_i, these properties are inherited by D_γ. From now on we will use the notation m_i := (p_i + q_i)/2.

i) Symmetry: from definition (6) it is direct to check that γ_g(p_i, q_i) = γ_g(q_i, p_i), ∀i. ♦

ii) Non-negativity: to prove D_γ(P||Q) ≥ 0 we need to show that γ_g(p_i, q_i) ≥ 0, ∀i.
Under the hypothesis that f is a convex function, for all t ∈ [0, 1] and p, q ∈ [0, 1] the Jensen inequality gives

$$f(t\,q + (1 - t)\,p) \le t\, f(q) + (1 - t)\, f(p). \quad (11)$$

Replacing f(x) = x·g(x) and choosing t = 1/2, we obtain

$$(p_i + q_i)\, g\!\left(\frac{p_i + q_i}{2}\right) \le p_i\, g(p_i) + q_i\, g(q_i),$$

which, by definition (6), means that γ_g(p_i, q_i) ≥ 0, ∀i; the sum of non-negative quantities is non-negative. ♦

iii) D_γ(P||Q) = 0 ⟺ P ≡ Q.
⇐) Replacing P ≡ Q in the definition of γ_g given in (6), we have

$$\gamma_g(q_i, q_i) = q_i\, g(q_i) + q_i\, g(q_i) - 2 q_i\, g(q_i) = 0 \quad \forall i, \quad (15)$$

and therefore D_γ(P||Q) = 0.
⇒) It has already been shown that γ_g(p_i, q_i) ≥ 0 ∀i; hence D_γ(P||Q) = 0 implies γ_g(p_i, q_i) = 0 for every i, which with simple algebraic steps can be written as f(p_i) + f(q_i) = 2 f(m_i). Since f is strictly convex, equality in Jensen's inequality holds if and only if p_i = m_i and q_i = m_i, implying p_i = q_i. ♦ •
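The definition above translates directly into code. The following is a minimal sketch (ours; `gamma_divergence` is a hypothetical name) of D_γ for an arbitrary admissible g, with the convention 0·g(0) = 0 for null bins:

```python
import numpy as np

def gamma_divergence(p, q, g):
    """D_gamma(P||Q) = sum_i [ p_i g(p_i) + q_i g(q_i) - (p_i + q_i) g(m_i) ],
    with m_i = (p_i + q_i)/2, for a g such that x*g(x) is convex."""
    def xg(x):
        # x * g(x) with the convention 0 * g(0) = 0 (the limit for the g's used here)
        out = np.zeros_like(x)
        pos = x > 0
        out[pos] = x[pos] * g(x[pos])
        return out
    p, q = np.asarray(p, float), np.asarray(q, float)
    m = (p + q) / 2.0
    return float(np.sum(xg(p) + xg(q) - 2.0 * xg(m)))

p = np.array([0.2, 0.3, 0.5])
q = np.array([0.4, 0.4, 0.2])
# g(x) = (1/2) log(x) recovers the Jensen-Shannon divergence
print(gamma_divergence(p, q, lambda x: 0.5 * np.log(x)))
# Other admissible choices used later in the paper
for g in (np.exp, np.log, np.sqrt, np.sinh):
    print(gamma_divergence(p, q, g))
```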

2.1 Subset of functions g(x)
Due to the previous result, we can look for a subset of functions g(x) that satisfy the hypotheses of the theorem of section 2. We know that if f(x) is twice differentiable, it is strictly convex if f̈(x) > 0. Then the inequality

$$\ddot{f}(x) = 2\,\dot{g}(x) + x\,\ddot{g}(x) > 0$$

gives a subset of functions g(x) that satisfies the theorem, building a subfamily of γ-Divergences.
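This condition can be checked symbolically; the small sketch below (our illustration, using sympy) evaluates f̈(x) = 2ġ(x) + xg̈(x) on a grid of points for the four functions used later in the paper:

```python
import sympy as sp

x = sp.symbols('x', positive=True)
for g in (sp.exp(x), sp.log(x), sp.sqrt(x), sp.sinh(x)):
    fpp = sp.diff(x * g, x, 2)  # f''(x) = 2*g'(x) + x*g''(x)
    vals = [fpp.subs(x, v) for v in (0.01, 0.1, 0.5, 1.0)]
    print(f"g(x) = {g}:  f''(x) = {sp.simplify(fpp)},  positive on grid: {all(v > 0 for v in vals)}")
```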

2.2 Linearity
Next, we show that new γ-Divergences can be generated by taking g(x) as a linear combination of functions g_k(x) that meet the conditions presented in 2.1. Let

$$g(x) = \sum_{k=1}^{m} \alpha_k\, g_k(x), \qquad \alpha_k > 0 \ \ \forall k = 1, \dots, m,$$

where the functions g_k(x) satisfy the hypotheses of the theorem. Replacing in equation (6) we obtain

$$\gamma_g(p_i, q_i) = \sum_{k=1}^{m} \alpha_k\, \gamma_{g_k}(p_i, q_i),$$

so the γ-Divergence takes the form

$$D_{\gamma_g}(P\|Q) = \sum_{k=1}^{m} \alpha_k\, D_{\gamma_k}(P\|Q), \qquad D_{\gamma_k}(P\|Q) := \sum_{i=1}^{N} \gamma_{g_k}(p_i, q_i).$$

Now we show that D_{γ_g} complies with the properties of the theorem. Every g_k(x) satisfies the theorem's hypotheses, implying D_{γ_k}(P||Q) ≥ 0 for all k; in consequence, if α_k > 0 for all k, then D_{γ_g}(P||Q) ≥ 0. On the other hand, since D_{γ_k}(P||Q) = D_{γ_k}(Q||P) for all k, we have D_{γ_g}(P||Q) = D_{γ_g}(Q||P). Finally, since D_{γ_k}(P||Q) = 0 ⟺ P ≡ Q for all k by hypothesis, if P ≡ Q then every term vanishes and D_{γ_g}(P||Q) = 0. Conversely, if D_{γ_g}(P||Q) = 0, each term of the sum must be equal to zero; because α_k > 0 for all k, this gives D_{γ_k}(P||Q) = 0, and by hypothesis each D_{γ_k} satisfies the theorem, so Q ≡ P. Therefore D_{γ_g}(P||Q) = 0 ⟺ P ≡ Q.
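The linearity can be verified numerically in a couple of lines; this sketch assumes the gamma_divergence helper from the earlier snippet is in scope:

```python
import numpy as np
# Assumes gamma_divergence from the earlier sketch is in scope.
alphas = [0.7, 0.3]
gs = [np.log, np.sqrt]
g_mix = lambda x: sum(a * g(x) for a, g in zip(alphas, gs))

p = np.array([0.2, 0.3, 0.5])
q = np.array([0.4, 0.4, 0.2])
lhs = gamma_divergence(p, q, g_mix)
rhs = sum(a * gamma_divergence(p, q, g) for a, g in zip(alphas, gs))
print(np.isclose(lhs, rhs))  # True: D_gamma is linear in g
```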

2.3 Similarity
Let q_i = p_i + δp_i for each i, with Σ_{i=1}^N δp_i = 0, and let g(x) be a differentiable function. Then the γ-Divergence can be written as

$$D_\gamma(P\|Q) = \sum_{i=1}^{N} \left[ p_i\, g(p_i) + (p_i + \delta p_i)\, g(p_i + \delta p_i) - (2 p_i + \delta p_i)\, g\!\left(p_i + \frac{\delta p_i}{2}\right) \right].$$

Taking the Taylor expansion of g(x) to linear order around p_i, we obtain

$$g(p_i + \delta p_i) \simeq g(p_i) + \dot{g}(p_i)\, \delta p_i, \qquad g\!\left(p_i + \frac{\delta p_i}{2}\right) \simeq g(p_i) + \dot{g}(p_i)\, \frac{\delta p_i}{2},$$

where ġ(x) := dg/dx. With some algebraic steps, the resulting γ-Divergence approximation can be rearranged in the form

$$D_\gamma(P\|Q) \simeq \frac{1}{2} \sum_{i=1}^{N} \dot{g}(p_i)\, (\delta p_i)^2,$$

which is quadratic in the introduced variation.
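A quick numerical check of this quadratic behavior (again assuming the gamma_divergence helper from above is in scope, with ġ supplied by hand):

```python
import numpy as np
# Assumes gamma_divergence from the earlier sketch is in scope.
p = np.array([0.2, 0.3, 0.5])
delta = 1e-4 * np.array([1.0, -2.0, 1.0])  # perturbation summing to zero
q = p + delta

g, g_dot = np.log, lambda x: 1.0 / x  # g and its derivative
exact = gamma_divergence(p, q, g)
approx = 0.5 * np.sum(g_dot(p) * delta**2)
print(exact, approx)  # agree to leading order in delta
```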

2.4 Weighted γ-Divergences
In several contexts, it can be useful to assign different relevance to the different probability distributions, for example in Bayesian inference. Here we propose a way to define a generalised γ-Divergence between weighted probability distributions. Let π_P, π_Q ≥ 0 with π_P + π_Q = 1 be arbitrary weights for the probability distributions P and Q. From equation (11), a natural assignment of weights is t = π_Q and (1 − t) = π_P, obtaining

$$D^{\pi_P, \pi_Q}_\gamma(P\|Q) = \sum_{i=1}^{N} \left[ \pi_P\, p_i\, g(p_i) + \pi_Q\, q_i\, g(q_i) - m_i\, g(m_i) \right],$$

where m_i = π_P p_i + π_Q q_i. This assignment assures that D^{π_P,π_Q}_γ(P||Q) ≥ 0.

2.5 Generalization for more than two distributions
It is also possible to generalize the γ-Divergence to more than two probability distributions. Let f(x) = x·g(x) be a convex function and let x = {x_1, . . . , x_N} be values in its domain. Jensen's inequality (see theorem 2.6.2 of [12]) gives

$$f\!\left(\sum_{k=1}^{N} \pi_k\, x_k\right) \le \sum_{k=1}^{N} \pi_k\, f(x_k),$$

where π_k ∈ [0, 1] and Σ_{k=1}^N π_k = 1. Using the definition of f and evaluating at the i-th components of the distributions P^{(1)}, . . . , P^{(N)}, the difference between the right- and left-hand sides is just the extension of γ^{π_Q,π_P}_g(q_i, p_i) to N probability distributions. In the same way, we can regard {π_1, . . . , π_N} as the weights of the distributions and define

$$\gamma^{\pi_1, \dots, \pi_N}_g(p^{(1)}_i, \dots, p^{(N)}_i) = \sum_{k=1}^{N} \pi_k\, p^{(k)}_i\, g(p^{(k)}_i) - m_i\, g(m_i), \qquad m_i = \sum_{k=1}^{N} \pi_k\, p^{(k)}_i.$$

Then

$$D^{\pi_1, \dots, \pi_N}_\gamma(P^{(1)}, \dots, P^{(N)}) = \sum_{i} \gamma^{\pi_1, \dots, \pi_N}_g(p^{(1)}_i, \dots, p^{(N)}_i).$$
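The weighted and multi-distribution cases collapse into a single formula. Below is a self-contained sketch (our illustration; weighted_gamma_divergence is a hypothetical name) of D_γ^{π_1,…,π_N}:

```python
import numpy as np

def weighted_gamma_divergence(dists, weights, g):
    """Weighted gamma-Divergence for distributions P^(1)..P^(N) with weights pi_k:
    sum_i [ sum_k pi_k p_i^(k) g(p_i^(k)) - m_i g(m_i) ],  m_i = sum_k pi_k p_i^(k).
    Non-negativity follows from Jensen's inequality applied to f(x) = x*g(x)."""
    def xg(x):
        out = np.zeros_like(x)
        pos = x > 0
        out[pos] = x[pos] * g(x[pos])  # convention 0*g(0) = 0
        return out
    P = np.asarray(dists, float)              # shape (n_dists, n_states)
    w = np.asarray(weights, float)[:, None]   # shape (n_dists, 1)
    m = np.sum(w * P, axis=0)                 # mixture distribution
    return float(np.sum(w * xg(P)) - np.sum(xg(m)))

p = np.array([0.2, 0.3, 0.5])
q = np.array([0.4, 0.4, 0.2])
# With two distributions and equal weights this equals (1/2) D_gamma(P||Q)
print(weighted_gamma_divergence([p, q], [0.5, 0.5], np.log))
```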

3 Introduction of the new generalized entropy
Entropy can be viewed as a measure of the amount of information contained in a system. It is possible to generalize the concept of entropy proposed by Shannon [14] by stating the fundamental characteristics that an entropy should have in general. There are many ways of introducing a generalized entropy. In the cases of the Havrda-Charvát-Tsallis (HCT) and Rényi entropies [15-17], the generalization is carried out through a parameter. For Salicrú's entropy [18], the intention was to introduce a set of entropies through two functions h and φ. In our case, we look for an entropy related to the γ-Divergence described before. Let P = {p_i}_{i=1}^N be a probability distribution and H_G[P] a functional of P. We say that H_G is a generalized entropy if it complies with the following properties:

• to be continuous in each p_i;
• to be equal to zero in the deterministic case, i.e., H_G[P] = 0 for p_i = 1 and p_j = 0, ∀ j ≠ i;
• to reach its maximum when the distribution is uniform, P = U, i.e., when p_i = 1/N, ∀ i;
• to be concave with respect to its argument.

The last property allows us to define a "Jensen-like" divergence in the following way:

$$D_{H_G}(P\|Q) = H_G\!\left[\frac{P+Q}{2}\right] - \frac{H_G[P] + H_G[Q]}{2}.$$

Consider now H_g[P] = −Σ_{i=1}^N p_i g(p_i), with f(x) = x·g(x) convex. From the Karamata theorem we have that, for any convex function f, Σ_i f(r_i) ≥ Σ_i f(p_i) whenever R majorizes P; taking R = {1, 0, . . . , 0}, which majorizes any distribution P, we obtain Σ_i p_i g(p_i) ≤ g(1), so the deterministic case minimizes H_g (and H_g[R] = 0 whenever g(1) = 0, as for g(x) = log(x)). On the other hand, let U = {1/N, . . . , 1/N} be the uniform distribution; by Jensen's inequality the function Σ_{i=1}^N p_i g(p_i) is minimal when the distribution is uniform, so H_g reaches its maximum at P = U. Finally, since f(x) is convex by definition, H_g has the property of being concave.
Let H_{h,φ}[P] = h(Σ_i φ(p_i)) be the entropy defined by Salicrú [18]. Let us particularize by taking h as the identity and φ(p_i) = −p_i g(p_i); we obtain

$$H_g[P] = -\sum_{i=1}^{N} p_i\, g(p_i),$$

and with some algebraic steps the Jensen-like divergence built from H_g becomes

$$H_g\!\left[\frac{P+Q}{2}\right] - \frac{H_g[P] + H_g[Q]}{2} = \frac{1}{2}\sum_{i=1}^{N} \gamma_g(p_i, q_i) = \frac{1}{2}\, D_\gamma(P\|Q).$$

This result shows that the generalized entropy H_g is closely related to the γ-Divergence defined in section 2.
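Under the definitions above, the relation can be verified numerically; this sketch assumes the gamma_divergence helper from section 2 is in scope and takes H_g[P] = −Σ p_i g(p_i):

```python
import numpy as np
# Assumes gamma_divergence from the earlier sketch is in scope.

def gamma_entropy(p, g):
    """H_g[P] = -sum_i p_i g(p_i), with the convention 0*g(0) = 0.
    For g(x) = log(x) this is the Shannon entropy."""
    p = np.asarray(p, float)
    pos = p > 0
    return float(-np.sum(p[pos] * g(p[pos])))

p = np.array([0.2, 0.3, 0.5])
q = np.array([0.4, 0.4, 0.2])
g = np.log
m = (p + q) / 2
jensen_like = gamma_entropy(m, g) - 0.5 * (gamma_entropy(p, g) + gamma_entropy(q, g))
print(jensen_like, 0.5 * gamma_divergence(p, q, g))  # the two values coincide
```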

4 Applications
In this section we use the γ-Divergence as a tool to detect dynamical changes in artificially generated sequences and in real electrophysiological signals.

4.1 Dynamical change detection
We used Monte Carlo simulation to study the efficiency of the γ-Divergence in detecting dynamical changes in artificially generated binary sequences. For this purpose, we generated M sequences, each composed of two sub-sequences, s^i = s^i_1 + s^i_2 for i = 1, . . . , M, with total length L = L_{s_1} + L_{s_2}. The sub-sequences had probability distributions P_{s_1} = [p_{s_1}, 1 − p_{s_1}] and P_{s_2} = [p_{s_2}, 1 − p_{s_2}] (Fig. 1A). We defined a pointer ρ which moves step by step across the sequence. At each step, we took two windows of the same length (L_{win}), one on the right (s_r) and the other on the left (s_l) of ρ (Fig. 1B). We estimated the probability distributions of both windows, P_{s_l} and P_{s_r}, and calculated the γ-Divergence between them, D_γ = D_γ(P_{s_l}||P_{s_r}). This procedure was repeated for each ρ ∈ [L_{win} + 1, L − L_{win}] (Fig. 1C). The point ρ* where the divergence reaches its maximum value, D_γ(ρ*) = D_{γmax}, is taken as the transition point between sub-sequences s_1 and s_2. We ran this analysis over the M generated sequences and then calculated their mean value and standard deviation. Although developed here for binary sequences, the procedure can be applied to any type of discrete, or continuous (with prior quantization), sequence. Figure 2 shows the γ-Divergence for four specific g(x) applied over a combined binary sequence whose first sub-sequence of L_{s_1} = 3000 points was generated with a fixed probability distribution. In all cases, the maximum divergence (D_{γmax}) is reached at the exact point where the sequence changes its probability distribution, and the maximum divergence values are much higher than the standard deviations, demonstrating their statistical significance. We then studied the detection limit of the γ-Divergence; in other words, for binary sequences we wanted to see the smallest difference between probability distributions that this family of divergences can detect. To fulfil this aim we generated four combined binary sequences with increasingly close probability distributions. We analyzed the sequences with the functions g(x) = e^x, log(x), √x, sinh(x), using a window length of L_{win} = 1000 data points. Fig. 3 shows the analysis for the function g(x) = e^x.
As the two probability distributions P and Q become closer, the maximum divergence value (D_{γmax}) decreases, reaching the detection limit for P = [0.5, 0.5] and Q = [0.45, 0.55] (Fig. 3C) and making it impossible to distinguish between them for P = [0.5, 0.5] and Q = [0.51, 0.49] (Fig. 3D). We obtained similar results for the functions log(x), √x, and sinh(x).
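A minimal sketch of this sliding-window procedure follows (our illustration; the probabilities of the example sequence are placeholders, since the exact values used in Fig. 2 are not recoverable from the text, and gamma_divergence is the helper defined earlier):

```python
import numpy as np
# Assumes gamma_divergence from the earlier sketch is in scope.

def binary_dist(seq):
    """Empirical distribution [P(0), P(1)] of a binary sequence."""
    p1 = float(np.mean(seq))
    return np.array([1.0 - p1, p1])

def detect_transition(seq, l_win, g):
    """Slide the pointer rho and compare left/right windows with D_gamma."""
    n = len(seq)
    rhos = np.arange(l_win, n - l_win)
    d = np.array([gamma_divergence(binary_dist(seq[r - l_win:r]),
                                   binary_dist(seq[r:r + l_win]), g)
                  for r in rhos])
    return d, int(rhos[np.argmax(d)])  # divergence profile and rho*

rng = np.random.default_rng(0)
seq = np.concatenate([rng.random(3000) < 0.3,   # P_s1 = [0.7, 0.3] (illustrative)
                      rng.random(3000) < 0.6])  # P_s2 = [0.4, 0.6] (illustrative)
d, rho_star = detect_transition(seq.astype(int), 1000, np.exp)
print(rho_star)  # close to the true change point at 3000
```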

4.2 Transition detection over EEG sleep recordings
In the second example, we used the γ-Divergence to identify the transitions between sleep states in an electroencephalogram (EEG) signal from a sleeping patient. Sleep is a dynamic activity during which many processes vital to health and well-being take place; it is essential for maintaining mood, memory, and cognitive performance [19-21]. Specialists define two primary sleep stages, REM and non-REM. The non-REM stage is composed of three stages: N1 and N2 are light sleep, in which one drifts in and out of sleep and can be awakened easily, while stage N3 shows slow brain waves and is the deepest sleep stage, resembling a comatose state. The REM stage (for rapid eye movement) is an active period of sleep caused by intense brain activity: brain waves are fast and desynchronized, comparable to those in the waking state, and this is also the stage in which most dreams take place. These stages progress cyclically, from N1 through REM, and then begin again at N1; maintaining these cycles is very important for health. Tools that can detect changes in sleep-stage dynamics in the EEG signal are therefore highly valuable for studying patients with sleep disorders [22]. We used our γ-Divergence to detect changes in the dynamics of the EEG that allow us to distinguish one state from another.
The data were taken from the Physionet database, The Sleep-EDF Database [Expanded] [23,24], and are freely available at [25]. The EEG was recorded from the bipolar channel Fpz-Cz with a sampling frequency of 100 Hz. Initially, we extracted segments of the original signal belonging to the five different sleep states: Awake, REM, N1, N2, N3. Each segment had 6000 points (corresponding to 60 s of recording), and the segments were joined into a single signal (Fig. 4A). The signal was pre-processed with a band-pass filter between 0.5 and 60 Hz. We quantized the signal using the permutation-vector approach [26], with parameters d = 4 and τ = 1, and applied the γ-Divergence following the method used in the previous section for the binary sequences. The functions used were g(x) = e^x, log(x), √x, sinh(x). Figure 4B shows that the divergence detects the transitions between different sleep states for all the functions g(x). The log(x) function shows the highest values and the best differentiation between stages.
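For readers who want to reproduce the pipeline, the sketch below implements an ordinal-pattern (Bandt-Pompe) symbolization, which is our assumption of what the permutation-vector method of [26] amounts to; the surrogate segments only stand in for real EEG, and gamma_divergence is the helper from section 2:

```python
import numpy as np
from itertools import permutations
# Assumes gamma_divergence from the earlier sketch is in scope.

def ordinal_distribution(x, d=4, tau=1):
    """Empirical distribution of ordinal (permutation) patterns of signal x,
    with embedding dimension d and delay tau (Bandt-Pompe symbolization)."""
    x = np.asarray(x, float)
    index = {pat: k for k, pat in enumerate(permutations(range(d)))}
    counts = np.zeros(len(index))
    for i in range(len(x) - (d - 1) * tau):
        window = x[i:i + (d - 1) * tau + 1:tau]
        counts[index[tuple(np.argsort(window))]] += 1
    return counts / counts.sum()

rng = np.random.default_rng(1)
seg_a = rng.normal(size=6000)             # surrogate for one sleep stage
seg_b = np.cumsum(rng.normal(size=6000))  # surrogate with different dynamics
P, Q = ordinal_distribution(seg_a), ordinal_distribution(seg_b)
print(gamma_divergence(P, Q, np.log))
```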
The transitions between Awake-REM and N2-N3 are more pronounced than those between REM-N1 and N1-N2. There is a significant differentiation between N2 and N3 because N3 is the deepest sleep stage: in this stage the body becomes more insensitive to outside stimuli, resembling a comatose state, and the EEG signal is composed mostly of slow waves (delta and theta), making the brain dynamics sharply different from the other states. Lower γ-Divergence values were found between the N1 and N2 states, showing that both share similar characteristics in their dynamics. In particular, N1 is characterized by drowsiness, with slowing brain waves and muscle activity. While N1 is a period of light sleep during which eye movement stops, the two stages (N1-N2) are very similar, with the difference that in N1 occasional bursts of rapid waves (12-14 Hz) called sleep spindles appear.
A similar situation holds between REM and N1, which also present lower divergence values. This is expected considering that the EEG of REM sleep contains frequencies present in the awake state and in the lighter sleep stage N1 [27]. Despite the similarities between REM and N1, there are still enough differences between them that statistically different values are obtained. The presence of 11-16 Hz activity (sleep spindles) in N1, and the more abundant alpha activity in REM sleep, means that these two stages present activity in an overlapping frequency range, which explains the proximity of the divergence values obtained. Difficulty in detecting N1 and REM sleep has also been reported using other measures [28,29].

5 Discussion
In the first place, we showed the existence of a close relationship between the square of the Euclidean metric and the Jensen-Shannon divergence, establishing that both belong to the same family of functionals. Based on this, we introduced a family of divergences, called γ-Divergences, that rely on the convexity of their generating functions. We demonstrated that this new family satisfies all the requirements of a generalized divergence. Then we studied our divergence for small PDF variations, revealing that its behavior is quadratic in the introduced variation. Subsequently, we introduced weights into the divergence to give more importance to a specific distribution, according to the needs of the problem under analysis. Finally, we generalized the divergence to more than two probability distributions, allowing its use in problems involving N distributions, for example multidimensional signal analysis.
Next, we defined a generalized entropy based on the properties of convex functions. This entropy includes Shannon's entropy as a particular case, and we showed that it belongs to the larger family of Salicrú entropies. We proved that this new entropy satisfies the requirements of a generalized entropy in the context of information theory, and we established the relationship between the γ-Divergence and the "Jensen-like" divergence through this generalized entropy.
Finally, we applied the γ-Divergence to simulated and real sequences. We showed that all the functions g(x) could detect, with high significance, the exact point where sequences with different probabilities were joined. Moreover, we studied the detection threshold, the point beyond which the divergence can no longer distinguish between the two signals. We then analysed an EEG signal from a sleeping patient and could detect the points where the signal changes its dynamics due to a change in the sleep state, showing that the γ-Divergence can be an alternative tool for detecting different sleep states.
It is also important to mention that, when using divergences, we must consider the significance of the values obtained. This significance is what allows us to say whether the distance found is real and not merely a product of statistical fluctuations. In this work we used the standard deviation calculated over M realizations as a measure of significance. However, often only a single realization is available, rendering this approach useless; a theoretical study of the statistics of these divergences is therefore necessary. Some studies on this topic have already been carried out by Grosse et al. for the Jensen-Shannon divergence [30]. Similar studies should be carried out for the γ-Divergence family; this exceeds the scope of the present work but will be addressed in the near future.
Finally, we know that the definition of a metric is much stronger than that of a divergence; because of that, we want to know whether the γ-Divergence introduced in this work meets the definition of a metric. For this aim, it is necessary to demonstrate that the divergence complies with the triangle inequality [1]. Proving this property for a whole family of divergences is not an easy task. For this reason, in a future work we will investigate the requirements that the function g(x) must fulfil for the square root of the γ-Divergence to be considered a true metric.

Figure 4: Application of the γ-Divergence over a sleep EEG signal using a running window. A) The EEG signal is composed of 5 sub-signals belonging to different sleep stages (Awake, REM, N1, N2, N3); each stage has 6000 points, corresponding to 60 s of recording (dashed lines). B) D_γ computed with the running-window method for the four g(x) functions compared in the study; the signal was quantized with permutation vectors with parameters d = 4 and τ = 1. For all functions, the maximum values D_{γmax} were reached at the exact points where transitions between sleep states exist.