Preprint Article, Version 1. This version is not peer-reviewed.

Topological Information Data Analysis: Poincare-Shannon Machine and Statistical Physic of Finite Heterogeneous Systems

Version 1 : Received: 9 April 2018 / Approved: 12 April 2018 / Online: 12 April 2018 (05:35:26 CEST)

How to cite: Baudot, P.; Tapia, M.; Goaillard, J. Topological Information Data Analysis: Poincare-Shannon Machine and Statistical Physic of Finite Heterogeneous Systems. Preprints 2018, 2018040157 (doi: 10.20944/preprints201804.0157.v1).

Abstract

This paper establishes methods that quantify the structure of statistical interactions within a given data set, using the cohomological characterization of information theory by finite methods, and expresses them in terms of statistical physics and machine learning. Following [1–3], we show directly that the k multivariate mutual informations (I_k) are k-coboundaries. The k-cocycles are given by I_k = 0, a condition that generalizes statistical independence to arbitrary dimension k. The topological approach makes it possible to investigate Shannon's information in the multivariate case without assuming independent, identically distributed variables. We develop the computationally tractable subcase of simplicial information cohomology, represented by entropy (H_k) and information (I_k) landscapes. The I_1 component defines a self-internal energy functional U_k, and the (−1)^k I_k, k ≥ 2, components define the contribution of the k-body interactions to a free energy functional G_k. The set of information paths in simplicial structures is in bijection with the symmetric group and random processes; it provides a topological expression of the 2nd law and points toward a discrete Noether theorem (1st law). The local minima of free energy, related to conditional information negativity and the non-Shannonian cone of Yeung [4], characterize a minimum free energy complex. This complex formalizes the minimum free energy principle in topology, provides a definition of a complex system, and characterizes a multiplicity of local minima that quantifies the diversity observed in biology. Finite data size effects and estimation bias severely constrain the effective computation of the information topology on data, and we provide simple statistical tests for the undersampling bias and for the k-dependences following [5]. We give an example application of these methods to genetic expression and cell-type classification. The maximal positive I_k identifies the variables that co-vary the most in the population, whereas the minimal negative I_k identifies clusters and the variables that differentiate and segregate the most. The methods uncover biologically relevant I_10 with a sample size of 41. The approach establishes generic methods to quantify epigenetic information storage and a unified formalism for unsupervised epigenetic learning.
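
As a rough illustration of the quantities named in the abstract, the sketch below computes a k-variate mutual information as an alternating sum of joint entropies over all non-empty subsets of the variables. This alternating-sum form is the standard definition and is assumed here, since the abstract does not state the formula. The function names `joint_entropy` and `multivariate_mutual_information`, the plug-in (empirical frequency) entropy estimate, and the XOR toy data are all hypothetical choices made for this example; they are not the authors' implementation and they ignore the undersampling bias the abstract warns about.

```python
from collections import Counter
from itertools import combinations
from math import log2


def joint_entropy(samples, idx):
    """Plug-in Shannon entropy (in bits) of the columns `idx` of `samples`.

    Assumption: probabilities are estimated by raw empirical frequencies,
    with no correction for finite-sample bias.
    """
    counts = Counter(tuple(row[i] for i in idx) for row in samples)
    n = len(samples)
    return -sum(c / n * log2(c / n) for c in counts.values())


def multivariate_mutual_information(samples, idx):
    """I_k over the columns `idx`, assuming the alternating-sum definition:

    I_k = sum over non-empty subsets S of idx of (-1)^(|S|-1) * H(X_S).
    """
    total = 0.0
    for size in range(1, len(idx) + 1):
        sign = (-1) ** (size - 1)
        for subset in combinations(idx, size):
            total += sign * joint_entropy(samples, subset)
    return total


# Toy data (hypothetical): two independent fair bits and their XOR.
xor_samples = [(x, y, x ^ y) for x in (0, 1) for y in (0, 1)]

i2 = multivariate_mutual_information(xor_samples, (0, 1))     # pairwise term
i3 = multivariate_mutual_information(xor_samples, (0, 1, 2))  # triple term

print(f"I_2(X;Y)   = {i2:+.3f} bits")
print(f"I_3(X;Y;Z) = {i3:+.3f} bits")
```

On this toy data the pairwise term vanishes (the two input bits are independent) while the triple term is negative. This is the kind of purely higher-order dependence that a pairwise analysis cannot see. The number of subsets grows as 2^k, so this brute-force enumeration is only practical for small k.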

Subject Areas

information theory; cohomology; algebraic topology; topological data analysis; genetic expression; epigenetics; machine learning; statistical physics; multivariate mutual information; complex systems; biodiversity
