Preprint
Review

This version is not peer-reviewed.

Biologically Informed Graph Contrastive Learning for Microbiome Data Analysis: A Survey

Submitted:

05 June 2026

Posted:

08 June 2026


Abstract
The human microbiome is a complex, dynamic and highly structured ecosystem whose analysis requires computational methods able to capture relationships among microbial taxa, genes, metabolic pathways, host factors, environmental exposures and disease phenotypes. Conventional machine-learning pipelines often represent microbiome samples as independent high-dimensional abundance vectors, thereby neglecting ecological, phylogenetic and functional dependencies among microbial entities. Graph-based learning provides a natural framework for modelling such dependencies, whereas graph contrastive learning (GCL) offers a self-supervised paradigm for learning robust representations from graph-structured data under limited label availability. This survey reviews the emerging intersection between GCL and microbiome data analysis. We first discuss the biological and computational characteristics of microbiome data, including sparsity, zero inflation, compositionality, batch effects, cohort heterogeneity and weak supervision. We then organize microbiome graph representations into taxa–taxa association networks, phylogenetic graphs, sample similarity graphs, microbe–disease association networks, host–microbe graphs, metabolic graphs and heterogeneous multi-omics graphs. Next, we summarize the foundations of GCL, including view generation, positive and negative pair construction, contrastive objectives, negative-free learning and multi-view representation learning.

Key Points

  • Microbiome data are naturally relational, but many computational pipelines still treat microbial profiles as flat abundance vectors.
  • Graph contrastive learning can exploit unlabelled microbiome data by aligning biologically related graph views while reducing dependence on expensive clinical labels.
  • Microbiome-oriented GCL requires biologically informed augmentations because generic perturbations may destroy ecological, phylogenetic, functional or compositional meaning.
  • A reliable benchmark should combine leave-one-study-out testing, cross-cohort transfer, graph-construction stress tests, biological plausibility checks and reproducibility reporting.
  • Clinical use requires explicit control of batch effects, confounding, dataset shift, false invariances and the uncertainty of inferred microbial networks.

1. Introduction

The human microbiome is increasingly recognized as a determinant of health and disease, influencing metabolism, immunity, inflammation, drug response and host–environment interactions. Advances in high-throughput sequencing have enabled the characterization of microbial communities through 16S rRNA gene sequencing, shotgun metagenomics, metatranscriptomics, metaproteomics and metabolomics. These technologies generate heterogeneous data describing taxonomic profiles, functional pathways, microbial genes, metabolites and host-associated phenotypes [1,2,3,4].
Despite these advances, microbiome data analysis remains computationally challenging. Microbiome datasets are usually high-dimensional, sparse and zero-inflated, with a number of microbial features that can exceed the number of available samples. Sequencing profiles are also compositional: most analyses observe relative abundances rather than absolute microbial loads, and naive correlation or distance-based analyses may therefore produce spurious associations [5,6,7,8]. Microbiome studies are further affected by cohort-specific factors, including sequencing protocol, DNA extraction, diet, geography, medication, clinical phenotype definitions and population structure. These sources of variation limit the generalizability of supervised models and complicate biological interpretation.
Most conventional machine-learning approaches represent a microbiome sample as a fixed-length abundance vector whose entries correspond to operational taxonomic units, amplicon sequence variants, species, genes or pathways. Although this representation has been successful in several classification tasks, it does not explicitly model the relational structure of microbial ecosystems. Microorganisms interact through ecological, metabolic, competitive, cooperative and host-mediated mechanisms. They are also connected by phylogenetic ancestry, functional redundancy, metabolic exchange and disease-associated relationships. These observations suggest that microbiome data are naturally suited to graph-based modelling.
Graph neural networks (GNNs) provide a flexible framework for learning from non-Euclidean relational data. In microbiome research, graphs may represent microbial co-abundance networks, phylogenetic trees, sample similarity networks, microbe–disease associations, host–microbe relationships or heterogeneous multi-omics systems. By propagating information over nodes and edges, GNNs can learn representations that integrate microbial features with relational context. Existing work has explored graph-based learning for microbial disease classification, co-abundance network classification, biomarker discovery, phylogeny-aware phenotype prediction and microbe–disease association prediction [9,10,11,12,13,14,15,16].
However, fully supervised GNNs remain constrained by the limited availability of labelled microbiome datasets and by the difficulty of transferring learned models across cohorts. Graph contrastive learning (GCL), a branch of self-supervised graph representation learning, offers a promising solution. GCL learns representations by contrasting two or more views of the same graph, node, subgraph or biological entity. These views may be generated through structural perturbations, feature masking, graph diffusion, subgraph sampling, multi-scale representations, cross-omics mappings or temporal sampling. The objective is to maximize agreement between related views while separating unrelated ones in latent space [17,18,19,20,21,22].
The intersection between GCL and microbiome data analysis is still underdeveloped. While generic GCL has been extensively studied in citation networks, social networks and molecular graphs, microbiome applications raise specific methodological constraints. Randomly dropping edges may remove plausible ecological interactions; masking rare taxa may eliminate low-abundance biomarkers; and arbitrary subgraph sampling may disrupt phylogenetic or metabolic organization. Therefore, microbiome-oriented GCL should move beyond generic augmentations and incorporate biological priors, uncertainty estimates, batch-aware sampling and compositional-aware preprocessing.
Figure 1. Conceptual workflow for biologically informed graph contrastive learning in microbiome data analysis.
This survey reviews graph contrastive learning from the perspective of microbiome data analysis. We organize the literature around three questions: (i) how can microbiome data be represented as graphs? (ii) how can contrastive learning be adapted to these graph representations? and (iii) how can learned representations support biomedical prediction and discovery? By addressing these questions, we aim to provide a methodological roadmap for robust, interpretable and biologically grounded GCL in microbiome research.

2. Microbiome Data as Graphs

A graph is defined as $G = (V, E, X)$, where $V$ is the set of nodes, $E$ is the set of edges and $X$ denotes node, edge or graph-level attributes. In microbiome studies, the definition of nodes and edges depends on the biological question, sequencing modality and available metadata. Table 1 summarizes common graph representations.
Graph construction is not a neutral preprocessing step. The choice of normalization, association metric, thresholding strategy, taxonomic resolution and metadata adjustment can substantially affect the resulting topology. Methods such as SparCC and SPIEC-EASI were developed to mitigate compositional artifacts in microbial association networks [5,6], whereas software such as NetCoMi supports reproducible construction, analysis and comparison of microbiome networks [23]. For GCL, graph construction has an even stronger role because it defines the views that will guide self-supervised learning. If these views are biologically meaningful, the model can learn robust and transferable representations; if they are poorly designed, the contrastive objective may enforce invariances that are technical rather than biological.
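The sensitivity of topology to construction choices can be illustrated with a minimal sketch. It assumes that abundances have already been transformed (for example by a log-ratio transform) and uses a plain Spearman correlation with a hard threshold as a simple stand-in; it is not an implementation of SparCC or SPIEC-EASI. The function names `ranks`, `pearson` and `spearman_graph` are hypothetical.

```python
import math

def ranks(values):
    """Average 1-based ranks, with ties sharing their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def pearson(a, b):
    """Pearson correlation; returns 0.0 if either vector is constant."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    if va == 0 or vb == 0:
        return 0.0
    return cov / math.sqrt(va * vb)

def spearman_graph(profiles, threshold):
    """profiles: list of samples, each a list of per-taxon values.
    Returns undirected edges (i, j), i < j, with |rho| >= threshold."""
    p = len(profiles[0])
    cols = [ranks([s[j] for s in profiles]) for j in range(p)]
    edges = set()
    for i in range(p):
        for j in range(i + 1, p):
            if abs(pearson(cols[i], cols[j])) >= threshold:
                edges.add((i, j))
    return edges

# Toy data: 4 samples, 3 taxa (hypothetical values).
toy = [[1, 2, 5], [2, 4, 3], [3, 6, 4], [4, 8, 1]]
strict = spearman_graph(toy, 0.9)
loose = spearman_graph(toy, 0.7)
```

On this toy input the stricter threshold keeps a single edge whereas the looser one yields a fully connected graph, which is the kind of topological change that would then propagate into any contrastive view built on top of it.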
Table 2 lists public resources and how they can be mapped to candidate graph formulations and augmentation policies. The table is intended as a practical starting point for benchmark design rather than as an exhaustive catalogue.

3. Foundations of Graph Contrastive Learning

Graph contrastive learning learns representations by encouraging agreement between related graph views, as depicted in Figure 2. Given a graph $G = (V, E, X)$, two stochastic transformations $t_1, t_2 \in \mathcal{T}$ generate two views $G^{(1)} = t_1(G)$ and $G^{(2)} = t_2(G)$. A GNN encoder $f_\theta$ maps each view to hidden representations $H^{(k)} = f_\theta(G^{(k)})$, and a projection head $g_\phi$ maps them to the contrastive space, $Z^{(k)} = g_\phi(H^{(k)})$. For graph-level learning, a readout function $\rho$ aggregates node embeddings into a graph embedding $h_G = \rho(\{h_v : v \in V\})$.
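The view-generation step can be sketched as follows, assuming two generic transformations (random edge dropping and feature-column masking) as the family of augmentations. The encoder and projection head are omitted. The names `drop_edges`, `mask_features` and `two_views`, and the default probabilities, are illustrative assumptions rather than settings from any cited method.

```python
import random

def drop_edges(edges, p_drop, rng):
    """Keep each edge independently with probability 1 - p_drop."""
    return [e for e in edges if rng.random() >= p_drop]

def mask_features(x, p_mask, rng):
    """Zero out whole feature columns, each with probability p_mask."""
    d = len(x[0])
    keep = [rng.random() >= p_mask for _ in range(d)]
    return [[v if k else 0.0 for v, k in zip(row, keep)] for row in x]

def two_views(edges, x, p_edge=0.2, p_feat=0.2, seed=0):
    """Apply two independent stochastic transformations to (edges, x)."""
    rng = random.Random(seed)
    v1 = (drop_edges(edges, p_edge, rng), mask_features(x, p_feat, rng))
    v2 = (drop_edges(edges, p_edge, rng), mask_features(x, p_feat, rng))
    return v1, v2

# Toy graph: 4 nodes, 4 edges, 3 features per node (hypothetical values).
edges = [(0, 1), (1, 2), (2, 3), (0, 3)]
x = [[1.0, 2.0, 3.0], [0.5, 0.1, 0.0], [2.0, 2.0, 2.0], [0.0, 1.0, 0.0]]
(e1, x1), (e2, x2) = two_views(edges, x, p_edge=0.5, p_feat=0.5, seed=1)
```

Both views keep the node set and feature dimensionality of the input, so the same encoder can be applied to each.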
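The most common objective is the InfoNCE or NT-Xent loss. For an anchor representation $z_i$, a positive view $z_i^{+}$ and a set of candidates $\mathcal{B}$, the loss is
$$\mathcal{L}_i = -\log \frac{\exp(\mathrm{sim}(z_i, z_i^{+}) / \tau)}{\sum_{z_j \in \mathcal{B}} \exp(\mathrm{sim}(z_i, z_j) / \tau)},$$
where $\mathrm{sim}(\cdot, \cdot)$ is commonly cosine similarity and $\tau$ is a temperature parameter. A symmetric loss averages both directions between two views,
$$\mathcal{L}_{\mathrm{sym}} = \frac{1}{2N} \sum_{i=1}^{N} \left[ \ell(z_i^{(1)}, z_i^{(2)}) + \ell(z_i^{(2)}, z_i^{(1)}) \right],$$
where $\ell(a, b)$ denotes the loss above evaluated with anchor $a$ and positive $b$.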
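A minimal numerical sketch of the symmetric objective is given below. It assumes that the candidate set for each anchor consists of the $N$ embeddings of the other view in the same batch (cross-view negatives only); methods such as GRACE additionally use intra-view negatives, which are not included here. Function names and the default temperature are illustrative.

```python
import math

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def info_nce(anchors, positives, tau=0.5):
    """Mean over anchors of -log softmax of the matching positive.
    Candidates for anchor i are all rows of `positives`."""
    total = 0.0
    for i, a in enumerate(anchors):
        logits = [cosine(a, p) / tau for p in positives]
        m = max(logits)  # subtract the max for numerical stability
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denom - logits[i]
    return total / len(anchors)

def symmetric_info_nce(z1, z2, tau=0.5):
    """Average of the two directions, view 1 -> view 2 and view 2 -> view 1."""
    return 0.5 * (info_nce(z1, z2, tau) + info_nce(z2, z1, tau))

# Two toy batches: matched views versus deliberately mismatched views.
z1 = [[1.0, 0.0], [0.0, 1.0]]
aligned = symmetric_info_nce(z1, [[1.0, 0.0], [0.0, 1.0]])
misaligned = symmetric_info_nce(z1, [[0.0, 1.0], [1.0, 0.0]])
```

The matched batch yields a much lower loss than the mismatched one, which is the behaviour the objective is meant to reward.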
Contrast may be performed at different levels. Node-level contrast aligns representations of the same node across two graph views; graph-level contrast aligns embeddings of the same graph under two augmentations; local–global contrast aligns node or patch embeddings with a graph-level summary; and multi-view contrast aligns structurally different but semantically related graphs.
GCL methods differ mainly in how they define views and losses. Deep Graph Infomax maximizes agreement between local patch representations and global graph summaries [17]. InfoGraph extends mutual-information maximization to graph-level representation learning [28]. GraphCL systematically studies node dropping, edge perturbation, attribute masking and subgraph sampling for graph-level learning [18]. GRACE applies augmentation-based node-level contrast to graph data [19]. MVGRL contrasts first-order neighborhood views with graph diffusion views [20]. GCA introduces adaptive augmentations based on topological and semantic importance [21]. GCC uses contrastive pretraining to transfer graph structural representations across datasets [29]. JOAO automatically searches augmentation policies [30]. AD-GCL learns adversarial augmentations that avoid trivial or overly destructive transformations [31]. BGRL replaces explicit negative pairs with a bootstrapped objective using online and target encoders [32]. These approaches provide design patterns for microbiome GCL, but their augmentations must be adapted to biological constraints.
The alignment–uniformity perspective is useful for interpreting contrastive objectives. Alignment encourages representations of positive pairs to be close,
$$\mathcal{L}_{\mathrm{align}} = \mathbb{E}_{(x, x^{+})} \left\| f(x) - f(x^{+}) \right\|_2^2,$$
whereas uniformity prevents collapse by spreading representations over the hypersphere [33]. In microbiome applications, alignment should encode meaningful invariances, such as robustness to sequencing depth or low-confidence edges, but not invariance to clinically relevant factors such as disease state or medication exposure. This distinction is central: a contrastive model can perform poorly if positive and negative pairs are defined in a way that encodes cohort labels or technical artifacts rather than biological similarity.
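Both quantities can be monitored on held-out embeddings. The sketch below assumes L2-normalized embeddings and uses the commonly used uniformity form $\log \mathbb{E} \exp(-t \| f(x) - f(y) \|_2^2)$ with $t = 2$; this formula and the value of $t$ are not stated in the text above and are included here as assumptions.

```python
import math

def sq_dist(a, b):
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def alignment(z, z_pos):
    """Mean squared distance between positive pairs (lower is better)."""
    return sum(sq_dist(a, b) for a, b in zip(z, z_pos)) / len(z)

def uniformity(z, t=2.0):
    """Log of the mean Gaussian potential over all distinct pairs.
    Equals 0 for fully collapsed embeddings; more negative is more uniform."""
    vals = [math.exp(-t * sq_dist(z[i], z[j]))
            for i in range(len(z)) for j in range(i + 1, len(z))]
    return math.log(sum(vals) / len(vals))

spread = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]]
collapsed = [[1.0, 0.0]] * 4
u_spread = uniformity(spread)
u_collapsed = uniformity(collapsed)
a_identical = alignment(spread, spread)
```

In a microbiome setting, computing alignment separately for positive pairs that differ only in a technical factor and for pairs that differ in a clinical factor is one way to check which invariances the model has actually absorbed.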
Negative-free methods also matter for biomedical data. In microbiome cohorts, false negatives are common: two samples from different patients may share a phenotype, environment or dietary exposure even if they are treated as negatives. Bootstrap methods such as BGRL reduce dependence on explicit negatives and may therefore be useful when cohort labels, phenotype labels or disease annotations are incomplete. However, negative-free learning does not remove the need for biologically meaningful views; it only changes the mechanism used to avoid representational collapse.
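The mechanism can be sketched loosely as follows: an online network predicts the embedding produced by a slowly moving target network, and the target is updated as an exponential moving average of the online parameters. This is a simplified illustration in the spirit of bootstrapped methods such as BGRL [32]; the loss scaling, predictor network and decay schedule of the original method are not reproduced, and all names and defaults are assumptions.

```python
import math

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def bootstrap_loss(online_preds, target_embs):
    """1 - mean cosine similarity between online predictions and target
    embeddings of the other view. No negative pairs are involved."""
    sims = [cosine(p, t) for p, t in zip(online_preds, target_embs)]
    return 1.0 - sum(sims) / len(sims)

def ema_update(target_params, online_params, decay=0.99):
    """Move target parameters slowly towards the online parameters."""
    return [decay * t + (1.0 - decay) * o
            for t, o in zip(target_params, online_params)]

loss_same = bootstrap_loss([[1.0, 0.0]], [[2.0, 0.0]])
loss_opposite = bootstrap_loss([[1.0, 0.0]], [[-1.0, 0.0]])
new_target = ema_update([0.0, 0.0], [1.0, 1.0], decay=0.9)
```

Because no sample is ever pushed away from another, two patients who happen to share a phenotype are not penalized for having similar embeddings.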
Table 3. Foundational GCL methods and their implications for microbiome data analysis.
| Method | Contrastive level | View construction | Objective | Microbiome implication |
|---|---|---|---|---|
| DGI [17] | Local–global | Original and corrupted graph | Mutual-information-inspired discrimination | Useful for aligning taxa/subgraph embeddings with whole-community summaries. |
| InfoGraph [28] | Graph-level | Graph and substructure summaries | MI maximization across scales | Relevant for sample-level representations and microbial community classification. |
| GraphCL [18] | Graph-level | Node drop, edge perturbation, attribute masking, subgraph sampling | NT-Xent contrast | Provides the canonical augmentation vocabulary, but requires biological safeguards. |
| GRACE [19] | Node-level | Edge dropping and feature masking | Symmetric InfoNCE | Relevant for taxon embeddings and node-level biomarker discovery. |
| MVGRL [20] | Node and graph | Neighborhood and diffusion views | Multi-view contrast | Suggests contrasting co-abundance graphs with diffusion or functional similarity views. |
| GCA [21] | Node-level | Adaptive topology/feature perturbation | InfoNCE | Motivates confidence-aware edge dropping and prevalence-aware masking. |
| GCC [29] | Subgraph/ego-network | Random-walk-based subgraphs | Contrastive pretraining | Useful for transferring structural motifs across studies or body sites. |
| JOAO [30] | Graph-level | Automatically selected augmentations | Bilevel/automated augmentation search | Suggests data-driven search constrained by compositional and ecological rules. |
| AD-GCL [31] | Graph-level | Learned adversarial augmentations | Contrastive adversarial objective | Useful for stress-testing whether augmentations are too weak or biologically destructive. |
| BGRL [32] | Node-level | Two augmented views | Negative-free bootstrap | Attractive when false negatives are likely across cohorts and phenotypes. |

4. Biologically Informed GCL for Microbiome Data

Generic perturbations may be effective in benchmark graph datasets, but they can be inappropriate for microbial ecosystems. For example, an edge in a co-abundance graph may represent a putative ecological association or a compositional artifact; a low-degree taxon may be rare but clinically relevant; and a subgraph may correspond to a phylogenetic clade or a disease-associated microbial module. Therefore, contrastive view construction should preserve biological semantics while introducing useful invariances.
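One way to make edge perturbation less destructive is to tie the drop probability to an edge-level confidence score, so that well-supported associations are rarely removed. The sketch below is a hypothetical scheme: it assumes each edge carries a confidence in $[0, 1]$ (for example a bootstrap support value) and uses a linear rule $p_{\text{drop}} = p_{\max}(1 - \text{confidence})$, which is an illustrative choice rather than a published policy.

```python
import random

def confidence_aware_drop(edges, p_max=0.5, seed=0):
    """edges: list of (u, v, confidence) with confidence in [0, 1].
    Low-confidence edges are dropped more often; an edge with
    confidence 1.0 is never dropped."""
    rng = random.Random(seed)
    kept = []
    for u, v, conf in edges:
        p_drop = p_max * (1.0 - conf)
        if rng.random() >= p_drop:
            kept.append((u, v, conf))
    return kept

# Hypothetical co-abundance edges with bootstrap-style support values.
edges = [(0, 1, 1.0), (1, 2, 0.0), (2, 3, 1.0), (0, 3, 0.0)]
view = confidence_aware_drop(edges, p_max=1.0, seed=42)
```

With `p_max=1.0` the unsupported edges are always removed and the fully supported ones always kept; smaller values of `p_max` interpolate towards the unperturbed graph.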
Compositionality is the first design constraint. Let $c_s = (c_{s1}, \ldots, c_{sp})$ be the count vector for sample $s$ and $N_s = \sum_j c_{sj}$ its library size. The relative abundance vector $x_s = c_s / N_s$ satisfies $\sum_j x_{sj} = 1$ and lies in the simplex rather than in unconstrained Euclidean space. A change in one component necessarily affects the observed proportions of other components. For this reason, GCL pipelines should avoid applying ordinary Euclidean perturbations directly to raw proportions unless the biological meaning is clear. A common strategy is to apply a zero-handling step followed by a log-ratio transformation. For example, the centered log-ratio transformation is
$$\mathrm{clr}(x_{sj}) = \log \frac{x_{sj}}{g(x_s)}, \qquad g(x_s) = \left( \prod_{j=1}^{p} x_{sj} \right)^{1/p}.$$
CLR is convenient for exploratory representation learning and feature perturbation, but its coordinates are linearly dependent. ILR coordinates and phylogenetic balances, including PhILR, provide orthonormal log-ratio representations and can incorporate tree structure [34,35,36]. In practice, CLR is often easier to combine with graph encoders, whereas ILR/PhILR is preferable when the study aims to preserve phylogenetic balance interpretation.
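A minimal sketch of the CLR transform follows. It assumes a simple additive pseudo-count applied to counts before closure; the value 0.5 is an illustrative default, not a recommendation from the text.

```python
import math

def clr(counts, pseudo_count=0.5):
    """Centered log-ratio of one sample.
    A pseudo-count is added so that zeros do not produce log(0)."""
    shifted = [c + pseudo_count for c in counts]
    total = sum(shifted)
    logs = [math.log(s / total) for s in shifted]
    mean_log = sum(logs) / len(logs)  # log of the geometric mean
    return [l - mean_log for l in logs]

sample = [120, 0, 30, 850]      # hypothetical counts for four taxa
z = clr(sample)

# With strictly positive counts and no pseudo-count, CLR does not depend
# on sequencing depth: rescaling the counts leaves the output unchanged.
shallow = clr([1, 2, 4], pseudo_count=0.0)
deep = clr([10, 20, 40], pseudo_count=0.0)
```

The coordinates of `z` sum to zero, which is the linear dependence noted above, and the depth-invariance check illustrates why log-ratio space is a more natural place to apply feature perturbations than raw proportions.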
Pseudo-count and zero-handling choices should be reported and stress-tested. A practical protocol is: (i) harmonize taxonomy across cohorts; (ii) remove features below a pre-specified prevalence threshold before adding pseudo-counts; (iii) distinguish structural zeros, sampling zeros and missing-by-design features where possible; (iv) use a small pseudo-count or multiplicative replacement consistently across training and test cohorts; and (v) perform sensitivity analysis over at least two zero-handling strategies. Augmentations should then be applied in log-ratio or balance space, or should explicitly preserve the simplex if applied to relative abundances. When taxa are inconsistently detected across cohorts, prevalence-aware masking and cohort-specific missingness indicators should be preferred over treating all absent taxa as biological negatives.
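Steps (ii) and (iv) of this protocol can be sketched as below. The prevalence threshold and the replacement value `delta` are illustrative assumptions; the multiplicative rule replaces each zero with `delta` and shrinks the non-zero parts so that the composition still sums to one.

```python
def prevalence_filter(samples, min_prevalence):
    """Return indices of features detected (count > 0) in at least
    `min_prevalence` (a fraction) of the samples."""
    n = len(samples)
    p = len(samples[0])
    return [j for j in range(p)
            if sum(1 for s in samples if s[j] > 0) / n >= min_prevalence]

def multiplicative_replacement(props, delta=1e-3):
    """Replace zeros in a composition by `delta` and rescale the
    non-zero parts so that the total remains 1."""
    n_zero = sum(1 for x in props if x == 0)
    return [delta if x == 0 else x * (1.0 - n_zero * delta) for x in props]

counts = [[5, 0, 1], [3, 0, 0], [2, 1, 0]]   # hypothetical 3 samples x 3 taxa
kept = prevalence_filter(counts, min_prevalence=0.5)
replaced = multiplicative_replacement([0.5, 0.5, 0.0], delta=0.01)
```

For the sensitivity analysis in step (v), the same downstream pipeline would be run once with this replacement and once with an additive pseudo-count, and the resulting embeddings or predictions compared.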
Batch effects and confounding require special attention in the contrastive setting. A model can learn to align views that share the same sequencing center, geography, medication profile or diet rather than the same biological phenotype. Positive pairs should therefore not be defined only by cohort membership or technical similarity. Recommended safeguards include study-balanced mini-batches, metadata-stratified negative sampling, leave-one-study-out validation, confounder prediction tests from learned embeddings, and, where appropriate, batch correction or meta-analysis tools designed for microbiome data such as SIAMCAT, MMUPHin and ConQuR [37,38,39]. Domain-adversarial losses can also be used to discourage embeddings from encoding study labels, but such strategies must be checked to avoid removing clinically meaningful geography- or diet-associated biology.
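Study-balanced mini-batches, the first safeguard listed above, can be sketched as follows. The function name and the decision to truncate every study to the size of the smallest one are assumptions made for brevity; larger studies are therefore undersampled within an epoch.

```python
import random
from collections import defaultdict

def study_balanced_batches(study_labels, per_study, seed=0):
    """Build mini-batches containing `per_study` samples from every study.
    Returns a list of batches, each a list of sample indices."""
    rng = random.Random(seed)
    by_study = defaultdict(list)
    for idx, study in enumerate(study_labels):
        by_study[study].append(idx)
    for idxs in by_study.values():
        rng.shuffle(idxs)
    # The smallest study limits how many balanced batches can be formed.
    n_batches = min(len(v) for v in by_study.values()) // per_study
    batches = []
    for b in range(n_batches):
        batch = []
        for study in sorted(by_study):
            batch.extend(by_study[study][b * per_study:(b + 1) * per_study])
        batches.append(batch)
    return batches

labels = ["A"] * 6 + ["B"] * 4   # hypothetical study membership
batches = study_balanced_batches(labels, per_study=2, seed=3)
```

Because every batch mixes studies in equal proportion, in-batch negatives cannot be separated from the anchor by study identity alone.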
Table 4 proposes a taxonomy of microbiome-specific contrastive augmentations with suggested safeguards. The numerical ranges are not universal hyperparameters; they should be treated as starting points for sensitivity analysis and adapted to graph density, sample size and biological question.
Several microbiome graph-learning studies motivate these directions, even when they do not explicitly use GCL. CACONET models microbial compositional-aware correlation networks for graph-level classification [10]; WSGMB uses weighted signed graph neural networks for microbial biomarker identification [11]; Ph-CNN, PopPhy-CNN and recent phylogeny-aware GNNs use evolutionary relationships for microbiome phenotype prediction [12,13,14]; and GCATCMDA explicitly combines graph neural networks and contrastive learning for microbe–disease association prediction [15]. These studies suggest that microbiome GCL can be developed along multiple axes: graph type, biological prior, contrastive level and downstream task.

5. Applications

Graph contrastive learning can support several microbiome-analysis tasks because it provides a flexible strategy for learning representations from relational biological data under limited supervision. This aspect is particularly relevant in microbiome studies, where labelled cohorts are often small, disease annotations may be noisy or heterogeneous, and large collections of unlabelled taxonomic, functional and multi-omic profiles are increasingly available. At present, direct use of GCL in microbiome data analysis is still emerging. Therefore, it is useful to distinguish between studies that explicitly apply contrastive objectives to microbiome-related graphs and adjacent graph-learning approaches that provide natural application scenarios for future GCL-based methods.
One immediate application is disease classification and patient stratification. Graph-based models use microbial relationships to predict disease status or clinical phenotype. Graph convolutional models have been used for multiclass disease classification from whole-community metagenomes by exploiting phylogenetic relationships among microbial taxa [9]. CACONET constructs microbial compositional-aware correlation networks and applies graph-level classification to distinguish colorectal cancer from healthy samples [10]. These studies are not primarily formulated as contrastive-learning methods, but they show that microbial graph topology contains predictive information that may be exploited by GCL through pretraining, multi-view learning or phenotype-aware contrastive objectives. Different posterior microbial networks inferred from the same condition could be treated as positive views, whereas networks from distinct phenotypes could be used as weakly supervised negative or contrasting views.
A second application is biomarker discovery. Traditional microbiome biomarker analysis often focuses on individual taxa whose abundance differs between conditions. Graph-based approaches extend this view by considering disease-associated interactions, network modules and signed relationships among microbes. WSGMB represents microbial communities as weighted signed graphs and uses graph neural networks to identify microbial biomarkers in colorectal cancer and Crohn’s disease [11]. From a GCL perspective, this task is attractive because contrastive pretraining can test whether candidate biomarkers remain stable under biologically meaningful perturbations, such as removal of low-confidence edges, taxonomic aggregation, feature masking or resampling of microbial subnetworks.
Microbe–disease association prediction is currently one of the clearest application areas for microbiome-oriented GCL. In this task, microbes and diseases are represented as nodes in a bipartite or heterogeneous graph, and the objective is to predict missing associations. GCATCMDA explicitly combines graph neural networks and contrastive learning to predict microbe–disease associations by exploiting microbe similarity, disease similarity and known association networks [15]. This problem is well suited to contrastive learning because known association matrices are sparse and incomplete: the absence of a reported association does not necessarily correspond to a true negative. Related graph representation learning methods, such as SARMDA, further show the relevance of graph autoencoders and adversarial regularization for the same task [16].
Beyond pairwise microbe–disease graphs, several microbiome problems require higher-order relationships. Diet, microbes and diseases form triplets in which food intake may reshape microbial communities and influence disease risk. LSCHNN addresses this type of problem using a lightweight single-view contrastive hypergraph neural network for food–microbe–disease association prediction [40]. Similarly, pharmacomicrobiomics requires predicting or explaining relationships between microbes and drugs. SMMDA integrates structure-sensitive transformer modules, learnable data augmentation and multi-view graph contrastive learning to predict drug-related microbes [41]. These applications illustrate how heterogeneous and hypergraph GCL can support microbiome-aware nutrition, therapy and drug discovery.
Finally, graph learning can contribute to microbial community modelling and temporal microbiome analysis. SIMBA-GNN integrates metabolic simulations and edge-aware graph transformers to predict microbial presence and relative abundance using mechanistic information such as cross-feeding probabilities, pathway activity fingerprints and microbe–microbe functional similarity [42]. Although SIMBA-GNN is not primarily presented as a GCL framework, it motivates mechanistic and temporal contrastive learning. Longitudinal samples from the same subject can define positive temporal views, whereas samples from unrelated subjects or disease states can define contrasting views. Simulation-derived graphs can provide counterfactual views for contrastive pretraining, allowing models to learn representations that are robust to sampling noise while remaining consistent with ecological and metabolic constraints.
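The temporal pairing idea can be sketched as below, assuming each record carries a sample identifier, a subject identifier and a collection time. Restricting positives to consecutive visits within a maximum gap is an illustrative design choice, since samples taken far apart may differ for clinically relevant reasons.

```python
def temporal_positive_pairs(records, max_gap):
    """records: list of (sample_id, subject_id, time).
    Returns pairs of consecutive samples from the same subject whose
    collection times differ by at most `max_gap`."""
    by_subject = {}
    for sample_id, subject_id, t in records:
        by_subject.setdefault(subject_id, []).append((t, sample_id))
    pairs = []
    for visits in by_subject.values():
        visits.sort()
        for (t1, a), (t2, b) in zip(visits, visits[1:]):
            if t2 - t1 <= max_gap:
                pairs.append((a, b))
    return pairs

# Hypothetical longitudinal records, with time in days.
records = [("s1", "p1", 0), ("s2", "p1", 7), ("s3", "p1", 60), ("s4", "p2", 0)]
pairs = temporal_positive_pairs(records, max_gap=14)
```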
Table 5. Representative application areas for GCL and related graph-learning methods in microbiome data analysis.
| Application | Representative study | Graph representation | Main methodological idea | Relevance for microbiome GCL |
|---|---|---|---|---|
| Disease classification | Khan et al. [9] | Phylogenetic graph of taxa | GCN for multiclass metagenomic classification | Taxonomic and phylogenetic structure can define biologically meaningful views. |
| CRC classification | CACONET [10] | Compositional-aware microbial correlation networks | Graph-level classification of posterior networks | Posterior or phenotype-specific networks can be contrasted as views. |
| Biomarker discovery | WSGMB [11] | Weighted signed microbial co-occurrence graphs | Signed GNN and node-importance scoring | Supports contrastive biomarker stability under perturbations. |
| Microbe–disease prediction | GCATCMDA [15] | Microbe similarity, disease similarity and association graphs | GNN plus contrastive learning | Direct example of contrastive graph learning for microbiome-related link prediction. |
| Higher-order association | LSCHNN [40] | Food–microbe–disease hypergraph | Single-view contrastive hypergraph neural network | Illustrates sparse higher-order microbiome-related GCL. |
| Microbe–drug prediction | SMMDA [41] | Heterogeneous drug–microbe graph | Transformer plus multi-view GCL | Shows relevance to pharmacomicrobiomics and drug discovery. |
| Community prediction | SIMBA-GNN [42] | Mechanistic microbe–metabolite–pathway graph | Simulation-augmented edge-aware graph transformer | Motivates mechanistic and temporal GCL with simulation-derived views. |

6. Evaluation and Benchmarking

Evaluation of microbiome GCL should go beyond standard predictive metrics. A model may achieve high accuracy in an internal split while failing to generalize across cohorts, sequencing protocols, populations or disease subtypes. Moreover, in graph-based microbiome learning, performance depends not only on the neural architecture but also on graph construction: normalization, association metric, sparsification threshold, taxonomic resolution and metadata adjustment. A rigorous evaluation framework should therefore assess predictive performance, external validity, graph robustness, biological plausibility, interpretability and reproducibility.
For disease classification and patient stratification, standard metrics include accuracy, balanced accuracy, precision, recall, F1-score, AUROC and AUPRC. In imbalanced clinical datasets, both AUROC and AUPRC should be reported together with class prevalence. Calibration should also be considered using Brier score, expected calibration error, calibration slope and calibration plots. For link-prediction tasks such as microbe–disease, food–microbe–disease or microbe–drug association prediction, AUROC, AUPRC, Hits@K, Recall@K, Precision@K and mean reciprocal rank can be used.
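Some of these metrics can be sketched in a few lines. The code below assumes that, for link prediction, each held-out true association is ranked against a list of candidate scores, with ties resolved pessimistically; that tie rule and the function names are assumptions.

```python
def rank_of(scores, true_index):
    """1-based rank of the true candidate among all candidates.
    Tied candidates are counted as ranked ahead (pessimistic)."""
    s = scores[true_index]
    ahead = sum(1 for i, v in enumerate(scores)
                if v > s or (v == s and i != true_index))
    return ahead + 1

def hits_at_k(ranks, k):
    """Fraction of queries whose true item is ranked within the top k."""
    return sum(1 for r in ranks if r <= k) / len(ranks)

def mean_reciprocal_rank(ranks):
    """Mean of 1 / rank over queries."""
    return sum(1.0 / r for r in ranks) / len(ranks)

def brier_score(probs, labels):
    """Mean squared error between predicted probabilities and 0/1 labels."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(probs)

r = rank_of([0.1, 0.9, 0.5], true_index=2)
ranks = [1, 2, 4]                      # hypothetical ranks for three queries
h2 = hits_at_k(ranks, k=2)
mrr = mean_reciprocal_rank(ranks)
bs = brier_score([0.9, 0.2], [1, 0])
```

Since unreported associations are not confirmed negatives, ranking metrics of this kind are usually easier to interpret than accuracy for association prediction.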
External validation is essential. A recommended protocol is leave-one-study-out (LOSO) validation: train on all but one cohort, tune hyperparameters without using the held-out study, and test once on the unseen cohort. This setting evaluates whether learned representations transfer across studies, populations and technical protocols. SIAMCAT provides a machine-learning framework for microbiome meta-analysis and cross-disease comparison [37]. Large cross-cohort evaluations have shown that microbiome classifiers may perform well in intra-cohort validation but worse in external validation [43]. For GCL, LOSO should compare (i) supervised non-graph baselines, (ii) supervised GNNs, (iii) GCL pretraining plus fine-tuning and (iv) zero-shot or linear-probe evaluation of pretrained embeddings.
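The LOSO splitting logic can be sketched as a small generator. It assumes one study label per sample; hyperparameter tuning would have to take place inside the training indices only, which is not shown here.

```python
def loso_splits(study_labels):
    """Yield (held_out_study, train_indices, test_indices),
    holding out each study in turn."""
    for held_out in sorted(set(study_labels)):
        train = [i for i, s in enumerate(study_labels) if s != held_out]
        test = [i for i, s in enumerate(study_labels) if s == held_out]
        yield held_out, train, test

labels = ["A", "A", "B", "C"]          # hypothetical study membership
splits = list(loso_splits(labels))
```

Each of the four model families listed above would be trained and evaluated on exactly the same splits, so that differences can be attributed to the method rather than to the partition.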
Graph robustness should be evaluated because microbiome graphs are inferred rather than directly observed. A benchmark should reconstruct graphs using at least two or three inference strategies, such as Spearman correlation, SparCC, SPIEC-EASI, SPRING or NetCoMi-supported workflows [5,6,23]. It should then vary edge threshold, graph density, taxonomic resolution and preprocessing strategy. Robustness can be quantified by measuring the stability of predictions, embeddings and explanations under graph reconstruction, bootstrap resampling and controlled perturbation.
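One simple robustness summary is the mean pairwise Jaccard overlap between the edge sets obtained from different inference strategies or thresholds. The sketch below treats edges as undirected and assumes the edge sets are given; the choice of Jaccard overlap as the stability measure is an illustrative assumption, and the same function can be applied to top-ranked taxa or explanation subgraphs.

```python
from itertools import combinations

def canonical(edges):
    """Represent undirected edges with a fixed node order."""
    return {tuple(sorted(e)) for e in edges}

def jaccard(a, b):
    """Jaccard overlap of two edge sets; two empty sets count as identical."""
    a, b = canonical(a), canonical(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def mean_pairwise_jaccard(edge_sets):
    """Average Jaccard overlap over all pairs of edge sets."""
    pairs = list(combinations(edge_sets, 2))
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)

# Hypothetical edge sets from three graph-inference strategies.
g_a = [(0, 1), (1, 2)]
g_b = [(1, 0), (2, 3)]
g_c = [(0, 1)]
stability = mean_pairwise_jaccard([g_a, g_b, g_c])
```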
Biological plausibility should be assessed by comparing model outputs with known microbiome biology. Important taxa, edges or modules identified by the model can be compared with curated microbe–disease resources such as Disbiome and GMrepo [25,26]. For functional data, enriched pathways or modules can be compared with known metabolic functions, inflammatory pathways or disease-specific microbial signatures. Interpretability should be evaluated as a stability property: node, edge and subgraph explanations should be compared across folds, random seeds, graph-construction pipelines and external cohorts.
Table 6 summarizes concrete evaluation dimensions, metrics, recommended protocols and representative examples for microbiome GCL benchmarking.
A minimal standardized benchmark suite should include: (i) a multi-study disease-classification task from curatedMetagenomicData; (ii) a microbe–disease link-prediction task using Disbiome or GMrepo; (iii) at least one graph-construction stress test based on SparCC, SPIEC-EASI and a rank-correlation graph; (iv) LOSO validation; (v) pseudo-count and CLR/ILR/PhILR sensitivity analysis; and (vi) ablations of each augmentation policy. This would provide a practical foundation for quantitative comparison even in a review-oriented field where no single benchmark is yet dominant.

7. Reporting Standards, Ethics and Clinical Translation

For review-oriented and methods-oriented bioinformatics work, reporting data and software availability is essential. At minimum, studies of microbiome GCL should report dataset provenance, inclusion and exclusion criteria, metadata harmonization, preprocessing steps, normalization, zero handling, pseudo-count strategy, log-ratio representation, graph-construction method, edge threshold, augmentation policy, negative sampling strategy, batch composition, encoder architecture, projection head, optimizer, temperature, random seeds, train–validation–test splits, LOSO design and ablation results.
Ethical and clinical considerations should also be explicit. Microbiome GCL models may encode dataset shift, technical artifacts or incomplete knowledge graphs as if they were biological signals. This is especially risky when models are used for disease prediction, biomarker discovery or intervention recommendation. Transparent reporting of learned invariances, external validation, uncertainty and failure modes is therefore required before clinical interpretation. In link-prediction tasks, incomplete knowledge graphs should not be treated as ground truth absence of association. In classification tasks, performance should be interpreted in light of confounders such as age, sex, diet, geography, medication, sequencing protocol and disease severity.

8. Future Directions

Promising future directions include phylogeny-aware GCL, heterogeneous graph contrastive learning, cross-omics contrast, temporal GCL, graph foundation models for microbiome data and interpretable microbiome digital twins. Particularly promising is the development of augmentation strategies that are not merely graph-theoretic but biologically grounded. Such strategies may allow GCL models to learn invariances that correspond to meaningful microbial properties, such as functional redundancy, taxonomic consistency, ecological stability and disease-specific dysbiosis.
Higher-order and heterogeneous graph learning is another important direction. Host–microbe–metabolite, diet–microbe–disease and drug–microbe–phenotype systems cannot always be represented adequately by pairwise homogeneous graphs. Multi-encoder architectures, hypergraphs and heterogeneous GNNs can incorporate these domain priors, and contrastive objectives can align different projections of the same biological system [40,41,44]. For microbiome data, such models should remain compositionality-aware and should avoid treating missing taxa, missing metabolites or incomplete associations as simple negatives.

9. Conclusion

Graph contrastive learning provides a compelling self-supervised framework for microbiome data analysis. By learning from graph-structured microbiome representations, GCL can reduce dependence on labelled data, improve robustness to noise and support integrative modelling of taxa, functions, metabolites, host variables and disease phenotypes. However, microbiome GCL should not simply import generic augmentations from standard graph benchmarks. Instead, it requires biologically informed graph construction, compositionality-aware feature spaces, batch- and confounder-aware pair selection, rigorous external validation and interpretable outputs. The development of such methods may contribute to more robust disease classifiers, more reliable biomarker discovery and ultimately to graph-based computational models of microbiome-mediated health and disease.

Data and Software Availability

This article is a review and does not report new experimental sequencing data, new clinical cohorts, new benchmark datasets or newly developed software. All datasets, databases, software tools and algorithms discussed in the manuscript are publicly available through the original publications, repositories or web resources cited in the main text and bibliography. No restricted, identifiable or patient-level data are distributed with this manuscript.

References

  1. Turnbaugh, P.J.; Ley, R.E.; Hamady, M.; Fraser-Liggett, C.M.; Knight, R.; Gordon, J.I. The human microbiome project. Nature 2007, 449, 804–810.
  2. Human Microbiome Project Consortium. Structure, function and diversity of the healthy human microbiome. Nature 2012, 486, 207–214.
  3. Qin, J.; Li, R.; Raes, J.; Arumugam, M.; Burgdorf, K.S.; Manichanh, C.; Nielsen, T.; Pons, N.; Levenez, F.; Yamada, T.; et al. A human gut microbial gene catalogue established by metagenomic sequencing. Nature 2010, 464, 59–65.
  4. Quince, C.; Walker, A.W.; Simpson, J.T.; Loman, N.J.; Segata, N. Shotgun metagenomics, from sampling to analysis. Nature Biotechnology 2017, 35, 833–844.
  5. Friedman, J.; Alm, E.J. Inferring correlation networks from genomic survey data. PLoS Computational Biology 2012, 8, e1002687.
  6. Kurtz, Z.D.; Müller, C.L.; Miraldi, E.R.; Littman, D.R.; Blaser, M.J.; Bonneau, R.A. Sparse and compositionally robust inference of microbial ecological networks. PLoS Computational Biology 2015, 11, e1004226.
  7. Gloor, G.B.; Macklaim, J.M.; Pawlowsky-Glahn, V.; Egozcue, J.J. Microbiome datasets are compositional: and this is not optional. Frontiers in Microbiology 2017, 8, 2224.
  8. Quinn, T.P.; Erb, I.; Gloor, G.; Notredame, C.; Richardson, M.F.; Crowley, T.M. A field guide for the compositional analysis of any-omics data. GigaScience 2019, 8, giz107.
  9. Khan, S.; Kelly, L.; Glickman, J.; Ghaoui, L.E.; et al. Multiclass disease classification from microbial whole-community metagenomes using graph convolutional neural networks. Pacific Symposium on Biocomputing 2020, 25, 223–234.
  10. Xu, Y.; et al. CACONET: a novel classification framework for microbial correlation networks. Bioinformatics 2022, 38, 1639–1648.
  11. Pan, S.; Jiang, X.; Zhang, K. WSGMB: weight signed graph neural network for microbial biomarker identification. Briefings in Bioinformatics 2024, 25, bbad448.
  12. Fioravanti, D.; Giarratano, Y.; Maggio, V.; Agostinelli, C.; Chierici, M.; Jurman, G.; Furlanello, C. Phylogenetic convolutional neural networks in metagenomics. BMC Bioinformatics 2018, 19, 49.
  13. Reiman, D.; Layden, B.T.; Dai, Y. PopPhy-CNN: a phylogenetic tree embedded architecture for convolutional neural networks to predict host phenotype from metagenomic data. IEEE Journal of Biomedical and Health Informatics 2020, 24, 2993–3001.
  14. Irwin, C.; Mignone, F.; Montani, S.; Portinale, L. Graph Neural Networks for Gut Microbiome Metaomic Data: A Preliminary Work. arXiv preprint arXiv:2407.00142 2024.
  15. Jiang, C.; et al. Predicting microbe-disease associations via graph neural network and contrastive learning. Frontiers in Microbiology 2024, 15, 1483983.
  16. He, L.; et al. Adversarial regularized autoencoder graph neural network for predicting microbe-disease associations. Briefings in Bioinformatics 2024, 25, bbae584.
  17. Veličković, P.; Fedus, W.; Hamilton, W.L.; Liò, P.; Bengio, Y.; Hjelm, R.D. Deep Graph Infomax. In Proceedings of the International Conference on Learning Representations, 2019.
  18. You, Y.; Chen, T.; Sui, Y.; Chen, T.; Wang, Z.; Shen, Y. Graph Contrastive Learning with Augmentations. In Advances in Neural Information Processing Systems, 2020, Vol. 33, pp. 5812–5823.
  19. Zhu, Y.; Xu, Y.; Yu, F.; Liu, Q.; Wu, S.; Wang, L. Deep Graph Contrastive Representation Learning. arXiv preprint arXiv:2006.04131 2020.
  20. Hassani, K.; Khasahmadi, A.H. Contrastive Multi-View Representation Learning on Graphs. In Proceedings of the 37th International Conference on Machine Learning, 2020, Vol. 119, pp. 4116–4126.
  21. Zhu, Y.; Xu, Y.; Yu, F.; Liu, Q.; Wu, S.; Wang, L. Graph Contrastive Learning with Adaptive Augmentation. In Proceedings of the Web Conference 2021, 2021, pp. 2069–2080.
  22. Ju, W.; Wang, Y.; Qin, Y.; Mao, Z.; Xiao, Z.; Luo, J.; Yang, J.; Gu, Y.; Wang, D.; Long, Q.; et al. Towards Graph Contrastive Learning: A Survey and Beyond. arXiv preprint arXiv:2405.11868 2024.
  23. Peschel, S.; Müller, C.L.; von Mutius, E.; Boulesteix, A.L.; Depner, M. NetCoMi: network construction and comparison for microbiome data in R. Briefings in Bioinformatics 2021, 22, bbaa290.
  24. Pasolli, E.; Schiffer, L.; Manghi, P.; Renson, A.; Obenchain, V.; Truong, D.T.; Beghini, F.; Malik, F.; Ramos, M.; Dowd, J.B.; et al. Accessible, curated metagenomic data through ExperimentHub. Nature Methods 2017, 14, 1023–1024.
  25. Dai, D.; Zhu, J.; Sun, C.; Li, M.; Liu, J.; Wu, S.; Ning, K. GMrepo v2: a curated human gut microbiome database with special focus on disease markers and cross-dataset comparison. Nucleic Acids Research 2022, 50, D777–D784.
  26. Janssens, Y.; Nielandt, J.; Bronselaer, A.; Debunne, N.; Verbeke, F.; Wynendaele, E.; Van Immerseel, F.; Vandewynckel, Y.P.; De Tré, G.; De Spiegeleer, B. Disbiome database: linking the microbiome to disease. BMC Microbiology 2018, 18, 50.
  27. Mitchell, A.L.; Almeida, A.; Beracochea, M.; Boland, M.; Burgin, J.; Cochrane, G.; Crusoe, M.R.; Kale, V.; Potter, S.C.; Richardson, L.J.; et al. MGnify: the microbiome analysis resource in 2020. Nucleic Acids Research 2020, 48, D570–D578.
  28. Sun, F.Y.; Hoffmann, J.; Verma, V.; Tang, J. InfoGraph: Unsupervised and Semi-supervised Graph-Level Representation Learning via Mutual Information Maximization. In Proceedings of the International Conference on Learning Representations, 2020. arXiv:1908.01000.
  29. Qiu, J.; Chen, Q.; Dong, Y.; Zhang, J.; Yang, H.; Ding, M.; Wang, K.; Tang, J. GCC: Graph Contrastive Coding for Graph Neural Network Pre-Training. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2020, pp. 1150–1160.
  30. You, Y.; Chen, T.; Shen, Y.; Wang, Z. Graph Contrastive Learning Automated. In Proceedings of the 38th International Conference on Machine Learning. PMLR, 2021, Vol. 139, Proceedings of Machine Learning Research, pp. 12121–12132.
  31. Suresh, S.; Li, P.; Hao, C.; Neville, J. Adversarial Graph Augmentation to Improve Graph Contrastive Learning. In Advances in Neural Information Processing Systems, 2021, Vol. 34, pp. 15920–15933.
  32. Thakoor, S.; Tallec, C.; Azar, M.G.; Azabou, M.; Dyer, E.L.; Munos, R.; Veličković, P.; Valko, M. Large-Scale Representation Learning on Graphs via Bootstrapping. In Proceedings of the International Conference on Learning Representations, 2022. arXiv:2102.06514.
  33. Wang, T.; Isola, P. Understanding Contrastive Representation Learning through Alignment and Uniformity on the Hypersphere. In Proceedings of the 37th International Conference on Machine Learning. PMLR, 2020, Vol. 119, Proceedings of Machine Learning Research, pp. 9929–9939.
  34. Aitchison, J. The Statistical Analysis of Compositional Data; Chapman and Hall: London, 1986.
  35. Silverman, J.D.; Washburne, A.D.; Mukherjee, S.; David, L.A. A phylogenetic transform enhances analysis of compositional microbiota data. eLife 2017, 6, e21887.
  36. Lin, H.; Peddada, S.D. Analysis of compositions of microbiomes with bias correction. Nature Communications 2020, 11, 3514.
  37. Wirbel, J.; Zych, K.; Essex, M.; Karcher, N.; Kartal, E.; Salazar, G.; Bork, P.; Sunagawa, S.; Zeller, G. Microbiome meta-analysis and cross-disease comparison enabled by the SIAMCAT machine learning toolbox. Genome Biology 2021, 22, 93.
  38. Ma, S.; Shungin, D.; Mallick, H.; Schirmer, M.; Nguyen, L.H.; Kolde, R.; Franzosa, E.A.; Vlamakis, H.; Xavier, R.J.; Huttenhower, C. Population structure discovery in meta-analyzed microbial communities and inflammatory bowel disease using MMUPHin. Genome Biology 2022, 23, 208.
  39. Ling, W.; et al. Batch effects removal for microbiome data via conditional quantile regression. Nature Communications 2022, 13, 5418.
  40. Hu, J.; Hu, M.; Wu, Y.; Mu, S.; Huang, D.; Wang, B.; Gao, Y.; Gu, S.; Zhu, J. A lightweight single-view contrastive learning hypergraph neural network for food–microbe–disease association prediction. BMC Bioinformatics 2025, 26, 283.
  41. Xuan, P.; Wang, R.; Gu, J.; Cui, H.; Zhang, T. Structure-sensitive transformer and multi-view graph contrastive learning enhanced prediction of drug-related microbes. BMC Bioinformatics 2025, 26, 231.
  42. Aminian-Dehkordi, J.; Parsa, M.; Dickson, A.; Mofrad, M.R.K. SIMBA-GNN: mechanistic graph learning for microbiome prediction. npj Systems Biology and Applications 2025, 11, 120.
  43. Li, M.; et al. Performance of gut microbiome as an independent diagnostic tool for 20 diseases: cross-cohort validation of machine-learning classifiers. Gut Microbes 2023, 15, 2157684.
  44. Chen, Z.; Wu, Z.; Zhong, L.; Plant, C.; Wang, S.; Guo, W. Attributed Multi-order Graph Convolutional Network for Heterogeneous Graphs. arXiv preprint arXiv:2304.06336 2023.
Figure 2. Schematic overview of a graph contrastive learning pipeline. Starting from an input graph with node features, two correlated graph views are generated through stochastic graph augmentations, including node dropping, edge perturbation, attribute masking, subgraph sampling, and diffusion- or random-walk-based transformations. The augmented views are processed by a shared graph neural network encoder and mapped through projection heads to latent representations. A contrastive objective, such as the InfoNCE loss, maximizes agreement between positive pairs corresponding to different views of the same graph while separating negative pairs derived from other graphs, thereby learning robust graph-level representations for downstream tasks.
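The contrastive objective named in the Figure 2 caption can be illustrated with a minimal, dependency-free sketch. It assumes one common form of the InfoNCE loss: cosine similarity, a temperature parameter, positives given by the same-index embedding in the second view, and negatives given by the other embeddings in that view. The function names and the toy embeddings are illustrative assumptions; published GCL methods differ in details such as symmetrization and whether the positive term appears in the denominator.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(view_a, view_b, temperature=0.5):
    """Mean one-directional InfoNCE loss over anchors in view_a.

    Row i of view_b is the positive for row i of view_a; all other
    rows of view_b act as negatives.
    """
    n = len(view_a)
    total = 0.0
    for i in range(n):
        logits = [cosine(view_a[i], view_b[j]) / temperature for j in range(n)]
        # Log-sum-exp with max subtraction for numerical stability.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_denominator - logits[i]
    return total / n


# Toy graph-level embeddings for two graphs under two augmented views.
view_a = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[1.0, 0.0], [0.0, 1.0]]     # views agree
misaligned = [[0.0, 1.0], [1.0, 0.0]]  # positives swapped

loss_aligned = info_nce(view_a, aligned)
loss_misaligned = info_nce(view_a, misaligned)
```

In this toy case the loss is lower when the two views of each graph agree, which is the behaviour the objective is meant to reward.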
Table 1. Representative graph formulations for microbiome data analysis.
| Graph type | Nodes | Edges | Typical use |
| --- | --- | --- | --- |
| Taxa–taxa graph | OTUs, ASVs, species or genera | Correlation, partial correlation, proportionality, co-abundance or inferred ecological association | Microbial network analysis, disease-specific community comparison, biomarker modules. |
| Phylogenetic graph | Taxa or internal tree nodes | Ancestor–descendant relationships or phylogenetic distance | Taxonomy-aware representation learning and phenotype prediction. |
| Sample–sample graph | Patients, samples or time points | Similarity in taxonomic, functional or multi-omics profiles | Patient stratification, disease classification and cohort alignment. |
| Microbe–disease graph | Microbes and diseases | Known or predicted associations | Microbe–disease association prediction and hypothesis generation. |
| Microbe–metabolite/pathway graph | Taxa, metabolites, genes or pathways | Functional annotation, metabolic exchange or correlation | Multi-omics integration and mechanism discovery. |
| Host–microbe graph | Host genes, immune markers, clinical variables and taxa | Statistical, mechanistic or literature-derived relationships | Precision medicine and host–microbiome interaction modelling. |
| Temporal graph | Taxa, samples or subject states over time | Longitudinal transitions or dynamic associations | Dysbiosis trajectories, intervention response and microbial stability. |
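As a concrete instance of the sample–sample formulation in Table 1, the sketch below links each sample to its nearest neighbours under Bray–Curtis dissimilarity. The choice of Bray–Curtis, the value of k and the toy count profiles are assumptions made for illustration; a compositionality-aware distance (for example, Euclidean distance on log-ratio-transformed profiles) could be substituted without changing the structure of the code.

```python
def bray_curtis(x, y):
    """Bray–Curtis dissimilarity between two non-negative abundance vectors."""
    numerator = sum(abs(a - b) for a, b in zip(x, y))
    denominator = sum(a + b for a, b in zip(x, y))
    return numerator / denominator if denominator > 0 else 0.0


def knn_sample_graph(profiles, k=1):
    """Return sorted undirected edges (i, j), i < j, joining each sample
    to its k nearest samples."""
    n = len(profiles)
    edges = set()
    for i in range(n):
        neighbours = sorted(
            (bray_curtis(profiles[i], profiles[j]), j)
            for j in range(n) if j != i
        )
        for _, j in neighbours[:k]:
            edges.add((min(i, j), max(i, j)))
    return sorted(edges)


# Toy count profiles (rows: samples, columns: taxa); values are made up.
profiles = [
    [10, 0, 5],
    [9, 1, 5],
    [0, 12, 1],
    [1, 11, 0],
]
edges = knn_sample_graph(profiles, k=1)
```

With these toy profiles the two pairs of similar samples are joined and no edge crosses between the pairs.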
Table 2. Representative public resources, candidate graph formulations and candidate augmentation policies for microbiome GCL.
| Resource | Data type | Candidate task | Graph formulation | Candidate augmentation policy |
| --- | --- | --- | --- | --- |
| Human Microbiome Project [2] | Multi-body-site 16S and shotgun profiles | Body-site classification, baseline pretraining | Sample–sample, taxa–taxa, body-site graphs | Body-site-preserving feature masking; cross-body-site transfer tests. |
| curatedMetagenomicData [24] | Uniformly processed human metagenomes | Disease classification, LOSO validation | Taxa–taxa, sample–sample and study-aware graphs | Study-balanced batches; edge-confidence perturbation; leave-one-study-out splits. |
| GMrepo [25] | Curated gut microbiome profiles and disease metadata | Cross-disease phenotype prediction | Microbe–disease and sample–disease graphs | Disease-aware contrast; hard negative sampling among related phenotypes. |
| Disbiome [26] | Literature-curated microbe–disease links | Link prediction and biological validation | Bipartite microbe–disease graph | Similarity-view contrast; evaluation against held-out known associations. |
| MGnify [27] | Public metagenomic/metabarcoding analyses | Environmental transfer and pretraining | Taxa, functional and environmental graphs | Environment-aware positives; domain adaptation across habitats. |
| Disease-specific cohorts | 16S or shotgun profiles plus metadata | IBD, CRC, metabolic disease, infection | Phenotype-specific co-abundance networks | Confounder-aware pair selection; graph robustness across inference methods. |
Table 4. Biologically informed graph augmentations for microbiome GCL. Suggested ranges are practical starting points and should be tuned by validation and stress testing.
| Augmentation | Implementation | Biological meaning | Suggested safeguard | Risk |
| --- | --- | --- | --- | --- |
| Confidence-aware edge dropping | Preferentially drop edges with low bootstrap support or weak association score | Robustness to uncertain microbial associations | Drop 5–30% of low-confidence edges; preserve high-confidence hubs | May remove weak but meaningful interactions. |
| Prevalence-aware taxon masking | Mask taxa with probability conditioned on prevalence and abundance | Robustness to sparsity and dropout | Avoid always masking rare disease markers; report prevalence threshold | May suppress low-abundance biomarkers. |
| Phylogenetic subgraph sampling | Sample clades or phylogenetically coherent neighborhoods | Preservation of evolutionary structure | Sample at multiple depths; compare species/genus/family views | May overemphasize taxonomy over function. |
| Log-ratio feature perturbation | Add noise or mask features in CLR/ILR/PhILR space | Compositionality-aware robustness | Perform sensitivity analysis across zero-handling strategies | Poor pseudo-count choice may distort rare taxa. |
| Cross-omics contrast | Contrast taxonomic, pathway, metabolite or host graphs | Integration of complementary biological layers | Use paired samples only or model missing modalities explicitly | Omics layers may have different noise models. |
| Disease-aware graph views | Contrast healthy and disease-specific networks | Identification of dysbiosis-related modules | Balance cohorts and metadata; avoid study-label shortcuts | Confounding by treatment, geography or lifestyle. |
| Temporal contrast | Contrast nearby time points or intervention states | Learning stability and transition patterns | Use subject-level splits and time-aware negatives | Requires dense longitudinal sampling. |
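Two of the augmentations in Table 4, confidence-aware edge dropping and log-ratio feature perturbation, can be sketched in a few lines. The example below is a simplified illustration under stated assumptions: edges carry a single confidence score, the cutoff of 0.5, the drop fraction, the pseudo-count of 0.5 and the Gaussian noise scale are hypothetical defaults, and the function names are introduced here rather than taken from any cited method.

```python
import math
import random


def drop_low_confidence_edges(edges, drop_fraction=0.2,
                              confidence_cutoff=0.5, seed=0):
    """Randomly drop a fraction of edges whose confidence is below the
    cutoff; edges at or above the cutoff are always kept.

    edges: list of (u, v, confidence) tuples.
    """
    rng = random.Random(seed)
    low_idx = [i for i, e in enumerate(edges) if e[2] < confidence_cutoff]
    n_drop = round(drop_fraction * len(low_idx))
    dropped = set(rng.sample(low_idx, n_drop))
    return [e for i, e in enumerate(edges) if i not in dropped]


def clr(counts, pseudo_count=0.5):
    """Centred log-ratio transform with an additive pseudo-count (assumed)."""
    logs = [math.log(c + pseudo_count) for c in counts]
    mean_log = sum(logs) / len(logs)
    return [x - mean_log for x in logs]


def perturb_clr(clr_values, sigma=0.1, seed=0):
    """Add Gaussian noise in CLR space, then re-centre so values sum to zero."""
    rng = random.Random(seed)
    noisy = [v + rng.gauss(0.0, sigma) for v in clr_values]
    mean = sum(noisy) / len(noisy)
    return [v - mean for v in noisy]


# Toy taxa–taxa edges with made-up bootstrap support values.
edges = [("t1", "t2", 0.9), ("t1", "t3", 0.2), ("t2", "t3", 0.3),
         ("t3", "t4", 0.1), ("t4", "t5", 0.4), ("t1", "t5", 0.8)]
view_edges = drop_low_confidence_edges(edges, drop_fraction=0.5)
view_features = perturb_clr(clr([120, 0, 3, 45]))
```

Because both augmentations depend on tunable choices (cutoff, drop fraction, pseudo-count, noise scale), the sensitivity analyses suggested in the table apply directly to these parameters.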
Table 6. Concrete evaluation dimensions for microbiome graph contrastive learning.
| Dimension | Metrics or criteria | Concrete protocol | GCL-specific check |
| --- | --- | --- | --- |
| Predictive performance | Accuracy, balanced accuracy, F1-score, AUROC, AUPRC, calibration | Compare non-graph ML, supervised GNN and GCL-pretrained models on identical splits; use link-prediction metrics for association tasks [15,40,41]. | Test whether gains come from contrastive pretraining rather than only architecture or classifier tuning. |
| External validation | LOSO AUROC/AUPRC, transfer gap, calibration shift | Train on multiple cohorts and test on a held-out study using SIAMCAT-like cross-study designs [37,43]. | Verify that embeddings transfer across studies and do not encode cohort identity. |
| Graph robustness | Stability across inference method, threshold, graph density and taxonomic level | Reconstruct graphs using Spearman/SparCC/SPIEC-EASI/NetCoMi and repeat evaluation [5,6,23]. | Measure prediction and embedding stability under graph-construction stress tests. |
| Biological plausibility | Enrichment of known taxa, pathways, edges or modules | Compare salient nodes/edges/subgraphs with Disbiome, GMrepo and disease literature [25,26]. | Check whether learned invariances correspond to plausible biology. |
| Interpretability | Stability of node, edge and subgraph explanations | Repeat explanations across seeds, folds, cohorts and graph perturbations; compare with CACONET/WSGMB-style network biomarkers [10,11]. | Determine whether biomarkers and microbial modules are stable rather than augmentation artifacts. |
| Reproducibility | Code, scripts, splits, graph parameters, seeds and model checkpoints | Release preprocessing, graph construction, view generation, training and evaluation scripts; use standardized resources such as curatedMetagenomicData [24]. | Allows separation of the effects of preprocessing, graph construction, augmentation and contrastive loss. |
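The external-validation protocol in Table 6 rests on leave-one-study-out (LOSO) splits, which can be generated as in the sketch below. The function names are illustrative, and defining the transfer gap as the within-study score minus the held-out score is an assumption made here for concreteness rather than a definition given in the cited work.

```python
def leave_one_study_out(study_labels):
    """Yield (held_out_study, train_indices, test_indices), holding out
    every sample from one study at a time."""
    for study in sorted(set(study_labels)):
        train = [i for i, s in enumerate(study_labels) if s != study]
        test = [i for i, s in enumerate(study_labels) if s == study]
        yield study, train, test


def transfer_gap(within_study_score, held_out_score):
    """Assumed definition: drop in a metric when moving to an unseen study."""
    return within_study_score - held_out_score


# Toy study labels for five samples; labels are placeholders.
study_labels = ["A", "A", "B", "C", "B"]
splits = list(leave_one_study_out(study_labels))
for study, train, test in splits:
    # No sample from the held-out study may appear in the training set.
    assert not set(train) & set(test)
```

Splitting by study rather than by sample keeps cohort-specific technical signal out of the training set, which is what the cohort-identity check in the table is designed to detect.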
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
Copyright: This open access article is published under a Creative Commons CC BY 4.0 license, which permits free download, distribution and reuse, provided that the author and preprint are cited in any reuse.