Concept Paper (preprint; this version is not peer-reviewed)
Bayesian Multi-View Graph Convolutional Network (BMGCN) for Integrative Multi-Omics Analysis with Survival Outcomes and Zero-Inflated Data

Submitted: 14 October 2025
Posted: 27 October 2025

Abstract
Modern biomedical research generates vast, multi-modal datasets (multi-omics) from the same patient cohorts, offering an unprecedented opportunity to understand complex diseases. However, integrating these heterogeneous data views to predict clinical outcomes like patient survival presents significant statistical challenges. These challenges include data heterogeneity, high dimensionality, inherent zero-inflation due to technical dropouts or biological absence, and the need to incorporate prior biological knowledge. We propose the Bayesian Multi-view Graph Convolutional Network (BMGCN), a deep generative framework designed to address these challenges. BMGCN factorizes the data into shared and view-specific latent representations, enabling both data integration and the identification of view-specific signals. It employs graph-convolutional encoders to integrate prior biological network knowledge, a zero-inflated likelihood to accurately model sparse omics data, and a spike-and-slab prior for Bayesian view selection to identify modalities most relevant to the outcome. Finally, a semi-parametric Cox proportional hazards module allows the model to handle right-censored survival data directly. We detail the full generative model, derive the variational inference objective, and outline a comprehensive validation strategy. BMGCN provides a powerful, interpretable, and flexible framework for integrative multi-omics analysis.

1. Introduction

The ability to measure multiple molecular layers—such as gene expression (RNA-seq), DNA methylation, microRNA, and proteomics—from the same subjects has revolutionized systems biology and precision medicine. Each of these “omics” views provides a unique window into the biological state of a system. The central hypothesis of multi-omics integration is that a unified analysis of these views can reveal emergent biological insights and yield more accurate predictive models than any single view alone [1].
However, the integrative analysis of such data is fraught with challenges. First, the data are heterogeneous, with different views having unique statistical properties, dimensionalities, and levels of noise. Second, many omics datasets, particularly those from single-cell and sequencing-based assays, are characterized by a high degree of sparsity and zero-inflation, where an excess of zero values complicates modeling [2]. These zeros can represent true biological absence or technical artifacts (e.g., “dropouts”), and distinguishing between them is crucial. Third, predicting clinical endpoints, such as patient survival, requires specialized statistical models that can properly handle right-censored data, where the event of interest is not observed for all subjects [3]. Finally, ignoring the rich web of prior biological knowledge, such as protein-protein interaction networks or gene regulatory pathways, is a missed opportunity to guide the model toward more robust and interpretable solutions.
To address these multifaceted challenges, we introduce the Bayesian Multi-view Graph Convolutional Network (BMGCN). BMGCN is a deep generative framework that provides a principled solution for integrating multi-omics data with censored survival outcomes. The design of BMGCN is guided by the following core objectives:
  • Latent Abstraction: To learn a low-dimensional representation of the data by factorizing the signal into latent variables that are shared across all omics views ($Z_i$) and variables that are specific to each view ($Z_i^{(k)}$). This captures both common and view-specific biological processes.
  • View Selection: To automatically determine the relevance of each omics view for predicting the survival outcome. We achieve this through a Bayesian spike-and-slab prior on the view-specific latents, governed by inclusion indicators $\gamma_k$, which can effectively “turn off” uninformative views, enhancing interpretability [4].
  • Principled Sparsity Modeling: To explicitly model the excess zeros common in omics data using a zero-inflated mixture likelihood [2]. This distinguishes between true biological absence (a point mass at zero) and low-level biological signal (a continuous component).
  • Censored Outcome Modeling: To directly model time-to-event data through an integrated Cox proportional hazards module that operates on the learned latent representations [3].
  • Integration of Biological Priors: To leverage prior knowledge in the form of biological networks ($A^{(k)}$) by using Graph Convolutional Network (GCN) encoders, which regularize the model and encourage biologically meaningful latent representations [5].
This paper details the statistical foundation of BMGCN. We first present the notation and model overview, followed by a detailed description of the full generative process. We then derive the variational inference framework and the Evidence Lower Bound (ELBO) objective function. Finally, we describe the architecture of the graph-based encoders and outline a rigorous strategy for model validation and diagnostics.

2. Methods

2.1. Notation and Core Entities

Let our dataset consist of N subjects and K omics views. The core entities of BMGCN are defined as follows:
Observed Data:
  • $X_i^{(k)} \in \mathbb{R}^{p_k}$: The feature vector for subject $i$ in view $k$, where $p_k$ is the number of features in that view.
  • $A^{(k)} \in \{0, 1\}^{p_k \times p_k}$: An optional adjacency matrix for view $k$, representing prior biological knowledge (e.g., a protein-protein interaction network).
  • $(T_i, \Delta_i)$: The observed time for subject $i$, where $T_i \in \mathbb{R}_+$ is the survival or censoring time and $\Delta_i \in \{0, 1\}$ is the event indicator ($\Delta_i = 1$ if the event was observed, $\Delta_i = 0$ if censored).
Latent Variables:
  • $Z_i \in \mathbb{R}^{L}$: A shared latent vector for subject $i$, capturing variation common to all views.
  • $Z_i^{(k)} \in \mathbb{R}^{L_k}$: A view-specific latent vector for subject $i$ in view $k$.
  • $\gamma_k \in \{0, 1\}$: A binary variable indicating whether view $k$ is active in predicting survival.
  • $\pi_k \in (0, 1)$: The prior inclusion probability for view $k$.

2.2. The BMGCN Generative Model

BMGCN defines a full generative process for the observed data, which factorizes into several key steps. For each subject $i = 1, \dots, N$:
  • Sample View Inclusion Probabilities and Indicators: For each view $k = 1, \dots, K$, we first draw a prior inclusion probability from a Beta distribution and then sample the binary indicator variable:
    $$\pi_k \sim \mathrm{Beta}(\alpha_0, \beta_0)$$
    $$\gamma_k \mid \pi_k \sim \mathrm{Bernoulli}(\pi_k)$$
  • Sample Latent Variables: A shared latent vector is drawn from a standard normal prior. The view-specific latent vectors are drawn from a spike-and-slab prior, conditioned on the inclusion variable $\gamma_k$. If $\gamma_k = 0$, the latent vector is fixed at zero (the “spike”); otherwise, it is drawn from a standard normal (the “slab”).
    $$Z_i \sim \mathcal{N}(0, I_L)$$
    $$p(Z_i^{(k)} \mid \gamma_k) = (1 - \gamma_k)\,\delta_0(Z_i^{(k)}) + \gamma_k\,\mathcal{N}(Z_i^{(k)} \mid 0, I_{L_k})$$
    where $\delta_0$ is the Dirac delta function at zero.
  • Generate Omics Features via Decoders: The observed features for each view are generated from the combined shared and view-specific latent variables. A decoder network $f_{\psi^{(k)}}$ maps the concatenated latent vector $[Z_i, Z_i^{(k)}]$ to the parameters of a zero-inflated Gaussian distribution.
    $$(\mu_i^{(k)}, \rho_i^{(k)}) = f_{\psi^{(k)}}\big([Z_i, Z_i^{(k)}]\big)$$
    $$p(X_{ij}^{(k)}) = \rho_{ij}^{(k)}\,\delta_0(X_{ij}^{(k)}) + (1 - \rho_{ij}^{(k)})\,\mathcal{N}(X_{ij}^{(k)} \mid \mu_{ij}^{(k)}, \sigma_k^2)$$
    Here, $\mu_i^{(k)}$ is the mean vector, $\rho_i^{(k)}$ is the vector of zero-inflation probabilities (passed through a sigmoid function to ensure they lie in $(0, 1)$), and $\sigma_k^2$ is a view-specific variance parameter.
  • Generate Survival Outcome: A latent risk score $\eta_i$ for each subject is computed by a survival head network $g_\theta$, which takes the shared latent and the gated view-specific latents as input. The observed survival data $(T_i, \Delta_i)$ are assumed to follow a Cox proportional hazards model conditioned on this risk score.
    $$\eta_i = g_\theta\big([Z_i, \gamma_1 Z_i^{(1)}, \dots, \gamma_K Z_i^{(K)}]\big)$$
    $$(T_i, \Delta_i) \sim \mathrm{Cox}(\eta_i)$$
    The likelihood for the Cox model is given by the partial log-likelihood:
    $$\ell(\eta) = \sum_{i=1}^{N} \Delta_i \left( \eta_i - \log \sum_{j \in R_i} e^{\eta_j} \right)$$
    where $R_i = \{ j : T_j \geq T_i \}$ is the set of subjects at risk at time $T_i$. This semi-parametric form is powerful because it does not require specifying a baseline hazard function $h_0(t)$. When multiple events occur at the same time (ties), approximations such as Breslow’s or Efron’s can be used.
    The full joint distribution over all observed and latent variables is given by:
    $$p(X, T, \Delta, Z, \{Z^{(k)}\}, \gamma, \pi) = \prod_{i=1}^{N} \left[ p(T_i, \Delta_i \mid \eta_i) \prod_{k=1}^{K} p(X_i^{(k)} \mid Z_i, Z_i^{(k)})\, p(Z_i^{(k)} \mid \gamma_k)\; p(Z_i) \right] \times \prod_{k=1}^{K} p(\gamma_k \mid \pi_k)\, p(\pi_k)$$
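The NumPy sketch below walks through this generative process end to end on simulated data. The dimensions, the linear stand-ins for the decoders $f_{\psi^{(k)}}$ and the survival head $g_\theta$, and the exponential baseline hazard are illustrative assumptions, not part of the BMGCN specification.

```python
# Illustrative simulation of the BMGCN generative process (Section 2.2).
# All sizes, decoder forms, and hyperparameters below are placeholder assumptions.
import numpy as np

rng = np.random.default_rng(0)
N, K = 100, 3                      # subjects, omics views
L, L_k = 8, 4                      # shared and view-specific latent dimensions
p_k = [200, 150, 100]              # features per view
alpha0, beta0 = 1.0, 1.0           # Beta hyperparameters

# 1. View inclusion probabilities and indicators
pi = rng.beta(alpha0, beta0, size=K)
gamma = rng.binomial(1, pi)

# 2. Latent variables (spike-and-slab for the view-specific parts)
Z = rng.normal(size=(N, L))
Z_k = [gamma[k] * rng.normal(size=(N, L_k)) for k in range(K)]

# 3. Omics features from a zero-inflated Gaussian decoder
#    (a random linear map stands in for the decoder network f_psi)
X = []
for k in range(K):
    W = rng.normal(scale=0.1, size=(L + L_k, p_k[k]))
    mu = np.concatenate([Z, Z_k[k]], axis=1) @ W
    rho = 1.0 / (1.0 + np.exp(-mu))                 # zero-inflation probabilities
    dropout = rng.binomial(1, rho)
    X.append(np.where(dropout == 1, 0.0, rng.normal(mu, 0.5)))

# 4. Survival outcome from a Cox model with an exponential baseline hazard
eta = Z @ rng.normal(size=L) + sum(gamma[k] * Z_k[k] @ rng.normal(size=L_k)
                                   for k in range(K))
T_event = rng.exponential(scale=np.exp(-eta))       # hazard proportional to exp(eta)
C = rng.exponential(scale=np.median(T_event), size=N)   # independent censoring times
T = np.minimum(T_event, C)
Delta = (T_event <= C).astype(int)
```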

2.3. Variational Inference and the ELBO

The posterior distribution over the latent variables, $p(Z, \{Z^{(k)}\}, \gamma \mid X, T, \Delta)$, is intractable due to the non-linearities in the model and the Cox likelihood. We therefore resort to amortized variational inference (VI) [6,7]. We introduce a tractable variational distribution $q(\cdot)$ to approximate the true posterior and maximize the Evidence Lower Bound (ELBO), $\mathcal{L}$, with respect to the variational parameters.
We use a mean-field variational family that factorizes across subjects and latent variables:
$$q(Z, \{Z^{(k)}\}, \gamma) = \prod_{i=1}^{N} \left[ q(Z_i) \prod_{k=1}^{K} q(Z_i^{(k)}) \right] \prod_{k=1}^{K} q(\gamma_k)$$
where each factor is parameterized by an encoder network:
$$q(Z_i \mid X_i) = \mathcal{N}\big(\mu_i(X_i), \mathrm{diag}(\sigma_i^2(X_i))\big)$$
$$q(Z_i^{(k)} \mid X_i^{(k)}, A^{(k)}) = \mathcal{N}\big(\mu_i^{(k)}(X_i^{(k)}, A^{(k)}), \mathrm{diag}((\sigma_i^{(k)})^2(X_i^{(k)}, A^{(k)}))\big)$$
$$q(\gamma_k) = \mathrm{Bernoulli}(\tau_k)$$
where $\tau_k$ is the posterior inclusion probability for view $k$.
The ELBO is defined as:
$$\mathcal{L} = \mathbb{E}_q\big[\log p(X, T, \Delta, Z, \{Z^{(k)}\}, \gamma)\big] - \mathbb{E}_q\big[\log q(Z, \{Z^{(k)}\}, \gamma)\big]$$
This objective can be decomposed into several intuitive terms:
$$\mathcal{L} = \mathcal{L}_{\text{recon}} + \mathcal{L}_{\text{surv}} - \mathcal{L}_{\text{KL}}$$
where:
  • Reconstruction Term ($\mathcal{L}_{\text{recon}}$): The expected log-likelihood of the omics data.
    $$\mathcal{L}_{\text{recon}} = \sum_{i=1}^{N} \sum_{k=1}^{K} \mathbb{E}_q\big[\log p(X_i^{(k)} \mid Z_i, Z_i^{(k)})\big]$$
    The piecewise nature of the zero-inflated log-likelihood makes this term directly computable without explicit sampling of discrete zero-inflation masks.
  • Survival Term ($\mathcal{L}_{\text{surv}}$): The expected Cox partial log-likelihood.
    $$\mathcal{L}_{\text{surv}} = \mathbb{E}_q\!\left[ \sum_{i=1}^{N} \Delta_i \left( \eta_i - \log \sum_{j \in R_i} e^{\eta_j} \right) \right]$$
  • KL Divergence Regularizer ($\mathcal{L}_{\text{KL}}$): A sum of Kullback-Leibler (KL) divergences that penalize deviations of the variational posterior from the prior.
    $$\mathcal{L}_{\text{KL}} = \sum_{i=1}^{N} \mathrm{KL}\big(q(Z_i) \,\|\, p(Z_i)\big) + \sum_{i=1}^{N} \sum_{k=1}^{K} \mathbb{E}_{q(\gamma_k)}\Big[\mathrm{KL}\big(q(Z_i^{(k)}) \,\|\, p(Z_i^{(k)} \mid \gamma_k)\big)\Big] + \sum_{k=1}^{K} \mathrm{KL}\big(q(\gamma_k) \,\|\, p(\gamma_k)\big)$$
    The KL terms for the Gaussian latents have a closed-form solution. The term involving the discrete $\gamma_k$ is approximated as $\tau_k \cdot \mathrm{KL}\big(q(Z_i^{(k)}) \,\|\, \mathcal{N}(0, I)\big)$, effectively weighting the KL penalty by the posterior probability that the view is included.
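A minimal PyTorch sketch of this KL regularizer is given below, assuming diagonal Gaussian posteriors, the $\tau_k$-weighted spike-and-slab approximation just described, and a Bernoulli prior with known inclusion probabilities; tensor shapes and function names are illustrative.

```python
import torch

def kl_diag_gaussian_std_normal(mu, logvar):
    """KL( N(mu, diag(exp(logvar))) || N(0, I) ), summed over the latent dimension."""
    return 0.5 * torch.sum(mu.pow(2) + logvar.exp() - logvar - 1.0, dim=-1)

def kl_bernoulli(tau, pi_prior, eps=1e-8):
    """KL( Bernoulli(tau) || Bernoulli(pi_prior) ), element-wise over views."""
    return (tau * torch.log((tau + eps) / (pi_prior + eps))
            + (1 - tau) * torch.log((1 - tau + eps) / (1 - pi_prior + eps)))

def kl_regularizer(mu_shared, logvar_shared, mu_views, logvar_views, tau, pi_prior):
    """
    mu_shared, logvar_shared : (N, L) parameters of q(Z_i)
    mu_views, logvar_views   : length-K lists of (N, L_k) parameters of q(Z_i^(k))
    tau                      : (K,) posterior inclusion probabilities q(gamma_k = 1)
    pi_prior                 : (K,) prior inclusion probabilities
    """
    kl = kl_diag_gaussian_std_normal(mu_shared, logvar_shared).sum()
    for k, (mu_k, logvar_k) in enumerate(zip(mu_views, logvar_views)):
        # spike-and-slab term: KL against the Gaussian slab, weighted by tau_k
        kl = kl + tau[k] * kl_diag_gaussian_std_normal(mu_k, logvar_k).sum()
    kl = kl + kl_bernoulli(tau, pi_prior).sum()
    return kl
```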

2.4. Graph-Convolutional Encoders

To incorporate prior biological network information ($A^{(k)}$), the encoder networks that parameterize $q(Z_i^{(k)})$ are implemented as Graph Convolutional Networks (GCNs) [5]. This allows the model to learn feature representations that are regularized by known biological interactions, improving robustness and interpretability.
For each view $k$, we first normalize the adjacency matrix to ensure stable learning:
$$\hat{A}^{(k)} = (\tilde{D}^{(k)})^{-1/2}\, \tilde{A}^{(k)}\, (\tilde{D}^{(k)})^{-1/2}$$
where $\tilde{A}^{(k)} = A^{(k)} + I$ (adding self-loops) and $\tilde{D}^{(k)}$ is the diagonal degree matrix of $\tilde{A}^{(k)}$.
The GCN takes the full data matrix for view $k$, $X^{(k)} \in \mathbb{R}^{N \times p_k}$, as input features for the nodes (e.g., genes) of the graph. The GCN then learns feature embeddings by propagating and transforming information between nodes in the graph. The propagation rule for layer $l$ is:
$$H^{(l+1)} = \mathrm{ReLU}\big(\hat{A}^{(k)} H^{(l)} W^{(l)}\big)$$
where $H^{(0)} = (X^{(k)})^\top \in \mathbb{R}^{p_k \times N}$, $W^{(l)}$ is a learnable weight matrix, and $H^{(l)} \in \mathbb{R}^{p_k \times d_l}$ is the feature embedding matrix at layer $l$. After $L$ layers, we obtain a final feature embedding matrix $H^{(L)} \in \mathbb{R}^{p_k \times d_L}$.
To obtain sample-specific parameters for $q(Z_i^{(k)})$, we apply a readout function. Specifically, the final feature embeddings are used to parameterize a sample-specific network that maps the input data to the latent space parameters:
$$h_i^{(k)} = \mathrm{FFN}_{\text{enc}}\big(X_i^{(k)}, H^{(L)}\big)$$
This sample-specific embedding $h_i^{(k)}$ is then mapped to the variational parameters:
$$\mu_i^{(k)} = W_\mu h_i^{(k)}, \qquad \log \sigma_i^{(k)} = W_\sigma h_i^{(k)}$$
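The PyTorch sketch below implements the symmetric adjacency normalization and a two-layer GCN encoder with a simple mean-pooling readout. The depth, layer widths, and the concrete form of $\mathrm{FFN}_{\text{enc}}$ are assumptions, since the paper does not fix a specific architecture; as described above, the GCN operates on the full data matrix of one view.

```python
import torch
import torch.nn as nn

def normalize_adjacency(A):
    """Symmetric normalization D^{-1/2} (A + I) D^{-1/2} of a (p_k, p_k) adjacency."""
    A_tilde = A + torch.eye(A.shape[0])
    d_inv_sqrt = A_tilde.sum(dim=1).pow(-0.5)
    return d_inv_sqrt.unsqueeze(1) * A_tilde * d_inv_sqrt.unsqueeze(0)

class GCNEncoder(nn.Module):
    """Two GCN layers over the feature graph, then a sample-wise readout (illustrative)."""
    def __init__(self, n_samples, p_k, d_hidden, latent_dim):
        super().__init__()
        self.W0 = nn.Linear(n_samples, d_hidden, bias=False)   # H^(0) = X^T has N columns
        self.W1 = nn.Linear(d_hidden, d_hidden, bias=False)
        self.readout = nn.Sequential(nn.Linear(p_k + d_hidden, d_hidden), nn.ReLU())
        self.to_mu = nn.Linear(d_hidden, latent_dim)
        self.to_logsigma = nn.Linear(d_hidden, latent_dim)

    def forward(self, X, A_hat):
        # X: (N, p_k) data for view k; A_hat: (p_k, p_k) normalized adjacency
        H = torch.relu(self.W0(A_hat @ X.t()))          # (p_k, d_hidden)
        H = torch.relu(self.W1(A_hat @ H))              # (p_k, d_hidden)
        graph_summary = H.mean(dim=0, keepdim=True)     # simple mean-pool over features
        h = self.readout(torch.cat([X, graph_summary.expand(X.shape[0], -1)], dim=1))
        return self.to_mu(h), self.to_logsigma(h)
```

The normalized adjacency can be computed once per view and reused throughout training.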

2.5. Optimization and Implementation

The ELBO is optimized with respect to the parameters of the encoders, decoders, and survival head using stochastic gradient ascent. We employ the reparameterization trick for Gaussian latents to obtain low-variance gradient estimates [6]. For the discrete view-selection variables γ k , we use a hard-concrete relaxation with a straight-through gradient estimator during training [8,9].
Training is performed in mini-batches. For the Cox survival term, which is defined over the full cohort, we use mini-batch approximations, such as constructing risk sets only from subjects within the current batch or using a memory bank. While this introduces a small bias, it is essential for scalability.
Key implementation details for ensuring stable training include:
  • Using the log-sum-exp trick for stable computation of the Cox likelihood denominator (a minimal batch-level sketch follows this list).
  • Applying KL annealing to gradually introduce the KL regularization term, preventing posterior collapse [10].
  • Using modern optimizers such as AdamW together with learning rate schedulers [11,12].
  • Initializing the view-selection logits (the logits underlying $\tau_k$) to favor inclusion initially, which promotes stable learning.
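As referenced above, the following PyTorch sketch computes the within-batch Cox partial log-likelihood with a masked log-sum-exp; risk sets are formed from the mini-batch only, ties are treated in the Breslow convention, and all names are illustrative assumptions.

```python
import torch

def cox_partial_log_likelihood(eta, times, events):
    """
    eta    : (B,) risk scores from the survival head for the current mini-batch
    times  : (B,) observed times T_i
    events : (B,) event indicators Delta_i (1 = event, 0 = censored)
    Returns the partial log-likelihood; training maximizes it (or minimizes its negative).
    """
    # Within-batch risk sets: in_risk[i, j] is True when T_j >= T_i
    in_risk = times.unsqueeze(0) >= times.unsqueeze(1)                 # (B, B)
    # log sum_{j in R_i} exp(eta_j), computed stably with a masked log-sum-exp
    masked_eta = eta.unsqueeze(0).expand(len(eta), -1).masked_fill(~in_risk, float("-inf"))
    log_risk = torch.logsumexp(masked_eta, dim=1)                      # (B,)
    return torch.sum(events * (eta - log_risk))
```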

3. Validation and Diagnostics Strategy

To ensure BMGCN is robust, generalizable, and interpretable, we propose a multi-faceted validation strategy.
  • Simulation Studies: We will generate synthetic multi-view data with known ground-truth latent structures, pre-defined informative vs. uninformative views, and known survival dependencies. This will allow us to quantitatively assess the model’s ability to recover the true latent variables (cosine similarity), correctly perform view selection (accuracy of $\tau_k$ vs. ground truth), and accurately predict survival (Concordance Index).
  • Cross-Validation: On real-world datasets, we will use a k-fold cross-validation scheme, stratified by event status, to evaluate predictive performance on held-out data. The primary metric will be the Concordance Index (C-index), which measures the fraction of concordant pairs of subjects (a minimal computation is sketched after this list). A secondary metric will be the integrated Brier score, which assesses the calibration of survival probability predictions over time.
  • Posterior Predictive Checks (PPCs): To assess goodness-of-fit, we will sample from the posterior predictive distribution. We will compare the distributions of simulated data (e.g., zero frequencies, means, variances) against the observed data to detect model misspecification.
  • Cox Diagnostics: We will examine Schoenfeld residuals to test the proportional hazards assumption of the Cox model, a critical assumption for the validity of the survival module.
  • Ablation Studies: To quantify the contribution of each model component, we will perform ablation studies by systematically removing key features: (1) the graph structure (setting $A^{(k)} = I$), (2) the zero-inflation module (fixing $\rho = 0$), and (3) the spike-and-slab selectors (forcing $\gamma_k = 1$ for all views).
  • Latent Space Analysis: We will visualize the learned shared latent space $Z_i$ using techniques like UMAP to check whether it clusters subjects by known phenotypes, treatment groups, or survival outcomes, providing a qualitative check on its biological relevance [13].
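As referenced in the cross-validation item above, the sketch below computes Harrell's concordance index directly from risk scores and censored survival data; pairs with tied event times are simply not counted here, and production analyses would typically rely on a vetted survival-analysis library implementation.

```python
import numpy as np

def concordance_index(times, events, risk_scores):
    """Fraction of comparable pairs whose ordering by risk matches the ordering by time."""
    concordant, comparable = 0.0, 0.0
    n = len(times)
    for i in range(n):
        if events[i] != 1:
            continue                          # comparable pairs are anchored at observed events
        for j in range(n):
            if times[j] > times[i]:           # subject j outlived subject i
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5         # tied risk scores count as half-concordant
    return concordant / comparable
```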

4. Discussion

BMGCN offers a comprehensive and principled framework for one of the most pressing problems in computational biology: the integrative analysis of multi-omics data for clinical outcome prediction. Its primary strengths lie in its ability to simultaneously address multiple, often co-occurring, challenges.
The Bayesian framework provides not only point estimates but also uncertainty quantification. The probabilistic view selection mechanism offers a path to identifying which data modalities are most informative for a given prediction task, a crucial step for biomarker discovery and for designing future cost-effective studies.
However, the model is not without limitations. The mean-field approximation used for variational inference may not capture complex posterior dependencies between latent variables. The Cox model component relies on the proportional hazards assumption, which may not hold in all biological contexts; diagnostics are essential. Computationally, BMGCN is intensive, particularly with a large number of views or features, though scalability is addressed via mini-batching.
Future work could extend BMGCN in several directions. More flexible variational families, such as normalizing flows, could be explored to improve posterior approximation. The model could be extended to handle other data types (e.g., categorical or count data) and other clinical endpoints. Furthermore, incorporating mechanisms to explicitly handle missing views for certain subjects would greatly enhance its applicability to real-world clinical datasets.

5. Conclusions

We have presented BMGCN, a Bayesian deep generative model for the integrative analysis of multi-omics data with censored survival outcomes. By combining shared and view-specific latent variables, graph-convolutional encoders, a zero-inflated likelihood, and a Cox survival head within a unified probabilistic framework, BMGCN is tailored to the unique challenges of modern biomedical data. Its design prioritizes flexibility, interpretability, and statistical rigor, providing a powerful new tool for uncovering the complex molecular underpinnings of disease and for improving patient outcome prediction.

Appendix A. Supplementary Mathematical Details: Full Derivations

This appendix provides a complete and rigorous mathematical derivation of the core components of the Bayesian Multi-view Graph Convolutional Network (BMGCN). We derive the Evidence Lower Bound (ELBO), its decomposition into constituent terms, the gradients for optimization, and the closed-form solutions for key quantities.

Appendix A.1. Derivation of the Evidence Lower Bound (ELBO)

The goal of variational inference is to approximate the true, intractable posterior distribution $p(Z, \{Z^{(k)}\}, \gamma, \pi \mid X, T, \Delta)$ with a tractable variational distribution $q(Z, \{Z^{(k)}\}, \gamma, \pi)$. We assume a mean-field factorization for $q$ as specified in the main text:
$$q(Z, \{Z^{(k)}\}, \gamma, \pi) = q(Z)\, q(\{Z^{(k)}\})\, q(\gamma)\, q(\pi) = \prod_{i=1}^{N} q(Z_i) \prod_{i=1}^{N} \prod_{k=1}^{K} q(Z_i^{(k)}) \prod_{k=1}^{K} q(\gamma_k) \prod_{k=1}^{K} q(\pi_k)$$
The quality of this approximation is measured by the Kullback-Leibler (KL) divergence between $q$ and $p$:
$$\mathrm{KL}(q \,\|\, p) = \mathbb{E}_q\big[\log q(Z, \{Z^{(k)}\}, \gamma, \pi)\big] - \mathbb{E}_q\big[\log p(Z, \{Z^{(k)}\}, \gamma, \pi \mid X, T, \Delta)\big]$$
By Bayes’ theorem, the log-posterior is:
$$\log p(Z, \{Z^{(k)}\}, \gamma, \pi \mid X, T, \Delta) = \log p(X, T, \Delta, Z, \{Z^{(k)}\}, \gamma, \pi) - \log p(X, T, \Delta)$$
Substituting this into the KL divergence:
$$\mathrm{KL}(q \,\|\, p) = \mathbb{E}_q[\log q] - \mathbb{E}_q\big[\log p(X, T, \Delta, Z, \{Z^{(k)}\}, \gamma, \pi)\big] + \log p(X, T, \Delta)$$
Rearranging terms, we get:
$$\log p(X, T, \Delta) = \mathbb{E}_q\big[\log p(X, T, \Delta, Z, \{Z^{(k)}\}, \gamma, \pi)\big] - \mathbb{E}_q\big[\log q(Z, \{Z^{(k)}\}, \gamma, \pi)\big] + \mathrm{KL}(q \,\|\, p)$$
Since $\mathrm{KL}(q \,\|\, p) \geq 0$, the first two terms on the right-hand side form a lower bound on the log-evidence (marginal likelihood) $\log p(X, T, \Delta)$. This is the Evidence Lower Bound (ELBO), denoted by $\mathcal{L}$:
$$\mathcal{L} = \mathbb{E}_q\big[\log p(X, T, \Delta, Z, \{Z^{(k)}\}, \gamma, \pi)\big] - \mathbb{E}_q\big[\log q(Z, \{Z^{(k)}\}, \gamma, \pi)\big]$$
This is Equation (15) from the main text. The ELBO is maximized with respect to the parameters of $q$ to find the best approximation to the true posterior.

Appendix A.2. Decomposition of the ELBO

We now decompose the ELBO $\mathcal{L}$ into its constituent parts: the reconstruction term ($\mathcal{L}_{\text{recon}}$), the survival term ($\mathcal{L}_{\text{surv}}$), and the KL divergence regularizer ($\mathcal{L}_{\text{KL}}$).
Starting from the joint distribution defined in Equation (10):
$$p(X, T, \Delta, Z, \{Z^{(k)}\}, \gamma, \pi) = \prod_{i=1}^{N} \left[ p(T_i, \Delta_i \mid \eta_i) \prod_{k=1}^{K} p(X_i^{(k)} \mid Z_i, Z_i^{(k)})\, p(Z_i^{(k)} \mid \gamma_k)\; p(Z_i) \right] \times \prod_{k=1}^{K} p(\gamma_k \mid \pi_k)\, p(\pi_k)$$
Taking the expectation under $q$:
$$\mathbb{E}_q\big[\log p(X, T, \Delta, Z, \{Z^{(k)}\}, \gamma, \pi)\big] = \mathbb{E}_q\!\left[\sum_{i=1}^{N} \log p(T_i, \Delta_i \mid \eta_i)\right] + \mathbb{E}_q\!\left[\sum_{i=1}^{N} \sum_{k=1}^{K} \log p(X_i^{(k)} \mid Z_i, Z_i^{(k)})\right] + \mathbb{E}_q\!\left[\sum_{i=1}^{N} \sum_{k=1}^{K} \log p(Z_i^{(k)} \mid \gamma_k)\right] + \mathbb{E}_q\!\left[\sum_{i=1}^{N} \log p(Z_i)\right] + \mathbb{E}_q\!\left[\sum_{k=1}^{K} \log p(\gamma_k \mid \pi_k)\right] + \mathbb{E}_q\!\left[\sum_{k=1}^{K} \log p(\pi_k)\right]$$
The second term in the ELBO is the entropy of the variational distribution:
$$\mathbb{E}_q\big[\log q(Z, \{Z^{(k)}\}, \gamma, \pi)\big] = \mathbb{E}_q\!\left[\sum_{i=1}^{N} \log q(Z_i)\right] + \mathbb{E}_q\!\left[\sum_{i=1}^{N} \sum_{k=1}^{K} \log q(Z_i^{(k)})\right] + \mathbb{E}_q\!\left[\sum_{k=1}^{K} \log q(\gamma_k)\right] + \mathbb{E}_q\!\left[\sum_{k=1}^{K} \log q(\pi_k)\right]$$
Substituting these into the ELBO $\mathcal{L}$:
$$\mathcal{L} = \underbrace{\mathbb{E}_q\!\left[\sum_{i=1}^{N} \sum_{k=1}^{K} \log p(X_i^{(k)} \mid Z_i, Z_i^{(k)})\right]}_{\mathcal{L}_{\text{recon}}} + \underbrace{\mathbb{E}_q\!\left[\sum_{i=1}^{N} \log p(T_i, \Delta_i \mid \eta_i)\right]}_{\mathcal{L}_{\text{surv}}} - \underbrace{\left( \mathbb{E}_q\!\left[\sum_{i=1}^{N} \log q(Z_i)\right] - \mathbb{E}_q\!\left[\sum_{i=1}^{N} \log p(Z_i)\right] \right)}_{\text{KL for } Z_i} - \underbrace{\left( \mathbb{E}_q\!\left[\sum_{i=1}^{N} \sum_{k=1}^{K} \log q(Z_i^{(k)})\right] - \mathbb{E}_q\!\left[\sum_{i=1}^{N} \sum_{k=1}^{K} \log p(Z_i^{(k)} \mid \gamma_k)\right] \right)}_{\text{KL for } Z_i^{(k)}} - \underbrace{\left( \mathbb{E}_q\!\left[\sum_{k=1}^{K} \log q(\gamma_k)\right] - \mathbb{E}_q\!\left[\sum_{k=1}^{K} \log p(\gamma_k \mid \pi_k)\right] \right)}_{\text{KL for } \gamma_k} - \underbrace{\left( \mathbb{E}_q\!\left[\sum_{k=1}^{K} \log q(\pi_k)\right] - \mathbb{E}_q\!\left[\sum_{k=1}^{K} \log p(\pi_k)\right] \right)}_{\text{KL for } \pi_k}$$
This can be grouped as follows:
$$\mathcal{L}_{\text{recon}} = \mathbb{E}_q\!\left[\sum_{i=1}^{N} \sum_{k=1}^{K} \log p(X_i^{(k)} \mid Z_i, Z_i^{(k)})\right] \quad (\text{Reconstruction Term})$$
$$\mathcal{L}_{\text{surv}} = \mathbb{E}_q\!\left[\sum_{i=1}^{N} \log p(T_i, \Delta_i \mid \eta_i)\right] = \mathbb{E}_q\!\left[\sum_{i=1}^{N} \Delta_i \left( \eta_i - \log \sum_{j \in R_i} e^{\eta_j} \right)\right] \quad (\text{Survival Term})$$
$$\mathcal{L}_{\text{KL}} = \sum_{i=1}^{N} \mathrm{KL}\big(q(Z_i) \,\|\, p(Z_i)\big) + \sum_{i=1}^{N} \sum_{k=1}^{K} \mathbb{E}_{q(\gamma_k)}\Big[\mathrm{KL}\big(q(Z_i^{(k)}) \,\|\, p(Z_i^{(k)} \mid \gamma_k)\big)\Big] + \sum_{k=1}^{K} \mathrm{KL}\big(q(\gamma_k) \,\|\, p(\gamma_k \mid \pi_k)\big) + \sum_{k=1}^{K} \mathrm{KL}\big(q(\pi_k) \,\|\, p(\pi_k)\big)$$
This confirms the decomposition $\mathcal{L} = \mathcal{L}_{\text{recon}} + \mathcal{L}_{\text{surv}} - \mathcal{L}_{\text{KL}}$ from Equation (16). For simplicity in the main text, the priors on $\gamma_k$ and $\pi_k$ are often absorbed or their KL terms are handled implicitly, leading to the simplified $\mathcal{L}_{\text{KL}}$ shown in Equations (19)-(21).

Appendix A.3. Zero-Inflated Gaussian Log-Likelihood and Gradients

The probability density function (PDF) of the zero-inflated Gaussian (ZIG) distribution for a single feature $j$ in view $k$ for subject $i$ is:
$$p(X_{ij}^{(k)} \mid \mu_{ij}^{(k)}, \sigma_k^2, \rho_{ij}^{(k)}) = \rho_{ij}^{(k)} \cdot \delta_0(X_{ij}^{(k)}) + (1 - \rho_{ij}^{(k)}) \cdot \mathcal{N}(X_{ij}^{(k)} \mid \mu_{ij}^{(k)}, \sigma_k^2)$$
where $\delta_0(x)$ is the Dirac delta function, which is infinite at $x = 0$ and zero elsewhere, with the property that $\int \delta_0(x)\, dx = 1$.
The log-likelihood for an observation $x$ is piecewise, as it must handle the discrete mass at zero and the continuous density elsewhere:
$$\log p(x \mid \mu, \sigma^2, \rho) = \begin{cases} \log\big(\rho + (1 - \rho) \cdot \mathcal{N}(0 \mid \mu, \sigma^2)\big) & \text{if } x = 0 \\ \log(1 - \rho) + \log \mathcal{N}(x \mid \mu, \sigma^2) & \text{if } x \neq 0 \end{cases}$$
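A direct NumPy transcription of this piecewise log-likelihood is given below; the vectorized form evaluates both branches and selects by the zero mask. Parameter names mirror the text and are otherwise illustrative.

```python
import numpy as np

def zig_log_likelihood(x, mu, sigma2, rho):
    """Element-wise log p(x | mu, sigma2, rho) under the zero-inflated Gaussian."""
    log_normal = -0.5 * np.log(2 * np.pi * sigma2) - (x - mu) ** 2 / (2 * sigma2)
    # Gaussian density evaluated at zero, N(0 | mu, sigma2)
    normal_at_zero = np.exp(-0.5 * np.log(2 * np.pi * sigma2) - mu ** 2 / (2 * sigma2))
    zero_case = np.log(rho + (1 - rho) * normal_at_zero)
    nonzero_case = np.log(1 - rho) + log_normal
    return np.where(x == 0, zero_case, nonzero_case)
```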
To optimize the model, we need the gradients of this log-likelihood with respect to its parameters $\mu$, $\sigma^2$, and $\rho$.
Case 1: $x = 0$. Let $C = \rho + (1 - \rho) \cdot \mathcal{N}(0 \mid \mu, \sigma^2)$.
$$\frac{\partial}{\partial \mu}\big[\log C\big] = \frac{1}{C} \cdot (1 - \rho) \cdot \frac{\partial}{\partial \mu}\big[\mathcal{N}(0 \mid \mu, \sigma^2)\big] = \frac{1}{C} \cdot (1 - \rho) \cdot \mathcal{N}(0 \mid \mu, \sigma^2) \cdot \frac{(0 - \mu)}{\sigma^2}$$
$$\frac{\partial}{\partial \sigma^2}\big[\log C\big] = \frac{1}{C} \cdot (1 - \rho) \cdot \frac{\partial}{\partial \sigma^2}\big[\mathcal{N}(0 \mid \mu, \sigma^2)\big] = \frac{1}{C} \cdot (1 - \rho) \cdot \mathcal{N}(0 \mid \mu, \sigma^2) \cdot \left( \frac{(0 - \mu)^2}{2\sigma^4} - \frac{1}{2\sigma^2} \right)$$
$$\frac{\partial}{\partial \rho}\big[\log C\big] = \frac{1}{C} \cdot \big(1 - \mathcal{N}(0 \mid \mu, \sigma^2)\big)$$
Case 2: $x \neq 0$.
$$\frac{\partial}{\partial \mu}\big[\log(1 - \rho) + \log \mathcal{N}(x \mid \mu, \sigma^2)\big] = \frac{\partial}{\partial \mu}\big[\log \mathcal{N}(x \mid \mu, \sigma^2)\big] = \frac{(x - \mu)}{\sigma^2}$$
$$\frac{\partial}{\partial \sigma^2}\big[\log(1 - \rho) + \log \mathcal{N}(x \mid \mu, \sigma^2)\big] = \frac{\partial}{\partial \sigma^2}\big[\log \mathcal{N}(x \mid \mu, \sigma^2)\big] = \frac{(x - \mu)^2}{2\sigma^4} - \frac{1}{2\sigma^2}$$
$$\frac{\partial}{\partial \rho}\big[\log(1 - \rho) + \log \mathcal{N}(x \mid \mu, \sigma^2)\big] = -\frac{1}{1 - \rho}$$
These gradients are used in the backpropagation algorithm to update the parameters of the decoder network $f_{\psi^{(k)}}$.

Appendix A.4. Gradient of the Cox Partial Log-Likelihood

The Cox partial log-likelihood for the entire dataset is:
$$\ell(\eta) = \sum_{i=1}^{N} \Delta_i \left( \eta_i - \log \sum_{j \in R_i} e^{\eta_j} \right)$$
where $R_i = \{ j : T_j \geq T_i \}$ is the risk set for subject $i$.
We derive the gradient of $\ell(\eta)$ with respect to the risk score $\eta_m$ for a specific subject $m$.
The gradient $\partial \ell / \partial \eta_m$ receives contributions from two places in the summation:
  • When $i = m$: if subject $m$ experienced the event ($\Delta_m = 1$), then $\eta_m$ appears directly in the first term.
  • For any subject $i$ for whom subject $m$ is in the risk set (i.e., $m \in R_i$), $\eta_m$ appears in the log-sum-exp term $\log\big(\sum_{j \in R_i} e^{\eta_j}\big)$.
Therefore:
$$\frac{\partial \ell}{\partial \eta_m} = \Delta_m \cdot 1 - \sum_{i:\, m \in R_i} \Delta_i \cdot \frac{1}{\sum_{j \in R_i} e^{\eta_j}} \cdot e^{\eta_m} = \Delta_m - \sum_{i:\, m \in R_i} \Delta_i \cdot \frac{e^{\eta_m}}{\sum_{j \in R_i} e^{\eta_j}}$$
This gradient is used to update the parameters of the survival head network $g_\theta$.
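For concreteness, the derived gradient can be implemented directly as below (NumPy; names are illustrative). In practice the same quantity is obtained automatically by backpropagating through the partial log-likelihood.

```python
import numpy as np

def cox_partial_gradient(eta, times, events):
    """Gradient of the Cox partial log-likelihood with respect to each risk score eta_m."""
    N = len(eta)
    risk = np.exp(eta)
    grad = events.astype(float)          # Delta_m contribution
    for i in range(N):
        if events[i] == 1:
            in_risk_set = times >= times[i]              # R_i = {j : T_j >= T_i}
            denom = risk[in_risk_set].sum()
            # subtract e^{eta_m} / sum_{j in R_i} e^{eta_j} for every m in R_i
            grad[in_risk_set] -= risk[in_risk_set] / denom
    return grad
```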

Appendix A.5. KL Divergence for Gaussian Distributions

We derive the closed-form solution for the KL divergence between two multivariate Gaussian distributions, which is used for the latent variables $Z_i$ and $Z_i^{(k)}$.
Let $q(z) = \mathcal{N}(\mu_q, \Sigma_q)$ and $p(z) = \mathcal{N}(\mu_p, \Sigma_p)$. The KL divergence is:
$$\mathrm{KL}(q \,\|\, p) = \mathbb{E}_q[\log q(z)] - \mathbb{E}_q[\log p(z)]$$
The expectation under $q$ of the log-density of a multivariate Gaussian $\mathcal{N}(\mu, \Sigma)$ is:
$$\mathbb{E}_q\big[\log \mathcal{N}(z \mid \mu, \Sigma)\big] = -\frac{d}{2}\log(2\pi) - \frac{1}{2}\log|\Sigma| - \frac{1}{2}\mathbb{E}_q\big[(z - \mu)^\top \Sigma^{-1} (z - \mu)\big]$$
For $q(z)$, this is:
$$\mathbb{E}_q[\log q(z)] = -\frac{d}{2}\log(2\pi) - \frac{1}{2}\log|\Sigma_q| - \frac{1}{2}\mathbb{E}_q\big[(z - \mu_q)^\top \Sigma_q^{-1} (z - \mu_q)\big] = -\frac{d}{2}\log(2\pi) - \frac{1}{2}\log|\Sigma_q| - \frac{1}{2}\mathrm{trace}(\Sigma_q^{-1}\Sigma_q) = -\frac{d}{2}\log(2\pi) - \frac{1}{2}\log|\Sigma_q| - \frac{d}{2}$$
because $\mathbb{E}_q\big[(z - \mu_q)^\top \Sigma_q^{-1} (z - \mu_q)\big] = \mathrm{trace}\big(\Sigma_q^{-1}\,\mathbb{E}_q[(z - \mu_q)(z - \mu_q)^\top]\big) = \mathrm{trace}(\Sigma_q^{-1}\Sigma_q) = \mathrm{trace}(I) = d$.
For $p(z)$, assuming a standard normal prior $p(z) = \mathcal{N}(0, I)$ (so $\mu_p = 0$, $\Sigma_p = I$), we have:
$$\mathbb{E}_q[\log p(z)] = -\frac{d}{2}\log(2\pi) - \frac{1}{2}\log|I| - \frac{1}{2}\mathbb{E}_q[z^\top z] = -\frac{d}{2}\log(2\pi) - \frac{1}{2}\mathbb{E}_q\!\left[\sum_{l=1}^{d} z_l^2\right] = -\frac{d}{2}\log(2\pi) - \frac{1}{2}\sum_{l=1}^{d} \mathbb{E}_q[z_l^2]$$
Since $\mathbb{E}_q[z_l^2] = \mathrm{Var}_q(z_l) + (\mathbb{E}_q[z_l])^2 = \sigma_{q,l}^2 + \mu_{q,l}^2$, we get:
$$\mathbb{E}_q[\log p(z)] = -\frac{d}{2}\log(2\pi) - \frac{1}{2}\sum_{l=1}^{d} \big(\mu_{q,l}^2 + \sigma_{q,l}^2\big)$$
Subtracting Equation (A28) from Equation (A26):
$$\mathrm{KL}(q \,\|\, p) = -\frac{1}{2}\log|\Sigma_q| - \frac{d}{2} + \frac{1}{2}\sum_{l=1}^{d} \big(\mu_{q,l}^2 + \sigma_{q,l}^2\big) = \frac{1}{2}\sum_{l=1}^{d} \big(\mu_{q,l}^2 + \sigma_{q,l}^2 - \log(\sigma_{q,l}^2) - 1\big)$$
where we have used the fact that for a diagonal covariance matrix $\Sigma_q = \mathrm{diag}(\sigma_{q,1}^2, \dots, \sigma_{q,d}^2)$, the determinant is $|\Sigma_q| = \prod_{l=1}^{d} \sigma_{q,l}^2$, and thus $\log|\Sigma_q| = \sum_{l=1}^{d} \log(\sigma_{q,l}^2)$.
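As a quick sanity check, the closed form can be compared against a Monte Carlo estimate of $\mathbb{E}_q[\log q(z) - \log p(z)]$; the parameter values in this NumPy sketch are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
mu_q = np.array([0.5, -1.0, 0.2])
sigma2_q = np.array([0.8, 1.5, 0.3])

# Closed-form KL( N(mu_q, diag(sigma2_q)) || N(0, I) )
closed_form = 0.5 * np.sum(mu_q**2 + sigma2_q - np.log(sigma2_q) - 1.0)

# Monte Carlo estimate of E_q[log q(z) - log p(z)]
z = mu_q + np.sqrt(sigma2_q) * rng.normal(size=(200_000, 3))
log_q = -0.5 * np.sum(np.log(2 * np.pi * sigma2_q) + (z - mu_q)**2 / sigma2_q, axis=1)
log_p = -0.5 * np.sum(np.log(2 * np.pi) + z**2, axis=1)
monte_carlo = np.mean(log_q - log_p)

print(closed_form, monte_carlo)   # the two estimates should agree closely
```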

Appendix A.6. KL Divergence for the Spike-and-Slab Prior

For the view-specific latent $Z_i^{(k)}$, the prior is a spike-and-slab:
$$p(Z_i^{(k)} \mid \gamma_k) = (1 - \gamma_k)\,\delta_0(Z_i^{(k)}) + \gamma_k\,\mathcal{N}(Z_i^{(k)} \mid 0, I)$$
The variational posterior is a Gaussian: $q(Z_i^{(k)}) = \mathcal{N}\big(\mu_q^{(k)}, \mathrm{diag}((\sigma_q^{(k)})^2)\big)$.
The KL divergence $\mathrm{KL}\big(q(Z_i^{(k)}) \,\|\, p(Z_i^{(k)} \mid \gamma_k)\big)$ is not standard because the prior is a mixture. We handle this by taking the expectation over the discrete variable $\gamma_k$:
$$\mathbb{E}_{q(\gamma_k)}\Big[\mathrm{KL}\big(q(Z_i^{(k)}) \,\|\, p(Z_i^{(k)} \mid \gamma_k)\big)\Big] = \mathbb{E}_{q(\gamma_k)}\!\left[ \int q(Z_i^{(k)}) \log \frac{q(Z_i^{(k)})}{p(Z_i^{(k)} \mid \gamma_k)}\, dZ_i^{(k)} \right] = q(\gamma_k = 0) \cdot \mathrm{KL}\big(q(Z_i^{(k)}) \,\|\, \delta_0\big) + q(\gamma_k = 1) \cdot \mathrm{KL}\big(q(Z_i^{(k)}) \,\|\, \mathcal{N}(0, I)\big)$$
The term $\mathrm{KL}\big(q(Z_i^{(k)}) \,\|\, \delta_0\big)$ is problematic because the delta function is not a density with respect to the Lebesgue measure. In practice, as noted in the main text, when $\gamma_k = 0$ the prior forces $Z_i^{(k)} = 0$. The model therefore approximates this term by applying the KL penalty against the Gaussian slab only when the view is included. This leads to the practical approximation:
$$\mathbb{E}_{q(\gamma_k)}\Big[\mathrm{KL}\big(q(Z_i^{(k)}) \,\|\, p(Z_i^{(k)} \mid \gamma_k)\big)\Big] \approx \tau_k \cdot \mathrm{KL}\big(q(Z_i^{(k)}) \,\|\, \mathcal{N}(0, I)\big)$$
where $\tau_k = q(\gamma_k = 1)$. This approximation is used in Equation (20).

Appendix A.7. Derivation of the Variational Objective for the Spike-and-Slab Prior

The exact KL divergence for the view-specific latent, marginalized over $\gamma_k$, is:
$$\mathrm{KL}\big(q(Z_i^{(k)}) \,\|\, p(Z_i^{(k)})\big) = \int q(Z_i^{(k)}) \log \frac{q(Z_i^{(k)})}{p(Z_i^{(k)})}\, dZ_i^{(k)} = -H\big(q(Z_i^{(k)})\big) - \mathbb{E}_{q(Z_i^{(k)})}\Big[\log\big((1 - \pi_k)\,\delta_0(Z_i^{(k)}) + \pi_k\,\mathcal{N}(Z_i^{(k)} \mid 0, I)\big)\Big]$$
where $H\big(q(Z_i^{(k)})\big)$ is the entropy of the variational distribution and the marginal prior $p(Z_i^{(k)}) = \sum_{\gamma_k \in \{0,1\}} p(Z_i^{(k)} \mid \gamma_k)\, p(\gamma_k)$ has been written out as the spike-and-slab mixture.
The intractability arises from the log of a sum inside the expectation. The common variational approach is to introduce an auxiliary variational distribution $r(\gamma_k)$ for the discrete variable and derive a lower bound. However, in the context of the overall ELBO, this can become very complex.
The approximation used in the paper, $\mathbb{E}_{q(\gamma_k)}\big[\mathrm{KL}\big(q(Z_i^{(k)}) \,\|\, p(Z_i^{(k)} \mid \gamma_k)\big)\big]$, is itself a variational approximation that treats $\gamma_k$ as independent. This is known as the “fully-factorized” or “mean-field” assumption across the discrete and continuous latents. The approximation in Equation (A31) (using $\tau_k \cdot \mathrm{KL}(q \,\|\, \mathcal{N})$) is a further simplification that ignores the KL divergence to the spike (delta function), because it is infinite for any non-degenerate $q$. This is justified by the model’s structure: when $\gamma_k = 0$, the survival head $g_\theta$ receives $0$, so the model is incentivized to set $q(Z_i^{(k)})$ to $\delta_0$ to minimize the reconstruction loss, making the infinite KL penalty moot. The $\tau_k$ weighting then acts as a regularizer only when the view is likely to be included.

Appendix A.8. Derivation of the Optimal Variational Distribution for $\pi_k$

The mean-field factorization above includes a variational distribution $q(\pi_k)$; here we derive its optimal form. Because the model is conjugate in $\pi_k$, a closed-form update is available.
From the joint distribution in Equation (A7), the conditional posterior for $\pi_k$ (before variational approximation) is proportional to:
$$p(\pi_k \mid \gamma_k) \propto p(\gamma_k \mid \pi_k)\, p(\pi_k) \propto \pi_k^{\gamma_k}(1 - \pi_k)^{1 - \gamma_k} \cdot \pi_k^{\alpha_0 - 1}(1 - \pi_k)^{\beta_0 - 1} = \pi_k^{\alpha_0 + \gamma_k - 1}(1 - \pi_k)^{\beta_0 + 1 - \gamma_k - 1}$$
This shows that the conditional posterior for $\pi_k$ is $\mathrm{Beta}(\alpha_0 + \gamma_k, \beta_0 + 1 - \gamma_k)$.
In variational inference, the optimal $q(\pi_k)$ that maximizes the ELBO is the one that minimizes the KL divergence to the true conditional posterior. Under the mean-field assumption, the optimal $q(\pi_k)$ satisfies:
$$q^*(\pi_k) \propto \exp\Big( \mathbb{E}_{q(\gamma_k)}\big[\log p(\pi_k \mid \gamma_k)\big] \Big) \propto \exp\Big( \mathbb{E}_{q(\gamma_k)}\big[\log \mathrm{Beta}(\pi_k \mid \alpha_0 + \gamma_k, \beta_0 + 1 - \gamma_k)\big] \Big)$$
Since the expectation of $\gamma_k$ under $q$ is $\tau_k$, we can substitute to find:
$$q^*(\pi_k) \propto \pi_k^{\alpha_0 + \tau_k - 1}(1 - \pi_k)^{\beta_0 + 1 - \tau_k - 1}, \quad \text{i.e., } q^*(\pi_k) = \mathrm{Beta}(\pi_k \mid \alpha_0 + \tau_k, \beta_0 + 1 - \tau_k)$$
Therefore, the optimal variational distribution for $\pi_k$ is $q(\pi_k) = \mathrm{Beta}(\alpha_0 + \tau_k, \beta_0 + 1 - \tau_k)$. The corresponding KL divergence term $\mathrm{KL}\big(q(\pi_k) \,\|\, p(\pi_k)\big)$ in Equation (A13) can then be computed in closed form using the KL divergence between two Beta distributions.
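A small SciPy sketch of this update and of the resulting $\mathrm{KL}\big(q(\pi_k) \,\|\, p(\pi_k)\big)$ term follows; the closed-form KL between two Beta distributions uses the standard log-Beta/digamma expression, and the numeric values are illustrative.

```python
import numpy as np
from scipy.special import betaln, digamma

def q_pi_update(tau_k, alpha0, beta0):
    """Optimal variational Beta parameters for pi_k given tau_k = q(gamma_k = 1)."""
    return alpha0 + tau_k, beta0 + 1.0 - tau_k

def kl_beta(a1, b1, a2, b2):
    """Closed-form KL( Beta(a1, b1) || Beta(a2, b2) )."""
    return (betaln(a2, b2) - betaln(a1, b1)
            + (a1 - a2) * digamma(a1) + (b1 - b2) * digamma(b1)
            + (a2 + b2 - a1 - b1) * digamma(a1 + b1))

a, b = q_pi_update(tau_k=0.9, alpha0=1.0, beta0=1.0)
print(kl_beta(a, b, 1.0, 1.0))   # KL(q(pi_k) || p(pi_k)) for a Beta(1, 1) prior
```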

Appendix A.9. Graph Convolutional Layer: Signal Processing Interpretation

The propagation rule $H^{(l+1)} = \mathrm{ReLU}\big(\hat{A}^{(k)} H^{(l)} W^{(l)}\big)$ can be analyzed from a signal-processing perspective on graphs.
The normalized adjacency matrix $\hat{A}^{(k)}$ can be decomposed via its eigenvectors: $\hat{A}^{(k)} = U \Lambda U^\top$, where $\Lambda$ is a diagonal matrix of eigenvalues and $U$ is the matrix of eigenvectors. This decomposition defines the graph Fourier transform.
A graph convolution in the spatial domain (like the GCN layer) is equivalent to a multiplication in the spectral domain. The operation $\hat{A}^{(k)} H^{(l)}$ is a form of low-pass filter: it smooths the feature representation of each node by averaging it with the features of its neighbors. The eigenvalues in $\Lambda$ lie in the range $[-1, 1]$, and the largest eigenvalues correspond to the lowest frequencies (the most global, smoothest signals on the graph).
The weight matrix W ( l ) then acts as a learnable filter that can amplify or attenuate different frequency components. The ReLU non-linearity introduces the capacity to learn more complex, non-linear feature interactions.
This mathematical interpretation justifies why GCNs are effective for biological data: they promote smoothness of the latent representations with respect to the prior biological network, meaning that genes/proteins that are known to interact will have similar representations in the latent space, leading to more biologically plausible and robust models.

Appendix A.10. Hard-Concrete Relaxation for $\gamma_k$

The main text mentions using a “hard-concrete relaxation with a straight-through gradient estimator” for γ k . Here, we provide the mathematical formulation of this trick.
The binary variable $\gamma_k \in \{0, 1\}$ is replaced during training with a continuous relaxation $\tilde{\gamma}_k \in [0, 1]$ sampled from a Hard-Concrete distribution. The sampling process is:
$$u \sim \mathrm{Uniform}(0, 1), \qquad s = \sigma\!\left(\frac{\log u - \log(1 - u) + \log \alpha}{\beta}\right), \qquad \tilde{s} = s \cdot (\gamma_{\max} - \gamma_{\min}) + \gamma_{\min}, \qquad \tilde{\gamma}_k = \min\big(1, \max(0, \tilde{s})\big)$$
where $\alpha$ is the location parameter (often derived from the logit of $\tau_k$), $\beta$ is a temperature parameter, and $[\gamma_{\min}, \gamma_{\max}]$ is an interval that slightly stretches the output beyond $[0, 1]$ (e.g., $[-0.1, 1.1]$).
During the forward pass, $\tilde{\gamma}_k$ is used. For the backward pass (gradient computation), the “straight-through” estimator is used: the gradient is computed as if the operation were an identity function, i.e., if $y = \mathrm{hard\_concrete}(x)$, then $\partial y / \partial x = 1$.
This allows gradients to flow through the discrete sampling process, enabling end-to-end training. As training progresses, the temperature β is annealed to zero, causing the Hard-Concrete distribution to collapse to a Bernoulli distribution, effectively discretizing γ ˜ k back to γ k .
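A PyTorch sketch of this sampler is shown below. Combining the hard clamp with a straight-through step can be done in several ways; here we binarize in the forward pass and let gradients flow through the soft gate, which is one common variant and an assumption rather than the paper's reference implementation.

```python
import torch

def sample_hard_concrete(log_alpha, beta=0.5, gamma_min=-0.1, gamma_max=1.1):
    """Draw a relaxed gate in [0, 1]; gradients flow through the soft sample."""
    u = torch.rand_like(log_alpha).clamp(1e-6, 1 - 1e-6)
    s = torch.sigmoid((torch.log(u) - torch.log(1 - u) + log_alpha) / beta)
    s_stretched = s * (gamma_max - gamma_min) + gamma_min
    gate = s_stretched.clamp(0.0, 1.0)                    # hard clamp to [0, 1]
    # straight-through variant: hard 0/1 value in the forward pass,
    # gradient of the soft gate in the backward pass
    hard = (gate > 0.5).float()
    return hard + (gate - gate.detach())

# Example: one gate per view, with learnable logits
log_alpha = torch.zeros(3, requires_grad=True)
print(sample_hard_concrete(log_alpha))
```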

Appendix A.11. Derivation of Gradients for Model Parameters

The preceding subsections address gradients of the ELBO with respect to the variational parameters. It is also crucial to derive the gradients with respect to the model parameters $\psi$ (decoder) and $\theta$ (survival head).
The ELBO is:
$$\mathcal{L} = \mathbb{E}_q\big[\log p(X, T, \Delta, Z, \{Z^{(k)}\}, \gamma)\big] - \mathbb{E}_q\big[\log q(Z, \{Z^{(k)}\}, \gamma)\big]$$
The second term (the entropy of $q$) does not depend on the model parameters $\psi, \theta$. Therefore, the gradient with respect to $\psi$ and $\theta$ comes only from the first term, the expected joint log-likelihood:
$$\nabla_{\psi, \theta}\, \mathcal{L} = \nabla_{\psi, \theta}\, \mathbb{E}_q\big[\log p(X, T, \Delta, Z, \{Z^{(k)}\}, \gamma)\big]$$
This expectation can be broken down as shown in Equation (A8). The gradients for the reconstruction term $\mathcal{L}_{\text{recon}}$ with respect to $\psi$ are derived in Appendix A.3. The gradients for the survival term $\mathcal{L}_{\text{surv}}$ with respect to $\theta$ are derived in Appendix A.4.
The key point is that these gradients can be estimated using Monte Carlo sampling from the variational distribution q, combined with the reparameterization trick for the Gaussian latents and the straight-through estimator for the discrete latents, making the entire model trainable via standard stochastic gradient descent.

Appendix A.12. Reparameterization Trick for Gaussian Latents

To obtain low-variance, unbiased gradient estimates for the Gaussian latent variables, we use the reparameterization trick.
For a latent variable $z \sim \mathcal{N}(\mu, \Sigma)$, instead of sampling $z$ directly, we sample an auxiliary noise variable $\epsilon \sim \mathcal{N}(0, I)$ and compute $z = \mu + \Sigma^{1/2}\epsilon$, where $\Sigma^{1/2}$ is the matrix square root. For a diagonal covariance $\Sigma = \mathrm{diag}(\sigma_1^2, \dots, \sigma_d^2)$, this simplifies to element-wise multiplication: $z = \mu + \sigma \odot \epsilon$, where $\odot$ denotes the Hadamard (element-wise) product.
The gradient of an expectation $\mathbb{E}_{q(z)}[f(z)]$ with respect to $\mu$ and $\sigma$ can then be computed as:
$$\nabla_\mu\, \mathbb{E}_{q(z)}[f(z)] = \mathbb{E}_\epsilon\!\left[ \frac{\partial f(z)}{\partial z} \cdot \frac{\partial z}{\partial \mu} \right] = \mathbb{E}_\epsilon\!\left[ \frac{\partial f(z)}{\partial z} \right]$$
$$\nabla_\sigma\, \mathbb{E}_{q(z)}[f(z)] = \mathbb{E}_\epsilon\!\left[ \frac{\partial f(z)}{\partial z} \cdot \frac{\partial z}{\partial \sigma} \right] = \mathbb{E}_\epsilon\!\left[ \frac{\partial f(z)}{\partial z} \odot \epsilon \right]$$
This allows gradients to flow through the sampling process, enabling efficient optimization via stochastic gradient descent.
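A minimal PyTorch illustration, assuming a diagonal Gaussian with learnable mean and log-scale and an arbitrary downstream objective:

```python
import torch

mu = torch.zeros(4, requires_grad=True)
log_sigma = torch.zeros(4, requires_grad=True)

eps = torch.randn(4)                       # auxiliary noise, epsilon ~ N(0, I)
z = mu + torch.exp(log_sigma) * eps        # z = mu + sigma * eps (element-wise)

loss = (z ** 2).sum()                      # stands in for any downstream objective f(z)
loss.backward()
print(mu.grad, log_sigma.grad)             # gradients flow back through the sampling step
```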

References

  1. Ritchie, M.D.; Holzinger, E.R.; Li, R.; Pendergrass, S.A.; Kim, D. Methods of integrating data to uncover genotype–phenotype interactions. Nature Reviews Genetics 2015, 16, 85–97. [Google Scholar] [CrossRef]
  2. Lambert, D. Zero-inflated Poisson regression, with an application to defects in manufacturing. Technometrics 1992, 34, 1–14. [Google Scholar] [CrossRef]
  3. Cox, D.R. Regression models and life-tables. Journal of the Royal Statistical Society: Series B (Methodological) 1972, 34, 187–202. [Google Scholar] [CrossRef]
  4. George, E.I.; McCulloch, R.E. Variable selection via Gibbs sampling. Journal of the American Statistical Association 1993, 88, 881–889. [Google Scholar] [CrossRef]
  5. Kipf, T.N.; Welling, M. Semi-supervised classification with graph convolutional networks. In Proceedings of the International Conference on Learning Representations (ICLR), 2017.
  6. Kingma, D.P.; Welling, M. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114 2013. [CrossRef]
  7. Rezende, D.J.; Mohamed, S.; Wierstra, D. Stochastic backpropagation and approximate inference in deep generative models. arXiv preprint arXiv:1401.4082 2014. [CrossRef]
  8. Jang, E.; Gu, S.; Poole, B. Categorical reparameterization with Gumbel-Softmax. In Proceedings of the International Conference on Learning Representations (ICLR), 2017.
  9. Maddison, C.J.; Mnih, A.; Teh, Y.W. The concrete distribution: A continuous relaxation of discrete random variables. In Proceedings of the International Conference on Learning Representations (ICLR), 2017.
  10. Bowman, S.R.; Vilnis, L.; Vinyals, O.; Dai, A.M.; Jozefowicz, R.; Bengio, S. Generating sentences from a continuous space. arXiv preprint arXiv:1511.06349 2015. [CrossRef]
  11. Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 2014. [CrossRef]
  12. Loshchilov, I.; Hutter, F. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 2017. [CrossRef]
  13. McInnes, L.; Healy, J.; Melville, J. UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 2018. [CrossRef]