Integrative approaches to reconstruct regulatory networks from multi-omics data: A review of state-of-the-art methods

Data generation using high-throughput technologies has led to the accumulation of diverse types of molecular data. These data have different types (discrete, real, string, etc.) and occur in various formats and sizes. Datasets including gene expression, miRNA expression, protein-DNA binding data (ChIP-seq/ChIP-chip), mutation data (copy number variations, single nucleotide polymorphisms), GO annotations, protein-protein interactions and disease-gene associations are among the genomic datasets commonly used to study biological processes. Each provides a unique, complementary and partly independent view of the genome and hence embeds essential information about regulatory mechanisms. The information provided by any one of these datasets individually may not be sufficient to understand the functions of genes and proteins or to analyze the mechanisms arising from their interactions. Integrating these multi-omics data and inferring regulatory interactions from the integrated dataset therefore provides system-level biological insight for predicting gene functions and their phenotypic outcomes. To study genome functionality through interaction networks, different methods have been proposed for collective mining of information from an integrated dataset. We survey here data integration approaches using state-of-the-art techniques such as network integration, Bayesian networks, regularized regression (LASSO) and multiple kernel learning.


INTRODUCTION
Living cells belong to a class of highly studied complex systems whose functioning continues to intrigue the research community. The cell houses a diverse molecular structure comprising a large number of molecular entities, viz. genes, proteins, metabolites and mRNA, which together form a complex and dynamic molecular machinery. Earlier experimental approaches focused on the effect of individual molecular entities or environmental factors on a particular cell function, without taking into consideration the effect of other important factors. These methods are, however, limited by the scale and duration of the study, making it difficult to obtain a system-wide cellular response to a stimulus, e.g., a drug dose, an environmental perturbation or a gene knockdown experiment.
• Nisar Wani, Govt. Degree College Baramulla, J&K, India.

For a holistic measurement of cellular responses and identification of functions, novel high-throughput technologies that generate genome-scale datasets open global perspectives on living organisms at the molecular level. Such collective quantification and characterization of the pool of diverse biomolecules that give rise to the structure, function and dynamics of an organism is what is nowadays commonly referred to as omics layers (Gligorijević & Pržulj, 2015). For example, the whole set of proteins and their interactions forms the proteome, mRNA abundance together with the factors responsible for transcription gives rise to the transcriptome, the genes within the DNA form the genome, the metabolites within the cell form the metabolome, and the diverse phenotypes constitute the phenome. The mechanisms that translate the functional codes within the genome depend on various intermediate omics layers and their inter-relationships (the interactome), giving rise to complex phenotypic traits in the phenome (Gligorijević & Pržulj, 2015).
High-throughput technologies such as protein mass spectrometry, yeast two-hybrid assays, microarrays and a range of protocols under the banner of next-generation sequencing (NGS), viz. RNA-seq, ChIP-seq, DNase-seq and miRNA-seq, produce vast amounts of disparate biological (omics) data. These datasets, together with a knowledge base of experimental literature on biochemical processes, are disrupting the earlier notion of isolated functionality and linear causality pathways and confronting us with multiple functions. For example, a protein produced from the information contained in a genetic sequence may function as a transcription factor (TF) binding to regulatory sites in the promoter region of one or more genes, as an enzyme catalyzing a metabolic reaction, as a cellular component, or as part of a signal transduction pathway, thereby implying networks of interacting biological components or regulatory systems. It is therefore a widely accepted notion in the modern era of high-throughput biology that a comprehensive understanding of any biological system can only be achieved through an integrative analysis of the relevant omics datasets. Such an analysis reflects the need for building a joint model that simultaneously captures the information content of all the data involved in the unified model (Ritchie et al., 2015).

NEED FOR MULTI-OMICS DATA INTEGRATION
The task of understanding the mechanisms by which living systems carry out their functions can be divided into two sub-tasks: i) identifying the components that make up these systems, and ii) characterizing the interactions between them that result in either function or dysfunction. For the first sub-task, an extensive research effort spanning several decades has characterized the structure and function of individual cellular components. Research endeavors such as the Human Genome Project, the growth of the Protein Data Bank (PDB) (Bank, 1971), signaling pathway databases (KEGG), the Gene Ontology Consortium (GO) (G. O. Consortium, 2014), the ENCODE project (E. P. Consortium et al., 2004), Roadmap Epigenomics (Bernstein et al., 2010) and the 1000 Genomes project (Siva, 2008) have generated huge databases and knowledge bases that catalog both structural and functional aspects of the elements of life. But to understand the underlying mechanisms and the overall functioning of the whole system, biological datasets need to be integrated under a mathematical or relational model that can describe the relationships between these components, in contrast to single-data-type study designs.
Data integration therefore refers to the fusion of multiple omics datasets in order to build an informative model that can yield comprehensive answers to our queries about cellular functions/dysfunctions and phenotypic traits, which are likely to be the outcome of the interplay between various cellular components at various levels of regulation. The prime motivation for fusing multiple genomic data types into an integrative analysis framework is to identify key biological factors that can explain a biological mechanism and help researchers predict a disease risk or a certain phenotypic outcome. Data integration may provide the platform for discovering biologically important factors and their inter-relationships with improved accuracy. Besides modeling this complexity, the interactions between single nucleotide polymorphism (SNP) data, copy number variations (CNVs), gene expression profiles, methylation data, metabolomes and proteomes may improve our understanding of the biological mechanisms underlying a complex-trait architecture.
With data integration, the scope for exploring more robust and novel research questions opens up, despite the challenge of assembling all the heterogeneous datasets in a biologically relevant manner. These challenges may be due to missing and noisy data, different sizes and data types across multiple datasets, and measurement errors that may lead to different correspondences and correlations across technologies. Over the years a number of tools for integrative analysis of multiple genomic datasets have been developed, but no single approach is reported to perform optimally across all studies. As observed by (Marbach et al., 2012), the performance and robustness of inferences improves greatly by combining predictions from multiple inference methods together with integrating diverse datasets. A comprehensive and more inclusive toolbox is therefore the need of the hour to discover, interpret and understand the intricacies of biological systems.

DATA INTEGRATION APPROACHES FOR OMICS DATA
Large and heterogeneous omics datasets are being generated across all branches of the life sciences. These datasets hold great promise, not only in terms of scalable and unbiased investigation of biological systems, but also for conceptual development and new discoveries. Because most such studies are publicly funded, a large number of omics datasets find their way into publicly available online database resources. With such an unprecedented rate of biological data generation and accumulation, there is a growing concern that much of this data is never used or fully analyzed, creating a widening disparity between data generation and data utilization. Although these datasets are publicly available, they cannot simply be put into a mathematical framework or fitted straight into a statistical model. There are peculiar challenges to integrating these datasets, arising from the differences they manifest in size, format, dimensionality, noisiness, information content and mutual concordance. Before integrating multiple datasets, it is also imperative to evaluate each data type for quality. For genomic datasets such as DNA-sequencing, RNA-sequencing, ChIP-sequencing and genome-wide methylation profiling, it is essential that quality control steps be implemented (Patel & Jain, 2012; Wani & Raza, 2017).
Another very important factor in data integration is handling the curse of dimensionality, which is common in biological datasets. This problem can be addressed by data reduction techniques that select a limited number of variables with high predictive power in single-data-type studies; data reduction as a pre-processing step can also be applied to multiple datasets before performing an integrative analysis. Datasets with dimensionality issues offer limited statistical power compared to data where the numbers of independent variables and samples are at least roughly balanced. Since biological datasets occur with inherent skewness in size, investigators apply some form of data reduction before performing association, correlation or modeling analyses. Reducing the amount of data through filtering strategies such as principal component analysis (PCA), matrix factorization (MF) and singular value decomposition (SVD) facilitates data integration, the selection of more robust features and the analysis of a small, refined dataset. Data reduction can improve the computational efficiency of inference algorithms and can reduce the burden of multiple-hypothesis testing. Some level of data reduction is imperative when working with complex models that explore millions of measurements from a single dataset. Applying data reduction techniques is therefore a requirement both for analyzing single datasets and for performing integrative analysis across a diverse set of data. Various approaches have been developed for data integration; these approaches are primarily categorized on the basis of (a) the type of data and (b) the integration strategy.
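The filtering strategies mentioned above can be sketched in a few lines. The following is a minimal, illustrative PCA-style reduction via SVD; the function name and toy dimensions are our own, not taken from any specific tool:

```python
import numpy as np

def reduce_dimensionality(X, n_components):
    """Project samples onto the leading principal components via SVD.

    X: (n_samples, n_features) expression matrix, samples as rows.
    Returns the (n_samples, n_components) reduced representation.
    """
    Xc = X - X.mean(axis=0)                 # centre each feature
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T         # scores on the top components

# Toy example: 20 samples x 500 genes reduced to 5 components.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 500))
Z = reduce_dimensionality(X, 5)
```

The reduced matrix Z can then feed any downstream association or integration analysis in place of the full feature set.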

Approaches based on data type
Although a clear-cut framework and definition of data integration is hardly available in the literature, various studies have proposed methods in this regard. For instance, (Lu et al., 2005), in a functional genomics context, define data integration as a process whereby data from different sources are combined statistically to make large-scale inferences and obtain a holistic view of the entire genome. Studies such as (Gligorijević & Pržulj, 2015; Hamid et al., 2009) categorize data integration into homogeneous and heterogeneous methods based on the type of data being combined. These datasets are produced by different experimental protocols, occur in different formats and are composed of different data types. For example, microarray gene expression data, gene expression profiles from RNA-seq experiments, protein data, SNPs, mutations, DNA sequences, CNVs and interaction data differ in their sources, structures, dimensions and formats.
Taking these differences into account, datasets are treated as homogeneous where the data are drawn from similar sources, have the same data type and possess a uniform format across different experiments; for example, all gene expression data from multiple samples, conditions or time points. Similarly, all protein, SNP, CNV or clinical data from multiple studies can be combined in a homogeneous integration. Heterogeneous datasets, on the other hand, are generated from different sources and experimental protocols, have different types (sequences, graphs, real numbers, integers, categorical) and possess varying formats; for example, DNA-sequencing, gene expression, protein, ChIP-seq, CNV, SNP and clinical data.
A recent review of network-based integration methods by (Gligorijević & Pržulj, 2015) treats datasets with similar nodes (e.g. proteins) and different edges (e.g. gene interaction networks, protein-protein interaction networks, etc.) as homogeneous, while datasets that characterize various biological entities and embed different types of relationships are grouped as heterogeneous. The latter are represented as inter-related collections of networks with different nodes and edges. Representations such as gene-disease association (GDA) networks, drug-chemical similarity (DCS) networks and protein-protein interaction (PPI) networks form a heterogeneous network with different types of nodes and edges.

Approaches based on integration strategies
In (Ritchie et al., 2015), the authors categorize data integration methods, based on how the individual datasets are combined in the integration process, into multi-staged analysis and meta-dimensional analysis.
A multi-staged analysis approach operates in a linear or hierarchical stepwise manner. Model construction using this approach accepts data as either numerical or categorical features; for example, gene expression values denoting the level of expression are continuous and numerical, whereas an indication of over-/under-expression can be represented using values from the categorical domain.
The meta-dimensional analysis approach, on the other hand, is based on the fusion of multiple features or scales. The aim is to find a complex meta-dimensional model in which multiple variables from different data types are combined simultaneously. It is broadly divided into three strategies: early integration (concatenation-based), late integration (model-based) and intermediate integration (transformation-based), as depicted in Fig. 1. These integration strategies were initially reported in (Pavlidis et al., 2001) and have been adopted by a number of meta-dimensional analysis studies, for example, protein function prediction, protein classification (Lanckriet et al., 2003a), gene function prediction (Mostafavi et al., 2008) and protein-protein network inference (Yamanishi et al., 2004).
(a) Early integration, or the concatenation-based approach, builds a single model from one composite dataset formed by combining multiple separate datasets. To achieve this, the individual datasets need to be transformed into a uniform representation; this transformation may result in a loss of information in certain scenarios (Pavlidis et al., 2001; Žitnik & Zupan, 2015a). An advantage of this strategy is that, once we figure out how to fuse the variables into a unified matrix, it becomes relatively simple to analyze continuous and categorical data with any statistical method. (b) In late integration, or the model-based approach, training sets derived from multiple data types are used to build separate models. The models created during the training phase are then combined into a unified final model, preserving data-specific properties. This strategy is driven by individual hypotheses and analyses specific to a particular model, together with a procedure that combines the resulting models in a meaningful and coherent manner. Although model-based integration lets us analyze each dataset individually, building models in isolation ignores their mutual relationships and can therefore drastically reduce the performance of the final model (Gevaert et al., 2006; Žitnik & Zupan, 2015a). (c) In intermediate integration, or the transformation-based approach, individual datasets are combined through the development of a joint model. The datasets are transformed before the integration step; for example, when handling classification problems we might transform the data into a similarity matrix, a graph or a covariance matrix before joining them.
Transformation-based integration preserves the data-type-specific properties of each dataset during the transformation and generates appropriate intermediate representations of the individual datasets, with the advantage of retaining most of the information content of the original data. The approach offers a versatile framework to integrate diverse types of datasets, viz. continuous or real-valued, categorical and sequence data, provided the data contain unifying features, such as gene identifiers connecting different genomic data types (gene expression, PPI, TF binding, etc.). Moreover, this approach offers robust handling of data measured on different scales. This strategy has been reported to produce high prediction accuracy in (Gevaert et al., 2006; Žitnik & Zupan, 2015a; Van Vliet et al., 2012), while results from (Özen et al., 2009) have reported superior performance of the concatenation- and model-based integration strategies, with little or no information loss because the data are not transformed. The advantage of building models using a systems genomics approach is that combining multiple datasets of different omics types can compensate for missing or unreliable information in any single data type. Genes or pathways that derive their information content from multiple genomic sources are also likely to produce fewer false positives. By considering different levels of genomic, genetic and proteomic regulation, it is highly likely that a near-complete biological model can be discovered. A summary of state-of-the-art methods used for the inference of regulatory networks is presented in Table 1.
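As a toy illustration of the concatenation-based (early integration) strategy, the sketch below aligns two hypothetical omics matrices on their shared samples before column-wise fusion; all sample identifiers, matrix contents and helper names are invented for the example:

```python
import numpy as np

def early_integration(samples_a, X_a, samples_b, X_b):
    """Concatenate two feature matrices column-wise over shared samples.

    samples_*: lists of sample identifiers labelling the matrix rows.
    Returns (shared_samples, combined_matrix).
    """
    shared = [s for s in samples_a if s in set(samples_b)]
    idx_a = [samples_a.index(s) for s in shared]
    idx_b = [samples_b.index(s) for s in shared]
    combined = np.hstack([X_a[idx_a], X_b[idx_b]])  # one composite dataset
    return shared, combined

# Hypothetical gene expression (4 genes) and methylation (3 probes) data.
expr_samples = ["s1", "s2", "s3"]
expr = np.arange(12.0).reshape(3, 4)
meth_samples = ["s2", "s3", "s4"]
meth = np.arange(9.0).reshape(3, 3)
shared, Xint = early_integration(expr_samples, expr, meth_samples, meth)
# shared == ["s2", "s3"]; Xint has shape (2, 7)
```

The composite matrix can then be handed to a single statistical model, which is exactly the appeal, and the pitfall, of this strategy.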

NETWORK BASED INTEGRATION AND INFERENCE
Network-based integration approaches provide the simplest way to integrate different types of network data into a single unified representation. They are flexible and can handle both homogeneous and heterogeneous datasets. In the network data integration lexicon, although the data types differ, homogeneous and heterogeneous data are characterized by the similarity/dissimilarity of the network nodes. In homogeneous data integration, N networks G_i = (V, E_i), with the same vertices (genes, proteins, etc.) across the different networks but different edge sets E_i, are combined by merging edges over the shared vertex set, i.e. G_int = (V, ∪_{i=1}^{N} E_i), as shown in Fig. 2. Similarity Network Fusion (SNF) is a typical example of constructing merged networks by employing this homogeneous network data integration principle. SNF, which integrates expression data (mRNA, miRNA) and DNA methylation data, creates a network for each data type and then builds an integrated network from the fusion of the individual homogeneous networks. The initial steps of SNF build dataset-specific similarity matrices. These matrices are equivalent to similarity networks whose nodes are genes/miRNAs and whose weighted edges represent pairwise node similarities. A network fusion step follows, iteratively building a single integrated network using message-passing techniques. The advantage of this iterative network construction is that low-weight (weak) edges exclusive to individual component networks are discarded in order to remove noise, while high-weight (strong) edges are incorporated into the final network. Low-weight (weak) edges common to all the networks being integrated may nevertheless be retained, depending on how densely connected their network neighborhoods are across the networks. With such non-linearity, SNF makes full use of the local network structure, combining common as well as complementary information across networks. Using this principle, multi-omics data on cohort features can likewise be integrated for the reconstruction of gene regulatory networks (GRNs): for example, given multiple RNA-seq/microarray datasets generated under different perturbations, treatments or conditions, we can learn a GRN from each dataset and then use SNF to fuse these individual GRNs into an integrated GRN. Similarly, GRNs inferred using algorithms such as CLR (Faith et al., 2007), ARACNE and SIRENE (Mordelet & Vert, 2008) can be fused by a method like SNF to achieve robust results, implementing the philosophy of 'the wisdom of crowds' (Marbach et al., 2012). A similar network fusion method has been proposed in KeyPathwayMinerWeb for pathway enrichment analysis using multi-omics data (Alcaraz et al., 2011). Another study, by (Gade et al., 2011), builds a transcriptional regulatory network of prostate cancer from mRNA and miRNA expression profiles. Besides these two datasets, their method also incorporates prior knowledge about miRNA targets from the MicroCosm target database (Enright et al., 2003). The authors start by building a bipartite graph connecting miRNAs mi_j to their respective mRNA targets m_i. These relationships are established by computing Pearson correlation coefficients p(m_i, mi_j) between the expression vectors of the mRNAs and miRNAs, resulting in a p-value for every mRNA-miRNA pair.
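The homogeneous merging step G_int = (V, ∪ E_i) can be sketched as below; note this is only the edge-union over a shared vertex set, not the full SNF message-passing procedure, and the toy networks are invented for illustration:

```python
import numpy as np

def merge_networks(adjacencies):
    """Union of edge sets over a shared vertex set: G_int = (V, U_i E_i).

    adjacencies: list of (n, n) 0/1 adjacency matrices over the same
    ordered node set. An edge appears in the merged network if it
    occurs in any of the input networks.
    """
    merged = np.zeros_like(adjacencies[0])
    for A in adjacencies:
        merged = np.maximum(merged, A)   # elementwise OR of edge sets
    return merged

# Two toy undirected networks over the same three vertices.
A1 = np.array([[0, 1, 0], [1, 0, 0], [0, 0, 0]])
A2 = np.array([[0, 0, 1], [0, 0, 0], [1, 0, 0]])
G_int = merge_networks([A1, A2])
# G_int keeps both the (0,1) edge from A1 and the (0,2) edge from A2.
```

For weighted similarity networks, SNF replaces this simple union with its iterative cross-network diffusion, but the shared-vertex bookkeeping is the same.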
Besides the mRNA-miRNA correlation p-values, p-values P^pred_{i,j} for target predictions from the MicroCosm database (Enright et al., 2003) are included as a secondary source of information. Incorporating this predicted information strengthens the confidence in a relationship between an mRNA and a miRNA when supportive evidence is available from the biological data. The MicroCosm database generally stores mRNA-miRNA pairs whose p-values are below 0.05; if a pair does not exist in the database, its value is set to 1.
P-values from both sources are combined using the truncated product method (Zaykin et al., 2002) to yield a weight matrix W_{i,j}, viewed as an adjacency matrix describing the relationships between mRNAs and miRNAs as a bipartite graph. The resulting graph W has edges from mRNAs to miRNAs and is interpreted as a directed graph. The study was not primarily designed to infer regulatory relationships between mRNAs and miRNAs, but rather does so as a prerequisite for identifying prognostic signatures able to predict clinical outcome in prostate cancer. The results of the study clearly indicate an improvement in prediction performance when the graph is incorporated as regulatory information. The introduction of likelihood-based boosting in (Binder & Schumacher, 2009), as a way to integrate a gene interaction network into feature selection, also improves the predictive performance of the model.
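As a rough sketch of how the two p-value sources can be combined, the code below computes the truncated product statistic of Zaykin et al. (2002) and estimates its significance by Monte Carlo simulation under a uniform null; the closed-form null distribution given in the original paper is omitted here, and the example pair of p-values is invented:

```python
import numpy as np

def truncated_product(pvals, tau=0.05):
    """Truncated product statistic: the product of all p-values at or
    below the truncation point tau (1.0 if none fall below tau)."""
    pvals = np.asarray(pvals, dtype=float)
    below = pvals[pvals <= tau]
    return float(np.prod(below)) if below.size else 1.0

def tpm_pvalue(pvals, tau=0.05, n_sim=20000, seed=0):
    """Monte Carlo significance of the statistic against uniform nulls."""
    rng = np.random.default_rng(seed)
    w_obs = truncated_product(pvals, tau)
    null = rng.uniform(size=(n_sim, len(pvals)))
    w_null = np.array([truncated_product(row, tau) for row in null])
    return float(np.mean(w_null <= w_obs))

# Combining an expression-correlation p-value with a MicroCosm target
# prediction p-value for one hypothetical mRNA-miRNA pair:
w = truncated_product([0.01, 0.03], tau=0.05)
```

Applied to every mRNA-miRNA pair, such combined statistics populate the weight matrix W_{i,j} described above.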
The heterogeneous network integration method uses simple projection techniques to combine different types of vertices and edges across a number of networks. This type of network integration is usually done for association studies, for example, projecting a gene-gene interaction (GI) network onto a disease similarity network (DSN). However, projecting a GI network onto a disease association network disturbs the GI network structure, resulting in a loss of the information carried by the GI network. More sophisticated methods based on diffusion (information flow across network connections) overcome the shortcomings of projection-based integration. The diffusion process explores the structure of each network and their mutual relationships in order to infer an integrated network. These approaches fall under the domain of network propagation methods and have been applied to a variety of biological problems, such as drug-target prediction and drug repurposing (Cheng et al., 2012), drug-disease association (Huang et al., 2013) and gene-disease association (Guo et al., 2011). Another important study in this direction, by (Cho et al., 2016), led to the development of an integrative framework called 'Mashup'. The Mashup framework exploits the topological representation of the individual networks, analyzing each network separately by running a localized network diffusion algorithm, random walk with restart (Tong et al., 2006), to obtain for each node a distribution describing its relevance to every other node in the network. These feature vectors are then used as input by the framework to carry out multiple tasks, such as gene interaction and gene function prediction.
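A minimal sketch of random walk with restart, the diffusion primitive used by Mashup, is shown below; the restart probability, toy graph and convergence settings are illustrative choices, not those of the original study:

```python
import numpy as np

def rwr(adj, seed_idx, restart=0.5, tol=1e-10, max_iter=1000):
    """Random walk with restart (Tong et al., 2006) on a graph with no
    isolated nodes.

    adj: (n, n) adjacency matrix; seed_idx: index of the restart node.
    Returns the stationary visiting distribution over all nodes.
    """
    W = adj / adj.sum(axis=0, keepdims=True)   # column-stochastic transition matrix
    n = adj.shape[0]
    e = np.zeros(n)
    e[seed_idx] = 1.0                          # restart vector
    p = e.copy()
    for _ in range(max_iter):
        p_next = (1 - restart) * W @ p + restart * e
        if np.abs(p_next - p).sum() < tol:
            break
        p = p_next
    return p

# A small path graph 0-1-2; diffusion from node 0.
A = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
p = rwr(A, seed_idx=0)
# The direct neighbour (node 1) scores higher than the two-hop node 2.
```

Stacking such distributions for every seed node gives the topology-aware feature vectors that Mashup subsequently compresses and feeds to its prediction tasks.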
More recently, heterogeneous network integration approaches have been used to build drug-target interaction networks in order to identify new indications for old drugs (drug repurposing). DTINet, a computational pipeline developed by (Luo et al., 2017), combines diverse information from heterogeneous networks by integrating a dimensionality reduction technique, diffusion component analysis (DCA), with the network diffusion algorithm random walk with restart (RWR) to obtain a low-dimensional feature vector representation of the network nodes, a process known as compact feature learning. This feature vector embeds the relational properties (e.g. similarity), association information and topological context of each drug (or protein) node in the heterogeneous network.

REGRESSION BASED APPROACHES
Regression-based techniques contribute a fair share of the methods developed so far to extract one-to-many relationships from gene expression profiles. These methods are employed to analyze high-dimensional datasets and involve regularization to learn sparse models. Among the regularized regression methods, ridge regression and LASSO (least absolute shrinkage and selection operator) (Tibshirani, 1996) have been used effectively to learn gene regulatory networks (Cai et al., 2013). Besides computational biology, LASSO has also been used successfully in medical imaging, signal processing and economics (Hesterberg et al., 2008; Dasgupta et al., 2011; Yang et al., 2010). GENIE3 (Irrthum et al., 2010) is another inference algorithm that uses regression to infer regulatory networks. The inference problem for P genes is decomposed into P regression problems, and tree-based ensemble methods, using either random forests or extra-trees, are used to predict the expression pattern of each target gene from the expression profiles of all other input genes. A putative regulatory link between an input gene and its target is predicted based on the role played by the input gene in predicting the target gene's expression pattern. Links from all the genes are ranked and then aggregated to reconstruct a GRN.
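The per-gene decomposition just described can be sketched with scikit-learn's random forests as below; this is an illustrative GENIE3-style reimplementation of the idea, not the authors' code, and the synthetic data are invented for the example:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def genie3_ranking(expr, n_trees=100, seed=0):
    """GENIE3-style link ranking: for each target gene, regress its
    expression on all other genes with a tree ensemble and take the
    feature importances as putative regulatory weights.

    expr: (n_samples, n_genes) expression matrix.
    Returns an (n_genes, n_genes) matrix W where W[i, j] is the
    importance of gene i in predicting gene j (diagonal left at zero).
    """
    n_genes = expr.shape[1]
    W = np.zeros((n_genes, n_genes))
    for j in range(n_genes):
        inputs = np.delete(expr, j, axis=1)          # all other genes
        rf = RandomForestRegressor(n_estimators=n_trees, random_state=seed)
        rf.fit(inputs, expr[:, j])
        W[np.arange(n_genes) != j, j] = rf.feature_importances_
    return W

# Synthetic check: gene 2 is driven by gene 0, so the 0->2 link
# should rank above the 1->2 link.
rng = np.random.default_rng(1)
g0 = rng.normal(size=200)
g1 = rng.normal(size=200)
g2 = 2.0 * g0 + 0.1 * rng.normal(size=200)
W = genie3_ranking(np.column_stack([g0, g1, g2]))
```

In the full algorithm, the entries of W across all P regressions are pooled and sorted to produce the final ranked edge list.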
In summary, although regularized regression models already yielded better results in GRN inference than other approaches, a further increase in accuracy was observed in (Qin et al., 2014) when the authors integrated ChIP-chip/ChIP-seq data as prior knowledge using L_p (p < 1), i.e. L_0 and L_{1/2}, regularization. In (Chartrand, 2007; Z. Xu et al., 2012), L_p (p < 1) regularization models have been reported to perform well even on data with few samples, achieving higher sparsity and more accurate solutions. Since the L_0 and L_{1/2} regression models suffer from non-convexity, iterative hard and half thresholding algorithms are used to solve them (Fornasier & Rauhut, 2008). The advantage of these algorithms is that they have a low computational cost and a fast convergence rate, both essential properties when designing algorithms for genome-scale data.

L_p regularization models
A linear system representation of the relationships between target genes and their regulatory proteins (TFs) can be written as:

AX + E = B

Here A ∈ R^{m×r} is a matrix containing the gene expression values of the candidate TFs, B ∈ R^{m×n} is the matrix containing the gene expression values of the target genes, E ∈ R^{m×n} is the error matrix and X ∈ R^{r×n} is the target-TF regulation matrix; m and r represent the number of samples and the number of TFs respectively, while n is the total number of target genes. The objective is to minimize the difference between AX_j and B_j using few TFs, a sparse optimization problem that can be written as:

min_{X_j} ||AX_j − B_j||_2^2   subject to   ||X_j||_0 ≤ s

where ||·||_2 denotes the Euclidean norm, ||X_j||_2 = (Σ_{i=1}^{r} X_{i,j}^2)^{1/2}, s is a prescribed sparsity level and ||X_j||_0 denotes the number of non-zero elements in X_j; a smaller ||X_j||_0 means higher sparsity of X_j, an indicator of the number of TFs found to be responsible for regulating target gene j.
To solve such inference problems, the practical approach is to transform the sparse optimization problem into a regularization problem. For example, with B_j as the expression profile of a given gene j, the L_0 regularization model imposes a minimization constraint on the difference between AX_j and B_j and a maximization constraint on the sparsity of X_j:

min_{X_j} ||AX_j − B_j||_2^2 + λ ||X_j||_0

where the regularization parameter λ > 0 provides a trade-off between sparsity and accuracy. L_0 regularization is very close to our initial problem, but solving it to a global optimum is NP-hard; therefore a relaxation of L_0, the L_1 (LASSO) regularization, is introduced:

min_{X_j} ||AX_j − B_j||_2^2 + λ ||X_j||_1

In many instances, the solutions obtained using the LASSO (L_1) model are less sparse than those of L_0 regularization.
More recently, L_{1/2} regularization has been proposed and shown to perform better than LASSO (L_1) regularization (Z. Xu et al., 2012). This model is described as:

min_{X_j} ||AX_j − B_j||_2^2 + λ ||X_j||_{1/2}^{1/2}

Qin et al. applied iterative thresholding algorithms to solve the L_0, L_{1/2} and L_1 regression models: L_1 models are solved using iterative soft thresholding, while the L_0 and L_{1/2} models use iterative hard and half thresholding, respectively, to obtain optimal solutions. These iterative algorithms are computationally very efficient, with fast convergence rates, and are effective tools for handling large-scale sparse optimization problems. A detailed discussion of iterative thresholding algorithms can be found in (Fornasier & Rauhut, 2008).
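As an illustration of iterative soft thresholding for the L_1 (LASSO) model, the sketch below implements plain ISTA with numpy; the step-size rule and toy TF-target data are our own choices, and the hard/half thresholding variants for L_0 and L_{1/2} differ only in the thresholding operator applied at each step:

```python
import numpy as np

def soft_threshold(x, t):
    """Elementwise soft-thresholding operator: sign(x) * max(|x| - t, 0)."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def ista_lasso(A, b, lam, n_iter=500):
    """Iterative soft thresholding for min_x ||Ax - b||^2 + lam * ||x||_1.

    Uses step size 1/L, where L = 2 * sigma_max(A)^2 is the Lipschitz
    constant of the gradient of the smooth term.
    """
    L = 2 * np.linalg.norm(A, 2) ** 2
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        grad = 2 * A.T @ (A @ x - b)             # gradient of ||Ax - b||^2
        x = soft_threshold(x - grad / L, lam / L)
    return x

# Sparse recovery toy: only TFs 0 and 3 regulate the target gene.
rng = np.random.default_rng(0)
A = rng.normal(size=(50, 10))                    # expression of 10 candidate TFs
x_true = np.zeros(10)
x_true[0], x_true[3] = 3.0, -2.0
b = A @ x_true + 0.01 * rng.normal(size=50)      # target gene expression
x_hat = ista_lasso(A, b, lam=0.5)
```

The nonzero pattern of x_hat identifies the TFs predicted to regulate the target gene, which is exactly how a column of the regulation matrix X is read off.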
The regularization parameter λ is updated iteratively for all the thresholding algorithms to maintain the sparsity of X_j^k. All three regularization models were evaluated on an mESC (mouse embryonic stem cell) dataset. Initial experiments were performed on transcriptomic data alone and compared against TF-target datasets from high-throughput (microarray/ChIP-X) and low-throughput (literature-curated) sources. Both TF-target datasets serve as gold standards against which model accuracy is tested. With the transcriptome-only dataset, the ROCs of all the models on both evaluation datasets were close to random. However, when ChIP-X data were integrated into the models, the performance of all the models improved drastically, with the L_0 and L_{1/2} models significantly outperforming the L_1 regularization model. The ROCs of all three regression models L_p (p = 0, 1/2, 1) are high against the high-throughput gold standard and low against the low-throughput gold standard. At an FPR of 0.05, the integrated models yield TPRs of 0.63, 0.59 and 0.07 for the L_0, L_{1/2} and L_1 models respectively, while the transcriptome-only models achieve TPRs of 0.03, 0.03 and 0.04. This demonstrates that the L_0 and L_{1/2} models achieve higher sensitivity with integrated ChIP-X data than with the transcriptome-only dataset.
Another regression-based study that infers gene regulatory networks used fused LASSO to integrate multiple transcriptomic datasets subject to different perturbations (Omranian et al., 2016). The approach combines the L_2 norm (residual sum of squares) with the L_1 norm (sum of absolute values of the regression coefficients), shrinking coefficients and yielding sparsity as the coefficients approach zero. Unlike LASSO, where the L_1 minimization is imposed only on the regression coefficients, fused LASSO imposes this constraint also on the consecutive differences of the regression coefficients of corresponding regressors (Tibshirani et al., 2005). The fused LASSO formulation has further been extended here by incorporating information about regressors and response genes based on the similarity of their differential behavior. To ensure that the inferred networks are sparse, the fusion approach imposes a similarity constraint on the relationships inferred from each dataset, backed by biologically meaningful evidence. Since this approach relies on the similarity of differential behavior, the model operates on P regressor genes and a single response gene: multiple gene expression datasets X_i, obtained under k different experimental conditions along with corresponding controls X_{c,i} (1 ≤ i ≤ k), and the expression profiles Y_i of the response gene along with corresponding control profiles Y_{c,i} (1 ≤ i ≤ k), are used to obtain the condition-specific differential behavior of regressor and response genes. This differential behavior is used to calculate a weight matrix W^i of size P × P from the probabilities Pr^{i,t}_j and Pr^{i,t}_Y that regressor j and the response gene, respectively, are differentially expressed; W^i captures the similarity between each regressor and the response gene at time point t and is given by:

Regression model for fused LASSO
For efficient inference of GRNs, the regression model formulated in equation (8) is based on three fundamental criteria: 1) the obtained regression should be sparse, to ensure fewer false positives and a high likelihood of detecting direct relationships; biologically, the expression levels of a few TF-coding genes should explain the expression levels of their presumed targets; 2) for an effective explanation of the analyzed condition, the model should infer the regulatory network from all k datasets simultaneously; 3) genes exhibiting similar differential behavior are preferred while assigning direct regulatory links. While LASSO captures the first criterion, the second criterion of inferring regulatory networks from all datasets can be achieved by imposing the LASSO penalty simultaneously on all the integrated and transformed datasets. The k transcriptomic datasets are clubbed together to form a single block matrix X denoting expression profiles for the P genes, while the vector Y contains the response expression profiles over the k different conditions. The reconstructed networks should be evaluated for proximity in terms of their interaction strengths, given by the regression coefficients that correspond to the links between the response Y and the regressor genes X in the k networks; this vector of coefficients is represented as β.
For the reconstructed networks to be in close proximity to each other, the final model includes a fusion term β′ − β″, where β′ = [β_1, β_2, ..., β_{k−1}]^T and β″ = [β_2, β_3, ..., β_k]^T. This term imposes a minimization constraint on the sum of absolute differences among the estimated regression coefficients over the multiple datasets. Within the fused LASSO formulation, the inclusion of W_i (1 ≤ i ≤ k) encodes the similarity of the differential behavior between the response and the regressor genes. Multiplying this matrix with the regression coefficients ensures that regressors having high explanatory power and associated with non-zero coefficients are less penalized over multiple datasets. Therefore the expression ||Wβ||_1 is added as an additional penalty, and the final model for reconstructing gene regulatory network interactions over the k given datasets is given as follows. Here, regressors whose regression coefficients do not shrink to zero are the potent regulators of the response gene Y. A comparative analysis of this approach against other state-of-the-art methods, viz. global silencing (Barzel & Barabási, 2013), network deconvolution (Feizi et al., 2013), Gaussian graphical models (GGM) (Schäfer & Strimmer, 2004; Krämer et al., 2009; Chun et al., 2013), mutual information (ARACNE and CLR) (Faith et al., 2007; Zoppoli et al., 2010; Meyer et al., 2007), Bayesian networks (Friedman et al., 2000), GENIE3 (Irrthum et al., 2010) and other regularization-based models, was performed using network sparsity and removal of indirect relationships as dual criteria for model evaluation. The GGM, ARACNE and CLR perform poorly at accurate inference of regulatory networks; on the contrary, the fused LASSO approach, in addition to retrieving the regulatory links efficiently, also predicted the type of regulation (activation/repression), thereby outperforming most of the contending methods on the set criteria.
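As a minimal numerical illustration (not the authors' implementation), the pieces of the fused LASSO penalty can be written down directly; the penalty weights `lam1` and `lam2` are hypothetical:

```python
import numpy as np

def fused_lasso_objective(X, y, beta, lam1, lam2):
    """Fused LASSO objective (Tibshirani et al., 2005):
    residual sum of squares (L2 term)
    + lam1 * L1 penalty on the coefficients (sparsity)
    + lam2 * L1 penalty on consecutive coefficient differences (fusion)."""
    rss = float(np.sum((y - X @ beta) ** 2))
    sparsity = lam1 * float(np.sum(np.abs(beta)))
    fusion = lam2 * float(np.sum(np.abs(np.diff(beta))))
    return rss + sparsity + fusion

# Toy check: identity design, perfect fit, coefficients [1, 1, 0]
X = np.eye(3)
beta = np.array([1.0, 1.0, 0.0])
y = X @ beta                                   # RSS = 0
print(fused_lasso_objective(X, y, beta, 1.0, 1.0))  # 0 + 2 (L1) + 1 (fusion) = 3.0
```

The fusion term rewards coefficient profiles that stay similar across adjacent regressors (here: adjacent datasets), which is the sense in which the k reconstructed networks are kept in close proximity.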

Bayesian Networks
Bayesian networks (BNs), or belief networks as they are often called, belong to the family of probabilistic graphical models used to represent knowledge about an uncertain domain. These networks are represented using directed acyclic graphs (DAGs), wherein each node represents a random variable and the edges between the nodes represent causal connections / probabilistic dependencies among the corresponding random variables (Ben-Gal et al., 2007). These conditional dependencies are estimated using statistical and computational methods; BNs therefore combine principles from graph theory, probability and statistics.
Bayesian networks are recognized as important tools in data driven biological science research and play a pivotal role in performing several biological tasks such as inferring cellular networks (N. Friedman, 2004), data integration (Troyanskaya et al., 2003;Santra, 2014), classification (Bradford et al., 2006) and genetic data analysis (Beaumont & Rannala, 2004).
The graphical representation and use of probabilistic formalisms make BNs suitable for combining data and domain knowledge, expressing causal relationships and learning from incomplete datasets, and provide a natural setting for modeling the stochastic nature of biological systems and their measurements. The graph in Fig. 3(a) provides an example of a BN describing a gene regulatory network; this network represents a joint probability distribution (JPD) over a set of random variables V, depicted as an annotated acyclic graph.
Fig. 3. (a) Vertices denote the genes whereas the directed edges represent the regulatory relations between the genes. Here gene g_1 regulates the expression of g_2, g_3 and g_5; genes g_2 and g_3 also regulate g_5, whereas g_4 is only regulated by g_2. (b) Graphical depiction of a naive BN with class node y acting as parent to independent nodes x_1, x_2, x_3, ..., x_n.
A BN is a pair B = (G, Θ), where G is the DAG whose nodes represent the genes as random variables g_1, g_2, g_3, ..., g_n and whose edges represent direct dependencies between the genes. The other component of the network, Θ, represents a set of parameters; the elements of this set are the parameters θ_{x_i|π_i} = P_B(x_i | π_i) for every realization x_i of g_i conditioned on π_i, the set of parents of g_i in G.
The DAG structure embeds conditional independence relationships, i.e. given its parents in G, a node g_i does not depend on its non-descendants. The JPD defined by B over V thus factorizes as P_B(g_1, ..., g_n) = ∏_{i=1}^{n} P_B(g_i | π_i). Bayesian networks have been used to build suitable frameworks for modeling and integrating different types of biological data. Network inference from disparate data is one of the biggest challenges in the field of systems biology: the goal is a sparse network containing the important gene associations, where the strength of each relationship within the network is represented by conditional probabilities in a joint probability distribution. Constructing a BN that provides an effective description of the data involves two important steps: structure learning and parameter learning (Needham et al., 2007; Ben-Gal et al., 2007). Because the number of possible BN structures grows super-exponentially with the number of nodes, the problem of finding the BN that best describes the data is NP-complete, and therefore heuristic methods have been employed for solving such problems (Chickering, 1996).
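The factorized JPD can be made concrete with a toy version of the Fig. 3(a) network; the conditional probability tables below are invented for illustration, not taken from any dataset:

```python
from itertools import product

# Illustrative CPTs (binary genes) for the Fig. 3(a) structure:
# g1 regulates g2, g3 and g5; g2 regulates g4 and g5; g3 regulates g5.
P_g1 = {0: 0.7, 1: 0.3}
P_g2_g1 = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.2, 1: 0.8}}     # P(g2 | g1)
P_g3_g1 = {0: {0: 0.8, 1: 0.2}, 1: {0: 0.3, 1: 0.7}}     # P(g3 | g1)
P_g4_g2 = {0: {0: 0.95, 1: 0.05}, 1: {0: 0.4, 1: 0.6}}   # P(g4 | g2)

def P_g5(g1, g2, g3):
    # P(g5 | g1, g2, g3): more active parents -> higher activation probability
    p_on = 0.05 + 0.3 * (g1 + g2 + g3)
    return {0: 1.0 - p_on, 1: p_on}

def joint(g1, g2, g3, g4, g5):
    """Factorized JPD: P(V) = prod_i P(g_i | parents(g_i))."""
    return (P_g1[g1] * P_g2_g1[g1][g2] * P_g3_g1[g1][g3]
            * P_g4_g2[g2][g4] * P_g5(g1, g2, g3)[g5])

# Sanity check: the factorization defines a proper distribution over 2^5 states.
total = sum(joint(*cfg) for cfg in product([0, 1], repeat=5))
print(round(total, 10))  # → 1.0
```

Five binary genes need only 1 + 2 + 2 + 2 + 8 conditional parameters here instead of 2^5 − 1 joint parameters, which is the practical payoff of the DAG factorization.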
Studies employing Bayesian networks for integrating gene expression data with other genomic datasets in a heterogeneous data integration framework have been widely reported in the literature. For example, (Zhu et al., 2008) built a probabilistic causal yeast regulatory network by combining gene expression data (GExD), transcription factor binding site (TFBS), eQTL (expression quantitative trait loci) and PPI data using Bayesian learning. While testing the performance of the constructed BN, they demonstrated that an integrated BN model possesses higher predictive power than a BN model derived from a single data type (gene expression data). A similar approach to construct a probabilistic model for GRN inference from brain tissue of late-onset Alzheimer's patients was adopted by (Zhang et al., 2013) by integrating patient data.
Earlier studies of data integration using BNs were reported in (Gevaert et al., 2006) to predict the outcome of breast cancer from heterogeneous genomic datasets. They constructed a BN in order to classify patients into different prognosis groups (good/poor). They assessed the performance of the BN under all three integration strategies described above and observed that the intermediate strategy gave the best results. Another study (Van Vliet et al., 2012), using multiple classifiers of varying complexity, also concluded in favour of intermediate integration.
Most data integration studies for heterogeneous biological data have used naive Bayes (a simple BN) for constructing integrated networks from genomic datasets, such as gene-gene association networks (Lee et al., 2004; Linghu et al., 2009; Franceschini et al., 2012; Jansen et al., 2003). Here all independent nodes representing heterogeneous data sources are treated as child nodes of a class node; such simplicity in the BN structure enables efficient learning and much faster inference. For example, gene-gene association studies use naive BN structures for prediction purposes, where the class node represents a set of protein/gene pairs (interacting/non-interacting) while the independent nodes represent multiple data sources. A graphical depiction of naive Bayes is shown in Fig. 3(b).
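A naive-Bayes gene-pair classifier of this kind can be sketched as follows, with synthetic binary evidence standing in for real data sources (the three sources and their firing rates are invented for illustration):

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

rng = np.random.default_rng(1)
# Synthetic evidence for 200 gene pairs from three hypothetical sources
# (e.g. co-expression, shared GO annotation, literature co-occurrence);
# the class node y = 1 marks interacting pairs, 0 non-interacting ones.
y = rng.integers(0, 2, size=200)
# Each evidence source fires with prob. 0.8 for interacting pairs, 0.2 otherwise.
X = np.column_stack(
    [rng.random(200) < (0.2 + 0.6 * y) for _ in range(3)]
).astype(int)

clf = BernoulliNB().fit(X, y)
# Posterior that a pair with all three evidence types present interacts:
p_int = clf.predict_proba([[1, 1, 1]])[0, 1]
print(round(p_int, 3))
```

Because the sources are conditionally independent given the class node, the posterior is just a product of per-source likelihood ratios, which is what makes learning and inference so cheap in the naive structure.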
Recently, (Santra, 2014) proposed a BN method based on Bayesian variable selection for GRN inference that incorporates TFBS data, PPIs among TFs and mRNA expression profiles subject to genetic perturbations. The approach uses a linear model based on the idea that when a network is perturbed, there is a widespread change in the expression levels of genes, as the effect of the perturbation rapidly propagates through the entire network. This premise has been shown to hold through the works of (Rogers & Girolami, 2005; Lo et al., 2012), which showed that the response expression x_i of a target gene g_i is linearly dependent upon the response expressions X_i of its direct regulators across a set of n_p perturbations.
Here β_i is the vector of linear coefficients representing the type and strength of interaction between gene g_i and its direct regulators. Since no perturbation experiment is error free, measurements of the expression levels are always accompanied by noise, referred to as "residuals". Accommodating these residuals in the linear model, the above equation becomes x_i = X_i β_i + ε_i, where ε_i is the residual error, a linear combination of the individual measurement errors (Kariya & Kurata, 2004) associated with the perturbation responses of gene g_i and its regulators.
In order to find the direct regulators of gene g_i, the requirement is to estimate β_i in the least-squares sense. In this formulation, the task of identifying the direct regulators of target gene g_i is determined by the values of β_i: elements β_ik (k = 1, ..., n_p) of β_i whose values differ significantly from zero are chosen as significant direct interactions, and the corresponding genes g_k are therefore the direct regulators of gene g_i. Performing perturbation experiments on a large scale is rarely feasible; in such situations reconstructing a full network by least squares is not possible, and the problem can be resolved by the Bayesian variable selection (BVS) algorithm (Santra et al., 2013).
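The least-squares step can be sketched on synthetic perturbation data; the dimensions, true coefficients and the significance cut-off below are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(2)
n_p, n_reg = 30, 4                       # perturbations x candidate regulators
X = rng.standard_normal((n_p, n_reg))    # regulator responses to perturbations
beta_true = np.array([1.5, 0.0, -0.8, 0.0])   # only two direct regulators
x_i = X @ beta_true + 0.05 * rng.standard_normal(n_p)  # target-gene response

# Least-squares estimate of the interaction strengths beta_i
beta_hat, *_ = np.linalg.lstsq(X, x_i, rcond=None)
direct = np.nonzero(np.abs(beta_hat) > 0.2)[0]  # crude significance cut-off
print(direct)                                    # inferred direct regulators
```

Note the sign of each significant coefficient also carries information (activation vs repression), which is why a cut-off on the magnitude, not the raw value, is the natural selection rule.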
In the BVS formulation an adjacency matrix A is used to represent the GRN topology: elements A_ik with non-zero values denote a direct regulatory relationship between genes g_i and g_k (k ≠ i), while zero-valued elements represent the absence of direct regulation. There is a close relationship between the elements of A and the elements of the interaction strength matrix β: given two genes (g_i, g_k), the absence of a direct interaction (A_ik = 0, k ≠ i) implies a zero interaction strength (β_ik = 0, k ≠ i), and vice versa. Hence, locating the direct regulators of gene g_i within the binary adjacency matrix is equivalent to locating the elements in the ith row A_i whose corresponding interaction strengths in β are non-zero. Since it is computationally inefficient to iterate across all combinations of A_i, the BVS algorithm adopts a Bayesian approach to avoid this exhaustive search. Bayesian algorithms learn in a natural way: knowledge about an event is updated only as new information becomes available. In the Bayesian lexicon, prior distributions are used to encode the available information about events, assigning a prior probability to each possible outcome (Heckerman, 1998).
For GRN inference, the prior distribution P(A_i) encodes the prior knowledge about gene g_i and its direct regulators. In (Santra et al., 2013), the model does not use any prior knowledge; the approach simply favors sparsity, penalizing models with too many regulators. Here, a more informative prior distribution over A_i is injected into the model using integrated TFBS data and PPI data among TFs.
The application of the BVS algorithm was demonstrated by reconstructing a liver-specific GRN from perturbation data. Different prior distributions were tested for model evaluation: 1) the prior distribution of A_i was formulated without using any prior knowledge from an external data source; 2) although no prior knowledge was incorporated, the prior distribution was tailored to favor sparse regulatory programs; 3) a prior distribution of A_i was formulated based on predicted direct regulatory interactions from publicly available TFBS data; 4) both direct and indirect regulatory interactions were used to construct a prior distribution of A_i from TFBS data and PPI data between TFs.
To estimate the posterior probability of interactions, the BVS algorithm uses a Markov chain Monte Carlo (MCMC) sampling strategy. The probabilities thus obtained represent the confidence in each interaction given the perturbation responses. An interaction is considered true if its probability exceeds a certain threshold P_th and absent if its probability is below or equal to P_th. GRNs reconstructed from the true interactions are then compared with the gold standard.
Results of the BVS formulation were compared with LASSO-based algorithms under different prior settings. The first case uses no prior, with the regularization parameter λ_1 set to 0.2; the second and third settings use prior distributions from TFBS and TFBS+PPI data, with regularization parameters λ_1 = 0.2 and λ_2 = 0.8 respectively. Evaluation with ROC and precision-recall curves suggests that with the inclusion of prior information, the performance of the BVS algorithm improved significantly. Prior knowledge about direct and indirect regulatory interactions from TFBS data integrated with PPI data between TFs was observed to improve the prediction of regulatory relationships more than TFBS data alone. Moreover, the proposed BVS formulation outperformed all LASSO models in all circumstances.
Modeling using BNs provides an efficient and suitable framework for integrating multiple datasets (homogeneous/heterogeneous). They are able to capture noisy conditional dependencies between data variables, and the construction of CPDs makes them suitable for a variety of network inference problems in bioinformatics. However, BNs have certain limitations as well: (i) although important associations are captured by the sparse network representation, some associations may be missing from the final network; (ii) network loops, which in many biological networks represent important control mechanisms, cannot be inferred using BNs because of their acyclic nature; (iii) as the number of nodes in the model increases, the space of BN structures grows super-exponentially with a huge computational cost, making BNs less attractive for large-scale studies (e.g. eukaryotic genomes).

Network inference through Markov networks
Graphical probabilistic models with undirected edges are called Markov networks. These networks provide a simple definition of independence between any two nodes and are used to represent conditional dependence relationships between genes. Mechanisms within biological cells are governed by complex interactions among genes, and it is of great interest to biologists to understand the conditional dependencies between these genes. It is possible to uncover these dependencies from experimental data characterizing genes and gene products and to represent them in the form of a gene network. Markov networks present a popular statistical approach for modeling such conditional independence relationships in high-dimensional genomics data. In particular, the absence of a link between genes s and t indicates an independence relationship between s and t given the immediate neighborhood of s. Following (Murphy, 2012), we can conclude that two nodes (genes) without any direct link between them are conditionally independent of each other given the rest of the genes in their vicinity. This conditional independence (Markov) property allows a rich set of dependence relationships between nodes, and therefore complex relations among the nodes of a Markov network can be uncovered (Allen & Liu, 2013). The problem of learning network structure using Markov networks has seen wide acceptance in the fields of computer vision (S. Z. Li, 1994), social networks, image and signal processing (C. Wang et al., 2013; Metzler & Croft, 2005) and genomics (Wei & Li, 2007).
Applications of undirected graphical models (Markov networks) in bioinformatics include the construction of signaling networks for cancer from proteomic data (Mukherjee & Speed, 2008) and genetic interaction network reconstruction from the integration of multiple datasets (Isci et al., 2013); in recent times, various Markov-network-based inference algorithms using NGS data have also been developed (Allen & Liu, 2013; Gallopin et al., 2013).
Zitnik and Zupan developed FUSENET (Žitnik & Zupan, 2015b), a GRN inference algorithm based on a Markov network formulation that integrates multiple non-identically distributed heterogeneous datasets. The input to FUSENET is a dataset collection, with each dataset containing a set of gene expression profiles. Gene expression measurements are taken from RNA-seq experiments (discrete, non-negative) and are combined with mutation data including CNVs, SNPs, short indels and multiple substitutions. While the RNA-seq data follow a Poisson or negative binomial distribution, the data from mutation datasets are modeled using either categorical or multinomial distributions. A prominent feature of FUSENET is the latent factor representation of model parameters: employing latent factors, with the flexibility to share these factors across different datasets, makes it easy for the algorithm to perform network inference from multiple datasets considered simultaneously and arising from different probability distributions. FUSENET thus offers a collective approach to inference, whereby the network estimation is performed jointly from non-identical data distributions.
Fig. 4. A Markov network modeled using FUSENET, which infers dependencies among any two genes conditioned on their neighbors. Here the dotted line denotes the absence of an edge between s_2 and s_3, implying that s_2 acts independently of s_3 given its neighbours s_1 and s_4. The ⊥ symbol stands for conditional independence.
The FUSENET model is illustrated as follows. Given a collection
D of n observations, D = {x_1, x_2, ..., x_n}, where x_i is a p-dimensional vector drawn from a specific probability distribution, this distribution is associated with a graph G = (V, E*) with parameters {θ*_c, c ∈ C}. Because the graph G embeds the Markov independence properties among the nodes, the task of reconstructing the topology of G is to infer the edge set E* corresponding to the probability distribution from which the observations in D were drawn; the set of edges as a function of the parameters can therefore be written as follows. Learning the network structure using FUSENET then amounts to estimating weights {θ_c, c ∈ C} whose values should be as close as possible to the true but unknown parameters {θ*_c, c ∈ C}. For the final model, the authors chose a pairwise Markov network, i.e. a joint probability distribution model with cliques of size at most two. The overall Markov network structure is then estimated by adopting neighborhood estimation approaches following the works of (Ravikumar et al., 2010; Jalali et al., 2011; Allen & Liu, 2013); the estimate for the entire network Ê is obtained from the estimated neighborhoods, where (s, t) represents an edge between nodes s and t and N̂(s) = {t ∈ V \ {s} : θ̂_st ≠ 0} is the estimated neighborhood of node s. In order to assess the performance of FUSENET, several competing inference algorithms were considered: the FUSENET model was compared with both Gaussian and non-Gaussian models, including graphical LASSO (J. Friedman et al., 2008), the Local Poisson Graphical Model (LPGM) (Allen & Liu, 2013) and the Multinomial Markov Graphical Model (Mult-GM) (Jalali et al., 2011). The considered inference methods, along with FUSENET, were applied to Poisson-distributed simulated data. Four network types, viz. random, hub, small-world and scale-free, were generated, the last three being characteristic of many real biological networks.
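The neighborhood-estimation idea can be sketched with a Gaussian stand-in (node-wise LASSO regressions in the spirit of Meinshausen & Bühlmann) rather than FUSENET's distribution-specific regressions; the chain-structured toy data are invented for illustration:

```python
import numpy as np
from sklearn.linear_model import Lasso

def neighborhood_selection(D, lam=0.15):
    """Node-wise LASSO: regress each gene on all others and connect it to
    genes with non-zero coefficients (OR-rule over estimated neighborhoods)."""
    n, p = D.shape
    edges = set()
    for s in range(p):
        others = [t for t in range(p) if t != s]
        coef = Lasso(alpha=lam).fit(D[:, others], D[:, s]).coef_
        for t, w in zip(others, coef):
            if abs(w) > 1e-8:
                edges.add(tuple(sorted((s, t))))
    return edges

rng = np.random.default_rng(3)
n = 2000
# Chain-structured data 0 - 1 - 2: genes 0 and 2 are conditionally
# independent given gene 1, so no direct 0-2 edge should be inferred.
x0 = rng.standard_normal(n)
x1 = 0.5 * x0 + np.sqrt(0.75) * rng.standard_normal(n)
x2 = 0.5 * x1 + np.sqrt(0.75) * rng.standard_normal(n)
D = np.column_stack([x0, x1, x2])

edges = neighborhood_selection(D)
print(sorted(edges))   # expected: only the two chain edges
```

FUSENET replaces each Gaussian node-wise regression with one matched to that node's distribution (Poisson, multinomial, ...) and couples the regressions across datasets through shared latent factors.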
Datasets with P = 100 variables (nodes) and n = 200 observations were generated, and ROC curves for all the simulated networks using the proposed and comparative methods were reported. It was concluded that FUSENET outperforms its Gaussian-based competitors (GLASSO, Log-GLASSO) as well as methods modeled on Poisson data (LPGM, Mult-GM), showing an overall and consistent performance of FUSENET across all considered networks and more than one data distribution. For an in-depth understanding of the model, refer to (Žitnik & Zupan, 2015b). Since GRN inference for most eukaryotic organisms remains a challenging task, another important study (Banf & Rhee, 2017) using Markov random fields exploits prior biological knowledge and heterogeneous data integration to build high-confidence network prediction models. The study uses two datasets to evaluate the prediction accuracy of the proposed algorithm; one of the findings, on an A. thaliana dataset, is covered here. For the reconstruction of a GRN, data from three different sources were integrated: (i) conserved non-coding sequences in the 2000 bp promoter regions of 17,610 genes, (ii) DNA binding predictions within these sequences for 120 TFs and (iii) a gene expression dataset of A. thaliana development, comprising RNA samples from 83 tissues. These data were used to derive a condition-specific coexpression network, and variance-based filtering was applied to remove genes that exhibit little variance across tissues and developmental stages.
For network inference, a highly robust and scalable tree-based regression is used, which decomposes the network inference task into separate regression problems for each target gene. This model calculates an importance measure for each predictor, which is used as an indicator of a link between a regulator and the target gene. Given the large number of regulators in A. thaliana, Banf & Rhee (2017) ran tree-based regressions with more than 5000 decision trees for each target gene, to ensure that regulators are selected multiple times during bootstrap aggregation and thereby provide stable predictions for each target gene. For all evaluations, they retained all predictions beyond the 95th percentile of the importance distribution.
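A tree-based decomposition of this kind (in the style of GENIE3) can be sketched with a random forest per target gene; the dimensions, the chosen regulators and the nonlinearity below are illustrative:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(4)
n_samples, n_tfs = 200, 8
tf_expr = rng.standard_normal((n_samples, n_tfs))   # candidate regulators
# Hypothetical target controlled by TF 0 (linearly) and TF 3 (nonlinearly);
# tree ensembles capture the nonlinear effect without extra modeling.
target = (1.2 * tf_expr[:, 0] + np.tanh(2.0 * tf_expr[:, 3])
          + 0.1 * rng.standard_normal(n_samples))

# One regression problem per target gene; feature importances score
# the candidate regulator -> target links.
rf = RandomForestRegressor(n_estimators=300, random_state=0).fit(tf_expr, target)
ranking = np.argsort(rf.feature_importances_)[::-1]
print(ranking[:2])   # top-ranked candidate regulators
```

In the full pipeline this loop runs once per target gene, and the importance scores are pooled across targets before the percentile cut-off is applied.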
Given a set of regulatory links, a meta gene regulatory network describes connections between two links l and l′, i.e., l ↔ l′. A connection between two links l and l′ is based on the co-regulation principle, i.e., two different target genes g and g′ being controlled by the same regulator r.
The weight of such a connection is based on a distance metric combining two measures that are assumed to reflect co-regulation. Given this meta gene regulatory network, modules of connected regulatory links can be extracted from the meta network. Each module represents a large group of genes that are pairwise co-regulated by a specific TF r; individual links that are not connected to any other link are retained as single-link modules. The co-regulatory effects within each module are modeled using a Markov random field G = (V, E), which implements a local independence assumption referred to as the Markov property (Schwaller, 2015). This property makes a node independent of all other nodes given its direct neighbors, i.e.: where N_i = {j | {i, j} ∈ E} denotes the set of immediate neighbors of node V_i. An important notion in the model is a clique, defined as a fully connected subset of nodes within the graph, which is considered maximal if it is not contained within any larger clique. In order to retain links in the final network, the model favors links with high weights, as well as strongly connected pairs of regulatory links being in the same state.

KERNEL BASED APPROACHES
Kernel methods represent a mathematical framework which embeds data points (genes, proteins, miRNAs etc.) from an input space I into a feature space F by employing a kernel function. Genomic datasets, viz. mRNA expression levels from RNA-seq, miRNA expression profiles from miRNA-seq and TF-gene regulation matrices obtained from databases such as ENCODE (E. P. Consortium et al., 2004), TRED (Jiang et al., 2007) and TRRUST (Han et al., 2015), comprise heterogeneous datasets that serve as the building blocks of gene regulatory networks and can be fused together using kernel methods.
Each dataset is transformed into a symmetric positive semi-definite kernel matrix by means of a kernel function, that is, a real-valued function k(x_1, x_2) that is symmetric, satisfying k(x_1, x_2) = k(x_2, x_1) for any two objects x_1 and x_2, and positive semi-definite, i.e., ∑_{i=1}^{n} ∑_{j=1}^{n} a_i a_j k(x_i, x_j) ≥ 0 for any integer n, any set of objects x_1, ..., x_n and any set of real numbers a_1, ..., a_n (Charpiat, 2015). Kernel functions provide a coherent representation and a mathematical framework for the input data (genes, TFs, miRNAs etc.) and represent the object features via their pairwise similarity values comprising the n × n kernel matrix K, with entries K_ij = k(x_i, x_j).
Kernel methods offer a modular approach to pattern analysis (Shawe-Taylor & Cristianini, 2004): a generic algorithm performs inner products on the inputs in a feature space and can therefore work with any kernel, and hence any data domain, while the kernel part is data specific. This offers an elegant and flexible way to design learning systems that can easily operate in very high-dimensional spaces, a modular framework where modules are combined to obtain complex learning systems. Commonly used examples (Shawe-Taylor & Cristianini, 2004) are the polynomial kernel k(x_1, x_2) = (⟨x_1, x_2⟩ + c)^d and the Gaussian kernel k(x_1, x_2) = exp(−||x_1 − x_2||² / (2σ²)). For suitable values of d (the degree of the polynomial kernel) and σ (the spread parameter of the Gaussian kernel), the similarity measure k(x_1, x_2) between x_1 and x_2 is always positive, with its maximum at x_1 = x_2. For the Gaussian kernel, all points have unit norm in feature space (since k(x, x) = 1 ∀ x), suggesting that the images of all points lie on the unit sphere in H (Schölkopf & Smola, 2002). In order to handle genomic datasets such as DNA and protein sequences, and network data such as protein-protein interactions and molecular and cellular networks, customized kernel functions have been proposed for building similarity matrices. For example, kernel functions for computing the similarity between two genes/proteins include the spectrum kernel (C. Leslie et al., 2001; C. S. Leslie et al., 2004), the pairwise kernel (Ben-Hur & Noble, 2005), motif kernels for defining similarity between TF-DNA binding motifs (Ben-Hur & Brutlag, 2003) and the pfam kernel (Gomez et al., 2003). Network data represented using adjacency matrices are transformed into similarity matrices using the diffusion kernel. For more in-depth coverage of the range of kernels devised for handling biological data, refer to (Schölkopf et al., 2004).
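The two standard kernels above can be written out directly and checked for the symmetric positive semi-definite property (the data here are random stand-ins for expression features):

```python
import numpy as np

def polynomial_kernel(X, Y, d=3, c=1.0):
    """k(x1, x2) = (<x1, x2> + c)^d"""
    return (X @ Y.T + c) ** d

def gaussian_kernel(X, Y, sigma=1.0):
    """k(x1, x2) = exp(-||x1 - x2||^2 / (2 sigma^2))"""
    sq_dists = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq_dists / (2 * sigma ** 2))

rng = np.random.default_rng(5)
X = rng.standard_normal((20, 6))        # e.g. 20 genes, 6 expression features

Kp = polynomial_kernel(X, X)
Kg = gaussian_kernel(X, X)
for K in (Kp, Kg):
    assert np.allclose(K, K.T)                     # symmetry
    assert np.linalg.eigvalsh(K).min() > -1e-6     # positive semi-definite
print("both kernel matrices are symmetric PSD")
```

The Gaussian kernel's diagonal is identically 1, matching the observation that all points map to the unit sphere in feature space.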
Besides sequence data, which normally occurs in string format, network data from molecular and interaction networks, pathways and Gene Ontology (GO) associations are frequently used in kernel-based integration studies. These types of datasets can be handled using graph kernels in two different scenarios. In the first case, the input variables are treated as nodes within a graph, e.g. proteins in PPI networks or genes in GRNs; in the second scenario, each object is itself represented by a graph, e.g. a protein represented by a phylogenetic tree. Computational biologists have developed kernels for both scenarios, for example kernels built from phylogenetic profiles modeled as trees (Vert, 2002), from graphical representations of small molecules (Kashima et al., 2004) and from graphs modeling protein structures (Borgwardt et al., 2005).

Multiple Kernel Learning Model
Multiple kernel learning (MKL) is a paradigm shift from traditional single-feature-based learning and offers the advantage of combining multiple features of objects such as genes, proteins and metabolites as different kernels (Sonnenburg et al., 2006). This information can be fed as an ensemble into an MKL algorithm as a combined kernel matrix for classification or regression tasks on unknown data. The basic algebraic operations of addition, multiplication and exponentiation, when used to combine multiple kernel matrices, preserve the positive semi-definite property and enable the use of a powerful kernel algebra; a new kernel can be defined from k_1 and k_2, with their corresponding embeddings Φ_1(x) and Φ_2(x), together with a new induced embedding. Given a kernel set K = {k_1, k_2, ..., k_m}, an affine combination of the m parametrized kernels can be formed as k = ∑_{i=1}^{m} μ_i k_i, subject to the constraint that the weights μ_i are positive, i.e. μ_i ≥ 0, i = 1, ..., m. A kernel-based statistical classifier such as an SVM induces a margin in feature space, separating the two classes using a linear discriminant. Finding this linear discriminant requires solving an optimization problem known as a quadratic program (QP), a form of convex optimization problem that is easily solvable. Kernel-based integration methods were first proposed in an early study wherein a 1-norm soft-margin SVM was trained for a classification problem separating membrane proteins from ribosomal proteins. The authors combined heterogeneous biological datasets, viz. PPI, amino acid sequence and gene expression data characterizing different proteins, by transforming them into multiple positive semi-definite kernel matrices using different kernel functions. Their findings reveal an improved classifier performance when all datasets are integrated as a unit, compared to testing the classifier on individual datasets.
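A minimal sketch of this workflow, with fixed (not learned) kernel weights and toy data in place of real genomic sources:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(6)
# Two toy "data sources" describing the same 100 genes.
X1 = rng.standard_normal((100, 5))
X2 = rng.standard_normal((100, 3))
y = (X1[:, 0] + X2[:, 0] > 0).astype(int)   # class depends on both sources

K1 = X1 @ X1.T                  # linear kernel on source 1
K2 = X2 @ X2.T                  # linear kernel on source 2
mu = [0.5, 0.5]                 # fixed weights; full MKL would learn these
K = mu[0] * K1 + mu[1] * K2     # affine combination of PSD matrices stays PSD

clf = SVC(kernel="precomputed").fit(K, y)
acc = clf.score(K, y)
print(f"training accuracy on the combined kernel: {acc:.2f}")
```

In a full MKL formulation the weights μ_i are optimized jointly with the SVM (e.g. via semidefinite or semi-infinite programming), so that informative data sources receive larger weights automatically.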
In an earlier study on function prediction for baker's yeast proteins, (Lanckriet et al., 2003b) trained an SVM classifier with multiple datasets and achieved improved performance over a classifier trained on a single data type.
In another study on network inference using kernel-based data integration (Yamanishi et al., 2004), four different datasets, viz. gene expression data, protein interaction data, protein localization data and phylogenetic profiles, were transformed into different kernel matrices. Datasets that assign a vector to each protein/gene, viz. gene expression, protein localization and phylogenetic profiles, can use Gaussian, polynomial or linear kernels as transformation functions. Graph datasets are kernelized using the diffusion kernel (Kondor & Lafferty, 2002), defined as $k = \exp(\beta H)$.
Here $H = A - D$ is the negative graph Laplacian, obtained by subtracting the diagonal degree matrix $D$ of the graph from its adjacency matrix $A$, and $\beta > 0$ is a parameter. All genomic datasets were transformed into different types of kernels. The gold-standard protein network and the noisy protein interaction datasets were represented by a diffusion kernel with parameter $\beta = 1$. Gene expression data were kernelized using a Gaussian RBF kernel with $\sigma = 5$. A linear kernel function was used to transform both localization data and phylogenetic profiles. The study (Yamanishi et al., 2004) compared both unsupervised and supervised inference methods on single and integrated datasets. For unsupervised inference, both the spectral and the direct approaches were tested either on a kernel derived from a single genomic dataset or on a combined kernel from multiple genomic datasets. Using the spectral approach, a feature space is defined by selecting the first L = 50 principal components. To assess the accuracy of both methods in terms of their capacity to recover existing interactions, the inferred network is compared with a gold-standard protein network. The overall accuracy reported for both methods suggests that they capture little information from the gold standard. Although results generated using the spectral approach with integrated datasets displayed improved accuracy, an increased rate of false positives was observed at any given true positive rate.
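As a concrete illustration of the diffusion kernel $k = \exp(\beta H)$ with $H = A - D$, the sketch below uses SciPy's matrix exponential on a toy three-node path graph. The graph and the value of $\beta$ are illustrative choices, not taken from the study.

```python
import numpy as np
from scipy.linalg import expm

def diffusion_kernel(A, beta=1.0):
    """Diffusion kernel K = exp(beta * H), H = A - D (Kondor & Lafferty, 2002).

    A is the adjacency matrix of an undirected graph; D is the diagonal
    degree matrix. The matrix exponential of beta * H is symmetric and
    positive semi-definite, hence a valid kernel between graph nodes.
    """
    A = np.asarray(A, dtype=float)
    D = np.diag(A.sum(axis=1))
    H = A - D
    return expm(beta * H)

# A small path graph on 3 nodes: 0 - 1 - 2
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]])
K = diffusion_kernel(A, beta=1.0)
assert np.allclose(K, K.T)                  # symmetric
assert np.linalg.eigvalsh(K).min() > -1e-9  # positive semi-definite
```

Larger values of $\beta$ diffuse similarity further along the graph, so distant nodes receive higher kernel values.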
Again with the number of principal components L = 50 defining the feature space, various combinations of kernels from multiple datasets were tested against a gold-standard kernel. To evaluate the performance of the supervised approach, a k = 10-fold cross-validation procedure was performed. A curve is gradually built by selecting true positives and plotting them as a function of false positives. Contrary to the direct and spectral approaches, the supervised approach captures most of the information about the gold-standard protein network and makes interesting predictions. Datasets such as gene expression and phylogenetic profiles appear to contribute an equal quantum of information, followed by the noisy PPI and localization datasets. Applying the supervised approach to integrated datasets produces the overall best results, highlighting the importance of guiding network inference with integrated prior biological knowledge. In another study, (Ben-Hur & Noble, 2005) applied kernel methods to PPI prediction and proposed a pairwise kernel between two pairs of proteins in order to construct a similarity matrix. Representing a protein sequence pair $(X_1, X_2)$ by the domain or motif pairs that appear in it, the pairwise kernel is described as

$$K'((X_1, X_2), (X_3, X_4)) = K(x_{12}, x_{34}),$$

where $x_{12}$ is the pairwise representation of the sequence pair $(X_1, X_2)$ and $K(\cdot,\cdot)$ is a kernel that operates on vector data. This pairwise kernel was built on three sequence kernels: the spectrum, motif and Pfam kernels. The ROC scores reported for these kernels are 0.76, 0.78 and 0.81, respectively.
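The idea of a kernel between pairs of proteins can be sketched compactly. The snippet below implements the tensor-product form of a pairwise kernel, one common construction for comparing pairs, using a linear base kernel on toy vectors as a stand-in for the sequence kernels above; it is an illustration of the pairwise idea, not the paper's exact pipeline.

```python
import numpy as np

def linear_kernel(x, y):
    """Base kernel on individual protein feature vectors."""
    return float(np.dot(x, y))

def pairwise_kernel(p, q, k=linear_kernel):
    """Tensor-product pairwise kernel between pairs p=(x1,x2), q=(x3,x4):

        K((x1,x2),(x3,x4)) = k(x1,x3)k(x2,x4) + k(x1,x4)k(x2,x3)

    The symmetrization makes the result invariant to the ordering of
    proteins within each pair, as required for undirected interactions.
    """
    (x1, x2), (x3, x4) = p, q
    return k(x1, x3) * k(x2, x4) + k(x1, x4) * k(x2, x3)

# Toy feature vectors standing in for four proteins
a, b, c, d = (np.array(v, dtype=float) for v in ([1, 0], [0, 1], [1, 1], [2, 0]))

# Invariance to the ordering of proteins within a pair:
assert pairwise_kernel((a, b), (c, d)) == pairwise_kernel((b, a), (c, d))
```

The resulting similarity matrix over protein pairs can then be fed to an SVM exactly as in single-object classification.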
This experiment was further extended to explore the effect of adding kernels from non-sequence data, such as Gene Ontology annotations, homology scores and the mutual clustering coefficient (MCC) derived from protein interactions, computed in each cross-validation fold. Integrating these non-sequence features with the pairwise kernel resulted in better performance than any method by itself, as revealed by their ROC scores. Furthermore, tuning the soft-margin parameter C of the SVM according to the reliability of the interactions provided another significant boost to performance, yielding an ROC score of 0.98 at a false positive rate (FPR) of 1%. The comparative analysis shows that the gain in model performance is due to the contribution of prior biological knowledge incorporated in the form of the GO-process kernel and other features of biological relevance.

ANN BASED APPROACHES
An artificial neural network (ANN) is an information-processing computing paradigm inspired by the structural and functional organization of biological neural systems. ANNs can learn from data, approximate a variety of nonlinear functions and handle noisy data robustly, making them suitable candidates for inferring gene regulatory interactions from genomics data. Several variants of ANNs have been successfully applied to modeling gene regulatory interactions, including multilayer perceptrons (Kim et al., 2000; J. Huang et al., 2003; Zhou et al., 2004), self-organizing maps (SOMs) (Weaver et al., 1999) and recurrent neural networks (RNNs) (Vohradský, 2001; Keedwell et al., 2002; Tian & Burrage, 2003; R. Xu et al., 2004, 2007a, 2007b; Ghazikhani et al., 2011; Noman et al., 2013; Raza & Alam, 2016). However, none of these approaches integrates heterogeneous data to infer GRNs, with the majority using only gene expression data. Very few studies in the literature use an ANN-based integration framework for genomics, such as the deep learning (LeCun et al., 2015) based multi-modal framework illustrated in Fig. 5. The fundamental idea is to build a subnetwork for each data view/dataset and then integrate the outputs of the individual subnetworks in the higher layers.
Subnetworks at lower layers offer the flexibility to select appropriate deep learning models for different data types, such as a deep belief net (DBN) (Hinton et al., 2006) or deep Boltzmann machine (DBM) (Salakhutdinov & Larochelle, 2010) for binary/Gaussian data, convolutional networks (LeCun et al., 1995) for image data, and RNNs (Graves & Jaitly, 2014) for sequential data, with subnetworks being either directed or undirected. Besides their application to learning representations from image data (Srivastava & Salakhutdinov, 2012), multi-modal DBN models have also been applied to genomics data, integrating gene expression, DNA methylation and drug response data for subtyping tumours on the basis of survival (Liang et al., 2015), showing superior performance to k-means clustering. Deep learning models have also been applied to study gene regulatory mechanisms: DeepBind (Alipanahi et al., 2015) has been shown to be broadly applicable, yields increased predictive power compared with traditional single-domain methods, and offers a scalable, flexible and unified computational approach for pattern discovery in regulatory genomics, outperforming most state-of-the-art methods when combining multiple genomic datasets (protein binding microarrays, ChIP-seq and RNAcompete assays). Although this multi-modal deep learning architecture could serve as a template for building scalable regulatory networks, we could not trace a single study that employs deep learning integrative approaches to learn GRNs from multi-modal data.
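The multi-modal architecture described above (one subnetwork per data view, merged in an integrative layer) can be sketched as a plain forward pass. The weights below are random stand-ins for trained parameters, and the two "views" (expression, methylation) are hypothetical; the sketch shows the data flow, not a trained model.

```python
import numpy as np

rng = np.random.default_rng(1)
relu = lambda z: np.maximum(z, 0.0)

def subnetwork(x, W, b):
    """One hidden layer per data view: learns a view-specific representation."""
    return relu(x @ W + b)

# Two hypothetical data views for one sample: expression (10-d) and
# methylation (6-d). Each view gets its own subnetwork and weights.
x_expr, x_meth = rng.normal(size=10), rng.normal(size=6)
W1, b1 = rng.normal(size=(10, 4)), np.zeros(4)
W2, b2 = rng.normal(size=(6, 4)), np.zeros(4)

# View-specific subnetworks, then fusion in a higher (integrative) layer
h = np.concatenate([subnetwork(x_expr, W1, b1), subnetwork(x_meth, W2, b2)])
W3, b3 = rng.normal(size=(8, 2)), np.zeros(2)
joint = relu(h @ W3 + b3)  # integrative layer operating on both views
assert joint.shape == (2,)
```

The component-wise structure is what permits pre-training each subnetwork on its own view before fine-tuning the whole network jointly.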

DISCUSSION & CONCLUSION
Molecular information produced by high throughput technologies (NGS, mass spectrometry, microarray etc.), accruing at gigantic scales, has opened floodgates to knowledge of the biological world. Large-scale information about genomic factors (genes, proteins, mRNA, miRNA, metabolites, SNPs etc.) that characterize diverse biological processes in both prokaryotes and eukaryotes can be obtained using modern experimental technologies. Besides, huge volumes of annotation and functional information about genes, proteins, DNA, RNA, diseases and signaling pathways are housed in publicly accessible databases. The volume and diversity of this information has led to the emergence of new statistical techniques and the development of computational tools aimed at identifying the genomic factors and interactions that play an essential role in the architecture of complex phenotypes, and it continues to unravel novel biological insights. The realization that inference studies using a single data type possess limited predictive power has led to the development of data integration methods that widen the scope for exploring novel interactions, disease risks, drug targets or any other biological outcome. Although no single gold-standard method has emerged from these systems genomics approaches, various strategies can be adopted to perform a powerful integrative analysis of heterogeneous datasets without relying on a single best approach. Still, these data integration approaches provide a means of analyzing multiple datasets simultaneously for a comprehensive understanding of biological systems.
Here we present a detailed review of some sophisticated statistical and machine learning techniques used in the inference of regulatory networks from multiple (homogeneous/heterogeneous) datasets. The aim is to bring together a diverse set of tools and techniques and highlight their potential and limitations in solving biological inference problems when presented with multiple sets of data.
Given the complexity and heterogeneity of the data, the approaches included here do not overcome all the data integration challenges outlined in Section II; therefore our emphasis is on how a particular method solves a particular biological problem or handles a specific data type. In that context, network integration methods offer a simple and straightforward solution whereby similar nodes (genes/proteins) across multiple networks are integrated by merging different types of edges. Similarly, Similarity Network Fusion builds similarity networks for multiple datasets and then merges these individual networks using a novel fusion method. Although simple, these methods are less efficient at preserving relationships across multiple networks, particularly when multiple networks are projected onto each other. Probabilistic graphical models such as Bayesian networks and Markov networks are popular approaches to modeling regulatory networks because they provide suitable frameworks for biological problems. Both BNs and Markov networks model conditional independence relationships and can handle the noisiness inherent in biological data. Although BNs do not scale well to large datasets, with a variable selection approach they can outperform some state-of-the-art methods proposed in the literature. Markov networks are the other variant of probabilistic graphical models employed to infer gene interactions from multiple genomics datasets; for example, FUSENET uses Markov networks to uncover conditional dependence relationships between genes, while regressions are employed to reconstruct a network from the identified interactions in an iterative manner, thereby circumventing the computational cost of network inference.
Similarly, regularized regression using LASSO-based formulations provides effective statistical tools to model and infer the regulatory relationships embedded in genomic datasets. Regularization parameters are tuned to control the sparsity of the inferred network while also harnessing the information contained in multiple datasets. These methods sometimes lose important independent variables (and hence interactions) because of the feature selection performed by the shrinkage parameters. Kernel-based methods, besides integrating multiple and diverse datasets, can also handle large-scale data, provided a kernel function for a specific data type is available; where there is limited scope for such a function to be defined, data integration for heterogeneous data types becomes less effective. As for the integration strategy adopted by various methods, intermediate (transformation-based) strategies have been reported to achieve improved accuracies in inference studies from multiple datasets. Moreover, kernel-based methods provide an automatic means of weighting data in an integration framework and can be extended to other methods for selecting informative datasets. ANN-based deep learning models possess several strengths for data integration: they offer a component-wise learning paradigm that can significantly reduce computational cost. For example, to learn model parameters, the subnetworks can be pre-trained on different data views separately, and the parameters of the entire network can then be globally fine-tuned. Also, using multi-modal deep networks, heterogeneous information from different sources can be jointly considered in the integrative layers for inference, classification and clustering (LeCun et al., 2015). Multi-modal networks can even learn from incomplete data views, enabling maximal use of the available data instead of only samples with complete views (Srivastava & Salakhutdinov, 2012).
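The LASSO-based network inference strategy discussed above can be sketched gene by gene: each target gene is regressed on all other genes, and nonzero coefficients are read off as candidate regulators. The data below are synthetic and the choice of alpha is illustrative; it is the parameter that trades off sparsity against recovered interactions.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Synthetic expression matrix: 50 samples x 6 genes. Gene 0 is driven
# by genes 1 and 2 plus a little noise, so those two interactions
# should survive the L1 shrinkage.
rng = np.random.default_rng(2)
X = rng.normal(size=(50, 6))
X[:, 0] = 0.8 * X[:, 1] - 0.6 * X[:, 2] + 0.1 * rng.normal(size=50)

# Regress the target gene on all remaining genes; nonzero coefficients
# are the inferred regulators. alpha controls the sparsity of the network.
target, predictors = X[:, 0], np.delete(X, 0, axis=1)
coef = Lasso(alpha=0.05).fit(predictors, target).coef_

# Genes 1 and 2 (columns 0 and 1 of `predictors`) should be recovered
assert abs(coef[0]) > 0 and abs(coef[1]) > 0
```

Repeating this regression for every gene and assembling the nonzero coefficients yields the full inferred network; a smaller alpha admits more edges, a larger one prunes them.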
Because of their complex structure, models based on deep neural networks suffer from overfitting, are not easily scalable and need large amounts of data for training.
The methods for data integration and inference included here do not provide perfect solutions, but they have nonetheless yielded good results. There are no set standards for validating the performance of these methods in terms of the quality of the integration, nor are there proper criteria by which two methods can be compared. Nonetheless, research studies using systems genomics approaches are being carried out at an unprecedented scale to handle the data explosion in biology. As the generation of biological data continues across different omics layers, the development of more comprehensive systems genomics strategies and tools will lead to an enhanced comprehension of complex-trait architecture and produce new knowledge about human physiology and diseases.