A mechanistic model for genome-scale estimations of metabolic activity

In the last decade, different omic technologies have experiences an exponential technological advancement. However, metabolomics has not followed a similarly vertiginous improving improvement and are far from genomics or transcriptomics in terms of throughput, cost and even accuracy. Therefore, genome-scale in-silico methodologies to estimate metabolic activities from genomic data constitute an active field. The solutions available fall into two extremes: those with few assumptions about the relationships among proteins and metabolites, which are easy-touse but less accurate and those that account for the complex relationships among molecules and proteins defined in the metabolic pathways, which are more accurate but require mathematical skills. Here, we introduce Metabolica, an algorithm that considers the complex functional relationships among all the molecules and proteins involved in the metabolic pathway analyzed but keeping an easy use that do not require of advanced mathematical skills. Metabolica has been implemented in a freely available software R package. The software inputs transcriptomic data and infers the activities of the reactions that produce the different metabolites in the pathway analyzed. An example shows how detected dysregulated metabolites in several cancers are related to patient survival.


Introduction
The deviant levels of metabolites are precise indicators of metabolic irregularities behind the neoplastic cell requirements during cancer initiation and progression. These requirements include mainly the biosynthesis of building blocks (nucleotides, lipids, and amino acids) and the fulfilment of the enormous energy dependencies of rapidly proliferating tumor cells. The main elements of metabolism are the metabolites and the biochemical reactions that are catalyzed by enzymes, which participate in a network of complex interactions that are described in detail in different pathway repositories, like Reactome [1], KEGG [2], Wikipathways [3], etc., or other ones more specific, like disease maps [4]. The quantification of these elements is essential to understand their mechanistic interactions with cellular functions (e.g. apoptosis, metastasis, etc.) [5] and to elucidate their cancerous/oncogenic activities [6].
Despite the relevance of metabolism in cancer was known for almost one century [7], with the observation of enhanced aerobic glycolysis (the well-known Warburg effect) [8], the number of measurements. The benchmark was restricted to 267 metabolites that were contained within metabolomics datasets and had KEGG compound identifiers. The significant differentially regulated metabolites (DRM) are depicted in Figure 1. A total of 74% of DRMs predicted by Metabolica were coincident with the the experimental metabolic profiling. The percentages provided by RM for the same comparison were lower than %1. The total number of DRMs predicted by Metabolica was almost twice the number obtained from the metabolomics datasets. It is important to note that the Metabolica determines DRMs based on the production rate of metabolites but does not necessarily report the actual balance of the metabolite, which is what experimental technologies such as gas chromatography and mass spectrometry detect, given that intermediate metabolites could be further depleted, transformed by other reactions. In order to assess the real ratio of false positives of the algorithm a test was carried by a systematic comparisons between pairs of groups of individuals sampled from the same condition. Since the individuals in the compared groups belong to the same condition, and are actually identical, any difference reported by Metabolica can be considered a false positive. (see Methods). The mean percentage of false positive results was always lower than % 0.003 when the conventional alpha value of 0.05 set as the threshold of significance ( Figure 2). Therefore, biases due to false positives in Metabolica can be discarded.

Impact of metabolites in cancer progression
In order to check whether the estimated metabolite production activities were related with cancer progression and patient prognosis, patient survival and cancer stage analyses were carried out. The number of metabolites significantly associated with patient survival was remarkably different between the two cancer types studied: only 2 metabolites in BRCA and 307 metabolites in KIRC (FDR-adjusted p<0.05 of hazard ratio for Cox model). Out of 307 significant metabolites found in KIRC, 19 were members of arginine and proline metabolism, glutathione metabolism, and primary bile acid biosynthesis pathways. The abundances of these metabolites also showed high correlation with the cancer stages (pearson | r | > 0.9). The results of patient survival and cancer stage of three clinically relevant metabolites from this list, putrescine (C00134), Acyl-CoA (C00040) and Acetyl-CoA (C00024) are given in Figure 3.There is a significant increasing trend for Acetyl-CoA (Correlation coefficient=0.73) and a decreasing trend for putrescine (CC=-0.92) and Acyl-CoA (CC=-0.90).  Table S1.

Discussion
Metabolica is based in a generalized mechanistic model that accounts for the activity of the reactions required to produce any metabolite in the whole metabolic pathway. The results suggest that there is a production of the metabolite, which doesn not mean that the metabolite is detectable given that it can be consumed in another reaction. However, if the metabolite plays a relevant role, this information about its production is relevant. Figure 1 shows a remarkable difference in the number of predicted DRMs by the Metabolica and RM, suggesting that the use of sub-pathways or circuits in the context of mechanistic models produces a more accurate detection of metabolite production activity [26, 36,37]. In particular Pyridoxate, also known as vitamin B6, was identified as a DRM in KIRC by both methods (Figure 1). Pyridoxate and its bioactive form has been identified as a critical factor to alter apoptosis induction properties of chemotherapeutics. Actually, In lung cancer, low expression of pyridoxal kinase gene, which encodes the enzyme that generates the bioactive form of vitamin B6, was found to be associated with poor prognosis and proposed as a biomarker for risk stratification among lung cancer patients [38]. Glucose dependence is known to be the main metabolic characteristic exhibited by tumor cells [7]. Interestingly, both Metabolica and real measurements show different levels of differential activity in the production of this metabolite between KIRC and BRCA ( Figure 1). This observation may be a consequence of the differences that exist between the glucose metabolization in different cancer cells, that was recently reported by isotope tracing in KIRC [39]. There were metabolites which have been identified as DRMs only by the Metabolica in both cancer types. For example, 5,6-dihydrouracil ( Figure 1), which is an intermediate product of breakdown process of uracil into beta-alanine required for epithelial-mesenchymal transition [40]. This catabolic process is also called pyrimidine degradation module and was found to be essential for cancer cell survival in some tumor types and experimentally validated by us [5]. Although a comprehensive description of the results is beyond the scope of this manuscript, it is worth mentioning how the dysregulation of metabolites in BRCA like succinate and kynurenine ( Figure 1) that are taking a key role in the initiation of tumorigenesis and its progression were correctly predicted [41,42].
The therapeutic perspective of the metabolites is an important aspect that cannot be despised. Since metabolites can act as direct regulators of gene expression, they have been used for decades as therapeutic agents, targets, or biomarkers [43]. Thus, one the main motivations here was to demonstrate the clinical relevance of metabolites. To achieve so, patient survival and cancer stage analyses were performed for the metabolites. As previously mentioned, both cancers showed a remarkably different number of metabolites significantly associated with patient survival, which was in agreement with recent observations of pan-cancer analyses of prognostic genes and metabolic modules [5,44]. On the other hand, the low number of significant associations observed in BRCA has been suggested to be an artifact derived from the short follow-up time of the TCGA samples [45]. Figure 3 shows three clinically relevant metabolites detected by Metabolica from this 10 members of the arginine and proline metabolism, glutathione metabolism, and primary bile acid biosynthesis pathways, putrescine (C00134), Acyl-CoA (C00040) and Acetyl-CoA (C00024). Putrescine is a polycationic alkylamine and the precursor of spermidine and spermine. These polyamines are involved in many fundamental processes, essential for normal cell growth and their depletion may have a cytostatic effect on some tumors. It is known that polyamine metabolism is frequently dysregulated in cancer and polyamine blocking therapies are used to heighten immune responses in cancer [46][47][48]. Acyl-CoA and Acetyl-CoA are the important metabolites of the fatty acid biosynthesis and elongation processes. They have been identified as critical factors for tumor growth by means of their effect on histone acetylation and gene expression [49,50]. The high production rate of Acetyl-CoA from acetate is known to be associated with poor prognosis cancer cell survival [51]. Mitochondrial respiration under prolonged hypoxic conditions increases the generation of reactive oxygen species (ROS) that results in the cell death [52]. Therefore, the cancer cells adapt to the hypoxic microenvironment by limiting the conversion of pyruvate to acetyl-CoA that is entering into the tricarboxylic acid (TCA) cycle. For all that, the required acetyl-CoA in the other cellular processes can be generated by alternative routes like using acyl-CoA in β-oxidation [52,53]. The negative correlation between the tendencies of acyl-CoA and acetyl-CoA from healthy tissue to tumor stage 4 is also confirming this alternative process ( Figure 3).
In summary, the Metabolica has been designed to predict metabolite production activity that can be used as potential diagnostic biomarkers or drug targets for complex traits. In particular, it can be used for in-silico targeted enrichment of metabolites to prioritize them for experimental metabolomics studies. The Metabolica tool, the data used in this study and more comprehensive results can be found at https://github.com/babelomics/Metabolica.

Breaking down the pathway into elementary sub-pathways
Metabolica requires the definition of sub-pathways in which the activity of the reactions of metabolite synthesis are modeled. Breaking down a pathway into sub-pathways and estimating the activity of a sub-pathway do not depend on a particular pathway repository, but it requires essential information for metabolic reactions (substrate, product, and reversibility descriptions). Here, the canonical pathways presented in KEGG database [54] were used. A total of 78 human metabolic pathways, containing 1901 reactions and 1270 metabolites (Table S2) were downloaded. The KGML files were parsed using the KEGGgraph Bioconductor package [55]. Figure 4: Example of a sub-pathway. The production sub-pathway of 4-Androsten-16alpha-ol-3,17dione (blue node) starting from cholesterol and 20alpha,22beta-Dihydroxycholesterol (green nodes). This sub-pathway (left) was dissected from the steroid hormone biosynthesis pathway (right). The circles, rectangles and arrows are representing metabolites, metabolic reactions and reaction reversibility, respectively.
The sub-pathway that produces a given metabolite is defined by all the nodes which were visited inside its pathway using breadth-first search algorithm. This process starts from the metabolite produced (so-called product) and continues iteratively on the direction of the edges which are arriving at the product and its connected neighbor nodes. Figure 4 shows a real example of a subpathway which is extracted from its pathway as described above. Because of the highly interconnected nature of metabolic pathways and the numerous feedback loops, the convergence of calculations is challenging when the propagation algorithms are applied on metabolic hypergraphs. To deal with this issue, the feedback loops which are not derived from the product were kept, however, all the feedback loops (outdegree edges) of the product were removed. By this means, we also restrict the consumption of the product by its producing pathway.
Similarly to the signaling [27] or metabolic module [36] implementations, Metabolica requires starting node(s) to initialize the propagation of metabolic flux along sub-pathways. The definition of starting nodes is a 2-steps process: first, the metabolites with indegree of zero and the metabolites at the farthest position in the sub-pathway (products) are selected, and the propagation algorithm (see below) is ran without any objective function. In the second step, the nodes which were not visited in the previous run to the list of starting nodes are included in order to guarantee that all the nodes can be visited in the further runs. All the decomposing steps were done only one time and saved for future analysis of the pathway.

Estimating reaction node activities
The method proposed in this study uses the expression levels of the genes that encode enzymes as proxies of levels of the corresponding gene product enzymes and, consequently, of their activity levels [58][59][60]. The transcriptomics data mapping from genes to enzymes (proteins) and reactions were done using the gene-protein-reaction (GPR) relationships [61]. In the case of single gene (protein) -reaction relationships, the normalized gene expression level is used as the activity level of the reaction. For reaction nodes composed of multiple genes (proteins), where the reaction is catalyzed by isozymes or enzyme complexes, a 50th percentile of expression values of the genes is assigned as the activity of the reaction. The percentile threshold (50th) used to summarize gene expression values into a reaction activity was empirically obtained from the observed regulation patterns of 3 well characterized metabolic reactions in the context of the Warburg effect: (S)-Lactate + NAD+ <=> Pyruvate + NADH + H+ (KEGG ID: R00703), L-Alanine + 2-Oxoglutarate <=> Pyruvate + L-Glutamate (KEGG ID: R00258) and Pyruvate + Thiamin diphosphate <=> 2-(alpha-Hydroxyethyl)thiamine diphosphate + CO2 (KEGG ID: R00014), depicted in Figure 5A.
To define the optimal threshold the Warburg Effect, defined as an increase in the rate of glucose uptake and preferential production of lactate (anaerobic), even in the presence of oxygen (aerobic) [56], was used. Therefore, the observation of an increase in the reaction activities of anaerobic respiration compared to aerobic in tumor samples ( Figure 5A) was expected. Figure 5B illustrates the upregulation of the main reaction of anaerobic branch (R00703) at 10th, 30th, 40th and 50th percentiles when tumor samples compared to healthy samples. Actually, the highest up regulation was observed at 50th percentile for both cancer types. The first reaction of the aerobic branch, R00014, was down regulated at all percentiles (from 10th to 90th percentiles). Finally, the reaction R00258 was up regulated at 10th, 20th, 30th, 40th and 50th percentiles in both cancer types [57]. Therefore, although the whole range from 10 th to 50 th percentiles is compatible with the Warburg Effect, the highest amount of lactate production, decrease in pyruvate decarboxylation and increase in the nitrogen transfer from glutamine to alanine over pyruvate is attained at 50 th percentile, with the highest lactate production using pyruvate, which is the essential metabolic process in tumorigenesis [57] .

Computing the sub-pathway activity
For each reaction node ri of a sub-pathway, the metabolic flux propagation is computed by the given formula according to the following recursive rules: Where Wms,ri is the amount of substrate (ms) used by reaction (ri), Wrimp is the amount of product (mp) produced by the reaction ri and mp is the final amount of product which is produced by different reactions. N + and Ndenote neighborhood of a node on the direction of its outgoing and incoming edges, respectively. n is the total number of Wrimp plus 1 for mp. Thus, the equation (1) distributes the substrate proportionally with the activities of its consuming reactions. The equation (2) aims to elucidate the reaction rate (limited by the minimum amount of the substrates used). The amount of metabolite produced per unit time depends on the capacity of enzyme (saturation) and the amount of substrate. This is the combination of Michaelis-Menten kinetics and systems-level analysis of mechanisms regulating metabolic fluxes [62]. The equation (3) updates mp node with the amount of contributing product of the reaction ri without saturating this node and it can also handle the loops appropriately. The loops in a sub-pathway need a high number of iterations to stabilize the flux propagated. Thus, the Metabolica iterates the flux that is in a loop until it reaches the convergence state. Here, the convergence state is defined as almost-zero flux change between iterations. Therefore, the Metabolica repeats the steps 1, 2 and 3 until the flux initiated in the initial nodes reaches the product in a sub-pathway and while the flux which is propagated in a loop has not reached convergence. Metabolica input values in the [0,1] interval and returns output values in the same interval. Such results are non-dimensional values that, like gene expression values, can be interpreted in the context of a comparison.

Samples and data processing
RNA-seq counts and simple somatic mutations data for a total of 1550 samples, 1365 corresponding to tumor and 185 to healthy reference tissues, belonging to breast invasive (BRCA) and kidney renal clear cell (KIRC) carcinomas were downloaded from The International Cancer Genome Consortium (ICGC) repository [63]. The trimmed mean of M-values (TMM) normalization method [64] was used for gene expression normalization. Normalized samples were log-transformed and a truncation by quantile 0.99 was applied. The COMBAT method [65] was used for batch effect correction. Finally, the data was re-scaled between 0 and 1.
Annovar tool [66] with its ljb26 database was used for functional annotation of non-synonymous genetic variants. The variants predicted as damaging by at least 3 out of 5 in-silico pathogenicity predictors were considered loss-of-function (LoF) mutations. These in-silico methods are: SIFT [67], Polyphen2 [68], FATHMM [69], MutationTaster [70], MutationAssessor [71]. For each tumor sample, expression value of the genes that were affected by the damaging variants were multiplicated by a decreasing constant: 0.001, to simulate LoF in the enzyme (equivalent to a non-expressed gene).

Comparative performance of Metabolica
The performance of Metabolica and the RM method in detecting DRMs were compared to real mass spectrophotometry (MS) metabolic profiles. An R implementation of RM method, as described in the original paper [23] was used. As it is recommended in the RM article, the aggregated Z scores were corrected for the background distribution by subtracting the mean and dividing by the standard deviation. For each dataset size, random sampling was repeated 10000 times to calculate background mean and standard deviation statistics. Scaled and imputed metabolomics datasets of BRCA and KIRC were downloaded from supplementary materials of their articles [72,73]. Additionally, quantile normalization was applied to these datasets using preprocessCore Bioconductor package [74].
Here, the samples and sample sizes which were used in RM and the Metabolica calculations belong to TCGA datasets and they are different than the samples of metabolomics datasets analyzed. But all these datasets contain the paired samples of tumor and healthy. Therefore, Student's t-test for paired samples is used to assess the significance of observed changes of metabolites when samples of two conditions are compared.
Finally, significant (FDR-adjusted p<0.05) DRMs of 3 methods were compared. The results of metabolomics profiles are taken as gold standards for this comparison.

False positive rate of Metabolica
In order to estimate the false positive rate (type I error), different testing datasets of samples were generated from the original dataset, and divided them into two equally sized groups that were further compared to each other for finding DRMs with Metabolica. Since the compared groups are composed of the same type of individuals, any significant change found in sub-pathway activities, can be considered a false positive result. To avoid biases derived from sample size or from the type of sample, different group sample sizes were tested both in cancer and in normal individuals. Sample sizes of groups ranged from 5 to 55 for normal samples and from 50 to 450 for tumor samples, which were also proportional to the total number of normal and tumor samples used in this study. For each sample size, 1000 different simulated samplings were carried out and compared. Student's t-test for unpaired samples used to assess the significance of the observations.

Clinical Relevance of Metabolite Producing Sub-pathway Activities
Since the Metabolica provides individual-level results the clinical relevance of metabolites can be tested. In order to show the potential prognostic value of sub-pathway activities, tumor stage and overall survival time of BRCA and KIRC patients were analyzed. Clinical data were obtained from the cBIOportal [75].
Cox hazards multivariate regression model was used considering the following variables: metabolite abundances, age, tumor stage, sex, race, and tumor purity. Metabolites having significant Cox proportional hazards regression models (p-value of log rank test < 0.05, in which metabolite abundances presented significant contribution with p<0.05) were prognostic metabolite candidates. For the visualization of the survival results Kaplan-Meier (K-M) plots were used. The K-M analysis needs a stratification variable with at least two categorical groups [76]. For this reason, we have discretized the metabolite level values. As a common discretization procedure, for each metabolite, lower 45th and upper 55th percentiles of its abundance were used to categorice the samples into two groups. Samples between these two percentiles were not used. Survival analysis was carried out using survival R package [77]. Similarly to the case of two class comparison, multiple testing effects are corrected by FDR [78].
For the tumor stage analysis continuous values of the metabolite abundances were used. The tumor stage codes of American Joint Committee on Cancer, which are divided into four categories (T1, T2, T3 and T4) were used for this analysis. Samples with detailed subdivision codes were pooled under their main designators (e.g. T3' = |T3| U |T3a| U |T3b| U … |T3n|). To find metabolites with a significant role in tumor progression, a linear correlation between tumor stages and metabolite abundances was used.

Conclusions
The Metabolica provides a simple and elegant algorithmic framework for genome-scale metabolic analysis of transcriptomic and genomic data, which accounts for the complexity of the relationships between proteins within metabolic pathways, and delivers estimations of metabolite production activity. Table S1: Death events for the survival analysis. Table S2: List of human metabolic pathways, reactions and metabolites downloaded from KEGG and used in this study. Funding: Please add: This research was funded by grants SAF2017-88908-R from the Spanish Ministry of Economy and Competitiveness and PT17/0009/0006 from the ISCIII, both co-funded with European Regional Development Funds (ERDF) as well as by European Union's Horizon 2020 research and innovation programme under the Marie Sklodowska-Curie (GA 813533) and "ELIXIR-EXCELERATE fast-track ELIXIR implementation and drive early user exploitation across the life sciences" (GA 676559).