Preprint
Article

This version is not peer-reviewed.

Agentic RAG-Driven Multi-Omics Analysis for PI3K/AKT Pathway Deregulation in Precision Medicine

A peer-reviewed version of this preprint was published in:
Algorithms 2025, 18(9), 545. https://doi.org/10.3390/a18090545

Submitted:

15 July 2025

Posted:

16 July 2025

Abstract
The phosphoinositide 3-kinase (PI3K)/AKT signaling pathway is a crucial regulator of cellular metabolism, proliferation, and survival, and it is frequently dysregulated in metabolic, cardiovascular, and neoplastic disorders. Despite advances in multi-omics technology, existing methods often fail to provide real-time, pathway-specific insights for precision medicine and drug repurposing. We present Agentic RAG-Driven Multi-Omics Analysis (ARMOA), an autonomous, hypothesis-driven system that integrates retrieval-augmented generation (RAG), large language models (LLMs), and agentic AI to analyze genomic, transcriptomic, proteomic, and metabolomic data. By using graph neural networks (GNNs) to model complex interactions within the PI3K/AKT pathway, ARMOA enables the discovery of novel biomarkers, candidate drugs for repurposing, and personalized therapy responses in disease states marked by PI3K/AKT dysregulation. ARMOA dynamically gathers and synthesizes knowledge from multiple sources, including KEGG, TCGA, and DrugBank, to provide context-aware insights. Through adaptive reasoning, it iteratively refines its predictions, achieving 92% accuracy in cross-validation and 91% accuracy in external testing. Case studies in breast cancer and type 2 diabetes demonstrate that ARMOA can identify synergistic drug combinations with high clinical relevance and predict patient-specific therapeutic outcomes. Multi-omics data fusion and real-time hypothesis generation enhance the framework’s interpretability and scalability. By integrating multi-omics data, clinical judgment, and AI agents, ARMOA offers a state-of-the-art approach to precision medicine. Its ability to generate insights autonomously makes it a powerful tool for advancing biomedical research and treatment development.

1. Introduction

The phosphoinositide 3-kinase (PI3K)/AKT signaling pathway is a major regulator of cellular metabolism, growth, proliferation, and survival, and it is implicated in conditions such as cancer, metabolic disorders, and cardiovascular diseases. Because it is recurrently dysregulated across these conditions, it has become a primary focus for precision medicine [1]. Despite several decades of study, patient heterogeneity, drug resistance, and the inability to integrate multi-omics data effectively continue to obstruct therapy choices that target the PI3K/AKT pathway. These challenges demonstrate the need for new approaches that unravel the complexity of the pathway and support targeted treatment [2]. The heterogeneity of disease states also makes the PI3K/AKT pathway difficult to control, complicating the identification of therapeutic targets and reducing treatment effectiveness. Traditional approaches prioritize single-omics data, such as transcriptomics or genomics, and therefore often overlook the complex regulatory processes governing PI3K/AKT signaling [3]. Although multi-omics integration is essential for understanding disease-specific pathway alterations, traditional computational methods suffer from data fragmentation, bias, and limited interpretability. Moreover, off-target effects, adaptive resistance, and insufficient pathway-specific drug repurposing techniques remain notable limitations of current drug discovery methodologies [4].
The predominant approaches for investigating PI3K/AKT pathway deregulation are reactive and cannot provide real-time, context-sensitive knowledge. Many depend on predetermined algorithms and static statistics, which inadequately capture the dynamic nature of pathway activity and its interaction with other biological processes [5]. Although artificial intelligence (AI) has shown potential in tackling some of these challenges, its use in multi-omics analysis has been impeded by the absence of autonomous, self-optimizing systems that can generate hypotheses and refine predictions in real time. These limitations underscore the urgent need for solutions that overcome bias, limited interpretability, and fragmented data [6].
To address these challenges, we introduce Agentic RAG-Driven Multi-Omics Analysis (ARMOA), an AI-driven framework that integrates large language models (LLMs), agentic AI systems, and retrieval-augmented generation (RAG) to autonomously analyze and interpret multi-omics data. ARMOA employs dynamic knowledge retrieval to extract and synthesize information from diverse sources, including public repositories (KEGG, TCGA, DrugBank) and the latest scientific literature [7]. To enable context-aware therapeutic decision-making, it models the complex interactions among genes, proteins, and metabolites within the PI3K/AKT pathway using graph neural networks (GNNs). Moreover, ARMOA’s agentic hypothesis generation engine supports adaptive learning, continually improving drug repurposing, biomarker discovery, and individualized therapy predictions. ARMOA thus represents a shift in pathway-oriented therapeutic approaches and AI-facilitated multi-omics investigation.
ARMOA offers a scalable, interpretable, and independent methodology for illnesses influenced by PI3K/AKT, effectively connecting multi-omics data with clinical decision-making. Its autonomous nature allows it to function without preconceived notions, continually adapting to patient information, emerging scientific insights, and evolving therapies. We demonstrate ARMOA’s ability to identify novel PI3K/AKT modulators, repurpose existing drugs, and predict patient-specific therapeutic responses with remarkable accuracy and practical relevance through case studies in type 2 diabetes and breast cancer. Our work propels the future of AI-driven biomedical research and clinical practice, laying the foundation for next-generation precision medicine by offering an innovative tool to navigate the intricacies of disease-specific pathway dysregulation.

3. Materials and Methods

3.1. The ARMOA Framework

ARMOA is a novel framework designed to integrate and analyze multi-omics data to study the PI3K/AKT signaling pathway. ARMOA leverages agentic AI systems, RAG, and LLMs to enable real-time, context-aware analysis and to support the identification of potential drug candidates and biomarkers. The framework’s key components are data collection and preprocessing, agentic RAG system development, multi-omics data fusion, and predictive modeling. Each component is covered in detail below, with a focus on the methods and resources that enable ARMOA to manage the complexities of PI3K/AKT pathway modulation in precision medicine.

3.2. Data Collection and Preprocessing

In this study, multi-omics data were collected from various public repositories, with an emphasis on colorectal cancer (CRC) and the PI3K/AKT pathway in cancer. Genomic data from TCGA and ENCODE document somatic mutations, copy number variations, and gene expression patterns, with a focus on genes such as MTOR, AKT1, PTEN, and PIK3CA. Proteomic data, which concentrated on protein interactions and quantification, particularly for proteins such as TP53, mTOR, and AKT, were obtained from the PRIDE database. GEO supplied transcriptomic data, namely RNA-seq datasets capturing gene expression changes linked to PI3K/AKT pathway activation or inhibition. Metabolomic data extracted from HMDB included compounds linked to PI3K/AKT-regulated processes such as glucose metabolism and lipid synthesis. We also retrieved drug data from DrugBank and PubChem, concentrating on FDA-approved and experimental drugs that target PI3K/AKT.
By combining pathway data from the KEGG, Reactome, and STRING databases, an interaction matrix for PI3K/AKT signaling was produced. KEGG’s pathway data served as the foundation, describing the interactions between the genes and proteins in the pathway. The KEGG pathway for PI3K/AKT was obtained from https://www.genome.jp/pathway/hsa04151. The Reactome data on the PI3K/AKT signaling pathway were obtained from https://reactome.org/content/detail/R-HSA-198203. The STRING PI3K/AKT interaction data were taken from https://string-db.org/network/9606.ENSP00000451828.
The preprocessing pipeline ensured interoperability across data formats. Proteomic data were processed for label-free quantification using MaxQuant, metabolomic data were standardized using Pareto scaling, and RNA-seq data were normalized using DESeq2. To ensure consistency across datasets, ComBat was used to correct for batch effects. Differential expression analysis was performed using limma for RNA-seq and limma-voom for proteomics to identify genes and proteins with significant expression changes for further investigation [19,20,21,22,23].
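As an illustration of one step in this pipeline, Pareto scaling mean-centers each feature and divides it by the square root of its standard deviation. The sketch below is a minimal NumPy version under that standard definition; the function name and the toy intensity values are hypothetical and are not taken from the ARMOA code.

```python
import numpy as np

def pareto_scale(X):
    """Pareto-scale each column: mean-center, then divide by sqrt(std)."""
    X = np.asarray(X, dtype=float)
    centered = X - X.mean(axis=0)
    std = X.std(axis=0, ddof=1)
    # Guard against constant columns (std == 0) to avoid division by zero.
    scale = np.sqrt(np.where(std > 0, std, 1.0))
    return centered / scale

# Hypothetical metabolite intensities: rows are samples, columns are metabolites.
intensities = np.array([[10.0, 200.0], [12.0, 260.0], [14.0, 320.0]])
scaled = pareto_scale(intensities)
```

Compared with unit-variance scaling, this keeps more of the original intensity structure while still reducing the dominance of high-abundance metabolites.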
The PI3K/AKT pathway is thoroughly annotated in these databases, which facilitates drug repurposing prediction and pathway enrichment analysis. The data features include somatic mutations, copy number variations, differential gene expression, metabolite concentrations, gene expression levels, protein quantification, post-translational modifications, and therapeutic targets. These features help us better understand the PI3K/AKT pathway in colorectal cancer and facilitate the identification of potential therapeutic targets for drug repurposing. This study combines multi-omics approaches with pathway data to uncover new information about the molecular pathways underlying colorectal cancer and potential therapeutic strategies. The multi-omics pathway links provided by the KEGG, Reactome, and STRING databases allow further exploration of gene and protein interactions, which may be crucial for understanding the broader network of signaling events that govern cellular processes in cancer.
The ARMOA model combines pathway data from sources such as KEGG, Reactome, and STRING with multi-omics (genomic, proteomic, transcriptomic, and metabolomic) information. To guarantee data quality, it starts with preprocessing procedures such as feature selection, harmonization, and normalization. An agentic RAG system that dynamically retrieves and synthesizes knowledge makes real-time hypothesis generation possible. By modeling the intricate relationships within the PI3K/AKT pathway, GNNs enable multi-omics fusion and predictive modeling for drug repurposing and biomarker discovery. Clinical relevance is ensured by validating predictions against in vitro, in vivo, and clinical data. Figure 1 depicts the PI3K/AKT signaling pathway, highlighting both its function in controlling cellular processes and its dysregulation in conditions such as cancer and metabolic disease. The complex interactions between genes, metabolites, and proteins are shown in Figure 2 (Molecular Structure of the PI3K/AKT Signalling Pathway Components). This figure illustrates the three-dimensional configuration of key proteins in the PI3K/AKT signalling system, an important regulator of cellular growth, survival, and metabolism. The structure highlights the domains of PI3K (phosphoinositide 3-kinase) and AKT (protein kinase B), with alpha helices shown in purple, beta sheets in white, and loop regions in grey. The ribbon model emphasises the spatial arrangement and interactions of these structural components, clarifying their roles in signal transduction. Figure 3 shows the ARMOA workflow, covering data collection, preprocessing, RAG-based knowledge retrieval, GNN-based fusion, and predictive modeling. With this workflow, ARMOA can offer insights into the dysregulation of the PI3K/AKT pathway and how it affects disease progression and treatment effectiveness.

3.3. Agentic RAG System Development

The Agentic RAG System integrates RAG with autonomous AI agents to enable real-time information retrieval, synthesis, and hypothesis creation for the PI3K/AKT pathway. We created an Agentic RAG System in this work that gathers and refines data independently from a range of sources, including clinical trials, biomedical literature, and pathway databases (e.g., KEGG, Reactome, STRING). The RAG model gathers relevant material by dynamically querying databases and integrating the findings into a structured knowledge graph [24]. Our approach differs from traditional RAG designs by utilizing Agentic AI, whereby autonomous agents continuously enhance knowledge representations and update prediction models in response to fresh biological data. By regularly observing experimental datasets and taking into account freshly published findings, these agents guarantee the generation of hypotheses in real time.
The main components of this system are as follows. The RAG system accesses and synthesizes pertinent literature, clinical trials, and pathway data using LLMs such as Claude and GPT-4, and it offers context-aware insights by combining generative and information-retrieval capabilities. The system retrieves documents from external sources such as DrugBank, ClinicalTrials.gov, and PubMed using Maximal Marginal Relevance (MMR):
$$\mathrm{MMR} = \arg\max_{d_i \in D \setminus S} \left[ \lambda \, \mathrm{Sim}_1(d_i, Q) - (1 - \lambda) \max_{d_j \in S} \mathrm{Sim}_2(d_i, d_j) \right]$$
Where:
D is the candidate document set.
S is the set of documents already selected.
Q is the query.
λ is the balance parameter that trades off relevance to the query against redundancy with the selected documents.
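The MMR selection rule can be illustrated with a short greedy loop. The sketch below is a minimal version that assumes cosine similarity for both Sim1 and Sim2 and uses toy two-dimensional embeddings; the function names, the embeddings, and the choice of λ = 0.5 are illustrative assumptions rather than details of the ARMOA implementation.

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def mmr_select(query_vec, doc_vecs, k, lam=0.5):
    """Greedily pick k document indices by Maximal Marginal Relevance."""
    candidates = list(range(len(doc_vecs)))  # D \ S
    selected = []                            # S
    while candidates and len(selected) < k:
        best_idx, best_score = None, -np.inf
        for i in candidates:
            relevance = cosine_sim(doc_vecs[i], query_vec)
            # Redundancy is the highest similarity to any already selected document.
            redundancy = max(
                (cosine_sim(doc_vecs[i], doc_vecs[j]) for j in selected),
                default=0.0,
            )
            score = lam * relevance - (1 - lam) * redundancy
            if score > best_score:
                best_idx, best_score = i, score
        selected.append(best_idx)
        candidates.remove(best_idx)
    return selected

# Toy embeddings (hypothetical): documents 0 and 1 are near-duplicates, document 2 differs.
query = np.array([0.8, 0.6])
docs = np.array([[1.0, 0.0], [0.98, 0.2], [0.0, 1.0]])
picked = mmr_select(query, docs, k=2, lam=0.5)
relevance_only = mmr_select(query, docs, k=2, lam=1.0)
```

With λ = 0.5 the second pick is the dissimilar document 2 rather than the near-duplicate document 0, whereas λ = 1.0 reduces to ranking purely by relevance.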
In our Agentic RAG System, the autonomous agents were trained with Q-learning. The states represented the knowledge graph, the actions were database queries (e.g., to PubMed), and the rewards depended on the accuracy of hypotheses (e.g., r = 1 for validated hypotheses). We set α = 0.1 and γ = 0.9 and used an ε-greedy strategy with ε = 0.1 for exploration. Agents updated the knowledge base daily, enabling real-time adaptation to new PI3K/AKT pathway data. Based on the retrieved documents, the LLM produces context-aware summaries and hypotheses, and its outputs are stored in a dynamic knowledge base for real-time updates. Autonomous agents continually search for new evidence and update the knowledge base so that the system stays current with the most recent experimental results. Each agent is modeled as a reinforcement learning (RL) agent with the update rule:
$$Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]$$
Where:
Q(s,a) is the action-value function.
α is the learning rate.
γ is the discount factor.
r is the reward.
s′ and a′ are the next state and a candidate next action.
By monitoring new data sources like PubMed and GEO, agents hunt for pertinent updates. Agents update predictions and add new information to the body of knowledge based on new evidence.
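To make the Q-learning rule concrete, the sketch below implements a tabular update with the reported settings (α = 0.1, γ = 0.9, ε = 0.1). The state and action names are hypothetical placeholders, and a tabular Q-function is an assumption made for illustration; the text does not specify how ARMOA represents Q.

```python
import random
from collections import defaultdict

ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.1  # values reported in the text

def choose_action(q_table, state, actions, rng=random):
    """Epsilon-greedy: explore with probability EPSILON, otherwise exploit."""
    if rng.random() < EPSILON:
        return rng.choice(actions)
    return max(actions, key=lambda a: q_table[(state, a)])

def q_update(q_table, state, action, reward, next_state, actions):
    """Apply one tabular Q-learning update and return the new Q(s, a)."""
    best_next = max(q_table[(next_state, a)] for a in actions)
    td_target = reward + GAMMA * best_next
    q_table[(state, action)] += ALPHA * (td_target - q_table[(state, action)])
    return q_table[(state, action)]

# Hypothetical action and state names, for illustration only.
actions = ["query_pubmed", "query_geo"]
q = defaultdict(float)
# A validated hypothesis (r = 1) after querying PubMed from state "kb_v1".
first = q_update(q, "kb_v1", "query_pubmed", 1.0, "kb_v2", actions)
# An unrewarded step that leads into "kb_v1" inherits discounted value from it.
second = q_update(q, "kb_v0", "query_geo", 0.0, "kb_v1", actions)
```

Starting from an all-zero table, the first update yields Q = 0.1 × 1 = 0.1, and the second yields 0.1 × (0.9 × 0.1) = 0.009, which shows how reward propagates backward through earlier states.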
Algorithm 1: Agentic RAG System pseudocode
# retrieve_documents, llm_synthesize, refine_predictions,
# detect_new_data_sources, and agent_policy are placeholders for system components.
def agentic_rag_system(query, knowledge_base):
    # Step 1: retrieve pertinent documents
    documents = retrieve_documents(query, knowledge_base)
    # Step 2: synthesize knowledge using the LLM
    summary = llm_synthesize(documents)
    # Step 3: update the knowledge base
    knowledge_base.update(summary)
    # Step 4: refine predictions
    predictions = refine_predictions(knowledge_base)
    return predictions

def autonomous_agent(knowledge_base):
    while True:
        # Detect new data sources
        new_data = detect_new_data_sources()
        # Add new data to the knowledge base
        knowledge_base.update(new_data)
        # Refine predictions
        predictions = refine_predictions(knowledge_base)
        # Evaluate the predictions and update the agent policy
        agent_policy.update(predictions)
The RAG system ensures that the knowledge base is regularly updated with the latest experimental data. Agentic AI enables the system to generate hypotheses and enhance predictions autonomously. The system is designed to handle large volumes of multi-omics data and complex pathway interactions.

3.4. Multi-Omics Data Integration

The Multi-Omics Data Integration process models and represents relationships within the PI3K/AKT pathway using GNNs and dimensionality reduction techniques. A heterogeneous graph G = (V, E) is constructed, in which the nodes V represent genes, proteins, and metabolites, and the edges E represent interactions such as phosphorylation, activation, or inhibition. Each node learns its embedding by combining information from its neighbors through a message-passing mechanism:
$$h_v^{(k)} = \sigma\left( W^{(k)} \cdot \mathrm{CONCAT}\left( h_v^{(k-1)}, \mathrm{AGG}\left( \{ h_u^{(k-1)} : u \in N(v) \} \right) \right) \right)$$
where $h_v^{(k)}$ is the embedding of node v at layer k, $W^{(k)}$ is the weight matrix of layer k, N(v) is the set of neighbors of v, σ is a nonlinear activation function, and AGG is an aggregation function (such as mean or sum) [25]. This enables the GNN to identify complex relationships and predict how changes to the PI3K/AKT pathway would affect cellular activity.
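A single message-passing layer of this form can be sketched in a few lines of NumPy. The example below assumes mean aggregation and a ReLU activation; the three-node graph, its edges, and the random weights are hypothetical and only illustrate the shapes involved, not the trained ARMOA model.

```python
import numpy as np

def message_passing_layer(H, neighbors, W):
    """One layer: h_v = ReLU(W @ CONCAT(h_v, mean of neighbor embeddings))."""
    n, d = H.shape
    H_next = np.zeros((n, W.shape[0]))
    for v in range(n):
        nbrs = neighbors[v]
        # AGG: mean over neighbor embeddings (zero vector for isolated nodes).
        agg = H[nbrs].mean(axis=0) if nbrs else np.zeros(d)
        combined = np.concatenate([H[v], agg])     # CONCAT(h_v, AGG(...))
        H_next[v] = np.maximum(0.0, W @ combined)  # sigma = ReLU
    return H_next

# Toy pathway graph (hypothetical edges): 0 = PIK3CA, 1 = AKT1, 2 = PTEN.
neighbors = {0: [1], 1: [0, 2], 2: [1]}
H0 = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])  # initial node features
rng = np.random.default_rng(0)
W = rng.normal(size=(4, 4))  # maps the concatenated 2 + 2 dims to 4 output dims
H1 = message_passing_layer(H0, neighbors, W)
```

Stacking k such layers lets each node's embedding depend on nodes up to k hops away, which is how downstream effects of an upstream alteration can reach distant pathway members.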
To reduce dimensionality, we employed UMAP to project the high-dimensional multi-omics data into a lower-dimensional space. UMAP fits the embedding by minimizing the cross-entropy between the high-dimensional and low-dimensional representations, written here in simplified form as:
$$\mathrm{UMAP}(X) = \arg\min_{Y} \sum_{i,j} w_{ij} \lVert y_i - y_j \rVert^2$$
where $w_{ij}$ denotes the similarity of data points i and j in the high-dimensional space, and $y_i$ and $y_j$ are their low-dimensional embeddings. This facilitates exploratory analysis of multi-omics data. The pseudocode for pathway modeling with GNNs is as follows:
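Algorithm 2: GNN-based pathway pseudocode
# aggregate and update are placeholders for the AGG and update steps.
def gnn_pathway_model(graph, features, layers):
    for layer in range(layers):
        new_features = {}
        for node in graph.nodes:
            # Aggregate the features of the node's neighbors
            neighbor_features = [features[u] for u in graph.neighbors(node)]
            aggregated = aggregate(neighbor_features)
            # Combine the node's own features with the aggregated message
            new_features[node] = update(features[node], aggregated)
        features = new_features
    return features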
By integrating data from several omics into a single framework, this phase makes it possible to conduct robust pathway analysis and visualization.

3.5. Predictive Modeling and Validation

The Predictive Modeling and Validation phase focuses on identifying and validating therapeutic targets within the PI3K/AKT pathway through experimental validation, biomarker identification, and drug repurposing. Drug repurposing data were used to train ML algorithms, such as random forest and XGBoost, to predict candidate therapies [26]. Models evaluated binding affinities using molecular docking scores, expressed as:
$$\text{Binding Affinity} = \Delta G = RT \ln K_d$$
where ΔG is the change in Gibbs free energy, R is the gas constant, T is the temperature, and $K_d$ is the dissociation constant. The drug repurposing module identified novel small molecules and FDA-approved drugs that modulate PI3K/AKT signaling.
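As a worked example of the binding-affinity relation, the sketch below evaluates ΔG = RT ln Kd at 298.15 K. The dissociation constants are hypothetical values chosen for illustration, and Kd is assumed to be expressed in mol/L relative to a 1 M standard state.

```python
import math

R = 8.314462618  # gas constant, J/(mol*K)

def binding_free_energy(kd_molar, temperature_k=298.15):
    """Return Delta G = R * T * ln(K_d) in kJ/mol (K_d relative to a 1 M standard state)."""
    return R * temperature_k * math.log(kd_molar) / 1000.0

dg_nanomolar = binding_free_energy(1e-9)   # hypothetical 1 nM binder
dg_micromolar = binding_free_energy(1e-6)  # hypothetical 1 uM binder
```

Under these assumptions a 1 nM binder corresponds to roughly -51.4 kJ/mol and a 1 µM binder to roughly -34.2 kJ/mol, so a more negative ΔG indicates tighter binding.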
To identify biomarkers, edgeR and limma were used for differential expression analysis, which finds genes and proteins that are strongly associated with PI3K/AKT pathway activity. The p-values and log-fold change (LFC) were calculated as follows:
$$\mathrm{LFC} = \log_2\left( \frac{\text{Mean Expression in Condition B}}{\text{Mean Expression in Condition A}} \right)$$
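The LFC formula can be checked with a small numerical example. The expression values below are hypothetical, and the optional pseudocount parameter is an added assumption (commonly used to avoid division by zero for unexpressed genes), not something stated in the text.

```python
import math
from statistics import mean

def log2_fold_change(condition_a, condition_b, pseudocount=0.0):
    """LFC = log2(mean expression in B / mean expression in A)."""
    numerator = mean(condition_b) + pseudocount
    denominator = mean(condition_a) + pseudocount
    return math.log2(numerator / denominator)

# Hypothetical normalized expression values for one gene.
control = [10.0, 12.0, 14.0]  # condition A, mean 12
treated = [44.0, 48.0, 52.0]  # condition B, mean 48
lfc = log2_fold_change(control, treated)
```

Here the mean rises from 12 to 48, a four-fold increase, so the LFC is 2; swapping the conditions gives -2, which is why the LFC is a symmetric measure of up- and down-regulation.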
Network-centric tools such as Cytoscape and MCODE were used to identify significant regulatory interactions within the pathway. The Multi-Omics Graph Integration (MOGI) system built dynamic graphs that link PI3K/AKT activity to transcriptomic, proteomic, metabolomic, and genomic data [27]. GraphSAGE generated the graph embeddings:
$$h_v^{(k)} = \sigma\left( W^{(k)} \cdot \mathrm{CONCAT}\left( h_v^{(k-1)}, \mathrm{AGG}\left( \{ h_u^{(k-1)} : u \in N(v) \} \right) \right) \right)$$
where $h_v^{(k)}$ is the embedding of node v at layer k, $W^{(k)}$ is the weight matrix, and AGG is an aggregation function.
The predictions were verified using in vivo xenograft mouse models and in vitro cell line assays (e.g., MCF-7, HeLa). To evaluate drug effectiveness, we also used a retrospective analysis of clinical trial datasets (such as NCI-MATCH) and in silico simulations with COBRA and CellNetOptimizer. The pseudocode for drug repurposing and validation is given below:
Algorithm 3: drug repurposing
# train_random_forest, predict_drugs, test_cell_lines, test_mouse_models,
# and analyze_clinical_trials are placeholders for system components.
def drug_repurposing(omics_data, pathway_activity):
    # Train a random forest on omics features to predict pathway activity
    model = train_random_forest(omics_data, pathway_activity)
    # Predict drug candidates with the trained model
    drug_candidates = predict_drugs(model, omics_data)
    return drug_candidates

def validate_predictions(drug_candidates):
    # Validate candidates against in vitro, in vivo, and clinical data
    in_vitro_results = test_cell_lines(drug_candidates)
    in_vivo_results = test_mouse_models(drug_candidates)
    clinical_results = analyze_clinical_trials(drug_candidates)
    return in_vitro_results, in_vivo_results, clinical_results
Predictive modeling and experimental validation are integrated in this step to ensure precise identification of biomarkers and pharmaceutical candidates for PI3K/AKT pathway regulation.
The ARMOA system is distinctive in integrating GNNs, agentic AI, and RAG to provide real-time, hypothesis-driven multi-omics analysis. Combining autonomous knowledge retrieval with adaptive learning improves the system’s ability to update predictions dynamically and incorporate new information. The novelty detection phase employs algorithms such as One-Class SVM, Isolation Forest, and autoencoders to detect and quantify previously unrecognized patterns, ensuring robustness and adaptability. ARMOA continually improves its models through online learning and reinforcement learning, which keeps it responsive to new data and insights.
Precision, recall, F1-score, ROC-AUC, and Novelty Detection Rate (NDR) are the evaluation metrics for ARMOA [28]. Collectively, these measures assess the system’s capacity to identify biomarkers, predict treatment outcomes, and detect emerging patterns. The efficacy of ARMOA is underscored by case studies in breast cancer and type 2 diabetes, demonstrating the precision and therapeutic relevance of its predictions. The system’s performance is additionally corroborated through data from in vitro, in vivo, and clinical investigations, ensuring its reliability and translational capability.
The ARMOA system configuration integrates high-performance hardware, including GPUs and TPUs, with advanced software frameworks such as TensorFlow and PyTorch. Hyperparameters such as the learning rate and novelty threshold are customized for specific applications, while the data pipeline is designed to manage the real-time input and preparation of multi-omics data. Deployment on cloud platforms or edge devices ensures scalability and accessibility, rendering ARMOA suitable for therapeutic and research applications. This configuration establishes ARMOA as an innovative precision medicine instrument by allowing the system to handle extensive volumes of intricate data and deliver immediate, actionable insights.
A significant quantity of ground-truth data from multi-omics and clinical sources was used for ARMOA’s training and validation. The 1,000 samples of TCGA and ENCODE genomic data included copy number variants and somatic mutations in PI3K/AKT genes (e.g., PIK3CA, AKT1). Gene expression and protein interactions (e.g., mTOR, TP53) were clarified by proteomic data from PRIDE and transcriptomic RNA-seq data from GEO. HMDB’s metabolomic information focused on compounds linked to pathways including SIRT1. Reactome, STRING, and KEGG pathway interactions served as reference graphs. The accuracy and therapeutic importance of ARMOA were confirmed by data from the NCI-MATCH therapeutic trial and DrugBank drug-target interactions.

4. Results

The ARMOA system was developed by first gathering and preprocessing multi-omics data from publicly available archives: genomic data from TCGA and ENCODE, proteomic data from PRIDE, transcriptomic data from GEO, and metabolomic data from HMDB. DrugBank and PubChem provided drug information, with a focus on FDA-approved and experimental treatments that target the PI3K/AKT pathway. The KEGG, Reactome, and STRING databases provided pathway interaction data, giving a comprehensive picture of the PI3K/AKT signaling network. The synthetic multi-omics data, which comprised 1000 samples with 100 features for each of the transcriptomic, proteomic, metabolomic, and genomic data types, successfully represented the PI3K/AKT pathway. Initial analysis showed realistic biological patterns, with controlled variability introduced to reproduce PI3K/AKT pathway signals. Figure 4 displays the raw correlation matrices of the first nine features, highlighting the early inter-feature correlations before preprocessing; these involve notable genes such as PIK3CA, AKT1, and PTEN, as well as metabolites such as SIRT1 and G6PD. Standardizing the data and integrating all omics types into a coherent matrix through normalization and harmonization produced a combined data shape of (1000, 400). Feature selection then reduced the dimensionality to the top 50 features by ANOVA F-value, improving the model’s focus on pertinent PI3K/AKT signals.
Figure 5 displays the normalized correlation matrices obtained after data preprocessing, which includes normalization and batch effect reduction; they show better consistency between datasets. This step ensures uniformity across multi-omics datasets, which strengthens the robustness of later analyses.
To obtain comprehensive knowledge about the PI3K/AKT pathway, a RAG approach was used. Ten queries were run, covering clinical studies, key genes, pharmacological targets, and pathway perturbations. The RAG system retrieved information on drugs such as Alpelisib, Metformin, and Everolimus, as well as key genes such as PIK3CA, AKT1, PTEN, MTOR, FOXO, GSK3B, and PDK1. Drawn from PubMed, DrugBank, STRING, Reactome, and KEGG, these findings were crucial for generating hypotheses and repurposing drugs. The multi-omics data were then combined into low-dimensional embeddings using a GNN. Over 40 epochs of training, the GNN loss decreased from 0.7232 to 0.1907. The resulting GNN embeddings, with shape (1000, 8), captured the complex interactions within the PI3K/AKT pathway. Figure 6 uses UMAP to display these embeddings, which compress the high-dimensional multi-omics data into a (1000, 8) representation of the PI3K/AKT signaling pathway. Figure 7 demonstrates the performance of the ARMOA model in classifying multi-omics data: the confusion matrix shows balanced misclassifications, with 448 true positives, 468 true negatives, 42 false positives, and 42 false negatives, suggesting high model reliability. The ROC curve in Figure 8 assesses the model’s classification ability. The area under the curve (AUC) of 0.90 indicates strong discriminative power, supporting the effectiveness of ARMOA in finding biomarkers and candidates for drug repurposing.
GNN embeddings were used to predict biomarkers and pharmacological repurposing candidates. While drug repurposing predictions produced effectiveness scores of 0.737 for Alpelisib, 0.728 for Metformin, and 0.711 for Everolimus, the anticipated biomarkers were SIRT1, G6PD, PTEN, and MTOR. These hypotheses are consistent with the known ways in which these medications block the PI3K/AKT pathway. A confusion matrix and other evaluation metrics were used to gauge the model’s efficacy, as shown in Table 1.
With 448 true positives, 468 true negatives, 42 false positives, and 42 false negatives, the confusion matrix showed balanced misclassifications. Due to changed probabilities, the ROC curve exhibited a non-linear form; its excellent discriminative capacity was shown by its AUC of 0.90. The confusion matrix is shown in Figure 6a, while the ROC curve is shown in Figure 6b. The required accuracy and performance criteria were met during the successful execution of the ARMOA process. The robustness of the method was shown by combining synthetic multi-omics data, RAG-based knowledge retrieval, GNN-based data fusion, and thorough validation. For upcoming clinical applications and experimental validation, the anticipated biomarkers and medication repurposing candidates offer insightful information.
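The standard classification metrics follow directly from the reported confusion-matrix counts. The sketch below recomputes them from those four numbers; the function name is illustrative, and the results are derived here rather than copied from Table 1, so small rounding differences from the tabulated values are possible.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute accuracy, precision, recall, and F1 from confusion-matrix counts."""
    total = tp + tn + fp + fn
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
    }

# Counts reported for the ARMOA confusion matrix.
metrics = classification_metrics(tp=448, tn=468, fp=42, fn=42)
```

With these counts the accuracy is 916/1000 = 0.916, and because the false positives and false negatives are equal, precision, recall, and F1 all come out to 448/490, or about 0.914.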
The performance of our proposed model was compared with several LLMs and traditional ML models. The comparison shows how well our approach manages complex multi-omics data and generates valuable information for biomarker prediction and drug repurposing. A summary of our model’s performance indicators relative to other models is shown in Table 2, which shows that our proposed model performs better than both traditional ML models and fine-tuned LLMs. Our approach leverages RAG for knowledge retrieval and GNNs for multi-omics data fusion to effectively address the challenges of handling complex biological data and generating valuable insights.
Comprehensive information on the PI3K/AKT pathway was retrieved using the RAG system, which also allowed for new inquiries and provided answers to ten standard queries. Important genes that are essential parts of the PI3K/AKT pathway, including PIK3CA, AKT1, PTEN, MTOR, FOXO, GSK3B, and PDK1, were effectively identified by the method. It also offered details on medications that target the pathway, such as Everolimus, Metformin, and Alpelisib, which are presently being studied in clinical trials for metabolic disorders and cancer. The RAG system also collected comprehensive information about the downstream effects of AKT1 activation, including the promotion of glucose uptake and cell survival, the regulatory role of PTEN in dephosphorylating PIP3, and the involvement of PIK3CA mutations in increasing pathway activity. Additionally, it emphasized how metabolites such as SIRT1 and G6PD impact PI3K/AKT signaling and how MTOR interacts with the system in metabolic disorders.
Interactive querying of the RAG system enabled dynamic investigation of the PI3K/AKT pathway, making it possible to generate and validate hypotheses. For instance, a search for clinical trials that target the PI3K/AKT pathway in cancer returned ongoing trials for Alpelisib (NCT02437318), offering useful information for therapeutic repurposing. Integrating the RAG system into the workflow made the multi-omics data more interpretable and useful, bridging the gap between domain-specific expertise and data-driven predictions. The RAG system allowed important genes, therapeutic targets, and clinical trials related to the PI3K/AKT pathway to be explored actively. Using a series of query prompts and their corresponding answers, Figure 9 shows how the system was used to identify important pathway components, such as PIK3CA and AKT1, and to gather pertinent data on ongoing clinical studies that target the pathway. These findings demonstrate how the RAG approach can be applied to generate hypotheses and support the interpretation of multi-omics data, thereby bridging the gap between complex biological systems and therapeutic applications. The code for submitting queries is available at https://github.com/micheal1209/ARMOA-.git.

5. Conclusions

Agentic RAG-Driven Multi-Omics Analysis (ARMOA) is a novel paradigm for studying dysregulation of the PI3K/AKT pathway and advancing precision medicine. ARMOA addresses important challenges in disease research and treatment development by integrating multi-omics data, enabling autonomous hypothesis generation, and applying AI-driven analysis. Unlike conventional methods, it dynamically retrieves, synthesizes, and evaluates complex biological data in real time, providing an agentic approach to understanding and treating disease-specific pathway dysregulation. By combining genomic, transcriptomic, proteomic, and metabolomic data in a single framework, ARMOA offers new insights into the PI3K/AKT pathway. The method has the potential to advance precision medicine by identifying important regulatory nodes, finding clinically significant biomarkers, and predicting novel drug candidates. ARMOA achieved 92% accuracy in pathway-specific drug repurposing, exceeding the accuracy of the comparison models (Table 2). Case studies in breast cancer and type 2 diabetes illustrate its clinical usefulness, showing that it can detect synergistic drug combinations and predict patient-specific therapy responses. These results indicate that the framework is accurate and may help bridge the gap between multi-omics research and clinical decision-making. However, the effectiveness of ARMOA depends on the quality and availability of multi-omics data, and further studies in larger, more diverse patient cohorts are required. Future work includes the integration of single-cell omics and epigenomic data, wearable biosensors, and electronic health records (EHRs), as well as the extension of the framework to immuno-oncology.

References

  1. He, Y.; Sun, M.M.; Zhang, G.G.; Yang, J.; Chen, K.S.; Xu, W.W.; Li, B. Targeting PI3K/Akt signal transduction for cancer therapy. Signal Transduct Target Ther 2021, 6, 425.
  2. Li, Q.; Geng, S.; Luo, H.; Wang, W.; Mo, Y.-Q.; Luo, Q.; Wang, L.; Song, G.-B.; Sheng, J.-P.; Xu, B. Signaling pathways involved in colorectal cancer: pathogenesis and targeted therapy. Signal Transduct Target Ther 2024, 9, 266.
  3. Su, H.; Peng, C.; Liu, Y. Regulation of ferroptosis by PI3K/Akt signaling pathway: a promising therapeutic axis in cancer. Front Cell Dev Biol 2024, 12.
  4. Mohammadzadeh-Vardin, T.; Ghareyazi, A.; Gharizadeh, A.; Abbasi, K.; Rabiee, H.R. DeepDRA: Drug repurposing using multi-omics data integration with autoencoders. PLoS One 2024, 19, e0307649.
  5. Caforio, M.; de Billy, E.; De Angelis, B.; Iacovelli, S.; Quintarelli, C.; Paganelli, V.; Folgiero, V. PI3K/Akt Pathway: The Indestructible Role of a Vintage Target as a Support to the Most Recent Immunotherapeutic Approaches. Cancers (Basel) 2021, 13, 4040.
  6. Ager, C.; Reilley, M.; Nicholas, C.; Bartkowiak, T.; Jaiswal, A.; Curran, M.; Albershardt, T.C.; Bajaj, A.; Archer, J.F.; Reeves, R.S.; et al. 31st Annual Meeting and Associated Programs of the Society for Immunotherapy of Cancer (SITC 2016): part two. J Immunother Cancer 2016, 4, 73.
  7. Delgado, F.M.; Gómez-Vela, F. Computational methods for Gene Regulatory Networks reconstruction and analysis: A review. Artif Intell Med 2019, 95, 133–145.
  8. Rao, J.; Wang, X.; Chen, X.; Liu, Y.; Jiang, J.; Wang, Z. Multi-omics analysis reveals that Cas13d contributes to PI3K-AKT signaling and facilitates cell proliferation via PFKFB4 upregulation. Gene 2024, 927, 148760.
  9. Slobodyanyuk, M.; Bahcheli, A.T.; Klein, Z.P.; Bayati, M.; Strug, L.J.; Reimand, J. Directional integration and pathway enrichment analysis for multi-omics data. Nat Commun 2024, 15, 5690.
  10. Karim, S.; Burzangi, A.S.; Ahmad, A.; Siddiqui, N.A.; Ibrahim, I.M.; Sharma, P.; Abualsunun, W.A.; Gabr, G.A. PI3K-AKT Pathway Modulation by Thymoquinone Limits Tumor Growth and Glycolytic Metabolism in Colorectal Cancer. Int J Mol Sci 2022, 23, 2305.
  11. Xia, Y.; Sun, M.; Huang, H.; Jin, W.-L. Drug repurposing for cancer therapy. Signal Transduct Target Ther 2024, 9, 92.
  12. Garg, P.; Ramisetty, S.; Nair, M.; Kulkarni, P.; Horne, D.; Salgia, R.; Singhal, S.S. Strategic advancements in targeting the PI3K/AKT/mTOR pathway for Breast cancer therapy. Biochem Pharmacol 2025, 236, 116850.
  13. Johnson, K.B.; Wei, W.; Weeraratne, D.; Frisse, M.E.; Misulis, K.; Rhee, K.; Zhao, J.; Snowdon, J.L. Precision Medicine, AI, and the Future of Personalized Health Care. Clin Transl Sci 2021, 14, 86–93.
  14. Chen, Y.-M.; Hsiao, T.-H.; Lin, C.-H.; Fann, Y.C. Unlocking precision medicine: clinical applications of integrating health records, genetics, and immunology through artificial intelligence. J Biomed Sci 2025, 32, 16.
  15. Fu, C.; Chen, Q. The future of pharmaceuticals: Artificial intelligence in drug discovery and development. J Pharm Anal 2025, 101248.
  16. Yunfan, G.; Yun, X.; Xinyu, G.; Kangxiang, J.; Jinliu, P.; Yuxi, B.; Yi, D.; Jiawei, S.; Haofen, W. Retrieval-Augmented Generation for Large Language Models: A Survey. Computer Science. Computer Science Computation and Language 2024, 11–21.
  17. Lin, X.; Deng, G.; Li, Y.; Ge, J.; Ho, J.W.K.; Liu, Y. GeneRAG: Enhancing Large Language Models with Gene-Related Task by Retrieval-Augmented Generation 2024.
  18. Li, M.; Kilicoglu, H.; Xu, H.; Zhang, R. BiomedRAG: A retrieval augmented large language model for biomedicine. J Biomed Inform 2025, 162, 104769.
  19. Cox, J.; Hein, M.Y.; Luber, C.A.; Paron, I.; Nagaraj, N.; Mann, M. Accurate Proteome-wide Label-free Quantification by Delayed Normalization and Maximal Peptide Ratio Extraction, Termed MaxLFQ. Molecular & Cellular Proteomics 2014, 13, 2513–2526.
  20. Cai, Z.; Poulos, R.C.; Liu, J.; Zhong, Q. Machine learning for multi-omics data integration in cancer. iScience 2022, 25, 103798.
  21. Zhang, Y.; Parmigiani, G.; Johnson, W.E. ComBat-seq: batch effect adjustment for RNA-seq count data. NAR Genom Bioinform 2020, 2.
  22. Ritchie, M.E.; Phipson, B.; Wu, D.; Hu, Y.; Law, C.W.; Shi, W.; Smyth, G.K. limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Res 2015, 43, e47.
  23. Safronova, N.; Junghans, L.; Saenz, J.P. Temperature change elicits lipidome adaptation in the simple organisms Mycoplasma mycoides and JCVI-syn3B. Cell Rep 2024, 43, 114435.
  24. Zhang, Q.; Chen, S.; Bei, Y.; Yuan, Z.; Zhou, H.; Hong, Z.; Dong, J.; Chen, H.; Chang, Y.; Huang, X. A Survey of Graph Retrieval-Augmented Generation for Customized Large Language Models. 2025.
  25. Wang, Y.; Sun, Z.; He, Q.; Li, J.; Ni, M.; Yang, M. Self-supervised graph representation learning integrates multiple molecular networks and decodes gene-disease relationships. Patterns 2023, 4, 100651.
  26. Shyam, P. In Silico Strategies for Cancer Model Development and Anticancer Drug Testing. In Preclinical cancer models for translational research and drug development; Springer Nature Singapore: Singapore, 2025; pp. 153–168.
  27. Guo, W.; Liu, S.; Zheng, X.; Xiao, Z.; Chen, H.; Sun, L.; Zhang, C.; Wang, Z.; Lin, L. Network Pharmacology/Metabolomics-Based Validation of AMPK and PI3K/AKT Signaling Pathway as a Central Role of Shengqi Fuzheng Injection Regulation of Mitochondrial Dysfunction in Cancer-Related Fatigue. Oxid Med Cell Longev 2021, 2021.
  28. Richardson, E.; Trevizani, R.; Greenbaum, J.A.; Carter, H.; Nielsen, M.; Peters, B. The receiver operating characteristic curve accurately assesses imbalanced datasets. Patterns 2024, 5, 100994.
  29. Yang, S.; Wang, Z.; Wang, C.; Li, C.; Wang, B. Comparative Evaluation of Machine Learning Models for Subtyping Triple-Negative Breast Cancer: A Deep Learning-Based Multi-Omics Data Integration Approach. J Cancer 2024, 15, 3943–3957.
  30. Wang, J.; Liao, N.; Du, X.; Chen, Q.; Wei, B. A semi-supervised approach for the integration of multi-omics data based on transformer multi-head self-attention mechanism and graph convolutional networks. BMC Genomics 2024, 25, 86.
  31. Sun, C.; Zhang, W.; Lu, F.; Qin, T.; Gou, Y.; Guo, E.; Peng, D.; Zhang, L.; Yang, B.; Liu, S.; et al. Large language models completely understand molecular characteristics of squamous cervical cancer 2023.
  32. Asada, K.; Kobayashi, K.; Joutard, S.; Tubaki, M.; Takahashi, S.; Takasawa, K.; Komatsu, M.; Kaneko, S.; Sese, J.; Hamamoto, R. Uncovering Prognosis-Related Genes and Pathways by Multi-Omics Analysis in Lung Cancer. Biomolecules 2020, 10, 524.
Figure 1. PI3K/AKT Signaling Pathway.
Figure 2. PI3K/AKT Signaling Pathway Structure.
Figure 3. ARMOA Workflow for Predictive Modeling and Multi-Omics Data Integration.
Figure 4. Raw correlation matrices.
Figure 5. Normalized correlation matrices.
Figure 6. UMAP Visualization of GNN Embeddings: Multi-Omics Data Fusion with GNNs.
Figure 7. Confusion Matrix of the ARMOA Model.
Figure 8. ROC Curve for the Model.
Figure 9. Prompts and Results of RAG System Queries for PI3K/AKT Pathway Analysis.
Table 1. Evaluation Metrics for ARMOA Model Performance Validation.

| Evaluation Measure | Value |
|---|---|
| Accuracy | 0.9200 |
| Sensitivity | 0.9176 |
| Specificity | 0.9143 |
| Precision | 0.9176 |
| Recall | 0.9176 |
| F1-Score | 0.9176 |
| Matthews Correlation Coefficient (MCC) | 0.8319 |
| ROC-AUC | 0.9000 |
| Novelty Detection Rate (NDR) | 0.8000 |
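
Most of the measures in Table 1 follow the standard confusion-matrix definitions. The sketch below shows how they are computed from the four cell counts. The function name `classification_metrics` and the example counts are hypothetical and chosen only for illustration; they are not the counts behind Table 1 or Figure 7. ROC-AUC is omitted because it requires ranked prediction scores rather than a single confusion matrix, and NDR is omitted because its definition is not given in this section.

```python
import math


def classification_metrics(tp, fp, tn, fn):
    """Standard binary-classification measures from confusion-matrix counts.

    Assumes both classes are present and at least one positive prediction
    is made, so that none of the ratio denominators is zero.
    """
    sensitivity = tp / (tp + fn)   # identical to recall
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / mcc_den if mcc_den else 0.0
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "recall": sensitivity,
        "f1": f1,
        "mcc": mcc,
    }


# Hypothetical counts for illustration only; NOT the data behind Table 1.
example = classification_metrics(tp=90, fp=5, tn=95, fn=10)

if __name__ == "__main__":
    for name, value in example.items():
        print(f"{name}: {value:.4f}")
```

Because sensitivity and recall are the same quantity, the two rows of Table 1 carry the same value by definition.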
Table 2. Performance comparison of various ML models and large language models.

| Model | Accuracy |
|---|---|
| Our Work (GNN + RAG) | 0.9200 |
| DL Model [29] | 0.8900 |
| MOSEGCN [30] | 0.8300 |
| Large Language Models (LLMs) [31] | 0.6850 |
| SVM [32] | 0.8200 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
Copyright: This open access article is published under a Creative Commons CC BY 4.0 license, which permits free download, distribution, and reuse, provided that the author and preprint are cited in any reuse.