Computer Science and Mathematics

Review

Mathematical and Computational Biology

Rebuilding the Antibiotic Pipeline with Guided Generative Models

Shriya Bhat

Rishab Jain

Wesley Greenblatt

Abstract: The antibiotic pipeline has stalled: most recent approvals reflect incremental modifications of existing scaffolds, while antimicrobial resistance continues to outpace discovery. Antimicrobial peptides (AMPs) offer a compelling alternative because of rapid, multi-modal activity, but clinical translation has been limited by toxicity, serum instability, and the prohibitive cost of synthesizing and testing large libraries. Recent progress in protein language models (pLMs) changes the computational landscape by providing embeddings that capture sequence context and biophysical regularities from massive unlabeled datasets. However, pLMs alone are not a design solution. We propose a technique coupling pLM-derived representations to diffusion or discrete flow-based generative models that can explore non-homologous regions of peptide space while being steered by multi-objective guidance. This framework supports direct optimization for potency, selectivity, and developability during generation, compressing hit discovery and early optimization into a single in silico loop. Conditioning generation on target and safety predictors could shift AMPs from membrane-lytic ‘blunt instruments’ toward more selective, target-aware therapeutics.

Posted: 16 January 2026

https://doi.org/10.20944/preprints202601.1230.v1

Article

Computer Science and Mathematics

Mathematical and Computational Biology

The Prevention Theorem: Time-Dependent Constraints on Post-Exposure Prophylaxis for HIV

A.C. Demidont

Abstract: Antiretroviral agents for HIV prevention are typically evaluated in terms of trial efficacy and programmatic coverage, but rarely in terms of whether they admit a true mathematical solution to prevention. Here we introduce the Prevention Theorem, which formalizes prevention for a given exposure e as the condition R0(e) = 0, meaning that the probability of establishing a productive, transmissible infection is exactly zero. Within this framework, post-exposure prophylaxis (PEP) is not delayed treatment but a time-dependent operator acting on within-host infection establishment dynamics. Using a mechanistic model of reservoir seeding and proviral integration, we derive the PEP Window Corollary: PEP can enforce R0(e) = 0 only when initiated within a finite biological window prior to irreversible integration and initial reservoir establishment. Beyond this window, all reachable system states satisfy R0(e) > 0 and are irreducible by post-exposure intervention. Parameterization using virological data indicates that this window extends to approximately 72 hours for mucosal exposures but is compressed to roughly 12–24 hours for parenteral exposures due to bypass of early immune bottlenecks. As an applied example, we show that structural access delays in high-risk populations, such as people who inject drugs, frequently exceed this compressed parenteral window. Consequently, for such exposures the condition R0(e) = 0 is mathematically and biologically unreachable before access is even attempted, rendering the failure of post-exposure prevention a consequence of violated biological boundary conditions rather than pharmacological efficacy.

Posted: 14 January 2026

https://doi.org/10.20944/preprints202601.1090.v1

Article

Computer Science and Mathematics

Mathematical and Computational Biology

The Synaptic Pruning Cliff: Threshold-Like Network Fragility Under Internal Stress and Efficient Recovery in a Computational Model of Depression

Ngo Cheung

Abstract: Background: Major depressive disorder (MDD) is increasingly viewed through a neuroplasticity lens, with developmental synaptic pruning emerging as a potential core liability. Genetic evidence implicates pruning pathways, while rapid-acting antidepressants like ketamine promote synaptogenesis, suggesting that excessive early elimination leaves circuits vulnerable to later stress. Few computational models, however, capture the specific MDD pattern of latent fragility collapsing under perturbation, followed by recovery via limited plasticity enhancement. Methods: An overparameterized feed-forward neural network (∼396,000 parameters) was trained on a noisy four-class Gaussian cluster task to represent dense early connectivity. Excessive pruning (95% magnitude-based weight removal, per-layer) simulated adolescent over-elimination. Fragility was assessed under input perturbations and internal neural noise (post-activation Gaussian injections at varying intensities) modeling neuromodulatory disruption. Recovery involved gradient-guided regrowth (50% of pruned connections, prioritized by loss-reduction potential) followed by fine-tuning. Comparisons included random regrowth and a sparsity sweep to identify thresholds. Results: The intact network showed robust performance across conditions. Pruning induced sharp collapse (clean accuracy ∼51%, standard noisy ∼43%), with pronounced sensitivity to internal noise (moderate stress accuracy ∼31%) exceeding input noise effects. Gradient-guided regrowth plus fine-tuning restored near-baseline accuracy (clean/standard ∼100%) and robustness (combined stress ∼97%) despite ∼47% persistent sparsity. Targeted regrowth slightly outperformed random under high stress. A critical threshold emerged around 93% sparsity, beyond which combined-stress performance dropped abruptly (>44 percentage points). Conclusions: Excessive pruning generates threshold-like intrinsic fragility consistent with stress-triggered MDD relapse, while targeted, limited synaptogenesis efficiently compensates without full density restoration. These findings support a pruning-mediated plasticity deficit as a mechanistic framework for MDD vulnerability and highlight the therapeutic potential of activity-dependent plasticity enhancement. The model provides a testable scaffold for linking polygenic pruning risk to circuit-level decompensation and rapid treatment response.
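
The per-layer, magnitude-based pruning step described in the Methods can be illustrated with a minimal sketch. It assumes each layer is a plain NumPy weight matrix; the function name `prune_by_magnitude`, the toy layer shapes, and the random weights are illustrative and are not taken from the paper.

```python
import numpy as np

def prune_by_magnitude(weights, sparsity=0.95):
    """Zero out the smallest-magnitude fraction of entries in each layer."""
    pruned = []
    for w in weights:
        k = int(round(sparsity * w.size))  # number of weights to remove in this layer
        if k == 0:
            pruned.append(w.copy())
            continue
        # The k-th smallest absolute value is this layer's cut-off.
        threshold = np.partition(np.abs(w).ravel(), k - 1)[k - 1]
        mask = np.abs(w) > threshold  # keep only weights above the cut-off
        pruned.append(w * mask)
    return pruned

# Toy layers (assumed shapes, random weights), pruned at the 95% level from the abstract.
rng = np.random.default_rng(0)
layers = [rng.normal(size=(64, 32)), rng.normal(size=(32, 4))]
pruned_layers = prune_by_magnitude(layers, sparsity=0.95)
sparsities = [float(np.mean(p == 0)) for p in pruned_layers]
```

Computing the threshold separately for each layer, rather than once over all weights, keeps every layer at the same sparsity, which is what "per-layer" removal implies.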

Posted: 13 January 2026

https://doi.org/10.20944/preprints202601.0805.v1

Article

Computer Science and Mathematics

Mathematical and Computational Biology

Mechanistic Insights into the Differential Efficacy of Sonrotoclax against Venetoclax-Resistant BCL2 G101V

Yashmin Afshar

Ali Goli

Melika Abrishami

Abstract: Resistance to venetoclax, a selective BCL-2 inhibitor approved for hematological malignancies, is frequently mediated by the G101V mutation in BCL-2. Sonrotoclax shows superior potency against both wild-type and G101V-mutated BCL-2, yet the mechanistic basis remains unclear. This study employed computational methods to investigate the binding dynamics of both inhibitors. Structures were predicted with AlphaFold and refined via molecular dynamics simulations (MDS), and ligands were docked with AutoDock Vina. Four systems were subjected to triplicate 200 ns MDS, with analyses including RMSD, RMSF, buried surface area, protein-ligand interaction fingerprints, and MM/GBSA binding free energies. Results indicate that venetoclax exhibits progressive dissociation from G101V BCL-2, with elevated RMSD, reduced buried surface area, and increased unbound states. In contrast, Sonrotoclax maintains a steady correlation, shows persistence with entropy-enthalpy compensation, displays negligible unbound time, higher binding free energies, and constant van der Waals anchors. Based on these results, a "Dynamic Blockade" hypothesis is proposed, in which Sonrotoclax's flexibility enables sustained BH3 groove occupancy, blocking pro-apoptotic BH3-only proteins and overcoming allosteric perturbations induced by G101V. This mechanistic perspective proposes an approach for designing resilient inhibitors to accelerate drug repurposing and development in oncology.

Posted: 08 January 2026

https://doi.org/10.20944/preprints202601.0600.v1

Article

Computer Science and Mathematics

Mathematical and Computational Biology

An Integrative Variant Scoring Function for Finding Novel Genes Associated with Ovarian and Thyroid Cancer

Amanda Bataycan

Omodolapo Nurudeen

Jonathon E. Mohl

Khodeza Begum Mitchell

Ming-Ying Leung

Abstract: We devised a quantitative scoring function to assess the cumulative effects of nonsynonymous single nucleotide variants (SNVs) on protein-coding genes in patients with ovarian cancer (OvCa) and thyroid cancer (ThCa). The goal is to find novel candidate cancer-related genes for downstream bioinformatics analyses and wet-lab studies. With Genomic Data Commons as the primary data resource, SNV information was extracted from whole-exome sequencing data from patients with these cancers. A cumulative variant scoring function, Q(G), was developed to sum up the deleterious effects of the individual SNVs on the gene G. While Q(G) can be computed using any popular functional effect analyzer such as FATHMM-XF, SIFT, PolyPhen, or CADD, we have also established an integrative scoring function iQ(G) that combines the deleterious assessments from different analyzers, and we demonstrate that iQ(G) is a more effective method for identifying likely cancer-related genes. Based on the iQ(G) rankings, the top three novel genes for OvCa are AHNAK2, UNC13A, and PCDHB4, and those for ThCa are PLEC, HECTD4, and CES1. Furthermore, the top 1% of genes with the highest iQ(G) scores for each cancer were submitted for KEGG pathway analysis. The results revealed that several genes of the CACNA1 family within the type II diabetes mellitus pathway are likely related to both OvCa and ThCa, and suggested other molecular interactions that should be further studied in connection with OvCa prognosis and ThCa treatment.
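
As a rough illustration of the scoring idea, the sketch below sums per-variant deleteriousness scores into a per-gene Q(G). The abstract does not state how iQ(G) combines analyzers, so the combination rule shown here (mean of max-normalised Q(G) values) is purely an assumption, as are the function names, the placeholder gene labels, and the made-up scores.

```python
from collections import defaultdict

def cumulative_scores(variants):
    """Q(G): sum of per-variant deleteriousness scores for each gene."""
    totals = defaultdict(float)
    for gene, score in variants:
        totals[gene] += score
    return dict(totals)

def integrative_scores(per_analyzer):
    """Hypothetical iQ(G): mean of max-normalised Q(G) across analyzers.

    This combination rule is an assumption for illustration only.
    """
    genes = set().union(*(q.keys() for q in per_analyzer.values()))
    combined = {}
    for gene in genes:
        parts = []
        for q in per_analyzer.values():
            top = max(q.values())
            parts.append(q.get(gene, 0.0) / top if top > 0 else 0.0)
        combined[gene] = sum(parts) / len(parts)
    return combined

# Made-up (gene, score) pairs on two different scales, standing in for two analyzers.
q1 = cumulative_scores([("GENE_A", 0.9), ("GENE_A", 0.8), ("GENE_B", 0.4)])
q2 = cumulative_scores([("GENE_A", 20.0), ("GENE_B", 25.0), ("GENE_B", 5.0)])
iq = integrative_scores({"analyzer_1": q1, "analyzer_2": q2})
ranking = sorted(iq, key=iq.get, reverse=True)
```

Normalising each analyzer's scores before combining them is one simple way to keep an analyzer with a larger numeric range from dominating the ranking.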

Posted: 07 January 2026

https://doi.org/10.20944/preprints202601.0543.v1

Article

Computer Science and Mathematics

Mathematical and Computational Biology

Bayesian Decision-Making Shapes Phenotypic Landscapes from Differentiation to Cancer

Arnab Barua

Haralampos Hatzikirou

Abstract: Cells adapt their phenotypes in noisy microenvironments while maintaining robust decision-making. We develop a coarse-grained theoretical framework in which cellular phenotypic adaptation is described as Bayesian decision-making coupled to replication and diffusion. This leads to an effective Fokker–Planck equation with an emergent fitness landscape governing phenotypic dynamics. We identify distinct phenotypic regimes (homeostatic fixation, bistable decision-making, critical switching, and runaway explosion) and propose a biological interpretation in which homeostatic and bistable landscapes correspond to healthy differentiated cell states, whereas explosive landscapes capture stem-like or cancer-like behaviour. In the Gaussian setting, the correlation $\rho$ between intrinsic and extrinsic states directly encodes mutual information and acts as a bifurcation parameter: high correlation produces shallow or explosive landscapes associated with phenotypic plasticity, while reduced correlation stabilises differentiated fates by deepening potential wells. We further show that proliferation reshapes these landscapes in a nontrivial manner. Proliferation may conditionally stabilise local homeostasis without altering global confinement, or cooperate with biased environmental sensing to eliminate homeostasis/bistability and drive cancer-like phenotypic explosion even at high phenotypic fidelity. Finally, we show that negative intrinsic–extrinsic correlations suppress explosive dynamics but also reduce bistable plasticity, suggesting a robustness-plasticity trade-off. Together, our results suggest that development, tissue homeostasis, and carcinogenesis can be understood as information-driven deformations of a Bayesian phenotypic fitness landscape.
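
The link between $\rho$ and mutual information in the Gaussian setting can be made concrete with a standard identity. The abstract does not give the formula, so the expression below assumes the simplest case of two jointly Gaussian scalar variables, $X$ (intrinsic state) and $Y$ (extrinsic state); these symbols are introduced here for illustration.

$$
I(X;Y) = -\tfrac{1}{2}\,\ln\!\left(1-\rho^{2}\right) \quad \text{(in nats)}
$$

Under this assumption $I(X;Y)=0$ at $\rho=0$ and grows without bound as $|\rho|\to 1$, so fixing $|\rho|$ fixes the mutual information.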

Posted: 05 January 2026

https://doi.org/10.20944/preprints202601.0188.v1

Article

Computer Science and Mathematics

Mathematical and Computational Biology

GenProtect-V: A Variational Inference-based Framework for Privacy-Preserving Synthetic Human Genomic Data Generation

Zihan Bian

Linyu Mou

Abstract: The generation of synthetic human genomic data offers immense potential for biomedical research and data sharing, while theoretically safeguarding individual privacy. However, existing methods, including deep generative models, struggle to achieve a robust balance between data utility and privacy protection. State-of-the-art evaluations like PRISM-G reveal vulnerabilities such as proximity, kinship replay, and trait-linked leakage. This paper introduces GenProtect-V, an end-to-end privacy-preserving synthetic human genomic data generation framework based on a Variational Autoencoder architecture. GenProtect-V integrates multi-layered privacy mechanisms: a Differentially Private Encoder to mitigate Proximity Leakage, Decoupled Latent Space Learning to address Kinship Replay, and a Rare Variant Smoother to counter Trait-linked Leakage. Through extensive experiments on the 1000 Genomes Project dataset, we demonstrate that GenProtect-V consistently achieves significantly lower PRISM-G composite scores compared to state-of-the-art baselines. Crucially, GenProtect-V simultaneously maintains or improves key utility metrics, including Allele Frequency fidelity, Population Structure preservation, and GWAS reproducibility. An ablation study further confirms the independent and significant contributions of its privacy mechanisms. GenProtect-V establishes a new benchmark for balancing privacy and utility, offering a more secure and practical paradigm for synthetic genomic data generation.

Posted: 29 December 2025

https://doi.org/10.20944/preprints202512.2461.v1

Brief Report

Computer Science and Mathematics

Mathematical and Computational Biology

Artificial Intelligence–Driven Structural Mining Enables Functional Inference in the Human Dark Proteome

Valentina Carbonari

Annamaria Defilippo

Ugo Lomoio

Caterina Francesca Perri

Barbara Puccio

Pierangelo Veltri

Pietro Hiram Guzzi

Abstract: The rapid diffusion of high-throughput sequencing technologies has generated a vast repertoire of protein-coding sequences whose biological roles remain unknown. This discrepancy between sequence availability and functional understanding has led to the definition of the dark proteome, comprising proteins or protein regions that lack experimentally resolved structures and reliable functional annotations. Classical sequence-based approaches often fail to characterize these targets due to extreme sequence divergence, intrinsic disorder, or membrane localization. Here, we present an integrated, structure-centric computational framework that leverages recent advances in artificial intelligence to enable functional inference in the human dark proteome. By combining deep learning–based protein structure prediction, large-scale structural alignment, and machine learning–driven surface pocket analysis, we uncover remote evolutionary relationships and conserved functional features that remain invisible to traditional bioinformatics pipelines. Our results demonstrate that artificial intelligence provides a powerful strategy to bridge the gap between genomic information and biological function, opening new avenues for systematic exploration of uncharacterized regions of the human proteome.

Posted: 23 December 2025

https://doi.org/10.20944/preprints202512.2025.v1

Article

Computer Science and Mathematics

Mathematical and Computational Biology

An Agent‐Based Model of Youth Nonmedical Prescription Opioid Use in Ontario: Forecast Validation and Future Projection

Narjes Shojaati

Abstract: Amid COVID-19-related in-person school closures in 2021, an agent-based simulation grounded in social impact theory was implemented and documented to investigate the effects of in-person school closure on nonmedical prescription opioid use among adolescents in Ontario, Canada. The results of model simulations forecasted an alarming rebound effect in the opioid use prevalence after the lifting of in-person school closures and identified secure medication storage in households as an effective strategy for mitigating associated risks. This study evaluates this result by comparing the baseline projection from the previously published study with newly released 2023 data from the Ontario Student Drug Use and Health Survey. Furthermore, it employs the developed agent-based model to simulate the projection through 2030 and assesses the efficacy of secure medication storage in households for the coming years. The study confirms that the previously published simulation projection for 2023 closely aligns with observed data, showing nonmedical prescription opioid use prevalence among Ontario adolescents nearly doubling from 2021 to 2023. Additionally, the results show that nonmedical prescription opioid use prevalence among youth is projected to remain at these elevated levels. Critically, the findings suggest that the temporal window for effective secure medication storage interventions has elapsed, and these interventions are now expected to have minimal impact on reducing this increase, even when applied extensively. The agreement between reported predictions and observed data demonstrates that a simulation model with a relevant conceptual foundation can accurately predict future trends and provide sufficient lead time for policymakers to implement interventions within critical time-sensitive windows to alter undesirable trajectories before public health crises escalate.

Posted: 19 December 2025

https://doi.org/10.20944/preprints202512.1761.v1

Article

Computer Science and Mathematics

Mathematical and Computational Biology

Kappa-Frameshift Background Mutations and Long-Range Correlations of the DNA Base Sequences

Elias Koorambas

Abstract: Following Livadiotis G. and McComas D. J. (2023) [1], we propose a new type of DNA frameshift mutation that occurs spontaneously due to information exchange between a DNA sequence of length n (in bases) and a mutation sequence of length m (in bases), and that respects the kappa-addition symbol ⊕κ. We call these proposed mutations Kappa-Frameshift Background (KFB) mutations. We find that entropy defects originate in the interdependence of the information-length systems induced by the proposed KFB mutation (that is, their interconnectedness: the state in which systems with a significant number of constituents, here information-length bases, depend on or are connected with each other). We also quantify the correlation between the DNA information lengths n and m due to information exchange. In the presence of entropy defects, Landauer's bound and the minimal metabolic rate for a biological system are modified. We observe that the different n and κ scales are manifested in the double evolutionary emergence of the proposed biological system through subsystem correlations. For specific values of the kappa parameter, we can expect deterministic laws associated with a single biological polymer in the short term, before the polymer explores over time all the possible ways it can exist.

Posted: 17 December 2025

https://doi.org/10.20944/preprints202503.0891.v2

Article

Computer Science and Mathematics

Mathematical and Computational Biology

Stochastic Modelling and Analysis of Within-Farm Highly Pathogenic Avian Influenza Dynamics in Dairy Cattle

Parul Tiwari

Malavika Smitha

Hammed Olawale Fatoyinbo

Abstract: Highly pathogenic avian influenza (HPAI) has expanded its host range with recent detections in dairy cattle, raising critical concerns regarding within-herd persistence and cross-species spillover. This study develops a stochastic SEI_sI_aR − B compartmental model to analyse HPAI transmission, explicitly accounting for environmental pathogen reservoirs and noise intensities through Wiener processes. The positivity and boundedness of solutions are established, and the disease-free and endemic equilibria are analytically derived. The basic reproduction number is determined using the next-generation matrix method. Numerical simulations confirm that the model dynamics are consistent with theoretical analysis and illustrate how stochastic fluctuations significantly influence disease persistence. Furthermore, sensitivity analysis using Latin Hypercube Sampling (LHS) and Partial Rank Correlation Coefficients (PRCC) identifies the transmission rate from asymptomatic infectious cattle (β_a) as the primary driver of transmission. The model effectively captures the dynamics of environmental variability affecting HPAI spread, suggesting that effective control strategies must prioritise the early detection and isolation of asymptomatic carriers alongside environmental management.

Posted: 17 December 2025

https://doi.org/10.20944/preprints202512.1583.v1

Article

Computer Science and Mathematics

Mathematical and Computational Biology

DNABERT2-CAMP: A Hybrid Transformer-CNN Model for E. coli Promoter Recognition

Hua-Lin Xu

Xiu-Jun Gong

Hua Yu

Ying-Kai Wang

Abstract: Accurate identification of promoters is essential for deciphering gene regulation but remains challenging due to the complexity and variability of transcriptional initiation signals. Existing deep learning models often fail to simultaneously capture long-range dependencies and precise local motifs in DNA sequences. To address this, we propose DNABERT2-CAMP, a hybrid deep learning framework that integrates global sequence context with localized feature extraction for enhanced promoter recognition in Escherichia coli. The model leverages a pre-trained DNABERT-2 Transformer to encode evolutionarily conserved patterns across extended contexts, while a novel CAMP (CNN-Attention-Mean Pooling) module detects fine-grained promoter motifs through convolutional filtering, multi-head attention, and mean pooling. By fusing global embeddings with high-resolution local features, our approach achieves robust discrimination between promoter and non-promoter sequences. Under 5-fold cross-validation, DNABERT2-CAMP attained an accuracy of 93.10% and a ROC AUC of 97.28%. It also demonstrated strong generalization on independent external data, achieving 89.83% accuracy and 92.79% ROC AUC. These results underscore the advantage of combining global contextual modeling with targeted local motif analysis for accurate and interpretable promoter identification, offering a powerful tool for synthetic biology and genomic research.

Posted: 17 December 2025

https://doi.org/10.20944/preprints202512.1533.v1

Article

Computer Science and Mathematics

Mathematical and Computational Biology

Graph of Life, Borders of Life, and Global Life Network

Valentin E. Brimkov

Abstract: In this work, we pose and aim to answer the following questions, among others: Which quantitative characteristics, being satisfied, led to the phase transition from "primordial soup" to living organisms? How to measure the negentropy of a certain organic matter that underpinned the appearance of a certain species? To what extent do the biosequences of living organisms differ from random sequences? How do we quantitatively distinguish primitive from higher-level organisms? How can we compare the complexity of two living things? Is there an adequate mathematical structure that naturally and appropriately represents each organism biosequence and all of them as a whole? What are the properties of that structure? How does that structure evolve, and what are the theoretical limits of any further evolution? Is it likely that these bounds will be reached, and what are the "limits of life?" How to estimate the effect on the mechanism of evolution of natural selection vs. the one of chance and mutations? To this end, we introduce relevant mathematical structures and use them for modeling purposes. Finally, we also speculate on possible scenarios of the origin of life, evolution, and related issues.

Posted: 15 December 2025

https://doi.org/10.20944/preprints202512.1257.v1

Article

Computer Science and Mathematics

Mathematical and Computational Biology

Fuzzy Logic–Integrated Optimal Control for Dynamic Intervention in Hepatitis C Virus Epidemiology

Debnarayan Khatua

Bikash Kumar

Manoranjan K. Singh

Somnath Kumar

Abstract: Hepatitis C Virus (HCV) continues to be a significant worldwide health issue, particularly in resource-limited environments with inadequate diagnostic and therapeutic options. This study formulates a deterministic six-compartment model, predicated on the assumptions that the population undergoes natural birth-death dynamics, awareness initiatives transition individuals from $S_1$ to $S_2$, diagnosis advances U to I, recovery is achieved through therapy or immunity, and infection and mortality rates vary among classes. The system is described by coupled nonlinear ODEs that include three time-dependent controls. Analysis establishes the positivity and boundedness of all compartments and yields the basic reproduction number ($R_0$) via the next-generation matrix. Sensitivity analysis shows that $\beta_1, \beta_2, \tau_1, \tau_2$ are the most important parameters. Using Pontryagin's Maximum Principle, the forward–backward sweep method is employed to determine the optimal controls that minimise both infection and cost. A Mamdani fuzzy logic controller is added to handle parameter uncertainty and generate adaptive responses to infection pressure, awareness level, and hospital load. Simulations reveal that fuzzy control delivers suppression equivalent to the crisp optimum at around two-thirds lower cost, providing a stable, interpretable, and resource-efficient paradigm for dynamic HCV intervention.

Posted: 14 December 2025

https://doi.org/10.20944/preprints202512.1149.v1

Article

Computer Science and Mathematics

Mathematical and Computational Biology

Real-Time Nanopore Methylome Profiling Identifies CpG-Poor Transcription Factor Regions as Epigenetic Signatures of Relapse in Acute Myeloid Leukemia

Gabriela Fernandes

Abstract: Relapse in acute myeloid leukemia (AML) is frequently associated with chemoresistance, yet the molecular mechanisms driving this transition remain incompletely understood. To explore relapse-associated epigenetic remodeling, we reanalyzed publicly available Nanopore whole-genome methylation data from three AML patients with matched onset and relapse samples. We focused on CpG-poor transcription factor (TF)-associated regulatory regions, recently implicated as unconventional epigenetic hotspots in leukemia progression. Across all samples, relapse was characterized by a consistent gain in DNA methylation within CpG-poor TF regions, with all ranked loci demonstrating a positive mean Δβ shift. Heatmap visualization of the top-ranked regions revealed distinct clustering of relapse versus onset samples, supporting the presence of a coordinated epigenetic signature rather than random methylation drift. These findings suggest that relapse AML cells may acquire targeted methylation to suppress key regulatory networks involved in DNA repair, apoptosis, and growth control, thereby enabling therapeutic escape. This work highlights the potential utility of Nanopore methylation profiling as a real-time biomarker platform to detect relapse-associated epigenetic rewiring and guide precision treatment strategies.

Posted: 02 December 2025

https://doi.org/10.20944/preprints202512.0257.v1

Article

Computer Science and Mathematics

Mathematical and Computational Biology

Reaction–Diffusion Model of CAR-T Cell Therapy in Solid Tumours with Antigen Escape

Maxim Valentinovich Polyakov

Elena Ivanovna Tuchina

Abstract: Developing effective CAR-T cell therapy for solid tumours remains challenging because of biological barriers such as antigen escape and an immunosuppressive microenvironment. The aim of this study is to develop a mathematical model of the spatio-temporal dynamics of tumour processes in order to assess key factors that limit treatment efficacy. We propose a reaction–diffusion model described by a system of partial differential equations for the densities of tumour cells and CAR-T cells, the concentration of immune inhibitors, and the degree of antigen escape. The methods of investigation include stability analysis and numerical solution of the model using a finite-difference scheme. The simulation results show that antigen escape leads to the formation of a persistent core within the tumour and subsequent relapse after an initial regression. We find that the efficacy of therapy critically depends on the balance between the rate of tumour-cell killing and the rate of resistance development, and that repeated administration of CAR-T cells provides deeper and more durable suppression of tumour growth compared with a single infusion. We conclude that the proposed model is a valuable tool for analysing and optimising CAR-T therapy protocols, and that our results highlight the need for combined strategies aimed at overcoming antigen escape.
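
The abstract does not give the model equations, so the sketch below is not the authors' four-variable system. It only illustrates how an explicit finite-difference scheme advances a reaction–diffusion equation, using a generic single-species stand-in, $u_t = D\,u_{xx} + r\,u(1-u)$, in one dimension. The parameter values, grid, and the function name `step` are all assumptions.

```python
import numpy as np

def step(u, D, r, dx, dt):
    """One explicit (forward Euler) update of u_t = D u_xx + r u (1 - u)."""
    # Copies of the edge values act as ghost cells, approximating zero-flux boundaries.
    padded = np.pad(u, 1, mode="edge")
    laplacian = (padded[2:] - 2.0 * u + padded[:-2]) / dx**2
    return u + dt * (D * laplacian + r * u * (1.0 - u))

# Assumed illustrative parameters; the explicit scheme needs D*dt/dx**2 <= 0.5.
D, r, dx, dt = 0.01, 1.0, 0.1, 0.1
assert D * dt / dx**2 <= 0.5

x = np.arange(0.0, 10.0, dx)
u0 = np.where(x < 1.0, 0.5, 0.0)  # small initial cell density near the left edge
u = u0.copy()
for _ in range(200):  # integrate to t = 20
    u = step(u, D, r, dx, dt)
```

The explicit update is simple to write and inspect, at the cost of the time-step restriction checked above; an implicit scheme would relax that restriction but requires solving a linear system at each step.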

Posted: 02 December 2025

https://doi.org/10.20944/preprints202512.0136.v1

Article

Computer Science and Mathematics

Mathematical and Computational Biology

Predicting the Onset of Type 2 Diabetes (T2D) Based on Genetic and Clinical Risk Factors Using XGBoost ML Model

Arnav Gupta

Gatik Goyal

Abstract: Both hereditary and clinical risk factors influence the development of T2D. A rich body of research exists on the effect of clinical factors on T2D, but less is known about how genetic factors influence its development. Therefore, we used a machine learning (ML) algorithm to better understand how genetic variants influence the development of T2D in the presence of high, moderate, and low risk clinical factors. We collected genetic and clinical risk factor datasets from publicly available sources. We probabilistically assigned genetic variants from our genetic dataset to the individuals in the clinical dataset to form a single dataset containing both clinical and genetic risk factors. An XGBoost XGBClassifier was then trained on the combined dataset. SHAP summary plots were also generated for each risk group after model training. The model's predictive performance (AUC) was highest for the low-risk group, while the moderate- and high-risk groups performed slightly lower. According to the SHAP plots, both BMI and family history are key predictors of T2D across all risk groups. However, SNP effect sizes were more influential than other clinical risk factors, indicating that genetic contributions, while secondary, were still relevant. ROC curves assess the model's ability to predict diabetes cases across risk groups. All models performed above the 0.7 AUC threshold, with an AUC of 0.9116 for the low-risk group, 0.7372 for the medium-risk group, and 0.7366 for the high-risk group, indicating they are clinically applicable and not affected by the assignment of genetic variables. While genetic treatments for diabetes remain experimental, our work supports emerging advancements in pharmacogenomics and gene-based therapies by helping to identify which patients may benefit from specific drug regimens, including gene-based interventions.

Posted: 26 November 2025

https://doi.org/10.20944/preprints202511.2060.v1

Article

Computer Science and Mathematics

Mathematical and Computational Biology

An Intelligent Decision-Support Framework for AST Risk Prediction Using Explainable Ensemble Learning

Natalya Maxutova

Akmaral Kassymova

Kuanysh Kadirkulov

Aisulu Ismailova

Gulkiz Zhidekulova

Zhanar Azhibekova

Jamalbek Tussupov

Quvvatali Rakhimov

Zhanat Kenzhebayeva

Abstract: This paper proposes an intelligent and explainable ensemble system for predicting aspartate aminotransferase (AST) levels based on routine biochemical and demographic data from the NHANES dataset. The framework integrates robust preprocessing, adaptive feature encoding, and multi-level ensemble learning within a nested cross-validation (5×3) structure to ensure reproducibility and prevent data leakage. Several regression models (including Random Forest, XGBoost, CatBoost, and stacking ensembles) were systematically compared using R², RMSE, MAE, and MAPE metrics. The results show that the Stacking v2 architecture, combining CatBoost, LightGBM, and Ridge meta-regression, achieves the highest predictive accuracy and stability. Explainable AI analysis using SHAP revealed key biochemical and lifestyle factors influencing AST variability. The proposed system provides a modular, interpretable, and reproducible foundation for decision-support applications in intelligent healthcare analytics, aligning with the goals of applied system innovation.

Posted: 18 November 2025

https://doi.org/10.20944/preprints202511.1265.v1

Article

Computer Science and Mathematics

Mathematical and Computational Biology

Variational Quantum Eigensolver for Clinical Biomarker Discovery: A Multi-Qubit Model

Juan Pablo Acuña González

Moisés Sánchez Adame

Oscar Montiel

Abstract: We formalize an inverse, data-conditioned variant of the Variational Quantum Eigensolver (VQE) for clinical biomarker discovery. Given patient-encoded quantum states, we construct a task-specific Hamiltonian whose coefficients are inferred from clinical associations, and interpret its expectation value as a calibrated energy score for prognosis and treatment monitoring. The method integrates principled coefficient estimation, ansatz specification with basis rotations, commuting-group measurements, and a practical shot-budget analysis. Evaluated on public infectious-disease datasets under severe class imbalance, the approach yields consistent gains in balanced accuracy and precision-recall over strong classical baselines, with stability across random seeds and feature ablations. This variational energy-scoring framework bridges Hamiltonian learning and clinical risk modeling, offering a compact, interpretable, and reproducible route to biomarker prioritization and decision support.
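The idea of reading a Hamiltonian's expectation value as an energy score can be illustrated with a small classical sketch. It assumes, purely for illustration, a diagonal Hamiltonian built from Pauli-Z terms, H = Σ h_i Z_i + Σ J_ij Z_i Z_j; the preprint's actual Hamiltonian, ansatz, and measurement scheme are not described in the abstract and are not reproduced here. The coefficient values and the names `ising_energy` and `expectation` are hypothetical.

```python
from itertools import product


def ising_energy(bits, linear, pairwise):
    """Eigenvalue of H = sum_i h_i Z_i + sum_(i,j) J_ij Z_i Z_j on a basis state.

    Z has eigenvalue +1 on bit 0 and -1 on bit 1.
    """
    z = [1 if b == 0 else -1 for b in bits]
    energy = sum(h * z[i] for i, h in linear.items())
    energy += sum(coupling * z[i] * z[j] for (i, j), coupling in pairwise.items())
    return energy


def expectation(amplitudes, linear, pairwise):
    """<psi|H|psi> for the diagonal H above.

    amplitudes maps bit tuples to (possibly complex) amplitudes; the state is
    normalized here, so unnormalized inputs are accepted.
    """
    norm = sum(abs(a) ** 2 for a in amplitudes.values())
    weighted = sum(
        abs(a) ** 2 * ising_energy(bits, linear, pairwise)
        for bits, a in amplitudes.items()
    )
    return weighted / norm


# Hypothetical coefficients standing in for values inferred from clinical data.
linear = {0: 0.5, 1: -0.3}
pairwise = {(0, 1): 0.2}

basis_score = expectation({(0, 0): 1.0}, linear, pairwise)
uniform = {bits: 0.5 for bits in product((0, 1), repeat=2)}
uniform_score = expectation(uniform, linear, pairwise)
print(basis_score, uniform_score)
```

Because every term here is diagonal, the score reduces to a probability-weighted average of classical energies; terms involving X or Y would require the basis rotations the abstract mentions.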

Posted: 13 November 2025

https://doi.org/10.20944/preprints202511.0978.v1

Article

Computer Science and Mathematics

Mathematical and Computational Biology

ML-Based Optimal Design of One-Component Ionizable Amphiphilic Janus Dendrimers for Enhanced Dendrimersome Nanoparticle-Mediated mRNA Delivery

Joshua Kim

Sungwoo Yang

Abstract: Background/Objectives: Ionizable lipid nanoparticles (LNPs) are the mainstream delivery mechanism for mRNA vaccines. However, LNPs are limited in their mRNA transfection efficiency (TE) into target cells. Dendrimersome nanoparticle (DNP) delivery systems, developed using ionizable amphiphilic Janus dendrimers (IAJDs), were designed to overcome the limitations of earlier approaches. Researchers have found this alternative promising because of its comparatively simple, repeating one-component structure and enhanced stability. This study sought to clarify the impact of particular IAJD structural components on mRNA TE and to develop novel IAJD candidates with maximum predicted TE. Methods: Structural constituents (hydrophilic, ionizable amine, and hydrophobic regions) were systematically defined and encoded for computational analysis. Luciferase-induced luminescence was used as a quantitative metric for mRNA transfection. TE prediction models were built using several machine learning algorithms, and the model using eXtreme Gradient Boosting was selected. This prediction model coped with the imbalanced datasets and was used to find the optimal IAJD designs and formulation conditions. Results: The IAJD optimization process ultimately yielded three novel optimized IAJD candidates and one existing IAJD, surpassing previously identified IAJDs. Conclusions: To our knowledge, this study presents the first large-scale computational investigation of IAJD structural optimization using machine learning. The design of the IAJD is the primary factor that influences mRNA TE, but other factors also contribute and more work is needed. This study highlights the potential of ML-driven IAJD optimization. Combined with high-throughput in vitro assays, this method could significantly accelerate mRNA therapeutics development with an improved delivery mechanism.

Posted: 30 October 2025

https://doi.org/10.20944/preprints202510.2331.v1
