Preprint Article, Version 1. Preserved in Portico. This version is not peer-reviewed.

Using LLM Models and Explainable ML to Analyse Biomarkers at Single Cell Level for Improved Understanding of Diseases

Version 1 : Received: 24 August 2023 / Approved: 29 August 2023 / Online: 30 August 2023 (03:53:31 CEST)

A peer-reviewed article of this Preprint also exists.

Elsborg, J.; Salvatore, M. Using LLMs and Explainable ML to Analyze Biomarkers at Single-Cell Level for Improved Understanding of Diseases. Biomolecules 2023, 13, 1516.

Abstract

Single-cell RNA sequencing (scRNA-seq) technology has significantly advanced our understanding of cellular diversity and how this diversity is implicated in disease. Yet translating these findings across scRNA-seq datasets is challenging because of technical variability and dataset-specific biases. To overcome this, we present a novel approach that combines an LLM-based framework with explainable machine learning to facilitate generalization across single-cell datasets and to identify gene signatures that capture disease-driven transcriptional changes. Our approach uses scBERT, which harnesses shared transcriptomic features among cell types to establish consistent cell-type annotations across multiple scRNA-seq datasets. Additionally, we employ a symbolic regression algorithm to identify highly relevant yet minimally redundant models and features for inferring a cell type’s disease state from its transcriptomic profile. We assess how well these cell-specific gene signatures transfer across datasets and show that they are robust molecular markers for identifying and characterizing disease-associated cell types. We validate the approach on four publicly available scRNA-seq datasets from healthy individuals and patients with ulcerative colitis (UC). The results demonstrate that our approach bridges dataset-specific disparities and enables comparative analyses. Notably, the simplicity and symbolic nature of the retrieved gene signatures make them interpretable, allowing us to elucidate underlying molecular disease mechanisms using these models.
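
To illustrate the idea of inferring disease state from a short symbolic gene signature, the following is a minimal sketch and not the algorithm used in this work. It assumes synthetic expression values, hypothetical gene names (`GENE_A` to `GENE_D`), a brute-force search over one- and two-gene expressions in place of a real symbolic regression engine, and an arbitrary complexity penalty of 0.01 per gene.

```python
import itertools
import random

# Hypothetical gene names; the data are synthetic, not taken from the study.
GENES = ["GENE_A", "GENE_B", "GENE_C", "GENE_D"]


def make_synthetic_cells(n_cells=200, seed=0):
    """Simulate expression values; a cell is 'diseased' when GENE_A exceeds GENE_B."""
    rng = random.Random(seed)
    cells = [[rng.random() for _ in GENES] for _ in range(n_cells)]
    labels = [1 if c[0] > c[1] else 0 for c in cells]
    return cells, labels


def auc(scores, labels):
    """Probability that a diseased cell scores higher than a healthy one."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def candidate_expressions():
    """Yield (name, function, complexity) for every one- and two-gene expression."""
    for i, g in enumerate(GENES):
        yield g, (lambda c, i=i: c[i]), 1
    ops = {
        "+": lambda a, b: a + b,
        "-": lambda a, b: a - b,
        "*": lambda a, b: a * b,
    }
    for (i, gi), (j, gj) in itertools.combinations(enumerate(GENES), 2):
        for sym, op in ops.items():
            yield f"{gi} {sym} {gj}", (lambda c, i=i, j=j, op=op: op(c[i], c[j])), 2


def best_signature(cells, labels, penalty=0.01):
    """Pick the expression with the highest AUC minus a small complexity penalty."""
    scored = []
    for name, fn, complexity in candidate_expressions():
        a = auc([fn(c) for c in cells], labels)
        scored.append((a - penalty * complexity, name, a))
    _, name, a = max(scored)
    return name, a


cells, labels = make_synthetic_cells()
name, score = best_signature(cells, labels)
print(name, score)
```

Because the selected signature is a short closed-form expression over named genes, it can be read directly, which is the interpretability property the abstract emphasizes. A practical symbolic regression tool would search a much larger expression space and would be evaluated on held-out datasets.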

Keywords

biomarker, LLM, interpretability, scRNA-seq, machine learning, symbolic regression

Subject

Biology and Life Sciences, Life Sciences
