Large Language Model as a Promising Framework for the Complete Clinical Interpretation of Human Genetic Variants

Jihun Bhak; Dong-Hyun Shin; Jongbum Jeon; Soobok Joe; Yeonsu Jeon; Hyoungjin Choi; Yoonsung Kwon; Kyungwhan An; Yun Sung Cho; Sungwon Jeon; Haeyoung Jeong; Jong Bhak

doi:10.20944/preprints202605.0651.v1

Submitted: 10 May 2026

Posted: 11 May 2026

Abstract

The number of human genetic variants cataloged in dbSNP has plateaued since 2021 at about 1.1 billion. Because the human pangenome reference now enables the precise identification of even structurally complex variants, capturing the entire spectrum of human genetic variants is nearly achievable. However, the clinical impact of most genetic variants remains elusive. This is due to limitations of the genome-wide association study (GWAS), the standard framework for variant interpretation, which relies solely on statistical assumptions. GWAS can neither interpret low-frequency alleles nor capture molecular interactions between variants, which hinders its ability to explain complex traits and diseases. Recently, large language models (LLMs) have enabled accurate inference of the pathogenicity of human genetic variants, without requiring a large sample size or prior annotations, by modeling the biological principles encoded within the genome. For instance, Evolutionary Scale Modeling (ESM1b) predicted the pathogenicity of missense variants in ClinVar with an auROC of up to 0.905. In addition, Evo 2 classified non-coding pathogenic variants in ClinVar with an auROC of 0.987 for single nucleotide variants (SNVs) and 0.971 for non-SNVs. These results suggest that, although LLMs are so far limited to pathogenicity prediction, integrating multiomic and clinical data through LLMs will enable the complete clinical interpretation of human genetic variants.

Keywords: large language model; human genetic variants; pangenome; GWAS

Subject: Biology and Life Sciences - Biology and Biotechnology

Copyright: This open access article is published under a Creative Commons CC BY 4.0 license, which permits free download, distribution, and reuse, provided that the author and preprint are cited in any reuse.
