SEFA: Semantic Embedding-Based Feature Augmentation of Biomedical Language-Model Embeddings Improves Interpretable Metabolomic Prediction of Lung Cancer

Jiawen Wu; Jean-François Haince; Rashid A. Bux; Guoyu Huang; Paramjit S. Tappia; Bram Ramjiawan; Maria Vaida

doi:10.20944/preprints202606.1685.v1

Submitted: 22 June 2026

Posted: 23 June 2026


Abstract

Feature engineering remains a major challenge in metabolomics-based prediction, particularly when rich biochemical knowledge is available but underutilized. Conventional metabolomics models rely primarily on measured variables and statistically driven feature selection, often overlooking the molecular and pathway context encoded in curated metabolite knowledge bases. Here, we propose SEFA (Semantic Embedding-Based Feature Augmentation), a model-agnostic feature engineering framework that integrates metabolite-level textual knowledge into structured metabolomics and clinical modeling for lung cancer. SEFA uses semantic embeddings derived from external metabolite resources to enrich feature construction through embedding-based semantic projection and knowledge-guided feature selection, yielding a compact and biologically interpretable feature representation. Using a targeted lung cancer metabolomics dataset, we evaluate SEFA across multiple learning algorithms, including linear, kernel-based, tree-based, and neural network models, and show that embedding-informed feature construction consistently improves prediction performance over approaches that rely solely on tabular metabolite and clinical features. Analysis of the selected features reveals enrichment in pathways including arginine and proline metabolism and glycine, serine, and threonine metabolism, indicating that the embedding-derived feature representations capture biologically relevant metabolic processes associated with lung cancer. These results support the use of semantic knowledge in metabolomics feature engineering and position SEFA as a practical, general machine learning framework for enhancing predictive modeling while preserving biological interpretability through pathway-level analysis.
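The abstract does not spell out how the embedding-based semantic projection is computed. As a minimal illustrative sketch only, one plausible reading is a linear projection of each sample's metabolite profile onto per-metabolite text embeddings, with the projected columns appended to the original tabular features. The function names `semantic_projection` and `augment`, the L2 normalization step, and the random demo data are all assumptions made here for illustration; they are not taken from the paper.

```python
import numpy as np


def semantic_projection(X, E):
    """Project samples into embedding space (assumed formulation, not the paper's).

    X: (n_samples, n_metabolites) matrix of metabolite abundances.
    E: (n_metabolites, dim) matrix with one text embedding per metabolite.
    Returns an (n_samples, dim) matrix of embedding-weighted features.
    """
    X = np.asarray(X, dtype=float)
    E = np.asarray(E, dtype=float)
    if X.shape[1] != E.shape[0]:
        raise ValueError("X columns and E rows must index the same metabolites")
    # L2-normalize each metabolite embedding so no single metabolite
    # dominates purely because its embedding has a large norm.
    norms = np.linalg.norm(E, axis=1, keepdims=True)
    E_unit = E / np.where(norms == 0, 1.0, norms)
    # Each output column is an abundance-weighted sum over metabolites.
    return X @ E_unit


def augment(X, E):
    """Concatenate the original tabular features with the projected ones."""
    return np.hstack([np.asarray(X, dtype=float), semantic_projection(X, E)])


# Hypothetical demo data: 5 samples, 4 metabolites, 3-dimensional embeddings.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(5, 4))
E_demo = rng.normal(size=(4, 3))
X_aug = augment(X_demo, E_demo)
```

Under these assumptions, `X_aug` keeps the four original columns and adds three projected ones, so any downstream learner (linear, kernel-based, tree-based, or neural) can consume it unchanged. The knowledge-guided feature selection step mentioned in the abstract is not sketched here.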

Keywords: lung cancer; metabolomics; feature augmentation; biomedical language models; semantic embeddings; pathway enrichment; interpretable machine learning; HMDB

Subject: Medicine and Pharmacology - Oncology and Oncogenics

Copyright: This open access article is published under a Creative Commons CC BY 4.0 license, which permits free download, distribution, and reuse, provided that the author and preprint are cited in any reuse.
