Preprint
Article

This version is not peer-reviewed.

SEFA: Semantic Embedding-Based Feature Augmentation of Biomedical Language-Model Embeddings Improves Interpretable Metabolomic Prediction of Lung Cancer

Submitted: 22 June 2026

Posted: 23 June 2026


Abstract
Feature engineering remains a major challenge in metabolomics-based prediction, particularly when rich biochemical knowledge is available but underutilized. Conventional metabolomics models rely primarily on measured variables and statistically driven feature selection, often overlooking the molecular and pathway context encoded in curated metabolite knowledge bases. Here, we propose SEFA (Semantic Embedding-based Feature Augmentation), a model-agnostic feature engineering framework that integrates metabolite-level textual knowledge into structured metabolomics and clinical modeling for lung cancer. SEFA uses semantic embeddings derived from external metabolite resources to enrich feature construction through embedding-based semantic projection and knowledge-guided feature selection, yielding a compact and biologically interpretable feature representation. Using a targeted lung cancer metabolomics dataset, we evaluate SEFA across multiple learning algorithms, including linear, kernel-based, tree-based, and neural network models, and show that embedding-informed feature construction consistently improves prediction performance over approaches that rely solely on tabular metabolite and clinical features. Analysis of the selected features reveals enrichment in pathways including arginine and proline metabolism and glycine, serine, and threonine metabolism, indicating that the embedding-derived feature representations capture biologically relevant metabolic processes associated with lung cancer. These results support the use of semantic knowledge in metabolomics feature engineering and position SEFA as a practical, general machine learning framework for enhancing predictive modeling while preserving biological interpretability through pathway-level analysis.
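The abstract does not specify how the embedding-based semantic projection is computed, so the following is only an illustrative sketch of one plausible reading: each sample's standardized metabolite profile is multiplied by a matrix of per-metabolite text embeddings, and the resulting columns are appended to the original tabular features. The function names `semantic_projection` and `augment`, the z-scoring and L2-normalization steps, and the random demo data are all assumptions made for illustration, not the authors' implementation.

```python
import numpy as np


def semantic_projection(X, E):
    """Project samples into embedding space (hypothetical reading of SEFA).

    X: (n_samples, n_metabolites) measured abundances.
    E: (n_metabolites, d) one text embedding per metabolite.
    Returns an (n_samples, d) matrix of embedding-space features.
    """
    X = np.asarray(X, dtype=float)
    E = np.asarray(E, dtype=float)
    if X.shape[1] != E.shape[0]:
        raise ValueError("X columns and E rows must index the same metabolites")

    # Assumption: z-score each metabolite so no single one dominates the sum.
    mu = X.mean(axis=0)
    sd = X.std(axis=0)
    sd[sd == 0] = 1.0  # guard against constant columns
    Xz = (X - mu) / sd

    # Assumption: L2-normalize each embedding so only its direction matters.
    norms = np.linalg.norm(E, axis=1, keepdims=True)
    norms[norms == 0] = 1.0
    En = E / norms

    # Abundance-weighted average of metabolite embeddings per sample.
    return Xz @ En / X.shape[1]


def augment(X, E):
    """Append the projected features to the original tabular features."""
    return np.hstack([np.asarray(X, dtype=float), semantic_projection(X, E)])


# Synthetic demo data only: 6 samples, 4 metabolites, 3-dimensional embeddings.
rng = np.random.default_rng(0)
X_demo = rng.lognormal(size=(6, 4))
E_demo = rng.normal(size=(4, 3))
X_aug = augment(X_demo, E_demo)
```

Because the augmented matrix is an ordinary feature table, it could be passed to any downstream learner, which is consistent with the model-agnostic framing above. The knowledge-guided feature selection step is not sketched here, since the abstract gives no detail on how it operates.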
Copyright: This open access article is published under a Creative Commons CC BY 4.0 license, which permits free download, distribution, and reuse, provided that the author and preprint are cited in any reuse.
Preprints.org is a free preprint server supported by MDPI in Basel, Switzerland.

© 2026 MDPI (Basel, Switzerland) unless otherwise stated