Identifying the primary site of metastatic cancer is vital for guiding treatment decisions, especially for patients with cancer of unknown primary (CUP). Despite advanced diagnostic techniques, the primary site in CUP remains difficult to pinpoint, and CUP is responsible for a considerable number of cancer-related deaths. Determining the origin is crucial for effective management and may improve patient outcomes.
This study introduces a machine learning framework, ONCOfind-AI, that leverages transcriptome-based gene set features to improve the accuracy of predicting the origin of metastatic cancers. By making RNA-sequencing and microarray data compatible, we were able to construct a more comprehensive training dataset, and integrating data from both platforms improved the accuracy of our machine learning models for predicting cancer origins. Our method was validated using external data from clinical samples collected through Kangbuk Samsung Medical Center and the Gene Expression Omnibus. External validation yielded a top-1 accuracy ranging from 0.80 to 0.86 and a top-2 accuracy of 0.90. This study shows that incorporating biological knowledge through curated gene sets can harmonize gene expression data from different platforms, providing the compatibility needed for more effective machine learning prediction models.
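The two technical ideas in this abstract, gene set features that are comparable across platforms and top-k accuracy, can be illustrated with a minimal sketch. It is not the ONCOfind-AI implementation: the scoring rule (mean within-sample rank of a set's genes) is an assumed stand-in for whatever gene set method the authors used, and the function names `gene_set_scores` and `top_k_accuracy` are hypothetical.

```python
import numpy as np

def gene_set_scores(expr, genes, gene_sets):
    """Score each gene set as the mean within-sample rank of its genes.

    Assumption: a simple rank-based score, chosen only because ranks do not
    depend on the measurement scale, so RNA-seq and microarray samples land
    in the same feature space. Not the authors' scoring method.

    expr      : (n_samples, n_genes) expression matrix
    genes     : list of gene names, one per column of expr
    gene_sets : dict mapping set name -> list of gene names
    Returns an (n_samples, n_sets) matrix, columns ordered by sorted set name.
    """
    expr = np.asarray(expr, dtype=float)
    # argsort applied twice gives 0-based ranks per sample (ties broken arbitrarily)
    ranks = expr.argsort(axis=1).argsort(axis=1) / (expr.shape[1] - 1)
    index = {g: i for i, g in enumerate(genes)}
    columns = []
    for name in sorted(gene_sets):
        idx = [index[g] for g in gene_sets[name] if g in index]
        if not idx:
            raise ValueError(f"no measured genes for gene set {name!r}")
        columns.append(ranks[:, idx].mean(axis=1))
    return np.column_stack(columns)

def top_k_accuracy(proba, y_true, k):
    """Fraction of samples whose true class is among the k highest-scored classes."""
    top_k = np.argsort(proba, axis=1)[:, -k:]
    hits = (top_k == np.asarray(y_true)[:, None]).any(axis=1)
    return float(hits.mean())

# Toy data: two samples with the same gene ordering but very different scales,
# standing in for a microarray-like and an RNA-seq-like measurement.
genes = ["g1", "g2", "g3", "g4"]
expr = [[1.0, 2.0, 3.0, 4.0],
        [10.0, 200.0, 3000.0, 40000.0]]
gene_sets = {"S1": ["g3", "g4"], "S2": ["g1", "g2"]}
scores = gene_set_scores(expr, genes, gene_sets)

# Toy class probabilities for three samples and three candidate primary sites.
proba = np.array([[0.1, 0.6, 0.3],
                  [0.5, 0.3, 0.2],
                  [0.2, 0.3, 0.5]])
y_true = [1, 1, 0]
top1 = top_k_accuracy(proba, y_true, k=1)
top2 = top_k_accuracy(proba, y_true, k=2)
```

In this toy case both samples receive identical gene set scores despite their different scales, which is the property that lets data from the two platforms be pooled for training. Top-2 accuracy counts a prediction as correct when the true site is either the first or second ranked candidate, so it is never lower than top-1.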