1. Introduction
Tumor metastasis is the movement of tumor cells spreading from a primary site and progressively colonizing distant organs. When cancer cells disperse from a primary tumor, they travel to other parts of the body through the blood or lymph system. In over 90% of cases, the metastatic spread of tumor cells is the greatest contributor to deaths from cancer [
1]. Treating metastatic cancer is an enormous challenge. For all types of cancer, metastatic tumor patients are often unresponsive to existing corresponding primary cancer treatment. It could be due to the basic biology of metastatic tumors and the lack of treatments targeting their biology. The genetic characteristics of metastatic tumors make them highly resistant to standard treatments for each tumor may have a unique tumor microenvironment (TME) and respond differently to the treatment [
2]. It is necessary to characterize the complicated molecular mechanism in order to choose appropriate treatment strategies.
Recently, RNA-seq transcriptome data have been widely used to study the nature of cancer metastasis [
3,
4,
5]. High-throughput technology frees the potential exploration of transcriptome changes at the whole-genome level which would be used for detecting biomarkers for diagnosis and treatment.
One critical challenge of metastatic cancer study is the difficulty of collecting data due to the complicated characteristics of metastatic cancer. Metastatic cancer is quite late-stage cancer, therefore, patients’ survival rate is significantly lower than primary cancer patients which makes the participants in any clinical trial much less [
6]. The diagnosis of metastatic cancer is also relatively difficult since the primary site and biopsy site are separated which could be confusing at some points. Therefore, building an Atlas for metastatic cancer could be costly. Even some big projects such as The Cancer Genome Atlas (TCGA) or Integrative Clinical Genomics of Metastatic Cancer (MET500) only store a few hundred metastatic samples, including the tumor types for which the primary tumor is rarely diagnosed such as in melanoma [
7,
8]. It leaves great potential for metastatic cancer data augmentation.
Generative models, such as Generative Adversarial Networks (GANs), have achieved remarkable successes in various fields, including computer vision and natural language processing [
9]. However, despite their potential, GANs suffer from several major challenges. 1) unstable training. their training can be unstable due to the simultaneous training of the generator and discriminator models in an adversarial game, where improvements to one model come at the expense of the other, resulting in difficulty in achieving convergence. 2) mode collapse. GANs are prone to mode collapse, where the generator model can learn to produce samples within a particular mode, resulting in a lack of diversity in the generated data. 3) Massive data requirement. GANs require large amounts of data to produce good results. 4) poor interpretability. The generator has no encouragement on giving interpretable and meaningful representation. In the context of metastatic data, GANs face additional challenges due to the limited availability of samples, especially for rare metastatic cancers where only a few samples are available, resulting in severe unstable training and mode collapse. Furthermore, biological data is complex and carries rich entangled information, requiring an interpretable model for downstream analysis.
To this end, we proposed a metastatic cancer expression generator (MetGen), a contrastive learning based generative model that can generate metastatic expression profiles from primary cancer and normal tissue expression samples. Generating metastatic samples from primary cancer and tissue samples is a challenging task, due mainly to limited metastatic cancer samples and corresponding primary cancer & tissue samples. For example, the MET500 dataset has in total of 868 samples from 22 different metastatic cancer types, where most of the cancer types have less than 30 samples[
8]. If we further divide one metastatic cancer into subtypes based on tissue sites, the sample size of each subtype would be even smaller which causes mode collapse and poor generating performance. To resolve this issue, contrastive learning is adopted in our model. The goal of contrastive learning is to learn such an embedding space in which similar sample pairs stay close to each other while dissimilar ones are far apart [
10]. Contrastive learning can be applied to both supervised and unsupervised settings. In the context of our study, our objective is to measure the similarity between a generated sample and an actual metastatic sample. Specifically, we aim to minimize the distance between samples that originate from the same type of metastatic cancer, while maximizing the distance between samples from different types of metastatic cancers.
MetGen avoids mode collapse. Mode collapse is caused by the model finding a trivial solution that only fits majority single mode samples but ignores the samples in other modes [
11]. In our model, modes are regulated by not only MET500 target samples but also TCGA source samples. All the modes, and metastatic cancer types in our scenario, are predefined and learned evenly. Therefore, there is no way that model ignores any of them.
MetGen ensures stable training. In contrastive learning, samples are fed to the model in pairs[
10]. By pairing cancer, tissue, and metastatic sample recursively, we could obtain a massive training dataset easily. For example, one breast cancer sample can pair with multiple different liver tissue samples which results in multiple sample pairs for training. In this way, even rare metastatic cancer types could be trained stable and sufficient.
MetGen has interpretable latent space. MetGen has many advantages in model interpretation. Firstly, MetGen model’s latent code is obtained by Variational autoencoder, and to a large degree, the learned code components are disentangled [
12,
13]. Each component or component cluster represents different functions. Secondly, the autoencoder gives a convenient connection between latent space and gene expression space. The changes in latent space could be reflected in gene expression space by reconstructing latent code. Therefore, the biological function of each code components could be learned in expression space easily by masking them out in code space.
In summary, our proposed model integrates primary cancer information into the context of metastasized tissue site during the generative process. This allows for differential analysis and interpretable visualization in a latent space. We anticipate that our model can facilitate the identification of functional pathways associated with cancer metastasis and contribute to the investigation of their roles in this process.
2. Results and Discussion
Overview of the MetGen framework for generating metastatic expression profiles. The MetGen framework consists of two model components, MetVAE and MetGen. Establishing the MetGen framework includes three stages, i.e., 1) Training MetVAE to obtain latent metastatic cancer codes for MET500 samples, 2) Training MetGen to generate metastatic cancer codes through contrastive mixing of cancer and tissue expressions, and 3) Reconstructing metastatic cancer expression from the generated metastatic cancer code (
Figure 1).
In the first stage, MetVAE was trained using gene expression profiles from the Integrative Clinical Genomics of Metastatic Cancer (MET500). MetVAE is a variational autoencoder (VAE) that includes an encoder and a decoder (
Figure 1a). MetVAE encoder converts a metastatic gene expression sample to a latent metastatic code distribution, which represents a low dimensional representation of the input expression. MetVAE decoder takes a sample from the latent code distribution and reconstructs the input gene expression. MetVAE was trained to minimize the VAE loss that enforces the reconstructed sample to be close to input while minimizing the KL divergence between the learned latent code distribution and standard Gaussian distribution. Therefore, MetVAE learns to produce good latent representations of input MET500 metastatic samples.
In the second stage, MetGen was trained using TCGA primary cancer and tissue expression samples to generate the latent metastatic code of a metastatic cancer type from the corresponding paired primary cancer and tissue expression data (
Figure 1b). MetVAE includes two expression encoders with shared weights followed by a CNN mixer (
Figure 1b). The two expression encoders convert TCGA primary cancer and tissue gene expression data to primary cancer and tissue latent codes, respectively. The two encoders, which are shared-weights convolutional neural networks (CNNs), extract cancer and tissue features and filter useless information for generating metastatic cancer samples. The CNN mixer integrates cancer and tissue codes to generate the corresponding metastatic cancer code, i.e., code for the corresponding input primary cancer spreading to the input tissue. At the input of the CNN mixer are stacked cancer and tissue codes (
Figure 1b), which are compressed into a 1D code by multiple 2 × 1 kennels, each mixing them into different metastatic cancer related features. Two fully connected layers are followed to produce nonlinear combinations of these features and generate the metastatic cancer code. The expression encoder and CNN mixer are trained with a contrastive learning loss to make the generated codes follow the distribution of MetVAE codes of the corresponding metastatic cancer in MET500. The contrastive learning loss compares the generated code and a MET500 metastatic code and minimizes the distances between the codes of matching metastatic cancer types (e.g., when the MetGen output code is generated from TCGA breast cancer and live tissue and MetVAE code is from MET500 breast cancer in liver) but keeps the distance of codes from mismatched types at a preset margin. When the training converges, MetGen would generate metastatic cancer codes as close to MET500 codes of the matching metastatic cancer type as possible but not to be confused with other metastatic cancer types.
In the third stage, the MetGen generated metastatic code is fed into the MetVAE decoder to reconstruct the corresponding metastatic cancer gene expression profile (
Figure 1c).
Training and testing data preparation. The dataset MET500 and TCGA used to train MetGen are downloaded from UCSC XenaHub. For the sake of discussion, we use the term ‘cancer type’ for TCGA primary cancer and metastatic cancer types; ‘tissue type’ for TCGA normal tissue and metastatic biopsy sites; ‘metastatic subtype’ for specific primary cancer migrated in specific tissue site.
TCGA and MET500 have 20 common cancer types. Our goal is to learn metastatic cancer samples, therefore we want to keep target information in training as much as possible. We picked the top 6 largest sample cancer types in MET500. Common tissue types shared by TCGA and MET500 are 16 in total, 6 of them are associated with selected cancer types. Thus, we have 6 cancer types, 6 tissue types and 21 metastatic subtypes left. We further removed two metastatic subtypes, breast cancer migrated from one to another and lung cancer migrated from one to another, from MET500 data, since it’s difficult to define such a metastasis in our model when there are no left and right labels for breasts and lungs in datasets. We filtered datasets based on selected cancer, tissue types and metastatic subtypes. 3052 TCGA cancer samples in 6 cancer types, 338 TCGA tissue samples in 6 tissue types and 219 MET500 samples in 19 metastatic subtypes are kept (
Table 1).
We randomly take one sample from each TCGA primary cancer, TCGA normal tissue and MET500 metastatic cancer. Three samples together are called a pair. In each pair, if TCGA samples and MET500 samples matches in both cancer type and tissue type, we call it positive pair. Otherwise, it’s called negative pair. For instance, a pair including TCGA breast cancer, TCGA normal liver tissue and MET500 breast cancer in liver is a positive pair while a pair including TCGA breast cancer, TCGA lung tissue and MET500 breast cancer in bladder is negative pair, etc. We generated 1000 positive pairs for each metastatic subtype and three times negative pairs, in total 76,000 pairs for training and validating. We also generate another 3800 pairs sample, 200 pairs for each metastatic subtypes, as testing data which is held out during training.
Expressions in all samples were normalized and transformed into log2(FPKM +1) and then scaled between 0 and 1 by the min-max scaler. 7,312 genes were selected for this study. The data preparation is explained in greater detail in Method.
MetVAE codes capture the cancer and tissue features of metastatic samples. We first examined if MetVAE codes could represent metastatic samples in lower dimensions without compromising feature patterns. The expressions and MetVAE latent codes are visualized using t-SNE [
14] (
Figure 2). In the expression space, unlike primary tumors [
15] (
Supplementary Figure S3), metastatic samples do not form distinguish clusters, though most of the cancer types are loosely grouped such as BRCA and PRAD (
Figure 2a). On the other hand, tissue types are not segregated based on biopsy sites, with the exception of liver [
8] (
Figure 2b). In code space, cancer types show similar pattern to data space sample (
Figure 2c), however, more cancer groups are more clustered than in gene expressions. For example, BRCA, HNSC and PRAD are further separated from other cancer groups and formed clear clusters. For tissue types t-SNE (
Figure 2d), no significant change discovered. We still observe liver tissues clustered together while other tissue types loosely distributed. In general, MET500 samples group in cancer types, and within cancer clusters we can observe multiple tissue sites. This is expected because metastatic cancer is the primary tumor cell metastasized to other tissue sites. They are still considered same cancer type as primary tumor, so the overall characteristics is similar. In summary, our MetVAE model learns faithful pattern of the cancer types, tissue types and metastatic subtypes in code space compared to gene expression space if not better.
To further quantify the extent to which the latent metastatic cancer codes captured both cancer and tissue features, we trained two classifiers on MET500 codes to predict cancer types and tissue types, respectively. The cancer classifier and tissue classifier achieved 98.3 ± 0.7% and 92.3 ± 0.6% ROC AUCs, indicating that MetVAE could capture both cancer and tissue information of MET500 samples correctly.
MetGen generates high-quality metastatic cancer samples that reveal the distinct patterns of metastatic cancer types. We generated 200 samples for every 19 metastatic subtypes, resulting in 3,800 samples, using testing data. To visualize the results, t-SNE was applied to reduce the feature dimensions. We observed 19 metastatic subtypes clusters while original MET500 data does not form such clear cancer and tissue clusters (
Figure 3a). To test if the generated codes reserve the cancer and tissue information, we tested 3,800 generated codes on trained cancer classifier and tissue classifier. The performance yields to 99.9% and 96.6% for cancer type and tissue type respectively. Nearly perfect classification performance suggests the generated code has very similar cancer and tissue feature to true MET500 code, or classifiers would not be able to classify them correctly.
To better show how MetGen reveals metastatic cancer distribution, we projected true MET500 sample to generated metastatic samples’ t-SNE plot (
Figure 3a). We found original data points and generated data points have significant overlaps. To be more specific, for each generated cluster, there are original metastatic samples enclosed. That is to say, MetGen generated new samples around the original metastatic target samples and impute space in between. The more data generated the better distribution pattern would show in t-SNE plot. To justify the results quantitatively, we trained a classifier on our testing dataset to predict metastatic subtypes. The idea is to see whether learned knowledge from generated metastatic data can apply on true metastatic samples. Unsurprisingly, the model shows good predicting power for trueMET500 samples with ROC AUC score 97.6%, which furtherly demonstrates MetGen’s power in generating high quality metastatic samples from primary cancer and normal tissues.
To investigate one metastatic subtype specifically, we used breast cancer in liver for example. We first retrieved all breast cancer in liver samples from MET500, breast cancer samples and normal liver tissue samples from TCGA, then compared them with 1000 generated breast cancer in liver sample in gene expression heatmaps (
Figure 3b). Generated samples do not only have similar discriminative power but also look like true samples in gene expressions. Some genes are not active in primary breast cancer but found high expression level in metastatic breast cancer. Those genes are also upregulated in liver tissues. It implies MetGen learns metastatic cancer by merging primary cancer information into tissue environment.
MetGen model learns metastatic prostate cancer characteristics. The development of metastatic cancer requires cancer cells to leave their primary sites and eventually acclimate to new cellular surroundings in tissue sites. It is rather complicated than simply ‘put tumor to target location’. We used metastatic prostate cancer for example to illustrate how MetGen takes advantage of its architecture to capture metastatic cancer information.
We generated 4000 metastatic prostate samples in four different tissue sites (lung, liver, bladder, skin) using TCGA-PRAD and TCGA normal samples. Another 4000 samples were generated similarly except the MetGen cancer branch was muted, thus generated samples do not contain prostate cancer information, namely masked cancer samples (
Figure 4A). To study the characteristics of the samples robustly, we use gene set variation analysis (GSVA) [
16] to perform single sample gene set analysis on 186 KEGG and 50 HALLMARK gene sets downloaded from MSigDB collection [
17,
18]. GSVA transforms gene expression files into 236 gene set signatures. Eventually, we obtain a GSVA score matrix of 8000 by 236. Each number is a sample’s enrichment score for one specific gene set (
Figure 4b). Metastatic prostate cancer and masked cancer metastatic sample shows clear different patterns. We anticipate the difference is mainly contributed by the cancer information in metastatic prostate cancer.
We performed differential analysis to identify the difference of two groups (
Figure 4c). KEGG androgen response (AR), a key pathway in prostate cancer, ranked high in upregulation pathways. Multiple studies identified it in the development of the normal prostate and in the progression from primary to metastasis, low AR transcriptional activity resulted in upregulated AR expression [
19,
20,
21]. Some other well-known metastatic prostate pathways such as purine metabolism [
22,
23,
24], complement and coagulation cascades [
25,
26] were also successfully detected by our model. Cancer intrinsic pathways like P53 signaling along with WNT β catenin signaling were found more activate in masked cancer samples which also makes sense since prostate cancer are usually suppressive [
27,
28]. Furthermore, tumor microenvironment (TME) pathways ECM receptor interaction [
29,
30] and epithelial mesenchymal transition [
31,
32] found significantly differential expressed in our comparison. The metastatic cascade is dependent on the loss of adhesion between cells which results in the dissociation of the cell from the primary tumor, and subsequently the ability of the cell to attain a motile phenotype via changes in cell to matrix interaction (
Figure 4c,
Supplementary Table S1).
MetGen not only ranked prostate cancer related pathways highly but also detected multiple cancer intrinsic and metastatic TME pathways correctly (
Figure 4c). Such results prove our model could leverage cancer information from primary tumor and then embed it to local tumor microenvironment.
MetGen latent components learns functional clusters in metastatic breast cancer in bladder. The latent code leaned from MetVAE reserves key metastatic cancer information, however due to a limited number of metastatic cancer samples, interpret the code statistically is often difficult. We use metastatic breast cancer in bladder to illustrate how our model helps in latent functional interpretation.
We generated 1000 metastatic breast cancer in bladder latent code using TCGA-BRCA and TCGA normal bladder tissues. Then, the hierarchical cluster was used to characterize 100 code components into clusters and high concordance was shown among 3 functional clusters (
Figure 5a).
To discover the pathway enriched in these functional clusters, we use similar masking approach from previous section. For instance, to get cluster 1 pathways, we set the cluster 1code components to zero, keep the rest components value as they are, then fed the masked code to MetVAE decoder to obtain masked metastatic cancer samples. Later on, masked metastatic cancer samples will be compared to non-masked metastatic cancer samples, therefor the difference is mainly contributed by cluster 1 components (
Figure 5b).
Similar to metastatic prostate cancer analysis, we first transform our samples into KEGG and HALLMARK gene set signatures using GSVA. Since we have three cluster masked samples and non-masked samples, 4000 by 236 GSVA score matrix was obtained.
Figure 5c shows the heatmap of activate level for gene sets in different samples.
To identify the cluster information, three differential comparisons were performed using LIMMA, for instance, metastatic breast cancer samples vs. each masked cluster samples. Top 20 of positive and negative differential pathways from each comparison were selected for analysis (
Supplementary Table S2). From the results (
Figure 5d), we noticed functional clusters often represents multiple functions. We roughly group them into 5 signatures.
Cell cycle. All three clusters detected G2M checkpoint, E2F targets and MYC targets V1 upregulated. Those pathways are closely related to
cell cycle process which is commonly active in biological samples [
33,
34,
35,
36].
Immune response. Immune response related pathways are mostly located in cluster 2 and 3 components. In this group we found interferon alpha response, phosphatidylinositol signaling system and PIK3, AKT, mTOR signaling which are actively regulated in immune activation and cell migration in the innate immune system [
37,
38]. Additionally, estrogen response has been proved with immunoenhancing effect on the immune system [
39]. Above pathways together have been shown to regulate immune response by impairing negative selection of high affinity auto-reactive B cells, modulating B cell function, and leading to Th2 response [
40].
Inflammatory/Cell differentiation/
metastasis. This group has rather homogeneous pathways score pattern but very complicated signature. Cluster 1 and 3 detected both Hedgehog and Notch signaling pathways. Hedgehog is a signaling pathway that transmits information to embryonic cells required for proper
cell differentiation. Notch signaling promotes proliferative signaling during neurogenesis, and its activity is inhibited by Numb to promote neural differentiation. They play a major role in the regulation of embryonic development [
41,
42,
43]. Interestingly, cluster 1 and 3 also detected olfactory signaling and melanogenesis pathways, along with Hedgehog and Notch signaling pathways, are often found in melanoma tumor metastasis [
44]. We also found several pathways related to specifical T cell response and immune cell trafficking in cluster 1 and 3. Antigen processing and presentation is a well-known pathway which can digest protein they encounter and display peptide fragment from them on their surface for another immune cell to recognize [
45,
46]. Cytosolic DNA sensing pathway are responsible for detecting foreign DNA from invading microbes or host cells and generating innate immune responses [
47]. Besides those pattern recognition receptors, immune cells such as T cell receptor and NK cell mediated cytotoxicity signals are also activated which strongly implies this group is for inflammatory response.
Cellular metabolism. Cellular metabolism involves complex sequences of controlled biochemical reactions, better known as metabolic pathways. These processes allow organisms to grow and reproduce, maintain their structures, and respond to environmental changes. We found two important cell metabolism target files, lipid metabolism and cytochrome P450 (CYP450). CYP450 are usually membrane-bound and localized to the inner mitochondrial or endoplasmic reticular membrane [
48]. Lipid metabolism are important sources of cellular energy, stored as triglycerides[
49]. Both play critical roles in oxygenase activity [
50].
Bladder function. Cluster 2 detected a few non cancer related pathways such as vascular smooth muscle contraction, tight junction and calcium signaling pathways [
51,
52]. Those pathways are highly correlated to muscle activities of bladder. We believe metastatic tissue site information are mainly stored in this cluster.
Above results show MetGen latent code components give great interpretability. Each functional code cluster captured multiple functional processes which are highly correlated to metastatic breast cancer. The generated samples reveal the characteristics of metastasis, target tissue sites and regular cancer related immune responses.