2. Materials and Methods
The initial dataset is publicly available and consists of three datasets derived from 105 studies [12]. Each of these datasets contains blood transcriptomics data. The first dataset uses the HG-U133A microarray to assess 2500 samples from 2500 individuals. The second dataset uses the HG-U133 microarray to assess 8348 samples from 8348 individuals. The final dataset uses RNA-seq to assess 1181 samples from 1181 individuals. Combining these datasets gives 12029 samples from 12029 individuals, each containing 12708 genes. Of these 12029 samples, 907 are healthy, 4082 have AML and 63 have acute megakaryoblastic leukaemia (AMKL). AMKL is a rare sub-type of AML [14] and is thus treated as AML in this study. The samples with other diseases are dropped, giving a dataset containing 907 healthy cases (17.95%) versus 4145 cases of AML (82.05%). We did not use any other information about the individuals, such as age or sex, since this would reduce the number of samples drastically. Age is highly correlated with AML, and it has been shown [1,3] that it plays an important role in the prediction of AML using CatBoost and transcriptomics data. We therefore expect that enriching this or a similar dataset with age in the future would improve the performance of our models.
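The sample counts and class proportions above can be checked with a few lines of arithmetic. The following is only a sanity-check sketch; the variable names are illustrative and not part of the original pipeline.

```python
# Sample counts per platform, as reported in the text.
samples_per_platform = {"HG-U133A": 2500, "HG-U133": 8348, "RNA-seq": 1181}
total_samples = sum(samples_per_platform.values())

# Class counts after dropping the samples with other diseases.
healthy = 907
aml = 4082 + 63  # AML plus AMKL, which is treated as AML

kept = healthy + aml
dropped = total_samples - kept
healthy_pct = round(100 * healthy / kept, 2)
aml_pct = round(100 * aml / kept, 2)

print(total_samples, kept, dropped)  # 12029 5052 6977
print(healthy_pct, aml_pct)          # 17.95 82.05
```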
For the classification of AML vs. Healthy individuals, the dataset contains 5052 data instances corresponding to 5052 individuals, each containing 12708 genes. The dataset has been randomly split into two sets, the training and the validation datasets. The training dataset contains 80% of the data and the validation dataset 20%. The proportions of AML and Healthy individuals are the same in both datasets, close to 20% Healthy and 80% AML. For parameter tuning we used 10-fold cross-validation (10CV) [15]. Since we want our models to be machine-agnostic (meaning that we do not add as a feature whether a data instance comes from HG-U133A, HG-U133 or RNA-seq), our dataset includes data instances from all three sources.
CatBoost is a gradient boosting algorithm that addresses prediction shift, which is present in other existing boosting techniques, by using ordered boosting with ordered target statistics. By doing so, it has been reported to outperform other gradient boosting algorithms such as XGBoost and LightGBM. Even though CatBoost was developed and is recommended for tabular datasets consisting of categorical features, various works on numerical features derived from ultrasound B-mode and shear wave elastography, as well as the parameters of the FIB-4 score, have shown that CatBoost can perform well on numerical data without any discretization applied. As the datasets are imbalanced, all of the models make use of the weight balance parameters offered by the CatBoost library. The tuning of the models mainly concerns the following parameters: the number of iterations, the learning rate and the depth. All other tunable parameters have been kept close to the default values offered by the CatBoost library. The first model is used for dimensionality reduction; the other models are used for the actual classification of AML.
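A minimal sketch of a class-preserving 80/20 split and of one possible class-weighting rule is given below. The text does not state how the split was implemented or which CatBoost weighting option was used, so both functions are assumptions. The weighting rule shown (largest class count divided by the class's own count) is intended to mirror the "Balanced" mode of CatBoost's `auto_class_weights` parameter.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, valid_fraction=0.2, seed=0):
    """Return (train_idx, valid_idx) with similar class proportions in both."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train_idx, valid_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_valid = round(len(indices) * valid_fraction)
        valid_idx.extend(indices[:n_valid])
        train_idx.extend(indices[n_valid:])
    return sorted(train_idx), sorted(valid_idx)


def balanced_class_weights(labels):
    """Weight each class by (largest class count) / (its own count)."""
    counts = Counter(labels)
    largest = max(counts.values())
    return {label: largest / n for label, n in counts.items()}


# Class counts taken from the text: 907 Healthy and 4145 AML.
labels = ["Healthy"] * 907 + ["AML"] * 4145
train_idx, valid_idx = stratified_split(labels)
weights = balanced_class_weights([labels[i] for i in train_idx])
print(len(train_idx), len(valid_idx))
print(weights)
```

With these counts the minority Healthy class receives a weight of roughly 4.6 and the AML class a weight of 1, so that both classes contribute comparably to the loss.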
Dimensionality reduction reduces the time needed for the training phase of machine learning models [16]. This makes it feasible to create and tune more complex models.
2.1. AML vs. Healthy
The dimensionality reduction CatBoost model, which is the first one in the whole methodology, has been tuned to 200 iterations, a depth of 4 and a learning rate of 0.1. From this model, we obtain the feature importance for each of its 12708 features. This is done for both the importance regarding the change of the loss function during the training phase and the importance regarding the predictability of the model itself. We took the 100 most important features (genes) under each of these two importance measures and computed their intersection. This led us to only 57 features. The above steps are similar to the method from [3]. In that previously validated method, 34 genes were found to have good predictive importance for the classification of AML vs. Healthy individuals using probe-set data. In order to take advantage of this previous work, we find the genes that belong to the intersection of the 12708 genes and the 34 genes of [1]. The idea is that using genes that have already shown importance for the predictability of a CatBoost model on the same problem will benefit the performance of our models. Only 15 genes belong to this intersection: ’ADAMTS2’, ’CEACAM3’, ’CHRNA3’, ’DSG2’, ’FAM153B’, ’FNDC3A’, ’GATA3’, ’LMAN1’, ’MAL’, ’PATJ’, ’RPL10’, ’SERPINI2’, ’SH2D3A’, ’SLC46A3’, ’TRIM45’. These genes, together with the 57 genes of the dimensionality reduction CatBoost model, lead us to 72 genes. These 72 genes are used in both of our models below.
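The selection logic described above (top-k under two importance measures, their intersection, then the addition of previously reported genes that are present in the data) can be sketched as follows. The function names and toy values are hypothetical. In practice, the two importance dictionaries could be built from CatBoost's `get_feature_importance` with `type="PredictionValuesChange"` and `type="LossFunctionChange"`, but mapping the two measures in the text to these two types is an assumption.

```python
def top_k(importances, k):
    """Names of the k features with the largest importance values."""
    ranked = sorted(importances, key=importances.get, reverse=True)
    return set(ranked[:k])


def select_genes(pred_importance, loss_importance, prior_genes, k=100):
    """Intersect the two top-k sets, then add prior genes present in the data."""
    shared = top_k(pred_importance, k) & top_k(loss_importance, k)
    available_prior = set(prior_genes) & set(pred_importance)
    return shared | available_prior


# Toy importances for five hypothetical genes (not real data).
pred = {"G1": 9.0, "G2": 7.0, "G3": 5.0, "G4": 1.0, "G5": 0.5}
loss = {"G1": 0.8, "G3": 0.6, "G4": 0.5, "G2": 0.1, "G5": 0.0}

# "G9" is a previously reported gene that is absent from the dataset.
selected = select_genes(pred, loss, prior_genes=["G5", "G9"], k=3)
print(sorted(selected))  # ['G1', 'G3', 'G5']
```

In the study itself, the same pattern with k = 100 yields 57 shared genes, to which the 15 previously reported genes present in the dataset are added, giving 72 genes.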
The first model created after the dimensionality reduction uses as features the 72 genes selected above. The tuned CatBoost model has 600 iterations, a depth of 6 and a learning rate of 0.2. This model will from now on be referred to as CatBoost72. Following the methodology of [1], for the next CatBoost model we first keep only those of the 72 genes that are not associated with AML (that is, for which there is no bibliographic reference); these are 62 genes. In [1], besides the genes not associated with AML, one more feature was used that is highly associated with AML, namely age. In our dataset, not all data instances had age recorded, so we decided not to use it. Instead, we included one gene that, according to the literature, is highly associated with AML: ’FLT3’. The resulting 63 genes are used for this final model, which has been tuned to 400 iterations, a depth of 6 and a learning rate of 0.2, and will from now on be referred to as CatBoost63. The main purpose of the CatBoost63 model is to show that there are genes that are associated with AML and to demonstrate once more, after [1], that machine learning can help identify genes that have not yet been associated with AML, helping to uncover the gene profile of this specific disease.
In the methodological framework of our study, we confined our analysis to a maximum of 20 genes, with the goal of identifying a subset with diagnostic relevance for AML. This constraint led to the initial selection of 19 genes, chosen for their statistical and biological significance. These genes formed the basis for our predictive modeling using the CatBoost algorithm. The CatBoost model trained on these 19 genes, referred to as CatBoost19, achieved a ROC-AUC score of 0.9946 in 10CV and 0.9941 on an independent inference dataset.
Subsequent iterations of model optimization were conducted, focusing on a more selective gene set comprising 15 of the initially chosen 19 genes. This selection was particularly notable for including genes with no prior associations to blood cancers, aiming to explore their predictive power in the context of AML. The refined model, henceforth referred to as CatBoost15, demonstrated a robust ROC-AUC of 0.9922 in 10CV and 0.9900 in the inference dataset, indicative of a strong predictive performance and hinting at the existence of previously unrecognized genetic markers for AML. The optimized parameters for CatBoost19 included 95 iterations, a tree depth of 6, and a learning rate of 0.5. In contrast, the CatBoost15 model was adjusted to 200 iterations, a tree depth of 5, and a learning rate of 0.2, reflecting the tailored approach to model refinement.
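For reference, the ROC-AUC metric reported above can be computed from ranks: it equals the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, with ties counted as one half. The function below is a standard-library sketch of that definition and is not the evaluation code used in the study.

```python
def roc_auc(y_true, scores):
    """Rank-based ROC-AUC; y_true holds 0/1 labels, scores the model outputs."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        # Find the end of the group of tied scores starting at position i.
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    n_pos = sum(1 for y in y_true if y == 1)
    n_neg = len(y_true) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("ROC-AUC needs at least one positive and one negative")
    rank_sum_pos = sum(r for r, y in zip(ranks, y_true) if y == 1)
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


# Toy example: three of the four positive/negative pairs are ordered correctly.
print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```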
This systematic approach not only underscores our study’s contribution to the existing body of knowledge, as echoed by the work of Angelakis et al. [1,3], but also highlights the methodological rigor applied in narrowing down the gene set. Our exploration extends beyond the conventional gene profiles associated with AML, venturing into uncharted genetic territories with potential implications for novel diagnostic markers and therapeutic targets. The discovery of predictive capabilities in genes hitherto unassociated with AML not only enriches the genetic landscape pertinent to the disease but also signifies a leap towards innovative drug discovery, opening pathways for the development of new therapeutic strategies. This advancement, rooted in our methodical analysis, sets a precedent for future research aimed at enhancing patient care through genetically informed interventions.
An overview of how the different models and datasets stack up can be found in Figure 1.
2.2. AML vs. Healthy & Other Diseases
In this study we use the dataset of a previous work [12]. In that work the AML vs. Healthy problem was not explored, so our focus here is on this particular problem. In addition, we want to see how the CatBoost algorithm and the methodology of [1] apply to the binary classification problem of AML vs. Healthy & Other Diseases individuals. In [12] it is stated that Lasso had the best performance, but the source of each data instance (HG-U133A, HG-U133 or RNA-seq) was taken into account. Since we want our model to be agnostic to the source (HG-U133A, HG-U133 or RNA-seq), we use all the data instances from the initial dataset. A methodology similar to that of the previous Section 2.1 has been followed. However, instead of the Lasso model that had the best performance in [12], we again use CatBoost models. This comparison model is built using the whole dataset, containing all the diseases and healthy people. The AML and AMKL cases are treated as one class, whereas all other diseases and healthy individuals are treated as the other class. This model is created to examine whether CatBoost and the methodology of [1] can achieve good performance on a different problem using blood transcriptomic data.
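The class definition for this second problem amounts to a simple label mapping, sketched below. The diagnosis strings are hypothetical placeholders, since the label encoding of the original dataset is not described here.

```python
# Diagnoses treated as the positive class; the string values are placeholders.
POSITIVE_DIAGNOSES = {"AML", "AMKL"}


def binary_label(diagnosis):
    """Return 1 for AML or AMKL, and 0 for healthy or any other disease."""
    return 1 if diagnosis in POSITIVE_DIAGNOSES else 0


diagnoses = ["AML", "Healthy", "AMKL", "OtherDisease"]
labels = [binary_label(d) for d in diagnoses]
print(labels)  # [1, 0, 1, 0]
```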
The initial dataset has 12029 data instances and 12708 features. First, a CatBoost model is trained on the whole dataset in order to reduce the number of features. This dimensionality reduction CatBoost model has 100 iterations, a depth of 5 and a learning rate of 0.2. After training and tuning using 10-fold cross-validation, we take the 100 most important features with regard to predictability and the 100 most important features with regard to the loss function. Their intersection contains 60 genes. A new dataset is then created containing only these genes as features. Given this new dataset, a new CatBoost model is tuned using 10CV. This model will from now on be referred to as CatBoost60. It has 200 iterations, a depth of 6 and a learning rate of 0.5.
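The hyperparameters reported in Sections 2.1 and 2.2 can be collected in one place, as in the sketch below. The dictionary keys `reduction_2_1` and `reduction_2_2` and the helper `build_model` are hypothetical names. The choices `auto_class_weights="Balanced"` and `eval_metric="AUC"` are assumptions, since the text only states that weight balance parameters were used and that ROC-AUC was reported.

```python
# Hyperparameters as reported in the text; all other settings stay at defaults.
REPORTED_PARAMS = {
    "reduction_2_1": {"iterations": 200, "depth": 4, "learning_rate": 0.1},
    "CatBoost72": {"iterations": 600, "depth": 6, "learning_rate": 0.2},
    "CatBoost63": {"iterations": 400, "depth": 6, "learning_rate": 0.2},
    "CatBoost19": {"iterations": 95, "depth": 6, "learning_rate": 0.5},
    "CatBoost15": {"iterations": 200, "depth": 5, "learning_rate": 0.2},
    "reduction_2_2": {"iterations": 100, "depth": 5, "learning_rate": 0.2},
    "CatBoost60": {"iterations": 200, "depth": 6, "learning_rate": 0.5},
}


def build_model(name, **overrides):
    """Create an untrained CatBoostClassifier with the reported settings."""
    # Imported lazily so that the table above can be used without catboost.
    from catboost import CatBoostClassifier

    params = dict(REPORTED_PARAMS[name])
    # Assumed settings: the exact weighting option is not given in the text.
    params.update(auto_class_weights="Balanced", eval_metric="AUC", verbose=False)
    params.update(overrides)
    return CatBoostClassifier(**params)
```

A model would then be created with, for example, `build_model("CatBoost60")` and trained by calling its `fit` method on the reduced feature matrix and the binary labels.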