Submitted:
30 November 2023
Posted:
01 December 2023
Abstract
Keywords:
1. Introduction
- The current study focuses on protein sequence data rather than image data.
- The genes most frequently mutated in chronic myeloid leukemia (CML) were identified through a literature review.
- Datasets were constructed from data on the most frequently mutated genes.
- Features were extracted from the sequences' physicochemical properties via amino acid composition (AAC), pseudo amino acid composition (Pse-AAC), and di-peptide composition (DPC).
- The study aims to enhance early-stage prediction in order to improve patient recovery prospects.
- Our proposed solution includes a user-friendly web application dashboard for early CML diagnosis that can be deployed in healthcare institutions and hospitals.
2. Literature Review
| Reference | Dataset | Classifier | Classification Task | Accuracy |
|---|---|---|---|---|
| Mohamed et al. [7] | White blood cell images | Random Forest | Detection of WBC cancer | 94.3% |
| Kumar et al. [8] | Medical images (ALL) | K-means clustering | Detection of acute lymphocytic leukemia (ALL) | 92.8% |
| Sharma et al. [9] | Medical images (leukemia cells) | ABC-BPNN and PCA | Classification of leukemia cells | 98.72% |
| Moshavash et al. [11] | Blood microscopic images (acute leukemia) | Support Vector Machine (SVM) | Classification of acute lymphocytic leukemia (ALL) | 89.81% |
| Gal et al. [13] | Gene expression patterns, RNA sequencing (AML) | k-nearest neighbors (K-NN) | Prediction of complete remission of AML | 84.2% |
| Bostanci et al. [14] | RNA sequences (colon cancer) | Random Forest | Prediction of colon cancer | 97.3% |
| Hosseinzadeh et al. [15] | Protein sequences (lung tumor) | Support Vector Machine (SVM) | Prediction of lung tumor types based on protein attributes | 82.0% |
| Dhakal et al. [16] | miRNA–mRNA interactions | Stacking classifier | Prediction of functional miRNA targets | 79.77% |
| Albitar et al. [17] | RNA sequences | Geometric Mean Naïve Bayesian | Bone marrow-based biomarkers for predicting aGVHD | 93.0% |
| Ahmad et al. [18] | Protein sequences | SVM, Random Forest, XGBoost | Prediction of chronic lymphocytic leukemia using protein sequences | 97.09% |
| He et al. [19] | DNA sequences | Deep learning (CNN & LSTM) | Investigation of leukemia types from transcription factor binding sites | 75.0% |
3. Materials and Methods
3.1. Block Diagram
3.2. Dataset Collection
3.2.1. Fasta Format
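As a small illustration of the FASTA format, the sketch below reads records in which a header line starting with `>` is followed by one or more sequence lines. The function name `parse_fasta` and the two example records are hypothetical and are not taken from the study's dataset.

```python
from typing import Dict, Iterable


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Map each FASTA header (without the leading '>') to its full sequence."""
    records: Dict[str, str] = {}
    header = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            header = line[1:]
            records[header] = ""
        elif header is not None:
            # A sequence may wrap over several lines; concatenate them.
            records[header] += line.upper()
    return records


# Hypothetical two-record input; identifiers and residues are made up.
example = """>seq1 hypothetical protein
ACDEFGHIKL
MNPQRSTVWY
>seq2 another hypothetical protein
MKTAYIAK
""".splitlines()

records = parse_fasta(example)
for name, seq in records.items():
    print(name, len(seq))
```

Each header maps to a single concatenated sequence string, which is the form the composition-based features expect as input.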
3.2.2. Sample of Protein Sequence (HSP90)
3.2.3. Sample of Protein Sequence (HSP90)
3.3. Feature Extraction
3.3.1. Amino Acid Composition
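Amino acid composition is commonly defined as the relative frequency of each of the 20 standard residues in a sequence, giving a 20-dimensional feature vector. The following is a minimal sketch under that definition; the function name `amino_acid_composition` is an assumption for illustration, and non-standard residue codes are simply ignored.

```python
from collections import Counter
from typing import Dict

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def amino_acid_composition(sequence: str) -> Dict[str, float]:
    """Return the fraction of each standard amino acid in the sequence."""
    # Keep only standard residues (assumption: others such as 'X' are dropped).
    residues = [aa for aa in sequence.upper() if aa in AMINO_ACIDS]
    if not residues:
        raise ValueError("sequence contains no standard amino acids")
    counts = Counter(residues)
    total = len(residues)
    return {aa: counts[aa] / total for aa in AMINO_ACIDS}


aac = amino_acid_composition("AACD")
print(aac["A"], aac["C"], aac["D"])
```

The 20 fractions sum to 1, so sequences of different lengths become directly comparable.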
3.3.2. Pseudo Amino Acid Composition
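For reference, the pseudo amino acid composition introduced by Chou (2001) extends the 20 amino acid frequencies $f_u$ with $\lambda$ sequence-order correlation factors $\theta_j$. The form below is the standard definition for a protein of length $L$ with residues $R_1, \dots, R_L$; the values of $\lambda$ and the weight $w$ used in this study are not stated here, so they are left symbolic.

```latex
\theta_j = \frac{1}{L-j}\sum_{i=1}^{L-j} \Theta(R_i, R_{i+j}),
\qquad j = 1, 2, \dots, \lambda, \quad \lambda < L

\Theta(R_i, R_k) = \frac{1}{3}\Big\{
  \big[H_1(R_k) - H_1(R_i)\big]^2 +
  \big[H_2(R_k) - H_2(R_i)\big]^2 +
  \big[M(R_k) - M(R_i)\big]^2 \Big\}

x_u =
\begin{cases}
\dfrac{f_u}{\sum_{i=1}^{20} f_i + w\sum_{j=1}^{\lambda}\theta_j}, & 1 \le u \le 20 \\[2ex]
\dfrac{w\,\theta_{u-20}}{\sum_{i=1}^{20} f_i + w\sum_{j=1}^{\lambda}\theta_j}, & 21 \le u \le 20+\lambda
\end{cases}
```

Here $H_1$, $H_2$, and $M$ denote normalized hydrophobicity, hydrophilicity, and side-chain mass values, and the resulting feature vector has $20 + \lambda$ components.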
3.3.3. Di-peptide Composition
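Di-peptide composition is usually computed as the relative frequency of each of the 400 ordered pairs of adjacent standard residues. The sketch below follows that common definition; the function name `dipeptide_composition` and the handling of non-standard residues (pairs containing them are skipped) are assumptions for illustration.

```python
from collections import Counter
from itertools import product
from typing import Dict

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
# All 400 ordered residue pairs, e.g. "AA", "AC", ..., "YY".
DIPEPTIDES = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]


def dipeptide_composition(sequence: str) -> Dict[str, float]:
    """Return the fraction of each adjacent residue pair in the sequence."""
    seq = sequence.upper()
    pairs = [seq[i:i + 2] for i in range(len(seq) - 1)]
    # Assumption: pairs containing a non-standard residue are skipped.
    valid = [p for p in pairs if p[0] in AMINO_ACIDS and p[1] in AMINO_ACIDS]
    if not valid:
        raise ValueError("sequence contains no valid dipeptides")
    counts = Counter(valid)
    total = len(valid)
    return {dp: counts[dp] / total for dp in DIPEPTIDES}


dpc = dipeptide_composition("AAAC")
print(dpc["AA"], dpc["AC"])
```

Unlike amino acid composition, this 400-dimensional vector retains some local sequence-order information.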
3.3.4. Data Augmentation
4. Development of Individual Classifiers
4.1. Support Vector Machine
4.2. Random Forest
4.3. K-Nearest Neighbor (KNN)
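The idea behind K-nearest neighbor classification can be sketched in a few lines: a query point receives the majority label among its k closest training points. The example below is a minimal pure-Python illustration with Euclidean distance and toy two-dimensional data; the function name `knn_predict`, the value k = 3, and the data points are hypothetical and do not reflect the configuration used in this study.

```python
import math
from collections import Counter
from typing import List, Sequence


def knn_predict(
    train_X: List[Sequence[float]],
    train_y: List[int],
    query: Sequence[float],
    k: int = 3,
) -> int:
    """Predict the label of `query` by majority vote of its k nearest neighbors."""
    # Pair each training point's Euclidean distance to the query with its label.
    distances = sorted(
        (math.dist(x, query), label) for x, label in zip(train_X, train_y)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Toy feature vectors with two classes (0 and 1); values are made up.
train_X = [(0.0, 0.0), (0.1, 0.2), (0.9, 1.0), (1.0, 0.8)]
train_y = [0, 0, 1, 1]

print(knn_predict(train_X, train_y, (0.2, 0.1)))
print(knn_predict(train_X, train_y, (0.8, 0.9)))
```

In practice the feature vectors would be the AAC, Pse-AAC, or DPC representations rather than two-dimensional points.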
4.4. Naïve Bayes
4.5. XGBoost
4.6. Logistic Regression
- $P(y = 1 \mid X)$ is the probability of the target variable $y$ being equal to 1 given the input features $X$
- $\theta$ is the vector of model parameters
- $X$ is the vector of input features
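For reference, the standard logistic (sigmoid) model that these definitions describe can be written as follows. The symbol $\theta$ for the parameter vector is an assumed notation, and any intercept term is taken to be absorbed into $\theta$ by appending a constant 1 to $X$.

```latex
P(y = 1 \mid X) = \sigma\!\left(\theta^{T} X\right) = \frac{1}{1 + e^{-\theta^{T} X}}
```

A sample is typically assigned to the positive class when this probability exceeds 0.5, which is equivalent to $\theta^{T} X > 0$.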
5. Results and Discussion
5.1. Results on Pseudo Amino Acid Composition (Pse-AAC) Data
5.2. Accuracy Result on Amino Acid Composition (AAC) Data
5.3. Accuracy Results on Di-Peptide Composition (DPC)
5.4. Machine Learning Based Dashboard
6. Conclusions
References
- Siegel, R.L.; Miller, K.D.; Fuchs, H.E.; Jemal, A.; et al. Cancer statistics, 2021. CA Cancer J Clin 2021, 71, 7–33.
- Bibi, N.; Sikandar, M.; Ud Din, I.; Almogren, A.; Ali, S. IoMT-based automated detection and classification of leukemia using deep learning. Journal of healthcare engineering 2020, 2020, 1–12.
- International Agency for Research on Cancer (IARC). Leukaemia. Source: Globocan 2020. Available from: https://gco.iarc.fr/today/data/factsheets/cancers/36-Leukaemia-fact-sheet.pdf, 2022.
- Munteanu, C.R.; Magalhães, A.L.; Uriarte, E.; González-Díaz, H. Multi-target QPDR classification model for human breast and colon cancer-related proteins using star graph topological indices. Journal of theoretical biology 2009, 257, 303–311.
- Ramani, R.G.; Jacob, S.G. Improved classification of lung cancer tumors based on structural and physicochemical properties of proteins using data mining models. PloS one 2013, 8, e58772.
- Yang, J.Y.; Yoshihara, K.; Tanaka, K.; Hatae, M.; Masuzaki, H.; Itamochi, H.; Takano, M.; Ushijima, K.; Tanyi, J.L.; Coukos, G.; et al. Predicting time to ovarian carcinoma recurrence using protein markers. The Journal of clinical investigation 2013, 123, 3740–3750.
- Mohamed, H.; Omar, R.; Saeed, N.; Essam, A.; Ayman, N.; Mohiy, T.; AbdelRaouf, A. Automated detection of white blood cells cancer diseases. 2018 First international workshop on deep and representation learning (IWDRL). IEEE, 2018, pp. 48–54.
- Kumar, S.; Mishra, S.; Asthana, P.; Pragya. Automated detection of acute leukemia using k-mean clustering algorithm. Advances in Computer and Computational Sciences: Proceedings of ICCCCS 2016, Volume 2. Springer, 2018, pp. 655–670.
- Sharma, R.; Kumar, R. A novel approach for the classification of leukemia using artificial bee colony optimization technique and back-propagation neural networks. Proceedings of 2nd International Conference on Communication, Computing and Networking: ICCCN 2018, NITTTR Chandigarh, India. Springer, 2019, pp. 685–694.
- Jothi, G.; Inbarani, H.H.; Azar, A.T.; Devi, K.R. Rough set theory with Jaya optimization for acute lymphoblastic leukemia classification. Neural Computing and Applications 2019, 31, 5175–5194.
- Moshavash, Z.; Danyali, H.; Helfroush, M.S. An automatic and robust decision support system for accurate acute leukemia diagnosis from blood microscopic images. Journal of digital imaging 2018, 31, 702–717.
- Umamaheswari, D.; Geetha, S. A framework for efficient recognition and classification of acute lymphoblastic leukemia with a novel customized-KNN classifier. Journal of computing and information technology 2018, 26, 131–140.
- Gal, O.; Auslander, N.; Fan, Y.; Meerzaman, D. Predicting complete remission of acute myeloid leukemia: machine learning applied to gene expression. Cancer informatics 2019, 18, 1176935119835544.
- Bostanci, E.; Kocak, E.; Unal, M.; Guzel, M.S.; Acici, K.; Asuroglu, T. Machine learning analysis of RNA-seq data for diagnostic and prognostic prediction of colon cancer. Sensors 2023, 23, 3080.
- Hosseinzadeh, F.; KayvanJoo, A.H.; Ebrahimi, M.; Goliaei, B. Prediction of lung tumor types based on protein attributes by machine learning algorithms. SpringerPlus 2013, 2, 1–14.
- Dhakal, P.; Tayara, H.; Chong, K.T. An ensemble of stacking classifiers for improved prediction of miRNA–mRNA interactions. Computers in Biology and Medicine 2023, 164, 107242.
- Albitar, M.; Zhang, H.; Pecora, A.L.; Ip, A.; Goy, A.H.; Antzoulatos, S.; De Dios, I.; Ma, W.; Kaur, S.; Suh, H.C.; et al. Bone Marrow-Based Biomarkers for Predicting aGVHD Using Targeted RNA Next Generation Sequencing and Machine Learning. Blood 2021, 138, 2892.
- Ahmad, W.; Hameed, M.; Bilal, M.; Majid, A. ML-Pred-CLL: Machine Learning based prediction of Chronic Lymphocytic Leukemia using protein sequential data. 2022 International Conference on Recent Advances in Electrical Engineering & Computer Sciences (RAEE & CS). IEEE, 2022, pp. 1–7.
- He, J.; Pu, X.; Li, M.; Li, C.; Guo, Y. Deep convolutional neural networks for predicting leukemia-related transcription factor binding sites from DNA sequence data. Chemometrics and Intelligent Laboratory Systems 2020, 199, 103976.
- Rodríguez, D.; Bretones, G.; Quesada, V.; Villamor, N.; Arango, J.R.; López-Guillermo, A.; Ramsay, A.J.; Baumann, T.; Quirós, P.M.; Navarro, A.; Royo, C.; Martín-Subero, J.I.; Campo, E.; López-Otín, C. Mutations in CHD2 cause defective association with active chromatin in chronic lymphocytic leukemia. Blood 2015, 126, 195–202.
- Apweiler, R.; Bairoch, A.; Wu, C.H.; Barker, W.C.; Boeckmann, B.; Ferro, S.; Gasteiger, E.; Huang, H.; Lopez, R.; Magrane, M.; et al. UniProt: the universal protein knowledgebase. Nucleic acids research 2004, 32, D115–D119.
- Fu, L.; Niu, B.; Zhu, Z.; Wu, S.; Li, W. CD-HIT: accelerated for clustering the next-generation sequencing data. Bioinformatics 2012, 28, 3150–3152.
- Feng, P.M.; Lin, H.; Chen, W.; et al. Identification of antioxidants from sequence information using naive Bayes. Computational and mathematical methods in medicine 2013, 2013.
- Feng, P.M.; Ding, H.; Chen, W.; Lin, H.; et al. Naive Bayes classifier with feature selection to identify phage virion proteins. Computational and mathematical methods in medicine 2013, 2013.
- Jia, J.; Liu, Z.; Xiao, X.; Liu, B.; Chou, K.C. pSuc-Lys: predict lysine succinylation sites in proteins with PseAAC and ensemble random forest approach. Journal of theoretical biology 2016, 394, 223–230.
- Lin, W.Z.; Fang, J.A.; Xiao, X.; Chou, K.C. iDNA-Prot: identification of DNA binding proteins using random forest with grey model. PloS one 2011, 6, e24756.
- Qu, K.; Han, K.; Wu, S.; Wang, G.; Wei, L. Identification of DNA-binding proteins using mixed feature representation methods. Molecules 2017, 22, 1602.
- Khajapeer, K.V.; Baskaran, R. Hsp90 inhibitors for the treatment of chronic myeloid leukemia. Leukemia research and treatment 2015, 2015.
- Alves, R.; Santos, D.; Jorge, J.; Gonçalves, A.C.; Catarino, S.; Girão, H.; Melo, J.B.; Sarmento-Ribeiro, A.B. Alvespimycin Inhibits Heat Shock Protein 90 and Overcomes Imatinib Resistance in Chronic Myeloid Leukemia Cell Lines. Molecules 2023, 28, 1210.
- Ellisen, L.W. PARP inhibitors in cancer therapy: promise, progress, and puzzles. Cancer cell 2011, 19, 165–167.
- Liu, Y.; Song, H.; Song, H.; Feng, X.; Zhou, C.; Huo, Z. Targeting autophagy potentiates the anti-tumor effect of PARP inhibitor in pediatric chronic myeloid leukemia. AMB Express 2019, 9, 1–9.
- Kaloni, D.; Diepstraten, S.T.; Strasser, A.; Kelly, G.L. BCL-2 protein family: Attractive targets for cancer therapy. Apoptosis 2023, 28, 20–38.
- Ko, T.K.; Chuah, C.T.; Huang, J.W.; Ng, K.P.; Ong, S.T. The BCL2 inhibitor ABT-199 significantly enhances imatinib-induced cell death in chronic myeloid leukemia progenitors. Oncotarget 2014, 5, 9033.
- Zhou, L.; Ng, D.S.C.; Yam, J.C.; Chen, L.J.; Tham, C.C.; Pang, C.P.; Chu, W.K. Post-translational modifications on the retinoblastoma protein. Journal of Biomedical Science 2022, 29, 1–16.
- Yin, D.D.; Fan, F.Y.; Hu, X.B.; Hou, L.H.; Zhang, X.P.; Liu, L.; Liang, Y.M.; Han, H. Notch signaling inhibits the growth of the human chronic myeloid leukemia cell line K562. Leukemia research 2009, 33, 109–114.
- Cai, Y.D.; Chou, K.C. Predicting subcellular localization of proteins in a hybridization space. Bioinformatics 2004, 20, 1151–1156.
- Chou, K.C. Impacts of bioinformatics to medicinal chemistry. Medicinal chemistry 2015, 11, 218–234.
- Chou, K.C. Prediction of protein cellular attributes using pseudo-amino acid composition. Proteins: Structure, Function, and Bioinformatics 2001, 43, 246–255.
- Khan, Y.D.; Ahmad, F.; Anwar, M.W. A neuro-cognitive approach for iris recognition using back propagation. World Applied Sciences Journal 2012, 16, 678–685.
- American Society of Clinical Oncology (ASCO). Genes and Cancer. Cancer.net 2023.
- Hart, P.E.; Stork, D.G.; Duda, R.O. Pattern classification; Wiley Hoboken, 2000.
- Khan, Y.D.; Ahmed, F.; Khan, S.A. Situation recognition using image moments and recurrent neural networks. Neural Computing and Applications 2014, 24, 1519–1529.
- Butt, A.H.; Khan, S.A.; Jamil, H.; Rasool, N.; Khan, Y.D.; et al. A prediction model for membrane proteins using moments based features. BioMed research international 2016, 2016.
- Butt, A.H.; Rasool, N.; Khan, Y.D. A treatise to computational approaches towards prediction of membrane protein and its subtypes. The Journal of membrane biology 2017, 250, 55–76.
- Khan, Y.D.; Khan, S.A.; Ahmad, F.; Islam, S.; et al. Iris recognition using image moments and k-means algorithm. The Scientific World Journal 2014, 2014.
- Sugiyama, M. Introduction to statistical machine learning; Morgan Kaufmann, 2015.
- Theodoridis, S. Machine learning: a Bayesian and optimization perspective; Academic press, 2015.
- Vapnik, V. The nature of statistical learning theory; Springer science & business media, 1999.
- Montesinos López, O.A.; Montesinos López, A.; Crossa, J. Multivariate statistical machine learning methods for genomic prediction; Springer Nature, 2022.
- Jiao, Y.; Du, P. Performance measures in evaluating machine learning based bioinformatics predictors for classifications. Quantitative Biology 2016, 4, 320–330.
- Fawcett, T. ROC graphs: Notes and practical considerations for researchers. Machine learning 2004, 31, 1–38.

| Name of Algorithms | Accuracy | F1-Score | Recall | Specificity |
|---|---|---|---|---|
| Support Vector Classifier | 92~94% | 91~92% | 91~93% | 92~94% |
| Extreme Gradient Boost | 79~85% | 63~70% | 51~55% | 92~94% |
| Logistic Regression | 66~69% | 10~20% | 6~10% | 97~98% |
| Decision Tree | 81~84% | 73~76% | 74~76% | 84~86% |
| Random Forest | 87~91% | 85~87% | 80~83% | 96~97% |
| K Nearest Neighbor | 82~86% | 72~74% | 61~64% | 93~95% |

| Name of Algorithms | True Negative | False Positive | False Negative | True Positive |
|---|---|---|---|---|
| Support Vector Classifier | 424 | 28 | 14 | 211 |
| Extreme Gradient Boost | 26159 | 2271 | 3435 | 10890 |
| Logistic Regression | 25817 | 2849 | 11010 | 3445 |
| Decision Tree | 24388 | 4278 | 3803 | 10652 |
| Random Forest | 28014 | 808 | 2753 | 11546 |
| K Nearest Neighbor | 419 | 23 | 95 | 140 |

| Name of Algorithms | Accuracy | F1-Score | Recall | Specificity |
|---|---|---|---|---|
| Support Vector Classifier | 54.95% | 14.3% | 0.7% | 100% |
| Extreme Gradient Boost | 56.8% | 52.9% | 45.9% | 69% |
| Logistic Regression | 51.1% | 27.6% | 19.1% | 81.7% |
| Decision Tree | 54.4% | 52.25% | 52.9% | 55.8% |
| Random Forest | 50.6% | 41.1% | 35.4% | 64.9% |
| K Nearest Neighbor | 54.2% | 54.8% | 57% | 51% |

| Name of Algorithms | True Negative | False Positive | False Negative | True Positive |
|---|---|---|---|---|
| Support Vector Classifier | 271 | 0 | 121 | 62 |
| Extreme Gradient Boost | 409 | 23 | 119 | 103 |
| Logistic Regression | 9028 | 2022 | 8519 | 2015 |
| Decision Tree | 124 | 98 | 95 | 107 |
| Random Forest | 12612 | 6817 | 11832 | 6510 |
| K Nearest Neighbor | 112 | 105 | 89 | 118 |

| Name of Algorithms | Accuracy | F1-Score | Recall | Specificity |
|---|---|---|---|---|
| Support Vector Classifier | 92~94% | 87~88% | 91~93% | 90~93% |
| Extreme Gradient Boost | 79~84% | 66~68% | 55~57% | 92~94% |
| Logistic Regression | 66~69% | 0% | 6~10% | 100% |
| Decision Tree | 81~84% | 70~73% | 56~59% | 96~97% |
| Random Forest | 82~84% | 67~68% | 57~58% | 94~95% |
| K Nearest Neighbor | 72~73% | 31~32% | 20~21% | 95~97% |

| Name of Algorithms | True Negative | False Positive | False Negative | True Positive |
|---|---|---|---|---|
| Support Vector Classifier | 416 | 37 | 17 | 207 |
| Extreme Gradient Boost | 413 | 25 | 105 | 134 |
| Logistic Regression | 453 | 0 | 224 | 0 |
| Decision Tree | 433 | 16 | 94 | 134 |
| Random Forest | 437 | 23 | 93 | 124 |
| K Nearest Neighbor | 438 | 15 | 179 | 45 |
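
The four metrics reported in the tables above are standard functions of the confusion-matrix counts. The sketch below shows the usual definitions; the function name `classification_metrics` is hypothetical, and the example call uses the Support Vector Classifier counts from the first confusion matrix (TN = 424, FP = 28, FN = 14, TP = 211). It is not the authors' evaluation code.

```python
from typing import Dict


def classification_metrics(tn: int, fp: int, fn: int, tp: int) -> Dict[str, float]:
    """Compute accuracy, recall, specificity, precision and F1 from raw counts."""
    total = tn + fp + fn + tp
    if total == 0:
        raise ValueError("confusion matrix is empty")
    # Each ratio falls back to 0.0 when its denominator is zero.
    recall = tp / (tp + fn) if (tp + fn) else 0.0          # sensitivity
    specificity = tn / (tn + fp) if (tn + fp) else 0.0     # true negative rate
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    f1 = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0
    return {
        "accuracy": (tp + tn) / total,
        "recall": recall,
        "specificity": specificity,
        "precision": precision,
        "f1": f1,
    }


# Counts for the Support Vector Classifier in the first confusion matrix.
metrics = classification_metrics(tn=424, fp=28, fn=14, tp=211)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

A classifier that never predicts the positive class (TP = 0, FP = 0) yields a specificity of 1 but a recall and F1 of 0, which is why accuracy alone can be misleading on imbalanced data.
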
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).