Submitted: 28 November 2023
Posted: 28 November 2023
Abstract
Keywords:
1. Introduction
1.1. The Road from GWAS Findings to Drug Discovery
1.2. GWAS Applications beyond Gene Discovery: Cumulative Genetic Profiles and Causal Relationships
2. Machine Learning Solutions for GWAS
2.1. Machine Learning Methods Frequently Adapted for GWAS
2.2. Machine Learning Application Areas in GWAS
2.3. Tools for SNP Discovery From Whole-Genome SNP Data
| Application categories | Applications and tools | Machine learning approach |
|---|---|---|
| Prioritization of top GWAS results | Clustering, SVM, Random Forest, Neural Network | |
| Epistasis detection among pre-selected SNPs | Clustering, Random Forest, Neural Network | |
| Variant prioritization | SVM, Random Forest, Neural Network | |
| Hypothesis-free GWAS | SVM, Neural Network | |
| Polygenic Risk Score | Random Forest, Neural Network | |
2.4. Applications Supporting PRS
| Name | Method | Genotype matrix generation | Explainability/ Method for SNP relevance scores | Language |
|---|---|---|---|---|
| COMBI | Two-step method: | Not built-in. It requires a phenotype vector and a genotype matrix. | No/ SVM for SNP relevance scores | Matlab/Octave, R and Java |
| DeepCOMBI | Three-step method: | Not built-in. It requires a phenotype vector and a genotype matrix. | Yes/ relevance scores | Python |
| Deep Mixed Model | Two-component DL method: | Not built-in. It requires genotype and phenotype matrices. | Not available | Python |
| DeepWAS | Integration method: | Not built-in. DeepSEA requires VCF format. | Not available | R |
| GenNet | Neural network whose connections are defined by prior biological knowledge, grouping nodes across layers to reduce the number of learnable parameters | Built-in | Built-in as SNP, gene and pathway relevance scores based on relative weights | Python |
| GMStool | Three-step method: | Not built-in. It requires genotype, phenotype, GWAS result and test list files. | Not available | R |
| GWANN | | Not built-in. It requires a VCF file with genotype data and a CSV file with phenotype data. | Not available | Python |
| IMEGES | The Annovar input/bed format file |
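
The GenNet entry in the table describes a network whose connections are restricted by prior biological knowledge so that fewer parameters need to be learned. The following is a minimal sketch of that general idea in NumPy, not the GenNet implementation: the SNP-to-gene annotation, the function names `build_mask` and `masked_forward`, and the toy dimensions are all hypothetical and chosen only for illustration.

```python
import numpy as np

def build_mask(snp_to_gene, n_snps, n_genes):
    """Binary (n_snps, n_genes) mask: 1 where an annotation links a SNP to a gene."""
    mask = np.zeros((n_snps, n_genes))
    for snp, gene in snp_to_gene:
        mask[snp, gene] = 1.0
    return mask

def masked_forward(genotypes, weights, mask):
    """Gene-layer activations computed only from annotated SNP-gene connections."""
    # Multiplying by the mask zeroes every weight that has no annotation behind it.
    return np.tanh(genotypes @ (weights * mask))

# Hypothetical annotation: SNPs 0-2 belong to gene 0, SNPs 3-5 to gene 1.
snp_to_gene = [(0, 0), (1, 0), (2, 0), (3, 1), (4, 1), (5, 1)]
mask = build_mask(snp_to_gene, n_snps=6, n_genes=2)

rng = np.random.default_rng(0)
weights = rng.normal(size=(6, 2))
# Toy genotype matrix: 4 individuals x 6 SNPs, coded as 0/1/2 allele counts.
genotypes = rng.integers(0, 3, size=(4, 6)).astype(float)

gene_activations = masked_forward(genotypes, weights, mask)

n_dense = weights.size          # parameters in a fully connected layer
n_learnable = int(mask.sum())   # parameters that survive the mask
```

In this toy setting the mask halves the number of effective weights relative to a dense layer, and a change in a SNP assigned to one gene cannot alter the activation of the other gene node. With genome-scale inputs the same construction would normally use sparse matrices rather than a dense mask.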
3. Limitations and Criticism of Machine Learning
4. Future Prospects
4.1. Multimodal Omics Databases
4.2. Opportunities of Large Language models and Foundation Models
5. Conclusions
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
