Palheta, H.; Gonçalves, W.G.; Brito, L.M.; Ribeiro dos Santos, A.; Matsumoto, M.; Ribeiro-dos-Santos, Â.K.; Araújo, G.S. AmazonForest: In-silico Meta-Prediction of Pathogenic Variants. Preprints 2020, 2020110519. https://doi.org/10.20944/preprints202011.0519.v1
APA Style
Palheta, H., Gonçalves, W.G., Brito, L.M., Ribeiro dos Santos, A., Matsumoto, M., Ribeiro-dos-Santos, Â.K., & Araújo, G.S. (2020). AmazonForest: In-silico Meta-Prediction of Pathogenic Variants. Preprints. https://doi.org/10.20944/preprints202011.0519.v1
Chicago/Turabian Style
Palheta, H., Ândrea Kely Ribeiro-dos-Santos and Gilderlanio Santana Araújo. 2020. "AmazonForest: In-silico Meta-Prediction of Pathogenic Variants." Preprints. https://doi.org/10.20944/preprints202011.0519.v1
Abstract
ClinVar is a web platform that stores around 774k curated entries and allows users to explore genetic variants and their associations with complex phenotypes. A subset of ClinVar’s variant associations is reported with conflicting interpretations or uncertain clinical significance, which challenges clinicians and geneticists. Here, we evaluate the performance of data pre-processing methods combined with classical prediction methods, such as Naive Bayes, Random Forest, and Support Vector Machine, to build a meta-prediction model that aims to improve the interpretation of genetic pathogenicity. Models were trained with ClinVar data (September 2020), and genetic variants were annotated with eight functional impact predictors catalogued with SnpEff/SnpSift (v4.3). Models were evaluated with 10-fold cross-validation using accuracy, F1-score, Receiver Operating Characteristic (ROC) curves, and Area Under the Curve (AUC). The best meta-prediction model arises from combining one-hot encoding with tree-based classifiers such as Random Forest, which shows AUC ≥ 0.93. We predict pathogenicity for 109k genetic variants that were labeled as uncertain significance or conflicting interpretation. Additionally, we implemented AmazonForest (https://www.lghm.ufpa.br/amazonforest), a web tool to query data for a set of 5k variants that were predicted with high pathogenic probability (RFprob ≥ 0.9).
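As a rough illustration of the workflow the abstract describes (one-hot encoding of predictor calls, a Random Forest classifier, 10-fold cross-validation scored by AUC, and a 0.9 probability cut-off), the sketch below uses scikit-learn on synthetic data. It is not the authors' code: the data are randomly generated, the call labels "D" (damaging), "T" (tolerated) and "U" (missing), the helper `simulate_calls`, and all hyperparameters are assumptions made only for this example.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

rng = np.random.default_rng(0)
N_PREDICTORS = 8  # the abstract mentions eight functional impact predictors


def simulate_calls(labels):
    """Hypothetical helper: synthetic categorical calls, one column per predictor.

    Each simulated predictor agrees with the true label about 80% of the time,
    and about 10% of calls are masked as missing ("U").
    """
    n = len(labels)
    agree = rng.random((n, N_PREDICTORS)) < 0.8
    calls = np.where(agree, labels[:, None], 1 - labels[:, None])
    X = np.where(calls == 1, "D", "T")
    X[rng.random(X.shape) < 0.1] = "U"
    return X


# Synthetic labelled training set: 1 = pathogenic, 0 = benign.
y_train = rng.integers(0, 2, size=400)
X_train = simulate_calls(y_train)

# One-hot encode the categorical calls, then fit a Random Forest.
model = make_pipeline(
    OneHotEncoder(handle_unknown="ignore"),
    RandomForestClassifier(n_estimators=100, random_state=0),
)

# 10-fold stratified cross-validation, scored by area under the ROC curve.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
auc_scores = cross_val_score(model, X_train, y_train, cv=cv, scoring="roc_auc")
print(f"mean AUC over {len(auc_scores)} folds: {auc_scores.mean():.3f}")

# Score synthetic "uncertain" variants and keep those with RFprob >= 0.9.
X_uncertain = simulate_calls(rng.integers(0, 2, size=100))
model.fit(X_train, y_train)
rf_prob = model.predict_proba(X_uncertain)[:, 1]  # probability of class 1
high_confidence = np.flatnonzero(rf_prob >= 0.9)
print(f"{high_confidence.size} of {rf_prob.size} variants have RFprob >= 0.9")
```

Wrapping the encoder and the classifier in a single pipeline means the encoding is refit inside each cross-validation fold, so the held-out fold does not influence the encoded feature set.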
Keywords
Meta-prediction; Encoding data; ClinVar; Classification; Random Forest; Naive Bayes; Support Vector Machine
Subject
Biology and Life Sciences, Anatomy and Physiology
Copyright:
This is an open access article distributed under the Creative Commons Attribution License which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.