Preprint Article, Version 1. Preserved in Portico. This version is not peer-reviewed.

AmazonForest: In-silico Meta-Prediction of Pathogenic Variants

Version 1 : Received: 18 November 2020 / Approved: 19 November 2020 / Online: 19 November 2020 (16:43:51 CET)

How to cite: Palheta, H.; Gonçalves, W.G.; Brito, L.M.; Ribeiro dos Santos, A.; Matsumoto, M.; Ribeiro-dos-Santos, Â.K.; Araújo, G.S. AmazonForest: In-silico Meta-Prediction of Pathogenic Variants. Preprints 2020, 2020110519 (doi: 10.20944/preprints202011.0519.v1).

Abstract

ClinVar is a web platform that stores around 774k curated entries and allows users to explore genetic variants and their associations with complex phenotypes. A subset of ClinVar's genetic associations is reported with conflicting interpretations or uncertain clinical significance, which currently challenges clinicians and geneticists. Here, we evaluate the performance of data pre-processing methods combined with classical prediction methods, such as Naive Bayes, Random Forest, and Support Vector Machine, to build a meta-prediction model that aims to improve the interpretation of genetic pathogenicity. Models were trained with ClinVar data (September 2020), and genetic variants were annotated with eight functional impact predictors catalogued with SnpEff/SnpSift (v4.3). Models were evaluated with 10-fold cross-validation using accuracy, F1-score, Receiver Operating Characteristic (ROC) curves, and Area Under the Curve (AUC). The best meta-prediction model arises from combining one-hot encoding with tree-based classifiers such as Random Forest, which reaches AUC ≥ 0.93. We predict pathogenicity for 109k genetic variants labeled as uncertain significance or conflicting interpretation. Additionally, we implemented AmazonForest (https://www.lghm.ufpa.br/amazonforest), a web tool to query a set of 5k variants that were predicted with high pathogenic probability (RFprob ≥ 0.9).
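The two data-handling steps named in the abstract, one-hot encoding of categorical predictor outputs and selecting variants with RFprob ≥ 0.9, can be illustrated with a minimal sketch. This is not the authors' code. The predictor names (`pred_a`, `pred_b`, `pred_c`), the category labels, the variant identifiers, the probabilities, and the choice to encode missing predictions as a separate `NA` category are all assumptions made for illustration.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical categorical outputs from three functional impact predictors.
# Labels such as "D" or "T" and the None (missing) entries are illustrative only.
variants: List[Dict[str, Optional[str]]] = [
    {"pred_a": "D", "pred_b": "P", "pred_c": "D"},
    {"pred_a": "T", "pred_b": "B", "pred_c": None},
    {"pred_a": "D", "pred_b": None, "pred_c": "N"},
]


def category_of(value: Optional[str]) -> str:
    """Map a missing prediction to its own 'NA' category (an assumption)."""
    return value if value is not None else "NA"


def build_vocabulary(rows: List[Dict[str, Optional[str]]]) -> List[Tuple[str, str]]:
    """Collect every (predictor, category) pair seen in the data, sorted."""
    pairs = set()
    for row in rows:
        for predictor, value in row.items():
            pairs.add((predictor, category_of(value)))
    return sorted(pairs)


def one_hot(row: Dict[str, Optional[str]], vocabulary: List[Tuple[str, str]]) -> List[int]:
    """Return one 0/1 column per (predictor, category) pair in the vocabulary."""
    return [
        1 if category_of(row.get(predictor)) == category else 0
        for predictor, category in vocabulary
    ]


def high_confidence(probabilities: Dict[str, float], threshold: float = 0.9) -> List[str]:
    """Keep variant IDs whose predicted pathogenic probability meets the threshold."""
    return [vid for vid, prob in probabilities.items() if prob >= threshold]


vocabulary = build_vocabulary(variants)
encoded = [one_hot(row, vocabulary) for row in variants]

# Hypothetical per-variant probabilities, standing in for a classifier's output.
rf_prob = {"var1": 0.97, "var2": 0.42, "var3": 0.90}
selected = high_confidence(rf_prob)
```

Each encoded row has one column per observed (predictor, category) pair, so a classifier such as Random Forest receives binary features rather than arbitrary label strings. The threshold of 0.9 mirrors the RFprob cutoff stated in the abstract.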

Subject Areas

Meta-prediction; Encoding data; ClinVar; Classification; Random Forest; Naive Bayes; Support Vector Machine
