ARTICLE | doi:10.20944/preprints202208.0201.v1
Subject: Life Sciences, Genetics Keywords: auto-encoder; high sparse binary data; feature extraction; SNV integration
Online: 10 August 2022 (10:27:32 CEST)
Genomics, involving tens of thousands of genes, is a complex system that determines phenotype. An important open question is how to integrate highly sparse genomic data, which carry a large number of minor effects, into a prediction model to improve predictive power. We find that deep learning can extract features effectively by transforming highly sparse dichotomous data into lower-dimensional continuous data in a non-linear way. This idea may benefit risk prediction based on genome-wide data, e.g. by integrating most of the information in the genotype data. We therefore developed a multi-stage strategy to extract information from highly sparse binary genotype data and applied it to risk prediction. Specifically, we first reduced the number of biomarkers to a moderate size using a univariable regression model. A trainable auto-encoder was then used to extract compact representations from the reduced data. Next, we fitted a LASSO model over a grid of tuning parameter values to select the optimal combination of extracted features. Finally, we applied this feature combination to two prognostic models and evaluated their predictive performance. Results from simulation studies and a real-data application indicated that these highly compressed transformed features could improve predictive performance and did not easily lead to over-fitting.
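The first three stages described above (univariable screening, auto-encoder compression, LASSO over a penalty grid) can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the correlation-based screening score, the single-hidden-layer architecture, the continuous outcome with squared-error loss, and the simple train/validation split are all hypothetical choices made for the example. The abstract does not specify any of them, and the prognostic models it mentions are not reproduced here.

```python
import numpy as np

def univariable_screen(X, y, keep):
    """Stage 1 (assumed form): rank features by |correlation| with y, keep the top `keep`."""
    Xc, yc = X - X.mean(axis=0), y - y.mean()
    denom = np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum()) + 1e-12
    score = np.abs(Xc.T @ yc) / denom
    return np.argsort(score)[::-1][:keep]

def train_autoencoder(X, n_hidden, epochs=200, lr=0.1, seed=0):
    """Stage 2 (assumed form): tanh encoder, sigmoid decoder, full-batch
    gradient descent on binary cross-entropy. Returns the encoder function."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    W1 = rng.normal(0, 0.1, (p, n_hidden)); b1 = np.zeros(n_hidden)
    W2 = rng.normal(0, 0.1, (n_hidden, p)); b2 = np.zeros(p)
    for _ in range(epochs):
        H = np.tanh(X @ W1 + b1)                  # continuous low-dimensional code
        R = 1 / (1 + np.exp(-(H @ W2 + b2)))      # reconstruction of the binary input
        dZ2 = (R - X) / n                         # gradient w.r.t. decoder pre-activation
        dZ1 = (dZ2 @ W2.T) * (1 - H ** 2)         # back-propagate through tanh
        W2 -= lr * H.T @ dZ2; b2 -= lr * dZ2.sum(axis=0)
        W1 -= lr * X.T @ dZ1; b1 -= lr * dZ1.sum(axis=0)
    return lambda Z: np.tanh(Z @ W1 + b1)

def lasso_cd(X, y, lam, n_iter=200):
    """Coordinate descent for (1/2n)||y - Xb||^2 + lam * ||b||_1."""
    n, p = X.shape
    b = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n + 1e-12
    r = np.array(y, dtype=float)                  # residual, starts at y because b = 0
    for _ in range(n_iter):
        for j in range(p):
            r += X[:, j] * b[j]                   # remove feature j's contribution
            rho = X[:, j] @ r / n
            b[j] = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_sq[j]  # soft-threshold
            r -= X[:, j] * b[j]
    return b

def select_lambda(H, y, grid, n_train):
    """Stage 3 (assumed form): choose the penalty with the lowest held-out squared error."""
    mu, sd = H[:n_train].mean(axis=0), H[:n_train].std(axis=0) + 1e-12
    Hs, y0 = (H - mu) / sd, y[:n_train].mean()
    best = None
    for lam in grid:
        b = lasso_cd(Hs[:n_train], y[:n_train] - y0, lam)
        err = np.mean((y[n_train:] - y0 - Hs[n_train:] @ b) ** 2)
        if best is None or err < best[0]:
            best = (err, lam, b)
    return best[1], best[2]
```

In use, one would pass the selected column indices from `univariable_screen` to `train_autoencoder`, encode the reduced genotype matrix, and hand the codes to `select_lambda`; the non-zero coefficients then identify the extracted features to carry forward into a downstream model.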
ARTICLE | doi:10.20944/preprints202007.0735.v1
Subject: Life Sciences, Genetics Keywords: Variant of Unknown Significance (VUS); Single-Nucleotide Variant (SNV); Variant Effect Prediction (VEP); Stacked Ensemble of Supervised Deep Learners (SESDL); Next Generation Sequencing (NGS); Alternative Allele Frequency (AAF).
Online: 31 July 2020 (06:13:53 CEST)
Pathogenicity is unknown for the majority of human gene variants. In silico approaches can be used to prioritize sequenced somatic and germline variants. In this study, 84 million non-synonymous Single Nucleotide Variants (SNVs) in the human coding genome were annotated using a consensus Variant Effect Prediction (cVEP) method. An algorithm, implemented as a stacked ensemble of supervised learners, combined 39 functional and conservation mutation impact scores from dbNSFP4.0. Adding a gene indispensability score, which accounts for differences in the pathogenicity of variants in essential versus mutation-tolerant genes, improved the predictions. For each SNV, the consensus combination gives either a continuous-valued pathogenicity score or a categorical score in five classes: pathogenic, likely pathogenic, uncertain significance, likely benign, and benign. The resulting class database is intended for direct use in clinical practice. The trained prediction models were 5-fold cross-validated on the evidence-based categorical annotations from the ClinVar database. The scores were ranked by their ability to predict pathogenicity. A two-step strategy using the rankings, scores, and class annotations is suggested for filtering and prioritizing human exome mutations in clinical and biological applications of NGS technology.
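To make the idea of a stacked ensemble concrete, the sketch below shows the generic stacking pattern in NumPy: base learners produce out-of-fold predictions on k folds, and a meta-learner is trained on those predictions. It is an illustration only and rests on several assumptions that are not in the abstract: the two base learners (a logistic regression and a nearest-centroid scorer), the binary benign/pathogenic label, the use of the 5 folds to generate meta-features, and every function name are hypothetical. The abstract states only that the ensemble combines impact scores and that the trained models were 5-fold cross-validated against ClinVar.

```python
import numpy as np

def fit_logistic(X, y, epochs=300, lr=0.5):
    """Gradient-descent logistic regression on standardized inputs; returns P(y=1 | x)."""
    mu, sd = X.mean(axis=0), X.std(axis=0) + 1e-12
    prep = lambda Z: np.hstack([(Z - mu) / sd, np.ones((len(Z), 1))])
    Xb, w = prep(X), np.zeros(X.shape[1] + 1)
    for _ in range(epochs):
        p = 1 / (1 + np.exp(-Xb @ w))
        w -= lr * Xb.T @ (p - y) / len(y)
    return lambda Z: 1 / (1 + np.exp(-prep(Z) @ w))

def fit_centroid(X, y):
    """Nearest-centroid scorer: sigmoid of the squared-distance gap to the two class means."""
    m0, m1 = X[y == 0].mean(axis=0), X[y == 1].mean(axis=0)
    def predict(Z):
        gap = ((Z - m0) ** 2).sum(axis=1) - ((Z - m1) ** 2).sum(axis=1)
        return 1 / (1 + np.exp(-np.clip(gap, -50, 50)))   # clip to avoid overflow
    return predict

def stack_fit(X, y, base_fitters, k=5, seed=0):
    """Train base learners on k-1 folds, predict the held-out fold, then fit a
    logistic meta-learner on the out-of-fold predictions."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), k)
    oof = np.zeros((len(y), len(base_fitters)))
    for test_idx in folds:
        train_idx = np.setdiff1d(np.arange(len(y)), test_idx)
        for j, fit in enumerate(base_fitters):
            oof[test_idx, j] = fit(X[train_idx], y[train_idx])(X[test_idx])
    meta = fit_logistic(oof, y)                    # meta-learner sees only unseen-fold scores
    bases = [fit(X, y) for fit in base_fitters]    # refit base learners on all data
    return lambda Z: meta(np.column_stack([b(Z) for b in bases]))
```

Here each row of `X` would hold the impact scores for one variant and `y` a 0/1 label; the returned function yields a continuous score in (0, 1). Mapping that score to five clinical classes would need cut-offs that the abstract does not give, so none are shown.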