Efficient Feature Extraction from High Sparse Binary Genotype Data for Genetic Risk Prediction by Deep Learning Method

Junjie Shen; Huijun Li; Xinghao Yu; Lu Bai; Yongfei Dong; Jianping Cao; Ke Lu; Zaixiang Tang

doi:10.20944/preprints202208.0201.v1

Submitted: 09 August 2022

Posted: 10 August 2022

Abstract

The genome, comprising tens of thousands of genes, is a complex system that determines phenotype. An important open question is how to integrate highly sparse genomic data, which carry a large number of minor effects, into a prediction model to improve predictive power. We find that deep learning methods can extract features effectively by transforming highly sparse dichotomous data into lower-dimensional continuous data in a non-linear way. This idea may benefit risk prediction based on genome-wide data, for example by integrating most of the information in the genotype data. Hence, we developed a multi-stage strategy to extract information from highly sparse binary genotype data and applied it to risk prediction. Specifically, we first reduced the number of biomarkers to a moderate size via a univariable regression model. Then a trainable auto-encoder was used to extract compact representations from the reduced data. Next, we solved the LASSO problem over a grid of tuning parameter values to select the optimal combination of extracted features. Finally, we applied this feature combination to two prognostic models and evaluated their predictive performance. The results of simulation studies and real data applications indicated that these highly compressed, transformed features could improve predictive performance and were not prone to over-fitting.
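The first three stages described in the abstract (univariable screening, auto-encoder compression, LASSO selection over a grid) can be sketched on toy data as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the simulated data, the continuous outcome, the correlation-based screening score (which gives the same ranking as a univariable linear regression p-value), the single-hidden-layer network, the hold-out split used to pick the tuning parameter, and all function names and sizes are hypothetical choices made for this example. The downstream prognostic models are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data (assumed): 300 samples, 400 sparse binary variants, continuous outcome.
n, p = 300, 400
X = (rng.random((n, p)) < 0.05).astype(float)
beta = np.zeros(p)
beta[:40] = rng.normal(0.0, 0.5, 40)
y = X @ beta + rng.normal(0.0, 1.0, n)
train, valid = np.arange(200), np.arange(200, 300)

def univariable_screen(X, y, keep):
    """Stage 1: rank features by |marginal correlation| with y, keep the top `keep`."""
    Xc, yc = X - X.mean(axis=0), y - y.mean()
    denom = np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum()) + 1e-12
    score = np.abs(Xc.T @ yc) / denom
    return np.argsort(score)[::-1][:keep]

def train_autoencoder(X, n_hidden, epochs=300, lr=0.2, seed=0):
    """Stage 2: tanh encoder + sigmoid decoder, full-batch gradient descent
    on the cross-entropy reconstruction loss. Returns the encoder function."""
    r = np.random.default_rng(seed)
    n, p = X.shape
    W1, b1 = r.normal(0.0, 0.1, (p, n_hidden)), np.zeros(n_hidden)
    W2, b2 = r.normal(0.0, 0.1, (n_hidden, p)), np.zeros(p)
    for _ in range(epochs):
        H = np.tanh(X @ W1 + b1)                     # encode
        R = 1.0 / (1.0 + np.exp(-(H @ W2 + b2)))     # decode
        dZ2 = (R - X) / n                            # grad w.r.t. decoder pre-activation
        dZ1 = (dZ2 @ W2.T) * (1.0 - H ** 2)          # back-propagate through tanh
        W2 -= lr * (H.T @ dZ2)
        b2 -= lr * dZ2.sum(axis=0)
        W1 -= lr * (X.T @ dZ1)
        b1 -= lr * dZ1.sum(axis=0)
    return lambda A: np.tanh(A @ W1 + b1)

def lasso_cd(Z, y, lam, n_iter=200):
    """Stage 3: minimise (1/2n)||y - b0 - Z b||^2 + lam * ||b||_1
    by cyclic coordinate descent. Returns (intercept, coefficients)."""
    n, k = Z.shape
    mu = Z.mean(axis=0)
    Zc = Z - mu
    col_sq = (Zc ** 2).sum(axis=0) / n
    b = np.zeros(k)
    resid = y - y.mean()
    for _ in range(n_iter):
        for j in range(k):
            if col_sq[j] == 0.0:
                continue
            rho = Zc[:, j] @ resid / n + col_sq[j] * b[j]
            new = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_sq[j]
            resid -= Zc[:, j] * (new - b[j])
            b[j] = new
    return y.mean() - mu @ b, b

# Fit every stage on the training split only, then tune lambda on the hold-out split.
kept = univariable_screen(X[train], y[train], keep=100)
encode = train_autoencoder(X[train][:, kept], n_hidden=8)
Z_train, Z_valid = encode(X[train][:, kept]), encode(X[valid][:, kept])

lams = np.logspace(-3, 0, 10)
val_mse = []
for lam in lams:
    b0, b = lasso_cd(Z_train, y[train], lam)
    val_mse.append(np.mean((y[valid] - b0 - Z_valid @ b) ** 2))
best_lam = lams[int(np.argmin(val_mse))]
intercept, coef = lasso_cd(Z_train, y[train], best_lam)
print("selected features:", np.flatnonzero(coef), "lambda:", best_lam)
```

Screening and the auto-encoder are fitted on the training split only, so the hold-out error used to choose the tuning parameter is not contaminated by the validation samples. In practice, cross-validation and an established deep learning library would likely replace the hold-out split and the hand-written network.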

Keywords: auto-encoder; high sparse binary data; feature extraction; SNV integration

Subject: Biology and Life Sciences - Biochemistry and Molecular Biology

Copyright: This open access article is published under a Creative Commons CC BY 4.0 license, which permits free download, distribution, and reuse, provided that the author and preprint are cited in any reuse.
