Preprint Article Version 1 Preserved in Portico This version is not peer-reviewed

Efficient Feature Extraction from High Sparse Binary Genotype Data for Genetic Risk Prediction by Deep Learning Method

Version 1 : Received: 9 August 2022 / Approved: 10 August 2022 / Online: 10 August 2022 (10:27:32 CEST)

How to cite: Shen, J.; Li, H.; Yu, X.; Bai, L.; Dong, Y.; Cao, J.; Lu, K.; Tang, Z. Efficient Feature Extraction from High Sparse Binary Genotype Data for Genetic Risk Prediction by Deep Learning Method. Preprints 2022, 2022080201. https://doi.org/10.20944/preprints202208.0201.v1

Abstract

Genomics, which involves tens of thousands of genes, is a complex system that determines phenotype. An interesting and vital question is how to integrate highly sparse genomic data, in which many variants have minor effects, into a prediction model to improve predictive power. We find that deep learning methods can extract features effectively by transforming highly sparse dichotomous data into lower-dimensional continuous data in a non-linear way. This idea may benefit risk prediction based on genome-wide data, e.g. by integrating most of the information in the genotype data. Hence, we developed a multi-stage strategy to extract information from highly sparse binary genotype data and applied it to risk prediction. Specifically, we first reduced the number of biomarkers to a moderate size via a univariable regression model. A trainable auto-encoder was then used to extract compact representations from the reduced data. Next, we fitted a LASSO model over a grid of tuning parameter values to select the optimal combination of extracted features. Finally, we applied this feature combination to two prognostic models and evaluated their predictive performance. The results of simulation studies and a real-data application indicated that these highly compressed transformed features could improve predictive performance and were not prone to over-fitting.
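
The first stage of the strategy above (univariable screening of sparse binary features) can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not state the regression family or the outcome type, so the sketch assumes a continuous outcome and ranks features by absolute Pearson correlation, which for a single-predictor linear regression gives the same ordering as the absolute t-statistic of the slope. The names univariable_score, screen_features and simulate, and all simulation settings (sample size, carrier frequency, effect size), are hypothetical.

```python
import math
import random


def univariable_score(x, y):
    """Absolute Pearson correlation between one binary feature and the outcome.

    Assumption: continuous outcome and a linear univariable model; the
    source abstract does not specify which regression model was used.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    if sxx == 0.0 or syy == 0.0:
        return 0.0  # constant column (e.g. variant never observed): no signal
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return abs(sxy) / math.sqrt(sxx * syy)


def screen_features(columns, y, keep):
    """Return the sorted indices of the `keep` highest-scoring features."""
    scores = [univariable_score(col, y) for col in columns]
    ranked = sorted(range(len(columns)), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:keep])


def simulate(n_samples=200, n_features=50, causal=(3, 17), seed=0):
    """Hypothetical sparse binary data: about 5% carriers per feature."""
    rng = random.Random(seed)
    columns = [
        [1 if rng.random() < 0.05 else 0 for _ in range(n_samples)]
        for _ in range(n_features)
    ]
    # Outcome depends on the causal features plus Gaussian noise.
    y = [
        sum(2.0 * columns[j][i] for j in causal) + rng.gauss(0.0, 0.5)
        for i in range(n_samples)
    ]
    return columns, y


columns, y = simulate()
selected = screen_features(columns, y, keep=10)
print("features kept after screening:", selected)
```

In the strategy described in the abstract, the retained columns would then be passed to the auto-encoder, and the encoded features to the LASSO selection step; those later stages are not sketched here.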

Keywords

auto-encoder; high sparse binary data; feature extraction; SNV integration

Subject

Biology and Life Sciences, Biochemistry and Molecular Biology
