Version 1: Received: 24 January 2022 / Approved: 25 January 2022 / Online: 25 January 2022 (07:53:01 CET)
Version 2: Received: 16 February 2022 / Approved: 16 February 2022 / Online: 16 February 2022 (02:57:38 CET)
Version 3: Received: 23 May 2022 / Approved: 23 May 2022 / Online: 23 May 2022 (11:16:49 CEST)
How to cite:
Fan, F.J.; Shi, Y. Effects of Data Quality and Quantity on Deep Learning for Protein-Ligand Binding Affinity Prediction. Preprints 2022, 2022010365. https://doi.org/10.20944/preprints202201.0365.v1.
Abstract
Prediction of protein-ligand binding affinities is crucial for computational drug discovery. A number of deep learning approaches have been developed in recent years to improve the accuracy of such affinity prediction. While the predictive power of these models has advanced to some degree depending on the dataset used for training and testing, the effects of the quality and quantity of the underlying data have not been thoroughly examined. In this study, we employed erroneous datasets and data subsets of different sizes, created from one of the largest databases of experimental binding affinities, to train and evaluate a deep learning system based on convolutional neural networks. Our results show that data quality and quantity have significant impacts on the performance of the trained models. Depending on the variations in data quality and quantity, the performance differences could be comparable to or even larger than those observed among different deep learning approaches. This implies that continued accumulation of high-quality affinity data is important for improving deep learning models to better predict protein-ligand binding affinities.
Keywords
binding affinity prediction; machine learning; data quality; data quantity; deep learning
Subject
LIFE SCIENCES, Biochemistry
Copyright:
This is an open access article distributed under the Creative Commons Attribution License which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.