Preprint Article, Version 3. Preserved in Portico. This version is not peer-reviewed.

Effects of Data Quality and Quantity on Deep Learning for Protein-Ligand Binding Affinity Prediction

Version 1 : Received: 24 January 2022 / Approved: 25 January 2022 / Online: 25 January 2022 (07:53:01 CET)
Version 2 : Received: 16 February 2022 / Approved: 16 February 2022 / Online: 16 February 2022 (02:57:38 CET)
Version 3 : Received: 23 May 2022 / Approved: 23 May 2022 / Online: 23 May 2022 (11:16:49 CEST)

A peer-reviewed article of this Preprint also exists.

Fan, F. J.; Shi, Y. Effects of Data Quality and Quantity on Deep Learning for Protein-Ligand Binding Affinity Prediction. Bioorganic & Medicinal Chemistry, 2022, 72, 117003. https://doi.org/10.1016/j.bmc.2022.117003.

Abstract

Prediction of protein-ligand binding affinities is crucial for computational drug discovery. A number of deep learning approaches have been developed in recent years to improve the accuracy of such affinity prediction. While the predictive power of these systems has advanced to varying degrees depending on the dataset used for model training and testing, the effects of the quality and quantity of the underlying data have not been thoroughly examined. In this study, we employed erroneous datasets and data subsets of different sizes, created from one of the largest databases of experimental binding affinities, to train and evaluate a deep learning system based on convolutional neural networks. Our results show that data quality and quantity have significant impacts on the prediction performance of trained models. Depending on the variations in data quality and quantity, the performance discrepancies can be comparable to or even larger than those observed among different deep learning approaches. In particular, the presence of proteins during model training leads to a dramatic increase in prediction accuracy. This implies that continued accumulation of high-quality affinity data, especially for new protein targets, is indispensable for improving deep learning models to better predict protein-ligand binding affinities.
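
As a rough illustration of the two kinds of data perturbation the abstract describes (erroneous labels and subsets of different sizes), the minimal sketch below adds zero-mean Gaussian error to a list of affinity labels and draws a random subset of a given fraction. It is not the authors' procedure: the function names `add_gaussian_noise` and `subsample`, the `sigma` and `fraction` parameters, and the toy label values are all assumptions made for this example.

```python
import random


def add_gaussian_noise(affinities, sigma, seed=0):
    """Return a copy of the labels with zero-mean Gaussian error added.

    `sigma` is the standard deviation of the injected error (hypothetical
    parameter; the study's actual error model may differ).
    """
    rng = random.Random(seed)
    return [a + rng.gauss(0.0, sigma) for a in affinities]


def subsample(records, fraction, seed=0):
    """Return a random subset holding roughly `fraction` of the records."""
    rng = random.Random(seed)
    k = max(1, round(len(records) * fraction))
    return rng.sample(records, k)


# Hypothetical toy labels (for example, pKd values); not data from the study.
labels = [5.2, 6.8, 7.1, 4.9, 8.3, 6.0, 7.7, 5.5]

noisy = add_gaussian_noise(labels, sigma=0.5)  # degraded data quality
half = subsample(labels, fraction=0.5)         # reduced data quantity
```

Training the same model on `noisy` versus `labels`, or on `half` versus the full list, and comparing the test metrics is one way such quality and quantity effects could be measured.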

Keywords

binding affinity prediction; machine learning; data quality; data quantity; deep learning

Subject

Biology and Life Sciences, Biochemistry and Molecular Biology

Comments (1)

Comment 1
Received: 23 May 2022
Commenter: Yun Shi
Commenter's Conflict of Interests: Author
Comment: Further revisions were made, including an analysis of the effects of normally distributed errors, a clarification of the differences between this study and previous ones, and a reiteration of our finding that data could have a larger impact than algorithms on how well DL models predict protein-ligand binding affinities.