Preprint Article, Version 1. Preserved in Portico. This version is not peer-reviewed.

A Parametric Approach to Molecular Encodings of Carbon-based Multilevel Atomic Neighborhoods

Version 1: Received: 16 February 2022 / Approved: 18 February 2022 / Online: 18 February 2022 (02:19:27 CET)

How to cite: Hattab, G.; Neumann, N.; Anžel, A.; Heider, D. A Parametric Approach to Molecular Encodings of Carbon-based Multilevel Atomic Neighborhoods. Preprints 2022, 2022020220. https://doi.org/10.20944/preprints202202.0220.v1

Abstract

Exploring new ways to represent and discover organic molecules is critical for developing novel therapies. Recent advances in bioinformatics have made virtual screening of databases possible. However, biochemical data must first be encoded by computer algorithms to become machine-readable, and the encoding must account for distance and similarity measures to support tasks such as similarity searching. Motivated by the ubiquity of carbon and the structured patterns that emerge from it, we propose a parametric approach to molecular encodings of carbon-based multilevel atomic neighborhoods. It implements a walk along the carbon chain of an organic molecule to compute different representations of its feature encoding, in the form of a binary or numerical array that can later be exported as an image. The resulting encodings are reproducible and readily formatted for various domain tasks, including machine learning. The approach was evaluated using 10-fold stratified cross-validation for binary classification with eight data sets and six different encodings (384 models) in the domain of cell-penetrating peptides. The parametric approach is built on open-source software and is implemented as a Python package (cmangoes). Source code and documentation are available at https://github.com/ghattab/cmangoes.
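
To make the idea of a carbon-centred, multilevel neighborhood encoding concrete, the following is a minimal sketch and not the cmangoes implementation or its API. It assumes a toy molecule (isopropanol with hydrogens omitted) given as a hand-written atom table and bond list, an assumed element vocabulary `ALPHABET`, and hypothetical helper names (`build_adjacency`, `neighbors_at_level`, `encode`). Visiting carbons in index order stands in for the walk along the carbon chain, and "level" is taken to mean bond distance from the current carbon; both are simplifying assumptions.

```python
from collections import deque

# Hypothetical toy molecule: isopropanol, CH3-CH(OH)-CH3, hydrogens omitted.
ATOMS = {0: "C", 1: "C", 2: "C", 3: "O"}
BONDS = [(0, 1), (1, 2), (1, 3)]
ALPHABET = ["C", "N", "O", "S"]  # assumed element vocabulary (one column each)


def build_adjacency(bonds):
    """Turn a bond list into an undirected adjacency map."""
    adj = {}
    for a, b in bonds:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj


def neighbors_at_level(adj, start, level):
    """Return atoms whose shortest bond distance from `start` equals `level`."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if dist[node] == level:
            continue  # do not expand beyond the requested level
        for nxt in adj.get(node, ()):
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return sorted(n for n, d in dist.items() if d == level)


def encode(atoms, bonds, level=1, binary=True):
    """One row per carbon atom, one column per element in ALPHABET.

    binary=True  -> 1 if the element occurs at the given level, else 0
    binary=False -> number of occurrences at the given level
    """
    adj = build_adjacency(bonds)
    rows = []
    for idx in sorted(atoms):  # simplified stand-in for the carbon-chain walk
        if atoms[idx] != "C":
            continue
        counts = [0] * len(ALPHABET)
        for n in neighbors_at_level(adj, idx, level):
            if atoms[n] in ALPHABET:
                counts[ALPHABET.index(atoms[n])] += 1
        rows.append([int(c > 0) for c in counts] if binary else counts)
    return rows


if __name__ == "__main__":
    print(encode(ATOMS, BONDS, level=1, binary=True))
    print(encode(ATOMS, BONDS, level=1, binary=False))
    print(encode(ATOMS, BONDS, level=2, binary=True))
```

In this sketch, `level` and `binary` play the role of the parameters: changing them yields different array representations of the same molecule, each of which could be flattened into a feature vector or rendered as an image.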

Keywords

parametric; encoding; fingerprinting; machine learning; classification; transporter; cell-penetrating peptide.

Subject

Computer Science and Mathematics, Artificial Intelligence and Machine Learning
