Version 1
Received: 16 February 2022 / Approved: 18 February 2022 / Online: 18 February 2022 (02:19:27 CET)
How to cite:
Hattab, G.; Neumann, N.; Anžel, A.; Heider, D. A Parametric Approach to Molecular Encodings of Carbon-based Multilevel Atomic Neighborhoods. Preprints 2022, 2022020220. https://doi.org/10.20944/preprints202202.0220.v1.
Abstract
Exploring new ways to represent and discover organic molecules is critical for developing novel therapies. Recent advances in bioinformatics have made virtual screening of databases possible. However, biochemical data must first be encoded in a machine-readable form that accounts for distance and similarity measures, so that tasks such as similarity searching are supported. Motivated by the ubiquity of carbon and the structured patterns that emerge from it, we propose a parametric approach to molecular encodings of carbon-based multilevel atomic neighborhoods. The approach walks along the carbon chain of an organic molecule to compute different representations of its feature encoding as a binary or numerical array, which can later be exported as an image. The resulting encodings are reproducible and readily formatted for various domain tasks, including machine learning. The approach was evaluated using 10-fold stratified cross-validation for binary classification with eight data sets and six different encodings (384 models) in the domain of cell-penetrating peptides. It is built on open-source software and implemented as a Python package (cmangoes). Source code and documentation are available at https://github.com/ghattab/cmangoes.
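To make the idea of a carbon-centred, multilevel neighborhood encoding concrete, the following is a minimal sketch and not the cmangoes API. It assumes a molecule is given as a list of element symbols plus a list of bonded index pairs, that the "walk" visits carbon atoms in index order, and that a level-k neighborhood means all atoms within k bonds of a carbon. The function name `neighborhood_encoding`, its parameters, and the element alphabet are hypothetical and chosen only for illustration.

```python
from collections import deque

# Ethanol (CH3-CH2-OH) as a toy input: element symbols and bonded index pairs.
ETHANOL_ATOMS = ["C", "C", "O", "H", "H", "H", "H", "H", "H"]
ETHANOL_BONDS = [(0, 1), (1, 2), (0, 3), (0, 4), (0, 5), (1, 6), (1, 7), (2, 8)]


def neighborhood_encoding(atoms, bonds, level=1, binary=False,
                          alphabet=("C", "H", "O")):
    """Hypothetical sketch: one row per carbon atom, one column per element.

    Each row counts the atoms of every element in `alphabet` found within
    `level` bonds of the carbon (numerical encoding), or flags their
    presence with 0/1 (binary encoding).
    """
    # Build an undirected adjacency list from the bond pairs.
    adjacency = {i: [] for i in range(len(atoms))}
    for a, b in bonds:
        adjacency[a].append(b)
        adjacency[b].append(a)

    rows = []
    for start, element in enumerate(atoms):
        if element != "C":
            continue  # assumption: only carbon atoms are walked
        counts = dict.fromkeys(alphabet, 0)
        seen = {start}
        queue = deque([(start, 0)])
        # Breadth-first search, not expanding atoms at the depth limit.
        while queue:
            node, depth = queue.popleft()
            if depth == level:
                continue
            for nxt in adjacency[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    if atoms[nxt] in counts:
                        counts[atoms[nxt]] += 1
                    queue.append((nxt, depth + 1))
        row = [counts[e] for e in alphabet]
        rows.append([int(c > 0) for c in row] if binary else row)
    return rows


numerical = neighborhood_encoding(ETHANOL_ATOMS, ETHANOL_BONDS, level=1)
binary = neighborhood_encoding(ETHANOL_ATOMS, ETHANOL_BONDS, level=1, binary=True)
```

In this sketch the `level` and `binary` arguments play the role of the parameters: raising `level` widens each carbon's neighborhood, and `binary` switches between presence flags and counts. The resulting list of rows is a small 2D array, which is the kind of structure that could be written out as an image.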
Copyright:
This is an open access article distributed under the Creative Commons Attribution License which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.