Preprint Article Version 1 Preserved in Portico This version is not peer-reviewed

Enhancing the Classification of Biosynthetic Gene Clusters through Comprehensive NLP-Based Approach

Version 1 : Received: 23 October 2023 / Approved: 24 October 2023 / Online: 25 October 2023 (11:36:13 CEST)

How to cite: Mishra, D.C.; Madival, S.D.; Sharma, A.; Budhlakoti, N.; Chaturvedi, K.K.; Angadi, U.B.; Farooqi, M.S.; Srivastava, S.; Basavaraja, P.; Arora, A.; Jha, G.K.; Rai, S. Enhancing the Classification of Biosynthetic Gene Clusters through Comprehensive NLP-Based Approach. Preprints 2023, 2023101564. https://doi.org/10.20944/preprints202310.1564.v1

Abstract

Biosynthetic gene clusters (BGCs) are specific genomic regions in microorganisms, such as bacteria and fungi, that are responsible for producing bioactive compounds. Identifying these clusters is difficult because of their diverse nature. This research presents a comprehensive approach to effective BGC identification. The study focuses on five classes of natural products (NPs): PKS, NRPS, RiPP, Terpenes, and Hybrid PKS-NRPS. Data were gathered from the MIBiG database in GBK format. Protein sequences were extracted from each file, and sequences under the same BGC ID were combined. Physicochemical properties were calculated, and sequence embeddings specific to each NP class were generated using the NLP techniques CountVec, TFIDF, and Word2Vec. An integrated feature matrix was created by merging the physicochemical properties with the generated embeddings. This matrix was used to train and test nine ML models, including SVM and RF. Each of the three embeddings was evaluated with and without SMOTE for class balancing, giving six datasets, and Grid Search was used for parameter optimization. Training the nine models on each of the six datasets produced 54 models. The LR model using TFIDF with SMOTE emerged as the most effective, achieving an accuracy of 0.96, an AUC of 0.9912, and other strong metrics. This method enhances BGC identification for drug development and offers a broader understanding of BGC applications in medicine and biotechnology.
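As an illustration of the embedding step and the experimental grid described in the abstract, the following is a minimal sketch and not the authors' implementation. It assumes that protein sequences are tokenized into overlapping 3-mers before TF-IDF weighting (the abstract does not state the tokenization), uses a plain tf × (ln(N/df) + 1) weighting, and relies on two made-up toy sequences with hypothetical BGC IDs. The function names `kmer_tokens` and `tfidf` are illustrative only.

```python
import math
from collections import Counter
from itertools import product


def kmer_tokens(sequence, k=3):
    """Split a protein sequence into overlapping k-mers ("words").

    Assumption: k-mer tokenization with k=3; the abstract does not specify this.
    """
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def tfidf(documents):
    """Return one {token: weight} dict per document.

    Weight = term frequency * (ln(N / document frequency) + 1).
    """
    n_docs = len(documents)
    doc_freq = Counter(tok for doc in documents for tok in set(doc))
    weights = []
    for doc in documents:
        counts = Counter(doc)
        total = len(doc)
        weights.append({
            tok: (cnt / total) * (math.log(n_docs / doc_freq[tok]) + 1.0)
            for tok, cnt in counts.items()
        })
    return weights


# Toy combined protein sequences, one per hypothetical BGC ID (not real data).
bgc_sequences = {
    "BGC_A": "MKTAYIAKQR",
    "BGC_B": "MKTLLVAGGS",
}
docs = [kmer_tokens(seq) for seq in bgc_sequences.values()]
embeddings = tfidf(docs)

# Experimental grid from the abstract:
# 3 embeddings x 2 balancing settings = 6 datasets; 9 classifiers each = 54 models.
embedding_methods = ["CountVec", "TFIDF", "Word2Vec"]
balancing = ["without SMOTE", "with SMOTE"]
datasets = list(product(embedding_methods, balancing))
n_classifiers = 9
n_models = len(datasets) * n_classifiers
```

In this toy example, a 3-mer shared by both sequences receives a lower weight than one unique to a single sequence, which is the property that lets TF-IDF emphasize class-specific sequence motifs. In practice, a library vectorizer and an oversampling implementation would replace these hand-written helpers.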

Keywords

Biosynthetic Gene Clusters; Natural Language Processing; Machine Learning; Hybrid PKS-NRPS; SMOTE

Subject

Biology and Life Sciences, Biochemistry and Molecular Biology
