Submitted: 16 September 2025
Posted: 17 September 2025
Abstract
Background: Invasive fungal infections (IFIs) are a pressing global health threat, particularly for immunocompromised individuals, yet antifungal drug development continues to lag behind that of antibacterial therapeutics. In this study, we present a data-driven machine learning framework that combines cheminformatics and supervised learning to predict antifungal compound activity.

Methods: A curated dataset of 3,748 positive (antifungal) and 4,096 negative (non-antifungal) compounds was constructed from ChEMBL, ChemDiv, and HMDB. Chemical class assignment with NPClassifier and Tanimoto similarity filtering ensured non-overlapping, structurally meaningful training data. We extracted 217 molecular descriptors per compound and compared the physicochemical properties of the positive and negative sets, confirming a statistically significant divergence in Lipinski parameters (p < 0.001).

Results: Feature selection using model-specific importance metrics identified key descriptors such as molecular weight, van der Waals surface area, and nitrogen group counts. Several supervised learning models were trained and evaluated using five-fold cross-validation: Random Forest (RF), Extreme Gradient Boosting (XGBoost), Support Vector Machines (SVMs) with RBF, polynomial, and sigmoid kernels, and a Multi-Layer Perceptron (MLP). RF and MLP achieved the highest AUC of 0.996, with XGBoost (0.995) and SVM-RBF (0.986) close behind. To assess generalizability, we introduced chemical class-based cross-validation, in which compounds were partitioned by chemical class to reduce information leakage between training and test folds. Metrics dropped slightly relative to random splits, but RF, XGBoost, MLP, and SVM-RBF retained balanced accuracies above 0.91, while the polynomial and sigmoid SVMs reached about 0.88. These results demonstrate the promise of integrating molecular informatics with machine learning for antifungal drug discovery and highlight the importance of rigorous validation strategies aligned with chemical diversity.
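The chemical class-based cross-validation described in the abstract can be sketched as follows. This is a minimal illustration and not the authors' implementation: the function names `class_based_folds` and `train_test_splits`, the greedy fold-balancing rule, the fold count, and the toy class labels are all assumptions made for the example. The only property taken from the text is that every compound of a given chemical class ends up on the same side of each train/test split.

```python
from collections import defaultdict


def class_based_folds(class_labels, n_folds=5):
    """Assign every chemical class, as a whole, to exactly one fold.

    class_labels[i] is the (hypothetical) class label of compound i.
    Returns a list of n_folds lists of compound indices.
    """
    members = defaultdict(list)
    for idx, label in enumerate(class_labels):
        members[label].append(idx)

    folds = [[] for _ in range(n_folds)]
    # Assumption: place the largest classes first, each into the fold that
    # is currently smallest, so that fold sizes stay roughly balanced.
    for label in sorted(members, key=lambda k: (-len(members[k]), k)):
        smallest = min(range(n_folds), key=lambda f: len(folds[f]))
        folds[smallest].extend(members[label])
    return folds


def train_test_splits(folds):
    """Yield (train_indices, test_indices), holding out one fold at a time."""
    for held_out in range(len(folds)):
        test_idx = sorted(folds[held_out])
        train_idx = sorted(
            i for f, fold in enumerate(folds) if f != held_out for i in fold
        )
        yield train_idx, test_idx


if __name__ == "__main__":
    # Toy labels for illustration only; not data from the study.
    labels = ["alkaloid", "alkaloid", "terpenoid", "polyketide",
              "terpenoid", "shikimate", "alkaloid", "polyketide"]
    for train_idx, test_idx in train_test_splits(class_based_folds(labels, 3)):
        held_out_classes = sorted({labels[i] for i in test_idx})
        print("held-out classes:", held_out_classes, "test indices:", test_idx)
```

Because no class is shared between the training and test indices of a split, a model cannot score well on the held-out fold merely by recognizing close structural relatives of training compounds, which is the leakage that a random split permits.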
Keywords:
1. Introduction
2. Materials and Methods
2.1. Dataset Preparation
2.2. Compositional Analysis
2.3. Calculation of Physicochemical Properties
2.4. Use of Non-Parametric Methods for Antifungal Compound Identification
2.5. Machine Learning
2.5.1. Hyperparameter Tuning
2.5.2. Feature Selection
2.5.3. Neural Network
2.5.4. Model Training and Validation
2.5.5. Chemical Class-Based Cross-Validation
3. Results
3.1. Compositional Analysis
3.2. Calculation of Physicochemical Properties
3.3. Use of Non-Parametric Methods for Antifungal Compound Identification
3.4. Machine Learning
3.4.1. Feature Selection
3.4.2. Model Training and Validation
3.4.2.1. Random Split Cross-Validation
3.4.2.2. Chemical Class-Based Cross-Validation
4. Discussion
5. Conclusion
Supplementary Materials
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgements
Conflicts of Interest
Abbreviations
| Abbreviation | Definition |
| --- | --- |
| RF | Random Forest |
| SVM | Support Vector Machine |
| XGBoost | Extreme Gradient Boosting |
| NN | Neural Network |
| PCA | Principal Component Analysis |
| t-SNE | t-Distributed Stochastic Neighbor Embedding |
| UMAP | Uniform Manifold Approximation and Projection |
| MCC | Matthews Correlation Coefficient |
| AUC | Area Under the ROC Curve |

Model performance under random-split five-fold cross-validation.

| Model | Balanced Accuracy | Precision | Recall | F1 | MCC | AUC |
| --- | --- | --- | --- | --- | --- | --- |
| Random Forest | 0.972±0.002 | 0.963±0.003 | 0.979±0.003 | 0.971±0.002 | 0.944±0.004 | 0.996±0.001 |
| XGBoost | 0.973±0.003 | 0.964±0.003 | 0.979±0.006 | 0.971±0.002 | 0.945±0.006 | 0.995±0.001 |
| SVM Polynomial | 0.942±0.006 | 0.938±0.012 | 0.942±0.008 | 0.940±0.007 | 0.885±0.013 | 0.980±0.003 |
| SVM RBF | 0.955±0.003 | 0.936±0.008 | 0.971±0.004 | 0.953±0.003 | 0.909±0.006 | 0.986±0.002 |
| SVM Sigmoid | 0.901±0.007 | 0.897±0.008 | 0.897±0.011 | 0.897±0.007 | 0.803±0.013 | 0.940±0.006 |
| Neural Network | 0.977±0.004 | 0.976±0.005 | 0.977±0.006 | 0.976±0.004 | 0.954±0.008 | 0.996±0.001 |

Model performance under chemical class-based cross-validation.

| Model | Balanced Accuracy | Precision | Recall | F1 | MCC | AUC |
| --- | --- | --- | --- | --- | --- | --- |
| Random Forest | 0.933 | 0.923 | 0.923 | 0.922 | 0.878 | 0.986 |
| XGBoost | 0.933 | 0.933 | 0.919 | 0.926 | 0.883 | 0.986 |
| SVM Polynomial | 0.880 | 0.708 | 0.861 | 0.764 | 0.669 | 0.941 |
| SVM RBF | 0.911 | 0.819 | 0.882 | 0.841 | 0.759 | 0.972 |
| SVM Sigmoid | 0.881 | 0.838 | 0.848 | 0.843 | 0.754 | 0.951 |
| Neural Network | 0.932 | 0.928 | 0.935 | 0.914 | 0.862 | 0.981 |
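
The columns reported above are standard binary classification metrics. As a reading aid, the sketch below shows how balanced accuracy, precision, recall, F1, and MCC follow from the four confusion-matrix counts. It is an illustration under assumptions: the function name `binary_metrics` and the example counts are hypothetical and not taken from the study, and the source does not state whether its values were computed per fold, per class, or on pooled predictions. AUC is omitted because it depends on ranked prediction scores rather than on counts at a single threshold.

```python
import math


def _ratio(numerator, denominator):
    """Divide safely; return 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def binary_metrics(tp, fp, tn, fn):
    """Compute common binary metrics from confusion-matrix counts.

    tp/fn: antifungal compounds predicted positive/negative.
    tn/fp: non-antifungal compounds predicted negative/positive.
    """
    recall = _ratio(tp, tp + fn)        # sensitivity, true positive rate
    specificity = _ratio(tn, tn + fp)   # true negative rate
    precision = _ratio(tp, tp + fp)
    f1 = _ratio(2 * tp, 2 * tp + fp + fn)
    mcc_denominator = math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    mcc = _ratio(tp * tn - fp * fn, mcc_denominator)
    return {
        "balanced_accuracy": (recall + specificity) / 2,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "mcc": mcc,
    }


if __name__ == "__main__":
    # Made-up counts for illustration only; not results from the study.
    for name, value in binary_metrics(tp=90, fp=20, tn=80, fn=10).items():
        print(f"{name}: {value:.3f}")
```

Balanced accuracy and MCC both account for performance on the negative class, which makes them more informative than precision or recall alone when positive and negative sets differ in size, as they do here (3,748 versus 4,096 compounds).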
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).