Preprint
Article

This version is not peer-reviewed.

PAID: An AI-Ready LC-MS/MS Dataset for Pesticide Residue Analysis

Submitted: 23 June 2026

Posted: 23 June 2026


Abstract
Liquid chromatography-tandem mass spectrometry (LC-MS/MS) is widely employed in pesticide residue analysis. Machine learning methods for automated spectral interpretation depend on large, well-curated training datasets; however, publicly available pesticide mass spectrometry data are fragmented across heterogeneous repositories, lack standardized preprocessing, and suffer from incomplete metadata. We introduce PAID (Pesticide AI-ready Dataset), comprising two curated LC-MS/MS spectral collections derived from 15 public sources (GNPS, MassIVE, MoNA, MassBank). Starting from 91,420 raw spectra and following an initial pesticide-directed screening, a reproducible seven-step pipeline (multi-source integration, spectral cleaning, deduplication, metadata standardization, quality scoring, stratified splitting, and feature engineering) yields PAID-Strict (7,527 spectra; 3,197 compounds) and PAID-Extended (21,292 spectra; 3,224 compounds). Both versions cover eight pesticide categories across QTOF and Orbitrap platforms, with core metadata fields (SMILES, InChIKey, molecular formula) exceeding 98% completeness. A feature suite of 32 chemoinformatic descriptors, 2,214 molecular fingerprints, and 30 spectral features is provided alongside the spectra. Benchmark classification of the eight pesticide categories using XGBoost and LightGBM achieved 81.5% accuracy. The dataset, code, and pre-computed features are publicly available under CC BY 4.0 and MIT licenses (DOI: 10.57760/sciencedb.35414).
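The counts quoted in the abstract can be cross-checked with a few lines of arithmetic. The sketch below is not part of the PAID code release. The variable and function names are illustrative assumptions, and only the numbers come from the abstract. It computes the share of the raw spectra retained in each subset, the mean number of spectra per compound, and the total feature count across the three feature groups.

```python
# Counts quoted in the abstract. All names here are illustrative, not from the PAID code.
RAW_SPECTRA = 91_420
SUBSETS = {
    "PAID-Strict": {"spectra": 7_527, "compounds": 3_197},
    "PAID-Extended": {"spectra": 21_292, "compounds": 3_224},
}
FEATURES = {"descriptors": 32, "fingerprints": 2_214, "spectral": 30}


def summarize(name):
    """Return the fraction of raw spectra retained and the mean spectra per compound."""
    subset = SUBSETS[name]
    return {
        "retained_fraction": subset["spectra"] / RAW_SPECTRA,
        "spectra_per_compound": subset["spectra"] / subset["compounds"],
    }


# Sum of the three feature groups listed in the abstract.
TOTAL_FEATURES = sum(FEATURES.values())

if __name__ == "__main__":
    for name in SUBSETS:
        stats = summarize(name)
        print(
            f"{name}: {stats['retained_fraction']:.1%} of raw spectra, "
            f"{stats['spectra_per_compound']:.2f} spectra per compound"
        )
    print(f"Total features across the three groups: {TOTAL_FEATURES}")
```

The retained fraction is measured against the 91,420 raw spectra, that is, before the pesticide-directed screening, so it reflects both the screening and the seven-step pipeline. Whether the three feature groups are stored as one concatenated vector is not stated in the abstract; the total is simply their sum.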
Copyright: This open access article is published under a Creative Commons CC BY 4.0 license, which permits free download, distribution, and reuse, provided that the author and preprint are cited in any reuse.

Preprints.org is a free preprint server supported by MDPI in Basel, Switzerland.

© 2026 MDPI (Basel, Switzerland) unless otherwise stated