Preprint
Article

This version is not peer-reviewed.

GSPM-YOLO: A Lightweight Detection Method for Pillboxes in Complex Environments Based on Improved YOLOv11

Submitted: 03 April 2026

Posted: 07 April 2026


Abstract
In unmanned pharmacy and home-care medicine management applications, reliable pillbox localization is a prerequisite for automated dispensing and grasping. However, existing detectors still perform poorly in complex environments where dense stacking, occlusion, weak illumination, and high inter-class similarity occur simultaneously. To address this problem, GSPM-YOLO is proposed as an improved detector built on the YOLOv11 framework for complex pillbox recognition, and four novel plug-and-play lightweight modules are developed: GSimConv, a lightweight dual-branch convolution module that incorporates the Attention Weight Calculation Algorithm in HardSAM for edge-preserving feature extraction, PSCAM for position-sensitive coordinate attention, MSAAM, a multi-scale strip-pooling module that integrates the Horizontal Context-Aware Attention weight calculation algorithm to strengthen occluded targets, and LGPFH for bidirectional ghost pyramid fusion. To simulate the complex operating environments of dispensing robots, we construct MBox-Complex, a dataset of 3{,}041 images with 8{,}153 annotations across 25 drug categories. Ablation experiments first validate the effectiveness of the four-module composition, with F1 rising from 0.641 to 0.714, and each module is then individually compared with advanced replacement schemes in dedicated substitution experiments to verify its own effectiveness. The integrated model is then benchmarked against advanced detectors and domain-specific methods on the self-constructed MBox-Complex dataset, achieving 0.727 mAP@50 and 0.427 mAP@50-95 with 3.8M parameters and surpassing YOLOv11 by 7.1 and 4.0 percentage points and YOLOv12 by 4.3 and 3.1 percentage points, respectively. Further cross-dataset evaluation on the VOC and Brain Tumor benchmark datasets verifies the transferability of the proposed model. 
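The margins above are stated in percentage points over the baselines; the baseline mAP@50 scores are implied rather than given directly. A small arithmetic check of what those margins entail (the derived baseline values are inferences from the abstract's numbers, not figures reported by the authors):

```python
# Back out the implied baseline mAP@50 scores from the reported margins.
gspm_map50 = 0.727          # GSPM-YOLO, mAP@50 on MBox-Complex (reported)
yolov11_margin_pp = 7.1     # reported improvement over YOLOv11, in percentage points
yolov12_margin_pp = 4.3     # reported improvement over YOLOv12, in percentage points

# One percentage point = 0.01 on the [0, 1] mAP scale.
yolov11_map50 = round(gspm_map50 - yolov11_margin_pp / 100, 3)
yolov12_map50 = round(gspm_map50 - yolov12_margin_pp / 100, 3)
print(yolov11_map50, yolov12_map50)  # implied baselines: 0.656 and 0.684
```

This also makes the relative scale of the claim explicit: a 7.1-point gain over an implied 0.656 baseline is roughly a 10% relative improvement in mAP@50.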
Grad-CAM is adopted to visualize the detector's attention distribution, and the resulting heatmaps together with detection visualizations confirm that the proposed model focuses more precisely on stacked and occluded regions.
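The heatmaps referenced here follow the standard Grad-CAM recipe (Selvaraju et al.): channel weights are obtained by global-average-pooling the class-score gradients over a convolutional layer, the feature maps are combined with those weights, and a ReLU keeps only positively contributing regions. A minimal NumPy sketch of that computation, independent of the paper's specific model and layers (the toy activation/gradient tensors are illustrative only):

```python
import numpy as np

def grad_cam(activations, gradients):
    """Grad-CAM heatmap for one conv layer.

    activations: (C, H, W) feature maps at the target layer
    gradients:   (C, H, W) gradient of the class score w.r.t. those maps
    Returns an (H, W) map normalized to [0, 1].
    """
    # Channel importance: global-average-pool the gradients per channel.
    weights = gradients.mean(axis=(1, 2))             # (C,)
    # Weighted sum of feature maps over the channel axis.
    cam = np.tensordot(weights, activations, axes=1)  # (H, W)
    cam = np.maximum(cam, 0)                          # ReLU: keep positive evidence
    if cam.max() > 0:
        cam /= cam.max()                              # normalize for visualization
    return cam

# Toy example: channel 0 fires on a small region and carries positive gradient,
# so the heatmap should be hot inside that region and cold elsewhere.
acts = np.zeros((2, 4, 4)); acts[0, 1:3, 1:3] = 1.0
grads = np.zeros((2, 4, 4)); grads[0] = 1.0
heat = grad_cam(acts, grads)
print(heat[2, 2], heat[0, 0])  # 1.0 inside the region, 0.0 outside
```

In practice the (H, W) map is upsampled to the input resolution and overlaid on the image, which is what produces heatmaps like those described for the stacked and occluded pillbox regions.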
Copyright: This open access article is published under a Creative Commons CC BY 4.0 license, which permits free download, distribution, and reuse, provided that the author and preprint are cited in any reuse.

Preprints.org is a free preprint server supported by MDPI in Basel, Switzerland.
