In complex traffic environments, image degradation caused by haze, low illumination, and occlusion significantly undermines the reliability of vehicle and pedestrian detection. To address these challenges, this paper proposes an aerial vision framework that tightly couples multi-level image enhancement with a lightweight detection architecture. At the image preprocessing stage, a cascaded “dehazing + illumination enhancement” module is constructed. Specifically, a learning-based dehazing method, Learning Hazing to Dehazing, is employed to restore long-range details degraded by atmospheric scattering. HVI-CIDNet is then applied to decouple luminance and chrominance in the Horizontal/Vertical Intensity (HVI) color space, improving structural fidelity in low-light regions while maintaining global brightness consistency. On the detection side, a lightweight yet robust architecture, termed GDEIM-SF, is designed. It adopts GoldYOLO as the lightweight backbone and integrates D-FINE as an anchor-free decoder. Two additional modules, CAPR and ASF, are incorporated to enhance high-frequency edge modeling and multi-scale semantic alignment, respectively. Evaluated on the VisDrone dataset, the proposed method improves core metrics such as mAP@50–90 by approximately 2.5–2.7 percentage points over comparable lightweight models (e.g., the DEIM baseline and YOLOv12s), while maintaining a low parameter count and computational overhead. The framework thus balances detection accuracy, inference efficiency, and deployment adaptability, providing a practical solution for UAV-based visual perception under challenging imaging conditions.
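The cascaded preprocessing idea described above (dehaze first, then correct illumination, then detect) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: `contrast_stretch` and `gamma_brighten` are simple classical stand-ins for the learned dehazing and HVI-CIDNet stages, `detect_bright_pixels` is a toy placeholder for the detector, and none of these names or operations come from the paper or from any of the cited models. The sketch only shows how the stages compose.

```python
from typing import Callable, List, Sequence

Image = List[List[float]]  # grayscale image, pixel values in [0, 1]
Stage = Callable[[Image], Image]


def contrast_stretch(img: Image) -> Image:
    """Hypothetical stand-in for dehazing: rescale intensities to span [0, 1]."""
    lo = min(min(row) for row in img)
    hi = max(max(row) for row in img)
    if hi == lo:  # flat image: nothing to stretch, return a copy
        return [row[:] for row in img]
    return [[(p - lo) / (hi - lo) for p in row] for row in img]


def gamma_brighten(img: Image, gamma: float = 0.5) -> Image:
    """Hypothetical stand-in for illumination enhancement: gamma < 1 lifts dark pixels."""
    return [[p ** gamma for p in row] for row in img]


def cascade(stages: Sequence[Stage]) -> Stage:
    """Compose stages so each one consumes the previous stage's output."""
    def run(img: Image) -> Image:
        for stage in stages:
            img = stage(img)
        return img
    return run


def detect_bright_pixels(img: Image, threshold: float = 0.6) -> int:
    """Toy placeholder for the detector: count pixels above a threshold."""
    return sum(p > threshold for row in img for p in row)


# A dim, low-contrast toy image standing in for a hazy aerial frame.
hazy = [[0.30, 0.32], [0.34, 0.40]]
enhance = cascade([contrast_stretch, gamma_brighten])
enhanced = enhance(hazy)

print("responses before enhancement:", detect_bright_pixels(hazy))
print("responses after enhancement: ", detect_bright_pixels(enhanced))
```

Ordering the stages this way means the illumination step operates on contrast that has already been restored, which is the motivation for a cascade rather than two independent corrections; in a real system each placeholder would be replaced by the corresponding trained network.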