Real-time video object detection on unmanned aerial vehicles (UAVs) is essential for urban inspection and autonomous perception, yet its deployment on edge devices is severely constrained by the high computational cost of accurate detectors, the quantization sensitivity of hybrid convolution-attention networks, and the system-level latency of full video processing pipelines. To address these challenges, we present DUST-YOLO, a deployment-oriented algorithm-hardware co-design framework for lightweight and efficient UAV small-object detection on edge platforms. First, we introduce a multi-dimensional structured pruning strategy that applies asymmetric channel pruning to convolutional and feature-fusion modules while compressing the Swin Transformer prediction heads and bottleneck stacks, thereby reducing parameters and computation with limited impact on multi-scale representation capability. Second, we develop a hardware-aware mixed-precision quantization-aware training (QAT) scheme that maps computation-intensive backbone layers to INT8 while preserving the Transformer-related modules in FP16, improving inference efficiency while mitigating the accuracy loss caused by uniform low-bit quantization. Third, we compile the optimized network with TensorRT and integrate the resulting inference engine into a DeepStream-based asynchronous video pipeline on the edge platform, enabling end-to-end acceleration by reducing decoding, preprocessing, and memory-transfer overheads. Experiments on the VisDrone2019-DET dataset, with deployment on an NVIDIA Jetson Orin NX, demonstrate that DUST-YOLO achieves 43.7% mAP@0.5 with an end-to-end latency of 36.3 ms and a throughput of 27.5 FPS. Compared with a state-of-the-art baseline detector, DUST-YOLO reduces end-to-end latency by 56.9%, improves end-to-end video throughput by 2.31×, and lowers total energy consumption by 68.5%.
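The hardware-aware mixed-precision scheme described above can be illustrated with a minimal precision-assignment sketch. This is not the authors' implementation: the module names and the substring-based matching rule are hypothetical placeholders that only convey the stated policy of INT8 for computation-intensive convolutional layers and FP16 for Transformer-related modules.

```python
# Illustrative sketch (hypothetical module names, not the DUST-YOLO code):
# assign per-module precisions following the abstract's policy --
# convolutional backbone/fusion layers -> INT8, Swin Transformer
# attention modules -> FP16 to avoid the accuracy loss that uniform
# low-bit quantization would cause in attention layers.

def assign_precision(module_names):
    """Map each module name to a quantization precision string."""
    policy = {}
    for name in module_names:
        # Keep Transformer-related modules in FP16.
        if "swin" in name or "attn" in name:
            policy[name] = "fp16"
        else:
            # Convolutional layers tolerate INT8 well and gain the
            # most throughput from INT8 kernels on edge GPUs.
            policy[name] = "int8"
    return policy

modules = [
    "backbone.conv1",          # computation-intensive conv -> INT8
    "neck.fusion.conv",        # feature-fusion conv -> INT8
    "head.swin_block1.attn",   # Transformer attention -> FP16
]
print(assign_precision(modules))
```

In a real QAT pipeline this mapping would drive which modules receive fake-quantization observers during training and which are exported in half precision to TensorRT.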