1. Introduction
Rapid global urbanization and industrial expansion have exacerbated water pollution, transforming it into a critical global challenge. Consequently, the monitoring and governance of floating litter have garnered significant international attention [1]. Floating debris, primarily composed of vegetation, plastic fragments, and anthropogenic waste [2], tends to form high-density accumulations in river channels and coastal zones due to hydrodynamic forces. Non-degradable plastic waste, in particular, disrupts aquatic ecosystems, threatens fisheries and navigation safety, and endangers human health through bioaccumulation [3]. Estimates suggest that 23 million tons of plastic waste entered aquatic ecosystems in 2016 alone, with projections soaring to 53 million tons by 2030 without effective intervention [4]. Traditional management strategies, relying on manual patrols and vessel-based salvage, exhibit significant latency and are constrained by visual range, weather conditions, and labor costs, preventing high-frequency, all-weather coverage [5].
In this context, Unmanned Aerial Vehicle (UAV) low-altitude remote sensing has emerged as a transformative paradigm. Surpassing the limitations of satellite remote sensing, UAVs offer centimeter-level spatial resolution and deployment flexibility, enabling real-time data acquisition in complex river networks [6]. This technology has been successfully applied to wildlife tracking [7], oil spill monitoring [8], and vegetation classification [9,10]. Deploying deep learning-based object detection algorithms on UAV platforms for automated litter identification has become a focal point in environmental remote sensing [11,12,13].
However, algorithm performance relies heavily on domain-specific datasets. General-purpose datasets (e.g., COCO [14]) fail to capture the unique aerial perspectives and optical characteristics of water scenes. Recent benchmarks such as SeaClear [15] and the River Floating Debris Dataset [16] have begun to address this gap. Furthermore, large-scale datasets like FMPD [17] have enhanced sample diversity and included annotations for complex environmental factors, such as reflections. Parallel to data accumulation, standardizing monitoring protocols is crucial. Research has focused on optimizing flight parameters and imaging configurations [18,19] and balancing coverage with resolution under varying hydrological conditions [20] to ensure data consistency and scientific reproducibility.
Despite the maturity of object detection in terrestrial scenarios [21,22], domain adaptation for water surface floating litter presents unique difficulties. As illustrated in Figure 1, UAV-based detection faces a triad of challenges stemming from target characteristics, environmental interference, and edge deployment constraints [23].
1) Small Scale and Non-Rigid Deformation: Due to the wide field of view of aerial imagery, floating litter often occupies only a minimal pixel area, with scales fluctuating drastically with flight altitude. While feature pyramids and context enhancement strategies have been proposed to mitigate this [24,25], targets also undergo non-rigid deformation caused by water flow, partial submersion, or occlusion. During the downsampling process of deep Convolutional Neural Networks, the fine-grained features of these irregular targets are prone to information loss, leading to significant missed detections.
2) Complex Environmental Noise: Natural water surfaces are highly dynamic [26]. Interferences such as specular reflections, glints, and shadows of shoreline vegetation create high-frequency structural noise. These visual artifacts are easily confused with targets like white foam or plastic bags, hindering the extraction of discriminative features by traditional networks. Consequently, integrating attention mechanisms or frequency-domain analysis to suppress background noise has become a mainstream approach for enhancing robustness [27].
3) SWaP Constraints on Edge Deployment: UAV inspection tasks necessitate deployment on embedded platforms with strict Size, Weight, and Power (SWaP) constraints, requiring real-time inference at video frame rates [28]. Existing high-performance models typically incur high computational costs, making it difficult to balance accuracy and speed under low-power conditions, thereby restricting practical engineering implementation.
To address these lacunae, this paper proposes FLD-Net, a lightweight, real-time detection model designed for UAV visual perception. Furthermore, to alleviate the scarcity of high-quality annotated data, we construct UAV-Flow, the first multi-scenario dataset for floating litter from a UAV perspective. The primary contributions of this work are summarized as follows:
We construct UAV-Flow, a benchmark dataset that fills the gap for small and non-rigid targets in water environments. Covering diverse hydrological and lighting conditions, it provides critical data support for researching algorithm robustness.
We propose FLD-Net, a real-time detection framework robust against water surface interference. By integrating three novel mechanisms—DFEM, DCFN, and DANA—the model effectively overcomes the technical bottlenecks of deformation adaptation, feature loss, and background noise suppression.
We implement a high-performance edge deployment scheme. Validation on embedded platforms demonstrates the system’s real-time capability and energy efficiency, offering a viable solution for low-cost, automated intelligent water monitoring.
2. Related Work
The automated monitoring of floating litter represents an interdisciplinary synergy of computer vision, remote sensing technology, and environmental engineering. With the rapid advancement of deep learning, research in this domain has witnessed a paradigm shift from traditional image processing methods to intelligent recognition based on Convolutional Neural Networks (CNNs). This section systematically reviews the existing literature along two dimensions, data benchmarks and detection algorithms, while analyzing the specific challenges encountered in UAV-based aquatic operations.
2.1. Surface Floating Debris Datasets
High-quality annotated data serves as the cornerstone for driving performance improvements in deep learning models. Although numerous marine and aquatic litter datasets have emerged in recent years, existing public repositories exhibit significant domain discrepancies and scenario limitations when supporting fine-grained detection tasks from a low-altitude UAV perspective. Aquatic litter can be categorized into three distinct types: beach litter, surface floating litter, and benthic debris [29]. This study specifically targets waste that floats or suspends on open water surfaces (including oceans, lakes, and rivers) driven by currents or wind—typically low-density, high-buoyancy items such as plastic bags, bottles, foam, and branches.
The most representative dataset, SeaClear [15], contains 8,610 images of shallow water environments and serves as a critical benchmark. However, it primarily focuses on underwater scenarios where image characteristics are dominated by light attenuation and scattering, with background interference stemming largely from benthic organisms or sediment. In contrast, UAV orthophotography is dominated by surface specular reflections, where background noise consists of dynamic wave glint, spray, and shoreline vegetation reflections. Another category, such as the Floater dataset [30], targets inland floating debris but is derived mainly from shore-based surveillance cameras. The horizontal or oblique angles of shore-based monitoring result in severe inter-object occlusion and perspective distortion, failing to simulate the spatial geometric features and scale distributions inherent to the UAV orthographic view.
While some studies have attempted to construct UAV-perspective datasets, they often suffer from limited sample sizes and scenario homogeneity. The River Floating Debris dataset [16] comprises only 840 samples from a single river scene; its limited magnitude and lack of annotation for varied flight altitudes fail to capture the scale variations essential for operational UAV missions. Similarly, the HAIDA Trash Dataset [31] focuses on coastal and marine floating litter, covering common items like plastic bottles and fishing nets, yet its small scale is insufficient to support the generalization of deep CNNs against illumination changes and complex background clutter. The recently proposed FMPD dataset [17] expands both data volume and scenario diversity but is largely collected under ideal conditions with favorable lighting and calm water surfaces, exhibiting a benign-water bias. Real-world aquatic environments are replete with uncontrollable interference factors; consequently, models trained on such biased data may perform well on test sets but suffer a drastic decline in robustness when deployed under complex lighting and adverse hydrological conditions.
Therefore, constructing a benchmark dataset that encompasses multiple scenarios, multi-scale targets, and rich environmental noise is a prerequisite for addressing the robustness deficiencies of current algorithms. This necessity underpins the motivation for developing the UAV-Flow dataset in this study.
2.2. UAV Object Detection
UAV remote sensing imagery presents severe challenges to traditional object detection algorithms due to its high spatial resolution, wide field of view, and complex backgrounds. Unlike natural scene images, UAV aerial photography is characterized by drastic target scale variations and intricate background textures [
21]. Addressing these characteristics, existing research primarily focuses on three core dimensions: small target feature enhancement, complex background suppression, and lightweight edge deployment.
Regarding small target detection, UAV perspectives result in ground targets occupying only a minimal pixel area. Although the YOLO series [32,33,34] and its variants [35,36] optimize multi-scale feature fusion via Path Aggregation Networks (PANet), continuous downsampling operations in deep CNNs inevitably lead to feature information loss. To mitigate this, LAR-YOLOv8 [37] and MSD-YOLOn [38] attempt to improve small target recall by expanding receptive fields or reinforcing shallow feature reuse. Their primary strategies include introducing attention mechanisms to enhance weak feature capture and employing multi-scale fusion architectures to integrate cross-level information, thereby preventing small targets from being submerged in deep features. Furthermore, RFLA [39] and NWD [40] introduce label assignment strategies based on Gaussian distribution modeling and the Wasserstein distance, respectively. While these approaches alleviate the sensitivity of positive/negative sample matching from a loss-function perspective, the loss of spatial information during the feature extraction stage remains a physical bottleneck constraining detection accuracy.
In terms of noise resistance in complex backgrounds, UAV imagery often includes urban scenes, vegetation cover, or uneven illumination, leading to significant background and occlusion interference. Attention mechanisms are widely regarded as effective means to enhance feature discriminability. FFCA-YOLO [24] utilizes channel attention to amplify salient feature responses while suppressing invalid background channels. However, traditional global attention often relies on statistical information, making it difficult to suppress high-frequency background noise while preserving the structural details of small targets. In comparison, RTD-Net [41] and RT-DETR [42] incorporate Transformer architectures, utilizing self-attention mechanisms to model long-range dependencies and effectively distinguish targets from complex backgrounds. However, their quadratic complexity causes high latency, restricting their use in real-time tasks.
Regarding real-time edge deployment, with the proliferation of edge computing in UAV payloads, balancing inference latency and detection accuracy on resource-constrained devices has become a critical hurdle. Model lightweighting techniques such as pruning, quantization, and knowledge distillation have been extensively explored. For instance, Drone-YOLO [36] maintains high detection accuracy under real-time conditions through a lightweight "sandwich" fusion mechanism, while LUD-YOLO [43] significantly reduces parameters via channel pruning. Nevertheless, a trade-off between accuracy and efficiency persists in practical applications: excessive parameter compression often sacrifices the feature extraction capability for weak and small targets, increasing the miss rate; conversely, high-precision models typically incur massive Floating Point Operations (FLOPs), making it difficult to meet real-time video stream processing requirements within limited power budgets.
In summary, while existing methods have progressed in their respective dimensions, constructing a lightweight model that simultaneously possesses small target perception and noise resistance remains a key scientific problem to be solved in UAV remote sensing. Moreover, the proposed FLD-Net must specifically account for the additional challenges posed by dynamic water surface backgrounds.
3. UAV-Flow Dataset
As elucidated in
Section 2.1, existing benchmarks for floating litter detection universally suffer from domain discrepancies (e.g., underwater or shore-based views), insufficient sample magnitudes, and scenario homogeneity relative to UAV remote sensing imagery. To bridge this data hiatus and provide robust training support for detection models, this study constructs
UAV-Flow, the first multi-scenario benchmark dataset tailored for the UAV perspective.
3.1. Dataset Construction
To guarantee data diversity and fidelity, the dataset construction workflow, as delineated in
Figure 2, comprises four distinct phases: data acquisition, screening, annotation, and partitioning.
Data acquisition was conducted using the DJI M350 RTK platform. This platform was selected for its centimeter-level positioning accuracy and high stability against wind, ensuring imaging quality under complex meteorological conditions. To build a robust sample library, the acquisition strategy emphasized spatiotemporal heterogeneity. Three typical aquatic environments were selected: urban river channels, natural lakes, and nearshore mudflats. These cover a spectrum of hydrological conditions, from static to dynamic flow and from clear to turbid water. A multi-variable flight strategy was employed, encompassing varying altitudes (15 m–100 m), multiple viewing angles, and diverse time slots. This approach effectively captured the visual feature variations of floating debris under different illumination intensities, shadow occlusions, and water surface reflection conditions. Over 5,000 raw images were acquired. Following rigorous manual screening, images exhibiting motion blur, severe overexposure, or background-only scenes were discarded, resulting in a core dataset of 4,593 high-resolution valid images.
Addressing the challenges of sparse distribution and high annotation costs for aquatic debris, we proposed a semi-supervised annotation strategy to balance efficiency and precision. First, approximately 20% of typical samples were manually annotated via the Makesense.ai platform to establish a high-quality seed dataset. Subsequently, an initial detection model was trained on this seed set to generate pseudo-labels for the remaining 80% of the data using the X-AnyLabeling tool. Human experts were then introduced to verify and fine-tune the automatically generated labels, focusing on correcting missed minute targets and refining boundary regression deviations. Finally, all annotations were standardized to the YOLO format and randomly partitioned into training, validation, and testing sets in a 7:2:1 ratio to ensure unbiased evaluation.
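As a rough illustration of the pseudo-labelling step described above, the sketch below uses an Ultralytics-style seed model to write YOLO-format label files for the unlabeled pool. The file names, confidence threshold, and the use of the `ultralytics` package are illustrative assumptions rather than the exact tooling used in this study (pseudo-labels were generated with X-AnyLabeling and then verified by human annotators).

```python
# Hypothetical pseudo-labelling sketch for the semi-supervised annotation step.
from pathlib import Path
from ultralytics import YOLO

def generate_pseudo_labels(weights="seed_model.pt", unlabeled_dir="unlabeled",
                           out_dir="pseudo_labels", conf_thres=0.5):
    model = YOLO(weights)  # seed model trained on the ~20% manually annotated subset
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    for img in sorted(Path(unlabeled_dir).glob("*.jpg")):
        result = model.predict(source=str(img), conf=conf_thres, verbose=False)[0]
        lines = []
        for cls, xywhn in zip(result.boxes.cls.tolist(), result.boxes.xywhn.tolist()):
            # YOLO format: class x_center y_center width height (all normalized)
            lines.append(f"{int(cls)} " + " ".join(f"{v:.6f}" for v in xywhn))
        (Path(out_dir) / f"{img.stem}.txt").write_text("\n".join(lines))
    # The generated labels are subsequently verified and corrected by human annotators.
```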
3.2. Statistical Analysis and Characteristics
The dataset contains 20,618 annotated instances covering five common categories of anthropogenic waste in natural waters: plastic bottles, plastic bags, courier cartons, Styrofoam blocks, and waste paper. The class distribution is relatively balanced; specifically, plastic bottles—the most representative surface pollutant—account for approximately 30% of instances, effectively mitigating model bias caused by long-tail distributions.
To quantify the scale characteristics of floating litter in UAV imagery, this study strictly adheres to the definition standards of the MS COCO benchmark [14]. Targets are classified into three scale levels based on the pixel area $A$ of the bounding box: small objects ($A < 32^2$ pixels), medium objects ($32^2 \le A \le 96^2$ pixels), and large objects ($A > 96^2$ pixels). Statistical analysis of UAV-Flow based on these criteria (Figure 3) reveals significant small-scale characteristics. Small objects constitute 78.9% of all annotated instances. Notably, the proportion of small objects for plastic bottles, waste paper, and courier cartons exceeds 80%. This scale distribution authentically replicates the visual challenges encountered during high-altitude UAV patrols, where fine-grained feature information is prone to loss during downsampling.
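For clarity, the scale-partitioning rule above can be expressed as a small helper; the thresholds follow the MS COCO convention cited in the text, while the function itself is only an illustrative sketch.

```python
def scale_level(box_w: float, box_h: float) -> str:
    """Classify a bounding box by pixel area using the MS COCO thresholds."""
    area = box_w * box_h
    if area < 32 ** 2:       # small: A < 32^2 pixels
        return "small"
    if area <= 96 ** 2:      # medium: 32^2 <= A <= 96^2 pixels
        return "medium"
    return "large"           # large: A > 96^2 pixels

# Example: a 20 x 20 px bottle -> "small"; a 120 x 90 px carton -> "large".
```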
To validate the cross-domain generalization capability of the model, UAV-Flow maintains high diversity in scenario composition, comprising urban rivers (46%), natural lakes (37%), and nearshore mudflats (17%). This multi-environment coverage ensures the inclusion of diverse interference modes, ranging from static zones with strong specular reflections to dynamic zones with complex background clutter. As presented in
Table 1, compared to the single-perspective limitations of Floater and FMPD, or the small sample sizes of River Floating Debris and HAIDA, UAV-Flow achieves an order-of-magnitude improvement in image quantity, small object proportion, and environmental heterogeneity. In summary, UAV-Flow provides a moderately scaled, high-quality, and challenging multi-scenario benchmark for surface floating litter detection, offering solid data support for subsequent algorithmic research and model generalization evaluation.
4. Methodology
To address the challenges of high small-target density, non-rigid geometric deformations, and water surface background noise interference inherent in UAV-based floating litter detection, this article proposes FLD-Net, a lightweight real-time detection framework optimized for edge deployment. To satisfy the stringent requirements for low power consumption and real-time inference on UAV platforms, we select YOLOv11 as the baseline model due to its superior balance between speed and accuracy. As illustrated in Figure 4, we introduce three mechanism-enhancement modules into the backbone, neck, and pre-detection stages of the YOLOv11 architecture: the Deformable Feature Extraction Module (DFEM), the Dynamic Cross-Scale Fusion Network (DCFN), and the Dual-Domain Anti-Noise Attention (DANA). These innovative modules form a continuous enhancement chain from low-level feature acquisition to high-level semantic decision-making, achieving the preservation of deformed small-target information, efficient cross-scale feature fusion, and background noise suppression.
4.1. Deformable Feature Extraction Module
In UAV orthographic remote sensing imagery, floating litter often exhibits highly irregular, non-rigid geometric deformations due to hydrodynamic impact and partial submersion. Standard Convolutional Neural Networks (CNNs) rely on fixed geometric grids for feature sampling. This rigid sampling mechanism, when processing floating objects with stochastic morphological variations, is prone to sampling points falling on background regions rather than the target body, resulting in spatial mismatch during feature extraction and loss of semantic information [
44].
To surmount the physical limitations of standard convolution in geometric modeling, Dai et al. [45] first proposed Deformable Convolutional Networks (DCN), which introduce a set of learnable 2D offsets to the standard sampling grid, endowing the convolution kernel with the ability to adaptively adjust the receptive field shape. Subsequently, Zhu et al. [46] introduced a modulation mechanism in DCNv2, enhancing the model’s ability to suppress irrelevant backgrounds by adding weight masks. Recent studies have widely demonstrated the significant advantages of DCN in visual tasks involving complex geometric transformations. Xiao et al. [47] utilized multi-scale deformable convolution alignment to effectively resolve spatial misalignment caused by dynamic motion in satellite video, confirming the robustness of DCN in feature alignment for dynamically drifting targets. Similarly, Liu et al. [48] further showed that DCN can adaptively fit the geometric boundaries of complex topological structures such as slender and curved shapes, significantly breaking through the physical limitations of traditional convolution in modeling irregular targets.
Inspired by these studies and given the non-rigid characteristics of floating debris, this research posits that introducing deformable convolution is critical to resolving feature spatial mismatch and semantic loss. To this end, we designed the DFEM, intended to replace traditional convolution units in the deep layers of the backbone network, enabling the network to actively deform to fit the target’s true geometric contour. DFEM adopts a hierarchical nested architecture as shown in Figure 5. This module follows the topological design principles of Cross Stage Partial (CSP) networks, balancing model expressiveness and inference cost through gradient flow splitting and feature recombination.
As shown in Figure 5a, the input feature flow of DFEM is first divided into a backbone branch and a residual branch via a Split operation. The backbone branch enters the core feature extraction component, designed as a C3_DCN unit whose repetition is configured according to network depth. C3_DCN retains the efficiency of the YOLO series’ C3 module but introduces a deformation mechanism within its stacked bottleneck layers. The residual branch connects directly to the end via a convolution. The two branches are finally merged via a Concat operation and fused by a terminal convolution layer. This design not only ensures lossless gradient flow during backpropagation but also effectively reduces video memory consumption after introducing complex operators.
The core innovation of DFEM lies in the micro-level Bottleneck_DCN unit, as shown in Figure 5b. This unit reconstructs the standard residual block by replacing the standard convolution with modulated deformable convolution. Specifically, let the input feature map be $X$ and the regular sampling grid of standard convolution be $R$ (e.g., $R = \{(-1,-1), (-1,0), \dots, (1,1)\}$ for a $3 \times 3$ kernel). In Bottleneck_DCN, for each position $p_{0}$ on the output feature map $Y$, the sampling process is reconstructed as:
$$ Y(p_{0}) = \sum_{p_{k} \in R} w_{k} \cdot X\left(p_{0} + p_{k} + \Delta p_{k}\right) \cdot \Delta m_{k} $$
where $p_{k}$ denotes the fixed offset within the regular sampling grid $R$, and $w_{k}$ represents the kernel weight corresponding to the sampling position. The equation introduces two key variables: the learnable offset $\Delta p_{k}$ and the modulation scalar $\Delta m_{k}$, which are dynamically predicted by the network based on the input feature $X$. Breaking the constraints of the regular grid, the network predicts a 2D offset vector $\Delta p_{k}$ for each sampling point, shifting the actual sampling position to $p_{0} + p_{k} + \Delta p_{k}$. This allows the convolution kernel to deform, enabling its receptive field to adaptively stretch, rotate, or bend to fit the actual geometric contour of the floating litter. $\Delta m_{k} \in (0, 1)$ is a scalar activated by a Sigmoid function, serving as the weight of each sampling point. If a sampling point falls into an irrelevant background region even after offsetting, the network can suppress its contribution by driving $\Delta m_{k}$ toward zero. Since the shifted sampling position generally has non-integer coordinates, the feature value at this sub-pixel location is computed via bilinear interpolation, ensuring gradient differentiability during backpropagation.
Through this design, Bottleneck_DCN can dynamically adjust the geometric shape of the receptive field according to the actual morphology of the floating litter, breaking the physical constraints of the rigid grid in standard convolution. Ultimately, via CSP structural feature aggregation, DFEM achieves high-fidelity feature extraction for non-rigid targets in deep networks, significantly ameliorating the issue of missed detections caused by geometric mismatch.
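A compact PyTorch sketch of the modulated deformable sampling described above is given below. It relies on `torchvision.ops.deform_conv2d`, which implements the DCNv2-style formulation with offsets and a modulation mask; the channel widths, initialization, and the simple residual wrapper are illustrative assumptions rather than the exact FLD-Net implementation.

```python
# Minimal Bottleneck_DCN-style block: one modulated deformable 3x3 convolution
# inside a residual connection (a sketch, not the released FLD-Net code).
import torch
import torch.nn as nn
from torchvision.ops import deform_conv2d

class BottleneckDCN(nn.Module):
    def __init__(self, channels: int, kernel_size: int = 3):
        super().__init__()
        k = kernel_size
        self.k = k
        self.weight = nn.Parameter(torch.empty(channels, channels, k, k))
        nn.init.kaiming_uniform_(self.weight, a=5 ** 0.5)
        # Predicts 2 offsets (x, y) and 1 modulation scalar per sampling point.
        self.offset_mask = nn.Conv2d(channels, 3 * k * k, kernel_size=k, padding=k // 2)
        nn.init.zeros_(self.offset_mask.weight)
        nn.init.zeros_(self.offset_mask.bias)   # start as a plain (undeformed) convolution

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        om = self.offset_mask(x)
        offset = om[:, : 2 * self.k * self.k]            # learnable offsets (Delta p_k)
        mask = torch.sigmoid(om[:, 2 * self.k * self.k:])  # modulation scalars (Delta m_k)
        y = deform_conv2d(x, offset, self.weight, padding=self.k // 2, mask=mask)
        return x + y                                      # residual connection of the bottleneck

# Example: BottleneckDCN(256)(torch.randn(1, 256, 40, 40)) keeps the (1, 256, 40, 40) shape.
```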
4.2. Dynamic Cross-Scale Fusion Network
In UAV aquatic remote sensing monitoring, floating litter exhibits cross-scale distribution characteristics due to variations in flight altitude, with a predominance of small targets. In the UAV-Flow dataset, small-scale targets (pixel area $A < 32^2$ pixels) account for as much as 78.9% of instances. Feature Pyramid Networks (FPN) are the standard paradigm for solving such multi-scale problems [49]. However, applying generic FPN architectures directly to water surface environments faces two core bottlenecks. First, while shallow feature maps retain high-resolution spatial details, they also contain substantial high-frequency background noise composed of water ripples and glints; excessive fusion of shallow features often degrades the Signal-to-Noise Ratio (SNR) of the feature maps. Second, traditional bilinear interpolation is content-agnostic, calculating sampling points based solely on spatial distance [50]. When processing floating objects with weak textures or blurred edges, this interpolation strategy is prone to sub-pixel level feature aliasing, leading to the loss of edge information for minute targets during fusion. Consequently, this paper proposes DCFN, which reconstructs the cross-scale feature fusion path through the synergistic design of asymmetric topology pruning and content-aware upsampling.
To achieve efficient feature aggregation under limited edge computing budgets, DCFN utilizes the P3, P4, and P5 feature layers of the backbone network as inputs to construct a lightweight asymmetric multi-path fusion structure, as shown in Figure 6. Drawing on the efficient design concepts of GFPN [51] and DAMO-YOLO [52], this structure strategically prunes redundant shallow fusion paths via an asymmetric topology and strengthens the top-down flow of deep information. By utilizing DySample for the top-down fusion path, semantic information is ensured not to decay during transmission. The exclusion of extra shallow-layer (P2) fusion is justified because, in aquatic aerial imagery, while shallow features retain high-resolution spatial information, they are also replete with high-frequency background noise. Excessive shallow injection would introduce significant non-target interference into deep semantics, causing minute targets to be submerged. This design not only adheres to SNR optimization principles in signal processing but also effectively reduces computational load (FLOPs), as complex operations on high-resolution shallow feature maps consume substantial computing power. Thus, DCFN theoretically achieves a dual optimization of accuracy and efficiency, making it particularly suitable for the resource constraints of edge devices.
To address the semantic ambiguity of small targets in cross-scale fusion, DCFN introduces the DySample operator to replace standard interpolation. DySample is a lightweight dynamic upsampler based on a point-sampling perspective. According to Wang et al. [53] and Liu et al. [50], standard interpolation kernels possess spatial invariance, using the same weight kernel for the entire image. However, for small targets with minimal pixel occupancy, local texture changes are drastic and sparse. Research [50] indicates that adopting a content-aware reorganization strategy—dynamically generating upsampling kernels based on the semantic content of feature maps—is key to recovering minute target details and reducing feature aliasing. Compared to kernel prediction-based methods [53], DySample is theoretically more aligned with the nature of geometric transformation and possesses lower computational complexity.
The core idea of DySample is that upsampling should not be a fixed, content-agnostic interpolation but a resampling process dynamically generated based on feature content. Let the input feature map be $X$ and the target upsampling factor be $s$, yielding the output feature map $Y$. As shown on the right side of Figure 6, DySample is divided into two stages: sampling set generation and feature resampling. The network first uses a lightweight generator to predict a dense offset map $\mathcal{O}$ based on the local semantic content of the input feature $X$. The final sampling set $\mathcal{S}$ is defined as the superposition of the original grid $\mathcal{G}$ and the offset $\mathcal{O}$:
$$ \mathcal{S} = \mathcal{G} + \mathcal{O} $$
where $\mathcal{G}$ represents the projection coordinates of the target feature map pixels on the original low-resolution map, and $\mathcal{O}$ is the adaptive offset learned by the network. Each pixel value on the output feature map $Y$ is obtained by sampling the input map $X$ at the corresponding position in $\mathcal{S}$:
$$ Y = \mathrm{GridSample}\left(X, \mathcal{S}\right) $$
Similarly, the sampling operation here is implemented via bilinear interpolation to ensure differentiability. By learning offsets, DySample attempts to recover lost spatial information, pulling sampling points back to the semantic center or edges of the target. For floating debris, this implies that even if the target is very blurry in high-level features, DySample can focus the upsampling on potential target areas by referencing surrounding contextual information, thereby reconstructing sharp feature boundaries. This mechanism theoretically guarantees semantic alignment during feature fusion, enabling deep semantics to be precisely injected into the texture of small targets in shallow layers.
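The point-sampling idea (S = G + O followed by bilinear resampling) can be sketched as follows. This is a simplified, assumption-laden re-implementation for illustration, not the official DySample code; the offset normalization and the zero initialization (which makes the module behave like plain bilinear upsampling at the start of training) are design choices of the sketch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicUpsample(nn.Module):
    """Content-aware s-fold upsampling in the spirit of DySample: S = G + O."""
    def __init__(self, channels: int, scale: int = 2):
        super().__init__()
        self.scale = scale
        # Lightweight generator predicting an (x, y) offset for every upsampled position.
        self.offset_gen = nn.Conv2d(channels, 2 * scale * scale, kernel_size=1)
        nn.init.zeros_(self.offset_gen.weight)
        nn.init.zeros_(self.offset_gen.bias)   # zero offsets -> plain bilinear at start

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        n, _, h, w = x.shape
        s = self.scale
        # O: content-dependent offsets rearranged to the upsampled resolution (N, 2, sH, sW).
        offsets = F.pixel_shuffle(self.offset_gen(x), s)
        # G: base sampling grid in normalized [-1, 1] coordinates (N, sH, sW, 2).
        ys = torch.linspace(-1, 1, s * h, device=x.device)
        xs = torch.linspace(-1, 1, s * w, device=x.device)
        gy, gx = torch.meshgrid(ys, xs, indexing="ij")
        grid = torch.stack((gx, gy), dim=-1).expand(n, -1, -1, -1)
        # S = G + O, with pixel-unit offsets rescaled to normalized coordinates.
        norm = torch.tensor([2.0 / max(w - 1, 1), 2.0 / max(h - 1, 1)], device=x.device)
        sample = grid + offsets.permute(0, 2, 3, 1) * norm
        # Feature resampling via differentiable bilinear interpolation.
        return F.grid_sample(x, sample, mode="bilinear", align_corners=True)

# Example: DynamicUpsample(256)(torch.randn(1, 256, 20, 20)) -> shape (1, 256, 40, 40)
```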
4.3. Dual-Domain Anti-Noise Attention
Although DFEM and DCFN effectively improve the geometric representation and cross-scale semantic alignment of non-rigid targets, significant high-frequency structural noise specific to aquatic environments remains in the deep feature space before entering the detection head. For instance, specular reflections under strong sunlight and bright edges of dynamic ripples often exhibit feature responses highly similar to minute plastic debris. Traditional global attention mechanisms (e.g., SE [54], CBAM [55]) typically recalibrate channel or spatial weights based on statistical information like global average pooling. However, in aquatic scenarios, such blind enhancement strategies face severe inductive bias risks; they tend to enhance all high-response regions, potentially erroneously amplifying bright reflection noise and leading to a surge in false positives.
To this end, we propose the DANA module, as shown in Figure 7. DANA is not merely feature enhancement but a discriminative decoupling mechanism. By establishing an interaction constraint mechanism between spatial and channel domains, it utilizes the spatial structural priors of the target to guide the screening of channel semantics, thereby precisely stripping the semantic signal of real floating debris from background clutter.
To precisely localize target boundaries while suppressing water ripple noise, DANA designs a Spatial Orthogonal Context Branch. Unlike standard spatial attention, which incurs a high computational burden from large-kernel convolutions, this branch adopts the philosophy of Coordinate Attention (CA) [56], performing 1D feature aggregation along the horizontal (X) and vertical (Y) directions, respectively. For the input feature $F$, two direction-aware feature vectors are generated:
$$ z^{h}_{c}(h) = \frac{1}{W}\sum_{0 \le i < W} F_{c}(h, i), \qquad z^{w}_{c}(w) = \frac{1}{H}\sum_{0 \le j < H} F_{c}(j, w) $$
where the superscripts $h$ and $w$ represent the vertical and horizontal spatial dimensions. Accordingly, $z^{h}$ and $z^{w}$ capture long-range dependencies of the feature map $F$ along the vertical and horizontal directions. This operation effectively compresses the 2D spatial structure into two 1D feature descriptors. Subsequently, the two descriptors are concatenated, transformed via convolution, and passed through a nonlinear activation $\delta$ to generate the intermediate feature $f$:
$$ f = \delta\left(\mathrm{Conv}_{1 \times 1}\left(\left[z^{h}; z^{w}\right]\right)\right) $$
Then $f$ is split and transformed into two spatial attention maps $A^{h}$ and $A^{w}$. The final spatial attention map $A_{s}$ is the broadcast product of the two:
$$ A_{s} = A^{h} \otimes A^{w} $$
where $\otimes$ denotes element-wise broadcast multiplication. This interactive design enables spatial attention to simultaneously consider contextual information in both the horizontal and vertical directions, thereby more precisely locating the non-rigid boundaries of floating litter.
To avoid the erroneous enhancement of background noise seen in traditional SE modules, DANA introduces the Spatial-Channel Interaction Strategy. Modulating the channel descriptors with the intermediate features generated by the spatial branch is the key step for noise resistance. A traditional channel descriptor is obtained by global average pooling, $z_{c} = \frac{1}{H \times W}\sum_{i=1}^{H}\sum_{j=1}^{W} F_{c}(i, j)$. In contrast, DANA computes a spatially weighted channel descriptor, $\tilde{z}_{c} = \frac{1}{H \times W}\sum_{i=1}^{H}\sum_{j=1}^{W} A_{s}(i, j) \cdot F_{c}(i, j)$. This ensures that if a channel responds spatially only in noise regions, its corresponding spatial weights will be suppressed, thereby lowering the overall weight of that channel via the interaction mechanism. After Sigmoid activation, the final channel attention vector $A_{c}$ is generated. The Sigmoid function maps weights to the continuous interval $(0, 1)$ rather than the binarized set $\{0, 1\}$, which means that even faint target signals are merely re-weighted rather than hard-truncated. This soft attention mechanism ensures that gradients can still flow to minute targets during backpropagation, allowing the network to progressively correct its focus on small targets during training. The final feature reconstruction $Y$ is the result of dual weighting:
$$ Y = F \otimes A_{s} \otimes A_{c} $$
Through this spatial-channel interaction mechanism, DANA achieves discriminative decoupling of target semantics and noise structures. Spatial attention is responsible for suppressing the spatial locations of noise, while channel attention suppresses the semantic channels where noise resides. A high response is obtained only through the dual validation of $A_{s}$ and $A_{c}$, effectively avoiding the over-suppression problem caused by judgment errors in a single dimension. After DANA processing, water surface reflections and ripple interference in the feature maps are significantly suppressed, while the feature responses of minute floating debris are focused and enhanced. This provides a high-SNR decision basis for the subsequent detection head, significantly reducing the false detection rate under complex lighting conditions.
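A condensed sketch of this dual-domain interaction is shown below: orthogonal 1D pooling yields the spatial map A_s, which then weights the channel descriptor before the channel vector A_c is computed. The reduction ratio, layer composition, and naming are assumptions made for illustration, not the exact DANA implementation.

```python
import torch
import torch.nn as nn

class DualDomainAttention(nn.Module):
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        mid = max(channels // reduction, 8)
        self.shared = nn.Sequential(
            nn.Conv2d(channels, mid, kernel_size=1), nn.BatchNorm2d(mid), nn.ReLU())
        self.conv_h = nn.Conv2d(mid, channels, kernel_size=1)     # vertical map A^h
        self.conv_w = nn.Conv2d(mid, channels, kernel_size=1)     # horizontal map A^w
        self.fc_c = nn.Conv2d(channels, channels, kernel_size=1)  # channel attention A_c

    def forward(self, f: torch.Tensor) -> torch.Tensor:
        n, c, h, w = f.shape
        # Orthogonal context aggregation: z^h (N, C, H, 1) and z^w rearranged to (N, C, W, 1).
        z_h = f.mean(dim=3, keepdim=True)
        z_w = f.mean(dim=2, keepdim=True).permute(0, 1, 3, 2)
        y = self.shared(torch.cat([z_h, z_w], dim=2))             # intermediate feature f
        f_h, f_w = torch.split(y, [h, w], dim=2)
        a_h = torch.sigmoid(self.conv_h(f_h))                     # (N, C, H, 1)
        a_w = torch.sigmoid(self.conv_w(f_w)).permute(0, 1, 3, 2) # (N, C, 1, W)
        a_s = a_h * a_w                                           # spatial attention A_s
        # Spatially weighted channel descriptor, then soft channel attention A_c.
        z_c = (a_s * f).mean(dim=(2, 3), keepdim=True)
        a_c = torch.sigmoid(self.fc_c(z_c))
        return f * a_s * a_c                                      # Y = F * A_s * A_c
```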
5. Experiments and Discussion
5.1. Experimental Setup
All model training and validation processes in this study were conducted on a single computational node equipped with an NVIDIA GeForce RTX 4060 Ti GPU (16 GB VRAM). The software environment was established on the PyTorch 2.0.0 deep learning framework, accelerated by CUDA 11.8, using Python 3.10.18. Input images were uniformly resized to a resolution of pixels. A Stochastic Gradient Descent (SGD) optimizer was employed with an initial learning rate of , a momentum factor of 0.937, and a weight decay coefficient of . The batch size was set to 16, and the training process spanned 200 epochs.
To comprehensively evaluate model performance, Precision ($P$), Recall ($R$), and mean Average Precision (mAP) were adopted as core metrics, defined mathematically as follows:
$$ P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN}, \qquad AP = \int_{0}^{1} P(R)\,dR, \qquad \mathrm{mAP} = \frac{1}{N}\sum_{i=1}^{N} AP_{i} $$
where $TP$, $FP$, and $FN$ denote the numbers of True Positives, False Positives, and False Negatives, respectively, $AP$ represents the average precision for a specific class, and $N$ is the number of classes. mAP50 denotes the mean average precision at an Intersection over Union (IoU) threshold of 0.5, reflecting the model’s detection capability. Additionally, Parameters (Params), Floating Point Operations (FLOPs), and Frames Per Second (FPS) were introduced to assess the degree of lightweight design and real-time inference capability.
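The computation reduces to simple counting and averaging; the toy numbers below are placeholders chosen only to show how the quantities combine, not results from the paper.

```python
def precision(tp: int, fp: int) -> float:
    return tp / (tp + fp) if (tp + fp) else 0.0

def recall(tp: int, fn: int) -> float:
    return tp / (tp + fn) if (tp + fn) else 0.0

def mean_average_precision(ap_per_class):
    """mAP is the mean of the per-class average precisions (here at IoU = 0.5)."""
    return sum(ap_per_class) / len(ap_per_class)

# Placeholder example: precision(80, 20) == 0.80, recall(80, 25) == 0.7619...
# mean_average_precision([0.81, 0.78, 0.75, 0.83, 0.79]) == 0.792
```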
5.2. Ablation Studies
To validate the scientific rigor of the FLD-Net architecture and the efficacy of its core modules, a two-stage ablation analysis was conducted on the UAV-Flow dataset. Prior to finalizing the architecture, granular comparative experiments were performed regarding the insertion position of DFEM and the topology/operator selection for DCFN.
Although DCN adapts to geometric deformations, it incurs computational overhead. To investigate its efficacy at different feature levels, DFEM was deployed at low levels (P2, P3), high levels (P4, P5), and all levels of the backbone. As shown in Table 2, deploying DFEM at high levels yielded the optimal cost-performance ratio, achieving an mAP50 of 73.51%. This configuration not only surpassed the all-level deployment in accuracy but also maintained a lower parameter count. Strategically introducing deformation modeling at deeper layers effectively captures the non-rigid contours of floating litter while avoiding the computational waste associated with DCN on high-resolution shallow feature maps.
To verify the necessity of the DCFN design, we performed stepwise optimization starting from the standard PANet. As presented in Table 3, directly introducing GFPN increased mAP50 to 73.58% but caused a parameter surge to 13.72 M, violating the lightweight principle. Adopting the asymmetric pruning strategy (GFPN-Scale) significantly reduced parameters to 9.66 M, with a slight regression in accuracy. The critical breakthrough occurred with the introduction of the DySample operator; mAP50 surged to 77.69% with a negligible parameter increase. This strongly evidences that the performance leap of DCFN stems not from parameter stacking but from DySample’s pixel-level lossless reconstruction of minute target features, which effectively compensates for information loss caused by topological pruning.
Upon determining the optimal internal structure, we analyzed the impact of the DFEM, DCFN, and DANA modules on detection performance using YOLOv11s as the baseline. The results are detailed in Table 4. The baseline model exhibited marked deficiencies in complex aquatic scenarios, with an mAP50 of only 71.96%. Introducing DFEM raised mAP50 to 73.44% by enhancing geometric deformation modeling. Adding DCFN alone boosted Recall from 66.18% to 75.04%, underscoring the critical role of dynamic cross-scale fusion for minute targets. Employing DANA alone achieved an mAP50 of 78.95%, the highest single-module gain, proving its effectiveness in filtering surface reflections and ripple noise via spatial-channel dual-domain filtering.
The experiments demonstrate significant complementarity among modules. DFEM ensures precise geometric alignment of deep features; DCFN and DANA target small target leakage and complex background noise, respectively. Crucially, the combined usage of all three modules reveals a distinct synergistic effect, yielding performance gains superior to simple superposition. Compared to the baseline YOLOv11s, the final FLD-Net achieved a substantial 11.66% improvement in Recall and an 8.51% increase in mAP50. These results validate that FLD-Net can systematically address the core challenges of extreme scale variations and intense background interference in complex water body remote sensing.
5.3. Comparative Experiments
To further verify the comprehensive performance of FLD-Net, we benchmarked it against mainstream general object detectors (YOLOv5s [57], YOLOv8s [33], YOLOv11s [34], RT-DETR [42]), UAV-optimized models (TPH-YOLOv5 [35], Drone-YOLO [36], UAV-DETR [58]), and the noise-resistant FFCA-YOLO [24]. All models were evaluated under identical dataset and training configurations.
As shown in Table 5, FLD-Net achieved the optimal overall performance with an mAP50 of 80.47% and a Recall of 77.84%. While YOLOv11s offers extremely high inference speeds, it is limited by its feature extraction capability for minute surface targets. FLD-Net improved mAP50 by 8.51% with only a 0.5 M increase in parameters and a 3.8 G increase in FLOPs, maintaining an exceptional efficiency ratio. Although RT-DETR and UAV-DETR achieved high precision via global attention, their computational costs are prohibitive for real-time airborne applications. In contrast, FLD-Net surpassed UAV-DETR in accuracy with only 1/4 of the parameters and 1/5 of the computational load, achieving 156.72 FPS. Furthermore, FLD-Net significantly outperformed UAV-specific models like TPH-YOLOv5 and Drone-YOLO, attributed to its explicit mechanisms for suppressing strong structural noise, which generic UAV models lack.
To evaluate generalization, cross-domain validation was conducted on two external datasets: the HAIDA Trash Dataset (complex marine scenes) and the FMPD dataset (dynamic river scenes). As presented in Table 6, FLD-Net achieved an mAP50 of 73.1% on HAIDA Trash, significantly outperforming YOLOv8 and RT-DETR. On the FMPD dataset, which contains macroscopic debris and strong ripple interference, FLD-Net achieved superior performance compared to the comparison models. Specifically, in contrast to the drastic recall decay of RT-DETR, FLD-Net demonstrated robust environmental adaptability, credited to the DANA module’s effective suppression of ripple and glare noise.
5.4. Visualization Analysis
To intuitively evaluate the perceptual robustness of the model in dynamic aquatic environments,
Figure 8 presents a qualitative comparison between FLD-Net and mainstream detectors (YOLOv8, YOLOv11, and RT-DETR) across five typical challenging scenarios. In the visualization, red bounding boxes denote missed detections (False Negatives), while yellow bounding boxes indicate false detections (False Positives). The comparative results reveal significant perceptual deficiencies in the baseline models under complex interference. Specifically, in scenarios characterized by dynamic water ripples (a), low illumination (b), and turbid water (d), the YOLO series and RT-DETR suffered from severe missed detections of small targets due to insufficient suppression of high-frequency noise and inadequate extraction of weak features. Under shoreline shadow occlusion (c), low background contrast caused target features to be submerged, resulting in missed detections within shadowed regions. Most critically, in strong glare reflection scenarios (e), the baseline models failed to distinguish high-intensity light spots from floating debris, generating numerous false positives. In contrast, FLD-Net demonstrated superior environmental adaptability, significantly outperforming existing mainstream models in both suppressing false positives and enhancing recall rates.
To further validate the generalization capability of FLD-Net,
Figure 9 illustrates the visualization results of FLD-Net versus mainstream detectors on the FMPD dataset, which is characterized by water flow fluctuations and high-density debris accumulation. The results indicate that the YOLO series models exhibit severe missed detection phenomena when confronting densely accumulated and minute floating fragments. This is primarily attributed to the loss of fine-grained features during the downsampling process, making it difficult to decouple minute targets from complex backgrounds. Conversely, although RT-DETR utilizes global attention to capture long-range dependencies, it demonstrates excessive sensitivity when processing dynamic ripple textures, resulting in substantial false positives by misidentifying water surface glints as floating litter. In comparison, FLD-Net exhibits exceptional adaptability on the FMPD dataset, effectively handling domain discrepancies across varying hydrological environments.
5.5. Real-time Edge Deployment and Energy Efficiency
To validate feasibility for practical engineering, end-to-end deployment testing was conducted on an NVIDIA Jetson Xavier NX (16 GB) embedded platform. The environment utilized the TensorRT 8.5 inference engine with FP16 half-precision quantization.
As shown in Table 7, under TensorRT optimization, FLD-Net achieved an inference speed of 69.6 FPS with a single-frame latency of approximately 14 ms. Despite incorporating three enhancement modules, the latency increased by only 4 ms compared to YOLOv11s, while yielding an 8.5% improvement in mAP50. At a full-load power consumption of 15 W, the system energy efficiency reached 4.80 FPS/W. This indicates that under typical UAV battery conditions, the algorithm imposes no significant burden on flight endurance.
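For reference, one convenient route to an FP16 TensorRT engine for this kind of edge deployment is the Ultralytics export interface; the checkpoint name and source image below are placeholders, and the exact conversion pipeline used in this work may differ.

```python
from ultralytics import YOLO

model = YOLO("fld_net.pt")                                         # hypothetical trained checkpoint
engine_path = model.export(format="engine", half=True, device=0)   # FP16 TensorRT engine

# On the Jetson, the generated .engine file can be loaded directly for inference:
trt_model = YOLO(engine_path)
results = trt_model.predict(source="river_frame.jpg", conf=0.25, verbose=False)
```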
As illustrated in
Figure 10, combined with field tests on the DJI M350 RTK platform, the end-to-end system response time—including camera acquisition, model inference, and 4G transmission—was controlled within 100 ms. This metric fully satisfies the operational requirements for closed-loop control and real-time ground station decision-making, confirming the practical value of FLD-Net as a high-efficiency edge detector.
6. Conclusions
Addressing the challenges of intelligent floating litter detection in complex aquatic environments, this article establishes a dual-layer innovation framework comprising a novel dataset and a specialized detection methodology.
At the data level, we established UAV-Flow, a multi-scenario benchmark dataset tailored for UAV-based vision. Leveraging multi-altitude, multi-view acquisition strategies and a semi-supervised annotation paradigm, this dataset authentically replicates scene characteristics such as drastic illumination changes, complex background clutter, and extreme target scale variations. It provides a robust data foundation for research on minute, non-rigid target detection and model generalization evaluation.
At the algorithmic level, targeting the unique visual difficulties of surface floating debris, we proposed FLD-Net, a lightweight, real-time detection framework. Specifically, the DFEM effectively overcomes the geometric mismatch inherent in standard convolutions when processing non-rigid deformations. The DCFN, utilizing a dynamic sampling mechanism, resolves feature aliasing and semantic ambiguity during cross-scale fusion. Furthermore, the DANA module, employing a dual-domain interaction strategy, successfully suppresses complex background noise such as surface glint and ripples.
Quantitative experiments confirm that FLD-Net achieves a superior trade-off between detection precision (mAP50 80.47%) and computational efficiency (156.72 FPS on RTX 4060 Ti). With a parameter count of only 9.9 M, the model demonstrates high operational viability for on-board real-time processing on resource-constrained UAV payloads. Future work will focus on integrating temporal coherence and multi-modal sensing to investigate the spatiotemporal distribution dynamics of aquatic pollution, thereby providing comprehensive technical support for ecological monitoring and assessment.
Author Contributions
Conceptualization, B.Z.; methodology, B.Z. and X.W.; software, X.W.; validation, X.W. and Z.W.; formal analysis, B.Z.; investigation, X.Y.; resources, L.W. and B.Z.; data curation, X.Y.; writing—original draft preparation, X.W.; writing—review and editing, B.Z., L.W. and X.Y.; visualization, X.W.; supervision, B.Z.; project administration, B.Z.; funding acquisition, B.Z. and L.W.
Funding
This research was supported by the Joint Fund of Zhejiang Provincial Natural Science Foundation of China under Grant No. LGEY26E090014.
Data Availability Statement
The source code and portions of the UAV-Flow dataset have been open-sourced on GitHub (link:
https://github.com/starandmoonw/FLD-Net). The complete dataset will be made publicly available upon acceptance of the paper.
Conflicts of Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
References
- Bhatia, S.K.; Mehariya, S.; Bhatia, R.K.; Kumar, M.; Pugazhendhi, A.; Awasthi, M.K.; Atabani, A.; Kumar, G.; Kim, W.; Seo, S.O.; et al. Wastewater based microalgal biorefinery for bioenergy production: Progress and challenges. Science of the Total Environment 2021, 751, 141599.
- Van Sebille, E.; Wilcox, C.; Lebreton, L.; Maximenko, N.; Hardesty, D.P.; Van Franeker, J.A.; Eriksen, M.; Siegel, D.; Galgani, F.; Law, K.L. A global inventory of small floating plastic debris. Environmental Research Letters 2015, 10, 124006.
- Hong, S.; Lee, J.; Lim, S. Navigational threats by derelict fishing gear to navy ships in the Korean seas. Marine Pollution Bulletin 2017, 119, 100–105.
- Borrelle, S.B.; Ringma, J.; Law, K.L.; Monnahan, C.C.; Lebreton, L.; McGivern, A.; Murphy, E.; Jambeck, J.; Leonard, G.H.; Hilleary, M.A.; et al. Predicted growth in plastic waste exceeds efforts to mitigate plastic pollution. Science 2020, 369, 1515–1518.
- Di, J.; Xi, K.; Yang, Y. An enhanced YOLOv8 model for accurate detection of solid floating waste. Scientific Reports 2025, 15, 25015.
- Beatty, D.S.; Aoki, L.R.; Graham, O.J.; Yang, B. The future is big—and small: remote sensing enables cross-scale comparisons of microbiome dynamics and ecological consequences. Msystems 2021, 6, e01106–21.
- Lee, H.; Byeon, S.; Kim, J.H.; Shin, J.K.; Park, Y. Construction of a Real-Time Detection for Floating Plastics in a Stream Using Video Cameras and Deep Learning. Sensors 2025, 25, 2225.
- He, L.; Zhou, Y.; Yang, H.; Su, L.; Ma, J. A deep learning-based method for marine oil spill detection and its application in UAV imagery. Marine Pollution Bulletin 2026, 222, 118889.
- Chang, B.; Li, F.; Hu, Y.; Yin, H.; Feng, Z.; Zhao, L. Application of UAV remote sensing for vegetation identification: a review and meta-analysis. Frontiers in Plant Science 2025, 16, 1452053.
- Cao, Y.; Lei, R. Progress and prospect of forest fire monitoring based on the multi-source remote sensing data. National Remote Sensing Bulletin 2024, 28, 1854–1869.
- Wang, J.; Zhao, H. Improved yolov8 algorithm for water surface object detection. Sensors 2024, 24, 5059.
- Yang, Z.; Yu, X.; Dedman, S.; Rosso, M.; Zhu, J.; Yang, J.; Xia, Y.; Tian, Y.; Zhang, G.; Wang, J. UAV remote sensing applications in marine monitoring: Knowledge visualization and review. Science of The Total Environment 2022, 838, 155939.
- Song, F.; Zhang, W.; Yuan, T.; Ji, Z.; Cao, Z.; Xu, B.; Lu, L.; Zou, S. UAV quantitative remote sensing of riparian zone vegetation for river and lake health assessment: A review. Remote Sensing 2024, 16, 3560.
- Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft coco: Common objects in context. In Proceedings of the European conference on computer vision. Springer, 2014, pp. 740–755.
- Đuraš, A.; Ilioudi, A.; Wolf, B.; Palunko, I.; De Schutter, B. Seaclear Marine Debris Detection & Segmentation Dataset, 2024. Data set. [CrossRef]
- Yang, Z.; et al. River Floating Debris Dataset for UAV-based detection, 2024. [Data set]. Available: https://figshare.com/.
- Pinson, S.; Vollering, A. Floating macroplastic debris in rivers labelled dataset, 2025. Data set. [CrossRef]
- Gonçalves, G.; Andriolo, U.; Gonçalves, L.M.; Sobral, P.; Bessa, F. Beach litter survey by drones: Mini-review and discussion of a potential standardization. Environmental Pollution 2022, 315, 120370.
- Andriolo, U.; et al. Drones for litter monitoring on coasts and rivers: suitable flight altitude and image resolution. Marine Pollution Bulletin 2023, 195, 115521.
- Gallitelli, L.; Girard, P.; Andriolo, U.; Liro, M.; Suaria, G.; Martin, C.; Lusher, A.; Hancke, K.; Blettler, M.; Garcia-Garin, O.; et al. Monitoring macroplastics in aquatic and terrestrial ecosystems: Expert survey reveals visual and drone-based census as most effective techniques. Science of the Total Environment 2024, 955, 176528.
- Wu, X.; Li, W.; Hong, D.; Tao, R.; Du, Q. Deep learning for unmanned aerial vehicle-based object detection and tracking: A survey. IEEE Geoscience and Remote Sensing Magazine 2021, 10, 91–124.
- Tang, G.; Ni, J.; Zhao, Y.; Gu, Y.; Cao, W. A survey of object detection for UAVs based on deep learning. Remote Sensing 2023, 16, 149.
- Jiang, Z.; Wu, B.; Ma, L.; Zhang, H.; Lian, J. APM-YOLOv7 for small-target water-floating garbage detection based on multi-scale feature adaptive weighted fusion. Sensors 2023, 24, 50.
- Zhang, Y.; Ye, M.; Zhu, G.; Liu, Y.; Guo, P.; Yan, J. FFCA-YOLO for Small Object Detection in Remote Sensing Images. IEEE Transactions on Geoscience and Remote Sensing 2024, 62, 1–15.
- Hui, Y.; Wang, J.; Li, B. STF-YOLO: A small target detection algorithm for UAV remote sensing images based on improved SwinTransformer and class weighted classification decoupling head. Measurement 2024, 224, 113936.
- Yu, J.; Zheng, H.; Xie, L.; Zhang, L.; Yu, M.; Han, J. Enhanced YOLOv7 integrated with small target enhancement for rapid detection of objects on water surfaces. Frontiers in Neurorobotics 2023, 17, 1315251.
- Gui, S.; Song, S.; Qin, R.; Tang, Y. Remote sensing object detection in the deep learning era—a review. Remote Sensing 2024, 16, 327.
- Hu, M.; Zhang, Y.; Jiao, T.; Xue, H.; Wu, X.; Luo, J.; Han, S.; Lv, H. An Enhanced Feature-Fusion Network for Small-Scale Pedestrian Detection on Edge Devices. Sensors 2024, 24, 7308.
- Chen, H.; Wang, S.; Guo, H.; Lin, H.; Zhang, Y.; Long, Z.; Huang, H. Study of marine debris around a tourist city in East China: Implication for waste management. Science of the total environment 2019, 676, 278–289.
- Qiao, G.; Yang, M.; Wang, H. An annotated Dataset and Benchmark for Detecting Floating Debris in Inland Waters. Scientific Data 2025, 12, 385.
- Liao, Y.H.; Juang, J.G. Application of Unmanned Aerial Vehicles for Marine Trash Detection and Real-Time Monitoring System. Master’s thesis, Department of Communications, Navigation and Control Engineering, NTOU, ROC, 2021.
- Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You Only Look Once: Unified, Real-Time Object Detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 2016.
- Ultralytics. YOLOv8. GitHub, 2023. [Online]. Available: https://github.com/ultralytics/ultralytics.
- Jocher, G.; et al. Ultralytics YOLOv11. GitHub, 2024. [Online]. Available: https://github.com/ultralytics/ultralytics.
- Zhu, X.; Lyu, S.; Wang, X.; Zhao, Q. TPH-YOLOv5: Improved YOLOv5 based on transformer prediction head for object detection on drone-captured scenarios. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 2778–2788.
- Zhang, Z. Drone-YOLO: An efficient neural network method for target detection in drone images. Drones 2023, 7, 526.
- Yi, H.; Liu, B.; Zhao, B.; Liu, E. Small object detection algorithm based on improved YOLOv8 for remote sensing. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing 2023, 17, 1734–1747.
- Zhang, X.; Du, B.; Jia, Y.; Luo, K.; Jiang, L. MSD-YOLO11n: an improved small target detection model for high precision UAV aerial imagery. Journal of King Saud University Computer and Information Sciences 2025, 37, 1–17.
- Xu, C.; Wang, J.; Yang, W.; Yu, H.; Yu, L.; Xia, G.S. RFLA: Gaussian receptive field based label assignment for tiny object detection. In Proceedings of the European conference on computer vision. Springer, 2022, pp. 526–543.
- Xu, C.; Wang, J.; Yang, W.; Yu, H.; Yu, L.; Xia, G.S. Detecting tiny objects in aerial images: A normalized Wasserstein distance and a new benchmark. ISPRS Journal of Photogrammetry and Remote Sensing 2022, 190, 79–93.
- Ye, T.; Qin, W.; Zhao, Z.; Gao, X.; Deng, X.; Ouyang, Y. Real-time object detection network in UAV-vision based on CNN and transformer. IEEE Transactions on Instrumentation and Measurement 2023, 72, 1–13.
- Zhao, Y.; Lv, W.; Xu, S.; Wei, J.; Wang, G.; Dang, Q.; Liu, Y.; Chen, J. DETRs beat YOLOs on real-time object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 16965–16974.
- Fan, Q.; Li, Y.; Deveci, M.; Zhong, K.; Kadry, S. LUD-YOLO: A novel lightweight object detection network for unmanned aerial vehicle. Information Sciences 2025, 686, 121366.
- Zhuang, Y.; Liu, J.; Zhao, H.; Ma, L.; Fang, Z.; Li, L.; Wu, C.; Cui, W.; Liu, Z. A deep learning framework based on structured space model for detecting small objects in complex underwater environments. Communications Engineering 2025, 4, 24.
- Dai, J.; Qi, H.; Xiong, Y.; Li, Y.; Zhang, G.; Hu, H.; Wei, Y. Deformable convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 764–773.
- Zhu, X.; Hu, H.; Lin, S.; Dai, J. Deformable convnets v2: More deformable, better results. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 9308–9316.
- Xiao, Y.; Su, X.; Yuan, Q.; Liu, D.; Shen, H.; Zhang, L. Satellite video super-resolution via multiscale deformable convolution alignment and temporal grouping projection. IEEE Transactions on Geoscience and Remote Sensing 2021, 60, 1–19.
- Liu, H.; Zhou, X.; Wang, C.; Chen, S.; Kong, H. Fourier-deformable convolution network for road segmentation from remote sensing images. IEEE Transactions on Geoscience and Remote Sensing 2024, 62, 1–17.
- Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings of the Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 2117–2125.
- Liu, W.; Lu, H.; Fu, H.; Cao, Z. Learning to upsample by learning to sample. In Proceedings of the Proceedings of the IEEE/CVF international conference on computer vision, 2023, pp. 6027–6037.
- Jiang, Y.; Tan, Z.; Wang, J.; Sun, X.; Lin, M.; Li, H. GiraffeDet: A heavy-neck paradigm for object detection. arXiv preprint arXiv:2202.04256 2022.
- Xu, X.; Jiang, Y.; Chen, W.; Huang, Y.; Zhang, Y.; Sun, X. Damo-yolo: A report on real-time object detection design. arXiv preprint arXiv:2211.15444 2022.
- Wang, J.; Chen, K.; Xu, R.; Liu, Z.; Loy, C.C.; Lin, D. Carafe: Content-aware reassembly of features. In Proceedings of the Proceedings of the IEEE/CVF international conference on computer vision, 2019, pp. 3007–3016.
- Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 7132–7141.
- Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. Cbam: Convolutional block attention module. In Proceedings of the Proceedings of the European conference on computer vision (ECCV), 2018, pp. 3–19.
- Hou, Q.; Zhou, D.; Feng, J. Coordinate attention for efficient mobile network design. In Proceedings of the Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021, pp. 13713–13722.
- Jocher, G.; et al. YOLOv5. GitHub, 2020. [Online]. Available: https://github.com/ultralytics/yolov5.
- Zhang, H.; Liu, K.; Gan, Z.; Zhu, G.N. UAV-DETR: efficient end-to-end object detection for unmanned aerial vehicle imagery. arXiv preprint arXiv:2501.01855 2025.
Figure 1. Three Major Challenges in Drone-Based Detection of Surface Floating Debris.
Figure 2. Schematic diagram of the dataset construction workflow.
Figure 3. Analysis of dataset scale information. (a) illustrates the dominance of small objects across all categories. (b) quantifies the overall scale distribution.
Figure 4. Overall network architecture of FLD-Net.
Figure 5. Detailed structure of the Deformable Feature Extraction Module.
Figure 6. Architecture of the Dynamic Cross-Scale Fusion Network and Micro-Structure of the DySample Mechanism.
Figure 7. Schematic of the Dual-Domain Anti-Noise Attention module.
Figure 8. Qualitative comparison of FLD-Net with mainstream detection models in five typical challenging scenarios: (a) Water ripple interference, (b) Low-light scenes, (c) Shoreline reflections, (d) Turbid water, and (e) Strong glare/flares.
Figure 9. Visual comparison of FLD-Net with mainstream detection models on the FMPD Dataset.
Figure 10. Architecture of the real-time floating litter detection system based on UAV edge computing.
Table 1. Comparison of UAV-Flow with Existing Mainstream Surface Floating Litter Datasets.

| Dataset | Platform | Environment | Scale | Small Obj. | Main Limitation |
|---|---|---|---|---|---|
| Floater | Shore Cams | Multiple Rivers | 3,000 | 16.3% | Static View |
| FMPD | River Monitor | Multiple Rivers | 2,229 | 36.5% | Static View |
| River Debris | DJI Mini 2 | Single River | 840 | - | Small Scale |
| HAIDA | UAV | Coastal/Port | 324 | 29.6% | Small Scale |
| UAV-Flow | DJI M350 | River/Lake/Flat | 4,593 | 78.9% | Limited Geo. Range |
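For context on how a "Small Obj." proportion like the one above can be derived, the following is a minimal sketch that counts the share of annotated boxes below a small-object area threshold in YOLO-format labels. The 32×32 px COCO-style threshold, the image resolution, and the directory layout are illustrative assumptions, not the datasets' documented protocols.

```python
from pathlib import Path

# Minimal sketch: estimate the share of "small" objects in a YOLO-format
# label directory. Assumes the COCO small-object convention (area < 32*32 px)
# and a hypothetical fixed capture resolution; both are illustrative only.
IMG_W, IMG_H = 1920, 1080          # assumed image resolution (hypothetical)
SMALL_AREA = 32 * 32               # assumed small-object threshold in px^2

def small_object_ratio(label_dir: str) -> float:
    small, total = 0, 0
    for txt in Path(label_dir).glob("*.txt"):
        for line in txt.read_text().splitlines():
            parts = line.split()
            if len(parts) < 5:
                continue
            # YOLO label format: class cx cy w h, all normalized to [0, 1]
            w_px = float(parts[3]) * IMG_W
            h_px = float(parts[4]) * IMG_H
            total += 1
            small += (w_px * h_px) < SMALL_AREA
    return small / total if total else 0.0

if __name__ == "__main__":
    # "labels/train" is a placeholder path for illustration.
    print(f"small-object ratio: {small_object_ratio('labels/train'):.1%}")
```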
Table 2. Ablation Study on DFEM Hierarchical Settings.

| Setting | P (%) | R (%) | mAP50 (%) | Params (M) |
|---|---|---|---|---|
| Baseline | 77.85 | 66.18 | 71.96 | 9.40 |
| P2, P3 | 78.14 | 68.55 | 72.78 | 9.42 (+0.02) |
| P4, P5 (Ours) | 78.88 | 67.99 | 73.51 | 9.50 (+0.10) |
| All | 77.51 | 68.53 | 73.32 | 9.52 (+0.12) |
Table 3. Ablation Study on DCFN Internal Design.

| Adjustment | P (%) | R (%) | mAP50 (%) | Params (M) |
|---|---|---|---|---|
| Baseline | 77.85 | 66.18 | 71.96 | 9.40 |
| + GFPN | 77.78 | 70.53 | 73.58 | 13.72 |
| + GFPN-Scale | 79.89 | 68.42 | 73.39 | 9.66 |
| + DySample (Ours) | 79.32 | 75.04 | 77.69 | 9.67 |
Table 4. Ablation Experiments of Key Modules on the UAV-Flow Dataset.

| DFEM | DCFN | DANA | P (%) | R (%) | mAP50 (%) |
|---|---|---|---|---|---|
|  |  |  | 77.85 | 66.18 | 71.96 |
| ✓ |  |  | 79.31 | 67.76 | 73.44 |
|  | ✓ |  | 79.32 | 75.04 | 77.69 |
|  |  | ✓ | 79.41 | 75.88 | 78.95 |
| ✓ | ✓ |  | 78.48 | 75.53 | 78.49 |
|  | ✓ | ✓ | 80.66 | 75.00 | 78.71 |
| ✓ |  | ✓ | 79.71 | 76.29 | 79.19 |
| ✓ | ✓ | ✓ | 80.67 | 77.84 | 80.47 |
Table 5. Comparative Experiments on the UAV-Flow Dataset.

| Method | P (%) | R (%) | mAP50 (%) | Params (M) | FLOPs (G) | FPS |
|---|---|---|---|---|---|---|
| YOLOv5s [57] | 78.33 | 65.94 | 71.35 | 7.8 | 18.7 | 323.49 |
| YOLOv8s [33] | 78.09 | 67.40 | 72.13 | 9.8 | 23.4 | 309.36 |
| YOLOv11s [34] | 77.85 | 66.18 | 71.96 | 9.4 | 21.3 | 289.91 |
| TPH-YOLOv5 [35] | 78.57 | 69.64 | 75.73 | 45.36 | 245.1 | 8.22 |
| Drone-YOLO [36] | 78.07 | 69.11 | 74.10 | 34.65 | 7.9 | 12.4 |
| FFCA-YOLO [24] | 80.39 | 73.39 | 75.62 | 7.1 | 51.3 | 53.37 |
| RT-DETR [42] | 82.14 | 73.36 | 77.29 | 41.9 | 125.6 | 80.07 |
| UAV-DETR [58] | 82.72 | 74.91 | 79.75 | 44.6 | 161.4 | 35.09 |
| FLD-Net (Ours) | 80.67 | 77.84 | 80.47 | 9.9 | 25.1 | 156.72 |
Table 6. Cross-Domain Generalization on the HAIDA Trash Dataset and FMPD Dataset.

| Method | HAIDA P (%) | HAIDA R (%) | HAIDA mAP50 (%) | FMPD P (%) | FMPD R (%) | FMPD mAP50 (%) |
|---|---|---|---|---|---|---|
| YOLOv8 [33] | 69.2 | 67.3 | 71.2 | 37.30 | 43.23 | 33.09 |
| YOLOv11 [34] | 55.5 | 63.5 | 64.9 | 38.17 | 49.30 | 37.00 |
| RT-DETR [42] | 62.9 | 67.3 | 67.0 | 40.35 | 25.23 | 26.77 |
| FLD-Net (Ours) | 74.0 | 66.4 | 73.1 | 41.31 | 49.50 | 39.13 |
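As a hedged illustration of the cross-domain evaluation summarized in Table 6, the sketch below scores a single set of UAV-Flow-trained weights directly on the two external test sets using the Ultralytics validation API. The weight file and dataset YAML names are hypothetical, and the zero-shot (no fine-tuning) protocol is assumed rather than restated from the paper.

```python
from ultralytics import YOLO

# Hedged sketch: zero-shot cross-domain check. Weights trained on UAV-Flow
# are evaluated on other datasets' test splits without any fine-tuning.
# "fldnet_uavflow.pt", "haida.yaml", and "fmpd.yaml" are placeholder names.
model = YOLO("fldnet_uavflow.pt")

for cfg in ("haida.yaml", "fmpd.yaml"):
    metrics = model.val(data=cfg, split="test", imgsz=640, verbose=False)
    print(cfg, f"mAP50 = {metrics.box.map50:.3f}")
```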
Table 7. Performance Comparison on NVIDIA Jetson Xavier NX.

| Model | TensorRT | mAP50 (%) | FPS | Power (W) | Efficiency (FPS/W) |
|---|---|---|---|---|---|
| YOLOv11s |  | 71.96 | 54.0 | 14.2 | 3.80 |
| YOLOv11s | ✓ | 71.57 | 83.2 | 14.2 | 5.86 |
| FLD-Net |  | 80.47 | 44.6 | 14.5 | 3.07 |
| FLD-Net | ✓ | 80.29 | 69.6 | 14.5 | 4.80 |
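The efficiency column in Table 7 is simply throughput divided by average board power (FPS/W). The sketch below shows one common way to produce a TensorRT engine with the Ultralytics export API and reproduce that arithmetic for the accelerated FLD-Net row; the weight file name, FP16 setting, and placeholder image are assumptions about, not a record of, the paper's exact deployment pipeline, while the 69.6 FPS and 14.5 W values are taken from the table.

```python
from ultralytics import YOLO

# Hedged sketch: export a detector to a TensorRT engine (requires a device
# with TensorRT installed, e.g., a Jetson) and derive the FPS/W efficiency
# figure. Weight file and sample image are hypothetical placeholders.
model = YOLO("fldnet_uavflow.pt")
engine_path = model.export(format="engine", half=True, imgsz=640)

trt_model = YOLO(engine_path)
trt_model.predict("sample_frame.jpg", imgsz=640, verbose=False)  # warm-up on a placeholder frame

measured_fps = 69.6     # on-device throughput, as reported in Table 7
board_power_w = 14.5    # average power draw, as reported in Table 7
print(f"efficiency = {measured_fps / board_power_w:.2f} FPS/W")  # -> 4.80
```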
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).