3.1. Dataset
The dataset employed in this study was specifically constructed to address the challenges of safety monitoring in real-world port logistics environments. It comprises 3,150 real-scene images collected through a collaborative project with a port company in Anhui Province, China. The images were sourced from surveillance videos and historically captured violation records across four distinct open channels, ensuring the data reflects a wide spectrum of actual operational conditions. To build a representative and diverse dataset capable of handling the complexities of the target environment—including significant scale variations, mutual occlusions, varying lighting conditions, and diverse viewpoints—we extracted frames from the video streams and applied data augmentation techniques (e.g., random rotation, brightness adjustment, and noise injection) to enhance the model's robustness and generalization capability.
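As an illustration of the augmentation step (the exact parameters used in this study are not specified here, so the ranges below are assumptions, and bounding-box handling is omitted), a minimal OpenCV-based sketch applying random rotation, brightness adjustment, and noise injection could look as follows:

```python
import cv2
import numpy as np

def augment(img: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Random rotation, brightness adjustment, and Gaussian noise injection."""
    h, w = img.shape[:2]

    # Random rotation about the image center (angle range is an assumed example).
    angle = rng.uniform(-15, 15)
    m = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)
    img = cv2.warpAffine(img, m, (w, h), borderMode=cv2.BORDER_REFLECT)

    # Brightness/contrast adjustment via a random gain (alpha) and bias (beta).
    img = cv2.convertScaleAbs(img, alpha=rng.uniform(0.7, 1.3), beta=rng.uniform(-20, 20))

    # Additive Gaussian noise.
    noise = rng.normal(0, 8, img.shape)
    return np.clip(img.astype(np.float32) + noise, 0, 255).astype(np.uint8)
```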
A rigorous categorization framework was devised, encompassing seven distinct label categories essential for port safety monitoring: helmet, head, person, truck, fake_helmet, tricycle, and bicycle. To ensure a balanced assessment, the dataset was meticulously partitioned into training, validation, and testing subsets following a 7:2:1 ratio, yielding 2,205 images for training, 630 for validation, and 315 for testing.
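The 7:2:1 partition can be reproduced with a simple shuffled split; the sketch below is illustrative only (the file handling and random seed are assumptions, not the procedure used in the study):

```python
import random

def split_dataset(image_paths, seed=0):
    """Shuffle and split into train/val/test at a 7:2:1 ratio
    (2205/630/315 images for a 3,150-image dataset)."""
    paths = list(image_paths)
    random.Random(seed).shuffle(paths)
    n = len(paths)
    n_train, n_val = int(n * 0.7), int(n * 0.2)
    return paths[:n_train], paths[n_train:n_train + n_val], paths[n_train + n_val:]
```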
Figure 5 provides a comprehensive statistical overview of the dataset. The top-left quadrant illustrates the categorical distribution of instances, revealing a notable disparity in the proportion of truck instances, which is attributable to their larger target size and lower frequency in the operational scene. Conversely, the top-right quadrant showcases the distribution of bounding boxes, highlighting the substantial variations in target sizes and the presence of overlaps, which underscores the challenging nature of target representations in this environment.
Furthermore, the scatter plot in the bottom-left corner depicts the distribution of target center positions, demonstrating their relatively even spread across the image space. This even distribution confirms the dataset's ability to represent targets appearing in various locations within the field of view. Lastly, the scatter plot in the bottom-right corner focuses on the width and height distribution of the targets, revealing a preponderance of small- and medium-sized targets. This prevalence poses a unique challenge for accurate detection and localization, further validating the dataset's complexity and practical relevance.
3.3. Detection Results and Comprehensive Analysis
The loss function quantifies the discrepancy between the model's predictions and the ground-truth values, and therefore directly influences the overall performance of the detection model. As depicted in Figure 6, the constituent losses—box_loss, cls_loss, and dfl_loss—decline gradually during the training and validation phases and eventually stabilize. This progression indicates that the model converges steadily towards well-optimized parameters and that prediction errors are minimized.
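For readers unfamiliar with how these three components enter training, the sketch below shows a weighted combination of the bounding-box regression, classification, and distribution focal losses; the weights are assumptions (they mirror common YOLOv8-style defaults) and are not values reported in this paper.

```python
def total_loss(box_loss: float, cls_loss: float, dfl_loss: float,
               w_box: float = 7.5, w_cls: float = 0.5, w_dfl: float = 1.5) -> float:
    """Weighted sum of box regression, classification, and distribution
    focal losses minimized during training (weights are illustrative)."""
    return w_box * box_loss + w_cls * cls_loss + w_dfl * dfl_loss
```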
Furthermore, the model's Precision and Recall, two key indicators of detection quality, reach approximately 0.9 and 0.95, respectively, by the end of training. These values show that the model produces few false positives while missing few actual targets, demonstrating strong detection accuracy and recall.
To provide a more comprehensive assessment of the model's performance, we also evaluate it using the mean Average Precision (mAP) metric at various Intersection over Union (IoU) thresholds. Specifically, the model achieves an mAP@0.5 score of approximately 0.95, indicating a strong performance in detecting objects with a high degree of overlap with the ground truth. Moreover, the mAP@0.5:0.95 score of approximately 0.6 demonstrates the model's resilience and adaptability across a broader range of IoU thresholds, further validating its comprehensive detection capabilities.
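For reference, the sketch below illustrates the two building blocks behind these metrics under simplified assumptions: the IoU of a single box pair, and the averaging of per-threshold mAP values (taken as given rather than derived from precision–recall curves) that yields mAP@0.5:0.95.

```python
import numpy as np

def iou(box_a, box_b):
    """IoU of two axis-aligned boxes in (x1, y1, x2, y2) format."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)

# mAP@0.5:0.95 averages the mAP obtained at the ten IoU thresholds
# 0.50, 0.55, ..., 0.95.
IOU_THRESHOLDS = np.linspace(0.50, 0.95, 10)

def map_50_95(map_at_each_threshold):
    """Mean of the per-threshold mAP values (one per entry of IOU_THRESHOLDS)."""
    assert len(map_at_each_threshold) == len(IOU_THRESHOLDS)
    return float(np.mean(map_at_each_threshold))
```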
In summary, the consistent decline and stabilization of loss functions, coupled with the impressive Precision, Recall, and mAP scores, collectively attest to the effectiveness and robustness of the proposed detection model. These results not only underscore the model's ability to accurately detect objects but also highlight its potential for real-world applications where high detection accuracy and recall are paramount.
A confusion matrix provides a comprehensive visualization for assessing algorithmic performance, showing how well the model distinguishes between classes. In this matrix, each row represents the actual class distribution, whereas each column represents the model's predicted classes. The elements along the main diagonal are of primary importance, as they count the correctly classified samples, commonly referred to as True Positives (TP).
The lower-left triangular region of the matrix reveals the False Negatives (FN): instances where the model fails to detect actual samples, either misclassifying them as other categories or missing them entirely. This region exposes the model's limitations in recognizing certain targets, leading to unrecognized or mislabeled objects.
Conversely, the upper-right triangular region contains the False Positives (FP): instances where the model erroneously assigns background elements or samples of other categories to the current category. A high density of FP values indicates a higher rate of spurious detections, in which non-target objects are mistaken for the current class.
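Under the row-equals-actual, column-equals-predicted convention described above, per-class precision and recall follow directly from the matrix; the snippet below is an illustrative sketch rather than the evaluation code used in this work.

```python
import numpy as np

def per_class_metrics(cm: np.ndarray):
    """Per-class precision and recall from a confusion matrix whose rows are
    actual classes and whose columns are predicted classes."""
    tp = np.diag(cm).astype(float)
    fn = cm.sum(axis=1) - tp   # actual samples of a class predicted as something else
    fp = cm.sum(axis=0) - tp   # samples of other classes predicted as this class
    precision = tp / np.maximum(tp + fp, 1e-9)
    recall = tp / np.maximum(tp + fn, 1e-9)
    return precision, recall
```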
As evidenced in Figure 7a, the strong concentration of values along the main diagonal reflects the high accuracy achieved by our detection model across most categories. For instance, the "helmet" category reaches a TP count of 323 and the "person" category reaches 655, showing that the model reliably separates these targets from others. Notably, even for the challenging "fake_helmet" category, which closely resembles "head" and "helmet", the model keeps both false and missed detections at a relatively low level.
Expanding on this analysis, Figure 7b presents the normalized per-category accuracy. The "helmet" category achieves an accuracy of 95%, the "person" category reaches 96%, and the "truck" category attains 100%, underscoring the model's strong performance across these classes. A closer inspection of the normalized matrix, however, also reveals weaknesses: the "background" category is misclassified as "person" with a probability of 0.26 and as "head" with a probability of 0.21, indicating that ambient noise and background complexity in real-world scenes can still challenge the model's classification accuracy. These findings provide useful direction for future refinements to further improve robustness and performance.
3.3.1. Ablation Experiment
To assess the individual and collective contributions of our proposed modules—SPPELAN-DW, CASAM, and CACC—to the overall performance of the model, we designed and executed a comprehensive ablation study. By systematically including and excluding these modules within the YOLOv8 framework, we examined their specific impacts on detection accuracy, measured by mAP@0.5 and the more stringent mAP@0.5:0.95, and on processing speed, quantified in frames per second (FPS). The results of this experiment are summarized in Table 3.
Table 3 presents the ablation results, where each row represents a unique configuration of the modules: a checkmark (√) indicates the inclusion of a module, whereas a cross (×) denotes its exclusion. Integrating SPPELAN-DW alone raised the mAP@0.5 to 0.931 and increased the FPS to 129.47, underscoring its role in improving both detection precision and computational efficiency. Similarly, introducing CASAM alone improved the mAP@0.5:0.95 to 0.610, confirming its effectiveness in boosting accuracy across a broader range of IoU thresholds.
The CACC module, when integrated on its own, provided a dual benefit, raising the mAP@0.5 to 0.932 and the FPS to 126.54, showing that it improves accuracy and speed simultaneously. Combining CASAM and CACC produced a clear synergistic effect, with the mAP@0.5 rising to 0.938 and the FPS reaching 142.30, at the cost of only a marginal dip in mAP@0.5:0.95, which remained at a high level.
The pairing of SPPELAN-DW and CACC yielded an mAP@0.5 of 0.936 coupled with an exceptional FPS of 146.64, showcasing their complementary strengths in accelerating processing while preserving accuracy. Likewise, the combination of SPPELAN-DW and CASAM struck a fine balance between accuracy (mAP@0.5 = 0.935) and speed (FPS = 144.21), further validating the efficacy of our proposed modules.
Ultimately, integrating all three modules—SPPELAN-DW, CASAM, and CACC—yielded the optimal configuration, achieving the highest mAP@0.5 of 0.949, an elevated mAP@0.5:0.95 of 0.625, and a remarkable FPS of 149.35. This outcome underscores the impact of our modular design, in which each component contributes distinctly yet synergistically to the detection accuracy and processing efficiency of the model in open channels, demonstrating the effectiveness of our approach.
3.3.2. Comparative Experiment
To rigorously evaluate the detection performance of the proposed CSPC-BRS algorithm, we conducted comprehensive comparative experiments against several state-of-the-art detectors, including YOLOv5-n, YOLOv6-n, YOLOv7-n, YOLOv8-n, RT-DETR-R18, PP-YOLOE+-s, and NanoDet-Plus. This systematic evaluation aimed to provide a thorough performance analysis across different architectural paradigms.
Figure 8 illustrates the learning trajectories of all compared models throughout the training process. Our CSPC-BRS model demonstrates rapid convergence in mAP@0.5 during the initial training phase, establishing a strong foundation early on. After approximately 20 epochs, it maintains a stable improvement trajectory, consistently outperforming all other methods. In contrast, while the YOLO series models show reasonable convergence trends, they exhibit noticeable performance fluctuations in later training stages. The CSPC-BRS model ultimately achieves the highest mAP@0.5 of 0.949 within 100 epochs, demonstrating superior training stability and convergence behavior.
As quantitatively demonstrated in Table 4, the proposed CSPC-BRS architecture achieves a favorable balance between computational efficiency and detection accuracy. Compared with the YOLOv8-n baseline, our model incurs only a modest increase in computational requirements—parameters grow from 3.0 M to 3.6 M (a 20% increase) and FLOPs from 8.1 to 8.6 GFLOPs (a 6.2% increase)—while delivering substantial performance gains. Most notably, CSPC-BRS achieves a state-of-the-art mAP@0.5:0.95 of 0.625, a 9.6% improvement over YOLOv8-n's 0.570.
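The relative changes quoted above follow directly from the absolute values in Table 4; the short arithmetic check below reproduces them.

```python
def rel_change(new: float, old: float) -> float:
    """Relative change in percent, as quoted in the text."""
    return (new - old) / old * 100.0

print(round(rel_change(3.6, 3.0), 1))      # parameters:    +20.0 %
print(round(rel_change(8.6, 8.1), 1))      # GFLOPs:        +6.2 %
print(round(rel_change(0.625, 0.570), 1))  # mAP@0.5:0.95:  +9.6 %
```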
The comprehensive comparison reveals several important insights. The transformer-based RT-DETR-R18 underperforms in our specific application scenario, achieving lower accuracy than YOLOv8-n despite higher computational costs, suggesting that pure transformer architectures may require further optimization for complex industrial environments. PP-YOLOE+-s shows competitive performance with a 1.6% improvement over YOLOv8-n, but our method maintains a clear 8.0% advantage. Notably, NanoDet-Plus, while extremely efficient, suffers from a substantial 20.7% accuracy drop, rendering it unsuitable for precision-critical applications.
Analysis of computational resource utilization demonstrates that model complexity alone does not guarantee performance. YOLOv7-n, with the highest parameter count (6.0 M) and computational demand (13.2 GFLOPs), achieves the lowest mAP@0.5:0.95 among the YOLO series. Compared with YOLOv7-n, CSPC-BRS uses 40% fewer parameters and 34.8% fewer FLOPs while delivering 19.9% higher detection accuracy, highlighting the effectiveness of our architectural design choices.
The real-time performance analysis in Table 5 provides crucial insights for practical deployment. Our CSPC-BRS framework achieves 149.35 FPS, a 21.2% improvement over YOLOv8-n (123.25 FPS), and outperforms other detectors including PP-YOLOE+-s (118.34 FPS) and RT-DETR-R18 (106.38 FPS). While NanoDet-Plus attains the highest FPS (238.10), this comes at the cost of severely compromised accuracy (-20.7% in mAP@0.5:0.95 versus YOLOv8-n), making it impractical for safety-critical applications requiring high detection precision.
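For context, the sketch below illustrates one common way to measure detector throughput—warm-up iterations followed by averaging wall-clock latency over repeated forward passes. The `model` argument, input size, and iteration counts are placeholders, not the exact benchmarking protocol used in this study.

```python
import time
import torch

def measure_fps(model, img_size=640, warmup=20, iters=200, device="cuda"):
    """Average frames per second of a single-image forward pass."""
    model = model.to(device).eval()
    dummy = torch.randn(1, 3, img_size, img_size, device=device)
    with torch.no_grad():
        for _ in range(warmup):            # warm-up to stabilize clocks/caches
            model(dummy)
        if device == "cuda":
            torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(iters):
            model(dummy)
        if device == "cuda":
            torch.cuda.synchronize()
    return iters / (time.perf_counter() - start)
```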
The experimental results demonstrate that through systematic integration of enhanced feature extraction (CASAM), efficient multi-scale context aggregation (SPPELAN-DW), and optimized feature fusion (CACC), our CSPC-BRS framework maintains robust performance across varying object scales and complex backgrounds while delivering superior frame rates. This confirms its practical viability for deployment in mission-critical infrastructure inspection systems.
To comprehensively evaluate the robustness and detection performance of the CSPC-BRS algorithm in complex open channel environments, we selected four representative scenarios (a, b, c, d) from the validation set imagery. For visual comparison clarity, we focused on comparing CSPC-BRS against the YOLO series models (YOLOv5-n, YOLOv6-n, YOLOv7-n, and YOLOv8-n), as these architectures share similar design paradigms, enabling more straightforward visual analysis of detection differences.
Figure 9 clearly demonstrates the superior performance of CSPC-BRS across all test scenarios. In scenario (a), characterized by an employee wearing a hat resembling a helmet amidst inclement weather and obstructions, YOLOv5-n and YOLOv6-n failed to detect the "fake_helmet" target entirely. While YOLOv7-n and YOLOv8-n successfully identified the target, both models generated false positives by misclassifying it as a genuine "helmet." In contrast, CSPC-BRS accurately identified the "fake_helmet" with high confidence, effectively eliminating false positives through its refined feature extraction capabilities and the effective Bounding Box Reduction Strategy (BRS).
Scenario (b) presented a challenging situation with overlapping targets, where a "person" was partially occluded by a "tricycle," and a small "helmet" blended with the background. In this case, YOLOv5-n and YOLOv6-n exhibited significant missed detections, failing to identify multiple targets. YOLOv7-n and YOLOv8-n demonstrated improved detection capability but with notably lower confidence scores and less precise bounding box localization compared to CSPC-BRS. Our method achieved significantly higher confidence scores and more accurate bounding boxes, demonstrating its exceptional capability in handling occlusion and detecting small objects.
Scenarios (c) and (d), featuring inadequate lighting conditions and significant target size variations, further revealed the limitations of the baseline models. YOLOv5-n and YOLOv6-n consistently showed the poorest performance with numerous missed detections across these challenging conditions. While YOLOv7-n and YOLOv8-n exhibited comparable detection performance to each other, both suffered from inconsistent confidence scores and occasional false positives. The proposed CSPC-BRS, empowered by its multi-level attention mechanism and optimized feature fusion, maintained robust detection performance with high confidence scores and minimal false positives even under low-light conditions and with varying target sizes.
The visual comparative analysis unequivocally confirms that CSPC-BRS achieves the most stable and reliable detection performance across diverse environmental challenges. While YOLOv5-n and YOLOv6-n suffer from substantial missed detections, and YOLOv7-n/YOLOv8-n exhibit inconsistent performance with false positives, our method consistently outperforms all YOLO variants in terms of both detection accuracy and confidence stability, demonstrating its practical advantage for real-world deployment in complex port environments.
3.3.3. Actual Detection Results
To comprehensively evaluate the robustness and efficacy of our detection model, we conducted a rigorous validation process by randomly selecting 16 representative images from the test set. These images encompassed a broad spectrum of challenging scenarios, including umbrella occlusion, extreme lighting conditions (both overly bright and dim), substantial variations in target sizes, and intricate instances where target features closely resemble those of the background.
As depicted in Figure 10, our detection model correctly identified all targets within these challenging images. This result stems from the model's enhanced feature fusion and representation capabilities, which allow it to dynamically modulate feature intensity and extract crucial information from multi-scale features. This adaptability enables the model to discern targets from their surroundings even in highly complex scenarios.
Furthermore, the integration of multi-scale feature extraction and deep convolutional layers significantly bolstered the model's representational power and computational efficiency. This enhancement allowed the model to seamlessly handle the challenges posed by varying target sizes and intricate environmental factors, ensuring consistent and accurate detections across the board.
In summary, the actual detection results presented in this section underscore the exceptional performance of our detection model in real-world applications. By leveraging advanced techniques such as enhanced feature fusion, multi-scale feature extraction, and deep convolutional processing, our model has demonstrated its ability to overcome even the most daunting of detection challenges, thereby validating its suitability for practical deployment in complex open channels.
3.4. Model Tracking Results and Analysis
3.4.1. MDRS Weight Ablation Experiment
To validate the scientific rationale behind the weight assignment in the Multi-Dimensional Re-identification Similarity (MDRS) mechanism, we conducted a systematic ablation study on the validation set. This experiment aimed to quantitatively evaluate the contribution of each feature dimension to cross-camera target re-identification performance and determine the optimal weight combination through empirical analysis.
A baseline configuration with uniform weights (1/6 ≈ 0.167 for each feature) was established to provide a neutral reference point in which no single feature dimension receives preferential treatment. This baseline allows for a clear measurement of the performance improvements achieved through optimized weight allocation, as shown in Table 6.
The experimental results provide several key insights into feature importance for cross-camera re-identification. Configuration A emerged as the optimal weighting scheme, achieving the highest IDF1 score of 0.829, which represents an 11.9% improvement over the uniform baseline.
The color-only configuration demonstrated that color features alone provide substantial discriminative power (+5.5% over baseline), confirming their status as the most stable and distinctive feature in port environments. However, the superior performance of Configuration A over the color-only approach highlights the importance of complementary features.
Comparative analysis of the configurations reveals the delicate balance required in weight assignment. Configuration B (emphasizing texture) and Configuration C (emphasizing color) both showed respectable improvements (+10.0% and +10.8% respectively) but fell short of the optimal balance achieved in Configuration A. The performance degradation in Configuration D (reduced shape weight) and Configuration E (reduced motion weight) further underscores the importance of maintaining appropriate weights for these complementary features.
The critical role of motion consistency is evident from the 2.9% performance gap between Configuration A and Configuration E, demonstrating that motion features provide essential spatiotemporal context for cross-camera association. Similarly, texture and shape features serve as vital complements to color information, with Configuration A's assignment of 0.20 and 0.15 weights proving optimal.
Size and temporal features function effectively as regularization terms with their low weights (0.05 each). The removal of temporal constraints resulted in a 2.8% performance drop compared to Configuration A, demonstrating their importance in filtering physically implausible associations across extended time intervals.
This systematic ablation study provides empirical justification for our weight selection and confirms that Configuration A establishes an optimal balance that emphasizes the most discriminative features while maintaining complementary support from secondary features and necessary regularization terms. The proposed weighting scheme effectively captures the relative importance of each feature dimension for cross-camera re-identification in complex port environments.
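A minimal sketch of the weighted multi-feature similarity underlying MDRS is given below. The texture (0.20), shape (0.15), size (0.05), and temporal (0.05) weights are those of Configuration A reported above; the color and motion weights are not stated in this section, so the values used here are placeholders chosen only so that the weights sum to 1.

```python
# Configuration-A-style weights; color and motion values are assumptions.
MDRS_WEIGHTS = {
    "color":    0.35,   # placeholder (assumption)
    "texture":  0.20,
    "shape":    0.15,
    "motion":   0.20,   # placeholder (assumption)
    "size":     0.05,
    "temporal": 0.05,
}

def mdrs_score(similarities: dict) -> float:
    """Weighted sum of per-dimension similarity scores, each assumed in [0, 1]."""
    return sum(MDRS_WEIGHTS[k] * similarities.get(k, 0.0) for k in MDRS_WEIGHTS)
```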
3.4.2. Tracking Results for Vehicles
This section evaluates the tracking performance of our proposed CSPC-BRS framework under both single-camera and cross-camera scenarios. The tracking requirements differ based on vehicle type and operational context: for smaller vehicles like electric bicycles and tricycles, the primary concern is monitoring safety compliance (e.g., helmet usage, unauthorized entry) within localized areas, making single-camera tracking sufficient. However, for trucks involved in cargo transportation, ensuring operational safety requires continuous monitoring across multiple zones, necessitating robust cross-camera identity consistency.
Figure 11 presents a comparative analysis of tracking efficacy in a low-light, backlit open environment, evaluating our CSPC-BRS algorithm against the baseline YOLOv8+BoT-SORT. Under these challenging conditions, our CSPC-BRS framework successfully tracked all targets, including fast-moving bicycles and tricycles, maintaining a stable maximum ID count of 3 without any target loss. In stark contrast, the baseline model not only failed to correctly classify targets—misidentifying a bicycle as a person at frame 57—but also completely lost the trajectory of this target by frame 62. This misclassification and subsequent tracking failure underscore the baseline's limitations in handling complex visual scenarios.
The performance disparity is further evidenced in Figure 12, where the baseline model consistently failed to detect the person target throughout the sequence. Our CSPC-BRS, in contrast, reliably maintained all tracks with high confidence and without any identity switches, even as the tricycle underwent increasing occlusion. These results collectively demonstrate the superior robustness of our method in adverse lighting conditions and its enhanced capability for handling rapid motion and partial occlusions.
The superior performance stems from the integrated CASAM attention mechanism, SPPELAN-DW aggregation network, and CACC convolutional modules, which collectively enhance feature extraction and fusion capabilities. Additionally, the BRS strategy dynamically adjusts bounding box dimensions in response to velocity changes, significantly reducing target loss and ID switching.
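To illustrate the idea behind the velocity-adaptive adjustment described above—not the exact formulation used in this work—a simple box-adjustment step driven by the target's estimated velocity might look as follows; the scaling coefficient `k` and the clamping range are assumptions.

```python
def brs_adjust(box, velocity, k=0.05, max_scale=1.5):
    """Hypothetical velocity-adaptive box adjustment: faster targets receive a
    proportionally enlarged box/search region so that frame-to-frame
    association is less likely to lose them.

    box      -- (cx, cy, w, h) in pixels
    velocity -- (vx, vy) in pixels per frame
    """
    cx, cy, w, h = box
    vx, vy = velocity
    speed = (vx ** 2 + vy ** 2) ** 0.5
    scale = min(1.0 + k * speed, max_scale)   # grow with speed, but clamp
    return (cx, cy, w * scale, h * scale)
```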
To evaluate performance in cross-camera scenarios essential for monitoring cargo transportation safety, we designed an experiment tracking a truck across three surveillance cameras with significantly different viewpoints. Two cross-camera routes were established: Route 1 (camera1 to camera3) and Route 2 (camera2 to camera3).
As shown in Figure 13, CSPC-BRS maintained consistent ID assignment (ID = 1) for the truck throughout both routes, despite significant viewpoint variations and visual distortions. In Route 1, when the truck re-emerged after temporary occlusion by a gantry crane, the baseline model failed to re-identify the target and assigned new IDs (2-4). In contrast, CSPC-BRS preserved the original identity, demonstrating the effectiveness of its multi-feature Re-ID scoring mechanism and BRS-based scale adaptation.
The baseline algorithm consistently assigned new IDs when the truck appeared in camera3 from both routes, indicating limited capability in handling appearance changes across viewpoints. CSPC-BRS not only maintained ID continuity but also sustained high-confidence bounding boxes throughout the tracking sequence, validating its robust performance in dynamic multi-camera environments.
This experiment confirms the effectiveness of our multi-dimensional Re-ID mechanism in complex cross-view scenarios and demonstrates the BRS module's crucial role in compensating for scale variations, collectively enhancing tracking stability in practical logistics surveillance applications.
3.4.3. Evaluation of Employee Tracking in Open Channels
To quantitatively evaluate the tracking robustness of the proposed CSPC-BRS framework, we conducted a comparative study against the baseline YOLOv8+BoT-SORT under diverse challenging scenarios in open channels, with a focus on identity (ID) preservation and occlusion handling.
The superior performance of our method is demonstrated in Figure 14, which summarizes tracking outcomes across two representative scenarios. In a scenario with frequent inter-person occlusion, the baseline model proved fragile: as individuals crossed paths, it suffered multiple identity switches, causing its maximum ID count to increase by 33.3% (from 6 to 8) and failing to maintain consistent trajectories. In contrast, our CSPC-BRS framework, supported by the BRS strategy, maintained tracking continuity through the occlusion events, reducing the incidence of identity loss by 50% and demonstrating robust data association.
The advantage of our model becomes even more pronounced in highly complex environments characterized by strong illumination and persistent occlusions. In such demanding conditions, the baseline model struggled severely with both detection and tracking, failing to reliably identify safety helmets and suffering from rampant ID fragmentation. This led to a drastic 100% increase in the maximum ID count (from 12 to 24), rendering the tracking output unusable. Our proposed framework, however, proved highly resilient. It achieved stable detection despite the harsh lighting and, most importantly, successfully re-identified targets after prolonged occlusions (e.g., from frame 88 to 104), thereby limiting the maximum ID count to 13 and ensuring high-fidelity trajectory preservation.
In summary, the experimental results from multiple scenarios conclusively demonstrate that the integrated CSPC-BRS framework significantly enhances tracking stability and reliability. It substantially mitigates the critical challenge of identity switches in complex industrial settings, providing a more dependable solution for real-world safety monitoring applications.