Preprint
Article

This version is not peer-reviewed.

A Wavelet-Enhanced Data-Augmented Network for Robust Test Tube Detection in Clinical Workflows

Submitted: 11 April 2026
Posted: 14 April 2026


Abstract
Reliable test-tube detection on clinical conveyor lines remains difficult when tubes are densely packed, placed irregularly, weakly illuminated, partially blurred by robot vibration, and contaminated by glare from glass or PET surfaces. These conditions erode the short-axis boundary cues and faint graduation marks that slender tubes depend on. We therefore build WDA-TNET on YOLOv11 and target the failure modes at four points of the pipeline. First, WGSR restores blurred regions selectively based on wavelet energy, avoiding over-sharpening specular areas. Second, GSCIM suppresses glare-dominated channel responses in the backbone through direction-aware pooling and cross-channel interaction, retaining weak structural cues like liquid-level edges. Third, DCPAF separates height and width encoding in the neck, dynamically balancing long-axis context and short-axis localization suitable for elongated targets. Finally, ATSS and MPDIoU stabilize supervision when positives are sparse and boxes overlap only weakly. We evaluated our model on the newly constructed Complex Test Tube (CTT) dataset containing 11,955 images and 81,044 instances. WDA-TNET achieves 94.1% precision and 79.1% mAP50:95, improving mAP50:95 by 3.6 percentage points over YOLOv11. On the transparent-container HeinSight4 dataset, the model attains 95.2% mAP50:95, proving robust cross-domain generalization.

1. Introduction

With the acceleration of global population aging, healthcare demand continues to grow. Medical expenses are projected to surge, putting unprecedented pressure on traditional labor-intensive dispensing and sorting models. In modern hospital laboratories and home-based care scenarios, routine biological samples such as urine and blood are typically transported in test tubes. These body fluid samples often carry infectious pathogens; for instance, blood samples may contain HIV or Hepatitis B virus, while urine samples may harbor pathogenic bacteria such as Mycobacterium tuberculosis. Manual sorting of these tubes poses significant occupational exposure risks to medical personnel, especially when dealing with non-compliant containers (e.g., unsealed caps or leakage due to breakage). Deploying dispensing robots for repetitive tasks such as drug transport and sorting can reduce labor costs and infection risks and extend service beyond traditional pharmacy hours.
However, the stable operation of such robotic systems depends heavily on the online recognition of medical supplies. Clinical conveyor lines do not provide the clean imaging conditions assumed by generic detectors. Test tubes exhibit transparent materials that lead to severe reflections; their colors and specifications vary; dense arrangements cause mutual occlusion. Lighting inside automated equipment is often weak, while glass and PET surfaces introduce unstable highlights that hide liquid levels, labels, and tube contours. The geometry is also difficult: test tubes are slender, and even a small localization error along the short axis can reduce IoU noticeably. Once vibration from the mobile platform introduces motion blur, barcodes and graduation marks become even harder to preserve.
WDA-TNET is designed around these failure modes rather than around a single generic enhancement block. WGSR restores blurred regions selectively so that specular areas are not sharpened indiscriminately. GSCIM suppresses glare-dominated channel responses in the backbone and retains weak structural cues. DCPAF separates height and width encoding in the neck, which is better suited to elongated targets than isotropic pooling. During training, ATSS and MPDIoU keep supervision stable when positives are sparse and boxes overlap only weakly. The resulting detector is evaluated on the newly established CTT dataset and two cross-domain benchmarks, where it delivers the strongest overall localization performance among the compared models.

3. Methodology

3.1. Overall Architecture

WDA-TNET addresses the aforementioned challenges through targeted improvements in four stages: preprocessing, feature extraction, fusion, and detection supervision. As shown in Figure 1, in the preprocessing stage, we introduce the Wavelet-Guided Selective Restoration (WGSR) algorithm to selectively repair motion blur and glare degradation. By using a degradation mask, restoration is constrained to areas that genuinely require enhancement, avoiding over-processing of clear regions. In the feature extraction stage, we replace the standard C3k2 module with the Glare-Suppressed Channel Interaction Module (GSCIM), which utilizes direction-aware pooling and cross-channel interaction to suppress abnormal high responses caused by reflection. For feature fusion, we substitute the standard SPPF with the Direction-Aware Cross-Stage Pyramid Attention Fusion (DCPAF) module, transforming isotropic pooling into direction-aware 1D encoding to balance receptive fields along the long and short axes of slender targets. Finally, in the detection head, we employ Adaptive Training Sample Selection (ATSS) for dynamic positive/negative sample assignment and combine it with the MPDIoU loss function to maintain effective gradients in low-IoU regimes, ensuring precise bounding box regression for densely packed tubes.

3.2. WGSR: Wavelet-Guided Selective Restoration

When medical robots perform tube picking and delivery tasks, chassis movement inevitably causes visual sensor vibration, resulting in directional motion blur. This temporal integration effect weakens critical localization features of test tube labels. Traditional global deblurring algorithms often amplify background noise and create textures in specular regions when applied to the entire image. To solve this, we propose the Wavelet-Guided Selective Restoration (WGSR) algorithm. Distinguished from global methods, WGSR constructs a three-stage collaborative workflow comprising frequency-domain localization, spatial-domain reconstruction, and gated fusion, as illustrated in Figure 2.

3.2.1. Frequency-Domain Localization

Clear edges typically possess sufficient high-frequency energy, whereas high-frequency components in blurred regions are significantly attenuated. Based on this physical property, WGSR applies a 2D discrete wavelet transform (2D-DWT) to the input image. Using single-level Daubechies wavelets (db2), which effectively capture edge features of slender structures, we decompose the image into a low-frequency approximation component $c_A$ and three high-frequency detail components $\{c_H, c_V, c_D\}$. Figure 3 shows the resulting sub-band decomposition and the corresponding energy map. We summarize these into a unified high-frequency energy map $E_{\text{raw}}$:
$$E_{\text{raw}} = \frac{1}{C} \sum_{c=1}^{C} \left( \lvert c_H^{(c)} \rvert + \lvert c_V^{(c)} \rvert + \lvert c_D^{(c)} \rvert \right),$$
where C denotes the number of channels.
Since the DWT halves spatial resolution, we upsample $E_{\text{raw}}$ to the original resolution $H \times W$ via bilinear interpolation. The upsampled energy map $E_{\text{up}}$ is then normalized instance-wise, and a Sigmoid gate with steepness $k$ converts the normalized response into the degradation mask $M$:
$$M = \mathrm{Sigmoid}\!\left( k \cdot \frac{E_{\text{up}} - \mu(E_{\text{up}})}{\mathrm{std}(E_{\text{up}}) + \varepsilon} \right),$$
where $\mu(\cdot)$ and $\mathrm{std}(\cdot)$ denote the mean and standard deviation, and $\varepsilon = 10^{-8}$ is a stabilizing constant. We set $k = 4.0$ to balance boundary preservation and noise suppression based on statistical analysis of normalized energy distributions. When $k \to \infty$, the gate degenerates into a hard binary segmentation; when $k \to 0$, it loses discrimination. $k = 4.0$ covers approximately one standard deviation below the mean, avoiding over-penalization. Figure 4 illustrates the spatial response under different $k$ values.
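As a concrete illustration, the following is a minimal NumPy/PyWavelets sketch of this frequency-domain localization step. The function name `degradation_mask` and the use of magnitude summation for the energy map are our illustrative assumptions rather than the released implementation.

```python
# Minimal sketch of WGSR frequency-domain localization (Section 3.2.1).
# Assumes a single-level db2 DWT via PyWavelets; names are illustrative.
import numpy as np
import pywt
import cv2

def degradation_mask(image: np.ndarray, k: float = 4.0, eps: float = 1e-8) -> np.ndarray:
    """Compute the degradation mask M from per-channel wavelet energy."""
    h, w = image.shape[:2]
    chans = image.astype(np.float32)
    if chans.ndim == 2:
        chans = chans[..., None]
    energy = None
    for c in range(chans.shape[-1]):
        # Single-level 2D DWT: approximation cA and details (cH, cV, cD).
        _, (cH, cV, cD) = pywt.dwt2(chans[..., c], "db2")
        e = np.abs(cH) + np.abs(cV) + np.abs(cD)
        energy = e if energy is None else energy + e
    e_raw = energy / chans.shape[-1]
    # DWT halves resolution; upsample back to H x W bilinearly.
    e_up = cv2.resize(e_raw, (w, h), interpolation=cv2.INTER_LINEAR)
    # Instance-wise normalization followed by a Sigmoid gate of steepness k.
    z = k * (e_up - e_up.mean()) / (e_up.std() + eps)
    return 1.0 / (1.0 + np.exp(-z))  # M ~ 1 in clear regions, ~ 0 in blurred ones
```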

3.2.2. Spatial Reconstruction and Gated Fusion

The degradation mask identifies regions requiring repair. We design a spatial structure reconstruction branch using a lightweight three-layer convolutional network F SR to selectively recover high-frequency textures:
$$Y = F_{\mathrm{SR}}(I) = \mathrm{Conv}_{1\times1}\big( \mathrm{Conv}_{3\times3}( \mathrm{Conv}_{7\times7}(I) ) \big).$$
Directly replacing the input I with Y would introduce over-sharpening in clear regions. Therefore, we compute the residual R = Y I and use the degradation mask M to spatially weight the injection:
$$I_{\text{out}} = I + \alpha \cdot (1 - M) \odot R,$$
where $\odot$ denotes element-wise multiplication and $\alpha$ is a learnable scalar initialized to 0.2. In clear regions ($M \to 1$), injection is suppressed; in degraded regions ($M \to 0$), details are fully injected. Figure 5 shows the three-stage output comparison on a blurred tube image.
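To make the gated fusion concrete, a hedged PyTorch sketch of the reconstruction branch and the residual injection follows. The layer widths (e.g., 32 channels) are illustrative assumptions, while the 7×7 → 3×3 → 1×1 layout and the 0.2 initialization follow the text above.

```python
# Hedged PyTorch sketch of the spatial reconstruction branch F_SR and the
# mask-gated residual injection. Layer widths are illustrative assumptions.
import torch
import torch.nn as nn

class WGSRRestoration(nn.Module):
    def __init__(self, channels: int = 3, width: int = 32):
        super().__init__()
        # Three-layer reconstruction network: 7x7 -> 3x3 -> 1x1.
        self.f_sr = nn.Sequential(
            nn.Conv2d(channels, width, 7, padding=3), nn.ReLU(inplace=True),
            nn.Conv2d(width, width, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(width, channels, 1),
        )
        # Learnable injection strength, initialized to 0.2 as in the text.
        self.alpha = nn.Parameter(torch.tensor(0.2))

    def forward(self, img: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
        y = self.f_sr(img)        # reconstructed image Y
        residual = y - img        # R = Y - I
        # Inject details only where the mask flags degradation (M -> 0).
        return img + self.alpha * (1.0 - mask) * residual
```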
From a deployment perspective, WGSR functions as an offline data augmentation module during training. The restoration network is pre-trained and frozen to generate enhanced training samples. During inference, the detection network receives raw images without any additional latency.

3.3. DCPAF: Direction-Aware Cross-Stage Pyramid Attention Fusion

Test tube targets exhibit strong directional structures, typically extending vertically. Isotropic pooling operations easily lose fine spatial positional information along the narrow short axis. DCPAF addresses this by decoupling 2D aggregation into two 1D encodings along height and width, as shown in Figure 6. This separates height and width encoding in the neck, preserving position resolution along the short axis while accumulating longer-range context along the long axis.
The input feature X first enters the direction attention branch, performing 1D average pooling along height and width respectively:
$$z_c^h(i) = \frac{1}{W} \sum_{j=0}^{W-1} x_c(i, j), \qquad z_c^w(j) = \frac{1}{H} \sum_{i=0}^{H-1} x_c(i, j).$$
We obtain a recalibrated feature by injecting the direction weights back into the input feature:
$$X_{\text{att}} = X \odot \sigma\big( \varphi^h(z^h) \big) \odot \sigma\big( \varphi^w(z^w) \big),$$
where $\sigma(\cdot)$ is the Sigmoid function, and $\varphi^h(\cdot)$ and $\varphi^w(\cdot)$ are lightweight mapping branches. We next apply a lightweight convolutional re-encoding to stabilize representations before pooling:
$$X_1 = \mathrm{Conv}_{1\times1}\big( \mathrm{Conv}_{3\times3}( \mathrm{Conv}_{1\times1}(X_{\text{att}}) ) \big).$$
For multi-scale context extraction, DCPAF chains 5 × 5 max-pooling to expand the effective receptive field at low cost without losing cross-stage details:
$$P_1 = \mathrm{MP}_5(X_1), \quad P_2 = \mathrm{MP}_5(P_1), \quad P_3 = \mathrm{MP}_5(P_2).$$
The pooled branches are concatenated with the identity branch and compressed by a 1 × 1 convolution:
$$Y_b = \mathrm{Conv}_{1\times1}\big( \mathrm{Concat}(X_1, P_1, P_2, P_3) \big).$$
Finally, we inject a cross-stage short path to preserve details and facilitate gradient flow:
$$Y = \mathrm{Conv}_{1\times1}\big( \mathrm{Concat}( \mathrm{Conv}_{3\times3}(Y_b), \mathrm{Conv}_{1\times1}(X_{\text{att}}) ) \big).$$
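A hedged PyTorch sketch of DCPAF following the formulas above is given below; the reduction ratio in the mapping branches $\varphi^h$ and $\varphi^w$ and the BatchNorm/SiLU choices are illustrative assumptions.

```python
# Illustrative PyTorch sketch of DCPAF. Branch widths and the reduction
# ratio in the direction-attention maps are assumptions, not the paper's code.
import torch
import torch.nn as nn

def conv_bn_act(c_in, c_out, k):
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, k, padding=k // 2, bias=False),
        nn.BatchNorm2d(c_out), nn.SiLU(inplace=True))

class DCPAF(nn.Module):
    def __init__(self, c: int, reduction: int = 8):
        super().__init__()
        mid = max(c // reduction, 8)
        # Lightweight mapping branches phi_h, phi_w for the 1D descriptors.
        self.phi_h = nn.Sequential(nn.Conv2d(c, mid, 1), nn.SiLU(), nn.Conv2d(mid, c, 1))
        self.phi_w = nn.Sequential(nn.Conv2d(c, mid, 1), nn.SiLU(), nn.Conv2d(mid, c, 1))
        # Re-encoding: Conv1x1 -> Conv3x3 -> Conv1x1.
        self.re_encode = nn.Sequential(conv_bn_act(c, c, 1), conv_bn_act(c, c, 3), conv_bn_act(c, c, 1))
        self.pool = nn.MaxPool2d(5, stride=1, padding=2)  # chained 5x5 max-pooling
        self.compress = conv_bn_act(4 * c, c, 1)
        self.refine = conv_bn_act(c, c, 3)
        self.skip = conv_bn_act(c, c, 1)                  # cross-stage short path
        self.out = conv_bn_act(2 * c, c, 1)

    def forward(self, x):
        # Direction attention: 1D average pooling along width and height.
        z_h = x.mean(dim=3, keepdim=True)                 # (B, C, H, 1)
        z_w = x.mean(dim=2, keepdim=True)                 # (B, C, 1, W)
        x_att = x * torch.sigmoid(self.phi_h(z_h)) * torch.sigmoid(self.phi_w(z_w))
        x1 = self.re_encode(x_att)
        p1 = self.pool(x1); p2 = self.pool(p1); p3 = self.pool(p2)
        y_b = self.compress(torch.cat([x1, p1, p2, p3], dim=1))
        return self.out(torch.cat([self.refine(y_b), self.skip(x_att)], dim=1))
```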

3.4. GSCIM: Glare-Suppressed Channel Interaction Module

Test tube images suffer from specular reflections and texture interference on glass surfaces, which degrade deeper feature maps during downsampling. Standard convolution treats all channels equally, lacking explicit modeling of inter-channel semantics. Consequently, abnormal high responses induced by glare are amplified.
To address this, we propose the Glare-Suppressed Channel Interaction Module (GSCIM), whose structure is shown in Figure 7. Instead of relying on global average pooling alone, GSCIM combines direction-aware coordinate pooling with grouped channel interaction so that highlight-dominated responses can be suppressed before they spread to later fusion stages.
Let the input feature be X. GSCIM uses two independent 1 × 1 convolutions to split the input into two parallel paths:
$$X_1 = \mathrm{Conv}_{1\times1}^{(1)}(X), \quad X_2 = \mathrm{Conv}_{1\times1}^{(2)}(X).$$
$X_1$ enters the transformation branch, passing through a cascade of $3\times3$ convolutions and ECA channel-weighting units. The ECA unit compresses each channel into a scalar via global average pooling, $v_c = \frac{1}{HW}\sum_{i,j} X_c(i, j)$, and models local dependencies across adjacent channels using a 1D convolution:
$$w = \sigma\big( \mathrm{Conv1D}_{k_s}(v) \big),$$
where the kernel size $k_s$ is adaptively determined by the channel dimension. The weights $w$ are multiplied element-wise with the input features for channel response recalibration. The refined features are then concatenated with the shallow preservation branch $X_2$ and compressed via a $1\times1$ convolution.
GSCIM is deployed after each downsampling stage in the backbone to perform joint screening of spatial positions and channels at early stages, preventing the propagation of invalid glare information.
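For illustration, a minimal PyTorch sketch of GSCIM is shown below; the ECA kernel-size rule follows the published ECA-Net heuristic, while the depth of the transformation branch and the half-channel split are our assumptions.

```python
# Hedged sketch of GSCIM. The number of 3x3/ECA stages in the
# transformation branch is an assumption.
import math
import torch
import torch.nn as nn

class ECA(nn.Module):
    """Efficient Channel Attention: GAP + adaptive-kernel 1D convolution."""
    def __init__(self, channels: int, gamma: int = 2, b: int = 1):
        super().__init__()
        t = int(abs((math.log2(channels) + b) / gamma))
        k = t if t % 2 else t + 1                # odd kernel size k_s
        self.conv = nn.Conv1d(1, 1, k, padding=k // 2, bias=False)

    def forward(self, x):
        v = x.mean(dim=(2, 3))                   # v_c: GAP over H, W
        w = torch.sigmoid(self.conv(v.unsqueeze(1))).squeeze(1)
        return x * w[:, :, None, None]           # channel recalibration

class GSCIM(nn.Module):
    def __init__(self, c: int):
        super().__init__()
        self.split1 = nn.Conv2d(c, c // 2, 1)    # transformation path
        self.split2 = nn.Conv2d(c, c // 2, 1)    # shallow preservation path
        self.transform = nn.Sequential(
            nn.Conv2d(c // 2, c // 2, 3, padding=1), nn.SiLU(), ECA(c // 2),
            nn.Conv2d(c // 2, c // 2, 3, padding=1), nn.SiLU(), ECA(c // 2))
        self.fuse = nn.Conv2d(c, c, 1)

    def forward(self, x):
        x1, x2 = self.split1(x), self.split2(x)
        return self.fuse(torch.cat([self.transform(x1), x2], dim=1))
```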

3.5. Training Supervision: ATSS and MPDIoU

3.5.1. Adaptive Training Sample Selection (ATSS)

Traditional static IoU thresholds struggle with densely packed and severely occluded tubes, where predicted boxes fluctuate wildly in early training epochs. ATSS dynamically constructs a positive-sample threshold based on the statistical distribution of candidates. Figure 8 illustrates where ATSS intervenes in the detection head.
The overall positive/negative sample assignment procedure is visualized in Figure 9. ATSS proceeds through five steps: center-distance ranking, per-level candidate selection, cross-level merging, IoU statistics, and threshold-based labeling with conflict resolution.
Formally, the center distance between candidate c and ground truth g is
$$d(c, g) = \big\lVert p(c) - p(g) \big\rVert_2,$$
where $p(\cdot)$ denotes the center coordinates. On feature level $l$, the $k$ nearest candidates by center distance form the per-level subset $S_g^l = \operatorname{TopK}_k\{\, c \in \mathcal{C}_l : d(c, g) \,\}$. These are merged across all levels into the candidate set
$$\mathcal{C}_g = \bigcup_{l} S_g^l.$$
The IoU values of all candidates in $\mathcal{C}_g$ with respect to $g$ yield a distribution $I_g$; its mean $\mu_g$ and standard deviation $\sigma_g$ define the adaptive threshold
$$t_g = \mu_g + \sigma_g.$$
A positive assignment requires satisfying both the IoU threshold and the geometric constraint that the candidate center falls within the ground truth box:
$$\mathrm{Pos}(c, g) = \mathbb{I}\big( \mathrm{IoU}(c, g) \ge t_g \big) \cdot \mathbb{I}\big( \mathrm{center}(c) \in g \big).$$
The indicator product enforces a strict logical AND: a candidate is accepted only when both conditions hold, which eliminates rack-edge candidates that overlap a tube region but whose center lies outside. When tubes are tightly packed, the same candidate may satisfy multiple ground truths simultaneously. This conflict is resolved by
$$g^{*}(c) = \arg\max_{g \in \mathcal{G}_c} \mathrm{IoU}(c, g),$$
where $\mathcal{G}_c$ is the set of ground truths that claim $c$ as a positive. Assigning each candidate to the single most geometrically compatible ground truth preserves one-to-one supervision and avoids label ambiguity at tube boundaries.
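A simplified sketch of the assignment statistics, assuming precomputed IoU and center-in-box tensors, is given below; real implementations also handle per-level top-$k$ selection and batching.

```python
# Simplified, hedged sketch of the ATSS assignment statistics. `ious` holds
# IoU(candidate, gt) and `centers_in_box` the geometric test; candidate
# indices are assumed to be merged across feature levels already.
import torch

def atss_assign(ious: torch.Tensor, candidate_idx: torch.Tensor,
                centers_in_box: torch.Tensor) -> torch.Tensor:
    """
    ious:           (num_gt, num_anchors) IoU matrix.
    candidate_idx:  (num_gt, k_total) long indices of center-distance candidates.
    centers_in_box: (num_gt, num_anchors) boolean center-in-gt test.
    Returns the assigned gt index per anchor (-1 = negative).
    """
    cand_ious = ious.gather(1, candidate_idx)            # IoU distribution I_g
    t_g = cand_ious.mean(dim=1) + cand_ious.std(dim=1)   # adaptive threshold
    pos = (ious >= t_g[:, None]) & centers_in_box        # strict logical AND
    # Restrict positives to the candidate set.
    in_cand = torch.zeros_like(pos)
    in_cand.scatter_(1, candidate_idx, True)
    pos &= in_cand
    # Conflict resolution: each anchor keeps the gt with the highest IoU.
    masked = torch.where(pos, ious, torch.full_like(ious, -1.0))
    best_iou, best_gt = masked.max(dim=0)
    best_gt[best_iou < 0] = -1
    return best_gt
```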

3.5.2. MPDIoU Loss Function

Test tube racks often contain mixed specifications (e.g., 5 ml, 15 ml, 50 ml). A few pixels of width deviation on a 5 ml tube causes a drastic IoU drop, whereas significant length deviation on a 50 ml tube may be masked by a large intersection area. Traditional IoU losses cannot distinguish these error types effectively. When boxes do not overlap, IoU is zero and gradients vanish.
MPDIoU translates regression into corner-point matching, directly measuring the Euclidean distances of the top-left and bottom-right corners, normalized by the diagonal of the minimum enclosing rectangle. The geometric interpretation is shown in Figure 10.
$$d_1^2 = (x_1^{\mathrm{prd}} - x_1^{\mathrm{gt}})^2 + (y_1^{\mathrm{prd}} - y_1^{\mathrm{gt}})^2,$$
$$d_2^2 = (x_2^{\mathrm{prd}} - x_2^{\mathrm{gt}})^2 + (y_2^{\mathrm{prd}} - y_2^{\mathrm{gt}})^2.$$
The complete MPDIoU loss is defined as:
$$\mathcal{L}_{\mathrm{MPDIoU}} = 1 - \mathrm{IoU} + \frac{d_1^2 + d_2^2}{w^2 + h^2},$$
where $w$ and $h$ denote the width and height of the minimum enclosing rectangle.
This formulation maintains effective gradients even when boxes do not overlap, and evenly penalizes deviations regardless of the absolute scale.
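A minimal PyTorch sketch of this loss, with boxes in $(x_1, y_1, x_2, y_2)$ format and the enclosing-rectangle normalization described above, is given below.

```python
# Minimal PyTorch sketch of the MPDIoU loss as defined above. `w` and `h`
# are taken from the minimum enclosing rectangle, following the text.
import torch

def mpdiou_loss(pred: torch.Tensor, target: torch.Tensor, eps: float = 1e-7) -> torch.Tensor:
    # Standard IoU.
    inter_x1 = torch.max(pred[:, 0], target[:, 0])
    inter_y1 = torch.max(pred[:, 1], target[:, 1])
    inter_x2 = torch.min(pred[:, 2], target[:, 2])
    inter_y2 = torch.min(pred[:, 3], target[:, 3])
    inter = (inter_x2 - inter_x1).clamp(0) * (inter_y2 - inter_y1).clamp(0)
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_t = (target[:, 2] - target[:, 0]) * (target[:, 3] - target[:, 1])
    iou = inter / (area_p + area_t - inter + eps)
    # Corner distances d1 (top-left) and d2 (bottom-right).
    d1 = (pred[:, 0] - target[:, 0]) ** 2 + (pred[:, 1] - target[:, 1]) ** 2
    d2 = (pred[:, 2] - target[:, 2]) ** 2 + (pred[:, 3] - target[:, 3]) ** 2
    # Diagonal of the minimum enclosing rectangle for normalization.
    enc_w = torch.max(pred[:, 2], target[:, 2]) - torch.min(pred[:, 0], target[:, 0])
    enc_h = torch.max(pred[:, 3], target[:, 3]) - torch.min(pred[:, 1], target[:, 1])
    return (1 - iou + (d1 + d2) / (enc_w ** 2 + enc_h ** 2 + eps)).mean()
```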

4. Experiments

4.1. Datasets

To evaluate the model under clinically relevant conditions, we constructed the Complex Test Tube (CTT) dataset for hospital laboratory sorting. The dataset contains 11,955 images and 81,044 instances, split into 9,797 for training, 1,079 for validation, and 1,079 for testing. Collection scenes fall into two stages: pre-sorting (mixed tubes in disorganized layouts) and rack-loaded (tubes positioned for scanning), as shown in Figure 11. Both stages were captured in hospital laboratory and ward sampling environments to maximize scene diversity.
Annotation was performed with the open-source tool CVAT, which supports key-frame extraction and semi-automatic interpolation for video clips. We annotated 15 categories across four dimensions to support comprehensive sample analysis:
1. Physical Features (6 classes): 5 ml, 15 ml, and 50 ml specifications for both tube bodies and caps.
2. Internal Sample Status (3 classes): blood, urine, and empty, to support coarse sample pre-screening.
3. External Labels (1 class): barcode region, providing visual support for downstream robotic scanning.
4. Cap Colors (5 classes): blue, grey, green, purple, red, and yellow cap colors encode additive types in clinical practice (e.g., red caps indicate immunological assays with higher biosafety requirements).
Figure 12 illustrates representative annotations across the four dimensions.
For cross-domain evaluation, we use HeinSight4.0 and VisDrone [28,29]. HeinSight4.0 [28] is a transparent-container dataset published by El-Khawaldeh et al. on Zenodo, containing 6,031 images extracted from chemical experiment videos. Five annotation categories cover air (empty and residual-gas), liquid (clear solution and turbid liquid), and solid (particles or precipitates suspended in liquid). All images were annotated manually with bounding boxes and divided into training and validation sets at a 9:1 ratio. The dataset shares key visual challenges with CTT—specular highlights from transparent walls, blurred phase boundaries, and liquid-level ambiguity—making it a suitable external benchmark for cross-domain generalization. Figure 13 shows examples from each category.

4.2. Data Augmentation

To improve robustness, we designed a multi-level augmentation pipeline.

4.2.1. Primary Augmentation

We implement basic geometric and photometric transforms with Albumentations [30]. Horizontal/vertical flips, shift-scale-rotate operations, brightness/contrast perturbations, and gamma adjustment are used to cover viewpoint changes and low-light variation. We also retain Mosaic [31] and MixUp [32] to simulate dense occlusion. Figure 14 shows representative examples of eight primary augmentation operations applied to tube images.
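A hedged Albumentations sketch of this primary stage is given below; the probabilities and parameter ranges are illustrative, and the composition-level strategies discussed next are applied separately by the detector's dataloader.

```python
# Hedged Albumentations sketch of the primary augmentation stage. The
# probabilities and parameter ranges are illustrative assumptions; Mosaic,
# MixUp, and Copy-Paste are applied by the detector's dataloader instead.
import albumentations as A

primary_transform = A.Compose(
    [
        A.HorizontalFlip(p=0.5),
        A.VerticalFlip(p=0.2),
        A.ShiftScaleRotate(shift_limit=0.05, scale_limit=0.1, rotate_limit=10, p=0.5),
        A.RandomBrightnessContrast(brightness_limit=0.2, contrast_limit=0.2, p=0.5),
        A.RandomGamma(gamma_limit=(80, 120), p=0.3),
    ],
    # Keep YOLO-format boxes consistent under every geometric transform.
    bbox_params=A.BboxParams(format="yolo", label_fields=["class_labels"]),
)
```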
Beyond single-image transforms, we additionally apply three composition-level strategies during training. Mosaic [31] stitches four images into a single training sample, simultaneously exposing the model to multi-scale tube arrangements and dense background clutter. MixUp [32] fuses two images and their annotations at a fixed ratio, strengthening robustness to partial occlusion. Copy-Paste [31] transfers tube instances between images and is particularly effective for rare cap-color categories. Figure 15 shows representative outputs from these three composition strategies.

4.2.2. Synthetic Patch Augmentation

To address sparse scenarios, we developed a synthetic patch augmentation based on GrabCut, which iteratively estimates GMM parameters for foreground and background to find the minimum cut. To eliminate jagged edges, we apply morphological opening ($M_{\text{open}} = (M_{\text{init}} \ominus K) \oplus K$) followed by Gaussian feathering, then use alpha blending to seamlessly embed the tube foreground into complex backgrounds. The complete pipeline is illustrated in Figure 16, with the feathered mask detail shown in Figure 17.
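A minimal OpenCV sketch of this pipeline, with illustrative kernel and blur sizes, is given below.

```python
# Hedged OpenCV sketch of the synthetic patch pipeline: GrabCut segmentation,
# morphological opening, Gaussian feathering, and alpha blending. Kernel and
# blur sizes are illustrative assumptions.
import cv2
import numpy as np

def composite_tube(src, bbox, background, iterations: int = 5):
    """Cut the tube inside `bbox` (x, y, w, h) out of `src` and paste it
    onto `background` of the same size."""
    mask = np.zeros(src.shape[:2], np.uint8)
    bgd, fgd = np.zeros((1, 65), np.float64), np.zeros((1, 65), np.float64)
    cv2.grabCut(src, mask, bbox, bgd, fgd, iterations, cv2.GC_INIT_WITH_RECT)
    m_init = np.where((mask == cv2.GC_FGD) | (mask == cv2.GC_PR_FGD), 255, 0).astype(np.uint8)
    # Morphological opening removes jagged edge artifacts.
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (5, 5))
    m_open = cv2.morphologyEx(m_init, cv2.MORPH_OPEN, kernel)
    # Gaussian feathering yields a soft alpha channel for seamless blending.
    alpha = cv2.GaussianBlur(m_open, (11, 11), 0).astype(np.float32) / 255.0
    alpha = alpha[..., None]
    return (alpha * src + (1.0 - alpha) * background).astype(np.uint8)
```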

4.2.3. Density-Driven Adaptive Blur

To simulate non-uniform motion blur caused by robot acceleration, we compute the spatial distribution density grid $F$ and its variation coefficient $\rho$:
$$\rho = \frac{\mathrm{std}(F)}{\mathrm{mean}(F) + \epsilon}.$$
The blur kernel size $\sigma_{\text{blur}}$ is adaptively scaled by $\rho$:
$$\sigma_{\text{blur}} = \sigma_{\min} + N(\rho) \cdot (\sigma_{\max} - \sigma_{\min}).$$
Sparse areas trigger stronger blur to simulate high-speed robot movements. Figure 18 shows two typical samples with the density-driven blur applied at different intensity levels.
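A hedged sketch of the density-driven blur is given below; the grid resolution, the kernel-size range, and the implementation of $N(\cdot)$ as a clipping of $\rho$ to $[0, 1]$ are our assumptions.

```python
# Hedged sketch of the density-driven adaptive blur. Grid resolution and
# min/max kernel sizes are assumptions; N(.) is realized here as a simple
# clipping of rho to [0, 1].
import cv2
import numpy as np

def adaptive_motion_blur(image, boxes, grid=8, k_min=3, k_max=15, eps=1e-6):
    """`boxes` are (cx, cy) tube centers in pixels; blur strength grows with
    the variation coefficient rho of the spatial density grid F."""
    h, w = image.shape[:2]
    F = np.zeros((grid, grid), np.float32)
    for cx, cy in boxes:
        F[min(int(cy / h * grid), grid - 1), min(int(cx / w * grid), grid - 1)] += 1
    rho = F.std() / (F.mean() + eps)           # variation coefficient
    n = float(np.clip(rho, 0.0, 1.0))          # N(rho): assumed [0, 1] clipping
    k = int(k_min + n * (k_max - k_min)) | 1   # odd kernel size
    # Horizontal motion-blur kernel approximating conveyor-direction shake.
    kernel = np.zeros((k, k), np.float32)
    kernel[k // 2, :] = 1.0 / k
    return cv2.filter2D(image, -1, kernel)
```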

4.3. Implementation Details

Training is performed for 200 epochs with a batch size of 16 and 8 dataloader workers on an RTX 4060 Ti GPU. We use AdamW with an initial learning rate of 0.01 and a cosine decay schedule.

4.4. Results on the CTT Dataset

Table 1 compares representative detectors on CTT.
Table 1 shows that WDA-TNET attains the highest precision (94.08%) and the highest mAP@50:95 (79.08%) among the compared models. Compared to YOLOv11, the stricter mAP@50:95 improves by 3.58 percentage points. This matters because the stricter metric is more sensitive to box tightness on slender tubes. The gain confirms that DCPAF and MPDIoU improve localization stability rather than merely increasing coarse detections at an IoU threshold of 0.5.

4.5. Cross-Domain Evaluation

HeinSight4. This dataset contains transparent-container scenes with strong reflection, refraction, and blurred phase boundaries.
Table 2. Performance comparison on HeinSight4.
Model Precision (%) Recall (%) mAP@50 (%) mAP@50:95 (%) FPS
Ours 97.73 94.99 97.74 95.17 131.05
YOLOv12 [36] 97.91 93.80 97.87 94.97 195.71
Improved YOLOv8 for pipette tips [15] 96.95 94.72 98.05 95.16 60.22
Ghost-YOLOv4 liquid-level [10] 96.75 95.39 98.04 95.47 115.15
YOLOv11 [34] 96.43 93.57 97.72 94.81 55.72
YOLOv9t [35] 95.55 92.03 97.72 94.68 69.55
YOLO aliquoting baseline [37] 96.64 93.81 97.46 94.95 83.02
YOLO liquid-level baseline 95.21 93.16 96.98 93.30 28.43
TubeDet-YOLO [13] 93.99 94.70 97.38 94.16 30.90
Tube baseline (improved) 90.90 93.58 96.46 92.94 102.63
Internal variant-2 95.43 93.46 96.99 93.89 111.05
On HeinSight4, WDA-TNET reaches 95.17% mAP@50:95 and 94.99% recall. The transfer result suggests that WGSR and GSCIM learn feature corrections that remain helpful in general transparent-object scenes, not just overfitting to clinical tubes.
VisDrone. We further test cross-domain generalization on VisDrone, a dense small-object dataset with complex backgrounds.
Table 3. Performance comparison on VisDrone.
Model Precision Recall mAP@50 mAP@50:95 Params (M) GFLOPs Size (MB)
Ours 48.94 36.63 37.33 21.85 9.49 22.99 18.38
YOLOv9t 43.14 32.33 32.33 18.78 2.01 7.86 8.48
YOLOv8 42.61 32.41 31.70 18.09 3.01 8.20 5.95
YOLOv12 42.00 32.83 32.01 18.48 2.57 6.49 5.26
YOLOv11 40.19 30.26 29.57 16.70 2.59 6.45 5.20
VisDrone is harsher on small-object density and background clutter. WDA-TNET still leads the comparison with 48.94% precision and 21.85% mAP@50:95, indicating that selective restoration and adaptive sample assignment remain useful when objects become smaller and more numerous.

4.6. Ablation on Backbone Block Variants

We evaluate replacing the backbone baseline block (C3k2) with five attention and normalization alternatives alongside our GSCIM on the CTT dataset.
Table 4. Ablation study on C3k2 backbone variants (CTT dataset).
Variant mAP@50 (%) mAP@50:95 (%) Params (M) FPS
C3k2+MS [38] 90.90 74.49 2.59 62.93
C3k2+RCM [39] 90.85 74.85 2.80 76.52
C3k2+DW [40] 90.27 74.18 2.73 62.52
C3k2+CBAM [41] 89.73 74.26 2.55 58.93
C3k2+GN [42] 86.40 67.00 2.90 70.90
Ours (GSCIM) 91.00 75.60 2.90 73.40
GSCIM achieves the best mAP@50 (91.0%) and mAP@50:95 (75.6%) in this group. The key reason is that, rather than simply amplifying high responses, it first establishes a stable background reference via global channel recalibration to flag glare-dominated activations as anomalies, then preserves geometrically continuous low-energy cues such as liquid-level edges through a short-cut path, and finally applies spatial reweighting to redirect responses toward real boundaries. This staged mechanism keeps the high-IoU localization cues intact without relying on a larger model. C3k2+GN flattens the feature distribution and eliminates the weak energy contrast between tube walls and liquid surfaces, causing mAP@50 to drop to 86.4%. Figure 19 shows precision, recall, and FLOPs across all six variants.

4.7. Detection Head Comparison

We compare six representative detection heads under identical backbone and training settings to justify the choice of ATSS.
Table 5. Comparison of detection head variants on the CTT dataset.
Head mAP@50 (%) mAP@50:95 (%) Params (M) Latency (ms)
YOLOX Head [43] 90.40 74.30 9.40 8.80
DY Head [44] 90.30 74.20 2.70 16.00
FCOS Head [45] 90.00 73.70 3.10 14.90
Decoupled Detect [33] 89.90 73.00 2.80 14.70
Lightweight Efficient 89.90 74.70 2.20 12.10
ATSS Head [23] 90.90 75.90 3.10 10.30
ATSS Head achieves the highest mAP@50 (90.9%) and mAP@50:95 (75.9%) with a latency of only 10.3 ms. YOLOX Head requires 9.4 M parameters, more than three times the 3.1 M of the ATSS head, yet offers no accuracy advantage. DY Head produces adaptive convolution kernels per input but incurs 16.0 ms latency, the highest in the group. FCOS and Decoupled Detect decouple classification and regression to different extents; however, when the coupling is reduced beyond a certain degree, the shared structural evidence for boundary discrimination weakens, and mAP@50:95 falls below 74%. Lightweight Efficient keeps parameters low at 2.2 M but cannot fully model the slender-tube short-axis boundary, leaving mAP@50:95 at 74.7%. Figure 20 visualizes precision, recall, and FLOPs across all six heads.

4.8. SPPF Module Ablation

We compare DCPAF against seven SPPF variants that represent representative directions in lightweight spatial pooling design.
Table 6. Comparison of SPPF variants and DCPAF on the CTT dataset.
Module P (%) R (%) F1 (%) mAP@50 (%) mAP@50:95 (%) FPS
SPPF (baseline) 92.1 87.4 89.7 90.8 75.5 73.4
SimConvSPPF [46] 92.5 88.8 90.6 91.7 77.6 94.2
LSKA_SPPF [47] 92.7 89.4 91.0 91.7 77.8 90.6
FocalModulation [48] 92.5 88.2 90.3 91.3 77.0 85.5
Mamba_SPPF [49] 92.5 89.3 90.9 91.6 77.6 92.3
GCBlock_SPPF [50] 93.6 88.6 91.0 91.8 77.5 88.9
UniRepLKNetSPPF [51] 92.4 89.8 91.1 92.0 78.0 76.9
DCPAF (Ours) 94.1 88.9 91.4 92.8 79.1 92.3
DCPAF achieves the best mAP@50 (92.8%), mAP@50:95 (79.1%), and F1 (91.4%) at 92.3 FPS. Other variants strengthen channel recalibration or expand spatial receptive fields, yet they retain isotropic pooling concatenation, which discards directional and positional differences between feature locations during aggregation. DCPAF applies direction-aware encoding to each pooling scale before concatenation, so horizontal liquid-level features and vertical tube-boundary features already carry explicit directional labels when they are merged. The cross-stage alignment mechanism further ensures semantic consistency across scales in the channel dimension, producing a clear advantage under strict IoU thresholds. GCBlock_SPPF achieves the highest precision (93.6%) through global context suppression of false positives, but its additive global feedback lacks local spatial modeling, limiting mAP@50:95 to 77.5%. Figure 21 summarizes mAP@50:95 and FPS for all variants.

4.9. Loss Function Comparison

To verify the regression effectiveness, we compare eight bounding-box regression losses under identical training strategies.
Table 7. Comparison of bounding-box regression losses on the CTT dataset.
Loss Precision (%) Recall (%) mAP@50 (%) mAP@50:95 (%)
GIoU [24] 92.82 89.25 92.31 79.08
CIoU [25] 93.50 89.18 92.20 78.79
VarifocalLoss [52] 92.80 88.50 91.50 76.30
WIoU [26] 92.84 86.53 90.96 72.07
IoU-Focaler 91.80 89.50 91.20 77.10
IoU-Slide 90.40 88.80 90.20 75.10
IoU-FocalIoU 93.40 89.20 92.20 78.70
MPDIoU [27] 94.10 88.90 92.80 79.10
MPDIoU achieves the best precision (94.1%), mAP@50 (92.8%), and mAP@50:95 (79.1%). The critical property is that it reformulates regression as corner-point alignment: even when prediction and ground truth do not overlap, the corner distances provide directional gradients, preventing the loss from stalling during early training. For tubes in mixed-specification racks, the diagonal normalization factor scales with the enclosing rectangle, so a few-pixel width error on a 5 ml tube and a tens-of-pixel length error on a 50 ml tube receive equivalent penalty weights. CIoU adds center-distance and aspect-ratio constraints on top of GIoU but lacks explicit corner alignment; when adjacent tubes share nearly identical centers, these constraints cannot distinguish boundary position errors, capping mAP@50:95 at 78.79%. WIoU applies asymmetric weighting but suppresses the gradient contribution of boundary-contrast-weak tubes in dense layouts, causing recall to drop to 86.5% and mAP@50:95 to fall to 72.1%. Focal-based variants (IoU-Focaler, IoU-FocalIoU) direct attention to hard samples without introducing geometric correction, so improvements over GIoU remain small. MPDIoU addresses the dominant bottleneck in this task—corner-level geometric alignment—rather than sample reweighting, which explains why it outperforms all alternatives.

5. Conclusions

This paper presented WDA-TNET, a test-tube detector designed around the four coupled failure modes encountered on clinical conveyor lines: motion blur, glass glare, slender geometry, and dense packing. Rather than applying a single enhancement block, the design distributes the correction burden across four stages. WGSR targets motion blur through wavelet-energy-guided selective restoration, confining sharpening to degraded regions so that specular highlights are not amplified. GSCIM addresses glare-dominated channel responses in the backbone through direction-aware pooling and local cross-channel interaction, preserving liquid-level edge cues that isotropic attention tends to suppress. DCPAF replaces isotropic SPPF pooling with decoupled height-and-width encoding and a cross-stage short path, maintaining short-axis positional resolution while aggregating long-axis context. ATSS and MPDIoU stabilize supervision when positives are sparse and box overlap is weak, with corner-distance constraints providing non-zero gradients throughout early training.
On the newly constructed CTT dataset (11,955 images, 81,044 instances), WDA-TNET reaches 94.1% precision and 79.1% mAP@50:95, improving mAP@50:95 by 3.6 percentage points over YOLOv11. Ablation experiments confirm that each module contributes complementary gains: WGSR improves border visibility in degraded frames, GSCIM suppresses glare noise at the feature level, DCPAF brings the largest single-step gain in high-IoU localization (mAP@50:95 from 75.5% to 79.1%, Table 6), and MPDIoU strengthens mid-range IoU regression quality. Cross-domain evaluation on HeinSight4 (95.2% mAP@50:95) and VisDrone (21.85% mAP@50:95) confirms that the learned corrections generalize beyond the training distribution.
Current limitations are twofold. First, the model's 8.46 M parameters (Table 1) still restrict direct deployment on low-power edge boards; future work will investigate structured pruning and INT8 quantization to bring WDA-TNET within the operational envelope of embedded controllers. Second, multi-task joint learning that simultaneously outputs tube type, liquid level, and barcode region, rather than treating each as a separate pipeline stage, would reduce system latency and support fully autonomous robotic sorting.

Institutional Review Board Statement

Not applicable.

Data Availability Statement

The CTT dataset and trained model weights described in this study are available from the corresponding author upon reasonable request.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Zhao, Z.Z. Clinical Application and Effect Evaluation of Blood Specimen Tube Rack and Tube Identification Alarm System. China Health Stand. Manag. 2017, 8, 123–125. [Google Scholar]
  2. Jing, Y. An Automated Test Tube Identification and Sorting System Based on Machine Vision. In Proceedings of the Third International Conference on Biomedical and Intelligent Systems (IC-BIS 2024), SPIE; SPIE: Bellingham, WA, USA, 2024; Vol. 13208, p. 132081K. [Google Scholar]
  3. Liu, C.; Dong, J.; Lu, Q.; et al. High-Precision Serum Level Detection Method Based on HSV Color Space. Chin. J. Sci. Instrum. 2020, 41, 78–86. [Google Scholar]
  4. Dong, J. Research on Test Tube Liquid Level Detection Method Based on Machine Vision. Agric. Equip. Veh. Eng. 2021, 59, 108–115. [Google Scholar]
  5. Zhang, W.; Li, S.; Long, T.; et al. Real-Time Detection of Test Tube Position Based on Image Processing. Softw. Eng. Appl. 2022, 11, 425–434. [Google Scholar]
  6. Balia, R.; Barra, S.; Podda, A.S.; Pompianu, L.; Sangiovanni, M.; Fenu, G. Automated Classification of Test Tubes Based on Uncontrolled Image Analysis. In Proceedings of the 2024 IEEE International Conference on Metrology for eXtended Reality, Artificial Intelligence and Neural Engineering; IEEE: St Julians, Malta, 2024; pp. 1159–1164. [Google Scholar]
  7. Xu, T. Test Tube Counting Based on Hough Transform and Convolutional Neural Network. Electron. Technol. Softw. Eng. 2020, 135–136. [Google Scholar] [CrossRef]
  8. Liu, S.; Lin, J.; Chen, Z.; Zou, Z. Data Matrix Code Recognition Method for Test Tube–Rack System Based on Mask R-CNN. J. Fujian Univ. Technol. 2023, 21, 378–384. [Google Scholar]
  9. Şişman, A.R.; Başok, B.İ.; Karakoyun, İ.; Çolak, A.; Bilge, U.; Demirci, F.; Başoğlu, N. Measuring the Performance of an Artificial Intelligence-Based Robot That Classifies Blood Tubes and Performs Quality Control in Terms of Preanalytical Errors: A Preliminary Study. Am. J. Clin. Pathol. 2024, 161, 553–560. [Google Scholar] [CrossRef]
  10. Liang, J.F.; Domingo Palaoag, T. Develop a Liquid Level Detection Algorithm for Infusion Bottles Based on Ghost-YOLOv4. In Proceedings of the 2024 7th International Conference on Computer Information Science and Application Technology (CISAT); IEEE: Chengdu, China, 2024; pp. 211–215. [Google Scholar]
  11. Wu, Z.; Tan, F.; Li, D.; Tang, Z. Test Tube Sample Liquid Level Detection Based on YOLOv5. China Med. Devices 2023, 38, 61–67. [Google Scholar] [CrossRef]
  12. Ren, K.; Tao, Q. Test Tube Detection Algorithm Based on YOLOv5. Mod. Comput. 2022, 28, 1–8. [Google Scholar] [CrossRef]
  13. Liu, X.; Xie, S.; Jin, X.; Bian, K. TubeDet-YOLO: Edge-Based Detector for Color-Coded Vacuum Blood Collection Tube. In Proceedings of the 2025 10th International Conference on Automation, Control and Robotics Engineering (CACRE); IEEE: Dalian, China, 2025; pp. 198–203. [Google Scholar]
  14. Zhang, K.; Peng, Y.; Wang, Y.; Tang, J. Test Tube and Liquid Level Recognition Algorithm Based on Improved YOLOv8. J. Comput. Appl. 2024, S2. [Google Scholar]
  15. Yin, Y.; Lei, J.; Tao, W. Detection of Liquid Retention on Pipette Tips in High-Throughput Liquid Handling Workstations Based on Improved YOLOv8 Algorithm with Attention Mechanism. Electronics 2024, 13, 2836. [Google Scholar] [CrossRef]
  16. Ma, X.; Zhang, X.; Wang, T. Chemical Product Localization and Classification Based on Robotic Arm. Comput. Mod. 2021, (10), 88–126. [Google Scholar]
  17. Yin, S.; Li, M.; Jiang, Y.; et al. Robotic Grasp Detection for Medical Test Tubes Based on Convolutional Neural Network. In Proceedings of the 2023 2nd International Conference on Automation, Robotics and Computer Engineering; IEEE: Wuhan, China, 2023; pp. 1–6. [Google Scholar]
  18. Chen, H.; Wan, W.; Matsushita, M.; Kotaka, T.; Harada, K. In-Rack Test Tube Pose Estimation Using RGB-D Data. arXiv 2023, arXiv:2308.10411. [Google Scholar]
  19. Tang, Y.; Wan, W.; Chen, H.; et al. Zero-Shot Recognition of Test Tube Types by Automatically Collecting and Labeling RGB Data. IEEE Robot. Autom. Lett. 2025, 10, 8276–8283. [Google Scholar] [CrossRef]
  20. Hu, J.; Shen, L.; Sun, G. Squeeze-and-Excitation Networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: Salt Lake City, UT, USA, 2018; pp. 7132–7141. [Google Scholar]
  21. Wang, Q.; Wu, B.; Zhu, P.; Li, P.; Zuo, W.; Hu, Q. ECA-Net: Efficient Channel Attention for Deep Convolutional Neural Networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: Seattle, WA, USA, 2020; pp. 11534–11542. [Google Scholar]
  22. Hou, Q.; Zhou, D.; Feng, J. Coordinate Attention for Efficient Mobile Network Design. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: Nashville, TN, USA, 2021; pp. 13713–13722. [Google Scholar]
  23. Zhang, S.; Chi, C.; Yao, Y.; Lei, Z.; Li, S.Z. Bridging the Gap Between Anchor-Based and Anchor-Free Detection via Adaptive Training Sample Selection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: Seattle, WA, USA, 2020; pp. 9759–9768. [Google Scholar]
  24. Rezatofighi, H.; Tsoi, N.; Gwak, J.Y.; Sadeghian, A.; Reid, I.; Savarese, S. Generalized Intersection over Union: A Metric and A Loss for Bounding Box Regression. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: Long Beach, CA, USA, 2019; pp. 658–666. [Google Scholar]
  25. Zheng, Z.; Wang, P.; Liu, W.; Li, J.; Ye, R.; Ren, D. Distance-IoU Loss: Faster and Better Learning for Bounding Box Regression. In Proceedings of the AAAI Conference on Artificial Intelligence; AAAI: New York, NY, USA, 2020; Volume 34, pp. 12993–13000. [Google Scholar]
  26. Tong, Z.; Chen, Y.; Xu, Z.; Yu, R. Wise-IoU: Bounding Box Regression Loss with Dynamic Focusing Mechanism. arXiv 2023, arXiv:2301.10051. [Google Scholar]
  27. Ma, S.; Xu, Y. MPDIoU: A Loss for Efficient and Accurate Bounding Box Regression. arXiv 2023, arXiv:2307.07662. [Google Scholar] [CrossRef]
  28. El-Khawaldeh, R.; Zhang, W.; Corkery, R. HeinSight4.0 Dataset and Models for Dynamic Monitoring of Chemical Experiments [Dataset]; Zenodo: Geneva, Switzerland, 2025. [Google Scholar]
  29. Zhu, P.; Wen, L.; Du, D.; Bian, X.; Fan, H.; Hu, Q. Detection and Tracking Meet Drones Challenge. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 44, 7380–7399. [Google Scholar] [CrossRef]
  30. Buslaev, A.; Iglovikov, V.I.; Khvedchenya, E.; Parinov, A.; Druzhinin, M.; Kalinin, A.A. Albumentations: Fast and Flexible Image Augmentations. Information 2020, 11, 125. [Google Scholar] [CrossRef]
  31. Bochkovskiy, A.; Wang, C.Y.; Liao, H.Y.M. YOLOv4: Optimal Speed and Accuracy of Object Detection. arXiv 2020, arXiv:2004.10934. [Google Scholar] [CrossRef]
  32. Zhang, H.; Cisse, M.; Dauphin, Y.N.; Lopez-Paz, D. mixup: Beyond Empirical Risk Minimization. In Proceedings of the International Conference on Learning Representations; ICLR: Vancouver, BC, Canada, 2018. [Google Scholar]
  33. Jocher, G.; Chaurasia, A.; Qiu, J. YOLOv8 [Software]; Ultralytics: Los Angeles, CA, USA, 2023; Available online: https://github.com/ultralytics/ultralytics.
  34. Jocher, G.; Chaurasia, A.; Qiu, J. YOLOv11 [Software]; Ultralytics: Los Angeles, CA, USA, 2024. [Google Scholar]
  35. Wang, C.Y.; Yeh, I.H.; Liao, H.Y.M. YOLOv9: Learning What You Want to Learn Using Programmable Gradient Information. arXiv 2024, arXiv:2402.13616. [Google Scholar] [CrossRef]
  36. Tian, Y.; Ye, Q.; Doermann, D. YOLOv12: Attention-Centric Real-Time Object Detectors. arXiv 2025, arXiv:2502.12524. [Google Scholar]
  37. Rybak, L.A.; Cherkasov, V.V.; Malyshev, D.I.; Carbone, G. Blood Serum Recognition Method for Robotic Aliquoting Using Different Versions of the YOLO Neural Network. In Advances in Service and Industrial Robotics; Springer: Cham, Switzerland, 2023; pp. 150–157. [Google Scholar]
  38. Chen, Y.; Yuan, X.; Wang, J.; et al. YOLO-MS: Rethinking Multi-Scale Representation Learning for Real-Time Object Detection. IEEE Trans. Pattern Anal. Mach. Intell. 2025, 47, 4240–4252. [Google Scholar] [CrossRef]
  39. Ni, Z.; Chen, X.; Zhai, Y.; Tang, Y.; Wang, Y. Context-Guided Spatial Feature Reconstruction for Efficient Semantic Segmentation. In Proceedings of the European Conference on Computer Vision; Springer: Cham, Switzerland, 2024; pp. 241–258. [Google Scholar]
  40. Howard, A.G.; Zhu, M.; Chen, B.; et al. MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications. arXiv 2017, arXiv:1704.04861. [Google Scholar] [CrossRef]
  41. Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. CBAM: Convolutional Block Attention Module. In Proceedings of the European Conference on Computer Vision; Springer: Munich, Germany, 2018; pp. 3–19. [Google Scholar]
  42. Wu, Y.; He, K. Group Normalization. In Proceedings of the European Conference on Computer Vision; Springer: Munich, Germany, 2018; pp. 3–19. [Google Scholar]
  43. Ge, Z.; Liu, S.; Wang, F.; Li, Z.; Sun, J. YOLOX: Exceeding YOLO Series in 2021. arXiv 2021, arXiv:2107.08430. [Google Scholar] [CrossRef]
  44. Dai, X.; Chen, Y.; Xiao, B.; et al. Dynamic Head: Unifying Object Detection Heads with Attentions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: Nashville, TN, USA, 2021; pp. 7373–7382. [Google Scholar]
  45. Tian, Z.; Shen, C.; Chen, H.; He, T. FCOS: Fully Convolutional One-Stage Object Detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision; IEEE: Seoul, Republic of Korea, 2019; pp. 9627–9636. [Google Scholar]
  46. Liu, S.; Sun, Y.; Fu, X.; et al. SlimConv: Reducing Channel Redundancy in Convolutional Neural Networks by Features Recombining. IEEE Trans. Image Process. 2021, 30, 6434–6445. [Google Scholar] [CrossRef]
  47. Li, T.; Chen, Z.; He, Y.; Li, H. Large Separable Kernel Attention. Expert Syst. Appl. 2023, 236, 121352. [Google Scholar] [CrossRef]
  48. Yang, J.; Li, C.; Zhang, P.; et al. Focal Modulation Networks. In Advances in Neural Information Processing Systems; NeurIPS: New Orleans, LA, USA, 2022; Volume 35, pp. 4203–4217. [Google Scholar]
  49. Liu, Y.; Tian, Y.; Zhao, Y.; et al. VMamba: Visual State Space Model. arXiv 2024, arXiv:2401.13260. [Google Scholar]
  50. Cao, Y.; Xu, J.; Lin, S.; Wei, F.; Hu, H. GCNet: Non-Local Networks Meet Squeeze-Excitation Networks and Beyond. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops; IEEE: Seoul, Republic of Korea, 2019; pp. 1971–1980. [Google Scholar]
  51. Ding, X.; Zhang, X.; Zhou, X.; et al. UniRepLKNet: A Universal Perception Large-Kernel ConvNet for Audio, Video, Point Cloud, Time-Series and Image Recognition. arXiv 2023, arXiv:2311.15599. [Google Scholar]
  52. Zhang, H.; Wang, Y.; Dayoub, F.; Sünderhauf, N. VarifocalNet: An IoU-aware Dense Object Detector. arXiv 2020, arXiv:2008.13367. [Google Scholar]
Figure 1. Overall architecture of WDA-TNET. The pipeline consists of four stages: (1) WGSR preprocessing for selective motion-blur restoration, (2) GSCIM-based backbone for glare-suppressed feature extraction, (3) DCPAF neck for direction-aware multi-scale fusion, and (4) ATSS detection head with MPDIoU loss for stable supervision on slender, densely packed targets.
Figure 2. Three-stage pipeline of WGSR. The frequency-domain branch computes a wavelet-energy degradation mask; the spatial reconstruction branch recovers high-frequency textures via a lightweight three-layer CNN; the gated fusion stage injects the residual selectively into degraded regions only.
Figure 3. Wavelet sub-band decomposition and degradation mask generation. From left to right: input image; low-frequency approximation $c_A$; horizontal, vertical, and diagonal high-frequency components $\{c_H, c_V, c_D\}$; aggregated energy map $E_{\text{raw}}$; final degradation mask $M$ after normalization and Sigmoid gating.
Figure 4. Degradation masks generated under different gating steepness k. A small k (left) includes too many low-contrast regions; a large k (right) over-clips the transition zone. At k = 4.0 (middle), the mask covers approximately one standard deviation below the energy mean, which matches the measured blur-kernel-to-edge-width ratio in the CTT dataset.
Figure 5. Visualization of WGSR three-stage collaborative processing. From left to right: original blurred input; degradation mask $M$ (bright regions = clear, dark = degraded); reconstruction output $Y$; final restored image $I_{\text{out}}$. The restoration is concentrated on blur-affected zones while specular highlights remain unchanged.
Figure 6. Structure of the Direction-Aware Cross-Stage Pyramid Attention Fusion (DCPAF) module. The input feature is first recalibrated by two independent 1D pooling branches along height and width. The recalibrated feature then goes through chained 5 × 5 max-pooling for multi-scale context, which is concatenated with a cross-stage skip path and fused by 1 × 1 convolution.
Figure 7. Architecture of the Glare-Suppressed Channel Interaction Module (GSCIM). The input is split into a transformation branch (cascade of 3 × 3 convolutions and ECA channel-weighting units) and a shallow preservation branch. Both branches are concatenated and compressed by a 1 × 1 convolution. The ECA unit computes local cross-channel weights via adaptive 1D convolution, suppressing glare-dominated responses while retaining liquid-level edge cues.
Figure 8. Position of the ATSS module within the detection head. Multi-scale features P3–P5 pass through the prediction module; the outputs feed into both the loss computation branch and the ATSS assigner, which dynamically determines positive/negative labels based on candidate-set statistics rather than a fixed IoU threshold.
Figure 9. Flowchart of ATSS adaptive sample assignment. For each ground-truth box, top-k candidates are selected per feature level by center distance, merged across levels, and an adaptive IoU threshold $t_g = \mu_g + \sigma_g$ is computed from the candidate set. A candidate is labeled positive only when it simultaneously satisfies the IoU condition and the center-in-box constraint; conflicts from dense tube layouts are resolved by assigning each candidate to the ground truth with the highest IoU.
Figure 10. Corner-distance constraint of MPDIoU. Distances $d_1$ and $d_2$ are computed between the predicted and ground-truth top-left and bottom-right corners respectively, then normalized by the diagonal of the minimum enclosing rectangle. This formulation provides non-zero gradients regardless of overlap, and penalizes short-axis and long-axis deviations equally.
Figure 11. CTT dataset sample collection scenes. The top row shows pre-sorting layouts with tubes in mixed, unordered states; the bottom row shows rack-loaded configurations ready for robotic scanning. Both stages span hospital laboratory and ward settings.
Figure 12. CTT dataset annotation examples. Four annotation dimensions are shown: physical specification (tube size), internal sample state (blood/urine/empty), external barcode label, and cap color. Each dimension contributes complementary information to support robot-assisted tube sorting.
Figure 13. HeinSight4.0 dataset: five-category annotation examples. From left to right: air (empty container), air with residual gas, clear liquid, turbid liquid, and solid/particle phase. The transparent container walls introduce strong reflection and refraction in all categories.
Figure 14. Primary data augmentation examples. Each column illustrates one transform: horizontal flip, vertical flip, shift-scale-rotate, brightness perturbation, contrast perturbation, gamma correction, Mosaic, and MixUp. All transformations maintain annotation consistency via coordinate remapping.
Figure 15. Composition-level augmentation examples. Left: Mosaic (four images stitched), simulating dense and multi-scale tube layouts. Middle: MixUp (two images blended), training the model on overlapping or partially visible tubes. Right: Copy-Paste (tube instances transplanted), increasing instance diversity in under-represented cap-color categories.
Figure 16. Synthetic patch augmentation pipeline. From left to right: source annotated image; GrabCut foreground extraction; morphological opening to smooth mask boundaries; Gaussian feathering to produce a soft alpha channel; background image with color perturbation; final composited sample with retained bounding-box annotation.
Figure 17. Detail of GrabCut-based foreground segmentation and Gaussian feathering. Left: initial hard binary mask from GrabCut. Middle: mask after morphological opening removing boundary artifacts. Right: soft alpha channel after Gaussian convolution, enabling seamless alpha-blend compositing.
Figure 18. Wavelet-enhanced augmentation examples for motion-blur simulation. Left pair: dense, relatively uniform tube arrangement ($\rho$ low) yields mild blur. Right pair: sparse or clustered arrangement ($\rho$ high) triggers stronger blur. WGSR restores the degraded training samples before they enter detector training.
Figure 19. Precision, recall, and GFLOPs comparison across C3k2 backbone variants on the CTT dataset. GSCIM achieves the best balance: comparable precision to C3k2+MS with lower GFLOPs and a consistently higher recall than all norm-based alternatives.
Figure 20. Precision (P), recall (R), and FLOPs comparison of six detection heads on CTT. ATSS Head achieves the highest precision and recall while keeping FLOPs well below YOLOX Head (approximately one-quarter of YOLOX's computational cost).
Figure 21. mAP@50:95 (bars) and FPS (line) comparison of SPPF variants and DCPAF on CTT. DCPAF reaches the highest mAP@50:95 at a competitive inference speed.
Table 1. Performance comparison on the CTT dataset.
Model Precision (%) Recall (%) mAP@50 (%) mAP@50:95 (%) Params (M)
yolov5_liquid_level 93.40 89.23 92.63 78.79 9.13
YOLOv8 [33] 91.64 87.99 91.19 76.22 3.01
YOLOv11 [34] 92.17 87.44 90.87 75.50 2.59
YOLOv8aa 91.90 87.96 90.85 75.71 2.94
yolov8_improved_pip 91.13 88.82 90.63 75.78 3.02
06_gold_yolo_style 91.15 88.17 90.61 74.46 2.59
yolov5_shufflenet 90.39 88.54 90.40 75.31 6.85
YOLOv9t [35] 90.14 88.36 90.19 75.21 2.01
YOLOv12 [36] 91.12 88.52 90.14 75.02 2.57
yolov8n_expiry_date 89.61 88.14 89.63 73.21 2.91
TubeDet-YOLO [13] 88.40 85.82 88.31 69.83 1.03
YOLOv5 liquid-level localization 87.97 79.24 83.30 66.20 2.75
WDA-TNET (Ours) 94.08 88.91 92.77 79.08 8.46