Preprint
Article

This version is not peer-reviewed.

A Wavelet-Enhanced Data-Augmented Network for Robust Test Tube Detection in Clinical Workflows

Submitted: 11 April 2026
Posted: 14 April 2026


Abstract
Reliable test-tube detection on clinical conveyor lines remains difficult when tubes are densely packed, placed irregularly, weakly illuminated, partially blurred by robot vibration, and contaminated by glare from glass or PET surfaces. These conditions erode the short-axis boundary cues and faint graduation marks that slender tubes depend on. We therefore build WDA-TNET on YOLOv11 and target the failure modes at four points of the pipeline. First, WGSR restores blurred regions selectively based on wavelet energy, avoiding over-sharpening specular areas. Second, GSCIM suppresses glare-dominated channel responses in the backbone through direction-aware pooling and cross-channel interaction, retaining weak structural cues like liquid-level edges. Third, DCPAF separates height and width encoding in the neck, dynamically balancing long-axis context and short-axis localization suitable for elongated targets. Finally, ATSS and MPDIoU stabilize supervision when positives are sparse and boxes overlap only weakly. We evaluated our model on the newly constructed Complex Test Tube (CTT) dataset containing 11,955 images and 81,044 instances. WDA-TNET achieves 94.1% precision and 79.1% mAP50:95, improving mAP50:95 by 3.6 percentage points over YOLOv11. On the transparent-container HeinSight4 dataset, the model attains 95.2% mAP50:95, proving robust cross-domain generalization.

1. Introduction

With the acceleration of global population aging, healthcare demand continues to grow. Medical expenses are projected to surge, putting unprecedented pressure on traditional labor-intensive dispensing and sorting models. In modern hospital laboratories and home-based care scenarios, routine biological samples such as urine and blood are typically transported in test tubes. These body fluid samples often carry infectious pathogens; for instance, blood samples may contain HIV or Hepatitis B virus, while urine samples may harbor pathogenic bacteria such as Mycobacterium tuberculosis. Manual sorting of these tubes poses significant occupational exposure risks to medical personnel, especially when dealing with non-compliant containers (e.g., unsealed caps or leakage due to breakage). Deploying dispensing robots for repetitive tasks such as drug transport and sorting can reduce labor costs and infection risks and extend service beyond traditional pharmacy hours.
However, the stable operation of such robotic systems depends heavily on the online recognition of medical supplies. Clinical conveyor lines do not provide the clean imaging conditions assumed by generic detectors. Test tubes exhibit transparent materials that lead to severe reflections; their colors and specifications vary; dense arrangements cause mutual occlusion. Lighting inside automated equipment is often weak, while glass and PET surfaces introduce unstable highlights that hide liquid levels, labels, and tube contours. The geometry is also difficult: test tubes are slender, and even a small localization error along the short axis can reduce IoU noticeably. Once vibration from the mobile platform introduces motion blur, barcodes and graduation marks become even harder to preserve.
WDA-TNET is designed around these failure modes rather than around a single generic enhancement block. WGSR restores blurred regions selectively so that specular areas are not sharpened indiscriminately. GSCIM suppresses glare-dominated channel responses in the backbone and retains weak structural cues. DCPAF separates height and width encoding in the neck, which is better suited to elongated targets than isotropic pooling. During training, ATSS and MPDIoU keep supervision stable when positives are sparse and boxes overlap only weakly. The resulting detector is evaluated on the newly established CTT dataset and two cross-domain benchmarks, where it delivers the strongest overall localization performance among the compared models.

3. Methodology

3.1. Overall Architecture

WDA-TNET addresses the aforementioned challenges through targeted improvements in four stages: preprocessing, feature extraction, fusion, and detection supervision. As shown in Figure 1, in the preprocessing stage, we introduce the Wavelet-Guided Selective Restoration (WGSR) algorithm to selectively repair motion blur and glare degradation. By using a degradation mask, restoration is constrained to areas that genuinely require enhancement, avoiding over-processing of clear regions. In the feature extraction stage, we replace the standard C3k2 module with the Glare-Suppressed Channel Interaction Module (GSCIM), which utilizes direction-aware pooling and cross-channel interaction to suppress abnormal high responses caused by reflection. For feature fusion, we substitute the standard SPPF with the Direction-Aware Cross-Stage Pyramid Attention Fusion (DCPAF) module, transforming isotropic pooling into direction-aware 1D encoding to balance receptive fields along the long and short axes of slender targets. Finally, in the detection head, we employ Adaptive Training Sample Selection (ATSS) for dynamic positive/negative sample assignment and combine it with the MPDIoU loss function to maintain effective gradients in low-IoU regimes, ensuring precise bounding box regression for densely packed tubes.

3.2. WGSR: Wavelet-Guided Selective Restoration

When medical robots perform tube picking and delivery tasks, chassis movement inevitably causes visual sensor vibration, resulting in directional motion blur. This temporal integration effect weakens critical localization features of test tube labels. Traditional global deblurring algorithms often amplify background noise and create textures in specular regions when applied to the entire image. To solve this, we propose the Wavelet-Guided Selective Restoration (WGSR) algorithm. Distinguished from global methods, WGSR constructs a three-stage collaborative workflow comprising frequency-domain localization, spatial-domain reconstruction, and gated fusion, as illustrated in Figure 2.

3.2.1. Frequency-Domain Localization

Clear edges typically possess sufficient high-frequency energy, whereas high-frequency components in blurred regions are significantly attenuated. Based on this physical property, WGSR applies a 2D discrete wavelet transform (2D-DWT) to the input image. Using single-level Daubechies wavelets (db2), which effectively capture edge features of slender structures, we decompose the image into a low-frequency approximation component $c_A$ and three high-frequency detail components $\{c_H, c_V, c_D\}$. Figure 3 shows the resulting sub-band decomposition and the corresponding energy map. We summarize these into a unified high-frequency energy map $E_{\text{raw}}$:
$$E_{\text{raw}} = \frac{1}{C} \sum_{c=1}^{C} \left( \lvert c_H^{(c)} \rvert + \lvert c_V^{(c)} \rvert + \lvert c_D^{(c)} \rvert \right),$$
where C denotes the number of channels.
Since the DWT halves spatial resolution, we upsample $E_{\text{raw}}$ to the original resolution $H \times W$ via bilinear interpolation. The upsampled energy map $E_{\text{up}}$ is then normalized instance-wise, and a Sigmoid gate with steepness $k$ converts the normalized response into the degradation mask $M$:
$$M = \mathrm{Sigmoid}\!\left( k \cdot \frac{E_{\text{up}} - \mu(E_{\text{up}})}{\mathrm{std}(E_{\text{up}}) + \varepsilon} \right),$$
where $\mu(\cdot)$ and $\mathrm{std}(\cdot)$ denote the mean and standard deviation, and $\varepsilon = 10^{-8}$ is a stabilizing constant. We set $k = 4.0$ to balance boundary preservation and noise suppression based on statistical analysis of normalized energy distributions. When $k \to \infty$, the gate degenerates into a hard binary segmentation; when $k \to 0$, it loses discrimination. $k = 4.0$ covers approximately one standard deviation below the mean, avoiding over-penalization. Figure 4 illustrates the spatial response under different $k$ values.
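As a concrete illustration, the following is a minimal NumPy/PyWavelets sketch of this frequency-domain localization step. The function name `degradation_mask` and the use of magnitude summation for the energy map are our illustrative assumptions rather than the released implementation.

```python
# Minimal sketch of WGSR frequency-domain localization (Section 3.2.1).
# Assumes a single-level db2 DWT via PyWavelets; names are illustrative.
import numpy as np
import pywt
import cv2

def degradation_mask(image: np.ndarray, k: float = 4.0, eps: float = 1e-8) -> np.ndarray:
    """Compute the degradation mask M from per-channel wavelet energy."""
    h, w = image.shape[:2]
    chans = image.astype(np.float32)
    if chans.ndim == 2:
        chans = chans[..., None]
    energy = None
    for c in range(chans.shape[-1]):
        # Single-level 2D DWT: approximation cA and details (cH, cV, cD).
        _, (cH, cV, cD) = pywt.dwt2(chans[..., c], "db2")
        e = np.abs(cH) + np.abs(cV) + np.abs(cD)
        energy = e if energy is None else energy + e
    e_raw = energy / chans.shape[-1]
    # DWT halves resolution; upsample back to H x W bilinearly.
    e_up = cv2.resize(e_raw, (w, h), interpolation=cv2.INTER_LINEAR)
    # Instance-wise normalization followed by a Sigmoid gate of steepness k.
    z = k * (e_up - e_up.mean()) / (e_up.std() + eps)
    return 1.0 / (1.0 + np.exp(-z))  # M ~ 1 in clear regions, ~ 0 in blurred ones
```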

3.2.2. Spatial Reconstruction and Gated Fusion

The degradation mask identifies regions requiring repair. We design a spatial structure reconstruction branch using a lightweight three-layer convolutional network F SR to selectively recover high-frequency textures:
$$Y = F_{\mathrm{SR}}(I) = \mathrm{Conv}_{1\times1}\big( \mathrm{Conv}_{3\times3}( \mathrm{Conv}_{7\times7}(I) ) \big).$$
Directly replacing the input I with Y would introduce over-sharpening in clear regions. Therefore, we compute the residual R = Y I and use the degradation mask M to spatially weight the injection:
$$I_{\text{out}} = I + \alpha \cdot (1 - M) \odot R,$$
where $\odot$ denotes element-wise multiplication and $\alpha$ is a learnable scalar initialized to 0.2. In clear regions ($M \to 1$), injection is suppressed; in degraded regions ($M \to 0$), details are fully injected. Figure 5 shows the three-stage output comparison on a blurred tube image.
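To make the gated fusion concrete, a hedged PyTorch sketch of the reconstruction branch and the residual injection follows. The layer widths (e.g., 32 channels) are illustrative assumptions, while the 7×7 → 3×3 → 1×1 layout and the 0.2 initialization follow the text above.

```python
# Hedged PyTorch sketch of the spatial reconstruction branch F_SR and the
# mask-gated residual injection. Layer widths are illustrative assumptions.
import torch
import torch.nn as nn

class WGSRRestoration(nn.Module):
    def __init__(self, channels: int = 3, width: int = 32):
        super().__init__()
        # Three-layer reconstruction network: 7x7 -> 3x3 -> 1x1.
        self.f_sr = nn.Sequential(
            nn.Conv2d(channels, width, 7, padding=3), nn.ReLU(inplace=True),
            nn.Conv2d(width, width, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(width, channels, 1),
        )
        # Learnable injection strength, initialized to 0.2 as in the text.
        self.alpha = nn.Parameter(torch.tensor(0.2))

    def forward(self, img: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
        y = self.f_sr(img)        # reconstructed image Y
        residual = y - img        # R = Y - I
        # Inject details only where the mask flags degradation (M -> 0).
        return img + self.alpha * (1.0 - mask) * residual
```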
From a deployment perspective, WGSR functions as an offline data augmentation module during training. The restoration network is pre-trained and frozen to generate enhanced training samples. During inference, the detection network receives raw images without any additional latency.

3.3. DCPAF: Direction-Aware Cross-Stage Pyramid Attention Fusion

Test tube targets exhibit strong directional structures, typically extending vertically. Isotropic pooling operations easily lose fine spatial positional information along the narrow short axis. DCPAF addresses this by decoupling 2D aggregation into two 1D encodings along height and width, as shown in Figure 6. This separates height and width encoding in the neck, preserving position resolution along the short axis while accumulating longer-range context along the long axis.
The input feature X first enters the direction attention branch, performing 1D average pooling along height and width respectively:
$$z_c^h(i) = \frac{1}{W} \sum_{j=0}^{W-1} x_c(i, j), \qquad z_c^w(j) = \frac{1}{H} \sum_{i=0}^{H-1} x_c(i, j).$$
We obtain a recalibrated feature by injecting the direction weights back into the input feature:
$$X_{\text{att}} = X \odot \sigma\big( \varphi^h(z^h) \big) \odot \sigma\big( \varphi^w(z^w) \big),$$
where $\sigma(\cdot)$ is the Sigmoid function, and $\varphi^h(\cdot)$ and $\varphi^w(\cdot)$ are lightweight mapping branches. We next apply a lightweight convolutional re-encoding to stabilize representations before pooling:
$$X_1 = \mathrm{Conv}_{1\times1}\big( \mathrm{Conv}_{3\times3}( \mathrm{Conv}_{1\times1}(X_{\text{att}}) ) \big).$$
For multi-scale context extraction, DCPAF chains 5 × 5 max-pooling to expand the effective receptive field at low cost without losing cross-stage details:
$$P_1 = \mathrm{MP}_5(X_1), \quad P_2 = \mathrm{MP}_5(P_1), \quad P_3 = \mathrm{MP}_5(P_2).$$
The pooled branches are concatenated with the identity branch and compressed by a 1 × 1 convolution:
$$Y_b = \mathrm{Conv}_{1\times1}\big( \mathrm{Concat}(X_1, P_1, P_2, P_3) \big).$$
Finally, we inject a cross-stage short path to preserve details and facilitate gradient flow:
$$Y = \mathrm{Conv}_{1\times1}\big( \mathrm{Concat}( \mathrm{Conv}_{3\times3}(Y_b), \mathrm{Conv}_{1\times1}(X_{\text{att}}) ) \big).$$
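A hedged PyTorch sketch of DCPAF following the formulas above is given below; the reduction ratio in the mapping branches $\varphi^h$ and $\varphi^w$ and the BatchNorm/SiLU choices are illustrative assumptions.

```python
# Illustrative PyTorch sketch of DCPAF. Branch widths and the reduction
# ratio in the direction-attention maps are assumptions, not the paper's code.
import torch
import torch.nn as nn

def conv_bn_act(c_in, c_out, k):
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, k, padding=k // 2, bias=False),
        nn.BatchNorm2d(c_out), nn.SiLU(inplace=True))

class DCPAF(nn.Module):
    def __init__(self, c: int, reduction: int = 8):
        super().__init__()
        mid = max(c // reduction, 8)
        # Lightweight mapping branches phi_h, phi_w for the 1D descriptors.
        self.phi_h = nn.Sequential(nn.Conv2d(c, mid, 1), nn.SiLU(), nn.Conv2d(mid, c, 1))
        self.phi_w = nn.Sequential(nn.Conv2d(c, mid, 1), nn.SiLU(), nn.Conv2d(mid, c, 1))
        # Re-encoding: Conv1x1 -> Conv3x3 -> Conv1x1.
        self.re_encode = nn.Sequential(conv_bn_act(c, c, 1), conv_bn_act(c, c, 3), conv_bn_act(c, c, 1))
        self.pool = nn.MaxPool2d(5, stride=1, padding=2)  # chained 5x5 max-pooling
        self.compress = conv_bn_act(4 * c, c, 1)
        self.refine = conv_bn_act(c, c, 3)
        self.skip = conv_bn_act(c, c, 1)                  # cross-stage short path
        self.out = conv_bn_act(2 * c, c, 1)

    def forward(self, x):
        # Direction attention: 1D average pooling along width and height.
        z_h = x.mean(dim=3, keepdim=True)                 # (B, C, H, 1)
        z_w = x.mean(dim=2, keepdim=True)                 # (B, C, 1, W)
        x_att = x * torch.sigmoid(self.phi_h(z_h)) * torch.sigmoid(self.phi_w(z_w))
        x1 = self.re_encode(x_att)
        p1 = self.pool(x1); p2 = self.pool(p1); p3 = self.pool(p2)
        y_b = self.compress(torch.cat([x1, p1, p2, p3], dim=1))
        return self.out(torch.cat([self.refine(y_b), self.skip(x_att)], dim=1))
```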

3.4. GSCIM: Glare-Suppressed Channel Interaction Module

Test tube images suffer from specular reflections and texture interference on glass surfaces, which degrade deeper feature maps during downsampling. Standard convolution treats all channels equally, lacking explicit modeling of inter-channel semantics. Consequently, abnormal high responses induced by glare are amplified.
To address this, we propose the Glare-Suppressed Channel Interaction Module (GSCIM), whose structure is shown in Figure 7. Instead of relying on global average pooling alone, GSCIM combines direction-aware coordinate pooling with grouped channel interaction so that highlight-dominated responses can be suppressed before they spread to later fusion stages.
Let the input feature be X. GSCIM uses two independent 1 × 1 convolutions to split the input into two parallel paths:
$$X_1 = \mathrm{Conv}_{1\times1}^{(1)}(X), \quad X_2 = \mathrm{Conv}_{1\times1}^{(2)}(X).$$
$X_1$ enters the transformation branch, passing through a cascade of $3\times3$ convolutions and ECA channel-weighting units. The ECA unit compresses each channel into a scalar via global average pooling, $v_c = \frac{1}{HW}\sum_{i,j} X_c(i, j)$, and models local dependencies across adjacent channels using a 1D convolution:
$$w = \sigma\big( \mathrm{Conv1D}_{k_s}(v) \big),$$
where the kernel size $k_s$ is adaptively determined by the channel dimension. The weights $w$ are multiplied element-wise with the input features for channel response recalibration. The refined features are then concatenated with the shallow preservation branch $X_2$ and compressed via a $1\times1$ convolution.
GSCIM is deployed after each downsampling stage in the backbone to perform joint screening of spatial positions and channels at early stages, preventing the propagation of invalid glare information.
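For illustration, a minimal PyTorch sketch of GSCIM is shown below; the ECA kernel-size rule follows the published ECA-Net heuristic, while the depth of the transformation branch and the half-channel split are our assumptions.

```python
# Hedged sketch of GSCIM. The number of 3x3/ECA stages in the
# transformation branch is an assumption.
import math
import torch
import torch.nn as nn

class ECA(nn.Module):
    """Efficient Channel Attention: GAP + adaptive-kernel 1D convolution."""
    def __init__(self, channels: int, gamma: int = 2, b: int = 1):
        super().__init__()
        t = int(abs((math.log2(channels) + b) / gamma))
        k = t if t % 2 else t + 1                # odd kernel size k_s
        self.conv = nn.Conv1d(1, 1, k, padding=k // 2, bias=False)

    def forward(self, x):
        v = x.mean(dim=(2, 3))                   # v_c: GAP over H, W
        w = torch.sigmoid(self.conv(v.unsqueeze(1))).squeeze(1)
        return x * w[:, :, None, None]           # channel recalibration

class GSCIM(nn.Module):
    def __init__(self, c: int):
        super().__init__()
        self.split1 = nn.Conv2d(c, c // 2, 1)    # transformation path
        self.split2 = nn.Conv2d(c, c // 2, 1)    # shallow preservation path
        self.transform = nn.Sequential(
            nn.Conv2d(c // 2, c // 2, 3, padding=1), nn.SiLU(), ECA(c // 2),
            nn.Conv2d(c // 2, c // 2, 3, padding=1), nn.SiLU(), ECA(c // 2))
        self.fuse = nn.Conv2d(c, c, 1)

    def forward(self, x):
        x1, x2 = self.split1(x), self.split2(x)
        return self.fuse(torch.cat([self.transform(x1), x2], dim=1))
```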

3.5. Training Supervision: ATSS and MPDIoU

3.5.1. Adaptive Training Sample Selection (ATSS)

Traditional static IoU thresholds struggle with densely packed and severely occluded tubes, where predicted boxes fluctuate wildly in early training epochs. ATSS dynamically constructs a positive-sample threshold based on the statistical distribution of candidates. Figure 8 illustrates where ATSS intervenes in the detection head.
The overall positive/negative sample assignment procedure is visualized in Figure 9. ATSS proceeds through five steps: center-distance ranking, per-level candidate selection, cross-level merging, IoU statistics, and threshold-based labeling with conflict resolution.
Formally, the center distance between candidate c and ground truth g is
$$d(c, g) = \big\lVert p(c) - p(g) \big\rVert_2,$$
where $p(\cdot)$ denotes the center coordinates. On feature level $l$, the $k$ nearest candidates by center distance form the per-level subset $S_g^l = \operatorname{TopK}_k\{\, c \in \mathcal{C}_l : d(c, g) \,\}$. These are merged across all levels into the candidate set
$$\mathcal{C}_g = \bigcup_{l} S_g^l.$$
The IoU values of all candidates in $\mathcal{C}_g$ with respect to $g$ yield a distribution $I_g$; its mean $\mu_g$ and standard deviation $\sigma_g$ define the adaptive threshold
$$t_g = \mu_g + \sigma_g.$$
A positive assignment requires satisfying both the IoU threshold and the geometric constraint that the candidate center falls within the ground truth box:
$$\mathrm{Pos}(c, g) = \mathbb{I}\big( \mathrm{IoU}(c, g) \ge t_g \big) \cdot \mathbb{I}\big( \mathrm{center}(c) \in g \big).$$
The indicator product enforces a strict logical AND: a candidate is accepted only when both conditions hold, which eliminates rack-edge candidates that overlap a tube region but whose center lies outside. When tubes are tightly packed, the same candidate may satisfy multiple ground truths simultaneously. This conflict is resolved by
$$g^{*}(c) = \arg\max_{g \in \mathcal{G}_c} \mathrm{IoU}(c, g),$$
where $\mathcal{G}_c$ is the set of ground truths that claim $c$ as a positive. Assigning each candidate to the single most geometrically compatible ground truth preserves one-to-one supervision and avoids label ambiguity at tube boundaries.
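A simplified sketch of the assignment statistics, assuming precomputed IoU and center-in-box tensors, is given below; real implementations also handle per-level top-$k$ selection and batching.

```python
# Simplified, hedged sketch of the ATSS assignment statistics. `ious` holds
# IoU(candidate, gt) and `centers_in_box` the geometric test; candidate
# indices are assumed to be merged across feature levels already.
import torch

def atss_assign(ious: torch.Tensor, candidate_idx: torch.Tensor,
                centers_in_box: torch.Tensor) -> torch.Tensor:
    """
    ious:           (num_gt, num_anchors) IoU matrix.
    candidate_idx:  (num_gt, k_total) long indices of center-distance candidates.
    centers_in_box: (num_gt, num_anchors) boolean center-in-gt test.
    Returns the assigned gt index per anchor (-1 = negative).
    """
    cand_ious = ious.gather(1, candidate_idx)            # IoU distribution I_g
    t_g = cand_ious.mean(dim=1) + cand_ious.std(dim=1)   # adaptive threshold
    pos = (ious >= t_g[:, None]) & centers_in_box        # strict logical AND
    # Restrict positives to the candidate set.
    in_cand = torch.zeros_like(pos)
    in_cand.scatter_(1, candidate_idx, True)
    pos &= in_cand
    # Conflict resolution: each anchor keeps the gt with the highest IoU.
    masked = torch.where(pos, ious, torch.full_like(ious, -1.0))
    best_iou, best_gt = masked.max(dim=0)
    best_gt[best_iou < 0] = -1
    return best_gt
```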

3.5.2. MPDIoU Loss Function

Test tube racks often contain mixed specifications (e.g., 5 ml, 15 ml, 50 ml). A few pixels of width deviation on a 5 ml tube causes a drastic IoU drop, whereas significant length deviation on a 50 ml tube may be masked by a large intersection area. Traditional IoU losses cannot distinguish these error types effectively. When boxes do not overlap, IoU is zero and gradients vanish.
MPDIoU translates regression into corner-point matching, directly measuring the Euclidean distances of the top-left and bottom-right corners, normalized by the diagonal of the minimum enclosing rectangle. The geometric interpretation is shown in Figure 10.
$$d_1^2 = (x_1^{\mathrm{prd}} - x_1^{\mathrm{gt}})^2 + (y_1^{\mathrm{prd}} - y_1^{\mathrm{gt}})^2,$$
$$d_2^2 = (x_2^{\mathrm{prd}} - x_2^{\mathrm{gt}})^2 + (y_2^{\mathrm{prd}} - y_2^{\mathrm{gt}})^2.$$
The complete MPDIoU loss is defined as:
$$\mathcal{L}_{\mathrm{MPDIoU}} = 1 - \mathrm{IoU} + \frac{d_1^2 + d_2^2}{w^2 + h^2},$$
where $w$ and $h$ denote the width and height of the minimum enclosing rectangle.
This formulation maintains effective gradients even when boxes do not overlap, and evenly penalizes deviations regardless of the absolute scale.
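A minimal PyTorch sketch of this loss, with boxes in $(x_1, y_1, x_2, y_2)$ format and the enclosing-rectangle normalization described above, is given below.

```python
# Minimal PyTorch sketch of the MPDIoU loss as defined above. `w` and `h`
# are taken from the minimum enclosing rectangle, following the text.
import torch

def mpdiou_loss(pred: torch.Tensor, target: torch.Tensor, eps: float = 1e-7) -> torch.Tensor:
    # Standard IoU.
    inter_x1 = torch.max(pred[:, 0], target[:, 0])
    inter_y1 = torch.max(pred[:, 1], target[:, 1])
    inter_x2 = torch.min(pred[:, 2], target[:, 2])
    inter_y2 = torch.min(pred[:, 3], target[:, 3])
    inter = (inter_x2 - inter_x1).clamp(0) * (inter_y2 - inter_y1).clamp(0)
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_t = (target[:, 2] - target[:, 0]) * (target[:, 3] - target[:, 1])
    iou = inter / (area_p + area_t - inter + eps)
    # Corner distances d1 (top-left) and d2 (bottom-right).
    d1 = (pred[:, 0] - target[:, 0]) ** 2 + (pred[:, 1] - target[:, 1]) ** 2
    d2 = (pred[:, 2] - target[:, 2]) ** 2 + (pred[:, 3] - target[:, 3]) ** 2
    # Diagonal of the minimum enclosing rectangle for normalization.
    enc_w = torch.max(pred[:, 2], target[:, 2]) - torch.min(pred[:, 0], target[:, 0])
    enc_h = torch.max(pred[:, 3], target[:, 3]) - torch.min(pred[:, 1], target[:, 1])
    return (1 - iou + (d1 + d2) / (enc_w ** 2 + enc_h ** 2 + eps)).mean()
```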

4. Experiments

4.1. Datasets

To evaluate the model under clinically relevant conditions, we constructed the Complex Test Tube (CTT) dataset for hospital laboratory sorting. The dataset contains 11,955 images and 81,044 instances, split into 9,797 for training, 1,079 for validation, and 1,079 for testing. Collection scenes fall into two stages: pre-sorting (mixed tubes in disorganized layouts) and rack-loaded (tubes positioned for scanning), as shown in Figure 11. Both stages were captured in hospital laboratory and ward sampling environments to maximize scene diversity.
Annotation was performed with the open-source tool CVAT, which supports key-frame extraction and semi-automatic interpolation for video clips. We annotated 15 categories across four dimensions to support comprehensive sample analysis:
1. Physical Features (6 classes): 5 ml, 15 ml, and 50 ml specifications for both tube bodies and caps.
2. Internal Sample Status (3 classes): blood, urine, and empty, to support coarse sample pre-screening.
3. External Labels (1 class): barcode region, providing visual support for downstream robotic scanning.
4. Cap Colors (5 classes): blue, grey, green, purple, red, and yellow cap colors encode additive types in clinical practice (e.g., red caps indicate immunological assays with higher biosafety requirements).
Figure 12 illustrates representative annotations across the four dimensions.
For cross-domain evaluation, we use HeinSight4.0 and VisDrone [28,29]. HeinSight4.0 [28] is a transparent-container dataset published by El-Khawaldeh et al. on Zenodo, containing 6,031 images extracted from chemical experiment videos. Five annotation categories cover air (empty and residual-gas), liquid (clear solution and turbid liquid), and solid (particles or precipitates suspended in liquid). All images were annotated manually with bounding boxes and divided into training and validation sets at a 9:1 ratio. The dataset shares key visual challenges with CTT—specular highlights from transparent walls, blurred phase boundaries, and liquid-level ambiguity—making it a suitable external benchmark for cross-domain generalization. Figure 13 shows examples from each category.

4.2. Data Augmentation

To improve robustness, we designed a multi-level augmentation pipeline.

4.2.1. Primary Augmentation

We implement basic geometric and photometric transforms with Albumentations [30]. Horizontal/vertical flips, shift-scale-rotate operations, brightness/contrast perturbations, and gamma adjustment are used to cover viewpoint changes and low-light variation. We also retain Mosaic [31] and MixUp [32] to simulate dense occlusion. Figure 14 shows representative examples of eight primary augmentation operations applied to tube images.
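A hedged Albumentations sketch of this primary stage is given below; the probabilities and parameter ranges are illustrative, and the composition-level strategies discussed next are applied separately by the detector's dataloader.

```python
# Hedged Albumentations sketch of the primary augmentation stage. The
# probabilities and parameter ranges are illustrative assumptions; Mosaic,
# MixUp, and Copy-Paste are applied by the detector's dataloader instead.
import albumentations as A

primary_transform = A.Compose(
    [
        A.HorizontalFlip(p=0.5),
        A.VerticalFlip(p=0.2),
        A.ShiftScaleRotate(shift_limit=0.05, scale_limit=0.1, rotate_limit=10, p=0.5),
        A.RandomBrightnessContrast(brightness_limit=0.2, contrast_limit=0.2, p=0.5),
        A.RandomGamma(gamma_limit=(80, 120), p=0.3),
    ],
    # Keep YOLO-format boxes consistent under every geometric transform.
    bbox_params=A.BboxParams(format="yolo", label_fields=["class_labels"]),
)
```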
Beyond single-image transforms, we additionally apply three composition-level strategies during training. Mosaic [31] stitches four images into a single training sample, simultaneously exposing the model to multi-scale tube arrangements and dense background clutter. MixUp [32] fuses two images and their annotations at a fixed ratio, strengthening robustness to partial occlusion. Copy-Paste [31] transfers tube instances between images and is particularly effective for rare cap-color categories. Figure 15 shows representative outputs from these three composition strategies.

4.2.2. Synthetic Patch Augmentation

To address sparse scenarios, we developed a synthetic patch augmentation based on GrabCut, which iteratively estimates GMM parameters for foreground and background to find the minimum cut. To eliminate jagged edges, we apply morphological opening ($M_{\text{open}} = (M_{\text{init}} \ominus K) \oplus K$) followed by Gaussian feathering, then use alpha blending to seamlessly embed the tube foreground into complex backgrounds. The complete pipeline is illustrated in Figure 16, with the feathered mask detail shown in Figure 17.
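A minimal OpenCV sketch of this pipeline, with illustrative kernel and blur sizes, is given below.

```python
# Hedged OpenCV sketch of the synthetic patch pipeline: GrabCut segmentation,
# morphological opening, Gaussian feathering, and alpha blending. Kernel and
# blur sizes are illustrative assumptions.
import cv2
import numpy as np

def composite_tube(src, bbox, background, iterations: int = 5):
    """Cut the tube inside `bbox` (x, y, w, h) out of `src` and paste it
    onto `background` of the same size."""
    mask = np.zeros(src.shape[:2], np.uint8)
    bgd, fgd = np.zeros((1, 65), np.float64), np.zeros((1, 65), np.float64)
    cv2.grabCut(src, mask, bbox, bgd, fgd, iterations, cv2.GC_INIT_WITH_RECT)
    m_init = np.where((mask == cv2.GC_FGD) | (mask == cv2.GC_PR_FGD), 255, 0).astype(np.uint8)
    # Morphological opening removes jagged edge artifacts.
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (5, 5))
    m_open = cv2.morphologyEx(m_init, cv2.MORPH_OPEN, kernel)
    # Gaussian feathering yields a soft alpha channel for seamless blending.
    alpha = cv2.GaussianBlur(m_open, (11, 11), 0).astype(np.float32) / 255.0
    alpha = alpha[..., None]
    return (alpha * src + (1.0 - alpha) * background).astype(np.uint8)
```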

4.2.3. Density-Driven Adaptive Blur

To simulate non-uniform motion blur caused by robot acceleration, we compute the spatial distribution density grid $F$ and its variation coefficient $\rho$:
$$\rho = \frac{\mathrm{std}(F)}{\mathrm{mean}(F) + \epsilon}.$$
The blur kernel size $\sigma_{\text{blur}}$ is adaptively scaled by $\rho$:
$$\sigma_{\text{blur}} = \sigma_{\min} + N(\rho) \cdot (\sigma_{\max} - \sigma_{\min}).$$
Sparse areas trigger stronger blur to simulate high-speed robot movements. Figure 18 shows two typical samples with the density-driven blur applied at different intensity levels.
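A hedged sketch of the density-driven blur is given below; the grid resolution, the kernel-size range, and the implementation of $N(\cdot)$ as a clipping of $\rho$ to $[0, 1]$ are our assumptions.

```python
# Hedged sketch of the density-driven adaptive blur. Grid resolution and
# min/max kernel sizes are assumptions; N(.) is realized here as a simple
# clipping of rho to [0, 1].
import cv2
import numpy as np

def adaptive_motion_blur(image, boxes, grid=8, k_min=3, k_max=15, eps=1e-6):
    """`boxes` are (cx, cy) tube centers in pixels; blur strength grows with
    the variation coefficient rho of the spatial density grid F."""
    h, w = image.shape[:2]
    F = np.zeros((grid, grid), np.float32)
    for cx, cy in boxes:
        F[min(int(cy / h * grid), grid - 1), min(int(cx / w * grid), grid - 1)] += 1
    rho = F.std() / (F.mean() + eps)           # variation coefficient
    n = float(np.clip(rho, 0.0, 1.0))          # N(rho): assumed [0, 1] clipping
    k = int(k_min + n * (k_max - k_min)) | 1   # odd kernel size
    # Horizontal motion-blur kernel approximating conveyor-direction shake.
    kernel = np.zeros((k, k), np.float32)
    kernel[k // 2, :] = 1.0 / k
    return cv2.filter2D(image, -1, kernel)
```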

4.3. Implementation Details

Training is performed for 200 epochs with a batch size of 16 and 8 dataloader workers on an RTX 4060 Ti GPU. We use AdamW with an initial learning rate of 0.01 and a cosine decay schedule.

4.4. Results on the CTT Dataset

Table 1 compares representative detectors on CTT.
Table 1 shows that WDA-TNET attains the highest precision (94.08%) and the highest mAP@50:95 (79.08%) among the compared models. Compared to YOLOv11, the stricter mAP@50:95 improves by 3.58 percentage points. This matters because the stricter metric is more sensitive to box tightness on slender tubes. The gain confirms that DCPAF and MPDIoU improve localization stability rather than merely increasing coarse detections at an IoU threshold of 0.5.

4.5. Cross-Domain Evaluation

HeinSight4. This dataset contains transparent-container scenes with strong reflection, refraction, and blurred phase boundaries.
Table 2. Performance comparison on HeinSight4.
Model Precision (%) Recall (%) mAP@50 (%) mAP@50:95 (%) FPS
Ours 97.73 94.99 97.74 95.17 131.05
YOLOv12 [36] 97.91 93.80 97.87 94.97 195.71
Improved YOLOv8 for pipette tips [15] 96.95 94.72 98.05 95.16 60.22
Ghost-YOLOv4 liquid-level [10] 96.75 95.39 98.04 95.47 115.15
YOLOv11 [34] 96.43 93.57 97.72 94.81 55.72
YOLOv9t [35] 95.55 92.03 97.72 94.68 69.55
YOLO aliquoting baseline [37] 96.64 93.81 97.46 94.95 83.02
YOLO liquid-level baseline 95.21 93.16 96.98 93.30 28.43
TubeDet-YOLO [13] 93.99 94.70 97.38 94.16 30.90
Tube baseline (improved) 90.90 93.58 96.46 92.94 102.63
Internal variant-2 95.43 93.46 96.99 93.89 111.05
On HeinSight4, WDA-TNET reaches 95.17% mAP@50:95 and 94.99% recall. The transfer result suggests that WGSR and GSCIM learn feature corrections that remain helpful in general transparent-object scenes, not just overfitting to clinical tubes.
VisDrone. We further test cross-domain generalization on VisDrone, a dense small-object dataset with complex backgrounds.
Table 3. Performance comparison on VisDrone.
Model Precision Recall mAP@50 mAP@50:95 Params (M) GFLOPs Size (MB)
Ours 48.94 36.63 37.33 21.85 9.49 22.99 18.38
YOLOv9t 43.14 32.33 32.33 18.78 2.01 7.86 8.48
YOLOv8 42.61 32.41 31.70 18.09 3.01 8.20 5.95
YOLOv12 42.00 32.83 32.01 18.48 2.57 6.49 5.26
YOLOv11 40.19 30.26 29.57 16.70 2.59 6.45 5.20
VisDrone is harsher on small-object density and background clutter. WDA-TNET still leads the comparison with 48.94% precision and 21.85% mAP@50:95, indicating that selective restoration and adaptive sample assignment remain useful when objects become smaller and more numerous.

4.6. Ablation on Backbone Block Variants

We evaluate replacing the backbone baseline block (C3k2) with five attention and normalization alternatives alongside our GSCIM on the CTT dataset.
Table 4. Ablation study on C3k2 backbone variants (CTT dataset).
Variant mAP@50 (%) mAP@50:95 (%) Params (M) FPS
C3k2+MS [38] 90.90 74.49 2.59 62.93
C3k2+RCM [39] 90.85 74.85 2.80 76.52
C3k2+DW [40] 90.27 74.18 2.73 62.52
C3k2+CBAM [41] 89.73 74.26 2.55 58.93
C3k2+GN [42] 86.40 67.00 2.90 70.90
Ours (GSCIM) 91.00 75.60 2.90 73.40
GSCIM achieves the best mAP@50 (91.0%) and mAP@50:95 (75.6%) in this group. The key reason is that, rather than simply amplifying high responses, it first establishes a stable background reference via global channel recalibration to flag glare-dominated activations as anomalies, then preserves geometrically continuous low-energy cues such as liquid-level edges through a short-cut path, and finally applies spatial reweighting to redirect responses toward real boundaries. This staged mechanism keeps the high-IoU localization cues intact without relying on a larger model. C3k2+GN flattens the feature distribution and eliminates the weak energy contrast between tube walls and liquid surfaces, causing mAP@50 to drop to 86.4%. Figure 19 shows precision, recall, and FLOPs across all six variants.

4.7. Detection Head Comparison

We compare six representative detection heads under identical backbone and training settings to justify the choice of ATSS.
Table 5. Comparison of detection head variants on the CTT dataset.
Head mAP@50 (%) mAP@50:95 (%) Params (M) Latency (ms)
YOLOX Head [43] 90.40 74.30 9.40 8.80
DY Head [44] 90.30 74.20 2.70 16.00
FCOS Head [45] 90.00 73.70 3.10 14.90
Decoupled Detect [33] 89.90 73.00 2.80 14.70
Lightweight Efficient 89.90 74.70 2.20 12.10
ATSS Head [23] 90.90 75.90 3.10 10.30
ATSS Head achieves the highest mAP@50 (90.9%) and mAP@50:95 (75.9%) with a latency of only 10.3 ms. YOLOX Head requires 9.4 M parameters, more than three times the 3.1 M of the ATSS head, yet offers no accuracy advantage. DY Head produces adaptive convolution kernels per input but incurs 16.0 ms latency, the highest in the group. FCOS and Decoupled Detect decouple classification and regression to different extents; however, when the coupling is reduced beyond a certain degree, the shared structural evidence for boundary discrimination weakens, and mAP@50:95 falls below 74%. Lightweight Efficient keeps parameters low at 2.2 M but cannot fully model the slender-tube short-axis boundary, leaving mAP@50:95 at 74.7%. Figure 20 visualizes precision, recall, and FLOPs across all six heads.

4.8. SPPF Module Ablation

We compare DCPAF against seven SPPF variants that represent representative directions in lightweight spatial pooling design.
Table 6. Comparison of SPPF variants and DCPAF on the CTT dataset.
Module P (%) R (%) F1 (%) mAP@50 (%) mAP@50:95 (%) FPS
SPPF (baseline) 92.1 87.4 89.7 90.8 75.5 73.4
SimConvSPPF [46] 92.5 88.8 90.6 91.7 77.6 94.2
LSKA_SPPF [47] 92.7 89.4 91.0 91.7 77.8 90.6
FocalModulation [48] 92.5 88.2 90.3 91.3 77.0 85.5
Mamba_SPPF [49] 92.5 89.3 90.9 91.6 77.6 92.3
GCBlock_SPPF [50] 93.6 88.6 91.0 91.8 77.5 88.9
UniRepLKNetSPPF [51] 92.4 89.8 91.1 92.0 78.0 76.9
DCPAF (Ours) 94.1 88.9 91.4 92.8 79.1 92.3
DCPAF achieves the best mAP@50 (92.8%), mAP@50:95 (79.1%), and F1 (91.4%) at 92.3 FPS. Other variants strengthen channel recalibration or expand spatial receptive fields, yet they retain isotropic pooling concatenation, which discards directional and positional differences between feature locations during aggregation. DCPAF applies direction-aware encoding to each pooling scale before concatenation, so horizontal liquid-level features and vertical tube-boundary features already carry explicit directional labels when they are merged. The cross-stage alignment mechanism further ensures semantic consistency across scales in the channel dimension, producing a clear advantage under strict IoU thresholds. GCBlock_SPPF achieves the highest precision (93.6%) through global context suppression of false positives, but its additive global feedback lacks local spatial modeling, limiting mAP@50:95 to 77.5%. Figure 21 summarizes mAP@50:95 and FPS for all variants.

4.9. Loss Function Comparison

To verify the regression effectiveness, we compare eight bounding-box regression losses under identical training strategies.
Table 7. Comparison of bounding-box regression losses on the CTT dataset.
Loss Precision (%) Recall (%) mAP@50 (%) mAP@50:95 (%)
GIoU [24] 92.82 89.25 92.31 79.08
CIoU [25] 93.50 89.18 92.20 78.79
VarifocalLoss [52] 92.80 88.50 91.50 76.30
WIoU [26] 92.84 86.53 90.96 72.07
IoU-Focaler 91.80 89.50 91.20 77.10
IoU-Slide 90.40 88.80 90.20 75.10
IoU-FocalIoU 93.40 89.20 92.20 78.70
MPDIoU [27] 94.10 88.90 92.80 79.10
MPDIoU achieves the best precision (94.1%), mAP@50 (92.8%), and mAP@50:95 (79.1%). The critical property is that it reformulates regression as corner-point alignment: even when prediction and ground truth do not overlap, the corner distances provide directional gradients, preventing the loss from stalling during early training. For tubes in mixed-specification racks, the diagonal normalization factor scales with the enclosing rectangle, so a few-pixel width error on a 5 ml tube and a tens-of-pixel length error on a 50 ml tube receive equivalent penalty weights. CIoU adds center-distance and aspect-ratio constraints on top of GIoU but lacks explicit corner alignment; when adjacent tubes share nearly identical centers, these constraints cannot distinguish boundary position errors, capping mAP@50:95 at 78.79%. WIoU applies asymmetric weighting but suppresses the gradient contribution of boundary-contrast-weak tubes in dense layouts, causing recall to drop to 86.5% and mAP@50:95 to fall to 72.1%. Focal-based variants (IoU-Focaler, IoU-FocalIoU) direct attention to hard samples without introducing geometric correction, so improvements over GIoU remain small. MPDIoU addresses the dominant bottleneck in this task—corner-level geometric alignment—rather than sample reweighting, which explains why it outperforms all alternatives.

5. Conclusions

This paper presented WDA-TNET, a test-tube detector designed around the four coupled failure modes encountered on clinical conveyor lines: motion blur, glass glare, slender geometry, and dense packing. Rather than applying a single enhancement block, the design distributes the correction burden across four stages. WGSR targets motion blur through wavelet-energy-guided selective restoration, confining sharpening to degraded regions so that specular highlights are not amplified. GSCIM addresses glare-dominated channel responses in the backbone through direction-aware pooling and local cross-channel interaction, preserving liquid-level edge cues that isotropic attention tends to suppress. DCPAF replaces isotropic SPPF pooling with decoupled height-and-width encoding and a cross-stage short path, maintaining short-axis positional resolution while aggregating long-axis context. ATSS and MPDIoU stabilize supervision when positives are sparse and box overlap is weak, with corner-distance constraints providing non-zero gradients throughout early training.
On the newly constructed CTT dataset (11,955 images, 81,044 instances), WDA-TNET reaches 94.1% precision and 79.1% mAP@50:95, improving mAP@50:95 by 3.6 percentage points over YOLOv11. Ablation experiments confirm that each module contributes complementary gains: WGSR improves border visibility in degraded frames, GSCIM suppresses glare noise at the feature level, DCPAF brings the largest single-step gain in high-IoU localization (mAP@50:95 from 75.5% to 79.1%, Table 6), and MPDIoU strengthens mid-range IoU regression quality. Cross-domain evaluation on HeinSight4 (95.2% mAP@50:95) and VisDrone (21.85% mAP@50:95) confirms that the learned corrections generalize beyond the training distribution.
Current limitations are twofold. First, the model's 8.46 M parameters (Table 1) still restrict direct deployment on low-power edge boards; future work will investigate structured pruning and INT8 quantization to bring WDA-TNET within the operational envelope of embedded controllers. Second, multi-task joint learning that simultaneously outputs tube type, liquid level, and barcode region, rather than treating each as a separate pipeline stage, would reduce system latency and support fully autonomous robotic sorting.

Institutional Review Board Statement

Not applicable.

Data Availability Statement

The CTT dataset and trained model weights described in this study are available from the corresponding author upon reasonable request.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Zhao, Z.Z. Clinical Application and Effect Evaluation of Blood Specimen Tube Rack and Tube Identification Alarm System. China Health Stand. Manag. 2017, 8, 123–125. [Google Scholar]
  2. Jing, Y. An Automated Test Tube Identification and Sorting System Based on Machine Vision. In Proceedings of the Third International Conference on Biomedical and Intelligent Systems (IC-BIS 2024), SPIE; SPIE: Bellingham, WA, USA, 2024; Vol. 13208, p. 132081K. [Google Scholar]
  3. Liu, C.; Dong, J.; Lu, Q.; et al. High-Precision Serum Level Detection Method Based on HSV Color Space. Chin. J. Sci. Instrum. 2020, 41, 78–86. [Google Scholar]
  4. Dong, J. Research on Test Tube Liquid Level Detection Method Based on Machine Vision. Agric. Equip. Veh. Eng. 2021, 59, 108–115. [Google Scholar]
  5. Zhang, W.; Li, S.; Long, T.; et al. Real-Time Detection of Test Tube Position Based on Image Processing. Softw. Eng. Appl. 2022, 11, 425–434. [Google Scholar]
  6. Balia, R.; Barra, S.; Podda, A.S.; Pompianu, L.; Sangiovanni, M.; Fenu, G. Automated Classification of Test Tubes Based on Uncontrolled Image Analysis. In Proceedings of the 2024 IEEE International Conference on Metrology for eXtended Reality, Artificial Intelligence and Neural Engineering; IEEE: St Julians, Malta, 2024; pp. 1159–1164. [Google Scholar]
  7. Xu, T. Test Tube Counting Based on Hough Transform and Convolutional Neural Network. Electron. Technol. Softw. Eng. 2020, 135–136. [Google Scholar] [CrossRef]
  8. Liu, S.; Lin, J.; Chen, Z.; Zou, Z. Data Matrix Code Recognition Method for Test Tube–Rack System Based on Mask R-CNN. J. Fujian Univ. Technol. 2023, 21, 378–384. [Google Scholar]
  9. Şişman, A.R.; Başok, B.İ.; Karakoyun, İ.; Çolak, A.; Bilge, U.; Demirci, F.; Başoğlu, N. Measuring the Performance of an Artificial Intelligence-Based Robot That Classifies Blood Tubes and Performs Quality Control in Terms of Preanalytical Errors: A Preliminary Study. Am. J. Clin. Pathol. 2024, 161, 553–560. [Google Scholar] [CrossRef]
  10. Liang, J.F.; Domingo Palaoag, T. Develop a Liquid Level Detection Algorithm for Infusion Bottles Based on Ghost-YOLOv4. In Proceedings of the 2024 7th International Conference on Computer Information Science and Application Technology (CISAT); IEEE: Chengdu, China, 2024; pp. 211–215. [Google Scholar]
  11. Wu, Z.; Tan, F.; Li, D.; Tang, Z. Test Tube Sample Liquid Level Detection Based on YOLOv5. China Med. Devices 2023, 38, 61–67. [Google Scholar] [CrossRef]
  12. Ren, K.; Tao, Q. Test Tube Detection Algorithm Based on YOLOv5. Mod. Comput. 2022, 28, 1–8. [Google Scholar] [CrossRef]
  13. Liu, X.; Xie, S.; Jin, X.; Bian, K. TubeDet-YOLO: Edge-Based Detector for Color-Coded Vacuum Blood Collection Tube. In Proceedings of the 2025 10th International Conference on Automation, Control and Robotics Engineering (CACRE); IEEE: Dalian, China, 2025; pp. 198–203. [Google Scholar]
  14. Zhang, K.; Peng, Y.; Wang, Y.; Tang, J. Test Tube and Liquid Level Recognition Algorithm Based on Improved YOLOv8. J. Comput. Appl. 2024, S2. [Google Scholar]
  15. Yin, Y.; Lei, J.; Tao, W. Detection of Liquid Retention on Pipette Tips in High-Throughput Liquid Handling Workstations Based on Improved YOLOv8 Algorithm with Attention Mechanism. Electronics 2024, 13, 2836. [Google Scholar] [CrossRef]
  16. Ma, X.; Zhang, X.; Wang, T. Chemical Product Localization and Classification Based on Robotic Arm. Comput. Mod. 2021, (10), 88–126. [Google Scholar]
  17. Yin, S.; Li, M.; Jiang, Y.; et al. Robotic Grasp Detection for Medical Test Tubes Based on Convolutional Neural Network. In Proceedings of the 2023 2nd International Conference on Automation, Robotics and Computer Engineering; IEEE: Wuhan, China, 2023; pp. 1–6. [Google Scholar]
  18. Chen, H.; Wan, W.; Matsushita, M.; Kotaka, T.; Harada, K. In-Rack Test Tube Pose Estimation Using RGB-D Data. arXiv 2023, arXiv:2308.10411. [Google Scholar]
  19. Tang, Y.; Wan, W.; Chen, H.; et al. Zero-Shot Recognition of Test Tube Types by Automatically Collecting and Labeling RGB Data. IEEE Robot. Autom. Lett. 2025, 10, 8276–8283. [Google Scholar] [CrossRef]
  20. Hu, J.; Shen, L.; Sun, G. Squeeze-and-Excitation Networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: Salt Lake City, UT, USA, 2018; pp. 7132–7141. [Google Scholar]
  21. Wang, Q.; Wu, B.; Zhu, P.; Li, P.; Zuo, W.; Hu, Q. ECA-Net: Efficient Channel Attention for Deep Convolutional Neural Networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: Seattle, WA, USA, 2020; pp. 11534–11542. [Google Scholar]
  22. Hou, Q.; Zhou, D.; Feng, J. Coordinate Attention for Efficient Mobile Network Design. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: Nashville, TN, USA, 2021; pp. 13713–13722. [Google Scholar]
  23. Zhang, S.; Chi, C.; Yao, Y.; Lei, Z.; Li, S.Z. Bridging the Gap Between Anchor-Based and Anchor-Free Detection via Adaptive Training Sample Selection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: Seattle, WA, USA, 2020; pp. 9759–9768. [Google Scholar]
  24. Rezatofighi, H.; Tsoi, N.; Gwak, J.Y.; Sadeghian, A.; Reid, I.; Savarese, S. Generalized Intersection over Union: A Metric and A Loss for Bounding Box Regression. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: Long Beach, CA, USA, 2019; pp. 658–666. [Google Scholar]
  25. Zheng, Z.; Wang, P.; Liu, W.; Li, J.; Ye, R.; Ren, D. Distance-IoU Loss: Faster and Better Learning for Bounding Box Regression. In Proceedings of the AAAI Conference on Artificial Intelligence; AAAI: New York, NY, USA, 2020; Volume 34, pp. 12993–13000. [Google Scholar]
  26. Tong, Z.; Chen, Y.; Xu, Z.; Yu, R. Wise-IoU: Bounding Box Regression Loss with Dynamic Focusing Mechanism. arXiv 2023, arXiv:2301.10051. [Google Scholar]
  27. Ma, S.; Xu, Y. MPDIoU: A Loss for Efficient and Accurate Bounding Box Regression. arXiv 2023, arXiv:2307.07662. [Google Scholar] [CrossRef]
  28. El-Khawaldeh, R.; Zhang, W.; Corkery, R. HeinSight4.0 Dataset and Models for Dynamic Monitoring of Chemical Experiments [Dataset]; Zenodo: Geneva, Switzerland, 2025. [Google Scholar]
  29. Zhu, P.; Wen, L.; Du, D.; Bian, X.; Fan, H.; Hu, Q. Detection and Tracking Meet Drones Challenge. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 44, 7380–7399. [Google Scholar] [CrossRef]
  30. Buslaev, A.; Iglovikov, V.I.; Khvedchenya, E.; Parinov, A.; Druzhinin, M.; Kalinin, A.A. Albumentations: Fast and Flexible Image Augmentations. Information 2020, 11, 125. [Google Scholar] [CrossRef]
  31. Bochkovskiy, A.; Wang, C.Y.; Liao, H.Y.M. YOLOv4: Optimal Speed and Accuracy of Object Detection. arXiv 2020, arXiv:2004.10934. [Google Scholar] [CrossRef]
  32. Zhang, H.; Cisse, M.; Dauphin, Y.N.; Lopez-Paz, D. mixup: Beyond Empirical Risk Minimization. In Proceedings of the International Conference on Learning Representations; ICLR: Vancouver, BC, Canada, 2018. [Google Scholar]
  33. Jocher, G.; Chaurasia, A.; Qiu, J. YOLOv8 [Software]; Ultralytics: Los Angeles, CA, USA, 2023; Available online: https://github.com/ultralytics/ultralytics.
  34. Jocher, G.; Chaurasia, A.; Qiu, J. YOLOv11 [Software]; Ultralytics: Los Angeles, CA, USA, 2024. [Google Scholar]
  35. Wang, C.Y.; Yeh, I.H.; Liao, H.Y.M. YOLOv9: Learning What You Want to Learn Using Programmable Gradient Information. arXiv 2024, arXiv:2402.13616. [Google Scholar] [CrossRef]
  36. Tian, Y.; Ye, Q.; Doermann, D. YOLOv12: Attention-Centric Real-Time Object Detectors. arXiv 2025, arXiv:2502.12524. [Google Scholar]
  37. Rybak, L.A.; Cherkasov, V.V.; Malyshev, D.I.; Carbone, G. Blood Serum Recognition Method for Robotic Aliquoting Using Different Versions of the YOLO Neural Network. In Advances in Service and Industrial Robotics; Springer: Cham, Switzerland, 2023; pp. 150–157. [Google Scholar]
  38. Chen, Y.; Yuan, X.; Wang, J.; et al. YOLO-MS: Rethinking Multi-Scale Representation Learning for Real-Time Object Detection. IEEE Trans. Pattern Anal. Mach. Intell. 2025, 47, 4240–4252. [Google Scholar] [CrossRef]
  39. Ni, Z.; Chen, X.; Zhai, Y.; Tang, Y.; Wang, Y. Context-Guided Spatial Feature Reconstruction for Efficient Semantic Segmentation. In Proceedings of the European Conference on Computer Vision; Springer: Cham, Switzerland, 2024; pp. 241–258. [Google Scholar]
  40. Howard, A.G.; Zhu, M.; Chen, B.; et al. MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications. arXiv 2017, arXiv:1704.04861. [Google Scholar] [CrossRef]
  41. Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. CBAM: Convolutional Block Attention Module. In Proceedings of the European Conference on Computer Vision; Springer: Munich, Germany, 2018; pp. 3–19. [Google Scholar]
  42. Wu, Y.; He, K. Group Normalization. In Proceedings of the European Conference on Computer Vision; Springer: Munich, Germany, 2018; pp. 3–19. [Google Scholar]
  43. Ge, Z.; Liu, S.; Wang, F.; Li, Z.; Sun, J. YOLOX: Exceeding YOLO Series in 2021. arXiv 2021, arXiv:2107.08430. [Google Scholar] [CrossRef]
  44. Dai, X.; Chen, Y.; Xiao, B.; et al. Dynamic Head: Unifying Object Detection Heads with Attentions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: Nashville, TN, USA, 2021; pp. 7373–7382. [Google Scholar]
  45. Tian, Z.; Shen, C.; Chen, H.; He, T. FCOS: Fully Convolutional One-Stage Object Detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision; IEEE: Seoul, Republic of Korea, 2019; pp. 9627–9636. [Google Scholar]
  46. Liu, S.; Sun, Y.; Fu, X.; et al. SlimConv: Reducing Channel Redundancy in Convolutional Neural Networks by Features Recombining. IEEE Trans. Image Process. 2021, 30, 6434–6445. [Google Scholar] [CrossRef]
  47. Li, T.; Chen, Z.; He, Y.; Li, H. Large Separable Kernel Attention. Expert Syst. Appl. 2023, 236, 121352. [Google Scholar] [CrossRef]
  48. Yang, J.; Li, C.; Zhang, P.; et al. Focal Modulation Networks. In Advances in Neural Information Processing Systems; NeurIPS: New Orleans, LA, USA, 2022; Volume 35, pp. 4203–4217. [Google Scholar]
  49. Liu, Y.; Tian, Y.; Zhao, Y.; et al. VMamba: Visual State Space Model. arXiv 2024, arXiv:2401.13260. [Google Scholar]
  50. Cao, Y.; Xu, J.; Lin, S.; Wei, F.; Hu, H. GCNet: Non-Local Networks Meet Squeeze-Excitation Networks and Beyond. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops; IEEE: Seoul, Republic of Korea, 2019; pp. 1971–1980. [Google Scholar]
  51. Ding, X.; Zhang, X.; Zhou, X.; et al. UniRepLKNet: A Universal Perception Large-Kernel ConvNet for Audio, Video, Point Cloud, Time-Series and Image Recognition. arXiv 2023, arXiv:2311.15599. [Google Scholar]
  52. Zhang, H.; Wang, Y.; Dayoub, F.; Sünderhauf, N. VarifocalNet: An IoU-aware Dense Object Detector. arXiv 2020, arXiv:2008.13367. [Google Scholar]
Figure 1. Overall architecture of WDA-TNET. The pipeline consists of four stages: (1) WGSR preprocessing for selective motion-blur restoration, (2) GSCIM-based backbone for glare-suppressed feature extraction, (3) DCPAF neck for direction-aware multi-scale fusion, and (4) ATSS detection head with MPDIoU loss for stable supervision on slender, densely packed targets.
Figure 2. Three-stage pipeline of WGSR. The frequency-domain branch computes a wavelet-energy degradation mask; the spatial reconstruction branch recovers high-frequency textures via a lightweight three-layer CNN; the gated fusion stage injects the residual selectively into degraded regions only.
Figure 3. Wavelet sub-band decomposition and degradation mask generation. From left to right: input image; low-frequency approximation $c_A$; horizontal, vertical, and diagonal high-frequency components $\{c_H, c_V, c_D\}$; aggregated energy map $E_{\text{raw}}$; final degradation mask $M$ after normalization and Sigmoid gating.
Figure 4. Degradation masks generated under different gating steepness k. A small k (left) includes too many low-contrast regions; a large k (right) over-clips the transition zone. At k = 4.0 (middle), the mask covers approximately one standard deviation below the energy mean, which matches the measured blur-kernel-to-edge-width ratio in the CTT dataset.
Figure 5. Visualization of WGSR three-stage collaborative processing. From left to right: original blurred input; degradation mask $M$ (bright regions = clear, dark = degraded); reconstruction output $Y$; final restored image $I_{\text{out}}$. The restoration is concentrated on blur-affected zones while specular highlights remain unchanged.
Figure 6. Structure of the Direction-Aware Cross-Stage Pyramid Attention Fusion (DCPAF) module. The input feature is first recalibrated by two independent 1D pooling branches along height and width. The recalibrated feature then goes through chained 5 × 5 max-pooling for multi-scale context, which is concatenated with a cross-stage skip path and fused by 1 × 1 convolution.
Figure 7. Architecture of the Glare-Suppressed Channel Interaction Module (GSCIM). The input is split into a transformation branch (cascade of 3 × 3 convolutions and ECA channel-weighting units) and a shallow preservation branch. Both branches are concatenated and compressed by a 1 × 1 convolution. The ECA unit computes local cross-channel weights via adaptive 1D convolution, suppressing glare-dominated responses while retaining liquid-level edge cues.
Figure 8. Position of the ATSS module within the detection head. Multi-scale features P3–P5 pass through the prediction module; the outputs feed into both the loss computation branch and the ATSS assigner, which dynamically determines positive/negative labels based on candidate-set statistics rather than a fixed IoU threshold.
Figure 9. Flowchart of ATSS adaptive sample assignment. For each ground-truth box, top-k candidates are selected per feature level by center distance, merged across levels, and an adaptive IoU threshold $t_g = \mu_g + \sigma_g$ is computed from the candidate set. A candidate is labeled positive only when it simultaneously satisfies the IoU condition and the center-in-box constraint; conflicts from dense tube layouts are resolved by assigning each candidate to the ground truth with the highest IoU.
Figure 10. Corner-distance constraint of MPDIoU. Distances $d_1$ and $d_2$ are computed between the predicted and ground-truth top-left and bottom-right corners respectively, then normalized by the diagonal of the minimum enclosing rectangle. This formulation provides non-zero gradients regardless of overlap, and penalizes short-axis and long-axis deviations equally.
Figure 11. CTT dataset sample collection scenes. The top row shows pre-sorting layouts with tubes in mixed, unordered states; the bottom row shows rack-loaded configurations ready for robotic scanning. Both stages span hospital laboratory and ward settings.
Figure 12. CTT dataset annotation examples. Four annotation dimensions are shown: physical specification (tube size), internal sample state (blood/urine/empty), external barcode label, and cap color. Each dimension contributes complementary information to support robot-assisted tube sorting.
Figure 13. HeinSight4.0 dataset: five-category annotation examples. From left to right: air (empty container), air with residual gas, clear liquid, turbid liquid, and solid/particle phase. The transparent container walls introduce strong reflection and refraction in all categories.
Figure 14. Primary data augmentation examples. Each column illustrates one transform: horizontal flip, vertical flip, shift-scale-rotate, brightness perturbation, contrast perturbation, gamma correction, Mosaic, and MixUp. All transformations maintain annotation consistency via coordinate remapping.
Figure 15. Composition-level augmentation examples. Left: Mosaic (four images stitched), simulating dense and multi-scale tube layouts. Middle: MixUp (two images blended), training the model on overlapping or partially visible tubes. Right: Copy-Paste (tube instances transplanted), increasing instance diversity in under-represented cap-color categories.
Figure 16. Synthetic patch augmentation pipeline. From left to right: source annotated image; GrabCut foreground extraction; morphological opening to smooth mask boundaries; Gaussian feathering to produce a soft alpha channel; background image with color perturbation; final composited sample with retained bounding-box annotation.
Figure 17. Detail of GrabCut-based foreground segmentation and Gaussian feathering. Left: initial hard binary mask from GrabCut. Middle: mask after morphological opening removing boundary artifacts. Right: soft alpha channel after Gaussian convolution, enabling seamless alpha-blend compositing.
Figure 18. Wavelet-enhanced augmentation examples for motion-blur simulation. Left pair: dense, relatively uniform tube arrangement ($\rho$ low) yields mild blur. Right pair: sparse or clustered arrangement ($\rho$ high) triggers stronger blur. WGSR restores the degraded training samples before they enter detector training.
Figure 19. Precision, recall, and GFLOPs comparison across C3k2 backbone variants on the CTT dataset. GSCIM achieves the best balance: comparable precision to C3k2+MS with lower GFLOPs and a consistently higher recall than all norm-based alternatives.
Figure 20. Precision (P), recall (R), and FLOPs comparison of six detection heads on CTT. ATSS Head achieves the highest precision and recall while keeping FLOPs well below YOLOX Head (approximately one-quarter of YOLOX's computational cost).
Figure 21. mAP@50:95 (bars) and FPS (line) comparison of SPPF variants and DCPAF on CTT. DCPAF reaches the highest mAP@50:95 at a competitive inference speed.
Table 1. Performance comparison on the CTT dataset.
Model Precision (%) Recall (%) mAP@50 (%) mAP@50:95 (%) Params (M)
yolov5_liquid_level 93.40 89.23 92.63 78.79 9.13
YOLOv8 [33] 91.64 87.99 91.19 76.22 3.01
YOLOv11 [34] 92.17 87.44 90.87 75.50 2.59
YOLOv8aa 91.90 87.96 90.85 75.71 2.94
yolov8_improved_pip 91.13 88.82 90.63 75.78 3.02
06_gold_yolo_style 91.15 88.17 90.61 74.46 2.59
yolov5_shufflenet 90.39 88.54 90.40 75.31 6.85
YOLOv9t [35] 90.14 88.36 90.19 75.21 2.01
YOLOv12 [36] 91.12 88.52 90.14 75.02 2.57
yolov8n_expiry_date 89.61 88.14 89.63 73.21 2.91
TubeDet-YOLO [13] 88.40 85.82 88.31 69.83 1.03
YOLOv5 liquid-level localization 87.97 79.24 83.30 66.20 2.75
WDA-TNET (Ours) 94.08 88.91 92.77 79.08 8.46