Preprint
Article

This version is not peer-reviewed.

Semi-Supervised Traffic Sign Detection with Dynamic Pseudo-Label Selection and Gated-Feature-Fusion-Based Proposal Refinement

Submitted: 29 April 2026
Posted: 30 April 2026


Abstract
Accurate traffic sign detection is important for the safety of autonomous driving systems. However, fully supervised methods require a large amount of manual annotation, which is cost-prohibitive and time-consuming. Semi-supervised methods employ a small amount of labeled data and a large amount of unlabeled data to train the models, hence largely reducing the annotation costs. However, these methods have the following challenges: (1) with an imbalanced long-tail class distribution of traffic signs, they tend to achieve poor performance on tail classes; (2) they often fail to detect small traffic signs. To solve these issues, we propose a Semi-Supervised Traffic Sign Detection method with Dynamic Pseudo-Label Selection and Gated-Feature-Fusion-based Proposal Refinement. Firstly, we design a Class-Distribution-based Dynamic Pseudo-Label Selection module (CD-DPLS) to select pseudo-labels for different classes based on the class distribution information, which reduces the tendency to select more pseudo-labels from head classes instead of tail classes, thereby improving the tail class detection performance. Secondly, we employ a Gated-Feature-Fusion-based Proposal Refinement strategy (GFF-PR) to refine detection proposals by fusing different-scale features with a gating mechanism, which facilitates the detection of small traffic signs. Besides, we use an Adaptive-Weight Focal Loss (AWFL), with which the weight of each pseudo-label is determined by the ratio between its classification confidence and the corresponding class-specific classification-confidence threshold. Experiments on traffic sign datasets demonstrate that the proposed method outperforms state-of-the-art semi-supervised approaches, with mAP50 scores of 11.5% and 36.3% using only 1% and 10% labeled data, respectively.

1. Introduction

Traffic signs provide essential visual guidance to drivers by conveying regulatory, warning, and informational messages in the road environment. In real-world driving scenarios, failing to correctly detect traffic signs can significantly increase the risk of traffic accidents. With the continuous advancement of autonomous driving and Advanced Driver Assistance Systems (ADAS), automatic and accurate traffic sign detection has become increasingly vital for driving safety [1].
In recent years, deep-learning-based object detection methods have achieved great breakthroughs, largely due to the availability of large-scale annotated datasets [2,3,4,5,6,7]. These fully supervised methods have demonstrated excellent performance across various research domains. However, they heavily rely on extensive manual annotation, which is inherently labor-intensive and cost-prohibitive, especially for traffic sign detection tasks that require high-precision bounding boxes for small targets.
To reduce the dependence on annotated data, researchers have turned to semi-supervised learning methods [8,9,10,11], which rely on a small amount of labeled data and a large amount of unlabeled data. By exploiting information from the unlabeled data, semi-supervised learning can achieve accurate and robust detection results while significantly lowering the dependence on expensive annotations, particularly when labeled data is hard to obtain. The semi-supervised learning methods mainly fall into two categories: Consistency training and Pseudo-labeling [12,13,14,15,16,17,18,19,20,21,22].
Consistency training typically constructs a regularization term to enforce prediction invariance across different perturbations, thereby smoothing the decision boundary. However, this mechanism inherently depends on the reliability of supervision signals. If the underlying pseudo-labels are inaccurate due to class imbalance, consistency regularization may inadvertently reinforce incorrect predictions.
Existing pseudo-labeling methods in traffic scenes usually rely on a single classification-confidence threshold for all object classes. However, this simple strategy ignores the difference in classification confidences of object classes, which is due to the imbalanced class distribution. As shown in Figure 1, the traffic sign dataset exhibits imbalance in instance counts of different classes, where the classes on the left appear much more frequently than those on the right. This is characterized by a long-tail, imbalanced class distribution. Here we take the 18 classes with more than 500 instances as “head” classes and the remaining 67 classes with fewer than 500 instances as “tail” classes. Since the model mainly learns from the head classes, it tends to predict these classes with higher classification confidences. In contrast, due to insufficient training, the model tends to predict tail classes with lower classification confidences. Therefore, a fixed threshold will filter out many potential positive samples from tail classes and treat them as background, which will lead to a severe loss of useful supervision information from tail classes.
To address these challenges, we propose a semi-supervised traffic sign detection method with dynamic pseudo-label selection and Gated-Feature-Fusion-based Proposal Refinement. Built on YOLOv11, the proposed method introduces class-aware pseudo-label selection, gated cross-scale feature fusion, and adaptive loss weighting to reduce the bias toward selecting pseudo-labels from head classes over tail classes and to improve the detection of small traffic signs. The main contributions of this paper are summarized as follows:
(1) We propose a Class-Distribution-based Dynamic Pseudo-Label Selection module (CD-DPLS). Instead of using a single global classification-confidence threshold for all classes, the CD-DPLS assigns different classification-confidence thresholds to different traffic sign classes to avoid retaining more pseudo-labels for head classes while filtering out tail classes. To make the class distribution estimation more reliable under limited labeled data, we combine two kinds of class distributions: the class distribution from labeled data and the class distribution provided by CLIP on unlabeled data. Thus, the proposed method can adjust class-specific classification-confidence thresholds dynamically to improve the detection performance on tail classes.
(2) We propose a Gated-Feature-Fusion-based Proposal Refinement strategy (GFF-PR). To avoid the missed detections of small traffic signs, the GFF-PR fuses different-scale features through a gating mechanism, and uses the fused features to refine object proposals. Specifically, the GFF-PR uses two types of proposals based on the fused feature pyramid for training. The first type of proposals consists of those whose fused confidence (computed as the product of classification confidence and localization confidence) exceeds a dynamic positive threshold determined from the mean and standard deviation of fused confidences in the current batch. The second type of proposals consists of those whose fused confidence lies between the dynamic positive threshold and the fixed negative threshold of 0.1. These proposals are retained only when the confidence gain between the fused feature pyramid and original feature pyramid exceeds 0.2. This strategy thereby improves the detection performance of small traffic signs.
(3) We propose an Adaptive-Weight Focal Loss (AWFL). Existing methods usually treat all pseudo-labels above the classification-confidence threshold equally, while the AWFL assigns each pseudo-label an adaptive weight based on the ratio between its classification confidence and the class-specific classification-confidence threshold. As a result, pseudo-labels, which are much higher than the aforementioned threshold, can get larger weights. This mitigates the issues caused by the imbalanced class distributions.

3. Methods

3.1. Overview of Our Method

The semi-supervised traffic sign detection method aims to address the limitations of existing methods in imbalanced class distributions and small object detection. The overall framework is illustrated in Figure 2. The proposed method consists of three main components: the Class-Distribution-based Dynamic Pseudo-Label Selection module, the Gated-Feature-Fusion-based Proposal Refinement strategy, and the Adaptive-Weight Focal Loss. The efficient single-stage fully convolutional network, YOLOv11, is adopted as the baseline detector.
We divide the training data into a small set of labeled data $D_L = \{x_i^l, y_i^l\}_{i=1}^{N_l}$ and a large set of unlabeled data $D_U = \{x_i^u\}_{i=1}^{N_u}$, typically satisfying $N_u \gg N_l$.
The framework consists of a teacher model and a student model with identical structures. The student model updates its parameters via backpropagation, while the teacher model’s parameters are updated using the Exponential Moving Average (EMA) of the student model’s parameters to ensure stability in pseudo-label generation.
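The EMA update of the teacher's parameters can be sketched as follows. This is a minimal illustration over a plain dictionary of scalar parameters, not the actual PyTorch state-dict handling:

```python
def ema_update(teacher_params, student_params, alpha=0.999):
    """Update teacher parameters as an exponential moving average of the
    student's: t <- alpha * t + (1 - alpha) * s, for each parameter."""
    return {name: alpha * t + (1.0 - alpha) * student_params[name]
            for name, t in teacher_params.items()}
```

A large `alpha` (0.999 in the experiments) means the teacher changes slowly, which stabilizes the pseudo-labels it generates.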

3.2. Class-Distribution-Based Dynamic Pseudo-Label Selection

Existing pseudo-label selection methods mainly rely on fixed confidence estimation to filter unreliable predictions. Although such strategies can improve the quality of pseudo-labels to some extent, they are still insufficient for traffic sign detection under class imbalance. In particular, teacher models tend to produce higher classification confidences for head classes, while the classification confidences of tail classes are lower due to limited training samples. Therefore, using a fixed global threshold or a simple Top-k strategy will suppress the tail classes, introducing pseudo-label noise for head classes. This issue may accumulate during iterative semi-supervised training, further exacerbating the model’s preference for head classes.
More importantly, using fixed classification-confidence thresholds alone cannot explicitly regulate the distribution of pseudo-labels. As a result, even after low-quality pseudo-labels are filtered out, the generated pseudo-labels may still remain imbalanced across classes, which prevents the detector from learning tail classes. To address this issue, we propose the Class-Distribution-based Dynamic Pseudo-Label Selection (CD-DPLS) module, as illustrated in Figure 3. Instead of applying fixed classification-confidence thresholds to all classes, the CD-DPLS exploits the class distribution derived from labeled data and unlabeled data to dynamically adjust the classification-confidence threshold for each class. The CD-DPLS improves the detection performance on tail classes.

3.2.1. Theoretical Basis and Distribution Estimation

The effectiveness of class distribution probabilities is grounded in a statistical observation: in semi-supervised learning settings, even when the proportion of labeled data is small, the empirical class distribution of the labeled data can still approximate the class distribution characteristics of the entire dataset.
Assuming the labeled dataset $D_L$ and the unlabeled dataset $D_U$ are independent and identically distributed, according to the generalization of Hoeffding's inequality, for any class $k$, the deviation between the estimated class distribution probability $\gamma_k$ and the true distribution $\gamma_k^*$ satisfies the following probability bound:

$$\left| \gamma_k - \gamma_k^* \right| \le \sqrt{\frac{\log(2n)}{2n}} + \sqrt{\frac{\log(2m)}{2m}}$$

where $n$ and $m$ denote the numbers of labeled and unlabeled data, respectively. As the number of labeled samples $n$ increases, this error converges at a rate of $O_p(1/\sqrt{n})$.
However, this distribution estimation approach based on labeled data still has clear limitations in long-tailed semi-supervised scenarios. On the one hand, when the labeling ratio is low, the number of samples from tail classes in the labeled set is often limited. As a result, although the overall error bound decreases as $n$ grows, the estimated class probabilities for different classes may still fluctuate. On the other hand, the unlabeled data is usually more abundant than the labeled data and thus contains richer information. Relying only on the labeled set makes it difficult to fully capture the latent distribution characteristics of the entire training dataset. Therefore, we calibrate the dataset's class distribution by combining the class distribution probabilities derived from labeled data with those from unlabeled data estimated by a pre-trained CLIP, ultimately obtaining more reliable class distribution probabilities for the subsequent dynamic selection process.

3.2.2. CLIP-Based Class Distribution Estimation

Since class labels in traffic sign detection tasks are often abbreviations (e.g., “pl50”, “p23”) that lack semantic information, directly applying a pre-trained CLIP for zero-shot inference is challenging. To this end, we design a Class-Semantic Mapping Mechanism that leverages CLIP's generalization ability to extract class distribution probabilities $\gamma_k^{clip}$ from unlabeled data, complementing the class distribution probabilities $\gamma_k^{label}$ derived from labeled data.
Because these engineering codes cannot be interpreted directly by the CLIP text encoder, we construct a mapping function $M$ that maps the abbreviated label space $C_{abbr}$ to the natural language description space $C_{desc}$. For example, “pl50” is mapped to “Speed limit 50 km/h”, and “p23” is mapped to “No left turn”. To enhance the contextual representation of textual features, we insert the mapped descriptions into a prompt template to generate the text embedding $t_k$ for each class $k$:

$$t_k = \mathrm{Encoder}_{\text{text}}\big(\text{“A photo of a ”} + M(\text{class}_k) + \text{“ traffic sign”}\big)$$
Then we use the CLIP to perform sampling inference on the unlabeled dataset D U to estimate its distribution. Specifically, for unlabeled images, we use the preliminary prediction boxes from the teacher model to crop them and input them into the CLIP image encoder to extract visual features v i . By calculating the cosine similarity between the visual features and all class text embeddings t k , we obtain the semantic prediction probability for the sample:
$$p_k^{clip}(v_i) = \frac{\exp\big(\mathrm{sim}(t_k, v_i)/\tau\big)}{\sum_{j=1}^{K} \exp\big(\mathrm{sim}(t_j, v_i)/\tau\big)}$$
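The temperature-scaled softmax over similarities can be sketched in a few lines. This is a generic illustration of the computation, not CLIP's actual implementation; the temperature value 0.07 is the common CLIP default, taken here as an assumption:

```python
import math

def clip_class_probs(sims, tau=0.07):
    """Softmax over cosine similarities between one crop's visual feature
    and the K class text embeddings; tau is the temperature."""
    exps = [math.exp(s / tau) for s in sims]
    z = sum(exps)
    return [e / z for e in exps]
```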
By aggregating these semantic predictions over a subset of unlabeled samples, we obtain the CLIP-based class distribution estimate $\gamma_k^{clip}$. Finally, to combine the class distribution probabilities derived from labeled and unlabeled data, we fuse the two via a weighted summation, yielding the final fused class distribution probability $\gamma_k$:
$$\gamma_k = \beta \cdot \gamma_k^{label} + (1 - \beta) \cdot \gamma_k^{clip}$$
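The fusion step can be sketched as follows, starting from raw per-class counts on each side. Normalizing counts into probabilities before the weighted sum is an assumption about the bookkeeping; the paper only specifies the final weighted combination:

```python
def fuse_class_distributions(label_counts, clip_counts, beta=0.6):
    """Fuse per-class probabilities from labeled-data instance counts and
    CLIP-predicted class counts on unlabeled data; beta weights the labeled
    side, matching gamma_k = beta * gamma_label + (1 - beta) * gamma_clip."""
    n_label, n_clip = sum(label_counts), sum(clip_counts)
    return [beta * l / n_label + (1.0 - beta) * c / n_clip
            for l, c in zip(label_counts, clip_counts)]
```

Because both inputs are normalized, the fused vector is itself a valid probability distribution.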

3.2.3. Class-Distribution-Based Threshold Setting

To achieve better alignment between the pseudo-label distribution and the true distribution, we formulate the semi-supervised learning process as an optimization problem with regularization constraints. Unlike traditional methods that only minimize classification loss, we introduce a class-aware regularization term to explicitly control the quantity of pseudo-labels generated for each class.
Specifically, for class $k$, the teacher model predicts the classification confidences for the corresponding traffic sign on the unlabeled dataset, and these confidences are sorted in descending order as $Conf_{k,1} \ge Conf_{k,2} \ge \cdots \ge Conf_{k,m}$. Additionally, we introduce positive- and negative-sample reliability ratios $\eta_1, \eta_0 \in (0, 1)$ to filter out false positive samples. The top $\eta_1 \gamma_k$ proportion of samples are selected as positive pseudo-labels, while the bottom $\eta_0 (1 - \gamma_k)$ proportion are selected as negative pseudo-labels. Thereby, the classification-confidence thresholds of positive and negative samples are computed as follows:

$$\tau_k^+ = Conf_{k,\,\lfloor \eta_1 \gamma_k m \rfloor}, \qquad \tau_k^- = Conf_{k,\,m - \lfloor \eta_0 (1 - \gamma_k) m \rfloor}$$

Under this definition, $\tau_k^+$ and $\tau_k^-$ vary dynamically with the class distribution probability to improve the detection accuracy of each class.
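The threshold computation can be sketched as follows. The floor-based rank indexing and the clamp to at least one sample are our assumptions; the paper does not spell out the exact rounding:

```python
def class_thresholds(sorted_confs, gamma_k, eta1=0.90, eta0=0.97):
    """Given teacher confidences for class k sorted in descending order,
    derive the positive threshold from the top eta1*gamma_k fraction and
    the negative threshold from the bottom eta0*(1 - gamma_k) fraction."""
    m = len(sorted_confs)
    n_pos = max(int(eta1 * gamma_k * m), 1)          # positives to keep
    n_neg = max(int(eta0 * (1.0 - gamma_k) * m), 1)  # negatives to keep
    tau_pos = sorted_confs[n_pos - 1]
    tau_neg = sorted_confs[m - n_neg]
    return tau_pos, tau_neg
```

A tail class with small $\gamma_k$ keeps fewer but proportionally allocated positives, so its threshold adapts to its own confidence range instead of a global cutoff.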

3.2.4. Dynamic Pseudo-Label Selection

Finally, the classification pseudo-label $\hat{y}_k$ for an unlabeled sample $x$ is determined by the teacher model according to the following rule:

$$\hat{y}_k = \begin{cases} 1, & Conf_k(x) \ge \tau_k^+ \\ 0, & Conf_k(x) \le \tau_k^- \\ \text{ignore}, & \text{otherwise} \end{cases}$$
This mechanism encourages the pseudo-label distribution of each class to better align with the target class distribution.
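The three-way selection rule above translates directly into code; here `None` stands in for the "ignore" outcome:

```python
def assign_pseudo_label(conf, tau_pos, tau_neg):
    """Positive above tau_pos, negative below tau_neg, ignored in between
    (ignored samples contribute no supervision signal)."""
    if conf >= tau_pos:
        return 1
    if conf <= tau_neg:
        return 0
    return None
```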

3.3. Gated-Feature-Fusion-Based Proposal Refinement Strategy

In the previous section, we introduced the CD-DPLS to address the challenge of imbalanced class distributions. However, in semi-supervised object detection, pseudo-label quality is also limited by the missed detection of small objects. The small objects often obtain low classification confidences due to their limited visual information, making them more likely to be mistaken as background. As a result, many potentially useful positive samples are ignored during the training.
To address this issue, we propose the Gated-Feature-Fusion-based Proposal Refinement strategy (GFF-PR). The main idea is to enhance the feature representation of small traffic signs by adaptively fusing features from different pyramid levels. Based on the fused feature pyramid, the model generates proposals for training and further refines them. In this way, proposals, which are weak in the original feature pyramid but become more reliable after feature fusion, can be identified and retained as valuable training signals.

3.3.1. Feature Pyramid Construction for Feature Fusion

To introduce scale robustness at the feature level, we construct a fused feature pyramid. Specifically, the teacher model processes the original-scale image and a 0.5× down-sampled image in parallel. After extracting features with the YOLOv11 backbone, we align and fuse the original-scale feature layer $F_i$ with the down-sampled feature layer $F_{i-1}^{down}$. To achieve adaptive feature fusion, we design a gated fusion function $g(\cdot)$ to dynamically compute the weights. As shown in Figure 4, for the $i$-th feature layer, we first concatenate the original-scale feature $F_i$ and the down-sampled feature $F_{i-1}^{down}$ along the channel dimension, and generate the fusion coefficient $\lambda_i$ through a lightweight network composed of global average pooling and a Multi-Layer Perceptron (MLP):

$$\lambda_i = \sigma\big(\mathrm{MLP}(\mathrm{AvgPool}([F_i, F_{i-1}^{down}]))\big)$$

where $\sigma$ is the Sigmoid activation function, which normalizes the weights to $(0, 1)$. The final fused feature $F_i^{fused}$ is obtained by weighted summation:

$$F_i^{fused} = \lambda_i \cdot F_i + (1 - \lambda_i) \cdot F_{i-1}^{down}$$
This mechanism allows the network to adaptively select whether to retain the detailed information from the original feature layer or the semantic information from the down-sampled feature layer, based on the specific scale characteristics of the target.
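The gating computation can be sketched numerically. Here a single linear layer stands in for the paper's MLP, the feature maps are assumed already aligned to the same `(C, H, W)` shape, and the gate is a scalar per layer; these are simplifications of the actual architecture:

```python
import numpy as np

def gated_fuse(F_i, F_down, w, b):
    """Gated fusion of two same-shape (C, H, W) feature maps: global-average-
    pool the concatenated channels, apply a linear layer + sigmoid to get the
    gate lambda, and blend the maps as lambda*F_i + (1-lambda)*F_down."""
    pooled = np.concatenate([F_i.mean(axis=(1, 2)), F_down.mean(axis=(1, 2))])
    lam = 1.0 / (1.0 + np.exp(-(w @ pooled + b)))  # sigmoid gate in (0, 1)
    return lam * F_i + (1.0 - lam) * F_down
```

With zero weights the gate sits at 0.5 and the output is the plain average of the two maps; training moves the gate toward whichever scale carries more useful detail.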

3.3.2. Candidate Boxes Selection Based on Confidence Gain

Since YOLOv11 is a one-stage anchor-free detector, it lacks the candidate box set generated by a Region Proposal Network (RPN). To evaluate candidates under the dense prediction paradigm, we utilize the prediction branch structure of YOLO to construct a dual confidence metric.
Dense Prediction Alignment and Association: Let $B$ be the set of predictions output by the teacher model on the original feature pyramid, and $B^{fused}$ be the set of predictions on the fused feature pyramid. For each candidate box $b_i \in B^{fused}$, we first compute its fused confidence and categorize it into the positive, ambiguous, or negative group according to the positive threshold $\tau_{pos}$ and negative threshold $\tau_{neg}$. $\tau_{pos}$ is dynamically determined from the mean and standard deviation of fused confidences in the current batch, while $\tau_{neg}$ is fixed at 0.1. For each ambiguous candidate box, we compare it with the corresponding prediction $b_i' \in B$ at the same spatial location.
Dual Confidence Gain Calculation: The confidences in the original and fused feature pyramids are computed as

$$Conf(b_i) = Cls(b_i) \cdot IoU(b_i)$$

$$Conf_{fused}(b_i) = Cls_{fused}(b_i) \cdot IoU_{fused}(b_i)$$

Here, $Cls(\cdot)$ represents the classification confidence predicted by the network, and $IoU(\cdot)$ represents the localization confidence predicted by the network.
Pseudo-Label Generation: We define the fusion confidence gain as $\Delta Conf = Conf_{fused}(b_i) - Conf(b_i)$. If $\Delta Conf$ exceeds 0.2, the ambiguous proposals are retained, and the proposals $b_i$ from the fused feature pyramid are adopted. Therefore, the final candidate box set consists of two parts: positive candidates whose fused confidence exceeds the dynamic positive threshold, and ambiguous candidates that satisfy the fused-confidence gain criterion. In this way, the GFF-PR reduces missed detections of small traffic signs through feature fusion.
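The selection logic can be sketched as follows. The specific form of the dynamic positive threshold, mean plus one standard deviation of the batch's fused confidences, is our assumption; the paper only states it is derived from the mean and standard deviation:

```python
def select_proposals(cands, tau_neg=0.1, gain_thresh=0.2):
    """cands: list of (conf_orig, conf_fused) pairs for one batch.
    Keep positives above the dynamic threshold, plus ambiguous candidates
    whose fused-vs-original confidence gain exceeds gain_thresh."""
    fused = [c[1] for c in cands]
    mu = sum(fused) / len(fused)
    sd = (sum((x - mu) ** 2 for x in fused) / len(fused)) ** 0.5
    tau_pos = mu + sd  # assumed form of the dynamic positive threshold
    keep = []
    for conf, conf_f in cands:
        if conf_f >= tau_pos:
            keep.append((conf, conf_f))      # positive candidate
        elif conf_f > tau_neg and conf_f - conf > gain_thresh:
            keep.append((conf, conf_f))      # ambiguous, rescued by the gain
    return keep
```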

3.4. Overall Optimization Objective

To fully leverage the proposed semi-supervised framework, we design a comprehensive objective function. The total loss L t o t a l for the student model is composed of a supervised loss on labeled data and a multi-task unsupervised loss on unlabeled data:
$$L_{total} = L_{sup} + L_{unsup} = L_{sup}^{cls} + L_{sup}^{loc} + L_{unsup}^{AWFL} + L_{unsup}^{loc} + \mu \cdot L_{unsup}^{iou}$$
For the labeled data D L , we employ standard detection losses. Specifically, L s u p c l s utilizes the Binary Cross Entropy Loss to minimize the classification error between the prediction and the ground truth class labels. For bounding box regression, L s u p l o c adopts the Generalized IoU Loss.
For the unlabeled data D U , the loss terms are computed using the refined pseudo-labels:
Adaptive-Weight Focal Loss: For the unsupervised classification loss, standard semi-supervised methods typically use a binary cross-entropy or standard Focal Loss, and treat all filtered pseudo-labels as equally important ground truths. This is ineffective because pseudo-labels have different levels of reliability. Moreover, directly using the raw classification confidences as the weight leads to a side-effect: it gives higher weights to head-class pseudo-labels and lower weights to tail-class pseudo-labels. As a result, the model pays less attention to the very classes that need more supervision. To address this, we propose the Adaptive-Weight Focal Loss ($L_{unsup}^{AWFL}$). Instead of relying on raw classification confidences, we introduce a Class-Adaptive Relative Weight strategy that adjusts the contribution of each pseudo-label based on how much its classification confidence exceeds the class-specific classification-confidence threshold. The formulation is:
$$L_{unsup}^{AWFL} = -\sum_j \omega_j \cdot (1 - p_j)^{\gamma} \log p_j$$

where $p_j$ is the student model's classification confidence for the target class, and $\gamma$ is the focusing parameter. Crucially, $\omega_j$ is defined as the ratio between the teacher model's classification confidence and the class-specific classification-confidence threshold:

$$\omega_j = \min\left(\frac{p_j^{fused}}{\tau_{c_j}^+},\ 1.0\right)$$

where $p_j^{fused}$ represents the classification confidence of the $j$-th sample in the fused feature pyramid, and $\tau_{c_j}^+$ denotes the positive classification-confidence threshold corresponding to the predicted class $c_j$ of the sample. By defining $\omega_j$ as the ratio of the classification confidence to the class-specific threshold, we ensure that tail-class samples are assigned high importance weights once they exceed their corresponding thresholds. This mitigates the issues caused by the imbalanced class distributions.
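Putting the two formulas together, the loss over a set of positive pseudo-labels can be sketched as a minimal scalar implementation (the real loss operates on dense prediction tensors):

```python
import math

def awfl_loss(p_student, p_teacher_fused, tau_pos, gamma=2.0):
    """Adaptive-Weight Focal Loss: sum of focal terms, each weighted by
    w_j = min(teacher fused confidence / class-specific threshold, 1)."""
    total = 0.0
    for p, p_f, tau in zip(p_student, p_teacher_fused, tau_pos):
        w = min(p_f / tau, 1.0)
        total += -w * (1.0 - p) ** gamma * math.log(p)
    return total
```

Note that a tail-class sample whose confidence just reaches its (low) class threshold already receives the full weight of 1, whereas under raw-confidence weighting it would be down-weighted relative to head-class samples.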
Localization Loss: Consistent with the supervised branch, we employ GIoU Loss for bounding box regression.
IoU Consistency Loss: To improve the student’s estimation of localization quality, we employ Binary Cross Entropy Loss as an auxiliary task.

4. Experiments

The proposed semi-supervised object detection method is implemented in Python using the PyTorch deep learning framework. It employs YOLOv11s as the detector, with over 300 epochs of pre-training followed by 100 epochs of semi-supervised fine-tuning. During training, both the labeled and unlabeled batches contain 16 samples. The Adam optimizer is used with a constant learning rate of 0.001. The Exponential Moving Average (EMA) parameter $\alpha$ is set to 0.999, and the loss weight $\mu$ is set to 1.

4.1. Evaluation Metrics

To evaluate the performance of the proposed method, we adopt mAP50, $AP_S$, $AP_M$, and $AP_L$ as evaluation metrics.
The mean Average Precision (mAP) is a standard metric for object detection, computed from the Precision-Recall (P-R) curve:

$$mAP = \int_0^1 P(R)\, dR$$

$$P = \frac{TP}{TP + FP}$$

$$R = \frac{TP}{TP + FN}$$

where $P(R)$ denotes the precision at a given recall level $R$; $TP$, $FP$, and $FN$ denote the numbers of true positives, false positives, and false negatives, respectively. mAP50 denotes the mAP computed at an IoU threshold of 0.5. $AP_S$, $AP_M$, and $AP_L$ represent the average precision for small, medium, and large-scale objects, respectively.
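The integral above can be approximated numerically from sampled P-R points. This rectangle-rule sketch is a generic illustration, not the exact interpolation scheme used by any particular benchmark:

```python
def average_precision(recalls, precisions):
    """Rectangle-rule approximation of the area under the P-R curve.
    Assumes recalls are sorted in ascending order, paired with precisions."""
    ap, prev_r = 0.0, 0.0
    for r, p in zip(recalls, precisions):
        ap += p * (r - prev_r)  # width of the recall step times its precision
        prev_r = r
    return ap
```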

4.2. Ablation Study

In ablation studies, we conducted experiments using the 10% labeled data setting. As shown in Table 1, we conduct the experiments to evaluate the contribution of each component of the proposed semi-supervised traffic sign detection method.
As shown in Table 1, the baseline model, Efficient Teacher, achieves only 23.2% mAP50 with 10% labeled data, whereas the CD-DPLS improves the mAP50 from 23.2% to 33.2%. This improvement mainly comes from the ability of the CD-DPLS to dynamically adjust the pseudo-label selection threshold for each class according to the class distribution probabilities, thereby improving the detection performance of the proposed method.
We then evaluate the effectiveness of the GFF-PR independently based on the baseline model. The experimental results show that the GFF-PR improves the mAP50 from 23.2% to 33.8% (see Table 1). This is because the GFF-PR successfully retains the missed detections of small traffic signs. When the CD-DPLS and the GFF-PR are used together, the proposed method further increases the mAP50 from 23.2% to 35.6% due to advantages of the CD-DPLS and the GFF-PR.
To further verify the effectiveness of the proposed loss function, we replace the focal loss in Efficient Teacher with AWFL. As shown in the last row of Table 1, our method with the AWFL improves the mAP50 from 35.6% to 36.3%, achieving an additional gain of 0.7%. By combining the CD-DPLS, GFF-PR, and AWFL, the proposed method achieves the best overall performance.
To visually demonstrate the effectiveness of the proposed components in improving traffic sign detection performance, we conduct an experiment with rare traffic sign classes, as well as small traffic signs. As shown in Figure 5, the baseline model, Efficient Teacher, misses the detection of the tail class “pr40”, while it misclassifies other tail-class objects, such as “pr60” and “w66”. In addition, it also fails to detect small traffic signs in the scenes.
With the CD-DPLS, the model not only detects the previously missed tail-class “pr40”, but also corrects the misclassification of “pr60” and “w66” by dynamically adjusting the classification-confidence thresholds, thereby improving the detection performance on tail classes. With the GFF-PR, the model can capture small distant targets.
Finally, after adding the AWFL for overall optimization, the model shows higher classification confidence.
Different Feature Scales: To verify the impact of fused features in the GFF-PR module, we compare the detection accuracy and inference speed of the down-sampled feature ($F^{down}$), the original feature ($F$), and the fused feature ($F^{fused}$). Details are shown in Table 2.
The experimental results show that, although the down-sampled feature $F^{down}$ has the fastest inference speed (56 FPS), the reduced feature-map resolution loses the details of small objects, resulting in only 8.9% small-object average precision ($AP_S$) and the lowest overall mAP50.
In contrast, the fused feature $F^{fused}$ achieves the best detection performance. Specifically, $AP_S$ increases from 12.4% to 15.4%, demonstrating the effectiveness of the GFF-PR strategy for small traffic sign detection. Although the inference speed decreases to 46 FPS, the model still satisfies real-time detection requirements. Besides small objects, the fused feature strategy also improves detection performance on medium and large objects.
Pseudo-Label Selection Strategies: To validate the effectiveness of the proposed CD-DPLS, we compare the model performance under different pseudo-label filtering strategies. Traditional semi-supervised object detection methods typically employ a fixed classification-confidence threshold to filter pseudo-labels; however, this traditional strategy exhibits limitations in the scenario of imbalanced class distributions.
As shown in Table 3, the threshold setting of 0.9 drops the mAP50 of tail classes to 15.4% and yields only 20.5 valid pseudo-labels per image on average, limiting the overall mAP50 to 30.5%. Conversely, the threshold setting of 0.5 limits the overall mAP50 to 33.8% because of the incorrect pseudo-labels it introduces. In contrast, by dynamically adjusting classification-confidence thresholds based on the fused class distribution probability, the CD-DPLS boosts tail-class detection performance to 24.8% and achieves the best overall performance (36.3% mAP50).

4.3. Parameter Sensitivity Analysis

4.3.1. Positive- and Negative-Sample Reliability Ratios

The performance of the CD-DPLS is highly dependent on the threshold settings used for pseudo-label filtering. Specifically, this process is controlled by the positive- and negative-sample reliability ratios $\eta_1$ and $\eta_0$, which define the proportions of reliable positive and negative pseudo-labels. To explore the configuration of these two ratios, we conduct a joint grid search within the ranges of $\eta_1 \in [0.80, 1.00]$ and $\eta_0 \in [0.90, 0.99]$ under the 10% labeled data setting. The experimental results are detailed in Table 4.
As shown in Table 4, the best detection performance (36.3%) of the proposed method is achieved when $\eta_1 = 0.90$ and $\eta_0 = 0.97$. These optimal settings enable the proposed method to select more reliable pseudo-labels.

4.3.2. Class Distribution Fusion Weight

The parameter $\beta$ serves as a balancing factor between the class distribution probabilities derived from labeled data ($\gamma_k^{label}$) and those derived from unlabeled data ($\gamma_k^{clip}$). We conduct a sensitivity analysis by varying $\beta$ from 0.0 to 1.0 with a step size of 0.2 under the 10% labeled data setting.
As shown in Table 5, the detection performance of the proposed method follows an inverted U-shaped trend, indicating that using only the labeled or only the unlabeled class distribution probability is not the best choice. The best detection performance (36.3% mAP50) is achieved at $\beta = 0.6$, demonstrating that combining labeled and unlabeled class distribution probabilities provides the most accurate fused class distribution probability, which enables better classification-confidence thresholds for each class.

4.4. Comparison with the State-of-the-Art Methods

To comprehensively evaluate our method, we compare it with state-of-the-art two-stage, one-stage, and Transformer-based semi-supervised detectors under identical settings. As shown in Table 6, our method achieves the highest mAP50 across all labeled data ratios. Under the 10% labeled setting, our method reaches 36.3% mAP50, outperforming PseCo and Semi-DETR by 8.5% and 6.9%, respectively. This improvement mainly comes from the combination of the three proposed modules: the CD-DPLS, GFF-PR, and AWFL. As the labeled data ratio increases, the detection performance of the proposed method gradually improves. Regarding inference speed, our method achieves 45.8 FPS, outperforming the two-stage and Transformer-based models. Although our method is slower than the original Efficient Teacher (50.5 FPS) due to the computational overhead of the CD-DPLS, GFF-PR, and AWFL, the mAP50 gain (from 23.2% to 36.3%) justifies the trade-off between efficiency and accuracy.

4.5. Visualization

To demonstrate the effectiveness of the proposed semi-supervised detection method, we select several representative cases of small objects and tail classes for qualitative comparison. Figure 6 presents the detection results of our method, the ground truth, and several semi-supervised methods: Semi-DETR and PseCo (the best-performing end-to-end and two-stage methods, respectively), as well as our baseline, Efficient Teacher, implemented on both YOLOv5 and YOLOv11.
(1) Small-object detection: As shown in the distant traffic sign cases, small targets occupy only a few pixels in the image and are therefore easily ignored by deep networks. Specifically, Efficient Teacher and PseCo miss small objects, and all the comparison methods except ours also produce false detections. In contrast, with the help of the GFF-PR, our method better recovers small objects by exploiting the fused features.
(2) Tail-class detection: As shown in Figure 6, all the comparison methods except ours misclassify the tail-class objects as other classes. Furthermore, PseCo also exhibits missed detections for tail-class objects. In contrast, the CD-DPLS retains more pseudo-labels for tail classes by dynamically adjusting the classification-confidence thresholds according to the fused class distribution probability, while the AWFL mitigates the issues caused by the imbalanced class distribution. As shown in the last row of Figure 6, our proposed method predicts the correct classes with higher confidence for tail-class objects such as “pr30”, “w45” and “w60”.
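The adaptive weighting of the AWFL mentioned above, where each pseudo-label is weighted by the ratio of its classification confidence to its class-specific threshold, can be sketched as follows. The standard focal-loss form and the focusing parameter gamma_f are assumptions for illustration; only the confidence-to-threshold ratio follows the description in the text.

```python
import math

# Illustrative sketch of the Adaptive-Weight Focal Loss (AWFL) idea.
# The focal term follows the standard focal loss; the adaptive weight
# w = confidence / class_threshold is the mechanism described in the text.
# All other details are assumptions, not the paper's exact formulation.

def awfl_term(p, confidence, class_threshold, gamma_f=2.0):
    """Loss contribution of one pseudo-labeled positive sample.

    p               -- predicted probability for the pseudo-label class
    confidence      -- teacher's classification confidence for the pseudo-label
    class_threshold -- class-specific threshold produced by the CD-DPLS
    """
    w = confidence / class_threshold                # adaptive pseudo-label weight
    focal = -((1.0 - p) ** gamma_f) * math.log(p)   # standard focal-loss term
    return w * focal

# A confident pseudo-label from a tail class (low threshold) is weighted up.
print(awfl_term(p=0.8, confidence=0.85, class_threshold=0.6))
```

Because tail classes receive lower thresholds, their pseudo-labels obtain larger ratios and thus larger loss weights, counteracting the imbalanced class distribution.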

5. Conclusion

In this paper, we propose a semi-supervised traffic sign detection method to address two challenges in traffic scenes: imbalanced class distributions and missed small-object detections. The proposed CD-DPLS combines class distribution probabilities from labeled and unlabeled data to dynamically adjust pseudo-label selection, improving detection performance on tail classes. In addition, the GFF-PR improves the detection of small objects by leveraging fused features to re-evaluate ambiguous proposals. Finally, we introduce the AWFL, which assigns adaptive weights to traffic sign classes, to mitigate the issues caused by the imbalanced class distributions.
Experiments show that the proposed method outperforms existing state-of-the-art semi-supervised detection methods while maintaining a real-time inference speed of 45.8 FPS, demonstrating a good balance between detection accuracy and efficiency. The experimental results also suggest that the proposed method is an effective solution for real-world intelligent transportation systems.

Author Contributions

Conceptualization, Y.S. and G.Y.; methodology, C.X., Y.S. and M.C.; software, C.X.; validation, C.X.; formal analysis, C.X.; investigation, C.X.; resources, G.Y.; writing—original draft preparation, C.X.; writing—review and editing, Y.S. and M.C.; visualization, C.X.; supervision, Y.S., G.Y. and M.C.; project administration, C.X.; funding acquisition, Y.S. and G.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This research was partially supported by the National Natural Science Foundation of China (Grant No. 61671255), Nantong Social Livelihood Science and Technology Plan (Grant No. MS2024025), Jiangsu Province’s “Qinglan Project” for Middle-Aged and Young Academic Leaders (2023), and the Postgraduate Research & Practice Innovation Program of Jiangsu Province (Grant No. SJCX25_2016).

Institutional Review Board Statement

Not applicable.

Data Availability Statement

The data presented in this study are available on request from the corresponding author. The data are not publicly available due to privacy restrictions.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Møgelmose, A.; Trivedi, M.M.; Moeslund, T.B. Vision-based traffic sign detection and analysis for intelligent driver assistance systems: Perspectives and survey. IEEE Trans. Intell. Transp. Syst. 2012, 13, 1484–1497. [Google Scholar] [CrossRef]
  2. Sun, H.; Wang, R.; Li, Y.; et al. SET: Spectral enhancement for tiny object detection. Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR) 2025, 4713–4723. [Google Scholar]
  3. Sun, H.; Li, Y.; Yang, L.; et al. Uncertainty-aware gradient stabilization for small object detection. Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV) 2025, 8407–8417. [Google Scholar]
  4. Wang, Y.; Bochkovskiy, A.; Liao, H.-Y.M. YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. arXiv 2022, arXiv:2207.02696. [Google Scholar]
  5. Yang, C.; Zhuang, K.; Chen, M.; et al. Traffic sign interpretation via natural language description. IEEE Trans. Intell. Transp. Syst. 2024, 25, 18939–18953. [Google Scholar] [CrossRef]
  6. Tan, M.; Pang, R.; Le, Q.V. EfficientDet: Scalable and efficient object detection. Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR) 2020, 10781–10790. [Google Scholar]
  7. Yang, W.; Wang, C.; Zhang, T.; et al. SA3Det++: Side-aware quality estimation for semi-supervised 3D object detection. IEEE Trans. Pattern Anal. Mach. Intell. 2025, 47, 10664–10679. [Google Scholar] [CrossRef]
  8. Shehzadi, T.; Hashmi, K.A.; Sarode, S.; et al. STEP-DETR: Advancing DETR-based semi-supervised object detection with super teacher and pseudo-label guided text queries. Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV) 2025, 3069–3079. [Google Scholar]
  9. Chen, C.; Han, J.; Debattista, K. Virtual category learning: A semi-supervised learning method for dense prediction with extremely limited labels. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 46, 5595–5611. [Google Scholar] [CrossRef]
  10. Luo, Y.; Zhu, J.; Li, M.; et al. Smooth neighbors on teacher graphs for semi-supervised learning. Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR) 2018, 8896–8905. [Google Scholar]
  11. Yang, X.; Li, P.; Zhou, Q.; et al. Dense information learning based semi-supervised object detection. IEEE Trans. Image Process. 2025, 34, 1022–1035. [Google Scholar] [CrossRef]
  12. Zeng, X.; Liu, X.; Xiang, X. Confidence-weighted teacher: Semi-supervised object detection based on confidence correction. Pattern Recognit. Comput. Vis. (PRCV), LNCS 2025, 15043, 1–15. [Google Scholar]
  13. Zhang, B.; Wang, Z.; Du, B. Boosting semi-supervised object detection in remote sensing images with active teaching. IEEE Geosci. Remote Sens. Lett. 2024, 21, 1–5. [Google Scholar] [CrossRef]
  14. Huang, T.-K.; Yeh, M.-C. Improving semi-supervised object detection by ROI-enhanced contrastive learning. APSIPA ASC 2024, 1–6. [Google Scholar]
  15. Zhang, R.; Xu, C.; Xu, F.; et al. S3OD: Size-unbiased semi-supervised object detection in aerial images. ISPRS J. Photogramm. Remote Sens. 2025, 221, 179–192. [Google Scholar] [CrossRef]
  16. Tran, P.V. SimLTD: Simple supervised and semi-supervised long-tailed object detection. Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR) 2025, 4672–4681. [Google Scholar]
  17. Yang, X.; Song, Z.; King, I.; et al. A survey on deep semi-supervised learning. IEEE Trans. Knowl. Data Eng. 2023, 35, 8934–8954. [Google Scholar] [CrossRef]
  18. Wang, C.; Xu, C.; Li, X.; et al. Multi-clue consistency learning to bridge gaps between general and oriented objects in semi-supervised detection. Proc. AAAI 2025, 7582–7590. [Google Scholar] [CrossRef]
  19. Zhao, T.; Fang, Q.; Shi, S.; et al. Density-guided dense pseudo-label selection for semi-supervised oriented object detection. Proc. IEEE Int. Conf. Image Process. (ICIP) 2024, 1092–1098. [Google Scholar]
  20. Shehzadi, T.; Hashmi, K.A.; Stricker, D.; et al. Sparse Semi-DETR: Sparse learnable queries for semi-supervised object detection. Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR) 2024, 5840–5850. [Google Scholar]
  21. Cao, F.; Yan, K.; Chen, H.; et al. SSCD-YOLO: Semi-supervised cross-domain YOLOv8 for pedestrian detection in low-light conditions. IEEE Access 2025, 13, 61225–61236. [Google Scholar] [CrossRef]
  22. Chen, S.; Zhang, Z.; Zhang, L.; et al. A semi-supervised learning framework combining CNN and multiscale transformer for traffic sign detection and recognition. IEEE Internet Things J. 2024, 11, 19500–19519. [Google Scholar] [CrossRef]
  23. Yang, Y.; Luo, H.; Xu, H.; et al. Towards real-time traffic sign detection and classification. IEEE Trans. Intell. Transp. Syst. 2016, 17, 2022–2031. [Google Scholar] [CrossRef]
  24. Yuan, X.; Hao, X.; Chen, H.; et al. Robust traffic sign recognition based on color global and local oriented edge magnitude patterns. IEEE Trans. Intell. Transp. Syst. 2014, 15, 1466–1477. [Google Scholar] [CrossRef]
  25. Liu, C.; Chang, F.; Chen, Z.; et al. Fast traffic sign recognition via high-contrast region extraction and extended sparse representation. IEEE Trans. Intell. Transp. Syst. 2016, 17, 79–92. [Google Scholar] [CrossRef]
  26. Chen, S.; Zhang, Z.; Ma, H.; et al. A content-adaptive hierarchical deep learning model for detecting arbitrary oriented road surface elements using MLS point clouds. IEEE Trans. Geosci. Remote Sens. 2023, 61, 1–16. [Google Scholar] [CrossRef]
  27. Wang, J.; Chen, Y.; Dong, Z.; et al. Improved YOLOv5 network for real-time multi-scale traffic sign detection. Neural Comput. Appl. 2023, 35, 7853–7865. [Google Scholar] [CrossRef]
  28. Manzari, O.N.; Boudesh, A.; Shokouhi, S.B. Pyramid transformer for traffic sign detection. Int. Conf. Comput. Knowl. Eng. 2022, 112–116. [Google Scholar]
  29. Vaswani, A.; Shazeer, N.; Parmar, N.; et al. Attention is all you need. Adv. Neural Inf. Process. Syst. (NeurIPS) 2017, 5998–6008. [Google Scholar]
  30. Wang, G.; Zhou, K.; Wang, L.; et al. Context-aware and attention-driven weighted fusion traffic sign detection network. IEEE Access 2023, 11, 42104–42112. [Google Scholar] [CrossRef]
  31. Zhang, J.; Xie, Z.; Sun, J.; et al. A cascaded R-CNN with multiscale attention and imbalanced samples for traffic sign detection. IEEE Access 2020, 8, 29742–29754. [Google Scholar] [CrossRef]
  32. Sohn, K.; Zhang, Z.; Li, C.L.; et al. A simple semi-supervised learning framework for object detection. arXiv 2020, arXiv:2005.04757. [Google Scholar] [CrossRef]
  33. Liu, Y.-C.; Ma, C.-Y.; He, Z.; et al. Unbiased teacher for semi-supervised object detection. arXiv 2021, arXiv:2102.09480. [Google Scholar] [CrossRef]
  34. Tang, Y.; Chen, W.; Luo, Y.; et al. Humble teachers teach better students for semi-supervised object detection. Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR) 2021, 3132–3141. [Google Scholar]
  35. Xu, M.; Zhang, Z.; Hu, H.; et al. End-to-end semi-supervised object detection with soft teacher. Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV) 2021, 3060–3069. [Google Scholar]
  36. Zhang, J.; Lin, X.; Zhang, W.; et al. Semi-DETR: Semi-supervised object detection with detection transformers. Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR) 2023, 23809–23818. [Google Scholar]
  37. Wang, P.; Cai, Z.; Yang, H.; et al. Omni-DETR: Omni-Supervised Object Detection with Transformers. Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR) 2022, 9367–9376. [Google Scholar]
  38. Li, G.; Li, X.; Wang, Y.; et al. PseCo: Pseudo labeling and consistency training for semi-supervised object detection. arXiv 2022, arXiv:2203.16348. [Google Scholar]
  39. Xu, B.; Chen, M.; Guan, W.; et al. Efficient Teacher: Semi-supervised object detection for YOLOv5. arXiv 2023, arXiv:2302.07577. [Google Scholar] [CrossRef]
  40. Liu, Y.C.; Ma, C.Y.; Kira, Z. Unbiased Teacher v2: Semi-supervised object detection for anchor-free and anchor-based detectors. Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR) 2022, 9819–9828. [Google Scholar]
  41. Luo, G.; Zhou, Y.; Jin, L.; et al. Towards end-to-end semi-supervised learning for one-stage object detection. arXiv 2023, arXiv:2306.00930. [Google Scholar]
  42. Zhou, H.; Ge, Z.; Liu, S.; et al. Dense Teacher: Dense pseudo-labels for semi-supervised object detection. arXiv 2022, arXiv:2206.07246. [Google Scholar]
Figure 1. Imbalanced class distribution in the traffic sign dataset. The horizontal axis shows the classes of traffic signs, and the vertical axis shows the number of traffic sign instances, revealing a severe imbalance between head classes and tail classes.
Figure 2. The overall architecture of the proposed semi-supervised traffic sign detection framework. The framework is based on the Efficient Teacher paradigm using YOLOv11 as the baseline. It contains two parallel streams: the teacher model generates pseudo-labels from weakly augmented unlabeled data, while the student model learns from labeled and unlabeled data under stronger augmentation. The framework includes three key components: the Class-Distribution-based Dynamic Pseudo-Label Selection module (CD-DPLS), which dynamically sets class-specific pseudo-label classification-confidence thresholds; the Gated-Feature-Fusion-based Proposal Refinement strategy (GFF-PR), which fuses multi-scale features to refine proposals and recover small traffic signs; and the Adaptive-Weight Focal Loss (AWFL), which mitigates the issues caused by the imbalanced class distributions.
Figure 3. Schematic illustration of the Class-Distribution-based Dynamic Pseudo-Label Selection module (CD-DPLS). The CD-DPLS combines the class distribution from labeled data with the class distribution estimated by CLIP from unlabeled data. By using the fused class distribution, the module dynamically adjusts the classification-confidence threshold for each class.
Figure 4. Detailed architecture of the Gated Feature Fusion module. This module builds a fused feature pyramid by adaptively combining features from different scales. Specifically, it takes the original feature map F_i and the downsampled feature F_{i-1}^down as inputs, and feeds the concatenated result into a lightweight gating network, composed of Global Average Pooling and a Multi-Layer Perceptron, to generate the fusion weight λ_i. The final fused feature F_i^fused is then obtained through weighted fusion. In this way, the module balances spatial details from the original feature with stronger semantic information from the downsampled feature, making it more effective for small traffic signs.
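The gating mechanism in the Figure 4 caption (Global Average Pooling, an MLP, and a weighted combination of the two feature maps) can be sketched in NumPy as follows. The tensor shapes, the MLP sizes, and the assumption that the downsampled feature has already been resized to match F_i are illustrative; only the GAP → MLP → gate → weighted-fusion pipeline follows the caption.

```python
import numpy as np

# NumPy sketch of the gated feature fusion described for Figure 4.
# Shapes and MLP widths are illustrative assumptions, not the paper's values.

rng = np.random.default_rng(0)

def gap(x):
    """Global average pooling over the spatial dimensions of a (C, H, W) map."""
    return x.mean(axis=(1, 2))                      # -> (C,)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gated_fusion(f_i, f_down, w1, w2):
    """Fuse the original-scale feature f_i with the downsampled feature f_down
    using a gate computed from their concatenated pooled descriptors."""
    g = np.concatenate([gap(f_i), gap(f_down)])     # (2C,) pooled descriptor
    h = np.maximum(w1 @ g, 0.0)                     # MLP hidden layer (ReLU)
    lam = sigmoid(w2 @ h)                           # scalar fusion weight in (0, 1)
    return lam * f_i + (1.0 - lam) * f_down, lam

c, height, width = 8, 16, 16
f_i = rng.standard_normal((c, height, width))
f_down = rng.standard_normal((c, height, width))    # assumed resized to f_i's shape
w1 = rng.standard_normal((4, 2 * c)) * 0.1          # hidden width 4 (illustrative)
w2 = rng.standard_normal((4,)) * 0.1
f_fused, lam = gated_fusion(f_i, f_down, w1, w2)
print(f_fused.shape, float(lam))
```

The learned gate lets the network decide, per feature level, how much spatial detail versus semantic context to keep, which is what makes the fused pyramid helpful for small signs.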
Figure 5. Visual comparison of different components in the proposed method. As the components are gradually introduced, the proposed method shows improvements in tail-class detection, distant small-object detection, as well as prediction confidence.
Figure 6. Visual comparison of the proposed method with state-of-the-art semi-supervised methods. The figure shows the ground truth and the detection results of Semi-DETR, PseCo, Efficient Teacher (YOLOv5 and YOLOv11), and our method on representative small-object and tail-class cases.
Table 1. Ablation study on the effectiveness of different components under the 10% labeled data setting. “CD-DPLS” denotes the Class-Distribution-based Dynamic Pseudo-Label Selection module, “GFF-PR” denotes the Gated-Feature-Fusion-based Proposal Refinement strategy, and “AWFL” denotes the Adaptive-Weight Focal Loss.
CD-DPLS GFF-PR AWFL mAP50 mAP50:95
23.2% 12.8%
27.9% 17.5%
27.5% 17.9%
25.7% 19.2%
28.8% 19.8%
32.1% 20.4%
Table 2. Performance comparison of different feature scales. We compare the detection accuracy and inference speed utilizing strictly down-sampled features (F^down), original-scale features (F), and the proposed fused features (F^fused).
Feature Level mAP50 AP_S AP_M AP_L FPS
F^down 32.8% 8.9% 22.1% 36.1% 56
F 34.1% 12.4% 24.5% 33.5% 52
F^fused 36.3% 15.4% 25.8% 36.5% 46
Table 3. Performance comparison of different pseudo-label selection strategies. We compare the detection accuracy on head and tail classes, as well as the dynamic thresholding process from initial candidates to final valid pseudo-labels, using fixed thresholds (0.9, 0.5) and the proposed CD-DPLS. “All” denotes all classes in the dataset, “Head” the head classes, and “Tail” the tail classes.
Threshold Strategy mAP50(All) mAP50(Head) mAP50(Tail) Avg. Initial Candidates Avg. Pseudo-labels
0.9 (Fixed) 30.5% 58.2% 15.4% 138.6 20.5
0.5 (Fixed) 33.8% 52.1% 20.6% 162.4 45.8
CD-DPLS (Dynamic) 36.3% 60.5% 24.8% 145.2 32.4
Table 4. Sensitivity analysis of positive- and negative-sample reliability ratios. The table reports the mAP50 performance under varying combinations of η_1 and η_0.
η_1 \ η_0 0.90 0.93 0.95 0.97 0.99
0.80 33.8% 34.2% 34.5% 34.2% 33.9%
0.85 34.2% 34.6% 34.9% 34.9% 34.5%
0.90 34.5% 34.9% 35.1% 36.3% 35.3%
0.95 34.3% 34.7% 34.8% 35.2% 34.9%
1.00 33.9% 34.3% 34.5% 34.7% 34.4%
Table 5. Sensitivity analysis of the class distribution fusion weight β in the CD-DPLS. β = 0 represents using only class distribution probabilities derived from unlabeled data, while β = 1 represents using only those derived from labeled data.
β Value 0.0 0.2 0.4 0.6 0.8 1.0
mAP50 33.9% 35.6% 36.1% 36.3% 35.8% 35.1%
Table 6. Comparison with state-of-the-art semi-supervised object detection methods on the traffic sign dataset. The results are reported using mAP50 with 1%, 2%, 5%, and 10% labeled data ratios. FPS indicates the inference speed on the same hardware. Our method achieves superior accuracy across different labeled data ratios while maintaining real-time performance.
Category Method 1% 2% 5% 10% FPS
End-to-end Omni-DETR [37] 9.07% 15.1% 20.1% 28.7% 10.7
Semi-DETR [36] 9.72% 16.2% 21.5% 29.4% 9.1
Two-stage STAC [32] 5.54% 7.36% 12.9% 21.2% 14.4
Unbiased Teacher [33] 6.81% 12.5% 16.7% 24.5% 16.8
PseCo [38] 8.04% 14.8% 19.1% 27.8% 15.7
Humble Teacher [34] 6.48% 9.31% 15.2% 24.5% 17.8
One-stage Efficient Teacher (YOLOv5) [39] 7.33% 12.5% 15.7% 23.2% 50.5
Efficient Teacher (YOLOv11) 8.58% 14.7% 19.4% 27.8% 49.3
Unbiased Teacher v2 [40] 7.19% 9.82% 16.2% 23.5% 48.8
One Teacher [41] 7.47% 11.4% 15.8% 24.7% 46.7
Dense Teacher [42] 8.03% 12.8% 16.9% 25.1% 48.6
Ours 11.5% 18.9% 26.6% 36.3% 45.8
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
Copyright: This open access article is published under a Creative Commons CC BY 4.0 license, which permits free download, distribution, and reuse, provided that the author and preprint are cited in any reuse.