Figure 1.
Initial quality-assurance example reproduced from Shih et al. [12], showing multiple radiologists’ bounding-box annotations on the same pneumonia case during the shared calibration stage. The figure highlights both overlap and variation in box placement, illustrating the expert-driven annotation process underlying the RSNA bounding-box dataset.
Figure 2.
Example chest X-ray images with bounding box annotations. The images illustrate both localized and multifocal pneumonia patterns.
Figure 3.
Aggregated bounding box location heatmap showing the spatial distribution of pneumonia annotations across the dataset. High-density regions correspond to the lung fields, particularly in mid-to-lower zones.
Figure 4.
Schematic illustration of the three pneumonia severity levels used in this study. The severity groups are defined from the total annotated disease area per chest X-ray image, with Severity 1 corresponding to low/mild burden ( px²), Severity 2 to moderate burden (– px²), and Severity 3 to severe burden ( px²). The visual opacities are illustrative and represent increasing radiographic burden from localized involvement to widespread disease.
Figure 5.
Count and proportion of images in the three derived severity categories. The optimized thresholding procedure yields an almost perfectly balanced distribution across Severity 1, Severity 2, and Severity 3.
Figure 6.
Histogram of total disease area per pneumonia-positive chest X-ray in the RSNA 2018 dataset. For each image, total burden was computed by summing the areas of all annotated lesion bounding boxes. The right-skewed distribution indicates that most positive cases have relatively limited annotated extent, while a smaller subset exhibits substantially larger radiographic burden.
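As a minimal sketch of the burden computation described in this caption, the snippet below sums bounding-box areas per image. The tuple layout `(image_id, width, height)`, the function name `total_area_per_image`, and the example values are assumptions made for illustration; overlapping boxes are simply added, since the caption describes a plain sum.

```python
from collections import defaultdict

def total_area_per_image(boxes):
    """Sum bounding-box areas (width * height, in px^2) per image ID."""
    totals = defaultdict(float)
    for image_id, width, height in boxes:
        totals[image_id] += width * height
    return dict(totals)

# Hypothetical annotations: (image_id, box width, box height) in pixels.
example_boxes = [
    ("img_a", 200, 300),
    ("img_a", 150, 100),
    ("img_b", 400, 500),
]
areas = total_area_per_image(example_boxes)
```

Images without any box never appear in the result, which matches the restriction to pneumonia-positive cases.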
Figure 7.
Distribution of total disease area in pneumonia-positive RSNA 2018 chest X-rays, stacked by sex. The shaded regions indicate the three final severity intervals derived from the optimized rank-based split: Severity 1 (low / mild), Severity 2 (moderate), and Severity 3 (severe).
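A rank-based three-tier split can be sketched as an equal-frequency assignment over the sorted disease areas. This is only an illustration under simplifying assumptions: the function name `rank_based_tiers` and the example areas are hypothetical, and the optimization used in the study (which also takes the sex distribution into account) is not reproduced here.

```python
def rank_based_tiers(areas, n_tiers=3):
    """Assign tiers 1..n_tiers by rank so that group sizes are as equal as possible."""
    order = sorted(areas, key=areas.get)  # image IDs in ascending order of total area
    n = len(order)
    tiers = {}
    for rank, image_id in enumerate(order):
        # Integer arithmetic maps ranks 0..n-1 onto tiers 1..n_tiers.
        tiers[image_id] = rank * n_tiers // n + 1
    return tiers

# Hypothetical total disease areas in px^2.
example_areas = {"a": 100, "b": 5000, "c": 900, "d": 20000, "e": 300, "f": 7000}
tiers = rank_based_tiers(example_areas)
```

Because the assignment depends only on rank, the resulting tiers are monotonic in disease area by construction.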
Figure 8.
Distribution of total disease area across the three severity categories, shown using boxplots and violin plots. Both visualizations confirm the expected monotonic increase in burden from Severity 1 to Severity 3.
Figure 9.
Empirical cumulative distribution functions (ECDFs) of total disease area for the three severity categories. The curves show the within-grade distribution of lesion burden and confirm the ordinal separation between low / mild, moderate, and severe groups.
Figure 10.
Sex distribution within each severity category after the optimized three-tier split. Males are more frequent than females in all tiers, but the optimization avoids extreme sex imbalance while preserving near-equal class sizes.
Figure 11.
Dataset split used in this study. From the curated RSNA 2018 cohort employed in the pipeline, 8,851 studies were retained in the Normal category, while pneumonia-positive cases were divided into three severity groups: Severity 1 (1,999 images), Severity 2 (2,006 images), and Severity 3 (2,007 images). A stratified class-wise split was then applied independently within each category, assigning 70% of the data to training, 15% to validation, and 15% to testing. The held-out test subset is not used during optimization, but is reserved for later inference-based evaluation in order to simulate a more realistic clinical deployment scenario.
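The class-wise 70/15/15 partition can be sketched as follows. The class sizes are taken from the caption; the shuffling seed, the rounding rule (truncation, with the remainder assigned to the test subset), and the function name `split_class` are assumptions, so the exact subset sizes may differ from those used in the study.

```python
import random

def split_class(items, train_frac=0.70, val_frac=0.15, seed=0):
    """Shuffle one class and cut it into train/val/test partitions."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n_train = int(len(items) * train_frac)
    n_val = int(len(items) * val_frac)
    return {
        "train": items[:n_train],
        "val": items[n_train:n_train + n_val],
        "test": items[n_train + n_val:],  # remainder, never used during optimization
    }

# Class sizes as reported in the caption; integer IDs stand in for image files.
class_sizes = {"Normal": 8851, "Severity 1": 1999, "Severity 2": 2006, "Severity 3": 2007}
splits = {name: split_class(range(size)) for name, size in class_sizes.items()}
```

Splitting each class independently keeps the class proportions approximately equal across the three subsets.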
Figure 12.
Budget-aware ViT configuration exploration. Although many architectural and optimization combinations were possible, the search was reduced to 24 strategically selected runs: four ViT backbones, three learning rates, and two weight-decay values. All other training settings were fixed to enable fair comparison under realistic computational constraints.
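The 24-run grid (4 backbones × 3 learning rates × 2 weight-decay values) can be enumerated as in the sketch below. The backbone names and weight-decay values follow the exploration tables; the three learning rates are placeholders, because their values are not stated in this caption.

```python
from itertools import product

backbones = ["ViT-B/16", "ViT-L/16", "ViT-B/32", "ViT-L/32"]
# Placeholder learning rates (assumption): the actual values are not given here.
learning_rates = [1e-6, 1e-5, 1e-4]
weight_decays = [0.05, 0.01]

# One dictionary per run, numbered from 1 in backbone-major order.
runs = [
    {"run": i, "model": model, "lr": lr, "weight_decay": wd}
    for i, (model, lr, wd) in enumerate(
        product(backbones, learning_rates, weight_decays), start=1
    )
]
```

With this ordering, each backbone occupies six consecutive runs, and the weight decay alternates between 0.05 and 0.01 within each learning rate.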
Figure 13.
Overview of the Vision Transformer (ViT) pipeline used in this study. The workflow starts from the RSNA 2018 chest X-ray dataset reorganized into a severity-aware JPEG folder structure, followed by a stratified class-wise split into training, validation, and testing subsets. An automated neural architecture search (NAS) procedure is then used to explore multiple ViT architectures and hyperparameter combinations, including model variant, image resolution, batch size, learning rate, weight decay, head design, optimizer, scheduler, and mixed-precision settings. The pipeline supports both the four-class severity-aware setting and specialized binary branches comparing normal cases against each severity subgroup independently.
Figure 14.
Specialized binary Vision Transformer pipeline used for the Severity 1 branch. The workflow takes only Normal and Pneumonia Severity 1 (low/mild) chest X-ray images as input, applies preprocessing and augmentation, and trains a ViT-B/16 model at resolution for binary classification. The output layer produces softmax probabilities for the two classes, allowing the model to distinguish subtle low/mild pneumonia cases from normal chest radiographs.
Figure 15.
Overview of the hardware and software environment used for model training. Experiments were conducted on Runpod.io using a PyTorch 2.8.0 environment with one NVIDIA RTX 5090 GPU (32 GB VRAM), 60 GB RAM, 12 vCPUs, and 80 GB disk space.
Figure 16.
Qualitative examples from the generated three-tier severity split. Bounding boxes were overlaid with OpenCV on a debugging-oriented dataset created for manual inspection. From Severity 1 (low/mild) to Severity 3 (severe), the annotated boxes generally increase in size and/or number, while the visible pneumonia-related white opacities also become progressively more extensive and pronounced.
Figure 17.
Comparison of final and maximum validation accuracy across the 24 Severity 1 ViT exploration runs. Results are grouped by backbone and configuration. The strongest validation performance was obtained by ViT-B/16 using a learning rate of and weight decay of 0.01.
Figure 18.
Training duration comparison across the 24 Severity 1 ViT exploration runs. ViT-L/16 required substantially longer training time than the other backbones, while ViT-B/32 was the fastest model family but achieved lower validation accuracy than ViT-B/16.
Figure 19.
Training and validation accuracy curves for the selected Severity 1 ViT-B/16 model. The best validation checkpoint was reached at epoch 39, with a maximum validation accuracy of 94.98%, while the final validation accuracy was 94.42%.
Figure 20.
Training and validation loss curves for the selected Severity 1 ViT-B/16 model. The training loss decreases throughout the training schedule, while the validation loss shows signs of saturation and later instability, suggesting mild overfitting after the best validation checkpoint.
Figure 21.
Validation macro-precision, macro-recall, and macro-F1 evolution for the selected Severity 1 ViT-B/16 model. These metrics provide a class-balanced view of model behavior beyond accuracy alone.
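For reference, macro-averaged metrics can be derived from a confusion matrix as sketched below. The matrix values are hypothetical, and macro-F1 is computed here as the unweighted mean of the per-class F1 scores, which is one common convention and may not match the exact implementation used in the study.

```python
def macro_metrics(confusion):
    """Return (macro-precision, macro-recall, macro-F1).

    confusion[i][j] is the number of samples of true class i predicted as class j.
    """
    n = len(confusion)
    precisions, recalls, f1s = [], [], []
    for k in range(n):
        tp = confusion[k][k]
        predicted_k = sum(confusion[i][k] for i in range(n))  # column sum
        actual_k = sum(confusion[k])                          # row sum
        p = tp / predicted_k if predicted_k else 0.0
        r = tp / actual_k if actual_k else 0.0
        f1 = 2 * p * r / (p + r) if (p + r) else 0.0
        precisions.append(p)
        recalls.append(r)
        f1s.append(f1)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n

# Hypothetical binary confusion matrix: rows = true (normal, pneumonia).
example_confusion = [[90, 10], [20, 80]]
macro_p, macro_r, macro_f1 = macro_metrics(example_confusion)
```

Each class contributes equally to the average regardless of its size, which is why these metrics are less sensitive to the larger Normal class than plain accuracy.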
Figure 22.
Class-specific validation accuracy for the selected Severity 1 ViT-B/16 model. The curves compare validation performance on the normal class and the low/mild pneumonia class.
Figure 23.
Training and validation confusion matrices for the selected Severity 1 ViT-B/16 model at the final epoch. The model achieves very strong training-set separation, while the validation matrix shows that most errors occur when Severity 1 low/mild pneumonia images are predicted as normal.
Figure 24.
Comparison of final and maximum validation accuracy across the 24 Severity 2 ViT exploration runs. Results are grouped by backbone and configuration. Several configurations reached the highest maximum validation accuracy of 97.88%, while ViT-B/16 achieved the best final validation accuracy among the top-performing runs.
Figure 25.
Training duration comparison across the 24 Severity 2 ViT exploration runs. ViT-L/16 required substantially longer training time than the other backbones, while ViT-B/16 provided the best balance between validation accuracy and computational efficiency.
Figure 26.
Training and validation accuracy curves for the selected Severity 2 ViT-B/16 model. The best validation checkpoint was reached at epoch 13, with a maximum validation accuracy of 97.88%, while the final validation accuracy was 97.84%.
Figure 27.
Training and validation loss curves for the selected Severity 2 ViT-B/16 model. The model converges rapidly, with validation loss stabilizing after the early epochs, indicating stable generalization behavior for the moderate-severity classification branch.
Figure 28.
Validation macro-precision, macro-recall, and macro-F1 evolution for the selected Severity 2 ViT-B/16 model. The curves show strong balanced validation performance across the normal and moderate-pneumonia classes.
Figure 29.
Class-specific validation accuracy for the selected Severity 2 ViT-B/16 model. The curves compare validation performance on the normal class and the moderate-pneumonia class, showing strong performance for both categories.
Figure 30.
Training and validation confusion matrices for the selected Severity 2 ViT-B/16 model at the final epoch. The model shows near-perfect separation on the training set and strong generalization on the validation set, with relatively few errors in both directions.
Figure 31.
Comparison of final and maximum validation accuracy across the 24 Severity 3 ViT exploration runs. Results are grouped by backbone and configuration. The strongest validation performance was obtained by ViT-L/16, while ViT-B/16 and ViT-B/32 achieved closely competitive results at substantially lower computational cost.
Figure 32.
Training duration comparison across the 24 Severity 3 ViT exploration runs. ViT-L/16 required the longest training time, whereas ViT-B/32 was the fastest among the competitive configurations.
Figure 33.
Training and validation accuracy curves for the selected Severity 3 ViT-L/16 model. The best validation checkpoint was reached at epoch 39, with a maximum validation accuracy of 99.08%, while the final validation accuracy was 98.99%.
Figure 34.
Training and validation loss curves for the selected Severity 3 ViT-L/16 model. The model converges rapidly and maintains low validation loss, supporting the strong generalization performance observed for the severe-pneumonia branch.
Figure 35.
Validation macro-precision, macro-recall, and macro-F1 evolution for the selected Severity 3 ViT-L/16 model. The curves show strong balanced validation performance across the normal and severe-pneumonia classes.
Figure 36.
Class-specific validation accuracy for the selected Severity 3 ViT-L/16 model. The curves compare validation performance on the normal class and the severe-pneumonia class, showing strong performance for both categories.
Figure 37.
Training and validation confusion matrices for the selected Severity 3 ViT-L/16 model at the final epoch. The model achieves almost perfect separation on the training set and very strong generalization on the validation set, with only a small number of errors in either direction.
Figure 38.
Comparison of final and maximum validation accuracy across the 24 classical binary ViT exploration runs. The best validation performance was obtained by ViT-L/16 using a learning rate of and weight decay of 0.05.
Figure 39.
Training duration comparison across the 24 classical binary ViT exploration runs. ViT-L/16 required the longest training time, while ViT-B/16 provided a more efficient alternative with competitive validation accuracy.
Figure 40.
Training and validation accuracy curves for the selected classical binary ViT-L/16 model. The best validation checkpoint was reached at epoch 45, with a maximum validation accuracy of 95.05%, while the final validation accuracy was 94.72%.
Figure 41.
Training and validation loss curves for the selected classical binary ViT-L/16 model. The training loss decreases during training, while the validation loss shows fluctuations after the early convergence period, supporting validation-based checkpoint selection.
Figure 42.
Validation macro-precision, macro-recall, and macro-F1 evolution for the selected classical binary ViT-L/16 model. These metrics provide a class-balanced view of model performance beyond accuracy alone.
Figure 43.
Class-specific validation accuracy for the selected classical binary ViT-L/16 model. The curves compare validation performance on the normal class and the combined pneumonia-positive class.
Figure 44.
Size comparison of the three selected severity-specific ViT models. The Severity 1 selected model has a disk size of 989 MB (1,037,438,261 bytes), while the Severity 2 and Severity 3 selected models are considerably larger, with 3.41 GB (3,669,091,565 bytes) and 3.39 GB (3,650,217,197 bytes), respectively.
Figure 45.
Overview of the cross-severity inference experiment. Each selected severity-specific ViT model was applied to all three pneumonia validation subsets, resulting in a total of nine cross-severity inference runs.
Figure 46.
Cross-severity inference probability distributions for the selected severity-specific ViT models. Each model was applied to the pneumonia validation images from Severity 1, Severity 2, and Severity 3. The violin plots show the predicted probability of the target class of each binary model: for the Severity 1 model, for the Severity 2 model, and for the Severity 3 model. The dashed horizontal line indicates the 0.5 decision threshold.
Figure 47.
Conceptual overview of the proposed hierarchical severity-routing framework. The system evaluates the Severity 3 model first, followed by the Severity 2 and Severity 1 models only when the higher-severity probability does not exceed its corresponding threshold. If none of the severity branches reaches the decision threshold, the case can be flagged as Normal or Uncertain for additional review.
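The routing logic in this caption can be sketched as an ordered cascade in which a lower-severity branch is evaluated only if every higher-severity branch stays at or below its threshold. The function name `route_severity`, the stub models, and the 0.5 thresholds are assumptions for illustration; the stubs stand in for the trained binary ViT models and return fixed probabilities.

```python
from typing import Callable, Sequence, Tuple

# (label, probability function, decision threshold)
Branch = Tuple[str, Callable[[object], float], float]

def route_severity(image, branches: Sequence[Branch],
                   fallback: str = "Normal or Uncertain") -> str:
    """Return the first label whose predicted probability exceeds its threshold."""
    for label, predict_proba, threshold in branches:
        if predict_proba(image) > threshold:
            return label  # remaining lower-severity branches are not evaluated
    return fallback

# Hypothetical stand-ins for the three binary models; `calls` records evaluation order.
calls = []

def make_stub(name, prob):
    def predict(_image):
        calls.append(name)
        return prob
    return predict

branches = [
    ("Severity 3", make_stub("sev3", 0.20), 0.5),
    ("Severity 2", make_stub("sev2", 0.90), 0.5),
    ("Severity 1", make_stub("sev1", 0.95), 0.5),
]
decision = route_severity("chest_xray.jpg", branches)
```

In this example the Severity 3 stub falls below its threshold, the Severity 2 stub exceeds it, and the Severity 1 stub is never called, so the cascade also saves inference time on clearly severe or moderate cases.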
Figure 48.
Proposed next-step extension of the severity-aware classification framework through lung-segmentation-guided modeling.
Figure 49.
Example concept dashboard illustrating how AI-assisted chest X-ray analysis could support clinical triage and consultation workflows. The interface combines a prioritized patient queue, current-exam visualization with AI attention overlays, severity prediction, confidence scoring, and structured decision-support outputs. In this example, the system highlights a severe pneumonia case, suggests immediate review priority, provides explainability cues such as attention maps and suspicious-region highlights, and summarizes clinically relevant findings to assist radiologists and clinicians in faster review, triage prioritization, and more informed consultation.
Table 1.
Summary statistics of bounding box annotations in the RSNA Pneumonia Detection dataset.
| # | Metric | Description / Value |
|---|---|---|
| 1 | Total bounding box annotations | 9,555 annotations across all pneumonia-positive cases |
| 2 | Positive patients | 6,012 patients labeled with pneumonia |
| 3 | Negative patients | 20,672 patients labeled as normal |
| 4 | Average bounding boxes per positive patient | 1.59 bounding boxes per patient |
| 5 | Maximum bounding boxes per patient | Up to 4 bounding boxes in a single patient |
| 6 | Positive cases without bounding boxes | 0 (all positive cases include at least one annotation) |
| 7 | Negative cases with bounding boxes | 0 (no bounding boxes are assigned to negative cases) |
| 8 | Annotation consistency | Bounding boxes are exclusively associated with positive cases, indicating a consistent labeling structure |
Table 2.
Severity 1 ViT model exploration results across 24 runs. Four ViT backbones were evaluated using six learning-rate and weight-decay configurations per model. All runs used input images, batch size 8, ImageNet-pretrained initialization, AdamW optimization, AMP, and cosine annealing scheduling.
| Run | Model | Configuration | Final Train [%] | Final Val. [%] | Max Train [%] | Max Val. [%] | Best Epoch | Duration [s] |
|---|---|---|---|---|---|---|---|---|
| 1 | ViT-B/16 | LR=, WD=0.05 | 88.91 | 88.39 | 89.68 | 88.89 | 44 | 7594.0 |
| 2 | ViT-B/16 | LR=, WD=0.01 | 88.44 | 86.59 | 88.82 | 87.24 | 42 | 7625.5 |
| 3 | ViT-B/16 | LR=, WD=0.05 | 96.24 | 93.00 | 96.24 | 93.55 | 34 | 7578.8 |
| 4 | ViT-B/16 | LR=, WD=0.01 | 96.08 | 92.58 | 96.08 | 93.23 | 37 | 7606.9 |
| 5 | ViT-B/16 | LR=, WD=0.05 | 99.93 | 94.19 | 99.93 | 94.33 | 20 | 7513.6 |
| 6 | ViT-B/16 | LR=, WD=0.01 | 99.82 | 94.42 | 99.90 | 94.98 | 13 | 7483.2 |
| 7 | ViT-L/16 | LR=, WD=0.05 | 89.23 | 88.34 | 89.34 | 88.53 | 44 | 19499.1 |
| 8 | ViT-L/16 | LR=, WD=0.01 | 87.89 | 86.64 | 88.18 | 87.14 | 48 | 19442.4 |
| 9 | ViT-L/16 | LR=, WD=0.05 | 85.84 | 85.99 | 91.47 | 92.35 | 12 | 19465.4 |
| 10 | ViT-L/16 | LR=, WD=0.01 | 94.90 | 92.90 | 94.90 | 93.00 | 42 | 19425.0 |
| 11 | ViT-L/16 | LR=, WD=0.05 | 99.95 | 93.96 | 99.95 | 94.70 | 19 | 19498.0 |
| 12 | ViT-L/16 | LR=, WD=0.01 | 99.88 | 94.01 | 99.90 | 94.56 | 20 | 19540.5 |
| 13 | ViT-B/32 | LR=, WD=0.05 | 89.30 | 88.20 | 89.33 | 88.29 | 48 | 6339.7 |
| 14 | ViT-B/32 | LR=, WD=0.01 | 89.63 | 88.71 | 89.64 | 88.89 | 48 | 6911.0 |
| 15 | ViT-B/32 | LR=, WD=0.05 | 90.63 | 89.31 | 91.13 | 89.45 | 47 | 6656.7 |
| 16 | ViT-B/32 | LR=, WD=0.01 | 90.50 | 89.03 | 90.71 | 89.08 | 42 | 6362.0 |
| 17 | ViT-B/32 | LR=, WD=0.05 | 96.22 | 93.82 | 96.68 | 93.82 | 35 | 6337.4 |
| 18 | ViT-B/32 | LR=, WD=0.01 | 98.10 | 93.18 | 98.10 | 93.59 | 30 | 6239.3 |
| 19 | ViT-L/32 | LR=, WD=0.05 | 88.40 | 87.74 | 88.77 | 88.29 | 45 | 7886.5 |
| 20 | ViT-L/32 | LR=, WD=0.01 | 87.91 | 87.28 | 88.21 | 87.56 | 47 | 7911.9 |
| 21 | ViT-L/32 | LR=, WD=0.05 | 88.58 | 87.88 | 90.58 | 90.69 | 8 | 7485.0 |
| 22 | ViT-L/32 | LR=, WD=0.01 | 87.89 | 87.05 | 89.21 | 89.31 | 3 | 7468.2 |
| 23 | ViT-L/32 | LR=, WD=0.05 | 99.88 | 93.46 | 99.91 | 93.82 | 13 | 7399.4 |
| 24 | ViT-L/32 | LR=, WD=0.01 | 99.73 | 93.27 | 99.83 | 93.92 | 16 | 7645.3 |
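The 24-run layout of the exploration tables can be sketched as a small grid enumeration. This is only an illustrative sketch: the backbone order and the two weight-decay values are read from the table, but the learning-rate values are not recoverable from the text, so `LR_1`, `LR_2`, and `LR_3` are hypothetical placeholder labels, and the split into three learning rates times two weight decays is an assumption inferred from the alternating WD pattern.

```python
from itertools import product

# Backbone order and weight-decay values as they appear in the table.
BACKBONES = ["ViT-B/16", "ViT-L/16", "ViT-B/32", "ViT-L/32"]
WEIGHT_DECAYS = [0.05, 0.01]

# Hypothetical placeholders: the actual learning-rate values are not
# given in the text, so only symbolic labels are used here.
LR_LABELS = ["LR_1", "LR_2", "LR_3"]


def build_runs():
    """Enumerate runs in table order: backbone, then LR, then weight decay."""
    grid = product(BACKBONES, LR_LABELS, WEIGHT_DECAYS)
    return [
        {"run": run_id, "model": model, "lr": lr, "weight_decay": wd}
        for run_id, (model, lr, wd) in enumerate(grid, start=1)
    ]


runs = build_runs()
print(len(runs))  # 4 backbones x 3 LR labels x 2 weight decays
```

Under these assumptions, run 1 is ViT-B/16 with WD=0.05 and run 24 is ViT-L/32 with WD=0.01, matching the row order of the table.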
Table 3.
Severity 2 ViT model exploration results across 24 runs. Four ViT backbones were evaluated using six learning-rate and weight-decay configurations per model. All runs used input images, batch size 8, ImageNet-pretrained initialization, AdamW optimization, AMP, and cosine annealing scheduling.
| Run | Model | Configuration | Final Train [%] | Final Val. [%] | Max Train [%] | Max Val. [%] | Best Epoch | Duration [s] |
|---|---|---|---|---|---|---|---|---|
| 1 | ViT-B/16 | LR=, WD=0.05 | 94.12 | 94.33 | 94.12 | 94.33 | 48 | 7525.0 |
| 2 | ViT-B/16 | LR=, WD=0.01 | 93.71 | 94.01 | 93.75 | 94.06 | 45 | 7555.3 |
| 3 | ViT-B/16 | LR=, WD=0.05 | 97.89 | 97.33 | 97.96 | 97.47 | 36 | 7559.3 |
| 4 | ViT-B/16 | LR=, WD=0.01 | 99.75 | 96.68 | 99.78 | 97.33 | 20 | 7529.4 |
| 5 | ViT-B/16 | LR=, WD=0.05 | 99.99 | 97.84 | 99.99 | 97.88 | 13 | 7544.9 |
| 6 | ViT-B/16 | LR=, WD=0.01 | 99.98 | 97.33 | 100.00 | 97.84 | 13 | 7537.1 |
| 7 | ViT-L/16 | LR=, WD=0.05 | 91.75 | 91.11 | 91.75 | 91.85 | 46 | 19564.9 |
| 8 | ViT-L/16 | LR=, WD=0.01 | 92.32 | 92.26 | 92.32 | 92.77 | 42 | 19745.9 |
| 9 | ViT-L/16 | LR=, WD=0.05 | 98.81 | 97.37 | 98.88 | 97.51 | 29 | 19578.3 |
| 10 | ViT-L/16 | LR=, WD=0.01 | 90.18 | 89.96 | 95.50 | 97.14 | 11 | 19538.5 |
| 11 | ViT-L/16 | LR=, WD=0.05 | 99.95 | 97.70 | 100.00 | 97.88 | 15 | 19604.9 |
| 12 | ViT-L/16 | LR=, WD=0.01 | 99.97 | 97.79 | 99.99 | 97.88 | 2 | 19489.7 |
| 13 | ViT-B/32 | LR=, WD=0.05 | 93.12 | 93.37 | 93.12 | 93.51 | 46 | 6800.5 |
| 14 | ViT-B/32 | LR=, WD=0.01 | 93.28 | 93.37 | 93.64 | 93.55 | 45 | 6954.9 |
| 15 | ViT-B/32 | LR=, WD=0.05 | 93.83 | 94.33 | 93.85 | 94.56 | 41 | 6737.8 |
| 16 | ViT-B/32 | LR=, WD=0.01 | 94.37 | 94.98 | 94.37 | 95.12 | 48 | 6866.0 |
| 17 | ViT-B/32 | LR=, WD=0.05 | 98.48 | 97.33 | 98.48 | 97.42 | 28 | 7068.8 |
| 18 | ViT-B/32 | LR=, WD=0.01 | 99.32 | 96.96 | 99.36 | 97.60 | 28 | 6744.8 |
| 19 | ViT-L/32 | LR=, WD=0.05 | 92.87 | 92.91 | 92.87 | 93.09 | 47 | 7825.6 |
| 20 | ViT-L/32 | LR=, WD=0.01 | 93.25 | 93.09 | 93.46 | 93.64 | 47 | 7846.1 |
| 21 | ViT-L/32 | LR=, WD=0.05 | 99.52 | 96.68 | 99.59 | 97.14 | 29 | 8065.1 |
| 22 | ViT-L/32 | LR=, WD=0.01 | 99.50 | 96.78 | 99.56 | 97.19 | 29 | 8019.9 |
| 23 | ViT-L/32 | LR=, WD=0.05 | 99.97 | 97.56 | 99.97 | 97.88 | 16 | 7866.2 |
| 24 | ViT-L/32 | LR=, WD=0.01 | 99.94 | 97.42 | 99.95 | 97.65 | 19 | 7927.1 |
Table 4.
Severity 3 ViT model exploration results across 24 runs. Four ViT backbones were evaluated using six learning-rate and weight-decay configurations per model. All runs used input images, batch size 8, ImageNet-pretrained initialization, AdamW optimization, AMP, and cosine annealing scheduling.
| Run | Model | Configuration | Final Train [%] | Final Val. [%] | Max Train [%] | Max Val. [%] | Best Epoch | Duration [s] |
|---|---|---|---|---|---|---|---|---|
| 1 | ViT-B/16 | LR=, WD=0.05 | 97.25 | 97.51 | 97.52 | 97.56 | 47 | 7861.7 |
| 2 | ViT-B/16 | LR=, WD=0.01 | 96.80 | 96.59 | 96.89 | 96.82 | 46 | 7893.4 |
| 3 | ViT-B/16 | LR=, WD=0.05 | 99.80 | 98.62 | 99.84 | 98.80 | 32 | 7867.2 |
| 4 | ViT-B/16 | LR=, WD=0.01 | 99.87 | 98.66 | 99.88 | 98.76 | 32 | 7910.0 |
| 5 | ViT-B/16 | LR=, WD=0.05 | 100.00 | 98.85 | 100.00 | 98.94 | 15 | 7912.2 |
| 6 | ViT-B/16 | LR=, WD=0.01 | 100.00 | 98.80 | 100.00 | 98.94 | 9 | 7927.9 |
| 7 | ViT-L/16 | LR=, WD=0.05 | 94.14 | 94.06 | 96.44 | 96.68 | 34 | 19422.8 |
| 8 | ViT-L/16 | LR=, WD=0.01 | 97.20 | 97.05 | 97.42 | 97.05 | 48 | 19448.3 |
| 9 | ViT-L/16 | LR=, WD=0.05 | 96.18 | 95.90 | 97.25 | 98.34 | 2 | 19424.7 |
| 10 | ViT-L/16 | LR=, WD=0.01 | 94.91 | 94.98 | 98.34 | 98.48 | 21 | 19635.3 |
| 11 | ViT-L/16 | LR=, WD=0.05 | 99.97 | 98.99 | 100.00 | 99.08 | 20 | 19641.5 |
| 12 | ViT-L/16 | LR=, WD=0.01 | 99.99 | 98.85 | 99.99 | 99.03 | 5 | 19583.5 |
| 13 | ViT-B/32 | LR=, WD=0.05 | 97.31 | 97.10 | 97.46 | 97.19 | 49 | 6034.6 |
| 14 | ViT-B/32 | LR=, WD=0.01 | 96.87 | 97.24 | 97.25 | 97.37 | 43 | 6160.5 |
| 15 | ViT-B/32 | LR=, WD=0.05 | 97.88 | 97.84 | 98.02 | 97.97 | 40 | 6089.2 |
| 16 | ViT-B/32 | LR=, WD=0.01 | 97.76 | 97.74 | 98.10 | 97.79 | 48 | 6002.1 |
| 17 | ViT-B/32 | LR=, WD=0.05 | 99.74 | 98.66 | 99.77 | 98.89 | 26 | 5947.8 |
| 18 | ViT-B/32 | LR=, WD=0.01 | 99.78 | 98.85 | 99.82 | 98.89 | 31 | 5926.0 |
| 19 | ViT-L/32 | LR=, WD=0.05 | 97.24 | 97.05 | 97.24 | 97.24 | 45 | 7459.4 |
| 20 | ViT-L/32 | LR=, WD=0.01 | 96.73 | 96.55 | 96.89 | 96.64 | 44 | 7575.9 |
| 21 | ViT-L/32 | LR=, WD=0.05 | 99.82 | 98.57 | 99.91 | 98.62 | 24 | 7786.2 |
| 22 | ViT-L/32 | LR=, WD=0.01 | 99.83 | 98.43 | 99.85 | 98.71 | 24 | 7803.8 |
| 23 | ViT-L/32 | LR=, WD=0.05 | 100.00 | 98.62 | 100.00 | 98.71 | 24 | 7743.0 |
| 24 | ViT-L/32 | LR=, WD=0.01 | 99.98 | 98.71 | 99.99 | 98.76 | 15 | 7747.5 |
Table 5.
Classical binary ViT model exploration results across 24 runs. The binary task corresponds to normal versus pneumonia-positive chest X-ray classification.
| Run | Model | Configuration | Final Train [%] | Final Val. [%] | Max Val. [%] | Val. F1 Macro [%] | Best Epoch | Duration [s] |
|---|---|---|---|---|---|---|---|---|
| 1 | ViT-B/16 | LR=, WD=0.05 | 89.94 | 89.80 | 89.80 | 89.46 | 43 | 14857.0 |
| 2 | ViT-B/16 | LR=, WD=0.01 | 90.89 | 90.24 | 90.34 | 90.01 | 45 | 14907.8 |
| 3 | ViT-B/16 | LR=, WD=0.05 | 99.07 | 94.08 | 94.48 | 94.23 | 34 | 14871.0 |
| 4 | ViT-B/16 | LR=, WD=0.01 | 99.22 | 93.88 | 94.55 | 94.33 | 33 | 16564.9 |
| 5 | ViT-B/16 | LR=, WD=0.05 | 99.90 | 94.35 | 94.55 | 94.32 | 6 | 10412.8 |
| 6 | ViT-B/16 | LR=, WD=0.01 | 99.92 | 94.65 | 94.68 | 94.47 | 49 | 10409.6 |
| 7 | ViT-L/16 | LR=, WD=0.05 | 90.06 | 90.41 | 90.44 | 90.11 | 49 | 26723.4 |
| 8 | ViT-L/16 | LR=, WD=0.01 | 89.86 | 90.88 | 90.95 | 90.65 | 48 | 26607.5 |
| 9 | ViT-L/16 | LR=, WD=0.05 | 89.04 | 89.06 | 93.10 | 92.76 | 4 | 26701.1 |
| 10 | ViT-L/16 | LR=, WD=0.01 | 90.87 | 91.45 | 92.06 | 91.70 | 4 | 26696.9 |
| 11 | ViT-L/16 | LR=, WD=0.05 | 99.75 | 94.72 | 95.05 | 94.85 | 45 | 26763.1 |
| 12 | ViT-L/16 | LR=, WD=0.01 | 99.85 | 94.15 | 94.95 | 94.76 | 40 | 26745.3 |
| 13 | ViT-B/32 | LR=, WD=0.05 | 90.47 | 88.93 | 89.30 | 88.95 | 48 | 8004.9 |
| 14 | ViT-B/32 | LR=, WD=0.01 | 89.91 | 88.43 | 88.90 | 88.56 | 42 | 7951.5 |
| 15 | ViT-B/32 | LR=, WD=0.05 | 94.89 | 93.20 | 93.30 | 93.04 | 49 | 7953.7 |
| 16 | ViT-B/32 | LR=, WD=0.01 | 95.14 | 92.93 | 93.27 | 93.00 | 34 | 8019.4 |
| 17 | ViT-B/32 | LR=, WD=0.05 | 97.29 | 93.67 | 94.01 | 93.74 | 40 | 8098.2 |
| 18 | ViT-B/32 | LR=, WD=0.01 | 98.35 | 93.81 | 93.98 | 93.73 | 37 | 8121.1 |
| 19 | ViT-L/32 | LR=, WD=0.05 | 90.56 | 89.91 | 90.14 | 89.77 | 46 | 10459.6 |
| 20 | ViT-L/32 | LR=, WD=0.01 | 91.03 | 89.74 | 90.21 | 89.88 | 45 | 10273.4 |
| 21 | ViT-L/32 | LR=, WD=0.05 | 98.73 | 93.24 | 93.71 | 93.44 | 29 | 10272.8 |
| 22 | ViT-L/32 | LR=, WD=0.01 | 98.86 | 93.07 | 93.64 | 93.37 | 37 | 10252.3 |
| 23 | ViT-L/32 | LR=, WD=0.05 | 99.81 | 94.15 | 94.28 | 94.03 | 30 | 10291.8 |
| 24 | ViT-L/32 | LR=, WD=0.01 | 99.90 | 93.54 | 94.38 | 94.15 | 30 | 10329.3 |
Table 6.
Selected ViT configurations for classical binary classification and severity-specific extended training and inference. The classical binary model is included as a conventional baseline, while the Severity 1–3 models correspond to the selected branches of the proposed severity-aware framework.
| Task / Branch | Selected Model | Configuration | Final Val. Acc. [%] | Max Val. Acc. [%] | Duration [s] |
|---|---|---|---|---|---|
| Classical binary | ViT-L/16 | LR=, WD=0.05 | 94.72 | 95.05 | 26763.1 |
| Severity 1 | ViT-B/16 | LR=, WD=0.01 | 94.42 | 94.98 | 7483.2 |
| Severity 2 | ViT-B/16 | LR=, WD=0.05 | 97.84 | 97.88 | 7544.9 |
| Severity 3 | ViT-L/16 | LR=, WD=0.05 | 98.99 | 99.08 | 19641.5 |
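The selected configurations are consistent with choosing, for each task, the run with the highest maximum validation accuracy. The sketch below illustrates one such rule on the four Severity 2 runs that tie at 97.88% maximum validation accuracy (runs 5, 11, 12, and 23 of Table 3). Breaking the tie by final validation accuracy is an assumption made here for illustration; the source does not state the tie-breaking rule, and the `Run` type and `select_best` helper are hypothetical names.

```python
from typing import NamedTuple


class Run(NamedTuple):
    run: int
    model: str
    final_val: float  # final validation accuracy [%]
    max_val: float    # maximum validation accuracy [%]


# The four Severity 2 runs tied at 97.88% max validation accuracy (Table 3).
severity2_candidates = [
    Run(5, "ViT-B/16", 97.84, 97.88),
    Run(11, "ViT-L/16", 97.70, 97.88),
    Run(12, "ViT-L/16", 97.79, 97.88),
    Run(23, "ViT-L/32", 97.56, 97.88),
]


def select_best(runs):
    """Pick the run with the highest max validation accuracy.

    Assumption: ties are broken by final validation accuracy.
    """
    return max(runs, key=lambda r: (r.max_val, r.final_val))


best = select_best(severity2_candidates)
print(best.run, best.model)
```

With this rule the sketch picks run 5 (ViT-B/16), which agrees with the Severity 2 row of Table 6.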
Table 7.
Cross-severity inference summary for the selected severity-specific ViT models. For each binary model, the reported probability is the one assigned to that model's own target class: the Severity 1 class for the Severity 1 model, the Severity 2 class for the Severity 2 model, and the Severity 3 class for the Severity 3 model.
| No. | Model | Validation Set | N | Median Prob. | Prob. ≥ 0.5 [%] |
|---|---|---|---|---|---|
| 1 | Severity 1 model | Severity 1 | 400 | 1.000 | 82.0 |
| 2 | Severity 1 model | Severity 2 | 401 | 1.000 | 93.0 |
| 3 | Severity 1 model | Severity 3 | 401 | 1.000 | 96.5 |
| 4 | Severity 2 model | Severity 1 | 400 | 0.963 | 72.5 |
| 5 | Severity 2 model | Severity 2 | 401 | 0.999 | 93.5 |
| 6 | Severity 2 model | Severity 3 | 401 | 1.000 | 97.3 |
| 7 | Severity 3 model | Severity 1 | 400 | 0.420 | 48.5 |
| 8 | Severity 3 model | Severity 2 | 401 | 0.990 | 85.8 |
| 9 | Severity 3 model | Severity 3 | 401 | 0.998 | 95.0 |
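The two summary columns of Table 7 can be computed from a list of per-image target-class probabilities, as in the minimal sketch below. The function name `summarize` and the example probabilities are hypothetical and are not taken from the study; only the 0.5 threshold and the two statistics (median probability, percentage of images at or above the threshold) come from the table headers.

```python
from statistics import median


def summarize(probs, threshold=0.5):
    """Return (median probability, percentage of images with prob >= threshold)."""
    if not probs:
        raise ValueError("probs must contain at least one value")
    n_above = sum(1 for p in probs if p >= threshold)
    return median(probs), 100.0 * n_above / len(probs)


# Hypothetical target-class probabilities for four validation images.
example_probs = [0.25, 0.5, 0.75, 1.0]
median_prob, pct_above = summarize(example_probs)
print(f"median={median_prob:.3f}, prob>=0.5: {pct_above:.1f}%")
```

Applying the same function to each model and validation-set pair would produce one row of the table per pair.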