Figure 1.
Representative samples from the Tokaido synthetic dataset (test split). Each row shows the RGB image (left), component ground-truth mask (centre; Slab = red, Beam = green, Column = blue, Nonstructural = yellow-orange, Nonbridge = dark grey), and damage ground-truth mask (right; Concrete = red, ExposedRebar = green, Nondamage = dark grey). Rows 1–3 are selected to show abundant damage pixels from both foreground classes; rows 4–6 cover all structural component classes. Outside the selected rows, damage pixels are typically sparse and localised on structural surfaces, reflecting the severe class imbalance addressed by the imbalance-aware loss functions.
Figure 2.
Component-aware damage segmentation architecture. The shared ConvNeXt encoder produces multi-scale features F fed to both branches. The component branch (top) computes feature map and logits ; the damage branch (bottom) fuses with the conditioning tensor before the damage head. The five conditioning modes differ in how is formed (see legend); in the damage-only and parallel-heads modes the dashed conditioning path is absent. The subscript denotes trainable parameters.
Figure 3.
ImageNet pretraining versus random initialisation (U-Net, EfficientNet-B3, ). Columns: input image, ground-truth, pretrained prediction, scratch prediction. Classes: Concrete = red, ExposedRebar = green, Nondamage = dark grey. Training from scratch produces fragmented, noisy outputs compared to the pretrained model, reflecting the near- foreground mIoU gap ( vs. ).
Figure 4.
Qualitative damage-segmentation results for ConvNeXt-Seg (weighted CE + Dice, seed 3). Images are selected from the synthetic test split to show substantial damage from both foreground classes. Columns: input image, ground-truth mask, predicted mask, and prediction–ground-truth (GT) overlay. Classes: Concrete = red, ExposedRebar = green, Nondamage = dark grey. The model closely follows the ground truth for both damage classes while maintaining near-perfect Nondamage accuracy.
Figure 5.
Qualitative comparison of damage predictions across all five joint-model ablation variants (seed 1) on representative synthetic test images. Columns: ground-truth damage mask, damage-only, parallel-heads, hard-mask, soft-detach, soft-full. All five variants produce visually similar outputs, consistent with the near-identical quantitative scores in Table 9. Classes: Concrete = red, ExposedRebar = green, Nondamage = dark grey.
Figure 6.
Qualitative component-segmentation results for the joint soft-detach model (seed 1) on representative synthetic test images. Columns: input image, ground-truth component mask, predicted component mask, and prediction–ground-truth (GT) overlay. Classes: Slab = red, Beam = green, Column = blue, Nonstructural = yellow-orange, Nonbridge = dark grey. The model accurately delineates all major structural elements, achieving on the test split.
Figure 7.
Structural component segmentation on real viaduct photographs (trained ConvNeXt-Seg, seed 1). Columns: input image, component ground-truth, trained prediction. Classes: Slab = red, Beam = green, Column = blue, Nonstructural = yellow-orange, Nonbridge = dark grey. Real photographs carry component annotations only; damage annotations are unavailable for this image set.
Figure 8.
Qualitative prompt ablation for GroundedSAM structural segmentation on real viaduct photographs. Columns: input image, component ground-truth, and zero-shot predictions for prompt variants v0 (bridge jargon), v2 (single terms), v3 (common vocabulary, best), and v4 (combined extended phrases). Classes: Slab = red, Beam = green, Column = blue, Nonstructural = yellow-orange, Nonbridge = dark grey. Variants v1 (descriptive geometry) and v4 (combined extended phrases) produce the weakest structural differentiation; v3 (common vocabulary) achieves the best overall mIoU (Table 12).
Table 1.
Summary of component-aware ablation variants.
| Variant | Component head | Fusion | Conditioning signal | Gradient to component branch |
| --- | --- | --- | --- | --- |
| damage-only | × | × | — | not applicable |
| parallel-heads | ✓ | × | — | component loss only |
| hard-mask | ✓ | ✓ | one-hot () | none from damage† |
| soft-detach | ✓ | ✓ | | none from damage† |
| soft-full | ✓ | ✓ | | yes |
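To make the "Conditioning signal" column of Table 1 concrete, the following is a minimal per-pixel sketch in plain Python. It assumes that the hard-mask signal is a one-hot encoding of the arg-max component class and that both soft variants use a softmax over the component logits; the soft entries are blank in the table as extracted, so the softmax is an assumption, and the function names are hypothetical. The sketch does not model gradient flow, which is the only difference between soft-detach and soft-full according to the last column of Table 1.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one pixel's component logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def one_hot_argmax(logits):
    """Hard mask: 1.0 at the highest-scoring class, 0.0 elsewhere."""
    best = max(range(len(logits)), key=lambda i: logits[i])
    return [1.0 if i == best else 0.0 for i in range(len(logits))]


def conditioning_signal(variant, component_logits):
    """Return the per-pixel conditioning vector for a variant (hypothetical helper).

    Assumption: soft variants use softmax probabilities. In a real training
    framework, soft-detach would additionally block gradients from the damage
    loss to the component branch; that is not representable here.
    """
    if variant in ("damage-only", "parallel-heads"):
        return None  # no conditioning path in these variants
    if variant == "hard-mask":
        return one_hot_argmax(component_logits)
    if variant in ("soft-detach", "soft-full"):
        return softmax(component_logits)
    raise ValueError(f"unknown variant: {variant}")


# Five component classes: Nonbridge, Slab, Beam, Column, Nonstructural.
example_logits = [0.2, 2.0, -1.0, 0.5, 0.0]
hard = conditioning_signal("hard-mask", example_logits)
soft = conditioning_signal("soft-detach", example_logits)
```

Under these assumptions the forward values of soft-detach and soft-full are identical, which is consistent with the table distinguishing them only by gradient routing.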
Table 2.
Ablation: Loss Comparison (damage)
| Variant | mIoU_main | mIoU_all | Dice | Precision |
| --- | --- | --- | --- | --- |
| weighted_ce_dice | 0.535 | 0.688 | 0.797 | 0.816 |
| tversky | 0.526 | 0.681 | 0.792 | 0.793 |
| ce_dice | 0.523 | 0.679 | 0.790 | 0.846 |
| focal_tversky | 0.498 | 0.662 | 0.775 | 0.768 |
| ce | 0.491 | 0.659 | 0.771 | 0.852 |
| focal_ce | 0.467 | 0.642 | 0.756 | 0.838 |
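The overlap-based losses compared in Table 2 are built on closely related indices. The sketch below, in plain Python, shows the textbook Dice and Tversky indices computed from (soft) true-positive, false-positive, and false-negative counts, plus one common form of the focal Tversky loss. The weights `alpha`, `beta`, and `gamma` are illustrative placeholders, not the settings used in this study, which are not stated in this section.

```python
def dice_index(tp, fp, fn):
    """Dice coefficient: 2*TP / (2*TP + FP + FN)."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom > 0 else 0.0


def tversky_index(tp, fp, fn, alpha=0.5, beta=0.5):
    """Tversky index: TP / (TP + alpha*FP + beta*FN)."""
    denom = tp + alpha * fp + beta * fn
    return tp / denom if denom > 0 else 0.0


def focal_tversky_loss(tp, fp, fn, alpha=0.5, beta=0.5, gamma=0.75):
    """One common focal Tversky form: (1 - TI) ** gamma (conventions vary)."""
    return (1.0 - tversky_index(tp, fp, fn, alpha, beta)) ** gamma


# Illustrative counts for a single damage class (hypothetical numbers).
tp, fp, fn = 30.0, 10.0, 20.0
dice = dice_index(tp, fp, fn)
tversky_sym = tversky_index(tp, fp, fn, alpha=0.5, beta=0.5)
tversky_recall = tversky_index(tp, fp, fn, alpha=0.3, beta=0.7)
```

With `alpha = beta = 0.5` the Tversky index reduces to Dice; raising `beta` above `alpha` penalises false negatives more heavily, which is the usual motivation for using it on sparse foreground classes.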
Table 3.
Ablation: Encoder Comparison (damage)
| Variant | mIoU_main | mIoU_all | Dice | Precision |
| --- | --- | --- | --- | --- |
| efficientnet_b5 | 0.547 | 0.696 | 0.803 | 0.850 |
| efficientnet_b3 | 0.526 | 0.681 | 0.791 | 0.842 |
| efficientnet_b0 | 0.488 | 0.656 | 0.769 | 0.823 |
| resnet50 | 0.418 | 0.609 | 0.724 | 0.780 |
Table 4.
Ablation: Resolution Comparison (damage)
| Variant | mIoU_main | mIoU_all | Dice | Precision |
| --- | --- | --- | --- | --- |
| 640x1120 | 0.549 | 0.697 | 0.805 | 0.857 |
| 512x896 | 0.536 | 0.689 | 0.797 | 0.839 |
| 384x672 | 0.464 | 0.640 | 0.754 | 0.822 |
| 256x448 | 0.398 | 0.595 | 0.711 | 0.763 |
Table 5.
Ablation: Pretrained Vs Scratch (damage)
| Variant | mIoU_main | mIoU_all | Dice | Precision |
| --- | --- | --- | --- | --- |
| pretrained | 0.519 | 0.677 | 0.787 | 0.826 |
| scratch | 0.264 | 0.504 | 0.605 | 0.638 |
Table 6.
Damage segmentation benchmark on the Tokaido synthetic test split (weighted CE + Dice loss; mean ± standard deviation over all available runs). n denotes the run count per cell: most models use seeds 1–3, whereas ConvNeXt-Seg has 6 completed attempts and DeepLabV3+ has 4. The additional runs are logged repeated attempts that reuse seed values, not additional distinct seeds. Params denotes trainable parameter count in millions (M); Prec and Rec denote precision and recall. †Swin-UperNet collapses to predicting the Nondamage class in all runs; ExposedRebar IoU ≈ 0.
| Model | Encoder | Params | n | mIoU_main | mIoU_all | Dice | Prec | Rec | FPS |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| SegFormer | PVT-V2-B2 | 26.0M | 3 | | 0.710 | 0.815 | 0.817 | 0.814 | 19.5 |
| ConvNeXt-Seg | ConvNeXt-B | 93.4M | 6 | | 0.706 | 0.811 | 0.824 | 0.799 | 17.3 |
| UNet++ | EfficientNet-B3 | 13.6M | 3 | | 0.690 | 0.798 | 0.838 | 0.766 | 17.8 |
| UNet | EfficientNet-B3 | 13.2M | 3 | | 0.686 | 0.795 | 0.816 | 0.777 | 19.6 |
| DeepLabV3+ | EfficientNet-B3 | 11.7M | 4 | | 0.649 | 0.762 | 0.787 | 0.742 | 20.2 |
| ViT-Seg | ViT-B/16 | 93.1M | 3 | | 0.571 | 0.686 | 0.743 | 0.645 | 14.1 |
| Swin-UperNet† | Swin-B | 92.1M | 3 | | 0.341 | 0.357 | 0.376 | 0.352 | 13.7 |
Table 7.
Per-class IoU breakdown on the Tokaido synthetic test split (weighted CE + Dice loss, mean ± standard deviation over all available runs; run counts per model match Table 6). Nondamage standard deviation is for all models except Swin-UperNet ( ) and is omitted for brevity. †Swin-UperNet collapses to the Nondamage class in all runs.
| Model | mIoU_main | Nondamage | Concrete | ExposedRebar |
| --- | --- | --- | --- | --- |
| SegFormer | | 0.994 | | |
| ConvNeXt-Seg | | 0.994 | | |
| UNet++ | | 0.993 | | |
| UNet | | 0.993 | | |
| DeepLabV3+ | | 0.992 | | |
| ViT-Seg | | 0.990 | | |
| Swin-UperNet† | | 0.982 | | |
Table 8.
Component segmentation benchmark on the Tokaido synthetic test split (weighted CE + Dice loss, mean ± standard deviation over seeds 1–3). Params denotes trainable parameter count in millions (M); Prec and Rec denote precision and recall. mIoU_main excludes the Nonbridge background class.
| Model | Encoder | Params | n | mIoU_main | mIoU_all | Dice | Prec | Rec | FPS |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ConvNeXt-Seg | ConvNeXt-B | 93.4M | 3 | | 0.964 | 0.981 | 0.979 | 0.984 | 19.5 |
| SegFormer | PVT-V2-B2 | 26.0M | 3 | | 0.959 | 0.979 | 0.976 | 0.981 | 24.6 |
| UNet++ | EfficientNet-B3 | 13.6M | 3 | | 0.957 | 0.978 | 0.976 | 0.980 | 24.5 |
| UNet | EfficientNet-B3 | 13.2M | 3 | | 0.956 | 0.977 | 0.975 | 0.980 | 27.6 |
| DeepLabV3+ | EfficientNet-B3 | 11.7M | 3 | | 0.955 | 0.977 | 0.975 | 0.979 | 27.6 |
| ViT-Seg | ViT-B/16 | 93.1M | 3 | | 0.947 | 0.973 | 0.972 | 0.974 | 16.8 |
| Swin-UperNet | Swin-B | 92.1M | 3 | | 0.857 | 0.922 | 0.918 | 0.926 | 16.4 |
Table 9.
Component-aware ablation results on the Tokaido synthetic test set (mean ± standard deviation over seeds 1–3). Superscripts dmg and comp denote damage and component tasks, respectively. †Variant has no component head; component metrics are not applicable.
| Variant | | Concrete IoU | ExposedRebar IoU | |
| --- | --- | --- | --- | --- |
| damage-only† | | | | — |
| parallel-heads | | | | |
| hard-mask | | | | |
| soft-detach | | | | |
| soft-full | | | | |
Table 10.
Component segmentation on real photographs (ConvNeXt-Seg, seeds 1–3, weighted CE + Dice; same three single-task checkpoints as Table 8; global pooled IoU over 16 images). Rows “Seed 1”–“Seed 3” are individual checkpoints; the final mIoU_main entry is mean ± standard deviation across seeds. The Nonbridge class is omitted from mIoU_main because its support is small and arises primarily from resize-pad preprocessing in the evaluation pipeline.
| | Nonbridge | Slab | Beam | Column | Nonstructural | mIoU_main |
| --- | --- | --- | --- | --- | --- | --- |
| Seed 1 | 0.010 | 0.456 | 0.409 | 0.598 | 0.036 | 0.374 |
| Seed 2 | 0.011 | 0.727 | 0.447 | 0.755 | 0.056 | 0.496 |
| Seed 3 | 0.009 | 0.505 | 0.424 | 0.643 | 0.034 | 0.402 |
| Mean | 0.010 | 0.562 | 0.427 | 0.665 | 0.042 | |
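The caption of Table 10 refers to a "global pooled IoU over 16 images". A minimal sketch of what pooling usually means is given below, assuming that intersections and unions are summed over all images before dividing, as opposed to averaging per-image IoU values. The pixel counts are invented for illustration and the function names are hypothetical; the evaluation code of the study is not shown in this section.

```python
def pooled_iou(per_image_counts):
    """IoU for one class with pixel counts pooled over all images.

    per_image_counts: list of (intersection, union) pixel counts, one pair
    per image. Assumption: "global pooled" means sum first, divide once.
    """
    inter = sum(i for i, _ in per_image_counts)
    union = sum(u for _, u in per_image_counts)
    return inter / union if union > 0 else float("nan")


def mean_per_image_iou(per_image_counts):
    """Alternative: average the per-image IoU, skipping images with empty union."""
    values = [i / u for i, u in per_image_counts if u > 0]
    return sum(values) / len(values) if values else float("nan")


# Hypothetical counts: one image where the class is large and well segmented,
# one image where it covers only a few pixels and is mostly missed.
counts = [(900, 1000), (1, 10)]
pooled = pooled_iou(counts)            # 901 / 1010
per_image = mean_per_image_iou(counts) # (0.9 + 0.1) / 2
```

Pooling weights every pixel equally, so a class that is tiny in some images does not dominate the score; this matters on a small set of photographs where class support varies strongly between images.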
Table 11.
Text prompts used for each structural class across the five GroundedSAM prompt variants (v0–v4; note that these are distinct from the five joint-model conditioning variants in Table 9). Each cell lists the period-separated phrases passed to GroundingDINO for that class. Nonstructural (class 4) is not directly prompted; it receives all pixels not assigned to classes 1–3 after priority-order merging.
| Variant | Slab (1) | Beam (2) | Column (3) | Nonstructural (4) |
| --- | --- | --- | --- | --- |
| v0 — baseline | bridge deck slab. concrete slab. bridge floor | bridge girder. bridge beam. concrete beam | bridge column. bridge pier. concrete column | railway rail. railway sleeper. bridge railing |
| v1 — descriptive geom. | flat concrete deck. horizontal concrete surface. concrete floor plate | concrete T-beam. prestressed concrete girder. longitudinal girder under deck | vertical concrete pier. cylindrical column. rectangular concrete pillar | metal handrail. steel guard rail. bridge fence |
| v2 — single terms | slab | beam | column | railing |
| v3 — common vocab. (best) | concrete floor. flat concrete ceiling. concrete soffit | structural beam. concrete support beam. I-beam | pillar. concrete pillar. support column | railing. fence. barrier |
| v4 — combined | bridge deck slab. flat concrete deck. concrete floor plate. bridge floor | bridge beam. bridge girder. concrete beam. prestressed concrete girder. T-beam | bridge column. concrete pier. bridge pier. vertical concrete pillar | bridge railing. metal handrail. steel fence. guardrail |
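The caption of Table 11 describes two mechanical steps: joining each class's phrases into a period-separated prompt, and merging the resulting per-class masks in priority order with Nonstructural (class 4) as the fallback. The sketch below illustrates both in plain Python. The priority order used here (Column, then Beam, then Slab) is an assumption chosen for the example, since the actual order is not stated in this section, and the helper names are hypothetical rather than part of GroundedSAM or GroundingDINO.

```python
def build_prompt(phrases):
    """Join a class's phrases into one period-separated prompt string."""
    return ". ".join(phrases)


def merge_by_priority(masks, priority, fallback=4):
    """Merge binary per-class masks into one label map.

    masks: dict mapping class id -> 2D list of booleans (same shape for all).
    priority: class ids ordered from highest to lowest priority (assumed order).
    Pixels claimed by no mask receive the fallback class (Nonstructural = 4).
    """
    first = next(iter(masks.values()))
    height, width = len(first), len(first[0])
    labels = [[fallback] * width for _ in range(height)]
    # Paint lowest priority first so higher-priority classes overwrite it.
    for class_id in reversed(priority):
        mask = masks[class_id]
        for r in range(height):
            for c in range(width):
                if mask[r][c]:
                    labels[r][c] = class_id
    return labels


column_prompt = build_prompt(["pillar", "concrete pillar", "support column"])

# Toy 1x3 image: Slab and Beam overlap at the middle pixel.
toy_masks = {
    1: [[True, True, False]],    # Slab
    2: [[False, True, False]],   # Beam
    3: [[False, False, False]],  # Column
}
merged = merge_by_priority(toy_masks, priority=[3, 2, 1])
```

In the toy example the overlapping pixel goes to Beam because it outranks Slab in the assumed order, and the unclaimed pixel falls through to class 4.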
Table 12.
Structural-task prompt ablation for GroundedSAM on 16 real photographs (mIoU_main excludes Nonbridge).
| Variant | Slab | Beam | Column | Nonstructural | mIoU_main |
| --- | --- | --- | --- | --- | --- |
| v0 — baseline (bridge-specific phrases) | 0.291 | 0.126 | 0.223 | 0.142 | 0.195 |
| v1 — descriptive geometry | 0.184 | 0.018 | 0.090 | 0.133 | 0.106 |
| v2 — single terms (slab/beam/…) | 0.372 | 0.089 | 0.310 | 0.105 | 0.219 |
| v3 — common vocabulary (best) | 0.338 | 0.133 | 0.491 | 0.038 | 0.250 |
| v4 — combined extended phrases | 0.131 | 0.147 | 0.106 | 0.088 | 0.118 |
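As a consistency check on Table 12, the final column can be recomputed from the four per-class values. The short sketch below assumes that this column is the unweighted mean of the Slab, Beam, Column, and Nonstructural IoUs (Nonbridge excluded, as the caption states); this reading is inferred from the arithmetic rather than stated explicitly in this section.

```python
# Per-class IoU (Slab, Beam, Column, Nonstructural) and the reported final
# column, copied from Table 12.
TABLE_12 = {
    "v0": ((0.291, 0.126, 0.223, 0.142), 0.195),
    "v1": ((0.184, 0.018, 0.090, 0.133), 0.106),
    "v2": ((0.372, 0.089, 0.310, 0.105), 0.219),
    "v3": ((0.338, 0.133, 0.491, 0.038), 0.250),
    "v4": ((0.131, 0.147, 0.106, 0.088), 0.118),
}


def foreground_miou(class_ious):
    """Unweighted mean IoU over the listed (non-background) classes."""
    return sum(class_ious) / len(class_ious)


recomputed = {name: foreground_miou(ious) for name, (ious, _) in TABLE_12.items()}

# Largest gap between the recomputed mean and the reported value.
max_abs_diff = max(
    abs(recomputed[name] - reported) for name, (_, reported) in TABLE_12.items()
)

# Variant with the highest recomputed foreground mIoU.
best_variant = max(recomputed, key=recomputed.get)
```

Every recomputed value agrees with the reported one to within rounding of the three-decimal inputs, and the ranking confirms v3 as the strongest prompt variant even though its Nonstructural IoU is the lowest of the five.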
Table 14.
Summary of progress relative to the prior study [13]. All metrics are foreground mIoU (background excluded) unless otherwise noted. Prior damage figures are per-class IoU averaged over the two foreground damage classes (Cracks, Reinforcement) from Table 3 of that paper.
| Aspect | Prior work | Present work |
| --- | --- | --- |
| Input resolution | | |
| Architectures evaluated | 2 | 7 |
| Loss functions compared | ad-hoc | 4 (systematic ablation) |
| Repeated seeds | no | 3 per config |
| Component mIoU (synthetic test) | 76–87% (val, 4 cls) | 94.6–95.7% (test, 5 cls) |
| Damage mIoU (synthetic test) | (2 arch) | (SegFormer mean) |
| Component mIoU (real photos) | (U-Net, 4 cls) | (trained, 5 cls, gap ) |
| Zero-shot structural baseline | — | (GroundedSAM) |
| Joint component-aware ablation | no | yes (5 variants) |