Controlled Benchmarking and Component-Aware Ablation for Railway Viaduct Structural and Damage Segmentation

Piotr Tauzowski; Paweł Hołobut; Bartłomiej Błachowski

doi:10.20944/preprints202606.0926.v1

Submitted: 10 June 2026

Posted: 11 June 2026


Abstract

Automated damage inspection of railway viaducts requires pixel-level identification of structural components and surface damage such as cracking and rebar exposure. A common assumption in bridge inspection is that damage segmentation improves when component information is provided alongside the image. This study tests that assumption on the Tokaido synthetic viaduct dataset using controlled comparisons between segmentation models with and without component information. Both damage and structural component segmentation are evaluated across multiple architectures, and the trained component model is assessed on real viaduct photographs against a baseline model requiring no task-specific training. Adding component information does not improve damage segmentation: all tested strategies remain within 0.008 mean Intersection-over-Union (mIoU) of a baseline without component input. This null result persists even when component predictions are reliable, indicating that structural element identity does not provide useful information for damage localisation in this setting. The best unconditioned model reaches 0.569 mIoU for damage segmentation; for real-photo component segmentation, the trained model reaches 0.424 mIoU compared with 0.250 mIoU for the training-free baseline. These results show that multi-task benefits reported in bridge inspection do not automatically translate into gains from explicit use of component information on synthetic viaduct data, where damage placement is largely independent of structural element type. The multi-architecture benchmark and the measured real-photo structural transfer gap provide reference baselines for subsequent work on component-aware and transfer-robust inspection.

Keywords: semantic segmentation; viaduct inspection; bridge damage; CNN; Vision Transformer; component-aware segmentation; Tokaido dataset; GroundedSAM; synthetic-to-real transfer

Subject: Engineering - Civil Engineering

1. Introduction

Railway viaduct inspection requires pixel-level information about both structural components and visible surface damage. Manual visual inspection remains costly, time-consuming, and dependent on subjective interpretation, especially when elevated structures must be assessed repeatedly across large transport networks [2,4,5]. Semantic segmentation offers a direct route from inspection imagery to spatially explicit component and damage maps, but its usefulness depends on whether models can identify small damage regions and transfer beyond the data used for training. These requirements make viaduct inspection a stringent test case for modern segmentation methods.

Recent bridge-inspection research has established two important directions. First, bridge element parsing and defect segmentation can be learned jointly, and multi-task models are often reported to improve element or defect metrics relative to single-task baselines [4,5]. Second, synthetic environments, such as the Tokaido synthetic viaduct benchmark and related synthetic inspection datasets [41,46,55], can reduce annotation cost and support repeatable experiments. These studies show that component information, synthetic data, and deep segmentation architectures can support structural inspection. They do not, however, resolve the issue of whether explicit component conditioning improves damage localisation under a controlled viaduct benchmark.

The distinction between multi-task learning and explicit component conditioning is central to this study. A multi-task network may improve damage segmentation because shared features regularise the encoder, because component supervision improves structural representation, or because the damage branch directly receives component information. Prior bridge-inspection work has mostly evaluated the combined effect of feature sharing and task interaction [4,5]. The practical question remains narrower: if a model already has access to an image, does adding predicted component information help it localise cracks, concrete damage, or exposed reinforcement? This question is testable only when architecture, training recipe, dataset split, and evaluation protocol are controlled.

Architecture choice adds a second unresolved factor. General semantic-segmentation research has moved from convolutional encoder–decoder models such as U-Net and DeepLabV3+ toward hierarchical Transformers and modernised convolutional architectures, including Swin Transformer, SegFormer, and ConvNeXt [17,19,24,25,26]. Inspection imagery differs from generic scene parsing because damage classes are sparse, thin, and texture-driven. A model that performs well on broad segmentation benchmarks may therefore fail under strong foreground–background imbalance. A controlled comparison across convolutional neural network (CNN), ConvNeXt, Vision Transformer (ViT), SegFormer, and Swin-style architectures is needed before claims about the benefits of component-aware damage localisation can be interpreted.

Synthetic-to-real transfer also remains a limiting condition for real-life deployment. The Tokaido dataset provides controlled synthetic viaduct imagery with component and damage annotations, while dacl10k and related bridge datasets demonstrate the diversity and difficulty of real inspection photographs [9,55]. Foundation segmentation models provide a complementary reference point because GroundingDINO and the Segment Anything Model (SAM) can produce zero-shot masks without task-specific training [42,43]. Comparing a trained synthetic-data model against such a training-free baseline helps separate ordinary segmentation accuracy from real-photo transfer performance.

This study evaluates railway viaduct component and damage segmentation under a controlled benchmark design. Seven segmentation architectures are trained for structural component and damage segmentation on the Tokaido synthetic dataset, using fixed splits, repeated seeds, and a common evaluation protocol. A separate component-aware ablation study tests damage-only, parallel-head, hard-mask, detached soft-conditioning, and end-to-end soft-conditioning variants. The design isolates whether component predictions improve damage localisation beyond a damage-only baseline. The best damage model reaches 0.569 foreground mIoU on the synthetic test split, while the component-aware variants remain within 0.008 mIoU of the unconditioned baseline. This result shows that explicit component conditioning does not provide a measurable damage-segmentation gain in this setting.

The paper makes three contributions. First, the study provides a controlled CNN–Transformer benchmark for viaduct structural component and damage segmentation on the Tokaido dataset. Second, the component-aware ablation shows that reliable component predictions do not automatically improve damage localisation. Third, the real-photo component-transfer evaluation compares a component segmentation model trained on synthetic data with a GroundingDINO+SAM zero-shot pipeline (GroundedSAM), where the trained model reaches 0.424 mIoU and the zero-shot baseline reaches 0.250 mIoU on real viaduct photographs. Together, these results position the benefits of component-aware conditioning as an empirical question rather than an assumed benefit, and they provide baseline evidence for future transfer-robust inspection models.

2. Related Work

Research on automated bridge and viaduct inspection draws from two distinct but interacting bodies of literature: generic semantic segmentation, which supplies the backbone architectures and training methods, and inspection-specific segmentation, which defines the tasks, datasets, and domain challenges. Progress in generic segmentation has been rapid and architecturally diverse, while progress in inspection-specific methods has been constrained by small real-world datasets, heterogeneous evaluation protocols, and narrow damage taxonomies. This section reviews both areas with respect to the three unresolved questions that motivate the present study: how convolutional neural network (CNN) and Transformer architectures compare on the same viaduct benchmark, whether explicit component-label conditioning improves damage localisation, and how large the synthetic-to-real transfer gap is against a training-free baseline.

2.1. Semantic Segmentation Architectures for Structural Inspection

The encoder–decoder paradigm, established by U-Net [17] and its densely connected extension UNet++ [18], defined the spatial segmentation baseline: a contracting encoder captures context and a symmetric decoder recovers spatial resolution through skip connections. DeepLabV3+ [19] offered a complementary approach via atrous spatial pyramid pooling, capturing multi-scale context without the resolution loss of pooling. UPerNet [49] paired a feature pyramid network with a pyramid pooling module over any hierarchical backbone, and is the primary decoder used with Swin Transformer and ConvNeXt in the present benchmark. These convolutional baselines are well validated for bridge crack and damage segmentation tasks [16,38] and represent the CNN family in the controlled comparison.

Vision Transformer architectures introduced global self-attention and altered the accuracy–efficiency tradeoff. Swin Transformer [24] achieved 48.1 mIoU on ADE20K with the UPerNet decoder by using shifted window attention to build hierarchical feature maps. SegFormer [25] combined a mix-Transformer hierarchical encoder with a lightweight all-multilayer-perceptron (MLP) decoder, reaching 51.0/51.8 mIoU (single-/multi-scale). ConvNeXt [26] demonstrated that a convolutional network redesigned with Transformer principles (large kernels, inverted bottleneck, and layer normalisation) matches Swin Transformer at comparable scale (44.4 vs. 44.5 mIoU). This result showed that self-attention is not strictly necessary for competitive dense segmentation, and motivates treating CNN and Transformer families as peers under controlled comparison rather than as successive generations.

Applied to inspection-specific tasks, the advantage of Transformers over CNNs is inconsistent. TransUNet improves over U-Net and DeepLabv3+ by at least 3.8% mIoU on concrete crack datasets [31], suggesting that global attention helps localise thin crack features. However, the CCSNet ablation [38] reveals that the Transformer branch alone drops to 52.61% mIoU on a three-class concrete damage task, well below the CNN branch at 76.13%; only their combination reaches 82.74%. MixSegNet [57] and ETCS-Net [35] report similar patterns: hybrid CNN-Transformer designs outperform either family in isolation for crack and multi-class damage segmentation. Because existing inspection papers use heterogeneous datasets and training protocols, the relative ranking of architecture families on a common viaduct benchmark remains an open empirical question.

2.2. Bridge Component Segmentation and Multi-Task Inspection

Bridge element parsing originated as component classification [2] and evolved into pixel-level component segmentation [21], enabling spatial attribution of damage to specific structural elements. The dominant framework became multi-task learning, in which a shared encoder is jointly supervised on component and damage targets, motivated by the expectation that component identity provides a structural prior for damage localisation.

The reported multi-task evidence is quantitatively consistent. Zhang et al. [4] compare twelve multi-task design variants on a 145-image Virginia DOT bridge dataset. The best variant improves bridge element mIoU from 82.89% to 85.48% and corrosion mIoU from 80.85% to 83.31% over single-task baselines, while also reducing total training time. Multi-Task High-Resolution Network (MT-HRNet) [60] applies a parallel multi-resolution backbone to the joint element-defect task, reaching 78.09% element mIoU and 79.76% defect mIoU on the Steel Bridge Condition Inspection Visual (SBCIV) benchmark. AECIF-Net [5] replaces simple feature sharing with bidirectional cross-task co-attention between task-specific relearning subnets. The model achieves 92.11% element mIoU and 87.16% defect mIoU on SBCIV, improving over naive multi-task learning by 2.51% and 1.90% respectively. Yao et al. [46] validate joint component and crack segmentation on high-rise building images, achieving component F-scores above 92% and crack F-score 73.51% under a synthetic-trained DeepLabv3+ model.

The studies above do not establish whether the performance gain reported for multi-task training arises from shared-encoder regularisation, from richer structural feature representations, or from explicit routing of component information into the damage branch. All three bridge-inspection studies [4,5,60] train a shared encoder jointly; none isolates the conditioning signal by providing component predictions as an explicit input to a damage-only encoder. Because the present study asks precisely this question, the multi-task literature provides the motivating context but not the answer.

2.3. Synthetic Inspection Data and Domain Transfer

Manually annotating real inspection images is expensive and inconsistent: lighting, viewpoint, surface texture, and damage morphology vary widely, and expert labellers are scarce. Synthetic rendering offers a scalable alternative. The Tokaido dataset [55] renders Tokaido Shinkansen viaducts from 3D CG models under varied lighting, weather, and damage states, producing pixel-accurate component and damage annotations without manual labelling. Zarski et al. [41] apply a comparable approach to post-earthquake building inspection, achieving component mIoU 91.54% on synthetic test data. Crack and exposed-rebar IoU remain near 30% because label overlap and class imbalance are not resolved by the synthetic rendering alone, and the authors explicitly note that real-world fine-tuning is required. Yao et al. [46] generate 2,000 photorealistic building images using a physics-based crack plugin, reaching component F-scores above 92% and crack F-score 73.51%, with the synthetic model successfully identifying over 96% of real field images. Rahman et al. [45] target bridge damage transfer directly but do not report standardised mIoU transfer metrics.

These studies share a common limitation: the magnitude of the synthetic-to-real performance drop is reported inconsistently, measured on private test sets, or not reported at all. dacl10k [9], a 9,920-image real bridge inspection benchmark with 12 damage and 6 component classes, provides the standardised evaluation domain needed to measure transfer consistently, but dacl10k has not previously been used as the target for Tokaido-trained models. The dacl10k taxonomy covers 12 damage classes with a different labelling convention from Tokaido’s two-class damage scheme (Concrete surface damage, ExposedRebar); mapping the two taxonomies would require annotation re-labelling beyond the scope of this study. The 16 hand-annotated real viaduct photographs used here provide a smaller but taxonomy-compatible transfer target. Promptable foundation models offer a complementary reference point. GroundingDINO+SAM [42,43] produces text-prompted segmentation without task-specific training and is used here as a zero-shot lower bound. Parameter-efficient foundation-model adaptation is also viable: low-rank adaptation (LoRA) of SAM achieves mask average precision $AP_{mask} = 0.703$ and mIoU $= 0.868$ on a masonry crack dataset [44], demonstrating that fine-tuning can close the gap to task-specific segmentation. Fine-tuned adaptation and zero-shot deployment differ fundamentally in data requirements, however, and the zero-shot protocol is the appropriate baseline for measuring how much task-specific synthetic training contributes.

Collectively, the existing literature establishes that hybrid CNN-Transformer designs outperform single-architecture baselines for damage segmentation, that multi-task bridge inspection is broadly beneficial, and that synthetic environments can support inspection-relevant segmentation. Three questions remain unresolved under controlled conditions: whether architecture family determines performance on a standardised viaduct benchmark, whether explicit component-label conditioning provides damage-segmentation gains beyond shared-encoder supervision, and what the synthetic-to-real component-segmentation gap is relative to a training-free zero-shot baseline on real viaduct photographs.

3. Methodology

This section defines the segmentation tasks, implemented model families, component-aware conditioning variants, and training configuration used throughout the benchmark reported in the present paper. The study is both a controlled architecture benchmark and an ablation-study design: standard segmentation architectures are compared under the same conditions, while the component-aware variants constitute the conditioning interventions used to test whether structural component information helps damage localisation.

3.1. Task Formulation and Label Spaces

Each input image is a red–green–blue (RGB) viaduct rendering $x \in \mathbb{R}^{3 \times H \times W}$, resized and cropped to $H \times W = 512 \times 896$ before being passed to the network. Two dense prediction tasks are considered. The structural component task predicts one of five labels at each pixel: Nonbridge, Slab, Beam, Column, or Nonstructural. The last class, Nonstructural, aggregates minor structural elements (rail, sleeper, and other non-load-bearing parts); the merging uses zero-based indices after label conversion (see Section 3.2). The damage task predicts one of three labels: Nondamage, Concrete, or ExposedRebar. The single-task models output one logit tensor $z \in \mathbb{R}^{C \times H \times W}$ for the selected task, with $C = 5$ for component segmentation and $C = 3$ for damage segmentation.

Performance is optimised and monitored using foreground mean Intersection-over-Union (mIoU), denoted $\mathrm{mIoU}_{\mathrm{main}}$, where Intersection-over-Union (IoU) is computed per class before averaging. This metric excludes the dominant background class, Nonbridge for component segmentation and Nondamage for damage segmentation, so that model selection is driven by the structural and damage classes that determine inspection usefulness. The same foreground convention is used by the validation scheduler and by the checkpoint criterion.
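For reference, the foreground metric can be written out explicitly. The block below is a sketch in standard confusion-matrix notation; the per-class counts $TP_{c}$, $FP_{c}$, $FN_{c}$ and the foreground index set $\mathcal{F}$ are symbols introduced here for illustration and are not part of the original text.

```latex
\mathrm{IoU}_{c} = \frac{TP_{c}}{TP_{c} + FP_{c} + FN_{c}},
\qquad
\mathrm{mIoU}_{\mathrm{main}} = \frac{1}{|\mathcal{F}|} \sum_{c \in \mathcal{F}} \mathrm{IoU}_{c}
```

Under this reading, $\mathcal{F}$ = {Slab, Beam, Column, Nonstructural} for component segmentation and $\mathcal{F}$ = {Concrete, ExposedRebar} for damage segmentation.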

3.2. Dataset

Both tasks are trained and evaluated on the Tokaido dataset [55], a synthetic photo-realistic collection of 7,575 viaduct images rendered from three-dimensional bridge models using the open-source Blender3D modelling tool. A viaduct is a multi-span elevated bridge structure, typically used to carry railways or roads over valleys or urban areas. The images are representative of UAV-acquired aerial imagery and come with pixel-accurate ground-truth annotations stored as unsigned-integer label images. Each sample includes segmentation masks covering structural object types — slab (the flat horizontal deck that forms the roadway or rail bed), beam (longitudinal load-bearing members that span between supports), column (vertical pillars that carry the beam loads to the ground), rail and sleeper (railway track components), and other minor elements — as well as two damage categories: concrete surface damage (cracking and spalling visible on the surface) and exposed rebar (severe deterioration where steel reinforcing bars have become visible due to cover loss or corrosion).

A validity flag stored in the file-list CSV is used to exclude low-quality or incomplete samples prior to any split; files whose label images are missing from disk are additionally filtered out. Figure 1 shows representative examples of the input images and their ground-truth annotations.

Class indexing and label merging.

Raw label images store class indices as unsigned integers starting at 1; the preprocessing pipeline subtracts 1 to obtain zero-based indices in the range 0–7 before merging. For the component task, zero-based source classes 5, 6, and 7 — corresponding to raw classes 6 (rail), 7 (sleeper), and 8 (other minor elements) — are merged into the Nonstructural class at zero-based index 4. Raw class 5 already maps to Nonstructural and is not a merge source. The merge reduces the eight raw component labels to five training classes.
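The index shift and merge described above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' preprocessing code; the function and constant names are hypothetical, and only the index values come from the text.

```python
import numpy as np

NONSTRUCTURAL = 4          # zero-based target index stated in the text
MERGE_SOURCES = (5, 6, 7)  # zero-based rail, sleeper, other minor elements


def convert_component_labels(raw_labels):
    """Map raw 1-based component labels (1..8) to the five training classes (0..4)."""
    # Cast before subtracting so unsigned label images cannot wrap around.
    zero_based = np.asarray(raw_labels).astype(np.int64) - 1
    merged = zero_based.copy()
    # Collapse the three minor-element classes into Nonstructural.
    merged[np.isin(zero_based, MERGE_SOURCES)] = NONSTRUCTURAL
    return merged


# Toy 2x4 "label image" containing every raw class once (illustrative only).
raw = np.array([[1, 2, 3, 4], [5, 6, 7, 8]], dtype=np.uint8)
converted = convert_component_labels(raw)
```

Casting to a signed type before the subtraction is a defensive choice: with unsigned storage, an unexpected raw value of 0 would otherwise wrap to a large positive index.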

3.3. Preprocessing and Augmentation

All images follow a common preprocessing pipeline before tensor conversion. For training, the image is first scaled with LongestMaxSize so that the longer side matches the configured training size, then padded with zeros using PadIfNeeded until the $512 \times 896$ canvas is obtained. A random $512 \times 896$ crop is then extracted. Training augmentation consists of horizontal flipping with probability 0.5, random brightness–contrast perturbation with probability 0.3, and Gaussian blur with probability 0.15, followed by normalisation and ToTensorV2. For non-augmented evaluation transforms, the same longest-side scaling, zero padding, normalisation, and tensor conversion are used without random crop or photometric augmentation. The same resize–pad–normalise convention is used for both component and damage inputs, ensuring that task differences arise from labels and model heads rather than from preprocessing.
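The geometry of the longest-side scaling and zero padding can be illustrated without any image library. The sketch below only computes sizes; it assumes a longest-side target of 896 pixels (the text says "configured training size" without stating the value) and a hypothetical 1080 × 1920 source image, and its rounding may differ slightly from the augmentation library actually used.

```python
def resize_pad_geometry(height, width, max_size=896, canvas=(512, 896)):
    """Return the scaled (h, w) and the total zero padding (pad_h, pad_w).

    max_size=896 is an assumption made for this illustration.
    """
    longest = max(height, width)
    # Scale both sides by the same factor so the longer side equals max_size.
    new_h = round(height * max_size / longest)
    new_w = round(width * max_size / longest)
    # Pad only where the scaled image is smaller than the canvas.
    pad_h = max(canvas[0] - new_h, 0)
    pad_w = max(canvas[1] - new_w, 0)
    return (new_h, new_w), (pad_h, pad_w)


# Hypothetical landscape frame: only a few rows of padding are needed.
scaled, padding = resize_pad_geometry(1080, 1920)
# Hypothetical portrait frame: the width is padded and the crop picks the window.
scaled_portrait, padding_portrait = resize_pad_geometry(1920, 1080)
```

For a landscape frame the padded image already matches the canvas, so the random crop is effectively the identity; for other aspect ratios the crop selects a $512 \times 896$ window from the padded image.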

3.4. Single-Task Segmentation Architectures

The single-task benchmark evaluates seven segmentation architectures under the same task interface. Three models use segmentation_models_pytorch decoders with an ImageNet-pretrained timm-efficientnet-b3 encoder: U-Net [17], UNet++ [18], and DeepLabV3+ [19]. U-Net provides the basic encoder–decoder baseline with skip connections; UNet++ adds nested skip pathways to reduce the semantic gap between encoder and decoder features; DeepLabV3+ replaces the symmetric decoder with atrous spatial pyramid pooling and a lightweight refinement decoder. Their verified trainable parameter counts are 13.16 million, 13.63 million, and 11.68 million, respectively.

The remaining four models use custom wrappers around timm encoders and decoder heads. ViT-Seg uses vit_base_patch16_224 [23] with a UPerNet-style decoder [49]; intermediate transformer blocks are reshaped into spatial feature maps and fused at one-quarter input resolution. ConvNeXt-Seg uses convnext_base [26] as a four-stage feature encoder with the same UPerNet-style decoder. SegFormer uses a Pyramid Vision Transformer v2 (PVT v2) pvt_v2_b2 hierarchical encoder with a SegFormer-style multilayer-perceptron (MLP) decoder [25]; each stage is projected to a common channel dimension, upsampled to one-quarter resolution, concatenated, and fused before the segmentation head. Swin-UPerNet uses swin_base_patch4_window7_224 [24] with the UPerNet-style decoder. The verified trainable parameter counts are 93.12 million for ViT-Seg, 93.37 million for ConvNeXt-Seg, 25.97 million for SegFormer, and 92.12 million for Swin-UPerNet.

UPerNet decoder.

The shared UPerNet decoder used by ViT-Seg, ConvNeXt-Seg, and Swin-UPerNet consists of (i) per-level lateral $1 \times 1$ convolutions that project all feature maps to a common $d = 256$-dimensional space, (ii) a Feature Pyramid Network (FPN) top-down pathway that adds upsampled deeper features to shallower ones, (iii) $3 \times 3$ convolutional refinement on each level, and (iv) a final fusion step that upsamples all levels to a common resolution, concatenates them, and applies a further $3 \times 3$ convolution. The segmentation head consists of a Conv-BN-ReLU block followed by a $1 \times 1$ convolution that produces the $C$-channel logit map, which is upsampled to the original image resolution using bilinear interpolation.
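Steps (i) to (iv) and the segmentation head can be sketched as a small PyTorch module. This is an illustrative reading of the description above, not the authors' implementation: the class name is hypothetical, and the $3 \times 3$ kernel inside the Conv-BN-ReLU block and the choice of the shallowest level as the common fusion resolution are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class UPerNetStyleDecoder(nn.Module):
    """Sketch of the described decoder; layer details are assumptions."""

    def __init__(self, in_channels, num_classes, d=256):
        super().__init__()
        # (i) lateral 1x1 convolutions project every level to d channels
        self.lateral = nn.ModuleList([nn.Conv2d(c, d, kernel_size=1) for c in in_channels])
        # (iii) 3x3 refinement on each level
        self.refine = nn.ModuleList([nn.Conv2d(d, d, kernel_size=3, padding=1) for _ in in_channels])
        # (iv) further 3x3 convolution over the concatenated levels
        self.fuse = nn.Conv2d(d * len(in_channels), d, kernel_size=3, padding=1)
        # head: Conv-BN-ReLU followed by a 1x1 classifier
        self.head = nn.Sequential(
            nn.Conv2d(d, d, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(d),
            nn.ReLU(inplace=True),
            nn.Conv2d(d, num_classes, kernel_size=1),
        )

    def forward(self, feats, out_size):
        # feats are ordered from shallow (high resolution) to deep
        lats = [conv(f) for conv, f in zip(self.lateral, feats)]
        # (ii) top-down pathway: add upsampled deeper features to shallower ones
        for i in range(len(lats) - 1, 0, -1):
            up = F.interpolate(lats[i], size=lats[i - 1].shape[-2:],
                               mode="bilinear", align_corners=False)
            lats[i - 1] = lats[i - 1] + up
        outs = [conv(x) for conv, x in zip(self.refine, lats)]
        # (iv) bring all levels to the shallowest resolution and concatenate
        target = outs[0].shape[-2:]
        outs = [outs[0]] + [
            F.interpolate(o, size=target, mode="bilinear", align_corners=False)
            for o in outs[1:]
        ]
        logits = self.head(self.fuse(torch.cat(outs, dim=1)))
        # upsample the C-channel logit map to the input resolution
        return F.interpolate(logits, size=out_size, mode="bilinear", align_corners=False)


# Tiny smoke test with made-up channel widths and a 64x64 "image".
channels = [8, 16, 32, 64]
decoder = UPerNetStyleDecoder(channels, num_classes=3, d=16).eval()
feats = [torch.randn(1, c, s, s) for c, s in zip(channels, [16, 8, 4, 2])]
with torch.no_grad():
    logits = decoder(feats, out_size=(64, 64))
```

The smoke test uses small channel widths to stay cheap; with $d = 256$ and real encoder channels the structure is unchanged.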

3.5. Component-Aware Damage Segmentation

The component-aware multitask model uses a shared convnext_base encoder and task-specific UPerNet-style decoder branches. The component branch produces a feature map $h^{c}$ at one-quarter input resolution and component logits $z^{c} \in \mathbb{R}^{5 \times H \times W}$; the damage branch produces a damage feature map $h^{d}$ and damage logits $z^{d} \in \mathbb{R}^{3 \times H \times W}$. When conditioning is active, the damage feature map is concatenated with a component-derived conditioning tensor $q^{c}$ and passed through a single Conv-BN-ReLU fusion block before the damage head. The input channel count is $256 + 5 = 261$ for the hard-mask variant (one-hot over five component classes) and $256 + 256 = 512$ for the soft variants (full feature map); in both cases the output is projected back to 256 channels. Figure 2 shows the shared encoder, task branches, and optional conditioning route.

Five conditioning configurations are evaluated, corresponding to distinct hypotheses about the role of component context (Table 1):

damage-only: the component branch is absent and no fusion is applied. This is the single-task damage baseline (93.37M parameters).
parallel-heads: both decoders are present and trained jointly on a combined loss, but the fusion module is absent. This variant tests whether shared multitask supervision alone benefits damage-segmentation quality (99.17M parameters).
hard-mask: $q^{c}$ is a discrete one-hot map derived from $\arg\max z^{c}$ at resolution $H/4 \times W/4$ and passed to the fusion module. A one-hot encoding assigns each pixel a binary vector with a single 1 in the position of the winning class and 0s everywhere else, discarding all probability information. Because argmax is not differentiable, no gradient flows through this conditioning path (99.78M parameters).
soft-detach: $q^{c} = \mathrm{sg}(F_{comp})$, where $F_{comp}$ is the continuous component feature map and $\mathrm{sg}(\cdot)$ denotes the stop-gradient operator (implemented via .detach() in PyTorch), which passes the value forward normally but blocks gradient flow during backpropagation. The component features act as the conditioning signal, but no gradient flows from the damage loss through the component path. This isolates the effect of the conditioning signal from gradient coupling (100.35M parameters).
soft-full: $q^{c} = F_{comp}$, without stop-gradient. The damage loss backpropagates through the fusion module into the component decoder and the shared encoder (100.35M parameters).

A sixth variant, component-only, trains only the component branch in isolation to establish structural segmentation performance independently of the damage objective; it is not part of the conditioning ablation.

The causal comparisons implied by these variants are part of the method definition. Damage-only versus parallel-heads tests whether auxiliary component supervision helps the shared representation when no component signal is received by the damage decoder. Parallel-heads versus hard-mask tests whether coarse predicted component identity helps damage segmentation. Parallel-heads versus soft-detach tests whether feature-level component information helps independently of extra gradient coupling. Soft-detach versus soft-full tests whether allowing the damage objective to update the component path through the conditioning branch changes the damage solution.
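The difference between the conditioning variants comes down to how $q^{c}$ is built and whether gradients can flow through it. The PyTorch sketch below illustrates that distinction under stated assumptions: the function and class names are hypothetical, the component logits are assumed to be available at quarter resolution, and the $3 \times 3$ kernel in the fusion block is a guess (the text specifies only Conv-BN-ReLU).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def conditioning_tensor(mode, comp_feat, comp_logits_q):
    """Build q^c for one variant; returns None when no fusion is used.

    comp_feat:     component feature map, shape (B, 256, H/4, W/4)
    comp_logits_q: component logits at H/4 x W/4 (assumed available)
    """
    if mode in ("damage-only", "parallel-heads"):
        return None
    if mode == "hard-mask":
        # argmax is non-differentiable, so this path carries no gradient
        winners = comp_logits_q.argmax(dim=1)
        one_hot = F.one_hot(winners, num_classes=comp_logits_q.shape[1])
        return one_hot.permute(0, 3, 1, 2).float()
    if mode == "soft-detach":
        # stop-gradient: same values forward, no gradient backward
        return comp_feat.detach()
    if mode == "soft-full":
        return comp_feat
    raise ValueError(f"unknown conditioning mode: {mode}")


class FusionBlock(nn.Module):
    """Conv-BN-ReLU over [h^d, q^c], projecting back to d channels."""

    def __init__(self, cond_channels, d=256):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(d + cond_channels, d, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(d),
            nn.ReLU(inplace=True),
        )

    def forward(self, damage_feat, q):
        return self.block(torch.cat([damage_feat, q], dim=1))


# Toy tensors at a made-up quarter resolution of 8 x 14.
comp_feat = torch.randn(2, 256, 8, 14, requires_grad=True)
comp_logits_q = torch.randn(2, 5, 8, 14, requires_grad=True)
damage_feat = torch.randn(2, 256, 8, 14)

q_hard = conditioning_tensor("hard-mask", comp_feat, comp_logits_q)
q_detach = conditioning_tensor("soft-detach", comp_feat, comp_logits_q)
q_full = conditioning_tensor("soft-full", comp_feat, comp_logits_q)

hard_fusion = FusionBlock(cond_channels=5)    # 256 + 5 = 261 input channels
soft_fusion = FusionBlock(cond_channels=256)  # 256 + 256 = 512 input channels
fused_hard = hard_fusion(damage_feat, q_hard)
fused_soft = soft_fusion(damage_feat, q_full)
```

In this sketch only the soft-full tensor keeps `requires_grad=True`, which is the mechanism that lets the damage loss reach the component decoder and shared encoder in that variant.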

3.6. Training Objective and Optimisation

Single-task models in the main benchmark are trained with one of four configured losses: cross-entropy plus Dice loss, weighted cross-entropy plus Dice loss, Tversky loss [11], and focal Tversky loss [12]. The preliminary loss ablation additionally evaluates plain cross-entropy (CE) and focal cross-entropy [10], giving six loss variants in that study. Dice loss is used in multiclass mode from logits [58], and the Dice term receives weight 0.5 in the composite cross-entropy–Dice objectives. The Tversky parameters are $\alpha = 0.3$ and $\beta = 0.7$. For focal Tversky, the verified focal exponent is $\gamma = 2.0$ for the structural task and $\gamma = 1.5$ for the damage and joint component-aware tasks. Weighted losses use per-class weights derived from training-set pixel frequencies. Let $f_{c}$ denote the fraction of training pixels assigned to class $c$. The per-class weight is

$$w_{c} = \operatorname{clip}\left(\frac{\tilde{w}_{c}}{\bar{\tilde{w}}},\ 0.25,\ 5.0\right), \qquad \tilde{w}_{c} = \frac{1}{\sqrt{f_{c} + \varepsilon}},$$

where $\varepsilon = 10^{-12}$ and $\bar{\tilde{w}}$ is the mean of $\tilde{w}_{c}$ over all classes. The square-root inverse downweights frequent classes without extreme amplification of rare ones; the $[0.25, 5.0]$ clip prevents any single class from dominating the loss.
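The weighting rule can be sketched directly from the formula. The function name and the pixel counts below are made up for illustration; only the square-root inverse, the mean normalisation, $\varepsilon = 10^{-12}$, and the $[0.25, 5.0]$ clip come from the text.

```python
import numpy as np


def class_weights(pixel_counts, eps=1e-12, lo=0.25, hi=5.0):
    """Clipped, mean-normalised inverse-square-root frequency weights."""
    counts = np.asarray(pixel_counts, dtype=np.float64)
    freq = counts / counts.sum()      # f_c
    raw = 1.0 / np.sqrt(freq + eps)   # w~_c
    return np.clip(raw / raw.mean(), lo, hi)


# Hypothetical counts for (Nondamage, Concrete, ExposedRebar).
example_counts = [9_700_000, 250_000, 50_000]
weights = class_weights(example_counts)
```

With strongly imbalanced counts such as these, the dominant class falls below the lower bound and is clipped up to 0.25, while the rare classes receive progressively larger weights.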

For joint component-aware training, the component and damage branches use the same configured loss type and hyperparameters, each with its own class weights. When both branches are present, the total objective is

$$L = L_{d}(z^{d}, y^{d}) + \lambda_{c} L_{c}(z^{c}, y^{c}),$$

where $L_{d}$ and $L_{c}$ are the damage and component segmentation losses, $y^{d}$ and $y^{c}$ are the corresponding label maps, and $\lambda_{c} = 1.0$. In component-only mode, the damage term is absent and only $L_{c}$ is optimised. In damage-only mode, the component term is absent and only $L_{d}$ is optimised.

Checkpointing and best-model selection.

Validation is performed on the held-out validation split after every training epoch. The model state is saved whenever the foreground validation $\mathrm{mIoU}_{\mathrm{main}}$ strictly improves; after training, the best checkpoint is reloaded for final evaluation on the test set.

All models are trained with AdamW [50] and weight decay 0.01. The learning rate is $3 \times 10^{-4}$ for the segmentation_models_pytorch (SMP)-based CNN single-task architectures and $1 \times 10^{-4}$ for the custom timm-based single-task architectures and joint component-aware variants. A ReduceLROnPlateau scheduler monitors validation $\mathrm{mIoU}_{\mathrm{main}}$ in maximisation mode with factor 0.5, patience 3, and minimum learning rate $10^{-6}$. Single-task models are configured for 40 epochs and joint component-aware models for 60 epochs. The configured seeds are 1, 2, and 3. Batch size is 4 for U-Net, UNet++, DeepLabV3+, ConvNeXt-Seg, SegFormer, and the joint component-aware variants, and 2 for ViT-Seg and Swin-UPerNet. Mixed precision is enabled when supported by the training device.
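The optimiser, scheduler, and strict-improvement checkpoint rule can be wired together as in the sketch below. It uses the hyperparameters stated above with a stand-in one-layer model and a made-up sequence of validation scores, so it shows the control flow only; it is not the authors' training loop.

```python
import torch
import torch.nn as nn

model = nn.Conv2d(3, 3, kernel_size=1)  # stand-in for a segmentation network
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=0.01)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="max", factor=0.5, patience=3, min_lr=1e-6
)

best_miou = float("-inf")
best_state = None
# Made-up validation mIoU_main values, one per epoch.
fake_val_miou = [0.40, 0.45, 0.45, 0.44, 0.44, 0.43, 0.43]

for val_miou in fake_val_miou:
    # One dummy optimisation step in place of a real training epoch.
    loss = model(torch.randn(1, 3, 8, 8)).pow(2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # Save only on strict improvement of the monitored metric.
    if val_miou > best_miou:
        best_miou = val_miou
        best_state = {k: v.clone() for k, v in model.state_dict().items()}

    # The scheduler sees the same metric, in maximisation mode.
    scheduler.step(val_miou)

# Reload the best checkpoint before test-set evaluation.
model.load_state_dict(best_state)
final_lr = optimizer.param_groups[0]["lr"]
```

Because the comparison is strict, the repeated 0.45 in the toy sequence does not overwrite the earlier checkpoint.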

4. Experiments

This section describes the dataset split, evaluation metrics, and experimental protocols. The benchmark covers seven architectures across both the damage and component segmentation tasks, each trained with four loss variants and three random seeds, yielding 168 nominal benchmark runs; several cells contain additional completed attempts (primarily ConvNeXt-Seg and DeepLabV3+ damage runs), bringing the total benchmark rows in the experiment log to 194. Some repeated attempts reuse seed values, so these row counts are completed-attempt counts rather than distinct-seed counts; the original execution rationale for the repeats is not documented. Together with 86 ablation runs and 15 joint-ablation runs, the experiment log contains 295 completed training runs in total. The 86 ablation runs comprise four training-protocol factors (loss, encoder, resolution, pretraining) each evaluated on both the damage and component tasks with 2–3 seeds per variant (68 runs), plus an 18-run cross-family loss confirmation that verified the loss ranking from U-Net/EfficientNet-B3 generalises to ConvNeXt-Seg and ViT-Seg before committing the recipe to the full benchmark; the 15 joint-ablation runs are five conditioning variants × 3 seeds. All experiments were executed on Compute Unified Device Architecture (CUDA)-compatible graphics processing unit (GPU) hardware.
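The run counts quoted above can be checked with simple arithmetic. The snippet below only recombines numbers stated in the paragraph; the variable names are illustrative.

```python
# Nominal benchmark grid: architectures x tasks x losses x seeds.
architectures, tasks, losses, seeds = 7, 2, 4, 3
nominal_runs = architectures * tasks * losses * seeds

benchmark_rows = 194                 # completed attempts in the experiment log
extra_attempts = benchmark_rows - nominal_runs

protocol_ablation_runs = 68          # four factors, both tasks, 2-3 seeds each
cross_family_runs = 18               # cross-family loss confirmation
ablation_runs = protocol_ablation_runs + cross_family_runs

joint_runs = 5 * 3                   # five conditioning variants x three seeds
total_runs = benchmark_rows + ablation_runs + joint_runs

print(nominal_runs, extra_attempts, ablation_runs, joint_runs, total_runs)
```

The difference between the 194 logged rows and the 168-run nominal grid corresponds to the additional completed attempts mentioned in the text.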

4.1. Dataset Split

All valid annotated images — those passing the dataset’s built-in validity filter — are partitioned into train, validation, and test sets at a 70/15/15 ratio using a fixed random seed (seed 123). The same split is applied identically to both tasks and to all experiment variants.
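A fixed-seed 70/15/15 partition can be sketched as follows. The text does not say which random generator or rounding rule the original split used, so this example (standard-library shuffle, rounding to the nearest integer, remainder assigned to the test set) is an assumption and will not reproduce the authors' exact split.

```python
import random


def split_indices(n, seed=123, ratios=(0.70, 0.15, 0.15)):
    """Shuffle range(n) with a fixed seed and cut it into train/val/test."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    n_train = round(ratios[0] * n)
    n_val = round(ratios[1] * n)
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]  # remainder goes to the test set
    return train, val, test


# Toy example with 100 valid samples.
train_idx, val_idx, test_idx = split_indices(100)
```

Splitting once on sample indices and reusing the result for both tasks and all variants is what keeps the comparison controlled.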

4.2. Evaluation Metrics

All segmentation performance is reported using mean Intersection-over-Union (mIoU) over foreground classes (${mIoU}_{main}$), which excludes the dominant background class (Nondamage for damage segmentation, Nonbridge for component segmentation). The full-class mean (${mIoU}_{all}$), macro Dice coefficient, precision, recall, and per-class Intersection-over-Union (IoU) are reported as secondary metrics. All metrics are computed from a global confusion matrix accumulated over the entire test split; results across seeds are reported as mean ± standard deviation. Classes with zero ground-truth support in the evaluated split yield undefined values and are excluded from macro averages. Inference throughput in frames per second (FPS) is measured on the same CUDA-compatible GPU hardware used for training.
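The pooled-confusion-matrix convention described above can be illustrated with a short sketch. It assumes class index 0 is the background class and uses two invented toy "images"; the helper names are hypothetical and the code is not taken from the authors' evaluation scripts.

```python
def confusion_matrix(y_true, y_pred, n_classes):
    """Count pixels; rows are ground-truth classes, columns are predictions."""
    cm = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        cm[t][p] += 1
    return cm


def add_matrices(a, b):
    """Element-wise sum, used to pool matrices over the whole split."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def per_class_iou(cm):
    """IoU_c = TP / (TP + FP + FN); None when the class has no ground-truth pixels."""
    n = len(cm)
    ious = []
    for c in range(n):
        tp = cm[c][c]
        fn = sum(cm[c]) - tp
        fp = sum(cm[r][c] for r in range(n)) - tp
        ious.append(tp / (tp + fp + fn) if tp + fn > 0 else None)
    return ious


def mean_iou(ious, exclude=()):
    """Macro average that skips excluded classes and classes without support."""
    vals = [v for c, v in enumerate(ious) if v is not None and c not in exclude]
    return sum(vals) / len(vals)


# Toy data (invented): flattened label vectors for two images, three classes.
images = [
    ([0, 0, 0, 1, 1, 2], [0, 0, 1, 1, 1, 2]),
    ([0, 0, 1, 2, 2, 2], [0, 0, 1, 2, 2, 0]),
]
pooled = [[0] * 3 for _ in range(3)]
for y_true, y_pred in images:
    pooled = add_matrices(pooled, confusion_matrix(y_true, y_pred, 3))

ious = per_class_iou(pooled)
miou_main = mean_iou(ious, exclude=(0,))  # foreground classes only
miou_all = mean_iou(ious)                 # all classes, background included
print(round(miou_main, 3), round(miou_all, 3))
```

Pooling before computing IoU weights every pixel equally, so images with little foreground contribute less than they would under a per-image average.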

4.3. Ablation Study Design

Four training-protocol factors are ablated in isolation using U-Net with EfficientNet-B3 as the reference model, varying one factor at a time with all others fixed. Most factors use two seeds per variant; the pretraining factor uses three. The optimal setting from each factor is carried forward to the main benchmark. The four factors are: (1) loss function, (2) encoder family, (3) input resolution, and (4) ImageNet pretraining.

4.4. Oracle-Filter Analysis

The null conditioning result admits two interpretations: either component predictions are too noisy to carry useful information to the damage branch (hypothesis H1), or component layout is genuinely uninformative for damage (hypothesis H2). To distinguish between these hypotheses, an oracle filter evaluates damage segmentation on the subset of test images where component predictions are demonstrably reliable.

For each seed, the soft-detach model’s foreground component mIoU (${mIoU}_{main}^{comp}$, where the superscript comp denotes the component task) is computed separately on each test image. Images are partitioned into a good-component subset (${mIoU}_{main}^{comp} \geq \tau$) and its complement at two quality thresholds, $\tau \in \{0.7, 0.8\}$. Damage mIoU is then computed separately for each subset using pooled confusion matrices, to match the training evaluation convention. The soft-detach and damage-only models are compared on the good-component subset: if H1 were correct, the conditioned model should outperform the unconditioned model when component predictions are accurate.
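The partition-then-pool procedure can be sketched as below. The record fields `comp_miou` and `dmg_cm`, the function names, and the three toy records are hypothetical and serve only to show the mechanics; class index 0 is assumed to be Nondamage.

```python
def pooled_foreground_miou(matrices, background=0):
    """Sum per-image confusion matrices, then average IoU over foreground classes."""
    n = len(matrices[0])
    cm = [[sum(m[r][c] for m in matrices) for c in range(n)] for r in range(n)]
    ious = []
    for c in range(n):
        if c == background:
            continue
        tp = cm[c][c]
        fn = sum(cm[c]) - tp
        fp = sum(cm[r][c] for r in range(n)) - tp
        if tp + fn > 0:  # classes without ground-truth support are skipped
            ious.append(tp / (tp + fp + fn))
    return sum(ious) / len(ious) if ious else None


def oracle_filter(records, tau):
    """Split images by per-image component mIoU and score damage on each subset."""
    good = [r["dmg_cm"] for r in records if r["comp_miou"] >= tau]
    bad = [r["dmg_cm"] for r in records if r["comp_miou"] < tau]
    return (
        pooled_foreground_miou(good) if good else None,
        pooled_foreground_miou(bad) if bad else None,
    )


# Toy records (invented): per-image component mIoU and damage confusion matrix.
records = [
    {"comp_miou": 0.90, "dmg_cm": [[8, 1, 0], [1, 3, 0], [0, 0, 2]]},
    {"comp_miou": 0.85, "dmg_cm": [[9, 0, 1], [0, 2, 0], [1, 0, 1]]},
    {"comp_miou": 0.60, "dmg_cm": [[5, 0, 0], [2, 4, 0], [0, 1, 3]]},
]
good_miou, bad_miou = oracle_filter(records, tau=0.8)
print(round(good_miou, 3), round(bad_miou, 3))
```

Running the same function on the confusion matrices of the conditioned and the unconditioned model, with the partition fixed by the conditioned model's component scores, gives the comparison used to separate H1 from H2.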

4.5. Synthetic-to-Real Transfer and Zero-Shot Baseline

The three single-task ConvNeXt-Seg component checkpoints (seeds 1–3) from the main benchmark are evaluated on 16 real viaduct photographs with hand-annotated component masks, using the same resize-pad-normalise preprocessing pipeline and no test-time augmentation. Because real photographs are available for the component task only, damage transfer is not evaluated on real images; the reported domain gap is therefore specific to structural component segmentation.
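The geometry of a resize-pad step can be sketched as follows. The text specifies only that a resize-pad-normalise pipeline is used; the aspect-preserving resize, the symmetric padding, and the function name `resize_pad_geometry` are assumptions made for illustration.

```python
def resize_pad_geometry(height, width, target_h=512, target_w=896):
    """Return the resized size and the (top, bottom, left, right) padding."""
    # Assumption: the image is scaled uniformly so that it fits inside the target.
    scale = min(target_h / height, target_w / width)
    new_h = min(target_h, round(height * scale))
    new_w = min(target_w, round(width * scale))
    pad_h, pad_w = target_h - new_h, target_w - new_w
    # Assumption: padding is split evenly; any odd pixel goes to the bottom/right.
    top, left = pad_h // 2, pad_w // 2
    return (new_h, new_w), (top, pad_h - top, left, pad_w - left)


print(resize_pad_geometry(1080, 1920))
```

Under these assumptions a 1080×1920 photograph receives only a thin band of padding, which would be the sole source of Nonbridge pixels introduced by preprocessing.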

To establish a training-free lower bound, the GroundingDINO + SAM pipeline (GroundedSAM) [42,43] is applied to the same 16 photographs in a zero-shot setting. GroundingDINO receives text prompts describing each structural class and outputs bounding boxes; SAM ViT-H generates pixel-level masks for each box. Five prompt variants are evaluated and the best-performing prompt set is used for the comparative results reported in Table 13 (Section 5.6).

5. Results

This section first presents the completed ablation studies that establish the best training recipe, then reports damage-segmentation performance on the synthetic Tokaido test split for the available single-task baselines, examines whether component-aware joint modelling improves damage localisation, and finally assesses how well component segmentation transfers to real photographs.

Unless noted otherwise, values reported in the form $a \pm b$ in this section denote mean ± one standard deviation over repeated runs with different random seeds.
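The aggregation can be sketched in a few lines. The text does not state whether the sample or the population standard deviation is used; the sketch assumes the sample form (divisor n − 1), and the three scores are placeholders rather than values from the experiment log.

```python
import statistics


def summarise(values):
    """Return (mean, sample standard deviation) over per-seed scores."""
    return statistics.mean(values), statistics.stdev(values)


# Placeholder per-seed scores (not taken from the paper).
mean_score, std_score = summarise([0.50, 0.52, 0.54])
report = f"{mean_score:.3f} ± {std_score:.3f}"
print(report)
```

With only three seeds, the choice between `statistics.stdev` and `statistics.pstdev` changes the reported spread by a factor of about 1.22, so the convention is worth stating alongside the results.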

5.1. Ablation Studies

The four recipe-selection ablations (loss, encoder, resolution, pretraining) use U-Net with EfficientNet-B3 as the controlled baseline model, varying one factor at a time while keeping all other settings fixed (512×896 resolution, ImageNet-pretrained encoder, ce_dice loss). Most factors use two seeds per variant; the pretraining factor uses three. This single-architecture design is standard practice: the ablation findings establish the best training recipe, which is then applied uniformly to all seven architectures in the main benchmark — the ablation results themselves are not part of the architecture comparison.

Loss function.

Table 2 compares six loss functions on the damage task. Weighted cross-entropy plus Dice (weighted_ce_dice) achieves the highest foreground mIoU (0.535), followed by Tversky (0.526) and unweighted cross-entropy plus Dice (ce_dice, 0.523). Focal Tversky (focal_tversky, 0.498), plain cross-entropy (ce, 0.491), and focal cross-entropy (focal_ce, 0.467) rank lowest; for the focal variants this is likely because the focal exponent suppresses the already-dominant Nondamage class while simultaneously reducing the gradient signal on the rare ExposedRebar class. Based on these results, weighted_ce_dice is adopted as the default loss for all subsequent single-task experiments. The 0.068 gap between the best and the worst loss indicates that the handling of class imbalance is a significant tuning lever for this task.
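For orientation, a weighted cross-entropy plus Dice objective can be written out on a handful of pixels. This is a plain-Python sketch under stated assumptions, not the training code: the class weights, the equal weighting of the two terms, the normalisation of the cross-entropy by the sum of applied weights, and the smoothing constant `eps` are all illustrative choices.

```python
import math


def weighted_ce_dice(probs, targets, class_weights, eps=1e-6):
    """probs: per-pixel lists of class probabilities; targets: per-pixel class ids."""
    n_classes = len(class_weights)
    # Weighted cross-entropy, normalised by the sum of the applied weights.
    ce_sum = sum(
        -class_weights[t] * math.log(max(p[t], eps)) for p, t in zip(probs, targets)
    )
    ce = ce_sum / sum(class_weights[t] for t in targets)
    # Soft Dice loss per class, macro-averaged.
    dice_losses = []
    for c in range(n_classes):
        inter = sum(p[c] for p, t in zip(probs, targets) if t == c)
        denom = sum(p[c] for p in probs) + sum(1 for t in targets if t == c)
        dice_losses.append(1.0 - (2.0 * inter + eps) / (denom + eps))
    return ce + sum(dice_losses) / n_classes


weights = [0.2, 1.0, 2.0]  # hypothetical: background down-weighted, rare class up-weighted
targets = [0, 1, 2]
perfect = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
uncertain = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.2, 0.2, 0.6]]
loss_perfect = weighted_ce_dice(perfect, targets, weights)
loss_uncertain = weighted_ce_dice(uncertain, targets, weights)
print(loss_perfect, loss_uncertain)
```

The class weights enter only the cross-entropy term, while the Dice term is insensitive to the number of background pixels that are classified correctly; this is one reason such combinations are commonly used under strong class imbalance.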

Encoder backbone.

Table 3 shows a clear capacity gradient across SMP-compatible encoders. EfficientNet-B5 attains the highest foreground mIoU (0.547), followed by B3 (0.526), B0 (0.488), and ResNet-50 (0.418). The 0.129 gap between EfficientNet-B5 and ResNet-50 confirms that encoder width and depth matter substantially for this task. EfficientNet-B3 is retained as the default in the main benchmark to balance accuracy and computational cost; B5 is noted as the stronger option when inference latency and memory are not a constraint. Because the decoder and loss function are held fixed in this ablation, the ranking isolates the contribution of the encoder.

Input resolution.

Table 4 reveals a monotonic improvement with resolution: foreground mIoU increases from 0.398 at $256 \times 448$ to 0.549 at $640 \times 1120$, a gain of 0.151 over the tested range. The sharpest step occurs between $384 \times 672$ (0.464) and $512 \times 896$ (0.536), suggesting that crack and rebar textures are partially lost below the $512 \times 896$ threshold. The standard benchmark resolution is $512 \times 896$. This sensitivity underscores the fact that fine-scale crack and rebar features require sufficient spatial detail, a baseline condition that the full benchmark satisfies.

ImageNet pretraining.

Table 5 quantifies the impact of encoder initialisation. Pretrained weights yield a foreground mIoU of 0.519 versus 0.264 from random initialisation, a relative drop of almost 50%. Figure 3 confirms visually that the scratch model produces fragmented, noisy predictions with poor class separation, while the pretrained model already localises damage reliably. Pretrained encoders are used throughout all experiments.
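The relative drop quoted for Table 5 follows directly from the two reported scores:

```latex
\[
\frac{0.519 - 0.264}{0.519} = \frac{0.255}{0.519} \approx 0.491 \quad (\text{about } 49\,\%).
\]
```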

5.2. Damage Segmentation on the Synthetic Test Split

All seven architectures are trained using the recipe identified in the ablation studies: weighted_ce_dice loss, EfficientNet-B3 encoder for the three SMP-based models (U-Net, UNet++, DeepLabV3+), $512 \times 896$ input resolution, and ImageNet-pretrained encoder weights. Table 6 reports mean test metrics per architecture (weighted CE + Dice loss, seeds 1–3). SegFormer achieves the highest foreground ${mIoU}_{main}$ ($0.569 \pm 0.013$) and macro Dice (0.815), closely followed by ConvNeXt-Seg ($0.562 \pm 0.006$, 0.811). The two lightweight EfficientNet-based decoders, UNet++ and U-Net, form a second tier in the range 0.533–0.538. DeepLabV3+ is notably weaker (0.477) despite using the same EfficientNet-B3 encoder, reflecting the limitations of its atrous spatial pyramid pooling (ASPP)-based decoder for fine-grained damage textures. ViT-Seg underperforms all CNN models (0.362) in spite of its large parameter count, consistent with the known sensitivity of plain ViT architectures to dataset size and high-resolution fine-scale features. Swin-UPerNet fails to converge meaningfully on the damage task (${mIoU}_{main} = 0.021 \pm 0.011$): in all 12 runs, the model collapses to predicting the Nondamage class almost exclusively, with ExposedRebar IoU $\approx 0.000$. This behaviour is consistent with the known instability of the Swin-UPerNet training pipeline under strong class imbalance, even with weighted losses.

Table 7 reports the mean per-class IoU for each architecture, averaged over all available runs (weighted CE + Dice; run counts match Table 6). The Nondamage class is near-perfectly solved ($\approx 0.99$) across all converging models; the remaining challenge lies entirely in the two foreground classes. Concrete is generally the easier foreground class (0.359–0.602), while ExposedRebar is harder (0.366–0.535) because rebar pixels are sparse and visually thin. The gap between Concrete and ExposedRebar is small for strong models (SegFormer: 0.602 vs. 0.535) and vanishes for ViT-Seg (0.359 vs. 0.366), where neither class is well localised. Swin-UPerNet effectively predicts zero ExposedRebar IoU in all runs, confirming its collapse to the majority class.

The strongest individual checkpoint is SegFormer with Tversky loss (seed 3), reaching ${mIoU}_{main} = 0.583$, ${mIoU}_{all} = 0.720$, macro Dice $= 0.823$, and 19.2 FPS, with class-wise IoUs of 0.994 (Nondamage), 0.607 (Concrete), and 0.559 (ExposedRebar). The best weighted CE + Dice run overall is SegFormer seed 3 at foreground mIoU 0.577, followed closely by SegFormer seed 2 (0.576) and ConvNeXt-Seg seed 3 (0.574).
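The two summary scores of the strongest checkpoint can be recovered from its class-wise IoUs, which illustrates how the foreground and full-class means differ:

```latex
\begin{align*}
{mIoU}_{main} &= \tfrac{1}{2}\,(0.607 + 0.559) = 0.583, \\
{mIoU}_{all}  &= \tfrac{1}{3}\,(0.994 + 0.607 + 0.559) = 0.720.
\end{align*}
```

The near-perfect Nondamage IoU raises the full-class mean by roughly 0.14, which is why the foreground mean is the primary metric.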

Qualitative results are shown in Figure 4.

5.3. Component Segmentation on the Synthetic Test Split

Table 8 reports component segmentation metrics on the synthetic test split for all seven architectures (weighted CE + Dice loss, seeds 1–3). Unlike the damage task, all seven architectures converge on the component task, and six of them achieve foreground ${mIoU}_{main}$ between 0.937 and 0.957. ConvNeXt-Seg leads at $0.957 \pm 0.001$, followed by SegFormer ($0.951 \pm 0.001$); the three EfficientNet-based decoders cluster tightly in the 0.946–0.948 range. ViT-Seg reaches $0.937 \pm 0.0003$, and Swin-UPerNet, while far behind on damage, still attains $0.832 \pm 0.014$ on components, indicating that its failure on the damage task is specific to the highly imbalanced damage label distribution rather than a general inability to learn structural features.

These high component scores motivate the joint-architecture ablation in Section 5.4: a component branch trained on this task provides reliable structural-layout estimates (${mIoU}_{main}^{comp} \approx 0.92$–$0.96$), making the question of whether such estimates help damage detection well-posed and testable.

5.4. Component-Aware Joint Architecture Ablation

We explicitly test the hypothesis that component conditioning improves damage localisation beyond what a shared encoder and auxiliary component supervision already provide. All five conditioning variants (damage-only, parallel-heads, hard-mask, soft-detach, soft-full; see Section 3.5) use the ConvNeXt-Seg architecture (ConvNeXt-Base encoder, UPerNet decoder), which achieved the highest component mIoU among single-task models. Table 9 reports the test-set damage and component metrics for all five variants, averaged over three seeds. The table does not support the initial hypothesis. All five variants remain within a narrow damage-performance band, even though prior bridge-inspection studies report gains from cross-task interaction in other settings [4,5]. Tokaido therefore serves as a controlled counterexample to any universal claim that component conditioning automatically improves bridge-damage segmentation.

Qualitative predictions across the five variants are shown in Figure 5, and component-segmentation results are illustrated in Figure 6. Two observations stand out. First, all five variants attain very similar damage foreground mIoU, ranging from 0.553 to 0.561, a spread of 0.008. The best value, soft-detach at $0.5609 \pm 0.0086$, exceeds the damage-only baseline ($0.5600 \pm 0.0060$) by only 0.0009, which is roughly an order of magnitude smaller than the per-variant seed standard deviations (0.0045–0.0112). No conditioning variant therefore shows a statistically meaningful damage gain over the baseline. This is the central falsification result of the paper. Second, the component branch delivers consistently strong component segmentation across all four variants that include it, with ${mIoU}_{main}^{comp}$ between 0.916 and 0.930. Joint training therefore achieves high-quality component segmentation, and the null result cannot be explained by failure of the component task.

Relative to the damage-only variant (19.5 ms per image, 51.4 FPS), adding the component branch increases damage inference cost to 26.9–28.0 ms per image (35.8–37.2 FPS), i.e., a 38–44% latency increase without a measurable accuracy gain.
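The latency figures can be cross-checked with a few lines of arithmetic. The helper names are illustrative. Throughput recomputed from the rounded millisecond values differs slightly from the reported FPS in the last digit, presumably because the reported values were derived from unrounded timings; that explanation is an assumption.

```python
def fps(latency_ms):
    """Frames per second implied by a per-image latency in milliseconds."""
    return 1000.0 / latency_ms


def relative_increase(baseline_ms, new_ms):
    """Fractional latency increase relative to the baseline."""
    return (new_ms - baseline_ms) / baseline_ms


baseline_ms = 19.5  # damage-only variant
for joint_ms in (26.9, 28.0):  # fastest and slowest component-aware variants
    pct = 100 * relative_increase(baseline_ms, joint_ms)
    print(f"{joint_ms} ms: {fps(joint_ms):.1f} FPS, +{pct:.0f}% latency")
```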

In practical terms, the component-aware variants first estimate which structural component each pixel belongs to (for example, slab, beam, or column) and then allow the damage branch to use that information. If damage were strongly tied to component type, these variants should outperform the damage-only baseline. They do not. The evidence instead suggests that, in the Tokaido dataset, damage localisation depends mainly on local surface appearance cues such as cracking, spalling texture, and exposed rebar, rather than on whether the damaged pixel lies on a slab, beam, or column. Read in the context of the bridge-inspection literature, this is best understood as a dataset-dependent negative result rather than as a contradiction of all prior component-aware work: some datasets encode stronger correlations between structural role and damage type, whereas Tokaido does not.

Oracle-filter analysis.

To test whether imperfect component predictions are masking a genuine conditioning benefit, the test set is re-evaluated on subsets where the soft-detach component branch is reliable. At threshold $\tau = 0.8$, the good-component subset contains 138–140 images depending on seed. On this subset, soft-detach achieves ${mIoU}_{main}^{dmg} = 0.522 \pm 0.016$, compared with $0.526 \pm 0.010$ for damage-only and $0.523 \pm 0.017$ for parallel-heads. The same pattern holds at the looser threshold $\tau = 0.7$ (soft-detach $0.539 \pm 0.010$, damage-only $0.544 \pm 0.019$, soft-detach minus damage-only difference $= -0.005$). The null result therefore persists even when conditioning quality is explicitly filtered: with reliable component predictions there is still no statistically meaningful improvement in damage localisation, which argues against unreliable component predictions as the explanation for the lack of improvement.

Notably, all three variants show the same damage-score drop on the good-component subset relative to its complement ($\Delta[\text{good} - \text{bad}] \approx -0.05$ to $-0.06$). This is a dataset-level structural effect rather than a consequence of conditioning: images where structural components are easily identifiable in the Tokaido synthetic test set tend to be geometrically simple, largely undamaged scenes with fewer foreground damage pixels, yielding intrinsically lower ${mIoU}_{main}^{dmg}$. Readers should therefore not interpret the lower subset-level scores as evidence that conditioned models perform worse on easier images; the difference reflects the distribution of damage content across subsets, not a conditioning effect.

5.5. Component Segmentation on Real Photographs

To assess synthetic-to-real transfer of the component model, the three single-task ConvNeXt-Seg component checkpoints from seeds 1–3 are evaluated on the 16 real viaduct photographs with hand-annotated component masks. Inference uses the same resize-pad-normalise pipeline as the synthetic evaluation; results are reported as global pooled IoU (matching the training metric).

On the 16 real photographs, the three checkpoints reach ${mIoU}_{main} = 0.424 \pm 0.064$ (Table 10). The same checkpoints achieve $0.957 \pm 0.001$ on the synthetic test split (Table 8), giving an absolute domain gap of 0.533. Because the real-photo evaluation reuses the exact benchmark checkpoints, this gap is a matched model comparison for the component task rather than a cross-model comparison. The per-class breakdown shows the following patterns.

Column generalises best (mean IoU = 0.665, standard deviation = 0.066): columns have a distinctive elongated vertical geometry and uniform material appearance that transfer from the photo-realistic Tokaido renders.

Slab is the highest-variance class (IoU range 0.46–0.73, standard deviation = 0.118), indicating that seed-to-seed variation in the synthetic optimum translates to large differences in cross-domain performance for this class. Slab occupancy varies substantially across real images, amplifying inter-seed variability.

Beam achieves moderate IoU (mean = 0.427, standard deviation = 0.016), consistent across seeds. Beam appearances vary considerably with bridge type and viewing angle, making generalisation harder than for columns.

Nonstructural is near zero (mean IoU = 0.042), as expected: this class aggregates heterogeneous elements (railings, cables, signage) that are underrepresented in training and visually diverse in the field.

Nonbridge IoU is trivially low ($\approx 0.01$), reflecting the small amount of padding-induced background introduced by resize-pad preprocessing. This class is uninformative in the real evaluation.

Overall, the component model demonstrates meaningful cross-domain generalisation for structurally salient classes (Column, Slab) while showing limited transfer for heterogeneous or underrepresented classes. Closing the domain gap would require domain-adaptive training or fine-tuning on annotated real imagery.

5.6. Zero-Shot GroundedSAM Baseline

To provide a training-free lower bound for real-image structural segmentation, we evaluate the GroundingDINO + SAM pipeline [42,43] (“GroundedSAM”) in a zero-shot setting on the same 16 real viaduct photographs used in Section 5.5. The real photographs carry component annotations only; no real-image damage annotations are available, so this evaluation covers the structural task exclusively. This experiment should not be read as a full evaluation of SAM-based inspection methods. It isolates what an off-the-shelf prompt-driven pipeline can deliver without task-specific adaptation. Trainable or refined SAM-based frameworks can be materially stronger in structural-inspection settings [27,28]; the present comparison is therefore intended only as a zero-shot lower bound.

Zero-shot means that the model is applied directly to the target task without any task-specific training or fine-tuning: it generalises purely from its large-scale pre-training on diverse image–text data. The GroundedSAM pipeline operates in two stages. First, GroundingDINO [43] is a vision–language model that takes a natural-language text prompt as input (e.g., “pillar. concrete pillar. support column”) and outputs bounding boxes in the image that spatially correspond to the described concept, together with a confidence score. Second, the Segment Anything Model (SAM) [42] takes each bounding box as a spatial prompt and produces a pixel-precise segmentation mask for the corresponding region. The two stages are therefore complementary: GroundingDINO localises “what” in the image matches the text, and SAM delineates “where” exactly the boundary lies. SAM ViT-H (the highest-capacity SAM variant) is used throughout. No task-specific training is performed; the only design choices are the text prompts that describe each foreground class.
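The two-stage composition can be sketched independently of any particular library. In the sketch below, `detect` and `segment` are hypothetical stand-ins for the GroundingDINO and SAM stages, passed in as plain callables; they are not the real APIs of either model. The rule that the highest-scoring box wins where masks overlap is also an assumption, since the text does not describe how overlaps are resolved.

```python
def zero_shot_label_map(image, class_prompts, detect, segment, height, width, background=0):
    """Compose per-class masks into a single label map.

    detect(image, prompt) -> list of (box, score)  # stand-in for the grounding stage
    segment(image, box)   -> set of (row, col)      # stand-in for the mask stage
    """
    label_map = [[background] * width for _ in range(height)]
    best_score = [[0.0] * width for _ in range(height)]
    for class_id, prompt in class_prompts.items():
        for box, score in detect(image, prompt):
            for r, c in segment(image, box):
                if score > best_score[r][c]:  # assumption: highest-scoring box wins
                    best_score[r][c] = score
                    label_map[r][c] = class_id
    return label_map


# Stub stages (invented) so that the sketch runs without any model weights.
def fake_detect(image, prompt):
    boxes = {"slab": [((0, 0, 1, 3), 0.5)], "column": [((0, 0, 2, 1), 0.9)]}
    return boxes[prompt]


def fake_segment(image, box):
    r0, c0, r1, c1 = box
    return {(r, c) for r in range(r0, r1) for c in range(c0, c1)}


label_map = zero_shot_label_map(
    None, {1: "slab", 3: "column"}, fake_detect, fake_segment, height=2, width=3
)
print(label_map)
```

Pixels covered by no mask keep the background label, so prompt wording affects the result only through the boxes returned by the first stage.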

Prompt sensitivity.

Because GroundingDINO is sensitive to prompt wording, five prompt variants are tested for the structural task (Table 11 and Table 12). The baseline prompts use extended bridge-specific phrases; the remaining variants progressively simplify or replace the vocabulary.

The results reveal nuanced prompt sensitivity (Table 12). The common-vocabulary variant v3 achieves the highest overall ${mIoU}_{main}$ (0.250), followed by v2 single terms (0.219) and the bridge-specific baseline v0 (0.195). The descriptive-geometry variant v1 (0.106) and the combined extended phrases v4 (0.118) are substantially weaker, suggesting that verbose multi-component descriptions impair GroundingDINO’s grounding even when the individual terms are semantically accurate. Prompt conciseness appears to matter as much as vocabulary choice: the two weakest variants are precisely those with the longest, most composite prompt strings. The common-vocabulary variant v3 is used as the best structural prompt set in the benchmark results below. Figure 8 shows qualitative predictions for all tested variants.

Zero-shot benchmark results.

Table 13 reports GroundedSAM performance on the structural task alongside the trained ConvNeXt-Seg component model from Section 5.5. Real-damage annotations are not available for the 16 real photographs; the GroundedSAM evaluation therefore covers the structural task only.

Table 13. GroundedSAM zero-shot structural evaluation on 16 real photographs (prompt variant v3, common vocabulary). The trained-model row is the mean over three seeds from Table 10. ${mIoU}_{main}$ excludes Nonbridge.

Model	Slab	Beam	Column	Nonstructural	${mIoU}_{main}$
Structural task
GroundedSAM (zero-shot, v3)	0.338	0.133	0.491	0.038	0.250
ConvNeXt-Seg (trained, mean)	0.562	0.427	0.665	0.042	0.424

GroundedSAM achieves ${mIoU}_{main} = 0.250$ on structural components without any task-specific training, establishing a zero-shot lower bound for real-image structural segmentation. These numbers should be interpreted narrowly: they quantify the performance of a prompt-based zero-shot GroundingDINO + SAM workflow on our real-image subset, not the ceiling of SAM-derived methods for infrastructure inspection. Figure 7 provides a qualitative comparison of both models on real photographs. The trained component model outperforms GroundedSAM on all structural classes, with the largest advantages on Beam (0.427 vs 0.133) and Slab (0.562 vs 0.338); the margin on Nonstructural is negligible (0.042 vs 0.038). The overall trained-vs-zero-shot gap is 0.174 in ${mIoU}_{main}$.
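The summary column of Table 13 and the per-class margins can be reproduced from the tabulated IoUs. The sketch below is a simple arithmetic check; the variable names are illustrative.

```python
CLASSES = ["Slab", "Beam", "Column", "Nonstructural"]
grounded_sam = [0.338, 0.133, 0.491, 0.038]  # zero-shot, prompt variant v3
convnext_seg = [0.562, 0.427, 0.665, 0.042]  # trained model, mean over three seeds


def mean(values):
    return sum(values) / len(values)


miou_zero_shot = mean(grounded_sam)
miou_trained = mean(convnext_seg)
# Per-class advantage of the trained model over the zero-shot baseline.
per_class_gap = {
    name: round(trained - zero_shot, 3)
    for name, trained, zero_shot in zip(CLASSES, convnext_seg, grounded_sam)
}

print(round(miou_zero_shot, 3), round(miou_trained, 3))
print(round(miou_trained - miou_zero_shot, 3))
print(per_class_gap)
```

The foreground mean is an unweighted average over the four classes, so the near-zero Nonstructural score lowers both models' ${mIoU}_{main}$ by a similar amount.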

6. Discussion

The benchmark separates two questions that are often conflated in inspection segmentation: which architecture is the strongest for a given task, and whether structural component information is useful for damage localisation. The architecture comparison shows that these answers are task-dependent. SegFormer gives the strongest damage segmentation result, while ConvNeXt-Seg gives the strongest component segmentation result. This difference is consistent with the visual character of the two tasks. Damage segmentation is dominated by sparse, thin, texture-level foreground classes, where the SegFormer-style multi-scale representation is competitive despite its moderate parameter count. Component segmentation is closer to scene parsing: slabs, beams, and columns occupy larger coherent regions, and the ConvNeXt-Seg model captures this structural layout more effectively on the synthetic test split. The practical implication is that a single architecture ranking is not sufficient for inspection systems that must solve both component and damage segmentation.

The component-aware ablation provides the central negative result of this paper. Adding a component branch, feeding a hard predicted component mask or detached component features into the damage segmentation module, or allowing full end-to-end feature conditioning did not produce a measurable damage gain over the damage-only baseline. The observed spread across variants is comparable to the run-to-run variation reported for the same variants, and the best conditioned model improves on the baseline by only 0.0009 foreground mIoU. The oracle-filter analysis makes the interpretation more specific: even on images where component predictions are reliable, the conditioned model does not outperform the unconditioned model. This argues against the explanation that component-aware conditioning failed simply because the component branch was inaccurate. The more plausible interpretation for this dataset is that the damage labels are driven mainly by local surface appearance, such as cracking and exposed rebar texture, rather than by whether the pixel lies on a slab, beam, or column.

This negative result should be read as dataset- and label-specific. It does not contradict bridge-inspection studies where joint component and damage learning improves performance, because those studies do not necessarily isolate explicit component conditioning from shared supervision and may involve damage categories that correlate more strongly with structural role. The present design tests a narrower claim: whether a damage segmentation model that already receives the original image as input benefits from being provided with predicted component information under a controlled Tokaido benchmark. For this claim, the evidence is consistently negative. The tested fusion module is intentionally simple (a single local Conv-BN-ReLU block that injects masks or feature maps at decoder resolution), so the result falsifies the hypothesis about damage-segmentation gains specifically for this controlled local-fusion design, not for richer conditioning architectures such as cross-attention or spatial routing. The cost side reinforces the same conclusion: the component-aware variants increase inference latency by 38–44% relative to the damage-only model while leaving damage accuracy effectively unchanged.

The real-photograph component experiment shows a different but equally important boundary. The trained ConvNeXt-Seg component model outperforms the zero-shot GroundedSAM baseline on the 16 annotated real photographs, indicating that synthetic task-specific training transfers useful structural information. At the same time, the drop from synthetic component performance to real-photo component performance is large, and the weakest real-photo class is the heterogeneous Nonstructural category. The zero-shot result should therefore be interpreted as a lower bound for prompt-based segmentation without adaptation, not as a ceiling for foundation-model approaches. Adapted SAM-style models or fine-tuned component segmenters may close part of this gap, but the current experiment shows that synthetic training alone does not remove the domain shift.

Progress relative to prior work.

The present benchmark substantially extends the earlier study [13], which evaluated only U-Net and DeepLabV3+ on the same Tokaido dataset at a lower resolution of $320 \times 160$ pixels using a custom Keras/TensorFlow implementation without repeated-seed evaluation or systematic loss ablation. Table 14 summarises the key quantitative improvements.

The two studies differ in resolution, class count, evaluation split, framework, and architecture scope, so no single factor can be isolated as the cause of any difference; the values in Table 14 are therefore descriptive rather than causal. On the component task, the previous study reported validation mIoU of 76% (U-Net) and 87% (DeepLabV3+) over four classes at $320 \times 160$. The present benchmark, evaluated on the held-out test set at $512 \times 896$ with five classes and repeated seeds, reports ${mIoU}_{main}$ of 0.947 (U-Net) and 0.946 (DeepLabV3+), with the best architecture (ConvNeXt-Seg) reaching 0.957. On the damage task, the previous study reported per-class IoU of 48% for Cracks and 44% for Reinforcement (U-Net, Tversky loss); the best single run here (SegFormer, Tversky loss, seed 3) achieves ${mIoU}_{main}^{dmg} = 0.583$, and the three-seed SegFormer mean under weighted CE + Dice is 0.569. The metrics are not directly comparable (per-class IoU versus foreground mIoU, validation versus test), so these figures serve as historical context only. On real-image transfer, the prior study reported a four-class foreground mean of approximately 18%; the present ${mIoU}_{main}^{comp} = 0.424$ on the same 16 photographs reflects simultaneous changes in class set, resolution, and training protocol, making direct comparison difficult.

The main limitation on the conclusions of the present study is scope. The conditioning conclusion is restricted to the Tokaido synthetic damage labels and to the tested ConvNeXt-based conditioning mechanisms. It should not be generalised to all bridge datasets or all component-aware designs, especially where damage type is physically tied to structural role. The real-photo evaluation is also restricted to component segmentation on 16 annotated photographs; no real-image damage masks are available in this study, so real-world damage transfer cannot be assessed. Future work should therefore focus on two targeted extensions: real-image fine-tuning or domain adaptation for component segmentation, and conditioning tests on datasets where component labels and damage labels have stronger physical coupling. These extensions would test whether the reported null result follows from the properties of the Tokaido dataset or is a broader limitation of explicit component conditioning for fine-scale damage segmentation.

7. Conclusion

This study evaluated structural component and damage segmentation for railway viaduct inspection under a controlled synthetic-data benchmark. Seven segmentation architectures were compared on the Tokaido dataset, and a separate component-aware ablation tested whether explicit component information improves damage localisation. The strongest damage segmentation model was SegFormer, with foreground ${mIoU}_{main} = 0.569 \pm 0.013$, while the strongest component segmentation model was ConvNeXt-Seg, with foreground ${mIoU}_{main} = 0.957 \pm 0.001$. These task-dependent rankings show that architecture choice cannot be reduced to a single best model across structural and damage segmentation.

The central result is that explicit component conditioning did not improve damage segmentation in this setting. Across the five ConvNeXt-based variants (the damage-only baseline and four component-aware designs), the full range of foreground damage mIoU was 0.008, and the best conditioned variant improved over the damage-only baseline by only 0.0009. The oracle-filter analysis reached the same conclusion on images where component predictions were reliable: conditioned and unconditioned damage models remained statistically indistinguishable. This evidence supports a narrow but important conclusion: for the Tokaido damage labels tested here, component identity does not provide a measurable localisation benefit beyond the image evidence already available to a damage model. It does not imply that component-aware inspection is generally ineffective; rather, the benefit depends on whether the dataset encodes a meaningful relationship between structural element type and damage occurrence.

The synthetic-to-real evaluation showed both the value and the limit of task-specific training. A ConvNeXt-Seg component model trained on synthetic Tokaido imagery reached ${mIoU}_{main} = 0.424 \pm 0.064$ on 16 real viaduct photographs, compared with 0.250 for the zero-shot GroundedSAM baseline. At the same time, the drop from $0.957 \pm 0.001$ on synthetic component segmentation to $0.424 \pm 0.064$ on real photographs quantifies a substantial domain gap. Future work should therefore prioritise real-image adaptation, larger real-photo test sets, and datasets where component labels and damage labels have stronger physical coupling. The benchmark and ablation results reported here provide reference baselines for that next stage, while showing that component-aware conditioning should be treated as an empirical design choice rather than an assumed source of improvement.

Author Contributions

Conceptualization, [Bartłomiej Błachowski, Piotr Tauzowski]; methodology, [Piotr Tauzowski]; software, [Piotr Tauzowski]; validation, [Bartłomiej Błachowski, Paweł Hołobut]; formal analysis, [Paweł Hołobut]; investigation, [Bartłomiej Błachowski]; data curation, [Piotr Tauzowski]; writing—original draft preparation, [Piotr Tauzowski]; writing—review and editing, [Piotr Tauzowski, Paweł Hołobut]; visualization, [Piotr Tauzowski]; supervision, [Bartłomiej Błachowski]. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The Tokaido dataset used here is publicly available. Trained model checkpoints and evaluation scripts are available from the corresponding author upon reasonable request.

Conflicts of Interest

The authors declare no conflicts of interest.

References

Chu, H.; Wang, W.; Deng, L. Tiny-Crack-Net: A multiscale feature fusion network with attention mechanisms for segmentation of tiny cracks. Computer-Aided Civil and Infrastructure Engineering 2022. [CrossRef]
Narazaki, Y.; Hoskere, V.; Hoang, T.A.; Spencer, B.F., Jr. Automated Vision-Based Bridge Component Extraction Using Multiscale Convolutional Neural Networks. In Proceedings of the 3rd Huixian International Forum on Earthquake Engineering for Young Researchers, Urbana-Champaign, IL, USA, 2017.
Hoskere, V.; Narazaki, Y.; Hoang, T.A.; Spencer, B.F., Jr. MaDnet: Multi-Task Semantic Segmentation of Multiple Types of Structural Materials and Damage in Images of Civil Infrastructure. Journal of Civil Structural Health Monitoring 2020, 10, 757–773. [CrossRef]
Zhang, C.; Karim, M.M.; Qin, R. A Multitask Deep Learning Model for Parsing Bridge Elements and Segmenting Defect in Bridge Inspection Images. Transportation Research Record 2022, pp. 1–11.
Zhang, C.; Yin, Z.; Qin, R. Attention-Enhanced Co-Interactive Fusion Network (AECIF-Net) for Automated Structural Condition Assessment in Visual Inspection. arXiv preprint arXiv:2307.07643 2024.
Saida, T.; Rashid, M.; Nemoto, Y.; Tsukamoto, S.; Asai, T.; Nishio, M. CNN-Based Segmentation Frameworks for Structural Component and Earthquake Damage Determinations Using UAV Images. Earthquake Engineering and Engineering Vibration 2023, 22, 359–369. [CrossRef]
Bai, Z.; Zou, D.; Liu, T.; Li, K.; Luo, W.; Liao, H.; Zhou, A. Component-Aware Post-Earthquake Damage Recognition for RC Structures Using Instance Segmentation and Oriented Bounding Box Detection. Construction and Building Materials 2025, 491, 142718. [CrossRef]
Flotzinger, J.; Rösch, P.J.; Benz, C.; Ahmad, M.; Cankaya, M.; Mayer, H.; Rodehorst, V.; Oswald, N.; Braml, T. dacl-challenge: Semantic Segmentation during Visual Bridge Inspections. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision Workshops (WACVW), 2024.
Flotzinger, J.; Rösch, P.J.; Braml, T. dacl10k: Benchmark for Semantic Bridge Damage Segmentation. arXiv preprint arXiv:2309.00460 2023.
Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal Loss for Dense Object Detection. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2017, pp. 2980–2988.
Salehi, S.S.M.; Erdogmus, D.; Gholipour, A. Tversky Loss Function for Image Segmentation Using 3D Fully Convolutional Deep Networks. In Proceedings of the Machine Learning in Medical Imaging. Springer, 2017, Vol. 10541, Lecture Notes in Computer Science, pp. 379–387.
Abraham, N.; Khan, N.M. A Novel Focal Tversky Loss Function with Improved Attention U-Net for Lesion Segmentation. In Proceedings of the 2019 IEEE 16th International Symposium on Biomedical Imaging (ISBI 2019), 2019, pp. 683–687.
Tauzowski, P.; Ostrowski, M.; Bogucki, D.; Jarosik, P.; Błachowski, B. Structural Component Identification and Damage Localization of Civil Infrastructure Using Semantic Segmentation. Sensors 2025, 25, 4698. [CrossRef]
Hamishebahar, Y.; Guan, H.; So, S.; Jo, J. A Comprehensive Review of Deep Learning-Based Crack Detection Approaches. Applied Sciences 2022, 12, 1374. [CrossRef]
Ai, D.; Jiang, G.; Lam, S.K.; He, P.; Li, C. Computer Vision Framework for Crack Detection of Civil Infrastructure—A Review. Engineering Applications of Artificial Intelligence 2023, 117, 105478. [CrossRef]
Nguyen, S.D.; Tran, T.S.; Tran, V.P.; Lee, H.J.; Piran, M.J.; Le, V.P. Deep Learning-Based Crack Detection: A Survey. International Journal of Pavement Research and Technology 2023, 16, 943–967. [CrossRef]
Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional Networks for Biomedical Image Segmentation. In Proceedings of the Medical Image Computing and Computer-Assisted Intervention (MICCAI), 2015.
Zhou, Z.; Siddiquee, M.M.R.; Tajbakhsh, N.; Liang, J. UNet++: A Nested U-Net Architecture for Medical Image Segmentation. In Proceedings of the Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support. Springer, 2018, Vol. 11045, Lecture Notes in Computer Science, pp. 3–11.
Chen, L.C.; Zhu, Y.; Papandreou, G.; Schroff, F.; Adam, H. Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), 2018.
Wang, J.; Ueda, T. Automatic Damage Detection and Segmentation Using Deep Learning Algorithms in Reinforced Concrete Structure Inspections. Structural Concrete 2025, 26, 5511–5534. [CrossRef]
Liu, G.; Ding, W.; Shu, J.; Strauss, A.; Duan, Y. Two-Stream Boundary-Aware Neural Network for Concrete Crack Segmentation and Quantification. Structural Control and Health Monitoring 2023, 2023, 3301106. [CrossRef]
Xu, S.; Shen, R.; Liu, Y.; Song, Y.; Xin, R.; Liu, E.; Shi, Y. Cross-Domain Coupled Convolutional Transformer Network for Concrete Damage Detection. Structural Control and Health Monitoring 2025, 2025, 6547856. [CrossRef]
Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image Is Worth 16x16 Words: Transformers for Image Recognition at Scale. In Proceedings of the International Conference on Learning Representations (ICLR), 2021.
Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin Transformer: Hierarchical Vision Transformer Using Shifted Windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2021.
Xie, E.; Wang, W.; Yu, Z.; Anandkumar, A.; Alvarez, J.M.; Luo, P. SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), 2021.
Liu, Z.; Mao, H.; Wu, C.Y.; Feichtenhofer, C.; Darrell, T.; Xie, S. A ConvNet for the 2020s. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022.
Eltouny, K.; Sajedi, S.; Liang, X. Dmg2Former-AR: Vision Transformers with Adaptive Rescaling for High-Resolution Structural Visual Inspection. Sensors 2024, 24, 6007. [CrossRef]
Azimi, M.; Yang, T.Y. Transformer-Based Framework for Accurate Segmentation of High-Resolution Images in Structural Health Monitoring. Computer-Aided Civil and Infrastructure Engineering 2024, 39, 3670–3684. [CrossRef]
Thisanke, H.; Deshan, C.; Chamith, K.; Seneviratne, S.; Vidanaarachchi, R.; Herath, D. Semantic Segmentation Using Vision Transformers: A Survey. Engineering Applications of Artificial Intelligence 2023, 126, 106669. [CrossRef]
Zhang, L.; Lu, J.; Zheng, S.; Zhao, X.; Zhu, X.; Fu, Y.; Xiang, T.; Feng, J.; Torr, P.H.S. Vision Transformers: From Semantic Segmentation to Dense Prediction. International Journal of Computer Vision 2024, 132, 6142–6162. [CrossRef]
Shamsabadi, E.A.; Xu, C.; Rao, A.S.; Nguyen, T.; Ngo, T.; da Costa, D.D. Vision Transformer-Based Autonomous Crack Detection on Asphalt and Concrete Surfaces. Automation in Construction 2022, 140, 104316. [CrossRef]
Hadinata, P.N.; Simanta, D.; Eddy, L.; Nagai, K. Multiclass Segmentation of Concrete Surface Damages Using U-Net and DeepLabV3+. Applied Sciences 2023, 13, 2398. [CrossRef]
Chu, H.; Deng, L.; Yuan, H.; Long, L.; Guo, J. A transformer and self-cascade operation-based architecture for segmenting high-resolution bridge cracks. Automation in Construction 2024, 158, 105194. [CrossRef]
Fan, Y.; Hu, Z.; Li, Q.; Sun, Y.; Chen, J.; Zhou, Q. CrackNet: A Hybrid Model for Crack Segmentation with Dynamic Loss Function. Sensors 2024, 24, 7134. [CrossRef]
Wang, W.; Su, C. Transformer-Based Crack Segmentation for Concrete Structures in Complex Scenarios. Structural Concrete 2026, 27, 474–491. [CrossRef]
Zhong, J.; Fan, Y.; Zhao, X.; Zhou, Q.; Xu, Y. Multi-Type Structural Damage Image Segmentation via Dual-Stage Optimization-Based Few-Shot Learning. Smart Cities 2024, 7, 1888–1906. [CrossRef]
Li, Y.; Wang, Y.; Zhao, H.; Gu, H.; Yu, Y.; Bao, T.; Wei, Y.; Xiang, Z. Automated Defect Segmentation and Quantification in Concrete Structures via Unmanned Aerial Vehicle-Based Lightweight Deep Learning. Computer-Aided Civil and Infrastructure Engineering 2025, 40, 4465–4484. [CrossRef]
Sun, L.; Yang, Y.; Zhou, G.; Chen, A.; Zhang, Y.; Cai, W.; Li, L. An Integration–Competition Network for Bridge Crack Segmentation under Complex Scenes. Computer-Aided Civil and Infrastructure Engineering 2024, 39, 617–634. [CrossRef]
Dang, M.; Wang, H.; Nguyen, T.H.; Tighitz, L.; Tien, L.D.; Nguyen, T.N.; Nguyen, N.P. CDD-TR: Automated concrete defect investigation using an improved deformable transformers. Journal of Building Engineering 2023, 75, 106976. [CrossRef]
Liu, C.; Wang, P.; Wang, X.; Miao, J. Autonomous damage segmentation of post-fire reinforced concrete structural components. Advanced Engineering Informatics 2024, p. 102498. [CrossRef]
Żarski, M.; Wójcik, B.; Miszczak, J.A.; Błachowski, B.; Ostrowski, M. Computer Vision Based Inspection on Post-Earthquake with UAV Synthetic Dataset. IEEE Access 2022, 10. [CrossRef]
Kirillov, A.; Mintun, E.; Ravi, N.; Mao, H.; Rolland, C.; Gustafson, L.; Xiao, T.; Whitehead, S.; Berg, A.C.; Lo, W.Y.; et al. Segment Anything. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2023.
Liu, S.; Zeng, Z.; Ren, T.; Li, F.; Zhang, H.; Yang, J.; Jiang, Q.; Li, C.; Yang, J.; Su, H.; et al. Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection. In Proceedings of the European Conference on Computer Vision (ECCV), 2024.
Ye, Z.; Lovell, L.; Faramarzi, A.; Ninić, J. Sam-based instance segmentation models for the automation of structural damage detection. Advanced Engineering Informatics 2024, 62, 102826. [CrossRef]
Rahman, A.U.; Hoskere, V. Instance Segmentation of Reinforced Concrete Bridge Point Clouds with Transformers Trained Exclusively on Synthetic Data. Automation in Construction 2025, 173, 106067. [CrossRef]
Yao, Z.; Jiang, S.; Wang, S.; Wang, J.; Liu, H.; Narazaki, Y.; Cui, J.; Spencer, B.F., Jr. Intelligent Crack Identification Method for High-Rise Buildings Aided by Synthetic Environments. Structural Design of Tall and Special Buildings 2024, 33, e2117. [CrossRef]
Pozzer, S.; Souza, M.P.V.D.; Hena, B.; Hesam, S.; Rezayie, R.K.; Azar, E.R.; Lopez, F.; Maldague, X. Effect of different imaging modalities on the performance of a CNN: An experimental study on damage segmentation in infrared, visible, and fused images of concrete structures. NDT & E International 2022, 133, 102709. [CrossRef]
Alexander, Q.G.; Hoskere, V.; Narazaki, Y.; Maxwell, A.; Spencer, B.F., Jr. Fusion of thermal and RGB images for automated deep learning based crack detection in civil infrastructure. AI in Civil Engineering 2022, 1, 2. [CrossRef]
Xiao, T.; Liu, Y.; Zhou, B.; Jiang, Y.; Sun, J. Unified Perceptual Parsing for Scene Understanding. In Proceedings of the European Conference on Computer Vision (ECCV), 2018.
Loshchilov, I.; Hutter, F. Decoupled Weight Decay Regularization. In Proceedings of the International Conference on Learning Representations (ICLR), 2019.
Long, J.; Shelhamer, E.; Darrell, T. Fully Convolutional Networks for Semantic Segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 3431–3440. [CrossRef]
Chen, L.C.; Papandreou, G.; Schroff, F.; Adam, H. Rethinking Atrous Convolution for Semantic Image Segmentation. arXiv preprint arXiv:1706.05587 2017.
Sun, K.; Xiao, B.; Liu, D.; Wang, J. Deep High-Resolution Representation Learning for Human Pose Estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 5693–5703. [CrossRef]
Wang, W.; Xie, E.; Li, X.; Fan, D.P.; Song, K.; Liang, D.; Lu, T.; Luo, P.; Shao, L. Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2021, pp. 568–578.
Narazaki, Y.; Hoskere, V.; Yoshida, K.; Spencer, B.F., Jr.; Fujino, Y. Synthetic Environments for Vision-Based Structural Condition Assessment of Japanese High-Speed Railway Viaducts. Mechanical Systems and Signal Processing 2021, 160, 107850. [CrossRef]
Liu, T.; Zhang, L.; Zhou, G.; Cai, W.; Cai, C.; Li, L. BC-DUnet-Based Segmentation of Fine Cracks in Bridges under a Complex Background. PLOS ONE 2022, 17, e0265258. [CrossRef]
Zhou, Y.; Ali, R.; Mokhtar, N.; Harun, S.W.; Iwahashi, M. MixSegNet: A Novel Crack Segmentation Network Combining CNN and Transformer. IEEE Access 2024. [CrossRef]
Milletari, F.; Navab, N.; Ahmadi, S.A. V-Net: Fully Convolutional Neural Networks for Volumetric Medical Image Segmentation. In Proceedings of the Fourth International Conference on 3D Vision (3DV), Stanford, CA, USA, 2016; pp. 565–571. [CrossRef]
Berman, M.; Rannen Triki, A.; Blaschko, M.B. The Lovász-Softmax Loss: A Tractable Surrogate for the Optimization of the Intersection-Over-Union Measure in Neural Networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 2018; pp. 4413–4421. [CrossRef]
Zhao, H.; et al. MT-HRNet: Multi-Task High-Resolution Network for Bridge Inspection. Automation in Construction 2021.

Figure 1. Representative samples from the Tokaido synthetic dataset (test split). Each row shows the RGB image (left), component ground-truth mask (centre; Slab = red, Beam = green, Column = blue, Nonstructural = yellow-orange, Nonbridge = dark grey), and damage ground-truth mask (right; Concrete = red, ExposedRebar = green, Nondamage = dark grey). Rows 1–3 are selected to show abundant damage pixels from both foreground classes; rows 4–6 cover all structural component classes. Outside the selected rows, damage pixels are typically sparse and localised on structural surfaces, reflecting the severe class imbalance addressed by the imbalance-aware loss functions.

Figure 2. Component-aware damage segmentation architecture. The shared ConvNeXt encoder $E_{\theta}$ produces multi-scale features $F$ fed to both branches. The component branch (top) computes feature map $h^{c}$ and logits $z^{c}$; the damage branch (bottom) fuses $h^{d}$ with the conditioning tensor $q^{c}$ before the damage head. The five conditioning modes differ in how $q^{c}$ is formed (see legend); in the damage-only and parallel-heads modes the dashed conditioning path is absent. The subscript $\theta$ denotes trainable parameters.

Figure 3. ImageNet pretraining versus random initialisation (U-Net, EfficientNet-B3, $512 \times 896$). Columns: input image, ground truth, pretrained prediction, scratch prediction. Classes: Concrete = red, ExposedRebar = green, Nondamage = dark grey. Training from scratch produces fragmented, noisy outputs compared with the pretrained model, reflecting the relative drop of nearly 50% in foreground mIoU (0.519 vs. 0.264).

Figure 4. Qualitative damage-segmentation results for ConvNeXt-Seg (weighted CE + Dice, seed 3). Images are selected from the synthetic test split to show substantial damage from both foreground classes. Columns: input image, ground-truth mask, predicted mask, and prediction–ground-truth (GT) overlay. Classes: Concrete = red, ExposedRebar = green, Nondamage = dark grey. The model closely follows the ground truth for both damage classes while maintaining near-perfect Nondamage accuracy.

Figure 5. Qualitative comparison of damage predictions across all five joint-model ablation variants (seed 1) on representative synthetic test images. Columns: ground-truth damage mask, damage-only, parallel-heads, hard-mask, soft-detach, soft-full. All five variants produce visually similar outputs, consistent with the near-identical quantitative scores in Table 9. Classes: Concrete = red, ExposedRebar = green, Nondamage = dark grey.

Figure 6. Qualitative component-segmentation results for the joint soft-detach model (seed 1) on representative synthetic test images. Columns: input image, ground-truth component mask, predicted component mask, and prediction–ground-truth (GT) overlay. Classes: Slab = red, Beam = green, Column = blue, Nonstructural = yellow-orange, Nonbridge = dark grey. The model accurately delineates all major structural elements, achieving ${mIoU}_{main}^{comp} \approx 0.92$ on the test split.

Figure 7. Structural component segmentation on real viaduct photographs (trained ConvNeXt-Seg, seed 1). Columns: input image, component ground-truth, trained prediction. Classes: Slab = red, Beam = green, Column = blue, Nonstructural = yellow-orange, Nonbridge = dark grey. Real photographs carry component annotations only; damage annotations are unavailable for this image set.

Figure 8. Qualitative prompt ablation for GroundedSAM structural segmentation on real viaduct photographs. Columns: input image, component ground-truth, and zero-shot predictions for prompt variants v0 (bridge jargon), v2 (single terms), v3 (common vocabulary, best), and v4 (combined extended phrases). Classes: Slab = red, Beam = green, Column = blue, Nonstructural = yellow-orange, Nonbridge = dark grey. Variants v1 (descriptive geometry) and v4 (combined extended phrases) produce the weakest structural differentiation; v3 (common vocabulary) achieves the best overall mIoU (Table 12).

Table 1. Summary of component-aware ablation variants.

Variant	Component head	Fusion	Conditioning signal	Gradient to component branch
damage-only	×	×	—	not applicable
parallel-heads	✓	×	—	component loss only
hard-mask	✓	✓	one-hot($\arg\max z^{c}$)	none from damage^†
soft-detach	✓	✓	$\mathrm{sg}(F_{comp})$	none from damage^†
soft-full	✓	✓	$F_{comp}$	yes

^† No gradient from the damage loss reaches the component branch; the component branch continues to be trained via the component segmentation loss.
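
The hard-mask conditioning signal in Table 1, the one-hot encoding of the per-pixel argmax over the component logits $z^{c}$, can be illustrated with a minimal sketch. It uses plain Python lists instead of tensors; the function names and the toy logits are hypothetical and are not taken from the authors' code.

```python
from typing import List


def one_hot_argmax(logits: List[float]) -> List[int]:
    """Return a one-hot vector marking the class with the largest logit."""
    best = max(range(len(logits)), key=lambda k: logits[k])
    return [1 if k == best else 0 for k in range(len(logits))]


def hard_mask(logit_map: List[List[List[float]]]) -> List[List[List[int]]]:
    """Apply one_hot_argmax to every pixel of an H x W x C nested list."""
    return [[one_hot_argmax(pixel) for pixel in row] for row in logit_map]


# Hypothetical 1 x 2 image with five component classes per pixel.
z_c = [[[0.1, 2.0, -1.0, 0.3, 0.0],
        [1.5, 0.2, 0.1, 3.0, -0.5]]]
q_c = hard_mask(z_c)
print(q_c)
```

Because the argmax is piecewise constant in the logits, it passes no useful gradient, which is consistent with the "none from damage" entry for the hard-mask row.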

Table 2. Ablation: Loss Comparison (damage)

Variant	mIoU_main	mIoU_all	Dice	Precision
weighted_ce_dice	0.535	0.688	0.797	0.816
tversky	0.526	0.681	0.792	0.793
ce_dice	0.523	0.679	0.790	0.846
focal_tversky	0.498	0.662	0.775	0.768
ce	0.491	0.659	0.771	0.852
focal_ce	0.467	0.642	0.756	0.838

Table 3. Ablation: Encoder Comparison (damage)

Variant	mIoU_main	mIoU_all	Dice	Precision
efficientnet_b5	0.547	0.696	0.803	0.850
efficientnet_b3	0.526	0.681	0.791	0.842
efficientnet_b0	0.488	0.656	0.769	0.823
resnet50	0.418	0.609	0.724	0.780

Table 4. Ablation: Resolution Comparison (damage)

Variant	mIoU_main	mIoU_all	Dice	Precision
640x1120	0.549	0.697	0.805	0.857
512x896	0.536	0.689	0.797	0.839
384x672	0.464	0.640	0.754	0.822
256x448	0.398	0.595	0.711	0.763

Table 5. Ablation: Pretrained vs. Scratch (damage)

Variant	mIoU_main	mIoU_all	Dice	Precision
pretrained	0.519	0.677	0.787	0.826
scratch	0.264	0.504	0.605	0.638

Table 6. Damage segmentation benchmark on the Tokaido synthetic test split (weighted CE + Dice loss; mean ± standard deviation over all available runs). n denotes the run count per model: most models use seeds 1–3, whereas ConvNeXt-Seg has 6 completed runs and DeepLabV3+ has 4; the additional runs are logged repeated attempts with reused seed values rather than additional distinct seeds. Params denotes trainable parameter count in millions (M); Prec and Rec denote precision and recall. ^†Swin-UperNet collapses to predicting the Nondamage class in all runs; ExposedRebar IoU ≈ 0.

Model	Encoder	Params	n	mIoU_main	mIoU_all	Dice	Prec	Rec	FPS
SegFormer	PVT-V2-B2	26.0M	3	$0.569 \pm 0.013$	0.710	0.815	0.817	0.814	19.5
ConvNeXt-Seg	ConvNeXt-B	93.4M	6	$0.562 \pm 0.006$	0.706	0.811	0.824	0.799	17.3
UNet++	EfficientNet-B3	13.6M	3	$0.538 \pm 0.011$	0.690	0.798	0.838	0.766	17.8
UNet	EfficientNet-B3	13.2M	3	$0.533 \pm 0.008$	0.686	0.795	0.816	0.777	19.6
DeepLabV3+	EfficientNet-B3	11.7M	4	$0.477 \pm 0.011$	0.649	0.762	0.787	0.742	20.2
ViT-Seg	ViT-B/16	93.1M	3	$0.362 \pm 0.013$	0.571	0.686	0.743	0.645	14.1
Swin-UperNet^†	Swin-B	92.1M	3	$0.021 \pm 0.011$	0.341	0.357	0.376	0.352	13.7

Table 7. Per-class IoU breakdown on the Tokaido synthetic test split (weighted CE + Dice loss, mean ± standard deviation over all available runs; run counts per model match Table 6). Nondamage standard deviation is $< 0.002$ for all models except Swin-UperNet (0.003) and is omitted for brevity. ^†Swin-UperNet collapses to the Nondamage class in all runs.

Model	mIoU_main	Nondamage	Concrete	ExposedRebar
SegFormer	$0.569 \pm 0.013$	0.994	$0.602 \pm 0.018$	$0.535 \pm 0.010$
ConvNeXt-Seg	$0.562 \pm 0.006$	0.994	$0.598 \pm 0.011$	$0.525 \pm 0.005$
UNet++	$0.538 \pm 0.011$	0.993	$0.570 \pm 0.007$	$0.507 \pm 0.016$
UNet	$0.533 \pm 0.008$	0.993	$0.556 \pm 0.006$	$0.510 \pm 0.014$
DeepLabV3+	$0.477 \pm 0.011$	0.992	$0.515 \pm 0.016$	$0.440 \pm 0.010$
ViT-Seg	$0.362 \pm 0.013$	0.990	$0.359 \pm 0.021$	$0.366 \pm 0.008$
Swin-UperNet^†	$0.021 \pm 0.011$	0.982	$0.042 \pm 0.021$	$0.000 \pm 0.000$

Table 8. Component segmentation benchmark on the Tokaido synthetic test split (weighted CE + Dice loss, mean ± standard deviation over seeds 1–3). Params denotes trainable parameter count in millions (M); Prec and Rec denote precision and recall. ${mIoU}_{main}$ excludes the Nonbridge background class.

Model	Encoder	Params	n	mIoU_main	mIoU_all	Dice	Prec	Rec	FPS
ConvNeXt-Seg	ConvNeXt-B	93.4M	3	$0.957 \pm 0.001$	0.964	0.981	0.979	0.984	19.5
SegFormer	PVT-V2-B2	26.0M	3	$0.951 \pm 0.001$	0.959	0.979	0.976	0.981	24.6
UNet++	EfficientNet-B3	13.6M	3	$0.948 \pm 0.005$	0.957	0.978	0.976	0.980	24.5
UNet	EfficientNet-B3	13.2M	3	$0.947 \pm 0.001$	0.956	0.977	0.975	0.980	27.6
DeepLabV3+	EfficientNet-B3	11.7M	3	$0.946 \pm 0.002$	0.955	0.977	0.975	0.979	27.6
ViT-Seg	ViT-B/16	93.1M	3	$0.937 \pm 0.001$	0.947	0.973	0.972	0.974	16.8
Swin-UperNet	Swin-B	92.1M	3	$0.832 \pm 0.014$	0.857	0.922	0.918	0.926	16.4
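
The distinction between ${mIoU}_{all}$ and the foreground ${mIoU}_{main}$ reported in the benchmark tables can be sketched as follows. This is a minimal illustration with toy labels, not the evaluation script used in the study. Treating class index 0 as the excluded background class is an assumption made for the example, and the helper names are hypothetical.

```python
from typing import List, Sequence


def confusion_matrix(y_true: Sequence[int], y_pred: Sequence[int],
                     num_classes: int) -> List[List[int]]:
    """Count pixels: rows are ground-truth classes, columns are predictions."""
    cm = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        cm[t][p] += 1
    return cm


def per_class_iou(cm: List[List[int]]) -> List[float]:
    """IoU_k = TP / (TP + FP + FN); NaN if class k never occurs."""
    n = len(cm)
    ious = []
    for k in range(n):
        tp = cm[k][k]
        fn = sum(cm[k]) - tp
        fp = sum(cm[r][k] for r in range(n)) - tp
        denom = tp + fp + fn
        ious.append(tp / denom if denom else float("nan"))
    return ious


def mean_iou(ious: Sequence[float], exclude: Sequence[int] = ()) -> float:
    """Unweighted mean over classes, skipping excluded and NaN entries."""
    kept = [v for k, v in enumerate(ious) if k not in exclude and v == v]
    return sum(kept) / len(kept)


# Toy flattened label maps; class 0 plays the role of the background (assumed).
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
ious = per_class_iou(confusion_matrix(y_true, y_pred, num_classes=3))
miou_all = mean_iou(ious)
miou_main = mean_iou(ious, exclude=(0,))
```

Accumulating one confusion matrix over all test pixels before computing IoU gives a pooled estimate; averaging per-image IoU values instead would generally yield different numbers.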

Table 9. Component-aware ablation results on the Tokaido synthetic test set (mean ± standard deviation over seeds 1–3). Superscripts dmg and comp denote damage and component tasks, respectively. ^†Variant has no component head; component metrics are not applicable.

Variant	${mIoU}_{main}^{dmg}$	Concrete IoU	ExposedRebar IoU	${mIoU}_{main}^{comp}$
damage-only^†	$0.5600 \pm 0.0060$	$0.5882$	$0.5319$	—
parallel-heads	$0.5535 \pm 0.0055$	$0.5884$	$0.5187$	$0.9229 \pm 0.0065$
hard-mask	$0.5574 \pm 0.0045$	$0.5936$	$0.5212$	$0.9262 \pm 0.0036$
soft-detach	$0.5609 \pm 0.0086$	$0.5976$	$0.5242$	$0.9231 \pm 0.0021$
soft-full	$0.5545 \pm 0.0112$	$0.5909$	$0.5181$	$0.9270 \pm 0.0027$
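
As a quick arithmetic check on Table 9, the sketch below computes the difference in mean damage ${mIoU}_{main}$ between each variant with a component head and the damage-only baseline. It uses only the rounded means printed in the table, so the last digit may differ slightly from values derived from unrounded logs; the variable names are illustrative.

```python
# Mean foreground damage mIoU per variant, copied from Table 9.
baseline = 0.5600          # damage-only
baseline_std = 0.0060      # seed-to-seed standard deviation of the baseline
variants = {
    "parallel-heads": 0.5535,
    "hard-mask": 0.5574,
    "soft-detach": 0.5609,
    "soft-full": 0.5545,
}

# Signed difference of each variant relative to the damage-only baseline.
deltas = {name: round(value - baseline, 4) for name, value in variants.items()}
best_variant = max(deltas, key=deltas.get)
largest_abs_delta = max(abs(d) for d in deltas.values())

for name, delta in deltas.items():
    print(f"{name:15s} {delta:+.4f}")
print("best:", best_variant, "largest |delta|:", largest_abs_delta)
```

Only soft-detach exceeds the baseline, by 0.0009, which is well below the baseline's own seed-to-seed standard deviation of 0.0060; the remaining variants fall slightly below the baseline, and every difference stays under 0.008.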

Table 10. Component segmentation on real photographs (ConvNeXt-Seg, seeds 1–3, weighted CE + Dice; same three single-task checkpoints as Table 8; global pooled IoU over 16 images). Rows “Seed 1”–“Seed 3” are individual checkpoints; the final ${mIoU}_{main}$ entry is mean ± standard deviation across seeds. The Nonbridge class is omitted from ${mIoU}_{main}$ because its support is small and arises primarily from resize-pad preprocessing in the evaluation pipeline.

	Nonbridge	Slab	Beam	Column	Nonstructural	${mIoU}_{main}$
Seed 1	0.010	0.456	0.409	0.598	0.036	0.374
Seed 2	0.011	0.727	0.447	0.755	0.056	0.496
Seed 3	0.009	0.505	0.424	0.643	0.034	0.402
Mean	0.010	0.562	0.427	0.665	0.042	$0.424 \pm 0.064$
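
The mean ± standard deviation entry in Table 10 and the synthetic-to-real gap can be reproduced from the per-seed values with a few lines of Python. The caption does not state which standard-deviation estimator was used; the sketch assumes the sample estimator (divisor n − 1), which matches the reported 0.064, whereas the population estimator would give about 0.052.

```python
import statistics

# Per-seed foreground mIoU on the 16 real photographs (Table 10).
seed_miou = [0.374, 0.496, 0.402]

mean_real = statistics.mean(seed_miou)
std_real = statistics.stdev(seed_miou)    # sample estimator (assumed)
std_pop = statistics.pstdev(seed_miou)    # population estimator, for comparison

# Synthetic test-split mean for the same ConvNeXt-Seg checkpoints (Table 8).
mean_synthetic = 0.957
domain_gap = mean_synthetic - mean_real

print(f"real: {mean_real:.3f} ± {std_real:.3f}")
print(f"synthetic-to-real gap: {domain_gap:.3f}")
```

With only three seeds and 16 images, the spread is dominated by Seed 2 (0.496), so the standard deviation should be read as a rough indication of checkpoint sensitivity rather than a tight confidence bound.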

Table 11. Text prompts used for each structural class across the five GroundedSAM prompt variants (v0–v4; note that these are distinct from the five joint-model conditioning variants in Table 9). Each cell lists the period-separated phrases passed to GroundingDINO for that class. Nonstructural (class 4) is not directly prompted; it receives all pixels not assigned to classes 1–3 after priority-order merging.

Variant	Slab (1)	Beam (2)	Column (3)	Nonstructural (4)
v0 — baseline	bridge deck slab. concrete slab. bridge floor	bridge girder. bridge beam. concrete beam	bridge column. bridge pier. concrete column	railway rail. railway sleeper. bridge railing
v1 — descriptive geom.	flat concrete deck. horizontal concrete surface. concrete floor plate	concrete T-beam. prestressed concrete girder. longitudinal girder under deck	vertical concrete pier. cylindrical column. rectangular concrete pillar	metal handrail. steel guard rail. bridge fence
v2 — single terms	slab	beam	column	railing
v3 — common vocab. (best)	concrete floor. flat concrete ceiling. concrete soffit	structural beam. concrete support beam. I-beam	pillar. concrete pillar. support column	railing. fence. barrier
v4 — combined	bridge deck slab. flat concrete deck. concrete floor plate. bridge floor	bridge beam. bridge girder. concrete beam. prestressed concrete girder. T-beam	bridge column. concrete pier. bridge pier. vertical concrete pillar	bridge railing. metal handrail. steel fence. guardrail
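
The priority-order merging mentioned in the Table 11 caption can be sketched as below: per-class binary masks are combined into a single label map, overlaps are resolved by a fixed priority, and unassigned pixels fall back to Nonstructural (class 4). The priority order used here (Column over Beam over Slab) and the function name are assumptions for illustration only; the order actually used in the study is not stated in this section.

```python
from typing import Dict, List, Sequence

Mask = List[List[bool]]


def merge_by_priority(masks: Dict[int, Mask], priority: Sequence[int],
                      fallback: int) -> List[List[int]]:
    """Merge binary masks into one label map.

    Classes earlier in `priority` win where masks overlap; pixels covered
    by no mask receive the `fallback` class.
    """
    first = next(iter(masks.values()))
    height, width = len(first), len(first[0])
    labels = [[fallback] * width for _ in range(height)]
    # Paint lowest priority first so higher-priority classes overwrite it.
    for cls in reversed(priority):
        for r in range(height):
            for c in range(width):
                if masks[cls][r][c]:
                    labels[r][c] = cls
    return labels


# Toy 1 x 4 masks for Slab (1), Beam (2), and Column (3).
masks = {
    1: [[True, True, False, False]],
    2: [[False, True, True, False]],
    3: [[False, False, False, False]],
}
label_map = merge_by_priority(masks, priority=(3, 2, 1), fallback=4)
print(label_map)
```

In the toy example the second pixel is claimed by both Slab and Beam and is assigned to Beam, while the last pixel is claimed by no mask and becomes Nonstructural.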

Table 12. Structural-task prompt ablation for GroundedSAM on 16 real photographs (${mIoU}_{main}$ excludes Nonbridge).

Variant	Slab	Beam	Column	Nonstructural	${mIoU}_{main}$
v0 — baseline (bridge-specific phrases)	0.291	0.126	0.223	0.142	0.195
v1 — descriptive geometry	0.184	0.018	0.090	0.133	0.106
v2 — single terms (slab/beam/…)	0.372	0.089	0.310	0.105	0.219
v3 — common vocabulary (best)	0.338	0.133	0.491	0.038	0.250
v4 — combined extended phrases	0.131	0.147	0.106	0.088	0.118

Table 14. Summary of progress relative to the prior study [13]. All metrics are foreground mIoU (background excluded) unless otherwise noted. Prior damage figures are per-class IoU averaged over the two foreground damage classes (Cracks, Reinforcement) from Table 3 of that paper.

Aspect	Prior work	Present work
Input resolution	$320 \times 160$	$512 \times 896$
Architectures evaluated	2	7
Loss functions compared	ad-hoc	4 (systematic ablation)
Repeated seeds	no	3 per config
Component mIoU (synthetic test)	76–87% (val, 4 cls)	94.6–95.7% (test, 5 cls)
Damage mIoU (synthetic test)	$\sim 0.46$ (2 arch)	$0.569$ (SegFormer mean)
Component mIoU (real photos)	$\sim 0.18$ (U-Net, 4 cls)	$0.424$ (trained, 5 cls, gap $\approx 0.53$ )
Zero-shot structural baseline	—	$0.250$ (GroundedSAM)
Joint component-aware ablation	no	yes (5 variants)

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

Copyright: This open access article is published under a Creative Commons CC BY 4.0 license, which permits free download, distribution, and reuse, provided that the author and preprint are cited in any reuse.
