Preprint
Article

This version is not peer-reviewed.

Controlled Benchmarking and Component-Aware Ablation for Railway Viaduct Structural and Damage Segmentation

Submitted:

10 June 2026

Posted:

11 June 2026


Abstract
Automated damage inspection of railway viaducts requires pixel-level identification of structural components and surface damage such as cracking and rebar exposure. A common assumption in bridge inspection is that damage segmentation improves when component information is provided alongside the image. This study tests that assumption on the Tokaido synthetic viaduct dataset using controlled comparisons between segmentation models with and without component information. Both damage and structural component segmentation are evaluated across multiple architectures, and the trained component model is assessed on real viaduct photographs against a baseline model requiring no task-specific training. Adding component information does not improve damage segmentation: all tested strategies remain within 0.008 mean Intersection-over-Union (mIoU) of a baseline without component input. This null result persists even when component predictions are reliable, indicating that structural element identity does not provide useful information for damage localisation in this setting. The best unconditioned model reaches 0.569 mIoU for damage segmentation; for real-photo component segmentation, the trained model reaches 0.424 mIoU compared with 0.250 mIoU for the training-free baseline. These results show that multi-task benefits reported in bridge inspection do not automatically translate into gains from explicit use of component information on synthetic viaduct data, where damage placement is largely independent of structural element type. The multi-architecture benchmark and the measured real-photo structural transfer gap provide reference baselines for subsequent work on component-aware and transfer-robust inspection.

1. Introduction

Railway viaduct inspection requires pixel-level information about both structural components and visible surface damage. Manual visual inspection remains costly, time-consuming, and dependent on subjective interpretation, especially when elevated structures must be assessed repeatedly across large transport networks [2,4,5]. Semantic segmentation offers a direct route from inspection imagery to spatially explicit component and damage maps, but its usefulness depends on whether models can identify small damage regions and transfer beyond the data used for training. These requirements make viaduct inspection a stringent test case for modern segmentation methods.
Recent bridge-inspection research has established two important directions. First, bridge element parsing and defect segmentation can be learned jointly, and multi-task models are often reported to improve element or defect metrics relative to single-task baselines [4,5]. Second, synthetic environments, such as the Tokaido synthetic viaduct benchmark and related synthetic inspection datasets [41,46,55], can reduce annotation cost and support repeatable experiments. These studies show that component information, synthetic data, and deep segmentation architectures can support structural inspection. They do not, however, resolve the issue of whether explicit component conditioning improves damage localisation under a controlled viaduct benchmark.
The distinction between multi-task learning and explicit component conditioning is central to this study. A multi-task network may improve damage segmentation because shared features regularise the encoder, because component supervision improves structural representation, or because the damage branch directly receives component information. Prior bridge-inspection work has mostly evaluated the combined effect of feature sharing and task interaction [4,5]. The practical question remains narrower: if a model already has access to an image, does adding predicted component information help it localise cracks, concrete damage, or exposed reinforcement? This question is testable only when architecture, training recipe, dataset split, and evaluation protocol are controlled.
Architecture choice adds a second unresolved factor. General semantic-segmentation research has moved from convolutional encoder–decoder models such as U-Net and DeepLabV3+ toward hierarchical Transformers and modernised convolutional architectures, including Swin Transformer, SegFormer, and ConvNeXt [17,19,24,25,26]. Inspection imagery differs from generic scene parsing because damage classes are sparse, thin, and texture-driven. A model that performs well on broad segmentation benchmarks may therefore fail under strong foreground–background imbalance. A controlled comparison across convolutional neural network (CNN), ConvNeXt, Vision Transformer (ViT), SegFormer, and Swin-style architectures is needed before claims about the benefits of component-aware damage localisation can be interpreted.
Synthetic-to-real transfer also remains a limiting condition for real-life deployment. The Tokaido dataset provides controlled synthetic viaduct imagery with component and damage annotations, while dacl10k and related bridge datasets demonstrate the diversity and difficulty of real inspection photographs [9,55]. Foundation segmentation models provide a complementary reference point because GroundingDINO and the Segment Anything Model (SAM) can produce zero-shot masks without task-specific training [42,43]. Comparing a trained synthetic-data model against such a training-free baseline helps separate ordinary segmentation accuracy from real-photo transfer performance.
This study evaluates railway viaduct component and damage segmentation under a controlled benchmark design. Seven segmentation architectures are trained for structural component and damage segmentation on the Tokaido synthetic dataset, using fixed splits, repeated seeds, and a common evaluation protocol. A separate component-aware ablation study tests damage-only, parallel-head, hard-mask, detached soft-conditioning, and end-to-end soft-conditioning variants. The design isolates whether component predictions improve damage localisation beyond a damage-only baseline. The best damage model reaches 0.569 foreground mIoU on the synthetic test split, while the component-aware variants remain within 0.008 mIoU of the unconditioned baseline. This result shows that explicit component conditioning does not provide a measurable damage-segmentation gain in this setting.
The paper makes three contributions. First, the study provides a controlled CNN–Transformer benchmark for viaduct structural component and damage segmentation on the Tokaido dataset. Second, the component-aware ablation shows that reliable component predictions do not automatically improve damage localisation. Third, the real-photo component-transfer evaluation compares a component segmentation model trained on synthetic data with a GroundingDINO+SAM zero-shot pipeline (GroundedSAM), where the trained model reaches 0.424 mIoU and the zero-shot baseline reaches 0.250 mIoU on real viaduct photographs. Together, these results position the benefits of component-aware conditioning as an empirical question rather than an assumed benefit, and they provide baseline evidence for future transfer-robust inspection models.

3. Methodology

This section defines the segmentation tasks, implemented model families, component-aware conditioning variants, and training configuration used throughout the benchmark reported in the present paper. The study is both a controlled architecture benchmark and an ablation-study design: standard segmentation architectures are compared under the same conditions, while the component-aware variants constitute the conditioning interventions used to test whether structural component information helps damage localisation.

3.1. Task Formulation and Label Spaces

Each input image is a red–green–blue (RGB) viaduct rendering $x \in \mathbb{R}^{3 \times H \times W}$, resized and cropped to $H \times W = 512 \times 896$ before being passed to the network. Two dense prediction tasks are considered. The structural component task predicts one of five labels at each pixel: Nonbridge, Slab, Beam, Column, or Nonstructural. The last class, Nonstructural, aggregates minor structural elements (rail, sleeper, and other non-load-bearing parts); the merging uses zero-based indices after label conversion (see Section 3.2). The damage task predicts one of three labels: Nondamage, Concrete, or ExposedRebar. The single-task models output one logit tensor $z \in \mathbb{R}^{C \times H \times W}$ for the selected task, with $C = 5$ for component segmentation and $C = 3$ for damage segmentation.
Performance is optimised and monitored using foreground mean Intersection-over-Union (mIoU), denoted $\mathrm{mIoU}_{\mathrm{main}}$, where Intersection-over-Union (IoU) is computed per class before averaging. This metric excludes the dominant background class (Nonbridge for component segmentation and Nondamage for damage segmentation), so that model selection is driven by the structural and damage classes that determine inspection usefulness. The same foreground convention is used by the validation scheduler and by the checkpoint criterion.

3.2. Dataset

Both tasks are trained and evaluated on the Tokaido dataset [55], a synthetic photo-realistic collection of 7,575 viaduct images rendered from three-dimensional bridge models using the open-source Blender3D modelling tool. A viaduct is a multi-span elevated bridge structure, typically used to carry railways or roads over valleys or urban areas. The images are representative of UAV-acquired aerial imagery and come with pixel-accurate ground-truth annotations stored as unsigned-integer label images. Each sample includes segmentation masks covering structural object types — slab (the flat horizontal deck that forms the roadway or rail bed), beam (longitudinal load-bearing members that span between supports), column (vertical pillars that carry the beam loads to the ground), rail and sleeper (railway track components), and other minor elements — as well as two damage categories: concrete surface damage (cracking and spalling visible on the surface) and exposed rebar (severe deterioration where steel reinforcing bars have become visible due to cover loss or corrosion).
A validity flag stored in the file-list CSV is used to exclude low-quality or incomplete samples prior to any split; files whose label images are missing from disk are additionally filtered out. Figure 1 shows representative examples of the input images and their ground-truth annotations.

Class indexing and label merging.

Raw label images store class indices as unsigned integers starting at 1; the preprocessing pipeline subtracts 1 to obtain zero-based indices in the range 0–7 before merging. For the component task, zero-based source classes 5, 6, and 7 — corresponding to raw classes 6 (rail), 7 (sleeper), and 8 (other minor elements) — are merged into the Nonstructural class at zero-based index 4. Raw class 5 already maps to Nonstructural and is not a merge source. The merge reduces the eight raw component labels to five training classes.
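The index shift and merge described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the study's preprocessing code: the function name `to_training_labels` and the constants are hypothetical, and the raw label image is assumed to be already loaded as an unsigned-integer array.

```python
import numpy as np

# Zero-based source indices 5, 6, 7 (raw 6 = rail, 7 = sleeper, 8 = other minor
# elements) are folded into Nonstructural at zero-based index 4.
MERGE_SOURCES = (5, 6, 7)   # hypothetical constant name
NONSTRUCTURAL = 4           # hypothetical constant name


def to_training_labels(raw_label: np.ndarray) -> np.ndarray:
    """Convert a raw one-based component label image to the five training classes."""
    # Cast before subtracting so an unsigned dtype cannot wrap around.
    zero_based = raw_label.astype(np.int64) - 1          # raw 1..8 -> 0..7
    merged = zero_based.copy()
    merged[np.isin(zero_based, MERGE_SOURCES)] = NONSTRUCTURAL
    return merged


# Toy 2 x 4 label image containing every raw class once.
raw = np.array([[1, 2, 3, 4], [5, 6, 7, 8]], dtype=np.uint8)
print(to_training_labels(raw))
```

In this toy example raw classes 1 to 4 map to indices 0 to 3, and raw classes 5 to 8 all end up at index 4, which gives the five training classes.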

3.3. Preprocessing and Augmentation

All images follow a common preprocessing pipeline before tensor conversion. For training, the image is first scaled with LongestMaxSize so that the longer side matches the configured training size, then padded with zeros using PadIfNeeded until the 512 × 896 canvas is obtained. A random 512 × 896 crop is then extracted. Training augmentation consists of horizontal flipping with probability 0.5, random brightness–contrast perturbation with probability 0.3, and Gaussian blur with probability 0.15, followed by normalisation and ToTensorV2. For non-augmented evaluation transforms, the same longest-side scaling, zero padding, normalisation, and tensor conversion are used without random crop or photometric augmentation. The same resize–pad–normalise convention is used for both component and damage inputs, ensuring that task differences arise from labels and model heads rather than from preprocessing.
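The geometry of the resize-and-pad step can be illustrated without the augmentation library. The sketch below only computes shapes. It assumes that the configured training size is 896 pixels on the longer side, which the text does not state explicitly. The helper name `resize_pad_geometry` and the example frame sizes are hypothetical, and the rounding rule may differ from the library's.

```python
def resize_pad_geometry(height, width, max_side=896, canvas_hw=(512, 896)):
    """Return the resized shape and the zero padding needed to reach the canvas.

    max_side=896 is an assumption about the configured training size.
    """
    longest = max(height, width)
    # Scale so that the longer side equals max_side, keeping the aspect ratio.
    new_h = round(height * max_side / longest)
    new_w = round(width * max_side / longest)
    # Pad only where the resized image is smaller than the canvas.
    pad_h = max(canvas_hw[0] - new_h, 0)
    pad_w = max(canvas_hw[1] - new_w, 0)
    return (new_h, new_w), (pad_h, pad_w)


print(resize_pad_geometry(1080, 1920))  # hypothetical 16:9 frame
print(resize_pad_geometry(1440, 1920))  # hypothetical 4:3 frame
```

Under these assumptions a 16:9 frame is resized to 504 × 896 and needs 8 rows of zero padding, so the subsequent 512 × 896 crop covers the whole canvas. A 4:3 frame is resized to 672 × 896 and needs no padding, so the random crop selects 512 of its 672 rows.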

3.4. Single-Task Segmentation Architectures

The single-task benchmark evaluates seven segmentation architectures under the same task interface. Three models use segmentation_models_pytorch decoders with an ImageNet-pretrained timm-efficientnet-b3 encoder: U-Net [17], UNet++ [18], and DeepLabV3+ [19]. U-Net provides the basic encoder–decoder baseline with skip connections; UNet++ adds nested skip pathways to reduce the semantic gap between encoder and decoder features; DeepLabV3+ replaces the symmetric decoder with atrous spatial pyramid pooling and a lightweight refinement decoder. Their verified trainable parameter counts are 13.16 million, 13.63 million, and 11.68 million, respectively.
The remaining four models use custom wrappers around timm encoders and decoder heads. ViT-Seg uses vit_base_patch16_224 [23] with a UPerNet-style decoder [49]; intermediate transformer blocks are reshaped into spatial feature maps and fused at one-quarter input resolution. ConvNeXt-Seg uses convnext_base [26] as a four-stage feature encoder with the same UPerNet-style decoder. SegFormer uses a Pyramid Vision Transformer v2 (PVT v2) pvt_v2_b2 hierarchical encoder with a SegFormer-style multilayer-perceptron (MLP) decoder [25]; each stage is projected to a common channel dimension, upsampled to one-quarter resolution, concatenated, and fused before the segmentation head. Swin-UPerNet uses swin_base_patch4_window7_224 [24] with the UPerNet-style decoder. The verified trainable parameter counts are 93.12 million for ViT-Seg, 93.37 million for ConvNeXt-Seg, 25.97 million for SegFormer, and 92.12 million for Swin-UPerNet.

UPerNet decoder.

The shared UPerNet decoder used by ViT-Seg, ConvNeXt-Seg, and Swin-UPerNet consists of (i) per-level lateral 1 × 1 convolutions that project all feature maps to a common $d = 256$-dimensional space, (ii) a Feature Pyramid Network (FPN) top-down pathway that adds upsampled deeper features to shallower ones, (iii) 3 × 3 convolutional refinement on each level, and (iv) a final fusion step that upsamples all levels to a common resolution, concatenates them, and applies a further 3 × 3 convolution. The segmentation head consists of a Conv-BN-ReLU block followed by a 1 × 1 convolution that produces the $C$-channel logit map, which is upsampled to the original image resolution using bilinear interpolation.

3.5. Component-Aware Damage Segmentation

The component-aware multitask model uses a shared convnext_base encoder and task-specific UPerNet-style decoder branches. The component branch produces a feature map $h_c$ at one-quarter input resolution and component logits $z_c \in \mathbb{R}^{5 \times H \times W}$; the damage branch produces a damage feature map $h_d$ and damage logits $z_d \in \mathbb{R}^{3 \times H \times W}$. When conditioning is active, the damage feature map is concatenated with a component-derived conditioning tensor $q_c$ and passed through a single Conv-BN-ReLU fusion block before the damage head. The input channel count is $256 + 5 = 261$ for the hard-mask variant (one-hot over five component classes) and $256 + 256 = 512$ for the soft variants (full feature map); in both cases the output is projected back to 256 channels. Figure 2 shows the shared encoder, task branches, and optional conditioning route.
Five conditioning configurations are evaluated, corresponding to distinct hypotheses about the role of component context (Table 1):
  • damage-only: the component branch is absent; no fusion. Single-task damage baseline (93.37M parameters).
  • parallel-heads: both decoders are present and trained jointly on a combined loss, but the fusion module is absent. Tests whether shared multitask supervision alone benefits damage-segmentation quality (99.17M parameters).
  • hard-mask: $q_c$ is a discrete one-hot map derived from $\arg\max z_c$ at resolution $H/4 \times W/4$ and passed through the fusion module. A one-hot encoding assigns each pixel a binary vector with a single 1 in the position of the winning class and 0s everywhere else, discarding all probability information. Because argmax is not differentiable, no gradient flows through this conditioning path (99.78M parameters).
  • soft-detach: $q_c = \mathrm{sg}(F_{\mathrm{comp}})$, where $F_{\mathrm{comp}}$ is the continuous component feature map and $\mathrm{sg}(\cdot)$ denotes the stop-gradient operator (implemented via .detach() in PyTorch), which passes the value forward normally but blocks gradient flow during backpropagation. The component features act as the conditioning signal, but no gradient flows from the damage loss through the component path. This isolates the effect of the conditioning signal from gradient coupling (100.35M parameters).
  • soft-full: $q_c = F_{\mathrm{comp}}$, without stop-gradient. The damage loss backpropagates through the fusion module into the component decoder and the shared encoder (100.35M parameters).
A sixth variant, component-only, trains only the component branch in isolation to establish structural segmentation performance independently of the damage objective; it is not part of the conditioning ablation.
The causal comparisons implied by these variants are part of the method definition. Damage-only versus parallel-heads tests whether auxiliary component supervision helps the shared representation when no component signal is received by the damage decoder. Parallel-heads versus hard-mask tests whether coarse predicted component identity helps damage segmentation. Parallel-heads versus soft-detach tests whether feature-level component information helps independently of extra gradient coupling. Soft-detach versus soft-full tests whether allowing the damage objective to update the component path through the conditioning branch changes the damage solution.
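The forward-pass difference between the variants comes down to which conditioning tensor, if any, is concatenated with the damage features. The NumPy sketch below illustrates only the tensor shapes for a single image. The function name `build_fusion_input` is hypothetical, component logits are assumed to be available at quarter resolution, and gradient behaviour (argmax and the stop-gradient) cannot be shown in NumPy, so it is noted in comments only.

```python
import numpy as np


def build_fusion_input(h_d, h_c, z_c_quarter, variant):
    """Return the tensor that would enter the fusion block (or the damage head).

    h_d, h_c:     (256, H/4, W/4) damage and component feature maps.
    z_c_quarter:  (5, H/4, W/4) component logits at quarter resolution (assumed).
    """
    if variant in ("damage-only", "parallel-heads"):
        return h_d                                    # no fusion module
    if variant == "hard-mask":
        winners = z_c_quarter.argmax(axis=0)          # (H/4, W/4) class ids
        one_hot = np.eye(z_c_quarter.shape[0], dtype=h_d.dtype)[winners]
        q_c = np.moveaxis(one_hot, -1, 0)             # (5, H/4, W/4)
    elif variant in ("soft-detach", "soft-full"):
        # Same forward value in both variants; soft-detach additionally blocks
        # the gradient in the real model, which is outside this sketch.
        q_c = h_c
    else:
        raise ValueError(f"unknown variant: {variant}")
    return np.concatenate([h_d, q_c], axis=0)         # channel-wise concatenation


rng = np.random.default_rng(0)
h_d = rng.normal(size=(256, 4, 6)).astype(np.float32)   # toy spatial size
h_c = rng.normal(size=(256, 4, 6)).astype(np.float32)
z_c = rng.normal(size=(5, 4, 6)).astype(np.float32)

for name in ("damage-only", "hard-mask", "soft-detach"):
    print(name, build_fusion_input(h_d, h_c, z_c, name).shape)
```

The channel counts match those stated for the fusion block: 256 without conditioning, 261 for the hard mask, and 512 for the soft variants.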

3.6. Training Objective and Optimisation

Single-task models in the main benchmark are trained with one of four configured losses: cross-entropy plus Dice loss, weighted cross-entropy plus Dice loss, Tversky loss [11], and focal Tversky loss [12]. The preliminary loss ablation additionally evaluates plain cross-entropy (CE) and focal cross-entropy [10], giving six loss variants in that study. Dice loss is used in multiclass mode from logits [58], and the Dice term receives weight 0.5 in the composite cross-entropy–Dice objectives. The Tversky parameters are $\alpha = 0.3$ and $\beta = 0.7$. For focal Tversky, the verified focal exponent is $\gamma = 2.0$ for the structural task and $\gamma = 1.5$ for the damage and joint component-aware tasks. Weighted losses use per-class weights derived from training-set pixel frequencies. Let $f_c$ denote the fraction of training pixels assigned to class $c$. The per-class weight is
$$ w_c = \mathrm{clip}\!\left(\frac{\tilde{w}_c}{\bar{\tilde{w}}},\; 0.25,\; 5.0\right), \qquad \tilde{w}_c = \frac{1}{\sqrt{f_c + \varepsilon}}, $$
where $\varepsilon = 10^{-12}$ and $\bar{\tilde{w}}$ is the mean of $\tilde{w}_c$ over all classes. The square-root inverse downweights frequent classes without extreme amplification of rare ones; the $[0.25, 5.0]$ clip prevents any single class from dominating the loss.
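The weighting rule can be checked numerically with a short NumPy sketch. It reads the raw weight as the inverse square root of $f_c + \varepsilon$, normalises by the mean raw weight, and clips to $[0.25, 5.0]$. The function name `class_weights` and the pixel counts are hypothetical, chosen only to mimic a strongly imbalanced three-class damage distribution.

```python
import numpy as np


def class_weights(pixel_counts, eps=1e-12, lo=0.25, hi=5.0):
    """Square-root inverse-frequency weights, mean-normalised and clipped."""
    counts = np.asarray(pixel_counts, dtype=np.float64)
    freq = counts / counts.sum()              # f_c: fraction of training pixels
    raw = 1.0 / np.sqrt(freq + eps)           # unnormalised weight per class
    return np.clip(raw / raw.mean(), lo, hi)  # divide by the mean, then clip


# Hypothetical pixel counts for (Nondamage, Concrete, ExposedRebar).
weights = class_weights([9_800_000, 150_000, 50_000])
print(weights)
```

With these illustrative counts the normalised background weight falls below the lower bound and is clipped to 0.25, while the two rare classes receive weights above 1 that stay well inside the upper bound.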
For joint component-aware training, the component and damage branches use the same configured loss type and hyperparameters, each with its own class weights. When both branches are present, the total objective is
$$ \mathcal{L} = \mathcal{L}_d(z_d, y_d) + \lambda_c \, \mathcal{L}_c(z_c, y_c), $$
where $\mathcal{L}_d$ and $\mathcal{L}_c$ are the damage and component segmentation losses, $y_d$ and $y_c$ are the corresponding label maps, and $\lambda_c = 1.0$. In component-only mode, the damage term is absent and only $\mathcal{L}_c$ is optimised. In damage-only mode, the component term is absent and only $\mathcal{L}_d$ is optimised.

Checkpointing and best-model selection.

Validation is performed on the held-out validation split after every training epoch. The model state is saved whenever the foreground validation $\mathrm{mIoU}_{\mathrm{main}}$ strictly improves; after training, the best checkpoint is reloaded for final evaluation on the test set.
All models are trained with AdamW [50] and weight decay 0.01. The learning rate is $3 \times 10^{-4}$ for the segmentation_models_pytorch (SMP)-based CNN single-task architectures and $1 \times 10^{-4}$ for the custom timm-based single-task architectures and joint component-aware variants. A ReduceLROnPlateau scheduler monitors validation $\mathrm{mIoU}_{\mathrm{main}}$ in maximisation mode with factor 0.5, patience 3, and minimum learning rate $10^{-6}$. Single-task models are configured for 40 epochs and joint component-aware models for 60 epochs. The configured seeds are 1, 2, and 3. Batch size is 4 for U-Net, UNet++, DeepLabV3+, ConvNeXt-Seg, SegFormer, and the joint component-aware variants, and 2 for ViT-Seg and Swin-UPerNet. Mixed precision is enabled when supported by the training device.
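How strict-improvement checkpointing interacts with plateau-based learning-rate decay can be traced with a small pure-Python simulation. This is a simplified stand-in for the scheduler rather than the library implementation: it ignores the improvement threshold and cooldown options, and the function name `simulate_training` and the validation scores are hypothetical.

```python
def simulate_training(val_scores, lr=1e-4, factor=0.5, patience=3, min_lr=1e-6):
    """Trace best-checkpoint selection and learning-rate decay over epochs."""
    best_score, best_epoch = float("-inf"), None
    bad_epochs = 0
    lr_history = []
    for epoch, score in enumerate(val_scores, start=1):
        if score > best_score:              # strict improvement: save checkpoint
            best_score, best_epoch = score, epoch
            bad_epochs = 0
        else:
            bad_epochs += 1
            if bad_epochs > patience:       # plateau exceeded patience: decay
                lr = max(lr * factor, min_lr)
                bad_epochs = 0
        lr_history.append(lr)               # rate in effect after this epoch
    return best_epoch, best_score, lr_history


# Hypothetical validation foreground mIoU values for eight epochs.
scores = [0.40, 0.45, 0.44, 0.45, 0.43, 0.44, 0.46, 0.46]
best_epoch, best_score, lr_history = simulate_training(scores)
print(best_epoch, best_score, lr_history)
```

In this trace the tie at epoch 4 does not count as an improvement, the rate is halved after the fourth consecutive non-improving epoch (epoch 6), and the checkpoint that would be reloaded for testing is the one from epoch 7.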

4. Experiments

This section describes the dataset split, evaluation metrics, and experimental protocols. The benchmark covers seven architectures across both the damage and component segmentation tasks, each trained with four loss variants and three random seeds, yielding 168 nominal benchmark runs; several cells contain additional completed attempts (primarily ConvNeXt-Seg and DeepLabV3+ damage runs), bringing the total benchmark rows in the experiment log to 194. Some repeated attempts reuse seed values, so these row counts are completed-attempt counts rather than distinct-seed counts; the original execution rationale for the repeats is not documented. Together with 86 ablation runs and 15 joint-ablation runs, the experiment log contains 295 completed training runs in total. The 86 ablation runs comprise four training-protocol factors (loss, encoder, resolution, pretraining) each evaluated on both the damage and component tasks with 2–3 seeds per variant (68 runs), plus an 18-run cross-family loss confirmation that verified the loss ranking from U-Net/EfficientNet-B3 generalises to ConvNeXt-Seg and ViT-Seg before committing the recipe to the full benchmark; the 15 joint-ablation runs are five conditioning variants × 3 seeds. All experiments were executed on Compute Unified Device Architecture (CUDA)-compatible graphics processing unit (GPU) hardware.
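The run counts quoted above follow from simple arithmetic, reproduced here as a quick consistency check. All numbers are taken from the text; only the variable names are introduced for illustration.

```python
architectures, tasks, losses, seeds = 7, 2, 4, 3
nominal_benchmark = architectures * tasks * losses * seeds

benchmark_rows = 194       # completed attempts in the log, including repeats
ablation_runs = 68 + 18    # recipe ablations + cross-family loss confirmation
joint_runs = 5 * 3         # conditioning variants x seeds

extra_attempts = benchmark_rows - nominal_benchmark
total_runs = benchmark_rows + ablation_runs + joint_runs
print(nominal_benchmark, extra_attempts, ablation_runs, joint_runs, total_runs)
# 168 26 86 15 295
```

The difference of 26 rows between the nominal grid and the logged benchmark rows corresponds to the additional completed attempts mentioned above.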

4.1. Dataset Split

All valid annotated images — those passing the dataset’s built-in validity filter — are partitioned into train, validation, and test sets at a 70/15/15 ratio using a fixed random seed (seed 123). The same split is applied identically to both tasks and to all experiment variants.
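A fixed-seed 70/15/15 partition of this kind can be sketched with the standard library. The exact shuffling routine used in the study is not described, so the function below (hypothetical name `split_indices`) shows only one way to obtain a reproducible split. The sample count of 1,000 is a placeholder, not the number of valid Tokaido images.

```python
import random


def split_indices(n_samples, seed=123, percents=(70, 15, 15)):
    """Shuffle sample indices with a fixed seed and cut them 70/15/15."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)        # seeded, so fully reproducible
    n_train = n_samples * percents[0] // 100    # integer arithmetic, no float error
    n_val = n_samples * percents[1] // 100
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]            # remainder goes to the test set
    return train, val, test


train, val, test = split_indices(1000)          # placeholder sample count
print(len(train), len(val), len(test))
```

Because the split depends only on the seed and the ordered list of valid samples, reusing it across both tasks and all variants keeps every comparison on identical test images.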

4.2. Evaluation Metrics

All segmentation performance is reported using mean Intersection-over-Union (mIoU) over foreground classes ($\mathrm{mIoU}_{\mathrm{main}}$), which excludes the dominant background class (Nondamage for damage segmentation, Nonbridge for component segmentation). The full-class mean ($\mathrm{mIoU}_{\mathrm{all}}$), macro Dice coefficient, precision, recall, and per-class Intersection-over-Union (IoU) are reported as secondary metrics. All metrics are computed from a global confusion matrix accumulated over the entire test split; results across seeds are reported as mean ± standard deviation. Classes with zero ground-truth support in the evaluated split yield undefined values and are excluded from macro averages. Inference throughput in frames per second (FPS) is measured on the same CUDA-compatible GPU hardware used for training.
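The metric convention (global confusion matrix, per-class IoU, background excluded, zero-support classes skipped) can be written compactly in NumPy. The sketch below assumes that class 0 is the background and that rows of the confusion matrix index ground truth. The function names and the two toy images are hypothetical.

```python
import numpy as np


def confusion_matrix(y_true, y_pred, num_classes):
    """Confusion matrix with rows = ground truth and columns = prediction."""
    idx = num_classes * y_true.ravel().astype(np.int64) + y_pred.ravel().astype(np.int64)
    return np.bincount(idx, minlength=num_classes ** 2).reshape(num_classes, num_classes)


def iou_per_class(cm):
    """Per-class IoU; classes without ground-truth support become NaN."""
    tp = np.diag(cm).astype(np.float64)
    union = cm.sum(axis=0) + cm.sum(axis=1) - tp
    iou = np.full(cm.shape[0], np.nan)
    supported = cm.sum(axis=1) > 0
    iou[supported] = tp[supported] / union[supported]
    return iou


def miou_main(cm, background=0):
    """Foreground mIoU: mean IoU over supported non-background classes."""
    return float(np.nanmean(np.delete(iou_per_class(cm), background)))


# Two toy "images" with three classes; the matrix is accumulated over both.
truths = [np.array([[0, 0, 1], [0, 2, 2]]), np.array([[0, 1, 1], [0, 0, 0]])]
preds = [np.array([[0, 0, 1], [0, 2, 0]]), np.array([[0, 1, 0], [0, 0, 0]])]
cm = sum(confusion_matrix(t, p, 3) for t, p in zip(truths, preds))
print(iou_per_class(cm), miou_main(cm))
```

Accumulating one matrix over the whole split, rather than averaging per-image scores, means that images with many foreground pixels contribute proportionally more to the reported value.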

4.3. Ablation Study Design

Four training-protocol factors are ablated in isolation using U-Net with EfficientNet-B3 as the reference model, varying one factor at a time with all others fixed. Most factors use two seeds per variant; the pretraining factor uses three. The optimal setting from each factor is carried forward to the main benchmark. The four factors are: (1) loss function, (2) encoder family, (3) input resolution, and (4) ImageNet pretraining.

4.4. Oracle-Filter Analysis

The null conditioning result admits two interpretations: either component predictions are too noisy to carry useful information to the damage branch (hypothesis H1), or component layout is genuinely uninformative for damage (hypothesis H2). To distinguish between these hypotheses, a filter oracle evaluates damage segmentation on a subset of test images where component predictions are demonstrably reliable.
For each seed, the soft-detach model’s per-image foreground component mIoU ($\mathrm{mIoU}_{\mathrm{main}}^{\mathrm{comp}}$, where the superscript comp denotes the component task) is computed on each test image. Images are partitioned into a good-component subset ($\mathrm{mIoU}_{\mathrm{main}}^{\mathrm{comp}} \geq \tau$) and its complement at two reliability thresholds, $\tau \in \{0.7, 0.8\}$. Damage mIoU is then computed separately for each subset using pooled confusion matrices, to match the training evaluation convention. The soft-detach and damage-only models are compared on the good-component subset: if H1 were correct, the conditioned model should outperform the unconditioned model when component predictions are accurate.
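The filtering step can be sketched as follows, assuming per-image confusion matrices are already available for both tasks (rows = ground truth, class 0 = background). The function names and the two toy images are hypothetical; in practice the same partition would be applied once with the soft-detach damage matrices and once with the damage-only matrices.

```python
import numpy as np


def foreground_miou(cm, background=0):
    """Mean IoU over non-background classes that have ground-truth support."""
    tp = np.diag(cm).astype(np.float64)
    union = cm.sum(axis=0) + cm.sum(axis=1) - tp
    fg = [c for c in range(cm.shape[0]) if c != background and cm[c].sum() > 0]
    if not fg:
        return float("nan")
    return float(np.mean([tp[c] / union[c] for c in fg]))


def oracle_filter(comp_cms, damage_cms, tau):
    """Pool damage confusion matrices separately for reliable and other images."""
    # NaN (no foreground component pixels) compares False and falls in "rest".
    good = [foreground_miou(cm) >= tau for cm in comp_cms]
    pooled = {True: np.zeros_like(damage_cms[0]), False: np.zeros_like(damage_cms[0])}
    for cm, flag in zip(damage_cms, good):
        pooled[flag] += cm
    return {
        "n_good": sum(good),
        "good": foreground_miou(pooled[True]),
        "rest": foreground_miou(pooled[False]),
    }


# Image A: perfect component prediction. Image B: poor component prediction.
comp_cms = [np.diag([10, 5, 5]), np.array([[10, 0, 0], [4, 1, 0], [5, 0, 0]])]
damage_cms = [np.array([[90, 0, 0], [2, 6, 0], [1, 0, 1]]),
              np.array([[95, 0, 0], [3, 1, 0], [1, 0, 0]])]
print(oracle_filter(comp_cms, damage_cms, tau=0.7))
```

Under H1, the gap between the conditioned and unconditioned models should open up in the "good" entry when this function is run on their respective damage matrices with the same component-based partition.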

4.5. Synthetic-to-Real Transfer and Zero-Shot Baseline

The three single-task ConvNeXt-Seg component checkpoints (seeds 1–3) from the main benchmark are evaluated on 16 real viaduct photographs with hand-annotated component masks, using the same resize-pad-normalise preprocessing pipeline and no test-time augmentation. Because real photographs are available for the component task only, damage transfer is not evaluated on real images; the reported domain gap is therefore specific to structural component segmentation.
To establish a training-free lower bound, the GroundingDINO + SAM pipeline (GroundedSAM) [42,43] is applied to the same 16 photographs in a zero-shot setting. GroundingDINO receives text prompts describing each structural class and outputs bounding boxes; SAM ViT-H generates pixel-level masks for each box. Five prompt variants are evaluated and the best-performing prompt set is used for the comparative results reported in Table 13 (Section 5.6).

5. Results

This section first presents the completed ablation studies that establish the best training recipe, then reports damage-segmentation performance on the synthetic Tokaido test split for the available single-task baselines, examines whether component-aware joint modelling improves damage localisation, and finally assesses how well component segmentation transfers to real photographs.
Unless noted otherwise, values reported in the form a ± b in this section denote mean ± one standard deviation over repeated runs with different random seeds.

5.1. Ablation Studies

The four recipe-selection ablations (loss, encoder, resolution, pretraining) use U-Net with EfficientNet-B3 as the controlled baseline model, varying one factor at a time while keeping all other settings fixed (512×896 resolution, ImageNet-pretrained encoder, ce_dice loss). Most factors use two seeds per variant; the pretraining factor uses three. This single-architecture design is standard practice: the ablation findings establish the best training recipe, which is then applied uniformly to all seven architectures in the main benchmark — the ablation results themselves are not part of the architecture comparison.

Loss function.

Table 2 compares six loss functions on the damage task. Weighted cross-entropy plus Dice (weighted_ce_dice) achieves the highest foreground mIoU (0.535), followed by Tversky (0.526) and unweighted cross-entropy plus Dice (ce_dice, 0.523). Plain cross-entropy (ce, 0.491) and the focal variants (focal_tversky, 0.498; focal_ce, 0.467) rank lowest. For the focal variants, a likely cause is that the focal exponent, while suppressing the already-dominant Nondamage class, also reduces the gradient signal on the rare ExposedRebar class. Based on these results, weighted_ce_dice is adopted as the default loss for all subsequent single-task experiments. The 0.068 gap between the best and the worst loss indicates that imbalance management is a major tuning lever for this task under the fixed reference architecture.

Encoder backbone.

Table 3 shows a clear capacity gradient across SMP-compatible encoders. EfficientNet-B5 attains the highest foreground mIoU (0.547), followed by B3 (0.526), B0 (0.488), and ResNet-50 (0.418). The 0.129 gap between EfficientNet-B5 and ResNet-50 confirms that encoder width and depth matter substantially for this task. EfficientNet-B3 is retained as the default in the main benchmark to balance accuracy and computational cost; B5 is noted as the stronger option when inference latency and memory are not a constraint. The consistent ranking across encoders confirms that representational capacity is the dominant factor, independent of decoder design or loss function.

Input resolution.

Table 4 reveals a monotonic improvement with resolution: foreground mIoU increases from 0.398 at 256 × 448 to 0.549 at 640 × 1120, a gain of 0.151 over the tested range. The sharpest step occurs between 384 × 672 (0.464) and 512 × 896 (0.536), suggesting that crack and rebar textures are partially lost below the 512 × 896 threshold. The standard benchmark resolution is 512 × 896. This sensitivity underscores the fact that fine-scale crack and rebar features require sufficient spatial detail, a baseline condition that the full benchmark satisfies.

ImageNet pretraining.

Table 5 quantifies the impact of encoder initialisation. Pretrained weights yield foreground mIoU 0.519 versus 0.264 from random initialisation, a relative drop of almost 50%. Figure 3 confirms visually that the scratch model produces fragmented, noisy predictions with poor class separation, while the pretrained model already localises damage reliably. Pretrained encoders are used throughout all experiments.

5.2. Damage Segmentation on the Synthetic Test Split

All seven architectures are trained using the recipe identified in the ablation studies: weighted_ce_dice loss, EfficientNet-B3 encoder for the three SMP-based models (U-Net, UNet++, DeepLabV3+), 512 × 896 input resolution, and ImageNet-pretrained encoder weights. Table 6 reports mean test metrics per architecture (weighted CE + Dice loss, seeds 1–3). SegFormer achieves the highest foreground $\mathrm{mIoU}_{\mathrm{main}}$ (0.569 ± 0.013) and macro Dice (0.815), closely followed by ConvNeXt-Seg (0.562 ± 0.006, 0.811). The two lightweight EfficientNet-based decoders, UNet++ and U-Net, form a second tier in the range 0.533–0.538. DeepLabV3+ is notably weaker (0.477) despite using the same EfficientNet-B3 encoder, reflecting the limitations of its atrous spatial pyramid pooling (ASPP)-based decoder for fine-grained damage textures. ViT-Seg underperforms all CNN models (0.362) in spite of its large parameter count, consistent with the known sensitivity of plain ViT architectures to dataset size and high-resolution fine-scale features. Swin-UPerNet fails to converge meaningfully on the damage task ($\mathrm{mIoU}_{\mathrm{main}} = 0.021 \pm 0.011$): in all 12 runs, the model collapses to predicting the Nondamage class almost exclusively, with an ExposedRebar IoU of 0.000. This behaviour is consistent with the known instability of the Swin-UPerNet training pipeline under strong class imbalance, even with weighted losses.
Table 7 reports the mean per-class IoU for each architecture, averaged over all available runs (weighted CE + Dice; run counts match Table 6). The Nondamage class is near-perfectly solved (IoU around 0.99) across all converging models; the remaining challenge lies entirely in the two foreground classes. Concrete is generally the easier foreground class (0.359–0.602), while ExposedRebar is harder (0.366–0.535) because rebar pixels are sparse and visually thin. The gap between Concrete and ExposedRebar is small for strong models (SegFormer: 0.602 vs. 0.535) and vanishes for ViT-Seg (0.359 vs. 0.366), where neither class is well localised. Swin-UPerNet effectively predicts zero ExposedRebar IoU in all runs, confirming its collapse to the majority class.
The strongest individual checkpoint is SegFormer with Tversky loss (seed 3), reaching $\mathrm{mIoU}_{\mathrm{main}} = 0.583$, $\mathrm{mIoU}_{\mathrm{all}} = 0.720$, macro Dice = 0.823, and 19.2 FPS, with class-wise IoUs of 0.994 (Nondamage), 0.607 (Concrete), and 0.559 (ExposedRebar). The best weighted CE + Dice run overall is SegFormer seed 3 at foreground mIoU 0.577, followed closely by SegFormer seed 2 (0.576) and ConvNeXt-Seg seed 3 (0.574).
Qualitative results are shown in Figure 4.

5.3. Component Segmentation on the Synthetic Test Split

Table 8 reports component segmentation metrics on the synthetic test split for all seven architectures (weighted CE + Dice loss, seeds 1–3). Unlike the damage task, every architecture converges on the component task. Six of the seven models reach foreground mIoU main between 0.937 and 0.957: ConvNeXt-Seg leads at 0.957 ± 0.001, followed by SegFormer (0.951 ± 0.001); the three EfficientNet-based models cluster tightly in the 0.946–0.948 range; and ViT-Seg reaches 0.937 ± 0.0003. Swin-UperNet, while far behind on damage, still attains 0.832 ± 0.014 on components, indicating that its failure on the damage task is specific to the highly imbalanced damage label distribution rather than a general inability to learn structural features.
These high component scores motivate the joint-architecture ablation in Section 5.4: a component branch trained on this task provides reliable structural-layout estimates (component mIoU main of roughly 0.92–0.96), making the question of whether such estimates help damage detection well-posed and testable.

5.4. Component-Aware Joint Architecture Ablation

We explicitly test the hypothesis that component conditioning improves damage localisation beyond what a shared encoder and auxiliary component supervision already provide. All five conditioning variants (damage-only, parallel-heads, hard-mask, soft-detach, soft-full; see Section 3.5) use the ConvNeXt-Seg architecture (ConvNeXt-Base encoder, UPerNet decoder), which achieved the highest component mIoU among single-task models. Table 9 reports the test-set damage and component metrics for all five variants, averaged over three seeds. The table does not support the initial hypothesis. All five variants remain within a narrow damage-performance band, even though prior bridge-inspection studies report gains from cross-task interaction in other settings [4,5]. Tokaido therefore serves as a controlled counterexample to any universal claim that component conditioning automatically improves bridge-damage segmentation.
Qualitative predictions across the five variants are shown in Figure 5, and component-segmentation results are illustrated in Figure 6. Two observations stand out. First, all five variants attain very similar damage foreground mIoU, ranging from 0.553 to 0.561, a spread of 0.008. The best value, soft-detach at 0.5609 ± 0.0086, exceeds the damage-only baseline (0.5600 ± 0.0060) by only 0.0009, which is roughly five to twelve times smaller than the per-variant seed standard deviations (0.0045–0.0112). No conditioning variant therefore shows a statistically meaningful damage gain over the baseline. This is the central falsification result of the paper. Second, the component branch delivers consistently strong component segmentation across all four variants that include it, with component mIoU main between 0.916 and 0.930. Joint training therefore achieves high-quality component segmentation, and the null result cannot be explained by failure of the component task.
Relative to the damage-only variant (19.5 ms per image, 51.4 FPS), adding the component branch increases damage inference cost to 26.9–28.0 ms per image (35.8–37.2 FPS), i.e., a 38–44% latency increase without a measurable accuracy gain.
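Both comparisons above, the damage gain relative to seed noise and the latency overhead, can be rechecked from the reported summary statistics. The following is a minimal sketch, not the analysis used in the study: it assumes the reported ± values are sample standard deviations over three seeds and uses Welch's t statistic as one possible way to express the gain in units of seed noise. The function names are hypothetical.

```python
import math


def welch_t(mean_a, sd_a, mean_b, sd_b, n_a=3, n_b=3):
    """Welch's t statistic from summary statistics of two independent groups."""
    standard_error = math.sqrt(sd_a ** 2 / n_a + sd_b ** 2 / n_b)
    return (mean_a - mean_b) / standard_error


def relative_increase(baseline, value):
    """Fractional increase of `value` over `baseline`."""
    return (value - baseline) / baseline


# soft-detach vs. damage-only damage mIoU (mean, std over three seeds)
t_stat = welch_t(0.5609, 0.0086, 0.5600, 0.0060)

# latency of component-aware variants vs. the 19.5 ms damage-only baseline
overheads = [relative_increase(19.5, ms) for ms in (26.9, 28.0)]

print(f"t = {t_stat:.2f}")                        # t = 0.15
print(", ".join(f"{o:.0%}" for o in overheads))   # 38%, 44%
```

A t statistic of about 0.15 with three seeds per group is far below any conventional significance threshold, which is consistent with reading the 0.0009 difference as seed noise.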
In practical terms, the component-aware variants first estimate which structural component each pixel belongs to (for example, slab, beam, or column) and then allow the damage branch to use that information. If damage were strongly tied to component type, these variants should outperform the damage-only baseline. They do not. The evidence instead suggests that, in the Tokaido dataset, damage localisation depends mainly on local surface appearance cues such as cracking, spalling texture, and exposed rebar, rather than on whether the damaged pixel lies on a slab, beam, or column. Read in the context of the bridge-inspection literature, this is best understood as a dataset-dependent negative result rather than as a contradiction of all prior component-aware work: some datasets encode stronger correlations between structural role and damage type, whereas Tokaido does not.

Oracle-filter analysis.

To test whether imperfect component predictions are masking a genuine conditioning benefit, the test set is re-evaluated on subsets where the soft-detach component branch is reliable. At threshold τ = 0.8, the good-component subset contains 138–140 images depending on seed. On this subset, soft-detach achieves damage mIoU main = 0.522 ± 0.016, compared with 0.526 ± 0.010 for damage-only and 0.523 ± 0.017 for parallel-heads. The same pattern holds at the looser threshold τ = 0.7 (soft-detach 0.539 ± 0.010, damage-only 0.544 ± 0.019, soft-detach minus damage-only difference = −0.005). The null result therefore persists even when conditioning quality is explicitly filtered: with reliable component predictions, there is still no statistically meaningful improvement in damage localisation, which argues against unreliable component predictions as the explanation.
Notably, all three variants show the same damage-score drop on the good-component subset relative to its complement (Δ[good − bad] of approximately −0.05 to −0.06). This is a dataset-level structural effect rather than a consequence of conditioning: images where structural components are easily identifiable in the Tokaido synthetic test set tend to be geometrically simple, undamaged scenes with fewer foreground damage pixels, yielding intrinsically lower damage mIoU main. Readers should therefore not interpret the lower subset-level scores as evidence that conditioned models perform worse on easier images; the difference reflects the distribution of damage content across subsets, not a conditioning effect.
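The filtering logic can be sketched as follows. This is an illustration under stated assumptions, not the evaluation script: it assumes that reliability is measured by a per-image component score compared against τ, and it averages per-image damage scores for simplicity, whereas the study may pool pixels instead. The function name and the toy numbers are hypothetical.

```python
def oracle_filter(component_scores, damage_scores, tau):
    """Split per-image damage scores by whether the component score reaches tau."""
    good = [d for c, d in zip(component_scores, damage_scores) if c >= tau]
    bad = [d for c, d in zip(component_scores, damage_scores) if c < tau]

    def mean(values):
        return sum(values) / len(values) if values else float("nan")

    return {
        "n_good": len(good),
        "good": mean(good),
        "bad": mean(bad),
        "delta_good_minus_bad": mean(good) - mean(bad),
    }


# Hypothetical toy values for four test images (not data from the study)
component_scores = [0.95, 0.85, 0.60, 0.75]
damage_scores = [0.50, 0.54, 0.60, 0.58]
result = oracle_filter(component_scores, damage_scores, tau=0.8)
print(result)
```

Applying the same split to each variant and comparing the `good` entries isolates the conditioning effect from the subset effect captured by `delta_good_minus_bad`.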

5.5. Component Segmentation on Real Photographs

To assess synthetic-to-real transfer of the component model, the three single-task ConvNeXt-Seg component checkpoints from seeds 1–3 are evaluated on the 16 real viaduct photographs with hand-annotated component masks. Inference uses the same resize-pad-normalise pipeline as the synthetic evaluation; results are reported as global pooled IoU (matching the training metric).
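The distinction between a globally pooled IoU and a per-image average matters on a small test set with uneven class occupancy. The sketch below assumes the usual meaning of pooling, namely that intersections and unions are summed over all images before dividing; the function names and the toy label lists are hypothetical.

```python
import math


def pooled_iou(pairs, class_id):
    """IoU for one class with intersection and union summed over all images."""
    intersection = union = 0
    for prediction, target in pairs:
        for p, t in zip(prediction, target):
            p_hit, t_hit = p == class_id, t == class_id
            intersection += p_hit and t_hit
            union += p_hit or t_hit
    return intersection / union if union else float("nan")


def mean_image_iou(pairs, class_id):
    """Average of per-image IoUs, skipping images where the class is absent."""
    scores = [pooled_iou([pair], class_id) for pair in pairs]
    scores = [s for s in scores if not math.isnan(s)]
    return sum(scores) / len(scores)


# Two toy images as flattened label lists (prediction, target)
pairs = [
    ([1, 1, 1, 1, 0, 0], [1, 1, 1, 0, 0, 0]),  # IoU 3/4 for class 1
    ([1, 0, 0, 0, 0, 0], [0, 1, 0, 0, 0, 0]),  # IoU 0/2 for class 1
]
print(pooled_iou(pairs, 1), mean_image_iou(pairs, 1))  # 0.5 0.375
```

Pooling weights each image by the number of pixels the class occupies, so images with large slabs or columns dominate the pooled score.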
The same three single-task ConvNeXt-Seg checkpoints achieve 0.957 ± 0.001 on the synthetic test split (Table 8) and 0.424 ± 0.064 on the real photographs (Table 10), giving an absolute domain gap of 0.533. Because the real-photo evaluation reuses the exact benchmark checkpoints, this gap is a matched model comparison for the component task rather than a cross-model comparison. The per-class breakdown shows the following patterns.
Column generalises best (mean IoU = 0.665 , standard deviation = 0.066 ): columns have a distinctive elongated vertical geometry and uniform material appearance that transfer from the photo-realistic Tokaido renders.
Slab is the highest-variance class (IoU range 0.46–0.73, standard deviation = 0.118), indicating that seed-to-seed variation in the synthetic optimum translates to large differences in cross-domain performance for this class. Slab occupancy varies substantially across real images, amplifying inter-seed variability.
Beam achieves moderate IoU (mean = 0.427 , standard deviation = 0.016 ), consistent across seeds. Beam appearances vary considerably with bridge type and viewing angle, making generalisation harder than for columns.
Nonstructural is near zero (mean IoU = 0.042 ), as expected: this class aggregates heterogeneous elements (railings, cables, signage) that are underrepresented in training and visually diverse in the field.
Nonbridge IoU is trivially low ( 0.01 ), reflecting the small amount of padding-induced background introduced by resize-pad preprocessing. This class is uninformative in the real evaluation.
Overall, the component model demonstrates meaningful cross-domain generalisation for structurally salient classes (Column, Slab) while showing limited transfer for heterogeneous or underrepresented classes. Closing the domain gap would require domain-adaptive training or fine-tuning on annotated real imagery.

5.6. Zero-Shot GroundedSAM Baseline

To provide a training-free lower bound for real-image structural segmentation, we evaluate the GroundingDINO + SAM pipeline [42,43] (“GroundedSAM”) in a zero-shot setting on the same 16 real viaduct photographs used in Section 5.5. The real photographs carry component annotations only; no real-image damage annotations are available, so this evaluation covers the structural task exclusively. This experiment should not be read as a full evaluation of SAM-based inspection methods. It isolates what an off-the-shelf prompt-driven pipeline can deliver without task-specific adaptation. Trainable or refined SAM-based frameworks can be materially stronger in structural-inspection settings [27,28]; the present comparison is therefore intended only as a zero-shot lower bound.
Zero-shot means that the model is applied directly to the target task without any task-specific training or fine-tuning: it generalises purely from its large-scale pre-training on diverse image–text data. The GroundedSAM pipeline operates in two stages. First, GroundingDINO [43] is a vision–language model that takes a natural-language text prompt as input (e.g., “pillar. concrete pillar. support column”) and outputs bounding boxes in the image that spatially correspond to the described concept, together with a confidence score. Second, the Segment Anything Model (SAM) [42] takes each bounding box as a spatial prompt and produces a pixel-precise segmentation mask for the corresponding region. The two stages are therefore complementary: GroundingDINO localises “what” in the image matches the text, and SAM delineates “where” exactly the boundary lies. SAM ViT-H (the highest-capacity SAM variant) is used throughout. No task-specific training is performed; the only design choices are the text prompts that describe each foreground class.
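The control flow of the two-stage pipeline can be sketched with placeholder callables. Everything below is an assumption made for illustration: `detect` and `segment` stand in for the text-prompted detector and the box-prompted segmenter and are not the GroundingDINO or SAM APIs, the 0.3 box threshold is arbitrary, and resolving overlaps by highest detection score is one possible merging rule, not necessarily the one used in the study.

```python
from typing import Callable, Dict, List, Tuple

Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1), upper bounds exclusive
Mask = List[List[bool]]


def two_stage_segment(
    image: List[List[int]],
    prompts: Dict[int, str],
    detect: Callable[[List[List[int]], str], List[Tuple[Box, float]]],
    segment: Callable[[List[List[int]], Box], Mask],
    box_threshold: float = 0.3,  # hypothetical value
) -> List[List[int]]:
    """Stage 1: text prompt -> scored boxes. Stage 2: box -> mask -> label map."""
    height, width = len(image), len(image[0])
    labels = [[0] * width for _ in range(height)]  # 0 = unassigned
    best = [[0.0] * width for _ in range(height)]
    for class_id, prompt in prompts.items():
        for box, score in detect(image, prompt):
            if score < box_threshold:
                continue
            mask = segment(image, box)
            for y in range(height):
                for x in range(width):
                    # Overlaps go to the detection with the higher score
                    if mask[y][x] and score > best[y][x]:
                        labels[y][x], best[y][x] = class_id, score
    return labels


# Toy stand-ins so the sketch runs without any model weights
COLUMN_PROMPT = "pillar. concrete pillar. support column"


def toy_detect(image, prompt):
    boxes = {COLUMN_PROMPT: [((0, 0, 2, 2), 0.9)], "beam": [((1, 1, 3, 3), 0.5)]}
    return boxes.get(prompt, [])


def toy_segment(image, box):
    x0, y0, x1, y1 = box
    return [[x0 <= x < x1 and y0 <= y < y1 for x in range(len(image[0]))]
            for y in range(len(image))]


image = [[0] * 3 for _ in range(3)]
labels = two_stage_segment(image, {1: COLUMN_PROMPT, 2: "beam"}, toy_detect, toy_segment)
print(labels)  # [[1, 1, 0], [1, 1, 2], [0, 2, 2]]
```

Passing the two stages as callables keeps the sketch independent of any particular model release; in a real run they would wrap the detector and the ViT-H segmenter.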

Prompt sensitivity.

Because GroundingDINO is sensitive to prompt wording, five prompt variants are tested for the structural task (Table 11 and Table 12). The baseline prompts use extended bridge-specific phrases; the remaining variants progressively simplify or replace the vocabulary.
The results reveal nuanced prompt sensitivity (Table 12). The common-vocabulary variant v3 achieves the highest overall mIoU main ( 0.250 ), followed by v2 single terms ( 0.219 ) and the bridge-specific baseline v0 ( 0.195 ). The descriptive-geometry variant v1 ( 0.106 ) and the combined extended phrases v4 ( 0.118 ) are substantially weaker, suggesting that verbose multi-component descriptions impair GroundingDINO’s grounding even when the individual terms are semantically accurate. Prompt conciseness appears to matter as much as vocabulary choice: the two weakest variants are precisely those with the longest, most composite prompt strings. The common-vocabulary variant v3 is used as the best structural prompt set in the benchmark results below. Figure 8 shows qualitative predictions for all tested variants.

Zero-shot benchmark results.

Table 13 reports GroundedSAM performance on the structural task alongside the trained ConvNeXt-Seg component model from Section 5.5. Real-damage annotations are not available for the 16 real photographs; the GroundedSAM evaluation therefore covers the structural task only.
Table 13. GroundedSAM zero-shot structural evaluation on 16 real photographs (prompt variant v3, common vocabulary). The trained-model row is the mean over three seeds from Table 10. mIoU main excludes Nonbridge.
| Model | Slab | Beam | Column | Nonstructural | mIoU main |
|---|---|---|---|---|---|
| Structural task | | | | | |
| GroundedSAM (zero-shot, v3) | 0.338 | 0.133 | 0.491 | 0.038 | 0.250 |
| ConvNeXt-Seg (trained, mean) | 0.562 | 0.427 | 0.665 | 0.042 | 0.424 |
GroundedSAM achieves mIoU main = 0.250 on structural components without any task-specific training, establishing a zero-shot lower bound for real-image structural segmentation. These numbers should be interpreted narrowly: they quantify the performance of a prompt-based zero-shot GroundingDINO + SAM workflow on our real-image subset, not the ceiling of SAM-derived methods for infrastructure inspection. Figure 7 provides a qualitative comparison of both models on real photographs. The trained component model outperforms GroundedSAM on all structural classes, with the largest absolute advantage on Beam (0.427 vs 0.133) and Slab (0.562 vs 0.338), a smaller one on Column (0.665 vs 0.491), and a negligible one on Nonstructural (0.042 vs 0.038). The overall trained-vs-zero-shot gap is 0.174 in mIoU main.
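The aggregate values in Table 13 can be rechecked from the per-class entries. The sketch below assumes that mIoU main is the unweighted mean of the four foreground class IoUs, which is consistent with the reported numbers; the variable names are illustrative.

```python
classes = ["Slab", "Beam", "Column", "Nonstructural"]
zero_shot = [0.338, 0.133, 0.491, 0.038]   # GroundedSAM, prompt variant v3
trained = [0.562, 0.427, 0.665, 0.042]     # ConvNeXt-Seg, mean over three seeds


def mean(values):
    return sum(values) / len(values)


# Unweighted foreground means (Nonbridge excluded, as in the table)
miou_zero_shot = mean(zero_shot)
miou_trained = mean(trained)

# Per-class advantage of the trained model over the zero-shot baseline
per_class_gap = {c: round(t - z, 3) for c, t, z in zip(classes, trained, zero_shot)}

print(f"{miou_zero_shot:.3f} {miou_trained:.3f} {miou_trained - miou_zero_shot:.3f}")
print(per_class_gap)
```

The per-class dictionary makes it easy to see where task-specific training helps most and where neither approach transfers.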

6. Discussion

The benchmark separates two questions that are often conflated in inspection segmentation: which architecture is the strongest for a given task, and whether structural component information is useful for damage localisation. The architecture comparison shows that these answers are task-dependent. SegFormer gives the strongest damage segmentation result, while ConvNeXt-Seg gives the strongest component segmentation result. This difference is consistent with the visual character of the two tasks. Damage segmentation is dominated by sparse, thin, texture-level foreground classes, where the SegFormer-style multi-scale representation is competitive despite its moderate parameter count. Component segmentation is closer to scene parsing: slabs, beams, and columns occupy larger coherent regions, and the ConvNeXt-Seg model captures this structural layout more effectively on the synthetic test split. The practical implication is that a single architecture ranking is not sufficient for inspection systems that must solve both component and damage segmentation.
The component-aware ablation provides the central negative result of this paper. Adding a component branch, feeding a hard predicted component mask or detached component features into the damage segmentation module, or allowing full end-to-end feature conditioning did not produce a measurable damage gain over the damage-only baseline. The observed spread across variants is smaller than the run-to-run variation reported for the same variants, and the best conditioned model improves the baseline by only 0.0009 foreground mIoU. The oracle-filter analysis makes the interpretation more specific: even on images where component predictions are reliable, the conditioned model does not outperform the unconditioned model. This argues against the explanation that component-aware conditioning failed simply because the component branch was inaccurate. The more plausible interpretation for this dataset is that the damage labels are driven mainly by local surface appearance, such as cracking and exposed rebar texture, rather than by whether the pixel lies on a slab, beam, or column.
This negative result should be read as dataset- and label-specific. It does not contradict bridge-inspection studies where joint component and damage learning improves performance, because those studies do not necessarily isolate explicit component conditioning from shared supervision and may involve damage categories that correlate more strongly with structural role. The present design tests a narrower claim: whether a damage segmentation model that already receives the original image as input benefits from being provided with predicted component information under a controlled Tokaido benchmark. For this claim, the evidence is consistently negative. The tested fusion module is intentionally simple (a single local Conv-BN-ReLU block that injects masks or feature maps at decoder resolution), so the result falsifies the hypothesis about damage-segmentation gains specifically for this controlled local-fusion design, not for richer conditioning architectures such as cross-attention or spatial routing. The cost side reinforces the same conclusion: the component-aware variants increase inference latency by 38–44% relative to the damage-only model while leaving damage accuracy effectively unchanged.
The real-photograph component experiment shows a different but equally important boundary. The trained ConvNeXt-Seg component model outperforms the zero-shot GroundedSAM baseline on the 16 annotated real photographs, indicating that synthetic task-specific training transfers useful structural information. At the same time, the drop from synthetic component performance to real-photo component performance is large, and the weakest real-photo class is the heterogeneous Nonstructural category. The zero-shot result should therefore be interpreted as a lower bound for prompt-based segmentation without adaptation, not as a ceiling for foundation-model approaches. Adapted SAM-style models or fine-tuned component segmenters may close part of this gap, but the current experiment shows that synthetic training alone does not remove the domain shift.

Progress relative to prior work.

The present benchmark substantially extends the earlier study [13], which evaluated only U-Net and DeepLabV3+ on the same Tokaido dataset at a lower resolution of 320 × 160 pixels using a custom Keras/TensorFlow implementation without repeated-seed evaluation or systematic loss ablation. Table 14 summarises the key quantitative improvements.
The two studies differ in resolution, class count, evaluation split, framework, and architecture scope, so no single factor can be isolated as the cause of any difference; the values in Table 14 are therefore descriptive rather than causal. On the component task, the previous study reported validation mIoU of 76% (U-Net) and 87% (DeepLabV3+) over four classes at 320 × 160. The present benchmark, evaluated on the held-out test set at 512 × 896 with five classes and repeated seeds, reports mIoU main of 0.947 (U-Net) and 0.946 (DeepLabV3+), with the best architecture (ConvNeXt-Seg) reaching 0.957. On the damage task, the previous study reported per-class IoU of 48% for Cracks and 44% for Reinforcement (U-Net, Tversky loss); the best single checkpoint here (SegFormer, Tversky loss, seed 3) achieves damage mIoU main = 0.583, and the SegFormer mean over three seeds with weighted CE + Dice loss is 0.569. The metrics are not directly comparable (per-class IoU versus foreground mIoU, validation versus test), so these figures serve as historical context only. On real-image transfer, the prior study reported a four-class foreground mean of approximately 18%; the present component mIoU main = 0.424 on the same 16 photographs reflects simultaneous changes in class set, resolution, and training protocol, making direct comparison difficult.
The main limitation on the conclusions of the present study is scope. The conditioning conclusion is restricted to the Tokaido synthetic damage labels and to the tested ConvNeXt-based conditioning mechanisms. It should not be generalized to all bridge datasets or all component-aware designs, especially where damage type is physically tied to structural role. The real-photo evaluation is also restricted to component segmentation on 16 annotated photographs; no real-image damage masks are available in this study, so real-world damage transfer cannot be assessed. Future work should therefore focus on two targeted extensions: real-image fine-tuning or domain adaptation for component segmentation, and conditioning tests on datasets where component labels and damage labels have stronger physical coupling. These extensions would test whether the reported null result follows from the properties of the Tokaido dataset or is a broader limitation of explicit component conditioning for fine-scale damage segmentation.

7. Conclusion

This study evaluated structural component and damage segmentation for railway viaduct inspection under a controlled synthetic-data benchmark. Seven segmentation architectures were compared on the Tokaido dataset, and a separate component-aware ablation tested whether explicit component information improves damage localisation. The strongest damage segmentation model was SegFormer, with foreground mIoU main = 0.569 ± 0.013 , while the strongest component segmentation model was ConvNeXt-Seg, with foreground mIoU main = 0.957 ± 0.001 . These task-dependent rankings show that architecture choice cannot be reduced to a single best model across structural and damage segmentation.
The central result is that explicit component conditioning did not improve damage segmentation in this setting. Across the five ConvNeXt-based component-aware variants, the full range of foreground damage mIoU was 0.008 , and the best conditioned variant improved over the damage-only baseline by only 0.0009 . The oracle-filter analysis reached the same conclusion on images where component predictions were reliable: conditioned and unconditioned damage models remained statistically indistinguishable. This evidence supports a narrow but important conclusion: for the Tokaido damage labels tested here, component identity does not provide a measurable localisation benefit beyond the image evidence already available to a damage model. It does not imply that component-aware inspection is generally ineffective; rather, the benefit depends on whether the dataset encodes a meaningful relationship between structural element type and damage occurrence.
The synthetic-to-real evaluation showed both the value and the limit of task-specific training. A ConvNeXt-Seg component model trained on synthetic Tokaido imagery reached mIoU main = 0.424 ± 0.064 on 16 real viaduct photographs, compared with 0.250 for the zero-shot GroundedSAM baseline. At the same time, the drop from 0.957 ± 0.001 on synthetic component segmentation to 0.424 ± 0.064 on real photographs quantifies a substantial domain gap. Future work should therefore prioritise real-image adaptation, larger real-photo test sets, and datasets where component labels and damage labels have stronger physical coupling. The benchmark and ablation results reported here provide reference baselines for that next stage, while showing that component-aware conditioning should be treated as an empirical design choice rather than an assumed source of improvement.

Author Contributions

Conceptualization, Bartłomiej Błachowski and Piotr Tauzowski; methodology, Piotr Tauzowski; software, Piotr Tauzowski; validation, Bartłomiej Błachowski and Paweł Hołobut; formal analysis, Paweł Hołobut; investigation, Bartłomiej Błachowski; data curation, Piotr Tauzowski; writing—original draft preparation, Piotr Tauzowski; writing—review and editing, Piotr Tauzowski and Paweł Hołobut; visualization, Piotr Tauzowski; supervision, Bartłomiej Błachowski. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The Tokaido dataset used here is publicly available. Trained model checkpoints and evaluation scripts are available from the corresponding author upon reasonable request.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Chu, H.; Wang, W.; Deng, L. Tiny-Crack-Net: A multiscale feature fusion network with attention mechanisms for segmentation of tiny cracks. Computer-Aided Civil and Infrastructure Engineering 2022. [CrossRef]
  2. Narazaki, Y.; Hoskere, V.; Hoang, T.A.; Spencer, B.F., Jr. Automated Vision-Based Bridge Component Extraction Using Multiscale Convolutional Neural Networks. In Proceedings of the 3rd Huixian International Forum on Earthquake Engineering for Young Researchers, Urbana-Champaign, IL, USA, 2017.
  3. Hoskere, V.; Narazaki, Y.; Hoang, T.A.; Spencer, B.F., Jr. MaDnet: Multi-Task Semantic Segmentation of Multiple Types of Structural Materials and Damage in Images of Civil Infrastructure. Journal of Civil Structural Health Monitoring 2020, 10, 757–773. [CrossRef]
  4. Zhang, C.; Karim, M.M.; Qin, R. A Multitask Deep Learning Model for Parsing Bridge Elements and Segmenting Defect in Bridge Inspection Images. Transportation Research Record 2022, pp. 1–11.
  5. Zhang, C.; Yin, Z.; Qin, R. Attention-Enhanced Co-Interactive Fusion Network (AECIF-Net) for Automated Structural Condition Assessment in Visual Inspection. arXiv preprint arXiv:2307.07643 2024.
  6. Saida, T.; Rashid, M.; Nemoto, Y.; Tsukamoto, S.; Asai, T.; Nishio, M. CNN-Based Segmentation Frameworks for Structural Component and Earthquake Damage Determinations Using UAV Images. Earthquake Engineering and Engineering Vibration 2023, 22, 359–369. [CrossRef]
  7. Bai, Z.; Zou, D.; Liu, T.; Li, K.; Luo, W.; Liao, H.; Zhou, A. Component-Aware Post-Earthquake Damage Recognition for RC Structures Using Instance Segmentation and Oriented Bounding Box Detection. Construction and Building Materials 2025, 491, 142718. [CrossRef]
  8. Flotzinger, J.; Rösch, P.J.; Benz, C.; Ahmad, M.; Cankaya, M.; Mayer, H.; Rodehorst, V.; Oswald, N.; Braml, T. dacl-challenge: Semantic Segmentation during Visual Bridge Inspections. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision Workshops (WACVW), 2024.
  9. Flotzinger, J.; Rösch, P.J.; Braml, T. dacl10k: Benchmark for Semantic Bridge Damage Segmentation. arXiv preprint arXiv:2309.00460 2023.
  10. Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal Loss for Dense Object Detection. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2017, pp. 2980–2988.
  11. Salehi, S.S.M.; Erdogmus, D.; Gholipour, A. Tversky Loss Function for Image Segmentation Using 3D Fully Convolutional Deep Networks. In Proceedings of the Machine Learning in Medical Imaging. Springer, 2017, Vol. 10541, Lecture Notes in Computer Science, pp. 379–387.
  12. Abraham, N.; Khan, N.M. A Novel Focal Tversky Loss Function with Improved Attention U-Net for Lesion Segmentation. In Proceedings of the 2019 IEEE 16th International Symposium on Biomedical Imaging (ISBI 2019), 2019, pp. 683–687.
  13. Tauzowski, P.; Ostrowski, M.; Bogucki, D.; Jarosik, P.; Błachowski, B. Structural Component Identification and Damage Localization of Civil Infrastructure Using Semantic Segmentation. Sensors 2025, 25, 4698. [CrossRef]
  14. Hamishebahar, Y.; Guan, H.; So, S.; Jo, J. A Comprehensive Review of Deep Learning-Based Crack Detection Approaches. Applied Sciences 2022, 12, 1374. [CrossRef]
  15. Ai, D.; Jiang, G.; Lam, S.K.; He, P.; Li, C. Computer Vision Framework for Crack Detection of Civil Infrastructure—A Review. Engineering Applications of Artificial Intelligence 2023, 117, 105478. [CrossRef]
  16. Nguyen, S.D.; Tran, T.S.; Tran, V.P.; Lee, H.J.; Piran, M.J.; Le, V.P. Deep Learning-Based Crack Detection: A Survey. International Journal of Pavement Research and Technology 2023, 16, 943–967. [CrossRef]
  17. Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional Networks for Biomedical Image Segmentation. In Proceedings of the Medical Image Computing and Computer-Assisted Intervention (MICCAI), 2015.
  18. Zhou, Z.; Siddiquee, M.M.R.; Tajbakhsh, N.; Liang, J. UNet++: A Nested U-Net Architecture for Medical Image Segmentation. In Proceedings of the Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support. Springer, 2018, Vol. 11045, Lecture Notes in Computer Science, pp. 3–11.
  19. Chen, L.C.; Zhu, Y.; Papandreou, G.; Schroff, F.; Adam, H. Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), 2018.
  20. Wang, J.; Ueda, T. Automatic Damage Detection and Segmentation Using Deep Learning Algorithms in Reinforced Concrete Structure Inspections. Structural Concrete 2025, 26, 5511–5534. [CrossRef]
  21. Liu, G.; Ding, W.; Shu, J.; Strauss, A.; Duan, Y. Two-Stream Boundary-Aware Neural Network for Concrete Crack Segmentation and Quantification. Structural Control and Health Monitoring 2023, 2023, 3301106. [CrossRef]
  22. Xu, S.; Shen, R.; Liu, Y.; Song, Y.; Xin, R.; Liu, E.; Shi, Y. Cross-Domain Coupled Convolutional Transformer Network for Concrete Damage Detection. Structural Control and Health Monitoring 2025, 2025, 6547856. [CrossRef]
  23. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image Is Worth 16x16 Words: Transformers for Image Recognition at Scale. In Proceedings of the International Conference on Learning Representations (ICLR), 2021.
  24. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin Transformer: Hierarchical Vision Transformer Using Shifted Windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2021.
  25. Xie, E.; Wang, W.; Yu, Z.; Anandkumar, A.; Alvarez, J.M.; Luo, P. SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), 2021.
  26. Liu, Z.; Mao, H.; Wu, C.Y.; Feichtenhofer, C.; Darrell, T.; Xie, S. A ConvNet for the 2020s. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022.
  27. Eltouny, K.; Sajedi, S.; Liang, X. Dmg2Former-AR: Vision Transformers with Adaptive Rescaling for High-Resolution Structural Visual Inspection. Sensors 2024, 24, 6007. [CrossRef]
  28. Azimi, M.; Yang, T.Y. Transformer-Based Framework for Accurate Segmentation of High-Resolution Images in Structural Health Monitoring. Computer-Aided Civil and Infrastructure Engineering 2024, 39, 3670–3684. [CrossRef]
  29. Thisanke, H.; Deshan, C.; Chamith, K.; Seneviratne, S.; Vidanaarachchi, R.; Herath, D. Semantic Segmentation Using Vision Transformers: A Survey. Engineering Applications of Artificial Intelligence 2023, 126, 106669. [CrossRef]
  30. Zhang, L.; Lu, J.; Zheng, S.; Zhao, X.; Zhu, X.; Fu, Y.; Xiang, T.; Feng, J.; Torr, P.H.S. Vision Transformers: From Semantic Segmentation to Dense Prediction. International Journal of Computer Vision 2024, 132, 6142–6162. [CrossRef]
  31. Shamsabadi, E.A.; Xu, C.; Rao, A.S.; Nguyen, T.; Ngo, T.; da Costa, D.D. Vision Transformer-Based Autonomous Crack Detection on Asphalt and Concrete Surfaces. Automation in Construction 2022, 140, 104316. [CrossRef]
  32. Hadinata, P.N.; Simanta, D.; Eddy, L.; Nagai, K. Multiclass Segmentation of Concrete Surface Damages Using U-Net and DeepLabV3+. Applied Sciences 2023, 13, 2398. [CrossRef]
  33. Chu, H.; Deng, L.; Yuan, H.; Long, L.; Guo, J. A transformer and self-cascade operation-based architecture for segmenting high-resolution bridge cracks. Automation in Construction 2024, 158, 105194. [CrossRef]
  34. Fan, Y.; Hu, Z.; Li, Q.; Sun, Y.; Chen, J.; Zhou, Q. CrackNet: A Hybrid Model for Crack Segmentation with Dynamic Loss Function. Sensors 2024, 24, 7134. [CrossRef]
  35. Wang, W.; Su, C. Transformer-Based Crack Segmentation for Concrete Structures in Complex Scenarios. Structural Concrete 2026, 27, 474–491. [CrossRef]
  36. Zhong, J.; Fan, Y.; Zhao, X.; Zhou, Q.; Xu, Y. Multi-Type Structural Damage Image Segmentation via Dual-Stage Optimization-Based Few-Shot Learning. Smart Cities 2024, 7, 1888–1906. [CrossRef]
  37. Li, Y.; Wang, Y.; Zhao, H.; Gu, H.; Yu, Y.; Bao, T.; Wei, Y.; Xiang, Z. Automated Defect Segmentation and Quantification in Concrete Structures via Unmanned Aerial Vehicle-Based Lightweight Deep Learning. Computer-Aided Civil and Infrastructure Engineering 2025, 40, 4465–4484. [CrossRef]
  38. Sun, L.; Yang, Y.; Zhou, G.; Chen, A.; Zhang, Y.; Cai, W.; Li, L. An Integration–Competition Network for Bridge Crack Segmentation under Complex Scenes. Computer-Aided Civil and Infrastructure Engineering 2024, 39, 617–634. [CrossRef]
  39. Dang, M.; Wang, H.; Nguyen, T.H.; Tighitz, L.; Tien, L.D.; Nguyen, T.N.; Nguyen, N.P. CDD-TR: Automated concrete defect investigation using an improved deformable transformers. Journal of Building Engineering 2023, 75, 106976. [CrossRef]
  40. Liu, C.; Wang, P.; Wang, X.; Miao, J. Autonomous damage segmentation of post-fire reinforced concrete structural components. Advanced Engineering Informatics 2024, p. 102498. [CrossRef]
  41. Żarski, M.; Wójcik, B.; Miszczak, J.A.; Błachowski, B.; Ostrowski, M. Computer Vision Based Inspection on Post-Earthquake with UAV Synthetic Dataset. IEEE Access 2022, 10. [CrossRef]
  42. Kirillov, A.; Mintun, E.; Ravi, N.; Mao, H.; Rolland, C.; Gustafson, L.; Xiao, T.; Whitehead, S.; Berg, A.C.; Lo, W.Y.; et al. Segment Anything. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2023.
  43. Liu, S.; Zeng, Z.; Ren, T.; Li, F.; Zhang, H.; Yang, J.; Jiang, Q.; Li, C.; Yang, J.; Su, H.; et al. Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection. In Proceedings of the European Conference on Computer Vision (ECCV), 2024.
  44. Ye, Z.; Lovell, L.; Faramarzi, A.; Ninić, J. Sam-based instance segmentation models for the automation of structural damage detection. Advanced Engineering Informatics 2024, 62, 102826. [CrossRef]
  45. Rahman, A.U.; Hoskere, V. Instance Segmentation of Reinforced Concrete Bridge Point Clouds with Transformers Trained Exclusively on Synthetic Data. Automation in Construction 2025, 173, 106067. [CrossRef]
  46. Yao, Z.; Jiang, S.; Wang, S.; Wang, J.; Liu, H.; Narazaki, Y.; Cui, J.; Spencer, B.F., Jr. Intelligent Crack Identification Method for High-Rise Buildings Aided by Synthetic Environments. Structural Design of Tall and Special Buildings 2024, 33, e2117. [CrossRef]
  47. Pozzer, S.; Souza, M.P.V.D.; Hena, B.; Hesam, S.; Rezayie, R.K.; Azar, E.R.; Lopez, F.; Maldague, X. Effect of different imaging modalities on the performance of a CNN: An experimental study on damage segmentation in infrared, visible, and fused images of concrete structures. NDT & E International 2022, 133, 102709. [CrossRef]
  48. Alexander, Q.G.; Hoskere, V.; Narazaki, Y.; Maxwell, A.; Spencer, B.F., Jr. Fusion of thermal and RGB images for automated deep learning based crack detection in civil infrastructure. AI in Civil Engineering 2022, 1, 2. [CrossRef]
  49. Xiao, T.; Liu, Y.; Zhou, B.; Jiang, Y.; Sun, J. Unified Perceptual Parsing for Scene Understanding. In Proceedings of the European Conference on Computer Vision (ECCV), 2018.
  50. Loshchilov, I.; Hutter, F. Decoupled Weight Decay Regularization. In Proceedings of the International Conference on Learning Representations (ICLR), 2019.
  51. Long, J.; Shelhamer, E.; Darrell, T. Fully Convolutional Networks for Semantic Segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 3431–3440. [CrossRef]
  52. Chen, L.C.; Papandreou, G.; Schroff, F.; Adam, H. Rethinking Atrous Convolution for Semantic Image Segmentation. arXiv preprint arXiv:1706.05587 2017.
  53. Sun, K.; Xiao, B.; Liu, D.; Wang, J. Deep High-Resolution Representation Learning for Human Pose Estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 5693–5703. [CrossRef]
  54. Wang, W.; Xie, E.; Li, X.; Fan, D.P.; Song, K.; Liang, D.; Lu, T.; Luo, P.; Shao, L. Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2021, pp. 568–578.
  55. Narazaki, Y.; Hoskere, V.; Yoshida, K.; Spencer, B.F., Jr.; Fujino, Y. Synthetic Environments for Vision-Based Structural Condition Assessment of Japanese High-Speed Railway Viaducts. Mechanical Systems and Signal Processing 2021, 160, 107850. [CrossRef]
  56. Liu, T.; Zhang, L.; Zhou, G.; Cai, W.; Cai, C.; Li, L. BC-DUnet-Based Segmentation of Fine Cracks in Bridges under a Complex Background. PLOS ONE 2022, 17, e0265258. [CrossRef]
  57. Zhou, Y.; Ali, R.; Mokhtar, N.; Harun, S.W.; Iwahashi, M. MixSegNet: A Novel Crack Segmentation Network Combining CNN and Transformer. IEEE Access 2024. [CrossRef]
  58. Milletari, F.; Navab, N.; Ahmadi, S.A. V-Net: Fully Convolutional Neural Networks for Volumetric Medical Image Segmentation. In Proceedings of the Fourth International Conference on 3D Vision (3DV), Stanford, CA, USA, 2016; pp. 565–571. [CrossRef]
  59. Berman, M.; Rannen Triki, A.; Blaschko, M.B. The Lovász-Softmax Loss: A Tractable Surrogate for the Optimization of the Intersection-Over-Union Measure in Neural Networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 2018; pp. 4413–4421. [CrossRef]
  60. Zhao, H.; et al. MT-HRNet: Multi-Task High-Resolution Network for Bridge Inspection. Automation in Construction 2021.
Figure 1. Representative samples from the Tokaido synthetic dataset (test split). Each row shows the RGB image (left), component ground-truth mask (centre; Slab = red, Beam = green, Column = blue, Nonstructural = yellow-orange, Nonbridge = dark grey), and damage ground-truth mask (right; Concrete = red, ExposedRebar = green, Nondamage = dark grey). Rows 1–3 are selected to show abundant damage pixels from both foreground classes; rows 4–6 cover all structural component classes. Outside the selected rows, damage pixels are typically sparse and localised on structural surfaces, reflecting the severe class imbalance addressed by the imbalance-aware loss functions.
Figure 2. Component-aware damage segmentation architecture. The shared ConvNeXt encoder $E_\theta$ produces multi-scale features $F$ fed to both branches. The component branch (top) computes feature map $h_c$ and logits $z_c$; the damage branch (bottom) fuses $h_d$ with the conditioning tensor $q_c$ before the damage head. The five conditioning modes differ in how $q_c$ is formed (see legend); in the damage-only and parallel-heads modes the dashed conditioning path is absent. The subscript $\theta$ denotes trainable parameters.
Figure 3. ImageNet pretraining versus random initialisation (U-Net, EfficientNet-B3, 512 × 896). Columns: input image, ground-truth, pretrained prediction, scratch prediction. Classes: Concrete = red, ExposedRebar = green, Nondamage = dark grey. Training from scratch produces fragmented, noisy outputs compared to the pretrained model, reflecting the near-50% relative gap in foreground mIoU (0.519 vs. 0.264).
Figure 4. Qualitative damage-segmentation results for ConvNeXt-Seg (weighted CE + Dice, seed 3). Images are selected from the synthetic test split to show substantial damage from both foreground classes. Columns: input image, ground-truth mask, predicted mask, and prediction–ground-truth (GT) overlay. Classes: Concrete = red, ExposedRebar = green, Nondamage = dark grey. The model closely follows the ground truth for both damage classes while maintaining near-perfect Nondamage accuracy.
Figure 5. Qualitative comparison of damage predictions across all five joint-model ablation variants (seed 1) on representative synthetic test images. Columns: ground-truth damage mask, damage-only, parallel-heads, hard-mask, soft-detach, soft-full. All five variants produce visually similar outputs, consistent with the near-identical quantitative scores in Table 9. Classes: Concrete = red, ExposedRebar = green, Nondamage = dark grey.
Figure 6. Qualitative component-segmentation results for the joint soft-detach model (seed 1) on representative synthetic test images. Columns: input image, ground-truth component mask, predicted component mask, and prediction–ground-truth (GT) overlay. Classes: Slab = red, Beam = green, Column = blue, Nonstructural = yellow-orange, Nonbridge = dark grey. The model accurately delineates all major structural elements, achieving $\mathrm{mIoU}_\mathrm{main}^\mathrm{comp}$ of 0.92 on the test split.
Figure 7. Structural component segmentation on real viaduct photographs (trained ConvNeXt-Seg, seed 1). Columns: input image, component ground-truth, trained prediction. Classes: Slab = red, Beam = green, Column = blue, Nonstructural = yellow-orange, Nonbridge = dark grey. Real photographs carry component annotations only; damage annotations are unavailable for this image set.
Figure 8. Qualitative prompt ablation for GroundedSAM structural segmentation on real viaduct photographs. Columns: input image, component ground-truth, and zero-shot predictions for prompt variants v0 (bridge jargon), v2 (single terms), v3 (common vocabulary, best), and v4 (combined extended phrases). Classes: Slab = red, Beam = green, Column = blue, Nonstructural = yellow-orange, Nonbridge = dark grey. Variants v1 (descriptive geometry) and v4 (combined extended phrases) produce the weakest structural differentiation; v3 (common vocabulary) achieves the best overall mIoU (Table 12).
Table 1. Summary of component-aware ablation variants.
| Variant | Component head | Fusion | Conditioning signal | Gradient to component branch |
| --- | --- | --- | --- | --- |
| damage-only | × | × | none | not applicable |
| parallel-heads | ✓ | × | none | component loss only |
| hard-mask | ✓ | ✓ | one-hot($\arg\max z_c$) | none from damage |
| soft-detach | ✓ | ✓ | $\mathrm{sg}(F_\mathrm{comp})$ | none from damage |
| soft-full | ✓ | ✓ | $F_\mathrm{comp}$ | yes |
Note on "none from damage": no gradient from the damage loss reaches the component branch; the component branch continues to be trained via the component segmentation loss.
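The conditioning signals listed in Table 1 can be illustrated with a minimal NumPy sketch. The function name `conditioning_signal`, the mode strings, and the array layouts are assumptions made for illustration and are not the authors' implementation. NumPy has no automatic differentiation, so the stop-gradient $\mathrm{sg}(\cdot)$ of the soft-detach mode appears only as a comment; in the forward pass it is the identity.

```python
import numpy as np

def conditioning_signal(mode, z_c, f_comp):
    """Return the conditioning tensor q_c for one ablation variant.

    z_c:    component logits, shape (C, H, W)        (hypothetical layout)
    f_comp: component feature map, shape (D, H, W)   (hypothetical layout)
    """
    if mode in ("damage-only", "parallel-heads"):
        # No fusion: the damage branch receives no component signal.
        return None
    if mode == "hard-mask":
        # One-hot encoding of the per-pixel argmax over component classes.
        labels = np.argmax(z_c, axis=0)             # (H, W)
        one_hot = np.eye(z_c.shape[0])[labels]      # (H, W, C)
        return np.moveaxis(one_hot, -1, 0)          # (C, H, W)
    if mode == "soft-detach":
        # Forward value equals f_comp. In an autograd framework this would be
        # wrapped in a stop-gradient so that the damage loss cannot update
        # the component branch.
        return f_comp.copy()
    if mode == "soft-full":
        # Same forward value; gradients from the damage loss would flow back.
        return f_comp
    raise ValueError(f"unknown mode: {mode}")
```

In the forward pass the two soft modes are numerically identical; they differ only in which parameters the damage loss is allowed to update.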
Table 2. Ablation: Loss Comparison (damage)
| Variant | $\mathrm{mIoU}_\mathrm{main}$ | $\mathrm{mIoU}_\mathrm{all}$ | Dice | Precision |
| --- | --- | --- | --- | --- |
| weighted_ce_dice | 0.535 | 0.688 | 0.797 | 0.816 |
| tversky | 0.526 | 0.681 | 0.792 | 0.793 |
| ce_dice | 0.523 | 0.679 | 0.790 | 0.846 |
| focal_tversky | 0.498 | 0.662 | 0.775 | 0.768 |
| ce | 0.491 | 0.659 | 0.771 | 0.852 |
| focal_ce | 0.467 | 0.642 | 0.756 | 0.838 |
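For readers unfamiliar with the overlap-based losses compared in Table 2, the following is a minimal sketch of the standard soft Dice coefficient and Tversky index for a single class. The weighting values $\alpha$ and $\beta$, the smoothing constant, and the function names are illustrative assumptions; this section does not state the settings used in the study. The corresponding losses are one minus these scores.

```python
import numpy as np

def tversky_index(prob, target, alpha=0.5, beta=0.5, eps=1e-12):
    """Soft Tversky index TP / (TP + alpha*FP + beta*FN) for one class."""
    tp = np.sum(prob * target)            # soft true positives
    fp = np.sum(prob * (1.0 - target))    # soft false positives
    fn = np.sum((1.0 - prob) * target)    # soft false negatives
    return tp / (tp + alpha * fp + beta * fn + eps)

def dice_coefficient(prob, target, eps=1e-12):
    """Soft Dice coefficient 2*TP / (|prob| + |target|) for one class."""
    inter = np.sum(prob * target)
    return 2.0 * inter / (np.sum(prob) + np.sum(target) + eps)

# Toy example: four pixels, two of which are foreground.
prob = np.array([0.9, 0.2, 0.7, 0.1])
target = np.array([1.0, 0.0, 1.0, 0.0])
print(dice_coefficient(prob, target), tversky_index(prob, target))
```

With $\alpha = \beta = 0.5$ the Tversky index coincides with the Dice coefficient; raising $\beta$ above $\alpha$ penalises false negatives more heavily, which is the usual motivation for using it on sparse foreground classes.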
Table 3. Ablation: Encoder Comparison (damage)
| Variant | $\mathrm{mIoU}_\mathrm{main}$ | $\mathrm{mIoU}_\mathrm{all}$ | Dice | Precision |
| --- | --- | --- | --- | --- |
| efficientnet_b5 | 0.547 | 0.696 | 0.803 | 0.850 |
| efficientnet_b3 | 0.526 | 0.681 | 0.791 | 0.842 |
| efficientnet_b0 | 0.488 | 0.656 | 0.769 | 0.823 |
| resnet50 | 0.418 | 0.609 | 0.724 | 0.780 |
Table 4. Ablation: Resolution Comparison (damage)
| Variant | $\mathrm{mIoU}_\mathrm{main}$ | $\mathrm{mIoU}_\mathrm{all}$ | Dice | Precision |
| --- | --- | --- | --- | --- |
| 640x1120 | 0.549 | 0.697 | 0.805 | 0.857 |
| 512x896 | 0.536 | 0.689 | 0.797 | 0.839 |
| 384x672 | 0.464 | 0.640 | 0.754 | 0.822 |
| 256x448 | 0.398 | 0.595 | 0.711 | 0.763 |
Table 5. Ablation: Pretrained vs. Scratch (damage)
| Variant | $\mathrm{mIoU}_\mathrm{main}$ | $\mathrm{mIoU}_\mathrm{all}$ | Dice | Precision |
| --- | --- | --- | --- | --- |
| pretrained | 0.519 | 0.677 | 0.787 | 0.826 |
| scratch | 0.264 | 0.504 | 0.605 | 0.638 |
Table 6. Damage segmentation benchmark on the Tokaido synthetic test split (weighted CE + Dice loss, mean ± standard deviation over all available runs). n denotes the run count per model: most models use seeds 1–3, whereas ConvNeXt-Seg has 6 completed runs and DeepLabV3+ has 4. The cells for these two models include logged repeated attempts with reused seed values rather than additional distinct seeds. Params denotes trainable parameter count in millions (M); Prec and Rec denote precision and recall. Swin-UperNet collapses to predicting the Nondamage class in all runs; ExposedRebar IoU ≈ 0.
| Model | Encoder | Params | n | $\mathrm{mIoU}_\mathrm{main}$ | $\mathrm{mIoU}_\mathrm{all}$ | Dice | Prec | Rec | FPS |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| SegFormer | PVT-V2-B2 | 26.0M | 3 | 0.569 ± 0.013 | 0.710 | 0.815 | 0.817 | 0.814 | 19.5 |
| ConvNeXt-Seg | ConvNeXt-B | 93.4M | 6 | 0.562 ± 0.006 | 0.706 | 0.811 | 0.824 | 0.799 | 17.3 |
| UNet++ | EfficientNet-B3 | 13.6M | 3 | 0.538 ± 0.011 | 0.690 | 0.798 | 0.838 | 0.766 | 17.8 |
| UNet | EfficientNet-B3 | 13.2M | 3 | 0.533 ± 0.008 | 0.686 | 0.795 | 0.816 | 0.777 | 19.6 |
| DeepLabV3+ | EfficientNet-B3 | 11.7M | 4 | 0.477 ± 0.011 | 0.649 | 0.762 | 0.787 | 0.742 | 20.2 |
| ViT-Seg | ViT-B/16 | 93.1M | 3 | 0.362 ± 0.013 | 0.571 | 0.686 | 0.743 | 0.645 | 14.1 |
| Swin-UperNet | Swin-B | 92.1M | 3 | 0.021 ± 0.011 | 0.341 | 0.357 | 0.376 | 0.352 | 13.7 |
Table 7. Per-class IoU breakdown on the Tokaido synthetic test split (weighted CE + Dice loss, mean ± standard deviation over all available runs; run counts per model match Table 6). Nondamage standard deviation is < 0.002 for all models except Swin-UperNet (0.003) and is omitted for brevity. Swin-UperNet collapses to the Nondamage class in all runs.
| Model | $\mathrm{mIoU}_\mathrm{main}$ | Nondamage | Concrete | ExposedRebar |
| --- | --- | --- | --- | --- |
| SegFormer | 0.569 ± 0.013 | 0.994 | 0.602 ± 0.018 | 0.535 ± 0.010 |
| ConvNeXt-Seg | 0.562 ± 0.006 | 0.994 | 0.598 ± 0.011 | 0.525 ± 0.005 |
| UNet++ | 0.538 ± 0.011 | 0.993 | 0.570 ± 0.007 | 0.507 ± 0.016 |
| UNet | 0.533 ± 0.008 | 0.993 | 0.556 ± 0.006 | 0.510 ± 0.014 |
| DeepLabV3+ | 0.477 ± 0.011 | 0.992 | 0.515 ± 0.016 | 0.440 ± 0.010 |
| ViT-Seg | 0.362 ± 0.013 | 0.990 | 0.359 ± 0.021 | 0.366 ± 0.008 |
| Swin-UperNet | 0.021 ± 0.011 | 0.982 | 0.042 ± 0.021 | 0.000 ± 0.000 |
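The relationship between the per-class IoU values in Table 7 and the two summary metrics in Table 6 can be checked with a short sketch. It assumes that the "main" metric is the unweighted mean over the foreground classes and the "all" metric is the unweighted mean over every class including the background; the helper name `miou_main_and_all` is hypothetical.

```python
import numpy as np

def miou_main_and_all(per_class_iou, background="Nondamage"):
    """Return (foreground-only mean IoU, mean IoU over all classes)."""
    all_vals = list(per_class_iou.values())
    main_vals = [v for k, v in per_class_iou.items() if k != background]
    return float(np.mean(main_vals)), float(np.mean(all_vals))

# Per-class means for SegFormer, copied from Table 7.
segformer = {"Nondamage": 0.994, "Concrete": 0.602, "ExposedRebar": 0.535}
miou_main, miou_all = miou_main_and_all(segformer)
print(f"mIoU_main = {miou_main:.4f}, mIoU_all = {miou_all:.4f}")
```

Under these assumptions the SegFormer row gives roughly 0.5685 and 0.7103, which agrees with the tabulated 0.569 and 0.710 up to rounding. The near-perfect Nondamage IoU is what lifts the all-class mean well above the foreground mean.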
Table 8. Component segmentation benchmark on the Tokaido synthetic test split (weighted CE + Dice loss, mean ± standard deviation over seeds 1–3). Params denotes trainable parameter count in millions (M); Prec and Rec denote precision and recall. $\mathrm{mIoU}_\mathrm{main}$ excludes the Nonbridge background class.
| Model | Encoder | Params | n | $\mathrm{mIoU}_\mathrm{main}$ | $\mathrm{mIoU}_\mathrm{all}$ | Dice | Prec | Rec | FPS |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ConvNeXt-Seg | ConvNeXt-B | 93.4M | 3 | 0.957 ± 0.001 | 0.964 | 0.981 | 0.979 | 0.984 | 19.5 |
| SegFormer | PVT-V2-B2 | 26.0M | 3 | 0.951 ± 0.001 | 0.959 | 0.979 | 0.976 | 0.981 | 24.6 |
| UNet++ | EfficientNet-B3 | 13.6M | 3 | 0.948 ± 0.005 | 0.957 | 0.978 | 0.976 | 0.980 | 24.5 |
| UNet | EfficientNet-B3 | 13.2M | 3 | 0.947 ± 0.001 | 0.956 | 0.977 | 0.975 | 0.980 | 27.6 |
| DeepLabV3+ | EfficientNet-B3 | 11.7M | 3 | 0.946 ± 0.002 | 0.955 | 0.977 | 0.975 | 0.979 | 27.6 |
| ViT-Seg | ViT-B/16 | 93.1M | 3 | 0.937 ± 0.001 | 0.947 | 0.973 | 0.972 | 0.974 | 16.8 |
| Swin-UperNet | Swin-B | 92.1M | 3 | 0.832 ± 0.014 | 0.857 | 0.922 | 0.918 | 0.926 | 16.4 |
Table 9. Component-aware ablation results on the Tokaido synthetic test set (mean ± standard deviation over seeds 1–3). Superscripts dmg and comp denote damage and component tasks, respectively. The damage-only variant has no component head, so component metrics are not applicable (n/a).
| Variant | $\mathrm{mIoU}_\mathrm{main}^\mathrm{dmg}$ | Concrete IoU | ExposedRebar IoU | $\mathrm{mIoU}_\mathrm{main}^\mathrm{comp}$ |
| --- | --- | --- | --- | --- |
| damage-only | 0.5600 ± 0.0060 | 0.5882 | 0.5319 | n/a |
| parallel-heads | 0.5535 ± 0.0055 | 0.5884 | 0.5187 | 0.9229 ± 0.0065 |
| hard-mask | 0.5574 ± 0.0045 | 0.5936 | 0.5212 | 0.9262 ± 0.0036 |
| soft-detach | 0.5609 ± 0.0086 | 0.5976 | 0.5242 | 0.9231 ± 0.0021 |
| soft-full | 0.5545 ± 0.0112 | 0.5909 | 0.5181 | 0.9270 ± 0.0027 |
Table 10. Component segmentation on real photographs (ConvNeXt-Seg, seeds 1–3, weighted CE + Dice; same three single-task checkpoints as Table 8; global pooled IoU over 16 images). Rows “Seed 1”–“Seed 3” are individual checkpoints; the final $\mathrm{mIoU}_\mathrm{main}$ entry is mean ± standard deviation across seeds. The Nonbridge class is omitted from $\mathrm{mIoU}_\mathrm{main}$ because its support is small and arises primarily from resize-pad preprocessing in the evaluation pipeline.
| | Nonbridge | Slab | Beam | Column | Nonstructural | $\mathrm{mIoU}_\mathrm{main}$ |
| --- | --- | --- | --- | --- | --- | --- |
| Seed 1 | 0.010 | 0.456 | 0.409 | 0.598 | 0.036 | 0.374 |
| Seed 2 | 0.011 | 0.727 | 0.447 | 0.755 | 0.056 | 0.496 |
| Seed 3 | 0.009 | 0.505 | 0.424 | 0.643 | 0.034 | 0.402 |
| Mean | 0.010 | 0.562 | 0.427 | 0.665 | 0.042 | 0.424 ± 0.064 |
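The "global pooled IoU" named in the Table 10 caption can be sketched as follows: confusion counts are summed over all images before the per-class IoU is formed, rather than averaging per-image IoU values. This is a minimal sketch under that reading of the caption; the function names are hypothetical. The last lines check the seed statistics, assuming the reported spread is the sample standard deviation (`ddof=1`).

```python
import numpy as np

def confusion_matrix(pred, gt, num_classes):
    """Count pixels by (ground-truth class, predicted class)."""
    idx = gt.ravel() * num_classes + pred.ravel()
    counts = np.bincount(idx, minlength=num_classes ** 2)
    return counts.reshape(num_classes, num_classes)

def pooled_iou(preds, gts, num_classes):
    """Per-class IoU from confusion counts pooled over all images."""
    total = np.zeros((num_classes, num_classes), dtype=np.int64)
    for pred, gt in zip(preds, gts):
        total += confusion_matrix(pred, gt, num_classes)
    tp = np.diag(total)
    union = total.sum(axis=0) + total.sum(axis=1) - tp
    # Classes that never occur in prediction or ground truth get NaN.
    return np.where(union > 0, tp / np.maximum(union, 1), np.nan)

# Per-seed foreground mIoU values copied from Table 10.
seed_miou = np.array([0.374, 0.496, 0.402])
seed_mean = seed_miou.mean()
seed_std = seed_miou.std(ddof=1)  # sample standard deviation (assumption)
print(f"{seed_mean:.3f} +/- {seed_std:.3f}")
```

Pooling weights each image by its pixel count for a class, so a few images with large slabs or columns can dominate that class's score. With only three seeds, the choice between sample and population standard deviation changes the reported spread noticeably; the rounded per-seed values are consistent with the sample form.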
Table 11. Text prompts used for each structural class across the five GroundedSAM prompt variants (v0–v4; note that these are distinct from the five joint-model conditioning variants in Table 9). Each cell lists the period-separated phrases passed to GroundingDINO for that class. Nonstructural (class 4) is not directly prompted; it receives all pixels not assigned to classes 1–3 after priority-order merging.
| Variant | Slab (1) | Beam (2) | Column (3) | Nonstructural (4) |
| --- | --- | --- | --- | --- |
| v0: baseline | bridge deck slab. concrete slab. bridge floor | bridge girder. bridge beam. concrete beam | bridge column. bridge pier. concrete column | railway rail. railway sleeper. bridge railing |
| v1: descriptive geom. | flat concrete deck. horizontal concrete surface. concrete floor plate | concrete T-beam. prestressed concrete girder. longitudinal girder under deck | vertical concrete pier. cylindrical column. rectangular concrete pillar | metal handrail. steel guard rail. bridge fence |
| v2: single terms | slab | beam | column | railing |
| v3: common vocab. (best) | concrete floor. flat concrete ceiling. concrete soffit | structural beam. concrete support beam. I-beam | pillar. concrete pillar. support column | railing. fence. barrier |
| v4: combined | bridge deck slab. flat concrete deck. concrete floor plate. bridge floor | bridge beam. bridge girder. concrete beam. prestressed concrete girder. T-beam | bridge column. concrete pier. bridge pier. vertical concrete pillar | bridge railing. metal handrail. steel fence. guardrail |
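The priority-order merging mentioned in the Table 11 caption can be sketched as below. This is an illustration only: the priority order (Column over Beam over Slab), the function name `merge_by_priority`, and the mask format are assumptions, since this section does not specify them. The class indices 1–4 follow the column headers of Table 11.

```python
import numpy as np

SLAB, BEAM, COLUMN, NONSTRUCTURAL = 1, 2, 3, 4  # indices from Table 11

def merge_by_priority(masks, priority=(COLUMN, BEAM, SLAB), fill=NONSTRUCTURAL):
    """Merge overlapping binary masks into a single label map.

    masks:    dict mapping class index to a boolean array of shape (H, W)
    priority: classes from highest to lowest priority (hypothetical order)
    fill:     label for pixels claimed by no prompted class
    """
    shape = next(iter(masks.values())).shape
    label_map = np.full(shape, fill, dtype=np.uint8)
    # Paint the lowest priority first so higher priorities overwrite it.
    for cls in reversed(priority):
        label_map[masks[cls]] = cls
    return label_map

# Toy 2x2 example with overlapping detections.
masks = {
    SLAB: np.array([[True, True], [False, False]]),
    BEAM: np.array([[False, True], [True, False]]),
    COLUMN: np.array([[False, False], [True, False]]),
}
print(merge_by_priority(masks))
```

Any pixel left unclaimed by the three prompted classes falls through to the fill label, which matches the caption's statement that Nonstructural receives the remainder.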
Table 12. Structural-task prompt ablation for GroundedSAM on 16 real photographs ($\mathrm{mIoU}_\mathrm{main}$ excludes Nonbridge).
| Variant | Slab | Beam | Column | Nonstructural | $\mathrm{mIoU}_\mathrm{main}$ |
| --- | --- | --- | --- | --- | --- |
| v0: baseline (bridge-specific phrases) | 0.291 | 0.126 | 0.223 | 0.142 | 0.195 |
| v1: descriptive geometry | 0.184 | 0.018 | 0.090 | 0.133 | 0.106 |
| v2: single terms (slab/beam/…) | 0.372 | 0.089 | 0.310 | 0.105 | 0.219 |
| v3: common vocabulary (best) | 0.338 | 0.133 | 0.491 | 0.038 | 0.250 |
| v4: combined extended phrases | 0.131 | 0.147 | 0.106 | 0.088 | 0.118 |
Table 14. Summary of progress relative to the prior study [13]. All metrics are foreground mIoU (background excluded) unless otherwise noted. Prior damage figures are per-class IoU averaged over the two foreground damage classes (Cracks, Reinforcement) from Table 3 of that paper.
| Aspect | Prior work | Present work |
| --- | --- | --- |
| Input resolution | 320 × 160 | 512 × 896 |
| Architectures evaluated | 2 | 7 |
| Loss functions compared | ad-hoc | 4 (systematic ablation) |
| Repeated seeds | no | 3 per config |
| Component mIoU (synthetic test) | 76–87% (val, 4 cls) | 94.6–95.7% (test, 5 cls) |
| Damage mIoU (synthetic test) | 0.46 (2 arch) | 0.569 (SegFormer mean) |
| Component mIoU (real photos) | 0.18 (U-Net, 4 cls) | 0.424 (trained, 5 cls, gap ≈ 0.53) |
| Zero-shot structural baseline | | 0.250 (GroundedSAM) |
| Joint component-aware ablation | no | yes (5 variants) |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
Copyright: This open access article is published under a Creative Commons CC BY 4.0 license, which permits free download, distribution, and reuse, provided that the author and preprint are cited in any reuse.