1. Introduction
In the stone processing and construction industries, digital transformation has created demand for high-fidelity virtual material representations to support virtual prototyping, digital twin applications, and mass customization workflows [1,2]. Natural stone textures, particularly marble with its stochastic vein patterns, pose unique synthesis challenges: each slab exhibits non-repeating structures requiring both photorealistic rendering and precise designer control [3]. However, obtaining large-scale annotated datasets for training conditional generative models is extremely challenging in industrial contexts due to the prohibitive cost of manual pixel-level annotation, the proprietary nature of production data, and the limited batch sizes typical of specialty materials [4]. This data scarcity problem represents a critical barrier to deploying deep learning solutions for texture synthesis in manufacturing environments.
While conditional Generative Adversarial Networks (cGANs) [5] have demonstrated impressive capabilities for image-to-image translation tasks [6,7], their application to industrial texture synthesis faces two unresolved challenges. First, existing approaches assume the availability of ground-truth semantic masks, an assumption that does not hold for proprietary manufacturing data, where manual annotation would require weeks of skilled labor [4,8]. Second, the standard evaluation paradigm relies exclusively on automated metrics, such as Fréchet Inception Distance (FID) [9], Inception Score (IS) [10], and the Multi-Scale Structural Similarity Index Measure (MS-SSIM), which were originally designed for object recognition rather than texture quality. Recent studies have shown that these metrics exhibit weak or even negative correlations with human perceptual judgments in texture synthesis tasks. The Inception-v3 network underlying FID was trained for object classification on ImageNet, making it architecturally texture-invariant and thus insensitive to texture artifacts (e.g., checkerboard patterns, repetitive microstructures) that are perceptually obvious to human observers [11]. This metric-perception discrepancy has been rigorously documented: Zhou et al. (2019) [12] demonstrated that across multiple datasets FID scores do not fully correlate with human evaluation scores, while the largest human evaluation study to date, comprising over 207,000 perceptual judgments across 41 generative models, concluded that existing metrics do not show strong correlations with human evaluations [13,14]. Borji (2022) [14] further documented that FID has a blind spot regarding image quality and is particularly unsuitable for specialized domains whose visual features differ substantially from those of ImageNet. Moreover, systematic analyses have demonstrated that FID is highly sensitive to low-level image statistics, such as noise and blur, that fail to correlate with human perception, while remaining largely insensitive to high-level artifacts that human observers immediately detect [13]. This fundamental misalignment has reinforced the position that human preference constitutes the ultimate ground truth for evaluating AI-generated content [15]. The present study addresses two critical gaps: first, it provides direct empirical evidence of metric-perception divergence across conditional GAN architectures in an industrial texture synthesis context; second, it introduces an annotation-free segmentation pipeline integrating SLIC, GMM, and Graph Cut optimization, thereby eliminating the manual labeling bottleneck that has historically constrained the application of conditional generative models to proprietary manufacturing datasets.
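Since FID recurs throughout this study, it may help to recall its closed form: the Fréchet distance between two Gaussians fitted to Inception-v3 feature activations of real and generated images. A minimal numpy/scipy sketch of that distance (the feature-extraction step is omitted, and the toy moments below are illustrative only):

```python
import numpy as np
from scipy import linalg

def frechet_distance(mu1, sigma1, mu2, sigma2):
    """Frechet distance between N(mu1, sigma1) and N(mu2, sigma2).
    In FID, the moments come from Inception-v3 activations of real
    and generated image sets; here they are plain numpy arrays."""
    diff = mu1 - mu2
    covmean, _ = linalg.sqrtm(sigma1 @ sigma2, disp=False)
    covmean = covmean.real  # discard tiny imaginary parts from numerics
    return diff @ diff + np.trace(sigma1 + sigma2 - 2.0 * covmean)

# Toy 2-D "features": identical distributions give distance ~0,
# shifting the mean by (1, 1) adds ||diff||^2 = 2.
mu, sigma = np.zeros(2), np.eye(2)
d_same = frechet_distance(mu, sigma, mu, sigma)
d_shift = frechet_distance(mu, sigma, np.ones(2), np.eye(2))
```

Because the distance is computed in Inception-v3's object-centric feature space, two texture sets can be far apart perceptually yet close in this statistic, which is the blind spot discussed above.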
The evolution of texture synthesis spans three major paradigms. Traditional procedural generation, exemplified by Perlin noise [16,17], can produce marble-like patterns through mathematical algorithms, but often with limited realism and at the cost of expert tuning [18,19]. Example-based non-parametric methods leverage real source images to generate textures but struggle with large-scale structures such as continuous marble veins [20,21,22,23,24]. Modern deep generative models have revolutionized image synthesis [25,26], with GANs emerging as efficient alternatives [27]. Neural style transfer [11,28] established the principle of separating structure and appearance, while Spatial GANs enabled scalable texture generation [29,30].
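As context for the procedural paradigm above, the classic Perlin-style marble recipe modulates a sine wave with multi-octave noise "turbulence". The sketch below substitutes cheap smoothed value noise for true Perlin gradient noise, so it illustrates the idea rather than being a faithful Perlin implementation:

```python
import numpy as np

def value_noise(shape, scale, rng):
    """Cheap smoothed value noise: bilinearly upsample a coarse random
    grid (a simple stand-in for Perlin gradient noise)."""
    coarse = rng.random((shape[0] // scale + 2, shape[1] // scale + 2))
    ys = np.linspace(0, coarse.shape[0] - 1.001, shape[0])
    xs = np.linspace(0, coarse.shape[1] - 1.001, shape[1])
    yi, xi = np.floor(ys).astype(int), np.floor(xs).astype(int)
    fy, fx = (ys - yi)[:, None], (xs - xi)[None, :]
    c = coarse
    return ((1 - fy) * (1 - fx) * c[yi][:, xi]
            + (1 - fy) * fx * c[yi][:, xi + 1]
            + fy * (1 - fx) * c[yi + 1][:, xi]
            + fy * fx * c[yi + 1][:, xi + 1])

def marble(shape=(128, 128), octaves=4, seed=0):
    """marble(x, y) = sin(k*x + turbulence): veins follow the perturbed
    sine bands, which is why tuning k and the turbulence amplitude by
    hand is needed to approach a specific real stone."""
    rng = np.random.default_rng(seed)
    turb = sum(value_noise(shape, 2 ** o, rng) / 2 ** (octaves - o)
               for o in range(octaves))
    x = np.arange(shape[1])[None, :]
    return 0.5 + 0.5 * np.sin(x * 0.15 + 6.0 * turb)

tex = marble()  # grayscale texture in [0, 1]
```

The hand-chosen frequency and amplitude constants here are exactly the kind of expert tuning the paragraph above refers to.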
Conditional GANs enable controllable synthesis by conditioning generation on structural inputs such as semantic masks. The Pix2Pix framework [31] established paired image-to-image translation using U-Net generators and PatchGAN discriminators, combining adversarial loss with L1 reconstruction to enforce structural fidelity. BicycleGAN [32] addresses mode collapse by enforcing bijective mappings between latent codes and outputs, enabling diverse generations. The most significant innovation for mask-conditioned synthesis is Spatially-Adaptive Normalization (SPADE) in GauGAN [6,31], which modulates normalization parameters as learned functions of the input mask at every layer, preventing semantic information from being washed away by standard normalization; this property is critical for preserving vein boundaries while synthesizing organic appearance [33].
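A minimal numpy sketch of the SPADE modulation arithmetic described above; the small convolutional networks that SPADE learns for mapping the mask to per-pixel scale (gamma) and shift (beta) maps are replaced here by fixed toy functions, so this illustrates the mechanism, not a trained model:

```python
import numpy as np

def spade_normalize(x, mask, gamma_fn, beta_fn, eps=1e-5):
    """Spatially-adaptive normalization, schematically.
    x:    feature map, shape (C, H, W)
    mask: semantic mask resized to (H, W)
    gamma_fn/beta_fn: stand-ins for the learned conv nets that map the
    mask to spatially varying affine parameters."""
    # Standard parameter-free normalization over spatial dimensions...
    mu = x.mean(axis=(1, 2), keepdims=True)
    var = x.var(axis=(1, 2), keepdims=True)
    x_norm = (x - mu) / np.sqrt(var + eps)
    # ...then re-inject the mask as per-pixel scale and shift, so the
    # semantic layout is not "washed away" by the normalization.
    return gamma_fn(mask) * x_norm + beta_fn(mask)

rng = np.random.default_rng(0)
feat = rng.normal(size=(8, 16, 16))
vein_mask = (rng.random((16, 16)) > 0.5).astype(float)
out = spade_normalize(feat, vein_mask,
                      gamma_fn=lambda m: 1.0 + 0.5 * m[None],
                      beta_fn=lambda m: 0.1 * m[None])
```

Because gamma and beta vary per pixel with the mask, vein versus matrix regions receive different feature statistics at every layer, which is the boundary-preserving behavior noted above.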
While diffusion models have achieved state-of-the-art results, their deployment faces practical barriers in manufacturing: inference is significantly slower than GANs, and conditional variants like ControlNet [34] rely on foundation models pre-trained on billions of images. For manufacturing environments with limited proprietary data (~200–500 samples) and computational constraints, cGANs represent a practical choice. Comprehensive comparison with fine-tuned foundation models would require different experimental protocols and constitutes valuable future work [35].
The annotation bottleneck has motivated unsupervised segmentation research in medical imaging [36,37,38], yet adoption in manufacturing remains limited [4,8]. Unsupervised methods combine classical computer vision techniques in a principled pipeline: Simple Linear Iterative Clustering (SLIC) superpixels [39,40] reduce computational complexity by grouping pixels into perceptually coherent regions while preserving boundaries [41]; Gaussian Mixture Models (GMM) then probabilistically classify these regions into semantic classes based on color features [42,43]; finally, Graph Cut algorithms [43] or Normalized Cuts [44] enforce spatial coherence through energy minimization, removing segmentation noise while respecting natural material boundaries. Interactive methods like GrabCut [45] offer additional refinement capabilities. This work adapts Borovec et al.'s pipeline [37] to natural stone, demonstrating that the same approach can automatically extract marble vein structures suitable for cGAN training without manual annotation.
Recent industrial applications demonstrate the transformative potential of GANs. In manufacturing quality control, GANs address data scarcity for defect detection [46] and anomaly detection [47,48]. Beyond inspection, GANs contribute to design optimization [49], product lifecycle prediction [50], and digital twin systems [51]. Strategic applications include technology road mapping [52] and custom material design [53,54]. However, prior work on natural materials like marble remains scarce [55,56], with most studies focusing on regular repeating patterns rather than stochastic geological textures.
To overcome these limitations, we propose a dual-evaluation framework for benchmarking conditional GANs in industrial texture synthesis that addresses both the annotation bottleneck and evaluation uncertainty. Our framework integrates: (1) an adapted unsupervised segmentation pipeline [37] that automatically extracts structural masks from raw production scans, eliminating manual annotation costs; and (2) a rigorous human-centered validation protocol combining Visual Turing Tests [10] and Mean Opinion Scores adapted from telecommunications standards [57,58] to complement standard automated metrics [59]. To the best of our knowledge, this represents the first systematic application of dual-protocol evaluation (automated + human) to industrial material texture synthesis, and the first demonstration that unsupervised mask generation enables conditional GAN training for natural stone without manual labeling.
This study focuses on a marble type from a single quarry, marketed under the commercial name Exotic Ambar, providing a controlled testbed for architectural comparison without confounding variables and reflecting authentic industrial deployment scenarios in which manufacturers specialize in specific materials. We systematically compare four conditional GAN architectures (baseline cGAN, Pix2Pix, BicycleGAN, and GauGAN), selected for their shared architectural lineage (U-Net generators, PatchGAN discriminators) and their trainability on modest datasets (~300 samples) using accessible GPU resources, unlike StyleGAN [60] or foundation models that require extensive tuning or billions of parameters. The main contributions of this study are:
We validate an unsupervised segmentation pipeline (SLIC + GMM + Graph Cut) for automatically generating semantic masks from marble imagery, demonstrating a practical solution to the data scarcity challenge where obtaining pixel-perfect annotations is economically prohibitive.
We conduct a systematic benchmark comparing four cGAN architectures trained and evaluated under identical conditions on real industrial data (289 high-resolution marble scans acquired from production lines), providing evidence-based architecture selection guidance for practitioners.
We implement a dual-evaluation framework contrasting automated metrics (FID, IS, MS-SSIM) with human-centered assessment (Visual Turing Test [10], Mean Opinion Scores from domain experts), revealing significant metric-perception discrepancies with direct implications for deployment decisions.
We demonstrate that GauGAN achieves human-indistinguishable synthesis quality despite inferior FID scores, while Pix2Pix exhibits the opposite pattern, establishing empirically that automated metrics alone are insufficient for architecture selection in quality-critical manufacturing applications [61,62].
We provide comprehensive methodology documentation to enable replication and extension to other natural material synthesis tasks (wood, fabric, geological samples).
The remainder of this paper is organized as follows.
Section 2 details our methodology: data collection, the unsupervised segmentation pipeline, cGAN implementations, and the dual-evaluation protocol.
Section 3 presents the results: visual comparisons, automated metrics, human-evaluation outcomes, metric-perception discrepancy analysis, and computational performance.
Section 4 discusses practical implications for industrial deployment and limitations.
Section 5 concludes with actionable recommendations and directions for future research.
3. Results
Results are organized to demonstrate the three core claims of this work: (1) unsupervised mask generation enables training without annotation cost, (2) human-centered evaluation reveals architecture-specific perceptual quality that automated metrics fail to capture, and (3) a critical divergence exists between metric-based optimization and human perception in industrial texture synthesis. We begin with qualitative visual assessment of generated textures, present comprehensive quantitative metrics, introduce structured human evaluation findings, and conclude with computational efficiency analysis to inform deployment decisions.
3.1. Qualitative Assessment: Visual Comparison Across Architectures
All four conditional GAN architectures successfully learned to synthesize photorealistic marble textures from binary vein masks after training on the 232-sample dataset with unsupervised annotations.
Figure 4 presents a systematic comparison across four representative samples with varying vein patterns: given identical mask inputs (column a), each architecture generates plausible marble appearances exhibiting correct vein placement, naturalistic matrix coloration, and appropriate texture granularity relative to ground-truth real marble images (column b).
Visual inspection reveals distinct architectural characteristics. The baseline conditional GAN (column c) produces diverse outputs due to explicit latent sampling, successfully generating variation in matrix tone and vein texture while maintaining structural fidelity to the input mask. Pix2Pix (column d) generates sharp, high-contrast outputs with excellent mask adherence and strong vein definition, producing textures that appear highly realistic at standard viewing distances. BicycleGAN (column e) successfully produces outputs with controlled diversity through its latent embedding mechanism, though this diversity sometimes manifests as variation in global lighting and color temperature rather than localized material texture properties. GauGAN (column f) exhibits smooth texture quality with particularly organic vein-to-matrix transitions, producing outputs where the boundary between vein and matrix regions appears naturally graduated rather than sharply delineated.
Across all four samples spanning different vein geometries, from parallel diagonal structures (Samples 1-2) to curved organic patterns (Sample 3) and complex scattered networks (Sample 4), the architectures demonstrate consistent synthesis capability. Each architecture maintains its characteristic visual signature across diverse structural inputs, indicating that observed quality differences stem from fundamental architectural design choices rather than mask-specific overfitting. All generated outputs are visually plausible and structurally faithful to their conditioning masks, validating the efficacy of the unsupervised segmentation pipeline for providing geometric guidance to the generative process.
3.2. Quantitative Metrics: Automated Performance Evaluation
Table 1 presents comprehensive quantitative evaluation across 10 automated metrics computed on all 57 validation samples. Pix2Pix achieved the best performance on distributional metrics, including the widely-used Fréchet Inception Distance (FID = 85.286, 2.3% lower than GauGAN) and Inception Score (IS = 1.940). This consistent metric superiority reflects Pix2Pix's L1 reconstruction loss enforcing pixel-level fidelity to ground-truth training data. GauGAN ranked second on distributional metrics but demonstrated the strongest performance on reconstruction fidelity measures: structural similarity (MS-SSIM = 0.713), pixel accuracy (PSNR = 22.626 dB, MSE = 0.006), correlation (CC = 0.847), and mask adherence (FMI-pixel = 0.886). BicycleGAN and the baseline conditional GAN showed weaker metric performance, particularly on distributional measures (FID > 94), suggesting their latent diversity mechanisms produce outputs that deviate further from training distribution statistics in Inception-v3 feature space. Full metric distributions across all validation samples are shown in Appendix A.3.
3.3. Human-Centered Evaluation: Perceptual Quality Assessment
The structured human evaluation protocols reveal a striking divergence from automated metric rankings.
Figure 5(a) presents Visual Turing Pass Rates (VTPR): GauGAN achieved the lowest (best) pass rate (0.533, 95% CI: 0.400–0.667), indicating expert evaluators could not reliably distinguish GauGAN outputs from real marble at better-than-chance levels. In contrast, Pix2Pix, the highest-performing architecture in metric-based evaluation, achieved the highest (worst) VTPR (0.650, 95% CI: 0.583–0.717), meaning experts correctly identified Pix2Pix outputs as synthetic in 65% of trials despite its 2.3% FID advantage over GauGAN. BicycleGAN (0.633, 95% CI: 0.567–0.700) and the baseline cGAN (0.583, 95% CI: 0.517–0.650) achieved intermediate performance. Mean Opinion Scores (Figure 5(b)) corroborate this ranking: GauGAN received the highest naturalness ratings (MOS-MA = 2.889, 95% CI: 2.578–3.200), while Pix2Pix scored lowest among the GAN architectures (2.333, 95% CI: 2.022–2.644). The baseline cGAN achieved a MOS-MA of 2.644 (95% CI: 2.333–2.955), and BicycleGAN scored 2.667 (95% CI: 2.356–2.978).
These human-centered evaluations showed a somewhat inverted ranking relative to automated metrics: the architecture optimized for FID (Pix2Pix) performs worst in perceptual authenticity, while GauGAN, which achieves the second-best FID, produces textures that expert evaluators perceive as indistinguishable from real marble. This metric-perception divergence challenges the foundational assumption that minimizing Inception-based distributional distance yields perceptually superior outputs, particularly for stochastic texture synthesis tasks where fine-scale irregularity defines naturalness.
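The two human-evaluation statistics used above reduce to simple formulas: VTPR is the fraction of trials in which an evaluator correctly flagged an output as synthetic (0.5 = chance-level indistinguishability), and MOS is the plain mean of 1-5 Likert ratings. A sketch with a normal-approximation 95% confidence interval; the trial counts below are illustrative, not the study's raw data:

```python
import math

def vtpr_with_ci(correct_detections, n_trials, z=1.96):
    """Visual Turing Pass Rate with a normal-approximation 95% CI.
    0.5 indicates perfect indistinguishability (chance level)."""
    p = correct_detections / n_trials
    half = z * math.sqrt(p * (1 - p) / n_trials)
    return p, (p - half, p + half)

def mos(ratings):
    """Mean Opinion Score: mean of 1-5 Likert ratings."""
    return sum(ratings) / len(ratings)

# Illustrative numbers only (hypothetical trial tallies):
p, (lo, hi) = vtpr_with_ci(96, 180)
score = mos([3, 2, 4, 3, 3])
```

An architecture is "human-indistinguishable" in this sense when the VTPR confidence interval contains 0.5, as reported for GauGAN above.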
3.4. The Metric-Perception Divergence
The contradiction between automated and human assessments is evident in the comparative results: architectures with better (lower) FID scores do not consistently achieve better human perceptual scores. This misalignment contradicts the foundational assumption of metric-based GAN development, namely that minimizing FID produces perceptually superior outputs. Pix2Pix represents the most extreme example of this divergence, achieving the best FID (85.286) yet the worst human scores (VTPR = 0.650, MOS-MA = 2.333). Conversely, GauGAN achieved the best human evaluation scores (VTPR = 0.533, MOS-MA = 2.889) despite having a higher (worse) FID of 87.308 compared to Pix2Pix. This inverted pattern, in which the architecture that best optimizes the standard training metric performs worst in human assessment, reveals a fundamental limitation in using Inception-v3-based metrics as the sole validation criterion for texture synthesis tasks.
Visual artifact analysis (Figure 6) reveals the mechanism underlying this divergence through comparative magnification of real marble, GauGAN output, and Pix2Pix output. While all three appear photorealistic at standard viewing distances, magnified inspection reveals critical qualitative differences. Real marble exhibits stochastic, aperiodic texture where fine-scale features, such as grain patterns, vein edge irregularities, and matrix crystallization details, show continuous local variation with no repeating motifs. GauGAN successfully replicates this natural stochasticity: magnified regions reveal organic texture variation indistinguishable from real marble, where adjacent areas maintain natural uniqueness without systematic pattern repetition.
In contrast, Pix2Pix outputs exhibit a subtle but systematic failure mode at magnification: identical or near-identical texture motifs recur multiple times within local neighborhoods (indicated by arrows in Figure 6(c)). These repetitive microstructures, such as specific vein branching geometries, matrix grain arrangements, or edge detail patterns appearing 3-5 times in the same orientation, violate the aperiodic character of natural mineral crystallization. While real marble and GauGAN outputs show rich local variation where no two adjacent regions share identical fine-scale structure, Pix2Pix occasionally replicates learned texture patterns during generation, producing spatially repetitive structures. This artifact manifests with sufficient frequency that expert evaluators, trained to assess natural stone quality, immediately perceive it as a synthetic regularity inconsistent with geological formation processes.
3.5. Computational Efficiency and Deployment Feasibility
Figure 7 presents the training evolution across all four architectures, revealing distinct convergence patterns that explain their final performance characteristics. The FID evolution (Figure 7(c)) demonstrates that GauGAN achieved the fastest and smoothest convergence, reaching its final FID of 87.308 within 1,000 epochs and maintaining stability thereafter. Pix2Pix exhibited slower but consistent FID improvement, achieving its best score of 85.286 at epoch 3,000. The baseline cGAN and BicycleGAN showed more volatile training dynamics with higher final FID values (94.623 and 100.071, respectively), suggesting that explicit latent diversity mechanisms complicate distributional alignment with Inception-v3 feature statistics.
Generator loss trajectories (Figure 7(a)) reveal the architectural differences in learning dynamics. GauGAN exhibited the most dramatic initial loss decrease, dropping from ~45 to ~21 within the first 1,000 epochs before stabilizing, a pattern indicating rapid feature learning enabled by SPADE's multi-scale semantic injection. BicycleGAN showed similar convergence behavior but with lower initial loss (~22), while Pix2Pix maintained relatively stable generator loss throughout training (~13-14), consistent with its deterministic mapping and L1 regularization. The baseline cGAN exhibited the lowest generator loss (~2-4), but this did not translate to superior FID performance, highlighting the disconnect between generator loss magnitude and perceptual quality.
Discriminator loss evolution (Figure 7(b)) provides insight into adversarial training stability. GauGAN's discriminator initially struggled (loss ~0.4, indicating overly confident predictions) before stabilizing around 0.6-0.7, suggesting the generator learned to produce challenging outputs that maintained discriminator uncertainty. Pix2Pix's discriminator loss decreased steadily from ~1.2 to ~0.2, indicating the discriminator became increasingly confident in detecting synthetic images, potentially explaining why human evaluators also found Pix2Pix outputs more detectable despite its superior FID. The baseline cGAN and BicycleGAN maintained more balanced discriminator losses (~0.5-0.6), consistent with the Nash equilibrium sought in adversarial training.
Table 3 reports computational characteristics relevant for industrial deployment. GauGAN requires the fewest trainable parameters (53.1M) despite its architectural complexity, as SPADE normalization layers are parameter-efficient relative to U-Net encoder stacks. However, GauGAN demands the highest training cost (0.82 min/epoch, 1761.3 GFLOPS) due to the computational expense of per-pixel normalization parameter prediction. This translates to a total training time of 42.4 hours to convergence (3,100 epochs), compared to 10.2 hours for Pix2Pix (5,100 epochs × 0.12 min/epoch) and 12.7 hours for the baseline cGAN (6,900 epochs × 0.11 min/epoch). Pix2Pix offers competitive training efficiency (0.12 min/epoch) at a comparable parameter count (61.4M). At inference time, all models run fast enough on consumer GPUs at 1280×720 resolution for use in interactive design tools. Given GauGAN's superior perceptual quality validated through human evaluation, the 4.2× total training time penalty relative to Pix2Pix (42.4 vs. 10.2 hours) is justified for quality-critical applications such as architectural visualization, where human perception is the ultimate arbiter.
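The total training times quoted above follow directly from epochs × per-epoch cost; a quick arithmetic check using the per-epoch figures reported in the text:

```python
def training_hours(epochs, min_per_epoch):
    """Total training time in hours from epoch count and min/epoch."""
    return epochs * min_per_epoch / 60.0

# Figures quoted in the text above:
gaugan = training_hours(3100, 0.82)   # ~42.4 h
pix2pix = training_hours(5100, 0.12)  # ~10.2 h
cgan = training_hours(6900, 0.11)     # ~12.7 h
ratio = gaugan / pix2pix              # ~4.2x penalty for GauGAN
```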
4. Discussion
This section synthesizes the experimental findings to interpret their broader implications for industrial texture synthesis, analyzes the training dynamics and architectural insights, and discusses the study’s limitations while proposing concrete directions for future research.
4.1. Synthesis of Findings and Implications
The comprehensive evaluation confirms that conditional GANs are highly effective at synthesizing photorealistic, structurally controlled marble textures. However, the results reveal a significant misalignment between automated quantitative metrics and expert human judgment, a finding with profound implications for how the research community validates generative models. The quantitative-perceptual divergence between Pix2Pix (best FID: 85.286, worst MOS-MA: 2.333) and GauGAN (FID: 87.308, best MOS-MA: 2.889) demonstrates that Inception-based metrics fail to capture human-relevant texture quality. This finding aligns with large-scale empirical evidence that models producing more perceptually realistic images paradoxically score worse on FID, suggesting that replacing Inception-v3 with alternative encoders could improve human-metric alignment [13]. Our results extend these findings to another industrial context, confirming Borji's (2022) [14] observation that FID is particularly unsuitable for specialized domains where visual features differ from those in natural images. The mechanism underlying this divergence is architectural: Pix2Pix's transposed convolutional layers are prone to introducing subtle patterns that, although they marginally affect structural similarity metrics, are immediately perceptible to human observers as unnatural regularity. In contrast, GauGAN's SPADE-based generator avoids transposed convolutions entirely, producing smooth organic textures that experts could not distinguish from real marble at statistically significant levels.
This finding has significant implications for industrial applications. Optimizing for FID alone would select Pix2Pix, delivering textures that professional users immediately recognize as artificial, a costly error in architectural visualization, virtual prototyping, and digital arts, where material authenticity determines client acceptance. For applications in which the end user is human, structured, human-centered evaluations should be considered an essential component of the validation pipeline. While automated metrics remain invaluable for guiding training dynamics, they are insufficient as the sole arbiters of perceptual quality.
Furthermore, the success of GauGAN in generating controllable, high-fidelity textures signals a potential paradigm shift in industrial material design. Our work demonstrates that conditional GANs successfully unify the control of traditional procedural methods with the realism of data-driven techniques, enabling explicit structural control through binary masks while synthesizing photorealistic local appearance. This capability could transition design processes from selecting materials from predefined catalogs to actively creating bespoke, digitally native materials on demand, thereby supporting virtual prototyping workflows and digital twin applications.
4.2. Training Dynamics and Architectural Insights
The training dynamics reveal fundamental architectural trade-offs between convergence speed, computational cost, and perceptual quality. GauGAN’s rapid FID convergence demonstrates SPADE’s efficiency in learning texture distributions through multi-scale semantic injection. However, this advantage comes with a substantial per-epoch computational cost: GauGAN requires 1,761.3 GFLOPS per forward pass (10.3× higher than Pix2Pix) due to per-pixel convolutions in each SPADE layer. The discriminator loss patterns provide additional insight into the metric-perception divergence. Pix2Pix’s steadily decreasing discriminator loss indicates that the discriminator has learned to reliably detect synthetic outputs, aligning with human evaluation results. In contrast, GauGAN maintained discriminator uncertainty throughout training, suggesting that its outputs remained difficult to classify even for networks explicitly trained to detect them. This adversarial balance correlates with human indistinguishability, validating the original GAN objective. For industrial deployment, these findings suggest clear decision criteria: applications requiring rapid iteration should prioritize Pix2Pix despite detectable artifacts, whereas quality-critical applications justify GauGAN’s 4.2× training-time investment to achieve human-level perceptual authenticity.
4.3. Practical Validation of Unsupervised Mask Generation
The successful training of all architectures to photorealistic quality levels validates the unsupervised segmentation pipeline as a practical solution to the annotation bottleneck. Masks generated from SLIC superpixels, GMM clustering, and Graph Cut optimization were sufficient to condition high-fidelity synthesis across all 289 marble slabs without manual correction, which is critical for industrial deployment, where annotation costs can be high for datasets of this scale.
The robustness to mask imperfections is noteworthy: the L1 reconstruction loss during GAN training implicitly corrects minor mask inaccuracies by learning to fill vein regions with textures that match the training data statistics. This aligns with recent work demonstrating that two-stage generative pipelines can be effective even with imperfect label maps [38]. However, systematic segmentation failures would propagate through the synthesis pipeline, suggesting that future work exploring foundation models like Segment Anything could further improve robustness.
4.4. Limitations and Directions for Future Research
While this study establishes a foundational benchmark for mask-conditioned synthesis of stochastic natural materials, several limitations warrant acknowledgment.
First, our methodology was validated on a single marble type. While this material exhibits rich vein patterns providing a challenging test case, geological materials display enormous visual diversity. Future work should extend this framework to taxonomically diverse natural stones to assess whether the findings generalize across material classes with different mechanisms of texture generation.
Second, the statistical power of human evaluation is constrained by the sample size. While sufficient for exploratory validation and aligned with practices in perceptual quality assessment literature, future work should employ larger panels to establish robust effect sizes and enable subgroup analyses across evaluator expertise levels. Consumer preferences may differ systematically from expert judgments, and expanding evaluation through crowdsourced platforms following ITU-standardized protocols would strengthen generalizability.
Third, we deliberately excluded diffusion models from this benchmark for methodological rigor: comparing models with fundamentally different training paradigms (adversarial vs. denoising), data requirements, and computational profiles would introduce confounding variables that obscure architecture-specific insights. State-of-the-art diffusion models like Stable Diffusion and ControlNet leverage massive pre-trained foundation models, making direct comparison methodologically complex. A dedicated diffusion model comparison using the same dataset and evaluation protocols is underway as a follow-up study, with careful attention to disentangling architectural effects from pre-training data scale.
Fourth, extending this 2D framework to 3D volumetric texture generation would enable more immersive visualization in architectural applications where marble veins penetrate through material depth. Recent work on 3D-aware generative models and neural radiance fields provides promising foundations for this extension.
Finally, the observed metric-perception misalignment suggests a critical need for developing new evaluation metrics that better align with human judgment for texture synthesis tasks. Research directions include learning-based perceptual metrics trained on human judgments, multi-scale texture descriptors that capture stochastic properties, and hybrid frameworks that combine efficient automated screening with targeted human validation. The dataset and protocols established here could serve as benchmarks for developing such metrics.
Figure 1.
Flowchart of the proposed pipeline for controllable marble texture synthesis. The workflow proceeds through four stages: (1) Data curation and pre-processing, including artifact-based filtering and image standardization; (2) Semantic mask generation via SLIC superpixel segmentation, GMM-based probabilistic modeling, and Graph Cut spatial regularization; (3) Conditional GAN implementation, encompassing model construction and adversarial training; (4) Performance evaluation combining qualitative human-centered assessment (VTPR, MOS-MA) with quantitative automated metrics (pixel-based, statistical, and learned distributional).
Figure 2.
Representative marble slabs and unsupervised mask generation. Top row: Four samples showing natural variation in vein density and orientation from the dataset of the marble commercially known as Exotic Ambar. Bottom row: Corresponding binary masks generated automatically via the SLIC + GMM + Graph Cut pipeline.
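The SLIC + GMM + Graph Cut mask-generation stage can be approximated in a minimal sketch. The version below is a simplification, not the paper's implementation: a pure-NumPy two-component Gaussian mixture is fit to pixel intensities by EM, and a 3×3 majority filter stands in for both the superpixel grouping and the Graph Cut spatial regularization; the function names `fit_gmm_1d` and `vein_mask` are illustrative.

```python
import numpy as np

def fit_gmm_1d(x, n_iter=50):
    """Fit a two-component 1-D Gaussian mixture to intensities via EM."""
    mu = np.array([x.min(), x.max()], dtype=float)   # init at the extremes
    var = np.full(2, x.var() + 1e-6)
    w = np.array([0.5, 0.5])
    for _ in range(n_iter):
        # E-step: per-pixel component responsibilities (log-domain for stability)
        d = x[:, None] - mu[None, :]
        log_p = -0.5 * d ** 2 / var - 0.5 * np.log(2 * np.pi * var) + np.log(w)
        log_p -= log_p.max(axis=1, keepdims=True)
        r = np.exp(log_p)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights, means, and variances
        nk = r.sum(axis=0) + 1e-12
        mu = (r * x[:, None]).sum(axis=0) / nk
        var = (r * (x[:, None] - mu[None, :]) ** 2).sum(axis=0) / nk + 1e-6
        w = nk / len(x)
    return mu, r

def vein_mask(img, smooth_iters=1):
    """Assign each pixel to the darker GMM component (veins), then apply a
    3x3 majority filter as a crude stand-in for Graph Cut regularization."""
    mu, r = fit_gmm_1d(img.ravel().astype(float))
    mask = (r.argmax(axis=1) == np.argmin(mu)).reshape(img.shape)
    h, wd = mask.shape
    for _ in range(smooth_iters):
        padded = np.pad(mask.astype(int), 1, mode="edge")
        votes = sum(padded[i:i + h, j:j + wd] for i in range(3) for j in range(3))
        mask = votes >= 5   # majority of the 3x3 neighborhood
    return mask
```

On a slab photograph normalized to [0, 1], dark veins form one intensity mode and the background the other, so the component with the smaller mean is taken as the vein class.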
Figure 3.
Dual-evaluation framework for comparative assessment of automated metrics versus human perceptual judgment. (a) Framework overview: generated images undergo parallel quantitative (10 automated metrics) and qualitative evaluation (2 protocols with human expert evaluators), with results compared to identify metric-perception alignment or divergence. (b) Quantitative metrics battery: ten metrics spanning pixel-based and structural (MSE, PSNR, MS-SSIM, SCD), statistical (SD, CC, Entropy, FMI-pixel), and learned distributional families (IS, FID). Arrows indicate optimization direction (↑ = higher is better; ↓ = lower is better). (c) VTPR protocol: three domain experts perform 60 two-alternative forced-choice trials, identifying real versus synthetic marble within 4-second viewing windows. VTPR = 1 - (error rate); 0.5 indicates perfect indistinguishability (chance level), values >0.5 indicate detectably synthetic outputs. (d) MOS-MA protocol: experts rate images on a 5-point Likert scale (1 = “Clearly Artificial”, 5 = “Clearly Natural”) without time constraints. MOS-MA = mean rating across all trials.
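The VTPR and MOS-MA statistics of panels (c) and (d) reduce to a proportion estimate and a sample mean. A sketch under the caption's definitions follows; the function names are illustrative, and the 95% intervals use a normal approximation, which may differ from the paper's interval construction.

```python
import math

def vtpr(n_correct, n_trials):
    """VTPR per the caption's definition (1 - error rate): the fraction of
    2AFC trials the expert answered correctly; 0.5 is chance level."""
    p = n_correct / n_trials
    half = 1.96 * math.sqrt(p * (1 - p) / n_trials)   # normal-approx 95% CI
    return p, (max(0.0, p - half), min(1.0, p + half))

def mos_ma(ratings):
    """Mean Opinion Score on the 1-5 Likert scale, with a normal-approx 95% CI."""
    n = len(ratings)
    mean = sum(ratings) / n
    var = sum((r - mean) ** 2 for r in ratings) / (n - 1)   # sample variance
    half = 1.96 * math.sqrt(var / n)
    return mean, (mean - half, mean + half)
```

For example, an expert who answers 30 of 60 trials correctly yields VTPR = 0.5 with a confidence interval spanning chance level, the "perfect indistinguishability" criterion used for GauGAN in Figure 5.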

Figure 4.
Qualitative comparison of marble texture synthesis across four cGAN architectures. (a) Input binary mask. (b) Ground-truth real marble. (c) cGAN output. (d) Pix2Pix output. (e) BicycleGAN output. (f) GauGAN output. Rows show samples with varying vein density and orientation.
Figure 5.
Human-centered evaluation reveals ranking inversion relative to automated metrics. (a) Visual Turing Pass Rate (VTPR): fraction of trials in which expert evaluators correctly identified synthetic images (lower values indicate more realistic outputs that fool experts). Dashed line at 0.5 indicates chance-level performance (perfect indistinguishability). GauGAN’s 95% confidence interval (error bars) spans 0.5, demonstrating expert-level photorealism. Pix2Pix, despite achieving the best FID score, is most easily detected by human evaluators. (b) Mean Opinion Score on Marble Authenticity (MOS-MA) on a 5-point Likert scale (1 = clearly artificial, 5 = perfectly natural). GauGAN achieves the highest authenticity ratings; Pix2Pix is rated significantly worse despite its metric superiority. Error bars represent 95% confidence intervals for all architectures.
Figure 6.
This finding validates recent literature documenting Inception-v3’s inadequacy for texture quality assessment in stochastic material synthesis, but extends it to an industrial context with economic stakes: optimizing for FID in a product design application would select Pix2Pix, delivering textures that professional users immediately recognize as artificial owing to perceptible local regularity. This is a costly error with direct consequences for architectural visualization, virtual prototyping, and design workflows where material authenticity determines client acceptance.
Figure 7.
Training dynamics across the four architectures. (a) Generator loss convergence. (b) Discriminator loss evolution. (c) FID evolution over epochs.
Table 1.
Final hyperparameter settings per architecture, determined through validation FID optimization.
| Architecture | Generator LR | Discriminator LR | Encoder LR | λ_L1 | λ_latent | Latent Dim | Total Training Epochs |
|---|---|---|---|---|---|---|---|
| cGAN | 1×10⁻⁴ | 5×10⁻⁵ | — | — | — | 100 | 6,900 |
| Pix2Pix | 2×10⁻⁴ | 2×10⁻⁴ | — | 100 | — | — | 5,100 |
| BicycleGAN | 2×10⁻⁴ | 2×10⁻⁴ | 2×10⁻⁴ | 100 | 10 | 8 | 3,400 |
| GauGAN | 5×10⁻⁵ | 5×10⁻⁵ | — | 350 | — | — | 3,100 |
Table 2.
Automated performance metrics on validation set (57 samples, mean ± std). Bold indicates best performance per metric.
Metric families: pixel-based and structural (MSE, PSNR, MS-SSIM, SCD); statistical (SD, CC, Entropy, FMI-pixel); learned distributional (IS, FID).

| Architecture | MSE (↓) | PSNR (↑) | MS-SSIM (↑) | SCD (↑) | SD (↑) | CC (↑) | Entropy (↑) | FMI-pixel (↑) | IS (↑) | FID (↓) |
|---|---|---|---|---|---|---|---|---|---|---|
| cGAN | 0.009 ± 0.004 | 20.931 ± 1.898 | 0.636 ± 0.065 | 0.980 ± 0.010 | 0.123 ± 0.015 | 0.780 ± 0.047 | 6.433 ± 0.226 | 0.643 ± 0.093 | 1.790 ± 0.154 | 94.623 |
| Pix2Pix | 0.007 ± 0.003 | 21.710 ± 1.883 | 0.680 ± 0.076 | 0.981 ± 0.010 | **0.125 ± 0.011** | 0.829 ± 0.056 | **6.439 ± 0.222** | 0.794 ± 0.154 | **1.940 ± 0.275** | **85.286** |
| BicycleGAN | 0.007 ± 0.004 | 22.165 ± 2.207 | 0.693 ± 0.080 | 0.982 ± 0.009 | 0.119 ± 0.011 | 0.839 ± 0.057 | 6.385 ± 0.214 | 0.826 ± 0.171 | 1.766 ± 0.197 | 100.071 |
| GauGAN | **0.006 ± 0.003** | **22.626 ± 2.514** | **0.713 ± 0.098** | **0.983 ± 0.010** | 0.120 ± 0.014 | **0.847 ± 0.065** | 6.381 ± 0.231 | **0.886 ± 0.225** | 1.903 ± 0.264 | 87.308 |
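Four of the ten Table 2 metrics can be re-implemented directly from their textbook definitions; the remainder (MS-SSIM, SCD, FMI-pixel, IS, FID) require multi-scale decompositions or pretrained networks. The sketch below is a simplified re-implementation, not the paper's evaluation code, for images scaled to [0, 1]:

```python
import numpy as np

def mse(a, b):
    """Mean squared error between images scaled to [0, 1] (lower is better)."""
    return float(np.mean((a - b) ** 2))

def psnr(a, b, data_range=1.0):
    """Peak signal-to-noise ratio in dB (higher is better)."""
    err = mse(a, b)
    return float("inf") if err == 0 else float(10 * np.log10(data_range ** 2 / err))

def corr_coef(a, b):
    """Pearson correlation coefficient (the CC column; higher is better)."""
    return float(np.corrcoef(a.ravel(), b.ravel())[0, 1])

def entropy(img, bins=256):
    """Shannon entropy of the grey-level histogram, in bits (higher is better)."""
    hist, _ = np.histogram(img, bins=bins, range=(0.0, 1.0))
    p = hist / hist.sum()
    p = p[p > 0]   # drop empty bins so log2 is defined
    return float(-(p * np.log2(p)).sum())
```

Per Table 2, each metric would be averaged over the 57 validation pairs of generated and ground-truth images to produce the reported mean ± std.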
Table 3.
Model complexity and training efficiency metrics.
| Architecture | # of Parameters (Millions) | Computational Cost (GFLOPS) | Avg. Time/Epoch (min) | Total Epochs |
|---|---|---|---|---|
| cGAN | 63.9 | 170.2 | 0.11 | 6,900 |
| Pix2Pix | 61.4 | 170.2 | 0.12 | 5,100 |
| BicycleGAN | 85.9 | 105.4 | 0.17 | 3,400 |
| GauGAN | 53.1 | 1761.3 | 0.82 | 3,100 |