Submitted:
16 March 2026
Posted:
17 March 2026
Abstract
The marble industry relies on proprietary commercial names rather than objective visual categories, creating market inefficiencies for stakeholders who select stones based on appearance. Supervised classification methods perpetuate this problem by replicating inconsistent commercial labels instead of discovering intrinsic visual structure. We propose an unsupervised pipeline that combines a two-stage training strategy (pure self-supervised pretraining followed by cluster-aware fine-tuning of a DINO Vision Transformer) with UMAP dimensionality reduction and Ward's agglomerative hierarchical clustering. Systematic ablation studies on 1,540 marble images spanning 10 commercial varieties validate each design choice: cluster-aware training at k=10 yields better embeddings than the self-supervised baseline (Silhouette Score 0.778 vs. 0.761; Davies–Bouldin Index 0.293 vs. 0.364), UMAP compression to five dimensions resolves high-dimensional noise pathologies, and Ward's linkage produces the most compact partitions. The resulting taxonomy reveals three phenomena invisible to commercial classification: cross-category merging of visually indistinguishable stones that carry different market names, intra-category splitting of heterogeneous sub-populations within single varieties, and coherent grouping where commercial and visual boundaries coincide. We further demonstrate that standard extrinsic metrics are misaligned with unsupervised taxonomy objectives when the reference labels encode the very inconsistencies the method aims to resolve. This work provides a validated methodology for data-driven visual classification in the natural stone industry and a transferable template for domains with unreliable labelling conventions.
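The two intrinsic metrics quoted in the abstract can be illustrated with a minimal sketch. It follows the standard textbook definitions of the Silhouette Score and the Davies–Bouldin Index using only the Python standard library; it is not the authors' implementation, and the toy points and label assignments below are invented purely for illustration rather than taken from the marble dataset.

```python
import math
from collections import defaultdict


def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points (higher is better, at most 1)."""
    clusters = defaultdict(list)
    for idx, lab in enumerate(labels):
        clusters[lab].append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    total = 0.0
    for i, p in enumerate(points):
        own = clusters[labels[i]]
        if len(own) == 1:
            continue  # common convention: a singleton cluster contributes 0
        # a: mean distance to the other members of the point's own cluster
        a = sum(math.dist(p, points[j]) for j in own if j != i) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(math.dist(p, points[j]) for j in members) / len(members)
            for lab, members in clusters.items()
            if lab != labels[i]
        )
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)


def davies_bouldin_index(points, labels):
    """Mean worst-case cluster similarity (lower is better, at least 0).

    Assumes at least two clusters with distinct centroids.
    """
    clusters = defaultdict(list)
    for p, lab in zip(points, labels):
        clusters[lab].append(p)
    centroids = {
        lab: [sum(coord) / len(members) for coord in zip(*members)]
        for lab, members in clusters.items()
    }
    # scatter: mean distance of a cluster's members to its centroid
    scatter = {
        lab: sum(math.dist(p, centroids[lab]) for p in members) / len(members)
        for lab, members in clusters.items()
    }
    worst = [
        max(
            (scatter[i] + scatter[j]) / math.dist(centroids[i], centroids[j])
            for j in clusters
            if j != i
        )
        for i in clusters
    ]
    return sum(worst) / len(worst)


# Hypothetical 2-D "embeddings": two tight, well-separated groups.
toy_points = [(0, 0), (0, 1), (1, 0), (1, 1),
              (10, 10), (10, 11), (11, 10), (11, 11)]
good_labels = [0, 0, 0, 0, 1, 1, 1, 1]  # partition matches the visual groups
bad_labels = [0, 1, 0, 1, 0, 1, 0, 1]   # partition cuts across the groups

for name, labels in (("good", good_labels), ("bad", bad_labels)):
    print(name,
          round(silhouette_score(toy_points, labels), 3),
          round(davies_bouldin_index(toy_points, labels), 3))
```

On the toy data, the partition that respects the two groups scores a higher silhouette and a lower Davies–Bouldin value than the one that cuts across them. This is the same direction of improvement the abstract reports for cluster-aware training over the self-supervised baseline. Both metrics use only the embeddings and the cluster assignments, so neither depends on the commercial labels.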