Computer Science and Mathematics

Article
Computer Science and Mathematics
Computer Vision and Graphics

Noah Fang, Salma Ali

Abstract: Current multimodal large language model (MLLM)-based autonomous driving systems struggle with deep contextual understanding, fine-grained personalization, and transparent risk assessment in complex real-world scenarios. This paper introduces ConsciousDriver, a novel context-aware multimodal personalized autonomous driving system designed to address these limitations. Our system integrates a Context-Awareness Module for richer environmental understanding and a Dynamic Risk Adaptation Mechanism to flexibly modulate driving behaviors based on real-time user prompts and situational risks. Built upon an extended MLLM architecture, ConsciousDriver processes environmental inputs and user prompts to generate deep contextual understanding, context-adaptive danger levels, optimal action decisions, and explicit decision intent explanations. Evaluated on an extended PAD-Highway benchmark, ConsciousDriver demonstrates superior performance in driving safety, efficiency, lane-keeping, and traffic density adaptation. Furthermore, it exhibits robust adaptability to diverse personalized prompts and enhanced performance in challenging traffic scenarios, with lower collision and higher completion rates. Human evaluation confirms the high quality of its explanations. ConsciousDriver represents a significant advancement towards intelligent, adaptive, and trustworthy autonomous driving.

Article
Computer Science and Mathematics
Computer Vision and Graphics

Cosimo Aliani, Cosimo Lorenzetto Bologna, Piergiorgio Francia, Leonardo Bocchi

Abstract: Standing equine cone-beam CT (CBCT) enables diagnostic imaging in weight-bearing conditions, reducing the need for general anaesthesia, but residual postural sway during acquisition can introduce motion artefacts that degrade image quality. External optical tracking based on a ChArUco fiducial and an auxiliary RGB camera is a practical strategy for projection-wise motion compensation; however, the impact of camera–marker geometry on pose-estimation performance is not well characterised. This study evaluates how viewing angle and working distance affect ChArUco-based pose estimation in a controlled CBCT-motivated setting. Pose estimates were obtained with an Intel RealSense D435 RGB sensor and OpenCV, using dedicated mechanical fixtures to vary viewing angle in 1° increments and to adjust the camera-to-board distance in 5 cm steps. Accuracy and precision were quantified using mean absolute error with respect to ground truth and the standard deviation across repeated measurements. Continuous and cyclic acquisition protocols yielded comparable errors, indicating that repeated repositioning did not introduce substantial additional variability. Viewing-angle experiments revealed a consistent accuracy–precision trade-off for rotation estimation: frontal views minimise mean absolute error but exhibit the highest variability, whereas increasingly oblique views reduce variability at the expense of larger mean error. Increasing working distance was associated with larger standard deviations, particularly affecting depth repeatability. These results provide practical guidance for selecting camera placement and nominal viewpoints when deploying ChArUco-based tracking for motion-aware standing equine CBCT/CT workflows.
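
A minimal sketch of the kind of ChArUco pose estimation and repeatability analysis described above, assuming opencv-contrib-python with the pre-4.7 cv2.aruco API (OpenCV ≥ 4.7 constructs the board via cv2.aruco.CharucoBoard instead); the board geometry and camera intrinsics are illustrative placeholders rather than the study's calibration.

```python
import cv2
import numpy as np

# Board and camera setup (placeholder values, not the study's configuration).
dictionary = cv2.aruco.getPredefinedDictionary(cv2.aruco.DICT_5X5_100)
board = cv2.aruco.CharucoBoard_create(7, 5, 0.04, 0.03, dictionary)  # 7x5 squares, 4 cm / 3 cm
K = np.array([[615.0, 0.0, 320.0], [0.0, 615.0, 240.0], [0.0, 0.0, 1.0]])
dist = np.zeros(5)

def estimate_board_pose(bgr_frame):
    """Return (rvec, tvec) of the ChArUco board in camera coordinates, or None on failure."""
    gray = cv2.cvtColor(bgr_frame, cv2.COLOR_BGR2GRAY)
    corners, ids, _ = cv2.aruco.detectMarkers(gray, dictionary)
    if ids is None or len(ids) == 0:
        return None
    _, ch_corners, ch_ids = cv2.aruco.interpolateCornersCharuco(corners, ids, gray, board)
    if ch_ids is None or len(ch_ids) < 4:
        return None
    rvec, tvec = np.zeros((3, 1)), np.zeros((3, 1))
    ok, rvec, tvec = cv2.aruco.estimatePoseCharucoBoard(ch_corners, ch_ids, board, K, dist, rvec, tvec)
    return (rvec, tvec) if ok else None

def accuracy_and_precision(tvecs, ground_truth):
    """Mean absolute error vs. ground truth (accuracy) and std over repeats (precision)."""
    t = np.asarray(tvecs).reshape(len(tvecs), 3)
    return np.mean(np.abs(t - ground_truth), axis=0), np.std(t, axis=0)
```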

Article
Computer Science and Mathematics
Computer Vision and Graphics

Jerry Gao, Radhika Agarwal, Rohit Vardam, Jasleen Narang

Abstract: OCR systems are increasingly used in high-stakes, form-based applications such as employment onboarding and voter registration, where accuracy is crucial to system reliability. This paper introduces a test modeling and evaluation framework for Smart OCR applications that combines decision tables, context trees, and component-level complexity analysis to enable complete, automated validation. The proposed approach models document structure, environmental conditions (lighting, angle, and image quality), and input completeness to produce test cases representative of different operational situations. Accuracy and coverage metrics are used together to measure recognition fidelity and the structural completeness of multimodal form components such as text fields, tables, checkboxes, radio buttons, and signatures. The framework is validated in an empirical case study on employment and voter registration forms with three commercial OCR tools: Microsoft Lens, Amazon Textract, and Parsio.io. The experimental findings show clear trade-offs between accuracy and coverage and reveal considerable differences in the symbolic and contextual extraction abilities of the tools. Parsio.io achieves the most balanced performance, combining high coverage with strong multimodal recognition among the tested systems.
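
As an illustration of decision-table-style test case generation and joint accuracy/coverage scoring of the kind described above, the sketch below enumerates factor combinations with itertools; the factor names, levels, and scoring rules are hypothetical placeholders, not the paper's actual test model.

```python
from itertools import product

FACTORS = {
    "lighting": ["bright", "dim", "backlit"],
    "angle": ["flat", "skewed_15deg", "skewed_30deg"],
    "image_quality": ["high", "compressed", "blurred"],
    "completeness": ["all_fields", "missing_signature", "missing_checkbox"],
}

def generate_test_cases(factors):
    """Cartesian product over factor levels -> one test case per operational situation."""
    keys = list(factors)
    return [dict(zip(keys, combo)) for combo in product(*factors.values())]

def accuracy(extracted, ground_truth):
    """Fraction of ground-truth field values the OCR tool reproduced exactly."""
    correct = sum(extracted.get(k) == v for k, v in ground_truth.items())
    return correct / max(len(ground_truth), 1)

def coverage(extracted, expected_components):
    """Fraction of structural components (text fields, tables, checkboxes, ...) detected at all."""
    found = sum(1 for c in expected_components if c in extracted)
    return found / max(len(expected_components), 1)

cases = generate_test_cases(FACTORS)
print(len(cases), "test cases, e.g.:", cases[0])
```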

Article
Computer Science and Mathematics
Computer Vision and Graphics

Ziyu Fang, Minghao Ye

Abstract: Image-to-PointCloud (I2P) place recognition is crucial for autonomous systems, facing challenges from modality discrepancies and environmental variations. Existing feature fusion strategies often fall short in complex real-world scenarios. We propose AttnLink, a novel framework that significantly enhances I2P place recognition through a sophisticated attention-guided cross-modal feature fusion mechanism. AttnLink integrates an Adaptive Depth Completion Network to generate dense depth maps and an Attention-Guided Cross-Modal Feature Encoder, utilizing lightweight spatial attention for local features and a context-gating mechanism for robust semantic clustering. Our core innovation is a Multi-Head Attention Fusion Network, which adaptively weights and fuses multi-modal, multi-level descriptors for a highly discriminative global feature vector. Trained end-to-end, AttnLink demonstrates superior performance on KITTI and HAOMO datasets, outperforming state-of-the-art methods in retrieval accuracy, efficiency, and robustness to varying input quality. Detailed ablation studies confirm the effectiveness of its components, supporting AttnLink's reliable deployment in real-time autonomous driving applications.
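
A minimal PyTorch sketch, in the spirit of the multi-head attention fusion described above, of how several multi-modal, multi-level descriptors can be adaptively weighted into one global vector; the dimensions and the learned fusion query are illustrative assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionFusion(nn.Module):
    def __init__(self, dim=256, num_heads=8):
        super().__init__()
        self.query = nn.Parameter(torch.randn(1, 1, dim))        # learned fusion query
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.proj = nn.Linear(dim, dim)

    def forward(self, descriptors):
        # descriptors: (B, N, dim) -- N per-modality / per-level descriptor branches
        q = self.query.expand(descriptors.size(0), -1, -1)
        fused, weights = self.attn(q, descriptors, descriptors)  # adaptively weight branches
        global_desc = F.normalize(self.proj(fused.squeeze(1)), dim=-1)
        return global_desc, weights

fusion = AttentionFusion()
branches = torch.randn(4, 6, 256)    # batch of 4 samples, six descriptor branches each
g, w = fusion(branches)
print(g.shape, w.shape)               # (4, 256), (4, 1, 6)
```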

Article
Computer Science and Mathematics
Computer Vision and Graphics

Zeyuan Xun, Yichen Ku

Abstract: Three-dimensional medical image segmentation is critical for clinical applications, yet expert annotations are costly, driving the need for semi-supervised learning. Current semi-supervised methods struggle with robustly integrating diverse network architectures and managing pseudo-label quality, especially in complex three-dimensional scenarios. We propose Dynamic Multi-Expert Diffusion Segmentation (DMED-Seg), a novel framework for semi-supervised three-dimensional medical image segmentation. DMED-Seg leverages a Diffusion Expert for global contextual understanding and a Convolutional Expert for fine-grained local detail extraction. A key innovation is the Dynamic Fusion Module, a lightweight Transformer that adaptively integrates multi-scale features and predictions from both experts based on their confidence. Complementing this, Confidence-Aware Consistency Learning enhances pseudo-label quality for unlabeled data using DFM-derived confidence, while Inter-expert Feature Alignment fosters synergistic learning between experts through contrastive loss. Extensive experiments on multiple public three-dimensional medical datasets demonstrate DMED-Seg consistently achieves superior performance across various labeled data ratios, outperforming state-of-the-art methods. Ablation studies confirm the efficacy of each proposed component, highlighting DMED-Seg as a highly effective and practical solution for three-dimensional medical image segmentation.
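
The sketch below illustrates the general idea of confidence-aware pseudo-label consistency on unlabeled volumes: voxels where the fused prediction is confident contribute more to the consistency term. It is a hedged illustration only, with an assumed threshold, not the paper's exact loss.

```python
import torch
import torch.nn.functional as F

def confidence_weighted_consistency(logits_expert_a, logits_expert_b, fused_probs, tau=0.8):
    """logits_*: (B, C, D, H, W) predictions of the two experts on the same unlabeled volume.
    fused_probs: (B, C, D, H, W) softmax output of the fusion module, used as confidence."""
    confidence, pseudo_label = fused_probs.max(dim=1)       # (B, D, H, W)
    weight = (confidence >= tau).float() * confidence       # down-weight uncertain voxels
    loss_a = F.cross_entropy(logits_expert_a, pseudo_label, reduction="none")
    loss_b = F.cross_entropy(logits_expert_b, pseudo_label, reduction="none")
    per_voxel = weight * (loss_a + loss_b)
    return per_voxel.sum() / weight.sum().clamp(min=1.0)
```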

Article
Computer Science and Mathematics
Computer Vision and Graphics

Gregory Yu, Ian Butler, Aaron Collins

Abstract: Subject-driven text-to-image generation presents a significant challenge: faithfully reproducing a specific subject's identity within novel, text-described scenes. Existing solutions typically involve computationally expensive model fine-tuning or less performant training-free methods. This paper introduces Content-Adaptive Grafting (CAG), a novel, efficient, and entirely training-free framework designed to achieve high subject fidelity and strong text alignment. CAG operates without modifying the underlying generative model's weights, instead leveraging intelligent noise initialization and adaptive feature fusion during inference. Our framework comprises Initial Structure Guidance (ISG), which prepares a structurally consistent starting point via an inverted collage image, and Dynamic Content Fusion (DCF), which adaptively infuses multi-scale reference features using a gated attention mechanism and a time-dependent decay strategy. Extensive experiments demonstrate that CAG significantly outperforms state-of-the-art training-free baselines in subject fidelity and text alignment, while maintaining competitive efficiency. Ablation studies and human evaluations further validate the critical contributions of ISG and DCF, affirming CAG's leading position in providing a high-quality, practical solution for subject-driven text-to-image generation.

Article
Computer Science and Mathematics
Computer Vision and Graphics

Mark Harris, Hunter Shaw, Ryan Young

Abstract: Multimodal misinformation demands robust Cross-modal Entity Consistency (CEC) verification, aligning textual entities with visual depictions. Current large vision-language models (LVLMs) struggle with fine-grained entity verification, especially in complex "contextual mismatch" scenarios, failing to capture intricate relationships or leverage auxiliary information. To address this, we propose the Entity-Aware Cross-Modal Fusion Network (EACFN), a novel architecture for deep semantic alignment and robust integration of external visual evidence. EACFN incorporates modules for entity encoding, cross-attention for reference image enhancement, and a Graph Neural Network (GNN)-based module for explicit inter-modal relational reasoning, culminating in fine-grained consistency predictions. Experiments on three annotated datasets demonstrate EACFN's superior performance, significantly outperforming state-of-the-art zero-shot LVLMs across tasks, particularly with reference images. EACFN also shows improved computational efficiency and stronger agreement with human judgments in ambiguous contexts. Our contributions include the innovative EACFN architecture, its GNN-based relational reasoning module, and effective integration of reference image information for enhanced verification robustness.

Article
Computer Science and Mathematics
Computer Vision and Graphics

Lianghao Tan, Yongjia Song, Ziyan Wen

Abstract: Accurate assessment of emotional states is critical in clinical diagnostics, yet traditional multimodal sentiment analysis often suffers from "modality laziness," where models overlook subtle micro-expressions in favor of text priors. This study proposes Lingo-Aura, a cognitive-enhanced framework based on Mistral-7B designed to align visual micro-expressions and acoustic signals with large language model (LLM) embeddings. We introduce a robust Double-MLP Projector and global mean pooling to bridge the modality gap while suppressing temporal noise and ensuring numerical stability during mixed-precision training. Crucially, the framework leverages a teacher LLM to generate meta-cognitive labels, such as reasoning mode and information stance, which are injected as explicit context to guide deep intent reasoning. Experimental results on the CMU-MOSEI dataset demonstrate that Lingo-Aura achieves a 135% improvement in emotion intensity correlation compared to text-only baselines. These findings suggest that Lingo-Aura effectively identifies discrepancies between verbal statements and internal emotional states, offering a powerful tool for mental health screening and pain assessment in non-verbal clinical populations.
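
A minimal PyTorch sketch of the projector idea named above: non-text features are mean-pooled over time and mapped into the LLM embedding space by two stacked MLPs. The hidden size and the 4096-dimensional target (typical of Mistral-7B) are assumptions for illustration.

```python
import torch
import torch.nn as nn

class DoubleMLPProjector(nn.Module):
    def __init__(self, in_dim=768, hidden=2048, llm_dim=4096):
        super().__init__()
        self.mlp1 = nn.Sequential(nn.Linear(in_dim, hidden), nn.GELU(), nn.LayerNorm(hidden))
        self.mlp2 = nn.Sequential(nn.Linear(hidden, llm_dim), nn.GELU(), nn.LayerNorm(llm_dim))

    def forward(self, frame_feats):
        # frame_feats: (B, T, in_dim) visual or acoustic features over T timesteps
        pooled = frame_feats.mean(dim=1)      # global mean pooling suppresses temporal noise
        return self.mlp2(self.mlp1(pooled))   # (B, llm_dim) embedding injected into the LLM input

proj = DoubleMLPProjector()
print(proj(torch.randn(2, 50, 768)).shape)    # torch.Size([2, 4096])
```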

Article
Computer Science and Mathematics
Computer Vision and Graphics

Gaetane Lorna N. Tchana, Damaris Belle M. Fotso, Antonio Hendricks, Christophe Bobda

Abstract: Conventional image stitching pipelines predominantly rely on homographic alignment, whose planar assumption often breaks down in dual-camera configurations capturing non-planar scenes, leading to visual artifacts such as geometric warping, spherical bulging, and structural deformation. To address these limitations, this paper presents SENA (SEamlessly NAtural), a geometry-driven image stitching approach with three complementary contributions. First, we propose a hierarchical affine-based warping strategy that combines global affine initialization, local affine refinement, and a smooth free-form deformation field regulated by seamguard adaptive smoothing. This multi-scale design preserves local shape, parallelism, and aspect ratios, thereby reducing the hallucinated distortions commonly associated with homography-based models. Second, SENA incorporates a geometry-driven adequate zone detection mechanism that identifies parallax-minimized regions directly from the disparity consistency of RANSAC-filtered feature correspondences, without relying on semantic segmentation or depth estimation. Third, building upon this adequate zone, we apply anchor-based seamline cutting and segmentation, enforcing one-to-one geometric correspondence between image pairs by construction and reducing ghosting, duplication, and smearing artifacts. Extensive experiments on challenging datasets demonstrate that SENA achieves alignment accuracy comparable to leading homography-based methods, while providing improved shape preservation, texture continuity, and overall visual realism.
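
To make the global affine initialization stage concrete, the sketch below estimates a 6-DoF affine transform on RANSAC-filtered feature correspondences with OpenCV, using ORB matching as a stand-in; the paper's local affine refinement and free-form deformation stages are not reproduced here.

```python
import cv2
import numpy as np

def global_affine_init(img_a, img_b):
    """Estimate a global affine transform from img_a to img_b and warp img_a accordingly."""
    orb = cv2.ORB_create(4000)
    ka, da = orb.detectAndCompute(cv2.cvtColor(img_a, cv2.COLOR_BGR2GRAY), None)
    kb, db = orb.detectAndCompute(cv2.cvtColor(img_b, cv2.COLOR_BGR2GRAY), None)
    matches = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True).match(da, db)
    src = np.float32([ka[m.queryIdx].pt for m in matches])
    dst = np.float32([kb[m.trainIdx].pt for m in matches])
    # Affine (6-DoF) rather than homography: preserves parallelism and aspect ratios.
    A, inliers = cv2.estimateAffine2D(src, dst, method=cv2.RANSAC, ransacReprojThreshold=3.0)
    h, w = img_b.shape[:2]
    warped = cv2.warpAffine(img_a, A, (w, h))
    return warped, A, inliers
```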

Article
Computer Science and Mathematics
Computer Vision and Graphics

Daniel Vera-Yanez, António Pereira, Nuno Rodrigues, José Pascual Molina, Arturo S. García, Antonio Fernández-Caballero

Abstract: The seemingly endless expanse of the sky might suggest that it could support a large volume of aerial traffic with minimal risk of collisions. However, mid-air collisions do occur and are a significant concern for aviation safety. Pilots are trained to scan the sky for other aircraft and maneuver to avoid such accidents, known as the basic see-and-avoid principle. While this method has proven effective, it is not infallible: human vision has limitations, and pilot performance can be degraded by fatigue or distraction. Despite progress in electronic conspicuity (EC) systems, which effectively increase the visibility of aircraft to other airspace users, their utility as collision avoidance systems remains limited. They are recommended but not mandatory in uncontrolled airspace, where most mid-air accidents occur, so other aircraft may not carry a compatible device or may have it switched off. Moreover, their use carries some risks, such as over-focusing on them. In response to these concerns, this paper presents evidence for the utility of an optical flow-based obstacle detection system that complements both the pilot and electronic conspicuity in collision avoidance: unlike the pilot, it does not tire, and unlike EC devices, it does not depend on the equipment carried by other aircraft. Using a mid-air collision simulator in various test environments, the investigation demonstrates that the proposed optical flow-based obstacle detection system meets or exceeds the critical minimum time required for pilots to detect and react to flying obstacles.
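
A hedged sketch of how dense optical flow can flag an object on a collision course: an approaching obstacle expands in the image, so its local flow field has positive divergence. The Farneback parameters and thresholds below are illustrative, not those of the study.

```python
import cv2
import numpy as np

def expansion_map(prev_gray, curr_gray):
    """Divergence of the dense optical flow field; positive values indicate looming regions."""
    flow = cv2.calcOpticalFlowFarneback(prev_gray, curr_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    dudx = np.gradient(flow[..., 0], axis=1)
    dvdy = np.gradient(flow[..., 1], axis=0)
    return dudx + dvdy

def detect_looming(prev_gray, curr_gray, div_threshold=0.05, min_area=50):
    """Return stats of connected regions whose expansion exceeds the threshold."""
    div = expansion_map(prev_gray, curr_gray)
    mask = (div > div_threshold).astype(np.uint8)
    n, labels, stats, _ = cv2.connectedComponentsWithStats(mask)
    return [stats[i] for i in range(1, n) if stats[i, cv2.CC_STAT_AREA] >= min_area]
```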

Article
Computer Science and Mathematics
Computer Vision and Graphics

Jingyuan Zhu, Anbang Chen, Bowen Wang, Sining Huang, Yukun Song, Yixiao Kang

Abstract: This paper presents a systematic comparison of neural architectures for English-to-Spanish machine translation. We implement and evaluate five model configurations ranging from vanilla LSTM encoder-decoders to Transformer models with pretrained embeddings. Using the OPUS-100 corpus (1M training pairs) and FLORES+ benchmark (2,009 test pairs), we evaluate translation quality using BLEU, chrF, and COMET metrics. Our best Transformer model achieves a BLEU score of 20.26, closing approximately 65% of the performance gap between our strongest LSTM baseline (BLEU 10.66) and the state-of-the-art Helsinki-NLP model (BLEU 26.60). We analyze the impact of architectural choices, data scale, and pretrained embeddings on translation quality, providing insights into the trade-offs between model complexity and performance.
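
For readers reproducing this kind of evaluation, corpus-level BLEU and chrF can be computed with sacrebleu as sketched below (COMET, being model-based, is omitted); the example sentences are placeholders, not data from the study.

```python
import sacrebleu

hypotheses = ["el gato está en la alfombra"]            # system outputs, one per test sentence
references = [["el gato está sobre la alfombra"]]        # one reference corpus, parallel to hypotheses

bleu = sacrebleu.corpus_bleu(hypotheses, references)
chrf = sacrebleu.corpus_chrf(hypotheses, references)
print(f"BLEU = {bleu.score:.2f}, chrF = {chrf.score:.2f}")
```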

Article
Computer Science and Mathematics
Computer Vision and Graphics

Asri Mulyani, Muljono, Purwanto, Moch Arief Soeleman

Abstract: Diabetic retinopathy (DR) is a leading cause of vision impairment and permanent blindness worldwide, requiring an accurate and automated system to classify its multi-grade severity to ensure timely patient intervention. However, standard Convolutional Neural Networks (CNNs) often struggle to capture the fine, high-frequency microvascular patterns critical for diagnosis. This study proposes a Robust Intelligent CNN Model (RICNN) designed to improve multi-level DR classification by integrating Gabor-based feature extraction with deep learning. The model also incorporates SMOTE (Synthetic Minority Oversampling Technique) balancing and Adam optimization for efficient convergence. The proposed RICNN was evaluated on the Messidor dataset (1,200 images) across four severity levels: Mild, Moderate, Severe, and Proliferative DR. The results showed that RICNN achieved superior performance with 89% accuracy, 88.75% precision, 89% recall, and 89% F1-score. The model also demonstrated high robustness in identifying advanced stages, achieving AUCs of 97% for Severe DR and 99% for Proliferative DR. Comparative analysis confirms that texture-aware Gabor enhancement significantly outperforms Local Binary Pattern (LBP) and Color Histogram approaches. These findings indicate that the proposed RICNN provides a reliable and intelligent foundation for clinical decision support systems, potentially reducing diagnostic errors and preventing vision loss in high-risk populations.
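
The sketch below illustrates the two preprocessing ideas named above: a Gabor filter bank to emphasize fine microvascular texture, and SMOTE to balance severity classes before training. The kernel parameters and the simple filter-response features are illustrative assumptions, not the paper's configuration.

```python
import cv2
import numpy as np
from imblearn.over_sampling import SMOTE

def gabor_features(gray, n_orientations=8, ksize=21, sigma=4.0, lambd=10.0, gamma=0.5):
    """Mean/std of Gabor filter responses over several orientations as a texture descriptor."""
    feats = []
    for i in range(n_orientations):
        theta = i * np.pi / n_orientations
        kern = cv2.getGaborKernel((ksize, ksize), sigma, theta, lambd, gamma, 0, ktype=cv2.CV_32F)
        resp = cv2.filter2D(gray.astype(np.float32), cv2.CV_32F, kern)
        feats.extend([resp.mean(), resp.std()])
    return np.array(feats)

# X would hold per-image Gabor features; y the severity grades (0..3). Placeholder data here.
X, y = np.random.rand(200, 16), np.random.randint(0, 4, 200)
X_bal, y_bal = SMOTE(random_state=0).fit_resample(X, y)     # oversample minority grades
print(X_bal.shape, np.bincount(y_bal))
```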

Article
Computer Science and Mathematics
Computer Vision and Graphics

Xu Ji, Kai Song, Lianzheng Sun, Haolin Lu, Hengyuan Zhang, Yiran Feng

Abstract: To overcome the low accuracy of conventional methods for estimating liquid volume and food nutrient content in bowl-type tableware, as well as the tool dependence and time-consuming nature of manual measurements, this study proposes an integrated approach that combines geometric reconstruction with deep learning–based segmentation. After a one-time camera calibration, only a frontal and a top-down image of a bowl are required. The pipeline automatically extracts key geometric information, including rim diameter, base diameter, bowl height, and the inner-wall profile, to complete geometric modeling and capacity computation. The estimated parameters are stored in a reusable bowl database, enabling repeated predictions of liquid volume and food nutrient content at different fill heights. We further propose Bowl Thick Net to predict bowl wall thickness with millimeter-level accuracy. In addition, we developed a Geometry-aware Feature Pyramid Network (GFPN) module and integrated it into an improved Mask R-CNN framework to enable precise segmentation of bowl contours. By integrating the contour mask with the predicted bowl wall thickness, precise geometric parameters for capacity estimation can be obtained. Liquid volume is then predicted using the geometric relationship of the liquid or food surface, while food nutrient content is estimated by coupling predicted food weight with a nutritional composition database. Experiments demonstrate an arithmetic mean error of −3.03% for bowl capacity estimation, a mean liquid-volume prediction error of 9.24%, and a mean nutrient-content (by weight) prediction error of 11.49% across eight food categories.
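
A worked sketch of the capacity computation implied above: with the inner-wall profile expressed as a radius r(h) from base to rim, the bowl can be treated as a solid of revolution and the volume up to a fill height approximated as pi times the integral of r(h)^2 dh. The example profile below is synthetic, not a measured bowl.

```python
import numpy as np

def capacity_ml(heights_cm, inner_radii_cm, fill_height_cm=None):
    """Volume (in millilitres, i.e. cm^3) enclosed by the inner wall up to fill_height_cm."""
    h = np.asarray(heights_cm, dtype=float)
    r = np.asarray(inner_radii_cm, dtype=float)
    if fill_height_cm is not None:
        keep = h <= fill_height_cm
        h, r = h[keep], r[keep]
    # Trapezoidal integration of pi * r(h)^2 over height.
    return float(np.pi * np.sum(0.5 * (r[1:] ** 2 + r[:-1] ** 2) * np.diff(h)))

h = np.linspace(0.0, 6.0, 61)            # bowl height: 6 cm
r = 3.0 + 3.5 * (h / 6.0) ** 0.8         # base radius 3 cm widening to 6.5 cm at the rim
print(f"full capacity ~ {capacity_ml(h, r):.0f} ml, half-filled ~ {capacity_ml(h, r, 3.0):.0f} ml")
```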

Short Note
Computer Science and Mathematics
Computer Vision and Graphics

Marcus Elvain, Howard Pellorin

Abstract: Generative models have achieved remarkable success in producing realistic images and short video clips, but existing approaches struggle to maintain persistent world coherence over long durations and across multiple modalities. We propose Multimodal Supervisory Graphs (MSG), a novel framework for world modeling that unifies geometry (3D structure), identity (consistent entities), physics (dynamic behavior), and interaction (user/agent inputs) in a single abstract representation. MSG represents the environment as a dynamic latent graph, factorized by these four aspects and trained with cross-modal supervision from visual (RGB-D), pose, and audio streams. This unified world abstraction enables generative AI systems to maintain consistent scene layouts, preserve object identities over time, obey physical laws, and incorporate interactive user prompts, all within one model. In our experiments, MSG demonstrates superior long-term coherence and cross-modal consistency compared to state-of-the-art generative video baselines, effectively bridging the gap between powerful short-term video generation and persistent, interactive world modeling. Our framework outperforms prior methods on metrics of identity consistency, physical plausibility, and multi-view geometry alignment, enabling new applications in extended reality and autonomous agent simulation.

Short Note
Computer Science and Mathematics
Computer Vision and Graphics

Brennan Sloane, Landon Vireo, Keaton Farrow

Abstract: High-fidelity telepresence requires the reconstruction of photorealistic 3D avatars in real-time to facilitate immersive interaction. Current solutions face a dichotomy: they are either computationally expensive multi-view systems (e.g., Codec Avatars) or lightweight mesh-based approximations that suffer from the "uncanny valley" effect due to a lack of high-frequency detail. In this paper, we propose Mono-Splat, a novel framework for reconstructing high-fidelity, animatable human avatars from a single monocular webcam video stream. Our method leverages 3D Gaussian Splatting (3DGS) combined with a lightweight deformation field driven by standard 2D facial landmarks. Unlike Neural Radiance Fields (NeRFs), which typically suffer from slow inference speeds due to volumetric ray-marching, our explicit Gaussian representation enables rendering at >45 FPS on consumer hardware. We further introduce a landmark-guided initialization strategy to mitigate the depth ambiguity inherent in monocular footage. Extensive experiments demonstrate that our approach outperforms existing NeRF-based and mesh-based methods in both rendering quality (PSNR/SSIM) and inference speed, presenting a viable, accessible pathway for next-generation VR telepresence.

Short Note
Computer Science and Mathematics
Computer Vision and Graphics

Landon Vireo, Brennan Sloane, Arden Piercefield, Greer Holloway, Keaton Farrow

Abstract: Diminished Reality (DR)—the ability to visually remove real-world objects from a live Augmented Reality (AR) feed—is essential for reducing cognitive load and decluttering workspaces. However, existing techniques face a critical challenge: removing an object creates a visual void ("hole") that must be filled with a plausible background. Traditional 2D inpainting methods lack temporal consistency, causing the background to flicker or slide as the user moves. In this paper, we propose Clean-Splat, a novel framework for real-time, multi-view consistent object removal. We leverage 3D Gaussian Splatting (3DGS) for scene representation and integrate a View-Consistent Diffusion Prior to hallucinate occluded background geometry and texture. Unlike previous NeRF-based inpainting which is prohibitively slow, our method updates the 3D scene representation in near real-time, enabling rendering at >30 FPS on consumer hardware. Extensive experiments on real-world cluttered scenes demonstrate that Clean-Splat achieves state-of-the-art perceptual quality (LPIPS) and temporal stability compared to existing video inpainting approaches.

Article
Computer Science and Mathematics
Computer Vision and Graphics

Zeren Gu, Jialei Tan

Abstract: Human action recognition (HAR) remains challenging, particularly for skeleton-based methods due to issues like domain shift and limited deep semantic understanding. Traditional Graph Convolutional Networks often struggle with effective cross-domain adaptation and inferring complex semantic relationships. To address these limitations, we propose CD-SEAFNet, a novel framework meticulously designed to significantly enhance robustness and cross-domain generalization for skeleton-based action recognition. CD-SEAFNet integrates three core modules: an Adaptive Spatio-Temporal Graph Feature Extractor that dynamically learns and adjusts graph structures to capture nuanced spatio-temporal dynamics; a Semantic Context Encoder and Fusion Module which leverages natural language descriptions to inject high-level semantic understanding via a cross-modal adaptive fusion mechanism; and a Domain Alignment and Classification Module that employs adversarial training and contrastive learning to generate domain-invariant, yet discriminative, features. Extensive experiments on the challenging NTU RGB+D datasets demonstrate that CD-SEAFNet consistently outperforms state-of-the-art methods across various evaluation protocols, unequivocally validating the synergistic effectiveness of our adaptive graph structure, semantic enhancement, and robust domain alignment strategies.

Short Note
Computer Science and Mathematics
Computer Vision and Graphics

Brennan Sloane, Landon Vireo, Greer Holloway, Keaton Farrow

Abstract: The demand for photorealistic Virtual Reality (VR) content is outpacing the ability of creators to model environments manually. While consumer 360-degree cameras are ubiquitous, they traditionally offer only 3-Degrees-of-Freedom (3-DoF) experiences, where users can look around but cannot physically move through the space. This restriction breaks immersion and frequently induces vestibular mismatch (motion sickness). In this paper, we propose InstantVR, a novel pipeline that automatically converts a sparse sequence of panoramic (equirectangular) images into a fully volumetric, walkable 6-DoF environment. We leverage 3D Gaussian Splatting (3DGS) adapted for spherical projection models to reconstruct high-fidelity scenes in minutes. Furthermore, we introduce a density-based Navigability Analysis module that automatically extracts a collision mesh and floor plan from the reconstructed point cloud, allowing users to physically walk within the generated scene without passing through virtual geometry. Experimental results demonstrate that InstantVR renders at >98 FPS per eye on consumer VR hardware, significantly outperforming NeRF-based alternatives in both training speed (8 mins vs. 4 hours) and rendering latency.
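
A hedged sketch of a density-based navigability analysis of the kind described above: reconstructed points within a body-height band are projected onto a 2D grid, and cells with low obstacle density are marked walkable. The grid size, thresholds, and placeholder point cloud are assumptions for illustration only.

```python
import numpy as np

def navigability_grid(points, floor_z=0.0, head_z=1.8, cell=0.10, obstacle_thresh=20):
    """points: (N, 3) array in metres, z up. Returns (walkable_mask, x_edges, y_edges)."""
    band = points[(points[:, 2] > floor_z + 0.05) & (points[:, 2] < head_z)]  # above-floor band
    x_edges = np.arange(points[:, 0].min(), points[:, 0].max() + cell, cell)
    y_edges = np.arange(points[:, 1].min(), points[:, 1].max() + cell, cell)
    density, _, _ = np.histogram2d(band[:, 0], band[:, 1], bins=[x_edges, y_edges])
    walkable = density < obstacle_thresh      # few points above floor height -> free to walk
    return walkable, x_edges, y_edges

pts = np.random.rand(50000, 3) * [6.0, 4.0, 2.5]   # placeholder reconstruction of a 6 x 4 m room
mask, xe, ye = navigability_grid(pts)
print(mask.shape, mask.mean())
```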

Article
Computer Science and Mathematics
Computer Vision and Graphics

Daniel Levi, Yael Friedman, Eitan Cohen

Abstract: Detecting small and low-contrast brain lesions is difficult for conventional foundation models, which often prioritize larger structures and overlook micro-lesions. MicroLesion-SAM enhances SAM by integrating structural centerline priors and scale-adaptive refinement to increase sensitivity to lesions below 20 voxels. The model uses probabilistic skeleton cues to guide attention toward lesion cores while preserving high-frequency features through resolution-adaptive enhancement. Evaluated on WMH2020 (60 subjects) and ISLES2018 (103 subjects), MicroLesion-SAM increases small-lesion recall from 0.684 to 0.812 (+18.7%) and raises global Dice to 0.902 (+4.5% over SAM-Med2D). HD95 decreases from 21.9 mm to 14.6 mm (−33.3%), and lesion-wise F1 improves by 11.2%. Cross-dataset validation on an independent clinical cohort of 72 subjects shows a 12.1% improvement in micro-lesion detection. Ablation experiments confirm that the centerline priors contribute a 7.3% Dice gain, while heatmap analysis shows more accurate localization of micro-vascular abnormalities.
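
A sketch of a lesion-wise small-lesion recall metric of the kind referenced above: connected components in the ground truth at or below a size threshold count as detected if any of their voxels overlap the prediction. The 20-voxel threshold follows the abstract; the overlap rule is an illustrative assumption.

```python
import numpy as np
from scipy import ndimage

def small_lesion_recall(pred_mask, gt_mask, max_voxels=20):
    """pred_mask, gt_mask: binary 3D arrays. Returns recall over GT lesions <= max_voxels."""
    labels, n = ndimage.label(gt_mask)
    detected, total = 0, 0
    for lesion_id in range(1, n + 1):
        lesion = labels == lesion_id
        if lesion.sum() > max_voxels:
            continue                                  # only score micro-lesions
        total += 1
        if np.logical_and(lesion, pred_mask).any():   # any overlap counts as a detection
            detected += 1
    return detected / total if total else float("nan")
```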

Article
Computer Science and Mathematics
Computer Vision and Graphics

Hiroshi Tanaka, Keiko Matsumoto, Ryohei Nakamura, Ayumi Sato

Abstract: Monitoring glioma progression using longitudinal MRI is hindered by inconsistent follow-up intervals, variations in imaging quality, and non-linear morphological changes. DiffProto-Net addresses these challenges by learning temporal lesion prototypes refined through a diffusion-based reconstruction process. The network stabilizes temporal representations by encouraging smooth evolution patterns and reducing noise-induced variability. It also models lesion progression trajectories, producing temporally coherent segmentation masks across all recorded timepoints. Experiments conducted on 830 longitudinal MRI sequences (3–6 timepoints per patient) show that DiffProto-Net achieves a TC-Dice of 0.874, outperforming UNet-LSTM (0.785, +11.3%) and ST-UNet (0.812, +7.6%). Temporal volume deviation decreases from 14.2% to 6.8%, indicating superior stability in tracking tumor evolution. For progression prediction, the model achieves an AUC of 0.903, representing a 12.5% improvement over 3D-ResNet. Under synthetic timing perturbations, temporal drift is reduced by 15.1%, and ablation experiments reveal that diffusion-based prototype refinement contributes nearly 10% Dice improvement.
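
A hedged sketch of two longitudinal metrics of the kind reported above: per-timepoint Dice averaged over the series, and the relative deviation of predicted lesion volume from the reference volume across timepoints. The exact definitions used in the paper may differ.

```python
import numpy as np

def dice(a, b, eps=1e-8):
    """Dice coefficient between two binary masks."""
    a, b = a.astype(bool), b.astype(bool)
    return (2.0 * np.logical_and(a, b).sum() + eps) / (a.sum() + b.sum() + eps)

def temporal_metrics(pred_series, gt_series):
    """pred_series, gt_series: lists of binary 3D masks, one per follow-up timepoint."""
    dices = [dice(p, g) for p, g in zip(pred_series, gt_series)]
    vol_dev = [abs(p.sum() - g.sum()) / max(g.sum(), 1) for p, g in zip(pred_series, gt_series)]
    return float(np.mean(dices)), float(np.mean(vol_dev))   # (mean Dice, mean volume deviation)
```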
