Computer Science and Mathematics

Concept Paper
Computer Science and Mathematics
Computer Vision and Graphics

Gurpreet Singh, Purva Mundada

Abstract: Sign language translation (SLT) aims to convert sign language videos into spoken language text, serving as a critical bridge for communication between the Deaf and hearing communities. While recent advances in Multimodal Large Language Models (MLLMs) have shown promising results in gloss-free SLT, existing methods typically rely on single-modality visual features, failing to fully exploit the complementary nature of appearance and structural cues inherent in sign language. In this architectural proposition paper, we introduce SignFuse, a novel dual-stream cross-modal fusion framework that synergistically combines CNN-based visual features with Graph Convolutional Network (GCN)-based skeletal features for gloss-free sign language translation. Our framework introduces three key innovations: (1) a Cross-Modal Fusion Attention (CMFA) module that performs bidirectional cross-attention between visual and skeletal modalities to produce enriched multimodal representations; (2) a Hierarchical Temporal Aggregation (HTA) mechanism that captures sign language dynamics at multiple temporal scales—frame-level, segment-level, and sequence-level; and (3) a Progressive Multi-Stage Training blueprint that systematically aligns visual-skeletal features with the LLM’s linguistic space through contrastive pre-training, feature alignment, and LoRA-based fine-tuning. We provide the complete mathematical formulation, detailed architectural specifications, and a fully implemented PyTorch codebase. As the computational barriers to training MLLMs remain high, we formalize the experimental methodology required to validate this framework on standard benchmarks (PHOENIX-14T, CSL-Daily, How2Sign) and extend an open invitation to the broader research community to conduct empirical validation and advance this architectural paradigm through collaboration. 
This work is presented as a concept and architectural framework paper, aiming to establish a theoretical foundation and encourage future empirical validation by the research community.
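The CMFA module described above, bidirectional cross-attention between the visual and skeletal streams, can be sketched in miniature. This is an illustrative single-head, NumPy-only sketch under assumed shapes (T frames, d-dimensional features per stream); the names `cross_attention` and `bidirectional_fuse` are hypothetical and not taken from the SignFuse codebase.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(q_feats, kv_feats):
    """Single-head cross-attention: one stream's frames query the other stream."""
    d = q_feats.shape[-1]
    attn = softmax(q_feats @ kv_feats.T / np.sqrt(d), axis=-1)
    return attn @ kv_feats

def bidirectional_fuse(visual, skeletal):
    """CMFA-style fusion sketch: each modality attends to the other, and the
    two enriched streams are concatenated into one per-frame representation."""
    v_enriched = visual + cross_attention(visual, skeletal)    # visual queries skeletal
    s_enriched = skeletal + cross_attention(skeletal, visual)  # skeletal queries visual
    return np.concatenate([v_enriched, s_enriched], axis=-1)

rng = np.random.default_rng(0)
T, d = 8, 16                       # 8 frames, 16-dim features per stream (toy sizes)
visual = rng.standard_normal((T, d))
skeletal = rng.standard_normal((T, d))
fused = bidirectional_fuse(visual, skeletal)
print(fused.shape)                 # (8, 32): per-frame multimodal representation
```

The residual connections keep each stream's own features intact while the cross-attention term injects the complementary modality, matching the "enriched multimodal representations" the abstract describes.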

Article
Computer Science and Mathematics
Computer Vision and Graphics

Talha Laique, Mikkel Gunnes, Ole Folkedal, Jonatan Nilsson, Evelina Andrea Losneslokken Green, Hannah Normann Gundersen, Øyvind Øverlia, Habib Ullah

Abstract: Intensive salmon farming is associated with high mortality rates, highlighting the need for new welfare indicators that can detect adverse conditions earlier and less invasively than many current approaches. Existing animal-based indicators used in the industry typically depend on subjective scoring and provide information mostly after welfare problems have already developed, such as emaciation, wounds, or scale loss. Preliminary data and ongoing investigation suggest that melanin-based skin pigmentation may change dynamically with stress and condition in salmonid fishes. In this study, we present a semi-automated methodology for assessing changes in the grayscale intensity of melanin-based skin spots within the operculum region of adult Atlantic salmon (Salmo salar) kept in seawater. The pipeline combines computer vision models to detect the operculum, segment individual spots, and extract grayscale-based features for spot-level analysis over time. The method was applied to out-of-water images collected before and after exposure to a confinement episode. The results showed an overall shift in grayscale intensity from black toward faded pigmentation after the challenge, although responses varied among individuals. These findings indicate that the proposed methodology can detect temporal changes in opercular melanin-based spots under the applied experimental conditions. We therefore present this work as proof of principle for using computer vision to quantify changes in melanin-based skin spots as a potentially useful, non-invasive indicator of stress and welfare in Atlantic salmon.
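At its core, the grayscale feature extraction above reduces, for a single segmented spot, to averaging pixel intensities under a mask and comparing that mean across time points. A minimal pure-Python sketch with invented toy values (the mask, intensities, and function name are illustrative, not from the described pipeline):

```python
def spot_mean_intensity(gray, mask):
    """Mean grayscale value over masked spot pixels (0 = black, 255 = white)."""
    vals = [gray[r][c]
            for r in range(len(gray))
            for c in range(len(gray[0]))
            if mask[r][c]]
    return sum(vals) / len(vals)

# Toy 4x4 "operculum crop": the same spot imaged before and after a stressor.
mask   = [[0, 1, 1, 0], [0, 1, 1, 0], [0, 0, 0, 0], [0, 0, 0, 0]]
before = [[200,  30,  35, 200], [200,  28,  32, 200],
          [200, 200, 200, 200], [200, 200, 200, 200]]
after  = [[200,  90,  95, 200], [200,  88,  92, 200],
          [200, 200, 200, 200], [200, 200, 200, 200]]

delta = spot_mean_intensity(after, mask) - spot_mean_intensity(before, mask)
print(f"intensity shift: {delta:+.2f}")   # intensity shift: +60.00
```

A positive shift (toward white) corresponds to the pigmentation fading reported after the confinement challenge.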

Article
Computer Science and Mathematics
Computer Vision and Graphics

Ali Emre Gök, Mustafa Yurdakul, Şakir Taşdemir

Abstract: In medical image analysis, modeling local and global features in high-resolution data presents a significant challenge. While the widely used Convolutional Neural Networks (CNNs) struggle to capture long-range dependencies between distant pixels, the high computational cost (O(N²)) of Vision Transformer (ViT) architectures causes bottlenecks in clinical applications. This study investigates the integration of Mamba models, which were developed to overcome these limitations and have linear complexity, into medical image analysis, along with recent studies in the literature. This architecture, grounded in continuous-time control theory, dynamically adapts to hardware resolution. Mamba models effectively retain anatomical structures and lesions in memory while filtering out irrelevant noise through their selective mechanism. Moreover, bidirectional scanning (Vision Mamba) and cross-scan (VMamba) methods are used to prevent the loss of spatial information and to overcome the necessity of processing one-dimensional data due to the language-based structure of the models. The reviewed literature can be categorized under three main headings: hybrid models, efficient and lightweight designs, and spatial representation studies. Comprehensive analyses of the literature indicate that Mamba models deliver significantly higher inference speed and memory efficiency compared to traditional CNN and ViT approaches, owing to their hardware-aware design and linear computational complexity. In conclusion, the Mamba architecture has the potential to become a next-generation standard that demonstrates high performance while maintaining global contextual integrity across diverse medical fields such as radiology, ophthalmology, and dermatology.

Article
Computer Science and Mathematics
Computer Vision and Graphics

Yu Shang, Yinzhou Tang, Xin Zhang, Shengyuan Wang, Yuwei Yan, Honglin Zhang, Zhiheng Zheng, Jie Zhao, Jie Feng, Chen Gao, +2 authors

Abstract: World models have emerged as a pivotal research direction, with recent breakthroughs in generative AI underscoring their potential for advancing artificial general intelligence. For embodied AI, world models are critical for enabling robots to effectively understand, interact with, and make informed decisions in real-world physical environments. This survey systematically reviews recent progress in embodied world models, under a novel technical taxonomy. We hierarchically organize the field by model architectures, training methodologies, application scenarios, and evaluation approaches, thus offering researchers a clear technical roadmap. We first thoroughly discuss vision-based generative world models and latent space world models, along with their corresponding training paradigms. We then explore the multifaceted roles of embodied world models in robotic applications, from functioning as cloud-based simulation environments to on-device agent brains. Additionally, we summarize important evaluation dimensions for benchmarking embodied world models. Finally, we outline key challenges and provide insights into promising future research directions within this crucial domain. We summarize the representative works discussed in this survey at https://github.com/tsinghua-fib-lab/Awesome-Embodied-World-Model.

Article
Computer Science and Mathematics
Computer Vision and Graphics

Min Pang, Jichao Jiao, Yingjian Zhang

Abstract: Three-dimensional (3D) shape recognition is a fundamental task in computer vision, where view-based methods have recently achieved state-of-the-art performance. However, effectively capturing and exploiting the rich geometric correspondences between different views remains a key challenge, as such information is crucial for accurate shape representation. Existing methods often fall short in explicitly modeling these structured correlations, which limits their ability to fully leverage discriminative shape information. To address this limitation, we propose a novel View-based Graph Convolution and Sampling Fusion Network (View-GFN). View-GFN employs a hierarchical architecture that progressively coarsens the view-graph to learn multi-scale features. In this structure, views are treated as graph nodes, and a predefined-value strategy is introduced to initialize the adjacency matrix (AM) for constructing initial node correlations. For effective graph coarsening, we develop a novel view down-sampling method based on a cluster assignment matrix. Furthermore, a Graph Convolution and Sampling Fusion (CSF) module is designed to seamlessly integrate deep feature embeddings with the topological information derived from view down-sampling. Extensive experiments on benchmark datasets, including ModelNet40 and RGB-D, demonstrate that View-GFN achieves a superior recognition accuracy of 97.8%, surpassing previous methods while reducing the number of model parameters by nearly 50%. These results validate the superiority of our hierarchical fusion strategy in capturing multi-view geometric information both effectively and efficiently.

Article
Computer Science and Mathematics
Computer Vision and Graphics

Xiaoming Zhang, Rundong Zhuang

Abstract: Reliable test-tube detection on clinical conveyor lines remains difficult when tubes are densely packed, placed irregularly, weakly illuminated, partially blurred by robot vibration, and contaminated by glare from glass or PET surfaces. These conditions erode the short-axis boundary cues and faint graduation marks that slender tubes depend on. We therefore build WDA-TNET on YOLOv11 and target the failure modes at four points of the pipeline. First, WGSR restores blurred regions selectively based on wavelet energy, avoiding over-sharpening specular areas. Second, GSCIM suppresses glare-dominated channel responses in the backbone through direction-aware pooling and cross-channel interaction, retaining weak structural cues like liquid-level edges. Third, DCPAF separates height and width encoding in the neck, dynamically balancing long-axis context and short-axis localization suitable for elongated targets. Finally, ATSS and MPDIoU stabilize supervision when positives are sparse and boxes overlap only weakly. We evaluated our model on the newly constructed Complex Test Tube (CTT) dataset containing 11,955 images and 81,044 instances. WDA-TNET achieves 94.1% precision and 79.1% mAP50:95, improving mAP50:95 by 3.6 percentage points over YOLOv11. On the transparent-container HeinSight4 dataset, the model attains 95.2% mAP50:95, proving robust cross-domain generalization.

Dataset
Computer Science and Mathematics
Computer Vision and Graphics

Igor Garcia-Atutxa, Hodei Calvo-Soraluze, Francisca Villanueva-Flores

Abstract: Open, well-documented datasets are essential for the reproducible development of vision systems for urban utility management. This Data Descriptor presents a curated RGB object-detection benchmark of four classes associated with electrical distribution and street-level utility assets: Inspection Chamber, Overhead-to-Underground Transition, General Protection Box, and Transformer Substation. The public release contains 997 valid image-label pairs partitioned into 698 training, 150 validation, and 149 test images. Images were acquired during 2019 in multiple localities across Spain, predominantly with a mobile phone and, in occasional cases, using Google Maps as a complementary visual source, and were manually annotated with LabelImg before export to YOLO format. During curation, four invalid image-label pairs were removed because at least one YOLO bounding box exceeded the normalized image domain. The benchmark contains 1,939 object instances, with marked class imbalance: General Protection Box accounts for 50.2% of objects whereas Transformer Substation represents 4.7%. Images are heterogeneous in size and viewpoint, ranging from 90 × 170 to 4160 × 4032 pixels, with a median resolution of 619 × 544 pixels and a median of two annotated objects per image. The public GitHub release is organized into images/, labels/, and metadata/ directories; metadata stores split definitions, classes.txt, data.yaml, inventory information, annotation schema documentation, and diagnostic summary figures. Beyond detector benchmarking, the dataset can support scalable mapping of visible distribution-grid assets, with potential value for smart-city digital twins and data-informed EV charging deployment.
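The curation check described above, discarding image-label pairs whose YOLO bounding boxes exceed the normalized image domain, can be sketched directly. This assumes the standard YOLO label line format `class cx cy w h` with center coordinates and sizes normalized to [0, 1]; the function names are illustrative:

```python
def yolo_box_valid(line):
    """Check one YOLO label line ('class cx cy w h', normalized coordinates).
    A box is invalid if any edge falls outside the [0, 1] image domain."""
    _cls, cx, cy, w, h = line.split()
    cx, cy, w, h = map(float, (cx, cy, w, h))
    return (0.0 <= cx - w / 2 and cx + w / 2 <= 1.0
            and 0.0 <= cy - h / 2 and cy + h / 2 <= 1.0)

def label_file_valid(lines):
    """A pair is kept only if every box in its label file is in-domain."""
    return all(yolo_box_valid(line) for line in lines if line.strip())

print(label_file_valid(["2 0.50 0.50 0.20 0.30"]))   # True
print(label_file_valid(["0 0.95 0.50 0.20 0.10"]))   # False: right edge at 1.05
```

Running such a check over all label files reproduces the kind of screening that removed the four invalid pairs from the release.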

Article
Computer Science and Mathematics
Computer Vision and Graphics

Hangbiao Li, Haojun Mo, Xing Li, Tao Fang, Sikun Liu, Shuzhen Yu, Zhibo Rao

Abstract: Stereo matching has witnessed rapid advances on curated benchmarks, yet deploying models in unconstrained real-world environments remains a fundamental challenge. This paper presents a sparse self-prompt guided network (SSPGNet) for stereo matching with strong generalization across diverse environments. Our core innovation lies in a sparse self-prompt guidance mechanism: 1) a sparse disparity map, used as a prompt, is self-estimated from visual foundation model features via cost aggregation; and 2) the sparse disparity is progressively refined into dense disparity maps through cross-attention-based stereo feature interaction, enabling sparse-to-dense disparity prediction. Additionally, we collected a diverse set of indoor and outdoor stereo pairs using a ZED 2 camera to assess the real-world performance of our model. Extensive experiments demonstrate that the proposed sparse-to-dense prompt mechanism not only preserves the semantic awareness of visual foundation models but also enhances stereo correspondence reasoning, achieving strong performance on public benchmarks and our in-the-wild dataset. These results highlight the potential of SSPGNet for direct deployment in real-world stereo perception systems. The code and data will be made publicly available upon publication.

Article
Computer Science and Mathematics
Computer Vision and Graphics

Nizamuddin Maitlo, Nooruddin Noonari, Fayaz Ahmed, Afifa Hussain

Abstract: We present Color-in-Context, a dataset of 12,086 photographs annotated along two complementary dimensions: color (Black, Blue, Gray, Orange, Pink, Purple, Skyblue, White, Yellow) and illumination (fluorescentLight, indoor, indoorNight, sunLight). The dataset is organized into 36 joint categories (9 colors × 4 illumination conditions) using a consistent folder hierarchy and normalized labels. We provide summary counts across colors, illuminations, and selected joint buckets, and an optional manifest file to support deterministic indexing and integrity checking. This Data Descriptor documents dataset construction, label normalization, duplicate screening, and file-integrity checks, and provides usage guidance for split generation and reporting under varied illumination.
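The optional manifest for deterministic indexing and integrity checking can be built with stdlib hashing alone. A hedged sketch, assuming the manifest maps relative file paths to SHA-256 digests (the released manifest's exact schema and folder names may differ; the paths and bytes below are invented):

```python
import hashlib
import json

def file_digest(data: bytes) -> str:
    """SHA-256 hex digest used as a stable per-file fingerprint."""
    return hashlib.sha256(data).hexdigest()

def build_manifest(files: dict) -> str:
    """files maps relative path -> raw bytes. Sorting the keys gives
    deterministic indexing; re-hashing on load detects silent corruption."""
    entries = {path: file_digest(data) for path, data in sorted(files.items())}
    return json.dumps(entries, indent=2, sort_keys=True)

demo = {
    "Black_fluorescentLight/img_0001.jpg": b"...fake jpeg bytes...",
    "Black_fluorescentLight/img_0002.jpg": b"...more fake bytes...",
}
manifest = build_manifest(demo)
print(len(json.loads(manifest)))   # 2 entries, one per file
```

On load, recomputing each digest and comparing against the manifest verifies that no file was dropped, duplicated, or modified since release.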

Review
Computer Science and Mathematics
Computer Vision and Graphics

Ramona Kühlechner

Abstract: Precise segmentation of defects is a key component of industrial quality control. This paper presents a comprehensive overview of contemporary methods utilising convolutional neural networks that have demonstrated practical efficacy. Depending on the application, semantic, instance-based, panoptic and hybrid segmentation methods are used to reliably detect material defects. Finally, prospects for industrial use are discussed, including the optimisation of hybrid methods, real-time capability and integration into existing production processes to ensure efficient, robust and practical defect detection.

Article
Computer Science and Mathematics
Computer Vision and Graphics

Rongchang Lu, Yunzhi Jiang, Bingcheng Liao, Conghan Yue, Xin Hai, Guoxin Chen

Abstract: Real-time 2D imagery super-resolution (SR) in UAV remote sensing faces significant speed and resource-consumption bottlenecks during large-scale processing. To overcome this, we propose Semantic Injection State Modeling for Super-Resolution (SIMSR), an ultra-lightweight architecture that integrates land-cover semantics into a linear state-space model. This integration mitigates state forgetting inherent in linear processing by linking hierarchical features to persistent semantic prototypes, enabling high-fidelity image enhancement. The model achieves a state-of-the-art PSNR of 32.9+ for 4x SR on RSSCN7 agricultural grassland imagery. Furthermore, the implementation of geographically chunked (tile-based) parallel processing simultaneously eliminates computational redundancies, yielding a 10.85x inference speedup, a 54% memory reduction, and an 8.74x faster training time. This breakthrough facilitates practical real-time SR deployment on UAV platforms, demonstrating strong efficacy for ecological monitoring applications by providing the detailed imagery essential for accurate analysis.
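For reference, the PSNR figure quoted above is the standard peak signal-to-noise ratio between a reference image and its reconstruction. A minimal pure-Python version for 8-bit grayscale images (the toy pixel values are illustrative only):

```python
import math

def psnr(ref, test, peak=255.0):
    """Peak signal-to-noise ratio in dB between two equal-size grayscale images."""
    mse = sum((r - t) ** 2
              for row_r, row_t in zip(ref, test)
              for r, t in zip(row_r, row_t)) / (len(ref) * len(ref[0]))
    return float("inf") if mse == 0 else 10 * math.log10(peak ** 2 / mse)

ref  = [[100, 110], [120, 130]]
test = [[102, 108], [121, 128]]     # small reconstruction error
print(round(psnr(ref, test), 2))    # 43.01 (dB)
```

Higher values mean lower reconstruction error; identical images yield infinite PSNR, which is why it is reported alongside perceptual metrics in SR work.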

Article
Computer Science and Mathematics
Computer Vision and Graphics

Xiaoming Zhang, Rundong Zhuang

Abstract: In unmanned pharmacy and home-care medicine management applications, reliable pillbox localization is a prerequisite for automated dispensing and grasping. However, existing detectors still perform poorly in complex environments where dense stacking, occlusion, weak illumination, and high inter-class similarity occur simultaneously. To address this problem, GSPM-YOLO is proposed as an improved detector built on the YOLOv11 framework for complex pillbox recognition, and four novel plug-and-play lightweight modules are developed: GSimConv, a lightweight dual-branch convolution module that incorporates the attention-weight calculation algorithm in HardSAM for edge-preserving feature extraction; PSCAM, for position-sensitive coordinate attention; MSAAM, a multi-scale strip-pooling module that integrates the horizontal context-aware attention-weight calculation algorithm to strengthen occluded targets; and LGPFH, for bidirectional ghost pyramid fusion. To simulate the complex operating environments of dispensing robots, we construct MBox-Complex, a dataset of 3,041 images with 8,153 annotations across 25 drug categories. Ablation experiments first validate the effectiveness of the four-module composition, with F1 rising from 0.641 to 0.714, and each module is then individually compared with advanced replacement schemes in dedicated substitution experiments to verify its effectiveness. The integrated model is then benchmarked against advanced detectors and domain-specific methods on the self-constructed MBox-Complex dataset, achieving 0.727 mAP@50 and 0.427 mAP@50-95 with 3.8M parameters and surpassing YOLOv11 by 7.1 and 4.0 percentage points and YOLOv12 by 4.3 and 3.1 percentage points, respectively. Further cross-dataset evaluation on the VOC and Brain Tumor benchmark datasets verifies the transferability of the proposed model.
Grad-CAM is adopted to visualize the detector's attention distribution, and the resulting heatmaps together with detection visualizations confirm that the proposed model focuses more precisely on stacked and occluded regions.

Article
Computer Science and Mathematics
Computer Vision and Graphics

Ruize Xia

Abstract: Downstream adaptation of a contrastively pretrained vision-language model can improve in-domain accuracy while degrading performance on unseen transfer tasks. This study examines how full fine-tuning and low-rank adaptation alter attention heatmaps under a controlled design that matches learning rate across adaptation methods. The completed matched-learning-rate matrix contains 80 runs using the OpenAI Contrastive Language-Image Pretraining model with a base 32-patch vision transformer image encoder, two datasets (EuroSAT and Oxford-IIIT Pets), four shared learning rates (1e-6, 5e-6, 1e-5, and 5e-5), and five random seeds. We measure classification-token-to-patch attention entropy, the fraction of patches required to capture 95% of attention mass, attention concentration, head diversity, in-domain validation accuracy, and adapter-aware zero-shot accuracy on CIFAR-100. Three findings emerge. First, learning rate is a primary determinant of structural drift: on EuroSAT, full fine-tuning moves from entropy broadening at 1e-6 (+1.83%) to marked contraction at 5e-5 (-3.99%), whereas low-rank adaptation remains entropy-positive across the full matched grid (+0.68% to +1.50%). Second, low-rank adaptation preserves out-of-domain transfer substantially better than full fine-tuning at matched learning rates: averaged across the EuroSAT grid, zero-shot accuracy on CIFAR-100 is 45.13% for low-rank adaptation versus 11.28% for full fine-tuning; on Oxford-IIIT Pets, the corresponding averages are 58.01% and 8.54%. Third, Oxford-IIIT Pets exhibits a clear interaction with optimization scale: low-learning-rate low-rank adaptation underfits the in-domain task, so method-only averages can obscure the regime in which it becomes competitive. Additional rollout, patch-to-patch, centered-kernel-alignment, and backbone analyses are directionally consistent with these controlled results.
Across both controlled datasets, runs with broader retained attention support also retain more zero-shot performance. Taken together, these findings support attention heatmap drift as an informative descriptive lens on model adaptation while arguing against a universal interpretation of the observed behavior as a single collapse phenomenon.

Article
Computer Science and Mathematics
Computer Vision and Graphics

Donald Martin, Blake Bowman

Abstract: Incomplete 3D point clouds present a significant challenge in diverse applications due to sensor limitations and occlusions. Existing methods often struggle to balance local detail accuracy and global structural integrity, frequently yielding artifacts or distortions due to reliance on symmetric Chamfer Distance or unguided contrastive losses. We propose Dual-Constraint Contrastive Completion (DCCC), an end-to-end framework integrating asymmetric weighted Chamfer distance with multi-granularity contrastive learning. DCCC utilizes an encoder-decoder backbone with Mamba layers for efficient feature extraction. Central to the framework is the Asymmetric Contrastive Chamfer Loss (ACCL), which decouples local-precision and global-integrity objectives into distinct contrastive components optimized via dynamic asymmetric weighting. A Self-Supervised Structural Guidance (SSG) module further learns coarse structural priors directly from incomplete inputs, reducing annotation reliance and improving robustness. Extensive experiments on benchmark datasets demonstrate DCCC's superior performance. DCCC achieves best-in-class results across critical metrics, significantly enhancing structural completeness and fine-grained accuracy in diverse settings, including real-world scenarios and high sparsity.
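The asymmetric weighting idea behind ACCL can be illustrated on the plain Chamfer distance: weighting the prediction-to-GT and GT-to-prediction terms separately decouples local precision from global coverage. A toy sketch on 3D points (the weights, points, and function names are illustrative; the full ACCL adds contrastive components not shown here):

```python
def nn_sq_dist(p, cloud):
    """Squared distance from point p to its nearest neighbor in cloud."""
    return min(sum((a - b) ** 2 for a, b in zip(p, q)) for q in cloud)

def asymmetric_chamfer(pred, gt, w_fwd=1.0, w_bwd=1.0):
    """Weighted two-sided Chamfer distance. The forward term measures how
    close predictions lie to the ground truth (local precision); the backward
    term penalizes ground-truth points left uncovered (global coverage).
    Unequal weights decouple the two objectives."""
    fwd = sum(nn_sq_dist(p, gt) for p in pred) / len(pred)
    bwd = sum(nn_sq_dist(q, pred) for q in gt) / len(gt)
    return w_fwd * fwd + w_bwd * bwd

pred = [(0, 0, 0), (1, 0, 0), (2.5, 0, 0)]   # one stray prediction at x = 2.5
gt = [(0, 0, 0), (1, 0, 0), (2, 0, 0)]

print(round(asymmetric_chamfer(pred, gt, 1.0, 1.0), 4))  # symmetric baseline
print(round(asymmetric_chamfer(pred, gt, 2.0, 0.5), 4))  # emphasize precision
```

With w_fwd > w_bwd the loss punishes stray predictions more than uncovered regions, which is the kind of trade-off the symmetric Chamfer distance cannot express.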

Article
Computer Science and Mathematics
Computer Vision and Graphics

Valli Nayagam V, Anukarthika S, Muhesh Krishnaa S, Sri Sathya K B

Abstract: The rapid expansion of sports broadcasting and digital media platforms has increased the demand for intelligent systems capable of automatically identifying important sports events for real-time analytics and highlight generation. Manual annotation of sports videos requires significant time and effort and may introduce human errors during analysis. This paper presents a real-time sports action recognition framework using a hybrid CNN–Transformer architecture for detecting critical events in football and cricket videos. The proposed system processes live or recorded video streams through frame extraction, normalization, and spatial feature learning using the MobileNetV2 network. Temporal relationships between consecutive frames are modeled using a Transformer encoder to improve action understanding. The framework classifies events such as pass and goal in football, and four, six, and wicket in cricket. Motion-based filtering and confidence thresholding reduce non-action frames and improve prediction reliability. Detected events are recorded with timestamps and displayed using broadcast-style overlays to support automated highlight generation. Experimental evaluation demonstrates high recognition accuracy and efficient real-time performance on low-cost hardware platforms. The framework provides an effective solution for sports analytics, media automation, and intelligent decision-support systems.
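The two gating steps mentioned above, motion-based filtering and confidence thresholding, compose as a simple per-frame conjunction. A minimal sketch on toy 2x2 grayscale frames (the thresholds, toy data, and `filter_events` helper are hypothetical, not the paper's exact pipeline):

```python
def mean_abs_diff(frame_a, frame_b):
    """Average absolute pixel difference between consecutive grayscale frames."""
    n = len(frame_a) * len(frame_a[0])
    return sum(abs(a - b)
               for row_a, row_b in zip(frame_a, frame_b)
               for a, b in zip(row_a, row_b)) / n

def filter_events(frames, preds, motion_thr=10.0, conf_thr=0.8):
    """Keep an event only when the frame shows motion AND the classifier is
    confident; a sketch of the two-filter gating described in the abstract."""
    events = []
    for i in range(1, len(frames)):
        label, conf = preds[i]
        if mean_abs_diff(frames[i - 1], frames[i]) >= motion_thr and conf >= conf_thr:
            events.append((i, label, conf))
    return events

frames = [
    [[0, 0], [0, 0]],      # t=0
    [[0, 0], [0, 0]],      # t=1: static, so even a confident prediction is dropped
    [[50, 50], [50, 50]],  # t=2: strong motion
]
preds = [("none", 0.0), ("goal", 0.95), ("goal", 0.90)]
print(filter_events(frames, preds))   # [(2, 'goal', 0.9)]
```

Surviving events carry their frame index, which maps directly to the timestamped records used for the broadcast-style highlight overlays.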

Article
Computer Science and Mathematics
Computer Vision and Graphics

Jianhua Zhu, Changjiang Liu, Danling Liang

Abstract: Multi-modal remote sensing image registration is a challenging task due to differences in resolution, viewpoint, and intensity, which often leads to inaccurate and time-consuming results with existing algorithms. To address these issues, we propose an algorithm based on Curvature Scale Space Contour Point Features (CSSCPF). Our approach combines multi-scale Sobel edge detection, dominant direction determination, an improved curvature scale space corner detector, a new gradient definition, and enhanced SIFT descriptors. Test results on publicly available datasets show that our algorithm outperforms existing methods in overall performance. Our code will be released at https://github.com/JianhuaZhu-IR.

Review
Computer Science and Mathematics
Computer Vision and Graphics

Ge Gao, Chen Feng, Yuxuan Jiang, Tianhao Peng, Ho Man Kwan, Siyue Teng, Chengxi Zeng, Yixuan Li, Changqi Wang, Robbie Hamilton, +3 authors

Abstract: While conventional video coding standards remain predominant in real-world applications, neural video compression has emerged over the past decade as an active research area, offering alternative solutions with potentially significant coding gains through end-to-end optimization. Owing to the rapid pace of recent progress, existing reviews of neural video coding quickly become outdated and often lack a systematic taxonomy and meaningful benchmarking. To address this gap, we provide a comprehensive review of two major classes of neural video codecs - scene-agnostic and scene-adaptive - with a focus on their design characteristics and limitations. More importantly, we benchmark representative state-of-the-art methods from each category under common test conditions recommended by video coding standardization bodies. This provides, to the best of our knowledge, the first large-scale unified comparison between conventional and neural video codecs under controlled settings. Our results show that neural codecs can already achieve competitive, and in some cases superior, performance relative to VTM and AVM, although they still fall short of ECM in overall coding efficiency under both Low Delay and Random Access configurations. To facilitate future algorithm benchmarking, we will release the full implementations and results at https://nvc-review-2025.github.io, thereby providing a useful resource for the video compression research community.

Article
Computer Science and Mathematics
Computer Vision and Graphics

Marc Tornero-Soria, Antonio-José Sánchez-Salmerón, Eduardo Vendrell-Vidal

Abstract: Public YOLO model releases typically provide high-level architectural descriptions and headline benchmark results, but offer limited empirical attribution of performance to individual blocks under controlled training conditions. This paper presents a modular, block-level analysis of YOLO26’s object detection architecture, detailing the design, function, and contribution of each component. We systematically examine YOLO26’s convolutional modules, bottleneck-based refinement blocks, spatial pyramid pooling, and position-sensitive attention mechanisms. Each block is analyzed in terms of objective and internal flow. In parallel, we conduct targeted ablation studies to quantify the effect of key design choices on accuracy (mAP50–95) and inference latency under a fixed, fully specified training and benchmarking protocol. Experiments use the MS COCO [1] dataset with the standard train2017 split (≈118k images) for training and the full val2017 split (5k images) for evaluation. The result is a self-contained technical reference that supports interpretability, reproducibility, and evidence-based architectural decision-making for real-time detection models.

Article
Computer Science and Mathematics
Computer Vision and Graphics

Longcheng Huang, Mengguang Liao, Shaoning Li, Chuanguang Zhu, Sichun Long

Abstract: Maritime search and rescue is an important component of emergency response frameworks and primarily relies on UAVs for maritime object detection. However, maritime accidents frequently occur in low-visibility environments, such as foggy or low-light conditions, which lead to low contrast, blurred object boundaries, and degraded texture representations. Most existing maritime object detection algorithms are developed for natural light scenes, and their performance deteriorates markedly when deployed directly in low-visibility environments, primarily due to reduced image quality that hinders feature extraction and semantic information aggregation. Although several studies incorporate image enhancement techniques prior to detection to improve image quality, these approaches often introduce significant additional computational overhead, limiting their practical deployment on UAV platforms. To tackle these challenges, this paper proposes a lightweight model built upon a recent YOLO framework, termed Multi-Scale Adaptive YOLO (MSA-YOLO), for maritime detection using UAVs in low-visibility environments. The proposed model systematically optimizes the backbone, neck, and detection head networks. Specifically, an improved StarNet backbone is designed by integrating ECA mechanisms and multi-scale convolutional kernels, which strengthen feature extraction capability while maintaining low computational overhead. In the neck network, a high-frequency enhanced residual block branch is inserted into the C3k2 module to capture richer detailed information, while depthwise separable convolution is utilized to further reduce computational cost. Moreover, a non-parametric attention module is incorporated into the detection head to adaptively optimize features in the classification and regression branches. 
Finally, a joint loss function that combines bounding box regression, classification, and distribution focal losses is utilized to improve detection accuracy and training stability. Experimental results on the constructed AFO, Zhoushan Island, and Shandong Province datasets demonstrate that, relative to YOLOv11-s, MSA-YOLO reduces model parameters and FLOPs by 52.07% and 41.36%, respectively, while achieving improvements of 1.11% and 1.33% in mAP@0.5:0.95 and mAP@0.5. These results indicate that the proposed method effectively balances computational efficiency and detection accuracy, rendering it suitable for practical maritime search and rescue applications in low-visibility environments.

Article
Computer Science and Mathematics
Computer Vision and Graphics

Xiaoming Chen, Xiaoyu Jiang, Yingqing Huang, Xi Wang, Chaoqun Ma

Abstract: We propose a field-transformation-based framework for generating phase-only light-field holograms from a single RGB image. The method establishes an explicit pipeline from monocular scene inference to holographic wavefront synthesis, without requiring multi-view capture or task-specific hologram-network training. First, we construct a layered occlusion RGB-D model from the input image using monocular depth estimation, connectivity-based layer decomposition, and occlusion-aware inpainting, which provides a lightweight 3D prior for sparse-view rendering in the small-parallax regime. Second, we transform the rendered sparse RGB-D light field into a target complex wavefront on the recording plane through local frequency mapping, thereby bridging explicit scene geometry and wave-optical field construction. Third, we optimize the phase-only hologram under multi-plane amplitude constraints using a geometrically consistent initial phase and an error-driven adaptive depth-sampling strategy, which improves convergence stability and reconstruction quality under a limited computational budget. Numerical experiments show that the proposed method achieves better depth continuity, occlusion fidelity, and lower speckle noise than representative layer-based and point-based methods, and improves the average PSNR and SSIM by approximately 3 dB and 0.15, respectively, over Hogel-Free Holography. Optical experiments further confirm the physical feasibility and robustness of the proposed framework.


Preprints.org is a free preprint server supported by MDPI in Basel, Switzerland.


© 2026 MDPI (Basel, Switzerland) unless otherwise stated