Computer Science and Mathematics

Article
Computer Science and Mathematics
Computer Vision and Graphics

Roberto Carlos Moreno-Hernández, Juan A. Moreno-Hernández, Margarita De la Portilla-Reynoso, Claudia del C. Gutiérrez-Torres, Juan G. Barbosa-Saldaña, Didier Samayoa, José A. Jiménez-Bernal

Abstract: Seismic attribute selection remains a critical yet often heuristic component in deep learning-based segmentation workflows. In this work, we propose a redundancy-aware framework to systematically analyse the contribution of seismic attributes by combining input-space statistics, representational similarity (CKA), and error-based evaluation. Our results show that statistical redundancy in the input space does not directly translate to functional redundancy within the network. In particular, attributes such as amplitude and instantaneous phase may exhibit high similarity at the input level while producing distinct error patterns and meaningful performance gains. We further demonstrate that complementary attributes do not necessarily yield additive improvements. While some combinations introduce conflicting interactions that limit global performance, others provide stable and consistent improvements across classes. Notably, the combination of amplitude, phase, and local variance forms a minimal informative subset that improves segmentation performance in a balanced manner, particularly in challenging facies. These findings highlight that attribute selection should be guided by functional complementarity and stability of interaction rather than by input diversity alone. The proposed framework provides a principled approach for identifying effective attribute subsets, contributing to more efficient and interpretable seismic segmentation workflows.
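
As a rough illustration of the representational-similarity step, the sketch below computes linear CKA between two activation matrices in plain Python. The abstract names CKA but not the variant, so the linear form, the function names, and the toy matrices are assumptions made here for illustration.

```python
import math


def center_columns(m):
    """Subtract the column mean from every entry (rows = samples, columns = features)."""
    n = len(m)
    means = [sum(col) / n for col in zip(*m)]
    return [[v - mu for v, mu in zip(row, means)] for row in m]


def cross_gram_sq_norm(a, b):
    """Squared Frobenius norm of A^T B, accumulated column pair by column pair."""
    total = 0.0
    for col_a in zip(*a):
        for col_b in zip(*b):
            dot = sum(u * v for u, v in zip(col_a, col_b))
            total += dot * dot
    return total


def linear_cka(x, y):
    """Linear CKA: ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F) on centered data."""
    xc, yc = center_columns(x), center_columns(y)
    num = cross_gram_sq_norm(xc, yc)
    den = math.sqrt(cross_gram_sq_norm(xc, xc) * cross_gram_sq_norm(yc, yc))
    return num / den


# Toy activations for 4 samples (hypothetical values).
x = [[1.0, 2.0], [2.0, 1.0], [3.0, 5.0], [4.0, 3.0]]
y = [[3.0 * v for v in row] for row in x]      # same representation, rescaled
z = [[1.0], [-1.0], [1.0], [-1.0]]             # a different representation

same = linear_cka(x, x)
scaled = linear_cka(x, y)
different = linear_cka(x, z)
```

A value near 1 means two representations carry the same structure up to scaling and rotation. This is the sense in which two attributes that look alike at the input level can still be compared inside the network.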

Article
Computer Science and Mathematics
Computer Vision and Graphics

Ruicheng Yang, Hailiang Zhao, Yongyi Kong, Yicheng Lai, Jiansen Zhao

Abstract: Reliable visual detection of small floating objects on the water surface is a prerequisite for environmental monitoring and clean-up tasks performed by unmanned surface vehicles (USVs) on inland waterways. Such scenes are routinely degraded by low illumination at dawn and dusk, strong specular reflections, ripple-induced clutter, and large object-scale variations, which together cause missed detections, false alarms, and unstable localization. This paper proposes YOLO11-LREP, a lightweight detection framework built upon YOLO11n and tailored for water-surface floating-object recognition under such adverse conditions. Four complementary improvements are integrated: (i) a Coordinate Attention (CoordAtt) module is inserted at the top of the backbone to enhance positional encoding and highlight obstacle-related semantic regions; (ii) three Efficient Channel Attention (ECA) modules are embedded at the multi-scale fusion nodes of the Neck so that reflection- and ripple-induced spurious channel responses can be suppressed at almost no extra cost; (iii) the Powerful-IoU (PIoU) loss replaces the original regression loss to enforce four-side boundary alignment and stabilize convergence on small, blurred-edge targets; and (iv) a joint low-light and reflection augmentation strategy, together with CutMix region-level mixing, broadens the training distribution along the illumination and occlusion axes. Experiments on the public FloW-Img dataset, split into 1,200 training and 800 validation images (2,024 instances) and run under a fixed random seed (seed = 0, deterministic = true), show that YOLO11-LREP attains AP₅₀ = 80.1 %, AP₅₀:₉₅ = 38.5 %, and AP_S = 24.3 % with only 2.84 M parameters and 9.3 GFLOPs. On an NVIDIA RTX 4060 Laptop GPU, the model runs at 3.3 ms total per 640×640 image (≈303 FPS), satisfying real-time perception requirements while retaining lightweight deployability. 
Ablation experiments verify the individual and complementary contributions of each component, and a systematic threshold sensitivity analysis (F₁ fluctuation < 0.2 %) demonstrates the stability of the final model.
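
The reported throughput follows directly from the per-image latency. A minimal check, with a helper name chosen here for illustration:

```python
def fps_from_latency_ms(latency_ms: float) -> float:
    """Frames per second implied by a total per-image latency in milliseconds."""
    return 1000.0 / latency_ms


fps = fps_from_latency_ms(3.3)  # latency quoted in the abstract
```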

Article
Computer Science and Mathematics
Computer Vision and Graphics

Arthur Nigmatzyanov, Gonzalo Ferrer

Abstract: Solving the 3D Point Cloud Place Recognition (3D-PCPR) task is essential for the localization and mapping of depth-based perception systems. Visual Place Recognition methods are highly dependent on image texture information, while the limited number of available point cloud datasets for 3D-PCPR causes the methods to overfit to specific data. Our objective is to use the latest foundation models for 3D point clouds. These models are trained using enormous 3D object datasets where the density is nearly uniform. However, the point clouds produced by the LiDAR sensors are sparse and non-uniformly distributed. We propose a new approach, Unified Point Cloud 3D Place Recognition (Uni-PCPR), effectively maintaining the expressiveness of features generated by the foundation model. We have evaluated the performance of Uni-PCPR on several datasets and found that it generalizes well to unseen data, outperforming other methods. The code will be available upon acceptance.

Article
Computer Science and Mathematics
Computer Vision and Graphics

Mustafa Yurdakul, Ahmet Eren Kaplan

Abstract: Accurate and timely detection of brain tumors on magnetic resonance imaging (MRI) is a critical prerequisite for effective neuro-oncological management. While convolutional neural networks (CNNs) have been the dominant paradigm for medical image classification over the past decade, the recent emergence of vision-capable large language models (LLMs) offers a complementary, training-free pathway to image-based decision support. This study presents a controlled, head-to-head comparison between 17 ImageNet-pretrained CNN architectures and 8 state-of-the-art multimodal LLMs on the publicly available Brain MRI Images for Brain Tumor Detection dataset (n = 253; 155 tumor, 98 non-tumor). Following an 80/20 train–test partition (n = 202 / n = 51), CNN models were fine-tuned via transfer learning, whereas LLMs were evaluated in a zero-shot configuration using a standardized prompt. Test-set performance was assessed using accuracy, precision, recall, specificity, F1-score, Cohen's kappa, and the area under the receiver operating characteristic curve (AUC). Among CNNs, six architectures (DenseNet169, DenseNet201, InceptionV3, ResNet101V2, VGG16, Xception) tied at 94.12% test accuracy, while ResNet50 and NASNetMobile exhibited pronounced overfitting (45.10% and 49.02%, respectively). Among LLMs, ChatGPT 5.4 Thinking achieved perfect classification (100% on all metrics), with ChatGPT 5.5 Thinking and Gemini 3.1 Thinking attaining 98.04% and 94.12% accuracy. These findings indicate that modern multimodal foundation models can match or exceed bespoke CNNs in low-data medical imaging tasks and support their further investigation as components of clinical decision-support pipelines.
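
With only 51 test images, each reported accuracy corresponds to a whole number of correct predictions. The sketch below recovers those counts and shows how the listed metrics follow from a binary confusion matrix. The counts are inferred from the percentages rather than stated in the abstract, and the confusion matrix is hypothetical.

```python
def correct_count(accuracy_pct, n_test):
    """Number of correct predictions whose percentage rounds to accuracy_pct, or None."""
    for k in range(n_test + 1):
        if round(100.0 * k / n_test, 2) == accuracy_pct:
            return k
    return None


def binary_metrics(tp, fn, fp, tn):
    """Standard binary metrics, including Cohen's kappa, from confusion-matrix counts."""
    n = tp + fn + fp + tn
    accuracy = (tp + tn) / n
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    specificity = tn / (tn + fp)
    f1 = 2 * precision * recall / (precision + recall)
    # Agreement expected by chance, from the row and column marginals.
    p_e = ((tp + fn) * (tp + fp) + (fp + tn) * (fn + tn)) / n ** 2
    kappa = (accuracy - p_e) / (1 - p_e)
    return {"accuracy": accuracy, "precision": precision, "recall": recall,
            "specificity": specificity, "f1": f1, "kappa": kappa}


counts = {acc: correct_count(acc, 51) for acc in (100.0, 98.04, 94.12, 49.02, 45.10)}

# Hypothetical confusion matrix with 48 of 51 correct (not taken from the paper).
example = binary_metrics(tp=30, fn=1, fp=2, tn=18)
```

On a test set this small, one additional error moves accuracy by about two percentage points, which is worth keeping in mind when comparing models that differ by a single image.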

Article
Computer Science and Mathematics
Computer Vision and Graphics

Zahid Ullah, Minki Hong, Jihie Kim

Abstract: Continual learning (CL), also referred to as lifelong learning, aims to develop intelligent systems capable of learning continuously from sequential data while retaining previously acquired knowledge. As AI systems are increasingly deployed in dynamic real-world environments, CL has become essential for enabling long-term adaptation without catastrophic forgetting. This review provides a structured overview of major CL paradigms, including task-incremental, domain-incremental, class-incremental, online, multimodal, and federated CL. We examine the theoretical foundations of CL, particularly the stability-plasticity dilemma, catastrophic forgetting, transfer dynamics, and representation learning. In addition, we analyze major methodological categories, including regularization-based, replay-based, architecture-based, optimization-based, representation-learning, and parameter-efficient approaches. Recent developments involving transformers, prompt learning, foundation models, and multimodal adaptation are also discussed as emerging directions in modern CL research. Furthermore, this review highlights important issues related to benchmark fragmentation, evaluation inconsistency, memory constraints, computational efficiency, scalability, and privacy-aware learning. We also summarize key application domains, including computer vision, natural language processing, robotics, healthcare, and medical imaging. Finally, we identify open research challenges and future directions toward scalable, reliable, and deployment-oriented lifelong learning systems capable of operating effectively in continuously evolving environments.

Article
Computer Science and Mathematics
Computer Vision and Graphics

Yu Shang, Yinzhou Tang, Xin Zhang, Shengyuan Wang, Yuwei Yan, Honglin Zhang, Zhiheng Zheng, Jie Zhao, Jie Feng, Chen Gao, +3 authors

Abstract: World models have emerged as a pivotal research direction, with recent breakthroughs in generative AI underscoring their potential for advancing artificial general intelligence. For embodied AI, world models are critical for enabling robots to effectively understand, interact with, and make informed decisions in real-world physical environments. This survey systematically reviews recent progress in embodied world models, under a novel technical taxonomy. We hierarchically organize the field by model architectures, training methodologies, application scenarios, and evaluation approaches, thus offering researchers a clear technical roadmap. We first thoroughly discuss vision-based generative world models and latent space world models, along with their corresponding training paradigms. We then explore the multifaceted roles of embodied world models in robotic applications, from functioning as cloud-based simulation environments to on-device agent brains. Additionally, we summarize important evaluation dimensions for benchmarking embodied world models. Finally, we outline key challenges and provide insights into promising future research directions within this crucial domain. We summarize the representative works discussed in this survey at https://github.com/tsinghua-fib-lab/Awesome-Embodied-World-Model.

Article
Computer Science and Mathematics
Computer Vision and Graphics

Ziquan Liu, Zhen Wang, Qiwei Wu, Chengbo Hu, Yongling Lu

Abstract: To address problems in corridor management, such as reliance on manual warning zone delineation and inconsistent boundaries, this paper proposes 3DSim-WZD, an automatic ground-level warning zone detection method based on 3D simulation data augmentation. Guided by "interpretable geometric priors combined with deep learning regression," the framework integrates four modules: parametric simulation generation, simulation-to-real transfer, boundary vertex regression, and voltage-level-based expansion. Specifically, parametric virtual scenes are constructed in Unity3D to automatically derive accurate vertex labels. The open-source Stable Diffusion framework, combined with ControlNet and LoRA, is employed for sim-to-real style transfer to reduce domain gaps. Furthermore, directional detection convolutional kernels are incorporated into the YOLO12m backbone to enhance sensitivity to transmission structures. Finally, safety clearance distances are mapped according to voltage levels to obtain regulatory-compliant warning zones. Evaluated on a dataset of 5,000 simulated and 300 real samples, the method achieves an mIoU of 91.2% and an inference speed of 46.8 FPS, demonstrating significant potential for large-scale deployment.

Article
Computer Science and Mathematics
Computer Vision and Graphics

Somasis Roy, Anirban Mitra, Sanjit Kumar Setua

Abstract: This research presents a novel approach for enhancing retinal fundus images to better detect anomalies and diagnose retinal diseases. The work is divided into two stages: image representation and enhancement. Fundus images are represented in a Clifford color space, a 3D color model based on the RGB system, where colors are stored as multivectors that preserve color information and luminance. A rotation operation is applied to correct the image's illumination by adjusting brightness and color deviations, with the rotation angle and axis being critical for accurate enhancement. The gray-level axis serves as the rotational plane, and a rotation angle about the grayscale bivector axis, determined via discrete entropy (DE), optimally corrects image illumination. Following this, the green channel is extracted and enhanced using the CLAHE technique before being recombined with the other channels, and the image is reverse-rotated to its original color space. The effectiveness of the proposed method is evaluated using PSNR, DE, and SSIM on the MESSIDOR and DRIVE datasets, showing superior image quality and information preservation compared to existing methods. This enhanced technique is particularly beneficial for retinal landmark and lesion detection, improving diagnostic accuracy in retinal imaging.
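
Discrete entropy appears here both as the criterion for the rotation angle and as an evaluation metric. The sketch below assumes DE means the Shannon entropy of the gray-level histogram, which is its usual meaning in enhancement work. The abstract does not define it, and the function name and toy pixel lists are illustrative.

```python
import math
from collections import Counter


def discrete_entropy(pixels):
    """Shannon entropy (in bits) of the gray-level histogram of a flat pixel list."""
    counts = Counter(pixels)
    n = len(pixels)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


flat_image = [128] * 16                  # a single gray level
two_level = [0] * 8 + [255] * 8          # two equally likely levels
four_level = [0, 85, 170, 255] * 4       # four equally likely levels

de_flat = discrete_entropy(flat_image)
de_two = discrete_entropy(two_level)
de_four = discrete_entropy(four_level)
```

Under this definition a higher DE indicates that gray levels are spread more evenly, so an angle search would keep the rotation whose output has the largest entropy. That search procedure is a guess at the method, not a description of it.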

Review
Computer Science and Mathematics
Computer Vision and Graphics

Lei Zhang, Tianyu Zhang, Xiaowei Fu, Fuxiang Huang, Wenguan Wang, David Zhang

Abstract: Frequency, as a physical quantity that describes the rate at which periodic events occur, is a crucial perspective and component, and can help observe and recognize the world via versatile frequency transforms. Initially built on Fourier analysis theory, it played an important role primarily in the field of signal processing, and has gradually become an indispensable part of deep learning to solve complex problems. Deep learning in the frequency domain (a.k.a. Fourier domain), which we call Frequency-principled Deep Learning (FDL), has been extensively employed in a wide range of scenarios owing to its compelling advantages, such as global receptive field, high computational efficiency, inherent data decomposition and explainability. Deep neural networks also exhibit certain properties from a frequency-domain perspective, which provides valuable insights for powerful model design and refinement. Despite growing attention to frequency-domain approaches in deep learning and computer vision, the absence of a systematic synthesis makes it difficult to grasp the current landscape, identify the core methodologies, perceive the challenges, and chart a course for future research. Moreover, a comprehensive explanation for why introducing frequency-domain methods contributes to problem-solving is still lacking. This survey aims to provide a comprehensive and structured overview of frequency-principled vision and learning to address this gap. Unlike previous reviews that may focus on isolated aspects, our work seeks to connect and systematize the field through a unified taxonomy. Specifically, we conduct a systematic survey and analysis of existing literature from multiple perspectives: frequency principle (theory), implementations (algorithms), applications, challenges and future frontiers of FDL across various tasks.

Article
Computer Science and Mathematics
Computer Vision and Graphics

Yichuan Zheng, Jin Shi, Wei Shen

Abstract: Product recognition in e-commerce live streaming is hindered by rapid viewpoint changes, occlusions, motion blur, and inconsistencies between visual and spoken information. Existing approaches typically focus on individual components such as detection, OCR, or speech recognition, which limits their effectiveness in end-to-end scenarios. To address this problem, we propose an integrated framework that combines task-oriented keyframe selection with multimodal semantic fusion. The framework first uses D-FINE to localize product regions, and then selects informative frames through two complementary strategies. Strategy A considers both detection confidence and Laplacian-based sharpness, while Strategy B combines detection confidence with a learned image-quality score estimated by an EfficientNetV2-based model. OCR, visual recognition, and ASR are then applied to the selected data, and a Qwen-Plus large language model is used to integrate multimodal evidence into structured product outputs. Experiments on an in-house dataset demonstrate significant gains over a last-frame baseline. Strategy A increases Perfect Match Rate from 58.00% to 80.00% and Product Name Recognition Accuracy from 78.00% to 98.00%. Strategy B achieves 77.00% and 98.00%, respectively. Ablation studies further show that the full multimodal framework consistently outperforms unimodal and dual-modality variants. In addition, Top-K analysis indicates that single-frame inference provides a good balance between performance and efficiency. Overall, the proposed framework offers an effective and practical solution for product recognition in complex live-streaming scenarios.
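
One plausible reading of Strategy A is a per-frame score that mixes detection confidence with the variance of a Laplacian response, a common sharpness proxy. The abstract does not say how the two terms are combined, so the weighted sum, the `alpha` parameter, the normalisation, and the toy frames below are all assumptions.

```python
def laplacian_variance(img):
    """Variance of the 4-neighbour Laplacian over the interior of a 2D grayscale image."""
    h, w = len(img), len(img[0])
    responses = [
        img[i - 1][j] + img[i + 1][j] + img[i][j - 1] + img[i][j + 1] - 4 * img[i][j]
        for i in range(1, h - 1)
        for j in range(1, w - 1)
    ]
    mean = sum(responses) / len(responses)
    return sum((r - mean) ** 2 for r in responses) / len(responses)


def select_keyframe(frames, alpha=0.5):
    """Index of the best frame; `frames` is a list of (detection_confidence, image) pairs."""
    sharpness = [laplacian_variance(img) for _, img in frames]
    max_sharp = max(sharpness) or 1.0  # avoid dividing by zero when every frame is flat
    scores = [
        alpha * conf + (1.0 - alpha) * s / max_sharp
        for (conf, _), s in zip(frames, sharpness)
    ]
    return max(range(len(frames)), key=scores.__getitem__)


# Toy 4x4 frames: a featureless (blurred) one and a high-contrast checkerboard.
blurred = [[10] * 4 for _ in range(4)]
sharp = [[255 if (i + j) % 2 else 0 for j in range(4)] for i in range(4)]

best = select_keyframe([(0.9, blurred), (0.8, sharp)])
```

In this toy case the sharper frame wins despite its slightly lower detection confidence, which is the trade-off such a combined score is meant to capture.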

Article
Computer Science and Mathematics
Computer Vision and Graphics

Zhipeng Ye, Zihao Lu, Yuan Zhang, Wenjie Qin, Jiayi Hong, Yan Cao, Xuming Wu, Zhibin Shao

Abstract: 3D Gaussian Splatting (3DGS) degrades severely in sparse-view scenarios, often collapsing into artifacts due to under-constrained optimization. While incorporating monocular depth priors provides dense supervision, their inherent multi-view inconsistency frequently distorts geometry. To address this, we propose GeoTrack-GS, a geometry-first framework that refines noisy depth priors using reliable self-supervised constraints. Specifically, we leverage sparse feature tracks to enforce macro-level reprojection consistency and introduce a micro-level anisotropic regularizer via K-NN PCA to suppress rank-collapse. On this corrected geometry, we design GT-DCA, a geometry-guided deformable cross-attention module that captures view-dependent appearance without compromising structure. A Decoupled Constraint Stabilization strategy further balances these heterogeneous signals during training. Experiments on LLFF and DTU under 3-9 input views, and on Mip-NeRF 360 under 12 input views, demonstrate that GeoTrack-GS achieves state-of-the-art geometric fidelity while maintaining competitive rendering quality compared to existing baselines, effectively reducing floaters and "waxy" surfaces.

Review
Computer Science and Mathematics
Computer Vision and Graphics

Jin Yang, Jing Zhang, Xiaobing Yu

Abstract: Machine learning and deep learning models trained on a source domain often suffer from performance degradation when deployed to new target domains due to domain shifts arising from differences in data distributions, acquisition conditions, or temporal variations. Domain adaptation addresses this issue by transferring knowledge from labeled source data to unlabeled target data. However, acquiring labels for target-domain samples is often costly or impractical in real-world applications. To improve label efficiency, active domain adaptation (ADA) and active continual learning (ACL) integrate active learning strategies into domain adaptation and continual learning frameworks. ADA selectively queries informative target samples to enhance adaptation performance, while ACL extends this paradigm to sequential settings, enabling models to adapt to evolving data streams while mitigating catastrophic forgetting. This survey provides a systematic review of ADA and ACL, focusing on their advances and applications. We further examine extensions of ADA such as source-free ADA, integration with semi-supervised learning, and advanced techniques for handling challenging adaptation scenarios. In addition, we summarize applications across computer vision, medical imaging, robotics, natural language processing, scientific and engineering tasks. Finally, we discuss open challenges and future directions, including robust adaptation under complex distribution shifts and reliable semi-supervised adaptation.

Article
Computer Science and Mathematics
Computer Vision and Graphics

Zichen Zhang, Chengjun Guo

Abstract: Existing vision-based defect detection algorithms built upon YOLOv11 often exhibit unstable performance in complex building environments, where varying illumination conditions and partial occlusions caused by debris or vegetation can severely degrade detection accuracy. More importantly, most existing methods rely solely on visual features while neglecting domain-specific prior knowledge from civil engineering, particularly the geometric continuity of structural damages and the physical stress distribution around defect regions. As a result, these approaches remain vulnerable to background interference, show limited capability in extracting features of small-scale defects, and may generate detections that are inconsistent with the actual physical characteristics of structures. To overcome these limitations, this paper proposes an enhanced detection framework, termed PIA-YOLO, which integrates a Physical Information Attention (PIA) module and a Residual Efficient Channel Attention (RECA) module as dual attention branches. Specifically, the PIA module incorporates civil engineering priors by embedding physically inspired gradient operators into the attention mechanism, rather than directly solving physical equations, thereby enhancing structural feature perception and suppressing physically unreasonable detections. Meanwhile, the RECA module adaptively recalibrates channel-wise feature responses through learnable residual coefficients, enabling more effective representation of subtle defects such as cracks and spalling that are characterized by small targets and weak pixel contrast. Extensive experiments on both public datasets and a self-built crack dataset demonstrate the effectiveness of the proposed method. Compared with the baseline YOLOv11, PIA-YOLO improves mAP@0.5 by 2.2% and 15.9%, respectively, while increasing recall by 4.6% and 34.0%, without significantly sacrificing inference speed or increasing computational cost. 
These results indicate that PIA-YOLO provides an efficient and accurate solution for intelligent building defect detection, with promising applications in structural inspection, environmental monitoring, traffic infrastructure management, and post-disaster assessment.

Article
Computer Science and Mathematics
Computer Vision and Graphics

Masachika Akage, Daisuke Yoshida, Wakana Fujimoto

Abstract: UAV-based crack inspection of port quay walls is promising for efficient infrastructure maintenance, but its practical deployment remains hindered by frequent false positives caused by debris, stains, and irregular surface textures. This study proposes a false-positive reduction framework for a crack inspection system based on aerial images acquired by a small general-purpose UAV. The proposed method introduces anomaly detection after object detection so that detected crack candidate regions are re-evaluated based on their deviation from the learned feature distribution of crack images. A Vision Transformer (ViT)-based anomaly detection model is employed, and both standard-threshold and low-threshold object detection settings are investigated. Experimental validation across five verification areas showed that the combination of standard-threshold object detection and anomaly detection consistently improved F1 and F2 scores over the conventional baseline, demonstrating stable suppression of false positives while maintaining crack detectability. Under the low-threshold setting, Frangi filter-based preprocessing was more effective than grayscale-based preprocessing, achieving a favorable balance between broader crack extraction and false-positive suppression in some 5 m cases. However, this advantage decreased as image resolution deteriorated. Overall, the results indicate that the most robust configuration in the current framework is the combination of standard-threshold object detection and anomaly-based false-positive suppression. In contrast, the benefit of low-threshold operation depends strongly on image resolution. The findings also suggest that practical deployment requires calibration of the anomaly-detection threshold based on site conditions and GSD.

Article
Computer Science and Mathematics
Computer Vision and Graphics

Baoyi Zhang, Xi Yu, Wuyi Cai, Xian Zhou, Binhai Wang, Tongyun Zhang

Abstract: The triangular mesh is one of the most widely used representations for 3D surfaces. However, high-resolution mesh models often contain a large number of triangles, leading to significant burdens in storage, transmission, and real-time rendering. Mesh simplification aims to reduce model complexity while preserving geometric fidelity and structural features. Classical methods, such as quadric error metrics (QEM), rely solely on local geometric errors, which makes it difficult for them to distinguish between redundant regions and structurally important features and often results in feature loss and topological degradation. To address these limitations, this study proposes a structure-aware triangular mesh simplification framework based on graph neural network (GNN)-guided QEM. GNNs are employed as a structural importance estimator to predict the geometric saliency of mesh edges. The predicted importance values are incorporated into the classical QEM edge collapse cost through a soft modulation mechanism. Furthermore, a geometry-saliency-driven dynamic cost modulation strategy is designed, enabling the simplification process to prioritize critical features in early stages and gradually transition to global error minimization in later stages, without compromising the geometric optimality of QEM. In terms of model design, hybrid structural representation GNNs are constructed by integrating spectral geometry and a dual-branch architecture. Laplacian positional encoding is introduced to capture global topological information, while 1-hop and 2-hop message passing branches enable multi-scale representation of complex geometric structures. In addition, a staged inference strategy is adopted to dynamically update graph structural features during simplification, effectively mitigating topological drift. Experimental results on the TOSCA dataset demonstrate that the proposed method achieves stable performance across various simplification ratios. 
It consistently outperforms FQMS and remains comparable to classical QEM in terms of geometric error (P_CD) and normal consistency (P_NE). For structural preservation (P_LE), the method shows clear advantages, with win-rates generally exceeding 70%. Moreover, it significantly improves the preservation of local geometric details at low to moderate simplification ratios. In summary, the proposed method effectively enhances local structural preservation while maintaining global geometric topology, providing an interpretable and practical solution for integrating learning-based structural awareness with classical geometric optimization in mesh simplification.
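
The idea of folding a predicted saliency into the QEM collapse cost can be sketched as follows. The quadric error is the standard sum of squared point-to-plane distances. The multiplicative modulation, the `strength` parameter, and the toy planes are assumptions made for illustration, since the abstract does not give the exact formula.

```python
def quadric_error(planes, point):
    """Sum of squared distances from `point` to planes (a, b, c, d) with unit normals."""
    x, y, z = point
    return sum((a * x + b * y + c * z + d) ** 2 for a, b, c, d in planes)


def modulated_cost(qem_cost, saliency, strength=1.0):
    """Hypothetical soft modulation: salient edges become more expensive to collapse."""
    return qem_cost * (1.0 + strength * saliency)


# Two planes meeting along the x-axis: z = 0 and y = 0.
planes = [(0.0, 0.0, 1.0, 0.0), (0.0, 1.0, 0.0, 0.0)]

base = quadric_error(planes, (5.0, 1.0, 2.0))                      # 2^2 + 1^2
flat_edge = modulated_cost(base, saliency=0.0)                     # falls back to plain QEM
feature_edge = modulated_cost(base, saliency=0.5, strength=2.0)    # penalised collapse
```

Letting `strength` decay toward zero as simplification proceeds would reproduce the described shift from feature protection in early stages to plain error minimization in later ones, though the schedule actually used is not specified.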

Article
Computer Science and Mathematics
Computer Vision and Graphics

Sergio Villanueva, Emilio Soria-Olivas, Manuel Sánchez-Montañés

Abstract: Automotive inspection in real production lines requires robust detection of rare and diverse defects. Fully supervised methods are often unfeasible because defective samples are scarce and heterogeneous. This work benchmarks recent unsupervised anomaly detection (UAD) methods on AutoVI, a real industrial dataset covering six automotive inspection tasks with challenging lighting, cluttered backgrounds, and multiple viewpoints. We establish RGB and pseudo-depth baselines for seven UAD models under a unified training and evaluation protocol, training exclusively on defect-free samples with z-score calibration for fair comparison. On top of these baselines we study late-fusion ensembles that combine complementary detectors within RGB and across modalities, at both image-score and pixel-map level, reporting AUROC, AP, TPR@TNR, and pixel-level sPRO/AUsPRO at 5% false positive rate. The main finding is that RGB-only late-fusion ensembles consistently improve pixel-level localization, often recovering defect coverage where all individual models fail. Combining RGB with monocular pseudo-depth through the same scheme, by contrast, does not yield systematic gains and is highly sensitive to the quality of the estimated depth channel. These results, validated with statistical significance testing across three random seeds, provide practical guidance for composing UAD pipelines in automotive inspection.
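
Z-score calibration followed by late fusion can be sketched at the image-score level as below. The sketch assumes that calibration statistics come from defect-free samples and that fusion is a plain average of calibrated scores. The abstract fixes neither detail, and the detector scores are made up.

```python
from statistics import mean, pstdev


def fit_zscore(normal_scores):
    """Mean and standard deviation of a detector's scores on defect-free samples."""
    return mean(normal_scores), pstdev(normal_scores)


def calibrate(score, params):
    """Map a raw anomaly score onto the detector's z-score scale."""
    mu, sigma = params
    return (score - mu) / sigma


def late_fusion(raw_scores, params_list):
    """Average the calibrated scores of several detectors for one test image."""
    return mean([calibrate(s, p) for s, p in zip(raw_scores, params_list)])


# Two hypothetical detectors whose raw scores live on very different scales.
params_a = fit_zscore([0.1, 0.2, 0.3])
params_b = fit_zscore([10.0, 20.0, 30.0])

z_a = calibrate(0.4, params_a)
z_b = calibrate(40.0, params_b)
fused = late_fusion([0.4, 40.0], [params_a, params_b])
```

Without the calibration step, the detector with the larger raw scale would dominate any average, which is why putting every model on a common scale matters before ensembling.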

Article
Computer Science and Mathematics
Computer Vision and Graphics

Tianzhi Jia, Shikui Wei, Yao Zhao

Abstract: Low-light image enhancement aims to recover high-quality visuals from poorly illuminated inputs, yet existing methods often suffer from over-enhancement, noise amplification, and semantic inconsistency in complex scenes. In this paper, we propose SeMaNet, a novel semantic-guided framework that integrates textual priors with a hybrid Transformer-Mamba architecture for controllable and efficient low-light enhancement. Our approach begins by leveraging pre-trained CLIP to generate semantically meaningful attention maps from natural language prompts, enabling interpretable region-aware enhancement without requiring pixel-level annotations. These semantic priors are then fused with illumination estimates and raw image features through a cross-attention mechanism, allowing dynamic interaction among multi-modal cues. To balance global context modeling and computational efficiency, we design a U-Net-based restoration network that interleaves Transformer blocks for long-range dependency capture and Mamba layers for linear-time sequence processing. Furthermore, our method explicitly models the image formation process via a perturbation-aware Retinex decomposition, enhancing physical plausibility. Extensive experiments on LOL v1, LOL-v2-real, LOL-v2-synthetic, SID, SMID, and SDSD-out datasets demonstrate that SeMaNet achieves state-of-the-art performance in both quantitative metrics (PSNR, SSIM) and qualitative quality, particularly excelling in preserving semantic coherence and fine details under challenging lighting conditions. The hybrid architecture also offers superior inference efficiency compared to pure Transformer-based models.

Article
Computer Science and Mathematics
Computer Vision and Graphics

Jiahao An, Qingxue Wang, Chunshan Wang, Xiang Sun, Qingwei Tian, Jin Yuan

Abstract: Rapid identification of maize waterlogging is essential for post-disaster agricultural assessment, but most existing methods rely on multi-temporal imagery that is often unavailable immediately after extreme rainfall events. This study proposes SAB-DeepLabV3+, a semantic segmentation model for mapping waterlogged maize from single-date multispectral imagery within pre-extracted maize planting areas. Built on DeepLabV3+, the model integrates three task-specific modules: a Spectral-Spatial Information Enhancement Module to improve feature discrimination under spectral mixing, an Adaptive Multi-Scale Pooling Module to capture heterogeneous patch sizes, and a Boundary Enhancement Module to refine transition zones. A pixel-level dataset containing 12,198 image patches was constructed from 62 multispectral scenes collected across five major maize-producing cities in Heilongjiang Province, China, during 2022–2024. On the test set, SAB-DeepLabV3+ achieved a waterlogged-class IoU of 68.30%, mIoU of 80.37%, mF1 of 88.62%, and OA of 93.49%, outperforming DeepLabV3+. Leave-one-city-out evaluation further produced an average mIoU of 76.56% and a waterlogged-class IoU of 63.45%. These results indicate that single-date high-resolution multispectral imagery can support rapid and reliable maize waterlogging mapping.
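
The segmentation metrics quoted above (class IoU, mIoU, OA) all derive from a pixel-level confusion matrix. A minimal sketch, using a hypothetical two-class matrix rather than the paper's data:

```python
def per_class_iou(conf):
    """IoU per class from a confusion matrix indexed as conf[true][predicted]."""
    n = len(conf)
    ious = []
    for c in range(n):
        tp = conf[c][c]
        fn = sum(conf[c]) - tp                          # true class c, predicted otherwise
        fp = sum(conf[r][c] for r in range(n)) - tp     # predicted c, actually another class
        ious.append(tp / (tp + fp + fn))
    return ious


def overall_accuracy(conf):
    """Fraction of pixels on the diagonal of the confusion matrix."""
    total = sum(sum(row) for row in conf)
    return sum(conf[c][c] for c in range(len(conf))) / total


# Hypothetical pixel counts: class 0 = waterlogged maize, class 1 = everything else.
conf = [[80, 20],
        [10, 890]]

ious = per_class_iou(conf)
miou = sum(ious) / len(ious)
oa = overall_accuracy(conf)
```

Because the minority class contributes few pixels, OA can remain high while the waterlogged-class IoU is much lower, which matches the gap between the reported 93.49% OA and 68.30% class IoU.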

Article
Computer Science and Mathematics
Computer Vision and Graphics

Mustafa Yurdakul, Ahmet Çakmak

Abstract: Effervescent tablets are highly hygroscopic solid dosage forms in which even minor surface defects can compromise product stability, dose uniformity, and patient safety. Reliable, high-throughput defect detection is therefore essential, yet the existing literature overwhelmingly focuses on compressed or film-coated tablets and rarely offers a systematic comparison across recent YOLO families and scales. This study presents a multi-scale performance benchmark of three recent YOLO families (YOLO11, YOLO12, and YOLO26) on a newly constructed effervescent tablet defect dataset. The dataset comprises 251 high-resolution images acquired under controlled illumination, each containing 12 tablets, and is manually annotated in YOLO format across six physical-condition classes (intact, damaged, cracked, broken, moist, and stained), yielding 3,012 bounding-box instances. All five standard scale variants (n, s, m, l, x) of each family were trained for 100 epochs under identical hyper-parameter settings, producing fifteen model variants that are compared in terms of mAP@0.5, mAP@0.5:0.95, precision, recall, inference speed (FPS), parameter count, and FLOPs. Experimental results show that YOLO11l achieves the best overall accuracy, with 96.8% mAP@0.5 and 91.7% mAP@0.5:0.95, while YOLO11n offers the most attractive real-time trade-off at 345.9 FPS with 95.6% mAP@0.5 and only 2.5M parameters. YOLO12 variants deliver competitive accuracy but at markedly lower inference speeds for the larger scales, whereas YOLO26 variants lag in the nano regime (88.0% mAP@0.5) but close the gap at the l/x scales. Class-wise analysis of YOLO11l shows consistently high performance across all six defect categories, with mAP@0.5 ranging from 94.0% (damaged) to 99.4% (stained). The results provide practical guidance for selecting a YOLO configuration for real-time effervescent tablet inspection lines and demonstrate that modern nano- and small-scale detectors are already sufficient for high-throughput pharmaceutical quality control.

Article
Computer Science and Mathematics
Computer Vision and Graphics

Gongxun Lin

,

Jincheng Jiang

,

Jiaheng Cai

,

Xingjian Luo

,

Zihao Wang

,

Hao Sun

,

Ziyuan Pu

Abstract: Real-time video object detection on unmanned aerial vehicles (UAVs) is essential for urban inspection and autonomous perception, yet its deployment on edge devices is severely constrained by the high computational cost of accurate detectors, the quantization sensitivity of hybrid convolution-attention networks, and the system-level latency of full video processing pipelines. To address these challenges, we present DUST-YOLO, a deployment-oriented algorithm-hardware co-design framework for lightweight and efficient UAV small-object detection on edge platforms. First, we introduce a multi-dimensional structured pruning strategy that applies asymmetric channel pruning to convolutional and feature-fusion modules while compressing the Swin Transformer prediction heads and bottleneck stacks, thereby reducing parameters and computation with limited impact on multi-scale representation capability. Second, we develop a hardware-aware mixed-precision quantization-aware training (QAT) scheme that maps computation-intensive backbone layers to INT8 while preserving the Transformer-related modules in FP16, improving inference efficiency while mitigating the accuracy loss caused by uniform low-bit quantization. Third, we compile the optimized network with TensorRT and integrate the resulting inference engine into a DeepStream-based asynchronous video pipeline on the edge platform, enabling end-to-end acceleration by reducing decoding, preprocessing, and memory-transfer overheads. Experimental results on the VisDrone2019-DET dataset and the NVIDIA Jetson Orin NX demonstrate that DUST-YOLO achieves 43.7% mAP@0.5 accuracy with an end-to-end latency of 36.3 ms and a throughput of 27.5 FPS. Compared with the state-of-the-art detector, DUST-YOLO reduces end-to-end latency by 56.9% and improves end-to-end video throughput by 2.31×, while lowering total energy consumption by 68.5%.
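The latency and throughput figures in this abstract can be related with simple arithmetic. The sketch below assumes that throughput is the reciprocal of per-frame latency, which need not hold exactly for an asynchronous pipeline; the function names are illustrative and do not come from the paper.

```python
# Figures quoted in the abstract.
latency_ms = 36.3          # end-to-end latency per frame, in milliseconds
latency_reduction = 0.569  # 56.9% lower latency than the compared detector


def fps_from_latency(ms: float) -> float:
    """Frames per second if one frame completes every `ms` milliseconds."""
    return 1000.0 / ms


def speedup_from_reduction(reduction: float) -> float:
    """Throughput ratio implied by a fractional latency reduction (assumes throughput = 1 / latency)."""
    return 1.0 / (1.0 - reduction)


fps = fps_from_latency(latency_ms)
speedup = speedup_from_reduction(latency_reduction)
print(f"{fps:.1f} FPS, {speedup:.2f}x")  # prints "27.5 FPS, 2.32x"
```

Under this assumption the latency figure reproduces the reported 27.5 FPS, and the implied speedup of about 2.32× is close to the reported 2.31×. The small difference may come from rounding of the published values or from throughput having been measured separately from latency.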

Preprints.org is a free preprint server supported by MDPI in Basel, Switzerland.

© 2026 MDPI (Basel, Switzerland) unless otherwise stated