Computer Science and Mathematics

Concept Paper
Computer Science and Mathematics
Computer Vision and Graphics

Gurpreet Singh, Purva Mundada

Abstract: Sign language translation (SLT) aims to convert sign language videos into spoken language text, serving as a critical bridge for communication between the Deaf and hearing communities. While recent advances in Multimodal Large Language Models (MLLMs) have shown promising results in gloss-free SLT, existing methods typically rely on single-modality visual features, failing to fully exploit the complementary nature of appearance and structural cues inherent in sign language. In this architectural proposition paper, we introduce SignFuse, a novel dual-stream cross-modal fusion framework that synergistically combines CNN-based visual features with Graph Convolutional Network (GCN)-based skeletal features for gloss-free sign language translation. Our framework introduces three key innovations: (1) a Cross-Modal Fusion Attention (CMFA) module that performs bidirectional cross-attention between visual and skeletal modalities to produce enriched multimodal representations; (2) a Hierarchical Temporal Aggregation (HTA) mechanism that captures sign language dynamics at multiple temporal scales—frame-level, segment-level, and sequence-level; and (3) a Progressive Multi-Stage Training blueprint that systematically aligns visual-skeletal features with the LLM’s linguistic space through contrastive pre-training, feature alignment, and LoRA-based fine-tuning. We provide the complete mathematical formulation, detailed architectural specifications, and a fully implemented PyTorch codebase. As the computational barriers to training MLLMs remain high, we formalize the experimental methodology required to validate this framework on standard benchmarks (PHOENIX-14T, CSL-Daily, How2Sign) and extend an open invitation to the broader research community to conduct empirical validation and advance this architectural paradigm through collaboration. 
This work is presented as a concept and architectural framework paper, aiming to establish a theoretical foundation and encourage future empirical validation by the research community.
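The CMFA module described above, bidirectional cross-attention between the visual and skeletal streams, can be sketched in miniature. This is an illustrative single-head, NumPy-only sketch under assumed shapes (T frames, d-dimensional features per stream); the names `cross_attention` and `bidirectional_fuse` are hypothetical and not taken from the SignFuse codebase.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(q_feats, kv_feats):
    """Single-head cross-attention: one stream's frames query the other stream."""
    d = q_feats.shape[-1]
    attn = softmax(q_feats @ kv_feats.T / np.sqrt(d), axis=-1)
    return attn @ kv_feats

def bidirectional_fuse(visual, skeletal):
    """CMFA-style fusion sketch: each modality attends to the other, and the
    two enriched streams are concatenated into one per-frame representation."""
    v_enriched = visual + cross_attention(visual, skeletal)    # visual queries skeletal
    s_enriched = skeletal + cross_attention(skeletal, visual)  # skeletal queries visual
    return np.concatenate([v_enriched, s_enriched], axis=-1)

rng = np.random.default_rng(0)
T, d = 8, 16                       # 8 frames, 16-dim features per stream (toy sizes)
visual = rng.standard_normal((T, d))
skeletal = rng.standard_normal((T, d))
fused = bidirectional_fuse(visual, skeletal)
print(fused.shape)                 # (8, 32): per-frame multimodal representation
```

The residual connections keep each stream's own features intact while the cross-attention term injects the complementary modality, matching the "enriched multimodal representations" the abstract describes.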

Article
Computer Science and Mathematics
Computer Vision and Graphics

Talha Laique, Mikkel Gunnes, Ole Folkedal, Jonatan Nilsson, Evelina Andrea Losneslokken Green, Hannah Normann Gundersen, Øyvind Øverlia, Habib Ullah

Abstract: Intensive salmon farming is associated with high mortality rates, highlighting the need for new welfare indicators that can detect adverse conditions earlier and less invasively than many current approaches. Existing animal-based indicators used in the industry typically depend on subjective scoring and provide information mostly after welfare problems have already developed, such as emaciation, wounds, or scale loss. Preliminary data and ongoing investigation suggest that melanin-based skin pigmentation may change dynamically with stress and condition in salmonid fishes. In this study, we present a semi-automated methodology for assessing changes in the grayscale intensity of melanin-based skin spots within the operculum region of adult Atlantic salmon (Salmo salar) kept in seawater. The pipeline combines computer vision models to detect the operculum, segment individual spots, and extract grayscale-based features for spot-level analysis over time. The method was applied to out-of-water images collected before and after exposure to a confinement episode. The results showed an overall shift in grayscale intensity from black toward faded pigmentation after the challenge, although responses varied among individuals. These findings indicate that the proposed methodology can detect temporal changes in opercular melanin-based spots under the applied experimental conditions. We therefore present this work as proof of principle for using computer vision to quantify changes in melanin-based skin spots as a potentially useful, non-invasive indicator of stress and welfare in Atlantic salmon.
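At its core, the grayscale feature extraction above reduces, for a single segmented spot, to averaging pixel intensities under a mask and comparing that mean across time points. A minimal pure-Python sketch with invented toy values (the mask, intensities, and function name are illustrative, not from the described pipeline):

```python
def spot_mean_intensity(gray, mask):
    """Mean grayscale value over masked spot pixels (0 = black, 255 = white)."""
    vals = [gray[r][c]
            for r in range(len(gray))
            for c in range(len(gray[0]))
            if mask[r][c]]
    return sum(vals) / len(vals)

# Toy 4x4 "operculum crop": the same spot imaged before and after a stressor.
mask   = [[0, 1, 1, 0], [0, 1, 1, 0], [0, 0, 0, 0], [0, 0, 0, 0]]
before = [[200,  30,  35, 200], [200,  28,  32, 200],
          [200, 200, 200, 200], [200, 200, 200, 200]]
after  = [[200,  90,  95, 200], [200,  88,  92, 200],
          [200, 200, 200, 200], [200, 200, 200, 200]]

delta = spot_mean_intensity(after, mask) - spot_mean_intensity(before, mask)
print(f"intensity shift: {delta:+.2f}")   # intensity shift: +60.00
```

A positive shift (toward white) corresponds to the pigmentation fading reported after the confinement challenge.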

Article
Computer Science and Mathematics
Computer Vision and Graphics

Ali Emre Gök, Mustafa Yurdakul, Şakir Taşdemir

Abstract: In medical image analysis, modeling local and global features in high-resolution data presents a significant challenge. While the widely used Convolutional Neural Networks (CNNs) struggle to capture long-range dependencies between distant pixels, the high computational cost (O(N²)) of Vision Transformer (ViT) architectures causes bottlenecks in clinical applications. This study investigates the integration of Mamba models, which were developed to overcome these limitations and have linear complexity, into medical image analysis, along with recent studies in the literature. This architecture, grounded in continuous-time control theory, dynamically adapts to hardware resolution. Mamba models effectively retain anatomical structures and lesions in memory while filtering out irrelevant noise through their selective mechanism. Moreover, bidirectional scanning (Vision Mamba) and cross-scan (VMamba) methods are used to prevent the loss of spatial information and to overcome the necessity of processing one-dimensional data due to the language-based structure of the models. The reviewed literature can be categorized under three main headings: hybrid models, efficient and lightweight designs, and spatial representation studies. Comprehensive analyses of the literature indicate that Mamba models deliver significantly higher inference speed and memory efficiency compared to traditional CNN and ViT approaches, owing to their hardware-aware design and linear computational complexity. In conclusion, the Mamba architecture has the potential to become a next-generation standard that demonstrates high performance while maintaining global contextual integrity across diverse medical fields such as radiology, ophthalmology, and dermatology.

Article
Computer Science and Mathematics
Computer Vision and Graphics

Yu Shang, Yinzhou Tang, Xin Zhang, Shengyuan Wang, Yuwei Yan, Honglin Zhang, Zhiheng Zheng, Jie Zhao, Jie Feng, Chen Gao, +2 authors

Abstract: World models have emerged as a pivotal research direction, with recent breakthroughs in generative AI underscoring their potential for advancing artificial general intelligence. For embodied AI, world models are critical for enabling robots to effectively understand, interact with, and make informed decisions in real-world physical environments. This survey systematically reviews recent progress in embodied world models, under a novel technical taxonomy. We hierarchically organize the field by model architectures, training methodologies, application scenarios, and evaluation approaches, thus offering researchers a clear technical roadmap. We first thoroughly discuss vision-based generative world models and latent space world models, along with their corresponding training paradigms. We then explore the multifaceted roles of embodied world models in robotic applications, from functioning as cloud-based simulation environments to on-device agent brains. Additionally, we summarize important evaluation dimensions for benchmarking embodied world models. Finally, we outline key challenges and provide insights into promising future research directions within this crucial domain. We summarize the representative works discussed in this survey at https://github.com/tsinghua-fib-lab/Awesome-Embodied-World-Model.

Article
Computer Science and Mathematics
Computer Vision and Graphics

Min Pang, Jichao Jiao, Yingjian Zhang

Abstract: Three-dimensional (3D) shape recognition is a fundamental task in computer vision, where view-based methods have recently achieved state-of-the-art performance. However, effectively capturing and exploiting the rich geometric correspondences between different views remains a key challenge, as such information is crucial for accurate shape representation. Existing methods often fall short in explicitly modeling these structured correlations, which limits their ability to fully leverage discriminative shape information. To address this limitation, we propose a novel View-based Graph Convolution and Sampling Fusion Network (View-GFN). View-GFN employs a hierarchical architecture that progressively coarsens the view-graph to learn multi-scale features. In this structure, views are treated as graph nodes, and a predefined-value strategy is introduced to initialize the adjacency matrix (AM) for constructing initial node correlations. For effective graph coarsening, we develop a novel view down-sampling method based on a cluster assignment matrix. Furthermore, a Graph Convolution and Sampling Fusion (CSF) module is designed to seamlessly integrate deep feature embeddings with the topological information derived from view down-sampling. Extensive experiments on benchmark datasets, including ModelNet40 and RGB-D, demonstrate that View-GFN achieves a superior recognition accuracy of 97.8%, surpassing previous methods while reducing the number of model parameters by nearly 50%. These results validate the superiority of our hierarchical fusion strategy in capturing multi-view geometric information both effectively and efficiently.

Article
Computer Science and Mathematics
Computer Vision and Graphics

Xiaoming Zhang, Rundong Zhuang

Abstract: Reliable test-tube detection on clinical conveyor lines remains difficult when tubes are densely packed, placed irregularly, weakly illuminated, partially blurred by robot vibration, and contaminated by glare from glass or PET surfaces. These conditions erode the short-axis boundary cues and faint graduation marks that slender tubes depend on. We therefore build WDA-TNET on YOLOv11 and target the failure modes at four points of the pipeline. First, WGSR restores blurred regions selectively based on wavelet energy, avoiding over-sharpening specular areas. Second, GSCIM suppresses glare-dominated channel responses in the backbone through direction-aware pooling and cross-channel interaction, retaining weak structural cues like liquid-level edges. Third, DCPAF separates height and width encoding in the neck, dynamically balancing long-axis context and short-axis localization suitable for elongated targets. Finally, ATSS and MPDIoU stabilize supervision when positives are sparse and boxes overlap only weakly. We evaluated our model on the newly constructed Complex Test Tube (CTT) dataset containing 11,955 images and 81,044 instances. WDA-TNET achieves 94.1% precision and 79.1% mAP50:95, improving mAP50:95 by 3.6 percentage points over YOLOv11. On the transparent-container HeinSight4 dataset, the model attains 95.2% mAP50:95, proving robust cross-domain generalization.

Dataset
Computer Science and Mathematics
Computer Vision and Graphics

Igor Garcia-Atutxa, Hodei Calvo-Soraluze, Francisca Villanueva-Flores

Abstract: Open, well-documented datasets are essential for the reproducible development of vision systems for urban utility management. This Data Descriptor presents a curated RGB object-detection benchmark of four classes associated with electrical distribution and street-level utility assets: Inspection Chamber, Overhead-to-Underground Transition, General Protection Box, and Transformer Substation. The public release contains 997 valid image-label pairs partitioned into 698 training, 150 validation, and 149 test images. Images were acquired during 2019 in multiple localities across Spain, predominantly with a mobile phone and, in occasional cases, using Google Maps as a complementary visual source, and were manually annotated with LabelImg before export to YOLO format. During curation, four invalid image-label pairs were removed because at least one YOLO bounding box exceeded the normalized image domain. The benchmark contains 1,939 object instances, with marked class imbalance: General Protection Box accounts for 50.2% of objects whereas Transformer Substation represents 4.7%. Images are heterogeneous in size and viewpoint, ranging from 90 × 170 to 4160 × 4032 pixels, with a median resolution of 619 × 544 pixels and a median of two annotated objects per image. The public GitHub release is organized into images/, labels/, and metadata/ directories; metadata stores split definitions, classes.txt, data.yaml, inventory information, annotation schema documentation, and diagnostic summary figures. Beyond detector benchmarking, the dataset can support scalable mapping of visible distribution-grid assets, with potential value for smart-city digital twins and data-informed EV charging deployment.
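The curation check described above, discarding image-label pairs whose YOLO bounding boxes exceed the normalized image domain, can be sketched directly. This assumes the standard YOLO label line format `class cx cy w h` with center coordinates and sizes normalized to [0, 1]; the function names are illustrative:

```python
def yolo_box_valid(line):
    """Check one YOLO label line ('class cx cy w h', normalized coordinates).
    A box is invalid if any edge falls outside the [0, 1] image domain."""
    _cls, cx, cy, w, h = line.split()
    cx, cy, w, h = map(float, (cx, cy, w, h))
    return (0.0 <= cx - w / 2 and cx + w / 2 <= 1.0
            and 0.0 <= cy - h / 2 and cy + h / 2 <= 1.0)

def label_file_valid(lines):
    """A pair is kept only if every box in its label file is in-domain."""
    return all(yolo_box_valid(line) for line in lines if line.strip())

print(label_file_valid(["2 0.50 0.50 0.20 0.30"]))   # True
print(label_file_valid(["0 0.95 0.50 0.20 0.10"]))   # False: right edge at 1.05
```

Running such a check over all label files reproduces the kind of screening that removed the four invalid pairs from the release.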

Article
Computer Science and Mathematics
Computer Vision and Graphics

Hangbiao Li, Haojun Mo, Xing Li, Tao Fang, Sikun Liu, Shuzhen Yu, Zhibo Rao

Abstract: Stereo matching has witnessed rapid advances on curated benchmarks, yet deploying models in unconstrained real-world environments remains a fundamental challenge. This paper presents a sparse self-prompt guided network (SSPGNet) for stereo matching with strong generalization across diverse environments. Our core innovation lies in a sparse self-prompt guidance mechanism: 1) a sparse disparity map, used as a prompt, is self-estimated from visual foundation model features via cost aggregation; and 2) the sparse disparity is progressively refined into dense disparity maps through cross-attention-based stereo feature interaction, enabling sparse-to-dense disparity prediction. Additionally, we collected a diverse set of indoor and outdoor stereo pairs using a ZED 2 camera to assess the real-world performance of our model. Extensive experiments demonstrate that the proposed sparse-to-dense prompt mechanism not only preserves the semantic awareness of visual foundation models but also enhances stereo correspondence reasoning, achieving strong performance on public benchmarks and our in-the-wild dataset. These results highlight the potential of SSPGNet for direct deployment in real-world stereo perception systems. The code and data will be made publicly available upon publication.

Article
Computer Science and Mathematics
Computer Vision and Graphics

Nizamuddin Maitlo, Nooruddin Noonari, Fayaz Ahmed, Afifa Hussain

Abstract: We present Color-in-Context, a dataset of 12,086 photographs annotated along two complementary dimensions: color (Black, Blue, Gray, Orange, Pink, Purple, Skyblue, White, Yellow) and illumination (fluorescentLight, indoor, indoorNight, sunLight). The dataset is organized into 36 joint categories (9 colors × 4 illumination conditions) using a consistent folder hierarchy and normalized labels. We provide summary counts across colors, illuminations, and selected joint buckets, and an optional manifest file to support deterministic indexing and integrity checking. This Data Descriptor documents dataset construction, label normalization, duplicate screening, and file-integrity checks, and provides usage guidance for split generation and reporting under varied illumination.
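The optional manifest for deterministic indexing and integrity checking can be built with stdlib hashing alone. A hedged sketch, assuming the manifest maps relative file paths to SHA-256 digests (the released manifest's exact schema and folder names may differ; the paths and bytes below are invented):

```python
import hashlib
import json

def file_digest(data: bytes) -> str:
    """SHA-256 hex digest used as a stable per-file fingerprint."""
    return hashlib.sha256(data).hexdigest()

def build_manifest(files: dict) -> str:
    """files maps relative path -> raw bytes. Sorting the keys gives
    deterministic indexing; re-hashing on load detects silent corruption."""
    entries = {path: file_digest(data) for path, data in sorted(files.items())}
    return json.dumps(entries, indent=2, sort_keys=True)

demo = {
    "Black_fluorescentLight/img_0001.jpg": b"...fake jpeg bytes...",
    "Black_fluorescentLight/img_0002.jpg": b"...more fake bytes...",
}
manifest = build_manifest(demo)
print(len(json.loads(manifest)))   # 2 entries, one per file
```

On load, recomputing each digest and comparing against the manifest verifies that no file was dropped, duplicated, or modified since release.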

Review
Computer Science and Mathematics
Computer Vision and Graphics

Ramona Kühlechner

Abstract: Precise segmentation of defects is a key component of industrial quality control. This paper presents a comprehensive overview of contemporary methods utilising convolutional neural networks that have demonstrated practical efficacy. Depending on the application, semantic, instance-based, panoptic and hybrid segmentation methods are used to reliably detect material defects. Finally, prospects for industrial use are discussed, including the optimisation of hybrid methods, real-time capability and integration into existing production processes to ensure efficient, robust and practical defect detection.

Article
Computer Science and Mathematics
Computer Vision and Graphics

Rongchang Lu, Yunzhi Jiang, Bingcheng Liao, Conghan Yue, Xin Hai, Guoxin Chen

Abstract: Real-time 2D imagery super-resolution (SR) in UAV remote sensing faces significant speed and resource-consumption bottlenecks during large-scale processing. To overcome this, we propose Semantic Injection State Modeling for Super-Resolution (SIMSR), an ultra-lightweight architecture that integrates land-cover semantics into a linear state-space model. This integration mitigates state forgetting inherent in linear processing by linking hierarchical features to persistent semantic prototypes, enabling high-fidelity image enhancement. The model achieves a state-of-the-art PSNR of 32.9+ for 4x SR on RSSCN7 agricultural grassland imagery. Furthermore, the implementation of geographically chunked (tile-based) parallel processing simultaneously eliminates computational redundancies, yielding a 10.85x inference speedup, a 54% memory reduction, and an 8.74x faster training time. This breakthrough facilitates practical real-time SR deployment on UAV platforms, demonstrating strong efficacy for ecological monitoring applications by providing the detailed imagery essential for accurate analysis.
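For reference, the PSNR figure quoted above is the standard peak signal-to-noise ratio between a reference image and its reconstruction. A minimal pure-Python version for 8-bit grayscale images (the toy pixel values are illustrative only):

```python
import math

def psnr(ref, test, peak=255.0):
    """Peak signal-to-noise ratio in dB between two equal-size grayscale images."""
    mse = sum((r - t) ** 2
              for row_r, row_t in zip(ref, test)
              for r, t in zip(row_r, row_t)) / (len(ref) * len(ref[0]))
    return float("inf") if mse == 0 else 10 * math.log10(peak ** 2 / mse)

ref  = [[100, 110], [120, 130]]
test = [[102, 108], [121, 128]]     # small reconstruction error
print(round(psnr(ref, test), 2))    # 43.01 (dB)
```

Higher values mean lower reconstruction error; identical images yield infinite PSNR, which is why it is reported alongside perceptual metrics in SR work.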

Article
Computer Science and Mathematics
Computer Vision and Graphics

Xiaoming Zhang, Rundong Zhuang

Abstract: In unmanned pharmacy and home-care medicine management applications, reliable pillbox localization is a prerequisite for automated dispensing and grasping. However, existing detectors still perform poorly in complex environments where dense stacking, occlusion, weak illumination, and high inter-class similarity occur simultaneously. To address this problem, GSPM-YOLO is proposed as an improved detector built on the YOLOv11 framework for complex pillbox recognition, and four novel plug-and-play lightweight modules are developed: GSimConv, a lightweight dual-branch convolution module that incorporates the attention-weight calculation algorithm in HardSAM for edge-preserving feature extraction; PSCAM, for position-sensitive coordinate attention; MSAAM, a multi-scale strip-pooling module that integrates the horizontal context-aware attention-weight calculation algorithm to strengthen occluded targets; and LGPFH, for bidirectional ghost pyramid fusion. To simulate the complex operating environments of dispensing robots, we construct MBox-Complex, a dataset of 3,041 images with 8,153 annotations across 25 drug categories. Ablation experiments first validate the effectiveness of the four-module composition, with F1 rising from 0.641 to 0.714, and each module is then individually compared with advanced replacement schemes in dedicated substitution experiments to verify its effectiveness. The integrated model is then benchmarked against advanced detectors and domain-specific methods on the self-constructed MBox-Complex dataset, achieving 0.727 mAP@50 and 0.427 mAP@50-95 with 3.8M parameters and surpassing YOLOv11 by 7.1 and 4.0 percentage points and YOLOv12 by 4.3 and 3.1 percentage points, respectively. Further cross-dataset evaluation on the VOC and Brain Tumor benchmark datasets verifies the transferability of the proposed model.
Grad-CAM is adopted to visualize the detector's attention distribution, and the resulting heatmaps together with detection visualizations confirm that the proposed model focuses more precisely on stacked and occluded regions.

Article
Computer Science and Mathematics
Computer Vision and Graphics

Ruize Xia

Abstract: Downstream adaptation of a contrastively pretrained vision-language model can improve in-domain accuracy while degrading performance on unseen transfer tasks. This study examines how full fine-tuning and low-rank adaptation alter attention heatmaps under a controlled design that matches learning rate across adaptation methods. The completed matched-learning-rate matrix contains 80 runs using the OpenAI Contrastive Language-Image Pretraining model with a base 32-patch vision transformer image encoder, two datasets (EuroSAT and Oxford-IIIT Pets), four shared learning rates (1e-6, 5e-6, 1e-5, and 5e-5), and five random seeds. We measure classification-token-to-patch attention entropy, the fraction of patches required to capture 95% of attention mass, attention concentration, head diversity, in-domain validation accuracy, and adapter-aware zero-shot accuracy on CIFAR-100. Three findings emerge. First, learning rate is a primary determinant of structural drift: on EuroSAT, full fine-tuning moves from entropy broadening at 1e-6 (+1.83%) to marked contraction at 5e-5 (-3.99%), whereas low-rank adaptation remains entropy-positive across the full matched grid (+0.68% to +1.50%). Second, low-rank adaptation preserves out-of-domain transfer substantially better than full fine-tuning at matched learning rates: averaged across the EuroSAT grid, zero-shot accuracy on CIFAR-100 is 45.13% for low-rank adaptation versus 11.28% for full fine-tuning; on Oxford-IIIT Pets, the corresponding averages are 58.01% and 8.54%. Third, Oxford-IIIT Pets exhibits a clear interaction with optimization scale: low-learning-rate low-rank adaptation underfits the in-domain task, so method-only averages can obscure the regime in which it becomes competitive. Additional rollout, patch-to-patch, centered-kernel-alignment, and backbone analyses are directionally consistent with these controlled results.
Across both controlled datasets, runs with broader retained attention support also retain more zero-shot performance. Taken together, these findings support attention heatmap drift as an informative descriptive lens on model adaptation while arguing against a universal interpretation of the observed behavior as a single collapse phenomenon.

Article
Computer Science and Mathematics
Computer Vision and Graphics

Donald Martin, Blake Bowman

Abstract: Incomplete 3D point clouds present a significant challenge in diverse applications due to sensor limitations and occlusions. Existing methods often struggle to balance local detail accuracy and global structural integrity, frequently yielding artifacts or distortions due to reliance on symmetric Chamfer Distance or unguided contrastive losses. We propose Dual-Constraint Contrastive Completion (DCCC), an end-to-end framework integrating asymmetric weighted Chamfer distance with multi-granularity contrastive learning. DCCC utilizes an encoder-decoder backbone with Mamba layers for efficient feature extraction. Central to the framework is the Asymmetric Contrastive Chamfer Loss (ACCL), which decouples local-precision and global-integrity objectives into distinct contrastive components optimized via dynamic asymmetric weighting. A Self-Supervised Structural Guidance (SSG) module further learns coarse structural priors directly from incomplete inputs, reducing annotation reliance and improving robustness. Extensive experiments on benchmark datasets demonstrate DCCC's superior performance. DCCC achieves best-in-class results across critical metrics, significantly enhancing structural completeness and fine-grained accuracy in diverse settings, including real-world scenarios and high sparsity.
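The asymmetric weighting idea behind ACCL can be illustrated on the plain Chamfer distance: weighting the prediction-to-GT and GT-to-prediction terms separately decouples local precision from global coverage. A toy sketch on 3D points (the weights, points, and function names are illustrative; the full ACCL adds contrastive components not shown here):

```python
def nn_sq_dist(p, cloud):
    """Squared distance from point p to its nearest neighbor in cloud."""
    return min(sum((a - b) ** 2 for a, b in zip(p, q)) for q in cloud)

def asymmetric_chamfer(pred, gt, w_fwd=1.0, w_bwd=1.0):
    """Weighted two-sided Chamfer distance. The forward term measures how
    close predictions lie to the ground truth (local precision); the backward
    term penalizes ground-truth points left uncovered (global coverage).
    Unequal weights decouple the two objectives."""
    fwd = sum(nn_sq_dist(p, gt) for p in pred) / len(pred)
    bwd = sum(nn_sq_dist(q, pred) for q in gt) / len(gt)
    return w_fwd * fwd + w_bwd * bwd

pred = [(0, 0, 0), (1, 0, 0), (2.5, 0, 0)]   # one stray prediction at x = 2.5
gt = [(0, 0, 0), (1, 0, 0), (2, 0, 0)]

print(round(asymmetric_chamfer(pred, gt, 1.0, 1.0), 4))  # symmetric baseline
print(round(asymmetric_chamfer(pred, gt, 2.0, 0.5), 4))  # emphasize precision
```

With w_fwd > w_bwd the loss punishes stray predictions more than uncovered regions, which is the kind of trade-off the symmetric Chamfer distance cannot express.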

Article
Computer Science and Mathematics
Computer Vision and Graphics

Valli Nayagam V, Anukarthika S, Muhesh Krishnaa S, Sri Sathya K B

Abstract: The rapid expansion of sports broadcasting and digital media platforms has increased the demand for intelligent systems capable of automatically identifying important sports events for real-time analytics and highlight generation. Manual annotation of sports videos requires significant time and effort and may introduce human errors during analysis. This paper presents a real-time sports action recognition framework using a hybrid CNN–Transformer architecture for detecting critical events in football and cricket videos. The proposed system processes live or recorded video streams through frame extraction, normalization, and spatial feature learning using the MobileNetV2 network. Temporal relationships between consecutive frames are modeled using a Transformer encoder to improve action understanding. The framework classifies events such as pass and goal in football, and four, six, and wicket in cricket. Motion-based filtering and confidence thresholding reduce non-action frames and improve prediction reliability. Detected events are recorded with timestamps and displayed using broadcast-style overlays to support automated highlight generation. Experimental evaluation demonstrates high recognition accuracy and efficient real-time performance on low-cost hardware platforms. The framework provides an effective solution for sports analytics, media automation, and intelligent decision-support systems.
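The two gating steps mentioned above, motion-based filtering and confidence thresholding, compose as a simple per-frame conjunction. A minimal sketch on toy 2x2 grayscale frames (the thresholds, toy data, and `filter_events` helper are hypothetical, not the paper's exact pipeline):

```python
def mean_abs_diff(frame_a, frame_b):
    """Average absolute pixel difference between consecutive grayscale frames."""
    n = len(frame_a) * len(frame_a[0])
    return sum(abs(a - b)
               for row_a, row_b in zip(frame_a, frame_b)
               for a, b in zip(row_a, row_b)) / n

def filter_events(frames, preds, motion_thr=10.0, conf_thr=0.8):
    """Keep an event only when the frame shows motion AND the classifier is
    confident; a sketch of the two-filter gating described in the abstract."""
    events = []
    for i in range(1, len(frames)):
        label, conf = preds[i]
        if mean_abs_diff(frames[i - 1], frames[i]) >= motion_thr and conf >= conf_thr:
            events.append((i, label, conf))
    return events

frames = [
    [[0, 0], [0, 0]],      # t=0
    [[0, 0], [0, 0]],      # t=1: static, so even a confident prediction is dropped
    [[50, 50], [50, 50]],  # t=2: strong motion
]
preds = [("none", 0.0), ("goal", 0.95), ("goal", 0.90)]
print(filter_events(frames, preds))   # [(2, 'goal', 0.9)]
```

Surviving events carry their frame index, which maps directly to the timestamped records used for the broadcast-style highlight overlays.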

Article
Computer Science and Mathematics
Computer Vision and Graphics

Jianhua Zhu, Changjiang Liu, Danling Liang

Abstract: Multi-modal remote sensing image registration is a challenging task due to differences in resolution, viewpoint, and intensity, which often leads to inaccurate and time-consuming results with existing algorithms. To address these issues, we propose an algorithm based on Curvature Scale Space Contour Point Features (CSSCPF). Our approach combines multi-scale Sobel edge detection, dominant direction determination, an improved curvature scale space corner detector, a new gradient definition, and enhanced SIFT descriptors. Test results on publicly available datasets show that our algorithm outperforms existing methods in overall performance. Our code will be released at https://github.com/JianhuaZhu-IR.

Review
Computer Science and Mathematics
Computer Vision and Graphics

Ge Gao, Chen Feng, Yuxuan Jiang, Tianhao Peng, Ho Man Kwan, Siyue Teng, Chengxi Zeng, Yixuan Li, Changqi Wang, Robbie Hamilton, +3 authors

Abstract: While conventional video coding standards remain predominant in real-world applications, neural video compression has emerged over the past decade as an active research area, offering alternative solutions with potentially significant coding gains through end-to-end optimization. Owing to the rapid pace of recent progress, existing reviews of neural video coding quickly become outdated and often lack a systematic taxonomy and meaningful benchmarking. To address this gap, we provide a comprehensive review of two major classes of neural video codecs - scene-agnostic and scene-adaptive - with a focus on their design characteristics and limitations. More importantly, we benchmark representative state-of-the-art methods from each category under common test conditions recommended by video coding standardization bodies. This provides, to the best of our knowledge, the first large-scale unified comparison between conventional and neural video codecs under controlled settings. Our results show that neural codecs can already achieve competitive, and in some cases superior, performance relative to VTM and AVM, although they still fall short of ECM in overall coding efficiency under both Low Delay and Random Access configurations. To facilitate future algorithm benchmarking, we will release the full implementations and results at https://nvc-review-2025.github.io, thereby providing a useful resource for the video compression research community.

Article
Computer Science and Mathematics
Computer Vision and Graphics

Marc Tornero-Soria, Antonio-José Sánchez-Salmerón, Eduardo Vendrell-Vidal

Abstract: Public YOLO model releases typically provide high-level architectural descriptions and headline benchmark results, but offer limited empirical attribution of performance to individual blocks under controlled training conditions. This paper presents a modular, block-level analysis of YOLO26’s object detection architecture, detailing the design, function, and contribution of each component. We systematically examine YOLO26’s convolutional modules, bottleneck-based refinement blocks, spatial pyramid pooling, and position-sensitive attention mechanisms. Each block is analyzed in terms of objective and internal flow. In parallel, we conduct targeted ablation studies to quantify the effect of key design choices on accuracy (mAP50–95) and inference latency under a fixed, fully specified training and benchmarking protocol. Experiments use the MS COCO [1] dataset with the standard train2017 split (≈118k images) for training and the full val2017 split (5k images) for evaluation. The result is a self-contained technical reference that supports interpretability, reproducibility, and evidence-based architectural decision-making for real-time detection models.

Article
Computer Science and Mathematics
Computer Vision and Graphics

Longcheng Huang, Mengguang Liao, Shaoning Li, Chuanguang Zhu, Sichun Long

Abstract: Maritime search and rescue is an important component of emergency response frameworks and primarily relies on UAVs for maritime object detection. However, maritime accidents frequently occur in low-visibility environments, such as foggy or low-light conditions, which lead to low contrast, blurred object boundaries, and degraded texture representations. Most existing maritime object detection algorithms are developed for natural light scenes, and their performance deteriorates markedly when deployed directly in low-visibility environments, primarily due to reduced image quality that hinders feature extraction and semantic information aggregation. Although several studies incorporate image enhancement techniques prior to detection to improve image quality, these approaches often introduce significant additional computational overhead, limiting their practical deployment on UAV platforms. To tackle these challenges, this paper proposes a lightweight model built upon a recent YOLO framework, termed Multi-Scale Adaptive YOLO (MSA-YOLO), for maritime detection using UAVs in low-visibility environments. The proposed model systematically optimizes the backbone, neck, and detection head networks. Specifically, an improved StarNet backbone is designed by integrating ECA mechanisms and multi-scale convolutional kernels, which strengthen feature extraction capability while maintaining low computational overhead. In the neck network, a high-frequency enhanced residual block branch is inserted into the C3k2 module to capture richer detailed information, while depthwise separable convolution is utilized to further reduce computational cost. Moreover, a non-parametric attention module is incorporated into the detection head to adaptively optimize features in the classification and regression branches. 
Finally, a joint loss function that combines bounding box regression, classification, and distribution focal losses is utilized to improve detection accuracy and training stability. Experimental results on the constructed AFO, Zhoushan Island, and Shandong Province datasets demonstrate that, relative to YOLOv11-s, MSA-YOLO reduces model parameters and FLOPs by 52.07% and 41.36%, respectively, while achieving improvements of 1.11% and 1.33% in mAP@0.5:0.95 and mAP@0.5. These results indicate that the proposed method effectively balances computational efficiency and detection accuracy, rendering it suitable for practical maritime search and rescue applications in low-visibility environments.

Article
Computer Science and Mathematics
Computer Vision and Graphics

Xiaoming Chen, Xiaoyu Jiang, Yingqing Huang, Xi Wang, Chaoqun Ma

Abstract: We propose a field-transformation-based framework for generating phase-only light-field holograms from a single RGB image. The method establishes an explicit pipeline from monocular scene inference to holographic wavefront synthesis, without requiring multi-view capture or task-specific hologram-network training. First, we construct a layered occlusion RGB-D model from the input image using monocular depth estimation, connectivity-based layer decomposition, and occlusion-aware inpainting, which provides a lightweight 3D prior for sparse-view rendering in the small-parallax regime. Second, we transform the rendered sparse RGB-D light field into a target complex wavefront on the recording plane through local frequency mapping, thereby bridging explicit scene geometry and wave-optical field construction. Third, we optimize the phase-only hologram under multi-plane amplitude constraints using a geometrically consistent initial phase and an error-driven adaptive depth-sampling strategy, which improves convergence stability and reconstruction quality under a limited computational budget. Numerical experiments show that the proposed method achieves better depth continuity, occlusion fidelity, and lower speckle noise than representative layer-based and point-based methods, and improves the average PSNR and SSIM by approximately 3 dB and 0.15, respectively, over Hogel-Free Holography. Optical experiments further confirm the physical feasibility and robustness of the proposed framework.


Preprints.org is a free preprint server supported by MDPI in Basel, Switzerland.


© 2026 MDPI (Basel, Switzerland) unless otherwise stated