Computer Science and Mathematics


Article
Computer Science and Mathematics
Computer Vision and Graphics

Wei Chen, Jiing Fang

Abstract: Video Frame Interpolation (VFI) is critical for generating smooth slow-motion and increasing video frame rates, yet it faces significant challenges in achieving high fidelity, accurate motion modeling, and robust spatiotemporal consistency, particularly for large displacements and occlusions. This paper introduces TemporalFlowDiffuser (TFD), a novel end-to-end latent space diffusion Transformer designed to overcome these limitations with exceptional efficiency and quality. TFD employs a lightweight Video Autoencoder to compress frames into a low-dimensional latent space. A Spatiotemporal Transformer models complex spatiotemporal dependencies and motion patterns, augmented by auxiliary latent optical flow features. Leveraging Flow Matching as its diffusion scheduler, TFD achieves high-quality frame generation with remarkably few denoising steps, making it highly suitable for real-time applications. Our extensive experiments on a challenging high-motion dataset demonstrate that TFD significantly outperforms state-of-the-art methods like RIFE across metrics such as PSNR, SSIM, and VFID, showcasing superior visual quality, structural similarity, and spatiotemporal consistency. Furthermore, human evaluation confirms TFD's enhanced perceptual realism and temporal smoothness, validating its efficacy in generating visually compelling and coherent video content.
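TFD's efficiency claim rests on Flow Matching, which generates samples by integrating a learned velocity field over a handful of steps instead of running a long denoising chain. A minimal sketch of that sampling loop, using a toy closed-form velocity field in place of the trained Spatiotemporal Transformer (the field, target latent, and step count are illustrative assumptions, not the authors' model):

```python
import numpy as np

def euler_sample(v_field, x0, n_steps=8):
    """Flow-matching sampling: integrate dx/dt = v(x, t)
    from t = 0 (noise) to t = 1 (data) with a few Euler steps."""
    x, dt = x0, 1.0 / n_steps
    for i in range(n_steps):
        t = i * dt
        x = x + dt * v_field(x, t)
    return x

# Toy stand-in for the trained network: for linear probability paths
# x_t = (1 - t) * x_noise + t * x_data, the target velocity is
# (x_data - x_t) / (1 - t); here x_data is a fixed "latent".
target = np.array([1.0, -2.0, 0.5])
v = lambda x, t: (target - x) / max(1.0 - t, 1e-3)

latent = euler_sample(v, np.zeros(3), n_steps=8)
print(np.round(latent, 3))    # reaches the target latent
```

In TFD the velocity field would be the transformer operating on Video Autoencoder latents; the few-step integration loop is what makes the sampler cheap enough for real-time use.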
Article
Computer Science and Mathematics
Computer Vision and Graphics

Rajarshi Karmakar, Ciaran Eising, Rekha Ramachandra, Sahil Zaidi

Abstract: We propose SuperSegmentation, a unified, fully-convolutional architecture for semantic keypoint correspondence in dynamic urban scenes. The model extends SuperPoint’s self-supervised interest point detector–descriptor backbone with a DeepLab-style Atrous Spatial Pyramid Pooling head for semantic segmentation and a lightweight sub-pixel regression branch. Using Cityscapes camera intrinsics and extrinsics to construct geometry-aware homographies, SuperSegmentation jointly predicts keypoints, descriptors, semantic labels (e.g., static vs. dynamic classes), and sub-pixel offsets from a shared encoder. Our experiments are conducted on Cityscapes, where a backbone pretrained on MS-COCO with strong random homographies over approximately planar images is fine-tuned with deliberately attenuated synthetic warps, as we found that reusing the aggressive COCO-style homographies on Cityscapes produced unrealistically large distortions. Within this controlled setting, we observe that adding semantic masking and sub-pixel refinement consistently improves stability on static structures and suppresses keypoints on dynamic or ambiguous regions.
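At the pixel level, the homographic warps this pipeline trains with reduce to mapping coordinates through a 3x3 matrix. A self-contained sketch (the matrix below is an illustrative translation; the paper derives its homographies from Cityscapes intrinsics and extrinsics):

```python
import numpy as np

def apply_homography(H, pts):
    """Map an (N, 2) array of pixel coordinates through a 3x3 homography."""
    pts_h = np.c_[pts, np.ones(len(pts))]     # lift to homogeneous coordinates
    mapped = pts_h @ H.T
    return mapped[:, :2] / mapped[:, 2:3]     # divide out the projective scale

pts = np.array([[0.0, 0.0], [2.0, 3.0]])
H_shift = np.eye(3)
H_shift[0, 2] = 1.0                           # pure x-translation homography
print(apply_homography(H_shift, pts))         # [[1. 0.] [3. 3.]]
```

Attenuating the synthetic warps, as the abstract describes, amounts to sampling `H` closer to the identity matrix than the aggressive COCO-style perturbations.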
Article
Computer Science and Mathematics
Computer Vision and Graphics

Lorenzo Alejandro Matadamas-Torres, Juan Regino Maldonado, Idarh Matadamas, Luis Alberto Alonso-Hernandez, Manuel de Jesus Melo-Monterrey, Lorena Juith Ramírez-López, Luis Enrique Rodríguez-Antonio

Abstract: The article analyzes scientific information concerning the effects of computer science on the creative industries, an approach that has been consolidated as a driver of the global economy fundamentally based on knowledge, innovation, and creativity. A bibliometric review of articles in the Scopus database (1983–November 2025) was applied to evaluate the conceptual evolution, fundamental themes, and most influential authors. The research was developed in three phases: (1) search criteria within the research field, (2) performance analysis, and (3) results analysis. The results showed a steady increase in the production of studies, particularly since 2019 due to the COVID-19 pandemic, focusing primarily on the areas of digitalization, innovation, and artificial intelligence. The authors with the highest number of publications originate from China, Indonesia, and Malaysia. The research determines the convergence between computing and creativity, which constitutes a strategic opportunity for the global economy. However, it acknowledges restrictions linked to the period of analysis and the dependence on a single database, thus suggesting future studies be expanded to include other sources and temporal contexts.
Article
Computer Science and Mathematics
Computer Vision and Graphics

Pengju Liu, Hongzhi Zhang, Chuanhao Zhang, Feng Jiang

Abstract: In clinical CT imaging, high-density metallic implants often induce severe metal artifacts that obscure critical anatomical structures and degrade image quality, thereby hindering accurate diagnosis. Although deep learning has advanced CT metal artifact reduction (CT-MAR), many methods do not effectively use frequency information, which can limit the recovery of both fine details and overall image structure. To address this limitation, we propose a Hybrid-Frequency-Aware Mixture-of-Experts (HFMoE) network for CT-MAR. The proposed method synergizes the spatial-frequency localization of the wavelet transform with the global spectral representation of the Fourier transform to achieve precise multi-scale modeling of artifact characteristics. Specifically, we design a Hybrid-Frequency Interaction Encoder with three specialized branches, incorporating wavelet-domain, Fourier-domain, and cascaded wavelet–Fourier modulation, to distinctively refine local details, global structures, and complex cross-domain features. Then, they are fused via channel attention to yield a comprehensive representation. Furthermore, a frequency-aware Mixture-of-Experts (MoE) mechanism is introduced to dynamically route features to specific frequency experts based on the degradation severity, thereby adaptively assigning appropriate receptive fields to handle varying metal artifacts. Evaluations on synthetic (DeepLesion) and clinical (SpineWeb, CLINIC-metal) datasets show that HFMoE outperforms existing methods in both quantitative metrics and visual quality. Our method demonstrates the value of explicit frequency-domain adaptation for CT-MAR and could inform the design of other image restoration tasks.
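The two frequency views that HFMoE's encoder branches operate on can be illustrated generically: a wavelet transform yields spatially localized subbands, while a 2-D FFT yields a global spectrum. A minimal sketch using a hand-written one-level Haar transform (this is the standard decomposition, not the HFMoE modules themselves):

```python
import numpy as np

def haar_level1(img):
    """One-level 2-D Haar transform: (LL, LH, HL, HH) half-resolution subbands."""
    a = img[0::2, 0::2]; b = img[0::2, 1::2]
    c = img[1::2, 0::2]; d = img[1::2, 1::2]
    LL = (a + b + c + d) / 4    # low-frequency approximation
    LH = (a + b - c - d) / 4    # horizontal detail
    HL = (a - b + c - d) / 4    # vertical detail
    HH = (a - b - c + d) / 4    # diagonal detail
    return LL, LH, HL, HH

img = np.random.rand(8, 8)
LL, LH, HL, HH = haar_level1(img)                 # local, spatially indexed
mag = np.abs(np.fft.fftshift(np.fft.fft2(img)))   # global Fourier magnitude
print(LL.shape, mag.shape)                        # (4, 4) (8, 8)
```

The wavelet subbands keep spatial indexing (useful for local artifact detail), whereas every FFT coefficient mixes the whole image (useful for global streak structure); HFMoE's cascaded branch composes the two.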
Article
Computer Science and Mathematics
Computer Vision and Graphics

Doo-Ho Choi, Youn-Lee Oh, Minji Oh, Eun-Ji Lee, Sung-I Woo, Minseek Kim, Ji-Hoon Im

Abstract: Mushrooms have long been economically and nutritionally important crops, and recent advances in digital agriculture have increased interest in automating phenotypic evaluation. Because of the limitations of traditional phenotype assessment, various artificial intelligence (AI) models, including YOLOv8, have been introduced to evaluate mushroom phenotypes non-destructively and efficiently. However, unlike for earlier models, few studies have assessed mushroom phenotypes with YOLOv11. In this study, mushroom phenotype analysis with YOLOv8 and YOLOv11 was compared using Pleurotus ostreatus and Flammulina velutipes. All images were captured under controlled conditions and preprocessed for model evaluation. The results demonstrated that YOLOv11 achieved segmentation accuracy comparable to YOLOv8 (ΔmAP50–95 < 0.01) while substantially improving computational efficiency, with a reduction of approximately 15–20%. In validation against physical measurements of mushroom phenotype, both models showed biologically meaningful, moderate correlations across phenotypic traits (r ≈ 0.2–0.44; R² ≈ 0.72–0.83), confirming that YOLO-derived measurements captured essential dimensional variation. Inter-model comparisons revealed strong consistency (r ≥ 0.94, R² ≥ 0.96, MAE ≤ 0.40), indicating that YOLOv11 maintained the predictive reliability of YOLOv8 while operating with superior computational efficiency. This study establishes YOLOv11 as a robust foundation for AI-assisted digital breeding and automated quality monitoring systems in fungal research and precision agriculture.
Article
Computer Science and Mathematics
Computer Vision and Graphics

Huazhong Wang, Xuetao Wang, Lihua Sun, Qingchao Jiang

Abstract: Pipelines play a critical role in industrial production and daily life as essential conduits for transportation. However, defects frequently arise because of environmental and manufacturing factors, posing potential safety hazards. To address the limitations of traditional object detection methods, such as inefficient feature extraction and loss of critical information, this paper proposes an improved algorithm named FALW-YOLOv8, based on YOLOv8. The FasterBlock is integrated into the C2f module to replace standard convolutional layers, thereby reducing redundant computations and significantly enhancing the efficiency of feature extraction. Additionally, the ADown module is employed to improve multi-scale feature retention, while the LSKA attention mechanism is incorporated to optimize detection accuracy, particularly for small defects. The Wise-IoU v2 loss function is adopted to refine bounding box precision for complex samples. Experimental results demonstrate that the proposed FALW-YOLOv8 achieves a 5.8% improvement in mAP50, alongside a 34.8% reduction in parameters and a 30.86% decrease in computational cost. This approach effectively balances accuracy and efficiency, making it suitable for real-time industrial inspection applications.
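Wise-IoU v2 builds on the plain intersection-over-union between predicted and ground-truth boxes, adding a dynamic focusing weight for hard samples (not shown here). The base quantity, as a sketch:

```python
def box_iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))   # intersection width
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))   # intersection height
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

# Two 2x2 boxes overlapping in a unit square: inter = 1, union = 7
print(box_iou((0, 0, 2, 2), (1, 1, 3, 3)))   # 0.142857...
```

An IoU-based loss is typically `1 - IoU` (plus penalty terms in the CIoU/Wise-IoU family), so better-aligned boxes incur a smaller loss.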

Article
Computer Science and Mathematics
Computer Vision and Graphics

Yan Wang, Yingchong Wang, Xiuqi Zhang, Xiaoyu Ding

Abstract: Subtle vibrations induced in everyday objects by ambient sound, especially speech, carry rich acoustic cues that can potentially be transformed into meaningful text. This paper presents a Vision-based Subtitle Generator (VSG). This is the first attempt to recover text directly from high-speed videos of sound-induced object vibrations using a generative approach. To this end, VSG introduces a phase-based motion estimation (PME) technique that treats each pixel as an “independent microphone”, and extracts thousands of pseudo-acoustic signals from a single video. Meanwhile, the pretrained Hidden-unit Bidirectional Encoder Representations from Transformers (HuBERT) serves as the encoder of the proposed VSG-Transformer architecture, effectively transferring large-scale acoustic representation knowledge to the vibration-to-text task. These strategies significantly reduce reliance on large volumes of video data. Experimentally, text was generated from vibrations induced in a bag of chips by AISHELL-1 audio samples. Two VSG-Transformer variants with different parameter scales (Base and Large) achieved character error rates of 13.7% and 12.5%, respectively, demonstrating the remarkable effectiveness of the proposed generative approach. Furthermore, experiments using signal upsampling techniques showed that VSG-Transformer performance was promising even when low-frame-rate videos were used, highlighting the potential of the proposed VSG approach in scenarios featuring conventional cameras.
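The "pixel as an independent microphone" idea can be sketched without the phase-based machinery: each pixel's intensity over time already forms a crude pseudo-acoustic signal. The paper's PME uses steerable-pyramid phase, which is far more sensitive; the synthetic flicker, frame rate, and amplitude below are illustrative assumptions only:

```python
import numpy as np

def pixel_signals(frames):
    """Treat each pixel as a 'microphone': return per-pixel,
    zero-mean intensity time series, shape (H*W, T)."""
    t, h, w = frames.shape
    sig = frames.reshape(t, h * w).T.astype(float)
    return sig - sig.mean(axis=1, keepdims=True)

# Synthetic 'vibration': a 50 Hz intensity flicker filmed at 1000 fps
fps, t = 1000, 256
base = np.random.rand(4, 4)
flicker = 0.01 * np.sin(2 * np.pi * 50 * np.arange(t) / fps)
frames = base[None] + flicker[:, None, None]

sig = pixel_signals(frames)
peak_bin = np.abs(np.fft.rfft(sig[0])).argmax()
print(peak_bin * fps / t)    # close to 50 Hz, the flicker frequency
```

Stacking thousands of such per-pixel signals (here 16, in the paper one per pixel) is what gives the downstream VSG-Transformer a redundant, denoisable input.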
Review
Computer Science and Mathematics
Computer Vision and Graphics

Mohammad Osman Khan, Imran Khan Apu

Abstract: Human Action Recognition (HAR) has grown into one of the most active areas in computer vision, finding uses in healthcare, smart homes, security, autonomous driving, and even human–robot interaction. Over the past decade, deep learning has transformed how HAR is approached. Instead of relying on handcrafted features, modern models learn directly from raw data, whether that comes from RGB videos, skeleton sequences, depth maps, wearable devices, or wireless signals. Existing surveys typically focus on either technical architectures or specific modalities, lacking comprehensive integration of recent advances, practical applications, and explainability. This survey addresses this gap by examining cutting-edge deep learning methods alongside their real-world deployment in fall detection, rehabilitation monitoring, and navigation systems. We analyze emerging techniques driving HAR forward: transformer architectures for temporal modeling, self-supervised learning reducing annotation requirements, contrastive learning for robust representations, and graph neural networks excelling in skeleton-based recognition through joint relationship modeling. Advanced approaches, including few-shot and meta-learning, enable novel activity recognition with limited data, while cross-modal learning facilitates knowledge transfer between sensor modalities. Federated learning preserves privacy across distributed devices, neural architecture search automates design optimization, and domain adaptation improves generalization across environments and populations, collectively advancing HAR toward efficient, adaptable, deployment-ready solutions. By synthesizing recent advances, real-world applications, and explainability requirements, this survey provides researchers and practitioners a consolidated roadmap for developing HAR systems that are accurate, interpretable, and ready for practical deployment across diverse domains.
Article
Computer Science and Mathematics
Computer Vision and Graphics

Stefanie Amiruzzaman, Md. Amiruzzaman, James Dracup, Alexander Pham, Benjamin Crocker, Linh Ngo, M. Ali Akber Dewan

Abstract: This study aims to develop a system for translating American Sign Language (ASL) to and from English, enhancing accessibility for ASL users. We leveraged a publicly available dataset to train a model that accurately predicts ASL signs and their English translations. The system employs AI-based transformers for bidirectional translation: converting text and speech into ASL using computer vision and translating ASL signs into text. For user accessibility, we built a web-based interface that integrates a computer vision framework (MediaPipe) to detect key body landmarks, including hands, shoulders, and facial features. This enables the system to process text, speech input, and video recordings, which are stored using msgpack and analyzed to generate ASL imagery. Additionally, we are developing a transformer model that is trained jointly on pairs of gloss sequences and sentences using connectionist temporal classification (CTC) and cross-entropy loss. Along with that, we are utilizing an EfficientNet-B0 pretrained on the ImageNet dataset with 1D convolution blocks to extract features from video frames, helping facilitate the conversion of ASL signs into structured English text.
Article
Computer Science and Mathematics
Computer Vision and Graphics

Mingxuan Du, Yutian Zeng

Abstract: Automated fine-grained classification of gastrointestinal diseases from endoscopic images is vital for early diagnosis yet remains difficult due to limited annotated data, large appearance variations, and the subtle nature of many lesions. Existing few-shot and relational learning approaches often struggle with handling drastic viewpoint shifts, complex contextual cues, and distinguishing visually similar pathologies under scarce supervision. To address these challenges, we introduce an Adaptive Contextual-Relational Network (ACRN), an end-to-end framework tailored for robust and efficient few-shot classification in gastrointestinal imaging. ACRN incorporates an adaptive contextual-relational module that fuses multi-scale contextual encoding with dynamic graph-based matching to enhance both feature representation and relational reasoning. An enhanced task interpolation strategy further enriches task diversity by generating more realistic virtual tasks through feature similarity–guided interpolation. Together with a lightweight encoder equipped with spatial attention and an efficient attention routing mechanism, the model achieves strong discriminative capability while maintaining practical computational efficiency. Experiments on a challenging benchmark demonstrate improved accuracy and stability over prior methods, with ablation studies confirming the contribution of each component. ACRN also shows resilience to common image perturbations and provides performance comparable to experienced clinicians on particularly difficult cases, underscoring its potential as a supportive tool in clinical workflows.
Article
Computer Science and Mathematics
Computer Vision and Graphics

Xinyu Liu, Xiongwei Sun, Jile Wang

Abstract: Accurate detection of small objects in remote sensing imagery remains challenging due to their limited texture, sparse features, and weak contrast. To address this, an enhanced small object detection model integrating top-down and bottom-up attention mechanisms is proposed. First, we design two statistical model-constrained feature pre-extraction networks to enhance the spatial patterns of small objects before feeding them into the backbone network. Next, a top-down attention mechanism followed by an overview-then-refinement process is employed to guide region-level feature extraction. Finally, a bottom-up feature fusion strategy is utilized to integrate micro features and macro structural features in a bottom-up manner, enhancing the representational capacity of limited features for small objects. Evaluations on the AI-TOD and SODA-A datasets show that our method outperforms existing benchmark models. On the AI-TOD dataset, it improves AP and AP0.5 by 0.3% and 0.7%, respectively. More notably, on the more challenging SODA-A dataset, it achieves significant gains of 2.6% in AP and 1.4% in AP0.5. These consistent enhancements across different datasets verify the effectiveness of our method in boosting the detection performance, particularly for small targets.
Article
Computer Science and Mathematics
Computer Vision and Graphics

Lin Huang, Xubin Ren, Daiming Qu, Lanhua Li, Jing Xu

Abstract: Optical Fiber Composite Overhead Ground Wire (OPGW) cables serve dual functions in power systems: lightning protection and critical communication infrastructure for real-time grid monitoring. Accurate OPGW identification during UAV inspections is essential to prevent miscuts and maintain power communication functionality. However, detecting small, twisted OPGW segments among visually similar ground wires is challenging, particularly given the computational and energy constraints of edge-based UAV platforms. We propose OPGW-DETR, a lightweight detector based on the D-FINE framework, optimized for low-power operation to enable reliable onboard detection. The model incorporates two key innovations: multi-scale convolutional global average pooling (MC-GAP), which fuses spatial features across multiple receptive fields and integrates frequency-domain information for enhanced fine-grained representation, and a hybrid gating mechanism that dynamically balances global and spatial features while preserving original information through residual connections. By enabling real-time onboard inference with minimal energy consumption, OPGW-DETR addresses UAV battery and bandwidth limitations while ensuring continuous detection capability. Evaluated on a custom OPGW dataset, the S-scale model achieves a 3.9% improvement in average precision (AP) and a 2.5% improvement in AP50 over the baseline. These gains enhance communication reliability by reducing misidentification risks, enabling uninterrupted grid monitoring in low-power UAV inspection scenarios where accurate detection is critical for communication integrity and grid safety.
Article
Computer Science and Mathematics
Computer Vision and Graphics

Munish Rathee, Boris Bačić, Maryam Doborjeh

Abstract: Automated anomaly detection in transportation infrastructures is essential for enhancing safety and reducing the operational costs associated with manual inspection protocols. This study presents an improved neuromorphic vision system, which extends the prior SIFT-SNN (Scale-Invariant Feature Transform–Spiking Neural Network) proof-of-concept by incorporating temporal feature aggregation for context-aware and sequence-stable detection. Analysis of classical stitching-based pipelines exposed sensitivity to motion and lighting variations, motivating the proposed temporally smoothed neuromorphic design. SIFT keypoints are encoded into latency-based spike trains and classified using a leaky integrate-and-fire (LIF) spiking neural network implemented in PyTorch. Evaluated across three hardware configurations (an NVIDIA RTX 4060 GPU, an Intel i7 CPU, and a simulated Jetson Nano), the system achieves 92.3% accuracy and a macro F1-score of 91.0% under five-fold cross-validation. Inference latencies were measured at 9.5 ms, 26.1 ms, and ~48.3 ms per frame, respectively. Memory footprints were under 290 MB, and power consumption was estimated to be between 5 and 65 W. The classifier distinguishes between safe, partially dislodged, and fully dislodged barrier pins, which are critical failure modes for the Auckland Harbour Bridge’s Movable Concrete Barrier (MCB) system. Temporal smoothing further improves recall for ambiguous cases. Achieving a compact model size (2.9 MB), low-latency inference, and minimal power demands, the proposed framework offers a deployable, interpretable, and energy-efficient alternative to conventional CNN-based inspection tools. Future work will focus on the generalisability and transferability of the presented approach, additional input sources, and human-computer interaction paradigms for various deployment infrastructures.
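Latency-based spike encoding and leaky integrate-and-fire dynamics, the two neuromorphic ingredients named above, can be sketched in a few lines. The thresholds, time constants, and weights are illustrative assumptions, not the paper's trained values:

```python
import numpy as np

def latency_encode(features, t_max=20):
    """Latency coding: stronger (normalized) feature values spike earlier;
    a time of t_max means 'no spike'."""
    f = np.clip(np.asarray(features, float), 0.0, 1.0)
    return np.round((1.0 - f) * t_max).astype(int)

def lif_run(spike_times, weights, tau=5.0, v_th=1.0, t_steps=21):
    """Leaky integrate-and-fire neuron: the membrane leaks each step,
    integrates weighted input spikes, and fires once it crosses threshold."""
    v = 0.0
    for t in range(t_steps):
        v *= np.exp(-1.0 / tau)                                    # leak
        v += sum(w for s, w in zip(spike_times, weights) if s == t)
        if v >= v_th:
            return t                                               # spike time
    return None                                                    # silent

print(latency_encode([1.0, 0.5, 0.0]))    # [ 0 10 20]
print(lif_run([0, 0], [0.6, 0.6]))        # 0 (two strong coincident inputs)
```

Because strong SIFT responses spike early and the membrane leaks, the neuron effectively performs coincidence detection: early, clustered spikes fire it, while weak or temporally scattered input decays away.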
Article
Computer Science and Mathematics
Computer Vision and Graphics

Erick Huitrón-Ramírez, Leonel G. Corona-Ramírez, Diego Jiménez-Badillo

Abstract: This work presents the Linear-Region-Based Contour Tracking (LRCT) method for extracting external contours in images, designed to achieve an accurate and efficient description of shapes, particularly useful for archaeological materials with irregular geometries. The approach treats the contour as a discrete signal and analyzes image regions containing edge segments. From these regions, a local linear model is estimated to guide the selection and chaining of representative pixels, yielding a continuous perimeter trajectory. This strategy reduces the amount of data required to describe the contour without compromising shape fidelity. As a case study, the method was applied to images of replicas of archaeological materials exhibiting substantial variations in color and morphology. The results show that the obtained trajectories are comparable in quality to those generated using Canny edge detection and Moore tracing, while providing compact representations well suited for subsequent analyses. Consequently, the method offers an efficient and reproducible alternative for documentation, recording, and morphological comparison, strengthening data-driven approaches in archaeological research.
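The "local linear model" that guides pixel selection in LRCT can be illustrated as an ordinary least-squares line fit over the edge pixels of a region; the region extraction and chaining logic are the paper's contribution and are not reproduced here. A sketch with hypothetical edge-pixel coordinates:

```python
import numpy as np

def fit_edge_line(points):
    """Fit a local linear model y = m*x + c to edge-pixel coordinates
    by ordinary least squares."""
    pts = np.asarray(points, float)
    A = np.c_[pts[:, 0], np.ones(len(pts))]    # design matrix [x, 1]
    (m, c), *_ = np.linalg.lstsq(A, pts[:, 1], rcond=None)
    return m, c

# Hypothetical edge pixels lying near the line y = 2x + 1
m, c = fit_edge_line([(0, 1.1), (1, 2.9), (2, 5.0), (3, 7.1)])
print(round(m, 2), round(c, 2))    # slope ~2, intercept ~1
```

Pixels closest to the fitted line can then serve as the region's representative points, which is how a linear model compresses a noisy edge segment into few parameters.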
Article
Computer Science and Mathematics
Computer Vision and Graphics

Ye Deng, Meng Chen, Jieguang Liu, Qi Cheng, Xiaopeng Xu, Yali Qu

Abstract: Breast ultrasound segmentation is challenged by strong noise, low contrast, and ambiguous lesion boundaries. Although deep models achieve high accuracy, their heavy computational cost limits deployment on portable ultrasound devices. In contrast, lightweight networks often struggle to preserve fine boundary details. To address this gap, we propose a lightweight boundary-aware network. A MobileNetV3-based encoder with atrous spatial pyramid pooling is integrated for efficient multi-scale representation learning. The lightweight boundary-aware block applies adaptive fusion of efficient channel attention and depthwise spatial attention to enhance discriminative capability with minimal computational overhead. A boundary-guided dual-head decoding scheme injects explicit boundary priors and enforces boundary consistency to sharpen and stabilize margin delineation. Experiments on the curated BUSI* and BUET* datasets demonstrate that the proposed network achieves 82.8% Dice, 38 px HD95, and real-time inference speeds (123 FPS on GPU / 19 FPS on CPU) using only 1.76M parameters. These results show that the proposed network offers a highly favorable balance between accuracy and efficiency.
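The headline 82.8% Dice is the standard overlap score between predicted and ground-truth binary masks; a minimal sketch with toy masks (HD95 requires boundary distance transforms and is omitted):

```python
import numpy as np

def dice(pred, gt, eps=1e-7):
    """Dice overlap between two binary masks: 2|A∩B| / (|A| + |B|)."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    return (2.0 * inter + eps) / (pred.sum() + gt.sum() + eps)

a = np.zeros((4, 4), int); a[1:3, 1:3] = 1   # 2x2 predicted lesion
b = np.zeros((4, 4), int); b[1:3, 1:4] = 1   # 2x3 ground-truth lesion
print(round(dice(a, b), 3))                  # 0.8
```

Dice rewards interior overlap, which is why the paper pairs it with HD95 (a boundary-distance percentile) to also capture margin delineation quality.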

Article
Computer Science and Mathematics
Computer Vision and Graphics

Cheng Yu, Jian Zhou, Lin Wang, Guizhen Liu, Zhongjun Ding

Abstract: Images captured underwater frequently suffer from color casts, blurring, and distortion, which are mainly attributable to the unique optical characteristics of water. Although conventional underwater image enhancement (UIE) methods rooted in physics are available, their effectiveness is often constrained, particularly in challenging aquatic and illumination conditions. More recently, deep learning has become a leading paradigm for UIE, recognized for its superior performance and operational efficiency. This paper proposes UCA-Net, a lightweight CNN-Transformer hybrid network. It incorporates multiple attention mechanisms and utilizes composite attention to effectively enhance textures, reduce blur, and correct color. A novel adaptive sparse self-attention module is introduced to jointly restore global color consistency and fine local details. The model employs a U-shaped encoder-decoder architecture with three-stage up- and down-sampling, facilitating multi-scale feature extraction and global context fusion for high-quality enhancement. Experimental results on multiple public datasets demonstrate UCA-Net’s superior performance, achieved with fewer parameters and lower computational cost. Its effectiveness is further validated by improvements in various downstream image tasks.

Article
Computer Science and Mathematics
Computer Vision and Graphics

Thomas Rowland, Mark Hansen, Melvyn Smith, Lyndon Smith

Abstract: Automation and computer vision are increasingly vital in modern agriculture, yet mushroom harvesting remains largely manual due to complex morphology and occluded growing environments. This study investigates the application of deep learning–based instance segmentation and keypoint detection to enable robotic harvesting of Lentinula edodes (shiitake) mushrooms. A dedicated RGB-D image dataset, the first open-access RGB-D dataset for mushroom harvesting, was created using a Microsoft Azure DK 3D camera under varied lighting and backgrounds. Two state-of-the-art segmentation models, YOLOv8-seg and Detectron2 Mask R-CNN, were trained and evaluated under identical conditions to compare accuracy, inference speed, and robustness. YOLOv8 achieved higher mean average precision (mAP = 67.9) and significantly faster inference, while Detectron2 offered comparable qualitative performance and greater flexibility for integration into downstream robotic systems. Experiments comparing RGB and RG-D inputs revealed minimal accuracy differences, suggesting that colour cues alone provide sufficient information for reliable segmentation. A proof-of-concept keypoint-detection model demonstrated the feasibility of identifying stem cut-points for robotic manipulation. These findings confirm that deep learning–based vision systems can accurately detect and localise mushrooms in complex environments, forming a foundation for fully automated harvesting. Future work will focus on expanding datasets, incorporating true four-channel RGB-D networks, and integrating perception with robotic actuation for intelligent agricultural automation.

Article
Computer Science and Mathematics
Computer Vision and Graphics

Kylychbek Parpiev

Abstract: Responsive design has become essential as users increasingly access websites through devices of different screen sizes, including laptops and smartphones. This study investigates how responsive design affects user experience, particularly in the areas of readability, navigation, task performance, and emotional engagement. Sixty participants tested both responsive and fixed-layout websites on laptops and mobile devices. Quantitative results show that responsive websites improved task completion time by 32%, and users rated them significantly higher in usability and readability (4.4/5 compared to 3.4/5 for fixed layouts). Heatmap and click-path analyses demonstrated reduced cognitive load and fewer navigation errors on responsive interfaces. Qualitative interviews revealed that users perceived responsive designs as modern, smooth, and easy to use, whereas fixed layouts were described as tiring and frustrating. However, accessibility issues remain for users with visual impairments. Overall, the findings indicate that responsive design enhances both functional and emotional aspects of user experience. The study recommends combining responsive layout practices with accessibility guidelines to ensure inclusive interaction across devices.
Article
Computer Science and Mathematics
Computer Vision and Graphics

Seungun Park, Jiakang Kuai, Hyunsu Kim, Hyunseong Ko, ChanSung Jung, Yunsik Son

Abstract: Object detection in adverse weather remains challenging due to the simultaneous degradation of visibility, structural boundaries, and semantic consistency. Existing restoration-driven or multi-branch detection approaches often fail to recover task-relevant features or introduce substantial computational overhead. To address this problem, DLC-SSD, a lightweight degradation-aware framework for robust object detection in adverse weather environments, is proposed. The framework is based on an integrated restoration and refinement strategy that performs image-level degradation correction, structural information enhancement, and semantic expression refinement in stages. First, the Differentiable Image Processing (DIP) module performs low-cost enhancement to adapt to global and local degradation patterns. After that, the Lightweight Edge-Guided Attention (LEGA) module uses a fixed Laplacian-based structural dictionary to reinforce the boundary cues in shallow, high-resolution features. Finally, the Content-aware Spatial Transformer with Gating (CSTG) module captures long-distance contextual relationships and refines the deep semantic representation by suppressing noise. These components are jointly optimized end-to-end with the Single Shot MultiBox Detector (SSD) backbone. In rain, fog, and low-light conditions, DLC-SSD demonstrated more stable performance than conventional detectors and maintained a quasi-real-time inference speed, confirming its practicality in intelligent monitoring and autonomous driving environments.
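The fixed Laplacian structural cue that LEGA builds on boils down to convolving features with the discrete Laplacian kernel, which responds only where intensity changes. A generic sketch (the dictionary construction and attention weighting around it are the paper's design and are not shown):

```python
import numpy as np

# Standard 4-neighbour discrete Laplacian kernel
LAPLACIAN = np.array([[0,  1, 0],
                      [1, -4, 1],
                      [0,  1, 0]], float)

def laplacian_response(img):
    """Valid-mode convolution with the fixed Laplacian kernel: flat
    regions give zero, edges and corners give nonzero responses."""
    h, w = img.shape
    out = np.zeros((h - 2, w - 2))
    for i in range(h - 2):
        for j in range(w - 2):
            out[i, j] = (img[i:i + 3, j:j + 3] * LAPLACIAN).sum()
    return out

img = np.zeros((5, 5)); img[:, 2:] = 1.0   # vertical step edge
print(laplacian_response(img))             # nonzero only near the step
```

Because the kernel is fixed rather than learned, the boundary cue costs no extra parameters, which fits the framework's lightweight design goal.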
Article
Computer Science and Mathematics
Computer Vision and Graphics

Linyu Lou, Jiarong Mo

Abstract: Video Visual Relation Detection plays a central role in understanding complex video content by identifying evolving spatio-temporal interactions between object tracklets. However, current approaches are hindered by long-tailed predicate distributions, the gap between image-based semantics and video dynamics, and the challenge of generalizing to unseen relation categories. We introduce the Dynamic Contextual Relational Alignment Network (DCRAN), an end-to-end framework designed to address these issues. DCRAN integrates a spatio-temporal gating mechanism to enrich tracklet representations with surrounding context, a dynamic relational prompting module that produces adaptive predicate prompts for each subject–object pair, and a multi-granular semantic alignment module that jointly aligns object features and relational representations with their corresponding textual cues through hierarchical contrastive learning. Experiments on standard benchmarks show that DCRAN substantially improves the detection of both frequent and previously unseen relations, demonstrating the value of dynamic prompting and multi-level alignment for robust video relational understanding.



Preprints.org is a free preprint server supported by MDPI in Basel, Switzerland.


© 2025 MDPI (Basel, Switzerland) unless otherwise stated