Computer Science and Mathematics

Article
Computer Science and Mathematics
Computer Vision and Graphics

Yanxia Lyu,

Yuhang Liu,

Qianqian Zhao,

Ziwen Hao,

Xin Song

Abstract: Remote sensing image (RSI) super-resolution plays a critical role in improving image details and reducing the costs associated with physical imaging devices. However, existing super-resolution methods are not applicable to resource-constrained edge devices because they are hampered by large parameter counts and significant computational complexity. To address these challenges, we propose a novel lightweight super-resolution model for remote sensing images, the strip-like feature superpixel interaction network (SFSIN), which combines the flexibility of convolutional neural networks (CNNs) with the long-range learning capabilities of Transformers. Specifically, the Transformer captures global context through long-range dependencies, while the CNN performs shape-adaptive convolutions. By stacking strip-like feature superpixel interaction (SFSI) modules, we aggregate strip-like features to enable deep feature extraction from both local and global perspectives. Unlike traditional methods that rely solely on direct upsampling for reconstruction, our model uses a convolutional block attention module with upsampling convolution (CBAMUpConv), which integrates deep features across the spatial and channel dimensions to improve reconstruction performance. Extensive experiments on the AID dataset show that SFSIN outperforms ten state-of-the-art lightweight models. SFSIN achieves a PSNR of 33.10 dB and an SSIM of 0.8715 at the ×2 scale, outperforming competing models both quantitatively and qualitatively, while also excelling at higher scales.
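A minimal PyTorch sketch of the CBAMUpConv idea described in the abstract: channel and spatial attention refine deep features before a sub-pixel upsampling convolution. The layer sizes, wiring, and class name here are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class CBAMUpConv(nn.Module):
    """Sketch of a CBAM-style upsampling block (hypothetical layout)."""
    def __init__(self, channels: int, scale: int = 2, reduction: int = 8):
        super().__init__()
        # Channel attention: squeeze spatially, excite per channel.
        self.channel_mlp = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid(),
        )
        # Spatial attention: 7x7 conv over pooled channel statistics.
        self.spatial = nn.Sequential(
            nn.Conv2d(2, 1, kernel_size=7, padding=3),
            nn.Sigmoid(),
        )
        # Upsampling convolution via PixelShuffle (sub-pixel convolution).
        self.upconv = nn.Sequential(
            nn.Conv2d(channels, channels * scale * scale, 3, padding=1),
            nn.PixelShuffle(scale),
            nn.Conv2d(channels, 3, 3, padding=1),  # 3-channel RGB output
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x * self.channel_mlp(x)                        # channel attention
        avg = x.mean(dim=1, keepdim=True)                  # spatial statistics
        mx, _ = x.max(dim=1, keepdim=True)
        x = x * self.spatial(torch.cat([avg, mx], dim=1))  # spatial attention
        return self.upconv(x)

sr = CBAMUpConv(64, scale=2)(torch.randn(1, 64, 48, 48))
print(sr.shape)  # torch.Size([1, 3, 96, 96])
```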
Article
Computer Science and Mathematics
Computer Vision and Graphics

Bin Wu,

Wei Wei,

Jinwen Shi,

Sanrui Lin,

Ling Li

Abstract: This study tackles the challenges posed by complex outdoor lighting on traditional visual measurement methods for helicopter rotor blade motion parameters. We introduce the Multi-feature Fusion and Self-Attention Rotating Detector (MFSA-RD), a novel detection framework that significantly enhances detection accuracy and efficiency. The model enhances YOLOv5s by integrating an advanced multi-feature extraction module that optimizes feature integration across various scales and positions. It streamlines the network by removing redundant convolution layers and utilizes a multi-head self-attention mechanism coupled with Cross-Stage Partial (CSP) convolution to effectively manage complex lighting disturbances. Moreover, the model incorporates a Skew Intersection Over Union (SKEWIOU) loss and an angular loss to refine the loss functions, leading to improved detection performance. Extensive evaluations on both a proprietary outdoor rotor blade dataset and the DOTA-v1.0 public dataset demonstrate significant enhancements over the baseline YOLOv5s, with improvements in mean average precision (mAP) and frame rate of up to 12.8% and 47.7%, respectively.
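A small sketch of what a skew-IoU-plus-angle loss for rotated boxes can look like, using shapely for the polygon overlap. The combination rule and weighting below are assumptions for illustration, not the paper's exact SKEWIOU formulation.

```python
import math
from shapely.geometry import Polygon

def rect_polygon(cx, cy, w, h, angle_rad):
    """Corner points of a rotated rectangle as a shapely Polygon."""
    c, s = math.cos(angle_rad), math.sin(angle_rad)
    pts = [(cx + dx * c - dy * s, cy + dx * s + dy * c)
           for dx, dy in [(-w/2, -h/2), (w/2, -h/2), (w/2, h/2), (-w/2, h/2)]]
    return Polygon(pts)

def skew_iou(box_a, box_b):
    """IoU between two rotated boxes given as (cx, cy, w, h, angle)."""
    pa, pb = rect_polygon(*box_a), rect_polygon(*box_b)
    inter = pa.intersection(pb).area
    return inter / (pa.area + pb.area - inter + 1e-9)

def rotated_box_loss(pred, target, lam=0.1):
    """Hypothetical combination: (1 - SkewIoU) plus a periodic angle penalty."""
    angle_term = 1.0 - math.cos(pred[4] - target[4])
    return (1.0 - skew_iou(pred, target)) + lam * angle_term

print(round(rotated_box_loss((0, 0, 4, 2, 0.3), (0.5, 0, 4, 2, 0.1)), 4))
```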
Article
Computer Science and Mathematics
Computer Vision and Graphics

Van-Khang Nguyen,

Chiung-An Chen,

Cheng-Yu Hsu,

Bo-Yi Li

Abstract: We applied image processing technology to detect and diagnose liver tumors in patients. The Cancer Imaging Archive (TCIA) was used because it contains images of patients diagnosed with liver tumors by medical experts. These images were analyzed to detect and segment liver tumors using advanced segmentation techniques. Following segmentation, the images were converted into binary images for automatic detection of the liver’s shape. The tumors within the liver were then localized and measured. By employing these image segmentation techniques, we accurately determined the size of the tumors. The application of medical image processing techniques significantly aids medical experts in identifying liver tumors more efficiently.
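An OpenCV sketch of the classical pipeline the abstract outlines (segment, binarize, localize, measure). The thresholds and the largest-component liver assumption are illustrative placeholders, not clinically validated values from the study.

```python
import cv2
import numpy as np

def measure_tumors(ct_slice: np.ndarray, px_area_mm2: float = 1.0):
    """Binarize a slice, take the largest component as the liver,
    then localize and measure dark candidate lesions inside it."""
    blur = cv2.medianBlur(ct_slice, 5)                    # suppress speckle
    _, binary = cv2.threshold(blur, 0, 255,
                              cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    # Keep the largest bright component as the liver (simplifying assumption).
    n, labels, stats, _ = cv2.connectedComponentsWithStats(binary)
    liver_idx = 1 + int(np.argmax(stats[1:, cv2.CC_STAT_AREA]))
    liver = labels == liver_idx
    # Candidate tumors: dark pixels inside the liver mask (placeholder threshold).
    lesions = ((blur < 60) & liver).astype(np.uint8) * 255
    contours, _ = cv2.findContours(lesions, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    # Return bounding box and area (in mm^2 given the pixel spacing).
    return [(cv2.boundingRect(c), cv2.contourArea(c) * px_area_mm2)
            for c in contours if cv2.contourArea(c) > 20]

demo = np.random.randint(0, 256, (512, 512), dtype=np.uint8)
print(measure_tumors(demo)[:3])
```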
Article
Computer Science and Mathematics
Computer Vision and Graphics

Rongyong Zhao,

Lingchen Han,

Yuxin Cai,

Bingyu Wei,

Arifur Rahman,

Cuiling Li,

Yunlong Ma

Abstract: Pedestrian panic behavior is a primary cause of overcrowding and stampede accidents in public micro-road network areas with high-density pedestrians. However, reliably detecting such behaviors remains challenging due to their inherent complexity, variability, and stochastic nature. Current detection models often rely on single-modality features, limiting their effectiveness in complex, dynamic crowd scenarios. To overcome these limitations, this study proposes a novel multimodal panic detection approach integrating crowd density mapping, pedestrian trajectory analysis, and semantic recognition. Specifically, crowd density maps are generated using a convolutional neural network (CDNet) to identify regions with abnormal density gradients via contour analysis. Within these potential panic zones, pedestrian trajectories are analyzed through LSTM networks to capture irregular movements such as counterflow and nonlinear wandering. Concurrently, semantic recognition based on Transformer models is used to identify verbal distress cues extracted through Baidu AI real-time speech-to-text conversion. These multimodal features—spatial, temporal, and semantic—are systematically fused and weighted using an MLP-based feature fusion framework to achieve robust panic detection. Comprehensive experiments on the UCF Crowd dataset demonstrate that the proposed approach significantly outperforms state-of-the-art methods, achieving an accuracy of 91.7%. The proposed detection framework can provide technical support for real-time crowd safety management in high-density pedestrian scenarios, including large public gatherings, transportation hubs, and emergency evacuations.
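A minimal PyTorch sketch of the MLP-based fusion step: per-modality features (density, trajectory, semantic) are projected to a shared space and combined with learned weights. The feature dimensions, gating scheme, and class names are assumptions, not the paper's specification.

```python
import torch
import torch.nn as nn

class PanicFusionMLP(nn.Module):
    """Sketch of weighted MLP fusion of three modality feature vectors."""
    def __init__(self, d_density=128, d_traj=64, d_sem=256, hidden=128):
        super().__init__()
        self.gate = nn.Sequential(                 # learned modality weights
            nn.Linear(d_density + d_traj + d_sem, 3), nn.Softmax(dim=-1))
        self.proj = nn.ModuleList(
            [nn.Linear(d, hidden) for d in (d_density, d_traj, d_sem)])
        self.head = nn.Sequential(
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 2))                  # panic / no-panic logits

    def forward(self, f_density, f_traj, f_sem):
        w = self.gate(torch.cat([f_density, f_traj, f_sem], dim=-1))
        fused = sum(w[:, i:i + 1] * p(f) for i, (p, f) in
                    enumerate(zip(self.proj, (f_density, f_traj, f_sem))))
        return self.head(fused)

model = PanicFusionMLP()
logits = model(torch.randn(4, 128), torch.randn(4, 64), torch.randn(4, 256))
print(logits.shape)  # torch.Size([4, 2])
```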
Article
Computer Science and Mathematics
Computer Vision and Graphics

Bakht Alam Khan,

Jin-Woo Jung

Abstract: Human Activity Recognition (HAR) plays a critical role across various fields, including surveillance, healthcare, and robotics, by enabling systems to interpret and respond to human behaviors. In this research, we present an innovative method for HAR that leverages the strengths of Dilated Convolutional Neural Networks (CNNs) integrated with Long Short-Term Memory (LSTM) networks. The proposed architecture achieves an impressive accuracy of 94.9%, surpassing the conventional CNN-LSTM approach, which achieves 93.7% accuracy on the challenging UCF50 dataset. The use of dilated CNNs significantly enhances the model's ability to capture extensive spatial-temporal features by expanding the receptive field, thus enabling the recognition of intricate human activities. This approach effectively preserves fine-grained details without increasing computational costs. The inclusion of LSTM layers further strengthens the model's performance by capturing temporal dependencies, allowing for a deeper understanding of action sequences over time. To validate the robustness of our model, we assessed its generalization capabilities on an unseen YouTube video, demonstrating its adaptability to real-world applications. The superior performance and flexibility of our approach suggest its potential to advance HAR applications in areas like surveillance, human-computer interaction, and healthcare monitoring.
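A compact PyTorch sketch of the dilated-CNN + LSTM layout the abstract describes: dilated convolutions enlarge the per-frame receptive field at no extra parameter cost, and an LSTM models the frame sequence. Channel counts and depths are illustrative assumptions.

```python
import torch
import torch.nn as nn

class DilatedCNNLSTM(nn.Module):
    """Sketch of a dilated-CNN feature extractor feeding an LSTM classifier."""
    def __init__(self, num_classes=50):        # UCF50 has 50 action classes
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=2, dilation=2), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(64, 128, 3, padding=4, dilation=4), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())   # -> (B*T, 128)
        self.lstm = nn.LSTM(128, 128, batch_first=True)
        self.fc = nn.Linear(128, num_classes)

    def forward(self, clip):                    # clip: (B, T, 3, H, W)
        b, t = clip.shape[:2]
        feats = self.backbone(clip.flatten(0, 1)).view(b, t, -1)
        out, _ = self.lstm(feats)               # temporal dependencies
        return self.fc(out[:, -1])              # classify from last state

logits = DilatedCNNLSTM()(torch.randn(2, 16, 3, 112, 112))
print(logits.shape)  # torch.Size([2, 50])
```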
Article
Computer Science and Mathematics
Computer Vision and Graphics

Zhong Chen,

Hanruo Chen,

Junsong Leng,

Xiaolei Zhang,

Qi Gao,

Weiyu Dong

Abstract: Remote sensing image change detection, being a pixel-level dense prediction task, requires both high speed and high accuracy. Model redundancy and detection errors, particularly missed detections, generally degrade accuracy and merit further research; redundancy also reduces speed. To guarantee the efficiency of change detection in terms of both speed and accuracy, a VMamba-based Multi-scale Feature Guiding Fusion Network (VMMCD) is proposed. This network can promptly model global relationships and realize multi-scale feature interaction. Specifically, a Mamba backbone is adopted to replace the commonly used CNN and Transformer backbones. By leveraging VMamba’s global modeling ability with linear computational complexity, the computational resources needed for extracting global features are reduced. Secondly, considering the characteristics of the VMamba model, a compact and efficient lightweight network architecture is devised to reduce the model’s redundancy, thereby avoiding the extraction or introduction of interfering and redundant information; as a result, both the speed and accuracy of the model are enhanced. Finally, the Multi-scale Feature Guiding Fusion (MFGF) module is developed, which strengthens the global modeling ability of VMamba and enriches the interaction among multi-scale features to address the common issue of missed detections in changed areas. The proposed network achieves competitive results on three datasets and remarkably surpasses the current state-of-the-art performance on SYSU-CD, with an F1 score of 83.35% and an IoU of 71.45%. Moreover, for 256×256 inputs, it is more than three times faster than the current state-of-the-art Mamba-based change detection model, demonstrating the effectiveness of our proposed approach.
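The abstract does not specify the MFGF wiring, so the following PyTorch fragment is only one plausible reading of a multi-scale guiding fusion step: an upsampled coarse (global) map gates a fine-scale map before the two are fused.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MFGF(nn.Module):
    """Sketch of multi-scale feature guiding fusion (assumed design)."""
    def __init__(self, c_fine: int, c_coarse: int):
        super().__init__()
        self.gate = nn.Sequential(nn.Conv2d(c_coarse, c_fine, 1), nn.Sigmoid())
        self.fuse = nn.Conv2d(c_fine + c_coarse, c_fine, 3, padding=1)

    def forward(self, fine, coarse):
        # Bring the coarse map to the fine resolution.
        up = F.interpolate(coarse, size=fine.shape[-2:], mode="bilinear",
                           align_corners=False)
        fine = fine * self.gate(up)        # coarse features guide fine ones
        return self.fuse(torch.cat([fine, up], dim=1))

out = MFGF(64, 128)(torch.randn(1, 64, 64, 64), torch.randn(1, 128, 16, 16))
print(out.shape)  # torch.Size([1, 64, 64, 64])
```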
Article
Computer Science and Mathematics
Computer Vision and Graphics

Jánoki Imre,

Kiss Dorottya,

Zarándy Ákos

Abstract: Workplace safety, particularly on production lines and in the automotive industry, has received unprecedented attention in recent years, leading to stricter regulations and driving innovation in monitoring technologies. A key focus has been the development of systems to detect drowsiness and monitor attention levels in both workers and drivers. Current visual monitoring approaches analyze various indicators, including eyelid movements, yawning frequency, head position, driving behavior, traffic response patterns, and physiological measurements such as heart rate and its variability using photoplethysmography. This paper presents an integrated solution that combines multiple cutting-edge components to assess attention and drowsiness levels. Our open-source system, released under the GNU General Public License, incorporates measurements of head position, yawning detection, gaze tracking, and blink analysis. The developed software platform enables testing applications in efficiency monitoring, manufacturing process optimization, and safety enhancement.
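For the blink-analysis component, one standard building block is the eye aspect ratio (EAR) computed over six eye landmarks; the sketch below shows that computation and a simple blink counter. The actual detector in this system may differ, and the thresholds here are typical textbook values, not the paper's.

```python
import numpy as np

def eye_aspect_ratio(eye: np.ndarray) -> float:
    """EAR over six landmarks p1..p6: drops sharply when the eye closes."""
    v1 = np.linalg.norm(eye[1] - eye[5])   # vertical distances
    v2 = np.linalg.norm(eye[2] - eye[4])
    h = np.linalg.norm(eye[0] - eye[3])    # horizontal distance
    return (v1 + v2) / (2.0 * h)

def blink_count(ear_series, closed_thresh=0.21, min_frames=2):
    """Count blinks as runs of low-EAR frames (illustrative thresholds)."""
    blinks, run = 0, 0
    for ear in ear_series:
        if ear < closed_thresh:
            run += 1
        else:
            blinks += run >= min_frames
            run = 0
    return blinks + (run >= min_frames)

open_eye = np.array([[0, 0], [1, 1], [2, 1], [3, 0], [2, -1], [1, -1]], float)
print(round(eye_aspect_ratio(open_eye), 3))  # ~0.667 for an open eye
```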
Article
Computer Science and Mathematics
Computer Vision and Graphics

Dimitar Rangelov,

Sierd Waanders,

Kars Waanders,

Evgeni Genchev,

Maurice van Keulen,

Radoslav Miltchev

Abstract: This study presents a comparative evaluation of Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS) for 3D reconstruction under various filming techniques. High-quality 3D models are essential in a range of applications, from forensic investigations to cultural heritage preservation, architecture, and robotics, where detail accuracy and minimal noise are critical. Leveraging continuous video footage captured with a stabilized full-frame camera setup, this research examines both algorithms across indoor and outdoor environments using consistent datasets. Key assessment criteria include reconstruction noise, detail preservation, and processing time. The results reveal that while both approaches generate high-fidelity reconstructions, 3DGS consistently outperforms NeRF in computational efficiency and noise reduction. These insights provide valuable guidance for selecting suitable reconstruction techniques across different professional domains.
Article
Computer Science and Mathematics
Computer Vision and Graphics

Costas Panagiotakis

Abstract: In this paper, we study the 2D Shape Equipartition Problem (2D-SEP) with minimal boundaries and propose an efficient method that solves it at low computational cost. The goal of 2D-SEP is to obtain a segmentation into N equal-area segments (regions), where the number of segments N is given by the user, under the constraint that the length of the boundaries between segments is minimized. We define 2D-SEP and study problem solutions on basic geometric shapes. We propose a 2D shape equipartition algorithm based on a fast balanced clustering method (SEP-FBC) that efficiently solves 2D-SEP for complex 2D shapes in O(N·|S|·log(|S|)), where |S| denotes the number of image pixels. The proposed SEP-FBC method initializes clustering using centroids provided by the k-means algorithm, which is executed first. During each iteration of the main SEP-FBC process, a region-growing procedure is applied, starting from the smallest region and expanding until regions of equal area are achieved. Additionally, a Particle Swarm Optimization (PSO) method that runs SEP-FBC from different initial centroids is proposed to explore better 2D-SEP solutions and to show how the selection of the initial centroids affects the performance of the proposed method. Finally, we present experimental results on more than 2,800 2D shapes to evaluate the performance of the proposed methods and show that their solutions outperform other methods from the literature.
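A compact sketch of the SEP-FBC idea as the abstract describes it: k-means provides the seeds, then the currently smallest region grows by one boundary layer per round until the shape is covered. This is a simplified reading for a connected shape, not the paper's exact algorithm or complexity-optimal implementation.

```python
import numpy as np
from sklearn.cluster import KMeans

def sep_fbc(mask: np.ndarray, n: int, seed: int = 0) -> np.ndarray:
    """Equipartition a boolean shape mask into n roughly equal regions."""
    pts = np.argwhere(mask)                       # shape pixels (row, col)
    centers = KMeans(n_clusters=n, n_init=5,
                     random_state=seed).fit(pts).cluster_centers_
    labels = np.full(mask.shape, -1, dtype=int)
    for k, c in enumerate(centers):               # snap each seed to the shape
        r, q = pts[np.argmin(((pts - c) ** 2).sum(axis=1))]
        labels[r, q] = k
    while (unassigned := mask & (labels < 0)).any():
        sizes = [(labels == k).sum() for k in range(n)]
        for k in np.argsort(sizes):               # smallest region grows first
            region = labels == k
            grow = np.zeros_like(region)          # 4-neighbour dilation
            grow[1:, :] |= region[:-1, :]
            grow[:-1, :] |= region[1:, :]
            grow[:, 1:] |= region[:, :-1]
            grow[:, :-1] |= region[:, 1:]
            grow &= unassigned                    # restrict to the shape
            if grow.any():
                labels[grow] = k
                break                             # re-rank sizes after growth
        else:
            break                                 # disconnected remainder; stop
    return labels

shape = np.zeros((64, 64), bool)
shape[8:56, 8:56] = True                          # a square test shape
parts = sep_fbc(shape, n=4)
print([(parts == k).sum() for k in range(4)])     # roughly equal areas
```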
Article
Computer Science and Mathematics
Computer Vision and Graphics

Hakan Dalkıç

Abstract: X-ray container imaging systems generate monochromatic images that are primarily used for the detection of prohibited items associated with smuggling activities. Within the scope of this study, two image enhancement algorithms were developed to reveal concealed or subtle details within scan images. For the purpose of noise reduction, median filtering and sharpening filters were utilized in the respective algorithms. In the contrast enhancement stage, one algorithm employed Histogram Equalization, while the other utilized Contrast Limited Adaptive Histogram Equalization (CLAHE). In the CLAHE-based method, optimal parameters were determined based on the entropy of the image, particularly by tuning the block size and clipping limit to achieve ideal enhancement results. To mitigate the naturally occurring airborne noise frequently observed in X-ray scans, contrast adjustment was further refined based on the median brightness of the image, which contributed to preserving a more natural visual appearance. The effectiveness of the developed algorithms was evaluated using objective image quality metrics, including Signal-to-Noise Ratio (SNR) and entropy. Results indicate that contrast-enhanced images improve the distinguishability of objects with similar intensity levels, thereby assisting X-ray operators in anomaly detection and interpretation during inspection processes.
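An OpenCV sketch of the entropy-guided CLAHE stage described above: grid-search the clip limit and tile size and keep the result with the highest entropy. The search ranges and the median pre-filter size are illustrative, not the paper's tuned values.

```python
import cv2
import numpy as np

def image_entropy(img: np.ndarray) -> float:
    """Shannon entropy of an 8-bit image's gray-level histogram."""
    hist = np.bincount(img.ravel(), minlength=256).astype(float)
    p = hist[hist > 0] / hist.sum()
    return float(-(p * np.log2(p)).sum())

def entropy_tuned_clahe(img: np.ndarray) -> np.ndarray:
    """Pick CLAHE parameters (clip limit, tile grid) maximizing entropy."""
    denoised = cv2.medianBlur(img, 3)             # suppress impulse noise
    best, best_h = denoised, -1.0
    for clip in (1.0, 2.0, 3.0, 4.0):
        for tiles in (4, 8, 16):
            clahe = cv2.createCLAHE(clipLimit=clip,
                                    tileGridSize=(tiles, tiles))
            out = clahe.apply(denoised)
            h = image_entropy(out)
            if h > best_h:
                best, best_h = out, h
    return best

scan = np.random.randint(0, 256, (256, 256), dtype=np.uint8)
print(round(image_entropy(entropy_tuned_clahe(scan)), 2))
```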
Review
Computer Science and Mathematics
Computer Vision and Graphics

Xinyue Zhang,

Jianfeng Wang,

Jinqiao Wei,

Xinyu Yuan,

Ming Wu

Abstract: Medical image segmentation, a critical task in medical image analysis, aims to precisely delineate regions of interest (ROIs) such as organs, lesions, and cells, and is crucial for applications including computer-aided diagnosis, surgical planning, radiation therapy, and pathological analysis. While fully supervised deep learning methods have demonstrated remarkable performance in this domain, their reliance on large-scale, pixel-level annotated datasets—a significant label scarcity challenge—severely hinders their widespread deployment in clinical settings. Addressing this limitation, this review focuses on non-fully supervised learning paradigms, systematically investigating the application of semi-supervised, weakly supervised, and unsupervised learning techniques for medical image segmentation. We delve into the theoretical foundations, core advantages, typical application scenarios, and representative algorithmic implementations associated with each paradigm. Furthermore, this paper compiles and critically reviews commonly utilized benchmark datasets within the field. Finally, we discuss future research directions and challenges, offering insights for advancing the field and reducing dependence on extensive annotations.
Article
Computer Science and Mathematics
Computer Vision and Graphics

Matteo Fincato,

Roberto Vezzani

Abstract: Multi-person pose estimation is the task of detecting and regressing the keypoint coordinates of multiple people in a single image. Significant progress has been achieved in recent years, especially with the introduction of transformer-based end-to-end methods. In this paper, we present DualPose, a novel framework that enhances multi-person pose estimation by leveraging a dual-block transformer decoding architecture. Class prediction and keypoint estimation are split into parallel blocks so that each sub-task can be improved separately and the risk of interference is reduced. This architecture improves the precision of keypoint localization and the model's capacity to accurately classify individuals. To further improve performance, the keypoints block processes self-attention in parallel, a strategy that improves keypoint localization accuracy. Additionally, DualPose incorporates a contrastive denoising (CDN) mechanism, leveraging positive and negative samples to stabilize training and improve robustness. Thanks to CDN, a variety of training samples is created by introducing controlled noise into the ground truth, improving the model's ability to discern valid from incorrect keypoints. DualPose achieves state-of-the-art results, outperforming recent end-to-end methods, as shown by extensive experiments on the MS COCO and CrowdPose datasets. The code and pretrained models are publicly available.
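A minimal PyTorch sketch of the dual-block idea: class prediction and keypoint estimation run in parallel decoder blocks over shared queries. Dimensions, head layouts, and the class name are assumptions for illustration, not the DualPose implementation.

```python
import torch
import torch.nn as nn

class DualBlockDecoderLayer(nn.Module):
    """Sketch of one dual-block decoder layer with parallel sub-tasks."""
    def __init__(self, d_model=256, nhead=8):
        super().__init__()
        self.class_block = nn.TransformerDecoderLayer(d_model, nhead,
                                                      batch_first=True)
        self.kpt_block = nn.TransformerDecoderLayer(d_model, nhead,
                                                    batch_first=True)
        self.class_head = nn.Linear(d_model, 2)         # person / background
        self.kpt_head = nn.Linear(d_model, 17 * 2)      # 17 COCO keypoints

    def forward(self, queries, memory):
        cls_q = self.class_block(queries, memory)       # parallel blocks:
        kpt_q = self.kpt_block(queries, memory)         # no interference
        return self.class_head(cls_q), self.kpt_head(kpt_q).sigmoid()

queries, memory = torch.randn(1, 20, 256), torch.randn(1, 400, 256)
logits, kpts = DualBlockDecoderLayer()(queries, memory)
print(logits.shape, kpts.shape)  # (1, 20, 2) (1, 20, 34)
```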
Article
Computer Science and Mathematics
Computer Vision and Graphics

Theresa Chen

Abstract: This paper presents OpenRSSI, a novel motion capture system that leverages ultra-wideband (UWB) radio signal strength indicators combined with inertial measurement units (IMUs) to achieve high-precision tracking without the positional drift common in pure inertial systems. Our approach utilizes an adaptive sensor fusion algorithm that dynamically adjusts to environmental conditions and movement patterns, providing robust tracking across varied use cases.
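The abstract names adaptive UWB/IMU fusion without giving the filter, so the following one-axis NumPy fragment is purely illustrative: dead-reckon from accelerometer data, then pull toward the noisy but drift-free UWB position with a gain that grows when the two disagree.

```python
import numpy as np

def fuse_position(uwb_pos, imu_acc, dt=0.01, alpha0=0.02):
    """Illustrative adaptive complementary filter along one axis."""
    pos, vel = uwb_pos[0], 0.0
    fused = []
    for z, a in zip(uwb_pos, imu_acc):
        vel += a * dt                     # inertial integration (drifts)
        pos += vel * dt
        alpha = min(1.0, alpha0 * (1.0 + abs(z - pos)))  # adaptive gain
        pos += alpha * (z - pos)          # UWB correction cancels drift
        fused.append(pos)
    return np.array(fused)

t = np.arange(0, 5, 0.01)
true = np.sin(t)
uwb = true + np.random.normal(0, 0.05, t.size)           # noisy, unbiased
acc = np.gradient(np.gradient(true, 0.01), 0.01) + 0.2   # biased accelerometer
err = np.abs(fuse_position(uwb, acc) - true).mean()
print(round(float(err), 3))               # fused error stays bounded
```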
Article
Computer Science and Mathematics
Computer Vision and Graphics

Ioana Livia Stefan,

Andrei Mateescu,

Ionut Lentoiu,

Silviu Raileanu,

Florin Daniel Anton,

Dragos Constantin Popescu,

Ioan Stefan Sacala

Abstract: With its fast advancements, Cloud Computing opens many opportunities for research in various applications of the robotics field. In this paper, we further explore the prospect of integrating Cloud AI object recognition services into an industrial robotic sorting task. Starting from our previously implemented solution on a digital twin, we now put the proposed architecture to the test in the real world, on an industrial robot, where factors such as illumination, shadows, and differences in the coloring and texture of the materials influence the performance of the vision system. We compare the results of our method with those of an industrial machine vision software package, obtaining promising performance and opening additional application perspectives in the robotics field as Cloud and AI technologies continue to improve.
Article
Computer Science and Mathematics
Computer Vision and Graphics

Sam Kavanagh,

Andrew Luxton-Reilly,

Burkhard C. Wünsche,

Beryl Plimmer,

Sebastian Dunn

Abstract: Virtual Reality (VR) has existed in the realm of education for over half a century; however, it has never achieved widespread adoption. This was traditionally attributed to the costs and usability problems associated with the technologies, but a new generation of consumer VR headsets has largely mitigated these issues. Arguably, the greater barrier is now the overhead involved in creating educational VR content, a process that has remained largely unchanged. In this paper we investigate the use of 360° video as an alternative way of producing educational VR content with a much lower barrier to entry. We report on the differences in user experience between 360° and standard desktop video. We also compare the short- and long-term learning retention of tertiary students who viewed the same video recordings but watched them in either 360° or standard video format. Our results indicate that students retain an equal amount of information from either video format but perceive 360° video to be more enjoyable and engaging, and would prefer to use it as a supplementary material in their coursework.
Article
Computer Science and Mathematics
Computer Vision and Graphics

Weiying Piao,

Yongqi Han,

Liye Hu,

Chunxue Wang

Abstract: Given the variety of focus measure operators, selecting an appropriate one based on scene requirements is critical. Based on an evaluation of focusing-curve morphology, and considering both the accuracy and robustness of metrics, four metrics were designed: the steep region width (Ws), the steep-to-gentle ratio (Rsg), the curvature of the peak point (Cp), and the relative root mean square error (RRMSE). Several typical focus measure operators were chosen for experimental evaluation. The results demonstrate that the proposed metrics effectively characterize the performance and features of various operators. These metrics serve as a valuable reference for selecting appropriate operators and provide a theoretical foundation for designing new ones.
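The abstract names the metrics but not their formulas, so the fragment below is one plausible reading: build a focusing curve with a common focus measure operator (variance of the Laplacian), then estimate a steep-region width and a peak curvature. All formulas here are assumptions for illustration.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, laplace

def focus_curve(stack):
    """Focus value per frame via variance of the Laplacian (one common
    focus measure operator; an illustrative choice)."""
    return np.array([laplace(img.astype(float)).var() for img in stack])

def steep_region_width(curve, level=0.5):
    """Plausible reading of Ws: number of frames above `level` of the peak."""
    return int((curve >= level * curve.max()).sum())

def peak_curvature(curve):
    """Plausible reading of Cp: |second difference| at the focus peak."""
    i = int(np.clip(np.argmax(curve), 1, len(curve) - 2))
    return float(abs(curve[i - 1] - 2 * curve[i] + curve[i + 1]))

# Synthetic focus stack: sharpest frame in the middle, blurrier elsewhere.
rng = np.random.default_rng(0)
base = rng.random((64, 64))
stack = [gaussian_filter(base, sigma=0.2 * abs(i - 10)) for i in range(21)]
c = focus_curve(stack)
print(steep_region_width(c), round(peak_curvature(c), 4))
```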
Article
Computer Science and Mathematics
Computer Vision and Graphics

Wenyuan Wang,

Ruisi Yang,

Jutao Xiao,

Shuwei Huo,

Zhengze Chen

Abstract: Oral disease recognition aims to identify specific diseases from individual oral images. With the progress of deep learning, pioneering computational methods tailored for this task have started to demonstrate their potential. Conventional deep learning approaches typically treat oral disease recognition as a simplistic image-to-label mapping, thereby neglecting the critical need to explicitly model disease-specific visual patterns. This oversimplification compromises their ability to generalize effectively to unseen data, particularly when training data is limited. To address this issue, we analyzed the mechanisms of the diagnostic process, which involves first identifying pathological features in images and then determining the corresponding oral disease based on these features. Current methods, however, overlook the step of extracting pathological features, resulting in suboptimal model performance. To overcome this limitation, we propose a novel framework termed Dynamic Compositional Prototype Learning (DyCoP). The DyCoP framework leverages componential prototypes and category prototypes to represent localized pathological features and oral diseases, respectively. It employs a dynamic composition mechanism to establish relationships between multiple pathological features and specific oral diseases. Furthermore, we introduce a gradient suppression strategy to fully utilize general knowledge embedded in pretrained models. These approaches enhance the model’s capacity to capture effective features and learn accurate diagnostic logic. Experiments conducted on the Dental Condition Dataset demonstrate the superiority of our method, achieving an accuracy of 93.3% and a macro-F1 score of 91.9%, outperforming state-of-the-art approaches by significant margins. This framework effectively addresses both precision and generalizability bottlenecks in automated oral disease diagnosis.
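A minimal PyTorch sketch of the prototype idea as the abstract describes it: componential prototypes score localized patch features, and a composition layer maps that evidence to disease logits. The dimensions, scoring rule, and composition layer are assumptions, not the paper's specification.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicCompositionalPrototypes(nn.Module):
    """Sketch: componential prototypes composed into category scores."""
    def __init__(self, d=256, n_comp=32, n_classes=7):
        super().__init__()
        self.comp_protos = nn.Parameter(torch.randn(n_comp, d))
        self.compose = nn.Linear(n_comp, n_classes)    # composition mechanism

    def forward(self, local_feats):                    # (B, N_patches, d)
        # Cosine similarity of every patch feature to every prototype.
        sim = F.cosine_similarity(local_feats.unsqueeze(2),
                                  self.comp_protos[None, None], dim=-1)
        evidence = sim.max(dim=1).values               # best patch per prototype
        return self.compose(evidence)                  # class logits

model = DynamicCompositionalPrototypes()
print(model(torch.randn(2, 49, 256)).shape)  # torch.Size([2, 7])
```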
Article
Computer Science and Mathematics
Computer Vision and Graphics

Romeo Šajina,

Goran Oreški,

Marina Ivasic-Kos

Abstract: Multi-person pose forecasting involves predicting the future body poses of multiple individuals over time, involving complex movement dynamics and interaction dependencies. Its relevance spans various fields, including computer vision, robotics, human-computer interaction, and surveillance. This paper introduces GCN-Transformer, a novel model for multi-person pose forecasting that leverages the integration of Graph Convolutional Network and Transformer architectures. We integrated novel loss terms during the training phase to enable the model to learn both interaction dependencies and the trajectories of multiple joints simultaneously. Additionally, we propose a novel pose forecasting evaluation metric called Final Joint Position and Trajectory Error (FJPTE), which assesses both local movement dynamics and global movement errors by considering the final position and the trajectory leading up to it, providing a more comprehensive assessment of movement dynamics. Comprehensive evaluations on the SoMoF Benchmark and ExPI datasets demonstrate that the proposed GCN-Transformer model consistently outperforms existing state-of-the-art (SOTA) models according to the VIM and MPJPE metrics. Specifically, GCN-Transformer shows a 5% improvement over the closest SOTA model on the SoMoF Benchmark’s MPJPE metric and a 2.6% improvement over the closest SOTA model on the ExPI dataset MPJPE metric. Unlike other models whose performance fluctuates across datasets, GCN-Transformer performs consistently, proving its robustness in multi-person pose forecasting and providing an excellent foundation for the application of GCN-Transformer in different domains. The code is available at https://github.com/RomeoSajina/GCN-Transformer.
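The abstract defines FJPTE qualitatively (final joint position plus the trajectory leading up to it), so the NumPy sketch below is one plausible reading of such a metric; the exact combination and weighting in the paper may differ.

```python
import numpy as np

def fjpte(pred, gt):
    """Plausible reading of FJPTE for arrays of shape (T, J, 3):
    final-position error plus mean per-step trajectory mismatch."""
    final_err = np.linalg.norm(pred[-1] - gt[-1], axis=-1).mean()
    # Trajectory term: mismatch of per-step displacements along the path.
    traj_err = np.linalg.norm(np.diff(pred, axis=0) - np.diff(gt, axis=0),
                              axis=-1).mean()
    return final_err + traj_err

rng = np.random.default_rng(0)
gt = rng.random((14, 13, 3))                 # 14 steps, 13 joints, 3-D
pred = gt + rng.normal(0, 0.02, gt.shape)
print(round(float(fjpte(pred, gt)), 4))
```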
Article
Computer Science and Mathematics
Computer Vision and Graphics

Fahmid Al Farid,

Md Ahsanul Bari,

Abu Saleh Musa Miah,

Sarina Mansur,

Jia Uddin,

Prabha Kumaresan

Abstract: Ambient Assisted Living (AAL) leverages technology to support the elderly and individuals with disabilities. A key challenge in AAL systems is efficient human activity recognition (HAR), yet no study has systematically compared single-view (SV) and multi-view (MV) HAR. This review addresses this gap by analyzing the evolution from SV to MV-HAR, covering benchmark datasets, feature extraction methods, and classification approaches. We examine how HAR systems have transitioned to MV with advanced deep learning architectures optimized for AAL, improving accuracy and robustness. Additionally, we explore machine learning and deep learning models—including CNNs, RNNs, LSTMs, TCNs, and GCNs—as well as lightweight transfer learning techniques for resource-constrained environments. Key challenges such as data remediation, privacy, and generalization are discussed alongside potential solutions like sensor fusion and advanced learning methods. Our study provides insights into advancements and future directions, guiding the development of intelligent, efficient, and privacy-compliant HAR systems for AAL.
Article
Computer Science and Mathematics
Computer Vision and Graphics

Wu Yonghao,

Liu Minyi,

Li Jun

Abstract: In various fields, including visual processing, computer graphics, neuroscience, and the biological sciences, geons are widely recognized as fundamental units of complex shapes, and their importance has been broadly acknowledged. However, accurately identifying and extracting these geons remains a challenge. This study integrates theories from signal processing, computer graphics, neuroscience, and the biological sciences, utilizing "object imaging" and neural networks to describe mathematical operators, in order to reveal the essence of visual geons. Experiments validate the core hypothesis of geon theory, namely that geons are foundational components by which the visual system recognizes complex objects. Through training, neural networks are capable of identifying distinct basic geons and, on this foundation, performing target recognition in more complex scenarios. This effectively confirms the existence of geons and their critical role in visual recognition, providing new tools and theoretical foundations for related research fields.
