Computer Science and Mathematics

Sort by

Article

Hardware and Architecture

AgroNova: An Autonomous IoT Platform for Greenhouse Climate Control

Borislav Toskov

Asya Toskova

Abstract: This article presents AgroNova, an intelligent and autonomous Internet of Things (IoT) platform developed for real-time monitoring and control of the microclimate in greenhouses. The system combines distributed wireless sensor nodes, actuator mod-ules, a local gateway equipped with a rule-based control agent, and a cloud infra-structure for data visualization and decision support. The platform’s hybrid architec-ture enables autonomous operation in the event of internet failures and at the same time allows the integration of a large language model (LLM) for context-based deci-sions. AgroNova was implemented in a tomato greenhouse and validated over a period of seven months, during which over 400,000 environmental data points were recorded. The system effectively kept temperature and humidity within optimal agronomic ranges and reduced deviation time compared to manual control. In experimental tests, the LLM component generated relevant recommendations under complex conditions, such as bad weather. The results show that AgroNova is a reliable and scalable solution for greenhouse microclimate management. The combination of local autonomy and cloud intelligence of the platform offers promising applications in precision agriculture. Future work in-cludes extending the scope of LLM-assisted reasoning and adapting the platform to additional crops and greenhouse environments.

Posted: 13 February 2026

https://doi.org/10.20944/preprints202602.1079.v1

Article

Computer Science and Mathematics

Hardware and Architecture

Transformer Algorithmics: A Tutorial on Efficient Implementation of Transformers on Hardware

Christoforos Kachris

Abstract: The rise of Large Language Models (LLMs) has redefined the landscape of artificial intelligence, with the Transformer architecture serving as the foundational backbone for these breakthroughs. Despite their algorithmic dominance, Transformers impose extreme computational and memory demands that render general-purpose processing elements (PEs), such as standard CPUs and GPUs, increasingly inefficient in terms of power density and throughput. As the industry moves toward domain-specific accelerators, there is a critical need for specialized digital design strategies that address the "Memory Wall" and the quadratic complexity of attention mechanisms. This paper presents a comprehensive tutorial on the most efficient hardware architectures for implementing Transformer components in digital logic. We provide a bottom-up analysis of the hardware realization of Multi-Head Attention (MHA), Feed-Forward Networks (FFN), and non-linear normalization units like Softmax and LayerNorm. Specifically, we explore state-of-the-art implementation techniques, including Systolic Arrays for linear projections, CORDIC and LUT-based approximations for non-linearities, and the emerging SwiGLU gated architectures. Furthermore, we discuss the latest trends in hardware-software co-design, such as the use of FlashAttention-4 and Tensor Memory (TMEM) pathways to minimize on-chip data movement. This tutorial serves as a guide for computer engineers and researchers to bridge the gap between high-level Transformer mathematics and low-level RTL-optimized hardware.

Posted: 11 February 2026

https://doi.org/10.20944/preprints202602.0932.v1

Article

Computer Science and Mathematics

Hardware and Architecture

Prompt Sensitivity and Bias Amplification in Aligned Video Diffusion Models

Marco Rossi

Giulia Bianchi

Alessandro Conti

Abstract: While alignment tuning aims to constrain undesirable outputs, its interaction with prompt sensitivity in video diffusion models has not been systematically quantified. This study examines how minor semantic perturbations in prompts affect bias emergence in aligned versus unaligned video diffusion systems. We generate 26,700 video samples using paired prompts with controlled lexical and contextual variations. Bias amplification is measured using demographic skew ratios, attribute co-occurrence statistics, and visual saliency attribution. Results indicate that aligned models exhibit 34.1% higher sensitivity to prompt perturbations in socially sensitive contexts, leading to amplified bias variance across outputs. These findings suggest that alignment tuning may unintentionally increase model fragility to prompt-level noise, posing challenges for reliable bias mitigation.

Posted: 27 January 2026

https://doi.org/10.20944/preprints202601.2005.v1

Article

Computer Science and Mathematics

Hardware and Architecture

Cross-Modal Bias Transfer in Aligned Video Diffusion Models

Yuki Nakamura

Kenji Sato

Ayaka Suzuki

Hiroshi Tanaka

Abstract: Video diffusion models integrate visual, temporal, and textual signals, creating potential pathways for cross-modal bias transfer. This paper studies how alignment tuning affects the transmission of social bias between text and visual modalities in video generation. We evaluate 14,200 text-to-video samples using a cross-modal attribution framework that decomposes bias contributions across input modalities. Quantitative analysis reveals that alignment tuning reduces text-conditioned bias by 24.8%, yet increases visually induced bias carryover by 31.5%, particularly in identity-related scenarios. The results demonstrate that alignment tuning redistributes bias across modalities rather than eliminating it, highlighting the need for modality-aware alignment strategies.

Posted: 27 January 2026

https://doi.org/10.20944/preprints202601.1956.v1

Article

Computer Science and Mathematics

Hardware and Architecture

Reliability–Latency Co-Optimization in Parallel Register Array Frameworks Under Fault Injection

Jun Wei

Li Ming

Wei Zhang

Abstract: Achieving low latency and high reliability simultaneously remains a fundamental trade-off in parallel register array frameworks. This paper introduces a system-level reliability–latency co-optimization model for parallel register array communication under fault-prone environments. The proposed approach formulates path selection and redundancy allocation as a constrained optimization problem, where latency minimization is balanced against probabilistic reliability guarantees. A heuristic solver is developed to efficiently compute near-optimal configurations for large-scale register arrays. Extensive fault injection experiments were conducted on register arrays with up to 8192 registers, considering both transient and permanent faults. Compared with fixed-configuration parallel frameworks, the proposed model achieves an average latency reduction of 21.5% while maintaining reliability above 97% across all tested fault scenarios. Sensitivity analysis further shows that the model adapts effectively to varying fault distributions, with latency degradation limited to less than 8% under worst-case fault clustering. These findings provide a quantitative foundation for reliability-aware design of parallel register array systems.

Posted: 26 January 2026

https://doi.org/10.20944/preprints202601.1888.v1

Article

Computer Science and Mathematics

Hardware and Architecture

The Spike Processing Unit (SPU): An IIR Filter Approach to Hardware-Efficient Spiking Neurons

Hugo Puertas de Araújo

Abstract: This paper presents the Spike Processing Unit (SPU), a digital spiking neuron model based on a discrete-time second-order Infinite Impulse Response (IIR) filter. By constraining filter coefficients to powers of two, the SPU implements all internal operations via shift-and-add arithmetic on 6-bit signed integers, eliminating general-purpose multipliers. Unlike traditional models, computation in the SPU is fundamentally temporal; spike timing emerges from the interaction between input events and internal IIR dynamics rather than signal intensity accumulation. The model’s efficacy is evaluated through a temporal pattern discrimination task. Using Particle Swarm Optimization (PSO) within a hardware-constrained parameter space, a single SPU is optimized to emit pattern-specific spikes while remaining silent under stochastic noise. Results from cycle-accurate Python simulations and synthesizable VHDL implementations confirm that learned temporal dynamics are preserved in digital hardware. This work demonstrates that discrete-time IIR-based neurons enable reliable temporal spike processing under strict quantization and arithmetic constraints.

Posted: 14 January 2026

https://doi.org/10.20944/preprints202509.1538.v2

Review

Computer Science and Mathematics

Hardware and Architecture

A Review of Floating-Point Arithmetic Algorithms Using Taylor Series Expansion and Mantissa Region Division Techniques

Jianglin Wei

Haruo Kobayashi

Abstract: This paper reviews digital floating-point arithmetic algorithms that employ Taylor series expansion combined with mantissa region division techniques, drawing upon the results of our research. In many scientific computing applications, compact and low-power hardware implementations are essential. To address these requirements, this review presents algorithms specifically designed to operate under such constraints. The focus is placed on efficient floating-point operations—including division, inverse square root, square root, exponentiation, and logarithmic functions—all realized through Taylor series expansions. Furthermore, the paper examines the trade-offs involved, such as the number of additions, subtractions, and multiplications, as well as the hardware cost associated with Look-Up Table (LUT) size. These factors are analyzed to identify the most suitable algorithms for engineering applications and to facilitate their practical implementation.

Posted: 05 January 2026

https://doi.org/10.20944/preprints202601.0284.v1

Article

Computer Science and Mathematics

Hardware and Architecture

A Unified GF(4)–Symplectic Framework for Quantum Error Correction: A Constructive, Pedagogical Derivation of the Steane [[7,1,3]] Code

Amir Hameed Mir

Abstract: This paper presents a complete, constructive derivation of the Steane [[7,1,3]] quantum error-correcting code using a unified framework that bridges GF(4) algebra, binary symplectic representation, and stabilizer formalism. We demonstrate how classical coding theory, finite-field arithmetic, and symplectic geometry naturally converge to form a comprehensive foundation for quantum error correction. Starting from the classical Hamming [7,4,3] code, we provide explicit constructions showing: (1) how GF(4) encodes the Pauli group modulo phases, (2) how the symplectic inner product on F²ⁿ₂ captures commutativity, (3) how syndrome extraction reduces to binary matrix multiplication, and (4) how transversal Clifford gates emerge from symplectic transformations. The step-by-step derivation encompasses stabilizer construction, centralizer analysis, logical operator identification, code distance verification, and fault-tolerant syndrome measurement via flagged circuits. All results are derived using elementary finite-field and binary linear algebra, ensuring the exposition is self-contained and accessible. We further illustrate how this algebraic framework extends naturally to modern quantum LDPC codes. This work serves as both a pedagogical tutorial for students entering quantum error correction and a unified reference for researchers implementing stabilizer codes in practice

Posted: 04 December 2025

https://doi.org/10.20944/preprints202512.0353.v1

Concept Paper

Computer Science and Mathematics

Hardware and Architecture

ESDM–SMTJ: An Entropic Semantic Dynamics Model for Classical Probabilistic Hardware with Superparamagnetic Tunnel Junctions

Ezequiel Lapilover

Abstract: We introduce ESDM–SMTJ, an Entropic Semantic Dynamics Model implemented on clas- sical probabilistic hardware based on superparamagnetic tunnel junctions (SMTJs). The model represents the internal state of a symbolic or cognitive system as a trajectory Σ(τ ) in a layered state space, with τ ∈ [0, 1] interpreted as an internal computation time from initial query to final answer. Each expression e (for example 2 + 2 =?) induces a program-specific dynamics U τ e that iteratively updates Σ(τ ). Ambiguous operators such as “+” are treated as multi-modal : every occurrence admits a finite family of semantic modes i, and an entropic gate scores each mode by the predicted reduction ∆H(k) i of the output entropy if that mode is selected at position k. These scores are mapped to effective energy levels E(k) i = E0 − κ∆H(k) i in a local SMTJ p-bit block, whose Boltzmann statistics implement a softmax distribution over modes at the hardware level. The resulting dynamics exhibits rumination (high-entropy plateaus), insight-like transi- tions (sharp entropy drops) and stabilization in low-entropy attractors, together with a natural notion of semantic commit at an internal time τc < 1 and a blind reveal of the output token via SMTJ readout at τf ≈ 1. We illustrate how simple arithmetical judgements—including rare anomalies such as 2 + 2 → 5 under mis-tuned parameters—can be expressed in this frame- work, and we outline a quantum extension in which semantic modes become basis states of a Hamiltonian with complex amplitudes instead of classical probabilities.

Posted: 02 December 2025

https://doi.org/10.20944/preprints202512.0272.v1

Article

Computer Science and Mathematics

Hardware and Architecture

Design of an Energy-Efficient SHA-3 Accelerator on Artix-7 FPGA for Secure Network Applications

Abdulmunem A. Abdulsamad

Sándor R. Répás

Abstract: With the rapid growth of secure communication and data integrity needs in embedded and networked systems, there is a growing demand for cryptographic solutions that are not only secure but also energy- and area-efficient. While software-based SHA-3 implementations offer flexibility, they often fall short in meeting the tight performance and power budgets of modern resource-constrained environments. This paper presents a hardware-accelerated SHA-3 implementation optimised for the Xilinx Artix-7 FPGA. The proposed architecture features a fully pipelined Keccak-f[1600] core and leverages techniques such as partial loop unrolling, clock gating, and pipeline balancing to improve efficiency. Designed in VHDL and synthesised using Vivado 2024.2.2, the accelerator achieves a throughput of 1.35 Gbps at 210 MHz with a total power consumption of just 0.94 W—resulting in an energy efficiency of 1.44 Gbps/W. The design is validated against NIST SHA-3 test vectors and demonstrates a strong balance between speed, low power, and hardware utilisation. These characteristics make it well-suited for deployment in secure embedded applications, such as IoT devices, edge nodes, and real-time authentication systems.

Posted: 28 November 2025

https://doi.org/10.20944/preprints202511.2191.v1

Article

Computer Science and Mathematics

Hardware and Architecture

Large Pages, Large Leaks? Hugepage-Induced Side-Channels vs. Performance Improvements in Cryptographic Computations

Xinyao Li

Akhilesh Tyagi

Abstract: Side-channel attacks leveraging microarchitectural components such as caches and translation lookaside buffers (TLBs) pose increasing risks to cryptographic and machine-learning workloads. This paper presents a comparative study of performance and side-channel leakage under two page-size configurations—standard 4KB pages and 2MB huge pages—using paired attacker–victim experiments instrumented with both Performance Monitoring Unit (PMU) counters and precise per-access timing using rdtscp(). The victim executes repeated, key-dependent memory accesses across eight cryptographic modes (AES, ChaCha20, RSA, and ECC variants) while the attacker records eight PMU features per access (cpu-cycles, instructions, cache-references, cache-misses, etc.) and precise rdtscp() timing. The resulting traces are analyzed using a multilayer perceptron classifier to quantify key-dependent leakage. Results show that the 2MB huge-page configuration achieves a comparable key-classification accuracy (mean 0.79 vs. 0.77 for 4KB) while reducing average CPU cycles by approximately 11%. Page-index identification remains near random chance (3.6--3.7% for PMU side-channels and 1.5% for timing side-channel), indicating no increase in measurable leakage at the page level. These findings suggest that huge-page mappings can improve runtime efficiency without amplifying observable side-channel vulnerabilities, offering a practical configuration for balancing performance and security in user-space cryptographic workloads.

Posted: 20 November 2025

https://doi.org/10.20944/preprints202511.1567.v1

Review

Computer Science and Mathematics

Hardware and Architecture

Cooling, Placement, and Virtualization for Sustainability

Pedro Ramos Brandao

Abstract: The exponential growth in global data generation has elevated the role of data centers in modern society. However, their immense energy requirements raise significant environ-mental concerns. This paper aims to demonstrate that current innovations in data center cooling systems, server placement architectures, and virtualization techniques are not on-ly technologically advanced but also critical drivers of energy sustainability. Through an in-depth review of current research, development of key technological pathways, and de-tailed discussion supported by 40 scholarly references, we establish that sustaina-ble data centers are not a futuristic ideal but a present necessity. The analysis is grounded in rigorous scientific methodologies, including thermodynamic modeling, computational fluid dynamics (CFD), and workload orchestration frameworks. By integrating energy-aware designs with cutting-edge software deployment models, data centers are being transformed from energy-intensive infrastructures into hubs of sustainable computational power. This transformation is supported not only by theoretical principles but also by a growing body of empirical data that demonstrates marked improvements in energy usage efficiency (PUE), carbon footprint (CUE), and overall sustainability metrics.

Posted: 18 August 2025

https://doi.org/10.20944/preprints202508.1226.v1

Article

Computer Science and Mathematics

Hardware and Architecture

An Open Chisel-Based Framework for Hardware Acceleration on High-Performance FPGA Cards

Robin Gay

Tarek Ould-Bachir

Abstract: This paper presents an open and fully Chisel-based hardware acceleration framework tailored for high-performance FPGA platforms, with a specific focus on AMD/Xilinx Alveo UltraScale+ cards. While the high-level synthesis (HLS) flow offered by Xilinx enables rapid deployment and is well-suited for many applications, it can be overly abstract for low-level control scenarios such as ASIC prototyping. The alternative RTL Kernel flow offers finer control but often suffers from the limitations of legacy hardware description languages and the overhead of vendor-specific tooling. To address these limitations, we propose a fully open-source workflow based on Chisel, a modern hardware construction language embedded in Scala. Chisel combines the flexibility of object-oriented programming with the ability to generate synthesizable RTL, enabling scalable, reusable, and modular designs. Our framework demonstrates how Chisel can be used to implement advanced hardware features including AXI4/AXI4-Lite interfacing, multi-clock domain designs, asynchronous communication primitives, and enhanced simulation capabilities such as custom VCD trace generation. The use of the Vivado RTL flow bypasses the constraints imposed by the Xilinx golden image and XRT stack, allowing direct programming and fine-grained control over the FPGA fabric. Lightweight host communication is achieved via the XDMA IP and Linux device files, enabling platform-agnostic integration using standard programming languages such as C++ and Python. As a proof of concept, we implement a high-throughput matrix-vector multiplication engine for floating-point data in a self-alignment format (SAF), fully utilizing the resources of a multi-SLR Alveo U200 card. Benchmark results show efficient pipelined operation and full cross-SLR scalability, validating the viability of the proposed framework for custom acceleration pipelines.

Posted: 13 August 2025

https://doi.org/10.20944/preprints202508.0984.v1

Article

Computer Science and Mathematics

Hardware and Architecture

Near-Optimal Multirun March Memory Tests for Neighborhood Pattern-Sensitive Faults in Random-Access Memories

Petru Cascaval

Doina Cascaval

Abstract: This research paper addresses the problem of testing n×1 random-access memories (RAMs) in which complex models of unlinked static neighborhood pattern-sensitive faults (NPSF) are considered. Specifically, two well-known fault models are addressed: the classical NPSF model that includes only memory faults sensitized by transition write operations and an extended NPSF model that covers faults sensitized by transition write operations as well as faults sensitized by non-transition writes or read operations. For these NPSF fault models, near-optimal multirun march memory tests suitable for implementation in embedded self-test logic are proposed. The assessment of the optimality is based on the fact that, for any group of cells corresponding to the NPSF model, the state graph is completely covered and each arc is traversed only once, which means that the graph is of the Eulerian type. Additional write operations are only required for data background changes. A characteristic of a memory test algorithm where multiple data backgrounds are applied is that test data is always correlated with the address of the accessed location. For easy implementation in embedded self-test logic, the proposed tests use 4×4 memory initialization patterns rather than the more difficult-to-implement 3×3 patterns, as is the case with other currently known near-optimal memory tests.

Posted: 09 July 2025

https://doi.org/10.20944/preprints202507.0773.v1

Article

Computer Science and Mathematics

Hardware and Architecture

Extending a Moldable Computer Architecture to Accelerate DL Inference on FPGA

Mirko Mariotti

Giulio Bianchini

Igor Neri

Daniele Spiga

Diego Ciangottini

Loriano Storchi

Abstract: Over the past years, the field of Machine and Deep Learning has seen a strong 1 developments both in terms of software and hardware with the increase of specialised devices. One of the biggest challenges in this field is the inference phase, where the trained model makes predictions of unseen data. Although computationally powerful, traditional computing architectures face limitations in efficiently managing requests, especially from an energy point of view. For this reason, the need arose to find alternative hardware solutions and among these there are Field Programmable Gate Arrays (FPGAs): their key feature of being reconfigurable, combined with parallel processing capability, low latency and low power consumption, makes those devices uniquely suited to accelerating inference tasks. In this paper, we present a novel approach to accelerate the inference phase of a Multi-Layer Perceptron (MLP) using BondMachine , an OpenSource framework for the design of hardware accelerators for FPGAs. Analysis of the latency, energy consumption and resource usage as well as comparisons with respect to standard architectures and other FPGA approaches are presented, highlighting the strengths and critical points of the proposed solution.

Posted: 27 May 2025

https://doi.org/10.20944/preprints202505.2111.v1

Article

Computer Science and Mathematics

Hardware and Architecture

Plücker Conoid-Inspired Geometry for Wave-Based Computing Systems

Arturo Tozzi

Abstract: Computing hardware approaches face challenges related to spatial efficiency, thermal regulation, signal latency and manufacturing complexity. We evaluated the potential of Plücker conoid-inspired geometry (PCIG) as a wave modulation strategy for wave-based systems like optical/acoustic computing platforms. We propose optical transistors in which guided input beams interact with surfaces modulated according to a Plücker conoid profile. The conoid’s sinusoidally modulated geometry introduces phase shifts to the wavefront, enabling passive control over signal flow, controllable transmission, reflection or redirection. Our device acts like a geometric gate, without requiring electronic components, electrical power or nonlinear media. We conducted simulations comparing standard planar wave propagation with waveforms modulated by PCIG. In PCIG, significant increases were detected in phase variance, indicating phase reshaping; in bandwidth expansion, leading to enhanced spectral resolution/information throughput; in information density, reflecting a denser wavefield encoding; in modulation depth, providing a broader dynamic range for signal expression. Still, PCIG emulates nonlinear propagation phenomena in linear media, enabling structured signal processing without material tuning. While electronic computers offer higher precision and general-purpose flexibility, Plücker-based systems provide low-energy alternatives for spatial computation based on parallel, analog signal processing, especially when computation is spatially embedded, inherently parallel and physically constrained. PCIG is well-suited for photonic/acoustic circuits operating without external energy inputs, for image processing and pattern recognition tasks, as an alternative to logic gates in neuromorphic systems and for reconfigurable metasurfaces and embedded sensor arrays requiring decentralized control. In particular, PCIG may be employed in extreme environments like underwater, aerospace or infrastructure monitoring.

Posted: 18 April 2025

https://doi.org/10.20944/preprints202504.1531.v1

Article

Computer Science and Mathematics

Hardware and Architecture

Adaptive NVM Word Compression Based on Cache Line Dynamics on Micro-Architecture

Jialin Wang

Zhen Yang

Zhenghao Yin

Yajuan Du

Abstract: With the explosive growth of big data in the era of artificial intelligence, emerging memory systems demand enhanced efficiency and scalability to address the limitations of conventional DRAM architectures. While DRAM remains prevalent for its high-speed operation, it is constrained by capacity restrictions, refresh power overhead, and scalability barriers. Non-volatile memory (NVM) technologies present a viable alternative with their inherent advantages of low refresh power consumption and superior scalability. However, NVM is faced with two critical challenges which are higher write latency and constrained write endurance. This paper proposes DCom, an adaptive compression that mitigates NVM write operations through intelligent data pattern analysis. DCom employs a dual-component architecture, i.e., a dynamic half-word cache that monitors word-level access patterns across various workload phases, and an adaptive frequency table that enables bit-width reduction compression for recurrent data patterns. By implementing selective compression based on real-time frequency analysis, DCom effectively reduces NVM write intensity while maintaining data integrity. We implement DCom on the Gem5 and NVMain simulators and demonstrate its effectiveness through experimental evaluation. The experiment result shows that DCom achieves substantial reduction in NVM writes and improves system performance by optimizing the compression of cache line data.

Posted: 15 April 2025

https://doi.org/10.20944/preprints202504.1172.v1

Review

Computer Science and Mathematics

Hardware and Architecture

A Survey on Advancements in Scheduling Techniques for Efficient Deep Learning Computations on GPUs

Rupinder Kaur

Arghavan Asad

Seham Al Abdul Wahid

Farah Mohammadi

Abstract: This comprehensive survey explores recent advancements in scheduling techniques for efficient deep learning computations on GPUs. The article highlights challenges related to parallel thread execution, resource utilization, and memory latency in GPUs, which can lead to suboptimal performance. The surveyed research focuses on novel scheduling policies to improve memory latency tolerance, exploit parallelism, and enhance GPU resource utilization. Additionally, it explores the integration of prefetching mechanisms, fine-grained warp scheduling, and warp switching strategies to optimize deep learning computations. Experimental evaluations demonstrate significant improvements in throughput, memory bank parallelism, and latency reduction. The insights gained from this survey can guide researchers, system designers, and practitioners in developing more efficient and powerful deep learning systems on GPUs. Furthermore, potential future research directions include advanced scheduling techniques, energy efficiency considerations, and the integration of emerging computing technologies. By continuously advancing scheduling techniques, the full potential of GPUs can be unlocked for a wide range of applications and domains, including GPU-accelerated deep learning, task scheduling, resource management, memory optimization, and more.

Posted: 20 February 2025

https://doi.org/10.20944/preprints202412.0276.v2

Article

Computer Science and Mathematics

Hardware and Architecture

Benchmarking Hyper-Breakpoints for Efficient Virtual Machine Introspection

Lukas Beierlieb

Alexander Schmitz

Christian Dietrich

Raphael Springer

Lukas Iffländer

Abstract: Virtual Machine Introspection (VMI) is a powerful technology used to detect and analyze malicious software inside Virtual Machines (VMs) from outside. Asynchronously accessing the VM’s memory can be insufficient for efficiently monitoring what is happening inside of a VM. Active VMI introduces breakpoints to intercept VM execution at relevant points. Especially for frequently visited breakpoints, it is crucial to keep their performance overhead as small as possible. In this paper, we provide an systematization of existing VMI breakpoint implementation variants, propose workloads to quantify the different performance penalties of breakpoints, and implement them in the benchmarking application bpbench. We used this benchmark to measure that, on an Intel Core i5 7300U, SmartVMI’s breakpoints take around 81 µs to handle, and keeping the breakpoint invisible costs an additional 21 µs per read access. The availability of bpbench allows the comparison of different breakpoint mechanisms, as well as their performance optimization with immediate feedback.

Posted: 03 January 2025

https://doi.org/10.20944/preprints202501.0111.v1

Article

Computer Science and Mathematics

Hardware and Architecture

Object Detection Post-Processing Accelerator Based on Co-Design of Hardware and Software

Dengtian Yang

Lan Chen

Xiaoran Hao

Mao Ni

Ming Chen

Yiheng Zhang

Abstract:

Deep learning significantly advances object detection. Post process, a critical component of this process, selects valid bounding boxes to represent true targets during inference and assigns boxes and labels to these objects during training to optimize the loss function. However, post process constitutes a substantial portion of the total processing time for a single image. This inefficiency primarily arises from the extensive Intersection over Union (IoU) calculations required between numerous redundant bounding boxes in post-processing algorithms. To reduce the redundant IoU calculations, we introduce a classification prioritization strategy during both training and inference post processes. Additionally, post process involves sorting operations that contribute to inefficiency. To minimize unnecessary comparisons in Top-K sorting, we have improved the bitonic sorter by developing a hybrid bitonic algorithm. These improvements have effectively accelerated post process. Given the similarities between training and inference post processes, we unify four typical post-processing algorithms and design a hardware accelerator based on this framework. Our accelerator achieves at least 7.55 times the speed in inference post process compared to recent accelerators. When compared to the RTX 2080 Ti system, our proposed accelerator offers at least 21.93 times the speed for training post process and 19.89 times for inference post process, thereby significantly enhancing the efficiency of loss function minimization.

Abstract:

Posted: 05 December 2024

https://doi.org/10.20944/preprints202412.0438.v1

of 4