Computer Science and Mathematics

Article
Computer Science and Mathematics
Hardware and Architecture

Amir Hameed Mir

Abstract: This paper presents a complete, constructive derivation of the Steane [[7,1,3]] quantum error-correcting code using a unified framework that bridges GF(4) algebra, binary symplectic representation, and stabilizer formalism. We demonstrate how classical coding theory, finite-field arithmetic, and symplectic geometry naturally converge to form a comprehensive foundation for quantum error correction. Starting from the classical Hamming [7,4,3] code, we provide explicit constructions showing: (1) how GF(4) encodes the Pauli group modulo phases, (2) how the symplectic inner product on F_2^{2n} captures commutativity, (3) how syndrome extraction reduces to binary matrix multiplication, and (4) how transversal Clifford gates emerge from symplectic transformations. The step-by-step derivation encompasses stabilizer construction, centralizer analysis, logical operator identification, code distance verification, and fault-tolerant syndrome measurement via flagged circuits. All results are derived using elementary finite-field and binary linear algebra, ensuring the exposition is self-contained and accessible. We further illustrate how this algebraic framework extends naturally to modern quantum LDPC codes. This work serves as both a pedagogical tutorial for students entering quantum error correction and a unified reference for researchers implementing stabilizer codes in practice.
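As a concrete companion to points (2) and (3), the following Python sketch builds the Steane stabilizers from the Hamming [7,4,3] parity-check matrix in binary symplectic form and extracts a syndrome by matrix arithmetic over F_2. The construction is the standard CSS one; the code layout is ours.

```python
import numpy as np

# Parity-check matrix of the classical Hamming [7,4,3] code
H = np.array([[0, 0, 0, 1, 1, 1, 1],
              [0, 1, 1, 0, 0, 1, 1],
              [1, 0, 1, 0, 1, 0, 1]], dtype=int)

n = 7
# Steane stabilizer generators in binary symplectic form (x-part | z-part):
# X-type generators put H in the x-part, Z-type generators in the z-part.
stabilizers = np.vstack([
    np.hstack([H, np.zeros_like(H)]),   # X stabilizers
    np.hstack([np.zeros_like(H), H]),   # Z stabilizers
])

def symplectic_product(u, v):
    """Symplectic inner product on F_2^{2n}; 0 iff the two Paulis commute."""
    return (u[:n] @ v[n:] + u[n:] @ v[:n]) % 2

# All six generators must mutually commute (the CSS condition H @ H.T = 0 mod 2).
assert all(symplectic_product(stabilizers[i], stabilizers[j]) == 0
           for i in range(6) for j in range(6))

# Syndrome extraction reduces to binary matrix multiplication: an X error on
# qubit index 2 is flagged exactly by the Z generators it anticommutes with.
e = np.zeros(2 * n, dtype=int)
e[2] = 1                                 # X error on qubit index 2
syndrome = np.array([symplectic_product(g, e) for g in stabilizers])
print(syndrome)                          # nonzero entries identify the error
```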
Concept Paper
Computer Science and Mathematics
Hardware and Architecture

Ezequiel Lapilover

Abstract: We introduce ESDM–SMTJ, an Entropic Semantic Dynamics Model implemented on classical probabilistic hardware based on superparamagnetic tunnel junctions (SMTJs). The model represents the internal state of a symbolic or cognitive system as a trajectory Σ(τ) in a layered state space, with τ ∈ [0, 1] interpreted as an internal computation time from initial query to final answer. Each expression e (for example 2 + 2 = ?) induces a program-specific dynamics U_τ^e that iteratively updates Σ(τ). Ambiguous operators such as “+” are treated as multi-modal: every occurrence admits a finite family of semantic modes i, and an entropic gate scores each mode by the predicted reduction ΔH_i^(k) of the output entropy if that mode is selected at position k. These scores are mapped to effective energy levels E_i^(k) = E_0 − κ·ΔH_i^(k) in a local SMTJ p-bit block, whose Boltzmann statistics implement a softmax distribution over modes at the hardware level. The resulting dynamics exhibits rumination (high-entropy plateaus), insight-like transitions (sharp entropy drops) and stabilization in low-entropy attractors, together with a natural notion of semantic commit at an internal time τ_c < 1 and a blind reveal of the output token via SMTJ readout at τ_f ≈ 1. We illustrate how simple arithmetical judgements—including rare anomalies such as 2 + 2 → 5 under mis-tuned parameters—can be expressed in this framework, and we outline a quantum extension in which semantic modes become basis states of a Hamiltonian with complex amplitudes instead of classical probabilities.
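The mapping from entropy-reduction scores to a hardware softmax follows directly from the Boltzmann statistics named in the abstract; a minimal sketch, with illustrative values for κ, kT and the mode scores:

```python
import numpy as np

def mode_distribution(delta_H, kappa=1.0, E0=0.0, kT=1.0):
    """Boltzmann weights over semantic modes, as an SMTJ p-bit block would
    realize them: E_i = E0 - kappa * delta_H_i, with p_i proportional to
    exp(-E_i / kT). The constant E0 cancels in the normalization, leaving
    a softmax over kappa * delta_H / kT."""
    E = E0 - kappa * np.asarray(delta_H, dtype=float)
    w = np.exp(-(E - E.min()) / kT)      # shift for numerical stability
    return w / w.sum()

# Three hypothetical modes of "+": arithmetic addition, string concatenation,
# and a spurious mis-tuned mode; the largest predicted entropy drop dominates.
p = mode_distribution([2.3, 0.4, 0.1], kappa=2.0)
print(p)
```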
Article
Computer Science and Mathematics
Hardware and Architecture

Abdulmunem A. Abdulsamad

,

Sándor R. Répás

Abstract: With the rapid growth of secure communication and data integrity needs in embedded and networked systems, there is a growing demand for cryptographic solutions that are not only secure but also energy- and area-efficient. While software-based SHA-3 implementations offer flexibility, they often fall short in meeting the tight performance and power budgets of modern resource-constrained environments. This paper presents a hardware-accelerated SHA-3 implementation optimised for the Xilinx Artix-7 FPGA. The proposed architecture features a fully pipelined Keccak-f[1600] core and leverages techniques such as partial loop unrolling, clock gating, and pipeline balancing to improve efficiency. Designed in VHDL and synthesised using Vivado 2024.2.2, the accelerator achieves a throughput of 1.35 Gbps at 210 MHz with a total power consumption of just 0.94 W—resulting in an energy efficiency of 1.44 Gbps/W. The design is validated against NIST SHA-3 test vectors and demonstrates a strong balance between speed, low power, and hardware utilisation. These characteristics make it well-suited for deployment in secure embedded applications, such as IoT devices, edge nodes, and real-time authentication systems.
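The reported efficiency figure is a direct ratio of the stated throughput and power; a one-line check:

```python
throughput_gbps = 1.35   # reported throughput at 210 MHz
power_w = 0.94           # reported total power consumption
print(f"{throughput_gbps / power_w:.2f} Gbps/W")   # -> 1.44 Gbps/W
```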
Article
Computer Science and Mathematics
Hardware and Architecture

Xinyao Li

,

Akhilesh Tyagi

Abstract: Side-channel attacks leveraging microarchitectural components such as caches and translation lookaside buffers (TLBs) pose increasing risks to cryptographic and machine-learning workloads. This paper presents a comparative study of performance and side-channel leakage under two page-size configurations—standard 4KB pages and 2MB huge pages—using paired attacker–victim experiments instrumented with both Performance Monitoring Unit (PMU) counters and precise per-access timing using rdtscp(). The victim executes repeated, key-dependent memory accesses across eight cryptographic modes (AES, ChaCha20, RSA, and ECC variants) while the attacker records eight PMU features per access (cpu-cycles, instructions, cache-references, cache-misses, etc.) and precise rdtscp() timing. The resulting traces are analyzed using a multilayer perceptron classifier to quantify key-dependent leakage. Results show that the 2MB huge-page configuration achieves a comparable key-classification accuracy (mean 0.79 vs. 0.77 for 4KB) while reducing average CPU cycles by approximately 11%. Page-index identification remains near random chance (3.6–3.7% for PMU side-channels and 1.5% for the timing side-channel), indicating no increase in measurable leakage at the page level. These findings suggest that huge-page mappings can improve runtime efficiency without amplifying observable side-channel vulnerabilities, offering a practical configuration for balancing performance and security in user-space cryptographic workloads.
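A sketch of the analysis pipeline described here, with synthetic stand-in data; the real study uses measured PMU counters and rdtscp() timings, and the feature and class counts below are assumptions:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Hypothetical stand-in traces: 8 PMU features per access (cpu-cycles,
# instructions, cache-references, cache-misses, ...) plus one rdtscp()
# timing column, labeled by an assumed key-dependent access class.
rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 9))           # 8 PMU counters + 1 timing column
y = rng.integers(0, 16, size=5000)       # e.g., 16 key-byte classes

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
clf = MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=300, random_state=0)
clf.fit(X_tr, y_tr)

# On real traces, accuracy above chance (1/16 here) indicates key-dependent
# leakage; on this random data it should sit near chance.
print("accuracy:", clf.score(X_te, y_te))
```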
Article
Computer Science and Mathematics
Hardware and Architecture

Hugo Puertas de Araújo

Abstract: This paper introduces the Spike Processing Unit (SPU), a novel digital spiking neuron model engineered for ultra-efficient hardware implementation. Departing from biologically plausible models, the SPU prioritizes computational performance by leveraging a discrete-time Infinite Impulse Response (IIR) filter with a key innovation: its coefficients are constrained to powers of two. This design eliminates the need for power-hungry digital multipliers, replacing them with simple bit-shift operations. Information is encoded using the inter-spike interval (ISI) format, which decouples signal representation from numerical precision. This allows the model to operate efficiently with low-precision 6-bit two’s complement integers without introducing representation errors. The model’s functionality is demonstrated through a temporal pattern discrimination task, where a single SPU is trained via a genetic algorithm to distinguish between specific input patterns and suppress noise, generating output spikes at distinct times. This proof-of-concept, validated in Python simulation, confirms the model’s core operational principle. The proposed approach provides a scalable and multiplier-free framework for Spiking Neural Networks, contributing to the advancement of energy-efficient neuromorphic computing.
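The multiplier-free core is easy to illustrate: with coefficients constrained to powers of two, each multiply becomes an arithmetic shift. A minimal sketch, assuming a first-order leaky-integrator filter form; the paper's exact filter order and coefficients are not given in the abstract:

```python
def spu_step(y_prev, x, shift_a=1, shift_b=2, bits=6):
    """One step of a first-order IIR with power-of-two coefficients:
    y[n] = y[n-1] - (y[n-1] >> shift_a) + (x[n] >> shift_b).
    Multipliers are replaced by arithmetic right shifts, and the state
    is saturated to a low-precision two's complement range (6-bit here)."""
    y = y_prev - (y_prev >> shift_a) + (x >> shift_b)
    lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    return max(lo, min(hi, y))           # clamp to [-32, 31] for 6 bits

# A single input pulse produces a decaying, membrane-like response.
y, trace = 0, []
for x in [24, 0, 0, 0, 0, 0, 0, 0]:
    y = spu_step(y, x)
    trace.append(y)
print(trace)   # [6, 3, 2, 1, 1, 1, 1, 1]
```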
Review
Computer Science and Mathematics
Hardware and Architecture

Pedro Ramos Brandao

Abstract: The exponential growth in global data generation has elevated the role of data centers in modern society. However, their immense energy requirements raise significant environmental concerns. This paper aims to demonstrate that current innovations in data center cooling systems, server placement architectures, and virtualization techniques are not only technologically advanced but also critical drivers of energy sustainability. Through an in-depth review of current research, development of key technological pathways, and detailed discussion supported by 40 scholarly references, we establish that sustainable data centers are not a futuristic ideal but a present necessity. The analysis is grounded in rigorous scientific methodologies, including thermodynamic modeling, computational fluid dynamics (CFD), and workload orchestration frameworks. By integrating energy-aware designs with cutting-edge software deployment models, data centers are being transformed from energy-intensive infrastructures into hubs of sustainable computational power. This transformation is supported not only by theoretical principles but also by a growing body of empirical data that demonstrates marked improvements in power usage effectiveness (PUE), carbon usage effectiveness (CUE), and overall sustainability metrics.
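For reference, the two headline metrics reduce to simple ratios; the figures in this sketch are illustrative, not drawn from the review:

```python
# PUE = total facility energy / IT equipment energy (ideal value 1.0).
# CUE = total CO2 emissions from facility energy / IT equipment energy.
it_energy_kwh = 8_000_000             # annual IT load (illustrative)
facility_energy_kwh = 11_200_000      # IT + cooling + distribution losses
co2_kg = facility_energy_kwh * 0.35   # assumed grid factor, kg CO2 per kWh

pue = facility_energy_kwh / it_energy_kwh
cue = co2_kg / it_energy_kwh
print(f"PUE = {pue:.2f}, CUE = {cue:.2f} kgCO2/kWh")   # PUE = 1.40, CUE = 0.49
```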
Article
Computer Science and Mathematics
Hardware and Architecture

Robin Gay

,

Tarek Ould-Bachir

Abstract: This paper presents an open and fully Chisel-based hardware acceleration framework tailored for high-performance FPGA platforms, with a specific focus on AMD/Xilinx Alveo UltraScale+ cards. While the high-level synthesis (HLS) flow offered by Xilinx enables rapid deployment and is well-suited for many applications, it can be overly abstract for low-level control scenarios such as ASIC prototyping. The alternative RTL Kernel flow offers finer control but often suffers from the limitations of legacy hardware description languages and the overhead of vendor-specific tooling. To address these limitations, we propose a fully open-source workflow based on Chisel, a modern hardware construction language embedded in Scala. Chisel combines the flexibility of object-oriented programming with the ability to generate synthesizable RTL, enabling scalable, reusable, and modular designs. Our framework demonstrates how Chisel can be used to implement advanced hardware features including AXI4/AXI4-Lite interfacing, multi-clock domain designs, asynchronous communication primitives, and enhanced simulation capabilities such as custom VCD trace generation. The use of the Vivado RTL flow bypasses the constraints imposed by the Xilinx golden image and XRT stack, allowing direct programming and fine-grained control over the FPGA fabric. Lightweight host communication is achieved via the XDMA IP and Linux device files, enabling platform-agnostic integration using standard programming languages such as C++ and Python. As a proof of concept, we implement a high-throughput matrix-vector multiplication engine for floating-point data in a self-alignment format (SAF), fully utilizing the resources of a multi-SLR Alveo U200 card. Benchmark results show efficient pipelined operation and full cross-SLR scalability, validating the viability of the proposed framework for custom acceleration pipelines.
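Host communication through XDMA device files can be sketched in a few lines; the device names follow the Xilinx xdma driver convention, while the channel numbers, offsets, and payload format below are assumptions about a particular platform design:

```python
import os
import struct

# Hypothetical host-side access through the XDMA character devices.
H2C, C2H = "/dev/xdma0_h2c_0", "/dev/xdma0_c2h_0"

def write_vector(data: bytes, offset: int = 0) -> None:
    fd = os.open(H2C, os.O_WRONLY)
    try:
        os.pwrite(fd, data, offset)          # DMA host-to-card transfer
    finally:
        os.close(fd)

def read_result(nbytes: int, offset: int = 0) -> bytes:
    fd = os.open(C2H, os.O_RDONLY)
    try:
        return os.pread(fd, nbytes, offset)  # DMA card-to-host transfer
    finally:
        os.close(fd)

# e.g., stream an 8-element float vector in, read the product vector back.
write_vector(struct.pack("<8f", *map(float, range(8))))
result = read_result(8 * 4)
```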
Article
Computer Science and Mathematics
Hardware and Architecture

Petru Cascaval

,

Doina Cascaval

Abstract: This research paper addresses the problem of testing n×1 random-access memories (RAMs) in which complex models of unlinked static neighborhood pattern-sensitive faults (NPSF) are considered. Specifically, two well-known fault models are addressed: the classical NPSF model that includes only memory faults sensitized by transition write operations and an extended NPSF model that covers faults sensitized by transition write operations as well as faults sensitized by non-transition writes or read operations. For these NPSF fault models, near-optimal multirun march memory tests suitable for implementation in embedded self-test logic are proposed. The assessment of the optimality is based on the fact that, for any group of cells corresponding to the NPSF model, the state graph is completely covered and each arc is traversed only once, which means that the graph is of the Eulerian type. Additional write operations are only required for data background changes. A characteristic of a memory test algorithm where multiple data backgrounds are applied is that test data is always correlated with the address of the accessed location. For easy implementation in embedded self-test logic, the proposed tests use 4×4 memory initialization patterns rather than the more difficult-to-implement 3×3 patterns, as is the case with other currently known near-optimal memory tests.
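For readers unfamiliar with march notation, the sketch below applies a march test to a simulated memory; MATS+ is shown for brevity, whereas the paper's NPSF tests use longer elements and multiple 4×4 data backgrounds:

```python
def apply_march(mem, elements):
    """Run a march test over a simulated bit-oriented memory. Each element is
    (direction, [ops]); an op is ('r', expected) or ('w', value). Returns the
    addresses where a read mismatched, i.e., where a fault was detected."""
    failures = []
    for direction, ops in elements:
        addrs = range(len(mem)) if direction == 'up' else reversed(range(len(mem)))
        for a in addrs:
            for op, val in ops:
                if op == 'w':
                    mem[a] = val
                elif mem[a] != val:
                    failures.append(a)
    return failures

# MATS+ as an illustration: {up(w0); up(r0,w1); down(r1,w0)}.
mem = [0] * 16
mats_plus = [('up',   [('w', 0)]),
             ('up',   [('r', 0), ('w', 1)]),
             ('down', [('r', 1), ('w', 0)])]
print(apply_march(mem, mats_plus))   # [] on a fault-free memory
```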
Article
Computer Science and Mathematics
Hardware and Architecture

Mirko Mariotti

,

Giulio Bianchini

,

Igor Neri

,

Daniele Spiga

,

Diego Ciangottini

,

Loriano Storchi

Abstract: Over the past years, the field of Machine and Deep Learning has seen strong development in both software and hardware, with an increase in specialised devices. One of the biggest challenges in this field is the inference phase, where the trained model makes predictions on unseen data. Although computationally powerful, traditional computing architectures face limitations in efficiently managing such requests, especially from an energy point of view. For this reason, the need arose to find alternative hardware solutions, and among these are Field Programmable Gate Arrays (FPGAs): their key feature of being reconfigurable, combined with parallel processing capability, low latency and low power consumption, makes these devices uniquely suited to accelerating inference tasks. In this paper, we present a novel approach to accelerating the inference phase of a Multi-Layer Perceptron (MLP) using BondMachine, an open-source framework for the design of hardware accelerators for FPGAs. Analyses of latency, energy consumption and resource usage, as well as comparisons with standard architectures and other FPGA approaches, are presented, highlighting the strengths and critical points of the proposed solution.
Article
Computer Science and Mathematics
Hardware and Architecture

Arturo Tozzi

Abstract: Computing hardware approaches face challenges related to spatial efficiency, thermal regulation, signal latency and manufacturing complexity. We evaluated the potential of Plücker conoid-inspired geometry (PCIG) as a wave modulation strategy for wave-based systems like optical/acoustic computing platforms. We propose optical transistors in which guided input beams interact with surfaces modulated according to a Plücker conoid profile. The conoid’s sinusoidally modulated geometry introduces phase shifts to the wavefront, enabling passive control over signal flow, with controllable transmission, reflection or redirection. Our device acts like a geometric gate, without requiring electronic components, electrical power or nonlinear media. We conducted simulations comparing standard planar wave propagation with waveforms modulated by PCIG. Compared with planar propagation, PCIG showed significant increases in phase variance, indicating phase reshaping; in bandwidth, leading to enhanced spectral resolution and information throughput; in information density, reflecting a denser wavefield encoding; and in modulation depth, providing a broader dynamic range for signal expression. Notably, PCIG emulates nonlinear propagation phenomena in linear media, enabling structured signal processing without material tuning. While electronic computers offer higher precision and general-purpose flexibility, Plücker-based systems provide low-energy alternatives for spatial computation based on parallel, analog signal processing, especially when computation is spatially embedded, inherently parallel and physically constrained. PCIG is well-suited for photonic/acoustic circuits operating without external energy inputs, for image processing and pattern recognition tasks, as an alternative to logic gates in neuromorphic systems, and for reconfigurable metasurfaces and embedded sensor arrays requiring decentralized control. In particular, PCIG may be employed in extreme environments like underwater, aerospace or infrastructure monitoring.
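The reported metrics can be reproduced qualitatively with a toy simulation: a sinusoidal phase profile, standing in for the conoid's modulation, raises both phase variance and spectral spread relative to a planar wavefront. All parameters here are illustrative:

```python
import numpy as np

x = np.linspace(-1, 1, 1024)
planar = np.exp(1j * 0 * x)                        # flat phase reference
conoid_phase = 0.8 * np.sin(2 * np.pi * 3 * x)     # sinusoidal phase profile
modulated = np.exp(1j * conoid_phase)

for name, field in [("planar", planar), ("PCIG", modulated)]:
    phase_var = np.var(np.angle(field))            # phase-reshaping metric
    spectrum = np.abs(np.fft.fft(field)) ** 2
    p = spectrum / spectrum.sum()
    p = p[p > 0]
    bandwidth = np.exp(-(p * np.log(p)).sum())     # effective bandwidth in bins
    print(f"{name}: phase variance {phase_var:.3f}, bandwidth {bandwidth:.1f}")
```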
Article
Computer Science and Mathematics
Hardware and Architecture

Jialin Wang

,

Zhen Yang

,

Zhenghao Yin

,

Yajuan Du

Abstract: With the explosive growth of big data in the era of artificial intelligence, emerging memory systems demand enhanced efficiency and scalability to address the limitations of conventional DRAM architectures. While DRAM remains prevalent for its high-speed operation, it is constrained by capacity restrictions, refresh power overhead, and scalability barriers. Non-volatile memory (NVM) technologies present a viable alternative with their inherent advantages of low refresh power consumption and superior scalability. However, NVM faces two critical challenges: higher write latency and constrained write endurance. This paper proposes DCom, an adaptive compression scheme that mitigates NVM write operations through intelligent data pattern analysis. DCom employs a dual-component architecture, i.e., a dynamic half-word cache that monitors word-level access patterns across various workload phases, and an adaptive frequency table that enables bit-width reduction compression for recurrent data patterns. By implementing selective compression based on real-time frequency analysis, DCom effectively reduces NVM write intensity while maintaining data integrity. We implement DCom on the Gem5 and NVMain simulators and demonstrate its effectiveness through experimental evaluation. The experimental results show that DCom achieves a substantial reduction in NVM writes and improves system performance by optimizing the compression of cache line data.
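A toy model of the two components, with an illustrative table size and half-word granularity; the paper's monitoring and replacement policies are more elaborate:

```python
from collections import Counter

class DComSketch:
    """Toy model of DCom's architecture: a counter over recently seen
    half-words (the dynamic cache) and a small frequency table mapping
    the hottest patterns to short indices (bit-width reduction)."""

    def __init__(self, table_size=8):
        self.counts = Counter()
        self.table_size = table_size

    def observe(self, halfword: int):
        self.counts[halfword] += 1

    def compress(self, line):
        table = [hw for hw, _ in self.counts.most_common(self.table_size)]
        out = []
        for hw in line:
            if hw in table:                          # frequent pattern:
                out.append(('idx', table.index(hw))) # 3-bit index for 8 entries
            else:                                    # infrequent pattern:
                out.append(('raw', hw))              # raw 16 bits
        return out

d = DComSketch()
for hw in [0x0000] * 10 + [0xFFFF] * 5 + [0x1234]:
    d.observe(hw)
print(d.compress([0x0000, 0xFFFF, 0xBEEF]))
# frequent half-words shrink to table indices; rare ones stay raw
```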
Review
Computer Science and Mathematics
Hardware and Architecture

Rupinder Kaur

,

Arghavan Asad

,

Seham Al Abdul Wahid

,

Farah Mohammadi

Abstract: This comprehensive survey explores recent advancements in scheduling techniques for efficient deep learning computations on GPUs. The article highlights challenges related to parallel thread execution, resource utilization, and memory latency in GPUs, which can lead to suboptimal performance. The surveyed research focuses on novel scheduling policies to improve memory latency tolerance, exploit parallelism, and enhance GPU resource utilization. Additionally, it explores the integration of prefetching mechanisms, fine-grained warp scheduling, and warp switching strategies to optimize deep learning computations. Experimental evaluations demonstrate significant improvements in throughput, memory bank parallelism, and latency reduction. The insights gained from this survey can guide researchers, system designers, and practitioners in developing more efficient and powerful deep learning systems on GPUs. Furthermore, potential future research directions include advanced scheduling techniques, energy efficiency considerations, and the integration of emerging computing technologies. By continuously advancing scheduling techniques, the full potential of GPUs can be unlocked for a wide range of applications and domains, including GPU-accelerated deep learning, task scheduling, resource management, memory optimization, and more.
Article
Computer Science and Mathematics
Hardware and Architecture

Lukas Beierlieb

,

Alexander Schmitz

,

Christian Dietrich

,

Raphael Springer

,

Lukas Iffländer

Abstract: Virtual Machine Introspection (VMI) is a powerful technology used to detect and analyze malicious software inside Virtual Machines (VMs) from outside. Asynchronously accessing the VM’s memory can be insufficient for efficiently monitoring what is happening inside of a VM. Active VMI introduces breakpoints to intercept VM execution at relevant points. Especially for frequently visited breakpoints, it is crucial to keep their performance overhead as small as possible. In this paper, we provide a systematization of existing VMI breakpoint implementation variants, propose workloads to quantify the different performance penalties of breakpoints, and implement them in the benchmarking application bpbench. We used this benchmark to measure that, on an Intel Core i5 7300U, SmartVMI’s breakpoints take around 81 µs to handle, and keeping the breakpoint invisible costs an additional 21 µs per read access. The availability of bpbench allows the comparison of different breakpoint mechanisms, as well as their performance optimization with immediate feedback.
Article
Computer Science and Mathematics
Hardware and Architecture

Dengtian Yang

,

Lan Chen

,

Xiaoran Hao

,

Mao Ni

,

Ming Chen

,

Yiheng Zhang

Abstract: Deep learning significantly advances object detection. Post-processing, a critical component of this pipeline, selects valid bounding boxes to represent true targets during inference and assigns boxes and labels to these objects during training to optimize the loss function. However, post-processing constitutes a substantial portion of the total processing time for a single image. This inefficiency primarily arises from the extensive Intersection over Union (IoU) calculations required between numerous redundant bounding boxes in post-processing algorithms. To reduce the redundant IoU calculations, we introduce a classification prioritization strategy during both training and inference post-processing. Additionally, post-processing involves sorting operations that contribute to inefficiency. To minimize unnecessary comparisons in Top-K sorting, we have improved the bitonic sorter by developing a hybrid bitonic algorithm. These improvements have effectively accelerated post-processing. Given the similarities between training and inference post-processing, we unify four typical post-processing algorithms and design a hardware accelerator based on this framework. Our accelerator achieves at least 7.55 times the speed in inference post-processing compared to recent accelerators. When compared to the RTX 2080 Ti system, our proposed accelerator offers at least 21.93 times the speed for training post-processing and 19.89 times for inference post-processing, thereby significantly enhancing the efficiency of loss function minimization.
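The IoU bottleneck and the benefit of filtering before any IoU work are easy to see in miniature; the sketch below uses plain NMS with a score pre-filter as a simple stand-in for the paper's classification prioritization strategy:

```python
import numpy as np

def iou(a, b):
    """IoU of two boxes in (x1, y1, x2, y2) form."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter + 1e-9)

def nms_class_first(boxes, scores, thr=0.5, score_thr=0.05):
    """NMS with a confidence pre-filter: discarding low-score boxes before
    any IoU computation removes most of the redundant pairwise IoU work."""
    order = [i for i in np.argsort(scores)[::-1] if scores[i] >= score_thr]
    kept = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) < thr for j in kept):
            kept.append(i)
    return kept

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
print(nms_class_first(boxes, [0.9, 0.8, 0.7]))   # -> [0, 2]
```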

Article
Computer Science and Mathematics
Hardware and Architecture

Chung-Wei Kuo

,

Wei Wei

,

Chun-Chang Lin

,

Yu-Yi Hong

,

Jia-Ruei Liu

,

Kuo-Yu Tsai

Abstract: 5G technology and IoT devices are improving efficiency and quality of life across many sectors. IoT devices are often used in open environments where they handle sensitive data. This makes them vulnerable to side-channel attacks (SCA), in which attackers intercept and analyze the electromagnetic signals emitted by microcontroller units (MCUs) to expose encryption keys and compromise sensitive data. To address this pressing issue, we propose a highly efficient key replacement mechanism tailored specifically for lightweight IoT microcontrollers. This mechanism establishes a secure Diffie-Hellman (D-H) channel for key transmission, effectively preventing key leakage and providing a strong defense against SCAs. The core of this solution lies in its integration of the Moving Target Defense (MTD) approach, dynamically updating encryption keys with each cryptographic cycle. Experimental results demonstrate that the proposed mechanism achieves key updates with minimal time overhead, ranging between 12 and 50 milliseconds per encryption transmission. More importantly, it exhibits resilience against template attacks: after 20,000 attack attempts, only 2 of 16 AES-128 subkeys were compromised, reflecting a significant improvement in the security of IoT devices. This dynamic key replacement mechanism dramatically reduces the risk of data leakage, offering an effective and scalable solution for lightweight IoT microcontroller applications that require both efficient performance and strong security.
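The key-rotation idea can be sketched with textbook Diffie-Hellman; the toy 64-bit modulus keeps the example readable, where a real deployment would use a standard 2048-bit MODP group or ECDH on a constrained MCU:

```python
import hashlib
import secrets

# Toy parameters for readability only: p = 2**64 - 59 (prime), generator 5.
p = 0xFFFFFFFFFFFFFFC5
g = 5

def fresh_key() -> bytes:
    """One MTD-style cycle: run a D-H exchange, hash the shared secret
    down to an AES-128 key. Both exponents are ephemeral, so every
    cryptographic cycle gets an independent key."""
    a = secrets.randbelow(p - 2) + 2          # device-side secret
    b = secrets.randbelow(p - 2) + 2          # server-side secret
    A, B = pow(g, a, p), pow(g, b, p)         # public values, exchanged in clear
    shared = pow(B, a, p)
    assert shared == pow(A, b, p)
    return hashlib.sha256(shared.to_bytes(8, 'big')).digest()[:16]

# A side-channel trace captured during one cycle is useless for the next.
print(fresh_key().hex())
print(fresh_key().hex())
```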
Article
Computer Science and Mathematics
Hardware and Architecture

Vedran Dakić

,

Karlo Bertina

,

Jasmin Redžepagić

,

Damir Regvart

Abstract: Integrating remote monitoring systems is crucial in the ever-changing field of data center management to enhance performance and guarantee reliability. This paper outlines a comprehensive strategy for monitoring remote servers by utilizing agents that establish connections to the Redfish API (Application Programming Interface) and the vSphere hypervisor API. Our solution uses the Redfish standard to provide secure and standardized management of hardware components in diverse server environments, improving interoperability and scalability. Simultaneously, the vSphere agent enables monitoring and hardware administration in vSphere-based virtualized environments, offering crucial insights into the state of the underlying hardware. This two-agent system simplifies the management of servers and seamlessly integrates with current data center infrastructures, enhancing efficiency. The policy-based alerting system built on top of these agents offers extensive capabilities by leveraging the alerting systems of both agents, which in turn can improve the capabilities of next-generation data centers.
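A minimal Redfish polling loop of the kind such an agent might run; the host, credentials, and the health field queried are assumptions, while the /redfish/v1/ paths and @odata.id linkage follow the Redfish standard:

```python
import requests

BASE = "https://bmc.example.com"        # hypothetical BMC endpoint
AUTH = ("monitor", "secret")            # hypothetical credentials

def get(path):
    r = requests.get(BASE + path, auth=AUTH, verify=False, timeout=10)
    r.raise_for_status()
    return r.json()

# Walk the standard Systems collection and report per-system health rollups,
# the kind of signal a policy-based alerting layer would consume.
for member in get("/redfish/v1/Systems")["Members"]:
    system = get(member["@odata.id"])
    print(system.get("Name"), system.get("Status", {}).get("HealthRollup"))
```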
Article
Computer Science and Mathematics
Hardware and Architecture

Henry Juarez Vargas

,

Roger Mijael Mansilla Huanacuni

,

Fred Torres Cruz

Abstract: The widespread adoption of the QWERTY keyboard layout, designed primarily for English, presents significant challenges for speakers of indigenous languages such as Quechua, particularly in the Puno region of Peru. This research examines the extent to which the QWERTY layout affects the writing and digital communication of Quechua speakers. Through an analysis of the Quechua language’s unique alphabet and character frequency, combined with insights from local speakers, we identify the limitations imposed by the QWERTY system on the efficient digital transcription of Quechua. The study further proposes alternative keyboard layouts, including optimizations of QWERTY and DVORAK, designed to enhance typing efficiency and reduce the digital divide for Quechua speakers. Our findings underscore the need for localized technological solutions to preserve linguistic diversity while improving digital literacy for indigenous communities. The proposed modifications offer a pathway toward more inclusive digital tools that respect and accommodate linguistic diversity.
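Character-frequency analysis of the kind described can be sketched directly; the sample sentence and the multigraph inventory below are illustrative, not the study's corpus. Quechua relies on digraphs and trigraphs (ch, chh, ch', ll, ...) for which QWERTY offers no single key, so frequency must be counted over these units rather than raw letters:

```python
from collections import Counter

sample = "allillanchu kachkanki ñuqaqa allillanmi kachkani"
multigraphs = ["chh", "ch'", "ch", "ll", "ñ"]   # longest units matched first

counts, text = Counter(), sample
for m in multigraphs:
    counts[m] = text.count(m)
    text = text.replace(m, " ")       # remove so shorter units aren't recounted
counts.update(c for c in text if c.isalpha())

for unit, n in counts.most_common(8):
    print(f"{unit!r}: {n}")
```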
Article
Computer Science and Mathematics
Hardware and Architecture

Nicholas Ayres

,

Lipika Deka

,

Daniel Paluszczyszyn

Abstract: The past 40 years have seen automotive Electronic Control Units (ECUs) move from being purely mechanically controlled to being primarily digitally controlled. While there have been significant improvements in terms of passenger safety and vehicle efficiency, including optimised fuel consumption, rising ECU numbers have resulted in increased vehicle weight, greater demands placed on power, more complex hardware and software, ad-hoc methods for updating software, and a subsequent rise in costs for both vehicle manufacturer and consumer. To address these issues, the research presented in this paper proposes virtualisation technologies to be applied within the automotive Electrical/Electronic (E/E) architecture. The proposed approach is evaluated through a comprehensive study of the CPU and memory resource requirements needed to support container-based ECU automotive functions. This comprehensive performance evaluation reveals that lightweight container virtualisation has the potential to bring about a paradigm shift in the E/E architecture, promoting consolidation and enhancing the architecture through power, weight and cost savings. Container-based virtualisation will also enable a robust mechanism to facilitate online dynamic software updates throughout the lifetime of a vehicle.
Article
Computer Science and Mathematics
Hardware and Architecture

Heonhui Jung

,

Hyunyoung Oh

Abstract: This study introduces a hardware accelerator to support various Post-Quantum Cryptosystem (PQC) schemes, addressing the quantum computing threat to cryptographic security. PQCs, while more secure, also bring significant computational demands, especially problematic for lightweight devices. Previous hardware accelerators are typically scheme-specific, which is inefficient given the National Institute of Standards and Technology (NIST)'s multiple finalists. Our approach focuses on the shared operations among these schemes, allowing a single design to accelerate multiple candidate PQCs at the same time. This is further enhanced by allocating resources according to performance profiling results. Our compact, scalable hardware accelerator supports four of the NIST PQC finalists, achieving an area efficiency of up to 81.85% compared to the current state-of-the-art multi-scheme accelerator while supporting twice as many schemes. The design demonstrates average throughput improvements ranging from 0.97× to 35.97× across the four schemes and their main operations, offering an efficient solution for implementing multiple PQC schemes within constrained hardware environments.
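One operation widely shared by lattice-based finalists is polynomial multiplication in Z_q[X]/(X^n + 1); the abstract does not name which shared operations the accelerator targets, so this schoolbook sketch, using Kyber's modulus q = 3329 and a reduced degree for readability, is purely illustrative (real cores use the NTT):

```python
def polymul(a, b, n, q):
    """Schoolbook multiplication in Z_q[X]/(X^n + 1): coefficients that
    wrap past degree n-1 come back negated, since X^n = -1 in this ring."""
    res = [0] * n
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            k = i + j
            if k < n:
                res[k] = (res[k] + ai * bj) % q
            else:
                res[k - n] = (res[k - n] - ai * bj) % q
    return res

# Tiny example with Kyber's q = 3329 and degree reduced to n = 4.
print(polymul([1, 2, 3, 4], [5, 6, 7, 8], n=4, q=3329))
```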
Review
Computer Science and Mathematics
Hardware and Architecture

Rupinder Kaur

,

Arghavan Asad

,

Farahnaz Mohammadi

Abstract: This comprehensive review explores the advancements in processing-in-memory (PIM) techniques for deep learning applications. It addresses the challenges faced by monolithic chip architectures and highlights the benefits of chiplet-based designs in terms of scalability, modularity, and flexibility. The review emphasizes the importance of dataflow-awareness, communication optimization, and thermal considerations in designing PIM-enabled manycore architectures. It discusses different machine learning workloads and their tailored dataflow requirements. Additionally, the review presents a heterogeneous PIM system for energy-efficient neural network training and discusses thermally efficient dataflow-aware monolithic 3D (M3D) NoC architectures for accelerating CNN inferencing. The advantages of TEFLON (Thermally Efficient Dataflow-Aware 3D NoC) over performance-optimized SFC-based counterparts are highlighted. Overall, this review provides valuable insights into the development and evaluation of chiplet and PIM architectures, emphasizing improved performance, energy efficiency, and inference accuracy in deep learning applications.
