Computer Science and Mathematics

Article
Computer Science and Mathematics
Hardware and Architecture

Arturo Tozzi

Abstract: Computing hardware faces challenges related to spatial efficiency, thermal regulation, signal latency and manufacturing complexity. We evaluated the potential of Plücker conoid-inspired geometry (PCIG) as a wave modulation strategy for wave-based systems such as optical/acoustic computing platforms. We propose optical transistors in which guided input beams interact with surfaces modulated according to a Plücker conoid profile. The conoid’s sinusoidally modulated geometry introduces phase shifts into the wavefront, enabling passive control over signal flow with controllable transmission, reflection or redirection. Our device acts as a geometric gate, requiring no electronic components, electrical power or nonlinear media. We conducted simulations comparing standard planar wave propagation with waveforms modulated by PCIG. With PCIG, we detected significant increases in phase variance, indicating phase reshaping; in bandwidth, enhancing spectral resolution and information throughput; in information density, reflecting denser wavefield encoding; and in modulation depth, providing a broader dynamic range for signal expression. Notably, PCIG emulates nonlinear propagation phenomena in linear media, enabling structured signal processing without material tuning. While electronic computers offer higher precision and general-purpose flexibility, Plücker-based systems provide low-energy alternatives for spatial computation based on parallel, analog signal processing, especially when computation is spatially embedded, inherently parallel and physically constrained. PCIG is well suited for photonic/acoustic circuits operating without external energy inputs, for image processing and pattern recognition tasks, as an alternative to logic gates in neuromorphic systems, and for reconfigurable metasurfaces and embedded sensor arrays requiring decentralized control. In particular, PCIG may be employed in extreme environments such as underwater, aerospace or infrastructure monitoring.
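As a rough numerical illustration of the idea, the sketch below imprints a conoid-like sinusoidal phase profile on a plane wave and reports the resulting phase variance; the height function z(θ) = h·sin(2θ) and the depth h are our assumptions, not parameters from the paper.

```python
import numpy as np

# Toy sketch of PCIG-style phase modulation (our reading of the abstract):
# a Plücker-conoid-like surface z(theta) = h*sin(2*theta) imprints a
# proportional phase delay on an incident plane wave.
h = 0.5                      # assumed modulation depth (radians)
n = 256
x = np.linspace(-1, 1, n)
X, Y = np.meshgrid(x, x)
theta = np.arctan2(Y, X)

planar = np.ones((n, n), dtype=complex)              # unmodulated plane wave
pcig = planar * np.exp(1j * h * np.sin(2 * theta))   # conoid-modulated wave

for name, field in [("planar", planar), ("PCIG", pcig)]:
    phase = np.angle(field)
    print(f"{name}: phase variance = {phase.var():.4f}")
```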
Article
Computer Science and Mathematics
Hardware and Architecture

Jialin Wang,

Zhen Yang,

Zhenghao Yin,

Yajuan Du

Abstract: With the explosive growth of big data in the era of artificial intelligence, emerging memory systems demand enhanced efficiency and scalability to address the limitations of conventional DRAM architectures. While DRAM remains prevalent for its high-speed operation, it is constrained by capacity restrictions, refresh power overhead, and scalability barriers. Non-volatile memory (NVM) technologies present a viable alternative with their inherent advantages of low refresh power consumption and superior scalability. However, NVM faces two critical challenges: higher write latency and constrained write endurance. This paper proposes DCom, an adaptive compression scheme that mitigates NVM write operations through intelligent data pattern analysis. DCom employs a dual-component architecture: a dynamic half-word cache that monitors word-level access patterns across workload phases, and an adaptive frequency table that enables bit-width-reduction compression for recurrent data patterns. By implementing selective compression based on real-time frequency analysis, DCom effectively reduces NVM write intensity while maintaining data integrity. We implement DCom on the Gem5 and NVMain simulators and demonstrate its effectiveness through experimental evaluation. The results show that DCom achieves a substantial reduction in NVM writes and improves system performance by optimizing the compression of cache line data.
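A minimal sketch of the frequency-table idea as we read it: recurring 16-bit half-words are replaced by short table indices, shrinking the written bit-width. Table size, index width, and the trace below are illustrative only.

```python
from collections import Counter

def compress_line(words, table):
    """Encode each 16-bit half-word as a short table index when possible.
    Returns (encoded, bits_saved); names and sizes are illustrative only."""
    encoded, saved = [], 0
    for w in words:
        hi, lo = (w >> 16) & 0xFFFF, w & 0xFFFF
        for half in (hi, lo):
            if half in table:                  # frequent pattern: 4-bit index
                encoded.append(("idx", table[half]))
                saved += 16 - 4
            else:                              # raw 16-bit half-word
                encoded.append(("raw", half))
    return encoded, saved

# Build a toy frequency table from observed traffic (top-16 half-words).
trace = [0x0000_0001, 0x0000_00FF, 0x0000_0001, 0xDEAD_0001]
halves = Counter()
for w in trace:
    halves.update([(w >> 16) & 0xFFFF, w & 0xFFFF])
table = {h: i for i, (h, _) in enumerate(halves.most_common(16))}

enc, saved = compress_line(trace, table)
print(f"bits saved on this line: {saved}")
```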
Review
Computer Science and Mathematics
Hardware and Architecture

Rupinder Kaur,

Arghavan Asad,

Seham Al Abdul Wahid,

Farah Mohammadi

Abstract: This comprehensive survey explores recent advancements in scheduling techniques for efficient deep learning computations on GPUs. The article highlights challenges related to parallel thread execution, resource utilization, and memory latency in GPUs, which can lead to suboptimal performance. The surveyed research focuses on novel scheduling policies to improve memory latency tolerance, exploit parallelism, and enhance GPU resource utilization. Additionally, it explores the integration of prefetching mechanisms, fine-grained warp scheduling, and warp switching strategies to optimize deep learning computations. Experimental evaluations demonstrate significant improvements in throughput, memory bank parallelism, and latency reduction. The insights gained from this survey can guide researchers, system designers, and practitioners in developing more efficient and powerful deep learning systems on GPUs. Furthermore, potential future research directions include advanced scheduling techniques, energy efficiency considerations, and the integration of emerging computing technologies. By continuously advancing scheduling techniques, the full potential of GPUs can be unlocked for a wide range of applications and domains, including GPU-accelerated deep learning, task scheduling, resource management, memory optimization, and more.
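To make the scheduling-policy discussion concrete, here is a toy contrast between two classic warp scheduling policies covered in this literature, loose round-robin (LRR) and greedy-then-oldest (GTO); the warp IDs and selection rules are simplified for illustration.

```python
# Sketch of two classic warp scheduling policies: loose round-robin (LRR)
# rotates over ready warps, while greedy-then-oldest (GTO) keeps issuing
# from the current warp until it stalls, then falls back to the oldest.
def lrr(ready, last):
    order = sorted(ready)
    for w in order:
        if w > last:
            return w
    return order[0]          # wrap around to the lowest-numbered warp

def gto(ready, current):
    if current in ready:
        return current       # keep issuing from the same warp
    return min(ready)        # oldest ready warp (lowest ID here)

ready = {2, 5, 7}
print(lrr(ready, last=5))      # -> 7
print(gto(ready, current=3))   # -> 2
```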
Article
Computer Science and Mathematics
Hardware and Architecture

Lukas Beierlieb,

Alexander Schmitz,

Christian Dietrich,

Raphael Springer,

Lukas Iffländer

Abstract: Virtual Machine Introspection (VMI) is a powerful technology used to detect and analyze malicious software inside Virtual Machines (VMs) from the outside. Asynchronously accessing the VM’s memory can be insufficient for efficiently monitoring what is happening inside a VM. Active VMI introduces breakpoints to intercept VM execution at relevant points. Especially for frequently visited breakpoints, it is crucial to keep their performance overhead as small as possible. In this paper, we provide a systematization of existing VMI breakpoint implementation variants, propose workloads to quantify the different performance penalties of breakpoints, and implement them in the benchmarking application bpbench. We used this benchmark to measure that, on an Intel Core i5-7300U, SmartVMI’s breakpoints take around 81 µs to handle, and keeping a breakpoint invisible costs an additional 21 µs per read access. The availability of bpbench allows the comparison of different breakpoint mechanisms, as well as their performance optimization with immediate feedback.
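The measurement methodology generalizes beyond VMI: time a workload with and without the intercepted operation and report the per-hit overhead. A toy harness in that spirit (a Python callback standing in for the breakpoint handler) might look like this; it is not bpbench itself.

```python
import time

# Methodology sketch behind a breakpoint microbenchmark: time a loop with
# and without the intercepted operation and report per-hit overhead.
N = 100_000

def workload(handler=None):
    t0 = time.perf_counter()
    for _ in range(N):
        if handler:
            handler()          # stands in for the breakpoint handling path
    return time.perf_counter() - t0

baseline = workload()
with_bp = workload(handler=lambda: None)
print(f"per-hit overhead: {(with_bp - baseline) / N * 1e9:.1f} ns")
```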
Article
Computer Science and Mathematics
Hardware and Architecture

Dengtian Yang,

Lan Chen,

Xiaoran Hao,

Mao Ni,

Ming Chen,

Yiheng Zhang

Abstract: Deep learning significantly advances object detection. Post-processing, a critical component of the detection pipeline, selects valid bounding boxes to represent true targets during inference and assigns boxes and labels to these objects during training to optimize the loss function. However, post-processing constitutes a substantial portion of the total processing time for a single image. This inefficiency primarily arises from the extensive Intersection over Union (IoU) calculations required between numerous redundant bounding boxes. To reduce redundant IoU calculations, we introduce a classification prioritization strategy in both the training and inference post-processes. Additionally, post-processing involves sorting operations that contribute to inefficiency; to minimize unnecessary comparisons in Top-K sorting, we improve the bitonic sorter with a hybrid bitonic algorithm. These improvements effectively accelerate post-processing. Given the similarities between the training and inference post-processes, we unify four typical post-processing algorithms and design a hardware accelerator based on this framework. Our accelerator achieves at least 7.55× the inference post-processing speed of recent accelerators. Compared to an RTX 2080 Ti system, it offers at least 21.93× the speed for training post-processing and 19.89× for inference post-processing, thereby significantly enhancing the efficiency of loss function minimization.
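A small sketch of the classification-prioritization idea as we understand it: low-confidence boxes are discarded before any IoU is computed, so IoU pairs are formed only among survivors. Thresholds and box format are illustrative.

```python
import numpy as np

def iou(a, b):
    # boxes as (x1, y1, x2, y2)
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter + 1e-9)

def nms_class_first(boxes, scores, score_thr=0.3, iou_thr=0.5):
    """Score-first NMS sketch: discard low-confidence boxes *before* any
    IoU is computed, so IoU pairs are only formed among survivors."""
    keep_idx = [i for i in np.argsort(scores)[::-1] if scores[i] >= score_thr]
    kept = []
    for i in keep_idx:
        if all(iou(boxes[i], boxes[j]) <= iou_thr for j in kept):
            kept.append(i)
    return kept

boxes = [(0, 0, 10, 10), (1, 1, 10, 10), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.6]
print(nms_class_first(boxes, scores))   # -> [0, 2]
```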

Article
Computer Science and Mathematics
Hardware and Architecture

Chung-Wei Kuo,

Wei Wei,

Chun-Chang Lin,

Yu-Yi Hong,

Jia-Ruei Liu,

Kuo-Yu Tsai

Abstract: 5G technology and IoT devices are improving efficiency and quality of life across many sectors. IoT devices are often used in open environments where they handle sensitive data. This makes them vulnerable to side-channel attacks (SCAs), in which attackers intercept and analyze the electromagnetic signals emitted by microcontroller units (MCUs) to expose encryption keys and compromise sensitive data. To address this pressing issue, we propose a highly efficient key replacement mechanism tailored specifically for lightweight IoT microcontrollers. This mechanism establishes a secure Diffie-Hellman (D-H) channel for key transmission, effectively preventing key leakage and providing a strong defense against SCAs. The core of this solution lies in its integration of the Moving Target Defense (MTD) approach, dynamically updating encryption keys with each cryptographic cycle. Experimental results demonstrate that the proposed mechanism achieves key updates with minimal time overhead, ranging between 12 and 50 milliseconds per encryption transmission. More importantly, it exhibits resilience against template attacks: after 20,000 attack attempts, only 2 of 16 AES-128 subkeys were compromised, reflecting a significant improvement in the security of IoT devices. This dynamic key replacement mechanism dramatically reduces the risk of data leakage, offering an effective and scalable solution for lightweight IoT microcontroller applications that require both efficient performance and strong security.
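A toy sketch of the mechanism's two ingredients, with deliberately insecure parameters: a D-H exchange seeds a shared secret, and an MTD-style hash ratchet then derives a fresh AES-128 key every cryptographic cycle. The group size and key derivation here are illustrative, not the paper's.

```python
import hashlib, secrets

# Toy sketch (NOT secure parameters): D-H seeds a shared secret, then an
# MTD-style hash ratchet derives a fresh AES-128 key every cycle.
p = 0xFFFFFFFB   # small illustrative prime; real use needs >=2048-bit groups
g = 5

a = secrets.randbelow(p - 2) + 2        # device private value
b = secrets.randbelow(p - 2) + 2        # gateway private value
A, B = pow(g, a, p), pow(g, b, p)       # public values exchanged in clear
shared = pow(B, a, p)                   # == pow(A, b, p)

key = hashlib.sha256(shared.to_bytes(8, "big")).digest()[:16]
for cycle in range(3):                  # new key each cryptographic cycle
    print(f"cycle {cycle}: AES-128 key = {key.hex()}")
    key = hashlib.sha256(key).digest()[:16]   # ratchet forward
```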
Article
Computer Science and Mathematics
Hardware and Architecture

Vedran Dakić,

Karlo Bertina,

Jasmin Redžepagić,

Damir Regvart

Abstract: Integrating remote monitoring systems is crucial in the ever-changing field of data center management to enhance performance and guarantee reliability. This paper outlines a comprehensive strategy for monitoring remote servers by utilizing agents that connect to the Redfish API (Application Programming Interface) and the vSphere hypervisor API. Our solution uses the Redfish standard to provide secure and standardized management of hardware components in diverse server environments, improving interoperability and scalability. Simultaneously, the vSphere agent enables monitoring and hardware administration in vSphere-based virtualized environments, offering crucial insights into the state of the underlying hardware. This two-agent system simplifies server management and integrates seamlessly with current data center infrastructures, enhancing efficiency. The policy-based alerting system built on top of these agents leverages both agents' alerting capabilities, which in turn can improve the capabilities of next-generation data centers.
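A minimal Redfish polling sketch in the spirit of such an agent: the host, credentials, and chassis ID are placeholders, and the Thermal resource layout follows the public Redfish schema (newer services expose ThermalSubsystem instead).

```python
import requests

# Minimal Redfish polling sketch; host, credentials and chassis ID are
# placeholders, and error handling is reduced to the essentials.
BASE = "https://bmc.example.com"
AUTH = ("monitor", "secret")

def poll_thermal(chassis_id="1"):
    url = f"{BASE}/redfish/v1/Chassis/{chassis_id}/Thermal"
    # verify=False only for lab BMCs with self-signed certificates
    resp = requests.get(url, auth=AUTH, verify=False, timeout=10)
    resp.raise_for_status()
    for t in resp.json().get("Temperatures", []):
        print(t.get("Name"), t.get("ReadingCelsius"))

poll_thermal()
```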
Article
Computer Science and Mathematics
Hardware and Architecture

Henry Juarez Vargas,

Roger Mijael Mansilla Huanacuni,

Fred Torres Cruz

Abstract: The widespread adoption of the QWERTY keyboard layout, designed primarily for English, presents significant challenges for speakers of indigenous languages such as Quechua, particularly in the Puno region of Peru. This research examines the extent to which the QWERTY layout affects the writing and digital communication of Quechua speakers. Through an analysis of the Quechua language’s unique alphabet and character frequency, combined with insights from local speakers, we identify the limitations imposed by the QWERTY system on the efficient digital transcription of Quechua. The study further proposes alternative keyboard layouts, including optimizations of QWERTY and DVORAK, designed to enhance typing efficiency and reduce the digital divide for Quechua speakers. Our findings underscore the need for localized technological solutions to preserve linguistic diversity while improving digital literacy for indigenous communities. The proposed modifications offer a pathway toward more inclusive digital tools that respect and accommodate linguistic diversity.
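A first step in such an analysis is a character-frequency count over Quechua text; the snippet below sketches this, noting that a real study would tokenize Quechua digraphs (ch, ll, q', etc.) as single units rather than raw characters. The sample string is illustrative.

```python
from collections import Counter
import unicodedata

def char_frequencies(text):
    """Count letter frequencies; a fuller analysis would treat Quechua
    digraphs (ch, ll, q', ...) as single symbols rather than raw chars."""
    text = unicodedata.normalize("NFC", text.lower())
    return Counter(c for c in text if c.isalpha() or c == "'")

sample = "allinllachu kachkanki"   # illustrative Quechua snippet
for ch, n in char_frequencies(sample).most_common(5):
    print(ch, n)
```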
Article
Computer Science and Mathematics
Hardware and Architecture

Nicholas Ayres,

Lipika Deka,

Daniel Paluszczyszyn

Abstract: The past 40 years have seen automotive Electronic Control Units (ECUs) move from being purely mechanically controlled to being primarily digitally controlled. While there have been significant improvements in passenger safety and vehicle efficiency, including optimised fuel consumption, rising ECU numbers have resulted in increased vehicle weight, greater demands on power, more complex hardware and software, ad-hoc methods for updating software, and a subsequent rise in costs for both vehicle manufacturer and consumer. To address these issues, the research presented in this paper proposes applying virtualisation technologies within the automotive Electrical/Electronic (E/E) architecture. The proposed approach is evaluated through a comprehensive study of the CPU and memory resource requirements needed to support container-based ECU automotive functions. This performance evaluation reveals that lightweight container virtualisation has the potential to bring a paradigm shift to the E/E architecture, promoting consolidation and enhancing the architecture through power, weight and cost savings. Container-based virtualisation will also enable a robust mechanism for online dynamic software updates throughout the lifetime of a vehicle.
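The kind of per-container measurement involved can be sketched as below, sampling CPU and memory from cgroup v2 files; the cgroup path is a placeholder, and the authors' actual methodology may differ.

```python
import time

# Sketch of per-container resource sampling via cgroup v2 files; the
# cgroup path is a placeholder for the container under test.
CG = "/sys/fs/cgroup/system.slice/docker-abc123.scope"

def cpu_usec():
    with open(f"{CG}/cpu.stat") as f:
        for line in f:
            key, val = line.split()
            if key == "usage_usec":
                return int(val)

def mem_bytes():
    with open(f"{CG}/memory.current") as f:
        return int(f.read())

def sample(period=1.0):
    u0 = cpu_usec()
    time.sleep(period)
    u1 = cpu_usec()
    print(f"CPU: {(u1 - u0) / (period * 1e4):.1f}%  "
          f"mem: {mem_bytes() >> 20} MiB")

sample()
```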
Article
Computer Science and Mathematics
Hardware and Architecture

Heonhui Jung,

Hyunyoung Oh

Abstract: This study introduces a hardware accelerator to support various Post-Quantum Cryptosystem (PQC) schemes, addressing the quantum computing threat to cryptographic security. PQCs, while more secure, also bring significant computational demands, which are especially problematic for lightweight devices. Previous hardware accelerators are typically scheme-specific, which is inefficient given the National Institute of Standards and Technology (NIST)'s multiple finalists. Our approach focuses on the operations shared among these schemes, allowing a single design to accelerate multiple candidate PQCs at the same time. This is further enhanced by allocating resources according to performance profiling results. Our compact, scalable hardware accelerator supports four of the NIST PQC finalists, achieving an area efficiency of up to 81.85% compared to the current state-of-the-art multi-scheme accelerator while supporting twice as many schemes. The design demonstrates average throughput improvements ranging from 0.97× to 35.97× across the four schemes and their main operations, offering an efficient solution for implementing multiple PQC schemes within constrained hardware environments.
Review
Computer Science and Mathematics
Hardware and Architecture

Rupinder Kaur,

Arghavan Asad,

Farahnaz Mohammadi

Abstract: This comprehensive review explores the advancements in processing-in-memory (PIM) techniques for deep learning applications. It addresses the challenges faced by monolithic chip architectures and highlights the benefits of chiplet-based designs in terms of scalability, modularity, and flexibility. The review emphasizes the importance of dataflow-awareness, communication optimization, and thermal considerations in designing PIM-enabled manycore architectures. It discusses different machine learning workloads and their tailored dataflow requirements. Additionally, the review presents a heterogeneous PIM system for energy-efficient neural network training and discusses thermally efficient dataflow-aware monolithic 3D (M3D) NoC architectures for accelerating CNN inferencing. The advantages of TEFLON (Thermally Efficient Dataflow-Aware 3D NoC) over performance-optimized SFC-based counterparts are highlighted. Overall, this review provides valuable insights into the development and evaluation of chiplet and PIM architectures, emphasizing improved performance, energy efficiency, and inference accuracy in deep learning applications.
Article
Computer Science and Mathematics
Hardware and Architecture

Alejandro Juarez-Lora,

Victor H. Ponce-Ponce,

Humberto Sossa-Azuela,

Osvaldo Espinosa-Sosa,

Elsa Rubio-Espino

Abstract: In this article, we propose a circuit that imitates the behavior of a Reward-Modulated Spike-Timing-Dependent Plasticity (R-STDP) synapse. When two neurons in adjacent layers produce spikes, each spike modifies the thickness of the common synapse. As a result, the synapse’s ability to conduct impulses is controlled, leading to an unsupervised learning rule. By introducing a reward signal, reinforcement learning is enabled: the growth and shrinkage of synapses is redirected based on feedback from the environment. The proposed synapse manages the convolution of the emitted spike signals to promote either the strengthening or weakening of the synapse, which is represented as the resistance value of a memristor device. As memristors have a conductance range that may differ from the available current input range of typical CMOS neuron designs, the synapse circuit can be adjusted to regulate the spike’s amplitude current to comply with the neuron. The circuit described in this work allows for the implementation of fully interconnected layers of analog neuron circuits. This is achieved by having each synapse reshape the spike signal, thus removing from the neurons the burden of providing enough power to each memristor. The synapse circuit was tested using a CMOS analog neuron described in the literature. Additionally, the article provides insight into how to properly describe the hysteresis behavior of the memristor in Verilog-A code. The testing and learning capabilities of the synapse circuit are demonstrated in simulation using the SkyWater 130 nm process. The article’s main goal is to provide the basic building blocks for Deep Neural Networks relying on spiking neurons and memristors as the basic processing elements to handle spike generation, propagation, and synaptic plasticity.
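A minimal numerical sketch of reward-modulated STDP as described: a pair-based STDP window produces an eligibility value, and the reward signal decides whether the memristive weight actually grows or shrinks. All constants are illustrative.

```python
import math

# Minimal reward-modulated STDP sketch: pair-based STDP builds an
# eligibility trace; the reward signal decides whether the synapse
# (memristor conductance) actually grows or shrinks.
A_PLUS, A_MINUS = 0.01, 0.012      # illustrative learning amplitudes
TAU = 20e-3                        # STDP time constant (s)

def stdp_window(dt):
    """dt = t_post - t_pre; positive dt potentiates, negative depresses."""
    if dt >= 0:
        return A_PLUS * math.exp(-dt / TAU)
    return -A_MINUS * math.exp(dt / TAU)

w = 0.5                            # normalized weight / conductance
trace = stdp_window(5e-3)          # post fired 5 ms after pre
reward = +1.0                      # environment feedback in [-1, 1]
w = min(1.0, max(0.0, w + reward * trace))
print(f"updated weight: {w:.4f}")
```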
Review
Computer Science and Mathematics
Hardware and Architecture

Alexander Tekles,

Nico Mexis,

Stefan Katzenbeisser

Abstract: Over the last decade, a lot of research has been conducted on memristors. Most of this research focusses on using memristors for Artificial Intelligence (AI) applications and to fabricate non-volatile memory, but also the security aspects of memristors have been examined. The current study summarises and compares five reviews on the security aspects of memristors. These reviews cover two different perspectives: (1) security applications of memristors such as Physical Unclonable Functions (PUFs) or True Random Number Generators (TRNGs), and (2) potential threats when using memristors to train and store neural networks. The comparison of the reviews reveals that different sets of studies are included in the reviews and different characterisations of the studies are provided. This shows that different perspectives are necessary to get a comprehensive overview of the security aspects of memristors. By synthesising the perspectives of different reviews, this study helps to get such an overview.
Article
Computer Science and Mathematics
Hardware and Architecture

Binh Kieu-Do-Nguyen,

Nguyen The Binh,

Cuong Pham-Quoc,

Phuc Nghi Huynh,

Ngoc-Thinh Tran,

Trong-Thuc Hoang,

Cong-Kha Pham

Abstract: In the era of the post-quantum Internet of Things (IoT), implementing quantum-resistant cryptographic algorithms in numerous terminals can successfully defend against prospective quantum computing attacks. Lattice-based cryptography can withstand quantum computing attacks, making it a viable substitute for the currently prevalent classical public-key cryptographic techniques. Nevertheless, the algorithms’ significant time complexity places a substantial computational burden on the edge computing chip in the IoT terminal. Polynomial multiplication is the most demanding task in lattice-based cryptographic algorithms, so investigating efficient methods for calculating it is highly important. The fast number-theoretic transform (NTT) is a widely employed technique to accelerate polynomial multiplication. This study presents a hardware implementation of an efficient number-theoretic transform. We utilize a multi-level pipeline architecture in the design to accomplish parallel calculations and implement it on a low-cost Artix-7 XC7A100T FPGA device. The performance evaluation results demonstrate that our implementation significantly enhances performance and reduces resource usage compared to other existing proposals on the same platform. Thanks to its small and low-latency design, the suggested NTT core can be implemented in edge computing chips to enhance computational speed. The experimental results show that the proposed design, which supports both NTT and inverse NTT, achieves 417 MHz and consumes only 541 LUTs on the Artix-7 XC7A100T.
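For reference, the arithmetic the pipeline accelerates can be sketched in a few lines: a naive NTT over Z_17 with n = 8 (ω = 9 is a primitive 8th root of unity) performs cyclic polynomial multiplication via pointwise products. The hardware replaces these loops with pipelined butterfly units.

```python
# Cyclic polynomial multiplication via a naive NTT over Z_q, q = 17, n = 8.
# omega = 9 is a primitive 8th root of unity mod 17 (9^4 = 16 = -1 mod 17).
Q, N, OMEGA = 17, 8, 9

def ntt(a, w):
    return [sum(a[j] * pow(w, j * k, Q) for j in range(N)) % Q
            for k in range(N)]

def poly_mul(a, b):
    fa, fb = ntt(a, OMEGA), ntt(b, OMEGA)
    fc = [(x * y) % Q for x, y in zip(fa, fb)]
    inv_n = pow(N, Q - 2, Q)                       # N^-1 mod Q (Fermat)
    return [(inv_n * c) % Q for c in ntt(fc, pow(OMEGA, Q - 2, Q))]

a = [1, 2, 3, 4, 0, 0, 0, 0]
b = [5, 6, 7, 0, 0, 0, 0, 0]
print(poly_mul(a, b))   # coefficients of a*b mod (x^8 - 1), coefficients mod 17
```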
Communication
Computer Science and Mathematics
Hardware and Architecture

Peter Schulz,

Grigore Sleahtitchi

Abstract: Edge computing enables radio-connected mobile systems to offload computing tasks to server-side resources. The proximity of an edge computing node avoids excessive latencies, which benefits real-time algorithms. Special computing needs arise because mobile systems are often so limited in weight and energy that they cannot carry powerful on-board computers. We propose the use of FPGA-based co-processors (FPGA: Field Programmable Gate Array) to handle computations in an edge node. The calculation of the Fast Fourier Transform (FFT) is presented as an example of a co-processor. The use of FPGA-based co-processors poses a particular challenge when a mobile system leaves its radio cell and the computing context must be transferred to another edge node. The article first addresses specific edge computing requirements, such as case-by-case reconfiguration of computing hardware and the handover mechanism from one edge node to another. Using the example of an FPGA-based FFT co-processor, we describe its development, which was carried out under the condition that mobile clients can request different co-processors and can also change the edge node when changing the radio cell. The latter requires passing the co-processor context. For the FPGA, this means that the co-processor is part of a partially reconfigurable environment and must support the handover mechanisms in hardware. Finally, we indicate the required FPGA resources and compare them with alternative solutions.
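The handover requirement can be sketched in software: run a radix-2 FFT stage by stage and treat (stage index, intermediate array) as the co-processor context that one node serializes and the next node resumes from. This mirrors the concept only; the paper implements it in partially reconfigurable FPGA hardware.

```python
import cmath

def bit_reverse(a):
    n = len(a); j = 0; a = list(a)
    for i in range(1, n):
        bit = n >> 1
        while j & bit:
            j ^= bit; bit >>= 1
        j |= bit
        if i < j:
            a[i], a[j] = a[j], a[i]
    return a

def fft_stage(a, stage):
    """Run one radix-2 butterfly stage; (stage, a) is the handover context."""
    n, m = len(a), 2 << stage
    w_m = cmath.exp(-2j * cmath.pi / m)
    for k in range(0, n, m):
        w = 1.0
        for j in range(m // 2):
            t = w * a[k + j + m // 2]
            u = a[k + j]
            a[k + j], a[k + j + m // 2] = u + t, u - t
            w *= w_m
    return a

# Node A runs stages 0..1, "hands over" (stage, a), node B finishes.
x = bit_reverse([complex(i) for i in range(8)])
for s in range(2):
    x = fft_stage(x, s)
context = (2, x)                 # serialized and shipped to the next node
stage, x = context
x = fft_stage(x, stage)
print([round(abs(v), 3) for v in x])
```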
Article
Computer Science and Mathematics
Hardware and Architecture

Muhammad Sohail Ibrahim,

Muhammad Usman,

Jeong-A Lee

Abstract: Deep Neural Network (DNN) inference demands substantial computing power, resulting in significant energy consumption. A large number of negative output activations in convolution layers are rendered zero due to the invocation of the ReLU activation function. This results in a substantial number of unnecessary computations that consume significant amounts of energy. This paper presents ECHO (Energy-efficient Computation Harnessing Online arithmetic), an MSDF-based accelerator for DNN inference designed for computation pruning, utilizing an unconventional arithmetic paradigm known as online, or most-significant-digit-first (MSDF), arithmetic, which performs computations in a digit-serial manner. The digit-serial nature of online arithmetic enables overlapped computation of successive operations, leading to substantial performance improvements. Coupled with a negative output detection scheme, online arithmetic facilitates early and precise recognition of negative outputs. This, in turn, allows for the timely termination of unnecessary computations, reducing energy consumption. The implemented design has been realized on the Xilinx Virtex-7 VU3P FPGA and subjected to a comprehensive evaluation through a rigorous comparative analysis involving widely used performance metrics. Experimental results demonstrate promising power and throughput improvements compared to contemporary methods. In particular, the proposed design achieved average improvements in power consumption of up to 81%, 82.9%, and 40.6% for VGG-16, ResNet-18, and ResNet-50 workloads compared to the conventional bit-serial design, respectively. Furthermore, significant average speedups of 2.39×, 2.6×, and 2.42× were observed when comparing the proposed design to conventional bit-serial designs for the VGG-16, ResNet-18, and ResNet-50 models, respectively.
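A conceptual stand-in for the early negative detection (ignoring the digit-level mechanics of MSDF arithmetic): accumulate the largest-magnitude products first and stop once the remaining terms can no longer make the pre-activation positive, since ReLU will zero it anyway.

```python
def relu_dot_early(w, x):
    """Conceptual stand-in for MSDF early negative detection: accumulate
    the largest-magnitude products first and stop as soon as the sign of
    the final sum is already decided, since ReLU maps it to zero anyway."""
    prods = sorted((wi * xi for wi, xi in zip(w, x)), key=abs, reverse=True)
    remaining = sum(abs(p) for p in prods)
    acc = 0.0
    for i, p in enumerate(prods):
        acc += p
        remaining -= abs(p)
        if acc + remaining < 0:      # future terms cannot make it positive
            return 0.0, i + 1        # pruned after i+1 of len(prods) terms
    return max(acc, 0.0), len(prods)

y, used = relu_dot_early([0.2, -0.9, 0.1], [1.0, 1.0, 1.0])
print(y, used)   # -> 0.0 after 1 of 3 products
```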
Article
Computer Science and Mathematics
Hardware and Architecture

Samuel López-Asunción,

Pablo Ituero

Abstract: Spiking neural networks (SNNs) promise to perform tasks currently done by classical artificial neural networks (ANNs) faster, in smaller footprints, and using less energy. Neuromorphic processors are set to revolutionize computing at a large scale, but the move to edge-computing applications calls for finely tuned custom implementations to keep pushing towards more efficient systems. To that end, we examined the architectural design space for executing spiking neuron models on FPGA platforms, focusing on achieving ultra-low area and power consumption. This work presents an efficient clock-driven spiking neuron architecture used for the implementation of both fully connected cores and 2D convolutional cores, which rely on deep pipelines for synaptic processing and distributed memory for weight and neuron states. With them, we developed an accelerator for an SNN version of the LeNet-5 network trained on the MNIST dataset. At around 5.5 slices/neuron and only 348 mW, it uses 33% less area and a quarter of the power per neuron compared with current state-of-the-art implementations, while keeping simulation step times low.
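A clock-driven spiking neuron update of the kind such cores implement can be sketched as follows, using a leaky integrate-and-fire model with illustrative constants (the paper's neuron model and parameters may differ).

```python
import numpy as np

# Clock-driven leaky integrate-and-fire (LIF) sketch: one update per
# simulation step. Constants are illustrative, not the paper's.
LEAK, V_TH, V_RESET = 0.9, 1.0, 0.0

def lif_step(v, weights, in_spikes):
    """v: membrane potentials; in_spikes: binary input vector this tick."""
    v = LEAK * v + weights @ in_spikes       # leak, then integrate synapses
    out = v >= V_TH                          # threshold crossing -> spike
    v[out] = V_RESET                         # reset fired neurons
    return v, out.astype(np.uint8)

rng = np.random.default_rng(0)
w = rng.uniform(0, 0.5, (4, 8))              # 8 inputs -> 4 neurons
v = np.zeros(4)
for t in range(5):
    v, spikes = lif_step(v, w, rng.integers(0, 2, 8))
    print(t, spikes)
```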
Article
Computer Science and Mathematics
Hardware and Architecture

Józef Kulisz,

Filip Jokiel

Abstract: The paper proposes a new implementation of the PID algorithm in digital hardware. The proposed circuit implements an advanced PID formula containing a non-ideal derivative component and weighting coefficients, which reduce the influence of setpoint changes on the proportional and derivative components. The implementation operates on standard single-precision (32-bit) floating-point numbers. The proposed circuit structure is optimized for cost: it uses just one arithmetic block, performing the multiply-and-add operation, and carries out the calculations sequentially. The circuit was implemented in a Cyclone V FPGA device from Intel using the Quartus Prime software, and proper operation was verified by simulation. The proposed solution is comparable in speed with other hardware implementations of the PID algorithm operating on standard single-precision floating-point numbers, while being significantly cheaper. It also outperforms by several orders of magnitude any software-based implementation, including solutions using PLCs and CPUs/MCUs. The proposed circuit structure, together with the overall regulator device concept, suits the SoC (System on Chip) or SoPC (System on Programmable Chip) idea well, i.e., a device that contains a CPU core immersed in “FPGA fabric”, the logic resources characteristic of FPGA devices.
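A software reference model of such an advanced PID formula, with setpoint weights b and c and a first-order filter (coefficient N) on the derivative, is sketched below using a standard textbook discretization; the paper's exact formula and coefficients may differ, and its circuit evaluates the same arithmetic sequentially on one multiply-and-add block.

```python
class PID2DOF:
    """Discrete two-degree-of-freedom PID with a filtered (non-ideal)
    derivative, in the standard textbook discretization."""
    def __init__(self, kp, ti, td, n=10.0, b=1.0, c=0.0, h=0.01):
        self.kp, self.ti, self.td, self.n = kp, ti, td, n
        self.b, self.c, self.h = b, c, h    # setpoint weights, sample time
        self.i = 0.0                        # integrator state
        self.d = 0.0                        # filtered derivative state
        self.e_d_prev = 0.0

    def update(self, r, y):
        e = r - y
        p = self.kp * (self.b * r - y)          # weighted proportional term
        e_d = self.c * r - y                    # weighted derivative input
        ad = self.td / (self.td + self.n * self.h)
        bd = self.kp * self.td * self.n / (self.td + self.n * self.h)
        self.d = ad * self.d + bd * (e_d - self.e_d_prev)
        self.e_d_prev = e_d
        u = p + self.i + self.d
        self.i += self.kp * self.h / self.ti * e  # integrate for next step
        return u

pid = PID2DOF(kp=2.0, ti=1.0, td=0.1)
print(pid.update(r=1.0, y=0.0))
```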
Article
Computer Science and Mathematics
Hardware and Architecture

Pargol Hatefi,

Mohammad Salehi

Abstract: Meeting high reliability requirements in real-time embedded systems demands both power management and fault-tolerance techniques. In recent years, this has motivated approaches that reduce worst-case instantaneous power through peak-power management while observing the Thermal Design Power (TDP). Although multi-core chips in real-time embedded systems provide a great opportunity to implement two-phase triple modular redundancy for high reliability, managing the energy overhead of concurrently executing tasks remains a challenge for designers. In this article, we propose a scheme named Permanent-Fault-Aware Two-Phase Peak-Power Management (PFA-TP3M) for scheduling real-time tasks in multi-core systems. It eliminates peak-power overlap between concurrently running tasks to keep peak power consumption within the chip TDP, and, by mapping different parts of each task onto separate cores, it avoids repeatedly placing work on cores that have suffered permanent faults, yielding a fault-tolerant design. Our simulation results show that the proposed scheme provides up to a 700x increase in fault tolerance (630x on average) compared to the state-of-the-art power management algorithm, at the cost of up to a 14% (11% on average) increase in the length of the task graph.
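The peak-power idea can be illustrated with a toy scheduler that delays task start times until the summed per-slot power never exceeds the TDP; profiles and units below are invented for illustration.

```python
# Toy peak-power scheduling sketch: delay task start times until the
# summed power profile never exceeds the chip TDP. Profiles are in
# arbitrary power units per time slot; values are illustrative.
TDP = 10

def schedule(tasks, horizon=20):
    total = [0] * horizon
    starts = []
    for profile in tasks:
        start = 0
        while any(total[start + t] + p > TDP
                  for t, p in enumerate(profile)):
            start += 1                       # stagger to avoid peak overlap
        starts.append(start)
        for t, p in enumerate(profile):
            total[start + t] += p
    return starts, max(total)

tasks = [[6, 8, 6], [6, 8, 6], [3, 3]]       # per-slot peak power of 3 tasks
starts, peak = schedule(tasks)
print(starts, "peak =", peak)                # peak stays <= TDP
```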
Article
Computer Science and Mathematics
Hardware and Architecture

Saeid Gorgin,

Mohammad Sina Karvandi,

Somaye Moghari,

Mohammad K Fallah,

Jeong-A Lee

Abstract: The effectiveness of Fuzzy Inference Systems (FISs) in handling uncertainty and nonlinearity makes them a subject of significant interest for decision-making in embedded systems. Accordingly, optimizing FIS hardware improves its performance, efficiency, and capabilities, leading to a better user experience, increased productivity, and cost savings. To be compatible with the limited power budget of most embedded systems, this paper presents a framework for realizing ultra-low-power FIS hardware. It supports optimizations for both conventional arithmetic and MSDF computing, making it highly consistent with MSDF-based sensors. In an MSDF-computing FIS, fuzzification, inference, and defuzzification all operate on serially arriving data bits. To demonstrate the efficiency of the proposed framework, we utilized Matlab, Chisel3, and Vivado to take it from high-level FIS descriptions to hardware synthesis. We also developed a Scala library in Chisel3 to connect these tools, bridging the gap and facilitating design space exploration at the arithmetic level. Furthermore, we realized an FIS for the navigation of autonomous mobile robots in unknown environments. Synthesis results show the superiority of our design framework's output in terms of resource usage as well as power and energy consumption compared to the output of the Matlab HDL code generator.
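For orientation, a minimal Mamdani-style FIS (conventional arithmetic, not MSDF) with one input, one output, triangular sets, min-inference, and centroid defuzzification can be sketched as follows; the rule base and universes are illustrative, not the robot-navigation FIS from the paper.

```python
import numpy as np

# Minimal Mamdani FIS sketch: one input "obstacle distance", one output
# "speed", triangular sets, min-inference, centroid defuzzification.
def tri(x, a, b, c):
    return np.maximum(np.minimum((x - a) / (b - a + 1e-9),
                                 (c - x) / (c - b + 1e-9)), 0.0)

def infer(dist):
    y = np.linspace(0.0, 1.0, 101)            # speed universe
    # Rule 1: distance NEAR -> speed SLOW
    # Rule 2: distance FAR  -> speed FAST
    near = tri(dist, 0.0, 0.0, 0.5)
    far = tri(dist, 0.3, 1.0, 1.0)
    agg = np.maximum(np.minimum(near, tri(y, 0.0, 0.0, 0.5)),
                     np.minimum(far, tri(y, 0.5, 1.0, 1.0)))
    return (agg * y).sum() / (agg.sum() + 1e-9)   # centroid

print(f"speed at distance 0.4: {infer(0.4):.3f}")
```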
