Submitted: 12 August 2025
Posted: 13 August 2025
Abstract
Keywords:
1. Introduction
- The framework natively supports AXI4 and AXI4-Lite interfaces [16], multi-clock domain design, and asynchronous communication primitives – all developed in pure Chisel including the top-level module, promoting portability and maintainability.
- We demonstrate a proof-of-concept accelerator: a high-performance, pipelined floating-point matrix-vector multiplication engine based on the Self-Alignment Format (SAF) [17]. The design is mapped across all three Super Logic Regions (SLRs) of an AMD/Xilinx Alveo U200 FPGA [18], validating both performance and cross-SLR scalability.
- We provide the first full Chisel implementation of the SAF arithmetic format [17], demonstrating the framework’s capability to express non-trivial numerical operators in a modular and reusable form.
- All source code, including the SAF accelerator, Chisel framework modules, and simulation/testbench infrastructure, is released as open source and made freely available via a GitHub repository [19] to encourage reproducibility and community adoption.
2. Background
2.1. Hardware Platform
2.1.1. The UltraScale+ Architecture
2.1.2. Super Logic Regions and Inherent Design Constraints
2.1.3. The Alveo Platforms
- The golden image takes up a fixed amount of resources: according to UG1120 [25] (section "U200 Gen3x16 XDMA base_2 Platform"), it consumes a little less than one third of the available LUTs and FFs in SLR1, and roughly half of the available BRAM and URAM blocks in SLR1. This amounts to roughly 18% of the LUTs and FFs, and 15% of the RAM tiles (URAM and BRAM), of the whole device. The replacement used in this design uses approximately 11000 LUTs (0.9%), 14000 FFs (0.6%) and 31 RAM blocks (1.4%); see Section 2.1.5.
- The XRT library is packaged only for some yum and APT-based Linux distributions and officially supports some versions of Ubuntu, RHEL and CentOS [28], which needlessly constrains the host platform OS.
- The framework enforces pre-defined host-card control and data transfer protocols [29]. While these are well integrated with HLS flows, they prevent the designer from using custom IP cores for AXI communication and impose Xilinx’s IP cores, which may not reach the AXI4 protocol’s full throughput [30].
- In the RTL kernel design flow, the Vitis linker is able to place each kernel in a user-defined SLR [31], but this entails that communication across SLR boundaries is always implemented with AXI4 protocols, which can add unnecessary design complexity and restrict ASIC prototyping applications.
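The shares quoted in the first item above can be cross-checked against the resource table of this paper. The Python sketch below infers the device totals from the "Golden image (left resources)" row and expresses the replacement shell's footprint as a percentage of each total. The dictionary and function names are illustrative only, and the inferred totals are approximate because the table percentages are rounded.

```python
# Resources left to the user by the golden image and the share of the device
# they represent, copied from the "Golden image (left resources)" table row.
LEFT = {"LUT": (978000, 82.72), "FF": (1956000, 82.72), "BRAM": (1860, 86.11)}

# Approximate footprint of the replacement shell, as quoted in the text.
SHELL = {"LUT": 11000, "FF": 14000, "BRAM": 31}


def device_total(left_count, left_percent):
    """Infer a device total from a count and the percentage it represents."""
    return left_count / (left_percent / 100.0)


def shell_share(resource):
    """Percentage of the whole device used by the replacement shell."""
    left_count, left_percent = LEFT[resource]
    return 100.0 * SHELL[resource] / device_total(left_count, left_percent)


for name in SHELL:
    print(f"{name}: {shell_share(name):.2f}% of the device")
```

Run as-is, this reports about 0.93% for LUTs, 0.59% for FFs and 1.44% for BRAM blocks.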
2.1.4. The Vivado RTL Flow
2.1.5. XDMA and Host-Card Communication
2.2. Chisel
2.2.1. Introducing Chisel
2.2.2. Chisel Testing Capabilities
2.3. Floating-Point Formats
2.3.1. Self-Alignment Technique
2.3.2. Berkeley’s Hardfloat
3. Implementation Methodology
3.1. Matrix-Vector Multiplication Applicative Core
3.1.1. Floating-Point Multiply-Accumulators
3.1.2. Applicative Matrix-Vector Multiplication Pipeline
- valid indicates valid data on the bus;
- prog specifies that the data is to be programmed in a PE memory;
- write specifies that the data was generated by a PE and is meant to be output.
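As a purely illustrative software model of the three flags listed above, the Python sketch below groups them in a record and classifies a bus word. The `BusWord` and `classify` names, the precedence of prog over write, and the treatment of a valid word with no other flag as operand data are assumptions made for this example, not details taken from the Chisel sources.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BusWord:
    """One word on the pipeline bus together with its control flags."""
    data: int
    valid: bool = False  # the data on the bus is meaningful
    prog: bool = False   # the data is to be programmed in a PE memory
    write: bool = False  # the data was generated by a PE and is meant to be output


def classify(word: BusWord) -> str:
    """Label how a word would be handled (hypothetical helper)."""
    if not word.valid:
        return "idle"      # nothing to do when valid is low
    if word.prog:
        return "program"   # assumption: prog takes precedence over write
    if word.write:
        return "output"
    return "input"         # assumption: plain valid data is operand data


print(classify(BusWord(data=42, valid=True, prog=True)))
```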

3.2. Multi-Clock Domain Design in Chisel
3.2.1. Metastability and Synchronization
3.2.2. Multi-Cycle Paths
3.2.3. Asynchronous FIFO Queues

3.3. Integration
3.3.1. Communication with the XDMA IP
Listing 1. withClockAndReset and withClock usage in the CoreWrapper module (wiring is omitted).

3.3.2. Top-Level Module
Listing 2. Core clock generation using Chisel BlackBoxes.

Listing 3. Interface bundles wiring.
3.4. Testing Asynchronous Designs in Chisel
Listing 4. The FakeClockDivider class.

Listing 5. Asynchronous test module and test-bench.

4. Experimental Results
4.1. Floating Point Precision
4.2. Timings and Performance
4.3. Card Resource Usage
5. Discussion
5.1. Floating Point Handling
5.2. Performance
5.3. Resource Usage
5.4. Proof-of-Concept Objectives
- Reusable: module encapsulation allows parts of the design, such as the FIFO, the MCP, or even the AXI stack, to be reused in other designs.
- Real-life enabled: multi-clock designs on complex FPGA architectures can be useful to prototype ASIC designs.
- Fully controllable: the designer has RTL control over the hardware and syscall-level control over the software. Moreover, all card resources are available for use.
- Platform-agnostic: any Linux-based OS supporting the basic dependencies of the project can run matmul.
6. Conclusions and Future Research
Author Contributions
Funding
Conflicts of Interest
References
- Hernandez, J.; et al. Field Programmable Gate Array: An Extensive Review, Recent Trends, Challenges, and Applications. Computation 2019, 7, 63.
- IEEE Computer Society. The Role of Field-Programmable Gate Arrays in the Acceleration of Scientific Computing. IEEE Computer 2024, 57, 45–53.
- AMD/Xilinx. UltraScale Architecture and Product Data Sheet: Overview, 2025. v4.7.
- Du, L.; Liang, T.; Zhou, X.; Ge, J.; Li, S.; Sinha, S.; Zhao, J.; Xie, Z.; Zhang, W. FADO: Floorplan-Aware Directive Optimization Based on Synthesis and Analytical Models for High-Level Synthesis Designs on Multi-Die FPGAs. ACM Trans. Reconfigurable Technol. Syst. 2024, 17.
- Uguen, Y.; Dinechin, F.D.; Lezaud, V.; Derrien, S. Application-Specific Arithmetic in High-Level Synthesis Tools. ACM Trans. Archit. Code Optim. 2020, 17.
- Cong, J.; Lau, J.; Liu, G.; Neuendorffer, S.; Pan, P.; Vissers, K.; Zhang, Z. FPGA HLS Today: Successes, Challenges, and Opportunities. ACM Trans. Reconfigurable Technol. Syst. 2022, 15.
- AMD. Xilinx Runtime (XRT). https://xilinx.github.io/XRT, 2023. [Accessed: 2025-06-01].
- Sahebi, A.; Barbone, M.; Procaccini, M.; Luk, W.; Gaydadjiev, G.; Giorgi, R. Distributed large-scale graph processing on FPGAs. Journal of Big Data 2023, 10, 95.
- Sozzo, E.D.; Conficconi, D.; Zeni, A.; Salaris, M.; Sciuto, D.; Santambrogio, M.D. Pushing the Level of Abstraction of Digital System Design: A Survey on How to Program FPGAs. ACM Comput. Surv. 2022, 55.
- chipsalliance. Chisel homepage. https://www.chisel-lang.org. [Accessed: 2025-05-29].
- Schoeberl, M.; Damsgaard, H.J.; Pezzarossa, L.; Keszocze, O.; Jellum, E.R. Hardware Generators with Chisel. In Proceedings of the Euromicro Conference on Digital System Design (DSD), 2024, pp. 168–175.
- Käyrä, M.; Hämäläinen, T.D. A Survey on System-on-a-Chip Design Using Chisel HW Construction Language. In Proceedings of the Annual Conference of the IEEE Industrial Electronics Society (IECON), 2021, pp. 1–6.
- AMD. DMA/Bridge Subsystem for PCI Express (XDMA) v4.1 - Product Guide PG195. https://docs.xilinx.com/r/en-US/pg195-pcie-dma, 2023. [Accessed: 2025-06-01].
- IEEE Computer Society. IEEE Standard for Verilog Hardware Description Language (IEEE Std 1364-2001). https://standards.ieee.org/standard/1364-2001.html, 2001. [Accessed: 2025-06-01].
- edwardcwang. decoupled-serializer. https://github.com/edwardcwang/decoupled-serializer, 2025.
- Arm Ltd. AMBA AXI and ACE Protocol Specification (AXI3, AXI4, AXI4-Lite). https://developer.arm.com/documentation/ihi0022/e, 2022. [Accessed: 2025-06-01].
- Ould-Bachir, T.; David, J.P. Self-Alignment Schemes for the Implementation of Addition-Related Floating-Point Operators. ACM Trans. Reconfigurable Technol. Syst. 2013, 6.
- AMD. Alveo U200 Data Center Accelerator Card. https://www.amd.com/en/products/accelerators/alveo-u200, 2023. [Accessed: 2025-06-01].
- Robin Gay (RobinGTM). matmul. https://github.com/RobinGTM/matmul/tree/v1.0-mdpi, 2025. [Public repository].
- AMD. AMD Alveo Adaptable Accelerator Cards. https://www.amd.com/en/products/accelerators/alveo.html, 2025. [Accessed: 2025-06-02].
- AMD. UltraScale Architecture Clocking Resources User Guide, 2025. 1.11 English.
- Saban, K. Xilinx Stacked Silicon Interconnect Technology Delivers Breakthrough FPGA Capacity, Bandwidth, and Power Efficiency. Technical Report WP380, Xilinx, 2012. v1.2.
- AMD/Xilinx. UltraFast Design Methodology Guide for FPGAs and SoCs, 2024. 2024.2 English.
- AMD/Xilinx. Alveo U200 and U250 Accelerator Cards User Guide, 2023. 1.1 English.
- AMD/Xilinx. Alveo Data Center Accelerator Card Platforms User Guide, 2023. 2.0.1 English.
- AMD. Pynq homepage. https://www.pynq.io/, 2024. [Accessed: 2025-06-02].
- Xilinx. Alveo U200 and U250 Data Center Accelerator Cards Data Sheet, 2023. v1.7 English.
- AMD/Xilinx. Xilinx Runtime (XRT) Release Notes, 2024. 2024.2 English.
- AMD/Xilinx. Vitis Tutorials: Hardware Acceleration, 2025. 2024.2 English.
- Gisselquist Technology, LLC. Building an AXI-Lite slave the easy way. https://zipcpu.com/blog/2020/03/08/easyaxil.html, 2020. [Accessed: 2025-06-02].
- AMD/Xilinx. Data Center Acceleration Using Vitis User Guide, 2025. 2024.2 English.
- Xilinx. DMA/Bridge Subsystem for PCI Express v4.1 Product Guide, 2022. v4.1.
- Xilinx. dma_ip_drivers. https://github.com/Xilinx/dma_ip_drivers.git, 2025. [Accessed: 2025-06-02].
- Schoeberl, M. Digital Design with Chisel; Kindle Direct Publishing, 2019.
- Izraelevitz, A.; Koenig, J.; Li, P.; Lin, R.; Wang, A.; Magyar, A.; Kim, D.; Schmidt, C.; Markley, C.; Lawson, J.; et al. Reusability is FIRRTL ground: Hardware construction languages, compiler frameworks, and transformations. In Proceedings of the 2017 IEEE/ACM International Conference on Computer-Aided Design (ICCAD), Nov 2017, pp. 209–216. https://doi.org/10.1109/ICCAD.2017.8203780.
- Li, P.S.; Izraelevitz, A.M.; Bachrach, J. Specification for the FIRRTL Language. Technical Report UCB/EECS-2016-9, EECS Department, University of California, Berkeley, 2016.
- chipsalliance. chisel. https://github.com/chipsalliance/chisel, 2025. [Accessed: 2025-06-27].
- chipsalliance. Chisel documentation homepage. https://www.chisel-lang.org/docs. [Accessed: 2025-06-27].
- chipsalliance. svsim. https://github.com/chipsalliance/chisel/tree/main/svsim, 2025. [Accessed: 2025-06-27].
- chipsalliance. Motivation – "Why Chisel?". https://www.chisel-lang.org/docs/explanations/motivation. [Accessed: 2025-06-27].
- chipsalliance. Multiple Clock Domains. https://www.chisel-lang.org/docs/explanations/multi-clock. [Accessed: 2025-06-27].
- Scala Center. Scala Build Tool homepage. https://www.scala-sbt.org/, 2025. [Accessed: 2025-07-07].
- Veripool. Verilator homepage. https://www.veripool.org/verilator/, 2024. [Accessed: 2025-06-27].
- ucb bar. chiseltest. https://github.com/ucb-bar/chiseltest, 2024. [Accessed: 2025-06-27].
- edwardcwang. decoupled-serializer. https://github.com/RobinGTM/decoupled-serializer, 2025.
- IEEE Standard for Floating-Point Arithmetic. IEEE Std 754-2008 2008, pp. 1–70.
- Muller, J.M.; Brunie, N.; de Dinechin, F.; Jeannerod, C.P.; Joldes, M.; Lefèvre, V.; Melquiond, G.; Revol, N.; Torres, S. Handbook of Floating-point Arithmetic (2nd edition); Birkhäuser Basel, 2018; pp. 1–627.
- ucb bar. berkeley-hardfloat. https://github.com/ucb-bar/berkeley-hardfloat, 2023. [Accessed: 2025-06-27].
- de Dinechin, F.; Pasca, B.; Creţ, O.; Tudoran, R. An FPGA-specific approach to floating-point accumulation and sum-of-products. 2008 International Conference on Field-Programmable Technology 2008, pp. 33–40.
- Luo, Z.; Martonosi, M. Accelerating pipelined integer and floating-point accumulations in configurable hardware with delayed addition techniques. IEEE Transactions on Computers 2000, 49, 208–218.
- Vangal, S.; Hoskote, Y.; Borkar, N.; Alvandpour, A. A 6.2-GFlops Floating-Point Multiply-Accumulator With Conditional Normalization. IEEE Journal of Solid-State Circuits 2006, 41, 2314–2323.
- Hajizadeh, F.; Ould-Bachir, T.; David, J.P. CuFP: An HLS Library for Customized Floating-Point Operators. Electronics 2024, 13.
- Hauser, J. Berkeley HardFloat. https://www.jhauser.us/arithmetic/HardFloat.html, 2024.
- AMBA AXI-Stream Protocol Specification. Technical Report ARM IHI 0051B (ID040921), 2021.
- Wikipedia contributors. Metastability (electronics) — Wikipedia, The Free Encyclopedia, 2025. [Accessed: 2025-08-11].
- Ginosar, R. Metastability and Synchronizers: A Tutorial. IEEE Design & Test of Computers 2011, 28, 23–35.
- Stephenson, J.; Chen, D.; Fung, R.; Chromczak, J. White Paper – Understanding metastability in FPGAs. Technical Report WP-01082-1.2, Altera Corporation, 2009. ver. 1.2.
- Golson, S. Synchronization and Metastability. In Proceedings of the SNUG Silicon Valley 2014. Trilobyte Systems, 2014. [Accessed: 2025-07-04].
- Cummings, C.E. Clock Domain Crossing (CDC) Design & Verification Techniques Using SystemVerilog. In Proceedings of the SNUG Boston 2008. Sunburst Design, Inc., 2008.
- Cummings, C.E. Simulation and Synthesis Techniques for Asynchronous FIFO Design. In Proceedings of the SNUG San Jose 2002. Sunburst Design, Inc., 2002. [Accessed: 2025-06-01].
- Gray, F. Pulse code communication. Technical Report US2632058A, United States Patent Office, 1947.
- vineetskumar. XDMA tests failing with Error 512 on Alveo U200 #331. https://github.com/Xilinx/dma_ip_drivers/pull/240, 2025. [Accessed: 2025-08-11].
- eniv. [XDMA] BUG: scheduling while atomic in engine_service_poll #229. https://github.com/Xilinx/dma_ip_drivers/issues/229, 2023. [Accessed: 2025-08-11].
- GitSoftwareNow. The XDMA did not work when I send big buffer #298. https://github.com/Xilinx/dma_ip_drivers/issues/298, 2024. [Accessed: 2025-08-11].
- mpb27. XDMA: End of packet has issues and needs more testing. #91. https://github.com/Xilinx/dma_ip_drivers/issues/91#issuecomment-2316828753, 2020. [Accessed: 2025-08-11].
- dwd_pete. C2H Streaming XDMA Linux Driver Broken. https://adaptivesupport.amd.com/s/question/0D52E00006hpgSoSAI/c2h-streaming-xdma-linux-driver-broken?language=en_US, 2018. [Accessed: 2025-08-11].
- AMBA AXI and ACE Protocol Specification. Technical Report ARM IHI 0022D (ID102711), 2011. AXI3, AXI4, and AXI4-Lite; ACE and ACE-Lite.
- chipsalliance. Chisel Multiple Clock Domains documentation. https://www.chisel-lang.org/docs/explanations/multi-clock. [Accessed: 2025-07-08].
- The GSL Team. GNU Scientific Library homepage. https://www.gnu.org/software/gsl/, 2024. [Accessed: 2025-07-12].
- Free Software Foundation, Inc. GNU General Public License. https://www.gnu.org/licenses/gpl-3.0.en.html, 2007. [Accessed: 2025-08-11].







| MH | MW | FP | Hardware time, mean (ms) | Hardware time, max (ms) | Software time, mean (ms) | Software time, max (ms) | Relative error, mean (%) | Relative error, max (%) |
|---|---|---|---|---|---|---|---|---|
| 10 | 10 | SAF | 0.031937 | 0.077000 | 0.000950 | 0.016000 | 0.000274 | 0.003808 |
| 50 | 50 | SAF | 0.023638 | 0.129000 | 0.002490 | 0.031000 | 0.004407 | 0.075721 |
| 100 | 50 | SAF | 0.025186 | 0.073000 | 0.004631 | 0.020000 | 0.012710 | 1.165867 |
| 100 | 100 | SAF | 0.022483 | 0.123000 | 0.008745 | 0.039000 | 0.039777 | 6.425782 |
| 100 | 200 | SAF | 0.025268 | 0.175000 | 0.020996 | 0.096000 | 0.027460 | 4.977176 |
| 300 | 300 | SAF | 0.016123 | 0.090000 | 0.048481 | 0.210000 | 0.126155 | 18.62970 |
| 400 | 400 | SAF | 0.013933 | 0.058000 | 0.072135 | 0.258000 | 0.312330 | 50.52832 |
| 10 | 10 | HF | 0.026675 | 0.094000 | 0.000827 | 0.112000 | 0.000000 | 0.000000 |
| 50 | 50 | HF | 0.030913 | 0.171000 | 0.003696 | 0.134000 | 0.000000 | 0.000000 |
| 100 | 50 | HF | 0.023033 | 0.117000 | 0.005115 | 0.042000 | 0.000000 | 0.000000 |
| 100 | 100 | HF | 0.023357 | 0.103000 | 0.008535 | 0.042000 | 0.000000 | 0.000000 |
| 100 | 200 | HF | 0.021130 | 0.117000 | 0.016658 | 0.068000 | 0.000000 | 0.000000 |
| 300 | 300 | HF | 0.014292 | 0.052000 | 0.042206 | 0.205000 | 0.000000 | 0.000000 |
| 400 | 400 | HF | 0.015355 | 0.078000 | 0.080186 | 0.346000 | 0.000000 | 0.000000 |
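
The mean times in the table above allow a quick estimate of the software-to-hardware time ratio. The Python sketch below only reuses values copied from that table; the dictionary layout and the `speedup` helper are illustrative names, not part of the released code.

```python
# (MH, MW, FP) -> (mean hardware time, mean software time), both in ms,
# copied from the timing table above.
MEAN_TIMES_MS = {
    (10, 10, "SAF"): (0.031937, 0.000950),
    (100, 200, "SAF"): (0.025268, 0.020996),
    (400, 400, "SAF"): (0.013933, 0.072135),
    (400, 400, "HF"): (0.015355, 0.080186),
}


def speedup(key):
    """Ratio of mean software time to mean hardware time (above 1: hardware is faster)."""
    hardware_ms, software_ms = MEAN_TIMES_MS[key]
    return software_ms / hardware_ms


for key in MEAN_TIMES_MS:
    print(key, f"{speedup(key):.2f}")
```

With these figures the ratio is about 5.2 for both 400 × 400 cases, while for 10 × 10 matrices the software run remains faster (ratio below 0.05).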

| MH | MW | FP | LUT count | LUT % | FF count | FF % | BRAM count | BRAM % | DSP count | DSP % |
|---|---|---|---|---|---|---|---|---|---|---|
| Golden image (left resources) | | | 978000 | 82.72 | 1956000 | 82.72 | 1860 | 86.11 | 5880 | 85.96 |
| 10 | 10 | S | 22437 | 1.898 | 16889 | 0.7143 | 33 | 1.528 | 20 | 0.2924 |
| 50 | 50 | S | 69384 | 5.869 | 27162 | 1.149 | 35 | 1.620 | 100 | 1.462 |
| 100 | 50 | S | 128222 | 10.85 | 40141 | 1.698 | 35 | 1.620 | 200 | 2.924 |
| 100 | 100 | S | 133661 | 11.31 | 40355 | 1.707 | 35 | 1.620 | 200 | 2.924 |
| 100 | 200 | S | 144255 | 12.21 | 40564 | 1.716 | 35 | 1.620 | 200 | 2.924 |
| 300 | 200 | S | 410972 | 34.77 | 94354 | 3.991 | 36 | 1.667 | 600 | 8.772 |
| 300 | 300 | S | 363067 | 25.86 | 92316 | 3.904 | 337 | 15.60 | 600 | 8.771 |
| 400 | 400 | S | 472903 | 40.00 | 117767 | 4.981 | 437 | 20.23 | 800 | 11.70 |
| 800 | 800 | S | 925654 | 78.30 | 224181 | 9.481 | 839 | 38.84 | 1600 | 23.39 |
| 900 | 900 | S | 1042003 | 88.14 | 250384 | 10.59 | 941 | 43.56 | 1800 | 26.32 |
| 1000 | 1000 | S | 1115257 | 97.44 | 276586 | 11.70 | 1041 | 48.19 | 2000 | 29.24 |
| 10 | 10 | H | 19122 | 1.618 | 16080 | 0.6802 | 33 | 1.528 | 20 | 0.2924 |
| 50 | 50 | H | 51927 | 4.393 | 23094 | 0.9767 | 35 | 1.620 | 100 | 1.462 |
| 100 | 50 | H | 93263 | 7.889 | 32004 | 1.354 | 35 | 1.620 | 200 | 2.924 |
| 100 | 100 | H | 98566 | 8.339 | 32215 | 1.363 | 35 | 1.620 | 200 | 2.924 |
| 100 | 200 | H | 108816 | 9.205 | 32426 | 1.372 | 35 | 1.620 | 200 | 2.924 |
| 300 | 200 | H | 305556 | 25.85 | 70048 | 2.963 | 36 | 1.667 | 600 | 8.772 |
| 300 | 300 | H | 246760 | 21.01 | 68017 | 2.923 | 337 | 15.60 | 600 | 8.771 |
| 400 | 400 | H | 324692 | 27.46 | 85764 | 3.627 | 437 | 20.23 | 800 | 11.70 |
| 800 | 800 | H | 644338 | 54.50 | 160182 | 6.775 | 841 | 38.94 | 1600 | 23.39 |
| 1000 | 1000 | H | 802580 | 67.89 | 196585 | 8.314 | 1041 | 48.19 | 2000 | 29.24 |
| 1200 | 800 | H | 960472 | 81.24 | 236592 | 10.01 | 1245 | 57.64 | 2400 | 35.09 |
| 1300 | 700 | H | 1040130 | 87.98 | 255090 | 10.79 | 1345 | 62.27 | 2600 | 38.01 |
| 1400 | 800 | H | 1119565 | 94.70 | 273582 | 11.57 | 1445 | 66.90 | 2800 | 40.94 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).




