Submitted: 19 January 2023
Posted: 20 January 2023
Abstract
Keywords:
1. Introduction
1.1. Hardware Platform Considerations
1.2. Computer System Considerations
1.2.1. Boot Time Reservation
1.2.2. Contiguous Memory Allocation
1.2.3. Working with Non-Contiguous Buffers
2. Existing Solutions for the Implementation of DMA in FPGAs
2.1. Official DMA Engines from AMD/Xilinx
2.2. Selected Existing Open-Source DMA Engines
2.3. PCIe SG DMA Controller
2.4. Wupper
3. Concept of a Versatile DMA Engine for HEP
4. Implementation of DMA in FPGA with HLS
| Listing 1. Very simple code that uses the AXI interface to read data from memory and write the modified data back to its original location. This is a shortened version of the source published by AMD/Xilinx in [32]. |
| Listing 2. Simple code that receives an AXI-Stream packet and stores it in a buffer in the computer system's memory. This is a modified version of the source published by AMD/Xilinx in [34]. |
4.1. Development of the Final HLS Solution
| Listing 3. Definition of structures and constants for the core of the DMA engine in HLS. |
| Listing 4. Top-level function implementing the core of the DMA engine in HLS. Explained in Section 4.1. |
4.2. Readin Subtask
| Listing 5. Simplified code reading the input data and preparing the burst markers. Explained in Section 4.2. |
4.3. Prepare Subtask
| Listing 6. Simplified code preparing the chunk descriptors. Explained in Section 4.3. |
4.4. Writeout Subtask
| Listing 7. Simplified code writing the output data based on the prepared chunk descriptors. Explained in Section 4.4 and Section 4.5. |
4.5. Update_Outs Subtask
4.6. HDL Support Cores
5. Software Supporting the DMA Engine
5.1. Detailed Description of the Software Operation
- It prepares the huge-page-backed DMA buffer: it creates a file of the required size in a hugetlbfs filesystem (this can be done even from a shell script) and then maps the created file into the application's address space.
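The buffer preparation described above can be sketched in user space as follows. This is only a minimal illustration under stated assumptions: the function names `round_up_to_huge_pages` and `map_dma_buffer` are hypothetical and are not taken from the actual software, and the code assumes that a hugetlbfs filesystem is already mounted at the path passed in.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Round the requested size up to a whole number of huge pages
 * (a hugetlbfs file can only be sized in huge-page units). */
size_t round_up_to_huge_pages(size_t size, size_t huge_page_size)
{
    assert(huge_page_size != 0);
    return (size + huge_page_size - 1) / huge_page_size * huge_page_size;
}

/* Hypothetical helper: create (or reuse) the file at `path`, set its
 * size, and map it read/write into the address space.
 * Returns NULL on failure. */
void *map_dma_buffer(const char *path, size_t size)
{
    int fd = open(path, O_CREAT | O_RDWR, 0600);
    if (fd < 0)
        return NULL;
    if (ftruncate(fd, (off_t)size) != 0) {
        close(fd);
        return NULL;
    }
    void *buf = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd); /* the mapping remains valid after the descriptor is closed */
    return buf == MAP_FAILED ? NULL : buf;
}
```

A caller might use it as `map_dma_buffer("/dev/hugepages/daq_buf", round_up_to_huge_pages(size, 2u << 20))`, where both the mount point and the 2 MiB huge page size are assumptions that depend on the system configuration.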
Whenever the data acquisition is started or restarted, the following actions must be performed:
- The DMA driver resets the engine (due to HLS limitations, this is needed to set the initial values of the registers).
- The DMA driver maps the buffer for DMA (if the buffer was already mapped, the mapping is destroyed and recreated; see footnote 4).
- The DMA driver configures the DMA engine to work with the currently mapped buffer. In particular, it writes the bus addresses of all huge pages into the engine’s registers.
- The DAQ control application configures the data source.
- In the multi-packet mode, the data processing application starts the processing threads.
- The DMA driver starts the engine.
- The DAQ control application starts the data source.
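The ordering of the start-up actions can be captured in a small step table. The sketch below is illustrative only: the step names, the `step_fn` callback type, and the `record_step` example callback are hypothetical and do not correspond to the real driver or application API. It shows that the thread start is skipped in the single-packet mode and that the engine is started before the data source.

```c
#include <assert.h>
#include <stdbool.h>

/* Start-up steps in the order listed above (names are hypothetical). */
enum start_step {
    STEP_RESET_ENGINE,
    STEP_MAP_BUFFER,
    STEP_CONFIGURE_ENGINE,
    STEP_CONFIGURE_SOURCE,
    STEP_START_THREADS,   /* multi-packet mode only */
    STEP_START_ENGINE,
    STEP_START_SOURCE,
    STEP_COUNT
};

/* Hypothetical callback performing one step; returns 0 on success. */
typedef int (*step_fn)(enum start_step step, void *ctx);

/* Run all applicable steps in order and stop at the first failure. */
int start_acquisition(bool multi_packet, step_fn run, void *ctx)
{
    for (int s = 0; s < STEP_COUNT; s++) {
        if (s == STEP_START_THREADS && !multi_packet)
            continue;
        int rc = run((enum start_step)s, ctx);
        if (rc != 0)
            return rc;
    }
    return 0;
}

/* Example callback that only records which steps were requested. */
struct step_log {
    enum start_step steps[STEP_COUNT];
    int count;
};

int record_step(enum start_step step, void *ctx)
{
    struct step_log *log = ctx;
    assert(log->count < STEP_COUNT);
    log->steps[log->count++] = step;
    return 0;
}
```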
If the single-packet mode is used, the data processing loop works as follows:
- If no data packet is available, the DMA interrupts are switched on, and the application sleeps, waiting for data or a command.
- If an error has occurred or the stop command has been received, the application leaves the data processing loop.
- If a new data packet has been received, the DMA interrupts are masked, and the packet is passed to the data processing function.
- After the packet is processed, it is confirmed and freed.
- The next iteration of the loop is started.
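The control flow of the single-packet loop can be sketched against a simulated engine. Everything here is an assumption made for illustration: `struct sim_engine` and `wait_for_event` stand in for the real driver interface (which uses interrupts and blocking system calls), and "processing" is reduced to incrementing a counter.

```c
#include <assert.h>
#include <stdbool.h>

enum daq_event { EV_DATA, EV_STOP, EV_ERROR };

/* Simulated engine state standing in for the real driver. */
struct sim_engine {
    int pending;        /* packets waiting in the DMA buffer */
    int arrivals_left;  /* packets the simulated source will still deliver */
    bool irq_enabled;   /* whether DMA interrupts are currently unmasked */
    int processed;      /* packets handed to the processing function */
};

/* Stand-in for the blocking wait: delivers one packet, or EV_STOP
 * once the simulated source is exhausted. */
enum daq_event wait_for_event(struct sim_engine *e)
{
    if (e->arrivals_left == 0)
        return EV_STOP;
    e->arrivals_left--;
    e->pending++;
    return EV_DATA;
}

/* Returns the event that ended the loop (EV_STOP or EV_ERROR). */
enum daq_event single_packet_loop(struct sim_engine *e)
{
    for (;;) {
        if (e->pending == 0) {
            e->irq_enabled = true;                 /* switch interrupts on */
            enum daq_event ev = wait_for_event(e); /* sleep until woken */
            if (ev != EV_DATA)
                return ev;                         /* error or stop */
        }
        e->irq_enabled = false;                    /* mask interrupts */
        assert(e->pending > 0);
        e->processed++;                            /* process the packet */
        e->pending--;                              /* confirm and free it */
    }
}
```

Note that interrupts are only re-enabled when no packet is pending, so a burst of packets is drained without any interrupt overhead.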
If the multi-packet mode is used, the data processing loop works as follows:
- If no data packet is available, the DMA interrupts are switched on, and the application sleeps, waiting for data or a command.
- If an error has occurred or the stop command has been received, the application leaves the data processing loop.
- If a new data packet has been received, its number is passed to one of the data processing threads via ZMQ, and the engine is notified that the particular packet has been scheduled for processing (see footnote 5).
- The software checks whether other packets have been received and are waiting for processing (see footnote 6). In the internal loop, all available packets are scheduled for processing in the available threads.
- The next iteration of the loop is started.
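The internal scheduling loop can be illustrated with plain modular arithmetic over the packet numbers. This sketch makes several assumptions: the packet numbers are treated as a circular range from the first ready packet up to, but not including, the first packet that is not yet written; the distribution via ZMQ is replaced by a simple round-robin assignment into an array; and the function name and parameters are hypothetical.

```c
#include <assert.h>

/* Assign every packet in the circular range [first_ready, first_unwritten)
 * to a worker in round-robin order. worker_of[p] receives the worker
 * chosen for packet p. Returns the number of packets scheduled.
 * Equal bounds are interpreted as "nothing to schedule". */
unsigned schedule_ready_packets(unsigned first_ready, unsigned first_unwritten,
                                unsigned n_desc, unsigned n_workers,
                                unsigned *next_worker, unsigned *worker_of)
{
    assert(n_desc > 0 && n_workers > 0);
    assert(first_ready < n_desc && first_unwritten < n_desc);
    unsigned scheduled = 0;
    for (unsigned p = first_ready; p != first_unwritten; p = (p + 1) % n_desc) {
        worker_of[p] = *next_worker;
        *next_worker = (*next_worker + 1) % n_workers;
        scheduled++;
    }
    return scheduled;
}
```

With 8 descriptors and 3 workers, a call with `first_ready = 6` and `first_unwritten = 2` schedules packets 6, 7, 0, and 1, wrapping around the end of the descriptor table.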
The signal processing thread in the multi-packet mode performs the following actions:
- The thread sleeps, waiting for a packet to be processed.
- The parts of the DMA buffer containing the packet descriptors and the data of the packet are synchronized for the CPU (see footnote 7).
- The start and end addresses of the packet data are read from the descriptor.
- The packet data are processed.
- After the data are processed, the packet is marked for freeing (see footnote 8).
- The thread is stopped if an error has occurred or the stop command has been received. Otherwise, the above operations are repeated.
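Because packets are processed in parallel, they may be marked for freeing out of order, which is why footnote 8 mentions a bitmap in the driver. The following sketch shows one way such bookkeeping could work. It is not the driver's code: the descriptor count `N_PKT`, the structure, and the function name are hypothetical, and the update of the "current buffer" position is omitted.

```c
#include <assert.h>
#include <stdint.h>

#define N_PKT 16u /* hypothetical number of packet descriptors (at most 32 here) */

struct free_tracker {
    uint32_t marked;  /* bit i set: packet i is processed and waits to be freed */
    unsigned current; /* oldest packet not yet freed (the "current packet") */
};

/* Mark packet `pkt` as processed. If it is the current packet, release it
 * together with all consecutively marked packets that follow it.
 * Returns the number of packets released. */
unsigned confirm_packet(struct free_tracker *t, unsigned pkt)
{
    assert(pkt < N_PKT);
    t->marked |= UINT32_C(1) << pkt;
    unsigned released = 0;
    while (t->marked & (UINT32_C(1) << t->current)) {
        t->marked &= ~(UINT32_C(1) << t->current);
        t->current = (t->current + 1) % N_PKT; /* circular packet numbering */
        released++;
    }
    return released;
}
```

For example, if the current packet is 14 and packets 15 and 0 are confirmed first, nothing is released until packet 14 is confirmed, at which point packets 14, 15, and 0 are released together.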
The shutdown procedure is as follows:
- The DAQ application stops the data source.
- The DMA application stops the DMA engine.
- In the multi-packet mode, the DMA application sends the STOP command to the processing threads and joins them.
- The DMA application frees the resources: it unmaps and frees the DMA buffer.
6. Tests and Results
6.1. Tests in the RTL Simulations
6.2. Tests in the Actual Hardware
7. Conclusions
Funding
Data Availability Statement
Acknowledgments
Conflicts of Interest
Abbreviations
The following abbreviations are used in this manuscript:

| Abbreviation | Meaning |
|---|---|
| FPGA | Field programmable gate array |
| DMA | Direct memory access |
| DAQ | Data acquisition system |
| HEP | High-energy physics |
| SoC | System on chip |
| MPSoC | Multi-processor system on chip |
| TLP | Transaction Layer Packet in the PCI Express interface |
| KiB | 1024 bytes |
| MiB | 1024*1024 bytes |
| GiB | 1024*1024*1024 bytes |
References
1. Baron, S.; Ballabriga, R.; Bonacini, S.; Cobanoglu, O.; Gui, P.; Kloukinas, K.; Hartin, P.; Llopart, X.; Fedorov, T.; Francisco, R.; et al. The GBT Project. 2009.
2. Marin, M.B.; Baron, S.; Feger, S.; Leitao, P.; Lupu, E.; Soos, C.; Vichoudis, P.; Wyllie, K. The GBT-FPGA core: features and challenges. Journal of Instrumentation 2015, 10, C03021.
3. Soós, C.; Détraz, S.; Olanterä, L.; Sigaud, C.; Troska, J.; Vasey, F.; Zeiler, M. Versatile Link PLUS transceiver development. Journal of Instrumentation 2017, 12, C03068.
4. Mendez, J.M.; Baron, S.; Kulis, S.; Fonseca, J. New LpGBT-FPGA IP: Simulation model and first implementation. In Proceedings of the Topical Workshop on Electronics for Particle Physics, PoS(TWEPP2018); Sissa Medialab: Antwerp, Belgium, 2019; p. 059.
5. AXI Bridge for PCI Express Gen3 Subsystem v3.0. https://docs.xilinx.com/v/u/en-US/pg194-axi-bridge-pcie-gen3. [Online; accessed 7-August-2022].
6. DMA/Bridge Subsystem for PCI Express v4.1. https://docs.xilinx.com/r/en-US/pg195-pcie-dma. [Online; accessed 7-August-2022].
7. AMBA AXI-Stream Protocol Specification. https://developer.arm.com/documentation/ihi0051/latest. [Online; accessed 15-October-2022].
8. Corbet, J.; Rubini, A.; Kroah-Hartman, G. Linux device drivers, 3rd ed.; O’Reilly: Beijing; Sebastopol, CA, 2005. Also available at https://lwn.net/Kernel/LDD3/ [Online; accessed 7-August-2022].
9. Intel® 64 and IA-32 Architectures Software Developer Manuals. https://www.intel.com/content/www/us/en/developer/articles/technical/intel-sdm.html. [Online; accessed 7-August-2022].
10. An Introduction to IOMMU Infrastructure in the Linux Kernel. https://lenovopress.lenovo.com/lp1467.pdf. [Online; accessed 15-October-2022].
11. 32/64 bit, IOMMU and SWIOTLB in Linux. http://xillybus.com/tutorials/iommu-swiotlb-linux. [Online; accessed 15-October-2022].
12. The kernel’s command-line parameters. https://www.kernel.org/doc/html/latest/admin-guide/kernel-parameters.html. [Online; accessed 7-August-2022].
13. Suryavanshi, A.S.; Sharma, S. An approach towards improvement of contiguous memory allocation linux kernel: a review. Indonesian Journal of Electrical Engineering and Computer Science 2022, 25, 1607.
14. AXI DMA Controller. https://www.xilinx.com/products/intellectual-property/axidma.html. [Online; accessed 7-August-2022].
15. AXI Central DMA Controller. https://www.xilinx.com/products/intellectual-property/axicentraldma. [Online; accessed 7-August-2022].
16. AXI Video Direct Memory Access. https://www.xilinx.com/products/intellectual-property/axivideodma.html. [Online; accessed 7-August-2022].
17. AXI Datamover. https://www.xilinx.com/products/intellectual-property/axidatamover.html. [Online; accessed 7-August-2022].
18. Zabołotny, W.M. DMA implementations for FPGA-based data acquisition systems. 2017, p. 1044548.
19. Simple AXI4-Stream -> PCIe core for Virtex 7. https://gitlab.com/WZab/versatile-dma1. [Online; accessed 7-August-2022].
20. AXI4-Stream FIFO. https://docs.xilinx.com/v/u/en-US/pg080-axi-fifo-mm-s. [Online; accessed 7-August-2022].
21. DMA for PCI Express (PCIe) Subsystem. https://www.xilinx.com/products/intellectual-property/pcie-dma.html. [Online; accessed 7-August-2022].
22. Gisselquist, D. WB2AXIP: Bus interconnects, bridges, and other components. https://github.com/ZipCPU/wb2axip. [Online; accessed 7-August-2022].
23. Gisselquist, D. AXIS2MM – AXI Stream to AXI Memory Mapped interface. https://github.com/ZipCPU/wb2axip/blob/master/rtl/axis2mm.v. [Online; accessed 7-August-2022].
24. Gisselquist, D. AXISGDMA a scatter-gather DMA implementation. https://github.com/ZipCPU/wb2axip/blob/master/rtl/axisgdma.v. [Online; accessed 7-August-2022].
25. PCIe SG DMA controller. https://opencores.org/projects/pciesgdma. [Online; accessed 7-August-2022].
26. Wupper: A PCIe Gen3/Gen4 DMA controller for Xilinx FPGAs. https://opencores.org/projects/virtex7pciedma. [Online; accessed 7-August-2022].
27. Corbet, J. NAPI polling in kernel threads. https://lwn.net/Articles/833840/, 2020. [Online; accessed 12-January-2023].
28. Gisselquist, D. Building a basic AXI Master. https://zipcpu.com/blog/2020/03/23/wbm2axisp.html, 2020. [Online; accessed 7-August-2022].
29. Gisselquist, D. Examples of AXI4 bus masters. https://zipcpu.com/blog/2021/06/28/master-examples.html, 2021. [Online; accessed 7-August-2022].
30. Vivado Design Suite User Guide, High-Level Synthesis. https://www.xilinx.com/content/dam/xilinx/support/documents/swmanuals/xilinx20201/ug902-vivado-high-level-synthesis.pdf. [Online; accessed 7-August-2022].
31. AXI4 Master Interface. https://docs.xilinx.com/r/en-US/ug1399-vitis-hls/AXI4-Master-Interface. [Online; accessed 7-August-2022].
32. Vitis HLS Introductory Examples – Using AXI Master. https://github.com/Xilinx/Vitis-HLS-Introductory-Examples/blob/master/Interface/Memory/usingaximaster/example.cpp. [Online; accessed 7-August-2022].
33. AXI4-Stream Interfaces. https://docs.xilinx.com/r/en-US/ug1399-vitis-hls/AXI4-Stream-Interfaces. [Online; accessed 7-August-2022].
34. Vitis HLS Introductory Examples – AXI Stream to Master. https://github.com/Xilinx/Vitis-HLS-Introductory-Examples/blob/master/Interface/Streaming/axistreamtomaster. [Online; accessed 7-August-2022].
35. HLS pragmas (Vitis). https://docs.xilinx.com/r/en-US/ug1399-vitis-hls/HLS-Pragmas. [Online; accessed 7-August-2022].
36. Zabołotny, W.M. Implementation of heapsort in programmable logic with high-level synthesis. In Proceedings of the Photonics Applications in Astronomy, Communications, Industry, and High-Energy Physics Experiments 2018; Romaniuk, R.S.; Linczuk, M., Eds.; SPIE: Wilga, Poland, 2018; p. 245.
37. Zabolotny, W.M. Implementation of OMTF trigger algorithm with high-level synthesis. In Proceedings of the Photonics Applications in Astronomy, Communications, Industry, and High-Energy Physics Experiments 2019; Romaniuk, R.S.; Linczuk, M., Eds.; SPIE: Wilga, Poland, 2019; p. 22.
38. Driver for the wzdaq1 AXI/PCIe DAQ system. https://gitlab.com/WZab/wzdaqdrv. [Online; accessed 6-January-2023].
39. Zabolotny, W.M. QEMU-based hardware/software co-development for DAQ systems. Journal of Instrumentation 2022, 17, C04004.
40. QEMU repository with model of PCIe-connected HLS DMA engine. https://github.com/wzab/qemu/tree/wzdaq-hls. [Online; accessed 6-January-2023].
41. QEMU repository with model system bus connected HLS DMA engine. https://github.com/wzab/qemu/tree/wzdaq-hls-sysbus. [Online; accessed 6-January-2023].
42. Integrated Logic Analyzer (ILA). https://www.xilinx.com/products/intellectual-property/ila.html. [Online; accessed 12-November-2022].
43. Gisselquist, D. Building the perfect AXI4 slave. https://zipcpu.com/blog/2019/05/29/demoaxi.html, 2019. [Online; accessed 7-August-2022].
44. WB2AXIP: Bus interconnects, bridges, and other components. https://github.com/ZipCPU/wb2axip. [Online; accessed 7-August-2022].
45. hls_dma - a simple yet versatile HLS-implemented DMA engine. https://gitlab.com/WZabISE/hlsdma. [Online; accessed 6-January-2023].
46. Xilinx Kintex UltraScale FPGA KCU105 Evaluation Kit. https://www.xilinx.com/products/boards-and-kits/kcu105.html. [Online; accessed 6-January-2023].
47. TEC0330 - PCIe FMC Carrier with Xilinx Virtex-7 FPGA. https://shop.trenz-electronic.de/en/Products/Trenz-Electronic/PCIe-FMC-Carrier/TEC0330-Xilinx-Virtex-7/. [Online; accessed 6-January-2023].
48. Dementev, D.; Guminski, M.; Kovalev, I.; Kruszewski, M.; Kudryashov, I.; Kurganov, A.; Miedzik, P.; Murin, Y.; Pozniak, K.; Schmidt, C.J.; et al. Fast Data-Driven Readout System for the Wide Aperture Silicon Tracking System of the BM@N Experiment. Physics of Particles and Nuclei 2021, 52, 830–834.
49. Frotscher, A. The (p,3p) two-proton removal from neutron-rich nuclei and the development of the STRASSE tracker. 2021.
| 1 | |
| 2 | A similar approach is used in Linux drivers for network cards [27]. |
| 3 | Because both buffers are circular, the numbers are increased using modular arithmetic. |
| 4 | This operation requires using functions get_user_pages_fast and __sg_alloc_table_from_pages or sg_alloc_table_from_pages_segment, and the implementation depends on the version of the kernel. |
| 5 | The device driver uses dedicated ioctl commands for that purpose: DAQ1_IOC_GET_READY_DESC for getting the number of the received packet, DAQ1_IOC_CONFIRM_SRV for writing the number of the last scheduled packet into the last scheduled packet register in the engine. |
| 6 | A dedicated ioctl command, DAQ1_IOC_GET_WRITTEN_DESC, returns the number of the first packet that is not yet ready for processing. So all packets between the one returned by DAQ1_IOC_GET_READY_DESC and that one may be scheduled for processing. |
| 7 | A dedicated ioctl command, DAQ1_IOC_SYNC, is used for that purpose. Synchronizing an arbitrarily selected part of the SG buffer in the Linux kernel requires storing a separate array of addresses of all huge pages forming the buffer, because the original sg_table structure does not support random access. |
| 8 | A dedicated ioctl command, DAQ1_IOC_CONFIRM_THAT, is used for that purpose. Due to the parallel handling of multiple packets, the driver must keep track of all packets ready to be freed. A bitmap is used for that purpose. When the packet currently pointed to by the “current packet” register is freed, all the packets marked for freeing are also freed. The “current packet” and “current buffer” are then updated accordingly. |
| 9 | For historical reasons, the chunk length is defined with the MAX_BURST_LENGTH constant in the HLS sources. It does not affect the AXI burst size, which is always 256. |
| 10 | The DMA engine was prepared for integration with projects using the 2020.1 version of Vivado-HLS and Vivado. Therefore the same version was used to synthesize, implement and test it. |
| 11 | Lengths of 256 and 2048 words were used, like in simulation in Section 6.1, to compare simulated and actual performances. |
| 12 | Trenz Electronic advertises the TEC0330 as an 8-lane PCIe Gen 2 capable board. However, the FPGA chip used in the board supports PCIe Gen 3, and correct operation in the 8xGen3 configuration has been verified in exhaustive tests of three different boards. |
| 13 | Currently the design requires Vivado and Vivado-HLS 2020.1, because that’s the version used by other projects with which it should be integrated. |












| | KCU105 LUTs | KCU105 Flip Flops | KCU105 Block RAMs | TEC0330 LUTs | TEC0330 Flip Flops | TEC0330 Block RAMs |
|---|---|---|---|---|---|---|
| Available | 242400 | 484800 | 600 | 204000 | 408000 | 750 |
| Used for 256-word chunks | 9909 (4.09%) | 15204 (3.14%) | 45 (7.5%) | 12503 (6.13%) | 15928 (3.90%) | 45 (6%) |
| Used for 2048-word chunks | 9858 (4.07%) | 15213 (3.14%) | 69.5 (11.58%) | 12445 (6.10%) | 15946 (3.91%) | 69.5 (9.27%) |
| | KCU105 Absolute | KCU105 Percentage | TEC0330 Absolute | TEC0330 Percentage |
|---|---|---|---|---|
| Available | 7.877 GB/s | 100% | 7.877 GB/s | 100% |
| Used for 256-word chunks | 4.731 GB/s | 60.1% | 4.721 GB/s | 59.9% |
| Used for 2048-word chunks | 6.724 GB/s | 85.4% | 6.691 GB/s | 84.9% |
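The figures in the throughput table can be cross-checked with a short calculation. The "Available" value matches the usual theoretical figure for an 8-lane PCIe Gen3 link with 128b/130b encoding; this interpretation is an assumption, since the table itself does not state how the value was obtained. The percentages then follow by dividing the measured throughput by that value, shown here for the KCU105 column:

$$
8\ \mathrm{GT/s} \times 8\ \text{lanes} \times \frac{128}{130} \times \frac{1\ \mathrm{B}}{8\ \mathrm{bit}} \approx 7.877\ \mathrm{GB/s},
\qquad
\frac{4.731}{7.877} \approx 60.1\%,
\qquad
\frac{6.724}{7.877} \approx 85.4\%
$$

The TEC0330 values follow in the same way: $4.721 / 7.877 \approx 59.9\%$ and $6.691 / 7.877 \approx 84.9\%$.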
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).






