2. Overall System Logic Design
The system's top-level logic, depicted in
Figure 2, is designed around an embedded Microblaze soft-core processor, establishing a hardware-software co-processing architecture. This design strategically separates the system's operation into a control path, managed by the processor, and a high-speed data path, implemented in programmable logic.In the control path, the Microblaze processor acts as the master controller. It configures the AD9361 transceiver's internal registers via the SPI bus to set key parameters (e.g., frequency, gain, bandwidth) and processes commands received from a host computer through a UART module. To communicate with the data path, it uses the AXI4-Lite bus protocol to send control signals to the FPGA's programmable logic (PL), enabling dynamic scheduling of custom hardware modules.
In the data path, the Microblaze generates and writes custom waveform data into on-chip Block RAM (BRAM). From there, dedicated driver logic within the PL reads this data at high speed and streams it to the AD9361 for transmission. The precise timing of this entire process is managed by PL-based timer and interrupt controllers, ensuring signal periodicity and stability. This architecture fully leverages the flexibility of the Microblaze for complex control tasks while exploiting the massive parallelism of the FPGA's PL for high-performance data processing.
2.1. BRAM Data Transfer Module
This module is responsible for generating and buffering the transmission waveforms. The data flow begins when the Microblaze soft-core processor receives configuration parameters, such as frequency, pulse width, and amplitude, from a host computer via the UART interface. Based on these parameters, the processor software generates the corresponding digital waveform sequence. Finally, using the AXI bus, the Microblaze writes the generated data into the FPGA's on-chip Block RAM (BRAM), where it is cached for subsequent high-speed access by the programmable logic.
2.2. AD9361 Data Driver Module Design
In this project, the AD9361 is configured via its SPI port to operate in a dual-receiver, dual-transmitter (2R2T) mode using a Low-Voltage Differential Signaling (LVDS) interface. The data transfer timing for the receive (RX) path is detailed in
Figure 3. This interface employs a source-synchronous, double data rate (DDR) scheme where each 12-bit I/Q sample is transmitted to the FPGA over two full clock cycles. The RX_FRAME signal functions as a channel selector: a high level indicates data transmission for the first channel, while a low level indicates transmission for the second. The transmit (TX) interface operates based on a similar protocol.
The AD9361 provides the FPGA with six differential data pairs, one differential
source-synchronous clock pair, and a frame synchronization signal. The physical layer implementation for this interface, detailed in the block diagram in
Figure 4, follows a structured processing chain. First, the incoming differential clock is converted to a single-ended signal by an IBUFDS primitive and then routed through a BUFG global buffer to ensure high-quality, low-skew clock distribution. The differential data and frame signals are similarly passed through IBUFDS primitives. To compensate for signal skew arising from PCB trace length mismatches, each of these single-ended signals is then fed into an IDELAYE2 primitive for fine-grained delay tuning. Finally, the delay-compensated signals are input to IDDR primitives, which de-serialize the double-data-rate stream by capturing data on both the rising and falling clock edges. The resulting parallel bits are then concatenated to reconstruct the full 12-bit I and Q data samples. The transmit path is implemented using an analogous methodology.
2.3. Data Storage and Coherent Integration Module
The data storage and coherent accumulation capabilities of this system are built upon a high-performance DDR4 SDRAM. Operating at a clock frequency of 1200 MHz, it provides the hardware foundation for the real-time throughput of massive radar data. The module processes a composite 24-bit data stream (12-bit I and 12-bit Q) from the AD9361 at a 40 MHz sampling rate. Due to the inherent limitations of DDR4 memory—such as periodic self-refresh cycles and the inability to perform simultaneous read and write operations—we employ a time-division operational strategy. This strategy involves writing one full frame of data before initiating the read and process sequence for multiple frames.
To enable flexible pulse accumulation, the system supports a dynamically configurable range of 1 to 40 pulses. Accordingly, the 512 MB physical address space of the DDR4 is logically partitioned into 40 distinct 12.8 MB regions, each dedicated to storing the data from a single pulse echo. For efficient data management and memory alignment, we define a "data frame" as a 2 KB block containing 680 24-bit samples. This structure allows the storage depth for each pulse, configurable from 1 to 18,823 frames, to be dynamically set by the host computer based on the specific detection task. As illustrated in
Figure 5, the module comprises three core sub-modules: Data Write, Coherent Accumulation, and Data Computation, which collectively manage the entire workflow from data buffering to the final accumulated output.
2.3.1. Data Write Module
To enable flexible pulse accumulation, the system supports a dynamically configurable range of 1 to 40 pulses. Accordingly, the 512 MB physical address space of the DDR4 is logically partitioned into 40 distinct 12.8 MB regions, each dedicated to storing the data from a single pulse echo. For efficient data management and memory alignment, we define a "data frame" as a 2 KB block containing 680 24-bit samples. This structure allows the storage depth for each pulse, configurable from 1 to 18,823 frames, to be dynamically set by the host computer based on the specific detection task. As illustrated in
Figure 5, the module comprises three core sub-modules: Data Write, Coherent Accumulation, and Data Computation, which collectively manage the entire workflow from data buffering to the final accumulated output.
2.3.2. Coherent integration module
This module orchestrates the intricate process of storing and retrieving data frames from DDR4 memory for accumulation. Its operation begins by reconstructing a complete data frame (680 samples, 2 KB) from the 192-bit-wide stream provided by the Data Write module's output FIFO. The module's logic is best understood in two phases: an initial buffer-filling phase and a steady-state circular buffer phase.
Initial Filling Phase: Initially, the system writes incoming data frames sequentially into the N dynamically configured pulse regions within the DDR4 memory. The memory address pointer automatically advances to the next region once the current one is filled to its configured frame depth. This process continues until all N pulse regions have been populated for the first time.
Steady-State Circular Buffer Phase: Once the buffers are full, the module transitions to its steady-state operation, functioning as a large, multi-pulse circular buffer. Here, the arrival of a new data frame, which is used to overwrite the oldest corresponding frame data, simultaneously triggers a synchronized read operation. This read command fetches data from the same frame address across all N active pulse regions. This critical step provides the N parallel data streams required for the subsequent computation stage. As shown in
Figure 6, the data read from each of the N regions is then temporarily buffered into one of 40 dedicated, 192-bit-wide FIFOs, with the number of active FIFOs precisely matching the configured pulse count.
2.3.3. Data Computing Module
This module performs the parallel summation of the data streams from the multiple output FIFOs. A naive approach of summing up to 40 input channels in a single clock cycle would result in an extremely high logic fan-in, creating a critical path that makes achieving timing closure impossible at the target frequency. To overcome this, our design implements a three-stage pipelined adder tree, as illustrated in
Figure 7.
Stage 1: The 40 input data streams are divided into groups, and seven parallel adders generate seven partial sums.Stage 2: The seven partial sums from the first stage are further reduced by a second layer of adders, yielding two intermediate sums.Stage 3: A final adder combines the two intermediate sums to produce the fully accumulated output signal.
This hierarchical, pipelined architecture systematically breaks the long combinatorial path into shorter, manageable segments, which significantly reduces the overall path delay and ensures stable operation at high clock frequencies.
2.4. Ethernet Data Transfer Module
Th This module manages the high-speed uplink of processed data to a host computer via Gigabit Ethernet. At the hardware level, the system employs a Micrel KSZ9031RNX Physical Layer (PHY) transceiver, which interfaces with the FPGA through the Reduced Gigabit Media Independent Interface (RGMII) standard.
At the protocol level, the implementation is based on the TCP/IP suite. We specifically selected the User Datagram Protocol (UDP) as the transport layer protocol, as it is ideally suited for the real-time, continuous data streams generated by the radar system. Unlike connection-oriented protocols such as TCP, UDP eliminates the overhead associated with connection establishment and data acknowledgment handshakes. This results in significantly lower latency and higher transmission efficiency, making it the optimal choice for high-throughput, real-time streaming applications. The internal logic of the UDP transmitter is implemented as a finite state machine (FSM), as depicted in
Figure 8.
The transmission data path begins with the final accumulated results from the Data Computation module being fed into a dedicated FIFO within the Ethernet module. This FIFO serves as a packet assembly buffer, interfacing with the byte-oriented network stack by providing an 8-bit (1-byte) wide read port. The system monitors the fill level of this buffer continuously. Once the amount of cached data reaches 1500 bytes—a size selected to match a typical Ethernet Maximum Transmission Unit (MTU)—the UDP packet transmission sequence is triggered, sending a full data payload to the host computer.
2.5. Host Computer Interface and Operation
System control and parameter configuration are handled by a custom host-side Graphical User Interface (GUI) developed in MATLAB, shown in
Figure 9. The application, which communicates with the hardware platform via a serial port, features an automatic scanning and connection function for user convenience. The GUI is organized into three primary panels.
Transceiver & Acquisition Settings: This panel allows for the configuration of the AD9361 RF front-end, including parameters like transmit attenuation and receive gain. Users can also set key acquisition parameters, such as the transmission duration and the number of data frames to capture. A "System Initialization" button provides a one-click reset to default values.
Waveform Generation: Here, users can define the transmission waveform by specifying its amplitude and frequency. Generating a Linear Frequency Modulated (LFM) signal is straightforward: the user selects a checkbox, enters the LFM parameters, and clicks the "Generate Signal" button.
Accumulation Control & System Status: This panel contains the most critical run-time control, allowing the user to dynamically set the number of pulses for coherent accumulation (from 1 to 40). It also provides real-time hardware monitoring, where users can query the board's temperature and voltage, with the readouts appearing in a status display console.
Processed data is transmitted from the FPGA over the network to the host computer, where it can be captured using a standard network utility like the 'NetAssist' tool shown in
Figure 10. To receive the data stream, the utility is configured by selecting the UDP protocol, setting the local host's IP address and listening port, and then activating the listening service. Furthermore, the tool includes an integrated data logging feature. Enabling the "Receive and save to file" option automatically stores the payload of all incoming UDP packets to a local file, providing a convenient method for subsequent offline analysis.