Vector addition
Introduction
The Vitis/XRT framework adopts a heterogeneous programming model that utilizes both the CPU and FPGA-based processing units. This model comprises two primary components: a host program and an FPGA binary containing the synthesized processing units. The host program, written in OpenCL C/C++, executes on the host CPU and calls user-space APIs provided by the AMD Runtime library (XRT) to manage interactions with the FPGA processing units. The kernel, written in C/C++ and potentially incorporating RTL modules in Verilog or VHDL, defines the functionality of the processing units. Once synthesized, these processing units communicate using standard AXI interfaces.
Vitis/XRT supports both HPC servers and embedded platforms. In HPC scenarios, the host program runs on an x86 processor, while the kernel is synthesized onto a PCIe®-attached acceleration card. This workshop focuses on the HPC workflow; details for the embedded platform approach are available through the provided link. For the initial program, we aim to accelerate vector addition on the FPGA.
Kernel code
Here, we will design the accelerator in C and rely on the Vitis framework to synthesize the HLS description into hardware. The Vitis framework also supports kernel descriptions in Verilog or VHDL.
The kernel code for our accelerator is outlined below.
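The full listing is published in the workshop repository; the following is a minimal sketch of such a kernel, where the function name vadd, the argument order (output first, matching the host-side setArg calls discussed later), and the bundle names gmem0 and gmem1 are illustrative assumptions:

// Sketch of an HLS vector-addition kernel (identifiers are assumptions).
extern "C" {
void vadd(int *c, const int *a, const int *b, int size) {
    // Map c and a to one AXI4 memory-mapped bundle and b to a separate one,
    // so that elements of a and b can be fetched in parallel.
#pragma HLS INTERFACE m_axi port=c bundle=gmem0
#pragma HLS INTERFACE m_axi port=a bundle=gmem0
#pragma HLS INTERFACE m_axi port=b bundle=gmem1

    for (int i = 0; i < size; i++) {
        c[i] = a[i] + b[i];
    }
}
}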
Key takeaways from this code include:
- extern "C" is used to avoid C++ name mangling issues.
- The HLS pragmas influence the design that HLS synthesis produces.
- The remaining code is a straightforward C implementation of vector addition.
HLS pragmas
HLS pragmas serve as a means to control the HLS process, providing a way to direct the HLS flow for optimizing the design, reducing latency, improving throughput performance, and minimizing area and device resource usage in the resulting RTL code. These pragmas are applied directly to the source code for the kernel.
This code employs three HLS interface pragmas to map function parameters to distinct kernel ports. These kernel ports are AXI4 memory-mapped (m_axi) interfaces, enabling the kernel to read and write data in global memory. To minimize overhead, ports a and c are wrapped into one bundle, while port b is placed in a separate bundle. This allows for parallel reading of elements from both arrays.
Host code
Now, let's dive into the host code. The initial segment of the host code focuses on configuring the OpenCL environment for FPGA programming through interaction with the Xilinx platform. The process starts by acquiring a list of available OpenCL platforms and iterates through them, displaying the name of each platform. In the context of OpenCL, a platform serves as a framework that offers an abstraction layer for heterogeneous computing, enabling developers to write programs that can be executed across diverse processing units like CPUs, GPUs, and FPGAs. The host program specifically chooses the first platform named "Xilinx," indicating the selection of a platform with FPGA cards from the Xilinx vendor.
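A sketch of this first step is shown below, assuming the OpenCL C++ bindings from CL/cl2.hpp; the variable names and version macros are illustrative, and the later fragments in this section continue inside the same main function:

#define CL_HPP_MINIMUM_OPENCL_VERSION 120
#define CL_HPP_TARGET_OPENCL_VERSION 120
#include <CL/cl2.hpp>
#include <cstdlib>
#include <iostream>
#include <string>
#include <vector>

int main(int argc, char **argv) {
    // Enumerate the available OpenCL platforms and pick the first one named "Xilinx".
    std::vector<cl::Platform> platforms;
    cl::Platform::get(&platforms);

    cl::Platform xilinx_platform;
    for (const auto &p : platforms) {
        std::string name = p.getInfo<CL_PLATFORM_NAME>();
        std::cout << "Platform: " << name << std::endl;
        if (name == "Xilinx") {
            xilinx_platform = p;
            break;
        }
    }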
Moving to the second phase, the host code retrieves a list of accelerator devices linked to the chosen platform and displays the count of identified devices. Subsequently, the host program iterates through the devices, presenting their names, and then selects the desired device through user input or a predefined value. This series of operations facilitates establishing a connection between the host and the targeted FPGA device, an essential step in preparing for subsequent FPGA programming using OpenCL.
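Continuing the sketch, the device query and selection might look as follows (index 0 is assumed here instead of user input):

    // Query the accelerator devices attached to the selected platform.
    std::vector<cl::Device> devices;
    xilinx_platform.getDevices(CL_DEVICE_TYPE_ACCELERATOR, &devices);
    std::cout << "Found " << devices.size() << " device(s)" << std::endl;

    for (const auto &d : devices) {
        std::cout << "  " << d.getInfo<CL_DEVICE_NAME>() << std::endl;
    }

    // Select a device; the first one is assumed, but it could come from user input.
    cl::Device device = devices[0];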
In this phase, the host program establishes an OpenCL context, the environment in which OpenCL operations are managed and coordinated. The context is created for the chosen FPGA device through the cl::Context class. Contexts play a central role in the OpenCL runtime: they own objects such as command queues, memory objects, programs, and kernels, and they govern the execution of kernels on one or more devices defined within the context.
Following this, the program creates an OpenCL command queue via the cl::CommandQueue class, which serves as the conduit for submitting OpenCL commands and transferring data to the FPGA device for execution. The command queue is associated with both the selected device and the previously established context, and it is configured with profiling enabled to support performance analysis.
Lastly, the host program initializes three OpenCL buffers using the cl::Buffer class. These buffers, denoted buffer_a, buffer_b, and buffer_res, provide the memory through which data is transferred between the host and the FPGA device during kernel execution, and they are central to orchestrating data flow and synchronization between the two throughout the OpenCL computation.
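A sketch of this setup follows; DATA_SIZE is assumed to be an integer constant defined in the host code, and the host-side vectors a, b, and result, which back the buffers via CL_MEM_USE_HOST_PTR, are illustrative:

    // Create the context for the selected device and a profiling-enabled command queue.
    cl::Context context(device);
    cl::CommandQueue q(context, device, CL_QUEUE_PROFILING_ENABLE);

    // Host-side storage backing the device buffers (DATA_SIZE assumed defined, e.g. 4096).
    std::vector<int> a(DATA_SIZE, 1), b(DATA_SIZE, 2), result(DATA_SIZE, 0);
    size_t bytes = DATA_SIZE * sizeof(int);

    cl::Buffer buffer_a(context, CL_MEM_USE_HOST_PTR | CL_MEM_READ_ONLY, bytes, a.data());
    cl::Buffer buffer_b(context, CL_MEM_USE_HOST_PTR | CL_MEM_READ_ONLY, bytes, b.data());
    cl::Buffer buffer_res(context, CL_MEM_USE_HOST_PTR | CL_MEM_WRITE_ONLY, bytes, result.data());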
Moving forward, the host program creates a program object associated with the established OpenCL context. This involves loading a binary file (the .xclbin file) containing the compiled FPGA kernel. The program binary is then loaded into an OpenCL program object, with the target FPGA device specified. A kernel object is subsequently created from the program using the KERNEL_CL identifier, with error checking applied. Successful creation of the program implies that the FPGA device has been programmed with the designated OpenCL kernel. If errors emerge during this process, the program reports that it was unable to program the device with the provided .xclbin file.
Note: the host source code includes a function, read_binary_file, responsible for reading the binary file.
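A sketch of this step follows. read_binary_file and KERNEL_CL appear in the original host code; their exact definitions, the assumption that the .xclbin path arrives as the second command-line argument, and the simplified error handling are assumptions here:

    // Load the FPGA binary (.xclbin) and program the selected device with it.
    std::string xclbin_path = argv[2];  // assumed invocation: ./host <platform> <xclbin>
    std::vector<unsigned char> binary = read_binary_file(xclbin_path);  // assumed return type
    cl::Program::Binaries bins{binary};

    cl_int err;
    cl::Program program(context, {device}, bins, nullptr, &err);
    if (err != CL_SUCCESS) {
        std::cerr << "Failed to program device with " << xclbin_path << std::endl;
        return EXIT_FAILURE;
    }

    // Create the kernel object; KERNEL_CL is assumed to hold the kernel's name.
    cl::Kernel kernel(program, KERNEL_CL, &err);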
In the subsequent step, the host program initiates the migration of arrays a and b from host to device memory. The definition of the arrays outlines the specific buffers to which each array is mapped. Following this, the program proceeds to set the kernel arguments for the OpenCL kernel established in prior steps. The setArg method binds the kernel arguments to their corresponding OpenCL kernel parameters. In this context, buffer_res, buffer_a, and buffer_b are assigned to the kernel's first, second, and third arguments. A constant value (DATA_SIZE) is also designated as the fourth argument. These kernel arguments define the data sources and sizes the OpenCL kernel will manipulate during its execution on the FPGA device.
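A sketch of the migration and argument setup (the flag value 0 passed to enqueueMigrateMemObjects requests host-to-device migration):

    // Migrate the input buffers from host memory to device (global) memory.
    q.enqueueMigrateMemObjects({buffer_a, buffer_b}, 0 /* 0: host -> device */);

    // Bind the kernel arguments: output buffer, the two input buffers, and the size.
    kernel.setArg(0, buffer_res);
    kernel.setArg(1, buffer_a);
    kernel.setArg(2, buffer_b);
    kernel.setArg(3, DATA_SIZE);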
Subsequently, the host program enqueues the OpenCL kernel (kernel) for execution on the FPGA device using the command q.enqueueTask(kernel). This command submits the kernel for execution in the command queue.
After the kernel execution, the program proceeds to the final step, which reads the output buffer (buffer_res) back from the FPGA device to the host. A vector named result stores the results, and the command q.finish() ensures that all previously enqueued commands on the command queue are completed before moving forward. Finally, the command q.enqueueMigrateMemObjects performs a read of the output buffer, copying the results from the FPGA device to the host's result vector. This step finalizes the data transfer and enables the host program to access the computed results from the FPGA.
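A sketch of these final steps closes out the host program; in this sketch the result vector backing buffer_res was already allocated when the buffers were created, and q.finish() is called after each enqueue to keep the ordering explicit (the original code may order these calls slightly differently):

    // Launch the kernel and wait for it to complete.
    q.enqueueTask(kernel);
    q.finish();

    // Copy the output buffer back from device memory into the host's result vector.
    q.enqueueMigrateMemObjects({buffer_res}, CL_MIGRATE_MEM_OBJECT_HOST);
    q.finish();

    // result[i] now holds a[i] + b[i] as computed on the FPGA.
    return 0;
}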
The above code is published in folder 01-vec-add of the workshop's repo.
Hardware and software simulation
In the XRT framework, software and hardware emulation play pivotal roles in developing and validating FPGA-accelerated applications. Software emulation involves the simulation of FPGA kernels on a host machine using a high-level language like C++ or OpenCL. This allows developers to quickly test and debug their algorithms before deploying them to the hardware. The following shell commands are used to compile the kernel for software simulation.
emconfigutil --platform $PLATFORM
v++ -c -t sw_emu --platform $PLATFORM --config $DIR/fpga.cfg -k $KERNEL $DIR/$KERNEL.cpp -o ./$KERNEL.xo
v++ -l -t sw_emu --platform $PLATFORM --config $DIR/fpga.cfg ./$KERNEL.xo -o ./$KERNEL.xclbin
On the other hand, hardware emulation provides a more accurate representation of the FPGA device by utilizing FPGA-specific simulation models. This enables developers to assess the performance and functionality of their designs in an environment resembling the target FPGA hardware. To compile the kernel for hardware emulation, we only need to change the target switch -t to hw_emu.
emconfigutil --platform $PLATFORM
v++ -c -t hw_emu --platform $PLATFORM --config $DIR/fpga.cfg -k $KERNEL $DIR/$KERNEL.cpp -o ./$KERNEL.xo
v++ -l -t hw_emu --platform $PLATFORM --config $DIR/fpga.cfg ./$KERNEL.xo -o ./$KERNEL.xclbin
After compiling both the host application and the kernel, we simply run the following command to launch the simulation:
XCL_EMULATION_MODE=sw_emu ./host $PLATFORM $KERNEL.xclbin
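For hardware emulation, the same command is used with the emulation mode variable set to hw_emu (assuming the kernel was built with -t hw_emu):
XCL_EMULATION_MODE=hw_emu ./host $PLATFORM $KERNEL.xclbin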
We automate this process using the shell scripts compile.sh and run.sh.