
Vector addition

Introduction

The Vitis/XRT framework adopts a heterogeneous programming model that uses both CPU and FPGA-based processing units. The model comprises two primary components: a host program and an FPGA binary containing the synthesized processing units. The host program, written in OpenCL C/C++, executes on the host CPU and calls user-space APIs provided by the AMD Runtime library (XRT) to manage interactions with the FPGA processing units. The kernel, written in C/C++ and potentially incorporating RTL modules in Verilog or VHDL, defines the functionality of the processing units. Once synthesized, these processing units communicate using standard AXI interfaces. Vitis/XRT supports both HPC servers and embedded platforms. In the HPC scenario, the host program runs on an x86 processor while the kernel is synthesized onto a PCIe®-attached acceleration card. This workshop focuses on the HPC workflow, with details for an embedded platform approach available through the provided link. For the initial program, we aim to accelerate vector addition on the FPGA.

Kernel code

Here, we will design the accelerator in C and rely on the Vitis framework to synthesize the HLS description into hardware. The Vitis framework also supports kernel descriptions in Verilog or VHDL.

The kernel code for our accelerator is given:

extern "C" {
    void vadd(int* c,
        const int* a,
        const int* b,
        const int n_elements)
    {

        #pragma HLS interface m_axi port=a bundle=aximm1
        #pragma HLS interface m_axi port=b bundle=aximm2
        #pragma HLS interface m_axi port=c bundle=aximm1


        // Local buffers used to process the data in chunks; BUFFER_SIZE is a
        // compile-time constant assumed to be defined elsewhere (e.g. via #define).
        int arrayA[BUFFER_SIZE];
        int arrayB[BUFFER_SIZE];
        int arrayC[BUFFER_SIZE];

    main_loop:
        for (int i = 0; i < n_elements; i += BUFFER_SIZE){

            int size = BUFFER_SIZE;

            if(i + size > n_elements)
                size = n_elements - i;

        readA:
            for(int j = 0; j < size; j++)
                arrayA[j] = a[i + j];

        readB:
            for(int j = 0; j < size; j++)
                arrayB[j] = b[i + j];

        vadd:
            for(int j = 0; j < size; j++)
                arrayC[j] = arrayA[j] + arrayB[j];

        writeC:
            for(int j = 0; j < size; j++)
                c[i + j] = arrayC[j];
        }
    }
}

Key takeaways from this code include:

  1. The use of extern "C" to prevent C++ name mangling, so that the kernel function can be located by name.
  2. The pragmas influence the resulting design of HLS synthesis.
  3. The remaining code is a straightforward C program for vector addition.

HLS pragmas

HLS pragmas control the HLS process: they direct the HLS flow to optimize the design, reduce latency, improve throughput, and minimize area and device resource usage in the resulting RTL code. The pragmas are applied directly to the kernel source code.

This code employs three HLS interface pragmas to map function parameters to kernel ports. These kernel ports are AXI4 memory-mapped (m_axi) interfaces that let the kernel read and write data in global memory. Ports a and c share one bundle (a single AXI master interface), which keeps resource usage down, while port b is placed in a separate bundle so that elements of a and b can be read in parallel.
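
For illustration only (this variant is not used in the workshop kernel), assigning each pointer to its own bundle would give every argument a dedicated AXI master interface, trading extra interface logic for more potential memory parallelism:

// Hypothetical alternative: one AXI master interface per argument
#pragma HLS interface m_axi port=a bundle=aximm1
#pragma HLS interface m_axi port=b bundle=aximm2
#pragma HLS interface m_axi port=c bundle=aximm3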

Host code

Now, let's dive into the host code. The initial segment of the host code focuses on configuring the OpenCL environment for FPGA programming through interaction with the Xilinx platform. The process starts by acquiring a list of available OpenCL platforms and iterates through them, displaying the name of each platform. In the context of OpenCL, a platform serves as a framework that offers an abstraction layer for heterogeneous computing, enabling developers to write programs that can be executed across diverse processing units like CPUs, GPUs, and FPGAs. The host program specifically chooses the first platform named "Xilinx," indicating the selection of a platform with FPGA cards from the Xilinx vendor.

//***************************************************
// STEP 1: Get the platform 
//***************************************************
vector<cl::Platform> platforms;
cl::Platform::get(&platforms);
cl::Platform platform;

for(cl::Platform &p: platforms)
{
    const string name = p.getInfo<CL_PLATFORM_NAME>();
    cout << "PLATFORM: " << name << endl;
    if(name == "Xilinx")
    {
        platform = p;
        break;
    }
}

if(platform == cl::Platform())
{
    cout << "Xilinx platform not found!" << endl;
    exit(EXIT_FAILURE);
}

Moving to the second phase, the host code retrieves the list of accelerator devices attached to the chosen platform and displays the number of devices found. The host program then iterates through the devices, printing their names, and selects the desired device by matching its name against the one supplied on the command line (argv[1]). These operations establish the connection between the host and the targeted FPGA device, an essential step in preparing for subsequent FPGA programming with OpenCL.

//***************************************************
// STEP 2: Get the devices and select the desired device 
//***************************************************

vector<cl::Device> devices;
platform.getDevices(CL_DEVICE_TYPE_ACCELERATOR, &devices);

cout<<"Number of devices found: " << devices.size() << endl;

cl::Device device;
for(cl::Device &iterDevice: devices){
    cout << "DEVICE: " << iterDevice.getInfo<CL_DEVICE_NAME>() << endl;
    if(iterDevice.getInfo<CL_DEVICE_NAME>() == argv[1])
        device = iterDevice;
}

In this phase, the host program creates an OpenCL context, the environment in which OpenCL objects are managed and coordinated. The context is created for the chosen FPGA device through the cl::Context class. Contexts are used by the OpenCL runtime to manage objects such as command queues, memory, program and kernel objects, and to execute kernels on one or more devices defined within the context.

Following this, the program generates an OpenCL command queue via the cl::CommandQueue class, functioning as a conduit for submitting OpenCL commands and transferring data to the FPGA device for execution. This command queue is associated with both the selected device and the previously established OpenCL context. Notably, it is configured to enable profiling, empowering performance analysis capabilities.
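
Because the queue is created with CL_QUEUE_PROFILING_ENABLE, the host can later time kernel execution through an OpenCL event. The snippet below is an illustrative sketch and not part of the listings in this section; it assumes the kernel and queue created in the later steps:

// Hypothetical use of the profiling-enabled queue: time the kernel with an event
cl::Event event;
q.enqueueTask(kernel, nullptr, &event);   // same enqueue as in STEP 9, but capturing an event
event.wait();                             // block until the kernel has finished

cl_ulong start = event.getProfilingInfo<CL_PROFILING_COMMAND_START>();
cl_ulong end   = event.getProfilingInfo<CL_PROFILING_COMMAND_END>();
std::cout << "Kernel time: " << (end - start) * 1e-6 << " ms" << std::endl;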

Lastly, the host program initializes three OpenCL buffers using the cl::Buffer class. These buffers, buffer_a, buffer_b, and buffer_res, are the memory objects through which data is transferred between the host and the FPGA device during kernel execution. Because they are created with the CL_MEM_USE_HOST_PTR flag, they wrap the existing host arrays a, b, and c rather than allocating separate host storage. The buffers orchestrate the data flow and synchronization between the host and the FPGA device throughout the OpenCL computation.

//***************************************************
// STEP 3: Create a context 
//***************************************************
// we create a context with the selected device using Context class 

cl::Context context(device, nullptr, nullptr, nullptr, &err);
cout << "CONTEXT ERROR: " << err << endl;

//***************************************************
// STEP 4: Create a command queue 
//***************************************************
// we create a command queue with the selected device and context using CommandQueue class 

cl::CommandQueue q(context, device, CL_QUEUE_PROFILING_ENABLE, &err);
cout << "COMMAND QUEUE ERROR: " << err << endl;

//***************************************************
// STEP 5: Create device buffers
//***************************************************

cl::Buffer buffer_a(context, CL_MEM_USE_HOST_PTR | CL_MEM_READ_ONLY, DATA_SIZE * sizeof(int), a, &err);    

cl::Buffer buffer_b(context, CL_MEM_USE_HOST_PTR | CL_MEM_READ_ONLY, DATA_SIZE * sizeof(int), b, &err);    

cl::Buffer buffer_res(context,  CL_MEM_USE_HOST_PTR | CL_MEM_WRITE_ONLY, DATA_SIZE * sizeof(int), c, &err);

Moving forward, the host program creates a program object associated with the established OpenCL context. This involves loading a binary file (the .xclbin file) containing the compiled FPGA kernel. The program binary is loaded into an OpenCL program object with the target FPGA device specified; successful creation of the program implies that the FPGA device has been programmed with the designated OpenCL kernel. The host then creates a kernel object from the program, using the KERNEL_CL identifier, and checks for errors. If errors emerge during this process, the program reports that it was unable to program the device with the provided .xclbin file.

Note: the host source code includes a function, read_binary_file, responsible for reading the binary file.
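
A minimal sketch of what such a helper could look like (the actual implementation in the repository may differ in signature and error handling):

// Hypothetical read_binary_file: load the .xclbin contents into a byte vector
#include <fstream>
#include <stdexcept>
#include <string>
#include <vector>

std::vector<unsigned char> read_binary_file(const std::string &path)
{
    std::ifstream file(path, std::ios::binary | std::ios::ate); // open at the end to learn the size
    if (!file)
        throw std::runtime_error("Cannot open binary file: " + path);

    std::streamsize size = file.tellg();
    file.seekg(0, std::ios::beg);

    std::vector<unsigned char> buffer(size);
    file.read(reinterpret_cast<char *>(buffer.data()), size);
    return buffer;
}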

//***************************************************
// STEP 6: Create a program object for the context
//***************************************************


cl::Kernel kernel;
auto program_binary = read_binary_file(binary_file);
cl::Program::Binaries bins{{program_binary.data(), program_binary.size()}};

std::cout << "Trying to program device: " << device.getInfo<CL_DEVICE_NAME>() << std::endl;
cl::Program program(context, {device}, bins, nullptr, &err);
//***************************************************
// STEP 6 (continued): Create the kernel object
//***************************************************

if (err != CL_SUCCESS) {
    std::cout << "Failed to program device with xclbin file!\n";
    cout << err << endl;
} else {
    std::cout << "Device: program successful!\n";
    kernel = cl::Kernel(program, KERNEL_CL, &err);
}

In the subsequent step, the host program migrates arrays a and b from host to device memory. Because each buffer was created over its corresponding host array, the buffer definitions determine which host data is transferred by the migration. The program then sets the arguments of the OpenCL kernel established in the prior steps. The setArg method binds each value to the corresponding kernel parameter: buffer_res, buffer_a, and buffer_b are assigned to the kernel's first, second, and third arguments, and the constant DATA_SIZE is passed as the fourth. These kernel arguments define the data sources and sizes the OpenCL kernel will operate on during its execution on the FPGA device.

//***************************************************
// STEP 7: Write host data to device buffers
//***************************************************


q.enqueueMigrateMemObjects({buffer_a, buffer_b}, 0 ); 
/* 0 means from host*/

//***************************************************
// STEP 8: Set the kernel arguments
//***************************************************

kernel.setArg(0, buffer_res);
kernel.setArg(1, buffer_a);
kernel.setArg(2, buffer_b);
kernel.setArg(3, DATA_SIZE);

Subsequently, the host program enqueues the OpenCL kernel (kernel) for execution on the FPGA device using the command q.enqueueTask(kernel). This command submits the kernel for execution in the command queue.

After the kernel execution is enqueued, the program proceeds to the final step, which reads the output buffer (buffer_res) back from the FPGA device to the host. The command q.enqueueMigrateMemObjects with the CL_MIGRATE_MEM_OBJECT_HOST flag enqueues a transfer of the output buffer from device memory back to the host memory that backs it, and q.finish() then blocks until all previously enqueued commands, including the kernel and this transfer, have completed. This finalizes the data transfer and lets the host program access the results computed on the FPGA.

//***************************************************
// STEP 9: Enqueue the kernel for execution
//***************************************************

q.enqueueTask(kernel);

//***************************************************
// STEP 10: Read the output buffer back to the host
//***************************************************
// Synchronous/blocking read of results

q.enqueueMigrateMemObjects({buffer_res}, CL_MIGRATE_MEM_OBJECT_HOST);

q.finish();

The above code is published in folder 01-vec-add of the workshop's repo.
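
Once the results are back in host memory, a simple host-side check can confirm that the kernel produced the expected sums. The loop below is a sketch that is not part of the listing above; it assumes a, b, and c are the host arrays of DATA_SIZE ints that back the buffers:

// Hypothetical verification: compare device results against a host-side reference
bool ok = true;
for (int i = 0; i < DATA_SIZE; i++) {
    if (c[i] != a[i] + b[i]) {   // c was populated by migrating buffer_res back to the host
        std::cout << "Mismatch at index " << i << std::endl;
        ok = false;
        break;
    }
}
std::cout << (ok ? "TEST PASSED" : "TEST FAILED") << std::endl;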

Hardware and software simulation

In the XRT framework, software and hardware emulation play pivotal roles in developing and validating FPGA-accelerated applications. Software emulation simulates the FPGA kernels on the host machine using a high-level language like C++ or OpenCL, which allows developers to quickly test and debug their algorithms before deploying them to hardware. The following shell commands compile the kernel for software emulation.

emconfigutil --platform $PLATFORM
v++ -c -t sw_emu --platform $PLATFORM --config $DIR/fpga.cfg -k $KERNEL $DIR/$KERNEL.cpp -o ./$KERNEL.xo
v++ -l -t sw_emu --platform $PLATFORM  --config $DIR/fpga.cfg ./$KERNEL.xo -o ./$KERNEL.xclbin

In the previous commands, the Xilinx v++ compiler first compiles (-c) the kernel from the C++ source file into an .xo object and then links (-l) it into an .xclbin binary for the target platform.

On the other hand, hardware emulation provides a more accurate representation of the FPGA device by utilizing FPGA-specific simulation models. This enables developers to assess the performance and functionality of their designs in an environment resembling the target FPGA hardware. To compile the kernel for hardware emulation, we just need to change the target switch -t to hw_emu.

emconfigutil --platform $PLATFORM
v++ -c -t hw_emu --platform $PLATFORM --config $DIR/fpga.cfg -k $KERNEL $DIR/$KERNEL.cpp -o ./$KERNEL.xo
v++ -l -t hw_emu --platform $PLATFORM  --config $DIR/fpga.cfg ./$KERNEL.xo -o ./$KERNEL.xclbin

We can go one step further and synthesize the kernel to run on the actual FPGA by changing the target switch -t to hw. However, synthesizing kernels for the FPGA takes a long time, so we will not cover this target in the workshop.

After compiling both the host application and the kernel, we run the following command to launch the simulation:

XCL_EMULATION_MODE=sw_emu ./host $PLATFORM $KERNEL.xclbin

Similarly, to run the hardware emulation, set the XCL_EMULATION_MODE environment variable to hw_emu (i.e., XCL_EMULATION_MODE=hw_emu ./host $PLATFORM $KERNEL.xclbin).

We automate this process using the shell scripts compile.sh and run.sh.