Xilinx Runtime (XRT) is an open-source software stack for managing and utilizing FPGAs to accelerate compute-intensive applications. Unlike traditional approaches to building applications on FPGAs, XRT adopts the familiar host-kernel programming model commonly seen in heterogeneous systems, enabling users to develop custom accelerators more quickly and efficiently. Currently, XRT supports PCIe-based FPGA devices from the Alveo family and several embedded FPGA devices from the Zynq family. In this workshop, we will use Alveo U250 boards.
The key benefits offered by XRT include:
- Facilitates the development of FPGA accelerators without requiring hardware expertise.
- Provides a common API for various use cases, such as edge computing or high-performance computing (HPC).
- Automates the deployment of accelerators on FPGAs.
- Allows multiple FPGAs on a single node or server to be used and orchestrated together.
Developing accelerators on FPGAs is a demanding task. In addition to designing the accelerator itself, developers must carefully design the environment in which the accelerator resides and how it communicates with the outside world, including access to peripherals and memory. Traditionally, this was accomplished by connecting multiple intellectual property (IP) blocks, which played a supporting but essential role in the overall design. However, when developing a new accelerator, there is often no need to modify the surrounding FPGA design; the focus can rest solely on the accelerator itself. These supporting IPs remain static across similar applications, while the accelerator design is adapted to specific needs. This motivated the division of the XRT platform into an immutable Shell and a reconfigurable user partition. The figure below illustrates how the platform is structured:
The Shell provides the core infrastructure for the Alveo platform, including a hardened PCIe block that establishes physical connectivity to the host PCIe bus via two physical functions. The Shell is loaded from the PROM during system boot and cannot be changed thereafter. It supplies data and control signals to the user-compiled design in the user partition. One crucial component of the Shell architecture is the ICAP module, whose primary task is reprogramming the user partition's FPGA fabric with the compiled design.
The Shell exposes two physical functions¹: a privileged and a non-privileged function. The privileged (management) function is responsible for board management tasks, including loading firmware, shell updates, device resets, and downloading user-compiled images; in the XRT stack, the xclmgmt driver binds to this management physical function. The xocl driver, in turn, binds to the user physical function, which provides access to the compute units in the user partition and handles crucial functionality such as device memory management, controlling the DMA engines, seamless execution of kernels, interrupt handling, and more.
User partition and host-kernel synchronization
Our accelerator and the memory subsystem (MMSS) reside in the user partition, fetching data from device memory. Notably, all accelerators expose standard AXI interfaces, which facilitates communication with the accelerators and the management of kernel execution. Two modes of execution exist: serial and pipelined. In serial mode, the host and kernel rely on a simple synchronization scheme involving the "start" and "done" signals. Pipelined execution adds two further signals: "ready" and "continue." The "ready" signal informs the host that the kernel can accept a new request while earlier ones are still being processed, while the "continue" signal instructs the kernel to proceed with data processing.
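The difference between the two handshakes can be illustrated with a toy cycle-level simulation. The signal names follow the text above; the three-cycle kernel latency and the pipeline depth are invented for illustration and do not come from any real platform:

```python
# Toy model of the serial ("start"/"done") and pipelined ("ready"/"continue")
# host-kernel handshakes. Latency and pipeline depth are made-up numbers.

LATENCY = 3  # cycles a request spends inside the kernel

def simulate(n_requests, pipelined):
    """Return how many cycles it takes to process n_requests."""
    in_flight = []   # completion cycle of each accepted request, oldest first
    issued = retired = cycle = 0
    while retired < n_requests:
        # "done": the oldest request finished; the host retires it
        # (in pipelined mode this is where it would assert "continue")
        if in_flight and in_flight[0] <= cycle:
            in_flight.pop(0)
            retired += 1
        # "start": serial mode accepts a request only when the kernel is idle;
        # pipelined mode asserts "ready" while the pipeline still has room
        can_accept = len(in_flight) < LATENCY if pipelined else not in_flight
        if issued < n_requests and can_accept:
            in_flight.append(cycle + LATENCY)
            issued += 1
        cycle += 1
    return cycle

print(simulate(8, pipelined=False), simulate(8, pipelined=True))
```

In serial mode each request costs the full kernel latency, whereas in pipelined mode the host issues one request per cycle and only the last one pays the full latency, so throughput approaches one request per cycle.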
Programming and execution model
The XRT programming model employs a heterogeneous approach, utilizing both the CPU and the FPGA. As in GPGPU programming, the term "host" refers to the CPU and its associated memory, while "device" refers to the FPGA and its memory. Code executed on the host can manage memory on both the host and the device and launch kernels, which are functions executed on the device. The device code, or kernel, describes the functionality of the compute units synthesized on the FPGA. The typical programming flow is illustrated in the image below.
Users utilize the Vitis™ compiler, v++, to compile and link device code for the target platform, generating an FPGA binary that represents the synthesized accelerator. These binaries, along with additional metadata, are packaged into xclbin files. Host code can be written in C/C++/OpenCL and compiled with gcc/g++, or written in Python using OpenCL (PyOpenCL) or the built-in XRT Python bindings (pyxrt).
The execution process begins by downloading the xclbin file onto the FPGA board through specific system calls. Next, data buffers are allocated for the compute units, and data is transferred from host to device memory. An execution command buffer is created and submitted to the XRT driver, which schedules and controls the kernel run. Once execution concludes, the data buffers are migrated back to host memory. Lastly, both the data and command buffers are released.
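These steps can be sketched with the XRT Python bindings mentioned above. This is a minimal sketch rather than a complete application: the kernel name "vadd", its argument order, and the xclbin path are placeholders, and real code would match them to the actual compiled design. It requires an Alveo board and an XRT installation to run:

```python
# Hypothetical walk-through of the execution flow with pyxrt. The kernel
# "vadd" and its three-buffer signature are placeholder assumptions.

def run_vadd(xclbin_path, a, b):
    """Run a (hypothetical) vadd kernel on two equal-length byte buffers."""
    import pyxrt  # imported here so the sketch can be read without XRT installed

    dev = pyxrt.device(0)                              # open the first device
    uuid = dev.load_xclbin(pyxrt.xclbin(xclbin_path))  # download the FPGA binary
    krnl = pyxrt.kernel(dev, uuid, "vadd")             # look up the compute unit

    size = len(a)  # bytes; assumes len(a) == len(b)
    # allocate device buffers on the memory bank each kernel argument uses
    bo_a = pyxrt.bo(dev, size, pyxrt.bo.normal, krnl.group_id(0))
    bo_b = pyxrt.bo(dev, size, pyxrt.bo.normal, krnl.group_id(1))
    bo_out = pyxrt.bo(dev, size, pyxrt.bo.normal, krnl.group_id(2))

    # transfer input data from host to device memory
    bo_a.write(a, 0)
    bo_a.sync(pyxrt.xclBOSyncDirection.XCL_BO_SYNC_BO_TO_DEVICE, size, 0)
    bo_b.write(b, 0)
    bo_b.sync(pyxrt.xclBOSyncDirection.XCL_BO_SYNC_BO_TO_DEVICE, size, 0)

    # build and submit the execution command, then wait for completion
    run = krnl(bo_a, bo_b, bo_out, size)
    run.wait()

    # migrate the result back to host memory; the buffer objects are
    # released when they go out of scope
    bo_out.sync(pyxrt.xclBOSyncDirection.XCL_BO_SYNC_BO_FROM_DEVICE, size, 0)
    return bo_out.read(size, 0)
```

The `krnl.group_id(i)` calls ask XRT which memory bank the i-th kernel argument was connected to at link time, so the buffers land where the compute unit expects them.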
¹ Physical functions are used for hardware-related management and special, sensitive tasks like firmware updates.