Host code optimization
In the previous part, we focused on optimizing the kernel itself, aiming for efficient memory transfers and a streamlined datapath. However, there is still room for improvement in the host code. By fine-tuning the host code, we can optimize memory transfers between host memory and the device, orchestrate the execution of multiple kernels, synchronize the host and the kernels more efficiently, and more.
In this part, we implement a program that multiplies two matrices. While writing a dedicated kernel for matrix multiplication is a common approach, we will instead reuse the previously created matrix-vector multiplication kernel. In other words, we reduce the matrix multiplication problem to a set of matrix-vector products that can be computed in parallel. In this exercise, we employ four compute units running the matrix-vector multiplication kernel. The approach is illustrated in the image below and sketched in code right after. Each kernel receives matrix A as input, while the input column and the resulting vector differ. The columns are distributed to the kernels cyclically, so the multiplication proceeds over groups of columns in a strip-mined, round-robin fashion. While the implementation may not be optimal, its concept is straightforward and illustrates the parallel execution of multiple kernels.
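To make the decomposition concrete, here is a plain host-side sketch that computes the same result without any OpenCL, distributing the columns round-robin over `NUM_CU` "units". The column-major layout of B and C and all names here are illustrative, not taken from the original code:

```cpp
#include <cstddef>
#include <vector>

constexpr std::size_t NUM_CU = 4;  // number of matrix-vector units

// y = A * x, with A stored row-major as rows x cols
void mat_vec(const std::vector<float>& A, const float* x, float* y,
             std::size_t rows, std::size_t cols) {
    for (std::size_t r = 0; r < rows; ++r) {
        float acc = 0.0f;
        for (std::size_t c = 0; c < cols; ++c)
            acc += A[r * cols + c] * x[c];
        y[r] = acc;
    }
}

// C = A * B, computed column by column: C[:, j] = A * B[:, j].
// B and C are stored column-major; column j goes to unit j % NUM_CU.
void mat_mult(const std::vector<float>& A, const std::vector<float>& B,
              std::vector<float>& C, std::size_t rows, std::size_t cols,
              std::size_t n) {
    for (std::size_t j = 0; j < n; j += NUM_CU)                // one group per pass
        for (std::size_t k = 0; k < NUM_CU && j + k < n; ++k)  // one column per unit
            mat_vec(A, &B[(j + k) * cols], &C[(j + k) * rows], rows, cols);
}
```

On the FPGA, the inner loop over k is what the four compute units execute in parallel.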
To allow the execution of multiple kernels, several changes need to be introduced in the host code compared to the previous implementation:
- Kernel Instantiation: We instantiate several kernels under the same context and command queue, as sketched below.
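A minimal sketch of this step, assuming the OpenCL C++ bindings and the Vitis-style `kernel_name:{cu_name}` selector for addressing individual compute units; the kernel name `mat_vec_mult` and the CU names are illustrative:

```cpp
#include <CL/cl2.hpp>
#include <string>
#include <vector>

constexpr int NUM_CU = 4;  // number of mat_vec_mult compute units

// device is the FPGA device selected earlier and bins the loaded xclbin
// (that setup is not shown here). All kernels share one context and,
// for now, one in-order command queue.
cl::Context context(device);
cl::CommandQueue queue(context, device, CL_QUEUE_PROFILING_ENABLE);
cl::Program program(context, {device}, bins);

// One cl::Kernel object per compute unit.
std::vector<cl::Kernel> kernels;
for (int k = 0; k < NUM_CU; ++k) {
    std::string name =
        "mat_vec_mult:{mat_vec_mult_" + std::to_string(k + 1) + "}";
    kernels.emplace_back(program, name.c_str());
}
```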
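Each compute unit also needs its own device buffers. In the sketch below, matrix A is shared read-only by all units, while each unit gets a private input column and output vector; the sizes and flags are assumptions:

```cpp
// Continues the sketch above: one shared buffer for A, plus per-CU
// buffers for the input column x and the result vector y.
struct DeviceBuffers {
    cl::Buffer A;
    std::vector<cl::Buffer> x, y;
};

DeviceBuffers make_buffers(const cl::Context& ctx,
                           std::size_t rows, std::size_t cols) {
    DeviceBuffers buf{
        cl::Buffer(ctx, CL_MEM_READ_ONLY, rows * cols * sizeof(float)), {}, {}};
    for (int k = 0; k < NUM_CU; ++k) {
        buf.x.emplace_back(ctx, CL_MEM_READ_ONLY,  cols * sizeof(float));
        buf.y.emplace_back(ctx, CL_MEM_WRITE_ONLY, rows * sizeof(float));
    }
    return buf;
}
```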
- Matrix Multiplication: The matrix multiplication is divided into two parts, as shown in the sketch below. In the first part, the arguments are set for every kernel and its execution is started; the code then waits for all kernels to finish. In the second part, the results are collected and the next loop iteration begins.
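A sketch of this loop, reusing the names from the sketches above (queue, kernels, and a DeviceBuffers instance buf); the kernel's argument order and the column-major layout of B and C are assumptions:

```cpp
#include <algorithm>

// A is needed by every kernel, so it is transferred once up front.
queue.enqueueWriteBuffer(buf.A, CL_TRUE, 0,
                         rows * cols * sizeof(float), A.data());

for (std::size_t j = 0; j < n; j += NUM_CU) {
    std::size_t active = std::min<std::size_t>(NUM_CU, n - j);

    // Part 1: set the arguments of every active kernel and start it.
    for (std::size_t k = 0; k < active; ++k) {
        // Blocking write, so the column is in place before the launch
        // regardless of the queue's ordering mode.
        queue.enqueueWriteBuffer(buf.x[k], CL_TRUE, 0,
                                 cols * sizeof(float), &B[(j + k) * cols]);
        kernels[k].setArg(0, buf.A);
        kernels[k].setArg(1, buf.x[k]);
        kernels[k].setArg(2, buf.y[k]);
        queue.enqueueTask(kernels[k]);
    }
    queue.finish();  // wait for all launched kernels to finish

    // Part 2: collect the result columns, then start the next iteration.
    for (std::size_t k = 0; k < active; ++k)
        queue.enqueueReadBuffer(buf.y[k], CL_TRUE, 0,
                                rows * sizeof(float), &C[(j + k) * rows]);
}
```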
These changes should enable the parallel execution of multiple kernels and hence a faster matrix multiplication. Now, let's analyze the obtained traces.
According to the image below, it's evident that only one compute unit is doing the work, contrary to our expectations. A compute unit corresponds to one synthesized instance of the kernel in the FPGA fabric. This behavior is due to the default configuration of the OpenCL queue, which executes all kernels in order, one after another.
This behavior is expected when multiple kernels share the same in-order queue, as in our scenario. With multiple queues, parallel execution would be possible even with in-order semantics. To distribute the work among multiple compute units while keeping the shared queue, we need to create a queue that allows out-of-order execution.
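The essential change is a single queue property; a minimal sketch, assuming the same context and device as before:

```cpp
// Replace the default in-order queue with one that permits out-of-order
// execution, so independent enqueueTask() calls can run concurrently on
// different compute units. Profiling stays enabled for the traces.
cl::CommandQueue queue(context, device,
                       CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE |
                           CL_QUEUE_PROFILING_ENABLE);
```

Note that queue.finish() still blocks until every enqueued command has completed, so the two-part loop above keeps working unchanged.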
Out-of-order execution of kernels refers to a mode of operation where multiple kernels are launched concurrently and the order in which they complete is not strictly determined by the order in which they were launched. In the traditional in-order execution model, kernels run sequentially, one after the other: a kernel must complete before the next one can start. With out-of-order execution, multiple kernels can be in flight at the same time, and their execution can overlap or be interleaved.
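When a particular ordering does matter in an out-of-order queue, it has to be requested explicitly, typically through events; a hypothetical fragment:

```cpp
// In an out-of-order queue, dependencies are expressed with events:
// kernels[1] is only allowed to start once kernels[0] has finished.
cl::Event done;
queue.enqueueTask(kernels[0], nullptr, &done);  // no dependencies
std::vector<cl::Event> deps{done};
queue.enqueueTask(kernels[1], &deps, nullptr);  // waits for kernels[0]
```

In our case the four matrix-vector kernels are independent, so no such dependencies are needed. With out-of-order execution enabled, let's look at the obtained time traces.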
As we can see, multiple compute units are operating concurrently, illustrating parallel execution. The corresponding code can be found in the repository.