
Vector addition1

The CUDA programming model is a heterogeneous model using both the CPU and GPU. Recall that the host refers to the CPU and its memory, while the device refers to the GPU and its memory. Code run on the host manages memory on both the host and device and launches kernels, which are functions executed on the device. Many GPU threads then execute these kernels in parallel. Given the heterogeneous nature of the CUDA programming model, a typical sequence of operations for a CUDA C program is:

  • Declare and allocate host and device memory.
  • Initialize host data.
  • Transfer data from the host to the device.
  • Execute one or more kernels.
  • Transfer results from the device to the host.

Keeping this sequence of operations in mind, let’s look at our first CUDA C example. We start with a simple program that adds the corresponding elements of two arrays (vectors) with n elements each. We will solve the problem in two ways. First, we will proceed quite naively and focus only on the basics of GPU programming and the steps on the host that are needed to run our first program on the GPU. Second, we will focus on improving the kernel.

Let us begin with the C code for vector addition intended to run on a single-core CPU. The code below shows the implementation of a function in C that computes the sum of two vectors whose elements are single-precision floating-point numbers:

// Add the elements of two arrays
void vecAdd(float *a,
               float *b,
               float *c,
               int n) {

    int tid = 0;

    while (tid < n) {
        c[tid] = a[tid] + b[tid];
        tid += 1;
    }
}

The function arguments are the addresses of the three vectors (a, b, c) and the number of elements in the vectors (n). The vectors are summed by adding the corresponding elements (pair-wise) of the input vectors a and b and storing the result in the corresponding element of the vector c. Since the vectors have n elements, we add them in a loop that repeats n times.

Notice that the individual elements of the vectors are indexed with the integer index tid. We chose this name on purpose. We will understand the reason for such a choice soon.
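
For illustration, a minimal host program that calls this CPU version might look as follows (the vector length and the values are arbitrary assumptions; it presumes the vecAdd() function above is defined in the same file):

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    const int n = 8;                        // assumed, small vector length
    float *a = (float *)malloc(n * sizeof(float));
    float *b = (float *)malloc(n * sizeof(float));
    float *c = (float *)malloc(n * sizeof(float));

    // Fill the input vectors with some values
    for (int i = 0; i < n; i++) {
        a[i] = 1.0f;
        b[i] = 2.0f;
    }

    vecAdd(a, b, c, n);                     // sum the vectors on the CPU

    printf("c[0] = %.2f\n", c[0]);          // prints 3.00

    free(a);
    free(b);
    free(c);
    return 0;
}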

A naïve attempt to add up vectors on the GPU

We would now like to run the vecAdd() function on a GPU. We have learnt that GPUs are ideal for performing data-parallel tasks. Vector addition is obviously a data-parallel task since we perform the same operation (summation) over all elements of the vectors. In addition, these operations are independent of each other (we can sum the elements in any order). Hence, we can sum the vectors in parallel.

Kernel function

To execute the vector addition on a GPU, we must write it as a kernel function that runs on the device. Each thread on the GPU will then execute the same kernel function. The main idea is to replace loop iterations with kernel instances executing at each point in the problem domain. For example, we can process vectors with n elements with one kernel instance per element (one thread per element), i.e., with n threads. The CUDA kernel is a code sequence that will be executed by every single thread running on the GPU.

We therefore need to divide the work done by the vecAdd() function as evenly as possible between the threads. In our case, this is a fairly simple task since each thread adds only two corresponding elements. We will thus run as many threads on the GPU as there are elements in the vectors; in our case, that is n threads. The global index of each thread running on the GPU is uniquely determined from the grid and corresponds to the index of the vector element that the thread adds. The kernel that all threads will run on the GPU is shown in the code below:

// CUDA kernel. Each thread takes care of one element of c
__global__ void vecAdd(const float *a,
                        const float *b,
                        float *c,
                        const int n)
{
    // Get global thread ID
    int tid = blockIdx.x * blockDim.x + threadIdx.x;

    // Make sure we do not go out of bounds
    if (tid < n)
        c[tid] = a[tid] + b[tid];
}

We can see that the kernel is very similar in structure to the C function, but it has the qualifier __global__. This qualifier alerts the compiler that a function is to be compiled to run on a GPU device, and it will be invoked from the host. The arguments are passed to a kernel as they are passed to any C function.

Variables defined within the kernel (e.g., tid) do not need to be specified as device variables because they are assumed to reside on the device. The pointers a, b and c must point to the device memory address space. We will allocate space for the vectors in the device memory and transfer the vectors a and b from the host memory to the device memory before the kernel invocation. The last argument, n, however, is not explicitly transferred to the device in the host code. Since in C/C++ function arguments are passed by value by default, the CUDA runtime can automatically handle the transfer of such scalar values to the device. This feature of the CUDA Runtime API makes launching kernels on the GPU very natural and easy: it is almost the same as calling a C function.

Suppose we want each thread to process the elements of the vectors at a particular index. In that case, we need a means of distinguishing and identifying each thread. We are already familiar with the CUDA variables blockDim, blockIdx, and threadIdx. The predefined variable blockDim contains the dimensions of each thread block, as specified in the second execution configuration parameter of the kernel launch. The predefined variables threadIdx and blockIdx contain the index of the thread within its thread block and the index of the thread block within the grid, respectively. The expression:

int tid = blockIdx.x * blockDim.x + threadIdx.x;

generates the global index of a thread, which is used to access the corresponding elements of the arrays. We did not use it in this example, but recall that there is also the predefined variable gridDim; it contains the dimensions of the grid, as specified in the first execution configuration parameter of the launch. Note also that the kernel function does not contain the while loop: each thread that executes the kernel takes care of only one element of the vectors. Before the index tid is used to access the vector elements, its value is checked against the number of elements, n, to ensure there are no out-of-bounds memory accesses. This check is required when the number of elements in the vectors is not evenly divisible by the thread block size and, as a result, the number of threads launched by the kernel is larger than the array size.
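
As a concrete illustration, the following host-only sketch mimics the index computation for assumed example values (128 threads per block, the third block in the grid, the sixth thread within that block):

#include <stdio.h>

int main(void)
{
    // Assumed example values, mirroring the CUDA built-in variables
    int blockDim_x  = 128;   // threads per block (second execution configuration parameter)
    int blockIdx_x  = 2;     // third thread block in the grid
    int threadIdx_x = 5;     // sixth thread within that block

    int tid = blockIdx_x * blockDim_x + threadIdx_x;
    printf("tid = %d\n", tid);   // prints 261: this thread computes c[261] = a[261] + b[261]
    return 0;
}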

Host program

The host application runs on a user’s computer (the host) and dispatches kernels to connected devices. The host application can be coded in C or C++. The host program for a heterogeneous system must carry out the following steps:

  • Allocate memory for each vector in the host memory.
  • Allocate memory for each vector in the device memory.
  • Initialize vectors a and b in the host memory.
  • Copy vectors a and b from the host to the device.
  • Define the grid.
  • Execute the kernel.
  • Copy the vector c from the device back to the host.

The listing below shows the host code present in the main function. In the paragraphs that follow, we explain each step within the host code and briefly describe the CUDA API function used to accomplish each step.

#include <stdio.h>
#include <stdlib.h>

#include <cuda.h>
#include <cuda_runtime.h>
#include "helper_cuda.h"

int main(int argc, char *argv[])
{
    // Size of vectors
    int n = 1024;
    size_t datasize = sizeof(float) * n;

    // Host input vectors
    float *h_a;
    float *h_b;
    // Host output vector
    float *h_c;

    // Device input vectors
    float *d_a;
    float *d_b;
    // Device output vector
    float *d_c;

    // Allocate memory for each vector on host
    h_a = (float *)malloc(datasize);
    h_b = (float *)malloc(datasize);
    h_c = (float *)malloc(datasize);

    // Allocate memory for each vector on GPU
    checkCudaErrors(cudaMalloc(&d_a, datasize));
    checkCudaErrors(cudaMalloc(&d_b, datasize));
    checkCudaErrors(cudaMalloc(&d_c, datasize));

    // Initialize vectors on host
    for (int i = 0; i < n; i++)
    {
        h_a[i] = 1.0;
        h_b[i] = 1.0;
    }

    // Copy host vectors to device
    checkCudaErrors(cudaMemcpy(d_a, h_a, datasize, cudaMemcpyHostToDevice));
    checkCudaErrors(cudaMemcpy(d_b, h_b, datasize, cudaMemcpyHostToDevice));

    // Define thread grid
    const size_t blockSize = 128;
    const size_t gridSize = (n + blockSize - 1) / blockSize;

    // Execute the kernel
    vecAdd<<<gridSize, blockSize>>>(d_a, d_b, d_c, n);
    getLastCudaError("vecAdd() execution failed\n");

    // Copy array back to host
    checkCudaErrors(cudaMemcpy(h_c, d_c, datasize, cudaMemcpyDeviceToHost));

    // Check the result
    float result = 0.0;
    for (int i = 0; i < n; i++)
        result += h_c[i];
    printf("Result = %.2f \n", result);

    // Release device memory
    checkCudaErrors(cudaFree(d_a));
    checkCudaErrors(cudaFree(d_b));
    checkCudaErrors(cudaFree(d_c));

    // Release host memory
    free(h_a);
    free(h_b);
    free(h_c);

    return 0;
}
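
Every CUDA runtime call in the listing is wrapped in checkCudaErrors, and the kernel launch is followed by getLastCudaError; both come from the helper_cuda.h header shipped with the NVIDIA CUDA samples. If that header is not at hand, a minimal stand-in for checkCudaErrors could look like the sketch below (an assumption for convenience, not the actual NVIDIA helper):

// Minimal stand-in for checkCudaErrors (a sketch, not the NVIDIA helper):
// abort with a readable message if a CUDA runtime call returns an error.
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

#define checkCudaErrors(call)                                           \
    do {                                                                \
        cudaError_t err__ = (call);                                     \
        if (err__ != cudaSuccess) {                                     \
            fprintf(stderr, "CUDA error '%s' at %s:%d\n",               \
                    cudaGetErrorString(err__), __FILE__, __LINE__);     \
            exit(EXIT_FAILURE);                                         \
        }                                                               \
    } while (0)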

The main function declares the vector pointers used to reference the vectors in the host and device memory:

    // Host input vectors
    float *h_a;
    float *h_b;
    // Host output vector
    float *h_c;

    // Device input vectors
    float *d_a;
    float *d_b;
    // Device output vector
    float *d_c;

With the following code, the host allocates memory for each host vector:

    // Allocate memory for each vector on host
    h_a = (float *)malloc(datasize);
    h_b = (float *)malloc(datasize);
    h_c = (float *)malloc(datasize);

Then, the host allocates memory on the GPU using cudaMalloc:

    // Allocate memory for each vector on GPU
    checkCudaErrors(cudaMalloc(&d_a, datasize));
    checkCudaErrors(cudaMalloc(&d_b, datasize));
    checkCudaErrors(cudaMalloc(&d_c, datasize));

The pointers h_a, h_b and h_c point to the host memory, allocated with malloc in the typical fashion, and the pointers d_a, d_b and d_c point to device memory allocated with the cudaMalloc function from the CUDA runtime API. The host and device in CUDA have separate memory spaces, both of which can be managed from the host code (CUDA C kernels can also allocate device memory on devices that support it). The host now initializes the vectors h_a and h_b in the host memory:

    // Initialize vectors on host
    for (int i = 0; i < n; i++)
    {
        h_a[i] = 1.0;
        h_b[i] = 1.0;
    }

After the initialization in the host memory, we simply copy the data of the vectors h_a and h_b to the corresponding device vectors d_a and d_b using cudaMemcpy. Its fourth argument specifies the direction of the copy. In this case, we use cudaMemcpyHostToDevice to specify that the first (destination) argument is a device pointer and the second (source) argument is a host pointer:

    // Copy host vectors to device
    checkCudaErrors(cudaMemcpy(d_a, h_a, datasize, cudaMemcpyHostToDevice));
    checkCudaErrors(cudaMemcpy(d_b, h_b, datasize, cudaMemcpyHostToDevice));

One of the most common mistakes made by those learning to program in CUDA C is to dereference pointers from the wrong memory space. The device pointers should not be dereferenced in the host code, as the memory allocated on the GPU is not directly addressable from the CPU. Nevertheless, the most recent GPU architectures support so-called Unified Memory, which lets you access both CPU and GPU memory through a single pointer.
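
As a brief sketch (not part of the workshop code), the input vector a could be placed in Unified Memory with cudaMallocManaged and then touched from both sides; the snippet below reuses n and datasize from the listing and assumes a device that supports managed memory:

// Unified Memory sketch: a single pointer visible to both host and device
float *a;
checkCudaErrors(cudaMallocManaged(&a, datasize));

// Initialize directly on the host, no explicit cudaMemcpy needed
for (int i = 0; i < n; i++)
    a[i] = 1.0f;

// ... launch kernels that read or write a[] ...

checkCudaErrors(cudaDeviceSynchronize());   // wait for the GPU before the host reads a[]
checkCudaErrors(cudaFree(a));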

When a kernel function is launched from the host side, execution is moved to a device where many threads are generated, and each thread executes the statements specified by the kernel function. Knowing how to organize threads is a critical part of CUDA programming. Prior to launching the kernel, we must set up the grid (i.e., we define the block dimension and the grid dimension):

    // Define thread grid
    const size_t blockSize = 128;
    const size_t gridSize = (n + blockSize - 1) / blockSize;

Finally, we invoke the kernel:

    // Execute the kernel
    vecAdd<<<gridSize, blockSize>>>(d_a, d_b, d_c, n);

The information between the triple angle brackets is the execution configuration, which dictates how many device threads execute the kernel in parallel and how these threads are organized into blocks and a grid. The first argument in the execution configuration specifies the number of thread blocks in the grid (the grid dimension), and the second specifies the number of threads in a thread block (the block dimension). In our case, we launch the kernel with thread blocks of 128 threads and (n + blockSize - 1) / blockSize thread blocks, i.e., the number of blocks needed to cover all n elements of the vectors, rounded up. When the number of elements in the vectors is not evenly divisible by the thread block size, this rounding launches more threads than there are elements, so the kernel code must check for out-of-bounds memory accesses, as we have already seen in the kernel code.
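
A small host-only sketch illustrates this arithmetic for an assumed vector length that is not a multiple of the block size:

#include <stdio.h>

int main(void)
{
    // Assumed example: 1000 elements, 128 threads per block
    int n = 1000;
    int blockSize = 128;
    int gridSize = (n + blockSize - 1) / blockSize;   // rounds up to 8 blocks
    int launched = gridSize * blockSize;              // 1024 threads in total

    // The 24 surplus threads get tid >= n and fail the (tid < n) check in the kernel
    printf("blocks = %d, threads launched = %d, idle threads = %d\n",
           gridSize, launched, launched - n);
    return 0;
}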

Upon kernel completion, we transfer the result back to the host memory and free all the allocated memory:

    // Copy array back to host
    checkCudaErrors(cudaMemcpy(h_c, d_c, datasize, cudaMemcpyDeviceToHost));

    // Check the result
    float result = 0.0;
    for (int i = 0; i < n; i++)
        result += h_c[i];
    printf("Result = %.2f \n", result);

    // Release device memory
    checkCudaErrors(cudaFree(d_a));
    checkCudaErrors(cudaFree(d_b));
    checkCudaErrors(cudaFree(d_c));

    // Release host memory
    free(h_a);
    free(h_b);
    free(h_c);

The above code is published in folder 02-vec-add-short of the workshop's repo.

Sum of arbitrarily long vectors

While studying the execution model of the GPU, we noted that the hardware limits the number of blocks in a single launch and the number of threads per block with which we can launch a kernel. Each SM has a limited number of registers and a limited amount of shared memory. Usually, no more than 16 thread blocks can run simultaneously on a single SM of the Kepler microarchitecture, and up to 8 thread blocks can run simultaneously on a single SM of the Fermi microarchitecture. There is also a limit on the number of active warps on a single SM (64 on Kepler, 48 on Fermi). We usually want to keep as many warps active as possible because this affects occupancy. Occupancy is the ratio of active warps on an SM to the maximum number of warps the SM allows. By keeping the occupancy high, we can hide latency when executing instructions; recall that running other warps while one warp is paused is the only way to hide latency and keep the hardware busy.

So how would we use this thread-based approach to add two arbitrarily long vectors? We will have to use a combination of threads and blocks to accomplish this. Fortunately, the solution is extremely simple. We first make a change to our kernel:

// CUDA kernel. Each thread takes care of one or more elements of c
__global__ void vecAdd( const float *a, 
                        const float *b, 
                        float *c, 
                        const int n)
{
    // Get global thread ID
    int tid = blockIdx.x * blockDim.x + threadIdx.x;

    // Make sure we do not go out of bounds
    while (tid < n){
        c[tid] = a[tid] + b[tid];
        tid += gridDim.x * blockDim.x; // Increase index by total number of threads in grid
    }
}

The kernel now looks similar to our original C version of the vector addition function vecAdd(). We again use a while loop to iterate through the data. Rather than incrementing tid by 1, however, each thread increments it by the total number of threads we launch. We want each thread to start at a different data index, so we use the well-known indexing to obtain the global index of each thread:

    int tid = blockIdx.x * blockDim.x + threadIdx.x;

After each thread finishes its work at the current global index, we increment tid by the total number of threads in the grid. This number is obtained simply as the product of the grid dimension and the block dimension:

    tid += gridDim.x * blockDim.x;

You may now be wondering why a single thread does not simply add up 2 or 4 adjacent pairs of elements instead. The reason lies in how the threads access memory. Recall that memory is accessed in segments: two adjacent threads in a warp must access two adjacent memory words to ensure memory coalescing.
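
On the host side, this grid-stride kernel can be launched with a grid that is much smaller than the number of elements; each thread then strides over the rest of the vector. Only the grid definition and the launch change with respect to the previous host program; the concrete sizes below are assumptions for illustration, not values taken from the repo:

// Grid-stride launch (sketch): a fixed, modest grid regardless of the vector length n
const size_t blockSize = 128;
const size_t gridSize = 256;    // capped number of blocks; each thread loops over the rest

vecAdd<<<gridSize, blockSize>>>(d_a, d_b, d_c, n);
getLastCudaError("vecAdd() execution failed\n");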

The above code is published in folder 03-vec-add-arb of the workshop's repo.


  1. © Patricio Bulić, University of Ljubljana, Faculty of Computer and Information Science. The material is published under license CC BY-NC-SA 4.0