The CUDA programming model is a heterogeneous model using both the CPU and GPU. Recall that the host refers to the CPU and its memory, while the device refers to the GPU and its memory. Code run on the host can manage memory on both the host and device and launch kernels, which are functions executed on the device. Many GPU threads then execute these kernels in parallel. Given the heterogeneous nature of the CUDA programming model, a typical sequence of operations for a CUDA C program is:
- Declare and allocate host and device memory.
- Initialize host data.
- Transfer data from the host to the device.
- Execute one or more kernels.
- Transfer results from the device to the host.
Keeping this sequence of operations in mind, let’s look at our first CUDA C example. We start with a simple program that adds the corresponding elements of two arrays (vectors) with iNumElements elements each. We will solve the problem in two ways. First, we will proceed quite naively and focus only on the basics of GPU programming and the steps needed on the host to run our first program on a GPU. Second, we will focus on improving the kernel.
Let us begin with the C code for vector addition intended to run on a single-core CPU. The code below shows the implementation of a function in C for summing two vectors whose elements are floating-point numbers in single-precision:
// add two vectors of iNumElements elements on the CPU, element by element
void VectorAdd(const float *a, const float *b, float *c, int iNumElements)
{
    int tid = 0;

    while (tid < iNumElements)
    {
        // sum the corresponding elements and store the result in c
        c[tid] = a[tid] + b[tid];
        // move to the next element
        tid += 1;
    }
}
The function VectorAdd() takes as arguments the pointers to the two input vectors (a and b), the pointer to the output vector (c), and the number of elements in the vectors (iNumElements). The vectors are summed by adding all the corresponding elements (pair-wise) of the input vectors a and b and saving the result in the corresponding element of the vector c. Since the vectors have iNumElements elements, we add the vectors in a loop that repeats iNumElements times.
Notice that the individual elements of the vectors are indexed with the integer index tid. We chose this name on purpose; the reason for this choice will become clear soon.
A naïve attempt to add up vectors on GPU
We would now like to run the VectorAdd() function on a GPU. We have learnt that GPUs are ideal for performing data-parallel tasks. Vector addition is obviously a data-parallel task, since we perform the same operation (summation) over all elements of the vectors. In addition, these operations are independent of each other (we can sum the elements of the vectors in any order). Hence, we can sum the vectors in parallel.
To execute the vector addition function on a GPU device, we must write it as a kernel function. Each thread on the GPU device will then execute this same kernel function. The main idea is to replace the loop iterations with kernel instances executing at each point in the problem domain. For example, we can process vectors with iNumElements elements with one kernel instance per element (one thread per element), i.e., with iNumElements threads. The CUDA kernel is a code sequence that will be executed by every single thread running on the GPU.
Therefore, we need to divide the work done by the VectorAdd() function as evenly as possible between the threads. In our case, this is a fairly simple task, since each thread adds only two corresponding elements. We will thus run as many threads on the GPU as there are elements in the vectors: in our case, iNumElements threads. The global index of each thread running on the GPU is uniquely determined from the grid and corresponds to the index of the vector elements that the thread adds up. The kernel that all threads will run on the GPU is shown in the code below:
// kernel: each thread adds one pair of corresponding elements
__global__ void VectorAdd(const float *a, const float *b, float *c, int iNumElements)
{
    // global index of the thread in the grid
    int tid = blockIdx.x * blockDim.x + threadIdx.x;

    // guard against out-of-bounds memory accesses
    if (tid < iNumElements)
    {
        c[tid] = a[tid] + b[tid];
    }
}
We can see that the kernel is very similar in structure to the C function, but it has the qualifier
__global__. This qualifier alerts the compiler that a function is to be compiled to run on a GPU device, and it will be invoked from the host. The arguments are passed to a kernel as they are passed to any C function.
Variables defined within the kernel (e.g.,
tid) do not need to be specified as device variables because they are assumed to reside on the device. The pointers a, b, and c must point to the device memory address space. This is true because we will allocate the space for the vectors in the device memory, and the vectors a and b will be transferred from the host memory to the device memory before the kernel invocation. The last argument,
iNumElements, however, is not explicitly transferred to the device in the host code. Since in C/C++ function arguments are passed by value by default, the CUDA runtime can automatically handle the transfer of these values to the device. This feature of the CUDA Runtime API makes launching kernels on the GPU very natural and easy: it is almost the same as calling a C function.
Suppose we want each thread to process the elements of the vectors with the same index. In that case, we need a means of distinguishing and identifying each thread. We are already familiar with the predefined CUDA variables threadIdx and blockIdx, which contain the index of the thread within its thread block and the index of the thread block within the grid, respectively. The predefined variable blockDim contains the dimensions of each thread block, as specified in the second execution configuration parameter of the kernel launch. The expression:
int tid = blockIdx.x * blockDim.x + threadIdx.x;
generates a global index of a thread that is used to access corresponding elements of the arrays. We didn’t use it in this example but recall that there is also the predefined variable
gridDim. It contains the dimensions of the grid as specified in the first execution configuration parameter to the launch. Note also that the kernel function does not contain the while loop. This is because each thread that executes the kernel will take care of only one element in the vector. Before the index
tid is used to access vector elements, its value is checked against the number of elements,
iNumElements, to ensure there are no out-of-bounds memory accesses. This check is required for cases where the number of elements in the vectors is not evenly divisible by the thread block size, and as a result, the number of threads launched by the kernel is larger than the array size.
The host application runs on a user’s computer (the host) and dispatches kernels to connected devices. The host application can be coded in C or C++. The host program for a heterogeneous system must carry out the following steps:
- Allocate memory for each vector in the host memory.
- Allocate memory for each vector in the device memory.
- Initialize vectors a and b in the host memory.
- Copy vectors a and b from the host to the device.
- Define the grid.
- Execute the kernel.
- Copy the vector c from the device back to the host.
The listing below shows the host code that is implemented in the main function. In the paragraphs that follow, we explain each step within the host code and briefly describe the CUDA API function used to accomplish each step.
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

// kernel declaration
__global__ void VectorAdd(const float *a, const float *b, float *c, int iNumElements);

int main(void)
{
    // number of elements in the vectors
    int iNumElements = 1024;

    // host pointers
    float *h_a;
    float *h_b;
    float *h_c;

    // device pointers
    float *d_a;
    float *d_b;
    float *d_c;

    // allocate memory for each vector in the host memory
    h_a = (float *)malloc(iNumElements * sizeof(float));
    h_b = (float *)malloc(iNumElements * sizeof(float));
    h_c = (float *)malloc(iNumElements * sizeof(float));

    // allocate memory for each vector in the device memory
    cudaMalloc((void **)&d_a, iNumElements * sizeof(float));
    cudaMalloc((void **)&d_b, iNumElements * sizeof(float));
    cudaMalloc((void **)&d_c, iNumElements * sizeof(float));

    // initialize the vectors a and b in the host memory
    for (int i = 0; i < iNumElements; i++)
    {
        h_a[i] = (float)i;
        h_b[i] = (float)i;
    }

    // copy the vectors a and b from the host to the device
    cudaMemcpy(d_a, h_a, iNumElements * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, iNumElements * sizeof(float), cudaMemcpyHostToDevice);

    // define the grid
    int blockSize = 128;
    int gridSize = (iNumElements + blockSize - 1) / blockSize;

    // execute the kernel
    VectorAdd<<<gridSize, blockSize>>>(d_a, d_b, d_c, iNumElements);

    // copy the vector c from the device back to the host
    cudaMemcpy(h_c, d_c, iNumElements * sizeof(float), cudaMemcpyDeviceToHost);

    // free device memory
    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_c);

    // free host memory
    free(h_a);
    free(h_b);
    free(h_c);

    return 0;
}
With the following code, the host allocates memory for each host vector:
// allocate memory for each vector in the host memory
h_a = (float *)malloc(iNumElements * sizeof(float));
h_b = (float *)malloc(iNumElements * sizeof(float));
h_c = (float *)malloc(iNumElements * sizeof(float));
Then, the host allocates memory on the GPU using cudaMalloc:

// allocate memory for each vector in the device memory
cudaMalloc((void **)&d_a, iNumElements * sizeof(float));
cudaMalloc((void **)&d_b, iNumElements * sizeof(float));
cudaMalloc((void **)&d_c, iNumElements * sizeof(float));
The pointers h_a, h_b, and h_c point to the host memory, allocated with malloc in the typical fashion, and the pointers d_a, d_b, and d_c point to device memory allocated with the cudaMalloc function from the CUDA runtime API. The host and device in CUDA have separate memory spaces, both of which can be managed from the host code (CUDA C kernels can also allocate device memory on devices that support it).
The host now initializes the vectors h_a and h_b in the host memory:

// initialize the vectors a and b in the host memory
for (int i = 0; i < iNumElements; i++)
{
    h_a[i] = (float)i;
    h_b[i] = (float)i;
}
After the initialization in the host memory, we simply copy the data of the vectors h_a and h_b to the corresponding device vectors d_a and d_b using the function cudaMemcpy. Its fourth argument specifies the direction of the copy. In this case, we use cudaMemcpyHostToDevice to specify that the first (destination) argument is a device pointer and the second (source) argument is a host pointer:
// copy the vectors a and b from the host to the device
cudaMemcpy(d_a, h_a, iNumElements * sizeof(float), cudaMemcpyHostToDevice);
cudaMemcpy(d_b, h_b, iNumElements * sizeof(float), cudaMemcpyHostToDevice);
When a kernel function is launched from the host side, execution is moved to a device where many threads are generated, and each thread executes the statements specified by the kernel function. Knowing how to organize threads is a critical part of CUDA programming. Prior to launching the kernel, we must set up the grid (i.e., we define the block dimension and the grid dimension):
// define the grid
int blockSize = 128;
int gridSize = (iNumElements + blockSize - 1) / blockSize;
Finally, we invoke the kernel:

VectorAdd<<<gridSize, blockSize>>>(d_a, d_b, d_c, iNumElements);
The information between the triple angle brackets is the execution configuration, which dictates how many device threads execute the kernel in parallel and how these threads are organized into blocks and a grid. The first argument in the execution configuration specifies the number of thread blocks in the grid (the grid dimension), and the second specifies the number of threads in a thread block (the block dimension). In our case, we launch the kernel with thread blocks containing blockSize = 128 threads each and with a grid of (iNumElements + blockSize - 1)/blockSize thread blocks, which is the number of thread blocks required to process all iNumElements elements of the vectors. When the number of elements in the vectors is not evenly divisible by the thread block size, the grid dimension is rounded up, more threads are launched than there are elements, and the kernel code must check for out-of-bounds memory accesses, as we have already seen in the kernel code.
Upon kernel completion, we transfer the result back to the host memory and free all allocated memory:

// copy the vector c from the device back to the host
cudaMemcpy(h_c, d_c, iNumElements * sizeof(float), cudaMemcpyDeviceToHost);

// free device memory
cudaFree(d_a);
cudaFree(d_b);
cudaFree(d_c);

// free host memory
free(h_a);
free(h_b);
free(h_c);
Sum of arbitrarily long vectors
While studying the execution model of a GPU, we noted that the hardware limits the number of blocks in a single launch and the number of threads per block with which we can launch a kernel. Each SM has a limited number of registers and a limited amount of shared memory. Usually, no more than 16 thread blocks can run simultaneously on a single SM with the Kepler microarchitecture, and up to 8 thread blocks can run simultaneously on a single SM with the Fermi microarchitecture. There is also a limit on the number of active warps on a single SM (64 on Kepler, 48 on Fermi). We usually want to keep as many warps active as possible, because this affects occupancy. Occupancy is the ratio of active warps on an SM to the maximum number of allowed warps. By keeping the occupancy high, we can hide latency when executing instructions. Recall that running other warps while one warp is paused is the only way to hide latency and keep the hardware busy.

So how would we use a thread-based approach to add two arbitrarily long vectors? We will have to use a combination of threads and blocks to accomplish this. Fortunately, the solution to this issue is extremely simple. We first make a change to our kernel:
// kernel: each thread processes every (blockDim.x * gridDim.x)-th element
__global__ void VectorAdd(const float *a, const float *b, float *c, int iNumElements)
{
    // global index of the thread in the grid
    int tid = blockIdx.x * blockDim.x + threadIdx.x;

    while (tid < iNumElements)
    {
        c[tid] = a[tid] + b[tid];
        // shift the index by the total number of threads in the grid
        tid += blockDim.x * gridDim.x;
    }
}
The kernel now looks similar to our original C version of the vector addition function, VectorAdd(). We again use a while loop to iterate through the data, but rather than incrementing tid by 1, each thread on the many-core GPU device increments tid by the total number of threads we launch. We want each thread to start at a different data index, so we use the well-known expression to obtain the global index of each thread:

int tid = blockIdx.x * blockDim.x + threadIdx.x;
After each thread finishes its work at the current global index, we increment tid by the total number of threads in the grid. This number is obtained simply as the product of the grid dimension and the block dimension:

tid += blockDim.x * gridDim.x;
You may now be wondering why a single thread does not simply add up two or four adjacent pairs of elements instead. The reason lies in how the threads access memory. Recall that memory is accessed in segments: two adjacent threads in a warp must access two adjacent memory words to ensure memory coalescing.