Nvidia Volta GPUs
Let's look at the structure of the Nvidia Quadro GV100 GPU shown in the figure below. It contains 6 GPU Processing Clusters (GPCs), each composed of 14 streaming multiprocessors (SMs) organized in pairs called Texture Processing Clusters (TPCs). Altogether, the Quadro GV100 GPU consists of 6 × 14 = 84 SMs. The Tesla V100 accelerator used in supercomputers consists of 80 SMs.
The figure below shows the structure of the Volta SM. It contains a variety of streaming processors (SPs) or cores: 64 integer cores (INT), 64 single-precision floating-point cores (FP32), 32 double-precision floating-point cores (FP64), 8 tensor cores, and 4 special function units (SFU).
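To relate these numbers to a real device, we can ask the CUDA runtime how many SMs it reports and multiply by the per-SM core counts. The following is a minimal sketch, not a definitive tool: it assumes device 0 is the GPU of interest, and the per-SM constants are simply copied from the description above, since the runtime API does not report core counts.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    // Assumption: the GPU we want to inspect is device 0.
    cudaError_t err = cudaGetDeviceProperties(&prop, 0);
    if (err != cudaSuccess) {
        std::fprintf(stderr, "cudaGetDeviceProperties failed: %s\n",
                     cudaGetErrorString(err));
        return 1;
    }

    std::printf("Device: %s (compute capability %d.%d)\n",
                prop.name, prop.major, prop.minor);
    std::printf("SMs: %d\n", prop.multiProcessorCount);

    // Per-SM counts for Volta, taken from the text; hard-coded because
    // the runtime does not expose them.
    const int fp32_per_sm   = 64;
    const int fp64_per_sm   = 32;
    const int tensor_per_sm = 8;

    // Compute capability 7.0 identifies Volta parts such as the V100.
    if (prop.major == 7 && prop.minor == 0) {
        std::printf("FP32 cores:   %d\n", prop.multiProcessorCount * fp32_per_sm);
        std::printf("FP64 cores:   %d\n", prop.multiProcessorCount * fp64_per_sm);
        std::printf("Tensor cores: %d\n", prop.multiProcessorCount * tensor_per_sm);
    } else {
        std::printf("Not a compute capability 7.0 device; "
                    "the Volta per-SM counts do not apply.\n");
    }
    return 0;
}
```

For a device with 80 SMs, these constants give 80 × 64 = 5120 FP32 cores.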
A key component of the CUDA programming model is the kernel: the code that runs on the GPU device. In the kernel, we must explicitly specify what each thread does.
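As an illustration of what "each thread does", here is a small vector-addition sketch. The kernel name vector_add, the array size, and the block size of 256 threads are illustrative choices, and unified memory (cudaMallocManaged) is used only to keep the example short.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Kernel: every thread computes exactly one element of c.
__global__ void vector_add(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n) {                                    // skip threads past the end
        c[i] = a[i] + b[i];
    }
}

int main() {
    const int n = 1 << 20;  // illustrative problem size
    float *a, *b, *c;

    // Unified memory is accessible from both host and device.
    cudaMallocManaged(&a, n * sizeof(float));
    cudaMallocManaged(&b, n * sizeof(float));
    cudaMallocManaged(&c, n * sizeof(float));

    for (int i = 0; i < n; ++i) {
        a[i] = 1.0f;
        b[i] = 2.0f;
    }

    const int threads_per_block = 256;  // illustrative block size
    const int blocks = (n + threads_per_block - 1) / threads_per_block;  // round up

    vector_add<<<blocks, threads_per_block>>>(a, b, c, n);

    // Wait for the kernel to finish and check for errors.
    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        std::fprintf(stderr, "kernel failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // Verify the result on the host.
    int mismatches = 0;
    for (int i = 0; i < n; ++i) {
        if (c[i] != 3.0f) ++mismatches;
    }
    std::printf("mismatches: %d\n", mismatches);

    cudaFree(a);
    cudaFree(b);
    cudaFree(c);
    return mismatches == 0 ? 0 : 1;
}
```

Note that the kernel contains no loop over the array: the index computation tells each thread which single element it is responsible for, and the launch configuration determines how many threads exist.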
When a kernel runs on the GPU, the thread scheduler (labelled GigaThread Engine in the figure) distributes the thread blocks among the streaming multiprocessors. Each streaming multiprocessor then internally schedules the threads of its blocks onto its streaming processors (SPs), or cores.
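We can observe this block-to-SM assignment, at least roughly, by having one thread of every block record the ID of the SM it runs on. The sketch below reads the PTX special register %smid through inline assembly; the helper sm_id, the kernel record_sm, and the launch sizes are hypothetical names and values chosen for illustration. The mapping is decided by the hardware and can differ from run to run, so the output is only a snapshot.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Returns the ID of the SM on which the calling thread is executing,
// read from the PTX special register %smid.
__device__ unsigned int sm_id() {
    unsigned int id;
    asm volatile("mov.u32 %0, %%smid;" : "=r"(id));
    return id;
}

// One thread per block records where its block was placed.
__global__ void record_sm(unsigned int* block_to_sm) {
    if (threadIdx.x == 0) {
        block_to_sm[blockIdx.x] = sm_id();
    }
}

int main() {
    const int blocks  = 16;  // illustrative launch configuration
    const int threads = 64;

    unsigned int* block_to_sm;
    cudaMallocManaged(&block_to_sm, blocks * sizeof(unsigned int));

    record_sm<<<blocks, threads>>>(block_to_sm);

    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        std::fprintf(stderr, "kernel failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    for (int b = 0; b < blocks; ++b) {
        std::printf("block %2d ran on SM %u\n", b, block_to_sm[b]);
    }

    cudaFree(block_to_sm);
    return 0;
}
```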
The number of threads that the GPU can schedule and execute simultaneously is limited, and so is the number of threads that each SM can schedule and execute. The next section will give a more detailed description of kernel execution, including thread scheduling and execution limitations.
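The concrete limits depend on the device and can be read from the CUDA runtime. The following sketch, which again assumes that device 0 is the GPU of interest, prints the per-block and per-SM thread limits and derives the device-wide number of resident threads from them.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    // Assumption: the GPU we want to inspect is device 0.
    cudaError_t err = cudaGetDeviceProperties(&prop, 0);
    if (err != cudaSuccess) {
        std::fprintf(stderr, "cudaGetDeviceProperties failed: %s\n",
                     cudaGetErrorString(err));
        return 1;
    }

    std::printf("Device: %s\n", prop.name);
    std::printf("SMs:                       %d\n", prop.multiProcessorCount);
    std::printf("Max threads per block:     %d\n", prop.maxThreadsPerBlock);
    std::printf("Max resident threads / SM: %d\n", prop.maxThreadsPerMultiProcessor);

    // Upper bound on threads resident on the whole GPU at one time.
    const int device_limit =
        prop.multiProcessorCount * prop.maxThreadsPerMultiProcessor;
    std::printf("Max resident threads / GPU: %d\n", device_limit);
    return 0;
}
```

A kernel may be launched with far more threads than this bound; the excess thread blocks simply wait until an SM has room for them.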