
Nvidia Volta GPUs¹

Let's look at the structure of the Nvidia Quadro GV100 GPU shown in the figure below. It contains 6 GPU Processing Clusters (GPCs), each composed of 14 streaming multiprocessors (SMs) organized in pairs called Texture Processing Clusters (TPCs), that is, 7 TPCs per GPC. Altogether, the Quadro GV100 GPU consists of 6 × 14 = 84 SMs. The Tesla V100 accelerator used in supercomputers consists of 80 SMs.

Tesla GV100 GPU

The figure below shows the structure of the Volta SM. It contains a variety of streaming processors (SPs) or cores: 64 integer cores (INT), 64 single-precision floating-point cores (FP32), 32 double-precision floating-point cores (FP64), 8 tensor cores, and 4 special function units (SFU).

Tesla V100 SM
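The counts quoted above can be combined into chip-wide totals. The following is a minimal sketch of that arithmetic, not vendor code: the `VoltaSm` struct and the helpers `sm_count` and `print_totals` are names made up for this illustration, and the numbers are simply the ones given in the text. It is host-only code, so it compiles with `nvcc` or with any C++ compiler.

```cuda
#include <cstdio>

// Per-SM core counts as quoted in the text; the struct itself is illustrative.
struct VoltaSm {
    int int32_cores  = 64;
    int fp32_cores   = 64;
    int fp64_cores   = 32;
    int tensor_cores = 8;
};

// Number of SMs on a chip: GPCs x TPCs per GPC x SMs per TPC.
constexpr int sm_count(int gpcs, int tpcs_per_gpc, int sms_per_tpc) {
    return gpcs * tpcs_per_gpc * sms_per_tpc;
}

// Print chip-wide core totals for a given number of SMs.
void print_totals(const char* name, int sms, const VoltaSm& sm) {
    std::printf("%s: %d SMs, %d FP32, %d FP64, %d INT32, %d tensor cores\n",
                name, sms, sms * sm.fp32_cores, sms * sm.fp64_cores,
                sms * sm.int32_cores, sms * sm.tensor_cores);
}

int main() {
    // 14 SMs per GPC organized in pairs gives 7 TPCs per GPC.
    constexpr int quadro_sms = sm_count(6, 7, 2);
    static_assert(quadro_sms == 84, "6 GPCs x 7 TPCs x 2 SMs should give 84 SMs");

    VoltaSm sm;
    // Quadro GV100: 84 SMs, 5376 FP32, 2688 FP64, 5376 INT32, 672 tensor cores
    print_totals("Quadro GV100", quadro_sms, sm);
    // Tesla V100: 80 SMs, 5120 FP32, 2560 FP64, 5120 INT32, 640 tensor cores
    print_totals("Tesla V100", 80, sm);
    return 0;
}
```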

Kernel execution

A key component of the CUDA programming model is the kernel: the code that runs on the GPU device. In the kernel, we explicitly specify what each thread does.
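To make the idea of "what each thread does" concrete, here is a minimal sketch of a kernel that adds two vectors, with each thread responsible for one element. The kernel name `vector_add`, the helper `run_vector_add`, the block size of 256 and the use of managed memory are choices made for this illustration, not something prescribed by the text.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Kernel: every thread computes exactly one element of c.
__global__ void vector_add(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n) {                                    // guard the last, partial block
        c[i] = a[i] + b[i];
    }
}

// Illustrative host helper: runs the kernel and checks the result on the CPU.
bool run_vector_add(int n) {
    float *a = nullptr, *b = nullptr, *c = nullptr;
    size_t bytes = n * sizeof(float);
    bool ok = cudaMallocManaged(&a, bytes) == cudaSuccess &&
              cudaMallocManaged(&b, bytes) == cudaSuccess &&
              cudaMallocManaged(&c, bytes) == cudaSuccess;
    if (ok) {
        for (int i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

        const int threads = 256;                         // threads per block (assumed)
        const int blocks = (n + threads - 1) / threads;  // enough blocks to cover n
        vector_add<<<blocks, threads>>>(a, b, c, n);

        ok = cudaGetLastError() == cudaSuccess &&
             cudaDeviceSynchronize() == cudaSuccess;
        for (int i = 0; ok && i < n; ++i) ok = (c[i] == 3.0f);
    }
    cudaFree(a); cudaFree(b); cudaFree(c);  // cudaFree(nullptr) is a no-op
    return ok;
}

int main() {
    std::printf("vector_add %s\n", run_vector_add(1 << 20) ? "succeeded" : "failed");
    return 0;
}
```

The bounds check `i < n` matters because the grid is rounded up to a whole number of blocks, so the last block may contain threads with no element to process.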

When a kernel runs on the GPU, the thread scheduler, labelled GigaThread Engine in the figure, distributes the thread blocks to the streaming multiprocessors. Each streaming multiprocessor then internally schedules the threads of its blocks onto its streaming processors (SPs), or cores.
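The distribution of thread blocks across SMs can be observed with a small experiment. The sketch below has one thread per block read the PTX special register `%smid` and record it; the names `current_sm_id`, `record_sm` and `block_to_sm` are hypothetical helpers for this illustration. Note that the mapping is decided by the hardware scheduler, so the output differs between runs and devices, and `%smid` is only a snapshot taken at the moment it is read.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Read the identifier of the SM the calling thread is currently running on.
__device__ unsigned int current_sm_id() {
    unsigned int id;
    asm volatile("mov.u32 %0, %%smid;" : "=r"(id));
    return id;
}

// The first thread of every block stores the SM identifier of its block.
__global__ void record_sm(unsigned int* sm_of_block) {
    if (threadIdx.x == 0) {
        sm_of_block[blockIdx.x] = current_sm_id();
    }
}

// Illustrative host helper: returns one SM identifier per block,
// or an empty vector if a CUDA call fails.
std::vector<unsigned int> block_to_sm(int blocks, int threads_per_block) {
    std::vector<unsigned int> result;
    unsigned int* d_ids = nullptr;
    size_t bytes = blocks * sizeof(unsigned int);
    if (cudaMalloc(&d_ids, bytes) != cudaSuccess) return result;

    record_sm<<<blocks, threads_per_block>>>(d_ids);

    result.resize(blocks);
    // cudaMemcpy waits for the kernel to finish before copying.
    if (cudaMemcpy(result.data(), d_ids, bytes, cudaMemcpyDeviceToHost) != cudaSuccess) {
        result.clear();
    }
    cudaFree(d_ids);
    return result;
}

int main() {
    std::vector<unsigned int> ids = block_to_sm(16, 64);
    for (size_t b = 0; b < ids.size(); ++b) {
        std::printf("block %2zu ran on SM %u\n", b, ids[b]);
    }
    return ids.empty() ? 1 : 0;
}
```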

The number of threads that the GPU can schedule and execute simultaneously is limited, and each SM can likewise schedule and execute only a limited number of threads. The next section will give a more detailed description of kernel execution, including thread scheduling and execution limitations.
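The actual limits depend on the device and can be queried at run time. As a minimal sketch, the program below reads them from `cudaDeviceProp` for device 0; the helper `max_resident_threads` is a name introduced here for illustration, and it gives only an upper bound, since registers and shared memory used by a kernel can lower the number of threads an SM can hold.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Upper bound on threads resident on the whole GPU at once:
// number of SMs x maximum resident threads per SM.
long long max_resident_threads(const cudaDeviceProp& prop) {
    return static_cast<long long>(prop.multiProcessorCount) *
           prop.maxThreadsPerMultiProcessor;
}

int main() {
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) {
        std::fprintf(stderr, "no usable CUDA device found\n");
        return 1;
    }
    std::printf("device:                 %s\n", prop.name);
    std::printf("SMs:                    %d\n", prop.multiProcessorCount);
    std::printf("warp size:              %d\n", prop.warpSize);
    std::printf("max threads per block:  %d\n", prop.maxThreadsPerBlock);
    std::printf("max threads per SM:     %d\n", prop.maxThreadsPerMultiProcessor);
    std::printf("max resident threads:   %lld\n", max_resident_threads(prop));
    return 0;
}
```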


  1. © Patricio Bulić, University of Ljubljana, Faculty of Computer and Information Science. The material is published under license CC BY-NC-SA 4.0