Device information¹

Runtime information

We can get basic information about the devices installed in the system by running the NVIDIA System Management Interface program:

nvidia-smi --query

The output reveals the basic information of attached devices like name, brand, bus information, utilization, memory usage, clock rates, and power consumption.

Output of nvidia-smi --query:

==============NVSMI LOG==============

Timestamp                                 : Sat Jun 18 10:30:37 2022
Driver Version                            : 510.39.01
CUDA Version                              : 11.6

Attached GPUs                             : 1
GPU 00000000:81:00.0
    Product Name                          : Tesla V100S-PCIE-32GB
    Product Brand                         : Tesla
    Product Architecture                  : Volta
    Display Mode                          : Enabled
    Display Active                        : Disabled
    Persistence Mode                      : Disabled
    MIG Mode
        Current                           : N/A
        Pending                           : N/A
    Accounting Mode                       : Disabled
    Accounting Mode Buffer Size           : 4000
    Driver Model
        Current                           : N/A
        Pending                           : N/A
    Serial Number                         : 1562820002759
    GPU UUID                              : GPU-57b2d021-f0e4-5d7f-b433-671b628da8cc
    Minor Number                          : 0
    VBIOS Version                         : 88.00.98.00.01
    MultiGPU Board                        : No
    Board ID                              : 0x8100
    GPU Part Number                       : 900-2G500-0440-030
    Module ID                             : 0
    Inforom Version
        Image Version                     : G500.0212.00.02
        OEM Object                        : 1.1
        ECC Object                        : 5.0
        Power Management Object           : N/A
    GPU Operation Mode
        Current                           : N/A
        Pending                           : N/A
    GSP Firmware Version                  : N/A
    GPU Virtualization Mode
        Virtualization Mode               : None
        Host VGPU Mode                    : N/A
    IBMNPU
        Relaxed Ordering Mode             : N/A
    PCI
        Bus                               : 0x81
        Device                            : 0x00
        Domain                            : 0x0000
        Device Id                         : 0x1DF610DE
        Bus Id                            : 00000000:81:00.0
        Sub System Id                     : 0x13D610DE
        GPU Link Info
            PCIe Generation
                Max                       : 3
                Current                   : 3
            Link Width
                Max                       : 16x
                Current                   : 16x
        Bridge Chip
            Type                          : N/A
            Firmware                      : N/A
        Replays Since Reset               : 0
        Replay Number Rollovers           : 0
        Tx Throughput                     : 0 KB/s
        Rx Throughput                     : 0 KB/s
    Fan Speed                             : N/A
    Performance State                     : P0
    Clocks Throttle Reasons
        Idle                              : Not Active
        Applications Clocks Setting       : Not Active
        SW Power Cap                      : Not Active
        HW Slowdown                       : Not Active
            HW Thermal Slowdown           : Not Active
            HW Power Brake Slowdown       : Not Active
        Sync Boost                        : Not Active
        SW Thermal Slowdown               : Not Active
        Display Clock Setting             : Not Active
    FB Memory Usage
        Total                             : 32768 MiB
        Reserved                          : 257 MiB
        Used                              : 0 MiB
        Free                              : 32510 MiB
    BAR1 Memory Usage
        Total                             : 32768 MiB
        Used                              : 2 MiB
        Free                              : 32766 MiB
    Compute Mode                          : Default
    Utilization
        Gpu                               : 4 %
        Memory                            : 0 %
        Encoder                           : 0 %
        Decoder                           : 0 %
    Encoder Stats
        Active Sessions                   : 0
        Average FPS                       : 0
        Average Latency                   : 0
    FBC Stats
        Active Sessions                   : 0
        Average FPS                       : 0
        Average Latency                   : 0
    Ecc Mode
        Current                           : Enabled
        Pending                           : Enabled
    ECC Errors
        Volatile
            Single Bit            
                Device Memory             : 0
                Register File             : 0
                L1 Cache                  : 0
                L2 Cache                  : 0
                Texture Memory            : N/A
                Texture Shared            : N/A
                CBU                       : N/A
                Total                     : 0
            Double Bit            
                Device Memory             : 0
                Register File             : 0
                L1 Cache                  : 0
                L2 Cache                  : 0
                Texture Memory            : N/A
                Texture Shared            : N/A
                CBU                       : 0
                Total                     : 0
        Aggregate
            Single Bit            
                Device Memory             : 0
                Register File             : 0
                L1 Cache                  : 0
                L2 Cache                  : 0
                Texture Memory            : N/A
                Texture Shared            : N/A
                CBU                       : N/A
                Total                     : 0
            Double Bit            
                Device Memory             : 0
                Register File             : 0
                L1 Cache                  : 0
                L2 Cache                  : 0
                Texture Memory            : N/A
                Texture Shared            : N/A
                CBU                       : 0
                Total                     : 0
    Retired Pages
        Single Bit ECC                    : 0
        Double Bit ECC                    : 0
        Pending Page Blacklist            : No
    Remapped Rows                         : N/A
    Temperature
        GPU Current Temp                  : 32 C
        GPU Shutdown Temp                 : 90 C
        GPU Slowdown Temp                 : 87 C
        GPU Max Operating Temp            : 83 C
        GPU Target Temperature            : N/A
        Memory Current Temp               : 29 C
        Memory Max Operating Temp         : 85 C
    Power Readings
        Power Management                  : Supported
        Power Draw                        : 37.29 W
        Power Limit                       : 250.00 W
        Default Power Limit               : 250.00 W
        Enforced Power Limit              : 250.00 W
        Min Power Limit                   : 100.00 W
        Max Power Limit                   : 250.00 W
    Clocks
        Graphics                          : 1245 MHz
        SM                                : 1245 MHz
        Memory                            : 1107 MHz
        Video                             : 1132 MHz
    Applications Clocks
        Graphics                          : 1245 MHz
        Memory                            : 1107 MHz
    Default Applications Clocks
        Graphics                          : 1245 MHz
        Memory                            : 1107 MHz
    Max Clocks
        Graphics                          : 1597 MHz
        SM                                : 1597 MHz
        Memory                            : 1107 MHz
        Video                             : 1432 MHz
    Max Customer Boost Clocks
        Graphics                          : 1597 MHz
    Clock Policy
        Auto Boost                        : N/A
        Auto Boost Default                : N/A
    Voltage
        Graphics                          : N/A
    Processes                             : None

Device properties

The CUDA runtime API provides many functions for managing devices. For example, the following two functions let us query information about GPU devices:

cudaError_t cudaGetDeviceProperties(cudaDeviceProp* prop, int device)

which fills a cudaDeviceProp structure with the properties of the selected device, and

cudaError_t cudaDeviceGetAttribute(int* value, cudaDeviceAttr attr, int device)

which returns the value of a single attribute of the selected device.

A description of both functions, their arguments, and the associated data structures can be found online in the CUDA Toolkit Documentation.

The code below retrieves the basic device information:

#include <stdio.h>
#include <stdlib.h>  // exit, EXIT_FAILURE
#include <cuda.h>
#include <cuda_runtime.h>

#include "helper_cuda.h"

int main(int argc, char **argv) {

  // Get number of GPUs
  int deviceCount = 0;
  cudaError_t error = cudaGetDeviceCount(&deviceCount);

  if (error != cudaSuccess) {
    printf("cudaGetDeviceCount error %d\n-> %s\n", error, cudaGetErrorString(error));
    exit(EXIT_FAILURE);
  }

  // Get device properties and print them
  for (int dev = 0; dev < deviceCount; dev++) {
    struct cudaDeviceProp prop;
    int value;
    printf("\n ==========  cudaDeviceGetProperties ============  \n\n");
    cudaGetDeviceProperties(&prop, dev);
    printf("\nDevice %d: \"%s\"\n", dev, prop.name);
    printf("  GPU Clock Rate (MHz):                          %d\n", prop.clockRate/1000);
    printf("  Memory Clock Rate (MHz):                       %d\n", prop.memoryClockRate/1000);
    printf("  Memory Bus Width (bits):                       %d\n", prop.memoryBusWidth);
    printf("  Peak Memory Bandwidth (GB/s):                  %.2f\n", 2.0*prop.memoryClockRate*(prop.memoryBusWidth/8)/1.0e6);
    printf("  CUDA Cores/MP:                                 %d\n", _ConvertSMVer2Cores(prop.major, prop.minor));
    printf("  CUDA Cores:                                    %d\n", _ConvertSMVer2Cores(prop.major, prop.minor) *
           prop.multiProcessorCount);
    printf("  Total amount of global memory:                 %.0f GB\n", prop.totalGlobalMem / 1073741824.0f);
    printf("  Total amount of shared memory per block:       %zu kB\n",
           prop.sharedMemPerBlock/1024);
    printf("  Total number of registers available per block: %d\n",
           prop.regsPerBlock);
    printf("  Warp size:                                     %d\n",
           prop.warpSize);
    printf("  Maximum number of threads per block:           %d\n",
           prop.maxThreadsPerBlock);
    printf("  Max dimension size of a thread block (x,y,z): (%d, %d, %d)\n",
           prop.maxThreadsDim[0], prop.maxThreadsDim[1],
           prop.maxThreadsDim[2]);
    printf("  Max dimension size of a grid size    (x,y,z): (%d, %d, %d)\n",
           prop.maxGridSize[0], prop.maxGridSize[1],
           prop.maxGridSize[2]);

    printf("\n\n\n ==========  cudaDeviceGetAttribute ============  \n\n");
    printf("\nDevice %d: \"%s\"\n", dev, prop.name);
    cudaDeviceGetAttribute (&value, cudaDevAttrMaxThreadsPerBlock, dev);
    printf("  Max number of threads per block:              %d\n",
           value);
    cudaDeviceGetAttribute (&value, cudaDevAttrMaxBlockDimX, dev);
    printf("  Max block dimension X:                        %d\n",
           value);
    cudaDeviceGetAttribute (&value, cudaDevAttrMaxBlockDimY, dev);
    printf("  Max block dimension Y:                        %d\n",
           value);
    cudaDeviceGetAttribute (&value, cudaDevAttrMaxBlockDimZ, dev);
    printf("  Max block dimension Z:                        %d\n",
           value);
    cudaDeviceGetAttribute (&value, cudaDevAttrMaxGridDimX, dev);
    printf("  Max grid dimension X:                         %d\n",
           value);
    cudaDeviceGetAttribute (&value, cudaDevAttrMaxGridDimY, dev);
    printf("  Max grid dimension Y:                         %d\n",
           value);
    cudaDeviceGetAttribute (&value, cudaDevAttrMaxGridDimZ, dev);
    printf("  Max grid dimension Z:                         %d\n",
           value);
    cudaDeviceGetAttribute (&value, cudaDevAttrMaxSharedMemoryPerBlock, dev);
    printf("  Max shared memory per block:                  %d\n",
           value);
    cudaDeviceGetAttribute (&value, cudaDevAttrWarpSize, dev);
    printf("  Warp size:                                    %d\n",
           value);      
    cudaDeviceGetAttribute (&value, cudaDevAttrClockRate, dev);
    printf("  Peak clock frequency in kilohertz:            %d\n",
           value);
    cudaDeviceGetAttribute (&value, cudaDevAttrMemoryClockRate, dev);
    printf("  Peak memory clock frequency in kilohertz:     %d\n",
           value);
    cudaDeviceGetAttribute (&value, cudaDevAttrGlobalMemoryBusWidth, dev);
    printf("  Global memory bus width in bits:              %d\n",
           value);
    cudaDeviceGetAttribute (&value, cudaDevAttrL2CacheSize, dev);
    printf("  Size of L2 cache in bytes:                    %d\n",
           value);
    cudaDeviceGetAttribute (&value, cudaDevAttrMaxThreadsPerMultiProcessor, dev);
    printf("  Maximum resident threads per SM:              %d\n",
           value);
    cudaDeviceGetAttribute (&value, cudaDevAttrComputeCapabilityMajor, dev);
    printf("  Major compute capability version number:      %d\n",
           value);
    cudaDeviceGetAttribute (&value, cudaDevAttrComputeCapabilityMinor, dev);
    printf("  Minor compute capability version number:      %d\n",
           value);
    cudaDeviceGetAttribute (&value, cudaDevAttrMaxSharedMemoryPerMultiprocessor, dev);
    printf("  Max shared memory per SM in bytes:            %d\n",
           value);
    cudaDeviceGetAttribute (&value, cudaDevAttrMaxRegistersPerMultiprocessor, dev);
    printf("  Max number of 32-bit registers per SM:        %d\n",
           value);  
    cudaDeviceGetAttribute (&value, cudaDevAttrMaxSharedMemoryPerBlockOptin, dev);
    printf("  Max per block shmem size on the device:       %d\n",
           value);  
    cudaDeviceGetAttribute (&value, cudaDevAttrMaxBlocksPerMultiprocessor, dev);
    printf("  Max thread blocks that can reside on a SM:    %d\n",
           value);  
  }
}

The output:

==========  cudaDeviceGetProperties ============  

Device 0: "Tesla V100S-PCIE-32GB"
  GPU Clock Rate (MHz):                          1597
  Memory Clock Rate (MHz):                       1107
  Memory Bus Width (bits):                       4096
  Peak Memory Bandwidth (GB/s):                  1133.57
  CUDA Cores/MP:                                 64
  CUDA Cores:                                    5120
  Total amount of global memory:                 32 GB
  Total amount of shared memory per block:       48 kB
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)


==========  cudaDeviceGetAttribute ============  

Device 0: "Tesla V100S-PCIE-32GB"
  Max number of threads per block:              1024
  Max block dimension X:                        1024
  Max block dimension Y:                        1024
  Max block dimension Z:                        64
  Max grid dimension X:                         2147483647
  Max grid dimension Y:                         65535
  Max grid dimension Z:                         65535
  Max shared memory per block:                  49152
  Warp size:                                    32
  Peak clock frequency in kilohertz:            1597000
  Peak memory clock frequency in kilohertz:     1107000
  Global memory bus width in bits:              4096
  Size of L2 cache in bytes:                    6291456
  Maximum resident threads per SM:              2048
  Major compute capability version number:      7
  Minor compute capability version number:      0
  Max shared memory per SM in bytes:            98304
  Max number of 32-bit registers per SM:        65536
  Max per block shmem size on the device:       98304
  Max thread blocks that can reside on a SM:    32

We can see that the Tesla V100S GPU has 5120 CUDA cores: 64 cores in each of its 80 streaming multiprocessors (SMs). The query also lists the available global memory, shared memory, and registers. Although the maximum number of threads in a grid is very high, it is still limited, and programs must not exceed the reported limits.

The device query program above is published in the folder 01-discover-devices of the workshop's repository.

¹ © Patricio Bulić, University of Ljubljana, Faculty of Computer and Information Science. The material is published under license CC BY-NC-SA 4.0.
