NVIDIA GPU Execution process





Previous: GPU Architecture Index Next: GPU Architecture/Memory hierarchy





CUDA kernels are executed by threads. Threads are grouped into blocks, and blocks are grouped into grids. Each level of this hierarchy behaves differently, and there are several points to take into account before writing CUDA code.


Thread hierarchy: threads are grouped into blocks, and blocks into grids[1]


A thread is the minimum unit of execution dispatched to a Streaming Multiprocessor (SM). Threads are grouped into blocks, which can be multidimensional and support up to 1024 threads per block. Finally, blocks are arranged in a grid, where the maximum number of blocks is, in practice, almost unlimited.

You can find the limits supported by your GPU with the deviceQuery sample:

/usr/local/cuda/samples/1_Utilities/deviceQuery/deviceQuery

Result:

  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
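
The same limits can also be queried at runtime through the CUDA runtime API. A minimal sketch (device 0 is assumed):

  #include <cstdio>
  #include <cuda_runtime.h>

  int main()
  {
      cudaDeviceProp prop;
      cudaGetDeviceProperties(&prop, 0);   // query device 0 (assumed)

      printf("Warp size:                      %d\n", prop.warpSize);
      printf("Max threads per multiprocessor: %d\n", prop.maxThreadsPerMultiProcessor);
      printf("Max threads per block:          %d\n", prop.maxThreadsPerBlock);
      printf("Max block dimensions (x,y,z):   (%d, %d, %d)\n",
             prop.maxThreadsDim[0], prop.maxThreadsDim[1], prop.maxThreadsDim[2]);
      printf("Max grid dimensions  (x,y,z):   (%d, %d, %d)\n",
             prop.maxGridSize[0], prop.maxGridSize[1], prop.maxGridSize[2]);
      return 0;
  }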

Before writing any CUDA algorithm, make sure to respect these limits. On most GPUs the maximum number of threads per block is 1024, and you can distribute those threads across the block dimensions in several ways (a minimal launch sketch follows the list). For instance:

  • 1D: (1024, 1, 1) # in x
  • 1D: (1, 1024, 1) # in y
  • 2D: (32, 32, 1) # in XY (32 x 32 = 1024)
  • 3D: (8, 8, 8) # XYZ (8 x 8 x 8 = 512)
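
As a minimal sketch of how these dimensions are chosen in practice (the kernel and image size below are illustrative assumptions), a 2D problem is typically covered with a 32 x 32 block and a grid rounded up to span the whole domain:

  #include <cuda_runtime.h>

  // Hypothetical kernel: each thread fills one element of a 2D image.
  __global__ void fill2D(float *data, int width, int height)
  {
      int x = blockIdx.x * blockDim.x + threadIdx.x;
      int y = blockIdx.y * blockDim.y + threadIdx.y;
      if (x < width && y < height) {        // guard against the padded grid
          data[y * width + x] = 1.0f;
      }
  }

  int main()
  {
      const int width = 1920, height = 1080;   // illustrative image size
      float *data = nullptr;
      cudaMalloc(&data, width * height * sizeof(float));

      // 2D block of 32 x 32 = 1024 threads (the per-block limit) and a grid
      // rounded up so that the blocks cover the whole image.
      dim3 block(32, 32, 1);
      dim3 grid((width + block.x - 1) / block.x,
                (height + block.y - 1) / block.y,
                1);
      fill2D<<<grid, block>>>(data, width, height);
      cudaDeviceSynchronize();

      cudaFree(data);
      return 0;
  }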

The criterion for how to dimension blocks is largely free, but it should be based on the following rule:

Performance hint: the block size (number of threads per block) should be a multiple of the warp size to get the best performance.

An important aspect is how threads are scheduled on the SMs. The scheduler takes groups of threads, called warps, and issues them for execution. On current NVIDIA architectures, the warp size is 32 threads.

In the best case, all threads of a warp start and finish at the same time. However, there are ill-conditioned cases where thread execution diverges: a thread is slower, or follows a different execution path than the others. Thread divergence can be worse than you think. The SM executes the warp in converged chunks: on an if condition, it first executes the threads that fall into the TRUE branch and then the threads that fall into the FALSE branch. This situation is illustrated in the following picture:


Thread divergence depicted: the GPU executes the TRUE-branch instructions first and then the FALSE-branch instructions[2]


Performance hint: most thread divergence comes from branching and from loops with a variable number of iterations per thread. It can lead to serious performance degradation.
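
As an illustrative sketch (the kernel and the branch condition are assumptions, not taken from the guide), the following kernel diverges inside every warp because neighboring threads take different branches:

  __global__ void divergent(float *out, int n)
  {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < n) {
          // Even and odd threads of the same warp take different branches, so the
          // SM serializes the two paths: first one branch, then the other.
          if (i % 2 == 0) {
              out[i] = 0.0f;   // taken by half of the warp
          } else {
              out[i] = 1.0f;   // taken by the other half, executed afterwards
          }
      }
  }

A warp-aligned condition such as if ((i / warpSize) % 2 == 0) keeps all threads of a warp on the same path and avoids the serialization.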

Another thing to take into account is that each block is assigned to a single SM and stays there: all the threads of the block execute on that SM. This allows the threads within a block to use the SM's shared memory. However, shared memory does not cross block boundaries: its scope is per block, and threads in two different blocks cannot access the same shared memory.

Performance hint: shared memory is faster to access than global memory because it is closer to the cores.
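
As a minimal sketch (the kernel name, tile size, and 3-point average are illustrative assumptions), threads of a block can stage data in shared memory and synchronize before reusing it:

  #define TILE 256   // threads per block; illustrative choice, launch with blockDim.x == TILE

  // Each block stages a tile of the input in shared memory, then every thread
  // reads its neighbors from the fast on-chip copy instead of global memory.
  __global__ void blurTile(const float *in, float *out, int n)
  {
      __shared__ float tile[TILE];          // visible only to this block

      int i = blockIdx.x * blockDim.x + threadIdx.x;
      tile[threadIdx.x] = (i < n) ? in[i] : 0.0f;
      __syncthreads();                      // wait until the whole tile is loaded

      if (i < n) {
          float left  = (threadIdx.x > 0)        ? tile[threadIdx.x - 1] : tile[threadIdx.x];
          float right = (threadIdx.x < TILE - 1) ? tile[threadIdx.x + 1] : tile[threadIdx.x];
          out[i] = (left + tile[threadIdx.x] + right) / 3.0f;
      }
  }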

Additionally, it is important to know how CUDA cores perform math. Section 5.4 of the CUDA C++ Programming Guide can help you understand the key differences. In summary, there are several ways to perform operations, depending on precision and accuracy; a short sketch comparing the standard and intrinsic variants follows the list. You may find:

  • Standard operations, IEEE 754 compliant: e.g. sinf(x)
  • Intrinsics: they have the same name prefixed with __ (such as __sinf(x)). They are faster because they map to fewer native instructions, but their accuracy may be lower.
  • Integer, half-precision, single-precision, and double-precision operations.
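
A short sketch of the difference (the kernel is an illustrative assumption): the standard sinf() and the intrinsic __sinf() can be used side by side, trading accuracy for speed:

  __global__ void compareSin(const float *x, float *precise, float *fast, int n)
  {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < n) {
          precise[i] = sinf(x[i]);    // standard single-precision function
          fast[i]    = __sinf(x[i]);  // intrinsic: fewer native instructions, lower accuracy
      }
  }

Compiling with nvcc --use_fast_math makes the compiler substitute the intrinsic versions for the standard single-precision functions automatically.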

Some interesting facts to consider are:

  • Multiplications, additions, and fused-multiply add (FMA) performance doubles as the precision is halved. For example, for Compute Capability 7.x, each multiprocessor runs up to 32 simultaneous double-precision operations per clock, 64 for single-precision, and 128 for half-precision.
  • Integer arithmetic often performs similarly to floating point. For example, int32 multiplications, additions, and FMA run at 64 operations per clock per multiprocessor, the same rate as single-precision floating point.
  • Type conversions (casts) have a throughput comparable to arithmetic operations, so be careful with unnecessary casts.
  • Approximations are usually more than enough for non-scientific applications, such as image processing and machine learning.

References

  1. NVIDIA (2017). CUDA C++ Programming Guide. v11.4.2. From: https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#thread-hierarchy
  2. L. Durant, O. Giroux, M. Harris and N. Stam (2017). Inside Volta: The World’s Most Advanced Data Center GPU. From: https://developer.nvidia.com/blog/inside-volta/


Previous: GPU Architecture Index Next: GPU Architecture/Memory hierarchy