NVIDIA GPU Architecture
This guide focuses on NVIDIA devices, given that they are CUDA-capable. It is very important to understand the architecture before even trying to optimize a piece of code: optimizing without knowing the architecture is like walking blindly in the middle of nowhere.
As a quick start, the GPU contains the following components:
- Streaming Multiprocessors (SM): processing units with multiple execution units. An SM can be seen as a CPU "core" with multiple ALUs (CUDA cores), or as a vector processor. Each SM receives work in the form of warps of threads, and each thread executes on a CUDA core. Threads are also grouped into blocks, which are assigned to an SM for execution.
- Thread: the smallest unit of work that a CUDA core can execute.
- Block (of threads): a group of threads, which can be arranged in 1D, 2D, or 3D. The scheduler assigns blocks to SMs for execution.
- Warp (of threads): a group of threads taken for execution in an SM cycle, usually 32 threads that execute at the same time. It can be seen as a subset of the threads of a block.
- Grid (of threads): a group of blocks, which can be arranged in 1D, 2D, or 3D (see the kernel sketch after this list).
- CUDA core: a core with limited capabilities, mainly targeted at the execution of simple operations. It can be seen as an ALU on steroids.
- Shared Memory / L1 Cache: in the CUDA context, "shared memory" refers to memory that resides within each SM; it is not shared with other SMs. Since each block is assigned to a single SM, threads of different blocks cannot access the same shared memory space.
- Read-only Memory: an efficient block of memory for constants or pieces of data whose access is read-only. It can be seen as `const` in C++.
- Warp scheduler: the unit in charge of receiving blocks of threads and partitioning them into warps.
- L2 Cache: a cache shared amongst the SMs. It accelerates access to global memory, speeding up applications.
- Global Memory: the memory that resides on the GPU. It is usually called VRAM or GPU-dedicated RAM.
- Constant Memory: a memory region for storing constant values, such as Look-Up Tables (LUTs). A usage sketch appears at the end of this section.
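As a minimal sketch of how this hierarchy maps to code (the kernel, sizes, and names below are illustrative, not taken from this guide), a kernel is launched over a grid of blocks, and each thread derives its global index from blockIdx, blockDim, and threadIdx:

```cpp
#include <cuda_runtime.h>

// Each thread handles one element. Its global index combines the block
// it belongs to (blockIdx), the block size (blockDim), and its position
// within the block (threadIdx).
__global__ void scaleKernel(float *data, float factor, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {               // guard: the grid may overshoot n
        data[idx] *= factor;
    }
}

int main() {
    const int n = 1 << 20;
    float *d_data = nullptr;
    cudaMalloc(&d_data, n * sizeof(float));

    // 256 threads per block (8 warps of 32 threads each), and enough
    // blocks in the grid to cover all n elements.
    const int threadsPerBlock = 256;
    const int numBlocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    scaleKernel<<<numBlocks, threadsPerBlock>>>(d_data, 2.0f, n);

    cudaDeviceSynchronize();
    cudaFree(d_data);
    return 0;
}
```

Here the warp scheduler of each SM splits every 256-thread block into 8 warps and issues them to the CUDA cores.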
We leave the texture processing system out of the equation, since we are more focused on computation.
Within this architecture, there are several points to take into account when optimizing, especially the execution modes and the memory hierarchy.
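To make the memory hierarchy concrete, here is a minimal sketch (the kernel, the 256-thread block size, and the d_scale symbol are our own illustration, not from this guide) of a per-block sum reduction that stages data in the SM's shared memory and reads a parameter from constant memory:

```cpp
#include <cuda_runtime.h>

// Hypothetical constant-memory symbol: read-only for kernels, uploaded
// once from the host with cudaMemcpyToSymbol().
__constant__ float d_scale;

// Per-block sum reduction. The partial sums live in shared memory,
// which resides on the SM running the block; threads of other blocks
// cannot read this tile. Assumes blockDim.x == 256 (a power of two).
__global__ void blockSum(const float *in, float *blockSums, int n) {
    __shared__ float tile[256];            // one slot per thread
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = (idx < n) ? in[idx] * d_scale : 0.0f;
    __syncthreads();                       // all loads done before reducing

    // Tree reduction: halve the number of active threads each step.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (threadIdx.x < stride) {
            tile[threadIdx.x] += tile[threadIdx.x + stride];
        }
        __syncthreads();
    }
    if (threadIdx.x == 0) {
        blockSums[blockIdx.x] = tile[0];   // one partial sum per block
    }
}
```

On the host, the constant would be uploaded once with cudaMemcpyToSymbol(d_scale, &scale, sizeof(float)); the per-block partial sums can then be reduced again on the GPU or summed on the CPU.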