NVIDIA GPU Memory hierarchy

From RidgeRun Developer Wiki



Previous: GPU Architecture/Execution process Index Next: GPU Architecture/Communication & Concurrency






The memory hierarchy of NVIDIA GPUs is worth understanding in detail. From fastest to slowest (smallest to largest):

  • CUDA Core Registers
  • L1 Cache
  • Shared Memory
  • L2 Cache
  • Global memory
  • Host memory
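A small kernel makes the hierarchy concrete. The sketch below is a standard block-wide sum reduction (names are illustrative, not from the original page): each thread reads global memory into a register, the block cooperates through shared memory, and one thread writes the partial result back to global memory.

```cuda
// Minimal sketch of three levels of the hierarchy:
// registers, shared memory, and global memory.
__global__ void block_sum(const float *in, float *out, int n)
{
    __shared__ float tile[256];          // shared memory: fast, per-block

    int tid = threadIdx.x;               // tid and i live in registers
    int i   = blockIdx.x * blockDim.x + tid;

    // Each thread loads one element from global memory (cached in
    // L2 and possibly L1 on the way) and stages it in shared memory.
    tile[tid] = (i < n) ? in[i] : 0.0f;
    __syncthreads();

    // Tree reduction entirely inside shared memory.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride)
            tile[tid] += tile[tid + stride];
        __syncthreads();
    }

    // One thread per block writes the block's sum back to global memory.
    if (tid == 0)
        out[blockIdx.x] = tile[0];
}
```

The pattern works because each level is used for what it is best at: registers for per-thread scalars, shared memory for intra-block communication, and global memory only for the initial load and final store.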

This hierarchy excludes the constant memory segment, which is nevertheless optimised to be fast because of caching. One important takeaway from the CUDA programming guide is the following:

  • The constant memory segment can be cached.
  • However, if a warp makes a request to __constant__ memory where different threads in the warp access different locations, those requests will be serialised.
  • The best performance is achieved when all threads within the warp access the same constant value.
Performance hints: Constant memory makes sense when data is reused. It is a good place for LUTs, provided the threads within a warp tend to retrieve the same entries.
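As a sketch of the LUT idea (kernel and symbol names are illustrative): the table is declared with __constant__, filled once from the host with cudaMemcpyToSymbol, and then read by every thread.

```cuda
// Illustrative 256-entry lookup table in constant memory.
__constant__ float lut[256];

__global__ void apply_lut(const unsigned char *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = lut[in[i]];   // fastest when threads in a warp hit the
                               // same index; divergent indices serialise
}

// Host side: copy the table once before any kernel launch.
// float host_lut[256] = { /* ... */ };
// cudaMemcpyToSymbol(lut, host_lut, sizeof(host_lut));
```

Note the caveat in the comment: if the input bytes vary wildly within a warp, the constant cache broadcast advantage is lost and the accesses are serialised, as the programming guide warns above.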

An important point to talk about is the different types of memory. You may find the following concepts:

  • Non-pinned memory / pageable memory: corresponds to normal host memory. It is organised in pages, fixed-size segments of memory that the operating system can relocate or swap out, which avoids fragmentation under dynamic allocation. As a consequence, the memory may not be physically contiguous.
  • Pinned memory / non-pageable memory: a chunk of host memory locked in physical RAM (and often physically contiguous). It is optimal for communication between devices and the CPU, since the DMA engine can transfer it directly without splitting the copy into many transactions.
  • Unified Memory Addressing / Managed Memory: a chunk of memory whose pointer is accessible from both the host and the device. Under the hood, there may be two chunks of memory (one on the host and one on the device) that are kept coherent by the UVA system.
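The managed-memory case can be sketched as a full program (the scale kernel is a hypothetical example, not from the original page): one pointer from cudaMallocManaged is dereferenced by both the host loop and the kernel, with a synchronisation point before the host reads the results back.

```cuda
#include <cstdio>

__global__ void scale(float *data, int n, float factor)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main()
{
    const int n = 1 << 20;
    float *data;                                  // single pointer, valid on host and device
    cudaMallocManaged(&data, n * sizeof(float));  // managed (unified) allocation

    for (int i = 0; i < n; ++i)                   // host writes through the same pointer
        data[i] = 1.0f;

    scale<<<(n + 255) / 256, 256>>>(data, n, 2.0f);
    cudaDeviceSynchronize();                      // required before the host touches data again

    printf("data[0] = %f\n", data[0]);
    cudaFree(data);
    return 0;
}
```

The convenience has a cost: the coherence machinery migrates pages on demand, so explicit copies with pinned buffers are usually faster when the access pattern is known in advance.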
Performance hints: Use pinned memory whenever possible (mainly for big memory chunks) to exploit transfer optimisations. Using non-pinned memory can lead to serious performance degradation, because transfers from pageable memory are scattered across pages and must be staged before the DMA engine can move them.
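A minimal sketch of the pinned-memory hint (buffer names and the 256 MiB size are illustrative): allocate the host buffer with cudaMallocHost instead of malloc, so the DMA engine can read it directly, and free it with cudaFreeHost.

```cuda
int main()
{
    const size_t bytes = 256 << 20;    // 256 MiB: large transfers benefit most
    float *h_pinned, *d_buf;

    cudaMallocHost(&h_pinned, bytes);  // pinned (page-locked) host allocation
    cudaMalloc(&d_buf, bytes);

    // The DMA engine reads page-locked memory directly, avoiding the
    // staging copy that a pageable (malloc'd) buffer would require.
    cudaMemcpy(d_buf, h_pinned, bytes, cudaMemcpyHostToDevice);

    // Pinned memory also enables genuinely asynchronous copies on streams,
    // so transfers can overlap with kernel execution.
    cudaStream_t s;
    cudaStreamCreate(&s);
    cudaMemcpyAsync(d_buf, h_pinned, bytes, cudaMemcpyHostToDevice, s);
    cudaStreamSynchronize(s);

    cudaStreamDestroy(s);
    cudaFree(d_buf);
    cudaFreeHost(h_pinned);            // pinned memory is freed with cudaFreeHost, not free()
    return 0;
}
```

Pinning is not free: page-locked memory is removed from the pool the OS can page out, so reserving very large pinned regions can degrade overall system performance. That is why it pays off mainly for big, frequently transferred buffers.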

