NVIDIA CUDA Memory Management
Introduction to CUDA memory management
This wiki is intended as a brief summary of the CUDA memory management programming paradigm, especially for the Jetson TX2 and Xavier boards. This page describes zero-copy memory and unified memory programming and their applications, to be used as a reference for further work.
Zero-Copy Memory (ZC)
Definition and implications
The CUDA Toolkit documentation defines it as “a feature that (...) enables GPU threads to directly access host memory (CPU)”. In this programming model, the CPU and the GPU use pinned memory (i.e., the same physical memory). Pinned memory is “non-pageable”, and for CUDA 8.x and below the shared memory region is not kept coherent through the caches. In this non-coherent scheme, pages are not cached and every access by the GPU (device) uses system memory directly (skipping the cache), causing higher latency and bandwidth usage. On integrated GPUs, mapped pinned memory is nevertheless beneficial because “it avoids superfluous copies as integrated GPU and CPU memory are physically the same”.
For CUDA 9.x and above, however, cached zero-copy memory is enabled through the CPU cache, when supported by the hardware (see the NVIDIA Jetson AGX Xavier section below).
Coding Example
The following is a coding example using zero-copy memory [1]:
// Set flag to enable zero-copy access
cudaSetDeviceFlags(cudaDeviceMapHost);

// Host arrays (CPU pointers)
float* h_in = NULL;
float* h_out = NULL;

// Allocate host memory using CUDA allocation calls
cudaHostAlloc((void **)&h_in, sizeIn, cudaHostAllocMapped);
cudaHostAlloc((void **)&h_out, sizeOut, cudaHostAllocMapped);

// Fill h_in with the data to process

// Device arrays (GPU pointers)
float *d_out, *d_in;

// Get device pointers from host memory. No allocation or memcpy
cudaHostGetDevicePointer((void **)&d_in, (void *)h_in, 0);
cudaHostGetDevicePointer((void **)&d_out, (void *)h_out, 0);

// Launch the GPU kernel
kernel<<<blocks, threads>>>(d_out, d_in);

// No need to copy d_out back
// Continue processing on host using h_out
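Before setting the map-host flag, it is good practice to confirm that the device actually supports mapped pinned memory. The following is a minimal sketch (not part of the cited example) that queries the device properties with cudaGetDeviceProperties and only enables zero-copy when the canMapHostMemory property is reported:

#include <cstdio>
#include <cuda_runtime.h>

int main(void)
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // Query device 0

    if (!prop.canMapHostMemory) {
        // The GPU cannot map pinned host memory into its address space
        printf("Zero-copy (mapped pinned memory) is not supported on this device\n");
        return -1;
    }

    // Set the flag before any kernel launch or device memory allocation
    cudaSetDeviceFlags(cudaDeviceMapHost);

    printf("Zero-copy enabled on %s\n", prop.name);
    return 0;
}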
Unified Memory Programming (UM)
Definition and implications
The CUDA Toolkit documentation defines it as “a component of the CUDA programming model (...) that defines a managed memory space in which all processors see a single coherent memory image with a common address space”. There is also an explicit note defining a processor as an independent execution unit with a dedicated Memory Management Unit (MMU), which includes both the CPU and the GPU. Fig. 1 illustrates this model.
This means that in a UM environment, the CPU and the GPU share a coherent “virtual” memory region with a common address space. It is worth noting that even though the memory region is shared through the address space (i.e., a single pointer refers to the same virtual region for both CPU and GPU), this does NOT mean that the same physical memory is shared: a copy of the shared data is still needed at some point. Moreover, as stated in the CUDA documentation, UM eliminates the need for explicit data copy routines (cudaMemcpy*), while avoiding the performance loss produced by placing all data into zero-copy memory (explained above). Nonetheless, “data movement, of course, still takes place, so a program’s run time typically does not decrease”. As the CUDA documentation points out, the benefit of UM programming is not to reduce execution time or increase performance, but to enable “the writing of simpler and more maintainable code”.
One key aspect of UM programming is that the shared memory region is kept coherent in an implicit and transparent manner: data is migrated towards the processing unit that accesses it when a page fault occurs (this is called “on-demand paging”). In this scheme, when a shared page is modified by either the CPU or the GPU, only that page (and not the whole memory region) is copied to the other processing unit’s memory space, reducing the amount of data copied when only a small portion of the shared data is modified. This process can be seen in Fig. 2. As the figure shows, as data modification increases, more copies are required, decreasing the system’s performance.
Coding Example
The following is a CUDA programming example using UM [2]:
#include <iostream>
#include <math.h>

// CUDA kernel to add elements of two arrays
__global__ void add(int n, float *x, float *y)
{
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = blockDim.x * gridDim.x;
    for (int i = index; i < n; i += stride)
        y[i] = x[i] + y[i];
}

int main(void)
{
    int N = 1 << 20;
    float *x, *y;

    // Allocate Unified Memory -- pointers accessible from CPU or GPU
    cudaMallocManaged(&x, N * sizeof(float));
    cudaMallocManaged(&y, N * sizeof(float));

    // Initialize x and y arrays on the host (CPU)
    for (int i = 0; i < N; i++) {
        x[i] = 1.0f;
        y[i] = 2.0f;
    }

    // Launch kernel on 1M elements on the GPU
    int blockSize = 256;
    int numBlocks = (N + blockSize - 1) / blockSize;
    add<<<numBlocks, blockSize>>>(N, x, y);

    // Wait for GPU to finish before accessing on host
    cudaDeviceSynchronize();

    // Check for errors (all values should be 3.0f)
    float maxError = 0.0f;
    for (int i = 0; i < N; i++)
        maxError = fmax(maxError, fabs(y[i] - 3.0f));
    std::cout << "Max error: " << maxError << std::endl;

    // Free memory
    cudaFree(x);
    cudaFree(y);

    return 0;
}
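Building on the example above, fault-driven migration can be reduced by prefetching managed data to the processor that will use it next. The following is a minimal sketch (not taken from the CUDA documentation) using cudaMemPrefetchAsync; it assumes the device reports the concurrentManagedAccess attribute, which is checked first since not all Jetson boards support it:

#include <cuda_runtime.h>

// Hypothetical helper: prefetch a managed buffer before the GPU touches it,
// so migration happens in bulk instead of one page fault at a time.
void prefetch_if_supported(void *managed_ptr, size_t bytes, int device)
{
    int concurrent = 0;
    cudaDeviceGetAttribute(&concurrent, cudaDevAttrConcurrentManagedAccess, device);

    if (concurrent) {
        // Move the pages to the GPU ahead of the kernel launch
        cudaMemPrefetchAsync(managed_ptr, bytes, device, 0);
    }
    // If the attribute is 0 (as on some Jetson boards), rely on on-demand paging
}

// Usage sketch (after cudaMallocManaged and host initialization):
//   prefetch_if_supported(x, N * sizeof(float), 0);                  // to GPU 0
//   add<<<numBlocks, blockSize>>>(N, x, y);
//   cudaMemPrefetchAsync(y, N * sizeof(float), cudaCpuDeviceId, 0);  // back to CPU
//   cudaDeviceSynchronize();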
NVIDIA Jetson TX2
As shown in the Jetson Tegra architecture diagram (Fig. 3), in the Jetson TX2 the processing units (GPU and CPUs) share the same physical memory, which is accessed through the shared memory controller and the system MMU.
The Jetson TX2 uses the Pascal GPU architecture. Although this architecture supports hardware cache coherence, the feature only applies to the GPU's internal caches; there is no hardware support for coherence between the CPU caches and the GPU cache. This means that true coherence cannot be achieved with zero-copy memory on this board, since there is no hardware mechanism to ensure it. Moreover, zero-copy memory programming on this architecture produces higher latencies and more bandwidth usage, since every shared memory access by the device (GPU) is not cached and must be served from main memory.
In this scenario, Unified Memory programming is recommended, since only the modified pages are copied to or from the GPU. For data-intensive problems, nonetheless, UM performance approaches that of zero-copy, since more and more pages end up being copied between the processing units.
NVIDIA Jetson AGX Xavier
Starting with CUDA 9.x, and from Xavier onwards, cache coherence between the CPUs and the GPU is maintained in hardware through the host (CPU) cache. This means that, on Xavier, zero-copy memory programming can be used to share physical memory between the processing units without the previous disadvantage of bypassing the cache, reducing latency overhead and bandwidth usage. Fig. 4 shows this new feature.
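As a practical illustration, the choice between zero-copy and unified memory can be made at run time from the device properties. The sketch below is an assumption-laden example (not from the CUDA documentation): it treats an integrated GPU of compute capability 7.2 or higher (Xavier's Volta GPU) as a hint that cached zero-copy is available, and falls back to managed memory otherwise.

#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical policy helper: decide between zero-copy and unified memory.
// The compute-capability threshold (7.2, i.e. Xavier) is an assumption used
// here only to illustrate the idea; adjust it for your own platform.
bool prefer_zero_copy(int device)
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, device);

    bool integrated   = prop.integrated != 0;        // CPU and GPU share physical memory
    bool can_map_host = prop.canMapHostMemory != 0;  // mapped pinned memory supported
    bool xavier_class = (prop.major > 7) ||
                        (prop.major == 7 && prop.minor >= 2);

    return integrated && can_map_host && xavier_class;
}

int main(void)
{
    if (prefer_zero_copy(0)) {
        printf("Using zero-copy (cudaHostAlloc + cudaHostGetDevicePointer)\n");
    } else {
        printf("Using unified memory (cudaMallocManaged)\n");
    }
    return 0;
}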
See also
- CUDA Memory Management Benchmark
- CUDA toolkit documentation
- S8868 - CUDA on Xavier
- Unified Memory for Beginners
- CUDA 8 and Beyond presentation. GPU Technology Conference, 2016
- CUDACast #18: Unified Memory