RidgeRun CUDA Optimisation Guide/Empirical Experiments/Multi-threaded bounding test
Introduction
This page is a follow-up to the CUDA Memory Benchmark; it adds multithreading to the testing of the different memory management modes. The test set is reduced to traditional, managed, and page-locked memory (with and without an explicit copy call), plus CUDA mapped memory.
Testing Setup
Memory Management Methods
The program tested had the option to use each of the following memory management configurations:
- Traditional mode, using malloc to reserve the memory on the host, then cudaMalloc to reserve it on the device, and cudaMemcpy to move the data between them. Internally, the driver allocates a non-pageable staging buffer, copies the data there, and only then transfers it for use on the device.
- Managed, using cudaMallocManaged, which avoids manual copies and having to handle two different pointers.
- Non-paging memory, using cudaMallocHost: a chunk of page-locked memory can be reserved that the device can use directly, since it is non-pageable.
- Non-paging memory with discrete copy, using cudaMallocHost plus an explicit call to cudaMemcpy. This is similar to the traditional model, with separate host and device pointers, but according to the NVIDIA documentation for cudaMallocHost, cudaMemcpy calls are accelerated when using this type of memory.
- Zero-copy memory, using cudaHostAlloc to reserve memory that is page-locked and directly accessible to the device. Different flags can change the properties of the memory; in this case, the flags used were cudaHostAllocMapped and cudaHostAllocWriteCombined.
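The five configurations above can be sketched roughly as follows. This is an assumption-level sketch, not the benchmark's actual code: the buffer size N, the missing kernel launches, and the omitted error checking are all simplifications.

```cpp
// Sketch of the five allocation modes tested (illustrative only).
#include <cuda_runtime.h>
#include <cstdlib>

#define N (1 << 20)

void traditional() {
    float *h = (float *)malloc(N * sizeof(float));  // pageable host memory
    float *d;
    cudaMalloc(&d, N * sizeof(float));              // device memory
    cudaMemcpy(d, h, N * sizeof(float), cudaMemcpyHostToDevice); // staged copy
    // ... launch kernel on d ...
    cudaFree(d); free(h);
}

void managed() {
    float *m;
    cudaMallocManaged(&m, N * sizeof(float));       // one pointer, migrated on demand
    // ... fill m on host, launch kernel on m directly ...
    cudaFree(m);
}

void pinned_direct() {
    float *p;
    cudaMallocHost(&p, N * sizeof(float));          // page-locked host memory
    // ... launch kernel on p directly, no explicit copy ...
    cudaFreeHost(p);
}

void pinned_with_copy() {
    float *p, *d;
    cudaMallocHost(&p, N * sizeof(float));          // page-locked host buffer
    cudaMalloc(&d, N * sizeof(float));
    cudaMemcpy(d, p, N * sizeof(float), cudaMemcpyHostToDevice); // accelerated copy
    // ... launch kernel on d ...
    cudaFree(d); cudaFreeHost(p);
}

void zero_copy() {
    float *h, *d;
    cudaHostAlloc(&h, N * sizeof(float),
                  cudaHostAllocMapped | cudaHostAllocWriteCombined);
    cudaHostGetDevicePointer(&d, h, 0);             // device alias of the host buffer
    // ... launch kernel on d ...
    cudaFreeHost(h);
}
```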
Platforms
- Discrete GPU: desktop PC with an RTX 2070 Super, using CUDA 12 and Ubuntu 20.04.
- Jetson AGX Orin: using CUDA 11.4 and JetPack 5.0.2.
- Jetson Nano 4GB devkit: using CUDA 10.2 and JetPack 4.6.3.
Program Structure
The program is divided into three main sections: one where the input memory is filled with data, the kernel worker threads, and the verification stage. The verification stage reads all the results and uses assert to check them. Before every test, 10 warm-up iterations of the full process were run to avoid any initialization time penalty. After that, the average of 100 runs was taken. Each of the sections can be seen in Figure 1.
Each kernel block can be seen in Figure 2.