Optimisation Recipes for CUDA Optimisation

From RidgeRun Developer Wiki









Currently, optimisations are based on heuristics. They must also be prioritised, since some optimisations have a greater impact than others. This section roughly describes some of them as a first approach to the CUDA optimisation world. They should apply to other fields as well.

Depending on what bounds your application, there are several ways to proceed:

1. GPU bound

This happens when GPU utilisation exceeds 80-90%, depending on the slack you need. The GPU's compute resources are heavily exploited and the algorithm has a high arithmetic intensity. In other words, the algorithm is hitting the compute roof. The most common GPU-bound optimisations are covered in the following sections.

2. Memory bound

This happens when GPU utilisation is low but the wall time is still high, typically together with a low L1 hit rate and a low number of compute operations per byte. Usually, the data access patterns are poorly suited to the GPU memory system. The corresponding optimisations are covered in the following sections.
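As a rough illustration of the "operations per byte" criterion that separates the GPU-bound and memory-bound cases, the sketch below applies the standard roofline model. The function names and the device figures (1000 GFLOP/s peak, 100 GB/s bandwidth) are hypothetical placeholders, not values of any real GPU; substitute the numbers for your own device and kernel.

```python
def arithmetic_intensity(flops, bytes_moved):
    """FLOPs performed per byte moved to or from device memory."""
    return flops / bytes_moved


def attainable_gflops(intensity, peak_gflops, bandwidth_gbs):
    """Roofline bound: the lower of the compute roof and the memory roof."""
    return min(peak_gflops, intensity * bandwidth_gbs)


def limiting_resource(intensity, peak_gflops, bandwidth_gbs):
    """Compare the intensity with the ridge point where both roofs meet."""
    ridge = peak_gflops / bandwidth_gbs  # FLOP/byte
    return "compute" if intensity >= ridge else "memory"


# Hypothetical device figures (assumptions, for illustration only).
PEAK_GFLOPS = 1000.0
BANDWIDTH_GBS = 100.0

# SAXPY on float32: 2 FLOPs per element (multiply and add),
# 12 bytes per element (read x, read y, write y).
saxpy = arithmetic_intensity(flops=2, bytes_moved=12)
print(limiting_resource(saxpy, PEAK_GFLOPS, BANDWIDTH_GBS))
print(attainable_gflops(saxpy, PEAK_GFLOPS, BANDWIDTH_GBS))
```

With these placeholder figures the ridge point sits at 10 FLOP/byte, so a kernel such as SAXPY, at roughly 0.17 FLOP/byte, lands on the memory roof regardless of how well its arithmetic is tuned.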

3. Communication or I/O bound

This is the Achilles heel of any accelerator. Transferring data takes most of the time, and the only way to avoid paying for it is to hide it behind computation. It happens when the kernel execution time is much lower than the time spent on memory transfers, either host to device or device to host. The corresponding optimisations are covered in the following sections.
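To get a feel for how much hiding transfers behind computation can help, the following back-of-the-envelope model compares a fully serial run with an idealised chunked pipeline, such as one built with CUDA streams and asynchronous copies. It assumes that host-to-device copies, kernels, and device-to-host copies of different chunks can all run concurrently, that the work splits evenly, and that chunks carry no extra overhead. The function names and timings are hypothetical.

```python
def serial_time(h2d, kernel, d2h):
    """Total time when copy-in, kernel, and copy-out run back to back."""
    return h2d + kernel + d2h


def pipelined_time(h2d, kernel, d2h, chunks):
    """Idealised three-stage pipeline over `chunks` equal pieces.

    The first chunk pays for every stage; each later chunk only adds
    the duration of the slowest stage.
    """
    stages = [h2d / chunks, kernel / chunks, d2h / chunks]
    return sum(stages) + (chunks - 1) * max(stages)


# Hypothetical timings in milliseconds: transfers dominate the kernel.
H2D_MS, KERNEL_MS, D2H_MS = 4.0, 2.0, 4.0

print(serial_time(H2D_MS, KERNEL_MS, D2H_MS))
for n in (1, 2, 4, 8):
    print(n, pipelined_time(H2D_MS, KERNEL_MS, D2H_MS, n))
```

In this model the total never drops below the longest stage. Once transfers dominate, adding chunks only approaches the transfer time, and moving less data is the remaining lever.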

The following plot shows some heuristics (without any unit on the Y-axis) of what these behaviours look like:


Typical issues while optimising. The set of optimisations is selected depending on the nature of the bottleneck.
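The three cases described in this section can be summarised as a rough triage. The sketch below is only one possible reading of the heuristics: the function name, the order of the checks, and the default 80% threshold are assumptions, and the measurements are expected to come from whatever profiler you already use.

```python
def classify_bottleneck(gpu_util_pct, kernel_ms, transfer_ms,
                        util_threshold_pct=80.0):
    """Return a first guess at the bottleneck: "io", "gpu", or "memory".

    gpu_util_pct: measured GPU utilisation, in percent.
    kernel_ms:    total kernel execution time.
    transfer_ms:  total host-to-device plus device-to-host copy time.
    """
    # Transfers outweighing kernel time points to communication or I/O.
    if transfer_ms > kernel_ms:
        return "io"
    # High utilisation suggests the compute resources are saturated.
    if gpu_util_pct >= util_threshold_pct:
        return "gpu"
    # Low utilisation with time still spent in kernels: suspect memory
    # access patterns, to be confirmed with L1 hit rate measurements.
    return "memory"


# Hypothetical measurements, for illustration only.
print(classify_bottleneck(gpu_util_pct=95.0, kernel_ms=10.0, transfer_ms=2.0))
print(classify_bottleneck(gpu_util_pct=30.0, kernel_ms=10.0, transfer_ms=2.0))
print(classify_bottleneck(gpu_util_pct=30.0, kernel_ms=2.0, transfer_ms=10.0))
```

The result is a starting point for choosing which set of optimisations to try first, not a replacement for profiling.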


Optimisations can be divided into coarse optimisations and fine optimisations, mostly depending on their impact on the overall performance. This classification can vary depending on the system and the application.

The following sections briefly explain the optimisations, assuming that the application already runs correctly and that a CUDA device is available.

