Optimisation Recipes
RidgeRun CUDA Optimisation Guide
Currently, these optimisations are based on heuristics. They must also be prioritised: some optimisations have a greater impact than others. This section roughly describes several of them as a first approximation to the world of CUDA optimisation; most of the ideas apply to other fields as well.
Depending on where your application is bound, there are several ways to proceed:
1. GPU bound
This occurs when the GPU occupancy exceeds 80-90% (depending on the slack you need): the GPU's computation resources are heavily exploited and the arithmetic intensity is high. Basically, the algorithm is touching the roofline. The following are some of the most common GPU-bound optimisations:
- Offloading to a dedicated device (i.e. another hardware accelerator)
- Function approximations
- Condition and loop replacement
- Inlining
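As a rough illustration of two of these ideas (function approximation and branch replacement), consider the sketch below. The kernel names and the transform itself are hypothetical; the point is the substitution of `expf()` with the `__expf()` hardware intrinsic (faster, lower precision) and of a data-dependent branch with a select expression the compiler can predicate:

```cuda
#include <cuda_runtime.h>

// Baseline: full-precision expf() and a data-dependent branch.
__global__ void tone_map_baseline(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v = expf(in[i]);            // accurate but slow transcendental
        if (v > 1.0f) out[i] = v * 0.5f;  // threads in a warp may diverge here
        else          out[i] = v;
    }
}

// Optimised: __expf() intrinsic (hardware approximation, lower precision)
// and a branchless select that keeps the warp converged.
__global__ void tone_map_fast(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v = __expf(in[i]);
        out[i] = v * (v > 1.0f ? 0.5f : 1.0f);
    }
}
```

Whether the precision loss of the intrinsic is acceptable depends on the application, so this trade-off should always be validated against the accuracy requirements.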
2. Memory bound
This occurs when GPU utilisation is low but the wall time is still high, the L1 hit rate is minimal, and the number of compute operations per byte is small. Usually, the data access patterns are poorly shaped for the hardware. Some optimisations are:
- Correcting memory access patterns
- Improving inter-thread communication
- Increasing arithmetic intensity
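A minimal sketch of the first point, correcting memory access patterns, is shown below (the kernels are hypothetical). In the strided version, neighbouring threads in a warp touch addresses far apart, so each warp issues many memory transactions; in the coalesced version, a warp's 32 loads fall into a few contiguous 128-byte segments, maximising effective bandwidth:

```cuda
#include <cuda_runtime.h>

// Strided access: thread i reads a[i * stride], so consecutive threads
// hit distant addresses and the warp's loads cannot be coalesced.
__global__ void scale_strided(const float* a, float* b, int n, int stride) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) b[i] = 2.0f * a[i * stride];
}

// Coalesced access: thread i reads a[i], so consecutive threads hit
// consecutive addresses and the warp's loads merge into few transactions.
__global__ void scale_coalesced(const float* a, float* b, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) b[i] = 2.0f * a[i];
}
```

When the algorithm genuinely needs a strided view of the data (e.g. a matrix transpose), staging tiles through shared memory is the usual way to keep the global-memory accesses coalesced.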
3. Communication or I/O bound
This is the Achilles' heel of any accelerator. Transferring data takes most of the time, and there is no way to avoid it other than hiding it behind computation. It occurs when the kernel execution time is much lower than the time spent on memory transfers: host to device or device to host. Please take into account the following optimisations:
- Overlapping transfers with computation (CUDA streams and asynchronous copies)
- Using pinned (page-locked) host memory to speed up transfers
- Reducing the amount of data transferred
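The sketch below shows one way to hide transfers behind computation using two CUDA streams; the `process` kernel and the two-chunk split are hypothetical, and error checking is omitted for brevity:

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel: processes a chunk of the buffer in place.
__global__ void process(float* data, size_t count) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < count) data[i] *= 2.0f;
}

// Split the work into two chunks on two streams so that, while chunk 0
// is computing, chunk 1's host-to-device copy is already in flight.
void process_overlapped(float* h_buf, float* d_buf, size_t n) {
    cudaStream_t s[2];
    size_t half = n / 2;  // assumes n is even, for brevity
    for (int i = 0; i < 2; ++i) cudaStreamCreate(&s[i]);

    for (int i = 0; i < 2; ++i) {
        size_t off = (size_t)i * half;
        cudaMemcpyAsync(d_buf + off, h_buf + off, half * sizeof(float),
                        cudaMemcpyHostToDevice, s[i]);
        process<<<(unsigned)((half + 255) / 256), 256, 0, s[i]>>>(d_buf + off, half);
        cudaMemcpyAsync(h_buf + off, d_buf + off, half * sizeof(float),
                        cudaMemcpyDeviceToHost, s[i]);
    }
    cudaDeviceSynchronize();
    for (int i = 0; i < 2; ++i) cudaStreamDestroy(s[i]);
}
```

Note that `h_buf` must be allocated with `cudaMallocHost()` (pinned memory): on pageable host memory, `cudaMemcpyAsync` falls back to synchronous behaviour and no overlap is achieved.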
The following plot shows some heuristics (without units on the Y-axis) of what these behaviours look like:
Optimisations can be divided into coarse optimisations and fine optimisations, depending mostly on their impact on overall performance. These classifications can vary with the system and the application.
The following sections briefly explain these optimisations, provided that the application currently runs correctly and you have a CUDA device.