Common pitfalls when optimising NVIDIA CUDA

From RidgeRun Developer Wiki









Common pitfalls when optimising

1. CUDA streams will guarantee concurrent execution.

CUDA streams do not guarantee concurrent execution; they only make it possible. If the other requirements for concurrency are not met, you will not see any overlap. For example, if one of your kernels occupies all of the GPU's resources (SMs, registers, shared memory), there is no room left for anything else to run alongside it.
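The point above can be sketched as follows. This is an illustrative example (the kernel name and sizes are ours, not from the article): two kernels are launched on separate streams, but if the first launch saturates the GPU, the second still waits.

```cuda
// Sketch: kernels on separate streams *may* overlap, but only if each
// leaves enough free SMs/registers/shared memory for the other.
#include <cuda_runtime.h>

__global__ void busy_kernel(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main() {
    const int n = 1 << 20;
    float *a, *b;
    cudaMalloc(&a, n * sizeof(float));
    cudaMalloc(&b, n * sizeof(float));

    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    // If this grid is large enough to occupy every SM, the second
    // launch cannot start early even though it is on another stream.
    busy_kernel<<<(n + 255) / 256, 256, 0, s1>>>(a, n);
    busy_kernel<<<(n + 255) / 256, 256, 0, s2>>>(b, n);

    cudaStreamSynchronize(s1);
    cudaStreamSynchronize(s2);
    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
    cudaFree(a);
    cudaFree(b);
    return 0;
}
```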

2. UVA is smart enough to handle memory transactions.

A memory transfer does not happen by magic. With managed memory, on-demand page migration stalls the kernel on every fault, so it is better to prefetch the data before the launch. Hints to the driver are always a good idea: like asking for directions before setting out, they save time.
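A minimal sketch of prefetching and hinting with managed memory, assuming a device that supports these features (the kernel and sizes are illustrative):

```cuda
// Sketch: prefetch managed memory and give the driver hints so the
// kernel does not stall on page faults.
#include <cuda_runtime.h>

__global__ void scale(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main() {
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);
    float *data;
    cudaMallocManaged(&data, bytes);

    int device;
    cudaGetDevice(&device);

    // Hint: the data should preferably live on the GPU.
    cudaMemAdvise(data, bytes, cudaMemAdviseSetPreferredLocation, device);
    // Prefetch to the GPU before the launch instead of faulting on demand.
    cudaMemPrefetchAsync(data, bytes, device);

    scale<<<(n + 255) / 256, 256>>>(data, n);

    // Prefetch back before the CPU touches the results.
    cudaMemPrefetchAsync(data, bytes, cudaCpuDeviceId);
    cudaDeviceSynchronize();

    cudaFree(data);
    return 0;
}
```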

3. Over-synchronising: delaying the train by giving false alarms.

Synchronisation must be used carefully. A common mistake is placing two synchronisation barriers where only one is needed, for example calling an explicit synchronisation right before an operation that already synchronises implicitly. CUDA may be naïve, but not that naïve.
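A sketch of the redundant-barrier case (the kernel and names are illustrative): a blocking device-to-host `cudaMemcpy` on the default stream already waits for the preceding kernel, so the explicit barrier before it only adds latency.

```cuda
// Sketch: the commented-out barrier is redundant because the blocking
// copy below already synchronises with the kernel on the default stream.
#include <cuda_runtime.h>
#include <stdlib.h>

__global__ void fill(float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = (float)i;
}

int main() {
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);
    float *d_out, *h_out = (float *)malloc(bytes);
    cudaMalloc(&d_out, bytes);

    fill<<<(n + 255) / 256, 256>>>(d_out, n);
    // cudaDeviceSynchronize();  // redundant: the copy below waits anyway
    cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost);  // implicit sync

    cudaFree(d_out);
    free(h_out);
    return 0;
}
```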

4. Use doubles, since modern machines perform the same with doubles as with singles.

This claim is sometimes heard when developing software for the CPU. However, the following applies to both GPU and CPU:

No, it does not! Doubles consume twice the memory of singles, so the application loads more cache lines and may suffer more cache misses and their penalties. Additionally, on the GPU, double-precision arithmetic runs at a fraction of single-precision throughput.

One clarification: this applies mainly to recurrent code (code that executes many times). For non-recurrent code, the effect may be negligible.
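A subtle way doubles sneak in is through unsuffixed floating-point literals, which are `double` in C/C++ and promote the whole expression. A small illustrative sketch:

```cuda
// Sketch: unsuffixed literals (0.5, 1.0) are doubles and silently
// promote the arithmetic to double precision inside the kernel.
__global__ void scale_slow(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = x[i] * 0.5 + 1.0;   // double math via promotion
}

// Suffixed literals keep the computation in single precision.
__global__ void scale_fast(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = x[i] * 0.5f + 1.0f; // single precision throughout
}
```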

5. The GPU will always be faster than the CPU because of parallelism.

This is another pitfall. If the algorithm is predominantly serial or requires many synchronisation barriers, the CPU may outperform the GPU. From our experience, there have been cases where the CPU is 10x faster than the GPU because synchronisation overhead consumes most of the time.

6. UVA avoids memory copies, increasing performance

It is actually a trade-off. UVA takes the memory copies out of the developer's hands, but explicit transfers from pinned memory may outperform it for larger chunks of memory.

Please keep the CUDA programming guide in mind.
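A sketch of the explicit alternative: page-locked (pinned) host memory with an asynchronous copy, which lets the DMA engine run at full bandwidth. The buffer size here is illustrative.

```cuda
// Sketch: explicit transfer from pinned host memory, the alternative
// to managed (UVA) allocations for large chunks.
#include <cuda_runtime.h>

int main() {
    const size_t bytes = 256 << 20;        // a large chunk (256 MiB)
    float *h_pinned, *d_buf;
    cudaMallocHost(&h_pinned, bytes);      // page-locked host memory
    cudaMalloc(&d_buf, bytes);

    // Asynchronous copy: only possible from pinned memory, and it
    // typically reaches higher bandwidth than pageable transfers.
    cudaMemcpyAsync(d_buf, h_pinned, bytes, cudaMemcpyHostToDevice);
    cudaDeviceSynchronize();

    cudaFree(d_buf);
    cudaFreeHost(h_pinned);
    return 0;
}
```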

7. The GPU was built for graphics. Image processing should run just fine

It depends on the algorithm and on accelerator availability. On Tegra, the best bet is to use the hardware accelerators designed for that purpose. A Tegra iGPU (integrated GPU) differs from a dGPU (discrete GPU) in both architecture and support.

8. Accelerators are always faster than the GPU

Despite their specialisation, accelerators may be clocked lower than the GPU, and some workloads run faster on the GPU. The programmer should take into account the trade-off between clock speed and workload.

9. Feeling confident.

Optimisation is like studying quantum physics: if you don't doubt yourself, you are not advancing. Skipping the architecture details of Kepler, Pascal, Tesla, or Volta can lead to suboptimal results. It is better to check the architecture details over and over again, and to treat the CUDA programming guide as your best friend.

The same applies to CPU optimisation.

