Inter-thread communication in CUDA Coarse optimisations

Previous: Optimisation Recipes/Coarse optimisations/Correct memory access patterns Index Next: Optimisation Recipes/Fine optimisations






Memory bound: Inter-thread communication

In this case, threads share some data among themselves. From the architecture perspective, these cases are ill-formed: ideally, the GPU should not be used for this kind of inter-thread communication. However, it may be necessary when there is no other choice.

In such cases, one way to go is to use shared memory. Shared memory is per streaming multiprocessor (SM) and is very fast compared to working in global memory. For reductions or accumulations, you can also use atomics; however, atomics can lead to serious serialisation penalties.
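
As an illustration, a minimal sketch of a block-level sum reduction in shared memory could look like the following (the kernel name, BLOCK_SIZE and overall structure are assumptions for this example, not code from the article):

  #include <cuda_runtime.h>

  #define BLOCK_SIZE 256  // assumed block size for this sketch

  // Each block reduces BLOCK_SIZE elements to one partial sum in shared memory.
  __global__ void block_sum(const float *in, float *block_sums, int n) {
      __shared__ float cache[BLOCK_SIZE];

      int tid = threadIdx.x;
      int idx = blockIdx.x * blockDim.x + tid;

      // Load one element per thread (0 if out of range) into shared memory.
      cache[tid] = (idx < n) ? in[idx] : 0.0f;
      __syncthreads();

      // Tree reduction: halve the number of active threads at each step.
      for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
          if (tid < stride) {
              cache[tid] += cache[tid + stride];
          }
          __syncthreads();
      }

      // Thread 0 writes the per-block partial sum back to global memory.
      if (tid == 0) {
          block_sums[blockIdx.x] = cache[0];
      }
  }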

You can also implement a form of intra-warp communication through register caching, where each lane keeps its data in registers and exchanges it with other lanes using warp shuffle instructions. You can find more information in this blog.
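
As a rough sketch of this idea (not code from the referenced blog), each lane keeps its value in a register and the warp exchanges values directly with shuffle intrinsics, without going through shared memory:

  // Warp-level sum kept entirely in registers via shuffle intrinsics.
  __inline__ __device__ float warp_sum(float val) {
      // Each step adds the value held by the lane 'offset' positions away.
      for (int offset = warpSize / 2; offset > 0; offset >>= 1) {
          val += __shfl_down_sync(0xffffffff, val, offset);
      }
      return val;  // lane 0 ends up with the sum of the whole warp
  }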

Some alternatives for this problem are:

  1. Use shared memory for inter-thread communication
  2. Place multiple operations in a single thread, being careful with thread divergence (see the coarsening sketch below)
  3. Use register caching

In some cases, the last option is the recommended one.
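
For option 2, a common approach is thread coarsening with a grid-stride loop: each thread accumulates several elements privately in a register and only communicates once at the end. A hedged sketch (the kernel name and structure are assumptions for this example):

  // Each thread accumulates many elements privately, then issues a single
  // atomicAdd, so the amount of inter-thread communication stays small.
  __global__ void coarse_sum(const float *in, float *out, int n) {
      float acc = 0.0f;

      // Grid-stride loop: every thread processes multiple elements.
      for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
           i += blockDim.x * gridDim.x) {
          acc += in[i];
      }

      // One atomic per thread instead of one per element.
      atomicAdd(out, acc);
  }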

Performance hints: in some cases the alternatives perform similarly, while in others one clearly outperforms the rest. It is worth trying the most promising one first; experience helps reduce the trial and error.
Limit thread communication by using warp shuffles whenever possible. If you are going for reductions, atomics can perform reasonably well.
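
Combining both hints, a reduction can do most of the work with shuffles and finish with a single atomic per warp. The sketch below reuses the hypothetical warp_sum() helper shown earlier:

  // Shuffle-based reduction with one atomicAdd per warp.
  __global__ void reduce_sum(const float *in, float *out, int n) {
      int idx = blockIdx.x * blockDim.x + threadIdx.x;
      float val = (idx < n) ? in[idx] : 0.0f;

      // Reduce within the warp using registers only.
      val = warp_sum(val);

      // Only the first lane of each warp touches global memory.
      if ((threadIdx.x & (warpSize - 1)) == 0) {
          atomicAdd(out, val);
      }
  }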


Previous: Optimisation Recipes/Coarse optimisations/Correct memory access patterns Index Next: Optimisation Recipes/Fine optimisations