Inter-thread communication in CUDA Coarse optimisations

Previous: Optimisation Recipes/Coarse optimisations/Correct memory access patterns Index Next: Optimisation Recipes/Fine optimisations






Memory bound: Inter-thread communication

In this case, threads share some data among themselves. From the architecture perspective, these cases are ill-formed: ideally, the GPU should not be used for this kind of inter-thread communication. However, it may be necessary when there is no other choice.

In such cases, one way to go is to use shared memory. Shared memory is per streaming multiprocessor (SM) and is very fast compared to working in global memory. For reductions or accumulations, you can also use atomics; however, atomics can lead to serious serialisation penalties.
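
As an illustration, a minimal sketch of a block-level sum reduction in shared memory could look like the following (the kernel name, BLOCK_SIZE and overall structure are assumptions for this example, not code from the article):

  #include <cuda_runtime.h>

  #define BLOCK_SIZE 256  // assumed block size for this sketch

  // Each block reduces BLOCK_SIZE elements to one partial sum in shared memory.
  __global__ void block_sum(const float *in, float *block_sums, int n) {
      __shared__ float cache[BLOCK_SIZE];

      int tid = threadIdx.x;
      int idx = blockIdx.x * blockDim.x + tid;

      // Load one element per thread (0 if out of range) into shared memory.
      cache[tid] = (idx < n) ? in[idx] : 0.0f;
      __syncthreads();

      // Tree reduction: halve the number of active threads at each step.
      for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
          if (tid < stride) {
              cache[tid] += cache[tid + stride];
          }
          __syncthreads();
      }

      // Thread 0 writes the per-block partial sum back to global memory.
      if (tid == 0) {
          block_sums[blockIdx.x] = cache[0];
      }
  }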

You can also implement a form of intra-warp communication through register caching, where each lane keeps its data in registers and exchanges it with other lanes using warp shuffle instructions. You can find more information in this blog.
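
As a rough sketch of this idea (not code from the referenced blog), each lane keeps its value in a register and the warp exchanges values directly with shuffle intrinsics, without going through shared memory:

  // Warp-level sum kept entirely in registers via shuffle intrinsics.
  __inline__ __device__ float warp_sum(float val) {
      // Each step adds the value held by the lane 'offset' positions away.
      for (int offset = warpSize / 2; offset > 0; offset >>= 1) {
          val += __shfl_down_sync(0xffffffff, val, offset);
      }
      return val;  // lane 0 ends up with the sum of the whole warp
  }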

Some alternatives for this problem are:

  1. Use shared memory for inter-thread communication
  2. Place multiple operations in a single thread, being careful with thread divergence (see the coarsening sketch below)
  3. Use register caching

In some cases, the last option is the recommended one.
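
For option 2, a common approach is thread coarsening with a grid-stride loop: each thread accumulates several elements privately in a register and only communicates once at the end. A hedged sketch (the kernel name and structure are assumptions for this example):

  // Each thread accumulates many elements privately, then issues a single
  // atomicAdd, so the amount of inter-thread communication stays small.
  __global__ void coarse_sum(const float *in, float *out, int n) {
      float acc = 0.0f;

      // Grid-stride loop: every thread processes multiple elements.
      for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
           i += blockDim.x * gridDim.x) {
          acc += in[i];
      }

      // One atomic per thread instead of one per element.
      atomicAdd(out, acc);
  }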

Performance hints: in some cases the alternatives perform similarly, while in others one clearly outperforms the rest. It is worth trying the most promising one first; experience helps reduce the trial and error.
Limit thread communication by using warp shuffles whenever possible. If you are going for reductions, atomics can perform reasonably well.
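
Combining both hints, a reduction can do most of the work with shuffles and finish with a single atomic per warp. The sketch below reuses the hypothetical warp_sum() helper shown earlier:

  // Shuffle-based reduction with one atomicAdd per warp.
  __global__ void reduce_sum(const float *in, float *out, int n) {
      int idx = blockIdx.x * blockDim.x + threadIdx.x;
      float val = (idx < n) ? in[idx] : 0.0f;

      // Reduce within the warp using registers only.
      val = warp_sum(val);

      // Only the first lane of each warp touches global memory.
      if ((threadIdx.x & (warpSize - 1)) == 0) {
          atomicAdd(out, val);
      }
  }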


Previous: Optimisation Recipes/Coarse optimisations/Correct memory access patterns Index Next: Optimisation Recipes/Fine optimisations