Inter-thread communication (CUDA coarse optimisations)
Memory bound: Inter-thread communication
In this case, the threads need to share data among themselves. From the architecture perspective, these cases are a poor fit: ideally, the GPU should not be used for heavy inter-thread communication. However, it may be necessary when there is no other choice.
In such cases, one option is to use shared memory. Shared memory is per streaming multiprocessor and is much faster than going through global memory. It is well suited to reductions and accumulation-style atomics, although heavy synchronisation or atomic contention can lead to serious penalties.
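As a minimal sketch of this idea (kernel name and block size are illustrative, not taken from the guide), a block-level sum reduction in shared memory might look like this:

```cuda
// Illustrative block size; tune for your device. Assumes a power-of-two block.
#define BLOCK_SIZE 256

// Each block reduces BLOCK_SIZE elements of `in` into one partial sum.
__global__ void block_sum_shared(const float *in, float *partial, int n)
{
    __shared__ float cache[BLOCK_SIZE];

    int tid = threadIdx.x;
    int idx = blockIdx.x * blockDim.x + threadIdx.x;

    // Load one element per thread (0 if out of range).
    cache[tid] = (idx < n) ? in[idx] : 0.0f;
    __syncthreads();

    // Tree reduction inside the block; every step halves the active threads.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride)
            cache[tid] += cache[tid + stride];
        __syncthreads();
    }

    // Thread 0 writes the block's partial sum to global memory.
    if (tid == 0)
        partial[blockIdx.x] = cache[0];
}
```

A second pass (or a single `atomicAdd` per block) is then needed to combine the partial sums.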
You can also implement intra-warp communication through register caching, where each thread keeps part of the shared data in its own registers and exchanges it with warp shuffle instructions. You can find more information in this blog post.
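As a hedged sketch of the shuffle-based approach (the helper name is illustrative), a warp-level sum using `__shfl_down_sync` looks like this:

```cuda
// Sum reduction within a single warp using register-to-register shuffles.
// The *_sync variants require CUDA 9+ and compute capability 3.0+.
__inline__ __device__ float warp_reduce_sum(float val)
{
    // 0xffffffff: all 32 lanes of the warp participate.
    for (int offset = warpSize / 2; offset > 0; offset >>= 1)
        val += __shfl_down_sync(0xffffffff, val, offset);
    return val;  // lane 0 ends up holding the warp's total
}
```

No shared memory or `__syncthreads()` is involved: the data stays in registers and moves directly between lanes.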
In summary, some alternatives for this problem are:
- Use shared memory for the inter-thread communication
- Place multiple operations in a single thread (thread coarsening), being careful with thread divergence; see the sketch below this list
- Use register caching
Of these, register caching is often the recommended option when the data exchange can be kept within a warp.
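For the second alternative, placing multiple operations in a single thread removes the need for communication until the very end, since each thread accumulates into a private register. A minimal sketch using a grid-stride loop (names are illustrative):

```cuda
// Each thread accumulates several elements in a private register,
// so no inter-thread communication happens during the loop.
__global__ void coarse_sum(const float *in, float *per_thread, int n)
{
    float local = 0.0f;

    // Grid-stride loop: this thread handles idx, idx + gridSize, ...
    for (int idx = blockIdx.x * blockDim.x + threadIdx.x;
         idx < n;
         idx += gridDim.x * blockDim.x)
        local += in[idx];

    // One partial result per thread; a later pass (or atomics) combines them.
    per_thread[blockIdx.x * blockDim.x + threadIdx.x] = local;
}
```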
Performance hints: in some cases the alternatives perform similarly, while in others one clearly outperforms the rest. It is worth trying the most promising ones first; experience is the best way to avoid pure trial and error.
Limit thread communication by using warp shuffles whenever possible. For reductions, atomics can perform reasonably well.
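As a hedged sketch combining both hints (names are illustrative and reuse the `warp_reduce_sum` helper sketched above), each warp can reduce its values with shuffles and then issue a single `atomicAdd`, which keeps atomic contention low:

```cuda
// Reduce an array to a single float: shuffles inside each warp,
// then one atomicAdd per warp instead of one per thread.
__global__ void reduce_atomic(const float *in, float *out, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    float val = (idx < n) ? in[idx] : 0.0f;

    val = warp_reduce_sum(val);               // warp-level partial sum

    if ((threadIdx.x & (warpSize - 1)) == 0)  // lane 0 of each warp
        atomicAdd(out, val);                  // *out must be zero-initialised
}
```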