Correct memory access patterns in CUDA: Coarse optimisations
Memory bound: Memory access patterns
The key is to exploit the cache as much as possible, and the way to achieve that is coalesced memory access.
Unlike a CPU, an NVIDIA GPU loads data at the streaming-multiprocessor level, and a whole group of threads depends on each memory transaction. If the threads of a warp do not access memory uniformly, part of every transaction is wasted and performance becomes suboptimal. This presentation is an interesting overview of the relevant optimisation techniques.
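To make this concrete, below is a minimal sketch (the kernel and variable names are illustrative, not from this guide) contrasting a coalesced access pattern with a strided one. Only the indexing differs, but the strided version can force the warp to touch one cache line per thread instead of one per warp.

```cuda
// Coalesced: thread k of a warp reads in[base + k], so the warp's 32
// four-byte accesses fall in a single 128-byte cache line.
__global__ void copy_coalesced(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];
}

// Strided: consecutive threads read addresses `stride` elements apart,
// so the warp may need a separate cache line for every thread.
__global__ void copy_strided(const float *in, float *out, int n, int stride)
{
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    if (i < n)
        out[i] = in[i];
}
```

Profiling both kernels for the same amount of useful data moved makes the cost of the extra cache lines immediately visible.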
Some of the symptoms of incorrect memory access are:
1. Offset (not line-aligned) warp addresses: the warp's base address is not aligned to the cache line, so the requested data straddles two memory segments (128 bytes per segment) instead of one.
- Fix: force alignment by adding padding to the data.
- Impact: left unaddressed, this drops bus utilisation to roughly 50%.
2. Large strides between threads within a warp: the warp requires more than one memory transaction (cache miss) to load the data it needs.
- Fix: try to change the data layout.
- Impact: bus utilisation drops to 1/N, where N is the number of cache lines the warp requires.
3. Each thread accesses a large contiguous region: usually caused by an array of structs (AoS); a struct of arrays (SoA) is the better layout.
- Fix: change the data layout, or change the execution order (using multiple threads per region) so that the accesses coalesce; see the first sketch after this list.
- Impact: medium L1 hit rate. Changing the data layout may yield a 2x improvement or more.
4. Irregular (scattered) addresses: the accesses do not follow a uniform stride.
- Fix: try non-caching loads or read-only loads. Read-only loads can be requested with the `const __restrict__` qualifiers or the `__ldg()` intrinsic; see the second sketch after this list.
- Impact: minimal L1 hit rate.
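As a sketch of the data-layout fix for symptom 3 (the point structure and kernels are hypothetical, chosen only to illustrate the idea): with the AoS layout, thread k reads `pts[k].x`, so a warp's 32 reads are 12 bytes apart and span several cache lines; with the SoA layout, the 32 reads of `x` are contiguous.

```cuda
struct PointAoS { float x, y, z; };   // array of structs: strided access

struct PointsSoA {                    // struct of arrays: coalesced access
    float *x;
    float *y;
    float *z;
};

__global__ void scale_aos(PointAoS *pts, float s, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        pts[i].x *= s;                // 12-byte stride between neighbouring threads
}

__global__ void scale_soa(PointsSoA pts, float s, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        pts.x[i] *= s;                // consecutive 4-byte accesses across the warp
}
```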
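And for symptom 4, a sketch of read-only loads on a gather pattern. `__ldg()` and the `const __restrict__` qualifiers are standard CUDA (the intrinsic requires compute capability 3.5 or newer); the gather kernel itself is a hypothetical example.

```cuda
// idx holds arbitrary indices, so the reads from `in` are scattered.
// Marking the inputs read-only lets the compiler route the loads
// through the read-only (texture) cache instead of polluting L1.
__global__ void gather(const float * __restrict__ in,
                       const int   * __restrict__ idx,
                       float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = __ldg(&in[idx[i]]);  // explicit read-only load; often the
                                      // const __restrict__ hint alone suffices
}
```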
The following figure depicts these phenomena in a nutshell.
*Figure: coalescing in a nutshell*
Performance hint: consuming only part of a cache line per warp also degrades performance; it is better to use each cache line entirely.
Performance hint: Nsight can help you inspect bus utilisation. These issues can also be caught by reading the code and performing a code analysis.
If the memory accesses show any of the issues listed above, reshape the data layout until the accesses coalesce.
The story does not end here. What happens if the application already has coalesced access, but the arithmetic intensity is low relative to the data, i.e. the application does not compute enough on each batch of data it loads? Another rule of thumb is to compute as much as you can with the data already loaded. To do so, modify the threads so that each one processes multiple elements instead of just one. This does not mean simply placing a loop over a cache line; it means that, keeping the alignment and the 128-byte cache-line length in mind, more than 32 operands can fit in a single line. For instance:
- Instead of doubles, use singles: it saves one cache-line retrieval per warp.
- Instead of singles, use halves: each thread can process two numbers while the warp still retrieves a single cache line.
- Instead of halves, operate on chars: it may be numerically suboptimal, but it raises arithmetic intensity further.
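As an illustrative sketch of the halves bullet (assuming a device with native FP16 arithmetic, compute capability 5.3 or newer; the kernel name is hypothetical), each thread below operates on a `half2`, i.e. two numbers per 4-byte load, so a warp still consumes exactly one 128-byte cache line per access while doing twice the work:

```cuda
#include <cuda_fp16.h>

// a*x + y on packed half-precision pairs. n2 is the number of half2
// elements, i.e. half the number of scalar values.
__global__ void axpy_half2(half2 a, const half2 *x, half2 *y, int n2)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n2)
        y[i] = __hfma2(a, x[i], y[i]);  // two fused multiply-adds per thread
}
```

The warp still issues one 128-byte transaction per access, but that transaction now carries 64 operands instead of 32, doubling the arithmetic intensity per byte moved.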
Performance hint: keep in mind that integer arithmetic can underperform floating-point; formulate the logic so that it can reach the floating-point throughput limit.
Use a numerical representation that satisfies the accuracy the problem actually needs; do not over-dimension it unnecessarily.
We invite you to have a look at this example.