Correct memory access patterns in CUDA: Coarse optimisations
Memory bound: Memory access patterns
The key is to exploit the cache as much as possible, and the way to achieve that is coalesced memory access.
Unlike a CPU, an NVIDIA GPU loads data at the streaming-multiprocessor level, and a whole group of threads depends on each memory transaction. If the threads of a warp do not access memory uniformly, part of every transaction is wasted and performance becomes suboptimal. This presentation is an interesting overview of the relevant optimisation techniques.
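To make this concrete, below is a minimal sketch (the kernel and variable names are illustrative, not from this guide) contrasting a coalesced access pattern with a strided one. Only the indexing differs, but the strided version can force the warp to touch one cache line per thread instead of one per warp.

```cuda
// Coalesced: thread k of a warp reads in[base + k], so the warp's 32
// four-byte accesses fall in a single 128-byte cache line.
__global__ void copy_coalesced(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];
}

// Strided: consecutive threads read addresses `stride` elements apart,
// so the warp may need a separate cache line for every thread.
__global__ void copy_strided(const float *in, float *out, int n, int stride)
{
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    if (i < n)
        out[i] = in[i];
}
```

Profiling both kernels for the same amount of useful data moved makes the cost of the extra cache lines immediately visible.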
Some of the symptoms of incorrect memory access are:
1. Offset (not line-aligned) warp addresses: the warp's base address is not aligned to the cache line, so the requested data straddles two memory segments (128 bytes per segment) instead of one.
- Fix: force alignment by adding padding to the data.
- Impact: left unaddressed, this drops bus utilisation to roughly 50%.
2. Large strides between threads within a warp: the warp requires more than one memory transaction (cache miss) to load the data it needs.
- Fix: try to change the data layout.
- Impact: bus utilisation drops to 1/N, where N is the number of cache lines the warp requires.
3. Each thread accesses a large contiguous region: usually caused by an array of structs (AoS); a struct of arrays (SoA) is the better layout.
- Fix: change the data layout, or change the execution order (using multiple threads per region) so that the accesses coalesce; see the first sketch after this list.
- Impact: medium L1 hit rate. Changing the data layout may yield a 2x improvement or more.
4. Irregular (scattered) addresses: the accesses do not follow a uniform stride.
- Fix: try non-caching loads or read-only loads. Read-only loads can be requested with the `const __restrict__` qualifiers or the `__ldg()` intrinsic; see the second sketch after this list.
- Impact: minimal L1 hit rate.
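As a sketch of the data-layout fix for symptom 3 (the point structure and kernels are hypothetical, chosen only to illustrate the idea): with the AoS layout, thread k reads `pts[k].x`, so a warp's 32 reads are 12 bytes apart and span several cache lines; with the SoA layout, the 32 reads of `x` are contiguous.

```cuda
struct PointAoS { float x, y, z; };   // array of structs: strided access

struct PointsSoA {                    // struct of arrays: coalesced access
    float *x;
    float *y;
    float *z;
};

__global__ void scale_aos(PointAoS *pts, float s, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        pts[i].x *= s;                // 12-byte stride between neighbouring threads
}

__global__ void scale_soa(PointsSoA pts, float s, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        pts.x[i] *= s;                // consecutive 4-byte accesses across the warp
}
```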
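And for symptom 4, a sketch of read-only loads on a gather pattern. `__ldg()` and the `const __restrict__` qualifiers are standard CUDA (the intrinsic requires compute capability 3.5 or newer); the gather kernel itself is a hypothetical example.

```cuda
// idx holds arbitrary indices, so the reads from `in` are scattered.
// Marking the inputs read-only lets the compiler route the loads
// through the read-only (texture) cache instead of polluting L1.
__global__ void gather(const float * __restrict__ in,
                       const int   * __restrict__ idx,
                       float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = __ldg(&in[idx[i]]);  // explicit read-only load; often the
                                      // const __restrict__ hint alone suffices
}
```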
The following figure depicts these phenomena in a nutshell.
*Figure: coalescing in a nutshell*
Performance hint: consuming only part of a cache line per warp also degrades performance; it is better to use each cache line entirely.
Performance hint: Nsight can help you inspect bus utilisation. These issues can also be caught by reading the code and performing a code analysis.
If the memory accesses show any of the issues listed above, reshape the data layout until the accesses coalesce.
The story does not end here. What happens if the application already has coalesced access, but the arithmetic intensity is low relative to the data, i.e. the application does not compute enough on each batch of data it loads? Another rule of thumb is to compute as much as you can with the data already loaded. To do so, modify the threads so that each one processes multiple elements instead of just one. This does not mean simply placing a loop over a cache line; it means that, keeping the alignment and the 128-byte cache-line length in mind, more than 32 operands can fit in a single line. For instance:
- Instead of doubles, use singles: it saves one cache-line retrieval per warp.
- Instead of singles, use halves: each thread can process two numbers while the warp still retrieves a single cache line.
- Instead of halves, operate on chars: it may be numerically suboptimal, but it raises arithmetic intensity further.
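As an illustrative sketch of the halves bullet (assuming a device with native FP16 arithmetic, compute capability 5.3 or newer; the kernel name is hypothetical), each thread below operates on a `half2`, i.e. two numbers per 4-byte load, so a warp still consumes exactly one 128-byte cache line per access while doing twice the work:

```cuda
#include <cuda_fp16.h>

// a*x + y on packed half-precision pairs. n2 is the number of half2
// elements, i.e. half the number of scalar values.
__global__ void axpy_half2(half2 a, const half2 *x, half2 *y, int n2)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n2)
        y[i] = __hfma2(a, x[i], y[i]);  // two fused multiply-adds per thread
}
```

The warp still issues one 128-byte transaction per access, but that transaction now carries 64 operands instead of 32, doubling the arithmetic intensity per byte moved.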
Performance hint: keep in mind that integer arithmetic can underperform floating-point; formulate the logic so that it can reach the floating-point throughput limit.
Use a numerical representation that satisfies the accuracy the problem actually needs; do not over-dimension it unnecessarily.
We invite you to have a look at this example.