Communication overlapping in CUDA

From RidgeRun Developer Wiki
Revision as of 03:18, 4 March 2023 by Spalli (talk | contribs)




Previous: Optimisation Recipes/Coarse optimisations/Problem size Index Next: Optimisation Recipes/Coarse optimisations/Correct memory access patterns





Communication bound: Overlapping & Asynchronous calls

There are applications where data movement negatively impacts the overall wall time. Criteria to determine whether overlapping communication is worthwhile include:

  • The communication and execution are serial: first communication (host to device), then computation, and finally communication (device to host)
  • The workload can be decomposed or batched without serious data dependencies

The basic idea of overlapping is to hide the communication overhead from the timeline. By batching the workload, the transfers are also performed in batches, as presented in the Figure "How streams look like in a timeline": once a chunk is in the device, the kernel execution on that chunk can proceed while the next chunk is being transferred.
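The batching idea can be sketched as follows. This is a minimal illustration, not code from the article: the kernel name (scale_kernel), the chunk count, and the sizes are all hypothetical. Each chunk gets its own stream, so the copy of one chunk overlaps the kernel of another. Note that pinned host memory (cudaMallocHost) is required for the copies to be truly asynchronous.

```cuda
// Hypothetical sketch: one large transfer + kernel split into N_CHUNKS
// batches, each queued on its own stream so copies overlap with compute.
#include <cuda_runtime.h>

__global__ void scale_kernel(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main() {
    const int N = 1 << 22, N_CHUNKS = 4, CHUNK = N / N_CHUNKS;
    float *h, *d;
    cudaMallocHost(&h, N * sizeof(float)); // pinned memory: needed for async copies
    cudaMalloc(&d, N * sizeof(float));

    cudaStream_t streams[N_CHUNKS];
    for (int c = 0; c < N_CHUNKS; ++c) cudaStreamCreate(&streams[c]);

    for (int c = 0; c < N_CHUNKS; ++c) {
        int off = c * CHUNK;
        // H2D copy, kernel, and D2H copy of chunk c queue in order on stream c;
        // chunk c+1's copy can overlap chunk c's kernel execution.
        cudaMemcpyAsync(d + off, h + off, CHUNK * sizeof(float),
                        cudaMemcpyHostToDevice, streams[c]);
        scale_kernel<<<(CHUNK + 255) / 256, 256, 0, streams[c]>>>(d + off, CHUNK);
        cudaMemcpyAsync(h + off, d + off, CHUNK * sizeof(float),
                        cudaMemcpyDeviceToHost, streams[c]);
    }
    for (int c = 0; c < N_CHUNKS; ++c) {
        cudaStreamSynchronize(streams[c]);
        cudaStreamDestroy(streams[c]);
    }
    cudaFreeHost(h); cudaFree(d);
    return 0;
}
```

In a timeline view (e.g. Nsight Systems), the copy of chunk 1 appears under the kernel of chunk 0, which is exactly the overlap this recipe aims for.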

Another way to hide the communication overhead is to compute part of the work on the CPU while the GPU is busy, taking advantage of the fact that the host is also a computing device. The workload must be balanced so that both sides have a similar wall time.
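This CPU/GPU split can be sketched as below, assuming a trivially splittable workload; the names (gpu_part, process_on_cpu) and the half-and-half split are illustrative assumptions. The key point is that kernel launches and async copies return immediately, so the host loop runs concurrently with the GPU work.

```cuda
// Hypothetical sketch: the host squares one half of the array while the GPU
// squares the other half; both halves are joined at the end.
#include <cuda_runtime.h>

__global__ void gpu_part(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] * data[i];
}

static void process_on_cpu(float *data, int n) {
    for (int i = 0; i < n; ++i) data[i] = data[i] * data[i];
}

int main() {
    const int N = 1 << 20, HALF = N / 2;
    float *h, *d;
    cudaMallocHost(&h, N * sizeof(float));  // pinned, so the copies are async
    cudaMalloc(&d, HALF * sizeof(float));

    // These calls return immediately; the GPU processes h[0..HALF) ...
    cudaMemcpyAsync(d, h, HALF * sizeof(float), cudaMemcpyHostToDevice, 0);
    gpu_part<<<(HALF + 255) / 256, 256>>>(d, HALF);
    cudaMemcpyAsync(h, d, HALF * sizeof(float), cudaMemcpyDeviceToHost, 0);

    // ... while the CPU processes h[HALF..N) at the same time.
    process_on_cpu(h + HALF, HALF);

    cudaDeviceSynchronize();  // join both halves before using the results
    cudaFreeHost(h); cudaFree(d);
    return 0;
}
```

If one side consistently finishes long before the other, shifting the split ratio (rather than half and half) rebalances the wall time, as the paragraph above suggests.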

In pipeline-based applications like GStreamer, it is possible to hide the communication overhead by separating CUDA-based elements with streams: one frame is often independent of the others. Thus, if each element has its own stream, the communication can be hidden without any further changes, and the work is effectively batched.

From the CUDA perspective, the recommended path is:

  1. Create a new stream and avoid using the default stream: when several instances each run on their own stream, they can execute in parallel.
  2. Make the memory transactions asynchronous: most transfer functions have an "Async" variant (e.g. cudaMemcpyAsync). Take advantage of them.
  3. Make sure that your jobs have finished with stream synchronisation: do not use device synchronisation unless strictly necessary, since it stalls every stream.
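The three steps above reduce to a short pattern. This is a minimal sketch, with a hypothetical kernel (my_kernel) and sizes; the API calls (cudaStreamCreate, cudaMemcpyAsync, cudaStreamSynchronize) are the real CUDA runtime functions the steps refer to.

```cuda
// Minimal sketch of the recommended path: non-default stream, async
// transfers, and stream-level (not device-level) synchronisation.
#include <cuda_runtime.h>

__global__ void my_kernel(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

int main() {
    const int N = 1 << 20;
    float *h, *d;
    cudaMallocHost(&h, N * sizeof(float)); // pinned host memory for true async copies
    cudaMalloc(&d, N * sizeof(float));

    cudaStream_t stream;
    cudaStreamCreate(&stream);             // step 1: a non-default stream

    // step 2: asynchronous transfers and kernel launch, all on the same stream
    cudaMemcpyAsync(d, h, N * sizeof(float), cudaMemcpyHostToDevice, stream);
    my_kernel<<<(N + 255) / 256, 256, 0, stream>>>(d, N);
    cudaMemcpyAsync(h, d, N * sizeof(float), cudaMemcpyDeviceToHost, stream);

    cudaStreamSynchronize(stream);         // step 3: wait on this stream only,
                                           // instead of cudaDeviceSynchronize()
    cudaStreamDestroy(stream);
    cudaFreeHost(h); cudaFree(d);
    return 0;
}
```

With several such instances on different streams, the runtime is free to overlap their copies and kernels against each other.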

