NVIDIA GPU Communication and Concurrency









Data can be transferred between host and device memory in two ways. A blocking transfer halts host execution until the copy completes. An asynchronous transfer returns control to the host immediately, so the host can carry on with other work while the copy is in flight. For this, CUDA provides streams, which are similar to multiple job pipelines running in parallel. Thanks to streams, several jobs can run on the same GPU at once.
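The difference between the two transfer modes can be sketched as follows. This is a minimal illustration, not a reference implementation: the function name copy_demo, the CHECK macro, and the buffer size are assumptions made for the example, while the runtime calls (cudaMemcpy, cudaMemcpyAsync, cudaMallocHost, cudaStreamSynchronize) are standard CUDA runtime API.

```cuda
#include <cstdio>
#include <cstring>
#include <cuda_runtime.h>

// Hypothetical helper: print any CUDA runtime error and make the demo fail.
#define CHECK(call)                                                        \
    do {                                                                   \
        cudaError_t err_ = (call);                                         \
        if (err_ != cudaSuccess) {                                         \
            std::fprintf(stderr, "CUDA error: %s\n",                       \
                         cudaGetErrorString(err_));                        \
            return 1;                                                      \
        }                                                                  \
    } while (0)

// Returns 0 if both round trips reproduce the input, 1 otherwise.
int copy_demo() {
    const int n = 1 << 20;                 // illustrative size
    const size_t bytes = n * sizeof(float);

    float *h_in = nullptr, *h_out = nullptr, *d_buf = nullptr;
    CHECK(cudaMallocHost((void**)&h_in, bytes));   // pinned host memory
    CHECK(cudaMallocHost((void**)&h_out, bytes));
    CHECK(cudaMalloc((void**)&d_buf, bytes));
    for (int i = 0; i < n; ++i) h_in[i] = float(i);

    // Blocking: each call returns only after the copy has completed.
    CHECK(cudaMemcpy(d_buf, h_in, bytes, cudaMemcpyHostToDevice));
    CHECK(cudaMemcpy(h_out, d_buf, bytes, cudaMemcpyDeviceToHost));
    bool ok = std::memcmp(h_in, h_out, bytes) == 0;

    // Asynchronous: the calls are queued on a stream and return at once.
    std::memset(h_out, 0, bytes);
    cudaStream_t stream;
    CHECK(cudaStreamCreate(&stream));
    CHECK(cudaMemcpyAsync(d_buf, h_in, bytes, cudaMemcpyHostToDevice, stream));
    CHECK(cudaMemcpyAsync(h_out, d_buf, bytes, cudaMemcpyDeviceToHost, stream));
    // ... the host could do unrelated work here ...
    CHECK(cudaStreamSynchronize(stream));  // wait before reading h_out
    ok = ok && std::memcmp(h_in, h_out, bytes) == 0;

    CHECK(cudaStreamDestroy(stream));
    CHECK(cudaFree(d_buf));
    CHECK(cudaFreeHost(h_in));
    CHECK(cudaFreeHost(h_out));
    return ok ? 0 : 1;
}

int main() {
    int rc = copy_demo();
    std::printf(rc == 0 ? "transfers OK\n" : "transfers FAILED\n");
    return rc;
}
```

Within one stream, operations still run in the order they were issued, so the second copy never starts before the first has finished.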


Figure: what streams look like on a timeline


As the figure shows, a problem can be partitioned into smaller pieces, each of which needs its own calls and works on a smaller chunk of data. These pieces are scheduled on different streams, so that the data transfers of one piece overlap with the computation of another.

Performance hints: Overlapping communication with computation is one of the most powerful optimisations available in CUDA. However, how much it helps depends on your application.

To achieve parallelism across streams, the following conditions must be met:

  • The host memory involved must be pinned (page-locked) memory, for example allocated with cudaMallocHost(), or Unified Virtual Addressing memory
  • The work must be issued to multiple streams other than the default stream (stream 0)
  • The kernels must not occupy the entire GPU (compute units, memory, registers, etc.)
  • Asynchronous calls, such as cudaMemcpyAsync(), must be used
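The conditions above, together with the partitioning into chunks described earlier, can be sketched as a small pipeline. This is only an illustration under stated assumptions: the kernel scale, the function pipeline_demo, the CHECK macro, the number of streams, and the chunk size are all hypothetical choices made for the example. Whether the transfers and kernels actually overlap depends on the hardware and on the application.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: print any CUDA runtime error and make the demo fail.
#define CHECK(call)                                                        \
    do {                                                                   \
        cudaError_t err_ = (call);                                         \
        if (err_ != cudaSuccess) {                                         \
            std::fprintf(stderr, "CUDA error: %s\n",                       \
                         cudaGetErrorString(err_));                        \
            return 1;                                                      \
        }                                                                  \
    } while (0)

// Illustrative kernel: multiply every element of a chunk by a factor.
__global__ void scale(float* data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

// Returns 0 if every element was doubled, 1 otherwise.
int pipeline_demo() {
    const int nStreams = 4;                // illustrative values
    const int chunk = 1 << 18;
    const int n = nStreams * chunk;
    const size_t chunkBytes = chunk * sizeof(float);

    float *h = nullptr, *d = nullptr;
    CHECK(cudaMallocHost((void**)&h, n * sizeof(float)));  // pinned memory
    CHECK(cudaMalloc((void**)&d, n * sizeof(float)));
    for (int i = 0; i < n; ++i) h[i] = float(i);

    cudaStream_t streams[nStreams];        // non-default streams
    for (int s = 0; s < nStreams; ++s) CHECK(cudaStreamCreate(&streams[s]));

    // Each chunk gets its own copy-in, kernel, and copy-out on one stream.
    for (int s = 0; s < nStreams; ++s) {
        const int off = s * chunk;
        CHECK(cudaMemcpyAsync(d + off, h + off, chunkBytes,
                              cudaMemcpyHostToDevice, streams[s]));
        scale<<<(chunk + 255) / 256, 256, 0, streams[s]>>>(d + off, chunk, 2.0f);
        CHECK(cudaGetLastError());
        CHECK(cudaMemcpyAsync(h + off, d + off, chunkBytes,
                              cudaMemcpyDeviceToHost, streams[s]));
    }

    // Wait for each stream before touching its results on the host.
    for (int s = 0; s < nStreams; ++s) {
        CHECK(cudaStreamSynchronize(streams[s]));
        CHECK(cudaStreamDestroy(streams[s]));
    }

    bool ok = true;
    for (int i = 0; i < n; ++i) ok = ok && (h[i] == 2.0f * float(i));

    CHECK(cudaFree(d));
    CHECK(cudaFreeHost(h));
    return ok ? 0 : 1;
}

int main() {
    int rc = pipeline_demo();
    std::printf(rc == 0 ? "pipeline OK\n" : "pipeline FAILED\n");
    return rc;
}
```

A profiler timeline is the usual way to confirm that the copies of one chunk really run alongside the kernel of another.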

Additionally, threads, streams, and the whole device can be synchronised through the following calls:

  • __syncthreads(): a block-level synchronisation barrier in device code. Each thread of a block waits until all threads of that block have reached the barrier. It does not block threads from other blocks.
  • cudaStreamSynchronize(stream): blocks the host until all work previously issued to the given stream has finished.
  • cudaDeviceSynchronize(): blocks the host until all work previously issued to the device, in every stream, has finished.
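The three calls can be seen together in a small per-block sum. This is a sketch under stated assumptions: the kernel block_sum, the function sync_demo, the CHECK macro, and the sizes are hypothetical names and values chosen for the example. The __syncthreads() barriers here are needed for correctness, because threads read shared-memory values written by other threads of the same block.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: print any CUDA runtime error and make the demo fail.
#define CHECK(call)                                                        \
    do {                                                                   \
        cudaError_t err_ = (call);                                         \
        if (err_ != cudaSuccess) {                                         \
            std::fprintf(stderr, "CUDA error: %s\n",                       \
                         cudaGetErrorString(err_));                        \
            return 1;                                                      \
        }                                                                  \
    } while (0)

const int kThreads = 256;  // threads per block (power of two)

// Illustrative kernel: each block sums its slice of the input.
__global__ void block_sum(const int* in, int* out, int n) {
    __shared__ int tile[kThreads];
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;
    tile[tid] = (i < n) ? in[i] : 0;
    __syncthreads();  // all loads into shared memory are now visible
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride) tile[tid] += tile[tid + stride];
        __syncthreads();  // block-level barrier between reduction steps
    }
    if (tid == 0) out[blockIdx.x] = tile[0];
}

// Returns 0 if the partial sums add up to n, 1 otherwise.
int sync_demo() {
    const int blocks = 4;
    const int n = blocks * kThreads;

    int *h_in = nullptr, *h_out = nullptr, *d_in = nullptr, *d_out = nullptr;
    CHECK(cudaMallocHost((void**)&h_in, n * sizeof(int)));
    CHECK(cudaMallocHost((void**)&h_out, blocks * sizeof(int)));
    CHECK(cudaMalloc((void**)&d_in, n * sizeof(int)));
    CHECK(cudaMalloc((void**)&d_out, blocks * sizeof(int)));
    for (int i = 0; i < n; ++i) h_in[i] = 1;

    cudaStream_t stream;
    CHECK(cudaStreamCreate(&stream));
    CHECK(cudaMemcpyAsync(d_in, h_in, n * sizeof(int),
                          cudaMemcpyHostToDevice, stream));
    block_sum<<<blocks, kThreads, 0, stream>>>(d_in, d_out, n);
    CHECK(cudaGetLastError());
    CHECK(cudaMemcpyAsync(h_out, d_out, blocks * sizeof(int),
                          cudaMemcpyDeviceToHost, stream));

    // Stream-level: the host waits only for the work queued on this stream.
    CHECK(cudaStreamSynchronize(stream));
    int total = 0;
    for (int b = 0; b < blocks; ++b) total += h_out[b];

    // Device-level: the host waits for all work in every stream.
    CHECK(cudaDeviceSynchronize());

    CHECK(cudaStreamDestroy(stream));
    CHECK(cudaFree(d_in));
    CHECK(cudaFree(d_out));
    CHECK(cudaFreeHost(h_in));
    CHECK(cudaFreeHost(h_out));
    return total == n ? 0 : 1;
}

int main() {
    int rc = sync_demo();
    std::printf(rc == 0 ? "sync OK\n" : "sync FAILED\n");
    return rc;
}
```

In this sketch the device-wide call is only there to show the broadest scope; the stream-level wait alone is enough to make the results safe to read.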

Use these synchronisation barriers carefully, and choose the narrowest one that meets your needs.

Performance tips: Avoid unnecessary synchronisation within kernels (__syncthreads()) and avoid device-wide synchronisation. Both carry large penalties in multi-stream executions.

