Workload offloading

Previous: Optimisation Recipes/Coarse optimisations Index Next: Optimisation Recipes/Coarse optimisations/Problem size





GPU bound: offloading to a specialised device

Is it running on the right device? This is perhaps the first question you should ask before starting. To answer it, you can use the following criteria:

  • Experience: from what you have observed in other applications.
  • Looking at the code: it is often better to inspect the code and how it runs. Perhaps it should run well on that device but is wrongly implemented.
  • Availability: which accelerators are available for the task.

The CPU turns out to be the most suitable architecture for most applications, including both highly serial and highly parallel ones. The inflection point comes when more speed is required and the wall time no longer meets your needs. In the case of NVIDIA Jetson, the table below suggests which device to choose depending on your application. The criteria may change depending on your requirements and context:


Device/Task                                  CPU  GPU  NVDLA  PVA  VIC
Serial algorithms with accumulators          Yes  No   No     No   No
Parallel algorithms with data dependencies   Yes  *    No     No   No
Embarrassingly parallel algorithms           *    Yes  No     No   No
Image processing (non-classical)             *    Yes  No     *    No
Image processing (classical)                 *    *    No     Yes  Yes
Deep Learning                                *    Yes  Yes    No   No

Yes = well suited; No = not suitable; * = possible, but usually not the best fit.


The table presents several architectures and common applications in RidgeRun's world. The architectures can be classified as:

  • Serial/Parallel General-Purpose: CPU
  • Parallel General-Purpose: GPU
  • Specialised Units: NVDLA, VIC, PVA

The NVIDIA Deep Learning Accelerator[1] (NVDLA) is a specialised unit for Deep Learning acceleration, performing fast fixed-point matrix-matrix operations and convolutions (including efficient convolution algorithms). The Video Image Compositor (VIC) implements various 2D image and video operations in a power-efficient manner: it handles system UI scaling, blending, and rotation operations, video post-processing functions needed during video playback, and advanced de-noising functions used for camera capture. The Programmable Vision Accelerator (PVA) is a processor in NVIDIA Jetson devices specialised for image processing and computer vision algorithms. You can seamlessly use the Jetson accelerators through the Vision Programming Interface[2] (VPI) library, which lets you run the same code on the VIC, GPU, CPU, and PVA.
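
As a concrete illustration of that portability, below is a minimal sketch assuming the VPI 2.x C API (signatures may vary slightly between VPI releases, and error checking of the returned VPIStatus codes is omitted for brevity). It downscales a 1080p NV12 frame to 540p on the VIC, but the identical code runs on another device just by changing the backend flag:

 #include <vpi/Image.h>
 #include <vpi/Stream.h>
 #include <vpi/algo/Rescale.h>
 
 int main()
 {
     VPIStream stream = nullptr;
     vpiStreamCreate(0, &stream);
 
     // 1080p NV12 input downscaled to 540p; NV12 is a format the VIC handles natively.
     VPIImage input = nullptr, output = nullptr;
     vpiImageCreate(1920, 1080, VPI_IMAGE_FORMAT_NV12, 0, &input);
     vpiImageCreate(960, 540, VPI_IMAGE_FORMAT_NV12, 0, &output);
 
     // Offload to the VIC; swap VPI_BACKEND_VIC for VPI_BACKEND_CUDA or
     // VPI_BACKEND_CPU to run the identical operation on another device.
     vpiSubmitRescale(stream, VPI_BACKEND_VIC, input, output,
                      VPI_INTERP_LINEAR, VPI_BORDER_ZERO, 0);
     vpiStreamSync(stream); // wait for the chosen backend to finish
 
     vpiImageDestroy(output);
     vpiImageDestroy(input);
     vpiStreamDestroy(stream);
     return 0;
 }

Because only the backend flag changes, it is straightforward to benchmark the same algorithm on each device and measure where it runs best.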

The rule of thumb when offloading (accelerating) is to place each algorithm where it is most suitable. For example, if the algorithm performs some sort of image conversion, it can be accelerated on the VIC, since the VIC is specialised in this kind of task and can outperform the GPU several times over. Note, however, that the accelerators generally run at lower clock rates than the GPU[3], which leads to a trade-off when offloading parts of the code to make them run faster.

Another rule of thumb is to avoid implementing algorithms with accumulators or high data dependency on highly parallel units. Although it is possible to offload them, they might underperform with respect to the CPU, given that they require hard synchronisation barriers. For those tasks, prefer the more general-purpose devices, or try to partition the problem in such a way that multi-threading can be used, i.e. pipelining (see the sketch below). Of course, this depends on how severe the data dependency is.
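
As a hypothetical sketch of such partitioning, the plain C++ pipeline below overlaps the serial accumulator with the parallel-friendly stage that feeds it, instead of running one after the other (the chunk size and the squared-sum work are illustrative placeholders):

 #include <condition_variable>
 #include <cstddef>
 #include <cstdio>
 #include <mutex>
 #include <queue>
 #include <thread>
 #include <vector>
 
 int main()
 {
     std::vector<float> data(1 << 20, 1.0f);
     const std::size_t kChunk = 1 << 14; // illustrative block size
 
     std::queue<float> partials; // partial sums handed from stage 1 to stage 2
     std::mutex m;
     std::condition_variable cv;
     bool done = false;
 
     // Stage 1: the embarrassingly parallel part (per-element transform plus a
     // per-chunk partial sum). This is the piece worth offloading.
     std::thread producer([&] {
         for (std::size_t i = 0; i < data.size(); i += kChunk) {
             float partial = 0.0f;
             for (std::size_t j = i; j < i + kChunk; ++j)
                 partial += data[j] * data[j]; // stand-in for real work
             {
                 std::lock_guard<std::mutex> lock(m);
                 partials.push(partial);
             }
             cv.notify_one();
         }
         {
             std::lock_guard<std::mutex> lock(m);
             done = true;
         }
         cv.notify_one();
     });
 
     // Stage 2: the serial accumulator with its data dependency, overlapped
     // with stage 1 rather than waiting for all chunks to be produced.
     double total = 0.0;
     {
         std::unique_lock<std::mutex> lock(m);
         for (;;) {
             cv.wait(lock, [&] { return !partials.empty() || done; });
             while (!partials.empty()) {
                 total += partials.front();
                 partials.pop();
             }
             if (done)
                 break;
         }
     }
 
     producer.join();
     std::printf("sum = %f\n", total);
     return 0;
 }

The same structure applies when stage 1 runs on the GPU or an accelerator: the serial reduction stays on the CPU, but it no longer dominates the wall time because it consumes results as they arrive.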

Performance hint: prefer specialised units over general-purpose ones.

Therefore, the first optimisation is:

Offload the algorithm to the most suitable device. Give the NVIDIA Jetson Multimedia API, VPI, and similar libraries a try.

References


Previous: Optimisation Recipes/Coarse optimisations Index Next: Optimisation Recipes/Coarse optimisations/Problem size