NVIDIA CUDA Optimisation Examples

Stitching

Problem statement

This case involves running a stitching algorithm on frames captured from five cameras at 30 fps. The pipeline includes gain and gamma correction algorithms and a colour space conversion from NV12 to RGB and back. The algorithm has three kernels:

Conversion NV12 - RGB + Colour Gain
Gamma Correction
Conversion back to NV12
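
To make the first stage concrete, here is a minimal sketch of an NV12 to RGBA conversion with a per-channel colour gain. It is not the original kernel. The BT.601 limited-range coefficients, the tightly packed planes (pitch equal to width), the even frame dimensions and the names nv12_to_rgba_gain and convert_nv12 are all assumptions made for illustration.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Clamp to [0, 255] and round to the nearest 8-bit value.
__device__ unsigned char to_u8(float v)
{
    return static_cast<unsigned char>(fminf(fmaxf(v, 0.0f), 255.0f) + 0.5f);
}

// One thread per output pixel. Assumes even width/height, tightly packed
// planes (pitch == width) and BT.601 limited-range coefficients.
__global__ void nv12_to_rgba_gain(const unsigned char* y_plane,
                                  const unsigned char* uv_plane,
                                  uchar4* rgba, int width, int height, float3 gain)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;

    // Each 2x2 block of luma samples shares one interleaved (U, V) pair.
    int uv = (y / 2) * width + (x / 2) * 2;
    float Y = y_plane[y * width + x] - 16.0f;
    float U = uv_plane[uv] - 128.0f;
    float V = uv_plane[uv + 1] - 128.0f;

    float r = 1.164f * Y + 1.596f * V;
    float g = 1.164f * Y - 0.392f * U - 0.813f * V;
    float b = 1.164f * Y + 2.017f * U;

    // The colour gain is applied per channel before clamping to 8 bits.
    rgba[y * width + x] = make_uchar4(to_u8(r * gain.x), to_u8(g * gain.y),
                                      to_u8(b * gain.z), 255);
}

// Hypothetical host helper: uploads an NV12 frame, converts it, downloads RGBA.
std::vector<uchar4> convert_nv12(const std::vector<unsigned char>& nv12,
                                 int width, int height, float3 gain)
{
    size_t luma = static_cast<size_t>(width) * height;
    std::vector<uchar4> out(luma);
    unsigned char* d_nv12 = nullptr;
    uchar4* d_rgba = nullptr;
    cudaMalloc(&d_nv12, nv12.size());
    cudaMalloc(&d_rgba, luma * sizeof(uchar4));
    cudaMemcpy(d_nv12, nv12.data(), nv12.size(), cudaMemcpyHostToDevice);

    dim3 block(16, 16);
    dim3 grid((width + block.x - 1) / block.x, (height + block.y - 1) / block.y);
    // The UV plane starts right after the luma plane in this packed layout.
    nv12_to_rgba_gain<<<grid, block>>>(d_nv12, d_nv12 + luma, d_rgba,
                                       width, height, gain);

    cudaMemcpy(out.data(), d_rgba, luma * sizeof(uchar4), cudaMemcpyDeviceToHost);
    cudaFree(d_nv12);
    cudaFree(d_rgba);
    return out;
}

int main()
{
    // 2x2 frame: four luma samples followed by one (U, V) pair.
    std::vector<unsigned char> nv12 = {16, 235, 16, 235, 128, 128};
    std::vector<uchar4> rgba = convert_nv12(nv12, 2, 2, make_float3(1.0f, 1.0f, 1.0f));
    for (const uchar4& p : rgba)
        printf("R=%d G=%d B=%d A=%d\n", p.x, p.y, p.z, p.w);
    return 0;
}
```

Error checking is left out to keep the sketch short. A real pipeline would check the CUDA return codes and honour the pitch reported by the capture buffers.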

When running five kernels in parallel, GPU consumption is similar to the picture presented for Nsight, with a GPU utilisation of ~95%.

Optimisation

From that scenario, the optimisations performed were:

Coarse:

Offload the RGBA -> NV12 colourspace conversion: use the VIC (Jetson Multimedia API).
Remove an unnecessary intermediate buffer and its computation: use ARGB instead.
Inline the gamma correction to the gain correction kernel: merge the kernels.
Make the computation in place: avoid extra memory storage.
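
As an illustration of the kernel merge and the in-place computation, the sketch below applies gain and gamma in a single pass over a packed ARGB buffer and writes the result back over the input. It is only a sketch under stated assumptions: the byte order (A, R, G, B), the use of powf for gamma and the names gain_gamma_inplace and apply_inplace are hypothetical and do not come from the original implementation.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Gain followed by gamma on one 8-bit channel. Assumes gain >= 0.
__device__ unsigned char gain_gamma(unsigned char c, float gain, float inv_gamma)
{
    float v = fminf(c * gain * (1.0f / 255.0f), 1.0f);  // normalise and clamp
    return static_cast<unsigned char>(powf(v, inv_gamma) * 255.0f + 0.5f);
}

// Fused kernel: one read and one write per pixel, no intermediate buffer.
// Assumed byte order per pixel: x = A, y = R, z = G, w = B.
__global__ void gain_gamma_inplace(uchar4* pixels, int n, float3 gain, float inv_gamma)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    uchar4 p = pixels[i];
    p.y = gain_gamma(p.y, gain.x, inv_gamma);
    p.z = gain_gamma(p.z, gain.y, inv_gamma);
    p.w = gain_gamma(p.w, gain.z, inv_gamma);
    pixels[i] = p;  // overwrite the input; alpha is left untouched
}

// Hypothetical host helper: uploads a frame, corrects it in place, downloads it.
std::vector<uchar4> apply_inplace(std::vector<uchar4> frame, float3 gain, float gamma)
{
    int n = static_cast<int>(frame.size());
    size_t bytes = frame.size() * sizeof(uchar4);
    uchar4* d_pixels = nullptr;
    cudaMalloc(&d_pixels, bytes);
    cudaMemcpy(d_pixels, frame.data(), bytes, cudaMemcpyHostToDevice);

    gain_gamma_inplace<<<(n + 255) / 256, 256>>>(d_pixels, n, gain, 1.0f / gamma);

    cudaMemcpy(frame.data(), d_pixels, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_pixels);
    return frame;
}

int main()
{
    std::vector<uchar4> frame = {make_uchar4(255, 100, 200, 50)};
    frame = apply_inplace(frame, make_float3(1.2f, 1.0f, 0.9f), 2.2f);
    printf("A=%d R=%d G=%d B=%d\n", frame[0].x, frame[0].y, frame[0].z, frame[0].w);
    return 0;
}
```

Compared with separate gain and gamma kernels, the fused version launches once and touches global memory once per pixel in each direction, which is where the saving from these two steps would be expected to come from.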

Fine:

Use a LUT to perform polynomial interpolation for the gamma correction.
Remove unneeded branching where possible by using the ternary operator.
Replace single-precision floating-point arithmetic with half-precision floating-point.
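
The three fine-grained steps can be sketched together in one small kernel: the gain is multiplied in half precision, the clamp uses ternary expressions instead of if statements, and gamma becomes a table lookup. This is a hedged illustration, not the original code. The table here is simply filled with std::pow on the host (the source does not say how its LUT was populated), the gain is assumed to be non-negative, the GPU is assumed to have native FP16 arithmetic (compute capability 5.3 or newer), and the names build_gamma_lut, gain_gamma_lut and apply_gain_gamma_lut are hypothetical.

```cuda
#include <cmath>
#include <cstdio>
#include <vector>
#include <cuda_fp16.h>
#include <cuda_runtime.h>

// Build a 256-entry gamma table on the host (assumption: filled with std::pow).
std::vector<unsigned char> build_gamma_lut(float gamma)
{
    std::vector<unsigned char> lut(256);
    for (int i = 0; i < 256; ++i)
        lut[i] = static_cast<unsigned char>(
            std::pow(i / 255.0f, 1.0f / gamma) * 255.0f + 0.5f);
    return lut;
}

// Gain in half precision, clamp with ternaries, gamma through the table.
__global__ void gain_gamma_lut(unsigned char* data, int n, __half gain,
                               const unsigned char* __restrict__ lut)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;  // bounds check only

    const __half zero = __float2half(0.0f);
    const __half max_val = __float2half(255.0f);

    __half v = __hmul(__int2half_rn(data[i]), gain);
    v = __hgt(v, max_val) ? max_val : v;  // clamp high without an if statement
    v = __hlt(v, zero) ? zero : v;        // clamp low, guards the table index
    data[i] = lut[__half2int_rn(v)];      // gamma via lookup, in place
}

// Hypothetical host helper operating on a flat array of 8-bit channel values.
std::vector<unsigned char> apply_gain_gamma_lut(std::vector<unsigned char> data,
                                                float gain, float gamma)
{
    std::vector<unsigned char> lut = build_gamma_lut(gamma);
    int n = static_cast<int>(data.size());
    unsigned char* d_data = nullptr;
    unsigned char* d_lut = nullptr;
    cudaMalloc(&d_data, data.size());
    cudaMalloc(&d_lut, lut.size());
    cudaMemcpy(d_data, data.data(), data.size(), cudaMemcpyHostToDevice);
    cudaMemcpy(d_lut, lut.data(), lut.size(), cudaMemcpyHostToDevice);

    gain_gamma_lut<<<(n + 255) / 256, 256>>>(d_data, n, __float2half(gain), d_lut);

    cudaMemcpy(data.data(), d_data, data.size(), cudaMemcpyDeviceToHost);
    cudaFree(d_data);
    cudaFree(d_lut);
    return data;
}

int main()
{
    std::vector<unsigned char> out = apply_gain_gamma_lut({0, 64, 128, 255}, 1.5f, 2.2f);
    for (unsigned char v : out)
        printf("%d\n", v);
    return 0;
}
```

Whether a ternary actually removes a branch depends on the compiler, which often emits a select instruction for short expressions like these, so it is worth confirming in the generated SASS. Half precision keeps 8-bit integers exact, but products above 2048 start losing integer precision, which is one reason the clamp happens right after the multiply.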