NVIDIA CUDA Optimisation Examples

From RidgeRun Developer Wiki



Previous: Optimisation Recipes/Common pitfalls when optimising Index Next: Contact_Us






Stitching

Problem statement


This case implies running a stitching algorithm, capturing frames from five cameras at 30 fps. Within the pipeline, there are gain and gamma correction algorithms and a colour space conversion to NV12-RGB back and forth. The algorithm has three kernels:

  • Conversion NV12 - RGB + Colour Gain
  • Gamma Correction
  • Conversion back to NV12

When running five kernels in parallel, the consumption is similar to the picture presented for Nsight, with a GPU occupation of ~95%.

Optimisation


From that scenario, the optimisations performed were:

Coarse:

  • Offload the RGBA -> NV12 colourspace conversion: use the VIC (Jetson Multimedia API).
  • Remove an unnecessary intermediate buffer and its computation: use ARGB instead.
  • Inline the gamma correction to the gain correction kernel: merge the kernels.
  • Make the computation in place: avoiding memory storage.

Fine:

  • Use a LUT to perform polynomial interpolation for the gamma correction.
  • Remove unneeded branching when possible by using the ternary operator
  • Replace the Single-Precision Floating-Point numeric representation for computation with Half-Precision Floating-Point.

Results


  • Baseline: 1x
  • Coarse optimisations: 9x
  • Fine optimisations: 10x (coarse + fine)

So, the frame rate scaled considerably, from 10 fps to ~45 fps per camera.


Previous: Optimisation Recipes/Common pitfalls when optimising Index Next: Contact_Us