NVIDIA CUDA Optimisation Examples
RidgeRun CUDA Optimisation Guide
Stitching
Problem statement
This case involves running a stitching algorithm on frames captured from five cameras at 30 fps. The pipeline applies gain and gamma correction, and converts the colour space from NV12 to RGB and back. The algorithm has three kernels:
- NV12 to RGB conversion plus colour gain
- Gamma correction
- Conversion back to NV12
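To make the first kernel concrete, here is a minimal sketch of an NV12 to RGB conversion with a gain applied in the same pass. It is not the original implementation: the full-range BT.601 coefficients, the single `gain` factor, the unpadded rows (pitch equal to `width`), the `Rgb` struct and all function names are assumptions made for illustration. The CUDA kernel is only compiled under `nvcc`; the per-pixel function also builds with a plain C++ compiler.

```cpp
#include <cassert>
#include <stddef.h>
#include <stdint.h>

#ifdef __CUDACC__
#define HD __host__ __device__
#else
#define HD
#endif

// Hypothetical output pixel type.
struct Rgb { uint8_t r, g, b; };

// Saturate to [0, 255] and round to the nearest integer.
HD inline uint8_t to_u8(float v) {
  v = v < 0.0f ? 0.0f : v;
  v = v > 255.0f ? 255.0f : v;
  return static_cast<uint8_t>(v + 0.5f);
}

// NV12 stores a full-resolution Y plane followed by a half-resolution plane
// of interleaved U,V pairs (one pair per 2x2 block of pixels).
// Assumptions: rows are not padded, coefficients are full-range BT.601.
HD inline Rgb nv12_to_rgb_gain(const uint8_t* y_plane, const uint8_t* uv_plane,
                               int width, int x, int y, float gain) {
  float luma = y_plane[y * width + x];
  const uint8_t* uv = uv_plane + (y / 2) * width + (x / 2) * 2;
  float u = uv[0] - 128.0f;
  float v = uv[1] - 128.0f;
  Rgb out;
  out.r = to_u8(gain * (luma + 1.402f * v));
  out.g = to_u8(gain * (luma - 0.344136f * u - 0.714136f * v));
  out.b = to_u8(gain * (luma + 1.772f * u));
  return out;
}

#ifdef __CUDACC__
// One thread per output pixel, launched on a 2D grid.
__global__ void nv12_to_rgb_kernel(const uint8_t* y_plane,
                                   const uint8_t* uv_plane, Rgb* out,
                                   int width, int height, float gain) {
  int x = blockIdx.x * blockDim.x + threadIdx.x;
  int y = blockIdx.y * blockDim.y + threadIdx.y;
  if (x < width && y < height) {
    out[y * width + x] = nv12_to_rgb_gain(y_plane, uv_plane, width, x, y, gain);
  }
}
#endif
```

In this layout every thread reads one luma sample and shares its chroma pair with three neighbouring pixels, which is why the chroma index uses `y / 2` and `x / 2`.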
With five kernels running in parallel, the resource consumption is similar to the picture presented for Nsight, with a GPU usage of ~95%.
Optimisation
Starting from that scenario, the following optimisations were performed:
Coarse:
- Offload the RGBA -> NV12 colour space conversion: use the VIC (Video Image Compositor) through the Jetson Multimedia API.
- Remove an unnecessary intermediate buffer and its computation: use ARGB instead.
- Inline the gamma correction into the gain correction kernel: merge the two kernels.
- Perform the computation in place: avoid extra memory storage.
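As a rough illustration of the kernel merge and in-place points, the sketch below applies gain and gamma to a packed ARGB frame in a single pass, so the intermediate value stays in a register and no second buffer is written. The `Argb` struct, the single `gain` factor, `inv_gamma` (the reciprocal of gamma) and all function names are assumptions for this example, not the original code. The CUDA kernel is only compiled under `nvcc`; the host loop is a CPU reference that reuses the same per-pixel function.

```cpp
#include <cassert>
#include <math.h>
#include <stddef.h>
#include <stdint.h>

#ifdef __CUDACC__
#define HD __host__ __device__
#else
#define HD
#endif

// Hypothetical packed pixel layout, one byte per channel.
struct Argb { uint8_t a, r, g, b; };

// Gain, then gamma, on one 8-bit channel. The value between the two steps
// lives in a register instead of an intermediate buffer.
HD inline uint8_t gain_gamma(uint8_t v, float gain, float inv_gamma) {
  float x = fminf(v * gain / 255.0f, 1.0f);  // apply gain, saturate at 1
  return static_cast<uint8_t>(powf(x, inv_gamma) * 255.0f + 0.5f);
}

// Overwrites the colour channels of one pixel; alpha is left untouched.
HD inline void correct_pixel(Argb* px, float gain, float inv_gamma) {
  px->r = gain_gamma(px->r, gain, inv_gamma);
  px->g = gain_gamma(px->g, gain, inv_gamma);
  px->b = gain_gamma(px->b, gain, inv_gamma);
}

#ifdef __CUDACC__
// Fused kernel: one thread per pixel, reading and writing the same buffer.
__global__ void gain_gamma_kernel(Argb* frame, size_t n, float gain,
                                  float inv_gamma) {
  size_t i = blockIdx.x * static_cast<size_t>(blockDim.x) + threadIdx.x;
  if (i < n) correct_pixel(&frame[i], gain, inv_gamma);
}
#endif

// CPU reference with the same per-pixel logic, useful for checking results.
inline void gain_gamma_host(Argb* frame, size_t n, float gain,
                            float inv_gamma) {
  assert(gain >= 0.0f && inv_gamma > 0.0f);
  for (size_t i = 0; i < n; ++i) correct_pixel(&frame[i], gain, inv_gamma);
}
```

Compared with two separate kernels, the fused version reads and writes each pixel once, which removes one full pass over global memory per frame.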
Fine:
- Use a look-up table (LUT) with polynomial interpolation for the gamma correction.
- Remove unneeded branching where possible by using the ternary operator.
- Replace single-precision floating point (FP32) with half-precision floating point (FP16) in the computation.
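The first two fine optimisations can be sketched as follows. This is an assumed variant, not the original code: the table has a hypothetical 33 nodes, the interpolation is linear (the simplest polynomial case), and the names `build_gamma_lut`, `gamma_lut` and `clamp01` are made up for the example. Half-precision arithmetic is left out so that the sketch also builds with a plain C++ compiler; the CUDA kernel is only compiled under `nvcc`.

```cpp
#include <cassert>
#include <math.h>
#include <stddef.h>
#include <stdint.h>

#ifdef __CUDACC__
#define HD __host__ __device__
#else
#define HD
#endif

constexpr int kLutSize = 33;  // hypothetical number of table nodes

// Host side: sample x^(1/gamma) at evenly spaced nodes on [0, 1].
inline void build_gamma_lut(float* lut, float gamma) {
  assert(gamma > 0.0f);
  for (int i = 0; i < kLutSize; ++i) {
    lut[i] = powf(static_cast<float>(i) / (kLutSize - 1), 1.0f / gamma);
  }
}

// Clamp written with ternaries, which compilers usually lower to select
// instructions rather than divergent branches.
HD inline float clamp01(float x) {
  x = x < 0.0f ? 0.0f : x;
  return x > 1.0f ? 1.0f : x;
}

// Gamma through the table: find the segment, then interpolate linearly.
HD inline float gamma_lut(const float* lut, float x) {
  float pos = clamp01(x) * (kLutSize - 1);
  int i = static_cast<int>(pos);
  i = i > kLutSize - 2 ? kLutSize - 2 : i;  // keep i + 1 inside the table
  float t = pos - i;
  return lut[i] + t * (lut[i + 1] - lut[i]);
}

#ifdef __CUDACC__
// In-place gamma on normalised samples; lut must point to device memory.
__global__ void gamma_lut_kernel(float* data, size_t n, const float* lut) {
  size_t i = blockIdx.x * static_cast<size_t>(blockDim.x) + threadIdx.x;
  if (i < n) data[i] = gamma_lut(lut, data[i]);
}
#endif
```

The table replaces a `powf` call per sample with two loads and a multiply-add. A uniformly spaced table is least accurate near zero, where the gamma curve is steepest, so the node count or spacing would need tuning against the image quality target.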
Results
- Baseline: 1x
- Coarse optimisations: 9x
- Coarse + fine optimisations: 10x
As a result, the frame rate increased considerably, from 10 fps to ~45 fps per camera.