RidgeRun NVIDIA PVA Development Algorithms

From RidgeRun Developer Wiki


PVA Algorithms from LibPVA

RidgeRun has implemented the following image processing algorithms on the PVA. These are foundational for image signal processing (ISP) pipelines and optimized for high efficiency.


Info
These algorithms are currently provided for performance evaluation only and are not intended for production use. Stay tuned for more!


All measurements were taken under the following conditions:

  • Platform: Jetson AGX Orin 32GB
  • OS: JetPack 6.2
  • Power Profile: MAXN power mode + Jetson Clocks
  • CPU: All measurements use aggressive compiler optimization flags and OpenMP. Introducing NEON might halve the execution times.
  • PVA: All measurements use a single VPS (half of the PVA)
  • Power Measurements: Taken with jetson-stats (a tool based on tegrastats) using the VDD_CPU_CV power meter probe.

The profiling details are:

  • Execution time CPU (ms): using dual ARM core execution
  • Execution time PVA (ms): using dual VPU PVA execution
  • Power Consumption CPU only (W): using a number of cores such that the execution time of the CPU is nearly the same as the PVA.
  • Power Consumption PVA only (W): using dual VPU PVA execution

Bit Shifting (Debayering Resolution Downscaling)

This technique reduces the bit depth of pixel values by right-shifting them during debayering. It’s useful for optimizing bandwidth or matching the bit depth required by downstream stages.

Average performance measurements are shown in the following table for the most common resolutions. Measurements are shown for an optimized implementation of the algorithm, and all results are in milliseconds. Additionally, power consumption measurements are shown in watts. A shift of 10 bits was used for the benchmarks. Performance measurements can also be observed in the attached graph.

Bit Shifting execution time and power consumption. The execution time shows the runtime to transform a 16-bit image into an 8-bit image using one CPU core and two VPUs. The power consumption values are iso-latency measurements, where the CPU uses six cores to match the PVA latency.
| Resolution | Execution time CPU (ms) | Execution time PVA (ms) | Power consumption CPU only (W) | Power consumption PVA (W) |
|---|---|---|---|---|
| 1280x720 | 0.309 | 0.04865 | 8.75 | 3.21 |
| 1920x1080 | 0.675 | 0.10678 | 9.14 | 3.27 |
| 3840x2160 | 2.51 | 0.4061 | 9.54 | 3.24 |
Fig 1. Bit shifting execution time. The execution time shows the runtime to complete a transformation on a 16-bit image to an 8-bit image using one CPU core and two VPUs.


The benchmark converts a single-channel image from 16-bit to 8-bit. Matching the PVA latency requires six ARM cores.
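The conversion described above can be illustrated with a minimal pure-Python sketch. It is not RidgeRun's PVA kernel: the function name `shift_to_8bit`, the saturation step, and the sample data are assumptions made for illustration. It right-shifts each 16-bit sample by a configurable number of bits and clamps the result to the 8-bit range.

```python
def shift_to_8bit(pixels, shift=8):
    """Right-shift unsigned 16-bit samples and saturate the result to 8 bits."""
    if not 0 <= shift <= 16:
        raise ValueError("shift must be between 0 and 16")
    # Saturation (an assumption of this sketch) guards shifts smaller than 8.
    return bytes(min(value >> shift, 255) for value in pixels)


# One row of hypothetical 16-bit samples.
row = [0, 255, 256, 1024, 32768, 65535]
print(list(shift_to_8bit(row, shift=8)))   # [0, 0, 1, 4, 128, 255]
print(list(shift_to_8bit(row, shift=10)))  # [0, 0, 0, 1, 32, 63]
```

With the 10-bit shift used in the benchmarks, the full 16-bit input range maps to values between 0 and 63, so saturation only matters for shifts smaller than 8.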

Radial Lens Shading Correction

Corrects vignetting or intensity falloff from the center to the edges of an image caused by lens characteristics. It’s implemented using radial correction maps that are efficiently processed on the PVA.

Average performance measurements are shown in the following table for the most common resolutions. Measurements are shown for an optimized implementation of the algorithm, and all results are in milliseconds. Additionally, power consumption measurements are shown in watts. Performance measurements can also be observed in the attached graph.

Radial Lens Shading correction execution time and power consumption. The execution time shows the runtime to process an RGB24 image using an 8-bit fixed-point correction. The execution time measurement uses one CPU core and two VPUs. The power consumption values are iso-latency measurements, where the CPU uses four cores to match the PVA latency.
| Resolution | Execution time CPU (ms) | Execution time PVA (ms) | Power consumption CPU only (W) | Power consumption CPU and PVA (W) |
|---|---|---|---|---|
| 1280x720 | 1.56 | 0.145 | 8.4 | 3.69 |
| 1920x1080 | 3.5 | 0.330 | 7.6 | 3.61 |
| 3840x2160 | 13.8 | 1.402 | 7.2 | 3.57 |
Fig 2. Radial Lens Shading correction execution time. The execution time shows the runtime to process an RGB24 image using an 8-bit fixed-point correction. The execution time measurement uses one CPU core and two VPUs.

The measurements were done with:

  • 8-bit Fixed-point correction maps (including channels)
  • RGB images (RGB24) - 8-bit per channel
  • The CPU requires four ARM cores to match the PVA latency.
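A radial correction with per-channel 8-bit fixed-point maps could look like the following pure-Python sketch. Everything specific here is an assumption for illustration: the fixed-point format (6 fractional bits, so a gain of 1.0 is stored as 64), the quadratic falloff model, the per-channel `strengths`, and the function names. The source does not describe the actual map layout or gain model used on the PVA.

```python
FRACTION_BITS = 6  # assumed format: a gain of 1.0 is stored as 64


def build_gain_map(width, height, strengths=(0.5, 0.5, 0.5)):
    """Return 8-bit gains (one per channel per pixel) that grow with radius."""
    cx, cy = (width - 1) / 2.0, (height - 1) / 2.0
    max_r2 = (cx * cx + cy * cy) or 1.0  # avoid dividing by zero on 1x1 images
    gains = []
    for y in range(height):
        for x in range(width):
            r2 = ((x - cx) ** 2 + (y - cy) ** 2) / max_r2  # normalized to [0, 1]
            for strength in strengths:
                gain = 1.0 + strength * r2  # hypothetical quadratic model
                gains.append(min(255, round(gain * (1 << FRACTION_BITS))))
    return gains


def correct_rgb24(pixels, gains):
    """Multiply interleaved RGB24 samples by their gains, saturating at 255."""
    if len(pixels) != len(gains):
        raise ValueError("expected one gain per channel sample")
    half = 1 << (FRACTION_BITS - 1)  # rounding term for the fixed-point product
    return bytes(
        min(255, (value * gain + half) >> FRACTION_BITS)
        for value, gain in zip(pixels, gains)
    )


# A flat 3x3 gray image: the center is unchanged, the corners are brightened.
gains = build_gain_map(3, 3)
corrected = correct_rgb24(bytes([100] * 27), gains)
print(corrected[12], corrected[0])  # 100 150
```

Precomputing the map keeps the per-pixel work to one multiply, one add, and one shift, which is the kind of operation that vector hardware handles well.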

Color Space Conversion (RGBA-Gray)

Transforms image data from one color space to another (e.g., RGB to YUV). It’s essential for encoding, display pipelines, and transmission where non-RGB formats are used.

These implementations showcase how RidgeRun leverages the PVA to create real-time, power-efficient vision pipelines suitable for embedded systems under tight performance constraints.

Average performance measurements are shown in the following table for the most common resolutions. Measurements are shown for an optimized version of the algorithm, and all results are in milliseconds. Additionally, power consumption measurements are shown in watts. In the example measurements, an RGBA to Grayscale conversion was performed. Performance measurements can also be observed in the attached graph.

RGBA to Grayscale conversion execution time and power consumption. The execution time shows the runtime to convert an RGB32 image to a GRAY8 image. The execution time measurement uses one CPU core and two VPUs. The power consumption values are iso-latency measurements, where the CPU uses six cores to match the PVA latency.
| Resolution | Execution time CPU (ms) | Execution time PVA (ms) | Power consumption CPU only (W) | Power consumption PVA (W) |
|---|---|---|---|---|
| 1280x720 | 1.36 | 0.085 | 10.35 | 3.97 |
| 1920x1080 | 3.05 | 0.195 | 10.74 | 3.84 |
| 3840x2160 | 12.1 | 0.746 | 10.74 | 3.61 |
Fig 3. RGBA to Grayscale conversion execution time. The execution time shows the runtime to process an RGB32 image to a GRAY8 image. The execution time measurement uses one CPU core and two VPUs.

The images involved:

  • Input: RGBA32 (8-bit per channel, four channels)
  • Output: Gray8 (8-bit single channel)
  • The CPU requires six cores to match the PVA's latency.
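As an illustration of the RGBA32 to Gray8 conversion, the sketch below uses integer weights that approximate the common BT.601 luma coefficients (0.299, 0.587, 0.114). The choice of weights, the rounding, and the function name `rgba_to_gray8` are assumptions. The source does not state which coefficients the PVA implementation uses.

```python
def rgba_to_gray8(rgba):
    """Convert interleaved RGBA32 bytes to Gray8, ignoring the alpha channel."""
    if len(rgba) % 4:
        raise ValueError("RGBA32 data must contain four bytes per pixel")
    gray = bytearray(len(rgba) // 4)
    for i in range(len(gray)):
        r, g, b = rgba[4 * i], rgba[4 * i + 1], rgba[4 * i + 2]
        # 77/256, 150/256 and 29/256 approximate 0.299, 0.587 and 0.114.
        # The weights sum to 256, so the result always fits in 8 bits.
        gray[i] = (77 * r + 150 * g + 29 * b + 128) >> 8
    return bytes(gray)


# White, black, and pure red pixels.
pixels = bytes([255, 255, 255, 255, 0, 0, 0, 255, 255, 0, 0, 255])
print(list(rgba_to_gray8(pixels)))  # [255, 0, 77]
```

Integer weights avoid floating-point arithmetic entirely, which suits fixed-point vector units.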

2D Filtering (Convolution)

Applies a 2D filter using a 5x5 kernel. It can be used for general image filtering and also showcases general 2D convolution performance.

Average performance measurements are shown in the following table for the most common resolutions. Measurements are shown for an optimized version of the algorithm, and all results are in milliseconds. Additionally, better performance can be achieved with further optimization as shown in NVIDIA's PVA Solutions implementation of the convolution. Performance measurements can also be observed in the attached graph.

2D Filter convolution execution time and power consumption. The execution time shows the runtime to process a Gray8 image with a 5x5 kernel. The execution time measurement uses one CPU core and one VPU.
| Resolution | Execution time CPU (ms) | Execution time PVA (ms) | Power consumption CPU only (W) | Power consumption PVA (W) |
|---|---|---|---|---|
| 1280x720 | 52.445 | 1.237 | TBD | TBD |
| 1920x1080 | 115.724 | 2.528 | TBD | TBD |
| 3840x2160 | 456.973 | 9.975 | TBD | TBD |
Fig 4. 2D convolution execution time. The execution time shows the runtime to process a Gray8 image for a 5x5 kernel. The execution time measurement uses one CPU core and one VPU.

The images involved:

  • Input: Gray8 (8-bit single channel)
  • Output: Gray8 (8-bit single channel)
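For reference, a 5x5 filter over a Gray8 image can be sketched in pure Python as follows. The border policy (replicating edge pixels), the integer kernel with a normalizing right shift, and the function name `convolve5x5` are assumptions for illustration, not details of the benchmarked implementation. The kernel is applied without flipping, which is equivalent to convolution for symmetric kernels such as the binomial one in the example.

```python
def convolve5x5(image, width, height, kernel, shift=0):
    """Filter a Gray8 image with a 5x5 integer kernel, replicating borders."""
    if len(kernel) != 25 or len(image) != width * height:
        raise ValueError("expected a 5x5 kernel and width*height pixels")
    out = bytearray(width * height)
    for y in range(height):
        for x in range(width):
            acc = 0
            for ky in range(5):
                sy = min(max(y + ky - 2, 0), height - 1)  # clamp row index
                for kx in range(5):
                    sx = min(max(x + kx - 2, 0), width - 1)  # clamp column index
                    acc += kernel[ky * 5 + kx] * image[sy * width + sx]
            # Normalize with a right shift, then saturate to the Gray8 range.
            out[y * width + x] = min(max(acc >> shift, 0), 255)
    return bytes(out)


# Binomial (Gaussian-like) kernel: the weights sum to 256, so shift by 8.
binomial = [a * b for a in (1, 4, 6, 4, 1) for b in (1, 4, 6, 4, 1)]
flat = bytes([100] * (8 * 6))
assert convolve5x5(flat, 8, 6, binomial, shift=8) == flat  # flat stays flat
```

Each output pixel needs 25 multiply-accumulate operations, which helps explain why this algorithm shows the largest gap between CPU and PVA execution times in the table.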

Final Remarks

From an energy perspective, power consumption may increase in some cases. Nevertheless, energy is the product of power and execution time, so because the PVA is faster in most cases, both the execution time and the overall energy consumption are lower.

Power consumption was measured at the platform level using the jetson-stats Python library.
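To make the energy comparison concrete, the sketch below multiplies power by execution time for the 1920x1080 row of the bit shifting table. It assumes that, under iso-latency, the six-core CPU run takes about as long as the PVA run, and it treats the reported power as constant over the frame. Both are simplifications, and the reported power also includes whatever baseline load the measurement captures. The helper name `energy_mj` is hypothetical.

```python
def energy_mj(power_w, time_ms):
    """Energy in millijoules: watts multiplied by milliseconds."""
    return power_w * time_ms


# 1920x1080 row of the bit shifting table.
pva_time_ms = 0.10678
cpu_power_w, pva_power_w = 9.14, 3.27

# Assumption: the iso-latency CPU run lasts about as long as the PVA run.
cpu_energy = energy_mj(cpu_power_w, pva_time_ms)
pva_energy = energy_mj(pva_power_w, pva_time_ms)
print(f"CPU: {cpu_energy:.3f} mJ, PVA: {pva_energy:.3f} mJ, "
      f"ratio: {cpu_energy / pva_energy:.2f}x")
# CPU: 0.976 mJ, PVA: 0.349 mJ, ratio: 2.80x
```

Under these assumptions the per-frame energy ratio equals the power ratio, since both runs are taken to last the same time.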