Performance of the Holoscan Sensor Bridge











Introduction

To characterize the performance of the Holoscan Sensor Bridge, we quantify the computational resources consumed by the Holoscan software while measuring the glass-to-glass latency.

Currently, the Holoscan Sensor Bridge is compatible with the NVIDIA Jetson AGX Orin and the NVIDIA IGX Orin. This page covers the Jetson AGX Orin.

Jetson AGX Orin Performance

Setup

The Holoscan Sensor Bridge is connected as specified in Holoscan Sensor Bridge/Hardware Connection, using a 10 Gbit/s Ethernet link.

Initial Clarifications

We run the application as described in Holoscan Sensor Bridge/Running the Demo. We use the unaccelerated version of the IMX274 example because the Jetson AGX Orin does not support DPDK[1]. The Holoscan Sensor Bridge and the Jetson communicate through UDP over Ethernet. The results may change dramatically on NVIDIA IGX Orin platforms, since they support NVIDIA ConnectX expansion cards for network communication.

The camera is configured as in the example and delivers 60 fps. The display is a Samsung TV with a 60 Hz refresh rate.

Results

The following results correspond to the baseline pipeline that the Holoscan Sensor Bridge framework provides as an example. The pipeline is shown in the following figure:

Figure: Baseline ISP Pipeline

For experimentation purposes, we have tried different setups and configurations:

Configurations
Abbreviation | Meaning
DP           | DisplayPort
HDMI         | HDMI port
JC           | Jetson Clocks
ED           | Exclusive Display
NED          | Non Exclusive Display
MAXN         | Maximum Power
CISP-GW      | CudaISP GrayWorld
CISP-HS      | CudaISP HistogramStretch
DBOnly       | Only Debayer without ISP
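
The result tables on this page label each run by joining these abbreviations with "+". As a reading aid, the following minimal Python sketch expands such a label into full option names. The `expand_label` helper is hypothetical and is not part of the Holoscan Sensor Bridge software; the dictionary simply mirrors the table above.

```python
# Abbreviations copied from the configuration table above.
ABBREVIATIONS = {
    "DP": "DisplayPort",
    "HDMI": "HDMI port",
    "JC": "Jetson Clocks",
    "ED": "Exclusive Display",
    "NED": "Non Exclusive Display",
    "MAXN": "Maximum Power",
    "CISP-GW": "CudaISP GrayWorld",
    "CISP-HS": "CudaISP HistogramStretch",
    "DBOnly": "Only Debayer without ISP",
}


def expand_label(label):
    """Expand a '+'-separated label into a list of full option names.

    Hypothetical helper for illustration only. Tokens that are not in
    the table (for example a power budget such as "15W") are kept as-is.
    """
    parts = [part.strip() for part in label.split("+")]
    return [ABBREVIATIONS.get(part, part) for part in parts]


print(expand_label("DP+ED+MAXN+JC+CISP-HS"))
```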

These configurations produce the following results:

Statistics    | MAXN + JC + ED + DP | MAXN + ED + DP | 15W + ED + DP
GPU           | 12%                 | 50%            | 73.6%
GPU Freq      | 1.3 GHz             | 306 MHz        | 400 MHz
GPU Mem       | 595M                | 595M           | 565M
CPU Mem       | 99.1M               | 96.3M          | 101M
CPU           | 3.8%                | 5%             | 23.5%
CPU Freq      | 2.2 GHz             | 729 MHz        | 900 MHz
Power         | 17 W                | 12.9 W         | 12.1 W

A second camera captured the glass-to-glass latency using video mirroring (the sensor captures a screen that shows a timer). The (total) CPU usage is the percentage of the entire CPU in use, whereas the core usage is the usage of a single core expressed relative to the entire CPU.
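
To make the measurement method concrete, the following Python sketch turns timer readings into a latency estimate. It assumes that each photograph taken by the second camera shows both the reference timer and its mirrored copy on the display, so that their difference is the glass-to-glass latency of that frame. The readings are hypothetical and are not measured data, and the sample standard deviation is only one possible spread measure; this page does not state how its uncertainty figures were derived.

```python
import statistics


def latency_stats(readings_ms):
    """Return (mean, sample standard deviation) of the latencies in ms.

    Each reading is a (reference_timer_ms, mirrored_timer_ms) pair read
    from one photograph taken by the second camera.
    """
    latencies = [reference - mirrored for reference, mirrored in readings_ms]
    return statistics.mean(latencies), statistics.stdev(latencies)


# Hypothetical readings for illustration, NOT measured data.
readings = [(1000, 960), (1500, 1450), (2000, 1965), (2500, 2455)]

mean_ms, std_ms = latency_stats(readings)
print(f"mean = {mean_ms:.2f} ms, std = {std_ms:.2f} ms")
```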

Note: We are currently working on expanding the sensor list. Stay tuned!

CUDA ISP Results

We have optimised our application by integrating CUDA ISP into the Holoscan Sensor Bridge pipeline. CUDA ISP provides an algorithm for colour correction and auto white balancing in RGB space. It adjusts the histogram of each colour channel within a confidence interval, which leads to a more complete colour balance. Compared with the baseline pipeline, our optimization drops the ISP Processor block and replaces the Gamma Correction block with the CUDA ISP block, leading to the following pipeline:

 
Figure: ISP pipeline with the CUDA ISP block
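
The per-channel histogram adjustment described above can be illustrated with a minimal CPU-side sketch in Python. This is not the CUDA ISP implementation: the function names are hypothetical, and the `clip` fraction is an assumed stand-in for the confidence interval mentioned in the text. Each channel is stretched independently so that the values between its lower and upper cut points span the full 8-bit range.

```python
def stretch_channel(values, clip=0.01, max_value=255):
    """Linearly stretch one channel so its central mass spans [0, max_value].

    `clip` is the fraction of samples ignored at each end of the sorted
    values (an assumption made for this illustration).
    """
    ordered = sorted(values)
    n = len(ordered)
    lo = ordered[int(clip * (n - 1))]
    hi = ordered[int((1.0 - clip) * (n - 1))]
    if hi <= lo:
        return list(values)  # flat channel: nothing to stretch
    # Integer arithmetic keeps the result deterministic; clamp the outliers.
    return [
        min(max_value, max(0, (v - lo) * max_value // (hi - lo)))
        for v in values
    ]


def histogram_stretch(pixels, clip=0.01):
    """Apply the stretch independently to the R, G and B channels."""
    channels = list(zip(*pixels))  # [(r, ...), (g, ...), (b, ...)]
    stretched = [stretch_channel(channel, clip) for channel in channels]
    return list(zip(*stretched))   # back to [(r, g, b), ...]


# Tiny hypothetical image with a bluish cast.
pixels = [(50, 60, 100), (100, 110, 150), (150, 160, 200)]
print(histogram_stretch(pixels, clip=0.0))
```

In this toy input the three channels differ only by an offset, so stretching them independently removes the cast and every pixel comes out grey.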


The blocks run in parallel as the stages of a pipeline. Removing a block shortens the frame processing time (latency), and optimizing any remaining block decreases the latency further. In this case, CUDA ISP removes one block from the pipeline.
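
A simple model helps to see why removing a block lowers the latency even though the blocks run concurrently. In the Python sketch below, a frame must traverse every stage in sequence, so its latency is the sum of the stage times, while the interval between output frames is set by the slowest stage. The stage names other than the ISP and gamma blocks, and all the timings, are hypothetical and are not measured values.

```python
def pipeline_latency_ms(stage_times_ms):
    """Latency of one frame: it traverses every stage in sequence."""
    return sum(stage_times_ms.values())


def pipeline_frame_interval_ms(stage_times_ms):
    """With all stages running concurrently, the slowest one sets the pace."""
    return max(stage_times_ms.values())


# Hypothetical per-stage times in ms, NOT measured values.
baseline = {
    "receive": 4.0,
    "debayer": 3.0,
    "isp": 5.0,
    "gamma": 2.0,
    "display": 6.0,
}

# Drop the ISP stage and replace gamma correction with a CUDA ISP stage.
optimized = {k: v for k, v in baseline.items() if k not in ("isp", "gamma")}
optimized["cuda_isp"] = 4.0

for name, stages in (("baseline", baseline), ("optimized", optimized)):
    print(
        f"{name}: latency = {pipeline_latency_ms(stages)} ms, "
        f"frame interval = {pipeline_frame_interval_ms(stages)} ms"
    )
```

Under these assumed timings the latency drops while the frame interval stays the same, because the slowest stage is unchanged.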

On the other hand, an important consideration is that CUDA ISP does not support RGBA64, which Holoviz requires. The pipeline therefore includes the conversions from RGBA64 to RGBA32 and back.
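
The following Python sketch shows one common way to perform such a conversion on a single pixel. It assumes that RGBA64 stores 16 bits per channel and RGBA32 stores 8 bits per channel; the function names are hypothetical, and the conversion actually used in the pipeline is not described here.

```python
def rgba64_to_rgba32(pixel):
    """Narrow four 16-bit channels to 8 bits by keeping the high byte."""
    return tuple(channel >> 8 for channel in pixel)


def rgba32_to_rgba64(pixel):
    """Widen four 8-bit channels to 16 bits.

    Multiplying by 257 replicates the byte (0xAB -> 0xABAB), so 255 maps
    exactly to 65535 and 0 maps to 0.
    """
    return tuple(channel * 257 for channel in pixel)


pixel64 = (65535, 32768, 0, 65535)
pixel32 = rgba64_to_rgba32(pixel64)
print(pixel32, rgba32_to_rgba64(pixel32))
```

Narrowing discards the low byte of each channel, so the round trip from RGBA64 to RGBA32 and back is lossy, while the round trip that starts from RGBA32 is exact.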

The following table summarizes the results obtained with the Holoscan Sensor Bridge, the NVIDIA Jetson AGX Orin and the IMX274 imager:

Configuration         | Mean Latency (ms) | Uncertainty (ms)
Baseline              | 41.61             | 8.35
DP+ED+MAXN+JC+CISP-HS | 37.93             | 8.35
DP+ED+MAXN+JC+CISP-GW | 49.76             | 8.35
DP+ED+MAXN+JC+DBOnly  | 35.96             | 8.35


These measurements were taken with the Jetson configured in maximum performance mode and with CUDA ISP operating at 24-bit colour depth. Optimizing the pipeline lowered the latency from 41.61 ms to 37.93 ms, an 8.8% reduction.
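
The percentage can be reproduced from the mean latencies in the table above. The short Python sketch below computes the relative change of each configuration with respect to the baseline; the helper name `change_percent` is hypothetical, and a negative value means a lower latency than the baseline.

```python
def change_percent(baseline_ms, measured_ms):
    """Relative latency change with respect to the baseline, in percent."""
    return (measured_ms - baseline_ms) / baseline_ms * 100.0


BASELINE_MS = 41.61

# Mean latencies in ms, copied from the table above.
mean_latency_ms = {
    "DP+ED+MAXN+JC+CISP-HS": 37.93,
    "DP+ED+MAXN+JC+CISP-GW": 49.76,
    "DP+ED+MAXN+JC+DBOnly": 35.96,
}

for config, latency in mean_latency_ms.items():
    print(f"{config}: {change_percent(BASELINE_MS, latency):+.1f}%")
```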

Further improvement is possible by offloading the image signal processing to the FPGA, which reduces the pressure on the Jetson system. The FPGA can potentially reduce the latency thanks to the dataflow execution pattern offered by FPGA hardware acceleration. The minimum latency obtained without altering the FPGA design is 35.96 ms (DBOnly), which defines the latency floor when the image signal processing pipeline only adds a debayer.

RidgeRun Services

RidgeRun has expertise in offloading processing algorithms using FPGAs, from Image Signal Processing to AI offloading. Our services include:

  • Algorithm Acceleration using FPGAs.
  • Image Signal Processing IP Cores.
  • Linux Device Drivers.
  • Low Power AI Acceleration using FPGAs.
  • Accelerated C++ Applications.

And much more. Contact us at https://www.ridgerun.com/contact.


