Performance of the Holoscan Sensor Bridge
Introduction
To assess the performance of the Holoscan Sensor Bridge, we quantify the computational resources consumed by the Holoscan software while measuring the glass-to-glass latency.
Currently, the Holoscan Sensor Bridge is compatible with the NVIDIA Jetson Orin AGX and NVIDIA Orin IGX. This page covers the Jetson Orin AGX.
Jetson Orin AGX Performance
Setup
The Holoscan Sensor Bridge is connected as specified in Holoscan Sensor Bridge/Hardware Connection, using a 10 Gbit/s Ethernet connection.
Initial Clarifications
We run the application as specified in Holoscan Sensor Bridge/Running the Demo. We use the unaccelerated version of the IMX274 example because the Jetson Orin AGX does not support DPDK[1]. This version uses UDP over Ethernet for the communication between the Holoscan Sensor Bridge and the Jetson. The results might change dramatically on NVIDIA Orin IGX platforms, given that they support NVIDIA ConnectX expansion cards for network communication.
The camera is configured as in the example and provides 60 fps. The display is a Samsung TV with a 60 Hz refresh rate.
Results
The following results correspond to the baseline pipeline provided by the Holoscan Sensor Bridge framework as an example. This pipeline is illustrated by:
For experimentation purposes, we have tried different setups and configurations:
Abbreviation | Meaning |
---|---|
DP | Display Port |
HDMI | HDMI Port |
JC | Jetson Clocks |
ED | Exclusive Display |
NED | Non Exclusive Display |
MAXN | Maximum Power |
CISP-GW | CudaISP GrayWorld |
CISP-HS | CudaISP HistogramStretch |
DBOnly | Only Debayer without ISP |
These setups produce the following results:
Statistics | MAXN + JC + ED + DP | MAXN + ED + DP | 15W + ED + DP |
---|---|---|---|
GPU | 12% | 50% | 73.60% |
GPU Freq | 1.3 GHz | 306 MHz | 400 MHz |
GPU Mem | 595M | 595M | 565M |
CPU Mem | 99.1M | 96.3M | 101M |
CPU | 3.80% | 5% | 23.50% |
CPU Freq | 2.2 GHz | 729 MHz | 900 MHz |
Power | 17 W | 12.9 W | 12.1 W |
A second camera captured the glass-to-glass latency using video mirroring (the sensor captures a screen that displays a timer). The (total) CPU usage represents the percentage of the entire CPU in use, whereas the core usage is expressed relative to the entire CPU.
Note: We are currently working on expanding the sensor list. Stay tuned!
CUDA ISP Results
We have optimised our application by integrating CUDA ISP into the Holoscan Sensor Bridge. CUDA ISP integrates an outstanding algorithm for colour correction and auto-white balancing in RGB space. It adjusts the histogram of each colour channel within a confidence interval, leading to a more complete colour balancing. Recalling the baseline pipeline, our optimisation drops the ISP Processor block and replaces the Gamma Correction block with the CUDA ISP block, leading to the following pipeline:
These blocks run in parallel as pipeline stages. Removing a block shortens the frame processing time (latency), and further optimising any of the remaining blocks also decreases the latency. In this case, CUDA ISP removes one block from the pipeline.
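To make the per-channel histogram adjustment more concrete, the following is a minimal sketch of a percentile-based histogram stretch in plain Python. It is not the CUDA ISP implementation: the function names `stretch_channel` and `stretch_rgb`, the percentile parameters, and the percentile lookup are all assumptions made for illustration, and the "confidence interval" is approximated here by clipping between a low and a high percentile.

```python
def stretch_channel(values, low_pct=1.0, high_pct=99.0, max_value=255):
    """Linearly stretch one colour channel between two percentiles (illustrative)."""
    ordered = sorted(values)
    n = len(ordered)
    # Simple percentile lookup by index (an assumption of this sketch).
    lo = ordered[int((n - 1) * low_pct / 100.0)]
    hi = ordered[int((n - 1) * high_pct / 100.0)]
    if hi == lo:
        return list(values)  # flat channel: nothing to stretch
    out = []
    for v in values:
        # Integer arithmetic keeps the result deterministic.
        stretched = (v - lo) * max_value // (hi - lo)
        # Values outside [lo, hi] are clipped to the valid range.
        out.append(min(max_value, max(0, stretched)))
    return out


def stretch_rgb(pixels, **kwargs):
    """Apply the stretch independently to the R, G and B channels."""
    channels = [stretch_channel([p[c] for p in pixels], **kwargs) for c in range(3)]
    return list(zip(*channels))


# Hypothetical 3-pixel image with a different range in each channel.
pixels = [(50, 100, 60), (100, 120, 80), (150, 140, 100)]
balanced = stretch_rgb(pixels, low_pct=0.0, high_pct=100.0)
print(balanced)  # [(0, 0, 0), (127, 127, 127), (255, 255, 255)]
```

Because each channel is stretched independently, channels that start with different ranges end up covering the same output range, which is the intuition behind using the stretch for colour balancing.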
On the other hand, an important consideration is that CUDA ISP does not support RGBA64, which Holoviz requires. The pipeline therefore includes the necessary conversions between RGBA64 and RGBA32 in both directions.
The following table summarises the results obtained with the Holoscan Sensor Bridge, the NVIDIA Jetson AGX Orin and the IMX274 imager:
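As an illustration of the conversion between the two formats, here is a minimal sketch in plain Python. It assumes that RGBA64 stores 16 bits per channel and RGBA32 stores 8 bits per channel; the function names are hypothetical and the actual conversion used in the pipeline may differ (for example, in how it rounds or scales).

```python
def rgba64_to_rgba32(pixel):
    """Reduce each 16-bit channel to 8 bits by keeping the most significant byte."""
    return tuple(channel >> 8 for channel in pixel)


def rgba32_to_rgba64(pixel):
    """Expand each 8-bit channel to 16 bits; multiplying by 257 maps 255 to 65535."""
    return tuple(channel * 257 for channel in pixel)


pixel64 = (65535, 32768, 256, 65535)
pixel32 = rgba64_to_rgba32(pixel64)
restored = rgba32_to_rgba64(pixel32)

print(pixel32)   # (255, 128, 1, 255)
print(restored)  # (65535, 32896, 257, 65535)
```

Note that going from 16 to 8 bits and back loses the low-order byte of each channel, so the restored pixel is close to the original but not identical. The opposite round trip (8 to 16 to 8 bits) is lossless in this sketch.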
Configuration | Mean Latency (ms) | Uncertainty (ms) |
---|---|---|
Baseline | 41.61 | 8.35 |
DP+ED+MAXN+JC+CISP-HS | 37.93 | 8.35 |
DP+ED+MAXN+JC+CISP-GW | 49.76 | 8.35 |
DP+ED+MAXN+JC+DBOnly | 35.96 | 8.35 |
These results were obtained with the Jetson configured in maximum performance mode and with CUDA ISP running at 24-bit colour depth. Optimising the pipeline lowered the latency from 41.61 ms to 37.93 ms, an 8.8% reduction.
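The reduction can be reproduced from the table with a few lines of Python. The sketch below uses only the mean latencies reported above and the 60 fps camera rate; the helper name `reduction_percent` is introduced here for illustration.

```python
BASELINE_MS = 41.61
FRAME_PERIOD_MS = 1000.0 / 60.0  # camera configured at 60 fps

# Mean latencies taken from the results table.
latencies_ms = {
    "DP+ED+MAXN+JC+CISP-HS": 37.93,
    "DP+ED+MAXN+JC+CISP-GW": 49.76,
    "DP+ED+MAXN+JC+DBOnly": 35.96,
}


def reduction_percent(latency_ms, baseline_ms=BASELINE_MS):
    """Latency reduction relative to the baseline; negative means slower."""
    return 100.0 * (baseline_ms - latency_ms) / baseline_ms


for name, latency in latencies_ms.items():
    print(f"{name}: {reduction_percent(latency):+.1f}% vs baseline, "
          f"{latency / FRAME_PERIOD_MS:.2f} frame periods")
```

With these numbers, CISP-HS gives a reduction of about 8.8%, DBOnly about 13.6%, and CISP-GW is about 19.6% slower than the baseline. The differences between configurations should be read with the reported 8.35 ms uncertainty in mind.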
Further improvement is possible by offloading the image signal processing to the FPGA, which reduces the pressure on the Jetson system. The FPGA can potentially reduce the latency thanks to the dataflow execution pattern offered by FPGA hardware acceleration. The minimum latency obtained without altering the FPGA design is 35.96 ms, which defines the latency floor when the image signal processing pipeline consists of only a debayer.
RidgeRun Services
RidgeRun has expertise in offloading processing algorithms using FPGAs, from Image Signal Processing to AI offloading. Our services include:
- Algorithm Acceleration using FPGAs.
- Image Signal Processing IP Cores.
- Linux Device Drivers.
- Low Power AI Acceleration using FPGAs.
- Accelerated C++ Applications.
And much more. Contact us at https://www.ridgerun.com/contact.
[1] Holoscan IMX274 Example: https://docs.nvidia.com/holoscan/sensor-bridge/1.0.0/examples.html#imx274-player-example