Performance of the Stitcher element on NVIDIA Xavier

From RidgeRun Developer Wiki










The performance of the cuda-stitcher element depends on many factors, the most significant being those that directly influence the output resolution.

The following sections show FPS and latency measurements of the cuda-stitcher for multiple image resolutions, as well as the impact of changing parameters such as the blending width and the homography list.

Platform Setup

The performance measurements were taken on the AGX Xavier in 30W All mode, which can be activated with:

sudo nvpmodel -m 3

Some sections compare the platform running at maximum frequency (with jetson_clocks) against base mode. The maximum-frequency mode can be set as follows:

sudo /usr/bin/jetson_clocks
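
Before taking measurements it can help to confirm which power mode and clock state are actually active. The following is a minimal sketch, assuming the standard `nvpmodel -q` and `jetson_clocks --show` query options are available on the board; `show_power_state` is a hypothetical helper name introduced here for illustration.

```shell
# Report the current power mode and clock state if the Jetson tools are present.
# show_power_state is a hypothetical helper, not part of the Jetson tooling.
show_power_state() {
    if command -v nvpmodel >/dev/null 2>&1; then
        sudo nvpmodel -q            # query the active power mode
    else
        echo "nvpmodel not found (not a Jetson?)"
    fi
    if command -v jetson_clocks >/dev/null 2>&1; then
        sudo jetson_clocks --show   # print the current clock settings
    else
        echo "jetson_clocks not found (not a Jetson?)"
    fi
}

# Usage: show_power_state
```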

Framerate

The AGX Xavier shows a CPU load between 10% and 20% when using the Stitcher element. The following charts show the average frames per second (FPS) while varying the number of inputs, image resolution, image overlap, and blending width. Some results also show the impact of running the jetson_clocks script.

Comparing the number of stitched images

These measurements were done with a BORDER_WIDTH of 20 and a 10% overlap between images.

FPS on 1920x1080 images with and without jetson_clocks.sh
FPS on 3840x2160 images with and without jetson_clocks.sh

Comparing homographies

The homographies define how each image is transformed before stitching. Depending on their values, the overlap between images can increase or decrease, yielding a smaller or larger output image and impacting performance.

These measurements show the effect of different transformations with different overlap percentages. The results were obtained by stitching 2 images of 3840x2160 with a BORDER_WIDTH of 20 and with the jetson_clocks script applied.
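
As a rough illustration of why the overlap changes the output size, the sketch below estimates the stitched width for the simplest case. It assumes the homographies are pure horizontal translations and that every adjacent pair of images shares the same overlap; real homographies may also rotate or warp the images, so this is only an approximation. `stitched_width` is a hypothetical helper, not part of the stitcher.

```shell
# Estimated output width (px) for N images placed side by side.
# $1 = image width (px), $2 = number of images, $3 = overlap between neighbors (%)
# Assumption: pure horizontal translation, same overlap for every pair.
stitched_width() {
    echo $(( $1 * $2 - ($2 - 1) * $1 * $3 / 100 ))
}

stitched_width 3840 2 10   # prints 7296
stitched_width 3840 2 30   # prints 6528
```

Under these assumptions a larger overlap gives a narrower output image, which means fewer output pixels to produce.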

FPS of the stitcher with 2 input images and multiple homographies with jetson_clocks.sh

Comparing blending widths

This parameter is set with the border-width=BORDER_WIDTH option. It sets the width of the overlap region that is blended between the input images, so the larger its value, the slower the processing. These measurements were taken by stitching 2 images of 3840x2160 with an overlap of 10% and with the jetson_clocks script applied.

The blender operates only on the overlapping parts of the images. In this case the overlap is 10%, which is why a BORDER_WIDTH of 500 does not affect performance compared to 400: the blender operates on a maximum of 384 pixels (10% of the 3840-pixel width).
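
The capping described above can be written as a small calculation. This is a sketch of the arithmetic only, assuming the overlap is measured along the image width; `effective_blend_width` is a hypothetical helper name and does not reflect how the element is implemented internally.

```shell
# Width (px) that the blender actually works on: BORDER_WIDTH capped by the overlap.
# $1 = image width (px), $2 = overlap (%), $3 = BORDER_WIDTH (px)
effective_blend_width() {
    overlap_px=$(( $1 * $2 / 100 ))
    if [ "$3" -lt "$overlap_px" ]; then
        echo "$3"
    else
        echo "$overlap_px"
    fi
}

effective_blend_width 3840 10 20    # prints 20
effective_blend_width 3840 10 400   # prints 384
effective_blend_width 3840 10 500   # prints 384
```

Both 400 and 500 resolve to the same 384 pixels, which is consistent with the two settings performing the same.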

FPS of the stitcher with 2 input images and multiple blending widths with jetson_clocks.sh

Comparing input resolutions

The resolution of the input images plays a big role in performance: the bigger the input, the slower the algorithm. The data below shows the results of stitching images of different resolutions, all with a 16:9 aspect ratio, a BORDER_WIDTH of 20, an overlap of 10%, and the jetson_clocks script applied.

FPS of the stitcher with 2 input images of multiple resolutions with jetson_clocks.sh

Pipeline Structure

The general structure of the pipeline used for the FPS measurements above is the following:

gst-launch-1.0 -e cudastitcher name=stitcher \
homography-list="$(tr -d '\n ' < homographies.json)" \
border-width=$BORDER_WIDTH \
filesrc location=images/ImageA.jpg ! nvjpegdec ! imagefreeze ! nvvidconv ! queue ! stitcher.sink_0 \
filesrc location=images/ImageB.jpg ! nvjpegdec ! imagefreeze ! nvvidconv ! queue ! stitcher.sink_1 \
stitcher. ! perf print-cpu-load=true ! fakesink
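
Since several measurements vary the number of inputs, the source branches of the pipeline above can be generated rather than typed by hand. The following is a minimal sketch that repeats the same branch once per input; the helper name `stitcher_branches` and the file names `images/Image_<i>.jpg` are hypothetical placeholders, and the homography list passed to the stitcher would also need to match the number of inputs.

```shell
# Print one source branch per input, following the two-input pipeline above.
# $1 = number of inputs. File names images/Image_<i>.jpg are placeholders.
stitcher_branches() {
    i=0
    while [ "$i" -lt "$1" ]; do
        printf 'filesrc location=images/Image_%d.jpg ! nvjpegdec ! imagefreeze ! nvvidconv ! queue ! stitcher.sink_%d ' "$i" "$i"
        i=$(( i + 1 ))
    done
}

# Usage (left unquoted on purpose so each token becomes its own argument):
#   gst-launch-1.0 -e cudastitcher name=stitcher \
#     homography-list="$(tr -d '\n ' < homographies.json)" \
#     border-width=$BORDER_WIDTH \
#     $(stitcher_branches 3) \
#     stitcher. ! perf print-cpu-load=true ! fakesink
```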

Latency

For the purpose of this performance evaluation, latency is measured as the time difference between the src pad of the element before the stitcher and the src pad of the stitcher itself, effectively measuring the time between the stitcher's input and output pads. For multiple inputs, the largest time difference is taken.

These latency measurements were taken using the GstShark interlatency tracer.

The pictures below show the latency of the cuda-stitcher element for multiple input images and multiple resolutions, with and without the jetson_clocks script.

Latency on 1920x1080 images with and without jetson_clocks.sh
Latency on 3840x2160 images with and without jetson_clocks.sh

Pipeline structure

The general structure of the pipeline used for the latency measurements is shown below, for the case of 2 images of 3840x2160 resolution.

BORDER_WIDTH=20

INPUT_1=image_1.jpg
INPUT_2=image_2.jpg

GST_DEBUG="3,GST_TRACER:7" GST_TRACERS="interlatency" GST_SHARK_CTF_DISABLE=1 \
gst-launch-1.0 -e cudastitcher name=stitcher \
homography-list="$(tr -d '\n ' < homographies.json)" \
border-width=$BORDER_WIDTH \
multifilesrc loop=true location=$INPUT_1 ! nvjpegdec ! 'video/x-raw, width=3840, height=2160' \
! nvvidconv ! 'video/x-raw(memory:NVMM),format=NV12, width=3840, height=2160' \
! nvvidconv ! queue ! stitcher.sink_0   \
multifilesrc loop=true location=$INPUT_2 ! nvjpegdec ! 'video/x-raw, width=3840, height=2160' \
! nvvidconv ! 'video/x-raw(memory:NVMM),format=NV12, width=3840, height=2160' \
! nvvidconv ! queue ! stitcher.sink_1   \
stitcher. ! perf print-arm-load=true ! queue ! fakesink
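
The tracer output of a pipeline like the one above can be summarized with a short script. The sketch below is one possible way to do it, not necessarily the method used for the charts. It assumes the debug output was saved to a file and that each interlatency entry contains `from_pad=(string)...`, `to_pad=(string)...` and `time=(string)H:MM:SS.NNNNNNNNN;` fields; both the log-line shape and the helper name `mean_interlatency_ms` are assumptions made for illustration.

```shell
# Mean interlatency in milliseconds reported for one destination pad.
# Reads tracer log lines on stdin; $1 is the to_pad name, e.g. stitcher_src.
mean_interlatency_ms() {
    awk -v pad="$1" '
        /interlatency/ && index($0, "to_pad=(string)" pad ",") {
            # Keep only the H:MM:SS.NNNNNNNNN value that follows "time=(string)".
            t = $0
            sub(/.*time=\(string\)/, "", t)
            sub(/;.*/, "", t)
            split(t, p, ":")
            sum += (p[1] * 3600 + p[2] * 60 + p[3]) * 1000
            n++
        }
        END { if (n) printf "%.3f\n", sum / n; else print "no samples" }
    '
}

# Usage, with trace.log as a hypothetical file holding the saved debug output:
#   mean_interlatency_ms stitcher_src < trace.log
#   mean_interlatency_ms queue0_src   < trace.log
```

Because the interlatency tracer reports times relative to the source elements, the latency as defined in this section would be the value at the stitcher's src pad minus the value at the src pad of the element feeding it, taking the largest difference across inputs.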

