NVIDIA Jetson AGX Thor - Performance Benchmarking
Methodology for Performance Benchmarking
The following measurements were performed to evaluate the performance of the NVIDIA Jetson AGX Thor while running different DeepStream applications. The goal was to quantify CPU and GPU utilization under controlled test conditions and identify bottlenecks across multiple inference pipelines.
Tools and Setup
Nsight Systems CLI (`nsys`)
Profiling was conducted using the NVIDIA Nsight Systems command-line interface.
The base profiling command was:

```
sudo nsys profile -t osrt,cuda,nvtx \
  -s process-tree --cpuctxsw=process-tree \
  --gpu-metrics-devices all --cuda-memory-usage true \
  --stats true -o report
```
This configuration collects:
- OS runtime and context switch activity,
- CUDA kernel launches, memory operations, and NVTX ranges,
- GPU active/idle periods and memory usage statistics.
SQLite Performance Extraction Script
After each run, the resulting `.sqlite` report was analyzed using an internal tool:

```
./nsys_sqlite_perf.sh report.sqlite
```
This script performs advanced SQL analysis to extract:
- **Absolute CPU utilization**, based on total thread runtime.
- **GPU active time**, derived from summed kernel, memcpy, and memset operations.
- **Per-element and per-module CPU distribution**, based on sampling callchains.
- **Wall time**, representing total execution duration.
The script automatically adapts to available Nsight tables (e.g., `SCHED_EVENTS`, `COMPOSITE_EVENTS`, `SAMPLING_CALLCHAINS`) and applies fallbacks for compatibility with different Nsight Systems versions.
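The table-detection fallback described above can be sketched as follows. This is an illustrative Python example, not the internal script itself; the table names (`CUPTI_ACTIVITY_KIND_KERNEL`, `CUPTI_ACTIVITY_KIND_MEMCPY`, `CUPTI_ACTIVITY_KIND_MEMSET`) follow the usual Nsight Systems SQLite export schema, but their availability varies by version, hence the existence check.

```python
import sqlite3

# GPU activity tables commonly present in Nsight Systems SQLite exports.
# Availability varies across nsys versions, so each table is checked
# before querying (mirroring the fallback behaviour described above).
GPU_TABLES = [
    "CUPTI_ACTIVITY_KIND_KERNEL",
    "CUPTI_ACTIVITY_KIND_MEMCPY",
    "CUPTI_ACTIVITY_KIND_MEMSET",
]

def table_exists(conn, name):
    """Check sqlite_master for a table of the given name."""
    row = conn.execute(
        "SELECT 1 FROM sqlite_master WHERE type = 'table' AND name = ?",
        (name,),
    ).fetchone()
    return row is not None

def gpu_active_ns(conn):
    """Sum ("end" - start), in nanoseconds, over all available GPU tables.

    Overlapping activities are double-counted here, which is one reason
    the derived GPU percentage can slightly overcount."""
    total = 0
    for table in GPU_TABLES:
        if table_exists(conn, table):
            (s,) = conn.execute(
                f'SELECT COALESCE(SUM("end" - start), 0) FROM {table}'
            ).fetchone()
            total += s
    return total
```

Dividing `gpu_active_ns(sqlite3.connect("report.sqlite"))` by the wall time (also in nanoseconds) yields the GPU utilization ratio used in the tables below.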
Metrics Definitions
- CPU (%):
Average CPU utilization normalized to a single core, calculated as the proportion of time threads spent in the running state relative to total wall time. This normalization allows consistent comparisons across different Jetson SoCs, regardless of core count.
- GPU (%):
Ratio between total GPU active time (kernels + memcpy + memset operations) and total wall time. This provides an approximate GPU workload percentage. Note: small overcounting may occur due to overlapping GPU activities.
- Wall time (s):
Total duration from the first to the last recorded event during profiling.
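As a worked example of these definitions, the sketch below (illustrative, not the internal script) derives both percentages from per-thread running times and GPU activity intervals. It also demonstrates the overcounting caveat: naively summing interval durations overcounts relative to merging overlapping intervals first.

```python
def cpu_percent(thread_running_s, wall_s):
    """Average CPU utilization normalized to a single core:
    total time threads spent running, divided by wall time."""
    return 100.0 * sum(thread_running_s) / wall_s

def gpu_percent_naive(intervals, wall_s):
    """Summed durations of kernel/memcpy/memset intervals over wall time.
    Overlapping activities are double-counted (slight overcount)."""
    return 100.0 * sum(e - s for s, e in intervals) / wall_s

def gpu_percent_merged(intervals, wall_s):
    """Same metric, but overlapping intervals are merged first,
    so concurrent GPU activity is counted only once."""
    busy, cur_s, cur_e = 0.0, None, None
    for s, e in sorted(intervals):
        if cur_e is None or s > cur_e:   # disjoint: close the current run
            if cur_e is not None:
                busy += cur_e - cur_s
            cur_s, cur_e = s, e
        else:                            # overlap: extend the current run
            cur_e = max(cur_e, e)
    if cur_e is not None:
        busy += cur_e - cur_s
    return 100.0 * busy / wall_s
```

For example, two threads running 1.0 s and 0.4 s over a 10 s wall time give 14% CPU; two GPU intervals (0, 1) and (0.5, 2) give 25% naively but 20% after merging, which bounds the size of the overcount.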
Experimental Procedure
- All profiling was conducted on the same Jetson AGX Thor hardware, at a constant temperature and in the MAXN power mode.
- After execution, the `.sqlite` profiling data was processed using `nsys_sqlite_perf.sh` to generate normalized CPU and GPU averages.
- Results were tabulated and compared across different sample applications.
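The per-sample procedure above can be sketched as a small driver script. This is a hypothetical harness, not part of the source: the `nsys` flags are taken from the base command shown earlier, while the sample names and application commands passed in are placeholders.

```python
import subprocess

# Flags from the base nsys profile command used throughout this study.
NSYS_FLAGS = [
    "-t", "osrt,cuda,nvtx",
    "-s", "process-tree", "--cpuctxsw=process-tree",
    "--gpu-metrics-devices", "all", "--cuda-memory-usage", "true",
    "--stats", "true",
]

def build_profile_cmd(app_cmd, report_name):
    """Assemble the nsys profile invocation for one sample run."""
    return ["nsys", "profile", *NSYS_FLAGS, "-o", report_name, *app_cmd]

def profile_samples(samples):
    """Run each (name, command) pair under nsys, then post-process
    the generated .sqlite report with the extraction script."""
    for name, app_cmd in samples:
        subprocess.run(build_profile_cmd(app_cmd, name), check=True)
        subprocess.run(["./nsys_sqlite_perf.sh", f"{name}.sqlite"], check=True)
```

A call such as `profile_samples([("ds_one_stream", ["deepstream-app", "-c", "config.txt"])])` (sample name and config path are hypothetical) reproduces the run-then-extract loop for each application.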
Purpose and Relevance
The motivation behind these measurements is to:
- Quantify real CPU and GPU load during DeepStream execution.
- Compare efficiency between different pipelines, models, or numbers of concurrent streams.
- Identify potential performance bottlenecks.
- Estimate system scalability as pipeline complexity or input stream count increases.
| Sample | CPU | GPU |
|---|---|---|
| One-frame DeepStream reference app | 14.28% | 5.30% |
| Four-frame DeepStream reference app | 13.35% | 10.66% |
| Thirty-frame DeepStream reference app | 15.86% | 36.8% |
| License plate detection and recognition | 2.89% | 7.89% |
| Parallel models | 6.82% | 11.53% |
| Text recognition with OCD/OCR models | 1.35% | 86.6% |
| Object embedding vector generation (PeopleNet detection) | 10.26% | 25.22% |
| Object embedding vector generation (retail detection) | 2.58% | 22.47% |
| Pose classification | 6.92% | 16.67% |