NVIDIA Jetson AGX Thor - Performance Benchmarking
Methodology for Performance Benchmarking
The following measurements were performed to evaluate the performance of the NVIDIA Jetson AGX Thor while running different DeepStream applications. The goal was to quantify CPU and GPU utilization under controlled test conditions and identify bottlenecks across multiple inference pipelines.
Tools and Setup
Nsight Systems CLI (`nsys`)
Profiling was conducted using the NVIDIA Nsight Systems command-line interface.
The base profiling command was:

```
sudo nsys profile -t osrt,cuda,nvtx \
  -s process-tree --cpuctxsw=process-tree \
  --gpu-metrics-devices all --cuda-memory-usage true \
  --stats true -o report
```
This configuration collects:
- OS runtime and context switch activity,
- CUDA kernel launches, memory operations, and NVTX ranges,
- GPU active/idle periods and memory usage statistics.
SQLite Performance Extraction Script
After each run, the resulting `.sqlite` report was analyzed using an internal tool:

```
./nsys_sqlite_perf.sh report.sqlite
```
This script performs advanced SQL analysis to extract:
- **Absolute CPU utilization**, based on total thread runtime.
- **GPU active time**, derived from summed kernel, memcpy, and memset operations.
- **Per-element and per-module CPU distribution**, based on sampling callchains.
- **Wall time**, representing total execution duration.
The script automatically adapts to available Nsight tables (e.g., `SCHED_EVENTS`, `COMPOSITE_EVENTS`, `SAMPLING_CALLCHAINS`) and applies fallbacks for compatibility with different Nsight Systems versions.
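The table-detection fallback described above can be sketched as follows. This is an illustrative Python example, not the internal script itself; the table names (`CUPTI_ACTIVITY_KIND_KERNEL`, `CUPTI_ACTIVITY_KIND_MEMCPY`, `CUPTI_ACTIVITY_KIND_MEMSET`) follow the usual Nsight Systems SQLite export schema, but their availability varies by version, hence the existence check.

```python
import sqlite3

# GPU activity tables commonly present in Nsight Systems SQLite exports.
# Availability varies across nsys versions, so each table is checked
# before querying (mirroring the fallback behaviour described above).
GPU_TABLES = [
    "CUPTI_ACTIVITY_KIND_KERNEL",
    "CUPTI_ACTIVITY_KIND_MEMCPY",
    "CUPTI_ACTIVITY_KIND_MEMSET",
]

def table_exists(conn, name):
    """Check sqlite_master for a table of the given name."""
    row = conn.execute(
        "SELECT 1 FROM sqlite_master WHERE type = 'table' AND name = ?",
        (name,),
    ).fetchone()
    return row is not None

def gpu_active_ns(conn):
    """Sum ("end" - start), in nanoseconds, over all available GPU tables.

    Overlapping activities are double-counted here, which is one reason
    the derived GPU percentage can slightly overcount."""
    total = 0
    for table in GPU_TABLES:
        if table_exists(conn, table):
            (s,) = conn.execute(
                f'SELECT COALESCE(SUM("end" - start), 0) FROM {table}'
            ).fetchone()
            total += s
    return total
```

Dividing `gpu_active_ns(sqlite3.connect("report.sqlite"))` by the wall time (also in nanoseconds) yields the GPU utilization ratio used in the tables below.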
Metrics Definitions
- CPU (%):
Average CPU utilization normalized to a single core, calculated as the proportion of time threads spent in the running state relative to total wall time. This normalization allows consistent comparisons across different Jetson SoCs, regardless of core count.
- GPU (%):
Ratio between total GPU active time (kernels + memcpy + memset operations) and total wall time. This provides an approximate GPU workload percentage. Note: small overcounting may occur due to overlapping GPU activities.
- Wall time (s):
Total duration from the first to the last recorded event during profiling.
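As a worked example of these definitions, the sketch below (illustrative, not the internal script) derives both percentages from per-thread running times and GPU activity intervals. It also demonstrates the overcounting caveat: naively summing interval durations overcounts relative to merging overlapping intervals first.

```python
def cpu_percent(thread_running_s, wall_s):
    """Average CPU utilization normalized to a single core:
    total time threads spent running, divided by wall time."""
    return 100.0 * sum(thread_running_s) / wall_s

def gpu_percent_naive(intervals, wall_s):
    """Summed durations of kernel/memcpy/memset intervals over wall time.
    Overlapping activities are double-counted (slight overcount)."""
    return 100.0 * sum(e - s for s, e in intervals) / wall_s

def gpu_percent_merged(intervals, wall_s):
    """Same metric, but overlapping intervals are merged first,
    so concurrent GPU activity is counted only once."""
    busy, cur_s, cur_e = 0.0, None, None
    for s, e in sorted(intervals):
        if cur_e is None or s > cur_e:   # disjoint: close the current run
            if cur_e is not None:
                busy += cur_e - cur_s
            cur_s, cur_e = s, e
        else:                            # overlap: extend the current run
            cur_e = max(cur_e, e)
    if cur_e is not None:
        busy += cur_e - cur_s
    return 100.0 * busy / wall_s
```

For example, two threads running 1.0 s and 0.4 s over a 10 s wall time give 14% CPU; two GPU intervals (0, 1) and (0.5, 2) give 25% naively but 20% after merging, which bounds the size of the overcount.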
Experimental Procedure
- All profiling was conducted on the same Jetson AGX Thor hardware, at a constant temperature and in the MAXN power mode.
- After execution, the `.sqlite` profiling data was processed using `nsys_sqlite_perf.sh` to generate normalized CPU and GPU averages.
- Results were tabulated and compared across different sample applications.
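The per-sample procedure above can be sketched as a small driver script. This is a hypothetical harness, not part of the source: the `nsys` flags are taken from the base command shown earlier, while the sample names and application commands passed in are placeholders.

```python
import subprocess

# Flags from the base nsys profile command used throughout this study.
NSYS_FLAGS = [
    "-t", "osrt,cuda,nvtx",
    "-s", "process-tree", "--cpuctxsw=process-tree",
    "--gpu-metrics-devices", "all", "--cuda-memory-usage", "true",
    "--stats", "true",
]

def build_profile_cmd(app_cmd, report_name):
    """Assemble the nsys profile invocation for one sample run."""
    return ["nsys", "profile", *NSYS_FLAGS, "-o", report_name, *app_cmd]

def profile_samples(samples):
    """Run each (name, command) pair under nsys, then post-process
    the generated .sqlite report with the extraction script."""
    for name, app_cmd in samples:
        subprocess.run(build_profile_cmd(app_cmd, name), check=True)
        subprocess.run(["./nsys_sqlite_perf.sh", f"{name}.sqlite"], check=True)
```

A call such as `profile_samples([("ds_one_stream", ["deepstream-app", "-c", "config.txt"])])` (sample name and config path are hypothetical) reproduces the run-then-extract loop for each application.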
Purpose and Relevance
The motivation behind these measurements is to:
- Quantify real CPU and GPU load during DeepStream execution.
- Compare efficiency between different pipelines, models, or numbers of concurrent streams.
- Identify potential performance bottlenecks.
- Estimate system scalability as pipeline complexity or input stream count increases.
| Sample | CPU | GPU |
|---|---|---|
| One-frame DeepStream reference app | 14.28% | 5.30% |
| Four-frame DeepStream reference app | 13.35% | 10.66% |
| Thirty-frame DeepStream reference app | 15.86% | 36.8% |
| License plate detection and recognition | 2.89% | 7.89% |
| Parallel models | 6.82% | 11.53% |
| Text recognition with OCD/OCR models | 1.35% | 86.6% |
| Object embedding vector generation (PeopleNet detection) | 10.26% | 25.22% |
| Object embedding vector generation (retail detection) | 2.58% | 22.47% |
| Pose classification | 6.92% | 16.67% |