NVIDIA Jetson AGX Thor - Evaluating the Performance
Evaluating Performance
As developers, we often need to measure CPU, GPU, and memory usage to budget resources for the applications that run in our projects. On Jetson platforms, NVIDIA provides two tools for this: tegrastats, which reports these and other important system metrics, and Nsight Systems, a profiler. This section provides an overview of how to use both.
Tegrastats
How to use tegrastats
tegrastats is a tool that reports memory usage, GPU usage, CPU usage, and SoC temperature for Tegra-based devices. It comes preinstalled in the sample root filesystem provided with JetPack. On the AGX Thor it is located in /usr/bin, which is on the system path, so it can be launched from any directory. To launch tegrastats, run the binary with root privileges:
sudo tegrastats
This command prints the system stats once per second; an example of the expected output is shown in the next subsection. To customize the behavior of tegrastats, use the following command-line options:
Usage: tegrastats [-option]
Options:
--help : print this help screen
--interval <millisec> : sample the information in <milliseconds>
--logfile <filename> : dump the output of tegrastats to <filename>
--load_cfg <filename> : load the information from <filename>
--readall : collect all stats including performance intensive stats
--save_cfg <filename> : save the information to <filename>
--start : run tegrastats as a daemon process in the background
--stop : stop any running instances of tegrastats
--verbose : print verbose message
Analyzing the output
Below is an example of the output printed by tegrastats on the AGX Thor.
08-28-2025 15:36:52 RAM 2028/125772MB (lfb 4x4MB) CPU[0%@972,0%@972,0%@972,0%@972,0%@972,0%@972,0%@972,0%@972,0%@972,0%@972,0%@972,0%@972,0%@972,0%@972] EMC_FREQ 0%@665 GR3D_FREQ @[0,0,0] NVENC0_FREQ @0 NVENC1_FREQ @0 NVDEC0_FREQ @0 NVDEC1_FREQ @0 NVJPG0_FREQ @0 VIC off OFA_FREQ @0 PVA0_FREQ off APE 300 cpu@39.593C tj@41.5C soc012@39.875C soc345@41.468C VDD_GPU 0mW/0mW VDD_CPU_SOC_MSS 5137mW/5137mW VIN_SYS_5V0 4313mW/4313mW
Table 1 provides a description to help interpret some of the metrics provided by tegrastats.
| Format of Statistic | Description of X | Description of Y | Description of Z |
|---|---|---|---|
| RAM X /Y (lfb NxZ) | Amount of RAM in use, in megabytes, e.g. 2028MB | Total amount of RAM available for applications, in megabytes, e.g. 125772MB | Size of the largest free memory block, in megabytes. N is the number of free blocks of this size, e.g. 4x4MB |
| SWAP X /Y (cached Z) | Amount of SWAP in use, in megabytes | Total amount of SWAP available for applications | Amount of SWAP cached |
| IRAM X /Y (lfb Z) | Amount of IRAM memory in use, in kilobytes | Total amount of IRAM memory available | Size of the largest free block of IRAM memory |
| CPU [X%,X%,...]@Z | Load on each CPU core relative to the current running frequency Z, or off if a core is currently powered down | - | CPU frequency in megahertz. Goes up or down dynamically depending on the CPU workload. |
| EMC_FREQ X %@Y | Percentage of EMC memory bandwidth in use relative to the current running frequency | EMC frequency in megahertz | - |
| GR3D_FREQ X %@[Y, Y, ...] | - | Frequency of each of the GPU's GPCs in megahertz. | - |
| VIC X %@Y | VIC engine loading as a percentage of current VIC engine frequency | Current VIC engine frequency | - |
| APE Y | - | APE frequency in megahertz | - |
| X @Y C | Processor block name | Processor block temperature in degrees Celsius | - |
| VDD_X YmW/ZmW | Name of the power rail * | Block’s current power consumption in milliwatts | Block’s average power consumption in milliwatts |
| NVENC0 X %@Y | - | NVENC frequency in megahertz | - |
| NVDEC0 X %@Y | - | NVDEC frequency in megahertz | - |
| NVDLA0 @Y | - | NVDLA frequency in megahertz | - |
| NVJPG0 X %@Y | - | NVJPG frequency in megahertz | - |
| PVA0_FREQ [X %,X %]@Y | PVA0_VPU0 and PVA0_VPU1 utilization (VPU is the vector processing unit) | PVA frequency in megahertz | - |
| OFA X %@Y | OFA utilization | OFA frequency in megahertz | - |
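As an illustration of the table above, the following minimal Python sketch pulls a few fields out of a single tegrastats line with regular expressions. The function name `parse_tegrastats_line` and the dictionary keys are hypothetical, and the patterns are only checked against the sample line shown earlier; other Jetson models or tegrastats versions may print fields differently.

```python
import re

# Sample line based on the AGX Thor output shown above (14 CPU cores).
SAMPLE = (
    "08-28-2025 15:36:52 RAM 2028/125772MB (lfb 4x4MB) CPU ["
    + ",".join(["0%@972"] * 14)
    + "] EMC_FREQ 0%@665 GR3D_FREQ @[0,0,0] NVENC0_FREQ @0 NVENC1_FREQ @0 "
    "NVDEC0_FREQ @0 NVDEC1_FREQ @0 NVJPG0_FREQ @0 VIC off OFA_FREQ @0 "
    "PVA0_FREQ off APE 300 cpu@39.593C tj@41.5C soc012@39.875C soc345@41.468C "
    "VDD_GPU 0mW/0mW VDD_CPU_SOC_MSS 5137mW/5137mW VIN_SYS_5V0 4313mW/4313mW"
)


def parse_tegrastats_line(line):
    """Extract RAM, CPU, temperature, and power fields from one line."""
    stats = {}

    ram = re.search(r"RAM (\d+)/(\d+)MB", line)
    if ram:
        stats["ram_used_mb"] = int(ram.group(1))
        stats["ram_total_mb"] = int(ram.group(2))

    # The space between "CPU" and "[" is treated as optional.
    cpu = re.search(r"CPU\s*\[([^\]]*)\]", line)
    if cpu:
        # Each entry is "<load>%@<MHz>"; cores reported as "off" are skipped.
        cores = re.findall(r"(\d+)%@(\d+)", cpu.group(1))
        stats["cpu_load_percent"] = [int(load) for load, _ in cores]
        stats["cpu_freq_mhz"] = [int(freq) for _, freq in cores]

    # Temperatures look like "tj@41.5C".
    stats["temps_c"] = {
        name: float(value)
        for name, value in re.findall(r"(\w+)@(-?\d+(?:\.\d+)?)C\b", line)
    }

    # Power rails look like "VDD_GPU 0mW/0mW" (current/average).
    stats["power_mw"] = {
        name: (int(current), int(average))
        for name, current, average in re.findall(r"(\w+) (\d+)mW/(\d+)mW", line)
    }
    return stats


if __name__ == "__main__":
    print(parse_tegrastats_line(SAMPLE))
```

Returning a plain dictionary keeps the sketch easy to feed into a CSV writer or a plotting script.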
Additional options
For easier debugging, you can run tegrastats in the background and write its output to a log file for later review. The following command does both, writing the stats to <out_file> (for example, tegrastats.log):
tegrastats --logfile <out_file> &
To stop tegrastats, first find its PID:
ps | grep tegrastats
Then kill the PID shown in the output:
kill -9 <PID>
The following table lists more options that can help when evaluating performance.
| Description | Command option |
|---|---|
| Set the interval, in milliseconds, at which tegrastats writes output to the log. Default: 1000 ms. | --interval int |
| Print verbose messages | --verbose |
| Stop any running instances of tegrastats | --stop |
| Dump tegrastats output to <out_file>. | --logfile filename |
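Once a log file exists, it can be reduced to a few headline numbers. The sketch below assumes that each line of the log has the same format as the console output shown earlier; the function name `summarize_tegrastats_log` and the returned keys are hypothetical.

```python
import re
from statistics import mean


def summarize_tegrastats_log(lines):
    """Return the sample count, peak RAM use, and mean CPU load of a log."""
    ram_used = []
    cpu_loads = []
    for line in lines:
        ram = re.search(r"RAM (\d+)/\d+MB", line)
        if ram:
            ram_used.append(int(ram.group(1)))
        cpu = re.search(r"CPU\s*\[([^\]]*)\]", line)
        if cpu:
            # Average the load over the cores that are powered on.
            loads = [int(x) for x in re.findall(r"(\d+)%@", cpu.group(1))]
            if loads:
                cpu_loads.append(mean(loads))
    return {
        "samples": len(ram_used),
        "peak_ram_mb": max(ram_used) if ram_used else None,
        "mean_cpu_percent": mean(cpu_loads) if cpu_loads else None,
    }


# Example use, assuming the log was written to tegrastats.log:
#   with open("tegrastats.log") as log:
#       print(summarize_tegrastats_log(log))
```

Lines that do not match the expected patterns are skipped, so a partially written last line does not break the summary.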
GPU Usage
In JetPack 7.0 the GPU driver is OpenRM. To check GPU usage, use the following command:
watch -n 1 nvidia-smi
The -n flag of watch sets the refresh interval in seconds (1 second in the command above).
Nsight Systems
Nsight Systems profiling tool
Nsight Systems is a profiling tool from NVIDIA that helps analyze how applications use CPU and GPU resources. It can show how much time the CPU spends on different threads, how much work is sent to the GPU, and the overlap between them. On Jetson devices, this is useful to understand if an application is CPU-bound, GPU-bound, or limited by synchronization or memory transfers.
When running nsys profile, the tool produces two main outputs:
- .nsys-rep file – binary format for visualization in the Nsight Systems GUI.
- .sqlite file – database that can be queried directly to extract performance metrics with SQL or scripts.
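Because the .sqlite output is a regular SQLite database, a quick first step is to list which tables a given report contains, since the set of tables depends on what was traced. The sketch below uses only Python's standard `sqlite3` module; the helper name `list_tables` and the file name `report.sqlite` are illustrative.

```python
import sqlite3


def list_tables(conn):
    """Return the names of all tables in an open SQLite connection."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    return [name for (name,) in rows]


# Example use, assuming the report was exported as report.sqlite:
#   conn = sqlite3.connect("report.sqlite")
#   print(list_tables(conn))
#   conn.close()
```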
The Nsight Systems CLI profile command has the following structure:
nsys [global-options] profile [options] [application] [application-arguments]
In this wiki we use the following command to extract the CPU and GPU usage:
sudo nsys profile -t osrt,cuda,nvtx -s process-tree --cpuctxsw=process-tree --gpu-metrics-devices all --cuda-memory-usage true --stats true -o report <APPLICATION>
The flags used in this wiki are described below:
-s: Selects how CPU IP/backtrace samples are collected. Here it is set to process-tree.
-t: Selects the API(s) to be traced. Here it traces osrt (OS runtime libraries), CUDA, and NVTX.
--cpuctxsw: Traces OS thread scheduling activity.
--cuda-memory-usage: Tracks the GPU memory usage by CUDA kernels.
--gpu-metrics-devices: Collects GPU metrics from the specified devices.
--duration: Sets the collection duration in seconds. It is not used in the command above.
-o: Name of the report file.
--stats: Generates summary statistics after the collection and enables generation of the .sqlite file.
The output will contain the following content.
[4/8] Executing 'osrt_sum' stats report
Time (%) Total Time (ns) Num Calls Avg (ns) Med (ns) Min (ns) Max (ns) StdDev (ns) Name
-------- --------------- --------- --------------- --------------- ------------- -------------- --------------- ----------------------
52.7 64,307,798,779 2,714 23,694,841.1 21,058,185.5 1,019 15,000,136,322 288,070,828.8 futex
39.3 48,000,786,332 9 5,333,420,703.6 4,000,088,111.0 4,000,037,639 10,000,115,144 2,645,765,674.9 pthread_cond_timedwait
7.9 9,583,209,810 18,541 516,865.9 2,315.0 1,000 9,939,675 2,116,298.0 ioctl
For practical evaluation we usually focus on CPU usage and GPU usage.
CPU Usage
CPU usage can be obtained from the COMPOSITE_EVENTS table, using the cpuCycles field. This allows calculating CPU% per thread or per process. Alternatively, if cpuCycles is not available, CPU usage can be approximated from the SCHED_EVENTS table by summing the time spent in the “running” state.
After computing the total CPU usage, it is important to subtract the NSYS overhead value reported in the thread section to get the actual CPU consumption of the application.
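The cycles-based approach can be sketched as follows. This is an illustration under stated assumptions, not the wrapper script's implementation: it assumes COMPOSITE_EVENTS has `globalTid` and `cpuCycles` columns, that `globalTid` packs the PID in bits 24 to 47, and that CPU% is defined as each process's share of all sampled cycles. The function name `cpu_share_by_pid` is hypothetical.

```python
import sqlite3


def cpu_share_by_pid(conn):
    """Return {pid: percent of all sampled CPU cycles} from COMPOSITE_EVENTS."""
    rows = conn.execute(
        "SELECT globalTid, COALESCE(SUM(cpuCycles), 0) "
        "FROM COMPOSITE_EVENTS GROUP BY globalTid"
    ).fetchall()
    total = sum(cycles for _, cycles in rows)
    if total == 0:
        return {}
    shares = {}
    for global_tid, cycles in rows:
        # Assumption: the PID is stored in bits 24..47 of globalTid.
        pid = (global_tid >> 24) & 0xFFFFFF
        shares[pid] = shares.get(pid, 0.0) + 100.0 * cycles / total
    return shares


# Example use, assuming the report was exported as report.sqlite:
#   conn = sqlite3.connect("report.sqlite")
#   print(cpu_share_by_pid(conn))
```

Grouping by `globalTid` first and folding threads into PIDs afterwards means the same query can also serve a per-thread breakdown.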
GPU Usage
GPU usage can be estimated from the CUPTI_ACTIVITY_KIND_KERNEL and CUPTI_ACTIVITY_KIND_MEMSET tables. Summing (end - start) over these activities estimates how long the GPU was actively executing, and dividing that by the total wall time gives an approximate GPU utilization percentage. Note that if kernels overlap, the sum counts the overlapping time more than once, so this number may overestimate GPU usage.
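To make the overlap caveat concrete, the sketch below computes GPU busy time in two ways: the plain sum of (end - start), and the length of the union of the intervals, which counts overlapping time only once. It assumes both tables expose `start` and `end` columns in nanoseconds; the function names are hypothetical, and the wall time must be supplied by the caller because its definition is not fixed here.

```python
import sqlite3

GPU_TABLES = ("CUPTI_ACTIVITY_KIND_KERNEL", "CUPTI_ACTIVITY_KIND_MEMSET")


def gpu_intervals(conn):
    """Collect (start, end) pairs from the GPU activity tables that exist."""
    existing = {
        name
        for (name,) in conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table'"
        )
    }
    intervals = []
    for table in GPU_TABLES:
        if table in existing:
            intervals += conn.execute(f'SELECT start, "end" FROM {table}').fetchall()
    return intervals


def summed_ns(intervals):
    """Plain sum of durations; overlapping activities are counted twice."""
    return sum(end - start for start, end in intervals)


def merged_ns(intervals):
    """Length of the union of the intervals; overlaps are counted once."""
    busy = 0
    cur_start = cur_end = None
    for start, end in sorted(intervals):
        if cur_end is None or start > cur_end:
            if cur_end is not None:
                busy += cur_end - cur_start
            cur_start, cur_end = start, end
        else:
            cur_end = max(cur_end, end)
    if cur_end is not None:
        busy += cur_end - cur_start
    return busy


def gpu_percent(busy_ns, wall_ns):
    """Approximate GPU utilization as busy time over wall time."""
    return 100.0 * busy_ns / wall_ns
```

Comparing `summed_ns` with `merged_ns` on the same report gives a rough idea of how much the simple sum over-counts when kernels run concurrently.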
Analyzing the data
The NVIDIA Nsight Systems Performance Tool Wrapper includes a script that extracts the CPU and GPU usage, taking into account the information above.
To test it, run the script as follows:
./nsys_sqlite_perf.sh /path/to/out_report.sqlite --top 10
The output looks like the following:
===== CPU (from COMPOSITE_EVENTS.cpuCycles) =====
Top PIDs by CPU%:
pid cpu_util_percent
5965 76.19
5969 4.17
5931 2.38
5935 2.38
5939 2.38
5943 2.38
5975 2.38
5981 2.38
5949 1.79
5955 1.79
Top threads by CPU%:
pid tid cpu_util_percent thread_name
5965 5965 29.76 gst-launch-1.0
5965 5992 14.88 qtdemux0:sink
5965 5998 13.69 qtdemux0:sink
5965 5966 5.95 [NSys]
5965 5997 5.95 qtdemux0:sink
5969 5969 4.17 gst-plugin-scan
5931 5931 2.38 jq
5935 5935 2.38 jq
5939 5939 2.38 jq
5943 5943 2.38 jq
=================================================
====== Nsight SQLite Performance Summary ======
Wall time (s): 1.439531
CPU% source: cycles-based (see tables above)
GPU active (s): 0.045897
GPU% (approx): 3.19
(GPU% ≈ (kernel + memset time) / wall; overlapping kernels may over-count)
For the analysis, identify your application's PID in the Top threads by CPU% table (the thread names help here), then take the value for that same PID from the Top PIDs by CPU% table. For GPU usage, take the GPU% value from the Nsight SQLite Performance Summary.
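As a worked example with the numbers from the sample output, the short sketch below subtracts the profiler overhead from the application's PID and re-derives the GPU percentage. It assumes that the [NSys] thread row is the overhead value to subtract; the variable names are illustrative.

```python
# Values copied from the sample output above.
pid_cpu_percent = 76.19      # PID 5965 in "Top PIDs by CPU%"
nsys_thread_percent = 5.95   # [NSys] thread of PID 5965 (assumed overhead)

# CPU consumption attributed to the application itself.
app_cpu_percent = round(pid_cpu_percent - nsys_thread_percent, 2)

gpu_active_s = 0.045897      # "GPU active (s)"
wall_s = 1.439531            # "Wall time (s)"

# Same formula as in the summary: active time over wall time.
gpu_percent = round(100 * gpu_active_s / wall_s, 2)

print(f"Application CPU%: {app_cpu_percent}")
print(f"GPU% (approx): {gpu_percent}")
```

With these inputs the application CPU usage comes out at 70.24% and the GPU percentage at 3.19%, which matches the GPU% line in the summary.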