NVIDIA Jetson AGX Thor - Evaluating the Performance

From RidgeRun Developer Wiki












Evaluating Performance

As developers, we often need to measure CPU, GPU, and memory usage to budget resources for the applications that run in our projects. NVIDIA provides two tools for this on Jetson platforms: tegrastats, which reports these and other system metrics, and Nsight Systems, which profiles individual applications. This section provides an overview of how to use both.

Tegrastats

How to use tegrastats

tegrastats is a tool that reports memory usage, GPU usage, CPU usage, and SoC temperature for Tegra-based devices. It comes installed in the sample root filesystem provided in JetPack and is located in /usr/bin on the AGX Thor, which is in the system path, so it can be launched from any location in the filesystem. To launch tegrastats, run the binary with root privileges:

sudo tegrastats

This command prints the system stats once per second. An example of the expected output can be found in the next subsection. To customize the tegrastats behavior, the binary provides the following command-line options:

Usage: tegrastats [-option]
Options:
    --help                  : print this help screen
    --interval <millisec>   : sample the information in <milliseconds>
    --logfile  <filename>   : dump the output of tegrastats to <filename>
    --load_cfg <filename>   : load the information from <filename>
    --readall               : collect all stats including performance intensive stats
    --save_cfg <filename>   : save the information to <filename>
    --start                 : run tegrastats as a daemon process in the background
    --stop                  : stop any running instances of tegrastats
    --verbose               : print verbose message

Analyzing the output

Below is an example of the output printed by tegrastats on the AGX Thor.

08-28-2025 15:36:52 RAM 2028/125772MB (lfb 4x4MB) CPU[0%@972,0%@972,0%@972,0%@972,0%@972,0%@972,0%@972,0%@972,0%@972,0%@972,0%@972,0%@972,0%@972,0%@972] EMC_FREQ 0%@665 GR3D_FREQ @[0,0,0] NVENC0_FREQ @0 NVENC1_FREQ @0 NVDEC0_FREQ @0 NVDEC1_FREQ @0 NVJPG0_FREQ @0 VIC off OFA_FREQ @0 PVA0_FREQ off APE 300 cpu@39.593C tj@41.5C soc012@39.875C soc345@41.468C VDD_GPU 0mW/0mW VDD_CPU_SOC_MSS 5137mW/5137mW VIN_SYS_5V0 4313mW/4313mW

Table 1 provides a description to help interpret some of the metrics provided by tegrastats.


Table 1: tegrastats interpretation

| Format of Statistic | X | Y | Z |
|---|---|---|---|
| RAM X/Y (lfb NxZ) | Amount of RAM in use, in megabytes, e.g. 2028MB | Total amount of RAM available for applications, e.g. 125772MB | Size of the largest free memory block, in megabytes. N is the number of free blocks of this size, e.g. 4x4MB |
| SWAP X/Y (cached Z) | Amount of SWAP in use, in megabytes | Total amount of SWAP available for applications | Amount of SWAP cached |
| IRAM X/Y (lfb Z) | Amount of IRAM memory in use, in kilobytes | Total amount of IRAM memory available | Size of the largest free block of IRAM memory |
| CPU [X%,X%,...]@Z | Load on each CPU core relative to the current running frequency Z, or off if a core is currently powered down | - | CPU frequency in megahertz. Goes up or down dynamically depending on the CPU workload |
| EMC_FREQ X%@Y | Percentage of EMC memory bandwidth in use relative to the current running frequency | EMC frequency in megahertz | - |
| GR3D_FREQ X%@[Y, Y, ...] | - | Frequency of each of the GPU's GPCs in megahertz | - |
| VIC X%@Y | VIC engine loading as a percentage of current VIC engine frequency | Current VIC engine frequency | - |
| APE Y | - | APE frequency in megahertz | - |
| X@YC | Processor block name | Processor block temperature in degrees Celsius | - |
| VDD_X YmW/ZmW | Name of the power rail | Block's current power consumption in milliwatts | Block's average power consumption in milliwatts |
| NVENC0 X%@Y | - | NVENC frequency in megahertz | - |
| NVDEC0 X%@Y | - | NVDEC frequency in megahertz | - |
| NVDLA0 @Y | - | NVDLA frequency in megahertz | - |
| NVJPG0 X%@Y | - | NVJPG frequency in megahertz | - |
| PVA0_FREQ [X%,X%]@Y | PVA0_VPU0 and PVA0_VPU1 utilization (VPU is the vector processing unit) | PVA frequency in megahertz | - |
| OFA X%@Y | OFA utilization | OFA frequency in megahertz | - |
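To make the fields above easier to post-process, the following is a minimal Python sketch that pulls a few of them out of one tegrastats line. It assumes the line layout shown in the AGX Thor sample output (per-core `X%@Z` entries inside `CPU[...]`); the function name `parse_tegrastats_line`, the dictionary keys, and the shortened `SAMPLE` line are illustrative choices and are not part of tegrastats.

```python
import re

# Shortened copy of the sample line shown earlier (only two CPU cores kept).
SAMPLE = (
    "08-28-2025 15:36:52 RAM 2028/125772MB (lfb 4x4MB) "
    "CPU[0%@972,0%@972] EMC_FREQ 0%@665 GR3D_FREQ @[0,0,0] "
    "cpu@39.593C tj@41.5C soc012@39.875C soc345@41.468C "
    "VDD_GPU 0mW/0mW VDD_CPU_SOC_MSS 5137mW/5137mW VIN_SYS_5V0 4313mW/4313mW"
)


def parse_tegrastats_line(line):
    """Extract RAM, per-core CPU, temperature, and power fields from one line."""
    stats = {}

    # RAM X/YMB: used and total memory in megabytes.
    ram = re.search(r"RAM (\d+)/(\d+)MB", line)
    if ram:
        stats["ram_used_mb"] = int(ram.group(1))
        stats["ram_total_mb"] = int(ram.group(2))

    # CPU[...]: one (load %, frequency MHz) tuple per core.
    # Entries that do not look like X%@Z (for example "off") become None.
    cpu = re.search(r"CPU\s*\[([^\]]*)\]", line)
    if cpu:
        cores = []
        for entry in cpu.group(1).split(","):
            m = re.match(r"(\d+)%@(\d+)", entry.strip())
            cores.append((int(m.group(1)), int(m.group(2))) if m else None)
        stats["cpu_cores"] = cores

    # name@tempC: block temperatures in degrees Celsius.
    stats["temps_c"] = {
        name: float(value)
        for name, value in re.findall(r"(\w+)@(-?\d+(?:\.\d+)?)C\b", line)
    }

    # RAIL curmW/avgmW: (current, average) power in milliwatts per rail.
    stats["power_mw"] = {
        name: (int(cur), int(avg))
        for name, cur, avg in re.findall(r"(\w+) (\d+)mW/(\d+)mW", line)
    }
    return stats


stats = parse_tegrastats_line(SAMPLE)
print(stats["ram_used_mb"], stats["temps_c"]["tj"], stats["power_mw"]["VIN_SYS_5V0"])
```

The same function can be applied line by line to a file written with --logfile. Fields not covered here (for example EMC_FREQ or the engine frequencies) would need their own patterns.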

Additional options

For easier debugging, you can run tegrastats in the background and write its output to a log file for later review. The following command does both, writing to the file given as <out_file> (for example, tegrastats.log):

tegrastats --logfile <out_file> &

To stop tegrastats, first find its process ID (alternatively, the --stop option listed above stops any running instances):

ps | grep tegrastats

Then kill the PID shown in the output:

kill -9  <PID>

Table 2 lists more options that can help with performance evaluation.


Table 2: Command options for tegrastats

| Description | Command option |
|---|---|
| Set the interval, in milliseconds, at which tegrastats writes output to the log. Default: 1000 ms. | --interval int |
| Print verbose messages | --verbose |
| Stop any running instances of tegrastats | --stop |
| Dump tegrastats output to <out_file>. | --logfile filename |
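When a measurement has to cover a fixed time window, the options above can be combined from a script. The following is a hedged Python sketch that builds the tegrastats command line from the --interval, --logfile, and --verbose options and runs it for a given number of seconds. The helper names `build_tegrastats_cmd` and `log_for` are hypothetical, and the sketch assumes tegrastats is in the path and can be run by the current user (prefix the command with sudo otherwise).

```python
import subprocess
import time


def build_tegrastats_cmd(interval_ms=1000, logfile=None, verbose=False):
    """Build the argument list using only the options listed in Table 2."""
    cmd = ["tegrastats", "--interval", str(interval_ms)]
    if logfile:
        cmd += ["--logfile", logfile]
    if verbose:
        cmd.append("--verbose")
    return cmd


def log_for(seconds, interval_ms=500, logfile="tegrastats.log"):
    """Run tegrastats for roughly `seconds`, then stop it (Jetson only)."""
    proc = subprocess.Popen(
        build_tegrastats_cmd(interval_ms, logfile),
        stdout=subprocess.DEVNULL,  # discard console output; the log file is the result
    )
    try:
        time.sleep(seconds)
    finally:
        proc.terminate()  # sends SIGTERM to the tegrastats process we started
        proc.wait()


print(" ".join(build_tegrastats_cmd(500, "tegrastats.log")))
```

On the board, calling `log_for(10)` would sample every 500 ms for about ten seconds and leave the samples in tegrastats.log. Because the script keeps the process handle, it does not need the ps and kill steps described above.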

GPU Usage

In JetPack 7.0 the GPU driver is OpenRM. To check GPU usage, use the following command:

watch -n 1 nvidia-smi

The -n flag of watch sets the refresh interval in seconds, so you can change its value to sample at a different rate.

Nsight Systems

Nsight Systems profiling tool

Nsight Systems is a profiling tool from NVIDIA that helps analyze how applications use CPU and GPU resources. It can show how much time the CPU spends on different threads, how much work is sent to the GPU, and the overlap between them. On Jetson devices, this is useful to understand if an application is CPU-bound, GPU-bound, or limited by synchronization or memory transfers.

When running nsys profile, the tool produces two main outputs:

  • .nsys-rep file – binary format for visualization in the Nsight Systems GUI.
  • .sqlite file – database that can be queried directly to extract performance metrics with SQL or scripts.

The Nsight Systems CLI profile command has the following structure:

nsys [global-options] profile [options] [application] [application-arguments]

In this wiki we use the following command to extract the CPU and GPU usage:

sudo nsys profile -t osrt,cuda,nvtx -s process-tree --cpuctxsw=process-tree --gpu-metrics-devices all --cuda-memory-usage true --stats true  -o report <APPLICATION>

The flags used in the command are described below:

-s: Selects how CPU IP/backtrace samples are collected. Set here to process-tree.

-t: Selects the API(s) to be traced. Set here to trace osrt (OS runtime libraries), CUDA, and NVTX.

--cpuctxsw: Traces OS thread scheduling activity.

--cuda-memory-usage: Tracks the GPU memory usage by CUDA kernels.

--gpu-metrics-devices: Collects GPU metrics from the specified devices.

--duration: Sets the collection duration in seconds. It is optional and is not used in the command above.

-o: Name of the report file.

--stats: Generates summary statistics after the collection. It also enables the generation of the .sqlite file.


The output includes summary reports such as the following:

[4/8] Executing 'osrt_sum' stats report

 Time (%)  Total Time (ns)  Num Calls     Avg (ns)         Med (ns)        Min (ns)        Max (ns)       StdDev (ns)             Name         
 --------  ---------------  ---------  ---------------  ---------------  -------------  --------------  ---------------  ----------------------
     52.7   64,307,798,779      2,714     23,694,841.1     21,058,185.5          1,019  15,000,136,322    288,070,828.8  futex                 
     39.3   48,000,786,332          9  5,333,420,703.6  4,000,088,111.0  4,000,037,639  10,000,115,144  2,645,765,674.9  pthread_cond_timedwait
      7.9    9,583,209,810     18,541        516,865.9          2,315.0          1,000       9,939,675      2,116,298.0  ioctl


Documentation
More switch configurations and options can be found in the wiki


For practical evaluation we usually focus on CPU usage and GPU usage.

CPU Usage

CPU usage is computed from the COMPOSITE_EVENTS table, using the cpuCycles field, which allows calculating CPU% per thread or per process. If cpuCycles is not available, CPU usage can be approximated from the SCHED_EVENTS table by summing the time spent in the “running” state.

After computing the total CPU usage, it is important to subtract the Nsight Systems overhead reported in the thread section to get the actual CPU consumption of the application.
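The per-process calculation can be sketched with Python's sqlite3 module. This is a minimal example, not the wrapper script's actual implementation: it assumes the COMPOSITE_EVENTS table exposes globalTid and cpuCycles columns and that the PID sits in bits 24 to 47 of globalTid, as described in the Nsight Systems exporter documentation. The function name `cpu_percent_by_pid` is illustrative, and the result is each process's share of all sampled cycles.

```python
import sqlite3


def cpu_percent_by_pid(conn, top=10):
    """Return (pid, percent of all sampled CPU cycles) rows, highest first."""
    query = """
        SELECT (globalTid >> 24) & 0xFFFFFF AS pid,
               ROUND(100.0 * SUM(cpuCycles)
                     / (SELECT SUM(cpuCycles) FROM COMPOSITE_EVENTS), 2)
                   AS cpu_util_percent
        FROM COMPOSITE_EVENTS
        GROUP BY pid
        ORDER BY cpu_util_percent DESC
        LIMIT ?
    """
    return conn.execute(query, (top,)).fetchall()


# Tiny synthetic database standing in for an exported report (made-up values).
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE COMPOSITE_EVENTS (globalTid INTEGER, cpuCycles INTEGER)")
demo.executemany(
    "INSERT INTO COMPOSITE_EVENTS VALUES (?, ?)",
    [((100 << 24) | 100, 300), ((100 << 24) | 101, 100), ((200 << 24) | 200, 100)],
)
for pid, percent in cpu_percent_by_pid(demo):
    print(pid, percent)
```

With a real report, `sqlite3.connect("report.sqlite")` would replace the in-memory database. Grouping by the full globalTid instead of the PID gives the per-thread view, from which the Nsight Systems overhead thread can be identified and subtracted.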


Documentation
More information on how to extract these values can be found in the Post-Collection Analysis Guide


GPU Usage

GPU usage is estimated from the CUPTI_ACTIVITY_KIND_KERNEL and CUPTI_ACTIVITY_KIND_MEMSET tables. Summing (end - start) over these activities estimates how long the GPU was actively executing, and dividing that by the total wall time gives an approximate GPU utilization percentage. Note that if kernels overlap, this number may overestimate GPU usage.
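The estimate above, and the effect of overlapping kernels, can be illustrated with a short Python sketch. It assumes both CUPTI tables expose start and end columns in nanoseconds and that a table may be absent when no such activity was recorded. The function names are illustrative, and the wall time is passed in by the caller because how it is measured is left open here. `merged_time` is an optional variation, not something the text above prescribes, that merges overlapping intervals so they are counted once.

```python
import sqlite3

ACTIVITY_TABLES = ("CUPTI_ACTIVITY_KIND_KERNEL", "CUPTI_ACTIVITY_KIND_MEMSET")


def gpu_intervals(conn):
    """Collect (start, end) pairs from whichever activity tables exist."""
    existing = {
        row[0]
        for row in conn.execute("SELECT name FROM sqlite_master WHERE type='table'")
    }
    intervals = []
    for table in ACTIVITY_TABLES:
        if table in existing:
            # "end" is an SQL keyword, so the column name is quoted.
            intervals += conn.execute(f'SELECT start, "end" FROM {table}').fetchall()
    return intervals


def summed_time(intervals):
    """Plain sum of (end - start); overlapping activities are counted twice."""
    return sum(end - start for start, end in intervals)


def merged_time(intervals):
    """Total time covered by the intervals, counting overlaps only once."""
    total, cur_start, cur_end = 0, None, None
    for start, end in sorted(intervals):
        if cur_end is None or start > cur_end:
            if cur_end is not None:
                total += cur_end - cur_start
            cur_start, cur_end = start, end
        else:
            cur_end = max(cur_end, end)
    if cur_end is not None:
        total += cur_end - cur_start
    return total


def gpu_percent(active_ns, wall_ns):
    """Approximate GPU utilization as active time over wall time."""
    return 100.0 * active_ns / wall_ns


# Synthetic data: two overlapping kernels and one separate kernel (made-up values).
demo = sqlite3.connect(":memory:")
demo.execute('CREATE TABLE CUPTI_ACTIVITY_KIND_KERNEL (start INTEGER, "end" INTEGER)')
demo.executemany(
    "INSERT INTO CUPTI_ACTIVITY_KIND_KERNEL VALUES (?, ?)",
    [(0, 10), (5, 15), (20, 30)],
)
iv = gpu_intervals(demo)
print(gpu_percent(summed_time(iv), 100), gpu_percent(merged_time(iv), 100))
```

In this synthetic case the plain sum reports more active time than the merged version, which is the over-count the note above warns about.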

Analyzing the data

The NVIDIA Nsight Systems Performance Tool Wrapper includes a script that extracts the CPU and GPU usage, taking into account the information above.

To test it, run the script as follows:

./nsys_sqlite_perf.sh /path/to/out_report.sqlite --top 10

The output looks like the following:

===== CPU (from COMPOSITE_EVENTS.cpuCycles) =====
Top PIDs by CPU%:
pid   cpu_util_percent
5965  76.19
5969  4.17
5931  2.38
5935  2.38
5939  2.38
5943  2.38
5975  2.38
5981  2.38
5949  1.79
5955  1.79

Top threads by CPU%:
pid   tid   cpu_util_percent  thread_name
5965  5965  29.76             gst-launch-1.0
5965  5992  14.88             qtdemux0:sink
5965  5998  13.69             qtdemux0:sink
5965  5966  5.95              [NSys]
5965  5997  5.95              qtdemux0:sink
5969  5969  4.17              gst-plugin-scan
5931  5931  2.38              jq
5935  5935  2.38              jq
5939  5939  2.38              jq
5943  5943  2.38              jq
=================================================

====== Nsight SQLite Performance Summary ======
Wall time (s):        1.439531
CPU% source:         cycles-based (see tables above)
GPU active (s):       0.045897
GPU% (approx):       3.19
  (GPU% ≈ (kernel + memset time) / wall; overlapping kernels may over-count)

For the analysis, find your application's PID in the Top threads by CPU% table and take the CPU value for that PID from the Top PIDs by CPU% table. For GPU usage, take the GPU% (approx) value from the Nsight SQLite Performance Summary.
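As a worked example with the sample output above, the following Python sketch reads the values for the gst-launch-1.0 process (PID 5965). It assumes the Nsight Systems overhead to subtract is the [NSys] thread listed under that PID, which is one reading of the overhead rule described earlier. The variable names are illustrative.

```python
# Values copied from the sample output above.
pid_cpu_percent = 76.19        # PID 5965 in "Top PIDs by CPU%"
nsys_overhead_percent = 5.95   # [NSys] thread under PID 5965 (assumed overhead)
gpu_active_s = 0.045897        # "GPU active (s)"
wall_s = 1.439531              # "Wall time (s)"

# CPU usage of the application once the profiler's own thread is removed.
app_cpu_percent = round(pid_cpu_percent - nsys_overhead_percent, 2)

# GPU% is (kernel + memset time) / wall time, as stated in the summary.
gpu_percent = round(100.0 * gpu_active_s / wall_s, 2)

print(app_cpu_percent, gpu_percent)  # prints: 70.24 3.19
```

The GPU figure matches the GPU% (approx) line of the summary, and the CPU figure is the per-PID value reduced by the profiler's own thread.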


