Exploring NVIDIA Jetson Orin Nano Super Mode performance using Generative AI

From RidgeRun Developer Wiki




NVIDIA Jetson Orin Nano Super Mode

Introduction

RidgeRun is always looking for the newest technologies to improve the services that we offer to our clients and the community by staying ahead of industry trends. So, in this post, we are exploring the new NVIDIA Jetson Orin Nano™ Super Developer Kit and its potential by running some of NVIDIA's Jetson Generative AI examples.

This post explores the Text Generation and Image Generation tutorials from the NVIDIA Jetson AI Lab. You will find CPU, GPU, and memory performance metrics collected while running the examples in the 15-watt power mode versus the MAX power mode, that is, the Super Developer Kit mode, which operates at 25 watts.

Furthermore, at the end of this wiki, you can find the instructions on how to enable the Jetson Orin Nano Super Mode Configuration and run the Text Generation with ollama client and Stable Diffusion Image Generation tutorials.

Environment

The tutorials explored in this post are Text Generation and Image Generation. They were executed on a Jetson Orin Nano 8 GB with a 1 TB M.2 NVMe SSD. This Jetson Orin has an Ampere architecture GPU and 6 Arm CPU cores.

The JetPack used is version 6.1, which is the latest JetPack 6 version delivered by NVIDIA as of the date of writing this post and includes the Orin Nano Super Developer Kit support.

The Jetson Orin supports different power modes. Two modes were used for this blog: the 15-watt mode and the maximum power mode, which has no power budget and is related to the Super Developer Kit release.

Super Developer Kit mode capabilities

A comparison of the Jetson Orin Nano Super Developer Kit configuration with the original Developer Kit can be found in the following table, adapted from [1] and [2]:

                  Original NVIDIA Jetson Orin Nano          NVIDIA Jetson Orin Nano Super
                  Developer Kit                             Developer Kit

GPU               NVIDIA Ampere architecture, 635 MHz       NVIDIA Ampere architecture, 1,020 MHz

AI Performance    40 INT8 TOPS (Sparse)                     67 TOPS (Sparse)
                  20 INT8 TOPS (Dense)                      33 TOPS (Dense)
                  10 FP16 TFLOPs                            17 FP16 TFLOPs

CPU               6-core Arm Cortex-A78AE v8.2 64-bit CPU   6-core Arm Cortex-A78AE v8.2 64-bit CPU
                  1.5 GHz                                   1.7 GHz

Memory            8GB 128-bit LPDDR5, 68 GB/s               8GB 128-bit LPDDR5, 102 GB/s

Module Power      15W                                       15W | 25W

Table 1. Jetson Orin Nano Super Developer Kit capabilities comparison


Some of these capabilities can be checked by executing jetson_clocks and tegrastats on the board. The following output corresponds to the Original NVIDIA Jetson Orin Nano Developer Kit:

RAM 5933/7620MB (lfb 21x1MB) SWAP 5/3810MB (cached 0MB) CPU [0%@1497,0%@1497,0%@1497,0%@1497,0%@1497,0%@1497] EMC_FREQ 0%@2133 GR3D_FREQ 0%@[611] NVDEC off NVJPG off NVJPG1 off VIC off OFA off APE 200 cpu@49.906C soc2@48.875C soc0@49.5C gpu@49.468C tj@50.062C soc1@50.062C VDD_IN 5313mW/5313mW VDD_CPU_GPU_CV 943mW/943mW VDD_SOC 1731mW/1731mW

On the other hand, the next output is for the NVIDIA Jetson Orin Nano Super Developer Kit:

RAM 929/7620MB (lfb 3x4MB) SWAP 0/3810MB (cached 0MB) CPU [0%@1728,0%@1728,0%@1728,0%@1728,0%@1728,0%@1728] EMC_FREQ 0%@3199 GR3D_FREQ 0%@[1007] NVDEC off NVJPG off NVJPG1 off VIC off OFA off APE 200 cpu@51.531C soc2@50.5C soc0@50.812C gpu@50.718C tj@51.531C soc1@51.468C VDD_IN 6848mW/6848mW VDD_CPU_GPU_CV 1495mW/1495mW VDD_SOC 2637mW/2637mW

Notice that the CPU frequency increased from approximately 1.5 GHz to 1.7 GHz, an improvement of 11.76%, and the GPU frequency went from 611 MHz to 1007 MHz, an improvement of 39.32%.
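As a side note, these improvement percentages appear to be computed relative to the boosted (new) frequency. A minimal Python check that reproduces them:

# Improvement computed relative to the new (boosted) frequency.
def improvement(old, new):
    return (new - old) / new * 100

print(f"CPU: {improvement(1.5, 1.7):.2f} %")   # ~11.76 %
print(f"GPU: {improvement(611, 1007):.2f} %")  # ~39.32 %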

Results summary

Detailed instructions on how to enable the NVIDIA Jetson Orin Nano Super Developer Kit mode and run the NVIDIA Generative AI tutorials can be found later in this wiki.

In addition, the detailed performance metrics can be taken from the figures in the next sections and are summarized in the following bullet points:

  • The duration of the text generation output decreased by 26.74% when using the Super Developer Kit mode, while the tokens/s increased by 31.38% in comparison with the 15-watt mode.
  • The GPU average usage increased by 1.25% when using the MAX power mode, and its frequency during the inference reached a maximum value of 1015 MHz, while the CPU and memory usage decreased by 0.2% and 15.23% on average, respectively, when testing the text generation example.
  • The duration of the image generation decreased by 34.37% when using the Super Developer Kit mode, while the iterations/s increased by 34.63% in comparison with the 15-watt mode.
  • For image generation, the GPU average usage increased by 5.07% when using the MAX power mode in comparison with the 15-watt mode, and its frequency during the inference reached a maximum value of 1016 MHz. The CPU usage increased by 1.36%, while the memory usage decreased by 0.37% on average.
  • Also, by running the bandwidthTest sample in shmoo mode and taking the Device to Device Bandwidth, the test reported 101.9 GB/s in the Super Developer Kit mode.

Results and Performance

One of the main purposes of this post is to explore the performance that can be achieved for generative models when using the NVIDIA Jetson Orin Nano Super Developer Kit and compare it with the 15-watt mode. There are plenty of metrics that can be extracted from these executions; the ones measured here are:

  • CPU usage
  • GPU usage
  • RAM usage
  • Tokens/s or iterations/s

The first three parameters are obtained by parsing the output of the NVIDIA tegrastats utility and plotting the measurements to get a better understanding of resource consumption, as sketched below. The execution time is reported by the application itself.
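The exact profiling scripts are not part of this wiki; the following is a minimal Python sketch, assuming tegrastats produces lines in the format shown earlier, of how CPU, GPU, and RAM usage percentages can be extracted from its output:

import re
import subprocess

def parse_tegrastats_line(line):
    """Extract RAM, average CPU, and GPU (GR3D) usage from one tegrastats line."""
    ram = re.search(r"RAM (\d+)/(\d+)MB", line)
    cpu_block = re.search(r"CPU \[([^\]]+)\]", line).group(1)
    # Each entry looks like "0%@1497"; offline cores may report "off".
    cpu_loads = [int(c.split("%")[0]) for c in cpu_block.split(",") if "%" in c]
    gpu = re.search(r"GR3D_FREQ (\d+)%", line)
    return {
        "ram_pct": 100.0 * int(ram.group(1)) / int(ram.group(2)),
        "cpu_pct": sum(cpu_loads) / len(cpu_loads),
        "gpu_pct": int(gpu.group(1)),
    }

# Sample tegrastats once per second and print the parsed usage values.
proc = subprocess.Popen(["tegrastats", "--interval", "1000"],
                        stdout=subprocess.PIPE, text=True)
for line in proc.stdout:
    print(parse_tegrastats_line(line))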

Text Generation with ollama client

The text generation model was tested on its capability to generate a response on what approach to follow to prevent race conditions in an application. The following prompt was used:

“What approach would you use to detect and prevent race conditions in a multithreaded application?”

This is one example of the kind of question the user can ask the AI model.

15 watts mode results

This request took 49.97 seconds to complete, and the output was generated at a rate of 9.84 tokens/s when using the 15-watt mode.

This same execution was used to measure performance and profile resource usage for the model on the Jetson. The resources of interest are GPU, CPU and RAM consumption. The following plots show a summary of the performance achieved by the model.

This image shows the GPU, CPU and Memory usage when running the text generation with ollama client example at 15 watts power mode. During the execution, the GPU reached almost 99% usage. The CPU had a spike of 30% at the beginning but the usage was lower than 10% after the spike. The memory reached 86.50% usage when the execution finished.

These plots show the impact on CPU, GPU, and memory usage when the model starts. On the GPU, usage reached an average of 93.88% during inference. On the CPU, however, we see a smaller increase, with an average usage of 4.45% and some spikes during inference execution. Regarding memory, the average usage was 86.35% with no significant change, a variation of approximately 2% during the inference; this is because the model was already loaded beforehand.

MAX mode results

When using the MAX mode, the request output was generated at a rate of 14.36 tokens/s and took 36.61 seconds to complete.

The resource usage plots that summarize the performance are the following:

This image shows the GPU, CPU and Memory usage when running the text generation with ollama client example at MAX power mode. During the execution, the GPU reached almost 99% usage. The CPU had a spike of 30% at the beginning but the usage was lower than 10% after the spike. The memory reached 72.75% usage when the execution finished.

The GPU usage reached an average of 96.41% during inference. On the CPU side, we see an average usage of 3.74%. Regarding memory usage, the average was 72.75%, and the usage increased by approximately 1.75% during the inference; this is because the model was already loaded beforehand.

We can see that the duration of the text generation output decreased by 26.74% when using the Super Developer Kit mode, while the tokens/s increased by 31.38% in comparison with the 15-watt mode.

On the resource consumption side, the GPU average usage increased by 2.53% when using the MAX power mode, and the GPU reached a frequency of 1015 MHz, while the CPU and memory usage decreased by 0.71% and 13.6% on average, respectively.

Image Generation

The Image Generation tutorial generates an image based on a text input prompt. Here is the prompt used to create the image and measure resource consumption:

Futuristic city with sunset, high quality, 4K image

15-Watts mode results

For the example executed in this mode, the model was loaded in 12.8 s and the image was generated in 16 seconds at a rate of 1.34 iterations/s.

Performance and resource usage for the model were measured during this same execution on the Jetson. The following plots show a summary of the performance achieved in this power mode.

This image shows the GPU, CPU and Memory usage when running the image generation example at 15 watts power mode. During the execution, the GPU reached almost 99% usage. The CPU started at approximately 70% usage and then stayed below 60% until the image was generated. The memory reached approximately 82% usage when the execution finished.

These plots show the CPU, GPU and memory usage percentages when the tutorial is run. The GPU usage reached an average of 93.93% during inference. On the CPU side, the average was 43.94%. The memory average usage was 81.13%, with an increase of 5.2%; this is because the model was already loaded beforehand.

MAX mode results

On the Super Developer Kit mode side, the model was loaded in 9.4 s and the image was generated in 10.5 seconds at a rate of 2.05 iterations per second.

The plots that summarize the resource consumption are shown as follows.

This image shows the GPU, CPU and Memory usage when running the image generation example at MAX power mode. During the execution, the GPU reached 99% usage. The CPU had some spikes greater than 50% of usage value during the execution. The memory reached approximately 81% usage when the execution finished.

The GPU usage reached an average of 99% during inference. On the CPU side, the average was 45.3%. The memory usage increased by approximately 1%, since the model was already loaded beforehand, while the average usage was 80.76%.

We can see that the duration of the image generation decreased by 34.37% when using the Super Developer Kit mode, while the iterations/s increased by 34.63% in comparison with the 15-watt mode.

On the resource consumption side, the GPU average usage increased by 5.07% when using the MAX power mode in comparison with the 15-watt mode, and its frequency reached a maximum value of 1016 MHz. The CPU usage increased by 1.36%, while the memory usage decreased by 0.37% on average.

Conclusions

With the results presented in this blog, the NVIDIA Jetson Orin Nano Super Developer Kit mode was successfully enabled and some Generative AI examples were executed.

On the performance metrics side, when using the Super Developer Kit mode, the text generation duration and the tokens/s improved by 26.74% and 31.38%, respectively, while the GPU usage increased by 1.25% with a maximum frequency of 1015 MHz reached during the inference. The CPU and memory usage decreased by 0.2% and 15.23% on average, respectively.

For the image generation example in the Super Developer Kit mode, the iterations/s and the generation duration improved by 34.63% and 34.37%, respectively. The GPU average usage increased by 5.07%, with a maximum frequency of 1016 MHz reached during the inference. Also, the CPU usage increased by 1.36%, while the memory usage decreased by 0.37% on average.

Furthermore, the bandwidthTest sample was executed in shmoo mode to measure the Device to Device memory bandwidth. The test reported 101.9 GB/s in the Super Developer Kit mode. In addition to this, the CPU frequency increased from approximately 1.5 GHz to 1.7 GHz, an improvement of 11.76%, and the GPU frequency went from 611 MHz to 1007 MHz, an improvement of 39.32%.

Finally, at RidgeRun we see an improvement when enabling the NVIDIA Jetson Orin Nano Super Developer Kit mode and running the Jetson Generative AI examples. There is still a lot more to explore in depth to adapt this technology to custom applications, but this survey allowed us to explore the new NVIDIA Jetson Orin Nano capabilities. Watch out for more blogs on the topic and technical documentation in our developer's wiki.

Jetson Orin Nano Super mode enabling and Generative AI running instructions

Find below the instructions to enable the Jetson Orin Nano Super mode configuration and the guidance on running the Jetson Generative AI Labs tutorials.

Enable the Super Mode Configuration

Step 1: Download the sources from the SDKManager

Download the sources from the SDKManager. Please, make sure that the configuration shown in the following image is being used:

This image shows SDKmanager configuration for downloading the JetPack 6.1 Orin Nano Super Developer Kit mode sources


The user can enable the Additional SDKs fields if they are needed.

Step 2: Define the Environment Variables

Make sure to run the following commands in the terminal that you will use to flash the Orin Nano.

export JETPACK=$HOME/nvidia/nvidia_sdk/JetPack_6.1_Linux_JETSON_ORIN_NANO_8GB_DEVKIT/Linux_for_Tegra

Step 3: Set Board in Recovery Mode

In order to flash the Orin, we must set it in recovery mode so that it can accept the files. The procedure to put the Orin in recovery mode is:

  1. Turn off the Jetson Orin Nano devkit and disconnect it from the power supply.
  2. Get a USB-A to USB-C or USB-C to USB-C cable and connect the USB-C end to your Orin Nano and the USB-A or USB-C end to your computer (the computer where you installed JetPack).
  3. Connect a jumper between the FC-REC pin and any GND pin.
  4. Connect the power supply.

At this point, the Orin Nano should be in recovery mode. To verify, you can run the following command on your host computer:

lsusb

If the Orin is in recovery mode, you should see a line similar to the following among the command output:

Bus 001 Device 011: ID 0955:7523 NVidia Corp

Step 4: Apply binaries

These commands are intended to prepare the files for a correct flash.

cd $JETPACK
sudo ./apply_binaries.sh
sudo ./tools/l4t_flash_prerequisites.sh

Step 5 (Optional): Create a Default User

The default user/password can be configured after flashing if you can connect a display, keyboard, and mouse to the Orin Nano to complete the OEM configuration during the first boot. If you wish to save time and create a default user/password for your Orin before flashing, you can execute the commands below; make sure to replace <user_name> and <password> with a username and password of your choice.

 cd $JETPACK/tools
 sudo ./l4t_create_default_user.sh -u <user_name> -p <password>

Step 6: Execute the Flash Script

NVIDIA provides a script for flashing the Orin in the Jetpack directory. This script takes two arguments, the target board, and the root device:

The following subsections show the command to flash each of the storage options available.

Option #1: NVMe

Execute the following commands:

cd $JETPACK
sudo ./tools/kernel_flash/l4t_initrd_flash.sh --external-device nvme0n1p1 \
  -c tools/kernel_flash/flash_l4t_t234_nvme.xml -p "-c bootloader/generic/cfg/flash_t234_qspi.xml" \
  --showlogs --network usb0 jetson-orin-nano-devkit-super internal

The flashing process will take a while, and should print a success message if the flash is finished successfully. After the flash script finishes successfully, the Orin will boot automatically.

Option #2: SD card

Execute the following commands:

cd $JETPACK
sudo ./tools/kernel_flash/l4t_initrd_flash.sh --external-device mmcblk0p1 \
  -c tools/kernel_flash/flash_l4t_t234_nvme.xml -p "-c bootloader/generic/cfg/flash_t234_qspi.xml" \
  --showlogs --network usb0 jetson-orin-nano-devkit-super internal

The flashing process will take a while and should print a success message if the flash is finished successfully. After the flash script finishes successfully, the Orin will boot automatically.

Option #3: USB Drive

sudo ./tools/kernel_flash/l4t_initrd_flash.sh --external-device sda1 \
  -c tools/kernel_flash/flash_l4t_t234_nvme.xml -p "-c bootloader/generic/cfg/flash_t234_qspi.xml" \
  --showlogs --network usb0 jetson-orin-nano-devkit-super internal

The flashing for any of the methods used above should finish by showing the following messages:

Flash is successful
Reboot device
Cleaning up...

Testing the MaxN mode

A way to confirm that the board was flashed with the super mode is to check the /etc/nvpmodel.conf file. Based on Supported Modes and Power Efficiency, the MaxN mode is available only when flashing with the `jetson-orin-nano-devkit-super` configuration. This file shows:

POWER_MODEL ID=2 NAME=MAXN

To change to the MaxN mode, execute the following command:

sudo /usr/sbin/nvpmodel -m 2

Then reboot the board and check it as:

sudo /usr/sbin/nvpmodel -q
NV Power Mode: MAXN
2

Also, tegrastats can be run and checked as:

sudo tegrastats --readall

RAM 929/7620MB (lfb 3x4MB) SWAP 0/3810MB (cached 0MB) CPU [0%@1728,0%@1728,0%@1728,0%@1728,0%@1728,0%@1728] EMC_FREQ 0%@3199 GR3D_FREQ 0%@[1007] NVDEC off NVJPG off NVJPG1 off VIC off OFA off APE 200 cpu@51.531C soc2@50.5C soc0@50.812C gpu@50.718C tj@51.531C soc1@51.468C VDD_IN 6848mW/6848mW VDD_CPU_GPU_CV 1495mW/1495mW VDD_SOC 2637mW/2637mW

Notice that the CPU frequency reported is 1.7 GHz while the GPU frequency is 1007 MHz, which are the values reported in NVIDIA Jetson Orin Nano Developer Kit Gets a “Super” Boost.

Furthermore, cuda-samples/Samples/1_Utilities/bandwidthTest can be executed to measure the memcopy bandwidth of the GPU and memcpy bandwidth across PCI-e. To execute this, follow the next instructions.

Clone the cuda-samples repository.

git clone https://github.com/NVIDIA/cuda-samples.git

Compile the bandwidthTest sample.

cd cuda-samples/Samples/1_Utilities/bandwidthTest/
make

Maximize jetson performance by running the jetson_clocks script.

sudo jetson_clocks

Execute the test as follows to get more detailed information regarding the memory bandwidth.

./bandwidthTest --mode=shmoo --device=all

Check the Device to Device Bandwidth. For this example, we got:

   1000000			101.9

Find the full output below.


./bandwidthTest --mode=shmoo --device=all
[CUDA Bandwidth Test] - Starting...

!!!!!Cumulative Bandwidth to be computed from all the devices !!!!!!

Running on...

 Device 0: Orin
 Shmoo Mode

.................................................................................
 Host to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)	Bandwidth(GB/s)
   1000				0.8
   2000				0.4
   3000				0.7
   4000				0.9
   5000				1.1
   6000				1.3
   7000				1.5
   8000				1.7
   9000				1.9
   10000			2.0
   11000			2.2
   12000			2.4
   13000			2.5
   14000			2.8
   15000			2.9
   16000			3.1
   17000			3.3
   18000			3.4
   19000			3.6
   20000			3.7
   22000			4.1
   24000			4.4
   26000			8.0
   28000			8.8
   30000			9.4
   32000			9.6
   34000			10.5
   36000			11.1
   38000			11.9
   40000			12.1
   42000			12.8
   44000			13.6
   46000			14.0
   48000			14.7
   50000			14.8
   60000			16.5
   70000			17.2
   80000			18.2
   90000			19.1
   100000			19.9
   200000			23.6
   300000			25.1
   400000			25.8
   500000			22.8
   600000			23.7
   700000			24.4
   800000			24.7
   900000			25.2
   1000000			25.3
   2000000			27.0
   3000000			27.6
   4000000			27.8
   5000000			28.0
   6000000			28.1
   7000000			28.1
   8000000			28.3
   9000000			28.3
   10000000			28.3
   11000000			28.3
   12000000			28.4
   13000000			28.4
   14000000			28.4
   15000000			28.4
   16000000			28.5
   18000000			28.5
   20000000			28.6
   22000000			28.6
   24000000			28.6
   26000000			28.6
   28000000			28.6
   30000000			28.6
   32000000			28.6
   36000000			28.6
   40000000			28.6
   44000000			28.6
   48000000			28.6
   52000000			28.7
   56000000			28.7
   60000000			28.7
   64000000			28.7
   68000000			28.7

.................................................................................
 Device to Host Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)	Bandwidth(GB/s)
   1000				0.3
   2000				0.7
   3000				1.0
   4000				1.3
   5000				1.7
   6000				1.9
   7000				2.3
   8000				2.7
   9000				3.0
   10000			3.4
   11000			3.6
   12000			4.1
   13000			4.4
   14000			4.7
   15000			5.0
   16000			5.4
   17000			5.8
   18000			6.1
   19000			6.6
   20000			6.9
   22000			7.4
   24000			8.4
   26000			9.0
   28000			9.8
   30000			10.5
   32000			11.3
   34000			11.8
   36000			12.5
   38000			13.4
   40000			13.9
   42000			14.7
   44000			15.5
   46000			16.2
   48000			17.0
   50000			18.0
   60000			19.4
   70000			20.3
   80000			21.7
   90000			22.5
   100000			23.1
   200000			25.7
   300000			26.6
   400000			27.0
   500000			23.4
   600000			24.7
   700000			24.7
   800000			25.7
   900000			25.4
   1000000			26.9
   2000000			26.6
   3000000			27.7
   4000000			28.0
   5000000			28.1
   6000000			28.3
   7000000			28.2
   8000000			28.4
   9000000			28.4
   10000000			28.4
   11000000			28.4
   12000000			28.4
   13000000			28.5
   14000000			28.4
   15000000			28.5
   16000000			28.5
   18000000			28.5
   20000000			28.6
   22000000			28.6
   24000000			28.6
   26000000			28.6
   28000000			28.6
   30000000			28.6
   32000000			28.6
   36000000			28.6
   40000000			28.6
   44000000			28.7
   48000000			28.7
   52000000			28.6
   56000000			28.6
   60000000			28.7
   64000000			28.7
   68000000			28.6

.................................................................................
 Device to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)	Bandwidth(GB/s)
   1000				0.5
   2000				0.8
   3000				1.1
   4000				1.4
   5000				1.8
   6000				2.1
   7000				2.4
   8000				2.8
   9000				3.2
   10000			3.5
   11000			3.9
   12000			4.3
   13000			4.6
   14000			5.0
   15000			5.4
   16000			5.7
   17000			6.1
   18000			6.3
   19000			6.6
   20000			7.1
   22000			7.9
   24000			8.5
   26000			9.1
   28000			9.8
   30000			10.6
   32000			11.3
   34000			11.9
   36000			12.5
   38000			13.3
   40000			14.0
   42000			14.6
   44000			15.2
   46000			16.2
   48000			16.8
   50000			17.1
   60000			20.2
   70000			15.8
   80000			17.8
   90000			20.3
   100000			22.4
   200000			38.8
   300000			52.5
   400000			69.0
   500000			71.8
   600000			83.6
   700000			86.3
   800000			90.7
   900000			97.3
   1000000			101.9
   2000000			63.2
   3000000			69.6
   4000000			72.3
   5000000			74.8
   6000000			76.0
   7000000			77.7
   8000000			78.2
   9000000			79.3
   10000000			79.9
   11000000			80.1
   12000000			80.5
   13000000			80.7
   14000000			80.4
   15000000			81.1
   16000000			81.6
   18000000			82.1
   20000000			82.1
   22000000			82.8
   24000000			82.5
   26000000			82.9
   28000000			83.1
   30000000			82.8
   32000000			83.5
   36000000			83.5
   40000000			83.6
   44000000			83.7
   48000000			83.6
   52000000			83.9
   56000000			84.0
   60000000			84.1
   64000000			84.5
   68000000			84.1

Result = PASS

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.


Set up the Jetson Orin Nano

Follow the next instructions, based on NVIDIA Jetson AI Lab / Tips - SSD + Docker, to set up the Jetson Orin Nano before running the Generative AI examples used to measure performance.

To install the SSD, the following video can be used as a guide Jetson Orin Nano Tutorial: SSD Install, Boot, and JetPack Setup - Full Guide!

When the SSD is installed and the Jetson is up, the SSD can be verified on the PCI bus as a new memory controller:

lspci

The output should look like the following:

0004:01:00.0 Non-Volatile memory controller: Sandisk Corp Device 5019 (rev 01)

Format and set up auto-mount

1. Run the following command to find the SSD device

lsblk

The output should look like the following:

nvme0n1      259:0    0 931,5G  0 disk

In this case, the device is nvme0n1

2. Format the SSD, create a mount point, and mount it to the filesystem.

sudo mkfs.ext4 /dev/nvme0n1
sudo mkdir /ssd
sudo mount /dev/nvme0n1 /ssd

Note: In the NVIDIA mount instructions, the directory where the SSD is mounted is named /ssd, but the user can choose any name for the mount point directory. In the jetson-containers setup.md documentation, /mnt is used.

3. To ensure that the mount persists after boot, add an entry to the fstab file:

3.1 Identify the SSD UUID:

lsblk -f

Then, add a new entry to the fstab file:

sudo vi /etc/fstab

Insert UUID value found from lsblk -f. For example:

UUID=d390172d-1e75-43d4-a66f-6c719d102329 /ssd/ ext4 defaults 0 2

Finally, change the ownership of the /ssd directory.

sudo chown ${USER}:${USER} /ssd

Set up Docker

1. Install the nvidia-container package. For Jetson Linux (L4T) R36.x (JetPack 6.x) users, follow the next instructions.

sudo apt update
sudo apt install -y nvidia-container curl
curl https://get.docker.com | sh && sudo systemctl --now enable docker
sudo nvidia-ctk runtime configure --runtime=docker

2. Restart the Docker service and run Docker as Non-Root User (No Sudo).

sudo systemctl restart docker
sudo usermod -aG docker $USER
newgrp docker

3. Add default runtime in /etc/docker/daemon.json.

3.1. First open the daemon.json file with sudo

sudo vi /etc/docker/daemon.json

3.2. Add the "default-runtime": "nvidia" line as:

{
    "runtimes": {
        "nvidia": {
            "path": "nvidia-container-runtime",
            "runtimeArgs": []
        }
    },
    "default-runtime": "nvidia"
}

4. Restart docker.

sudo systemctl daemon-reload && sudo systemctl restart docker

Migrate Docker directory to SSD

Use the SSD as extra storage capacity to hold the storage-demanding Docker directory by following the instructions below.

1. Stop the Docker service.

sudo systemctl stop docker

2. Move the existing Docker directory to the SSD.

sudo du -csh /var/lib/docker/ && \
    sudo mkdir /ssd/docker && \
    sudo rsync -axPS /var/lib/docker/ /ssd/docker/ && \
    sudo du -csh  /ssd/docker/

3. Edit /etc/docker/daemon.json with sudo to set the data-root path as following:

{
    "runtimes": {
        "nvidia": {
            "path": "nvidia-container-runtime",
            "runtimeArgs": []
        }
    },
    "default-runtime": "nvidia",
    "data-root": "/ssd/docker"
}

4. Rename the old Docker data directory.

sudo mv /var/lib/docker /var/lib/docker.old

5. Restart the Docker daemon.

sudo systemctl daemon-reload && \
    sudo systemctl restart docker && \
    sudo journalctl -u docker

Verification

1. Pull the nvcr.io/nvidia/l4t-base:r35.2.1 image.

docker pull nvcr.io/nvidia/l4t-base:r35.2.1

2. When the above command finishes, check the /ssd as follows:

docker info | grep Root
Docker Root Dir: /ssd/docker
sudo ls -l /ssd/docker/
total 44
drwx--x--x  4 root root 4096 Mar 22 11:44 buildkit
drwx--x---  2 root root 4096 Mar 22 11:44 containers
drwx------  3 root root 4096 Mar 22 11:44 image
drwxr-x---  3 root root 4096 Mar 22 11:44 network
drwx--x--- 13 root root 4096 Mar 22 16:20 overlay2
drwx------  4 root root 4096 Mar 22 11:44 plugins
drwx------  2 root root 4096 Mar 22 16:19 runtimes
drwx------  2 root root 4096 Mar 22 11:44 swarm
drwx------  2 root root 4096 Mar 22 16:20 tmp
drwx------  2 root root 4096 Mar 22 11:44 trust
drwx-----x  2 root root 4096 Mar 22 16:19 volumes
docker info | grep -e "Runtime" -e "Root"
Runtimes: io.containerd.runtime.v1.linux nvidia runc io.containerd.runc.v2
Default Runtime: nvidia
Docker Root Dir: /ssd/docker

Run Text Generation with ollama client tutorial

Remember to go to the directory mounted on the NVMe SSD and run the following commands.

1. Install Ollama's official support for Jetson and CUDA.

curl -fsSL https://ollama.com/install.sh | sh

It creates an ollama service (ollama serve). Check the usage as:

ollama --help
Large language model runner

Usage:
  ollama [flags]
  ollama [command]

Available Commands:
  serve       Start ollama
  create      Create a model from a Modelfile
  show        Show information for a model
  run         Run a model
  stop        Stop a running model
  pull        Pull a model from a registry
  push        Push a model to a registry
  list        List models
  ps          List running models
  cp          Copy a model
  rm          Remove a model
  help        Help about any command

Flags:
  -h, --help      help for ollama
  -v, --version   Show version information

Use "ollama [command] --help" for more information about a command

2. Then set up a container for running the text generation with ollama client tutorial.

cd /ssd
git clone https://github.com/dusty-nv/jetson-containers
bash jetson-containers/install.sh

The install.sh script installs some Python dependencies for running the Jetson AI Lab docker images. The jetson-containers build.sh, run, and autotag scripts are responsible for finding the compatible docker image based on the example the user wants to run; in this case, the dustynv/ollama:0.5.1-r36.4.0 image.

3. Run the docker container for ollama using jetson-containers as:

jetson-containers run --name ollama $(autotag ollama) ollama run mistral

mistral is a 7B model released by Mistral AI.

After running the above command, the following output should show up:

jetson-containers run --name ollama $(autotag ollama) ollama run mistral
Namespace(packages=['ollama'], prefer=['local', 'registry', 'build'], disable=[''], user='dustynv', output='/tmp/autotag', quiet=False, verbose=False)
-- L4T_VERSION=36.4.2  JETPACK_VERSION=5.1  CUDA_VERSION=12.6
-- Finding compatible container image for ['ollama']
dustynv/ollama:0.5.1-r36.4.0
V4L2_DEVICES: 
+ docker run --runtime nvidia -it --rm --network host --shm-size=8g --volume /tmp/argus_socket:/tmp/argus_socket --volume /etc/enctune.conf:/etc/enctune.conf --volume /etc/nv_tegra_release:/etc/nv_tegra_release --volume /tmp/nv_jetson_model:/tmp/nv_jetson_model --volume /var/run/dbus:/var/run/dbus --volume /var/run/avahi-daemon/socket:/var/run/avahi-daemon/socket --volume /var/run/docker.sock:/var/run/docker.sock --volume /ssd/jetson-containers/data:/data -v /etc/localtime:/etc/localtime:ro -v /etc/timezone:/etc/timezone:ro --device /dev/snd -e PULSE_SERVER=unix:/run/user/1000/pulse/native -v /run/user/1000/pulse:/run/user/1000/pulse --device /dev/bus/usb --device /dev/i2c-0 --device /dev/i2c-1 --device /dev/i2c-2 --device /dev/i2c-4 --device /dev/i2c-5 --device /dev/i2c-7 --device /dev/i2c-9 -v /run/jtop.sock:/run/jtop.sock --name ollama dustynv/ollama:0.5.1-r36.4.0 ollama run mistral
>>> Send a message (/? for help)

The message sent was: What approach would you use to detect and prevent race conditions in a multithreaded application?

4. To check the tokens per second calculation, use the ollama API.

4.1. Run the ollama container as:

jetson-containers run --name ollama $(autotag ollama)

4.2. Make a request to get the response in a single reply as:

curl http://localhost:11434/api/generate -d '{
  "model": "mistral",
  "prompt": "What approach would you use to detect and prevent race conditions in a multithreaded application?",
  "stream": false
}'

When the response is generated, check the eval_count and eval_duration parameters, which are the number of tokens in the response and the time in nanoseconds spent generating the response, respectively. To calculate the tokens per second (tokens/s), divide eval_count by eval_duration converted to seconds, that is, tokens/s = eval_count / (eval_duration * 10^-9). Check the example output below, followed by a short calculation sketch.


root@nvidia-desktop:/# curl http://localhost:11434/api/generate -d '{
  "model": "mistral",
  "prompt": "What approach would you use to detect and prevent race conditions in a multithreaded application?",
  "stream": false
}'
{"model":"mistral","created_at":"2024-12-23T22:35:48.572961262Z","response":"1. Use Synchronized Blocks or Methods: In Java, the `synchronized` keyword can be used to ensure that only one thread can access a block of code or a method at a time. This can help avoid race conditions by ensuring exclusive access to shared resources.\n\n2. Atomic Variables: Use atomic variables whenever possible for simple operations on shared data. In Java, the `AtomicInteger`, `AtomicLong`, etc., are examples of atomic variables. They provide thread-safe operations that can be updated without the need for explicit synchronization.\n\n3. Lock Objects: Instead of using the `synchronized` keyword at the method level, you can use custom lock objects. This allows finer-grained control over the locked resources and can help reduce contention.\n\n4. Immutable Objects: Whenever possible, make your shared objects immutable. If an object cannot be modified after it is created, there is no need for synchronization when accessing that object.\n\n5. Use Libraries with Concurrency Support: Consider using libraries designed to handle concurrent programming, such as Java's ExecutorService or ConcurrentHashMap. These libraries provide built-in concurrency support and can help prevent race conditions.\n\n6. Data Race Detection Tools: Use static analysis tools like JThreadScope, Helgrind, or Valgrind's thread tool to detect potential data races in your code. These tools can help find race conditions during development that may be difficult to identify manually.\n\n7. Testing and Design Patterns: Write tests to verify the correctness of your multithreaded application under various scenarios, paying special attention to areas where race conditions might occur. Use design patterns like producer-consumer, barrier, or double-checked locking to manage concurrent access to shared resources in a more structured way.\n\n8. Avoid Shared State: Minimize the use of shared state by using message passing instead. This can help reduce the likelihood of race conditions and make your code easier to reason about.\n\n9. Use Wait, Notify, and Condition Variables Correctly: In Java, these constructs can be used to coordinate the actions of threads. Misuse can lead to race conditions; ensure you understand how they work and use them appropriately.\n\n10. Keep Code Simple: Complex code is more likely to contain race conditions. 
Strive for simplicity in your multithreaded design, and focus on clear and concise logic.","done":true,"done_reason":"stop","context":[3,29473,2592,5199,1450,1136,1706,1066,7473,1072,6065,6709,5099,1065,1032,3299,1140,1691,1054,5761,29572,4,1027,29508,29491,6706,1086,15706,2100,9831,29481,1210,10779,29481,29515,1328,12407,29493,1040,2320,29481,15706,2100,29600,24015,1309,1115,2075,1066,6175,1137,1633,1392,6195,1309,3503,1032,3492,1070,3464,1210,1032,2806,1206,1032,1495,29491,1619,1309,2084,5229,6709,5099,1254,20851,15127,3503,1066,7199,6591,29491,781,781,29518,29491,2562,8180,15562,3329,29515,6706,19859,9693,12094,3340,1122,4356,7701,1124,7199,1946,29491,1328,12407,29493,1040,2320,3935,8180,8164,7619,2320,3935,8180,8236,7619,5113,2831,1228,10022,1070,19859,9693,29491,2074,3852,6195,29501,8016,7701,1137,1309,1115,9225,2439,1040,1695,1122,10397,15709,2605,29491,781,781,29538,29491,18350,5393,29481,29515,8930,1070,2181,1040,2320,29481,15706,2100,29600,24015,1206,1040,2806,2952,29493,1136,1309,1706,3228,5112,7465,29491,1619,6744,1622,1031,29501,1588,2506,3370,1522,1040,12190,6591,1072,1309,2084,8411,1447,2916,29491,781,781,29549,29491,3004,14650,5393,29481,29515,28003,3340,29493,1806,1342,7199,7465,1271,14650,29491,1815,1164,2696,4341,1115,12220,1792,1146,1117,4627,29493,1504,1117,1476,1695,1122,15709,2605,1507,3503,1056,1137,2696,29491,781,781,29550,29491,6706,1161,17932,1163,2093,17482,11499,29515,12540,2181,24901,6450,1066,5037,1147,3790,17060,29493,2027,1158,12407,29510,29481,12431,1039,3096,1210,2093,3790,21120,29491,3725,24901,3852,5197,29501,1030,1147,17482,2528,1072,1309,2084,6065,6709,5099,29491,781,781,29552,29491,6052,20824,1152,23588,27026,29515,6706,1830,6411,7808,1505,1243,7005,9920,29493,6192,1588,1275,29493,1210,4242,1588,1275,29510,29481,6195,4689,1066,7473,5396,1946,17277,1065,1342,3464,29491,3725,7808,1309,2084,2068,6709,5099,2706,4867,1137,1761,1115,4564,1066,9819,23828,29491,781,781,29555,29491,4503,1056,1072,9416,27340,29481,29515,12786,8847,1066,12398,1040,5482,2235,1070,1342,3299,1140,1691,1054,5761,1684,4886,22909,29493,10728,3609,5269,1066,5788,1738,6709,5099,2427,6032,29491,6706,3389,12301,1505,14644,29501,17893,1031,29493,20412,29493,1210,4347,29501,13181,5112,1056,1066,9362,1147,3790,3503,1066,7199,6591,1065,1032,1448,29197,1837,29491,781,781,29551,29491,27806,2063,2095,4653,29515,3965,1089,1421,1040,1706,1070,7199,2433,1254,2181,3696,10488,4287,29491,1619,1309,2084,8411,1040,25045,1070,6709,5099,1072,1806,1342,3464,7857,1066,3379,1452,29491,781,781,29542,29491,6706,13886,29493,3048,2343,29493,1072,29005,15562,3329,3966,3891,1114,29515,1328,12407,29493,1935,5890,29481,1309,1115,2075,1066,18005,1040,7536,1070,18575,29491,22065,2498,1309,2504,1066,6709,5099,29513,6175,1136,3148,1678,1358,1539,1072,1706,1474,7350,2767,29491,781,781,29508,29502,29491,11107,8364,14656,29515,27735,3464,1117,1448,4685,1066,7769,6709,5099,29491,1430,6784,1122,26001,1065,1342,3299,1140,1691,1054,3389,29493,1072,4000,1124,3849,1072,3846,1632,12176,29491],"total_duration":36658984876,"load_duration":31889330,"prompt_eval_count":24,"prompt_eval_duration":9000000,"eval_count":526,"eval_duration":36616000000}


Run Stable Diffusion Image generation tutorial

There were some issues when running the tutorial by following the instructions found in the Jetson AI Lab at the moment of writing this wiki page. So, a workaround for executing this tutorial is to run the container as follows:

docker run --runtime nvidia -it --name=text_generation --network host --volume /tmp/argus_socket:/tmp/argus_socket --volume /etc/enctune.conf:/etc/enctune.conf --volume /etc/nv_tegra_release:/etc/nv_tegra_release --volume /tmp/nv_jetson_model:/tmp/nv_jetson_model --volume /home/nvidia/jetson-containers/data:/data --device /dev/snd --device /dev/bus/usb dustynv/text-generation-webui:r36.2.0

Then, the instructions to create images can be found in NVIDIA_Jetson_Generative_AI_Lab_RidgeRun_Exploration#Create_images.

Furthermore, in the terminal where the docker container is running, the tutorial shows some useful information regarding the model loading time, the generated image information, the time taken, and the iterations per second used to generate that image. Check the example below.


Model loaded in 9.4s (load weights from disk: 1.8s, create model: 0.8s, apply weights to model: 5.6s, apply half(): 0.1s, load textual inversion embeddings: 0.5s, calculate empty prompt: 0.3s).

Steps: 20, Sampler: DPM++ 2M Karras, CFG scale: 7, Seed: 1017573880, Size: 512x512, Model hash: 6ce0161689, Model: v1-5-pruned-emaonly, Version: v1.7.0

Time taken: 10.5 sec.

A: 2.57 GB, R: 3.37 GB, Sys: 6.4/7.44141 GB (86.0%)

Total progress: 100%|███████████████████████████████████████████████████| 20/20 [00:09<00:00,  2.05it/s]


References



For direct inquiries, please refer to the contact information available on our Contact page. Alternatively, you may complete and submit the form provided at the same link. We will respond to your request at our earliest opportunity.

