JetPack 6 Key Feature: JetPack Compute Stack
JetPack 6.0 introduces significant updates and enhancements across various components of its compute stack. Below is a detailed summary of the key changes and new features included in this release.
TensorRT
JetPack 6.0 includes TensorRT 8.6.2, enhancing AI inference capabilities. The main changes in this release include the following:
Compatibility and Support Changes
1. Support Updates.
- Python 3.11 Support: Starting with TensorRT 8.6 GA.
- CUDA 12.x Support: Starting with TensorRT 8.6, but incompatible with CUDA 11.x builds.
2. Hardware Compatibility.
- Cross-Architecture Engine Compatibility: Engines built on one GPU architecture can now work on GPUs of other architectures (Ampere and newer). Note the potential for latency/throughput degradation and the lack of support on JetPack.
3. Version Compatibility.
- trtexec Version Compatibility Support: TensorRT engines built with the appropriate build-time configuration are now compatible with other minor versions within the same major version. Explicit runtime configuration is also required; a minimal build-time sketch follows this list.
4. Runtime Installation Options.
- Lean Runtime Installation: Smaller installation for loading and running pre-built engines. Does not support building new TensorRT plan files.
- Dispatch Runtime Installation: Minimal memory consumption installation for loading and running pre-built engines, includes lean runtime functionality but lacks build support.
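The build-time side of version compatibility can be sketched in Python roughly as follows. This is a minimal, hedged example assuming the TensorRT 8.6 Python API; the tiny identity network exists only to make the snippet self-contained and is not part of any sample shipped with JetPack.

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
config = builder.create_builder_config()

# Hypothetical single-layer network, just so the sketch is self-contained.
network = builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
x = network.add_input("x", trt.float32, (1, 3, 224, 224))
identity = network.add_identity(x)
network.mark_output(identity.get_output(0))

# Build-time opt-in: the resulting plan can be loaded by later TensorRT minor
# versions within the same major version (the runtime must also be configured).
config.set_flag(trt.BuilderFlag.VERSION_COMPATIBLE)

plan = builder.build_serialized_network(network, config)
```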
New Features and Samples
1. New Layers and Interfaces.
- IReverseSequence Layer: Supports the ReverseSequence operator in ONNX.
- INormalization Layer: Supports InstanceNormalization, GroupNormalization, and LayerNormalization operations in ONNX.
- ICastLayer Interface: Converts the data type of the input tensor across FP32, FP16, INT32, INT8, UINT8, and BOOL (see the sketch after this list).
2. Experimental Features.
- Extended DLA Support: IElementWiseLayer now supports the equal operation (ElementWiseOperation::kEQUAL) with specific restrictions. The trtexec flag --layerDeviceTypes allows specifying the device type per layer.
3. New Sample.
- onnx_custom_plugin Sample: Demonstrates the use of C++ plugins for running TensorRT on ONNX models with custom or unsupported layers.
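As a rough illustration of the new cast layer, here is a hedged sketch against the TensorRT 8.6 Python API (INetworkDefinition.add_cast returns an ICastLayer; the tensor name and shape are made up for this example):

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))

# FP32 input tensor; the name and shape are arbitrary for this sketch.
x = network.add_input("x", trt.float32, (1, 16))

# ICastLayer: convert the tensor's data type, here FP32 -> INT32.
cast = network.add_cast(x, trt.int32)
network.mark_output(cast.get_output(0))
```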
API Changes
1. C++ API Changes.
- Additions and Enhancements: New classes (e.g., IReverseSequenceLayer, INormalizationLayer), functions (e.g., for builder configuration and plugin management), enums (e.g., HardwareCompatibilityLevel), and macros, alongside deprecated elements such as certain enums and properties.
2. Python API Changes.
- New Features and Properties: Introduction of new classes and functions (e.g., ICastLayer, IReverseSequenceLayer), properties (e.g., compatibility levels, stream settings), and enums, with some deprecated properties related to optimization.
3. Multi-Stream APIs.
- Control the number of streams TensorRT uses to run independent parts of the network in parallel, improving performance (see the sketch after this list).
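Both the hardware compatibility level and the auxiliary stream count are set on the builder configuration in the Python API. A minimal sketch, assuming TensorRT 8.6:

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
config = builder.create_builder_config()

# Hardware compatibility: build an engine that can run on Ampere and newer GPUs
# (as noted above, this mode is not supported on JetPack).
config.hardware_compatibility_level = trt.HardwareCompatibilityLevel.AMPERE_PLUS

# Multi-stream API: let TensorRT use up to 2 auxiliary streams to run
# independent parts of the network in parallel.
config.max_aux_streams = 2
```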
Performance
1. Performance Enhancements.
- Improved Multi-Head Attention (MHA) Fusions: Enhanced speed for Transformer-like networks.
- Optimized Engine Building: Better performance for Transformer-like networks with dynamic shapes and reduced unnecessary synchronization calls.
- Performance Boost on NVIDIA Hopper GPUs: Various network optimizations.
- Optimization Level Builder Flag: Allows trading longer engine build time for potentially better tactics, or a faster build with a reduced tactic search scope (see the sketch below).
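In the Python API this is exposed as IBuilderConfig.builder_optimization_level; a small sketch, assuming TensorRT 8.6:

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
config = builder.create_builder_config()

# Levels range from 0 to 5 (default 3): higher levels allow longer build times
# in exchange for potentially better tactics; lower levels build faster.
config.builder_optimization_level = 5
```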
Deep Learning Accelerator (DLA)
This release features DLA 3.14, providing optimized hardware acceleration for deep learning applications.
CUDA
CUDA remains a crucial part of the AI compute stack. JetPack 6.0 incorporates CUDA 12.2.1, offering improved performance and new capabilities for developers. The main changes in this release include the following:
General CUDA
1. Heterogeneous Memory Management (HMM).
- New Feature: Seamless data sharing between host memory and accelerator devices.
- Supported on: Linux (kernel 6.1.24+ or 6.2.11+).
- Requirements: NVIDIA GPU Open Kernel Modules driver.
- Limitations:
- No GPU atomic operations on file-backed memory.
- No Arm CPU support.
- No HugeTLBfs page support.
- Incomplete fork() system call support.
- Potential performance issues compared to existing memory management APIs.
2. Lazy Loading.
- Default Status: Enabled on Linux with 535 driver.
- Disabling: Set CUDA_MODULE_LOADING=EAGER on Linux.
- Windows: Enable with CUDA_MODULE_LOADING=LAZY.
3. Host NUMA Memory Allocation.
- Feature: Allocate CPU memory for specific NUMA nodes.
- Responsibility: Applications must request memory accessibility explicitly.
4. CUDA Multi-Process Service (MPS).
- Feature: Per-client priority mapping at runtime.
- Environment Variable: CUDA_MPS_CLIENT_PRIORITY, with values 0 (NORMAL) and 1 (BELOW_NORMAL); see the sketch after this list.
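Both the lazy-loading and MPS priority controls above are plain environment variables, so they can be set from the shell or, as in this minimal Python sketch, from the process environment before CUDA is initialized:

```python
import os

# Lazy loading: force eager module loading (lazy loading is the default
# on Linux with the 535 driver).
os.environ["CUDA_MODULE_LOADING"] = "EAGER"

# MPS: schedule this client at below-normal priority
# (0 = NORMAL, 1 = BELOW_NORMAL).
os.environ["CUDA_MPS_CLIENT_PRIORITY"] = "1"

# Initialize the CUDA-using library only after the variables are set,
# so that the runtime picks them up.
```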
CUDA Compilers
1. libNVVM Samples.
- Relocation: Moved to GitHub under NVIDIA/cuda-samples.
CUDA Developer Tools
1. Updates for:
- nvprof and Visual Profiler.
- CUPTI.
- Nsight Compute.
- Compute Sanitizer.
- CUDA-GDB.
For detailed updates and changes, refer to the respective changelogs.
Vision Programming Interface (VPI)
JetPack 6.0 includes VPI 3.1, which introduces several new enhancements:
Performance Optimizations and New Support
1. Median Filter:
- Up to 4x performance optimization on VPI_BACKEND_CUDA.
2. Erode:
- Added support for VPI_BACKEND_PVA.
- Supported filter sizes: 3x3 and 5x5.
- Supported image formats: VPI_IMAGE_FORMAT_U8, VPI_IMAGE_FORMAT_S8, VPI_IMAGE_FORMAT_U16, VPI_IMAGE_FORMAT_S16.
- Supported border types: VPI_BORDER_ZERO and VPI_BORDER_CLAMP.
- Supported filter shapes: ALL_TRUE, ALL_FALSE, and CROSS (3x3 only).
3. Dilate:
- Added support for VPI_BACKEND_PVA.
- Supported filter sizes: 3x3 and 5x5.
- Supported image formats: VPI_IMAGE_FORMAT_U8, VPI_IMAGE_FORMAT_S8, VPI_IMAGE_FORMAT_U16, VPI_IMAGE_FORMAT_S16.
- Supported border types: VPI_BORDER_ZERO and VPI_BORDER_CLAMP.
- Supported filter shapes: ALL_TRUE, ALL_FALSE, and CROSS (3x3 only).
Other Updates
1. Crop Scaler:
- Added support for VPI_IMAGE_FORMAT_BGR8p.
- Maximum frames increased to 128.
2. ORB Feature Detector:
- Functional changes to the algorithm.
- Now works with pyramidal keypoints, extracting keypoints with octave information and computing descriptors at the appropriate octave image (pyramid level).
3. Convert Image Format:
- Added support for conversion from VPI_IMAGE_FORMAT_2S16_BL to VPI_IMAGE_FORMAT_2S16_PL and vice versa for VPI_BACKEND_VIC.
4. Gaussian Pyramid Generator:
- Added support for VPI_BORDER_REFLECT and VPI_BORDER_MIRROR for VPI_BACKEND_CUDA.
5. Laplacian Pyramid Generator:
- Added support for VPI_BORDER_REFLECT and VPI_BORDER_MIRROR for VPI_BACKEND_CUDA.
6. DCF Tracker:
- Increased maximum sequences to 33 for VPI_BACKEND_PVA.
- Increased maximum sequences to 1024 for VPI_BACKEND_CUDA.
DeepStream 7.0
DeepStream 7.0 introduces a range of new features and enhancements that significantly advance the capabilities for vision AI development. Below is a comprehensive overview of these new features and updates:
1. New Development Pathway.
- Leveraging new DeepStream libraries through Python APIs, enabling more flexible and accessible development.
2. Service Maker.
- Simplifies application development by providing a streamlined process for creating, deploying, and managing AI applications.
3. Enhanced Single-View 3D Tracker.
- Improved features for single-view 3D tracking, enhancing the accuracy and reliability of object tracking in three-dimensional space.
4. Support for Sensor Fusion Model (BEVFusion).
- Integrated with the DeepStream 3D framework, this model supports the fusion of data from multiple sensors, providing more comprehensive environmental perception.
5. Support for Windows Subsystem for Linux (WSL2).
- DeepStream applications can now be developed and run on WSL2, making it easier for developers using Windows to leverage DeepStream capabilities.
6. PipeTuner.
- A new tool for optimizing AI pipeline performance, allowing developers to streamline and enhance the efficiency of their AI workflows.
New Feature Examples
These are some of the examples tested on the latest versions of the Compute Stack available in the JetPack 6.0 Production Release, on a Jetson Orin Nano.
VPI 3.1.5
The VPI installation provides samples in the directory /opt/nvidia/vpi3/samples/. This directory also contains an assets subfolder with images and videos to test the samples.
Sample: ORB Feature Detector
This sample detects features across an input pyramid and computes a descriptor for each feature, returning the coordinates of each feature along with its associated bitstring descriptor. The sample uses the Gaussian Pyramid Generator algorithm, which in VPI 3.1.5 adds support for VPI_BORDER_REFLECT and VPI_BORDER_MIRROR on the CUDA backend; this can be tested in either C++ or Python.
C++ code
In the source code, the Gaussian Pyramid Generator can be changed to use a new border type, for example VPI_BORDER_REFLECT:
```diff
- CHECK_STATUS(vpiSubmitGaussianPyramidGenerator(stream, backend, imgGrayScale, pyrInput, VPI_BORDER_CLAMP));
+ CHECK_STATUS(vpiSubmitGaussianPyramidGenerator(stream, backend, imgGrayScale, pyrInput, VPI_BORDER_REFLECT));
```
Python code
Similar to the C++ source code, the border argument can be changed to use a new type, for example VPI_BORDER_MIRROR:
```diff
- pyr = src.gaussian_pyramid(3)
+ pyr = src.gaussian_pyramid(3, border=vpi.Border.MIRROR)
```
Dependencies
With the out-of-the-box Compute Stack, no other dependencies are needed.
Usage
C++:
```
cd /opt/nvidia/vpi3/samples/03-harris_corners
sudo cmake .
sudo make
sudo ./vpi_sample_03_harris_corners cuda ../assets/kodim08.png
```
Python:
```
cd /opt/nvidia/vpi3/samples/03-harris_corners
sudo python3 main.py cuda ../assets/kodim08.png
```
Result
TensorRT 8.6.2.3
The TensorRT installation contains available samples in the directory /usr/src/tensorrt/samples.
Sample: Network API PyTorch MNIST
This end-to-end sample trains a model in PyTorch, recreates the network in TensorRT, imports the weights from the trained model, and finally runs inference with a TensorRT engine. The sample uses Python; among other changes, this release adds the following properties to the API:
- ICudaEngine.hardware_compatibility_level
- ICudaEngine.num_aux_streams
Python code
The main function of the sample starts by initializing and training an MNIST model using PyTorch. After training, it extracts the model's weights and uses them to build a TensorRT inference engine. Next, it allocates the necessary resources, such as buffers and an execution context, for performing inference. A random test case is loaded into the input buffer. The function then runs inference with TensorRT, determines the predicted class from the output, frees the allocated buffers, and prints the test case number and the prediction. It also prints the new properties: the hardware compatibility level and the number of auxiliary streams.
```python
def main():
    common.add_help(description="Runs an MNIST network using a PyTorch model")
    # Train the PyTorch model
    mnist_model = model.MnistModel()
    mnist_model.learn(1)
    weights = mnist_model.get_weights()
    # Do inference with TensorRT.
    engine = build_engine(weights)

    # Get hardware compatibility level
    compatibility_level = engine.hardware_compatibility_level
    print(f"Hardware Compatibility Level: {compatibility_level}")

    # Get number of auxiliary streams
    num_aux_streams = engine.num_aux_streams
    print(f"Number of Auxiliary Streams: {num_aux_streams}")

    # Build an engine, allocate buffers and create a stream.
    # For more information on buffer allocation, refer to the introductory samples.
    inputs, outputs, bindings, stream = common.allocate_buffers(engine)
    context = engine.create_execution_context()

    case_num = load_random_test_case(mnist_model, pagelocked_buffer=inputs[0].host)
    # For more information on performing inference, refer to the introductory samples.
    # The common.do_inference function will return a list of outputs - we only have one in this case.
    [output] = common.do_inference_v2(context, bindings=bindings, inputs=inputs, outputs=outputs, stream=stream)
    pred = np.argmax(output)
    common.free_buffers(inputs, outputs, stream)
    print("Test Case: " + str(case_num))
    print("Prediction: " + str(pred))
```
Dependencies
This sample needs PyTorch and some auxiliary Python packages. To install them, execute these commands:
```
cd /usr/src/tensorrt/samples/python/network_api_pytorch_mnist
pip3 install --upgrade pip
pip3 install -r requirements.txt
```
Usage
```
cd /usr/src/tensorrt/samples/python/network_api_pytorch_mnist
python3 sample.py
```
Result
```
Train Epoch: 1 [0/60000 (0%)]      Loss: 2.354930
Train Epoch: 1 [6400/60000 (11%)]  Loss: 0.570188
Train Epoch: 1 [12800/60000 (21%)] Loss: 0.335233
Train Epoch: 1 [19200/60000 (32%)] Loss: 0.336444
Train Epoch: 1 [25600/60000 (43%)] Loss: 0.245888
Train Epoch: 1 [32000/60000 (53%)] Loss: 0.159086
Train Epoch: 1 [38400/60000 (64%)] Loss: 0.191686
Train Epoch: 1 [44800/60000 (75%)] Loss: 0.237418
Train Epoch: 1 [51200/60000 (85%)] Loss: 0.055748
Train Epoch: 1 [57600/60000 (96%)] Loss: 0.092222

Test set: Average loss: 0.0897, Accuracy: 9712/10000 (97%)

Hardware Compatibility Level: HardwareCompatibilityLevel.NONE
Number of Auxiliary Streams: 0
Test Case: 1
Prediction: 1
```