JetPack 6 Key Feature: JetPack Compute Stack

JetPack 6.0 introduces significant updates and enhancements across various components of its compute stack. Below is a detailed summary of the key changes and new features included in this release.

TensorRT

JetPack 6.0 includes TensorRT 8.6.2, enhancing AI inference capabilities. The main changes in this release include the following:

Compatibility and Support Changes

1. Support Updates.

  • Python 3.11 Support: Starting with TensorRT 8.6 GA.
  • CUDA 12.x Support: Available starting with TensorRT 8.6; CUDA 12.x builds are not interchangeable with CUDA 11.x builds.

2. Hardware Compatibility.

  • Cross-Architecture Engine Compatibility: Engines built on one GPU architecture can now run on GPUs of other architectures (Ampere and newer). Note that this may degrade latency/throughput and is not supported on JetPack.

3. Version Compatibility.

  • trtexec Version Compatibility Support: With the appropriate build-time configuration, TensorRT engines are now compatible with other minor versions within the same major version; running them requires explicit runtime configuration (see the builder-configuration sketch after this list).

4. Runtime Installation Options.

  • Lean Runtime Installation: Smaller installation for loading and running pre-built engines. Does not support building new TensorRT plan files.
  • Dispatch Runtime Installation: Minimal memory consumption installation for loading and running pre-built engines, includes lean runtime functionality but lacks build support.
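
The snippet below is a minimal Python sketch (not taken from an NVIDIA sample) of how the hardware- and version-compatibility options appear in the TensorRT 8.6 builder configuration; building the actual network is omitted:

import tensorrt as trt

# Minimal sketch: create a builder configuration and enable the new compatibility options.
logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
config = builder.create_builder_config()

# Cross-architecture engine (Ampere and newer); note this is not supported on JetPack.
config.hardware_compatibility_level = trt.HardwareCompatibilityLevel.AMPERE_PLUS

# Version-compatible engine that can run against later minor versions of the same major release.
config.set_flag(trt.BuilderFlag.VERSION_COMPATIBLE)

# ... populate `network`, then build:
# serialized_engine = builder.build_serialized_network(network, config)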

New Features and Samples

1. New Layers and Interfaces.

  • IReverseSequence Layer: Supports the ReverseSequence operator in ONNX.
  • INormalization Layer: Supports InstanceNormalization, GroupNormalization, and LayerNormalization operations in ONNX.
  • ICastLayer Interface: Converts the data type of the input tensor across FP32, FP16, INT32, INT8, UINT8, and BOOL (see the sketch after this list).

2. Experimental Features.

  • Extended DLA Support: IElementWiseLayer now supports the equal operation (ElementWiseOperation::kEQUAL) with specific restrictions. The new trtexec flag --layerDeviceTypes allows specifying the device type per layer.

3. New Sample.

  • onnx_custom_plugin Sample: Demonstrates the use of C++ plugins for running TensorRT on ONNX models with custom or unsupported layers.
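
As a reference for the new layer interfaces, the following minimal Python sketch (input shape chosen arbitrarily) inserts an ICastLayer through the network-definition API to cast an FP32 tensor to FP16:

import tensorrt as trt

# Minimal sketch: build a network containing only a cast layer.
logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))

inp = network.add_input("input", trt.float32, (1, 3, 224, 224))
cast = network.add_cast(inp, trt.float16)  # returns an ICastLayer
network.mark_output(cast.get_output(0))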

API Changes

1. C++ API Changes.

  • Additions and Enhancements: New classes (e.g., IReverseSequenceLayer, INormalizationLayer), functions (e.g., for builder configuration and plugin management), enums (e.g., HardwareCompatibilityLevel), and macros, alongside deprecated elements like certain enums and properties.

2. Python API Changes.

  • New Features and Properties: Introduction of new classes and functions (e.g., ICastLayer, IReverseSequenceLayer), properties (e.g., compatibility levels, stream settings), and enums, with some deprecated properties related to optimization.

3. Multi-Stream APIs.

  • Multi-Stream APIs: Control the number of auxiliary streams TensorRT can use to run independent parts of the network in parallel, which can improve performance (a minimal sketch follows this list).
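
A minimal sketch of the builder-side control is shown below; the engine-side counterpart, ICudaEngine.num_aux_streams, is printed by the MNIST sample later on this page:

import tensorrt as trt

# Minimal sketch: limit the number of auxiliary streams TensorRT may use.
logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
config = builder.create_builder_config()

# At most 2 auxiliary streams for running independent parts of the network in parallel.
config.max_aux_streams = 2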

Performance

1. Performance Enhancements.

  • Improved Multi-Head Attention (MHA) Fusions: Enhanced speed for Transformer-like networks.
  • Optimized Engine Building: Better performance for Transformer-like networks with dynamic shapes and reduced unnecessary synchronization calls.
  • Performance Boost on NVIDIA Hopper GPUs: Various network optimizations.
  • Optimization Level Builder Flag: Trades build time for engine performance, allowing longer builds that search for better tactics or faster builds with a reduced search scope (a minimal sketch follows this list).
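
A minimal Python sketch of the optimization level flag in the builder configuration:

import tensorrt as trt

# Minimal sketch: raise the optimization level to spend more build time
# searching for faster tactics (lower values build faster instead).
logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
config = builder.create_builder_config()

config.builder_optimization_level = 5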

Deep Learning Accelerator (DLA)

This release features DLA 3.14, providing optimized hardware acceleration for deep learning applications.

CUDA

CUDA remains a crucial part of the AI compute stack. JetPack 6.0 incorporates CUDA 12.2.1, offering improved performance and new capabilities for developers. The main changes in this release include the following:

General CUDA

1. Heterogeneous Memory Management (HMM).

  • New Feature: Seamless data sharing between host memory and accelerator devices.
  • Supported on: Linux (kernel 6.1.24+ or 6.2.11+).
  • Requirements: NVIDIA GPU Open Kernel Modules driver.
  • Limitations:
    • No GPU atomic operations on file-backed memory.
    • No Arm CPU support.
    • No HugeTLBfs page support.
    • Incomplete fork() system call support.
    • Potential performance issues compared to existing memory management APIs.

2. Lazy Loading.

  • Default Status: Enabled on Linux with 535 driver.
  • Disabling: Set CUDA_MODULE_LOADING=EAGER on Linux (a minimal sketch follows this list).
  • Windows: Enable with CUDA_MODULE_LOADING=LAZY.

3. Host NUMA Memory Allocation.

  • Feature: Allocate CPU memory for specific NUMA nodes.
  • Responsibility: Applications must request memory accessibility explicitly.

4. CUDA Multi-Process Service (MPS).

  • Feature: Per-client priority mapping at runtime.
  • Environment Variable: CUDA_MPS_CLIENT_PRIORITY with values 0 (NORMAL) and 1 (BELOW_NORMAL).
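
Because these behaviors are driven by environment variables, they must be set before the CUDA driver is initialized. A minimal Python sketch for the lazy loading setting:

import os

# Opt out of lazy loading; the variable must be set before CUDA is initialized,
# i.e. before importing any CUDA-backed module.
os.environ["CUDA_MODULE_LOADING"] = "EAGER"

# import tensorrt, pycuda, torch, ... only after this point so the driver sees the value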

CUDA Compilers

1. libNVVM Samples.

  • Relocation: Moved to GitHub under NVIDIA/cuda-samples.

CUDA Developer Tools

1. Updates for:

  • nvprof and Visual Profiler.
  • CUPTI.
  • Nsight Compute.
  • Compute Sanitizer.
  • CUDA-GDB.

For detailed updates and changes, refer to the respective changelogs.

Vision Programming Interface (VPI)

JetPack 6.0 includes VPI 3.1, which introduces several new enhancements:

Performance Optimizations and New Support

1. Median Filter:

  • Up to 4x performance optimization on VPI_BACKEND_CUDA.

2. Erode:

  • Added support for VPI_BACKEND_PVA.
  • Supported filter sizes: 3x3 and 5x5.
  • Supported image formats: VPI_IMAGE_FORMAT_U8, VPI_IMAGE_FORMAT_S8, VPI_IMAGE_FORMAT_U16, VPI_IMAGE_FORMAT_S16.
  • Supported border types: VPI_BORDER_ZERO and VPI_BORDER_CLAMP.
  • Supported filter shapes: ALL_TRUE, ALL_FALSE, and CROSS (Only for 3x3).

3. Dilate:

  • Added support for VPI_BACKEND_PVA.
  • Supported filter sizes: 3x3 and 5x5.
  • Supported image formats: VPI_IMAGE_FORMAT_U8, VPI_IMAGE_FORMAT_S8, VPI_IMAGE_FORMAT_U16, VPI_IMAGE_FORMAT_S16.
  • Supported border types: VPI_BORDER_ZERO and VPI_BORDER_CLAMP.
  • Supported filter shapes: ALL_TRUE, ALL_FALSE, and CROSS (Only for 3x3).

Other Updates

1. Crop Scaler:

  • Added support for VPI_IMAGE_FORMAT_BGR8p.
  • Maximum frames increased to 128.

2. ORB Feature Detector:

  • Functional changes to the algorithm.
  • Now works with pyramidal keypoints, extracting keypoints with octave information and computing descriptors at the appropriate octave image (pyramid level).

3. Convert Image Format:

  • Added support for conversion from VPI_IMAGE_FORMAT_2S16_BL to VPI_IMAGE_FORMAT_2S16_PL and vice versa for VPI_BACKEND_VIC.

4. Gaussian Pyramid Generator:

  • Added support for VPI_BORDER_REFLECT and VPI_BORDER_MIRROR for VPI_BACKEND_CUDA.

5. Laplacian Pyramid Generator:

  • Added support for VPI_BORDER_REFLECT and VPI_BORDER_MIRROR for VPI_BACKEND_CUDA.

6. DCF Tracker:

  • Increased maximum sequences to 33 for VPI_BACKEND_PVA.
  • Increased maximum sequences to 1024 for VPI_BACKEND_CUDA.

DeepStream 7.0

DeepStream 7.0 introduces a range of new features and enhancements that significantly advance the capabilities for vision AI development. Below is a comprehensive overview of these new features and updates:

1. New Development Pathway.

  • Leveraging new DeepStream libraries through Python APIs, enabling more flexible and accessible development.

2. Service Maker.

  • Simplifies application development by providing a streamlined process for creating, deploying, and managing AI applications.

3. Enhanced Single-View 3D Tracker.

  • Improved features for single-view 3D tracking, enhancing the accuracy and reliability of object tracking in three-dimensional space.

4. Support for Sensor Fusion Model (BEVFusion).

  • Integrated with the DeepStream 3D framework, this model supports the fusion of data from multiple sensors, providing more comprehensive environmental perception.

5. Support for Windows Subsystem for Linux (WSL2).

  • DeepStream applications can now be developed and run on WSL2, making it easier for developers using Windows to leverage DeepStream capabilities.

6. PipeTuner.

  • A new tool for optimizing AI pipeline performance, allowing developers to streamline and enhance the efficiency of their AI workflows.

New Feature Examples

The following examples were tested on the latest versions of the compute stack available in the JetPack 6.0 production release, running on a Jetson Orin Nano.

VPI 3.1.5

The VPI installation provides samples in the directory /opt/nvidia/vpi3/samples/. This folder also contains an assets subfolder with images and videos for testing the samples.

Sample: ORB Feature Detector

This sample detects features across an input pyramid and computes a descriptor for each feature, returning each feature's coordinates along with its associated bitstring descriptor. The sample uses the Gaussian Pyramid Generator algorithm, which in VPI 3.1.5 adds support for VPI_BORDER_REFLECT and VPI_BORDER_MIRROR on the CUDA backend; this can be tested in either C++ or Python.

C++ code

In the source code, the Gaussian Pyramid Generator can be changed to use a new border type, for example VPI_BORDER_REFLECT:

- CHECK_STATUS(vpiSubmitGaussianPyramidGenerator(stream, backend, imgGrayScale, pyrInput, VPI_BORDER_CLAMP));
+ CHECK_STATUS(vpiSubmitGaussianPyramidGenerator(stream, backend, imgGrayScale, pyrInput, VPI_BORDER_REFLECT));
Python code

Similar to the C++ source code, the border argument can be changed to use a new type, for example VPI_BORDER_MIRROR:

- pyr = src.gaussian_pyramid(3)
+ pyr = src.gaussian_pyramid(3, border=vpi.Border.MIRROR)
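
For reference, a minimal standalone sketch of how the modified call could be used, assuming an 8-bit grayscale NumPy array as input (a placeholder array is used here instead of a real image):

import numpy as np
import vpi

# Placeholder grayscale frame; in the sample this comes from the input image.
frame = np.zeros((480, 640), dtype=np.uint8)

with vpi.Backend.CUDA:
    src = vpi.asimage(frame)
    # 3-level Gaussian pyramid using the new MIRROR border on the CUDA backend.
    pyr = src.gaussian_pyramid(3, border=vpi.Border.MIRROR)
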
Dependencies

With the out-of-the-box Compute Stack, no other dependencies are needed.

Usage

C++:

cd /opt/nvidia/vpi3/samples/03-harris_corners
sudo cmake .
sudo make
sudo ./vpi_sample_03_harris_corners cuda ../assets/kodim08.png

Python:

cd /opt/nvidia/vpi3/samples/03-harris_corners
sudo python3 main.py cuda ../assets/kodim08.png
Result

Figure: NVIDIA VPI sample image showing houses after applying a black and white filter and detecting corners.

TensorRT 8.6.2.3

The TensorRT installation provides samples in the directory /usr/src/tensorrt/samples.

Sample: Network API PyTorch MNIST

This end-to-end sample trains a model in PyTorch, recreates the network in TensorRT, imports the weights from the trained model, and finally runs inference with a TensorRT engine. The sample is written in Python. Among other API changes in this TensorRT release, the following engine properties were added:

  • ICudaEngine.hardware_compatibility_level.
  • ICudaEngine.num_aux_streams.
Python code

The sample's main function starts by initializing and training an MNIST model in PyTorch. After training, it extracts the model's weights and uses them to build a TensorRT inference engine. Next, it allocates the resources needed for inference, such as buffers and an execution context. A random test case is loaded into the input buffer. The function then runs inference with TensorRT, determines the predicted class from the output, frees the allocated buffers, and prints the test case number and the prediction. This function also prints the new properties: the hardware compatibility level and the number of auxiliary streams. (The excerpt below relies on NumPy and the sample-local common and model helper modules, which sample.py imports at the top.)

def main():
    common.add_help(description="Runs an MNIST network using a PyTorch model")
    # Train the PyTorch model
    mnist_model = model.MnistModel()
    mnist_model.learn(1)
    weights = mnist_model.get_weights()
    # Do inference with TensorRT.
    engine = build_engine(weights)

    # Get hardware compatibility level
    compatibility_level = engine.hardware_compatibility_level
    print(f"Hardware Compatibility Level: {compatibility_level}")

    # Get number of auxiliary streams
    num_aux_streams = engine.num_aux_streams
    print(f"Number of Auxiliary Streams: {num_aux_streams}")

    # Build an engine, allocate buffers and create a stream.
    # For more information on buffer allocation, refer to the introductory samples.
    inputs, outputs, bindings, stream = common.allocate_buffers(engine)
    context = engine.create_execution_context()

    case_num = load_random_test_case(mnist_model, pagelocked_buffer=inputs[0].host)
    # For more information on performing inference, refer to the introductory samples.
    # The common.do_inference function will return a list of outputs - we only have one in this case.
    [output] = common.do_inference_v2(context, bindings=bindings, inputs=inputs, outputs=outputs, stream=stream)
    pred = np.argmax(output)
    common.free_buffers(inputs, outputs, stream)
    print("Test Case: " + str(case_num))
    print("Prediction: " + str(pred))
Dependencies

This sample needs PyTorch and a few auxiliary Python packages. To install them, run the following commands:

cd /usr/src/tensorrt/samples/python/network_api_pytorch_mnist
pip3 install --upgrade pip
pip3 install -r requirements.txt
Usage
cd /usr/src/tensorrt/samples/python/network_api_pytorch_mnist
python3 sample.py
Result
Train Epoch: 1 [0/60000 (0%)]	Loss: 2.354930
Train Epoch: 1 [6400/60000 (11%)]	Loss: 0.570188
Train Epoch: 1 [12800/60000 (21%)]	Loss: 0.335233
Train Epoch: 1 [19200/60000 (32%)]	Loss: 0.336444
Train Epoch: 1 [25600/60000 (43%)]	Loss: 0.245888
Train Epoch: 1 [32000/60000 (53%)]	Loss: 0.159086
Train Epoch: 1 [38400/60000 (64%)]	Loss: 0.191686
Train Epoch: 1 [44800/60000 (75%)]	Loss: 0.237418
Train Epoch: 1 [51200/60000 (85%)]	Loss: 0.055748
Train Epoch: 1 [57600/60000 (96%)]	Loss: 0.092222

Test set: Average loss: 0.0897, Accuracy: 9712/10000 (97%)

Hardware Compatibility Level: HardwareCompatibilityLevel.NONE
Number of Auxiliary Streams: 0
Test Case: 1
Prediction: 1


