NVIDIA Jetson AGX Thor - Blackwell GPU
The NVIDIA Jetson AGX Thor documentation from RidgeRun is presently being developed.
Overview
The Jetson AGX Thor comes with 2560 NVIDIA® CUDA® cores and 96 fifth-generation Tensor Cores (3 GPCs, 10 TPCs), with MIG support.
Highlights of the GPU architecture in Thor SoC:
- Advanced graphics capabilities, with next generation DLSS.
- Fifth-generation Tensor Cores, adding support for high-throughput FP8 compute to reduce the need for FP16 compute.
- Transformer Engine accelerates Transformer model inference.
- Multi-Instance GPU (MIG) support provides multi-domain computing, allowing the GPU to be partitioned into separate GPU instances where required.
Multi-Instance GPU (MIG)
Allows the GPU to be split into two physically separated GPU instances to provide isolation. Graphics is supported on only one MIG partition, or across the whole GPU when MIG is not in use. Compute functionality is available on all MIG partitions.
Transformer Engine
Uses a combination of software and custom NVIDIA Tensor core technology designed specifically to accelerate transformer model inference.
FP8 and FP4 support
FP8 format is normally based on an E4M3 encoding that includes:
- One sign bit, four exponent bits, three mantissa bits
- Subnormals and NaNs are supported
This format has a much wider dynamic range (roughly 2000× wider) than INT8, which makes it less dependent on post-training quantization parameters.
E5M2 FP8 is also supported, but that format is intended for gradients during training rather than for inference.
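As a rough illustration of that dynamic-range claim, the E4M3 encoding described above can be decoded in a few lines. This is a sketch only (the `e4m3_decode` helper is hypothetical), using the bit layout and the standard bias of 7:

```python
def e4m3_decode(sign, exp, mant):
    # E4M3: 1 sign bit, 4 exponent bits (bias 7), 3 mantissa bits.
    # exp == 0 encodes subnormals; exp = 0b1111 with mant = 0b111 is NaN
    # (the format has no infinities).
    if exp == 0b1111 and mant == 0b111:
        return float("nan")
    if exp == 0:
        value = (mant / 8) * 2.0 ** (1 - 7)        # subnormal
    else:
        value = (1 + mant / 8) * 2.0 ** (exp - 7)  # normal
    return -value if sign else value

largest = e4m3_decode(0, 0b1111, 0b110)   # 448.0, the E4M3 maximum
smallest = e4m3_decode(0, 0b0000, 0b001)  # smallest positive subnormal, 2**-9
# Dynamic range: 448 / 2**-9 = 229376, vs. 127 for INT8 -- about 1800x wider.
print(largest, smallest)
```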
FP4 format is based on E2M1 encoding that includes:
- One sign bit, two exponent bits, one mantissa bit
- Infinity and NaNs are not supported
Although the data elements are only four bits wide, they can be rescaled using a UE8M0 factor for every 32 elements or a UE4M3 factor for every 16 elements. This scaling mechanism allows the FP4 format to achieve a wide dynamic range.
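The block-scaling idea can be sketched as follows. This is illustrative Python, not the hardware algorithm: `quantize_fp4_block` and its rounding rule are hypothetical simplifications that use a single UE8M0-style power-of-two scale shared by the whole block:

```python
import math

E2M1_MAGS = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]  # all magnitudes E2M1 encodes

def quantize_fp4_block(block):
    # Pick a UE8M0-style scale: a power of two shared by the block, chosen so
    # the block maximum maps near E2M1's largest magnitude (6.0).
    amax = max(abs(v) for v in block) or 1.0
    scale = 2.0 ** math.ceil(math.log2(amax / 6.0))
    def nearest(v):
        mag = min(E2M1_MAGS, key=lambda c: abs(abs(v) / scale - c))
        return math.copysign(mag * scale, v)
    return [nearest(v) for v in block]

# With the shared scale = 2, every value in this block survives quantization:
print(quantize_fp4_block([0.0, 1.0, -6.0, 12.0]))
```

Because only the shared exponent changes per block, a 4-bit element can still represent values far outside E2M1's native [-6, 6] range, which is where the wide dynamic range comes from.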
Compute features
- Introduces fifth-generation NVIDIA Tensor Cores, supporting a broad range of precisions (TF32, bfloat16, FP16, FP8, FP4, and INT8) and delivering unmatched versatility and performance.
- Adds structured sparsity support, enabling parameters in AI models to be set to zero without loss of accuracy, allowing Tensor Cores to achieve up to 2× higher inference performance.
- Implements Compute Data Compression to accelerate unstructured sparsity and other compressible patterns, providing up to 4× improvement in DRAM read/write and L2 bandwidth, and up to 2× greater L2 capacity.
- Includes additional enhancements that further boost compute throughput.
Note: TensorFloat-32 (TF32) combines the 10-bit mantissa of FP16 with the 8-bit exponent of FP32, ensuring sufficient precision for AI workloads while retaining the full numeric range of FP32.
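A quick way to see what TF32 keeps and drops is to reduce an FP32 value to 10 mantissa bits. This is a sketch: `to_tf32` is a hypothetical helper, and it truncates for simplicity where real hardware conversion rounds:

```python
import struct

def to_tf32(x):
    # TF32 keeps FP32's 8-bit exponent but only 10 of its 23 mantissa bits.
    # Simplification: truncate the low 13 mantissa bits of the FP32 encoding.
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    return struct.unpack(">f", struct.pack(">I", bits & ~0x1FFF))[0]

print(to_tf32(1 + 2**-10))  # fits in 10 mantissa bits: unchanged
print(to_tf32(1 + 2**-11))  # needs an 11th mantissa bit: collapses to 1.0
```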
CUDA
Overview
Jetson Thor now supports a unified CUDA 13.0 installation across all Arm targets, streamlining development, reducing fragmentation, and ensuring consistency from server-class systems to Jetson Thor.
To see the component versions needed for CUDA 13.0, visit CUDA Toolkit Major Components.
Running CUDA 13.0 requires a system with a CUDA-capable GPU and a driver that is compatible with the CUDA Toolkit.
CUDA Toolkit | Driver Range for Minor Version Compatibility | Linux x86_64 Driver Version |
---|---|---|
13.x / 13.0 GA | >=580 | >=580.65.06 |
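The minimum-driver rule above can be checked mechanically. A small sketch (the `driver_ok` helper is hypothetical):

```python
def driver_ok(version, minimum="580.65.06"):
    # Compare dotted driver versions numerically; a plain string compare
    # would wrongly rank "9" above "58", for example.
    parse = lambda s: [int(part) for part in s.split(".")]
    return parse(version) >= parse(minimum)

print(driver_ok("580.65.06"))  # True: exactly the CUDA 13.0 GA minimum
print(driver_ok("575.51.03"))  # False: below the >=580 range
```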
General CUDA
- Unified toolkit for Arm platforms (except Jetson Orin).
- Performance: 32-byte aligned vector types for Blackwell; registers can spill to shared memory (10× lower latency than L2).
- Platform support: New OS versions – RHEL 10.0/9.6, Debian 12.10, Fedora 42, Rocky Linux 10.0/9.6.
- Architecture: SM101 renamed to SM110; CUDA 13.0 supports GPUs from Turing through Grace Blackwell.
- Memory & APIs:
- Host allocations now supported with cuMemCreate / cudaMallocAsync.
- Managed memory discard via new UVM APIs.
- Coherent memory platforms can be initialized in non-NUMA mode.
- Runtime & tools:
- Rich error reporting added to runtime + memcpy APIs.
- cuda-checkpoint enables GPU migration via UUID mapping.
- Fatbin compression switched to Zstd for smaller binaries.
- Runtime now supports contextless loading.
- Hostnames allowed in nvidia-imex node configs.
- All Windows executables/dlls are signed.
- CUDA Compiler
- PTX ISA updated to v9.0.
- API changes:
- cudaGraphKernelNodeCopyAttributes argument order updated.
- nvJitLinkCreate options now const char * const *.
- NVCC:
- Default compression uses --compress-mode=balance (Zstd).
- New -jobserver flag for thread control under GNU Make 4.4+.
- Fatbinary: old -image option removed (use -image2 or -image3).
- Compiler support: new host compilers Clang 20 and GCC 15.
- Libraries: libNVVM, libNVPTXCompiler, and CUDA CRT split from NVCC and aligned with LLVM 20.
- Developer Tools
- New tool Compile Time Advisor (ctadvisor) to analyze and optimize CUDA build times.
Blackwell compatibility
- CUDA kernels can be shipped as cubin (native binary for a specific compute capability) and/or PTX (intermediate code).
- Cubin compatibility: works on GPUs with the same major compute capability and equal or higher minor version.
- cubin for 8.0 → runs on 8.6;
- cubin for 8.6 → does not run on 8.0.
- Cubins do not cross major versions (an 8.x cubin won’t run on 9.0).
- PTX is forward-compatible: it’s JIT-compiled at runtime to a cubin and will run on GPUs with the same or higher compute capability than it was generated for.
- PTX for 9.x → runs on 9.x and 10.0.
- Recommendation: always include PTX alongside any cubins to keep your app forward-compatible with newer GPUs.
To read more about cubin and PTX compatibility, see Compilation with NVCC.
When an app launches a CUDA kernel, the runtime checks the GPU’s compute capability. If there is a matching cubin for that GPU, it runs directly. If no cubin matches, the runtime will JIT-compile PTX into a cubin and run that. If neither cubin nor PTX is available, the kernel launch fails.
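The selection logic just described can be sketched in a few lines (illustrative Python; `select_binary` is a hypothetical name, and compute capabilities are modeled as (major, minor) tuples):

```python
def select_binary(gpu_cc, cubin_ccs, ptx_cc=None):
    # A cubin runs only on a GPU with the same major compute capability
    # and an equal-or-higher minor version.
    for major, minor in cubin_ccs:
        if major == gpu_cc[0] and minor <= gpu_cc[1]:
            return "run cubin"
    # PTX is forward-compatible: JIT-compile it if the GPU's compute
    # capability is at least the one the PTX was generated for.
    if ptx_cc is not None and gpu_cc >= ptx_cc:
        return "JIT PTX"
    return "launch fails"

print(select_binary((8, 6), [(8, 0)]))           # cubin for 8.0 runs on 8.6
print(select_binary((8, 0), [(8, 6)]))           # cubin for 8.6 fails on 8.0
print(select_binary((10, 0), [(8, 6)], (9, 0)))  # falls back to JIT-ing 9.x PTX
```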
Binaries with PTX included will work on new GPUs like Blackwell without needing rebuilds. Binaries with only cubins (no PTX) must be rebuilt to support Blackwell. To know more about building compatible applications read Building Applications with Blackwell Architecture Support
OpenGL
Like all recent NVIDIA GPUs, Blackwell supports OpenGL 4.6, OpenGL ES 3.2, Vulkan 1.3, and EGL 1.5. Support has been present since the first Blackwell-enabled drivers (the CUDA 12.8 / 13.0 driver releases).
For OpenGL specifically, there are no unique Blackwell-only extensions reported.
Blackwell’s innovations are primarily in AI and compute:
- Enhanced Tensor Cores with FP4/FP8 precision → massive training/inference speedups.
- Next-gen Multi-Instance GPU (MIG) → better GPU partitioning for multi-tenant workloads.
- NVLink 5 → up to 1.8 TB/s interconnect bandwidth.
- Memory hierarchy improvements → higher bandwidth, lower latency.
These boost CUDA, HPC, and AI frameworks rather than OpenGL directly.
Traditional OpenGL-CUDA interop works by CUDA directly consuming handles created in OpenGL. Since both OpenGL and CUDA can also consume memory and synchronization objects created in Vulkan, an alternative interop approach exists: memory and synchronization objects exported by Vulkan can be imported into both OpenGL and CUDA and used to coordinate memory accesses between them.
For further information, refer to OpenGL Interoperability.
OpenCV
OpenCV does not include CUDA support in its prebuilt binaries. To leverage CUDA 13.0 (which supports NVIDIA architectures from Turing through Blackwell), you will need to build OpenCV from source with CUDA enabled, making sure the build targets the correct compute capability.
Some performance considerations:
- Many OpenCV CUDA functions are optimized for FP32/INT8.
- For advanced AI acceleration, you would typically use TensorRT.
DeepStream
(Coming soon)
Previous version comparison
Feature | NVIDIA Volta architecture GPU (AGX Xavier) | NVIDIA Ampere architecture GPU (AGX Orin) | NVIDIA Blackwell architecture GPU (AGX Thor) |
---|---|---|---|
CUDA Cores | 512 | 2048 | 2560 |
Tensor Cores | 64 (1st gen) | 64 (3rd gen) | 96 (5th gen, FP4/FP8) |
AI Performance | ~30 TOPS | 200-275 TOPS | 1000+ TOPS |
Compute Capability | 7.2 | 8.7 | 10.0 |
GPU Max Freq | 1.1 GHz - 1.2 GHz | 930 MHz - 1.3 GHz | 1.57 GHz |
Memory | LPDDR4x, 64 GB | LPDDR5, 64 GB | LPDDR5x, 128 GB |
Features | - | Sparsity | MIG, FP4, Transformer Engine |
What RidgeRun can offer
As an official NVIDIA Jetson ecosystem partner, RidgeRun provides end-to-end services that can now leverage the Blackwell GPU in Jetson Thor:
- Custom driver development (cameras, sensors, V4L2 pipelines).
- Optimized multimedia pipelines using GStreamer (NVENC/NVDEC, VIC, PVA).
- AI model integration and optimization with DeepStream and TensorRT (e.g., quantization, FP8/INT8 acceleration, multi-stream and multi-model concurrency).
- Vision APIs such as OpenCV, plus advanced algorithms like image stitching, video stabilization, bird’s eye view.
- Secure Boot, firmware signing, full BSP customization, and kernel/DTB integration for production.
- Ready-to-use products:
- CUDA Camera Undistort
- Bird's Eye View plugin
- DeepStream reference pipelines
- Video analytics and GStreamer plugins
For direct inquiries, please refer to the contact information available on our Contact page. Alternatively, you may complete and submit the form provided at the same link. We will respond to your request at our earliest opportunity.
Links to RidgeRun Resources and RidgeRun Artificial Intelligence Solutions can be found in the footer below.