NVIDIA Jetson Xavier - Description of the Volta GPU





Previous: Processors/GPU Index Next: Processors/GPU/CUDA






The same Volta GPU architecture that powers NVIDIA high-performance computing (HPC) products was adapted for use in Xavier series modules. The Volta architecture features a new Streaming Multiprocessor (SM) optimized for deep learning.

Volta Streaming Multiprocessor

The new Volta SM is far more energy-efficient than previous generations, enabling major performance gains within the same power envelope. The Volta SM includes:

  1. New programmable Tensor Cores purpose-built for INT8/FP16/FP32 deep learning tensor operations; IMMA and HMMA instructions accelerate integer and mixed-precision matrix-multiply-and-accumulate operations.
  2. Enhanced L1 data cache for higher performance and lower latency.
  3. Streamlined instruction set for simpler decoding and reduced instruction latencies.
  4. Higher clocks and higher power efficiency.

The Volta architecture also incorporates a new-generation memory subsystem with enhanced unified memory and address translation services, which increase memory bandwidth and improve utilization for greater efficiency.
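On Jetson modules the CPU and the integrated GPU share the same physical DRAM, so CUDA unified memory lets a single allocation be accessed from both sides without explicit copies. As a minimal illustrative sketch (not part of the original page, assuming CUDA is installed on the target), the following program allocates a buffer with cudaMallocManaged, fills it on the CPU, and scales it on the GPU:

    #include <cuda_runtime.h>
    #include <stdio.h>

    /* Simple kernel: scale every element of the buffer in place. */
    __global__ void scale(float *data, float factor, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            data[i] *= factor;
    }

    int main(void)
    {
        const int n = 1 << 20;
        float *data = NULL;

        /* One allocation visible to both the CPU and the GPU (unified memory). */
        cudaMallocManaged(&data, n * sizeof(float));

        for (int i = 0; i < n; i++)
            data[i] = 1.0f;                             /* CPU writes directly */

        scale<<<(n + 255) / 256, 256>>>(data, 2.0f, n); /* GPU uses the same pointer */
        cudaDeviceSynchronize();

        printf("data[0] = %f\n", data[0]);              /* expected: 2.000000 */
        cudaFree(data);
        return 0;
    }

Compile with nvcc (for example, nvcc -arch=sm_72 managed.cu -o managed on Xavier).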

Graphics Processing Cluster

The Graphics Processing Cluster (GPC) is a dedicated hardware block for computing, rasterization, shading, and texturing; most of the GPU’s core graphics functions are performed inside the GPC. It comprises four Texture Processing Clusters (TPCs), each containing two SM units, plus a Raster Engine. The SM creates, manages, schedules, and executes instructions from many threads in parallel. Raster operators (ROPs) remain aligned with L2 cache slices and memory controllers. The geometry and pixel processing performance of the SM makes it highly suitable for rendering advanced user interfaces, and the efficiency of the Volta GPU delivers this performance in power-limited environments.

Each SM is partitioned into four separate processing blocks (referred to as SMPs); each SMP contains its own instruction buffer, scheduler, CUDA cores, and Tensor Cores. Within each SMP, the CUDA cores perform pixel/vertex/geometry shading and physics/compute calculations, while each Tensor Core provides a 4x4x4 matrix processing array that performs mixed-precision fused multiply-add (FMA) operations. Texture units perform texture filtering, load/store units fetch and save data to memory, and Special Function Units (SFUs) handle transcendental and graphics interpolation instructions. Finally, the PolyMorph Engine handles vertex fetch, tessellation, viewport transform, attribute setup, and stream output.
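In CUDA the Tensor Cores are not driven one 4x4x4 operation at a time; they are exposed through the warp-level WMMA API, where a full warp cooperatively computes D = A*B + C on small tiles (16x16x16 with FP16 inputs and FP32 accumulation is the common shape, mapping to the HMMA instructions mentioned above). The following sketch is illustrative only, assuming a compute capability 7.2 device such as Xavier and input tiles already resident in device memory:

    #include <mma.h>
    #include <cuda_fp16.h>

    using namespace nvcuda;

    /* One warp computes a single 16x16 output tile D = A * B + C,
     * with FP16 inputs and FP32 accumulation (the HMMA path on the Tensor Cores). */
    __global__ void wmma_tile(const half *A, const half *B, const float *C, float *D)
    {
        wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
        wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
        wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;

        wmma::load_matrix_sync(a_frag, A, 16);                      /* leading dimension 16 */
        wmma::load_matrix_sync(b_frag, B, 16);
        wmma::load_matrix_sync(c_frag, C, 16, wmma::mem_row_major);

        wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);             /* Tensor Core multiply-accumulate */

        wmma::store_matrix_sync(D, c_frag, 16, wmma::mem_row_major);
    }

    /* Launch with a single warp: wmma_tile<<<1, 32>>>(A, B, C, D);
     * Compile for Xavier with: nvcc -arch=sm_72 wmma_tile.cu */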

Features

  • 512 CUDA cores (see the device-query sketch after this list).
  • End-to-end lossless compression.
  • Tile Caching.
  • OpenGL 4.6, OpenGL ES 3.2, and Vulkan 1.0.
  • Adaptive Scalable Texture Compression (ASTC) LDR profile supported.
  • DirectX 12 compliant.
  • CUDA support.
  • Iterated blend, ROP OpenGL-ES blend modes.
  • 2D BLIT from 3D class avoids channel switch.
  • 2D color compression.
  • Constant color render SM bypass.
  • 2x, 4x, 8x MSAA with color and Z compression.
  • Non-power-of-2 and 3D textures, FP16 texture filtering.
  • FP16 shader support.
  • Geometry and Vertex attribute Instancing.
  • Parallel pixel processing.
  • Early-z reject: Fast rejection of occluded pixels acts as a multiplier on pixel shader and texture performance while saving power and bandwidth.
  • Video protection region.
  • Power saving: Multiple levels of clock gating for linear scaling of power.
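
As a quick, illustrative way to confirm some of these figures on a given board (not part of the original page), the CUDA runtime can report the GPU name, compute capability, and SM count; on Xavier the integrated Volta GPU reports compute capability 7.2, and its 512 CUDA cores correspond to 64 cores per SM across 8 SMs:

    #include <cuda_runtime.h>
    #include <stdio.h>

    int main(void)
    {
        cudaDeviceProp prop;

        if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) {
            fprintf(stderr, "No CUDA device found\n");
            return 1;
        }

        /* On Jetson AGX Xavier this typically reports compute capability 7.2
         * and 8 multiprocessors (8 SMs x 64 CUDA cores = 512 cores). */
        printf("Device name        : %s\n", prop.name);
        printf("Compute capability : %d.%d\n", prop.major, prop.minor);
        printf("SM count           : %d\n", prop.multiProcessorCount);
        printf("Integrated GPU     : %s\n", prop.integrated ? "yes" : "no");
        return 0;
    }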



Previous: Processors/GPU Index Next: Processors/GPU/CUDA