
Xavier/Processors/GPU/Description

</noinclude>


The same Volta GPU architecture that powers NVIDIA high-performance computing (HPC) products was adapted for use in Xavier series modules. The Volta architecture features a new Streaming Multiprocessor (SM) optimized for deep learning.


__TOC__


==Volta Streaming Multiprocessor==
The new Volta SM is far more energy-efficient than previous generations, enabling major performance boosts in the same power envelope. The Volta SM includes:
#New programmable Tensor Cores purpose-built for INT8/FP16/FP32 deep learning tensor operations; IMMA and HMMA instructions accelerate integer and mixed-precision matrix-multiply-and-accumulate operations.
#Enhanced L1 data cache for higher performance and lower latency.
#Streamlined instruction set for simpler decoding and reduced instruction latencies.
#Higher clocks and higher power efficiency.
The Volta architecture also incorporates a new generation of its memory subsystem and enhanced unified memory and address translation services that increase memory bandwidth and improve utilization for greater efficiency.


==Graphics Processing Cluster==
The Graphics Processing Cluster (GPC) is a dedicated hardware block for compute, rasterization, shading, and texturing; most of the GPU's core graphics functions are performed inside the GPC. It comprises four Texture Processing Clusters (TPCs), each containing two SM units, and a Raster Engine. The SM unit creates, manages, schedules, and executes instructions from many threads in parallel. Raster operators (ROPs) continue to be aligned with L2 cache slices and memory controllers. The SM's geometry and pixel processing performance makes it highly suitable for rendering advanced user interfaces; the efficiency of the Volta GPU enables this performance in power-limited environments.
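As a rough sketch of how the hierarchy above multiplies out (the counts are from this article; variable names are illustrative, not NVIDIA terminology):

```python
# Unit counts per GPC, as described in this article.
TPCS_PER_GPC = 4   # four Texture Processing Clusters per GPC
SMS_PER_TPC = 2    # each TPC contains two SM units
SMPS_PER_SM = 4    # each SM is partitioned into four processing blocks (SMPs)

sms_per_gpc = TPCS_PER_GPC * SMS_PER_TPC    # 8 SMs per GPC
smps_per_gpc = sms_per_gpc * SMPS_PER_SM    # 32 SMPs per GPC
print(sms_per_gpc, smps_per_gpc)  # 8 32
```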


Each SM is partitioned into four separate processing blocks (referred to as SMPs); each SMP contains its own instruction buffer, scheduler, CUDA cores, and Tensor cores. Inside each SMP, CUDA cores perform pixel/vertex/geometry shading and physics/compute calculations, and each Tensor core provides a 4x4x4 matrix processing array to perform mixed-precision fused multiply-add (FMA) mathematical operations. Texture units perform texture filtering, and load/store units fetch and save data to memory. Special Function Units (SFUs) handle transcendental and graphics interpolation instructions.
Finally, the PolyMorph Engine handles vertex fetch, tessellation, viewport transform, attribute setup, and stream output.
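The mixed-precision Tensor core operation described above can be emulated in NumPy to show its semantics: a 4x4x4 matrix multiply-and-accumulate D = A·B + C with FP16 operands and FP32 accumulation (a sketch only; NumPy stands in for the hardware instruction):

```python
import numpy as np

# FP16 multiplicands, FP32 accumulator -- the operand shapes and precisions
# match the 4x4x4 mixed-precision FMA described in the text.
rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4)).astype(np.float16)
B = rng.standard_normal((4, 4)).astype(np.float16)
C = rng.standard_normal((4, 4)).astype(np.float32)

# Promote the FP16 operands to FP32 before multiplying, so the products
# and the running sum are accumulated at full FP32 precision.
D = A.astype(np.float32) @ B.astype(np.float32) + C
```

Accumulating in FP32 avoids the precision loss that summing many FP16 products would otherwise introduce.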


*Geometry and Vertex attribute Instancing.
*Parallel pixel processing.
*Early-z reject: Fast rejection of occluded pixels acts as a multiplier on pixel shader and texture performance while saving power and bandwidth.
*Video protection region.
*Power saving: Multiple levels of clock gating for linear scaling of power.
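The early-z reject listed above can be sketched as follows: pixels that fail the depth test are discarded before the expensive shading step, which is why rejection multiplies effective shader and texture throughput (function names and the cost model here are hypothetical, not the hardware pipeline):

```python
def shade(pixel):
    # Stand-in for the expensive pixel-shading + texturing work.
    return pixel["color"]

def rasterize(pixels, depth_buffer):
    """Depth-test each pixel BEFORE shading; occluded pixels cost nothing."""
    shaded = []
    for p in pixels:
        key = p["pos"]
        if p["depth"] >= depth_buffer[key]:
            continue                      # occluded: rejected before shading
        depth_buffer[key] = p["depth"]    # passed: update depth, then shade
        shaded.append(shade(p))
    return shaded

depth_buffer = {(0, 0): 1.0}
pixels = [
    {"pos": (0, 0), "depth": 0.5, "color": "near"},
    {"pos": (0, 0), "depth": 0.9, "color": "far"},  # behind "near": rejected
]
print(rasterize(pixels, depth_buffer))  # ['near']
```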