NVIDIA Jetson Xavier - Using CUDA

From RidgeRun Developer Wiki

Build All CUDA Samples

1. Go to the samples directory:

cd /usr/local/cuda/samples

2. Build the samples using the Makefile (sudo is used because the samples directory is typically owned by root):

sudo make
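
Building the whole tree can take a while. As a rough sketch, a single sample can be built by pointing make at that sample's directory. The helper name build_sample and the SAMPLES_ROOT variable are hypothetical (they are not part of the CUDA toolkit), and the snippet assumes each sample directory ships its own Makefile.

```shell
# Hypothetical helper (not part of the CUDA toolkit): build one sample only.
# Usage: build_sample 1_Utilities/deviceQuery
build_sample() {
    # SAMPLES_ROOT is an assumed override; it defaults to the path used above.
    root="${SAMPLES_ROOT:-/usr/local/cuda/samples}"
    sample_dir="$root/$1"
    if [ ! -f "$sample_dir/Makefile" ]; then
        echo "build_sample: no Makefile in $sample_dir" >&2
        return 1
    fi
    # make -C changes into the sample directory before building;
    # -j with nproc runs one compile job per CPU core.
    sudo make -C "$sample_dir" -j"$(nproc)"
}
```

The argument is a path relative to the samples root, written as in the tables below but without the leading slash.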

CUDA Samples

All the samples are in:

/usr/local/cuda/samples
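
To see which samples are actually present on a given installation, the tree can be walked from the shell. This is a minimal sketch: list_samples is a hypothetical helper, and it assumes the category directories are named with a leading digit and underscore (0_Simple, 1_Utilities, ...) and that every sample directory contains a Makefile.

```shell
# Hypothetical helper: print "category/sample" for every sample directory.
# Usage: list_samples            (uses /usr/local/cuda/samples)
#        list_samples /some/dir  (uses another samples root)
list_samples() {
    root="${1:-/usr/local/cuda/samples}"
    # Category directories are assumed to look like 0_Simple, 1_Utilities, ...
    for category in "$root"/[0-9]_*/; do
        [ -d "$category" ] || continue
        for sample in "$category"*/; do
            # Skip directories that do not look like a buildable sample.
            [ -f "${sample}Makefile" ] || continue
            printf '%s/%s\n' "$(basename "$category")" "$(basename "$sample")"
        done
    done
}
```

The printed names match the Path column of the tables below, without the leading slash.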

Simple Samples

Path Sample Description
/0_Simple/asyncAPI asyncAPI This sample uses CUDA streams and events to overlap execution on CPU and GPU.
/0_Simple/cdpSimplePrint cdpSimplePrint This sample demonstrates simple printf implemented using CUDA Dynamic Parallelism. This sample requires devices with compute capability 3.5 or higher.
/0_Simple/cdpSimpleQuicksort cdpSimpleQuicksort This sample demonstrates simple quicksort implemented using CUDA Dynamic Parallelism. This sample requires devices with compute capability 3.5 or higher.
/0_Simple/clock clock This example shows how to use the clock function to measure the performance of a block of threads of a kernel accurately.
/0_Simple/cppIntegration cppIntegration This example demonstrates how to integrate CUDA into an existing C++ application, i.e. the CUDA entry point on the host side is only a function which is called from C++ code, and only the file containing this function is compiled with nvcc. It also demonstrates that vector types can be used from cpp.
/0_Simple/cppOverload cppOverload This sample demonstrates how to use C++ function overloading on the GPU.
/0_Simple/cudaOpenMP cudaOpenMP This sample demonstrates how to use the OpenMP API to write an application for multiple GPUs.
/0_Simple/fp16ScalarProduct fp16ScalarProduct Calculates scalar product of two vectors of FP16 numbers.
/0_Simple/inlinePTX inlinePTX A simple test application that demonstrates a new CUDA 4.0 ability to embed PTX in a CUDA kernel.
/0_Simple/matrixMul matrixMul This sample implements matrix multiplication using a tiling approach, which makes use of shared memory to ensure data reuse.
/0_Simple/matrixMulCUBLAS matrixMulCUBLAS This sample implements matrix multiplication. To illustrate GPU performance for matrix multiply, this sample also shows how to use the new CUDA 4.0 interface for CUBLAS to demonstrate high-performance matrix multiplication.
/0_Simple/matrixMulDrv matrixMulDrv This sample implements matrix multiplication and uses the new CUDA 4.0 kernel launch Driver API.
/0_Simple/simpleAssert simpleAssert This CUDA Runtime API sample is a very basic sample that shows how to use the assert function in device code. Requires Compute Capability 2.0.
/0_Simple/simpleAtomicIntrinsics simpleAtomicIntrinsics A simple demonstration of global memory atomic instructions. Requires Compute Capability 2.0 or higher.
/0_Simple/simpleCallback simpleCallback This sample implements multi-threaded heterogeneous computing workloads with the new CPU callbacks for CUDA streams and events introduced with CUDA 5.0.
/0_Simple/simpleCooperativeGroups simpleCooperativeGroups This sample is a simple code that illustrates the basic usage of cooperative groups within the thread block.
/0_Simple/simpleCubemapTexture simpleCubemapTexture Simple example that demonstrates how to use a new CUDA 4.1 feature to support cubemap Textures in CUDA C.
/0_Simple/simpleCudaGraphs simpleCudaGraphs A demonstration of CUDA Graphs creation, instantiation, and launch using Graphs APIs and Stream Capture APIs.
/0_Simple/simpleLayeredTexture simpleLayeredTexture Simple example that demonstrates how to use a new CUDA 4.0 feature to support layered Textures in CUDA C.
/0_Simple/simpleMPI simpleMPI Simple example demonstrating how to use MPI in combination with CUDA.
/0_Simple/simpleMultiCopy simpleMultiCopy This sample illustrates the usage of CUDA streams to achieve overlapping of kernel execution with data copies to and from the device.
/0_Simple/simpleMultiGPU simpleMultiGPU This application demonstrates how to use the new CUDA 4.0 API for CUDA context management and multi-threaded access to run CUDA kernels on multiple GPUs.
/0_Simple/simpleOccupancy simpleOccupancy This sample demonstrates the basic usage of the CUDA occupancy calculator and occupancy-based launch configurator APIs by launching a kernel with the launch configurator and measures the utilization difference against a manually configured launch.
/0_Simple/simplePitchLinearTexture simplePitchLinearTexture Demonstrates the use of pitch linear textures.
/0_Simple/simplePrintf simplePrintf This CUDA Runtime API sample is a very basic sample that shows how to use the printf function in device code.
/0_Simple/simpleSeparateCompilation simpleSeparateCompilation This sample demonstrates a CUDA 5.0 feature, the ability to create a GPU device static library and use it within another CUDA kernel. This example demonstrates how to pass in a GPU device function (from the GPU device static library) as a function pointer to be called.
/0_Simple/simpleStreams simpleStreams This sample uses CUDA streams to overlap kernel executions with memory copies between the host and a GPU device.
/0_Simple/simpleSurfaceWrite simpleSurfaceWrite Simple example that demonstrates the use of 2D surface references (Write-to-Texture).
/0_Simple/simpleTemplates simpleTemplates This sample is a templatized version of the template project. It also shows how to correctly templatize dynamically allocated shared memory arrays.
/0_Simple/simpleTexture simpleTexture Simple example that demonstrates use of Textures in CUDA.
/0_Simple/simpleTextureDrv simpleTextureDrv Simple example that demonstrates the use of Textures in CUDA. This sample uses the new CUDA 4.0 kernel launch Driver API.
/0_Simple/simpleVoteIntrinsics simpleVoteIntrinsics Simple program which demonstrates how to use the Vote (any, all) intrinsic instruction in a CUDA kernel.
/0_Simple/simpleZeroCopy simpleZeroCopy This sample illustrates how to use Zero MemCopy: kernels can read and write directly to pinned system memory.
/0_Simple/template template A trivial template project that can be used as a starting point to create new CUDA projects.
/0_Simple/UnifiedMemoryStreams UnifiedMemoryStreams This sample demonstrates the use of OpenMP and streams with Unified Memory on a single GPU.
/0_Simple/vectorAdd vectorAdd This CUDA Runtime API sample is a very basic sample that implements element by element vector addition.
/0_Simple/vectorAddDrv vectorAddDrv This basic sample implements element-by-element vector addition using the CUDA Driver API.

Utilities Samples

Path Sample Description
/1_Utilities/bandwidthTest bandwidthTest This is a simple test program to measure the memcpy bandwidth of the GPU and the memcpy bandwidth across PCIe.
/1_Utilities/deviceQuery deviceQuery This sample enumerates the properties of the CUDA devices present in the system.
/1_Utilities/deviceQueryDrv deviceQueryDrv This sample enumerates the properties of the CUDA devices present using CUDA Driver API calls.
/1_Utilities/p2pBandwidthLatencyTest p2pBandwidthLatencyTest This application demonstrates the CUDA Peer-To-Peer (P2P) data transfers between pairs of GPUs and computes latency and bandwidth.
/1_Utilities/UnifiedMemoryPerf UnifiedMemoryPerf This sample uses a matrix multiplication kernel to compare the performance of Unified Memory, with and without hints, against other types of memory (zero-copy buffers, pageable memory, and page-locked memory) with synchronous and asynchronous transfers on a single GPU.
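
After a build, deviceQuery is a convenient first check that the GPU is visible to CUDA. The sketch below assumes the sample Makefiles leave each executable in its own sample directory under the sample's name; run_sample and SAMPLES_ROOT are hypothetical names introduced here for illustration.

```shell
# Hypothetical helper: run an already-built sample from its own directory.
# Usage: run_sample 1_Utilities/deviceQuery
#        run_sample 5_Simulations/particles -particles=1024
run_sample() {
    # SAMPLES_ROOT is an assumed override for the default samples path.
    root="${SAMPLES_ROOT:-/usr/local/cuda/samples}"
    dir="$root/$1"
    # Assumption: the executable is named after the sample directory.
    exe="$dir/$(basename "$1")"
    if [ ! -x "$exe" ]; then
        echo "run_sample: $exe not found; build the sample first" >&2
        return 1
    fi
    shift
    # Run inside the sample directory (in a subshell) so that any
    # relative data paths used by the sample resolve correctly.
    ( cd "$dir" && "$exe" "$@" )
}
```

Any extra arguments are passed through unchanged, so sample-specific options can be appended after the sample path.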

Graphics Samples

Path Sample Description
/2_Graphics/bindlessTexture bindlessTexture This example demonstrates use of cudaSurfaceObject, cudaTextureObject, and MipMap support in CUDA.
/2_Graphics/Mandelbrot Mandelbrot This sample uses CUDA to compute and display the Mandelbrot or Julia sets interactively. It also illustrates the use of "double single" arithmetic to improve precision when zooming a long way into the pattern.
/2_Graphics/marchingCubes marchingCubes This sample extracts a geometric isosurface from a volume dataset using the marching cubes algorithm. It uses the scan (prefix sum) function from the Thrust library to perform stream compaction.
/2_Graphics/simpleGL simpleGL Simple program which demonstrates interoperability between CUDA and OpenGL. The program modifies vertex positions with CUDA and uses OpenGL to render the geometry.
/2_Graphics/simpleGLES simpleGLES Demonstrates data exchange between CUDA and OpenGL ES (aka Graphics interop). The program modifies vertex positions with CUDA and uses OpenGL ES to render the geometry.
/2_Graphics/simpleGLES_EGLOutput simpleGLES_EGLOutput Demonstrates data exchange between CUDA and OpenGL ES (aka Graphics interop). The program modifies vertex positions with CUDA and uses OpenGL ES to render the geometry, and shows how to render directly to the display using the EGLOutput mechanism and the DRM library.
/2_Graphics/simpleTexture3D simpleTexture3D Simple example that demonstrates use of 3D Textures in CUDA.
/2_Graphics/volumeFiltering volumeFiltering This sample demonstrates 3D Volumetric Filtering using 3D Textures and 3D Surface Writes.
/2_Graphics/volumeRender volumeRender This sample demonstrates basic volume rendering using 3D Textures.

Imaging Samples

Path Sample Description
/3_Imaging/bicubicTexture bicubicTexture This sample demonstrates how to efficiently implement a Bicubic B-spline interpolation filter with CUDA texture.
/3_Imaging/bilateralFilter bilateralFilter Bilateral filter is an edge-preserving non-linear smoothing filter that is implemented with CUDA with OpenGL rendering. It can be used in image recovery and denoising. Each pixel is weighted by considering both the spatial distance and color distance between its neighbors.
/3_Imaging/boxFilter boxFilter Fast image box filter using CUDA with OpenGL rendering.
/3_Imaging/convolutionFFT2D convolutionFFT2D This sample demonstrates how 2D convolutions with very large kernel sizes can be efficiently implemented using FFT transformations.
/3_Imaging/convolutionSeparable convolutionSeparable This sample implements a separable convolution filter of a 2D signal with a gaussian kernel.
/3_Imaging/convolutionTexture convolutionTexture Texture-based implementation of a separable 2D convolution with a gaussian kernel.
/3_Imaging/dct8x8 dct8x8 This sample demonstrates how Discrete Cosine Transform (DCT) for blocks of 8 by 8 pixels can be performed using CUDA: a naive implementation by definition and a more traditional approach used in many libraries.
/3_Imaging/dwtHaar1D dwtHaar1D Discrete Haar wavelet decomposition for 1D signals with a length which is a power of 2.
/3_Imaging/dxtc dxtc High-Quality DXT Compression using CUDA. This example shows how to implement an existing computationally-intensive CPU compression algorithm in parallel on the GPU, and obtain an order of magnitude performance improvement.
/3_Imaging/EGLStream_CUDA_CrossGPU EGLStream_CUDA_CrossGPU Demonstrates CUDA and EGL Streams interop, where the consumer's EGL Stream is on one GPU and the producer's is on another, and the consumer and producer are different processes.
/3_Imaging/EGLStreams_CUDA_Interop EGLStreams_CUDA_Interop Demonstrates data exchange between CUDA and EGL Streams.
/3_Imaging/EGLSync_CUDAEvent_Interop EGLSync_CUDAEvent_Interop Demonstrates interoperability between CUDA Event and EGL Sync/EGL Image, which allows GL-EGL-CUDA operations to be synchronized on the GPU itself instead of blocking the CPU for synchronization.
/3_Imaging/histogram histogram This sample demonstrates the efficient implementation of 64-bin and 256-bin histograms.
/3_Imaging/HSOpticalFlow HSOpticalFlow Variational optical flow estimation example. Uses textures for image operations. Shows how a simple PDE solver can be accelerated with CUDA.
/3_Imaging/imageDenoising imageDenoising This sample demonstrates two adaptive image denoising techniques: KNN and NLM, based on the computation of both geometric and color distance between texels.
/3_Imaging/postProcessGL postProcessGL This sample shows how to post-process an image rendered in OpenGL using CUDA.
/3_Imaging/recursiveGaussian recursiveGaussian This sample implements a Gaussian blur using Deriche's recursive method.
/3_Imaging/simpleCUDA2GL simpleCUDA2GL This sample shows how to copy a CUDA image back to OpenGL using the most efficient methods.
/3_Imaging/SobelFilter SobelFilter This sample implements the Sobel edge detection filter for 8-bit monochrome images.
/3_Imaging/stereoDisparity stereoDisparity A CUDA program that demonstrates how to compute a stereo disparity map using SIMD SAD (Sum of Absolute Difference) intrinsics.

Finance Samples

Path Sample Description
/4_Finance/binomialOptions binomialOptions This sample evaluates fair call price for a given set of European options under the binomial model.
/4_Finance/BlackScholes BlackScholes This sample evaluates fair call and put prices for a given set of European options by Black-Scholes formula.
/4_Finance/MonteCarloMultiGPU MonteCarloMultiGPU This sample evaluates fair call price for a given set of European options using the Monte Carlo approach, taking advantage of all CUDA-capable GPUs installed in the system.
/4_Finance/quasirandomGenerator quasirandomGenerator This sample implements Niederreiter Quasirandom Sequence Generator and Inverse Cumulative Normal Distribution functions for the generation of Standard Normal Distributions.
/4_Finance/SobolQRNG SobolQRNG This sample implements Sobol Quasirandom Sequence Generator.

Simulations Samples

Path Sample Description
/5_Simulations/fluidsGL fluidsGL An example of fluid simulation using CUDA and CUFFT, with OpenGL rendering.
/5_Simulations/fluidsGLES fluidsGLES An example of fluid simulation using CUDA and CUFFT, with OpenGL ES rendering.
/5_Simulations/nbody nbody This sample demonstrates the efficient all-pairs simulation of a gravitational n-body simulation in CUDA.
/5_Simulations/nbody_opengles nbody_opengles This sample demonstrates the efficient all-pairs simulation of a gravitational n-body simulation in CUDA. Unlike the OpenGL nbody sample, there is no user interaction.
/5_Simulations/oceanFFT oceanFFT This sample simulates an Ocean height field using CUFFT Library and renders the result using OpenGL.
/5_Simulations/particles particles This sample uses CUDA to simulate and visualize a large set of particles and their physical interaction. Adding "-particles=<N>" to the command line sets the number of particles for the simulation.
/5_Simulations/smokeParticles smokeParticles Smoke simulation with volumetric shadows using half-angle slicing technique.


Advanced Samples

Path Sample Description
/6_Advanced/alignedTypes alignedTypes A simple test, showing huge access speed gap between aligned and misaligned structures.
/6_Advanced/cdpAdvancedQuicksort cdpAdvancedQuicksort This sample demonstrates an advanced quicksort implemented using CUDA Dynamic Parallelism.
/6_Advanced/cdpBezierTessellation cdpBezierTessellation This sample demonstrates bezier tessellation of lines implemented using CUDA Dynamic Parallelism.
/6_Advanced/cdpQuadtree cdpQuadtree This sample demonstrates Quad Trees implemented using CUDA Dynamic Parallelism.
/6_Advanced/concurrentKernels concurrentKernels This sample demonstrates the use of CUDA streams for concurrent execution of several kernels on devices of compute capability 2.0 or higher. Devices of compute capability 1.x will run the kernels sequentially.
/6_Advanced/eigenvalues eigenvalues This sample demonstrates a parallel implementation of a bisection algorithm for the computation of all eigenvalues of a tridiagonal symmetric matrix of arbitrary size with CUDA.
/6_Advanced/fastWalshTransform fastWalshTransform Naturally (Hadamard)-ordered Fast Walsh Transform for batched vectors of arbitrary eligible lengths (powers of two).
/6_Advanced/FDTD3d FDTD3d This sample applies a finite differences time domain progression stencil on a 3D surface.
/6_Advanced/FunctionPointers FunctionPointers This sample illustrates how to use function pointers and implements the Sobel Edge Detection filter for 8-bit monochrome images.
/6_Advanced/interval interval Interval arithmetic operators example.
/6_Advanced/lineOfSight lineOfSight This sample is an implementation of a simple line-of-sight algorithm: Given a height map and a ray originating at some observation point, it computes all the points along the ray that are visible from the observation point.
/6_Advanced/matrixMulDynlinkJIT matrixMulDynlinkJIT This sample revisits matrix multiplication using the CUDA driver API. It demonstrates how to link to CUDA driver at runtime and how to use JIT (just-in-time) compilation from PTX code.
/6_Advanced/mergeSort mergeSort This sample implements a merge sort (also known as Batcher's sort), an algorithm belonging to the class of sorting networks.
/6_Advanced/newdelete newdelete This sample demonstrates dynamic global memory allocation through device C++ new and delete operators and virtual function declarations available with CUDA 4.0.
/6_Advanced/ptxjit ptxjit This sample uses the Driver API to just-in-time compile (JIT) a Kernel from PTX code. Additionally, this sample demonstrates the seamless interoperability capability of the CUDA Runtime and CUDA Driver API calls.
/6_Advanced/radixSortThrust radixSortThrust This sample demonstrates a very fast and efficient parallel radix sort that uses the Thrust library. The included RadixSort class can sort either key-value pairs (with a float or unsigned integer keys) or keys only.
/6_Advanced/reduction reduction A parallel sum reduction that computes the sum of a large array of values.
/6_Advanced/scalarProd scalarProd This sample calculates scalar products of a given set of input vector pairs.
/6_Advanced/scan scan This example demonstrates an efficient CUDA implementation of parallel prefix sum, also known as "scan". Given an array of numbers, scan computes a new array in which each element is the sum of all the elements before it in the input array.
/6_Advanced/segmentationTreeThrust segmentationTreeThrust This sample demonstrates an approach to constructing image segmentation trees. This method is based on Boruvka's MST algorithm.
/6_Advanced/shfl_scan shfl_scan This example demonstrates how to use the shuffle intrinsic __shfl_up to perform a scan operation across a thread block.
/6_Advanced/simpleHyperQ simpleHyperQ This sample demonstrates the use of CUDA streams for concurrent execution of several kernels on devices that provide HyperQ (SM 3.5). Devices without HyperQ (SM 2.0 and SM 3.0) will run a maximum of two kernels concurrently.
/6_Advanced/sortingNetworks sortingNetworks This sample implements bitonic sort and odd-even merge sort (also known as Batcher's sort), algorithms belonging to the class of sorting networks. On large sequences they are generally less efficient than algorithms with better asymptotic complexity (e.g. merge sort or radix sort).
/6_Advanced/threadFenceReduction threadFenceReduction This sample shows how to perform a reduction operation on an array of values using the thread Fence intrinsic to produce a single value in a single kernel.
/6_Advanced/threadMigration threadMigration Simple program illustrating how to use the CUDA Context Management API together with the new CUDA 4.0 parameter passing and CUDA launch API. CUDA contexts can be created separately and attached independently to different threads.
/6_Advanced/transpose transpose This sample demonstrates Matrix Transpose.
/6_Advanced/warpAggregatedAtomicsCG warpAggregatedAtomicsCG This sample demonstrates how to use Cooperative Groups (CG) to perform warp aggregated atomics, a useful technique to improve performance when many threads atomically add to a single counter.


CUDALibraries Samples

Path Sample Description
/7_CUDALibraries/batchCUBLAS batchCUBLAS A CUDA Sample that demonstrates how to use batched CUBLAS API calls to improve overall performance.
/7_CUDALibraries/BiCGStab BiCGStab A CUDA Sample that demonstrates Bi-Conjugate Gradient Stabilized (BiCGStab) iterative method for nonsymmetric and symmetric positive definite (s.p.d.) linear systems using CUSPARSE and CUBLAS.
/7_CUDALibraries/boundSegmentsNPP boundSegmentsNPP An NPP CUDA Sample that demonstrates using nppiLabelMarkers to generate connected region segment labels in an 8-bit grayscale image then compressing the sparse list of generated labels into the minimum number of uniquely labeled regions in the image using nppiCompressMarkerLabels. Finally, a boundary is added surrounding each segmented region in the image using nppiBoundSegments.
/7_CUDALibraries/boxFilterNPP boxFilterNPP An NPP CUDA Sample that demonstrates how to use the NPP FilterBox function to perform a Box Filter.
/7_CUDALibraries/cannyEdgeDetectorNPP cannyEdgeDetectorNPP An NPP CUDA Sample that demonstrates the recommended parameters to use with the nppiFilterCannyBorder_8u_C1R Canny Edge Detection image filter function.
/7_CUDALibraries/conjugateGradient conjugateGradient This sample implements a conjugate gradient solver on the GPU using the CUBLAS and CUSPARSE libraries.
/7_CUDALibraries/cuSolverDn_LinearSolver cuSolverDn_LinearSolver A CUDA Sample that demonstrates cuSolverDN's LU, QR, and Cholesky factorization.
/7_CUDALibraries/cuSolverRf cuSolverRf A CUDA Sample that demonstrates cuSolver's refactorization library - CUSOLVERRF.
/7_CUDALibraries/cuSolverSp_LinearSolver cuSolverSp_LinearSolver A CUDA Sample that demonstrates cuSolverSP's LU, QR, and Cholesky factorization.
/7_CUDALibraries/cuSolverSp_LowlevelCholesky cuSolverSp_LowlevelCholesky A CUDA Sample that demonstrates Cholesky factorization using cuSolverSP's low-level APIs.
/7_CUDALibraries/cuSolverSp_LowlevelQR cuSolverSp_LowlevelQR A CUDA Sample that demonstrates QR factorization using cuSolverSP's low-level APIs.
/7_CUDALibraries/FilterBorderControlNPP FilterBorderControlNPP This NPP CUDA Sample demonstrates how any border version of an NPP filtering function can be used in the most common mode (with border control enabled), can be used to duplicate the results of the equivalent non-border version of the NPP function, and can be used to enable and disable border control on various source image edges depending on what portion of the source image is being used as input.
/7_CUDALibraries/freeImageInteropNPP freeImageInteropNPP A simple CUDA Sample demonstrating how to use the FreeImage library with NPP.
/7_CUDALibraries/histEqualizationNPP histEqualizationNPP This CUDA Sample demonstrates how to use NPP for histogram equalization for image data.
/7_CUDALibraries/jpegNPP jpegNPP This sample demonstrates a simple image processing pipeline. First, a JPEG file is Huffman decoded and inverse DCT transformed and dequantized. Then the different planes are resized. Finally, the resized image is quantized, forward DCT transformed and Huffman encoded.
/7_CUDALibraries/MC_EstimatePiInlineP MC_EstimatePiInlineP This sample uses Monte Carlo simulation for Estimation of Pi (using inline PRNG). This sample also uses the NVIDIA CURAND library.
/7_CUDALibraries/MC_EstimatePiInlineQ MC_EstimatePiInlineQ This sample uses Monte Carlo simulation for Estimation of Pi (using inline QRNG). This sample also uses the NVIDIA CURAND library.
/7_CUDALibraries/MC_EstimatePiP MC_EstimatePiP This sample uses Monte Carlo simulation for Estimation of Pi (using batch PRNG). This sample also uses the NVIDIA CURAND library.
/7_CUDALibraries/MC_EstimatePiQ MC_EstimatePiQ This sample uses Monte Carlo simulation for Estimation of Pi (using batch QRNG). This sample also uses the NVIDIA CURAND library.
/7_CUDALibraries/MC_SingleAsianOptionP MC_SingleAsianOptionP This sample uses Monte Carlo to simulate Single Asian Options using the NVIDIA CURAND library.
/7_CUDALibraries/MersenneTwisterGP11213 MersenneTwisterGP11213 This sample demonstrates the Mersenne Twister random number generator GP11213 in cuRAND.
/7_CUDALibraries/randomFog randomFog This sample illustrates pseudo- and quasi- random numbers produced by CURAND.
/7_CUDALibraries/simpleCUBLAS simpleCUBLAS Example of using CUBLAS through the new CUBLAS API available in CUDA 4.0.
/7_CUDALibraries/simpleCUBLASXT simpleCUBLASXT Example of using CUBLAS-XT library.
/7_CUDALibraries/simpleCUFFT simpleCUFFT Example of using CUFFT. In this example, CUFFT is used to compute the 1D-convolution of some signal with some filter by transforming both into the frequency domain, multiplying them together, and transforming the signal back to the time domain.
/7_CUDALibraries/simpleCUFFT_2d_MGPU simpleCUFFT_2d_MGPU Example of using CUFFT. In this example, CUFFT is used to compute the 2D-convolution of some signal with some filter by transforming both into the frequency domain, multiplying them together, and transforming the signal back to the time domain on multiple GPUs.
/7_CUDALibraries/simpleCUFFT_MGPU simpleCUFFT_MGPU Example of using CUFFT. In this example, CUFFT is used to compute the 1D-convolution of some signal with some filter by transforming both into the frequency domain, multiplying them together, and transforming the signal back to the time domain on multiple GPUs.

For more information about CUDA, go to: Xavier/JetPack_4.1/Components/Cuda


