NVIDIA GPU Optimisation Tool CUDA Profiler

From RidgeRun Developer Wiki
Revision as of 03:12, 4 March 2023 by Spalli (talk | contribs)




Previous: Tools/CUDA Memcheck Index Next: Tools/ Computational Budget Tool





CUDA Profiler

Depending on your setup, Nsight may be the most useful tool, since it integrates a user interface and guides the developer through the analysis process. In fact, NVIDIA recommends Nsight for profiling. You can find more information in the User manual for NVIDIA profiling tools for optimizing the performance of CUDA applications.

Since most of RidgeRun's work is on Tegra, this section focuses on nvprof first and then on Nsight. Feel free to learn how to use Nsight properly and update this wiki later :D

nvprof can be as easy as running:

nvprof ./application

By default, it reports the API calls made by the application and how much time each kernel consumes.
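Beyond the default summary, nvprof can also print per-invocation traces. A minimal sketch of the common modes (`./application` is a placeholder for your binary):

```shell
# Default summary mode: aggregated time per kernel and per API call
nvprof ./application

# GPU trace: one line per kernel launch and memory copy, in timeline order
nvprof --print-gpu-trace ./application

# API trace: every CUDA runtime/driver call as it happens
nvprof --print-api-trace ./application
```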

For example:

$ nvprof matrixMul
[Matrix Multiply Using CUDA] - Starting...
==27694== NVPROF is profiling process 27694, command: matrixMul
GPU Device 0: "GeForce GT 640M LE" with compute capability 3.0

MatrixA(320,320), MatrixB(640,320)
Computing result using CUDA Kernel...
done
Performance= 35.35 GFlop/s, Time= 3.708 msec, Size= 131072000 Ops, WorkgroupSize= 1024 threads/block
Checking computed result for correctness: OK

Note: For peak performance, please refer to the matrixMulCUBLAS example.
==27694== Profiling application: matrixMul
==27694== Profiling result:
Time(%)      Time     Calls       Avg       Min       Max  Name
 99.94%  1.11524s       301  3.7051ms  3.6928ms  3.7174ms  void matrixMulCUDA<int=32>(float*, float*, float*, int, int)
  0.04%  406.30us         2  203.15us  136.13us  270.18us  [CUDA memcpy HtoD]
  0.02%  248.29us         1  248.29us  248.29us  248.29us  [CUDA memcpy DtoH]

==27694== API calls:
Time(%)      Time     Calls       Avg       Min       Max  Name
 49.81%  285.17ms         3  95.055ms  153.32us  284.86ms  cudaMalloc
 25.95%  148.57ms         1  148.57ms  148.57ms  148.57ms  cudaEventSynchronize
 22.23%  127.28ms         1  127.28ms  127.28ms  127.28ms  cudaDeviceReset
  1.33%  7.6314ms       301  25.353us  23.551us  143.98us  cudaLaunch
  0.25%  1.4343ms         3  478.09us  155.84us  984.38us  cudaMemcpy
  0.11%  601.45us         1  601.45us  601.45us  601.45us  cudaDeviceSynchronize
  0.10%  564.48us      1505     375ns     313ns  3.6790us  cudaSetupArgument
  0.09%  490.44us        76  6.4530us     307ns  221.93us  cuDeviceGetAttribute
  0.07%  406.61us         3  135.54us  115.07us  169.99us  cudaFree
  0.02%  143.00us       301     475ns     431ns  2.4370us  cudaConfigureCall
  0.01%  42.321us         1  42.321us  42.321us  42.321us  cuDeviceTotalMem
  0.01%  33.655us         1  33.655us  33.655us  33.655us  cudaGetDeviceProperties
  0.01%  31.900us         1  31.900us  31.900us  31.900us  cuDeviceGetName
  0.00%  21.874us         2  10.937us  8.5850us  13.289us  cudaEventRecord
  0.00%  16.513us         2  8.2560us  2.6240us  13.889us  cudaEventCreate
  0.00%  13.091us         1  13.091us  13.091us  13.091us  cudaEventElapsedTime
  0.00%  8.1410us         1  8.1410us  8.1410us  8.1410us  cudaGetDevice
  0.00%  2.6290us         2  1.3140us     509ns  2.1200us  cuDeviceGetCount
  0.00%  1.9970us         2     998ns     520ns  1.4770us  cuDeviceGet

This summary gives you a first hint about what is taking too long to complete. If a kernel dominates the execution time, the track to follow is lowering the kernel execution time. If kernel execution accounts for less than 80% of the time and communication takes the rest, you should follow the I/O optimisation path. If most of the time is consumed by synchronisation, you should apply memory/I/O optimisations in order to get rid of the synchronisation barriers. You will find more details in the following sections.
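When the kernel itself dominates, nvprof can collect hardware metrics that help explain why it is slow. A sketch using two commonly available metrics (metric names vary per GPU; list the ones your device supports with `nvprof --query-metrics`):

```shell
# Achieved occupancy and global-load efficiency for every kernel.
# Note: metric collection replays kernels, so the profiled run is slower.
nvprof --metrics achieved_occupancy,gld_efficiency ./application
```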

The information presented by nvprof may not be enough for your purposes, and you may need a higher level of detail. For that purpose, you can use the Visual Profiler or Nsight. Either tool produces a timeline similar to the following:


Timeline presented by Visual Profiler and Nsight. In the picture, it is possible to see the kernels executed, their duration, API calls, and streams


For remote data collection (for example, when you are far away from your hardware), you can use nvprof to collect the data:

nvprof -o output.%h ${APP} ${APP_OPTIONS}

It will generate an output file that is importable with Visual Profiler:

1. Open the Visual Profiler

nvvp -vm /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java

2. Go to File > Import. Select nvprof.

3. Select Single Process

4. Load the output file in Timeline datafile

5. Click Finish
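For the guided analysis stages in the Visual Profiler to work on an imported session, the data can be collected with analysis metrics included. A sketch, assuming the documented `--analysis-metrics` option and `%h` hostname placeholder; the file names are illustrative:

```shell
# Timeline only (fast):
nvprof -o timeline.%h.nvvp ./application

# Timeline plus the metrics needed by nvvp's guided analysis
# (slower, since kernels are replayed):
nvprof --analysis-metrics -o analysis.%h.nvvp ./application
```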

A similar process is followed for the multi-process case. Let's say that the customer needs to execute the pipeline five times: it is possible to model this behavior by using MPI. Please refer to the corresponding section of the documentation for more information.
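The multi-process case can be sketched as follows, assuming mpirun is available and relying on the documented output-file placeholders (`%h` expands to the hostname and `%p` to the process ID, so each rank writes its own file):

```shell
# Profile five MPI ranks; each rank produces its own importable file
mpirun -np 5 nvprof -o output.%h.%p.nvvp ./application
```

When importing into the Visual Profiler, choose Multiple Processes instead of Single Process and load all the generated files.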

From the timeline, the following details should at least be analyzed:

  • GPU general occupation: is it mostly busy executing kernels?
  • API calls: synchronisation calls, memory copies, and their durations
  • Streams: are the kernels being executed in parallel?
  • Overlapping: is the communication being overlapped by the computation?
  • Synchronisation: is there a symptom of over-synchronisation?
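To back the synchronisation question with numbers, nvprof's dependency analysis can help: it reports how long API calls block and which activities sit on the critical path. A sketch using the documented `--dependency-analysis` option:

```shell
# Collect a trace and run dependency analysis on it
nvprof --dependency-analysis ./application
```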

Some conclusions that can be retrieved from the timeline presented above:

  • The application is currently GPU bound. The CUDA HW occupation is more than 95%
  • There are three kernels. Is it possible to collapse operations into a single kernel?
  • There are color space conversions. Using the GPU for these tasks is often a waste of resources; it is better to offload them to the VIC. The GPU can be slower for them because of clocking reasons.

