CUDA Profiler
Depending on your setup, Nsight may be the most convenient option, since it integrates a user interface and guides the developer through the analysis process. In fact, NVIDIA recommends Nsight for profiling. You can find more information in the User manual for NVIDIA profiling tools for optimizing the performance of CUDA applications.
Since most of RidgeRun's work is on Tegra, this page focuses on nvprof first and on Nsight afterwards. Feel free to learn how to use Nsight properly and update this wiki later :D
nvprof can be as easy as running:
nvprof ./application
By default, it reports the CUDA API calls and how much time each kernel consumes.
For example:
$ nvprof matrixMul
[Matrix Multiply Using CUDA] - Starting...
==27694== NVPROF is profiling process 27694, command: matrixMul
GPU Device 0: "GeForce GT 640M LE" with compute capability 3.0

MatrixA(320,320), MatrixB(640,320)
Computing result using CUDA Kernel...
done
Performance= 35.35 GFlop/s, Time= 3.708 msec, Size= 131072000 Ops, WorkgroupSize= 1024 threads/block
Checking computed result for correctness: OK

Note: For peak performance, please refer to the matrixMulCUBLAS example.
==27694== Profiling application: matrixMul
==27694== Profiling result:
Time(%)      Time     Calls       Avg       Min       Max  Name
 99.94%  1.11524s       301  3.7051ms  3.6928ms  3.7174ms  void matrixMulCUDA<int=32>(float*, float*, float*, int, int)
  0.04%  406.30us         2  203.15us  136.13us  270.18us  [CUDA memcpy HtoD]
  0.02%  248.29us         1  248.29us  248.29us  248.29us  [CUDA memcpy DtoH]

==27694== API calls:
Time(%)      Time     Calls       Avg       Min       Max  Name
 49.81%  285.17ms         3  95.055ms  153.32us  284.86ms  cudaMalloc
 25.95%  148.57ms         1  148.57ms  148.57ms  148.57ms  cudaEventSynchronize
 22.23%  127.28ms         1  127.28ms  127.28ms  127.28ms  cudaDeviceReset
  1.33%  7.6314ms       301  25.353us  23.551us  143.98us  cudaLaunch
  0.25%  1.4343ms         3  478.09us  155.84us  984.38us  cudaMemcpy
  0.11%  601.45us         1  601.45us  601.45us  601.45us  cudaDeviceSynchronize
  0.10%  564.48us      1505     375ns     313ns  3.6790us  cudaSetupArgument
  0.09%  490.44us        76  6.4530us     307ns  221.93us  cuDeviceGetAttribute
  0.07%  406.61us         3  135.54us  115.07us  169.99us  cudaFree
  0.02%  143.00us       301     475ns     431ns  2.4370us  cudaConfigureCall
  0.01%  42.321us         1  42.321us  42.321us  42.321us  cuDeviceTotalMem
  0.01%  33.655us         1  33.655us  33.655us  33.655us  cudaGetDeviceProperties
  0.01%  31.900us         1  31.900us  31.900us  31.900us  cuDeviceGetName
  0.00%  21.874us         2  10.937us  8.5850us  13.289us  cudaEventRecord
  0.00%  16.513us         2  8.2560us  2.6240us  13.889us  cudaEventCreate
  0.00%  13.091us         1  13.091us  13.091us  13.091us  cudaEventElapsedTime
  0.00%  8.1410us         1  8.1410us  8.1410us  8.1410us  cudaGetDevice
  0.00%  2.6290us         2  1.3140us     509ns  2.1200us  cuDeviceGetCount
  0.00%  1.9970us         2     998ns     520ns  1.4770us  cuDeviceGet
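If you also need one line per kernel launch and memory copy instead of these aggregates, nvprof offers a GPU-trace mode (--print-gpu-trace is a standard nvprof flag):
nvprof --print-gpu-trace ./application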
Either way, the summary gives you a first hint about what is taking too long to complete. If a kernel dominates the execution time, the path to follow is lowering the kernel execution time. If kernel execution accounts for less than roughly 80% of the time and communication takes the rest, you should follow the I/O optimisation path. If most of the time is consumed by synchronisation, you should apply memory/I/O optimisations to get rid of the synchronisation barriers. You will find more details in the following sections.
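To corroborate the profiler numbers from inside the application, kernel time can be measured with CUDA events. Below is a minimal sketch, assuming a placeholder kernel myKernel that is not part of the example above:

#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel: doubles every element.
__global__ void myKernel(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main() {
    const int n = 1 << 20;
    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Events are recorded on the GPU timeline, so the elapsed time
    // matches the per-kernel GPU time that nvprof reports.
    cudaEventRecord(start);
    myKernel<<<(n + 255) / 256, 256>>>(d_data, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("kernel time: %f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_data);
    return 0;
}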
The information presented by nvprof may not be detailed enough for your purposes. If you need a higher level of detail, you can use the Visual Profiler or Nsight; either tool produces a timeline plot like the one analysed below.
For remote data collection (for example, when you are far away from your hardware), you can use nvprof to collect the data into a file:
nvprof -o output.%h ${APP} ${APP_OPTIONS}
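nvprof expands %h to the hostname, which keeps the output files apart when several machines are profiled. A possible session on the target, assuming a hypothetical workstation reachable as user@workstation:
nvprof -o profile.%h.nvvp ${APP} ${APP_OPTIONS}
scp profile.$(hostname).nvvp user@workstation: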
This generates an output file that can be imported into the Visual Profiler:
1. Open the Visual Profiler
nvvp -vm /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java
2. Go to File > Import and select nvprof
3. Select Single Process
4. Load the output file in the Timeline data file field
5. Click Finish
A similar process is followed for multi-process executions. Suppose, for instance, that the customer needs to execute the pipeline five times in parallel; that behaviour can be modelled using MPI. Please refer to the corresponding section of the documentation for more information.
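For collection in that scenario, nvprof also expands %p to the process ID, so each process writes its own output file (the five-rank mpirun invocation below is illustrative):
mpirun -np 5 nvprof -o output.%h.%p ${APP} ${APP_OPTIONS}
The resulting files can then be loaded through File > Import by choosing Multiple Processes instead of Single Process.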
From the timeline, at least the following details should be analysed:
- Overall GPU occupation: is the GPU mostly busy executing kernels?
- API calls: synchronisation calls, memory copies, and their durations
- Streams: are the kernels being executed in parallel?
- Overlapping: is the communication being overlapped with the computation? (see the streams sketch after this list)
- Synchronisation: are there symptoms of over-synchronisation?
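To make the overlapping check concrete, the sketch below issues each chunk's copies and kernel on its own stream, so the transfer of one chunk can overlap the computation of another. This is a minimal illustration with a placeholder kernel, not code from the profiled application:

#include <cuda_runtime.h>

// Placeholder per-chunk computation.
__global__ void process(float *buf, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) buf[i] += 1.0f;
}

int main() {
    const int chunks = 4;
    const int n = 1 << 20;
    float *h_buf, *d_buf;
    cudaMallocHost(&h_buf, chunks * n * sizeof(float));  // pinned memory, required for truly async copies
    cudaMalloc(&d_buf, chunks * n * sizeof(float));

    cudaStream_t stream[chunks];
    for (int c = 0; c < chunks; c++) cudaStreamCreate(&stream[c]);

    // Copy-in, kernel, and copy-out of each chunk go to the same stream;
    // different streams may overlap, which shows up as overlapping rows
    // in the profiler timeline.
    for (int c = 0; c < chunks; c++) {
        float *h = h_buf + c * n;
        float *d = d_buf + c * n;
        cudaMemcpyAsync(d, h, n * sizeof(float), cudaMemcpyHostToDevice, stream[c]);
        process<<<(n + 255) / 256, 256, 0, stream[c]>>>(d, n);
        cudaMemcpyAsync(h, d, n * sizeof(float), cudaMemcpyDeviceToHost, stream[c]);
    }
    cudaDeviceSynchronize();

    for (int c = 0; c < chunks; c++) cudaStreamDestroy(stream[c]);
    cudaFreeHost(h_buf);
    cudaFree(d_buf);
    return 0;
}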
Some conclusions that can be drawn from the timeline described above:
- The application is currently GPU bound: the CUDA hardware occupation is above 95%
- There are three kernels. Is it possible to collapse the operations into a single kernel? (see the fusion sketch below)
- There are color space conversions. The GPU is often a waste of resources for these tasks and it is better to offload them to the VIC; the GPU can even be slower for clocking reasons.
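As an illustration of collapsing operations, the sketch below fuses two element-wise kernels into one, halving the launch overhead and removing one full read/write round trip through global memory. The kernels are hypothetical placeholders, not the three kernels from the timeline:

#include <cuda_runtime.h>

// Unfused: two launches, each making a full pass over global memory.
__global__ void scale(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= 2.0f;
}
__global__ void offset(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] += 1.0f;
}

// Fused: one launch, one read and one write per element.
__global__ void scale_offset(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = x[i] * 2.0f + 1.0f;
}

int main() {
    const int n = 1 << 20;
    const int blocks = (n + 255) / 256;
    float *d_x;
    cudaMalloc(&d_x, n * sizeof(float));

    scale<<<blocks, 256>>>(d_x, n);         // unfused pipeline: two launches
    offset<<<blocks, 256>>>(d_x, n);

    scale_offset<<<blocks, 256>>>(d_x, n);  // fused: same result in one launch

    cudaDeviceSynchronize();
    cudaFree(d_x);
    return 0;
}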