Increasing arithmetic intensity for Fine optimisations in CUDA Optimisation
Memory bound / GPU bound: increase arithmetic intensity
Arithmetic intensity is the number of operations performed per byte of memory traffic. The idea is to extract as much computation as possible from each byte moved. This is achieved mainly by reducing precision. For example, a kernel that uses double-precision operations can often be switched to single-precision operations. Doing so will:
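- Raise the OPS/s: single-precision throughput is at least twice the double-precision throughput, and the gap is far larger on GPUs with few double-precision units, such as most consumer parts.
- Halve the storage and the memory traffic per element.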
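As a rough illustration of the two points above, the following host-side C++ sketch computes the arithmetic intensity of a hypothetical AXPY-style operation, `y[i] = a * x[i] + y[i]`, for both precisions. The function names (`arithmetic_intensity`, `axpy_intensity`, `buffer_bytes`) and the operation itself are assumptions made for this example and do not come from the guide. The sketch also ignores caches and assumes every element is read and written exactly once.

```cpp
#include <cassert>
#include <cstddef>

// Assumption of this sketch: the usual IEEE 754 sizes (4 and 8 bytes).
static_assert(sizeof(double) == 2 * sizeof(float),
              "sketch assumes double is twice the size of float");

// Arithmetic intensity: operations per byte of memory traffic.
inline double arithmetic_intensity(double operations, double bytes) {
    assert(bytes > 0.0);
    return operations / bytes;
}

// Hypothetical AXPY-style operation, y[i] = a * x[i] + y[i]:
// 2 operations per element (one multiply, one add) and
// 3 memory accesses per element (read x[i], read y[i], write y[i]).
template <typename Real>
double axpy_intensity() {
    const double operations_per_element = 2.0;
    const double bytes_per_element = 3.0 * sizeof(Real);
    return arithmetic_intensity(operations_per_element, bytes_per_element);
}

// Bytes occupied by a buffer of n elements of the given type.
template <typename Real>
std::size_t buffer_bytes(std::size_t n) {
    return n * sizeof(Real);
}
```

With 8-byte doubles, `axpy_intensity<double>()` gives 2 operations over 24 bytes, while `axpy_intensity<float>()` gives 2 operations over 12 bytes. The same work is done on half the traffic, so the intensity doubles.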
However, casting can be costly on the GPU. It is preferable for the whole application to move entirely to float, so that both host and device operations are pure single-precision operations. On the other hand, if the computations require fine precision, this optimisation must be evaluated carefully, with a report of the error it introduces.
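One possible way to produce such an error report is sketched below in host-side C++. The computation is written once as a template, so it runs entirely in one precision without intermediate casts, and the single-precision result is compared with a double-precision reference. The names `sum_of_squares` and `relative_error_float_vs_double`, as well as the input data, are hypothetical and chosen only for illustration. A real evaluation should use the application's own kernels and representative data.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// The same computation for any floating-point type: switching precision
// means changing one template argument, with no casts inside the loop.
template <typename Real>
Real sum_of_squares(const std::vector<Real>& x) {
    Real acc = Real(0);
    for (Real v : x) {
        acc += v * v;
    }
    return acc;
}

// Relative error of the single-precision result, taking the
// double-precision result as the reference.
inline double relative_error_float_vs_double(std::size_t n) {
    assert(n > 0);
    std::vector<double> xd(n);
    std::vector<float> xf(n);
    for (std::size_t i = 0; i < n; ++i) {
        xd[i] = 1.0 / static_cast<double>(i + 1);  // hypothetical input data
        xf[i] = static_cast<float>(xd[i]);         // converted once, up front
    }
    const double reference = sum_of_squares(xd);
    const double approximation = static_cast<double>(sum_of_squares(xf));
    return std::fabs(approximation - reference) / std::fabs(reference);
}
```

Calling `relative_error_float_vs_double(100000)` returns a single figure that can be checked against the tolerance the application can accept. Accumulations such as this one tend to lose small terms in single precision, which is the kind of effect the report should expose.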
Performance hints: Prefer single precision over double precision. If half precision can be used, it is even better.
On the host side, although double precision can perform similarly to single precision, single precision is still preferable because it occupies half the memory. The problem remains two-fold: compute and memory.