Function approximation for Fine optimisations in CUDA Optimisation

GPU bound: use approximations

Standard math operations consume more resources than you can expect. If the application allows you to use approximations, go ahead.

For reference of the consumption of the math operations, please, have a look at the optimisation guide chapter of the Programming Guide.

Performance hints: Put your approximation efforts on recurrent operations. Do not waste time on operations that are not intensive

So, the first approach is using intrinsic / SIMD functions available in CUDA.

However, there is an even more powerful tool.

Let's suppose that your function is something like:

$y({\tilde {x}},{\tilde {y}})=\log _{2}(2^{\frac {R\times {\tilde {x}}}{B}}-2^{\frac {R\times {\tilde {y}}}{B}})$

Supposing that ${\tilde {x}}$ and ${\tilde {y}}$ are expressed in 8-bit integers and ${\frac {R}{B}}\times x\in [-7,1]$ , take into account the following questions:

Do you really need powf(x, y) here?
Do you really need log2f(x) here?
Do you really need to cast scale the argument of the power?

The questions stated above are all negative (except for the second one). Some optimisation hints:

powf(x) is too generalist. The powers are often iterative. If the argument is large, you may end up with too many clock cycles. You can use an approximant for it.
In general terms, the log2f(x) can be preserved. Just replace it by the intrinsic if possible.
Scaling every time may be a waste of time. Include the scaling in your approximant.

So. What if the implementation looks like:

$y({\tilde {x}},{\tilde {y}})=\log _{2}({\text{pow2a}}({\tilde {x}})-{\text{pow2a}}({\tilde {y}}))$

where ${\text{pow2a}}(\cdot )$ is the approximant.

The trick is: how to compute the approximant.

There are three answers for it:

Use Maclaurin series: Taylor is often used. Center it where you need
Use Padè approximants: choose the degree that fulfills your requirements
(My favorite): use LUT and polynomial interpolation

The common approach is to go for LUT or Padè, depending on the problem. For simplicity, the LUT is one of the safest paths for continuous and well-behaved functions.

The following diagram illustrates an interpolation process by using a LUT and linear interpolation:

Linear interpolation for computing (approximant) for the power of two within a limited domain

The results can be adequate with just a simple linear interpolation. You may need to verify your errors after any numerical approximation. The following are knobs you can play with:

Number of elements of the LUT. Try to minimise the usage of divisions, multiplications, and exploit bit shifting. Try to make it a multiple of 2.
The degree of the polynomial. You can go from linear to third grade. After that, consider using Padè.

After a numerical approximation, try to verify your function isolated with the domain provided by the application. Additionally, compute the mean error, the variance, and corners. Inform the customer about these approximations.

Performance hints: Math functions from the LIBC/STL library are often over-dimensioned to cover most of the cases. An optimal implementation will use functions with better responses in the domain of the application.

With respect to the last performance hint and if you are interested in scientific software esoterism, have a look at faster versions of logarithms in single and double precision. It implements the logarithm in base 2, base e, and base 10 in a limited domain.

Previous: Optimisation Recipes/Fine optimisations/Increase arithmetic intensity

Index

Next: Optimisation Recipes/Fine optimisations/Condition and loops replacement

❯