Function approximation for fine optimisations in CUDA
RidgeRun CUDA Optimisation Guide  

GPU bound: use approximations
Standard math operations consume more resources than you might expect. If the application allows you to use approximations, go ahead.
For reference on the cost of each math operation, please have a look at the performance guidelines chapter of the CUDA Programming Guide.
Performance hints: Focus your approximation efforts on recurrent operations. Do not waste time on operations that are not intensive.
So, the first approach is to use the intrinsic/SIMD functions available in CUDA.
However, there is an even more powerful tool.
Let's suppose that your function is something like a combination of powf(x, y) and log2f(x), with the argument of the power cast and scaled to floating point. Supposing that x and y are expressed as 8-bit integers, take into account the following questions:
- Do you really need powf(x, y) here?
- Do you really need log2f(x) here?
- Do you really need to cast and scale the argument of the power?

The answer to each of these questions is no (except for the second one). Some optimisation hints:
- powf(x, y) is too general. Power computations are often iterative, and if the argument is large you may burn too many clock cycles; use an approximant for it.
- In general terms, the log2f(x) can be preserved. Just replace it with the __log2f(x) intrinsic if possible.
- Casting and scaling on every call may be a waste of time. Fold the scaling into your approximant.
So, what if the implementation looks like a single call f̂(x, y), where f̂ is the approximant?
The trick is how to compute the approximant. There are three answers to that:

- Use Maclaurin series: Taylor is often used; center the expansion where you need it.
- Use Padé approximants: choose the degree that fulfills your requirements.
- (My favorite): use a lookup table (LUT) plus polynomial interpolation.
The common approach is to go for a LUT or Padé, depending on the problem. For simplicity, the LUT is one of the safest paths for continuous and well-behaved functions.
The following diagram illustrates an interpolation process by using a LUT and linear interpolation:
The results can be adequate with just simple linear interpolation. You will need to verify your errors after any numerical approximation. The following are knobs you can play with:
- The number of elements in the LUT. Try to minimise divisions and multiplications in the index computation and exploit bit shifting: make the table size a power of two.
- The degree of the interpolation polynomial. You can go from linear up to third degree; after that, consider using Padé.
After a numerical approximation, verify your function in isolation over the domain provided by the application. Additionally, compute the mean error, the variance, and the error at the corners. Inform the customer about these approximations.
Performance hints: Math functions from the LIBC/STL libraries are often over-dimensioned to cover the general case. An optimal implementation will use functions with better behaviour over the domain of the application.
With respect to the last performance hint, and if you are interested in scientific-software esoterica, have a look at faster versions of the logarithm in single and double precision: implementations of the logarithm in base 2, base e, and base 10 over a limited domain.