Inlining for Fine optimisations in CUDA Optimisation

From RidgeRun Developer Wiki
Revision as of 03:21, 4 March 2023 by Spalli (talk | contribs)




Previous: Optimisation Recipes/Fine optimisations/Condition and loops replacement Index Next: Optimisation Recipes/Common pitfalls when optimising





GPU bound: inlining

Inlining is another optimisation technique. Consider it when a kernel calls another function many times. Each function invocation can carry overhead, such as saving and restoring state and jumping to a different region of code, and this overhead degrades performance as the calls accumulate.

However, inlining can also degrade performance in some cases. Therefore, it is usually adequate to trust the compiler's inlining heuristic, which weighs the potential performance gain from eliminating call overhead against compile time. Aggressive inlining may lead to much larger code. It may also affect resource utilisation, such as register usage, which can negatively affect the deployed code.

In our experience, the instances in which it makes sense to override the compiler's inlining heuristic are rare. We have used __noinline__ to limit code size and thus reduce excessive compile times. As far as we are aware, the use of __noinline__ has no predictable effect on register pressure: inlining may allow more aggressive code movement, such as load scheduling, which may increase register pressure, while not inlining may increase register pressure due to ABI restrictions on the use of registers. We have never found a case where the use of __noinline__ improved performance, but such cases could exist, possibly due to instruction cache effects.
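Since the effect on register pressure is hard to predict, one option is to measure it. The following is a minimal sketch, assuming two hypothetical kernels (kernel_with_call and kernel_inlined) that differ only in the inlining qualifier of their helper. It queries the per-thread register count and local memory size of each kernel with cudaFuncGetAttributes. The kernel names, the polynomial helper, and the KernelResources struct are illustrative and not part of the original recipe.

```cuda
#include <cuda_runtime.h>
#include <cstddef>

// Hypothetical helper: Horner evaluation of a small polynomial, kept out of line.
__device__ __noinline__ float poly_call(float x) {
    return ((0.5f * x + 1.0f) * x + 2.0f) * x + 3.0f;
}

// Same body, but always inlined into the caller.
__device__ __forceinline__ float poly_inline(float x) {
    return ((0.5f * x + 1.0f) * x + 2.0f) * x + 3.0f;
}

// Kernel that calls the non-inlined helper repeatedly.
__global__ void kernel_with_call(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float acc = in[i];
        for (int k = 0; k < 16; ++k) acc = poly_call(acc);
        out[i] = acc;
    }
}

// Identical kernel, except that the helper is force-inlined.
__global__ void kernel_inlined(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float acc = in[i];
        for (int k = 0; k < 16; ++k) acc = poly_inline(acc);
        out[i] = acc;
    }
}

// Illustrative container for the two numbers of interest.
struct KernelResources {
    bool ok;                            // false if the query failed
    int registers_per_thread;           // from cudaFuncAttributes::numRegs
    std::size_t local_bytes_per_thread; // from cudaFuncAttributes::localSizeBytes
};

using KernelFn = void (*)(const float*, float*, int);

// Query the compiled resource usage of a kernel without launching it.
KernelResources query_resources(KernelFn kernel) {
    KernelResources result{false, 0, 0};
    cudaFuncAttributes attr{};
    if (cudaFuncGetAttributes(&attr, kernel) == cudaSuccess) {
        result.ok = true;
        result.registers_per_thread = attr.numRegs;
        result.local_bytes_per_thread = attr.localSizeBytes;
    }
    return result;
}
```

Calling query_resources(kernel_with_call) and query_resources(kernel_inlined) and comparing the two results shows which variant uses more registers for a given architecture and compiler version. The outcome can go either way, so no expected numbers are given here. The same information is printed at build time when compiling with nvcc -Xptxas -v.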

In general, you can guide the compiler's inlining decisions with the following function qualifiers:

  • The __noinline__ function qualifier hints that the compiler should not inline the function, if possible.
  • The __forceinline__ function qualifier forces the compiler to inline the function.
  • The __noinline__ and __forceinline__ function qualifiers cannot be used together, and neither can be applied to an inline function.
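The qualifiers are placed on the function declaration. The sketch below is only an illustration of the syntax: the helper functions (scale_noinline, scale_forceinline, scale_default), the kernel, and the host routine run_scale_demo are hypothetical names invented for this example.

```cuda
#include <cuda_runtime.h>
#include <cstddef>
#include <vector>

// Hint: ask the compiler not to inline this function, if possible.
__device__ __noinline__ float scale_noinline(float x, float a) {
    return a * x;
}

// Force the compiler to inline this function into its callers.
__device__ __forceinline__ float scale_forceinline(float x, float a) {
    return a * x;
}

// No qualifier: the compiler's heuristic decides whether to inline.
__device__ float scale_default(float x, float a) {
    return a * x;
}

// Note: combining __noinline__ and __forceinline__ on one function is not allowed.

__global__ void scale_kernel(const float* in, float* out, float a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // All three helpers compute a * x; only the inlining behaviour differs.
        out[i] = scale_noinline(in[i], a)
               + scale_forceinline(in[i], a)
               + scale_default(in[i], a);
    }
}

// Host routine: runs the kernel on n elements (assumes n > 0) and
// returns true if every output equals 3 * a * input.
bool run_scale_demo(int n) {
    std::vector<float> host_in(n), host_out(n, 0.0f);
    for (int i = 0; i < n; ++i) host_in[i] = static_cast<float>(i);

    const std::size_t bytes = static_cast<std::size_t>(n) * sizeof(float);
    float* dev_in = nullptr;
    float* dev_out = nullptr;
    if (cudaMalloc(&dev_in, bytes) != cudaSuccess) return false;
    if (cudaMalloc(&dev_out, bytes) != cudaSuccess) {
        cudaFree(dev_in);
        return false;
    }

    cudaMemcpy(dev_in, host_in.data(), bytes, cudaMemcpyHostToDevice);

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    scale_kernel<<<blocks, threads>>>(dev_in, dev_out, 2.0f, n);

    // The blocking copy also waits for the kernel to finish.
    cudaError_t status =
        cudaMemcpy(host_out.data(), dev_out, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dev_in);
    cudaFree(dev_out);
    if (status != cudaSuccess) return false;

    for (int i = 0; i < n; ++i) {
        if (host_out[i] != 6.0f * host_in[i]) return false;
    }
    return true;
}
```

The qualifiers change only how the code is generated, not what it computes, so the three helpers return the same value. Whether any variant is faster depends on the kernel and should be measured rather than assumed.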

By default, the compiler inlines any __device__ function when deemed appropriate.

