Condition and loops replacement for Fine optimisations in CUDA Optimisation

From RidgeRun Developer Wiki









GPU bound: removing if conditions / loops

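Conditionals, which include both if statements and loops, are a major source of thread divergence. When threads of the same warp take different paths, the GPU executes each path one after the other, so a two-way branch can increase the walltime of a thread by up to a factor of two.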

One way to mitigate conditionals is to use masking or the ternary operator.

For example, to create an identity matrix, you may use:

/* Naïve - Create an identity */
for (int i = 0; i < N; ++i)
  for (int j = 0; j < N; ++j)
    if (i == j)
      matrix[i][j] = 1;
    else
      matrix[i][j] = 0;

The optimised version would look like:

/* Optimised 1 - Create an identity */
for (int i = 0; i < N; ++i)
  for (int j = 0; j < N; ++j)
    matrix[i][j] = i == j ? 1 : 0;

/* or Optimised 2 - Create an identity */
for (int i = 0; i < N; ++i)
  for (int j = 0; j < N; ++j)
    matrix[i][j] = i == j;

Note that the explicit if/else no longer exists. Whether it is better to assign the result of the comparison directly or to use the ternary operator depends on the application.
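The masking approach mentioned above can be sketched as follows. This is a minimal illustration in plain C++, not code from the original recipe: the helper names `select_masked` and `identity_element` are hypothetical, and the values are assumed to be 32-bit unsigned integers.

```cpp
#include <cstdint>

// Hypothetical helper: returns a when cond is true and b otherwise,
// using a bit mask instead of an if/else.
inline std::uint32_t select_masked(bool cond, std::uint32_t a, std::uint32_t b) {
  // cond converts to 1 or 0, so mask is all ones (0xFFFFFFFF) or all zeros.
  const std::uint32_t mask = 0u - static_cast<std::uint32_t>(cond);
  return (a & mask) | (b & ~mask);
}

// Hypothetical helper: value of the identity matrix at (row, col),
// that is, what a single thread would write for its own element.
inline std::uint32_t identity_element(int row, int col) {
  return select_masked(row == col, 1u, 0u);
}
```

Both operands are always evaluated, so this form suits cases where computing `a` and `b` is cheap and free of side effects.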

Another version of the identity example takes advantage of bulk operations such as memset:

/* Optimised 3 - Create an identity */
/* Assumes matrix is a contiguous N x N array */
memset(matrix, 0, N * N * sizeof(matrix[0][0]));
for (int i = 0; i < N; ++i)
  matrix[i][i] = 1;
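As a self-contained sketch of the same idea, the version below assumes the matrix is stored as a flat, row-major buffer of `float`. The function name `set_identity` and the storage layout are assumptions made for illustration.

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical helper: writes an n x n identity matrix into a flat,
// row-major buffer that the caller has already allocated (n * n floats).
inline void set_identity(float* matrix, std::size_t n) {
  // memset works on bytes; all-zero bytes represent 0.0f in IEEE 754.
  std::memset(matrix, 0, n * n * sizeof(float));

  // Element (i, i) lives at flat index i * n + i = i * (n + 1).
  for (std::size_t i = 0; i < n; ++i)
    matrix[i * (n + 1)] = 1.0f;
}
```

Only the n diagonal elements are written individually, and no comparison between row and column indices is needed.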

Now, suppose N is a template parameter or is otherwise known at compile time. You may then get rid of the loop by unrolling it:

/* Optimised 4 - Create an identity */
memset(matrix, 0, N * N * sizeof(matrix[0][0]));

#pragma unroll
for (int i = 0; i < N; ++i)
  matrix[i][i] = 1;

The #pragma unroll directive asks the compiler to unroll the loop at compile time, replacing it with a straight-line sequence of its iterations. In other words, the result is similar to writing:

/* Optimised 4 after unrolling - Create an identity */
memset(matrix, 0, N * N * sizeof(matrix[0][0]));

matrix[0][0] = 1;
matrix[1][1] = 1;
matrix[2][2] = 1;
/* ... */
matrix[N - 2][N - 2] = 1;
matrix[N - 1][N - 1] = 1;

What do you save here? You save the loop overhead: the counter increments, the comparisons, and the jumps. However, take into account that the size of the program (binary) will increase.
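The compile-time case described above can be sketched with a template parameter. This is an illustrative example under stated assumptions: the function name `make_identity` and the flat `std::array` storage are hypothetical, and the pragma is guarded because `#pragma unroll` is recognised by nvcc and clang, while other compilers may warn about it.

```cpp
#include <array>
#include <cstddef>

// Hypothetical helper: N is a template parameter, so the trip count of
// the loop is known at compile time and the compiler is free to unroll it.
template <std::size_t N>
std::array<float, N * N> make_identity() {
  std::array<float, N * N> matrix{};  // value-initialised: all elements 0.0f

#if defined(__CUDACC__) || defined(__clang__)
#pragma unroll
#endif
  for (std::size_t i = 0; i < N; ++i)
    matrix[i * (N + 1)] = 1.0f;  // diagonal element (i, i), row-major

  return matrix;
}
```

A call such as `make_identity<4>()` instantiates the function for N = 4. The pragma is a request rather than a guarantee, so it is worth inspecting the generated code when the unrolling matters.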

Performance hints: avoid if conditions, loops, divisions, and modulus operations as much as possible. Conditions and loops can cause thread divergence, while divisions and modulus operations are computationally expensive, so they often lead to high computation times.
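As an illustration of the hint about divisions and modulus, the sketch below replaces both with a shift and a bitwise AND. It assumes unsigned indices and a divisor that is a power of two. The names `kTileWidth`, `kTileShift`, `tile_row`, and `tile_col` are hypothetical.

```cpp
#include <cstdint>

// Hypothetical constants: a tile width that is a power of two (32 = 2^5).
constexpr std::uint32_t kTileWidth = 32u;
constexpr std::uint32_t kTileShift = 5u;

static_assert((kTileWidth & (kTileWidth - 1u)) == 0u,
              "kTileWidth must be a power of two");
static_assert((1u << kTileShift) == kTileWidth,
              "kTileShift must match kTileWidth");

// Same result as index / kTileWidth for unsigned values.
inline std::uint32_t tile_row(std::uint32_t index) {
  return index >> kTileShift;
}

// Same result as index % kTileWidth for unsigned values.
inline std::uint32_t tile_col(std::uint32_t index) {
  return index & (kTileWidth - 1u);
}
```

Compilers often apply this transformation on their own when the divisor is a compile-time constant, so the manual form is mostly useful when the divisor is only known at run time to be a power of two.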

