.. _omp120:

Transformed generic-mode kernel to SPMD-mode [OMP120]
=====================================================

This optimization remark indicates that the execution strategy for the OpenMP
target offloading kernel was changed. Generic-mode kernels are executed by a
single thread that schedules parallel worker threads using a state machine. This
code transformation can move a kernel that was initially generated in
generic-mode to SPMD-mode, where all threads are active at the same time and no
state machine is needed. SPMD-mode is closer to how threads are actually
executed on a GPU target. The transformation is only possible if the
instructions previously executed by a single thread have no side-effects or can
be guarded. Instructions with no side-effects are simply recomputed by each
thread, while instructions with side-effects are guarded so that only one
thread executes them.
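
As a rough illustration of these two cases, consider the sketch below. The
function ``scale_rows`` and all of its parameters are hypothetical and are not
part of the remark or of any OpenMP runtime API. The sequential part of each
``distribute`` iteration contains arithmetic and a load with no side-effects,
which every thread could recompute, and one store to ``Scale``, which would
need a guard. Whether a particular compiler version actually transforms this
kernel is an assumption, not a guarantee.

.. code-block:: c++

  // Hypothetical kernel, used only to illustrate the two kinds of
  // sequential code found between the target and parallel regions.
  void scale_rows(int M, int N, const double *In, const double *W,
                  double *Out, double *Scale) {
  #pragma omp target teams distribute \
    map(to:In[0 : M*N], W[0 : M]) map(from:Out[0 : M*N], Scale[0 : M])
    for (int i = 0; i < M; i++) {
      // No side-effects: in SPMD-mode each thread can recompute these.
      int base = i * N;
      double scale = 0.5 * W[i];

      // Store to memory visible to other threads: a side-effect that
      // would be guarded so that only one thread executes it.
      Scale[i] = scale;

  #pragma omp parallel for firstprivate(base, scale)
      for (int j = 0; j < N; j++)
        Out[base + j] = scale * In[base + j];
    }
  }
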

Generic-mode is often considerably slower than SPMD-mode because of the extra
overhead required to separately schedule worker threads and pass data between
them. This optimization allows users to use generic-mode semantics while
achieving the performance of SPMD-mode. This can be helpful when defining shared
memory between the threads using :ref:`OMP111 <omp111>`.

Examples
--------

Normally, any kernel that contains split OpenMP target and parallel regions will
be executed in generic-mode. Sometimes it is easier to use generic-mode
semantics to define shared memory, or to control the distribution of the
threads more tightly. The example below shows a naive matrix-matrix
multiplication that contains code that will need to be guarded.

.. code-block:: c++

  void matmul(int M, int N, int K, double *A, double *B, double *C) {
  #pragma omp target teams distribute collapse(2) \
    map(to:A[0: M*K]) map(to:B[0: K*N]) map(tofrom:C[0 : M*N])
    for (int i = 0; i < M; i++) {
      for (int j = 0; j < N; j++) {
        double sum = 0.0;

  #pragma omp parallel for reduction(+:sum) default(firstprivate)
        for (int k = 0; k < K; k++)
          sum += A[i*K + k] * B[k*N + j];

        C[i*N + j] = sum;
      }
    }
  }

.. code-block:: console

   $ clang++ -fopenmp -fopenmp-targets=nvptx64 -fopenmp-version=51 -O2 -Rpass=openmp-opt omp120.cpp
   omp120.cpp:6:14: remark: Replaced globalized variable with 8 bytes of shared memory. [OMP111]
        double sum = 0.0;
               ^
   omp120.cpp:2:1: remark: Transformed generic-mode kernel to SPMD-mode. [OMP120]
   #pragma omp target teams distribute collapse(2) \
   ^

The transformation requires guarding the store to the shared variable ``sum``
and the store to the matrix ``C``. Conceptually, the transformed kernel behaves
like the code below.

.. code-block:: c++

  #include <omp.h> // For omp_get_thread_num().

  void matmul(int M, int N, int K, double *A, double *B, double *C) {
  #pragma omp target teams distribute collapse(2) \
    map(to:A[0: M*K]) map(to:B[0: K*N]) map(tofrom:C[0 : M*N])
    for (int i = 0; i < M; i++) {
      for (int j = 0; j < N; j++) {
        double sum;
  #pragma omp parallel default(firstprivate) shared(sum)
        {
          // Guarded store to the shared variable: only thread 0 writes.
  #pragma omp barrier
          if (omp_get_thread_num() == 0)
            sum = 0.0;
  #pragma omp barrier

  #pragma omp for reduction(+:sum)
          for (int k = 0; k < K; k++)
            sum += A[i*K + k] * B[k*N + j];

          // Guarded store to the matrix C: only thread 0 writes.
  #pragma omp barrier
          if (omp_get_thread_num() == 0)
            C[i*N + j] = sum;
  #pragma omp barrier
        }
      }
    }
  }
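
For comparison, a kernel written with a combined construct does not contain
sequential code between the target and parallel regions, so Clang typically
generates it in SPMD-mode from the start and there is nothing to transform or
guard. The sketch below is a hypothetical variant (``matmul_combined`` is not
part of the original example) in which each thread computes one element of
``C`` and the loop over ``k`` runs sequentially inside that thread.

.. code-block:: c++

  // Hypothetical variant for comparison only: one thread per element of C.
  void matmul_combined(int M, int N, int K, const double *A, const double *B,
                       double *C) {
  #pragma omp target teams distribute parallel for collapse(2) \
    map(to:A[0 : M*K]) map(to:B[0 : K*N]) map(from:C[0 : M*N])
    for (int i = 0; i < M; i++) {
      for (int j = 0; j < N; j++) {
        // Private to each thread, so no shared memory or guard is needed.
        double sum = 0.0;
        for (int k = 0; k < K; k++)
          sum += A[i*K + k] * B[k*N + j];
        C[i*N + j] = sum;
      }
    }
  }

The split form used in the example above lets all threads of a team cooperate
on a single element through the reduction, which can be preferable when ``K``
is large relative to ``M*N``. The combined form trades that cooperation for a
simpler kernel.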

Diagnostic Scope
----------------

OpenMP target offloading optimization remark.