**Design document for OpenMP reductions on the GPU**

//Abstract:// In this document we summarize the new design for an OpenMP
implementation of reductions on NVIDIA GPUs. This document comprises
* a succinct background review,
* an introduction to the decoupling of the reduction algorithm from the
  data-structure-specific processing routines,
* detailed illustrations of the reduction algorithms used, and
* a brief overview of the steps we have taken beyond the previous
  implementation.

Consider a typical OpenMP program with a reduction pragma:

    double foo, bar;
    #pragma omp parallel for reduction(+:foo, bar)
    for (int i = 0; i < N; i++) {
      foo += A[i]; bar += B[i];
    }

where 'foo' and 'bar' are reduced across all threads in the parallel region.
Our primary goal is to efficiently aggregate the values of foo and bar in a
way that
* keeps the compiler logically concise, and
* efficiently reduces within warps, threads, blocks, and the device.

**Introduction to Decoupling**

In this section we address the problem of making the compiler
//logically concise// by partitioning the task of reduction into two broad
categories: data-structure-specific routines and algorithmic routines.

The previous reduction implementation was tightly coupled to the specifics of
the reduction element data structures (e.g., sizes, data types) and the
operators of the reduction (e.g., addition, multiplication). In our
implementation we strive to decouple the two; in the final implementation we
were able to remove all template functions from our runtime system.

The (simplified) pseudocode generated by LLVM for the device is as follows:

    1. Create private copies of variables: foo_p, bar_p
    2. Each thread reduces the chunk of A and B assigned to it and writes
       to foo_p and bar_p respectively.
    3. ret = kmpc_nvptx_reduce_nowait(..., reduceData, shuffleReduceFn,
                                      interWarpCpyFn)
       where:
         struct ReduceData {
           double *foo;
           double *bar;
         } reduceData;
         reduceData.foo = &foo_p;
         reduceData.bar = &bar_p;

shuffleReduceFn and interWarpCpyFn are two auxiliary functions generated by
the compiler to help the runtime perform the algorithmic steps while staying
agnostic about the layout of ReduceData.

In particular, shuffleReduceFn is a function that takes the following
parameters:
a. a local copy of ReduceData;
b. the lane_id of the calling thread;
c. the offset, relative to the current lane, of the lane_id that hosts the
   remote ReduceData;
d. an algorithm version parameter determining which reduction algorithm to
   use.

This shuffleReduceFn retrieves the remote ReduceData through shuffle
intrinsics and, using the algorithm specified by the 4th parameter, reduces
the local ReduceData element-wise with the remote ReduceData, placing the
resulting values into the local ReduceData.

Different reduction algorithms are implemented with different runtime
functions, but they all call this same shuffleReduceFn to perform the
essential reduction step. Based on the 4th parameter, shuffleReduceFn
therefore behaves slightly differently in order to cooperate with the runtime
function and ensure correctness under the different circumstances.

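To make this concrete, below is a minimal sketch, in CUDA, of what a
compiler-generated shuffleReduceFn could look like for the foo/bar ReduceData
above. The 64-bit shuffle helper, the full-warp mask, and the exact branch
conditions are illustrative assumptions (the conditions mirror the three
algorithm versions described in the next section); this is not the actual
Clang-generated code.

    struct ReduceData { double *foo; double *bar; };

    __device__ static double shuffle_down_double(double val, int offset) {
      // Compose a 64-bit shuffle from two 32-bit shuffle intrinsics.
      int lo = __double2loint(val);
      int hi = __double2hiint(val);
      lo = __shfl_down_sync(0xffffffffu, lo, offset);
      hi = __shfl_down_sync(0xffffffffu, hi, offset);
      return __hiloint2double(hi, lo);
    }

    __device__ void shuffleReduceFn(ReduceData *local, int lane_id,
                                    int offset, int algo_version) {
      // Retrieve the remote ReduceData element-wise through shuffles.
      double remote_foo = shuffle_down_double(*local->foo, offset);
      double remote_bar = shuffle_down_double(*local->bar, offset);
      // Per algorithm version, either reduce or copy (see next section):
      //   0: full warp reduction          -- always reduce
      //   1: contiguous partial reduction -- lanes below 'offset' reduce,
      //                                      the others copy the remote value
      //   2: dispersed partial reduction  -- even logical lanes with a valid
      //                                      partner reduce, the others copy
      bool reduce = (algo_version == 0) ||
                    (algo_version == 1 && lane_id < offset) ||
                    (algo_version == 2 && lane_id % 2 == 0 && offset > 0);
      if (reduce) {
        *local->foo += remote_foo;   // '+' is the reduction operator here
        *local->bar += remote_bar;
      } else {
        *local->foo = remote_foo;
        *local->bar = remote_bar;
      }
    }
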
InterWarpCpyFn, as the name suggests, is a function that copies data across
warps. Its purpose is to tunnel the thread-private ReduceData that has
already been reduced within each warp to a lane in the first warp, with a
minimal shared memory footprint. This is an essential step in preparing for
the final stage of a block reduction.

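The following is a minimal sketch of what such an interWarpCpyFn could look
like for the same ReduceData. The shared-memory staging buffers, their size
bound, and the use of __syncthreads() in place of the runtime's named barrier
(which synchronizes only the participating threads and therefore cannot
deadlock) are illustrative assumptions; to keep the shared memory footprint
minimal, the generated code may stage the data in smaller chunks.

    #define MAX_WARPS_PER_BLOCK 32   // assumed upper bound for this sketch

    struct ReduceData { double *foo; double *bar; };

    __device__ void interWarpCpyFn(ReduceData *local, int num_warps) {
      // One staging slot per warp, one buffer per reduction variable.
      __shared__ double staging_foo[MAX_WARPS_PER_BLOCK];
      __shared__ double staging_bar[MAX_WARPS_PER_BLOCK];
      int wid = threadIdx.x / warpSize;
      int lane_id = threadIdx.x % warpSize;

      // Each "warp master" publishes its warp-reduced values...
      if (lane_id == 0) {
        staging_foo[wid] = *local->foo;
        staging_bar[wid] = *local->bar;
      }
      __syncthreads();   // stands in for the runtime's named barrier

      // ...and the first num_warps lanes of warp 0 pick them up, so that a
      // partial warp reduction over warp 0 can finish the block reduction.
      if (wid == 0 && lane_id < num_warps) {
        *local->foo = staging_foo[lane_id];
        *local->bar = staging_bar[lane_id];
      }
    }
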
(Warp, block, and device level reduction routines that utilize these
auxiliary functions will be discussed in the next section.)

    4. if ret == 1:
         The master thread stores the reduced result in the globals:
         foo += reduceData.foo; bar += reduceData.bar

**Reduction Algorithms**

On the warp level, we have three versions of the algorithm:

1. Full Warp Reduction

    gpu_regular_warp_reduce(void *reduce_data,
                            kmp_ShuffleReductFctPtr ShuffleReduceFn) {
      for (int offset = WARPSIZE/2; offset > 0; offset /= 2)
        ShuffleReduceFn(reduce_data, 0, offset, 0);
    }

ShuffleReduceFn is invoked here with lane_id set to 0 because the lane_id is
not used by this version; we thereby save instructions by not retrieving it
from the corresponding special register. The 4th parameter, which represents
the version of the algorithm being used, is set to 0 to signify full warp
reduction.

When the specified version is 0, ShuffleReduceFn behaves, per element, as
follows:

    // reduce_elem refers to an element in the local ReduceData
    // remote_elem is retrieved from a remote lane
    remote_elem = shuffle_down(reduce_elem, offset, 32);
    reduce_elem = reduce_elem @ remote_elem;

An illustration of this algorithm operating on a hypothetical 8-lane full
warp:

    [figure: full warp reduction on 8 lanes]

The coloring invariant is that elements with the same color are combined and
reduced in the next reduction step. As can be observed, no overhead is
present and exactly log2(N) steps are needed.

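For comparison, the same full-warp pattern, specialized to a single double
value and written directly against a CUDA shuffle intrinsic, would look like
the sketch below (illustrative only, not runtime code):

    __device__ double warp_sum(double v) {
      // After log2(warpSize) steps, lane 0 holds the sum over all 32 lanes.
      for (int offset = warpSize / 2; offset > 0; offset /= 2)
        v += __shfl_down_sync(0xffffffffu, v, offset);
      return v;
    }
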
2. Contiguous Partial Warp Reduction

    gpu_irregular_warp_reduce(void *reduce_data,
        kmp_ShuffleReductFctPtr ShuffleReduceFn, int size, int lane_id) {
      int curr_size = size;
      int offset = curr_size/2;
      while (offset > 0) {
        ShuffleReduceFn(reduce_data, lane_id, offset, 1);
        curr_size = (curr_size+1)/2;
        offset = curr_size/2;
      }
    }

When the specified version is 1, ShuffleReduceFn behaves, per element, as
follows:

    // reduce_elem refers to an element in the local ReduceData
    // remote_elem is retrieved from a remote lane
    remote_elem = shuffle_down(reduce_elem, offset, 32);
    if (lane_id < offset) {
      reduce_elem = reduce_elem @ remote_elem;
    } else {
      reduce_elem = remote_elem;
    }

An important invariant (also a restriction on the starting state of the
reduction) is that this algorithm assumes that all ReduceData still to be
reduced are located in a contiguous subset of lanes in a warp, starting from
lane 0.

When the number of participating lanes is odd, the value of the trailing
active lane will not be aggregated with any other lane in that step.
Therefore, in order to preserve the invariant, such ReduceData is copied to
the first lane whose thread-local ReduceData has already been consumed in
this reduction step and would otherwise go unused.

An illustration of this algorithm operating on a hypothetical 8-lane partial
warp:

    [figure: contiguous partial warp reduction]

As illustrated, this version of the algorithm introduces overhead whenever
there is an odd number of participating lanes in a reduction step, because
data must be copied between lanes.

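To make the index pattern concrete, the following is a small host-side
simulation of the version-1 behavior for 7 active lanes; the array, the zero
used for out-of-range remote values, and the printed result are purely
illustrative and are not part of the runtime.

    #include <stdio.h>

    int main(void) {
      double lanes[7] = {1, 2, 3, 4, 5, 6, 7};   // one value per active lane
      int curr_size = 7;
      int offset = curr_size / 2;
      while (offset > 0) {
        double next[7];
        for (int lane = 0; lane < curr_size; lane++) {
          // shuffle_down: read the value held by lane (lane + offset)
          double remote =
              (lane + offset < curr_size) ? lanes[lane + offset] : 0.0;
          // version 1: lanes below 'offset' reduce, the others copy down
          next[lane] = (lane < offset) ? lanes[lane] + remote : remote;
        }
        for (int lane = 0; lane < curr_size; lane++)
          lanes[lane] = next[lane];
        curr_size = (curr_size + 1) / 2;
        offset = curr_size / 2;
      }
      printf("warp-reduced value in lane 0: %g\n", lanes[0]);   // prints 28
      return 0;
    }
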
3. Dispersed Partial Warp Reduction

    gpu_irregular_simt_reduce(void *reduce_data,
                              kmp_ShuffleReductFctPtr ShuffleReduceFn) {
      int size, remote_id;
      int logical_lane_id =
          find_number_of_dispersed_active_lanes_before_me() * 2;
      do {
        remote_id = find_the_next_active_lane_id_right_after_me();
        // the above function returns 0 if no active lane
        // is present right after the current thread.
        size = get_number_of_active_lanes_in_this_warp();
        logical_lane_id /= 2;
        ShuffleReduceFn(reduce_data, logical_lane_id,
                        remote_id-1-threadIdx.x, 2);
      } while (logical_lane_id % 2 == 0 && size > 1);
    }

There is no assumption made about the initial state of the reduction. Any
number of lanes (>= 1) could be active at any position. The reduction result
is kept in the first active lane.

When the specified version is 2, ShuffleReduceFn behaves, per element, as
follows:

    // reduce_elem refers to an element in the local ReduceData
    // remote_elem is retrieved from a remote lane
    remote_elem = shuffle_down(reduce_elem, offset, 32);
    if (lane_id % 2 == 0 && offset > 0) {
      reduce_elem = reduce_elem @ remote_elem;
    } else {
      reduce_elem = remote_elem;
    }

We now briefly explain some of the arguments passed in; a sketch of the
helper queries used above follows this list. In this section we introduce the
concept of the logical_lane_id, and it is important to distinguish it from
the physical lane_id as defined by NVIDIA.

1. //logical_lane_id//: as the name suggests, this is a calculated lane id
   (as opposed to the physical one defined by NVIDIA) chosen to keep our
   algorithm logically concise. A thread with logical_lane_id k has k active
   threads before it.
2. //remote_id-1-threadIdx.x//: remote_id identifies the remote lane from
   which we will retrieve the ReduceData (it is derived from the
   NVIDIA-defined physical lane id, and 0 signals that no active lane
   follows). We subtract (threadIdx.x+1) from it because we would like to
   maintain only one underlying shuffle intrinsic (which is used to
   communicate among lanes in a warp). The particular shuffle intrinsic we
   use accepts only offsets, not absolute lane ids, so the subtraction
   converts the computed id into an offset.

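The helper queries used in the pseudo code above can be derived from the
warp's active-lane mask. Below is a sketch assuming __activemask() (or,
equivalently, the older __ballot(1)) supplies that mask; the function names
mirror the pseudo code rather than the runtime's actual identifiers, and the
"next active lane id plus one" encoding is an inference from the
'remote_id-1-threadIdx.x' offset and the "returns 0 if none" convention.

    __device__ static unsigned my_lane_id() {
      return threadIdx.x % warpSize;
    }

    __device__ int find_number_of_dispersed_active_lanes_before_me() {
      unsigned mask = __activemask();
      // Count active lanes whose physical lane id is lower than mine.
      return __popc(mask & ((1u << my_lane_id()) - 1u));
    }

    __device__ int find_the_next_active_lane_id_right_after_me() {
      unsigned mask = __activemask();
      // Keep only the lanes strictly above mine.
      unsigned higher = mask & ~((2u << my_lane_id()) - 1u);
      // __ffs is 1-based, so this yields (next active lane id + 1),
      // and 0 when no active lane follows the current thread.
      return __ffs(higher);
    }

    __device__ int get_number_of_active_lanes_in_this_warp() {
      return __popc(__activemask());
    }
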
This algorithm differs from the previous ones in two ways and is not,
conceptually, a generalization of them.
1. It reduces elements that are close to each other. For instance, the value
   in the 0th lane is combined with that of the 1st lane, and the value in
   the 2nd lane is combined with that of the 3rd lane. We did not use the
   previous scheme, in which the first half of the (partial) warp is reduced
   with the second half, because the mapping
       f(x): logical_lane_id -> physical_lane_id
   can be easily calculated, whereas its inverse
       f^-1(x): physical_lane_id -> logical_lane_id
   cannot, and performing such a reduction requires the inverse to be known.
2. Because this algorithm is agnostic about the positions of the active
   lanes, we do not need to perform the copying step used in the second
   algorithm.

An illustrative run would look like:

    [figure: dispersed partial warp reduction]

As observed, the overhead is high because the logical_lane_id and the
remote_id have to be recalculated in each and every reduction step.

At the block level, we have implemented the following block reduction
algorithm:

    gpu_irregular_block_reduce(void *reduce_data,
                               kmp_ShuffleReductFctPtr shuflReduceFn,
                               kmp_InterWarpCopyFctPtr interWarpCpyFn,
                               int size) {
      int wid = threadIdx.x/WARPSIZE;
      int lane_id = threadIdx.x%WARPSIZE;

      int warp_needed = (size+WARPSIZE-1)/WARPSIZE; //ceiling of division

      unsigned tnum = __ballot(1);
      int thread_num = __popc(tnum);

      //full warp reduction
      if (thread_num == WARPSIZE) {
        gpu_regular_warp_reduce(reduce_data, shuflReduceFn);
      }
      //partial warp reduction
      if (thread_num < WARPSIZE) {
        gpu_irregular_warp_reduce(reduce_data, shuflReduceFn, thread_num,
                                  lane_id);
      }
      //Gather all the reduced values from each warp
      //to the first warp.
      //named_barrier inside this function to ensure
      //correctness. It is effectively a sync_thread
      //that won't deadlock.
      interWarpCpyFn(reduce_data, warp_needed);

      //This is to reduce data gathered from each "warp master".
      if (wid == 0) {
        gpu_irregular_warp_reduce(reduce_data, shuflReduceFn, warp_needed,
                                  lane_id);
      }
    }

In this function, no ShuffleReduceFn is called directly; instead, the
function calls the various versions of the warp-reduction routines. It first
reduces ReduceData warp by warp; at that point we end up with as many
ReduceData values as there are warps in the thread block. We then gather all
such ReduceData into the first warp.

As observed, this algorithm makes use of InterWarpCpyFn, which copies data
from each "warp master" (the 0th lane of each warp, where a warp-reduced
ReduceData is held) to the 0th warp. This step reduces (in the mathematical
sense) the problem of reduction across the warp masters in a block to the
problem of a warp reduction, which we already know how to solve.

We can thus completely avoid the use of atomics to reduce within a
threadblock.

**Efficient Cross Block Reduce**

The next challenge is to reduce values across threadblocks. We aim to do this
without atomics or critical sections.

Let a kernel be started with TB threadblocks.
Let the GPU have S SMs.
There can be at most N active threadblocks per SM at any time.

Consider a threadblock tb (tb < TB) running on SM s (s < S). 'tb' is one of
at most 'N' active threadblocks on SM s. Let each threadblock active on an SM
be given an instance identifier id (0 <= id < N). The tuple (s, id) therefore
uniquely identifies an active threadblock on the GPU.

To efficiently implement cross block reduce, we first allocate, for each
value to be reduced, an array of size S*N (the maximum number of
threadblocks that can be active on the device at any time).

Each threadblock reduces its value into slot [s][id]. This can be done
without locking since no other threadblock can write to the same slot
concurrently.

As a final stage, we reduce the values in the array as follows:

    // Compiler generated wrapper function for each target region with a
    // reduction clause.
    target_function_wrapper(map_args, reduction_array)  <--- start with 1 team
                                                             and 1 thread.
      // Use dynamic parallelism to launch M teams, N threads as requested by
      // the user to execute the target region.
      target_function<<<M, N>>>(map_args)

      Reduce values in reduction_array

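Because the wrapper runs with a single team and a single thread, the final
stage can be a simple sequential sweep over the staging slots once the child
grid completes. The sketch below is illustrative (the function name is not
part of the runtime) and assumes unused slots were initialized to the
identity of the reduction operator:

    __device__ double reduce_staging_array(const double *reduction_array,
                                           int num_slots /* S * N */) {
      double result = 0.0;                 // identity of '+'
      for (int slot = 0; slot < num_slots; slot++)
        result += reduction_array[slot];   // '+' is the reduction operator
      return result;
    }
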
**Comparison with Last Version**

The (simplified) pseudocode generated by LLVM on the host is as follows:

    1. Create private copies of variables: foo_p, bar_p
    2. Each thread reduces the chunk of A and B assigned to it and writes
       to foo_p and bar_p respectively.
    3. ret = kmpc_reduce_nowait(..., reduceData, reduceFn, lock)
       where:
         struct ReduceData {
           double *foo;
           double *bar;
         } reduceData;
         reduceData.foo = &foo_p;
         reduceData.bar = &bar_p;

reduceFn is a pointer to a function that takes two inputs of type ReduceData,
"reduces" them element-wise, and places the result in the first input:

    reduceFn(ReduceData *a, ReduceData *b)

Every thread in the parallel region calls kmpc_reduce_nowait with its private
copy of reduceData. The runtime reduces across the threads (using a tree
reduction over the operator 'reduceFn') and, if successful, stores the final
result in the master thread.

    4. if ret == 1:
         The master thread stores the reduced result in the globals:
         foo += reduceData.foo; bar += reduceData.bar

    5. else if ret == 2:
         In this case kmpc_reduce_nowait() could not use tree reduction,
         so it uses atomics instead:
           each thread atomically writes to foo
           each thread atomically writes to bar

On a GPU, a similar reduction may need to be performed across SIMT threads,
warps, and threadblocks. The challenge is to do so efficiently in a fashion
that is compatible with the LLVM OpenMP implementation.

In the previously released 0.1 version of the LLVM OpenMP compiler for GPUs,
the salient steps of the code generated are as follows:

    1. Create private copies of variables: foo_p, bar_p
    2. Each thread reduces the chunk of A and B assigned to it and writes
       to foo_p and bar_p respectively.
    3. ret = kmpc_reduce_nowait(..., reduceData, reduceFn, lock)
         status = can_block_reduce()
         if status == 1:
           reduce efficiently to thread 0 using shuffles and shared memory.
           return 1
         else:
           cannot use efficient block reduction, fall back to atomics
           return 2
    4. if ret == 1:
         The master thread stores the reduced result in the globals:
         foo += reduceData.foo; bar += reduceData.bar
    5. else if ret == 2:
         In this case kmpc_reduce_nowait() could not use tree reduction,
         so it uses atomics instead:
           each thread atomically writes to foo
           each thread atomically writes to bar

The function can_block_reduce() is defined as follows:

    int32_t can_block_reduce() {
      int tid = GetThreadIdInTeam();
      int nt = GetNumberOfOmpThreads(tid);
      if (nt != blockDim.x)
        return 0;
      unsigned tnum = __ballot(1);
      if (tnum != (~0x0)) {
        return 0;
      }
      return 1;
    }

This function permits the use of the efficient block reduction algorithm
using shuffles and shared memory (returning 1) only if (a) all SIMT threads
in a warp are active (i.e., the number of threads in the parallel region is a
multiple of 32) and (b) the number of threads in the parallel region (set by
the num_threads clause) equals blockDim.x.

If either of these preconditions does not hold, each thread in the
threadblock updates the global value using atomics.

Atomics and compare-and-swap operations are expensive on many-threaded
architectures such as GPUs, and we must avoid them completely.

**Appendix: Implementation Details**

    // Compiler generated function.
    reduceFn(ReduceData *a, ReduceData *b)
      a->foo = a->foo + b->foo
      a->bar = a->bar + b->bar

    // Compiler generated function.
    swapAndReduceFn(ReduceData *thread_private, int lane)
      ReduceData *remote = new ReduceData()
      remote->foo = shuffle_double(thread_private->foo, lane)
      remote->bar = shuffle_double(thread_private->bar, lane)
      reduceFn(thread_private, remote)

    // OMP runtime function.
    warpReduce_regular(ReduceData *thread_private, Fn *swapAndReduceFn):
      offset = 16
      while (offset > 0)
        swapAndReduceFn(thread_private, offset)
        offset /= 2

    // OMP runtime function.
    warpReduce_irregular():
      ...

    // OMP runtime function.
    kmpc_reduce_warp(reduceData, swapAndReduceFn)
      if all lanes in the warp are active:
        warpReduce_regular(reduceData, swapAndReduceFn)
      else:
        warpReduce_irregular(reduceData, swapAndReduceFn)
      if in_simd_region:
        // all done, reduce to global in simd lane 0
        return 1
      else if in_parallel_region:
        // done reducing to one value per warp, now reduce across warps
        return 2

    // OMP runtime function; one for each basic type.
    kmpc_reduce_block_double(double *a)
      if lane == 0:
        shared[wid] = *a
      named_barrier(1, num_threads)
      if wid == 0:
        reduce the values in shared[] into *a using warp shuffles
      named_barrier(1, num_threads)
      if wid == 0 and lane == 0:
        return 1 // write back reduced result
      else:
        return 0 // don't do anything

    // Compiler generated code.
    1. Create private copies of variables: foo_p, bar_p
    2. Each thread reduces the chunk of A and B assigned to it and writes
       to foo_p and bar_p respectively.
    3. ret = kmpc_reduce_warp(reduceData, swapAndReduceFn)
    4. if ret == 1:
         The master thread stores the reduced result in the globals:
         foo += reduceData.foo; bar += reduceData.bar
    5. else if ret == 2:
         ret = block_reduce_double(reduceData.foo)
         if ret == 1:
           foo += reduceData.foo
         ret = block_reduce_double(reduceData.bar)
         if ret == 1:
           bar += reduceData.bar

Some final remarks on this scheme:

1. This scheme requires that the CUDA OMP runtime can call LLVM-generated
   functions. This functionality now works.
2. If the user inlines the CUDA OMP runtime bitcode, all of the machinery
   (including calls through function pointers) is optimized away.
3. If we are reducing multiple variables in a parallel region, the reduce
   operations are all performed in warpReduce_[ir]regular(). This results in
   more independent instructions in the loop and should therefore lead to
   fewer stalls due to data dependencies. Unfortunately we cannot do the same
   in kmpc_reduce_block_double() without increasing shared memory usage.