**Design document for OpenMP reductions on the GPU**

//Abstract:// In this document we summarize the new design for an OpenMP
implementation of reductions on NVIDIA GPUs. This document comprises
* a succinct background review,
* an introduction to the decoupling of the reduction algorithm from the
  data-structure-specific processing routines,
* detailed illustrations of the reduction algorithms used, and
* a brief overview of the steps we have taken beyond the previous
  implementation.

Consider a typical OpenMP program with a reduction pragma:

    double foo, bar;
    #pragma omp parallel for reduction(+:foo, bar)
    for (int i = 0; i < N; i++) {
      foo += A[i]; bar += B[i];
    }

where 'foo' and 'bar' are reduced across all threads in the parallel region.
Our primary goal is to efficiently aggregate the values of foo and bar in a
way that
* keeps the compiler logically concise, and
* efficiently reduces within warps, threads, blocks, and the device.

**Introduction to Decoupling**

In this section we address the problem of making the compiler
//logically concise// by partitioning the task of reduction into two broad
categories: data-structure-specific routines and algorithmic routines.

The previous reduction implementation was tightly coupled to the specifics of
the reduction element data structures (e.g., sizes, data types) and the
operators of the reduction (e.g., addition, multiplication). In our
implementation we strive to decouple the two; in the final implementation we
were able to remove all template functions from our runtime system.

The (simplified) pseudocode generated by LLVM for the device is as follows:

    1. Create private copies of variables: foo_p, bar_p
    2. Each thread reduces the chunk of A and B assigned to it and writes
       to foo_p and bar_p respectively.
    3. ret = kmpc_nvptx_reduce_nowait(..., reduceData, shuffleReduceFn,
                                      interWarpCpyFn)
       where:
         struct ReduceData {
           double *foo;
           double *bar;
         } reduceData;
         reduceData.foo = &foo_p;
         reduceData.bar = &bar_p;

shuffleReduceFn and interWarpCpyFn are two auxiliary functions generated by
the compiler to help the runtime perform the algorithmic steps while staying
agnostic about the layout of ReduceData.

In particular, shuffleReduceFn is a function that takes the following
parameters:
a. a local copy of ReduceData;
b. the lane_id of the calling thread;
c. the offset, relative to the current lane, of the lane_id that hosts the
   remote ReduceData;
d. an algorithm version parameter determining which reduction algorithm to
   use.

This shuffleReduceFn retrieves the remote ReduceData through shuffle
intrinsics and, using the algorithm specified by the 4th parameter, reduces
the local ReduceData element-wise with the remote ReduceData, placing the
resulting values into the local ReduceData.

Different reduction algorithms are implemented with different runtime
functions, but they all call this same shuffleReduceFn to perform the
essential reduction step. Based on the 4th parameter, shuffleReduceFn
therefore behaves slightly differently in order to cooperate with the runtime
function and ensure correctness under the different circumstances.

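To make this concrete, below is a minimal sketch, in CUDA, of what a
compiler-generated shuffleReduceFn could look like for the foo/bar ReduceData
above. The 64-bit shuffle helper, the full-warp mask, and the exact branch
conditions are illustrative assumptions (the conditions mirror the three
algorithm versions described in the next section); this is not the actual
Clang-generated code.

    struct ReduceData { double *foo; double *bar; };

    __device__ static double shuffle_down_double(double val, int offset) {
      // Compose a 64-bit shuffle from two 32-bit shuffle intrinsics.
      int lo = __double2loint(val);
      int hi = __double2hiint(val);
      lo = __shfl_down_sync(0xffffffffu, lo, offset);
      hi = __shfl_down_sync(0xffffffffu, hi, offset);
      return __hiloint2double(hi, lo);
    }

    __device__ void shuffleReduceFn(ReduceData *local, int lane_id,
                                    int offset, int algo_version) {
      // Retrieve the remote ReduceData element-wise through shuffles.
      double remote_foo = shuffle_down_double(*local->foo, offset);
      double remote_bar = shuffle_down_double(*local->bar, offset);
      // Per algorithm version, either reduce or copy (see next section):
      //   0: full warp reduction          -- always reduce
      //   1: contiguous partial reduction -- lanes below 'offset' reduce,
      //                                      the others copy the remote value
      //   2: dispersed partial reduction  -- even logical lanes with a valid
      //                                      partner reduce, the others copy
      bool reduce = (algo_version == 0) ||
                    (algo_version == 1 && lane_id < offset) ||
                    (algo_version == 2 && lane_id % 2 == 0 && offset > 0);
      if (reduce) {
        *local->foo += remote_foo;   // '+' is the reduction operator here
        *local->bar += remote_bar;
      } else {
        *local->foo = remote_foo;
        *local->bar = remote_bar;
      }
    }
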
InterWarpCpyFn, as the name suggests, is a function that copies data across
warps. Its purpose is to tunnel the thread-private ReduceData that has
already been reduced within each warp to a lane in the first warp, with a
minimal shared memory footprint. This is an essential step in preparing for
the final stage of a block reduction.

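The following is a minimal sketch of what such an interWarpCpyFn could look
like for the same ReduceData. The shared-memory staging buffers, their size
bound, and the use of __syncthreads() in place of the runtime's named barrier
(which synchronizes only the participating threads and therefore cannot
deadlock) are illustrative assumptions; to keep the shared memory footprint
minimal, the generated code may stage the data in smaller chunks.

    #define MAX_WARPS_PER_BLOCK 32   // assumed upper bound for this sketch

    struct ReduceData { double *foo; double *bar; };

    __device__ void interWarpCpyFn(ReduceData *local, int num_warps) {
      // One staging slot per warp, one buffer per reduction variable.
      __shared__ double staging_foo[MAX_WARPS_PER_BLOCK];
      __shared__ double staging_bar[MAX_WARPS_PER_BLOCK];
      int wid = threadIdx.x / warpSize;
      int lane_id = threadIdx.x % warpSize;

      // Each "warp master" publishes its warp-reduced values...
      if (lane_id == 0) {
        staging_foo[wid] = *local->foo;
        staging_bar[wid] = *local->bar;
      }
      __syncthreads();   // stands in for the runtime's named barrier

      // ...and the first num_warps lanes of warp 0 pick them up, so that a
      // partial warp reduction over warp 0 can finish the block reduction.
      if (wid == 0 && lane_id < num_warps) {
        *local->foo = staging_foo[lane_id];
        *local->bar = staging_bar[lane_id];
      }
    }
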
(Warp, block, and device level reduction routines that utilize these
auxiliary functions will be discussed in the next section.)

    4. if ret == 1:
         The master thread stores the reduced result in the globals:
         foo += reduceData.foo; bar += reduceData.bar

**Reduction Algorithms**

On the warp level, we have three versions of the algorithm:

1. Full Warp Reduction

    gpu_regular_warp_reduce(void *reduce_data,
                            kmp_ShuffleReductFctPtr ShuffleReduceFn) {
      for (int offset = WARPSIZE/2; offset > 0; offset /= 2)
        ShuffleReduceFn(reduce_data, 0, offset, 0);
    }

ShuffleReduceFn is invoked here with lane_id set to 0 because the lane_id is
not used by this version; we thereby save instructions by not retrieving it
from the corresponding special register. The 4th parameter, which represents
the version of the algorithm being used, is set to 0 to signify full warp
reduction.

When the specified version is 0, ShuffleReduceFn behaves, per element, as
follows:

    // reduce_elem refers to an element in the local ReduceData
    // remote_elem is retrieved from a remote lane
    remote_elem = shuffle_down(reduce_elem, offset, 32);
    reduce_elem = reduce_elem @ remote_elem;

An illustration of this algorithm operating on a hypothetical 8-lane full
warp:

    [figure: full warp reduction on 8 lanes]

The coloring invariant is that elements with the same color are combined and
reduced in the next reduction step. As can be observed, no overhead is
present and exactly log2(N) steps are needed.

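For comparison, the same full-warp pattern, specialized to a single double
value and written directly against a CUDA shuffle intrinsic, would look like
the sketch below (illustrative only, not runtime code):

    __device__ double warp_sum(double v) {
      // After log2(warpSize) steps, lane 0 holds the sum over all 32 lanes.
      for (int offset = warpSize / 2; offset > 0; offset /= 2)
        v += __shfl_down_sync(0xffffffffu, v, offset);
      return v;
    }
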
2. Contiguous Partial Warp Reduction

    gpu_irregular_warp_reduce(void *reduce_data,
        kmp_ShuffleReductFctPtr ShuffleReduceFn, int size, int lane_id) {
      int curr_size = size;
      int offset = curr_size/2;
      while (offset > 0) {
        ShuffleReduceFn(reduce_data, lane_id, offset, 1);
        curr_size = (curr_size+1)/2;
        offset = curr_size/2;
      }
    }

When the specified version is 1, ShuffleReduceFn behaves, per element, as
follows:

    // reduce_elem refers to an element in the local ReduceData
    // remote_elem is retrieved from a remote lane
    remote_elem = shuffle_down(reduce_elem, offset, 32);
    if (lane_id < offset) {
      reduce_elem = reduce_elem @ remote_elem;
    } else {
      reduce_elem = remote_elem;
    }

An important invariant (also a restriction on the starting state of the
reduction) is that this algorithm assumes that all ReduceData still to be
reduced are located in a contiguous subset of lanes in a warp, starting from
lane 0.

When the number of participating lanes is odd, the value of the trailing
active lane will not be aggregated with any other lane in that step.
Therefore, in order to preserve the invariant, such ReduceData is copied to
the first lane whose thread-local ReduceData has already been consumed in
this reduction step and would otherwise go unused.

An illustration of this algorithm operating on a hypothetical 8-lane partial
warp:

    [figure: contiguous partial warp reduction]

As illustrated, this version of the algorithm introduces overhead whenever
there is an odd number of participating lanes in a reduction step, because
data must be copied between lanes.

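To make the index pattern concrete, the following is a small host-side
simulation of the version-1 behavior for 7 active lanes; the array, the zero
used for out-of-range remote values, and the printed result are purely
illustrative and are not part of the runtime.

    #include <stdio.h>

    int main(void) {
      double lanes[7] = {1, 2, 3, 4, 5, 6, 7};   // one value per active lane
      int curr_size = 7;
      int offset = curr_size / 2;
      while (offset > 0) {
        double next[7];
        for (int lane = 0; lane < curr_size; lane++) {
          // shuffle_down: read the value held by lane (lane + offset)
          double remote =
              (lane + offset < curr_size) ? lanes[lane + offset] : 0.0;
          // version 1: lanes below 'offset' reduce, the others copy down
          next[lane] = (lane < offset) ? lanes[lane] + remote : remote;
        }
        for (int lane = 0; lane < curr_size; lane++)
          lanes[lane] = next[lane];
        curr_size = (curr_size + 1) / 2;
        offset = curr_size / 2;
      }
      printf("warp-reduced value in lane 0: %g\n", lanes[0]);   // prints 28
      return 0;
    }
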
3. Dispersed Partial Warp Reduction

    gpu_irregular_simt_reduce(void *reduce_data,
                              kmp_ShuffleReductFctPtr ShuffleReduceFn) {
      int size, remote_id;
      int logical_lane_id =
          find_number_of_dispersed_active_lanes_before_me() * 2;
      do {
        remote_id = find_the_next_active_lane_id_right_after_me();
        // the above function returns 0 if no active lane
        // is present right after the current thread.
        size = get_number_of_active_lanes_in_this_warp();
        logical_lane_id /= 2;
        ShuffleReduceFn(reduce_data, logical_lane_id,
                        remote_id-1-threadIdx.x, 2);
      } while (logical_lane_id % 2 == 0 && size > 1);
    }

There is no assumption made about the initial state of the reduction. Any
number of lanes (>= 1) could be active at any position. The reduction result
is kept in the first active lane.

When the specified version is 2, ShuffleReduceFn behaves, per element, as
follows:

    // reduce_elem refers to an element in the local ReduceData
    // remote_elem is retrieved from a remote lane
    remote_elem = shuffle_down(reduce_elem, offset, 32);
    if (lane_id % 2 == 0 && offset > 0) {
      reduce_elem = reduce_elem @ remote_elem;
    } else {
      reduce_elem = remote_elem;
    }

We now briefly explain some of the arguments passed in; a sketch of the
helper queries used above follows this list. In this section we introduce the
concept of the logical_lane_id, and it is important to distinguish it from
the physical lane_id as defined by NVIDIA.

1. //logical_lane_id//: as the name suggests, this is a calculated lane id
   (as opposed to the physical one defined by NVIDIA) chosen to keep our
   algorithm logically concise. A thread with logical_lane_id k has k active
   threads before it.
2. //remote_id-1-threadIdx.x//: remote_id identifies the remote lane from
   which we will retrieve the ReduceData (it is derived from the
   NVIDIA-defined physical lane id, and 0 signals that no active lane
   follows). We subtract (threadIdx.x+1) from it because we would like to
   maintain only one underlying shuffle intrinsic (which is used to
   communicate among lanes in a warp). The particular shuffle intrinsic we
   use accepts only offsets, not absolute lane ids, so the subtraction
   converts the computed id into an offset.

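The helper queries used in the pseudo code above can be derived from the
warp's active-lane mask. Below is a sketch assuming __activemask() (or,
equivalently, the older __ballot(1)) supplies that mask; the function names
mirror the pseudo code rather than the runtime's actual identifiers, and the
"next active lane id plus one" encoding is an inference from the
'remote_id-1-threadIdx.x' offset and the "returns 0 if none" convention.

    __device__ static unsigned my_lane_id() {
      return threadIdx.x % warpSize;
    }

    __device__ int find_number_of_dispersed_active_lanes_before_me() {
      unsigned mask = __activemask();
      // Count active lanes whose physical lane id is lower than mine.
      return __popc(mask & ((1u << my_lane_id()) - 1u));
    }

    __device__ int find_the_next_active_lane_id_right_after_me() {
      unsigned mask = __activemask();
      // Keep only the lanes strictly above mine.
      unsigned higher = mask & ~((2u << my_lane_id()) - 1u);
      // __ffs is 1-based, so this yields (next active lane id + 1),
      // and 0 when no active lane follows the current thread.
      return __ffs(higher);
    }

    __device__ int get_number_of_active_lanes_in_this_warp() {
      return __popc(__activemask());
    }
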
This algorithm differs from the previous ones in two ways and is not,
conceptually, a generalization of them.
1. It reduces elements that are close to each other. For instance, the value
   in the 0th lane is combined with that of the 1st lane, and the value in
   the 2nd lane is combined with that of the 3rd lane. We did not use the
   previous scheme, in which the first half of the (partial) warp is reduced
   with the second half, because the mapping
       f(x): logical_lane_id -> physical_lane_id
   can be easily calculated, whereas its inverse
       f^-1(x): physical_lane_id -> logical_lane_id
   cannot, and performing such a reduction requires the inverse to be known.
2. Because this algorithm is agnostic about the positions of the active
   lanes, we do not need to perform the copying step used in the second
   algorithm.

An illustrative run would look like:

    [figure: dispersed partial warp reduction]

As observed, the overhead is high because the logical_lane_id and the
remote_id have to be recalculated in each and every reduction step.

At the block level, we have implemented the following block reduction
algorithm:

    gpu_irregular_block_reduce(void *reduce_data,
                               kmp_ShuffleReductFctPtr shuflReduceFn,
                               kmp_InterWarpCopyFctPtr interWarpCpyFn,
                               int size) {
      int wid = threadIdx.x/WARPSIZE;
      int lane_id = threadIdx.x%WARPSIZE;

      int warp_needed = (size+WARPSIZE-1)/WARPSIZE; //ceiling of division

      unsigned tnum = __ballot(1);
      int thread_num = __popc(tnum);

      //full warp reduction
      if (thread_num == WARPSIZE) {
        gpu_regular_warp_reduce(reduce_data, shuflReduceFn);
      }
      //partial warp reduction
      if (thread_num < WARPSIZE) {
        gpu_irregular_warp_reduce(reduce_data, shuflReduceFn, thread_num,
                                  lane_id);
      }
      //Gather all the reduced values from each warp
      //to the first warp.
      //named_barrier inside this function to ensure
      //correctness. It is effectively a sync_thread
      //that won't deadlock.
      interWarpCpyFn(reduce_data, warp_needed);

      //This is to reduce data gathered from each "warp master".
      if (wid == 0) {
        gpu_irregular_warp_reduce(reduce_data, shuflReduceFn, warp_needed,
                                  lane_id);
      }
    }

In this function, no ShuffleReduceFn is called directly; instead, the
function calls the various versions of the warp-reduction routines. It first
reduces ReduceData warp by warp; at that point we end up with as many
ReduceData values as there are warps in the thread block. We then gather all
such ReduceData into the first warp.

As observed, this algorithm makes use of InterWarpCpyFn, which copies data
from each "warp master" (the 0th lane of each warp, where a warp-reduced
ReduceData is held) to the 0th warp. This step reduces (in the mathematical
sense) the problem of reduction across the warp masters in a block to the
problem of a warp reduction, which we already know how to solve.

We can thus completely avoid the use of atomics to reduce within a
threadblock.

**Efficient Cross Block Reduce**

The next challenge is to reduce values across threadblocks. We aim to do this
without atomics or critical sections.

Let a kernel be started with TB threadblocks.
Let the GPU have S SMs.
There can be at most N active threadblocks per SM at any time.

Consider a threadblock tb (tb < TB) running on SM s (s < S). 'tb' is one of
at most 'N' active threadblocks on SM s. Let each threadblock active on an SM
be given an instance identifier id (0 <= id < N). The tuple (s, id) therefore
uniquely identifies an active threadblock on the GPU.

To efficiently implement cross block reduce, we first allocate, for each
value to be reduced, an array of size S*N (the maximum number of
threadblocks that can be active on the device at any time).

Each threadblock reduces its value into slot [s][id]. This can be done
without locking since no other threadblock can write to the same slot
concurrently.

As a final stage, we reduce the values in the array as follows:

    // Compiler generated wrapper function for each target region with a
    // reduction clause.
    target_function_wrapper(map_args, reduction_array)  <--- start with 1 team
                                                             and 1 thread.
      // Use dynamic parallelism to launch M teams, N threads as requested by
      // the user to execute the target region.
      target_function<<<M, N>>>(map_args)

      Reduce values in reduction_array

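Because the wrapper runs with a single team and a single thread, the final
stage can be a simple sequential sweep over the staging slots once the child
grid completes. The sketch below is illustrative (the function name is not
part of the runtime) and assumes unused slots were initialized to the
identity of the reduction operator:

    __device__ double reduce_staging_array(const double *reduction_array,
                                           int num_slots /* S * N */) {
      double result = 0.0;                 // identity of '+'
      for (int slot = 0; slot < num_slots; slot++)
        result += reduction_array[slot];   // '+' is the reduction operator
      return result;
    }
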
**Comparison with Last Version**

The (simplified) pseudocode generated by LLVM on the host is as follows:

    1. Create private copies of variables: foo_p, bar_p
    2. Each thread reduces the chunk of A and B assigned to it and writes
       to foo_p and bar_p respectively.
    3. ret = kmpc_reduce_nowait(..., reduceData, reduceFn, lock)
       where:
         struct ReduceData {
           double *foo;
           double *bar;
         } reduceData;
         reduceData.foo = &foo_p;
         reduceData.bar = &bar_p;

reduceFn is a pointer to a function that takes two inputs of type ReduceData,
"reduces" them element-wise, and places the result in the first input:

    reduceFn(ReduceData *a, ReduceData *b)

Every thread in the parallel region calls kmpc_reduce_nowait with its private
copy of reduceData. The runtime reduces across the threads (using a tree
reduction over the operator 'reduceFn') and, if successful, stores the final
result in the master thread.

    4. if ret == 1:
         The master thread stores the reduced result in the globals:
         foo += reduceData.foo; bar += reduceData.bar

    5. else if ret == 2:
         In this case kmpc_reduce_nowait() could not use tree reduction,
         so it uses atomics instead:
           each thread atomically writes to foo
           each thread atomically writes to bar

On a GPU, a similar reduction may need to be performed across SIMT threads,
warps, and threadblocks. The challenge is to do so efficiently in a fashion
that is compatible with the LLVM OpenMP implementation.

In the previously released 0.1 version of the LLVM OpenMP compiler for GPUs,
the salient steps of the code generated are as follows:

    1. Create private copies of variables: foo_p, bar_p
    2. Each thread reduces the chunk of A and B assigned to it and writes
       to foo_p and bar_p respectively.
    3. ret = kmpc_reduce_nowait(..., reduceData, reduceFn, lock)
         status = can_block_reduce()
         if status == 1:
           reduce efficiently to thread 0 using shuffles and shared memory.
           return 1
         else:
           cannot use efficient block reduction, fall back to atomics
           return 2
    4. if ret == 1:
         The master thread stores the reduced result in the globals:
         foo += reduceData.foo; bar += reduceData.bar
    5. else if ret == 2:
         In this case kmpc_reduce_nowait() could not use tree reduction,
         so it uses atomics instead:
           each thread atomically writes to foo
           each thread atomically writes to bar

The function can_block_reduce() is defined as follows:

    int32_t can_block_reduce() {
      int tid = GetThreadIdInTeam();
      int nt = GetNumberOfOmpThreads(tid);
      if (nt != blockDim.x)
        return 0;
      unsigned tnum = __ballot(1);
      if (tnum != (~0x0)) {
        return 0;
      }
      return 1;
    }

This function permits the use of the efficient block reduction algorithm
using shuffles and shared memory (returning 1) only if (a) all SIMT threads
in a warp are active (i.e., the number of threads in the parallel region is a
multiple of 32) and (b) the number of threads in the parallel region (set by
the num_threads clause) equals blockDim.x.

If either of these preconditions does not hold, each thread in the
threadblock updates the global value using atomics.

Atomics and compare-and-swap operations are expensive on many-threaded
architectures such as GPUs, and we must avoid them completely.

**Appendix: Implementation Details**

    // Compiler generated function.
    reduceFn(ReduceData *a, ReduceData *b)
      a->foo = a->foo + b->foo
      a->bar = a->bar + b->bar

    // Compiler generated function.
    swapAndReduceFn(ReduceData *thread_private, int lane)
      ReduceData *remote = new ReduceData()
      remote->foo = shuffle_double(thread_private->foo, lane)
      remote->bar = shuffle_double(thread_private->bar, lane)
      reduceFn(thread_private, remote)

    // OMP runtime function.
    warpReduce_regular(ReduceData *thread_private, Fn *swapAndReduceFn):
      offset = 16
      while (offset > 0)
        swapAndReduceFn(thread_private, offset)
        offset /= 2

    // OMP runtime function.
    warpReduce_irregular():
      ...

    // OMP runtime function.
    kmpc_reduce_warp(reduceData, swapAndReduceFn)
      if all lanes in the warp are active:
        warpReduce_regular(reduceData, swapAndReduceFn)
      else:
        warpReduce_irregular(reduceData, swapAndReduceFn)
      if in_simd_region:
        // all done, reduce to global in simd lane 0
        return 1
      else if in_parallel_region:
        // done reducing to one value per warp, now reduce across warps
        return 2

    // OMP runtime function; one for each basic type.
    kmpc_reduce_block_double(double *a)
      if lane == 0:
        shared[wid] = *a
      named_barrier(1, num_threads)
      if wid == 0:
        reduce the values in shared[] into *a using warp shuffles
      named_barrier(1, num_threads)
      if wid == 0 and lane == 0:
        return 1 // write back reduced result
      else:
        return 0 // don't do anything

    // Compiler generated code.
    1. Create private copies of variables: foo_p, bar_p
    2. Each thread reduces the chunk of A and B assigned to it and writes
       to foo_p and bar_p respectively.
    3. ret = kmpc_reduce_warp(reduceData, swapAndReduceFn)
    4. if ret == 1:
         The master thread stores the reduced result in the globals:
         foo += reduceData.foo; bar += reduceData.bar
    5. else if ret == 2:
         ret = block_reduce_double(reduceData.foo)
         if ret == 1:
           foo += reduceData.foo
         ret = block_reduce_double(reduceData.bar)
         if ret == 1:
           bar += reduceData.bar

Some final remarks on this scheme:

1. This scheme requires that the CUDA OMP runtime can call LLVM-generated
   functions. This functionality now works.
2. If the user inlines the CUDA OMP runtime bitcode, all of the machinery
   (including calls through function pointers) is optimized away.
3. If we are reducing multiple variables in a parallel region, the reduce
   operations are all performed in warpReduce_[ir]regular(). This results in
   more independent instructions in the loop and should therefore lead to
   fewer stalls due to data dependencies. Unfortunately we cannot do the same
   in kmpc_reduce_block_double() without increasing shared memory usage.