1 ==============================
2 Convergent Operation Semantics
3 ==============================
12 Some parallel execution environments execute threads in groups that allow
13 efficient communication within the group using special primitives called
14 *convergent* operations. The outcome of a convergent operation is sensitive to
15 the set of threads that executes it "together", i.e., convergently. When control
16 flow :ref:`diverges <convergence-and-uniformity>`, i.e. threads of the same
17 group follow different
18 paths through the CFG, not all threads of the group may be available to
19 participate in this communication. This is the defining characteristic that
20 distinguishes convergent operations from other inter-thread communication:
22 A convergent operation involves inter-thread communication or synchronization
23 that occurs outside of the memory model, where the set of threads which
24 participate in communication is implicitly affected by control flow.
26 For example, in the following GPU compute kernel, communication during the
27 convergent operation is expected to occur precisely among those threads of an
28 implementation-defined execution scope (such as workgroup or subgroup) for
29 which ``condition`` is true:
33 void example_kernel() {
36 convergent_operation();
40 In structured programming languages, there is often an intuitive and
41 unambiguous way of determining the threads that are expected to communicate.
42 However, this is not always the case even in structured programming languages,
43 and the intuition breaks down entirely in unstructured control flow. This
44 document describes the formal semantics in LLVM, i.e. how to determine the set
45 of communicating threads for convergent operations.
47 The definitions in this document leave many details open, such as how groups of
48 threads are formed in the first place. It focuses on the questions that are
49 relevant for deciding the correctness of generic program transforms and
50 convergence-related analyses such as :ref:`uniformity analysis
51 <convergence-and-uniformity>`.
53 .. _convergent_operations:
58 In LLVM IR, the only way to communicate between threads as described
59 above is by calling target-defined convergent intrinsics. Hence, only
60 a call-site in LLVM IR (a :ref:`call <i_call>`, :ref:`invoke
61 <i_invoke>`, or :ref:`callbr <i_callbr>` instruction) can result in a
64 A function in LLVM IR is said to be *convergent* if it has the
65 :ref:`convergent <attr_convergent>` attribute.
67 A call-site in LLVM IR is said to be *convergent* if it is a direct
68 call to a convergent function or it has the :ref:`convergent
69 <attr_convergent>` attribute or a :ref:`convergencectrl operand bundle
74 A function may have to be treated as convergent if that function, or
75 transitively, any function called from it, contains a convergent call-site. A
76 frontend generating the ``convergent`` attribute should take this into account
77 when emitting functions and function calls. But this is not always the case:
79 A non-convergent function may contain convergent operations; such operations
80 do not directly depend on the set of threads that enter the function as a
81 single communicating group. Instead, these operations depend on an
82 implementation-defined subset of threads within the body of the function, as
83 shown in :ref:`opportunistic_convergence`.
85 Examples of Convergent Operations
86 ========================================
88 (This section is informative.)
90 Texture sampling in a pixel shader
91 ----------------------------------
93 The following stylized pixel shader samples a texture at a given set of
94 coordinates, using the builtin function `textureSample`. Texture sampling
95 requires screen-space derivatives of the coordinates to determine the level of
96 detail (mipmap level) of the sample. They are commonly approximated by taking
97 the difference between neighboring pixels, which are computed by different
98 threads in the same group:
102 void example_shader() {
104 color = textureSample(texture, coordinates);
111 From a purely single-threaded perspective, sinking the `textureSample` into
112 the if-statement appears legal. However, if the condition is false for some
113 neighboring pixels, then their corresponding threads will not execute together
114 in the group, making it impossible to take the difference of coordinates as an
115 approximation of the screen-space derivative. In practice, the outcome will be
118 That is, the `textureSample` operation fits our definition of a convergent
121 1. It communicates with a set of threads that implicitly depends on control
123 2. Correctness depends on this set of threads.
125 The compiler frontend can emit IR that expresses the convergence constraints as
130 define void @example_shader() convergent {
131 %entry = call token @llvm.experimental.convergence.entry()
133 %color = call T @textureSample(U %texture, V %coordinates) [ "convergencectrl"(token %entry) ]
134 br i1 %condition, label %then, label %end
137 call void @use(T %color)
144 The :ref:`llvm.experimental.convergence.entry <llvm.experimental.convergence.entry>`
145 intrinsic is itself ``convergent``, and we expect it to communicate at least
146 among all threads of the same "quad" -- a group of 2x2 pixels that are
147 evaluated together for the purpose of approximating screen-space derivatives.
148 This fact is not part of the generic LLVM IR semantics; it would have to be
149 defined somewhere else, for example as part of target-specific ABI definitions
150 and/or in reference to some relevant API specs.
152 Since the ``@textureSample`` call then uses the token produced by the entry
153 intrinsic in its ``convergencectrl`` bundle, and has no additional control
154 dependencies, it must communicate among the same set of threads. This indicates
155 to generic program transforms that sinking the ``@textureSample`` call is
156 forbidden. (A program transform can still sink the call if it can prove somehow,
157 e.g. by leaning on target-specific callbacks that can analyze the program with
158 additional knowledge, that ``%condition`` is always uniform across the threads
159 referenced by the *convergence token* ``%entry``.)
161 .. _convergence_example_reductions:
163 Reductions inside divergent control flow
164 ----------------------------------------
166 The following example shows that merging common code of branches can be
167 incorrect in the face of convergent operations:
171 void example_kernel() {
174 total_gains = subgroupAdd(delta);
177 total_losses = subgroupAdd(delta);
182 The ``subgroupAdd`` computing the ``total_gains`` will be executed by the
183 subset of threads with positive ``delta`` in a subgroup (wave), and so will sum
184 up all the ``delta`` values of those threads; and similarly for the
185 ``subgroupAdd`` that computes the ``total_losses``.
187 If we were to hoist and merge the ``subgroupAdd`` above the if-statement, it
188 would sum up the ``delta`` across *all* threads instead.
190 The compiler frontend can emit IR that expresses the convergence constraints
195 define void @example_kernel() convergent {
196 %entry = call token @llvm.experimental.convergence.entry()
198 %cc = icmp sgt i32 %delta, 0
199 br i1 %cc, label %then, label %else
202 %total_gains = call i32 @subgroupAdd(i32 %delta) [ "convergencectrl"(token %entry) ]
207 %total_losses = call i32 @subgroupAdd(i32 %delta) [ "convergencectrl"(token %entry) ]
215 The entry intrinsic behaves like in the previous example: assuming that
216 ``@example_kernel`` is an OpenCL kernel (as hinted at by the "subgroup"
217 terminology), we expect it to communicate among all threads within the
218 "subgroup". This typically maps to a SIMD vector on GPU hardware.
220 The calls to ``@subgroupAdd`` use the token produced by the entry intrinsic,
221 but they also have an additional control dependency. According to the rules
222 defined in this document, they only communicate among the subset of threads
223 that actually end up executing the respective (static) call site.
225 Hoisting them would remove the control dependency and cause them to communicate
226 among the full set of threads that the entry intrinsic communicated with.
227 Again, hoisting is allowed if it can be proven that ``%cc`` is always uniform
228 among the relevant set of threads: in that case, the ``@subgroupAdd`` already
229 communicates among the full set of threads in the original program.
231 Motivating Examples of Convergence Control
232 ==========================================
234 (This section is informative.)
236 Unstructured control flow
237 -------------------------
239 Consider an example of how jump threading removes structure in a way that can
240 make semantics non-obvious without the convergence intrinsics described in this
245 void example_original() {
248 br i1 %cond1, label %then1, label %mid
256 %flag = phi i1 [ true, %entry ], [ %cond2, %then1 ]
257 br i1 %flag, label %then2, label %end
261 call void @subgroupControlBarrier()
268 void example_jumpthreaded() {
271 br i1 %cond1, label %then1, label %then2
276 br i1 %cond2, label %then2, label %end
280 call void @subgroupControlBarrier()
287 Is the control barrier guaranteed to synchronize among the same set of threads
288 in both cases? Different implementations in the literature may give different
289 answers to this question:
291 * In an implementation that reconverges at post-dominators, threads reconverge
292 at ``mid`` in the first version, so that all threads (within a subgroup/wave)
293 that execute the control barrier do so together. In the second version,
294 threads that reach the control barrier via different paths synchronize
295 separately: the first (and only) post-dominator is ``end``, so threads do not
296 reconverge before then.
298 * An implementation that sorts basic blocks topologically and ensures maximal
299 reconvergence for each basic block would behave the same way in both
302 We generally take the stance that reconvergence in acyclic control flow must
303 be maximal. The compiler frontend could augment the original code as follows:
307 define void @example_original() convergent {
309 %entry = call token @llvm.experimental.convergence.entry()
311 br i1 %cond1, label %then1, label %mid
319 %flag = phi i1 [ true, %entry ], [ %cond2, %then1 ]
320 br i1 %flag, label %then2, label %end
324 call void @subgroupControlBarrier() [ "convergencectrl"(token %entry) ]
331 If S is the set of threads that the entry intrinsic communicated with, then
332 the ``@subgroupControlBarrier`` call communicates with the subset of S that
333 actually reaches the call site. This set of threads doesn't change after
334 jump-threading, so the answer to the question posed above remains the same.
336 .. _opportunistic_convergence:
338 Opportunistic convergent operations
339 -----------------------------------
341 Some programs have local regions of code that contain a sequence of convergent
342 operations where the code does not care about the exact set of threads with
343 which it is executed, but only that the set of threads is the same for all the
344 operations within the sequence. (If a subset of the convergent operations in the
345 sequence have additional, non-uniform control dependencies, then this is not
346 possible. However, the code may still require that the sets of threads are
347 logically consistent with the conditions of those control dependencies.) In this
348 case, :ref:`llvm.experimental.convergence.anchor
349 <llvm.experimental.convergence.anchor>` can be used to express the desired
352 The following example function could be part of a hypothetical "append buffer"
353 implementation, where threads conditionally write fixed-sized records
354 contiguously into a global buffer. The function ``@reserveSpaceInBuffer``
355 returns the index into the buffer at which the calling thread should store its
358 This could be achieved by using a simple atomic operation in every thread to
359 bump an allocation counter.
361 However, the following implementation can be more performant on some hardware,
362 because it uses only a single atomic operation for an entire group of threads.
363 To do this, it first determines the total size of the group, which will be the
364 operand to the atomic operation, and then later broadcasts the result of the
365 atomic operation to all threads of the group, so that each thread can compute
366 its individual position in the buffer:
370 define i32 @reserveSpaceInBuffer() { ; NOTE: _not_ a convergent function!
372 %anchor = call token @llvm.experimental.convergence.anchor()
374 %ballot = call i64 @subgroupBallot(i1 true) [ "convergencectrl"(token %anchor) ]
375 %numThreads.p = call i64 @llvm.ctpop.i64(i64 %ballot)
376 %numThreads = trunc i64 %numThreads.p to i32
378 %absoluteThreadIdx = call i32 @getSubgroupLocalInvocationId()
379 %absoluteThreadIdx.ext = zext i32 %absoluteThreadIdx to i64
380 %mask.p = shl i64 1, %absoluteThreadIdx.ext
381 %mask = sub i64 %mask.p, 1
383 %maskedBallot = and i64 %ballot, %mask
384 %relativeThreadIdx.p = call i64 @llvm.ctpop.i64(i64 %maskedBallot)
385 %relativeThreadIdx = trunc i64 %relativeThreadIdx.p to i32
387 %isFirstThread = icmp eq i32 %relativeThreadIdx, 0
388 br i1 %isFirstThread, label %then, label %end
391 %baseOffset.1 = atomicrmw add ptr @bufferAllocationCount, i32 %numThreads monotonic
395 %baseOffset.2 = phi i32 [ undef, %entry ], [ %baseOffset.1, %then ]
396 %baseOffset = call i32 @subgroupBroadcastFirst(i32 %baseOffset.2) [ "convergencectrl"(token %anchor) ]
397 %offset = add i32 %baseOffset, %relativeThreadIdx
401 The key here is that the function really doesn't care which set of threads it
402 is being called with. It takes whatever set of threads it can get. What the
403 implementation of the function cares about is that the initial
404 ``@subgroupBallot`` -- which is used to retrieve the bitmask of threads that
405 executed the anchor together -- executes with the same set of threads as the
406 final ``@subgroupBroadcastFirst``. Nothing else is required for correctness as
407 far as convergence is concerned.
409 The function ``@reserveSpaceInBuffer`` itself is _not_ ``convergent``: callers
410 are free to move call sites of the function as they see fit. This can change
411 the behavior in practice, by changing the sets of threads that are grouped
412 together for the atomic operation. This can be visible in the output of the
413 program, since the order in which outputs appear in the buffer is changed.
414 However, this does not break the overall contract that ``@reserveSpaceInBuffer``
415 has with its caller -- which makes sense: the order of outputs is
416 non-deterministic anyway because of the atomic operation that is involved.
418 If the function is inlined, the use of the anchor intrinsic similarly indicates
419 that certain transforms which are usually forbidden by the presence of
420 convergent operations are in fact allowed, as long as they don't break up the
421 region of code that is controlled by the anchor.
423 .. _convergence_high-level_break:
425 Extended Cycles: Divergent Exit from a Loop
426 -------------------------------------------
428 High-level languages typically provide a ``break`` statement that transfers
429 control out of a loop statement. In most cases, the loop is structured and hence
430 there is no ambiguity about convergence inside the loop. But an ambiguity arises
431 when a ``break`` is control dependent on a divergent condition inside the loop.
432 Consider the following example:
441 if (condition) { // divergent condition
452 In this program, the call to convergent_op() is lexically "inside" the ``for``
453 loop. But when translated to LLVM IR, the basic block B is an exiting block
454 ending in a divergent branch, and the basic block C is an exit of the loop.
455 Thus, the call to convergent_op() is outside the loop. This causes a mismatch
456 between the programmer's expectation and the compiled program. The call should
457 be executed convergently on every iteration of the loop, by threads that
458 together take the branch to exit the loop. But when compiled, all threads that
459 take the divergent exit on different iterations first converge at the beginning
460 of basic block C and then together execute the call to convergent_op().
462 In this case, :ref:`llvm.experimental.convergence.loop
463 <llvm.experimental.convergence.loop>` can be used to express the desired
464 semantics. A call to this intrinsic is placed in the loop header, which tracks
465 each iteration of the loop. The token produced by this is used as a
466 ``convergencectrl`` operand to the convergent call. The semantics of the
467 ``loop`` intrinsic ensures that the convergent call is performed convergently
468 only by those threads that convergently exited the loop in a given iteration.
472 define void @example() convergent {
473 %entry = call token @llvm.experimental.convergence.entry()
477 %inner = call token @llvm.experimental.convergence.loop() ["convergencectrl"(token %entry)]
479 br i1 %for.cond, label %B, label %E
484 br i1 %condition, label %C, label %D
487 call void @convergent_op() ["convergencectrl"(token %inner)]
499 The LLVM IR version of the same program shows a cycle consisting of the basic
500 blocks ``%for``, ``%B`` and ``%D``, while ``%C`` is an exit reached by the
501 divergent branch at the end of the exiting block ``%B``. But the use of
502 convergence control tokens makes it clear that block ``%C`` must be executed
503 convergently only by those threads that convergently take the exit edge from %B
504 to ``%C``. In other words, the convergent execution of ``%C`` is governed by the
505 call to the :ref:`llvm.experimental.convergence.loop
506 <llvm.experimental.convergence.loop>` intrinsic inside the cycle. The cycle is
507 effectively extended to include all uses of this token that lie outside the
510 .. _dynamic_instances_and_convergence_tokens:
512 Dynamic Instances and Convergence Tokens
513 ========================================
515 Every execution of an LLVM IR instruction occurs in a :ref:`dynamic instance
516 <convergence-dynamic-instances>` of the instruction. Dynamic instances are the
517 formal objects by which we talk about communicating threads in convergent
518 operations. Dynamic instances are defined for *all* operations in an LLVM
519 program, whether convergent or not. Convergence control is primarily about the
520 dynamic instances of convergent operations since they affect execution of the
521 program through inter-thread communication. The dynamic instances for
522 non-convergent operations are relevant for determining :ref:`uniformity
523 <convergence-and-uniformity>` of values.
525 Dynamic instances produced by the execution of the same *convergent operation*
526 by different threads may be :ref:`converged <convergence-definition>`. When
527 executing a convergent operation, the set of threads that execute converged
528 dynamic instances is the set of threads that communicate with each other.
529 *Convergence tokens* capture this convergence as described below.
531 *Convergence tokens* are values of ``token`` type, i.e. they cannot be used in
532 ``phi`` or ``select`` instructions. A convergence token value represents the
533 dynamic instance of the instruction that produced it.
535 Convergent operations may have an optional ``convergencectrl`` operand bundle with
536 a convergence token operand to define the set of communicating threads relative
537 to the operation that defined the token.
539 Let ``U`` be a convergent operation other than a call to a convergence
540 control intrinsic, and ``D`` be the convergent operation that defines
541 the token value used as the ``convergencectrl`` operand to ``U``. Two
542 threads execute converged dynamic instances of ``U`` if and only if the
543 token value in both threads was returned by converged dynamic
548 The text defines convergence token values as representing dynamic instances.
549 But if we were to assume that converged dynamic instances produce the same
550 token value, then we could almost think of the token value as representing a
551 set of threads instead -- specifically, the set ``S`` of threads that
552 executed converged dynamic instances of the defining instruction ``D``.
554 In this intuitive picture, when a convergence token value ``T`` is used by a
555 ``convergencectrl`` bundle on an instruction ``I``, then the set of threads that
556 communicates in ``I`` is a subset of the set ``S`` represented by the token value.
557 Specifically, it is the subset of threads that ends up executing ``I`` while
558 using the token value.
560 This by itself wouldn't quite work as a definition: what if ``I`` is executed
561 multiple times by the same threads? Which execution of ``I`` in thread 1
562 communicates with which execution of ``I`` in thread 2? Leaning on the notion
563 of dynamic instances gives a robust answer to this question as long as ``D``
564 and ``I`` are at the same loop (or cycle) nesting level.
566 The case where ``D`` and ``I`` are at different loop nesting levels is
567 forbidden by the :ref:`static rules <convergence_static_rules>` -- handling
568 that case is the purpose of :ref:`llvm.experimental.convergence.loop
569 <llvm.experimental.convergence.loop>`.
571 .. _convergence_control_intrinsics:
573 Convergence Control Intrinsics
574 ==============================
576 This section describes target-independent intrinsics that can be used to
577 produce convergence tokens.
579 Behaviour is undefined if a convergence control intrinsic is called
582 .. _llvm.experimental.convergence.entry:
584 ``llvm.experimental.convergence.entry``
585 ----------------------------------------
589 token @llvm.experimental.convergence.entry() convergent readnone
591 This intrinsic is used to tie the dynamic instances inside of a function to
594 1. If the function is called from outside the scope of LLVM, the convergence of
595 dynamic instances of this intrinsic are environment-defined. For example:
597 a. In an OpenCL *kernel launch*, the maximal set of threads that
598 can communicate outside the memory model is a *workgroup*.
599 Hence, a suitable choice is to specify that all the threads from
600 a single workgroup in OpenCL execute converged dynamic instances
602 b. In a C/C++ program, threads are launched independently and they can
603 communicate only through the memory model. Hence the dynamic instances of
604 this intrinsic in a C/C++ program are never converged.
605 2. If the function is called from a call-site in LLVM IR, then two
606 threads execute converged dynamic instances of this intrinsic if and
607 only if both threads entered the function by executing converged
608 dynamic instances of the call-site.
610 This intrinsic can occur at most once in a function, and only in the the entry
611 block of the function. If this intrinsic occurs in a basic block, then it must
612 precede any other convergent operation in the same basic block.
614 It is an error if this intrinsic appears in a non-convergent function.
616 It is an error to specify a ``convergencectrl`` operand bundle at a
617 call to this intrinsic.
619 Function inlining substitutes this intrinsic with the token from the operand
626 void callee() convergent {
627 %tok = call token @llvm.experimental.convergence.entry()
628 convergent_operation(...) [ "convergencectrl"(token %tok) ]
632 %outer = call token @llvm.experimental.convergence.anchor()
634 %inner = call token @llvm.experimental.convergence.loop() [ "convergencectrl"(token %outer) ]
635 callee() [ "convergencectrl"(token %inner) ]
642 %outer = call token @llvm.experimental.convergence.anchor()
644 %inner = call token @llvm.experimental.convergence.loop() [ "convergencectrl"(token %outer) ]
645 convergent_operation(...) [ "convergencectrl"(token %inner) ]
649 .. _llvm.experimental.convergence.loop:
651 ``llvm.experimental.convergence.loop``
652 --------------------------------------
656 token @llvm.experimental.convergence.loop() [ "convergencectrl"(token) ] convergent readnone
658 This intrinsic represents the place where an imaginary counter is incremented
659 for determining convergence inside a control flow cycle.
661 Let ``U`` be a call to this intrinsic and ``D`` be the convergent operation that
662 defines the token value used as the ``convergencectrl`` operand to ``U``. Two
663 threads execute converged dynamic instances of ``U`` if and only if:
665 1. The token value in both threads was returned by converged dynamic
666 instances of ``D``, and,
667 2. There is an integer *n* such that both threads execute ``U`` for the *n*'th time
668 with that token value.
670 It is an error to omit the ``convergencectrl`` operand bundle on a
671 call to this intrinsic.
673 If this intrinsic occurs in a basic block, then it must precede any other
674 convergent operation in the same basic block.
676 .. _convergence_cycle_heart:
678 **Heart of a Cycle:**
680 If a :ref:`cycle <cycle-terminology>` ``C`` contains an occurrence ``H`` of
681 this intrinsic whose token operand is defined outside ``C``, then ``H`` is
682 called the heart of ``C``.
686 The static rules for cycles imply that a heart can occur only in the header
687 of a natural loop. This ensures that the heart closely represents the
688 intuitive notion of a loop iteration. If this restriction is relaxed, the
689 resulting semantics provides a new notion of "cycle iteration" even for
690 irreducible cycles. But this allows a natural loop to have a heart in a
691 node other than its header, which has interesting consequences on the
692 meaning of a loop iteration in terms of convergence. For now, we disallow
693 this situation since its practical application is very rare.
695 .. _llvm.experimental.convergence.anchor:
697 ``llvm.experimental.convergence.anchor``
698 ----------------------------------------
702 token @llvm.experimental.convergence.anchor() convergent readnone
704 This intrinsic produces an initial convergence token that is independent from
705 any "outer scope". The set of threads executing converged dynamic instances of
706 this intrinsic is implementation-defined.
708 It is an error to pass a ``convergencectrl`` operand bundle at a
709 call to this intrinsic.
713 The expectation is that all threads within a group that "happen to be active
714 at the same time" will execute converged dynamic instances, so that programs
715 can detect the maximal set of threads that can communicate efficiently within
716 some local region of the program.
718 .. _convergence_uncontrolled:
720 Uncontrolled Convergent Operations
721 ==================================
723 Convergent operations with an explicit ``convergencectrl`` operand bundle are
724 called *controlled convergent operations*. All other convergent operations are
725 said to be *uncontrolled*.
727 An uncontrolled convergent operation is said to have *implicit convergence
728 control* determined by the ``convergent`` attribute alone. The semantics of the
729 ``convergent`` attribute as implemented in LLVM differs from the documented
730 semantics. The implementation tries to follow common intuition about convergent
731 operations, which remains under-specified. As such, it is not possible to fully
732 translate implicit convergence control into explicit convergence control tokens,
733 and these two modes cannot be mixed in the same function.
735 If a function contains a controlled convergent operation, then all convergent
736 operations in that function must either be controlled operations or calls to
737 the convergence control intrinsics.
742 (This section is informational)
744 Sometimes, it may be necessary to reinterpret the implicit convergence control
745 in terms of explicit convergence control tokens. For example, this may happen
746 when a function call is inlined, and either the caller or the callee contains
747 uncontrolled convergent operations.
749 Some uses of uncontrolled convergent operations may need to satisfy the
752 For an environment-defined group of threads (such as an OpenCL workgroup or
753 subgroup), if one thread in the group executes a convergent operation, then
754 all threads in the group do so convergently with that thread.
756 In terms of explicit convergence control, this means that the
757 ``convergencectrl`` operand on each convergent operation ``X`` must ultimately
758 originate from a call to the :ref:`llvm.experimental.convergence.entry
759 <llvm.experimental.convergence.entry>` intrinsic. This preserves the possibility
760 that the group of threads that converge on reaching ``X`` is the same group that
761 originally started executing the program in convergence. In comparison, the
762 :ref:`llvm.experimental.convergence.anchor
763 <llvm.experimental.convergence.anchor>` intrinsic captures an
764 implementation-defined group of threads, which is insufficient to support the
767 One way to approximate implicit convergence control in terms of explicit
768 convergence control tokens is the following procedure, which preserves the above
771 1. Convert every irreducible cycle into a reducible cycle.
772 2. Insert a call to :ref:`llvm.experimental.convergence.entry
773 <llvm.experimental.convergence.entry>` at the start of the entry block of the
775 3. Insert a call to :ref:`llvm.experimental.convergence.loop
776 <llvm.experimental.convergence.loop>` at the start of every loop header. If
777 this loop is an outermost loop, the ``convergencectrl`` operand is the call
778 to :ref:`llvm.experimental.convergence.entry
779 <llvm.experimental.convergence.entry>` in the entry block of the function.
780 Otherwise, the ``convergencectrl`` operand is the call to
781 :ref:`llvm.experimental.convergence.loop
782 <llvm.experimental.convergence.loop>` in the parent loop's header.
783 4. For each uncontrolled convergent operation ``X``, add a ``convergencectrl``
784 operand bundle using the token defined by a definition ``D`` that is a
785 :ref:`sibling <cycle-sibling>` to this operation. ``D`` always dominates
786 ``X`` --- if ``X`` is not in any cycle, then ``D`` is a call to
787 :ref:`llvm.experimental.convergence.entry
788 <llvm.experimental.convergence.entry>`; otherwise ``D`` is the heart of the
789 parent cycle of ``X``.
791 .. _convergence_static_rules:
796 A *well-formed* program in LLVM IR must satisfy the following static
797 rules about cycles and convergence regions.
802 A :ref:`closed path <cycle-closed-path>` in a CFG is a connected sequence of
803 nodes and edges in the CFG whose start and end points are the same.
805 1. Every closed path in the CFG that contains a use of a convergence token T other
807 :ref:`llvm.experimental.convergence.loop <llvm.experimental.convergence.loop>`
808 must also contain the definition of T.
810 2. Every closed path in the CFG that contains two different uses of a convergence
811 token T must also contain the definition of T.
813 3. Every closed path in the CFG that contains uses of two different convergence tokens
814 T1 and T2 must also contain the definition of at least one of them.
816 Taken together, these rules imply that for every closed path C, there can be at most
817 one convergence token T which is used in C but defined outside of it, and that
818 T can be used only once in C, and only by ``llvm.experimental.convergence.loop``.
820 4. In every closed path that contains a use U of a token T but not the
821 definition of T, U must dominate all nodes in the closed path.
823 This implies that ``llvm.experimental.convergence.loop`` can appear as a heart
824 only in the header of a natural loop.
826 **Sufficient Conditions:** From the :ref:`properties of cycles
827 <cycle-closed-path>`, it is sufficient to prove the above properties
828 for cycles instead of closed paths. Briefly, any closed path that violates
829 one or more of the above static rules is contained in a cycle that also
830 violates the same rule(s).
832 .. _convergence_region:
837 The *convergence region* of a convergence token T is the minimal region in
838 which T is live and used, i.e., the set of program points dominated by the
839 definition D of T from which a use of T can be reached.
841 The following static rule about convergence regions must be satisfied by
844 If a convergence region R for a token T1 contains a use of a convergence
845 token T2, then R must also contain the definition of T2. (In other words,
846 convergence regions must be reasonably nested.)
850 For brevity, this document uses the term "convergence region of a token
851 definition ``D``" to actually refer to the convergence region of the token
852 ``T`` defined by ``D``.
854 .. _inferring_noconvergent:
856 Inferring non-convergence
857 =========================
859 When the target or the environment guarantees that threads do not
860 communicate using convergent operations or that threads never diverge,
861 the dynamic instances in the program are irrelevant and an optimizer
862 may remove any occurrence of the ``convergent`` attribute on a
863 call-site or a function and any explicit ``convergencectrl`` operand
864 bundle at a call-site.
866 An optimizer may remove the ``convergent`` attribute and any explicit
867 ``convergencectrl`` operand bundle from a call-site if it can prove
868 that the execution of this call-site always results in a call to a
869 non-convergent function.
871 An optimizer may remove the ``convergent`` attribute on a function if it can
872 prove that the function does not contain a call to
873 :ref:`llvm.experimental.convergence.entry
874 <llvm.experimental.convergence.entry>`, or any uncontrolled convergent
877 Memory Model Non-Interaction
878 ============================
880 The fact that an operation is convergent has no effect on how it is treated for
881 memory model purposes. In particular, an operation that is ``convergent`` and
882 ``readnone`` does not introduce additional ordering constraints as far as the
883 memory model is concerned. There is no implied barrier, neither in the memory
884 barrier sense nor in the control barrier sense of synchronizing the execution
887 Informational note: Threads that execute converged dynamic instances do not
888 necessarily do so at the same time.
894 A function can be both ``convergent`` and
895 ``speculatable``, indicating that the function does not have undefined
896 behavior and has no effects besides calculating its result, but is still
897 affected by the set of threads executing this function. This typically
898 prevents speculation of calls to the function unless the constraint imposed
899 by ``convergent`` is further relaxed by some other means.
901 Controlled Maximal Convergence
902 ==============================
904 The :ref:`converged-with relation <convergence-definition>` over dynamic
905 instances of each controlled convergent operation is completely defined by the
906 semantics of convergence tokens. But the implementation-defined convergence at a
907 call to :ref:`llvm.experimental.convergence.anchor
908 <llvm.experimental.convergence.anchor>` also depends on the cycle hierarchy
909 chosen if it occurs inside an irreducible cycle.
911 When the token defined by a convergent operation ``D`` is used at another
912 convergent operation ``U``, the implementation must ensure that the threads that
913 converge at ``U`` are all the threads that reached ``U`` after converging at
914 ``D``. On most implementations, it is reasonable to assume that only these
915 threads are converged at every node they reach on any path from ``D`` to ``U``.
916 In other words, the converged-with relation at ``D`` produces groups of threads
917 that can converge only within each group, while inside the convergence region of
920 All this affects the :ref:`maximal converged-with relation
921 <convergence-maximal>` over dynamic instances and in turn the :ref:`m-converged
922 property <uniformity-analysis>` of static instances in the convergence region of
925 .. _controlled_maximal_converged_with:
927 **Controlled Maximal converged-with Relation**
929 1. Dynamic instances of a *convergent operation* are related in the controlled
930 maximal converged-with relation according to the semantics of the convergence
932 2. Dynamic instances ``X1`` and ``X2`` produced by different threads for the
933 same *non-convergent operation* ``X`` are related in the controlled maximal
934 converged-with relation if and only if:
936 1. Both threads executed converged dynamic instances of every token
937 definition ``D`` such that ``X`` is in the convergence region of ``D``,
939 2. For every cycle ``C`` with header ``H`` that contains ``X``:
941 - every dynamic instance ``H1`` of ``H`` that precedes ``X1`` in the
942 respective thread is convergence-before ``X2``, and,
943 - every dynamic instance ``H2`` of ``H`` that precedes ``X2`` in the
944 respective thread is convergence-before ``X1``,
945 - without assuming that ``X1`` is converged with ``X2``.
947 .. _controlled_m_converged:
949 **Controlled m-converged Static Instances**
951 A node ``X`` in a given CFG is reported to be m-converged if and only if:
953 1. For any token definition ``D`` such that ``X`` is inside the convergence region
954 of ``D``, ``D`` itself is m-converged, and,
955 2. Every cycle that contains ``X`` satisfies the following necessary
958 a. Every divergent branch inside the cycle satisfies the :ref:`diverged
959 entry criterion<convergence-diverged-entry>`, and,
960 b. There are no :ref:`diverged paths reaching the
961 cycle<convergence-diverged-outside>` from a divergent branch outside it.
963 Temporal Divergence at Cycle Exit
964 ---------------------------------
966 When a cycle has a divergent exit, maximal convergence assumes that all threads
967 converge at the exit block. But if a controlled convergent operation outside the
968 cycle uses a token defined by an operation ``D`` inside the cycle, the
969 convergence region of ``D`` now extends outside the cycle. If two threads
970 executed converged dynamic instances of ``D`` before exiting the cycle, then
971 they continue to execute converged dynamic instances of nodes in the convergence
972 region of ``D`` outside the cycle. Thus, for a value ``V`` defined inside the
973 cycle, any use ``U`` of ``V`` within the convergence region of ``T`` uses the
974 output of converged dynamic instances of ``V``. If ``V`` is uniform, then its
975 use at such a ``U`` is also uniform. In other words, temporal divergence applies
976 only to a use of ``V`` that is outside the convergence region of ``D``.
978 Rationales for Static rules about cycles
979 ========================================
981 (This section is informative.)
985 For convenience, we use the operator ``==`` to represent the relation
986 ``converged-with`` and the operator ``!=`` to represent its negation.
988 Consider a loop with (incorrect!) convergence control as in the following
993 ; WARNING: Example of incorrect convergence control!
995 %anchor = call token @llvm.experimental.convergence.anchor()
998 call void @convergent.op() [ "convergencectrl"(token %anchor) ]
1002 This code is forbidden by the first static rule about cycles.
1004 A first formal argument why we have to do this is that the dynamic rule for
1005 deciding whether two threads execute converged dynamic instances of
1006 ``@convergent.op`` leads to a logical contradiction in this code.
1007 Assume two threads execute converged dynamic instances of the anchor
1008 followed by two iterations of the loop. Thread 1 executes dynamic instances
1009 I1 and I2 of ``@convergent.op``, thread 2 executes dynamic instances J1 and J2.
1010 Using all the rules, we can deduce:
1012 1. ``I1 != I2`` and ``J1 != J2`` by the basic rules of dynamic instances.
1014 2. ``I1 == J1`` by the first dynamic rule about controlled convergent
1015 operations: both threads execute the same static instruction while using
1016 a convergence token value produced by converged dynamic instances of an
1017 instruction (the anchor).
1019 3. ``I1 == J2`` by the same argument. Also, ``I2 == J1`` and ``I2 == J2``.
1021 The fact that one may be *intuitively* tempted to think of ``I1`` and ``J2``
1022 as being executed in different loop iterations is completely irrelevant for
1023 the *formal* argument. There is no mechanism in LLVM IR semantics for
1024 forming associations between loop iterations in different threads, *except*
1025 for the rules defined in this document -- and the rules in this document
1026 require a loop heart intrinsic for talking about loop iterations.
1028 4. By transitivity, we have ``I1 == I2`` and ``J1 == J2``. That is a
1031 This problem goes away by inserting a loop heart intrinsic as follows, which
1032 establishes a relationship between loop iterations across threads.
1034 .. code-block:: llvm
1036 %anchor = call token @llvm.experimental.convergence.anchor()
1038 %loop = call token @llvm.experimental.convergence.loop() [ "convergencectrl"(token %anchor) ]
1040 call void @convergent.op() [ "convergencectrl"(token %loop) ]
1044 In the same scenario of two threads executing converged dynamic instances of the
1045 anchor and then two iterations of the loop, the dynamic rule about loop heart
1046 intrinsics implies that both threads execute the converged dynamic instances of
1047 the loop heart intrinsic in their respective first iterations and then again in
1048 their respective second iterations of the loop.
1050 This then implies that they execute converged dynamic instances ``I1 == J1`` of
1051 the ``@convergent.op`` in their first iterations and then
1052 ``I2 == J2`` in their second iterations. The rule is an "if and only if" rule,
1053 so it also implies that ``I1 != J2`` and ``I2 != J1``, because those executions
1054 see token values of ``%loop`` originating from non-converged dynamic
1055 instances of the loop intrinsic.
1057 One may ask whether we could change the dynamic rule instead of adding the
1058 static rule about cycles. That is impractical due to deeper difficulties.
1059 Consider the following loop, again with incorrect convergence control:
1061 .. code-block:: llvm
1063 ; WARNING: Example of incorrect convergence control!
1066 %anchor = call token @llvm.experimental.convergence.anchor()
1071 call void @convergent.op.1() [ "convergencectrl"(token %anchor) ]
1076 call void @convergent.op.2() [ "convergencectrl"(token %anchor) ]
1082 Assume two threads execute converged dynamic instances of the anchor followed
1083 by this sequence of basic blocks:
1085 .. code-block:: text
1087 Thread 1: A B C D F B D E F G
1088 Thread 2: A B D E F B C D F G
1090 That is, both threads execute two iterations of the loop, but they execute
1091 the different convergent operations in different iterations. Without forming a
1092 relation between loop iterations across the threads, there is no reasonable way
1093 of defining which dynamic instances of the convergent operations should be the
1094 same across the threads, if any.
1096 Again, this can be addressed by adding a loop heart intrinsic, most naturally
1099 .. code-block:: llvm
1102 %anchor = call token @llvm.experimental.convergence.anchor()
1105 %loop = call token @llvm.experimental.convergence.loop() [ "convergencectrl"(token %anchor) ]
1108 call void @convergent.op.1() [ "convergencectrl"(token %loop) ]
1113 call void @convergent.op.2() [ "convergencectrl"(token %loop) ]
1119 Let ``%loop(i;j)`` be the dynamic instance of ``j``-th execution of the loop
1120 heart intrinsic by thread ``i``, and analogously ``@op.k(i)`` and ``@op.k(i)``
1121 the dynamic instances of the execution of ``@convergent.op.k`` by thread ``i``.
1124 1. ``%loop(1;j) == %loop(2;j)`` for ``j = 1, 2`` because of the dynamic rule
1125 about loop heart intrinsics.
1127 2. ``%loop(i;1) != %loop(i;2)`` for ``i = 1, 2`` because of the basic rule that
1128 different executions by the same thread happen in different dynamic
1131 3. ``@op.1(1) != @op.1(2)``, since ``@op.1(1)`` uses the token value of ``%loop``
1132 referring to ``%loop(1;1)`` and ``@op.1(2)`` uses that
1133 referring to ``%loop(2;2) == %loop(1;2)``, which is different from
1136 4. Similarly, ``@op.2(1) != @op.2(2)``.
1138 However, loop heart intrinsics could be inserted differently, at the cost
1139 of also inserting a free-standing anchor:
1141 .. code-block:: llvm
1144 %anchor = call token @llvm.experimental.convergence.anchor()
1149 %loop = call token @llvm.experimental.convergence.loop() [ "convergencectrl"(token %anchor) ]
1150 call void @convergent.op.1() [ "convergencectrl"(token %loop) ]
1155 %free = call token @llvm.experimental.convergence.anchor()
1156 call void @convergent.op.2() [ "convergencectrl"(token %free) ]
1162 This leads to the "unnatural counting of loop iterations" that is also mentioned
1163 elsewhere. Let ``%loop(i)`` be the dynamic instance of the execution of the
1164 loop heart intrinsic by thread ``i`` (each thread executes it only once), and
1165 let ``@op.k(i)`` be as before. Then:
1167 1. ``%loop(1) == %loop(2)`` because of the dynamic rule about loop heart
1170 2. ``@op.1(1) == @op.1(2)`` because ``@op.1(i)`` uses the value of ``%loop``
1171 referring to ``%loop(i)``, and ``%loop(1) == %loop(2)``.
1173 3. Whether ``@op.2(1) == @op.2(2)`` is implementation-defined because of the
1174 use of the ``%free`` anchor intrinsic.
1176 In practice, they almost certainly have to be non-converged dynamic
1177 instances. Consider that if an implementation strictly follows the order of
1178 instructions given in the program, the executions of the threads can be
1179 "aligned" as follows:
1181 .. code-block:: text
1183 Thread 1: A B C D F B D E F G
1184 Thread 2: A B D E F B C D F G
1186 So then ``@op.2(1)`` physically executes later than ``@op.2(2)`` and there
1187 can be no communication between the threads, which means they execute
1188 non-converged dynamic instances.
1190 That said, it is conceivable that there aren't actually any data or other
1191 dependencies that would enforce this execution order. In that case, a highly
1192 out-of-order implementation could potentially allow communication. That's
1193 why the rules defined in this document are silent about whether
1194 ``@op.2(1) == @op.2(2)`` or not.
1196 This type of convergence control seems relatively unlikely to appear in real
1197 programs. Its possibility is simply a logical consequence of the model.
1199 An equivalent issue arises if the convergent operations are replaced by nested
1200 loops with loop heart intrinsics that directly refer to ``%anchor``, hence
1201 the variants of the static rules about cycles that apply to them:
1203 .. code-block:: llvm
1205 ; WARNING: Example of incorrect convergence control!
1207 %anchor = call token @llvm.experimental.convergence.anchor()
1211 %loop1 = call token @llvm.experimental.convergence.loop() [ "convergencectrl"(token %anchor) ]
1216 %loop2 = call token @llvm.experimental.convergence.loop() [ "convergencectrl"(token %anchor) ]
1221 There is a cycle (closed walk in the CFG) that goes through both loop heart
1222 intrinsics using ``%anchor`` but not through the definition of ``%anchor``,
1223 so this code is invalid.
1226 Examples for the Correctness of Program Transforms
1227 ==================================================
1229 (This section is informative.)
1231 As implied by the rules in the previous sections, program transforms are correct
1232 with respect to convergent operations if they preserve or refine their
1233 semantics. This means that the set of communicating threads in the transformed
1234 program must have been possible in the original program.
1236 Program transforms with a single-threaded focus are generally conservatively
1237 correct if they do not sink or hoist convergent operations across a branch.
1238 This applies even to program transforms that change the control flow graph.
1240 For example, unrolling a loop that does not contain convergent operations
1241 cannot break any of the guarantees required for convergent operations outside
1245 Loop unrolling examples
1246 -----------------------
1248 We consider three kinds of loop unrolling here:
1250 * Partial unrolling with no known trip multiple, so a "tail" is required to
1251 collect the remaining elements.
1252 * Partial unrolling by a trip multiple, so no "tail" is required.
1253 * Full unrolling, which eliminates the loop.
1255 The first kind is forbidden when ``@llvm.experimental.convergence.loop`` is
1256 used. We illustrate the reasoning with some examples.
1258 First, an arbitrary loop that contains convergent operations *can* be unrolled
1259 in all of these ways, even with "tail", if all convergent operations refer back
1260 to an anchor inside the loop. For example (in pseudo-code):
1262 .. code-block:: llvm
1264 while (counter > 0) {
1265 %tok = call token @llvm.experimental.convergence.anchor()
1266 call void @convergent.operation() [ "convergencectrl"(token %tok) ]
1270 This can be unrolled to:
1272 .. code-block:: llvm
1274 while (counter >= 2) {
1275 %tok = call token @llvm.experimental.convergence.anchor()
1276 call void @convergent.operation() [ "convergencectrl"(token %tok) ]
1277 %tok = call token @llvm.experimental.convergence.anchor()
1278 call void @convergent.operation() [ "convergencectrl"(token %tok) ]
1281 while (counter > 0) {
1282 %tok = call token @llvm.experimental.convergence.anchor()
1283 call void @convergent.operation() [ "convergencectrl"(token %tok) ]
1287 This is likely to change the behavior of the convergent operation if there
1288 are threads whose initial counter value is not a multiple of 2. In particular,
1289 all threads with an odd trip count are now likely to execute the convergent
1290 operation in their respective final iterations together because the
1291 underlying implementation is likely to try to group as many threads together
1292 as possible for the execution of the "tail".
1294 This change is allowed because the anchor intrinsic has implementation-defined
1295 convergence behavior and the loop unrolling transform is considered to be part
1296 of the implementation. Another way of reasoning is that while the *likely*
1297 behavior of the code has changed, the *guarantees* about its behavior have
1300 If the loop contains uncontrolled convergent operations, this kind of unrolling
1303 Unrolling a loop with convergent operations that refer to tokens produced
1304 outside the loop is forbidden when a "tail" or "remainder" would have to
1305 be introduced. Consider:
1307 .. code-block:: llvm
1310 %outer = call token @llvm.experimental.convergence.anchor()
1311 while (counter > 0) {
1312 %inner = call token @llvm.experimental.convergence.loop() [ "convergencectrl"(token %outer) ]
1314 call void @convergent.operation() [ "convergencectrl"(token %inner) ]
1319 To understand why unrolling is forbidden, consider two threads that execute
1320 converged dynamic instances of the anchor and then proceed with 3 and 4 loop
1321 iterations, respectively:
1323 .. code-block:: text
1326 Thread 2: A B B B B C
1328 By the dynamic rule on loop heart intrinsics, these threads execute converged
1329 dynamic instances of the loop intrinsic for the first 3 iterations, and then
1330 thread 2 executes another dynamic instance by itself.
1332 By the dynamic rule on general convergent operations, the threads execute
1333 converged dynamic instances of the ``@convergent.operation`` in the first 3
1334 iterations (that is, the dynamic instance executed by thread 1 in iteration
1335 *n* is the same as that executed by thread 2 in iteration *n*, for *n = 1,2,3*;
1336 the dynamic instance executed in iteration 1 is different from that in
1339 Now assume that the loop is unrolled by a factor of 2, which requires a
1340 remainder as follows:
1342 .. code-block:: llvm
1345 %outer = call token @llvm.experimental.convergence.anchor()
1346 while (counter >= 2) {
1347 %inner = call token @llvm.experimental.convergence.loop() [ "convergencectrl"(token %outer) ]
1349 call void @convergent.operation() [ "convergencectrl"(token %inner) ]
1350 call void @convergent.operation() [ "convergencectrl"(token %inner) ]
1355 %remainder = call token @llvm.experimental.convergence.loop() [ "convergencectrl"(token %outer) ]
1357 call void @convergent.operation() [ "convergencectrl"(token %remainder) ]
1361 First of all, note some interesting problems surrounding the loop intrinsic:
1363 1. It is *not* duplicated inside the unrolled loop. This is to comply with
1364 the :ref:`convergence_static_rules`.
1366 2. It is unclear whether the loop intrinsic ought to be duplicated in the
1367 remainder, or whether the final ``@convergent.operation`` in D should just
1368 refer to either ``%inner`` (which is possible in SSA form) or directly to
1369 ``%outer``. The decision made here is arbitrary and doesn't change the
1370 argument that follows. Ultimately, it simply doesn't matter because the
1371 transform is incorrect either way.
1373 The threads now execute the following sequences of blocks:
1375 .. code-block:: text
1378 Thread 2: A B B C D E
1380 Analogous to the argument above, they execute converged dynamic instances of the
1381 ``%inner`` intrinsic and the ``@convergent.operation`` in the first iteration
1382 of the unrolled loop, which corresponds to the first 2 iterations of the
1385 However, they execute different static calls to ``@convergent.operation`` for
1386 the 3rd iteration of the original loop. In thread 1, that iteration corresponds
1387 to the call in the remainder, while in thread 2 it corresponds to the first
1388 call to ``@convergent.operation`` in the unrolled loop. Therefore, they execute
1389 non-converged dynamic instances, which means that the set of communicating threads
1390 for the 3rd iteration of the original loop is different. This is why the
1391 unrolling is incorrect.
1393 On the other hand, unrolling without "tail" is allowed. For example, assuming
1394 that the trip counter is known to be a multiple of 2, we can unroll the loop
1397 .. code-block:: llvm
1399 %outer = call token @llvm.experimental.convergence.anchor()
1400 while (counter > 0) {
1401 %inner = call token @llvm.experimental.convergence.loop() [ "convergencectrl"(token %outer) ]
1402 call void @convergent.operation() [ "convergencectrl"(token %inner) ]
1403 call void @convergent.operation() [ "convergencectrl"(token %inner) ]
1407 Note again that the loop intrinsic is not duplicated.
1410 :ref:`llvm.experimental.convergence.loop <llvm.experimental.convergence.loop>`
1411 intrinsic is typically expected to appear in the header of a natural loop.
1412 However, it can also appear in non-header blocks of a loop. In that case, the
1413 loop can generally not be unrolled.
1416 Hoisting and sinking
1417 --------------------
1419 In general, hoisting and sinking of convergent operations is forbidden. This is
1420 because moving the operation to a different point in control flow generally
1421 changes the set of threads that reach the operation and therefore, the set of
1422 threads that execute converged dynamic instances of the operation. By
1423 definition, this changes the set of threads that participate in the
1424 communication of the convergent operation, which will typically change its
1427 There are a number of exceptions, though most of them require additional
1430 For example, hoisting and sinking across *uniform* conditional branches -- i.e.,
1431 conditional branches where within every possible relevant set of threads, all
1432 threads will always take the same direction -- is generally allowed. See the end
1433 of the :ref:`example of reductions inside control flow
1434 <convergence_example_reductions>` for a brief discussion.
1436 Some convergent operations can be hoisted but not sunk, or vice versa. A simple
1437 example is the ``subgroupShuffle(data, id)`` operation. It returns the ``data``
1438 operand of the thread identified by ``id``, where thread IDs are fixed and
1439 assigned to each thread at launch. The result is undefined (or perhaps there is
1440 UB, depending on the language and environment) if thread ``id`` is not in the
1441 communicating set of threads. So hoisting is allowed in the following
1442 pseudo-code example:
1444 .. code-block:: llvm
1446 define void @example(...) convergent {
1447 %entry = call token @llvm.experimental.convergence.entry()
1451 %shuffled = call i32 @subgroupShuffle(i32 %data, i32 %id) [ "convergencectrl"(token %entry) ]
1454 %shuffled = call i32 @subgroupShuffle(i32 %data, i32 %id) [ "convergencectrl"(token %entry) ]
1459 After hoisting the calls to ``@subgroupShuffle``, the communicating set of
1460 threads is the union of the two sets of threads in the original program, so
1461 ``%id`` can only go "out of range" after hoisting if it did so in the original
1464 However, speculative execution of ``@subgroupShuffle`` in the following program
1467 .. code-block:: llvm
1469 define void @example(...) convergent {
1470 %entry = call token @llvm.experimental.convergence.entry()
1474 %shuffled = call i32 @subgroupShuffle(i32 %data, i32 %id) [ "convergencectrl"(token %entry) ]
1479 There is no guarantee about the value of ``%id`` in the threads where
1480 ``condition`` is false. If ``@subgroupShuffle`` is defined to have UB when
1481 ``%id`` is outside of the set of communicating threads, then speculating and
1482 hoisting ``@subgroupShuffle`` might introduce UB.
1484 On the other hand, if ``@subgroupShuffle`` is defined such that it merely
1485 produces an undefined value or poison as result when ``%id`` is "out of range",
1486 then speculating is okay.
1489 :ref:`llvm.experimental.convergence.anchor <llvm.experimental.convergence.anchor>`
1490 is marked as ``convergent``, it can be sunk in some cases. For example, in
1493 .. code-block:: llvm
1495 %tok = call token @llvm.experimental.convergence.anchor()
1497 call void @convergent.operation() [ "convergencectrl"(token %tok) ]
1500 Assuming that ``%tok`` is only used inside the conditional block, the anchor can
1501 be sunk. The rationale is two-fold. First, the anchor has implementation-defined
1502 behavior, and the sinking is part of the implementation. Second, already in the
1503 original program, the set of threads that communicates in the
1504 ``@convergent.operation`` is automatically subset to the threads for which
1505 ``condition`` is true.
1507 Anchors can be hoisted in acyclic control flow. For example:
1509 .. code-block:: llvm
1512 %tok1 = call token @llvm.experimental.convergence.anchor()
1513 call void @convergent.operation() [ "convergencectrl"(token %tok1) ]
1515 %tok2 = call token @llvm.experimental.convergence.anchor()
1516 call void @convergent.operation() [ "convergencectrl"(token %tok2) ]
1519 The anchors can be hoisted, resulting in:
1521 .. code-block:: llvm
1523 %tok = call token @llvm.experimental.convergence.anchor()
1525 call void @convergent.operation() [ "convergencectrl"(token %tok) ]
1527 call void @convergent.operation() [ "convergencectrl"(token %tok) ]
1530 The behavior is unchanged, since each of the static convergent operations only
1531 ever communicates with threads that have the same ``condition`` value.
1532 By contrast, hoisting the convergent operations themselves is forbidden.
1534 Hoisting and sinking anchors out of and into loops is forbidden. For example:
1536 .. code-block:: llvm
1539 %tok = call token @llvm.experimental.convergence.anchor()
1540 call void @convergent.operation() [ "convergencectrl"(token %tok) ]
1543 Hoisting the anchor would make the program invalid according to the static
1544 validity rules. Conversely:
1546 .. code-block:: llvm
1548 %outer = call token @llvm.experimental.convergence.anchor()
1549 while (counter > 0) {
1550 %inner = call token @llvm.experimental.convergence.loop() [ "convergencectrl"(token %outer) ]
1551 call void @convergent.operation() [ "convergencectrl"(token %inner) ]
1555 The program would stay valid if the anchor was sunk into the loop, but its
1556 behavior could end up being different. If the anchor is inside the loop, then
1557 each loop iteration has a new dynamic instance of the anchor, and the set of
1558 threads participating in those dynamic instances of the anchor could be
1559 different in arbitrary implementation-defined ways. Via the dynamic rules about
1560 dynamic instances of convergent operations, this then implies that the set of
1561 threads executing ``@convergent.operation`` could be different in each loop
1562 iteration in arbitrary implementation-defined ways.
1564 Convergent operations can be sunk together with their anchor. Again in
1567 .. code-block:: llvm
1569 %tok = call token @llvm.experimental.convergence.anchor()
1570 %a = call T @pure.convergent.operation(...) [ "convergencectrl"(token %tok) ]
1571 %b = call T @pure.convergent.operation(...) [ "convergencectrl"(token %tok) ]
1576 Assuming that ``%tok``, ``%a``, and ``%b`` are only used inside the conditional
1577 block, all can be sunk together:
1579 .. code-block:: llvm
1582 %tok = call token @llvm.experimental.convergence.anchor()
1583 %a = call T @pure.convergent.operation(...) [ "convergencectrl"(token %tok) ]
1584 %b = call T @pure.convergent.operation(...) [ "convergencectrl"(token %tok) ]
1588 The rationale is that the anchor intrinsic has implementation-defined behavior,
1589 and the sinking transform is considered to be part of the implementation:
1590 the sinking will restrict the set of communicating threads to those for which
1591 ``condition`` is true, but that could have happened in the original program
1592 anyway for some arbitrary other reason.
1594 However, sinking *only* the convergent operation producing ``%b`` would be
1595 incorrect. That would allow threads for which ``condition`` is false to
1596 communicate at ``%a``, but not at ``%b``, which the original program doesn't
1599 Note that the entry intrinsic behaves differently. Sinking the convergent
1600 operations is forbidden in the following snippet:
1602 .. code-block:: llvm
1604 %tok = call token @llvm.experimental.convergence.entry()
1605 %a = call T @pure.convergent.operation(...) [ "convergencectrl"(token %tok) ]
1606 %b = call T @pure.convergent.operation(...) [ "convergencectrl"(token %tok) ]