llvm/docs/ConvergentOperations.rst

   1 ==============================
   2 Convergent Operation Semantics
   3 ==============================
   4
   5 .. contents::
   6    :local:
   7    :depth: 4
   8
   9 Overview
  10 ========
  11
  12 Some parallel execution environments execute threads in groups that allow
  13 efficient communication within the group using special primitives called
  14 *convergent* operations. The outcome of a convergent operation is sensitive to
  15 the set of threads that executes it "together", i.e., convergently. When control
  16 flow :ref:`diverges <convergence-and-uniformity>`, i.e. threads of the same
  17 group follow different
  18 paths through the CFG, not all threads of the group may be available to
  19 participate in this communication. This is the defining characteristic that
  20 distinguishes convergent operations from other inter-thread communication:
  21
  22   A convergent operation involves inter-thread communication or synchronization
  23   that occurs outside of the memory model, where the set of threads which
  24   participate in communication is implicitly affected by control flow.
  25
  26 For example, in the following GPU compute kernel, communication during the
  27 convergent operation is expected to occur precisely among those threads of an
  28 implementation-defined execution scope (such as workgroup or subgroup) for
  29 which ``condition`` is true:
  30
  31 .. code-block:: c++
  32
  33   void example_kernel() {
  34       ...
  35       if (condition)
  36           convergent_operation();
  37       ...
  38   }
  39
  40 In structured programming languages, there is often an intuitive and
  41 unambiguous way of determining the threads that are expected to communicate.
  42 However, this is not always the case even in structured programming languages,
  43 and the intuition breaks down entirely in unstructured control flow. This
  44 document describes the formal semantics in LLVM, i.e. how to determine the set
  45 of communicating threads for convergent operations.
  46
  47 The definitions in this document leave many details open, such as how groups of
  48 threads are formed in the first place. It focuses on the questions that are
  49 relevant for deciding the correctness of generic program transforms and
  50 convergence-related analyses such as :ref:`uniformity analysis
  51 <convergence-and-uniformity>`.
  52
  53 .. _convergent_operations:
  54
  55 Convergent Operations
  56 =====================
  57
  58 In LLVM IR, the only way to communicate between threads as described
  59 above is by calling target-defined convergent intrinsics. Hence, only
  60 a call-site in LLVM IR (a :ref:`call <i_call>`, :ref:`invoke
  61 <i_invoke>`, or :ref:`callbr <i_callbr>` instruction) can result in a
  62 convergent operation.
  63
  64 A function in LLVM IR is said to be *convergent* if it has the
  65 :ref:`convergent <attr_convergent>` attribute.
  66
  67 A call-site in LLVM IR is said to be *convergent* if it is a direct
  68 call to a convergent function or it has the :ref:`convergent
  69 <attr_convergent>` attribute or a :ref:`convergencectrl operand bundle
  70 <convergencectrl>`.
  71
  72 Informational notes:
  73
  74   A function may have to be treated as convergent if that function, or
  75   transitively, any function called from it, contains a convergent call-site. A
  76   frontend generating the ``convergent`` attribute should take this into account
  77   when emitting functions and function calls. But this is not always the case:
  78
  79   A non-convergent function may contain convergent operations; such operations
  80   do not directly depend on the set of threads that enter the function as a
  81   single communicating group. Instead, these operations depend on an
  82   implementation-defined subset of threads within the body of the function, as
  83   shown in :ref:`opportunistic_convergence`.
  84
  85 Examples of Convergent Operations
  86 ========================================
  87
  88 (This section is informative.)
  89
  90 Texture sampling in a pixel shader
  91 ----------------------------------
  92
  93 The following stylized pixel shader samples a texture at a given set of
  94 coordinates, using the builtin function `textureSample`. Texture sampling
  95 requires screen-space derivatives of the coordinates to determine the level of
  96 detail (mipmap level) of the sample. They are commonly approximated by taking
  97 the difference between neighboring pixels, which are computed by different
  98 threads in the same group:
  99
 100 .. code-block:: c++
 101
 102   void example_shader() {
 103     ...
 104     color = textureSample(texture, coordinates);
 105     if (condition) {
 106       use(color);
 107     }
 108     ...
 109   }
 110
 111 From a purely single-threaded perspective, sinking the `textureSample` into
 112 the if-statement appears legal. However, if the condition is false for some
 113 neighboring pixels, then their corresponding threads will not execute together
 114 in the group, making it impossible to take the difference of coordinates as an
 115 approximation of the screen-space derivative. In practice, the outcome will be
 116 an undefined value.
 117
 118 That is, the `textureSample` operation fits our definition of a convergent
 119 operation:
 120
 121  1. It communicates with a set of threads that implicitly depends on control
 122     flow.
 123  2. Correctness depends on this set of threads.
 124
 125 The compiler frontend can emit IR that expresses the convergence constraints as
 126 follows:
 127
 128 .. code-block:: llvm
 129
 130   define void @example_shader() convergent {
 131     %entry = call token @llvm.experimental.convergence.entry()
 132     ...
 133     %color = call T @textureSample(U %texture, V %coordinates) [ "convergencectrl"(token %entry) ]
 134     br i1 %condition, label %then, label %end
 135
 136   then:
 137     call void @use(T %color)
 138     br label %end
 139
 140   end:
 141     ret void
 142   }
 143
 144 The :ref:`llvm.experimental.convergence.entry <llvm.experimental.convergence.entry>`
 145 intrinsic is itself ``convergent``, and we expect it to communicate at least
 146 among all threads of the same "quad" -- a group of 2x2 pixels that are
 147 evaluated together for the purpose of approximating screen-space derivatives.
 148 This fact is not part of the generic LLVM IR semantics; it would have to be
 149 defined somewhere else, for example as part of target-specific ABI definitions
 150 and/or in reference to some relevant API specs.
 151
 152 Since the ``@textureSample`` call then uses the token produced by the entry
 153 intrinsic in its ``convergencectrl`` bundle, and has no additional control
 154 dependencies, it must communicate among the same set of threads. This indicates
 155 to generic program transforms that sinking the ``@textureSample`` call is
 156 forbidden. (A program transform can still sink the call if it can prove somehow,
 157 e.g. by leaning on target-specific callbacks that can analyze the program with
 158 additional knowledge, that ``%condition`` is always uniform across the threads
 159 referenced by the *convergence token* ``%entry``.)
 160
 161 .. _convergence_example_reductions:
 162
 163 Reductions inside divergent control flow
 164 ----------------------------------------
 165
 166 The following example shows that merging common code of branches can be
 167 incorrect in the face of convergent operations:
 168
 169 .. code-block:: c++
 170
 171   void example_kernel() {
 172     delta = ...
 173     if (delta > 0) {
 174       total_gains = subgroupAdd(delta);
 175       ...
 176     } else {
 177       total_losses = subgroupAdd(delta);
 178       ...
 179     }
 180   }
 181
 182 The ``subgroupAdd`` computing the ``total_gains`` will be executed by the
 183 subset of threads with positive ``delta`` in a subgroup (wave), and so will sum
 184 up all the ``delta`` values of those threads; and similarly for the
 185 ``subgroupAdd`` that computes the ``total_losses``.
 186
 187 If we were to hoist and merge the ``subgroupAdd`` above the if-statement, it
 188 would sum up the ``delta`` across *all* threads instead.
 189
 190 The compiler frontend can emit IR that expresses the convergence constraints
 191 as follows:
 192
 193 .. code-block:: llvm
 194
 195   define void @example_kernel() convergent {
 196     %entry = call token @llvm.experimental.convergence.entry()
 197     %delta = ...
 198     %cc = icmp sgt i32 %delta, 0
 199     br i1 %cc, label %then, label %else
 200
 201   then:
 202     %total_gains = call i32 @subgroupAdd(i32 %delta) [ "convergencectrl"(token %entry) ]
 203     ...
 204     br label %end
 205
 206   else:
 207     %total_losses = call i32 @subgroupAdd(i32 %delta) [ "convergencectrl"(token %entry) ]
 208     ...
 209     br label %end
 210
 211   end:
 212     ...
 213   }
 214
 215 The entry intrinsic behaves like in the previous example: assuming that
 216 ``@example_kernel`` is an OpenCL kernel (as hinted at by the "subgroup"
 217 terminology), we expect it to communicate among all threads within the
 218 "subgroup". This typically maps to a SIMD vector on GPU hardware.
 219
 220 The calls to ``@subgroupAdd`` use the token produced by the entry intrinsic,
 221 but they also have an additional control dependency. According to the rules
 222 defined in this document, they only communicate among the subset of threads
 223 that actually end up executing the respective (static) call site.
 224
 225 Hoisting them would remove the control dependency and cause them to communicate
 226 among the full set of threads that the entry intrinsic communicated with.
 227 Again, hoisting is allowed if it can be proven that ``%cc`` is always uniform
 228 among the relevant set of threads: in that case, the ``@subgroupAdd`` already
 229 communicates among the full set of threads in the original program.
 230
 231 Motivating Examples of Convergence Control
 232 ==========================================
 233
 234 (This section is informative.)
 235
 236 Unstructured control flow
 237 -------------------------
 238
 239 Consider an example of how jump threading removes structure in a way that can
 240 make semantics non-obvious without the convergence intrinsics described in this
 241 document:
 242
 243 .. code-block:: llvm
 244
 245   void example_original() {
 246   entry:
 247       ...
 248       br i1 %cond1, label %then1, label %mid
 249
 250   then1:
 251       ...
 252       %cond2 = ...
 253       br label %mid
 254
 255   mid:
 256       %flag = phi i1 [ true, %entry ], [ %cond2, %then1 ]
 257       br i1 %flag, label %then2, label %end
 258
 259   then2:
 260       ...
 261       call void @subgroupControlBarrier()
 262       ...
 263       br label %end
 264
 265   end:
 266   }
 267
 268   void example_jumpthreaded() {
 269   entry:
 270       ...
 271       br i1 %cond1, label %then1, label %then2
 272
 273   then1:
 274       ...
 275       %cond2 = ...
 276       br i1 %cond2, label %then2, label %end
 277
 278   then2:
 279       ...
 280       call void @subgroupControlBarrier()
 281       ...
 282       br label %end
 283
 284   end:
 285   }
 286
 287 Is the control barrier guaranteed to synchronize among the same set of threads
 288 in both cases? Different implementations in the literature may give different
 289 answers to this question:
 290
 291 * In an implementation that reconverges at post-dominators, threads reconverge
 292   at ``mid`` in the first version, so that all threads (within a subgroup/wave)
 293   that execute the control barrier do so together. In the second version,
 294   threads that reach the control barrier via different paths synchronize
 295   separately: the first (and only) post-dominator is ``end``, so threads do not
 296   reconverge before then.
 297
 298 * An implementation that sorts basic blocks topologically and ensures maximal
 299   reconvergence for each basic block would behave the same way in both
 300   versions.
 301
 302 We generally take the stance that reconvergence in acyclic control flow must
 303 be maximal. The compiler frontend could augment the original code as follows:
 304
 305 .. code-block:: llvm
 306
 307   define void @example_original() convergent {
 308   entry:
 309     %entry = call token @llvm.experimental.convergence.entry()
 310     ...
 311     br i1 %cond1, label %then1, label %mid
 312
 313   then1:
 314     ...
 315     %cond2 = ...
 316     br label %mid
 317
 318   mid:
 319     %flag = phi i1 [ true, %entry ], [ %cond2, %then1 ]
 320     br i1 %flag, label %then2, label %end
 321
 322   then2:
 323     ...
 324     call void @subgroupControlBarrier() [ "convergencectrl"(token %entry) ]
 325     ...
 326     br label %end
 327
 328   end:
 329   }
 330
 331 If S is the set of threads that the entry intrinsic communicated with, then
 332 the ``@subgroupControlBarrier`` call communicates with the subset of S that
 333 actually reaches the call site. This set of threads doesn't change after
 334 jump-threading, so the answer to the question posed above remains the same.
 335
 336 .. _opportunistic_convergence:
 337
 338 Opportunistic convergent operations
 339 -----------------------------------
 340
 341 Some programs have local regions of code that contain a sequence of convergent
 342 operations where the code does not care about the exact set of threads with
 343 which it is executed, but only that the set of threads is the same for all the
 344 operations within the sequence. (If a subset of the convergent operations in the
 345 sequence have additional, non-uniform control dependencies, then this is not
 346 possible. However, the code may still require that the sets of threads are
 347 logically consistent with the conditions of those control dependencies.) In this
 348 case, :ref:`llvm.experimental.convergence.anchor
 349 <llvm.experimental.convergence.anchor>` can be used to express the desired
 350 semantics.
 351
 352 The following example function could be part of a hypothetical "append buffer"
 353 implementation, where threads conditionally write fixed-sized records
 354 contiguously into a global buffer. The function ``@reserveSpaceInBuffer``
 355 returns the index into the buffer at which the calling thread should store its
 356 data.
 357
 358 This could be achieved by using a simple atomic operation in every thread to
 359 bump an allocation counter.
 360
 361 However, the following implementation can be more performant on some hardware,
 362 because it uses only a single atomic operation for an entire group of threads.
 363 To do this, it first determines the total size of the group, which will be the
 364 operand to the atomic operation, and then later broadcasts the result of the
 365 atomic operation to all threads of the group, so that each thread can compute
 366 its individual position in the buffer:
 367
 368 .. code-block:: llvm
 369
 370   define i32 @reserveSpaceInBuffer() {    ; NOTE: _not_ a convergent function!
 371   entry:
 372     %anchor = call token @llvm.experimental.convergence.anchor()
 373
 374     %ballot = call i64 @subgroupBallot(i1 true) [ "convergencectrl"(token %anchor) ]
 375     %numThreads.p = call i64 @llvm.ctpop.i64(i64 %ballot)
 376     %numThreads = trunc i64 %numThreads.p to i32
 377
 378     %absoluteThreadIdx = call i32 @getSubgroupLocalInvocationId()
 379     %absoluteThreadIdx.ext = zext i32 %absoluteThreadIdx to i64
 380     %mask.p = shl i64 1, %absoluteThreadIdx.ext
 381     %mask = sub i64 %mask.p, 1
 382
 383     %maskedBallot = and i64 %ballot, %mask
 384     %relativeThreadIdx.p = call i64 @llvm.ctpop.i64(i64 %maskedBallot)
 385     %relativeThreadIdx = trunc i64 %relativeThreadIdx.p to i32
 386
 387     %isFirstThread = icmp eq i32 %relativeThreadIdx, 0
 388     br i1 %isFirstThread, label %then, label %end
 389
 390   then:
 391     %baseOffset.1 = atomicrmw add ptr @bufferAllocationCount, i32 %numThreads monotonic
 392     br label %end
 393
 394   end:
 395     %baseOffset.2 = phi i32 [ undef, %entry ], [ %baseOffset.1, %then ]
 396     %baseOffset = call i32 @subgroupBroadcastFirst(i32 %baseOffset.2) [ "convergencectrl"(token %anchor) ]
 397     %offset = add i32 %baseOffset, %relativeThreadIdx
 398     ret i32 %offset
 399   }
 400
 401 The key here is that the function really doesn't care which set of threads it
 402 is being called with. It takes whatever set of threads it can get. What the
 403 implementation of the function cares about is that the initial
 404 ``@subgroupBallot`` -- which is used to retrieve the bitmask of threads that
 405 executed the anchor together -- executes with the same set of threads as the
 406 final ``@subgroupBroadcastFirst``. Nothing else is required for correctness as
 407 far as convergence is concerned.
 408
 409 The function ``@reserveSpaceInBuffer`` itself is _not_ ``convergent``: callers
 410 are free to move call sites of the function as they see fit. This can change
 411 the behavior in practice, by changing the sets of threads that are grouped
 412 together for the atomic operation. This can be visible in the output of the
 413 program, since the order in which outputs appear in the buffer is changed.
 414 However, this does not break the overall contract that ``@reserveSpaceInBuffer``
 415 has with its caller -- which makes sense: the order of outputs is
 416 non-deterministic anyway because of the atomic operation that is involved.
 417
 418 If the function is inlined, the use of the anchor intrinsic similarly indicates
 419 that certain transforms which are usually forbidden by the presence of
 420 convergent operations are in fact allowed, as long as they don't break up the
 421 region of code that is controlled by the anchor.
 422
 423 .. _convergence_high-level_break:
 424
 425 Extended Cycles: Divergent Exit from a Loop
 426 -------------------------------------------
 427
 428 High-level languages typically provide a ``break`` statement that transfers
 429 control out of a loop statement. In most cases, the loop is structured and hence
 430 there is no ambiguity about convergence inside the loop. But an ambiguity arises
 431 when a ``break`` is control dependent on a divergent condition inside the loop.
 432 Consider the following example:
 433
 434 .. code-block:: c++
 435
 436   void example() {
 437     // A
 438     ...
 439     for (...) {
 440       // B
 441       if (condition) { // divergent condition
 442         // C
 443         convergent_op();
 444         break;
 445       }
 446       // D
 447       ...
 448     }
 449     // E
 450   }
 451
 452 In this program, the call to convergent_op() is lexically "inside" the ``for``
 453 loop. But when translated to LLVM IR, the basic block B is an exiting block
 454 ending in a divergent branch, and the basic block C is an exit of the loop.
 455 Thus, the call to convergent_op() is outside the loop. This causes a mismatch
 456 between the programmer's expectation and the compiled program. The call should
 457 be executed convergently on every iteration of the loop, by threads that
 458 together take the branch to exit the loop. But when compiled, all threads that
 459 take the divergent exit on different iterations first converge at the beginning
 460 of basic block C and then together execute the call to convergent_op().
 461
 462 In this case, :ref:`llvm.experimental.convergence.loop
 463 <llvm.experimental.convergence.loop>` can be used to express the desired
 464 semantics. A call to this intrinsic is placed in the loop header, which tracks
 465 each iteration of the loop. The token produced by this is used as a
 466 ``convergencectrl`` operand to the convergent call. The semantics of the
 467 ``loop`` intrinsic ensures that the convergent call is performed convergently
 468 only by those threads that convergently exited the loop in a given iteration.
 469
 470 .. code-block:: llvm
 471
 472   define void @example() convergent {
 473     %entry = call token @llvm.experimental.convergence.entry()
 474     br label %for
 475
 476   for:
 477     %inner = call token @llvm.experimental.convergence.loop() ["convergencectrl"(token %entry)]
 478     %for.cond = i1 ...
 479     br i1 %for.cond, label %B, label %E
 480
 481   B:
 482     ...
 483     %condition = i1 ...
 484     br i1 %condition, label %C, label %D
 485
 486   C:
 487     call void @convergent_op() ["convergencectrl"(token %inner)]
 488     br label %E
 489
 490   D:
 491     ...
 492     br label %for
 493
 494   E:
 495     ...
 496     ret void
 497   }
 498
 499 The LLVM IR version of the same program shows a cycle consisting of the basic
 500 blocks ``%for``, ``%B`` and ``%D``, while ``%C`` is an exit reached by the
 501 divergent branch at the end of the exiting block ``%B``. But the use of
 502 convergence control tokens makes it clear that block ``%C`` must be executed
 503 convergently only by those threads that convergently take the exit edge from %B
 504 to ``%C``. In other words, the convergent execution of ``%C`` is governed by the
 505 call to the :ref:`llvm.experimental.convergence.loop
 506 <llvm.experimental.convergence.loop>` intrinsic inside the cycle. The cycle is
 507 effectively extended to include all uses of this token that lie outside the
 508 cycle.
 509
 510 .. _dynamic_instances_and_convergence_tokens:
 511
 512 Dynamic Instances and Convergence Tokens
 513 ========================================
 514
 515 Every execution of an LLVM IR instruction occurs in a :ref:`dynamic instance
 516 <convergence-dynamic-instances>` of the instruction. Dynamic instances are the
 517 formal objects by which we talk about communicating threads in convergent
 518 operations. Dynamic instances are defined for *all* operations in an LLVM
 519 program, whether convergent or not. Convergence control is primarily about the
 520 dynamic instances of convergent operations since they affect execution of the
 521 program through inter-thread communication. The dynamic instances for
 522 non-convergent operations are relevant for determining :ref:`uniformity
 523 <convergence-and-uniformity>` of values.
 524
 525 Dynamic instances produced by the execution of the same *convergent operation*
 526 by different threads may be :ref:`converged <convergence-definition>`. When
 527 executing a convergent operation, the set of threads that execute converged
 528 dynamic instances is the set of threads that communicate with each other.
 529 *Convergence tokens* capture this convergence as described below.
 530
 531 *Convergence tokens* are values of ``token`` type, i.e. they cannot be used in
 532 ``phi`` or ``select`` instructions. A convergence token value represents the
 533 dynamic instance of the instruction that produced it.
 534
 535 Convergent operations may have an optional ``convergencectrl`` operand bundle with
 536 a convergence token operand to define the set of communicating threads relative
 537 to the operation that defined the token.
 538
 539    Let ``U`` be a convergent operation other than a call to a convergence
 540    control intrinsic, and ``D`` be the convergent operation that defines
 541    the token value used as the ``convergencectrl`` operand to ``U``. Two
 542    threads execute converged dynamic instances of ``U`` if and only if the
 543    token value in both threads was returned by converged dynamic
 544    instances of ``D``.
 545
 546 .. note::
 547
 548    The text defines convergence token values as representing dynamic instances.
 549    But if we were to assume that converged dynamic instances produce the same
 550    token value, then we could almost think of the token value as representing a
 551    set of threads instead -- specifically, the set ``S`` of threads that
 552    executed converged dynamic instances of the defining instruction ``D``.
 553
 554    In this intuitive picture, when a convergence token value ``T`` is used by a
 555    ``convergencectrl`` bundle on an instruction ``I``, then the set of threads that
 556    communicates in ``I`` is a subset of the set ``S`` represented by the token value.
 557    Specifically, it is the subset of threads that ends up executing ``I`` while
 558    using the token value.
 559
 560    This by itself wouldn't quite work as a definition: what if ``I`` is executed
 561    multiple times by the same threads? Which execution of ``I`` in thread 1
 562    communicates with which execution of ``I`` in thread 2? Leaning on the notion
 563    of dynamic instances gives a robust answer to this question as long as ``D``
 564    and ``I`` are at the same loop (or cycle) nesting level.
 565
 566    The case where ``D`` and ``I`` are at different loop nesting levels is
 567    forbidden by the :ref:`static rules <convergence_static_rules>` -- handling
 568    that case is the purpose of :ref:`llvm.experimental.convergence.loop
 569    <llvm.experimental.convergence.loop>`.
 570
 571 .. _convergence_control_intrinsics:
 572
 573 Convergence Control Intrinsics
 574 ==============================
 575
 576 This section describes target-independent intrinsics that can be used to
 577 produce convergence tokens.
 578
 579 Behaviour is undefined if a convergence control intrinsic is called
 580 indirectly.
 581
 582 .. _llvm.experimental.convergence.entry:
 583
 584 ``llvm.experimental.convergence.entry``
 585 ----------------------------------------
 586
 587 .. code-block:: llvm
 588
 589   token @llvm.experimental.convergence.entry() convergent readnone
 590
 591 This intrinsic is used to tie the dynamic instances inside of a function to
 592 those in the caller.
 593
 594 1. If the function is called from outside the scope of LLVM, the convergence of
 595    dynamic instances of this intrinsic are environment-defined. For example:
 596
 597    a. In an OpenCL *kernel launch*, the maximal set of threads that
 598       can communicate outside the memory model is a *workgroup*.
 599       Hence, a suitable choice is to specify that all the threads from
 600       a single workgroup in OpenCL execute converged dynamic instances
 601       of this intrinsic.
 602    b. In a C/C++ program, threads are launched independently and they can
 603       communicate only through the memory model. Hence the dynamic instances of
 604       this intrinsic in a C/C++ program are never converged.
 605 2. If the function is called from a call-site in LLVM IR, then two
 606    threads execute converged dynamic instances of this intrinsic if and
 607    only if both threads entered the function by executing converged
 608    dynamic instances of the call-site.
 609
 610 This intrinsic can occur at most once in a function, and only in the entry
 611 block of the function. If this intrinsic occurs in a basic block, then it must
 612 precede any other convergent operation in the same basic block.
 613
 614 It is an error if this intrinsic appears in a non-convergent function.
 615
 616 It is an error to specify a ``convergencectrl`` operand bundle at a
 617 call to this intrinsic.
 618
 619 Function inlining substitutes this intrinsic with the token from the operand
 620 bundle. For example:
 621
 622 .. code-block:: c++
 623
 624   // Before inlining:
 625
 626   void callee() convergent {
 627     %tok = call token @llvm.experimental.convergence.entry()
 628     convergent_operation(...) [ "convergencectrl"(token %tok) ]
 629   }
 630
 631   void main() {
 632     %outer = call token @llvm.experimental.convergence.anchor()
 633     for (...) {
 634       %inner = call token @llvm.experimental.convergence.loop() [ "convergencectrl"(token %outer) ]
 635       callee() [ "convergencectrl"(token %inner) ]
 636     }
 637   }
 638
 639   // After inlining:
 640
 641   void main() {
 642     %outer = call token @llvm.experimental.convergence.anchor()
 643     for (...) {
 644       %inner = call token @llvm.experimental.convergence.loop() [ "convergencectrl"(token %outer) ]
 645       convergent_operation(...) [ "convergencectrl"(token %inner) ]
 646     }
 647   }
 648
 649 .. _llvm.experimental.convergence.loop:
 650
 651 ``llvm.experimental.convergence.loop``
 652 --------------------------------------
 653
 654 .. code-block:: llvm
 655
 656   token @llvm.experimental.convergence.loop() [ "convergencectrl"(token) ] convergent readnone
 657
 658 This intrinsic represents the place where an imaginary counter is incremented
 659 for determining convergence inside a control flow cycle.
 660
 661 Let ``U`` be a call to this intrinsic and ``D`` be the convergent operation that
 662 defines the token value used as the ``convergencectrl`` operand to ``U``. Two
 663 threads execute converged dynamic instances of ``U`` if and only if:
 664
 665 1. The token value in both threads was returned by converged dynamic
 666    instances of ``D``, and,
 667 2. There is an integer *n* such that both threads execute ``U`` for the *n*'th time
 668    with that token value.
 669
 670 It is an error to omit the ``convergencectrl`` operand bundle on a
 671 call to this intrinsic.
 672
 673 If this intrinsic occurs in a basic block, then it must precede any other
 674 convergent operation in the same basic block.
 675
 676 .. _convergence_cycle_heart:
 677
 678 **Heart of a Cycle:**
 679
 680   If a :ref:`cycle <cycle-terminology>` ``C`` contains an occurrence ``H`` of
 681   this intrinsic whose token operand is defined outside ``C``, then ``H`` is
 682   called the heart of ``C``.
 683
 684   .. note::
 685
 686      The static rules for cycles imply that a heart can occur only in the header
 687      of a natural loop. This ensures that the heart closely represents the
 688      intuitive notion of a loop iteration. If this restriction is relaxed, the
 689      resulting semantics provides a new notion of "cycle iteration" even for
 690      irreducible cycles. But this allows a natural loop to have a heart in a
 691      node other than its header, which has interesting consequences on the
 692      meaning of a loop iteration in terms of convergence. For now, we disallow
 693      this situation since its practical application is very rare.
 694
 695 .. _llvm.experimental.convergence.anchor:
 696
 697 ``llvm.experimental.convergence.anchor``
 698 ----------------------------------------
 699
 700 .. code-block:: llvm
 701
 702   token @llvm.experimental.convergence.anchor() convergent readnone
 703
 704 This intrinsic produces an initial convergence token that is independent from
 705 any "outer scope". The set of threads executing converged dynamic instances of
 706 this intrinsic is implementation-defined.
 707
 708 It is an error to pass a ``convergencectrl`` operand bundle at a
 709 call to this intrinsic.
 710
 711 .. note::
 712
 713    The expectation is that all threads within a group that "happen to be active
 714    at the same time" will execute converged dynamic instances, so that programs
 715    can detect the maximal set of threads that can communicate efficiently within
 716    some local region of the program.
 717
 718 .. _convergence_uncontrolled:
 719
 720 Uncontrolled Convergent Operations
 721 ==================================
 722
 723 Convergent operations with an explicit ``convergencectrl`` operand bundle are
 724 called *controlled convergent operations*. All other convergent operations are
 725 said to be *uncontrolled*.
 726
 727 An uncontrolled convergent operation is said to have *implicit convergence
 728 control* determined by the ``convergent`` attribute alone. The semantics of the
 729 ``convergent`` attribute as implemented in LLVM differs from the documented
 730 semantics. The implementation tries to follow common intuition about convergent
 731 operations, which remains under-specified. As such, it is not possible to fully
 732 translate implicit convergence control into explicit convergence control tokens,
 733 and these two modes cannot be mixed in the same function.
 734
 735   If a function contains a controlled convergent operation, then all convergent
 736   operations in that function must either be controlled operations or calls to
 737   the convergence control intrinsics.
 738
 739 Inferring Tokens
 740 ----------------
 741
 742 (This section is informational)
 743
 744 Sometimes, it may be necessary to reinterpret the implicit convergence control
 745 in terms of explicit convergence control tokens. For example, this may happen
 746 when a function call is inlined, and either the caller or the callee contains
 747 uncontrolled convergent operations.
 748
 749 Some uses of uncontrolled convergent operations may need to satisfy the
 750 following property:
 751
 752   For an environment-defined group of threads (such as an OpenCL workgroup or
 753   subgroup), if one thread in the group executes a convergent operation, then
 754   all threads in the group do so convergently with that thread.
 755
 756 In terms of explicit convergence control, this means that the
 757 ``convergencectrl`` operand on each convergent operation ``X`` must ultimately
 758 originate from a call to the :ref:`llvm.experimental.convergence.entry
 759 <llvm.experimental.convergence.entry>` intrinsic. This preserves the possibility
 760 that the group of threads that converge on reaching ``X`` is the same group that
 761 originally started executing the program in convergence. In comparison, the
 762 :ref:`llvm.experimental.convergence.anchor
 763 <llvm.experimental.convergence.anchor>` intrinsic captures an
 764 implementation-defined group of threads, which is insufficient to support the
 765 above property.
 766
 767 One way to approximate implicit convergence control in terms of explicit
 768 convergence control tokens is the following procedure, which preserves the above
 769 mentioned property:
 770
 771 1. Convert every irreducible cycle into a reducible cycle.
 772 2. Insert a call to :ref:`llvm.experimental.convergence.entry
 773    <llvm.experimental.convergence.entry>` at the start of the entry block of the
 774    function.
 775 3. Insert a call to :ref:`llvm.experimental.convergence.loop
 776    <llvm.experimental.convergence.loop>` at the start of every loop header. If
 777    this loop is an outermost loop, the ``convergencectrl`` operand is the call
 778    to :ref:`llvm.experimental.convergence.entry
 779    <llvm.experimental.convergence.entry>` in the entry block of the function.
 780    Otherwise, the ``convergencectrl`` operand is the call to
 781    :ref:`llvm.experimental.convergence.loop
 782    <llvm.experimental.convergence.loop>` in the parent loop's header.
 783 4. For each uncontrolled convergent operation ``X``, add a ``convergencectrl``
 784    operand bundle using the token defined by a definition ``D`` that is a
 785    :ref:`sibling <cycle-sibling>` to this operation. ``D`` always dominates
 786    ``X`` --- if ``X`` is not in any cycle, then ``D`` is a call to
 787    :ref:`llvm.experimental.convergence.entry
 788    <llvm.experimental.convergence.entry>`; otherwise ``D`` is the heart of the
 789    parent cycle of ``X``.
 790
 791 .. _convergence_static_rules:
 792
 793 Static Rules
 794 ============
 795
 796 A *well-formed* program in LLVM IR must satisfy the following static
 797 rules about cycles and convergence regions.
 798
 799 Closed Paths
 800 ------------
 801
 802 A :ref:`closed path <cycle-closed-path>` in a CFG is a connected sequence of
 803 nodes and edges in the CFG whose start and end points are the same.
 804
 805 1. Every closed path in the CFG that contains a use of a convergence token T other
 806    than a use by
 807    :ref:`llvm.experimental.convergence.loop <llvm.experimental.convergence.loop>`
 808    must also contain the definition of T.
 809
 810 2. Every closed path in the CFG that contains two different uses of a convergence
 811    token T must also contain the definition of T.
 812
 813 3. Every closed path in the CFG that contains uses of two different convergence tokens
 814    T1 and T2 must also contain the definition of at least one of them.
 815
 816 Taken together, these rules imply that for every closed path C, there can be at most
 817 one convergence token T which is used in C but defined outside of it, and that
 818 T can be used only once in C, and only by ``llvm.experimental.convergence.loop``.
 819
 820 4. In every closed path that contains a use U of a token T but not the
 821    definition of T, U must dominate all nodes in the closed path.
 822
 823 This implies that ``llvm.experimental.convergence.loop`` can appear as a heart
 824 only in the header of a natural loop.
 825
 826 **Sufficient Conditions:** From the :ref:`properties of cycles
 827 <cycle-closed-path>`, it is sufficient to prove the above properties
 828 for cycles instead of closed paths. Briefly, any closed path that violates
 829 one or more of the above static rules is contained in a cycle that also
 830 violates the same rule(s).
 831
 832 .. _convergence_region:
 833
 834 Convergence Regions
 835 -------------------
 836
 837 The *convergence region* of a convergence token T is the minimal region in
 838 which T is live and used, i.e., the set of program points dominated by the
 839 definition D of T from which a use of T can be reached.
 840
 841 The following static rule about convergence regions must be satisfied by
 842 valid programs:
 843
 844    If a convergence region R for a token T1 contains a use of a convergence
 845    token T2, then R must also contain the definition of T2. (In other words,
 846    convergence regions must be reasonably nested.)
 847
 848 .. note::
 849
 850    For brevity, this document uses the term "convergence region of a token
 851    definition ``D``" to actually refer to the convergence region of the token
 852    ``T`` defined by ``D``.
 853
 854 .. _inferring_noconvergent:
 855
 856 Inferring non-convergence
 857 =========================
 858
 859 When the target or the environment guarantees that threads do not
 860 communicate using convergent operations or that threads never diverge,
 861 the dynamic instances in the program are irrelevant and an optimizer
 862 may remove any occurrence of the ``convergent`` attribute on a
 863 call-site or a function and any explicit ``convergencectrl`` operand
 864 bundle at a call-site.
 865
 866 An optimizer may remove the ``convergent`` attribute and any explicit
 867 ``convergencectrl`` operand bundle from a call-site if it can prove
 868 that the execution of this call-site always results in a call to a
 869 non-convergent function.
 870
 871 An optimizer may remove the ``convergent`` attribute on a function if it can
 872 prove that the function does not contain a call to
 873 :ref:`llvm.experimental.convergence.entry
 874 <llvm.experimental.convergence.entry>`, or any uncontrolled convergent
 875 operations.
 876
 877 Memory Model Non-Interaction
 878 ============================
 879
 880 The fact that an operation is convergent has no effect on how it is treated for
 881 memory model purposes. In particular, an operation that is ``convergent`` and
 882 ``readnone`` does not introduce additional ordering constraints as far as the
 883 memory model is concerned. There is no implied barrier, neither in the memory
 884 barrier sense nor in the control barrier sense of synchronizing the execution
 885 of threads.
 886
 887 Informational note: Threads that execute converged dynamic instances do not
 888 necessarily do so at the same time.
 889
 890
 891 Other Interactions
 892 ==================
 893
 894 A function can be both ``convergent`` and
 895 ``speculatable``, indicating that the function does not have undefined
 896 behavior and has no effects besides calculating its result, but is still
 897 affected by the set of threads executing this function. This typically
 898 prevents speculation of calls to the function unless the constraint imposed
 899 by ``convergent`` is further relaxed by some other means.
 900
 901 Controlled Maximal Convergence
 902 ==============================
 903
 904 The :ref:`converged-with relation <convergence-definition>` over dynamic
 905 instances of each controlled convergent operation is completely defined by the
 906 semantics of convergence tokens. But the implementation-defined convergence at a
 907 call to :ref:`llvm.experimental.convergence.anchor
 908 <llvm.experimental.convergence.anchor>` also depends on the cycle hierarchy
 909 chosen if it occurs inside an irreducible cycle.
 910
 911 When the token defined by a convergent operation ``D`` is used at another
 912 convergent operation ``U``, the implementation must ensure that the threads that
 913 converge at ``U`` are all the threads that reached ``U`` after converging at
 914 ``D``. On most implementations, it is reasonable to assume that only these
 915 threads are converged at every node they reach on any path from ``D`` to ``U``.
 916 In other words, the converged-with relation at ``D`` produces groups of threads
 917 that can converge only within each group, while inside the convergence region of
 918 ``D``.
 919
 920 All this affects the :ref:`maximal converged-with relation
 921 <convergence-maximal>` over dynamic instances and in turn the :ref:`m-converged
 922 property <uniformity-analysis>` of static instances in the convergence region of
 923 ``D``.
 924
 925 .. _controlled_maximal_converged_with:
 926
 927   **Controlled Maximal converged-with Relation**
 928
 929   1. Dynamic instances of a *convergent operation* are related in the controlled
 930      maximal converged-with relation according to the semantics of the convergence
 931      control tokens.
 932   2. Dynamic instances ``X1`` and ``X2`` produced by different threads for the
 933      same *non-convergent operation* ``X`` are related in the controlled maximal
 934      converged-with relation if and only if:
 935
 936      1. Both threads executed converged dynamic instances of every token
 937         definition ``D`` such that ``X`` is in the convergence region of ``D``,
 938         and,
 939      2. Either ``X`` is not contained in any cycle, or, for every cycle ``C``
 940         with header ``H`` that contains ``X``:
 941
 942         - every dynamic instance ``H1`` of ``H`` that precedes ``X1`` in the
 943           respective thread is convergence-before ``X2``, and,
 944         - every dynamic instance ``H2`` of ``H`` that precedes ``X2`` in the
 945           respective thread is convergence-before ``X1``,
 946         - without assuming that ``X1`` is converged with ``X2``.
 947
 948 .. _controlled_m_converged:
 949
 950   **Controlled m-converged Static Instances**
 951
 952   A node ``X`` in a given CFG is reported to be m-converged if and only if:
 953
 954   1. For any token definition ``D`` such that ``X`` is inside the convergence region
 955      of ``D``, ``D`` itself is m-converged, and,
 956   2. Every cycle that contains ``X`` satisfies the following necessary
 957      conditions:
 958
 959      a. Every divergent branch inside the cycle satisfies the :ref:`diverged
 960         entry criterion<convergence-diverged-entry>`, and,
 961      b. There are no :ref:`diverged paths reaching the
 962         cycle<convergence-diverged-outside>` from a divergent branch outside it.
 963
 964 Temporal Divergence at Cycle Exit
 965 ---------------------------------
 966
 967 When a cycle has a divergent exit, maximal convergence assumes that all threads
 968 converge at the exit block. But if a controlled convergent operation outside the
 969 cycle uses a token defined by an operation ``D`` inside the cycle, the
 970 convergence region of ``D`` now extends outside the cycle. If two threads
 971 executed converged dynamic instances of ``D`` before exiting the cycle, then
 972 they continue to execute converged dynamic instances of nodes in the convergence
 973 region of ``D`` outside the cycle. Thus, for a value ``V`` defined inside the
 974 cycle, any use ``U`` of ``V`` within the convergence region of ``T`` uses the
 975 output of converged dynamic instances of ``V``. If ``V`` is uniform, then its
 976 use at such a ``U`` is also uniform. In other words, temporal divergence applies
 977 only to a use of ``V`` that is outside the convergence region of ``D``.
 978
 979 Rationales for Static rules about cycles
 980 ========================================
 981
 982 (This section is informative.)
 983
 984 .. note::
 985
 986    For convenience, we use the operator ``==`` to represent the relation
 987    ``converged-with`` and the operator ``!=`` to represent its negation.
 988
 989 Consider a loop with (incorrect!) convergence control as in the following
 990 pseudocode:
 991
 992 .. code-block:: llvm
 993
 994   ; WARNING: Example of incorrect convergence control!
 995
 996   %anchor = call token @llvm.experimental.convergence.anchor()
 997   for (;;) {
 998     ...
 999     call void @convergent.op() [ "convergencectrl"(token %anchor) ]
1000     ...
1001   }
1002
1003 This code is forbidden by the first static rule about cycles.
1004
1005 A first formal argument why we have to do this is that the dynamic rule for
1006 deciding whether two threads execute converged dynamic instances of
1007 ``@convergent.op`` leads to a logical contradiction in this code.
1008 Assume two threads execute converged dynamic instances of the anchor
1009 followed by two iterations of the loop. Thread 1 executes dynamic instances
1010 I1 and I2 of ``@convergent.op``, thread 2 executes dynamic instances J1 and J2.
1011 Using all the rules, we can deduce:
1012
1013 1. ``I1 != I2`` and ``J1 != J2`` by the basic rules of dynamic instances.
1014
1015 2. ``I1 == J1`` by the first dynamic rule about controlled convergent
1016    operations: both threads execute the same static instruction while using
1017    a convergence token value produced by converged dynamic instances of an
1018    instruction (the anchor).
1019
1020 3. ``I1 == J2`` by the same argument. Also, ``I2 == J1`` and ``I2 == J2``.
1021
1022    The fact that one may be *intuitively* tempted to think of ``I1`` and ``J2``
1023    as being executed in different loop iterations is completely irrelevant for
1024    the *formal* argument. There is no mechanism in LLVM IR semantics for
1025    forming associations between loop iterations in different threads, *except*
1026    for the rules defined in this document -- and the rules in this document
1027    require a loop heart intrinsic for talking about loop iterations.
1028
1029 4. By transitivity, we have ``I1 == I2`` and ``J1 == J2``. That is a
1030    contradiction.
1031
1032 This problem goes away by inserting a loop heart intrinsic as follows, which
1033 establishes a relationship between loop iterations across threads.
1034
1035 .. code-block:: llvm
1036
1037   %anchor = call token @llvm.experimental.convergence.anchor()
1038   for (;;) {
1039     %loop = call token @llvm.experimental.convergence.loop() [ "convergencectrl"(token %anchor) ]
1040     ...
1041     call void @convergent.op() [ "convergencectrl"(token %loop) ]
1042     ...
1043   }
1044
1045 In the same scenario of two threads executing converged dynamic instances of the
1046 anchor and then two iterations of the loop, the dynamic rule about loop heart
1047 intrinsics implies that both threads execute the converged dynamic instances of
1048 the loop heart intrinsic in their respective first iterations and then again in
1049 their respective second iterations of the loop.
1050
1051 This then implies that they execute converged dynamic instances ``I1 == J1`` of
1052 the ``@convergent.op`` in their first iterations and then
1053 ``I2 == J2`` in their second iterations. The rule is an "if and only if" rule,
1054 so it also implies that ``I1 != J2`` and ``I2 != J1``, because those executions
1055 see token values of ``%loop`` originating from non-converged dynamic
1056 instances of the loop intrinsic.
1057
1058 One may ask whether we could change the dynamic rule instead of adding the
1059 static rule about cycles. That is impractical due to deeper difficulties.
1060 Consider the following loop, again with incorrect convergence control:
1061
1062 .. code-block:: llvm
1063
1064   ; WARNING: Example of incorrect convergence control!
1065
1066   ; (A)
1067   %anchor = call token @llvm.experimental.convergence.anchor()
1068   for (;;) {
1069     ; (B)
1070     if (condition1) {
1071       ; (C)
1072       call void @convergent.op.1() [ "convergencectrl"(token %anchor) ]
1073     }
1074     ; (D)
1075     if (condition2) {
1076       ; (E)
1077       call void @convergent.op.2() [ "convergencectrl"(token %anchor) ]
1078     }
1079     ; (F)
1080   }
1081   ; (G)
1082
1083 Assume two threads execute converged dynamic instances of the anchor followed
1084 by this sequence of basic blocks:
1085
1086 .. code-block:: text
1087
1088   Thread 1: A B C D F B D E F G
1089   Thread 2: A B D E F B C D F G
1090
1091 That is, both threads execute two iterations of the loop, but they execute
1092 the different convergent operations in different iterations. Without forming a
1093 relation between loop iterations across the threads, there is no reasonable way
1094 of defining which dynamic instances of the convergent operations should be the
1095 same across the threads, if any.
1096
1097 Again, this can be addressed by adding a loop heart intrinsic, most naturally
1098 as:
1099
1100 .. code-block:: llvm
1101
1102   ; (A)
1103   %anchor = call token @llvm.experimental.convergence.anchor()
1104   for (;;) {
1105     ; (B)
1106     %loop = call token @llvm.experimental.convergence.loop() [ "convergencectrl"(token %anchor) ]
1107     if (condition1) {
1108       ; (C)
1109       call void @convergent.op.1() [ "convergencectrl"(token %loop) ]
1110     }
1111     ; (D)
1112     if (condition2) {
1113       ; (E)
1114       call void @convergent.op.2() [ "convergencectrl"(token %loop) ]
1115     }
1116     ; (F)
1117   }
1118   ; (G)
1119
1120 Let ``%loop(i;j)`` be the dynamic instance of ``j``-th execution of the loop
1121 heart intrinsic by thread ``i``, and analogously ``@op.k(i)`` and ``@op.k(i)``
1122 the dynamic instances of the execution of ``@convergent.op.k`` by thread ``i``.
1123 Then we have:
1124
1125 1. ``%loop(1;j) == %loop(2;j)`` for ``j = 1, 2`` because of the dynamic rule
1126    about loop heart intrinsics.
1127
1128 2. ``%loop(i;1) != %loop(i;2)`` for ``i = 1, 2`` because of the basic rule that
1129    different executions by the same thread happen in different dynamic
1130    instances.
1131
1132 3. ``@op.1(1) != @op.1(2)``, since ``@op.1(1)`` uses the token value of ``%loop``
1133    referring to ``%loop(1;1)`` and ``@op.1(2)`` uses that
1134    referring to ``%loop(2;2) == %loop(1;2)``, which is different from
1135    ``%loop(1;1)``.
1136
1137 4. Similarly, ``@op.2(1) != @op.2(2)``.
1138
1139 However, loop heart intrinsics could be inserted differently, at the cost
1140 of also inserting a free-standing anchor:
1141
1142 .. code-block:: llvm
1143
1144   ; (A)
1145   %anchor = call token @llvm.experimental.convergence.anchor()
1146   for (;;) {
1147     ; (B)
1148     if (condition1) {
1149       ; (C)
1150       %loop = call token @llvm.experimental.convergence.loop() [ "convergencectrl"(token %anchor) ]
1151       call void @convergent.op.1() [ "convergencectrl"(token %loop) ]
1152     }
1153     ; (D)
1154     if (condition2) {
1155       ; (E)
1156       %free = call token @llvm.experimental.convergence.anchor()
1157       call void @convergent.op.2() [ "convergencectrl"(token %free) ]
1158     }
1159     ; (F)
1160   }
1161   ; (G)
1162
1163 This leads to the "unnatural counting of loop iterations" that is also mentioned
1164 elsewhere. Let ``%loop(i)`` be the dynamic instance of the execution of the
1165 loop heart intrinsic by thread ``i`` (each thread executes it only once), and
1166 let ``@op.k(i)`` be as before. Then:
1167
1168 1. ``%loop(1) == %loop(2)`` because of the dynamic rule about loop heart
1169    intrinsics.
1170
1171 2. ``@op.1(1) == @op.1(2)`` because ``@op.1(i)`` uses the value of ``%loop``
1172    referring to ``%loop(i)``, and ``%loop(1) == %loop(2)``.
1173
1174 3. Whether ``@op.2(1) == @op.2(2)`` is implementation-defined because of the
1175    use of the ``%free`` anchor intrinsic.
1176
1177    In practice, they almost certainly have to be non-converged dynamic
1178    instances. Consider that if an implementation strictly follows the order of
1179    instructions given in the program, the executions of the threads can be
1180    "aligned" as follows:
1181
1182    .. code-block:: text
1183
1184      Thread 1: A B         C D F B D E F G
1185      Thread 2: A B D E F B C D F         G
1186
1187    So then ``@op.2(1)`` physically executes later than ``@op.2(2)`` and there
1188    can be no communication between the threads, which means they execute
1189    non-converged dynamic instances.
1190
1191    That said, it is conceivable that there aren't actually any data or other
1192    dependencies that would enforce this execution order. In that case, a highly
1193    out-of-order implementation could potentially allow communication. That's
1194    why the rules defined in this document are silent about whether
1195    ``@op.2(1) == @op.2(2)`` or not.
1196
1197 This type of convergence control seems relatively unlikely to appear in real
1198 programs. Its possibility is simply a logical consequence of the model.
1199
1200 An equivalent issue arises if the convergent operations are replaced by nested
1201 loops with loop heart intrinsics that directly refer to ``%anchor``, hence
1202 the variants of the static rules about cycles that apply to them:
1203
1204 .. code-block:: llvm
1205
1206   ; WARNING: Example of incorrect convergence control!
1207
1208   %anchor = call token @llvm.experimental.convergence.anchor()
1209   for (;;) {
1210     if (condition1) {
1211       for (;;) {
1212         %loop1 = call token @llvm.experimental.convergence.loop() [ "convergencectrl"(token %anchor) ]
1213       }
1214     }
1215     if (condition2) {
1216       for (;;) {
1217         %loop2 = call token @llvm.experimental.convergence.loop() [ "convergencectrl"(token %anchor) ]
1218       }
1219     }
1220   }
1221
1222 There is a cycle (closed walk in the CFG) that goes through both loop heart
1223 intrinsics using ``%anchor`` but not through the definition of ``%anchor``,
1224 so this code is invalid.
1225
1226
1227 Examples for the Correctness of Program Transforms
1228 ==================================================
1229
1230 (This section is informative.)
1231
1232 As implied by the rules in the previous sections, program transforms are correct
1233 with respect to convergent operations if they preserve or refine their
1234 semantics. This means that the set of communicating threads in the transformed
1235 program must have been possible in the original program.
1236
1237 Program transforms with a single-threaded focus are generally conservatively
1238 correct if they do not sink or hoist convergent operations across a branch.
1239 This applies even to program transforms that change the control flow graph.
1240
1241 For example, unrolling a loop that does not contain convergent operations
1242 cannot break any of the guarantees required for convergent operations outside
1243 of the loop.
1244
1245
1246 Loop unrolling examples
1247 -----------------------
1248
1249 We consider three kinds of loop unrolling here:
1250
1251 * Partial unrolling with no known trip multiple, so a "tail" is required to
1252   collect the remaining elements.
1253 * Partial unrolling by a trip multiple, so no "tail" is required.
1254 * Full unrolling, which eliminates the loop.
1255
1256 The first kind is forbidden when ``@llvm.experimental.convergence.loop`` is
1257 used. We illustrate the reasoning with some examples.
1258
1259 First, an arbitrary loop that contains convergent operations *can* be unrolled
1260 in all of these ways, even with "tail", if all convergent operations refer back
1261 to an anchor inside the loop. For example (in pseudo-code):
1262
1263 .. code-block:: llvm
1264
1265   while (counter > 0) {
1266     %tok = call token @llvm.experimental.convergence.anchor()
1267     call void @convergent.operation() [ "convergencectrl"(token %tok) ]
1268     counter--;
1269   }
1270
1271 This can be unrolled to:
1272
1273 .. code-block:: llvm
1274
1275   while (counter >= 2) {
1276     %tok = call token @llvm.experimental.convergence.anchor()
1277     call void @convergent.operation() [ "convergencectrl"(token %tok) ]
1278     %tok = call token @llvm.experimental.convergence.anchor()
1279     call void @convergent.operation() [ "convergencectrl"(token %tok) ]
1280     counter -= 2;
1281   }
1282   while (counter > 0) {
1283     %tok = call token @llvm.experimental.convergence.anchor()
1284     call void @convergent.operation() [ "convergencectrl"(token %tok) ]
1285     counter--;
1286   }
1287
1288 This is likely to change the behavior of the convergent operation if there
1289 are threads whose initial counter value is not a multiple of 2. In particular,
1290 all threads with an odd trip count are now likely to execute the convergent
1291 operation in their respective final iterations together because the
1292 underlying implementation is likely to try to group as many threads together
1293 as possible for the execution of the "tail".
1294
1295 This change is allowed because the anchor intrinsic has implementation-defined
1296 convergence behavior and the loop unrolling transform is considered to be part
1297 of the implementation. Another way of reasoning is that while the *likely*
1298 behavior of the code has changed, the *guarantees* about its behavior have
1299 remained the same.
1300
1301 If the loop contains uncontrolled convergent operations, this kind of unrolling
1302 is forbidden.
1303
1304 Unrolling a loop with convergent operations that refer to tokens produced
1305 outside the loop is forbidden when a "tail" or "remainder" would have to
1306 be introduced. Consider:
1307
1308 .. code-block:: llvm
1309
1310   ; (A)
1311   %outer = call token @llvm.experimental.convergence.anchor()
1312   while (counter > 0) {
1313     %inner = call token @llvm.experimental.convergence.loop() [ "convergencectrl"(token %outer) ]
1314     ; (B)
1315     call void @convergent.operation() [ "convergencectrl"(token %inner) ]
1316     counter--;
1317   }
1318   ; (C)
1319
1320 To understand why unrolling is forbidden, consider two threads that execute
1321 converged dynamic instances of the anchor and then proceed with 3 and 4 loop
1322 iterations, respectively:
1323
1324 .. code-block:: text
1325
1326   Thread 1: A B B B C
1327   Thread 2: A B B B B C
1328
1329 By the dynamic rule on loop heart intrinsics, these threads execute converged
1330 dynamic instances of the loop intrinsic for the first 3 iterations, and then
1331 thread 2 executes another dynamic instance by itself.
1332
1333 By the dynamic rule on general convergent operations, the threads execute
1334 converged dynamic instances of the ``@convergent.operation`` in the first 3
1335 iterations (that is, the dynamic instance executed by thread 1 in iteration
1336 *n* is the same as that executed by thread 2 in iteration *n*, for *n = 1,2,3*;
1337 the dynamic instance executed in iteration 1 is different from that in
1338 iteration 2, etc.).
1339
1340 Now assume that the loop is unrolled by a factor of 2, which requires a
1341 remainder as follows:
1342
1343 .. code-block:: llvm
1344
1345   ; (A)
1346   %outer = call token @llvm.experimental.convergence.anchor()
1347   while (counter >= 2) {
1348     %inner = call token @llvm.experimental.convergence.loop() [ "convergencectrl"(token %outer) ]
1349     ; (B)
1350     call void @convergent.operation() [ "convergencectrl"(token %inner) ]
1351     call void @convergent.operation() [ "convergencectrl"(token %inner) ]
1352     counter -= 2;
1353   }
1354   ; (C)
1355   if (counter > 0) {
1356     %remainder = call token @llvm.experimental.convergence.loop() [ "convergencectrl"(token %outer) ]
1357     ; (D)
1358     call void @convergent.operation() [ "convergencectrl"(token %remainder) ]
1359   }
1360   ; (E)
1361
1362 First of all, note some interesting problems surrounding the loop intrinsic:
1363
1364 1. It is *not* duplicated inside the unrolled loop. This is to comply with
1365    the :ref:`convergence_static_rules`.
1366
1367 2. It is unclear whether the loop intrinsic ought to be duplicated in the
1368    remainder, or whether the final ``@convergent.operation`` in D should just
1369    refer to either ``%inner`` (which is possible in SSA form) or directly to
1370    ``%outer``. The decision made here is arbitrary and doesn't change the
1371    argument that follows. Ultimately, it simply doesn't matter because the
1372    transform is incorrect either way.
1373
1374 The threads now execute the following sequences of blocks:
1375
1376 .. code-block:: text
1377
1378   Thread 1: A B C D E
1379   Thread 2: A B B C D E
1380
1381 Analogous to the argument above, they execute converged dynamic instances of the
1382 ``%inner`` intrinsic and the ``@convergent.operation`` in the first iteration
1383 of the unrolled loop, which corresponds to the first 2 iterations of the
1384 original loop.
1385
1386 However, they execute different static calls to ``@convergent.operation`` for
1387 the 3rd iteration of the original loop. In thread 1, that iteration corresponds
1388 to the call in the remainder, while in thread 2 it corresponds to the first
1389 call to ``@convergent.operation`` in the unrolled loop. Therefore, they execute
1390 non-converged dynamic instances, which means that the set of communicating threads
1391 for the 3rd iteration of the original loop is different. This is why the
1392 unrolling is incorrect.
1393
1394 On the other hand, unrolling without "tail" is allowed. For example, assuming
1395 that the trip counter is known to be a multiple of 2, we can unroll the loop
1396 as follows:
1397
1398 .. code-block:: llvm
1399
1400   %outer = call token @llvm.experimental.convergence.anchor()
1401   while (counter > 0) {
1402     %inner = call token @llvm.experimental.convergence.loop() [ "convergencectrl"(token %outer) ]
1403     call void @convergent.operation() [ "convergencectrl"(token %inner) ]
1404     call void @convergent.operation() [ "convergencectrl"(token %inner) ]
1405     counter -= 2;
1406   }
1407
1408 Note again that the loop intrinsic is not duplicated.
1409
1410 The
1411 :ref:`llvm.experimental.convergence.loop <llvm.experimental.convergence.loop>`
1412 intrinsic is typically expected to appear in the header of a natural loop.
1413 However, it can also appear in non-header blocks of a loop. In that case, the
1414 loop can generally not be unrolled.
1415
1416
1417 Hoisting and sinking
1418 --------------------
1419
1420 In general, hoisting and sinking of convergent operations is forbidden. This is
1421 because moving the operation to a different point in control flow generally
1422 changes the set of threads that reach the operation and therefore, the set of
1423 threads that execute converged dynamic instances of the operation. By
1424 definition, this changes the set of threads that participate in the
1425 communication of the convergent operation, which will typically change its
1426 result.
1427
1428 There are a number of exceptions, though most of them require additional
1429 knowledge.
1430
1431 For example, hoisting and sinking across *uniform* conditional branches -- i.e.,
1432 conditional branches where within every possible relevant set of threads, all
1433 threads will always take the same direction -- is generally allowed. See the end
1434 of the :ref:`example of reductions inside control flow
1435 <convergence_example_reductions>` for a brief discussion.
1436
1437 Some convergent operations can be hoisted but not sunk, or vice versa. A simple
1438 example is the ``subgroupShuffle(data, id)`` operation. It returns the ``data``
1439 operand of the thread identified by ``id``, where thread IDs are fixed and
1440 assigned to each thread at launch. The result is undefined (or perhaps there is
1441 UB, depending on the language and environment) if thread ``id`` is not in the
1442 communicating set of threads. So hoisting is allowed in the following
1443 pseudo-code example:
1444
1445 .. code-block:: llvm
1446
1447   define void @example(...) convergent {
1448     %entry = call token @llvm.experimental.convergence.entry()
1449     %data = ...
1450     %id = ...
1451     if (condition) {
1452       %shuffled = call i32 @subgroupShuffle(i32 %data, i32 %id) [ "convergencectrl"(token %entry) ]
1453       ...
1454     } else {
1455       %shuffled = call i32 @subgroupShuffle(i32 %data, i32 %id) [ "convergencectrl"(token %entry) ]
1456       ...
1457     }
1458   }
1459
1460 After hoisting the calls to ``@subgroupShuffle``, the communicating set of
1461 threads is the union of the two sets of threads in the original program, so
1462 ``%id`` can only go "out of range" after hoisting if it did so in the original
1463 program.
1464
1465 However, speculative execution of ``@subgroupShuffle`` in the following program
1466 may be forbidden:
1467
1468 .. code-block:: llvm
1469
1470   define void @example(...) convergent {
1471     %entry = call token @llvm.experimental.convergence.entry()
1472     %data = ...
1473     %id = ...
1474     if (condition) {
1475       %shuffled = call i32 @subgroupShuffle(i32 %data, i32 %id) [ "convergencectrl"(token %entry) ]
1476       ...
1477     }
1478   }
1479
1480 There is no guarantee about the value of ``%id`` in the threads where
1481 ``condition`` is false. If ``@subgroupShuffle`` is defined to have UB when
1482 ``%id`` is outside of the set of communicating threads, then speculating and
1483 hoisting ``@subgroupShuffle`` might introduce UB.
1484
1485 On the other hand, if ``@subgroupShuffle`` is defined such that it merely
1486 produces an undefined value or poison as result when ``%id`` is "out of range",
1487 then speculating is okay.
1488
1489 Even though
1490 :ref:`llvm.experimental.convergence.anchor <llvm.experimental.convergence.anchor>`
1491 is marked as ``convergent``, it can be sunk in some cases. For example, in
1492 pseudo-code:
1493
1494 .. code-block:: llvm
1495
1496   %tok = call token @llvm.experimental.convergence.anchor()
1497   if (condition) {
1498     call void @convergent.operation() [ "convergencectrl"(token %tok) ]
1499   }
1500
1501 Assuming that ``%tok`` is only used inside the conditional block, the anchor can
1502 be sunk. The rationale is two-fold. First, the anchor has implementation-defined
1503 behavior, and the sinking is part of the implementation. Second, already in the
1504 original program, the set of threads that communicates in the
1505 ``@convergent.operation`` is automatically subset to the threads for which
1506 ``condition`` is true.
1507
1508 Anchors can be hoisted in acyclic control flow. For example:
1509
1510 .. code-block:: llvm
1511
1512   if (condition) {
1513     %tok1 = call token @llvm.experimental.convergence.anchor()
1514     call void @convergent.operation() [ "convergencectrl"(token %tok1) ]
1515   } else {
1516     %tok2 = call token @llvm.experimental.convergence.anchor()
1517     call void @convergent.operation() [ "convergencectrl"(token %tok2) ]
1518   }
1519
1520 The anchors can be hoisted, resulting in:
1521
1522 .. code-block:: llvm
1523
1524   %tok = call token @llvm.experimental.convergence.anchor()
1525   if (condition) {
1526     call void @convergent.operation() [ "convergencectrl"(token %tok) ]
1527   } else {
1528     call void @convergent.operation() [ "convergencectrl"(token %tok) ]
1529   }
1530
1531 The behavior is unchanged, since each of the static convergent operations only
1532 ever communicates with threads that have the same ``condition`` value.
1533 By contrast, hoisting the convergent operations themselves is forbidden.
1534
1535 Hoisting and sinking anchors out of and into loops is forbidden. For example:
1536
1537 .. code-block:: llvm
1538
1539   for (;;) {
1540     %tok = call token @llvm.experimental.convergence.anchor()
1541     call void @convergent.operation() [ "convergencectrl"(token %tok) ]
1542   }
1543
1544 Hoisting the anchor would make the program invalid according to the static
1545 validity rules. Conversely:
1546
1547 .. code-block:: llvm
1548
1549   %outer = call token @llvm.experimental.convergence.anchor()
1550   while (counter > 0) {
1551     %inner = call token @llvm.experimental.convergence.loop() [ "convergencectrl"(token %outer) ]
1552     call void @convergent.operation() [ "convergencectrl"(token %inner) ]
1553     counter--;
1554   }
1555
1556 The program would stay valid if the anchor was sunk into the loop, but its
1557 behavior could end up being different. If the anchor is inside the loop, then
1558 each loop iteration has a new dynamic instance of the anchor, and the set of
1559 threads participating in those dynamic instances of the anchor could be
1560 different in arbitrary implementation-defined ways. Via the dynamic rules about
1561 dynamic instances of convergent operations, this then implies that the set of
1562 threads executing ``@convergent.operation`` could be different in each loop
1563 iteration in arbitrary implementation-defined ways.
1564
1565 Convergent operations can be sunk together with their anchor. Again in
1566 pseudo-code:
1567
1568 .. code-block:: llvm
1569
1570   %tok = call token @llvm.experimental.convergence.anchor()
1571   %a = call T @pure.convergent.operation(...) [ "convergencectrl"(token %tok) ]
1572   %b = call T @pure.convergent.operation(...) [ "convergencectrl"(token %tok) ]
1573   if (condition) {
1574     use(%a, %b)
1575   }
1576
1577 Assuming that ``%tok``, ``%a``, and ``%b`` are only used inside the conditional
1578 block, all can be sunk together:
1579
1580 .. code-block:: llvm
1581
1582   if (condition) {
1583     %tok = call token @llvm.experimental.convergence.anchor()
1584     %a = call T @pure.convergent.operation(...) [ "convergencectrl"(token %tok) ]
1585     %b = call T @pure.convergent.operation(...) [ "convergencectrl"(token %tok) ]
1586     use(%a, %b)
1587   }
1588
1589 The rationale is that the anchor intrinsic has implementation-defined behavior,
1590 and the sinking transform is considered to be part of the implementation:
1591 the sinking will restrict the set of communicating threads to those for which
1592 ``condition`` is true, but that could have happened in the original program
1593 anyway for some arbitrary other reason.
1594
1595 However, sinking *only* the convergent operation producing ``%b`` would be
1596 incorrect. That would allow threads for which ``condition`` is false to
1597 communicate at ``%a``, but not at ``%b``, which the original program doesn't
1598 allow.
1599
1600 Note that the entry intrinsic behaves differently. Sinking the convergent
1601 operations is forbidden in the following snippet:
1602
1603 .. code-block:: llvm
1604
1605   %tok = call token @llvm.experimental.convergence.entry()
1606   %a = call T @pure.convergent.operation(...) [ "convergencectrl"(token %tok) ]
1607   %b = call T @pure.convergent.operation(...) [ "convergencectrl"(token %tok) ]
1608   if (condition) {
1609     use(%a, %b)
1610   }