llvm/docs/ConvergentOperations.rst

   1 ==============================
   2 Convergent Operation Semantics
   3 ==============================
   4
   5 .. contents::
   6    :local:
   7    :depth: 4
   8
   9 Overview
  10 ========
  11
  12 Some parallel execution environments execute threads in groups that allow
  13 efficient communication within the group using special primitives called
  14 *convergent* operations. The outcome of a convergent operation is sensitive to
  15 the set of threads that executes it "together", i.e., convergently. When control
  16 flow :ref:`diverges <convergence-and-uniformity>`, i.e. threads of the same
  17 group follow different
  18 paths through the CFG, not all threads of the group may be available to
  19 participate in this communication. This is the defining characteristic that
  20 distinguishes convergent operations from other inter-thread communication:
  21
  22   A convergent operation involves inter-thread communication or synchronization
  23   that occurs outside of the memory model, where the set of threads which
  24   participate in communication is implicitly affected by control flow.
  25
  26 For example, in the following GPU compute kernel, communication during the
  27 convergent operation is expected to occur precisely among those threads of an
  28 implementation-defined execution scope (such as workgroup or subgroup) for
  29 which ``condition`` is true:
  30
  31 .. code-block:: c++
  32
  33   void example_kernel() {
  34       ...
  35       if (condition)
  36           convergent_operation();
  37       ...
  38   }
  39
  40 In structured programming languages, there is often an intuitive and
  41 unambiguous way of determining the threads that are expected to communicate.
  42 However, this is not always the case even in structured programming languages,
  43 and the intuition breaks down entirely in unstructured control flow. This
  44 document describes the formal semantics in LLVM, i.e. how to determine the set
  45 of communicating threads for convergent operations.
  46
  47 The definitions in this document leave many details open, such as how groups of
  48 threads are formed in the first place. It focuses on the questions that are
  49 relevant for deciding the correctness of generic program transforms and
  50 convergence-related analyses such as :ref:`uniformity analysis
  51 <convergence-and-uniformity>`.
  52
  53 .. _convergent_operations:
  54
  55 Convergent Operations
  56 =====================
  57
  58 In LLVM IR, the only way to communicate between threads as described
  59 above is by calling target-defined convergent intrinsics. Hence, only
  60 a call-site in LLVM IR (a :ref:`call <i_call>`, :ref:`invoke
  61 <i_invoke>`, or :ref:`callbr <i_callbr>` instruction) can result in a
  62 convergent operation.
  63
  64 A function in LLVM IR is said to be *convergent* if it has the
  65 :ref:`convergent <attr_convergent>` attribute.
  66
  67 A call-site in LLVM IR is said to be *convergent* if it is a direct
  68 call to a convergent function or it has the :ref:`convergent
  69 <attr_convergent>` attribute or a :ref:`convergencectrl operand bundle
  70 <convergencectrl>`.
  71
  72 Informational notes:
  73
  74   A function may have to be treated as convergent if that function, or
  75   transitively, any function called from it, contains a convergent call-site. A
  76   frontend generating the ``convergent`` attribute should take this into account
  77   when emitting functions and function calls. But this is not always the case:
  78
  79   A non-convergent function may contain convergent operations; such operations
  80   do not directly depend on the set of threads that enter the function as a
  81   single communicating group. Instead, these operations depend on an
  82   implementation-defined subset of threads within the body of the function, as
  83   shown in :ref:`opportunistic_convergence`.
  84
  85 Examples of Convergent Operations
  86 ========================================
  87
  88 (This section is informative.)
  89
  90 Texture sampling in a pixel shader
  91 ----------------------------------
  92
  93 The following stylized pixel shader samples a texture at a given set of
  94 coordinates, using the builtin function `textureSample`. Texture sampling
  95 requires screen-space derivatives of the coordinates to determine the level of
  96 detail (mipmap level) of the sample. They are commonly approximated by taking
  97 the difference between neighboring pixels, which are computed by different
  98 threads in the same group:
  99
 100 .. code-block:: c++
 101
 102   void example_shader() {
 103     ...
 104     color = textureSample(texture, coordinates);
 105     if (condition) {
 106       use(color);
 107     }
 108     ...
 109   }
 110
 111 From a purely single-threaded perspective, sinking the `textureSample` into
 112 the if-statement appears legal. However, if the condition is false for some
 113 neighboring pixels, then their corresponding threads will not execute together
 114 in the group, making it impossible to take the difference of coordinates as an
 115 approximation of the screen-space derivative. In practice, the outcome will be
 116 an undefined value.
 117
 118 That is, the `textureSample` operation fits our definition of a convergent
 119 operation:
 120
 121  1. It communicates with a set of threads that implicitly depends on control
 122     flow.
 123  2. Correctness depends on this set of threads.
 124
 125 The compiler frontend can emit IR that expresses the convergence constraints as
 126 follows:
 127
 128 .. code-block:: llvm
 129
 130   define void @example_shader() convergent {
 131     %entry = call token @llvm.experimental.convergence.entry()
 132     ...
 133     %color = call T @textureSample(U %texture, V %coordinates) [ "convergencectrl"(token %entry) ]
 134     br i1 %condition, label %then, label %end
 135
 136   then:
 137     call void @use(T %color)
 138     br label %end
 139
 140   end:
 141     ret void
 142   }
 143
 144 The :ref:`llvm.experimental.convergence.entry <llvm.experimental.convergence.entry>`
 145 intrinsic is itself ``convergent``, and we expect it to communicate at least
 146 among all threads of the same "quad" -- a group of 2x2 pixels that are
 147 evaluated together for the purpose of approximating screen-space derivatives.
 148 This fact is not part of the generic LLVM IR semantics; it would have to be
 149 defined somewhere else, for example as part of target-specific ABI definitions
 150 and/or in reference to some relevant API specs.
 151
 152 Since the ``@textureSample`` call then uses the token produced by the entry
 153 intrinsic in its ``convergencectrl`` bundle, and has no additional control
 154 dependencies, it must communicate among the same set of threads. This indicates
 155 to generic program transforms that sinking the ``@textureSample`` call is
 156 forbidden. (A program transform can still sink the call if it can prove somehow,
 157 e.g. by leaning on target-specific callbacks that can analyze the program with
 158 additional knowledge, that ``%condition`` is always uniform across the threads
 159 referenced by the *convergence token* ``%entry``.)
 160
 161 .. _convergence_example_reductions:
 162
 163 Reductions inside divergent control flow
 164 ----------------------------------------
 165
 166 The following example shows that merging common code of branches can be
 167 incorrect in the face of convergent operations:
 168
 169 .. code-block:: c++
 170
 171   void example_kernel() {
 172     delta = ...
 173     if (delta > 0) {
 174       total_gains = subgroupAdd(delta);
 175       ...
 176     } else {
 177       total_losses = subgroupAdd(delta);
 178       ...
 179     }
 180   }
 181
 182 The ``subgroupAdd`` computing the ``total_gains`` will be executed by the
 183 subset of threads with positive ``delta`` in a subgroup (wave), and so will sum
 184 up all the ``delta`` values of those threads; and similarly for the
 185 ``subgroupAdd`` that computes the ``total_losses``.
 186
 187 If we were to hoist and merge the ``subgroupAdd`` above the if-statement, it
 188 would sum up the ``delta`` across *all* threads instead.
 189
 190 The compiler frontend can emit IR that expresses the convergence constraints
 191 as follows:
 192
 193 .. code-block:: llvm
 194
 195   define void @example_kernel() convergent {
 196     %entry = call token @llvm.experimental.convergence.entry()
 197     %delta = ...
 198     %cc = icmp sgt i32 %delta, 0
 199     br i1 %cc, label %then, label %else
 200
 201   then:
 202     %total_gains = call i32 @subgroupAdd(i32 %delta) [ "convergencectrl"(token %entry) ]
 203     ...
 204     br label %end
 205
 206   else:
 207     %total_losses = call i32 @subgroupAdd(i32 %delta) [ "convergencectrl"(token %entry) ]
 208     ...
 209     br label %end
 210
 211   end:
 212     ...
 213   }
 214
 215 The entry intrinsic behaves like in the previous example: assuming that
 216 ``@example_kernel`` is an OpenCL kernel (as hinted at by the "subgroup"
 217 terminology), we expect it to communicate among all threads within the
 218 "subgroup". This typically maps to a SIMD vector on GPU hardware.
 219
 220 The calls to ``@subgroupAdd`` use the token produced by the entry intrinsic,
 221 but they also have an additional control dependency. According to the rules
 222 defined in this document, they only communicate among the subset of threads
 223 that actually end up executing the respective (static) call site.
 224
 225 Hoisting them would remove the control dependency and cause them to communicate
 226 among the full set of threads that the entry intrinsic communicated with.
 227 Again, hoisting is allowed if it can be proven that ``%cc`` is always uniform
 228 among the relevant set of threads: in that case, the ``@subgroupAdd`` already
 229 communicates among the full set of threads in the original program.
 230
 231 Motivating Examples of Convergence Control
 232 ==========================================
 233
 234 (This section is informative.)
 235
 236 Unstructured control flow
 237 -------------------------
 238
 239 Consider an example of how jump threading removes structure in a way that can
 240 make semantics non-obvious without the convergence intrinsics described in this
 241 document:
 242
 243 .. code-block:: llvm
 244
 245   void example_original() {
 246   entry:
 247       ...
 248       br i1 %cond1, label %then1, label %mid
 249
 250   then1:
 251       ...
 252       %cond2 = ...
 253       br label %mid
 254
 255   mid:
 256       %flag = phi i1 [ true, %entry ], [ %cond2, %then1 ]
 257       br i1 %flag, label %then2, label %end
 258
 259   then2:
 260       ...
 261       call void @subgroupControlBarrier()
 262       ...
 263       br label %end
 264
 265   end:
 266   }
 267
 268   void example_jumpthreaded() {
 269   entry:
 270       ...
 271       br i1 %cond1, label %then1, label %then2
 272
 273   then1:
 274       ...
 275       %cond2 = ...
 276       br i1 %cond2, label %then2, label %end
 277
 278   then2:
 279       ...
 280       call void @subgroupControlBarrier()
 281       ...
 282       br label %end
 283
 284   end:
 285   }
 286
 287 Is the control barrier guaranteed to synchronize among the same set of threads
 288 in both cases? Different implementations in the literature may give different
 289 answers to this question:
 290
 291 * In an implementation that reconverges at post-dominators, threads reconverge
 292   at ``mid`` in the first version, so that all threads (within a subgroup/wave)
 293   that execute the control barrier do so together. In the second version,
 294   threads that reach the control barrier via different paths synchronize
 295   separately: the first (and only) post-dominator is ``end``, so threads do not
 296   reconverge before then.
 297
 298 * An implementation that sorts basic blocks topologically and ensures maximal
 299   reconvergence for each basic block would behave the same way in both
 300   versions.
 301
 302 We generally take the stance that reconvergence in acyclic control flow must
 303 be maximal. The compiler frontend could augment the original code as follows:
 304
 305 .. code-block:: llvm
 306
 307   define void @example_original() convergent {
 308   entry:
 309     %entry = call token @llvm.experimental.convergence.entry()
 310     ...
 311     br i1 %cond1, label %then1, label %mid
 312
 313   then1:
 314     ...
 315     %cond2 = ...
 316     br label %mid
 317
 318   mid:
 319     %flag = phi i1 [ true, %entry ], [ %cond2, %then1 ]
 320     br i1 %flag, label %then2, label %end
 321
 322   then2:
 323     ...
 324     call void @subgroupControlBarrier() [ "convergencectrl"(token %entry) ]
 325     ...
 326     br label %end
 327
 328   end:
 329   }
 330
 331 If S is the set of threads that the entry intrinsic communicated with, then
 332 the ``@subgroupControlBarrier`` call communicates with the subset of S that
 333 actually reaches the call site. This set of threads doesn't change after
 334 jump-threading, so the answer to the question posed above remains the same.
 335
 336 .. _opportunistic_convergence:
 337
 338 Opportunistic convergent operations
 339 -----------------------------------
 340
 341 Some programs have local regions of code that contain a sequence of convergent
 342 operations where the code does not care about the exact set of threads with
 343 which it is executed, but only that the set of threads is the same for all the
 344 operations within the sequence. (If a subset of the convergent operations in the
 345 sequence have additional, non-uniform control dependencies, then this is not
 346 possible. However, the code may still require that the sets of threads are
 347 logically consistent with the conditions of those control dependencies.) In this
 348 case, :ref:`llvm.experimental.convergence.anchor
 349 <llvm.experimental.convergence.anchor>` can be used to express the desired
 350 semantics.
 351
 352 The following example function could be part of a hypothetical "append buffer"
 353 implementation, where threads conditionally write fixed-sized records
 354 contiguously into a global buffer. The function ``@reserveSpaceInBuffer``
 355 returns the index into the buffer at which the calling thread should store its
 356 data.
 357
 358 This could be achieved by using a simple atomic operation in every thread to
 359 bump an allocation counter.
 360
 361 However, the following implementation can be more performant on some hardware,
 362 because it uses only a single atomic operation for an entire group of threads.
 363 To do this, it first determines the total size of the group, which will be the
 364 operand to the atomic operation, and then later broadcasts the result of the
 365 atomic operation to all threads of the group, so that each thread can compute
 366 its individual position in the buffer:
 367
 368 .. code-block:: llvm
 369
 370   define i32 @reserveSpaceInBuffer() {    ; NOTE: _not_ a convergent function!
 371   entry:
 372     %anchor = call token @llvm.experimental.convergence.anchor()
 373
 374     %ballot = call i64 @subgroupBallot(i1 true) [ "convergencectrl"(token %anchor) ]
 375     %numThreads.p = call i64 @llvm.ctpop.i64(i64 %ballot)
 376     %numThreads = trunc i64 %numThreads.p to i32
 377
 378     %absoluteThreadIdx = call i32 @getSubgroupLocalInvocationId()
 379     %absoluteThreadIdx.ext = zext i32 %absoluteThreadIdx to i64
 380     %mask.p = shl i64 1, %absoluteThreadIdx.ext
 381     %mask = sub i64 %mask.p, 1
 382
 383     %maskedBallot = and i64 %ballot, %mask
 384     %relativeThreadIdx.p = call i64 @llvm.ctpop.i64(i64 %maskedBallot)
 385     %relativeThreadIdx = trunc i64 %relativeThreadIdx.p to i32
 386
 387     %isFirstThread = icmp eq i32 %relativeThreadIdx, 0
 388     br i1 %isFirstThread, label %then, label %end
 389
 390   then:
 391     %baseOffset.1 = atomicrmw add ptr @bufferAllocationCount, i32 %numThreads monotonic
 392     br label %end
 393
 394   end:
 395     %baseOffset.2 = phi i32 [ undef, %entry ], [ %baseOffset.1, %then ]
 396     %baseOffset = call i32 @subgroupBroadcastFirst(i32 %baseOffset.2) [ "convergencectrl"(token %anchor) ]
 397     %offset = add i32 %baseOffset, %relativeThreadIdx
 398     ret i32 %offset
 399   }
 400
 401 The key here is that the function really doesn't care which set of threads it
 402 is being called with. It takes whatever set of threads it can get. What the
 403 implementation of the function cares about is that the initial
 404 ``@subgroupBallot`` -- which is used to retrieve the bitmask of threads that
 405 executed the anchor together -- executes with the same set of threads as the
 406 final ``@subgroupBroadcastFirst``. Nothing else is required for correctness as
 407 far as convergence is concerned.
 408
 409 The function ``@reserveSpaceInBuffer`` itself is _not_ ``convergent``: callers
 410 are free to move call sites of the function as they see fit. This can change
 411 the behavior in practice, by changing the sets of threads that are grouped
 412 together for the atomic operation. This can be visible in the output of the
 413 program, since the order in which outputs appear in the buffer is changed.
 414 However, this does not break the overall contract that ``@reserveSpaceInBuffer``
 415 has with its caller -- which makes sense: the order of outputs is
 416 non-deterministic anyway because of the atomic operation that is involved.
 417
 418 If the function is inlined, the use of the anchor intrinsic similarly indicates
 419 that certain transforms which are usually forbidden by the presence of
 420 convergent operations are in fact allowed, as long as they don't break up the
 421 region of code that is controlled by the anchor.
 422
 423 .. _convergence_high-level_break:
 424
 425 Extended Cycles: Divergent Exit from a Loop
 426 -------------------------------------------
 427
 428 High-level languages typically provide a ``break`` statement that transfers
 429 control out of a loop statement. In most cases, the loop is structured and hence
 430 there is no ambiguity about convergence inside the loop. But an ambiguity arises
 431 when a ``break`` is control dependent on a divergent condition inside the loop.
 432 Consider the following example:
 433
 434 .. code-block:: c++
 435
 436   void example() {
 437     // A
 438     ...
 439     for (...) {
 440       // B
 441       if (condition) { // divergent condition
 442         // C
 443         convergent_op();
 444         break;
 445       }
 446       // D
 447       ...
 448     }
 449     // E
 450   }
 451
 452 In this program, the call to convergent_op() is lexically "inside" the ``for``
 453 loop. But when translated to LLVM IR, the basic block B is an exiting block
 454 ending in a divergent branch, and the basic block C is an exit of the loop.
 455 Thus, the call to convergent_op() is outside the loop. This causes a mismatch
 456 between the programmer's expectation and the compiled program. The call should
 457 be executed convergently on every iteration of the loop, by threads that
 458 together take the branch to exit the loop. But when compiled, all threads that
 459 take the divergent exit on different iterations first converge at the beginning
 460 of basic block C and then together execute the call to convergent_op().
 461
 462 In this case, :ref:`llvm.experimental.convergence.loop
 463 <llvm.experimental.convergence.loop>` can be used to express the desired
 464 semantics. A call to this intrinsic is placed in the loop header, which tracks
 465 each iteration of the loop. The token produced by this is used as a
 466 ``convergencectrl`` operand to the convergent call. The semantics of the
 467 ``loop`` intrinsic ensures that the convergent call is performed convergently
 468 only by those threads that convergently exited the loop in a given iteration.
 469
 470 .. code-block:: llvm
 471
 472   define void @example() convergent {
 473     %entry = call token @llvm.experimental.convergence.entry()
 474     br label %for
 475
 476   for:
 477     %inner = call token @llvm.experimental.convergence.loop() ["convergencectrl"(token %entry)]
 478     %for.cond = i1 ...
 479     br i1 %for.cond, label %B, label %E
 480
 481   B:
 482     ...
 483     %condition = i1 ...
 484     br i1 %condition, label %C, label %D
 485
 486   C:
 487     call void @convergent_op() ["convergencectrl"(token %inner)]
 488     br label %E
 489
 490   D:
 491     ...
 492     br label %for
 493
 494   E:
 495     ...
 496     ret void
 497   }
 498
 499 The LLVM IR version of the same program shows a cycle consisting of the basic
 500 blocks ``%for``, ``%B`` and ``%D``, while ``%C`` is an exit reached by the
 501 divergent branch at the end of the exiting block ``%B``. But the use of
 502 convergence control tokens makes it clear that block ``%C`` must be executed
 503 convergently only by those threads that convergently take the exit edge from %B
 504 to ``%C``. In other words, the convergent execution of ``%C`` is governed by the
 505 call to the :ref:`llvm.experimental.convergence.loop
 506 <llvm.experimental.convergence.loop>` intrinsic inside the cycle. The cycle is
 507 effectively extended to include all uses of this token that lie outside the
 508 cycle.
 509
 510 .. _dynamic_instances_and_convergence_tokens:
 511
 512 Dynamic Instances and Convergence Tokens
 513 ========================================
 514
 515 Every execution of an LLVM IR instruction occurs in a :ref:`dynamic instance
 516 <convergence-dynamic-instances>` of the instruction. Dynamic instances are the
 517 formal objects by which we talk about communicating threads in convergent
 518 operations. Dynamic instances are defined for *all* operations in an LLVM
 519 program, whether convergent or not. Convergence control is primarily about the
 520 dynamic instances of convergent operations since they affect execution of the
 521 program through inter-thread communication. The dynamic instances for
 522 non-convergent operations are relevant for determining :ref:`uniformity
 523 <convergence-and-uniformity>` of values.
 524
 525 Dynamic instances produced by the execution of the same *convergent operation*
 526 by different threads may be :ref:`converged <convergence-definition>`. When
 527 executing a convergent operation, the set of threads that execute converged
 528 dynamic instances is the set of threads that communicate with each other.
 529 *Convergence tokens* capture this convergence as described below.
 530
 531 *Convergence tokens* are values of ``token`` type, i.e. they cannot be used in
 532 ``phi`` or ``select`` instructions. A convergence token value represents the
 533 dynamic instance of the instruction that produced it.
 534
 535 Convergent operations may have an optional ``convergencectrl`` operand bundle with
 536 a convergence token operand to define the set of communicating threads relative
 537 to the operation that defined the token.
 538
 539    Let ``U`` be a convergent operation other than a call to a convergence
 540    control intrinsic, and ``D`` be the convergent operation that defines
 541    the token value used as the ``convergencectrl`` operand to ``U``. Two
 542    threads execute converged dynamic instances of ``U`` if and only if the
 543    token value in both threads was returned by converged dynamic
 544    instances of ``D``.
 545
 546 .. note::
 547
 548    The text defines convergence token values as representing dynamic instances.
 549    But if we were to assume that converged dynamic instances produce the same
 550    token value, then we could almost think of the token value as representing a
 551    set of threads instead -- specifically, the set ``S`` of threads that
 552    executed converged dynamic instances of the defining instruction ``D``.
 553
 554    In this intuitive picture, when a convergence token value ``T`` is used by a
 555    ``convergencectrl`` bundle on an instruction ``I``, then the set of threads that
 556    communicates in ``I`` is a subset of the set ``S`` represented by the token value.
 557    Specifically, it is the subset of threads that ends up executing ``I`` while
 558    using the token value.
 559
 560    This by itself wouldn't quite work as a definition: what if ``I`` is executed
 561    multiple times by the same threads? Which execution of ``I`` in thread 1
 562    communicates with which execution of ``I`` in thread 2? Leaning on the notion
 563    of dynamic instances gives a robust answer to this question as long as ``D``
 564    and ``I`` are at the same loop (or cycle) nesting level.
 565
 566    The case where ``D`` and ``I`` are at different loop nesting levels is
 567    forbidden by the :ref:`static rules <convergence_static_rules>` -- handling
 568    that case is the purpose of :ref:`llvm.experimental.convergence.loop
 569    <llvm.experimental.convergence.loop>`.
 570
 571 .. _convergence_control_intrinsics:
 572
 573 Convergence Control Intrinsics
 574 ==============================
 575
 576 This section describes target-independent intrinsics that can be used to
 577 produce convergence tokens.
 578
 579 Behaviour is undefined if a convergence control intrinsic is called
 580 indirectly.
 581
 582 .. _llvm.experimental.convergence.entry:
 583
 584 ``llvm.experimental.convergence.entry``
 585 ----------------------------------------
 586
 587 .. code-block:: llvm
 588
 589   token @llvm.experimental.convergence.entry() convergent readnone
 590
 591 This intrinsic is used to tie the dynamic instances inside of a function to
 592 those in the caller.
 593
 594 1. If the function is called from outside the scope of LLVM, the convergence of
 595    dynamic instances of this intrinsic are environment-defined. For example:
 596
 597    a. In an OpenCL *kernel launch*, the maximal set of threads that
 598       can communicate outside the memory model is a *workgroup*.
 599       Hence, a suitable choice is to specify that all the threads from
 600       a single workgroup in OpenCL execute converged dynamic instances
 601       of this intrinsic.
 602    b. In a C/C++ program, threads are launched independently and they can
 603       communicate only through the memory model. Hence the dynamic instances of
 604       this intrinsic in a C/C++ program are never converged.
 605 2. If the function is called from a call-site in LLVM IR, then two
 606    threads execute converged dynamic instances of this intrinsic if and
 607    only if both threads entered the function by executing converged
 608    dynamic instances of the call-site.
 609
 610 This intrinsic can occur at most once in a function, and only in the entry
 611 block of the function. If this intrinsic occurs in a basic block, then it must
 612 precede any other convergent operation in the same basic block.
 613
 614 It is an error if this intrinsic appears in a non-convergent function.
 615
 616 It is an error to specify a ``convergencectrl`` operand bundle at a
 617 call to this intrinsic.
 618
 619 Function inlining substitutes this intrinsic with the token from the operand
 620 bundle. For example:
 621
 622 .. code-block:: c++
 623
 624   // Before inlining:
 625
 626   void callee() convergent {
 627     %tok = call token @llvm.experimental.convergence.entry()
 628     convergent_operation(...) [ "convergencectrl"(token %tok) ]
 629   }
 630
 631   void main() {
 632     %outer = call token @llvm.experimental.convergence.anchor()
 633     for (...) {
 634       %inner = call token @llvm.experimental.convergence.loop() [ "convergencectrl"(token %outer) ]
 635       callee() [ "convergencectrl"(token %inner) ]
 636     }
 637   }
 638
 639   // After inlining:
 640
 641   void main() {
 642     %outer = call token @llvm.experimental.convergence.anchor()
 643     for (...) {
 644       %inner = call token @llvm.experimental.convergence.loop() [ "convergencectrl"(token %outer) ]
 645       convergent_operation(...) [ "convergencectrl"(token %inner) ]
 646     }
 647   }
 648
 649 .. _llvm.experimental.convergence.loop:
 650
 651 ``llvm.experimental.convergence.loop``
 652 --------------------------------------
 653
 654 .. code-block:: llvm
 655
 656   token @llvm.experimental.convergence.loop() [ "convergencectrl"(token) ] convergent readnone
 657
 658 This intrinsic represents the place where an imaginary counter is incremented
 659 for determining convergence inside a control flow cycle.
 660
 661 Let ``U`` be a call to this intrinsic and ``D`` be the convergent operation that
 662 defines the token value used as the ``convergencectrl`` operand to ``U``. Two
 663 threads execute converged dynamic instances of ``U`` if and only if:
 664
 665 1. The token value in both threads was returned by converged dynamic
 666    instances of ``D``, and,
 667 2. There is an integer *n* such that both threads execute ``U`` for the *n*'th time
 668    with that token value.
 669
 670 It is an error to omit the ``convergencectrl`` operand bundle on a
 671 call to this intrinsic.
 672
 673 If this intrinsic occurs in a basic block, then it must precede any other
 674 convergent operation in the same basic block.
 675
 676 .. _convergence_cycle_heart:
 677
 678 **Heart of a Cycle:**
 679
 680   If a :ref:`cycle <cycle-terminology>` ``C`` contains an occurrence ``H`` of
 681   this intrinsic whose token operand is defined outside ``C``, then ``H`` is
 682   called the heart of ``C``.
 683
 684   .. note::
 685
 686      The static rules for cycles imply that a heart can occur only in the header
 687      of a natural loop. This ensures that the heart closely represents the
 688      intuitive notion of a loop iteration. If this restriction is relaxed, the
 689      resulting semantics provides a new notion of "cycle iteration" even for
 690      irreducible cycles. But this allows a natural loop to have a heart in a
 691      node other than its header, which has interesting consequences on the
 692      meaning of a loop iteration in terms of convergence. For now, we disallow
 693      this situation since its practical application is very rare.
 694
 695 .. _llvm.experimental.convergence.anchor:
 696
 697 ``llvm.experimental.convergence.anchor``
 698 ----------------------------------------
 699
 700 .. code-block:: llvm
 701
 702   token @llvm.experimental.convergence.anchor() convergent readnone
 703
 704 This intrinsic produces an initial convergence token that is independent from
 705 any "outer scope". The set of threads executing converged dynamic instances of
 706 this intrinsic is implementation-defined.
 707
 708 It is an error to pass a ``convergencectrl`` operand bundle at a
 709 call to this intrinsic.
 710
 711 .. note::
 712
 713    The expectation is that all threads within a group that "happen to be active
 714    at the same time" will execute converged dynamic instances, so that programs
 715    can detect the maximal set of threads that can communicate efficiently within
 716    some local region of the program.
 717
 718 .. _convergence_uncontrolled:
 719
 720 Uncontrolled Convergent Operations
 721 ==================================
 722
 723 Convergent operations with an explicit ``convergencectrl`` operand bundle are
 724 called *controlled convergent operations*. All other convergent operations are
 725 said to be *uncontrolled*.
 726
 727 An uncontrolled convergent operation is said to have *implicit convergence
 728 control* determined by the ``convergent`` attribute alone. The semantics of the
 729 ``convergent`` attribute as implemented in LLVM differs from the documented
 730 semantics. The implementation tries to follow common intuition about convergent
 731 operations, which remains under-specified. As such, it is not possible to fully
 732 translate implicit convergence control into explicit convergence control tokens,
 733 and these two modes cannot be mixed in the same function.
 734
 735   If a function contains a controlled convergent operation, then all convergent
 736   operations in that function must either be controlled operations or calls to
 737   the convergence control intrinsics.
 738
 739 Inferring Tokens
 740 ----------------
 741
 742 (This section is informational)
 743
 744 Sometimes, it may be necessary to reinterpret the implicit convergence control
 745 in terms of explicit convergence control tokens. For example, this may happen
 746 when a function call is inlined, and either the caller or the callee contains
 747 uncontrolled convergent operations.
 748
 749 Some uses of uncontrolled convergent operations may need to satisfy the
 750 following property:
 751
 752   For an environment-defined group of threads (such as an OpenCL workgroup or
 753   subgroup), if one thread in the group executes a convergent operation, then
 754   all threads in the group do so convergently with that thread.
 755
 756 In terms of explicit convergence control, this means that the
 757 ``convergencectrl`` operand on each convergent operation ``X`` must ultimately
 758 originate from a call to the :ref:`llvm.experimental.convergence.entry
 759 <llvm.experimental.convergence.entry>` intrinsic. This preserves the possibility
 760 that the group of threads that converge on reaching ``X`` is the same group that
 761 originally started executing the program in convergence. In comparison, the
 762 :ref:`llvm.experimental.convergence.anchor
 763 <llvm.experimental.convergence.anchor>` intrinsic captures an
 764 implementation-defined group of threads, which is insufficient to support the
 765 above property.
 766
 767 One way to approximate implicit convergence control in terms of explicit
 768 convergence control tokens is the following procedure, which preserves the above
 769 mentioned property:
 770
 771 1. Convert every irreducible cycle into a reducible cycle.
 772 2. Insert a call to :ref:`llvm.experimental.convergence.entry
 773    <llvm.experimental.convergence.entry>` at the start of the entry block of the
 774    function.
 775 3. Insert a call to :ref:`llvm.experimental.convergence.loop
 776    <llvm.experimental.convergence.loop>` at the start of every loop header. If
 777    this loop is an outermost loop, the ``convergencectrl`` operand is the call
 778    to :ref:`llvm.experimental.convergence.entry
 779    <llvm.experimental.convergence.entry>` in the entry block of the function.
 780    Otherwise, the ``convergencectrl`` operand is the call to
 781    :ref:`llvm.experimental.convergence.loop
 782    <llvm.experimental.convergence.loop>` in the parent loop's header.
 783 4. For each uncontrolled convergent operation ``X``, add a ``convergencectrl``
 784    operand bundle using the token defined by a definition ``D`` that is a
 785    :ref:`sibling <cycle-sibling>` to this operation. ``D`` always dominates
 786    ``X`` --- if ``X`` is not in any cycle, then ``D`` is a call to
 787    :ref:`llvm.experimental.convergence.entry
 788    <llvm.experimental.convergence.entry>`; otherwise ``D`` is the heart of the
 789    parent cycle of ``X``.
 790
 791 .. _convergence_static_rules:
 792
 793 Static Rules
 794 ============
 795
 796 A *well-formed* program in LLVM IR must satisfy the following static
 797 rules about cycles and convergence regions.
 798
 799 Closed Paths
 800 ------------
 801
 802 A :ref:`closed path <cycle-closed-path>` in a CFG is a connected sequence of
 803 nodes and edges in the CFG whose start and end points are the same.
 804
 805 1. Every closed path in the CFG that contains a use of a convergence token T other
 806    than a use by
 807    :ref:`llvm.experimental.convergence.loop <llvm.experimental.convergence.loop>`
 808    must also contain the definition of T.
 809
 810 2. Every closed path in the CFG that contains two different uses of a convergence
 811    token T must also contain the definition of T.
 812
 813 3. Every closed path in the CFG that contains uses of two different convergence tokens
 814    T1 and T2 must also contain the definition of at least one of them.
 815
 816 Taken together, these rules imply that for every closed path C, there can be at most
 817 one convergence token T which is used in C but defined outside of it, and that
 818 T can be used only once in C, and only by ``llvm.experimental.convergence.loop``.
 819
 820 4. In every closed path that contains a use U of a token T but not the
 821    definition of T, U must dominate all nodes in the closed path.
 822
 823 This implies that ``llvm.experimental.convergence.loop`` can appear as a heart
 824 only in the header of a natural loop.
 825
 826 **Sufficient Conditions:** From the :ref:`properties of cycles
 827 <cycle-closed-path>`, it is sufficient to prove the above properties
 828 for cycles instead of closed paths. Briefly, any closed path that violates
 829 one or more of the above static rules is contained in a cycle that also
 830 violates the same rule(s).
 831
 832 .. _convergence_region:
 833
 834 Convergence Regions
 835 -------------------
 836
 837 The *convergence region* of a convergence token T is the minimal region in
 838 which T is live and used, i.e., the set of program points dominated by the
 839 definition D of T from which a use of T can be reached.
 840
 841 The following static rule about convergence regions must be satisfied by
 842 valid programs:
 843
 844    If a convergence region R for a token T1 contains a use of a convergence
 845    token T2, then R must also contain the definition of T2. (In other words,
 846    convergence regions must be reasonably nested.)
 847
 848 .. note::
 849
 850    For brevity, this document uses the term "convergence region of a token
 851    definition ``D``" to actually refer to the convergence region of the token
 852    ``T`` defined by ``D``.
 853
 854 .. _inferring_noconvergent:
 855
 856 Inferring non-convergence
 857 =========================
 858
 859 When the target or the environment guarantees that threads do not
 860 communicate using convergent operations or that threads never diverge,
 861 the dynamic instances in the program are irrelevant and an optimizer
 862 may remove any occurrence of the ``convergent`` attribute on a
 863 call-site or a function and any explicit ``convergencectrl`` operand
 864 bundle at a call-site.
 865
 866 An optimizer may remove the ``convergent`` attribute and any explicit
 867 ``convergencectrl`` operand bundle from a call-site if it can prove
 868 that the execution of this call-site always results in a call to a
 869 non-convergent function.
 870
 871 An optimizer may remove the ``convergent`` attribute on a function if it can
 872 prove that the function does not contain a call to
 873 :ref:`llvm.experimental.convergence.entry
 874 <llvm.experimental.convergence.entry>`, or any uncontrolled convergent
 875 operations.
 876
 877 Memory Model Non-Interaction
 878 ============================
 879
 880 The fact that an operation is convergent has no effect on how it is treated for
 881 memory model purposes. In particular, an operation that is ``convergent`` and
 882 ``readnone`` does not introduce additional ordering constraints as far as the
 883 memory model is concerned. There is no implied barrier, neither in the memory
 884 barrier sense nor in the control barrier sense of synchronizing the execution
 885 of threads.
 886
 887 Informational note: Threads that execute converged dynamic instances do not
 888 necessarily do so at the same time.
 889
 890
 891 Other Interactions
 892 ==================
 893
 894 A function can be both ``convergent`` and
 895 ``speculatable``, indicating that the function does not have undefined
 896 behavior and has no effects besides calculating its result, but is still
 897 affected by the set of threads executing this function. This typically
 898 prevents speculation of calls to the function unless the constraint imposed
 899 by ``convergent`` is further relaxed by some other means.
 900
 901 Controlled Maximal Convergence
 902 ==============================
 903
 904 The :ref:`converged-with relation <convergence-definition>` over dynamic
 905 instances of each controlled convergent operation is completely defined by the
 906 semantics of convergence tokens. But the implementation-defined convergence at a
 907 call to :ref:`llvm.experimental.convergence.anchor
 908 <llvm.experimental.convergence.anchor>` also depends on the cycle hierarchy
 909 chosen if it occurs inside an irreducible cycle.
 910
 911 When the token defined by a convergent operation ``D`` is used at another
 912 convergent operation ``U``, the implementation must ensure that the threads that
 913 converge at ``U`` are all the threads that reached ``U`` after converging at
 914 ``D``. On most implementations, it is reasonable to assume that only these
 915 threads are converged at every node they reach on any path from ``D`` to ``U``.
 916 In other words, the converged-with relation at ``D`` produces groups of threads
 917 that can converge only within each group, while inside the convergence region of
 918 ``D``.
 919
 920 All this affects the :ref:`maximal converged-with relation
 921 <convergence-maximal>` over dynamic instances and in turn the :ref:`m-converged
 922 property <uniformity-analysis>` of static instances in the convergence region of
 923 ``D``.
 924
 925 .. _controlled_maximal_converged_with:
 926
 927   **Controlled Maximal converged-with Relation**
 928
 929   1. Dynamic instances of a *convergent operation* are related in the controlled
 930      maximal converged-with relation according to the semantics of the convergence
 931      control tokens.
 932   2. Dynamic instances ``X1`` and ``X2`` produced by different threads for the
 933      same *non-convergent operation* ``X`` are related in the controlled maximal
 934      converged-with relation if and only if:
 935
 936      1. Both threads executed converged dynamic instances of every token
 937         definition ``D`` such that ``X`` is in the convergence region of ``D``,
 938         and,
 939      2. For every cycle ``C`` with header ``H`` that contains ``X``:
 940
 941         - every dynamic instance ``H1`` of ``H`` that precedes ``X1`` in the
 942           respective thread is convergence-before ``X2``, and,
 943         - every dynamic instance ``H2`` of ``H`` that precedes ``X2`` in the
 944           respective thread is convergence-before ``X1``,
 945         - without assuming that ``X1`` is converged with ``X2``.
 946
 947 .. _controlled_m_converged:
 948
 949   **Controlled m-converged Static Instances**
 950
 951   A node ``X`` in a given CFG is reported to be m-converged if and only if:
 952
 953   1. For any token definition ``D`` such that ``X`` is inside the convergence region
 954      of ``D``, ``D`` itself is m-converged, and,
 955   2. Every cycle that contains ``X`` satisfies the following necessary
 956      conditions:
 957
 958      a. Every divergent branch inside the cycle satisfies the :ref:`diverged
 959         entry criterion<convergence-diverged-entry>`, and,
 960      b. There are no :ref:`diverged paths reaching the
 961         cycle<convergence-diverged-outside>` from a divergent branch outside it.
 962
 963 Temporal Divergence at Cycle Exit
 964 ---------------------------------
 965
 966 When a cycle has a divergent exit, maximal convergence assumes that all threads
 967 converge at the exit block. But if a controlled convergent operation outside the
 968 cycle uses a token defined by an operation ``D`` inside the cycle, the
 969 convergence region of ``D`` now extends outside the cycle. If two threads
 970 executed converged dynamic instances of ``D`` before exiting the cycle, then
 971 they continue to execute converged dynamic instances of nodes in the convergence
 972 region of ``D`` outside the cycle. Thus, for a value ``V`` defined inside the
 973 cycle, any use ``U`` of ``V`` within the convergence region of ``T`` uses the
 974 output of converged dynamic instances of ``V``. If ``V`` is uniform, then its
 975 use at such a ``U`` is also uniform. In other words, temporal divergence applies
 976 only to a use of ``V`` that is outside the convergence region of ``D``.
 977
 978 Rationales for Static rules about cycles
 979 ========================================
 980
 981 (This section is informative.)
 982
 983 .. note::
 984
 985    For convenience, we use the operator ``==`` to represent the relation
 986    ``converged-with`` and the operator ``!=`` to represent its negation.
 987
 988 Consider a loop with (incorrect!) convergence control as in the following
 989 pseudocode:
 990
 991 .. code-block:: llvm
 992
 993   ; WARNING: Example of incorrect convergence control!
 994
 995   %anchor = call token @llvm.experimental.convergence.anchor()
 996   for (;;) {
 997     ...
 998     call void @convergent.op() [ "convergencectrl"(token %anchor) ]
 999     ...
1000   }
1001
1002 This code is forbidden by the first static rule about cycles.
1003
1004 A first formal argument why we have to do this is that the dynamic rule for
1005 deciding whether two threads execute converged dynamic instances of
1006 ``@convergent.op`` leads to a logical contradiction in this code.
1007 Assume two threads execute converged dynamic instances of the anchor
1008 followed by two iterations of the loop. Thread 1 executes dynamic instances
1009 I1 and I2 of ``@convergent.op``, thread 2 executes dynamic instances J1 and J2.
1010 Using all the rules, we can deduce:
1011
1012 1. ``I1 != I2`` and ``J1 != J2`` by the basic rules of dynamic instances.
1013
1014 2. ``I1 == J1`` by the first dynamic rule about controlled convergent
1015    operations: both threads execute the same static instruction while using
1016    a convergence token value produced by converged dynamic instances of an
1017    instruction (the anchor).
1018
1019 3. ``I1 == J2`` by the same argument. Also, ``I2 == J1`` and ``I2 == J2``.
1020
1021    The fact that one may be *intuitively* tempted to think of ``I1`` and ``J2``
1022    as being executed in different loop iterations is completely irrelevant for
1023    the *formal* argument. There is no mechanism in LLVM IR semantics for
1024    forming associations between loop iterations in different threads, *except*
1025    for the rules defined in this document -- and the rules in this document
1026    require a loop heart intrinsic for talking about loop iterations.
1027
1028 4. By transitivity, we have ``I1 == I2`` and ``J1 == J2``. That is a
1029    contradiction.
1030
1031 This problem goes away by inserting a loop heart intrinsic as follows, which
1032 establishes a relationship between loop iterations across threads.
1033
1034 .. code-block:: llvm
1035
1036   %anchor = call token @llvm.experimental.convergence.anchor()
1037   for (;;) {
1038     %loop = call token @llvm.experimental.convergence.loop() [ "convergencectrl"(token %anchor) ]
1039     ...
1040     call void @convergent.op() [ "convergencectrl"(token %loop) ]
1041     ...
1042   }
1043
1044 In the same scenario of two threads executing converged dynamic instances of the
1045 anchor and then two iterations of the loop, the dynamic rule about loop heart
1046 intrinsics implies that both threads execute the converged dynamic instances of
1047 the loop heart intrinsic in their respective first iterations and then again in
1048 their respective second iterations of the loop.
1049
1050 This then implies that they execute converged dynamic instances ``I1 == J1`` of
1051 the ``@convergent.op`` in their first iterations and then
1052 ``I2 == J2`` in their second iterations. The rule is an "if and only if" rule,
1053 so it also implies that ``I1 != J2`` and ``I2 != J1``, because those executions
1054 see token values of ``%loop`` originating from non-converged dynamic
1055 instances of the loop intrinsic.
1056
1057 One may ask whether we could change the dynamic rule instead of adding the
1058 static rule about cycles. That is impractical due to deeper difficulties.
1059 Consider the following loop, again with incorrect convergence control:
1060
1061 .. code-block:: llvm
1062
1063   ; WARNING: Example of incorrect convergence control!
1064
1065   ; (A)
1066   %anchor = call token @llvm.experimental.convergence.anchor()
1067   for (;;) {
1068     ; (B)
1069     if (condition1) {
1070       ; (C)
1071       call void @convergent.op.1() [ "convergencectrl"(token %anchor) ]
1072     }
1073     ; (D)
1074     if (condition2) {
1075       ; (E)
1076       call void @convergent.op.2() [ "convergencectrl"(token %anchor) ]
1077     }
1078     ; (F)
1079   }
1080   ; (G)
1081
1082 Assume two threads execute converged dynamic instances of the anchor followed
1083 by this sequence of basic blocks:
1084
1085 .. code-block:: text
1086
1087   Thread 1: A B C D F B D E F G
1088   Thread 2: A B D E F B C D F G
1089
1090 That is, both threads execute two iterations of the loop, but they execute
1091 the different convergent operations in different iterations. Without forming a
1092 relation between loop iterations across the threads, there is no reasonable way
1093 of defining which dynamic instances of the convergent operations should be the
1094 same across the threads, if any.
1095
1096 Again, this can be addressed by adding a loop heart intrinsic, most naturally
1097 as:
1098
1099 .. code-block:: llvm
1100
1101   ; (A)
1102   %anchor = call token @llvm.experimental.convergence.anchor()
1103   for (;;) {
1104     ; (B)
1105     %loop = call token @llvm.experimental.convergence.loop() [ "convergencectrl"(token %anchor) ]
1106     if (condition1) {
1107       ; (C)
1108       call void @convergent.op.1() [ "convergencectrl"(token %loop) ]
1109     }
1110     ; (D)
1111     if (condition2) {
1112       ; (E)
1113       call void @convergent.op.2() [ "convergencectrl"(token %loop) ]
1114     }
1115     ; (F)
1116   }
1117   ; (G)
1118
1119 Let ``%loop(i;j)`` be the dynamic instance of ``j``-th execution of the loop
1120 heart intrinsic by thread ``i``, and analogously ``@op.k(i)`` and ``@op.k(i)``
1121 the dynamic instances of the execution of ``@convergent.op.k`` by thread ``i``.
1122 Then we have:
1123
1124 1. ``%loop(1;j) == %loop(2;j)`` for ``j = 1, 2`` because of the dynamic rule
1125    about loop heart intrinsics.
1126
1127 2. ``%loop(i;1) != %loop(i;2)`` for ``i = 1, 2`` because of the basic rule that
1128    different executions by the same thread happen in different dynamic
1129    instances.
1130
1131 3. ``@op.1(1) != @op.1(2)``, since ``@op.1(1)`` uses the token value of ``%loop``
1132    referring to ``%loop(1;1)`` and ``@op.1(2)`` uses that
1133    referring to ``%loop(2;2) == %loop(1;2)``, which is different from
1134    ``%loop(1;1)``.
1135
1136 4. Similarly, ``@op.2(1) != @op.2(2)``.
1137
1138 However, loop heart intrinsics could be inserted differently, at the cost
1139 of also inserting a free-standing anchor:
1140
1141 .. code-block:: llvm
1142
1143   ; (A)
1144   %anchor = call token @llvm.experimental.convergence.anchor()
1145   for (;;) {
1146     ; (B)
1147     if (condition1) {
1148       ; (C)
1149       %loop = call token @llvm.experimental.convergence.loop() [ "convergencectrl"(token %anchor) ]
1150       call void @convergent.op.1() [ "convergencectrl"(token %loop) ]
1151     }
1152     ; (D)
1153     if (condition2) {
1154       ; (E)
1155       %free = call token @llvm.experimental.convergence.anchor()
1156       call void @convergent.op.2() [ "convergencectrl"(token %free) ]
1157     }
1158     ; (F)
1159   }
1160   ; (G)
1161
1162 This leads to the "unnatural counting of loop iterations" that is also mentioned
1163 elsewhere. Let ``%loop(i)`` be the dynamic instance of the execution of the
1164 loop heart intrinsic by thread ``i`` (each thread executes it only once), and
1165 let ``@op.k(i)`` be as before. Then:
1166
1167 1. ``%loop(1) == %loop(2)`` because of the dynamic rule about loop heart
1168    intrinsics.
1169
1170 2. ``@op.1(1) == @op.1(2)`` because ``@op.1(i)`` uses the value of ``%loop``
1171    referring to ``%loop(i)``, and ``%loop(1) == %loop(2)``.
1172
1173 3. Whether ``@op.2(1) == @op.2(2)`` is implementation-defined because of the
1174    use of the ``%free`` anchor intrinsic.
1175
1176    In practice, they almost certainly have to be non-converged dynamic
1177    instances. Consider that if an implementation strictly follows the order of
1178    instructions given in the program, the executions of the threads can be
1179    "aligned" as follows:
1180
1181    .. code-block:: text
1182
1183      Thread 1: A B         C D F B D E F G
1184      Thread 2: A B D E F B C D F         G
1185
1186    So then ``@op.2(1)`` physically executes later than ``@op.2(2)`` and there
1187    can be no communication between the threads, which means they execute
1188    non-converged dynamic instances.
1189
1190    That said, it is conceivable that there aren't actually any data or other
1191    dependencies that would enforce this execution order. In that case, a highly
1192    out-of-order implementation could potentially allow communication. That's
1193    why the rules defined in this document are silent about whether
1194    ``@op.2(1) == @op.2(2)`` or not.
1195
1196 This type of convergence control seems relatively unlikely to appear in real
1197 programs. Its possibility is simply a logical consequence of the model.
1198
1199 An equivalent issue arises if the convergent operations are replaced by nested
1200 loops with loop heart intrinsics that directly refer to ``%anchor``, hence
1201 the variants of the static rules about cycles that apply to them:
1202
1203 .. code-block:: llvm
1204
1205   ; WARNING: Example of incorrect convergence control!
1206
1207   %anchor = call token @llvm.experimental.convergence.anchor()
1208   for (;;) {
1209     if (condition1) {
1210       for (;;) {
1211         %loop1 = call token @llvm.experimental.convergence.loop() [ "convergencectrl"(token %anchor) ]
1212       }
1213     }
1214     if (condition2) {
1215       for (;;) {
1216         %loop2 = call token @llvm.experimental.convergence.loop() [ "convergencectrl"(token %anchor) ]
1217       }
1218     }
1219   }
1220
1221 There is a cycle (closed walk in the CFG) that goes through both loop heart
1222 intrinsics using ``%anchor`` but not through the definition of ``%anchor``,
1223 so this code is invalid.
1224
1225
1226 Examples for the Correctness of Program Transforms
1227 ==================================================
1228
1229 (This section is informative.)
1230
1231 As implied by the rules in the previous sections, program transforms are correct
1232 with respect to convergent operations if they preserve or refine their
1233 semantics. This means that the set of communicating threads in the transformed
1234 program must have been possible in the original program.
1235
1236 Program transforms with a single-threaded focus are generally conservatively
1237 correct if they do not sink or hoist convergent operations across a branch.
1238 This applies even to program transforms that change the control flow graph.
1239
1240 For example, unrolling a loop that does not contain convergent operations
1241 cannot break any of the guarantees required for convergent operations outside
1242 of the loop.
1243
1244
1245 Loop unrolling examples
1246 -----------------------
1247
1248 We consider three kinds of loop unrolling here:
1249
1250 * Partial unrolling with no known trip multiple, so a "tail" is required to
1251   collect the remaining elements.
1252 * Partial unrolling by a trip multiple, so no "tail" is required.
1253 * Full unrolling, which eliminates the loop.
1254
1255 The first kind is forbidden when ``@llvm.experimental.convergence.loop`` is
1256 used. We illustrate the reasoning with some examples.
1257
1258 First, an arbitrary loop that contains convergent operations *can* be unrolled
1259 in all of these ways, even with "tail", if all convergent operations refer back
1260 to an anchor inside the loop. For example (in pseudo-code):
1261
1262 .. code-block:: llvm
1263
1264   while (counter > 0) {
1265     %tok = call token @llvm.experimental.convergence.anchor()
1266     call void @convergent.operation() [ "convergencectrl"(token %tok) ]
1267     counter--;
1268   }
1269
1270 This can be unrolled to:
1271
1272 .. code-block:: llvm
1273
1274   while (counter >= 2) {
1275     %tok = call token @llvm.experimental.convergence.anchor()
1276     call void @convergent.operation() [ "convergencectrl"(token %tok) ]
1277     %tok = call token @llvm.experimental.convergence.anchor()
1278     call void @convergent.operation() [ "convergencectrl"(token %tok) ]
1279     counter -= 2;
1280   }
1281   while (counter > 0) {
1282     %tok = call token @llvm.experimental.convergence.anchor()
1283     call void @convergent.operation() [ "convergencectrl"(token %tok) ]
1284     counter--;
1285   }
1286
1287 This is likely to change the behavior of the convergent operation if there
1288 are threads whose initial counter value is not a multiple of 2. In particular,
1289 all threads with an odd trip count are now likely to execute the convergent
1290 operation in their respective final iterations together because the
1291 underlying implementation is likely to try to group as many threads together
1292 as possible for the execution of the "tail".
1293
1294 This change is allowed because the anchor intrinsic has implementation-defined
1295 convergence behavior and the loop unrolling transform is considered to be part
1296 of the implementation. Another way of reasoning is that while the *likely*
1297 behavior of the code has changed, the *guarantees* about its behavior have
1298 remained the same.
1299
1300 If the loop contains uncontrolled convergent operations, this kind of unrolling
1301 is forbidden.
1302
1303 Unrolling a loop with convergent operations that refer to tokens produced
1304 outside the loop is forbidden when a "tail" or "remainder" would have to
1305 be introduced. Consider:
1306
1307 .. code-block:: llvm
1308
1309   ; (A)
1310   %outer = call token @llvm.experimental.convergence.anchor()
1311   while (counter > 0) {
1312     %inner = call token @llvm.experimental.convergence.loop() [ "convergencectrl"(token %outer) ]
1313     ; (B)
1314     call void @convergent.operation() [ "convergencectrl"(token %inner) ]
1315     counter--;
1316   }
1317   ; (C)
1318
1319 To understand why unrolling is forbidden, consider two threads that execute
1320 converged dynamic instances of the anchor and then proceed with 3 and 4 loop
1321 iterations, respectively:
1322
1323 .. code-block:: text
1324
1325   Thread 1: A B B B C
1326   Thread 2: A B B B B C
1327
1328 By the dynamic rule on loop heart intrinsics, these threads execute converged
1329 dynamic instances of the loop intrinsic for the first 3 iterations, and then
1330 thread 2 executes another dynamic instance by itself.
1331
1332 By the dynamic rule on general convergent operations, the threads execute
1333 converged dynamic instances of the ``@convergent.operation`` in the first 3
1334 iterations (that is, the dynamic instance executed by thread 1 in iteration
1335 *n* is the same as that executed by thread 2 in iteration *n*, for *n = 1,2,3*;
1336 the dynamic instance executed in iteration 1 is different from that in
1337 iteration 2, etc.).
1338
1339 Now assume that the loop is unrolled by a factor of 2, which requires a
1340 remainder as follows:
1341
1342 .. code-block:: llvm
1343
1344   ; (A)
1345   %outer = call token @llvm.experimental.convergence.anchor()
1346   while (counter >= 2) {
1347     %inner = call token @llvm.experimental.convergence.loop() [ "convergencectrl"(token %outer) ]
1348     ; (B)
1349     call void @convergent.operation() [ "convergencectrl"(token %inner) ]
1350     call void @convergent.operation() [ "convergencectrl"(token %inner) ]
1351     counter -= 2;
1352   }
1353   ; (C)
1354   if (counter > 0) {
1355     %remainder = call token @llvm.experimental.convergence.loop() [ "convergencectrl"(token %outer) ]
1356     ; (D)
1357     call void @convergent.operation() [ "convergencectrl"(token %remainder) ]
1358   }
1359   ; (E)
1360
1361 First of all, note some interesting problems surrounding the loop intrinsic:
1362
1363 1. It is *not* duplicated inside the unrolled loop. This is to comply with
1364    the :ref:`convergence_static_rules`.
1365
1366 2. It is unclear whether the loop intrinsic ought to be duplicated in the
1367    remainder, or whether the final ``@convergent.operation`` in D should just
1368    refer to either ``%inner`` (which is possible in SSA form) or directly to
1369    ``%outer``. The decision made here is arbitrary and doesn't change the
1370    argument that follows. Ultimately, it simply doesn't matter because the
1371    transform is incorrect either way.
1372
1373 The threads now execute the following sequences of blocks:
1374
1375 .. code-block:: text
1376
1377   Thread 1: A B C D E
1378   Thread 2: A B B C D E
1379
1380 Analogous to the argument above, they execute converged dynamic instances of the
1381 ``%inner`` intrinsic and the ``@convergent.operation`` in the first iteration
1382 of the unrolled loop, which corresponds to the first 2 iterations of the
1383 original loop.
1384
1385 However, they execute different static calls to ``@convergent.operation`` for
1386 the 3rd iteration of the original loop. In thread 1, that iteration corresponds
1387 to the call in the remainder, while in thread 2 it corresponds to the first
1388 call to ``@convergent.operation`` in the unrolled loop. Therefore, they execute
1389 non-converged dynamic instances, which means that the set of communicating threads
1390 for the 3rd iteration of the original loop is different. This is why the
1391 unrolling is incorrect.
1392
1393 On the other hand, unrolling without "tail" is allowed. For example, assuming
1394 that the trip counter is known to be a multiple of 2, we can unroll the loop
1395 as follows:
1396
1397 .. code-block:: llvm
1398
1399   %outer = call token @llvm.experimental.convergence.anchor()
1400   while (counter > 0) {
1401     %inner = call token @llvm.experimental.convergence.loop() [ "convergencectrl"(token %outer) ]
1402     call void @convergent.operation() [ "convergencectrl"(token %inner) ]
1403     call void @convergent.operation() [ "convergencectrl"(token %inner) ]
1404     counter -= 2;
1405   }
1406
1407 Note again that the loop intrinsic is not duplicated.
1408
1409 The
1410 :ref:`llvm.experimental.convergence.loop <llvm.experimental.convergence.loop>`
1411 intrinsic is typically expected to appear in the header of a natural loop.
1412 However, it can also appear in non-header blocks of a loop. In that case, the
1413 loop can generally not be unrolled.
1414
1415
1416 Hoisting and sinking
1417 --------------------
1418
1419 In general, hoisting and sinking of convergent operations is forbidden. This is
1420 because moving the operation to a different point in control flow generally
1421 changes the set of threads that reach the operation and therefore, the set of
1422 threads that execute converged dynamic instances of the operation. By
1423 definition, this changes the set of threads that participate in the
1424 communication of the convergent operation, which will typically change its
1425 result.
1426
1427 There are a number of exceptions, though most of them require additional
1428 knowledge.
1429
1430 For example, hoisting and sinking across *uniform* conditional branches -- i.e.,
1431 conditional branches where within every possible relevant set of threads, all
1432 threads will always take the same direction -- is generally allowed. See the end
1433 of the :ref:`example of reductions inside control flow
1434 <convergence_example_reductions>` for a brief discussion.
1435
1436 Some convergent operations can be hoisted but not sunk, or vice versa. A simple
1437 example is the ``subgroupShuffle(data, id)`` operation. It returns the ``data``
1438 operand of the thread identified by ``id``, where thread IDs are fixed and
1439 assigned to each thread at launch. The result is undefined (or perhaps there is
1440 UB, depending on the language and environment) if thread ``id`` is not in the
1441 communicating set of threads. So hoisting is allowed in the following
1442 pseudo-code example:
1443
1444 .. code-block:: llvm
1445
1446   define void @example(...) convergent {
1447     %entry = call token @llvm.experimental.convergence.entry()
1448     %data = ...
1449     %id = ...
1450     if (condition) {
1451       %shuffled = call i32 @subgroupShuffle(i32 %data, i32 %id) [ "convergencectrl"(token %entry) ]
1452       ...
1453     } else {
1454       %shuffled = call i32 @subgroupShuffle(i32 %data, i32 %id) [ "convergencectrl"(token %entry) ]
1455       ...
1456     }
1457   }
1458
1459 After hoisting the calls to ``@subgroupShuffle``, the communicating set of
1460 threads is the union of the two sets of threads in the original program, so
1461 ``%id`` can only go "out of range" after hoisting if it did so in the original
1462 program.
1463
1464 However, speculative execution of ``@subgroupShuffle`` in the following program
1465 may be forbidden:
1466
1467 .. code-block:: llvm
1468
1469   define void @example(...) convergent {
1470     %entry = call token @llvm.experimental.convergence.entry()
1471     %data = ...
1472     %id = ...
1473     if (condition) {
1474       %shuffled = call i32 @subgroupShuffle(i32 %data, i32 %id) [ "convergencectrl"(token %entry) ]
1475       ...
1476     }
1477   }
1478
1479 There is no guarantee about the value of ``%id`` in the threads where
1480 ``condition`` is false. If ``@subgroupShuffle`` is defined to have UB when
1481 ``%id`` is outside of the set of communicating threads, then speculating and
1482 hoisting ``@subgroupShuffle`` might introduce UB.
1483
1484 On the other hand, if ``@subgroupShuffle`` is defined such that it merely
1485 produces an undefined value or poison as result when ``%id`` is "out of range",
1486 then speculating is okay.
1487
1488 Even though
1489 :ref:`llvm.experimental.convergence.anchor <llvm.experimental.convergence.anchor>`
1490 is marked as ``convergent``, it can be sunk in some cases. For example, in
1491 pseudo-code:
1492
1493 .. code-block:: llvm
1494
1495   %tok = call token @llvm.experimental.convergence.anchor()
1496   if (condition) {
1497     call void @convergent.operation() [ "convergencectrl"(token %tok) ]
1498   }
1499
1500 Assuming that ``%tok`` is only used inside the conditional block, the anchor can
1501 be sunk. The rationale is two-fold. First, the anchor has implementation-defined
1502 behavior, and the sinking is part of the implementation. Second, already in the
1503 original program, the set of threads that communicates in the
1504 ``@convergent.operation`` is automatically subset to the threads for which
1505 ``condition`` is true.
1506
1507 Anchors can be hoisted in acyclic control flow. For example:
1508
1509 .. code-block:: llvm
1510
1511   if (condition) {
1512     %tok1 = call token @llvm.experimental.convergence.anchor()
1513     call void @convergent.operation() [ "convergencectrl"(token %tok1) ]
1514   } else {
1515     %tok2 = call token @llvm.experimental.convergence.anchor()
1516     call void @convergent.operation() [ "convergencectrl"(token %tok2) ]
1517   }
1518
1519 The anchors can be hoisted, resulting in:
1520
1521 .. code-block:: llvm
1522
1523   %tok = call token @llvm.experimental.convergence.anchor()
1524   if (condition) {
1525     call void @convergent.operation() [ "convergencectrl"(token %tok) ]
1526   } else {
1527     call void @convergent.operation() [ "convergencectrl"(token %tok) ]
1528   }
1529
1530 The behavior is unchanged, since each of the static convergent operations only
1531 ever communicates with threads that have the same ``condition`` value.
1532 By contrast, hoisting the convergent operations themselves is forbidden.
1533
1534 Hoisting and sinking anchors out of and into loops is forbidden. For example:
1535
1536 .. code-block:: llvm
1537
1538   for (;;) {
1539     %tok = call token @llvm.experimental.convergence.anchor()
1540     call void @convergent.operation() [ "convergencectrl"(token %tok) ]
1541   }
1542
1543 Hoisting the anchor would make the program invalid according to the static
1544 validity rules. Conversely:
1545
1546 .. code-block:: llvm
1547
1548   %outer = call token @llvm.experimental.convergence.anchor()
1549   while (counter > 0) {
1550     %inner = call token @llvm.experimental.convergence.loop() [ "convergencectrl"(token %outer) ]
1551     call void @convergent.operation() [ "convergencectrl"(token %inner) ]
1552     counter--;
1553   }
1554
1555 The program would stay valid if the anchor was sunk into the loop, but its
1556 behavior could end up being different. If the anchor is inside the loop, then
1557 each loop iteration has a new dynamic instance of the anchor, and the set of
1558 threads participating in those dynamic instances of the anchor could be
1559 different in arbitrary implementation-defined ways. Via the dynamic rules about
1560 dynamic instances of convergent operations, this then implies that the set of
1561 threads executing ``@convergent.operation`` could be different in each loop
1562 iteration in arbitrary implementation-defined ways.
1563
1564 Convergent operations can be sunk together with their anchor. Again in
1565 pseudo-code:
1566
1567 .. code-block:: llvm
1568
1569   %tok = call token @llvm.experimental.convergence.anchor()
1570   %a = call T @pure.convergent.operation(...) [ "convergencectrl"(token %tok) ]
1571   %b = call T @pure.convergent.operation(...) [ "convergencectrl"(token %tok) ]
1572   if (condition) {
1573     use(%a, %b)
1574   }
1575
1576 Assuming that ``%tok``, ``%a``, and ``%b`` are only used inside the conditional
1577 block, all can be sunk together:
1578
1579 .. code-block:: llvm
1580
1581   if (condition) {
1582     %tok = call token @llvm.experimental.convergence.anchor()
1583     %a = call T @pure.convergent.operation(...) [ "convergencectrl"(token %tok) ]
1584     %b = call T @pure.convergent.operation(...) [ "convergencectrl"(token %tok) ]
1585     use(%a, %b)
1586   }
1587
1588 The rationale is that the anchor intrinsic has implementation-defined behavior,
1589 and the sinking transform is considered to be part of the implementation:
1590 the sinking will restrict the set of communicating threads to those for which
1591 ``condition`` is true, but that could have happened in the original program
1592 anyway for some arbitrary other reason.
1593
1594 However, sinking *only* the convergent operation producing ``%b`` would be
1595 incorrect. That would allow threads for which ``condition`` is false to
1596 communicate at ``%a``, but not at ``%b``, which the original program doesn't
1597 allow.
1598
1599 Note that the entry intrinsic behaves differently. Sinking the convergent
1600 operations is forbidden in the following snippet:
1601
1602 .. code-block:: llvm
1603
1604   %tok = call token @llvm.experimental.convergence.entry()
1605   %a = call T @pure.convergent.operation(...) [ "convergencectrl"(token %tok) ]
1606   %b = call T @pure.convergent.operation(...) [ "convergencectrl"(token %tok) ]
1607   if (condition) {
1608     use(%a, %b)
1609   }