llvm/docs/MemoryModelRelaxationAnnotations.rst

   1 ===================================
   2 Memory Model Relaxation Annotations
   3 ===================================
   4
   5 .. contents::
   6    :local:
   7
   8 Introduction
   9 ============
  10
  11 Memory Model Relaxation Annotations (MMRAs) are target-defined properties
  12 on instructions that can be used to selectively relax constraints placed
  13 by the memory model. For example:
  14
  15 * The use of ``VulkanMemoryModel`` in a SPIRV program allows certain
  16   memory operations to be reordered across ``acquire`` or ``release``
  17   operations.
  18 * OpenCL APIs expose primitives to only fence a specific set of address
  19   spaces. Carrying that information to the backend can enable the
  20   use of faster synchronization instructions, rather than fencing all
  21   address spaces everytime.
  22
  23 MMRAs offer an opt-in system for targets to relax the default LLVM
  24 memory model.
  25 As such, they are attached to an operation using LLVM metadata which
  26 can always be dropped without affecting correctness.
  27
  28 Definitions
  29 ===========
  30
  31 memory operation
  32     A load, a store, an atomic, or a function call that is marked as
  33     accessing memory.
  34
  35 synchronizing operation
  36     An instruction that synchronizes memory with other threads (e.g.
  37     an atomic or a fence).
  38
  39 tag
  40     Metadata attached to a memory or synchronizing operation
  41     that represents some target-defined property regarding memory
  42     synchronization.
  43
  44     An operation may have multiple tags that each represent a different
  45     property.
  46
  47     A tag is composed of a pair of metadata string: a *prefix* and a *suffix*.
  48
  49     In LLVM IR, the pair is represented using a metadata tuple.
  50     In other cases (comments, documentation, etc.), we may use the
  51     ``prefix:suffix`` notation.
  52     For example:
  53
  54     .. code-block::
  55       :caption: Example: Tags in Metadata
  56
  57       !0 = !{!"scope", !"workgroup"}  # scope:workgroup
  58       !1 = !{!"scope", !"device"}     # scope:device
  59       !2 = !{!"scope", !"system"}     # scope:system
  60
  61     .. note::
  62
  63       The only semantics relevant to the optimizer is the
  64       "compatibility" relation defined below. All other
  65       semantics are target defined.
  66
  67     Tags can also be organised in lists to allow operations
  68     to specify all of the tags they belong to. Such a list
  69     is referred to as a "set of tags".
  70
  71     .. code-block::
  72       :caption: Example: Set of Tags in Metadata
  73
  74       !0 = !{!"scope", !"workgroup"}
  75       !1 = !{!"sync-as", !"private"}
  76       !2 = !{!0, !2}
  77
  78     .. note::
  79
  80       If an operation does not have MMRA metadata, it's treated as if
  81       it has an empty list (``!{}``) of tags.
  82
  83     Note that it is not an error if a tag is not recognized by the
  84     instruction it is applied to, or by the current target.
  85     Such tags are simply ignored.
  86
  87     Both synchronizing operations and memory operations can have
  88     zero or more tags attached to them using the ``!mmra`` syntax.
  89
  90     For the sake of readability in examples below,
  91     we use a (non-functional) short syntax to represent MMMRA metadata:
  92
  93     .. code-block::
  94       :caption: Short Syntax Example
  95
  96       store %ptr1 # foo:bar
  97       store %ptr1 !mmra !{!"foo", !"bar"}
  98
  99     These two notations can be used in this document and are strictly
 100     equivalent. However, only the second version is functional.
 101
 102 compatibility
 103     Two sets of tags are said to be *compatible* iff, for every unique
 104     tag prefix P present in at least one set:
 105
 106     - the other set contains no tag with prefix P, or
 107     - at least one tag with prefix P is common to both sets.
 108
 109     The above definition implies that an empty set is always compatible
 110     with any other set. This is an important property as it ensures that
 111     if a transform drops the metadata on an operation, it can never affect
 112     correctness. In other words, the memory model cannot be relaxed further
 113     by deleting metadata from instructions.
 114
 115 .. _HappensBefore:
 116
 117 The *happens-before* Relation
 118 ==============================
 119
 120 Compatibility checks can be used to opt out of the *happens-before* relation
 121 established between two instructions.
 122
 123 Ordering
 124     When two instructions' metadata are not compatible, any program order
 125     between them are not in *happens-before*.
 126
 127     For example, consider two tags ``foo:bar`` and
 128     ``foo:baz`` exposed by a target:
 129
 130     .. code-block::
 131
 132        A: store %ptr1                 # foo:bar
 133        B: store %ptr2                 # foo:baz
 134        X: store atomic release %ptr3  # foo:bar
 135
 136     In the above figure, ``A`` is compatible with ``X``, and hence ``A``
 137     happens-before ``X``. But ``B`` is not compatible with
 138     ``X``, and hence it is not happens-before ``X``.
 139
 140 Synchronization
 141     If an synchronizing operation has one or more tags, then whether it
 142     synchronizes-with and participates in the  ``seq_cst`` order with
 143     other operations is target dependent.
 144
 145     Whether the following example synchronizes with another sequence depends
 146     on the target-defined semantics of ``foo:bar`` and ``foo:bux``.
 147
 148     .. code-block::
 149
 150        fence release               # foo:bar
 151        store atomic %ptr1          # foo:bux
 152
 153 Examples
 154 --------
 155
 156 Example 1:
 157     .. code-block::
 158
 159       A: store ptr addrspace(1) %ptr2                  # sync-as:1 vulkan:nonprivate
 160       B: store atomic release ptr addrspace(1) %ptr3   # sync-as:0 vulkan:nonprivate
 161
 162     A and B are not ordered relative to each other
 163     (no *happens-before*) because their sets of tags are not compatible.
 164
 165     Note that the ``sync-as`` value does not have to match the ``addrspace`` value.
 166     e.g. In Example 1, a store-release to a location in ``addrspace(1)`` wants to
 167     only synchronize with operations happening in ``addrspace(0)``.
 168
 169 Example 2:
 170     .. code-block::
 171
 172       A: store ptr addrspace(1) %ptr2                 # sync-as:1 vulkan:nonprivate
 173       B: store atomic release ptr addrspace(1) %ptr3  # sync-as:1 vulkan:nonprivate
 174
 175     The ordering of A and B is unaffected because their set of tags are
 176     compatible.
 177
 178     Note that A and B may or may not be in *happens-before* due to other reasons.
 179
 180 Example 3:
 181     .. code-block::
 182
 183       A: store ptr addrspace(1) %ptr2                 # sync-as:1 vulkan:nonprivate
 184       B: store atomic release ptr addrspace(1) %ptr3  # vulkan:nonprivate
 185
 186     The ordering of A and B is unaffected because their set of tags are
 187     compatible.
 188
 189 Example 4:
 190     .. code-block::
 191
 192       A: store ptr addrspace(1) %ptr2                 # sync-as:1
 193       B: store atomic release ptr addrspace(1) %ptr3  # sync-as:2
 194
 195     A and B do not have to be ordered relative to each other
 196     (no *happens-before*) because their sets of tags are not compatible.
 197
 198 Use-cases
 199 =========
 200
 201 SPIRV ``NonPrivatePointer``
 202 ---------------------------
 203
 204 MMRAs can support the SPIRV capability
 205 ``VulkanMemoryModel``, where synchronizing operations only affect
 206 memory operations that specify ``NonPrivatePointer`` semantics.
 207
 208 The example below is generated from a SPIRV program using the
 209 following recipe:
 210
 211 - Add ``vulkan:nonprivate`` to every synchronizing operation.
 212 - Add ``vulkan:nonprivate`` to every non-atomic memory operation
 213   that is marked ``NonPrivatePointer``.
 214 - Add ``vulkan:private`` to tags of every non-atomic memory operation
 215   that is not marked ``NonPrivatePointer``.
 216
 217 .. code-block::
 218
 219    Thread T1:
 220     A: store %ptr1                 # vulkan:nonprivate
 221     B: store %ptr2                 # vulkan:private
 222     X: store atomic release %ptr3  # vulkan:nonprivate
 223
 224    Thread T2:
 225     Y: load atomic acquire %ptr3   # vulkan:nonprivate
 226     C: load %ptr2                  # vulkan:private
 227     D: load %ptr1                  # vulkan:nonprivate
 228
 229 Compatibility ensures that operation ``A`` is ordered
 230 relative to ``X`` while operation ``D`` is ordered relative to ``Y``.
 231 If ``X`` synchronizes with ``Y``, then ``A`` happens-before ``D``.
 232 No such relation can be inferred about operations ``B`` and ``C``.
 233
 234 .. note::
 235    The `Vulkan Memory Model <https://registry.khronos.org/vulkan/specs/1.3-extensions/html/vkspec.html#memory-model-non-private>`_
 236    considers all atomic operation non-private.
 237
 238    Whether ``vulkan:nonprivate`` would be specified on atomic operations is
 239    an implementation detail, as an atomic operation is always ``nonprivate``.
 240    The implementation may choose to be explicit and emit IR with
 241    ``vulkan:nonprivate`` on every atomic operation, or it could choose to
 242    only emit ``vulkan::private`` and assume ``vulkan:nonprivate``
 243    by default.
 244
 245 Operations marked with ``vulkan:private`` effectively opt out of the
 246 happens-before order in a SPIRV program since they are incompatible
 247 with every synchronizing operation. Note that SPIRV operations that
 248 are not marked ``NonPrivatePointer`` are not entirely private to the
 249 thread --- they are implicitly synchronized at the start or end of a
 250 thread by the Vulkan *system-synchronizes-with* relationship. This
 251 example assumes that the target-defined semantics of
 252 ``vulkan:private`` correctly implements this property.
 253
 254 This scheme is general enough to express the interoperability of SPIRV
 255 programs with other environments.
 256
 257 .. code-block::
 258
 259    Thread T1:
 260    A: store %ptr1                 # vulkan:nonprivate
 261    X: store atomic release %ptr2  # vulkan:nonprivate
 262
 263    Thread T2:
 264    Y: load atomic acquire %ptr2   # foo:bar
 265    B: load %ptr1
 266
 267 In the above example, thread ``T1`` originates from a SPIRV program
 268 while thread ``T2`` originates from a non-SPIRV program. Whether ``X``
 269 can synchronize with ``Y`` is target defined.  If ``X`` synchronizes
 270 with ``Y``, then ``A`` happens before ``B`` (because A/X and
 271 Y/B are compatible).
 272
 273 Implementation Example
 274 ~~~~~~~~~~~~~~~~~~~~~~
 275
 276 Consider the implementation of SPIRV ``NonPrivatePointer`` on a target
 277 where all memory operations are cached, and the entire cache is
 278 flushed or invalidated at a ``release`` or ``acquire`` respectively. A
 279 possible scheme is that when translating a SPIRV program, memory
 280 operations marked ``NonPrivatePointer`` should not be cached, and the
 281 cache contents should not be touched during an ``acquire`` and
 282 ``release`` operation.
 283
 284 This could be implemented using the tags that share the ``vulkan:`` prefix,
 285 as follows:
 286
 287 - For memory operations:
 288
 289   - Operations with ``vulkan:nonprivate`` should bypass the cache.
 290   - Operations with ``vulkan:private`` should be cached.
 291   - Operations that specify neither or both should conservatively
 292     bypass the cache to ensure correctness.
 293
 294 - For synchronizing operations:
 295
 296   - Operations with ``vulkan:nonprivate`` should not flush or
 297     invalidate the cache.
 298   - Operations with ``vulkan:private`` should flush or invalidate the cache.
 299   - Operations that specify neither or both should conservatively
 300     flush or invalidate the cache to ensure correctness.
 301
 302 .. note::
 303    In such an implementation, dropping the metadata on an operation, while
 304    not affecting correctness, may have big performance implications.
 305    e.g. an operation bypasses the cache when it shouldn't.
 306
 307 Memory Types
 308 ------------
 309
 310 MMRAs may express the selective synchronization of
 311 different memory types.
 312
 313 As an example, a target may expose an ``sync-as:<N>`` tag to
 314 pass information about which address spaces are synchronized by the
 315 execution of a synchronizing operation.
 316
 317 .. note::
 318   Address spaces are used here as a common example, but this concept
 319   can apply for other "memory types". What "memory types" means here is
 320   up to the target.
 321
 322 .. code-block::
 323
 324    # let 1 = global address space
 325    # let 3 = local address space
 326
 327    Thread T1:
 328    A: store %ptr1                                  # sync-as:1
 329    B: store %ptr2                                  # sync-as:3
 330    X: store atomic release ptr addrspace(0) %ptr3  # sync-as:3
 331
 332    Thread T2:
 333    Y: load atomic acquire ptr addrspace(0) %ptr3   # sync-as:3
 334    C: load %ptr2                                   # sync-as:3
 335    D: load %ptr1                                   # sync-as:1
 336
 337 In the above figure, ``X`` and ``Y`` are atomic operations on a
 338 location in the ``global``  address space. If ``X`` synchronizes with
 339 ``Y``, then ``B`` happens-before ``C`` in the ``local`` address
 340 space. But no such statement can be made about operations ``A`` and
 341 ``D``, although they are peformed on a location in the ``global``
 342 address space.
 343
 344 Implementation Example: Adding Address Space Information to Fences
 345 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 346
 347 Languages such as OpenCL C provide fence operations such as
 348 ``atomic_work_item_fence`` that can take an explicit address
 349 space to fence.
 350
 351 By default, LLVM has no means to carry that information in the IR, so
 352 the information is lost during lowering to LLVM IR. This means that
 353 targets such as AMDGPU have to conservatively emit instructions to
 354 fence all address spaces in all cases, which can have a noticeable
 355 performance impact in high-performance applications.
 356
 357 MMRAs may be used to preserve that information at the IR level, all the
 358 way through code generation. For example, a fence that only affects the
 359 global address space ``addrspace(1)`` may be lowered as
 360
 361 .. code-block::
 362
 363     fence release # sync-as:1
 364
 365 and the target may use the presence of ``sync-as:1`` to infer that it
 366 must only emit instruction to fence the global address space.
 367
 368 Note that as MMRAs are opt in, a fence that does not have MMRA metadata
 369 could still be lowered conservatively, so this optimization would only
 370 apply if the front-end emits the MMRA metadata on the fence instructions.
 371
 372 Additional Topics
 373 =================
 374
 375 .. note::
 376
 377   The following sections are informational.
 378
 379 Performance Impact
 380 ------------------
 381
 382 MMRAs are a way to capture optimization opportunities in the program.
 383 But when an operation mentions no tags or conflicting tags,
 384 the target may need to produce conservative code to ensure correctness
 385 at the cost of performance. This can happen in the following situations:
 386
 387 1. When a target first introduces MMRAs, the
 388    frontend might not have been updated to emit them.
 389 2. An optimization may drop MMRA metadata.
 390 3. An optimization may add arbitrary tags to an operation.
 391
 392 Note that targets can always choose to ignore (or even drop) MMRAs
 393 and revert to the default behavior/codegen heuristics without
 394 affecting correctness.
 395
 396 Consequences of the Absence of *happens-before*
 397 -----------------------------------------------
 398
 399 In the :ref:`happens-before<HappensBefore>` section, we defined how an
 400 *happens-before* relation between two instruction can be broken
 401 by leveraging compatibility between MMRAs. When the instructions
 402 are incompatible and there is no *happens-before* relation, we say
 403 that the instructions "do not have to be ordered relative to each
 404 other".
 405
 406 "Ordering" in this context is a very broad term which covers both
 407 static and runtime aspects.
 408
 409 When there is no ordering constraint, we *could* statically reorder
 410 the instructions in an optimizer transform if the reordering does
 411 not break other constraints as single location coherence.
 412 Static reordering is one consequence of breaking *happens-before*,
 413 but is not the most interesting one.
 414
 415 Run-time consequences are more interesting. When there is an
 416 *happens-before* relation between instructions, the target has to emit
 417 synchronization code to ensure other threads will observe the effects of
 418 the instructions in the right order.
 419
 420 For instance, the target may have to wait for previous loads & stores to
 421 finish before starting a fence-release, or there may be a need to flush a
 422 memory cache before executing the next instruction.
 423 In the absence of *happens-before*, there is no such requirement and
 424 no waiting or flushing is required. This may noticeably speed up
 425 execution in some cases.
 426
 427 Combining Operations
 428 --------------------
 429
 430 If a pass can combine multiple memory or synchronizing operations
 431 into one, it needs to be able to combine MMRAs. One possible way to
 432 achieve this is by doing a prefix-wise union of the tag sets.
 433
 434 Let A and B be two tags set, and U be the prefix-wise union of A and B.
 435 For every unique tag prefix P present in A or B:
 436
 437 * If either A or B has no tags with prefix P, no tags with prefix
 438   P are added to U.
 439 * If both A and B have at least one tag with prefix P, all tags with prefix
 440   P from both sets are added to U.
 441
 442 Passes should avoid aggressively combining MMRAs, as this can result
 443 in significant losses of information. While this cannot affect
 444 correctness, it may affect performance.
 445
 446 As a general rule of thumb, common passes such as SimplifyCFG that
 447 aggressively combine/reorder operations should only combine
 448 instructions that have identical sets of tags.
 449 Passes that combine less frequently, or that are well aware of the cost
 450 of combining the MMRAs can use the prefix-wise union described above.
 451
 452 Examples:
 453
 454 .. code-block::
 455
 456     A: store release %ptr1  # foo:x, foo:y, bar:x
 457     B: store release %ptr2  # foo:x, bar:y
 458
 459     # Unique prefixes P = [foo, bar]
 460     # "foo:x" is common to A and B so it's added to U.
 461     # "bar:x" != "bar:y" so it's not added to U.
 462     U: store release %ptr3  # foo:x
 463
 464 .. code-block::
 465
 466     A: store release %ptr1  # foo:x, foo:y
 467     B: store release %ptr2  # foo:x, bux:y
 468
 469     # Unique prefixes P = [foo, bux]
 470     # "foo:x" is common to A and B so it's added to U.
 471     # No tags have the prefix "bux" in A.
 472     U: store release %ptr3  # foo:x
 473
 474 .. code-block::
 475
 476     A: store release %ptr1
 477     B: store release %ptr2  # foo:x, bar:y
 478
 479     # Unique prefixes P = [foo, bar]
 480     # No tags with "foo" or "bar" in A, so no tags added.
 481     U: store release %ptr3