docs/AMDGPUUsage.rst

   1 =============================
   2 User Guide for AMDGPU Backend
   3 =============================
   4
   5 .. contents::
   6    :local:
   7
   8 Introduction
   9 ============
  10
  11 The AMDGPU backend provides ISA code generation for AMD GPUs, starting with the
  12 R600 family up until the current GCN families. It lives in the
  13 ``lib/Target/AMDGPU`` directory.
  14
  15 LLVM
  16 ====
  17
  18 .. _amdgpu-target-triples:
  19
  20 Target Triples
  21 --------------
  22
  23 Use the ``clang -target <Architecture>-<Vendor>-<OS>-<Environment>`` option to
  24 specify the target triple:
  25
  26   .. table:: AMDGPU Architectures
  27      :name: amdgpu-architecture-table
  28
  29      ============ ==============================================================
  30      Architecture Description
  31      ============ ==============================================================
  32      ``r600``     AMD GPUs HD2XXX-HD6XXX for graphics and compute shaders.
  33      ``amdgcn``   AMD GPUs GCN GFX6 onwards for graphics and compute shaders.
  34      ============ ==============================================================
  35
  36   .. table:: AMDGPU Vendors
  37      :name: amdgpu-vendor-table
  38
  39      ============ ==============================================================
  40      Vendor       Description
  41      ============ ==============================================================
  42      ``amd``      Can be used for all AMD GPU usage.
  43      ``mesa3d``   Can be used if the OS is ``mesa3d``.
  44      ============ ==============================================================
  45
  46   .. table:: AMDGPU Operating Systems
  47      :name: amdgpu-os-table
  48
  49      ============== ============================================================
  50      OS             Description
  51      ============== ============================================================
  52      *<empty>*      Defaults to the *unknown* OS.
  53      ``amdhsa``     Compute kernels executed on HSA [HSA]_ compatible runtimes
  54                     such as AMD's ROCm [AMD-ROCm]_.
  55      ``amdpal``     Graphic shaders and compute kernels executed on AMD PAL
  56                     runtime.
  57      ``mesa3d``     Graphic shaders and compute kernels executed on Mesa 3D
  58                     runtime.
  59      ============== ============================================================
  60
  61   .. table:: AMDGPU Environments
  62      :name: amdgpu-environment-table
  63
  64      ============ ==============================================================
  65      Environment  Description
  66      ============ ==============================================================
  67      *<empty>*    Default.
  68      ============ ==============================================================
  69
  70 .. _amdgpu-processors:
  71
  72 Processors
  73 ----------
  74
  75 Use the ``clang -mcpu <Processor>`` option to specify the AMD GPU processor. The
  76 names from both the *Processor* and *Alternative Processor* can be used.
  77
  78   .. table:: AMDGPU Processors
  79      :name: amdgpu-processor-table
  80
  81      =========== =============== ============ ===== ================= ======= ======================
  82      Processor   Alternative     Target       dGPU/ Target            ROCm    Example
  83                  Processor       Triple       APU   Features          Support Products
  84                                  Architecture       Supported
  85                                                     [Default]
  86      =========== =============== ============ ===== ================= ======= ======================
  87      **Radeon HD 2000/3000 Series (R600)** [AMD-RADEON-HD-2000-3000]_
  88      -----------------------------------------------------------------------------------------------
  89      ``r600``                    ``r600``     dGPU
  90      ``r630``                    ``r600``     dGPU
  91      ``rs880``                   ``r600``     dGPU
  92      ``rv670``                   ``r600``     dGPU
  93      **Radeon HD 4000 Series (R700)** [AMD-RADEON-HD-4000]_
  94      -----------------------------------------------------------------------------------------------
  95      ``rv710``                   ``r600``     dGPU
  96      ``rv730``                   ``r600``     dGPU
  97      ``rv770``                   ``r600``     dGPU
  98      **Radeon HD 5000 Series (Evergreen)** [AMD-RADEON-HD-5000]_
  99      -----------------------------------------------------------------------------------------------
 100      ``cedar``                   ``r600``     dGPU
 101      ``cypress``                 ``r600``     dGPU
 102      ``juniper``                 ``r600``     dGPU
 103      ``redwood``                 ``r600``     dGPU
 104      ``sumo``                    ``r600``     dGPU
 105      **Radeon HD 6000 Series (Northern Islands)** [AMD-RADEON-HD-6000]_
 106      -----------------------------------------------------------------------------------------------
 107      ``barts``                   ``r600``     dGPU
 108      ``caicos``                  ``r600``     dGPU
 109      ``cayman``                  ``r600``     dGPU
 110      ``turks``                   ``r600``     dGPU
 111      **GCN GFX6 (Southern Islands (SI))** [AMD-GCN-GFX6]_
 112      -----------------------------------------------------------------------------------------------
 113      ``gfx600``  - ``tahiti``    ``amdgcn``   dGPU
 114      ``gfx601``  - ``hainan``    ``amdgcn``   dGPU
 115                  - ``oland``
 116                  - ``pitcairn``
 117                  - ``verde``
 118      **GCN GFX7 (Sea Islands (CI))** [AMD-GCN-GFX7]_
 119      -----------------------------------------------------------------------------------------------
 120      ``gfx700``  - ``kaveri``    ``amdgcn``   APU                             - A6-7000
 121                                                                               - A6 Pro-7050B
 122                                                                               - A8-7100
 123                                                                               - A8 Pro-7150B
 124                                                                               - A10-7300
 125                                                                               - A10 Pro-7350B
 126                                                                               - FX-7500
 127                                                                               - A8-7200P
 128                                                                               - A10-7400P
 129                                                                               - FX-7600P
 130      ``gfx701``  - ``hawaii``    ``amdgcn``   dGPU                    ROCm    - FirePro W8100
 131                                                                               - FirePro W9100
 132                                                                               - FirePro S9150
 133                                                                               - FirePro S9170
 134      ``gfx702``                  ``amdgcn``   dGPU                    ROCm    - Radeon R9 290
 135                                                                               - Radeon R9 290x
 136                                                                               - Radeon R390
 137                                                                               - Radeon R390x
 138      ``gfx703``  - ``kabini``    ``amdgcn``   APU                             - E1-2100
 139                  - ``mullins``                                                - E1-2200
 140                                                                               - E1-2500
 141                                                                               - E2-3000
 142                                                                               - E2-3800
 143                                                                               - A4-5000
 144                                                                               - A4-5100
 145                                                                               - A6-5200
 146                                                                               - A4 Pro-3340B
 147      ``gfx704``  - ``bonaire``   ``amdgcn``   dGPU                            - Radeon HD 7790
 148                                                                               - Radeon HD 8770
 149                                                                               - R7 260
 150                                                                               - R7 260X
 151      **GCN GFX8 (Volcanic Islands (VI))** [AMD-GCN-GFX8]_
 152      -----------------------------------------------------------------------------------------------
 153      ``gfx801``  - ``carrizo``   ``amdgcn``   APU   - xnack                   - A6-8500P
 154                                                       [on]                    - Pro A6-8500B
 155                                                                               - A8-8600P
 156                                                                               - Pro A8-8600B
 157                                                                               - FX-8800P
 158                                                                               - Pro A12-8800B
 159      \                           ``amdgcn``   APU   - xnack           ROCm    - A10-8700P
 160                                                       [on]                    - Pro A10-8700B
 161                                                                               - A10-8780P
 162      \                           ``amdgcn``   APU   - xnack                   - A10-9600P
 163                                                       [on]                    - A10-9630P
 164                                                                               - A12-9700P
 165                                                                               - A12-9730P
 166                                                                               - FX-9800P
 167                                                                               - FX-9830P
 168      \                           ``amdgcn``   APU   - xnack                   - E2-9010
 169                                                       [on]                    - A6-9210
 170                                                                               - A9-9410
 171      ``gfx802``  - ``iceland``   ``amdgcn``   dGPU  - xnack           ROCm    - FirePro S7150
 172                  - ``tonga``                          [off]                   - FirePro S7100
 173                                                                               - FirePro W7100
 174                                                                               - Radeon R285
 175                                                                               - Radeon R9 380
 176                                                                               - Radeon R9 385
 177                                                                               - Mobile FirePro
 178                                                                                 M7170
 179      ``gfx803``  - ``fiji``      ``amdgcn``   dGPU  - xnack           ROCm    - Radeon R9 Nano
 180                                                       [off]                   - Radeon R9 Fury
 181                                                                               - Radeon R9 FuryX
 182                                                                               - Radeon Pro Duo
 183                                                                               - FirePro S9300x2
 184                                                                               - Radeon Instinct MI8
 185      \           - ``polaris10`` ``amdgcn``   dGPU  - xnack           ROCm    - Radeon RX 470
 186                                                       [off]                   - Radeon RX 480
 187                                                                               - Radeon Instinct MI6
 188      \           - ``polaris11`` ``amdgcn``   dGPU  - xnack           ROCm    - Radeon RX 460
 189                                                       [off]
 190      ``gfx810``  - ``stoney``    ``amdgcn``   APU   - xnack
 191                                                       [on]
 192      **GCN GFX9** [AMD-GCN-GFX9]_
 193      -----------------------------------------------------------------------------------------------
 194      ``gfx900``                  ``amdgcn``   dGPU  - xnack           ROCm    - Radeon Vega
 195                                                       [off]                     Frontier Edition
 196                                                                               - Radeon RX Vega 56
 197                                                                               - Radeon RX Vega 64
 198                                                                               - Radeon RX Vega 64
 199                                                                                 Liquid
 200                                                                               - Radeon Instinct MI25
 201      ``gfx902``                  ``amdgcn``   APU   - xnack                   - Ryzen 3 2200G
 202                                                       [on]                    - Ryzen 5 2400G
 203      ``gfx904``                  ``amdgcn``   dGPU  - xnack                   *TBA*
 204                                                       [off]
 205                                                                               .. TODO
 206                                                                                  Add product
 207                                                                                  names.
 208      ``gfx906``                  ``amdgcn``   dGPU  - xnack                   - Radeon Instinct MI50
 209                                                       [off]                   - Radeon Instinct MI60
 210      ``gfx908``                  ``amdgcn``   dGPU  - xnack                   *TBA*
 211                                                       [off]
 212                                                       sram-ecc
 213                                                       [on]
 214      ``gfx909``                  ``amdgcn``   APU   - xnack                   *TBA* (Raven Ridge 2)
 215                                                       [on]
 216                                                                               .. TODO
 217                                                                                  Add product
 218                                                                                  names.
 219      **GCN GFX10** [AMD-GCN-GFX10]_
 220      -----------------------------------------------------------------------------------------------
 221      ``gfx1010``                 ``amdgcn``   dGPU  - xnack                   *TBA*
 222                                                       [off]
 223                                                     - wavefrontsize64
 224                                                       [off]
 225                                                     - cumode
 226                                                       [off]
 227                                                                               .. TODO
 228                                                                                  Add product
 229                                                                                  names.
 230      ``gfx1011``                 ``amdgcn``   dGPU  - xnack                   *TBA*
 231                                                       [off]
 232                                                     - wavefrontsize64
 233                                                       [off]
 234                                                     - cumode
 235                                                       [off]
 236                                                                               .. TODO
 237                                                                                  Add product
 238                                                                                  names.
 239      ``gfx1012``                 ``amdgcn``   dGPU  - xnack                   *TBA*
 240                                                       [off]
 241                                                     - wavefrontsize64
 242                                                       [off]
 243                                                     - cumode
 244                                                       [off]
 245                                                                               .. TODO
 246                                                                                  Add product
 247                                                                                  names.
 248      =========== =============== ============ ===== ================= ======= ======================
 249
 250 .. _amdgpu-target-features:
 251
 252 Target Features
 253 ---------------
 254
 255 Target features control how code is generated to support certain
 256 processor specific features. Not all target features are supported by
 257 all processors. The runtime must ensure that the features supported by
 258 the device used to execute the code match the features enabled when
 259 generating the code. A mismatch of features may result in incorrect
 260 execution, or a reduction in performance.
 261
 262 The target features supported by each processor, and the default value
 263 used if not specified explicitly, is listed in
 264 :ref:`amdgpu-processor-table`.
 265
 266 Use the ``clang -m[no-]<TargetFeature>`` option to specify the AMD GPU
 267 target features.
 268
 269 For example:
 270
 271 ``-mxnack``
 272   Enable the ``xnack`` feature.
 273 ``-mno-xnack``
 274   Disable the ``xnack`` feature.
 275
 276   .. table:: AMDGPU Target Features
 277      :name: amdgpu-target-feature-table
 278
 279      ====================== ==================================================
 280      Target Feature         Description
 281      ====================== ==================================================
 282      -m[no-]xnack           Enable/disable generating code that has
 283                             memory clauses that are compatible with
 284                             having XNACK replay enabled.
 285
 286                             This is used for demand paging and page
 287                             migration. If XNACK replay is enabled in
 288                             the device, then if a page fault occurs
 289                             the code may execute incorrectly if the
 290                             ``xnack`` feature is not enabled. Executing
 291                             code that has the feature enabled on a
 292                             device that does not have XNACK replay
 293                             enabled will execute correctly, but may
 294                             be less performant than code with the
 295                             feature disabled.
 296
 297      -m[no-]sram-ecc        Enable/disable generating code that assumes SRAM
 298                             ECC is enabled/disabled.
 299
 300      -m[no-]wavefrontsize64 Control the default wavefront size used when
 301                             generating code for kernels. When disabled
 302                             native wavefront size 32 is used, when enabled
 303                             wavefront size 64 is used.
 304
 305      -m[no-]cumode          Control the default wavefront execution mode used
 306                             when generating code for kernels. When disabled
 307                             native WGP wavefront execution mode is used,
 308                             when enabled CU wavefront execution mode is used
 309                             (see :ref:`amdgpu-amdhsa-memory-model`).
 310      ====================== ==================================================
 311
 312 .. _amdgpu-address-spaces:
 313
 314 Address Spaces
 315 --------------
 316
 317 The AMDGPU backend uses the following address space mappings.
 318
 319 The memory space names used in the table, aside from the region memory space, is
 320 from the OpenCL standard.
 321
 322 LLVM Address Space number is used throughout LLVM (for example, in LLVM IR).
 323
 324   .. table:: Address Space Mapping
 325      :name: amdgpu-address-space-mapping-table
 326
 327      ================== =================================
 328      LLVM Address Space Memory Space
 329      ================== =================================
 330      0                  Generic (Flat)
 331      1                  Global
 332      2                  Region (GDS)
 333      3                  Local (group/LDS)
 334      4                  Constant
 335      5                  Private (Scratch)
 336      6                  Constant 32-bit
 337      7                  Buffer Fat Pointer (experimental)
 338      ================== =================================
 339
 340 The buffer fat pointer is an experimental address space that is currently
 341 unsupported in the backend. It exposes a non-integral pointer that is in future
 342 intended to support the modelling of 128-bit buffer descriptors + a 32-bit
 343 offset into the buffer descriptor (in total encapsulating a 160-bit 'pointer'),
 344 allowing us to use normal LLVM load/store/atomic operations to model the buffer
 345 descriptors used heavily in graphics workloads targeting the backend.
 346
 347 .. _amdgpu-memory-scopes:
 348
 349 Memory Scopes
 350 -------------
 351
 352 This section provides LLVM memory synchronization scopes supported by the AMDGPU
 353 backend memory model when the target triple OS is ``amdhsa`` (see
 354 :ref:`amdgpu-amdhsa-memory-model` and :ref:`amdgpu-target-triples`).
 355
 356 The memory model supported is based on the HSA memory model [HSA]_ which is
 357 based in turn on HRF-indirect with scope inclusion [HRF]_. The happens-before
 358 relation is transitive over the synchonizes-with relation independent of scope,
 359 and synchonizes-with allows the memory scope instances to be inclusive (see
 360 table :ref:`amdgpu-amdhsa-llvm-sync-scopes-table`).
 361
 362 This is different to the OpenCL [OpenCL]_ memory model which does not have scope
 363 inclusion and requires the memory scopes to exactly match. However, this
 364 is conservatively correct for OpenCL.
 365
 366   .. table:: AMDHSA LLVM Sync Scopes
 367      :name: amdgpu-amdhsa-llvm-sync-scopes-table
 368
 369      ======================= ===================================================
 370      LLVM Sync Scope         Description
 371      ======================= ===================================================
 372      *none*                  The default: ``system``.
 373
 374                              Synchronizes with, and participates in modification
 375                              and seq_cst total orderings with, other operations
 376                              (except image operations) for all address spaces
 377                              (except private, or generic that accesses private)
 378                              provided the other operation's sync scope is:
 379
 380                              - ``system``.
 381                              - ``agent`` and executed by a thread on the same
 382                                agent.
 383                              - ``workgroup`` and executed by a thread in the
 384                                same workgroup.
 385                              - ``wavefront`` and executed by a thread in the
 386                                same wavefront.
 387
 388      ``agent``               Synchronizes with, and participates in modification
 389                              and seq_cst total orderings with, other operations
 390                              (except image operations) for all address spaces
 391                              (except private, or generic that accesses private)
 392                              provided the other operation's sync scope is:
 393
 394                              - ``system`` or ``agent`` and executed by a thread
 395                                on the same agent.
 396                              - ``workgroup`` and executed by a thread in the
 397                                same workgroup.
 398                              - ``wavefront`` and executed by a thread in the
 399                                same wavefront.
 400
 401      ``workgroup``           Synchronizes with, and participates in modification
 402                              and seq_cst total orderings with, other operations
 403                              (except image operations) for all address spaces
 404                              (except private, or generic that accesses private)
 405                              provided the other operation's sync scope is:
 406
 407                              - ``system``, ``agent`` or ``workgroup`` and
 408                                executed by a thread in the same workgroup.
 409                              - ``wavefront`` and executed by a thread in the
 410                                same wavefront.
 411
 412      ``wavefront``           Synchronizes with, and participates in modification
 413                              and seq_cst total orderings with, other operations
 414                              (except image operations) for all address spaces
 415                              (except private, or generic that accesses private)
 416                              provided the other operation's sync scope is:
 417
 418                              - ``system``, ``agent``, ``workgroup`` or
 419                                ``wavefront`` and executed by a thread in the
 420                                same wavefront.
 421
 422      ``singlethread``        Only synchronizes with, and participates in
 423                              modification and seq_cst total orderings with,
 424                              other operations (except image operations) running
 425                              in the same thread for all address spaces (for
 426                              example, in signal handlers).
 427
 428      ``one-as``              Same as ``system`` but only synchronizes with other
 429                              operations within the same address space.
 430
 431      ``agent-one-as``        Same as ``agent`` but only synchronizes with other
 432                              operations within the same address space.
 433
 434      ``workgroup-one-as``    Same as ``workgroup`` but only synchronizes with
 435                              other operations within the same address space.
 436
 437      ``wavefront-one-as``    Same as ``wavefront`` but only synchronizes with
 438                              other operations within the same address space.
 439
 440      ``singlethread-one-as`` Same as ``singlethread`` but only synchronizes with
 441                              other operations within the same address space.
 442      ======================= ===================================================
 443
 444 AMDGPU Intrinsics
 445 -----------------
 446
 447 The AMDGPU backend implements the following LLVM IR intrinsics.
 448
 449 *This section is WIP.*
 450
 451 .. TODO
 452    List AMDGPU intrinsics
 453
 454 AMDGPU Attributes
 455 -----------------
 456
 457 The AMDGPU backend supports the following LLVM IR attributes.
 458
 459   .. table:: AMDGPU LLVM IR Attributes
 460      :name: amdgpu-llvm-ir-attributes-table
 461
 462      ======================================= ==========================================================
 463      LLVM Attribute                          Description
 464      ======================================= ==========================================================
 465      "amdgpu-flat-work-group-size"="min,max" Specify the minimum and maximum flat work group sizes that
 466                                              will be specified when the kernel is dispatched. Generated
 467                                              by the ``amdgpu_flat_work_group_size`` CLANG attribute [CLANG-ATTR]_.
 468      "amdgpu-implicitarg-num-bytes"="n"      Number of kernel argument bytes to add to the kernel
 469                                              argument block size for the implicit arguments. This
 470                                              varies by OS and language (for OpenCL see
 471                                              :ref:`opencl-kernel-implicit-arguments-appended-for-amdhsa-os-table`).
 472      "amdgpu-num-sgpr"="n"                   Specifies the number of SGPRs to use. Generated by
 473                                              the ``amdgpu_num_sgpr`` CLANG attribute [CLANG-ATTR]_.
 474      "amdgpu-num-vgpr"="n"                   Specifies the number of VGPRs to use. Generated by the
 475                                              ``amdgpu_num_vgpr`` CLANG attribute [CLANG-ATTR]_.
 476      "amdgpu-waves-per-eu"="m,n"             Specify the minimum and maximum number of waves per
 477                                              execution unit. Generated by the ``amdgpu_waves_per_eu``
 478                                              CLANG attribute [CLANG-ATTR]_.
 479      "amdgpu-ieee" true/false.               Specify whether the function expects the IEEE field of the
 480                                              mode register to be set on entry. Overrides the default for
 481                                              the calling convention.
 482      "amdgpu-dx10-clamp" true/false.         Specify whether the function expects the DX10_CLAMP field of
 483                                              the mode register to be set on entry. Overrides the default
 484                                              for the calling convention.
 485      ======================================= ==========================================================
 486
 487 Code Object
 488 ===========
 489
 490 The AMDGPU backend generates a standard ELF [ELF]_ relocatable code object that
 491 can be linked by ``lld`` to produce a standard ELF shared code object which can
 492 be loaded and executed on an AMDGPU target.
 493
 494 Header
 495 ------
 496
 497 The AMDGPU backend uses the following ELF header:
 498
 499   .. table:: AMDGPU ELF Header
 500      :name: amdgpu-elf-header-table
 501
 502      ========================== ===============================
 503      Field                      Value
 504      ========================== ===============================
 505      ``e_ident[EI_CLASS]``      ``ELFCLASS64``
 506      ``e_ident[EI_DATA]``       ``ELFDATA2LSB``
 507      ``e_ident[EI_OSABI]``      - ``ELFOSABI_NONE``
 508                                 - ``ELFOSABI_AMDGPU_HSA``
 509                                 - ``ELFOSABI_AMDGPU_PAL``
 510                                 - ``ELFOSABI_AMDGPU_MESA3D``
 511      ``e_ident[EI_ABIVERSION]`` - ``ELFABIVERSION_AMDGPU_HSA``
 512                                 - ``ELFABIVERSION_AMDGPU_PAL``
 513                                 - ``ELFABIVERSION_AMDGPU_MESA3D``
 514      ``e_type``                 - ``ET_REL``
 515                                 - ``ET_DYN``
 516      ``e_machine``              ``EM_AMDGPU``
 517      ``e_entry``                0
 518      ``e_flags``                See :ref:`amdgpu-elf-header-e_flags-table`
 519      ========================== ===============================
 520
 521 ..
 522
 523   .. table:: AMDGPU ELF Header Enumeration Values
 524      :name: amdgpu-elf-header-enumeration-values-table
 525
 526      =============================== =====
 527      Name                            Value
 528      =============================== =====
 529      ``EM_AMDGPU``                   224
 530      ``ELFOSABI_NONE``               0
 531      ``ELFOSABI_AMDGPU_HSA``         64
 532      ``ELFOSABI_AMDGPU_PAL``         65
 533      ``ELFOSABI_AMDGPU_MESA3D``      66
 534      ``ELFABIVERSION_AMDGPU_HSA``    1
 535      ``ELFABIVERSION_AMDGPU_PAL``    0
 536      ``ELFABIVERSION_AMDGPU_MESA3D`` 0
 537      =============================== =====
 538
 539 ``e_ident[EI_CLASS]``
 540   The ELF class is:
 541
 542   * ``ELFCLASS32`` for ``r600`` architecture.
 543
 544   * ``ELFCLASS64`` for ``amdgcn`` architecture which only supports 64
 545     bit applications.
 546
 547 ``e_ident[EI_DATA]``
 548   All AMDGPU targets use ``ELFDATA2LSB`` for little-endian byte ordering.
 549
 550 ``e_ident[EI_OSABI]``
 551   One of the following AMD GPU architecture specific OS ABIs
 552   (see :ref:`amdgpu-os-table`):
 553
 554   * ``ELFOSABI_NONE`` for *unknown* OS.
 555
 556   * ``ELFOSABI_AMDGPU_HSA`` for ``amdhsa`` OS.
 557
 558   * ``ELFOSABI_AMDGPU_PAL`` for ``amdpal`` OS.
 559
 560   * ``ELFOSABI_AMDGPU_MESA3D`` for ``mesa3D`` OS.
 561
 562 ``e_ident[EI_ABIVERSION]``
 563   The ABI version of the AMD GPU architecture specific OS ABI to which the code
 564   object conforms:
 565
 566   * ``ELFABIVERSION_AMDGPU_HSA`` is used to specify the version of AMD HSA
 567     runtime ABI.
 568
 569   * ``ELFABIVERSION_AMDGPU_PAL`` is used to specify the version of AMD PAL
 570     runtime ABI.
 571
 572   * ``ELFABIVERSION_AMDGPU_MESA3D`` is used to specify the version of AMD MESA
 573     3D runtime ABI.
 574
 575 ``e_type``
 576   Can be one of the following values:
 577
 578
 579   ``ET_REL``
 580     The type produced by the AMD GPU backend compiler as it is relocatable code
 581     object.
 582
 583   ``ET_DYN``
 584     The type produced by the linker as it is a shared code object.
 585
 586   The AMD HSA runtime loader requires a ``ET_DYN`` code object.
 587
 588 ``e_machine``
 589   The value ``EM_AMDGPU`` is used for the machine for all processors supported
 590   by the ``r600`` and ``amdgcn`` architectures (see
 591   :ref:`amdgpu-processor-table`). The specific processor is specified in the
 592   ``EF_AMDGPU_MACH`` bit field of the ``e_flags`` (see
 593   :ref:`amdgpu-elf-header-e_flags-table`).
 594
 595 ``e_entry``
 596   The entry point is 0 as the entry points for individual kernels must be
 597   selected in order to invoke them through AQL packets.
 598
 599 ``e_flags``
 600   The AMDGPU backend uses the following ELF header flags:
 601
 602   .. table:: AMDGPU ELF Header ``e_flags``
 603      :name: amdgpu-elf-header-e_flags-table
 604
 605      ================================= ========== =============================
 606      Name                              Value      Description
 607      ================================= ========== =============================
 608      **AMDGPU Processor Flag**                    See :ref:`amdgpu-processor-table`.
 609      -------------------------------------------- -----------------------------
 610      ``EF_AMDGPU_MACH``                0x000000ff AMDGPU processor selection
 611                                                   mask for
 612                                                   ``EF_AMDGPU_MACH_xxx`` values
 613                                                   defined in
 614                                                   :ref:`amdgpu-ef-amdgpu-mach-table`.
 615      ``EF_AMDGPU_XNACK``               0x00000100 Indicates if the ``xnack``
 616                                                   target feature is
 617                                                   enabled for all code
 618                                                   contained in the code object.
 619                                                   If the processor
 620                                                   does not support the
 621                                                   ``xnack`` target
 622                                                   feature then must
 623                                                   be 0.
 624                                                   See
 625                                                   :ref:`amdgpu-target-features`.
 626      ``EF_AMDGPU_SRAM_ECC``            0x00000200 Indicates if the ``sram-ecc``
 627                                                   target feature is
 628                                                   enabled for all code
 629                                                   contained in the code object.
 630                                                   If the processor
 631                                                   does not support the
 632                                                   ``sram-ecc`` target
 633                                                   feature then must
 634                                                   be 0.
 635                                                   See
 636                                                   :ref:`amdgpu-target-features`.
 637      ================================= ========== =============================
 638
 639   .. table:: AMDGPU ``EF_AMDGPU_MACH`` Values
 640      :name: amdgpu-ef-amdgpu-mach-table
 641
 642      ================================= ========== =============================
 643      Name                              Value      Description (see
 644                                                   :ref:`amdgpu-processor-table`)
 645      ================================= ========== =============================
 646      ``EF_AMDGPU_MACH_NONE``           0x000      *not specified*
 647      ``EF_AMDGPU_MACH_R600_R600``      0x001      ``r600``
 648      ``EF_AMDGPU_MACH_R600_R630``      0x002      ``r630``
 649      ``EF_AMDGPU_MACH_R600_RS880``     0x003      ``rs880``
 650      ``EF_AMDGPU_MACH_R600_RV670``     0x004      ``rv670``
 651      ``EF_AMDGPU_MACH_R600_RV710``     0x005      ``rv710``
 652      ``EF_AMDGPU_MACH_R600_RV730``     0x006      ``rv730``
 653      ``EF_AMDGPU_MACH_R600_RV770``     0x007      ``rv770``
 654      ``EF_AMDGPU_MACH_R600_CEDAR``     0x008      ``cedar``
 655      ``EF_AMDGPU_MACH_R600_CYPRESS``   0x009      ``cypress``
 656      ``EF_AMDGPU_MACH_R600_JUNIPER``   0x00a      ``juniper``
 657      ``EF_AMDGPU_MACH_R600_REDWOOD``   0x00b      ``redwood``
 658      ``EF_AMDGPU_MACH_R600_SUMO``      0x00c      ``sumo``
 659      ``EF_AMDGPU_MACH_R600_BARTS``     0x00d      ``barts``
 660      ``EF_AMDGPU_MACH_R600_CAICOS``    0x00e      ``caicos``
 661      ``EF_AMDGPU_MACH_R600_CAYMAN``    0x00f      ``cayman``
 662      ``EF_AMDGPU_MACH_R600_TURKS``     0x010      ``turks``
 663      *reserved*                        0x011 -    Reserved for ``r600``
 664                                        0x01f      architecture processors.
 665      ``EF_AMDGPU_MACH_AMDGCN_GFX600``  0x020      ``gfx600``
 666      ``EF_AMDGPU_MACH_AMDGCN_GFX601``  0x021      ``gfx601``
 667      ``EF_AMDGPU_MACH_AMDGCN_GFX700``  0x022      ``gfx700``
 668      ``EF_AMDGPU_MACH_AMDGCN_GFX701``  0x023      ``gfx701``
 669      ``EF_AMDGPU_MACH_AMDGCN_GFX702``  0x024      ``gfx702``
 670      ``EF_AMDGPU_MACH_AMDGCN_GFX703``  0x025      ``gfx703``
 671      ``EF_AMDGPU_MACH_AMDGCN_GFX704``  0x026      ``gfx704``
 672      *reserved*                        0x027      Reserved.
 673      ``EF_AMDGPU_MACH_AMDGCN_GFX801``  0x028      ``gfx801``
 674      ``EF_AMDGPU_MACH_AMDGCN_GFX802``  0x029      ``gfx802``
 675      ``EF_AMDGPU_MACH_AMDGCN_GFX803``  0x02a      ``gfx803``
 676      ``EF_AMDGPU_MACH_AMDGCN_GFX810``  0x02b      ``gfx810``
 677      ``EF_AMDGPU_MACH_AMDGCN_GFX900``  0x02c      ``gfx900``
 678      ``EF_AMDGPU_MACH_AMDGCN_GFX902``  0x02d      ``gfx902``
 679      ``EF_AMDGPU_MACH_AMDGCN_GFX904``  0x02e      ``gfx904``
 680      ``EF_AMDGPU_MACH_AMDGCN_GFX906``  0x02f      ``gfx906``
 681      ``EF_AMDGPU_MACH_AMDGCN_GFX908``  0x030      ``gfx908``
 682      ``EF_AMDGPU_MACH_AMDGCN_GFX909``  0x031      ``gfx909``
 683      *reserved*                        0x032      Reserved.
 684      ``EF_AMDGPU_MACH_AMDGCN_GFX1010`` 0x033      ``gfx1010``
 685      ``EF_AMDGPU_MACH_AMDGCN_GFX1011`` 0x034      ``gfx1011``
 686      ``EF_AMDGPU_MACH_AMDGCN_GFX1012`` 0x035      ``gfx1012``
 687      ================================= ========== =============================
 688
 689 Sections
 690 --------
 691
 692 An AMDGPU target ELF code object has the standard ELF sections which include:
 693
 694   .. table:: AMDGPU ELF Sections
 695      :name: amdgpu-elf-sections-table
 696
 697      ================== ================ =================================
 698      Name               Type             Attributes
 699      ================== ================ =================================
 700      ``.bss``           ``SHT_NOBITS``   ``SHF_ALLOC`` + ``SHF_WRITE``
 701      ``.data``          ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_WRITE``
 702      ``.debug_``\ *\**  ``SHT_PROGBITS`` *none*
 703      ``.dynamic``       ``SHT_DYNAMIC``  ``SHF_ALLOC``
 704      ``.dynstr``        ``SHT_PROGBITS`` ``SHF_ALLOC``
 705      ``.dynsym``        ``SHT_PROGBITS`` ``SHF_ALLOC``
 706      ``.got``           ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_WRITE``
 707      ``.hash``          ``SHT_HASH``     ``SHF_ALLOC``
 708      ``.note``          ``SHT_NOTE``     *none*
 709      ``.rela``\ *name*  ``SHT_RELA``     *none*
 710      ``.rela.dyn``      ``SHT_RELA``     *none*
 711      ``.rodata``        ``SHT_PROGBITS`` ``SHF_ALLOC``
 712      ``.shstrtab``      ``SHT_STRTAB``   *none*
 713      ``.strtab``        ``SHT_STRTAB``   *none*
 714      ``.symtab``        ``SHT_SYMTAB``   *none*
 715      ``.text``          ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_EXECINSTR``
 716      ================== ================ =================================
 717
 718 These sections have their standard meanings (see [ELF]_) and are only generated
 719 if needed.
 720
 721 ``.debug``\ *\**
 722   The standard DWARF sections. See :ref:`amdgpu-dwarf` for information on the
 723   DWARF produced by the AMDGPU backend.
 724
 725 ``.dynamic``, ``.dynstr``, ``.dynsym``, ``.hash``
 726   The standard sections used by a dynamic loader.
 727
 728 ``.note``
 729   See :ref:`amdgpu-note-records` for the note records supported by the AMDGPU
 730   backend.
 731
 732 ``.rela``\ *name*, ``.rela.dyn``
 733   For relocatable code objects, *name* is the name of the section that the
 734   relocation records apply. For example, ``.rela.text`` is the section name for
 735   relocation records associated with the ``.text`` section.
 736
 737   For linked shared code objects, ``.rela.dyn`` contains all the relocation
 738   records from each of the relocatable code object's ``.rela``\ *name* sections.
 739
 740   See :ref:`amdgpu-relocation-records` for the relocation records supported by
 741   the AMDGPU backend.
 742
 743 ``.text``
 744   The executable machine code for the kernels and functions they call. Generated
 745   as position independent code. See :ref:`amdgpu-code-conventions` for
 746   information on conventions used in the isa generation.
 747
 748 .. _amdgpu-note-records:
 749
 750 Note Records
 751 ------------
 752
 753 The AMDGPU backend code object contains ELF note records in the ``.note``
 754 section. The set of generated notes and their semantics depend on the code
 755 object version; see :ref:`amdgpu-note-records-v2` and
 756 :ref:`amdgpu-note-records-v3`.
 757
 758 As required by ``ELFCLASS32`` and ``ELFCLASS64``, minimal zero byte padding
 759 must be generated after the ``name`` field to ensure the ``desc`` field is 4
 760 byte aligned. In addition, minimal zero byte padding must be generated to
 761 ensure the ``desc`` field size is a multiple of 4 bytes. The ``sh_addralign``
 762 field of the ``.note`` section must be at least 4 to indicate at least 8 byte
 763 alignment.
 764
 765 .. _amdgpu-note-records-v2:
 766
 767 Code Object V2 Note Records (-mattr=-code-object-v3)
 768 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 769
 770 .. warning:: Code Object V2 is not the default code object version emitted by
 771   this version of LLVM. For a description of the notes generated with the
 772   default configuration (Code Object V3) see :ref:`amdgpu-note-records-v3`.
 773
 774 The AMDGPU backend code object uses the following ELF note record in the
 775 ``.note`` section when compiling for Code Object V2 (-mattr=-code-object-v3).
 776
 777 Additional note records may be present, but any which are not documented here
 778 are deprecated and should not be used.
 779
 780   .. table:: AMDGPU Code Object V2 ELF Note Records
 781      :name: amdgpu-elf-note-records-table-v2
 782
 783      ===== ============================== ======================================
 784      Name  Type                           Description
 785      ===== ============================== ======================================
 786      "AMD" ``NT_AMD_AMDGPU_HSA_METADATA`` <metadata null terminated string>
 787      ===== ============================== ======================================
 788
 789 ..
 790
 791   .. table:: AMDGPU Code Object V2 ELF Note Record Enumeration Values
 792      :name: amdgpu-elf-note-record-enumeration-values-table-v2
 793
 794      ============================== =====
 795      Name                           Value
 796      ============================== =====
 797      *reserved*                       0-9
 798      ``NT_AMD_AMDGPU_HSA_METADATA``    10
 799      *reserved*                        11
 800      ============================== =====
 801
 802 ``NT_AMD_AMDGPU_HSA_METADATA``
 803   Specifies extensible metadata associated with the code objects executed on HSA
 804   [HSA]_ compatible runtimes such as AMD's ROCm [AMD-ROCm]_. It is required when
 805   the target triple OS is ``amdhsa`` (see :ref:`amdgpu-target-triples`). See
 806   :ref:`amdgpu-amdhsa-code-object-metadata-v2` for the syntax of the code
 807   object metadata string.
 808
 809 .. _amdgpu-note-records-v3:
 810
 811 Code Object V3 Note Records (-mattr=+code-object-v3)
 812 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 813
 814 The AMDGPU backend code object uses the following ELF note record in the
 815 ``.note`` section when compiling for Code Object V3 (-mattr=+code-object-v3).
 816
 817 Additional note records may be present, but any which are not documented here
 818 are deprecated and should not be used.
 819
 820   .. table:: AMDGPU Code Object V3 ELF Note Records
 821      :name: amdgpu-elf-note-records-table-v3
 822
 823      ======== ============================== ======================================
 824      Name     Type                           Description
 825      ======== ============================== ======================================
 826      "AMDGPU" ``NT_AMDGPU_METADATA``         Metadata in Message Pack [MsgPack]_
 827                                              binary format.
 828      ======== ============================== ======================================
 829
 830 ..
 831
 832   .. table:: AMDGPU Code Object V3 ELF Note Record Enumeration Values
 833      :name: amdgpu-elf-note-record-enumeration-values-table-v3
 834
 835      ============================== =====
 836      Name                           Value
 837      ============================== =====
 838      *reserved*                     0-31
 839      ``NT_AMDGPU_METADATA``         32
 840      ============================== =====
 841
 842 ``NT_AMDGPU_METADATA``
 843   Specifies extensible metadata associated with an AMDGPU code
 844   object. It is encoded as a map in the Message Pack [MsgPack]_ binary
 845   data format. See :ref:`amdgpu-amdhsa-code-object-metadata-v3` for the
 846   map keys defined for the ``amdhsa`` OS.
 847
 848 .. _amdgpu-symbols:
 849
 850 Symbols
 851 -------
 852
 853 Symbols include the following:
 854
 855   .. table:: AMDGPU ELF Symbols
 856      :name: amdgpu-elf-symbols-table
 857
 858      ===================== ================== ================ ==================
 859      Name                  Type               Section          Description
 860      ===================== ================== ================ ==================
 861      *link-name*           ``STT_OBJECT``     - ``.data``      Global variable
 862                                               - ``.rodata``
 863                                               - ``.bss``
 864      *link-name*\ ``.kd``  ``STT_OBJECT``     - ``.rodata``    Kernel descriptor
 865      *link-name*           ``STT_FUNC``       - ``.text``      Kernel entry point
 866      *link-name*           ``STT_OBJECT``     - SHN_AMDGPU_LDS Global variable in LDS
 867      ===================== ================== ================ ==================
 868
 869 Global variable
 870   Global variables both used and defined by the compilation unit.
 871
 872   If the symbol is defined in the compilation unit then it is allocated in the
 873   appropriate section according to if it has initialized data or is readonly.
 874
 875   If the symbol is external then its section is ``STN_UNDEF`` and the loader
 876   will resolve relocations using the definition provided by another code object
 877   or explicitly defined by the runtime.
 878
 879   If the symbol resides in local/group memory (LDS) then its section is the
 880   special processor-specific section name ``SHN_AMDGPU_LDS``, and the
 881   ``st_value`` field describes alignment requirements as it does for common
 882   symbols.
 883
 884   .. TODO
 885      Add description of linked shared object symbols. Seems undefined symbols
 886      are marked as STT_NOTYPE.
 887
 888 Kernel descriptor
 889   Every HSA kernel has an associated kernel descriptor. It is the address of the
 890   kernel descriptor that is used in the AQL dispatch packet used to invoke the
 891   kernel, not the kernel entry point. The layout of the HSA kernel descriptor is
 892   defined in :ref:`amdgpu-amdhsa-kernel-descriptor`.
 893
 894 Kernel entry point
 895   Every HSA kernel also has a symbol for its machine code entry point.
 896
 897 .. _amdgpu-relocation-records:
 898
 899 Relocation Records
 900 ------------------
 901
 902 AMDGPU backend generates ``Elf64_Rela`` relocation records. Supported
 903 relocatable fields are:
 904
 905 ``word32``
 906   This specifies a 32-bit field occupying 4 bytes with arbitrary byte
 907   alignment. These values use the same byte order as other word values in the
 908   AMD GPU architecture.
 909
 910 ``word64``
 911   This specifies a 64-bit field occupying 8 bytes with arbitrary byte
 912   alignment. These values use the same byte order as other word values in the
 913   AMD GPU architecture.
 914
 915 Following notations are used for specifying relocation calculations:
 916
 917 **A**
 918   Represents the addend used to compute the value of the relocatable field.
 919
 920 **G**
 921   Represents the offset into the global offset table at which the relocation
 922   entry's symbol will reside during execution.
 923
 924 **GOT**
 925   Represents the address of the global offset table.
 926
 927 **P**
 928   Represents the place (section offset for ``et_rel`` or address for ``et_dyn``)
 929   of the storage unit being relocated (computed using ``r_offset``).
 930
 931 **S**
 932   Represents the value of the symbol whose index resides in the relocation
 933   entry. Relocations not using this must specify a symbol index of ``STN_UNDEF``.
 934
 935 **B**
 936   Represents the base address of a loaded executable or shared object which is
 937   the difference between the ELF address and the actual load address. Relocations
 938   using this are only valid in executable or shared objects.
 939
 940 The following relocation types are supported:
 941
 942   .. table:: AMDGPU ELF Relocation Records
 943      :name: amdgpu-elf-relocation-records-table
 944
 945      ========================== ======= =====  ==========  ==============================
 946      Relocation Type            Kind    Value  Field       Calculation
 947      ========================== ======= =====  ==========  ==============================
 948      ``R_AMDGPU_NONE``                  0      *none*      *none*
 949      ``R_AMDGPU_ABS32_LO``      Static, 1      ``word32``  (S + A) & 0xFFFFFFFF
 950                                 Dynamic
 951      ``R_AMDGPU_ABS32_HI``      Static, 2      ``word32``  (S + A) >> 32
 952                                 Dynamic
 953      ``R_AMDGPU_ABS64``         Static, 3      ``word64``  S + A
 954                                 Dynamic
 955      ``R_AMDGPU_REL32``         Static  4      ``word32``  S + A - P
 956      ``R_AMDGPU_REL64``         Static  5      ``word64``  S + A - P
 957      ``R_AMDGPU_ABS32``         Static, 6      ``word32``  S + A
 958                                 Dynamic
 959      ``R_AMDGPU_GOTPCREL``      Static  7      ``word32``  G + GOT + A - P
 960      ``R_AMDGPU_GOTPCREL32_LO`` Static  8      ``word32``  (G + GOT + A - P) & 0xFFFFFFFF
 961      ``R_AMDGPU_GOTPCREL32_HI`` Static  9      ``word32``  (G + GOT + A - P) >> 32
 962      ``R_AMDGPU_REL32_LO``      Static  10     ``word32``  (S + A - P) & 0xFFFFFFFF
 963      ``R_AMDGPU_REL32_HI``      Static  11     ``word32``  (S + A - P) >> 32
 964      *reserved*                         12
 965      ``R_AMDGPU_RELATIVE64``    Dynamic 13     ``word64``  B + A
 966      ========================== ======= =====  ==========  ==============================
 967
 968 ``R_AMDGPU_ABS32_LO`` and ``R_AMDGPU_ABS32_HI`` are only supported by
 969 the ``mesa3d`` OS, which does not support ``R_AMDGPU_ABS64``.
 970
 971 There is no current OS loader support for 32 bit programs and so
 972 ``R_AMDGPU_ABS32`` is not used.
 973
 974 .. _amdgpu-dwarf:
 975
 976 DWARF
 977 -----
 978
 979 Standard DWARF [DWARF]_ Version 5 sections can be generated. These contain
 980 information that maps the code object executable code and data to the source
 981 language constructs. It can be used by tools such as debuggers and profilers.
 982
 983 Address Space Mapping
 984 ~~~~~~~~~~~~~~~~~~~~~
 985
 986 The following address space mapping is used:
 987
 988   .. table:: AMDGPU DWARF Address Space Mapping
 989      :name: amdgpu-dwarf-address-space-mapping-table
 990
 991      =================== =================
 992      DWARF Address Space Memory Space
 993      =================== =================
 994      1                   Private (Scratch)
 995      2                   Local (group/LDS)
 996      *omitted*           Global
 997      *omitted*           Constant
 998      *omitted*           Generic (Flat)
 999      *not supported*     Region (GDS)
1000      =================== =================
1001
1002 See :ref:`amdgpu-address-spaces` for information on the memory space terminology
1003 used in the table.
1004
1005 An ``address_class`` attribute is generated on pointer type DIEs to specify the
1006 DWARF address space of the value of the pointer when it is in the *private* or
1007 *local* address space. Otherwise the attribute is omitted.
1008
1009 An ``XDEREF`` operation is generated in location list expressions for variables
1010 that are allocated in the *private* and *local* address space. Otherwise no
1011 ``XDREF`` is omitted.
1012
1013 Register Mapping
1014 ~~~~~~~~~~~~~~~~
1015
1016 *This section is WIP.*
1017
1018 .. TODO
1019    Define DWARF register enumeration.
1020
1021    If want to present a wavefront state then should expose vector registers as
1022    64 wide (rather than per work-item view that LLVM uses). Either as separate
1023    registers, or a 64x4 byte single register. In either case use a new LANE op
1024    (akin to XDREF) to select the current lane usage in a location
1025    expression. This would also allow scalar register spilling to vector register
1026    lanes to be expressed (currently no debug information is being generated for
1027    spilling). If choose a wide single register approach then use LANE in
1028    conjunction with PIECE operation to select the dword part of the register for
1029    the current lane. If the separate register approach then use LANE to select
1030    the register.
1031
1032 Source Text
1033 ~~~~~~~~~~~
1034
1035 Source text for online-compiled programs (e.g. those compiled by the OpenCL
1036 runtime) may be embedded into the DWARF v5 line table using the ``clang
1037 -gembed-source`` option, described in table :ref:`amdgpu-debug-options`.
1038
1039 For example:
1040
1041 ``-gembed-source``
1042   Enable the embedded source DWARF v5 extension.
1043 ``-gno-embed-source``
1044   Disable the embedded source DWARF v5 extension.
1045
1046   .. table:: AMDGPU Debug Options
1047      :name: amdgpu-debug-options
1048
1049      ==================== ==================================================
1050      Debug Flag           Description
1051      ==================== ==================================================
1052      -g[no-]embed-source  Enable/disable embedding source text in DWARF
1053                           debug sections. Useful for environments where
1054                           source cannot be written to disk, such as
1055                           when performing online compilation.
1056      ==================== ==================================================
1057
1058 This option enables one extended content types in the DWARF v5 Line Number
1059 Program Header, which is used to encode embedded source.
1060
1061   .. table:: AMDGPU DWARF Line Number Program Header Extended Content Types
1062      :name: amdgpu-dwarf-extended-content-types
1063
1064      ============================  ======================
1065      Content Type                  Form
1066      ============================  ======================
1067      ``DW_LNCT_LLVM_source``       ``DW_FORM_line_strp``
1068      ============================  ======================
1069
1070 The source field will contain the UTF-8 encoded, null-terminated source text
1071 with ``'\n'`` line endings. When the source field is present, consumers can use
1072 the embedded source instead of attempting to discover the source on disk. When
1073 the source field is absent, consumers can access the file to get the source
1074 text.
1075
1076 The above content type appears in the ``file_name_entry_format`` field of the
1077 line table prologue, and its corresponding value appear in the ``file_names``
1078 field. The current encoding of the content type is documented in table
1079 :ref:`amdgpu-dwarf-extended-content-types-encoding`
1080
1081   .. table:: AMDGPU DWARF Line Number Program Header Extended Content Types Encoding
1082      :name: amdgpu-dwarf-extended-content-types-encoding
1083
1084      ============================  ====================
1085      Content Type                  Value
1086      ============================  ====================
1087      ``DW_LNCT_LLVM_source``       0x2001
1088      ============================  ====================
1089
1090 .. _amdgpu-code-conventions:
1091
1092 Code Conventions
1093 ================
1094
1095 This section provides code conventions used for each supported target triple OS
1096 (see :ref:`amdgpu-target-triples`).
1097
1098 AMDHSA
1099 ------
1100
1101 This section provides code conventions used when the target triple OS is
1102 ``amdhsa`` (see :ref:`amdgpu-target-triples`).
1103
1104 .. _amdgpu-amdhsa-code-object-target-identification:
1105
1106 Code Object Target Identification
1107 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1108
1109 The AMDHSA OS uses the following syntax to specify the code object
1110 target as a single string:
1111
1112   ``<Architecture>-<Vendor>-<OS>-<Environment>-<Processor><Target Features>``
1113
1114 Where:
1115
1116   - ``<Architecture>``, ``<Vendor>``, ``<OS>`` and ``<Environment>``
1117     are the same as the *Target Triple* (see
1118     :ref:`amdgpu-target-triples`).
1119
1120   - ``<Processor>`` is the same as the *Processor* (see
1121     :ref:`amdgpu-processors`).
1122
1123   - ``<Target Features>`` is a list of the enabled *Target Features*
1124     (see :ref:`amdgpu-target-features`), each prefixed by a plus, that
1125     apply to *Processor*. The list must be in the same order as listed
1126     in the table :ref:`amdgpu-target-feature-table`. Note that *Target
1127     Features* must be included in the list if they are enabled even if
1128     that is the default for *Processor*.
1129
1130 For example:
1131
1132   ``"amdgcn-amd-amdhsa--gfx902+xnack"``
1133
1134 .. _amdgpu-amdhsa-code-object-metadata:
1135
1136 Code Object Metadata
1137 ~~~~~~~~~~~~~~~~~~~~
1138
1139 The code object metadata specifies extensible metadata associated with the code
1140 objects executed on HSA [HSA]_ compatible runtimes such as AMD's ROCm
1141 [AMD-ROCm]_. The encoding and semantics of this metadata depends on the code
1142 object version; see :ref:`amdgpu-amdhsa-code-object-metadata-v2` and
1143 :ref:`amdgpu-amdhsa-code-object-metadata-v3`.
1144
1145 Code object metadata is specified in a note record (see
1146 :ref:`amdgpu-note-records`) and is required when the target triple OS is
1147 ``amdhsa`` (see :ref:`amdgpu-target-triples`). It must contain the minimum
1148 information necessary to support the ROCM kernel queries. For example, the
1149 segment sizes needed in a dispatch packet. In addition, a high level language
1150 runtime may require other information to be included. For example, the AMD
1151 OpenCL runtime records kernel argument information.
1152
1153 .. _amdgpu-amdhsa-code-object-metadata-v2:
1154
1155 Code Object V2 Metadata (-mattr=-code-object-v3)
1156 ++++++++++++++++++++++++++++++++++++++++++++++++
1157
1158 .. warning:: Code Object V2 is not the default code object version emitted by
1159   this version of LLVM. For a description of the metadata generated with the
1160   default configuration (Code Object V3) see
1161   :ref:`amdgpu-amdhsa-code-object-metadata-v3`.
1162
1163 Code object V2 metadata is specified by the ``NT_AMD_AMDGPU_METADATA`` note
1164 record (see :ref:`amdgpu-note-records-v2`).
1165
1166 The metadata is specified as a YAML formatted string (see [YAML]_ and
1167 :doc:`YamlIO`).
1168
1169 .. TODO
1170    Is the string null terminated? It probably should not if YAML allows it to
1171    contain null characters, otherwise it should be.
1172
1173 The metadata is represented as a single YAML document comprised of the mapping
1174 defined in table :ref:`amdgpu-amdhsa-code-object-metadata-map-table-v2` and
1175 referenced tables.
1176
1177 For boolean values, the string values of ``false`` and ``true`` are used for
1178 false and true respectively.
1179
1180 Additional information can be added to the mappings. To avoid conflicts, any
1181 non-AMD key names should be prefixed by "*vendor-name*.".
1182
1183   .. table:: AMDHSA Code Object V2 Metadata Map
1184      :name: amdgpu-amdhsa-code-object-metadata-map-table-v2
1185
1186      ========== ============== ========= =======================================
1187      String Key Value Type     Required? Description
1188      ========== ============== ========= =======================================
1189      "Version"  sequence of    Required  - The first integer is the major
1190                 2 integers                 version. Currently 1.
1191                                          - The second integer is the minor
1192                                            version. Currently 0.
1193      "Printf"   sequence of              Each string is encoded information
1194                 strings                  about a printf function call. The
1195                                          encoded information is organized as
1196                                          fields separated by colon (':'):
1197
1198                                          ``ID:N:S[0]:S[1]:...:S[N-1]:FormatString``
1199
1200                                          where:
1201
1202                                          ``ID``
1203                                            A 32 bit integer as a unique id for
1204                                            each printf function call
1205
1206                                          ``N``
1207                                            A 32 bit integer equal to the number
1208                                            of arguments of printf function call
1209                                            minus 1
1210
1211                                          ``S[i]`` (where i = 0, 1, ... , N-1)
1212                                            32 bit integers for the size in bytes
1213                                            of the i-th FormatString argument of
1214                                            the printf function call
1215
1216                                          FormatString
1217                                            The format string passed to the
1218                                            printf function call.
1219      "Kernels"  sequence of    Required  Sequence of the mappings for each
1220                 mapping                  kernel in the code object. See
1221                                          :ref:`amdgpu-amdhsa-code-object-kernel-metadata-map-table-v2`
1222                                          for the definition of the mapping.
1223      ========== ============== ========= =======================================
1224
1225 ..
1226
1227   .. table:: AMDHSA Code Object V2 Kernel Metadata Map
1228      :name: amdgpu-amdhsa-code-object-kernel-metadata-map-table-v2
1229
1230      ================= ============== ========= ================================
1231      String Key        Value Type     Required? Description
1232      ================= ============== ========= ================================
1233      "Name"            string         Required  Source name of the kernel.
1234      "SymbolName"      string         Required  Name of the kernel
1235                                                 descriptor ELF symbol.
1236      "Language"        string                   Source language of the kernel.
1237                                                 Values include:
1238
1239                                                 - "OpenCL C"
1240                                                 - "OpenCL C++"
1241                                                 - "HCC"
1242                                                 - "OpenMP"
1243
1244      "LanguageVersion" sequence of              - The first integer is the major
1245                        2 integers                 version.
1246                                                 - The second integer is the
1247                                                   minor version.
1248      "Attrs"           mapping                  Mapping of kernel attributes.
1249                                                 See
1250                                                 :ref:`amdgpu-amdhsa-code-object-kernel-attribute-metadata-map-table-v2`
1251                                                 for the mapping definition.
1252      "Args"            sequence of              Sequence of mappings of the
1253                        mapping                  kernel arguments. See
1254                                                 :ref:`amdgpu-amdhsa-code-object-kernel-argument-metadata-map-table-v2`
1255                                                 for the definition of the mapping.
1256      "CodeProps"       mapping                  Mapping of properties related to
1257                                                 the kernel code. See
1258                                                 :ref:`amdgpu-amdhsa-code-object-kernel-code-properties-metadata-map-table-v2`
1259                                                 for the mapping definition.
1260      ================= ============== ========= ================================
1261
1262 ..
1263
1264   .. table:: AMDHSA Code Object V2 Kernel Attribute Metadata Map
1265      :name: amdgpu-amdhsa-code-object-kernel-attribute-metadata-map-table-v2
1266
1267      =================== ============== ========= ==============================
1268      String Key          Value Type     Required? Description
1269      =================== ============== ========= ==============================
1270      "ReqdWorkGroupSize" sequence of              If not 0, 0, 0 then all values
1271                          3 integers               must be >=1 and the dispatch
1272                                                   work-group size X, Y, Z must
1273                                                   correspond to the specified
1274                                                   values. Defaults to 0, 0, 0.
1275
1276                                                   Corresponds to the OpenCL
1277                                                   ``reqd_work_group_size``
1278                                                   attribute.
1279      "WorkGroupSizeHint" sequence of              The dispatch work-group size
1280                          3 integers               X, Y, Z is likely to be the
1281                                                   specified values.
1282
1283                                                   Corresponds to the OpenCL
1284                                                   ``work_group_size_hint``
1285                                                   attribute.
1286      "VecTypeHint"       string                   The name of a scalar or vector
1287                                                   type.
1288
1289                                                   Corresponds to the OpenCL
1290                                                   ``vec_type_hint`` attribute.
1291
1292      "RuntimeHandle"     string                   The external symbol name
1293                                                   associated with a kernel.
1294                                                   OpenCL runtime allocates a
1295                                                   global buffer for the symbol
1296                                                   and saves the kernel's address
1297                                                   to it, which is used for
1298                                                   device side enqueueing. Only
1299                                                   available for device side
1300                                                   enqueued kernels.
1301      =================== ============== ========= ==============================
1302
1303 ..
1304
1305   .. table:: AMDHSA Code Object V2 Kernel Argument Metadata Map
1306      :name: amdgpu-amdhsa-code-object-kernel-argument-metadata-map-table-v2
1307
1308      ================= ============== ========= ================================
1309      String Key        Value Type     Required? Description
1310      ================= ============== ========= ================================
1311      "Name"            string                   Kernel argument name.
1312      "TypeName"        string                   Kernel argument type name.
1313      "Size"            integer        Required  Kernel argument size in bytes.
1314      "Align"           integer        Required  Kernel argument alignment in
1315                                                 bytes. Must be a power of two.
1316      "ValueKind"       string         Required  Kernel argument kind that
1317                                                 specifies how to set up the
1318                                                 corresponding argument.
1319                                                 Values include:
1320
1321                                                 "ByValue"
1322                                                   The argument is copied
1323                                                   directly into the kernarg.
1324
1325                                                 "GlobalBuffer"
1326                                                   A global address space pointer
1327                                                   to the buffer data is passed
1328                                                   in the kernarg.
1329
1330                                                 "DynamicSharedPointer"
1331                                                   A group address space pointer
1332                                                   to dynamically allocated LDS
1333                                                   is passed in the kernarg.
1334
1335                                                 "Sampler"
1336                                                   A global address space
1337                                                   pointer to a S# is passed in
1338                                                   the kernarg.
1339
1340                                                 "Image"
1341                                                   A global address space
1342                                                   pointer to a T# is passed in
1343                                                   the kernarg.
1344
1345                                                 "Pipe"
1346                                                   A global address space pointer
1347                                                   to an OpenCL pipe is passed in
1348                                                   the kernarg.
1349
1350                                                 "Queue"
1351                                                   A global address space pointer
1352                                                   to an OpenCL device enqueue
1353                                                   queue is passed in the
1354                                                   kernarg.
1355
1356                                                 "HiddenGlobalOffsetX"
1357                                                   The OpenCL grid dispatch
1358                                                   global offset for the X
1359                                                   dimension is passed in the
1360                                                   kernarg.
1361
1362                                                 "HiddenGlobalOffsetY"
1363                                                   The OpenCL grid dispatch
1364                                                   global offset for the Y
1365                                                   dimension is passed in the
1366                                                   kernarg.
1367
1368                                                 "HiddenGlobalOffsetZ"
1369                                                   The OpenCL grid dispatch
1370                                                   global offset for the Z
1371                                                   dimension is passed in the
1372                                                   kernarg.
1373
1374                                                 "HiddenNone"
1375                                                   An argument that is not used
1376                                                   by the kernel. Space needs to
1377                                                   be left for it, but it does
1378                                                   not need to be set up.
1379
1380                                                 "HiddenPrintfBuffer"
1381                                                   A global address space pointer
1382                                                   to the runtime printf buffer
1383                                                   is passed in kernarg.
1384
1385                                                 "HiddenDefaultQueue"
1386                                                   A global address space pointer
1387                                                   to the OpenCL device enqueue
1388                                                   queue that should be used by
1389                                                   the kernel by default is
1390                                                   passed in the kernarg.
1391
1392                                                 "HiddenCompletionAction"
1393                                                   A global address space pointer
1394                                                   to help link enqueued kernels into
1395                                                   the ancestor tree for determining
1396                                                   when the parent kernel has finished.
1397
1398                                                 "HiddenMultiGridSyncArg"
1399                                                   A global address space pointer for
1400                                                   multi-grid synchronization is
1401                                                   passed in the kernarg.
1402
1403      "ValueType"       string         Required  Kernel argument value type. Only
1404                                                 present if "ValueKind" is
1405                                                 "ByValue". For vector data
1406                                                 types, the value is for the
1407                                                 element type. Values include:
1408
1409                                                 - "Struct"
1410                                                 - "I8"
1411                                                 - "U8"
1412                                                 - "I16"
1413                                                 - "U16"
1414                                                 - "F16"
1415                                                 - "I32"
1416                                                 - "U32"
1417                                                 - "F32"
1418                                                 - "I64"
1419                                                 - "U64"
1420                                                 - "F64"
1421
1422                                                 .. TODO
1423                                                    How can it be determined if a
1424                                                    vector type, and what size
1425                                                    vector?
1426      "PointeeAlign"    integer                  Alignment in bytes of pointee
1427                                                 type for pointer type kernel
1428                                                 argument. Must be a power
1429                                                 of 2. Only present if
1430                                                 "ValueKind" is
1431                                                 "DynamicSharedPointer".
1432      "AddrSpaceQual"   string                   Kernel argument address space
1433                                                 qualifier. Only present if
1434                                                 "ValueKind" is "GlobalBuffer" or
1435                                                 "DynamicSharedPointer". Values
1436                                                 are:
1437
1438                                                 - "Private"
1439                                                 - "Global"
1440                                                 - "Constant"
1441                                                 - "Local"
1442                                                 - "Generic"
1443                                                 - "Region"
1444
1445                                                 .. TODO
1446                                                    Is GlobalBuffer only Global
1447                                                    or Constant? Is
1448                                                    DynamicSharedPointer always
1449                                                    Local? Can HCC allow Generic?
1450                                                    How can Private or Region
1451                                                    ever happen?
1452      "AccQual"         string                   Kernel argument access
1453                                                 qualifier. Only present if
1454                                                 "ValueKind" is "Image" or
1455                                                 "Pipe". Values
1456                                                 are:
1457
1458                                                 - "ReadOnly"
1459                                                 - "WriteOnly"
1460                                                 - "ReadWrite"
1461
1462                                                 .. TODO
1463                                                    Does this apply to
1464                                                    GlobalBuffer?
1465      "ActualAccQual"   string                   The actual memory accesses
1466                                                 performed by the kernel on the
1467                                                 kernel argument. Only present if
1468                                                 "ValueKind" is "GlobalBuffer",
1469                                                 "Image", or "Pipe". This may be
1470                                                 more restrictive than indicated
1471                                                 by "AccQual" to reflect what the
1472                                                 kernel actual does. If not
1473                                                 present then the runtime must
1474                                                 assume what is implied by
1475                                                 "AccQual" and "IsConst". Values
1476                                                 are:
1477
1478                                                 - "ReadOnly"
1479                                                 - "WriteOnly"
1480                                                 - "ReadWrite"
1481
1482      "IsConst"         boolean                  Indicates if the kernel argument
1483                                                 is const qualified. Only present
1484                                                 if "ValueKind" is
1485                                                 "GlobalBuffer".
1486
1487      "IsRestrict"      boolean                  Indicates if the kernel argument
1488                                                 is restrict qualified. Only
1489                                                 present if "ValueKind" is
1490                                                 "GlobalBuffer".
1491
1492      "IsVolatile"      boolean                  Indicates if the kernel argument
1493                                                 is volatile qualified. Only
1494                                                 present if "ValueKind" is
1495                                                 "GlobalBuffer".
1496
1497      "IsPipe"          boolean                  Indicates if the kernel argument
1498                                                 is pipe qualified. Only present
1499                                                 if "ValueKind" is "Pipe".
1500
1501                                                 .. TODO
1502                                                    Can GlobalBuffer be pipe
1503                                                    qualified?
1504      ================= ============== ========= ================================
1505
1506 ..
1507
1508   .. table:: AMDHSA Code Object V2 Kernel Code Properties Metadata Map
1509      :name: amdgpu-amdhsa-code-object-kernel-code-properties-metadata-map-table-v2
1510
1511      ============================ ============== ========= =====================
1512      String Key                   Value Type     Required? Description
1513      ============================ ============== ========= =====================
1514      "KernargSegmentSize"         integer        Required  The size in bytes of
1515                                                            the kernarg segment
1516                                                            that holds the values
1517                                                            of the arguments to
1518                                                            the kernel.
1519      "GroupSegmentFixedSize"      integer        Required  The amount of group
1520                                                            segment memory
1521                                                            required by a
1522                                                            work-group in
1523                                                            bytes. This does not
1524                                                            include any
1525                                                            dynamically allocated
1526                                                            group segment memory
1527                                                            that may be added
1528                                                            when the kernel is
1529                                                            dispatched.
1530      "PrivateSegmentFixedSize"    integer        Required  The amount of fixed
1531                                                            private address space
1532                                                            memory required for a
1533                                                            work-item in
1534                                                            bytes. If the kernel
1535                                                            uses a dynamic call
1536                                                            stack then additional
1537                                                            space must be added
1538                                                            to this value for the
1539                                                            call stack.
1540      "KernargSegmentAlign"        integer        Required  The maximum byte
1541                                                            alignment of
1542                                                            arguments in the
1543                                                            kernarg segment. Must
1544                                                            be a power of 2.
1545      "WavefrontSize"              integer        Required  Wavefront size. Must
1546                                                            be a power of 2.
1547      "NumSGPRs"                   integer        Required  Number of scalar
1548                                                            registers used by a
1549                                                            wavefront for
1550                                                            GFX6-GFX10. This
1551                                                            includes the special
1552                                                            SGPRs for VCC, Flat
1553                                                            Scratch (GFX7-GFX10)
1554                                                            and XNACK (for
1555                                                            GFX8-GFX10). It does
1556                                                            not include the 16
1557                                                            SGPR added if a trap
1558                                                            handler is
1559                                                            enabled. It is not
1560                                                            rounded up to the
1561                                                            allocation
1562                                                            granularity.
1563      "NumVGPRs"                   integer        Required  Number of vector
1564                                                            registers used by
1565                                                            each work-item for
1566                                                            GFX6-GFX10
1567      "MaxFlatWorkGroupSize"       integer        Required  Maximum flat
1568                                                            work-group size
1569                                                            supported by the
1570                                                            kernel in work-items.
1571                                                            Must be >=1 and
1572                                                            consistent with
1573                                                            ReqdWorkGroupSize if
1574                                                            not 0, 0, 0.
1575      "NumSpilledSGPRs"            integer                  Number of stores from
1576                                                            a scalar register to
1577                                                            a register allocator
1578                                                            created spill
1579                                                            location.
1580      "NumSpilledVGPRs"            integer                  Number of stores from
1581                                                            a vector register to
1582                                                            a register allocator
1583                                                            created spill
1584                                                            location.
1585      ============================ ============== ========= =====================
1586
1587 .. _amdgpu-amdhsa-code-object-metadata-v3:
1588
1589 Code Object V3 Metadata (-mattr=+code-object-v3)
1590 ++++++++++++++++++++++++++++++++++++++++++++++++
1591
1592 Code object V3 metadata is specified by the ``NT_AMDGPU_METADATA`` note record
1593 (see :ref:`amdgpu-note-records-v3`).
1594
1595 The metadata is represented as Message Pack formatted binary data (see
1596 [MsgPack]_). The top level is a Message Pack map that includes the
1597 keys defined in table
1598 :ref:`amdgpu-amdhsa-code-object-metadata-map-table-v3` and referenced
1599 tables.
1600
1601 Additional information can be added to the maps. To avoid conflicts,
1602 any key names should be prefixed by "*vendor-name*." where
1603 ``vendor-name`` can be the the name of the vendor and specific vendor
1604 tool that generates the information. The prefix is abbreviated to
1605 simply "." when it appears within a map that has been added by the
1606 same *vendor-name*.
1607
1608   .. table:: AMDHSA Code Object V3 Metadata Map
1609      :name: amdgpu-amdhsa-code-object-metadata-map-table-v3
1610
1611      ================= ============== ========= =======================================
1612      String Key        Value Type     Required? Description
1613      ================= ============== ========= =======================================
1614      "amdhsa.version"  sequence of    Required  - The first integer is the major
1615                        2 integers                 version. Currently 1.
1616                                                 - The second integer is the minor
1617                                                   version. Currently 0.
1618      "amdhsa.printf"   sequence of              Each string is encoded information
1619                        strings                  about a printf function call. The
1620                                                 encoded information is organized as
1621                                                 fields separated by colon (':'):
1622
1623                                                 ``ID:N:S[0]:S[1]:...:S[N-1]:FormatString``
1624
1625                                                 where:
1626
1627                                                 ``ID``
1628                                                   A 32 bit integer as a unique id for
1629                                                   each printf function call
1630
1631                                                 ``N``
1632                                                   A 32 bit integer equal to the number
1633                                                   of arguments of printf function call
1634                                                   minus 1
1635
1636                                                 ``S[i]`` (where i = 0, 1, ... , N-1)
1637                                                   32 bit integers for the size in bytes
1638                                                   of the i-th FormatString argument of
1639                                                   the printf function call
1640
1641                                                 FormatString
1642                                                   The format string passed to the
1643                                                   printf function call.
1644      "amdhsa.kernels"  sequence of    Required  Sequence of the maps for each
1645                        map                      kernel in the code object. See
1646                                                 :ref:`amdgpu-amdhsa-code-object-kernel-metadata-map-table-v3`
1647                                                 for the definition of the keys included
1648                                                 in that map.
1649      ================= ============== ========= =======================================
1650
1651 ..
1652
1653   .. table:: AMDHSA Code Object V3 Kernel Metadata Map
1654      :name: amdgpu-amdhsa-code-object-kernel-metadata-map-table-v3
1655
1656      =================================== ============== ========= ================================
1657      String Key                          Value Type     Required? Description
1658      =================================== ============== ========= ================================
1659      ".name"                             string         Required  Source name of the kernel.
1660      ".symbol"                           string         Required  Name of the kernel
1661                                                                   descriptor ELF symbol.
1662      ".language"                         string                   Source language of the kernel.
1663                                                                   Values include:
1664
1665                                                                   - "OpenCL C"
1666                                                                   - "OpenCL C++"
1667                                                                   - "HCC"
1668                                                                   - "HIP"
1669                                                                   - "OpenMP"
1670                                                                   - "Assembler"
1671
1672      ".language_version"                 sequence of              - The first integer is the major
1673                                          2 integers                 version.
1674                                                                   - The second integer is the
1675                                                                     minor version.
1676      ".args"                             sequence of              Sequence of maps of the
1677                                          map                      kernel arguments. See
1678                                                                   :ref:`amdgpu-amdhsa-code-object-kernel-argument-metadata-map-table-v3`
1679                                                                   for the definition of the keys
1680                                                                   included in that map.
1681      ".reqd_workgroup_size"              sequence of              If not 0, 0, 0 then all values
1682                                          3 integers               must be >=1 and the dispatch
1683                                                                   work-group size X, Y, Z must
1684                                                                   correspond to the specified
1685                                                                   values. Defaults to 0, 0, 0.
1686
1687                                                                   Corresponds to the OpenCL
1688                                                                   ``reqd_work_group_size``
1689                                                                   attribute.
1690      ".workgroup_size_hint"              sequence of              The dispatch work-group size
1691                                          3 integers               X, Y, Z is likely to be the
1692                                                                   specified values.
1693
1694                                                                   Corresponds to the OpenCL
1695                                                                   ``work_group_size_hint``
1696                                                                   attribute.
1697      ".vec_type_hint"                    string                   The name of a scalar or vector
1698                                                                   type.
1699
1700                                                                   Corresponds to the OpenCL
1701                                                                   ``vec_type_hint`` attribute.
1702
1703      ".device_enqueue_symbol"            string                   The external symbol name
1704                                                                   associated with a kernel.
1705                                                                   OpenCL runtime allocates a
1706                                                                   global buffer for the symbol
1707                                                                   and saves the kernel's address
1708                                                                   to it, which is used for
1709                                                                   device side enqueueing. Only
1710                                                                   available for device side
1711                                                                   enqueued kernels.
1712      ".kernarg_segment_size"             integer        Required  The size in bytes of
1713                                                                   the kernarg segment
1714                                                                   that holds the values
1715                                                                   of the arguments to
1716                                                                   the kernel.
1717      ".group_segment_fixed_size"         integer        Required  The amount of group
1718                                                                   segment memory
1719                                                                   required by a
1720                                                                   work-group in
1721                                                                   bytes. This does not
1722                                                                   include any
1723                                                                   dynamically allocated
1724                                                                   group segment memory
1725                                                                   that may be added
1726                                                                   when the kernel is
1727                                                                   dispatched.
1728      ".private_segment_fixed_size"       integer        Required  The amount of fixed
1729                                                                   private address space
1730                                                                   memory required for a
1731                                                                   work-item in
1732                                                                   bytes. If the kernel
1733                                                                   uses a dynamic call
1734                                                                   stack then additional
1735                                                                   space must be added
1736                                                                   to this value for the
1737                                                                   call stack.
1738      ".kernarg_segment_align"            integer        Required  The maximum byte
1739                                                                   alignment of
1740                                                                   arguments in the
1741                                                                   kernarg segment. Must
1742                                                                   be a power of 2.
1743      ".wavefront_size"                   integer        Required  Wavefront size. Must
1744                                                                   be a power of 2.
1745      ".sgpr_count"                       integer        Required  Number of scalar
1746                                                                   registers required by a
1747                                                                   wavefront for
1748                                                                   GFX6-GFX9. A register
1749                                                                   is required if it is
1750                                                                   used explicitly, or
1751                                                                   if a higher numbered
1752                                                                   register is used
1753                                                                   explicitly. This
1754                                                                   includes the special
1755                                                                   SGPRs for VCC, Flat
1756                                                                   Scratch (GFX7-GFX9)
1757                                                                   and XNACK (for
1758                                                                   GFX8-GFX9). It does
1759                                                                   not include the 16
1760                                                                   SGPR added if a trap
1761                                                                   handler is
1762                                                                   enabled. It is not
1763                                                                   rounded up to the
1764                                                                   allocation
1765                                                                   granularity.
1766      ".vgpr_count"                       integer        Required  Number of vector
1767                                                                   registers required by
1768                                                                   each work-item for
1769                                                                   GFX6-GFX9. A register
1770                                                                   is required if it is
1771                                                                   used explicitly, or
1772                                                                   if a higher numbered
1773                                                                   register is used
1774                                                                   explicitly.
1775      ".max_flat_workgroup_size"          integer        Required  Maximum flat
1776                                                                   work-group size
1777                                                                   supported by the
1778                                                                   kernel in work-items.
1779                                                                   Must be >=1 and
1780                                                                   consistent with
1781                                                                   ReqdWorkGroupSize if
1782                                                                   not 0, 0, 0.
1783      ".sgpr_spill_count"                 integer                  Number of stores from
1784                                                                   a scalar register to
1785                                                                   a register allocator
1786                                                                   created spill
1787                                                                   location.
1788      ".vgpr_spill_count"                 integer                  Number of stores from
1789                                                                   a vector register to
1790                                                                   a register allocator
1791                                                                   created spill
1792                                                                   location.
1793      =================================== ============== ========= ================================
1794
1795 ..
1796
1797   .. table:: AMDHSA Code Object V3 Kernel Argument Metadata Map
1798      :name: amdgpu-amdhsa-code-object-kernel-argument-metadata-map-table-v3
1799
1800      ====================== ============== ========= ================================
1801      String Key             Value Type     Required? Description
1802      ====================== ============== ========= ================================
1803      ".name"                string                   Kernel argument name.
1804      ".type_name"           string                   Kernel argument type name.
1805      ".size"                integer        Required  Kernel argument size in bytes.
1806      ".offset"              integer        Required  Kernel argument offset in
1807                                                      bytes. The offset must be a
1808                                                      multiple of the alignment
1809                                                      required by the argument.
1810      ".value_kind"          string         Required  Kernel argument kind that
1811                                                      specifies how to set up the
1812                                                      corresponding argument.
1813                                                      Values include:
1814
1815                                                      "by_value"
1816                                                        The argument is copied
1817                                                        directly into the kernarg.
1818
1819                                                      "global_buffer"
1820                                                        A global address space pointer
1821                                                        to the buffer data is passed
1822                                                        in the kernarg.
1823
1824                                                      "dynamic_shared_pointer"
1825                                                        A group address space pointer
1826                                                        to dynamically allocated LDS
1827                                                        is passed in the kernarg.
1828
1829                                                      "sampler"
1830                                                        A global address space
1831                                                        pointer to a S# is passed in
1832                                                        the kernarg.
1833
1834                                                      "image"
1835                                                        A global address space
1836                                                        pointer to a T# is passed in
1837                                                        the kernarg.
1838
1839                                                      "pipe"
1840                                                        A global address space pointer
1841                                                        to an OpenCL pipe is passed in
1842                                                        the kernarg.
1843
1844                                                      "queue"
1845                                                        A global address space pointer
1846                                                        to an OpenCL device enqueue
1847                                                        queue is passed in the
1848                                                        kernarg.
1849
1850                                                      "hidden_global_offset_x"
1851                                                        The OpenCL grid dispatch
1852                                                        global offset for the X
1853                                                        dimension is passed in the
1854                                                        kernarg.
1855
1856                                                      "hidden_global_offset_y"
1857                                                        The OpenCL grid dispatch
1858                                                        global offset for the Y
1859                                                        dimension is passed in the
1860                                                        kernarg.
1861
1862                                                      "hidden_global_offset_z"
1863                                                        The OpenCL grid dispatch
1864                                                        global offset for the Z
1865                                                        dimension is passed in the
1866                                                        kernarg.
1867
1868                                                      "hidden_none"
1869                                                        An argument that is not used
1870                                                        by the kernel. Space needs to
1871                                                        be left for it, but it does
1872                                                        not need to be set up.
1873
1874                                                      "hidden_printf_buffer"
1875                                                        A global address space pointer
1876                                                        to the runtime printf buffer
1877                                                        is passed in kernarg.
1878
1879                                                      "hidden_default_queue"
1880                                                        A global address space pointer
1881                                                        to the OpenCL device enqueue
1882                                                        queue that should be used by
1883                                                        the kernel by default is
1884                                                        passed in the kernarg.
1885
1886                                                      "hidden_completion_action"
1887                                                        A global address space pointer
1888                                                        to help link enqueued kernels into
1889                                                        the ancestor tree for determining
1890                                                        when the parent kernel has finished.
1891
1892                                                      "hidden_multigrid_sync_arg"
1893                                                        A global address space pointer for
1894                                                        multi-grid synchronization is
1895                                                        passed in the kernarg.
1896
1897      ".value_type"          string         Required  Kernel argument value type. Only
1898                                                      present if ".value_kind" is
1899                                                      "by_value". For vector data
1900                                                      types, the value is for the
1901                                                      element type. Values include:
1902
1903                                                      - "struct"
1904                                                      - "i8"
1905                                                      - "u8"
1906                                                      - "i16"
1907                                                      - "u16"
1908                                                      - "f16"
1909                                                      - "i32"
1910                                                      - "u32"
1911                                                      - "f32"
1912                                                      - "i64"
1913                                                      - "u64"
1914                                                      - "f64"
1915
1916                                                      .. TODO
1917                                                         How can it be determined if a
1918                                                         vector type, and what size
1919                                                         vector?
1920      ".pointee_align"       integer                  Alignment in bytes of pointee
1921                                                      type for pointer type kernel
1922                                                      argument. Must be a power
1923                                                      of 2. Only present if
1924                                                      ".value_kind" is
1925                                                      "dynamic_shared_pointer".
1926      ".address_space"       string                   Kernel argument address space
1927                                                      qualifier. Only present if
1928                                                      ".value_kind" is "global_buffer" or
1929                                                      "dynamic_shared_pointer". Values
1930                                                      are:
1931
1932                                                      - "private"
1933                                                      - "global"
1934                                                      - "constant"
1935                                                      - "local"
1936                                                      - "generic"
1937                                                      - "region"
1938
1939                                                      .. TODO
1940                                                         Is "global_buffer" only "global"
1941                                                         or "constant"? Is
1942                                                         "dynamic_shared_pointer" always
1943                                                         "local"? Can HCC allow "generic"?
1944                                                         How can "private" or "region"
1945                                                         ever happen?
1946      ".access"              string                   Kernel argument access
1947                                                      qualifier. Only present if
1948                                                      ".value_kind" is "image" or
1949                                                      "pipe". Values
1950                                                      are:
1951
1952                                                      - "read_only"
1953                                                      - "write_only"
1954                                                      - "read_write"
1955
1956                                                      .. TODO
1957                                                         Does this apply to
1958                                                         "global_buffer"?
1959      ".actual_access"       string                   The actual memory accesses
1960                                                      performed by the kernel on the
1961                                                      kernel argument. Only present if
1962                                                      ".value_kind" is "global_buffer",
1963                                                      "image", or "pipe". This may be
1964                                                      more restrictive than indicated
1965                                                      by ".access" to reflect what the
1966                                                      kernel actual does. If not
1967                                                      present then the runtime must
1968                                                      assume what is implied by
1969                                                      ".access" and ".is_const"      . Values
1970                                                      are:
1971
1972                                                      - "read_only"
1973                                                      - "write_only"
1974                                                      - "read_write"
1975
1976      ".is_const"            boolean                  Indicates if the kernel argument
1977                                                      is const qualified. Only present
1978                                                      if ".value_kind" is
1979                                                      "global_buffer".
1980
1981      ".is_restrict"         boolean                  Indicates if the kernel argument
1982                                                      is restrict qualified. Only
1983                                                      present if ".value_kind" is
1984                                                      "global_buffer".
1985
1986      ".is_volatile"         boolean                  Indicates if the kernel argument
1987                                                      is volatile qualified. Only
1988                                                      present if ".value_kind" is
1989                                                      "global_buffer".
1990
1991      ".is_pipe"             boolean                  Indicates if the kernel argument
1992                                                      is pipe qualified. Only present
1993                                                      if ".value_kind" is "pipe".
1994
1995                                                      .. TODO
1996                                                         Can "global_buffer" be pipe
1997                                                         qualified?
1998      ====================== ============== ========= ================================
1999
2000 ..
2001
2002 Kernel Dispatch
2003 ~~~~~~~~~~~~~~~
2004
2005 The HSA architected queuing language (AQL) defines a user space memory interface
2006 that can be used to control the dispatch of kernels, in an agent independent
2007 way. An agent can have zero or more AQL queues created for it using the ROCm
2008 runtime, in which AQL packets (all of which are 64 bytes) can be placed. See the
2009 *HSA Platform System Architecture Specification* [HSA]_ for the AQL queue
2010 mechanics and packet layouts.
2011
2012 The packet processor of a kernel agent is responsible for detecting and
2013 dispatching HSA kernels from the AQL queues associated with it. For AMD GPUs the
2014 packet processor is implemented by the hardware command processor (CP),
2015 asynchronous dispatch controller (ADC) and shader processor input controller
2016 (SPI).
2017
2018 The ROCm runtime can be used to allocate an AQL queue object. It uses the kernel
2019 mode driver to initialize and register the AQL queue with CP.
2020
2021 To dispatch a kernel the following actions are performed. This can occur in the
2022 CPU host program, or from an HSA kernel executing on a GPU.
2023
2024 1. A pointer to an AQL queue for the kernel agent on which the kernel is to be
2025    executed is obtained.
2026 2. A pointer to the kernel descriptor (see
2027    :ref:`amdgpu-amdhsa-kernel-descriptor`) of the kernel to execute is
2028    obtained. It must be for a kernel that is contained in a code object that that
2029    was loaded by the ROCm runtime on the kernel agent with which the AQL queue is
2030    associated.
2031 3. Space is allocated for the kernel arguments using the ROCm runtime allocator
2032    for a memory region with the kernarg property for the kernel agent that will
2033    execute the kernel. It must be at least 16 byte aligned.
2034 4. Kernel argument values are assigned to the kernel argument memory
2035    allocation. The layout is defined in the *HSA Programmer's Language Reference*
2036    [HSA]_. For AMDGPU the kernel execution directly accesses the kernel argument
2037    memory in the same way constant memory is accessed. (Note that the HSA
2038    specification allows an implementation to copy the kernel argument contents to
2039    another location that is accessed by the kernel.)
2040 5. An AQL kernel dispatch packet is created on the AQL queue. The ROCm runtime
2041    api uses 64 bit atomic operations to reserve space in the AQL queue for the
2042    packet. The packet must be set up, and the final write must use an atomic
2043    store release to set the packet kind to ensure the packet contents are
2044    visible to the kernel agent. AQL defines a doorbell signal mechanism to
2045    notify the kernel agent that the AQL queue has been updated. These rules, and
2046    the layout of the AQL queue and kernel dispatch packet is defined in the *HSA
2047    System Architecture Specification* [HSA]_.
2048 6. A kernel dispatch packet includes information about the actual dispatch,
2049    such as grid and work-group size, together with information from the code
2050    object about the kernel, such as segment sizes. The ROCm runtime queries on
2051    the kernel symbol can be used to obtain the code object values which are
2052    recorded in the :ref:`amdgpu-amdhsa-code-object-metadata`.
2053 7. CP executes micro-code and is responsible for detecting and setting up the
2054    GPU to execute the wavefronts of a kernel dispatch.
2055 8. CP ensures that when the a wavefront starts executing the kernel machine
2056    code, the scalar general purpose registers (SGPR) and vector general purpose
2057    registers (VGPR) are set up as required by the machine code. The required
2058    setup is defined in the :ref:`amdgpu-amdhsa-kernel-descriptor`. The initial
2059    register state is defined in
2060    :ref:`amdgpu-amdhsa-initial-kernel-execution-state`.
2061 9. The prolog of the kernel machine code (see
2062    :ref:`amdgpu-amdhsa-kernel-prolog`) sets up the machine state as necessary
2063    before continuing executing the machine code that corresponds to the kernel.
2064 10. When the kernel dispatch has completed execution, CP signals the completion
2065     signal specified in the kernel dispatch packet if not 0.
2066
2067 .. _amdgpu-amdhsa-memory-spaces:
2068
2069 Memory Spaces
2070 ~~~~~~~~~~~~~
2071
2072 The memory space properties are:
2073
2074   .. table:: AMDHSA Memory Spaces
2075      :name: amdgpu-amdhsa-memory-spaces-table
2076
2077      ================= =========== ======== ======= ==================
2078      Memory Space Name HSA Segment Hardware Address NULL Value
2079                        Name        Name     Size
2080      ================= =========== ======== ======= ==================
2081      Private           private     scratch  32      0x00000000
2082      Local             group       LDS      32      0xFFFFFFFF
2083      Global            global      global   64      0x0000000000000000
2084      Constant          constant    *same as 64      0x0000000000000000
2085                                    global*
2086      Generic           flat        flat     64      0x0000000000000000
2087      Region            N/A         GDS      32      *not implemented
2088                                                     for AMDHSA*
2089      ================= =========== ======== ======= ==================
2090
2091 The global and constant memory spaces both use global virtual addresses, which
2092 are the same virtual address space used by the CPU. However, some virtual
2093 addresses may only be accessible to the CPU, some only accessible by the GPU,
2094 and some by both.
2095
2096 Using the constant memory space indicates that the data will not change during
2097 the execution of the kernel. This allows scalar read instructions to be
2098 used. The vector and scalar L1 caches are invalidated of volatile data before
2099 each kernel dispatch execution to allow constant memory to change values between
2100 kernel dispatches.
2101
2102 The local memory space uses the hardware Local Data Store (LDS) which is
2103 automatically allocated when the hardware creates work-groups of wavefronts, and
2104 freed when all the wavefronts of a work-group have terminated. The data store
2105 (DS) instructions can be used to access it.
2106
2107 The private memory space uses the hardware scratch memory support. If the kernel
2108 uses scratch, then the hardware allocates memory that is accessed using
2109 wavefront lane dword (4 byte) interleaving. The mapping used from private
2110 address to physical address is:
2111
2112   ``wavefront-scratch-base +
2113   (private-address * wavefront-size * 4) +
2114   (wavefront-lane-id * 4)``
2115
2116 There are different ways that the wavefront scratch base address is determined
2117 by a wavefront (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). This
2118 memory can be accessed in an interleaved manner using buffer instruction with
2119 the scratch buffer descriptor and per wavefront scratch offset, by the scratch
2120 instructions, or by flat instructions. If each lane of a wavefront accesses the
2121 same private address, the interleaving results in adjacent dwords being accessed
2122 and hence requires fewer cache lines to be fetched. Multi-dword access is not
2123 supported except by flat and scratch instructions in GFX9-GFX10.
2124
2125 The generic address space uses the hardware flat address support available in
2126 GFX7-GFX10. This uses two fixed ranges of virtual addresses (the private and
2127 local appertures), that are outside the range of addressible global memory, to
2128 map from a flat address to a private or local address.
2129
2130 FLAT instructions can take a flat address and access global, private (scratch)
2131 and group (LDS) memory depending in if the address is within one of the
2132 apperture ranges. Flat access to scratch requires hardware aperture setup and
2133 setup in the kernel prologue (see :ref:`amdgpu-amdhsa-flat-scratch`). Flat
2134 access to LDS requires hardware aperture setup and M0 (GFX7-GFX8) register setup
2135 (see :ref:`amdgpu-amdhsa-m0`).
2136
2137 To convert between a segment address and a flat address the base address of the
2138 appertures address can be used. For GFX7-GFX8 these are available in the
2139 :ref:`amdgpu-amdhsa-hsa-aql-queue` the address of which can be obtained with
2140 Queue Ptr SGPR (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). For
2141 GFX9-GFX10 the appature base addresses are directly available as inline constant
2142 registers ``SRC_SHARED_BASE/LIMIT`` and ``SRC_PRIVATE_BASE/LIMIT``. In 64 bit
2143 address mode the apperture sizes are 2^32 bytes and the base is aligned to 2^32
2144 which makes it easier to convert from flat to segment or segment to flat.
2145
2146 Image and Samplers
2147 ~~~~~~~~~~~~~~~~~~
2148
2149 Image and sample handles created by the ROCm runtime are 64 bit addresses of a
2150 hardware 32 byte V# and 48 byte S# object respectively. In order to support the
2151 HSA ``query_sampler`` operations two extra dwords are used to store the HSA BRIG
2152 enumeration values for the queries that are not trivially deducible from the S#
2153 representation.
2154
2155 HSA Signals
2156 ~~~~~~~~~~~
2157
2158 HSA signal handles created by the ROCm runtime are 64 bit addresses of a
2159 structure allocated in memory accessible from both the CPU and GPU. The
2160 structure is defined by the ROCm runtime and subject to change between releases
2161 (see [AMD-ROCm-github]_).
2162
2163 .. _amdgpu-amdhsa-hsa-aql-queue:
2164
2165 HSA AQL Queue
2166 ~~~~~~~~~~~~~
2167
2168 The HSA AQL queue structure is defined by the ROCm runtime and subject to change
2169 between releases (see [AMD-ROCm-github]_). For some processors it contains
2170 fields needed to implement certain language features such as the flat address
2171 aperture bases. It also contains fields used by CP such as managing the
2172 allocation of scratch memory.
2173
2174 .. _amdgpu-amdhsa-kernel-descriptor:
2175
2176 Kernel Descriptor
2177 ~~~~~~~~~~~~~~~~~
2178
2179 A kernel descriptor consists of the information needed by CP to initiate the
2180 execution of a kernel, including the entry point address of the machine code
2181 that implements the kernel.
2182
2183 Kernel Descriptor for GFX6-GFX10
2184 ++++++++++++++++++++++++++++++++
2185
2186 CP microcode requires the Kernel descriptor to be allocated on 64 byte
2187 alignment.
2188
2189   .. table:: Kernel Descriptor for GFX6-GFX10
2190      :name: amdgpu-amdhsa-kernel-descriptor-gfx6-gfx10-table
2191
2192      ======= ======= =============================== ============================
2193      Bits    Size    Field Name                      Description
2194      ======= ======= =============================== ============================
2195      31:0    4 bytes GROUP_SEGMENT_FIXED_SIZE        The amount of fixed local
2196                                                      address space memory
2197                                                      required for a work-group
2198                                                      in bytes. This does not
2199                                                      include any dynamically
2200                                                      allocated local address
2201                                                      space memory that may be
2202                                                      added when the kernel is
2203                                                      dispatched.
2204      63:32   4 bytes PRIVATE_SEGMENT_FIXED_SIZE      The amount of fixed
2205                                                      private address space
2206                                                      memory required for a
2207                                                      work-item in bytes. If
2208                                                      is_dynamic_callstack is 1
2209                                                      then additional space must
2210                                                      be added to this value for
2211                                                      the call stack.
2212      127:64  8 bytes                                 Reserved, must be 0.
2213      191:128 8 bytes KERNEL_CODE_ENTRY_BYTE_OFFSET   Byte offset (possibly
2214                                                      negative) from base
2215                                                      address of kernel
2216                                                      descriptor to kernel's
2217                                                      entry point instruction
2218                                                      which must be 256 byte
2219                                                      aligned.
2220      351:272 20                                      Reserved, must be 0.
2221              bytes
2222      383:352 4 bytes COMPUTE_PGM_RSRC3               GFX6-9
2223                                                        Reserved, must be 0.
2224                                                      GFX10
2225                                                        Compute Shader (CS)
2226                                                        program settings used by
2227                                                        CP to set up
2228                                                        ``COMPUTE_PGM_RSRC3``
2229                                                        configuration
2230                                                        register. See
2231                                                        :ref:`amdgpu-amdhsa-compute_pgm_rsrc3-gfx10-table`.
2232      415:384 4 bytes COMPUTE_PGM_RSRC1               Compute Shader (CS)
2233                                                      program settings used by
2234                                                      CP to set up
2235                                                      ``COMPUTE_PGM_RSRC1``
2236                                                      configuration
2237                                                      register. See
2238                                                      :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`.
2239      447:416 4 bytes COMPUTE_PGM_RSRC2               Compute Shader (CS)
2240                                                      program settings used by
2241                                                      CP to set up
2242                                                      ``COMPUTE_PGM_RSRC2``
2243                                                      configuration
2244                                                      register. See
2245                                                      :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`.
2246      448     1 bit   ENABLE_SGPR_PRIVATE_SEGMENT     Enable the setup of the
2247                      _BUFFER                         SGPR user data registers
2248                                                      (see
2249                                                      :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
2250
2251                                                      The total number of SGPR
2252                                                      user data registers
2253                                                      requested must not exceed
2254                                                      16 and match value in
2255                                                      ``compute_pgm_rsrc2.user_sgpr.user_sgpr_count``.
2256                                                      Any requests beyond 16
2257                                                      will be ignored.
2258      449     1 bit   ENABLE_SGPR_DISPATCH_PTR        *see above*
2259      450     1 bit   ENABLE_SGPR_QUEUE_PTR           *see above*
2260      451     1 bit   ENABLE_SGPR_KERNARG_SEGMENT_PTR *see above*
2261      452     1 bit   ENABLE_SGPR_DISPATCH_ID         *see above*
2262      453     1 bit   ENABLE_SGPR_FLAT_SCRATCH_INIT   *see above*
2263      454     1 bit   ENABLE_SGPR_PRIVATE_SEGMENT     *see above*
2264                      _SIZE
2265      457:455 3 bits                                  Reserved, must be 0.
2266      458     1 bit   ENABLE_WAVEFRONT_SIZE32         GFX6-9
2267                                                        Reserved, must be 0.
2268                                                      GFX10
2269                                                        - If 0 execute in
2270                                                          wavefront size 64 mode.
2271                                                        - If 1 execute in
2272                                                          native wavefront size
2273                                                          32 mode.
2274      463:459 5 bits                                  Reserved, must be 0.
2275      511:464 6 bytes                                 Reserved, must be 0.
2276      512     **Total size 64 bytes.**
2277      ======= ====================================================================
2278
2279 ..
2280
2281   .. table:: compute_pgm_rsrc1 for GFX6-GFX10
2282      :name: amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table
2283
2284      ======= ======= =============================== ===========================================================================
2285      Bits    Size    Field Name                      Description
2286      ======= ======= =============================== ===========================================================================
2287      5:0     6 bits  GRANULATED_WORKITEM_VGPR_COUNT  Number of vector register
2288                                                      blocks used by each work-item;
2289                                                      granularity is device
2290                                                      specific:
2291
2292                                                      GFX6-GFX9
2293                                                        - vgprs_used 0..256
2294                                                        - max(0, ceil(vgprs_used / 4) - 1)
2295                                                      GFX10 (wavefront size 64)
2296                                                        - max_vgpr 1..256
2297                                                        - max(0, ceil(vgprs_used / 4) - 1)
2298                                                      GFX10 (wavefront size 32)
2299                                                        - max_vgpr 1..256
2300                                                        - max(0, ceil(vgprs_used / 8) - 1)
2301
2302                                                      Where vgprs_used is defined
2303                                                      as the highest VGPR number
2304                                                      explicitly referenced plus
2305                                                      one.
2306
2307                                                      Used by CP to set up
2308                                                      ``COMPUTE_PGM_RSRC1.VGPRS``.
2309
2310                                                      The
2311                                                      :ref:`amdgpu-assembler`
2312                                                      calculates this
2313                                                      automatically for the
2314                                                      selected processor from
2315                                                      values provided to the
2316                                                      `.amdhsa_kernel` directive
2317                                                      by the
2318                                                      `.amdhsa_next_free_vgpr`
2319                                                      nested directive (see
2320                                                      :ref:`amdhsa-kernel-directives-table`).
2321      9:6     4 bits  GRANULATED_WAVEFRONT_SGPR_COUNT Number of scalar register
2322                                                      blocks used by a wavefront;
2323                                                      granularity is device
2324                                                      specific:
2325
2326                                                      GFX6-GFX8
2327                                                        - sgprs_used 0..112
2328                                                        - max(0, ceil(sgprs_used / 8) - 1)
2329                                                      GFX9
2330                                                        - sgprs_used 0..112
2331                                                        - 2 * max(0, ceil(sgprs_used / 16) - 1)
2332                                                      GFX10
2333                                                        Reserved, must be 0.
2334                                                        (128 SGPRs always
2335                                                        allocated.)
2336
2337                                                      Where sgprs_used is
2338                                                      defined as the highest
2339                                                      SGPR number explicitly
2340                                                      referenced plus one, plus
2341                                                      a target-specific number
2342                                                      of additional special
2343                                                      SGPRs for VCC,
2344                                                      FLAT_SCRATCH (GFX7+) and
2345                                                      XNACK_MASK (GFX8+), and
2346                                                      any additional
2347                                                      target-specific
2348                                                      limitations. It does not
2349                                                      include the 16 SGPRs added
2350                                                      if a trap handler is
2351                                                      enabled.
2352
2353                                                      The target-specific
2354                                                      limitations and special
2355                                                      SGPR layout are defined in
2356                                                      the hardware
2357                                                      documentation, which can
2358                                                      be found in the
2359                                                      :ref:`amdgpu-processors`
2360                                                      table.
2361
2362                                                      Used by CP to set up
2363                                                      ``COMPUTE_PGM_RSRC1.SGPRS``.
2364
2365                                                      The
2366                                                      :ref:`amdgpu-assembler`
2367                                                      calculates this
2368                                                      automatically for the
2369                                                      selected processor from
2370                                                      values provided to the
2371                                                      `.amdhsa_kernel` directive
2372                                                      by the
2373                                                      `.amdhsa_next_free_sgpr`
2374                                                      and `.amdhsa_reserve_*`
2375                                                      nested directives (see
2376                                                      :ref:`amdhsa-kernel-directives-table`).
2377      11:10   2 bits  PRIORITY                        Must be 0.
2378
2379                                                      Start executing wavefront
2380                                                      at the specified priority.
2381
2382                                                      CP is responsible for
2383                                                      filling in
2384                                                      ``COMPUTE_PGM_RSRC1.PRIORITY``.
2385      13:12   2 bits  FLOAT_ROUND_MODE_32             Wavefront starts execution
2386                                                      with specified rounding
2387                                                      mode for single (32
2388                                                      bit) floating point
2389                                                      precision floating point
2390                                                      operations.
2391
2392                                                      Floating point rounding
2393                                                      mode values are defined in
2394                                                      :ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`.
2395
2396                                                      Used by CP to set up
2397                                                      ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
2398      15:14   2 bits  FLOAT_ROUND_MODE_16_64          Wavefront starts execution
2399                                                      with specified rounding
2400                                                      denorm mode for half/double (16
2401                                                      and 64 bit) floating point
2402                                                      precision floating point
2403                                                      operations.
2404
2405                                                      Floating point rounding
2406                                                      mode values are defined in
2407                                                      :ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`.
2408
2409                                                      Used by CP to set up
2410                                                      ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
2411      17:16   2 bits  FLOAT_DENORM_MODE_32            Wavefront starts execution
2412                                                      with specified denorm mode
2413                                                      for single (32
2414                                                      bit)  floating point
2415                                                      precision floating point
2416                                                      operations.
2417
2418                                                      Floating point denorm mode
2419                                                      values are defined in
2420                                                      :ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`.
2421
2422                                                      Used by CP to set up
2423                                                      ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
2424      19:18   2 bits  FLOAT_DENORM_MODE_16_64         Wavefront starts execution
2425                                                      with specified denorm mode
2426                                                      for half/double (16
2427                                                      and 64 bit) floating point
2428                                                      precision floating point
2429                                                      operations.
2430
2431                                                      Floating point denorm mode
2432                                                      values are defined in
2433                                                      :ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`.
2434
2435                                                      Used by CP to set up
2436                                                      ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
2437      20      1 bit   PRIV                            Must be 0.
2438
2439                                                      Start executing wavefront
2440                                                      in privilege trap handler
2441                                                      mode.
2442
2443                                                      CP is responsible for
2444                                                      filling in
2445                                                      ``COMPUTE_PGM_RSRC1.PRIV``.
2446      21      1 bit   ENABLE_DX10_CLAMP               Wavefront starts execution
2447                                                      with DX10 clamp mode
2448                                                      enabled. Used by the vector
2449                                                      ALU to force DX10 style
2450                                                      treatment of NaN's (when
2451                                                      set, clamp NaN to zero,
2452                                                      otherwise pass NaN
2453                                                      through).
2454
2455                                                      Used by CP to set up
2456                                                      ``COMPUTE_PGM_RSRC1.DX10_CLAMP``.
2457      22      1 bit   DEBUG_MODE                      Must be 0.
2458
2459                                                      Start executing wavefront
2460                                                      in single step mode.
2461
2462                                                      CP is responsible for
2463                                                      filling in
2464                                                      ``COMPUTE_PGM_RSRC1.DEBUG_MODE``.
2465      23      1 bit   ENABLE_IEEE_MODE                Wavefront starts execution
2466                                                      with IEEE mode
2467                                                      enabled. Floating point
2468                                                      opcodes that support
2469                                                      exception flag gathering
2470                                                      will quiet and propagate
2471                                                      signaling-NaN inputs per
2472                                                      IEEE 754-2008. Min_dx10 and
2473                                                      max_dx10 become IEEE
2474                                                      754-2008 compliant due to
2475                                                      signaling-NaN propagation
2476                                                      and quieting.
2477
2478                                                      Used by CP to set up
2479                                                      ``COMPUTE_PGM_RSRC1.IEEE_MODE``.
2480      24      1 bit   BULKY                           Must be 0.
2481
2482                                                      Only one work-group allowed
2483                                                      to execute on a compute
2484                                                      unit.
2485
2486                                                      CP is responsible for
2487                                                      filling in
2488                                                      ``COMPUTE_PGM_RSRC1.BULKY``.
2489      25      1 bit   CDBG_USER                       Must be 0.
2490
2491                                                      Flag that can be used to
2492                                                      control debugging code.
2493
2494                                                      CP is responsible for
2495                                                      filling in
2496                                                      ``COMPUTE_PGM_RSRC1.CDBG_USER``.
2497      26      1 bit   FP16_OVFL                       GFX6-GFX8
2498                                                        Reserved, must be 0.
2499                                                      GFX9-GFX10
2500                                                        Wavefront starts execution
2501                                                        with specified fp16 overflow
2502                                                        mode.
2503
2504                                                        - If 0, fp16 overflow generates
2505                                                          +/-INF values.
2506                                                        - If 1, fp16 overflow that is the
2507                                                          result of an +/-INF input value
2508                                                          or divide by 0 produces a +/-INF,
2509                                                          otherwise clamps computed
2510                                                          overflow to +/-MAX_FP16 as
2511                                                          appropriate.
2512
2513                                                        Used by CP to set up
2514                                                        ``COMPUTE_PGM_RSRC1.FP16_OVFL``.
2515      28:27   2 bits                                  Reserved, must be 0.
2516      29      1 bit    WGP_MODE                       GFX6-GFX9
2517                                                        Reserved, must be 0.
2518                                                      GFX10
2519                                                        - If 0 execute work-groups in
2520                                                          CU wavefront execution mode.
2521                                                        - If 1 execute work-groups on
2522                                                          in WGP wavefront execution mode.
2523
2524                                                        See :ref:`amdgpu-amdhsa-memory-model`.
2525
2526                                                        Used by CP to set up
2527                                                        ``COMPUTE_PGM_RSRC1.WGP_MODE``.
2528      30      1 bit    MEM_ORDERED                    GFX6-9
2529                                                        Reserved, must be 0.
2530                                                      GFX10
2531                                                        Controls the behavior of the
2532                                                        waitcnt's vmcnt and vscnt
2533                                                        counters.
2534
2535                                                        - If 0 vmcnt reports completion
2536                                                          of load and atomic with return
2537                                                          out of order with sample
2538                                                          instructions, and the vscnt
2539                                                          reports the completion of
2540                                                          store and atomic without
2541                                                          return in order.
2542                                                        - If 1 vmcnt reports completion
2543                                                          of load, atomic with return
2544                                                          and sample instructions in
2545                                                          order, and the vscnt reports
2546                                                          the completion of store and
2547                                                          atomic without return in order.
2548
2549                                                        Used by CP to set up
2550                                                        ``COMPUTE_PGM_RSRC1.MEM_ORDERED``.
2551      31      1 bit    FWD_PROGRESS                   GFX6-9
2552                                                        Reserved, must be 0.
2553                                                      GFX10
2554                                                        - If 0 execute SIMD wavefronts
2555                                                          using oldest first policy.
2556                                                        - If 1 execute SIMD wavefronts to
2557                                                          ensure wavefronts will make some
2558                                                          forward progress.
2559
2560                                                        Used by CP to set up
2561                                                        ``COMPUTE_PGM_RSRC1.FWD_PROGRESS``.
2562      32      **Total size 4 bytes**
2563      ======= ===================================================================================================================
2564
2565 ..
2566
2567   .. table:: compute_pgm_rsrc2 for GFX6-GFX10
2568      :name: amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table
2569
2570      ======= ======= =============================== ===========================================================================
2571      Bits    Size    Field Name                      Description
2572      ======= ======= =============================== ===========================================================================
2573      0       1 bit   ENABLE_SGPR_PRIVATE_SEGMENT     Enable the setup of the
2574                      _WAVEFRONT_OFFSET               SGPR wavefront scratch offset
2575                                                      system register (see
2576                                                      :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
2577
2578                                                      Used by CP to set up
2579                                                      ``COMPUTE_PGM_RSRC2.SCRATCH_EN``.
2580      5:1     5 bits  USER_SGPR_COUNT                 The total number of SGPR
2581                                                      user data registers
2582                                                      requested. This number must
2583                                                      match the number of user
2584                                                      data registers enabled.
2585
2586                                                      Used by CP to set up
2587                                                      ``COMPUTE_PGM_RSRC2.USER_SGPR``.
2588      6       1 bit   ENABLE_TRAP_HANDLER             Must be 0.
2589
2590                                                      This bit represents
2591                                                      ``COMPUTE_PGM_RSRC2.TRAP_PRESENT``,
2592                                                      which is set by the CP if
2593                                                      the runtime has installed a
2594                                                      trap handler.
2595      7       1 bit   ENABLE_SGPR_WORKGROUP_ID_X      Enable the setup of the
2596                                                      system SGPR register for
2597                                                      the work-group id in the X
2598                                                      dimension (see
2599                                                      :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
2600
2601                                                      Used by CP to set up
2602                                                      ``COMPUTE_PGM_RSRC2.TGID_X_EN``.
2603      8       1 bit   ENABLE_SGPR_WORKGROUP_ID_Y      Enable the setup of the
2604                                                      system SGPR register for
2605                                                      the work-group id in the Y
2606                                                      dimension (see
2607                                                      :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
2608
2609                                                      Used by CP to set up
2610                                                      ``COMPUTE_PGM_RSRC2.TGID_Y_EN``.
2611      9       1 bit   ENABLE_SGPR_WORKGROUP_ID_Z      Enable the setup of the
2612                                                      system SGPR register for
2613                                                      the work-group id in the Z
2614                                                      dimension (see
2615                                                      :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
2616
2617                                                      Used by CP to set up
2618                                                      ``COMPUTE_PGM_RSRC2.TGID_Z_EN``.
2619      10      1 bit   ENABLE_SGPR_WORKGROUP_INFO      Enable the setup of the
2620                                                      system SGPR register for
2621                                                      work-group information (see
2622                                                      :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
2623
2624                                                      Used by CP to set up
2625                                                      ``COMPUTE_PGM_RSRC2.TGID_SIZE_EN``.
2626      12:11   2 bits  ENABLE_VGPR_WORKITEM_ID         Enable the setup of the
2627                                                      VGPR system registers used
2628                                                      for the work-item ID.
2629                                                      :ref:`amdgpu-amdhsa-system-vgpr-work-item-id-enumeration-values-table`
2630                                                      defines the values.
2631
2632                                                      Used by CP to set up
2633                                                      ``COMPUTE_PGM_RSRC2.TIDIG_CMP_CNT``.
2634      13      1 bit   ENABLE_EXCEPTION_ADDRESS_WATCH  Must be 0.
2635
2636                                                      Wavefront starts execution
2637                                                      with address watch
2638                                                      exceptions enabled which
2639                                                      are generated when L1 has
2640                                                      witnessed a thread access
2641                                                      an *address of
2642                                                      interest*.
2643
2644                                                      CP is responsible for
2645                                                      filling in the address
2646                                                      watch bit in
2647                                                      ``COMPUTE_PGM_RSRC2.EXCP_EN_MSB``
2648                                                      according to what the
2649                                                      runtime requests.
2650      14      1 bit   ENABLE_EXCEPTION_MEMORY         Must be 0.
2651
2652                                                      Wavefront starts execution
2653                                                      with memory violation
2654                                                      exceptions exceptions
2655                                                      enabled which are generated
2656                                                      when a memory violation has
2657                                                      occurred for this wavefront from
2658                                                      L1 or LDS
2659                                                      (write-to-read-only-memory,
2660                                                      mis-aligned atomic, LDS
2661                                                      address out of range,
2662                                                      illegal address, etc.).
2663
2664                                                      CP sets the memory
2665                                                      violation bit in
2666                                                      ``COMPUTE_PGM_RSRC2.EXCP_EN_MSB``
2667                                                      according to what the
2668                                                      runtime requests.
2669      23:15   9 bits  GRANULATED_LDS_SIZE             Must be 0.
2670
2671                                                      CP uses the rounded value
2672                                                      from the dispatch packet,
2673                                                      not this value, as the
2674                                                      dispatch may contain
2675                                                      dynamically allocated group
2676                                                      segment memory. CP writes
2677                                                      directly to
2678                                                      ``COMPUTE_PGM_RSRC2.LDS_SIZE``.
2679
2680                                                      Amount of group segment
2681                                                      (LDS) to allocate for each
2682                                                      work-group. Granularity is
2683                                                      device specific:
2684
2685                                                      GFX6:
2686                                                        roundup(lds-size / (64 * 4))
2687                                                      GFX7-GFX10:
2688                                                        roundup(lds-size / (128 * 4))
2689
2690      24      1 bit   ENABLE_EXCEPTION_IEEE_754_FP    Wavefront starts execution
2691                      _INVALID_OPERATION              with specified exceptions
2692                                                      enabled.
2693
2694                                                      Used by CP to set up
2695                                                      ``COMPUTE_PGM_RSRC2.EXCP_EN``
2696                                                      (set from bits 0..6).
2697
2698                                                      IEEE 754 FP Invalid
2699                                                      Operation
2700      25      1 bit   ENABLE_EXCEPTION_FP_DENORMAL    FP Denormal one or more
2701                      _SOURCE                         input operands is a
2702                                                      denormal number
2703      26      1 bit   ENABLE_EXCEPTION_IEEE_754_FP    IEEE 754 FP Division by
2704                      _DIVISION_BY_ZERO               Zero
2705      27      1 bit   ENABLE_EXCEPTION_IEEE_754_FP    IEEE 754 FP FP Overflow
2706                      _OVERFLOW
2707      28      1 bit   ENABLE_EXCEPTION_IEEE_754_FP    IEEE 754 FP Underflow
2708                      _UNDERFLOW
2709      29      1 bit   ENABLE_EXCEPTION_IEEE_754_FP    IEEE 754 FP Inexact
2710                      _INEXACT
2711      30      1 bit   ENABLE_EXCEPTION_INT_DIVIDE_BY  Integer Division by Zero
2712                      _ZERO                           (rcp_iflag_f32 instruction
2713                                                      only)
2714      31      1 bit                                   Reserved, must be 0.
2715      32      **Total size 4 bytes.**
2716      ======= ===================================================================================================================
2717
2718 ..
2719
2720   .. table:: compute_pgm_rsrc3 for GFX10
2721      :name: amdgpu-amdhsa-compute_pgm_rsrc3-gfx10-table
2722
2723      ======= ======= =============================== ===========================================================================
2724      Bits    Size    Field Name                      Description
2725      ======= ======= =============================== ===========================================================================
2726      3:0     4 bits  SHARED_VGPR_COUNT               Number of shared VGPRs for wavefront size 64. Granularity 8. Value 0-120.
2727                                                      compute_pgm_rsrc1.vgprs + shared_vgpr_cnt cannot exceed 64.
2728      31:4    28                                      Reserved, must be 0.
2729              bits
2730      32      **Total size 4 bytes.**
2731      ======= ===================================================================================================================
2732
2733 ..
2734
2735   .. table:: Floating Point Rounding Mode Enumeration Values
2736      :name: amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table
2737
2738      ====================================== ===== ==============================
2739      Enumeration Name                       Value Description
2740      ====================================== ===== ==============================
2741      FLOAT_ROUND_MODE_NEAR_EVEN             0     Round Ties To Even
2742      FLOAT_ROUND_MODE_PLUS_INFINITY         1     Round Toward +infinity
2743      FLOAT_ROUND_MODE_MINUS_INFINITY        2     Round Toward -infinity
2744      FLOAT_ROUND_MODE_ZERO                  3     Round Toward 0
2745      ====================================== ===== ==============================
2746
2747 ..
2748
2749   .. table:: Floating Point Denorm Mode Enumeration Values
2750      :name: amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table
2751
2752      ====================================== ===== ==============================
2753      Enumeration Name                       Value Description
2754      ====================================== ===== ==============================
2755      FLOAT_DENORM_MODE_FLUSH_SRC_DST        0     Flush Source and Destination
2756                                                   Denorms
2757      FLOAT_DENORM_MODE_FLUSH_DST            1     Flush Output Denorms
2758      FLOAT_DENORM_MODE_FLUSH_SRC            2     Flush Source Denorms
2759      FLOAT_DENORM_MODE_FLUSH_NONE           3     No Flush
2760      ====================================== ===== ==============================
2761
2762 ..
2763
2764   .. table:: System VGPR Work-Item ID Enumeration Values
2765      :name: amdgpu-amdhsa-system-vgpr-work-item-id-enumeration-values-table
2766
2767      ======================================== ===== ============================
2768      Enumeration Name                         Value Description
2769      ======================================== ===== ============================
2770      SYSTEM_VGPR_WORKITEM_ID_X                0     Set work-item X dimension
2771                                                     ID.
2772      SYSTEM_VGPR_WORKITEM_ID_X_Y              1     Set work-item X and Y
2773                                                     dimensions ID.
2774      SYSTEM_VGPR_WORKITEM_ID_X_Y_Z            2     Set work-item X, Y and Z
2775                                                     dimensions ID.
2776      SYSTEM_VGPR_WORKITEM_ID_UNDEFINED        3     Undefined.
2777      ======================================== ===== ============================
2778
2779 .. _amdgpu-amdhsa-initial-kernel-execution-state:
2780
2781 Initial Kernel Execution State
2782 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
2783
2784 This section defines the register state that will be set up by the packet
2785 processor prior to the start of execution of every wavefront. This is limited by
2786 the constraints of the hardware controllers of CP/ADC/SPI.
2787
2788 The order of the SGPR registers is defined, but the compiler can specify which
2789 ones are actually setup in the kernel descriptor using the ``enable_sgpr_*`` bit
2790 fields (see :ref:`amdgpu-amdhsa-kernel-descriptor`). The register numbers used
2791 for enabled registers are dense starting at SGPR0: the first enabled register is
2792 SGPR0, the next enabled register is SGPR1 etc.; disabled registers do not have
2793 an SGPR number.
2794
2795 The initial SGPRs comprise up to 16 User SRGPs that are set by CP and apply to
2796 all wavefronts of the grid. It is possible to specify more than 16 User SGPRs using
2797 the ``enable_sgpr_*`` bit fields, in which case only the first 16 are actually
2798 initialized. These are then immediately followed by the System SGPRs that are
2799 set up by ADC/SPI and can have different values for each wavefront of the grid
2800 dispatch.
2801
2802 SGPR register initial state is defined in
2803 :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.
2804
2805   .. table:: SGPR Register Set Up Order
2806      :name: amdgpu-amdhsa-sgpr-register-set-up-order-table
2807
2808      ========== ========================== ====== ==============================
2809      SGPR Order Name                       Number Description
2810                 (kernel descriptor enable  of
2811                 field)                     SGPRs
2812      ========== ========================== ====== ==============================
2813      First      Private Segment Buffer     4      V# that can be used, together
2814                 (enable_sgpr_private              with Scratch Wavefront Offset
2815                 _segment_buffer)                  as an offset, to access the
2816                                                   private memory space using a
2817                                                   segment address.
2818
2819                                                   CP uses the value provided by
2820                                                   the runtime.
2821      then       Dispatch Ptr               2      64 bit address of AQL dispatch
2822                 (enable_sgpr_dispatch_ptr)        packet for kernel dispatch
2823                                                   actually executing.
2824      then       Queue Ptr                  2      64 bit address of amd_queue_t
2825                 (enable_sgpr_queue_ptr)           object for AQL queue on which
2826                                                   the dispatch packet was
2827                                                   queued.
2828      then       Kernarg Segment Ptr        2      64 bit address of Kernarg
2829                 (enable_sgpr_kernarg              segment. This is directly
2830                 _segment_ptr)                     copied from the
2831                                                   kernarg_address in the kernel
2832                                                   dispatch packet.
2833
2834                                                   Having CP load it once avoids
2835                                                   loading it at the beginning of
2836                                                   every wavefront.
2837      then       Dispatch Id                2      64 bit Dispatch ID of the
2838                 (enable_sgpr_dispatch_id)         dispatch packet being
2839                                                   executed.
2840      then       Flat Scratch Init          2      This is 2 SGPRs:
2841                 (enable_sgpr_flat_scratch
2842                 _init)                            GFX6
2843                                                     Not supported.
2844                                                   GFX7-GFX8
2845                                                     The first SGPR is a 32 bit
2846                                                     byte offset from
2847                                                     ``SH_HIDDEN_PRIVATE_BASE_VIMID``
2848                                                     to per SPI base of memory
2849                                                     for scratch for the queue
2850                                                     executing the kernel
2851                                                     dispatch. CP obtains this
2852                                                     from the runtime. (The
2853                                                     Scratch Segment Buffer base
2854                                                     address is
2855                                                     ``SH_HIDDEN_PRIVATE_BASE_VIMID``
2856                                                     plus this offset.) The value
2857                                                     of Scratch Wavefront Offset must
2858                                                     be added to this offset by
2859                                                     the kernel machine code,
2860                                                     right shifted by 8, and
2861                                                     moved to the FLAT_SCRATCH_HI
2862                                                     SGPR register.
2863                                                     FLAT_SCRATCH_HI corresponds
2864                                                     to SGPRn-4 on GFX7, and
2865                                                     SGPRn-6 on GFX8 (where SGPRn
2866                                                     is the highest numbered SGPR
2867                                                     allocated to the wavefront).
2868                                                     FLAT_SCRATCH_HI is
2869                                                     multiplied by 256 (as it is
2870                                                     in units of 256 bytes) and
2871                                                     added to
2872                                                     ``SH_HIDDEN_PRIVATE_BASE_VIMID``
2873                                                     to calculate the per wavefront
2874                                                     FLAT SCRATCH BASE in flat
2875                                                     memory instructions that
2876                                                     access the scratch
2877                                                     apperture.
2878
2879                                                     The second SGPR is 32 bit
2880                                                     byte size of a single
2881                                                     work-item's scratch memory
2882                                                     usage. CP obtains this from
2883                                                     the runtime, and it is
2884                                                     always a multiple of DWORD.
2885                                                     CP checks that the value in
2886                                                     the kernel dispatch packet
2887                                                     Private Segment Byte Size is
2888                                                     not larger, and requests the
2889                                                     runtime to increase the
2890                                                     queue's scratch size if
2891                                                     necessary. The kernel code
2892                                                     must move it to
2893                                                     FLAT_SCRATCH_LO which is
2894                                                     SGPRn-3 on GFX7 and SGPRn-5
2895                                                     on GFX8. FLAT_SCRATCH_LO is
2896                                                     used as the FLAT SCRATCH
2897                                                     SIZE in flat memory
2898                                                     instructions. Having CP load
2899                                                     it once avoids loading it at
2900                                                     the beginning of every
2901                                                     wavefront.
2902                                                   GFX9-GFX10
2903                                                     This is the
2904                                                     64 bit base address of the
2905                                                     per SPI scratch backing
2906                                                     memory managed by SPI for
2907                                                     the queue executing the
2908                                                     kernel dispatch. CP obtains
2909                                                     this from the runtime (and
2910                                                     divides it if there are
2911                                                     multiple Shader Arrays each
2912                                                     with its own SPI). The value
2913                                                     of Scratch Wavefront Offset must
2914                                                     be added by the kernel
2915                                                     machine code and the result
2916                                                     moved to the FLAT_SCRATCH
2917                                                     SGPR which is SGPRn-6 and
2918                                                     SGPRn-5. It is used as the
2919                                                     FLAT SCRATCH BASE in flat
2920                                                     memory instructions.
2921      then       Private Segment Size       1      The 32 bit byte size of a
2922                                                   (enable_sgpr_private single
2923                                                   work-item's
2924                                                   scratch_segment_size) memory
2925                                                   allocation. This is the
2926                                                   value from the kernel
2927                                                   dispatch packet Private
2928                                                   Segment Byte Size rounded up
2929                                                   by CP to a multiple of
2930                                                   DWORD.
2931
2932                                                   Having CP load it once avoids
2933                                                   loading it at the beginning of
2934                                                   every wavefront.
2935
2936                                                   This is not used for
2937                                                   GFX7-GFX8 since it is the same
2938                                                   value as the second SGPR of
2939                                                   Flat Scratch Init. However, it
2940                                                   may be needed for GFX9-GFX10 which
2941                                                   changes the meaning of the
2942                                                   Flat Scratch Init value.
2943      then       Grid Work-Group Count X    1      32 bit count of the number of
2944                 (enable_sgpr_grid                 work-groups in the X dimension
2945                 _workgroup_count_X)               for the grid being
2946                                                   executed. Computed from the
2947                                                   fields in the kernel dispatch
2948                                                   packet as ((grid_size.x +
2949                                                   workgroup_size.x - 1) /
2950                                                   workgroup_size.x).
2951      then       Grid Work-Group Count Y    1      32 bit count of the number of
2952                 (enable_sgpr_grid                 work-groups in the Y dimension
2953                 _workgroup_count_Y &&             for the grid being
2954                 less than 16 previous             executed. Computed from the
2955                 SGPRs)                            fields in the kernel dispatch
2956                                                   packet as ((grid_size.y +
2957                                                   workgroup_size.y - 1) /
2958                                                   workgroupSize.y).
2959
2960                                                   Only initialized if <16
2961                                                   previous SGPRs initialized.
2962      then       Grid Work-Group Count Z    1      32 bit count of the number of
2963                 (enable_sgpr_grid                 work-groups in the Z dimension
2964                 _workgroup_count_Z &&             for the grid being
2965                 less than 16 previous             executed. Computed from the
2966                 SGPRs)                            fields in the kernel dispatch
2967                                                   packet as ((grid_size.z +
2968                                                   workgroup_size.z - 1) /
2969                                                   workgroupSize.z).
2970
2971                                                   Only initialized if <16
2972                                                   previous SGPRs initialized.
2973      then       Work-Group Id X            1      32 bit work-group id in X
2974                 (enable_sgpr_workgroup_id         dimension of grid for
2975                 _X)                               wavefront.
2976      then       Work-Group Id Y            1      32 bit work-group id in Y
2977                 (enable_sgpr_workgroup_id         dimension of grid for
2978                 _Y)                               wavefront.
2979      then       Work-Group Id Z            1      32 bit work-group id in Z
2980                 (enable_sgpr_workgroup_id         dimension of grid for
2981                 _Z)                               wavefront.
2982      then       Work-Group Info            1      {first_wavefront, 14'b0000,
2983                 (enable_sgpr_workgroup            ordered_append_term[10:0],
2984                 _info)                            threadgroup_size_in_wavefronts[5:0]}
2985      then       Scratch Wavefront Offset   1      32 bit byte offset from base
2986                 (enable_sgpr_private              of scratch base of queue
2987                 _segment_wavefront_offset)        executing the kernel
2988                                                   dispatch. Must be used as an
2989                                                   offset with Private
2990                                                   segment address when using
2991                                                   Scratch Segment Buffer. It
2992                                                   must be used to set up FLAT
2993                                                   SCRATCH for flat addressing
2994                                                   (see
2995                                                   :ref:`amdgpu-amdhsa-flat-scratch`).
2996      ========== ========================== ====== ==============================
2997
2998 The order of the VGPR registers is defined, but the compiler can specify which
2999 ones are actually setup in the kernel descriptor using the ``enable_vgpr*`` bit
3000 fields (see :ref:`amdgpu-amdhsa-kernel-descriptor`). The register numbers used
3001 for enabled registers are dense starting at VGPR0: the first enabled register is
3002 VGPR0, the next enabled register is VGPR1 etc.; disabled registers do not have a
3003 VGPR number.
3004
3005 VGPR register initial state is defined in
3006 :ref:`amdgpu-amdhsa-vgpr-register-set-up-order-table`.
3007
3008   .. table:: VGPR Register Set Up Order
3009      :name: amdgpu-amdhsa-vgpr-register-set-up-order-table
3010
3011      ========== ========================== ====== ==============================
3012      VGPR Order Name                       Number Description
3013                 (kernel descriptor enable  of
3014                 field)                     VGPRs
3015      ========== ========================== ====== ==============================
3016      First      Work-Item Id X             1      32 bit work item id in X
3017                 (Always initialized)              dimension of work-group for
3018                                                   wavefront lane.
3019      then       Work-Item Id Y             1      32 bit work item id in Y
3020                 (enable_vgpr_workitem_id          dimension of work-group for
3021                 > 0)                              wavefront lane.
3022      then       Work-Item Id Z             1      32 bit work item id in Z
3023                 (enable_vgpr_workitem_id          dimension of work-group for
3024                 > 1)                              wavefront lane.
3025      ========== ========================== ====== ==============================
3026
3027 The setting of registers is done by GPU CP/ADC/SPI hardware as follows:
3028
3029 1. SGPRs before the Work-Group Ids are set by CP using the 16 User Data
3030    registers.
3031 2. Work-group Id registers X, Y, Z are set by ADC which supports any
3032    combination including none.
3033 3. Scratch Wavefront Offset is set by SPI in a per wavefront basis which is why
3034    its value cannot included with the flat scratch init value which is per queue.
3035 4. The VGPRs are set by SPI which only supports specifying either (X), (X, Y)
3036    or (X, Y, Z).
3037
3038 Flat Scratch register pair are adjacent SGRRs so they can be moved as a 64 bit
3039 value to the hardware required SGPRn-3 and SGPRn-4 respectively.
3040
3041 The global segment can be accessed either using buffer instructions (GFX6 which
3042 has V# 64 bit address support), flat instructions (GFX7-GFX10), or global
3043 instructions (GFX9-GFX10).
3044
3045 If buffer operations are used then the compiler can generate a V# with the
3046 following properties:
3047
3048 * base address of 0
3049 * no swizzle
3050 * ATC: 1 if IOMMU present (such as APU)
3051 * ptr64: 1
3052 * MTYPE set to support memory coherence that matches the runtime (such as CC for
3053   APU and NC for dGPU).
3054
3055 .. _amdgpu-amdhsa-kernel-prolog:
3056
3057 Kernel Prolog
3058 ~~~~~~~~~~~~~
3059
3060 .. _amdgpu-amdhsa-m0:
3061
3062 M0
3063 ++
3064
3065 GFX6-GFX8
3066   The M0 register must be initialized with a value at least the total LDS size
3067   if the kernel may access LDS via DS or flat operations. Total LDS size is
3068   available in dispatch packet. For M0, it is also possible to use maximum
3069   possible value of LDS for given target (0x7FFF for GFX6 and 0xFFFF for
3070   GFX7-GFX8).
3071 GFX9-GFX10
3072   The M0 register is not used for range checking LDS accesses and so does not
3073   need to be initialized in the prolog.
3074
3075 .. _amdgpu-amdhsa-flat-scratch:
3076
3077 Flat Scratch
3078 ++++++++++++
3079
3080 If the kernel may use flat operations to access scratch memory, the prolog code
3081 must set up FLAT_SCRATCH register pair (FLAT_SCRATCH_LO/FLAT_SCRATCH_HI which
3082 are in SGPRn-4/SGPRn-3). Initialization uses Flat Scratch Init and Scratch Wavefront
3083 Offset SGPR registers (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`):
3084
3085 GFX6
3086   Flat scratch is not supported.
3087
3088 GFX7-GFX8
3089   1. The low word of Flat Scratch Init is 32 bit byte offset from
3090      ``SH_HIDDEN_PRIVATE_BASE_VIMID`` to the base of scratch backing memory
3091      being managed by SPI for the queue executing the kernel dispatch. This is
3092      the same value used in the Scratch Segment Buffer V# base address. The
3093      prolog must add the value of Scratch Wavefront Offset to get the wavefront's byte
3094      scratch backing memory offset from ``SH_HIDDEN_PRIVATE_BASE_VIMID``. Since
3095      FLAT_SCRATCH_LO is in units of 256 bytes, the offset must be right shifted
3096      by 8 before moving into FLAT_SCRATCH_LO.
3097   2. The second word of Flat Scratch Init is 32 bit byte size of a single
3098      work-items scratch memory usage. This is directly loaded from the kernel
3099      dispatch packet Private Segment Byte Size and rounded up to a multiple of
3100      DWORD. Having CP load it once avoids loading it at the beginning of every
3101      wavefront. The prolog must move it to FLAT_SCRATCH_LO for use as FLAT SCRATCH
3102      SIZE.
3103
3104 GFX9-GFX10
3105   The Flat Scratch Init is the 64 bit address of the base of scratch backing
3106   memory being managed by SPI for the queue executing the kernel dispatch. The
3107   prolog must add the value of Scratch Wavefront Offset and moved to the FLAT_SCRATCH
3108   pair for use as the flat scratch base in flat memory instructions.
3109
3110 .. _amdgpu-amdhsa-memory-model:
3111
3112 Memory Model
3113 ~~~~~~~~~~~~
3114
3115 This section describes the mapping of LLVM memory model onto AMDGPU machine code
3116 (see :ref:`memmodel`). *The implementation is WIP.*
3117
3118 .. TODO
3119    Update when implementation complete.
3120
3121 The AMDGPU backend supports the memory synchronization scopes specified in
3122 :ref:`amdgpu-memory-scopes`.
3123
3124 The code sequences used to implement the memory model are defined in table
3125 :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx6-gfx10-table`.
3126
3127 The sequences specify the order of instructions that a single thread must
3128 execute. The ``s_waitcnt`` and ``buffer_wbinvl1_vol`` are defined with respect
3129 to other memory instructions executed by the same thread. This allows them to be
3130 moved earlier or later which can allow them to be combined with other instances
3131 of the same instruction, or hoisted/sunk out of loops to improve
3132 performance. Only the instructions related to the memory model are given;
3133 additional ``s_waitcnt`` instructions are required to ensure registers are
3134 defined before being used. These may be able to be combined with the memory
3135 model ``s_waitcnt`` instructions as described above.
3136
3137 The AMDGPU backend supports the following memory models:
3138
3139   HSA Memory Model [HSA]_
3140     The HSA memory model uses a single happens-before relation for all address
3141     spaces (see :ref:`amdgpu-address-spaces`).
3142   OpenCL Memory Model [OpenCL]_
3143     The OpenCL memory model which has separate happens-before relations for the
3144     global and local address spaces. Only a fence specifying both global and
3145     local address space, and seq_cst instructions join the relationships. Since
3146     the LLVM ``memfence`` instruction does not allow an address space to be
3147     specified the OpenCL fence has to convervatively assume both local and
3148     global address space was specified. However, optimizations can often be
3149     done to eliminate the additional ``s_waitcnt`` instructions when there are
3150     no intervening memory instructions which access the corresponding address
3151     space. The code sequences in the table indicate what can be omitted for the
3152     OpenCL memory. The target triple environment is used to determine if the
3153     source language is OpenCL (see :ref:`amdgpu-opencl`).
3154
3155 ``ds/flat_load/store/atomic`` instructions to local memory are termed LDS
3156 operations.
3157
3158 ``buffer/global/flat_load/store/atomic`` instructions to global memory are
3159 termed vector memory operations.
3160
3161 For GFX6-GFX9:
3162
3163 * Each agent has multiple shader arrays (SA).
3164 * Each SA has multiple compute units (CU).
3165 * Each CU has multiple SIMDs that execute wavefronts.
3166 * The wavefronts for a single work-group are executed in the same CU but may be
3167   executed by different SIMDs.
3168 * Each CU has a single LDS memory shared by the wavefronts of the work-groups
3169   executing on it.
3170 * All LDS operations of a CU are performed as wavefront wide operations in a
3171   global order and involve no caching. Completion is reported to a wavefront in
3172   execution order.
3173 * The LDS memory has multiple request queues shared by the SIMDs of a
3174   CU. Therefore, the LDS operations performed by different wavefronts of a work-group
3175   can be reordered relative to each other, which can result in reordering the
3176   visibility of vector memory operations with respect to LDS operations of other
3177   wavefronts in the same work-group. A ``s_waitcnt lgkmcnt(0)`` is required to
3178   ensure synchronization between LDS operations and vector memory operations
3179   between wavefronts of a work-group, but not between operations performed by the
3180   same wavefront.
3181 * The vector memory operations are performed as wavefront wide operations and
3182   completion is reported to a wavefront in execution order. The exception is
3183   that for GFX7-GFX9 ``flat_load/store/atomic`` instructions can report out of
3184   vector memory order if they access LDS memory, and out of LDS operation order
3185   if they access global memory.
3186 * The vector memory operations access a single vector L1 cache shared by all
3187   SIMDs a CU. Therefore, no special action is required for coherence between the
3188   lanes of a single wavefront, or for coherence between wavefronts in the same
3189   work-group. A ``buffer_wbinvl1_vol`` is required for coherence between wavefronts
3190   executing in different work-groups as they may be executing on different CUs.
3191 * The scalar memory operations access a scalar L1 cache shared by all wavefronts
3192   on a group of CUs. The scalar and vector L1 caches are not coherent. However,
3193   scalar operations are used in a restricted way so do not impact the memory
3194   model. See :ref:`amdgpu-amdhsa-memory-spaces`.
3195 * The vector and scalar memory operations use an L2 cache shared by all CUs on
3196   the same agent.
3197 * The L2 cache has independent channels to service disjoint ranges of virtual
3198   addresses.
3199 * Each CU has a separate request queue per channel. Therefore, the vector and
3200   scalar memory operations performed by wavefronts executing in different work-groups
3201   (which may be executing on different CUs) of an agent can be reordered
3202   relative to each other. A ``s_waitcnt vmcnt(0)`` is required to ensure
3203   synchronization between vector memory operations of different CUs. It ensures a
3204   previous vector memory operation has completed before executing a subsequent
3205   vector memory or LDS operation and so can be used to meet the requirements of
3206   acquire and release.
3207 * The L2 cache can be kept coherent with other agents on some targets, or ranges
3208   of virtual addresses can be set up to bypass it to ensure system coherence.
3209
3210 For GFX10:
3211
3212 * Each agent has multiple shader arrays (SA).
3213 * Each SA has multiple work-group processors (WGP).
3214 * Each WGP has multiple compute units (CU).
3215 * Each CU has multiple SIMDs that execute wavefronts.
3216 * The wavefronts for a single work-group are executed in the same
3217   WGP. In CU wavefront execution mode the wavefronts may be executed by
3218   different SIMDs in the same CU. In WGP wavefront execution mode the
3219   wavefronts may be executed by different SIMDs in different CUs in the same
3220   WGP.
3221 * Each WGP has a single LDS memory shared by the wavefronts of the work-groups
3222   executing on it.
3223 * All LDS operations of a WGP are performed as wavefront wide operations in a
3224   global order and involve no caching. Completion is reported to a wavefront in
3225   execution order.
3226 * The LDS memory has multiple request queues shared by the SIMDs of a
3227   WGP. Therefore, the LDS operations performed by different wavefronts of a work-group
3228   can be reordered relative to each other, which can result in reordering the
3229   visibility of vector memory operations with respect to LDS operations of other
3230   wavefronts in the same work-group. A ``s_waitcnt lgkmcnt(0)`` is required to
3231   ensure synchronization between LDS operations and vector memory operations
3232   between wavefronts of a work-group, but not between operations performed by the
3233   same wavefront.
3234 * The vector memory operations are performed as wavefront wide operations.
3235   Completion of load/store/sample operations are reported to a wavefront in
3236   execution order of other load/store/sample operations performed by that
3237   wavefront.
3238 * The vector memory operations access a vector L0 cache. There is a single L0
3239   cache per CU. Each SIMD of a CU accesses the same L0 cache.
3240   Therefore, no special action is required for coherence between the lanes of a
3241   single wavefront. However, a ``BUFFER_GL0_INV`` is required for coherence
3242   between wavefronts executing in the same work-group as they may be executing on
3243   SIMDs of different CUs that access different L0s. A ``BUFFER_GL0_INV`` is also
3244   required for coherence between wavefronts executing in different work-groups as
3245   they may be executing on different WGPs.
3246 * The scalar memory operations access a scalar L0 cache shared by all wavefronts
3247   on a WGP. The scalar and vector L0 caches are not coherent. However, scalar
3248   operations are used in a restricted way so do not impact the memory model. See
3249   :ref:`amdgpu-amdhsa-memory-spaces`.
3250 * The vector and scalar memory L0 caches use an L1 cache shared by all WGPs on
3251   the same SA. Therefore, no special action is required for coherence between
3252   the wavefronts of a single work-group. However, a ``BUFFER_GL1_INV`` is
3253   required for coherence between wavefronts executing in different work-groups as
3254   they may be executing on different SAs that access different L1s.
3255 * The L1 caches have independent quadrants to service disjoint ranges of virtual
3256   addresses.
3257 * Each L0 cache has a separate request queue per L1 quadrant. Therefore, the
3258   vector and scalar memory operations performed by different wavefronts, whether
3259   executing in the same or different work-groups (which may be executing on
3260   different CUs accessing different L0s), can be reordered relative to each
3261   other. A ``s_waitcnt vmcnt(0) & vscnt(0)`` is required to ensure synchronization
3262   between vector memory operations of different wavefronts. It ensures a previous
3263   vector memory operation has completed before executing a subsequent vector
3264   memory or LDS operation and so can be used to meet the requirements of acquire,
3265   release and sequential consistency.
3266 * The L1 caches use an L2 cache shared by all SAs on the same agent.
3267 * The L2 cache has independent channels to service disjoint ranges of virtual
3268   addresses.
3269 * Each L1 quadrant of a single SA accesses a different L2 channel. Each L1
3270   quadrant has a separate request queue per L2 channel. Therefore, the vector
3271   and scalar memory operations performed by wavefronts executing in different
3272   work-groups (which may be executing on different SAs) of an agent can be
3273   reordered relative to each other. A ``s_waitcnt vmcnt(0) & vscnt(0)`` is
3274   required to ensure synchronization between vector memory operations of
3275   different SAs. It ensures a previous vector memory operation has completed
3276   before executing a subsequent vector memory and so can be used to meet the
3277   requirements of acquire, release and sequential consistency.
3278 * The L2 cache can be kept coherent with other agents on some targets, or ranges
3279   of virtual addresses can be set up to bypass it to ensure system coherence.
3280
3281 Private address space uses ``buffer_load/store`` using the scratch V# (GFX6-GFX8),
3282 or ``scratch_load/store`` (GFX9-GFX10). Since only a single thread is accessing the
3283 memory, atomic memory orderings are not meaningful and all accesses are treated
3284 as non-atomic.
3285
3286 Constant address space uses ``buffer/global_load`` instructions (or equivalent
3287 scalar memory instructions). Since the constant address space contents do not
3288 change during the execution of a kernel dispatch it is not legal to perform
3289 stores, and atomic memory orderings are not meaningful and all access are
3290 treated as non-atomic.
3291
3292 A memory synchronization scope wider than work-group is not meaningful for the
3293 group (LDS) address space and is treated as work-group.
3294
3295 The memory model does not support the region address space which is treated as
3296 non-atomic.
3297
3298 Acquire memory ordering is not meaningful on store atomic instructions and is
3299 treated as non-atomic.
3300
3301 Release memory ordering is not meaningful on load atomic instructions and is
3302 treated a non-atomic.
3303
3304 Acquire-release memory ordering is not meaningful on load or store atomic
3305 instructions and is treated as acquire and release respectively.
3306
3307 AMDGPU backend only uses scalar memory operations to access memory that is
3308 proven to not change during the execution of the kernel dispatch. This includes
3309 constant address space and global address space for program scope const
3310 variables. Therefore the kernel machine code does not have to maintain the
3311 scalar L1 cache to ensure it is coherent with the vector L1 cache. The scalar
3312 and vector L1 caches are invalidated between kernel dispatches by CP since
3313 constant address space data may change between kernel dispatch executions. See
3314 :ref:`amdgpu-amdhsa-memory-spaces`.
3315
3316 The one execption is if scalar writes are used to spill SGPR registers. In this
3317 case the AMDGPU backend ensures the memory location used to spill is never
3318 accessed by vector memory operations at the same time. If scalar writes are used
3319 then a ``s_dcache_wb`` is inserted before the ``s_endpgm`` and before a function
3320 return since the locations may be used for vector memory instructions by a
3321 future wavefront that uses the same scratch area, or a function call that creates a
3322 frame at the same address, respectively. There is no need for a ``s_dcache_inv``
3323 as all scalar writes are write-before-read in the same thread.
3324
3325 For GFX6-GFX9, scratch backing memory (which is used for the private address space)
3326 is accessed with MTYPE NC_NV (non-coherenent non-volatile). Since the private
3327 address space is only accessed by a single thread, and is always
3328 write-before-read, there is never a need to invalidate these entries from the L1
3329 cache. Hence all cache invalidates are done as ``*_vol`` to only invalidate the
3330 volatile cache lines.
3331
3332 For GFX10, scratch backing memory (which is used for the private address space)
3333 is accessed with MTYPE NC (non-coherenent). Since the private address space is
3334 only accessed by a single thread, and is always write-before-read, there is
3335 never a need to invalidate these entries from the L0 or L1 caches.
3336
3337 For GFX10, wavefronts are executed in native mode with in-order reporting of loads
3338 and sample instructions. In this mode vmcnt reports completion of load, atomic
3339 with return and sample instructions in order, and the vscnt reports the
3340 completion of store and atomic without return in order. See ``MEM_ORDERED`` field
3341 in :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`.
3342
3343 In GFX10, wavefronts can be executed in WGP or CU wavefront execution mode:
3344
3345 * In WGP wavefront execution mode the wavefronts of a work-group are executed
3346   on the SIMDs of both CUs of the WGP. Therefore, explicit management of the per
3347   CU L0 caches is required for work-group synchronization. Also accesses to L1 at
3348   work-group scope need to be expicitly ordered as the accesses from different
3349   CUs are not ordered.
3350 * In CU wavefront execution mode the wavefronts of a work-group are executed on
3351   the SIMDs of a single CU of the WGP. Therefore, all global memory access by
3352   the work-group access the same L0 which in turn ensures L1 accesses are
3353   ordered and so do not require explicit management of the caches for
3354   work-group synchronization.
3355
3356 See ``WGP_MODE`` field in :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`
3357 and :ref:`amdgpu-target-features`.
3358
3359 On dGPU the kernarg backing memory is accessed as UC (uncached) to avoid needing
3360 to invalidate the L2 cache. For GFX6-GFX9, this also causes it to be treated as
3361 non-volatile and so is not invalidated by ``*_vol``. On APU it is accessed as CC
3362 (cache coherent) and so the L2 cache will be coherent with the CPU and other
3363 agents.
3364
3365   .. table:: AMDHSA Memory Model Code Sequences GFX6-GFX10
3366      :name: amdgpu-amdhsa-memory-model-code-sequences-gfx6-gfx10-table
3367
3368      ============ ============ ============== ========== =============================== ==================================
3369      LLVM Instr   LLVM Memory  LLVM Memory    AMDGPU     AMDGPU Machine Code             AMDGPU Machine Code
3370                   Ordering     Sync Scope     Address    GFX6-9                          GFX10
3371                                               Space
3372      ============ ============ ============== ========== =============================== ==================================
3373      **Non-Atomic**
3374      ----------------------------------------------------------------------------------------------------------------------
3375      load         *none*       *none*         - global   - !volatile & !nontemporal      - !volatile & !nontemporal
3376                                               - generic
3377                                               - private    1. buffer/global/flat_load      1. buffer/global/flat_load
3378                                               - constant
3379                                                          - volatile & !nontemporal       - volatile & !nontemporal
3380
3381                                                            1. buffer/global/flat_load      1. buffer/global/flat_load
3382                                                               glc=1                           glc=1 dlc=1
3383
3384                                                          - nontemporal                   - nontemporal
3385
3386                                                            1. buffer/global/flat_load      1. buffer/global/flat_load
3387                                                               glc=1 slc=1                     slc=1
3388
3389      load         *none*       *none*         - local    1. ds_load                      1. ds_load
3390      store        *none*       *none*         - global   - !nontemporal                  - !nontemporal
3391                                               - generic
3392                                               - private    1. buffer/global/flat_store     1. buffer/global/flat_store
3393                                               - constant
3394                                                          - nontemporal                   - nontemporal
3395
3396                                                            1. buffer/global/flat_stote      1. buffer/global/flat_store
3397                                                               glc=1 slc=1                      slc=1
3398
3399      store        *none*       *none*         - local    1. ds_store                     1. ds_store
3400      **Unordered Atomic**
3401      ----------------------------------------------------------------------------------------------------------------------
3402      load atomic  unordered    *any*          *any*      *Same as non-atomic*.           *Same as non-atomic*.
3403      store atomic unordered    *any*          *any*      *Same as non-atomic*.           *Same as non-atomic*.
3404      atomicrmw    unordered    *any*          *any*      *Same as monotonic              *Same as monotonic
3405                                                          atomic*.                        atomic*.
3406      **Monotonic Atomic**
3407      ----------------------------------------------------------------------------------------------------------------------
3408      load atomic  monotonic    - singlethread - global   1. buffer/global/flat_load      1. buffer/global/flat_load
3409                                - wavefront    - generic
3410      load atomic  monotonic    - workgroup    - global   1. buffer/global/flat_load      1. buffer/global/flat_load
3411                                               - generic                                     glc=1
3412
3413                                                                                            - If CU wavefront execution mode, omit glc=1.
3414
3415      load atomic  monotonic    - singlethread - local    1. ds_load                      1. ds_load
3416                                - wavefront
3417                                - workgroup
3418      load atomic  monotonic    - agent        - global   1. buffer/global/flat_load      1. buffer/global/flat_load
3419                                - system       - generic     glc=1                           glc=1 dlc=1
3420      store atomic monotonic    - singlethread - global   1. buffer/global/flat_store     1. buffer/global/flat_store
3421                                - wavefront    - generic
3422                                - workgroup
3423                                - agent
3424                                - system
3425      store atomic monotonic    - singlethread - local    1. ds_store                     1. ds_store
3426                                - wavefront
3427                                - workgroup
3428      atomicrmw    monotonic    - singlethread - global   1. buffer/global/flat_atomic    1. buffer/global/flat_atomic
3429                                - wavefront    - generic
3430                                - workgroup
3431                                - agent
3432                                - system
3433      atomicrmw    monotonic    - singlethread - local    1. ds_atomic                    1. ds_atomic
3434                                - wavefront
3435                                - workgroup
3436      **Acquire Atomic**
3437      ----------------------------------------------------------------------------------------------------------------------
3438      load atomic  acquire      - singlethread - global   1. buffer/global/ds/flat_load   1. buffer/global/ds/flat_load
3439                                - wavefront    - local
3440                                               - generic
3441      load atomic  acquire      - workgroup    - global   1. buffer/global/flat_load      1. buffer/global_load glc=1
3442
3443                                                                                            - If CU wavefront execution mode, omit glc=1.
3444
3445                                                                                          2. s_waitcnt vmcnt(0)
3446
3447                                                                                            - If CU wavefront execution mode, omit.
3448                                                                                            - Must happen before
3449                                                                                              the following buffer_gl0_inv
3450                                                                                              and before any following
3451                                                                                              global/generic
3452                                                                                              load/load
3453                                                                                              atomic/stote/store
3454                                                                                              atomic/atomicrmw.
3455
3456                                                                                          3. buffer_gl0_inv
3457
3458                                                                                            - If CU wavefront execution mode, omit.
3459                                                                                            - Ensures that
3460                                                                                              following
3461                                                                                              loads will not see
3462                                                                                              stale data.
3463
3464      load atomic  acquire      - workgroup    - local    1. ds_load                      1. ds_load
3465                                                          2. s_waitcnt lgkmcnt(0)         2. s_waitcnt lgkmcnt(0)
3466
3467                                                            - If OpenCL, omit.              - If OpenCL, omit.
3468                                                            - Must happen before            - Must happen before
3469                                                              any following                   the following buffer_gl0_inv
3470                                                              global/generic                  and before any following
3471                                                              load/load                       global/generic load/load
3472                                                              atomic/store/store              atomic/store/store
3473                                                              atomic/atomicrmw.               atomic/atomicrmw.
3474                                                            - Ensures any                   - Ensures any
3475                                                              following global                following global
3476                                                              data read is no                 data read is no
3477                                                              older than the load             older than the load
3478                                                              atomic value being              atomic value being
3479                                                              acquired.                       acquired.
3480
3481                                                                                          3. buffer_gl0_inv
3482
3483                                                                                            - If CU wavefront execution mode, omit.
3484                                                                                            - If OpenCL, omit.
3485                                                                                            - Ensures that
3486                                                                                              following
3487                                                                                              loads will not see
3488                                                                                              stale data.
3489
3490      load atomic  acquire      - workgroup    - generic  1. flat_load                    1. flat_load glc=1
3491
3492                                                                                            - If CU wavefront execution mode, omit glc=1.
3493
3494                                                          2. s_waitcnt lgkmcnt(0)         2. s_waitcnt lgkmcnt(0) &
3495                                                                                             vmcnt(0)
3496
3497                                                                                            - If CU wavefront execution mode, omit vmcnt.
3498                                                            - If OpenCL, omit.              - If OpenCL, omit
3499                                                                                              lgkmcnt(0).
3500                                                            - Must happen before            - Must happen before
3501                                                              any following                   the following
3502                                                              global/generic                  buffer_gl0_inv and any
3503                                                              load/load                       following global/generic
3504                                                              atomic/store/store              load/load
3505                                                              atomic/atomicrmw.               atomic/store/store
3506                                                                                              atomic/atomicrmw.
3507                                                            - Ensures any                   - Ensures any
3508                                                              following global                following global
3509                                                              data read is no                 data read is no
3510                                                              older than the load             older than the load
3511                                                              atomic value being              atomic value being
3512                                                              acquired.                       acquired.
3513
3514                                                                                          3. buffer_gl0_inv
3515
3516                                                                                            - If CU wavefront execution mode, omit.
3517                                                                                            - Ensures that
3518                                                                                              following
3519                                                                                              loads will not see
3520                                                                                              stale data.
3521
3522      load atomic  acquire      - agent        - global   1. buffer/global/flat_load      1. buffer/global_load
3523                                - system                     glc=1                           glc=1 dlc=1
3524                                                          2. s_waitcnt vmcnt(0)           2. s_waitcnt vmcnt(0)
3525
3526                                                            - Must happen before            - Must happen before
3527                                                              following                       following
3528                                                              buffer_wbinvl1_vol.             buffer_gl*_inv.
3529                                                            - Ensures the load              - Ensures the load
3530                                                              has completed                   has completed
3531                                                              before invalidating             before invalidating
3532                                                              the cache.                      the caches.
3533
3534                                                          3. buffer_wbinvl1_vol           3. buffer_gl0_inv;
3535                                                                                             buffer_gl1_inv
3536
3537                                                            - Must happen before            - Must happen before
3538                                                              any following                   any following
3539                                                              global/generic                  global/generic
3540                                                              load/load                       load/load
3541                                                              atomic/atomicrmw.               atomic/atomicrmw.
3542                                                            - Ensures that                  - Ensures that
3543                                                              following                       following
3544                                                              loads will not see              loads will not see
3545                                                              stale global data.              stale global data.
3546
3547      load atomic  acquire      - agent        - generic  1. flat_load glc=1              1. flat_load glc=1 dlc=1
3548                                - system                  2. s_waitcnt vmcnt(0) &         2. s_waitcnt vmcnt(0) &
3549                                                             lgkmcnt(0)                      lgkmcnt(0)
3550
3551                                                            - If OpenCL omit                - If OpenCL omit
3552                                                              lgkmcnt(0).                     lgkmcnt(0).
3553                                                            - Must happen before            - Must happen before
3554                                                              following                       following
3555                                                              buffer_wbinvl1_vol.             buffer_gl*_invl.
3556                                                            - Ensures the flat_load         - Ensures the flat_load
3557                                                              has completed                   has completed
3558                                                              before invalidating             before invalidating
3559                                                              the cache.                      the caches.
3560
3561                                                          3. buffer_wbinvl1_vol           3. buffer_gl0_inv;
3562                                                                                             buffer_gl1_inv
3563
3564                                                            - Must happen before            - Must happen before
3565                                                              any following                   any following
3566                                                              global/generic                  global/generic
3567                                                              load/load                       load/load
3568                                                              atomic/atomicrmw.               atomic/atomicrmw.
3569                                                            - Ensures that                  - Ensures that
3570                                                              following loads                 following loads
3571                                                              will not see stale              will not see stale
3572                                                              global data.                    global data.
3573
3574      atomicrmw    acquire      - singlethread - global   1. buffer/global/ds/flat_atomic 1. buffer/global/ds/flat_atomic
3575                                - wavefront    - local
3576                                               - generic
3577      atomicrmw    acquire      - workgroup    - global   1. buffer/global/flat_atomic    1. buffer/global_atomic
3578                                                                                          2. s_waitcnt vm/vscnt(0)
3579
3580                                                                                            - If CU wavefront execution mode, omit.
3581                                                                                            - Use vmcnt if atomic with
3582                                                                                              return and vscnt if atomic
3583                                                                                              with no-return.
3584                                                                                            - Must happen before
3585                                                                                              the following buffer_gl0_inv
3586                                                                                              and before any following
3587                                                                                              global/generic
3588                                                                                              load/load
3589                                                                                              atomic/stote/store
3590                                                                                              atomic/atomicrmw.
3591
3592                                                                                          3. buffer_gl0_inv
3593
3594                                                                                            - If CU wavefront execution mode, omit.
3595                                                                                            - Ensures that
3596                                                                                              following
3597                                                                                              loads will not see
3598                                                                                              stale data.
3599
3600      atomicrmw    acquire      - workgroup    - local    1. ds_atomic                    1. ds_atomic
3601                                                          2. waitcnt lgkmcnt(0)           2. waitcnt lgkmcnt(0)
3602
3603                                                            - If OpenCL, omit.              - If OpenCL, omit.
3604                                                            - Must happen before            - Must happen before
3605                                                              any following                   the following
3606                                                              global/generic                  buffer_gl0_inv.
3607                                                              load/load
3608                                                              atomic/store/store
3609                                                              atomic/atomicrmw.
3610                                                            - Ensures any                   - Ensures any
3611                                                              following global                following global
3612                                                              data read is no                 data read is no
3613                                                              older than the                  older than the
3614                                                              atomicrmw value                 atomicrmw value
3615                                                              being acquired.                 being acquired.
3616
3617                                                                                          3. buffer_gl0_inv
3618
3619                                                                                            - If OpenCL omit.
3620                                                                                            - Ensures that
3621                                                                                              following
3622                                                                                              loads will not see
3623                                                                                              stale data.
3624
3625      atomicrmw    acquire      - workgroup    - generic  1. flat_atomic                  1. flat_atomic
3626                                                          2. waitcnt lgkmcnt(0)           2. waitcnt lgkmcnt(0) &
3627                                                                                             vm/vscnt(0)
3628
3629                                                                                            - If CU wavefront execution mode, omit vm/vscnt.
3630                                                            - If OpenCL, omit.              - If OpenCL, omit
3631                                                                                              waitcnt lgkmcnt(0)..
3632                                                                                            - Use vmcnt if atomic with
3633                                                                                              return and vscnt if atomic
3634                                                                                              with no-return.
3635                                                                                              waitcnt lgkmcnt(0).
3636                                                            - Must happen before            - Must happen before
3637                                                              any following                   the following
3638                                                              global/generic                  buffer_gl0_inv.
3639                                                              load/load
3640                                                              atomic/store/store
3641                                                              atomic/atomicrmw.
3642                                                            - Ensures any                   - Ensures any
3643                                                              following global                following global
3644                                                              data read is no                 data read is no
3645                                                              older than the                  older than the
3646                                                              atomicrmw value                 atomicrmw value
3647                                                              being acquired.                 being acquired.
3648
3649                                                                                          3. buffer_gl0_inv
3650
3651                                                                                            - If CU wavefront execution mode, omit.
3652                                                                                            - Ensures that
3653                                                                                              following
3654                                                                                              loads will not see
3655                                                                                              stale data.
3656
3657      atomicrmw    acquire      - agent        - global   1. buffer/global/flat_atomic    1. buffer/global_atomic
3658                                - system                  2. s_waitcnt vmcnt(0)           2. s_waitcnt vm/vscnt(0)
3659
3660                                                                                            - Use vmcnt if atomic with
3661                                                                                              return and vscnt if atomic
3662                                                                                              with no-return.
3663                                                                                              waitcnt lgkmcnt(0).
3664                                                            - Must happen before            - Must happen before
3665                                                              following                       following
3666                                                              buffer_wbinvl1_vol.             buffer_gl*_inv.
3667                                                            - Ensures the                   - Ensures the
3668                                                              atomicrmw has                   atomicrmw has
3669                                                              completed before                completed before
3670                                                              invalidating the                invalidating the
3671                                                              cache.                          caches.
3672
3673                                                          3. buffer_wbinvl1_vol           3. buffer_gl0_inv;
3674                                                                                             buffer_gl1_inv
3675
3676                                                            - Must happen before            - Must happen before
3677                                                              any following                   any following
3678                                                              global/generic                  global/generic
3679                                                              load/load                       load/load
3680                                                              atomic/atomicrmw.               atomic/atomicrmw.
3681                                                            - Ensures that                  - Ensures that
3682                                                              following loads                 following loads
3683                                                              will not see stale              will not see stale
3684                                                              global data.                    global data.
3685
3686      atomicrmw    acquire      - agent        - generic  1. flat_atomic                  1. flat_atomic
3687                                - system                  2. s_waitcnt vmcnt(0) &         2. s_waitcnt vm/vscnt(0) &
3688                                                             lgkmcnt(0)                      lgkmcnt(0)
3689
3690                                                            - If OpenCL, omit               - If OpenCL, omit
3691                                                              lgkmcnt(0).                     lgkmcnt(0).
3692                                                                                            - Use vmcnt if atomic with
3693                                                                                              return and vscnt if atomic
3694                                                                                              with no-return.
3695                                                            - Must happen before            - Must happen before
3696                                                              following                       following
3697                                                              buffer_wbinvl1_vol.             buffer_gl*_inv.
3698                                                            - Ensures the                   - Ensures the
3699                                                              atomicrmw has                   atomicrmw has
3700                                                              completed before                completed before
3701                                                              invalidating the                invalidating the
3702                                                              cache.                          caches.
3703
3704                                                          3. buffer_wbinvl1_vol           3. buffer_gl0_inv;
3705                                                                                             buffer_gl1_inv
3706
3707                                                            - Must happen before            - Must happen before
3708                                                              any following                   any following
3709                                                              global/generic                  global/generic
3710                                                              load/load                       load/load
3711                                                              atomic/atomicrmw.               atomic/atomicrmw.
3712                                                            - Ensures that                  - Ensures that
3713                                                              following loads                 following loads
3714                                                              will not see stale              will not see stale
3715                                                              global data.                    global data.
3716
3717      fence        acquire      - singlethread *none*     *none*                          *none*
3718                                - wavefront
3719      fence        acquire      - workgroup    *none*     1. s_waitcnt lgkmcnt(0)         1. s_waitcnt lgkmcnt(0) &
3720                                                                                             vmcnt(0) & vscnt(0)
3721
3722                                                                                            - If CU wavefront execution mode, omit vmcnt and
3723                                                                                              vscnt.
3724                                                            - If OpenCL and                 - If OpenCL and
3725                                                              address space is                address space is
3726                                                              not generic, omit.              not generic, omit
3727                                                                                              lgkmcnt(0).
3728                                                                                            - If OpenCL and
3729                                                                                              address space is
3730                                                                                              local, omit
3731                                                                                              vmcnt(0) and vscnt(0).
3732                                                            - However, since LLVM           - However, since LLVM
3733                                                              currently has no                currently has no
3734                                                              address space on                address space on
3735                                                              the fence need to               the fence need to
3736                                                              conservatively                  conservatively
3737                                                              always generate. If             always generate. If
3738                                                              fence had an                    fence had an
3739                                                              address space then              address space then
3740                                                              set to address                  set to address
3741                                                              space of OpenCL                 space of OpenCL
3742                                                              fence flag, or to               fence flag, or to
3743                                                              generic if both                 generic if both
3744                                                              local and global                local and global
3745                                                              flags are                       flags are
3746                                                              specified.                      specified.
3747                                                            - Must happen after
3748                                                              any preceding
3749                                                              local/generic load
3750                                                              atomic/atomicrmw
3751                                                              with an equal or
3752                                                              wider sync scope
3753                                                              and memory ordering
3754                                                              stronger than
3755                                                              unordered (this is
3756                                                              termed the
3757                                                              fence-paired-atomic).
3758                                                            - Must happen before
3759                                                              any following
3760                                                              global/generic
3761                                                              load/load
3762                                                              atomic/store/store
3763                                                              atomic/atomicrmw.
3764                                                            - Ensures any
3765                                                              following global
3766                                                              data read is no
3767                                                              older than the
3768                                                              value read by the
3769                                                              fence-paired-atomic.
3770                                                                                            - Could be split into
3771                                                                                              separate s_waitcnt
3772                                                                                              vmcnt(0), s_waitcnt
3773                                                                                              vscnt(0) and s_waitcnt
3774                                                                                              lgkmcnt(0) to allow
3775                                                                                              them to be
3776                                                                                              independently moved
3777                                                                                              according to the
3778                                                                                              following rules.
3779                                                                                            - s_waitcnt vmcnt(0)
3780                                                                                              must happen after
3781                                                                                              any preceding
3782                                                                                              global/generic load
3783                                                                                              atomic/
3784                                                                                              atomicrmw-with-return-value
3785                                                                                              with an equal or
3786                                                                                              wider sync scope
3787                                                                                              and memory ordering
3788                                                                                              stronger than
3789                                                                                              unordered (this is
3790                                                                                              termed the
3791                                                                                              fence-paired-atomic).
3792                                                                                            - s_waitcnt vscnt(0)
3793                                                                                              must happen after
3794                                                                                              any preceding
3795                                                                                              global/generic
3796                                                                                              atomicrmw-no-return-value
3797                                                                                              with an equal or
3798                                                                                              wider sync scope
3799                                                                                              and memory ordering
3800                                                                                              stronger than
3801                                                                                              unordered (this is
3802                                                                                              termed the
3803                                                                                              fence-paired-atomic).
3804                                                                                            - s_waitcnt lgkmcnt(0)
3805                                                                                              must happen after
3806                                                                                              any preceding
3807                                                                                              local/generic load
3808                                                                                              atomic/atomicrmw
3809                                                                                              with an equal or
3810                                                                                              wider sync scope
3811                                                                                              and memory ordering
3812                                                                                              stronger than
3813                                                                                              unordered (this is
3814                                                                                              termed the
3815                                                                                              fence-paired-atomic).
3816                                                                                            - Must happen before
3817                                                                                              the following
3818                                                                                              buffer_gl0_inv.
3819                                                                                            - Ensures that the
3820                                                                                              fence-paired atomic
3821                                                                                              has completed
3822                                                                                              before invalidating
3823                                                                                              the
3824                                                                                              cache. Therefore
3825                                                                                              any following
3826                                                                                              locations read must
3827                                                                                              be no older than
3828                                                                                              the value read by
3829                                                                                              the
3830                                                                                              fence-paired-atomic.
3831
3832                                                                                          3. buffer_gl0_inv
3833
3834                                                                                            - If CU wavefront execution mode, omit.
3835                                                                                            - Ensures that
3836                                                                                              following
3837                                                                                              loads will not see
3838                                                                                              stale data.
3839
3840      fence        acquire      - agent        *none*     1. s_waitcnt lgkmcnt(0) &       1. s_waitcnt lgkmcnt(0) &
3841                                - system                     vmcnt(0)                        vmcnt(0) & vscnt(0)
3842
3843                                                            - If OpenCL and                 - If OpenCL and
3844                                                              address space is                address space is
3845                                                              not generic, omit               not generic, omit
3846                                                              lgkmcnt(0).                     lgkmcnt(0).
3847                                                                                            - If OpenCL and
3848                                                                                              address space is
3849                                                                                              local, omit
3850                                                                                              vmcnt(0) and vscnt(0).
3851                                                            - However, since LLVM           - However, since LLVM
3852                                                              currently has no                currently has no
3853                                                              address space on                address space on
3854                                                              the fence need to               the fence need to
3855                                                              conservatively                  conservatively
3856                                                              always generate                 always generate
3857                                                              (see comment for                (see comment for
3858                                                              previous fence).                previous fence).
3859                                                            - Could be split into
3860                                                              separate s_waitcnt
3861                                                              vmcnt(0) and
3862                                                              s_waitcnt
3863                                                              lgkmcnt(0) to allow
3864                                                              them to be
3865                                                              independently moved
3866                                                              according to the
3867                                                              following rules.
3868                                                            - s_waitcnt vmcnt(0)
3869                                                              must happen after
3870                                                              any preceding
3871                                                              global/generic load
3872                                                              atomic/atomicrmw
3873                                                              with an equal or
3874                                                              wider sync scope
3875                                                              and memory ordering
3876                                                              stronger than
3877                                                              unordered (this is
3878                                                              termed the
3879                                                              fence-paired-atomic).
3880                                                            - s_waitcnt lgkmcnt(0)
3881                                                              must happen after
3882                                                              any preceding
3883                                                              local/generic load
3884                                                              atomic/atomicrmw
3885                                                              with an equal or
3886                                                              wider sync scope
3887                                                              and memory ordering
3888                                                              stronger than
3889                                                              unordered (this is
3890                                                              termed the
3891                                                              fence-paired-atomic).
3892                                                            - Must happen before
3893                                                              the following
3894                                                              buffer_wbinvl1_vol.
3895                                                            - Ensures that the
3896                                                              fence-paired atomic
3897                                                              has completed
3898                                                              before invalidating
3899                                                              the
3900                                                              cache. Therefore
3901                                                              any following
3902                                                              locations read must
3903                                                              be no older than
3904                                                              the value read by
3905                                                              the
3906                                                              fence-paired-atomic.
3907                                                                                            - Could be split into
3908                                                                                              separate s_waitcnt
3909                                                                                              vmcnt(0), s_waitcnt
3910                                                                                              vscnt(0) and s_waitcnt
3911                                                                                              lgkmcnt(0) to allow
3912                                                                                              them to be
3913                                                                                              independently moved
3914                                                                                              according to the
3915                                                                                              following rules.
3916                                                                                            - s_waitcnt vmcnt(0)
3917                                                                                              must happen after
3918                                                                                              any preceding
3919                                                                                              global/generic load
3920                                                                                              atomic/
3921                                                                                              atomicrmw-with-return-value
3922                                                                                              with an equal or
3923                                                                                              wider sync scope
3924                                                                                              and memory ordering
3925                                                                                              stronger than
3926                                                                                              unordered (this is
3927                                                                                              termed the
3928                                                                                              fence-paired-atomic).
3929                                                                                            - s_waitcnt vscnt(0)
3930                                                                                              must happen after
3931                                                                                              any preceding
3932                                                                                              global/generic
3933                                                                                              atomicrmw-no-return-value
3934                                                                                              with an equal or
3935                                                                                              wider sync scope
3936                                                                                              and memory ordering
3937                                                                                              stronger than
3938                                                                                              unordered (this is
3939                                                                                              termed the
3940                                                                                              fence-paired-atomic).
3941                                                                                            - s_waitcnt lgkmcnt(0)
3942                                                                                              must happen after
3943                                                                                              any preceding
3944                                                                                              local/generic load
3945                                                                                              atomic/atomicrmw
3946                                                                                              with an equal or
3947                                                                                              wider sync scope
3948                                                                                              and memory ordering
3949                                                                                              stronger than
3950                                                                                              unordered (this is
3951                                                                                              termed the
3952                                                                                              fence-paired-atomic).
3953                                                                                            - Must happen before
3954                                                                                              the following
3955                                                                                              buffer_gl*_inv.
3956                                                                                            - Ensures that the
3957                                                                                              fence-paired atomic
3958                                                                                              has completed
3959                                                                                              before invalidating
3960                                                                                              the
3961                                                                                              caches. Therefore
3962                                                                                              any following
3963                                                                                              locations read must
3964                                                                                              be no older than
3965                                                                                              the value read by
3966                                                                                              the
3967                                                                                              fence-paired-atomic.
3968
3969                                                          2. buffer_wbinvl1_vol           2. buffer_gl0_inv;
3970                                                                                             buffer_gl1_inv
3971
3972                                                            - Must happen before any        - Must happen before any
3973                                                              following global/generic        following global/generic
3974                                                              load/load                       load/load
3975                                                              atomic/store/store              atomic/store/store
3976                                                              atomic/atomicrmw.               atomic/atomicrmw.
3977                                                            - Ensures that                  - Ensures that
3978                                                              following loads                 following loads
3979                                                              will not see stale              will not see stale
3980                                                              global data.                    global data.
3981
3982      **Release Atomic**
3983      ----------------------------------------------------------------------------------------------------------------------
3984      store atomic release      - singlethread - global   1. buffer/global/ds/flat_store  1. buffer/global/ds/flat_store
3985                                - wavefront    - local
3986                                               - generic
3987      store atomic release      - workgroup    - global   1. s_waitcnt lgkmcnt(0)         1. s_waitcnt lgkmcnt(0) &
3988                                                                                             vmcnt(0) & vscnt(0)
3989
3990                                                                                            - If CU wavefront execution mode, omit vmcnt and
3991                                                                                              vscnt.
3992                                                            - If OpenCL, omit.              - If OpenCL, omit
3993                                                                                              lgkmcnt(0).
3994                                                            - Must happen after
3995                                                              any preceding
3996                                                              local/generic
3997                                                              load/store/load
3998                                                              atomic/store
3999                                                              atomic/atomicrmw.
4000                                                                                            - Could be split into
4001                                                                                              separate s_waitcnt
4002                                                                                              vmcnt(0), s_waitcnt
4003                                                                                              vscnt(0) and s_waitcnt
4004                                                                                              lgkmcnt(0) to allow
4005                                                                                              them to be
4006                                                                                              independently moved
4007                                                                                              according to the
4008                                                                                              following rules.
4009                                                                                            - s_waitcnt vmcnt(0)
4010                                                                                              must happen after
4011                                                                                              any preceding
4012                                                                                              global/generic load/load
4013                                                                                              atomic/
4014                                                                                              atomicrmw-with-return-value.
4015                                                                                            - s_waitcnt vscnt(0)
4016                                                                                              must happen after
4017                                                                                              any preceding
4018                                                                                              global/generic
4019                                                                                              store/store
4020                                                                                              atomic/
4021                                                                                              atomicrmw-no-return-value.
4022                                                                                            - s_waitcnt lgkmcnt(0)
4023                                                                                              must happen after
4024                                                                                              any preceding
4025                                                                                              local/generic
4026                                                                                              load/store/load
4027                                                                                              atomic/store
4028                                                                                              atomic/atomicrmw.
4029                                                            - Must happen before            - Must happen before
4030                                                              the following                   the following
4031                                                              store.                          store.
4032                                                            - Ensures that all              - Ensures that all
4033                                                              memory operations               memory operations
4034                                                              to local have                   have
4035                                                              completed before                completed before
4036                                                              performing the                  performing the
4037                                                              store that is being             store that is being
4038                                                              released.                       released.
4039
4040                                                          2. buffer/global/flat_store     2. buffer/global_store
4041      store atomic release      - workgroup    - local                                    1. waitcnt vmcnt(0) & vscnt(0)
4042
4043                                                                                            - If CU wavefront execution mode, omit.
4044                                                                                            - If OpenCL, omit.
4045                                                                                            - Could be split into
4046                                                                                              separate s_waitcnt
4047                                                                                              vmcnt(0) and s_waitcnt
4048                                                                                              vscnt(0) to allow
4049                                                                                              them to be
4050                                                                                              independently moved
4051                                                                                              according to the
4052                                                                                              following rules.
4053                                                                                            - s_waitcnt vmcnt(0)
4054                                                                                              must happen after
4055                                                                                              any preceding
4056                                                                                              global/generic load/load
4057                                                                                              atomic/
4058                                                                                              atomicrmw-with-return-value.
4059                                                                                            - s_waitcnt vscnt(0)
4060                                                                                              must happen after
4061                                                                                              any preceding
4062                                                                                              global/generic
4063                                                                                              store/store atomic/
4064                                                                                              atomicrmw-no-return-value.
4065                                                                                            - Must happen before
4066                                                                                              the following
4067                                                                                              store.
4068                                                                                            - Ensures that all
4069                                                                                              global memory
4070                                                                                              operations have
4071                                                                                              completed before
4072                                                                                              performing the
4073                                                                                              store that is being
4074                                                                                              released.
4075
4076                                                          1. ds_store                     2. ds_store
4077      store atomic release      - workgroup    - generic  1. s_waitcnt lgkmcnt(0)         1. s_waitcnt lgkmcnt(0) &
4078                                                                                             vmcnt(0) & vscnt(0)
4079
4080                                                                                            - If CU wavefront execution mode, omit vmcnt and
4081                                                                                              vscnt.
4082                                                            - If OpenCL, omit.              - If OpenCL, omit
4083                                                                                              lgkmcnt(0).
4084                                                            - Must happen after
4085                                                              any preceding
4086                                                              local/generic
4087                                                              load/store/load
4088                                                              atomic/store
4089                                                              atomic/atomicrmw.
4090                                                                                            - Could be split into
4091                                                                                              separate s_waitcnt
4092                                                                                              vmcnt(0), s_waitcnt
4093                                                                                              vscnt(0) and s_waitcnt
4094                                                                                              lgkmcnt(0) to allow
4095                                                                                              them to be
4096                                                                                              independently moved
4097                                                                                              according to the
4098                                                                                              following rules.
4099                                                                                            - s_waitcnt vmcnt(0)
4100                                                                                              must happen after
4101                                                                                              any preceding
4102                                                                                              global/generic load/load
4103                                                                                              atomic/
4104                                                                                              atomicrmw-with-return-value.
4105                                                                                            - s_waitcnt vscnt(0)
4106                                                                                              must happen after
4107                                                                                              any preceding
4108                                                                                              global/generic
4109                                                                                              store/store
4110                                                                                              atomic/
4111                                                                                              atomicrmw-no-return-value.
4112                                                                                            - s_waitcnt lgkmcnt(0)
4113                                                                                              must happen after
4114                                                                                              any preceding
4115                                                                                              local/generic load/store/load
4116                                                                                              atomic/store atomic/atomicrmw.
4117                                                            - Must happen before            - Must happen before
4118                                                              the following                   the following
4119                                                              store.                          store.
4120                                                            - Ensures that all              - Ensures that all
4121                                                              memory operations               memory operations
4122                                                              to local have                   have
4123                                                              completed before                completed before
4124                                                              performing the                  performing the
4125                                                              store that is being             store that is being
4126                                                              released.                       released.
4127
4128                                                          2. flat_store                   2. flat_store
4129      store atomic release      - agent        - global   1. s_waitcnt lgkmcnt(0) &         1. s_waitcnt lgkmcnt(0) &
4130                                - system       - generic     vmcnt(0)                          vmcnt(0) & vscnt(0)
4131
4132                                                            - If OpenCL, omit               - If OpenCL, omit
4133                                                              lgkmcnt(0).                     lgkmcnt(0).
4134                                                            - Could be split into           - Could be split into
4135                                                              separate s_waitcnt              separate s_waitcnt
4136                                                              vmcnt(0) and                    vmcnt(0), s_waitcnt vscnt(0)
4137                                                              s_waitcnt                       and s_waitcnt
4138                                                              lgkmcnt(0) to allow             lgkmcnt(0) to allow
4139                                                              them to be                      them to be
4140                                                              independently moved             independently moved
4141                                                              according to the                according to the
4142                                                              following rules.                following rules.
4143                                                            - s_waitcnt vmcnt(0)            - s_waitcnt vmcnt(0)
4144                                                              must happen after               must happen after
4145                                                              any preceding                   any preceding
4146                                                              global/generic                  global/generic
4147                                                              load/store/load                 load/load
4148                                                              atomic/store                    atomic/
4149                                                              atomic/atomicrmw.               atomicrmw-with-return-value.
4150                                                                                            - s_waitcnt vscnt(0)
4151                                                                                              must happen after
4152                                                                                              any preceding
4153                                                                                              global/generic
4154                                                                                              store/store atomic/
4155                                                                                              atomicrmw-no-return-value.
4156                                                            - s_waitcnt lgkmcnt(0)          - s_waitcnt lgkmcnt(0)
4157                                                              must happen after               must happen after
4158                                                              any preceding                   any preceding
4159                                                              local/generic                   local/generic
4160                                                              load/store/load                 load/store/load
4161                                                              atomic/store                    atomic/store
4162                                                              atomic/atomicrmw.               atomic/atomicrmw.
4163                                                            - Must happen before            - Must happen before
4164                                                              the following                   the following
4165                                                              store.                          store.
4166                                                            - Ensures that all              - Ensures that all
4167                                                              memory operations               memory operations
4168                                                              to memory have                  to memory have
4169                                                              completed before                completed before
4170                                                              performing the                  performing the
4171                                                              store that is being             store that is being
4172                                                              released.                       released.
4173
4174                                                          2. buffer/global/ds/flat_store  2. buffer/global/ds/flat_store
4175      atomicrmw    release      - singlethread - global   1. buffer/global/ds/flat_atomic 1. buffer/global/ds/flat_atomic
4176                                - wavefront    - local
4177                                               - generic
4178      atomicrmw    release      - workgroup    - global   1. s_waitcnt lgkmcnt(0)         1. s_waitcnt lgkmcnt(0) &
4179                                                                                             vmcnt(0) & vscnt(0)
4180
4181                                                                                            - If CU wavefront execution mode, omit vmcnt and
4182                                                                                              vscnt.
4183                                                            - If OpenCL, omit.
4184
4185                                                            - Must happen after
4186                                                              any preceding
4187                                                              local/generic
4188                                                              load/store/load
4189                                                              atomic/store
4190                                                              atomic/atomicrmw.
4191                                                                                            - Could be split into
4192                                                                                              separate s_waitcnt
4193                                                                                              vmcnt(0), s_waitcnt
4194                                                                                              vscnt(0) and s_waitcnt
4195                                                                                              lgkmcnt(0) to allow
4196                                                                                              them to be
4197                                                                                              independently moved
4198                                                                                              according to the
4199                                                                                              following rules.
4200                                                                                            - s_waitcnt vmcnt(0)
4201                                                                                              must happen after
4202                                                                                              any preceding
4203                                                                                              global/generic load/load
4204                                                                                              atomic/
4205                                                                                              atomicrmw-with-return-value.
4206                                                                                            - s_waitcnt vscnt(0)
4207                                                                                              must happen after
4208                                                                                              any preceding
4209                                                                                              global/generic
4210                                                                                              store/store
4211                                                                                              atomic/
4212                                                                                              atomicrmw-no-return-value.
4213                                                                                            - s_waitcnt lgkmcnt(0)
4214                                                                                              must happen after
4215                                                                                              any preceding
4216                                                                                              local/generic
4217                                                                                              load/store/load
4218                                                                                              atomic/store
4219                                                                                              atomic/atomicrmw.
4220                                                            - Must happen before            - Must happen before
4221                                                              the following                   the following
4222                                                              atomicrmw.                      atomicrmw.
4223                                                            - Ensures that all              - Ensures that all
4224                                                              memory operations               memory operations
4225                                                              to local have                   have
4226                                                              completed before                completed before
4227                                                              performing the                  performing the
4228                                                              atomicrmw that is               atomicrmw that is
4229                                                              being released.                 being released.
4230
4231                                                          2. buffer/global/flat_atomic    2. buffer/global_atomic
4232      atomicrmw    release      - workgroup    - local                                    1. waitcnt vmcnt(0) & vscnt(0)
4233
4234                                                                                            - If CU wavefront execution mode, omit.
4235                                                                                            - If OpenCL, omit.
4236                                                                                            - Could be split into
4237                                                                                              separate s_waitcnt
4238                                                                                              vmcnt(0) and s_waitcnt
4239                                                                                              vscnt(0) to allow
4240                                                                                              them to be
4241                                                                                              independently moved
4242                                                                                              according to the
4243                                                                                              following rules.
4244                                                                                            - s_waitcnt vmcnt(0)
4245                                                                                              must happen after
4246                                                                                              any preceding
4247                                                                                              global/generic load/load
4248                                                                                              atomic/
4249                                                                                              atomicrmw-with-return-value.
4250                                                                                            - s_waitcnt vscnt(0)
4251                                                                                              must happen after
4252                                                                                              any preceding
4253                                                                                              global/generic
4254                                                                                              store/store atomic/
4255                                                                                              atomicrmw-no-return-value.
4256                                                                                            - Must happen before
4257                                                                                              the following
4258                                                                                              store.
4259                                                                                            - Ensures that all
4260                                                                                              global memory
4261                                                                                              operations have
4262                                                                                              completed before
4263                                                                                              performing the
4264                                                                                              store that is being
4265                                                                                              released.
4266
4267                                                          1. ds_atomic                    2. ds_atomic
4268      atomicrmw    release      - workgroup    - generic  1. s_waitcnt lgkmcnt(0)         1. s_waitcnt lgkmcnt(0) &
4269                                                                                             vmcnt(0) & vscnt(0)
4270
4271                                                                                            - If CU wavefront execution mode, omit vmcnt and
4272                                                                                              vscnt.
4273                                                            - If OpenCL, omit.              - If OpenCL, omit
4274                                                                                              waitcnt lgkmcnt(0).
4275                                                            - Must happen after
4276                                                              any preceding
4277                                                              local/generic
4278                                                              load/store/load
4279                                                              atomic/store
4280                                                              atomic/atomicrmw.
4281                                                                                            - Could be split into
4282                                                                                              separate s_waitcnt
4283                                                                                              vmcnt(0), s_waitcnt
4284                                                                                              vscnt(0) and s_waitcnt
4285                                                                                              lgkmcnt(0) to allow
4286                                                                                              them to be
4287                                                                                              independently moved
4288                                                                                              according to the
4289                                                                                              following rules.
4290                                                                                            - s_waitcnt vmcnt(0)
4291                                                                                              must happen after
4292                                                                                              any preceding
4293                                                                                              global/generic load/load
4294                                                                                              atomic/
4295                                                                                              atomicrmw-with-return-value.
4296                                                                                            - s_waitcnt vscnt(0)
4297                                                                                              must happen after
4298                                                                                              any preceding
4299                                                                                              global/generic
4300                                                                                              store/store
4301                                                                                              atomic/
4302                                                                                              atomicrmw-no-return-value.
4303                                                                                            - s_waitcnt lgkmcnt(0)
4304                                                                                              must happen after
4305                                                                                              any preceding
4306                                                                                              local/generic load/store/load
4307                                                                                              atomic/store atomic/atomicrmw.
4308                                                            - Must happen before            - Must happen before
4309                                                              the following                   the following
4310                                                              atomicrmw.                      atomicrmw.
4311                                                            - Ensures that all              - Ensures that all
4312                                                              memory operations               memory operations
4313                                                              to local have                   have
4314                                                              completed before                completed before
4315                                                              performing the                  performing the
4316                                                              atomicrmw that is               atomicrmw that is
4317                                                              being released.                 being released.
4318
4319                                                          2. flat_atomic                  2. flat_atomic
4320      atomicrmw    release      - agent        - global   1. s_waitcnt lgkmcnt(0) &       1. s_waitcnt lkkmcnt(0) &
4321                                - system       - generic     vmcnt(0)                         vmcnt(0) & vscnt(0)
4322
4323                                                            - If OpenCL, omit               - If OpenCL, omit
4324                                                              lgkmcnt(0).                     lgkmcnt(0).
4325                                                            - Could be split into           - Could be split into
4326                                                              separate s_waitcnt              separate s_waitcnt
4327                                                              vmcnt(0) and                    vmcnt(0), s_waitcnt
4328                                                              s_waitcnt                       vscnt(0) and s_waitcnt
4329                                                              lgkmcnt(0) to allow             lgkmcnt(0) to allow
4330                                                              them to be                      them to be
4331                                                              independently moved             independently moved
4332                                                              according to the                according to the
4333                                                              following rules.                following rules.
4334                                                            - s_waitcnt vmcnt(0)            - s_waitcnt vmcnt(0)
4335                                                              must happen after               must happen after
4336                                                              any preceding                   any preceding
4337                                                              global/generic                  global/generic
4338                                                              load/store/load                 load/load atomic/
4339                                                              atomic/store                    atomicrmw-with-return-value.
4340                                                              atomic/atomicrmw.
4341                                                                                            - s_waitcnt vscnt(0)
4342                                                                                              must happen after
4343                                                                                              any preceding
4344                                                                                              global/generic
4345                                                                                              store/store atomic/
4346                                                                                              atomicrmw-no-return-value.
4347                                                            - s_waitcnt lgkmcnt(0)          - s_waitcnt lgkmcnt(0)
4348                                                              must happen after               must happen after
4349                                                              any preceding                   any preceding
4350                                                              local/generic                   local/generic
4351                                                              load/store/load                 load/store/load
4352                                                              atomic/store                    atomic/store
4353                                                              atomic/atomicrmw.               atomic/atomicrmw.
4354                                                            - Must happen before            - Must happen before
4355                                                              the following                   the following
4356                                                              atomicrmw.                      atomicrmw.
4357                                                            - Ensures that all              - Ensures that all
4358                                                              memory operations               memory operations
4359                                                              to global and local             to global and local
4360                                                              have completed                  have completed
4361                                                              before performing               before performing
4362                                                              the atomicrmw that              the atomicrmw that
4363                                                              is being released.              is being released.
4364
4365                                                          2. buffer/global/ds/flat_atomic 2. buffer/global/ds/flat_atomic
4366      fence        release      - singlethread *none*     *none*                          *none*
4367                                - wavefront
4368      fence        release      - workgroup    *none*     1. s_waitcnt lgkmcnt(0)         1. s_waitcnt lgkmcnt(0) &
4369                                                                                             vmcnt(0) & vscnt(0)
4370
4371                                                                                            - If CU wavefront execution mode, omit vmcnt and
4372                                                                                              vscnt.
4373                                                            - If OpenCL and                 - If OpenCL and
4374                                                              address space is                address space is
4375                                                              not generic, omit.              not generic, omit
4376                                                                                              lgkmcnt(0).
4377                                                                                            - If OpenCL and
4378                                                                                              address space is
4379                                                                                              local, omit
4380                                                                                              vmcnt(0) and vscnt(0).
4381                                                            - However, since LLVM           - However, since LLVM
4382                                                              currently has no                currently has no
4383                                                              address space on                address space on
4384                                                              the fence need to               the fence need to
4385                                                              conservatively                  conservatively
4386                                                              always generate. If             always generate. If
4387                                                              fence had an                    fence had an
4388                                                              address space then              address space then
4389                                                              set to address                  set to address
4390                                                              space of OpenCL                 space of OpenCL
4391                                                              fence flag, or to               fence flag, or to
4392                                                              generic if both                 generic if both
4393                                                              local and global                local and global
4394                                                              flags are                       flags are
4395                                                              specified.                      specified.
4396                                                            - Must happen after
4397                                                              any preceding
4398                                                              local/generic
4399                                                              load/load
4400                                                              atomic/store/store
4401                                                              atomic/atomicrmw.
4402                                                                                            - Could be split into
4403                                                                                              separate s_waitcnt
4404                                                                                              vmcnt(0), s_waitcnt
4405                                                                                              vscnt(0) and s_waitcnt
4406                                                                                              lgkmcnt(0) to allow
4407                                                                                              them to be
4408                                                                                              independently moved
4409                                                                                              according to the
4410                                                                                              following rules.
4411                                                                                            - s_waitcnt vmcnt(0)
4412                                                                                              must happen after
4413                                                                                              any preceding
4414                                                                                              global/generic
4415                                                                                              load/load
4416                                                                                              atomic/
4417                                                                                              atomicrmw-with-return-value.
4418                                                                                            - s_waitcnt vscnt(0)
4419                                                                                              must happen after
4420                                                                                              any preceding
4421                                                                                              global/generic
4422                                                                                              store/store atomic/
4423                                                                                              atomicrmw-no-return-value.
4424                                                                                            - s_waitcnt lgkmcnt(0)
4425                                                                                              must happen after
4426                                                                                              any preceding
4427                                                                                              local/generic
4428                                                                                              load/store/load
4429                                                                                              atomic/store atomic/
4430                                                                                              atomicrmw.
4431                                                            - Must happen before            - Must happen before
4432                                                              any following store             any following store
4433                                                              atomic/atomicrmw                atomic/atomicrmw
4434                                                              with an equal or                with an equal or
4435                                                              wider sync scope                wider sync scope
4436                                                              and memory ordering             and memory ordering
4437                                                              stronger than                   stronger than
4438                                                              unordered (this is              unordered (this is
4439                                                              termed the                      termed the
4440                                                              fence-paired-atomic).           fence-paired-atomic).
4441                                                            - Ensures that all              - Ensures that all
4442                                                              memory operations               memory operations
4443                                                              to local have                   have
4444                                                              completed before                completed before
4445                                                              performing the                  performing the
4446                                                              following                       following
4447                                                              fence-paired-atomic.            fence-paired-atomic.
4448
4449      fence        release      - agent        *none*     1. s_waitcnt lgkmcnt(0) &       1. s_waitcnt lgkmcnt(0) &
4450                                - system                     vmcnt(0)                        vmcnt(0) & vscnt(0)
4451
4452                                                            - If OpenCL and                 - If OpenCL and
4453                                                              address space is                address space is
4454                                                              not generic, omit               not generic, omit
4455                                                              lgkmcnt(0).                     lgkmcnt(0).
4456                                                            - If OpenCL and                 - If OpenCL and
4457                                                              address space is                address space is
4458                                                              local, omit                     local, omit
4459                                                              vmcnt(0).                       vmcnt(0) and vscnt(0).
4460                                                            - However, since LLVM           - However, since LLVM
4461                                                              currently has no                currently has no
4462                                                              address space on                address space on
4463                                                              the fence need to               the fence need to
4464                                                              conservatively                  conservatively
4465                                                              always generate. If             always generate. If
4466                                                              fence had an                    fence had an
4467                                                              address space then              address space then
4468                                                              set to address                  set to address
4469                                                              space of OpenCL                 space of OpenCL
4470                                                              fence flag, or to               fence flag, or to
4471                                                              generic if both                 generic if both
4472                                                              local and global                local and global
4473                                                              flags are                       flags are
4474                                                              specified.                      specified.
4475                                                            - Could be split into           - Could be split into
4476                                                              separate s_waitcnt              separate s_waitcnt
4477                                                              vmcnt(0) and                    vmcnt(0), s_waitcnt
4478                                                              s_waitcnt                       vscnt(0) and s_waitcnt
4479                                                              lgkmcnt(0) to allow             lgkmcnt(0) to allow
4480                                                              them to be                      them to be
4481                                                              independently moved             independently moved
4482                                                              according to the                according to the
4483                                                              following rules.                following rules.
4484                                                            - s_waitcnt vmcnt(0)            - s_waitcnt vmcnt(0)
4485                                                              must happen after               must happen after
4486                                                              any preceding                   any preceding
4487                                                              global/generic                  global/generic
4488                                                              load/store/load                 load/load atomic/
4489                                                              atomic/store                    atomicrmw-with-return-value.
4490                                                              atomic/atomicrmw.
4491                                                                                            - s_waitcnt vscnt(0)
4492                                                                                              must happen after
4493                                                                                              any preceding
4494                                                                                              global/generic
4495                                                                                              store/store atomic/
4496                                                                                              atomicrmw-no-return-value.
4497                                                            - s_waitcnt lgkmcnt(0)          - s_waitcnt lgkmcnt(0)
4498                                                              must happen after               must happen after
4499                                                              any preceding                   any preceding
4500                                                              local/generic                   local/generic
4501                                                              load/store/load                 load/store/load
4502                                                              atomic/store                    atomic/store
4503                                                              atomic/atomicrmw.               atomic/atomicrmw.
4504                                                            - Must happen before            - Must happen before
4505                                                              any following store             any following store
4506                                                              atomic/atomicrmw                atomic/atomicrmw
4507                                                              with an equal or                with an equal or
4508                                                              wider sync scope                wider sync scope
4509                                                              and memory ordering             and memory ordering
4510                                                              stronger than                   stronger than
4511                                                              unordered (this is              unordered (this is
4512                                                              termed the                      termed the
4513                                                              fence-paired-atomic).           fence-paired-atomic).
4514                                                            - Ensures that all              - Ensures that all
4515                                                              memory operations               memory operations
4516                                                              have                            have
4517                                                              completed before                completed before
4518                                                              performing the                  performing the
4519                                                              following                       following
4520                                                              fence-paired-atomic.            fence-paired-atomic.
4521
4522      **Acquire-Release Atomic**
4523      ----------------------------------------------------------------------------------------------------------------------
4524      atomicrmw    acq_rel      - singlethread - global   1. buffer/global/ds/flat_atomic 1. buffer/global/ds/flat_atomic
4525                                - wavefront    - local
4526                                               - generic
4527      atomicrmw    acq_rel      - workgroup    - global   1. s_waitcnt lgkmcnt(0)         1. s_waitcnt lgkmcnt(0) &
4528                                                                                             vmcnt(0) & vscnt(0)
4529
4530                                                                                            - If CU wavefront execution mode, omit vmcnt and
4531                                                                                              vscnt.
4532                                                            - If OpenCL, omit.              - If OpenCL, omit
4533                                                                                              s_waitcnt lgkmcnt(0).
4534                                                            - Must happen after             - Must happen after
4535                                                              any preceding                   any preceding
4536                                                              local/generic                   local/generic
4537                                                              load/store/load                 load/store/load
4538                                                              atomic/store                    atomic/store
4539                                                              atomic/atomicrmw.               atomic/atomicrmw.
4540                                                                                            - Could be split into
4541                                                                                              separate s_waitcnt
4542                                                                                              vmcnt(0), s_waitcnt
4543                                                                                              vscnt(0) and s_waitcnt
4544                                                                                              lgkmcnt(0) to allow
4545                                                                                              them to be
4546                                                                                              independently moved
4547                                                                                              according to the
4548                                                                                              following rules.
4549                                                                                            - s_waitcnt vmcnt(0)
4550                                                                                              must happen after
4551                                                                                              any preceding
4552                                                                                              global/generic load/load
4553                                                                                              atomic/
4554                                                                                              atomicrmw-with-return-value.
4555                                                                                            - s_waitcnt vscnt(0)
4556                                                                                              must happen after
4557                                                                                              any preceding
4558                                                                                              global/generic
4559                                                                                              store/store
4560                                                                                              atomic/
4561                                                                                              atomicrmw-no-return-value.
4562                                                                                            - s_waitcnt lgkmcnt(0)
4563                                                                                              must happen after
4564                                                                                              any preceding
4565                                                                                              local/generic load/store/load
4566                                                                                              atomic/store atomic/atomicrmw.
4567                                                            - Must happen before            - Must happen before
4568                                                              the following                   the following
4569                                                              atomicrmw.                      atomicrmw.
4570                                                            - Ensures that all              - Ensures that all
4571                                                              memory operations               memory operations
4572                                                              to local have                   have
4573                                                              completed before                completed before
4574                                                              performing the                  performing the
4575                                                              atomicrmw that is               atomicrmw that is
4576                                                              being released.                 being released.
4577
4578                                                          2. buffer/global/flat_atomic    2. buffer/global_atomic
4579                                                                                          3. s_waitcnt vm/vscnt(0)
4580
4581                                                                                            - If CU wavefront execution mode, omit vm/vscnt.
4582                                                                                            - Use vmcnt if atomic with
4583                                                                                              return and vscnt if atomic
4584                                                                                              with no-return.
4585                                                                                              waitcnt lgkmcnt(0).
4586                                                                                            - Must happen before
4587                                                                                              the following
4588                                                                                              buffer_gl0_inv.
4589                                                                                            - Ensures any
4590                                                                                              following global
4591                                                                                              data read is no
4592                                                                                              older than the
4593                                                                                              atomicrmw value
4594                                                                                              being acquired.
4595
4596                                                                                          4. buffer_gl0_inv
4597
4598                                                                                            - If CU wavefront execution mode, omit.
4599                                                                                            - Ensures that
4600                                                                                              following
4601                                                                                              loads will not see
4602                                                                                              stale data.
4603
4604      atomicrmw    acq_rel      - workgroup    - local                                    1. waitcnt vmcnt(0) & vscnt(0)
4605
4606                                                                                            - If CU wavefront execution mode, omit.
4607                                                                                            - If OpenCL, omit.
4608                                                                                            - Could be split into
4609                                                                                              separate s_waitcnt
4610                                                                                              vmcnt(0) and s_waitcnt
4611                                                                                              vscnt(0) to allow
4612                                                                                              them to be
4613                                                                                              independently moved
4614                                                                                              according to the
4615                                                                                              following rules.
4616                                                                                            - s_waitcnt vmcnt(0)
4617                                                                                              must happen after
4618                                                                                              any preceding
4619                                                                                              global/generic load/load
4620                                                                                              atomic/
4621                                                                                              atomicrmw-with-return-value.
4622                                                                                            - s_waitcnt vscnt(0)
4623                                                                                              must happen after
4624                                                                                              any preceding
4625                                                                                              global/generic
4626                                                                                              store/store atomic/
4627                                                                                              atomicrmw-no-return-value.
4628                                                                                            - Must happen before
4629                                                                                              the following
4630                                                                                              store.
4631                                                                                            - Ensures that all
4632                                                                                              global memory
4633                                                                                              operations have
4634                                                                                              completed before
4635                                                                                              performing the
4636                                                                                              store that is being
4637                                                                                              released.
4638
4639                                                          1. ds_atomic                    2. ds_atomic
4640                                                          2. s_waitcnt lgkmcnt(0)         3. s_waitcnt lgkmcnt(0)
4641
4642                                                            - If OpenCL, omit.              - If OpenCL, omit.
4643                                                            - Must happen before            - Must happen before
4644                                                              any following                   the following
4645                                                              global/generic                  buffer_gl0_inv.
4646                                                              load/load
4647                                                              atomic/store/store
4648                                                              atomic/atomicrmw.
4649                                                            - Ensures any                   - Ensures any
4650                                                              following global                following global
4651                                                              data read is no                 data read is no
4652                                                              older than the load             older than the load
4653                                                              atomic value being              atomic value being
4654                                                              acquired.                       acquired.
4655
4656                                                                                          4. buffer_gl0_inv
4657
4658                                                                                            - If CU wavefront execution mode, omit.
4659                                                                                            - If OpenCL omit.
4660                                                                                            - Ensures that
4661                                                                                              following
4662                                                                                              loads will not see
4663                                                                                              stale data.
4664
4665      atomicrmw    acq_rel      - workgroup    - generic  1. s_waitcnt lgkmcnt(0)         1. s_waitcnt lgkmcnt(0) &
4666                                                                                             vmcnt(0) & vscnt(0)
4667
4668                                                                                            - If CU wavefront execution mode, omit vmcnt and
4669                                                                                              vscnt.
4670                                                            - If OpenCL, omit.              - If OpenCL, omit
4671                                                                                              waitcnt lgkmcnt(0).
4672                                                            - Must happen after
4673                                                              any preceding
4674                                                              local/generic
4675                                                              load/store/load
4676                                                              atomic/store
4677                                                              atomic/atomicrmw.
4678                                                                                            - Could be split into
4679                                                                                              separate s_waitcnt
4680                                                                                              vmcnt(0), s_waitcnt
4681                                                                                              vscnt(0) and s_waitcnt
4682                                                                                              lgkmcnt(0) to allow
4683                                                                                              them to be
4684                                                                                              independently moved
4685                                                                                              according to the
4686                                                                                              following rules.
4687                                                                                            - s_waitcnt vmcnt(0)
4688                                                                                              must happen after
4689                                                                                              any preceding
4690                                                                                              global/generic load/load
4691                                                                                              atomic/
4692                                                                                              atomicrmw-with-return-value.
4693                                                                                            - s_waitcnt vscnt(0)
4694                                                                                              must happen after
4695                                                                                              any preceding
4696                                                                                              global/generic
4697                                                                                              store/store
4698                                                                                              atomic/
4699                                                                                              atomicrmw-no-return-value.
4700                                                                                            - s_waitcnt lgkmcnt(0)
4701                                                                                              must happen after
4702                                                                                              any preceding
4703                                                                                              local/generic load/store/load
4704                                                                                              atomic/store atomic/atomicrmw.
4705                                                            - Must happen before            - Must happen before
4706                                                              the following                   the following
4707                                                              atomicrmw.                      atomicrmw.
4708                                                            - Ensures that all              - Ensures that all
4709                                                              memory operations               memory operations
4710                                                              to local have                   have
4711                                                              completed before                completed before
4712                                                              performing the                  performing the
4713                                                              atomicrmw that is               atomicrmw that is
4714                                                              being released.                 being released.
4715
4716                                                          2. flat_atomic                  2. flat_atomic
4717                                                          3. s_waitcnt lgkmcnt(0)         3. s_waitcnt lgkmcnt(0) &
4718                                                                                             vm/vscnt(0)
4719
4720                                                                                            - If CU wavefront execution mode, omit vm/vscnt.
4721                                                            - If OpenCL, omit.              - If OpenCL, omit
4722                                                                                              waitcnt lgkmcnt(0).
4723                                                            - Must happen before            - Must happen before
4724                                                              any following                   the following
4725                                                              global/generic                  buffer_gl0_inv.
4726                                                              load/load
4727                                                              atomic/store/store
4728                                                              atomic/atomicrmw.
4729                                                            - Ensures any                   - Ensures any
4730                                                              following global                following global
4731                                                              data read is no                 data read is no
4732                                                              older than the load             older than the load
4733                                                              atomic value being              atomic value being
4734                                                              acquired.                       acquired.
4735
4736                                                                                          3. buffer_gl0_inv
4737
4738                                                                                            - If CU wavefront execution mode, omit.
4739                                                                                            - Ensures that
4740                                                                                              following
4741                                                                                              loads will not see
4742                                                                                              stale data.
4743
4744      atomicrmw    acq_rel      - agent        - global   1. s_waitcnt lgkmcnt(0) &       1. s_waitcnt lgkmcnt(0) &
4745                                - system                     vmcnt(0)                        vmcnt(0) & vscnt(0)
4746
4747                                                            - If OpenCL, omit               - If OpenCL, omit
4748                                                              lgkmcnt(0).                     lgkmcnt(0).
4749                                                            - Could be split into           - Could be split into
4750                                                              separate s_waitcnt              separate s_waitcnt
4751                                                              vmcnt(0) and                    vmcnt(0), s_waitcnt
4752                                                              s_waitcnt                       vscnt(0) and s_waitcnt
4753                                                              lgkmcnt(0) to allow             lgkmcnt(0) to allow
4754                                                              them to be                      them to be
4755                                                              independently moved             independently moved
4756                                                              according to the                according to the
4757                                                              following rules.                following rules.
4758                                                            - s_waitcnt vmcnt(0)            - s_waitcnt vmcnt(0)
4759                                                              must happen after               must happen after
4760                                                              any preceding                   any preceding
4761                                                              global/generic                  global/generic
4762                                                              load/store/load                 load/load atomic/
4763                                                              atomic/store                    atomicrmw-with-return-value.
4764                                                              atomic/atomicrmw.
4765                                                                                            - s_waitcnt vscnt(0)
4766                                                                                              must happen after
4767                                                                                              any preceding
4768                                                                                              global/generic
4769                                                                                              store/store atomic/
4770                                                                                              atomicrmw-no-return-value.
4771                                                            - s_waitcnt lgkmcnt(0)          - s_waitcnt lgkmcnt(0)
4772                                                              must happen after               must happen after
4773                                                              any preceding                   any preceding
4774                                                              local/generic                   local/generic
4775                                                              load/store/load                 load/store/load
4776                                                              atomic/store                    atomic/store
4777                                                              atomic/atomicrmw.               atomic/atomicrmw.
4778                                                            - Must happen before            - Must happen before
4779                                                              the following                   the following
4780                                                              atomicrmw.                      atomicrmw.
4781                                                            - Ensures that all              - Ensures that all
4782                                                              memory operations               memory operations
4783                                                              to global have                  to global have
4784                                                              completed before                completed before
4785                                                              performing the                  performing the
4786                                                              atomicrmw that is               atomicrmw that is
4787                                                              being released.                 being released.
4788
4789                                                          2. buffer/global/flat_atomic    2. buffer/global_atomic
4790                                                          3. s_waitcnt vmcnt(0)           3. s_waitcnt vm/vscnt(0)
4791
4792                                                                                            - Use vmcnt if atomic with
4793                                                                                              return and vscnt if atomic
4794                                                                                              with no-return.
4795                                                                                              waitcnt lgkmcnt(0).
4796                                                            - Must happen before            - Must happen before
4797                                                              following                       following
4798                                                              buffer_wbinvl1_vol.             buffer_gl*_inv.
4799                                                            - Ensures the                   - Ensures the
4800                                                              atomicrmw has                   atomicrmw has
4801                                                              completed before                completed before
4802                                                              invalidating the                invalidating the
4803                                                              cache.                          caches.
4804
4805                                                          4. buffer_wbinvl1_vol           4. buffer_gl0_inv;
4806                                                                                             buffer_gl1_inv
4807
4808                                                            - Must happen before            - Must happen before
4809                                                              any following                   any following
4810                                                              global/generic                  global/generic
4811                                                              load/load                       load/load
4812                                                              atomic/atomicrmw.               atomic/atomicrmw.
4813                                                            - Ensures that                  - Ensures that
4814                                                              following loads                 following loads
4815                                                              will not see stale              will not see stale
4816                                                              global data.                    global data.
4817
4818      atomicrmw    acq_rel      - agent        - generic  1. s_waitcnt lgkmcnt(0) &       1. s_waitcnt lgkmcnt(0) &
4819                                - system                     vmcnt(0)                        vmcnt(0) & vscnt(0)
4820
4821                                                            - If OpenCL, omit               - If OpenCL, omit
4822                                                              lgkmcnt(0).                     lgkmcnt(0).
4823                                                            - Could be split into           - Could be split into
4824                                                              separate s_waitcnt              separate s_waitcnt
4825                                                              vmcnt(0) and                    vmcnt(0), s_waitcnt
4826                                                              s_waitcnt                       vscnt(0) and s_waitcnt
4827                                                              lgkmcnt(0) to allow             lgkmcnt(0) to allow
4828                                                              them to be                      them to be
4829                                                              independently moved             independently moved
4830                                                              according to the                according to the
4831                                                              following rules.                following rules.
4832                                                            - s_waitcnt vmcnt(0)            - s_waitcnt vmcnt(0)
4833                                                              must happen after               must happen after
4834                                                              any preceding                   any preceding
4835                                                              global/generic                  global/generic
4836                                                              load/store/load                 load/load atomic
4837                                                              atomic/store                    atomicrmw-with-return-value.
4838                                                              atomic/atomicrmw.
4839                                                                                            - s_waitcnt vscnt(0)
4840                                                                                              must happen after
4841                                                                                              any preceding
4842                                                                                              global/generic
4843                                                                                              store/store atomic/
4844                                                                                              atomicrmw-no-return-value.
4845                                                            - s_waitcnt lgkmcnt(0)          - s_waitcnt lgkmcnt(0)
4846                                                              must happen after               must happen after
4847                                                              any preceding                   any preceding
4848                                                              local/generic                   local/generic
4849                                                              load/store/load                 load/store/load
4850                                                              atomic/store                    atomic/store
4851                                                              atomic/atomicrmw.               atomic/atomicrmw.
4852                                                            - Must happen before            - Must happen before
4853                                                              the following                   the following
4854                                                              atomicrmw.                      atomicrmw.
4855                                                            - Ensures that all              - Ensures that all
4856                                                              memory operations               memory operations
4857                                                              to global have                  have
4858                                                              completed before                completed before
4859                                                              performing the                  performing the
4860                                                              atomicrmw that is               atomicrmw that is
4861                                                              being released.                 being released.
4862
4863                                                          2. flat_atomic                  2. flat_atomic
4864                                                          3. s_waitcnt vmcnt(0) &         3. s_waitcnt vm/vscnt(0) &
4865                                                             lgkmcnt(0)                      lgkmcnt(0)
4866
4867                                                            - If OpenCL, omit               - If OpenCL, omit
4868                                                              lgkmcnt(0).                     lgkmcnt(0).
4869                                                                                            - Use vmcnt if atomic with
4870                                                                                              return and vscnt if atomic
4871                                                                                              with no-return.
4872                                                            - Must happen before            - Must happen before
4873                                                              following                       following
4874                                                              buffer_wbinvl1_vol.             buffer_gl*_inv.
4875                                                            - Ensures the                   - Ensures the
4876                                                              atomicrmw has                   atomicrmw has
4877                                                              completed before                completed before
4878                                                              invalidating the                invalidating the
4879                                                              cache.                          caches.
4880
4881                                                          4. buffer_wbinvl1_vol           4. buffer_gl0_inv;
4882                                                                                             buffer_gl1_inv
4883
4884                                                            - Must happen before            - Must happen before
4885                                                              any following                   any following
4886                                                              global/generic                  global/generic
4887                                                              load/load                       load/load
4888                                                              atomic/atomicrmw.               atomic/atomicrmw.
4889                                                            - Ensures that                  - Ensures that
4890                                                              following loads                 following loads
4891                                                              will not see stale              will not see stale
4892                                                              global data.                    global data.
4893
4894      fence        acq_rel      - singlethread *none*     *none*                          *none*
4895                                - wavefront
4896      fence        acq_rel      - workgroup    *none*     1. s_waitcnt lgkmcnt(0)         1. s_waitcnt lgkmcnt(0) &
4897                                                                                             vmcnt(0) & vscnt(0)
4898
4899                                                                                            - If CU wavefront execution mode, omit vmcnt and
4900                                                                                              vscnt.
4901                                                            - If OpenCL and                 - If OpenCL and
4902                                                              address space is                address space is
4903                                                              not generic, omit.              not generic, omit
4904                                                                                              lgkmcnt(0).
4905                                                                                            - If OpenCL and
4906                                                                                              address space is
4907                                                                                              local, omit
4908                                                                                              vmcnt(0) and vscnt(0).
4909                                                            - However,                      - However,
4910                                                              since LLVM                      since LLVM
4911                                                              currently has no                currently has no
4912                                                              address space on                address space on
4913                                                              the fence need to               the fence need to
4914                                                              conservatively                  conservatively
4915                                                              always generate                 always generate
4916                                                              (see comment for                (see comment for
4917                                                              previous fence).                previous fence).
4918                                                            - Must happen after
4919                                                              any preceding
4920                                                              local/generic
4921                                                              load/load
4922                                                              atomic/store/store
4923                                                              atomic/atomicrmw.
4924                                                                                            - Could be split into
4925                                                                                              separate s_waitcnt
4926                                                                                              vmcnt(0), s_waitcnt
4927                                                                                              vscnt(0) and s_waitcnt
4928                                                                                              lgkmcnt(0) to allow
4929                                                                                              them to be
4930                                                                                              independently moved
4931                                                                                              according to the
4932                                                                                              following rules.
4933                                                                                            - s_waitcnt vmcnt(0)
4934                                                                                              must happen after
4935                                                                                              any preceding
4936                                                                                              global/generic
4937                                                                                              load/load
4938                                                                                              atomic/
4939                                                                                              atomicrmw-with-return-value.
4940                                                                                            - s_waitcnt vscnt(0)
4941                                                                                              must happen after
4942                                                                                              any preceding
4943                                                                                              global/generic
4944                                                                                              store/store atomic/
4945                                                                                              atomicrmw-no-return-value.
4946                                                                                            - s_waitcnt lgkmcnt(0)
4947                                                                                              must happen after
4948                                                                                              any preceding
4949                                                                                              local/generic
4950                                                                                              load/store/load
4951                                                                                              atomic/store atomic/
4952                                                                                              atomicrmw.
4953                                                            - Must happen before            - Must happen before
4954                                                              any following                   any following
4955                                                              global/generic                  global/generic
4956                                                              load/load                       load/load
4957                                                              atomic/store/store              atomic/store/store
4958                                                              atomic/atomicrmw.               atomic/atomicrmw.
4959                                                            - Ensures that all              - Ensures that all
4960                                                              memory operations               memory operations
4961                                                              to local have                   have
4962                                                              completed before                completed before
4963                                                              performing any                  performing any
4964                                                              following global                following global
4965                                                              memory operations.              memory operations.
4966                                                            - Ensures that the              - Ensures that the
4967                                                              preceding                       preceding
4968                                                              local/generic load              local/generic load
4969                                                              atomic/atomicrmw                atomic/atomicrmw
4970                                                              with an equal or                with an equal or
4971                                                              wider sync scope                wider sync scope
4972                                                              and memory ordering             and memory ordering
4973                                                              stronger than                   stronger than
4974                                                              unordered (this is              unordered (this is
4975                                                              termed the                      termed the
4976                                                              acquire-fence-paired-atomic     acquire-fence-paired-atomic
4977                                                              ) has completed                 ) has completed
4978                                                              before following                before following
4979                                                              global memory                   global memory
4980                                                              operations. This                operations. This
4981                                                              satisfies the                   satisfies the
4982                                                              requirements of                 requirements of
4983                                                              acquire.                        acquire.
4984                                                            - Ensures that all              - Ensures that all
4985                                                              previous memory                 previous memory
4986                                                              operations have                 operations have
4987                                                              completed before a              completed before a
4988                                                              following                       following
4989                                                              local/generic store             local/generic store
4990                                                              atomic/atomicrmw                atomic/atomicrmw
4991                                                              with an equal or                with an equal or
4992                                                              wider sync scope                wider sync scope
4993                                                              and memory ordering             and memory ordering
4994                                                              stronger than                   stronger than
4995                                                              unordered (this is              unordered (this is
4996                                                              termed the                      termed the
4997                                                              release-fence-paired-atomic     release-fence-paired-atomic
4998                                                              ). This satisfies the           ). This satisfies the
4999                                                              requirements of                 requirements of
5000                                                              release.                        release.
5001                                                                                            - Must happen before
5002                                                                                              the following
5003                                                                                              buffer_gl0_inv.
5004                                                                                            - Ensures that the
5005                                                                                              acquire-fence-paired
5006                                                                                              atomic has completed
5007                                                                                              before invalidating
5008                                                                                              the
5009                                                                                              cache. Therefore
5010                                                                                              any following
5011                                                                                              locations read must
5012                                                                                              be no older than
5013                                                                                              the value read by
5014                                                                                              the
5015                                                                                              acquire-fence-paired-atomic.
5016
5017                                                                                          3. buffer_gl0_inv
5018
5019                                                                                            - If CU wavefront execution mode, omit.
5020                                                                                            - Ensures that
5021                                                                                              following
5022                                                                                              loads will not see
5023                                                                                              stale data.
5024
5025      fence        acq_rel      - agent        *none*     1. s_waitcnt lgkmcnt(0) &       1. s_waitcnt lgkmcnt(0) &
5026                                - system                     vmcnt(0)                        vmcnt(0) & vscnt(0)
5027
5028                                                            - If OpenCL and                 - If OpenCL and
5029                                                              address space is                address space is
5030                                                              not generic, omit               not generic, omit
5031                                                              lgkmcnt(0).                     lgkmcnt(0).
5032                                                                                            - If OpenCL and
5033                                                                                              address space is
5034                                                                                              local, omit
5035                                                                                              vmcnt(0) and vscnt(0).
5036                                                            - However, since LLVM           - However, since LLVM
5037                                                              currently has no                currently has no
5038                                                              address space on                address space on
5039                                                              the fence need to               the fence need to
5040                                                              conservatively                  conservatively
5041                                                              always generate                 always generate
5042                                                              (see comment for                (see comment for
5043                                                              previous fence).                previous fence).
5044                                                            - Could be split into           - Could be split into
5045                                                              separate s_waitcnt              separate s_waitcnt
5046                                                              vmcnt(0) and                    vmcnt(0), s_waitcnt
5047                                                              s_waitcnt                       vscnt(0) and s_waitcnt
5048                                                              lgkmcnt(0) to allow             lgkmcnt(0) to allow
5049                                                              them to be                      them to be
5050                                                              independently moved             independently moved
5051                                                              according to the                according to the
5052                                                              following rules.                following rules.
5053                                                            - s_waitcnt vmcnt(0)            - s_waitcnt vmcnt(0)
5054                                                              must happen after               must happen after
5055                                                              any preceding                   any preceding
5056                                                              global/generic                  global/generic
5057                                                              load/store/load                 load/load
5058                                                              atomic/store                    atomic/
5059                                                              atomic/atomicrmw.               atomicrmw-with-return-value.
5060                                                                                            - s_waitcnt vscnt(0)
5061                                                                                              must happen after
5062                                                                                              any preceding
5063                                                                                              global/generic
5064                                                                                              store/store atomic/
5065                                                                                              atomicrmw-no-return-value.
5066                                                            - s_waitcnt lgkmcnt(0)          - s_waitcnt lgkmcnt(0)
5067                                                              must happen after               must happen after
5068                                                              any preceding                   any preceding
5069                                                              local/generic                   local/generic
5070                                                              load/store/load                 load/store/load
5071                                                              atomic/store                    atomic/store
5072                                                              atomic/atomicrmw.               atomic/atomicrmw.
5073                                                            - Must happen before            - Must happen before
5074                                                              the following                   the following
5075                                                              buffer_wbinvl1_vol.             buffer_gl*_inv.
5076                                                            - Ensures that the              - Ensures that the
5077                                                              preceding                       preceding
5078                                                              global/local/generic            global/local/generic
5079                                                              load                            load
5080                                                              atomic/atomicrmw                atomic/atomicrmw
5081                                                              with an equal or                with an equal or
5082                                                              wider sync scope                wider sync scope
5083                                                              and memory ordering             and memory ordering
5084                                                              stronger than                   stronger than
5085                                                              unordered (this is              unordered (this is
5086                                                              termed the                      termed the
5087                                                              acquire-fence-paired-atomic     acquire-fence-paired-atomic
5088                                                              ) has completed                 ) has completed
5089                                                              before invalidating             before invalidating
5090                                                              the cache. This                 the caches. This
5091                                                              satisfies the                   satisfies the
5092                                                              requirements of                 requirements of
5093                                                              acquire.                        acquire.
5094                                                            - Ensures that all              - Ensures that all
5095                                                              previous memory                 previous memory
5096                                                              operations have                 operations have
5097                                                              completed before a              completed before a
5098                                                              following                       following
5099                                                              global/local/generic            global/local/generic
5100                                                              store                           store
5101                                                              atomic/atomicrmw                atomic/atomicrmw
5102                                                              with an equal or                with an equal or
5103                                                              wider sync scope                wider sync scope
5104                                                              and memory ordering             and memory ordering
5105                                                              stronger than                   stronger than
5106                                                              unordered (this is              unordered (this is
5107                                                              termed the                      termed the
5108                                                              release-fence-paired-atomic     release-fence-paired-atomic
5109                                                              ). This satisfies the           ). This satisfies the
5110                                                              requirements of                 requirements of
5111                                                              release.                        release.
5112
5113                                                          2. buffer_wbinvl1_vol           2. buffer_gl0_inv;
5114                                                                                             buffer_gl1_inv
5115
5116                                                            - Must happen before            - Must happen before
5117                                                              any following                   any following
5118                                                              global/generic                  global/generic
5119                                                              load/load                       load/load
5120                                                              atomic/store/store              atomic/store/store
5121                                                              atomic/atomicrmw.               atomic/atomicrmw.
5122                                                            - Ensures that                  - Ensures that
5123                                                              following loads                 following loads
5124                                                              will not see stale              will not see stale
5125                                                              global data. This               global data. This
5126                                                              satisfies the                   satisfies the
5127                                                              requirements of                 requirements of
5128                                                              acquire.                        acquire.
5129
5130      **Sequential Consistent Atomic**
5131      ----------------------------------------------------------------------------------------------------------------------
5132      load atomic  seq_cst      - singlethread - global   *Same as corresponding          *Same as corresponding
5133                                - wavefront    - local    load atomic acquire,            load atomic acquire,
5134                                               - generic  except must generated           except must generated
5135                                                          all instructions even           all instructions even
5136                                                          for OpenCL.*                    for OpenCL.*
5137      load atomic  seq_cst      - workgroup    - global   1. s_waitcnt lgkmcnt(0)         1. s_waitcnt lgkmcnt(0) &
5138                                               - generic                                     vmcnt(0) & vscnt(0)
5139
5140                                                                                            - If CU wavefront execution mode, omit vmcnt and
5141                                                                                              vscnt.
5142                                                                                            - Could be split into
5143                                                                                              separate s_waitcnt
5144                                                                                              vmcnt(0), s_waitcnt
5145                                                                                              vscnt(0) and s_waitcnt
5146                                                                                              lgkmcnt(0) to allow
5147                                                                                              them to be
5148                                                                                              independently moved
5149                                                                                              according to the
5150                                                                                              following rules.
5151                                                            - Must                          - waitcnt lgkmcnt(0) must
5152                                                              happen after                    happen after
5153                                                              preceding                       preceding
5154                                                              global/generic load             local load
5155                                                              atomic/store                    atomic/store
5156                                                              atomic/atomicrmw                atomic/atomicrmw
5157                                                              with memory                     with memory
5158                                                              ordering of seq_cst             ordering of seq_cst
5159                                                              and with equal or               and with equal or
5160                                                              wider sync scope.               wider sync scope.
5161                                                              (Note that seq_cst              (Note that seq_cst
5162                                                              fences have their               fences have their
5163                                                              own s_waitcnt                   own s_waitcnt
5164                                                              lgkmcnt(0) and so do            lgkmcnt(0) and so do
5165                                                              not need to be                  not need to be
5166                                                              considered.)                    considered.)
5167                                                                                            - waitcnt vmcnt(0)
5168                                                                                              Must happen after
5169                                                                                              preceding
5170                                                                                              global/generic load
5171                                                                                              atomic/
5172                                                                                              atomicrmw-with-return-value
5173                                                                                              with memory
5174                                                                                              ordering of seq_cst
5175                                                                                              and with equal or
5176                                                                                              wider sync scope.
5177                                                                                              (Note that seq_cst
5178                                                                                              fences have their
5179                                                                                              own s_waitcnt
5180                                                                                              vmcnt(0) and so do
5181                                                                                              not need to be
5182                                                                                              considered.)
5183                                                                                            - waitcnt vscnt(0)
5184                                                                                              Must happen after
5185                                                                                              preceding
5186                                                                                              global/generic store
5187                                                                                              atomic/
5188                                                                                              atomicrmw-no-return-value
5189                                                                                              with memory
5190                                                                                              ordering of seq_cst
5191                                                                                              and with equal or
5192                                                                                              wider sync scope.
5193                                                                                              (Note that seq_cst
5194                                                                                              fences have their
5195                                                                                              own s_waitcnt
5196                                                                                              vscnt(0) and so do
5197                                                                                              not need to be
5198                                                                                              considered.)
5199                                                            - Ensures any                   - Ensures any
5200                                                              preceding                       preceding
5201                                                              sequential                      sequential
5202                                                              consistent local                consistent global/local
5203                                                              memory instructions             memory instructions
5204                                                              have completed                  have completed
5205                                                              before executing                before executing
5206                                                              this sequentially               this sequentially
5207                                                              consistent                      consistent
5208                                                              instruction. This               instruction. This
5209                                                              prevents reordering             prevents reordering
5210                                                              a seq_cst store                 a seq_cst store
5211                                                              followed by a                   followed by a
5212                                                              seq_cst load. (Note             seq_cst load. (Note
5213                                                              that seq_cst is                 that seq_cst is
5214                                                              stronger than                   stronger than
5215                                                              acquire/release as              acquire/release as
5216                                                              the reordering of               the reordering of
5217                                                              load acquire                    load acquire
5218                                                              followed by a store             followed by a store
5219                                                              release is                      release is
5220                                                              prevented by the                prevented by the
5221                                                              waitcnt of                      waitcnt of
5222                                                              the release, but                the release, but
5223                                                              there is nothing                there is nothing
5224                                                              preventing a store              preventing a store
5225                                                              release followed by             release followed by
5226                                                              load acquire from               load acquire from
5227                                                              competing out of                competing out of
5228                                                              order.)                         order.)
5229
5230                                                          2. *Following                   2. *Following
5231                                                             instructions same as            instructions same as
5232                                                             corresponding load              corresponding load
5233                                                             atomic acquire,                 atomic acquire,
5234                                                             except must generated           except must generated
5235                                                             all instructions even           all instructions even
5236                                                             for OpenCL.*                    for OpenCL.*
5237      load atomic  seq_cst      - workgroup    - local    *Same as corresponding
5238                                                          load atomic acquire,
5239                                                          except must generated
5240                                                          all instructions even
5241                                                          for OpenCL.*
5242
5243                                                                                          1. s_waitcnt vmcnt(0) & vscnt(0)
5244
5245                                                                                            - If CU wavefront execution mode, omit.
5246                                                                                            - Could be split into
5247                                                                                              separate s_waitcnt
5248                                                                                              vmcnt(0) and s_waitcnt
5249                                                                                              vscnt(0) to allow
5250                                                                                              them to be
5251                                                                                              independently moved
5252                                                                                              according to the
5253                                                                                              following rules.
5254                                                                                            - waitcnt vmcnt(0)
5255                                                                                              Must happen after
5256                                                                                              preceding
5257                                                                                              global/generic load
5258                                                                                              atomic/
5259                                                                                              atomicrmw-with-return-value
5260                                                                                              with memory
5261                                                                                              ordering of seq_cst
5262                                                                                              and with equal or
5263                                                                                              wider sync scope.
5264                                                                                              (Note that seq_cst
5265                                                                                              fences have their
5266                                                                                              own s_waitcnt
5267                                                                                              vmcnt(0) and so do
5268                                                                                              not need to be
5269                                                                                              considered.)
5270                                                                                            - waitcnt vscnt(0)
5271                                                                                              Must happen after
5272                                                                                              preceding
5273                                                                                              global/generic store
5274                                                                                              atomic/
5275                                                                                              atomicrmw-no-return-value
5276                                                                                              with memory
5277                                                                                              ordering of seq_cst
5278                                                                                              and with equal or
5279                                                                                              wider sync scope.
5280                                                                                              (Note that seq_cst
5281                                                                                              fences have their
5282                                                                                              own s_waitcnt
5283                                                                                              vscnt(0) and so do
5284                                                                                              not need to be
5285                                                                                              considered.)
5286                                                                                            - Ensures any
5287                                                                                              preceding
5288                                                                                              sequential
5289                                                                                              consistent global
5290                                                                                              memory instructions
5291                                                                                              have completed
5292                                                                                              before executing
5293                                                                                              this sequentially
5294                                                                                              consistent
5295                                                                                              instruction. This
5296                                                                                              prevents reordering
5297                                                                                              a seq_cst store
5298                                                                                              followed by a
5299                                                                                              seq_cst load. (Note
5300                                                                                              that seq_cst is
5301                                                                                              stronger than
5302                                                                                              acquire/release as
5303                                                                                              the reordering of
5304                                                                                              load acquire
5305                                                                                              followed by a store
5306                                                                                              release is
5307                                                                                              prevented by the
5308                                                                                              waitcnt of
5309                                                                                              the release, but
5310                                                                                              there is nothing
5311                                                                                              preventing a store
5312                                                                                              release followed by
5313                                                                                              load acquire from
5314                                                                                              competing out of
5315                                                                                              order.)
5316
5317                                                                                          2. *Following
5318                                                                                             instructions same as
5319                                                                                             corresponding load
5320                                                                                             atomic acquire,
5321                                                                                             except must generated
5322                                                                                             all instructions even
5323                                                                                             for OpenCL.*
5324
5325      load atomic  seq_cst      - agent        - global   1. s_waitcnt lgkmcnt(0) &       1. s_waitcnt lgkmcnt(0) &
5326                                - system       - generic     vmcnt(0)                        vmcnt(0) & vscnt(0)
5327
5328                                                            - Could be split into           - Could be split into
5329                                                              separate s_waitcnt              separate s_waitcnt
5330                                                              vmcnt(0)                        vmcnt(0), s_waitcnt
5331                                                              and s_waitcnt                   vscnt(0) and s_waitcnt
5332                                                              lgkmcnt(0) to allow             lgkmcnt(0) to allow
5333                                                              them to be                      them to be
5334                                                              independently moved             independently moved
5335                                                              according to the                according to the
5336                                                              following rules.                following rules.
5337                                                            - waitcnt lgkmcnt(0)            - waitcnt lgkmcnt(0)
5338                                                              must happen after               must happen after
5339                                                              preceding                       preceding
5340                                                              global/generic load             local load
5341                                                              atomic/store                    atomic/store
5342                                                              atomic/atomicrmw                atomic/atomicrmw
5343                                                              with memory                     with memory
5344                                                              ordering of seq_cst             ordering of seq_cst
5345                                                              and with equal or               and with equal or
5346                                                              wider sync scope.               wider sync scope.
5347                                                              (Note that seq_cst              (Note that seq_cst
5348                                                              fences have their               fences have their
5349                                                              own s_waitcnt                   own s_waitcnt
5350                                                              lgkmcnt(0) and so do            lgkmcnt(0) and so do
5351                                                              not need to be                  not need to be
5352                                                              considered.)                    considered.)
5353                                                            - waitcnt vmcnt(0)              - waitcnt vmcnt(0)
5354                                                              must happen after               must happen after
5355                                                              preceding                       preceding
5356                                                              global/generic load             global/generic load
5357                                                              atomic/store                    atomic/
5358                                                              atomic/atomicrmw                atomicrmw-with-return-value
5359                                                              with memory                     with memory
5360                                                              ordering of seq_cst             ordering of seq_cst
5361                                                              and with equal or               and with equal or
5362                                                              wider sync scope.               wider sync scope.
5363                                                              (Note that seq_cst              (Note that seq_cst
5364                                                              fences have their               fences have their
5365                                                              own s_waitcnt                   own s_waitcnt
5366                                                              vmcnt(0) and so do              vmcnt(0) and so do
5367                                                              not need to be                  not need to be
5368                                                              considered.)                    considered.)
5369                                                                                            - waitcnt vscnt(0)
5370                                                                                              Must happen after
5371                                                                                              preceding
5372                                                                                              global/generic store
5373                                                                                              atomic/
5374                                                                                              atomicrmw-no-return-value
5375                                                                                              with memory
5376                                                                                              ordering of seq_cst
5377                                                                                              and with equal or
5378                                                                                              wider sync scope.
5379                                                                                              (Note that seq_cst
5380                                                                                              fences have their
5381                                                                                              own s_waitcnt
5382                                                                                              vscnt(0) and so do
5383                                                                                              not need to be
5384                                                                                              considered.)
5385                                                            - Ensures any                   - Ensures any
5386                                                              preceding                       preceding
5387                                                              sequential                      sequential
5388                                                              consistent global               consistent global
5389                                                              memory instructions             memory instructions
5390                                                              have completed                  have completed
5391                                                              before executing                before executing
5392                                                              this sequentially               this sequentially
5393                                                              consistent                      consistent
5394                                                              instruction. This               instruction. This
5395                                                              prevents reordering             prevents reordering
5396                                                              a seq_cst store                 a seq_cst store
5397                                                              followed by a                   followed by a
5398                                                              seq_cst load. (Note             seq_cst load. (Note
5399                                                              that seq_cst is                 that seq_cst is
5400                                                              stronger than                   stronger than
5401                                                              acquire/release as              acquire/release as
5402                                                              the reordering of               the reordering of
5403                                                              load acquire                    load acquire
5404                                                              followed by a store             followed by a store
5405                                                              release is                      release is
5406                                                              prevented by the                prevented by the
5407                                                              waitcnt of                      waitcnt of
5408                                                              the release, but                the release, but
5409                                                              there is nothing                there is nothing
5410                                                              preventing a store              preventing a store
5411                                                              release followed by             release followed by
5412                                                              load acquire from               load acquire from
5413                                                              competing out of                competing out of
5414                                                              order.)                         order.)
5415
5416                                                          2. *Following                   2. *Following
5417                                                             instructions same as            instructions same as
5418                                                             corresponding load              corresponding load
5419                                                             atomic acquire,                 atomic acquire,
5420                                                             except must generated           except must generated
5421                                                             all instructions even           all instructions even
5422                                                             for OpenCL.*                    for OpenCL.*
5423      store atomic seq_cst      - singlethread - global   *Same as corresponding          *Same as corresponding
5424                                - wavefront    - local    store atomic release,           store atomic release,
5425                                - workgroup    - generic  except must generated           except must generated
5426                                                          all instructions even           all instructions even
5427                                                          for OpenCL.*                    for OpenCL.*
5428      store atomic seq_cst      - agent        - global   *Same as corresponding          *Same as corresponding
5429                                - system       - generic  store atomic release,           store atomic release,
5430                                                          except must generated           except must generated
5431                                                          all instructions even           all instructions even
5432                                                          for OpenCL.*                    for OpenCL.*
5433      atomicrmw    seq_cst      - singlethread - global   *Same as corresponding          *Same as corresponding
5434                                - wavefront    - local    atomicrmw acq_rel,              atomicrmw acq_rel,
5435                                - workgroup    - generic  except must generated           except must generated
5436                                                          all instructions even           all instructions even
5437                                                          for OpenCL.*                    for OpenCL.*
5438      atomicrmw    seq_cst      - agent        - global   *Same as corresponding          *Same as corresponding
5439                                - system       - generic  atomicrmw acq_rel,              atomicrmw acq_rel,
5440                                                          except must generated           except must generated
5441                                                          all instructions even           all instructions even
5442                                                          for OpenCL.*                    for OpenCL.*
5443      fence        seq_cst      - singlethread *none*     *Same as corresponding          *Same as corresponding
5444                                - wavefront               fence acq_rel,                  fence acq_rel,
5445                                - workgroup               except must generated           except must generated
5446                                - agent                   all instructions even           all instructions even
5447                                - system                  for OpenCL.*                    for OpenCL.*
5448      ============ ============ ============== ========== =============================== ==================================
5449
5450 The memory order also adds the single thread optimization constrains defined in
5451 table
5452 :ref:`amdgpu-amdhsa-memory-model-single-thread-optimization-constraints-gfx6-gfx10-table`.
5453
5454   .. table:: AMDHSA Memory Model Single Thread Optimization Constraints GFX6-GFX10
5455      :name: amdgpu-amdhsa-memory-model-single-thread-optimization-constraints-gfx6-gfx10-table
5456
5457      ============ ==============================================================
5458      LLVM Memory  Optimization Constraints
5459      Ordering
5460      ============ ==============================================================
5461      unordered    *none*
5462      monotonic    *none*
5463      acquire      - If a load atomic/atomicrmw then no following load/load
5464                     atomic/store/ store atomic/atomicrmw/fence instruction can
5465                     be moved before the acquire.
5466                   - If a fence then same as load atomic, plus no preceding
5467                     associated fence-paired-atomic can be moved after the fence.
5468      release      - If a store atomic/atomicrmw then no preceding load/load
5469                     atomic/store/ store atomic/atomicrmw/fence instruction can
5470                     be moved after the release.
5471                   - If a fence then same as store atomic, plus no following
5472                     associated fence-paired-atomic can be moved before the
5473                     fence.
5474      acq_rel      Same constraints as both acquire and release.
5475      seq_cst      - If a load atomic then same constraints as acquire, plus no
5476                     preceding sequentially consistent load atomic/store
5477                     atomic/atomicrmw/fence instruction can be moved after the
5478                     seq_cst.
5479                   - If a store atomic then the same constraints as release, plus
5480                     no following sequentially consistent load atomic/store
5481                     atomic/atomicrmw/fence instruction can be moved before the
5482                     seq_cst.
5483                   - If an atomicrmw/fence then same constraints as acq_rel.
5484      ============ ==============================================================
5485
5486 Trap Handler ABI
5487 ~~~~~~~~~~~~~~~~
5488
5489 For code objects generated by AMDGPU backend for HSA [HSA]_ compatible runtimes
5490 (such as ROCm [AMD-ROCm]_), the runtime installs a trap handler that supports
5491 the ``s_trap`` instruction with the following usage:
5492
5493   .. table:: AMDGPU Trap Handler for AMDHSA OS
5494      :name: amdgpu-trap-handler-for-amdhsa-os-table
5495
5496      =================== =============== =============== =======================
5497      Usage               Code Sequence   Trap Handler    Description
5498                                          Inputs
5499      =================== =============== =============== =======================
5500      reserved            ``s_trap 0x00``                 Reserved by hardware.
5501      ``debugtrap(arg)``  ``s_trap 0x01`` ``SGPR0-1``:    Reserved for HSA
5502                                            ``queue_ptr`` ``debugtrap``
5503                                          ``VGPR0``:      intrinsic (not
5504                                            ``arg``       implemented).
5505      ``llvm.trap``       ``s_trap 0x02`` ``SGPR0-1``:    Causes dispatch to be
5506                                            ``queue_ptr`` terminated and its
5507                                                          associated queue put
5508                                                          into the error state.
5509      ``llvm.debugtrap``  ``s_trap 0x03``                 - If debugger not
5510                                                            installed then
5511                                                            behaves as a
5512                                                            no-operation. The
5513                                                            trap handler is
5514                                                            entered and
5515                                                            immediately returns
5516                                                            to continue
5517                                                            execution of the
5518                                                            wavefront.
5519                                                          - If the debugger is
5520                                                            installed, causes
5521                                                            the debug trap to be
5522                                                            reported by the
5523                                                            debugger and the
5524                                                            wavefront is put in
5525                                                            the halt state until
5526                                                            resumed by the
5527                                                            debugger.
5528      reserved            ``s_trap 0x04``                 Reserved.
5529      reserved            ``s_trap 0x05``                 Reserved.
5530      reserved            ``s_trap 0x06``                 Reserved.
5531      debugger breakpoint ``s_trap 0x07``                 Reserved for debugger
5532                                                          breakpoints.
5533      reserved            ``s_trap 0x08``                 Reserved.
5534      reserved            ``s_trap 0xfe``                 Reserved.
5535      reserved            ``s_trap 0xff``                 Reserved.
5536      =================== =============== =============== =======================
5537
5538 AMDPAL
5539 ------
5540
5541 This section provides code conventions used when the target triple OS is
5542 ``amdpal`` (see :ref:`amdgpu-target-triples`) for passing runtime parameters
5543 from the application/runtime to each invocation of a hardware shader. These
5544 parameters include both generic, application-controlled parameters called
5545 *user data* as well as system-generated parameters that are a product of the
5546 draw or dispatch execution.
5547
5548 User Data
5549 ~~~~~~~~~
5550
5551 Each hardware stage has a set of 32-bit *user data registers* which can be
5552 written from a command buffer and then loaded into SGPRs when waves are launched
5553 via a subsequent dispatch or draw operation. This is the way most arguments are
5554 passed from the application/runtime to a hardware shader.
5555
5556 Compute User Data
5557 ~~~~~~~~~~~~~~~~~
5558
5559 Compute shader user data mappings are simpler than graphics shaders, and have a
5560 fixed mapping.
5561
5562 Note that there are always 10 available *user data entries* in registers -
5563 entries beyond that limit must be fetched from memory (via the spill table
5564 pointer) by the shader.
5565
5566   .. table:: PAL Compute Shader User Data Registers
5567      :name: pal-compute-user-data-registers
5568
5569      ============= ================================
5570      User Register Description
5571      ============= ================================
5572      0             Global Internal Table (32-bit pointer)
5573      1             Per-Shader Internal Table (32-bit pointer)
5574      2 - 11        Application-Controlled User Data (10 32-bit values)
5575      12            Spill Table (32-bit pointer)
5576      13 - 14       Thread Group Count (64-bit pointer)
5577      15            GDS Range
5578      ============= ================================
5579
5580 Graphics User Data
5581 ~~~~~~~~~~~~~~~~~~
5582
5583 Graphics pipelines support a much more flexible user data mapping:
5584
5585   .. table:: PAL Graphics Shader User Data Registers
5586      :name: pal-graphics-user-data-registers
5587
5588      ============= ================================
5589      User Register Description
5590      ============= ================================
5591      0             Global Internal Table (32-bit pointer)
5592      +             Per-Shader Internal Table (32-bit pointer)
5593      + 1-15        Application Controlled User Data
5594                    (1-15 Contiguous 32-bit Values in Registers)
5595      +             Spill Table (32-bit pointer)
5596      +             Draw Index (First Stage Only)
5597      +             Vertex Offset (First Stage Only)
5598      +             Instance Offset (First Stage Only)
5599      ============= ================================
5600
5601   The placement of the global internal table remains fixed in the first *user
5602   data SGPR register*. Otherwise all parameters are optional, and can be mapped
5603   to any desired *user data SGPR register*, with the following regstrictions:
5604
5605   * Draw Index, Vertex Offset, and Instance Offset can only be used by the first
5606     activehardware stage in a graphics pipeline (i.e. where the API vertex
5607     shader runs).
5608
5609   * Application-controlled user data must be mapped into a contiguous range of
5610     user data registers.
5611
5612   * The application-controlled user data range supports compaction remapping, so
5613     only *entries* that are actually consumed by the shader must be assigned to
5614     corresponding *registers*. Note that in order to support an efficient runtime
5615     implementation, the remapping must pack *registers* in the same order as
5616     *entries*, with unused *entries* removed.
5617
5618 .. _pal_global_internal_table:
5619
5620 Global Internal Table
5621 ~~~~~~~~~~~~~~~~~~~~~
5622
5623 The global internal table is a table of *shader resource descriptors* (SRDs) that
5624 define how certain engine-wide, runtime-managed resources should be accessed
5625 from a shader. The majority of these resources have HW-defined formats, and it
5626 is up to the compiler to write/read data as required by the target hardware.
5627
5628 The following table illustrates the required format:
5629
5630   .. table:: PAL Global Internal Table
5631      :name: pal-git-table
5632
5633      ============= ================================
5634      Offset        Description
5635      ============= ================================
5636      0-3           Graphics Scratch SRD
5637      4-7           Compute Scratch SRD
5638      8-11          ES/GS Ring Output SRD
5639      12-15         ES/GS Ring Input SRD
5640      16-19         GS/VS Ring Output #0
5641      20-23         GS/VS Ring Output #1
5642      24-27         GS/VS Ring Output #2
5643      28-31         GS/VS Ring Output #3
5644      32-35         GS/VS Ring Input SRD
5645      36-39         Tessellation Factor Buffer SRD
5646      40-43         Off-Chip LDS Buffer SRD
5647      44-47         Off-Chip Param Cache Buffer SRD
5648      48-51         Sample Position Buffer SRD
5649      52            vaRange::ShadowDescriptorTable High Bits
5650      ============= ================================
5651
5652   The pointer to the global internal table passed to the shader as user data
5653   is a 32-bit pointer. The top 32 bits should be assumed to be the same as
5654   the top 32 bits of the pipeline, so the shader may use the program
5655   counter's top 32 bits.
5656
5657 Unspecified OS
5658 --------------
5659
5660 This section provides code conventions used when the target triple OS is
5661 empty (see :ref:`amdgpu-target-triples`).
5662
5663 Trap Handler ABI
5664 ~~~~~~~~~~~~~~~~
5665
5666 For code objects generated by AMDGPU backend for non-amdhsa OS, the runtime does
5667 not install a trap handler. The ``llvm.trap`` and ``llvm.debugtrap``
5668 instructions are handled as follows:
5669
5670   .. table:: AMDGPU Trap Handler for Non-AMDHSA OS
5671      :name: amdgpu-trap-handler-for-non-amdhsa-os-table
5672
5673      =============== =============== ===========================================
5674      Usage           Code Sequence   Description
5675      =============== =============== ===========================================
5676      llvm.trap       s_endpgm        Causes wavefront to be terminated.
5677      llvm.debugtrap  *none*          Compiler warning given that there is no
5678                                      trap handler installed.
5679      =============== =============== ===========================================
5680
5681 Source Languages
5682 ================
5683
5684 .. _amdgpu-opencl:
5685
5686 OpenCL
5687 ------
5688
5689 When the language is OpenCL the following differences occur:
5690
5691 1. The OpenCL memory model is used (see :ref:`amdgpu-amdhsa-memory-model`).
5692 2. The AMDGPU backend appends additional arguments to the kernel's explicit
5693    arguments for the AMDHSA OS (see
5694    :ref:`opencl-kernel-implicit-arguments-appended-for-amdhsa-os-table`).
5695 3. Additional metadata is generated
5696    (see :ref:`amdgpu-amdhsa-code-object-metadata`).
5697
5698   .. table:: OpenCL kernel implicit arguments appended for AMDHSA OS
5699      :name: opencl-kernel-implicit-arguments-appended-for-amdhsa-os-table
5700
5701      ======== ==== ========= ===========================================
5702      Position Byte Byte      Description
5703               Size Alignment
5704      ======== ==== ========= ===========================================
5705      1        8    8         OpenCL Global Offset X
5706      2        8    8         OpenCL Global Offset Y
5707      3        8    8         OpenCL Global Offset Z
5708      4        8    8         OpenCL address of printf buffer
5709      5        8    8         OpenCL address of virtual queue used by
5710                              enqueue_kernel.
5711      6        8    8         OpenCL address of AqlWrap struct used by
5712                              enqueue_kernel.
5713      7        8    8         Pointer argument used for Multi-gird
5714                              synchronization.
5715      ======== ==== ========= ===========================================
5716
5717 .. _amdgpu-hcc:
5718
5719 HCC
5720 ---
5721
5722 When the language is HCC the following differences occur:
5723
5724 1. The HSA memory model is used (see :ref:`amdgpu-amdhsa-memory-model`).
5725
5726 .. _amdgpu-assembler:
5727
5728 Assembler
5729 ---------
5730
5731 AMDGPU backend has LLVM-MC based assembler which is currently in development.
5732 It supports AMDGCN GFX6-GFX10.
5733
5734 This section describes general syntax for instructions and operands.
5735
5736 Instructions
5737 ~~~~~~~~~~~~
5738
5739 .. toctree::
5740    :hidden:
5741
5742    AMDGPU/AMDGPUAsmGFX7
5743    AMDGPU/AMDGPUAsmGFX8
5744    AMDGPU/AMDGPUAsmGFX9
5745    AMDGPU/AMDGPUAsmGFX10
5746    AMDGPUModifierSyntax
5747    AMDGPUOperandSyntax
5748    AMDGPUInstructionSyntax
5749    AMDGPUInstructionNotation
5750
5751 An instruction has the following :doc:`syntax<AMDGPUInstructionSyntax>`:
5752
5753     ``<``\ *opcode*\ ``>    <``\ *operand0*\ ``>, <``\ *operand1*\ ``>,...    <``\ *modifier0*\ ``> <``\ *modifier1*\ ``>...``
5754
5755 :doc:`Operands<AMDGPUOperandSyntax>` are normally comma-separated while
5756 :doc:`modifiers<AMDGPUModifierSyntax>` are space-separated.
5757
5758 The order of *operands* and *modifiers* is fixed.
5759 Most *modifiers* are optional and may be omitted.
5760
5761 See detailed instruction syntax description for :doc:`GFX7<AMDGPU/AMDGPUAsmGFX7>`,
5762 :doc:`GFX8<AMDGPU/AMDGPUAsmGFX8>`, :doc:`GFX9<AMDGPU/AMDGPUAsmGFX9>`
5763 and :doc:`GFX10<AMDGPU/AMDGPUAsmGFX10>`.
5764
5765 Note that features under development are not included in this description.
5766
5767 For more information about instructions, their semantics and supported combinations of
5768 operands, refer to one of instruction set architecture manuals
5769 [AMD-GCN-GFX6]_, [AMD-GCN-GFX7]_, [AMD-GCN-GFX8]_, [AMD-GCN-GFX9]_ and
5770 [AMD-GCN-GFX10]_.
5771
5772 Operands
5773 ~~~~~~~~
5774
5775 Detailed description of operands may be found :doc:`here<AMDGPUOperandSyntax>`.
5776
5777 Modifiers
5778 ~~~~~~~~~
5779
5780 Detailed description of modifiers may be found :doc:`here<AMDGPUModifierSyntax>`.
5781
5782 Instruction Examples
5783 ~~~~~~~~~~~~~~~~~~~~
5784
5785 DS
5786 ++
5787
5788 .. code-block:: nasm
5789
5790   ds_add_u32 v2, v4 offset:16
5791   ds_write_src2_b64 v2 offset0:4 offset1:8
5792   ds_cmpst_f32 v2, v4, v6
5793   ds_min_rtn_f64 v[8:9], v2, v[4:5]
5794
5795
5796 For full list of supported instructions, refer to "LDS/GDS instructions" in ISA Manual.
5797
5798 FLAT
5799 ++++
5800
5801 .. code-block:: nasm
5802
5803   flat_load_dword v1, v[3:4]
5804   flat_store_dwordx3 v[3:4], v[5:7]
5805   flat_atomic_swap v1, v[3:4], v5 glc
5806   flat_atomic_cmpswap v1, v[3:4], v[5:6] glc slc
5807   flat_atomic_fmax_x2 v[1:2], v[3:4], v[5:6] glc
5808
5809 For full list of supported instructions, refer to "FLAT instructions" in ISA Manual.
5810
5811 MUBUF
5812 +++++
5813
5814 .. code-block:: nasm
5815
5816   buffer_load_dword v1, off, s[4:7], s1
5817   buffer_store_dwordx4 v[1:4], v2, ttmp[4:7], s1 offen offset:4 glc tfe
5818   buffer_store_format_xy v[1:2], off, s[4:7], s1
5819   buffer_wbinvl1
5820   buffer_atomic_inc v1, v2, s[8:11], s4 idxen offset:4 slc
5821
5822 For full list of supported instructions, refer to "MUBUF Instructions" in ISA Manual.
5823
5824 SMRD/SMEM
5825 +++++++++
5826
5827 .. code-block:: nasm
5828
5829   s_load_dword s1, s[2:3], 0xfc
5830   s_load_dwordx8 s[8:15], s[2:3], s4
5831   s_load_dwordx16 s[88:103], s[2:3], s4
5832   s_dcache_inv_vol
5833   s_memtime s[4:5]
5834
5835 For full list of supported instructions, refer to "Scalar Memory Operations" in ISA Manual.
5836
5837 SOP1
5838 ++++
5839
5840 .. code-block:: nasm
5841
5842   s_mov_b32 s1, s2
5843   s_mov_b64 s[0:1], 0x80000000
5844   s_cmov_b32 s1, 200
5845   s_wqm_b64 s[2:3], s[4:5]
5846   s_bcnt0_i32_b64 s1, s[2:3]
5847   s_swappc_b64 s[2:3], s[4:5]
5848   s_cbranch_join s[4:5]
5849
5850 For full list of supported instructions, refer to "SOP1 Instructions" in ISA Manual.
5851
5852 SOP2
5853 ++++
5854
5855 .. code-block:: nasm
5856
5857   s_add_u32 s1, s2, s3
5858   s_and_b64 s[2:3], s[4:5], s[6:7]
5859   s_cselect_b32 s1, s2, s3
5860   s_andn2_b32 s2, s4, s6
5861   s_lshr_b64 s[2:3], s[4:5], s6
5862   s_ashr_i32 s2, s4, s6
5863   s_bfm_b64 s[2:3], s4, s6
5864   s_bfe_i64 s[2:3], s[4:5], s6
5865   s_cbranch_g_fork s[4:5], s[6:7]
5866
5867 For full list of supported instructions, refer to "SOP2 Instructions" in ISA Manual.
5868
5869 SOPC
5870 ++++
5871
5872 .. code-block:: nasm
5873
5874   s_cmp_eq_i32 s1, s2
5875   s_bitcmp1_b32 s1, s2
5876   s_bitcmp0_b64 s[2:3], s4
5877   s_setvskip s3, s5
5878
5879 For full list of supported instructions, refer to "SOPC Instructions" in ISA Manual.
5880
5881 SOPP
5882 ++++
5883
5884 .. code-block:: nasm
5885
5886   s_barrier
5887   s_nop 2
5888   s_endpgm
5889   s_waitcnt 0 ; Wait for all counters to be 0
5890   s_waitcnt vmcnt(0) & expcnt(0) & lgkmcnt(0) ; Equivalent to above
5891   s_waitcnt vmcnt(1) ; Wait for vmcnt counter to be 1.
5892   s_sethalt 9
5893   s_sleep 10
5894   s_sendmsg 0x1
5895   s_sendmsg sendmsg(MSG_INTERRUPT)
5896   s_trap 1
5897
5898 For full list of supported instructions, refer to "SOPP Instructions" in ISA Manual.
5899
5900 Unless otherwise mentioned, little verification is performed on the operands
5901 of SOPP Instructions, so it is up to the programmer to be familiar with the
5902 range or acceptable values.
5903
5904 VALU
5905 ++++
5906
5907 For vector ALU instruction opcodes (VOP1, VOP2, VOP3, VOPC, VOP_DPP, VOP_SDWA),
5908 the assembler will automatically use optimal encoding based on its operands.
5909 To force specific encoding, one can add a suffix to the opcode of the instruction:
5910
5911 * _e32 for 32-bit VOP1/VOP2/VOPC
5912 * _e64 for 64-bit VOP3
5913 * _dpp for VOP_DPP
5914 * _sdwa for VOP_SDWA
5915
5916 VOP1/VOP2/VOP3/VOPC examples:
5917
5918 .. code-block:: nasm
5919
5920   v_mov_b32 v1, v2
5921   v_mov_b32_e32 v1, v2
5922   v_nop
5923   v_cvt_f64_i32_e32 v[1:2], v2
5924   v_floor_f32_e32 v1, v2
5925   v_bfrev_b32_e32 v1, v2
5926   v_add_f32_e32 v1, v2, v3
5927   v_mul_i32_i24_e64 v1, v2, 3
5928   v_mul_i32_i24_e32 v1, -3, v3
5929   v_mul_i32_i24_e32 v1, -100, v3
5930   v_addc_u32 v1, s[0:1], v2, v3, s[2:3]
5931   v_max_f16_e32 v1, v2, v3
5932
5933 VOP_DPP examples:
5934
5935 .. code-block:: nasm
5936
5937   v_mov_b32 v0, v0 quad_perm:[0,2,1,1]
5938   v_sin_f32 v0, v0 row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
5939   v_mov_b32 v0, v0 wave_shl:1
5940   v_mov_b32 v0, v0 row_mirror
5941   v_mov_b32 v0, v0 row_bcast:31
5942   v_mov_b32 v0, v0 quad_perm:[1,3,0,1] row_mask:0xa bank_mask:0x1 bound_ctrl:0
5943   v_add_f32 v0, v0, |v0| row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
5944   v_max_f16 v1, v2, v3 row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
5945
5946 VOP_SDWA examples:
5947
5948 .. code-block:: nasm
5949
5950   v_mov_b32 v1, v2 dst_sel:BYTE_0 dst_unused:UNUSED_PRESERVE src0_sel:DWORD
5951   v_min_u32 v200, v200, v1 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_1 src1_sel:DWORD
5952   v_sin_f32 v0, v0 dst_unused:UNUSED_PAD src0_sel:WORD_1
5953   v_fract_f32 v0, |v0| dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_1
5954   v_cmpx_le_u32 vcc, v1, v2 src0_sel:BYTE_2 src1_sel:WORD_0
5955
5956 For full list of supported instructions, refer to "Vector ALU instructions".
5957
5958 .. TODO
5959    Remove once we switch to code object v3 by default.
5960
5961 .. _amdgpu-amdhsa-assembler-predefined-symbols-v2:
5962
5963 Code Object V2 Predefined Symbols (-mattr=-code-object-v3)
5964 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
5965
5966 .. warning:: Code Object V2 is not the default code object version emitted by
5967   this version of LLVM. For a description of the predefined symbols available
5968   with the default configuration (Code Object V3) see
5969   :ref:`amdgpu-amdhsa-assembler-predefined-symbols-v3`.
5970
5971 The AMDGPU assembler defines and updates some symbols automatically. These
5972 symbols do not affect code generation.
5973
5974 .option.machine_version_major
5975 +++++++++++++++++++++++++++++
5976
5977 Set to the GFX major generation number of the target being assembled for. For
5978 example, when assembling for a "GFX9" target this will be set to the integer
5979 value "9". The possible GFX major generation numbers are presented in
5980 :ref:`amdgpu-processors`.
5981
5982 .option.machine_version_minor
5983 +++++++++++++++++++++++++++++
5984
5985 Set to the GFX minor generation number of the target being assembled for. For
5986 example, when assembling for a "GFX810" target this will be set to the integer
5987 value "1". The possible GFX minor generation numbers are presented in
5988 :ref:`amdgpu-processors`.
5989
5990 .option.machine_version_stepping
5991 ++++++++++++++++++++++++++++++++
5992
5993 Set to the GFX stepping generation number of the target being assembled for.
5994 For example, when assembling for a "GFX704" target this will be set to the
5995 integer value "4". The possible GFX stepping generation numbers are presented
5996 in :ref:`amdgpu-processors`.
5997
5998 .kernel.vgpr_count
5999 ++++++++++++++++++
6000
6001 Set to zero each time a
6002 :ref:`amdgpu-amdhsa-assembler-directive-amdgpu_hsa_kernel` directive is
6003 encountered. At each instruction, if the current value of this symbol is less
6004 than or equal to the maximum VPGR number explicitly referenced within that
6005 instruction then the symbol value is updated to equal that VGPR number plus
6006 one.
6007
6008 .kernel.sgpr_count
6009 ++++++++++++++++++
6010
6011 Set to zero each time a
6012 :ref:`amdgpu-amdhsa-assembler-directive-amdgpu_hsa_kernel` directive is
6013 encountered. At each instruction, if the current value of this symbol is less
6014 than or equal to the maximum VPGR number explicitly referenced within that
6015 instruction then the symbol value is updated to equal that SGPR number plus
6016 one.
6017
6018 .. _amdgpu-amdhsa-assembler-directives-v2:
6019
6020 Code Object V2 Directives (-mattr=-code-object-v3)
6021 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
6022
6023 .. warning:: Code Object V2 is not the default code object version emitted by
6024   this version of LLVM. For a description of the directives supported with
6025   the default configuration (Code Object V3) see
6026   :ref:`amdgpu-amdhsa-assembler-directives-v3`.
6027
6028 AMDGPU ABI defines auxiliary data in output code object. In assembly source,
6029 one can specify them with assembler directives.
6030
6031 .hsa_code_object_version major, minor
6032 +++++++++++++++++++++++++++++++++++++
6033
6034 *major* and *minor* are integers that specify the version of the HSA code
6035 object that will be generated by the assembler.
6036
6037 .hsa_code_object_isa [major, minor, stepping, vendor, arch]
6038 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
6039
6040
6041 *major*, *minor*, and *stepping* are all integers that describe the instruction
6042 set architecture (ISA) version of the assembly program.
6043
6044 *vendor* and *arch* are quoted strings.  *vendor* should always be equal to
6045 "AMD" and *arch* should always be equal to "AMDGPU".
6046
6047 By default, the assembler will derive the ISA version, *vendor*, and *arch*
6048 from the value of the -mcpu option that is passed to the assembler.
6049
6050 .. _amdgpu-amdhsa-assembler-directive-amdgpu_hsa_kernel:
6051
6052 .amdgpu_hsa_kernel (name)
6053 +++++++++++++++++++++++++
6054
6055 This directives specifies that the symbol with given name is a kernel entry point
6056 (label) and the object should contain corresponding symbol of type STT_AMDGPU_HSA_KERNEL.
6057
6058 .amd_kernel_code_t
6059 ++++++++++++++++++
6060
6061 This directive marks the beginning of a list of key / value pairs that are used
6062 to specify the amd_kernel_code_t object that will be emitted by the assembler.
6063 The list must be terminated by the *.end_amd_kernel_code_t* directive.  For
6064 any amd_kernel_code_t values that are unspecified a default value will be
6065 used.  The default value for all keys is 0, with the following exceptions:
6066
6067 - *amd_code_version_major* defaults to 1.
6068 - *amd_kernel_code_version_minor* defaults to 2.
6069 - *amd_machine_kind* defaults to 1.
6070 - *amd_machine_version_major*, *machine_version_minor*, and
6071   *amd_machine_version_stepping* are derived from the value of the -mcpu option
6072   that is passed to the assembler.
6073 - *kernel_code_entry_byte_offset* defaults to 256.
6074 - *wavefront_size* defaults 6 for all targets before GFX10. For GFX10 onwards
6075   defaults to 6 if target feature ``wavefrontsize64`` is enabled, otherwise 5.
6076   Note that wavefront size is specified as a power of two, so a value of **n**
6077   means a size of 2^ **n**.
6078 - *call_convention* defaults to -1.
6079 - *kernarg_segment_alignment*, *group_segment_alignment*, and
6080   *private_segment_alignment* default to 4. Note that alignments are specified
6081   as a power of 2, so a value of **n** means an alignment of 2^ **n**.
6082 - *enable_wgp_mode* defaults to 1 if target feature ``cumode`` is disabled for
6083   GFX10 onwards.
6084 - *enable_mem_ordered* defaults to 1 for GFX10 onwards.
6085
6086 The *.amd_kernel_code_t* directive must be placed immediately after the
6087 function label and before any instructions.
6088
6089 For a full list of amd_kernel_code_t keys, refer to AMDGPU ABI document,
6090 comments in lib/Target/AMDGPU/AmdKernelCodeT.h and test/CodeGen/AMDGPU/hsa.s.
6091
6092 .. _amdgpu-amdhsa-assembler-example-v2:
6093
6094 Code Object V2 Example Source Code (-mattr=-code-object-v3)
6095 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
6096
6097 .. warning:: Code Object V2 is not the default code object version emitted by
6098   this version of LLVM. For a description of the directives supported with
6099   the default configuration (Code Object V3) see
6100   :ref:`amdgpu-amdhsa-assembler-example-v3`.
6101
6102 Here is an example of a minimal assembly source file, defining one HSA kernel:
6103
6104 .. code-block:: none
6105
6106    .hsa_code_object_version 1,0
6107    .hsa_code_object_isa
6108
6109    .hsatext
6110    .globl  hello_world
6111    .p2align 8
6112    .amdgpu_hsa_kernel hello_world
6113
6114    hello_world:
6115
6116       .amd_kernel_code_t
6117          enable_sgpr_kernarg_segment_ptr = 1
6118          is_ptr64 = 1
6119          compute_pgm_rsrc1_vgprs = 0
6120          compute_pgm_rsrc1_sgprs = 0
6121          compute_pgm_rsrc2_user_sgpr = 2
6122          compute_pgm_rsrc1_wgp_mode = 0
6123          compute_pgm_rsrc1_mem_ordered = 0
6124          compute_pgm_rsrc1_fwd_progress = 1
6125      .end_amd_kernel_code_t
6126
6127      s_load_dwordx2 s[0:1], s[0:1] 0x0
6128      v_mov_b32 v0, 3.14159
6129      s_waitcnt lgkmcnt(0)
6130      v_mov_b32 v1, s0
6131      v_mov_b32 v2, s1
6132      flat_store_dword v[1:2], v0
6133      s_endpgm
6134    .Lfunc_end0:
6135         .size   hello_world, .Lfunc_end0-hello_world
6136
6137 .. _amdgpu-amdhsa-assembler-predefined-symbols-v3:
6138
6139 Code Object V3 Predefined Symbols (-mattr=+code-object-v3)
6140 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
6141
6142 The AMDGPU assembler defines and updates some symbols automatically. These
6143 symbols do not affect code generation.
6144
6145 .amdgcn.gfx_generation_number
6146 +++++++++++++++++++++++++++++
6147
6148 Set to the GFX major generation number of the target being assembled for. For
6149 example, when assembling for a "GFX9" target this will be set to the integer
6150 value "9". The possible GFX major generation numbers are presented in
6151 :ref:`amdgpu-processors`.
6152
6153 .amdgcn.gfx_generation_minor
6154 ++++++++++++++++++++++++++++
6155
6156 Set to the GFX minor generation number of the target being assembled for. For
6157 example, when assembling for a "GFX810" target this will be set to the integer
6158 value "1". The possible GFX minor generation numbers are presented in
6159 :ref:`amdgpu-processors`.
6160
6161 .amdgcn.gfx_generation_stepping
6162 +++++++++++++++++++++++++++++++
6163
6164 Set to the GFX stepping generation number of the target being assembled for.
6165 For example, when assembling for a "GFX704" target this will be set to the
6166 integer value "4". The possible GFX stepping generation numbers are presented
6167 in :ref:`amdgpu-processors`.
6168
6169 .. _amdgpu-amdhsa-assembler-symbol-next_free_vgpr:
6170
6171 .amdgcn.next_free_vgpr
6172 ++++++++++++++++++++++
6173
6174 Set to zero before assembly begins. At each instruction, if the current value
6175 of this symbol is less than or equal to the maximum VGPR number explicitly
6176 referenced within that instruction then the symbol value is updated to equal
6177 that VGPR number plus one.
6178
6179 May be used to set the `.amdhsa_next_free_vpgr` directive in
6180 :ref:`amdhsa-kernel-directives-table`.
6181
6182 May be set at any time, e.g. manually set to zero at the start of each kernel.
6183
6184 .. _amdgpu-amdhsa-assembler-symbol-next_free_sgpr:
6185
6186 .amdgcn.next_free_sgpr
6187 ++++++++++++++++++++++
6188
6189 Set to zero before assembly begins. At each instruction, if the current value
6190 of this symbol is less than or equal the maximum SGPR number explicitly
6191 referenced within that instruction then the symbol value is updated to equal
6192 that SGPR number plus one.
6193
6194 May be used to set the `.amdhsa_next_free_spgr` directive in
6195 :ref:`amdhsa-kernel-directives-table`.
6196
6197 May be set at any time, e.g. manually set to zero at the start of each kernel.
6198
6199 .. _amdgpu-amdhsa-assembler-directives-v3:
6200
6201 Code Object V3 Directives (-mattr=+code-object-v3)
6202 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
6203
6204 Directives which begin with ``.amdgcn`` are valid for all ``amdgcn``
6205 architecture processors, and are not OS-specific. Directives which begin with
6206 ``.amdhsa`` are specific to ``amdgcn`` architecture processors when the
6207 ``amdhsa`` OS is specified. See :ref:`amdgpu-target-triples` and
6208 :ref:`amdgpu-processors`.
6209
6210 .amdgcn_target <target>
6211 +++++++++++++++++++++++
6212
6213 Optional directive which declares the target supported by the containing
6214 assembler source file. Valid values are described in
6215 :ref:`amdgpu-amdhsa-code-object-target-identification`. Used by the assembler
6216 to validate command-line options such as ``-triple``, ``-mcpu``, and those
6217 which specify target features.
6218
6219 .amdhsa_kernel <name>
6220 +++++++++++++++++++++
6221
6222 Creates a correctly aligned AMDHSA kernel descriptor and a symbol,
6223 ``<name>.kd``, in the current location of the current section. Only valid when
6224 the OS is ``amdhsa``. ``<name>`` must be a symbol that labels the first
6225 instruction to execute, and does not need to be previously defined.
6226
6227 Marks the beginning of a list of directives used to generate the bytes of a
6228 kernel descriptor, as described in :ref:`amdgpu-amdhsa-kernel-descriptor`.
6229 Directives which may appear in this list are described in
6230 :ref:`amdhsa-kernel-directives-table`. Directives may appear in any order, must
6231 be valid for the target being assembled for, and cannot be repeated. Directives
6232 support the range of values specified by the field they reference in
6233 :ref:`amdgpu-amdhsa-kernel-descriptor`. If a directive is not specified, it is
6234 assumed to have its default value, unless it is marked as "Required", in which
6235 case it is an error to omit the directive. This list of directives is
6236 terminated by an ``.end_amdhsa_kernel`` directive.
6237
6238   .. table:: AMDHSA Kernel Assembler Directives
6239      :name: amdhsa-kernel-directives-table
6240
6241      ======================================================== =================== ============ ===================
6242      Directive                                                Default             Supported On Description
6243      ======================================================== =================== ============ ===================
6244      ``.amdhsa_group_segment_fixed_size``                     0                   GFX6-GFX10   Controls GROUP_SEGMENT_FIXED_SIZE in
6245                                                                                                :ref:`amdgpu-amdhsa-kernel-descriptor-gfx6-gfx10-table`.
6246      ``.amdhsa_private_segment_fixed_size``                   0                   GFX6-GFX10   Controls PRIVATE_SEGMENT_FIXED_SIZE in
6247                                                                                                :ref:`amdgpu-amdhsa-kernel-descriptor-gfx6-gfx10-table`.
6248      ``.amdhsa_user_sgpr_private_segment_buffer``             0                   GFX6-GFX10   Controls ENABLE_SGPR_PRIVATE_SEGMENT_BUFFER in
6249                                                                                                :ref:`amdgpu-amdhsa-kernel-descriptor-gfx6-gfx10-table`.
6250      ``.amdhsa_user_sgpr_dispatch_ptr``                       0                   GFX6-GFX10   Controls ENABLE_SGPR_DISPATCH_PTR in
6251                                                                                                :ref:`amdgpu-amdhsa-kernel-descriptor-gfx6-gfx10-table`.
6252      ``.amdhsa_user_sgpr_queue_ptr``                          0                   GFX6-GFX10   Controls ENABLE_SGPR_QUEUE_PTR in
6253                                                                                                :ref:`amdgpu-amdhsa-kernel-descriptor-gfx6-gfx10-table`.
6254      ``.amdhsa_user_sgpr_kernarg_segment_ptr``                0                   GFX6-GFX10   Controls ENABLE_SGPR_KERNARG_SEGMENT_PTR in
6255                                                                                                :ref:`amdgpu-amdhsa-kernel-descriptor-gfx6-gfx10-table`.
6256      ``.amdhsa_user_sgpr_dispatch_id``                        0                   GFX6-GFX10   Controls ENABLE_SGPR_DISPATCH_ID in
6257                                                                                                :ref:`amdgpu-amdhsa-kernel-descriptor-gfx6-gfx10-table`.
6258      ``.amdhsa_user_sgpr_flat_scratch_init``                  0                   GFX6-GFX10   Controls ENABLE_SGPR_FLAT_SCRATCH_INIT in
6259                                                                                                :ref:`amdgpu-amdhsa-kernel-descriptor-gfx6-gfx10-table`.
6260      ``.amdhsa_user_sgpr_private_segment_size``               0                   GFX6-GFX10   Controls ENABLE_SGPR_PRIVATE_SEGMENT_SIZE in
6261                                                                                                :ref:`amdgpu-amdhsa-kernel-descriptor-gfx6-gfx10-table`.
6262      ``.amdhsa_wavefront_size32``                             Target              GFX10        Controls ENABLE_WAVEFRONT_SIZE32 in
6263                                                               Feature                          :ref:`amdgpu-amdhsa-kernel-descriptor-gfx6-gfx10-table`.
6264                                                               Specific
6265                                                               (-wavefrontsize64)
6266      ``.amdhsa_system_sgpr_private_segment_wavefront_offset`` 0                   GFX6-GFX10   Controls ENABLE_SGPR_PRIVATE_SEGMENT_WAVEFRONT_OFFSET in
6267                                                                                                :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`.
6268      ``.amdhsa_system_sgpr_workgroup_id_x``                   1                   GFX6-GFX10   Controls ENABLE_SGPR_WORKGROUP_ID_X in
6269                                                                                                :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`.
6270      ``.amdhsa_system_sgpr_workgroup_id_y``                   0                   GFX6-GFX10   Controls ENABLE_SGPR_WORKGROUP_ID_Y in
6271                                                                                                :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`.
6272      ``.amdhsa_system_sgpr_workgroup_id_z``                   0                   GFX6-GFX10   Controls ENABLE_SGPR_WORKGROUP_ID_Z in
6273                                                                                                :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`.
6274      ``.amdhsa_system_sgpr_workgroup_info``                   0                   GFX6-GFX10   Controls ENABLE_SGPR_WORKGROUP_INFO in
6275                                                                                                :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`.
6276      ``.amdhsa_system_vgpr_workitem_id``                      0                   GFX6-GFX10   Controls ENABLE_VGPR_WORKITEM_ID in
6277                                                                                                :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`.
6278                                                                                                Possible values are defined in
6279                                                                                                :ref:`amdgpu-amdhsa-system-vgpr-work-item-id-enumeration-values-table`.
6280      ``.amdhsa_next_free_vgpr``                               Required            GFX6-GFX10   Maximum VGPR number explicitly referenced, plus one.
6281                                                                                                Used to calculate GRANULATED_WORKITEM_VGPR_COUNT in
6282                                                                                                :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`.
6283      ``.amdhsa_next_free_sgpr``                               Required            GFX6-GFX10   Maximum SGPR number explicitly referenced, plus one.
6284                                                                                                Used to calculate GRANULATED_WAVEFRONT_SGPR_COUNT in
6285                                                                                                :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`.
6286      ``.amdhsa_reserve_vcc``                                  1                   GFX6-GFX10   Whether the kernel may use the special VCC SGPR.
6287                                                                                                Used to calculate GRANULATED_WAVEFRONT_SGPR_COUNT in
6288                                                                                                :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`.
6289      ``.amdhsa_reserve_flat_scratch``                         1                   GFX7-GFX10   Whether the kernel may use flat instructions to access
6290                                                                                                scratch memory. Used to calculate
6291                                                                                                GRANULATED_WAVEFRONT_SGPR_COUNT in
6292                                                                                                :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`.
6293      ``.amdhsa_reserve_xnack_mask``                           Target              GFX8-GFX10   Whether the kernel may trigger XNACK replay.
6294                                                               Feature                          Used to calculate GRANULATED_WAVEFRONT_SGPR_COUNT in
6295                                                               Specific                         :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`.
6296                                                               (+xnack)
6297      ``.amdhsa_float_round_mode_32``                          0                   GFX6-GFX10   Controls FLOAT_ROUND_MODE_32 in
6298                                                                                                :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`.
6299                                                                                                Possible values are defined in
6300                                                                                                :ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`.
6301      ``.amdhsa_float_round_mode_16_64``                       0                   GFX6-GFX10   Controls FLOAT_ROUND_MODE_16_64 in
6302                                                                                                :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`.
6303                                                                                                Possible values are defined in
6304                                                                                                :ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`.
6305      ``.amdhsa_float_denorm_mode_32``                         0                   GFX6-GFX10   Controls FLOAT_DENORM_MODE_32 in
6306                                                                                                :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`.
6307                                                                                                Possible values are defined in
6308                                                                                                :ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`.
6309      ``.amdhsa_float_denorm_mode_16_64``                      3                   GFX6-GFX10   Controls FLOAT_DENORM_MODE_16_64 in
6310                                                                                                :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`.
6311                                                                                                Possible values are defined in
6312                                                                                                :ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`.
6313      ``.amdhsa_dx10_clamp``                                   1                   GFX6-GFX10   Controls ENABLE_DX10_CLAMP in
6314                                                                                                :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`.
6315      ``.amdhsa_ieee_mode``                                    1                   GFX6-GFX10   Controls ENABLE_IEEE_MODE in
6316                                                                                                :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`.
6317      ``.amdhsa_fp16_overflow``                                0                   GFX9-GFX10   Controls FP16_OVFL in
6318                                                                                                :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`.
6319      ``.amdhsa_workgroup_processor_mode``                     Target              GFX10        Controls ENABLE_WGP_MODE in
6320                                                               Feature                          :ref:`amdgpu-amdhsa-kernel-descriptor-gfx6-gfx10-table`.
6321                                                               Specific
6322                                                               (-cumode)
6323      ``.amdhsa_memory_ordered``                               1                   GFX10        Controls MEM_ORDERED in
6324                                                                                                :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`.
6325      ``.amdhsa_forward_progress``                             0                   GFX10        Controls FWD_PROGRESS in
6326                                                                                                :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`.
6327      ``.amdhsa_exception_fp_ieee_invalid_op``                 0                   GFX6-GFX10   Controls ENABLE_EXCEPTION_IEEE_754_FP_INVALID_OPERATION in
6328                                                                                                :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`.
6329      ``.amdhsa_exception_fp_denorm_src``                      0                   GFX6-GFX10   Controls ENABLE_EXCEPTION_FP_DENORMAL_SOURCE in
6330                                                                                                :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`.
6331      ``.amdhsa_exception_fp_ieee_div_zero``                   0                   GFX6-GFX10   Controls ENABLE_EXCEPTION_IEEE_754_FP_DIVISION_BY_ZERO in
6332                                                                                                :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`.
6333      ``.amdhsa_exception_fp_ieee_overflow``                   0                   GFX6-GFX10   Controls ENABLE_EXCEPTION_IEEE_754_FP_OVERFLOW in
6334                                                                                                :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`.
6335      ``.amdhsa_exception_fp_ieee_underflow``                  0                   GFX6-GFX10   Controls ENABLE_EXCEPTION_IEEE_754_FP_UNDERFLOW in
6336                                                                                                :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`.
6337      ``.amdhsa_exception_fp_ieee_inexact``                    0                   GFX6-GFX10   Controls ENABLE_EXCEPTION_IEEE_754_FP_INEXACT in
6338                                                                                                :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`.
6339      ``.amdhsa_exception_int_div_zero``                       0                   GFX6-GFX10   Controls ENABLE_EXCEPTION_INT_DIVIDE_BY_ZERO in
6340                                                                                                :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`.
6341      ======================================================== =================== ============ ===================
6342
6343 .amdgpu_metadata
6344 ++++++++++++++++
6345
6346 Optional directive which declares the contents of the ``NT_AMDGPU_METADATA``
6347 note record (see :ref:`amdgpu-elf-note-records-table-v3`).
6348
6349 The contents must be in the [YAML]_ markup format, with the same structure and
6350 semantics described in :ref:`amdgpu-amdhsa-code-object-metadata-v3`.
6351
6352 This directive is terminated by an ``.end_amdgpu_metadata`` directive.
6353
6354 .. _amdgpu-amdhsa-assembler-example-v3:
6355
6356 Code Object V3 Example Source Code (-mattr=+code-object-v3)
6357 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
6358
6359 Here is an example of a minimal assembly source file, defining one HSA kernel:
6360
6361 .. code-block:: none
6362
6363   .amdgcn_target "amdgcn-amd-amdhsa--gfx900+xnack" // optional
6364
6365   .text
6366   .globl hello_world
6367   .p2align 8
6368   .type hello_world,@function
6369   hello_world:
6370     s_load_dwordx2 s[0:1], s[0:1] 0x0
6371     v_mov_b32 v0, 3.14159
6372     s_waitcnt lgkmcnt(0)
6373     v_mov_b32 v1, s0
6374     v_mov_b32 v2, s1
6375     flat_store_dword v[1:2], v0
6376     s_endpgm
6377   .Lfunc_end0:
6378     .size   hello_world, .Lfunc_end0-hello_world
6379
6380   .rodata
6381   .p2align 6
6382   .amdhsa_kernel hello_world
6383     .amdhsa_user_sgpr_kernarg_segment_ptr 1
6384     .amdhsa_next_free_vgpr .amdgcn.next_free_vgpr
6385     .amdhsa_next_free_sgpr .amdgcn.next_free_sgpr
6386   .end_amdhsa_kernel
6387
6388   .amdgpu_metadata
6389   ---
6390   amdhsa.version:
6391     - 1
6392     - 0
6393   amdhsa.kernels:
6394     - .name: hello_world
6395       .symbol: hello_world.kd
6396       .kernarg_segment_size: 48
6397       .group_segment_fixed_size: 0
6398       .private_segment_fixed_size: 0
6399       .kernarg_segment_align: 4
6400       .wavefront_size: 64
6401       .sgpr_count: 2
6402       .vgpr_count: 3
6403       .max_flat_workgroup_size: 256
6404   ...
6405   .end_amdgpu_metadata
6406
6407 If an assembly source file contains multiple kernels and/or functions, the
6408 :ref:`amdgpu-amdhsa-assembler-symbol-next_free_vgpr` and
6409 :ref:`amdgpu-amdhsa-assembler-symbol-next_free_sgpr` symbols may be reset using
6410 the ``.set <symbol>, <expression>`` directive. For example, in the case of two
6411 kernels, where ``function1`` is only called from ``kernel1`` it is sufficient
6412 to group the function with the kernel that calls it and reset the symbols
6413 between the two connected components:
6414
6415 .. code-block:: none
6416
6417   .amdgcn_target "amdgcn-amd-amdhsa--gfx900+xnack" // optional
6418
6419   // gpr tracking symbols are implicitly set to zero
6420
6421   .text
6422   .globl kern0
6423   .p2align 8
6424   .type kern0,@function
6425   kern0:
6426     // ...
6427     s_endpgm
6428   .Lkern0_end:
6429     .size   kern0, .Lkern0_end-kern0
6430
6431   .rodata
6432   .p2align 6
6433   .amdhsa_kernel kern0
6434     // ...
6435     .amdhsa_next_free_vgpr .amdgcn.next_free_vgpr
6436     .amdhsa_next_free_sgpr .amdgcn.next_free_sgpr
6437   .end_amdhsa_kernel
6438
6439   // reset symbols to begin tracking usage in func1 and kern1
6440   .set .amdgcn.next_free_vgpr, 0
6441   .set .amdgcn.next_free_sgpr, 0
6442
6443   .text
6444   .hidden func1
6445   .global func1
6446   .p2align 2
6447   .type func1,@function
6448   func1:
6449     // ...
6450     s_setpc_b64 s[30:31]
6451   .Lfunc1_end:
6452   .size func1, .Lfunc1_end-func1
6453
6454   .globl kern1
6455   .p2align 8
6456   .type kern1,@function
6457   kern1:
6458     // ...
6459     s_getpc_b64 s[4:5]
6460     s_add_u32 s4, s4, func1@rel32@lo+4
6461     s_addc_u32 s5, s5, func1@rel32@lo+4
6462     s_swappc_b64 s[30:31], s[4:5]
6463     // ...
6464     s_endpgm
6465   .Lkern1_end:
6466     .size   kern1, .Lkern1_end-kern1
6467
6468   .rodata
6469   .p2align 6
6470   .amdhsa_kernel kern1
6471     // ...
6472     .amdhsa_next_free_vgpr .amdgcn.next_free_vgpr
6473     .amdhsa_next_free_sgpr .amdgcn.next_free_sgpr
6474   .end_amdhsa_kernel
6475
6476 These symbols cannot identify connected components in order to automatically
6477 track the usage for each kernel. However, in some cases careful organization of
6478 the kernels and functions in the source file means there is minimal additional
6479 effort required to accurately calculate GPR usage.
6480
6481 Additional Documentation
6482 ========================
6483
6484 .. [AMD-RADEON-HD-2000-3000] `AMD R6xx shader ISA <http://developer.amd.com/wordpress/media/2012/10/R600_Instruction_Set_Architecture.pdf>`__
6485 .. [AMD-RADEON-HD-4000] `AMD R7xx shader ISA <http://developer.amd.com/wordpress/media/2012/10/R700-Family_Instruction_Set_Architecture.pdf>`__
6486 .. [AMD-RADEON-HD-5000] `AMD Evergreen shader ISA <http://developer.amd.com/wordpress/media/2012/10/AMD_Evergreen-Family_Instruction_Set_Architecture.pdf>`__
6487 .. [AMD-RADEON-HD-6000] `AMD Cayman/Trinity shader ISA <http://developer.amd.com/wordpress/media/2012/10/AMD_HD_6900_Series_Instruction_Set_Architecture.pdf>`__
6488 .. [AMD-GCN-GFX6] `AMD Southern Islands Series ISA <http://developer.amd.com/wordpress/media/2012/12/AMD_Southern_Islands_Instruction_Set_Architecture.pdf>`__
6489 .. [AMD-GCN-GFX7] `AMD Sea Islands Series ISA <http://developer.amd.com/wordpress/media/2013/07/AMD_Sea_Islands_Instruction_Set_Architecture.pdf>`_
6490 .. [AMD-GCN-GFX8] `AMD GCN3 Instruction Set Architecture <http://amd-dev.wpengine.netdna-cdn.com/wordpress/media/2013/12/AMD_GCN3_Instruction_Set_Architecture_rev1.1.pdf>`__
6491 .. [AMD-GCN-GFX9] `AMD "Vega" Instruction Set Architecture <http://developer.amd.com/wordpress/media/2013/12/Vega_Shader_ISA_28July2017.pdf>`__
6492 .. [AMD-GCN-GFX10] AMD "Navi" Instruction Set Architecture *TBA*
6493 .. TODO
6494    ttye Add link when made public.
6495 .. [AMD-ROCm] `ROCm: Open Platform for Development, Discovery and Education Around GPU Computing <http://gpuopen.com/compute-product/rocm/>`__
6496 .. [AMD-ROCm-github] `ROCm github <http://github.com/RadeonOpenCompute>`__
6497 .. [HSA] `Heterogeneous System Architecture (HSA) Foundation <http://www.hsafoundation.com/>`__
6498 .. [ELF] `Executable and Linkable Format (ELF) <http://www.sco.com/developers/gabi/>`__
6499 .. [DWARF] `DWARF Debugging Information Format <http://dwarfstd.org/>`__
6500 .. [YAML] `YAML Ain't Markup Language (YAML™) Version 1.2 <http://www.yaml.org/spec/1.2/spec.html>`__
6501 .. [MsgPack] `Message Pack <http://www.msgpack.org/>`__
6502 .. [OpenCL] `The OpenCL Specification Version 2.0 <http://www.khronos.org/registry/cl/specs/opencl-2.0.pdf>`__
6503 .. [HRF] `Heterogeneous-race-free Memory Models <http://benedictgaster.org/wp-content/uploads/2014/01/asplos269-FINAL.pdf>`__
6504 .. [CLANG-ATTR] `Attributes in Clang <http://clang.llvm.org/docs/AttributeReference.html>`__